2009 02 04 ################################################################## SHORT TERM PROJECTS ################################################################## ______________________________________________________________ OPEN TICKETS 125925 aklog of kcron tickets under SLF 4.7 124576 Account and group alignment _____________________________________________________________ ACTIVE - _____________________________________________________________ CONDOR - 7.2.0 upgrade on cluster, to fix write to Iwd of .proxy FARM running - on some nodes, lacking libxml2-2.6.16-12.6 dlopen error: libxml2.so.2: cannot open shared object file: No such FARM duplicates in nearcat / cedar, from 1 Nov . AFS timeouts - pursue minos-mysql1 timeouts correlated to nwest ssh connx Check 'WasIScanned' at security web page Test DCache unsecured door capacity, OK to 4000 ? use root -b to hold files open, Rename volumes d141 - testups d199 - testminossoft and test mk/rm and rename impact on running processes Subject: HelpDesk ticket 115219 has additional info. Short Description: Cannot write via dcap -q x509 using Howard Rubin proxy CRL - reproduce and report multi-header and java crash issues dc2nfs -cedar far .bntp and sntp all months, to catch up. See notes 5/25 bluwatch add /grid/data , /minos/data2 /minos/scratch monitoring Add write-mode monitoring libssl.so.4 on flxb35 ( 64 bit ) report, send advice to skip this node ############################################################################# ############################################################################# W O R K L O G ############################################################################# ############################################################################# ============================================================================= 2009 02 06 ============================================================================= ######## # DATA # ######## Date: Thu, 05 Feb 2009 21:29:12 -0600 From: ssa-group@fnal.gov The libraries for stken: LTO3, LTO4F1 and LTO4G1 are currently down. The cause is not yet known, but somebody is working on investigating and we will post more info as we are aware. __________________________________________________________________________ Date: Fri, 06 Feb 2009 01:18:43 -0600 From: ssa-group@fnal.gov The disruption of service to stken libraries has been fixed and all libraries are back in production. 
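A quick way to survey volume state after an outage like this is to loop enstore info over the suspect volumes; a minimal sketch (volume names are examples, and the python-style dict output is only grepped, not parsed):

# Spot check of Minos volume state after the Enstore outage.
# Volume names are illustrative.
VOLS='VO2307 VO2432 VO4335 VO8536 VOK237'
for VOL in ${VOLS} ; do
  echo ${VOL}
  enstore info --vol=${VOL} | egrep "'library'|'sum_mounts'|'sum_rd_err'|'sum_wr_err'"
done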
__________________________________________________________________________ I see many Minos data tapes in no-access : VO2307 0.04GB (NOACCESS 0205-2223 full 0209-0621) CD-9940B minos.neardet_data.cpio_odc VO2432 45.58GB (NOACCESS 0205-2224 readonly 0909-1538) CD-9940B minos.fardet_data.cpio_odc write-protected VO4335 49.10GB (NOACCESS 0205-2225 readonly 0511-1029) CD-9940B minos.fardet_data.cpio_odc VO8536 55.29GB (NOACCESS 0205-2225 readonly 0311-1010) CD-9940B minos.fardet_data.cpio_odc Copied to new media 03/14/06 VOK237 361.69GB (NOACCESS 0205-2202 none 0526-0123) CD-LTO3 minos.reco_far_cedar_cand.cpio_odc VO4209 177.36GB (NOTALLOWED 0511-1347 none 0831-0853) CD-9940B minos.reco_near_cedar_phy_mrnt.cpio_odc BOT overwritten 05/11/07 VO4956 0.37GB (NOTALLOWED 0112-1046 migrating 0106-0720) 9940 minos.caldet_data.cpio_odc Volume is missing 01/06/2009, not in drop slot VOB445 200.00GB (NOTALLOWED 0908-1806 none ) CD-9940B minos.beam_data.cern Cannot seem to write a label; needs investigation VOK330 331.09GB (NOTALLOWED 0731-1124 readonly 0716-1115) CD-LTO3 minos.reco_far_cedar_bcnd.cpio_odc Volume needs to be cloned due to repeated errors Tapes with raw data : V02307, staging in ND raw data. caught up in Enstore restart ? 'sum_mounts': 312, 'sum_rd_err': 2, Thu Feb 5 22:23:57 CST 2009 VO2432 minos.fardet_data.cpio_odc 'last_access': 1233894269.0, 'library': 'CD-9940B', 'sum_mounts': 3, 'sum_rd_err': 2, Thu Feb 5 22:24:29 CST 2009 VO4335 minos.fardet_data.cpio_odc 'sum_mounts': 996, 'sum_rd_err': 2, 'sum_wr_err': 1, Thu Feb 5 22:25:36 CST 2009 VO8536 minos.fardet_data.cpio_odc 'sum_mounts': 133, 'sum_rd_err': 2, 'sum_wr_err': 3, Thu Feb 5 22:25:00 CST 2009 __________________________________________________________________________ Date: Fri, 06 Feb 2009 17:28:41 -0600 (CST) Subject: HelpDesk ticket 128947 Short Description: STKEN Minos tapes in NOACCESS Problem Description: enstore-admin : There are four Minos tapes that have gone NOACCESS recently at http://www-stken.fnal.gov/enstore/tape_inventory/NOACCESS The tapes all show last access times around Thu Feb 5 22:23:57 CST 2009 . The time was during last night's unscheduled Enstore outage. Please restore access to these tapes. ___________________________________________ This ticket is assigned to HICKS, STAN of the CD-SF/DMS/DSC/SSA. ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ ######## # FARM # ######## Date: Thu, 05 Feb 2009 17:07:07 -0600 Subject: cleanup All but one cedar_phy_bhcurv near MC (a helium job) have completed their reruns, so you could rerun the concatenation. I'm currently running the far failures (57 of them plus). In light of this I'm going to rerun all cedar and dogwoodtest4 data failures. ============================================================================= 2009 02 05 ============================================================================= ########## # DCACHE # ########## W A R N I N G Our raw data seems to be getting migrated to LTO-4 tape : Reading VOO107(neardet_data-MIGRATION) using LTO4_112.mover from stkendca24a by root This may really foul up our raw data restoration, as the files may no longer be in tape-order, if they are currently being moved. This may also destroy our policy of having primary and vault copies in different buildings. 
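To see whether the migration is actually scattering raw files across volumes, the tape label for each file can be read from PNFS; a sketch assuming the usual Enstore layer-4 layout, with the volume label on the first line (the directory is an example):

# List the tape label for each raw file in one month, sorted by volume,
# to check whether migration is breaking tape order.
DIR=/pnfs/minos/neardet_data/2009-01
cd ${DIR}
for FILE in N*.mdaq.root ; do
  VOL=`head -1 ".(use)(4)(${FILE})"`
  printf '%-10s %s\n' "${VOL}" "${FILE}"
done | sort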
######## # ENCP # ######## Make current the version of encp having enmv MINOS26 > date Thu Feb 5 15:09:39 CST 2009 MINOS26 > ups declare -c encp v3_7d WARNING: Unless you know what you are doing, use a qualifier in your ups declare command! ########### # ENSTORE # ########### Date: Thu, 05 Feb 2009 14:38:52 -0600 From: ssa-group@fnal.gov To: cdweb@fnal.gov, helpdesk@fnal.gov, oleynik@fnal.gov, stan@fnal.gov, wolbers@fnal.gov, crawdad@fnal.gov, white@fnal.gov, moibenko@fnal.gov, timur@fnal.gov, stk-users@fnal.gov, cms-t1@fnal.gov, fermigrid-announce@fnal.gov, enstore-admin@fnal.gov Subject: Announcement: Service disruption for enstore on stken for a duration of 30 min The media_changer needs some work and we are backing out of the previous upgrade. We should be down for about 30 minutes. Please stand-by... Stanley __________________________________________________________ Date: Thu, 05 Feb 2009 15:20:53 -0600 The libraries are back in production. Total outage was only 20 minutes. __________________________________________________________ __________________________________________________________ ############ # MCIMPORT # ############ Date: Thu, 05 Feb 2009 14:20:38 -0500 From: Daniel D. Cherdack The next decade (75*) of Daikon06 singles is ready for import. __________________________________________________________ touch STAGE/cherdack/MCIMPORT __________________________________________________________ ######### # STAGE # ######### Running this on minos27, for speed ./volumes vols NVOLS=`./volumes neardet_data` { for VOL in ${NVOLS} ; do ./stage -w -g q ${VOL} done ; } > /minos/scratch/kreymer/log/stage/ndstage.log 2>&1 & STARTING Thu Feb 5 13:01:00 CST 2009 ####### # EVO # ####### Send HELP error report regarding window scroll bar sliders needing decoration. ######## # FARM # ######## ./roundup -p -r cedar_phy_bhcurv mcnear SRV1> ./roundup -b 2000 -r cedar_phy_bhcurv mcnear Thu Feb 5 12:46:18 CST 2009 ######## # FARM # ######## Why are a large number of these files not declared to sam ? Like n13047122_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root ########## # DCACHE # ########## Date: Thu, 05 Feb 2009 08:56:46 -0600 (CST) Subject: HelpDesk ticket 128816 Short Description: FNDCA PNFS is down Problem Description: enstore-admin, dcache-admin : The FNDCA PNFS server is down. This takes down the public DCache system. There is scheduled Enstore maintenance today. But the announcement stated : The following services will be available. - Stken PNFS. - Public dCache. - The rest of Enstore. Therefore I assume that this is unrelated to the maintenance. ___________________________________________ This ticket is assigned to HICKS, STAN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Thu, 05 Feb 2009 09:07:04 -0600 (CST) Note To Requester: swhicks@fnal.gov sent this Notes To Requester: Art, We mistakenly took the pnfs manager down as part of our maintenance. It has since been brought back up and you should see dcache back to normal. Please let us know if this doesn't happen right away. Sorry for the inconvenience. ___________________________________________ ___________________________________________ 02/05 08:53:29 PNFS manager is back up Daq logging failed with messages in Recent FTP like 2009-02-5 08:49:10 451 Operation failed: PANIC : Unexpected message arrived class dmg.cells.nucleus.NoRouteToCellException ... 
2009-02-5 07:27:13 last success 2009-02-5 08:06:45 first failure ########## # DCACHE # ########## To : minos_software_discussion@fnal.gov Cc : minos-data@fnal.gov Attchmnt: Subject : Fermilab DCache unscheduled outage this morning ----- Message Text ----- It appears that between about 07:30 and 08:55 this morning, the DCache PNFS manager went offline, taking DCache down. I have submitted a helpdesk ticket. Here is the reply : We mistakenly took the pnfs manager down as part of our maintenance. It has since been brought back up and you should see dcache back to normal. Please let us know if this doesn't happen right away. Sorry for the inconvenience. ########### # ENSTORE # ########### Date: Wed, 04 Feb 2009 15:51:26 -0600 From: ssa-group@fnal.gov To: cdweb@fnal.gov, helpdesk@fnal.gov, oleynik@fnal.gov, stan@fnal.gov, wolbers@fnal.gov, crawdad@fnal.gov, white@fnal.gov, moibenko@fnal.gov, timur@fnal.gov, d0en-announce@fnal.gov, stk-users@fnal.gov, cms-t1@fnal.gov, fermigrid-announce@fnal.gov, cdf_dh_help@fnal.gov, enstore-admin@fnal.gov Subject: Announcement: Service scheduled outage for enstore on d0en, stken, cdfen for a duration of 4 Hours This is a reminder that Stken and parts of Cdfen and D0en will be down Feb 5, 2009 from 0730 - 1130. The work to be done is: - Replace/repair 2 Bots in SL8500#2. - Update Enstore database. - Enstore code update. The following services will be unavailable. - Stken Enstore. This includes all Stken Library Managers. - CDF-LTO4F1 and D0-LTO4F1 Library Managers during the Bot repair. The following services will be available. - Stken PNFS. - Public dCache. - The rest of Enstore. The Bots will be repaired first. This will allow the CDF and D0 Library Mangers to be available as soon as that work is done. ~2 hours. _____________________________________________________________________ Via CDF JIRA Date: Thu, 05 Feb 2009 10:12:08 -0600 (CST) The libraries at GCC for D0 and cdf have been un-paused and returned to service. You should see your submitted jobs begin to run. The affected libraries for this maintenance were: CDF-LTO3 CDF-LTO4G1 D0-LTO4G1 _____________________________________________________________________ Date: Thu, 05 Feb 2009 11:56:15 -0600 From: ssa-group@fnal.gov We are experiencing a bit of difficulty in getting back into production. We will be over by an additional 30 minutes. _____________________________________________________________________ Date: Thu, 05 Feb 2009 12:25:35 -0600 The maintenance is completed on stken and everything is being brought back into production. Please let us know if you experience any errors as they relate to the libraries and enstore being down. 
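A post-maintenance spot check is worth running at this point; a minimal sketch using the dccptest helper seen later in this log (file names are examples, and nonzero exit on failure is an assumption):

# Pull a couple of known raw files through DCache after the maintenance.
FILES='N00015460_0010.mdaq.root F00042228_0003.mdaq.root'
for FILE in ${FILES} ; do
  ./dccptest ${FILE} || echo "FAILED ${FILE}"
done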
_____________________________________________________________________ _____________________________________________________________________ ============================================================================= 2009 02 04 ============================================================================= ######## # FARM # ######## SRV1> ./roundup -b 6000 -D -r cedar_phy_bhcurv mcnear Wed Feb 4 07:18:52 CST 2009 Wed Feb 4 19:03:00 CST 2009 ########## # DCACHE # ########## Date: Wed, 04 Feb 2009 17:21:40 -0600 (CST) Subject: HelpDesk ticket 128800 ___________________________________________ Short Description: FNDCA Recent FTP Transfers is out of date again Problem Description: dcache-admin : Once again, the Recent FTP Transfer page is out of date, showing only transfers from user pagedcache http://fndca3a.fnal.gov/cgi-bin/dcache_files.py Please correct this, we use this to verify Minos raw data archiving. ___________________________________________ This ticket is assigned to HICKS, STAN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 04 Feb 2009 17:50:37 -0600 (CST) From: HelpDesk Note To Requester: swhicks@fnal.gov sent this Notes To Requester: Art, I have re-opened our bug #189 in which the developers said this was fixed. An email has gone out to them to correct this problem. ___________________________________________ ___________________________________________ ___________________________________________ ######### # STAGE # ######### Raw data far stage. ./volumes vols NVOLS=`./volumes neardet_data` enstore info --vol=VOO267 { for VOL in ${NVOLS} ; do ./stage -d -p 0 -g q ${VOL} done ; } Run the full stage { for VOL in ${NVOLS} ; do ./stage -w -g q ${VOL} done ; } > /minos/scratch/kreymer/log/stage/ndstage.log 2>&1 & ___________________________________________ Date: Wed, 04 Feb 2009 15:38:26 -0600 (CST) Subject: HelpDesk ticket 128790 Short Description: Permisson to reload files to FNDCA RawDataWritePools Problem Description: dcache-admin : The new RawDataWritePools pools are deployed in FNDCA. We have current Pool Directory Listings available I would like to reload all the Minos raw data to this pool group. My scripts for loading the files in optimal tape-order are ready to run. Please send permission from the admins, and I will start these scripts. Standing by ... ___________________________________________ This ticket is assigned to HICKS, STAN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 04 Feb 2009 17:23:46 -0600 (CST) From: HelpDesk Note To Requester: swhicks@fnal.gov sent this Notes To Requester: I thought this would be addressed to the dcache admins (which I don't believe is SSA) and forwarded it on to the Dcache Admin maillist we have; his reply was that he thought it was our call and that he had no problems with this. > > So, I guess I have no problems with your reloading them either. > ___________________________________________ Date: Wed, 04 Feb 2009 23:26:08 +0000 (GMT) From: Arthur Kreymer Thanks for the green light ! I will start the file restores after the Enstore maintenance tommorrow. This ticket can be closed. ___________________________________________ Date: Thu, 05 Feb 2009 15:26:48 -0600 (CST) This has been done. ___________________________________________ ######### # STAGE # ######### stage.20090204 Modify pool option to use suffix ( r w m q )and the files.* summary file, as we no longer have Layer 2 metadata. 
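A sketch of the intended lookup, assuming the daily list.<suffix> summary is consulted in place of Layer 2 (this shows the idea, not the actual stage code; FILE is an example):

# Check whether a file is already on a pool via the daily list.<suffix>
# summary, now that Layer 2 no longer carries pool data. SUF is one of r w m q.
SUF=q
FILE=N00015460_0010.mdaq.root
LISTS=/afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/`date +%Y/%m`
POOL=`grep "^${FILE} " ${LISTS}/list.${SUF} | awk '{print $2}'`
if [ -n "${POOL}" ] ; then
  echo "CACHED ${FILE} ${POOL}"
else
  echo "STAGE  ${FILE}"
fi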
./stage.20090204 -n -g q fardet_data/2009-02 MINOS26 > ln -sf stage.20090204 stage # was stage.20081203 ############ # DATASETS # ############ Added symlink CFL/list.* to listing file MIN > mv datasets.20090116 datasets.20090204 MIN > ln -sf datasets.20090204 datasets ########## # DCACHE # ########## Updating file lists MINOS26 > dcache/datasets q '' '' list Run Wed Feb 4 10:53:08 CST 2009 Data from 04-Feb-2009 06:15 Pool group RawDataWritePools Caches = 12 LIST w-raw-minos-stkendca21a-1.files LIST w-raw-minos-stkendca22a-1.files LIST w-raw-minos-stkendca24a-1.files LIST w-raw-minos-stkendca26a-1.files LIST w-stkendca7a-1.files LIST w-stkendca7a-2.files LIST w-stkendca8a-1.files LIST w-stkendca8a-2.files LIST w-stkendca9a-3.files LIST w-stkendca10a-3.files LIST w-stkendca11a-3.files LIST w-stkendca12a-3.files /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2009/02/list.q ############ # PREDATOR # ############ crc values seem to be 666 since /pnfs/minos/fardet_data/2008-12/F00042259_0018.mdaq.root ######## # DATA # ######## tagg reports corruption of files : Error in : error reading all requested bytes from file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015460_0010.mdaq.root got 16057 of 42564 Error in : error reading all requested bytes from file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2008-11/F00042228_0003.mdaq.root got 8878 of 45732 I can copy these to disk OK. ecrc values are OK. ./dccptest N00015460_0010.mdaq.root '' '' '' keep ./dccptest F00042228_0003.mdaq.root '' '' '' keep ecrc /local/scratch26/kreymer/DCCPTEST/N00015460_0010.mdaq.root ecrc /local/scratch26/kreymer/DCCPTEST/F00042228_0003.mdaq.root setup_minos -r R1.22 loon -bq firstlast.C /local/scratch26/kreymer/DCCPTEST/N00015460_0010.mdaq.root loon -bq firstlast.C dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015460_0010.mdaq.root loon -bq firstlast.C dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2008-11/F00042228_0003.mdaq.root All these look OK to me. ls /pnfs/minos/reco_near/cedar/cand_data/2009-01/N00015460_0010.* /pnfs/minos/reco_near/cedar/cand_data/2009-01/N00015460_0010.cosmic.cand.cedar.0.root /pnfs/minos/reco_near/cedar/cand_data/2009-01/N00015460_0010.spill.cand.cedar.0.root ls /pnfs/minos/reco_far/cedar/cand_data/2008-11/F00042228_0003* /pnfs/minos/reco_far/cedar/cand_data/2008-11/F00042228_0003.all.cand.cedar.0.root /pnfs/minos/reco_far/cedar/cand_data/2008-11/F00042228_0003.spill.cand.cedar.0.root MINOS26 > grep N00015460_0010.mdaq.root /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2009/02/list.q N00015460_0010.mdaq.root w-raw-minos-stkendca22a-1 MINOS26 > grep F00042228_0003.mdaq.root /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2009/02/list.q F00042228_0003.mdaq.root w-stkendca11a-3 _________________________________________________________________________ Date: Wed, 04 Feb 2009 16:45:21 +0000 (GMT) From: Arthur Kreymer To: Nathaniel Tagg Cc: minos_software_discussion@fnal.gov Subject: Re: More errors! On Wed, 4 Feb 2009, Nathaniel Tagg wrote: > I now see an identical error from a Far detector run as well as a Near > detector run. I can run loon R1.22 on both these files in Dcache, and from local copies, on minos01.fnal.gov. ( loon -bq ~kreymer/minos/scripts/firstlast.C ... ) Copies made from DCache seem to have the correct Enstore checksums. There are cedar keepup output files for both of these subruns. I suspect a problem with your loon, or your node . 
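For the record, the check above generalizes to any reported file; a compact sketch of the same steps, using only the commands and paths already shown:

# Reproduce a reported read error : dcap copy, checksum, and loon open.
FILES='N00015460_0010.mdaq.root F00042228_0003.mdaq.root'
setup_minos -r R1.22
for FILE in ${FILES} ; do
  ./dccptest ${FILE} '' '' '' keep
  ecrc /local/scratch26/kreymer/DCCPTEST/${FILE}
  loon -bq firstlast.C /local/scratch26/kreymer/DCCPTEST/${FILE}
done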
_________________________________________________________________________ MINOS26 > grep N00015460_0010.mdaq.root /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2009/02/list.q N00015460_0010.mdaq.root w-raw-minos-stkendca22a-1 MINOS26 > grep F00042228_0003.mdaq.root /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2009/02/list.q F00042228_0003.mdaq.root w-stkendca11a-3 ============================================================================= 2009 02 03 ============================================================================= Continued mysql testing, see LOG.mysql ============================================================================= 2009 02 02 ============================================================================= ####### # NET # ####### Non-disruptive network maintenance on r-s-edge-1 router. 9:00 - 9:30AM CST Tuesday February 3, 2009 ######## # FARM # ######## Digging into the Mother Lode of backlog : SRV1> ls /minos/data/minfarm/mcnearcat | grep cedar_phy_bhcurv | grep sntp | wc -l 2931 SRV1> ls /minos/data/minfarm/mcnearcat | grep cedar_phy_bhcurv | grep mrnt | wc -l 2930 SRV1> ./roundup -r cedar_phy_bhcurv mcnear Mon Feb 2 12:56:07 CST 2009 Mon Feb 2 13:08:33 CST 2009 Found 204 duplicate files Running a pass over all 6000 files, do clear DUP's and get master PEND list SRV1> ./roundup -b 6000 -D -r cedar_phy_bhcurv mcnear Mon Feb 2 14:46:04 CST 2009 __________________________________________________________________________ Date: Mon, 02 Feb 2009 21:38:31 +0000 (GMT) From: Arthur Kreymer I have run a full pass of ./roundup -b 6000 -D -r cedar_phy_bhcurv mcnear 204 duplicate files were removed. There is a fresh pend file, /home/minfarm/ROUNTMP/LOG/cedar_phy_bhcurvmcnear.pend There are also several files present which are flagged as bad : SRV1> grep ZAPPING LOG/2009-02/cedar_phy_bhcurvmcnear.log ZAPPING BAD n13037415_0009_L010185N_D04.cand.cedar_phy_bhcurv.1.root ZAPPING BAD n13037415_0009_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root ZAPPING BAD n13037415_0009_L010185N_D04.sntp.cedar_phy_bhcurv.1.root ZAPPING BAD n13037436_0005_L010185N_D04.cand.cedar_phy_bhcurv.1.root ZAPPING BAD n13037436_0005_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root ZAPPING BAD n13037436_0005_L010185N_D04.sntp.cedar_phy_bhcurv.1.root Should the badfiles list be edited, or should these files be discarded ? These files do seem to be a missing subruns that are needed. Note that some of these are candidate files. 
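To keep track of what these passes are zapping and what is still pending, the month's log can be summarized; a small sketch using the log and pend paths above:

# Summarize this month's roundup activity for cedar_phy_bhcurv mcnear :
# files zapped as bad, pending-run count, and size of the fresh pend list.
LOGDIR=/home/minfarm/ROUNTMP/LOG
grep ZAPPING ${LOGDIR}/2009-02/cedar_phy_bhcurvmcnear.log | awk '{print $3}' | sort -u
grep -c PEND ${LOGDIR}/2009-02/cedar_phy_bhcurvmcnear.log
wc -l ${LOGDIR}/cedar_phy_bhcurvmcnear.pend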
__________________________________________________________________________ Date: Mon, 02 Feb 2009 16:13:18 -0600 From: Howard Rubin To: Arthur Kreymer Subject: Re: cedar_phy_bhcurvmcnear.pend Here's the history on the 'bad' files: n13037415_0009_L010185N_D04.0 28665 analyze fnpc260 BEGIN 2008-03-26 08:30:39 n13037415_0009_L010185N_D04.0 28665 analyze fnpc260 ERROR 2008-03-26 09:12:22 136 n13037415_0009_L010185N_D04.0 18430 analyze caf1626 BEGIN 2008-04-29 14:34:31 n13037415_0009_L010185N_D04.0 18430 analyze caf1586 BEGIN 2008-04-29 14:34:53 n13037415_0009_L010185N_D04.0 18430 analyze caf1626 ERROR 2008-04-29 16:14:31 136 n13037415_0009_L010185N_D04.0 18430 analyze caf1586 END 2008-04-29 18:12:11 n13037436_0005_L010185N_D04.0 15737 analyze fnpc255 BEGIN 2008-04-05 00:33:55 n13037436_0005_L010185N_D04.0 15737 analyze fnpc255 END 2008-04-05 03:55:30 n13037436_0005_L010185N_D04.1 2159 analyze caf1606 BEGIN 2008-05-09 17:57:26 n13037436_0005_L010185N_D04.1 2159 analyze fnpc304 BEGIN 2008-05-09 18:20:28 n13037436_0005_L010185N_D04.1 2159 analyze fnpc304 ERROR 2008-05-09 18:53:51 136 n13037436_0005_L010185N_D04.1 2159 analyze caf1606 END 2008-05-09 21:10:19 I don't remember the reason for 'Pass 1' but it appears that, at least in that pass, the 2 jobs were multiply submitted. I've removed the lines from bad_runs_mc.cedar_phy_bhcurv. I'm beginning processing of the pend list except for the 'MRE' run whose history I don't remember at all, and whose naming convention doesn't match our recent running. (Were they perhaps run by Matt as a special pass?) I suggest we just forget about the one pair of files. __________________________________________________________________________ Date: Mon, 02 Feb 2009 17:05:59 -0600 From: Howard Rubin To: Art Kreymer Subject: cleanup Art, Not counting the MRE jobs, there are 714 files in the cedar_phy_bhcurv pend list. Of these, 652 have already run successfully, meaning that they produced *some* output which has made it to dcache or is already in mcnearcat. This means that this cleanup will produce *lots* of duplicates, lots of them probably being candidates, and some of which I will catch if the existing output is in mcnearcat (or candidates in dcache), and the rest of which you will find when you concatenate. I think that *all* of these duplicates should simply be deleted, not kept around in a 'duplicates' area. Presumably we understand why they've been produced, and there's no reason for anyone to ever look at them. Incidentally, all the files I produce in this cleanup will be pass 0. __________________________________________________________________________ Date: Tue, 03 Feb 2009 15:01:33 +0000 (GMT) From: Arthur Kreymer I ran roundup this morning. It picked up the subruns which had formerly been flagged as bad. But nothing more. The newest files in /minos/data/minfarm/mcnearcat are from Nov 26. __________________________________________________________________________ Date: Tue, 03 Feb 2009 10:23:10 -0600 From: Howard Rubin Jobs seem to be running now so I'm submitting more. __________________________________________________________________________ Date: Wed, 04 Feb 2009 09:38:06 -0600 From: Howard Rubin Art, The cedar_phy_bhcurv cleanup processing is complete. Again, I expect *many* duplicates, and I think these should be deleted, not moved. Everything has been forced to use pass 0. 
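Since the cleanup is forced to pass 0, duplicates will show up as the same subrun with more than one pass number; a quick survey sketch of mcnearcat for that pattern (roundup -D remains the tool that actually removes them):

# List subruns present with more than one pass number.
# Names look like n13037415_0009_L010185N_D04.sntp.cedar_phy_bhcurv.<pass>.root
ls /minos/data/minfarm/mcnearcat | grep cedar_phy_bhcurv \
  | sed 's/\.[0-9]*\.root$//' | sort | uniq -d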
__________________________________________________________________________ __________________________________________________________________________ ######## # FARM # ######## Sunday, purged D06, charm, helium files on farm Started cleanup of helium; SRV1> ./roundup -s helium -r cedar_phy_bhcurv mcnear & This finished, I have done the purge : SRV1> ./roundup -p -s helium -r cedar_phy_bhcurv mcnear ####### # NAS # ####### Reported previous delays to romero ######## # GRID # ######## Tracking down dejong problems access condor_submit but not condor_q ######### # ADMIN # ######### Planning for gfactory/gfrontend ID, versus Condor 7.2 ============================================================================= 2009 01 30 ============================================================================= ######## # FARM # ######## helium files are complete for M100200R Still missing subruns for M100200N SRV1> ./roundup -s helium -r cedar_phy_bhcurv mcnear Fri Jan 30 11:30:44 CST 2009 SRV1> ./roundup -s charm -r cedar_phy_bhcurv mcnear Fri Jan 30 13:55:56 CST 2009 ######## # FARM # ######## SRV1> ./roundup -s D06 -r cedar_phy_bhcurv mcnear Fri Jan 30 11:42:37 CST 2009 ./roundup -n -p -s D06 -r cedar_phy_bhcurv mcnear ============================================================================= 2009 01 29 ============================================================================= ######### # DOCDB # ######### The old ticket regarding DNS failover is closed. I think nothing was done. Noted. Date: Thu, 29 Jan 2009 12:16:54 -0600 (CST) Subject: Help Desk Ticket 121522 Has Been Resolved. Solution: related to DNS issue ___________________________________________________________________ This ticket was resolved by INKMANN, JOHN of the CD-LSCS/CSI/CS/EST group. ######### # AKLOG # ######### Surveyed FNALU for system running LSF 4.7, to test aklog/kcron combo There are none : MIN > for NODE in ${UNODES} ; do printf "${NODE} " ; ssh -ax ${NODE} 'cat /etc/redhat-release' ; done flxi02 Scientific Linux Fermi LTS release 4.4 (Wilson) x86_64 flxi03 Scientific Linux Fermi LTS release 4.4 (Wilson) x86_64 flxi04 Scientific Linux Fermi LTS release 4.5 (Wilson) i686 flxi05 Scientific Linux Fermi LTS release 4.5 (Wilson) i686 flxi06 Scientific Linux SLF release 5.1 (Lederman) i686 no kcron exists flxi07 Scientific Linux Fermi LTS release 4.4 (Wilson) x86_64 flxi09 Scientific Linux Fermi LTS release 4.5 (Wilson) i686 ####### # SAM # ####### minosora3 - Oracle patches and OS patches being performed, Seems to be complete as of 09:00 09:12 - ran HOWTO.sam tests, including station test AOK Updated the MINOS status page Date: Thu, 29 Jan 2009 15:27:40 +0000 (GMT) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Cc: minos_sam_admin@fnal.gov Subject: SAM maintenance completed this morning The Minos SAM Oracle database was upgraded from 08:00 to 09:12. The SAM station and dbservers have resumed normal operation . 
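The HOWTO.sam tests are not reproduced here; the cheapest spot check after a database maintenance is a timed metadata lookup, sketched with the sam locate command used elsewhere in this log (the file name is an example):

# Minimal post-maintenance check : time one metadata lookup through the dbserver.
time sam locate n15037001_0000_L010185N_D06_nccohbkg.reroot.root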
kreymer@minos26 : crontab crontab.dat mindata@minos26 : crontab crontab.dat minfarm@fnpcsrv1 : mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok ######## # FARM # ######## Concatenation of D00_medi.sntp.cedar_phy_linfix and D00_lowi.mrnt.cedar_phy_linfix completed Tue Jan 27 15:12:54 CST 2009 ######## # FARM # ######## The concatenation of daikon06 files is up to date, as of Wed Jan 28 14:37:26 CST 2009 There is one run pending, due to a missing subrun : PEND - have 28/29 subruns for n13037060_*_L010185N_D06_nccohbkg.mrnt.cedar_phy_bhcurv.0.root MISS 0002 PEND - have 28/29 subruns for n13037060_*_L010185N_D06_nccohbkg.sntp.cedar_phy_bhcurv.0.root MISS 0002 ============================================================================= 2009 01 28 ============================================================================= ######## # DATA # ######## Tue Jan 27 13:18:14 CST 2009 cherdack imports have completed, with one false DUP entry, n15037088_0000_L010185N_D06_nccohbkg.reroot.root mv DUP/n15037088_0000_L010185N_D06_nccohbkg.reroot.root dcache/ ./mcimport cherdack less /minos/data/mcimport/cherdack/log/mcimport.log Wed Jan 28 14:42:12 CST 2009 PURGED n15037088_0000_L010185N_D06_nccohbkg.reroot.root rm STAGE/cherdack/MCIMPORT ######### # ADMIN # ######### Date: Wed, 28 Jan 2009 11:59:20 -0600 (CST) Subject: HelpDesk ticket 128348 Short Description: Request UID assignment for gfactory account. Problem Description: Please assign an official UID for the gfactory account. The UID assignment should be similar to that of gfrontend, including the Minos 5111 group, and Ryan Patterson as the registered user : 43598 5111 PATTERSON RYAN GFRONTEND ___________________________________________ Date: Wed, 28 Jan 2009 12:07:40 -0600 (CST) This ticket has been reassigned to VALADEZ, YOLANDA of the CD-LSCS/CSI/HD ___________________________________________ Date: Wed, 28 Jan 2009 13:08:56 -0600 (CST) Solution: Refresh your uid/gid list. The following new uid/gid assignment has been entered: UID:GID:LASTNAME: FIRSTNAME:USERNAME 43680 5111 Patterson Ryan gfactory This ticket was resolved by VALADEZ, YOLANDA of the CD-LSCS/CSI/HD group. ___________________________________________ ########## # DCACHE # ########## The backlog has cleared, we are back in operation . See ticket 128156 below ######### # ADMIN # ######### Date: Wed, 28 Jan 2009 11:30:49 -0600 From: HelpDesk To: Arthur Kreymer Subject: uid and gid listings http://www-giduid.fnal.gov/cd/FUE/uidgid/uid.lis http://www-giduid.fnal.gov/cd/FUE/uidgid/gid_id.lis ============================================================================= 2009 01 27 ============================================================================= ########## # DCACHE # ########## Forwarded to M_B, M-D, CC: M_S_D Date: Tue, 27 Jan 2009 14:46:14 -0600 From: ssa-group@fnal.gov All STKEN dCache pools will be restarted in 10 minutes (at approximately 2:55pm) for an emergency configuration change. Another notification will be sent after the restart is complete. ___________________________________________________________________ Date: Tue, 27 Jan 2009 15:06:12 -0600 From: ssa-group@fnal.gov The restart of the public dCache pools is complete. ___________________________________________________________________ Stores Queued 15:08 - 29000 approx 15:25 - 27609 15:36 - 27412 DES writes to tape are continuing apace, about 5 sec/file. 
16:56 - 19763 16:57 - 18649 That's more like it, 1000/minute, ########### # BLUEARC # ########### To fermigrid-users, minos-data : I run a set of processes which check the readability and speed of files in the /minos/data2 piece of Bluearc. In response to questions during the Grid Users meeting yesterday, here is a summary of recent slow access, as seem from Minos nodes. There have been no recent outright failures . File access times over 10 seconds are logged. ( Normal access times are under a second. ) The full logs are at http://www-numi.fnal.gov/computing/dh/bluwatch/log/ I am listing here the large blocks of slow response. See the full logs for a few other short term slowdowns. fnpcsrv1 Jan 14 04:36 -> 06:26 to 50 sec Jan 24 18:17 -> 23:29 to 92 sec Jan 25 18:37 -> 21:07 to 41 sec minos-sam03 Jan 24 18:20 -> 23:21 to 43 sec Jan 25 18:37 -> 21:35 to 40 sec minos02 Jan 24 18:16 -> 23:18 to 36 sec Jan 25 18:40 -> 21:07 to 44 sec minos26 Jan 24 18:16 -> 23:21 to 69 sec Jan 25 18:37 -> 22:44 to 43 sec ######## # FARM # ######## Concatenating lowi and medi files These are the only linfix files pending ./roundup -n -r cedar_phy_linfix mcnear ./looper '-r cedar_phy_linfix mcnear' & Tue Jan 27 13:56:55 CST 2009 ============================================================================= 2009 01 26 ============================================================================= ######## # FARM # ######## ls /minos/data/minfarm/mcnearcat | grep D06 | wc -l 8970 ls /minos/data/minfarm/mcnearcat -ltr | grep D06 | tail -rw-rw-r-- 1 minospro e875 20802267 Jan 20 19:52 n13037137_0047_L010185N_D06_nccohbkg.mrnt.cedar_phy_bhcurv.0.root rm /minos/data/minfarm/roundup/STOP.LOOPER SRV1> ./roundup -n -s D06 -r cedar_phy_bhcurv mcnear OK - processing /minos/data/minfarm/mcnearcat version 20090121 SELECT files containing D06 Mon Jan 26 17:32:48 CST 2009 ... OK adding n13037015_0000_L010185N_D06_nccohbkg.mrnt.cedar_phy_bhcurv.0.root 31 ./roundup: line 764: ((: SSIF = : syntax error: operand expected (error token is " ") ./roundup: line 764: ((: SSIF = : syntax error: operand expected (error token is " ") OK adding n13037016_0000_L010185N_D06_nccohbkg.mrnt.cedar_phy_bhcurv.0.root 31 ... BIG - Splitting due to size 2168015605 OK adding n13037015_0000_L010185N_D06_nccohbkg.sntp.cedar_phy_bhcurv.0.root 30 OK adding n13037015_0030_L010185N_D06_nccohbkg.sntp.cedar_phy_bhcurv.0.root 1 ./roundup: line 764: ((: SSIF = : syntax error: operand expected (error token is " ") BIG - Splitting due to size 2201111372 OK adding n13037016_0000_L010185N_D06_nccohbkg.sntp.cedar_phy_bhcurv.0.root 30 OK adding n13037016_0030_L010185N_D06_nccohbkg.sntp.cedar_phy_bhcurv.0.root 1 The errors were because I tried to cancel the interactive roundup with ^C and ^Y. ./looper '-s D06 -r cedar_phy_bhcurv mcnear' & LOG/2009-01/cedar_phy_bhcurvmcnearD06.log Mon Jan 26 18:04:51 CST 2009 ######## # DATA # ######## Testing cherdack imports, 500 files, typically 600 MBytes, a few over 1 GByte. $ ./mcimport -b 2 cherdack OK, logging activity to /minos/data/mcimport/cherdack/log/mcimport.log ... 
Mon Jan 26 16:43:23 CST 2009 MCIN configuration n1503 _L010185N_D06_nccohbkg.reroot.root SRMCPed n15037001_0000_L010185N_D06_nccohbkg.reroot.root SRMCPed n15037002_0000_L010185N_D06_nccohbkg.reroot.root BAIL aftar 2 files ~/saddmc --declare daikon_06 near/daikon_06/L010185N_nccohbkg/700 324213 /minos/data/mcimport/cherdack/ du: cannot access `/minos/data/mcimport/cherdack/tar': No such file or directory 1 /minos/data/mcimport/cherdack/dcache 324212 /minos/data/mcimport/cherdack/mcin 1240 /minos/data/mcimport/cherdack/mcin/dcache Mon Jan 26 16:46:43 CST 2009 Sam data is OK MINOS26 > sam locate n15037001_0000_L010185N_D06_nccohbkg.reroot.root ['/pnfs/minos/mcin_data/near/daikon_06/L010185N_nccohbkg/700,26@dcache'] Allow the rest to run on the normal cycle MINOS26 > touch /minos/data/mcimport/cherdack/MCIMPORT And start an early cycle, as we just missed the 4 hour boat by 10 minutes. $ ./mcimport cherdack OK, logging activity to /minos/data/mcimport/cherdack/log/mcimport.log ######## # DATA # ######## One of the Minos caldet tapes has been found MIA during migration. http://www-stken.fnal.gov/enstore/tape_inventory/VO4956 'library': '9940' 'sum_mounts': 629 'sum_rd_access': 1169 'sum_rd_err': 5 'sum_wr_access': 581 'sum_wr_err': 0 'volume_family': 'minos.caldet_data.cpio_odc' ########## # CONDOR # ########## Jobs took a nosedive around 00:00 today. Confirmed by GPFarm Ganglia http://rexganglia2.fnal.gov/farms/?m=load_one&r=day&s=descending&c=GP+Farm&h=&sh=1&hc=4 Processes dropped from 800 at 00:00 sharply to 400, then to under 200 at 01:00 The latest running gfactory is 268378.16 gfactory 1/24 14:29 1+18:50:24 My glide test jobs stayed idle from 00:00 through 07:45. They are idle again, starting 08;20 Jobs started up again around 09:44 .. ########## # DCACHE # ########## Huge write backlog, over 16k files. Built up during the day Sunday 25 Jan 21K Stores queued in DCache, mostly w-pub-minos-stkendca21a-2 4122 w-pub-minos-stkendca22a-2 1143 w-pub-minos-stkendca23a-2 2493 w-sstkendca10a-4 1768 w-sstkendca10a-5 2140 w-sstkendca10a-6 811 w-stkendca11a-4 3007 w-stkendca11a-5 249 w-stkendca11a-6 2666 w-stkendca12a-4 550 w-stkendca12a-5 296 w-stkendca12a-6 536 w-stkendca9a-4 1222 Appears to be file family des. Recent files on tape No visible Minos backlog so far. My eyeball suggests the backlog will takeat least 3 days to clear. The latest 38 files on VOK823 average 8 MBytes in size. http://www-stken.fnal.gov/enstore/tape_inventory/VOK823 Many are under 10 KBytes. Date: Mon, 26 Jan 2009 09:11:33 -0600 (CST) Subject: HelpDesk ticket 128156 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> Short Description: writePools backlog Problem Description: dcache-admin : It appears that a write backlog of over 20,000 files has built up in the FNDCA writePools group. On the surface, these seem to be in file family des. The average file size is under 10 MBytes, with many files under 1 MByte, based on a glance at recent files on volume VOK823. At the present rate, this backlog will take at least 3 days to clear. I urge that the user be contacted, and that these writes be reorganized. ___________________________________________ Date: Mon, 26 Jan 2009 11:23:19 -0600 (CST) This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. We are looking at this and working to resolve it. I do see quite a large number of queued write requests. What monitoring page are you using to determine the user? _ Thank you for letting us know. 
__________________________________________ Date: Mon, 26 Jan 2009 18:05:12 +0000 (GMT) From: Arthur Kreymer I am deducing the user from the backlogs in Enstore : Servers page, see CD-LTO3.library_manager http://www-stken.fnal.gov/enstore/status_enstore_system.html CD-LT03 Full Queue Elements http://www-stken.fnal.gov/enstore/CD-LTO3.library_manager-full.html ___________________________________________ Date: Mon, 26 Jan 2009 20:56:34 +0000 (GMT) From: Arthur Kreymer We were making a little progress this morning, but now the Queued Stores are up by another 15K files, to 29133 as of about 14:55. The new files seem to have all been queued in a single sampling of the plot at http://fndca.fnal.gov/dcache/queue/allpools.jpg ___________________________________________ Date: Tue, 27 Jan 2009 19:50:03 +0000 (GMT) From: Arthur Kreymer Almost a full day ago, we were told at the Grid Users meeting that the errant des files would be removed by their creator at UIC. The write backlog remains at nearly 28,000, as seen at http://fndca.fnal.gov:2288/queueInfo Minos production reprocessing remains shut down, as it has been since last Sunday, due to our service level agreement. I request that either 1) Administrators remove these DES files. or 2) Minos be given permission to write to the DCache pools without any service limitations. If thing remain as they are, it will be several more days before we can resume processing. This is unacceptable. ___________________________________________ Date: Tue, 27 Jan 2009 14:52:48 -0600 (CST) The user ceased writing yesterday afternoon, but the jobs already in the system have continued to clog it up. We are taking action immediately to resolve this problem in conjunction with a pool restart. The backlog should clear out shortly after we restart pools and the files are removed from pnfs space. I will let you know when we are done. Thanks for your patience. -Tim Messer ___________________________________________ Date: Tue, 27 Jan 2009 18:00:15 -0600 (CST) The FNDCA write queue is now below 3,000 and continues to drop. ___________________________________________ Date: Wed, 28 Jan 2009 15:56:54 +0000 (GMT) From: Arthur Kreymer Thanks ! The DCache writePools backlog cleared yesterday, around 18:00 . Minos reconstruction processing has resumed. This ticket can be closed. ___________________________________________ Solution: DES user was writing >28,000 small files to dCache, slowing the system down and causing MINOS processing to stop. User was notified and files were removed from pnfs. ___________________________________________ 16:42 - 31177 ########## # CONDOR # ########## ============================================================================= 2009 01 23 ============================================================================= ######### # SHIFT # ######### Where do the minos-shiftscript item come from, sent to minos-shifters ? From minos23, probably habig account. ######### # ADMIN # ######### Date: Fri, 23 Jan 2009 14:20:03 -0600 (CST) Subject: HelpDesk ticket 128129 ___________________________________________ Short Description: Please create minsoft account on minos-sam04 Problem Description: run2-sys : Please create account minsoft on minos-sam04. This is for development testing of mysql. The account can be NIS served, as on minos-mysql2. Please copy the .k5login file from minos-mysql2. ___________________________________________ Date: Fri, 23 Jan 2009 15:49:05 -0600 (CST) This ticket has been reassigned to SHEPELAK, KAREN of the CD-SF/FEF Group. 
___________________________________________ Date: Fri, 23 Jan 2009 16:05:37 -0600 (CST) Solution: shepelak@fnal.gov sent this solution: Minsoft account has been created on minos-sam04 with .k5login as copied from minos-mysql2 as requested. -- karen This ticket was resolved by SHEPELAK, KAREN of the CD-SF/FEF group. ___________________________________________ Date: Fri, 23 Jan 2009 22:30:57 +0000 (GMT) From: Arthur Kreymer Thanks, I can log in ! ___________________________________________ ########## # CONDOR # ########## Increased analysis to 350, leaving headroom for Farm jobs on GPFARM. Somehow I typed the wrong file names last time. I had edited the old vofrontend.cfg.20081119, not vofrontend.cfg.20090122, and symlinked to it. I REALLY NEED A BETTER MONITOR ! cd /home/gfrontend/myvofrontend2/etc cp vofrontend.cfg.20081125 vofrontend.cfg.20090123 nedit - max_running_jobs=350 ln -sf vofrontend.cfg.20090123 vofrontend.cfg # was vofrontend.cfg.20081119 kill -9 22120 cd ./start_frontend.sh Fri Jan 23 15:59:46 CST 2009 [gfrontend@minos25 ~]$ tail /home/gfrontend/myvofrontend2/log/frontend_info.`date +%Y%m%d`.log [2009-01-23T15:59:45-05:00 28984] Iteration at Fri Jan 23 15:59:45 2009 [2009-01-23T15:59:48-05:00 28984] Match [2009-01-23T15:59:48-05:00 28984] Total running 205 limit 350 [2009-01-23T15:59:48-05:00 28984] For gpgeneral@t22_glexec@minos Idle 898 Running 200 [2009-01-23T15:59:48-05:00 28984] Advertize gpgeneral@t22_glexec@minos Request idle 40 max_run 1142 [2009-01-23T15:59:48-05:00 28984] For gpminos@t22_glexec@minos Idle 898 Running 205 [2009-01-23T15:59:48-05:00 28984] Advertize gpminos@t22_glexec@minos Request idle 40 max_run 1148 [2009-01-23T15:59:48-05:00 28984] For cdf@t22_glexec@minos Idle 898 Running 200 [2009-01-23T15:59:48-05:00 28984] Advertize cdf@t22_glexec@minos Request idle 40 max_run 1142 [2009-01-23T15:59:48-05:00 28984] Sleep ######## # DATA # ######## From: ssa-group@fnal.gov Subject: Announcement: Service scheduled outage for dCache on stken for a duration of 30 minutes The following STKEN services will be restarted at 3:30pm today to allow new gri$ fndca3a - gPlazma stkendca2a - dcap, gridftp and srm services stkendca13a-20a - dCache and gridftp doors A notice will be sent when the services have been restarted. _____________________________________________________________________ Date: Fri, 23 Jan 2009 15:45:04 -0600 From: ssa-group@fnal.gov Subject: Announcement: Service restoration for dCache on stken for a duration of 30 minutes The STKEN dCache and grid services on fndca3a and stkendca2a,13a-20a have been restarted. Thank you. ######### # MYSQL # ######### Date: Fri, 23 Jan 2009 11:47:09 -0600 (CST) Subject: HelpDesk ticket 128116 ___________________________________________ Short Description: Please change the system timezone for minos-mysql2 to UTC Problem Description: run2-sys : Because the new minos-mysql2 server connects with various grid hosts, we would like to have the system time zone set to UTC, at your next convenience. Please contact minos-admin and minosdb_support to schedule the change. I suggest waiting until next week, to avoid weekend surprises. ___________________________________________ Date: Fri, 23 Jan 2009 11:53:27 -0600 (CST) This ticket has been reassigned to SHEPELAK, KAREN of the CD-SF/FEF Group. ___________________________________________ Date: Fri, 23 Jan 2009 13:07:00 -0600 (CST) I reset the timezone to UTC on minos-mysql2. I'm still able to login as root after the change so this change looks to be working. 
Please confirm that you are able to login. Let me know that this change works for you, then I'll close the ticket. [root@minos-mysql2 ~]# date Fri Jan 23 19:02:06 UTC 2009 ___________________________________________ Date: Fri, 23 Jan 2009 13:17:20 -0600 (CST) This ticket has been reassigned to ALLEN, JASON of the CD-SF/FEF Group. ___________________________________________ Date: Fri, 23 Jan 2009 20:26:28 +0000 (GMT) From: Arthur Kreymer Gamglia monitoring of minos-mysql2 seems to have stopped, as of almost precisely 12:00 today. http://rexganglia2.fnal.gov/minos/?m=load_one&r=day&c=MINOS Server&h=minos-mysql2.fnal.gov Perhaps the Ganglia server is confused about the time shift. If this does not fix itself, please look into it Monday. ___________________________________________ Date: Fri, 23 Jan 2009 14:27:21 -0600 From: Jason Allen I don't want to make these machines unique. Please set the timezone back to CST. If using CST is really causing minos legitimate problems then I'll discuss the matter with Margaret V. ___________________________________________ Date: Fri, 23 Jan 2009 21:01:02 +0000 (GMT) From: Arthur Kreymer Jason has asked that the UTC test on mysql2 be suspended. So be it. I share the desire for uniform systems. I would like to have all Minos Servers running UTC. minos-sam01 minos-sam02 minos-sam03 minos-sam04 minos-mysql1 minos-mysql2 I'll be glad to discuss a deployment plan for this, when you have time. This is not urgent. I would also like to see the entire Minos Cluster at UTC, but this takes a lot more planning and testing. Users' crontab files are affected by such a change, We might set a local TZ in /etc/profile or /etc/csh.cshrc so that interactive users would get the old times reported. This would likely be done at a major system upgrade. ___________________________________________ Date: Fri, 23 Jan 2009 15:48:44 -0600 Minos-mysql2 is back on CS Time. ___________________________________________ Date: Mon, 26 Jan 2009 14:38:19 -0600 (CST) Solution: jallen@fnal.gov sent this solution: Converting just the Minos nodes to UTC time zone introduces unnecessary system management complexity to the FEF Dept. Minos has been operating for many years with the nodes set to CST time zone. The FEF Dept will evaluate the potential benefits of converting all nodes CDF/D0/Minos/MiniBoone/EAG/etc to UTC. This ticket was resolved by ALLEN, JASON of the CD-SF/FEF group. __________________________________________ Date: Fri, 23 Jan 2009 19:27:49 +0000 (GMT) Thanks, I can log in and access the database server normally. This ticket can be closed. __________________________________________ Mysql2 > dds /etc/sysconfig/clock -rw-r--r-- 1 root root 42 Jan 23 18:50 /etc/sysconfig/clock ########## # DCACHE # ########## Date: Fri, 23 Jan 2009 11:36:31 -0600 (CST) Subject: HelpDesk ticket 128115 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 128115 ___________________________________________ Short Description: FNDCA dcache monitor page is out of date Problem Description: There is a web page which monitors for internal file problems in FNDCA : http://www-stken.fnal.gov/enstore/dcache_monitor/ The content of this page seems not to have changed since 23-Jul-2008 ___________________________________________ This ticket is assigned to JONES, TERRY of the CD-SF/DMS/DSC/SSA. 
____________________________________________
Date: Fri, 23 Jan 2009 12:50:14 -0600 (CST)
Note To Requester: jonest@fnal.gov sent this Notes To Requester:
Have you looked at this URL yet?
> http://www-stken.fnal.gov/enstore/www-stken_pnfs_monitor
___________________________________________
Date: Fri, 23 Jan 2009 19:19:54 +0000 (GMT)
From: Arthur Kreymer
That web page is new to me.
The minos data files listed there are normal.
Problem files were once sorted into files like
http://www-stken.fnal.gov/enstore/dcache_monitor/p929_bad.txt
This seemed to be useful.
___________________________________________
Date: Wed, 28 Jan 2009 16:48:21 +0000 (GMT)
From: Arthur Kreymer
This ticket can be closed.
I have adjusted my web pages to point to the new url
http://www-stken.fnal.gov/enstore/www-stken_pnfs_monitor
Because this listing includes files from all experiments, both bad files
and files queued for write to tape, it is not trivial to scan for bad
Minos files. That is OK. I continue to rely on the SSA group to alert us
to problem files. There is no need for me to scan this list on a regular basis.
___________________________________________
Note To Requester: jhendry@fnal.gov sent this Notes To Requester:
Hi Art,
For your information, this web page can be accessed from the top level
dcache web page: http://www-stken.fnal.gov/enstore/enstore_saag.html
Click on "CD dCache", which goes to the FNAL General dCache System Status
url: http://fndca.fnal.gov/
Then click on "Meta-Data Checks", which takes you to the pnfs monitor
output page: http://www-stken.fnal.gov/enstore/www-stken_pnfs_monitor
John
___________________________________________
##########
# DCACHE #
##########
Date: Fri, 23 Jan 2009 10:11:04 -0600
From: jonest
To: kreymer@fnal.gov, bernstein@fnal.gov
Cc: Dcache Admin
Subject: minos files not in dcache or pnfs.
----------------------------------------
These files are not in dcache. They have layer 2 information but no pool data.
Do you still have the files? If so, please delete from pnfs and retransfer
these files, or regenerate the files and retransfer.
Recent PNFS Database minos files:  timestamp                  |                   pnfsid | layer1 | layer2 | layer4 | path  2009-01-21 15:12:57.930745 | 000F00000000000009073A88 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0000.mdaq.root  2009-01-21 16:13:55.177935 | 000F00000000000009073AC8 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0001.mdaq.root  2009-01-21 17:14:23.413759 | 000F00000000000009073B38 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0002.mdaq.root  2009-01-21 18:14:58.21879  | 000F00000000000009073C48 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0003.mdaq.root  2009-01-21 22:11:48.698747 | 000F00000000000009073F70 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0007.mdaq.root  2009-01-22 01:13:41.053332 | 000F00000000000009074C08 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0010.mdaq.root  2009-01-22 04:09:34.693716 | 000F00000000000009075B10 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0013.mdaq.root Terry Jones, jonest@fnal.gov, run2-sys@fnal.gov FCC/2/252/T x5200 __________________________________________________________________ FFILES=' F00042731_0000.mdaq.root F00042731_0001.mdaq.root F00042731_0002.mdaq.root F00042731_0003.mdaq.root F00042731_0007.mdaq.root F00042731_0010.mdaq.root F00042731_0013.mdaq.root ' for FILE in ${FFILES} ; do ./dc_stat ${FILE} ; done MINOS26 > for FILE in ${FFILES} ; do ./dccptest ${FILE} ; done Everthing is OK. __________________________________________________________________ Date: Fri, 23 Jan 2009 17:16:16 +0000 (GMT) From: Arthur Kreymer Layer 2 no longer has pool data, so this is normal. The files are all on tape, and I can read them all from DCache. It is a serious problem that we have no pool information. But this is a design issue, for which we have no solution planned. ######### # MYSQL # ######### Testing file copy speeds data to /var/tmp ( over 50 MBytes/second ) Note, we can set per-user limits. http://dev.mysql.com/doc/refman/5.1/en/user-resources.html The existing user table has field max_connection ########## # CONDOR # ########## The glideins adjusted to the new 200 target around midnight. Farm jobs are running, ============================================================================= 2009 01 22 ============================================================================= ########## # CONDOR # ########## 16:30 farm jobs are not getting started, due to existing glideins. 
Adjusted max_running_jobs=650 to max_running_jobs=200 in /home/gfrontend/myvofrontend2/etc/vofrontend.cfg.20081119 ln -sf vofrontend.cfg.20081119 vofrontend.cfg # was vofrontend.cfg.20081125 [2009-01-22T16:38:37-05:00 16173] Iteration at Thu Jan 22 16:38:37 2009 [2009-01-22T16:38:48-05:00 16173] Match [2009-01-22T16:38:48-05:00 16173] Total running 417 limit 2050 [2009-01-22T16:38:48-05:00 16173] For gpgeneral@t22_glexec@minos Idle 3666 Running 417 [2009-01-22T16:38:48-05:00 16173] Advertize gpgeneral@t22_glexec@minos Request idle 40 max_run 4247 [2009-01-22T16:38:48-05:00 16173] For gpminos@t22_glexec@minos Idle 3672 Running 417 [2009-01-22T16:38:48-05:00 16173] Advertize gpminos@t22_glexec@minos Request idle 40 max_run 4253 [2009-01-22T16:38:48-05:00 16173] For cdf@t22_glexec@minos Idle 3666 Running 417 [2009-01-22T16:38:48-05:00 16173] Advertize cdf@t22_glexec@minos Request idle 40 max_run 4247 [2009-01-22T16:38:48-05:00 16173] Sleep That does not match the config file. Tried to kill gfrontent without kill -9, noffect. Tried again, kill -9 [gfrontend@minos25 ~]$ cd [gfrontend@minos25 ~]$ ./start_frontend.sh [2009-01-22T16:41:58-05:00 22120] Iteration at Thu Jan 22 16:41:58 2009 [2009-01-22T16:42:08-05:00 22120] Match [2009-01-22T16:42:08-05:00 22120] Total running 417 limit 200 [2009-01-22T16:42:08-05:00 22120] For gpgeneral@t22_glexec@minos Idle 3664 Running 417 [2009-01-22T16:42:08-05:00 22120] Advertize gpgeneral@t22_glexec@minos Request idle 0 max_run 4245 [2009-01-22T16:42:08-05:00 22120] For gpminos@t22_glexec@minos Idle 3670 Running 417 [2009-01-22T16:42:08-05:00 22120] Advertize gpminos@t22_glexec@minos Request idle 0 max_run 4251 [2009-01-22T16:42:08-05:00 22120] For cdf@t22_glexec@minos Idle 3664 Running 417 [2009-01-22T16:42:08-05:00 22120] Advertize cdf@t22_glexec@minos Request idle 0 max_run 4245 [2009-01-22T16:42:08-05:00 22120] Sleep ########## # ORACLE # ########## Benchmarks comparing sparc vs x86 http://it.toolbox.com/blogs/david/ultrasparc-vs-x86-servers-which-one-runs-oracle-faster-22776 http://www.tpc.org/ spec.org int2006 rates Sun T5440 4x8core, x8 threads/core 1.4 GHz, 256 GB memory 255 copies, rate 270 peak 301 http://www.spec.org/cpu2006/results/res2008q4/cpu2006-20080929-05410.txt HP ProLiant DL580 G5 (2.66 GHz, Intel Xeon X7460 ) 64 GB memory 24 copies rate 267 peak 291 http://www.spec.org/cpu2006/results/res2008q3/cpu2006-20080901-05166.txt Scanning our heavy hitting Oracle SAM servers via the browser, Oracle Database Administration Offline System Session and License Usage http://dbb.fnal.gov/d0/databases http://rexganglia1.fnal.gov/d0/?r=month&c=d0central&h=d0ora2.fnal.gov 36 DBS_* connections, 3 active, all DBS_USER_POOL http://cdfdbb.fnal.gov/cdfr2/databases/ http://rexganglia2.fnal.gov/cdf/?r=month&c=CDF+Central&h=fcdfora4.fnal.gov 46 DBS_* connections, 1 active, DBS_CDF_USER_PRD The D0 load is very non-uniform. Something is going on every 15 minutes which is most of the load. ####### # CVS # ####### Date: Thu, 22 Jan 2009 11:36:08 -0600 (CST) To: KREYMER@FNAL.GOV Subject: HelpDesk ticket 128037 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 128037 ___________________________________________ Short Description: Please remove minoscvs from NIS hosts file on Minos Cluster Problem Description: run2-sys : The Minos CVS production server minoscvs has moved to cdcvs. 
But the NIS hosts file on the Minos Cluster contains MINOS26 > ypcat hosts | grep minoscvs 131.225.193.33 minoscvs.fnal.gov minoscvs 131.225.193.33 minoscvs.fnal.gov minoscvs Please remove minoscvs from this file and push out the change, so that Minos Cluster users will connect to the correct server. We plan to run the old CVS server on minos01, in readonly mode, for another week or two, before shutting down that CVS server. P.S. All lines in the NIS hosts file seem to entered twice. You might want to clean that up. ___________________________________________ This ticket is assigned to HelpDesk of the Help Desk. ___________________________________________ Date: Thu, 22 Jan 2009 11:40:21 -0600 (CST) This ticket has been reassigned to SHEPELAK, KAREN of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 22 Jan 2009 13:28:41 -0600 (CST) Solution: Minoscvs has been removed from nis hosts file as requested. This ticket was resolved by SHEPELAK, KAREN of the CD-SF/FEF group. ___________________________________________ Thanks for updating the file. minoscvs is now being properly resolved to cdcvs. ############ # TERAGRID # ############ Date: Thu, 22 Jan 2009 09:25:08 -0600 From: help@teragrid.org To: rmehdi@fnal.gov Subject: Re: Parrot and Teragrid FROM: Eijkhout, Victor (Concerning ticket No. 166606) Rashid, I'm still not convinced that compiling libraries locally is going to work, but let's assume it does. Can your proposed arrangement with parrot handle the fact that jobs have to be run through a batch system? You are not allowed to run production codes on the login nodes; those are for compiling only. By the way, lonestar is on the teragrid, so you can have shared software areas there. I suspect that's what you mean by "web cache". Or are you really talking about something relating to http connections? Victor. ######## # MAIL # ######## Tracking down A-hat characters in Mac and PC mail Perhaps they use UTF-8 or Windows-1252 rather than ISO-8859-1 or USASCII : http://objectmix.com/pine/330696-pc-alpine-character-set.html ########## # DCACHE # ########## Scheduled maintenance seems to have begun around 08:30 Finished around ######## # DATA # ######## Date: Thu, 22 Jan 2009 14:08:15 +0000 (GMT) From: Arthur Kreymer To: Mayly Sanchez Cc: Minos_Batch , minos-data@fnal.gov Subject: Re: low/medium intensity MC On Wed, 21 Jan 2009, Mayly Sanchez wrote: > We need processed with very high priority the following set using > cedar_phy_linfix: > /pnfs/minos/mcin_data/near/daikon_00/L010185N_medi/ > /pnfs/minos/mcin_data/near/daikon_00/L010185N_lowi/ I have created the output directories, and corrected the file families for the input directories : cd ~kreymer/minos/scripts ./pnfsdirs near cedar_phy_linfix daikon_00 L010185N_medi write ./pnfsdirs near cedar_phy_linfix daikon_00 L010185N_lowi write ============================================================================= 2009 01 21 ============================================================================= ######## # DATA # ######## There will be a downtime on Wednesday Jan 21, 2009 starting at 7:30am until 1:30pm. Fire Suppression maintenance will be done on the 9310's, SL8500 and AML in FCC2. ######## # FARM # ######## roundup.20090121 - added SEL string to PENDLOG, and added timestamp AFSS/roundup.20090121 -s charm -r cedar_phy_bhcurv mcnear AFSS/roundup.20090121 -s helium -r cedar_phy_bhcurv mcnear SRV1> cp AFSS/roundup.20090121 . 
SRV1> ln -sf roundup.20090121 roundup # was roundup.20081209 ./roundup -s L010185_D04 -r cedar_phy_bhcurv mcnear ============================================================================= 2009 01 20 ============================================================================= ######## # DATA # ######## Date: Tue, 20 Jan 2009 16:47:26 -0600 The SL8500 at FCC had a problem with one of the gripper/bots. SUN/STK replaced two gripper/bots This is effected the following libraries: D0-LTO4F1.library_manager CD-LTO4F1.library_manager CDF-LTO4F1.library_manager They libraries have been returned to service. The libraries were unavailable from ~1pm to 4:30pm ############ # HELPDESK # ############ Date: Tue, 20 Jan 2009 15:12:24 -0600 (CST) Subject: HelpDesk ticket 127906 ___________________________________________ Ticket #: 127906 ___________________________________________ Short Description: System Status Input Page - shortcut request Problem Description: We have been making good use of the System Status Input Page, https://computing.fnal.gov/cdsystemstatus/customersupport/inpform.htm But in spite of being careful, I have at least once updated the status of the wrong System. I suggest setting up a URL to pre-select the system, something like : https://computing.fnal.gov/cdsystemstatus/customersupport/inpform.htm?s=MINOS ___________________________________________ This ticket is assigned to HelpDesk of the Help Desk. ___________________________________________ Date: Tue, 20 Jan 2009 15:20:42 -0600 (CST) This ticket has been reassigned to ARENA, MATTHEW of the CD-LSCS/DBAP/IS ___________________________________________ Date: Wed, 21 Jan 2009 11:27:05 -0600 (CST) Solution: mengel@fnal.gov sent this solution: Art, That input form is just a static HTML form. It has no smarts behind it to prefill in data, etc. However, if you wanted to make a copy of the form, put the full URL in the
tag, and add a few 'value="..."' attributes to the input tags, you could make a version that has various fields pre-filled, which should work fine, just do a "save page as.." in your browser and edit the HTML. This ticket was resolved by ARENA, MATTHEW of the CD-LSCS/DBAP/IS group. ___________________________________________ N.B. I have copied this page to my desktop, and removed all the non-Minos options. This seems to work as desired. There is a risk of version drift, if the original page chanages. ___________________________________________ ######## # FARM # ######## I have removed the vanilla L010185N_D04 duplicates. There are thousands of concatenated subruns. For a partial list of missing subruns, see /home/minfarm/ROUNTMP/LOG/2009-01/cedar_phy_bhcurvmcnearL010185N_D04.log There are still many missing charm and helium subruns /home/minfarm/ROUNTMP/LOG/2009-01/cedar_phy_bhcurvmcnearcharm.log /home/minfarm/ROUNTMP/LOG/2009-01/cedar_phy_bhcurvmcnearhelium.log ########## # DCACHE # ########## Date: Tue, 20 Jan 2009 20:50:31 +0000 (GMT) From: Arthur Kreymer To: minos_software_discussion@fnal.gov, minos_batch@fnal.gov Cc: minos-data@fnal.gov Subject: PNFS/DCache down Thursday morning 22 Jan 2009 The Minos PNFS and DCache systems will be down this Thursday morning. ---------- Forwarded message ---------- Date: Fri, 16 Jan 2009 15:11:18 -0600 From: ssa-group@fnal.gov Subject: Announcement: Service schedule outage for enstore, dCache on d0en, stken, cdfen for a duration of 4 Hours There will be a downtime on Thursday Jan 22, 2009 starting at 7:30am until 11:30pm. The following is the agenda for the downtime. Not all services will be available during the maintenance. Listed below are the services that will be effected and when they will be up or down. 7:30 Stop Stken Enstore Here is the list of Library Managers that will be unavailable. 9940.library CD-9940B CD-LTO3 CD-LTO4F1 CD-LTO4G1 CDF-LTO4G1 D0-LTO4G1 Replace library robot arm in GCC SL8500. Begin repair work on srv0, srv1, srv2 and srv4. Begin SW changes and upgrades after repair work. dCache pool config change. 10:30 SL8500 work done. Verify work. The following Library Managers will be made available. CDF-LTO4G1 D0-LTO4G1 11:30 All Stken srv's back online, Enstore service restored. All Library Managers available. Stan Naymola ssa-group@fnal.gov ######### # MYSQL # ######### Trying to tailor Port 3307 server for crl Sort of succeeded, got port 3308. That's OK. ########## # ORACLE # ########## Oracle's January Quarterly patch has been deployed on minosora3~minosdev successfully. Can we proceed with the Production Oracle & O/S patch? Jan 28 or 29th? 8 a.m. ? We will do the reboot for the o/s at this time as well. In sum, requesting ~2 hr downtime. Please advice ___________________________________________________________________ I have done the usual minimal tests. The development station seems to be working normally. No dbserver or station restarts were needed. Thursday Jan 29 at 8 AM is OK with me. Consider it scheduled ============================================================================= 2009 01 16 ============================================================================= ############ # DATASETS # ############ Why are the *minos* pools not listed ? cp datasets.20081201 datasets.20090116 Because they are not in the daily pool lists. ######## # DATA # ######## New Minos pools are showing up, and in use. Sizes are odd, I see over 26 TB of minos-names pools, but we only bought about 14 TB of new pools. 
Pool + indicates presence in Cell Services list 1.8 TB each + r-minos-stkendca21a-3 + r-minos-stkendca22a-3 + r-minos-stkendca23a-3 r-minos-stkendca25a-3 r-minos-stkendca26a-3 r-minos-stkendca27a-2 2.8 TB each + w-pub-minos-stkendca21a-2 + w-pub-minos-stkendca22a-2 + w-pub-minos-stkendca23a-2 w-pub-minos-stkendca25a-3 1.8 TB each + w-raw-minos-stkendca21a-1 + w-raw-minos-stkendca22a-1 + w-raw-minos-stkendca24a-1 w-raw-minos-stkendca26a-1 Date: Fri, 16 Jan 2009 16:11:58 -0600 (CST) Subject: HelpDesk ticket 127805 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Short Description: Minos DCache pool file listings ? Problem Description: dcache-admin : I do not see the new Minos DCache pools in the pool file listings at http://fndca3a.fnal.gov/dcache/files/ The pools are online, so it is important that we have these lists, especially now that we lack Layer 2 PNFS data. Some of the pools are also missing from the Cell Services list at http://fndca3a.fnal.gov:2288/cellInfo r-minos-stkendca25a-3 r-minos-stkendca26a-3 r-minos-stkendca27a-2 w-pub-minos-stkendca25a-3 w-raw-minos-stkendca26a-1 ___________________________________________ Date: Fri, 16 Jan 2009 16:11:58 -0600 (CST) This ticket is assigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 28 Jan 2009 16:55:48 +0000 (GMT) From: Arthur Kreymer There are still no pool directory listings for the 'minos' pools under http://fndca3a.fnal.gov/dcache/files/ Is progress being made on this ? ___________________________________________ Date: Thu, 29 Jan 2009 09:49:03 -0600 (CST) Note To Requester: georges@fnal.gov sent this Notes To Requester: This was forwarded to the dcache developers. ___________________________________________ Date: Mon, 02 Feb 2009 21:02:26 +0000 (GMT) From: Arthur Kreymer Is there a time estimate for getting these listings ? ___________________________________________ Date: Tue, 03 Feb 2009 14:29:47 -0600 From: Vladimir Podstavkov Fixed. Minos DCache pool file listings have been generated and will be generated daily from now on. About the pool info on http://fndca3a.fnal.gov:2288/cellInfo page. r-minos-stkendca25a-3 --> No such pool r-minos-stkendca26a-3 --> Are present r-minos-stkendca27a-2 --> Are present w-pub-minos-stkendca25a-3 --> No such pool w-raw-minos-stkendca26a-1 --> Are present The ticket can be closed. ___________________________________________ ___________________________________________ ######## # DATA # ######## Dealing with plans to import reroot files in /minos/data/users/cherdack/Daikon06/singles_reroot/7* 264 GBytes 12769 files Average size 21 MBytes Typically 31 subruns per run Typically 6 GBytes, 10 runs, 300 files and per directory ######## # GRID # ######## Submitted account request to TACC, for interactive access to UT Austin Teragrid resources. 
https://portal.tacc.utexas.edu/gridsphere/gridsphere
Approved, logged in OK at
https://portal.tacc.utexas.edu/gridsphere/gridsphere?cid
Login to longhorn.tacc.utexas.edu via ssh
This does not work, needed to be registered in a valid group by rmehdi

########
# FARM #
########

Continuing cleanup of dups for general CPB mcnear, only a few runs have dupes:
n13037009 - already removed as a test
n13037010
n13037064
n13037137
n13037138
n13037139
n13037148

To clean up,
./roundup -M -D -s "n13037010" -r cedar_phy_bhcurv mcnear

for RUN in n13037010 n13037064 n1303713 n13037148 ; do
    ./roundup -M -D -s "${RUN}" -r cedar_phy_bhcurv mcnear
    date
done
Fri Jan 16 13:14:59 CST 2009
Fri Jan 16 13:16:33 CST 2009
Fri Jan 16 13:18:03 CST 2009
Fri Jan 16 13:19:57 CST 2009

Now one more full pass on D04 files
Most files are L010185N_D04, a few are L250200N_D04
./roundup -s L250200N_D04 -r cedar_phy_bhcurv mcnear
./roundup -s L010185N_D04 -r cedar_phy_bhcurv mcnear

#############
# SADDCACHE #
#############

ln -sf saddcache.20090116 saddcache # was saddcache.20070806
Shifted date to the second field
Increased minimum file name width to 45, from 25, for cleaner reading

########
# MAIL #
########

Removed RFC2369 headers from lists for which they are not appropriate,
to eliminate the PINE messages
[ Note: This message contains email list management information ]
To disable the headers, added to the head of the options list,
Misc-Options= NO_RFC2369
minos_software_discussion

########
# MAIL #
########

Date: Fri, 16 Jan 2009 09:42:52 -0600 (CST)
Subject: HelpDesk ticket 127751

<-- # @@@ Enter Update below this line. @@@ # -->
<-- # @@@ Enter Update above this line. @@@ # -->
___________________________________________
Ticket #: 127751
___________________________________________
Short Description: SpamAssassin mistuned ?
Problem Description:
I am getting a lot more spam lately ( what else is new . )
But some of the items coming through have negative SpamAssassin ratings,
in spite of having the single most common signature of SPAM,
an IP address which is not registered in DNS.
For example,
Received: by hepa2.fnal.gov (Postfix, from userid 102)
    id 42BDBBA2F7; Fri, 16 Jan 2009 05:01:12 -0600 (CST)
Received: from ffhzi.comunitel.net (unknown [77.224.46.178])
    by hepa2.fnal.gov (Postfix) with SMTP id 8EB35BA2EE
    for ; Fri, 16 Jan 2009 05:01:10 -0600 (CST)
This produced
X-Spam-Status: No, score=-0.4 required=5.0 tests=BAYES_00,
    DATE_IN_PAST_96_XX, HTML_MESSAGE,RDNS_NONE,URI_HEX autolearn=no
    version=3.2.4
With so many flags set, how could this be getting a negative score ?
Perhaps something is broken in the SpamAssassin configuration.
___________________________________________
Date: Fri, 16 Jan 2009 09:59:52 -0600 (CST)
This ticket has been reassigned to MIHALEK, MAURINE of the CD-LSCS/CSI/CS/EST Group.
___________________________________________
Note To Requester:
i spoke to the mail expert. SpamAssassin is not broken.
the spam you are getting was crafted to score more negatively.
his suggestion is that you report this to the spam reporting page: http://computing.fnal.gov/cgi-bin/email/teachspam.cgi ___________________________________________ ___________________________________________ ============================================================================= 2009 01 15 ============================================================================= ######## # FARM # ######## Clearing out cedar_phy_bhcurv mcnear duplicates ./roundup -n -M -D -s n13037009 -r cedar_phy_bhcurv mcnear Looks OK, do this limited thing manually, ./roundup -M -D -s n13037009 -r cedar_phy_bhcurv mcnear Then clear out the D04_charm duplicates, ./roundup -n -M -D -s D04_charm -r cedar_phy_bhcurv mcnear OK - processing 660 files still many PEND files. Go ahead and drop the duplicates ./roundup -M -D -s D04_charm -r cedar_phy_bhcurv mcnear ============================================================================= 2009 01 14 ============================================================================= Date: Wed, 14 Jan 2009 08:19:08 -0600 (CST) Subject: HelpDesk ticket 127564 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 127564 ___________________________________________ Short Description: Recent Minos raw data file not readable in DCache Problem Description: dcache-admin,minos-data : I get an empty file when accessing a recent Minos raw data file in DCache. /pnfs/minos/fardet_data/2009-01/F00042685_0003.mdaq.root But the file should not be empty, see ls and Layer 4 metadata below : -rw-r--r-- 1 buckley e875 18677506 Jan 13 08:01 F00042685_0003.mdaq.root VO8699 0000_000000000_0003456 18677506 fardet_data /pnfs/fs/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root 000F00000000000009016408 CDMS123185530300000 stkenmvr10a:/dev/rmt/tps2d0n:479002012194 According to today's pool listings, this file is in w-stkendca8a-2 I see no files listed in the DCache filemonitor page. But that may not mean much, as those listing seem dated 23-Jul-2008 http://www-stken.fnal.gov/enstore/dcache_monitor/ Please restore F00042685_0003.mdaq.root in DCache. ___________________________________________ This ticket is assigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA. ____________________________________________ Date: Wed, 14 Jan 2009 09:11:02 -0600 This problem was forwarded to the dcache developers. ___________________________________________ Date: Wed, 14 Jan 2009 13:23:04 -0600 (CST) georges@fnal.gov sent this Notes To Requester: Notes To requester Please advise user to try to use file once again. Zero length copy of file has been removed from the pool after it has been verified that file exists on the tape and is of correct size. ___________________________________________ Date: Wed, 14 Jan 2009 20:23:26 +0000 (GMT) From: Arthur Kreymer The file is gone from DCache, as confirmed by dc_check. But it has not been restored yet, despite my attempt to dccp it, and my doing a manual dccp -P The restore is visible in the DCache restore queue pages http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/lazy http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/* It has been going to w-raw-minos-stkendca21a-1, since about 13:09. But there is no tape mounted, in spite of light library activity. I see the problem. At http://fndca3a.fnal.gov:2288/queueInfo, Restores Max is set to 0 for all the w-*-minos-* pools. This restore shows up as Queued, but will not get scheduled. 
Please boost the restore limit for these pools. __________________________________________ __________________________________________ Date: Wed, 14 Jan 2009 21:44:32 +0000 (GMT) From: Arthur Kreymer The file is now queued for restore, but is not yet on disk. I see no tape mounted, though the tape system is not busy. The restore request was in the restore queues for a while, but is now gone. dc_check reports the file as unavailable : MINOS26 > dc_check dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00 042685_0003.mdaq.root dc_stage fail : File not cached System error: Resource temporarily unavailable I have run another dccp by hand, at 15:35:35 I see the transfer in the Restore Queues again, but again no tape is mounted, and there is no active transfer. This time the file is targeted to w-raw-minos-stkendca22a-1 Still no tape activity, as of 15:44. __________________________________________ Date: Wed, 14 Jan 2009 15:18:58 -0600 From: Vladimir Podstavkov Max number of restores for the new pools set to 1. __________________________________________ Date: Wed, 14 Jan 2009 16:08:04 -0600 (CST) From: Dmitry Litvintsev Let us know if anything changed for the better. __________________________________________ Date: Wed, 14 Jan 2009 23:33:49 +0000 (GMT) From: Arthur Kreymer Restoring the file took about 12 minutes, from around 17:10, due to a modest backlog of restore requests ( your testing ? ). The file has been processed and declared to SAM. This ticket can be closed. What was done to kick things loose ? __________________________________________ Date: Wed, 14 Jan 2009 20:35:55 -0600 (CST) From: Dmitry Litvintsev Two issues: 1) pools are part of read/write link but number of restores was set to zero preventing staging files from tape. This has been noted by you and then fixed by us. 2) in addition the encp, which is as you know is HSM client used to put data to/from tape has been missing on the pools. So the HSM jobs will spawn and quit w/ errors. This has been spotted and corrected. The long time to restore a file was due to a backlog of store/restore jobs coming from these pools. __________________________________________ __________________________________________ __________________________________________ http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/lazy PnfsId Subnet PoolCandidate Started Clients Retries Status 000F00000000000009016408 0.0.0.0/0.0.0.0-*/* w-raw-minos-stkendca21a-1 01.14 13:09:37 3 0 Staging 01.14 13:09:37 /pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/* PnfsId Subnet PoolCandidate Started Clients Retries Status 000F00000000000009016408 0.0.0.0/0.0.0.0-*/* w-raw-minos-stkendca21a-1 01.14 13:09:37 3 0 Staging 01.14 13:09:37 Wed Jan 14 14:10:46 CST 2009 000F00000000000009016408 0.0.0.0/0.0.0.0-*/* w-raw-minos-stkendca24a-1 01.14 15:16:34 1 1 Waiting 01.14 15:16:34 While waiting, get a safe copy : [minos@minos-gateway ~]$ scp -c blowfish daqdcp.minos-soudan.org:/daqdata/F00042685_0003.mdaq.root /tmp/ [minos@minos-gateway ~]$ sum /tmp/F00042685_0003.mdaq.root 45765 18240 [minos@minos-gateway ~]$ scp /tmp/F00042685_0003.mdaq.root kreymer@minos26.fnal.gov:/minos/scratch/kreymer/ MIN > sum /minos/scratch/kreymer/F00042685_0003.mdaq.root 45765 18240 MINOS26 > ./dccptest /fardet_data/2009-01/F00042685_0003.mdaq.root PORT 24136 Connected in 0.00s. 
[Wed Jan 14 15:35:35 2009] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root in cache. Command failed! Server error message for [2]: "902" (errno 902). Failed open file in the dCache. Can't open source file : "902" System error: Input/output error ls: /local/scratch26/kreymer/DCCPTEST/F00042685_0003.mdaq.root: No such file or directory 000F00000000000009016408 0.0.0.0/0.0.0.0-*/* w-raw-minos-stkendca22a-1 01.14 15:35:35 1 1 Waiting 01.14 15:35:35 /pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root 17:10 - MINOS26 > dc_check dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root dc_stage fail : File not cached System error: Resource temporarily unavailable MINOS26 > date ; ./dccptest /fardet_data/2009-01/F00042685_0003.mdaq.root Wed Jan 14 17:11:36 CST 2009 PORT 24136 Connected in 0.00s. [Wed Jan 14 17:11:37 2009] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root in cache. Cache open succeeded in 877.22s. 18677506 bytes in 0 seconds -rw-r--r-- 1 kreymer g020 18677506 Jan 14 17:26 /local/scratch26/kreymer/DCCPTEST/F00042685_0003.mdaq.root 000F00000000000009016408 0.0.0.0/0.0.0.0-*/* w-raw-minos-stkendca22a-1 01.14 17:10:47 3 0 Staging 01.14 17:10:47 /pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root Restores are queued up for w-raw-minos-stkendca21a-1 8 w-raw-minos-stkendca22a-1 3 w-raw-minos-stkendca24a-1 2 By 17:26, the queue for 22a-1 was down to 0, 1 active ============================================================================= 2009 01 13 ============================================================================= ####### # DAQ # ####### Last file archived : Tue Jan 13 16:01:01 CST 2009 F00042685_0002.mdaq.root Jan 12 15:03 N00015450_0000.mdaq.root Jan 13 15:36 F090112_000000.mdcs.root Jan 12 19:43 N090112_000003.mdcs.root Jan 12 19:16 B090113_080001.mbeam.root Jan 13 10:14 Near DAQ is OK, Far is bad, both DCS are bad since yesterday. Slight slowdowns reading yesterday, http://www-numi.fnal.gov/computing/dh/ftplog/2009/01/12.txt 8 Mon Jan 12 14:50:53 CST 2009 557 10 Mon Jan 12 15:01:03 CST 2009 557 10 Mon Jan 12 15:11:13 CST 2009 557 5 Mon Jan 12 15:21:18 CST 2009 557 Looked at logs in LDIR=/local/scratch26/kreymer/genpy/neardet_data/2009-01 Testing a few of the difficult mdaq files with loon, firstlast.C Created freebird to allow me to run on minos26 setup_minos -r R1.22 The failed command was dbu -bq /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/dbu_sampy.C \ dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015439_0008.mdaq.root Trying loon -bq firstlast.C \ dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015439_0008.mdaq.root Processing firstlast.C... Spin(104292 in 104292 out 0 filt.) 
1) +RawRecCounts::Ana n=104292(104292/ 0) t=( 1.97/ 0.08) This looks OK on the surface, now try export ENV_TSQL_URL="mysql:odbc://minos-db1.fnal.gov/temp;mysql:odbc://minos-db1.fnal.gov/offline;mysql:odbc://minos-db1.fnal.gov/offl ine_dev" dbu -bq dbu_sampy.C \ dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015439_0008.mdaq.root This is also OK Try genpy in its normal context per HOWTO.genpy ./genpy -d -f N00015439_0008.mdaq.root neardet_data/2009-01 ./genpy -f N00015439_0008.mdaq.root neardet_data/2009-01 This looks OK ./predator Only one file fails, F00042685_0003.sam.py was not generated - check log for error Unable to open dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root ./dc_stat /pnfs/minos/fardet_data/2009-01/F00042685_0003.mdaq.root PNFS status for /pnfs/minos/fardet_data/2009-01/F00042685_0003.mdaq.root -rw-r--r-- 1 buckley e875 18677506 Jan 13 08:01 F00042685_0003.mdaq.root LEVEL 2 2,0,0,0.0,0.0 :l=18677506; LEVEL 4 VO8699 0000_000000000_0003456 18677506 fardet_data /pnfs/fs/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root 000F00000000000009016408 CDMS123185530300000 stkenmvr10a:/dev/rmt/tps2d0n:479002012194 2356856715 ============================ dc_check dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root Check passed for file "dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root" loon -bq firstlast.C \ dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root Error in : error reading all requested bytes ./dccptest /fardet_data/2009-01/F00042685_0003.mdaq.root PORT 24136 Connected in 0.00s. [Tue Jan 13 22:13:23 2009] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003. mdaq.root in cache. Cache open succeeded in 0.20s. 0 bytes in 0 seconds This file claims to be OK on tape, but is 0 length in DCache. Ran ./predator 2009-01 - clean except for this one file. 22:20 crontab crontab.dat ####### # EVO # ####### Registered. http://evo.caltech.edu/evoGate/index.jsp This was immediate via automatic email/web validation. ######### # MYSQL # ######### Date: Tue, 13 Jan 2009 14:56:04 -0600 (CST) Subject: HelpDesk ticket 127533 ___________________________________________ Ticket #: 127533 ___________________________________________ Short Description: On minos-mysql2, need /data/crl owned by minsoft Problem Description: run2-sys On minos-mysql2, please create directory /data/crl owned by minsoft,mysql just like /data/database. ___________________________________________ Date: Tue, 13 Jan 2009 15:02:24 -0600 (CST) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 13 Jan 2009 15:10:41 -0600 (CST) Solution: Request completed. ___________________________________________ ___________________________________________ ___________________________________________ ######### # MYSQL # ######### Added Minos Cluster account for our Mysql DBA MINOS01 > cmd add_minos_user svetlana Date: Tue, 13 Jan 2009 14:28:52 -0600 (CST) Subject: HelpDesk ticket 127529 ___________________________________________ Ticket #: 127529 ___________________________________________ Short Description: On minos-mysql1 and minos-mysql2, add svetlana access, remove bgreen, Problem Description: run2-sys : Please add local account access to minos-mysql1 and minos-mysql2 for user svetlana. 
This account has recently been added to the Minos Cluster. So you can just add to /etc/passwd : +svetlana::::: This would also be a good time to remove the bgreen accounts from minos-mysql1 and minoos-mysql2. Bruce left the lab years ago. I have examined the /home/bgreen files on minos-mysql1. They should be removed along with the account. ___________________________________________ Date: Tue, 13 Jan 2009 15:27:30 -0600 (CST) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 14 Jan 2009 14:36:44 -0600 (CST) Solution: Requests complete. This ticket was resolved by SCOTT, RENNIE of the CD-SF/FEF group. ___________________________________________ ######## # PNFS # ######## Date: Tue, 13 Jan 2009 11:53:12 -0600 From: John Hendry We only show Liz Buckley-Geer as the sole pnfs authorizations contact for minos. I know she is working in dark energy survey now. Should I make you the minos contact in our pnfs authorizations list? On http://www-ccf.fnal.gov/ISA/PNFS_Auth_list.html For these mountpoints: These users are authorized to request mountpoint exports: Experiment also known as: * stkensrv1:/minos * Liz Buckley-Geer (buckley@fnal.gov) E875/E934 _________________________________________________________________ Date: Tue, 13 Jan 2009 19:53:38 +0000 (GMT) From: Arthur Kreymer Yes, I should be the primary contact. Robert Hatcher (rhatcher) should also be authorized. ######## # PINE # ######## http://mailman2.u.washington.edu/pipermail/alpine-info/2008-July/000971.html ####### # DAQ # ####### FD data gap ? -rw-r--r-- 1 buckley e875 59609235 Jan 11 16:03 F00042682_0002.mdaq.root -rw-r--r-- 1 buckley e875 18551291 Jan 12 07:56 F00042679_0015.mdaq.root ============================================================================= 2009 01 12 ============================================================================= ######### # MYSQL # ######### Mysql> ./dbar -I STARTED DBARCHIVES Mon Jan 12 16:06:32 CST 2009 Archiving OFFLINE Mon Jan 12 16:06:32 CST 2009 69377 . 
Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 174G 44G 80% /data /minos/data/mysql/archive/20090112/offline/offline.log ########### # CRONTAB # ########### Restarted kreymer@minos26, mindata@minos26, down since AFS problems 7 Jan Ran condorproxy and predator manually ############ # PREDATOR # ############ Many, not all, ND files timed out N00015420_0005.mdaq.root Mon Jan 12 20:54:14 UTC 2009 N00015420_0008.mdaq.root Mon Jan 12 21:00:44 UTC 2009 N00015425_0000.mdaq.root Mon Jan 12 21:36:40 UTC 2009 N00015427_0002.mdaq.root Mon Jan 12 21:46:32 UTC 2009 N00015427_0008.mdaq.root Mon Jan 12 21:59:12 UTC 2009 N00015427_0009.mdaq.root Mon Jan 12 22:02:38 UTC 2009 N00015427_0010.mdaq.root Mon Jan 12 22:06:13 UTC 2009 N00015427_0011.mdaq.root Mon Jan 12 22:09:38 UTC 2009 N00015427_0012.mdaq.root Mon Jan 12 22:13:14 UTC 2009 N00015427_0014.mdaq.root Mon Jan 12 22:20:14 UTC 2009 N00015427_0016.mdaq.root Mon Jan 12 22:25:29 UTC 2009 N00015427_0017.mdaq.root Mon Jan 12 22:28:55 UTC 2009 N00015427_0019.mdaq.root Mon Jan 12 22:34:15 UTC 2009= N00015427_0020.mdaq.root Mon Jan 12 22:37:45 UTC 2009= N00015430_0000.mdaq.root Mon Jan 12 22:46:41 UTC 2009 N00015430_0002.mdaq.root Mon Jan 12 22:51:51 UTC 2009 N00015430_0003.mdaq.root Mon Jan 12 22:55:26 UTC 2009 N00015431_0000.mdaq.root Mon Jan 12 22:59:11 UTC 2009= N00015433_0002.mdaq.root Mon Jan 12 23:04:25 UTC 2009= N00015433_0005.mdaq.root Mon Jan 12 23:11:15 UTC 2009= N00015433_0006.mdaq.root Mon Jan 12 23:14:45 UTC 2009= N00015433_0009.mdaq.root Mon Jan 12 23:22:01 UTC 2009 N00015433_0010.mdaq.root Mon Jan 12 23:25:31 UTC 2009 N00015433_0014.mdaq.root Mon Jan 12 23:33:56 UTC 2009= N00015433_0015.mdaq.root Mon Jan 12 23:37:16 UTC 2009= N00015433_0017.mdaq.root Mon Jan 12 23:43:31 UTC 2009 N00015433_0018.mdaq.root Mon Jan 12 23:47:02 UTC 2009 N00015433_0019.mdaq.root Mon Jan 12 23:50:22 UTC 2009 N00015433_0020.mdaq.root Mon Jan 12 23:53:52 UTC 2009= N00015433_0021.mdaq.root Mon Jan 12 23:57:12 UTC 2009 N00015433_0022.mdaq.root Mon Jan 12 23:58:52 UTC 2009 N00015433_0023.mdaq.root Tue Jan 13 00:02:12 UTC 2009= N00015435_0000.mdaq.root Tue Jan 13 00:08:57 UTC 2009 N00015436_0004.mdaq.root Tue Jan 13 00:18:33 UTC 2009 N00015436_0007.mdaq.root Tue Jan 13 00:25:18 UTC 2009 N00015436_0009.mdaq.root Tue Jan 13 00:30:33 UTC 2009 N00015436_0010.mdaq.root Tue Jan 13 00:34:08 UTC 2009 N00015436_0014.mdaq.root Tue Jan 13 00:42:23 UTC 2009 N00015436_0015.mdaq.root Tue Jan 13 00:45:43 UTC 2009 N00015436_0017.mdaq.root Tue Jan 13 00:51:14 UTC 2009 N00015436_0020.mdaq.root Tue Jan 13 00:58:03 UTC 2009 N00015436_0021.mdaq.root Tue Jan 13 01:01:24 UTC 2009 N00015436_0023.mdaq.root Tue Jan 13 01:06:54 UTC 2009 N00015439_0004.mdaq.root Tue Jan 13 01:20:53 UTC 2009 N00015439_0008.mdaq.root Tue Jan 13 01:29:17 UTC 2009X N00015439_0009.mdaq.root Tue Jan 13 01:32:38 UTC 2009X N00015439_0010.mdaq.root Tue Jan 13 01:36:08 UTC 2009X N00015439_0011.mdaq.root Tue Jan 13 01:39:28 UTC 2009X N00015439_0012.mdaq.root Tue Jan 13 01:42:58 UTC 2009X N00015439_0013.mdaq.root Tue Jan 13 01:44:08 UTC 2009X N00015439_0014.mdaq.root Tue Jan 13 01:47:38 UTC 2009X N00015439_0015.mdaq.root Tue Jan 13 01:50:58 UTC 2009X N00015439_0016.mdaq.root Tue Jan 13 01:54:29 UTC 2009X N00015439_0017.mdaq.root Tue Jan 13 01:57:34 UTC 2009X N00015439_0018.mdaq.root Tue Jan 13 01:59:03 UTC 2009X N00015439_0019.mdaq.root Tue Jan 13 02:02:24 UTC 2009X Many of these were picked up the next cycle. 
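To spot-check whether timed-out subruns like those above are now readable from DCache,
a small loop over dc_check works. This is only a sketch, using the same dcap prefix as
the dccptest calls below, with two of the file names above substituted as examples :

# assumes the 'Check passed' line goes to stdout, as in the dc_check transcripts below
DCPOR=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01
for FILE in N00015420_0005.mdaq.root N00015427_0016.mdaq.root ; do
  if dc_check ${DCPOR}/${FILE} | grep -q 'Check passed' ; then
    echo "CACHED     ${FILE}"
  else
    echo "NOT CACHED ${FILE}"
  fi
done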
New timeouts, N00015439_0020.mdaq.root Tue Jan 13 08:12:30 UTC 2009 N00015439_0021.mdaq.root Tue Jan 13 08:14:05 UTC 2009 dccptest neardet_data/2009-01/N00015427_0016.mdaq.root OK, 2 seconds to copy dccp testneardet_data/2009-01/N00015420_0005.mdaq.root PORT 24136 Datafile with name 'neardet_data/2009-01/N00015420_0005.mdaq.root' not found. Connected in 0.00s. [Mon Jan 12 16:32:38 2009] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos//neardet_data/2009-01/N00015420_0005.mdaq.root in cache. Cache open succeeded in 317.21s. 115960637 bytes in 4 seconds (28310.70 KB/sec) ####### # CRL # ####### Reset password for rlt ( Richard Talaga ) via Administer menu.in CRL. minoscrl-admin added kreymer removed howcroft ######## # FARM # ######## touch /minos/data/minfarm/roundup/STOP.LOOPER ####### # DAQ # ####### Control Room network went down, back up around 8:50 Date: Mon, 12 Jan 2009 06:39:10 -0600 From: MINOS DAQ To: c.j.metelko@rl.ac.uk, gfp@fnal.gov, kreymer@fnal.gov Subject: Copy of N00015439_0012.mdaq.root to minos-om.fnal.gov:/data/root_files/ FAILED ============================================================================= 2009 01 10 ============================================================================= ######## # FARM # ######## ./looper '-r cedar_phy_bhcurv mcnear' & rm /minos/data/minfarm/roundup/STOP.LOOPER ./looper '-r cedar_phy_bhcurv mcnear' & ######### # MYSQL # ######### Moved XTRA databases to mysql2 ============================================================================= 2009 01 09 ============================================================================= ######## # DATA # ######## Date: Wed, 07 Jan 2009 09:29:02 -0500 From: Daniel D. Cherdack To: Arthur Kreymer Subject: Data Storage I have about 365G of Daikon04 AnaNue files that I would like to hold onto for at least the near term. Right now they sit on /minos/data/users/cherdack/ while I trim and merge them for transport analysis etc. When I finish where is the preferred location for storage? _________________________________________________________________________ Date: Sat, 10 Jan 2009 00:17:08 +0000 (GMT) I have started looking at this. For archival, we need most files to be at least 1 GB in size, and no more than about 100 per directory. Your files seem too small, with over 1000 in one directory. So we will need to tar them up in some simple fashion, then archive. ########## # PARROT # ########## Created indexparrot script under /home/mindata. Will put this into CVS soon. ####### # CRL # ####### Date: Fri, 09 Jan 2009 17:55:04 -0600 (CST) Subject: HelpDesk ticket 127347 ___________________________________________ Ticket #: 127347 ___________________________________________ Short Description: Minos CRL software sometimes silently fails to post an entry Problem Description: minoscrl-admin,gysin : As a spinoff of discussions earlier today, we verified that entries posted to the CRL without a specified Topic could be silently dropped. This happens if the item is Save'd without a Preview and without a Topic. Suzanne Gysin has added the warning in version v1_14 as deployed to Minos. Please update the official software, and the release notes. ___________________________________________ This ticket is assigned to HelpDesk of the Help Desk. ___________________________________________ Date: Mon, 12 Jan 2009 09:26:59 -0600 (CST) This ticket has been reassigned to GYSIN, SUZANNE of the CD-ILC/FP Group. 
___________________________________________ Date: Mon, 12 Jan 2009 09:48:08 -0600 (CST) Note To Requester: gysin@fnal.gov sent this Notes To Requester: The official software and release notes are updated and all CRLW installations are patched. ___________________________________________ Date: Tue, 20 Jan 2009 09:54:56 -0600 (CST) Solution: gysin@fnal.gov sent this solution: The software is updated and the release notes published This ticket was resolved by GYSIN, SUZANNE of the CD-ILC/FP group. ___________________________________________ ########## # DCACHE # ########## Subject : Re: HelpDesk ticket 126958 has additional info. The FTP transfers page seems to be up to date today. I presume that something was done to address the previous problem. If that is the case, this ticket can be closed. Thanks ! ####### # CRL # ####### Suzanne Gysin gave me admin access. Web master email: minoscrl-admin@fnal.gov Path to CRL configuration directory: /afs/fnal.gov/files/data/minos/crl_data/ Memo Pad URL: http://www-bd.fnal.gov/issues/wiki/MINOSMemopad////////// MINOS26 > fs listacl /afs/fnal.gov/files/data/minos/crl_data/ Access list for /afs/fnal.gov/files/data/minos/crl_data/ is Normal rights: bens:crlweb2 rlidwk bgreen:minoscrladmin rlidwka bgreen:minoscrl rlidwk spanacek:crladmin rlidwka system:administrators rlidwka system:anyuser rl buckley rlidwka bgreen rlidwka We should create minos:minoscrladmin to replace the breen's, Meanwhile, CRLCONF=/afs/fnal.gov/files/data/minos/crl_data/LogBook_admin/LogbookConfigParms.properties grep Logbook.file_location.entry_directory ${CRLCONF} Logbook.file_location.entry_directory = /afs/fnal.gov/files/data/minos/crl_data/CRLdata grep Logbook.file_location.www_directory ${CRLCONF} Logbook.file_location.www_directory = /afs/fnal.gov/files/data/minos/crl_data/WWWdirectory ####### # CRL # ####### Checklist for CRL: S.Gysin 1-9-08 AFS The symptoms for AFS failure are that one can not see the entries, and can not make new entries or annotations. In many cases the CRL administrator chooses to store the CRL entries in an AFS directory. This directory can be found in the configuration file. The path to the configuration file is noted in the CRL on line on the configure page: 1.On the CRL's index page (Index.jsp) select configure (upper left). Only a user with admin priviledges has access to this page. 2.The path to the CRL configuration directory is noted there. 3.Go to this directory and > cd LogBook_admin open the file LogbookConfigParms.properties This file contains the path to all data stored within the logbook. The directory for storing the entries is noted by the property: Logbook.file_location.entry_directory Logbook.file_location.www_directory 4.If this is an afs directory, one can check this by attempting to cd to it. Database The symptom of a Database failure are that one can not see any entries nor make new entries or annotations. Similarily to the AFS directory the database information is also stored in the configuration file. 1.On the CRL's index page (Index.jsp) select configure (upper left). Only a user with admin priviledges has access to this page. 2.The path to the CRL configuration directory is noted there. Go to this directory and > cd LogBook_admin open the file LogbookConfigParms.properties This file contains the database information in the following properties: Logbook.database.server Logbook.database.dbms_name Logbook.database.username 3.If this database is in your control, log in and execute a mysql command to see if it is up and running. 
If it is not in your control, open a help desk ticket, specifying the logbook and database. WebServer The symptom of a webserver failure are that one can not see the webpage and usually sees an error 500. The webserver is also specific to each logbook. Most logbooks at FNAL run currently under an alias on crlweb2. The alias has to be correct to see images, because each alias is assigned a virtual host with its own home directory (see afs) where the images are stored. The CRL runs under Tomcat and Apache. The distinction is irrelevant since either can be down and both have to be restarted if one is down. The webserver is restarted every day at 4 am to re-issue an AFS token. If you see that your webserver is down, open a help desk ticket stating the name of your logbook and the symptom. CRL application The symptoms of a CRL application failure vary greatly. You may see an exception and an error page, or you may find a single entry missing. If you are sure AFS, the database, and the Webserver are running, open a helpdesk ticket for the CRL application. ############ # PREDATOR # ############ Predator - look into beam declares, seemed to be choking STARTING Wed Jan 7 11:31:06 UTC 2009 B090106_080001.mbeam.root Wed Jan 7 11:31:07 UTC 2009 B090106_160002.mbeam.root Wed Jan 7 11:36:23 UTC 2009 ? ########## # ORACLE # ########## Found a list of HP Oracle certified servers, http://h18004.www1.hp.com/products/servers/linux/hplinuxcert-oracle.html For example, http://h10010.www1.hp.com/wwpc/us/en/en/WF25a/15351-15351-3328412-241644-3328422-3454575.html HP DL580 G5 Rack $25K 4x4core 2.9 GHz 64 GB, 4x350GB disk, dual ps, dual FC http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=431&FamilyId=2635&BaseId=23381&oi=E9CED&BEID=19701&SBLID= 3 GHz Intel is like 6 HGz Sparc ? HP DL580 G5 7400 Rack $28K 4x6core 2.7 GHz 64 GB, 4x250GB disk, dual ps, dual FC http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=431&FamilyId=2635&BaseId=28012&oi=E9CED&BEID=19701&SBLID= Similar Sun servers, like Netra X4250 Server, $7K The baseline Oracle server is Sun T5440 Config 2 http://shop.sun.com/is-bin/INTERSHOP.enfinity/WFS/Sun_NorthAmerica-Sun_Store_US-Site/en_US/-/USD/ViewStandardCatalog-Browse?CategoryName=SPARC_T5440&CategoryDomainName=Sun_NorthAmerica-Sun_Store_US-SunCatalog $92K, 8-Core 4 x 1.2 GHz UltraSPARC T2 Plus 64 GB (32 x 2 GB DIMMs) DELL www.dell.com/oracle http://www.dell.com/content/topics/global.aspx/alliances/en/oracle_builds?c=us&cs=555&l=en&s=biz&~tab=3 Dell 900 Oracle Validated $37K 4x6core 2.7 GHz 64 GB, 4x300GB disk, dual ps, dual FC http://configure.us.dell.com/dellstore/config.aspx?c=us&cs=555&l=en&oc=MLB1041&s=biz ============================================================================= 2009 01 08 ============================================================================= ######### # BATCH # ######### Finished stray cpl far files (3 each mrnt, sntp) ./roundup -r cedar_phy_linfix mcfar Waited aobut 5 minutes for files to be written, iterated. ./roundup -r cedar_phy_linfix mcfar Date: Thu, 08 Jan 2009 15:31:10 -0600 (CST) From: Howard@agni.phys.iit.edu, Rubin@agni.phys.iit.edu To: kreymer@fnal.gov Subject: On vacation I'm on vacation and will not be reading my mail for a while. Your mail will be dealt with when I return on or about January 12. 
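For the record, the stray cedar_phy_linfix mcfar pickup above was just run, wait,
run again. A minimal sketch of that pattern, where the 300 second sleep is only a
stand-in for the roughly 5 minute wait noted above, not a tuned value :

for PASS in 1 2 ; do
  ./roundup -r cedar_phy_linfix mcfar
  [ ${PASS} -lt 2 ] && sleep 300   # allow time for the remaining files to be written
done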
=============================================================================
2009 01 07
=============================================================================

#######
# AFS #
#######

Date: Thu, 08 Jan 2009 10:38:56 -0600
From: Ramon C. Pasetes
To: 'Arthur Kreymer'
Subject: RE: CENTRAL WEB Servers & AFS Status (fwd)

Hi Art,

We don't know what caused the incident yesterday,
but we have been stable as of 10:22.

-Ray
_____________________________________________________________________

Date: Thu, 08 Jan 2009 17:08:44 +0000 (GMT)
From: Arthur Kreymer
To: minos-admin@fnal.gov
Cc: minos_all@fnal.gov, minos_software_discussion@fnal.gov, minos-shifters@fnal.gov
Subject: Re: Fermilab AFS problems since 05:00 CST ( 11:00 UTC )

The root cause of yesterday's AFS problems is not yet understood.
But AFS is stable and back to normal, since about 10:30 yesterday.
We can resume normal operations.

#########
# MYSQL #
#########

Archived the extra databases,

Mysql> ./dbar -X
tee: /minos/data/mysql/archive/20090107/archive.log: No such file or directory
STARTED DBARCHIVES Wed Jan 7 16:50:06 CST 2009
STARTED DBARCHIVES Wed Jan 7 16:50:06 CST 2009
FINISHED DBARCHIVES Wed Jan 7 17:14:32 CST 2009

Helping nwest get connected to the server on mysql2,
getting dbarchive ready for running in cron,
and cleaning the non-database files out of /data/database

mkdir -p /var/tmp/minsoft/maint
mkdir -p /var/tmp/minsoft/maint/grid
chmod 700 /var/tmp/minsoft/maint/grid
chmod 700 /var/tmp/minsoft/maint/grid/mysqlroot

Test this on each system:
export MYSQL_PWD=`cat /var/tmp/minsoft/maint/grid/mysqlroot`
. ${HOME}/setups.sh
setup mysql
mysqladmin -u root processlist

###########
# MINOS01 #
###########

Load average on minos01 is increasing.
Many processes seen in ps axfu, like arms /usr/krb5/bin/kcron /usr/krb5/bin/aklog cp -p /minos/data/mcimport/hgallag/md5/all.md5 /afs/fnal/files/home/room3/arms/public_html/hgallag.all.md5

Date: Wed, 7 Jan 2009 17:22:37 +0000 (GMT)
From: Arthur Kreymer
To: arms@fnal.gov
Cc: minos-admin@fnal.gov
Subject: arms cron jobs piling up on minos01

Kregg :

You have many cron jobs piling up on minos01.fnal.gov,
starting every 10 minutes, doing things like

cp -p /minos/data/mcimport/hgallag/md5/all.md5 \
 /afs/fnal/files/home/room3/arms/public_html/hgallag.all.md5

These are getting stuck due to the present AFS problems.
You should kill off most of these,
and put in a test to keep more than one from running at once.

Done, the load is back down.

########
# PNFS #
########

Isolated slow access to PNFS, but this may have been due to AFS

http://www-numi.fnal.gov/computing/dh/pnfslog/NOW.txt
102 Wed Jan 7 04:35:22 CST 2009
 50 Wed Jan 7 05:01:22 CST 2009
776 Wed Jan 7 05:29:21 CST 2009
 52 Wed Jan 7 07:32:11 CST 2009
203 Wed Jan 7 08:20:48 CST 2009

http://www-numi.fnal.gov/computing/dh/ftplog/NOW.txt
136 Wed Jan 7 04:35:24 CST 2009 557
823 Wed Jan 7 05:29:23 CST 2009 557
116 Wed Jan 7 07:32:13 CST 2009 557

#######
# CRL #
#######

D0 logbook is at
http://www-d0online.fnal.gov/crlw/Index.jsp?inquiry=/CRLWindex/2_Hr_All_Entries
D0 uses v1_8_28 September 6, 2006
Minos uses v1_13 November 6, 2008

Available support levels are :
24by7    ( commonly called 24x7 )
8to00by7
8to17by7
8to17by5 ( commonly called 8x5, incorrectly, should be 9x5 )

Per discussion with Tom B, we need an SLA (Service Level Agreement)
listing the components of CRL and their support.
D0 is doing a full review of their online systems, much larger scale.

##########
# CONDOR #
##########

Moved condorglide out of AFS, due to today's global failure.
MINOS25 > pwd /minos/scratch/kreymer/condor/probe MINOS25 > cp ${HOME}/minos/scripts/condorglide ../condorglide crontab.minos25 MAILTO=kreymer@fnal.gov 0-59/10 * * * * /minos/scratch/kreymer/condor/condorglide 07 1-23/2 * * * /usr/krb5/bin/kcron /local/scratch25/grid/kproxy ####### # AFS # ####### Date: Wed, 07 Jan 2009 15:49:13 +0000 (GMT) From: Arthur Kreymer To: minos-admin@fnal.gov Cc: minos_all@fnal.gov, minos_software_discussion@fnal.gov, minos-shifters@fnal.gov Subject: Fermilab AFS problems since 05:00 CST ( 11:00 UTC ) There have been severe AFS problems since about 05:00 CST ( 11:00 UTC ). This has affected most of the Fermilab web pages, and has severely slowed down logins to Minos Cluster nodes, to the point of uselessness. The Fermilab experts are working to resolve the problem. Please minimize use of /afs/fnal.gov . Please stand by for a further announcement. ########## # ORACLE # ########## Date: Wed, 07 Jan 2009 09:42:04 -0600 From: Maurine Mihalek To: Arthur Kreymer Cc: dsg-entire group , minosdb-support@fnal.gov, csi unix group Subject: Re: new kernel for minosora1/minosora3 minosora3 is back up. julie checked db's and they are up. will co-ordinate minosora1 with nelly. maurine ----- Original Message ----- From: Arthur Kreymer Date: Monday, January 5, 2009 2:50 pm Subject: Re: new kernel for minosora1/minosora3 To: Maurine Mihalek Cc: dsg-entire group , minosdb-support@fnal.gov, csi unix group > On Fri, 2 Jan 2009, Maurine Mihalek wrote: > > > there is a new linux kernel that needs to be made effective on > minosora1 and > > minosora3. > > > > i would like to do minosora3 on wednesday (1/7/2009) and reboot > around 8:30 > > am. minosora3 should be up by 9 am. minosora3 has been up for 231 days. > > > > for minosora1, i would like to upgrade the kernel and reboot on > tuesday > > (1/13) morning at 8 am. minosora1 has been up for 192 days. > > > > are these days and times acceptable? > > You can do minosora3 any time, just let me know when it is done. > > Since the January quarterly patches are coming out soon, > I would rather combine the minosora1 kernel update with those patches, > to minimize service interruptions. > > Note - please do not cc: minos_software_discussion or minos-data. > Those lists are not related to database support. ============================================================================= 2009 01 06 ============================================================================= ######### # BATCH # ######### Date: Tue, 06 Jan 2009 15:30:51 -0600 (CST) From: HelpDesk Subject: HelpDesk ticket 127096 ___________________________________________ Ticket #: 127096 ___________________________________________ Short Description: Ten runaway emacs session for user elllis Problem Description: User ellis has ten emacs sessions running on fnpcsrv1. 9 of these are running CPU bound, each using nearly 19 hours of CPU so far. This is bogging down the the fnpcsrv1 server. fnpcsrv1% ps -flu ellis F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 0 R ellis 340 1 83 85 0 - 2149 - Jan05 ? 18:53:47 xemacs rubmit2.al=0.1 4 S ellis 2029 1370 0 75 0 - 2033 - 10:24 pts/1 00:00:00 -tcsh 0 R ellis 4417 1 77 85 0 - 2149 - Jan05 ? 18:56:06 xemacs submit1.al=0.1 0 R ellis 5520 1 77 85 0 - 2149 - Jan05 ? 18:50:51 xemacs copy 0 R ellis 7202 1 78 85 0 - 2149 - Jan05 ? 19:06:37 xemacs symlin 0 R ellis 7720 1 78 85 0 - 2149 - Jan05 ? 18:54:52 xemacs symlin 0 R ellis 10945 1 78 85 0 - 2149 - Jan05 ? 
18:55:42 xemacs submit1.al=0.01 4 S ellis 12023 12016 0 75 0 - 1732 - 12:03 pts/7 00:00:00 [Message 1 copied to "minosbatch" in and deleted] cc: ellis@fnal.gov ___________________________________________ Date: Tue, 06 Jan 2009 15:39:08 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: Art, I had already seen these emacs processes and killed them by the time this ticket got to me. ___________________________________________ ___________________________________________ ####### # CRL # ####### Found an interesting CRL review, when responding to an errant Helpdesk ticket 127060 by Elvin Harms (ILC) Electronic Logbooks for Use at FNAL ILC Test Areas http://docdb.fnal.gov/ILC/DocDB/0003/000306/001/ElectronicLogbooks_060529.ppt They mention a PSI logbook : PSI logbook: This product was used at MINOS for a while but it's use is declining due to support problems. PSI: MINOS liked it but they tried to make some changes and the server now hangs frequently. Archaic architecture is blamed for the difficulty in finding the problem. Rejected. ######### # BATCH # ######### LINFIX status : less LOG/2008-12/cedar_phy_linfixmcnear.log Files were picked up Thu Dec 25 18:38:25 CST 2008 There are 7 each sntp/mrnt linfix files left in mcnearcat n11011001_0009_L010185N_D00.mrnt.cedar_phy_linfix.0.root n11011001_0009_L010185N_D00.sntp.cedar_phy_linfix.0.root n11011015_0002_L010185N_D00.mrnt.cedar_phy_linfix.0.root n11011015_0002_L010185N_D00.sntp.cedar_phy_linfix.0.root n13011112_0010_L010185N_D00.mrnt.cedar_phy_linfix.0.root n13011112_0010_L010185N_D00.sntp.cedar_phy_linfix.0.root n13011318_0000_L010185N_D00.mrnt.cedar_phy_linfix.0.root n13011318_0000_L010185N_D00.sntp.cedar_phy_linfix.0.root n13011493_0009_L010185N_D00.mrnt.cedar_phy_linfix.0.root n13011493_0009_L010185N_D00.sntp.cedar_phy_linfix.0.root n13011494_0010_L010185N_D00.mrnt.cedar_phy_linfix.0.root n13011494_0010_L010185N_D00.sntp.cedar_phy_linfix.0.root n13012017_0003_L010185N_D00.mrnt.cedar_phy_linfix.0.root n13012017_0003_L010185N_D00.sntp.cedar_phy_linfix.0.root These are all marked as bad : n11011001_0009_L010185N_D00.0 136 2008-10-29 21:06:51 fcdfcaf1581 n11011001_0009_L010185N_D00.0 136 2008-10-29 21:06:51 fcdfcaf1581 n11011015_0002_L010185N_D00.0 136 2008-10-29 21:02:37 fcdfcaf1508 n11011015_0002_L010185N_D00.0 136 2008-10-29 21:02:37 fcdfcaf1508 n13011112_0010_L010185N_D00.0 136 2008-10-30 07:51:53 fcdfcaf1599 n13011112_0010_L010185N_D00.0 136 2008-10-30 07:51:53 fcdfcaf1599 n13011318_0000_L010185N_D00.0 136 2008-11-01 16:21:35 fcdfcaf1614 n13011318_0000_L010185N_D00.0 136 2008-11-01 16:21:35 fcdfcaf1614 n13011493_0009_L010185N_D00.0 136 2008-11-04 04:53:12 fcdfcaf1664 n13011493_0009_L010185N_D00.0 136 2008-11-04 04:53:12 fcdfcaf1664 n13011494_0010_L010185N_D00.0 136 2008-11-04 07:35:02 fcdfcaf1613 n13011494_0010_L010185N_D00.0 136 2008-11-04 07:35:02 fcdfcaf1613 n13012017_0003_L010185N_D00.0 136 2008-12-09 15:46:47 fnpc263 n13012017_0003_L010185N_D00.0 136 2008-12-09 15:46:47 fnpc263 The HAVE messages show the following missing subruns run missing n11011001 - 0000 n11011015 - 0000 0001 n13011112 - complete n13011318 - 0010 n13011493 - complete n13011494 - complete n13012017 - complete I have taken the liberty of editing the bad_runs file : RUBIN> cd /minos/data/minfarm/lists RUBIN> cp -a bad_runs_mc.cedar_phy_linfix bad_runs_mc.cedar_phy_linfix.20081211 RUBIN> nedit bad_runs_mc.cedar_phy_linfix ####### # NET # ####### Date: Tue, 06 Jan 2009 13:40:09 -0600 From: Rick Finnegan Thursday, January 15, 2008  6:00pm - 7:00pm 
Upgrade S-S-WH8W-5 network switch chassis Minos/Numi nodes connected : wh12whp800c wh12w-hp4200 wh12w-xerox8400 numi-koizumilt numi-92582 numi-94790 numi-lucaspc numi-plunkettpc ######### # ADMIN # ######### This shows status of Minos and other systems ( Ganglia up/down ) catecgorized by date of purchase. http://d0om.fnal.gov/d0admin/faultlog/ ######### # MYSQL # ######### dbarchive - adding -X option to archive non-offline/crl tables ######### # MYSQL # ######### Mysql> mysqladmin -u root processlist | grep crlweb | wc -l 30 ######### # ADMIN # ######### MINOS01 > setup systools MINOS01 > cmd add_minos_user lueking Creating account... /var/yp gmake[1]: Entering directory `/var/yp/minos' gmake[1]: `ypservers' is up to date. gmake[1]: Leaving directory `/var/yp/minos' gmake[1]: Entering directory `/var/yp/minos' Updating passwd.byname... Updating passwd.byuid... Updating netid.byname... gmake[1]: Leaving directory `/var/yp/minos' Adding user to Minos AFS group... Sending mail to subscribe to minos-user mailing list ... Sending email to user... ######## # MAIL # ######## minos-shifters mail list To cut spam, but allow mail from stk-users, changed configuration from Send= Private to Local=fnal.gov,*.fnal.gov Service=Local Send=Service ########### # MONTHLY # ########### DATASETS 1/6 PREDATOR 1/6 VAULT 1/3 MYSQL 1/6 rm -r /data/archive/COPY/20081114 scripts/dbarchive Tue Jan 6 12:03:37 CST 2009 FINISHED DBARCHIVES Tue Jan 6 15:14:55 CST 2009 ============================================================================= 2009 01 05 ============================================================================= ########## # DCACHE # ########## id=164 - use dc_check or srmls to check disk copies MINOS26 > dc_check dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015411_0008.mdaq.root Check passed for file "dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015411_0008.mdaq.root" dc_check does dccp -P -t -1 $* Reviewed http://www-dcache.desy.de/manuals/dccp.html Need a file which is not on disk : /pnfs/minos/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root ./dc_stat /pnfs/minos/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root MINOS26 > time dc_check dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root dc_stage fail : File not cached System error: Resource temporarily unavailable Check FAILED for file "dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root" real 0m0.382s user 0m0.004s sys 0m0.020s MINOS26 > time dccp -P -t -1 dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root dc_stage fail : File not cached System error: Resource temporarily unavailable real 0m0.804s user 0m0.001s sys 0m0.010s real 0m0.192s user 0m0.001s sys 0m0.010s Time this for FILES=`ls /pnfs/minos/neardet_data/2009-01` MINOS26 > printf "${FILES}\n" | wc -l 181 DCP=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01 time for FILE in ${FILES} ; do dccp -P -t -1 ${DCP}/${FILE} ; done real 0m41.377s user 0m0.172s sys 0m1.557s Rate is 4.4 files/second Check using Layer 2, MINOS26 > time ./stage -n neardet_data/2009-01 Staging files from /pnfs/minos/neardet_data/2009-01 Prestaging 183 files ................... Needed 183/183 STARTED Wed Jan 7 11:56:52 CST 2009 FINISHED Wed Jan 7 11:56:59 CST 2009 Rate is 23 files/second. 
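Since the point of this exercise (id=164) is a fast check of which files have disk
copies, the timing loop above extends naturally into a cached/uncached tally. A sketch
only, assuming dccp -P -t -1 exits nonzero when dc_stage fails ( dc_check, which wraps
the same command, does report Check FAILED in that case ) :

DCP=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01
FILES=`ls /pnfs/minos/neardet_data/2009-01`
CACHED=0 ; TOTAL=0
for FILE in ${FILES} ; do
  TOTAL=`expr ${TOTAL} + 1`
  dccp -P -t -1 ${DCP}/${FILE} > /dev/null 2>&1 && CACHED=`expr ${CACHED} + 1`
done
echo "Cached ${CACHED} of ${TOTAL} files in neardet_data/2009-01"

At the 4.4 files/second rate measured above this is fine for one month of raw data,
but the Layer 2 based stage -n check remains much faster for bulk scans.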
Testing srmls 15 files in beam_data/2004-12 time srmls real 0m4.313s user 0m7.715s sys 0m0.263s time srmls -l real 0m9.623s user 0m8.063s sys 0m0.276s 181 files in neardet_data/2009-01 time srmls real 0m6.865s user 0m10.177s sys 0m0.266s time srmls -l real 1m9.835s user 0m11.597s sys 0m0.368s Perhaps disk status is given by access latency:NEARLINE locality:ONLINE_AND_NEARLINE Test non-local file SPATH2z=${S2MINOS}/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root access latency:NEARLINE locality:NEARLINE real 0m4.758s user 0m6.214s sys 0m0.235s Now test a larger directory, with over 1K files : SPATH2B=${S2MINOS}/fardet_data/2008-12 time { srmls ${SPATH2B} | tee /tmp/bigls ; } real 0m24.986s user 0m16.601s sys 0m0.483s $ time srmls -l ${SPATH2B} real 6m9.159s user 0m18.488s sys 0m0.557s Rates are 40 files/second for the list, 2.7 files/second for the full list ######### # EMAIL # ######### minos-shifters - allow email from stk admins ? ########## # DCACHE # ########## Data logging failing since 02:06, picked up 09:51 ftplog gap 02:23 to 09:41 Gap in pagedcache ftp transferf 02:15 to 09:38 PREDATOR genpy large interval 07:09 to 15:39 ___________________________________________ Date: Mon, 05 Jan 2009 10:48:37 -0600 (CST) Subject: HelpDesk ticket 126954 ___________________________________________ Ticket #: 126954 ___________________________________________ Short Description: FNDCA ftp transfers failing since Problem Description: FTP transfers from DCache seem to have failed from about 02:30 to 09:30 today, 2008 Jan 05. This includes password access FTP reads and kerberized FTP writes . Was there a known outage ? ___________________________________________ Date: Mon, 05 Jan 2009 11:43:05 -0600 (CST) Solution: jhendry@fnal.gov sent this solution: Hi Art, Yes there was a problem which has already been resolved. I clicked the d0 box instead of stken when I made the initial announcement so that may be why you did not see it. However when I sent the resolution announcement I also clicked the stken box on that web page form. Its confusing as on the announcment web page form the boxes are on the left of each instance whereas on other forms we use they are on the left of each instance. Sorry for any confusion. Date: Mon, 05 Jan 2009 10:17:13 -0600 The stken pnfs server matter has been resolved. Please report any further problems. Thanks, John Hendry SSA Primary ___________________________________________ ___________________________________________ ___________________________________________ This ticket is assigned to NAYMOLA, STAN of the CD-SF/DMS/DSC/SSA. Date: Mon, 05 Jan 2009 10:54:55 -0600 (CST) Subject: HelpDesk ticket 126958 ___________________________________________ Ticket #: 126958 ___________________________________________ Short Description: FNDCA Recent FTP Transfers web page is missing entries Problem Description: The only entries in the Recent FTP Transfers web page are for pagedcache(5744.6209). http://fndca3a.fnal.gov/cgi-bin/dcache_files.py There have been many other recent transfers from other accounts, but these are not showing up on the web page. ___________________________________________ This ticket is assigned to NAYMOLA, STAN of the CD-SF/DMS/DSC/SSA. ____________________________________________ Date: Mon, 05 Jan 2009 12:48:21 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: Opened bugzilla 186 for dcache developers. 
___________________________________________ Date: Mon, 05 Jan 2009 12:54:52 -0600 (CST) This ticket has been reassigned to HENDRY, JOHN of the CD-SF/DMS/DSC/SSA Group. __________________________________________ The FTP transfers page seems to be up to date today. I presume that something was done to address the previous problem. If that is the case, this ticket can be closed. Thanks ! __________________________________________ Date: Mon, 12 Jan 2009 17:27:59 +0000 (GMT) As of 11:24 today, the FTP transfers page only shows pagedcache items, Times from 2009-01-11 11:15:08 to 2009-01-12 11:15:33 No entries for any other users. So please continue to investigate. __________________________________________ Date: Wed, 14 Jan 2009 17:56:56 +0000 (GMT) The FTP transfers page is still incomplete, showing only recent transfers by pagedcache, none of the 'buckley' transfers of Minos raw data. Is there any progress on bringing it back to life ? __________________________________________ Date: Wed, 14 Jan 2009 13:02:11 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: Dmitry believes he has solved this issue. I will await your OK prior to closing this ticket. Comment #4 from Dmitry Litvintsev 2009-01-14 12:33:33 The pages shows up to date info (I updated manually). The cause of files not being copied to the right destination is being investigated. Comment #5 from Dmitry Litvintsev 2009-01-1412:42:13 lock file discovered and removed. Page is up to date. Lock file dated Jan 09. __________________________________________ Date: Wed, 14 Jan 2009 19:53:47 +0000 (GMT) From: Arthur Kreymer The web page is up to date again, after the clearing of a lock file by Dmitri. You can close this ticket. Thanks ! __________________________________________ Date: Wed, 14 Jan 2009 14:21:55 -0600 (CST) Originator concurs issue has been resolved with lock file removal by Dmitry. This ticket was resolved by HENDRY, JOHN of the CD-SF/DMS/DSC/SSA group. __________________________________________ __________________________________________ ####### # CDF # ####### no cvs backups since 22 Dec, low disk space. ######### # ADMIN # ######### jmusser/musser password reset needed MINOS26 > finger musser@fnal.fnal.gov OpenLDAP Finger Service... Search failed to find anything. MINOS26 > finger jmusser@fnal.fnal.gov OpenLDAP Finger Service... 1 exact match found for "jmusser": "jmusser, People" Users Name: james mussser User ID: jmusser E-Mail forwarded to: jmusser@indiana.edu Suggested that he call the helpdesk for a password reset. ######### # DOCDB # ######### Updated ticket 126747, yes do get the developers involved. This is resolved ! See the ticket below. Informed m_s_d ########## # CONDOR # ########## condor admins schedule - Jan 9 ############ # minverva # ############ Lee Leuking joins as Minerva liaison ########### # ENSTORE # ########### Restarted 29 Dec 15:30 through 17:13 ########### # ENSTORE # ########### 9940 and 9940B library managers down Tue 30 Dec 04:00 through 11:33 ######### # ADMIN # ######### 126613 ncurses.i386 installed on minos-mysql2 ######### # MYSQL # ######### beam dbu processes hung Tue, 30 Dec 2008 15:59:18 Date: Mon, 05 Jan 2009 19:09:33 +0000 (GMT) From: Arthur Kreymer The database server Ganglia monitoring shows a low load average, and very little network activity at this time, as viewed at http://rexganglia2.fnal.gov/minos/?r=week&c=MINOS+Server&h=minos-mysql1. 
fnal.gov Dan Cherdack did have about 250 grid jobs running at that time, There are a lot of connections right now from grid nodes, with long open connections to temp, like : | Id | User | Host | db | Command |Time | 194405069 | reader_old | fncdf279.fnal.gov:51647 | temp | Sleep |9491 MINOS25 > condor_q -run | grep -v gfactory | grep fncdf279 254128.10 cherdack 1/5 10:16 0+02:51:30 glidein_28550@fncdf279.fnal.gov Dan, we need to see what is going wrong with your jobs. ########## # TRAVEL # ########## signed up for cambridge meeting, http://www.hep.phy.cam.ac.uk/~thomson/meetings/collabmtg2009/ ########## # ORACLE # ########## Date: Fri, 02 Jan 2009 15:30:36 -0600 From: Maurine Mihalek To: Arthur Kreymer Cc: dsg-entire group , minosdb-support@fnal.gov, csi unix group , minos_software_discussion@fnal.gov, minos-data@fnal.gov Subject: new kernel for minosora1/minosora3 there is a new linux kernel that needs to be made effective on minosora1 and minosora3. i would like to do minosora3 on wednesday (1/7/2009) and reboot around 8:30 am. minosora3 should be up by 9 am. minosora3 has been up for 231 days. for minosora1, i would like to upgrade the kernel and reboot on tuesday (1/13) morning at 8 am. minosora1 has been up for 192 days. are these days and times acceptable? maurine ___________________________________________________________________ You can do minosora3 any time, just let us know when it is done. Since the January quarterly patches are coming out soon, I would rather combine the minosora1 kernel update with those patches, to minimize service interruptions. Note - please do not cc: minos_software_discussion or minos-data. Those lists are not related to database support. ########## # CONDOR # ########## Date: Sun, 04 Jan 2009 00:38:55 -0600 (CST) From: Jeff K deJong I seem to be having some trouble getting my jobs to run on Condor, they ran for me at the start of december and now I run my scripts again and my jobs are just sitting idle in the queue, while other jobs submited after I submitted mine run and complete OK. I've run a condor analyze command ... 253380.034: Run analysis summary. Of 145 machines, 39 are rejected by your job's requirements 106 reject your job because of their own requirements 0 match but are serving users with a better priority in the pool 0 match but reject the job for unknown reasons 0 match but will not currently preempt their existing job 0 are available to run your job The Requirements expression for your job is: ( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) && ( target.HasFileTransfer ) Condition Machines Matched Suggestion --------- ---------------- ---------- 1 ( target.Arch == "X86_64" ) 106 2 ( target.OpSys == "LINUX" ) 145 3 ( target.Disk >= 3 ) 145 4 ( ( 1024 * target.Memory ) >= 3 ) 145 5 ( target.HasFileTransfer ) 145 The following attributes are missing from the job ClassAd: RunOnGrid x509userproxysubject Reply : The file suggests that you are requiring X86_64 ( 64 bit kernel ) while submitting to the local Minos Cluster. We do not have any such nodes. 
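One way to confirm and relax that requirement, sketched below; the submit file name is hypothetical, and the job id is the one analyzed above.

  # Confirm the 64-bit clause in the job ClassAd
  condor_q -l 253380.034 | grep '^Requirements'
  # In the submit file, target the 32-bit local nodes instead, e.g.
  #   requirements = ( Arch == "INTEL" ) && ( OpSys == "LINUX" )
  # then resubmit ( file name hypothetical )
  condor_submit myjobs.run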
For local cluster submission, see for example /minos/scratch/kreymer/condor/probe/probe.run ============================================================================= 2008 12 26 ============================================================================= KREYMER IS ON VACATION UNTIL 2009 JANUARY 5 nodified minos-admin,minos_batch,minos-data ============================================================================= 2008 12 25 ============================================================================= ON SHIFT 00:00 - 07:00 ######## # FARM # ######## To : Howard Rubin Cc : Alexandre Sousa , minos-data@fnal.gov Attchmnt: Subject : Re: linfix reprocessing ----- Message Text ----- On Wed, 24 Dec 2008, Howard Rubin wrote: > The linfix reprocessing is complete. The latest file in /minos/data/minfarm/mcnearcat is dated 9 Dec. But I see 89 each sntp and mrnt files in mcnear ( without the cat ), 15 from Dec 9 and the rest from Dec 24. Should these be shifted to mcnearcat ? ######## # DATA # ######## On Wed, 24 Dec 2008, Robert Hatcher wrote: > On Dec 24, 2008, at 8:26 AM, "Musser, James A." wrote: > > Robert: >  Could you copy > > /minos/scratch/petyt/FDfiles_2008/.bntp/all_events_cphy_bfld.root > > to someplace I can access it in afs space, if it still exists?  > Sorry, I have lost the capability of logging on to the minos cluster. cp /minos/scratch/petyt/FDfiles_2008/.bntp/all_events_cphy_bfld.root \ /afs/fnal.gov/files/data/minos/d13/musser/all_events_cphy_bfld.root Done, Merry Christmas ! ######### # DOCDB # ######### Date: Wed, 24 Dec 2008 06:25:09 -0600 (CST) From: HelpDesk Subject: HelpDesk ticket 126747 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126747 ___________________________________________ Short Description: Minos DocDB pages are extremely slow to load Problem Description: Several Minos DocDB pages are extremely slow to load. This is true with a variety of browsers, and both Linux and XP clients. Therefore this is probably a server side problem. Specific examples : 15 seconds to load https://minos-docdb.fnal.gov:440/cgi-bin/ShowCalendar 19 seconds to load http://minos-docdb.fnal.gov:8080/cgi-bin/DisplayMeeting?sessionid=1307 ___________________________________________ This ticket is assigned to TECKENBROCK, MARCIA of the CD-CDO/CO. ___________________________________________ Date: Sat, 27 Dec 2008 12:15:23 -0600 Dear MINOS DocDB Admins, Please see the complaint below. This issue was brought to our attention prior to the upgrade, but we felt we had obtained a manageable speed. Of course, instances with more events will have a slower load time. Do you feel the 19 second load time for the events page is serious enough to get the developer involved? Thank you. -Marcia ___________________________________________ Yes, I think this is too slow, given that these pages formerly loaded in a fraction of a second. Something is clearly wrong. There is nothing being done that should take anything like this long. This is well worth getting the developers involved. For example, it takes 15 seconds to load a meeting containing a single talk and a single document : https://minos-docdb.fnal.gov:440/cgi-bin/DisplayMeeting?sessionid=1264 ___________________________________________ Date: Mon, 05 Jan 2009 17:14:16 -0600 (CST) Solution: E. Vaandering has fixed the problem and notes we should let him know if we notice other speed issues. 
The MINOS DocDB instance has been upgraded and testing complete. ___________________________________________ Date: Mon, 05 Jan 2009 23:41:49 +0000 (GMT) Thanks ! This seems to have restored access to full speed. I had to Clear Private Data/Authenticated Sessions under Firefox in order to restore access to the certificate-protected pages. Some large Calendar pages are still a bit slow, but tolerable. 5 seconds for https://minos-docdb.fnal.gov:440/cgi-bin/ShowCalendar?year=2008&month=12 ___________________________________________ The DocDB developers ( Eric Vaandering ) have corrected the problems that led to slow loading of our Calendars and Meeting pages. You may need to Clear Private Data/Authenticated Sessions under Firefox in order to regain certificated based access to internal pages. ___________________________________________ Date: Tue, 06 Jan 2009 10:24:38 -0600 From: Eric Vaandering The ShowCalendar function should be nearly instantaneous again. It's not on my test setup, but I have a thousand or so events on a single day, so that takes some time to write out. :-) This is in stable/8.7.5 ___________________________________________ Date: Tue, 06 Jan 2009 17:28:46 +0000 (GMT) Thanks ! I am not seeing the speedup in Minos DocDB yet, but I expect that the new code has not been deployed for us. ___________________________________________ Date: Tue, 06 Jan 2009 11:30:41 -0600 From: Eric Vaandering Probably not, but you can always check the version number in the lower left corner of a page. ___________________________________________ Date: Tue, 6 Jan 2009 17:32:53 +0000 (GMT) Thanks ! As expected, we are at 8.7.4 . ___________________________________________ Date: Tue, 06 Jan 2009 12:01:10 -0600 Nope. I will do this in just a few minutes. ___________________________________________ Date: Tue, 06 Jan 2009 12:21:15 -0600 From: Marcia Teckenbrock I've upgraded the code, but the December calendar is still 7-8 seconds for me. It has quite a few events, though, so I don't think this is unreasonable. The January calendar loads in 2 seconds. ___________________________________________ Date: Tue, 06 Jan 2009 18:25:37 +0000 (GMT) Thanks ! That is odd. The speed loading the December calendar is slower than before, 7 seconds under 8.7.4 8 seconds under 8.7.5. https://minos-docdb.fnal.gov:440/cgi-bin/ShowCalendar?year=2008&month=12 DocDB Version 8.7.5, contact Document Database Administrators Execution time: 8 wallclock secs ( 4.02 usr + 3.89 sys = 7.91 CPU) ___________________________________________ Date: Tue, 06 Jan 2009 12:45:02 -0600 From: Eric Vaandering That is weird, because in my test setup which has a day with hundreds of events, it takes 4 seconds using my laptop as a server. So let me take a look again. The difference may be that I don't have sessions and talks in those events, so it may be slowing down somewhere else. It's also possible it is unavoidable. ___________________________________________ Date: Tue, 06 Jan 2009 19:50:21 -0600 From: Eric Vaandering Marcia, can you update again? I didn't bump the version number yet, but I *think* I've got the time consuming calls taken out of this without side effects. Some other things still seem as if they might be slower than I would like. Perhaps we can find an evening to turn on the debugging so I can see what is happening live on the Minos DocDB. (Debugging will just add a bunch of text output to the end of the page, so DocDB is still fully usable, but doesn't look quite as nice.) 
___________________________________________ Date: Wed, 07 Jan 2009 18:02:23 -0600 From: Marcia Teckenbrock Sorry, I ran out of time today, but I will do this first thing tomorrow. ___________________________________________ Date: Thu, 08 Jan 2009 10:25:36 -0600 From: Marcia Teckenbrock This seems to have done the trick. Art, would you please take a look? ___________________________________________ Date: Thu, 08 Jan 2009 18:17:59 +0000 (GMT) From: Arthur Kreymer I see no change at present : https://minos-docdb.fnal.gov:440/cgi-bin/ShowCalendar?year=2008&month=12 DocDB Version 8.7.5, contact Document Database Administrators Execution time: 6 wallclock secs ( 2.51 usr + 2.84 sys = 5.35 CPU) ___________________________________________ Date: Thu, 08 Jan 2009 12:52:29 -0600 From: Marcia Teckenbrock Hmm. There's something about the December calendar, because the change caused the current month's calendar display time to be halved. ___________________________________________ December had some heavy sessions, due to the Collaboration meeting. October and November were more normal, with no such meeting, but they are also loading more slowly than you might expect : October Execution time: 7 wallclock secs ( 2.96 usr + 3.09 sys = 6.05 CPU) November Execution time: 4 wallclock secs ( 1.93 usr + 1.74 sys = 3.67 CPU) February, with no talks, does load very quickly Execution time: 0 wallclock secs ( 0.37 usr + 0.06 sys = 0.43 CPU) ___________________________________________ Date: Thu, 08 Jan 2009 13:32:30 -0600 From: Marcia Teckenbrock To: Eric Vaandering Cc: Art Kreymer Subject: Re: Help Desk Ticket 126747 Has Been Resolved. Yes, I am willing to work on this next week, but the earliest I can do it is Wednesday. ___________________________________________ Date: Thu, 08 Jan 2009 13:45:51 -0600 From: Eric Vaandering Ok. On the other hand, if the problem is the sessions and talks, then it *should* be no different than it was several months ago. But we'll look at this. Marcia, Wednesday is fine for me and I will try to reproduce something like a "normal" MINOS month in the meantime and see what improvements can be made. I think MINOS is using this part of DocDB more heavily than it was ever tested before so I'm not too surprised things might be popping up. ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 12 24 ============================================================================= ON SHIFT 00:00 - 07:00 ######### # DOCDB # ######### Ticket submitted ============================================================================= 2008 12 23 ============================================================================= ON SHIFT 00:00 - 07:00 ######### # DOCDB # ######### Display of the calendar remains slow, as noted below 19 seconds for http://minos-docdb.fnal.gov:8080/cgi-bin/DisplayMeeting?sessionid=1307 ============================================================================= 2008 12 22 ============================================================================= ON SHIFT 00:00 - 07:00 ######### # DOCDB # ######### Host cert update schedule 12:00 to 12:20 If display of calendar is still slow, report this. 
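For the record, a rough client-side timing check, as a sketch: the query string is borrowed from the December calendar link above, the plain-http :8080 URL avoids the certificate prompt, and this measures the full page fetch rather than the server's 'Execution time' footer.

  for I in 1 2 3 ; do
    curl -s -o /dev/null -w '%{time_total}s\n' \
      'http://minos-docdb.fnal.gov:8080/cgi-bin/ShowCalendar?year=2008&month=12'
  done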
Presently, takes 12 seconds to display https://minos-docdb.fnal.gov:440/cgi-bin/ShowCalendar and http://minos-docdb.fnal.gov:8080/cgi-bin/ShowCalendar Meetings can also take a long time, 14 seconds for http://minos-docdb.fnal.gov:8080/cgi-bin/DisplayMeeting?sessionid=1307 ============================================================================= 2008 12 20 Sun ============================================================================= ######## # SPAM # ######## to minos-docdb minos_software_discussion Need to let ssa-group post to minos-shifters ============================================================================= 2008 12 20 Sat ============================================================================= Changed Send= Public to Send= Private for minos-docdb minos_sam_users minoscrl_admin and Send= Owner for minos_comp minos_linux_users numi-pc-users ============================================================================= 2008 12 19 ============================================================================= ######### # MYSQL # ######### dbarchive.20081219 Draft dbarchive supports -I to copy indexes, and do just the offline database, nothing else And self logging to ${DBCOPY}/archive.log Mysql> ./scripts/dbarchive -I PARSING ARGS Archiving OFFLINE Fri Dec 19 18:03:26 CST 2008 68683 . Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 172G 47G 79% /data /minos/data/mysql/archive/20081219/offline/offline.log FINISHED DBARCHIVES Fri Dec 19 19:29:06 CST 2008 [1]+ Done nedit scripts/dbarchive ########## # DCACHE # ########## rubin reported srmcp failure, cp /minos/data2/minfarm/farmtest/mclogs/dogwoodtest4/near/daikon_04/L010185N/706/n13037064_0009_L010185N_D04.0.dogwoodtest4.log.gz /var/tmp/dogtest.gz gunzip /var/tmp/dogtest.gz MINOS26 > ./dccptest n13037064_0009_L010185N_D04.reroot.root PORT 24136 Connected in 0.00s. [Fri Dec 19 13:06:15 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_04/L010185N/706/n13037064_0009_L010185N_D04.reroot.root in cache. Cache open succeeded in 99.57s. 355257111 bytes in 6 seconds (57821.80 KB/sec) Tested an srmcp , with upgraded srmtest.20081219, looks OK. Date: Fri, 19 Dec 2008 22:15:20 +0000 (GMT) From: Arthur Kreymer The log indicated that the copy first started at Thu Dec 18 13:00:36, while the Enstore robots were still down. Perhaps srm/Dcache failed to properly execute the retries ? 
I can dccp and srmcp the file now : n13037064_0009_L010185N_D04.reroot.root ######## # MAIL # ######## minos-shifters mail list To cut spam, changed configuration Send= Public to Send= Private And added Owner= zwaska ####### # CVS # ####### Created new admin package cd /local/scratch26/kreymer/trel/testrel mkdir admin cd admin touch .cvsignore MINOS26 > cvs import minossoft/admin kreymer start aklog: Couldn't get fnal.gov AFS tickets: aklog: Incorrect net address while getting AFS tickets nedit: the current locale is utf8 (en_US.UTF-8) nedit: changed locale to non-utf8 (en_US) N minossoft/admin/.cvsignore No conflicts created by this import MINOS26 > date Fri Dec 19 13:13:28 CST 2008 cd /local/scratch26/kreymer/trel/CVSROOT cvs update nedit modules cvs commit -m 'Added admin for scripts and HOWTOs' modules nedit check_access cvs commit -m 'Added admin module' check_access ######### # MYSQL # ######### Added older HOWTO's to admin/mysql addpkg -h admin cd admin mkdir mysql cvs add -m 'mysql database' mysql cd mysql cp ~/minos/HOWTO.dbarchive.20051021 HOWTO.dbarchive cvs add HOWTO.dbarchive cvs commit -m 'HOWTO.dbarchive.20051021' HOWTO.dbarchive cp ~/minos/HOWTO.dbarchive.20070403 HOWTO.dbarchive cvs commit -m "HOWTO.dbarchive.20070403" HOWTO.dbarchive for DAT in 20070703 20070705 20080115 20080409 20080804 20081014 ; do cp ~/minos/HOWTO.dbarchive.${DAT} HOWTO.dbarchive cvs commit -m "HOWTO.dbarchive.${DAT}" HOWTO.dbarchive done Also added dbarchive script cd /minos/scratch/kreymer/admin ######### # DOCDB # ######### To minos_all DocDB will be down next Monday 22 Dec at noon, for 20 minutes. There is no Minos meeting conflict, we have none scheduled next week. ---------- Forwarded message ---------- Date: Fri, 19 Dec 2008 12:53:20 -0600 From: Marcia Teckenbrock To: cd-docdb-users@fnal.gov Subject: Re: DocDB Outage on Monday, December 22nd, Noon Hi All, Just want you to know we ARE going forward with the outage on Monday at Noon. Thank you for your prompt responses, and happy holidays! -Marcia Marcia Teckenbrock wrote: > Dear DocDB Users, > > The system administrators for the DocDB machine would like to schedule an > approximately 20 minute outage on Monday, December 22nd at Noon. The purpose > of the outage is to install the new ssl certificate on the server. > > Before we schedule, I just want to make sure this will not interfere with > your activities. Please let me know ASAP. Thanks, > > -Marcia > marcia@fnal.gov ####### # NET # ####### netdown email , Tuesday, December 23, 2008 6:00am - 7:00am Upgrade S-S-WH8W-9 network switch Scattered locations in Wilson Hall on the following VLANS. Not all users on these VLANS will be down - only those connected to switch #9. VLAN 18 - Beams-WH VLAN 19 - Dir VLAN 27 - BSS VLAN 31 - LSS VLAN 55 - PPD VLAN 92 - Conference Note that the Control Room is on subnet 55. 
But probably not routing through S-S-WH8W-9 minos-rc is connected to s-s-wh8w-7 on port Fa3/30 minos-acnet same minos-om same minos-evd is connected to s-s-wh8w-7 on port Fa3/31 ########## # MYSQL2 # ########## Continuing tests, see LOG.mysql2 ============================================================================= 2008 12 18 ============================================================================= ########## # MYSQL2 # ########## Date: Thu, 18 Dec 2008 17:19:25 -0600 (CST) Subject: HelpDesk ticket 126613 ___________________________________________ Ticket #: 126613 ___________________________________________ Short Description: minos-mysql2 needs compatibility libncurses.so.5 for mysql Problem Description: Some of the 32 bit mysql programs on minos-mysql2 need libncurses.so.5 . Please find out which compability rpm's are needed, and install them, on all of our new SLF 4.7 systems ( mysql2, sam04, minos25, minos27 ) This is not urgent, I've copied the library from minos-mysql1 and put it in my path for testing purposes. ___________________________________________ Date: Fri, 19 Dec 2008 08:23:13 -0600 (CST) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 30 Dec 2008 09:36:47 -0600 (CST) Solution: schmitz@fnal.gov sent this solution: installed ncurses.i386 ___________________________________________ ___________________________________________ ########## # MYSQL2 # ########## Tests needed for commissioning : Make all this into a formal deployment and support plan. OS - find missing .so files OS - measure and boost open file limits as appropriate OS - set time zone to UTC ? DB - copy mysql table from mysql1, to get accounts DB - Time mysqldump for backups 3 minutes/GB for production table, to bluearc DB - Time mysql_upgrade 2h 35m for offline database DB - set connection limits consistent with OS file limits and capacity test performance vs connections DB - Test recovery from data file, with index rebuild time from mysqldump files adding binlogs to base restore DB - Test full database snapshot ( backup plus indexes ) DB - Copy all but offline and crl databases from mysql1 to mysql2 DB - Test defragmentation MINOS - Test dbmauto - Nick ######## # DATA # ######## Date: Thu, 18 Dec 2008 13:26:27 -0600 From: George Szmuksta The enstore outage is over. All enstore libraries are available for work. ######### # ADMIN # ######### Date: Thu, 18 Dec 2008 12:35:11 -0600 (CST) Subject: HelpDesk ticket 126575 ___________________________________________ Ticket #: 126575 ___________________________________________ Short Description: sshd not responding on minos21 Problem Description: I cannot log into minos21 with ssh. But kerberized rsh is working. Please restart the sshd. MIN > ssh minos21 ssh_exchange_identification: Connection closed by remote host MIN > date Thu Dec 18 18:31:28 GMT 2008 ___________________________________________ This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. 
Solution: schmitz@fnal.gov sent this solution: restarted sshd ___________________________________________ Date: Thu, 18 Dec 2008 13:11:25 -0600 (CST) ___________________________________________ ######### # ADMIN # ######### Date: Thu, 18 Dec 2008 12:35:12 -0600 (CST) Subject: HelpDesk ticket 126576 ___________________________________________ Ticket #: 126576 ___________________________________________ Short Description: minos26 mount of /grid/data Problem Description: The /grid/data file handle is stale on minos26 : MINOS26 > ls -ld /grid/data ls: /grid/data: Stale NFS file handle MINOS26 > date Thu Dec 18 12:32:43 CST 2008 ___________________________________________ This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 18 Dec 2008 13:13:30 -0600 (CST) Solution: schmitz@fnal.gov sent this solution: cleared stale mount and remounted ___________________________________________ ___________________________________________ ########## # CONDOR # ########## To : Nicholas Devenish Cc : "Ryan B. Patterson" , minos-admin@fnal.gov, nickd@fnal.gov, xbhuang@fnal.gov, minos_software_discussion@fnal.gov Attchmnt: Subject : Re: minos25, /minos/data2? (fwd) ----- Message Text ----- On Thu, 18 Dec 2008, Nicholas Devenish wrote: > I've been noticing it for a couple of days - my mass jobs are submitting at a > rate of about 0.5 per second (horrendously slow). Thanks to detective work by Ryan, we have found the cause of recent Condor slowness and failures. User xbhuang has been running some grid jobs which have memory leaks. These eventually grow above 2 GBytes on the CDF worker nodes, crashing the glidein processes on the workers, and causing large overheads in our Condor scripts as they reconnect and eventually restart these jobs. I have just added a 1.8 GBytes memory limit to the paloon script, which should help prevent future global problems. Xiaobo - you need to debug your jobs to eliminate the memory leak before running more of these. I have removed all your existing jobs from Condor, some of which had grown to nearly 4 Gbytes in size.
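A quick way to spot such runaway jobs before they reach the glidein limit, sketched against the condor_q long output of that era (ImageSize is reported in KBytes):

  # Five largest image sizes among the user's queued/running jobs
  condor_q -l xbhuang | grep '^ImageSize ' | sort -t = -k 2 -n | tail -5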
########## # CONDOR # ########## My test glide jobs last completed at Dec 18 08:51 There are log files through Dec 18 10:20 logs/glide/probe.246186.0.out No further submissions, as of 10:50 Ganglia showa a load average that is low now, Load spike to 70 around 00:40 Load spike to 100 around 03:00 Networking showed sustained 6 MB/sec input 22:45 through 09:52 3 MB/sec output same time, with spike at end to 10 MB/sec condor_status looks OK, Name OpSys Arch State Activity LoadAv Mem ActvtyTime minos01.fnal.gov LINUX INTEL Claimed Busy 0.990 4053 0+03:46:01 minos02.fnal.gov LINUX INTEL Claimed Busy 0.990 4053 0+04:09:52 slot1@minos03.fnal LINUX INTEL Claimed Busy 2.690 2026 0+03:26:54 slot2@minos03.fnal LINUX INTEL Claimed Busy 2.020 2026 0+02:51:36 glidein_12452@fnpc LINUX X86_64 Claimed Busy 1.030 16053 0+08:04:32 monitor_12452@fnpc LINUX X86_64 Owner Idle 1.000 1605 0+08:05:13 Total Owner Claimed Unclaimed Matched Preempting Backfill INTEL/LINUX 39 0 39 0 0 0 0 X86_64/LINUX 58 29 29 0 0 0 0 Total 97 29 68 0 0 0 0 condor_q is failing, MINOS25 > condor_q -- Failed to fetch ads from: <131.225.193.25:63348> : minos25.fnal.gov 10:55 - condor_status has decayed, MINOS25 > condor_status Name OpSys Arch State Activity LoadAv Mem ActvtyTime slot1@minos04.fnal LINUX INTEL Claimed Busy 1.010 2026 0+03:43:15 slot2@minos04.fnal LINUX INTEL Claimed Busy 0.980 2026 0+03:58:16 slot1@minos09.fnal LINUX INTEL Claimed Busy 0.980 2026 0+02:35:41 slot2@minos09.fnal LINUX INTEL Claimed Busy 0.970 2026 0+02:29:29 slot1@minos15.fnal LINUX INTEL Claimed Busy 0.980 2026 0+04:09:00 slot2@minos15.fnal LINUX INTEL Claimed Busy 0.970 2026 0+04:34:05 slot1@minos18.fnal LINUX INTEL Claimed Busy 0.940 2026 0+04:24:05 slot2@minos18.fnal LINUX INTEL Claimed Busy 1.050 2026 0+03:14:04 slot1@minos19.fnal LINUX INTEL Claimed Busy 0.980 2026 0+04:18:02 slot2@minos19.fnal LINUX INTEL Claimed Busy 1.020 2026 0+04:13:27 slot1@minos20.fnal LINUX INTEL Claimed Busy 1.040 2026 0+04:19:11 slot2@minos20.fnal LINUX INTEL Claimed Busy 1.900 2026 0+03:14:34 slot1@minos21.fnal LINUX INTEL Claimed Busy 1.020 2026 0+03:14:20 slot2@minos21.fnal LINUX INTEL Claimed Busy 0.970 2026 0+04:24:23 Total Owner Claimed Unclaimed Matched Preempting Backfill INTEL/LINUX 14 0 14 0 0 0 0 Total 14 0 14 0 0 0 0 I see no full disks, MINOS25 > w 10:57:03 up 13 days, 2:33, 15 users, load average: 0.21, 0.44, 0.30 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT nickd pts/12 ndevenish-macboo 10:26 28:57 3.44s 0.00s /bin/sh /afs/fnal.gov/files/code/e875/general/condor/scripts/remote_wrappers/condor_submit Condor_FCSyst.run MINOS25 > ps -flu gfactory F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 4 Z gfactory 4293 20163 0 76 0 - 0 exit 00:24 ? 00:00:39 [condor_gridmana] 4 S gfactory 22056 22055 0 76 0 - 14047 - Dec17 pts/9 00:00:00 -bash MINOS25 > ps -flu gfrontend F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD Checking Condor log, 12/18 09:22:51 Return from HandleReq 12/18 09:27:39 Preen pid is 20970 12/18 09:27:39 DaemonCore: pid 20970 exited with status 0, invoking reaper 1 12/18 09:27:39 Child 20970 died, but not a daemon -- Ignored 12/18 09:27:39 DaemonCore: return from reaper for pid 20970 12/18 09:34:51 Calling HandleReq (0) ... 
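Once the echoed list above looks right, the same loop without the echo does the actual cleanup, using the -forcex flag from the dry run above (a sketch):

  PROCS=`condor_q xbhuang | grep xbhuang | cut -f 1 -d ' '`
  for PROC in ${PROCS} ; do
    condor_rm -forcex ${PROC}
    sleep 1
  done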
12/18 10:52:14 Calling Handler 12/18 10:52:14 ZKM: setting default map to rbpatter@fnal.gov 12/18 10:52:14 DaemonCore: Command received via TCP from rbpatter@fnal.gov from host <131.225.193.25:62745>, access level ADMINISTRATOR 12/18 10:52:14 DaemonCore: received command 454 (DAEMONS_OFF), calling handler (admin_command_handler) 12/18 10:52:14 Calling HandleReq (0) 12/18 10:52:14 Sent SIGTERM to COLLECTOR (pid 7451) 12/18 10:52:14 Sent SIGTERM to NEGOTIATOR (pid 7452) 12/18 10:52:14 Sent SIGTERM to SCHEDD (pid 20163) 12/18 10:52:14 Return from HandleReq 12/18 10:52:14 Return from Handler 12/18 10:52:14 DaemonCore: pid 7452 exited with status 0, invoking reaper 1 12/18 10:52:14 The NEGOTIATOR (pid 7452) exited with status 0 12/18 10:52:14 DaemonCore: return from reaper for pid 7452 12/18 10:52:53 DaemonCore: pid 7451 exited with status 0, invoking reaper 1 12/18 10:52:53 The COLLECTOR (pid 7451) exited with status 0 12/18 10:52:53 DaemonCore: return from reaper for pid 7451 11:02 condor_status is back to a full list, condor_q still hung condor_q still fails In Shadowlog, 12/18 04:27:50 (245787.13) (23786): attempt to connect to <131.225.211.155:48026> failed: Connection refused (connect errno = 111). fcdfcaf1507.fnal.gov grep attempted /local/stage1/condor/log/ShadowLog /minos/data/users/xbhuang/new_run3/log.245787.0 245869.99 PROCS=`condor_q xbhuang | grep xbhuang | cut -f 1 -d ' '` Need to clear the local side of some of these, PROCS=`condor_q xbhuang | grep xbhuang | cut -f 1 -d ' '` 11:56:30 for PROC in ${PROCS} ; do printf "${PROC} " ; echo condor_rm -forcex ${PROC} ; sleep 1 ; done ########## # PALOON # ########## Added 1.8 GBytes virtual memory limit, to stop future crashes as above, ulimit -v 1800000 cp paloon paloon.20081218 Tested this, cd /grid/fermiapp/minos/parrot ./paloon.20081218 Moved new version to production ln -sf paloon.20081218 paloon ######## # DATA # ######## Early fardet_data was not in monthly directories : grep fardet_data/F ../CFL/CFL | wc -l 326 MINOS26 > ls /pnfs/minos/fardet_data/2001-09 | wc -l 316 MINOS26 > ls /pnfs/minos/fardet_data/2001-10 | wc -l 55 OFILES=`grep fardet_data/F ../CFL/CFL | cut -f 8 -d ' ' | sort` printf "${OFILES}\n" | sort /pnfs/minos/fardet_data/F00000508_0000.mdaq.root /pnfs/minos/fardet_data/F00000535_0000.mdaq.root ... /pnfs/minos/fardet_data/F00000983_0000.mdaq.root /pnfs/minos/fardet_data/F00000985_0000.mdaq.root Where are the first and last files ? 
MINOS26 > ls /pnfs/minos/fardet_data/*/F00000508_0000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000508_0000.mdaq.root MINOS26 > ls /pnfs/minos/fardet_data/*/F00000985_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000985_0000.mdaq.root for FILE in ${OFILES} ; do FIL=`echo ${FILE} | cut -f 5 -d / | cut -f 1 -d '}'` ls /pnfs/minos/fardet_data/2001-*/${FIL} done Mostly in 2001-09, except /pnfs/minos/fardet_data/2001-10/F00000965_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000966_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000967_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000968_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000969_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000970_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000974_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000980_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000983_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000985_0000.mdaq.root Also, have a few three digit subruns : /pnfs/minos/fardet_data/2001-09/F00000570_000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000573_000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000574_000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000575_000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000576_000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000577_000.mdaq.root First, let's enmv everything to its current locations, for the CFL listing. setup encp v3_7d -q stken FILES09=`printf "${OFILES}\n" | head -316 | cut -f 5 -d /` for FILE in ${FILES09} ; do printf "${FILE}\n" enmv /pnfs/minos/fardet_data/2001-09/${FILE} \ /pnfs/minos/fardet_data/2001-09/${FILE} sleep 1 done ... F00000964_0000.mdaq.root FILES10=`printf "${OFILES}\n" | tail -10 | cut -f 5 -d /` for FILE in ${FILES10} ; do printf "${FILE}\n" enmv /pnfs/minos/fardet_data/2001-10/${FILE} \ /pnfs/minos/fardet_data/2001-10/${FILE} sleep 1 done Rename the short subruns to standard form SHORTS='570 573 574 575 576 577' Verified absence from CFL with some other subrun: for SHORT in ${SHORTS} ; do grep F00000${SHORT} ../CFL/CFL done minos fardet_data VO6876 0000_000000000_0002625 CDMS109626825400000 245916094 3230905666 /pnfs/minos/fardet_data/F00000570_000.mdaq.root minos fardet_data VO6876 0000_000000000_0002649 CDMS109626881500000 83623425 3755784777 /pnfs/minos/fardet_data/F00000573_000.mdaq.root minos fardet_data VO6876 0000_000000000_0002650 CDMS109626883600000 50080898 3771648130 /pnfs/minos/fardet_data/F00000574_000.mdaq.root minos fardet_data VO6876 0000_000000000_0002651 CDMS109626888400000 37155578 2198592335 /pnfs/minos/fardet_data/F00000575_000.mdaq.root minos fardet_data VO6876 0000_000000000_0002652 CDMS109626890700000 24823718 526838901 /pnfs/minos/fardet_data/F00000576_000.mdaq.root minos fardet_data VO6876 0000_000000000_0002653 CDMS109626892000000 12363247 542879600 /pnfs/minos/fardet_data/F00000577_000.mdaq.root for SHORT in ${SHORTS} ; do printf "F00000${SHORT}_000.mdaq.root\n" enmv /pnfs/minos/fardet_data/2001-09/F00000${SHORT}_000.mdaq.root \ /pnfs/minos/fardet_data/2001-09/F00000${SHORT}_0000.mdaq.root done TO DO - check CFL tomorrow ######### # DOCDB # ######### Registered Rustem for numirw ( and not all groups as requested ) Note that the email should be modified : If this is correct, please visit https://minos-docdb.fnal.gov:440/cgi-bin/EmailAdministerForm, select "Modify", select the user, check "Verify", and click to Submit. If the groups are not correct, select the correct groups before clicking Submit. 
Check the box to the left of 'User is Verified' Click on 'Modify Personal Account Got a request from Gary W. Smith gsmish@fnal.gov Not an author, asked whether this was a real request. ============================================================================= 2008 12 17 ============================================================================= ########## # SAMSUB # ########## samsub.20081217 Removed leftover diagnostic printouts. Somehow that had been used in production up to now. ln -sf samsub.20081217 samsub # was samsub.20081118 SRV1> cp -a AFSS/samsub.20081217 . SRV1> ln -sf samsub.20081217 samsub ============================================================================= 2008 12 16 ============================================================================= ########## # DCACHE # ########## Date: Tue, 16 Dec 2008 12:26:14 -0600 From: ssa-group@fnal.gov We need to restart pnfs on stken for a logging issue. The restart should only take a couple of minutes. Date: Tue, 16 Dec 2008 12:27:40 -0600 The restart of pnfs will happen at 1245 pm. Date: Tue, 16 Dec 2008 12:46:52 -0600 Done. ============================================================================= 2008 12 15 ============================================================================= ####### # CFL # ####### minos/CFL data files filled my AFS quota. MINOS26 > fs listquota . Volume Name Quota Used %Used Partition d.minos.d5 8000000 7997943 100%<< 25% < CFL/lists Adjusted cfl script accordingly cp -a cfl cfl.20081210 cp cfl cfl.20081215 nedit cfl.20081215 ln -sf cfl.20081215 cfl cp -va CFL.* lists/ CFILES=`ls CFL.* for FILE in ${CFILES} ; do echo ${FILE} ; diff ${FILE} lists/${FILE} ; done rm CFL.* Still a problem, as ed works in /tmp, which is only 1 GByte in minos26 !!! WOW Time edit with sed : cat CFL.new | sed 's./fs/usr/./.g'| sed 's./fnal.gov/usr/./.g' Pipe CFL.new through sed MINOS26 > time cat CFL.new | sed 's^/fs/usr/^/^g' | sed 's^/fnal.gov/usr/^/^g' > CFL.newer real 0m49.075s user 1m14.395s sys 0m2.452s Seems OK, let's put that filter on the original curl output. MINOS26 > time ./cfl real 1m0.709s user 1m16.221s sys 0m9.321s MINOS27 > time ./cfl real 0m40.826s user 0m40.643s sys 0m5.202s ####### # NET # ####### Date: Mon, 15 Dec 2008 12:51:43 -0600 (CST) Subject: HelpDesk ticket 126359 Short Description: Wireless problems reported on WH12W Problem Description: This is a low priority request to check out Wireless networking around WH2W, including the Control Room at WH12NW. During the Minos Collaboration meeting this weekend, and this morning, there have been some reports of connectivity problems. I cannot reproduce the problems with my own laptop. Unfortunately , I have no specific report ( this is third hand. ) So please just cast an eye on your monitoring logs, see whether there are any obvious issues. ___________________________________________ This ticket is assigned to FINNEGAN, RICK of the CD-LSCS/CNCS/SN. ___________________________________________ This ticket has been reassigned to ANDREWS, CHARLES of the CD-LSCS/CNCS/SN Group. ___________________________________________ Date: Tue, 16 Dec 2008 16:13:48 -0600 (CST) Solution: Art - The microwave oven in the control room is leaking RF - levels at 1 meter are about -7 to -10 Dbm - more than enough to interfere with some wireless operatrions - I will e-mail the screen captures to you. 
-Chuck- ___________________________________________ ######## # GRID # ######## Date: Mon, 15 Dec 2008 08:26:49 -0600 (CST) Subject: HelpDesk ticket 126325 ___________________________________________ Ticket #: 126325 ___________________________________________ Short Description: Make rbpatter a /fermilab/minos manager Problem Description: Please give Ryan Patterson, rbpatter@fnal.gov the same control over /fermilab/minos that I have, so that he can authorize new members and assign roles. ___________________________________________ This ticket is assigned to TIMM, STEVE of the CD-SF/GF/FGS. ___________________________________________ Date: Mon, 15 Dec 2008 12:24:07 -0600 (CST) Solution: This has been done. ___________________________________________________________________ This ticket was resolved by TIMM, STEVE of the CD-SF/GF/FGS group. ============================================================================= 2008 12 14 SUN ============================================================================= ########## # DCACHE # ########## I see no data movement since 13 Dec 23:00 roundup and predator are stuck Services are all up. ftplog failed since 6 Sat Dec 13 23:33:58 CST 2008 557 3601 Sun Dec 14 00:43:59 CST 2008 1 pnfslog 400 seconds since 2 Sat Dec 13 23:40:42 CST 2008 309 Sat Dec 13 23:50:51 CST 2008 No recent helpdesk tickets Date: Sun, 14 Dec 2008 12:17:02 -0600 (CST) Subject: HelpDesk ticket 126310 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126310 ___________________________________________ Short Description: PNFS not responding in FNDCA DCache Problem Description: User access to FNDCA has been failing since about Sat Dec 13 23:45 See my FTP and PNFS listing monitoring logs at http://www-numi.fnal.gov/computing/dh/ftplog/2008/12/13.txt http://www-numi.fnal.gov/computing/dh/pnfslog/2008/12/13.txt Reply to minos-data I can be reached at 630 697 0469 ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ Calling helpdesk , option 5 to page get message 'is not available at the moment at the tone record your message... " ___________________________________________ Yolanda paged SSA, ticket 126312, around 16:45 ___________________________________________ ___________________________________________ Date: Mon, 15 Dec 2008 14:44:58 -0600 (CST) From: Dmitry Litvintsev Info was requested of me by HelpDesk: dcache developer primary has been contacted at about 7:30 PM Sunday 12/14. It was discovered that an obscure log file used by pnfs daemon has reached 2GB in size generating error "File size exceeds limit". The file has been moved and the pnfs daemon has been restarted. System ws back to normal just after 8PM. ============================================================================= 2008 12 13 ============================================================================= ############ # MCIMPORT # ############ Working to get arms approved for pushing data to volatile area from the grid /DC=org/DC=doegrids/OU=People/CN=Kregg Arms 875233 Date: Sat, 13 Dec 2008 11:43:40 -0600 (CST) Subject: HelpDesk ticket 126305 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. 
@@@ # --> ___________________________________________ Ticket #: 126305 ___________________________________________ Short Description: FNDCA permissions for Kregg Arms to fermigrid/volatile/minos Problem Description: Please authorize Kregg to write to fermigrid/volatile/minos , using /DC=org/DC=doegrids/OU=People/CN=Kregg Arms 875233 User/group mapping should probably be arms/e875 Kregg intends to write from Teragrid sites, probably with Grid FTP. ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ ___________________________________________ ___________________________________________ ####### # CRL # ####### Ticket 126307 CRLWEB problem- Minos Control room - 630-840-3368 Detailed Problem Description (if supplied): Masaki called off hours helpdesk from the Minos Control rool that they are having problems with the CRL logbook (CRLWEB) with images, Extensive Work Log, datail of people being paged, Ticket 126303 Short Description: (from weekend) Proble with Control Room Logbook for MINOS Problem Description: Our instance of the Control Room Logbook has stopped loading images. It appears to be some problem with the html wrapper for the image. http://crlweb2.fnal.gov/minos/Index.jsp We seem to be able to enter images into entries, but none will load or display onto pages (including navigation buttons). You may contact the control room at x3368 - it is manned 24 hours a day. Or, contact me at 630-240-6842. ___________________________________________________________________ This ticket was resolved by RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST group. ___________________________________________________________________ ___________________________________________________________________ Ticket 126304 Web images problem from MINOS control romm hi from MINOS control room We are having a problem to view images in the following web page. http://crlweb2.fnal.gov/minos/Index.jsp For example : When we click one of images that we pasted, the images sometimes does't appear at all, but we are able to see the images in some other computers... error message we get is ************************************************* Not Found The requested URL /Entries/2008/12month/13day/06hour/General/Operations/Log/Text_82191_0_dec_ 13_08_day_plot1_png_wrapper.htm was not found on this server. Apache/2.0.46 (Scientific Linux) Server at crlweb2.fnal.gov Port 80 ************************************************** Can you help to solve the problem? other contacts for MINOS experts : kreymer@fnal.gov phone number for MINOS control room : x3368 x8751 x2482 x6913 The Work Log has discussions of missing files ? in /afs/fnal.gov/files/data/crl/dr/WWWdirectory Files have been restored to /afs/fnal.gov/files/restored/d.crl.1 ___________________________________________________________________ ___________________________________________________________________ Date: Sat, 13 Dec 2008 22:43:00 -0600 From: kreymer To: Maurine Mihalek Cc: Wayne Baisley , Hugh Gallagher , Desktop & Server Support - Enterprise , Arthur Kreymer Maureen, sorry that I could not get you more specific information earlier this evening, I was on the way out the door, and not able to get to a computer until just now. The ticket numbers were 126307 and 126304. My memory was faulty, this issue came up this morning, not last night. 
One of the missing URL's is http://crlweb2.fnal.gov/Entries/2008/12month/13day/06hour/General/Operations /Log/Text_82191_0_dec_13_08_day_plot1_png_wrapper.htm This is visible from my Desktop system at Fermilab, but not my laptop, the CRL, or my home system. This cannot have been cached on my desktop, as the graphic is much newer than my previous access to CRL from that system. =========================================================== Date: Sat, 13 Dec 2008 23:41:06 -0600 From: Wayne Baisley To: "kreymer@fnal.gov" Cc: Maurine Mihalek , Hugh Gallagher , Desktop & Server Support - Enterprise Subject: Re: Minos control room crlweb tickets > One of the missing URL's is > http://crlweb2.fnal.gov/Entries/2008/12month/13day/06hour/General/Operatio > ns/Log/Text_82191_0_dec_13_08_day_plot1_png_wrapper.htm > > This is visible from my Desktop system at Fermilab, but not my laptop, > the CRL, or my home system. > This cannot have been cached on my desktop, > as the graphic is much newer than my previous access to CRL from that > system. It seems to be cached somewhere, because most of the directory tree under Entries has gone missing, excepting a couple of hours for Thursday. The deepest directories in the vicinity are ... /afs/fnal.gov/files/data/crl/dr/LogBook_admin /afs/fnal.gov/files/data/crl/dr/CRLinquiries/CRLWindex /afs/fnal.gov/files/data/crl/dr/CRLdata/Entries/2008/12month/11day/10hour/Si mulation/SLIC_studies/Log /afs/fnal.gov/files/data/crl/dr/CRLdata/Entries/2008/12month/11day/13hour/Si mulation/SLIC_studies/Log /afs/fnal.gov/files/data/crl/dr/WWWdirectory/Entries/2008/12month/11day/10ho ur/Simulation/SLIC_studies/Log /afs/fnal.gov/files/data/crl/dr/WWWdirectory/Entries/2008/12month/11day/13ho ur/Simulation/SLIC_studies/Log /afs/fnal.gov/files/data/crl/dr/CRLmaillists I get 404s for anything below 12month, aside from the 11day directories which give me a 403 (Forbidden). Hope that helps some. Wayne ==================================================================== Date: Sun, 14 Dec 2008 06:15:56 +0000 (GMT) From: Arthur Kreymer To: Wayne Baisley Cc: Maurine Mihalek , Hugh Gallagher , Desktop & Server Support - Enterprise , kreymer@fnal.gov Subject: Re: Minos control room crlweb tickets On Sat, 13 Dec 2008, Wayne Baisley wrote: > It seems to be cached somewhere, because most of the directory tree under > Entries has gone missing, excepting a couple of hours for Thursday. The > deepest directories in the vicinity are ... I have created a minos account on my desktop, where things worked this afternoon: minos-93198.dhcp.fnal.gov I have given access to kreymer, baisley, mmihalek, and minos-wh-cr/minos/minos-om.fnal.gov@FNAL.GOV which is the principal used in the control room. You can run 'firefox' there, Unfortunately, running this remotely from home, the graphics seem to have disappared from my desktop also. I cannot do anything about the AFS areas under /afs/fnal.gov/files/data/crl I do not even have read access. Date: Sun, 14 Dec 2008 00:22:18 -0600 From: Maurine Mihalek To: Arthur Kreymer , mmihalek@fnal.gov Cc: Wayne Baisley , Hugh Gallagher , Desktop & Server Support - Enterprise , kreymer@fnal.gov Subject: Re: Minos control room crlweb tickets the directory wayne requested to be restored was /afs/fnal.gov/files/data/crl/dr/WWWdirectory. i am doing a tibs restore of the volume d.crl.1. this is the only way i see in the afs and tibs restore documentation to do it. i conferred with joe syu and he agrees. the restore is started. 
once it is finished, i will be mounting it in a restored area. i will advise you of the name and when that restored volume is available. maurine Date: Sun, 14 Dec 2008 01:05:26 -0600 From: Maurine Mihalek To: Maurine Mihalek , Arthur Kreymer , Wayne Baisley , Hugh Gallagher , Desktop & Server Support - Enterprise , kreymer@fnal.gov Subject: Re: Minos control room crlweb tickets I restored from 12/12/2008 tibs backup. the restored volume d.crl.1 is mounted under /afs/fnal.gov/files/restored/d.crl.1 there is a dr directory that has the WWWdirectory from Dec 12 tibs backup you can restore whatever files you need from there. Date: Sun, 14 Dec 2008 22:26:38 +0000 (GMT) From: Arthur Kreymer To: helpdesk-forwarder@fnal.gov Cc: Maurine Mihalek , Wayne Baisley , Desktop & Server Support - Enterprise Subject: HelpDesk ticket 126304 <-- # @@@ Enter Update below this line. @@@ # --> Minos has access to these CRL/AFS support areas only through the CRL web interface. We cannot read the /afs paths, or perform maintenance. My desktop system minos-93198.dhcp.fnal.gov, running SLF 5, with Firefox 3.0.4, continues to see new graphics files, like the recent http://www-minoscrl2.fnal.gov/Entries/2008/12month/14day/06hour/General/Operati$ This file was added long after the problem started yesterday. This file is not visible to the Control Room, or to most other systems. The problem cannot be in the loss of the data files on the Web Server, just a failure to serve them to some ( but not all ) clients. Perhaps there is a change in the way the server handles or caches usernames/passwords for restricted pages. I think that all these images are password protected, which is historically quite a nuisance when viewing pages. My successful Firefox 3.0.4 browser has a stored password for : http://crlweb2.fnal.gov(CRLW) User minos on the same system with the same browser, cannot see images. <-- # @@@ Enter Update above this line. @@@ # --> Fails http://crlweb2.fnal.gov/Entries/2008/12month/15day/00hour/General/Operations/Log/Text_82255_0_dec_15_08_night_plot1_png_wrapper.htm Works http://www-minoscrl2.fnal.gov/Entries/2008/12month/15day/00hour/General/Operations/Log/Text_82255_0_dec_15_08_night_plot1_png_wrapper.htm ---------- Forwarded message ---------- Date: Mon, 15 Dec 2008 11:08:06 -0600 From: Suzanne Gysin To: Arthur Kreymer Subject: Re: Minos control room crlweb tickets (fwd) Hi Art, just a little more information. The web address has not changed, not since years. The way this works is the webserver (crlweb2) has many logbooks, each having an alias. The alias maps to the specific log's image and entry directory. This is nothing new. Maybe the images were cashed until now in the control room, I don't know why it worked before. Maybe they used the alias before. Suzanne _____________________________________________________________________________ Corrected crlweb2 to www-minoscrl2 at http://www-numi.fnal.gov/Minos/ControlRoom/index.html _____________________________________________________________________________ Date: Mon, 15 Dec 2008 17:22:00 +0000 (GMT) From: Arthur Kreymer To: helpdesk@fnal.gov Cc: mmihalek@fnal.gov, baisley@fnal.gov, dss-est@fnal.gov, zwaska@fnal.gov, kreymer@fnal.gov, votava@fnal.gov Subject: Re: Minos control room crlweb tickets (fwd) Tickets 126307, 126303, and 126304 can be closed. We had been using an incorrect web address, which stopped working over the weekend. Apparently the correct web address works fine. There appears to have been no problem with the server itself. 
I have updated the link on the Minos Control Room web page. _____________________________________________________________________________ Per baisley, also updated /afs/fnal.gov/files/expwww/numi/html/documentation/alphabetical_index.html _____________________________________________________________________________ HISTORY - digging through email for www-minoscrl2 references minos: Date: Fri, 02 Jun 2006 16:31:00 -0500 saranen, new account minosadmin: Date: Wed, 05 Apr 2006 14:44:32 -0500 (CDT) referenced in passsing, re mysql upgrade Date: Mon, 24 Apr 2006 11:04:51 -0500 (CDT) reference to the old CRL, www-minoscrl out: ============================================================================= 2008 12 12 ============================================================================= ######## # FARM # ######## To minos_batch, rubin : Please run cedar catchup spill processing on the horn off runs, N00015187 MISS 0021 0022 0023 N00015190 MISS 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 0010 0011 0012 0013 0014 0015 0016 0017 0018 0019 0020 Xiaobo - there are spill sntp files waiting to be concatenated for these in /minos/data/minfarm/nearcat The third horn off run has been fully processed, N00015193 This may take special sntp-only processing, as it seems that cand and cosmic files exist for all the missing subruns. ---------------------------- Howie submitted these, picked up Dec 12 19:31 /minos/data/minfarm/nearcat/N00015187_0021.spill.sntp.cedar.0.root Dec 12 20:06 /minos/data/minfarm/nearcat/N00015187_0022.spill.sntp.cedar.0.root Dec 12 20:12 /minos/data/minfarm/nearcat/N00015187_0023.spill.sntp.cedar.0.root N00015190_0000 through 20 I have concatenated ahead of schedule, SRV1> ./roundup -r cedar near Fri Dec 12 20:43:32 CST 2008 OK adding N00015187_0000.spill.sntp.cedar.0.root 9 OK adding N00015187_0010.spill.sntp.cedar.0.root 14 SUPPRESS N00015190_0024.spill.sntp.cedar.0.root OK adding N00015190_0000.spill.sntp.cedar.0.root 24 Fri Dec 12 20:56:30 CST 2008 Informed minos_batch, xbhuang ######### # MYSQL # ######### Resuming work on mysql installation HOWTO.mysqladmin - updating per mysql2 work HOWTO.mysqladmin.20080820 - describes minos-sam03 work ########## # DCACHE # ########## Requested closeout of ticket 121533 as there no planned action to test DCache failover to secondary DNS. 9/12/2008 11:04:51 AM _______________________________________________________________________ During last night's fnsrv0 primary DNS server outage, it appears that all FNDCA and STKEN data transfers stopped. The DCache and Enstore data rate plots show no activity, and many user jobs failed. Likewise, I see no Enstore data transfers in CDFEN or D0EN from 22:45 through 04:00 last night ( Sep 11/12 ) All password ftp reads from FNDCA failed during this period. Access to PNFS was very slow, typically 3 minutes instead of 3 seconds. Strangly, the Minos Data Acquisition kerberized ftp copies all succeeded during this downtime. It would be very desirable for Enstore and DCache smoothly fail over to secondary nameservers. _______________________________________________________________________ 9/22/2008 3:59:20 PM Remedy Application Service The following was e-mailed to the Requester: jonest@fnal.gov sent this Notes To Requester: All enstore movers and servers have primary and secondary > nameservers defined. > nameserver 131.225.8.120 > nameserver 131.225.17.150 > Also, each mover and server has all enstore related nodes listed in the /etc/hosts file. 
_______________________________________________________________________ Date: Fri, 12 Dec 2008 20:03:32 +0000 (GMT) From: Arthur Kreymer As no further action on this issue seems to be planned, I suggest that this ticket be closed. It would be nice to test failover to secondary DNS servers in the test stand. But that is not best tracked via a helpdesk ticket. Thanks ! _______________________________________________________________________ Solution: jonest@fnal.gov sent this solution: > Art feels this ticket can be close, Thanks Art! _______________________________________________________________________ _______________________________________________________________________ ########## # CONDOR # ########## Date: Fri, 12 Dec 2008 10:33:31 -0600 (CST) Subject: HelpDesk ticket 126261 ___________________________________________ Ticket #: 126261 ___________________________________________ Short Description: xbhuang Robot Cert needs approval. Problem Description: User xbhuang added a Robot cert a couple of days ago, but this is still listed as 'new', not 'approved'. Please approve this cert : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Xiaobo Huang/CN=UID:xbhuang ___________________________________________ This ticket is assigned to of the . ___________________________________________ Date: Fri, 12 Dec 2008 10:43:56 -0600 (CST) This ticket has been reassigned to TIMM, STEVE of the CD-Grid/Fermi Group. ___________________________________________ Date: Fri, 12 Dec 2008 10:56:35 -0600 (CST) Note To Requester: chadwick@fnal.gov sent this Notes To Requester: The new certificate has been approved. ___________________________________________ Solution: This cert has been added and approved. This ticket was resolved by TIMM, STEVE of the CD-Grid/Fermi group. ___________________________________________ Date: Mon, 15 Dec 2008 12:19:54 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: The logs of VOMRS show that the certificate was never successfully added but I have added it now and approved it. Steve Timm ___________________________________________ ######## # DATA # ######## html/computing/dh/dhleft.html.20081212 updated for data/data2/scratch free ############# # DBARCHIVE # ############# Cleaned up messages : Non-verbose file copy Final time stamp ########### # MONTHLY # ########### DATASETS 12/5 PREDATOR 12/5 VAULT 12/4 MYSQL 12/12 Thu Dec 11 15:49:39 CST 2008 Archiving OFFLINE Thu Dec 11 15:49:55 CST 2008 68608 . 
Archiving BINLOGS
Thu Dec 11 16:53:49 CST 2008
real    1m41.312s
Compressing archives
Thu Dec 11 16:55:34 CST 2008
real    55m36.702s
Copying back to local disk
Thu Dec 11 20:57:49 CST 2008
real    64m0.852s

Mysql> du -sm /data/archive/COPY/*
34485   /data/archive/COPY/20080902
22233   /data/archive/COPY/20081013
22445   /data/archive/COPY/20081114
22509   /data/archive/COPY/20081211

=============================================================================
2008 12 11
=============================================================================

##########
# CONDOR #
##########

Submitted glidex100 ( 100 sections ), around 14:19
82 jobs; 71 idle, 10 running, 1 held   14:20
72 jobs; 61 idle, 10 running, 1 held   Thu Dec 11 14:21:03 CST 2008
50 jobs;  0 idle, 49 running, 1 held   Thu Dec 11 14:22:00 CST 2008
 1 jobs;  0 idle,  0 running, 1 held   Thu Dec 11 14:23:00 CST 2008

Checking startup time, from the logs :

Removed old held job
242864.0   kreymer    12/10 23:20   0+00:14:02 H  0   0.0  probe
HoldReason = "Error from starter on glidein_24419@fcdfcaf1695.fnal.gov:
Failed to execute
'/local/stage1/condor/execute/dir_23884/glide_e23941/condor_job_wrapper.sh'
with arguments 0 sleep 30: Connection reset by peer"

Checking start times from logs
grep executing logs/glide/probe.243057.*.log \
    | cut -f 4 -d ' ' | sort > probex100.logtimes
14:18:48 ... 14:22:19
Never more than two jobs per second, and these are always isolated.
Net rate 100/210 seconds.

Checking job starts from the *.out files, for consistency
grep STARTED logs/glide/probe.243057.*.out \
    | cut -f 7 -d ' ' | sort > probex100.outtimes
14:18:49 ... 14:22:21
Triplets at
14:21:47
14:21:47
14:21:50
14:21:50
14:21:50
14:21:51
14:21:52
14:21:52
14:21:52
14:21:54
14:21:54
14:21:56

This did not get 100 running at once, 30 sec was too short a sleep.
And did not have 100 glideins up front.
Increased the sleep to 210 seconds, 243074
primed the pump with another run at 15:30,
Farm glideins: R=189 I=0 H=0

condor_submit glidex100.run
100 job(s) submitted to cluster 243087.
100 jobs; 100 idle, 0 running, 0 held
Thu Dec 11 15:55:22 CST 2008
MINOS25 > condor_q kreymer | tail -1 ; date
100 jobs; 0 idle, 100 running, 0 held
Thu Dec 11 15:55:35 CST 2008

grep executing logs/glide/probe.243087.*.log | cut -f 4 -d ' ' | sort \
    > probex100a.logtimes
grep STARTED logs/glide/probe.243087.*.out | cut -f 7 -d ' ' | sort \
    > probex100a.outtimes

#########
# MYSQL #
#########

Preparing for replication testing of mysql2, see
http://www-numi.fnal.gov/offline_software/srt_public_context/DatabaseMaintenance/doc/dbmauto_index.html

##########
# DCACHE #
##########

Date: Thu, 11 Dec 2008 11:28:06 -0600
From: ssa-group@fnal.gov
To: cdweb@fnal.gov, helpdesk@fnal.gov, oleynik@fnal.gov, stan@fnal.gov,
    wolbers@fnal.gov, crawdad@fnal.gov, white@fnal.gov, moibenko@fnal.gov,
    timur@fnal.gov, stk-users@fnal.gov, cms-t1@fnal.gov, dcache-admin@fnal.gov
Subject: Announcement: Service disruption for dCache on stken for a duration of is back up.

There was a short disruption of the fndca (public) dcache.
A hardware repair contractor accidentally rebooted the head node.
Dcache is running now.

###########
# ENSTORE #
###########

Date: Thu, 11 Dec 2008 10:38:59 -0600
From: ssa-group@fnal.gov

Just as a reminder there will be an upgrade to the STKEN movers at 11:00 AM.
Approximately 20 minutes from now.

Date: Thu, 11 Dec 2008 11:27:49 -0600
From: ssa-group@fnal.gov

The update is complete. Thank you for your patience.
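Regarding the glidex100 timing files above, the peak and overall start rates can be pulled straight from the *.logtimes lists ( a sketch, just post-processing the grep/cut output already saved ) :

    # starts per second, busiest seconds first
    sort probex100a.logtimes | uniq -c | sort -rn | head -5
    # first start, last start, total number of starts
    head -1 probex100a.logtimes ; tail -1 probex100a.logtimes ; wc -l probex100a.logtimes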
########## # DCACHE # ########## Closed out ticket 126014 FNDCA - three files unavailable via DCache All three files are readable. ####### # CRL # ####### Date: Thu, 11 Dec 2008 08:51:29 -0600 (CST) Subject: HelpDesk ticket 126186 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126186 ___________________________________________ Short Description: Incorrect email address in error message when CRL was down Problem Description: On Dec 3, there was a problem with the Minos CRL web server. The message that was seen referred to an incorrect email address, webmaster@crlweb2.fnal.gov At low priority, I suggest that this be tracked down and corrected. Here is the message that was seen at the time : The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, webmaster@crlweb2.fnal.gov and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log. ____________________________________________________________________________ Apache/2.0.46 (Scientific Linux) Server at crlweb2.fnal.gov Port 80 ___________________________________________ Date: Thu, 11 Dec 2008 09:10:03 -0600 (CST) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/WST ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 12 10 ============================================================================= ########## # CONDOR # ########## Helping xbhuang on minos25 MINOS_CONDOR=/afs/fnal.gov/files/code/e875/general/condor ${MINOS_CONDIR}/scripts/proxyconfig fails to find libssl3.so In particular, ${MINOS_CONDOR}/scripts/get-cert/get-cert.sh -i ####### # CFL # ####### Sometimes the COMPLETE_FILE_LISTING is truncated. This results in 'newline appended' messages from he cronjob. I would be better to bail on incomplete CFL files. The last line should be like (1832559 rows) Test with tail -1 | grep -q '^(.* rows)$' Updated cfl script accordingly ########### # BLUEARC # ########### Date: Wed, 10 Dec 2008 11:58:43 -0600 From: Andrew J. Romero To: "'site-nas-announce@fnal.gov'" Subject: Emergency Reboot of all nodes in RHEA cluster to occur at noon (Dec 10, 2008) We are experiencing severe heap allocation problems on both RHEA cluster nodes. At noon we will be rebooting both RHEA-1 and RHEA-2 (We will also install a firmware upgrade which will addresses this issue) The following EVSs are effected: BLUE1 BLUE2 MINOS-NAS-0 CDSERVER PPDSERVER DIRSERVER1 ESHSERVER1 CDFSERVER1 LSSERVER NUMISERVER PSEEKITS The RHEA cluster should be back online at 12:30 Andy ____________________________________________________________________ bluwatch resumed at 12:34 But the logs are strangely incomplete. fnpcrv1 - no errors or timeouts minos-sam03 - no errors or timeouts minos01 - timeout ending 12:34:57 minos26 - no errors or timeouts ____________________________________________________________________ Date: Wed, 10 Dec 2008 13:15:54 -0600 From: Andrew J. Romero To: "'site-nas-announce@fnal.gov'" Subject: BLUEARC .... Emergency Reboot .... 
COMPLETE All EVSs and Filesystems were back online at 12:42PM ____________________________________________________________________ ____________________________________________________________________ ####### # DAQ # ####### Started ND archiver, had been stopped by GFP Restarted FD archiver, it appeared to be stuck, valid PID, process running, no activity. Restarted BEAM archiver, no files had moved, as with FD. Restarted NDCS around 14:07 , had not archived Dec 8 18:00 N081209_000002.mdcs.root got archived 14:16 For consistency with daq, beam systems, midir bin/init cp -a /etc/init.d/archiver bin/init/ Restarted FDCS around 14:30 Dec 8 18:00 F081209_000006.mdcs.root got archived at 14:32 midir bin/init cp -a /etc/init.d/archiver bin/init/ Noted that archiverstatus.sh is in scripts on N/F DAQ, in bin for N/F DCS and BEAM ( where it has been in crontabs ) For the first time in over a year, all five archivers have current status and recently archived files at http://minos-om.fnal.gov/cgi-bin/archiver_status.cgi ######### # ADMIN # ######### Old rbpatter helpdesk ticket :124576 11/10/2008 11:18:33 PM ___________________________________________ Please add the following names to the NIS database for the MINOS Cluster: UID 42918 -- gfrontend UID 43021 -- minosana and GID 5468 -- e898 Thanks in advance. --Ryan ___________________________________________ 11/11/2008 5:22:07 PM jereboze Checked out uids listed on this request, there was no uid/gid assignment for "gfrontend". UID 42918 was assigned to user "whend". Created uid 43498 for gfrontend. UID for minosana is correct. Minos Cluster admin you can now add the two uids: UID 43598 -- gfrontend UID 43021 -- minosana Will assign ticket to the Minos admins.. Yolanda Valadez CD/Helpdesk 11/11/2008 4:53:44 PM valadez Checked out uids listed on this request, there was no uid/gid assignment for "gfrontend". UID 42918 was assigned to user "whend". Created uid 43498 for gfrontend. UID for minosana is correct. Minos Cluster admin you can now add the two uids: UID 43598 -- gfrontend UID 43021 -- minosana Will assign ticket to the Minos admins.. Yolanda Valadez CD/Helpdesk ___________________________________________ Assigned to Arthur Kreymer 1/11/2008 5:21:35 PM jereboze The Assigned To Group was changed from CD-SF/FEF to CD-SP. The Assigned To Individual was changed from HO, LING to KREYMER, ARTHUR. The Assigned To E-mail Address was changed from run2-sys@fnal.gov to kreymer@fnal.gov. helpdesk@fnal.gov was put into the CC Address 1 field. 11/11/2008 4:56:11 PM valadez The Assigned To Group was changed from CD-LSCS/CSI/HD to CD-SF/FEF. The Assigned To Individual was changed from VALADEZ, YOLANDA to HO, LING. The Assigned To E-mail Address was changed from helpdesk@fnal.gov to run2-sys@fnal.gov. 11/11/2008 4:53:44 PM valadez The Assigned To Group was changed from Help Desk to CD-LSCS/CSI/HD. The Assigned To Individual was changed from HelpDesk to VALADEZ, YOLANDA. ___________________________________________ Date: Wed, 28 Jan 2009 17:52:00 +0000 (GMT) Subject: Re: KREYMER, ARTHUR HelpDesk ticket 124576 Reminder We need an official UID for gfactory. It presently has none. I will submit a separate request. The UID we now use, 42917, belongs to Stefan Lammel/cdfprd_svx. Once that is assigned, I will ask to shift this ticket back to FEF ( run2-sys ) so that we can schedule a shift of the gfactory and gfrontend accounts and files to their proper UID's. 
___________________________________________ Date: Wed, 28 Jan 2009 20:48:12 +0000 (GMT) Subject: Re: KREYMER, ARTHUR HelpDesk ticket 124576 Has Been Updated. Helpdesk : Please reassign this ticket to run2-sys (FEF) We should change the gfactory and gfrontent accounts to use their assigned UID's 43598 gfrontend 43680 gfactory This needs to be coordinated with rbpatter and kreymer, as the factory and frontend processes needed to be stopped while the NIS and file ownerships are being changed. ___________________________________________ Date: Wed, 28 Jan 2009 15:10:22 -0600 (CST) This ticket has been reassigned There is no need for you to continue working on this problem. The ticket has been reassigned to: COOPER, GLENN ___________________________________________ ___________________________________________ ######## # SSHD # ######## Date: Wed, 10 Dec 2008 10:50:56 -0600 (CST) Subject: HelpDesk ticket 126126 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126126 ___________________________________________ Short Description: sshd not responding on minos03, minos05, minos22 Problem Description: run2-sys : rlogin works, but sshd logins are not available for nodes minos05 minos08 minos22 Please restart the sshd servers on these nodes. ___________________________________________ Date: Wed, 10 Dec 2008 10:55:27 -0600 (CST) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 10 Dec 2008 11:26:11 -0600 (CST) Solution: The sshd service has been restarted on the three machines. This ticket was resolved by BURNS, ETTA of the CD-SF/FEF group. ___________________________________________ ___________________________________________ ######### # AKLOG # ######### Date: Wed, 10 Dec 2008 15:52:27 +0000 (GMT) Subject: Re: [Fwd: HelpDesk ticket 125925 has additional info.] <-- # @@@ Enter Update below this line. @@@ # --> This rebuilt version of aklog does get me a token. Unfortunately, the token cannot be used to write to AFS. I can write using tokens derived from my default ticket MINOS25 > touch /afs/fnal.gov/files/home/room1/kreymer/testafs.native MINOS25 > tokens > /tmp/tokens.ok But not with kcron tokens MINOS25 > kcron MINOS25 > aklog MINOS25 > tokens Tokens held by the Cache Manager: User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Dec 10 19:44] --End of list-- MINOS25 > touch /afs/fnal.gov/files/home/room1/kreymer/testafs.kcron touch: cannot touch `/afs/fnal.gov/files/home/room1/kreymer/testafs.kcron': Permission denied MINOS25 > tokens > /tmp/tokens.kcron MINOS25 > diff /tmp/tokens.ok /tmp/tokens.kcron 4c4 < User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Dec 11 10:53] --- > User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Dec 10 19:44] <-- # @@@ Enter Update above this line. @@@ # --> ____________________________________________________________________________ 12/10/2008 4:02:11 PM dawson The following was e-mailed to the Requester: Hi Art, I am reassigning this to the kcron maintainer. This is beyond my expertise, as I usually deal with just tickets gotten the usual way. ____________________________________________________________________________ Date: Mon, 02 Feb 2009 21:19:43 +0000 (GMT) From: Arthur Kreymer Is there any progress on this ticket ? We really need an aklog that functions with kcron tickets on the new Minos SLF 4.7 servers. This ticket has been pending now for almost two months. 
We are tantalizingly close to a solution. Ling's patched version of aklog delivers AFS tokens. But those tokens to not give us access to AFS, so they must be broken in some invisible way. As the ticket is assigned to Frank, I have created a nagy account on the Minos Cluster. Nodes minos25 and minos27 are at SLF 4.7 . ____________________________________________________________________________ ____________________________________________________________________________ ____________________________________________________________________________ ########## # DCACHE # ########## DCache seems to be down Last ftplog 20:57 Last MRTG activity around 21:10 for fndca3a aka fndca Traffic Analysis for 3/20 fndca3a -- s-s-fcc1-server kreymer@minos26 crontab -r mindata@minos26 crontab -r minfarm@fnpcsrv1 mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT Updated MINOS status page _____________________________________________________________ Date: Wed, 10 Dec 2008 11:05:18 -0600 Subject: Announcement: Service disruption for dCache on stken for a duration of public dcache system back up The fndca ( public ) dcache system is back up. The head node was replaced. _____________________________________________________________ Near and far data have been archived since 11:00 _____________________________________________________________ 16:55 - restarted all ============================================================================= 2008 12 09 ============================================================================= ########## # DCACHE # ########## Found some 3 day old files in write pools : n13036711_0004_L010185N_D04_helium.reroot.root kreymer e875 0 Dec 4 09:08 n13036710_0013_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root minospro e875 0 Dec 4 05:44 n13036710_0014_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root minospro e875 0 Dec 4 09:57 n13036705_0028_L010185N_D04_helium.sntp.cedar_phy_bhcurv.0.root minospro e875 0 Dec 4 04:59 -rw-r--r-- 1 kreymer e875 0 Dec 4 09:08 /pnfs/minos/mcin_data/near/daikon_04/L010185N_helium/671/n13036711_0004_L010185N_D04_helium.reroot.root -rw-r--r-- 1 minospro e875 0 Dec 4 05:44 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N_helium/cand_data/671/n13036710_0013_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minospro e875 0 Dec 4 09:57 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N_helium/cand_data/671/n13036710_0014_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minospro e875 0 Dec 4 04:59 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N_helium/sntp_data/670/n13036705_0028_L010185N_D04_helium.sntp.cedar_phy_bhcurv.0.root ZFILES=' n13036711_0004_L010185N_D04_helium.reroot.root n13036710_0013_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root n13036710_0014_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root n13036705_0028_L010185N_D04_helium.sntp.cedar_phy_bhcurv.0.root ' Removed the files from PNFS for FILE in ${ZFILES} ; do FPAT=`sam locate ${FILE} | cut -f 2 -d "'" | grep '^/pnfs' | cut -f 1 -d,`/${FILE} rm ${FPAT} done Removed the files from SAM for FILE in ${ZFILES} ; do sam undeclare file ${FILE} ; done ######## # FARM # ######## Date: Tue, 09 Dec 2008 15:29:17 +0000 (GMT) From: Arthur Kreymer To: minos_batch@fnal.gov Cc: rubin@fnal.gov Subject: linfix concatenation up to date The mcnearcat cedar_phy_bhcurv linfix concatenation is up to date. There are 36 runs which are pending missing subruns. 
See ROUNTUP/LOG/cedar_phy_linfixmcnear.pend ########### # ROUNDUP # ########### rounudup.20081209 Corrected logic to set aside old PENDFILE when NOOP is null SRV1> cp AFSS/roundup.20081209 . SRV1> ln -sf roundup.20081209 roundup # was roundup.20081126 ============================================================================= 2008 12 08 ============================================================================= ######### # STAGE # ######### Review state of carrot_06 for pawloski CARROT=/pnfs/minos/mcout_data/cedar/near/carrot_06/L010185/sntp_data CDIRS=`ls ${CARROT}` for DIR in ${CDIRS} ; do cd ${CARROT}/${DIR} CFILES=`ls ${CARROT}/${DIR}` for FILE in ${CFILES} ; do head -1 ".(use)(4)(${FILE})" done done 2>&1 | tee /tmp/CVOLS wc -l /tmp/CVOLS 7457 /tmp/CVOLS CVOLS=`sort -u /tmp/CVOLS` echo $CVOLS VO8219 VO8366 VOB656 VOB862 VOB873 VOB879 VOB883 VOB887 VOB895 VOB903 VOB907 VOB908 VOB913 VOB920 VOB927 for VOL in ${CVOLS} ; do printf "${VOL} " ; grep ${VOL} /tmp/CVOLS | wc -l ; done VO8219 38 VO8366 582 VOB656 575 VOB862 592 VOB873 600 VOB879 582 VOB883 583 VOB887 590 VOB895 492 VOB903 563 VOB907 481 VOB908 504 VOB913 593 VOB920 608 VOB927 74 cd ~/minos/scripts { for VOL in ${CVOLS} ; do ./stage -w -p 5 -s carrot_06/L010185/sntp_data ${VOL} done ; } >> /minos/scratch/kreymer/log/stage/carrot06.log 2>&1 & STARTING Mon Dec 8 15:15:18 CST 2008 FINISHED Tue Dec 9 04:06:02 CST 2008 ########## # DCACHE # ########## Date: Mon, 08 Dec 2008 09:43:51 -0600 From: David Saranen To: Arthur Kreymer Subject: Re: Daq archiving started again this evening Ticket 125959 FNDCA recent FTP web page listing is empty. _____________________________________________________________________ Similar to tickets 123960 and 124357. Web page: http://fndca3a.fnal.gov/cgi-bin/dcache_files.py is empty. Art Kreymer has restarted archiving from MINOS Far Detector, so I assume that files are being transfered normally. _____________________________________________________________________ Date: Thu, 11 Dec 2008 15:44:33 +0000 (GMT) The FTP Recent Transfers list is still empty, at http://fndca3a.fnal.gov/cgi-bin/dcache_files.py Our data transfers seem to be running normally, but it would be nice to have this diagnostic page available. _____________________________________________________________________ Date: Thu, 11 Dec 2008 16:26:46 -0600 (CST) From: Dmitry Litvintsev I believe I have fixed the issue. Issue was having to do with log gathering script that has been stoppped/killed but left lock file that prevented this script from starting over. Lock file was dated Dec 03 _____________________________________________________________________ ########### # BLUEARC # ########### Date: Mon, 08 Dec 2008 14:23:00 -0600 (CST) Subject: HelpDesk ticket 126005 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126005 ___________________________________________ Short Description: BlueArc r/w exports for new Minos servers Problem Description: LSC/CSI Please make read/write exports for /minos/data, data2, scratcch for the new Minos servers : minos27 minos-sam04 minos-mysql2 minos-mysql3 ___________________________________________ This ticket is assigned to HelpDesk of the Help Desk. 
___________________________________________ Date: Mon, 08 Dec 2008 15:30:33 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST ___________________________________________ Note To Requester: added blue2.fnal.gov:/minos/data for hosts minos27 , minos-sam04 , minos-mysql2 , minos-mysql3 But we want to confirm, is that minos-mysql2 with letter "l" or number one? We were extending the pattern in the current host list. (hosts with plus signs added during this request) minos-mysql1.fnal.gov(rw,no_root_squash) flxb10.fnal.gov (rw) flxb11.fnal.gov (rw) flxb12.fnal.gov (rw) flxb13.fnal.gov (rw) flxb14.fnal.gov (rw) flxb15.fnal.gov (rw) flxb16.fnal.gov (rw) flxb17.fnal.gov (rw) flxb18.fnal.gov (rw) flxb19.fnal.gov (rw) flxb20.fnal.gov (rw) flxb21.fnal.gov (rw) flxb22.fnal.gov (rw) flxb23.fnal.gov (rw) flxb24.fnal.gov (rw) flxb25.fnal.gov (rw) flxb26.fnal.gov (rw) flxb27.fnal.gov (rw) flxb28.fnal.gov (rw) flxb29.fnal.gov (rw) flxb30.fnal.gov (rw) flxb31.fnal.gov (rw) flxb32.fnal.gov (rw) flxb33.fnal.gov (rw) flxb34.fnal.gov (rw) flxb35.fnal.gov (rw) flxi02.fnal.gov (rw) flxi03.fnal.gov (rw) flxi04.fnal.gov (rw) flxi05.fnal.gov (rw) flxi06.fnal.gov (rw) flxi07.fnal.gov (rw) minos01.fnal.gov (rw) minos02.fnal.gov (rw) minos03.fnal.gov (rw) minos04.fnal.gov (rw) minos05.fnal.gov (rw) minos06.fnal.gov (rw) minos07.fnal.gov (rw) minos08.fnal.gov (rw) minos09.fnal.gov (rw) minos10.fnal.gov (rw) minos11.fnal.gov (rw) minos12.fnal.gov (rw) minos13.fnal.gov (rw) minos14.fnal.gov (rw) minos15.fnal.gov (rw) minos16.fnal.gov (rw) minos17.fnal.gov (rw) minos18.fnal.gov (rw) minos19.fnal.gov (rw) minos20.fnal.gov (rw) minos21.fnal.gov (rw) minos22.fnal.gov (rw) minos23.fnal.gov (rw) minos24.fnal.gov (rw) minos25.fnal.gov (rw) minos26.fnal.gov (rw) minos27.fnal.gov (rw) +++++++ minos-mysql11.fnal.gov (rw) minos-mysql12.fnal.gov (rw) ++++++++ ??????????? minos-mysql13.fnal.gov (rw) ++++++++ ??????????? minos-sam01.fnal.gov (rw) minos-sam02.fnal.gov (rw) minos-sam03.fnal.gov (rw) minos-sam04.fnal.gov (rw) ++++++++ 131.225.166.* (rw) 131.225.167.* (rw) 131.225.208.0/22 (rw) 131.225.212.0/23 (rw) 131.225.240.0/24 (rw) 131.225.238.0/23 (rw) 131.225.*.* (read_only,root_squash) ___________________________________________ Date: Tue, 09 Dec 2008 17:40:46 +0000 (GMT) Thanks for updating these. The mysql node names start with minos-mysql followed by the numbers one, two three. There is an extra 1(one) in these in your list below, minos-mysql11 should be minos-mysql1 etc. ___________________________________________ Date: Tue, 09 Dec 2008 13:17:32 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/WST ___________________________________________ Date: Tue, 09 Dec 2008 14:12:19 -0600 (CST) Solution: Gave the following hosts: minos27 minos-sam04 minos-mysql2 minos-mysql3 (rw) access to minos-nas-0.fnal.gov:/minos/data minos-nas-0.fnal.gov:/minos/scratch blue2.fnal.gov:/minos/data This ticket was resolved by ROMERO, ANDY of the CD-LSCS/CSI/CS/WST group. 
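A quick client-side check of the new exports would be a loop like this ( sketch only, not run ; assumes the mount points are already in fstab on each host and that a writable kreymer subdirectory exists under each area ) :

    for NODE in minos27 minos-sam04 minos-mysql2 minos-mysql3 ; do
        for AREA in /minos/data /minos/scratch ; do
            ssh -ax ${NODE} \
                "touch ${AREA}/kreymer/rwtest && rm ${AREA}/kreymer/rwtest" \
                > /dev/null 2>&1 \
                && echo "rw OK   ${NODE} ${AREA}" \
                || echo "rw FAIL ${NODE} ${AREA}"
        done
    done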
___________________________________________ ___________________________________________ ___________________________________________ ########### # MINOS27 # ########### Date: Mon, 08 Dec 2008 14:52:58 -0600 (CST) Subject: HelpDesk ticket 126011 ___________________________________________ Ticket #: 126011 ___________________________________________ Short Description: minos27 lacks /afs/fnal.gov and /pnfs/minos mounts Problem Description: The /afs/fnal.gov and /pnfs/minos mounts seem to have disappeared from node minos27.fnal.gov . Please investigate and restore. ( Please also make /var/log/messages world readable on minos27 and the other new servers (minos25, minos-sam04, minos-mysql2 ) ___________________________________________ Date: Mon, 08 Dec 2008 15:38:56 -0600 (CST) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 08 Dec 2008 16:14:39 -0600 (CST) Note To Requester: Please make read/write exports for /minos/data, data2, scratcch (is this typo?) for the new Minos servers : minos27 minos-sam04 minos-mysql2 minos-mysql3 ===================================================================================== scratcch (is this typo?) Added rights for minos27 , minos-sam04 , minos-mysql2 , minos-mysql3 to export /minos/data Added rights for minos27 , minos-sam04 , minos-mysql2 , minos-mysql3 to export /minos/scratch No export with "data2" found. Is this correct name of export? ___________________________________________ Date: Mon, 08 Dec 2008 22:49:14 +0000 (GMT) Sorry for the typo, and the imprecise list. My request should have been in terms of your exports, not our mounts. Thanks for taking care of the first two of these : minos-nas-0.fnal.gov:/minos/data minos-nas-0.fnal.gov:/minos/scratch blue2.fnal.gov:/minos/data ___________________________________________ Date: Tue, 09 Dec 2008 08:17:28 -0600 (CST) Note To Requester: Please verify which minos machines should have /pnfs/minos mounted. ___________________________________________ Date: Tue, 09 Dec 2008 14:29:29 -0600 (CST) This ticket has been reassigned to TIMM, STEVE of the CD-Grid/Fermi Group. ___________________________________________ Date: Tue, 09 Dec 2008 14:42:48 -0600 (CST) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 09 Dec 2008 20:50:43 +0000 (GMT) Scanned existing nodes, ro everywhere but minos01 - rw minos25 - missing deliberately minos26 - rw minos-mysql2 - missing minos-sam03 - rw minos-sam04 - missing Since you ask, I have reviewed the existing mounts, and compared this to what we might be needing soon. Request : Mount /pnfs/minos ro on all Minos Cluster and Server nodes, except : Do not mount on minos25 Mount rw on minos01 minos26 minos-mysql2 minos-sam03 minos-sam04 Compared the present status, this adds rw mounts on minos-mysql2 minos-sam04 ___________________________________________ Date: Wed, 10 Dec 2008 10:16:23 -0600 (CST) Solution: Verified that the /pnfs/minos mount is in the /etc/fstab file for the entire cluster of machines. Made /var/log/messages world readable as requested. This ticket was resolved by BURNS, ETTA of the CD-SF/FEF group. ___________________________________________ ___________________________________________ ########## # ANNUAL # ########## Created new directories for 2009, per HOWTO.annual ########## # DCACHE # ########## Date: Mon, 08 Dec 2008 12:06:28 -0600 (CST) Subject: HelpDesk ticket 125989 <-- # @@@ Enter Update below this line. 
@@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125989 ___________________________________________ Short Description: FNDCA KFTP-stkendca2a status report claims to be offline Problem Description: The KFTP-stkendca2a status report at http://fndca.fnal.gov:2288/cellInfo claims that the service is offline. KFTP-stkendca2a gridftp-stkendca2aDomain OFFLINE Yet the Minos raw data logging, via KFTP, seems to be OK. ___________________________________________ This ticket is assigned to NAYMOLA, STAN of the CD-SF/DMS/DSC/SSA. ____________________________________________ Date: Mon, 08 Dec 2008 13:19:20 -0600 (CST) The process was restarted but seems to not have resolved the problem of restarting every three minutes. The log output states that there is a Java exception. I have created Bug # 176 so the d-cache group will take a look at this. Glenn ___________________________________________ Date: Wed, 10 Dec 2008 23:19:40 +0000 (GMT) Since the DCache restart this morning, the KFTP-stkendca2a status is no longer OFFLINE at http://fndca.fnal.gov:2288/cellInfo In fact, it shows a creation time of 12/05 16:10:42 . That is well before the problem was first reported on Monday 8 Dec. Oh well. Our data transfers seem to be running OK. I guess this ticket can be closed. __________________________________________ Date: Mon, 15 Dec 2008 14:56:44 -0600 (CST) Solution: Door properly reported after a reboot of the server. This ticket was resolved by NAYMOLA, STAN of the CD-SF/DMS/DSC/SSA group. ######## # DATA # ######## > They all worked with the exception of 3 > /reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar$ > /reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0005.spill.bcnd.cedar$ > /reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0006.spill.bcnd.cedar$ MINOS26 > grep F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root ../CFL/list.r F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root r-stkendca16a-6 MINOS26 > grep F00032507_0005.spill.bcnd.cedar_phy_bhcurv.0.root ../CFL/list.r F00032507_0005.spill.bcnd.cedar_phy_bhcurv.0.root r-stkendca9a-2 MINOS26 > grep F00032507_0006.spill.bcnd.cedar_phy_bhcurv.0.root ../CFL/list.r MINOS26 > ./dccptest F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root Connected in 0.00s. [Mon Dec 8 09:17:36 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root in cache. MINOS26 > ./dc_stat F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root ============================ PNFS status for /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root -rw-r--r-- 1 rubin e875 39198093 Dec 21 2007 F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root LEVEL 2 2,0,0,0.0,0.0 :c=1:4d900d72;h=yes;l=39198093; LEVEL 4 VOC190 0000_000000000_0002756 39198093 reco_far_cedar_phy_bhcurv_bcnd /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root 000F0000000000000730EA60 CDMS119826912800000 stkenmvr16a:/dev/rmt/tps0d0n:479000017059 217648497 ============================ Tested again, MINOS26 > ./dccptest F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root Connected in 0.00s. [Thu Dec 11 09:07:58 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root in cache. 
Cache open succeeded in 138.29s. 39198093 bytes in 1 seconds (38279.39 KB/sec) MINOS26 > ./dccptest F00032507_0005.spill.bcnd.cedar_phy_bhcurv.0.root Connected in 0.00s. [Thu Dec 11 09:26:42 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0005.spill.bcnd.cedar_phy_bhcurv.0.root in cache. Cache open succeeded in 0.21s. 38144336 bytes in 1 seconds (37250.33 KB/sec) MINOS26 > ./dccptest F00032507_0006.spill.bcnd.cedar_phy_bhcurv.0.root Connected in 0.00s. [Thu Dec 11 09:27:09 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0006.spill.bcnd.cedar_phy_bhcurv.0.root in cache. Cache open succeeded in 0.20s. 46186096 bytes in 1 seconds (45103.61 KB/sec) ___________________________________________ Date: Mon, 08 Dec 2008 15:09:40 -0600 (CST) Subject: HelpDesk ticket 126014 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126014 ___________________________________________ Short Description: FNDCA - three files unavailable via DCache Problem Description: One of the Minos users has reported three files to be unavailable via dccp. Their metadata looks good, and the first two are in recent pool listings. I have verified that a 'dccp' of the first of these gets stuck indefinitely : Under /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/ File Pool listing F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root r-stkendca16a-6 F00032507_0005.spill.bcnd.cedar_phy_bhcurv.0.root r-stkendca9a-2 F00032507_0006.spill.bcnd.cedar_phy_bhcurv.0.root MINOS26 > ./dccptest F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root Connected in 0.00s. [Mon Dec 8 09:17:36 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhc urv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root in cache. ( still stuck as of 15:00 ) Odd, the entry in the Lazy Restore Queue for the first file is dated 11.28, and indicates a pool to pool transfer to a write queue ??? 000F0000000000000730EA60 0.0.0.0/0.0.0.0-*/* r-stkendca16a-6->w-stkendca10a-6 11.28 19:32:25 136 0 Pool2Pool 11.28 19:32:25 /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F0003 2507_0004.spill.bcnd.cedar_phy_bhcurv.0.root ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Thu, 11 Dec 2008 15:29:23 +0000 (GMT) All three of the files are available via DCache now. This ticket can be closed. Thanks ! ___________________________________________ ___________________________________________ ####### # CRL # ####### CRL web page got stuck last Friday, 5 Dec. Mail bounced : Date: Mon, 08 Dec 2008 00:30:46 -0600 (CST) From: Internet Mail Delivery To: kreymer@fnal.gov Subject: Delivery Notification: Delivery has been delayed ... Recipient address: webmaster@crlweb2.fnal.gov Reason: unable to deliver this message after 4 days Delivery attempt history for your mail: Sun, 7 Dec 2008 18:56:20 -0600 (CST) TCP active open: Failed connect() Error: Connection refused ... Wed, 3 Dec 2008 17:32:57 -0600 (CST) TCP active open: Failed connect() Error: Connection refused The mail system will continue to try to deliver your message for an additional 3 days. ... 
On Wed, 3 Dec 2008, Mayly Sanchez wrote:
> We are getting the following error when trying to access the CRL. This
> started happening around 15.15 and ti still not fixed at 17:00.
> Mayly

I have looked at the minos-db1 mysql server.
The server seems to be acting normally,
with several current connections from the CRL web server :

=============================================================================
2008 12 06
=============================================================================

The KFTP-stkendca2a kerberized FTP door is down.
This is needed for Minos raw data archiving.
The last file archived seems to have been
Dec 5 21:29 /pnfs/minos/neardet_data/2008-12/N00015256_0021.mdaq.root

FTP transfer page is empty :
http://fndca3a.fnal.gov/cgi-bin/dcache_files.py

#######
# DAQ #
#######

restarted the ND archiver

Last login: Fri Dec 5 21:03:55 2008 from 131.225.192.193
[minos@daqdcp-nd ~]$ ps xf
  PID TTY   STAT TIME COMMAND
 5701 pts/0 Ss   0:00 -bash
 5737 pts/0 R+   0:00  \_ ps xf
14689 ?     S    0:32 python /home/minos/bin/archiver_krb.py
15678 ?     Z    0:00  \_ [kdestroy]
15681 ?     Z    0:00  \_ [kinit]

[minos@daqdcp-nd ~]$ bin/init/archiver restart
Stopping archiver - try graceful exit first. Please wait ......
Killing archiver with USR1
Starting archiver

[minos@daqdcp-nd ~]$ ps xf
  PID TTY   STAT TIME COMMAND
 5701 pts/0 Ss   0:00 -bash
 6687 pts/0 R+   0:00  \_ ps xf
 6394 pts/0 S    0:09 python /home/minos/bin/archiver_krb.py
 6683 pts/0 Z    0:00  \_ [kinit]

Caught up to Dec 6 15:00 N00015268_0016.mdaq.root
Next cycle was OK at 16:00
[minos@daqdcp-nd ~]$ ls -l -tr /daqdata/archiver/data-archived/ | tail -2
-rw-r--r-- 1 minos e875 0 Dec 6 16:05 N00015268_0017.mdaq.root

On Sunday, restarted the beamdata archiver, after clearing empty PID file.
Need to research archiverstatus.sh scripts, updating web status ?
Ran one of these manually, seemed to work. Why is this not in cron ?

Date: Sun, 07 Dec 2008 14:04:26 +0000 (GMT)
From: Arthur Kreymer
To: kreymer@fnal.gov
Subject: beam_data archiver

08:00 - removed empty pid file, started archiver.

=============================================================================
2008 12 05
=============================================================================

############
# PREDATOR #
############

N00015238_0017.mdaq.root bad .py data needs to be cleared/added

Set the damaged file aside
cd /local/scratch26/kreymer/genpy/neardet_data/2008-11
MINOS26 > dds N00015238_0017*
-rw-r--r-- 1 kreymer g020 4662 Nov 30 01:09 N00015238_0017.log
-rw-r--r-- 1 kreymer g020 1284 Nov 30 01:09 N00015238_0017.sam.py
-rw-r--r-- 1 kreymer g020 1266 Nov 30 01:15 N00015238_0017.sam.pyc
mv N00015238_0017.sam.py N00015238_0017.sam.pybad2

Note that there is a .pyc for this file !
MINOS26 > dds *.pyc
-rw-r--r-- 1 kreymer g020 1266 Nov 30 01:15 N00015238_0017.sam.pyc

MINOS26 > ./predator 2008-11
Looks OK this time.

#########
# VAULT #
#########

Far encp failed, because the first copy was complete and clean on 3 Dec.
Removed working files
rm -r /local/scratch26/kreymer/SHEEP/fardet_data/2008-11

###########
# MINOS25 #
###########

Date: Fri, 05 Dec 2008 19:28:29 +0000 (GMT)
From: Arthur Kreymer
To: Ling C. Ho
Cc: minos-admin@fnal.gov, rbpatter@fnal.gov
Subject: Re: minos25 hardware swap tomorrow, 5 Dec at 08:00

Update on kcron / aklog problem.
I have tested this on minos27 and minos25 ( both recently installed. )
After doing kcron, aklog seems to succeed,
but does not provide a working AFS token.
aklog with a normal ticket MINOS25 > kinit kreymer MINOS25 > /usr/krb5/bin/aklog MINOS25 > tokens Tokens held by the Cache Manager: User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Dec 6 15:26] --End of list-- aklog with a kcron ticket MINOS25 > kcron MINOS25 > klist Ticket cache: /tmp/krb5cc_1060_L11665 Default principal: kreymer/cron/minos25.fnal.gov@FNAL.GOV Valid starting Expires Service principal 12/05/08 13:27:12 12/05/08 23:27:12 krbtgt/FNAL.GOV@FNAL.GOV MINOS25 > /usr/krb5/bin/aklog MINOS25 > echo $? 0 MINOS25 > tokens Tokens held by the Cache Manager: Tokens for afs@fnal.gov [Expires Dec 5 23:27] --End of list-- ================================================================ 13:31 Trying mengel workaround from FNALU, * * * * * /usr/krb5/bin/kcron "/usr/krb5/bin/aklog ; ${HOME}/minos/scripts/crontestark" 13:32 - still no good, removed aklog from crontestark Trying again interactively MINOS25 > aklog -d Authenticating to cell fnal.gov (server fsus01.fnal.gov). Trying to authenticate to user's realm FNAL.GOV. Getting tickets: afs/fnal.gov@FNAL.GOV We've deduced that we need to authenticate to realm FNAL.GOV. Getting tickets: afs/fnal.gov@FNAL.GOV Getting tickets: afs/fnal.gov@FNAL.GOV Getting tickets: afs@FNAL.GOV Using Kerberos V5 ticket natively About to resolve name kreymer.cron to id in cell fnal.gov. Id 32766 Set username to kreymer.cron Setting tokens. kreymer.cron / @ FNAL.GOV MINOS25 > tokens Tokens held by the Cache Manager: Tokens for afs@fnal.gov [Expires Dec 5 23:27] --End of list-- Success on minos26 looks like : MINOS26 > aklog -d Authenticating to cell fnal.gov (server fsus01.fnal.gov). We've deduced that we need to authenticate to realm FNAL.GOV. Getting tickets: afs/@FNAL.GOV endTime = 1228540648, Fri Dec 5 23:17:28 2008 About to resolve name kreymer to id in cell fnal.gov. Id 1060 Set username to AFS ID 1060 Setting tokens. AFS ID 1060 / @ FNAL.GOV Test 64bit node fnpc344, SLF 4.5 Getting tickets: afs/@FNAL.GOV endTime = 1229144749, Fri Dec 12 23:05:49 2008 About to resolve name kreymer to id in cell fnal.gov. Id 1060 Set username to AFS ID 1060 Setting tokens. AFS ID 1060 / @ FNAL.GOV touch /afs/fnal.gov/files/home/room1/kreymer/maint/touchafs rm /afs/fnal.gov/files/home/room1/kreymer/maint/touchafs Trying a copy of aklog.slf45 on minos25: MINOS25 > ./aklog.slf45 -d Authenticating to cell fnal.gov (server fsus01.fnal.gov). We've deduced that we need to authenticate to realm FNAL.GOV. Getting tickets: afs/@FNAL.GOV endTime = 1228802052, Mon Dec 8 23:54:12 2008 About to resolve name kreymer to id in cell fnal.gov. Id 1060 Set username to AFS ID 1060 Setting tokens. AFS ID 1060 / @ FNAL.GOV aklog.slf45: unable to obtain tokens for cell fnal.gov (status: 11862788). DISABLED AKLOG AND VOMSES CODE IN KPROXY, TILL THIS IS FIXED edited /local/scratch25/grid/kproxy CHECKING FNALU MIN > for NODE in ${UNODES} ; do printf "${NODE} " ; ssh -ax ${NODE} 'cat /etc/redhat-release' ; done flxi02 Scientific Linux Fermi LTS release 4.4 (Wilson) flxi03 Scientific Linux Fermi LTS release 4.4 (Wilson) flxi04 Scientific Linux Fermi LTS release 4.5 (Wilson) flxi05 Scientific Linux Fermi LTS release 4.5 (Wilson) OK flxi06 Scientific Linux SLF release 5.1 (Lederman) NA flxi07 Scientific Linux Fermi LTS release 4.4 (Wilson) OK flxi09 Scientific Linux Fermi LTS release 4.5 (Wilson) OK Ling will pursue getting an SFL 4.7 aklog. can reproduce the problem on minos25 with ling account. 
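The whole kcron check above boils down to a few lines that could be rerun on any node ( a sketch of the by-hand procedure above ; the testafs path is my own home area, as used earlier ) :

    # does a kcron-derived token actually grant AFS write access ?
    TFILE=/afs/fnal.gov/files/home/room1/kreymer/testafs.kcron
    kcron                   # get the /cron/ principal
    /usr/krb5/bin/aklog     # convert it to an AFS token
    tokens                  # healthy output shows 'AFS ID 1060', broken shows only kreymer.cron
    if touch ${TFILE} 2>/dev/null ; then
        echo "kcron token OK" ; rm ${TFILE}
    else
        echo "kcron token BROKEN - no AFS write access"
    fi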
minos27 is having other problems, kcron fails a random 1/2 the time : MINOS27 > kcron MINOS27 > kcron kinit: Preauthentication failed while getting initial credentials MINOS27 > kcron kinit: Preauthentication failed while getting initial credentials MINOS27 > kcron kinit: Preauthentication failed while getting initial credentials MINOS27 > kcron MINOS27 > kcron ######## # FARM # ######## rm /minos/data/minfarm/roundup/STOP.LOOPER Inspect bad file : -rw-r--r-- 1 minospro e875 0 Dec 4 01:19 /pnfs/minos/mcout_data/cedar_phy_linfix/near/daikon_00/L010185N/mrnt_data/108/n13011086_0000_L010185N_D00.mrnt.cedar_phy_linfix.0.root ls -l /minos/data/minfarm/WRITE/n13011086_0000_L010185N_D00.mrnt.cedar_phy_linfix.0.root -rw-r--r-- 1 minfarm e875 211699277 Dec 3 20:25 minospro@minos26 rm /pnfs/minos/mcout_data/cedar_phy_linfix/near/daikon_00/L010185N/mrnt_data/108/n13011086_0000_L010185N_D00.mrnt.cedar_phy_linfix.0.root ./looper '-r cedar_phy_linfix mcnear' & [1] 19786 OK - processing /minos/data/minfarm/mcnearcat version 20081126 Fri Dec 5 13:06:50 CST 2008 PURGING WRITE files 496 ####### # DAQ # ####### Predator fails on N00015253_0012.mdaq.root Thu Dec 4 23:09:36 UTC 2008 0 length -rw-r--r-- 1 buckley e875 0 Dec 4 05:09 N00015253_0012.mdaq.root -rw-r--r-- 1 buckley e875 0 Dec 4 02:08 B081204_000001.mbeam.root rm /pnfs/minos/neardet_data/2008-12/N00015253_0012.mdaq.root rm /pnfs/minos/beam_data/2008-12/B081204_000001.mbeam.root Need to manually restart the archivers ? And update email address in archiver restart scripts to minos-data. ########### # MINOS25 # ########### MINOS25 > condor_off -fast minos25 -master Sent "Kill-Daemon-Fast" command for "master" to master minos25.fnal.gov MINOS25 > condor_status CEDAR:6001:Failed to connect to <131.225.193.25:9618> Error: Couldn't contact the condor_collector on minos25.fnal.gov. MINOS25 > ps -flu condor F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 5 S condor 29187 1 0 76 0 - 2685 - Dec01 ? 00:23:29 /opt/condor/sbin/condor_master MINOS25 > date Fri Dec 5 08:04:16 CST 2008 08:12 - ling shuttind down both systems for the swap. 239047.0 tinti job that completed ?, yes, logs look OK 10:00 rbpatter started up gfactory. 11:09 - released 6 pawloski jobs 11:11 - pawloski jobs start to run pawloski 6 0 565 12/4 13:39 0+03:03:31 paloon.sh release 100 scavan jobs SJOBS=`condor_q -hold scavan | grep scavan | head -100 | cut -f 1 -d ' '` 11:15 for JOB in ${SJOBS} ; do condor_release ${JOB} ; done Released another 100 pawloski SJOBS=`condor_q -hold pawloski | grep pawloski | head -100 | cut -f 1 -d ' '` 11:21 for JOB in ${SJOBS} ; do condor_release ${JOB} ; done 11:33 condor_release pawloski scavan released his own jobs, apparently ######## # DATA # ######## rubin@fnpcsrv1 cut/paste shrc/kreymer export PRODUCTS=/afs/fnal.gov/files/code/e875/general/ups/db setup encp v3_7d -q stken CARROT=/pnfs/minos/mcout_data/cedar/near/carrot_06/L010185 STREAM=cand_data BADDIR=${CARROT}/${STREAM} BFILES=`ls -C1 ${BADDIR}` printf "${BFILES}\n" | wc -l 7457 for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` echo ${SDIR} done | sort -u | wc -l 389 100 through 599 cd ${BADDIR} NMOV=0 date for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` [ ! 
-d "${SDIR}" ] && printf " MAKING ${SDIR}\n" && mkdir ${SDIR} (( NMOV++ )) enmv ${FILE} ${SDIR}/${FILE} printf "\r ${NMOV} ${FILE}" sleep 1 done printf "\n" date Fri Dec 5 11:45:18 CST 2008 MAKING 100 7 n11001009_0000_L010185.cand.cedar.root MAKING 101 17 n11001019_0000_L010185.cand.cedar.root MAKING 102 27 n11001029_0000_L010185.cand.cedar.root MAKING 103 ... ============================================================================= 2008 12 04 ============================================================================= ######### # STAGE # ######### Restarted staging ; Dropped limit to 500 files, slow down to 5 sec/file. FVOLS="VOB594 VOB990" { for VOL in ${FVOLS} ; do ./stage -w -p 5 -s cedar_phy_bhcurv/.bcnd_data ${VOL} done ; } >> /minos/scratch/kreymer/log/stage/cpbcnd3.log 2>&1 & Good, not all files are needed on this pass. ######## # DATA # ######## Create subdirectories for carrot_06 files rubin@fnpcsrv CARROT=/pnfs/minos/mcout_data/cedar/near/carrot_06/L010185 ( for STREAM in snts_data cand_data sntp_data ; do ) STREAM=snts_data BADDIR=${CARROT}/${STREAM} BFILES=`ls -C1 ${BADDIR}` printf "${BFILES}\n" | wc -l 802 for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` echo ${SDIR} done | sort -u | wc -l 93 cd ${BADDIR} for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` [ ! -d "${SDIR}" ] && mkdir ${SDIR} mv ${FILE} ${SDIR}/${FILE} ; usleep 100000 done $ find . -type f | wc -l 802 Now do use enmv to correct this metadata for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` enmv ${SDIR}/${FILE} ${SDIR}/${FILE} ; sleep 1 done REVISED FOR ENMV : CARROT=/pnfs/minos/mcout_data/cedar/near/carrot_06/L010185 STREAM=sntp_data BADDIR=${CARROT}/${STREAM} BFILES=`ls -C1 ${BADDIR}` printf "${BFILES}\n" | wc -l 7457 for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` echo ${SDIR} done | sort -u | wc -l 389 100 through 599 cd ${BADDIR} NMOV=0 date for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` [ ! -d "${SDIR}" ] && printf " MAKING ${SDIR}\n" && mkdir ${SDIR} (( NMOV++ )) enmv ${FILE} ${SDIR}/${FILE} printf "\r ${NMOV} ${FILE}" sleep 1 done printf "\n" date Thu Dec 4 16:03:07 CST 2008 # N.B. - this is running about 3 seconds/file But it has sped up to 1 sec/file in dirctory 400 MAKING 598 3832 n13015989_0000_L010185.sntp.cedar.root MAKING 599 7457 n13025999_0000_L010185.sntp.cedar.rootfnpcsrv1$ printf "\n" fnpcsrv1$ date Thu Dec 4 19:53:16 CST 2008 fnpcsrv1$ Connection to fnpcsrv1 closed. ####### # DAQ # ####### /home/minos/bin/archiverstatus.sh changed mail form buckley to minos-data $ crontab -l 1 * * * * /home/minos/bin/archiverstatus.sh > /dev/null 2>&1 # Update the #pot. This is a bit buggy still! # 0 0,8,16 * * * /home/minos/BD/R1.16/BeamData/ana/bv/run_npot.sh # Run the DBU job 10 minutes after every hour. It will just exit if there is nothing to do. 
20 * * * * /bin/bash /home/minos/BD/dbu/BeamDataDbi/scripts/run_bdbu_fnal_cron.sh ########## # DCACHE # ########## http://www-numi.fnal.gov/computing/dh/ftplog/2008/12/03.txt FTP got slow : 115 Wed Dec 3 18:05:01 CST 2008 557 6 Wed Dec 3 19:56:38 CST 2008 557 7 Wed Dec 3 20:06:45 CST 2008 557 21 Wed Dec 3 20:17:06 CST 2008 557 11 Wed Dec 3 20:27:17 CST 2008 557 24 Wed Dec 3 20:37:41 CST 2008 557 5 Wed Dec 3 20:47:47 CST 2008 557 48 Wed Dec 3 20:58:35 CST 2008 557 5 Wed Dec 3 21:08:40 CST 2008 557 75 Wed Dec 3 21:19:55 CST 2008 557 65 Wed Dec 3 21:31:00 CST 2008 557 244 Wed Dec 3 21:45:04 CST 2008 557 392 Wed Dec 3 22:01:36 CST 2008 557 13 Wed Dec 3 22:11:49 CST 2008 557 10 Wed Dec 3 22:21:59 CST 2008 557 1781 Wed Dec 3 23:01:40 CST 2008 557 220 Wed Dec 3 23:15:20 CST 2008 557 373 Wed Dec 3 23:31:33 CST 2008 557 644 Wed Dec 3 23:52:17 CST 2008 557 http://www-numi.fnal.gov/computing/dh/ftplog/2008/12/04.txt Failing now : 40 Thu Dec 4 00:02:57 CST 2008 557 1418 Thu Dec 4 00:36:35 CST 2008 557 110 Thu Dec 4 00:48:25 CST 2008 557 64 Thu Dec 4 00:59:29 CST 2008 557 45 Thu Dec 4 01:10:14 CST 2008 557 76 Thu Dec 4 01:21:30 CST 2008 557 1330 Thu Dec 4 01:53:40 CST 2008 557 3603 Thu Dec 4 03:03:43 CST 2008 1 3602 Thu Dec 4 04:13:45 CST 2008 1 3603 Thu Dec 4 05:23:48 CST 2008 1 3603 Thu Dec 4 06:33:51 CST 2008 1 3602 Thu Dec 4 07:43:53 CST 2008 1 Date: Thu, 04 Dec 2008 03:05:02 -0600 From: MINOS DAQ To: carl.metelko@stfc.ac.uk, geoff.pearce@stfc.ac.uk, kreymer@fnal.gov, miller@sudan.umn.edu, saranen@sudan.umn.edu Subject: FARDAQ: 1 file(s) waiting more than 1h for archival Predator is failing to read recent files. Starting with N00015253_0008.mdaq.root Thu Dec 4 05:09:22 UTC 2008 ( 23:09 CST ) Continuing with repeated failures, N00015253_0009.mdaq.root N00015253_0011.mdaq.root N00015253_0012.mdaq.root MINOS26 > ./dc_stat /pnfs/minos/beam_data/2008-12/B081203_080002.mbeam.root ============================ PNFS status for /pnfs/minos/beam_data/2008-12/B081203_080002.mbeam.root -rw-r--r-- 1 buckley e875 8068 Dec 3 18:09 B081203_080002.mbeam.root LEVEL 2 2,0,0,0.0,0.0 :c=1:a4ccb18c;h=yes;l=8068; LEVEL 4 VOC009 0000_000000000_0000270 8068 beam_data /pnfs/fnal.gov/usr/minos/beam_data/2008-12/B081203_080002.mbeam.root 000F000000000000089499D8 CDMS122834935400000 stkenmvr25a:/dev/rmt/tps0d0n:479000022613 2236133771 Cannot read this file from DCache. MINOS26 > ./dccptest /pnfs/minos/beam_data/2008-12/B081203_080002.mbeam.root Connected in 0.00s. Try another file : ./dccptest n13047018_0029_L010185N_D04.reroot.root Connected in 0.00s. failed also. /pnfs/minos/mcin_data/near/daikon_04/L010185N/701/n13047018_0029_L010185N_D04.reroot.root Vaults are also apparently stuck : tail ~/minos/log/rawcopy/near/2008-11.log neardet_data.2008-11.8.tar N00015143_0001.mdaq.root to N00015147_0000.mdaq.root ................... 
MINOS26 > ls -l /local/scratch26/kreymer/SHEEP/neardet_data/2008-11/ total 11974696 -rw-r--r-- 1 kreymer g020 1758433280 Dec 3 23:12 neardet_data.2008-11.1.tar -rw-r--r-- 1 kreymer g020 1733775360 Dec 3 23:42 neardet_data.2008-11.2.tar -rw-r--r-- 1 kreymer g020 1691781120 Dec 4 00:03 neardet_data.2008-11.3.tar -rw-r--r-- 1 kreymer g020 1766010880 Dec 4 00:34 neardet_data.2008-11.4.tar -rw-r--r-- 1 kreymer g020 1783511040 Dec 4 00:51 neardet_data.2008-11.5.tar -rw-r--r-- 1 kreymer g020 1726822400 Dec 4 01:04 neardet_data.2008-11.6.tar -rw-r--r-- 1 kreymer g020 1789736960 Dec 4 01:24 neardet_data.2008-11.7.tar MINOS26 > ls -l /var/tmp/rawcopy/TARWORK/ total 1557196 -rw-r--r-- 1 kreymer g020 117972943 Dec 4 01:29 N00015143_0001.mdaq.root -rw-r--r-- 1 kreymer g020 89554683 Dec 4 01:32 N00015143_0002.mdaq.root -rw-r--r-- 1 kreymer g020 86609970 Dec 4 01:32 N00015143_0003.mdaq.root -rw-r--r-- 1 kreymer g020 76573633 Dec 4 01:32 N00015143_0004.mdaq.root -rw-r--r-- 1 kreymer g020 87348917 Dec 4 01:32 N00015143_0005.mdaq.root -rw-r--r-- 1 kreymer g020 76753020 Dec 4 01:32 N00015143_0006.mdaq.root -rw-r--r-- 1 kreymer g020 89123053 Dec 4 01:33 N00015143_0007.mdaq.root -rw-r--r-- 1 kreymer g020 276591749 Dec 4 01:38 N00015144_0000.mdaq.root -rw-r--r-- 1 kreymer g020 13252439 Dec 4 01:43 N00015145_0000.mdaq.root -rw-r--r-- 1 kreymer g020 81400956 Dec 4 01:44 N00015146_0000.mdaq.root -rw-r--r-- 1 kreymer g020 90560856 Dec 4 01:46 N00015146_0001.mdaq.root -rw-r--r-- 1 kreymer g020 80464850 Dec 4 01:50 N00015146_0002.mdaq.root -rw-r--r-- 1 kreymer g020 90798583 Dec 4 01:54 N00015146_0003.mdaq.root -rw-r--r-- 1 kreymer g020 80166536 Dec 4 01:59 N00015146_0004.mdaq.root -rw-r--r-- 1 kreymer g020 90765663 Dec 4 02:05 N00015146_0005.mdaq.root -rw-r--r-- 1 kreymer g020 77672824 Dec 4 02:11 N00015146_0006.mdaq.root -rw-r--r-- 1 kreymer g020 87265249 Dec 4 02:30 N00015146_0007.mdaq.root There were about 900 pawloski jobs, accessing files in MINOS26 > grep cand_data/ ../CFL/CFL | wc -l 7450 MINOS26 > grep /pnfs/minos/mcout_data/cedar/near/carrot_06/L010185/sntp_data/ ../CFL/CFL | wc -l 7455 MINOS26 > grep /pnfs/minos/mcout_data/cedar/near/carrot_06/L010185/snts_data/ ../CFL/CFL | wc -l 802 MINOS26 > grep /pnfs/minos/mcout_data/cedar/near/carrot_06/L010185/mrnt_data/ ../CFL/CFL | wc -l 0 This probably was the prime cause of the PNFS overload. We must break up these files into more directories. =============================================================== Killing off all activity : MINOS26 > ps xf PID TTY STAT TIME COMMAND 29193 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/vault_monthly 29195 ? S 0:00 \_ /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/vault_monthly 2377 ? S 0:00 \_ /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/vault near 2008-11 2487 ? S 0:00 \_ /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/rawcopy neardet_data/2008-11 19740 ? S 0:00 \_ dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2008-11/N00015146_0009.mdaq.root /var/tmp/rawcopy/TARWORK/N00015146_0009.mdaq. 
MINOS26 > kill 29193 MINOS26 > kill 29195 MINOS26 > kill 2377 MINOS26 > kill 2487 MINOS26 > kill 19740 MINOS26 > crontab -r kill %2 ( kills off prestage ) Updated MINOS CD status page mindata@minos26 11:54 crontab -r minfarm@fnpcsrv1 11:55 mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT touch /minos/data/minfarm/roundup/STOP.LOOPER 12:01 =============================================================== http://fndca3a.fnal.gov:2288/context/transfers.html found unusual activity , authentication since Thu Dec 04 11:14:32 CST 2008 KFTP-stkendca2a-Unknown-31478 kerberizedftpdoor-stkendca2aDomain 38532 GFtp-1 1602 0 null N.N. bzora1.fnal.gov checking permissions via permission handler 07:09:13 Staging and many more like these, from various ExpDbWritePools clients 11:14 - 7:09 = 04:05, well after the problem started. The latest FTP tranfer indicated at http://fndca3a.fnal.gov/cgi-bin/dcache_files.py oracle(1602.2752) 2008-12-3 16:00:23 =============================================================== Date: Thu, 04 Dec 2008 09:35:54 -0600 (CST) Subject: HelpDesk ticket 125810 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125810 ___________________________________________ Short Description: FNDCA RawDataWritePools severe problems Problem Description: There seems to be a global DCache problem, which started last night. Kerberized ftp writes of Minos raw data to RawDataWritePools have been failing since about 03:00. dccp from both RawDataWritePools and readPools are failing, for all files that I have tried We were doing the monthly raw data safetly copies at the time of the failure The last successful dccp by this script was on minos26, to -rw-r--r-- 1 kreymer g020 87265249 Dec 4 02:30 N00015146_0007.mdaq.root We were doing regulated dccp -P prestages of files, which got stuck soon after this status report : File/needed Time stamp Restore queue depth 5181/6794 Thu Dec 4 02:11:30 CST 2008 queue=1351/2000 Anonymous ftp reads were very slow last night, then failed altogether. Here are detailed logs of some access attempts. Elapsed Attempted ftp at Bytes returned sec 115 Wed Dec 3 18:05:01 CST 2008 557 6 Wed Dec 3 19:56:38 CST 2008 557 7 Wed Dec 3 20:06:45 CST 2008 557 21 Wed Dec 3 20:17:06 CST 2008 557 11 Wed Dec 3 20:27:17 CST 2008 557 24 Wed Dec 3 20:37:41 CST 2008 557 5 Wed Dec 3 20:47:47 CST 2008 557 48 Wed Dec 3 20:58:35 CST 2008 557 5 Wed Dec 3 21:08:40 CST 2008 557 75 Wed Dec 3 21:19:55 CST 2008 557 65 Wed Dec 3 21:31:00 CST 2008 557 244 Wed Dec 3 21:45:04 CST 2008 557 392 Wed Dec 3 22:01:36 CST 2008 557 13 Wed Dec 3 22:11:49 CST 2008 557 10 Wed Dec 3 22:21:59 CST 2008 557 1781 Wed Dec 3 23:01:40 CST 2008 557 220 Wed Dec 3 23:15:20 CST 2008 557 373 Wed Dec 3 23:31:33 CST 2008 557 644 Wed Dec 3 23:52:17 CST 2008 557 40 Thu Dec 4 00:02:57 CST 2008 557 1418 Thu Dec 4 00:36:35 CST 2008 557 110 Thu Dec 4 00:48:25 CST 2008 557 64 Thu Dec 4 00:59:29 CST 2008 557 45 Thu Dec 4 01:10:14 CST 2008 557 76 Thu Dec 4 01:21:30 CST 2008 557 1330 Thu Dec 4 01:53:40 CST 2008 557 3603 Thu Dec 4 03:03:43 CST 2008 1 3602 Thu Dec 4 04:13:45 CST 2008 1 3603 Thu Dec 4 05:23:48 CST 2008 1 3603 Thu Dec 4 06:33:51 CST 2008 1 3602 Thu Dec 4 07:43:53 CST 2008 1 ___________________________________________ This ticket is assigned to JONES, TERRY of the CD-SF/DMS/DSC/SSA. 
___________________________________________ Date: Thu, 04 Dec 2008 12:33:21 -0600 From: Timur Perelmutov These are pnfs problem, it can not handle the increased load. We requested to upgrade the pnfs software that should allow it to perform under increased load. We are waiting for the reply from SSA group. ___________________________________________ Date: Thu, 04 Dec 2008 12:56:58 -0600 From: Timur Perelmutov Service should be back up again. ___________________________________________ As of 13:40, I have successfully tested dccp srmls ftp =============================================================== Restating activity SRV1> mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok mindata@minos26 $ crontab crontab.dat kreymer@minos26 crontab crontab.dat rm -r /local/scratch26/kreymer/SHEEP/neardet_data/2008-11 rm /var/tmp/rawcopy/TARWORK/*.root hacked crontab to run vault tonight ============================================================================= 2008 12 03 ============================================================================= ####### # CRL # ####### Date: Wed, 03 Dec 2008 16:58:48 -0600 From: Mayly Sanchez To: webmaster@crlweb2.fnal.gov Cc: Robert Bernstein , Bob Zwaska , Arthur Kreymer Subject: CRL is dead Parts/Attachments: 1 OK 29 lines Text 2 Shown ~44 lines Text ---------------------------------------- Hi,  We are getting the following error when trying to access the CRL. This started happening around 15.15 and ti still not fixed at 17:00.   Mayly  OK The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, webmaster@crlweb2.fnal.gov and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log. ____________________________________________________________________________ Apache/2.0.46 (Scientific Linux) Server at crlweb2.fnal.gov Port 80 Mysql> mysqladmin processlist -u root | grep crlweb | 57308859 | crl | crlweb.fnal.gov:38519 | crl_v1 | Sleep | 16892388 | 57308860 | crl | crlweb.fnal.gov:38520 | crl_v1 | Sleep | 16892388 | 110774986 | crl | crlweb.fnal.gov:53489 | crl_v1 | Sleep | 12257425 | 110774987 | crl | crlweb.fnal.gov:53490 | crl_v1 | Sleep | 12257425 | 110774988 | crl | crlweb.fnal.gov:53491 | crl_v1 | Sleep | 12257425 | 189141425 | crl | crlweb.fnal.gov:47178 | crl_v1 | Sleep | 2489 | 189143126 | crl | crlweb.fnal.gov:47180 | crl_v1 | Sleep | 1585 | 189143129 | crl | crlweb.fnal.gov:47181 | crl_v1 | Sleep | 1583 | 189144484 | crl | crlweb.fnal.gov:47183 | crl_v1 | Sleep | 680 | 189144487 | crl | crlweb.fnal.gov:47184 | crl_v1 | Sleep | 678 ########## # CONDOR # ########## Date: Wed, 03 Dec 2008 14:51:37 -0600 (CST) Subject: HelpDesk ticket 125792 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125792 ___________________________________________ Short Description: minos25 condor configuration file update Problem Description: run2-sys : Please update the minos25 local configuration file, /opt/condor-7.0.1/local/condor_config.local to have the content of /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/ condor_config.local.minos25.20081203 This is not urgent. This change reduces the rate at which our Grid jobs start, to prevent the global SAZ overloads seem last weekend. 
___________________________________________ Date: Wed, 03 Dec 2008 15:10:39 -0600 (CST) This ticket has been reassigned to SIMMONDS, EDWARD of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 03 Dec 2008 15:33:38 -0600 (CST) Solution: esimm@fnal.gov sent this solution: Done. I did not restart any services. ___________________________________________ ___________________________________________ ####### # SRM # ####### Testing publicly useable srm on my desktop, using mcimport as a model. MIN > scp minfarm@fnpcsrv1:/local/globus/minfarm/.grid/kreymer-production.proxy . MIN > cd .grid . /minos/scratch/app/OSG1/setup.sh export X509_USER_PROXY=/home/kreymer/.grid/kreymer-production.proxy srmls srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport 0 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/ 512 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/howcroft/ 512 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/kordosky/ ########### # BLUEARC # ########### Monitoring showed no failures, delays on minos-sam03 Wed Dec 3 04:49:13 CST 2008 SLO N00013822_0000.spill.sntp.cedar_phy_bhcurv.0.root 600 minos26 Wed Dec 3 04:49:28 CST 2008 SLO N00013289_0000.spill.sntp.cedar_phy_bhcurv.0.root 625 After bluearc outage 04:30 minfarm@fnpcsrv1 mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok kreymer@minos26 crontab crontab.dat ######## # DATA # ######## On Wed, 3 Dec 2008, Steven Cavanaugh wrote: > I am running grid jobs which dccp a number of files which have all been > prestaged. The problem is that not all the files were prestaged. Short story : I need to spend about 1/2 hour modifying scripts to get a list of volumes containing these files, so that I can stage the rest. Longer story : about half these .bcnd files got written to the generic 'minos' file family, not the reco_far_cedar_phy_bhcurv_bcnd family. My original tape list was therfore incomplete. I think that I can get an accurate tape list using SAM. With 25K files to check, 5 second queries are too slow. I can adapt an existing python script to get tape data with one query. I'm not sure if it helps (I don't have tape information), but I have a file with a list of all of the files that I need: /minos/scratch/scavan/mrcc_trimmer/final/bcnd_far_cedar_phy_bhcurv wc -l /minos/scratch/scavan/mrcc_trimmer/final/bcnd_far_cedar_phy_bhcurv 22981 wc -l /tmp/CPBVOL.lis 22963 /tmp/CPBVOL.lis MINOS26 > sort -u /tmp/CPBVOL.lis vob235 vob570 vob594 vob990 voc190 voc193 voh334 Volumes previously restored were VOC190 VOC193 VOH334 VOK485 New volumes are FVOLS="VOB235 VOB570 VOB594 VOB990" Why does VOK485 not show up in this list ? Files seem to be in 2007-11, not declared to SAM. VFILES=`enstore info --list=VOK485 | grep cedar_phy_bhcurv | cut -f 10 -d /` for FILE in ${VFILES} ; do sam locate ${FILE} ; done Datafile with name 'F00039984_0011.spill.bcnd.cedar_phy_bhcurv.0.root' not found. ... Datafile with name 'F00039987_0011.spill.bcnd.cedar_phy_bhcurv.0.root' not found. All files are missing from SAM. But they were prestaged already. Let's pick up the remaining files : { for VOL in ${FVOLS} ; do ./stage.20081125 -w -p 1 -s cedar_phy_bhcurv/.bcnd_data ${VOL} done ; } >> /minos/scratch/kreymer/log/stage/cpbcnd2.log 2>&1 & Hmmm, this looks grim. Encp history http://www-stken.fnal.gov/enstore/encp_enstore_system.html shows 1 minute between file copies from VOB570, though the files are in tape-order. Almost all time is being spent seeking. 
9940B27.mover 2937 xfers 5878 - 11:05:20 5879 11:05:42 5883 11:07:55 9940B27.mover alive : HAVE BOUND volume (VOB570) - IDLE stkenmvr27a 2008-Dec-03 11:06:28 Completed Transfers 5882 Failed Transfers 0 Last Read (bytes) 29,878,277 Volume VOC197 Last Write (bytes) 29,878,082 Location Cookie 45 /pnfs/fs/usr/.(access)(000F00000000000006E55678) --> stkendca14a.fnal.gov:/diskc/read-pool-5/data/000F00000000000006E55678 Stared at ENTV display on WH8E for a while. Seek times vary from 2 to 60 seconds. Average is 20 MINOS26 > echo '(38 + 44 + 6 + 12 + 17 + 10 + 60 + 30 + 10 + 12 + 2 + 8 + 23 ) / 13' | bc 20 08:43 - grinding away on VOB594, 5181/6794 Thu Dec 4 02:11:30 CST 2008 queue=1351/2000 ######### # STAGE # ######### < Changed test for exiting file from using level 2 ( no longer valid ) < to grepping through a consolidated file dump POOLFILES. ln -sf stage.20081203 stage # was stage.20071012 ############# # SAMLOCATE # ############# Adding -t option, for tape label Steal code from http://d0db-prd.fnal.gov/rexipedia/illingworth/add_pnfs_location.py SAMDIM="FILE_NAME F00030617_0002.spill.bcnd.cedar_phy_bhcurv.0.root" ./samlocate "${SAMDIM}" F00030617_0002.spill.bcnd.cedar_phy_bhcurv.0.root /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-04 SAMDIM=" DATA_TIER bcnd-far and VERSION cedar.phy.bhcurv " ./samlocate -t "${SAMDIM}" | tee /tmp/CPBVOL.lis wc -l /tmp/CPBVOL.lis 22963 /tmp/CPBVOL.lis ######### Date: Tue, 02 Dec 2008 19:15:50 -0600 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron /usr/krb5/bin/kcron ${HOME}/minos/scripts/cfl Newline appended Found the end of CFL with partial file name : minos reco_near_cedar_phy_bhcurv_cand VO9747 0000_000000000_0000453 CDMS119569487200000 326877613 4165507 /pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2007-04/N00012063_0020.spil not the customary (1832538 rows) File: `CFL' Size: 304196973 Blocks: 594136 IO Block: 4096 regular file Device: 18h/24d Inode: 27722344 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 1060/ kreymer) Gid: ( 1525/ g020) Access: 2008-12-08 19:16:51.574824459 -0600 Modify: 2008-12-08 19:16:51.485102646 -0600 Change: 2008-12-08 19:16:51.485102646 -0600 ######## # FARM # ######## Test linfix concat, do just one run ./roundup -s n11011001 -r cedar_phy_linfix mcnear OK - processing /minos/data/minfarm/mcnearcat version 20081126 SELECT files containing n11011001 Wed Dec 3 09:35:09 CST 2008 Wed Dec 3 09:50:11 CST 2008 SADD less +F /home/minfarm/ROUNTMP/LOG/saddreco/daikon_00/cedar_phy_linfix/near_L010185N.log Wed Dec 3 11:21:09 CST 2008 Declared thousands of candidates Start up looper on linfix ./looper '-r cedar_phy_linfix mcnear' & OK - processing /minos/data/minfarm/mcnearcat version 20081126 Wed Dec 3 11:40:47 CST 2008 Traceback (most recent call last): File "/home/minfarm/scripts/samdup", line 162, in ? 
SUB = FILE.strip().split('_')[1].split('.')[0] IndexError: list index out of range SRV1> ls /minos/data/minfarm/mcnearcat | grep -v .root marker_end marker_start wc SRV1> dds /minos/data/minfarm/mcnearcat/mar* -rw-rw-r-- 1 rubin e875 0 Apr 29 2008 /minos/data/minfarm/mcnearcat/marker_end -rw-rw-r-- 1 rubin e875 0 Apr 29 2008 /minos/data/minfarm/mcnearcat/marker_start SRV1> dds /minos/data/minfarm/mcnearcat/w* -rw-r--r-- 1 asousa e875 18496 Oct 14 08:37 /minos/data/minfarm/mcnearcat/wc SRV1> mv /minos/data/minfarm/mcnearcat/mar* /minos/data/minfarm/maint/ SRV1> mv /minos/data/minfarm/mcnearcat/wc /minos/data/minfarm/maint/ ============================================================================= 2008 12 02 ============================================================================= ######## # FARM # ######## Prepare for linfix concat, do just one run ./roundup -n -s n11011001 -r cedar_phy_linfix mcnear Tue Dec 2 16:28:10 CST 2008 ... OK adding n11011001_0001_L010185N_D00.sntp.cedar_phy_linfix.0.root 8 OK adding n11011001_0010_L010185N_D00.sntp.cedar_phy_linfix.0.root 1 HADD rate 0 Kbytes/second Tue Dec 2 16:49:32 CST 2008 ... ######## # DATA # ######## Staged a test copy , using door 0 ./dccptest n13047018_0029_L010185N_D04.reroot.root 24125 15:30:30 Checked transfer page, see nothing for dcap00 or minos26 as of 15:33 Showed up at 15:34:17 DCap00-stkendca2a-unknow-113 dcap00-stkendca2aDomain 2 dcap-3 1060 30485 000F000000000000074C0348 N.N. minos26.fnal.gov WaitingForGetPool 00:02:08 Staging ######## # DATA # ######## Date: Tue, 02 Dec 2008 14:37:50 -0600 From: Keith Chadwick To: fermigrid-announce@fnal.gov Subject: FermiGrid BlueArc filesystems... The BlueArc filesystems (/grid/app, /grid/data, /grid/home, etc.) appear to have "burped" across all of FermiGrid about 5 minutes ago. We are investigating... -Keith. ------------------------------------------------ Date: Tue, 02 Dec 2008 14:47:40 -0600 From: Andrew J. Romero To: "'site-nas-announce@fnal.gov'" Subject: BlueArc node RHEA-1 crashed at ~2:00pm today The following EVSs failed over to RHEA-2 blue2 minos-nas-0 and appear to be operating normally I will provide more information soon. ---------- Forwarded message ---------- Date: Tue, 02 Dec 2008 15:12:56 -0600 From: Andrew J. Romero To: 'Jon Bakken' , 'Steve Timm' , 'Arthur Kreymer' Subject: RHEA-1 crash We are still working with BlueArc to determine why RHEA-1 crashed. We will need to rebalance the EVSs (Virtual Servers) as soon as possible. This will involve a short 10-15min downtime for the following EVSs (as they are moved from RHEA-2 to RHEA-1) - blue2 - minos-nas-0 Assuming RHEA-1 does not have a hardware problem, we would like to rebalance the EVS load at 4:30am tomorrow. Let me know if this time is acceptable Andy ------------------------------------------------ There were no failures to read files. But the reads took 20 minutes to complete, finishing around 14:41 . Strangely, the process on fnpcsrv1 seems to have continued with no logged error or delay around 14:41 ! There have been many instances of 20 to 40 second delays this last week. See samples recorded under http://www-numi.fnal.gov/computing/dh/bluwatch/log/ ------------------------------------------------ Date: Tue, 02 Dec 2008 22:09:44 +0000 (GMT) From: Arthur Kreymer To: minos_batch@fnal.gov, minos_software_discussion@fnal.gov Subject: BlueArc downtime/stall tomorrow at 04:30 There is a scheduled emergency interruption to BlueArc service, affecting /grid/* and /minos/* file systems. 
Files reads may be slow, with 10 to 20 minute delays. I do not know what will happen to file writes. I am shutting down file concatenation for the evening. ------------------------------------------------ mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT Tue Dec 2 16:11:27 CST 2008 And disabled mdsum_log on minos26. ######### # FNALU # ######### Brebel reports condor on FNALU not giving him tokens. Repeat my tests, cd /local/stage1/kreymer/condor condor_submit probe ####### # DAQ # ####### Date: Tue, 02 Dec 2008 12:33:39 -0600 From: John Urish To: Brett M Viren , Mary R Bishai , Arthur E Kreymer , zwaska@fnal.gov Subject: minos-beamdata minos-beamdata is set up in it's rack in FCC. The new IP address is 131.225.107.196. I'm able to connect to it via SSH. You can go ahead and check the minos software on it. Let me know if you find any problems. ######## # FARM # ######## Can you get the linfix files prestaged yet? The runs are in /minos/data/minfarm/lists/mclist.cedar_phy_linfix. There are also ~1100 dogwoodtest4 jobs to rerun which should be prestaged if possible. They'll be found in /minos/data/minfarm/farmtest/mclist.dogwoodtest4. dcache/datasets r '' '' list /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2008/12/list.r PFILES=/afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2008/12/list.r Linfix input files LFILES=`cat /minos/data/minfarm/lists/mclist.cedar_phy_linfix | cut -f 1 -d .` for FILE in ${LFILES} ; do grep ${FILE}.reroot.root ${PFILES} ; done | wc -l 79 list files not in pools for FILE in ${LFILES} ; do grep -q ${FILE}.reroot.root ${PFILES} || echo ${FILE} ; done 83 Dogwood test input files MFILES=`cat /minos/data/minfarm/lists/mclist.dogwoodtest4.matt` 2385 count files in pools for FILE in ${MFILES} ; do grep ${FILE}.reroot.root ${PFILES} ; done | wc -l 2648 list files not in pools for FILE in ${MFILES} ; do grep -q ${FILE}.reroot.root ${PFILES} || echo ${FILE} ; done More mail from Howie wc -l /minos/data/minfarm/farmtest/lists/mclist.dogwoodtest4 1135 HFILES=`cat /minos/data/minfarm/farmtest/lists/mclist.dogwoodtest4 | cut -f 1 -d .` count files in pools for FILE in ${HFILES} ; do grep ${FILE}.reroot.root ${PFILES} ; done | wc -l 1285 list files not in pools for FILE in ${HFILES} ; do grep -q ${FILE}.reroot.root ${PFILES} || echo ${FILE} ; done ########### # MINOS25 # ########### Date: Tue, 02 Dec 2008 12:02:00 -0600 (CST) Reply-To: HelpDesk Subject: HelpDesk ticket 125690 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125690 ___________________________________________ Short Description: minos-mysql3 swap with minos25 - OS reinstall requested Problem Description: run2-sys : Per discussions with Jason Allen this morning, please reinstall the OS on the node presently called minos-mysql3, configured as a Minos Cluster node ( presently configured as a mysql server.) Please create partitions and accounts, and copy files as described in http://www-numi.fnal.gov/computing/minos25.txt or provide appropriate revisions to this plan. ___________________________________________ This ticket is assigned to HelpDesk of the Help Desk. ___________________________________________ Date: Tue, 02 Dec 2008 12:06:14 -0600 (CST) This ticket has been reassigned to SIMMONDS, EDWARD of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 03 Dec 2008 09:55:09 -0600 (CST) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. 
___________________________________________ Date: Wed, 03 Dec 2008 14:55:58 -0600 (CST) The OS has been reinstalled. [root@minos-mysql3 ~]# uname -a Linux minos-mysql3.fnal.gov 2.6.9-78.0.8.ELsmp #1 SMP Wed Nov 19 13:11:58 CST 2008 x86_64 x86_64 x86_64 GNU/Linux [root@minos-mysql3 ~]# df -l Filesystem 1K-blocks Used Available Use% Mounted on /dev/sda1 20641788 3585528 16007620 19% / none 8219568 0 8219568 0% /dev/shm /dev/sda7 192822076 5818512 177208736 4% /home /dev/sda6 2063504 35948 1922736 2% /tmp /dev/sda3 10317860 182728 9611012 2% /var /dev/sda2 10317860 55844 9737896 1% /var/tmp AFS 9000000 0 9000000 0% /afs /dev/sdb1 240292420 98892 227987344 1% /local/scratch25 /dev/sdb mounted as /local/scratch25. Not sure if you really want /minos/scratch25. 10GB /var/tmp created. gfrontend and gfactory users and groups added to NIS server. Only these three users are recognized on the machine, besides the system logins. Home areas have been rsyned. /home/condor from minos25:/local/stage1/condor /local/scratch25/condor does not exist. Guessing this is /local/scratch25/stage1/condor (or /local/stage1/condor) Copied /local/stage1/condor to minos-mysql3:/home/condor Copied kcron files under /var/adm/krb/ Copied ONLY the grid directories under the user directories. Not sure if other content in the user directories need to be copied. Copied crontabs file under /var/spool/cron . System cron files were not copied. ___________________________________________ Date: Wed, 03 Dec 2008 23:42:45 +0000 (GMT) > /dev/sdb mounted as /local/scratch25. Not sure if you really want /minos/scra$ Yes, /minos/scratch25 was yet another typo. Your interpretation was correct. I have corrected the original document. >> Create accounts for Condor management, cloned from minos25, >> and rsync the home areas. >> Account Home Size >> condor /home/condor 844 MB >> symlink to this from /local/stage1/condor >> gfrontend /home/gfrontend 175 MB >> gfactory /home/gfactory 4614 MB > > gfrontend and gfactory users and groups added to NIS server. > Only these three users are recognized on the machine, besides the system logi$ Please allow the full set of Minos users to log in, as is the case on minos25. That will let me log in as kreymer, to do some configuration. > Home areas have been rsyned. > /home/condor from minos25:/local/stage1/condor Thanks, I have logged into gfactory and gfrontend, looks OK. > /local/scratch25/condor does not exist. Guessing this is /local/scratch25/sta$ > /local/stage1/condor) > Copied /local/stage1/condor to minos-mysql3:/home/condor Correct again. I have correct the document. The condor home is /local/stage1/condor, which is a symlink to /local/scratch25/stage1/condor. > Copied kcron files under /var/adm/krb/ > Copied ONLY the grid directories under the user directories. Not sure if othe$ > directories need to be copied. Not needed for this migration. Will copy them elsewhere, later. > Copied crontabs file under /var/spool/cron . System cron files were not copie$ Let me know when rpbatter and kreymer et.al. can log in, we will start doing more tests. ___________________________________________ Date: Wed, 03 Dec 2008 17:45:56 -0600 From: Ling C. Ho I have corrected this. You should be able to log in now. ___________________________________________ Date: Wed, 03 Dec 2008 17:48:19 -0600 From: Ling C. Ho By the way I don't thee the user rpbatter on minos25 nor the nis map. ___________________________________________ Date: Wed, 03 Dec 2008 15:48:51 -0800 (PST) From: Ryan B. 
Patterson Any word on the new Condor versions? Are they *really* going to be available this week, or should we ask for 7_0_3 to be installed. I'd like to do hard testing this week/weekend if possible, and my gut tells me that we aren't going to see these new RPMs in a timely manner. ___________________________________________ Date: Wed, 03 Dec 2008 17:49:57 -0600 From: Ling C. Ho SOrry, I am slow at the end of the day. "rbpatter" is there. ___________________________________________ Date: Wed, 03 Dec 2008 16:01:51 -0800 (PST) From: Ryan B. Patterson /minos/scratch and /minos/data are present but appear to be read-only at the moment: bash$ touch /minos/data/hi touch: cannot touch `/minos/data/hi': Read-only file system bash$ touch /minos/scratch/hi touch: cannot touch `/minos/scratch/hi': Read-only file system Perhaps this is temporary, as 'mount' suggests they should be rw: minos-nas-0.fnal.gov:/minos/scratch on /minos/scratch type nfs (rw,rsize=32768,timeo=600,proto=tcp,nfsvers=3,hard,intr,addr=131.225.111.115 ) minos-nas-0.fnal.gov:/minos/data on /minos/data type nfs (rw,rsize=32768,timeo=600,proto=tcp,nfsvers=3,hard,intr,addr=131.225.111.115 ) ___________________________________________ Date: Thu, 04 Dec 2008 10:30:21 -0800 (PST) From: Ryan B. Patterson To: Arthur Kreymer Subject: Re: HelpDesk ticket 125690 has additional info. An additional observation: The permissions of /var/adm/krb5 seem to disallow proper kcron operation. On minos25[old] this was: drwx--s--x 2 root root 4096 Nov 18 14:56 krb5 on minos25-mysql3, this is: drwxr-xr-x 2 root root 4096 Dec 3 11:44 krb5 --Ryan ___________________________________________ Date: Mon, 08 Dec 2008 13:21:28 -0600 (CST) Machine swap completed on Friday 12/5/08. This ticket was resolved by HO, LING of the CD-SF/FEF group. ___________________________________________ ########## # DCACHE # ########## > Stan Naymola > I may have found a dCache monitor page that may help you with door status. > Look at http://fndca3a.fnal.gov:2288/context/transfers.html . Look at the > first column, it tells you which door is being used or queued. Let me know if > this is helpful. Thanks ! This should be very useful. I seems to be updated about every 2 minutes, which should be good enough for load balancing. -------------------------------------------------------- Time stamps from the bottom of the page 11:33:24 11:35:14 11:37:05 11:38:54 CST 2008 ########## # DCACHE # ########## Date: Tue, 02 Dec 2008 11:47:16 -0600 (CST) Reply-To: HelpDesk Subject: HelpDesk ticket 125688 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125688 ___________________________________________ Short Description: STKEN door DCap00-stkendca2a , port 24125 is stuck Problem Description: According to the login plots at http://fndca3a.fnal.gov/dcache/logins//DCap00-stkendca2a.jpg DCap00-stkendca2a , port 24125, seems to have been down since early Saturday 29 November, I cannot connect to this door : Failed to create a control line [Tue Dec 2 11:24:45 2008] Going to open file dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/neardet_data/2004-11/N 00004502_0000.mdaq.root in cache. Failed to create a control line Failed open file in the dCache. Can't open source file : Unable to connect to server System error: Connection refused ___________________________________________ This ticket is assigned to JONES, TERRY of the CD-SF/DMS/DSC/SSA. 
___________________________________________ Date: Tue, 02 Dec 2008 15:43:20 -0600 (CST) Solution: jonest@fnal.gov sent this solution: > Door was not working with no obvious reason. Nothing in the log > file. The door > has been restarted and now works fine. This ticket was resolved by JONES, TERRY of the CD-SF/DMS/DSC/SSA group. ___________________________________________ ___________________________________________ ============================================================================= 2008 12 01 ============================================================================= ########### # MINOS25 # ########### First draft migration plan to swap minos-mysql3 with minos25 We are having severe performance problems with minos25, presently our Condor gateway system. Unexplained very high load averages, with very high I/O wait delays are common. The problems continue with each of the suspected software components disabled. Even without these overloads, we need a more capable Condor host system. Therefore, we would like to swap the new minos-sam03 hardware with minos25. The following items can be done in preparation : Install condor v7_0_3 on minos-mysql3 as on the rest of the Minos Cluster. Install condor v7_0_3 configuration files to be provided by Minos to FEF. Create accounts for Condor management, cloned from minos25, and rsync the home areas. Account Home Size condor /local/stage1/condor 844 MB gfrontend /home/gfrontend 175 MB gfactory /home/gfactory 4614 MB Change mount from /data to /local/scratch25 on minos-mysql3, and set permissions per minos25. Copy /local/scratch25/condor to minos-mysql3. Copy all user kcron files to minos-mysql3. Copy all user /local/scratch25//grid files to minos-mysql3. Copy all user crontabs from minos25 to minos-mysql3 For the actualy identity transplant : Shut down the minos25 condor processes. Disable automatic Condor startup on minos25 (old and new). Shut down minos25 and minos-mysql. Exchange host Grid certificates . Reboot the new minos25, without starting condor. rsync files from the condor, gfrontend and gfactory accounts. Restart condor and verify proper operation. N.B. shift ~condor from /local/stage1/condor to /home/condor, with a symlink to new space for compatibility. N.B. - changed /local/scratch25 to /local/scratch25 above, typo in first draft N.B - added correction to afs login problem on minos-mysql3 ########## # DCACHE # ########## Requesting more doors : Date: Mon, 01 Dec 2008 18:02:55 -0600 (CST) From: HelpDesk <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125645 ___________________________________________ Short Description: FNDCA needs more unauthenticated dcap doors Problem Description: Minos is ramping up grid usage, with a goal of 5000 jobs running on FermiGrid. We are getting close to 1000 jobs running recently. But we still have only four unauthenticated dcap doors. We are hitting door limits even when using all four. To handle 5000 jobs, figuring 250 connections per door, we would need about 20 doors. Please increase the number of doors to 10 as soon as possible, to handle the present load. And increase to 20 doors as soon as convenient to handle the longer term load. ___________________________________________ This ticket is assigned to JONES, TERRY of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Mon, 01 Dec 2008 17:38:15 -0600 From: Jon Bakken That's interesting. 
I configured the CMS doors for a max of 4000 each (running multiple doors per node). We routinely have more than a 1000 per door without troubles. ___________________________________________ Date: Tue, 02 Dec 2008 15:01:26 -0600 (CST) Note To Requester: jonest@fnal.gov sent this Notes To Requester: Hi, A dcache expert has responded. should I close this ticket? > --- Comment #1 from Vladimir Podstavkov > 2008-12-02 14:53:01 --- > The current limit for dcap doors is 4000 connections per door. Two > doors allow > to have up to 8000 open connections. No additional doors needed. ___________________________________________ Date: Tue, 02 Dec 2008 21:16:55 +0000 (GMT) From: Arthur Kreymer To: HelpDesk Cc: minos-data@fnal.gov, dcache-admin@fnal.gov Subject: Re: HelpDesk ticket 125645 has additional info. <-- # @@@ Enter Update below this line. @@@ # --> Thanks, it is good to know that the limit is at 4000. Door 0 seems to have died Saturday with 1000 connections. Doors 1/2/3 seem to have historic peaks under 400. Door 2 is at 536 right now, as high as I've ever seen it. I'll keep watching, and think about setting up a load test. I'll start regular monitoring of service availability. <-- # @@@ Enter Update above this line. @@@ # --> ############ # PREDATOR # ############ InvalidMetadata: Invalid Metadata specified for file 'N00015238_0017.mdaq.root' of type 'importedDetector': Application with family 'online', applName 'rotorooter', version 'v00-00--1' not found. FINISHED Sun Nov 30 05:12:08 2008 Set the damaged file aside cd /local/scratch26/kreymer/genpy/neardet_data/2008-11 mv N00015238_0017.sam.py N00015238_0017.sam.pybad Note that there is a .pyc for this file ! MINOS26 > dds *.pyc -rw-r--r-- 1 kreymer g020 1266 Nov 30 01:15 N00015238_0017.sam.pyc Subruns 16 and 18 are OK. MINOS26 > ./predator 2008-11 ########### # MINOS27 # ########### /pnfs/minos is not mounted ######## # DATA # ######## Link for file listings from dcache pools, in minos/CFL/ ln -s /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2008/12/list.q dcache_q.txt Many raw data files are still not on tape, since Thursday 27 Nov. cat > /tmp/oldfiles DFILES=`grep -v ' 30' /tmp/oldfiles | cut -c 7- | cut -f 1 -d ' '` MINOS27 > for FILE in ${DFILES} ; do grep ${FILE} ../CFL/dcache_q.txt ; done w-stkendca10a-3 w-stkendca11a-3 w-stkendca12a-3 w-stkendca8a-1 w-stkendca8a-2 w-stkendca9a-3 ######## # DATA # ######## Date: Mon, 01 Dec 2008 09:23:19 -0500 From: Steven Cavanaugh To: Arthur Kreymer Subject: [Fwd: dccp issues with grid] Hi Art, I forgot to include you on this email last night Thanks, Steve -------- Original Message -------- Subject: dccp issues with grid Date: Sun, 30 Nov 2008 21:55:47 -0500 From: Steven Cavanaugh To: Helpdesk , Mayly Sanchez Hi, This is a high priority issue, as minos grid jobs are currently unable to dccp files. I am submitting some jobs on minos25 to the condor grid jobs executing on the following machines (and possibly more) are unable to dccp a file: 131.225.167.4 fnpc237 131.225.166.81 fnpc303 131.225.211.160 fcdfcaf1512 131.225.167.8 fnpc240 131.225.166.90 fnpc312 They all get the error : Failed to create a control line Failed open file in the dCache. Can't open source file : Unable to connect to server System error: Connection refused however, running the dccp on minos09 works: dccp /pnfs/minos//reco_far/cedar_phy_bhcurv/.bcnd_data/2005-03/F00030612_0004.spi ll.bcnd.cedar_phy_bhcurv.0.root . 
5802487 bytes in 0 seconds ___________________________________________________________________ Thanks, Steve New Information: scavan@fas.harvard.edu sent in this update: > > > Some additional information: These jobs were attempting to copy 10 files using dccp (a separate dccp for each file, which would not run until the previous dccp command completed) usually the first 3-8 files would copy without issue, and the remainder would fail as described below So this is not simply a connection problem ___________________________________________________________________ <-- # @ Enter Update below this line. @ # --> I think I see the reason for the dccp failures. Messages like Unable to connect to server - usually come from Doors ( dccp ports ) that are overloaded. Looking at the door login plots under http://fndca3a.fnal.gov/dcache/dc_login_plots.html, particularly those for the unauthenticated doors we use, DCap00/01/02/03 I see repeated peaks over 300, sometimes to 1000. The doors can only handle about 200 to 300 connections before they stop taking connections. Steve - I presume that you are spreading your connections randomly between ports 24125, 24136, 24137, 24138. doors 00 01 02 03 We need to have more doors added to DCache. I will put in a new helpdesk ticket for this. Until the new doors come, you need to reduce the number of simultaneous dccp's so as not to overload any single door. You may need to do something similar to Greg, submitting then holding the jobs, then releasing a few hundred at a time. He has some scripts to regulate the number of jobs running. This should let you run your jobs in 10 file mode, if you use all four doors/ports randomly, and keep your total per door under 100 ( there are other users.) <-- # @ Enter Update above this line. @ # --> ######## # GRID # ######## Date: Sun, 30 Nov 2008 22:47:29 -0600 (CST) From: Steven Timm To: minos-data@fnal.gov, scavan@fnal.gov Cc: fermigrid-operations@fnal.gov Subject: HUGE number of very short jobs from user scavan of MINOS We are seeing a HUGE number of jobs (17000!) from minos user scavan (Steve Cavanaugh) being processed through the minosgli glideins. The mean finishing time of these jobs is 2-5 minutes. This is very much frowned upon. It is pushing our SAZ server to near-record and near-failure levels and threatening to disable all of FermiGrid. Make the jobs longer, NOW. Jobs designed to run only 1-2 minutes are not allowed on FermiGrid, period. If our SAZ server alarms for high load again tonight, which it has already done once, we will not hesitate to block all MINOS glideins until the problem is fixed. Steve Timm -------------------------- Per previous discussions, JOB_START_COUNT = 8 JOB_START_DELAY = 2 -------------------------- condor_config_val -schedd JOB_START_COUNT 8 condor_config_val -schedd -rset "JOB_START_COUNT = 1" Attempt to set configuration "JOB_START_COUNT = 1" on schedd minos25.fnal.gov <131.225.193.25:65226> failed. export X509_USER_PROXY=/local/scratch25/kreymer/grid/kreymer.proxy Still fails condor_reconfig Sent "Reconfig" command to local master Trying a sample from the man page, MINOS25 > condor_config_val -schedd -rset "MAX_JOBS_RUNNING = 2001" Attempt to set configuration "MAX_JOBS_RUNNING = 2001" on schedd minos25.fnal.gov <131.225.193.25:65226> failed. 
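For what it's worth, the -rset failures look like runtime configuration simply being locked down on the schedd. Condor only honors condor_config_val -rset when knobs along these lines are present in the config; this is an assumption about the minos25 setup, shown as a sketch, not something we enabled :

# hypothetical settings needed before condor_config_val -rset would be accepted
ENABLE_RUNTIME_CONFIG = TRUE
SETTABLE_ATTRS_CONFIG = JOB_START_COUNT, JOB_START_DELAY, MAX_JOBS_RUNNING
HOSTALLOW_CONFIG = minos25.fnal.gov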
Let's do this through config files.
cd /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701
cp condor_config.local.minos25.20080925 condor_config.local.minos25.20081203
From
    JOB_START_COUNT = 3
    JOB_START_DELAY = 2
To
    JOB_START_COUNT = 1
    JOB_START_DELAY = 1
ln -sf condor_config.local.minos25.20081203 condor_config.local.minos25
# was condor_config.local.minos25.20080925 Sep 25
condor_config.20081203 added rbpatter to QUEUE_SUPER_USERS, removed buckley

###########
# MINOS25 #
###########

Test hdb disk access, using a similar script to the DATA test below.
RFILES=`cat /minos/data/minfarm/lists/mmm.D00 | cut -f 1 -d .`
for FILE in ${RFILES}; do
    time ./dccptest ${FILE}.reroot.root
done 2>&1 | tee /tmp/dccptest.log
Data rates are typically 30 MB/sec, as on minos26.
But the load average is running about 8, versus 2,
and the CPU average is running about 70% wait, versus 40% wait.
26 GBytes of files were copied, no errors.
The CPU load stayed high about 5 minutes past the last network traffic.
Question : is DMA enabled on this local disk ?

Do a similar test on minos-mysql3, candidate for minos-grid.
Extended dccptest to take a 4th argument, the destination path; it will supply a DCCPTEST subdirectory.
MINOS-MYSQL3 > RFILES=`cat /minos/data/minfarm/lists/mmm.D00 | cut -f 1 -d .`
for FILE in ${RFILES}; do
    time ./dccptest ${FILE}.reroot.root '' '' /var/tmp/kreymer/DCCPTEST
done 2>&1 | tee /tmp/dccptest.log
Data rates are a uniform 40 MBytes/sec.
Load average on minos-mysql3 was around 1, CPU usage 10% wait I/O, no post-copy delays.
Elapsed time 10 minutes.

Continuing tests, with a local file :
DTDIR=/local/scratch26/${LOGNAME}/DCCPTEST
time dd if=/dev/urandom of=${DTDIR}/10G.dat bs=10M count=1000
Top - 26% system, dd is 100% CPU bound, load around 1.5
real 31m41.954s user 0m0.009s sys 31m38.962s

DTDIR=/local/scratch25/${LOGNAME}/DCCPTEST
time dd if=/dev/urandom of=${DTDIR}/10G.dat bs=10M count=1000
Top - 26% system, dd is 100% CPU bound, load around 1.2
real 31m57.528s user 0m0.011s sys 31m53.885s

DTDIR=/var/tmp/${LOGNAME}/DCCPTEST
time dd if=/dev/urandom of=${DTDIR}/10G.dat bs=10M count=1000
Top - 12% system, dd is 100% CPU bound, load around .6
real 17m48.005s user 0m0.000s sys 17m46.705s

Not such a good test; dd seems to be burning CPU making the urandom content.
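A cheaper way to build the 10 GB test file, keeping the random-number generation out of the timing. A sketch; /dev/zero data is compressible, but that does not matter for a raw local-disk write test :

DTDIR=/var/tmp/${LOGNAME}/DCCPTEST
# write 10 GB of zeros; the trailing sync keeps the page cache from hiding the real write time
time ( dd if=/dev/zero of=${DTDIR}/10G.dat bs=10M count=1000 && sync )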
Deferred minos25 test, load average started growing aroun 14:20, The 10G.dat creation finished at 14:38:12 Overload lasted 14:20 to 14:50 MINOS25 > time dd if=${DTDIR}/10G.dat of=${DTDIR}/10G.dat2 top - 16:07:47 up 41 days, 16 min, 9 users, load average: 3.07, 0.99, 0.49 Cpu(s): 4.4% us, 5.1% sy, 0.0% ni, 30.8% id, 59.4% wa, 0.2% hi, 0.0% si MINOS26 > time dd if=${DTDIR}/10G.dat of=${DTDIR}/10G.dat2 top - 14:35:42 up 395 days, 3:03, 11 users, load average: 3.52, 1.91, 1.59 Cpu(s): 2.9% us, 3.7% sy, 0.0% ni, 22.0% id, 70.9% wa, 0.5% hi, 0.0% si 20480000+0 records in 20480000+0 records out real 9m8.772s user 0m13.858s sys 2m17.817s MINOS-MYSQL3 > time dd if=${DTDIR}/10G.dat of=${DTDIR}/10G.dat2 top - 14:36:02 up 16 days, 21:26, 2 users, load average: 1.81, 1.57, 0.98 Cpu(s): 0.0% us, 0.5% sy, 0.0% ni, 61.0% id, 38.3% wa, 0.0% hi, 0.1% si real 3m56.376s user 0m3.955s sys 0m56.385s ####### # SRM # ####### Test FermiGrid volatile : SRV1> export X509_USER_PROXY=/local/globus/minfarm/.grid/kreymer-production.proxy SRV1> srmls srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport 0 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/ 512 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/howcroft/ 512 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/kordosky/ ########## # CONDOR # ########## minos24 I/O wait overloads continued through the weekend. Abandoning ship, will move to a new host, minos-grid formerly minos-mysql3 ######## # DATA # ######## Date: Thu, 27 Nov 2008 23:24:55 -0500 From: Howard Rubin There remain 79 srmcp failures listed in file /minos/data/minfarm/lists/mmm.D00. Can you check whether these might be resulting from a NONACCESS (or other problem) tape? The symptoms are ... N.B., yes, these were all on VOC495. MINOS26 > head -1 /minos/data/minfarm/lists/mmm.D00 n13023108_0002_L010185N_D00.0 1 2008-11-27 03:37:14 fcdfcaf1056 RFILES=`cat /minos/data/minfarm/lists/mmm.D00 | cut -f 1 -d .` for FILE in ${RFILES}; do ./dccptest ${FILE}.reroot.root rm -f /local/scratch26/kreymer/${FILE}.reroot.root done ############ # DCCPTEST # ############ Extended, look up path of simple file name in SAM. Third argument is Debug flag Fourth argument is DEST , will supply DCCPTEST subdirectory ============================================================================= 2008 11 27 ============================================================================= T H A N K S G I V I N G ============================================================================= 2008 11 26 ============================================================================= ############ # NOACCESS # ############ This tape has been on the list much of the last week. Status ? VOC495 0.05GB (NOACCESS 1118-2328 full 0629-0105) CD-9940B minos.mcin_near_daikon.cpio_odc Howie needs some of these files for farm processing. 
MINOS26 > enstore info --vol=VOC495 {'blocksize': 131072, 'capacity_bytes': 214748364800L, 'comment': '', 'declared': 1180707936.0, 'eod_cookie': '0000_000000000_0000438', 'external_label': 'VOC495', 'first_access': 1181290454.0, 'last_access': 1227736330.0, 'library': 'CD-9940B', 'media_type': '9940B', 'remaining_bytes': 48729600L, 'si_time': [1227736121.0, 1183097105.0], 'sum_mounts': 54, 'sum_rd_access': 701, 'sum_rd_err': 2, 'sum_wr_access': 437, 'sum_wr_err': 0, 'system_inhibit': ['none', 'full'], 'user_inhibit': ['none', 'none'], 'volume_family': 'minos.mcin_near_daikon.cpio_odc', 'wrapper': 'cpio_odc', 'write_protected': 'y'} Date: Wed, 26 Nov 2008 15:59:37 -0600 (CST) Subject: HelpDesk ticket 125550 ___________________________________________ Ticket #: 125550 ___________________________________________ Short Description: VOC495 status Problem Description: We need to read some files from Minos 9940B volume VOC495. The tape is NOACCESS, I think for several days. What is the status of this tape ? ___________________________________________ This ticket is assigned to HENDRY, JOHN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 26 Nov 2008 22:07:21 +0000 (GMT) From: Arthur Kreymer The NOACCESS list is at http://www-stken.fnal.gov/enstore/tape_inventory/NOACCESS The Minos tapes seem to be VO4209 - not a problem, copied elsewhere long ago. VOB445 - not a problem, no files written to this tape. VOC495 0.05GB (NOACCESS 1118-2328 full 0629-0105) CD-9940B minos.mcin_near_daikon.cpio_odc VOK330 331.09GB (NOTALLOWED 0731-1124 readonly 0716-1115) CD-LTO3 minos.reco_far_cedar_bcnd.cpio_odc Volume needs to be cloned due to repeated errors ( This one has been on the list for weeks ) ___________________________________________ cleared by Timur ___________________________________________ Date: Wed, 26 Nov 2008 16:22:26 -0600 (CST) Solution: jhendry@fnal.gov sent this solution: Hi Art, Glenn cleared this tape a bit earlier c. 15:49 today Nov 26 2008 CST. ___________________________________________ ########### # ROUNDUP # ########### Corrected SOCFILE from insecure /export/stage/minfarm/.grid/samdbs_prd to /local/globus/minfarm/.grid/samdbs_prd cp -a AFSS/roundup.20081126 . ln -sf roundup.20081126 roundup # was roundup.20081118 Wed Nov 26 11:14:41 CST 2008 And clean up, rm roundup.20080* ######## # GRID # ######## M25 overloads seem to come from gfrontent, Igor asks that we upgrade to current code. ######## # DATA # ######## Big backlog of restores for files in mcout_data/cedar_phy/far/daikon_00/L010185N/cand_data Total data is about .5 TB, under 2k files, this should clear up by itself. ============================================================================= 2008 11 25 ============================================================================= ######## # FARM # ######## killed looper on charm, stable results, many duplicate and missing subruns. Reported to rubin. Reenabled charm and helium mcnearcat in corral. 
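For the duplicate-subrun bookkeeping, a quick way to see which outputs are present more than once in mcnearcat. A sketch only, assuming the usual trailing pass-number field in the file names; this is not the samdup logic :

# list base names that occur with more than one pass number
ls /minos/data/minfarm/mcnearcat | grep '\.root$' \
    | sed 's/\.[0-9]*\.root$//' | sort | uniq -d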
######## # DATA # ######## Date: Tue, 25 Nov 2008 15:09:18 -0600 (CST) Subject: HelpDesk ticket 125498 ___________________________________________ Ticket #: 125498 ___________________________________________ Short Description: Increase space available to /minos/scratch Problem Description: LSC/CSI : At your next convenience, please shift 3 TBytes of capacity on minos-nas: from /minos/data to /minos/scratch ___________________________________________ Date: Wed, 26 Nov 2008 08:13:32 -0600 (CST) This ticket has been reassigned to WILLIAMS, CARL of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Wed, 26 Nov 2008 08:40:22 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/WST ___________________________________________ Date: Wed, 26 Nov 2008 09:24:01 -0600 (CST) Solution: Quotas have been adjusted /minos/data .... decreased to 24TB /minos/scratch ... increased to 9TB ___________________________________________ ######## # DATA # ######## Date: Tue, 25 Nov 2008 15:07:12 -0600 (CST) Subject: HelpDesk ticket 125497 ___________________________________________ Ticket #: 125497 ___________________________________________ Short Description: Quota request for rahaman on BlueArc served /minos/scratch Problem Description: LSC/CSI : Please increase the individual storage quota to 1000 GBytes for user rahaman on the BlueArc served /minos/scratch volume. Please try to do this before the Thanksgiving weekend. ___________________________________________ Date: Tue, 25 Nov 2008 15:11:24 -0600 (CST) This ticket has been reassigned to WILLIAMS, CARL of the CD-LSCS/CSI/CS/EST ___________________________________________ Date: Tue, 25 Nov 2008 15:21:53 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/WST Group. ___________________________________________ Date: Tue, 25 Nov 2008 16:26:49 -0600 (CST) Solution: Quota for rahaman on /minos/scratch is now 1000GB ___________________________________________ ########## # ADMIN # ########## mgoodman,zwaska,plunk,cjames Need to roll back the collab web page: In particular, I am looking for a version of http://www-numi.fnal.gov/collab/index.html from before September of this year. At www.archive.com, typed in the web address 'Take Me Back' Got pleny of them, through Aug 2007. http://web.archive.org/web/*/http://www-numi.fnal.gov/collab/index.html http://web.archive.org/web/20070813035359/http://www-numi.fnal.gov/collab/index.html Put a copy of Aug 13 link in to ~kreymer/minos/collabindex.html Maury cannot access these links : The Argonne IT Administrator Review Group (IT-ARG) has chosen to block this url based on its category affilication. The site you requested is blocked under the following categories: *Anonymizing Utilities* Anonymizing Utilities RESOLVED : To : mgoodman@fnal.gov, zwaska@fnal.gov, plunk@fnal.gov, cjames@fnal.gov Cc : Attchmnt: Subject : Re: [Fwd: Re: Fwd: Re: IB web page] ----- Message Text ----- ANL blocked Maury's access to www.archive.org. He could not log into Fermilab to get the copy I had made. I took the liberty of cleaning this up myself. I found our own backup copy, html/collab/index07.html. I renamed the various older copies, and copied index.20070910.html to index.html, first saving the stray copy of the ib index, as follows : index.20050115.html - was index_old.html index.20070813.html - from www.archive.org index.20070910.html - was index07.html index.ib.20080918.html - problematic copy of ib/ib.index The Collaboration page is working again. Enjoy ! 
( If this is too stale, and needs rolling forward to Sept 08, we could issue a helpdesk request for file restoration. ) ########## # DCACHE # ########## Date: Tue, 25 Nov 2008 14:40:07 -0600 (CST) Subject: HelpDesk ticket 125495 ___________________________________________ Ticket #: 125495 ___________________________________________ Short Description: Level 2 metadata seems very out of date Problem Description: Some Minos data management scripts use the PNFS Level 2 metadata to get an estimate of whether a file is on disk in DCache. For example, ( cd /pnfs/minos/neardet_data/2004-11 ; \ cat ".(use)(2)(N00004502_0000.mdaq.root)" ) 2,0,0,0.0,0.0 :l=15761813; w-stkendca7a-1 r-stkendca4a-1 It is understood that this information is not precise. But it has always been very close to reality. For at least the last several months, the pool information seems to be very out of date. Recently written raw data files do no have pool information, even files which have been in the pools for over a month. The following file was written on Oct 1, and is in pool stkendca11a-3 ( cd /pnfs/minos/neardet_data/2008-10 ; \ cat ".(use)(2)(N00014898_0024.mdaq.root)" ) 2,0,0,0.0,0.0 :c=1:d22ec926;h=yes;l=100979; Please investigate, and correct this condition. ___________________________________________ This ticket is assigned to HENDRY, JOHN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Tue, 25 Nov 2008 15:24:00 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: This issue had been recorded as dcache bugzilla Bug 164. We will update this ticket upon action from the dcache developers. ___________________________________________ Date: Tue, 25 Nov 2008 17:14:18 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: An update has been received from the dcache developers: --- Comment #1 from Alex Kulyavtsev 2008-11-25 16:36:13 --- (In reply to comment #0) > > This is from helpdesk remedy ticket 125495 submitted by ARTHUR KREYMER. > > > > Some Minos data management scripts use the PNFS Level 2 metadata > > to get an estimate of whether a file is on disk in DCache. During last dcache upgrade on June 24 the pool code was switched from version 2 to version 3, which is default and has been used by CMS for a while. This version of code does not keep pool location information for the file replica in pnfs layer(2). Instead it stores cacheinfo in so called "pnfs companion" DB. Files cached earlier may keep cacheinfo in layer(2). Alex. ___________________________________________ Date: Wed, 26 Nov 2008 17:49:07 +0000 (GMT) From: Arthur Kreymer Thanks, the switch to the 'companion' explains the lack of layer(2) data. How to I obtain PNFS companion data ? I occasionally need a fair estimate of which pools a file is in. The layer(2) data worked very nicely. The daily pool directory listings are a bit too stale, and do not include file paths. ___________________________________________ Date: Wed, 26 Nov 2008 12:04:18 -0600 From: Timur Perelmutov I would not recommend to grant you access to the companion database, this is critical internal dCache service. What do you use information for? ___________________________________________ Date: Wed, 26 Nov 2008 20:52:25 +0000 (GMT) From: Arthur Kreymer It is reasonable that users should not have direct access. 
Two recent examples where I would have used the old level(2) information : 1) A Minos user was scanning through 24,000 files in a particular family, generating uncontrolled large tape restore backlogs, and many tape mounts. These files were on only 4 tapes. I have a script which pre-stages such files ( using dccp -P ) tape by tape, and in tape order. The script issues each dccp then waits 5 seconds, to keep slightly ahead of actual file delivery from tape. It backs off when the Pool Request Queues page shows a backlog. For this to work efficiently, I need to avoid the dccp/delay for files which are already on disk. 2) Last Friday afternoon, it seemed that hundreds of files were being restored to a single pool. I could have been mistaken about this. I wanted to see which pools these files were in after staging. If we have no present means of getting this data, we might try a readonly database replica ( perhaps using Slony-I ) with direct access for experts, and a simple web interface for people like me. A readonly replica might let us deploy more agressive monitoring tools. ___________________________________________ Date: Mon, 01 Dec 2008 13:57:40 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: The originator, made this reply outside of the helpdesk remedy system: ( see above ) ___________________________________________ Date: Mon, 01 Dec 2008 14:04:05 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: Hi Art, I have appended your comments to enstore/dcache bugzilla 164. ___________________________________________ Date: Tue, 30 Dec 2008 09:34:38 -0600 (CST) > --- Comment #3 from Timur Perelmutov 2008-12-29 13:24:26 --- We do not have resources to implement the replication of the companion database into a read-only database replica. We suggest that you use either dc_check or srmls to find out if file has a copy on disk.This will allow MINOS to estimate if the file is on disk without access to internal databases of dCache. ___________________________________________ Date: Wed, 07 Jan 2009 11:31:50 -0600 From: John Hendry May I please have your agreement to close this ticket? ___________________________________________ Date: Wed, 07 Jan 2009 22:02:03 +0000 (GMT) From: Arthur Kreymer This ticket can be closed as an operational issue, We should do something to improve the situation longer term. The suggested workarounds are to use dc_check, or srmls to check for the existence of a file on DCache disk. >>>> dc_check <<<< dc_check runs dccp -P -t -1 This runs at a rate of about 4.5 files per second. The same test using PNFS Layer 2 runs at 23 files per second. This slowdown is tolerable for the occasional global pre-stage. But dc_check does not give any estimate of which pool holds the file, something we had with the PNFS Layer 2 data. >>>> srmls <<<< srmls -l seems to give file location information via locality:ONLINE_AND_NEARLINE The per-query overhead of srmls is about 8 seconds, making it too slow for testing tens of thousands of files. srmls -l is slower than dc_check even for large directories, about 3 seconds per file. But srmls has a fatal flaw. It quietly lists only the first 999 files in a given directory. This can be worked around using the count and offset options, but this requires private knowledge of the directory content, and special logic in each application for making multiple queries. Not worth it, when the rate is so slow. I do not trust a tool which has such a deep flaw. 
The bottom line is that we can work around the problem by using dccp -P -t -1 , but this is very inefficient. I would like to know where dccp gets its information, to avoid the cost of activating the dccp image. Formerly this information came from the PNFS Layer 2 data. ___________________________________________ ####### # AFS # ####### loiacono has no access to $MINOS_DATA/d190 /afs/fnal.gov/files/data/minos/d190 MINOS26 > fs listacl d190 Access list for d190 is Normal rights: minos rlidwka system:administrators rlidwka buckley rlidwka pts membership minos | grep loiacono nuthin pts adduser -user loiacono -group minos pts membership minos | grep loiacono loiacono She still has no access. fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d190 \ -acl system:authuser rl That worked, can read and listacl. ######### # STAGE # ######### The form of the http://fndca.fnal.gov:2288/queueInfo has changed, such that we no longer get valid queue feedback. The data was once in a single line, headed by Total. Now it is split across many lines in the HTML source. ######### # STAGE # ######### We need the full set of near cedar_phy_bhcurv/.bcnd_data files 2005-03 through 2008-07 Check file counts MINOS26 > find /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data -type f | wc -l 24182 MINOS26 > find /pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data -type f | wc -l 17599 Check file sizes MINOS26 > ( cd /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2007-10 ; du -sm * ; ) 21 F00039719_0003.spill.bcnd.cedar_phy_bhcurv.0.root 23 F00039719_0004.spill.bcnd.cedar_phy_bhcurv.0.root 22 F00039719_0005.spill.bcnd.cedar_phy_bhcurv.0.root 22 F00039719_0006.spill.bcnd.cedar_phy_bhcurv.0.root 24 F00039719_0007.spill.bcnd.cedar_phy_bhcurv.0.root Net size would be somewhat over 24182*22 = 532 GBytes. Check file families, so we can do this by volume ( cd /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data ; enstore pnfs --tags ; ) .(tag)(library) = CD-LTO3 .(tag)(file_family) = reco_far_cedar_phy_bhcurv_bcnd ./volumes vols FVOLS=`./volumes reco_far_cedar_phy_bhcurv_bcnd` echo $FVOLS VOC190 VOC193 VOH334 VOK485 Test one volume enstore info --list=VOC190 ./stage -d -s cedar_phy_bhcurv/.bcnd_data VOC190 Staging files from tape VOC190 OK JUST TESTING Staging VOC190 Version 20071012 STARTING Tue Nov 25 13:30:37 CST 2008 Prestaging 3389 files Selecting cedar_phy_bhcurv/.bcnd_data .NEED /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2007-01/F00037210_0008.spill.bcnd.cedar_phy_bhcurv.0.root dccp -P dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2007-01/F00037210_0008.spill.bcnd.cedar_phy_bhcurv.0.root Run the full stage { for VOL in ${FVOLS} ; do ./stage -w -s cedar_phy_bhcurv/.bcnd_data ${VOL} done ; } > /minos/scratch/kreymer/log/stage/cpbcnd.log 2>&1 & Staging VOC190 Version 20071012 STARTING Tue Nov 25 13:35:28 CST 2008 ... I do not see much backlog building. Restarted with a 1 second pause, not the default 5 Change stage_limit to 2000, from 200. These are small files. 
kill %2 { for VOL in ${FVOLS} ; do ./stage -w -p 1 -s cedar_phy_bhcurv/.bcnd_data ${VOL} done ; } >> /minos/scratch/kreymer/log/stage/cpbcnd.log 2>&1 & Staging VOC190 Version 20071012 STARTING Tue Nov 25 14:04:56 CST 2008 Prestaging 3389 files 1/3389 Tue Nov 25 14:03:47 CST 2008 queue=0/2000 Staging VOC193 Version 20071012 STARTING Tue Nov 25 15:28:44 CST 2008 Prestaging 571 files Staging VOH334 Version 20071012 STARTING Tue Nov 25 15:43:05 CST 2008 Prestaging 5520 files Killed at 16:44 CST, queue is up to 1974, and the stage script will not back off due to web page changes. less /local/scratch26/kreymer/log/stage/VOH334.20081125.log 2511/5520 Tue Nov 25 16:43:29 CST 2008 queue=0/2000 VOH334 is mounted, transferring data. Check out the last file from VOC193 ./dccptest /reco_far/cedar_phy_bhcurv/.bcnd_data/2006-03/F00034263_0013.spill.bcnd.cedar_phy_bhcurv.0.root Cache open succeeded in 1.09s. 14207324 bytes in 0 seconds Corrected stage to handle new web page format, for proper metering { for VOL in VOH334 VOK485 ; do ./stage.20081125 -w -p 1 -s cedar_phy_bhcurv/.bcnd_data ${VOL} done ; } >> /minos/scratch/kreymer/log/stage/cpbcnd.log 2>&1 & FINISHED Tue Nov 25 23:00:20 CST 2008 But only a net of about 9K files were restored. Date: Wed, 26 Nov 2008 16:23:56 +0000 (GMT) From: Arthur Kreymer To: scavan@fnal.gov, msanchez@fnal.gov Cc: Patricia Vahle , minos-data@fnal.gov Subject: Re: cedar_phy_bhcurv .bntp file staging The prestaging of these files finished late last night. It should be OK to run full blast on the files under /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data ########## # PARROT # ########## Added GROWFS section to HOWTO.parrot, for rebuilding growfsdir Changed archives to GROWFS subdirectory, cleaned up d120 Made new directory for d119 make_growfs: 1241589 files, 5675 links, 177026 dirs, 0 checksums computed ########## # PARROT # ########## Tracking down SAM problems under parrot. On CDF and GP farm nodes. ( fnpc338, fcdfcaf1502 With and without squid. With or without fresh PTD cache. export PRO=/grid/fermiapp/minos/parrot REL=current ; ARC='x86_64-linux-2.6' ; DAT='' export VER=cctools-${REL}${DAT}-${ARC} export PARROT_DIR=${PRO}/${VER} export PATH=${PARROT_DIR}/bin:${PATH} PTD=/local/stage1/${LOGNAME}/parrot rm -r ${PTD} parrot -m ${PARROT_DIR}/mountfile.grow -H -t ${PTD} /bin/bash PS1='P> ' export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup ls -d /afs/fnal.gov/files/code/e875/general/minossoft unset SETUP_UPS SETUPS_DIR . /afs/fnal.gov/files/code/e875/general/ups/etc/setups.sh setup sam No default SAM configuration exists at this time. Works OK on fnpc340, which has AFS. Try rebuilding the products index I can now setup sam, but get segmentation fault. Still fail to be able to run sam. 
/tmp/minossoft_30632.setup_script: line 5: 30939 Segmentation fault /afs/fnal.gov/files/code/e875/general/minossoft/setup/datagram/datagram_client.py "[sh] kreymer minos_offline R1.24.2 -q GCC_3_4 # kernel 2.6.9-78.0.1.ELsmp OS 4.5 " ######## # FARM # ######## DIR=313 ./stage ${MCIND}/${DIR} http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/lazy to sorted r-stkendca14a-5 r-stkendca14a-5 + r-stkendca14a-5 r-stkendca14a-5 r-stkendca14a-5 r-stkendca14a-5 r-stkendca14a-6 r-stkendca14a-5 r-stkendca16a-6 r-stkendca14a-5 r-stkendca14a-5 r-stkendca14a-5 r-stkendca14a-6 r-stkendca14a-5 r-stkendca14a-6 r-stkendca14a-6 + r-stkendca16a-6 r-stkendca14a-6 r-stkendca14a-5 r-stkendca14a-6 r-stkendca14a-6 r-stkendca14a-6 r-stkendca14a-5 r-stkendca14a-6 r-stkendca14a-5 r-stkendca16a-6 + r-stkendca14a-6 r-stkendca16a-6 This is a reasonable balance I suppose, 3 pools involved, 2 hosts Will continue with rest of the stages : MCIND=mcin_data/near/daikon_00/L010185N DIRS=`cat /minos/data/minfarm/lists/mclist.cedar_phy_linfix | cut -c 6-8 | sort -u ` echo $DIRS 304 305 306 307 310 311 312 313 Removed those already done : DIRS='304 305 306 307 310 311' FINISHED Tue Nov 25 10:16:40 CST 2008 ============================================================================= 2008 11 24 ============================================================================= ######## # FARM # ######## Draft to rubin : On Fri, 21 Nov 2008, Howard Rubin wrote: > You'll find the list of still-to-be-run jobs in > /minos/data/minfarm/lists/mclist.cedar_phy_bhcurv. This file is empty. I think you meant /minos/data/minfarm/lists/mclist.cedar_phy_linfix MINOS27 > wc -l /minos/data/minfarm/lists/mclist.cedar_phy_linfix 980 /minos/data/minfarm/lists/mclist.cedar_phy_linfix I've gotten a list of directories based on this list : DIRS=`cat /minos/data/minfarm/lists/mclist.cedar_phy_linfix | cut -c 6-8 | sort -u ` echo $DIRS 223 224 225 303 304 305 306 307 310 311 312 313 I've checked that the file counts are about right MCIND=mcin_data/near/daikon_00/L010185N for DIR in ${DIRS}; do ls /pnfs/minos/${MCIND}/${DIR} | wc -l ; done 109 86 10 99 109 109 107 64 96 109 108 11 I've not started the file restores : 08:54 cd ~kreymer/minos/scripts { for DIR in ${DIRS}; do ./stage -w ${MCIND}/${DIR} done ; } > /minos/scratch/kreymer/log/stage/linfix.log & This could take a while Of the 13 LTO_3 drives 1 writing mcin nccohbkg 6 in mount or dismount wait 6 writing exp-db, 3 dismount/mount 1 seek 2 active ########## # DCACHE # ########## Tests that all raw data is on tape per below for FILE in F081119_000006.mdcs.root B081120_000001.mbeam.root ; do ./dc_stat ${FILE} ; done for FILE in ${NFILES} ; do ./dc_stat ${FILE} ; done for FILE in ${FFILES} ; do ./dc_stat ${FILE} ; done ############### # MINOS-SAM04 # ############### Requested sam and samread accounts, copy of .k5login from minos-sam01. Date: Mon, 24 Nov 2008 17:57:52 -0600 (CST) Subject: HelpDesk ticket 125435 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125435 ___________________________________________ Short Description: minos-sam04 needs /home/sam and samread Problem Description: runs-sys : Please create /home/sam and /home/samread login areas on minos-sam04, and copy the .k5login files from minos-sam01. ___________________________________________ Date: Tue, 25 Nov 2008 08:19:59 -0600 (CST) This ticket has been reassigned to COOPER, GLENN of the CD-SF/FEF Group. 
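An aside on the restore-pool balance check in the FARM section above : rather than eyeballing
the pool list from the lazy restore handler page, the distribution can be tallied mechanically.
A sketch, assuming the pool names can be pulled from that page with a simple pattern match
( the page layout itself is not verified here ) :

URL=http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/lazy
# pending restores per pool, busiest first
curl -s ${URL} | grep -o 'r-stkendca[0-9]*a-[0-9]*' | sort | uniq -c | sort -rn
# and per host, to see how many servers are involved
curl -s ${URL} | grep -o 'r-stkendca[0-9]*a' | sort | uniq -c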
___________________________________________ Date: Tue, 25 Nov 2008 10:51:42 -0600 (CST) Solution: gcooper@fnal.gov sent this solution: Added home areas and .k5login files. ___________________________________________ ___________________________________________ ########### # ENSTORE # ########### library is CD-LTO4F1, per ( cd /pnfs/minos/mcout_data/cedar_phy_linfix/near/daikon_00; enstore pnfs --tags ) I do not recall setting this up. ########## # DCACHE # ########## Date: Mon, 24 Nov 2008 23:23:23 +0000 (GMT) From: Arthur Kreymer To: Gene Oleynik Cc: minos-data@fnal.gov Subject: Re: dCache upgrade/expansion On Fri, 21 Nov 2008, Gene Oleynik wrote: > The new hardware is in place. We still have to install OS etc, and plan > the migration from new hardware. Seems to me you will get most benefit > from bringing up the minos expansion first. > > How do you want these new pools configured? What file families, > read/write, etc. We would like to deploy the additional 12 TB of disk as follows : Expand RawDataWritePools from 6 TB to 8 TB Selection unchanged. Expand MinosPrdReadPools from 13 TB to 23 TB Selection has been a long list of file families like minos.mcout_cedar_phy_bhcurv_far_daikon_04_sntp If wild cards worked the way we might wish, this would be minos.*sntp minos.*mrnt We could discuss shifting the explicit selection to the general readPools, with a somewhat shorter list like minos.*bcnd minos.*cand minos.mcin* The family selections can be updated after deployment, if this is desired. ########## # CONDOR # ########## Removed 'fnpc374.fnal.gov' from entry_gpgeneral/nodes.blacklist as the file system mounts have been restored ############# # MDSUM_LOG # ############# Corrected to use fine for a subdirectory list, due to files at top level of minfarm. MIN > ln -sf mdsum_log.20081124 mdsum_log # was mdsum_log.20081118 ######### # ADMIN # ######### Date: Mon, 24 Nov 2008 11:55:34 -0600 (CST) Subject: HelpDesk ticket 125388 ___________________________________________ Short Description: Minos Cluster has stale NFS mounts of /grid/data Problem Description: run2-sys : During the maintenance period last Thursday 20 Nov, the /grid/data files were moved to a new server There are stale NFS mounts of /grid/data on most Minos Cluster systems, and minos-sam01 minos-sam02 minos-sam03 minos-mysql2 minos-mysql3 We do not use /grid/data heavily on the Cluster or servers, but it would be nice to have the mounts cleaned up at your next convenience. ___________________________________________ Date: Mon, 24 Nov 2008 12:37:01 -0600 (CST) This ticket has been reassigned to COOPER, GLENN of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 24 Nov 2008 14:43:21 -0600 (CST) Solution: gcooper@fnal.gov sent this solution: /grid/data remounted on minos[01-24], minos-sam[01-03], minos-mysql[2-3]. This ticket was resolved by COOPER, GLENN of the CD-SF/FEF group. 
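An aside on the "is it on tape yet" checks in this entry : dc_stat is the tool used here,
but the same question can be asked of the PNFS layer 4 metadata directly. A sketch; the
layer-4 convention ( volume label on the first line ) is standard Enstore usage, not
something taken from dc_stat itself, so treat this as illustrative.

FILE=/pnfs/minos/fardcs_data/2008-11/F081119_000006.mdcs.root   # example from the dc_stat loop above
DIR=`dirname ${FILE}` ; NAME=`basename ${FILE}`
VOL=`head -1 "${DIR}/.(use)(4)(${NAME})" 2>/dev/null`
if [ -n "${VOL}" ] ; then
    echo "${NAME} is on tape ${VOL}"
else
    echo "${NAME} has no tape copy yet"
fi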
___________________________________________ ___________________________________________ ============================================================================= 2008 11 22 Sat ============================================================================= ########## # DCACHE # ########## Looking at precious space in RawDataWritePools Pool MB precious w-stkendca10a-3 2494 w-stkendca11a-3 165 w-stkendca12a-3 0 w-stkendca8a-1 0 w-stkendca8a-2 0 w-stkendca9a-3 468 ############ # POOLSTAT # ############ hacked poolstat.20081124 to give full pool listing : ./poolstat full ln -sf poolstat.20081124 poolstat # was poolstat.20080707 ########## # DCACHE # ########## Date: Sat, 22 Nov 2008 14:04:37 -0600 (CST) Subject: HelpDesk ticket 125350 ___________________________________________ Short Description: FNDCA has not written tape since Thursday maintenance ? Problem Description: dcache-admin - It appears that no Minos raw data files have been written to tape from the RawDataWritePools group since Thursday 19 November. Looking at http://fndca3a.fnal.gov:2288/usageInfo, I see many precious files in write pools across the whole DCache system, over half of the capacity of some pools. Most of the writes I see active in Enstore are for one file family. minos.mcout_cedar_phy_linfix_near_daikon_00_cand.cpio_odc There are also problems with file restores from tape. There are hundreds of file restores pending for the readPools group, but all are directed to one pool, r-stkendca15a I will ask the Minos team to shut down production processing, to help take some of the load off the system until these problems are resolved. ___________________________________________ Date: Sat, 22 Nov 2008 15:17:33 -0600 From: Howard Rubin This is almost certainly related to the read problems I've been having, which I first reported several weeks ago, and which have, in general, been either ignored or 'turned over to the developers.' At the present time there are 542 linfix jobs in the system, either running or idle, so when they finish the system will be empty except for (data) keep-up. So far today (2008-11-22) 133 jobs have failed on srm input while 1618 jobs have finished successfully. This is roughly consistent with the error rate I've quoted in recent tickets. There have been no output failures due to the lost mounts. ___________________________________________ Date: Mon, 24 Nov 2008 12:12:08 -0600 From: Timur Perelmutov The problems were due to the fact that pnfs was not mounted on pools after restart. This was fixed by Vladimir over the weekend. ___________________________________________ Date: Mon, 24 Nov 2008 19:01:07 +0000 (GMT) We still have raw data files not archived to tape since Wednesday 19 Nov. For example, /pnfs/minos/fardet_data/2008-11/F00042222_0005.mdaq.root The nightly pool listing shows this file in w-stkendca12a-3. http://fndca3a.fnal.gov:2288/usageInfo shows 0 MB of precious files in this pool w-stkendca12a-3. Why is this file not being written ? Here is a summary of precious space presently reported at usageInfo . As noted above, these space reports are not consistent with the list of files not on tape. Pool MB precious w-stkendca10a-3 2494 w-stkendca11a-3 165 w-stkendca12a-3 0 w-stkendca8a-1 0 w-stkendca8a-2 0 w-stkendca9a-3 468 ___________________________________________ Date: Mon, 24 Nov 2008 13:14:43 -0600 From: George Szmuksta From your perspective are the dcache problems fixed? 
___________________________________________ Date: Mon, 24 Nov 2008 13:21:39 -0600 From: Margaret Votava i think minos is still having a problem? Is it fixed from their perspective? ___________________________________________ Date: Mon, 24 Nov 2008 13:22:09 -0600 From: Margaret Votava I don't think so. Art? Gene Oleynik wrote: Minos has production back up now, correct? ___________________________________________ Date: Mon, 24 Nov 2008 13:24:35 -0600 From: Gene Oleynik To: Margaret Votava I asked George Szmuksta to follow up on the ticket. Art, if there are still issues let me know asap. ___________________________________________ Date: Mon, 24 Nov 2008 13:37:50 -0600 From: Timur Perelmutov Many of the precious files remain on disk, because they are deleted from pnfs. In order to prevent potential data loss, the files in these cases are not deleted automatically. They can not be written to tape either, as enstore will detect that they are deleted from pnfs. We perform a periodic manual clean up of these files. The behavior will be different in dCache 1.9. So certain accumulation of the precious space on the write pools is normal and should not be considered a system malfunction. ___________________________________________ georges,minos_batch,dcache-admin,minos-data,timur,votava,oleynik The following files, dating from 19 through 21 November are not on tape . I have listed the pools that they seem to be in, per today's pool listings. Other more recent files are on tape. /pnfs/minos/fardet_data/2008-11 pools F00042222_0002.mdaq.root 11a-3 F00042222_0003.mdaq.root 10a-3 12a-3 F00042222_0004.mdaq.root 10a-3 F00042222_0005.mdaq.root 10a-3 12a-3 F00042222_0006.mdaq.root 10a-3 F00042222_0007.mdaq.root 10a-3 F00042222_0008.mdaq.root 10a-3 11a-3 F00042222_0009.mdaq.root 10a-3 F00042222_0010.mdaq.root 10a-3 F00042222_0011.mdaq.root 10a-3 F00042222_0012.mdaq.root 10a-3 12a-3 F00042222_0013.mdaq.root 10a-3 F00042222_0014.mdaq.root 10a-3 F00042222_0015.mdaq.root 10a-3 F00042222_0016.mdaq.root 10a-3 F00042222_0017.mdaq.root 10a-3 F00042222_0018.mdaq.root 10a-3 F00042222_0019.mdaq.root 10a-3 F00042222_0020.mdaq.root 10a-3 F00042222_0021.mdaq.root 10a-3 11a-3 /pnfs/minos/fardcs_data/2008-11 F081119_000006.mdcs.root 10a-3 /pnfs/minos/neardet_data/2008-11 N00015199_0014.mdaq.root 11a-3 N00015199_0015.mdaq.root 10a-3 N00015199_0016.mdaq.root 10a-3 11a-3 N00015199_0017.mdaq.root 10a-3 N00015199_0018.mdaq.root 10a-3 N00015199_0019.mdaq.root 10a-3 N00015199_0020.mdaq.root 10a-3 11a-3 N00015199_0021.mdaq.root 10a-3 N00015199_0022.mdaq.root 10a-3 N00015199_0023.mdaq.root 10a-3 N00015199_0024.mdaq.root 10a-3 N00015200_0000.mdaq.root 10a-3 N00015201_0000.mdaq.root 10a-3 N00015202_0000.mdaq.root 10a-3 N00015202_0001.mdaq.root 10a-3 N00015202_0002.mdaq.root 10a-3 N00015202_0003.mdaq.root 10a-3 N00015202_0004.mdaq.root 10a-3 N00015202_0005.mdaq.root 10a-3 N00015202_0006.mdaq.root 10a-3 N00015202_0007.mdaq.root 10a-3 N00015202_0008.mdaq.root 10a-3 /pnfs/minos/beam_data/2008-11 B081120_000001.mbeam.root 10a-3 ___________________________________________ Date: Mon, 24 Nov 2008 15:03:45 -0600 From: Alex Kulyavtsev I was looking on one file you referred before and I was going to ask do you know other files like that. You do. Thanks for the info - I'll let you know as we learn more. ___________________________________________ Date: Mon, 24 Nov 2008 17:31:54 -0600 From: Alex Kulyavtsev the issue was due to pnfs not mounted during pool restart. dcache did not find file in pnfs and decided to deactivate requests. 
I restarted pools 10-3, 11-3 and 12-3 and requests were flushed to tape. Could you please confirm files were written to tape ? ___________________________________________ Date: Tue, 25 Nov 2008 00:10:38 +0000 (GMT) From: Arthur Kreymer I have re-scanned the full file list. They all appear to be on tape. Thanks ! ___________________________________________ Date: Mon, 01 Dec 2008 19:39:13 +0000 (GMT) From: Arthur Kreymer RawDataWritePools files dated before Nov 30, but not yet on tape, Pools are determined from the nightly Pool Directory Listings at http://fndca3a.fnal.gov/dcache/files/ F00042247_0005.mdaq.root w-stkendca10a-3 F00042247_0006.mdaq.root w-stkendca10a-3 F00042247_0007.mdaq.root w-stkendca10a-3 F00042247_0008.mdaq.root w-stkendca10a-3 F00042247_0009.mdaq.root w-stkendca10a-3 F00042247_0010.mdaq.root w-stkendca10a-3 F00042247_0011.mdaq.root w-stkendca10a-3 F00042247_0012.mdaq.root w-stkendca10a-3 F00042247_0013.mdaq.root w-stkendca10a-3 F00042247_0014.mdaq.root w-stkendca10a-3 F00042247_0015.mdaq.root w-stkendca10a-3 F00042247_0016.mdaq.root w-stkendca10a-3 F00042247_0017.mdaq.root w-stkendca10a-3 F00042247_0018.mdaq.root w-stkendca10a-3 F00042247_0019.mdaq.root w-stkendca10a-3 F00042247_0020.mdaq.root w-stkendca10a-3 F00042247_0021.mdaq.root w-stkendca10a-3 F00042247_0022.mdaq.root w-stkendca10a-3 F00042248_0000.mdaq.root w-stkendca10a-3 F00042250_0000.mdaq.root w-stkendca10a-3 F00042250_0001.mdaq.root w-stkendca10a-3 F00042250_0002.mdaq.root w-stkendca10a-3 F00042250_0003.mdaq.root w-stkendca10a-3 F00042250_0004.mdaq.root w-stkendca10a-3 F00042250_0005.mdaq.root w-stkendca10a-3 F00042250_0006.mdaq.root w-stkendca10a-3 F00042250_0007.mdaq.root w-stkendca10a-3 F00042250_0008.mdaq.root w-stkendca10a-3 F00042250_0009.mdaq.root w-stkendca10a-3 F00042250_0010.mdaq.root w-stkendca10a-3 F00042250_0011.mdaq.root w-stkendca10a-3 F00042250_0015.mdaq.root w-stkendca10a-3 F00042250_0016.mdaq.root w-stkendca10a-3 F00042250_0017.mdaq.root w-stkendca10a-3 F00042250_0018.mdaq.root w-stkendca10a-3 F00042250_0019.mdaq.root w-stkendca10a-3 F00042250_0020.mdaq.root w-stkendca10a-3 F00042252_0000.mdaq.root w-stkendca10a-3 F00042253_0000.mdaq.root w-stkendca10a-3 F00042253_0001.mdaq.root w-stkendca10a-3 F00042253_0002.mdaq.root w-stkendca10a-3 F00042253_0003.mdaq.root w-stkendca10a-3 F00042253_0004.mdaq.root w-stkendca10a-3 F00042253_0005.mdaq.root w-stkendca10a-3 F00042253_0006.mdaq.root w-stkendca10a-3 F00042253_0007.mdaq.root w-stkendca10a-3 F00042253_0008.mdaq.root w-stkendca10a-3 F00042253_0009.mdaq.root w-stkendca10a-3 F00042253_0012.mdaq.root w-stkendca10a-3 F00042253_0013.mdaq.root w-stkendca10a-3 F00042253_0014.mdaq.root w-stkendca10a-3 F00042253_0015.mdaq.root w-stkendca10a-3 F00042253_0016.mdaq.root w-stkendca10a-3 F00042253_0017.mdaq.root w-stkendca10a-3 F00042253_0018.mdaq.root w-stkendca10a-3 F00042253_0019.mdaq.root w-stkendca10a-3 F00042253_0020.mdaq.root w-stkendca10a-3 F00042253_0021.mdaq.root w-stkendca10a-3 F00042253_0022.mdaq.root w-stkendca10a-3 F00042253_0023.mdaq.root w-stkendca10a-3 F00042255_0000.mdaq.root w-stkendca10a-3 F00042256_0000.mdaq.root w-stkendca10a-3 F00042256_0001.mdaq.root w-stkendca10a-3 F00042256_0002.mdaq.root w-stkendca10a-3 F00042256_0003.mdaq.root w-stkendca10a-3 F00042256_0006.mdaq.root w-stkendca10a-3 F00042256_0007.mdaq.root w-stkendca10a-3 F00042256_0008.mdaq.root w-stkendca10a-3 F00042256_0009.mdaq.root w-stkendca10a-3 F00042256_0010.mdaq.root w-stkendca10a-3 F00042259_0011.mdaq.root w-stkendca10a-3 F081126_000010.mdcs.root w-stkendca10a-3 
F081127_000010.mdcs.root w-stkendca10a-3 F081128_000002.mdcs.root w-stkendca10a-3 N081126_000002.mdcs.root w-stkendca10a-3 N081127_000002.mdcs.root w-stkendca10a-3 N081128_000003.mdcs.root w-stkendca10a-3 F00042247_0023.mdaq.root w-stkendca11a-3 F00042250_0012.mdaq.root w-stkendca11a-3 F00042250_0014.mdaq.root w-stkendca11a-3 F00042253_0009.mdaq.root w-stkendca11a-3 N00015235_0008.mdaq.root w-stkendca11a-3 N00015237_0000.mdaq.root w-stkendca11a-3 N00015238_0003.mdaq.root w-stkendca11a-3 F00042249_0000.mdaq.root w-stkendca12a-3 F00042250_0013.mdaq.root w-stkendca12a-3 F00042251_0000.mdaq.root w-stkendca12a-3 F00042253_0010.mdaq.root w-stkendca12a-3 F00042253_0011.mdaq.root w-stkendca12a-3 F00042253_0015.mdaq.root w-stkendca12a-3 F00042254_0000.mdaq.root w-stkendca12a-3 F00042256_0004.mdaq.root w-stkendca12a-3 F00042256_0005.mdaq.root w-stkendca12a-3 N00015231_0000.mdaq.root w-stkendca12a-3 N00015234_0000.mdaq.root w-stkendca12a-3 N00015235_0009.mdaq.root w-stkendca12a-3 N00015235_0024.mdaq.root w-stkendca12a-3 N00015238_0002.mdaq.root w-stkendca12a-3 F00042245_0000.mdaq.root w-stkendca8a-1 F081125_000010.mdcs.root w-stkendca8a-1 N00015225_0000.mdaq.root w-stkendca8a-1 N00015228_0000.mdaq.root w-stkendca8a-1 F00042247_0003.mdaq.root w-stkendca8a-2 B081124_160001.mbeam.root w-stkendca9a-3 B081125_000001.mbeam.root w-stkendca9a-3 B081125_080001.mbeam.root w-stkendca9a-3 F081124_000012.mdcs.root w-stkendca9a-3 N081124_000002.mdcs.root w-stkendca9a-3 N081125_000002.mdcs.root w-stkendca9a-3 ___________________________________________ Date: Tue, 02 Dec 2008 22:37:43 +0000 (GMT) From: Arthur Kreymer To: dcache-admin@fnal.gov Cc: minos-data@fnal.gov Subject: Re: HelpDesk ticket 125350 (fwd) Is there any progress on this ? There are precious files in RawDataWritePools as old as 24 November For example, Nov 24 18:00 /pnfs/minos/beam_data/2008-11/B081124_160001.mbeam.root _____________________________________________ Date: Wed, 03 Dec 2008 17:25:28 +0000 (GMT) From: Arthur Kreymer To: helpdesk-forwarder@fnal.gov, helpdesk@fnal.gov, dcache-admin@fnal.gov Cc: minos-data@fnal.gov, timur@fnal.gov, oleynik@fnal.gov, georges@fnal.gov Subject: Re: HelpDesk ticket 125350 (fwd) <-- # @@@ Enter Update below this line. @@@ # --> I have had no feedback on this ticket since November 24. Some of our raw data files have been pending for over a week now ! There are precious files in RawDataWritePools as old as 24 November For example, Nov 24 18:00 /pnfs/minos/beam_data/2008-11/B081124_160001.mbeam.root I have updated the ticket, sent mail to dcache-admin, and raised the issue in CD ops and Grid Ops meetings. Still no response at all. Is anyone there ? Hello ? <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Date: Wed, 03 Dec 2008 11:59:44 -0600 From: Vladimir Podstavkov Yes, we are working on this. If you noticed some of these files have been written to the tape yesterday evening. We are investigating what caused this problem and don't want just blindly flush everything. Sorry, we have had to let you know that we are looking into it. _________________________________________ Date: Wed, 03 Dec 2008 18:07:07 +0000 (GMT) From: Arthur Kreymer Thanks for the feedback. Because the Minos beam is off for another week, you can take whatever time is needed to investigate this. _________________________________________ Date: Wed, 03 Dec 2008 14:14:14 -0600 From: Vladimir Podstavkov Finally I have found the cause of the problem. 
It turned out that the timeout for Minos pools has been set to 10 days instead of one by mistake. I have changed the setup files and will change the actual values on all pools, so all files will be flushed within a day. Sorry for inconvenience and thank you for your patience! _________________________________________ ============================================================================= 2008 11 21 ============================================================================= ######### # MYSQL # ######### Testing minos-mysql2 gcooper created /home/minsoft and /data/database directories, and copied .k5login Next, requested minsoft be in group mysql 9531 per ups tailor mysql Iterated account/files, initially local group file had mysql/27 group . The base server is running ! Date: Tue, 25 Nov 2008 12:25:20 -0600 (CST) From: Glenn Cooper I finally got back to this. Removed the mysql entry from local /etc/group file, and also changed nsswitch.conf to use the NIS map as well as the local file. ########## # CONDOR # ########## The load average took off starting at 12:00, up over 55 at 14:00 MINOS25 > touch /minos/scratch/kreymer/test1121 MINOS25 > rm /minos/scratch/kreymer/test1121 MINOS25 > touch /minos/data/users/kreymer/test1121 MINOS25 > rm /minos/data/users/kreymer/test1121 condor response time is fine. MINOS25 > condor_q | tail -1 3016 jobs; 1651 idle, 1365 running, 0 held I see nothing recent in /var/log/messages I see no change in the condor queues around 12:00 Last output file in logs/glide/probe*out is Nov 20 10:50 logs/glide/probe.229053.0.log ${HOME}/minos/scripts/condorglide no output, nothing in queue condor_history kreymer ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD MINOS25 > condor_submit glide.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 229927. I see that over 200 pilots started up on CDF pool 3, from 11:22 through 12:18. These show up in MINOS25 > condor_q gfactory | tail -1 685 jobs; 33 idle, 652 running, 0 held But no user jobs seem to be making use of them yet, as of 14:40. 
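The two condor_q counts just below break the running pilots down by gatekeeper by hand.
The same breakdown per grid resource can be had in one pass; this assumes the condor_q -run
output format shown later in this entry, with the grid resource in the last field :

condor_q -run gfactory | awk '/jobmanager/ { print $NF }' | cut -d: -f1 | sort | uniq -c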
MINOS25 > condor_q -run gfactory | grep fnpcfg1 | wc -l 400 MINOS25 > condor_q -run gfactory | grep fermigridosg1 | wc -l 247 ######## # GRID # ######## for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ; ssh -ax ${HOST} 'ls -ld /fnal/ups/etc' ; done fcdfcaf1528 ls: /fnal/ups/etc: No such file or directory fcdfcaf1539 ls: /fnal/ups/etc: No such file or directory fcdfcaf1559 ls: /fnal/ups/etc: No such file or directory fcdfcaf1566 ls: /fnal/ups/etc: No such file or directory for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /fnal/ups /usr/local/etc/setups.* > /dev/null && echo'; done ; date fcdfcaf1528 ls: /fnal/ups: No such file or directory /usr/local/etc/setups.*: No such file or directory fcdfcaf1539 ls: /fnal/ups: No such file or directory /usr/local/etc/setups.*: No such file or directory fcdfcaf1559 ls: /fnal/ups: No such file or directory /usr/local/etc/setups.*: No such file or directory fcdfcaf1566 ls: /fnal/ups: No such file or directory /usr/local/etc/setups.*: No such file or directory Ran again, reversed file order, fcdfcaf1528 ls: /usr/local/etc/setups.*: No such file or directory ls: /fnal/ups: No such file or directory fcdfcaf1539 ls: /usr/local/etc/setups.*: No such file or directory ls: /fnal/ups: No such file or directory fcdfcaf1559 ls: /usr/local/etc/setups.*: No such file or directory ls: /fnal/ups: No such file or directory fcdfcaf1566 ls: /usr/local/etc/setups.*: No such file or directory ls: /fnal/ups: No such file or directory Date: Fri, 21 Nov 2008 12:52:33 -0600 (CST) Subject: HelpDesk ticket 125314 ___________________________________________ Short Description: Four fcdfcaf nodes lack /fnal/ups and /usr/local/etc/setups.* scripts Problem Description: Four of the fcdfcaf nodes lack a local UPS installation in /fnal/ups, and the associated /usr/local/etc/setups.[c]sh scripts : fcdfcaf1528 fcdfcaf1539 fcdfcaf1559 fcdfcaf1566 ___________________________________________ This ticket is assigned to TIMM, STEVE of the CD-SF/GF/FGS. ___________________________________________ Date: Fri, 21 Nov 2008 13:02:56 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: This is the first I knew that there is a /fnal/ups installation on any fcdfcaf nodes. There did not use to be any /fnal/ups installation on the cdf nodes at all and FermiGrid never gave Minos any representations that it would work or that you could count on it. Your scripts should not be dependent on /fnal/ups anywhere outside the GP Grid cluster, and it would be good to get rid of the dependency there too. That said, it appears that in the latest round of CDF node reinstalls that is ongoing, there is now a upsupdbootstrap rpm on most of the new ones. The four nodes that you mention in the ticket are currently being drained to be reinstalled again for other reasons and this problem will be resolved by FEF at that time. Steve Timm ___________________________________________ Date: Mon, 24 Nov 2008 10:50:24 -0600 (CST) Solution: These 4 systems have been removed from the batch system for a software reinstall that was scheduled to take place even before this ticket was filed. By the way--the FEF person who does these reinstalls is aware of what went wrong and it should be automatically fixed next time. Nevertheless FermiGrid makes no warranty of presence or usability of /fnal/ups outside the GP Grid cluster. Steve Timm This ticket was resolved by TIMM, STEVE of the CD-SF/GF/FGS group. 
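Given Steve Timm's point that /fnal/ups cannot be counted on outside the GP Grid cluster,
a job script can at least fail loudly up front instead of dying later with "command not found".
A minimal sketch of such a guard; the AFS fallback is the setups.sh path used elsewhere in
this log, and whether AFS is mounted on a given worker is itself not guaranteed :

if [ -r /usr/local/etc/setups.sh ] ; then
    . /usr/local/etc/setups.sh
elif [ -r /afs/fnal.gov/files/code/e875/general/ups/etc/setups.sh ] ; then
    . /afs/fnal.gov/files/code/e875/general/ups/etc/setups.sh
else
    echo "No UPS setups script on `hostname` - giving up" >&2
    exit 1
fi
# ... then setup sam, srt_setup, etc. as usual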
___________________________________________ ########## # CONDOR # ########## Date: Fri, 21 Nov 2008 10:37:21 -0600 (CST) Subject: HelpDesk ticket 125300 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Short Description: Minos glideins to CDF nodes not starting Problem Description: This morning at around 08:00, according to CondorView, the minosgli jobs disappeared sharply from the fcdfosg3 served nodes, No new Minos glideins to the CDF nodes have started since then. There are plenty of jobs still running on the fcdfosg3 pool, and there seems to be plenty of unused capacity there. Glideins continue to start normally on the gpfarm pool. fnpc4x1 looks OK, only one globus-job-manager running for minosgli. It seems that rubin's minospro jobs are also not running, with about 49 idle. ___________________________________________ Date: Fri, 21 Nov 2008 10:21:30 -0600 From: Marian Zvada To: cdf_caf_user@fnal.gov, cdf_jointphysics@fnal.gov Cc: cdfoom@fnal.gov Subject: cdfgrid fcdfhead10 down Dear Users, during the night we've experienced trouble with cdfgrid cluster. Now it's under recovery and not available for the users. We will announce when the system is back in normal. Sorry for any inconvenience, Marian (for the CAF team) ___________________________________________ Date: Fri, 21 Nov 2008 10:45:41 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: Art, how long had the glideins been running at the time? Also, were they going directly to fcdfosg3 or coming through fg1x1? There was a disruption at 07:52 on the CDF side of things from their submitter that submits to that grid. Maybe something was connected to that. I'll have a look. Steve Timm ___________________________________________ Date: Fri, 21 Nov 2008 11:19:16 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: It appears that most of the glideins which exited from the fcdfosg3/4 cluster around 8:00 exited of their own accord, with status zero. This was not at all due to any problems in the "cdfgrid" which is the CDF submission machine that would normally feed this cluster. There were a few, 10-20 glideins in the last number of days that were removed from the CDF cluster because they were above the 2GB/process memory limit on this cluster and got killed. Steve Timm ___________________________________________ Date: Mon, 24 Nov 2008 10:52:55 -0600 (CST) Solution: Information was given on the status of the recent minosgli glideins. This ticket was resolved by TIMM, STEVE of the CD-SF/GF/FGS group. ########## # CONDOR # ########## CDFCAF glideins seem to have stopped, probably around 08:00 ( sharp drop in client jobs, to near 0 ) Fri Nov 21 10:08:29 CST 2008 MINOS25 > condor_q -run | grep -v gfactory | grep fcdfcaf | wc -l 0 MINOS25 > condor_q -run | grep -v gfactory | grep fnpc | wc -l 387 MINOS25 > condor_q gfactory | tail -1 635 jobs; 83 idle, 552 running, 0 held No new glideins are getting started on cdf nodes, check the grid gateway ssh -ax fnpc4x1 'ps axfu | grep globus-job-manager | grep minos | grep -v grep' minosgli 1044 0.0 0.0 111932 5000 ? S 10:08 0:00 globus-job-manager -conf /usr/local/vdt-1.10.1/globus/etc/globus-job-manager.conf -type managedfork -rdn jobmanager-managedfork -machine-type unknown -publish-jobs Check that factory and frontends are running, looks OK ps -flu gfactory ps -flu gfrontend condor_q -run | grep gfactory ... 
229132.0 gfactory 11/20 13:22 0+20:48:26 gt2 fermigridosg1.fnal.gov:2119/jobmanager-condor 229132.1 gfactory 11/20 13:22 0+20:48:26 gt2 fermigridosg1.fnal.gov:2119/jobmanager-condor 229594.0 gfactory 11/21 06:31 0+02:42:39 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor 229615.0 gfactory 11/21 06:59 0+02:14:08 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor ... ########## # DCACHE # ########## PO 582564 F1F-141000HDRG SATABOY storage device configured with (14) 1TB disks I presume 12 TB net capacity,due to raid 5. We would like to deploy the additional 12 TB of disk as follows : Expand RawDataWritePools from 6 TB to 8 TB Expand MinosPrdReadPools from 13 TB to 23 TB ########## # DCACHE # ########## Date: Fri, 21 Nov 2008 09:43:34 -0600 From: Gene Oleynik The new hardware is in place. We still have to install OS etc, and plan the migration from new hardware. Seems to me you will get most benefit from bringing up the minos expansion first. How do you want these new pools configured? What file families, read/write, etc. ----------------------------------------------------------------------- ######## # FARM # ######## Start clearing out the backlog ------------------------------------------------------------------------ Summarizing /minos/data/minfarm/*cat Fri Nov 21 09:03:17 CST 2008 1719 232668 nearcat 5031 54693 farcat 30240 1287973 mcnearcat 26 1218 mcfarcat 0 1 mcfmockcat 7 1 WRITE 37023 1576554 TOTAL files, GBytes nearcat 66 1896 cosmic.sntp.cedar.0.root 412 164242 spill.cand.cedar_phy_bhcurv.0.root 587 25393 spill.mrnt.cedar_phy_bhcurv.0.root 1 47 spill.mrnt.cedar_phy_bhcurv.1.root 65 5649 spill.sntp.cedar.0.root 587 46647 spill.sntp.cedar_phy_bhcurv.0.root 1 87 spill.sntp.cedar_phy_bhcurv.1.root farcat 23 544 all.sntp.cedar.0.root 1359 33137 all.sntp.cedar_phy_bhcurv.0.root 23 170 spill.bntp.cedar.0.root 1201 8951 spill.bntp.cedar_phy_bhcurv.0.root 1201 8551 spill.mrnt.cedar_phy_bhcurv.0.root 23 112 spill.sntp.cedar.0.root 1201 5870 spill.sntp.cedar_phy_bhcurv.0.root mcnearcat 2 1151 cand.cedar_phy_bhcurv.1.root 2336 68216 mrnt.cedar_phy_bhcurv.0.root 2881 54129 mrnt.cedar_phy_bhcurv.1.root 54 1737 mrnt.cedar_phy_bhcurv.root 9813 190711 mrnt.cedar_phy_linfix.0.root 65 3114 mrnt.cedar_phy.root 2336 198362 sntp.cedar_phy_bhcurv.0.root 2881 185731 sntp.cedar_phy_bhcurv.1.root 54 5188 sntp.cedar_phy_bhcurv.root 9813 641990 sntp.cedar_phy_linfix.0.root 2 135 sntp.cedar_phy.root mcfarcat 3 120 mrnt.cedar_phy_linfix.0.root 4 209 sntp.cedar_phy_bhcurv.0.root 16 813 sntp.cedar_phy_bhcurv.root 3 132 sntp.cedar_phy_linfix.0.root ------------------------------------------------------------------------ mcfar linfix would write cleanly, waiting for permission. mcfar bhcurv has many several DUPs, would write nothing, ran this to produce a log for reference, ./roundup -r cedar_phy_bhcurv mcfar ~minfarm/ROUNTMP/LOG/2008-11/cedar_phy_bhcurvmcfar.log mcnear CBP charm files can be written, First a small test probe ./roundup -s charm -b 10 -r cedar_phy_bhcurv mcnear ~minfarm/ROUNTMP/LOG/2008-11/cedar_phy_bhcurvmcnearcharm.log Would start writing them all ./looper '-s charm -r cedar_phy_bhcurv mcnear' & But first, there may be too many duplicates, like n13037067 ./roundup -n -s charm -r cedar_phy_bhcurv mcnear 2>&1 | tee /tmp/cpbmcnc.log DUPEs include just 3 runs. 
DUPE n13037065_0027_L010185N_D04_charm.mrnt.cedar_phy_bhcurv.0.root DUPE n13037066_0002_L010185N_D04_charm.mrnt.cedar_phy_bhcurv.0.root DUPE n13037067_0001_L010185N_D04_charm.mrnt.cedar_phy_bhcurv.0.root Let's proceed with the rest : ./looper '-s charm -r cedar_phy_bhcurv mcnear' & rm /minos/data/minfarm/roundup/STOP.LOOPER ./looper '-s charm -r cedar_phy_bhcurv mcnear' & ============================================================================= 2008 11 20 ============================================================================= ########## # CONDOR # ########## rbpatter and Igor have cut back monitoring, to avoid overloads on minos25. Looks healthy with 660 user jobs glided, 690 pilots ########## # CONDOR # ########## rbpatter has implemented group priorities for high priority tasks. Announced to primer users. ############ # SHUTDOWN # ############ Thu Nov 20 19:31:49 CST 2008 kreymer@minos26 cd minos/scripts crontab crontab.dat mindata@minos26 cd crontab crontab.dat minfarm@fnpcsrv1 mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok ########## # DCACHE # ########## Date: Thu, 20 Nov 2008 20:28:10 +0000 (GMT) From: Arthur Kreymer To: minos-data@fnal.gov Cc: dcache-admin@fnal.gov, minos_batch@fnal.gov Subject: Holding of Minos writing to DCache for now Most FNDCA DCache services seem to have come back up around 13:13 today. Raw data files are being archived successfully. But 5 of the 12 pools in the writePools group are still offline. Summary from ~kreymer/minos/scripts/poolstat Thu Nov 20 14:26:24 CST 2008 DOWN TOT POOL GROUP 3/ 14 ExpDbWritePools 4/ 10 FermigridVolPools 12 KTeVReadPools 3/ 15 MinosPrdReadPools 2/ 8 RawDataWritePools 4/ 13 readPools 7/ 14 writePools Pools down in writePools : w-stkendca11a-4 w-stkendca12a-5 w-stkendca12a-6 w-stkendca6a-1 w-stkendca6a-2 I will not restart the Farm concatenation and MC import jobs, in order to reduce the load on that system, until we hear more from the DCache people, ------------------------------------------------------ Date: Thu, 20 Nov 2008 15:43:56 -0600 From: ssa-group@fnal.gov We believe we have resolved the pnfs problems. As far as we know everything is once again operational. Please report any problems. ------------------------------------------------------ Thu Nov 20 19:20:00 CST 2008 DOWN TOT POOL GROUP 3/ 14 ExpDbWritePools 4/ 10 FermigridVolPools 12 KTeVReadPools 3/ 15 MinosPrdReadPools 2/ 8 RawDataWritePools 4/ 13 readPools 4/ 14 writePools This looks OK'ish to me ------------------------------------------------------ ######## # DCAP # ######## ups copy dcap v2_42_f0710 -q unsecured -G "dcap v2_42_f0710 -q unsecured" ups declare -c dcap v2_42_f0710 -q unsecured copy succeeded around 13:30 ######## # DATA # ######## PNFS/ftp test succeeded 10 Thu Nov 20 13:18:37 CST 2008 557 Date: Thu, 20 Nov 2008 12:09:07 -0600 (CST) From: Steven Timm The /grid/data file system is now available again for use, on the new disk with increased size. 
Tested roundup, too few write pools : OOPS - POOLS ACTIVE NEED 12 7 8 ########## # PARROT # ########## mindata@minos26 PD=/minos/scratch/parrot MD=/afs/fnal.gov/files/data/minos/d120 MDB=${MD}/20081106 cd ${PD} MDB=${MD}/GROWFSDIR/20081106 mkdir -p ${MDB} cp -va ${MD}/.grow* ${MDB}/ date ; time ./make_growfs.auto -k ${MD} Thu Nov 20 10:44:05 CST 2008 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d120/.growfsdir make_growfs: 2710628 files, 8291 links, 125209 dirs, 0 checksums computed real 28m3.788s user 2m56.720s sys 10m48.974s MDB=${MD}/20081120 mkdir -p ${MDB} cp -va ${MD}/.grow* ${MDB}/ $ du -sm $MD/*/.growfsdir 120 /afs/fnal.gov/files/data/minos/d120/20081106/.growfsdir 123 /afs/fnal.gov/files/data/minos/d120/20081120/.growfsdir ######## # FARM # ######## Moving to the new scripts, which give correct MISS lists SRV1> cp -a AFSS/roundup.20081118 . SRV1> cp -a AFSS/samsub.20081118 . ######## # DATA # ######## Date: Wed, 19 Nov 2008 18:03:36 -0600 I am having problems with: fcdfcaf1566.fnal.gov It was around 5:15 pm yesterday. I would get the following errors on that node: /minos/scratch/pawloski/EntProc/condor_job_glidein_Reco_SameCali_FarTauMC.sh : line 8: srt_setup: command not found /minos/scratch/pawloski/EntProc/condor_job_glidein_Reco_SameCali_FarTauMC.sh : line 32: dccp: command not found /minos/scratch/pawloski/EntProc/condor_job_glidein_Reco_SameCali_FarTauMC.sh : line 42: loon: command not found Note I used a release at: /grid/app/minos/ by sourcing the following script: /grid/app/minos/users/boehm/setup_minossoft_MINOS_BATCH_GRID.sh R1.24.2 Greg Reply - It scanned OK at 09:17, ypwhich fcdfcaf1575 N.B. - see 2008 11 21 GRID note - lack of /fnal/ups ########## # CONDOR # ########## MINOS25 > uptime 09:01:22 up 29 days, 17:09, 8 users, load average: 213.06, 213.04, 213.00 MINOS25 > lsof ^C MINOS25 > df -h Filesystem Size Used Avail Use% Mounted on /dev/hda1 9.9G 7.1G 2.4G 76% / none 2.0G 0 2.0G 0% /dev/shm /dev/hda5 1012M 535M 426M 56% /tmp /dev/hda6 22G 244M 21G 2% /var /dev/hdb1 230G 180G 38G 83% /local/scratch25 ^C MINOS25 > cat /etc/fstab # This file is edited by fstab-sync - see 'man fstab-sync' for details LABEL=/ / ext3 defaults 1 1 none /dev/pts devpts gid=5,mode=620 0 0 none /dev/shm tmpfs defaults 0 0 none /proc proc defaults 0 0 none /sys sysfs defaults 0 0 LABEL=/tmp /tmp ext3 defaults 1 2 LABEL=/var /var ext3 defaults 1 2 LABEL=SWAP-hda2 swap swap defaults 0 0 LABEL=SWAP-hda3 swap swap defaults 0 0 stkensrv1:/minos /pnfs/minos nfs user,intr,bg,hard,ro,noac 0 0 LABEL=/local/scratch25 /local/scratch25 ext3 defaults 0 0 minos-nas-0.fnal.gov:/minos/data /minos/data nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 minos-nas-0.fnal.gov:/minos/scratch /minos/scratch nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-data /grid/data nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-app /grid/app nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-fermiapp /grid/fermiapp nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 /dev/hdc /media/cdrom auto pamconsole,exec,noauto,managed 0 0 /dev/fd0 /media/floppy auto pamconsole,exec,noauto,managed 0 0 blue2.fnal.gov:/minos/data /minos/data2 nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 Grabbing the ps axf output, as of around 09:00 cat > /local/scratch25/kreymer/psaxf25.20081120 MINOS25 > grep /minos/scratch /local/scratch25/kreymer/psaxf25.20081120 11247 ? 
D 0:02 condor_submit /minos/scratch/tinti/Cleaning/RecoStudies.run 12276 ? D 0:00 cat /minos/scratch/tinti/condor_logs/datamc.log.229046.0 14360 ? D 0:00 cat /minos/scratch/tinti/condor_logs/datamc.log.229046.0 26143 ? D 0:00 /bin/sh submit_badChanSntpGen.sh 1 2 /minos/scratch/med/badChannels/N00014833_0002.mdaq.root 26743 ? D 0:00 rm /minos/scratch/med/badChannels/R1.24/cmdfile MINOS25 > cat /minos/scratch/tinti/condor_logs/datamc.log.229046.0 000 (229046.000.000) 11/20 04:30:04 Job submitted from host: <131.225.193.25:63223> ^Y^Y^Y^Y^Y^Y [gfrontend@minos25 ~]$ ps xf PID TTY STAT TIME COMMAND 2072 pts/18 Ss 0:00 -bash 2107 pts/18 R+ 0:00 \_ ps xf 14360 ? D 0:00 cat /minos/scratch/tinti/condor_logs/datamc.log.229046.0 [gfrontend@minos25 ~]$ kill -9 14360 [gfrontend@minos25 ~]$ kill -9 14360 [gfactory@minos25 ~]$ ps xf PID TTY STAT TIME COMMAND 22865 pts/13 Ss 0:00 -bash 31838 pts/13 D+ 0:00 \_ cat /minos/data/users/scavan/mrcc_cand_filter/log.228592.100 18216 ? Z 1:07 [condor_gridmana] 2128 pts/18 Ss 0:00 -bash 2162 pts/18 R+ 0:00 \_ ps xf Nov 20 03:09:48 minos25 kernel: afs: Tokens for user of AFS id 4356 for cell fnal.gov have expired Nov 20 03:29:07 minos25 kernel: afs: Tokens for user of AFS id 13849 for cell fnal.gov are discarded (rxkad error=19270407) Nov 20 05:22:51 minos25 kernel: afs: Tokens for user of AFS id 5922 for cell fnal.gov are discarded (rxkad error=19270407) Nov 20 09:02:18 minos25 kernel: nfs: server stkensrv1 not responding, still trying Nov 20 09:08:14 minos25 kernel: nfs_statfs: statfs error = 512 Load average plummeted around 10:42 Date: Thu, 20 Nov 2008 09:49:08 -0600 (CST) Subject: HelpDesk ticket 125219 ___________________________________________ Short Description: minos25 cannot write to /minos/scratch, load average is over 200 Problem Description: run2-sys : Starting around 04:00 CST today, the load average on minos25 started climbing sharply. It is now over 200. Writes to the Bluearc served /minos/scratch seem to get hung up. I can read some files from /minos/scratch, but others hang up : cat /minos/scratch/tinti/condor_logs/datamc.log.229046.0 This command displays the file, but fails to exit, and cannot be killed. This same file can be read from other hosts such as minos26. I see no interesting messages in /var/log/messages. Please see whether we can determine the cause for these hangups. If necessary, please reboot minos25. ___________________________________________ Date: Thu, 20 Nov 2008 09:50:16 -0600 From: Ling C. Ho This mount is the reason lsof is hanging: stkensrv1:/minos on /pnfs/minos type nfs (ro,noexec,nosuid,nodev,intr,bg,hard,noac,addr=131.225.13.1) ___________________________________________ Date: Thu, 20 Nov 2008 15:55:32 +0000 (GMT) From: Arthur Kreymer It would be OK to remove the /pnfs/minos mount from minos25. We do not need it on that node. ___________________________________________ Date: Thu, 20 Nov 2008 10:42:57 -0600 (CST) This ticket has been reassigned to SHEPELAK, KAREN of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 20 Nov 2008 10:45:15 -0600 From: Ling C. Ho All the stuck close calls seems to have returned. Not sure what happened. /pnfs/minos has been unmounted. /grid/data was remounted. ___________________________________________ Date: Thu, 20 Nov 2008 10:42:06 -0600 From: condor To: minos-admin@fnal.gov Subject: [Condor] Problem minos25.fnal.gov: condor_schedd killed (unresponsive) This is an automated email from the Condor system on machine "minos25.fnal.gov". Do not reply. 
"/opt/condor/sbin/condor_schedd" on "minos25.fnal.gov" was killed because it was no longer responding. Condor will automatically restart this process in 10 seconds. ___________________________________________ Date: Thu, 20 Nov 2008 17:01:57 +0000 (GMT) From: Arthur Kreymer The condor_schedd seems to have restarted itself at 10:42:06 At just about that time, the load average dropped, and the condor system became responsive. User jobs are running again. /minos/scratch is writeable again from minos25. The formerly bad file is OK : /minos/scratch/tinti/condor_logs/datamc.log.229046.0 ___________________________________________ Date: Thu, 20 Nov 2008 15:49:34 -0600 (CST) Hi Art, The problem seems to have resolved itself. Load average is ok right now, and condor_q doesn't hang. [root@minos25 ~]# uptime 15:39:36 up 29 days, 23:47, 15 users, load average: 1.25, 1.10, 1.21 [root@minos25 ~]# condor_q 6832 jobs; 5457 idle, 1375 running, 0 held Let me know if you think it is ok to mark this ticket as resolved, karen ___________________________________________ Date: Sun, 30 Nov 2008 22:04:48 -0600 (CST) From: HelpDesk This request will be automatically closed in two weeks. If you wish this problem to remain open please contact the HelpDesk. ___________________________________________ ___________________________________________ ============================================================================= 2008 11 19 ============================================================================= ######## # FARM # ######## samsub.20081118 and roundup.20081118 ready for production. Do this coming out of the shutdown tomorrow. ############ # SHUTDOWN # ############ Prepared for PNFS/DCache maintenance Nov 20 kreymer@minos26 echo "crontab -r" | at 06:30 job 20 at 2008-11-20 06:30 mindata@minos26 echo "crontab -r" | at 01:00 job 21 at 2008-11-20 01:00 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 job 17 at 2008-11-20 01:00 ######## # GRID # ######## Investigating Ticket #: 125092 /minos/data2 mounts unstable on CDF grid nodes MIN > ssh fcdfcaf1502 -bash-3.00$ domainname fcdfosg1 -bash-3.00$ ypwhich -d fcdfosg1 fcdfcaf1550.fnal.gov -bash-3.00$ ypwhich -d fcdfosg1 fcdfcaf1425.fnal.gov -bash-3.00$ ypwhich -d fcdfosg1 fcdfcaf1450.fnal.gov for HOST in `head -2 /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'printf "`ypwhich` " ; ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null && echo' done fcdfcaf1502 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1503 fcdfcaf1425.fnal.gov fcdfcaf1504 fcdfcaf1450.fnal.gov fcdfcaf1505 fcdfcaf1425.fnal.gov fcdfcaf1506 fcdfcaf1501.fnal.gov fcdfcaf1507 fcdfcaf1425.fnal.gov fcdfcaf1508 fcdfcaf1550.fnal.gov fcdfcaf1509 fcdfcaf1425.fnal.gov fcdfcaf1510 fcdfcaf1425.fnal.gov fcdfcaf1511 fcdfcaf1450.fnal.gov fcdfcaf1512 fcdfcaf1450.fnal.gov fcdfcaf1513 fcdfcaf1425.fnal.gov fcdfcaf1514 fcdfcaf1475.fnal.gov fcdfcaf1515 fcdfcaf1550.fnal.gov fcdfcaf1516 fcdfcaf1450.fnal.gov fcdfcaf1517 fcdfcaf1425.fnal.gov fcdfcaf1518 fcdfcaf1450.fnal.gov fcdfcaf1519 fcdfcaf1475.fnal.gov fcdfcaf1520 fcdfcaf1425.fnal.gov fcdfcaf1521 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'printf "`ypwhich` " ; ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null && echo' done 2>&1 | tee /tmp/scan1119a.lis ( selecting only failing nodes ) fcdfcaf1519 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1521 
fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1525 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1529 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1540 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1544 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1557 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1560 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1670 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1674 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1677 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1680 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1681 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1683 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1684 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1686 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1687 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1689 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1695 do_ypcall: clnt_call: RPC: Timed out fcdfcaf1703 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1712 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1713 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory Get list of ypwhich'es : MIN > cat /tmp/scan1119a.lis | cut -f 2 -d ' ' | sort -u do_ypcall: fcdfcaf1150.fnal.gov fcdfcaf1201.fnal.gov fcdfcaf1425.fnal.gov fcdfcaf1450.fnal.gov fcdfcaf1475.fnal.gov fcdfcaf1501.fnal.gov fcdfcaf1525.fnal.gov fcdfcaf1550.fnal.gov fcdfcaf1575.fnal.gov fcdfcaf1601.fnal.gov List the maps -bash-3.00$ ypwhich -m passwd.byuid fcdf0x4.fnal.gov auto.des fcdf0x4.fnal.gov-h auto.grid fcdf0x4.fnal.gov auto.master fcdf0x4.fnal.gov auto.home fcdf0x4.fnal.gov auto.minos fcdf0x4.fnal.gov auto.cdf fcdf0x4.fnal.gov passwd.byname fcdf0x4.fnal.gov ypservers fcdf0x4.fnal.gov auto.ilc fcdf0x4.fnal.gov group.bygid fcdf0x4.fnal.gov group.byname fcdf0x4.fnal.gov Look at the maps : -bash-3.00$ ypcat -h fcdfcaf1450 auto.grid blue2:/fermigrid-home blue2:/fermigrid-products/opt/condorsleeper/${ARCH}/condor-7.0.3 blue2:/fermigrid-products/opt/condorsleeper_sl5/${ARCH}/condor-7.0.3 blue2:/fermigrid-products/usr/local/grid-1.0.0-${ARCH} blue2:/fermigrid-fermiapp blue2:/fermigrid-app blue2:/fermigrid-data -bash-3.00$ ypcat -h fcdfcaf1525 auto.minos -ro minos-nas-0.fnal.gov:/minos/scratch -noexec blue2.fnal.gov:/minos/data -noexec minos-nas-0.fnal.gov:/minos/data -bash-3.00$ ypcat -h fcdfcaf1450 auto.minos -ro minos-nas-0.fnal.gov:/minos/scratch -noexec blue2.fnal.gov:/minos/data -noexec minos-nas-0.fnal.gov:/minos/data -bash-3.00$ ypcat -h fcdfcaf1450 auto.master auto.minos -o proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600 auto.grid -o proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600 auto.ilc -o proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600 auto.home -o proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600 auto.cdf -o proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600 -bash-3.00$ ypcat ypservers fcdfcaf1325.fnal.gov fcdfcaf1525.fnal.gov fcdf0x4.fnal.gov fcdfcaf1001.fnal.gov fcdfcaf1250.fnal.gov fcdfcaf1450.fnal.gov fcdfcaf1225.fnal.gov fcdfcaf1375.fnal.gov fcdfcaf1425.fnal.gov 
fcdfcaf1575.fnal.gov fcdfcaf1150.fnal.gov fcdfcaf1350.fnal.gov fcdfcaf1275.fnal.gov fcdfcaf1501.fnal.gov fcdfcaf1475.fnal.gov fcdfcaf1050.fnal.gov fcdfcaf1201.fnal.gov fcdfcaf1401.fnal.gov fcdfcaf1601.fnal.gov fcdfcaf1101.fnal.gov fcdfcaf1550.fnal.gov Schmitz will reboot fcdfcaf1525 after permission from Timm. on fcdfcaf1519, scan all the maps MAPS=`ypwhich -m | cut -f 1 -d ' '` for HOST in fcdf0x4 fcdfcaf1525 fcdfcaf1450; do for MAP in ${MAPS} ; do ypcat -h ${HOST} ${MAP} done >> /tmp/ypmaps.${HOST} done There are differences in passwd and group files try again without these, MAPS=`ypwhich -m | cut -f 1 -d ' ' | grep -v passwd | grep -v group` for HOST in fcdf0x4 fcdfcaf1525 fcdfcaf1450; do for MAP in ${MAPS} ; do ypcat -h ${HOST} ${MAP} done > /tmp/ypmaps.${HOST} done No differences, but still : -bash-3.00$ ypwhich fcdfcaf1525.fnal.gov -bash-3.00$ ls -ld /minos/data2 ls: /minos/data2: No such file or directory /usr/lib64/autofs/autofs-ldap-auto-maste 11:27 - fcdfcaf1525 has been rebooted, try a new scan done 2>&1 | tee /tmp/scan1119b.lis Wed Nov 19 17:48:30 GMT 2008 Clean, aside from timeout logging into 1525 - cleared up now. Second scan, just to be sure ! MIN > ssh fcdfcaf1525 -bash-3.00$ uptime ; date 11:36:34 up 3 min, 2 users, load average: 0.12, 0.11, 0.04 Wed Nov 19 11:36:34 CST 2008 date 11:43, poking in parallel, while stuck on connection to 1694 MIN > ssh -ax fcdfcaf1693 'ypwhich ; ls -ld /minos/data /minos/data2 /minos/scratch' do_ypcall: clnt_call: RPC: Timed out fcdfcaf1525.fnal.gov done 2>&1 | tee /tmp/scan1119d.lis ; date Wed Nov 19 17:57:39 GMT 2008 Clean, the ticket is closed. Back to work ! ============================================================================= 2008 11 18 ============================================================================= ############# # MDSUM_LOG # ############# created new version, mdsum_log.20081118 ran test around 16:23 ########### # ROUNDUP # ########### roundup.20081118 Make use of the extended subrun list. Present usage of samsubs : printed in HAVE message get VAL subrun count from cut -f 2 -d ':' Due to new whitespace in SAMSUBS, changed for SAMSUB in ${SAMSUBS} to printf "${SAMSUBS}\n" | while read SAMSUB Therefore must deploy new samsubs and roundup together. Test with AFSS/roundup.20081118 -n -W -r cedar near AFSS/roundup.20081118 -n -W -s n13037095 -r cedar_phy_bhcurv mcnear samsub tripped on stray files in mcnearcat, Oops, my parsing of subruns does not work for mcout files, which have many more underscores. Test on n13011168 Weird, no longer need the string.join, Needed to split on _ first, then ., for sake of mcin parents. For quicker testing, hacked INDIR=/minos/data/minfarm/testroundup SRV1> mkdir /minos/data/minfarm/testroundup cp -a /minos/data/minfarm/mcnearcat/n1303709*.root /minos/data/minfarm/testroundup/ ######## # FARM # ######## ######## # GRID # ######## Date: Tue, 18 Nov 2008 11:52:50 -0800 (PST) From: Ryan B. Patterson To: minos-admin@fnal.gov Cc: pawloski@fnal.gov Subject: Glidein throughput issue seems to be addressed Hi, We were suffering from glideins not willing to run new jobs after finishing their first one. This was limiting computing throughput to the rate of new glidein production. Igor was eventually able to track it down to a communication problem between the starter and the schedd/shadow. 
On his suggestion, I added two obscure settings to the factory configuration: WANT_UDP_COMMAND_SOCKET = True STARTD_SENDS_ALIVES = False Initial testing suggests that this fix has worked, as new glideins seem to be returning to the pool in the "Unclaimed" state and seem willing to run new jobs. Let me know if this doesn't seem true in more complex situations over the next few days. --Ryan ########### # MINOS27 # ########### Date: Tue, 18 Nov 2008 19:38:36 +0000 (GMT) From: Arthur Kreymer To: Ling C. Ho Cc: minos-admin@fnal.gov Subject: Re: DB servers, and other Grid items On Fri, 14 Nov 2008, Ling C. Ho wrote: > I have minos27 ready. Please log in and take a look. The virtual network > interface is not set up. We can swap this with minos01 once you determine > everything need is installed properly. Thanks ! /minos/data and /minos/scratch need to be NFS mounted, as on other Cluster systems. The local scratch space should be mounted as /minos/scratch27, group writeable by the e875 group, similar to other Cluster systems. /local/scratch27 should be configured as a single 500 GB volume. We would rename /minos/scratch27 to /minos/scratch01, and clone content from minos01, when the system eventually goes to production. We will move the minoscvs repository to the CD's cdcvs server before we retire the old minos01 server. So we will not need the /cvs area, or the special sshd cvs server ( /usr/sbin/sshd -f /etc/ssh/sshd_config.cvs ) And we will not need to run the pserver on minos27. Date: Tue, 18 Nov 2008 14:14:26 -0600 From: Ling C. Ho > /minos/data and /minos/scratch need to be NFS mounted, > as on other Cluster systems. Corrected. Do you mean /local/scratch27? I have repartitioned the data disk and mounted as /local/scratch27. Date: Tue, 18 Nov 2008 20:55:56 +0000 (GMT) From: Arthur Kreymer Yes, my bad. On the other Minos Cluster nodes, /local/scratchNN has permissions 777, and is owned by root.root It is probaby best to do this also on minos27, rather than set ownership root.e875, and mode 775 as I had requested. I'll let the users know that minos27 is available now. Date: Tue, 18 Nov 2008 20:56:14 +0000 (GMT) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Cc: minos-admin@fnal.gov Subject: minos27 testing - SLF 4.7 and x86_64 kernel Node minos27 is available for testing. This node is intended to become the replacement for minos01. It is running SLF 4.7 . The rest of the Minos Cluster runs SLF 4.4. It is running the x86_64 64 bit kernel. Please test the working environment there, particularly whether programs built on minos27 can be used on other Minos Cluster nodes. Do not put anything permanent into /local/scratch27, as that area may be cloned from /local/scratch01 eventually. ######## # GRID # ######## Date: Tue, 18 Nov 2008 12:07:49 -0600 (CST) Subject: HelpDesk ticket 125094 ___________________________________________ Ticket #: 125094 ___________________________________________ Short Description: /minos/data and data2 timeouts on fnpcsrv1 Problem Description: Howie Rubin reports repeated failures to access /minos/data2 on fnpcsrv1. My once-per minute scans have not seen a recent failure, but they are not accessing the system as often as Howie. In /var/log/messages, it is apparent that /minos/data and /minos/data2 are being dismounted and remounted about every 20 minutes, even though I am accessing files there every minute. 
These dismounts ( 'expired /minos/data2' ) seem to have started at Nov 14 01:02:27 but have not occured since Nov 18 09:42:59 Please tune fnpcsrv1 so that these filesystems are not dismounted so frequently, or confirm that this tuning was done this morning after 09:42. ___________________________________________ This ticket is assigned to TIMM, STEVE of the CD-SF/GF/FGS. ___________________________________________ Verbal 12:30, Steve Timm The script which kept /minos/data* mounted was stopped when the stale NFS handles were cleared last Friday. He is restarting the script. I see no further dismounts as of 13:50, of /minos/data2. But /minos/data continues to be dismounted every 20 minutes . ___________________________________________ Date: Tue, 18 Nov 2008 14:07:22 -0600 (CST) Our script which keeps the minos areas mounted on fnpcsrv1 is now running again. We had disabled it during last week's incident with /minos/data when we had to recover from the stale file handles. Note, however, that it is the belief of FGS that whether the file system is temporarily umounted has nothing to do with the problems that Howie is seeing. So let us know if Howie's problem persists. Steve Timm ___________________________________________ N.B. - 2008 11 19 08:35 - no further expired messages in messages ___________________________________________ Date: Thu, 20 Nov 2008 11:24:44 -0600 (CST) From: HelpDesk Solution: We restored the processes which keep the /minos/data, /minos/data2, and /minos/scratch areas mounted all the time. Steve Timm ___________________________________________ ######## # GRID # ######## Date: Tue, 18 Nov 2008 11:55:19 -0600 (CST) Subject: HelpDesk ticket 125092 Short Description: /minos/data2 mounts unstable on CDF grid nodes Problem Description: The mounts of /minos/data2 are sometimes present, sometimes absent from the CDF grid nodes ( fcdfcaf1502 through fcdfcaf1716 ) This seems to be the same problem previously resolved on Friday 14 Nov, Helpdesk Ticket 124929 Here is the relevant language from that ticket, quoting Steve Timm : There is probably an old out-of-sync yp slave server somewhere on the CDF grid cluster 2 that does not yet have the new map that includes /minos/data2. I re-pushed out the map to all existing nodes. Will ask FEF to re-enable the 3 slave servers that are down. This might not happen before the weekend. ___________________________________________ This ticket is assigned to Box, Dennis of the CDF. ___________________________________________ Date: Tue, 18 Nov 2008 12:05:39 -0600 (CST) Reassign Please reassign this to someone who can update yp mapfiles on cdf machines (FEF?) ___________________________________________ Date: Tue, 18 Nov 2008 12:09:56 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: Somehow this ticket went to Grid/CDF and should have gone to FEF instead. But Art misunderstood my E-mail to him. It is not a problem with the automount maps now, it is just that some of the worker nodes had a missing /minos/data2 left over from Nov. 14 when the automount maps were fixed and that needs to get reset. Steve Timm ___________________________________________ I still don't understand. /minos/data2 was mounted on all nodes on Friday. It is now intermittent, coming and going apparently randomly. ___________________________________________ Date: Tue, 18 Nov 2008 13:25:24 -0600 (CST) This ticket has been reassigned to SHEPELAK, KAREN of the CD-SF/FEF Group. 
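On the FGS script ( ticket 125094 above ) that keeps the /minos areas mounted on fnpcsrv1 :
the script itself is not shown in this log, but the idea is just to touch each automounted
area more often than the autofs expire interval. A sketch of that keep-alive, with the area
list and the sleep interval as assumptions :

AREAS="/minos/data /minos/data2 /minos/scratch"
while true ; do
    for AREA in ${AREAS} ; do
        ls ${AREA} > /dev/null 2>&1      # any access resets the autofs idle timer
    done
    sleep 300                            # well under the ~20 minute expiry seen above
done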
___________________________________________ Date: Tue, 18 Nov 2008 22:12:21 +0000 (GMT) From: Arthur Kreymer To: Howard Rubin Cc: minos-data@fnal.gov, shepelak@fnal.gov Subject: Re: cdf nodes On Tue, 18 Nov 2008, Howard Rubin wrote: > They may have done something. The last failure was at 13:40. My last scan, around 15:00, picked up 5 failures, out of 151 hosts. One of the 5 failures was an rcp timeout: fcdfcaf1510 ls: /minos/data2: No such file or directory fcdfcaf1525 ls: /minos/data2: No such file or directory fcdfcaf1672 ls: /minos/data2: No such file or directory fcdfcaf1677 ls: /minos/data2: No such file or directory fcdfcaf1686 do_ypcall: clnt_call: RPC: Timed out ___________________________________________ Date: Tue, 18 Nov 2008 23:03:52 +0000 (GMT) We are continuing to see failures, as of 17:00 today. ___________________________________________ Date: Tue, 18 Nov 2008 17:53:35 -0600 (CST) Note To Requester: investigating ___________________________________________ Date: Wed, 19 Nov 2008 15:25:59 +0000 (GMT) From: Arthur Kreymer The problem seems only to occur when yp sever fcdfcaf1502 happens to be chosen . I have repeated a scan of the fcdfdcaf nodes, this time checking which yp server is being used as follows : for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'printf "`ypwhich` " ; ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null && echo' done 2>&1 > /tmp/scan1119a.lis A typical line of failing output is : fcdfcaf1519 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory Various hosts fail, but the problem always occurs when fcdfcaf1525 happens to be used to serve the yp maps. I am still puzzled, I see no difference in the auto.minos map served by fcdfcaf1525 : -bash-3.00$ ypcat -h fcdfcaf1525 auto.minos -ro minos-nas-0.fnal.gov:/minos/scratch -noexec blue2.fnal.gov:/minos/data -noexec minos-nas-0.fnal.gov:/minos/data -bash-3.00$ ypcat -h fcdfcaf1450 auto.minos -ro minos-nas-0.fnal.gov:/minos/scratch -noexec blue2.fnal.gov:/minos/data -noexec minos-nas-0.fnal.gov:/minos/data ___________________________________________ Date: Wed, 19 Nov 2008 09:30:58 -0600 (CST) From: Steven Timm To: Arthur Kreymer Cc: HelpDesk , shepelak@fnal.gov, run2-sys@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 125092 has additional info. What I have been trying to say is the following: On Nov. 14, when the yp server fcdfcaf1525 came back up, it came up with a bad map, which I fixed yesterday when I saw that problem. However, any nodes which were bound to that server between Nov 14 and now, and tried to access /minos/data2 during that time, got the "no such file or directory" error. Killing and restarting automount may not be enough to fix this error, it may require a reboot of the nodes in question. ___________________________________________ Date: Wed, 19 Nov 2008 09:34:33 -0600 (CST) I restarted ypbind on fcdfcaf1502. It seems to have resolved the mount issue. Check it out and let me know. Mark ___________________________________________ Date: Wed, 19 Nov 2008 11:13:44 -0600 From: Mark Schmitz To: Arthur Kreymer Subject: fcdfcaf1525 This node is being rebooted now. Mark ___________________________________________ Date: Wed, 19 Nov 2008 10:39:30 -0600 From: Howard Rubin To: Art Kreymer , Steve Timm Subject: I/O failures on /minos/data2 Art (Steve, FYI), Since 23:42 yesterday there have been a total of 612 input or output failures. I'm going to shut down processing (except for keep-up) until we get a response from FEF. 
However, you should be aware that some input failures are also occurring on GPGrid nodes as well. These appear to be the same old SRM problem to which I don't think I've ever received a satisfactory response to old tickets beyond "Turned over to developers" which has been the standard response lately (after a couple of weeks of aging). The several random checks I've made of input failures on CDFGrid are *not* SRM related. Being the good soldier that I am, I will submit a ticket on this. Wed Nov 19 09:12:59 CST 2008: ====> fileStatus state ==Failed java.io.IOException: rs.state = Failed rs.error = at Wed Nov 19 07:08:26 CST 20 08 state Pending : created RequestFileStatus#-2144281128 failed with error:[ at Wed Nov 19 09:12:03 CST 20 08 state Failed : Pinning failed] at gov.fnal.srm.util.SRMGetClientV1.start(SRMGetClientV1.java:298) at gov.fnal.srm.util.SRMDispatcher.work(SRMDispatcher.java:795) at gov.fnal.srm.util.SRMDispatcher.main(SRMDispatcher.java:374) srm copy of at least one file failed or not completed I have a complete list of nodes and times. Howie ___________________________________________ Date: Wed, 19 Nov 2008 10:42:04 -0600 (CST) From: Steven Timm To: Howard Rubin Cc: Art Kreymer Subject: Re: I/O failures on /minos/data2 Internal E-mail from Mark Schmitz of FEF tells me he is working on the /minos/data2 issue on the CDF nodes at the moment. This is something that I would have the permission to do myself but since I am in the workshop I can't get to it today or tomorrow and you are better to stay with them. Steve Timm ___________________________________________ Date: Wed, 19 Nov 2008 17:25:32 +0000 (GMT) From: Arthur Kreymer To: HelpDesk Cc: shepelak@fnal.gov, run2-sys@fnal.gov, timm@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 125092 has additional info. <-- # @@@ Enter Update below this line. @@@ # --> I earlier wrote " The problem seems only to occur when yp sever fcdfcaf1502 happens to be chosen " As always, I cannot type correctly. The yp server at issue is fcdfcaf1525 , not 1502, 1525 was mentioned correctly several times later in the posting. <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ ######## # DATA # ######## Date: Fri, 14 Nov 2008 13:23:05 -0600 (CST) Subject: HelpDesk ticket 124929 has additional info. _________________________________________________________________ Ticket #: 124929 _________________________________________________________________ Note To Requester: timm@fnal.gov sent this Notes To Requester: There is probably an old out-of-sync yp slave server somewhere on the CDF grid cluster 2 that does not yet have the new map that includes /minos/data2. I re-pushed out the map to all existing nodes. Will ask FEF to re-enable the 3 slave servers that are down. This might not happen before the weekend. As far as stale file handles the process is the following, which FEF can do in my absence 1) kill any stale processes of the form gidd_alloc or procd 2) umount /minos/data (and /minos/scratch if necessary) 3) kill auto.minos automount process 4) umount /minos 5) service autofs reload Art--I would suggest that a ticket be opened to FEF to get this done expediently., Steve _________________________________________________________________ Date: Wed, 19 Nov 2008 11:35:44 -0600 (CST) Note To Requester: Hi Art, Steve, I'd like to reboot fcdfcaf1502. The /minos/data2 directory is still not mounting even after ypbind and autofs services have been restarted. Can you start draining condor jobs? 
Automount maps also appear to be ok, same mapping as the other machines which mount the directory ok. [root@fcdfcaf1502 ~]# ypcat -k auto.minos scratch -ro minos-nas-0.fnal.gov:/minos/scratch data2 -noexec blue2.fnal.gov:/minos/data data -noexec minos-nas-0.fnal.gov:/minos/data Since services restarted I am now seeing: [root@fcdfcaf1502 ~]# mount /minos/data2 Unsupported nfs mount option: o [root@fcdfcaf1502 ~]# mount blue2.fnal.gov:/minos/data /minos/data2 mount: blue2.fnal.gov:/minos/data already mounted or /minos/data2 busy mount: according to mtab, blue2.fnal.gov:/minos/data is already mounted on /minos/data2 [root@fcdfcaf1502 ~]# cat /etc/mtab |grep data2 blue2.fnal.gov:/minos/data /minos/data2 nfs rw,noexec,o,proto=tcp,nfsvers=3,wsize=32768,rsize=32768,hard,intr,timeo=600, addr=131.225.111.93 0 0 thanks, karen _________________________________________________________________ Date: Wed, 19 Nov 2008 11:53:45 -0600 From: Mark Schmitz To: Arthur Kreymer Cc: HelpDesk , shepelak@fnal.gov, run2-sys@fnal.gov, timm@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 125092 has additional info. <-- # @@@ Enter Update below this line. @@@ # --> worklog fcdfcaf1525 has been restarted <-- # @@@ Enter Update above this line. @@@ # --> _________________________________________________________________ Date: Wed, 19 Nov 2008 17:58:49 +0000 (GMT) From: Arthur Kreymer To: Mark Schmitz Cc: HelpDesk , shepelak@fnal.gov, run2-sys@fnal.gov, timm@fnal.gov, minos-data@fnal.gov, minos_batch@fnal.gov, minos_software_discussion@fnal.gov Subject: Re: HelpDesk ticket 125092 has additional info. <-- # @@@ Enter Update below this line. @@@ # --> I have run two full scans on the cdf caf nodes since the reboot of fcdfcaf1525. There are no failures to mount the /minos areas. The fcdfcaf1525 NIS server is being used heavily. We can consider this probelem resolved. Thanks ! <-- # @@@ Enter Update above this line. @@@ # --> _________________________________________________________________ Date: Wed, 19 Nov 2008 13:57:08 -0600 (CST) This ticket was resolved by SHEPELAK, KAREN of the CD-SF/FEF group. ######## # FARM # ######## Rubin reports failure to read or write /minos/data2 on several hosts. For exmaple, reading /minos/data/minfarm/loonexe/set_tsql_override.C > 2008-11-18 04:01:03 fcdfcaf1554 > 2008-11-18 04:01:16 fcdfcaf1688 > 2008-11-18 04:01:16 fcdfcaf1672 > 2008-11-18 04:01:40 fcdfcaf1672 > 2008-11-18 04:01:58 fcdfcaf1672 > 2008-11-18 04:02:28 fcdfcaf1504 > > and those with output errors: > > 2008-11-18 07:54:36 fcdfcaf1675 > 2008-11-18 08:14:50 fcdfcaf1563 > 2008-11-18 08:21:34 fcdfcaf1522 > 2008-11-18 08:21:42 fcdfcaf1665 > 2008-11-18 08:26:38 fcdfcaf1555 > 2008-11-18 08:27:20 fcdfcaf1670 > 2008-11-18 08:33:31 fcdfcaf1535 > 2008-11-18 08:34:17 fcdfcaf1555 > 2008-11-18 09:16:26 fcdfcaf1540 Rescanning for /minos/data, as before for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null && echo' done 2>&1 | tee /tmp/scan1118a.lis done 2>&1 | tee /tmp/scan1118b.lis Scanned gpfarm nodes, AOK for HOST in `cat /tmp/gphosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null ;echo' done on fnpcsrv1, find expired /minos/data2 every 20 minutes since Nov 14 01:02:27 through Nov 18 09:42:59 but not since then. 
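The ad-hoc host scans above are all variants of the same loop. A consolidated sketch, assuming password-less ssh to the workers (as in the scans above) and a host list file passed as the argument, printing only the failures:

#!/bin/sh
# Sketch: report hosts where any of the /minos areas fails to list.
# Usage: ./minosmountscan /tmp/cdfhosts   ( or /tmp/gphosts )
HOSTFILE=${1:-/tmp/cdfhosts}
AREAS="/minos/scratch /minos/data /minos/data2"
for HOST in `cat ${HOSTFILE}` ; do
  ERR=`ssh -ax -o ConnectTimeout=10 ${HOST} "ls -ld ${AREAS}" 2>&1 > /dev/null`
  [ -n "${ERR}" ] && printf "%s : %s\n" ${HOST} "${ERR}"
done

A healthy node prints nothing ; stale handles, missing mounts and ypcall timeouts all show up in the error text.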
######## # DATA # ######## /grid/data usage message Total disk allocated (GB): 400.0 Percent disk used: 80.1% du -sm /grid/data/minos/* 315373 /grid/data/minos/users 9659 /grid/data/minos/minfarm 1972 /grid/data/minos/OLDfarcat 594 /grid/data/minos/OLDneardet 110 /grid/data/minos/condor_log ... du -sm /grid/data/minos/users/* | sort -n ... 91 /grid/data/minos/users/mishi 315282 /grid/data/minos/users/rustem Forwarded to rustem ============================================================================= 2008 11 17 ============================================================================= ######### # MYSQL # ######### Date: Mon, 17 Nov 2008 15:15:19 -0600 From: Ling C. Ho To: Arthur Kreymer Cc: minos-admin@fnal.gov Subject: Re: DB servers, and other Grid items Hi Art, Minos-mysql2 and 3 are ready. I have installed NIS client on these nodes, but login is limited to the few accounts that were on the local password files. The default database directory is /var/lib/mysql. Please let me know how you would like to set up /data (ie, if you want subdirectories like on minos-mysql1) and I can create symbolic links to point /var/lib/mysql to the right place. ####### # SAM # ####### Date: Mon, 17 Nov 2008 17:08:43 -0600 From: Ling C. Ho To: Arthur Kreymer Cc: minos-admin@fnal.gov Subject: Re: DB servers, and other Grid items Minos-sam04 is ready too. ######## # DATA # ######## mcimport - mtavera, duplicates over the weekend - inform her ######## # DATA # ######## mcimport OVERLAY Mon Nov 17 08:48:32 CST 2008 MCIN configuration n1303 _L010185N_D06_nccohbkg.reroot.root SRMClientV1 : put: try # 0 failed with error SRMClientV1 : java.net.ConnectException: Connection timed out srm copy of at least one file failed or not completed MRTG shows stkendca2a off net just after 08:30 this morning. Now it is back up, services up since 10:57 stkendca7a is also offline, as of 11:11 MRTG has no data for that node Date: Mon, 17 Nov 2008 11:01:35 -0600 (CST) Subject: HelpDesk ticket 125001 ___________________________________________ Short Description: stkendca2a seems down, doors offline in FNDCA Problem Description: According to MRTG data, stkendca2a went off the net around$ morning. All the stlendca2a services are offline, including dcap dcapK dcapG SRM GFTP0/1 KFTP WFTP Minos raw data archiving has stopped. ___________________________________________ Date: Mon, 17 Nov 2008 11:26:43 -0600 (CST) The stkendca2a node should be back on-line now. It was moved from FCC1 to the Mezzanine this morning. I had believed that the node was only running the test instance of dCache. I was not aware it was serving public dCache services. I'm sorry for the inconvenience. No further interruptions are anticipated. Ken S. -- SSA Group ___________________________________________ Date: Mon, 17 Nov 2008 17:37:49 +0000 (GMT) From: Arthur Kreymer Thanks for restoring stkendca2a. I see that node stkendca7a is also off the network. This serves two of the RawDataWritePools pools, and one GFTP door. ___________________________________________ Date: Mon, 17 Nov 2008 12:13:44 -0600 Art, The stkendca7a node was recently brought back on-line. On Friday, it was found that one of the two RAID sets was inaccessible. In trying to recover that RAID partition, I issued a command to reboot the system. We have been unable to get the system to boot properly since then. It has been down since Friday afternoon. This is a serious hardware problem. And it is a rather old node. 
I'm meeting with my manager after lunch to discuss options for how we can get this node back on-line and recover the pools. Ken S. ___________________________________________ ######## # DATA # ######## Continue cleanout of MOVED files, find /minos/data/reco_near.MOVED -user mindata | wc -l 14747 find /minos/data/reco_near.MOVED -user minfarm | wc -l 3348 find /minos/data/reco_near.MOVED -user rubin | wc -l 206 minfarm@fnpcsrv1> find /minos/data/reco_near.MOVED -user minfarm -exec chmod g+w {} \; rubin@fnpcsrv1> find /minos/data/reco_near.MOVED -user rubin -exec chmod g+w {} \; df -m /minos/data 28311552 20197451 8114102 72% /minos/data mindata@minos26> time rm -r /minos/data/reco_near.MOVED rm: remove write-protected regular file `/minos/data/reco_near.MOVED/cedar_phy_bhcurv/sntp_data/2007-03/libMyPainterSL4_51902.so'?y -rwxr-xr-x 1 rodriges e875 124459 Mar 12 2008 libMyPainterSL4_51902.so* real 190m44.659s user 0m0.156s sys 0m2.681s df -m /minos/data 28311552 18119726 10191827 65% /minos/data ============================================================================= 2008 11 14 ============================================================================= ######## # DATA # ######## Date: Fri, 14 Nov 2008 16:44:29 +0000 (GMT) From: Arthur Kreymer To: plunk@fnal.gov Cc: minos-data@fnal.gov, votava@fnal.gov Subject: FYI, people involved in new /minos/data disk deployment : CSI/SVC - Andy Romero BlueArc deployment and migration, intervening as necessary past 10 PM, and as early as 4 AM. Ray Pasetes CSI group HeadC CSI/DSS ( FNALU ) Margaret Greaney mounted the new disks and fixed stale handles on FNALU Wayne Baisley DSS group head Jack Schmidt CSI Dept Head FEF - Glenn Cooper - assisted in planning and coordination Jason Harrington -/minos/data2 mounts on Minos systems Ling Ho - stale file handle cleanup 10 PM Thursday Jason Allen FEF Dept Head Grid - Steve Timm assisted in planning, set up the new /minos/data2 mounts, cleaned up stale file handles Thursday night. Keith Chadwick Grid Services group head Eileen Berman Grid Dept Head ############ # BLUWATCH # ############ ln -sf bluwatch.20081114 bluwatch # was bluwatch.20080724 rm /afs/fnal.gov/files/data/minos/log_data/bluwatch/STOP set nohup ; ${HOME}/minos/scripts/bluwatch & ######### # ADMIN # ######### minos27 is available for testing. Stray message at login, aklog: unable to obtain tokens for cell fnal.gov (status: 11862788). ######## # FARM # ######## Restarted concatenation ( cedar ) 14:50 mv NOCAT NOCAT.ok Need to investigate DUP files in N00015122 ############ # DATABASE # ############ Doing monthly backups, cut/paste from the new dbarchive script. Next month will run the script as such ! ######## # GRID # ######## /minos/data2 mounts on cdf nodes seem flaky condor_status | grep @ | grep '\. LINUX' | cut -f 2 -d @ | cut -f 1 -d . | sort -u > /tmp/cdfhosts scp fcdfcaf1699:/tmp/cdfhosts /tmp/cdfhosts for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/data /minos/data2 > /dev/null' sleep 1 ; done 2>&1 | tee /tmp/scan1.lis Most nodes lack /minos/data2, but many are intermittent. 
NTHOSTS=' fcdfcaf1573 fcdfcaf1583 fcdfcaf1586 fcdfcaf1663 fcdfcaf1669 fcdfcaf1670 fcdfcaf1672 fcdfcaf1680 fcdfcaf1681 fcdfcaf1684 ' ls: /minos/data: Stale NFS file handle fcdfcaf1674 do_ypcall: clnt_call: RPC: Timed out Scan for FS that should be good: for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/scratch /minos/data > /dev/null && echo' done 2>&1 | tee /tmp/scansd.lis Everything is fine, except for the NFS file handles on select nodes. Repeated scan for just data2, missing on all but fcdfcaf1512 Scanning GPFARM hosts condor_status | grep LINUX | wc -l 1089 condor_status | grep fnpc | cut -f 2 -d @ | \ cut -f 1 -d . | sort -u > /tmp/gphosts wc -l /tmp/gphosts 204 /tmp/gphosts scp fnpc340:/tmp/gphosts /tmp/gphosts for HOST in `cat /tmp/gphosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null ;echo' sleep 1 ; done 2>&1 | tee /tmp/scangp1.lis CDF data2 mounts were corrected about 13:25 by Timm. Passed the stale NFS handles on to FEF. for HOST in ${NTHOSTS} ; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/data > /dev/null && echo' ; done ######## # DATA # ######## Date: Fri, 14 Nov 2008 12:10:01 -0600 (CST) Subject: HelpDesk ticket 124929 ___________________________________________ Short Description: CDF node mounts intermittent for /minos/data2, and some stale NFS handles Problem Description: The /minos/data2 file system seems to come and go on FermiGrid CDF nodes. For example, on fcdfcaf1502, it is visible about half the time, as seen by 'df -h /minos/data2' or 'ls -ld /minos/data2' A few nodes have stale NFS file handles for /minos/data : fcdfcaf1573 fcdfcaf1583 fcdfcaf1586 fcdfcaf1663 fcdfcaf1669 fcdfcaf1670 fcdfcaf1672 fcdfcaf1680 fcdfcaf1681 fcdfcaf1684 ___________________________________________ ######## # DATA # ######## Date: Fri, 14 Nov 2008 16:59:30 +0000 From: Philip Rodrigues To: Arthur Kreymer Subject: Re: /minos/data status - coming soon - please stand by ! Hi Art, > There may still be stale NFS handles on FNALU batch nodes, > we hope to fix that tomorrow morning. Just to let you know, I'm seeing stale file handles on CDF nodes. Other nodes seem to be working fine. 
Thanks, Phil ######## # DATA # ######## Continue cleanout of MOVED files, rubin@fnpcsrv1 find /minos/data/reco_far.MOVED -user rubin -exec ls -l {} \; 1542 find /minos/data/reco_far.MOVED -user rubin -exec chmod g+w {} \; find /minos/data/reco_near.MOVED -user rubin -exec ls -l {} \; find /minos/data/reco_near.MOVED -user rubin -exec chmod g+w {} \; mindata@minos26 $ time rm -r /minos/data/reco_far.MOVED real 9m20.399s user 0m0.112s sys 0m2.966s minfarm@fnpcsrv1 time rm -r /minos/data/reco_far.MOVED real 46m33.447s Stopped to set g+w for mindata files find /minos/data/reco_far.MOVED -user mindata -exec ls -l {} \; 52258 find /minos/data/reco_far.MOVED -user minfarm -exec ls -l {} \; 19266 Lots fewer minfarm files, let's chmod them, and remove under mindata find /minos/data/reco_far.MOVED -user minfarm -exec chmod g+w {} \; time rm -r /minos/data/reco_near.MOVED real 98m42.239s user 0m0.121s sys 0m3.328s ============================================================================= 2008 11 13 ============================================================================= ######## # DATA # ######## Creating rsync command for data -> data2 replica Steal from gridappsync, and HOWTO.rsync Preview, DIR=mcimport/boehm/mcin time rsync -r /minos/data/${DIR}/ /minos/data2/${DIR} \ -n --perms --times --links --size-only --delete --verbose { echo ; date printf "time rsync -r /minos/data/${DIR}/ /minos/data2/${DIR} \ -n --perms --times --links --size-only --delete --verbose " time rsync -r /minos/data/${DIR}/ /minos/data2/${DIR} \ --perms --times --links --size-only --delete --verbose } 2>&1 | tee -a /home/minsoft/datasync.log Test the boehm files, Restarted, after adding / to the source path, to avoid creation of an extra subdirectory at the distination. RDIRS=`ls /minos/data | grep -v analysis | grep -v users` beam_data condor-limbo condor-tmp d10 flux log_data maint mcimport mcout_data mindata minfarm mysql reco_far reco_near release_data validation for DIR in ${RDIRS} ; do { echo ; date printf "time rsync -r /minos/data/${DIR}/ /minos/data2/${DIR} \ -n --perms --times --links --size-only --delete --verbose " time rsync -r /minos/data/${DIR}/ /minos/data2/${DIR} \ --perms --times --links --size-only --delete --verbose } 2>&1 | tee -a /home/minsoft/datasync.log done Ganglia shows stable data rates of 15 to 20 MBytes/seconds on minos-mysql1, starting around 16:12 101 minutes to update mcimport, done at 16:40 ... Thu Nov 13 18:12:24 CST 2008 time rsync -r /minos/data/validation/ /minos/data2/validation -n --perms --times --links --size-only --delete --verbose real 7m29.180s Correct directory ownerships chown 3648 /minos/data2/condor-limbo chown 3648 /minos/data2/condor-tmp chown 3648 /minos/data2/validation Create symlinks TEST for DIR in ${RDIRS} ; do echo mv ${DIR} ${DIR}.MOVED echo ln -s /minos/data2/${DIR} ${DIR} ls -ld ${DIR} echo done MOVE cd /minos/data for DIR in ${RDIRS} ; do mv ${DIR} ${DIR}.MOVED ln -s /minos/data2/${DIR} ${DIR} ls -ld ${DIR} echo done Clean up after not haveing done cd before the first move cd for DIR in ${RDIRS} ; do ls -ld ${DIR} done Generally, files moved seemed to match what was expected, except for reco_near and reco_far lists, nothing was copied. Repeated the NUFILES scan, Thu Nov 13 19:29:51 CST 2008 GOT 3238 Which files are not present ? GOT=0 ; for FILE in ${NUFILES} ; do [ ! 
-r /minos/data2/${FILE} ] && (( GOT++)) && ls -l /minos/data/${FILE} done ; date ; printf " GOT ${GOT} \n" These are all condor-tmp and minfarm/DBM/dbtables/checksum files DANGER DANGER DANGER It seems I should have used option --archive, which preserves perms, links, times, group, owner, devices or have added --owner --group The files which I rsync'd are now owned by root !!!! Files were written to condor-tmp mcimport minfarm DIR=condor-tmp DIR=minfarm 1041 DIR=mcimport 2747 ( the find took about 20' ) ROOFILES=`find /minos/data2/${DIR} -user root | cut -f 5- -d /` for FILE in ${ROOFILES} ; do ls -ld /minos/data2/${DIR}/${FILE} chown --reference=/minos/data/${DIR}.MOVED/${FILE} /minos/data2/${DIR}/${FILE} done DONE ! Prepare to remove the .MOVED directories for mcout_data, reco_near, reco_far Mysql> time du -sm /minos/data/mcout_data.MOVED 5480489 /minos/data/mcout_data.MOVED real 0m7.086s user 0m0.070s sys 0m0.797s 5480489 /minos/data2/mcout_data real 0m7.095s user 0m0.059s sys 0m0.692s Now trying to remove mcout_data.MOVED, hard because so many files are owned by rubin. rubin@fnpcsrv1 find /minos/data/mcout_data.MOVED -user rubin -exec chmod g+w {} \; 22:54 -bash-3.00$ time rm -r /minos/data/mcout_data.MOVED real 207m39.066s user 0m0.127s sys 0m2.798s ######## # DATA # ######## The second replication to data2 is still running. Checking access times of something scanned by bluwatch /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-11 -rw-r--r-- 1 minfarm e875 1922308910 Nov 11 18:58 /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-11/N00009300_0000.spill.sntp.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minfarm e875 1922308910 Nov 12 21:19 /minos/data2/reco_near/cedar_phy_bhcurv/sntp_data/2005-11/N00009300_0000.spill.sntp.cedar_phy_bhcurv.0.root Things not recently scanned. -rw-r--r-- 1 minfarm e875 1047749340 Oct 4 16:10 /minos/data/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11/N00009300_0000.spill.mrnt.cedar_phy_bhcurv.1.root -rw-r--r-- 1 minfarm e875 1047749340 Nov 12 16:19 /minos/data2/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11/N00009300_0000.spill.mrnt.cedar_phy_bhcurv.1.root Also some of the old mcin/boehm/mcin files -rw-r--r-- 1 mindata e875 383622203 Nov 8 22:14 n00009592_0001_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 420489636 Nov 8 22:14 n00009592_0002_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 423754360 Nov 8 22:14 n00009592_0003_spill_D04_cedarphybhcurvMRE.reroot.root and dcache -rw-r--r-- 1 mindata e875 388314221 Nov 9 02:51 n00009219_0011_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 375754081 Nov 9 02:51 n00009219_0012_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 356026673 Nov 9 02:51 n00009219_0013_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 400541244 Nov 9 02:51 n00009219_0014_spill_D04_cedarphybhcurvMRE.reroot.root romero has found the problem, removal of the previous snapshot caused a full copy of data to data2. This could take another week. Creating a summary of files to be copies to data2, based on the summary in /minos/scratch/mindata/newdata.log NDL=/minos/scratch/mindata/newdata.log cd ~kreymer/minos/scripts grep /minos/data ${NDL} | tr -s ' ' | cut -f 5 -d ' ' | count Enter numbers to be added : Got 3519 /tmp/FOO numbers 56645368420 57 GBytes. Select only those modified in November : MINOS26 > grep /minos/data ${NDL} | grep ' Nov ' | tr -s ' ' | cut -f 5 -d ' ' | count Enter numbers to be added : Got 3056 /tmp/FOO numbers 35871684801 36 GBytes. 
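The root-ownership repair above can be generalized to every replicated directory. A sketch only, assuming the originals are still present as /minos/data/DIR.MOVED and that it runs with enough privilege to chown:

# Sketch: give files in the data2 copy back the ownership of the originals.
# The .MOVED source trees and the directory list are as used above.
for DIR in condor-tmp mcimport minfarm ; do
  find /minos/data2/${DIR} -user root | while read COPY ; do
    ORIG=/minos/data/${DIR}.MOVED${COPY#/minos/data2/${DIR}}
    [ -e "${ORIG}" ] && chown --reference="${ORIG}" "${COPY}"
  done
done

Running the rsync with --archive ( or adding --owner --group ) in the first place avoids the repair entirely, as noted under DANGER above.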
Summary by directory : FDIRS=`ls /minos/data | grep -v analysis | grep -v users` for DIR in mcimport ; do for DIR in ${FDIRS} ; do COUNTS=`grep /minos/data/${DIR} ${NDL} | tr -s ' ' | cut -f 5 -d ' ' | count` BYTES=`printf "${COUNTS}\n" | tail -1` (( MB = BYTES / 1000000 )) NFILES=`printf "${COUNTS}\n" | grep Got | cut -f 3 -d ' '` printf "%12s %6d %6d\n" ${DIR} ${NFILES} ${MB} done DIRECTORY FILES MB beam_data 0 0 condor-limbo 0 0 condor-tmp 915 2 d10 0 0 flux 0 0 log_data 0 0 maint 0 0 mcimport 1468 18089 mcout_data 0 0 mindata 0 0 minfarm 1100 25986 mysql 0 0 reco_far 23 3058 reco_near 13 9508 release_data 0 0 validation 0 0 ######## # GRID # ######## scavan still having trouble with cert. Looks OK in VOMRS /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven Cavanaugh/CN=UID:scavan His robot cert is not registered. ============================================================================= 2008 11 12 ============================================================================= ########## # CONDOR # ########## Added analysis in /fermilab/minos for asousa rodriges scavan In VOMRS, Status Approved Roles Member Groups /fermilab/minos Group Roles Analysis My Groups Only - check Good listing, but still finds 4912 records, non-group members Group/Group Role Status - approved - limits this to approved users, but only those with analysis role. Removed Group Roles, get 153 records, but only see approved roles. Removed Status and Group Roles entries, tried again, Get 182 rows ! Let's get a list of active Condor users : condor_userprio -all -allusers | grep fnal.gov | sort \ | grep -vfnal.gov@fnal.gov | cut -f 1 -d @ | wc -l 46 ahimmel himmel asousa sousa bckhouse backhouse brebel rebel bspeak speakman cherdack cherdack deb4 Bhattacharya djauty auty idanko danko jdejong dejong jjling ling jyuko ma koskinen koskinen masaki watabe mtavera tavera naples naples nsmayer mayer ochoa ochoa petyt petyt pittam pittam rahaman rahaman rearmstr armstrong rhatcher hatcher rmehdi mehdiyev sfarrell farrell sfiligoi sfiligoi sjc coleman tagg tagg tinti tinti tjyang yang vahle vahle whitehd whitehead zisvan isvan Pre approved hartnell hartnell kreymer kreymer loiacono loiacono mishi ishitsuka med dorman nickd devenish pawloski pawloski rbpatter patterson rodriges rodrigues rustem ospanov scavan cavanaugh ######## # DATA # ######## Scan for files newer than the first snapshot. $ { date ; find /minos/data -type f -ctime -7 -exec ls -ld {} \; ; date ; } 2>&1 | tee /tmp/newdata.log Wed Nov 12 10:39:14 CST 2008 Oops finding lots in the .snapshot directory, and wasting time in anaysis and users. $ FDIRS=`ls /minos/data | grep -v analysis | grep -v users` for DIR in ${FDIRS} ; do { printf "\n${DIR} `date`\n" find /minos/data/${DIR} -type f -ctime -7 -exec ls -ld {} \; } 2>&1 | tee -a /tmp/newdata.log done condor-tmp ... 
d10 Wed Nov 12 10:57:15 CST 2008 flux Wed Nov 12 10:57:15 CST 2008 log_data Wed Nov 12 11:02:22 CST 2008 maint Wed Nov 12 11:02:22 CST 2008 mcimport Wed Nov 12 11:02:22 CST 2008 First pass missed logging of headers, restarted beam_data Wed Nov 12 11:04:13 CST 2008 condor-limbo Wed Nov 12 11:04:13 CST 2008 condor-tmp Wed Nov 12 11:04:13 CST 2008 d10 Wed Nov 12 11:04:19 CST 2008 flux Wed Nov 12 11:04:19 CST 2008 log_data Wed Nov 12 11:04:29 CST 2008 maint Wed Nov 12 11:04:29 CST 2008 mcimport Wed Nov 12 11:04:29 CST 2008 mcout_data Wed Nov 12 12:20:14 CST 2008 mindata Wed Nov 12 12:25:12 CST 2008 minfarm Wed Nov 12 12:25:13 CST 2008 mysql Wed Nov 12 13:04:40 CST 2008 reco_far Wed Nov 12 13:06:59 CST 2008 reco_near Wed Nov 12 13:13:53 CST 2008 release_data Wed Nov 12 13:16:23 CST 2008 validation Wed Nov 12 13:16:27 CST 2008 MINFARM > wc -l /tmp/newdata.log 3551 /tmp/newdata.log MINFARM > grep /minos/data /tmp/newdata.log | wc -l 3519 Count files modified in each month : MINFARM > for N in 4 5 6 7 8 9 ; do printf " ${N} " ; grep /minos/data /tmp/newdata.log | grep "Nov ${N}" | wc -l ; done MINFARM > for N in 10 11 12 ; do printf " ${N} " ; grep /minos/data /tmp/newdata.log | grep "Nov ${N}" | wc -l ; done 4 0 5 537 6 1145 7 300 8 205 9 53 10 56 11 191 12 37 echo '537 +1145 +300 +205 +53 +56 +191+37' | bc 2524 NUFILES=`grep /minos/data /minos/scratch/mindata/newdata.log | cut -f 4- -d /` GOT=0 ; for FILE in ${NUFILES} ; do [ -r /minos/data2/${FILE} ] && (( GOT++)) done ; date ; printf " GOT ${GOT} \n" Wed Nov 12 14:39:01 CST 2008 GOT 230 Thu Nov 13 08:22:21 CST 2008 GOT 230 Thu Nov 13 19:29:51 CST 2008 GOT 3238 Let's get a new review of these files, with change times rather than the default modification times. for FILE in ${NUFILES} ; do ls -lc /minos/data/${FILE} ; done \ > /minos/scratch/mindata/newchange.log for N in 4 5 6 7 8 9 ; do printf " ${N} " grep /minos/data /minos/scratch/mindata/newchange.log | grep "Nov ${N}" | wc -l ; done for N in 10 11 12 ; do printf " ${N} " grep /minos/data /minos/scratch/mindata/newdata.log | grep "Nov ${N}" | wc -l ; done Day Files 4 0 5 565 6 675 7 336 8 1598 9 55 10 56 11 191 12 37 Sum is 3513 ============================================================================= 2008 11 11 ============================================================================= ############# # DBARCHIVE # ############# Creating script from HOWTO.dbarchive. Dropping use of script command, tee into log files instead, now what we do not cut/paste from the terminal. ######## # GRID # ######## Subject: Help Desk Ticket 119292 Has Been Resolved. 
Ticket closed, no more globus errors 17 or 43 since our glideinWMS upgraded to Condor 7.1.3 ######## # MAIL # ######## As usual, an email to stk-users bounces from minos-shifters : Your message cannot be delivered to the following recipients: Recipient address: c.bungau@SUSSEX.AC.UK Reason: Remote SMTP server has rejected address Diagnostic code: smtp;550 unknown user, or c.bungau has a bad forwarding address Remote system: dns;smtp2.susx.ac.uk (TCP|131.225.111.11|43150|139.184.14.93|25) (sivits.uscs.susx.ac.uk ESMTP Exim 4.64 Tue, 11 Nov 2008 16:15:41 +0000) Recipient address: kafv1@SUSSEX.AC.UK Reason: Remote SMTP server has rejected address Diagnostic code: smtp;550 unknown user, or kafv1 has a bad forwarding address Remote system: dns;smtp2.susx.ac.uk (TCP|131.225.111.11|43150|139.184.14.93|25) (sivits.uscs.susx.ac.uk ESMTP Exim 4.64 Tue, 11 Nov 2008 16:15:41 +0000) minos shifters does contain Cristian Bungau Elisabeth Falk I find no Bungau under the Sussex shift index. bungau does exist at Fermilab, forwarded to c.bungau@SUSSEX.AC.UK Removed bungau from minos-shifters Changed to kafv1 to e.falk ########### # ENSTORE # ########### Date: Tue, 11 Nov 2008 08:54:15 -0600 SSA Group needs to reboot the system which manages the STK Silos for Public Enstore %28STK%29 and D0en Enstore. We need to do this as soon as we can, but we want to do this carefully so we cause as little disruption of service as possible. This will only affect the STK libraries. STKen: 9940.library_manager & CD-9940B.library_manager D0en: D0-9940B.library_manager & mezsilo.library_manager CDFen: CDF-9940B-D0.library_manager We will begin draining these libraries at 09:00. Users will still be able to submit requests which will be queued up for the library manager. Draining will allow any tape work that is already in progress to complete. No new mount requests will be handled until after the reboot. Once any copies that are already in progress finish and those tapes get dismounted, we will be able to reboot the %27fntt%27 library front end system. Once the reboot is accomplished, we will restart Enstore processes as necessary. We will then re-open the library managers for normal use. We will have these re-opened as soon as possible. Again, this will only affect access to 9940 type tapes. ----------------------------------------------------------------------- Date: Tue, 11 Nov 2008 10:13:07 -0600 SSA Group has reboot the %27fntt%27 system. Everything went smoothly. All libraries are re-opened. STKen: 9940.library_manager & CD-9940B.library_manager D0en: D0-9940B.library_manager & mezsilo.library_manager CDFen: CDF-9940B-D0.library_manager ######## # GRID # ######## Date: Tue, 11 Nov 2008 09:32:04 -0600 From: Edward Simmonds To: kreymer@fnal.gov Subject: New Grid certificates Art, I don't think we've met, but I'm the "new guy" in Jason Allen's department. I've been asked to install new Grid certificates on minos01-26, because the current certs expire Thursday. I'd like to update one server, minos26 if that works for you, and have someone test to make sure the new certificates are working properly. In other words, I don't want to install all twenty-six and have it break something. Can I install the new cert on minos26 and have you (or anyone you suggest) test it before I update the other 25 servers? Thanks much, Edward Simmonds ----------------------------------------- Date: Tue, 11 Nov 2008 15:42:11 +0000 (GMT) From: Arthur Kreymer The Minos Cluster host cert's are used by the Condor batch system. 
I suggest updating the cert on minos01 first.
The real test is to see Condor jobs start and finish after the upgrade.
Let me know when the cert is upgraded on minos01, and I will run a test job.
Then minos02 through 24 can be upgraded.
minos25 is the Condor master node, and the most sensitive to problems.
It should be upgraded last.
-----------------------------------------
Date: Tue, 11 Nov 2008 09:49:29 -0600
From: Edward Simmonds
To: Arthur Kreymer
Cc: minos-admin@fnal.gov
Subject: Re: New Grid certificates
Arthur Kreymer wrote:
> I suggest updating the cert on minos01 first.
> The real test is to see Condor jobs start and finish after the upgrade.
> Let me know when the cert is upgraded on minos01,
Okay, I'll do this right now and send you an email.
-----------------------------------------
Date: Tue, 11 Nov 2008 09:57:49 -0600
The new cert is installed on minos01.
Please test and let me know the results.
-----------------------------------------
Date: Tue, 11 Nov 2008 16:06:47 +0000 (GMT)
From: Arthur Kreymer
A Condor test job has run on minos01 after the cert update.
Please go ahead with the rest of the Minos Cluster cert updates.
-----------------------------------------
Date: Tue, 11 Nov 2008 10:59:15 -0600
All Grid certificates have been installed on minos01 through 26.
Please let me know if you have any issues.
-----------------------------------------
Date: Tue, 11 Nov 2008 17:02:16 +0000 (GMT)
From: Arthur Kreymer
Thanks !
I have seen at least one new Condor glideinWMS job run,
so I think we are in good shape.
-----------------------------------------
########
# DATA #
########
mysql and validation are in data2 this morning !
date ; df -m /minos/data2 | grep '^ '
Tue Nov 11 14:17:25 GMT 2008
28311552 22455831 5855722 80% /minos/data2
The output of df seems constant, perhaps the first pass is complete !
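Rather than spot checks by hand, the same df can be left polling. A sketch, with an arbitrary 10 minute interval and an assumed log path under /tmp :

# Sketch: timestamped free-space tracking while the replication runs.
# Interval and log path are arbitrary ; stop it with ctrl-c when done.
while true ; do
  { date ; df -m /minos/data /minos/data2 ; echo ; } >> /tmp/data2free.log
  sleep 600
done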
Andy Romero released the snapshot date ; df -m /minos/data | grep '^ ' Tue Nov 11 14:24:00 GMT 2008 28311552 28261948 49605 100% /minos/data Tue Nov 11 14:26:30 GMT 2008 28311552 28196767 114786 100% /minos/data romero is starting the next replication Tue Nov 11 14:38:02 GMT 2008 28311552 27934584 376969 99% /minos/data Tue Nov 11 14:57:58 GMT 2008 28311552 27285167 1026386 97% /minos/data 28311552 22455831 5855722 80% /minos/data2 Tue Nov 11 15:49:51 GMT 2008 28311552 26727884 1583669 95% /minos/data 28311552 22455869 5855684 80% /minos/data2 Second replication started around 12:00 CST Tue Nov 11 12:26:20 CST 2008 28311552 26728252 1583301 95% /minos/data 28311552 22093690 6217863 79% /minos/data2 Tue Nov 11 13:12:40 CST 2008 28311552 26728382 1583171 95% /minos/data 28311552 22240488 6071065 79% /minos/data2 Tue Nov 11 15:59:20 CST 2008 28311552 26728807 1582746 95% /minos/data 28311552 21826461 6485092 78% /minos/data2 Check removed files in mcin : du -sm /minos/data2/mcimport/boehm/mcin/dcache MIN > ls /minos/data2 beam_data condor-limbo condor-tmp d10 flux log_data maint mcimport mcout_data mindata minfarm mysql reco_far reco_near release_data validation MIN > du -sm /minos/data2/* 259008 /minos/data2/beam_data 1 /minos/data2/condor-limbo 3286 /minos/data2/condor-tmp 2 /minos/data2/d10 du: cannot read directory `/minos/data2/flux/gnumi/v19/fluka05_le010z185i_old': Permission denied 3862464 /minos/data2/mcimport 5480489 /minos/data2/mcout_data 1 /minos/data2/mindata 1 /minos/data2/mindata du: cannot read directory `/minos/data2/minfarm/farmtest/.certs/rubin': Permission denied 2687538 /minos/data2/minfarm 445852 /minos/data2/mysql 2131478 /minos/data2/reco_far 4094215 /minos/data2/reco_near 2251 /minos/data2/release_data 85846 /minos/data2/validation Sum 19052432 Checking known deleted files : du -sm /minos/data/mcimport/boehm/ 2558 /minos/data/mcimport/boehm/ du -sm /minos/data2/mcimport/boehm/ 521751 /minos/data2/mcimport/boehm/ du -sm /minos/data2/mcimport/boehm/mcin/dcache 274817 /minos/data2/mcimport/boehm/mcin/dcache MINOS26 > date Tue Nov 11 13:19:18 CST 2008 Date: Tue, 11 Nov 2008 23:02:24 +0000 (GMT) From: Arthur Kreymer The second pass of replication to /minos/data2 is still running. This seems likely to finish this evening, perhaps early tomorrow morning. We will then need one more replication to /minos/data2, with the file system unmounted. I will sent a note when this starts. Again, please stand by, and minimize access to /minos/data or scratch. PLAN FOR FINAL /minos/data2 CUTOVER mindata@minos26 minfarm@minos26 minsoft@minos26 cd /minos/data DDIRS=`find . -type d -maxdepth 1 -user ${LOGNAME} -exec basename {} \; \ | sort |grep -v analysis | grep -v users` printf "${DDIRS}\n" TEST for DIR in ${DDIRS} ; do echo mv ${DIR} ${DIR}.MOVED echo ln -s /minos/data2/${DIR} ${DIR} ls -ld ${DIR} echo done MOVE for DIR in ${DDIRS} ; do mv ${DIR} ${DIR}.MOVED ln -s /minos/data2/${DIR} ${DIR} ls -ld ${DIR} echo done ============================================================================= 2008 11 10 ============================================================================= ######## # GRID # ######## Date: Mon, 10 Nov 2008 15:47:13 -0800 (PST) From: Ryan B. Patterson FYI: We have increased he total number of allowed glidein pilots from 250 to 400, of which 50 may be on CDF nodes. 
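A quick cross check on how many pilots are actually in the pool, counting execute hosts by the GP and CDF worker name patterns used in the scans elsewhere in this log. A sketch only :

# Sketch: count slots in the pool, split by worker name pattern.
condor_status -format '%s\n' Machine > /tmp/poolhosts
printf "total slots   : " ; wc -l < /tmp/poolhosts
printf "GP  ( fnpc )    : " ; grep -c fnpc /tmp/poolhosts
printf "CDF ( fcdfcaf ) : " ; grep -c fcdfcaf /tmp/poolhosts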
######## # DATA # ######## MIN > df -m /minos/data* Filesystem 1M-blocks Used Available Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28311552 28256092 55461 100% /minos/data blue2.fnal.gov:/minos/data 28311552 22503981 5807572 80% /minos/data2 mysql and validation are still not in data2. ######## # DATA # ######## Date: Mon, 10 Nov 2008 12:35:48 -0600 (CST) Subject: HelpDesk ticket 124533 ___________________________________________ Short Description: Deployment plan for new BlueArc /minos/data disk Problem Description: Please forward this to CSI fermigrid-help run2-sys fnalu-admin Per conversation with Andy Romero this morning, here is a plan for active deployment of the new Minos data disks. 1) CSI - export the new disks as blue2:/minos/data to all systems presently mounting minos-nas:/minos/data Do this ASAP, and inform minos-data, fermigrid-help, run2-sys, fnalu-admin 2) run2-sys , fermigrid-help , fnalu-admin Mount blue2:/minos/data as /minos/data2, on all systems where /minos/data is presently mounted. Do this as soon as blue2 is exported, see step 1) above 3) CSI - complete the data copy from minos-nas:/minos/data to blue2:/minos/data, including touchup copies. This is likely to finish today. 4) CSI/Kreymer coordinated deployment : CSI - Make the final touchup copy, with readonly file systems Kreymer - rename the copied directories to *.MOVED in minos-nas create symlinks for each directory from minos-nas to blue2 CSI - Make file systems writeable. Arthur Kreymer can be reached at x4261, or cell 630 697 0469 ___________________________________________ Date: Mon, 10 Nov 2008 13:20:29 -0600 (CST) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Mon, 10 Nov 2008 13:35:04 -0600 (CST) From: Steven Timm To: kreymer@fnal.gov Cc: fermigrid-help@fnal.gov FermiGrid has mounted the /minos/data2 = blue2:/minos/data on all the places where the other minos volumes are mounted. Steve Timm ___________________________________________ Date: Mon, 10 Nov 2008 15:11:12 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/WST Group. ___________________________________________ Date: Mon, 10 Nov 2008 15:13:27 -0600 (CST) Note To Requester: The following (read-only ... for now) export has been created: blue2.fnal.gov:/minos/data The suggested NFS mount options are: -o rsize=32768,wsize=32768,timeo=600,proto=tcp,vers=3,hard,intr ___________________________________________ Date: Mon, 10 Nov 2008 15:29:51 -0600 (CST) From: Margaret_Greaney fnalu nodes were updated for the new mount. ___________________________________________ Date: Mon, 10 Nov 2008 21:52:02 +0000 (GMT) From: Arthur Kreymer To: Margaret_Greaney On FNALU batch nodes, I see the new blue2:/minos/data file system mounted on /minos/data, rather than /minos/data2. Please restore the original minos-nas:/minos/data mount on /minos/data, and add a new mount of blue2:/minos/data on /minos/data2. Thanks ___________________________________________ Date: Mon, 10 Nov 2008 16:03:53 -0600 (CST) From: Margaret_Greaney done __________________________________________ Date: Mon, 10 Nov 2008 15:44:58 -0800 (PST) From: Ryan B. Patterson To: kreymer@fnal.gov Subject: condor-tmp and condor-limbo ownership I've changed ownership of these areas to 'mindata'. Enjoy. 
___________________________________________ Date: Tue, 11 Nov 2008 10:54:15 -0600 From: Jason Harrington To: Arthur E Kreymer Cc: run2-sys@fnal.gov Subject: /minos/data2 /minos/data2 has been installed on all nodes listed in the 'minos-cluster' sysadmin db cluster with the following exceptions: > minos-mysql2 (no /minos/data) > minos-mysql3 (login permission denied) > minos-sam04 (no /minos/data) > minos27 (ssh connection refused, telnet login permission denied) ___________________________________________ Verbal - This morning's replication crashed, needed to clear the snapshot. Andy restarted it around 12:00 Will evaluate again around 16:30, after his class. ___________________________________________ Date: Wed, 12 Nov 2008 16:14:04 +0000 (GMT) From: Arthur Kreymer blue2:/minos/data is mounted on all our clients as /minos/data2. I am now tracking free space in /minos/data2 hourly at http://www-numi.fnal.gov/computing/dh/mdfree/data2/NOW.txt Replication seems to be both adding and removing files. 300 GB was freed up just before 09:44. I am watching for removal of 520 GB of *.reroot.root files under /minos/data2/mcimport/boehm/mcin. ___________________________________________ Date: Wed, 12 Nov 2008 23:41:09 +0000 (GMT) From: Arthur Kreymer To: minos-data@fnal.gov Cc: romero@fnal.gov Subject: Replication status and estimates Andy, yes, I did a full filescan earlier today. Between 11:04 and 13:16, I did a find in replicating directories. This produced file listing /minos/scratch/mindata/newdata.log There are 3519 files. I have done a day by day scan of the 'change' times of these files, an improvement over 'mod' times previously reported. Day Files 4 0 5 565 6 675 7 336 8 1598 9 55 10 56 11 191 12 37 I have occasionally been doing : NUFILES=`grep /minos/data /minos/scratch/mindata/newdata.log | cut -f 4- -d /` GOT=0 ; for FILE in ${NUFILES} ; do [ -r /minos/data2/${FILE} ] && (( GOT++)) done ; date ; printf " GOT ${GOT} \n" Wed Nov 12 14:39:01 CST 2008 GOT 230 ... Wed Nov 12 17:05:10 CST 2008 GOT 230 We do not seem to be getting more of the files that I expect to see, nor are the files being removed from /minos/data2/mcimport/boehm/mcin/. But there are certainly files being copied, as df indicates net file system size changes, at http://www-numi.fnal.gov/computing/dh/mdfree/data2/NOW.txt I do not know of any other activity that would be doing global file scans. ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ Date: Thu, 13 Nov 2008 19:42:43 +0000 (GMT) From: Arthur Kreymer __ The CSI internal replication is being canceled. Due to premature removal of a Bluearc snapshot, this turned in to a full replication, which would have run for days. The modified plan step 4) is : CSI - cancel the second replication CSI - remove all exports of /minos/data from minos-nas: and blue2: CSI - export these root-enabled to minos-mysql1 kreymer - finish the file system copies using rsync ( estimate 4 hours ) I will log in as root@minos-mysql1 ( already have access ) CSI - restore full read/write exports after the rsync is done. FEF/Fermigrid/FNALU - remounts will likely be needed, due to stale file handles _________________________________________ Date: Thu, 13 Nov 2008 19:45:09 +0000 (GMT) From: Arthur Kreymer We are starting the final replication pass. The exports of /minos/data and data2 have been removed. I have reason to expect this to take less than 4 hours. 
Then file systems will have to be re-exported and remounted. ___________________________________________ Date: Thu, 13 Nov 2008 21:01:36 +0000 (GMT) From: Arthur Kreymer The rsync copies started around 14:36 ___________________________________________ Date: Thu, 13 Nov 2008 22:33:09 +0000 (GMT) From: Arthur Kreymer The rsync copies are cranking along. Files are presently moving to mcimport/mtavera. I guess we are about halfway through this pass. I will check again on progress before about 19:00 this evening. ___________________________________________ Date: Fri, 14 Nov 2008 01:22:56 +0000 (GMT) From: Arthur Kreymer Mount/remounts of /minos/data, /minos/data2 needed as follows : run2-sys - We need dismount/remounts on the Minos Cluster and servers to clear the stale file handles in /minos/data /minos/data2 ( except for minos-mysql1, where I have done this already. ) fermigrid-help - we need remounts of /minos/data on fnpcsrv1, and possibly other nodes. Perhaps we can get this by waiting an hour or so for automount to time out ? fnalu-admin - we need remounts desribed above on FNALU batch nodes, which are seeing stale NFS handles. --------- Summary of plan execution ----------- CSI - cancel the second replication DONE CSI - remove all exports of /minos/data from minos-nas: and blue2: DONE CSI - export these root-enabled to minos-mysql1 DONE kreymer - finish the file system copies using rsync ( estimate 4 hours ) I will log in as root@minos-mysql1 ( already have access ) DONE CSI - restore full read/write exports after the rsync is done. DONE around 19:00 CST, thanks Andy FEF/Fermigrid/FNALU - remounts will likely be needed, due to stale file handles TRUE - requests are listed above ___________________________________________ Date: Thu, 13 Nov 2008 19:49:07 -0600 (CST) From: Steven Timm Stale file handles for /minos/data cleared on fnpcsrv1. Will check workers later. ___________________________________________ Date: Thu, 13 Nov 2008 20:31:07 -0600 (CST) From: Steven Timm To: Arthur Kreymer Subject: Re: HelpDesk ticket 124533 has additional info. Stale file handles on 7 worker nodes in gp grid cleared too, Will look for stale ones on cdf grid later. ___________________________________________ Date: Fri, 14 Nov 2008 02:37:27 +0000 (GMT) From: Arthur Kreymer To: run2-sys@fnal.gov Cc: minos-data@fnal.gov Subject: Urgent - please remount /minos/data and data2 on Minos Cluster Please, if you get a chance this evening, correct the stale file handles on the Minos Cluster, as noted below. This is the last thing that needs to be done before we announce availalbility of the disks to our users. ... ___________________________________________ 20:50 - called helpdesk, requested page of FEF run2-sys 21:00 - helpdesk will page FEF run2-sys Date: Thu, 13 Nov 2008 21:15:56 -0600 (CST) From: HelpDesk Subject: HelpDesk ticket 124803 Short Description: MINOS Cluster - Art Kreymer- 630-840-4261 Problem Description: Detailed Problem Description (if supplied)FEF primary call Art Kreymer X4271 regarding the MINOS Cluster. This ticket is assigned to HO, LING of the CD-SF/FEF. _________________________________________ 21:23 Ling responded, will remount disks by about 22:00 ___________________________________________ Date: Thu, 13 Nov 2008 22:29:23 -0600 From: Ling C. Ho I have remounted /minos/data and /minos/data on minos01-26, minos-sam01-03. Is there anything I missed? 
___________________________________________
Date: Thu, 13 Nov 2008 22:49:45 -0600 (CST)
Thanks, things look good on all nodes,
except for minos07, which still shows NFS timeouts.
___________________________________________
Date: Fri, 14 Nov 2008 04:56:42 +0000 (GMT)
Looks like I missed minos07. It should be fine now.
___________________________________________
Date: Fri, 14 Nov 2008 04:58:09 +0000 (GMT)
From: Arthur Kreymer
Thanks, minos07 is better now.
Thanks again for attending to this so late in the evening.
I will announce availability of the file systems.
This ticket can be closed !
__________________________________________
Date: Thu, 13 Nov 2008 23:04:11 -0600 (CST)
From: HelpDesk
Solution: ling@fnal.gov sent this solution:
NFS mounts restored.
This ticket was resolved by HO, LING of the CD-SF/FEF group.
=============================================================================
2008 11 07
=============================================================================
#########
# ADMIN #
#########
Date: Fri, 07 Nov 2008 23:38:55 +0000 (GMT)
From: Arthur Kreymer
To: ling@fnal.gov
Cc: minos-admin@fnal.gov
Subject: Re: DB servers, and other Grid items
On Wed, 22 Oct 2008, Arthur Kreymer wrote:
> Here is a strawman configuration, for discussion
>
> minos-mysql2
> ...
> minos-sam04
> ...
> minos01 replacement - configured just like minos01, as NIS server.
> ...
> minos-mysql3
> ...
These systems seem to have been on the network for a week or two.
I can log into
kreymer@minos-mysql2
kreymer@minos27 ( using rsh, not ssh )
but not the other two systems.
I do not see the requested minsoft account on minos-mysql2.
When will these systems be available for us to test ?
#########
# MYSQL #
#########
Bootstrapping mysql product onto samread@minos-sam03
scp minsoft@minos-sam03:setups.sh .
mkdir -p ups/db/foo
mkdir -p ups/db/.upsfiles
mkdir -p ups/db/.updfiles
AFSP=/afs/fnal.gov/files/code/e875/general/ups
cp ${AFSP}/db/.upsfiles/dbconfig ups/db/.upsfiles/dbconfig
nedit ups/db/.upsfiles/dbconfig
changed /afs/fnal.gov/files/code/e875/general/ups to /home/samread/ups
cp ${AFSP}/db/.updfiles/updconfig ups/db/.updfiles/updconfig
setup upd
upd install -j mysql v5_0_67
ups declare -c mysql v5_0_67
########
# DATA #
########
/grid/data/minos is over 80% capacity ( 400 GB ).
Again, it is Rustem.
SRV1> du -sm /grid/data/minos/users/*
1 /grid/data/minos/users/boehm
1 /grid/data/minos/users/brebel
1 /grid/data/minos/users/habig
1 /grid/data/minos/users/jdejong
1 /grid/data/minos/users/jjling
1 /grid/data/minos/users/kreymer
1 /grid/data/minos/users/masaki
91 /grid/data/minos/users/mishi
1 /grid/data/minos/users/petyt
315282 /grid/data/minos/users/rustem
1 /grid/data/minos/users/scavan
1 /grid/data/minos/users/tinti
##########
# CONDOR #
##########
Removed the cronjob entry that releases held gfactory jobs;
this should no longer happen, since our Condor 7.1.3 upgrade in gfactory.
#########
# POWER #
#########
recover from planned outage 05:00
###########
# BLUEARC #
###########
Date: Fri, 07 Nov 2008 05:44:00 -0600
From: Andrew J. Romero
To: 'lisa' , 'Jon Bakken' , 'Steven Timm' , "'kreymer@fnal.gov'"
Subject: BlueArc maintenance complete
BlueArc maintenance complete
bluwatch saw the outage, 05:24 through 05:42
########
# DATA #
########
./mcimport boehm
less /minos/data/mcimport/boehm/log/mcimport.log
Fri Nov 7 07:08:35 CST 2008
OK - purging 87 MCIN files ?
MCIN processing 511 files Fri Nov 7 08:04:30 CST 2008 74 files copied by about 9:12 this is a much better rate than yesterday, 1' per file versus 10' per file ./mciboehm -n boehm | tee /tmp/mcib.log $ grep SIZE /tmp/mcib.log | wc -l 1519 $ grep PNFS /tmp/mcib.log | wc -l 81 $ ls /minos/data/analysis/nue/MRERerootFiles/CedarPhyData | wc -l 1600 MINOS26 > du -sm /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE 177736 /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE MINOS26 > find /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE -type f | wc -l 1543 N.B. - should mcimprt -M to suppress sam declarations. ============================================================================= 2008 11 06 ============================================================================= ####### # CVS # ####### global lock for the repository http://www.mail-archive.com/info-cvs@gnu.org/msg33409.html create an empty $CVSROOT/CVSROOT/writers file http://ximbiot.com/cvs/manual/cvs-1.11.20/cvs_2.html#SEC36 ######## # DATA # ######## Moving the remaining nearly 600 files, ./mcimport -b 1 boehm ./mcimport boehm ########### # BLUEARC # ########### Data has been moving from /minos/data to /minos/data2 since Tuesday. Internally, via FC. Only directories : mcimport mcout_data minfarm reco_far reco_near Will change these to symlinks in /M/D/ when the copy is complete. Existing exports : minos-nas-0.fnal.gov:/minos/data /minos/data nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 minos-nas-0.fnal.gov:/minos/scratch /minos/scratch nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 New export : blue2.fnal.gov:/minos/data /minos/data2 nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 ######## # DATA # ######## Test copy of a file which Rubin cannot read via srmcp MINOS26 > ./dc_stat n13011670_0005_L010185N_D00.reroot.root ============================ PNFS status for /pnfs/minos/mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root -rw-r--r-- 1 rhatcher e875 371431752 May 1 2007 n13011670_0005_L010185N_D00.reroot.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;l=371431752; LEVEL 4 VO4722 0000_000000000_0000129 371431752 mcin_near_daikon /pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root 000F00000000000005412E40 CDMS117804817300000 stkenmvr26a:/dev/rmt/tps0d0n:479000037979 455900240 ============================ MINOS26 > ./dccptest /mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root 2,0,0,0.0,0.0 :h=yes;l=371431752; [Thu Nov 6 11:18:31 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root in cache. Connected in 0.00s. Cache open succeeded in 94.03s. 371431752 bytes in 5 seconds (72545.26 KB/sec) -rw-r--r-- 1 kreymer g020 371431752 Nov 6 11:20 /local/scratch26/kreymer/n13011670_0005_L010185N_D00.reroot.root MINOS26 > ./dccptest /mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root 2,0,0,0.0,0.0 :h=yes;l=371431752; [Thu Nov 6 13:23:16 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root in cache. Connected in 0.00s. Cache open succeeded in 0.34s. 
371431752 bytes in 24 seconds (15113.60 KB/sec) -rw-r--r-- 1 kreymer g020 371431752 Nov 6 13:23 /local/scratch26/kreymer/n13011670_0005_L010185N_D00.reroot.root ########### # SRMTEST # ########### srmtest.20081106 - now using X509_USER_PROXY instead of SRM_CONFIG Tested on fnpcsrv1, OK ./mcimport -b 1 boehm ######## # FARM # ######## Setting a flag which will tell the Minos Farm scripts not to reconstruct the archived mcin data from Josh mkdir /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurvMRE/NORECO Cleaning up daikon_04/spill_cedarphybhcurvMRE files produced in error on the farm minospro@minos26 cd /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/spill_cedarphybhcurvMRE PRO> du -sm . 35670 . ( cd /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/spill_cedarphybhcurvMRE/cand_data ; enstore pnfs --tags ) .(tag)(library) = CD-LTO3 .(tag)(file_family) = minos .(tag)(file_family_wrapper) = cpio_odc .(tag)(storage_group) = minos .(tag)(file_family_width) = 1 PRO> find . -type f | wc -l 88 cd /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04 rm -r spill_cedarphybhcurvMRE date Thu Nov 6 10:48:32 CST 2008 ######## # GRID # MILESTONE ######## Upgraded to use the analysis role, jobs will run under minosana MINOS25 > cd /local/scratch25/grid MINOS25 > cp ~rbpatter/computing/sandbox/kproxy.20081106 . MINOS25 > cp ~rbpatter/computing/sandbox/kproxyv.20081106 . MINOS25 > ln -sf kproxy.20081106 kproxy MINOS25 > ln -sf kproxyv.20081106 kproxyv MINOS25 > date Thu Nov 6 08:39:37 CST 2008 MINOS25 > condor_q | tail -1 1291 jobs; 369 idle, 42 running, 880 held MINOS25 > condor_q -run | grep -v minos ID OWNER SUBMITTED RUN_TIME HOST(S) 218557.3 gfactory 11/6 05:20 0+03:20:02 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor 218564.0 gfactory 11/6 06:11 0+02:28:30 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor 218578.3 gfactory 11/6 08:01 0+00:38:27 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor Updated my proxy at 08:47 MINOS25 > /local/scratch25/grid/kproxy MINOS25 > /local/scratch25/grid/kproxyi attribute : /fermilab/minos/Role=Analysis/Capability=NULL ######## # GRID # MILESTONE ######## Ryan has run the first Minos glidein job on CDF nodes those nodes lack /usr/local/etc/setups.sh ########### # ENSTORE # ########### Data Rates plot shows zero from 15:00 through 19:30 yesterday ########## # PARROT # ########## mindata@minos26 PD=/minos/scratch/parrot MD=/afs/fnal.gov/files/data/minos cd ${PD} date ; time ./make_growfs.auto -k ${MD}/d120 Thu Nov 6 07:53:10 CST 2008 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d120/.growfsdir ####### # SAM # ####### Reviewing sam web pages , be sure we have no samzilla running : SAM User registration http://www-numi.fnal.gov/cgi-bin/autoRegister.py Get list of files http://www-numi.fnal.gov/computing/findrun_sam.html SAG http://www-numi.fnal.gov/sam_local/SamAtAGlance/ Web home is /afs/fnal.gov/files/expwww/numi Did global search under cgi-bin, find cgi-bin -name samzilla ============================================================================= 2008 11 05 ============================================================================= 14:37 or so - site wide power outage 15:35 - power back up ######## # DATA # ######## Creating specal mciboehm script to purge archived files from /minos/data/analysis/nue/MRERerootFiles/CedarPhyData Nov 5 13:09 STAGE/boehm/log/mcimport.log The previous iteration of mcimport failed, SRMCPed n00009238_0007_spill_D04_cedarphybhcurvMRE.reroot.root SRMCPed 
n00009238_0008_spill_D04_cedarphybhcurvMRE.reroot.root SRMCPed n00009238_0009_spill_D04_cedarphybhcurvMRE.reroot.root SRMClientV1 : getRequestStatus: try #0 failed with error SRMClientV1 : java.net.ConnectException: Connection timed out setting File Request to "Done" failed java.lang.RuntimeException: java.net.ConnectException: Connection timed out at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1101) at gov.fnal.srm.util.SRMV1CopyJob.done(SRMV1CopyJob.java:188) at gov.fnal.srm.util.Copier.run(Copier.java:359) at java.lang.Thread.run(Thread.java:595) srm client error: java.net.ConnectException: Connection timed out SRMClientV1 : getRequestStatus: try #0 failed with error SRMClientV1 : java.rmi.RemoteException: srm setFileStatus failed; nested exception is: java.lang.RuntimeException: java.lang.IllegalArgumentException: FileRequest fileRequestId =-2144325443does not belong to this Request Exception in thread "Thread-1" java.lang.RuntimeException: java.rmi.RemoteException: srm setFileStatus failed; nested exception is: java.lang.RuntimeException: java.lang.IllegalArgumentException: FileRequest fileRequestId =-2144325443does not belong to this Request at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1101) at gov.fnal.srm.util.SRMV1CopyJob.done(SRMV1CopyJob.java:188) at gov.fnal.srm.util.Copier.cleanup(Copier.java:672) at gov.fnal.srm.util.Copier.run(Copier.java:274) at java.lang.Thread.run(Thread.java:595) OOPS - srmcp failed , bailing Last file copied seems to be : $ ls -l /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurvMRE/923/n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 kreymer e875 332164037 Nov 5 13:08 /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurvMRE/923/n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root $ ls -l STAGE/boehm/mcin/n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 332164037 Sep 28 04:12 STAGE/boehm/mcin/n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root Power outage, then recovery, resume the copies : $ cd STAGE/boehm/mcin $ mv n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root dcache/n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root ./mcimport boehm less /minos/data/mcimport/boehm/log/mcimport.log OK - purging 916 MCIN files ? Wed Nov 5 15:56:28 CST 2008 PURGED n00008451_0001_spill_D04_cedarphybhcurvMRE.reroot.root ... $ ls /minos/data/analysis/nue/MRERerootFiles/CedarPhyData | wc -l 1600 Rats, the file sizes do not match ! $ dds /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE/900/n00009003_0000_spill_D03_cedarphyMRE.reroot.root -rw-r--r-- 1 kreymer e875 144656067 Nov 6 2007 /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE/900/n00009003_0000_spill_D03_cedarphyMRE.reroot.root $ dds /minos/data/analysis/nue/MRERerootFiles/CedarPhyData/n00009003_0000_spill_D03_cedarphyMRE.reroot.root -rw-r--r-- 1 mindata e875 110607413 Dec 13 2007 /minos/data/analysis/nue/MRERerootFiles/CedarPhyData/n00009003_0000_spill_D03_cedarphyMRE.reroot.root And data rates for present srmcp's is over 10 minutes per file !!!!!! 
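Regarding the "file sizes do not match" comparison above - a loop like the following would flag every mismatched PNFS/BlueArc pair instead of spot-checking single files. This is only a sketch: the three-digit PNFS subdirectory rule (characters 6-8 of the file name) is inferred from the example paths above, and GNU stat is assumed.
# compare sizes of the CedarPhyData copies against the daikon_03 PNFS copies
MCID=/minos/data/analysis/nue/MRERerootFiles/CedarPhyData
PNFD=/pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE
for FILE in `ls ${MCID}` ; do
  SUB=`echo ${FILE} | cut -c 6-8`          # e.g. n00009003... -> 900
  PFIL=${PNFD}/${SUB}/${FILE}
  [ -f "${PFIL}" ] || { echo "NOPNFS ${FILE}" ; continue ; }
  LSIZ=`stat -c %s ${MCID}/${FILE}`
  PSIZ=`stat -c %s ${PFIL}`
  [ "${LSIZ}" = "${PSIZ}" ] || echo "SIZE ${FILE} local ${LSIZ} pnfs ${PSIZ}"
done
Any NOPNFS or SIZE lines mark files that should not be purged from /minos/data.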
Killed the script, ran one more iteration to clean the pid file, ./mcimport -b 1 boehm ######## # SPAM # ######## Election spam, from unregistered network address, hitting minos_sam_admin minos_software_discussion minos_sim ######## # GRID # ######## Date: Wed, 05 Nov 2008 08:48:57 -0600 (CST) Subject: HelpDesk ticket 124250 ___________________________________________ Short Description: Many files at the top of /grid/data - see closed ticket 120763 Problem Description: There are over 7000 files at the top level of /grid/data with names like 2008-11-05T14:08:01Z-gridftp-probe-test-file-remote.7515 There are about 200 of these per day, dating back through Oct 1. Perhaps the daily purge script has stopped working. ___________________________________________ Date: Thu, 06 Nov 2008 14:37:56 -0600 (CST) Note To Requester: I have manually removed these files again. There is some reason why this script isn't working in cron and we are not sure why. Will keep the ticket open until we get it solved. Steve Timm ___________________________________________ Date: Mon, 10 Nov 2008 13:27:41 -0600 (CST) Solution: The cleanup script is working now. Steve Timm This ticket was resolved by TIMM, STEVE of the CD-SF/GF/FGS group. ___________________________________________ ============================================================================ 2008 11 04 ============================================================================= ######## # GRID # ######## Date: Tue, 04 Nov 2008 14:57:04 -0600 (CST) From: Steven Timm To: kreymer@fnal.gov Subject: minos quota bump Art--as a result of the meeting with Steve W. and Patty M. this morning it was determined that MINOS quota would be (and now has been) increased to 400 slots. I know it is not full opportunistic yet but it will help some. ######## # DATA # ######## Ticket 118354 - raw data writes run daily again, see below ####### # SAM # ####### I think the following is moot, our logs are not browsable Date: Tue, 04 Nov 2008 13:19:04 -0600 From: Robert Illingworth To: Angela Bellavance , Arthur Kreymer Subject: SamZilla web vulnerabilities There are apparently security vulnerabilities with the SamZilla web log file browser. If this is installed on the CDF or Minos webservers, I recommend you either remove it or at least remove execute permission from the python scripts in the ups product cgi directory, until we discovered how serious the problem is and what can be done about it. Robert ########## # ORACLE # ########## Date: Tue, 04 Nov 2008 11:46:19 -0600 minosdev & minosint database hosted on minosora3 will be down for Oracle DB Security patches on Tuesday 11/04/2008 starting 1:30PM Interruption is expected to last 1 hour. -Nelly p.s.  i have cc'd the sysadmin maillist,  giving them a heads up that i will need a script run as root during the patching process Date: Tue, 04 Nov 2008 14:19:56 -0600 minosdev & minosint database hosted on minosora3 october oracle quarterly  work  is completed.   let us know if you have any issues at oem-admin@fnal.gov   our goal is to patch minosora1~minosprd on thursday nov 20, 2008 ============================================================================= 2008 11 03 ============================================================================= ########### # MONTHLY # ########### DATASETS 11/03 PREDATOR 11/03 VAULT 11/03 MYSQL 11/14 using new dbarchive script ( still cut/paste ) ######## # DATA # ######## boehm volunteered about 1 TB of reroot files that can be archived. 
/minos/data/analysis/nue/MRERerootFiles/ See notes 10/30 ./pnfsdirs near MCIN daikon_04 spill_cedarphybhcurveMRE write Mon Nov 3 10:52:04 CST 2008 STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurveMRE FAMSET mcin_near_daikon_04 FAMILY mcin_near_daikon_04 Oops, ./pnfsdirs near MCIN daikon_04 spill_cedarphybhcurvMRE write rmdir /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurveMRE Shifted a few files, for testing MCIF=/minos/data/analysis/nue/MRERerootFiles MCID=${MCIF}/CedarPhyBhcurvData/NearDetector MCIN=/minos/data/mcimport/boehm/mcin $ ls ${MCIF}/CedarPhyBhcurvData/NearDetector | grep n00008454 n00008454_0000_spill_D04_cedarphybhcurvMRE.reroot.root n00008454_0007_spill_D04_cedarphybhcurvMRE.reroot.root n00008454_0014_spill_D04_cedarphybhcurvMRE.reroot.root n00008454_0020_spill_D04_cedarphybhcurvMRE.reroot.root 10:57 mv ${MCIF}/CedarPhyBhcurvData/NearDetector/n00008454* ${MCIN}/ Cleaning up stray dcache files in boehm/mcin/dcache : -rw-r--r-- 1 mindata e875 0 Oct 25 2007 n00009573_0000_spill_D03_cedarphyMRE.reroot.root -rw-r--r-- 1 mindata e875 0 Oct 24 2007 n00009696_0019_spill_D03_cedarphyMRE.reroot.root -rw-r--r-- 1 mindata e875 0 Oct 24 2007 n00009696_0020_spill_D03_cedarphyMRE.reroot.root -rw-r--r-- 1 mindata e875 0 Oct 24 2007 n00009696_0021_spill_D03_cedarphyMRE.reroot.root In log/mcimport.log, these were pending a year ago. MCIN processing 0 files Thu Nov 8 15:14:26 CST 2007 The files were cleanly written to /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurvMRE/845/ Let's move the rest. $ ls ${MCID} | wc -l 1515 FILES=`ls ${MCID}` printf "${FILES}\n" | wc -l 1515 time for FILE in ${FILES} ; do mv ${MCID}/${FILE} ${MCIN}/${FILE} ; done real 0m56.457s user 0m1.088s sys 0m2.653s $ du -sm ${MCIN} 521751 /minos/data/mcimport/boehm/mcin The above was done around 17:00 ######## # DATA # ######## /minos/data nearly filled on Sunday, then cleared out 237536 Sun Nov 2 00:56:07 CDT 2008 210189 Sun Nov 2 01:56:10 CDT 2008 187179 Sun Nov 2 01:56:11 CST 2008 165368 Sun Nov 2 02:56:12 CST 2008 151886 Sun Nov 2 03:56:13 CST 2008 132018 Sun Nov 2 04:56:14 CST 2008 131748 Sun Nov 2 05:56:16 CST 2008 118768 Sun Nov 2 06:56:18 CST 2008 108241 Sun Nov 2 07:56:19 CST 2008 94099 Sun Nov 2 08:56:22 CST 2008 101133 Sun Nov 2 09:56:25 CST 2008 106683 Sun Nov 2 10:56:29 CST 2008 101025 Sun Nov 2 11:56:31 CST 2008 83327 Sun Nov 2 12:56:35 CST 2008 80768 Sun Nov 2 13:56:38 CST 2008 71174 Sun Nov 2 14:56:40 CST 2008 64220 Sun Nov 2 15:56:43 CST 2008 118041 Sun Nov 2 16:56:47 CST 2008 232765 Sun Nov 2 17:56:51 CST 2008 391707 Sun Nov 2 18:56:55 CST 2008 ########## # CONDOR # ########## minos25 went into overload, average over 100, starting around 02:30 Average is up to 140 around 06:00 Condor activity pretty much shut down globally. Can write to /minos/scratch /minos/data . /grid/data /grid/app MINOS25 > lsof | wc -l 1372 Nothing interesting in the /var/log/messages gfactory plot entries end at 3:30 No unusual activity around 02:30, about 210 running glideins, few idle stuck doing MINOS25 > echo FOO > /grid/fermiapp/touchga MINOS25 > lsof /grid/fermiapp COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME bash 5721 kreymer 1u REG 0,24 4 3937155499 /grid/fermiapp/touchga (blue2:/fermigrid-fermiapp) Load dropped sharply at 09:48. No condor processes are running. System is still about 25% wait I/O, Ganalia shows sustained 4 to 9 MB/sec I/O. 
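An aside on the touchga test above - each stuck write costs an interactive shell. A detached probe along these lines keeps the login usable; a sketch only, the probe path and the 30 second limit are arbitrary, and a write stuck in D state will survive the kill (the probe still reports it).
# timed write probe for a BlueArc/NFS path
PROBE=/grid/fermiapp/minos/touchprobe.$$
( echo FOO > ${PROBE} ) &
WPID=$!
sleep 30
if kill -0 ${WPID} 2>/dev/null ; then
  echo "`date` STUCK writing ${PROBE}"
  kill -9 ${WPID} 2>/dev/null      # best effort, a D-state write ignores this
else
  echo "`date` OK    wrote ${PROBE}"
  rm -f ${PROBE}
fi
Run from cron, the OK/STUCK lines give a timeline comparable to the bluwatch SLO entries.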
MINOS25 > lsof | wc -l 620 MINOS25 > date Mon Nov 3 10:38:07 CST 2008 MINOS25 > cat logs/glide/probe.217336.0.log 000 (217336.000.000) 11/02 14:50:03 Job submitted from host: <131.225.193.25:64258> ... 001 (217336.000.000) 11/02 14:50:32 Job executing on host: <131.225.166.120:60817> ... 007 (217336.000.000) 11/03 09:48:13 Shadow exception! Assertion ERROR on (result) 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job Pawloski reports errors writing to /minos/data last night. cherdack 9777 1.2 0.0 3692 720 pts/2 D+ 09:05 1:10 | \_ mv /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0000_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0001_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0002_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0003_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0004_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0005_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0006_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0007_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0008_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0009_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0010_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0000_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0001_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0002_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0003_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0004_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0005_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0006_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0007_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0008_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0010_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0000_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0001_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0002_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0003_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0004_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0005_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0006_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0007_L010185N_D00_nccoh.sntp.cedar_phy.root 
/minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0008_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0009_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0010_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0000_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0001_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0002_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0003_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0004_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0005_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0006_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0007_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/u Date: Mon, 03 Nov 2008 08:54:04 -0600 (CST) Subject: HelpDesk ticket 124102 ___________________________________________ Short Description: minos25 is overloaded - why ? Problem Description: run2-sys, minos-admin : At around 02:30 Monday 3 Nov, the load average on minos25 increased sharply to 100. It has built up to 140 through the morning. Many of the usual suspects do not seem to be at fault. I can write to /minos/data and scratch, and /grid/data and app. In testing, I created a new stuck interactive process, doing echo FOO > /grid/fermiapp/touchga The file was written, and is readable from minos25 and elsewhere, but the shell that did the writing is stuck, cannot be interrupted. I see nothing interesting in /var/log/messages. We are not out of file descriptors, as I can log in and write new files. What is going on ? ___________________________________________ Date: Mon, 03 Nov 2008 09:14:02 -0600 (CST) This ticket has been reassigned to BRICHACEK, MATTHEW of the CD-SF/FEF Group. x3982 ___________________________________________ Date: Mon, 03 Nov 2008 11:38:26 -0600 There is a move command that has made /minos/data io-bound. This is causing the load to jump on the server. I have stopped condor but the move is still completing. The move command PID is 9777. Once the move is complete I will restart condor. ___________________________________________ Date: Mon, 03 Nov 2008 11:40:06 -0600 The move command just completed and condor is on it's way back up. ___________________________________________ Date: Mon, 03 Nov 2008 11:43:42 -0600 (CST) Solution: A move command was causing IO on /minos/data to hang. Condor was stopped, the move command completed and condor has been restarted ___________________________________________ Date: Mon, 03 Nov 2008 17:48:04 +0000 (GMT) Would this be the command ? cherdack 9777 1.2 0.0 3692 720 pts/2 D+ 09:05 1:10 | \_ mv /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0000_L010185N_D00_ nccoh.sntp.cedar_phy.root ... ___________________________________________ Date: Mon, 03 Nov 2008 11:48:50 -0600 Yes, that was the one. 
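For the next time a Cluster node bogs down like this, the culprit can usually be spotted without waiting for the ticket: look for long-lived processes in uninterruptible sleep (state D) touching a BlueArc path. A sketch, using nothing beyond plain procps:
# list D-state (I/O-bound) processes with elapsed time and command line
ps -eo stat,pid,user,etime,args | awk 'NR == 1 || $1 ~ /^D/'
# narrow to the usual suspects if the list is long
ps -eo stat,pid,user,args | awk '$1 ~ /^D/' | grep -e /minos/data -e /minos/scratch -e /grid/
The cherdack mv above would have shown up immediately in such a listing.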
___________________________________________ ============================================================================= 2008 10 31 ============================================================================= ########## # CONDOR # ########## The initial working directory seems to be in _CONDOR_SCRATCH_DIR Confirmed this, http://osg-docdb.opensciencegrid.org/0003/000382/002/NFS-lite.doc ############### # CONDORGLIDE # ############### script/condorglide switched from glideafs to glide. do not select afs nodes do not set REMOTE_INITIALDIR ########## # CONDOR # ########## Probe job is taking over 6 minutes to do du -sh /grid/home/minos better remove this from probe, for now. Holding glideins for the moment, with touch /minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE 09:59 rm /minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE ########## # CONDOR # ########## About 54 gpfarm nodes are down, including 295 -> 318 323 -> 346 ( 339-346 are the Minos AFS nodes ) Ganglia shows a short drop in farm capacity around 00:00 today. Sometime around 09:46, AFS came back. ########## # CONDOR # ########## ssh fnpc4x1 ps axfu | grep globus-job-manager | grep minos | grep -v grep minos 15734 0.0 0.0 111924 5008 ? S 08:33 0:00 globus-job-manager -conf /usr/local/vdt-1.10.1/globus/etc/globus-job-manager.conf -type managedfork -rdn jobmanager-managedfork -machine-type unknown -publish-jobs ssh -ax fnpc4x1 'ps axfu | grep globus-job-manager | grep minos | grep -v grep' minos 15734 0.0 0.0 111924 5008 ? S 08:33 0:00 globus-job-manager -conf /usr/local/vdt-1.10.1/globus/etc/globus-job-manager.conf -type managedfork -rdn jobmanager-managedfork -machine-type unknown -publish-jobs Date: Fri, 31 Oct 2008 09:06:08 -0500 (CDT) From: Steven Timm To: Arthur Kreymer Cc: fermigrid-help@fnal.gov Subject: Re: Userid for MINOS glideins That's right--since you aren't user "minos" you can't actually strace the process. Only root can do that. But if there's a globus-job-manager out there, particularly if it has any subprocesses such as globus-gass-cache-util sitting out there, let us know and we can deal with it. We are now pretty sure that this problem is related to the "feature" of bluearc snapshotting, in which sometimes a hard link can be removed from a directory on the bluearc but it isn't really gone. thus the globus-gass-cache-util spins in a tight loop: rm "data" no such file or directory ln "data" file exists and so forth. for whatever reason an strace is enough to jar it out of the loop. Steve ============================================================================= 2008 10 30 ============================================================================= ######## # DATA # ######## Removed .removed files from the 2008 10 24 simulation cleanup. Details have been appended to that entry. ########## # CONDOR # ########## No new glidein user jobs seem to have started since about 12:00 today.
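A quick way to check the "no new glidein user jobs" impression is to ask the schedd for the most recent start times directly. A sketch; JobStatus, Owner and JobStartDate are standard job ClassAd attributes, and the output is epoch seconds:
condor_q -constraint 'JobStatus == 2' \
         -format "%-10s " Owner -format "%d\n" JobStartDate \
  | sort -n -k 2 | tail -10
If the newest values all predate about 12:00, the stall is real and not just an impression from the queue listing.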
Bluearc has been healthy, can write from minos25 In /home/gfrontend/myvofrontend2/log/frontend_info.20081030.log, there was unusual activity around noon : [2008-10-30T11:57:27-05:00 7042] Iteration at Thu Oct 30 11:57:27 2008 [2008-10-30T11:57:33-05:00 7042] Match [2008-10-30T11:57:33-05:00 7042] Total running 239 limit 250 [2008-10-30T11:57:33-05:00 7042] For gpminos@t20_glexec@minos Idle 432 Running 239 [2008-10-30T11:57:33-05:00 7042] Advertize gpminos@t20_glexec@minos Request idle 10 max_run 698 [2008-10-30T11:57:33-05:00 7042] For gpgeneral@t20_glexec@minos Idle 432 Running 239 [2008-10-30T11:57:33-05:00 7042] Advertize gpgeneral@t20_glexec@minos Request idle 10 max_run 698 [2008-10-30T11:57:33-05:00 7042] Sleep [2008-10-30T11:59:03-05:00 7042] Iteration at Thu Oct 30 11:59:03 2008 [2008-10-30T11:59:13-05:00 7042] Match [2008-10-30T11:59:13-05:00 7042] Total running 239 limit 250 [2008-10-30T11:59:13-05:00 7042] For gpminos@t20_glexec@minos Idle 922 Running 239 [2008-10-30T11:59:13-05:00 7042] Advertize gpminos@t20_glexec@minos Request idle 10 max_run 1208 [2008-10-30T11:59:13-05:00 7042] For gpgeneral@t20_glexec@minos Idle 922 Running 239 [2008-10-30T11:59:13-05:00 7042] Advertize gpgeneral@t20_glexec@minos Request idle 10 max_run 1208 [2008-10-30T11:59:13-05:00 7042] Sleep currently [2008-10-30T16:02:38-05:00 7042] Iteration at Thu Oct 30 16:02:38 2008 [2008-10-30T16:02:42-05:00 7042] Match [2008-10-30T16:02:42-05:00 7042] Total running 81 limit 250 [2008-10-30T16:02:42-05:00 7042] For gpminos@t20_glexec@minos Idle 294 Running 81 [2008-10-30T16:02:42-05:00 7042] Advertize gpminos@t20_glexec@minos Request idle 10 max_run 391 [2008-10-30T16:02:42-05:00 7042] For gpgeneral@t20_glexec@minos Idle 269 Running 81 [2008-10-30T16:02:42-05:00 7042] Advertize gpgeneral@t20_glexec@minos Request idle 10 max_run 365 [2008-10-30T16:02:43-05:00 7042] Sleep /home/gfactory/glideinsubmit/glidein_t20_glexec/log/factory_info.20081030.log I see nothing interesting going on around 12:00 or later From: Steven Timm To: Sfiligoi Igor Cc: Arthur Kreymer , Ryan B. Patterson , fermigrid-help@fnal.gov Subject: Re: Userid for MINOS glideins 10 globus-job-manager processes were hung on fnpcfg1 (a.k.a. fnpc4x1) since 11:34AM I found the right one, straced it, and they all cleared up. Since both Art and Ryan have the "admin" access to the gatekeepers and worker nodes as requested by MINOS it is possible for them to log in to fnpc4x1 as kreymer and rbpatter respectively and check for this themselves. Look for any globus-job-manager process owned by minos that is more than 1 hr old, that's a sign of this problem. We have a feature request in to condor team so that the condor-G client can corrrectly detect this error condition too. Expect they'll get it done in the next release or two. Steve Timm ######## # DATA # ######## boehm volunteered about 1 TB of reroot files that can be archived. /minos/data/analysis/nue/MRERerootFiles/ Warning, most of the CedarPhyData files are already in PNFS, they can just be removed. CedarPhyBhcurvData/NearDetector CedarPhyDaikon00 CedarPhyDaikon00/NearL010185N CedarPhyDaikon00/FarL010185N CedarPhyData MINOS26 > find . -type d -exec du -sm {} \; 844299 . 
519535 ./CedarPhyBhcurvData/NearDetector 143030 ./CedarPhyDaikon00 1 ./CedarPhyDaikon00/NearL010185N 26516 ./CedarPhyDaikon00/FarL010185N 181734 ./CedarPhyData Typical files CedarPhyBhcurvData/NearDetector n00008451_0001_spill_D04_cedarphybhcurvMRE.reroot.root CedarPhyDaikon00 n13011001_0001_L010185N_D00_sntp_D03_cedarphyMRE.reroot.root Josh will rename these like n13011001_0001_L010185ND00sntp_D03_cedarphyMRE.reroot.root ./pnfsdirs near MCIN daikon_03 L010185ND00sntp_cedarphyMRE write CedarPhyDaikon00/FarL010185N reroot_f21011005_0000.root CedarPhyData n00009000_0000_spill_D03_cedarphyMRE.reroot.root File name forms, reroot* is NG, ignore for now The major item them is the 520 GB in CedarPhyBhcurvData/NearDetector ls CedarPhyBhcurvData/NearDetector | grep -v '^n........_...._spill_D04_cedarphybhcurvMRE.reroot.root$' MINOS26 > ls CedarPhyBhcurvData/NearDetector | wc -l 1519 MINOS26 > ls CedarPhyBhcurvData/NearDetector | grep '^n........_...._spill_D04_cedarphybhcurvMRE.reroot.root$' | wc -l 1519 CPD were last imported around 2007 11 06, at that time just like those in CedarPhyData. Remember that pnfsdirs supports an MCIN release just for this stuff Previously used /home/mindata/STAGE/boehm/mcin now this is /minos/data/mcimport/boehm/mcin Files had been written to /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE MCIF=/minos/data/analysis/nue/MRERerootFiles MDIR=... ( find ${MCIF}/${MDIR} -type f -name \*reroot.root -exec du -sm {} \; | cut -f 1 ) > /minos/scratch/kreymer/MCIN.gpl printf 'plot "/minos/scratch/kreymer/MCIN.gpl"\n' | gnuplot -persist MDIR=CedarPhyData mostly 80 to 250, some around 1 MDIR=CedarPhyBhcurvData/NearDetector mostly 250 to 700 MB MDIR=CedarPhyDaikon00 tight, 200 to 250 MB MDIR=CedarPhyDaikon00/FarL010185N file names are reroot_*.root tight around 160 MB for 1/4, then tight around 20-25 MB for 3/4 of files Let's get the CPD stuff going, should just need to move files to mcin, as similar files were previously imported JOSH=/minos/data/analysis/nue/MRERerootFiles MCIN=/minos/data/mcimport/boehm/mcin BOEH=/minos/data/mcimport/boehm cd /minos/data/analysis/nue/MRERerootFiles mv CedarPhyData/n00009000* ${MCIN}/ ########## # SAMSUB # ########## Updating to provide list of subrun, not a count, for use in roundup. ln -sf samsub samsub.20080408 cp -a samsub samsub.20081030 ######## # FCOE # ######## converged network adapters (CNAs) from Emulex and QLogic ? ============================================================================= 2008 10 29 ============================================================================= ########## # BUEARC # ########## Spoke to Andy Romero, they are making plans, will contact us in a few days regarding specific actions for deployment of new /minos/data disk. ########### # SERVERS # ########### MRTG shows network activity since last Friday 24 Oct. No logins yet, but : MINOS26 > host minos-mysql2 minos-mysql2.fnal.gov has address 131.225.193.32 MINOS26 > host minos-mysql3 minos-mysql3.fnal.gov has address 131.225.193.34 MINOS26 > host minos-sam04 minos-sam04.fnal.gov has address 131.225.193.35 MINOS26 > host minos27 minos27.fnal.gov has address 131.225.193.31 ########## # DCACHE # ########## Date: Wed, 29 Oct 2008 15:05:17 -0500 From: David Saranen To: Arthur Kreymer Subject: file transfers from mine http://fndca3a.fnal.gov/cgi-bin/dcache_files.py doesn't show any data. Is this related to stk problems earlier this week? 
-Dave Date: Wed, 29 Oct 2008 15:17:35 -0500 (CDT) Subject: HelpDesk ticket 123960 ___________________________________________ Short Description: FNDCA recent FTP web page listing is empty. Problem Description: The web page listing recent FNDCA FTP transfers contains only the header line and Legal Notices, no tranfers. I am not sure how long this has been so. The problem was present today 29 Oct, around 15:10 CDT I know that recent transfers have occurred, the page should not be empty. See : http://fndca3a.fnal.gov/cgi-bin/dcache_files.py ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 29 Oct 2008 16:28:40 -0500 (CDT) Note To Requester: The process which gathers the log files appeared to have been stuck since Monday morning and left a lockfile behind. It is now running and the page lists recent transfers. ___________________________________________ Date: Wed, 29 Oct 2008 17:42:16 -0500 (CDT) Solution: Killed stuck transfers of log files, removed stale lockfile. ftp_gather ran at the next scheduled time. dcache_files.py now generates output. This ticket was resolved by MESSER, TIM of the CD-SF/DMS/DSC/SSA group. ######## # DATA # ######## dcache/datasets - add capacity calculation to the summary Get from Pool Usage, http://fndca3a.fnal.gov:2288/usageInfo Applied 1049/1000 scale factory to the usageinfo MBytes numbers. ln -sf datasets.20081029 datasets # was datasets.20070703 For daq pools, dcache/datasets.20081029 q ... Size = 3188 Capacity = 3190 Historically, MIN > grep Size */*/current.q.* 2006/09/current.q.20060918:Size = 1610 2006/09/current.q.20060920:Size = 1610 2006/09/current.q.20060925:Size = 3265 2006/10/current.q.20061023:Size = 3397 2007/02/current.q.20070226:Size = 3714 2007/02/current.q.20070228:Size = 3714 2007/03/current.q.20070302:Size = 3714 2007/03/current.q.20070319:Size = 3714 2007/04/current.q.20070402:Size = 3714 2007/05/current.q.20070501:Size = 3714 2007/06/current.q.20070609:Size = 5505 2007/07/current.q.20070703:Size = 5671 2007/08/current.q.20070803:Size = 5206 2007/09/current.q.20070910:Size = 5206 2007/10/current.q.20071002:Size = 5631 2007/10/current.q.20071029:Size = 5432 2007/11/current.q.20071105:Size = 5468 2007/12/current.q.20071213:Size = 5782 2008/01/current.q.20080102:Size = 5723 2008/02/current.q.20080204:Size = 5925 2008/03/current.q.20080303:Size = 6100 2008/04/current.q.20080407:Size = 6193 2008/05/current.q.20080513:Size = 6418 2008/06/current.q.20080604:Size = 6508 2008/07/current.q.20080701:Size = 6604 2008/08/current.q.20080804:Size = 3188 2008/09/current.q.20080902:Size = 3188 2008/10/current.q.20081013:Size = 3187 2008/10/current.q.20081029:Size = 3188 In July, there were eight pools Tue Jul 1 06:02:06 CDT 2008 w-stkendca7a-1.files Tue Jul 1 06:06:42 CDT 2008 w-stkendca7a-2.files Tue Jul 1 06:11:20 CDT 2008 w-stkendca8a-1.files Tue Jul 1 06:16:00 CDT 2008 w-stkendca8a-2.files Tue Jul 1 06:13:05 CDT 2008 w-stkendca9a-3.files Tue Jul 1 06:13:14 CDT 2008 w-stkendca10a-3.files Tue Jul 1 06:13:22 CDT 2008 w-stkendca11a-3.files Tue Jul 1 06:13:06 CDT 2008 w-stkendca12a-3.files In August, this dropped to four Mon Aug 4 06:02:46 CDT 2008 w-stkendca9a-3.files Mon Aug 4 06:03:43 CDT 2008 w-stkendca10a-3.files Mon Aug 4 06:03:55 CDT 2008 w-stkendca11a-3.files Mon Aug 4 06:03:48 CDT 2008 w-stkendca12a-3.files The pools group is back to eight now, but only four are active. 
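For the record, the Capacity number now printed by datasets is just this arithmetic - sum the per-pool MBytes totals read off the usageInfo page and apply the 1049/1000 factor. A sketch; the four pool totals below are placeholders, not real readings:
# per-pool total MBytes for the active RawDataWritePools members (placeholders)
POOLMB="760000 760000 760000 760000"
echo ${POOLMB} | tr ' ' '\n' |
  awk '{ s += $1 } END { printf "Capacity = %d\n", s * 1.049 / 1000 }'
With the real usageInfo numbers this should reproduce the Capacity = 3190 line above.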
Date: Wed, 29 Oct 2008 12:27:22 -0500 (CDT) Subject: HelpDesk ticket 123932 ___________________________________________ Short Description: RawDataWritePools decreased in August, needs an increase Problem Description: The RawDataWritePools group is sized to hold all the Minos raw data, about 6 TB. This was true through last July. But in August the pool group capacity decreased to 3 TB, and remains there. It appears that four of the eight pools were removed from the group : w-stkendca7a-1 w-stkendca7a-2 w-stkendca8a-1 w-stkendca8a-2 As of 13 October, these pools seem to have returned to the group. But these pools are not listed today in cellInfo, or in usageInfo. The file listings under http://fndca3a.fnal.gov/dcache/files/ indicate that these four pools are empty. We need an increase to 8 TB to handle the next year or so of data taking. Please review the status of the RawDataWritePools group, and take action consistent with the new round of hardware deployments coming over the next few weeks. ___________________________________________ ######## # DATA # ######## Date: Wed, 29 Oct 2008 15:24:15 +0000 (GMT) Tickled helpdesk ticket 118354 , 2008 07 08 regarding aggressive writes to tape. The current FD volume has been mounted 1153 times. enstore info --vol VO8699 ... 'eod_cookie': '0000_000000000_0001335', 'sum_mounts': 1153, 'sum_rd_access': 65, 'sum_wr_access': 1335, ./volumes vols FVOLS=`./volumes fardet_data BVOLS=` for VOL in ${FVOLS} ; do printf "${VOL} " enstore info --vol ${VOL} | grep library done | grep 9940B | cut -f 1 -d ' ' ` for VOL in ${BVOLS} ; do printf "${VOL} " enstore info --vol ${VOL} | grep sum_mounts done VO2432 'sum_mounts': 2, VO3899 'sum_mounts': 282, VO4298 'sum_mounts': 1758, VO4335 'sum_mounts': 996, VO6876 'sum_mounts': 800, VO8536 'sum_mounts': 133, VO8555 'sum_mounts': 1151, VO8699 'sum_mounts': 1153, VO9488 'sum_mounts': 1048, VO9830 'sum_mounts': 163, VOA187 'sum_mounts': 307, VOB499 'sum_mounts': 194, VOB737 'sum_mounts': 108, VOC268 'sum_mounts': 144, VOC475 'sum_mounts': 1133, VOC513 'sum_mounts': 771, VOC538 'sum_mounts': 13, VOC560 'sum_mounts': 253, ############ # GRELEASE # ############ Added daily release of all gfactory processes, at 05:34, logged to /afs/fnal.gov/files/expwww/numi/html/gfactory/release.txt http://www-numi.fnal.gov/gfactory/release.txt ######## # DATA # ######## Closed helpdesk ticket ############# # CHECKLIST # ############# Ganglia monitoring for Minos Cluster offline yesterday 15:00 - 17:50 Same for all rexganglia2 monitoring MRTG shows dropout, 15:50 - 16:40 ####### # SAM # ####### +Includes corrected in all SAM products by illingwo, closed SAMDEV-25 in jira ============================================================================= 2008 10 28 ============================================================================= ######## # DATA # ######## boehm volunteers about 1 TB of reroot files that can be archived. /minos/data/analysis/nue/MRERerootFiles/ Some are similar to files previously imported. Some may need to be renamed, to fit our conventions for mcimport. ########### # ENSTORE # ########### Date: Tue, 28 Oct 2008 15:54:27 -0500 From: Tim Messer To: stk-users@fnal.gov, "cms-t1@fnal.gov" Cc: Enstore Admin Subject: STK Enstore mover code update Hello, An emergency update of the Enstore code is ready for deployment. SSA and the Enstore developers have determined that this code update is necessary in order to prevent movers from going offline in certain cases when switching from write mode to read mode. 
SSA will begin to deploy the code shortly and will restart the mover processes after the code is copied into place. This will not affect transfers in progress and is anticipated to be transparent to users. ######## # DATA # ######## Removed temporary data copies. MINOS26 > rm /local/scratch26/kreymer/DAQ/*.root ######## # FARM # ######## Requesting minfarm account on the Minos Cluater Scanning existing account in NIS, are they all in AFS home* ? Yes they are, either /afs/fnal.gov or /afs/fnal/, except for strictly local home areas. condor:KERBEROS:4716:3302:condor:/local/stage1/condor:/sbin/nologin sam:KERBEROS:7816:5024:sam users:/home/sam:/bin/bash minoscvs:KERBEROS:7927:5111:E875 Minos:/home/minoscvs:/home/minoscvs/bin/cvsh products:KERBEROS:1342:4525:products account:/local/ups:/bin/sh minsoft:KERBEROS:9979:5111:Minos Software:/home/minsoft:/bin/bash minfarm:KERBEROS:10871:5111:Minos Farm:/home/minfarm:/bin/bash lsfadm:KERBEROS:7628:5443:Admin_Load_Sharing_Facility:/home/room1/lsf/v6_1:/bin/bash samread:KERBEROS:12160:5024:Sequential Access - Run II:/home/samread:/bin/bash mindata:KERBEROS:3648:5111:Minos Data:/home/mindata:/bin/bash Minfarm exists on minos-sam03, that's why it is in the NIS list. Updated the .k5login per fnpcsrv1, removing servers and obsolete users. Sent note to minos_batch. Date: Tue, 28 Oct 2008 16:45:27 -0500 (CDT) Subject: HelpDesk ticket 123886 ___________________________________________ Short Description: Please create minfarm local account on minos01 and minos26 Problem Description: Please create a minfarm local account on minos26. This is for the purpose of building software in /grid/farmiapp. We would prefer to do this on the Minos Cluster, for software uniformity. The account is already in the NIS passwd file, and is enabled on minos_sam03. Please create the /home/minfarm area on minos26, and copy .k5login from minos_sam03. ___________________________________________ Date: Wed, 29 Oct 2008 08:36:49 -0500 (CDT) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 30 Oct 2008 15:58:16 -0500 (CDT) Solution: Request completed. This ticket was resolved by SCOTT, RENNIE of the CD-SF/FEF group. ___________________________________________ The account works, for kreymer ####### # SAM # ####### SAMDEV-25 Per Fermilab security recommendation ( inkmann@fnal.gov ) we need to change all .htaccess files from the use of Options +Includes to Options +IncludesNOEXEC ####### # SAM # ####### Jira categories, at https://fermilab.onjira.com/secure/BrowseProject.jspa need to be clarified. Project Key Project Lead URL D0 Grid Data Production Initiative DZGDPI Adam Lyon No URL D0 SAM Operations DZSAM D0 SAM Shifter No URL SAMGrid development SAMDEV Adam Lyon No URL ######## # ZOOM # ######## Final snapshot from cvsuser@cdfcode, after the move to cdcvs. [cvsuser@cdfcode cvsuser]$ time tar czvf /var/tmp/zoomcvs.tgz . real 1m0.955s user 0m24.220s sys 0m3.280s [cvsuser@cdfcode cvsuser]$ du -sm . 331 . 
[cvsuser@cdfcode cvsuser]$ du -sm * 9 archive 1 ark 2 bin 1 check_access.bak 1 check_access.hold 1 crontab.dat 0 cvsh 1 cvshlog 1 Desktop 1 genser 1 LOG 1 LOG~ 1 log.bak 1 maint 170 repository 149 repository_work 1 rsyncZoom.sh 1 shrc [cvsuser@cdfcode cvsuser]$ du -sm /var/tmp/zoomcvs.tgz 55 /var/tmp/zoomcvs.tgz [cvsuser@cdfcode cvsuser]$ scp -c blowfish /var/tmp/zoomcvs.tgz kreymer@minos26:/minos/data/users/kreymer/zoomcvs.tgz [cvsuser@cdfcode cvsuser]$ crontab -l 55 23 * * * ${HOME}/archive/archive 1> ${HOME}/archive/archive.log 2>&1 Sent mail to garren, rs, suggesting shutdown of the nightly cron job. ============================================================================= 2008 10 27 ============================================================================= ######## # PNFS # ######## Date: Mon, 27 Oct 2008 12:18:17 -0500 (CDT) Subject: HelpDesk ticket 123773 ___________________________________________ Short Description: PNFS not responding Problem Description: Starting soon after 11:57 today, the /pnfs file system does not seem to be responding. ftp file transfers hang up, ls hangs up. ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Mon, 27 Oct 2008 12:26:57 -0500 From: Stanley Hicks To: stk-users@fnal.gov, cms-t1@fnal.gov, Enstore Admin Subject: Problems with stk Users and interested parties, We received notification about a power supply failure and possible disk failure on the raid on two stk servers. The situation is currently being investigated and we will update you with more information as we discover the cause and potential uptime. Currently there are no transfers happening on stken. Sorry for the inconvenience and please stay tuned for further information. Thanks, Stanley ___________________________________________ Date: Mon, 27 Oct 2008 12:42:23 -0500 (CDT) Note To Requester: There was a facilities power problem with the STKEN server rack. It has been corrected and we are working on restoring services. ___________________________________________ Date: Mon, 27 Oct 2008 16:59:03 -0500 From: Tim Messer To: stk-users@fnal.gov, cms-t1@fnal.gov Cc: Enstore Admin Subject: Re: Problems with stk Hi, STK Enstore has been returned to service. The cause of the outage was the accidental power-off of a breaker on the circuit feeding most of the STKEN server rack. Power was restored, and after running consistency checks, the system is now believed to be stable. We apologize for the inconvenience and thank you for your patience. Please let us know if you encounter any further trouble. Thank you. ___________________________________________ Date: Tue, 28 Oct 2008 13:49:52 +0000 (GMT) Services were restored yesterday. This ticket can be closed. ___________________________________________ Date: Tue, 28 Oct 2008 16:32:01 -0500 (CDT) Solution: PNFS was not responding due to loss of power to the STKEN server rack. 
___________________________________________ 17:20 reeneabled crontab for mindata@minos26, and renamed NOCAT to NOCAT.ok on fnpcsrv1 ######## # DATA # ######## ./savedata 2>&1 | tee -a ../maint/daqwrite/savedata.log Mon Oct 27 12:02:04 CDT 2008 OOPS, bad file, F00042086_0000.all.sntp.cedar.0.root OOPS, bad file, F00042089_0006.spill.bcnd.cedar.0.root OOPS, bad file, F00042089_0007.spill.bcnd.cedar.0.root OOPS, bad file, F00042089_0011.spill.bcnd.cedar.0.root OOPS, bad file, F00042089_0015.spill.bcnd.cedar.0.root OOPS, bad file, F00042089_0020.spill.bcnd.cedar.0.root OOPS, bad file, F00042089_0021.spill.bcnd.cedar.0.root COPYING F00042089_0022.mdaq.root Mon Oct 27 12:02:13 CDT 2008 ... interrupted at 12:07, no progress, no tape mount, ... MINOS26 > ./dc_stat F00042089_0021.spill.bcnd.cedar.0.root no response, killed ftplog/NOW.txt 4 Mon Oct 27 11:36:59 CDT 2008 557 7 Mon Oct 27 11:47:06 CDT 2008 557 5 Mon Oct 27 11:57:11 CDT 2008 557 3604 Mon Oct 27 13:07:15 CDT 2008 1 3603 Mon Oct 27 14:17:18 CDT 2008 1 5776 Mon Oct 27 16:03:34 CDT 2008 1 6 Mon Oct 27 16:13:40 CDT 2008 557 6 Mon Oct 27 16:23:46 CDT 2008 557 pnfslog/NOW.txt 2 Mon Oct 27 11:48:51 CDT 2008 4 Mon Oct 27 11:53:55 CDT 2008 2 Mon Oct 27 11:58:57 CDT 2008 10804 Mon Oct 27 15:04:01 CDT 2008 3 Mon Oct 27 15:09:04 CDT 2008 3 Mon Oct 27 15:14:07 CDT 2008 3 Mon Oct 27 15:19:10 CDT 2008 3 Mon Oct 27 15:24:13 CDT 2008 2 Mon Oct 27 15:29:15 CDT 2008 2 Mon Oct 27 15:34:17 CDT 2008 3 Mon Oct 27 15:39:20 CDT 2008 Sent in PNFS helpdesk ticket, above PNFS is back up, rescanned for stale files, they all seem to be on tape now . MINOS26 > ./saddcache --list | grep -v vo MODE list Relocating files in SAM as needed, in prd STARTED Mon Oct 27 17:16:47 2008 324 FILES STARTED Mon Oct 27 17:16:47 2008 FINISHED Mon Oct 27 17:17:00 2008 A typical line is : 304 F00042090_0000.mdaq.root /pnfs/minos/fardet_data/2008-10(vo8699.1189) ============================================================================= 2008 10 25 ============================================================================= ./saddcache --list > ./maint/daqwrite/oct25.pend cd ../maint/daqwrite grep root oct25.pend | cut -c 7- | cut -f 1 -d ' ' | sort > oct25.files wc -l oct25.files 90 oct25.files for FILE in `cat ../maint/daqwrite/oct25.files` ; do ./dc_stat ${FILE} ; done | less None of the files have pools listed in Level2, none are on tape MINOS26 > ./dccptest /fardet_data/2008-10/F00042089_0011.mdaq.root 2,0,0,0.0,0.0 :h=yes;c=1:4e8ae670;l=72334795; 72334795 bytes in 2 seconds (35319.72 KB/sec) -rw-r--r-- 1 kreymer g020 72334795 Oct 25 08:55 /local/scratch26/kreymer/F00042 089_0011.mdaq.root Set up a safety copy on minos26. MINOS26 > cp dccptest dccpdata Increased debug level to 2, to get name of originating pool ./savedata 2>&1 | tee -a ../maint/daqwrite/savedata.log 95 files copied ============================================================================= 2008 10 24 ============================================================================= ####### # WEB # ####### /afs/.fnal.gov/files/expwww/numi/html Focused scan of sam docs MIN > find sam -name .htaccess -exec grep Includes {} \; -print Options +Includes sam/doc/design/samBootstrapRedesign/.htaccess Options +Includes sam/doc/install/.htaccess Options +Includes sam/sam_doc/doc/design/samBootstrapRedesign/.htaccess Options +Includes sam/sam_doc/doc/install/.htaccess Options +Includes sam/sam_doc/www/.htaccess find sam -name .htaccess -exec grep Includes {} \; -exec nedit {} \; added NOEXEC Globel scan find computing/products. 
-follow -name .htaccess -exec grep Includes {} \; -print Too many symlinks under computing/products/prd/MINOS_ROOT/Linux2.4-GCC_3_3/v4-00-08f/v4-00-08f links to itself lrwxr-xr-x 1 5922 5111 9 Feb 14 2006 computing/products/prd/MINOS_ROOT/Linux2.4-GCC_3_3/v4-00-08f/v4-00-08f -> v4-00-08f rm computing/products/prd/MINOS_ROOT/Linux2.4-GCC_3_3/v4-00-08f/v4-00-08f Options +IncludesNOEXEC computing/products/db/sam_config/Symlinks/v4_2_34/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_config/Symlinks/current/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_bootstrap/Symlinks/v4_4_1/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_bootstrap/Symlinks/current/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_web_services/Symlinks/v0_9_8/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_web_services/Symlinks/current/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_web_services/Symlinks/v0_9_9/www/.htaccess Options +Includes computing/products/prd/sam_config/v4_2_28/NULL/www/.htaccess Options +Includes computing/products/prd/sam_config/v4_2_34/NULL/www/.htaccess Options +Includes computing/products/prd/sam_bootstrap/v4_4_1/NULL/www/.htaccess Options +Includes computing/products/prd/sam_web_services/v0_9_8/NULL/www/.htaccess Options +Includes computing/products/prd/sam_web_services/v0_9_9/NULL/www/.htaccess Removed old products, no longer needed MINOS26 > ups list -aK+ sam_config v4_2_28 "sam_config" "v4_2_28" "NULL" "" "" "sam_config" "v4_2_28" "NULL" "minos" "" "sam_config" "v4_2_28" "NULL" "minos_prd" "" "sam_config" "v4_2_28" "NULL" "prd" "" "sam_config" "v4_2_28" "NULL" "dev" "" ups undeclare sam_config v4_2_28 -q dev -f NULL ups undeclare sam_config v4_2_28 -q int -f NULL ups undeclare sam_config v4_2_28 -q prd -f NULL ups undeclare sam_config v4_2_28 -q minos -f NULL ups undeclare sam_config v4_2_28 -q minos_prd -f NULL ups undeclare -Y sam_config v4_2_28 -f NULL ############ # PREDATOR # ############ Several raw data files are not on tape after nearly 1 day . Good, this is the desired behaviour ! Write pools should only go to tape daily. 4 F00042086_0006.mdaq.root 24 13 N00015034_0002.mdaq.root 23 14 N00015034_0005.mdaq.root 24 23 N00015034_0004.mdaq.root 24 31 N00015034_0006.mdaq.root 24 38 F00042086_0002.mdaq.root 23 48 F00042086_0007.mdaq.root 24 55 F00042085_0000.mdaq.root 23 57 N00015033_0000.mdaq.root 23 RUNS="F00042086_0006 N00015034_0002 N00015034_0005 N00015034_0004 N00015034_0006 F00042086_0002 F00042086_0007 F00042085_0000 N00015033_0000" for RUN in ${RUNS} ; do ./dc_stat ${RUN}.mdaq.root ; done The oldest unwritten file is -rw-r--r-- 1 buckley e875 17651382 Oct 23 12:13 F00042085_0000.mdaq.root This is under 1 day old, good ! The latest written file is -rw-r--r-- 1 buckley e875 114201451 Oct 24 07:53 N00015034_0002.mdaq.root Tested again at 15:00, not so good ! No further files have been written to tape. 
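An aside - the stale-file checks here all come down to whether a file has Level 4 (tape) metadata yet. A standalone scan of a raw data month directory could look like this; a sketch only, assuming the usual PNFS ".(use)(N)(file)" layer files are readable from minos26, as they are for dc_stat:
# list raw files in a PNFS month directory with no Level 4 (tape) metadata
PDIR=/pnfs/minos/fardet_data/2008-10
cd ${PDIR}
for FILE in `ls *.mdaq.root` ; do
  L4=`cat ".(use)(4)(${FILE})" 2>/dev/null | head -1`
  [ -z "${L4}" ] && ls -l ${FILE}
done
Anything listed with a modification time more than a day old is a candidate for the next ticket update.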
The files are OK, but not on tape MINOS26 > ./dccptest /fardet_data/2008-10/F00042085_0000.mdaq.root 2,0,0,0.0,0.0 :h=yes;c=1:dab8b986;l=17651382; 17651382 bytes in 1 seconds (17237.68 KB/sec) -rw-r--r-- 1 kreymer g020 17651382 Oct 24 15:02 /local/scratch26/kreymer/F00042085_0000.mdaq.root Made a fresh file list ./saddcache --list OFILES=" F00042086_0022.mdaq.root B081023_160001.mbeam.root B081024_000001.mbeam.root F00042085_0000.mdaq.root F00042086_0002.mdaq.root F00042086_0006.mdaq.root F00042086_0007.mdaq.root F00042086_0008.mdaq.root F00042086_0009.mdaq.root F00042086_0010.mdaq.root F00042086_0011.mdaq.root F00042086_0012.mdaq.root F00042086_0013.mdaq.root F00042086_0014.mdaq.root F00042086_0015.mdaq.root F00042086_0016.mdaq.root F00042086_0017.mdaq.root F00042086_0018.mdaq.root F00042086_0019.mdaq.root F00042086_0020.mdaq.root F00042086_0021.mdaq.root F00042086_0022.mdaq.root F00042086_0023.mdaq.root F00042087_0000.mdaq.root F00042088_0000.mdaq.root F00042089_0000.mdaq.root F00042089_0001.mdaq.root F081023_000010.mdcs.root N00015033_0000.mdaq.root N00015034_0004.mdaq.root N00015034_0005.mdaq.root N00015034_0006.mdaq.root N00015034_0007.mdaq.root N00015034_0008.mdaq.root N00015034_0009.mdaq.root N00015034_0010.mdaq.root N00015034_0011.mdaq.root N00015034_0012.mdaq.root N00015034_0013.mdaq.root N00015034_0014.mdaq.root N00015034_0015.mdaq.root N00015034_0016.mdaq.root N00015034_0017.mdaq.root N00015034_0018.mdaq.root N00015034_0019.mdaq.root N00015034_0020.mdaq.root N00015034_0021.mdaq.root N00015034_0022.mdaq.root N00015034_0023.mdaq.root N00015034_0024.mdaq.root N00015035_0000.mdaq.root N00015036_0000.mdaq.root N00015037_0000.mdaq.root N081023_000002.mdcs.root " Scanned for stale files, at about 15:30 for FILE in ${OFILES} ; do ./dc_stat ${FILE} ; done | less /pnfs/minos/fardet_data/2008-10/F00042085_0000.mdaq.root -rw-r--r-- 1 buckley e875 17651382 Oct 23 12:13 F00042085_0000.mdaq.root /pnfs/minos/fardet_data/2008-10/F00042086_0002.mdaq.root -rw-r--r-- 1 buckley e875 28326005 Oct 23 15:15 F00042086_0002.mdaq.root /pnfs/minos/neardet_data/2008-10/N00015033_0000.mdaq.root -rw-r--r-- 1 buckley e875 13390345 Oct 23 13:50 N00015033_0000.mdaq.root Date: Fri, 24 Oct 2008 15:41:56 -0500 (CDT) Subject: HelpDesk ticket 123698 ___________________________________________ Short Description: Minos raw data files not moving to tape Problem Description: Some Minos raw data files seem to have not moved to tape in the last day. This is based on the absence of PNFS Level 4 metadata. The latest file to be written seems to be /pnfs/minos/neardet_data/2008-10/N00015034_0002.mdaq.root around Oct 24 07:53 I noticed the delay writing to tape this morning, I had hoped that the normal 1 day delay had been restored, as requested. At that time, all the files were under 24 hours old. But now a few files are over 24 hours in the pools, without being on tape. /pnfs/minos/fardet_data/2008-10/F00042085_0000.mdaq.root -rw-r--r-- 1 buckley e875 17651382 Oct 23 12:13 F00042085_0000.mdaq.root /pnfs/minos/fardet_data/2008-10/F00042086_0002.mdaq.root -rw-r--r-- 1 buckley e875 28326005 Oct 23 15:15 F00042086_0002.mdaq.root /pnfs/minos/neardet_data/2008-10/N00015033_0000.mdaq.root -rw-r--r-- 1 buckley e875 13390345 Oct 23 13:50 N00015033_0000.mdaq.root There are plenty of free movers; the delay is not due to an Enstore backlog. ___________________________________________ Date: Sat, 25 Oct 2008 15:46:15 +0000 (GMT) From: Arthur Kreymer Subject: Re: HelpDesk ticket 123698 has additional info. 
<-- # @@@ Enter Update below this line. @@@ # --> The files are all in the RawDataWritePools pool group. It is hard to tell which specific pool is involved, because the pool name is not in the Level 2 PNFS data. Perhaps this is a clue to an underlying problem ! I have made a safety copy of all pending files to node minos26, using dccp -d 2 to determine the pools holding the files. The 95 files copied came from pools as follows : 1 stkendca9a 87 stkendca11a 7 stkendca12a It would seem that there is a problem with w-stkendca11a There many files over a day old, one of which is /pnfs/minos/near_dcs_data/2008-10/N081023_000002.mdcs.root <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Date: Mon, 27 Oct 2008 08:45:48 -0500 (CDT) Note To Requester: We are looking into this Art. This ticket is assigned to JONES, TERRY of the CD-SF/DMS/DSC/SSA group. ___________________________________________ Date: Tue, 28 Oct 2008 13:52:28 +0000 (GMT) All of our files in RawDataWritePools have moved to tape. You can close this ticket. Thanks ! ___________________________________________ Date: Tue, 04 Nov 2008 11:11:42 -0600 (CST) This ticket was resolved by JONES, TERRY of the CD-SF/DMS/DSC/SSA group. ######## # DATA # ######## Date: Thu, 23 Oct 2008 17:27:56 -0500 (CDT) From: Kregg E Arms Only the files within the run range 7481 - 7500 were uploaded to FNAL from this "corrupted" set (i.e. only the runs listed below by Nick as "L010185_near_production" & "L010185_rock_pro"), the others were merely test runs not actually used. So, the list derived from yours with this correction is /minos/scratch/arms/remove.badd04.lis Repeated the above , with SAMDIM=" RUN_TYPE physics% and DATA_TIER mc-near and MC.BEAM L010185N and RUN_NUMBER in 7481,7482,7483,7484,7485,7486,7487,7488,7489, 7490,7491,7492,7493,7494,7495,7496,7497,7498,7499,7500 " ~/minos/scripts/samlocate "${SAMDIM}" > reroot.lis File Count: 626 Average File Size: 342.95MB Total File Size: 209.66GB Total Event Count: 500800 for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and RUN_NUMBER in 7481,7482,7483,7484,7485,7486,7487,7488,7489, 7490,7491,7492,7493,7494,7495,7496,7497,7498,7499,7500 " ~/minos/scripts/samlocate "${SAMDIM}" > ${STRM}.lis done "; sam list files --dim="${SAMDIM}" --summaryOnly ; done File Count: 412 Average File Size: 551.04MB Total File Size: 221.71GB Total Event Count: 329600 File Count: 29 Average File Size: 1.34GB Total File Size: 38.73GB Total Event Count: 500800 File Count: 28 Average File Size: 408.75MB Total File Size: 11.18GB Total Event Count: 500800 MINOS26 > wc -l *.lis 412 cand.lis 28 mrnt.lis 626 reroot.lis 29 sntp.lis 1095 total 469 mcout_data Total size is 481 GB. Made a backed up copy of these in AFS MINOS26 > cp -vax . 
~/minos/maint/badd04 Set aside the files minospro@minos26 cd /minos/scratch/kreymer/badd04 for STRM in sntp mrnt cand reroot ; do cat ${STRM}.lis | while read LINE ; do FNAM=`echo ${LINE} | cut -f 1 -d ' '` FPAT=`echo ${LINE} | cut -f 2 -d ' '` echo mv ${FPAT}/${FNAM} ${FPAT}/${FNAM}.removed mv ${FPAT}/${FNAM} ${FPAT}/${FNAM}.removed usleep 100000 done done Tested the copy for STRM in sntp mrnt cand reroot; do cat ${STRM}.lis | while read LINE ; do FNAM=`echo ${LINE} | cut -f 1 -d ' '` FPAT=`echo ${LINE} | cut -f 2 -d ' '` ls ${FPAT}/${FNAM}.removed usleep 100000 done done Removed them from /minos/data for STRM in sntp mrnt ; do # head -1 ${STRM}.lis | while read LINE ; do cat ${STRM}.lis | while read LINE ; do FNAM=`echo ${LINE} | cut -f 1 -d ' '` FPAT=`echo ${LINE} | cut -f 2 -d ' '` FMD=/minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/${STRM}_data NUM=`echo ${FPAT} | cut -f 10 -d /` ls -l ${FMD}/${NUM}/${FNAM} rm ${FMD}/${NUM}/${FNAM} done minfarm@fnpcsrv1 chmod 775 /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/mrnt_data/750 chmod 775 /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/sntp_data/750 Removed the files from SAM ./samundeclare "${SAMDIM}" Found 626 files OOPS, did not undeclare n13037481_0049_L010185N_D04.reroot.root OK, let's try cand instead Worked OK. Same for sntp and mrnt Now the reroots can be removed. OK! Of course, they are now no longer parents. 2008 10 30 removing the .removed files, for real. cd /minos/scratch/kreymer/badd04 for STRM in sntp mrnt cand reroot; do cat ${STRM}.lis | while read LINE ; do FNAM=`echo ${LINE} | cut -f 1 -d ' '` FPAT=`echo ${LINE} | cut -f 2 -d ' '` ls -l ${FPAT}/${FNAM}.removed rm -f ${FPAT}/${FNAM}.removed usleep 100000 done done Did this at about 16:45 CDT ============================================================================= 2008 10 23 ============================================================================= ######## # DATA # ######## Purging sam/pnfs/md for bad v18 D04 mcnear files See notes in this log, 2008 09 18 Date: Fri, 12 Sep 2008 16:18:09 +0100 From: Nick West Attachment was copied to /minos/scratch/kreymer/badd04 SAMDIM=" DATA_TIER ${STRM}-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and RUN_NUMBER >= 7450 and RUN_NUMBER <= 7500 " sam list files --dim="${SAMDIM} File Count: 76 Average File Size: 1.28GB Total File Size: 97.56GB Total Event Count: 1260800 All are n1303*, Sorted and uniqued the run list, regardless of configuration, SAMDIM=" DATA_TIER ${STRM}-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and RUN_NUMBER in 7450,7451,7455,7481,7482,7483,7484,7485,7486,7487,7488,7489, 7490,7491,7492,7493,7494,7495,7496,7497,7498,7499,7500,7655 " STRM=sntp File Count: 32 Average File Size: 1.39GB Total File Size: 44.49GB Total Event Count: 575200 STRM=mrnt File Count: 31 Average File Size: 424.03MB Total File Size: 12.84GB Total Event Count: 575200 STRM=cand File Count: 530 Average File Size: 550.95MB Total File Size: 285.16GB Total Event Count: 424000 SAMDIM=" RUN_TYPE physics% and DATA_TIER mc-near and MC.BEAM=L010185N and RUN_NUMBER in 7450,7451,7455,7481,7482,7483,7484,7485,7486,7487,7488,7489, 7490,7491,7492,7493,7494,7495,7496,7497,7498,7499,7500,7655 " File Count: 749 Average File Size: 342.88MB Total File Size: 250.80GB Total Event Count: 599200 MINOS26 > sam list files --dim="${SAMDIM}" --nosummary | cut -f 1 -d _ | sort -u n13037450 n13037451 n13037455 n13037481 n13037482 n13037483 n13037484 
n13037485 n13037486 n13037487 n13037488 n13037489 n13037490 n13037491 n13037492 n13037493 n13037494 n13037495 n13037496 n13037497 n13037498 n13037499 n13037500 n13037655 ~/minos/scripts/samlocate "${SAMDIM}" > reroot.lis for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and RUN_NUMBER in 7450,7451,7455,7481,7482,7483,7484,7485,7486,7487,7488,7489, 7490,7491,7492,7493,7494,7495,7496,7497,7498,7499,7500,7655 " ~/minos/scripts/samlocate "${SAMDIM}" > ${STRM}.lis done MINOS26 > wc -l *.lis 530 cand.lis 31 mrnt.lis 749 reroot.lis 32 sntp.lis 1342 total ########## # PARROT # ########## paloonew - arguments -m - mountfile name e.g. mountfile.grow mountfile.d199d141.grow -p - parrot arguments e.g. "-d remote" -r - parrot release e.g. current current-20081010 2_4_3 -s - script to run e.g. /grid/fermiapp/minos/parrot/loonar "/grid/fermiapp/minos/parrot/loonar -r R1.24.2" loonar - arguments -r - loon release e.g. R1.24.2 S08-08-28-R1-30 -s - script to run e.g. firstlast.C This is working well, making this the new paloon: PW=/afs/fnal.gov/files/expwww/numi/html/computing/parrot /grid/fermiapp/minos/parrot mv paloon paloon.20081013 ; cp paloonew paloon cp paloon ${PW}/paloon cp loonar ${PW}/loonar ./paloon -r current -s "/grid/fermiapp/minos/parrot/loonar -r S08-08-28-R1-30" ./paloon -m mountfile.d199d141.grow ./paloon -p "-d remote" ########## # PARROT # ########## Bootstrap process to run reco : WP=http://www-numi.fnal.gov/computing/parrot wget ${WP}/paloon wget ${WP}/mountfile.grow wget ${WP}/loonar wget ${WP}/firstlast.C wget ${WP}/reco_near_spill_cedar.C wget ${WP}/N00009870_0002.mdaq.root wget ${WP}/N00009870_0002.log chmod 755 loonar paloon Modify paloon to set the path to parrot, and the default verson of parrot for your site. Copy mountfile.grow to the parrot home directory Run a firstlast.C event count test of the file ./paloon -s "./loonar -f N00009870_0002.mdaq.root" Reconstruct the file { time ./paloon -s "./loonar -f N00009870_0002.mdaq.root -s reco_near_spill_cedar.C" ; } 2>&1 | tee N00009870_0002.log2 Look for reco .root files : -rw-r--r-- 1 kreymer e875 3186229 Oct 23 11:33 ntupleStS.root -rw-r--r-- 1 kreymer e875 17805680 Oct 23 11:33 CandS.root Compare N00009870_0002.log2 to N00009870_0002.log ============================================================================= 2008 10 22 ============================================================================= ######## # FARM # ######## > the minos mysql instance is taking 729% of cpu on fnpcsrv1 > currently, with only 250 production jobs running. Isn't that > a bit more than usual? Could someone please have a look? According to fnpcsrv1 ganglia plots, this happened in coincidence with sustained network rates of 15 MBytes/sec out, 5 MBytes/sec in, The lastest overloads were at : 16:20 to 16:40 16:50 to 17:05 Earlier episodes started after 12:30. Condorview for fcdfosg3 shows minospro usage peaks, from roughly 17:40 to 20:55 UTC, then from 21:20 through the current time. 12:40 to 15:55 CDT 16:20 CDT This web site is http://fcdfcm3.fnal.gov/UserDay.html So I think this overload is consistent with our recent farm startups. I think we need to ramp up more gradually, to avoid these overloads. 
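A gradual ramp-up could be scripted; below is a minimal hedged sketch, to run on fnpcsrv1.
submit_batch.sh and the 300% mysqld CPU threshold are assumptions for illustration,
not the actual farm submission machinery.

  for BATCH in 1 2 3 4 5 ; do
      ./submit_batch.sh ${BATCH}        # hypothetical helper, ~50 jobs per batch
      while true ; do                   # pause until mysqld load settles
          CPU=`ps -C mysqld -o pcpu= | head -1 | cut -f 1 -d . | tr -d ' '`
          [ "${CPU:-0}" -lt 300 ] && break
          sleep 600
      done
  done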
############ # BLUWATCH # ############ Restarted bluwatch on minos25, down since yesterday's reboot ########### # BLUEARC # ########### /minos/data service slowdown seen by all clients, [TXT] fnpcsrv1.txt 22-Oct-2008 13:37 87 [TXT] minos-sam03.txt 22-Oct-2008 13:38 87 [TXT] minos01.txt 22-Oct-2008 13:38 87 [TXT] minos26.txt 22-Oct-2008 13:38 87 This produced Howie's doubling or missing content in condor job files in /minos/data/minfarm But this doubling disappeared on later inspection of the same files. Sounds to me like a local cache defect, induced by these delays. The roundup concatenator script was not running at the time. log files minos01.txt Tue Oct 21 09:01:47 CDT 2008 SLO N00011513_0000.spill.sntp.cedar_phy_bhcurv.0.root 176 Wed Oct 22 13:04:30 CDT 2008 SLO N00011896_0000.spill.sntp.cedar_phy_bhcurv.0.root 168 Wed Oct 22 13:38:04 CDT 2008 SLO N00011995_0000.spill.sntp.cedar_phy_bhcurv.0.root 210 minos26.txt Tue Oct 21 09:01:50 CDT 2008 SLO N00009689_0000.spill.sntp.cedar_phy_bhcurv.0.root 161 Tue Oct 21 15:00:42 CDT 2008 SLO N00011411_0020.spill.sntp.cedar_phy_bhcurv.0.root 36 Wed Oct 22 13:04:33 CDT 2008 SLO N00010265_0019.spill.sntp.cedar_phy_bhcurv.0.root 210 Wed Oct 22 13:38:08 CDT 2008 SLO N00010350_0000.spill.sntp.cedar_phy_bhcurv.0.root 238 Ganglia on minos26 shows low network activity consistent with this. Ganglia on fnpcsrv1 shows a load average around 40, since about 12:30 mysqld shows cpu usage around 700% ######## # GRID # ######## Cleaned up stray files in /grid/home/minos empty directory 0 empty file foo SRV1> stat /grid/home/minos/0 File: `/grid/home/minos/0' Size: 2048 Blocks: 64 IO Block: 32768 directory Device: 1dh/29d Inode: 148324388 Links: 2 Access: (0755/drwxr-xr-x) Uid: ( 7927/ minos) Gid: ( 5111/ numi) Access: 2008-10-22 08:10:38.218000000 -0500 Modify: 2005-08-12 11:51:57.000000000 -0500 Change: 2006-09-19 13:49:04.035000000 -0500 SRV1> stat /grid/home/minos/foo File: `/grid/home/minos/foo' Size: 0 Blocks: 0 IO Block: 32768 regular empty file Device: 1dh/29d Inode: 1869770220 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 7927/ minos) Gid: ( 5111/ numi) Access: 2008-01-22 13:43:38.288000000 -0600 Modify: 2008-01-22 13:43:38.288000000 -0600 Change: 2008-01-22 13:43:38.289000000 -0600 ######## # GRID # ######## find /grid/home/minos -maxdepth 1 -type d -name gram_scratch_\* -mtime +60 | wc -l 1641 Date: Wed, 22 Oct 2008 09:04:23 -0500 (CDT) Subject: HelpDesk ticket 123533 ___________________________________________ Short Description: /grid/home/minos old gram_scratch directories Problem Description: The /grid/home/minos area is getting pretty large, nearly 1 TByte. 725M /grid/home/minos There are 1641 directories dating from July , all but one from July 31 : For comparison, there are many fewer current files : SRV1> ls -l -t /grid/home/minos | grep gram_scratch | grep Oct | wc -l 297 SRV1> ls -l -t /grid/home/minos | grep gram_scratch | grep Jul | wc -l 1641 SRV1> ls -l -t /grid/home/minos | grep gram_scratch | grep 'Jul 31' | wc -l 1640 These July files should probably be removed. ___________________________________________ Date: Wed, 22 Oct 2008 14:22:40 +0000 (GMT) From: Arthur Kreymer Correction, that is almost a GBYte, not a TByte, used. Still, it is probably good to clear up the old directories. ___________________________________________ Date: Wed, 22 Oct 2008 09:36:27 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: This has now been done. Most of those directories were blank anyway. 
It exposed a minor bug in our gram_scratch clearing script, we will get that fixed. Steve ___________________________________________ Date: Mon, 27 Oct 2008 12:35:31 -0500 (CDT) Subject: Help Desk Ticket 123533 Has Been Resolved. Solution: These directories were removed as requested Steve timm ============================================================================= 2008 10 21 ============================================================================= ########## # CONDOR # ########## Date: Tue, 21 Oct 2008 15:30:23 -0500 (CDT) Subject: HelpDesk ticket 123517 ___________________________________________ Short Description: Minos25 stuck writing to /minos/data, Condor is hung Problem Description: Starting around 13:30 today, processes writing to /minos/data seem to be stuck on node minos25 . Please have a look to see whether there is any obvious cause for this. ( System level file descriptors, bad user processes, etc. ? ) I see nothing in /var/log/messages since : Oct 21 08:59:38 minos25 kernel: lockd: server 131.225.111.115 not responding, still trying Oct 21 09:00:58 minos25 last message repeated 5 times Oct 21 09:01:08 minos25 last message repeated 6 times Oct 21 09:01:46 minos25 kernel: lockd: server 131.225.111.115 OK If this cannot be corrected gently , we may need to reboot the system. ___________________________________________ Date: Tue, 21 Oct 2008 15:35:22 -0500 (CDT) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 21 Oct 2008 15:58:05 -0500 (CDT) Note To Requester: sether@fnal.gov sent this Notes To Requester: There are a rather large number of defunct processes on the system from around 1pm - 2pm, which I assume is the result of some issues with the bluarc server. Everything appears okay now in terms of access, but it looks like the system never really recovered. A reboot is probably the best choice. ___________________________________________ Date: Tue, 21 Oct 2008 21:16:08 +0000 (GMT) From: Arthur Kreymer Please reboot minos25 as soon as possible, to clear this condition. ___________________________________________ Date: Tue, 21 Oct 2008 21:25:27 +0000 (GMT) From: Arthur Kreymer I wonder whether it is worth trying a forced dismount, then remount of /minos/data, which is less drastic than a full reboot ? I suspect that any process trying to dismount /minos/data is likely to get stuck itself, so please do not waste too much time trying this. ___________________________________________ Date: Tue, 21 Oct 2008 16:55:43 -0500 (CDT) Solution: The machine has been rebooted. Things look okay in general, but let us know if there are any more problems. ___________________________________________ MINOS25 > uptime 16:55:43 up 4 min, 2 users, load average: 0.29, 0.37, 0.17 MINOS25 > date Tue Oct 21 16:55:48 CDT 2008 ___________________________________________ Date: Tue, 21 Oct 2008 22:11:23 +0000 (GMT) From: Arthur Kreymer Thanks for the reboot ! Most of the running jobs seem to be accounted for. New jobs have started, both local and through glideinWMS. ___________________________________________ Date: Tue, 21 Oct 2008 22:14:47 +0000 (GMT) From: Arthur Kreymer Today between about 13:30 and 17:00 CDT, the Minos25 system bogged down with stuck writes to /minos/data. This pretty much stopped the Condor system from running. minos25 was rebooted around 16:50. Condor jobs seem to have resumed running. New jobs have started, both on the Cluster and through glideinWMS. 
The system seem to have kept track of most of the existing running jobs. ######## # DATA # ######## Planning to remove all the R* ntuples, we should really only need the cedar* releases on disk. Per minos batch meeting today. cd /minos/data/reco_near du -sm R* 17 R1_18 270547 R1_18_2 8813 R1_18_3 142294 R1_18_4 22221 R1_21 10001 R1_23 9336 R1_23a 9958 R1_24 10089 R1_24a 11734 R1_24b 41283 R1_24c 24122 R1_24cal find R1* -atime -365 -type f | wc -l 583 for FILE in ${FILES} ; do dirname ${FILE} ; done | sort -u R1_18_4/sntp_data/2006-09 R1_18_4/sntp_data/2006-10 R1_18/sntp_data/2005-04 There is only one file in R1_18 R1_18/sntp_data/2005-04/N00007184_0000.spill.sntp.R1_18.0.root find R1* -atime -304 -type f R1_18/sntp_data/2005-04/N00007184_0000.spill.sntp.R1_18.0.root So most of these were last accessed 305 days ago. Removing them all, mkdir ZAP mv R1* ZAP/ du -sm ZAP 560411 ZAP date ; time rm -r ZAP Tue Oct 21 14:09:44 CDT 2008 real 59m54.901s user 0m0.042s sys 0m1.024s ######### # MDSUM # ######### mdsum_log wrapped around, is running twice. Killed the younger one 26132 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 28734 ? D 0:00 \_ du -sm users/scavan 9283 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 21228 ? S 0:29 \_ du -sm users 26132 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 4221 ? S 0:13 \_ du -sm users 9283 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 20120 ? S 0:01 \_ du -sm analysis/nue Tokens will have expired, kill em both MINOS26 > kill 26132 MINOS26 > kill 9283 Updated mdsum_log to check for existing process running. ############ # PREDATOR # ############ Why is predator linked to NORECO ? lrwxr-xr-x 1 kreymer kreymer 24 Dec 12 2006 predator -> predator.20061209.NORECO* -rwxr-xr-x 1 kreymer kreymer 4638 Dec 9 2006 predator.20061209* -rwxr-xr-x 1 kreymer kreymer 4803 Oct 20 15:00 predator.20061209.NORECO* I hacked the NORECO file inadvertantly, presumably it dates 2006 12 12 MIN > diff predator.20061209.NORECO predator.20061209 69,73d68 < # 2008 10 20 < # work around loon suppression on minos25/26, temporarily < < PATH=${PATH/#\/afs\/fnal.gov\/files\/code\/e875\/general\/minos25_bin:/} < 120,121c115 < if false ; then < #if [ ${HOUR} = "23" -o -n "${FORCE}" ] ; then --- > if [ ${HOUR} = "23" -o -n "${FORCE}" ] ; then OK, this is the version that leaves saddreco up to the concatenator. Cutting a new version with this content, removing the reco code altogether, as predator.20081020 Renaming predator.20061209.NORECO to predator.20061212 cp -a predator.20061209.NORECO predator.20081020 nedit predator.20081020 ln -sf predator.20081020 predator mv predator.20061209.NORECO predator.20061212 ############ # PREDATOR # ############ genpy/sadd are clean today, after hacks to correct the path to loon. saddcache timed out : /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/predator: line 144: ./saddcache: Connection timed out MINOS26 > ./saddcache --list STARTED Tue Oct 21 10:50:14 2008 looks OK to me, finds 5 files to add. ########## # PARROT # ########## recopa - support single file name specification recopa "" FILE Added printout of loon command Updated in PG and PW Generated logs for two sample files in PG, PW : MINOS24 > { date ; time ${PG}/recopa "" N00009761_0010.mdaq.root ; } 2>&1 | tee N00009761_0010.log Tue Oct 21 10:39:48 CDT 2008 ... 
real 4m9.124s user 3m40.282s sys 0m6.296s MINOS24 > { date ; time ${PG}/recopa "" N00009870_0002.mdaq.root ; } 2>&1 | tee N00009870_0002.log Tue Oct 21 10:45:36 CDT 2008 ... real 15m29.448s user 15m5.123s sys 0m14.673s ============================================================================= 2008 10 20 ============================================================================= ########## # PARROT # ########## Final local test of a large file, about 10' reco time ssh minos24 cd /local/scratch24/kreymer PW=/afs/fnal.gov/files/expwww/numi/html/computing/parrot export PRO=/local/scratch24/kreymer cp ${PG}/reco_near_spill_cedar.C ${PRO} cp ${PG}/N00009870_0002.mdaq.root ${PRO} MINOS24 > date ; time ${PG}/recopa Mon Oct 20 17:58:05 CDT 2008 real 17m58.747s user 15m42.702s sys 0m14.986s ########## # PARROT # ########## minosadmin Update from rbpatter, re Grid items Where to run dbserver tests ( sam02/3 prob'ly ) How to get to CDf nodes ( See Steve and Igor ) How to cleanup CONDOR_TMP - perhaps sudo script, where ? ############ # PREDATOR # ############ Starting Sat AM, near/fardcs and beam genpy failing, like B081017_080002.mbeam.root Sat Oct 18 10:11:57 UTC 2008 /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/run_dbu: line 70: [: too many arguments /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/run_dbu: line 92: [: too many arguments ERROR: List of process IDs must follow -p. This was due to the setup_minos fix stopping users from running loon on minos26. Hacked PATH in predator, to remove the killer path. MINOS26 > ./predator 2008-10 This found nothing to do for dcs/beam, we must have damaged .py files around. cd /local/scratch26/kreymer/genpy/near_dcs_data/2008-10 Some files are like : MINOS26 > cat N081017_000001.sam.py from SamFile.SamDataFile import SamDataFile from SamFile.SamDataFile import ApplicationFamily from SamFile.SamDataFile import CRC from SamFile.SamDataFile import SamTime from SamFile.SamDataFile import RunDescriptorList from SamFile.SamDataFile import SamSize import SAM metadata = SamDataFile( fileName = 'N081017_000001.mdcs.root', fileType = 'physicsGeneric', fileContentStatus = SAM.DataFileContentStatus_Good, fileFormat = SAM.DataFileFormat_ROOT, fileSize = SamSize('479932B'), crc = CRC(1106728658L,SAM.CRC_Adler32Type), group = 'minos', applicationFamily = ApplicationFamily('online','rotorooter',''), dataTier = 'dcs-near', datastream = 'alldata', startTime = SamTime ( '(UTC)' , '%Y-%m-%d %H:%M:%S(UTC)' ), endTime = SamTime ( '(UTC)' , '%Y-%m-%d %H:%M:%S(UTC)' ), eventCount = , firstEvent = 0, lastEvent = ) #/pnfs/minos/near_dcs_data/2008-10(vo8508.1304) Automatic scan for this problem, FILES=`ls` for FILE in ${FILES} ; do grep '= ,' ${FILE} && ls -l ${FILE} ; done eventCount = , -rw-r--r-- 1 kreymer g020 1032 Oct 18 05:12 N081017_000001.sam.py eventCount = , -rw-r--r-- 1 kreymer g020 1031 Oct 18 05:12 N081017_235956.sam.py eventCount = , -rw-r--r-- 1 kreymer g020 1032 Oct 19 05:13 N081018_000003.sam.py eventCount = , -rw-r--r-- 1 kreymer g020 1032 Oct 20 05:13 N081019_000003.sam.py Removed the bad files : for FILE in ${FILES} ; do grep '= ,' ${FILE} && rm ${FILE} ; done Repeated for fardcs, beam cd /local/scratch26/kreymer/genpy/far_dcs_data/2008-10 -rw-r--r-- 1 kreymer g020 1031 Oct 18 05:13 F081017_000010.sam.py -rw-r--r-- 1 kreymer g020 1031 Oct 20 05:13 F081018_000008.sam.py -rw-r--r-- 1 kreymer g020 1031 Oct 20 05:13 F081019_000011.sam.py cd /local/scratch26/kreymer/genpy/beam_data/2008-10 -rw-r--r-- 1 kreymer g020 1027 Oct 18 05:12 
B081017_080002.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 18 05:12 B081017_160001.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 18 05:12 B081018_000001.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 19 05:12 B081018_080001.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 19 05:12 B081018_160001.sam.py -rw-r--r-- 1 kreymer g020 1026 Oct 19 05:13 B081019_000002.sam.py -rw-r--r-- 1 kreymer g020 1026 Oct 20 05:12 B081019_080001.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 20 05:12 B081019_160001.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 20 05:12 B081020_000001.sam.py ./predator 2008-10 Failed again, removed .py files again. Needed to update genpy cp -a genpy.20080915 genpy.20081020 nedit genpy.2008102 PATH=${PATH/#\/afs\/fnal.gov\/files\/code\/e875\/general\/minos25_bin:/} ln -sf genpy.20081020 genpy # was genpy.20080915 ./predator 2008-10 Successful ########## # PARROT # ########## Updated directory for latest snapshot ls -l /afs/fnal.gov/files/expwww/numi/html/computing/parrot releases -> .../d120 MD=/afs/fnal.gov/files/data/minos PG=/grid/fermiapp/minos/parrot cd ${PG} mkdir ${MD}/d120/GROWFSDIR mv ${MD}/d120/GROW ${MD}/d120/GROWFSDIR/20080814 mkdir ${MD}/d120/GROWFSDIR/20080829 cp -a ${MD}/d120/.grow* ${MD}/d120/GROWFSDIR/20080829 du -sm ${MD}/d120/GROWFSDIR/*/.growfsdir 30 /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080814/.growfsdir 203 /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080829/.growfsdir Mon Oct 20 13:20:58 CDT 2008 time ./make_growfs.auto -k ${MD}/d120 ; date real 26m20.190s user 3m34.661s sys 10m35.154s Mon Oct 20 13:47:30 CDT 2008 30 /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080814/.growfsdir 203 /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080829/.growfsdir 119 /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20081020/.growfsdir $ grep '^L ' /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080814/.growfsdir | wc -l 16425 $ grep '^L ' /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080829/.growfsdir | wc -l 12 $ grep '^L ' /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20081020/.growfsdir | wc -l 7979 mkdir ${MD}/d120/GROWFSDIR/20081020 cp -a ${MD}/d120/.grow* ${MD}/d120/GROWFSDIR/20081020 Test in fnpc185 PG=/grid/fermiapp/minos/parrot mkdir /local/stage1/kreymer cd /local/stage1/kreymer Before the new index time ${PG}/paloonew "" "" ${PG}/recopa mountfile.grow real 8m2.860s user 3m3.453s sys 0m58.444s -rw-r--r-- 1 kreymer numi 1533038 Oct 20 12:12 CandS.root -rw-r--r-- 1 kreymer numi 233549 Oct 20 12:12 ntupleStS.root After the new index real 4m33.607s user 2m56.538s sys 0m57.384s -rw-r--r-- 1 kreymer numi 1533038 Oct 20 14:17 CandS.root -rw-r--r-- 1 kreymer numi 233549 Oct 20 14:17 ntupleStS.root real 4m15.273s user 2m53.797s sys 0m55.029s ######### # BATCH # ######### per asousa, for numiwrk/Batch we pages, pts membership wadmnumi:numiweb pts membership wadmnumi:numiweb | grep masaki pts adduser -user masaki -group wadmnumi:numiweb ============================================================================= 2008 10 16 ============================================================================= ########## # PARROT # ########## mindata@minos26 Repeat test of new d141(ups) d199(minsoft) copies, with make_growfs.auto MD=/afs/fnal.gov/files/data/minos PD=/minos/scratch/parrot PG=/grid/fermiapp/minos/parrot cd ${PD} time ./make_growfs.auto -k ${MD}/d199 make_growfs: 2557394 files, 7810 links, 117623 dirs, 0 checksums computed real 25m59.916s user 2m55.113s sys 11m24.894s 116 .growfsdir time ./make_growfs.auto -k ${MD}/d141 make_growfs: 1088642 files, 5588 links, 150368 dirs, 0 
checksums computed real 11m0.743s user 1m27.281s sys 2m36.532s 47 .growfsdir Suggestion, progress messages could be shorter, by omitting the common path to the area beng indexed : make_growfs: following link /afs/fnal.gov/files/data/minos/d199/releases/.. could be make_growfs: following link releases/.. There are some differences in these indexes, e.g. in d199 $ diff .growfsdir oldparrot/20080912/.growfsdir | less < F arch_spec_doc.mk 33188 4217 22456815 0 --- > F arch_spec_doc.mk 33188 4228 -14522322 0 Test this with a full reco job Find an idle node at http://rexganglia2.fnal.gov/farms/?c=GP Farm&m=&r=hour&s=descending&hc=4 fnpc387 PG=/grid/fermiapp/minos/parrot mkdir /local/stage1/kreymer cd /local/stage1/kreymer Regular minimal pallon ${PG}/paloon ${PG}/paloon "" "" ${PG}/paloon ${PG}/paloon "" "" ${PG}/recopa seg faults sending datagram set faults running recopa Try a somewhat older node fnpc238 ${PG}/paloon Still a segfault sending datagram ${PG}/paloonew "" "" ${PG}/recopa Still setfault running loon Try a node formerly used fnpc185 - formerly tested cd $PG ./paloon "" "" ./recopa no segfault for datagram no segfault running recopa mkdir /local/stage1/kreymer cd /local/stage1/kreymer ${PG}/paloon "" "" ${PG}/recopa Spill(100000 in 750 out 99250 filt.) Try a shorter file, N00009761_0010.mdaq.root -bash-3.00$ time ./paloon "" "" ./recopa real 4m11.336s user 3m4.323s sys 0m29.075s cd ${PG} -bash-3.00$ time ${PG}/paloon "" "" ${PG}/recopa real 4m56.638s user 3m10.155s sys 0m52.410s OK, wrote root files. Now try paloonew, old files time ${PG}/paloonew "" "" ${PG}/recopa mountfile.grow real 4m39.656s user 3m7.654s sys 0m42.132s updated mountfile.d199d141.grow, adding MINOS_EXTERNAL, sim, release_data rm -r /local/stage1/kreymer/parrot time ${PG}/paloonew "" "" ${PG}/recopa mountfile.d199d141.grow real 5m51.853s user 3m19.409s sys 0m27.931s ########## # CONDOR # ########## rhatcher hacked setup_minos so that users will have a dummy loon, root, etc in their path The setup ends like using PYTHIA6 (v6_409) for LUND *********************************************************************** * WARNING: do NOT run loon or root on minos25 *********************************************************************** Running loon gets you ******************************************************************* ******************************************************************* ** MINOS25.FNAL.GOV: condor head node ** user should not run executables here ** attempted to run: "loon" ******************************************************************* ******************************************************************* ####### # DAQ # ####### changed buckley to minos-data in email from archiver cp -a archiver_near_daq.config archiver_near_dcs.config.20070531 nedit archiver_near_daq.config cp -a archiver_near_dcs.config archiver_near_dcs.config.20051103 nedit archiver_near_dcs.config cp -a archiver_far_daq.config archiver_far_daq.config.20070531 nedit archiver_far_daq.config cp -a archiver_far_dcs.config archiver_far_dcs.config.20051103 nedit archiver_far_dcs.config cp -a archiver_beam.config archiver_beam.config.20080724 nedit archiver_beam.config Also in minfarm@fnpcsrv1 : /home/minfarm/scripts/check_delivery ######## # GRID # ######## Date: Mon, 06 Oct 2008 12:06:03 -0500 (CDT) Subject: minos-admin HelpDesk ticket 122469 Reminder ___________________________________________________________________ Requester Name: JOSHUA BOEHM Phone: 3316 E-Mail Address: BOEHM@PHYSICS.HARVARD.EDU Incident Time: 10/1/2008 1:59:18 PM 
System Name: Priority: Medium Problem Category: Software Type: Other Item: Other Urgency: Medium Short Description: condor jobs losing permissions Problem Description: I've submitted a large number of jobs through the glide-in system to the minos part of the farm. Oddly a random subset of these jobs appear to be dying with a permission denied error trying to access the condor scripts. Its not universal, new jobs are successfully running, but most are dying. This started around 13:00 cst, I thought perhaps my tokens had expired, but I logged out and in with a new ticket and even new submissions are demonstrating this problem. Things ran perfectly smoothly as best as I can tell for the previous 20 hours or so. Is there an obvious setting I missed that would be causing this? Have I missed a setting? the scripts that are running are located in /minos/scratch/boehm/MREGeneration/SummaryMake And assuming they haven't all died the current batch of jobs showing issues is condor cluster 199262 Thanks, Josh ___________________________________________________________________ Date: Mon, 13 Oct 2008 12:05:42 -0500 (CDT) reminder ___________________________________________________________________ Date: Fri, 17 Oct 2008 19:07:24 +0000 (GMT) I was on vacation Sep 26 through Oct 12. Josh - has this problem cleared up ? I do not see reports of such problems from our current active users. I do not see unusual activity at around that time in the glidein statistics plots. Were your jobs failing on a specific node ? Sometimes a single node with a filesystem problem can consume much more than its share of failing jobs, until the FermiGrid people spot this and take it out of the configuration. ___________________________________________________________________ Date: Fri, 17 Oct 2008 23:38:17 +0000 (GMT) Resolved I hear from rhatcher that your problem was related to the 'setuid' issue, which has since been resolved. I am marking this helpdesk ticket Resolved. ============================================================================= 2008 10 15 ============================================================================= ######## # DATA # ######## requested scan of near cedar sntp cosmics check that they are in pnfs MINOS26 > ./stage -d -p 0 -s cosmic reco_near/cedar/sntp_data/2008-10 most claim to be off disk, not likely. ./dc_stat N00014901_0000.cosmic.sntp.cedar.0.root This claims the file is not in dcache. Not so, quick response from MINOS26 > ./dccptest /reco_near/cedar/sntp_data/2008-10/N00014901_0000.cosmic.sntp.cedar.0.root 2,0,0,0.0,0.0 :c=1:2b854664;h=yes;l=708962619; 708962619 bytes in 13 seconds (53257.41 KB/sec) -rw-r--r-- 1 kreymer g020 708962619 Oct 16 17:39 /local/scratch26/kreymer/N00014901_0000.cosmic.sntp.cedar.0.root So level2 information is stale or wrong. Run anyway, extra dccp -P will be run, that should be OK. 
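For reference, a minimal sketch of such a prestage pass over one month of cosmic sntp
files, assuming the mounted /pnfs paths used elsewhere in this log; dccp -P only issues
a stage request, no data is copied. The real work is done by the ./stage loop that follows.

  DDIR=/pnfs/minos/reco_near/cedar/sntp_data/2008-10
  for FILE in `ls ${DDIR} | grep cosmic.sntp` ; do
      dccp -P ${DDIR}/${FILE}           # prestage only
  done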
CDIRS=`ls for DIR in $CDIRS ; do ./stage -w -s cosmic reco_near/cedar/sntp_data/${DIR} ; done > minos/log/stage/reco_near_cedar_cosmic.log MINOS26 > grep Needed reco_near_cedar_cosmic.log Needed 1/1 Needed 0/632 Needed 0/772 Needed 0/736 Needed 0/740 Needed 0/743 Needed 0/718 Needed 0/704 Needed 418/716 Needed 288/745 Needed 633/754 Needed 412/605 Needed 0/18 Needed 33/704 Needed 0/526 Needed 12/743 Needed 1/576 Needed 0/695 Needed 0/703 Needed 0/742 Needed 0/55 Needed 0/47 Needed 0/43 Needed 0/36 Needed 0/48 Needed 0/44 Needed 0/59 Needed 0/20 Needed 0/5 Needed 3/26 Needed 0/43 Needed 4/71 Needed 0/42 Needed 2/36 Needed 0/49 Needed 7/37 Needed 6/34 Needed 48/48 Needed 48/48 Needed 40/40 Needed 15/15 # DATA # Date: Fri, 10 Oct 2008 09:49:17 -0500 From: George Szmuksta To: Arthur Kreymer Cc: Dcache Admin Subject: Minos dcache pnfs files with no layer information There are 2 files in pnfs dated 10/07/08 that not have any pnfs layer information. These files are not in dcache and should be deleted and retransfered. /pnfs/fs/usr/minos/reco_near/cedar_phy_bhcurv/cand_data/tmp1.27374 /pnfs/fs/usr/minos/reco_near/cedar_phy_bhcurv/cand_data/tmp1.27412 Sorry for the inconvenience. George Szmuksta SSA _________________________________________________________ -rw-rw-r-- 1 mstrait e875 0 Oct 7 12:20 tmp1.27374 -rw-rw-r-- 1 mstrait e875 0 Oct 7 12:20 tmp1.27412 I have removed these. _________________________________________________________ ######## # DATA # ######## Helping to make space, rhatcher is archiving files for pawloski mindata@minos26 mkdir /pnfs/minos/analysis enstore pnfs --tags file family is minos, want analysis enstore pnfs --file_family analysis $ mkdir nue $ cd nue $ enstore pnfs --file_family analysis_nue Date: Thu, 16 Oct 2008 15:33:57 -0500 (CDT) Subject: HelpDesk ticket 123303 ___________________________________________ Short Description: Please assign CD-LTO4G1 to /pnfs/minos/analsys and /pnfs/minos/analysis/nue Problem Description: enstore-admin : We need to archive a few TBytes of minos files, similar to what we did previously in /pnfs/minos/stage. This time we will write under /pnfs/minos/analysis Please assign the CD-LTO4G1 library to /pnfs/minos/analysis /pnfs/minos/analysis/nue And see to it that there are a few tapes available, 10 should be more than enough for now. Thanks ! ___________________________________________ Date: Fri, 31 Oct 2008 16:20:43 -0500 (CDT) Your request has been put into a pending status by the expert working on the problem. Pending Reason: On Hold By Expert ___________________________________________ Date: Mon, 03 Nov 2008 11:04:25 -0600 (CST) Note To Requester: Art, The library tag for the /pnfs/minos/analysis directory has been changed from CD-9940B to CD-LTO4G1. No other tags were changed. We have also increased the quota of LTO4 tapes for Minos by 10. The quota was set to 25 tapes but is now set to 35. Minos currently has 19 tapes in use. The /pnfs/minos/analysis/nue directory is also updated. It automatically inherits the library tag of its parent directory. Please try this out and then let me know if I can close this Remedy request. Ken S -- SSA Group ___________________________________________ Date: Mon, 03 Nov 2008 17:25:43 +0000 (GMT) Thanks for updating the /pnfs/minos/analysis tags, for future writes, and adding the tapes. We completed this round of file copies a couple of weeks ago, hence that data went to 9940B tapes. We will keep an eye on things the next time we write. This ticket can be closed out. 
___________________________________________
Date: Mon, 03 Nov 2008 11:41:15 -0600 (CST)
Solution: Library tag was updated and quota was increased.
This ticket was resolved by SCHUMACHER, KEN of the CD-SF/DMS/DSC/SSA group.

########
# FARM #
########

cedar_phy_bhcurvmcnearcharm.log and helium, errors in samsub,
starting Tue Oct 14 12:11:53 CDT 2008
Traceback (most recent call last):
  File "/home/minfarm/scripts/samsub", line 159, in ?
    SUB = FILE.strip().split('_')[1].split('.')[0]
IndexError: list index out of range
Many pending runs in helium and charm, regardless of this problem.
Disabled linfix ( complete ) and helium/charm in corral ( stuck )

##########
# CONDOR #
##########

At around 06:00, an interactive loon job by bckhouse on minos25
lost its network connection.
Condor processes on the cluster seem to have stopped,
as well as most other schedd activity, including gfactory plots.
Process writing to /minos/scratch succeeded, but hung up after the write.
He killed the lost loon around 11:14; schedd and user processes broke loose at that time.

##########
# RUSTEM #
##########

Per his request, to allow building his code on Minos Cluster
MINOS26 > upd install -j mysql v5_0_22
informational: installed mysql v5_0_22.
upd install succeeded.

############
# PREDATOR #
############

Stuck running dbu on
N00014991_0007.mdaq.root Tue Oct 14 14:06:13 UTC 2008
recovered.
Stuck permanently on
N00014995_0000.mdaq.root Tue Oct 14 22:06:14 UTC 2008
N00014995_0001.mdaq.root Tue Oct 14 22:09:09 UTC 2008
N00014996_0000.mdaq.root Tue Oct 14 22:11:29 UTC 2008
N00014997_0000.mdaq.root Tue Oct 14 22:17:35 UTC 2008
N00014998_0000.mdaq.root Wed Oct 15 00:19:23 UTC 2008
through
N00014998_0015.mdaq.root Wed Oct 15 15:05:26 UTC 2008
Similar for far,
F00042018_0016.mdaq.root Mon Oct 6 22:06:14 UTC 2008
through
F00042058_0018.mdaq.root Wed Oct 15 15:48:00 UTC 2008
And
B081014_080001.mbeam.root Wed Oct 15 11:28:41 UTC 2008
B081014_160001.mbeam.root Wed Oct 15 11:33:27 UTC 2008
B081015_000001.mbeam.root Wed Oct 15 11:35:16 UTC 2008
N081010_145548.mdcs.root Wed Oct 15 11:37:49 UTC 2008
This cleared up when the DCache queues cleared Thursday.

########
# DATA #
########

11:00 dbu has been stuck since last night, see above
Stuck in
MINOS26 > ./dccptest /neardet_data/2008-10/N00014995_0000.mdaq.root
2,0,0,0.0,0.0
:c=1:d0bf66c5;h=yes;l=81922203;
Big mover queues on 10a-3, 11a-3, 12a-3, 9a-3 ( write pools )
Many connections to beam_data from fnpc34* nodes.
But I don't believe this listing; Started/Active times are like
Aug 25 08:34:56 Aug 25 08:40:49
The login plot for door 0 shows a spike to nearly 250 late last night,
down to 120 this morning, down to 56 right now ( 14:00 )
Queues on 10,12a-3 have cleared, 37,30 on 12,9
15:45, queue down to 15
15:55 queue down to 12, only on w-stkendca9a-3
17:24 - queues have cleared

Date: Wed, 15 Oct 2008 11:50:35 -0500 (CDT)
Subject: HelpDesk ticket 123203
___________________________________________
Short Description: FNDCA login list is stale ?
Problem Description:
I am trying to hunt down the source of an overload of STKEN
RawDataWritePools, as indicated by queues in
http://fndca.fnal.gov:2288/queueInfo
I look at the login list, at
http://fndca3a.fnal.gov/dcache/DOORS.html
The time stamp on the listing is current, Wed Oct 15 11:23:02 2008
But almost all connections show Started/Active times like
Aug 25 08:34:56 Aug 25 08:40:49
Is this page stale ?
We need to track down the user who is overloading this pool.
This ticket is assigned to SCHUMACHER, KEN of the CD-SF/DMS/DSC/SSA.
___________________________________________ Date: Wed, 15 Oct 2008 14:20:35 -0500 (CDT) The problem has been turned over to the developers. It is bugzilla ticket number 125. Thank you ___________________________________________ Date: Wed, 15 Oct 2008 22:32:04 +0000 (GMT) The queuing has cleared from the RawDataWritePools, as of around 17:00 CDT. My programs that access this pool group have resumed normal operation. The non-timestamp content of http://fndca3a.fnal.gov/dcache/DOORS.html has not changed between 14:19 and 17:27 today, So it seems the content of this page is indeed stale. ___________________________________________ > Just another data point. > > The content of http://fndca.fnal.gov:2288/queueInfo today > is identical to the content yesterday, > with the exception of the time stamp. ___________________________________________ Date: Fri, 17 Oct 2008 10:55:16 -0500 We are looking at this problem. Thanks, Timur ___________________________________________ Date: Fri, 17 Oct 2008 11:27:45 -0500 The problem was looked at by enstore developers and experts at the storage meeting. The following was determined. These are requests for small files. The requests cause multiple mount / rewind operations and because of these delays the effective transfer rates are low. We will continue looking at how the situation can be improved. Thanks, Timur ___________________________________________ I don't quite understand your reply. The problem is not with the transfers listed on the login page. The problem is that the contents of the page is 2 months old. ___________________________________________ Date: Fri, 17 Oct 2008 11:34:56 -0500 From: Timur Perelmutov It is not old for me, can it be cached in your web browser? ___________________________________________ Date: Fri, 17 Oct 2008 17:55:10 +0000 (GMT) From: Arthur Kreymer The content of the page is different each time I display it so this is not a caching problem : MIN > diff DOORS1017.html DOORS1419.html 7c7 < Thu Oct 16 17:11:02 2008 --- > Wed Oct 15 14:19:01 2008 97c97 < Finished at Thu Oct 16 17:11:05 CDT 2008 --- > Finished at Wed Oct 15 14:19:04 CDT 2008 MIN > The time stamps are changing. It is the login list data that is stale. ___________________________________________ Date: Fri, 17 Oct 2008 14:05:36 -0500 _Then I do not understand what particular information on the page http://fndca.fnal.gov:2288/queueInfo is stale? And why do you think it is stale? __________________________________________ Date: Fri, 17 Oct 2008 19:12:07 +0000 (GMT) _ The contents of the page, aside from the overall page timestamps, has apparently not changed since August 25, the latest Started entry on that page. I am quite sure that there have been new DCache logins since Aug 25, and that some of there are active. At the time that I first spotted this problem, there were at least 30 recent, open, active logins, not reflected in the login list. JobId door Node State Started Last Active UID/PID Role Username Pool [PNFS Id] [Timer] [File Seq] [Client Id] [Client Pid] Kind Status(time-in-state) Command DCap00-stkendca2a-unknow-93225 DCap00-stkendca2a-unknow-93225 fnpc341.fnal.gov active Aug 25 17:41:21 Aug 25 17:54:56 7927/9134 DCap00-stkendca2a-unknow-93225 E875 Minos ? ? ? ? stat minos/beam_data/2007-11/B071121_224612.mbeam.root __________________________________________ The stale page is http://fndca3a.fnal.gov/dcache/DOORS.html __________________________________________ Date: Fri, 17 Oct 2008 14:20:17 -0500 Ok, I was looking at a completely different page. 
__________________________________________ Date: Fri, 17 Oct 2008 14:23:46 -0500 Art, We have watched the resotres changed so that is not stale. We have come across an enstore bug we think. See if this makes sense to you. Minos is currently reading almost all, if not all, of the files off of a certain tape containing thousands of files. dCache requests these files over time in no particular order. enstore should order these and assure that the tape progressively moves forward. What is happening is new requests, some behind the current tape position, are not getting ordered properly such that the tape is seeking back and forth and not reading sequentially. The rate is really slow due to this seeking, so the queues are being drained very slowly, and the restore queues in dCache aren't changing by very much. Development is working on the problem. Gene N.B. - AK - these are reco_far/cedar_phy_bhcurv/.bcnd_data spill files, like 2007-02/F00037654_0001.spill.bcnd.cedar_phy_bhcurv.0.root __________________________________________ Date: Wed, 29 Oct 2008 13:43:09 +0000 (GMT) The http://fndca3a.fnal.gov/dcache/DOORS.html login list continued to show August data earlier this week. But this morning it is up to date, as of Oct 29 08:36:12 CDT 2008 I am curious as to the root cause of the problem. Thanks, this ticket can be closed. __________________________________________ Date: Fri, 31 Oct 2008 14:12:53 -0500 (CDT) Solution: The "stale" report may have been related to a backlog caused by a large number of small files being read from tape and possibly related to a bug in Enstore sequencing of these read from tape requests. This ticket was resolved by SCHUMACHER, KEN of the CD-SF/DMS/DSC/SSA group. __________________________________________ Date: Fri, 31 Oct 2008 14:12:54 -0500 (CDT) Note To Requester: Art, I'm not sure if this is actually the root cause you asked about. But I did find the following information in a dCache developer report from two weeks ago. I will mark this request as resolved. If you encounter further problems, please open a new request. Ken S. -- SSA Group __________________________________________ ============================================================================= 2008 10 14 ============================================================================= ######## # FARM # ######## MINOS26 > ./pnfsdirs far cedar_phy_linfix daikon_00 L010185N write MINOS26 > ./pnfsdirs near cedar_phy_linfix daikon_00 L010185N write SRV1> nedit ~/ROUNTMP/ROOTRELS added cedar_phy_linfix export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.linfix samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.linfix done New applicationFamilyId = 258 Application family already exists: id = 258 New applicationFamilyId = 95 Application family already exists: id = 95 New applicationFamilyId = 368 Application family already exists: id = 368 Note that daikon_00 has 1 subrun per run, concatenation is swift. Picked up the existing 12 runs ( mrnt and sntp ) SRV1> ./roundup -r cedar_phy_linfix mcfar Ran cleanly, declared files to sam ( most already on tape ). 
corral : [ ${BADS} -le 1 ] && ${HOME}/scripts/roundup -c -r cedar_phy_linfix mcfar || (( BADS++ )) #[ ${BADS} -le 1 ] && ${HOME}/scripts/roundup -c -r cedar_phy_linfix mcnear || (( BADS++ )) ####### # WEB # ####### MIN > ln -sf protons.20081014.html protons.html # was protons.20080117.html ######## # DATA # ######## MINOS26 > du -sm /grid/app/minos/* 840 /grid/app/minos/Minossoft du: cannot read directory `/grid/app/minos/VDT/vdt/extract': Permission denied du: cannot read directory `/grid/app/minos/VDT/vdt/backup': Permission denied du: cannot read directory `/grid/app/minos/VDT/vdt/services': Permission denied 288 /grid/app/minos/VDT 1 /grid/app/minos/bin du: cannot read directory `/grid/app/minos/minfarm/Minossoft/EXTERNAL/mysql-5.0.22/sql/share/japanese-sjis': Permission denied 18007 /grid/app/minos/minfarm 1 /grid/app/minos/parrot 848 /grid/app/minos/parrotold 56 /grid/app/minos/sam 5 /grid/app/minos/scripts 1 /grid/app/minos/test 9471 /grid/app/minos/users Rustem is using 9 GB, including recently built ROOT version. Requested his removal of files from /grid/app via email, cc minos_batch ============================================================================= 2008 10 13 ============================================================================= ########### # MONTHLY # ########### DATASETS 10/13 PREDATOR 10/13 VAULT 10/3 MYSQL 10/16 mysql timing, offline copies Mon Oct 13 14:20:26 CDT 2008 Mon Oct 13 15:08:13 CDT 2008 Adjusted HOWTO.dbarchive to use /tmp/*.sql for gzip phase, no more cut/paste. ############ # VACATION # ############ Predator - no neardcs since Mon Oct 6 10:09:15 2008 UTC Checklist - Cluster ganglia shows mostly wait state Oct 07 15:00 through Wed 08 Oct 06:00 and a high load avarage ( over 150 ) High on minos25, low on the rest 1.5 GB data free, needs attention Blue Arc was clean, ############# # MAIL SCAN # ############# Nearly 1000 emails to dig through cdfdev - zoomcvs moved from cdfcvs to cdcvs Fri, 26 Sep 2008 10:13:01 -0500 minosshift - Many messages FAR DAQ web status unable to contact minos-om.fnal.gov Sep 26 Oct 3 Thursday ( with and without web status string ) 12/30/2007 NEAR daqautoclean.sh refused kerberos ticket by minos-om.fnal.gov nas - Reboot RHEA 2 Fri 10/10 affects Windows - cdserver1 numiserver1 farm - down Sep 22 through Oct 5 due to DCache/SRM authentication problems. lusers - parrot on 2.6.9 ? firefox update from 1.5 to 3.0 Thu 2 Oct., SLF 4.5 and older Also affected my desktop minosbatch masaki starrrting web page, needs access to /afs/fnal.gov/files/expwww/numi/html/workgrps doing 1700 runs of mcfar cedar_phy_linfix mnv fermigrid mounts allowed ? x need for VO - nope. minosdata ticket 122261 - fermiapp quota - done HDS disk configuration - plan ? parrot setuid problem discussion - new nodes at 2.6.9-78 kernel ticket 122483 2 hour emergency Enstore downtime Thu Oct 9 10:00 - web outage scheduled 9 Oct 06:00 - 07:00 minosadmin 121790 - fnalu mounts /m/d and /m/s mounted , /grid is not mounted. jyuko - how to setup under condor ? - referred to loont/loonb jdejong - 7 day job cannot write to afs. Yep. 121520 - crl during dns - was a database problem, closed 122270 - rodriges - missing function.h on fnpc339 working OK now. 122469 - condor script access 13:00 cst Wed, 01 Oct 2008 119292 - Thu, 02 Oct 2008 13:09:15 gahp_server upgrade for grid errors we need condor upgrad ( to ? 
) parrot setuid problem - scavan minossim - hgallag keytab problems - resolved, bad ssh client mail list ( new coordinater hennessy ) continue to use minos_sim parrot meeting 1211, grid 2008 - health exam ============================================================================= KREYMER ON VACATION THROUGH 12 OCTOBER ============================================================================= 2008 09 26 ============================================================================= DCache seminar 09:00 WH10NW Write queue timers/limits should act per pool group, not per pool Need wild cards in file family pool associations ( e.g. all Ntuples ) Kerberos doors hang up due to single client access with expired cert Management of ports for doors clients need list of valid ports, or automatic port assignment File leveling and migration for pool additions and retirement Per-file time overheads ( small file management ) ######## # GRID # ######## Date: Fri, 26 Sep 2008 08:42:47 -0500 (CDT) Subject: HelpDesk ticket 122261 ___________________________________________ Short Description: Increase to e875/minos/numi/5111 group quota in /grid/fermiapp Problem Description: The group quota for 5111 a.k.a. e875/minos/numi in /grid/fermiapp is only 30 GB. This is not enough to hold all the files formerly in /home/minfarm etc. We will shift some of these to /minos/data, but we could still use more space in /grid/fermiapp. Please increase this quota to 100 GBytes, at your next convenience. This will expedite our retirement our use of /grid/app. Thanks ! Please reply to minos-data, as user kreymer is leaving on vacation today. ___________________________________________ Date: Mon, 29 Sep 2008 12:12:32 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: We have passed the quota request on to CSI-WST group. It may take a bit, as it turned out that this volume is currently configured with user-by-user quotas and not group-by-group quotas so it could take some time to reconfigure it. Steve Timm ___________________________________________ Date: Mon, 29 Sep 2008 13:00:56 -0500 (CDT) Solution: This request has been completed. ============================================================================= 2008 09 25 ============================================================================= ########## # CONDOR # ########## Date: Thu, 25 Sep 2008 12:22:36 -0500 (CDT) Subject: HelpDesk ticket 122224 ___________________________________________ Short Description: minos25 configuration file needs to be update before the weekend Problem Description: run2-sys : Please update the minos25 local configuration file, /opt/condor-7.0.1/local/condor_config.local to have the content of /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/ condor_config.local.minos25.20080925 We need to have this done today if at all possible. This is to correct a parameter which has been limiting out glideins to Fermigrid to only 100 out of the 350 jobs we should normally get. ___________________________________________ Date: Thu, 25 Sep 2008 12:38:04 -0500 (CDT) This ticket has been reassigned to COOPER, GLENN of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 25 Sep 2008 12:55:23 -0500 (CDT) From: Glenn Cooper I copied the file in. Do I need to restart/reload condor, or will it read the file each time a job is submitted? 
____________________________________________ Date: Thu, 25 Sep 2008 19:31:06 +0000 (GMT) From: Arthur Kreymer I see no change, aside from the modified time, in /opt/condor-7.0.1/local/condor_config.local The file should end like : MINOS26 > tail -6 /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local.minos25.20080925 ########################################################## # Set the number of jobs that can be submitted to glide in, default 100 # setting this to the full gpfarm, set a tighter limit via gfrontend ########################################################## GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=1000 ___________________________________________ Date: Thu, 25 Sep 2008 21:08:05 +0000 (GMT) From: Arthur Kreymer Thanks for updating the file, this went fine. Unfortunately, at or around Sep 18 13:53, /opt/condor-7.0.1/etc/condor_config was updated on all the Minos Cluster nodes, including minos25. Please, immediately, restore the correct content. My copy of the former file is in /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config __________________________________________ Date: Thu, 25 Sep 2008 16:27:29 -0500 (CDT) From: Glenn Cooper Art's version put into cfengine and pulled to minos25. The other nodes should get it over the next few hours. Glenn __________________________________________ 16:29 - condor_reconfig minos25 # by rhatcher 16:32 - restarted condor_gfactory process Date: Thu, 25 Sep 2008 21:45:57 +0000 (GMT) From: Arthur Kreymer Thanks, we have run condor_reconfig minos25 and have restarted the gfactory process. The gfactory processes are registered again, seen by gfrontend. New glidein processes are now running, and user jobs have started again. Thanks !!! __________________________________________ Date: Thu, 25 Sep 2008 16:32:27 -0500 (CDT) From: Glenn Cooper Not sure how the incorrect file got there, epecially with a Sep 18 date. Our subversion logs show no changes to this file since May 20 (until today, of course). I'll investigate further and let you know if I find anything. __________________________________________ MINOS25 > condor_q gfactory | tail -1 94 jobs; 18 idle, 74 running, 2 held MINOS25 > date Thu Sep 25 16:35:09 CDT 2008 MINOS25 > condor_q gfactory | tail -1 ; date 96 jobs; 15 idle, 79 running, 2 held Thu Sep 25 16:35:44 CDT 2008 MINOS25 > condor_q gfactory | tail -1 ; date 116 jobs; 13 idle, 100 running, 3 held Thu Sep 25 16:42:38 CDT 2008 The plots have come alive, at http://www-numi.fnal.gov/gfactory/monitor/glidein_t20_glexec/total/ MINOS25 > condor_q gfactory | tail -1 ; date 169 jobs; 14 idle, 155 running, 0 held Thu Sep 25 17:09:04 CDT 2008 ######### # MYSQL # ######### SOFT03 > ups declare -c mysql v5_0_67 DECLARE: A UPS start/stop exists for this product SOFT03 > ups tailor mysql Enter valid path for mysql data directory: /home/minsoft/database Never use default port number 3306 for any mysql server instances! Assign your port number here:3306 You can update mysql server options in my.cnf file before you start mysql server. Please assign a new username for your mysql daemon. For security it is recommended to substitute this name for mysql root in a mysql database. See README file in your mysql datadir for more details. Do not forget to set a strong password for root user IMMEDIATELY after initial startup of mysql daemon! Then replace root username with the newly assigned username. 
Enter your new username here:root Mysql server with server_id = 1 was already configured on minos-sam03.fnal.gov machine. Would you like to configure next mysql server on minos-sam03.fnal.gov machine (y,n)? n SOFT03 > ups start mysql Setup:mysql datadir = /home/minsoft/database Setup:port=3306; socket=/home/minsoft/database/mysql.sock SOFT03 > WARNING: Found /home/minsoft/database/my.cnf Datadir is deprecated place for my.cnf, please move it to /home/minsoft/ups/prd/mysql/v5_0_67/Linux-2-6 Starting mysqld daemon with databases from /home/minsoft/database SOFT03 > ups rootpass mysql Setup:mysql datadir = /home/minsoft/database Setup:port=3306; socket=/home/minsoft/database/mysql.sock Enter password for root user: Setup root password for root@localhost is O.K. You also need to set this password for root@minos-sam03.fnal.gov when you start mysql client. You can do it using following command in mysql: mysql> SET PASSWORD FOR root@minos-sam03.fnal.gov=PASSWORD('new_password'); See user table in mysql database. ============================================================================= 2008 09 24 ============================================================================= 195912.0 gfactory 9/24 08:41 0+00:00:00 I 0 0.0 glidein_startup.sh 195932.0 gfactory 9/24 10:16 0+00:00:00 I 0 0.0 glidein_startup.sh 195932.1 gfactory 9/24 10:16 0+00:00:00 I 0 0.0 glidein_startup.sh 195932.2 gfactory 9/24 10:16 0+00:00:00 I 0 0.0 glidein_startup.sh 195932.3 gfactory 9/24 10:16 0+00:00:00 I 0 0.0 glidein_startup.sh 195940.0 gfactory 9/24 10:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195944.0 gfactory 9/24 10:59 0+00:00:00 I 0 0.0 glidein_startup.sh 195967.0 gfactory 9/24 13:16 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.0 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.1 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.2 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.3 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.4 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.5 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.6 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.7 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.8 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.9 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 196017.2 gfactory 9/24 16:05 0+00:00:00 I 0 0.0 glidein_startup.sh 196018.0 gfactory 9/24 16:08 0+00:00:00 I 0 0.0 glidein_startup.sh 196020.0 gfactory 9/24 16:11 0+00:00:00 I 0 0.0 glidein_startup.sh 196021.0 gfactory 9/24 16:19 0+00:00:00 I 0 0.0 glidein_startup.sh 196024.0 gfactory 9/24 16:36 0+00:00:00 I 0 0.0 glidein_startup.sh 196025.0 gfactory 9/24 16:39 0+00:00:00 I 0 0.0 glidein_startup.sh 196027.0 gfactory 9/24 16:44 0+00:00:00 I 0 0.0 glidein_startup.sh 196031.0 gfactory 9/24 16:54 0+00:00:00 I 0 0.0 glidein_startup.sh 196032.0 gfactory 9/24 16:56 0+00:00:00 I 0 0.0 glidein_startup.sh 196034.0 gfactory 9/24 17:05 0+00:00:00 I 0 0.0 glidein_startup.sh MINOS25 > for CLU in ${CLUS} ; do printf "${CLU} " ; condor_q -l ${CLU} | grep GlideinEntryName ; done 195912.0 GlideinEntryName = "gpminos" 195932.0 GlideinEntryName = "gpminos" 195932.1 GlideinEntryName = "gpminos" 195932.2 GlideinEntryName = "gpminos" 195932.3 GlideinEntryName = "gpminos" 195940.0 GlideinEntryName = "gpminos" 195944.0 GlideinEntryName = "gpminos" 195967.0 GlideinEntryName = "gpminos" 195974.0 GlideinEntryName = "gpminos" 195974.1 GlideinEntryName = "gpminos" 
195974.2 GlideinEntryName = "gpminos" 195974.3 GlideinEntryName = "gpminos" 195974.4 GlideinEntryName = "gpminos" 195974.5 GlideinEntryName = "gpminos" 195974.6 GlideinEntryName = "gpminos" 195974.7 GlideinEntryName = "gpminos" 195974.8 GlideinEntryName = "gpminos" 195974.9 GlideinEntryName = "gpminos" 196017.2 GlideinEntryName = "gpgeneral" 196018.0 GlideinEntryName = "gpgeneral" 196020.0 GlideinEntryName = "gpgeneral" 196021.0 GlideinEntryName = "gpgeneral" 196024.0 GlideinEntryName = "gpgeneral" 196025.0 GlideinEntryName = "gpgeneral" 196027.0 GlideinEntryName = "gpgeneral" 196031.0 GlideinEntryName = "gpgeneral" 196032.0 GlideinEntryName = "gpgeneral" 196034.0 GlideinEntryName = "gpgeneral" MINOS25 > for CLU in ${CLUS} ; do printf "${CLU} " ; condor_q -l ${CLU} | grep QDate ; done 195912.0 QDate = 1222263702 195932.0 QDate = 1222269414 195932.1 QDate = 1222269414 195932.2 QDate = 1222269414 195932.3 QDate = 1222269414 195940.0 QDate = 1222271191 195944.0 QDate = 1222271941 195967.0 QDate = 1222280188 195974.0 QDate = 1222281985 195974.1 QDate = 1222281985 195974.2 QDate = 1222281985 195974.3 QDate = 1222281985 195974.4 QDate = 1222281985 195974.5 QDate = 1222281985 195974.6 QDate = 1222281985 195974.7 QDate = 1222281985 195974.8 QDate = 1222281985 195974.9 QDate = 1222281985 196017.2 QDate = 1222290308 196018.0 QDate = 1222290494 196020.0 QDate = 1222290683 196021.0 QDate = 1222291149 196024.0 QDate = 1222292175 196025.0 QDate = 1222292360 196027.0 QDate = 1222292641 196031.0 QDate = 1222293296 196032.0 QDate = 1222293389 196034.0 QDate = 1222293947 MINOS25 > datesec 1222290308 Wed Sep 24 16:05:08 CDT 2008 MINOS25 > datesec 1222293947 Wed Sep 24 17:05:47 CDT 2008 ########## # CONDOR # ########## Date: Wed, 24 Sep 2008 16:05:54 -0500 (CDT) Subject: HelpDesk ticket 122184 ___________________________________________ Short Description: Too few jobs running on GPFarm Problem Description: I see far fewer jobs than expected running on GPfarm. Our analysis users are getting well under half normal capacity, and we have several high priority jobs that we are trying to get through. The overall load on GPFarm seems pretty light, according to Ganglia, The 'nice' CPU is running around 20%. Condorview shows about 250 running processes, out of the 850 capacity. User jobs are getting in and running, but at nothing like normal capacity. According to condor_q, rubin has 97 jobs idle, only 14 running. The Minos glideins have 100 processes running, with over 20 idle. A few new pilots have gotten started during the day, with no net gain. We usually have more like 200 running. Any idea what has gone wrong ? ___________________________________________ Date: Wed, 24 Sep 2008 16:30:32 -0500 (CDT) From: HelpDesk Note To Requester: timm@fnal.gov sent this Notes To Requester: Art--with respect to the glideins, I checked the glideins and those that have not started, haven't started because they are waiting for the nodes with AFS. As far as Howie's jobs are concerned, it appears that his condor_gridmanager on fnpcsrv1 got stuck, I have now unstuck it. If you continue to see gfactory jobs sitting "unsubmitted" or "pending" on minos25 for any length of time, keep us posted. Steve ___________________________________________ Date: Wed, 24 Sep 2008 22:40:09 +0000 (GMT) From: Arthur Kreymer Thanks for unsticking the rubin jobs, they seem to have finished. You are right, the glideins submitted earlier today, through 16:05, were all going toward the saturated AFS nodes. 
There are 10 newer glideins submitted between 16:05 and 17:05, which I think are not tied to AFS, but which are also all idle. 196017.2 QDate = 1222290308 196018.0 QDate = 1222290494 196020.0 QDate = 1222290683 196021.0 QDate = 1222291149 196024.0 QDate = 1222292175 196025.0 QDate = 1222292360 196027.0 QDate = 1222292641 196031.0 QDate = 1222293296 196032.0 QDate = 1222293389 196034.0 QDate = 1222293947 Only the first of these has started to run, as of 17:34. ___________________________________________ Date: Thu, 25 Sep 2008 16:15:25 +0000 (GMT) From: Arthur Kreymer FYI, Minos glideinWMS status plots are available at http://www-numi.fnal.gov/gfactory/monitor/glidein_t20_glexec/total/0Status.day.large.html The gpgeneral glideins are not restricted to AFS nodes. The gpminos glideins are restricted to AFS nodes. We seem to have hit a ceiling of about 80 to 100 glideins. This is about the level of our hardware priority allocation. Is this a coincidence ? Recent glidein jobs are continuing to get started, but at a very restricted rate, consistent with some GPFarm limit, although the GPFarm nodes are mostly idle. ___________________________________________ ___________________________________________ ___________________________________________ ######## # FARM # ######## Studying confused F00041882 status, 25/24 reported All subruns 00 thru 23 are present in cand files. dds /pnfs/minos/reco_far/cedar/cand_data/2008-08/F00041882 Aug 29 through Sep 02 Have only 0, 7, 10, 12 in pass 0, Aug 31, for sntp and bntp SRV1> FILE=F00041882_0000.all.sntp.cedar.0.root SRV1> sam get metadata --file=${FILE} | grep parents | tr "'" \\\n | grep root | sort F00041882_0000.mdaq.root F00041882_0001.mdaq.root F00041882_0002.mdaq.root F00041882_0003.mdaq.root F00041882_0004.mdaq.root F00041882_0007.mdaq.root F00041882_0008.mdaq.root F00041882_0010.mdaq.root F00041882_0012.mdaq.root F00041882_0013.mdaq.root F00041882_0014.mdaq.root F00041882_0015.mdaq.root F00041882_0016.mdaq.root F00041882_0017.mdaq.root F00041882_0018.mdaq.root F00041882_0019.mdaq.root F00041882_0020.mdaq.root F00041882_0021.mdaq.root F00041882_0022.mdaq.root In farcat, have 05 06 09 10 11 23 dating Aug 31 and Sep 02 The problem is subrun 10, a duplicate ? Why is this not detected ? From logs, Sun Aug 31 06:07:28 CDT 2008 BADRUNS F00041882_0005.all.sntp.cedar.0.root BADRUNS F00041882_0006.all.sntp.cedar.0.root BADRUNS F00041882_0009.all.sntp.cedar.0.root BADRUNS F00041882_0011.all.sntp.cedar.0.root BADRUNS F00041882_0023.all.sntp.cedar.0.root The files were processed at that time.
SRV1> dds /minos/data/minfarm/farcat/F00041882_0010* -rw-rw-r-- 1 minospro numi 23029986 Aug 31 22:28 /minos/data/minfarm/farcat/F00041882_0010.all.sntp.cedar.0.root -rw-rw-r-- 1 minospro numi 8183846 Aug 31 22:28 /minos/data/minfarm/farcat/F00041882_0010.spill.bntp.cedar.0.root -rw-rw-r-- 1 minospro numi 5223452 Aug 31 22:28 /minos/data/minfarm/farcat/F00041882_0010.spill.sntp.cedar.0.root SRV1> SRV1> dds /pnfs/minos/reco_far/cedar/sntp_data/F00041882_0010* ls: /pnfs/minos/reco_far/cedar/sntp_data/F00041882_0010*: No such file or directory SRV1> dds /pnfs/minos/reco_far/cedar/sntp_data/2008-08/F00041882_0010* -rw-r--r-- 1 rubin numi 23031199 Aug 31 07:43 /pnfs/minos/reco_far/cedar/sntp_data/2008-08/F00041882_0010.all.sntp.cedar.0.root -rw-r--r-- 1 rubin numi 5223452 Aug 31 07:40 /pnfs/minos/reco_far/cedar/sntp_data/2008-08/F00041882_0010.spill.sntp.cedar.0.root SRV1> dds /pnfs/minos/reco_far/cedar/.bntp_data/2008-08/F00041882_0010* -rw-r--r-- 1 rubin numi 8183846 Aug 31 07:37 /pnfs/minos/reco_far/cedar/.bntp_data/2008-08/F00041882_0010.spill.bntp.cedar.0.root Let's check out DUP checking SRV1> . ./samsetup SRV1> ./samdup /minos/data/minfarm/farcat F00041882_0010.spill.bntp.cedar.0.root F00041882_0010.all.sntp.cedar.0.root F00041882_0010.spill.sntp.cedar.0.root The roundup script was escaping quotation marks around the -s \"${SEL}" argument being sent to samdup, causing nothing to be selected. Corrected this and a related typo in the DUP code ( extre [] brackets ) Cut a new roundup.20080923 version on the fly. Testing this out, this is also the first test of the new proxy. SRV1> ./roundup -D -r cedar far ########### # ROUNDUP # ########### Put today's changes for DUP handling into roundup.20080924 SRV1> cp -a AFSS/roundup.20080924 . SRV1> ln -sf roundup.20080924 roundup ######### # MYSQL # ######### On minos-sam03, created setups.sh script in home area of minsoft, unset UPS_DIR unset SETUP_UPS . 
/usr/local/etc/setups.sh export PRODUCTS=${HOME}/ups/db:/local/ups/db Test this, and move to the newer mysql, ########## # CONDOR # ########## No held jobs for the last couple of days, then a few this morning : MINOS25 > condor_q -hold gfactory -- Submitter: minos25.fnal.gov : <131.225.193.25:64961> : minos25.fnal.gov ID OWNER HELD_SINCE HOLD_REASON 195887.5 gfactory 9/24 07:55 Globus error 17: the job failed when the jo 195887.7 gfactory 9/24 07:55 Globus error 43: the job manager failed to 195887.9 gfactory 9/24 07:55 Globus error 17: the job failed when the jo ######### # MYSQL # ######### Make room in samread on minos-sam02 for database tests cd DBARCH/ -rw-r----- 1 samread 5024 8806 Aug 15 2007 PULSERDRIFT.frm -rw-r----- 1 samread 5024 75037580970 Aug 15 2007 PULSERDRIFT.MYD -rw-r--r-- 1 samread 5024 32655442989 Aug 17 2007 PULSERDRIFT.MYD.gz -rw-r----- 1 samread 5024 28319080448 Aug 15 2007 PULSERDRIFT.MYI -rw-r--r-- 1 samread 5024 8937925900 Aug 18 2007 PULSERDRIFT.MYI.gz Test integrity of the zipped PD files, remove originals, copy to /minos/data/mysql/old MINOS-SAM02 > gunzip -c PULSERDRIFT.MYI.gz > PDI MINOS-SAM02 > diff PULSERDRIFT.MYI PDI rm PULSERDRIFT.MYI PDI time gunzip -c PULSERDRIFT.MYD.gz > PDD real 59m8.515s user 18m53.518s time md5sum PULSERDRIFT.MYD PDD 96e5cb77b49526184e10d78f12969636 PULSERDRIFT.MYD 96e5cb77b49526184e10d78f12969636 PDD rm PULSERDRIFT.MYD PDD MINOS-SAM02 > mkdir /minos/data/mysql/old MINOS-SAM02 > time cp -va PULSER* /minos/data/mysql/old/ `PULSERDRIFT.frm' -> `/minos/data/mysql/old/PULSERDRIFT.frm' `PULSERDRIFT.MYD.gz' -> `/minos/data/mysql/old/PULSERDRIFT.MYD.gz' real 15m55.623s user 0m0.753s sys 1m54.251s Wow, data transfers peak up to 50 MBytes/second ! BlueArc must be happy today. But there are frequent minute long interruptions with 0 data rate. Packet rates are steady around 30K/second. du -sm shows rates about 46 MBytes/sec, maybe the du sample was lucky. Net transfer was about 40 GB/1000 sec or 40 MB/sec. SOFT03 > du -sm restore/20080902/ 66492 restore/20080902/ mkdir ~/MYSQL cd ~/MYSQL time scp -r -c blowfish minsoft@minos-sam03:restore/20080902 restore Typical rates are reported around 40 MB/sec ============================================================================= 2008 09 23 ============================================================================= ######## # FARM # ######## Howie's cert was about to expire created a fresh kreymer cert with Role=Production, on minos26 MINOS26 > . /minos/scratch/kreymer/VDT/setup.sh MINOS26 > cd /local/scratch26/kreymer/grid MINOS26 > voms-proxy-init -voms fermilab:/fermilab/minos/Role=Production -cert kreymerdoe.pem -key kreymerdoekey.pem -out kreymer-production.proxy -valid 10000:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Enter GRID pass phrase: phrase is too short, needs to be at least 4 chars Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Creating temporary proxy ............................................. Done Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Done Warning: fg6x1.fnal.gov:15001: The validity of this VOMS AC in your proxy is shortened to 86400 seconds! Creating proxy ............................................................. 
Done Warning: your certificate and proxy will expire Wed Mar 25 14:45:40 2009 which is within the requested lifetime of the proxy MINOS26 > voms-proxy-info -all -file kreymer-production.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : kreymer-production.proxy timeleft : 4385:57:02 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov attribute : /fermilab/minos/Role=Production/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL This will give a production role proxy for use by roundup. Copied this to /local/globus/minfarm/.grid SRV1> pwd /local/globus/minfarm/.grid SRV1> scp kreymer@minos26:/local/scratch26/kreymer/grid/kreymer-production.proxy . SRV1> cd /export/stage/minfarm/.grid Created draft local srmtestp, using production proxy, and adding a write and cleanup to /pnfs/minos/NULL Created a new roundup, using the new cert in the correct location. SRV1> ln -sf roundup.20080923 roundup # was roundup.20080915 SRV1> date Tue Sep 23 21:18:47 CDT 2008 ####### # WEB # ####### Per request of inkmann, reviewing all .htaccess files with Options +Includes These should be Options +IncludesNOEXEC MIN > find /afs/fnal.gov/files/data/minos/d119 -name .htaccess /afs/fnal.gov/files/data/minos/d119/prd/sam_config/v4_2_28/NULL/www/.htaccess /afs/fnal.gov/files/data/minos/d119/prd/sam_config/v4_2_34/NULL/www/.htaccess /afs/fnal.gov/files/data/minos/d119/prd/sam_bootstrap/v4_4_1/NULL/www/.htaccess /afs/fnal.gov/files/data/minos/d119/prd/sam_web_services/v0_9_8/NULL/www/.htaccess /afs/fnal.gov/files/data/minos/d119/prd/sam_web_services/v0_9_9/NULL/www/.htaccess /afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_4/v03/boost_1_34_1/regression/.htaccess Only the sam files are SSI enabled First checking all .shtml files to see that we are not using #exec MIN > find . -name \*\.shtml ./prd/sam_config/v4_2_28/NULL/www/index.shtml ./prd/sam_config/v4_2_34/NULL/www/index.shtml ./prd/sam_bootstrap/v4_4_1/NULL/www/index.shtml So the sam_web_services entry seem frivolous. None of these .shtml files contain the string exec. Corrected all sam* .htaccess files, in d119 and d141. Checked on minos-sam01 FILES=`find . -name \*\.shtml` for FILE in $FILES ; do echo ${FILE} ; grep exec ${FILE} ; done Found no #exec elements of directives ####### # SAM # ####### Note to sam-design I just received an email from the Fermilab Web security team, noting that several of the Minos .htaccess files contained Options +Includes This is apparently dangerous. 
They should be set up to prohibit #exec directives on the server side: Options +IncludesNOEXEC This is a sam issue because none of these particular .includes are from active Minos code, but are parts of various sam products whose files are incidentally being served to the web : sam_config/v4_2_28 sam_config/v4_2_34 sam_bootstrap/v4_4_1 sam_web_services/v0_9_8 sam_web_services/v0_9_9 Looking at products on the Minos station/dbserver, it seems that many sam products have Options +Includes sam_bootstrap sam_config sam_cp sam_gridftp sam_kerberos_rcp The good news is that none of our .shtml files seem to use the dangerous #exec element, so there is no immediate risk. The bad news is that the web security people will bug us until we change all the +Includes to +IncludesNOEXEC ############ # PRODUCTS # ############ Per loiacono request, upd install -j geant4 v4_8_1_p02 -q GCC_3_4_3 -f Linux+2.4-2.3.2 informational: installed geant4 v4_8_1_p02. upd install succeeded. ######## # GRID # ######## MINOS26 > cd /grid/app/minos MINOS26 > du -sm * $ du -sm /grid/app/minos/users/* 3202 /grid/app/minos/users/boehm 2975 /grid/app/minos/users/loiacono 1 /grid/app/minos/users/pawloski 2683 /grid/app/minos/users/rustem 10 /grid/app/minos/users/scavan MINOS26 > quota -v -s -g e875 Disk quotas for group e875 (gid 5111): Filesystem blocks quota limit grace files quota limit grace blue2:/fermigrid-data 315G 0 400G 128k 0 0 blue2:/fermigrid-app 26436M 0 30720M 391k 0 0 minos-nas-0.fnal.gov:/minos/scratch 5084G 0 6144G 1549k 0 0 minos-nas-0.fnal.gov:/minos/data 16384G* 0 16384G 1500k 0 0 blue2:/fermigrid-fermiapp 15291M 0 30720M 217k 0 0 ########## # PARROT # ########## 10:50 - as planned, move to use of /grid/fermiapp/minos/parrot mv /grid/app/minos/parrot /grid/app/minos/parrotold ln -s /grid/fermiapp/minos/parrot /grid/app/minos/parrot ######## # DATA # ######## rbpatter will create door lists in something like computing/config/dcachedoor pts membership wadmnumi:numiweb pts adduser -user rbpatter -group wadmnumi:numiweb ============================================================================= 2008 09 22 ============================================================================= ######### # CLUBS # ######### HOWTO.nodes - updated per current condor nodes flxb31 flxb32 flxb33 flxb34 flxb36 flxi09 flxi10 Can also log into flxb19 flxb35 ########## # DCACHE # ########## Why is level 2 information not indicating a pool for raw data? ./dc_stat /pnfs/minos/fardet_data/2008-09/F00041967_0000.mdaq.root ============================ PNFS status for /pnfs/minos/fardet_data/2008-09/F00041967_0000.mdaq.root -rw-r--r-- 1 buckley e875 43610714 Sep 21 21:36 F00041967_0000.mdaq.root LEVEL 2 2,0,0,0.0,0.0 :c=1:5cb8e1c2;h=yes;l=43610714; LEVEL 4 VO8699 0000_000000000_0000396 43610714 fardet_data /pnfs/fnal.gov/usr/minos/fardet_data/2008-09/F00041967_0000.mdaq.root 000F0000000000000864A588 CDMS122205097100000 stkenmvr25a:/dev/rmt/tps0d0n:479000022613 3277382081 ============================ ############ # DCCPTEST # ############ Created dccptest script, can copy recent raw data. 
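For the record, a minimal sketch of the sort of thing dccptest does ( not the actual script ; the door port, scratch path and choice of raw file here are illustrative assumptions ) :

  # copy one recent raw file through the unsecured dcap door,
  # then compare sizes against the PNFS listing
  DOOR=dcap://fndca1.fnal.gov:24125
  RAW=fardet_data/2008-09/F00041967_0000.mdaq.root
  dccp ${DOOR}/pnfs/fnal.gov/usr/minos/${RAW} /local/scratch26/kreymer/dccptest.root
  ls -l /local/scratch26/kreymer/dccptest.root /pnfs/minos/${RAW}
  rm -f /local/scratch26/kreymer/dccptest.root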
######## # FARM # ######## SRV1> ls -ltr /minos/data/minfarm/mcnearcat | grep charm | wc -l 1714 -rw-rw-r-- 1 minospro numi 29619026 Sep 19 18:49 n13037053_0003_L010185N_D04_charm.mrnt.cedar_phy_bhcurv.0.root SRV1> ls -ltr /minos/data/minfarm/mcnearcat | grep helium | wc -l 830 -rw-rw-r-- 1 minospro numi 39034470 Sep 21 00:01 n13038001_0015_M100200N_D04_helium.mrnt.cedar_phy_bhcurv.0.root ./roundup -b 2000 -s helium -r cedar_phy_bhcurv mcnear Mon Sep 22 11:00:39 CDT 2008 Need to set up helium and charm in corral. ######### # ADMIN # ######### CD105723 produced requisition 204475 last week. Buyer Gloinski PO 582475 SuperMicro server Promised Date: 09-Oct-2008 No PO yet for Sataboy It is there now, 17:00 CDT, PO 582564 F1F-141000HDRG SATABOY storage device configured with (14) 1TB disks ORDER DATE 22-Sep-2008 Promised Date: 13-Oct-2008 ============================================================================= 2008 09 21 Sun ============================================================================= ######### # MYSQL # ######### minsoft@minos-sam03 - added rearmstr ########## # PARROT # ########## Cloned to /grid/fermiapp, stop supporting /grid/app mindata@minos26 MINOS26 > du -sm /grid/app/minos/parrot 848 /grid/app/minos/parrot $ cp -vax /grid/app/minos/parrot /grid/fermiapp/minos/parrot $ date Sun Sep 21 19:17:27 CDT 2008 $ diff -r /grid/app/minos/parrot /grid/fermiapp/minos/parrot mountfile2.grow was missing link $ cp -a /grid/app/minos/parrot/cctools-current-20080717-i686-linux-2.6/mountfile2.grow /grid/app/minos/parrot $ cp -a /grid/fermiapp/minos/parrot/cctools-current-20080717-i686-linux-2.6/mountfile2.grow /grid/fermiapp/minos/parrot $ diff -r /grid/app/minos/parrot /grid/fermiapp/minos/parrot clean Will shift tomorrow. after confirming /g/fa mounts at Grid Users Meeting, mv /grid/app/minos/parrot /grid/app/minos/parrotold ln -s /grid/fermiapp/minos/parrot /grid/app/minos/parrot paloon - adjusted paths to /grid/fermiapp/... ########## # PARROT # ########## Test file for grid tests. mindata@minos26 . /afs/fnal.gov/ups/etc/setups.sh export PRODUCTS=/afs/fnal.gov/files/code/e875/general/ups/db cd /afs/fnal.gov/files/data/minos/release_data/parrot DFILE='dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root' dccp ${DFILE} . This is one of our short, 50KBytes test files. Where is a somewhat longer, 10 miinue MIN > ssh fnpc176 -bash-3.00$ cd /grid/fermiapp/minos/parrot -bash-3.00$ time ./paloon SETTING UP UPS SETTING UP MINOS real 0m55.410s OK , found my test files, cp /grid/fermiapp/minos/parrot/N00009870_0002.mdaq.root \ /afs/fnal.gov/files/data/minos/release_data/parrot/N00009870_0002.mdaq.root time ./paloon "" "" /grid/fermiapp/minos/parrot/recopa Spin(103760 in 103760 out 0 filt.) real 2m58.785s user 1m1.069s sys 0m49.411s Need a yet larger file for realistic testing. No, need to correct typos, and run in a writeable area; FAP=/grid/fermiapp/minos/parrot cd /local/scratch1/kreymer time ${FAP}/paloon "" "" ${FAP}/recopa Spill(100000 in 750 out 99250 filt.) 
real 13m58.890s user 10m10.147s sys 2m27.029s -bash-3.00$ ls -ltr total 20544 drwxr-xr-x 259 kreymer numi 4096 Sep 22 11:51 parrot -rw-r--r-- 1 kreymer numi 17819119 Sep 22 12:05 CandS.root -rw-r--r-- 1 kreymer numi 3179130 Sep 22 12:05 ntupleStS.root -bash-3.00$ du -sm * 18 CandS.root 4 ntupleStS.root ============================================================================= 2008 09 19 ============================================================================= ######## # FARM # ######## Once again, I have run CPB far concatenation without first removing the mrnt files, and renaming bmnt to mrnt. I have added an appropriate comment to the corral scripts. Doing a test run of roundup, it seems we have a clean boundary, all runs presently in farcat have all subruns present. 171 files of each, in 9 runs. 660 bmnt files, some previously concatenated. So we can swap out the bmnt/mrnt in farcat, then remove the previously written mrnt files in SAM. ----------------------------------------------------------- BMNT LIST BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` MFILES=`ls /minos/data/minfarm/farcat | grep mrnt | sort` printf "${BFILES}\n" | wc -w 660 printf "${MFILES}\n" | wc -w 171 ----------------------------------------------------------- MOVE MRNT OUT OF THE WAY mkdir -p /minos/data/minfarm/FMRNT cd /minos/data/minfarm/farcat for MFILE in ${MFILES} ; do mv ${MFILE} /minos/data/minfarm/FMRNT/${MFILE} done ----------------------------------------------------------- RENAME BMNT TO MRNT cd /minos/data/minfarm/farcat check for conflicts for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` [ -r ${MFILE} ] && ls -l ${MFILE} done for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` mv ${BFILE} ${MFILE} done ----------------------------------------------------------- PNFS, MINOS_DATA and SAM cleanup prepration Get run list of possible bmnt MRUNS=`printf "${BFILES}\n" | cut -f 1 -d _ | sort -u` printf "${MRUNS}\n" | wc -w 36 cd ~/scripts . 
./samsetup Detailed check via SAM for MRUN in ${MRUNS} ; do RUN=`echo ${MRUN} | cut -c 5-` SAMDIM=" DATA_TIER mrnt-far and VERSION cedar.phy.bhcurv and PHYSICAL_DATASTREAM_NAME spill and RUN_NUMBER ${RUN} " sam list files --dim="${SAMDIM}" --nosummary done > /minos/data/minfarm/maint/MFILES I expect 27 ( = 36 - 9 ) runs to remove SRV1> wc -l /minos/data/minfarm/maint/MFILES 28 /minos/data/minfarm/maint/MFILES One run is split, F00040213_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00040213_0019.spill.mrnt.cedar_phy_bhcurv.0.root So have 28 files to remove grep -v '_0000' /minos/data/minfarm/maint/MFILES MFILES=`cat /minos/data/minfarm/maint/MFILES` printf "${MFILES}\n" | wc -l 28 Added the paths for MFILE in ${MFILES} ; do MON=`sam locate ${MFILE} | cut -f 7 -d / | cut -f 1 -d ,` printf "reco_far/cedar_phy_bhcurv/mrnt_data/${MON}/${MFILE}\n" \ | tee -a /minos/data/minfarm/maint/MFILEPS done MFILEPS=`cat /minos/data/minfarm/maint/MFILEPS` for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} ; done for MFILEP in ${MFILEPS} ; do ls -l /minos/data/${MFILEP} ; done ----------------------------------------------------------- /minos/data - minfarm@fnpcsrv1 for MFILEP in ${MFILEPS} ; do MFILER=`echo ${MFILEP} | sed s/mrnt_data/BMNT2/g` MFILED=`dirname ${MFILER}` mkdir -p /minos/data/${MFILED} mv /minos/data/${MFILEP} /minos/data/${MFILER} done find /minos/data/reco_far/cedar_phy_bhcurv/BMNT2/2008-01 -type f | wc -l 28 ----------------------------------------------------------- /pnfs/minos - rubin@fnpcsrv1 cat shrc/kreymer # cut and paste the result to get into bash MFILES=`cat /minos/data/minfarm/maint/MFILES` MFILEPS=`cat /minos/data/minfarm/maint/MFILEPS` for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} rm /pnfs/minos/${MFILEP} done ----------------------------------------------------------- SAM/READ cd /export/stage/minfarm/ROUNDUP mkdir -p READBMNT2 for MFILE in ${MFILES} ; do ls -l READ/SAM/${MFILE} mv READ/SAM/${MFILE} READBMNT2/${MFILE} done ----------------------------------------------------------- SAM for MFILE in ${MFILES} ; do sam undeclare file ${MFILE} done 13:30 ----------------------------------------------------------- WRITE clean up the items which I left dangling. for MFILE in ${MFILES} ; do [ -L "/minos/data/minfarm/WRITE/${MFILE}" ] \ && rm /minos/data/minfarm/WRITE/${MFILE} done Now we should be able to roundup the remaining 9 runs. 
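Before that, a quick cross check could confirm the cleanup really took ( a sketch only ; it reuses the MFILES/MFILEPS lists built above, and every entry should now come back as not found ) :

  MFILES=`cat /minos/data/minfarm/maint/MFILES`
  MFILEPS=`cat /minos/data/minfarm/maint/MFILEPS`
  # expect 'not found' style errors from each of these
  for MFILE in ${MFILES} ; do sam locate ${MFILE} ; done
  for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} ; done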
Needed to adjust the BAIL limit to over the default 1000 First purge the stale WRITE links ./roundup -w -r cedar_phy_bhcurv far One CPBF file is left, not in DCache or Tape ( stkendca9a problem ) F00040151_0000.spill.sntp.cedar_phy_bhcurv.0.root ./roundup -b 1500 -r cedar_phy_bhcurv far Fri Sep 19 14:33:54 CDT 2008 OK - processing 1173 files ############ # PREDATOR # ############ N00014862_0013.mdaq.root Fri Sep 19 14:06:19 UTC 2008 F00041958_0003.mdaq.root Fri Sep 19 10:13:58 UTC 2008 B080918_080001.mbeam.root Fri Sep 19 10:18:07 UTC 2008 N080918_000003.mdcs.root Fri Sep 19 10:28:30 UTC 2008 repeatedly time out in dbu F080917_000007.mdcs.root Thu Sep 18 10:14:17 UTC 2008 is ok Many files queued for write to tape : STARTED Fri Sep 19 02:11:47 2008 302 FILES 3 N00014861_0000.mdaq.root 18 32 N00014862_0006.mdaq.root 19 65 F00040170_0022.spill.cand.cedar_phy_bhcurv.0.root 18 76 F00041955_0018.mdaq.root 18 98 F00040167_0012.all.cand.cedar_phy_bhcurv.0.root 18 102 F00040167_0018.all.cand.cedar_phy_bhcurv.0.root 18 111 F00040176_0019.all.cand.cedar_phy_bhcurv.0.root 18 113 F00040151_0000.spill.sntp.cedar_phy_bhcurv.0.root 18 114 F00041955_0021.mdaq.root 19 115 F00041955_0022.mdaq.root 19 119 N00014862_0004.mdaq.root 18 146 F00041955_0019.mdaq.root 19 155 F00040145_0023.spill.bcnd.cedar_phy_bhcurv.0.root 18 172 F00040148_0021.spill.cand.cedar_phy_bhcurv.0.root 18 215 F00040176_0011.spill.cand.cedar_phy_bhcurv.0.root 18 217 N00014862_0007.mdaq.root 19 224 F00041955_0012.mdaq.root 18 247 F00040148_0003.all.cand.cedar_phy_bhcurv.0.root 18 249 F00040148_0018.all.cand.cedar_phy_bhcurv.0.root 18 252 F00040167_0005.spill.cand.cedar_phy_bhcurv.0.root 18 259 F00040173_0019.spill.cand.cedar_phy_bhcurv.0.root 18 262 F00040176_0019.spill.cand.cedar_phy_bhcurv.0.root 18 266 N00014859_0022.mdaq.root 18 272 F00040145_0012.spill.bcnd.cedar_phy_bhcurv.0.root 18 284 F00040145_0014.spill.cand.cedar_phy_bhcurv.0.root 18 296 F00040170_0005.spill.cand.cedar_phy_bhcurv.0.root 18 299 N00014862_0005.mdaq.root 19 300 F00041955_0020.mdaq.root 19 301 N00014862_0008.mdaq.root 19 Full paths of the dbu trouble files : /pnfs/minos/neardet_data/2008-09/N00014862_0013.mdaq.root /pnfs/minos/fardet_data/2008-09/F00041958_0003.mdaq.root /pnfs/minos/beam_data/2008-09/B080918_080001.mbeam.root /pnfs/minos/near_dcs_data/2008-09/N080918_000003.mdcs.root MINOS26 > ./dc_stat /pnfs/minos/neardet_data/2008-09/N00014862_0013.mdaq.root ============================ PNFS status for /pnfs/minos/neardet_data/2008-09/N00014862_0013.mdaq.root -rw-r--r-- 1 buckley e875 111237454 Sep 19 01:16 N00014862_0013.mdaq.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:8097bf89;l=111237454; LEVEL 4 ============================ Same for all 4 files. 
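A quick way to watch whether any of the four reach tape ( a sketch ; it assumes the NFS mounted /pnfs view and the standard PNFS layer 4 files that dc_stat appears to read ) :

  # empty layer 4 means the file has no tape copy yet
  for F in \
    neardet_data/2008-09/N00014862_0013.mdaq.root \
    fardet_data/2008-09/F00041958_0003.mdaq.root \
    beam_data/2008-09/B080918_080001.mbeam.root \
    near_dcs_data/2008-09/N080918_000003.mdcs.root ; do
    DIR=/pnfs/minos/`dirname ${F}` ; FIL=`basename ${F}`
    VOL=`cat "${DIR}/.(use)(4)(${FIL})" | head -1`
    echo "${FIL} ${VOL:-not-yet-on-tape}"
  done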
Scanning the write backlog for SAM locations : FILES=' N00014861_0000.mdaq.root N00014862_0006.mdaq.root F00040170_0022.spill.cand.cedar_phy_bhcurv.0.root F00041955_0018.mdaq.root F00040167_0012.all.cand.cedar_phy_bhcurv.0.root F00040167_0018.all.cand.cedar_phy_bhcurv.0.root F00040176_0019.all.cand.cedar_phy_bhcurv.0.root F00040151_0000.spill.sntp.cedar_phy_bhcurv.0.root F00041955_0021.mdaq.root F00041955_0022.mdaq.root N00014862_0004.mdaq.root F00041955_0019.mdaq.root F00040145_0023.spill.bcnd.cedar_phy_bhcurv.0.root F00040148_0021.spill.cand.cedar_phy_bhcurv.0.root F00040176_0011.spill.cand.cedar_phy_bhcurv.0.root N00014862_0007.mdaq.root F00041955_0012.mdaq.root F00040148_0003.all.cand.cedar_phy_bhcurv.0.root F00040148_0018.all.cand.cedar_phy_bhcurv.0.root F00040167_0005.spill.cand.cedar_phy_bhcurv.0.root F00040173_0019.spill.cand.cedar_phy_bhcurv.0.root F00040176_0019.spill.cand.cedar_phy_bhcurv.0.root N00014859_0022.mdaq.root F00040145_0012.spill.bcnd.cedar_phy_bhcurv.0.root F00040145_0014.spill.cand.cedar_phy_bhcurv.0.root F00040170_0005.spill.cand.cedar_phy_bhcurv.0.root N00014862_0005.mdaq.root F00041955_0020.mdaq.root N00014862_0008.mdaq.root ' MINOS26 > for FILE in $FILES ; do sam locate $FILE ; done The cand files are all /pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2008-01 bcnd are /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2008-01 sntp are /pnfs/minos/reco_far/cedar_phy_bhcurv/sntp_data/2008-01 Some are on tape now, F00040167_0012.all.cand.cedar_phy_bhcurv.0.root F00040176_0019.all.cand.cedar_phy_bhcurv.0.root F00040148_0018.all.cand.cedar_phy_bhcurv.0.root F00040176_0019.spill.cand.cedar_phy_bhcurv.0.root F00040145_0012.spill.bcnd.cedar_phy_bhcurv.0.root F00040145_0014.spill.cand.cedar_phy_bhcurv.0.root F00040170_0005.spill.cand.cedar_phy_bhcurv.0.root Submitted helpdesk ticket I see now that w-stkendca9a-* pools are offline stkendca9a is up, on the network MRTG traffic stops around 01:45 MRTG traffic starts around 14:30 Date: Fri, 19 Sep 2008 16:15:26 +0000 (GMT) From: Arthur Kreymer To: HelpDesk Cc: minos-data@fnal.gov, dcache-admin@fnal.gov Subject: Re: HelpDesk ticket 121930 <-- # @@@ Enter Update below this line. @@@ # --> According to the MRGT network monitoring, stkendca9a is up and on the network, but stopped most activity around 01:45 this morning. This node serves both writePools and RawDataWritePools. This could explain the absence of our files. <-- # @@@ Enter Update above this line. @@@ # --> From: Arthur Kreymer To: HelpDesk Cc: minos-data@fnal.gov, dcache-admin@fnal.gov Subject: Re: HelpDesk ticket 121930 <-- # @@@ Enter Update below this line. @@@ # --> According to the MRGT network monitoring, stkendca9a started moving data again around 14:30. All of the previously backlogged files seem to have made it to tape, <-- # @@@ Enter Update above this line. @@@ # --> ############ # MINOS_OM # ############ Investigating failure of FarWeb to contact minos-om since Fri, 19 Sep 2008 07:50:39 -0500 /var/log/messages is flooded with Aug 24 04:02:30 minos-om pam_timestamp_check: pam_timestamp: `/var/run/' owner UID != 0 /var/run is owned by apache. ####### # DAQ # ####### [root@minos-evd ~]# cat /etc/exports # # export /data/mcr to other control room pc's. SA 1/19/05 /data/mcr 131.225.55.0/255.255.0.0(rw) minos-beamdata.fnal.gov(rw) /data/minsoft 131.225.55.0/255.255.0.0(rw) minos-beamdata.fnal.gov(rw) This is dangerously wrong, exports rw to 131.225.* , all of Fermilab Probably want an exlicit list of CR systems. 
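Something along these lines would tighten it up ( a sketch only ; the host list is a placeholder, to be replaced with the real set of CR systems ) :

  # /etc/exports - export only to the specific control room hosts,
  # not to the whole 131.225/16 range
  /data/mcr     minos-rc.fnal.gov(rw) minos-om.fnal.gov(rw) minos-acnet.fnal.gov(rw) minos-beamdata.fnal.gov(rw)
  /data/minsoft minos-rc.fnal.gov(rw) minos-om.fnal.gov(rw) minos-acnet.fnal.gov(rw) minos-beamdata.fnal.gov(rw)

Then exportfs -r on minos-evd to pick up the change.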
============================================================================= 2008 09 18 ============================================================================= ######## # FARM # ######## Cleanup of v18 flux error files in D04 GPB Got list of configs and runs from nwest email Date: Fri, 12 Sep 2008 16:18:09 +0100 From: Nick West scp minos-93198.dhcp:baddo4 /minos/scratch/kreymer/badd04 These are things whose mapping to files I do not understand L010185_near_bhcurv 00007450 L010185_near_bhcurv_test 00007655 L010185_near_production 00007484 L010185_rock_pro 00007481 ######## # GRID # ######## 14:37 Most worker nodes have booted up, as recently as 14:20 A couple of nodes are still down ( formerly running pawloski jobs ) fnpc209.fnal.gov fnpc219.fnal.gov 16:24 - Timm states that FermiGrid is and has been up. Started gfactory and gfrontend ########## # CONDOR # ########## Clean up pilots that think they are running. Nodes supposedly running jobs XNO=`condor_q -run | grep fnpc | grep -v gfactory | cut -f 3 -d @ | sort -u` Of these, some respond to ping XUP=`for NO in ${XNO} ; do ping -c 1 -w 2 ${NO} > /dev/null && echo ${NO} ; done` Of these, scan for condor processes for UP in ${XUP} ; do echo ${UP} ; ssh -ax ${UP} "ps -fu condor" ; done mostly got UID PID PPID C STIME TTY TIME CMD condor 3773 1 0 08:04 ? 00:00:00 /opt/condor/sbin/condor_master condor 3788 3773 0 08:04 ? 00:00:05 condor_startd -f unauthorized on fnpc207.fnal.gov fnpc218.fnal.gov fnpc253.fnal.gov fnpc257.fnal.gov fnpc236.fnal.gov This rsh session is using DES encryption for all data transmissions. UID PID PPID C STIME TTY TIME CMD condor 3818 1 0 08:04 ? 00:00:00 /opt/condor/sbin/condor_master condor 3826 3818 0 08:04 ? 00:00:14 condor_startd -f condor 6712 3826 0 08:41 ? 00:00:00 condor_starter -f -a slot2 fnpcosg1.fnal.gov condor 6713 3826 0 08:41 ? 00:00:00 condor_starter -f -a slot3 fnpcosg1.fnal.gov condor 6714 3826 0 08:41 ? 00:00:00 condor_starter -f -a slot1 fnpcosg1.fnal.gov condor 6715 3826 0 08:41 ? 00:00:00 condor_starter -f -a slot4 fnpcosg1.fnal.gov These are jobs for username engage Checking again for minos processes, there were none : for UP in ${XUP} ; do echo ${UP} ; ssh -ax ${UP} "ps -fu minos" ; done Bottom line, nothing useful is running for us. Shall/can I remove these gfactory's ? No need, they are all held now ! MINOS25 > condor_q gfactory | tail -1 ; date 73 jobs; 0 idle, 0 running, 73 held Thu Sep 18 09:43:26 CDT 2008 MINOS25 > condor_rm gfactory User gfactory's job(s) have been marked for removal. six of these went back into X status then back to H after a minute MINOS25 > condor_rm gfactory all clear Held pawloski jobs 66 jobs; 0 idle, 66 running, 0 held MINOS25 > condor_hold pawloski 66 jobs; 0 idle, 0 running, 66 held The pawloski jobs are now in X state ######## # FARM # ######## GCC power maintenance has started, Condor glideins are shut down since 00:45. My glideafs stopped getting scheduled at 01:30 ############## # AFSERRSCAN # ############## Added printout of nodes failing to connect via ssh ######### # ADMIN # ######### Two nodes unavailable to ssh, they are OK with rsh minos03 minos16 Ticket 121841 Date: Thu, 18 Sep 2008 08:26:32 -0500 (CDT) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group.
ling ============================================================================= 2008 09 17 ============================================================================= ########## # CONDOR # ########## Have over 150 gfactory pilots running, but few user jobs MINOS25 > condor_q gfactory | tail -1 176 jobs; 25 idle, 151 running, 0 held MINOS25 > condor_q -run | grep fnpc | grep -v gfactory | wc -l 34 MINOS25 > condor_q -run | grep ahimmel | tr -s ' ' | cut -f 6 -d ' ' | cut -f 3 -d @ | sort -u fnpc340.fnal.gov fnpc341.fnal.gov fnpc342.fnal.gov fnpc343.fnal.gov fnpc345.fnal.gov On 343, see 6035 ? Ss 17:36 /opt/condor/sbin/condor_master 6055 ? Ss 304:10 \_ condor_startd -f 7336 ? S 49:30 \_ condor_procd -A /local/stage1/condor/log/condor-lock.fnpc3430.0366944801799072/procd_pipe.STARTD -S 60 -C 4716 and many condor_starter -f under which our jobs run On 344, just have 8657 ? Ss 22:17 /opt/condor/sbin/condor_master 8658 ? Ss 371:24 \_ condor_startd -f 8740 ? S 60:23 \_ condor_procd -A /local/stage1/condor/log/condor-lock.fnpc3440.0366944801799072/procd_pipe.STARTD -S 60 -C 4716 We know that a pawloski job finished on fnpc300 recently ( 14:00 ) and that it was running loon OK. 25553 ? Ss 13:42 /opt/condor/sbin/condor_master 25554 ? Ss 465:30 \_ condor_startd -f 25733 ? S 44:06 \_ condor_procd -A /local/stage1/condor/log/condor-lock.fnpc3000.0366944801799072/procd_pipe.STARTD -S 60 -C 4716 22113 ? Ss 0:00 \_ condor_starter -f -a slot4 fnpcfg1.fnal.gov 22114 ? SNs 0:00 | \_ /bin/bash /grid/home/minos/.globus/.gass_cache/local/md5/68/c3/98/63a9242a845ee20f3cf6078aa0/md5/6c/85/f9/a30b5e7bee1d26d3c297802c16/data -v 22514 ? SN 0:00 | \_ /bin/bash ./condor_startup.sh glidein_config 22699 ? SN 0:10 | \_ /local/stage1/condor/execute/dir_22113/glide_m22150/condor/sbin/condor_master -r 359 -dyn -f 22700 ? SN 1:09 | \_ condor_startd -f 22925 ? SN 0:34 | \_ condor_procd -A /local/stage1/condor/execute/dir_22113/glide_m22150/log.131.225.166.78-22699/procd_address.STARTD -L /local/ 16310 ? SN 0:00 | \_ /local/stage1/condor/execute/dir_22113/glide_m22150/condor/sbin/condor_starter -f -a vm2 minos25.fnal.gov 16509 ? SN 0:22 | \_ condor_procd -A /local/stage1/condor/execute/dir_22113/glide_m22150/tmp/starter-tmp-dir-wu3e3x/log/procd_pipe.STARTER -L 16510 ? SN 0:00 | \_ /bin/sh /minos/scratch/pawloski/EntProc/paloon 148 0 /minos/scratch/pawloski/EntProc/SntpFileListsForSept2008Meeting/con 16513 ? RN 498:00 | \_ parrot -m /grid/app/minos/parrot/cctools-current-20080708-x86_64-linux-2.6/mountfile.grow -H -t /local/stage1/minos/ 16514 ? TN 0:00 | \_ /minos/scratch/pawloski/EntProc/SntpFileListsForSept2008Meeting/condor_job_glidein_FarDataAll_HornOn_SUN.sh /min 17009 ? TN 302:58 | \_ loon -bq /minos/scratch/pawloski/Nue/nue_standard_Firebird_SUN/NueAna/macros/MakeAnaNueTreePECut.C dcap://fndca1 7623 ? Ss 0:00 \_ condor_starter -f -a slot1 fnpcosg1.fnal.gov 7626 ? RNs 21:46 | \_ condor_exec.exe pd_45mA_errors12_1p0.in 7637 ? Ss 0:00 \_ condor_starter -f -a slot5 fnpcosg1.fnal.gov 7638 ? RNs 18:15 | \_ condor_exec.exe pd_45mA_errors16_1p0.in 7639 ? Ss 0:00 \_ condor_starter -f -a slot3 fnpcosg1.fnal.gov 7641 ? RNs 17:43 | \_ condor_exec.exe pd_45mA_errors15_1p0.in 7640 ? Ss 0:00 \_ condor_starter -f -a slot2 fnpcosg1.fnal.gov 7642 ? RNs 18:12 | \_ condor_exec.exe pd_45mA_errors14_1p0.in 7653 ? Ss 0:00 \_ condor_starter -f -a slot7 fnpcosg1.fnal.gov 7655 ? RNs 17:26 | \_ condor_exec.exe pd_45mA_errors107_1p0.in 7654 ? Ss 0:00 \_ condor_starter -f -a slot6 fnpcosg1.fnal.gov 7657 ? 
RNs 17:21 | \_ condor_exec.exe pd_45mA_errors40_1p0.in 7668 ? Ss 0:00 \_ condor_starter -f -a slot8 fnpcosg1.fnal.gov 7671 ? RNs 17:11 \_ condor_exec.exe pd_45mA_errors51_1p0.in MINOS25 > condor_q gfactory | tail -1 176 jobs; 25 idle, 151 running, 0 held Sees that this fnpc300 pilot is gone, but we were not informed. 25553 ? Ss 13:42 /opt/condor/sbin/condor_master 25554 ? Ss 465:34 \_ condor_startd -f 25733 ? S 44:07 \_ condor_procd -A /local/stage1/condor/log/condor-lock.fnpc3000.0366944801799072/procd_pipe.STARTD -S 60 -C 4716 7623 ? Ss 0:00 \_ condor_starter -f -a slot1 fnpcosg1.fnal.gov 7626 ? RNs 41:25 | \_ condor_exec.exe pd_45mA_errors12_1p0.in 7637 ? Ss 0:00 \_ condor_starter -f -a slot5 fnpcosg1.fnal.gov 7638 ? RNs 37:59 | \_ condor_exec.exe pd_45mA_errors16_1p0.in 7639 ? Ss 0:00 \_ condor_starter -f -a slot3 fnpcosg1.fnal.gov 7641 ? RNs 37:26 | \_ condor_exec.exe pd_45mA_errors15_1p0.in 7640 ? Ss 0:00 \_ condor_starter -f -a slot2 fnpcosg1.fnal.gov 7642 ? RNs 37:54 | \_ condor_exec.exe pd_45mA_errors14_1p0.in 7653 ? Ss 0:00 \_ condor_starter -f -a slot7 fnpcosg1.fnal.gov 7655 ? RNs 37:09 | \_ condor_exec.exe pd_45mA_errors107_1p0.in 7654 ? Ss 0:00 \_ condor_starter -f -a slot6 fnpcosg1.fnal.gov 7657 ? RNs 37:04 | \_ condor_exec.exe pd_45mA_errors40_1p0.in 7668 ? Ss 0:00 \_ condor_starter -f -a slot8 fnpcosg1.fnal.gov 7671 ? RNs 36:53 \_ condor_exec.exe pd_45mA_errors51_1p0.in Starting around 02:00, getting fewer and fewer running client jobs. And my probe jobs have not run since 20:40, based on *.out sizes. Next job to run should be 190628.0 Submitting a non-afs test job, MINOS25 > condor_submit glide.run Checking proxy in gfactory, it looks fine. Steve Timm logged into fnpc4x1, found 10 stuck minos account globus-job-manager processes Sent them a signal by running ptrace, this kicked them loose. May be able to do this with kill -s SIGCONG ( see man 7 signal ) Our condor_q info is now up to date, pilots are starting, and jobs are starting to run. I do not seem to be able to log into fnpcrx1 or fnpcfg1 sfiligoi notes that condor_status -l contains the information necessary to trace a gfactory job to a specific execution node, searching for things like GLIDEIN_ClusterId = 190823 GLIDEIN_ProcId = 5 for 190823.5 ######### # ADMIN # ######### Date: Wed, 17 Sep 2008 11:21:19 -0500 (CDT) Subject: Help Desk Ticket 121592 Has Been Resolved. ___________________________________________________________________ Solution: Added resolv.conf to cfengine for the minos cluster. ___________________________________________________________________ Short Description: minos-mysql1 /etc/resolv.conf has been updated Problem Description: run2-sys : The /etc/resolv.conf file on minos-mysql1 consisted of: search fnal.gov nameserver 131.225.8.120 This of course caused severe problems for mysql service, including unavailability of the Minos Control Room Logbook, during the recent service problems with fnsrv0/131.225.8.120 Because this has been causing current operational problems, I have taken the liberty of renaming this file to resolv.conf.2004 and have created a new file, copied from the Minos Cluster systems, but putting fnsrv1 first in the list: search fnal.gov nameserver 131.225.17.150 nameserver 131.225.227.254 nameserver 131.225.8.120 The old resolv.conf file seems to be older than minos-mysql1 [root@minos-mysql1 etc]# ls -l resolv.conf.2004 -rw-r--r-- 1 root root 41 Nov 22 2004 resolv.conf.2004 Action items : Please put this change in via your usual means ( cfengine ? ), rather than my hack. 
Please also update resolv.conf on minos-sam02 . ___________________________________________________________________ This ticket was resolved by SCOTT, RENNIE of the CD-SF/FEF group. _________________________________________________________________ Note To Requester: Hi Art, I put resolv.conf in cfengine. Of note, you are correct that the nameserver on minos-mysql-1 was older. This is the only system in the cluster that was upgraded and not re-installed during last years upgrade of the minos cluster. This method was done due to the uncertainty of the mysql database configurations and what would happen if upgraded i.e. new mysql database issue, etc. Since resolv.conf is configured during our kickstart process, this system never received an updated resolv.conf like the rest of the cluster. I can't remeber what we did with minos-sam02 so I'm not sure why it's resolv.conf did not get updated. This cfengine file entry should rectify the issues. Best regards, Rennie ######## # MAIL # ######## Adjusting SPAM filter to directly reject mail over 5, rather than send to managers. The level might even go lower, but I don't have time to test, and I've never seen good mail at or over 5. ___________________________________________________________________ Date: Tue, 16 Sep 2008 11:25:58 -0500 (CDT) Subject: Help Desk Ticket 121629 Has Been Resolved. Solution: Hi, Sounds like your list is configured with the default setting of sending spam to the owners for moderation. You can change that by adding a filter to the list configuration on listserv. The instructions for doing so are here: http://computing.fnal.gov/email/listserv/listserv-spam.html ___________________________________________________________________ List Management select a list Templates Select a template to view or edit Rules for filtering list messages based on their contents [CONTENT FILTER] Edit Form X-Spam-Flag: YES Action: Discard SPAM ( was Action: Moderate ) Update Did this to isajet-users and minos_sam_admin Also did minos-cdops, set spam level of 2 X-Spam-ListServ-Level:-- Action: Discard SPAM ######### # FNALU # ######### Date: Wed, 17 Sep 2008 10:22:13 -0500 (CDT) Subject: HelpDesk ticket 121790 ___________________________________________ Short Description: FNALU Batch mount needed for /grid/data,app,fermiapp and /minos/data,scratch Problem Description: fnalu-admin In order to let Minos make effective use of the FNALU Condor batch system, please mount on the interactive and worker nodes : /minos/data /minos/scratch /grid/data /grid/fermiapp /grid/app As FNALU interactive and batch nodes are all GCE managed nodes, these file systems can be exported and mounted with full RWX access. Thanks ! ___________________________________________ Date: Wed, 17 Sep 2008 10:42:32 -0500 (CDT) This ticket has been reassigned to HILL, KEVIN of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Wed, 17 Sep 2008 11:26:00 -0500 (CDT) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Fri, 26 Sep 2008 09:31:01 -0500 (CDT) Art, I've mounted /minos/data and scratch on the fnalu condor nodes, but not the /grid mounts. Wayne is going to be sending an e-mail to you clarifying or expanding on what we discussed in our meeting this week. Please remember that DSS considers this condor pool as a test set up for a couple months. 
I will wait to announce that condor is on fnalu because I will not be here next week and there will be very restricted support for it during that time. Margaret ___________________________________________ Date: Tue, 14 Oct 2008 09:02:40 -0500 (CDT) Solution: file systems were mounted on 9/21. ___________________________________________ N.B. this is only the /minos/files. /grid is not mounted. ___________________________________________ ============================================================================= 2008 09 16 ============================================================================= ######### # FNALU # ######### Testing condor submission from /local/stage1/kreymer/condor, per http://cdorg.fnal.gov/dss/condor/condor.html http://cdorg.fnal.gov/dss/condor/condor2.html Had to move to /local/stage1/kreymer/condor, submission from $HOME/condor resulted in a running job per condor, but no useful work on the worker ( FLXI09 > condor_submit probe.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 532. WARNING: File /afs/fnal.gov/files/home/room1/kreymer/condor/logs/probe/probe.532.0.err is not writable by condor. WARNING: File /afs/fnal.gov/files/home/room1/kreymer/condor/logs/probe/probe.532.0.out is not writable by condor. FLXI09 > condor_q kreymer -run -- Quill: quilld@flxi09.fnal.gov : <131.225.68.37:5432> : quill2 : 2008-09-17 11:31:03-05 ID OWNER SUBMITTED RUN_TIME HOST(S) 532.0 kreymer 9/17 11:30 0+00:01:03 slot1@flxb32.fnal.gov 32678 ? Ss 0:09 /opt/condor/sbin/condor_master 32679 ? Ss 0:34 \_ condor_startd -f 7137 ? S 0:07 \_ condor_procd -A /tmp/condor-lock.flxb320.342033398644165/procd_pipe.STARTD -S 60 -C 4716 ####### # DAQ # ####### Date: Tue, 16 Sep 2008 14:55:23 -0500 Subject: [Fwd: Hardware Service Request] System Node Name: minos-beamdata Tag Number: 106501 Manufacturer/Model: Dell Equipment Location: precision 390 Task Name: 50 Task Number: 50.01.06.04.01.01 Problem Details: Move the minos-beamdata PC to FCC computer rooms. This computer logs accelerator beam data for the Minos experiment. It's currently located in WH12NW. The form factor is a mini-tower. It is currently in the 131.225.52.xxx subnet with a fixed IP. ######### # FNALU # ######### Date: Tue, 16 Sep 2008 14:22:13 -0500 (CDT) Subject: HelpDesk ticket 121735 ___________________________________________ Short Description: LSF has been shut down as scheduled - please announce Problem Description: The FNALU LSF queues have been shut down as scheduled. Please put an announcement on the System Status / FNALU web page, and/or a login message on FNALU. Please let us know when Condor queues will be available. ___________________________________________ Date: Tue, 16 Sep 2008 16:11:40 -0500 (CDT) This ticket has been reassigned to GREANEY, MARGARET of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Tue, 16 Sep 2008 15:27:23 -0500 (CDT) From: Margaret_Greaney The condor cluster is available. I specifically asked Wayne to contact you before I made an announcement. Could you please try to call him? This condor cluster is not grid specific. I have two html documents set up that describe the way this pool is configured. These are at http://cdorg.fnal.gov/dss under the FNALU heading. I am still setting up a few more nodes for the cluster, but you should be able to login to flxi09 and look at the current set up. If you would like to have a meeting with Wayne and me I would welcome that to answer your questions if you have any. 
Also, remember that we had no budget to buy any new batch nodes yet because of the budget cuts last year. ___________________________________________ Date: Tue, 16 Dec 2008 12:32:39 -0600 (CST) Solution: motd on fnalu nodes updated; system status page updated for lsf replacement. ___________________________________________ Date: Tue, 16 Dec 2008 12:32:40 -0600 (CST) Note To Requester: Art, the helpdesk gave me permissions to update the system status page and this has been done for condor/lsf. margaret ___________________________________________ ######## # FARM # ######## Forcing out cedar_phy near recent processing, not needing concatenation, per batch meeting discussion AFSS/roundup.20080915 -F -r cedar_phy near AFSS/roundup.20080915 -F -r cedar_phy near All done SRV1> cp -a AFSS/roundup.20080915 . ########### # ROUNDUP # ########### 20080915 version hacked to filter SAMSUBS list on \.${REL}\. Restored -F option ######## # FARM # ######## Rename the n1303*.0.root files to .root SRV1> FILES=`ls /minos/data/minfarm/mcnearcat | grep .0.root | grep n1303` SRV1> printf "${FILES}\n" | wc -l 62 SRV1> cd /minos/data/minfarm/mcnearcat for FILE in ${FILES} ; do mv ${FILE} ${FILE/.0./.} ; done SRV1> ln -sf roundup.20080915 roundup # was roundup.20080728 ./roundup -n -s n130370 -r cedar_phy_bhcurv mcnear Missing subrun 29 of n13037097, failed in pass0, but have shifted to null Created a fake null pass line via SRV1> nedit /minos/data/minfarm/lists/bad_runs_mc.cedar_phy_bhcurv Test with AFSS/roundup.20080915 -n -W -s n13037097 -r cedar_phy_bhcurv mcnear PEND - have 30/31 subruns for n13037097_*_L250200N_D04.sntp.cedar_phy_bhcurv.root 8 09/08 06:25 4 26 MISS 0005 0006 0007 0008 0029 RAWF is like n13037097_0029_L250200N_D04 Changed all \.${PASS} to .${PASS} so this can match null passes RUNTAI n13037097__L250200N_D04.sntp.cedar_phy_bhcurv.root SAMSUBS n13037097__L250200N_D04.mrnt.cedar_phy_bhcurv.root:4 ============================================================================= 2008 09 15 ============================================================================= ######## # FARM # ######## n13037078_0024_L250200N_D04.sntp.cedar_phy_bhcurv.0.root is in mcnearcat. Subruns 10 and 26 have no input files All other subruns are concatenated in 00 11 27 So why is the MISS list so long, and why is there no output ? I do not see the expected HAVE n13037078__L250200N_D04 Probably because the old files have no pass number, and the new ones have pass 0. Wiped out the pend file in testing this. Ugh, hacked it back into place from the full log. Let's see how many .0.root files have .root friends MINOS26 > ls /minos/data/minfarm/mcnearcat > /tmp/mcn MINOS26 > grep '.0.root' /tmp/mcn | wc -l 66 MINOS26 > grep '.0.root' /tmp/mcn | cut -f 1 -d _ | sort -u The first 4 are CosmicMu n10032095 n10042089 n10042106 n10042115 The next 4 are L250200N_D04 n13037073 n13037078 n13037095 n13037097 for RUN in n10032095 n10042089 n10042106 n10042115 ; do sam list files \ --dim="FILE_NAME ${RUN}%L250200N_D04.sntp.cedar_phy_bhcurv.root" done No files match the given constraints. No files match the given constraints. No files match the given constraints. No files match the given constraints.
for RUN in n13037073 n13037078 n13037095 n13037097 ; do sam list files \ --dim="FILE_NAME ${RUN}%L250200N_D04.sntp.cedar_phy_bhcurv.root" done n13037073_0000_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037073_0018_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037073_0021_L250200N_D04.sntp.cedar_phy_bhcurv.root File Count: 3 Average File Size: 1.08GB Total File Size: 3.25GB Total Event Count: 24000 Files: n13037078_0000_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037078_0011_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037078_0025_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037078_0027_L250200N_D04.sntp.cedar_phy_bhcurv.root File Count: 4 Average File Size: 773.30MB Total File Size: 3.02GB Total Event Count: 22400 Files: n13037095_0000_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0003_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0006_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0012_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0014_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0018_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0020_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0023_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0028_L250200N_D04.sntp.cedar_phy_bhcurv.root File Count: 9 Average File Size: 269.94MB Total File Size: 2.37GB Total Event Count: 17600 Files: n13037097_0005_L250200N_D04.sntp.cedar_phy_bhcurv.root File Count: 1 Average File Size: 442.30MB Total File Size: 442.30MB Total Event Count: 3200 ############ # PREDATOR # ############ B080906_160000.mbeam.root Fri Sep 12 10:26:16 UTC 2008 OOPS - loon status is 139 genpy sets up loon R1.22 Created predatorbfx to just do beam_data Test manually, DFILE=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/beam_data/2008-09/B080906_160000.mbeam.root minos setup_minos -r R1.22 loon -bq ${HOME}/minos/scripts/firstlast.C ${DFILE} loon [0] Processing /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/firstlast.C... Warning in : no dictionary for class RecJobHistory is available Warning in : The StreamerInfo of class RawDataBlock read from file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/beam_data/2008-09/B080906_160000.mbeam.root has the same version (=1) as the active class but a different checksum. You should update the version to ClassDef(RawDataBlock,2). Do not try to write objects with the current class definition, the files will not be readable. MINOS26 > setup_minos -r R1.26 # root v5-16-00 - fails MINOS26 > setup_minos -r R1.28 # root v5-18-00 , good output ... root version: v05-21-03 Need to declare this root version to SAM before trying to declare files. export SAM_ORACLE_CONNECT="samdbs/password" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=online --appName=rotorooter --appVersion=v05-21-03 done New applicationFamilyId = 257 New applicationFamilyId = 94 New applicationFamilyId = 348 S08-01-10-R1-27 v5-17-08 OK S08-02-16-R1-28 v5-18-00 S08-02-24-R1-28 v5-18-00a S08-03-20-R1-28 v5-18-00a S08-04-24-R1-28 v5-19-02a S08-07-25-R1-29 v5-20-00 S08-08-28-R1-30 v5-20-00 R1.29 v5-18-00c R1.30 v5-20-00 Try running root from development, get v5-21-03 !!!
The Bottom Line : These beam_data files require R1.28, S08-01-10-R1-27 or later for reading ( root >= v5-17-08 ) ./predatorbfx B080906_080002.mbeam.root Mon Sep 15 18:20:53 UTC 2008 OOPS - run_dbu is stuck for 208, killing it OK - declared B080906_160000.mbeam.root OK - declared B080915_080001.mbeam.root DFILE=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/beam_data/2008-09/B080906_080002.mbeam.root MINOS26 > time loon -bq ${HOME}/minos/scripts/firstlast.C ${DFILE} real 4m59.461s user 0m28.788s sys 0m3.123s repeat, real 4m57.145s OK, let's hack predatorbfx to genpy -t 500 ./predatorbfx Removed predatorbfx, no longer needed ######### # GENPY # ######### Added printing of TIMEX when set. Added a few more {} in getopts parsing ########## # CONDOR # ########## Only three help gfactory pilots this morning, not the usual dozen/day All were Globus error 43: 189597.3 gfactory 9/13 14:31 0+00:00:00 H 0 0.0 glidein_startup.sh 189944.9 gfactory 9/14 22:49 0+00:00:00 H 0 0.0 glidein_startup.sh 189995.3 gfactory 9/15 03:24 0+00:00:00 H 0 0.0 glidein_startup.sh ######### # ADMIN # ######### sar data is bunk, no idle time listed, contrary to top, uptime. ARK > ssh -ax minos01.fnal.gov rpm -q sysstat sysstat-5.0.5-11.rhel4 MIN > ssh -ax minos01 cat /etc/redhat-release Scientific Linux Fermi LTS release 4.4 (Wilson) ARK > ssh -ax fnpcsrv1.fnal.gov rpm -q sysstat sysstat-5.0.5-16.rhel4 MIN > ssh -ax fnpcsrv1 cat /etc/redhat-release Scientific Linux Fermi LTS release 4.6 (Wilson) ============================================================================= 2008 09 12 ============================================================================= ########## # PARROT # ########## Test of new d141(ups) d199(minsoft) copies, with make_growfs.auto MD=/afs/fnal.gov/files/data/minos PD=/minos/scratch/parrot $ time ./make_growfs.auto -k ${MD}/d141 WOW, lots of broken symliks, both relative and to /fnal/ups/... real 9m10.773s user 0m54.767s sys 1m39.739s $ time ./make_growfs.auto -k ${MD}/d199 ####### # DAQ # ####### DNS problems continue with fnsrv0 / 131.225.8.120, according to the helpdesk. for ROLE in rc evd om acnet beamdata gateway-nd ; do echo ${ROLE} ssh -l minos minos-${ROLE} cat /etc/resolv.conf done All are ; generated by /sbin/dhclient-script search fnal.gov dhcp.fnal.gov nameserver 131.225.17.150 nameserver 131.225.8.120 Same for my desktop ######### # ADMIN # ######### xbhuang account - created this for xiaobo ####### # CRL # ####### CRL not responding well, ( no login, or content ) Mail to gysin : Date: Fri, 12 Sep 2008 17:22:50 +0000 (GMT) From: Arthur Kreymer To: gysin@fnal.gov Subject: Minos CRL Sorry not to have gotten back to you sooner. I presume that you have now seen the helpdesk ticket 121520. In addition, just recently, we seem to have additional Minos CRL problems. I cannot login : To login you must be authenticated: Login for kreymer was invalid - please try again. If the problem persists, contact your CRL Administrator to verify your user name, password, and your status as an active, remote user. User name: Password: And when I try to view http://www-minoscrl2.fnal.gov/minos/Index.jsp, I get two header bars, and no content. The minos-mysql1 database server looks normal to me, and has crl connections. 
######## # FARM # ######## SRV1> ./looper '-r cedar_phy_bhcurv mcnear' & Fri Sep 12 12:04:44 CDT 2008 ZAPPING BAD n13037415_0009_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037415_0009_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037415_0009_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037436_0005_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 ZAPPING BAD n13037436_0005_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 ZAPPING BAD n13037436_0005_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 This ran out of steam, to many pending files, restarted as : SRV1> ./looper '-b 10000 -r cedar_phy_bhcurv mcnear' & Fri Sep 12 16:00:08 CDT 2008 ####### # CRL # ####### Where to go for help ( ticket, mail ??? ) not on CRL or HELP Help spelling ( hyrarchy of topics ) ############ # PREDATOR # ############ B080906_160000.mbeam.root Fri Sep 12 10:26:16 UTC 2008 OOPS - loon status is 139 OOPS - cannot read B080906_160000.sam.py ####### # NET # ####### The Primary DNS server fnsrv0 131.225.8.120 failed last night. It was on the network, but not providing DNS service. Test : host www.fnal.gov 131.225.8.120 # failed host www.fnal.gov 131.225.17.150 # succeeded Problems : Predator : genpy failed EVD stopped working CRL stopped responding Ticket #: 121520 - closed, not a CRL issue, was database Helpdesk expert login and ticket submission failed. Ticket #: 121521 ( closed in Oct, no real cause found ) DocDB failed Ticket #: 121522 ( closed 2009 Jan 29, as is ) MRTG network monitoring data disappeared around 22:45 CDT, back at 03:30 fnsrv0 monitoring was back at 04:00 Ticket #: 121529 - closed 13 Oct 2008 Solution: darryl@fnal.gov sent this solution: The MRTG monitors are configured to use fnsrv1 as a secondary DNS server. However, there is insufficient evidence that the DNS failover mechanism on an MRTG host would have singlehandedly prevented loss of data, as there are ongoing external factors affecting host performance. DCache was down, no data transfers in FNDCA CDFDCA seems to have stayed up, but not tape activity pnfslog times were over 3 minutes, not 3 seconds. Strangely, Daq kftp writes to DCache claim to have succeeded. Ticket #: 121533 Ticket 121506 9/12/2008 6:59:14 AM by plunk Resolved 9/12/2008 9:12:32 AM rader@fnal.gov sent this solution: FNSRV0 dns server needed a reboot. Looking into the cause of the failure... ============================================================================= 2008 09 11 ============================================================================= xbhuang account /fermilab/mions VO entries ######## # GRID # ######## Sweeping up all Minos folk into /fermilab/minos VO MINOS26 > ypcat passwd | cut -f 5 -d : > /tmp/names sort /tmp/names -k 2,2 > /tmp/namess Registering them one at a time , cp /tmp/namess /minos/scratch/kreymer/namess Completed this tomorrow ( 9/12 ), about 152 total users.. Got note from Yocum : Date: Fri, 12 Sep 2008 15:30:26 -0500 From: Dan Yocum I notice that you're adding a lot of suspended members to the minos group. 
For instance: Brandon Sielhan Vitali Semenov Christopher Smith Philip Symes Edward Tetteh-Lartey Carol Ward Qun Wu Hai Zheng Anyone who only has a DN with '../UID=...' has been suspended for a long time. ######### # ADMIN # ######### Make sure everyone is in group e875 5111 MINOS26 > GPS=`ypcat passwd | cut -f 4 -d : | sort -u` MINOS26 > for GP in ${GPS} ; do grep ${GP} /tmp/group ; done g020:x:1525:kreymer epp:x:1535: e791:x:1720:cjames e781:x:1747: oss:x:5023: us_cms:-:5063:gaines,odell e875:x:5111:kreymer,pgouffon,bishai,cjames,jyuko,rbpatter lsfadm:-:5443: Wow, there are 64 users not in the e875/5111 group ! Here are the apparent Minos users who need addition, beyond present kreymer,pgouffon,bishai,cjames,jyuko,rbpatter GUSERS=' ahimmel baller bock costas cwhite dave_b dawson diwan djensen erwin escobar grossman hartouni jkn joffem jpaley kafka kschu kulik llhsu mmichel3 moeller murgia naples nevans niki paolone para qkwu rtoner shanahan thosieck tzanakos verebryu ' Date: Thu, 11 Sep 2008 16:48:37 -0500 (CDT) Subject: HelpDesk ticket 121501 ___________________________________________ Short Description: Please add users to groups list Problem Description: We are preparing to scale up our grid usage. Users need to write to group protected areas under /minos/data/... Quite a few Minos Cluster users are not in the e875/5111 group, and whose group id is not 5111. The e875 list is presently : MINOS26 > ypcat group | grep e875 e875:x:5111:kreymer,pgouffon,bishai,cjames,jyuko,rbpatter Please add these : ahimmel baller bock costas cwhite dave_b dawson diwan djensen erwin escobar grossman hartouni jkn joffem jpaley kafka kschu kulik llhsu mmichel3 moeller murgia naples nevans niki paolone para qkwu rtoner shanahan thosieck tzanakos verebryu ___________________________________________ Date: Fri, 12 Sep 2008 09:27:25 -0500 (CDT) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. ___________________________________________ ___________________________________________ ######## # FARM # ######## Let's just nibble away at CPB mcnearcat for a while. SRV1> ls /minos/data/minfarm/mcnearcat | grep n13047 | wc -l 7452 SRV1> ls /minos/data/minfarm/mcnearcat | grep n13037 | wc -l 2332 SRV1> ls /minos/data/minfarm/mcnearcat | grep L250 | wc -l 296 First some L250 items, then n13037 ./looper '-b 2000 -s L250 -r cedar_phy_bhcurv mcnear' & Thu Sep 11 14:30:55 CDT 2008 ############ # FERMIAPP # ############ MINOS25 > mkdir /grid/fermiapp/minos MINOS25 > chgrp e875 /grid/fermiapp/minos MINOS25 > chmod 775 /grid/fermiapp/minos MINOS25 > mkdir /grid/fermiapp/minos/kreymer MIN > for NODE in ${NODES} ; do printf "${NODE} " ; ssh -ax ${NODE} touch /grid/fermiapp/minos/kreymer/${NODE} ; done minos01 touch: cannot touch `/grid/fermiapp/minos/kreymer/minos01': No such file or directory minos02 touch: cannot touch `/grid/fermiapp/minos/kreymer/minos02': Read-only file system ... 
minos25 minos26 touch: cannot touch `/grid/fermiapp/minos/kreymer/minos26': Read-only file system MIN > for NODE in ${NODES} ; do printf "${NODE} " ; ssh -ax ${NODE} touch /grid/app/minos/test/${NODE} ; done Updated ticket accordingly ######### # ADMIN # ######### Four links under http://computing.fnal.gov/xms/Internal/Budget_%26_Finance ICON - Web Interface for Electronic Purchase Request http://fncdug1.fnal.gov/miser/ redirects a new window to https://appora.fnal.gov/pls/cert/miscomp.miser.html LINK - Create/Edit Purchase Requisition https://appora.fnal.gov/pls/cert/miscomp.miser.html ICON - Puchase Requisition Query https://miscomp.fnal.gov/miser/req-query.html redirects a new window to https://appora.fnal.gov/pls/cert/miscomp.miser.html LINK - Query Purchase Requisition https://appora.fnal.gov/miser_ora/www/req-query.html Followed this latter link, got query form, Put in CD105723 and/or CD%105723 Get a result which should have had the Lab_Number and PO_Number These fields were blank. ######## # FARM # ######## UGH, looking closely at concatenation, why would this be written? In /pnfs/minos/mcin_data/near/daikon_04/L250200N/709/n13037094*, all but subrun 1 are present. SRV1> ./roundup -n -s L250 -r cedar_phy_bhcurv mcnear ... MISS n13037094_*._L250200N_D04.mrnt.cedar_phy_bhcurv.root 0006 0007 0020 0021 0022 0023 0029 0030 OOPS - SUBRUN gap 1 to 1 OK adding n13037094_0000_L250200N_D04.mrnt.cedar_phy_bhcurv.root 1 OOPS - SUBRUN gap 6 to 7 OK adding n13037094_0002_L250200N_D04.mrnt.cedar_phy_bhcurv.root 4 OOPS - SUBRUN gap 20 to 23 OK adding n13037094_0008_L250200N_D04.mrnt.cedar_phy_bhcurv.root 12 OK adding n13037094_0024_L250200N_D04.mrnt.cedar_phy_bhcurv.0.root 5 SRV1> ./roundup -n -s n13037094 -r cedar_phy_bhcurv mcnear Same behaviour, would write this sntp and mrnt, in spite of gaps. SRV1> ./roundup -n -v -s n13037094 -r cedar_phy_bhcurv mcnear Note that the reco files are both .0.root and .root HAVE n13037094__L250200N_D04.mrnt.cedar_phy_bhcurv.root:8 HAVE n13037094__L250200N_D04.sntp.cedar_phy_bhcurv.root:8 MINOS26 > ls /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L250200N/sntp_data/709 | grep n13037094 n13037094_0006_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037094_0020_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037094_0029_L250200N_D04.sntp.cedar_phy_bhcurv.root So this is legitemate, filling in missing subruns. ########## # DCACHE # ########## Date: Thu, 11 Sep 2008 11:14:04 -0500 (CDT) Subject: HelpDesk ticket 121458 ___________________________________________ Short Description: Additional unsecured dcap doors needed FNDCA Problem Description: There are presently only two unsecured dcap doors in FNDCA, each capable of a few hundred connections. Minos is now routinely running over 400 analysis jobs. We plan to scale up toward several thousand jobs running on Fermigrid. We will clearly need more doors. Please add a few more doors now, if technically possible. Please contact us ( minos-data ) to prepare a long term plan. 
___________________________________________ Date: Mon, 22 Sep 2008 14:12:47 -0500 (CDT) From: Dmitry Litvintsev I started two additional dcap doors on fndca1: port numbers : 24137,24138 ___________________________________________ ___________________________________________ ######## # FARM # ######## Date: Thu, 11 Sep 2008 14:34:10 +0100 From: Robert Pittam To: Arthur Kreymer , Alex Sousa Subject: RE: Cedar cedar_phy differences In a similar vein as the last email there are some near detector files for which there are a cedar_phy_bhcurv version but no cedar_phy equivalent. Some of them exist in cedar as well. I checked /minos/data/minfarm/nearcat/ But theres no sign of them. Any idea why they're missing? Jul 05 N00008046_0000.spill.sntp.cedar_phy_bhcurv.0.root Oct 06 N00011134_0000.spill.sntp.cedar_phy_bhcurv.0.root N00011134_0017.spill.sntp.cedar_phy_bhcurv.0.root Dec 06 N00011437_0000.spill.sntp.cedar_phy_bhcurv.0.root Jan 07 N00011468_0000.spill.sntp.cedar_phy_bhcurv.0.root Feb 07 N00011710_0000.spill.sntp.cedar_phy_bhcurv.0.root Apr 07 N00012074_0000.spill.sntp.cedar_phy_bhcurv.0.root N00012083_0000.spill.sntp.cedar_phy_bhcurv.0.root SUBS='N00008046_0000 N00011134_0000 N00011134_0017 N00011437_0000 N00011468_0000 N00011710_0000 N00012074_0000 N00012083_0000' MINOS26 > for SUB in ${SUBS} ; do grep ${SUB} /minos/data/minfarm/lists/bad_runs.cedar_phy ; done N00008046_0000.0 2005-07 8 2 2007-06-01 13:37:07 fnpc269 N00011468_0000.0 2007-01 106 2 2007-06-10 16:08:50 fnpc282 N00012074_0000.0 2007-04 2 2008-07-01 15:47:06 fnpc219 N00012083_0000.0 2007-04 2 2008-07-01 15:53:51 fnpc183 MINOS26 > for SUB in ${SUBS} ; do grep ${SUB} /minos/data/minfarm/lists/good_runs.cedar_phy ; done N00011134_0000.0 2006-10 101468 2007-05-15 09:22:28 fnpc237 N00011134_0000.1 2006-10 101468 2007-05-18 15:39:42 fnpc226 N00011134_0017.0 2006-10 100520 2007-05-15 08:36:42 fnpc242 N00011134_0017.1 2006-10 100520 2007-05-18 16:20:20 fnpc279 N00011437_0000.0 2006-12 100792 2007-10-12 17:18:27 fnpc136 N00011710_0000.0 2007-02 99755 2007-06-11 05:02:27 fnpc222 grep N00011134_0000.spill.*sntp */cedar_phynear.log __________________________________________________ Howie is resubmitting the missing runs. __________________________________________________ __________________________________________________ ============================================================================= 2008 09 10 ============================================================================= ######### # ADMIN # ######### OLD - Subject: HelpDesk ticket 118265 MINOS01 > cmd add_minos_user jcravens Creating account... /var/yp gmake[1]: Entering directory `/var/yp/minos' gmake[1]: `ypservers' is up to date. gmake[1]: Leaving directory `/var/yp/minos' gmake[1]: Entering directory `/var/yp/minos' Updating passwd.byname... Updating passwd.byuid... Updating netid.byname... gmake[1]: Leaving directory `/var/yp/minos' Adding user to Minos AFS group... Sending mail to subscribe to minos-user mailing list ... Sending email to user... MINOS01 > ypcat passwd | grep jcravens jcravens:KERBEROS:43513:5111:John Cravens:/afs/fnal/files/home/room2/jcravens:/usr/local/bin/tcsh The user got /usr/local/bin/tcsh rather than /bin/bash send mail to jonest ######### # ADMIN # ######### CD105723 https://appora.fnal.gov/pls/cert/miscomp.miser.html?action=view&cd_req_number=CD105723&submit_label=Go! 
State Entry Requires Role Exit Via Transition By Actor Comments E-Ready 23-Jul-2008 15:34:50.406 Terminal State CheckOut 22-Jul-2008 15:28:25.855 CheckOut_Approver 23-Jul-2008 15:34:50.392 E-Ready cbruce ######## # FARM # ######## Start to work again on CPB mcnear, howie is doing cleanup runs. The 10K file backlog is too-too much, Breaking it down. SRV1> ls /minos/data/minfarm/mcnearcat/n1303709* | wc -l 288 SRV1> ls /minos/data/minfarm/mcnearcat/n130374* | wc -l 1196 SRV1> ./roundup -n -s n1303709 -r cedar_phy_bhcurv mcnear This mostly added files SRV1> ./roundup -b 2000 -n -s n130374 -r cedar_phy_bhcurv mcnear This almost entirely added, with a few ZAPPED ZAPPING BAD n13037415_0009_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037415_0009_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037415_0009_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037436_0005_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 ZAPPING BAD n13037436_0005_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 ZAPPING BAD n13037436_0005_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 ./looper '-b 2000 -s n130374 -r cedar_phy_bhcurv mcnear' & Wed Sep 10 15:46:23 CDT 2008 ######## # FARM # ######## From: Arthur Kreymer To: rubin@fnal.gov Two files are in mcfarcat which are in bad_runs_mc.cedar_phy_bhcurv ZAPPING BAD f21438026_0000_M100200N_D04_helium.mrnt.cedar_phy_bhcurv.0.root f21438026_0000_M100200N_D04_helium.0 136 2008-09-04 21:34:47 fcdfcaf1605 ZAPPING BAD f21438026_0000_M100200N_D04_helium.sntp.cedar_phy_bhcurv.0.root f21438026_0000_M100200N_D04_helium.0 136 2008-09-04 21:34:47 fcdfcaf1605 The file times seem to predate the entries in bad_runs : -rw-rw-r-- 1 minospro numi 109085210 Sep 3 16:45 f21438026_0000_M100200N_D04_helium.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 126672629 Sep 3 16:44 f21438026_0000_M100200N_D04_helium.sntp.cedar_phy_bhcurv.0.root The candiate is similar : -rw-r--r-- 1 minospro e875 1596835361 Sep 3 17:58 /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_04/M100200N_helium/cand_data/802/f21438026_0000_M10 0200N_D04_helium.cand.cedar_phy_bhcurv.0.root Date: Wed, 10 Sep 2008 12:45:47 -0500 From: Howard Rubin To: Arthur Kreymer Subject: Re: Two 'bad' files in mcfarcat I can't be sure I understand this, but I have a possible scenario. Suppose this is one of those cases where a job was spontaneously restarted -- or perhaps not spontaneously because there were several jobs as mentioned by Steve in the meeting where they finished but held on termination. He released them but they restarted from the beginning and reran. If, on the second pass, they hit the 'random' failure, they might have failed, producing the bad_runs_mc entry. In fact, this seems to be borne out by the existence of a line in the good_runs_mc file: f21438026_0000_M100200N_D04_helium.0 2008-09-03 16:45:00 fcdfcaf1573 Since the pass number is determined upon submission, not upon processing, the pass for both processes would be 0. The operative procedure would be to delete the line(s) from bad_runs_mc. Do you want to do it or should I? 
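Before editing anything, a quick cross check in the spirit of the good_runs/bad_runs greps above for cedar_phy. A sketch only ; it assumes the _mc lists live in the usual /minos/data/minfarm/lists area :

RUN=f21438026_0000_M100200N_D04_helium
grep ${RUN} /minos/data/minfarm/lists/good_runs_mc.cedar_phy_bhcurv
grep ${RUN} /minos/data/minfarm/lists/bad_runs_mc.cedar_phy_bhcurv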
Rubin re activity : If /grid/app/minos/scripts is on your path, lj s will give you the current activity. If Matt's jobs are running there may be some formatting errors, but the final count will be correct. SRV1> /grid/app/minos/scripts/lj 0 jobs running. The ENSTORE write pool contains 0 files at 13:49 on 09/10/08. Updated the bad_runs file. fnpcsrv1% ls -l bad_runs_mc.cedar_phy_bhcurv -rw-rw-r-- 1 rubin numi 13465 Sep 9 17:20 bad_runs_mc.cedar_phy_bhcurv fnpcsrv1% cp -a bad_runs_mc.cedar_phy_bhcurv bad_runs_mc.cedar_phy_bhcurvnew fnpcsrv1% nedit bad_runs_mc.cedar_phy_bhcurvnew fnpcsrv1% diff bad_runs_mc.cedar_phy_bhcurvnew bad_runs_mc.cedar_phy_bhcurv 162a163 > f21438026_0000_M100200N_D04_helium.0 136 2008-09-04 21:34:47 fcdfcaf1605 fnpcsrv1% mv bad_runs_mc.cedar_phy_bhcurvnew bad_runs_mc.cedar_phy_bhcurv ######## # FARM # ######## MINOS26 > ./samdup /minos/data/minfarm/mcnearcat ######## # GRID # ######## Date: Wed, 10 Sep 2008 10:03:29 -0500 (CDT) Subject: HelpDesk ticket 121371 ___________________________________________ Short Description: Please mount /grid/fermiapp on Minos Cluster and Servers run2sys : The existing /grid/app application are is to assume a new role in Fermigrid, such that we will need to reinstall our software in a new area, /grid/farmiapp. Please mount /grid/fermiapp on Minos Cluster nodes minos02 through minos26 and on the Minos SAM servers minos-sam01 minos-sam02 minos-sam03 The mount should be similar to that of /grid/app : blue2:/fermigrid-app /grid/app nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-fermiapp /grid/fermiapp nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 ___________________________________________ Date: Wed, 10 Sep 2008 10:08:00 -0500 (CDT) Subject: Your ticket 121371 has been reassigned to SCOTT, RENNIE ___________________________________________ Date: Thu, 11 Sep 2008 16:53:04 -0500 (CDT) Solution: Request completed. ============================================================================= 2008 09 09 ============================================================================= ######### # ADMIN # ######### reviewed status of requisition 14-Aug-2008 ALLEN 203579 CD105749.1 CD105749 for FL/CD/SCF/FEF Storage and Servers for Minos PO 582085 Page 4 2U Dual Intel Xeon E5430 2.66GHz General Rack computer Server Promised Date: 06-Oct-2008 Deliver To: ALLEN, JASON M 4.00 EACH 3,422.00 PO 203579 13,688.00 582126 Configuration # 1 - TagmaStore Adaptable Storage and TagmaStore Workgroup Storage Hardware - 30TB Additional Capacity Promised Date: 06-Oct-2008 Deliver To: ALLEN, JASON M 2.0 EACH 14,000.0 PO 203579 28,000.00 Project CD Operations Task MINOS-COMP-OP Task Number 50.01.06.04.01.01 Exp. Org CD - FERMILAB EXPERIMENTS FACILITIES Exp. Type MATERIAL PURCHASES Task Org CD - RUNNING EXPERIMENTS Service Type OP-EXST PRGM OP-DET ########## # PARROT # ########## Added MINOS_EXTERNAL and release_data to mountfile.grow, renaming the mountfile.MX.grow previously tested. $ diff mountfile.3119d120MX.grow mountfile.grow 3d2 < /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL /grow/www-numi.fnal.gov/computing/parrot/MINOS_EXTERNAL 5d3 < /afs/fnal.gov/files/data/minos /grow/www-numi.fnal.gov/computing/parrot/release_data $ ln -sf mountfile.3119d120MX.grow mountfile.grow $ date Tue Sep 9 14:50:33 CDT 2008 $ pwd /grid/app/minos/parrot ######### # DOCDB # ######### Added Mark Messier 581879, group numirw Actual actions are : Find Name/ID in list, and click it. 
Select Action: Modify Verify User Click on Modify Personal Account. Instructions say to Select, nonesuch. ######## # FARM # ######## Pushing out mcfar CPB helium files ./looper '-r cedar_phy_bhcurv mcfar' & ######## # FARM # ######## Repeated dccp tests, per Ken S request SRV1> cd /minos/data/minfarm/mcnear SRV1> FILE=n13047014_0025_L010185N_D04.cand.cedar_phy_bhcurv.1.root SRV1> setup dcap -q x509 SRV1> source /usr/local/vdt/setup.sh SRV1> export X509_USER_PROXY=/export/stage/minfarm/.grid/x509up_u1334 SRV1> dccp ${FILE} \ dcap://fndca1.fnal.gov:24536/pnfs/fnal.gov/usr/minos/NULL/${FILE} Error ( POLLIN) (with data) on control line [6] Failed to create a control line Error ( POLLIN) (with data) on control line [7] Failed to create a control line Failed open file in the dCache. Can't open destination file : Server rejected "hello" System error: Input/output error SRV1> voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. Error: VOMS extension not found! subject : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990/CN=2146134877 issuer : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 identity : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 type : unknown strength : 512 bits path : /export/stage/minfarm/.grid/x509up_u1334 timeleft : 358:01:23 I think I need a cert with production role. ######### # FNALU # ######### Still on schedule for shutdown next week. For the record, lest we forget : MINOS26 > bqueues QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP debug 99 Open:Active - 5 1 - 0 0 0 0 test 98 Open:Active - 15 1 - 0 0 0 0 30min 10 Open:Active - - 1 - 0 0 0 0 4hr 8 Open:Active - - 1 - 6 0 6 0 12hr 6 Open:Active - - 1 - 1 0 1 0 1day 4 Open:Active - - 1 - 0 0 0 0 selex 4 Open:Active - 5 1 - 0 0 0 0 minos 4 Open:Active - - 1 - 0 0 0 0 1day_ex 4 Open:Active - 4 1 - 0 0 0 0 4day 2 Open:Active - 5 1 - 0 0 0 0 8day 1 Open:Active - 2 1 - 0 0 0 0 MINOS26 > bhosts HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV flxb16.fnal.gov unavail - 4 1 1 0 0 0 flxb17.fnal.gov ok - 4 0 0 0 0 0 flxb18.fnal.gov ok - 4 0 0 0 0 0 flxb19.fnal.gov ok - 4 0 0 0 0 0 flxb20.fnal.gov unavail - 4 0 0 0 0 0 flxb21.fnal.gov unavail - 4 0 0 0 0 0 flxb22.fnal.gov unavail - 4 0 0 0 0 0 flxb23.fnal.gov unavail - 4 0 0 0 0 0 flxb24.fnal.gov unavail - 4 1 1 0 0 0 flxb25.fnal.gov ok - 4 0 0 0 0 0 flxb26.fnal.gov closed - 4 2 2 0 0 0 flxb27.fnal.gov ok - 4 0 0 0 0 0 flxb28.fnal.gov closed - 4 2 2 0 0 0 flxb29.fnal.gov ok - 4 0 0 0 0 0 flxb30.fnal.gov ok - 4 0 0 0 0 0 flxb31.fnal.gov ok - 4 0 0 0 0 0 flxb32.fnal.gov ok - 4 0 0 0 0 0 flxb33.fnal.gov ok - 4 0 0 0 0 0 flxb34.fnal.gov ok - 4 0 0 0 0 0 flxb35.fnal.gov ok - 4 0 0 0 0 0 flxi04.fnal.gov unavail - 1 0 0 0 0 0 flxi06.fnal.gov ok - 2 0 0 0 0 0 flxi07.fnal.gov unavail - 2 0 0 0 0 0 fsui03.fnal.gov unavail - 5 0 0 0 0 0 minos14.fnal.gov unavail - 2 0 0 0 0 0 minos15.fnal.gov unavail - 2 0 0 0 0 0 minos16.fnal.gov unavail - 2 0 0 0 0 0 minos17.fnal.gov unavail - 2 0 0 0 0 0 minos18.fnal.gov unavail - 2 0 0 0 0 0 minos19.fnal.gov unavail - 2 0 0 0 0 0 minos20.fnal.gov unavail - 2 0 0 0 0 0 minos21.fnal.gov unavail - 2 0 0 0 0 0 minos22.fnal.gov unavail - 2 0 0 0 0 0 minos23.fnal.gov unavail - 2 0 0 0 0 0 minos24.fnal.gov unavail - 2 0 0 0 0 0 minos25.fnal.gov unavail - 2 0 0 0 0 0 minos26.fnal.gov unavail - 2 0 0 0 0 0 ============================================================================= 2008 09 08 ============================================================================= ######## # MAIL # 
######## Removed RFC2369 headers from lists for which they are not appropriate, to eliminate the PINE messages [ Note: This message contains email list management information ] To disable the headers, added to the head of the options list, Misc-Options= NO_RFC2369 minos-admin minos-docdb minos_sam_admin minos_sam_users Need to get ownership of some other lists minosdb-support MINOS-ACCOUNTS ? ######## # MCIN # ######### Checked sized, for budget planning MINOS26 > du -sh /pnfs/minos/mcin_data/near/daikon* 7.3T /pnfs/minos/mcin_data/near/daikon_00 33G /pnfs/minos/mcin_data/near/daikon_01 2.2T /pnfs/minos/mcin_data/near/daikon_03 15T /pnfs/minos/mcin_data/near/daikon_04 1.0K /pnfs/minos/mcin_data/near/daikon_05 ######### # MYSQL # ######### export PRODUCTS=/${HOME}/ups/db RE=/home/minsoft/restore/20080902/offline OF=/home/minsoft/database//offline cp -va ${RE}/BEAMMONCUTS.* ${OF}/ SOFT03 > ups start mysql SOFT03 > ups status mysql Setup:mysql datadir = /home/minsoft/database Setup:port=3306; socket=/home/minsoft/database/mysql.sock Check mysqld status: Uptime: 25 Threads: 1 Questions: 1 Slow queries: 0 Opens: 12 Flush tables: 1 Open tables: 6 Queries per second avg: 0.040 export MYSQL_PWD mysqladmin processlist -u root mysql> show tables ; +-------------------+ | Tables_in_offline | +-------------------+ | BEAMMONCUTS | +-------------------+ mysql> show columns from BEAMMONCUTS ; +-------------+---------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------+---------+------+-----+---------+-------+ | SEQNO | int(11) | NO | PRI | 0 | | | ROW_COUNTER | int(11) | NO | PRI | 0 | | | CUTVALUES | text | YES | | NULL | | +-------------+---------+------+-----+---------+-------+ 3 rows in set (0.00 sec) mysql> show index from BEAMMONCUTS ; +-------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | +-------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ | BEAMMONCUTS | 0 | PRIMARY | 1 | SEQNO | A | NULL | NULL | NULL | | BTREE | | | BEAMMONCUTS | 0 | PRIMARY | 2 | ROW_COUNTER | A | 12 | NULL | NULL | | BTREE | | +-------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ 2 rows in set (0.00 sec) mysql> check table BEAMMONCUTS ; +---------------------+-------+----------+-------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +---------------------+-------+----------+-------------------------------------------------------+ | offline.BEAMMONCUTS | check | warning | 1 client is using or hasn't closed the table properly | | offline.BEAMMONCUTS | check | status | OK | +---------------------+-------+----------+-------------------------------------------------------+ 2 rows in set (0.00 sec) ============================================================================= 2008 09 06 ============================================================================= ###### # WH # ###### Power out 06:00 to 18:00 ########## # CONDOR # ########## MINOS25 > condor_q gfactory 351 jobs; 16 idle, 248 running, 87 held MINOS25 > IDLES=`condor_q gfactory -hold | grep gfactory | cut -f 1 -d ' '` MINOS25 > date Sat Sep 6 02:41:48 CDT 2008 for IDLE in ${IDLES} ; 
do condor_release ${IDLE} ; sleep 10 ; done ============================================================================= 2008 09 05 ============================================================================= ########### # MINOS26 # ########### ./vault near 2008-08 ============================================================================= 2008 09 04 ============================================================================= ######## # FARM # ######## SRV1> ./roundup -b 2000 -r cedar_phy far SRV1> ./roundup -b 2000 -r cedar_phy far OK - processing 351 files ####### # DAQ # ####### [minos@dcsdcp ~]$ cat /dcsdata/logs/archiver.log /home/minos/kftp/v3_5/NULL/lib/gssftp.py:1: RuntimeWarning: Python C API version mismatch for module gss: This Python has API version 1012, module gss has version 1011. import gss Traceback (most recent call last): File "/home/minos/bin/archiver_krb.py", line 395, in ? os.remove(lock_file) OSError: [Errno 2] No such file or directory: '/var/lock/dcs/archiver.pid' [minos@dcsdcp ~]$ ls -l /var/lock/dcs/ total 0 -rw-r--r-- 1 minos e875 0 Sep 4 15:40 archiver.pid -r--r----- 1 minos e875 0 Sep 4 15:22 dcs_mysql2rotod.lock Checking out the near detector [minos@dcsdcp-nd dcsdata]$ cat /var/lock/dcs/archiver.pid 3046 [minos@dcsdcp-nd dcsdata]$ ps -f -p 3046 UID PID PPID C STIME TTY TIME CMD minos 3046 1 0 Jun27 ? 00:00:01 python /home/minos/bin/archiver_krb.py [minos@dcsdcp-nd dcsdata]$ /etc/init.d/archiver status Archiver is running This looks like the classic empty pid file, try clearing it FDCS > ls -l /dcsdata/archiver/data-to-archive total 0 -rw-r--r-- 1 minos e875 0 Jan 1 2007 F070101_163119.mdcs.root -rw-r--r-- 1 minos e875 0 Aug 28 19:00 F080829_000008.mdcs.root -rw-r--r-- 1 minos e875 0 Aug 29 19:00 F080830_000009.mdcs.root -rw-r--r-- 1 minos e875 0 Aug 30 19:00 F080831_000004.mdcs.root -rw-r--r-- 1 minos e875 0 Aug 31 19:00 F080901_000006.mdcs.root -rw-r--r-- 1 minos e875 0 Sep 1 19:00 F080902_000012.mdcs.root -rw-r--r-- 1 minos e875 0 Sep 2 19:00 F080903_000013.mdcs.root -rw-r--r-- 1 minos e875 0 Sep 3 19:00 F080904_000001.mdcs.root FDCS > ls -l /dcsdata/2007/F070101* -rw-r--r-- 1 minos e875 11273 Dec 31 2006 /dcsdata/2007/F070101_000034.mdcs.root -rw-r--r-- 1 minos e875 1303994 Jan 1 2007 /dcsdata/2007/F070101_000057.mdcs.root -rw-r--r-- 1 minos e875 721315 Jan 23 2007 /dcsdata/2007/F070101_170032.mdcs.root FDCS > mkdir /dcsdata/archiver/data-to-archivestray/ FDCS > mv /dcsdata/archiver/data-to-archive/F070101_163119.mdcs.root /dcsdata/archiver/data-to-archivestray/F070101_163119.mdcs.root FDCS > /etc/init.d/archiver start Starting archiver FDCS > cat logs/archiver.log /home/minos/kftp/v3_5/NULL/lib/gssftp.py:1: RuntimeWarning: Python C API version mismatch for module gss: This Python has API version 1012, module gss has version 1011. import gss MINOS26 > dds /pnfs/minos/far_dcs_data/2008-08 -rw-r--r-- 1 buckley e875 2510149 Aug 27 23:40 F080827_000002.mdcs.root -rw-r--r-- 1 buckley e875 2518981 Sep 4 15:38 F080828_000010.mdcs.root -rw-r--r-- 1 buckley e875 2494006 Sep 4 15:55 F080829_000008.mdcs.root I think the archiver still has the 10 minute cycle time. Files are continuing to move. On Thu, 4 Sep 2008, Alec T. Habig wrote: > Art Kreymer writes: >> File "/home/minos/bin/archiver_krb.py", line 395, in ? >> os.remove(lock_file) >> OSError: [Errno 2] No such file or directory: '/var/lock/dcs/archiver.pid' > > This was when I was trying to clean up processes, and had deleted the > lockfile but not killed the zombie process. 
> > Interestingly, I haven't been able to get the scripts to write to that > logfile since, although I can run the archiver manually (without the > startup scripts). The startup scripts do nothing. > > The data's there to be archived, flag files are in /dcsdata/archiver > > I did try and fixing ./dcs_startup/dcs_init_functions, which as far as I > could tell was trying to invoke the archiver with a nonexistant path. > diff it with ./dcs_startup/dcs_init_functions.cya to see what I mean. I did restart the archiver one more time, after removing a stale /var/lock/dcs/archiver.pid It has transferred two files so far, so we should be in business. I did correct one other problem. There was a tag file /dcsdata/archiver/data-to-archive/F070101_163119.mdcs.root for which there was no corresponding file in /dcsdata Perhaps this was tripping things up ( time bomb triggered by new python ? ) ############ # PREDATOR # ############ Dealing with effects of the full disk Bad files for N00014791_0007.mdaq.root N00014791_0008.mdaq.root N00014791_0009.mdaq.root N00014791_0010.mdaq.root F00041897_0009.mdaq.root Thu Sep 4 06:09:49 UTC 2008 F00041897_0010.mdaq.root Thu Sep 4 06:10:39 UTC 2008 F00041897_0011.mdaq.root Thu Sep 4 08:09:24 UTC 2008 F00041897_0012.mdaq.root Thu Sep 4 08:10:19 UTC 2008 B080903_080001.mbeam.root Thu Sep 4 10:11:14 UTC 2008 cat: write error: No space left on device ? B080903_160001.mbeam.root Thu Sep 4 10:11:59 UTC 2008 cat: write error: No space left on device ? B080904_000001.mbeam.root Thu Sep 4 10:12:38 UTC 2008 N080903_000002.mdcs.root Thu Sep 4 10:13:20 UTC 2008 cd /local/scratch26/kreymer/genpy/beam_data/2008-09 rm B080903_080001.sam.py B080903_160001.sam.py B080904_000001.sam.py cd /local/scratch26/kreymer/genpy/fardet_data/2008-09 for SR in 09 10 11 12 13 14 15 16 ; do rm F00041897_00${SR}.sam.py ; done cd /local/scratch26/kreymer/genpy/near_dcs_data/2008-09 rm N080903_000002.sam.py cd /local/scratch26/kreymer/genpy/neardet_data/2008-09 for SR in 07 08 09 10 11 12 13 14 ; do rm N00014791_00${SR}.sam.py ; done These were picked up on the 15:06 cycle Now pick up the beam and dcs MINOS26 > ./predator 2008-09 ########### # MINOS26 # ########### Disk was filled, due to monthly vault copies combined with jdejong use of this disk MINOS26 > du -sm /pnfs/minos/neardet_data/2008-08 88047 /pnfs/minos/neardet_data/2008-08 Just the neardet failed, the far was OK. MINOS26 > du -sm /local/scratch26/jdejong 130137 /local/scratch26/jdejong MINOS26 > du -sm * 1 fardet_data 48262 neardet_data Checking other big users, mainly mindata/MOVED MINOS26 > du -sm * 1692 141 1 CRON 22195 MOVED mindata $ rm -r /local/scratch26/mindata/MOVED kreymer MINOS26 > cd /local/scratch26/kreymer/SHEEP/neardet_data MINOS26 > rm -r 2008-08 Jeff will remove his files, they are no longer needed. 
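To catch a full /local/scratch26 sooner next time, a quick survey sketch built from the same du calls as above ; perhaps worth a cron, thresholds still to be chosen :

cd /local/scratch26
df -h .                                    # overall fill level
du -sm * 2>/dev/null | sort -n | tail -5   # the five largest users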
######### # MYSQL # ######### Shifted MYI files out of the way, before continuing with the gzip and local phase of archives mkdir ${DBCOPY}/offlineindex mv ${DBCOPY}/offline/*.MYI ${DBCOPY}/offlineindex/ Mysql> du -sm ${DBCOPY}/off* 53650 /minos/data/mysql/archive/20080902/offline 12777 /minos/data/mysql/archive/20080902/offlineindex ============================================================================= 2008 09 03 ============================================================================= ######### # MYSQL # ######### export PRODUCTS=/${HOME}/ups/db ups stop mysql SOFT03 > du -sm /minos/data/mysql/archive/20080902/offline 66426 /minos/data/mysql/archive/20080902/offline SOFT03 > rm -r restore mkdir restore mkdir restore/20080902 time cp -vax /minos/data/mysql/archive/20080902/offline restore/20080902/offline/ real 132m44.680s user 0m10.905s sys 6m14.102s Ganglia rate is 6 to 10 MBytes/second, with the usual 15 second drop outs every 10 minutes. For example, after 600 sec of 6 MB/sec ( 3.6 GB ) Peak rate interval 240 s 15 MB/sec ( 3.4 Gb ) SOFT03 > du -sk /minos/data/mysql/archive/20080902/offline restore/20080902/offline/ 68020112 /minos/data/mysql/archive/20080902/offline 68087604 restore/20080902/offline/ ============================================================================= 2008 09 02 ============================================================================= ########## # PARROT # ########## Test thain's new symlink hack to make_growfs.auto, using $ cat mountlink.grow /parrot /grow/www-numi.fnal.gov/computing/parrot/link The usual parrot test, but parrot -m /minos/scratch/parrot/mountlink.grow -d remote /bin/bash P> ls -l /parrot total 2514 -r--r--r-- 1 kreymer numi 1283387 Jul 29 14:56 data -rw-r--r-- 1 kreymer numi 6483 Aug 26 16:12 HOWTO.parrot -r--r--r-- 1 kreymer numi 1283387 Jul 29 14:56 releasedata P> head -3 /parrot/releasedata /// /// Data and script to load the CalStripAtten database /// with ND mapper attenuation curves. P> head -3 /parrot/data ... 2008/09/02 14:20:44.589679 [20793] parrot: grow: failed to open http://www-numi.fnal.gov:80/computing/parrot/link//data head: cannot open `/parrot/data' for reading: Permission denied This is the expected result, as /minos/data cannot be web served. Now test this sort of directory on ups/minossoft in d199/d141 Need a fresh copy of these, for testing with current software minsoft@minos-mysql1 MD=/afs/fnal.gov/files/data/minos ECHO=echo for DIR in packages releases setup srt ; do date ; echo ${DIR} ${ECHO} rm -r ${MD}/d199/${DIR} ${ECHO} cp -vax ${MD}/d120/${DIR} \ ${MD}/d199/${DIR} date ; done for DIR in catman db etc man prd ; do date ; echo ${DIR} ${ECHO} rm -r ${MD}/d141/${DIR} ${ECHO} cp -vax ${MD}/d119/${DIR} \ ${MD}/d141/${DIR} date ; done Tue Sep 2 16:52:25 CDT 2008 packages ... Tue Sep 2 19:37:54 CDT 2008 man rm: remove write-protected regular file `/afs/fnal.gov/files/data/minos/d141/man/man1/tclsh.1'? OOOPS, should not have used 'v' option for cp, Mysql> fs listacl /afs/fnal.gov/files/data/minos/d141/man/man1 Access list for /afs/fnal.gov/files/data/minos/d141/man/man1 is Normal rights: minos rlidwka system:administrators rlidwka system:anyuser rl Mysql> tokens Tokens held by the Cache Manager: User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Sep 4 20:47] --End of list-- Mysql> rm /afs/fnal.gov/files/data/minos/d141/man/man1/tclsh.1 rm: remove write-protected regular file `/afs/fnal.gov/files/data/minos/d141/man/man1/tclsh.1'? 
no Mysql> chmod 755 /afs/fnal.gov/files/data/minos/d141/man/man1/tclsh.1 Mysql> rm /afs/fnal.gov/files/data/minos/d141/man/man1/tclsh.1 ?????????????? what gives ???????????? Since when do AFS files care about file permissions ? Mysql> ls -l /afs/fnal.gov/files/data/minos/d141/man/man1 total 36 -rw-r--r-- 1 kreymer 1525 5887 Oct 17 2005 dropit.1 -rw-r--r-- 1 kreymer 1525 11704 Apr 26 2005 python.1 -rwxr-xr-- 1 kreymer 1525 4448 Oct 17 2005 upd.1 -r--r--r-- 1 kreymer 1525 12668 Jan 31 2005 wish.1 Mysql> whoami minsoft Mysql> chmod u+w /afs/fnal.gov/files/data/minos/d141/man/man1/wish.1 This allowed the file to be removed. There are many more files lacking u+w permission, using code from addpkg : DIR=/afs/fnal.gov/files/data/minos/d141 #find ${DIR} ! -perm -200 -exec ls -l {} \; #find ${DIR} ! -perm -200 -exec chmod u+w {} \; Many things in man tcl tk blt python perl xfig imagelibs java oracle_client MINOS_EXTERN Mysql> find ${DIR} ! -perm -200 | wc -l 125553 -r-xr-xr-x 1 kreymer 1525 7771765 May 7 2007 /afs/fnal.gov/files/data/minos/d141/prd/encp/v3_6g/Linux-2-4-2-3-2/enstore total 52 -r--r--r-- 1 kreymer 1525 1766 May 7 2007 ECRC.c -r--r--r-- 1 kreymer 1525 1019 May 7 2007 Makefile -r--r--r-- 1 kreymer 1525 4820 May 7 2007 add_to_tape.c -r--r--r-- 1 kreymer 1525 5960 May 7 2007 cpio.c ... ???? what gives ????? where did the path go ?????? find /afs/fnal.gov/files/code/e875/general/products/man/man1 ! -perm -200 -exec ls -l {} \; see similar issues Checking out minossoft : Mysql> find /afs/fnal.gov/files/data/minos/d120 ! -perm -200 -exec ls -l {} \; ... nothing found ... find ${DIR} ! -perm -200 -exec chmod u+w {} \; Mysql> find ${DIR} ! -perm -200 -exec chmod u+w {} \; Mysql> date Wed Sep 3 11:27:25 CDT 2008 Let's correct the original working ups files, DIR=/afs/fnal.gov/files/data/minos/d119 date Wed Sep 3 11:55:52 CDT 2008 time find ${DIR} ! -perm -200 -exec chmod u+w {} \; real 22m32.942s user 0m51.270s sys 7m49.561s for DIR in catman db etc man prd ; do date ; echo ${DIR} ${ECHO} rm -r ${MD}/d141/${DIR} ${ECHO} cp -ax ${MD}/d119/${DIR} \ ${MD}/d141/${DIR} date ; done Wed Sep 3 12:20:42 CDT 2008 catman Wed Sep 3 12:20:44 CDT 2008 Wed Sep 3 12:20:44 CDT 2008 db Wed Sep 3 12:20:53 CDT 2008 Wed Sep 3 12:20:53 CDT 2008 etc Wed Sep 3 12:20:53 CDT 2008 Wed Sep 3 12:20:53 CDT 2008 man Wed Sep 3 12:21:02 CDT 2008 Wed Sep 3 12:21:02 CDT 2008 prd Wed Sep 3 15:45:33 CDT 2008 ######## # DISK # ######## Date: Tue, 02 Sep 2008 13:27:51 -0500 (CDT) Subject: HelpDesk ticket 120912 ___________________________________________ Short Description: Quota request for jdejong on BlueArc served /minos/scratch Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user jdejong on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. Please cc: jdejong and minos-data ( my mail comes through imap3, which is down right now ) ___________________________________________ Date: Tue, 02 Sep 2008 14:36:35 -0500 (CDT) This ticket has been reassigned to RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Tue, 02 Sep 2008 15:17:15 -0500 (CDT) The quota has been moved up to 500GB as requested. This ticket was resolved by RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST group. ___________________________________________ ######## # IMAP # ######## 10:50 imapserver3 seems to be down, not on the network. Stopped getting mail service around 10:50 CDT. Submitted helpdesk ticket. 13:30 imap3 is up. 
Date: Tue, 02 Sep 2008 13:27:53 -0500 (CDT) Subject: HelpDesk ticket 120917 ___________________________________________ Short Description: imapserver3 down Problem Description: At around 10:50, imapserver3 seems to have gone off the network. ( I will not be able to receive email regarding this, as this is where kreymer@fnal.gov mail goes. ) ___________________________________________ Date: Tue, 02 Sep 2008 14:02:36 -0500 (CDT) Solution: Imapserver3 experienced a hardware failure at 11:00am this morning. The imapserver3 service is back up as of 1:30pm on other temporary hardware. Another downtime will be scheduled to switch the service back once the hardware problem has been repaired. email will be sent when we know exactly when the downtime will be. This ticket was resolved by BOZONELOS, JERE of the CD-LSCS/CSI/HD group. __________________________________________ ___________________________________________ Date: Tue, 02 Sep 2008 13:34:41 -0500 From: Fermilab Postmaster To: all-imap3-users@imapserver3.fnal.gov Subject: Imapserver3 back up Hi, Imapserver3 experienced a hardware failure at 11:00am this morning. The imapserver3 service is back up as of 1:30pm on other temporary hardware. We will have to schedule a downtime to switch the service back once the hardware problem has been repaired. We will send email when we know exactly when the downtime will be. Fermilab Email Team ############ # PREDATOR # ############ No far_dcs_data files this month Last file standing was F080827_000002.mdcs.root Thu Aug 28 10:14:58 UTC 2008 ########### # MONTHLY # ########### DATASETS 9/2 PREDATOR 9/2 VAULT 9/5 MYSQL 9/5 ============================================================================= 2008 08 29 ============================================================================= ########## # PARROT # ########## Fails when running paloon/loonar with S08-08-28-R1-30 OK with old R1.24.2 and S07-12-22-R1-26 could not find a gcc version for release "S08-08-28-R1-30" on Linux+2 ERROR: Need unique instance but multiple "products" found INFORMATIONAL: Product '*' (with qualifiers ','), has no S08-08-28-R1-30 version (or may not exist) RUNNING LOON /grid/app/minos/parrot/loonar: line 47: loon: command not found Should look like No default SAM configuration exists at this time. MINOSSOFT release "S07-12-22-R1-26" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-17-08 EXTERN=v03 CONFIG=v01 setup "test" version of LABYRINTH [ linux , FNALU ] setup NEUGEN3 development explicitly setting up GCC3_4_3 version of GEANT using PYTHIA6 (v6_412) for LUND Needed $ date ; time make_growfs -k -f /afs/fnal.gov/files/code/e875/general/ups make_growfs: 1133707 files, 361 links, 153891 dirs, 0 checksums computed real 9m1.392s user 1m19.533s sys 2m6.919s ########## # CONDOR # ########## Slow response, especially for condor_submit ( seconds per process ) since late last night, reported by pawloski and loiacono Also, recently held gfactory jobs, 108 jobs; 1 idle, 19 running, 88 held MINOS25 > condor_q -l 181282.1 HoldReason = "Globus error 10: data transfer to the server failed" MINOS25 > condor_q -l -hold gfactory | grep EnteredCurrentStatus | sort EnteredCurrentStatus = 1220027205 Fri Aug 29 11:26:45 CDT 2008 many ... EnteredCurrentStatus = 1220027597 Fri Aug 29 11:33:17 CDT 2008 many ... EnteredCurrentStatus = 1220027780 Fri Aug 29 11:36:20 CDT 2008 few ... 
EnteredCurrentStatus = 1220029654 Fri Aug 29 12:07:34 CDT 2008 many bluwatch did see slow access ( no failure ) Fri Aug 29 03:25:30 CDT 2008 SLO N00013286_0000.spill.sntp.cedar_phy_bhcurv.0.root 13 Fri Aug 29 03:29:46 CDT 2008 SLO N00013299_0000.spill.sntp.cedar_phy_bhcurv.0.root 12 ... Fri Aug 29 11:16:17 CDT 2008 SLO N00008017_0000.spill.sntp.cedar_phy_bhcurv.0.root 58 Fri Aug 29 11:17:47 CDT 2008 SLO N00008019_0000.spill.sntp.cedar_phy_bhcurv.0.root 30 Fri Aug 29 11:19:28 CDT 2008 SLO N00008019_0002.spill.sntp.cedar_phy_bhcurv.0.root 41 Fri Aug 29 11:21:15 CDT 2008 SLO N00008020_0000.spill.sntp.cedar_phy_bhcurv.0.root 47 Fri Aug 29 11:22:49 CDT 2008 SLO N00008021_0000.spill.sntp.cedar_phy_bhcurv.0.root 34 Fri Aug 29 12:23:08 CDT 2008 SLO N00008218_0000.spill.sntp.cedar_phy_bhcurv.0.root 12 Released 84 held gfactory jobs, a few are running, more are held ( error 10 ) ########## # PARROT # ########## Updated minossoft, for latest snapshot ( noopt only ) $ date ; time make_growfs -v -k -f /afs/fnal.gov/files/code/e875/general/minossoft Fri Aug 29 14:20:44 CDT 2008 Interrupted, there were changes to the latest snapshot. Reduced verbosity $ date ; time make_growfs -k -f /afs/fnal.gov/files/code/e875/general/minossoft Fri Aug 29 14:56:34 CDT 2008 make_growfs: loading existing directory from /afs/fnal.gov/files/code/e875/general/minossoft/.growfsdir make_growfs: scanning directory tree for changes... Broken link, /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-08-28-R1-30/include/CodeMgtTools/include rhatcher repaired this, $ date ; time make_growfs -k -f /afs/fnal.gov/files/code/e875/general/minossoft Fri Aug 29 15:58:56 CDT 2008 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.16/bin/Linux2.6-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.16/bin/Linux-sl3-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.16/bin/Linux2.4-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/bin/Linux2.6-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/bin/Linux2.6-GCC_3_4-maxopt/Linux2.4-GCC_3_4-maxopt warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/bin/Linux2.4-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/bin/Linux2.4-GCC_3_4-maxopt/Linux2.4-GCC_3_4-maxopt warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/lib/Linux2.6-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/lib/Linux2.6-GCC_3_4-maxopt/Linux2.4-GCC_3_4-maxopt warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/lib/Linux2.4-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/lib/Linux2.4-GCC_3_4-maxopt/Linux2.4-GCC_3_4-maxopt warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1-29/Linux2.6-GCC_3_4-maxopt make_growfs: 4422692 files, 12 links, 176657 dirs, 0 checksums computed real 36m58.916s user 3m35.032s sys 20m50.270s From -rw-r--r-- 1 kreymer 5111 31326785 Aug 14 19:29 .growfsdir to 
-rw-r--r-- 1 kreymer 5111 211982753 Aug 29 21:35 .growfsdir Following symlinks takes us from 31 MB to 211 MB directory size. ############# # MILESTONE # ############# Successfully ran a standard cedar near detector spill reco job under Parrot. ########## # PARROT # ########## Added release_data service /afs/fnal.gov/files/expwww/numi/html/computing/parrot MIN > ln -s /afs/fnal.gov/files/data/minos release_data Corrected single file sim data file link, for reco test ln -sf /afs/fnal.gov/files/data/minos/release_data/bmaps/bfld_160.dat \ /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat Would restore with ln -sf /minos/data/release_data/bmaps/bfld_160.dat \ /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat Checked file identity with diff /minos/data/release_data/bmaps/bfld_160.dat \ /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat Before retest, rebuilt sim index, and created such for release_data time make_growfs -v -k /afs/fnal.gov/files/data/minos/release_data make_growfs: 292 files, 1 links, 53 dirs, 0 checksums computed real 0m1.885s user 0m0.028s sys 0m0.083s time make_growfs -v -k -f /afs/fnal.gov/files/code/e875/sim Interrupted, circular symlink under gmieg/Mesa/Mesa-2.6 $ ls -alF /afs/fnal.gov/files/code/e875/sim/gmieg/Mesa/Mesa-2.6/Mesa-2.6 lrwxr-xr-x 1 gmieg e875 53 Jan 25 1999 /afs/fnal.gov/files/code/e875/sim/gmieg/Mesa/Mesa-2.6/Mesa-2.6 -> /afs/fnal.gov/files/code/e875/sim/gmieg/Mesa/Mesa-2.6/ $ rm /afs/fnal.gov/files/code/e875/sim/gmieg/Mesa/Mesa-2.6/Mesa-2.6 time make_growfs -v -k -f /afs/fnal.gov/files/code/e875/sim make_growfs: 393475 files, 77 links, 63801 dirs, 0 checksums computed real 3m33.670s user 0m22.977s sys 0m57.210s Repeaded loon run, got stuck at : explicitly setting up GCC3_4_3 version of GEANT using PYTHIA6 (v6_412) for LUND 6160 kreymer 25 0 6592 1976 1248 R 100 0.1 13:39.02 5b33ce4febe1b4b 5819 pts/0 S 0:11 \_ parrot -m ./mountfile.MX.grow -H /bin/bash 5822 pts/0 T 0:00 \_ /bin/bash 6160 pts/0 R+ 13:46 \_ python /afs/fnal.gov/files/code/e875/general/minossoft/setup/datagram/datagram_client.py [sh] kreymer minos_offline R1. Noted that /local/stage1/minos was empty, Fresh login, fresh parrot sesssion, parrot -m /grid/app/minos/parrot/mountfile.MX.grow -H /bin/bash ... BfldLoanPool::GetMap new map, type 2 'Rect2dGrid', variant 160 BfldMapRect2d read file: /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat BfldMapRect2d: near detector, 40kAturns CurrentForward -- FXY Bob Wands 08/09/2005 ... =E= Bfld 2008/08/29 11:59:27 [9870|200520] BfldMapRect2d.cxx,v1.26:87> can not open input file: '/afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_161.dat' Floating point exception diff /minos/data/release_data/bmaps/bfld_161.dat \ /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_161.dat ln -sf /afs/fnal.gov/files/data/minos/release_data/bmaps/bfld_161.dat \ /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_161.dat $ time make_growfs -v -k -f /afs/fnal.gov/files/code/e875/sim make_growfs: 393475 files, 77 links, 63801 dirs, 0 checksums computed real 2m41.662s user 0m30.416s sys 0m54.832s Spill(100000 in 750 out 99250 filt.) P> du -sk *root 17420 CandS.root 81772 N00009870_0002.mdaq.root 3112 ntupleStS.root S U C C E S S ! rhatcher will revise all the sim symlinks from /minos/data to afs, as was done with the minossoft area. Repeated this run, after rebuilding minsoft index, looks OK Spill(100000 in 750 out 99250 filt.) ... 
Channels with the most errors: Errors: 13 [ Near| 46 Vf| 39|*W] Errors: 13 [ Near| 136 Vf| 64|*W] Errors: 13 [ Near| 136 Vf| 72|*W] Errors: 11 [ Near| 146 Vf| 65|*W] Errors: 11 [ Near| 146 Vf| 73|*W] Errors: 10 [ Near| 6 Vf| 81|*W] Errors: 10 [ Near| 16 Vf| 48|*W] Errors: 10 [ Near| 136 Vf| 65|*W] Errors: 10 [ Near| 136 Vf| 73|*W] Errors: 9 [ Near| 136 Vf| 80|*W] DatabaseInterface shutdown not requested ============================================================================= 2008 08 28 ============================================================================= ########## # PARROT # ########## continue reco test, this time kreymer@minos26 cd ${HOME}/minos . ./setup_minos setup_minos -r R1.24.0 export ENV_TSQL_URL='mysql:odbc://fnpcsrv1.fnal.gov:3307/temp;mysql:odbc://fnpcsrv1.fnal.gov:3307/cedar' export ENV_TSQL_USER=reader export ENV_TSQL_PSWD=minos_db cd /local/scratch26/kreymer/DATA FIN=N00009870_0002.mdaq.root time loon -b -q reco_near_spill_cedar.C ${FIN} 2>&1 | tee loon.log ... Spill(100000 in 750 out 99250 filt.) ... real 14m10.572s user 13m43.216s sys 0m12.443s Try this again using the public database Changed host to minos-db1 port to 3306 cedar to offline export ENV_TSQL_URL='mysql:odbc://minos-db1.fnal.gov:3306/temp;mysql:odbc://minos-db1.fnal.gov:3306/offline' export ENV_TSQL_USER=reader export ENV_TSQL_PSWD=minos_db This looks good, let's gear up for Parrot tests. mindata@Minos26: $ cd /grid/app/minos/parrot $ cp /local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root . $ cp /local/scratch26/kreymer/DATA/reco_near_spill_cedar.C . kreymer@fnpc170 Run standard usage test, with the setups and env's as above mkdir -p /local/stage1/minos cd /local/stage1/minos cp /grid/app/minos/parrot/N00009870_0002.mdaq.root . cp /grid/app/minos/parrot/reco_near_spill_cedar.C . time loon -b -q reco_near_spill_cedar.C ${FIN} 2>&1 | tee loon.log ^D ( needed for parrot stickiness ) Ended after libFiltration.so Try again with printf, time { printf "" | loon -b -q reco_near_spill_cedar.C ${FIN} } 2>&1 | tee loon.log P> time { printf "" | loon -b -q reco_near_spill_cedar.C ${FIN} ; } 2>&1 | tee loon.log ; ERROR: ld.so: object '/grid/app/minos/parrot/cctools-current-20080708-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. Warning in : class timespec already in TClassTable Processing reco_near_spill_cedar.C... Warning in : class CandDigitListHandleKeyFunctor already in TClassTable Warning in : class CandDigitListHandleKeyFunc already in TClassTable Warning in : class CandDigitListHandleItr already in TClassTable Segmentation fault real 0m12.186s user 0m0.000s sys 0m0.000s The next log messages would have been Successfully opened connection to: mysql:odbc://minos-db1.fnal.gov:3306/temp?option=1; Successfully opened connection to: mysql:odbc://minos-db1.fnal.gov:3306/offline?option=1; On rerunning, got an additional message, segmentation fault Trying a fresh test, parrot -m ${PARROT_DIR}/mountfile.grow -H /bin/bash # for production PS1='P> ' export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup unset SETUP_UPS SETUPS_DIR . /afs/fnal.gov/files/code/e875/general/ups/etc/setups.sh setup_minos() { . 
$MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } setup_minos -r R1.24.0 cd /local/stage1/minos export ENV_TSQL_URL='mysql:odbc://minos-db1.fnal.gov:3306/temp;mysql:odbc://minos-db1.fnal.gov:3306/offline' export ENV_TSQL_USER=reader export ENV_TSQL_PSWD=minos_db FIN=N00009870_0002.mdaq.root printf "" | loon -b -q reco_near_spill_cedar.C ${FIN} Warning in : class timespec already in TClassTable Processing reco_near_spill_cedar.C... Warning in : class CandDigitListHandleKeyFunctor already in TClassTable Warning in : class CandDigitListHandleKeyFunc already in TClassTable Warning in : class CandDigitListHandleItr already in TClassTable Segmentation fault With -d all, see message just before the crash , cannot open /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/etc/odbcinst.ini This file is indeed not visible, MINOS_EXTERNAL is not exported. MIN > pwd /afs/fnal.gov/files/expwww/numi/html/computing/parrot ln -s /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL MINOS_EXTERNAL rm MINOS_EXTERNAL/.gr* $ time make_growfs -v -k /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL make_growfs: 37346 files, 1077 links, 2425 dirs, 0 checksums computed real 0m22.704s user 0m1.763s sys 0m3.694s parrot -m ./mountfile.MX.grow -H /bin/bash This connects to the database, cranking along at 17:02 CDT, BfldLoanPool::GetMap new map, type 2 'Rect2dGrid', variant 160 =E= Bfld 2008/08/28 17:02:56 [9870|200314] BfldMapRect2d.cxx,v1.26:87> can not open input file: '/afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat' Floating point exception That is reasonable, we need to shift more symlinks : $ ls -l /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat lrwxr-xr-x 1 rhatcher e875 43 Aug 5 16:16 /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat -> /minos/data/release_data/bmaps/bfld_160.dat ####### # NAS # ####### Date: Thu, 28 Aug 2008 13:05:52 -0500 From: Andrew J. Romero To: "'site-nas-announce@fnal.gov'" Subject: Reminder: BlueArc Maintenance Tuesday Sept 2, 2008 from 6:00am to 6:20am (FERMI-BLUE cluster ONLY) We will be performing maintenance on the FERMI-BLUE BlueArc cluster on Tuesday Sept 2, 2008 from 6:00am to 6:20am During the maintenance outage we will be upgrading the BlueArc Titan firmware UNIX NFS Clients should recover gracefully when the maintenance is complete. Note: There are no production Windows shares hosted on the FERMI-BLUE cluster The following BlueArc hosted file servers (EVSs) are effected by this maintenance outage ----------------------------------------------- blue3 bluetest fermi-nas-1 mb-nas-0 The following BlueArc hosted file servers (EVSs) are **NOT** effected by this maintenance outage ----------------------------------------------- blue1 blue2 cdfserver1 cdserver dirserver1 eshserver1 lsserver minos-nas-0 numiserver ppdserver pseekits ############# # MINOSSOFT # ############# Preparing to move all the /minos/data/release_data sylinks to /afs/fnal.gov/files/data/minos/release_data Build on the methods described in HOWTO.afssoftprod LOGD=/minos/scratch/minsoft/afssoft SLINKF=${LOGD}/slink/recodata.links SLINKL=${LOGD}/slink/recodata.log PVOL=/afs/fnal.gov/files/data/minos/d120 DOUT=${PVOL} find ${DOUT} -type l -exec ls -l {} \; \ | cut -f 2- -d / \ | sed 's/ -> /:/g' \ | grep ':/' \ | grep :/minos/data/release_data/ \ | tee ${SLINKF} * * * Not proceeding with this. * * * rhatcher already has scripts in place which can do this, as part of normal release management. 
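For completeness, had we proceeded, the repointing would have been a simple loop over ${SLINKF}, in the spirit of the single bfld_16x.dat links fixed above. A sketch only, never run ; the find/cut/sed pipeline above strips the leading slash from the link path, and the echo keeps this a dry run :

while IFS=: read LINK TARG ; do
  NEWT=`echo ${TARG} | sed 's|^/minos/data/release_data|/afs/fnal.gov/files/data/minos/release_data|'`
  echo ln -sf ${NEWT} /${LINK}
done < ${SLINKF}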
####### # WEB # ####### Date: Thu, 28 Aug 2008 11:35:53 -0500 (CDT) Subject: HelpDesk ticket 120775 ___________________________________________ Short Description: public web server allows browsing of local files in /etc Problem Description: Web administrators : The CD public web servers are wollowing symlinks to local files, such as /etc/passwd. The actual files you see vary from time to time, depending on which backend web server you actually get connected to. For example, see http://www-numi.fnal.gov/computing/parrot/link/etcdir/ which is a symlink to /etc, which contains interesting files like ftpusers group hosts passwd release resolv.conf These are probably things we do not want served to the world. ___________________________________________ Date: Thu, 28 Aug 2008 15:38:56 -0500 (CDT) This ticket has been reassigned to PASETES, RAY of the CD-LSCS/CSI/CS/EST Group ___________________________________________ Date: Fri, 29 Aug 2008 10:33:07 -0500 (CDT) Solution: Hi Art, Thank you for bringing this up. Currently, we are allowing links on the central web servers. However, we are working with security to change this policy. This will be a HUGE disruption to many sites which rely on soft links heavily. There are a couple of other options we can place which could prevent the people from linking to local files, but these options would also severely break many other sites. So, for now, we are balancing security with practicality. We accept the current risk for now and are working towards a more secure infrastructure while also providing the least painful path for our users. We have removed your links to the /etc directory. Please do not do that again. Thank you. This ticket was resolved by PASETES, RAY of the CD-LSCS/CSI/CS/EST group. ___________________________________________ Thanks for the clarification. I had not intented to suggest disabling symlinks, which would be disastrous. Instead, something more gentle, limiting served files to /afs/fnal.gov ( and maybe /afs/.fnal.gov ) Thanks for your attention to this. I have removed the rest of my test links to local file systems. ___________________________________________ ############ # STARTUP # ############ PNFS is back 2 Thu Aug 28 08:40:19 CDT 2008 13219 Thu Aug 28 12:25:38 CDT 2008 4 Thu Aug 28 12:30:42 CDT 2008 FTP is back 6 Thu Aug 28 12:36:01 CDT 2008 557 So DCache seems to be fine, restarting tasks Thu Aug 28 13:33:24 CDT 2008 kreymer@minos26 cd minos/scripts crontab crontab.dat mindata@minos26 cd crontab crontab.dat minfarm@fnpcsrv1 mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok ########### # MINOS01 # ########### Date: Thu, 28 Aug 2008 10:09:47 -0500 (CDT) Subject: HelpDesk ticket 120768 ___________________________________________ Short Description: minos01 is not accepting ssh connections Problem Description: minos01 will not accept ssh connections. rsh is still working. The last login I see in /var/log/messages is Aug 27 12:03:05 minos01 sshd(pam_unix)[8288]: session opened for user rustem by (uid=0) It is rather important to fix this, as this system is our CVS server, which is accessed primarily via ssh. ___________________________________________ Date: Thu, 28 Aug 2008 10:13:49 -0500 From: Mark Schmitz I restarted sshd. Seems OK now. ___________________________________________ Date: Thu, 28 Aug 2008 10:16:26 -0500 (CDT) This ticket has been reassigned to HARRINGTON, JASON of the CD-SF/FEF Group. 
___________________________________________ Date: Thu, 28 Aug 2008 10:16:27 -0500 (CDT) Solution: restarted sshd ######## # GRID # ######## Date: Thu, 28 Aug 2008 09:32:50 -0500 (CDT) Subject: HelpDesk ticket 120763 ___________________________________________ Short Description: Many files at the top of /grid/data Problem Description: There are over 4600 files at the top level of /grid/data . MINOS26 > ls /grid/data | grep remote$ | wc -l 4681 An initial 'ls /grid/data' command can take over two minutes. The files are owned by fnalgrid, and have names like 2008-07-25T17:08:01Z-gridftp-probe-test-file-remote I suggest moving these to a subdirectory of /grid/data , perhaps /grid/data/fnalgrid/ . ___________________________________________ Date: Thu, 28 Aug 2008 09:37:35 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: Art--most of those files are the by-product of the automatic OSG RSV system probe testing. Unfortunately it is not easy to change the directory into which they go but we could easily develop a cron to purge them after a day or so and we will do that. Steve Timm ___________________________________________ Date: Mon, 08 Sep 2008 09:06:25 -0500 (CDT) Solution: These files in /grid/data are now being purged daily by a script. Steve Timm ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 08 27 ============================================================================= ############ # SHUTDOWN # ############ Prepared for PNFS/DCache maintenance Aug 28 kreymer@minos26 echo "crontab -r" | at 05:30 job 18 at 2008-08-28 05:30 mindata@minos26 echo "crontab -r" | at 01:00 job 19 at 2008-08-28 01:00 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 job 16 at 2008-08-28 01:00 ####### # AFS # ####### Spoke to Ray Pasetes ( kreymer, rhatcher ) We can add mounts for d119 and d120 to replace the symlinks anytime we want. 
Testing, fs mkmount art room.kreymer fs rmmount art If a mount fails, it still needs to be rmmount'd Got the volume names for mounting with fs examine $MINOS_DATA/d119 fs examine $MINOS_DATA/d120 Will do cd /afs/fnal.gov/files/code/e875/general rm ups fs mkmount ups nb.minos.d119 rm minossoft fs mkmount minossoft nb.minos.d120 Survey total Minos usage in AFS vos partinfo fsus-minos02 Free space on partition /vicepa: 282137398 K blocks out of total 898040549 Free space on partition /vicepb: 327383281 K blocks out of total 898040549 Free space on partition /vicepc: 328670058 K blocks out of total 898040549 Free space on partition /vicepd: 415337195 K blocks out of total 896348377 Free space on partition /vicepe: 379669157 K blocks out of total 899211060 Free space on partition /vicepf: 325347830 K blocks out of total 841974743 Free space on partition /vicepg: 296747358 K blocks out of total 841974743 Free space on partition /viceph: 380486512 K blocks out of total 841974743 Free space on partition /vicepi: 349533267 K blocks out of total 841974743 Free space on partition /vicepj: 334027396 K blocks out of total 841974743 Free space on partition /vicepk: 397945771 K blocks out of total 841974743 Free space on partition /vicepl: 343095448 K blocks out of total 841974743 Free space on partition /vicepm: 336612230 K blocks out of total 841974743 Free space on partition /vicepn: 362614880 K blocks out of total 841974743 Free space on partition /vicepo: 386638702 K blocks out of total 841974743 MINOS26 > vos partinfo fsus-minos02 | wc -l 15 ########## # PARROT # ########## Date: Wed, 27 Aug 2008 12:39:21 +0100 From: Alexandre Sousa Sorry this is coming a bit late, but the Dogwood validation meeting went all the way to 20:40 local time, so I got home a little too late. So cedar reconstruction uses R1.24.0 for data and R1.24.1 for MC. Therefore, to run a Near detector data job in fnpcsrv1 you would do: source /grid/app/minos/minfarm/Minossoft/setup_minossoft_MINOS_BATCH_GRID_CEDAR.[sh;csh] R1.24.0 This sets up , root v5.12.00 and the mysql environment variables: echo $ENV_TSQL_URL mysql:odbc://fnpcsrv1.fnal.gov:3307/temp;mysql:odbc://fnpcsrv1.fnal.gov:3307/cedar echo $ENV_TSQL_USER reader echo $ENV_TSQL_PSWD that allow you to connect to the far DB. Then, to run the job, do: loon -b -q /home/minfarm/loonexe/reco_near_spill_cedar.C To run a Near Detector MC job, you would do: source /grid/app/minos/minfarm/Minossoft/setup_minossoft_MINOS_BATCH_GRID_CEDAR.[sh ;csh] R1.24.1 loon -b -q /home/minfarm/loonexe/GoodSpillTime.C /home/minfarm/loonexe/reco_MC_daikon_near_cedar.C The cedar scripts are also found in the Production package under: Production/Cedar Hope this helps. Let me know if you run into problems. Cheers, Alex FNPCSRV1 > pwd /home/kreymer/DATA FNPCSRV1 > scp -c blowfish minos26:/local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root . 
FNPCSRV1 > FIN=N00009870_0002.mdaq.root FNPCSRV1 > time loon -b -q /home/minfarm/loonexe/reco_near_spill_cedar.C ${FIN} 2>&1 | tee loon.log real 10m31.341s user 10m21.704s sys 0m7.702s MINOS26 > du -sk /pnfs/minos/reco_near/cedar/cand_data/2006-02/N00009870_0002* 112612 /pnfs/minos/reco_near/cedar/cand_data/2006-02/N00009870_0002.cosmic.cand.cedar.0.root 18170 /pnfs/minos/reco_near/cedar/cand_data/2006-02/N00009870_0002.spill.cand.cedar.0.root MINOS26 > du -sk /pnfs/minos/reco_near/cedar/sntp_data/2006-02/N00009870_0002* 29271 /pnfs/minos/reco_near/cedar/sntp_data/2006-02/N00009870_0002.cosmic.sntp.cedar.0.root 3235 /pnfs/minos/reco_near/cedar/sntp_data/2006-02/N00009870_0002.spill.sntp.cedar.0.root [kreymer@fnpcsrv1 ~/DATA]$ du -sk * 17440 CandS.root 32 loon.log 81696 N00009870_0002.mdaq.root 3104 ntupleStS.root ########## # PARROT # ########## /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link ln -s /etc/mail etcmail ln -s /etc/mail/access access make_growfs: loading existing directory from /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/.growfsdir make_growfs: no directory exists, this might be quite slow... make_growfs: scanning directory tree for changes... /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/HOWTO.parrot /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/access make_growfs: 1 files, 2 links, 0 dirs, 0 checksums computed MIN > cat .growfsdir D root 16877 2048 20706406 0 F HOWTO.parrot 33188 6483 20621540 0 L etcmail 41453 9 20706367 0 /etc/mail L access 41453 16 20706378 0 /etc/mail/access E This fails to follow symlinks $ make_growfs -v -f -k /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link make_growfs: loading existing directory from /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/.growfsdir make_growfs: scanning directory tree for changes... 
/afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/HOWTO.parrot /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/virtusertable.db /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/submit.cf /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/mailertable /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/submit.mc /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/domaintable /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/helpfile /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/Makefile /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/virtusertable /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/access /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/sendmail.cf /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/access.db /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/local-host-names /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/trusted-users /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/domaintable.db /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/sendmail.mc /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/mailertable.db /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/access make_growfs: 18 files, 0 links, 1 dirs, 0 checksums computed MIN > cat .growfsdir D root 16877 2048 20706583 0 F HOWTO.parrot 33188 6483 20621540 0 D etcmail 16877 4096 -11869490 0 F virtusertable.db 33184 12288 -11869522 0 F submit.cf 33060 41313 -48846718 0 F mailertable 33188 0 -48846717 0 F submit.mc 33188 952 -48846718 0 F domaintable 33188 0 -48846717 0 F helpfile 33188 5588 -48846718 0 F Makefile 33188 1035 -48846717 0 F virtusertable 33188 0 -48846718 0 F access 33188 331 -48846717 0 F sendmail.cf 33188 58049 -11869490 0 F access.db 33184 12288 -11869522 0 F local-host-names 33188 64 -48846718 0 F trusted-users 33188 127 -48846718 0 F domaintable.db 33184 12288 -11869522 0 F sendmail.mc 33188 6736 -11869490 0 F mailertable.db 33184 12288 -11869522 0 E F access 33188 331 -48846717 0 E This DOES follow symlinks Check again on symlinks in d120 (minossoft) They are all local ( with or without full path ) /minos/data/... Test web access to /minos/data MIN > ln -s /minos/data/release_data/minossoft/Calibrator/macros/GenerateNdAttenConstants-r1.2.C data Test availablility of release_data via the web MIN > ln -s /afs/fnal.gov/files/data/minos/release_data/minossoft/Calibrator/macros/GenerateNdAttenConstants-r1.2.C releasedata ============================================================================= 2008 08 26 ============================================================================= ########## # PARROT # ########## Try following a simple symlink : Testing under /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link ######### # FNALU # minos-users ######### From CD ops notes 9/3: 8-5 There will be an all-day FNALU downtime for rack consolidation on FCC-1.  This will affect everything - interactive and batch nodes - except for fsui03. Login message NOTE: Downtime Sept. 3, 2008 all day for all fnalu nodes except fsui03. This includes batch nodes. Rack consolidation will be done in FCC1. ########## # CONDOR # ########## 66 jobs; 10 idle, 27 running, 29 held MINOS25 > condor_release gfactory User gfactory's job(s) released. 
MINOS25 > date Tue Aug 26 13:04:08 CDT 2008
N.B. I have lately spotted a few jobs in state 'C', which I think means Completing. This seems unusual.
####### # AFS # #######
Mysql> pwd /afs/fnal.gov/files/code/e875/general
Mysql> rm ups
Mysql> ln -s /afs/fnal.gov/files/data/minos/d119 ups
Mysql> mv minossoft minossoftold
Mysql> ln -1 /afs/fnal.gov/files/data/minos/d120 ups ln: invalid option -- 1 Try `ln --help' for more information.
Mysql> ln -s /afs/fnal.gov/files/data/minos/d120 minossoft
Mysql> date Tue Aug 26 11:16:05 CDT 2008
MINOS26 > minos -bash: /afs/fnal.gov/files/code/e875/general/minossoft/setup/setup_minossoft_FNALU.sh: No such file or directory
Mysql> ln -s /afs/fnal.gov/files/data/minos/d120 minossoft
Mysql> date Tue Aug 26 11:18:31 CDT 2008
Mysql> rm ups/d120
Examined Minos Cluster Ganglia report, no glitch seen around 11:16, remains busy running batch jobs. Some loiacono condor jobs, submitted at 11:10, failed due to this.
####### # SAM # #######
Test in integration export SAM_ORACLE_CONNECT="samdbs/" samadmin add dimension \ --name=EVENT_COUNT \ --table=data_files \ --column=EVENT_COUNT \ --type=number \ --desc='select DATA_FILES EVENT_COUNT'
Test using file N00009521_0024.mdaq.root 'eventCount' : 129L,
NEARDIM=`printf "DATA_TIER raw-near and RUN_NUMBER 9521 and EVENT_COUNT < 1000"`
sam list files --dim="${NEARDIM}" --nosummary | sort
This worked in integration, now did this in production
MINOS26 > sam list files --dim="${NEARDIM}" --nosummary | sort N00009521_0024.mdaq.root
============================================================================= 2008 08 25 =============================================================================
####### # AFS # #######
Date: Mon, 25 Aug 2008 22:11:30 +0000 (GMT) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Cc: minos-admin@fnal.gov Subject: Minos AFS symlink adjustment tomorrow for ups, minossoft
We have copied all of the Minos UPS and Release files into single 50 GB volumes, in order to clean up the tangled web of symlinks which have grown as various smaller 2 to 8 GB volumes have overflowed. ups is already a symlink to an alternate volume, this new copy is just to a larger volume. These copies have been tested with real Analysis jobs, via Parrot. We have scheduled to symlink these new volumes into production use tomorrow morning, as follows : cd /afs/fnal.gov/files/code/e875/general rm ups ln -s /afs/fnal.gov/files/data/minos/d119 ups mv minossoft minossoftold ln -s /afs/fnal.gov/files/data/minos/d120 ups Ideally, shifting the links will be entirely transparent, even for running jobs. We do not plan to remove the original files in the near future. With the exception of the development release, we think that the copies are identical to the originals.
============================================================================= 2008 08 23 Sat =============================================================================
######### # ADMIN # #########
In ganglia, Minos cluster system mode CPU kicked up to 20% out of 40%, by Friday 24:00, starting Friday 22 Aug before noon. Circumstantial evidence suggests this is caused by tinti condor jobs. ( no problem on minos08, where they are not running ) This cleared up by Sunday at noon.
######### # CONDOR # #########
/home/gfactory/glideinsubmit/glidein_t20_glexec entry_gpminos/log condor_activity_20080822_gpminos@t20_glexec@minos@my2.log
000 (177730.005.000) 08/22 14:04:33 Job submitted from host: <131.225.193.25:61451> ...
017 (177730.000.000) 08/23 04:36:36 Job submitted to Globus RM-Contact: fnpcfg1.fnal.gov:2119/jobmanager-condor JM-Contact: https://fnpcfg1.fnal.gov:40032/31893/1219484189/ Can-Restart-JM: 1 ... 027 (177730.000.000) 08/23 04:36:36 Job submitted to grid resource GridResource: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor GridJobId: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor https://fnpcfg1.fnal.gov:40032/31893/1219484189/ factory_info.20080823.log Look at latest err and .out files, less entry_gpgeneral/log/job.177567.6.out ran on fnpc266, looks fine to me less entry_gpgeneral/log/job.177567.6.err errors at the end : MasterLog ======== gzip | uuencode ============= ./condor_startup.sh: line 195: uuencode: command not found gzip: stdout: Broken pipe StartdLog ======== gzip | uuencode ============= ./condor_startup.sh: line 195: uuencode: command not found gzip: stdout: Broken pipe But I see similar things back through August 18. Date: Sat, 23 Aug 2008 10:56:17 -0500 (CDT) Subject: HelpDesk ticket 120532 _____________________________________________________________________ Sometime after Friday 22 August 18:00, the Minos glideinWMS pilots seem to have disappeared from the GPFarm nodes. The minos25 condor system shows 134 gfactory processes running, from 176094.2 gfactory 8/18 15:35 0+19:55:38 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor to 177726.9 gfactory 8/22 13:59 0+19:56:38 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor The last user job that I see completing normally was 177760.0 kreymer 8/22 18:00 0+00:00:44 C 8/22 18:00 /minos/scratch/ MINOS25 > condor_q gfactory ... 150 jobs; 20 idle, 130 running, 0 held The condor logs look pretty normal for recently running gfactory jobs, /home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/log/condor_activity_20080822_gpminos@t20_glexec@minos@my2.log 000 (177726.000.000) 08/22 13:59:51 Job submitted from host: <131.225.193.25:61451> 017 (177726.000.000) 08/22 13:59:59 Job submitted to Globus RM-Contact: fnpcfg1.fnal.gov:2119/jobmanager-condor JM-Contact: https://fnpcfg1.fnal.gov:40012/3970/1219431596/ Can-Restart-JM: 1 ... 027 (177726.000.000) 08/22 13:59:59 Job submitted to grid resource GridResource: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor GridJobId: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor https://fnpcfg1.fnal.gov:40012/3970/1219431596/ ... 001 (177726.000.000) 08/22 14:02:02 Job executing on host: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor But there is a big gap in activity in the log 000 (177730.005.000) 08/22 14:04:33 Job submitted from host: <131.225.193.25:61451> ... 017 (177730.000.000) 08/23 04:36:36 Job submitted to Globus RM-Contact: fnpcfg1.fnal.gov:2119/jobmanager-condor JM-Contact: https://fnpcfg1.fnal.gov:40032/31893/1219484189/ Can-Restart-JM: 1 Nothing seems to have started or finished since that gap. _____________________________________________________________________ _____________________________________________________________________ _____________________________________________________________________ _____________________________________________________________________ Cannot submit helpdesk ticket, get download window for Helpdesk.pl Cannot run firefox2, complains about missing library. Submitted the above from my laptop, FF 2. ######## # GRID # ######## /grid/data has 4471 files at the top level. On fnpc341, an 'ls' takes about 150 seconds. On minos26, 112 sec -rw-r--r-- 1 fnalgrid fnalgrid 134 May 8 17:08 2008-05-08T22:08:00Z-gridftp-probe-test-file-remote ... 
-rw-r--r-- 1 fnalgrid fnalgrid 134 Aug 23 09:08 2008-08-23T14:08:02Z-gridftp-probe-test-file-remote -bash-3.00$ ls /grid/data | grep -remote | wc -l 4227 Mon Aug 25, MINOS26 > ls /grid/data | grep remote$ | wc -l 4425 ============================================================================= 2008 08 22 ============================================================================= ########## # CONDOR # ########## gfrontend - reset limit to 250, now that pawloski jobs are corrected to not hold database connections open Stopped and restarted after changing limit Found about 20 Globus error 17 and 43 held gfactories, MINOS25 > condor_release gfactory User gfactory's job(s) released. ######## # DATA # ######## pittam suggests lack of cedar_phy files from 2007-04/5/6/7, in farcat F00037835_0000.spill.bntp.cedar_phy.0.root F00037835 F00037838_0000.spill.bntp.cedar_phy.0.root F00037838 F00037841_0000.spill.bntp.cedar_phy.0.root F00037841 F00037868_0000.spill.bntp.cedar_phy.0.root F00037868 F00037871_0000.spill.bntp.cedar_phy.0.root F00037871 F00037947_0000.spill.bntp.cedar_phy.0.root F00037947 F00037956_0000.spill.bntp.cedar_phy.0.root F00037956 F00037974_0000.spill.bntp.cedar_phy.0.root F00037974 F00037977_0000.spill.bntp.cedar_phy.0.root F00037977 F00037989_0000.spill.bntp.cedar_phy.0.root F00037986 F00037989_0017.spill.bntp.cedar_phy.0.root F00037989 F00037989_0018.spill.bntp.cedar_phy.0.root F00037993 F00037989_0021.spill.bntp.cedar_phy.0.root F00037996 F00037993_0000.spill.bntp.cedar_phy.0.root F00038221 F00037993_0007.spill.bntp.cedar_phy.0.root F00038266 F00037996_0000.spill.bntp.cedar_phy.0.root F00038283 F00038221_0000.spill.bntp.cedar_phy.0.root F00038307 F00038266_0000.spill.bntp.cedar_phy.0.root F00038559 F00038283_0000.spill.bntp.cedar_phy.0.root F00038283_0006.spill.bntp.cedar_phy.0.root F00038304_0016.spill.bntp.cedar_phy.0.root F00038307_0000.spill.bntp.cedar_phy.0.root F00038559_0000.spill.bntp.cedar_phy.0.root ######## # DESK # ######## Upgrading kreymer desktop to SLF 5, limited availability this AM ============================================================================= 2008 08 21 ============================================================================= ############ # BLUWATCH # ############ Restarted bluwatch on minos01/25/26, stopped since the desktop reboot. This time being sure to set nohup before ./bluwatch &, and verifying updates after logging out. ######## # DESK # ######## kreymer recovered from /home filesystem error on desktop, required boot with recovery disk to fsck. the system continues to hang up... this is distracting ######## # GRID # ######## Date: Thu, 21 Aug 2008 12:16:15 -0500 (CDT) Subject: HelpDesk ticket 120454 ___________________________________________ Short Description: fnpc339 lacks mount of /grid/app Problem Description: We had a set of user jobs go into a block hole recently on node fnpc339. They all failed to access /grid/app . Indeed, mounts seem to be broken there for /grid/app /grid/data /home/kreymer As of 12:13, this problem is still present. ___________________________________________ Date: Thu, 21 Aug 2008 12:31:50 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: I saw the problem of the black hole jobs last night and disabled condor on that node. I was waiting for all the existing jobs to finish which they now have done. I will notify FEF that the node needs a reboot. 
Steve Timm ___________________________________________ Note To Requester: timm@fnal.gov sent this Notes To Requester: I have now sent a ticket to FEF asking for a reboot. This is the second of the AFS machines that has crashed this way in less than a week. Steve Timm ___________________________________________ Date: Fri, 22 Aug 2008 14:07:21 -0500 (CDT) Note To Requester: FEF rebooted the node this morning and said it was OK but it is not. I still can't login. Have asked them to look again. ___________________________________________ The grid mount points are back on fnpc339, and jobs are running there. But /afs is missing, and jobs are again failing. In fact, the openafs is not installed : -bash-3.00$ rpm -q openafs package openafs is not installed Rather than having afs reinstalled on this one node, please just remove the ISMINOSAFS flag for this node. You may also need to gracefully terminate the glidein pilots which are presently running there. ___________________________________________ Note To Requester: timm@fnal.gov sent this Notes To Requester: The ISMINOSAFS tag has been removed temporarily from node fnpc339. We will have them put the AFS back on on Monday. Sorry for the inconvenience. Steve Timm ___________________________________________ Date: Mon, 25 Aug 2008 16:55:55 -0500 (CDT) Solution: FEF added AFS back to the node. I changed the condor config back to make ISMINOSAFS true It is good to go. Steve Timm ___________________________________________ ============================================================================= 2008 08 20 ============================================================================= ######## # FARM # ######## Per petyt mail, 8 Jan, Run I thru IIb is 31720 - 38449 Run IIb (2007-04 - 2007-07) ######### # MYSQL # ######### Oops, finished up gzip/local phases of this month's archives ######### # MYSQL # ######### HOWTO.mysqladmin - details installation and operation of mysql, initially on minos-sam01 This will document two major modes 1) primary warehouse, for disaster recovery for upgrades and tests 2) replica Observe this on minos-mysql1 , from ps xfwww /local/ups/prd/mysql/v4_1_11/Linux-2/libexec/mysqld --basedir=/local/ups/prd/mysql/v4_1_11/Linux-2 --datadir=/data/database --pid-file=/data/database/minos-mysql1.fnal.gov.pid --skip-locking --port=3306 --socket=/data/database/mysql.sock 'start') setup mysql user=`/usr/bin/whoami` if [ "${user}" = "minsoft" ]; then $MYSQL_DIR/bin/mysql.server start else ulimit -n 4096 su minsoft -c "setup mysql; $MYSQL_DIR/bin/mysql.server start" fi ;; 'stop') setup mysql $MYSQL_DIR/bin/mysql.server stop ;; On minos-mysql1, /etc/my.cnf -> /data/database/my.cnf According to mysql.server comments, the default location should be basedir, which is $MYSQL_DIR, rather than datadir I don't like this much, as UPS products should not be hacked. So will continue to use the deprecated datadir/my.cnf, avoiding /etc/my.cnf to minimize root activity. cp database/my.cnf.minos-mysql1 database/my.cnf.20080820 As a last resort, reading the README file at $MYSQL_DIR This is very confusing document, too many forward references and unexplained options. many irrelevant comments about group ownership, for cases where several accounts share the mysql server recommends against default port 3306 ( we use this ) References to 'starting mysql client' are confusing to me. 
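A sketch, assuming the ups mysql product is set up in the shell (not something run above), of how to cross-check which option files mysqld will actually read, and which basedir/datadir the running server ended up with :
setup mysql
# mysqld prints the option-file search order near the top of its help output
${MYSQL_DIR}/bin/mysqld --verbose --help 2>/dev/null | grep -A 1 'Default options'
# once the server is up, confirm what it actually chose
${MYSQL_DIR}/bin/mysql -u root -p -e "SHOW VARIABLES LIKE 'basedir'; SHOW VARIABLES LIKE 'datadir';"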
____________________________________________________________________________ Let's try the tailor process : ____________________________________________________________________________ SOFT03 > ups tailor mysql Enter valid path for mysql data directory: /home/minsoft/database Never use default port number 3306 for any mysql server instances! Assign your port number here:3306 You can update mysql server options in my.cnf file before you start mysql server. Please assign a new username for your mysql daemon. For security it is recommended to substitute this name for mysql root in a mysql database. See README file in your mysql datadir for more details. Do not forget to set a strong password for root user IMMEDIATELY after initial startup of mysql daemon! Then replace root username with the newly assigned username. Enter your new username here:root There are small,medium,large or huge cnf files in /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/share/mysql directory. Which one you would like to use (s/m/l/h)? h Installing MySQL system tables... 080820 13:53:51 [Warning] One can only use the --user switch if running as root OK Filling help tables... 080820 13:53:51 [Warning] One can only use the --user switch if running as root OK To start mysqld at boot time you have to copy support-files/mysql.server to the right place for your system PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER ! To do so, start the server, then issue the following commands: /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysqladmin -u root password 'new-password' /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysqladmin -u root -h minos-sam03.fnal.gov password 'new-password' Alternatively you can run: /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysql_secure_installation which will also give you the option of removing the test databases and anonymous user created by default. This is strongly recommended for production servers. See the manual for more instructions. You can start the MySQL daemon with: cd /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6 ; /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysqld_safe & You can test the MySQL daemon with mysql-test-run.pl cd mysql-test ; perl mysql-test-run.pl Please report any problems with the /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysqlbug script! The latest information about MySQL is available on the web at http://www.mysql.com Support MySQL by buying support/licenses at http://shop.mysql.com chgrp: invalid group name `products' You can on/off skip-innodb option in my.cnf file before starting mysqld. Disable InnoDB tables now (y/n)? n Mysql server successfuly configured. chgrp: invalid group name `mysql' !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Cannot change group name to mysql in your data directory! Group ownership for files in your data area should belong to mysql group. Reserved gid is 9531. Please, add an entry for a mysql group to your systems /etc/group file or NIS group map.If you are setting up multiple databases to be managed by different UNIX acccounts, add these account names to the group entry. !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
There are following ups function you can use: To start/stop mysql server : ups start/stop mysql To set root password : ups rootpass mysql To start mysql client : ups client mysql To see run-time server variables: ups variables mysql To see short status message : ups status mysql To ping mysql server : ups ping mysql ____________________________________________________________________________ Set the initial password with SOFT03 > ups rootpass mysql Setup:mysql datadir = /home/minsoft/database Setup:port=3306; socket=/home/minsoft/database/mysql.sock Enter password for root user: Setup root password for root@localhost is O.K. You also need to set this password for root@minos-sam03.fnal.gov when you start mysql client. You can do it using following command in mysql: mysql> SET PASSWORD FOR root@minos-sam03.fnal.gov=PASSWORD('new_password'); See user table in mysql database. Changed password back to the existing one, for testing mysqladmin -u root -p password themostsecrecpasswordever ____________________________________________________________________________ Loading data tables SOFT03 > du -sm /minos/data/mysql/archive/20080804/offline 53095 /minos/data/mysql/archive/20080804/offline SOFT03 > time cp -a /minos/data/mysql/archive/20080804/offline/*.frm \ > ${HOME}/database/mysql/ real 0m6.293s user 0m0.009s sys 0m0.067s SOFT03 > time cp -a /minos/data/mysql/archive/20080804/offline/*.MYD \ > ${HOME}/database/mysql/ This runs at 6 MB/sec. Interrupted around 15 GB into the copy. Removed partial copies, SOFT03 > export LANG="C"; SOFT03 > rm database/mysql/[A-Z]* SOFT03 > time scp -c blowfish minsoft@minos-mysql1:/data/archive/COPY/20080702/offline/* database/mysql/ real 21m42.975s user 5m6.842s sys 3m51.649s rates were typically 25 to 30 MB/sec for larger files SOFT03 > rm database/mysql/*.log SOFT03 > rm database/mysql/CALADCTOPESVLD.dump SOFT03 > rm database/mysql/CALADCTOPESVLD.dump2 What about db.opt ? -rw-r----- 1 minsoft e875 65 Aug 20 15:40 db.opt SOFT03 > cat database/mysql/db.opt default-character-set=latin1 default-collation=latin1_swedish_ci SOFT03 > du -sm database/mysql 20265 database/mysql SOFT03 > time gunzip database/mysql/*.gz real 33m53.119s user 12m15.257s sys 3m19.913s Shifted offline tables to the offline database SOFT03 > mv database/mysql/[A-Z]* database/offline/ SOFT03 > mv database/mysql/db.opt database/offline/db.opt Shifted files to /home/minsoft/recover directory Started database mysql> use offline mysql> restore table BEAMMONCUTS from '/home/minsoft/restore' ; +-------------+---------+----------+----------------------------------------+ | Table | Op | Msg_type | Msg_text | +-------------+---------+----------+----------------------------------------+ | BEAMMONCUTS | restore | error | Failed generating table from .frm file | +-------------+---------+----------+----------------------------------------+ 1 row in set, 1 warning (0.00 sec) ============================================================================= 2008 08 19 ============================================================================= ######### # MYSQL # ######### minsoft@minos-sam03 SOFT03 > mkdir -p ups/db/foo SOFT03 > mkdir -p ups/db/.upsfiles SOFT03 > mkdir -p ups/db/.updfiles SOFT03 > AFSP=/afs/fnal.gov/files/code/e875/general/ups SOFT03 > cp ${AFSP}/db/.upsfiles/dbconfig ups/db/.upsfiles/dbconfig SOFT03 > nedit ups/db/.upsfiles/dbconfig changed path to /home/minsoft/ups/... SOFT03 > cp ${AFSP}/db/.updfiles/updconfig ups/db/.updfiles/updconfig . 
/usr/local/etc/setups.sh setup upd export PRODUCTS=/${HOME}/ups/db upd install -j mysql v5_0_51 informational: installed mysql v5_0_51. upd install succeeded. SOFT03 > ups list -aK+ "mysql" "v5_0_51" "Linux+2.6" "" "" SOFT03 > ups declare -c mysql v5_0_51 DECLARE: A UPS start/stop exists for this product Setup seems to use configs from ${PRODUCTS}/mysql/config/${MACHID}.${UPS_OPTIONS} Mysql> cat /local/ups/db/mysql/config/minos-mysql1.fnal.gov. /data/database ########### # GNUPLOT # ########### Date: Tue, 19 Aug 2008 16:13:53 -0500 (CDT) ___________________________________________ Ticket #: 120371 ___________________________________________ Short Description: Install gnuplot on Minos cluster and servers Problem Description: run2-sys : Please install gnuplot on the Minos Cluster and the minos servers, including minos26, minos-sam01/2/3 and minos-mysql1 . This should be a standard part of Minos installations. This is not at all urgent, please do it at your next convenience. ___________________________________________ Date: Tue, 19 Aug 2008 16:22:53 -0500 (CDT) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 20 Aug 2008 09:44:08 -0500 (CDT) Solution: The gnuplot rpm has been installed on the cluster and server machines. ___________________________________________ This works --- for batch, needed to add a - for stdin echo " ... " | gnuplot -persist - ___________________________________________ ___________________________________________ ######### # ADMIN # ######### Date: Tue, 19 Aug 2008 15:36:18 -0500 (CDT) Subject: HelpDesk ticket 120369 ___________________________________________ Short Description: Please add rbpatter to the e875 group on the Minos Cluster Problem Description: run2-sys : Please add rbpatter to the e875 group on the Minos Cluster Thanks ! _________________________________________________________________ Date: Tue, 19 Aug 2008 15:41:10 -0500 (CDT) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. __________________________________________ Date: Wed, 20 Aug 2008 08:59:11 -0500 (CDT) Solution: rbpatter has been added to e875 group. ######### # FNALU # ######### Date: Tue, 19 Aug 2008 11:46:27 -0500 (CDT) Subject: HelpDesk ticket 120350 ___________________________________________ Short Description: FNALU status needs update on CD systems page Problem Description: The FNALU status is shown as Down with a scheduled outage, at http://computing.fnal.gov/cdsystemstatus/system/FNALU.html due to a problem with fsui03 back in June. But the system is up. Please update the status. ___________________________________________ Date: Tue, 19 Aug 2008 12:50:31 -0500 (CDT) This ticket has been reassigned to GREANEY, MARGARET of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Mon, 25 Aug 2008 13:40:12 -0500 (CDT) Note To Requester: mgreaney@fnal.gov sent this Notes To Requester: I will ask the CSI group to update the status page for FNALU as I see it has a red ball on the system status page. When you click on this page, it does not show any message. As for the remainder, none of the nodes are marked down in NGOP. I think there is still a problem with ngop updating the system status pages. I make entries for down or problem nodes in ngop and these are updated to the system status page, but it is not something that I do, but is supposed to get done automatically. 
___________________________________________ Solution: There is some way other than ngop updates which changes the system status page, and the helpdesk has access to this. ___________________________________________ ####### # AFS # ####### Mail to nwest and minos-data : I recently looked at /var/log/messages on minos-mysql1, and saw a surprising number of AFS timeouts, highly correleated with nwest ssh logins from minos-db.minos-soudan.org. These are very similar to the short timeouts that have been plaguing the Minos Cluster, typically once per month per node, not correleated with any other activity as far as I can tell. I would love to understand what activity your connections are performing which might trigger these timeouts. Maybe we can reproduce and try to eliminate the problem. Here is a sample from this morning : ... a subset of the listing below ... ####### # AFS # ####### Timeouts typically look like this on minos-mysql1 Aug 19 02:15:12 minos-mysql1 sshd(pam_unix)[10442]: session opened for user nwest by (uid=0) Aug 19 02:15:15 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:15:15 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:18:24 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:18:24 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:37:29 minos-mysql1 sshd(pam_unix)[11895]: session opened for user nwest by (uid=0) Aug 19 02:37:32 minos-mysql1 kernel: afs: Tokens for user of AFS id 4777 for cell fnal.gov are discarded (rxkad error=19270407) Aug 19 02:37:37 minos-mysql1 sshd(pam_unix)[12031]: session opened for user nwest by (uid=0) Aug 19 02:38:54 minos-mysql1 sshd(pam_unix)[12169]: session opened for user nwest by (uid=0) Aug 19 02:38:57 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:38:57 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:39:01 minos-mysql1 sshd(pam_unix)[12305]: session opened for user nwest by (uid=0) Aug 19 02:40:10 minos-mysql1 sshd(pam_unix)[12506]: session opened for user nwest by (uid=0) Aug 19 02:40:12 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:40:12 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:40:13 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:40:13 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:40:14 minos-mysql1 kernel: afs: Tokens for user of AFS id 4777 for cell fnal.gov are discarded (rxkad error=19270407) Aug 19 02:40:17 minos-mysql1 sshd(pam_unix)[12642]: session opened for user nwest by (uid=0) Aug 19 02:40:30 minos-mysql1 sshd(pam_unix)[12778]: session opened for user nwest by (uid=0) Aug 19 02:40:33 minos-mysql1 kernel: 
afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:40:33 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:40:37 minos-mysql1 sshd(pam_unix)[12914]: session opened for user nwest by (uid=0) Aug 19 02:40:46 minos-mysql1 sshd(pam_unix)[13050]: session opened for user nwest by (uid=0) Aug 19 02:40:50 minos-mysql1 sshd(pam_unix)[13186]: session opened for user nwest by (uid=0) Aug 19 02:43:20 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:43:20 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:43:20 minos-mysql1 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:43:20 minos-mysql1 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) In /var/log/secure, find Aug 19 02:15:11 minos-mysql1 sshd[10441]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:15:11 minos-mysql1 sshd[10441]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56335 ssh2 Aug 19 02:37:29 minos-mysql1 sshd[11894]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:37:29 minos-mysql1 sshd[11894]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56340 ssh2 Aug 19 02:37:36 minos-mysql1 sshd[12030]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:37:36 minos-mysql1 sshd[12030]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56341 ssh2 Aug 19 02:38:53 minos-mysql1 sshd[12168]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:38:53 minos-mysql1 sshd[12168]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56342 ssh2 Aug 19 02:39:00 minos-mysql1 sshd[12304]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:39:00 minos-mysql1 sshd[12304]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56343 ssh2 Aug 19 02:40:10 minos-mysql1 sshd[12505]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:10 minos-mysql1 sshd[12505]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56344 ssh2 Aug 19 02:40:16 minos-mysql1 sshd[12641]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:16 minos-mysql1 sshd[12641]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56345 ssh2 Aug 19 02:40:30 minos-mysql1 sshd[12777]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:30 minos-mysql1 sshd[12777]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56346 ssh2 Aug 19 02:40:36 minos-mysql1 sshd[12913]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:36 minos-mysql1 sshd[12913]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56347 ssh2 Aug 19 02:40:46 
minos-mysql1 sshd[13049]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:46 minos-mysql1 sshd[13049]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56348 ssh2 Aug 19 02:40:49 minos-mysql1 sshd[13185]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:49 minos-mysql1 sshd[13185]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56349 ssh2 Aug 19 03:31:22 minos-mysql1 sshd[15994]: Authorized to nwest, krb5 principal nwest@FNAL.GOV (krb5_kuserok) Aug 19 03:31:22 minos-mysql1 sshd[15994]: Accepted external-keyx for nwest from ::ffff:163.1.136.71 port 48518 ssh2 Mysql> host 198.124.213.10 10.213.124.198.in-addr.arpa domain name pointer minos-db.minos-soudan.org. Mysql> host 163.1.136.71 71.136.1.163.in-addr.arpa domain name pointer pplxint1.physics.ox.ac.uk. # grep nwest /etc/passwd nwest:x:4777:5111:Nick_West:/afs/fnal.gov/files/home/room3/nwest:/usr/local/bin/tcsh ######## # FARM # -> 2008 ######## Date: Tue, 19 Aug 2008 09:15:35 -0500 From: Phyllis Rubin Reply-To: phyllis.rubin@comcast.net To: kreymer@fnal.gov, asousa@fnal.gov Subject: Howie Rubin Howie won't be at today's phone meeting. He is in the hospital after having a heart attack on Saturday night. He is doing well now. ... ============================================================================= 2008 08 18 ============================================================================= ####### # AFS # ####### Summary of AFS ups/minossoft copy issues Questions are prefixed with > We will ask the CSI group to rearrange the AFS mounts : I have sniffed out volumes using fs examine MD=/afs/fnal.gov/files/data/minos MG=/afs/fnal.gov/files/code/e875/general MG/minossoft is not presently its own volume, it sits under MG. AFS volume present mount future mount vid = 1685748770 named c.e875.d1 MG/ups MG/oldups (vid = 1685404769 named code.e875.general) MG/minossoft MG/oldminossoft vid = 1685735879 named nb.minos.d119 MD/d119 MG/ups vid = 1685735882 named nb.minos.d120 MD/d120 MG/minossoft > What happens to old MG/minossoft files when we mount MD/d120 there ? > How long do the remounts take, and do they cause failures of running jobs ? UPS Several files were duplicated in removing symlinks, see /minos/scratch/minsoft/afssoft/slink/prod2 Mainly, config_build_root.sh config_build_root_minimal.sh and broken links to mengel There are more broken links to non-AFS space > Should we set up local symlinks for config_build_root*.sh ? MINOSSOFT /minos/scratch/minsoft/afssoft/slink/soft1 Beyond the expected bin/lib/tmp copies, there were symlinks to packages/DatabaseTables/HEAD 53 MB packages/WebDocs/HEAD/doxygen/loon 415 MB The doxygen files copied extremely slowly. > Should these be left in-line, cleaning up the originals later ? /afs/fnal.gov/files/data/minos/d04/libraries/DatabaseTables/HEAD /afs/fnal.gov/files/code/e875/releases1/doxygen/loon ######### # ADMIN # ######### Date: Mon, 18 Aug 2008 14:17:57 -0500 (CDT) Subject: HelpDesk ticket 120307 ___________________________________________ Short Description: minos21 sshd not accepting logins Problem Description: run2-sys : minos21 does not accept ssh logins. 
rsh does allow connections $ ssh minos21 ssh_exchange_identification: Connection closed by remote host The latest ssh login seems to be Aug 16 05:27:10 minos21 sshd(pam_unix)[28947]: session opened for user rhatcher by (uid=0) ___________________________________________ Date: Mon, 18 Aug 2008 14:27:05 -0500 (CDT) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 18 Aug 2008 14:45:17 -0500 (CDT) Solution: I restarted the ssh daemon. ___________________________________________ ___________________________________________ ######## # FARM # ######## SIESTA=`date -u +'%Y-%m-%d %H:%M:%S' -d 'now - 1 day'` FARDIM=`printf "DATA_TIER raw-far and RUN_TYPE physics%s and EVENT_COUNT >=16000 and END_TIME >= to_date(\'${SIESTA}\',\'yyyy-mm-dd hh24:mi:ss\')" %` sam list files --dim="${FARDIM}" ~kreymer/minos/scripts/samlocate "${FARDIM}" This does not work, no such dimension. Several metadata items seem not to be dimensions, therefore not selectable. Found the column name via db browser, EVENT_COUNT=16083 apparently in the FILE_DATA table as in F00041835_0003.mdaq.root Test in development : FARDIM=`printf "DATA_TIER raw-far and RUN_NUMBER 28812"` MINOS26 > sam get metadata --file=F00028812_0000.mdaq.root ... 'eventCount' : 7L, export SAM_ORACLE_CONNECT="samdbs/" samadmin add dimension \ --name=EVENT_COUNT \ --table=file_data \ --column=EVENT_COUNT \ --type=number \ --desc='select eventCount, DATA_FILES EVENT_COUNT' New dimensionName 'EVENT_COUNT' added. MINOS26 > sam get dimension info --category=datafile EVENT_COUNT (category: datafile) select eventcount data_files event_count FARDIM=`printf "DATA_TIER raw-far and RUN_NUMBER 28812 and EVENT_COUNT 7"` MINOS26 > sam list files --dim="${FARDIM}" table or view does not exist SQL> select file_name,event_count from data_files where file_name = 'F00028812_0000.mdaq.root' ; FILE_NAME -------------------------------------------------------------------------------- EVENT_COUNT ----------- F00028812_0000.mdaq.root 7 Try again with data_files table, in development samadmin add dimension \ --name=EVENT_COUNTS \ --table=data_files \ --column=EVENT_COUNT \ --type=number \ --desc='select eventCount based on DATA_FILES EVENT_COUNT' New dimensionName 'EVENT_COUNTS' added. FARDIM=`printf "DATA_TIER raw-far and RUN_NUMBER 28812 and EVENT_COUNTS 7"` That's fine. ####### # SAM # ####### ########## # CONDOR # ########## Released some older held pilots 11:28 condor_release gfactory They all went to Idle state, OK ########## # CONDOR # ########## [gfrontend@minos25 etc]$ stat vofrontend.cfg Access: 2008-07-31 14:18:24.000000000 -0500 [gfrontend@minos25 etc]$ ps -flu gfrontend F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 4 S 42918 13839 13838 0 75 0 - 1536 wait Jul31 pts/2 00:00:00 -bash 0 S 42918 15585 1 1 76 0 - 28850 - Jul14 ? 13:54:57 python glideinFrontend.py 90 4 /home/gfrontend/myvo 0 R 42918 23272 13839 0 77 0 - 1008 - 09:23 pts/2 00:00:00 ps -flu gfrontend That July 31 time was me looking at the .cfg file. 
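A sketch, not from the log, of dumping the HoldReason strings for the held gfactory pilots before a blanket condor_release, to confirm they are the usual Globus error 17 / 43 holds :
# one line per held pilot : cluster.proc and the hold reason
condor_q gfactory -hold \
  -format '%d.' ClusterId -format '%-3d ' ProcId -format '%s\n' HoldReason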
Check the log, /home/gfrontend/myvofrontend2/log/frontend_info.20080714.log [2008-07-14T15:32:46-05:00 15585] Starting up Kill and restart, soon after sleep [2008-08-18T09:29:06-05:00 15585] Sleep kill 15585 [2008-08-18T09:30:39-05:00 15585] Sleep Reverted to kill -9 15585 [gfrontend@minos25 ~]$ ./start_frontend.sh 2008-08-18T09:31:30-05:00 23634] Starting up [2008-08-18T09:31:30-05:00 23634] Iteration at Mon Aug 18 09:31:30 2008 [2008-08-18T09:31:33-05:00 23634] Match [2008-08-18T09:31:33-05:00 23634] Total running 255 limit 125 MINOS25 > condor_q pawloski -run | grep pawl | wc -l 251 MINOS25 > date ; condor_q pawloski -run | grep pawl | wc -l Mon Aug 18 09:44:32 CDT 2008 242 Mon Aug 18 09:44:45 CDT 2008 249 Could not connect to CRL, finally got a connection at MINOS25 > date ; condor_q pawloski -run | grep pawl | wc -l Mon Aug 18 09:48:28 CDT 2008 238 10:50 Greg killed the presently idle jobs, to avoid new starts 10:51 Grek killed all the jobs Mysql> date ; mysqladmin processlist -u root | grep fnal | wc -l Mon Aug 18 10:51:23 CDT 2008 16 Adjusted limit down to 100, for safety [2008-08-18T11:23:56-05:00 645] Starting up [2008-08-18T11:23:56-05:00 645] Iteration at Mon Aug 18 11:23:56 2008 [2008-08-18T11:23:56-05:00 645] Match [2008-08-18T11:23:56-05:00 645] Total running 0 limit 100 ######## # SOFT # ######## 14:04 UTC Cleaned up protections in minos/scripts, MINOS26 > chmod g+x * MINOS26 > chmod o+x * ============================================================================= 2008 08 15 ============================================================================= ######## # FARM # ######## REQUEST FOR FILE INPUT LISTS FROM SAM Date: Thu, 14 Aug 2008 21:35:18 -0500 From: Howard Rubin To: Arthur Kreymer What I need is a file with the subruns from the 'UDT previous day' delivered before 23:19 local time to directory /minos/data/minfarm/lists. The content of the file is subrun month where subrun is obvious and month is the month subdirectory in which the mdaq is to be found. Life would be easiest for me if the file were named fardet.month (neardet.month); example: fardet.2008-08 Howie __________________________________________________________________________ SIESTA=`date -u +'%Y-%m-%d %H:%M:%S' -d 'now - 1 day'` FARDIM=`printf "DATA_TIER raw-far and END_TIME >= to_date(\'${SIESTA}\',\'yyyy-mm-dd hh24:mi:ss\')"` sam list files --dim="${FARDIM}" ~kreymer/minos/scripts/samlocate "${FARDIM}" ######## # DATA # ######## Date: Fri, 15 Aug 2008 15:18:39 -0500 (CDT) From: Kregg E Arms To: Arthur Kreymer Cc: Minos Data , Marta Tavera Subject: New pnfs directories for MC Hi Art, It appears we will soon need directories in pnfs for the following new MC samples: near/daikon_04/M100200N_helium near/daikon_04/M100200R_helium far/daikon_04/M100200N_helium far/daikon_04/M100200R_helium _______________________________________________________________ cd ~kreymer/minos/scripts ./pnfsdirs near cedar_phy_bhcurv daikon_04 M100200N_helium write ./pnfsdirs near cedar_phy_bhcurv daikon_04 M100200R_helium write ./pnfsdirs far cedar_phy_bhcurv daikon_04 M100200N_helium write ./pnfsdirs far cedar_phy_bhcurv daikon_04 M100200R_helium write MINOS26 > date Fri Aug 15 16:38:08 CDT 2008 ######## # FARM # ######## farcat 3674 25987 spill.bmnt.cedar_phy_bhcurv.0.root 3674 27225 spill.bntp.cedar_phy_bhcurv.0.root 3674 17099 spill.mrnt.cedar_phy_bhcurv.0.root Picked up the bntp's first, see below Following the plan of 2008 06 18 to clear bmnt files out of farcat area, simplified because no mrnt have gone to PNFS/SAM. 
Everything can be done as minfarm@fnpcsrv1 ----------------------------------------------------------- BMNT LIST BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` MFILES=`ls /minos/data/minfarm/farcat | grep mrnt | sort` printf "${BFILES}\n" | wc -w 3674 printf "${MFILES}\n" | wc -w 3674 ----------------------------------------------------------- MOVE MRNT OUT OF THE WAY 13:18 mkdir -p /minos/data/minfarm/FMRNT cd /minos/data/minfarm/farcat for MFILE in ${MFILES} ; do mv ${MFILE} /minos/data/minfarm/FMRNT/${MFILE} done ----------------------------------------------------------- RENAME BMNT TO MRNT cd /minos/data/minfarm/farcat check for conflicts for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` [ -r ${MFILE} ] && ls -l ${MFILE} done 13:24 for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` mv ${BFILE} ${MFILE} done ----------------------------------------------------------- DONE, NOW ROUNDUP ! 3674 25987 spill.mrnt.cedar_phy_bhcurv.0.root Grab all mrnt files at once ./roundup -b 4000 -r cedar_phy_bhcurv far Fri Aug 15 16:33:34 CDT 2008 ######## # FARM # ######## bad_runs.cedar_phy_bhcurv is updated for stray subruns, SRV1> ./roundup -s sntp -r cedar_phy_bhcurv far Wow, for the first time, see size discrepancy, consistently, OK adding F00040403_0000.all.sntp.cedar_phy_bhcurv.0.root 3 NSFIL SSIZ MSIZ DSIZ 3 1482704657 1479415327 1644665 OOPS, concatenated file size discrepancy, 1644665 gt 1500000 OOPS, concatenated file size discrepancy, 1643462 gt 1500000 Subrun 0 is 1.47 GB by itself, this is a wonky run. SRV1> dds /minos/data/minfarm/farcat/*sntp.cedar_phy_bhcurv* -rw-rw-r-- 1 minospro numi 1472851108 Aug 10 07:25 /minos/data/minfarm/farcat/F00040403_0000.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 4837673 Aug 8 19:55 /minos/data/minfarm/farcat/F00040403_0001.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 5015876 Aug 8 19:56 /minos/data/minfarm/farcat/F00040403_0002.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 512174805 Aug 9 07:57 /minos/data/minfarm/farcat/F00040403_0004.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 5096558 Aug 8 19:57 /minos/data/minfarm/farcat/F00040403_0005.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 482758689 Aug 9 07:09 /minos/data/minfarm/farcat/F00040403_0007.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 5350507 Aug 8 19:58 /minos/data/minfarm/farcat/F00040403_0008.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 498439762 Aug 9 07:34 /minos/data/minfarm/farcat/F00040403_0010.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 5224760 Aug 8 19:58 /minos/data/minfarm/farcat/F00040403_0011.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 455139267 Aug 9 06:31 /minos/data/minfarm/farcat/F00040403_0013.all.sntp.cedar_phy_bhcurv.0.root This left a stray file in WRITE -rw-r--r-- 1 minfarm numi 1479417732 Aug 15 09:31 Merged.17767.root SRV1> rm /minos/data/minfarm/WRITE/Merged.17767.root Hacked DLIM from 1500000 to 2000000, reran. All the files show large per-subrun DSIZ values about 1.6 MB/subrun Also ran the .bntp pass : ./roundup -s bntp -r cedar_phy_bhcurv far Iterated, with larger bail limit ( default was 1000 ) ./roundup -s bntp -b 3000 -r cedar_phy_bhcurv far And once more the purge WRITE ./roundup -s bntp -b 4000 -r cedar_phy_bhcurv far Hacked DLIM back. ########## # CONDOR # ########## Date: Fri, 15 Aug 2008 09:02:54 -0500 (CDT) Subject: HelpDesk ticket 119292 has additional info. 
Note To Requester: We captured extra debug output from one of the minos glidein jobs yesterday that held with error 17 and have sent it to the Condor team and the OSG Troubleshooting team. Steve Timm ####### # SAM # ####### Date: Thu, 14 Aug 2008 23:16:57 +0100 From: Nicholas Devenish I am trying to learn how to use sam, and actually registered my username on the system in april; but when I try to create a definition it gives me the message: "Person 'nickd' is not registered in group 'minos'" When I look at the registration at http://www-numi.fnal.gov/cgi-bin/autoRegister.py it looks like I am supposed to be in the group, so I don't know what it is complaining about. ______________________________________________________________________ Date: Tue, 26 Aug 2008 15:39:24 +0000 (GMT) From: Arthur Kreymer Sorry to be slow in responding . Are you still having problems ? It appears that nickd was added to the minos group just after your creation attempts on August 14. Perhaps the dbserver had some old information cached at that time. I successfully created a definition under Person nickd today. Tested by using export SAM_USER_NAME=nickd ============================================================================= 2008 08 14 ============================================================================= ####### # AFS # ####### Working on duplication of e875/sim for parrot, use volume d117. 10 broken links to /utarchive/para/minos/events/old 18 broken links to /afs/fnal.gov/files/data/minos/d12/root_files 5 broken links to /afs/fnal.gov/files/data/minos/d7/hitbits 4 broken links to /afs/fnal.gov/files/data/minos/d1/nuflux/newfiles Size of files in the links that exist : 1 /afs/fnal.gov/files/code/e875/general/minossoft/releases/development/BField/bfld_imap.C 624 /afs/fnal.gov/files/data/minos/d17/gnumi_flux 77 /afs/fnal.gov/files/data/minos/d82/rhatcher/daikon_02.tar.gz 1 /afs/fnal.gov/files/data/minos/d87/gnumi/v19 1 /afs/fnal.gov/files/data/minos/d87/gnumi/v19 1 /afs/fnal.gov/files/data/minos/d87/gnumi/v19 1 /afs/fnal.gov/files/data/minos/d87/gnumi/v19 gnumi_flux has additional links, to d87, largest is v18 at 26 GB. the original d18 is small, links to large d87 Mysql> fs listquota $MD/d17 Volume Name Quota Used %Used Partition nb.minos.d19 8000000 2334722 29% 60% Mysql> fs listquota $MD/d87 Volume Name Quota Used %Used Partition nb.minos.d87 50000000 35085185 70% 58% For present, bailing on this, leave sim as it is in /afs/fnal.gov/files/code/e875/sim Added to mountfile.d119d120.grow : This gets rid of the old labyrinth complaints : -bash-3.00$ /grid/app/minos/parrot/paloon SETTING UP UPS SETTING UP MINOS No default SAM configuration exists at this time. 
MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 setup "test" version of LABYRINTH [ linux , FNALU ] setup NEUGEN3 development explicitly setting up GCC3_4_3 version of GEANT using PYTHIA6 (v6_412) for LUND ########## # CONDOR # ########## MINOS25 > condor_q -hold 30 jobs; 0 idle, 0 running, 30 held MINOS25 > condor_release gfactory ########## # PARROT # ########## Testing releases/ups copies on d119/d120 Recorrected symlinks to be /general/ups rather than /general/ups We plan to remount this all as /general/ups Had done : SLINKF=${LOGD}/slink/prod1 SLINKL=${LOGD}/slink/prod1.log generated SLINKF as below, Corrected general/ups symlinks to general/products This had not been done in the first tests in d141, do not know why we got away with this, as d141 did not have symlink ups -> products SLINKS=`grep ':/afs' ${SLINKF} | grep ":${AFSC}/general/ups"` printf "${SLINKS}\n" | while read SLINK ; do SLIN=`printf "${SLINK}" | cut -f 2 -d :` SLIX=${SLIN/\/e875\/general\/ups/\/e875\/general\/products} SLOU=/`printf "${SLINK}" | cut -f 1 -d :` rm -f ${SLOU} ln -s ${SLIX} ${SLOU} done Now need to reverse this, having lost our copy of prod1 SLINKF=${LOGD}/slink/prodprod SLINKL=${LOGD}/slink/prodprod.log generated SLINKF, hacked to include $UPI, which we need to change Mysql> wc -l ${SLINKF} 29 /minos/scratch/minsoft/afssoft/slink/prodprod SLINKS=`grep ':/afs' ${SLINKF} | grep ":${AFSC}/general/products"` printf "${SLINKS}\n" | while read SLINK ; do SLIN=`printf "${SLINK}" | cut -f 2 -d :` SLIX=${SLIN/\/e875\/general\/products/\/e875\/general\/ups} SLOU=/`printf "${SLINK}" | cut -f 1 -d :` rm -f ${SLOU} ln -s ${SLIX} ${SLOU} done mindata@minos26 $ time make_growfs -k /afs/fnal.gov/files/data/minos/d119 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d119/.growfsdir make_growfs: scanning directory tree for changes... make_growfs: 1075098 files, 5525 links, 149944 dirs, 0 checksums computed real 10m51.369s user 1m16.832s sys 1m43.335s Ran the usual HOWTO.parrot tests on fnpc185, OK ! Ran paloon integrated test, -bash-3.00$ /grid/app/minos/parrot/paloon SETTING UP UPS SETTING UP MINOS No default SAM configuration exists at this time. MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 /tmp/fileHEYSq8: line 620: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory RUNNING LOON Warning in : class timespec already in TClassTable Processing firstlast.C... Spin(1 in 1 out 0 filt.) 1) +RawRecCounts::Ana n=1 ( 1/ 0) t=( 0.00/ 0.00) ... RawRecCounts done OK, ran loon under parrot ####### # SAM # ####### pittam reports several cedar files not declared to SAM, in cedar, sntp_data 10 2006-01 2007 Feb 5 32 2006-06 2007 Feb 5 12 2006-08 2006 Dec 12/13 1 2006-09 2006 Dec 12 Test 1 file MINOS26 > sam locate N00010896_0014.spill.sntp.cedar.0.root Datafile with name 'N00010896_0014.spill.sntp.cedar.0.root' not found. RELEASE=cedar DET=near MONTH=2006-09 for MONTH in 2006-01 2006-06 2006-08 2006-09 ; do ./saddreco ${DET} ${RELEASE} ${MONTH} verify done needed 20, 65, 24, 2 files These should be even numbers, including cosmic/spill N00010163_0015.spill.sntp.cedar.0.root is missing. 
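For the record, the verify pass over several months could be wrapped so each month's output is kept for comparison before anything is declared. A sketch only, reusing the ./saddreco DET RELEASE MONTH verify call and log area shown here; the grep for 'needed' is a guess at the verify output format.

#!/bin/sh
# Sketch: run saddreco in verify mode across months, keep per-month logs.
DET=near
RELEASE=cedar
LOGD=${HOME}/ROUNTMP/LOG/saddreco/${RELEASE}
mkdir -p ${LOGD}
for MONTH in 2006-01 2006-06 2006-08 2006-09 ; do
    VLOG=${LOGD}/${DET}.${MONTH}.verify
    ./saddreco ${DET} ${RELEASE} ${MONTH} verify 2>&1 | tee ${VLOG}
    grep -i needed ${VLOG}    # 'needed N files' wording is an assumption
done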
declare 2006-09, looks OK SLOG=${HOME}/ROUNTMP/LOG/saddreco/${RELEASE}/${DET}.log for MONTH in 2006-01 2006-06 2006-08 2006-09 ; do ./saddreco ${DET} ${RELEASE} ${MONTH} declare done 2>&1 | tee -a ${SLOG} Updated HOWTO.saddreco with improved SLOG and paths Ran full verify scan, one more file was missing MONTH 2005-03 needed 1 N00007101_0001.cosmic.*.cedar.0.root MONTH 2007-01 several obsoletes MONTH 2008-07 several obsoletes obsolete N00014529_0010.spill.cand.cedar.1.root obsolete N00014529_0002.spill.cand.cedar.1.root obsolete N00014529_0003.spill.cand.cedar.1.root obsolete N00014529_0001.spill.cand.cedar.1.root obsolete N00014551_0001.spill.cand.cedar.0.root obsolete N00014529_0004.spill.cand.cedar.1.root obsolete N00014529_0005.spill.cand.cedar.1.root obsolete N00014529_0006.spill.cand.cedar.1.root MONTH=2005-03 ./saddreco ${DET} ${RELEASE} ${MONTH} declare 2>&1 | tee -a ${SLOG} FINISHED Thu Aug 14 16:39:16 2008 Also scanned DET=far, found nothing missing. ####### # SAM # ####### Example SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and MC.VTXREGION 3 and MC.BFIELD 3 and VERSION cedar.phy.bhcurv and RUN_NUMBER >= 7250 and RUN_NUMBER <= 7260 " MINOS26 > sam list files --summaryonly --dim="${SAMDIM}" File Count: 14 Average File Size: 1.50GB Total File Size: 20.96GB Total Event Count: 270400 MINOS26 > sam list files --noSummary --dim="${SAMDIM}" n13037259_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037257_0025_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037257_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037252_0014_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037260_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037254_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037250_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037255_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037253_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037258_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037256_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037252_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037251_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037255_0016_L010185N_D04.sntp.cedar_phy_bhcurv.1.root ============================================================================= 2008 08 13 ============================================================================= ######## # FARM # ######## ---------- rubin > F00038575_0010 is bad_runs, but with pass 1 (filenames were > intentionally changed) Added a line to bad_runs to mark pass 0 as bad, which it was. fnpcsrv1% grep F00038575_0010 bad_runs.cedar_phy_bhcurv >> /tmp/newbad fnpcsrv1% nedit /tmp/newbad changed pass1 to 0 fnpcsrv1% cat /tmp/newbad >> bad_runs.cedar_phy_bhcurv fnpcsrv1% grep F00038575_0010 bad_runs.cedar_phy_bhcurv F00038575_0010.1 2007-08 139 2008-08-09 22:08:13 fcdfcaf1283 F00038575_0010.0 2007-08 139 2008-08-09 22:08:13 fcdfcaf1283 > F00039811 and F00039818 are 2007-10 and were not supposed to be run > through the spill pass. It looks as though a single run in each case > was put through both passes by mistake, probably as part of a previous > cleanup. 
One of us should delete the spill files from farcat We have 1 subrun of each of these, bmnt/bntp/mrnt/sntp rm /minos/data/minfarm/farcat/F00039811* rm /minos/data/minfarm/farcat/F00039818* sam locate F00039811_0000.spill.cand.cedar_phy_bhcurv.0.root ['/pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10,321@vok698'] sam locate F00039818_0015.spill.cand.cedar_phy_bhcurv.0.root ['/pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10,369@vok698'] sam undeclare file F00039811_0000.spill.cand.cedar_phy_bhcurv.0.root sam undeclare file F00039818_0015.spill.cand.cedar_phy_bhcurv.0.root ls /pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10/F00039811_0000.spill.cand.cedar_phy_bhcurv.0.root ls /pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10/F00039818_0015.spill.cand.cedar_phy_bhcurv.0.root rm /pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10/F00039811_0000.spill.cand.cedar_phy_bhcurv.0.root rm /pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10/F00039818_0015.spill.cand.cedar_phy_bhcurv.0.root > F00040124 crosses a month/year boundary and the first part may have > escaped the original runlist. I'm submitting them. > > F00040133, 40403, 40421: No idea why they weren't run. I'm submitting > them. OK, will keep an eye out for them. ---------- All but 40403 and 40421 seem to be there now SRV1> ./roundup -s sntp -r cedar_phy_bhcurv far Wed Aug 13 15:19:45 CDT 2008 ---------- rubin F00040124 and 40133 are complete. It turns out that the other subruns are from runs which, for some other subruns in the first pass, produced my 'Type 90' failures where they run 'forever' producing multi-volume candidate files. The remaining 3 jobs look like they'll do the same. I'll let them run for several hours and kill them if they produce a second candidate volume. I will then make a manual entry in the bad_runs list and let you know. The 3 subruns do not appear in the nightly lists, nor in the suppressed lists. The former is why they were not run in the first pass. They are continuing to chug along, but they are almost certainly junk. ####### # AFS # ####### HOWTO.afssoftprod Continuing to clean up symlinks, interrupted by other work correcting many symlinks from general/ups to general/products Adjusted HOWTO to filter :${UPI} out of the initial SLINKF file. Oops, stepped on prod1 file, lost it. 
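Losing prod1 is the kind of thing a temp-file-and-rename pattern avoids when regenerating these link lists. A sketch, not what was run; the find command is only an illustration, the real SLINKF recipe is in HOWTO.afssoftprod, and the path:target layout here only approximates it.

# Sketch: rebuild a symlink list without clobbering the previous copy
SLINKD=/minos/scratch/minsoft/afssoft/slink
SLINKF=${SLINKD}/prod1
TMPF=${SLINKF}.new.$$
find /afs/fnal.gov/files/data/minos/d119 -type l -printf '%p:%l\n' > ${TMPF} \
 && { [ -r ${SLINKF} ] && cp -p ${SLINKF} ${SLINKF}.bak ; mv ${TMPF} ${SLINKF} ; }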
Mysql> mv /minos/scratch/minsoft/afssoft/slink/prod1 /minos/scratch/minsoft/afssoft/slink/prod2 Let's look at the deadwood, not pointing to /afs Mysql> grep -v :/afs $SLINKF | wc -l 18 Mysql> grep :/afs $SLINKF | wc -l 11 Mysql> grep -v :/afs $SLINKF afs/fnal.gov/files/data/minos/d119/prd/sam/v8_2_0/Linux-2/ups/..tar:/ftp/products/sam/v8_2_0/Linux+2/sam_v8_2_0_Linux+2.ups.tar afs/fnal.gov/files/data/minos/d119/prd/sam_ns_ior/v7_1_0/NULL/ups/..tar:/ftp/products/sam_ns_ior/v7_1_0/NULL/sam_ns_ior_v7_1_0_NULL.ups.tar afs/fnal.gov/files/data/minos/d119/prd/oracle_client/v10_1_0_2_0b/Linux-2/bin/lbuilder:/fnal/ups/prd/oracle_client/v10_1_0_2_0a/Linux-2/nls/lbuilder/lbuilder afs/fnal.gov/files/data/minos/d119/prd/oracle_client/v10_1_0_2_0b/Linux-2/jdk/man/ja:/fnal/ups/prd/oracle_client/v10_1_0_2_0a/Linux-2/jdk/man/ja_JP.eucJP afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_3_2/config_build_root.sh:/afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_3_3/config_build_root.sh:/afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_3_4/config_build_root.sh:/afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_3_4/config_build_root_minimal.sh:/afs/fnal.gov/files/code/e875/general/bin/config_build_root_minimal.sh afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_4_1/config_build_root.sh:/afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_2/v01/lib/mysql/libz.so:/usr/lib/libz.so.1 afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_2/v01/lib/libz.so:/usr/lib/libz.so.1 afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_2/v03/tar_files:/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/tar_files afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_3/v02/lib/libmyodbc.so:/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_3/lib/libmyodbc3.so afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_4/bleeding-edge/lib/libmyodbc.so:/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_4/v03/lib/libmyodbc.so:/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/config.guess:/usr/share/libtool/config.guess afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/config.sub:/usr/share/libtool/config.sub afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/ltmain.sh:/usr/share/libtool/ltmain.sh afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/install-sh:/usr/share/automake-1.6/install-sh afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/mkinstalldirs:/usr/share/automake-1.6/mkinstalldirs afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/missing:/usr/share/automake-1.6/missing afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/depcomp:/usr/share/automake-1.6/depcomp afs/fnal.gov/files/data/minos/d119/prd/samgrid_batch_adapter/v7_1_0/NULL/ups/..tar:/ftp/products/samgrid_batch_adapter/v7_1_0/NULL/samgrid_batch_adapter_v7_1_0_NULL.ups.tar 
afs/fnal.gov/files/data/minos/d119/prd/geant/v3_21_14a/Linux-2-6/ups/..tar:/ftp/products/geant/v3_21_14a/Linux+2.6/geant_v3_21_14a_Linux+2.6.ups.tar afs/fnal.gov/files/data/minos/d119/prd/sam_products/v4_30/NULL/ups/..tar:/ftp/products/sam_products/v4_30/NULL/sam_products_v4_30_NULL.ups.tar afs/fnal.gov/files/data/minos/d119/prd/sam_products/v4_31/NULL/ups/..tar:/ftp/products/sam_products/v4_30/NULL/sam_products_v4_30_NULL.ups.tar afs/fnal.gov/files/data/minos/d119/prd/sam_products/v4_32/NULL/ups/..tar:/ftp/products/sam_products/v4_30/NULL/sam_products_v4_30_NULL.ups.tar afs/fnal.gov/files/data/minos/d119/prd/gcc/v3_4_3/Linux+2.6-2.3.4/tar/binutils.tar.gz:/afs/fnal/files/home/room1/mengel/binutils.tar.gz afs/fnal.gov/files/data/minos/d119/prd/gcc/v3_4_3/Linux+2.6-2.3.4/tar/gcc.tar.gz:/afs/fnal/files/home/room1/mengel/gcc-3.4.3.tar.gz Back to work, look at what is needed from afs, all are files except for one directory Mysql> printf "${SLINKS}\n" | cut -f 2 -d : /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root_minimal.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/tar_files /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_3/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so /afs/fnal/files/home/room1/mengel/binutils.tar.gz /afs/fnal/files/home/room1/mengel/gcc-3.4.3.tar.gz For now, let's not clean up, just take a copy as-is Proceeding to releases, Added SLINKF filter against /minos/data/release_data MINOSSOFT soft1 found mostly bin/lib/tmp, plus Mysql> grep -v '/bin$' $SLINKF | grep -v '/lib$' | grep -v '/tmp$' afs/fnal.gov/files/data/minos/d120/packages/DatabaseTables/HEAD:/afs/fnal.gov/files/data/minos/d04/libraries/DatabaseTables/HEAD afs/fnal.gov/files/data/minos/d120/packages/WebDocs/HEAD/doxygen/loon:/afs/fnal.gov/files/code/e875/releases1/doxygen/loon Mysql> du -sm /afs/fnal.gov/files/data/minos/d04/libraries/DatabaseTables/HEAD 53 /afs/fnal.gov/files/data/minos/d04/libraries/DatabaseTables/HEAD Mysql> du -sm /afs/fnal.gov/files/code/e875/releases1/doxygen/loon 415 /afs/fnal.gov/files/code/e875/releases1/doxygen/loon Let's go ahead and copy all this, it should fit. Using only 4.1 GB without bin/lib/tmp doxygen/loon copy is slow, 1/3 MB/second Wed Aug 13 19:15:55 CDT 2008 /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/lib /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib 535 /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/lib close : Connection timed out Rename of /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib/Linux2.4-GCC_3_4/libCluster3D.so.UPD to /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib/Linux2.4-GCC_3_4/libCluster3D.so failed. ... 
getacl : Connection timed out Unable to set mode-bits for /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib/Linux2.4-GCC_3_4-maxopt to 16877 Couldn't set acls for /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib/Linux2.4-GCC_3_4-maxopt Could not read symbolic link /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/lib/Linux2.6-GCC_3_4 read link : Connection timed out Could not read symbolic link /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/lib/Linux2.6-GCC_3_4-maxopt read link : Connection timed out Unable to set mode-bits for /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib to 16877 getacl : Connection timed out du: cannot access `/afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib': Connection timed out Wed Aug 13 19:16:15 CDT 2008 /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/tmp /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/tmp the rest timed out. Mysql> fs listquota /afs/fnal.gov/files/data/minos/d120 Volume Name Quota Used %Used Partition nb.minos.d120 50000000 19849589 40% 56% ... Aug 13 19:15:20 minos-mysql1 kernel: post_create: no inode, dir (dev=afs, ino=1238086069) Aug 13 19:15:46 minos-mysql1 kernel: post_create: no inode, dir (dev=afs, ino=1238086069) Aug 13 19:15:55 minos-mysql1 kernel: post_create: no inode, dir (dev=afs, ino=1238086077) Aug 13 19:15:55 minos-mysql1 kernel: post_create: no inode, dir (dev=afs, ino=1238086077) Aug 13 19:16:13 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.11 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 13 19:16:13 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.11 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 13 19:16:14 minos-mysql1 kernel: afs: failed to store file (110) Aug 13 19:16:14 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 13 19:16:14 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 13 19:16:15 minos-mysql1 kernel: afs: Tokens for user of AFS id 1060 for cell fnal.gov have expired Aug 13 19:16:31 minos-mysql1 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 13 19:16:31 minos-mysql1 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 13 19:16:31 minos-mysql1 kernel: afs: file server 131.225.68.11 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 13 19:16:31 minos-mysql1 kernel: afs: file server 131.225.68.11 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Mysql> grep ^OPTIONS= /etc/sysconfig/afs OPTIONS=$LARGE Pick up where we left off, Mysql> printf "${SLINKS}\n" | wc -l 158 Mysql> printf "${SLINKS}\n" | grep -n S08-01-10-R1-27/lib 109:... 
Mysql> SLINKX=`printf "${SLINKS}\n" | tail +109` Renewed expired token Mysql> tokens Removed partial library Mysql> rm -r /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib Mysql> grep afs: /var/log/messages | grep -v Tokens | uniq | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' | sort >> put this into the afse.txt file Ran the SLINK procedures, reading from SLINKX printf "${SLINKX}\n" | while read SLINK ; do and the rest per SLINK procedure in HOWTO.afssoftprod Thu Aug 14 09:37:03 CDT 2008 ... Thu Aug 14 10:07:44 CDT 2008 Mysql> fs listquota /afs/fnal.gov/files/data/minos/d120 Volume Name Quota Used %Used Partition nb.minos.d120 50000000 30004513 60% 57% $ mv mountfile.grow mountfile.d199d141.grow $ ln -s mountfile.d119d120.grow mountfile.grow Mysql> wc -l ${SLINKF} 29 /minos/scratch/minsoft/afssoft/slink/prodprod ####### # CVS # ####### Removed blake cvs keys, per request ( a machine was cracked ) Deferred adding new keys pending test of kerberos access. "Yes, kerberos access works - I just committed some code to CVS." MINOSCVS > grep blake cvshlog | tail -1 Thu Aug 14 04:39:41 2008 (blake@(null)) : cvsh -c cvs server [sSk] ######## # FARM # ######## ./roundup -s sntp -r cedar_phy_bhcurv far Wed Aug 13 10:41:31 CDT 2008 ######## # FARM # ######## Added test of NOCAT to looper : #!/bin/sh OPTS="${1}" if [ -z "${OPTS}" ] ; then printf " OOPS, need to specify at least release/stream \n" printf " LIKE ./looper '-r cedar_phy_bhhi mcnear' \n" exit 1 fi printf "./roundup -c ${OPTS}\n" while true ; do [ -r /home/minfarm/ROUNTMP/NOCAT ] || ./roundup -c ${OPTS} sleep 1200 done ============================================================================= 2008 08 12 ============================================================================= ######## # FARM # ######## Monitoring automount of /minos/data on fnpcsrv1 yesterday, Aug 11 09:43:41 fnpcsrv1 automount[15347]: mount(nfs): mounted minos-nas-0.fnal.gov:/minos/data on /minos/data Aug 11 17:43:33 fnpcsrv1 automount[15441]: expired /minos/data This good period of no dismounts corresponds to timm's test script which cd'd to /minos/datat, terminating around 17:43 ######## # FARM # ######## Added asousa, mstrait, nwest to .k5login, pending restore SRV1> du -sk . 4666784 . Tue Aug 12 12:21:17 CDT 2008 9598368 . SRV1> date Tue Aug 12 13:01:48 CDT 2008 SRV1> sdiff -s restore_20080810/minfarm/.k5login .k5login > bseilhan/cron/fnpcsrv1.fnal.gov@FNAL.GOV > durga@FNAL.GOV mishi@FNAL.GOV < mstrait/cron/fnpcsrv1.fnal.gov@FNAL.GOV < mstrait/cron/minos04fnal.gov@FNAL.GOV < > rubin/cron/fnppd.fnal.gov@FNAL.GOV timm@FNAL.GOV < > timm@FNAL.GOV removed bseilhan/cron, durga, rubin/cron/fnppd SRV1> sdiff -s restore_20080810/minfarm/.k5login .k5login mishi@FNAL.GOV < mstrait/cron/fnpcsrv1.fnal.gov@FNAL.GOV < mstrait/cron/minos04fnal.gov@FNAL.GOV < timm@FNAL.GOV < > timm@FNAL.GOV cd restore_20080810/minfarm SRV1> du -sm * | sort -n ... 2 restore 2 rhatcher 3 maint 3 work 4 bin 7 lib 8 monitor 11 loonexe 20 scripts 60 west 67 web 128 FNAL_00030851.dbm.gz 167 lists 546 strait_scratch 1047 track_crash 2735 bckhousetest Time of minfarm directory seems to be Aug 09 Latest bad_runs* or good_runs* seem to be Feb 26 That is because lists moved to /minos/data/minfarm then. Howie suggests cd ~minfarm setenv R restore_20080810/minfarm cp -upr $R/.* . cp -upr $R/* . 
I'd be explicit, and add the 'd' ( I usually use a, which is -dpr ) First, though back it all up : cd /home/minfarm date Tue Aug 12 14:24:49 CDT 2008 time tar cf /minos/data/minfarm/backup20080812.tar . Many files could not be opened, owned by rubin : FILES=' restore_20080810/minfarm/lists/missing.cedar.bck restore_20080810/minfarm/lists/good_runs_mc.cedar.duplicated_output restore_20080810/minfarm/lists/greg_list.bck restore_20080810/minfarm/lists/bad_runs_mc.cedar_phy_bhcurv.bck restore_20080810/minfarm/lists/mmm.d04.cedar_phy_bhcurv.bck restore_20080810/minfarm/lists/bad_runs.cedar_phy_bhcurv.bck restore_20080810/minfarm/lists/bad_runs.cedar.bck restore_20080810/minfarm/loonexe/reco_MC_daikon_far_MRCCOnly_cedar_phy_Corrected.C.bck restore_20080810/minfarm/scripts/deprecated/total_data.trigger.old restore_20080810/minfarm/restore/minfarm/lists/bad_runs.cedar_phy_bhcurv.bck restore/minfarm/lists/bad_runs.cedar_phy_bhcurv.bck ' This agrees with SRV1> find . -user rubin ! -perm -40 SRV1> rm /minos/data/minfarm/backup20080812.tar fnpcsrv1% cd /home/minfarm fnpcsrv1$ for FILE in ${FILES} ; do chmod g+r ${FILE} ; done SRV1> time tar cf /minos/data/minfarm/backup20080812.tar . real 17m34.869s user 0m3.311s sys 1m52.664s SRV1> date Tue Aug 12 14:55:26 CDT 2008 SRV1> du -sh /minos/data/minfarm/backup20080812.tar 7.8G /minos/data/minfarm/backup20080812.tar SRV1> du -sh . 9.2G . SRV1> find . -type f | wc -l 50686 Test file restoration, with loonexe SRV1> du -sk loonexe restore_20080810/minfarm/loonexe/ 2976 loonexe 10848 restore_20080810/minfarm/loonexe/ SRV1> find restore_20080810/minfarm/loonexe -type f | wc -l 248 cp -dupr restore_20080810/minfarm/loonexe . cp: setting permissions for `./loonexe/josh': Permission denied Many files are owned by rubin : SRV1> find restore_20080810 -user rubin | wc -l 3562 SRV1> find -user rubin | wc -l 4555 This is OK, per howie. Everying will end up being owned by minfarm, Go for the gold cp -dupr restore_20080810/minfarm/.* . 
cp: setting permissions for `././condor_log': Permission denied cp: setting permissions for `././condor_submit': Permission denied cp: setting permissions for `././lists/non_current': Permission denied cp: setting permissions for `././loonexe/josh': Permission denied cp: cannot overwrite directory `././.nedit' with non-directory cp: setting permissions for `././badlogs': Permission denied cp: setting permissions for `././monitor/R1_18/logfiles': Permission denied cp: setting permissions for `././monitor/R1_18/psfiles': Permission denied cp: setting permissions for `././monitor/R1_18': Permission denied cp: setting permissions for `././monitor/R1_18_2': Permission denied cp: setting permissions for `././monitor/R1_18_3': Permission denied cp: setting permissions for `././monitor/R1_18_4': Permission denied cp: setting permissions for `././monitor/R1_21': Permission denied cp: setting permissions for `././monitor/R1_23': Permission denied cp: setting permissions for `././monitor/R1_23a': Permission denied cp: setting permissions for `././monitor/S06-05-25-R1-22': Permission denied cp: setting permissions for `././monitor/S06-06-22-R1-22': Permission denied cp: setting permissions for `././monitor/R1_24c': Permission denied cp: setting permissions for `././monitor/R1_24': Permission denied cp: setting permissions for `././monitor/R1_24a': Permission denied cp: setting permissions for `././monitor/cedar': Permission denied cp: setting permissions for `././monitor/R1_24b': Permission denied cp: setting permissions for `././monitor/cedar_phy_bhcurve': Permission denied cp: setting permissions for `././monitor/R1_24calB': Permission denied cp: setting permissions for `././monitor/R1_24cal': Permission denied cp: setting permissions for `././monitor/cedar_phy': Permission denied cp: setting permissions for `././monitor/cedar_phy_safitter': Permission denied cp: setting permissions for `././monitor/cedar_phy_srsafitter': Permission denied cp: setting permissions for `././monitor/srsafitter': Permission denied cp: setting permissions for `././monitor/cedar_phy_mboone': Permission denied cp: setting permissions for `././monitor/cedar_phy_srsafitterbx113': Permission denied cp: setting permissions for `././monitor/cedar_phy_bhcurv': Permission denied cp: preserving times for `././recover': Permission denied cp: setting permissions for `././scripts/caldet': Permission denied cp: setting permissions for `././scripts/deprecated': Permission denied cp: setting permissions for `././scripts/fbs': Permission denied cp: setting permissions for `././scripts/specials': Permission denied cp: setting permissions for `././scripts/old_li': Permission denied cp: setting permissions for `././web/deprecated': Permission denied cp: setting permissions for `././web/indexes': Permission denied cp: setting permissions for `././restore/minfarm/lists': Permission denied cp: setting permissions for `././restore/minfarm': Permission denied cp: setting permissions for `././restore': Permission denied cp: preserving times for `././strait_scratch/itworked': Permission denied cp: preserving times for `././strait_scratch/itworked2': Permission denied cp: preserving times for `././strait_scratch/badtries/10thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/11thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/12thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/13thtry': Permission denied cp: preserving times for 
`././strait_scratch/badtries/14thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/15thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/16thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/eighthtry': Permission denied cp: preserving times for `././strait_scratch/badtries/ninthtry': Permission denied cp: preserving times for `././strait_scratch/badtries': Permission denied cp: preserving times for `././strait_scratch/2005-08-logs': Permission denied cp: preserving times for `././strait_scratch/2005-09-logs': Permission denied cp: will not create hard link `./minfarm' to directory `./.' cp: will not create hard link `./.autosave' to directory `././.autosave' cp: will not create hard link `./.emacs.d' to directory `././.emacs.d' cp: will not create hard link `./.grid' to directory `././.grid' cp: cannot overwrite directory `./.nedit' with non-directory cp: will not create hard link `./.netscape' to directory `././.netscape' cp: will not create hard link `./.srmconfig' to directory `././.srmconfig' cp: will not create hard link `./.ssh' to directory `././.ssh' cp: will not create hard link `./.subversion' to directory `././.subversion' drwxrwxr-x 2 minfarm numi 2048 Aug 12 09:12 .nedit/ rmdir .nedit SRV1> cp -dupr restore_20080810/minfarm/.nedit . for FOO in autosave emacs.d grid netscape srmconfig ssh subversion do echo ${FOO} diff -r restore_20080810/minfarm/.${FOO} .${FOO} done Tue Aug 12 15:25:32 CDT 2008 cp -dupr restore_20080810/minfarm/* . cp: setting permissions for `./badlogs': Permission denied cp: setting permissions for `./condor_log': Permission denied cp: setting permissions for `./condor_submit': Permission denied cp: setting permissions for `./lists/non_current': Permission denied cp: setting permissions for `./loonexe/josh': Permission denied cp: setting permissions for `./monitor/R1_18/logfiles': Permission denied cp: setting permissions for `./monitor/R1_18/psfiles': Permission denied cp: setting permissions for `./monitor/R1_18': Permission denied cp: setting permissions for `./monitor/R1_18_2': Permission denied cp: setting permissions for `./monitor/R1_18_3': Permission denied cp: setting permissions for `./monitor/R1_18_4': Permission denied cp: setting permissions for `./monitor/R1_21': Permission denied cp: setting permissions for `./monitor/R1_23': Permission denied cp: setting permissions for `./monitor/R1_23a': Permission denied cp: setting permissions for `./monitor/S06-05-25-R1-22': Permission denied cp: setting permissions for `./monitor/S06-06-22-R1-22': Permission denied cp: setting permissions for `./monitor/R1_24c': Permission denied cp: setting permissions for `./monitor/R1_24': Permission denied cp: setting permissions for `./monitor/R1_24a': Permission denied cp: setting permissions for `./monitor/cedar': Permission denied cp: setting permissions for `./monitor/R1_24b': Permission denied cp: setting permissions for `./monitor/cedar_phy_bhcurve': Permission denied cp: setting permissions for `./monitor/R1_24calB': Permission denied cp: setting permissions for `./monitor/R1_24cal': Permission denied cp: setting permissions for `./monitor/cedar_phy': Permission denied cp: setting permissions for `./monitor/cedar_phy_safitter': Permission denied cp: setting permissions for `./monitor/cedar_phy_srsafitter': Permission denied cp: setting permissions for `./monitor/srsafitter': Permission denied cp: setting permissions for `./monitor/cedar_phy_mboone': Permission denied cp: 
setting permissions for `./monitor/cedar_phy_srsafitterbx113': Permission denied cp: setting permissions for `./monitor/cedar_phy_bhcurv': Permission denied cp: preserving times for `./recover': Permission denied cp: setting permissions for `./restore/minfarm/lists': Permission denied cp: setting permissions for `./restore/minfarm': Permission denied cp: setting permissions for `./restore': Permission denied cp: setting permissions for `./scripts/caldet': Permission denied cp: setting permissions for `./scripts/deprecated': Permission denied cp: setting permissions for `./scripts/fbs': Permission denied cp: setting permissions for `./scripts/specials': Permission denied cp: setting permissions for `./scripts/old_li': Permission denied cp: preserving times for `./strait_scratch/itworked': Permission denied cp: preserving times for `./strait_scratch/itworked2': Permission denied cp: preserving times for `./strait_scratch/badtries/10thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/11thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/12thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/13thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/14thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/15thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/16thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/eighthtry': Permission denied cp: preserving times for `./strait_scratch/badtries/ninthtry': Permission denied cp: preserving times for `./strait_scratch/badtries': Permission denied cp: preserving times for `./strait_scratch/2005-08-logs': Permission denied cp: preserving times for `./strait_scratch/2005-09-logs': Permission denied cp: setting permissions for `./web/deprecated': Permission denied cp: setting permissions for `./web/indexes': Permission denied Tue Aug 12 15:27:44 CDT 2008 We are good to go now. Making a separate copy of the restore_20080810 files. cd restore_20080810 time tar cf /minos/data/minfarm/restore_20080810.tar . real 3m37.368s user 0m1.392s sys 1m2.883s ######## # FARM # ######## export MYSQL_PWD= mysqladmin -h fnpcsrv1.fnal.gov --port 3307 -u minfarm processlist mysqladmin -u minfarm -S /export/stage/minfarm/mysql.sock1 processlist ============================================================================= 2008 08 11 ============================================================================= ######## # FARM # ######## Most of the /home/minfarm files were deleted , by a runaway script /grid/app/minos/scripts/gather_runs.mc The script was intending to concatenate files in /home/minfarm/lists/BAD and GOOD, But there were no BAD or GOOD directories initially. So the script wandered into /home, and started removing all files thereunder. I have put back a .k5login with rubin and kreymer. I have done crontab crontab.dat in the scripts directory. And in /home/minfarm, ln -s scripts/crontab.dat crontab.dat ########### # ROUNDUP # ########### Fell behind on 11:00 cycle, fardet, due to small runs around 60 KBytes F00041598 through F00041801. ######## # FARM # ######## Date: Sat, 09 Aug 2008 12:32:10 -0500 There have been serious NFS (probably) problems at 19:15-19:45 yesterday and again at ~01:00 today. /minos/data has been affected. I'm not sure if it relates to the missing runs or not. I want to check before resubmitting. 
"Late" FD r3 is complete (except of course for your missing runs) so I'm going to start submitting the early r3 stuff which has only a cosmic pass. Date: Sat, 09 Aug 2008 16:41:16 -0500 From: Howard Rubin There is a single run of pass 1 output in farcat, F00039719. I think I must have run these subruns to complete the run started in the previous month. I don't know if these should actually replace the pass 0 because of better constants or just be deleted. If they should replace, then the first 3 subruns should probably also be reprocessed. What do you think? Note that these are cosmic/all only. Note added in proof: Apparently the first pass stuff must have been deleted because there's no F00039719_0000.all.sntp.cedar_phy_bhcurv.0.root (or _0003) in SAM. I guess that means I *should* run the first 3 subruns so that the run will be complete. I'll change the pass to pass 0 and remove the good_runs lines that are causing them to be designated pass 1. Date: Sat, 09 Aug 2008 22:47:29 -0500 From: Howard Rubin There a set of runs, probably a month's worth, and perhaps more coming with pass 1. I'm going to stop roundup until this is all complete and rename the files to pass 0. I've checked and there don't seem to be any of these declared to SAM. I've done it by stopping the cron job. I'm sure there's a more elegant way, but this should only be for a short time. There are only a couple of hundred more jobs to run. Date: Sat, 09 Aug 2008 23:35:50 -0500 From: Howard Rubin I've restarted the corral cron job. Except for anything you might find in roundup, run 3 is complete. I'll start checking the list from 'late run3' concatenation tomorrow or Monday. --------------------------------- looper picked up F00039719 around Aug 9 23:03 , pass 0 had formerly been pass 1 It picked up and concatenated most of the .1. data last Saturday. 
Sat Aug 9 19:52:35 CDT 2008 Some with pass 2 or 4 Sat Aug 9 21:49:13 CDT 2008 PURGED WRITE/F00039340_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039345_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039348_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039349_0000.all.sntp.cedar_phy_bhcurv.2.root PURGED WRITE/F00039350_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039353_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039356_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039362_0000.all.sntp.cedar_phy_bhcurv.2.root PURGED WRITE/F00039574_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039827_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039830_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039834_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039840_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039840_0008.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039843_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039846_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039849_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039855_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039858_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039869_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039878_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039881_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039884_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039887_0000.all.sntp.cedar_phy_bhcurv.0.root Sun Aug 10 00:07:38 CDT 2008 PURGED WRITE/F00038559_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039359_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039571_0000.all.sntp.cedar_phy_bhcurv.2.root PURGED WRITE/F00039577_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039580_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039583_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039586_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039589_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039592_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039595_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039603_0000.all.sntp.cedar_phy_bhcurv.2.root PURGED WRITE/F00039607_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039608_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039610_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039615_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039618_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039622_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039625_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039628_0000.all.sntp.cedar_phy_bhcurv.4.root PURGED WRITE/F00039631_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039653_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039676_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039679_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039682_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039685_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039688_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039691_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039694_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039697_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039700_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039704_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039707_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED 
WRITE/F00039710_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039713_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039716_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039719_0000.all.sntp.cedar_phy_bhcurv.1.root FONES='' for FON in $FONES ; do sam locate ${FON}cand.cedar_phy_bhcurv.1.root ; done AOK with .1. except Datafile with name 'F00039603_0000.all.cand.cedar_phy_bhcurv.1.root' not found. for FON in $FTWOS ; do sam locate ${FON}cand.cedar_phy_bhcurv.2.root ; done AOK, all 4 files are in SAM The single .4. cand file is in SAM and PNFS FTWOS cands are all in PNFS. FONES cands are all in PNFS, except F00039603_0000. ######## # FARM # ######## ============================================================================= 2008 08 08 ============================================================================= ####### # AFS # ####### HOWTO.afssoftprod Continuing, resolving symlinks and cleaning up the HOWTO per this pass. In first products symlink pass ( prod1 ) first needed to correct many symlinks from general/ups to general/products ######## # FARM # ######## farcat 8 191 all.sntp.cedar.0.root 315 7614 all.sntp.cedar_phy.0.root 634 15573 all.sntp.cedar_phy_bhcurv.0.root 627 4386 spill.bmnt.cedar_phy_bhcurv.0.root 8 65 spill.bntp.cedar.0.root 315 1332 spill.bntp.cedar_phy.0.root 627 4543 spill.bntp.cedar_phy_bhcurv.0.root 627 2786 spill.mrnt.cedar_phy_bhcurv.0.root 8 40 spill.sntp.cedar.0.root 315 888 spill.sntp.cedar_phy.0.root 627 2877 spill.sntp.cedar_phy_bhcurv.0.root mcfmockcat 241 3427 mrnt.cedar_phy_bhcurv.0.root 241 6827 sntp.cedar_phy_bhcurv.0.root mockfar seems complete, force this out : ./looper '-M -r cedar_phy_bhcurv mockfar' & Fri Aug 8 13:21:33 CDT 2008 OK - processing 482 files Fri Aug 8 13:23:59 CDT 2008 WRITING to DCache 482 CPB far is running currently, let's keep up with sntp : ./looper '-s sntp -r cedar_phy_bhcurv far' & SELECT files containing sntp Fri Aug 8 13:28:55 CDT 2008 ZAPPING BAD F00040942_0008.all.sntp.cedar_phy_bhcurv.0.root F00040942_0008.0 2008-06 136 2008-08-01 00:26:31 fcdfcaf1628 ... ZAPPING BAD F00040942_0020.all.sntp.cedar_phy_bhcurv.0.root F00040942_0020.0 2008-06 136 2008-08-01 00:30:56 fcdfcaf1597 ZAPPING BAD F00040942_0020.spill.sntp.cedar_phy_bhcurv.0.root F00040942_0020.0 2008-06 136 2008-08-01 00:30:56 fcdfcaf1597 ... OK - processing 974 files OK - stream all.sntp.cedar_phy_bhcurv OK - 12096 Mbytes in 22 runs ... Date: Fri, 08 Aug 2008 21:14:14 +0000 (UTC) From: Arthur Kreymer To: Minos Batch The additional 241 subruns of cedar_phy_bhcurv daikon_05 mdc have been written to PNFS and /minos/data . ####### # WEB # ####### MIN > cp dhleft.html.20070328 dhleft.html.20080808 [14]+ Done nedit dhleft.html MIN > ln -sf dhleft.html.20080808 dhleft.html Updated dhleft with Cluster group replacing MINOS26, and DATA group showing disk status dhmain - added DATABASE - ganglia/fnpcsrv1 ############ # NOACCESS # ############ VOK330 331.09GB (NOTALLOWED 0731-1124 readonly 0716-1115) CD-LTO3 minos.reco_far_cedar_bcnd.cpio_odc Volume needs to be cloned due to repeated errors ########## # CONDOR # ########## Summary of proxy activities last week : When I added the pilot proxy, I did wait for all the glideins to terminate. condor_q showed no entries for gfactory when I changed the proxy. I was not so clever yesterday. I did remove the pilot Role while jobs were running yesterday. This resulted in many held gfactory processes. Plus a few freshly minted pilots without the pilot Role. 
I stopped the gfactory, restored the pilot Role, did 'condor_release gfactory', waited for the released jobs to run, but some reverted to 'held', iterated several times until all processes terminated. Then I removed the pilot Role from the proxy, released the last few gfactory processes, and waited for them to run and terminate. Then I restarted the gfactory. Thinks look clean since then. ============================================================================= 2008 08 07 ============================================================================= ########## # CONDOR # ########## Errors seen recently by pawloski cat /minos/data/users/pawloski/Nue/PETrimmerTest_Delete/Far_Beam_Standard_MC/log.172615.43 000 (172615.043.000) 08/06 21:44:29 Job submitted from host: <131.225.193.25:64545> ... 007 (172615.043.000) 08/06 23:57:51 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:65459> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/06 23:57:54 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:65459> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/06 23:57:58 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:65459> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/07 00:00:09 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:64284> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/07 00:00:12 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:64284> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/07 00:00:17 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:64284> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/07 00:00:19 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:64284> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/07 00:00:22 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:64284> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 001 (172615.043.000) 08/07 00:02:46 Job executing on host: <131.225.166.119:62644> ... 009 (172615.043.000) 08/07 00:02:54 Job was aborted by the user. The system macro SYSTEM_PERIODIC_REMOVE expression '(JobRunCount > 10) || (JobRunCount>=1 && ImageSize>1000000 && JobStatus==1)' evaluated to TRUE ... 
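A quick tally of how many Shadow exceptions each of these jobs hit would show whether 172615.43 was an outlier. A sketch, not something run at the time; it only counts the 'Shadow exception' lines in the user logs quoted above.

# Sketch: count Shadow exceptions per Condor user log
LDIR=/minos/data/users/pawloski/Nue/PETrimmerTest_Delete/Far_Beam_Standard_MC
for ULOG in ${LDIR}/log.172615.* ; do
    N=`grep -c 'Shadow exception' ${ULOG}`
    [ "${N}" -gt 0 ] && printf '%4d %s\n' ${N} ${ULOG}
done | sort -rn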
########## # CONDOR # ########## 15:11 restored condorproxy without pilot role Killed gfactory master process ( only, not condor_gridmanager ) Restored pilot role to condorproxy, so we can release held gfactory jobs Still have fresh gfactories, MINOS25 > condor_q gfactory | grep -v H -- Submitter: minos25.fnal.gov : <131.225.193.25:61622> : minos25.fnal.gov ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 172904.0 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.1 gfactory 8/7 15:18 0+00:10:08 R 0 0.0 glidein_startup.sh 172904.2 gfactory 8/7 15:18 0+00:47:08 R 0 0.0 glidein_startup.sh 172904.3 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.4 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.5 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.6 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.7 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.8 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.9 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172906.0 gfactory 8/7 15:22 0+00:00:00 I 0 0.0 glidein_startup.sh 172913.0 gfactory 8/7 15:58 0+00:00:00 I 0 0.0 glidein_startup.sh 16:10 Released GLobus error 17 and 43 from yesterday, condor_release 171531 172568 172589 The recent glideins have evaporated. 16:15 JOBS=`condor_q gfactory -hold | grep gfactory | cut -f 1 -d ' '` for JOB in ${JOBS} ; do condor_q ${JOB} | grep gfactory ; done 88 for JOB in ${JOBS} ; do condor_release ${JOB} ; sleep 1 ; done 16:18 87 jobs; 12 idle, 50 running, 25 held 16:46 82 jobs; 12 idle, 45 running, 25 held 18:40: 57 jobs; 0 idle, 32 running, 25 held MINOS25 > condor_release 172666.0 MINOS25 > condor_release 172670.0 MINOS25 > condor_q 172666.0 172666.0 gfactory 8/7 01:59 0+10:21:03 R 0 0.0 glidein_startup.sh MINOS25 > condor_q 172670.0 172670.0 gfactory 8/7 02:20 0+10:20:33 R 0 0.0 glidein_startup.sh JOBS2=`condor_q gfactory -hold | grep gfactory | cut -f 1 -d ' '` for JOB in ${JOBS2} ; do condor_q ${JOB} | grep gfactory ; done 23 for JOB in ${JOBS2} ; do condor_release ${JOB} ; sleep 1 ; done 51 jobs; 0 idle, 35 running, 16 held JOBS3=`condor_q gfactory -hold | grep gfactory | cut -f 1 -d ' '` for JOB in ${JOBS3} ; do condor_q ${JOB} | grep gfactory ; done 23 for JOB in ${JOBS3} ; do condor_release ${JOB} ; sleep 1 ; done two more started running. 20:45 MINOS25 > condor_q gfactory | tail -1 14 jobs; 0 idle, 0 running, 14 held MINOS25 > condor_release gfactory User gfactory's job(s) released. MINOS25 > condor_q gfactory | tail -1 14 jobs; 14 idle, 0 running, 0 held 21:00 12 jobs; 0 idle, 0 running, 12 held MINOS25 > condor_release gfactory These are the processes without /Role=pilot 21:02:30 - removed the Role from the proxy 21:03:20 MINOS25 > condor_release gfactory 21:07 all processes are gone, including gfactory's condor_gridmanager gfactory@minos25: ./start_factory.sh 21:09 rm /minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE 172923.0 gfactory 8/7 21:12 0+00:00:00 I 0 0.0 glidein_startup.sh 21:17 172920.0 kreymer 8/7 21:10 0+00:00:25 R 0 0.0 probe 172923.0 gfactory 8/7 21:12 0+00:03:43 R 0 0.0 glidein_startup.sh RUN FINISHED Thu Aug 7 21:18:25 CDT 2008 ALL CLEAR ! 
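In hindsight the JOBS / JOBS2 / JOBS3 release passes above are one loop. A sketch of how that evening could have been scripted; the 10-pass limit and 5-minute wait are arbitrary choices.

#!/bin/sh
# Sketch: keep releasing held gfactory glideins until none are left,
# or give up after 10 passes.
PASS=0
while [ ${PASS} -lt 10 ] ; do
    HELD=`condor_q gfactory -hold | grep gfactory | cut -f 1 -d ' '`
    [ -z "${HELD}" ] && break
    for JOB in ${HELD} ; do condor_release ${JOB} ; sleep 1 ; done
    sleep 300
    PASS=`expr ${PASS} + 1`
done
condor_q gfactory | tail -1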
######## # FARM # ######## mysql overload on fnpcsrv1 Looked at top, 10 second interval, top - 14:13:06 up 1 day, 50 min, 9 users, load average: 11.99, 10.33, 8.30 PID USER PR NI %CPU TIME+ %MEM VIRT RES SHR S COMMAND 7748 minfarm 16 0 601 3740:44 2.3 437m 367m 3548 S mysqld 7748 minfarm 16 0 565 3741:40 2.3 436m 366m 3548 S mysqld Using 60 seconds of CPU in 10 seconds, 600%, It does not to show up on 'idle suppressed' top displays. Checking the Starting messages in the database log file, find SRV1> grep Starting /farm/minsoft2/Minossoft/dbm-cedar_phy/logs/dbm_checksum.log 2008-08-07 11:35:07 Starting pass 1 on BEAMMONSPILLVLD: 2008-08-07 11:44:08 Starting pass 2 on BEAMMONSPILLVLD: 2008-08-07 11:55:35 Starting pass 3 on BEAMMONSPILLVLD: 2008-08-07 12:07:54 Starting pass 4 on BEAMMONSPILLVLD: 2008-08-07 12:20:59 Starting pass 5 on BEAMMONSPILLVLD: 2008-08-07 12:33:13 Starting pass 6 on BEAMMONSPILLVLD: 2008-08-07 12:46:10 Starting pass 7 on BEAMMONSPILLVLD: 2008-08-07 12:58:17 Starting pass 8 on BEAMMONSPILLVLD: 2008-08-07 13:10:22 Starting pass 9 on BEAMMONSPILLVLD: 2008-08-07 13:22:20 Starting pass 10 on BEAMMONSPILLVLD: 2008-08-07 13:34:26 Starting pass 11 on BEAMMONSPILLVLD: 2008-08-07 13:46:47 Starting pass 12 on BEAMMONSPILLVLD: 2008-08-07 13:59:35 Starting pass 13 on BEAMMONSPILLVLD: 2008-08-07 14:14:02 Starting pass 14 on BEAMMONSPILLVLD: 2008-08-07 14:29:06 Starting pass 15 on BEAMMONSPILLVLD: 2008-08-07 14:44:53 Starting pass 16 on BEAMMONSPILLVLD: 2008-08-07 15:01:31 Starting pass 17 on BEAMMONSPILLVLD: 2008-08-07 15:13:55 Starting pass 18 on BEAMMONSPILLVLD: 2008-08-07 15:26:07 Starting pass 19 on BEAMMONSPILLVLD: 2008-08-07 15:39:07 Starting pass 20 on BEAMMONSPILLVLD: 2008-08-07 15:52:51 Starting pass 21 on BEAMMONSPILLVLD: 2008-08-07 16:03:51 Starting pass 22 on BEAMMONSPILLVLD: 2008-08-07 16:11:17 Starting pass 23 on BEAMMONSPILLVLD: 2008-08-07 16:18:50 Starting pass 24 on BEAMMONSPILLVLD: 2008-08-07 16:26:39 Starting pass 25 on BEAMMONSPILLVLD: 2008-08-07 16:34:18 Starting pass 26 on BEAMMONSPILLVLD: 2008-08-07 16:42:12 Starting pass 27 on BEAMMONSPILLVLD: 2008-08-07 16:47:04 Starting pass 1 on CALADCTOPESVLD: 2008-08-07 16:59:16 Starting pass 2 on CALADCTOPESVLD: 2008-08-07 17:03:46 Starting pass 3 on CALADCTOPESVLD: 2008-08-07 17:08:19 Starting pass 4 on CALADCTOPESVLD: 2008-08-07 17:36:23 Starting pass 5 on CALADCTOPESVLD: 2008-08-07 18:17:34 Starting pass 6 on CALADCTOPESVLD: 2008-08-07 18:57:08 Starting pass 7 on CALADCTOPESVLD: SRV1> grep Starting /farm/minsoft2/Minossoft/dbm-cedar/logs/dbm_checksum.log | cut -f 1-3 -d : 2008-08-07 19:20:22 Starting pass 1 on BEAMMONFILESUMMARYVLD 2008-08-07 19:20:24 Starting pass 1 on BEAMMONSPILLVLD 2008-08-07 19:20:37 Starting pass 1 on BEAMMONSWICPEDSVLD 2008-08-07 19:20:50 Starting pass 1 on CALADCTOPESVLD 2008-08-07 19:30:29 Starting pass 1 on CALADCTOPEVLD 2008-08-07 19:40:10 Starting pass 1 on FABPLNINSTALLVLD 2008-08-07 19:40:11 Starting pass 1 on PHOTONBLUESPECTRUMVLD 2008-08-07 19:40:11 Starting pass 1 on PHOTONELECTRONRANGEVLD 2008-08-07 19:40:20 Starting pass 1 on UGLIDBISCINTPLNSTRUCTVLD 2008-08-07 19:40:22 Starting pass 1 on UGLIDBISCINTPLNVLD Oops, wiped out /farm/minsoft2/Minossoft/dbm-cedar_phy/logs/dbm_checksum.log with one of my commands. 
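The 'Starting pass' timestamps are enough to see what each checksum pass costs. A sketch, run against the dbm-cedar log since the cedar_phy one is gone, assuming GNU date:

# Sketch: per-pass durations from the dbm_checksum Starting lines
CLOG=/farm/minsoft2/Minossoft/dbm-cedar/logs/dbm_checksum.log
grep Starting ${CLOG} | while read DAY TIME REST ; do
    date -d "${DAY} ${TIME}" +%s
done | awk 'NR > 1 { print $1 - prev " seconds" } { prev = $1 }'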
SRV1> wc -l /farm/minsoft2/Minossoft/dbm-cedar/reference_checksums/checksum_Fnal_CALADCTOPES.log 295 /farm/minsoft2/Minossoft/dbm-cedar/reference_checksums/checksum_Fnal_CALADCTOPES.log SRV1> wc -l /farm/minsoft2/Minossoft/dbm-cedar_phy/reference_checksums/checksum_Fnal_CALADCTOPES.log 159200 /farm/minsoft2/Minossoft/dbm-cedar_phy/reference_checksums/checksum_Fnal_CALADCTOPES.log SRV1> wc -l /farm/minsoft2/Minossoft/dbm-cedar/reference_checksums/checksum_Fnal_BEAMMONSPILL.log 10111 /farm/minsoft2/Minossoft/dbm-cedar/reference_checksums/checksum_Fnal_BEAMMONSPILL.log SRV1> wc -l /farm/minsoft2/Minossoft/dbm-cedar_phy/reference_checksums/checksum_Fnal_BEAMMONSPILL.log 9908 /farm/minsoft2/Minossoft/dbm-cedar_phy/reference_checksums/checksum_Fnal_BEAMMONSPILL.log ########## # CONDOR # ########## ID OWNER HELD_SINCE HOLD_REASON 171531.1 gfactory 8/5 13:12 Globus error 43: the job manager failed to 171531.3 gfactory 8/5 13:12 Globus error 17: the job failed when the jo 172568.5 gfactory 8/6 18:39 Globus error 17: the job failed when the jo 172568.6 gfactory 8/6 18:39 Globus error 43: the job manager failed to 172589.6 gfactory 8/6 19:54 Globus error 43: the job manager failed to 172589.7 gfactory 8/6 19:54 Globus error 17: the job failed when the jo ============================================================================= 2008 08 06 ============================================================================= ######## # FARM # ######## SRV1> Broadcast message from root (ttyS0) (Wed Aug 6 13:09:59 2008): The system is going down for reboot NOW! 15:24 - restarted kreymer@fnpcsrv1 ./bluwatch & ####### # AFS # ####### HOWTO.afssoftprod Will use d119 products d120 releases Adjusted HOWTO, saved old as HOWTO.afssoftprod.20080207 As before, a flood of Unable to set group-id messages { time up ${UPI} ${UPO} ; } 2>&1 | tee -a /minos/scratch/minsoft/afssoft/cloneproducts.log Unable to set group-id for /afs/fnal.gov/files/data/minos/d119/.growfschecksum to 1525 Unable to set group-id for /afs/fnal.gov/files/data/minos/d119/.growfsdir to 1525 ... 
Unable to set group-id for /afs/fnal.gov/files/data/minos/d119/catman/cat1/kcommon.1 to 1525 Unable to set group-id for /afs/fnal.gov/files/data/minos/d119/catman/cat1/bison.1 to 1525 real 17m53.617s user 0m2.322s sys 2m37.550s grep -v 'Unable to set' /minos/scratch/minsoft/afssoft/cloneproducts.log Scanned sizes of PLINKS 680 /afs/fnal.gov/files/code/e875/releases/GENIE 91 /afs/fnal.gov/files/code/e875/releases/LOG4CPP 3307 /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN 22630 /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT 841 /afs/fnal.gov/files/code/e875/releases/NEUGEN3 183 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 27 /afs/fnal.gov/files/code/e875/releases/stdhep Tested stdhep first, PLINKS=stdhep Looks OK, proceeded with PLINKS=' GENIE LOG4CPP MINOS_EXTERN MINOS_ROOT NEUGEN3 PYTHIA6 ' OK - copying GENIE Wed Aug 6 12:04:06 CDT 2008 680 /afs/fnal.gov/files/code/e875/releases/GENIE real 2m7.429s user 0m0.226s sys 0m20.850s OK - copying LOG4CPP Wed Aug 6 12:06:16 CDT 2008 91 /afs/fnal.gov/files/code/e875/releases/LOG4CPP real 1m2.998s user 0m0.171s sys 0m8.953s OK - copying MINOS_EXTERN Wed Aug 6 12:07:21 CDT 2008 3307 /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN real 20m4.292s user 0m2.805s sys 3m11.530s OK - copying MINOS_ROOT Wed Aug 6 12:27:53 CDT 2008 22630 /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT real 181m8.958s user 0m24.515s sys 22m48.879s OK - copying NEUGEN3 Wed Aug 6 15:34:45 CDT 2008 841 /afs/fnal.gov/files/code/e875/releases/NEUGEN3 real 2m20.528s user 0m0.349s sys 0m29.542s OK - copying PYTHIA6 Wed Aug 6 15:37:09 CDT 2008 183 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 real 0m44.693s user 0m0.097s sys 0m7.347s Before cleaning up symlinks, copy the other big slug of files, Mysql> AFSC=/afs/fnal.gov/files/code/e875 Mysql> RVOL=/afs/fnal.gov/files/data/minos/d120 # previously d199 Mysql> UPI=${AFSC}/general/minossoft Mysql> { time up ${UPI} ${RVOL} ; } 2>&1 \ > | grep -v 'Unable to set .*-id' \ > | tee /minos/scratch/minsoft/afssoft/cloneminos.log real 101m23.552s user 0m12.607s sys 9m27.300s ######## # FARM # ######## nwest updated the farm data, as needed for CPB processing of Run III. In farcat I see part of F00040942 ( 8 - 17 ) from 01:28 to 01:57. This is part of the 2008-06 part of the run ( 8 - 23 ) The first part is in 2008-05 ( 0 - 7 ) Scanning back, the last Run II cand seems to be F00038559 in 2007-08, dribbling over from 2007-07 sntp is entirely in 2007-07 Howie is running Run II FD CPB full bore now, round 11:15, as well as rest of MDC ######## # FARM # ######## Date: Wed, 06 Aug 2008 09:13:13 -0500 (CDT) Subject: HelpDesk ticket 119761 ___________________________________________ Short Description: Request fnpcsrv1 account, anf work node login for rbpatter, for Minos support Problem Description: Ryan Patterson ( rbpatter@fnal.gov ) of Caltech is joining the Minos support term, particularly working on Parrot and Condor support. Please create an account for him on fnpcsrv1, and give him interactive access to the worker nodes. ___________________________________________ Date: Fri, 08 Aug 2008 10:30:07 -0500 (CDT) Subject: Help Desk Ticket 119761 Has Been Resolved. Solution: Account rbpatter has been created on fnpcsrv1, and the gp grid workers. Steve Timm ___________________________________________________________________ ######### # ADMIN # ######### Default shell for new minos accounts is now /bin/bash, not FNALU shell. 
Ticket 118265 2008 07 07 ============================================================================= 2008 08 05 ============================================================================= ########## # CONDOR # ########## Updated glideafs10min.run per glideme.run, testing UID of jobs running via glidein,, test on fnpc344 UID 4716 condor_starter 7927 /bin/bash /grid/home/minos/... 7927 /bin/bash ./condor_startup.sh 7927 .../condor_master 7927 condor_startd -f 7927 condor_procd 43022 .../condor_starter 43022 condor_procd 43022 /bin/sh /minos/scratch/kreymer/condor/probe/probe 0 sleep 600 -bash-3.00$ id condor uid=4716(condor) gid=3302(condor) groups=3302(condor) #-bash-3.00$ id minos uid=7927(minos) gid=5111(numi) groups=5111(numi) -bash-3.00$ id minosgli uid=43022(minosgli) gid=5111(numi) groups=5111(numi) Ticket 119498 updated, > _________________________________________________________________ > Note To Requester: timm@fnal.gov sent this Notes To Requester: > > Actually you haven't added the role. glideins are still running > without any glidein role as user "minos" as they always have. > > Steve Timm > > > > _________________________________________________________________ The user jobs are running under the minosgli account, but the pilot is apparently remaining under minos. Here is a simplfied execution tree, from 'ps -axflwww' UID 4716 condor_starter 7927 /bin/bash /grid/home/minos/... 7927 /bin/bash ./condor_startup.sh 7927 .../condor_master 7927 condor_startd -f 7927 condor_procd 43022 .../condor_starter 43022 condor_procd 43022 /bin/sh /minos/scratch/kreymer/condor/probe/probe 0 sleep 600 4716 is condor 7927 is minos 43022 is minosgli It is no longer clear to me that we are running under glexec . I do not see anything like a glexec binary in this execution tree. We did upgrade the glideinWMS software back on 14 July. _________________________________________________________________ Date: Tue, 05 Aug 2008 15:48:53 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: With the new glexec, it exits immediately and you do not see the glexec executable keep running through the course of the job. But I assure you that unless it had been running successfully you would never have been able to change uid from minos to minosgli. You've actually implemented it backwards of the way we had intended the glidein role to be used. "gfactory" should have the glidein role in its proxy when it submits the glideins to fnpcfg1. The normal users should not. The glideins should be running as "minosgli" and the user processes they spawn should be running as "minos". Steve Timm _________________________________________________________________ Date: Tue, 05 Aug 2008 15:57:08 -0500 From: Sfiligoi Igor Hi Art. The fact that you do not see glexec is OK... this is as it should be. Regarding the UID: you mentioned in a previous mail that you added a role... Where did you do that? The factory one does not have one: [gfactory@minos25 ~]$ voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. 
Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : /home/gfactory/.grid/kreymer-condor.proxy timeleft : 132:14:19 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov attribute : /fermilab/minos/Role=NULL/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL timeleft : 12:14:18 Did you add a role to your user jobs? Igor _________________________________________________________________ You are both correct, I've gotten my proxies reversed. I have corrected the pilot role from the user proxy to gfactory, around 16:21. We happen to have gotten a fresh batch of pilots with the new proxies, just as I made the change. The accounts are now as you describe, pilots under minosgli, users under minos. Thanks ! _________________________________________________________________ ######### # CONDOR # ########## The email worked, alerting us to a held job : Date: Tue, 05 Aug 2008 13:14:17 -0500 From: Art Kreymer To: minos-admin@fnal.gov, fermigrid-help@fnal.gov, timm@fnal.gov Subject: Minos gfactory job held, details follow Tue Aug 5 13:14:17 CDT 2008 -- Submitter: minos25.fnal.gov : <131.225.193.25:64545> : minos25.fnal.gov ID OWNER HELD_SINCE HOLD_REASON 171531.1 gfactory 8/5 13:12 Globus error 43: the job manager failed to 171531.3 gfactory 8/5 13:12 Globus error 17: the job failed when the jo 2 jobs; 0 idle, 0 running, 2 held ########## # CONDOR # ########## Per pittam/brebel request, set pittam to better priority. MINOS25 > condor_userprio -setfactor pittam@fnal.gov 10. The priority factor of pittam@fnal.gov was set to 10.000000 ########## # CONDOR # ########## spotting users with excessively good priorities HOTS=`condor_userprio -all -allusers \ | grep -v gfactory \ | grep -v kreymer \ | grep -v rhatcher \ | grep ' 1.00 ' \ | cut -f 1 -d @ ` for HOT in ${HOTS} ; do printf "condor_userprio -setfactor ${HOT}@fnal.gov 100.\n" done MINOS25 > condor_userprio -setfactor rbpatter@fnal.gov 100. condor_userprio -all -allusers \ | grep -v gfactory \ | grep -v kreymer \ | grep -v rhatcher \ | grep ' 100.00 ' ############## # MINOS_DATA # ############## Need to return to cleanups, to make space for releases and products, and for user analysis MINOS26 > dds *.index | sort -n -k 5,5 ... 
-rw-r--r-- 1 rubin e875 66383 Oct 22 2007 mc_far.carrot.cedar.index -rw-r--r-- 1 rubin e875 83431 Oct 22 2007 mc_cosmic.bfld201.cedar.index -rw-r--r-- 1 rubin e875 87615 May 9 2007 mc_far.daikon_00.cedar.index -rw-r--r-- 1 rubin e875 118976 Oct 24 2007 mc_near.R1_18_2.index -rw-r--r-- 1 rubin e875 512600 Oct 31 2006 mc_near.carrot_06.cedar.index -rw-r--r-- 1 rubin e875 542620 Nov 1 2006 mc_near.carrot_06.R1_18_2.index -rw-rw-r-- 1 rubin e875 606080 Oct 24 2007 mc_near.daikon_00.cedar.index MINOS26 > less mc_near.R1_18_2.index | cut -f 1 -d / | sort -u recodata35 recodata36 recodata37 recodata38 recodata39 recodata40 recodata42 MINOS26 > less mc_near.carrot_06.R1_18_2.index | cut -f 1 -d / | sort -u recodata08 recodata13 recodata19 recodata21 recodata22 recodata35 recodata36 recodata37 recodata38 recodata39 recodata40 recodata42 recodata43 recodata44 recodata45 recodata46 recodata47 recodata48 recodata49 recodata50 recodata51 recodata53 recodata56 MINOS26 > for NN in 35 36 37 38 39 40 42 ; do ls ../recodata${NN} | cut -f 3 -d . | sort -u ; done R1_18_2 cbdl cnts recodata35 sntp snts R1_18_2 cbdl cnts recodata36 sntp snts R1_18_2 recodata37 R1_18_2 recodata38 sntp snts R1_18_2 recodata39 R1_18_2 recodata40 R1_18_2 recodata42 rubin@fnpcsrv1 cat shrc/kreymer cut/paste cd /afs/fnal.gov/files/data/minos/d10/indexes ./rvm _near.R1_18_2 noop SRV1> ./rvm _near.R1_18_2 This procedure will erase all _near.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Removing mc_near.R1_18_2. Removed 2288 files Removed net 2288 files SRV1> date Tue Aug 5 11:38:37 CDT 2008 ./rvm _near.carrot_06.R1_18_2 noop | less SRV1> ./rvm _near.carrot_06.R1_18_2 ; date many messages failing to remove nonexistent files, rm: cannot remove `../recodata38/n13011000_0000_L010170.sntp.R1_18_2.root': No such file or directory rm: cannot remove `../recodata40/n13011000_0000_L010185.sntp.R1_18_2.root': No such file or directory rm: cannot remove `../recodata42/n13011000_0000_L010200.sntp.R1_18_2.root': No such file or directory ... Updated rvm to print data, using /bin/bash, and to rm -f SRV1> ./rvm _near.carrot_06.R1_18_2 This procedure will erase all _near.carrot_06.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 11:47:22 CDT 2008 Removing mc_near.carrot_06.R1_18_2. Removed 10435 files Removed net 10435 files Tue Aug 5 11:48:33 CDT 2008 Cleaning up the rest of mc R1_18_2 -rw-r--r-- 1 rubin e875 21736 Sep 28 2006 mc_far.R1_18_2.index lrwxr-xr-x 1 rubin e791 24 Jan 27 2006 mc_far.beet.R1_18_2.index -> mc_far.v17.R1_18_2.index -rw-r--r-- 1 rubin e875 1560 Sep 28 2006 mc_far.v17.R1_18_2.index -rw-r--r-- 1 rubin e875 265 Feb 4 2006 mc_far.v17.R1_18_2a.index -rw-r--r-- 1 rubin e875 19292 Mar 17 2006 mc_fmock.carrot.R1_18_2.index -rw-r--r-- 1 rubin e875 1508 Mar 15 2006 mc_fmock.carrot_06.R1_18_2.index lrwxr-xr-x 1 rubin e791 25 Jan 27 2006 mc_near.beet.R1_18_2.index -> mc_near.v17.R1_18_2.index -rw-r--r-- 1 rubin e875 4108 Sep 28 2006 mc_near.v17.R1_18_2.index rm mc_far.beet.R1_18_2.index rm mc_near.beet.R1_18_2.index REL=_far.R1_18_2 REL=_far.v17.R1_18_2 REL=_far.v17.R1_18_2a REL=_fmock.carrot.R1_18_2 REL=_fmock.carrot_06.R1_18_2 REL=_near.v17.R1_18_2 ./rvm ${REL} noop | less SRV1> REL=_far.R1_18_2 SRV1> ./rvm ${REL} This procedure will erase all _far.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 11:59:45 CDT 2008 Removing mc_far.R1_18_2. 
Removed 418 files Removed net 418 files Tue Aug 5 11:59:47 CDT 2008 SRV1> REL=_far.v17.R1_18_2 SRV1> ./rvm ${REL} This procedure will erase all _far.v17.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 11:59:58 CDT 2008 Removing mc_far.v17.R1_18_2. Removed 30 files Removed net 30 files Tue Aug 5 11:59:58 CDT 2008 SRV1> REL=_far.v17.R1_18_2a SRV1> ./rvm ${REL} This procedure will erase all _far.v17.R1_18_2a ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 12:00:06 CDT 2008 Removing mc_far.v17.R1_18_2a. Removed 5 files Removed net 5 files Tue Aug 5 12:00:06 CDT 2008 SRV1> REL=_fmock.carrot.R1_18_2 SRV1> ./rvm ${REL} This procedure will erase all _fmock.carrot.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 12:00:15 CDT 2008 Removing mc_fmock.carrot.R1_18_2. Removed 371 files Removed net 371 files Tue Aug 5 12:00:17 CDT 2008 SRV1> REL=_fmock.carrot_06.R1_18_2 SRV1> ./rvm ${REL} This procedure will erase all _fmock.carrot_06.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 12:00:25 CDT 2008 Removing mc_fmock.carrot_06.R1_18_2. Removed 29 files Removed net 29 files Tue Aug 5 12:00:25 CDT 2008 SRV1> REL=_near.v17.R1_18_2 SRV1> ./rvm ${REL} This procedure will erase all _near.v17.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 12:00:30 CDT 2008 Removing mc_near.v17.R1_18_2. Removed 79 files Removed net 79 files Tue Aug 5 12:00:31 CDT 2008 Identifying the empty disks : cd $MINOS_DATA/d10 fs listquota recodata* | sort -k 3,3 | head -12 Volume Name Quota Used %Used Partition nb.minos.d114 50000000 246 0% 59% nb.minos.d117 50000000 252 0% 60% nb.minos.d116 50000000 278 0% 59% nb.minos.d124 50000000 292 0% 53% nb.minos.d119 50000000 300 0% 60% nb.minos.d120 50000000 638 0% 53% nb.minos.d115 50000000 1405 0% 55% nb.data.minosd10 8000000 8446 0% 59% nb.minos.d198 50000000 59211 0% 53% nb.minos.d123 50000000 310819 1% 54% nb.minos.d125 50000000 2043136 4% 55% Noted that recodata17 points to ../d88, which was long ago given to cc. rm recodata17 Let's backlink these to the recodata links : for RECO in `ls -d recodata*` ; do USED=`fs listquota ${RECO} | grep '% ' | tr -s ' ' | cut -f 3 -d ' '` [ ${USED} -lt 10000 ] && printf " ${RECO} " && fs listquota ${RECO} | grep '% ' done recodata01 nb.data.minosd10 8000000 8446 0% 59% recodata37 nb.minos.d114 50000000 246 0% 59% recodata38 nb.minos.d115 50000000 1405 0% 55% recodata39 nb.minos.d116 50000000 278 0% 59% recodata40 nb.minos.d117 50000000 252 0% 60% recodata42 nb.minos.d119 50000000 300 0% 60% recodata43 nb.minos.d120 50000000 638 0% 53% recodata47 nb.minos.d124 50000000 292 0% 53% Remove the recodata* links presently empty for RD in 37 38 39 40 42 43 47 ; do ls -l recodata${RD} ; done for RD in 37 38 39 40 42 43 47 ; do rm recodata${RD} ; done Clean out the remnant data directories MINOS26 > for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type f ; done ../d115/recodata38/F00033307_0005.all.snts.R1_18_2.1.root ../d115/recodata38/F00033307_0005.spill.sntp.R1_18_2.1.root ../d115/recodata38/N00008029_0007.cosmic.snts.R1_18_2.1.root ../d115/recodata38/N00008029_0007.spill.sntp.R1_18_2.1.root ../d115/recodata38/n13021020_0017_L100200.sntp.R1_18_2.root MINOS26 > grep recodata38 indexes/*.index nothing, so these few files are unindexed strays. 
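For reference, the same check can be made file by file across all the remnant d1NN areas before removing anything. A minimal sketch, run from $MINOS_DATA/d10 as above (this is not an existing script; the filename is used as a plain grep pattern):
for DN in 14 15 16 17 19 20 24 ; do
  find ../d1${DN} -type f | while read FILE ; do
    BASE=`basename ${FILE}`
    # flag any file that no index references
    grep -q "${BASE}" indexes/*.index || printf "stray ${FILE}\n"
  done
done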
MINOS26 > dds ../d115/recodata38/*.root -rw-r--r-- 1 rubin e875 156754 Dec 14 2005 ../d115/recodata38/F00033307_0005.all.snts.R1_18_2.1.root -rw-r--r-- 1 rubin e875 158192 Dec 14 2005 ../d115/recodata38/F00033307_0005.spill.sntp.R1_18_2.1.root -rw-r--r-- 1 rubin e875 159699 Dec 15 2005 ../d115/recodata38/N00008029_0007.cosmic.snts.R1_18_2.1.root -rw-r--r-- 1 rubin e875 217387 Dec 15 2005 ../d115/recodata38/N00008029_0007.spill.sntp.R1_18_2.1.root -rw-r--r-- 1 rubin e875 453595 Dec 15 2005 ../d115/recodata38/n13021020_0017_L100200.sntp.R1_18_2.root MINOS26 > rm ../d115/recodata38/*.root for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type f ; done for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type l ; done for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type l -exec rm {} \; ; done for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type d -name reco\* ; done for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type d -name reco\* -exec rmdir {} \; ; done MINOS26 > for DN in 14 15 16 17 19 20 24 ; do ls -l ../d1${DN} ; done total 0 total 0 total 0 total 0 total 0 total 0 total 0 ============================================================================= 2008 08 04 ============================================================================= ########### # MONTHLY # ########### DATASETS 8/4 PREDATOR 8/4 VAULT 8/3 MYSQL 8/4 ######### # MYSQL # ######### HOWTO.dbarchive.20080804 MILESTONE - no more locking CRL during backups Rework table locking, pairwise, FLUSH TABLES ${TAB}, ${TAB}VLD LOCK TABLES ${TAB}, ${TAB}VLD READ 131 *VLD.MYD 266 *.MYD Non-VLD are : DBUVACHIPPEDS_OLD.MYD DBUVACHIPSPARS_OLD.MYD GLOBALSEQNO.MYD LOCALSEQNO.MYD ########## # CONDOR # ########## Try running with Role=pilot Edited /local/scratch25/grid/kproxy, adding /Role=pilot to -voms fermilab:/fermilab/minos/Role=pilot Just after 13:10, will have to let gfactory processes expire, MINOS25 > condor_q gfactory -- Submitter: minos25.fnal.gov : <131.225.193.25:64545> : minos25.fnal.gov ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 171085.2 gfactory 8/4 13:01 0+00:28:19 R 0 0.0 glidein_startup.sh touch /minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE 17:15 - rm /minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE gfactory jobs submitted at 17:18 MILESTONE - glidein jobs are running under minosgli account ! new proxies continue to get /Role=pilot ########## # CONDOR # ########## Started this around 10:32 CDT, to send email once in case of a future held gfactory job whose condor_q message include 'Globus', per request of Timm to be notified immediately. { while ! { condor_q gfactory -hold | grep -q Globus ; } ; do sleep 500 done { date ; condor_q gfactory -hold ; } | \ mail -s "Minos gfactory job held, details follow" \ minos-admin@fnal.gov,fermigrid-help@fnal.gov,timm@fnal.gov } & ####### # RAL # ####### Verified that kreymer@rl.ac.uk mail still is forwarded to kreymer@fnal.gov ============================================================================= 2008 08 02 ============================================================================= ########## # CONDOR # ########## see 08 01, released last block of 'tranfer' related held gfactory jobs still 31 others, clear them later. pawloski is running again ######## # FARM # ######## CPB near is fully concatenated, so killed looper. 
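For the record, the pairwise FLUSH/LOCK rework noted under the 2008 08 04 MYSQL entry above follows this general pattern. This is a minimal sketch only, not the dbarchive code: the database name, table name, data-file paths and archive area are placeholders, and the copy has to happen while the locking session is still open, here via the mysql client's system command:
DB=offline ; TAB=BEAMMONSPILL ; ARC=/data/archive/${DB}   # hypothetical names and paths
mysql ${DB} <<EOF
FLUSH TABLES ${TAB}, ${TAB}VLD ;
LOCK TABLES ${TAB} READ, ${TAB}VLD READ ;
system cp /var/lib/mysql/${DB}/${TAB}.MYD /var/lib/mysql/${DB}/${TAB}VLD.MYD ${ARC}/
UNLOCK TABLES ;
EOF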
============================================================================= 2008 08 01 ============================================================================= ######## # GRID # ######## Date: Sat, 02 Aug 2008 03:53:28 +0100 From: Jenny Thomas To: minos_authors@fnal.gov Subject: MINOS GRID INFRASTRUCTURE GROUP All, I am very happy to announce that Ryan Patterson has agreed to lead the new MINOS GRID INFRASTRUCTURE GROUP whose goal is to enable a ten fold increase in the MINOS processing capability. This will provide a desparately needed shot in the arm for our computing resources. That was the good stuff. The next thing of course is that Ryan needs volunteers to help him. I would like to point you to Doc-DB 4886 which lays out the tasks which need to be covered. I would like to ask you all to consider whether you might be willing to volunteer. The idea is that this would be a 2-3 month blitz to set up the infrastructue and then it would be routine maintenence after that. Some experience of unix systems is obviously necessary or a willingness to learn it quickly! I would point out that GRID usage is going to become the bread and butter of HEP physics analysis in the future and so this would be extremely good experience for more junior people although senior people are also encouraged to volunteer. Please respond to me in the first instance and ideally before the next collaboration meeting. Thanks, Jenny ########### # SCRATCH # ########### Disk is full, MINOS26 > df -h /minos/scratch Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/scratch 5.0T 5.0T 38G 100% /minos/scratch MINOS26 > du -sm /minos/scratch/* | sort -n du: `/minos/scratch/app/OSG1/vdt/extract': Permission denied du: `/minos/scratch/app/OSG1/vdt/backup': Permission denied du: `/minos/scratch/boehm/Extrapolation/SideBands/PidTweak/BadFiles': Permission denied du: `/minos/scratch/boehm/Extrapolation/SideBands/L250200/SysFiles': Permission denied du: `/minos/scratch/djauty/.TemporaryItems/folders.23559': Permission denied du: `/minos/scratch/pawloski/Nue/HornOn_HornOff/HornOn/macros/NueErrors': Permission denied du: `/minos/scratch/pawloski/Nue/Old/HornOn_HornOff/HornOn/macros/NueErrors': Permission denied du: `/minos/scratch/pawloski/Nue/Old/HornOn_HornOff/HornOn_PreAustinCuts/macros/NueErrors': Permission denied du: `/minos/scratch/pawloski/Old/test/.tmp/.files': Permission denied du: `/minos/scratch/pawloski/.tmp': Permission denied du: `/minos/scratch/rahaman/latex/BcForBeach2006': Permission denied du: `/minos/scratch/rahaman/latex/aps2006': Permission denied du: `/minos/scratch/rahaman/latex/beach2006': Permission denied du: `/minos/scratch/rahaman/latex/bcpaper': Permission denied du: `/minos/scratch/rahaman/latex/cdfnote': Permission denied du: `/minos/scratch/rahaman/latex/ckm06Proc': Permission denied du: `/minos/scratch/rahaman/latex/cv': Permission denied du: `/minos/scratch/rahaman/latex/style': Permission denied du: `/minos/scratch/rahaman/latex/talk': Permission denied du: `/minos/scratch/rahaman/latex/thesis': Permission denied du: `/minos/scratch/rearmstr/.ssh': Permission denied ... 
11255 /minos/scratch/asousa 11532 /minos/scratch/arms 11753 /minos/scratch/grashorn 12002 /minos/scratch/bishai 13144 /minos/scratch/masaki 14802 /minos/scratch/rbpatter 19195 /minos/scratch/sjc 19960 /minos/scratch/tagg 20928 /minos/scratch/annah1 23068 /minos/scratch/kimjj 25386 /minos/scratch/bspeak 25853 /minos/scratch/tinti 26276 /minos/scratch/jyuko 44281 /minos/scratch/med 46128 /minos/scratch/ahimmel 46539 /minos/scratch/vahle 47369 /minos/scratch/ochoa 49958 /minos/scratch/hartnell 51873 /minos/scratch/rodriges 55421 /minos/scratch/koskinen 59403 /minos/scratch/brebel 59769 /minos/scratch/bckhouse 61388 /minos/scratch/deb4 66207 /minos/scratch/evansj 69586 /minos/scratch/niki 72106 /minos/scratch/djauty 82008 /minos/scratch/petyt 84030 /minos/scratch/zarko 90560 /minos/scratch/pittam 92179 /minos/scratch/mishi 140503 /minos/scratch/jjling 176011 /minos/scratch/boehm 276778 /minos/scratch/rustem 342611 /minos/scratch/tjyang 433343 /minos/scratch/rmehdi 457522 /minos/scratch/scavan 474990 /minos/scratch/pawloski 629348 /minos/scratch/rahaman 902190 /minos/scratch/loiacono Date: Fri, 01 Aug 2008 17:33:30 -0500 (CDT) Subject: HelpDesk ticket 119604 ___________________________________________ Short Description: Please move 1 TB of quota from /minos/data to /minos/scratch Problem Description: LSC/CSI : We have recovered a lot of space from /minos/data, with about 5 TB free. We have run out of space in /minos/scratch. So please shift 1 TB of capacity back from /minos/data to /minos/scratch Thanks ! ___________________________________________ forwarded copy of ticket to rayp, romero, inkmann ___________________________________________ Date: Fri, 01 Aug 2008 17:45:32 -0500 From: Andrew Romero /minos/data ... decreased to 27TB /minos/scratch ... increased to 6TB ___________________________________________ ######## # FARM # ######## Date: Fri, 01 Aug 2008 15:34:19 -0500 From: Howard Rubin To: Minos_Batch batch Subject: Current status of Run III The Run III ND spill pass has completed and over 400 FD runs from 2008-06 have run. *ALL* of the FD jobs and 25% of the ND jobs have failed with FPE's, according to Steve's research, probably all in the calibrator. I am shutting down Run III processing until this is completely diagnosed and fixed. It may be necessary to rerun all of the ND as well as, of course the FD. Because of a logic error in my proxy renewal for grid processing, there was a problem early this morning which has caused a substantial backlog of 'apparently incomplete' jobs which have to be cleared out. This may not happen until the next cedar keep-up is due to start, so tentatively I have also shut down keep-up until the backlog clears and I'm sure I've removed the logic flaw. ########## # PARROT # ########## Test parallel parrots, First, try out stale test area on fnpc338 last changed Jul 31 16:42 388 > du -sm /local/stage1/condor/ 150 /local/stage1/condor/ 388 > /grid/app/minos/parrot/paloon 388 > du -sm /local/stage1/condor/ 150 /local/stage1/condor/ oops, potoential shared /local/stage1/kreymer is absent, 388 > du -sm /var/tmp/kreymer/ 336 /var/tmp/kreymer/ changed pallon to create /local/stage1/${LOGNAME} Set up file list, just the data part of the path without /pnfs/minos in a shared area. 
./samlocate "__set__ st-censmall" ./samlocate "__set__ st-censmall" | sort | while read FLINE ; do FILE=`echo ${FLINE} | cut -f 1 -d ' '` FPAT=`echo ${FLINE} | cut -f 2 -d ' '` printf "${FPAT/\/pnfs\/minos\/}/${FILE}\n" done > /minos/scratch/kreymer/condor/parrot/st-censmall.files Set up paloon and loonar to take optional process and file list. /grid/app/minos/parrot/paloon 3 /minos/scratch/kreymer/condor/parrot/st-censmall.files This works ! But we had the wrong files in st-censmall, too big ! Recreated the dataset, and the list, this time ordered by name. Test all the files , N=-1 while [ ${N} -lt 100 ] ; do (( N ++ )) /grid/app/minos/parrot/paloon ${N} /minos/scratch/kreymer/condor/parrot/st-censmall.files done Completed cleanly. Shifted to a less busy node, fnpc299 mkdir parrot cd parrot /grid/app/minos/parrot/paloon 3 /minos/scratch/kreymer/condor/parrot/st-censmall.files > 3.log 2>&1 & EXE=/grid/app/minos/parrot/paloon FILES=/minos/scratch/kreymer/condor/parrot/st-censmall.files for N in 0 1 2 3 4 5 6 7 8 9 ; do ${EXE} ${N} ${FILES} > ${N}.log 2>&1 & done Clear the boards, try fresh with empty cache -bash-3.00$ rm -r /local/stage1/kreymer cd ~/parrot/try2 for N in 0 1 2 3 4 5 6 7 8 9 ; do ${EXE} ${N} ${FILES} > ${N}.log 2>&1 & done -bash-3.00$ du -sm /local/stage1/kreymer/parrot/ 337 /local/stage1/kreymer/parrot/ ########## # CONDOR # ########## FYI , to switch to using account minosgli or minosana, just add the corresponding role to the glidin proxy, pilot or Analysis I'm going to wait till I have purged all the held gfactory jobs, probably next week. ############### # CONDORGLIDE # ############### Added flag to skip : [ -r "/minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE" ] && exit 0 ########### # ROUNDUP # ########### roundup.20080801 DFARM cleanup : DFARM is was used as flag for file purging, absent for purge of components, in PURGE GRID, set at that time required before purge from WRITE. This assures the purge of components, in case of messy restarts. ( a file gets into WRITE, components not purged ) Should change directory name to PURGED, Should remove the PURGED file as soon as the PURGE WRITE is complete. ######## # FARM # ######## DFARM cleanup SRV1> ls /export/stage/minfarm/ROUNDUP/DFARM | wc -l 124861 SRV1> ls /export/stage/minfarm/ROUNDUP/DFARM | grep cedar | wc -l 122924 SRV1> du -sm . 495 . Safety copy : SRV1> tar cf /minos/data/minfarm/maint/DFARM.tar . SRV1> tar tf /minos/data/minfarm/maint/DFARM.tar | grep root | wc -l 124898 MINOS26 > du -sm /minos/data/minfarm/maint/DFARM.tar 122 /minos/data/minfarm/maint/DFARM.tar Found 38 files in DFARM/tmp, vintage Jun 1 2007. Removed them. SRV1> rm -r /export/stage/minfarm/ROUNDUP/DFARM/tmp Remove an older release SRV1> find -name \*R1_24\* | wc -l 1936 SRV1> find -name \*R1_24\* -exec rm {} \; Remove candidates, easy pickens SRV1> find . -name \*cand\* | wc -l 60366 SRV1> time find . -name \*cand\* -exec rm {} \; real 5m35.302s user 0m19.702s sys 4m51.090s now monte carlo, not presently being concatenated SRV1> find . -name n\* | wc -l 17875 SRV1> find . -name f\* | wc -l 11547 SRV1> time find . -name n\* -exec rm {} \; real 1m36.460s user 0m6.008s sys 1m13.267s SRV1> time find . -name f\* -exec rm {} \; real 0m54.439s user 0m4.046s sys 0m48.083s Blow away D05 mdc files SRV1> time find . -name \*D05\* -exec rm {} \; real 0m0.945s user 0m0.103s sys 0m0.617s And Far files, none being written presently SRV1> find . -type f | wc -l 32822 SRV1> find . -type f -name F\* | wc -l 24472 SRV1> time find . 
-type f -name F\* -exec rm {} \; real 2m6.308s user 0m8.369s sys 1m38.664s SRV1> find . -type f | wc -l 8350 Grab the cedar_phy files, SRV1> find . -type f -name \*\.cedar_phy\.\* | wc -l 3676 SRV1> find . -type f -name \*\.cedar_phy\.\* | cut -f 5 -d . | uniq cedar_phy SRV1> time find . -type f -name \*\.cedar_phy\.\* -exec rm {} \; real 0m20.316s user 0m1.286s sys 0m15.412s Troll for more stuff SRV1> find . -type f | cut -f 5 -d . | sort -u cedar cedar_phy_bhcurv cedar_phy_srsafitter cedar_phy_srsafitterbx113 SRV1> find . -type f -name \*\.cedar_phy_srsafitter\* | wc -l 302 SRV1> time find . -type f -name \*\.cedar_phy_srsafitter\* -exec rm {} \; real 0m1.450s user 0m0.142s sys 0m1.205s Keepup cleaned up, can purge cedar SRV1> find . -type f -name \*\.cedar\.\* | wc -l 1503 SRV1> time find . -type f -name \*\.cedar\.\* -exec rm {} \; Now a final purge of slightly old files, SRV1> find . -type f -mtime +3 | wc -l 2453 ls -ltr -rw-rw-r-- 1 minfarm numi 29 Mar 28 18:46 N00008345_0002.spill.mrnt.cedar_phy_bhcurv.1.root drwxrwxr-x 16 minfarm numi 4096 Jul 24 21:44 ../ -rw-rw-r-- 1 minfarm numi 29 Jul 30 18:54 N00014166_0000.spill.mrnt.cedar_phy_bhcurv.0.root That makes sense, have purged all but CPB, last run last March. SRV1> find . -type f -mtime +3 -exec ls -l {} \; | tail -rw-rw-r-- 1 minfarm numi 29 Mar 28 18:38 ./N00008439_0011.spill.mrnt.cedar_phy_bhcurv.1.root ... SRV1> time find . -type f -mtime +3 -exec rm {} \; real 0m15.380s user 0m0.862s sys 0m12.335s SRV1> ls | wc -l 416 This is very healthy. Finish the purge after the upgrade to roundup, and the shift from DFARM to PURGED ########## # CONDOR # ########## The backlog has cleared, nothing runing in glidein beyond my 10-minute interval tests. The first batch of released held gfactory jobs has cleared. Release a batch of 100. JOBS=`condor_q gfactory -hold | grep transfer | head -100 | cut -f 1 -d ' '` for JOB in ${JOBS} ; do condor_release ${JOB} ; sleep 1 ; done for JOB in ${JOBS} ; do condor_q ${JOB} | grep gfactory ; done Released these around 10:00 All were running by about 10:50, about half already timed out. 13:12 - previous batch is clear, Ran the next batch, 576 jobs; 102 idle, 55 running, 419 held 170282.0 gfactory 7/31 07:05 0+00:00:00 H 0 0.0 glidein_startup.sh ... 170301.9 gfactory 7/31 08:45 0+00:00:00 H 0 0.0 glidein_startup.sh 17:15 - all clear, take another shot ( farm has ramped down ) 23:00 - all clear, another shot 219 jobs; 0 idle, 0 running, 219 held 2 August 192 jobs; 10 idle, 63 running, 119 held 12:40 - all clear, many other glideins running MINOS25 > printf "${JOBS}\n" | wc -l 88 192 jobs; 98 idle, 63 running, 31 held ######### # PROBE # ######### 10:00 Added space check of HOME, df -h ${HOME} du -sh ${HOME} ######## # FARM # ######## Just for completeness, now that other cleanup has been done, looking at state of May 2 vintage cedar_phy far. Forcing an update of ROUNTMP/LOG/cdar_phyfar.pend farcat 315 7614 all.sntp.cedar_phy.0.root 315 1332 spill.bntp.cedar_phy.0.root 315 888 spill.sntp.cedar_phy.0.root ./roundup -r cedar_phy far ============================================================================= 2008 07 31 ============================================================================= ########## # CONDOR # ########## Per sfiligoi advice, clearing out all the old held gfactory sections, by releasing them. Let's do 50 at a time, and start with the 'data transfer error's of today. 
JOBS=`condor_q gfactory -hold | grep transfer | head -50 | cut -f 1 -d ' '`
for JOB in ${JOBS} ; do condor_release ${JOB} ; sleep 1 ; done
for JOB in ${JOBS} ; do condor_q ${JOB} | grep gfactory ; done
Thu Jul 31 15:19:48 CDT 2008 initially all idle ( 170250.0 - 170259.9 )
Thu Jul 31 15:27:30 CDT 2008 most of these are running.
Plan : wait a half hour for them to time out, then do another batch.
Things have changed, pawloski has 1460 jobs queued up.
But his jobs only run on the AFS nodes, 64 of them.
On the other hand, there are 808 farm jobs running, so the glideins are not starting too quickly.
I'll have a look again tomorrow.
####### # WEB # #######
Created an easier-to-find DH home page
MIN > cd /afs/fnal.gov/files/expwww/numi/html/computing/dh
MIN > ln -s dhmain.html index.html
Added mdsum link
MIN > cp dhmain.20080403.html dhmain.20080731.html
MIN > ln -sf dhmain.20080731.html dhmain.html
############ # MCIMPORT # ############
The planned data archival using mcimport is complete.
######## # FARM # ########
> User: minos > > Email: rubin@fnal.gov > > FileSystem: fermigrid-home > > Total disk allocated (GB): 10.0 > > Percent disk used: 100.0%
SRV1> dds /export/blue2_home
drwxr-xr-x 1184 minos numi 1808384 Jul 31 11:17 minos/
drwxr-xr-x 3 minosana numi 2048 Dec 16 2007 minosana/
drwxr-xr-x 4 minosgli numi 12288 Dec 30 2007 minosgli/
drwxr-xr-x 4605 minospro numi 2160640 Jul 31 11:19 minospro/
drwxr-xr-x 3 minsoft numi 2048 Jan 22 2008 minsoft/
SRV1> df -h /export/blue2_home
Filesystem Size Used Avail Use% Mounted on
blue2.fnal.gov:/fermigrid-home 1004G 282G 723G 29% /grid/home
du -sm /grid/home/minos
du: cannot read directory `/grid/home/minos/gram_scratch_azzcRJplkx': Permission denied
du: cannot read directory `/grid/home/minos/gram_scratch_gMpBXjwcYD': Permission denied
du: cannot read directory `/grid/home/minos/gram_scratch_Es7lcr26pj': Permission denied
407 /grid/home/minos
SRV1> ls -l /grid/home/minos | grep drw | wc -l 1221
SRV1> ls -l /grid/home/minos | grep -v gram
total 408192
drwxr-xr-x 2 minos numi 2048 Aug 12 2005 0
-rw-r--r-- 1 minos numi 0 Jan 22 2008 foo
-rw-r--r-- 1 minos numi 12197 May 22 2006 wrapper.sh
-rw------- 1 minos numi 7660 May 22 2006 x509_proxy_in
SRV1> ls -l /grid/home/minos | wc -l 9553
SRV1> ls -l /grid/home/minos | grep gram_scratch | wc -l 1229
CONDOR errors in gfactory jobs, starting
170250.0 gfactory 7/31 04:30 Globus error 10: data transfer to the serve
170250.1 gfactory 7/31 04:30 Globus error 10: data transfer to the serve
ls -ldtr /grid/home/minos/gram_scratch_*
drwx------ 2 minos numi 2048 Jul 12 13:52 /grid/home/minos/gram_scratch_GqyUAmOXyw
drwx------ 2 minos numi 2048 Jul 13 04:08 /grid/home/minos/gram_scratch_BiWr7QCsMH
drwx------ 2 minos numi 2048 Jul 13 04:38 /grid/home/minos/gram_scratch_8rVn4pIePb
drwx------ 2 minos numi 2048 Jul 13 04:38 /grid/home/minos/gram_scratch_LIuonUwfsV
drwx------ 2 minos numi 2048 Jul 13 04:52 /grid/home/minos/gram_scratch_zuAusgD3En ...
drwx------ 2 minos numi 2048 Jul 13 09:56 /grid/home/minos/gram_scratch_8LT5ENcZEr drwx------ 2 minos numi 2048 Jul 13 10:45 /grid/home/minos/gram_scratch_syH6ohX48k drwx------ 2 minos numi 2048 Jul 24 06:20 /grid/home/minos/gram_scratch_pryJw1EdQF drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_WHgYnLpDYr drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_dHjsbrp3jM drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_bNGYJjYDPO There are also lots of stale gram_scratch areas under minospro, mainly from : Jul 1 21:06 thru Jul 2 16:13 Jul 11 10:17 thru Jul 11 22:11 Jul 15 13:27 Jul 15 17:30 Jul 16 05:57 Jul 17 12:14 Jul 18 19:34 Jul 18 19:39 Jul 18 20:59 Jul 23 15:17 Jul 23 15:17 Jul 23 15:17 Jul 24 06:25 Jul 24 23:09 thru Jul 25 16:48 Jul 26 11:35 Jul 27 04:37 Jul 28 15:23 Jul 28 17:24 Jul 28 17:25 Jul 28 17:26 Jul 28 17:26 Jul 28 17:26 Jul 28 17:27 Jul 28 22:32 Jul 28 22:32 Jul 28 22:37 Jul 29 09:40 thru Jul 29 17:17 Jul 30 07:31 Jul 30 07:37 Jul 30 11:30 thru Jul 30 12:15 Jul 30 20:58 thru current, Jul 31 11:51 Counting non-Jul 31 gram_scratch directories: SRV1> ls -ltr /grid/home/minospro | grep -v 'Jul 31' | wc -l 3779 SRV1> ls -ltr /grid/home/minos | grep -v 'Jul 31' | wc -l 7153 Somewhat better now, 13:15 SRV1> ls -ltr /grid/home/minospro | grep -v 'Jul 31' | wc -l 616 SRV1> ls -ltr /grid/home/minos | grep -v 'Jul 31' | wc -l 7075 Later, around 14:23, SRV1> ls -ltr /grid/home/minos | grep -v 'Jul 31' | wc -l 179 I see a few gfactory processes running, starting around 14:03 Rustem has removed his large stdout jobs, around Testing the release of a few gfactory jobs : condor_release 170382 Date: Thu, 31 Jul 2008 11:53:32 -0500 (CDT) Subject: HelpDesk ticket 119498 ___________________________________________ Short Description: /grid/home/minos quota used up - why ? Problem Description: The /grid/home/minos quota of 10 GB seems to have been suddenly used up. Analysis glidein jobs have stalled, as of early this morning. A quick scan of the area does not show evidence of user abuse. This is difficult, as I do not have an interactive minos login with access to /grid/home/minos. With what I can see, there are about 400 MBytes of visible files, plus over 1200 gram_scratch* directories. We have nothing like 1200 jobs running, so something must have gone wrong with the grid software. The time stamps are suspicious : ls -ldtr /grid/home/minos/gram_scratch_* drwx------ 2 minos numi 2048 Jul 12 13:52 /grid/home/minos/gram_scratch_GqyUAmOXyw drwx------ 2 minos numi 2048 Jul 13 04:08 /grid/home/minos/gram_scratch_BiWr7QCsMH drwx------ 2 minos numi 2048 Jul 13 04:38 /grid/home/minos/gram_scratch_8rVn4pIePb drwx------ 2 minos numi 2048 Jul 13 04:38 /grid/home/minos/gram_scratch_LIuonUwfsV drwx------ 2 minos numi 2048 Jul 13 04:52 /grid/home/minos/gram_scratch_zuAusgD3En .. drwx------ 2 minos numi 2048 Jul 13 09:56 /grid/home/minos/gram_scratch_8LT5ENcZEr drwx------ 2 minos numi 2048 Jul 13 10:45 /grid/home/minos/gram_scratch_syH6ohX48k drwx------ 2 minos numi 2048 Jul 24 06:20 /grid/home/minos/gram_scratch_pryJw1EdQF drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_WHgYnLpDYr drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_dHjsbrp3jM drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_bNGYJjYDPO Something seems to have gone wrong around July 12 and 13, then again around 04:20 this morning. 
___________________________________________ This ticket is assigned to TIMM, STEVE of the CD-SF/GF/FGS. ___________________________________________
Date: Thu, 31 Jul 2008 12:16:38 -0500 (CDT)
Note To Requester: The recent move of fermigrid1 to fg1x1 led us to neglect to re-enable our cleanup cron-job. I'll re-enable it shortly and run it manually and it should clean everything up. Thanks for bringing this to our attention. Steve Timm
___________________________________________
Note To Requester: In the process of cleaning out the /grid/home/minos directory we discovered that the bulk of the quota was actually used up not by the glidein jobs going to fnpcfg1 but by a set of jobs that were submitted by Rustem, 20 or more of which produced stdout of 500+MB per job. The quota is only 10GB. If minos leadership wants to not have collisions, you should switch to running the minos glideins as user "minosgli" as we had previously discussed. That will keep the glidein jobs from interfering with the jobs of the unwashed minos users. If you need changes in priority among the various minos users we can do that too. Seeing that these stdouts were only written last night, the cleanup script will not remove them right away. We can intervene for those jobs that were removed, if necessary, or we can temporarily get the /grid/home/minos quota bumped up appropriately. Let us know what you want to do. Steve Timm
___________________________________________
Date: Thu, 31 Jul 2008 19:04:47 +0000 (UTC)
From: Arthur Kreymer
To: rustem@fnal.gov
Cc: minos-admin@fnal.gov, timm@fnal.gov
You have about 44 jobs running or queued on Fermigrid, most submitted around 7/30 20:51. These have exhausted the quota in the minos account, so none of these are likely to be producing useful output. The note from Steve Timm indicates that the jobs are producing about 1/2 GBytes in stdout ( perhaps more, as they are still running ). The grid system cannot handle this much data in stdout. This is not a local Fermilab limitation; it is intrinsic to all existing grid systems that I know of. Until these jobs are cleared out, nobody ( including you ) will be able to run analysis jobs on Fermigrid. Please cancel these jobs, and reorganize them to write data to the usual places ( local disk or /minos/data ). Thanks !
___________________________________________
Date: Thu, 31 Jul 2008 14:43:49 -0500 (CDT)
Solution: Jobs from one minos user were identified and cancelled. The stdout files that were left behind from the cancelled jobs were cleaned up on /grid/home/minos directory. Also minos quota will be boosted to 50GB so there will be less chance of running out of quota again. minos is now using only 68MB of the 10GB quota. Steve Timm
___________________________________________
Date: Mon, 04 Aug 2008 22:36:27 +0000 (UTC)
From: Arthur Kreymer
I have added the pilot role to the proxy used to run Minos glideins. The jobs are now running under the minosgli account. You can return the minos account quota to 10 GB. If you feel inclined to boost quotas to give more margin, please boost the minosgli account. At this point, I think this ticket can be closed. Thanks !
___________________________________________
########### # ENSTORE # ###########
The noaccess list is back with us. Can close ticket 119190. Sent request for same.
######## # FARM # ########
fmock roundup - how to do it ?
Previous runs were like, on 2007 08 29, ./roundup -M -r cedar_phy mockfar, but the mockfar directory does not exist.
As I recall, there was a symlink before we moved /minos/data/minfarm Let's recreate it : ln -s mcfmockcat /minos/data/minfarm/mockfarcat ./roundup -n -W -r cedar_phy_bhcurv mockfar This looks OK, all subruns would be added Separately. Last time we did this, no SAM was available for MC. Let's try one file : ./roundup -s F21930001_0000_L010185N_D05.mrnt -r cedar_phy_bhcurv mockfar The SAM declares fail, messages like Oops, no directories found like /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_05/*_data/L010185N So let's run without SAM, -M, as before mcfmockcat 156 2232 mrnt.cedar_phy_bhcurv.0.root 157 4514 sntp.cedar_phy_bhcurv.0.root ./roundup -M -r cedar_phy_bhcurv mockfar Thu Jul 31 10:37:52 CDT 2008 Thu Jul 31 12:11:12 CDT 2008 SRMCP copies were running just about 19 seconds/file. These are smallish, clustered at 15 MBytes and 27 MBytes ( find /minos/data/minfarm/WRITE -type f -name F\*D05\* -exec du -sm {} \; | cut -f 1 ) > /minos/scratch/kreymer/mcfmock.gpl printf 'plot "/minos/scratch/kreymer/mcfmock.gpl"' | gnuplot -persist Later files, like F21930001_0026_L010185N_D05.sntp.cedar_phy_bhcurv.0.root, copy in about 14 seconds. This is a larger file, 28 MBytes, so the change is not due to size. Did cleanup of WRITE, SRV1> ./roundup -M -W -r cedar_phy_bhcurv mockfar Thu Jul 31 15:25:50 CDT 2008 PURGING WRITE files 313 PURGED 278/313 Thu Jul 31 16:03:40 CDT 2008 PURGING WRITE files 35 PURGED 0/35 Drive LTO3_11 is still busy writing this data, to VOK682 Thu Jul 31 16:33:26 CDT 2008 PURGED 35/35 DONE ! ============================================================================= 2008 07 30 ============================================================================= ######## # FARM # ######## cedar_phy_bhcurv processing has started for Run III. A few near files are showing up, nearcat 17 118 spill.mrnt.cedar_phy_bhcurv.0.root 17 216 spill.sntp.cedar_phy_bhcurv.0.root about 18:30 CDT ./looper '-r cedar_phy_bhcurv near' & ############ # MCIMPORT # ############ TOP=daikon_04/L010000N/near # 10MB, 190-230 MCI3 > echo $RDIRS 700 701 702 703 704 705 706 for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport.20080730 -n -t ${TOP}/${DIR} | grep FILES done 700 278/278 TOTAL FILES 138/138 TOTAL FILES 701 311/311 TOTAL FILES 110/110 TOTAL FILES 702 309/309 TOTAL FILES 111/111 TOTAL FILES 703 310/310 TOTAL FILES 110/110 TOTAL FILES 704 307/307 TOTAL FILES 110/110 TOTAL FILES 705 305/305 TOTAL FILES 109/109 TOTAL FILES 706 182/182 TOTAL FILES 65/65 TOTAL FILES DIR=706 ./mcimport.20080730 -n -t ${TOP}/${DIR} ./mcimport.20080730 -t ${TOP}/${DIR} OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010000N/near/706/mcimport.log Wed Jul 30 15:33:14 CDT 2008 ... n11037060_0000_L010000N_D04-n11037060_0007_L010000N_D04.tar 8 n11037060_0000_L010000N_D04.tar.gz to n11037060_0007_L010000N_D04.tar.gz from 8 files, 1756725643 bytes tar 8 files, 1756733440 bytes (7797) rate 7 MB/sec ... 
ln -sf mcimport.20080730 mcimport # was mcimort.20080729 RDIRS='700 701 702 703 704 705' for DIR in ${RDIRS}; do ./mcimport -t ${TOP}/${DIR} done OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010000N/near/700/mcimport.log Wed Jul 30 18:26:49 CDT 2008 Thu Jul 31 12:49:21 CDT 2008 ######## # FARM # ######## MINOS26 > ./pnfsdirs fmock cedar_phy_bhcurv daikon_05 L010185N write Wed Jul 30 14:59:36 CDT 2008 ls -l /pnfs/minos/mcout_data/cedar_phy_bhcurv/fmock/daikon_05/L010185N/cand_data/000 minospro e875 298500705 Jul 29 14:42 F21930001_0000_L010185N_D05.cand.cedar_phy_bhcurv.0.root minospro e875 308399928 Jul 29 14:42 F21930001_0001_L010185N_D05.cand.cedar_phy_bhcurv.0.root minospro e875 305796428 Jul 29 17:05 F21930001_0002_L010185N_D05.cand.cedar_phy_bhcurv.0.root minospro e875 311444452 Jul 29 17:17 F21930001_0003_L010185N_D05.cand.cedar_phy_bhcurv.0.root ######## # FARM # ######## Date: Tue, 29 Jul 2008 17:04:05 -0500 (CDT) From: Matthew Strait mrnts for these two subruns are now in /minos/data/minfarm/farmtest_strait/mcnearcat/ ---------------------------------------------------- Someone moved these, as well as sntp's to /m/d/mf/mcnearcat yesterday, and the looper script picked them up. Moving the duplicate sntp's out of the way. Need to correct the DUP detection in roundup, it is clearly broken. SRV1> ls -l /minos/data/minfarm/mcnearcat/*bhhi* -rw-rw-r-- 1 minospro numi 65551338 Jul 29 16:39 /minos/data/minfarm/mcnearcat/n13037022_0007_L010185N_D04.sntp.cedar_phy_bhhi.root -rw-rw-r-- 1 minospro numi 66227908 Jul 29 16:44 /minos/data/minfarm/mcnearcat/n13037022_0010_L010185N_D04.sntp.cedar_phy_bhhi.root SRV1> mv /minos/data/minfarm/mcnearcat/*bhhi* /minos/data/minfarm/DUP/ ######## # FARM # ######## Date: Tue, 29 Jul 2008 17:08:53 -0500 From: Howard Rubin This was another split month. The 2 are running now. ------------------------------------------------------------------ SRV1> ./roundup -r cedar_phy_bhcurv far Wed Jul 30 09:38:08 CDT 2008 SRV1> ./roundup -w -r cedar_phy_bhcurv far Wed Jul 30 09:43:00 CDT 2008 PURGING WRITE files 4 SRV1> cat ../ROUNTMP/READ/SAM/F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root F00037962_0001.spill.mrnt.cedar_phy_bhcurv.1.root F00037962_0002.spill.mrnt.cedar_phy_bhcurv.0.root F00037962_0003.spill.mrnt.cedar_phy_bhcurv.0.root ... F00037962_0023.spill.mrnt.cedar_phy_bhcurv.0.root So now we have F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root which is in PNFS and eclared to SAM, with first two subruns blinded. Proposed to just remove all these files, and forget this run for mrnt. Per rubin's approved, did rm /pnfs/minos/reco_far/cedar_phy_bhcurv/mrnt_data/2007-04/F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root sam undeclare file F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root ls /minos/data/minfarm/farcat/F00037962* /minos/data/minfarm/farcat/F00037962_0000.spill.bmnt.cedar_phy_bhcurv.1.root /minos/data/minfarm/farcat/F00037962_0001.spill.bmnt.cedar_phy_bhcurv.1.root rm /minos/data/minfarm/farcat/F00037962* rm /minos/data/reco_far/cedar_phy_bhcurv/mrnt_data/2007-04/F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root ============================================================================= 2008 07 29 ============================================================================= ######## # FARM # ######## Date: Tue, 29 Jul 2008 11:33:32 -0500 From: Howard Rubin The farcat cedar_phy_bhcurv cleanup has finished. There is a possible complication. 
I believe that at some point you may have renamed bmnt files to mrnt, getting rid of the original mrnt. The cleanup has produced both mrnt and bmnt, but because the cleanup didn't include all subruns, the newly produced bmnt set isn't complete. If you did rename, then all you have to do is rename the new bmnt to mrnt and run the concatenator. Otherwise I have to repeat the cleanup to produce the missing bmnt. I think it would also avoid confusion if the pass was renamed to 0. ------------------------------------------------------------- Reviewed files for the partial runs, for RUN in F00032654 F00036592 F00037962 F00037965 ; do echo ls -alF /minos/data/minfarm/farcat/${RUN}*spill.mrnt.cedar_phy_bhcurv.* done cd /minos/data/minfarm/farcat mv *spill.mrnt.cedar_phy_bhcurv.1.root /minos/data/minfarm/BAD/ RUNS=`ls *spill.bmnt.cedar_phy_bhcurv.1.root | cut -f 1 -d .` for RUN in ${RUNS} ; do ls ${RUN}.spill.bmnt.cedar_phy_bhcurv.1.root mv ${RUN}.spill.bmnt.cedar_phy_bhcurv.1.root \ ${RUN}.spill.mrnt.cedar_phy_bhcurv.0.root done Not quite all there, PEND - have 22/24 subruns for F00037962_*.spill.mrnt.cedar_phy_bhcurv.0.root 0 07/28 22:56 0 22 MISS 0000 0001 ############ # MCIMPORT # ############ mcimport.20080729 For better speed in -t mode, add a new path to local disk for the concatenated tar files. This had been TAPAT, instead use ${LOCAL} as used by -T ( TAPER ) ############ # MCIMPORT # ############ tarring up (-t) smaller files now TOP=daikon_04/L010185N_charm/near # 10MB, 120-260 MB MCI3 > echo $RDIRS 700 701 702 703 999 700 269/269 TOTAL FILES 701 296/296 TOTAL FILES 702 296/296 TOTAL FILES 703 30/30 TOTAL FILES 999 14/14 TOTAL FILES 14/14 TOTAL FILES Test with just DIR=703, as this code is stale ./mcimport -t ${TOP}/${DIR} OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N_charm/near/703/mcimport.log Tue Jul 29 11:01:50 CDT 2008 OK - writing 3 tarfiles Tue Jul 29 11:17:05 CDT 2008 Disk rate = 5.55 MB/sec. Exit status = 0. OK - purging 3 files ? Tue Jul 29 11:32:56 CDT 2008 PURGED n14037030_0000_L010185N_D04_charm-n14037030_0011_L010185N_D04_charm.tar PURGED n14037030_0012_L010185N_D04_charm-n14037030_0023_L010185N_D04_charm.tar PURGED n14037030_0024_L010185N_D04_charm-n14037030_0029_L010185N_D04_charm.tar Tue Jul 29 11:32:57 CDT 2008 Data rates are pretty lousy for these direct /m/d to enstore transfers, under 6 MB/sec. Upgraded to mcimport.20080729, for speed. Test on another smaller directory DIR=999 ./mcimport.20080729 -t ${TOP}/${DIR} OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N_charm/near/999/mcimport.log Tue Jul 29 14:03:44 CDT 2008 OOPS, ran the old version of the script ( failed to flush editor ) Cleaned out /minos/data/mcimport/TAR/daikon_04/L010185N_charm/near/999 ./mcimport.20080729 -t ${TOP}/${DIR} OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N_charm/near/999/mcimport.log Tue Jul 29 14:18:47 CDT 2008 Tar rates look better, 9 MB/sec, versus former 4. Encp rates are much better, (10.7 MB/S overall) (40.6 MB/S transfer) (38.8 MB/S overall) (39.1 MB/S transfer) (39.5 MB/S overall) (39.8 MB/S transfer) (56.7 MB/S overall) (57.2 MB/S transfer) But the purging code failed, the ecrc files are missing ? 
Corrected location of .ecrc files to /home/mindata/TAPE/ MCI3 > ./mcimport.20080729 -t ${TOP}/${DIR} OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N_charm/near/999/mcimport.log Tue Jul 29 14:57:40 CDT 2008 PURGED n14011011_0000_L010185N_D00_charm-n14011011_0006_L010185N_D00_charm.tar PURGED n14011011_0007_L010185N_D00_charm-n14011012_0002_L010185N_D00_charm.tar PURGED n14039991_0000_L010185N_D04_charm-n14039991_0006_L010185N_D04_charm.tar PURGED n14039991_0007_L010185N_D04_charm-n14039992_0002_L010185N_D04_charm.tar Tue Jul 29 14:57:47 CDT 2008 MCI3 > ln -sf mcimport.20080729 mcimport # was mcimport.20080728 RDIRS='700 701 702' for DIR in ${RDIRS}; do ./mcimport -t ${TOP}/${DIR} done ########## # CONDOR # ########## Date: Tue, 29 Jul 2008 08:45:38 -0500 (CDT) Subject: Help Desk Ticket 115222 Has Been Resolved. ___________________________________________________________________ Solution: Since the new glexec from osg 1.0.0 has been installed on the GP Grid cluster and elsewhere they have been able to see the environment variables correctly. ___________________________________________________________________ ============================================================================= 2008 07 28 ============================================================================= ######## # FARM # ######## Date: Mon, 28 Jul 2008 11:55:23 -0500 From: Howard Rubin The following runs in farcat appear to me to be complete, given what's in that directory plus what's in bad_runs. All are spill.mrnt.cedar_phy_bhcurv. F00031874 F00031939 F00032654 F00032997 F00033538 F00033570 F00035947 F00036563 F00037126 F00037752 F00038266 There are 3 additional runs which are not complete: F00036592 F00037962 F00037965 In the first 2 of these there is only 1/23 subrun *present* while the last has 6/23 missing. I've checked the logs for a couple of these 'missing' subruns and they were apparently written (on or about Dec. 20) along with the other ntuples, which were successfully concatenated. I'm going to rerun these to get the mrnt, but first I would like you to run the concatenator to see if what's complete gets put out. Then after I do the rerun I can go in and delete all but the mrnt, avoiding duplicates showing up. I'll also change the pass to 0 to avoid confusion. 
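Before running the concatenator, the subrun bookkeeping can be spot-checked by hand. A minimal sketch (this is not the roundup logic itself; the run, stream, and subrun count are just the F00037962 example from above):
RUN=F00037962 ; STRM=spill.mrnt.cedar_phy_bhcurv ; NSUB=24
for (( SR=0 ; SR<NSUB ; SR++ )) ; do
  SUB=`printf "%04d" ${SR}`
  # report subruns with no matching file in farcat, regardless of pass number
  ls /minos/data/minfarm/farcat/${RUN}_${SUB}.${STRM}.*.root > /dev/null 2>&1 || printf " MISS ${SUB}"
done ; printf "\n"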
SRV1> ./roundup -n -r cedar_phy_bhcurv far Mon Jul 28 12:22:01 CDT 2008 OK - 444 Mbytes in 14 runs PEND - have 7/8 subruns for F00031874_*.spill.mrnt.cedar_phy_bhcurv.0.root 233 12/07 19:32 0 7 MISS 0002 PEND - have 7/8 subruns for F00031939_*.spill.mrnt.cedar_phy_bhcurv.0.root 233 12/07 20:43 0 7 MISS 0001 PEND - have 5/14 subruns for F00032654_*.spill.mrnt.cedar_phy_bhcurv.0.root 219 12/21 16:42 0 5 MISS 0005 0006 0007 0008 0009 0010 0011 0012 0013 PEND - have 23/24 subruns for F00032997_*.spill.mrnt.cedar_phy_bhcurv.0.root 232 12/08 11:40 0 23 MISS 0019 PEND - have 1/2 subruns for F00033538_*.spill.mrnt.cedar_phy_bhcurv.0.root 232 12/08 18:29 0 1 MISS 0000 PEND - have 7/8 subruns for F00033570_*.spill.mrnt.cedar_phy_bhcurv.0.root 232 12/08 19:09 0 7 MISS 0007 PEND - have 23/24 subruns for F00035947_*.spill.mrnt.cedar_phy_bhcurv.0.root 231 12/09 15:22 0 23 MISS 0015 PEND - have 23/24 subruns for F00036563_*.spill.mrnt.cedar_phy_bhcurv.0.root 231 12/10 01:21 0 23 MISS 0015 PEND - have 1/24 subruns for F00036592_*.spill.mrnt.cedar_phy_bhcurv.0.root 219 12/21 16:44 0 1 MISS 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 0010 0012 0013 0014 0015 0016 0017 0018 0019 0020 0021 0022 0023 PEND - have 23/24 subruns for F00037126_*.spill.mrnt.cedar_phy_bhcurv.0.root 230 12/10 18:22 0 23 MISS 0022 SUPPRESS F00037752_0024.spill.mrnt.cedar_phy_bhcurv.0.root PEND - have 23/24 subruns for F00037752_*.spill.mrnt.cedar_phy_bhcurv.0.root 230 12/11 05:29 0 23 MISS 0012 PEND - have 1/24 subruns for F00037962_*.spill.mrnt.cedar_phy_bhcurv.0.root 228 12/12 15:00 0 1 MISS 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 0010 0011 0012 0013 0014 0015 0016 0017 0018 0019 0020 0021 0022 SUPPRESS F00037965_0024.spill.mrnt.cedar_phy_bhcurv.0.root PEND - have 18/24 subruns for F00037965_*.spill.mrnt.cedar_phy_bhcurv.0.root 228 12/12 14:59 0 18 MISS 0000 0002 0003 0005 0006 0007 PEND - have 23/24 subruns for F00038266_*.spill.mrnt.cedar_phy_bhcurv.0.root 228 12/12 20:52 0 23 MISS 0011 Date: Mon, 28 Jul 2008 13:01:52 -0500 From: Howard Rubin All inserted 'data' lines are from bad_runs.cedar_phy_bhcurv. You appear to not be using this for mrnt. If there's no sntp (or bntp) there's no mrnt. -------------------------------------------------- The roundup script was changed to get bad runs from bad_runs_mrcc.${REL} in April of 2007. It has been that way since then. But I see no *mrcc* files in /minos/data/minfarm/lists . Apparently we have never had an mrcc-only failure ? So I will change roundup to go back to using bad_runs.${REL} for mrnt data, for the present. SRV1> AFSS/roundup.20080728 -r cedar_phy_bhcurv far ########### # ROUNDUP # ########### SRV1> cp AFSS/roundup.20080728 . 
SRV1> ln -sf roundup.20080728 roundup # was roundup.20080722 Dropped bad_runs_mrcc , was specific to original mrcc tests ########## # CONDOR # ########## MINOS25 > condor_q -hold -- Submitter: minos25.fnal.gov : <131.225.193.25:64545> : minos25.fnal.gov ID OWNER HELD_SINCE HOLD_REASON 168302.2 gfactory 7/24 10:21 Globus error 17: the job failed when the jo 168302.4 gfactory 7/24 10:21 Globus error 43: the job manager failed to 168348.1 gfactory 7/24 13:57 Globus error 17: the job failed when the jo 168348.2 gfactory 7/24 13:57 Globus error 43: the job manager failed to 168411.3 gfactory 7/24 20:31 Globus error 43: the job manager failed to 168458.0 gfactory 7/25 01:51 Globus error 17: the job failed when the jo 168458.3 gfactory 7/25 01:51 Globus error 43: the job manager failed to 168652.0 gfactory 7/25 15:18 Globus error 43: the job manager failed to 168652.1 gfactory 7/25 15:18 Globus error 17: the job failed when the jo 168652.3 gfactory 7/25 15:18 Globus error 17: the job failed when the jo 168652.4 gfactory 7/25 15:18 Globus error 17: the job failed when the jo 169073.0 gfactory 7/27 03:52 Globus error 17: the job failed when the jo 169073.2 gfactory 7/27 03:52 Globus error 43: the job manager failed to 169109.0 gfactory 7/27 07:52 Globus error 17: the job failed when the jo 169109.3 gfactory 7/27 07:52 Globus error 43: the job manager failed to 169132.3 gfactory 7/27 10:11 Globus error 43: the job manager failed to 169148.1 gfactory 7/27 11:51 Globus error 17: the job failed when the jo 169148.2 gfactory 7/27 11:51 Globus error 17: the job failed when the jo 169148.3 gfactory 7/27 11:51 Globus error 43: the job manager failed to 169148.4 gfactory 7/27 11:51 Globus error 43: the job manager failed to 169176.1 gfactory 7/27 14:42 Globus error 17: the job failed when the jo 169176.3 gfactory 7/27 14:42 Globus error 43: the job manager failed to 169264.4 gfactory 7/28 00:30 Globus error 43: the job manager failed to 169407.4 gfactory 7/28 08:30 Globus error 43: the job manager failed to ... 
169471.0 gfactory 7/28 14:34 Globus error 43: the job manager failed to
169685.2 gfactory 7/29 07:10 Globus error 43: the job manager failed to
170368.0 gfactory 7/31 14:35 Globus error 43: the job manager failed to
170368.1 gfactory 7/31 14:19 Globus error 43: the job manager failed to
170368.2 gfactory 7/31 14:24 Globus error 43: the job manager failed to
170368.4 gfactory 7/31 14:14 Globus error 43: the job manager failed to
170368.9 gfactory 7/31 14:29 Globus error 43: the job manager failed to
MINOS25 > condor_q -l 169176.1
UserLog = "/home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/log/condor_activity_20080727_gpminos@t20_glexec@minos@my2.log"
GridResource = "gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor"
GlobalJobId = "minos25.fnal.gov#1217187710#169176.1"
EnteredCurrentStatus = 1217187727
HoldReason = "Globus error 17: the job failed when the job manager attempted to run it"
MINOS25 > condor_q -l 169176.3
UserLog = "/home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/log/condor_activity_20080727_gpminos@t20_glexec@minos@my2.log"
GridResource = "gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor"
GlobalJobId = "minos25.fnal.gov#1217187710#169176.3"
EnteredCurrentStatus = 1217187727
HoldReason = "Globus error 43: the job manager failed to stage the executable"
Date: Mon, 28 Jul 2008 11:46:48 -0500 (CDT)
Subject: HelpDesk ticket 119292
___________________________________________
Short Description: Minos glideinWMS pilot jobs seeing low level of grid errors on GPFARM
Problem Description: Recently, we have started seeing a few Minos glideinWMS jobs being held by Condor on our end, with messages like :
MINOS25 > condor_q -l 169176.1
UserLog ="/home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/log/condor_ activity_20080727_gpminos@t20_glexec@minos@my2.log"
EnteredCurrentStatus = 1217187727
HoldReason = "Globus error 17: the job failed when the job manager attempted to run it"
and
MINOS25 > condor_q -l 169176.3
GlobalJobId = "minos25.fnal.gov#1217187710#169176.3"
EnteredCurrentStatus = 1217187727
HoldReason = "Globus error 43: the job manager failed to stage the executable"
So far these are not doing a great deal of harm, our jobs run on other pilots. But they are cluttering up the local queues, and may indicate a grid problem. For more details, see the 2008 07 28 CONDOR entry in http://www-numi.fnal.gov/minwork/computing/dh/worklog.txt
The pilot jobs are submitted in batches of 5, most of which are OK. For example, here are 2 of a batch of 5 in the UserLog, including the .001. process which got held :
000 (169176.000.000) 07/27 14:41:50 Job submitted from host: <131.225.193.25:64545> ..
000 (169176.001.000) 07/27 14:41:50 Job submitted from host: <131.225.193.25:64545>
017 (169176.001.000) 07/27 14:42:06 Job submitted to Globus RM-Contact: fnpcfg1.fnal.gov:2119/jobmanager-condor JM-Contact: https://fnpcfg1.fnal.gov:40013/19561/1217187720/ Can-Restart-JM: 1 ..
027 (169176.001.000) 07/27 14:42:06 Job submitted to grid resource GridResource: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor GridJobId: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor https://fnpcfg1.fnal.gov:40013/19561/1217187720/ ..
017 (169176.000.000) 07/27 14:42:06 Job submitted to Globus RM-Contact: fnpcfg1.fnal.gov:2119/jobmanager-condor JM-Contact: https://fnpcfg1.fnal.gov:40019/19682/1217187720/ Can-Restart-JM: 1 ..
027 (169176.000.000) 07/27 14:42:06 Job submitted to grid resource GridResource: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor GridJobId: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor https://fnpcfg1.fnal.gov:40019/19682/1217187720/ 012 (169176.001.000) 07/27 14:42:07 Job was held. Globus error 17: the job failed when the job manager attempted to run it Code 2 Subcode 17 005 (169176.000.000) 07/27 15:07:15 Job terminated. (1) Normal termination (return value 0) ___________________________________________ Note To Requester: timm@fnal.gov sent this Notes To Requester: These errors have been seen by a variety of users on fermigridosg1, fnpcosg1, and fnpcfg1. We always had a low level of them but they appear to have increased in frequency since the OSG 1.0 upgrade. We have an open ticket with condor_support on this already and as of this morning we received a debug version of one of the key condor executables which we have deployed, in hopes of figuring out what is causing this error. At the moment it does not seem to be related to any bluearc problems at all. In the case of the minos glidein jobs, do you have TRANSFER_EXECUTABLE set to TRUE or FALSE? Steve Timm ___________________________________________ This seems to be set to True, as has been the case since around April 23 when we moved to glexec. MINOS25 > cat /home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/job.condor # File: job.condor # Universe = grid Grid_Resource = gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor globus_rsl = (condorsubmit=(universe vanilla)(requirements \"(ISMINOSAFS=?=True)\")) Executable = glidein_startup.sh Arguments = -v $ENV(GLIDEIN_VERBOSITY) -cluster $(Cluster) -name t20_glexec -entry gpminos -subcluster $(Process) -schedd $ENV(GLIDEIN_SCHEDD) -factory minos -web http://www-numi.fnal.gov/gfactory/stage/glidein_t20_glexec -sign 3962a6fa3b08256b9424992ea9f4b871028d589f -signentry 1e30f8c685e345d5be3a3c5f5b510515f0ed2d18 -signtype sha1 -descript description.87eflp.cfg -descriptentry description.87eflp.cfg -dir Condor -param_GLIDEIN_Client $ENV(GLIDEIN_CLIENT) $ENV(GLIDEIN_PARAMS) +GlideinFactory = "minos" +GlideinName = "t20_glexec" +GlideinEntryName = "gpminos" +GlideinClient = "$ENV(GLIDEIN_CLIENT)" +GlideinWebBase = "http://www-numi.fnal.gov/gfactory/stage/glidein_t20_glexec" +GlideinLogNr = "$ENV(GLIDEIN_LOGNR)" +GlideinWorkDir = "Condor" Transfer_Executable = True transfer_Input_files = transfer_Output_files = WhenToTransferOutput = ON_EXIT Notification = Never +Owner = undefined Log = entry_gpminos/log/condor_activity_$ENV(GLIDEIN_LOGNR)_$ENV(GLIDEIN_CLIENT).log Output = entry_gpminos/log/job.$(Cluster).$(Process).out Error = entry_gpminos/log/job.$(Cluster).$(Process).err stream_output = False stream_error = False Queue $ENV(GLIDEIN_COUNT) ___________________________________________ Date: Fri, 15 Aug 2008 09:02:54 -0500 (CDT) Note To Requester: We captured extra debug output from one of the minos glidein jobs yesterday that held with error 17 and have sent it to the Condor team and the OSG Troubleshooting team. Steve Timm ___________________________________________ Date: Thu, 02 Oct 2008 13:09:15 -0500 (CDT) Note To Requester: We have recently received a patch to the gahp_server binary of condor which has shown great promise thus far in reducing and eliminating these errors on fnpcsrv1 and fg1x1. We want to run with the patch on fnpcsrv1 and fg1x1 first for another week or so, until minos production solves its current difficulties and is able to ramp back up. 
It would then be possible for you to install the same patch on minos25 and it it should address the difficulties there as well. Steve Timm ___________________________________________ Date: Mon, 06 Oct 2008 10:16:48 -0500 (CDT) Note To Requester: We are in possession of a new debug/patched gahp_server executable. Since this has been installed on fg1x1 and fnpcsrv1 we have not seen any repeats of the globus errors 17 and 43 there. I suggest that it get installed on minos25 as well. Please contact me. Steve Timm ___________________________________________ Date: Tue, 11 Nov 2008 10:24:18 -0600 (CST) Solution: Per E-mail from MINOS they have not seen this error since they upgraded the condor within their glideins to condor 7.1.3. We will close this ticket for now but keep an eye on the larger problem in FermiGrid, which is not going to upgrade to condor 7.1.x for a couple of months yet. Steve Timm ######## # FARM # ######## SRV1> less cedar_phy_bhlomcnear.log Finished last purge Sun Jul 27 07:21:39 CDT 2008 SRV1> less cedar_phy_bhhimcnear.log Finished last purge Sun Jul 27 04:20:58 CDT 2008 PEND - have 28/30 subruns for n13037022_*_L010185N_D04.mrnt.cedar_phy_bhhi.root 15 07/11 18:13 0 28 MISS 0007 0010 Informed rubin via email ############ # MCIMPORT # ############ MCI3 > ln -sf mcimport.20080728 mcimport Put ECRC message inline, reduced encp verbose from 4 to 1 TOP=daikon_04/L150200N/near # 10MB, 480-540 MB Mon Jul 28 08:59:51 CDT 2008 Tue Jul 29 05:30:09 CDT 2008 ########### # ENSTORE # ########### Per email from kordosky, nwest Date: Mon, 28 Jul 2008 09:41:38 -0500 (CDT) Subject: HelpDesk ticket 119275 ___________________________________________ Short Description: Some web pages not available offsite Problem Description: Since the July 24 upgrades, several web pages are not visible to clients outside fnal.gov. Not all web pages are affected. Available pages include http://www-stken.fnal.gov/enstore/enstore_system.html and all links directly under this page, except the three listed here. Blocked pages include these links under the home page : Quota and Usage http://www-stken.fnal.gov/enstore/tape_inventory/VOLUME_QUOTAS Tape Inventory Summary http://www-stken.fnal.gov/cgi-bin/enstore_show_inv_summary_cgi.py Tape Inventory http://www-stken.fnal.gov/cgi-bin/enstore_show_inventory_cgi.py The same problem seems to exist under www-cdfen and www-d0en We did not see this problem in STKEN before the July 24 upgrade. ___________________________________________ Date: Mon, 28 Jul 2008 12:20:28 -0500 (CDT) kschu Note To Requester: I believe this is because the computer security webserver exemptions are tied to old hostnames that are no longer in use. I will request exemptions for the new hostnames, and this will be resolved as soon as possible. Thanks for letting us know. ___________________________________________ Date: Mon, 28 Jul 2008 12:20:29 -0500 (CDT) This ticket has been reassigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA Group. ___________________________________________ Tested this later 28 July, CFL is available again, not the cgi's Tested via ssh -1 cdfsoft@uchicago.edu ; mozilla --local ___________________________________________ Date: Thu, 31 Jul 2008 14:25:45 -0500 (CDT) Note To Requester: Exemption requests have been submitted. Can you please see whether the specified pages can be viewed off-site now? 
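( For reference, an off-site check like this could be scripted rather than done by hand in a browser. This is only a sketch, not the test actually used; it assumes curl is available on the remote host, e.g. the cdf.uchicago.edu login node used earlier. )
for URL in \
  http://www-stken.fnal.gov/enstore/tape_inventory/VOLUME_QUOTAS \
  http://www-stken.fnal.gov/cgi-bin/enstore_show_inv_summary_cgi.py \
  http://www-stken.fnal.gov/cgi-bin/enstore_show_inventory_cgi.py
do
  printf "%s " ${URL}
  curl -s -o /dev/null -w "%{http_code}\n" ${URL}   # 200 = visible, 403 = still blocked
done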
___________________________________________ The first page is available again, > Quota and Usage > http://www-stken.fnal.gov/enstore/tape_inventory/VOLUME_QUOTAS This is the area for which we had an immediate need. ( This area is still blocked for www-cdfen and www-d0en, but this is not a problem for Minos. ) > Tape Inventory Summary > http://www-stken.fnal.gov/cgi-bin/enstore_show_inv_summary_cgi.py > Tape Inventory > http://www-stken.fnal.gov/cgi-bin/enstore_show_inventory_cgi.py The latter two URL's still seem to be unavailable, at least at cdf.uchicago.edu , with messages as follows : Forbidden You don't have permission to access /cgi-bin/enstore_show_inv_summary_cgi.py on this server. Apache Server at www-stken.fnal.gov Port 80 ___________________________________________ Date: Fri, 29 Aug 2008 10:37:47 -0500 (CDT) Solution: I believe this problem to be resolved. Some pages were never intended to be viewed off-site, as part of policy. Main system pages are now available off-site after requesting web server exemptions from CST. This ticket was resolved by MESSER, TIM of the CD-SF/DMS/DSC/SSA group. ___________________________________________ ============================================================================= 2008 07 25 ============================================================================= ######### # ADMIN # ######### Ticket #: 119229 MINOS01 > cmd add_minos_user paschrei ########## # CONDOR # ########## Several loiacono jobs got held, trying to write loiacono.proxy to afs : MINOS25 > condor_history -l 168583.20 cat /minos/scratch/loiacono/condor_minosoft_output/log.168583.20 000 (168583.020.000) 07/25 10:46:58 Job submitted from host: <131.225.193.25:64545> ... 001 (168583.020.000) 07/25 12:07:49 Job executing on host: <131.225.166.130:61062> ... 006 (168583.020.000) 07/25 12:12:57 Image size of job updated: 124608 ... 006 (168583.020.000) 07/25 12:42:57 Image size of job updated: 125824 ... 006 (168583.020.000) 07/25 13:32:57 Image size of job updated: 189636 ... 007 (168583.020.000) 07/25 13:37:03 Shadow exception! Error from starter on vm2@26016@fnpc342.fnal.gov: STARTER at 131.225.166.130 failed to send file(s) to <131.225.193.25:65305>; SHADOW at 131.225.193.25 failed to write to file /afs/fnal.gov/files/home/room3/loiacono/work/minossoft/BeamDataPro/condor/loiacono.proxy: (errno 13) Permission denied 373213 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 012 (168583.020.000) 07/25 13:37:03 Job was held. Error from starter on vm2@26016@fnpc342.fnal.gov: STARTER at 131.225.166.130 failed to send file(s) to <131.225.193.25:65305>; SHADOW at 131.225.193.25 failed to write to file /afs/fnal.gov/files/home/room3/loiacono/work/minossoft/BeamDataPro/condor/loiacono.proxy: (errno 13) Permission denied Code 12 Subcode 13 ... 013 (168583.020.000) 07/25 14:13:12 Job was released. via condor_release (by user loiacono) ... 001 (168583.020.000) 07/25 14:14:04 Job executing on host: <131.225.166.122:65323> ... 006 (168583.020.000) 07/25 14:19:12 Image size of job updated: 190260 ... 005 (168583.020.000) 07/25 14:19:33 Job terminated. (1) Normal termination (return value 0) Usr 0 00:04:45, Sys 0 00:00:01 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:04:45, Sys 0 00:00:01 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 366975 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 740188 - Total Bytes Sent By Job 0 - Total Bytes Received By Job ... 
Strange, there are many loon processes using almost no CPU on fnpc342 Same thing on fnpc343, where the above job ran quickly : bash-3.00$ ps axf | grep loon 8874 pts/0 S+ 0:00 \_ grep loon 24475 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2007,01,11,2007,01,11) 8341 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2006,12,27,2006,12,27) 23010 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2007,01,06,2007,01,06) 5511 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2006,12,21,2006,12,21) 4389 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2006,12,20,2006,12,20) 7771 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2006,12,26,2006,12,26) 30748 ? SN 0:01 \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2006,12,12,2006,12,12) These loon processes are claimed not to be reading from DCache, ######## # FARM # ######## Changed 'looper' script to take command options via parmeter Will do ./looper '-r cedar_phy_bhhi mcnear' & ./looper '-r cedar_phy_bhlo mcnear' & Fired these up round 17:45 ######### # ADMIN # ######### Suggested contacts for surplus equipment tape Gene Oleynik ( head of Data Storage and Dacheing in DMS in SF. CPU Bob Tschirhart ########## # CONDOR # ########## Testing new glideme.run, would like to avoid setting REMOTE_INITIALDIR, MINOS25 > condor_submit probeme.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 168576. MINOS25 > condor_submit glideme.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 168579. Removed REMOTE_INITIALDIR, fully specified logs MINOS25 > condor_submit glideme.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 168608. MINOS25 > grep PWD logs/glide/glide.168608.0.out PWD /local/stage1/condor/execute/dir_3962/glide_Ig4000/tmp/starter-tmp-dir-zolpWp/execute/dir_4929 Removed full spec from logs, probably uses Iwd. MINOS25 > condor_submit glideme.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 168611. MINOS25 > grep PWD logs/glide/glide.168611.0.out PWD /local/stage1/condor/execute/dir_3962/glide_Ig4000/tmp/starter-tmp-dir-rJq7IP/execute/dir_5675 Hacked glideme.run to transfer a file, run probefile. MINOS25 > condor_submit glideme.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 168613. MINOS25 > less logs/glidefile/glide.168613.0.out ########## # FILE # ########## Looking at the file ------------------- THIS IS A TEST FILE. I AM HERE ! ------------------- RUN FINISHED Fri Jul 25 11:43:37 CDT 2008 ########## # CONDOR # ########## MINOS25 > condor_q -hold -- Submitter: minos25.fnal.gov : <131.225.193.25:64545> : minos25.fnal.gov ID OWNER HELD_SINCE HOLD_REASON 168302.2 gfactory 7/24 10:21 Globus error 17: the job failed when the jo 168302.4 gfactory 7/24 10:21 Globus error 43: the job manager failed to 168348.1 gfactory 7/24 13:57 Globus error 17: the job failed when the jo 168348.2 gfactory 7/24 13:57 Globus error 43: the job manager failed to 168411.3 gfactory 7/24 20:31 Globus error 43: the job manager failed to 168458.0 gfactory 7/25 01:51 Globus error 17: the job failed when the jo 168458.3 gfactory 7/25 01:51 Globus error 43: the job manager failed to ########## # PARROT # ########## Found 100 small files for parroting, same place as our reference small file, /pnfs/minos/fardet_data/2005-04/F000310* Sizes range 40 to 54 KB. 
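For the record, the size range could be confirmed with something like this sketch ( not the command run at the time ; assumes plain ls sizes are adequate for these small raw files on the pnfs mount ) :
ls -l /pnfs/minos/fardet_data/2005-04/F000310*.mdaq.root | awk '{ print $5 }' | sort -n | \
  awk 'NR == 1 { MIN = $1 } { MAX = $1 ; N++ }
       END { printf "%d files , %.0f KB to %.0f KB\n", N, MIN/1024, MAX/1024 }'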
Should make a dataset st-100small, Runs 3100 through 3199 SAMDIM=' DATA_TIER raw-far and RUN_NUMBER >= 3100 and RUN_NUMBER <= 3199 ' sam list files --dim="${SAMDIM}" ST100=`sam list files --dim="${SAMDIM}" --nosummary | sort` MINOS26 > for FILE in ${ST100} ; do printf "${FILE} " ; ./dc_stat ${FILE} | grep stkendca ; done F00003100_0000.mdaq.root w-stkendca7a-1 F00003101_0000.mdaq.root w-stkendca9a-3 F00003102_0000.mdaq.root w-stkendca9a-3 ... F00003197_0000.mdaq.root w-stkendca11a-3 F00003198_0000.mdaq.root w-stkendca7a-1 r-stkendca4a-1 F00003199_0000.mdaq.root w-stkendca9a-3 sam create definition \ --definitionName='st-censmall' \ --dimensions="${SAMDIM}" \ --group='minos' DatasetDefinition saved with definitionId = 5201 sam list files --dim="__set__ st-censmall" 2008 08 01 OOPS, this was wrong, file selection should have been SAMDIM=' DATA_TIER raw-far and RUN_NUMBER >= 31000 and RUN_NUMBER <= 31099 ' sam list files --dim="${SAMDIM}" File Count: 100 Average File Size: 48.90KB Total File Size: 4.78MB Total Event Count: 690 sam delete dataset definition --definitionName st-censmall sam create definition \ --definitionName='st-censmall' \ --dimensions="${SAMDIM}" \ --group='minos' DatasetDefinition saved with definitionId = 5215 ############ # MCIMPORT # ############ TOP=daikon_04/L150200N/near # 10MB, 480-540 MB Fri Jul 25 10:20:10 CDT 2008 Fri Jul 25 19:20:19 CDT 2008 ######## # FARM # ######## ./roundup -r cedar_phy_bhlo mcfar Fri Jul 25 09:42:15 CDT 2008 PURGED 832/832 Fri Jul 25 09:45:12 CDT 2008 bhhi mcnear is already done ./roundup -r cedar_phy_bhhi mcnear Fri Jul 25 10:08:11 CDT 2008 PURGED 6/6 Fri Jul 25 13:22:34 CDT 2008 ./roundup -r cedar_phy_bhhi mcnear & Fri Jul 25 13:25:18 CDT 2008 PURGING WRITE files 44 ./roundup -r cedar_phy_bhlo mcnear & Fri Jul 25 13:23:32 CDT 2008 ########### # CONDOR # ########## MINOS25 > condor_q kreymer 166375.0 kreymer 7/17 08:40 0+00:00:00 X 0 0.0 probe MINOS25 > condor_rm 166375.0 Job 166375.0 already marked for removal MINOS25 > condor_rm -force 166375.0 Job 166375.0 removed locally (remote state unknown) MINOS25 > dds -tr logs/glideafs/*out | tail -rw-r--r-- 1 kreymer g020 3844 Jul 25 08:40 logs/glideafs/probe.168524.0.out -rw-r--r-- 1 kreymer g020 3675 Jul 25 08:53 logs/glideafs/probe.168525.0.out -rw-r--r-- 1 kreymer g020 3897 Jul 25 09:00 logs/glideafs/probe.168528.0.out -rw-r--r-- 1 kreymer g020 6317 Jul 25 09:10 logs/glideafs/probe.168543.0.out -rw-r--r-- 1 kreymer g020 6637 Jul 25 09:23 logs/glideafs/probe.168565.0.out Difference is due to number of jobs running on each node. loiacono has just submitted a slug o jobs. ############# # CHECKLIST # ############# FTPLOG 5 Fri Jul 25 02:10:48 CDT 2008 557 3601 Fri Jul 25 03:20:49 CDT 2008 1 5 Fri Jul 25 03:30:54 CDT 2008 557 NOACCESS missing DATA missing ########## # DCACHE # ########## http://www-numi.fnal.gov/computing/dh/ftplog/2008/07/25.txt 5 Fri Jul 25 02:10:48 CDT 2008 557 3601 Fri Jul 25 03:20:49 CDT 2008 1 5 Fri Jul 25 03:30:54 CDT 2008 557 Date: Fri, 25 Jul 2008 09:10:28 -0500 (CDT) Subject: HelpDesk ticket 119196 ___________________________________________ Short Description: FNDCA - glitch in weak ftp access around 2008/07/25 02:20 Problem Description: I test the availability of weak ftp access every 10 minutes, from minos26. This is done by listing a small directory, /pnfs/minos/beam_data/2004-12 . The listing at 02:20 this morning failed after 1 hour. These numbers are the elapsed time, time stamps, and size of the listings. 
5 Fri Jul 25 02:10:48 CDT 2008 557 3601 Fri Jul 25 03:20:49 CDT 2008 1 5 Fri Jul 25 03:30:54 CDT 2008 557 This does not cause me a problem, but may be a symptom of some deeper problem. minos26 also tests connectivity to bluearc every minute, no problem seen there. ___________________________________________ Date: Fri, 25 Jul 2008 16:29:42 -0500 (CDT) Vladimir fixed the problem. ########## # DCACHE # ########## BILLING http://fndca3a.fnal.gov/dcache/billing.html looks very empty http://fndca2a.fnal.gov:8090/dcache/outplot?lvl=0&filename=billing.week.brd.png no response OLD PLOTS http://fndca2a.fnal.gov:8090/dcache/lsplots no response fron fndca2a NEW PLOTS http://fndca2a.fnal.gov:9090/lps/plots/src/plots.lzx cannot establish connection http://fndca3a.fnal.gov/cgi-bin/dcache_files.py At around 08:50, the latest transfers listed are at about 07:32:40 by minospro Date: Fri, 25 Jul 2008 09:02:29 -0500 (CDT) Subject: HelpDesk ticket 119192 ___________________________________________ Short Description: Billing and other web monitoring data is missing from FNDCA since the shutdown Problem Description: dcache-admin : Since yesterday's shutdown, several web pages are missing or abnormal : BILLING http://fndca3a.fnal.gov/dcache/billing.html looks very empty http://fndca2a.fnal.gov:8090/dcache/outplot?lvl=0&filename=billing.week.brd png no response OLD PLOTS http://fndca2a.fnal.gov:8090/dcache/lsplots no response fron fndca2a NEW PLOTS http://fndca2a.fnal.gov:9090/lps/plots/src/plots.lzx cannot establish connection http://fndca3a.fnal.gov/cgi-bin/dcache_files.py At around 08:50, the latest transfers listed are at about 07:32:40 by minospro ___________________________________________ Date: Fri, 25 Jul 2008 16:27:27 -0500 the developer says these are fixed now....please click away... ########### # ENSTORE # ########### http://www-stken.fnal.gov/enstore/tape_inventory/NOACCESS The requested URL /enstore/tape_inventory/NOACCESS was not found on this server. Date: Fri, 25 Jul 2008 09:02:06 -0500 (CDT) Subject: HelpDesk ticket 119190 ___________________________________________ Short Description: NOACCESS list is missing Problem Description: enstore-admin : ince yesterday's SDE upgrade of STKEN, the list of NOACCESS tapes is missing from http://www-stken.fnal.gov/enstore/tape_inventory/NOACCESS The requested URL /enstore/tape_inventory/NOACCESS was not found on this server. ___________________________________________ Date: Fri, 25 Jul 2008 16:27:27 -0500 the developer says these are fixed now....please click away... ___________________________________________ The list actually returned round July 31 ============================================================================= 2008 07 24 ============================================================================= ############ # STARTUP # ############ kreymer@minos26 crontab crontab.dat mindata@minos26 crontab crontab.dat minfarm@fnpcsrv1 ############ # SHUTDOWN # ############ DOWNTIMES Thursday 24 July 06:00 - 06:20 (06:29) BlueArc 06:30 Enstore drain 07:15 - 18:00 (21:17) Enstore and DCache ; FermiGrid reracking ( not down ) 07:30 - 07:45 (07:45) AFS data servers 09:30 - 10:00 (09:55) SAM database MDSUM_LOG The mdsum_log script was still running at 06:00. Restarted it via cron 29891 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 29905 ? S 0:00 \_ /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 25478 ? S 0:00 \_ /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 25479 ? 
S 0:00 \_ du -sm mcimport/TAR 25480 ? S 0:00 \_ tr -d Killed all these pid's. BLUWATCH The STOP file did not get created for bluwatch Reported slow access : fnpcsrv1.txt 11-Jul-2008 19:10 86 minos-sam03.txt 24-Jul-2008 06:18 87 minos01.txt 24-Jul-2008 06:18 87 minos25.txt 19-Jul-2008 17:29 86 minos26.txt 15-Jul-2008 12:20 86 The right way to touch the STOP file would have been echo "/usr/krb5/bin/kcron ; /usr/krb5/bin/aklog ;touch /afs/fnal.gov/files/data/minos/log_data/bluwatch/TEST" | at 07:57 echo " /usr/krb5/bin/kcron /usr/krb5/bin/aklog touch /afs/fnal.gov/files/data/minos/log_data/bluwatch/TEST touch /afs/fnal.gov/files/data/minos/log_data/bluwatch/TEST2 " | at 08 09 AFS was on schedule, 06:30 to 07:45 SAM was on schedule, 09:30 to 09:53 Tested per HOWTO.sam UNI=prd for N in 1 2 3 ; do echo ${N} date ./sam_test_py minos ${UNI} st-onesmall ./sam_test_py minos ${UNI} st-ten ./sam_test_py minos ${UNI} st-cen done ; date DCACHE PNFS logging came up at 18:43 FTP logging started to recover 19:56, looked normal starting at 21:17 From: Ken Schumacher Date: Thu, 24 Jul 2008 18:09:25 -0500 We have encountered a few problems... Date: Thu, 24 Jul 2008 20:15:02 -0500 We have overcome most of our problems... Date: Thu, 24 Jul 2008 21:32:20 -0500 The public dCache services are back on-line. Enstore is ready except Date: Thu, 24 Jul 2008 23:03:46 -0500 We thank you for your patience and we apologize for the extended ... On behalf of the whole team from DMS, Good Night. ########## # PARROT # ########## /grid/app/minos/parrot paloon - script to run loon on a raw data file, under parrot loonar - loon script run by paloon 388 > /grid/app/minos/parrot/paloon ######### # ADMIN # ######### 13:00 Cannot ssh to minos04 or minos12 ssh_exchange_identification: Connection closed by remote host MINOS04 > tail /var/log/messages ... Jul 23 05:43:24 minos04 sshd(pam_unix)[26376]: session opened for user djauty by (uid=0) Jul 24 13:06:30 minos04 login: kreymer preauthenticated login on pts/0 from minos-93198.dhcp MINOS12 > tail /var/log/messages ... Jul 23 07:59:37 minos12 sshd(pam_unix)[22780]: session opened for user jyuko by jyuko(uid=0) Jul 23 17:18:50 minos12 sshd(pam_unix)[9686]: session opened for user kreymer by (uid=0) Jul 23 21:45:11 minos12 sshd: pam_krb5[10080]: authentication fails for 'jyuko' (jyuko@FNAL.GOV): Authentication service cannot retrieve authentication info. (Cannot contact any KDC for requested realm) Jul 24 13:00:15 minos12 kernel: nfs: server stkensrv1 not responding, still trying Jul 24 13:07:34 minos12 login: kreymer preauthenticated login on pts/0 from minos-93198.dhcp 13:10 submitted ticket Date: Thu, 24 Jul 2008 13:12:52 -0500 (CDT) Subject: HelpDesk ticket 119161 ___________________________________________ Short Description: ssh logins fail to minos04 and minos12 Problem Description: run2-sys : I cannot ssh to minos04 or minos12. I can rsh to them, and they look OK on the surface. MIN > date Thu Jul 24 18:10:06 UTC 2008 MIN > ssh -v minos04 OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos04 [131.225.193.4] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 ... ___________________________________________ Date: Thu, 24 Jul 2008 13:18:23 -0500 (CDT) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. 
___________________________________________ Date: Thu, 24 Jul 2008 13:46:08 -0500 (CDT) Resolved: killed off stuck ssh/pam processes and ssh access was restored. ___________________________________________ ___________________________________________ ########### # GNUPLOT # ########### On my desktop and laptop, yum install gnuplot ######## # MAIL # ######## Removed RFC2369 headers from lists for which they are not appropriate, to eliminate the PINE messages minos-users [ Note: This message contains email list management information ] To disable the headers, added to the head of the options list, Misc-Options= NO_RFC2369 Need to get ownership of some other lists minos_sam_admin minosdb-support MINOS-ACCOUNTS ? MINOS-SAM-USERS ? ############ # BLUWATCH # ############ bluwatch.20080724 Added SKIP control, gentler than STOP Added usage comments ln -sf bluwatch.20080724 bluwatch # was bluwatch.20080707 Thu Jul 24 14:05:44 UTC 2008 ########## # CONDOR # ########## Removed ancient stuck probe job, 166375.0 kreymer 7/17 08:40 6+23:31:16 R 0 0.0 probe 166375.0 kreymer 7/17 08:40 6+03:36:35 vm2@12501@fnpc346.fnal.gov MINOS25 > condor_rm 166375.0 Job 166375.0 marked for removal 166375.0 kreymer 7/17 08:40 0+00:00:00 X 0 0.0 probe Let this sit a day, then remove again. Also, let's clean up again the held pilots 167672.1 gfactory 7/22 02:10 0+07:11:56 H 0 0.0 glidein_startup.sh 167693.3 gfactory 7/22 03:09 0+06:46:58 H 0 0.0 glidein_startup.sh MINOS25 > condor_rm 167672.1 Job 167672.1 marked for removal MINOS25 > condor_rm 167693.3 Job 167693.3 marked for removal 167672.1 gfactory 7/22 02:10 0+07:11:56 X 0 0.0 glidein_startup.sh 167693.3 gfactory 7/22 03:09 0+06:46:58 X 0 0.0 glidein_startup.sh A minute later, they were gone. ============================================================================= 2008 07 23 ============================================================================= ############ # SHUTDOWN # ############ Prepared for PNFS/DCache maintenance Jul 24 kreymer@minos26 echo "crontab -r" | at 05:30 job 11 at 2008-07-24 05:30 echo "touch /afs/fnal.gov/files/data/minos/log_data/bluwatch/STOP" | at 05:58 job 14 at 2008-07-24 05:58 mindata@minos26 echo "crontab -r" | at 01:00 job 12 at 2008-07-24 01:00 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 job 15 at 2008-07-24 01:00 ########### # ROUNDUP # ########### Added MISS messages listing pending subruns.
PENDLOG is only written when NOOP is clear SRV1> ln -sf roundup.20080722 roundup # was roundup.20080703 ( did nothing yesterday, due to a typo ) ########## # CONDOR # ########## 165531.3 gfactory 7/15 10:21 0+00:00:00 H 0 0.0 glidein_startup.sh 165861.1 gfactory 7/15 21:03 0+00:00:00 H 0 0.0 glidein_startup.sh 165861.3 gfactory 7/15 21:03 0+00:00:00 H 0 0.0 glidein_startup.sh 166779.1 gfactory 7/18 01:51 0+00:00:00 H 0 0.0 glidein_startup.sh 166779.2 gfactory 7/18 01:51 0+00:00:00 H 0 0.0 glidein_startup.sh 166861.7 gfactory 7/18 11:27 0+00:00:00 H 0 0.0 glidein_startup.sh 167039.1 gfactory 7/19 06:12 0+00:00:00 H 0 0.0 glidein_startup.sh 167039.2 gfactory 7/19 06:12 0+00:00:00 H 0 0.0 glidein_startup.sh 167039.4 gfactory 7/19 06:12 0+00:00:00 H 0 0.0 glidein_startup.sh 167672.1 gfactory 7/22 02:10 0+07:11:56 H 0 0.0 glidein_startup.sh 167693.3 gfactory 7/22 03:09 0+06:46:58 H 0 0.0 glidein_startup.sh 167863.0 gfactory 7/22 14:35 0+00:00:00 H 0 0.0 glidein_startup.sh 167999.0 gfactory 7/23 04:00 0+00:00:00 H 0 0.0 glidein_startup.sh 167999.1 gfactory 7/23 04:00 0+00:00:00 H 0 0.0 glidein_startup.sh MINOS25 > condor_rm 165531.3 MINOS25 > condor_rm 165861.1 MINOS25 > condor_rm 165861.3 JOB=166779.1 ; condor_rm ${JOB} ; sleep 10 ; condor_q ${JOB} JOB=166779.2 ; condor_rm ${JOB} ; sleep 10 ; condor_q ${JOB} for JOB in 166861.7 167039.1 167039.2 167039.4 167672.1 ; do condor_rm ${JOB} ; sleep 10 ; condor_q ${JOB} ; done 167672.1 gfactory 7/22 02:10 0+07:11:56 X 0 0.0 glidein_startup.sh for JOB in 167693.3 167863.0 167999.0 167999.1 ; do condor_rm ${JOB} ; sleep 10 ; condor_q ${JOB} ; done 167693.3 gfactory 7/22 03:09 0+06:46:58 X 0 0.0 glidein_startup.sh So we have two X jobs stuck, since yesterday. Now they are back to H, 167672.1 gfactory 7/22 02:10 0+07:11:56 H 0 0.0 glidein_startup.sh 167693.3 gfactory 7/22 03:09 0+06:46:58 H 0 0.0 glidein_startup.sh ########## # ORACLE # ########## Costs per email Date: Thu, 10 Apr 2008 11:42:44 -0500 From: Maurine Mihalek continuing monthly decision 1 maintenance costs are: MINOSORA1 - $52.13 MINOSORA1-SUN-RAID-ARRAY - $272.27 MINOSORA3 - $43.32 MINOSORA3-SUN-RAID-ARRAY - $82.71 ############ # MCIMPORT # ############ Keep on truckin, per list at 2008 07 17 MCIMPORTARCHIVELIST TOP=daikon_04/L010170N/near RDIRS=`ls /minos/data/mcimport/STAGE/${TOP}` echo $RDIRS ( find /minos/data/mcimport/STAGE/${TOP} -type f -name \*.tar.gz -exec du -sm {} \; | cut -f 1 ) \ > /minos/scratch/mindata/ssize.gpl FLXI04 > printf 'plot "/minos/scratch/mindata/ssize.gpl"' | gnuplot -persist for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport -n -T ${TOP}/${DIR} | grep NFILES \ | grep -v 'NFILES 0' done 700 NFILES 183 for DIR in ${RDIRS}; do ./mcimport -T ${TOP}/${DIR} done OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010170N/near/700/mcimport.log Wed Jul 23 10:52:31 CDT 2008 Wed Jul 23 13:30:51 CDT 2008 Plan is to do the rest in roughly order of total size TOP=daikon_04/L100200N/near # 10MB, 350-410 MB 700 NFILES 278 701 NFILES 31 OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L100200N/near/700/mcimport.log Wed Jul 23 14:49:34 CDT 2008 Wed Jul 23 22:20:15 CDT 2008 TOP=daikon_04/L150200N/near # 10MB, 480-540 MB 700 NFILES 275 701 NFILES 31 OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L150200N/near/700/mcimport.log Fri Jul 25 10:20:10 CDT 2008 Fri Jul 25 19:20:19 CDT 2008 TOP=daikon_04/L010185N_helium/near # 10MB, 350-410 MB 650 NFILES 278 651 NFILES 305 652 NFILES 307 653 NFILES 29 Mon Jul 28 08:59:51 CDT 2008 Tue Jul 29 05:30:09 CDT 2008 
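The dry run and the real pass could be folded together, archiving only directories that still have files to tar. A sketch, not run verbatim here ; TOP is a placeholder, and -t would replace -T for the smaller file sets noted just below :
TOP=daikon_04/L010185N/near                       # placeholder
for DIR in `ls /minos/data/mcimport/STAGE/${TOP}` ; do
  # the dry run reports a line like 'NFILES 183' ; skip directories with nothing left
  NF=`./mcimport -n -T ${TOP}/${DIR} | grep NFILES | awk '{ print $2 }'`
  [ "${NF:-0}" -gt 0 ] && ./mcimport -T ${TOP}/${DIR}
done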
These are a bit too small, should tar them with -t option TOP=daikon_04/L010185N_charm/near # 10MB, 120-260 MB TOP=daikon_04/L010000N/near # 10MB, 190-230 for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport -n -t ${TOP}/${DIR} | grep FILES done for DIR in ${RDIRS}; do ./mcimport -t ${TOP}/${DIR} done ######## # DOWN # ######## Announced downtime schedule to minos-data minos_software_discussion Minos status web page ########## # ORACLE # ########## Date: Wed, 23 Jul 2008 10:17:31 -0500 From: Anil Kumar Checked with Nelly too. MINOS_DEV user on minosdev database can be safely dropped. I am dropping the user now. ######## # FARM # ######## The removal of dogwood0/1 seems to be complete. MINOS26 > du -sm /minos/data/minfarm/farmtest 616494 /minos/data/minfarm/farmtest We now have over 5 TB free ######## # FARM # ######## rounding up cedar_phy_bhhi mcnear cedar_phy_bhlo mcnear cedar_phy_bhhi mcfar cedar_phy_bhlo mcfar Had top get bad_runs lists, cp /minos/data/minfarm/farmtest_strait/lists/bad_runs_mc.cedar_phy_bhlo \ /minos/data/minfarm/lists/bad_runs_mc.cedar_phy_bhlo cp /minos/data/minfarm/farmtest_strait/lists/bad_runs_mc.cedar_phy_bhhi \ /minos/data/minfarm/lists/bad_runs_mc.cedar_phy_bhhi for DET in near far ; do for FIE in hi lo ; do for STR in sntp mrnt ; do NF=`ls /minos/data/minfarm/mc${DET}cat | \ grep L010185N_D04.${STR}.cedar_phy_bh${FIE}.root | wc -l` printf " %6s %6s %6s %6d\n" ${DET} ${FIE} ${STR} ${NF} done ; done ; done near hi sntp 5493 near hi mrnt 5491 near lo sntp 5493 near lo mrnt 5493 far hi sntp 417 far hi mrnt 417 far lo sntp 416 far lo mrnt 416 ./roundup -n -W -s n13037001 -r cedar_phy_bhhi mcnear ./roundup -s n13037001 -r cedar_phy_bhhi mcnear takes a while to pick up the candidates ./roundup -n -W -s f21037001 -r cedar_phy_bhhi mcfar Noted that ALL the mcfar files are subrun 0. ./roundup -s f21037001 -r cedar_phy_bhhi mcfar ./roundup -r cedar_phy_bhhi mcfar cedar_phy_bhhimcfar.log One of the srmcp 's failed, SRMCP 34/832 -streams_num=1 -server_mode=active -protocols=gsiftp file:///f21037018_0000_L010185N_D04.sntp.cedar_phy_bhhi.root /pnfs/minos/mcout_data/cedar_phy_bhhi/far/daikon_04/L010185N/sntp _data/701 [main] ERROR gsi.CertificateRevocationLists - CRL /usr/local/grid/globus/TRUSTED_CA/eebc7717.r0 failed to load. Getting several of these, they seem scary but harmless, should report. ./roundup -r cedar_phy_bhhi mcfar Wed Jul 23 17:24:32 CDT 2008 PURGED 832/832 Wed Jul 23 17:28:28 CDT 2008 Updated to newer roundup, ./roundup -r cedar_phy_bhlo mcfar Wed Jul 23 17:58:18 CDT 2008 Wed Jul 23 22:23:51 CDT 2008 ############ # PNFSDIRS # ############ Added mrnt to far , due to problems seen in bhhi, bhlo ./pnfsdirs far cedar_phy_bhhi daikon_04 L010185N write ./pnfsdirs far cedar_phy_bhlo daikon_04 L010185N write ############# # MDSUM_LOG # ############# Date: Wed, 23 Jul 2008 03:10:06 -0500 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron /usr/krb5/bin/kcron ${HOME}/minos/scripts/mdsum_log /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log: line 50: kcron: command not found ... Removed kcron Changed to full path /usr/krb5/bin/aklog Tested this at 09:39, then restored crontab.dat Looks OK. 
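For the record, the cron-safe pattern now in use looks roughly like this sketch ( the schedule is taken from the 03:10 cron mail above, and the AFS path in the last line is just a placeholder ). Cron supplies almost no environment, so the Kerberos/AFS tools are named by full path and the AFS token is refreshed inside the script :
# crontab entry, wrapped in kcron so the script runs with the cron principal
10 03 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/mdsum_log
# near the top of mdsum_log itself
#!/bin/sh
/usr/krb5/bin/aklog                                        # AFS token from the kcron ticket
du -sm /afs/fnal.gov/files/data/minos/log_data             # placeholder AFS access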
============================================================================= 2008 07 22 ============================================================================= ######## # FARM # ######## farmtest - directory is being purged of dogwoodtest0 and 1 files, farmtest_strait - hi/lo will be moved to mcnearcat/mcfarcat for catting ########### # ROUNDUP # ########### Added MISS messages listing pending subruns. PENDLOG is only written when NOOP is clear SRV1> ln -sf roundup.20080722 roundup # was roundup.20080703 ########## # PARROT # ########## Installed the present 'current' release versions into /grid/app/minos/parrot per HOWTO.parrot, REL=current ; ARC='x86_64-linux-2.6' ; DAT='-20080708' REL=current ; ARC='i686-linux-2.6' ; DAT='-20080717' Tested ssh fnpc338 mkdir -p /local/stage1/kreymer/parrot export PRO=/grid/app/minos/parrot REL=current ; ARC='x86_64-linux-2.6' ; DAT='-20080708' export VER=cctools-${REL}${DAT}-${ARC} export PARROT_DIR=${PRO}/${VER} export PATH=${PARROT_DIR}/bin:${PATH} export HTTP_PROXY="http://squid.fnal.gov:3128" PTD=/local/stage1/kreymer/parrot parrot -m ${PARROT_DIR}/mountfile.grow -H -t ${PTD} /bin/bash P> printf "\n" | loon -bq firstlast.C ${DFILE} sh: error while loading shared libraries: libtermcap.so.2: object file has no loadable segments -bash-3.00$ du -sm /local/stage1/kreymer/parrot/ 337 /local/stage1/kreymer/parrot/ Repeated test with fresh login, REL=current ; ARC='x86_64-linux-2.6' ; DAT='-20080619' sh: error while loading shared libraries: libtermcap.so.2: object file has no loadable segments printf "" | loon -bq firstlast.C ${DFILE} This is clean -bash-3.00$ du -sm $PTD 673 /local/stage1/kreymer/parrot Repeated the test, -bash-3.00$ du -sm $PTD 704 /local/stage1/kreymer/parrot ########## # PARROT # ########## How to check which kernel we run ?
uname -m, --machine print the machine hardware name -p, --processor print the processor type -i, --hardware-platform print the hardware platform -i tends to be i386 or x86_64 -m -p tend to be i686 or x86_64 For testing, FNALU nodes are all Intel for NODE in ${UNODES} ; do printf "$NODE "; ssh -ax ${NODE} 'grep name /proc/cpuinfo | head -1' ; done MIN > for NODE in ${UNODES} ; do printf "$NODE "; ssh -ax ${NODE} 'uname -m -p -i' ; done flxi02 x86_64 x86_64 x86_64 flxi03 ssh_exchange_identification: Connection closed by remote host flxi04 i686 i686 i386 flxi05 i686 i686 i386 flxi06 i686 i686 i386 flxi07 x86_64 x86_64 x86_64 flxi09 i686 i686 i386 MIN > ssh flxb31 uname -m -p -i i686 athlon i386 So experimentally, we use 'uname -m' to identify bitosity of kernel ############ # MCIMPORT # ############ MCI3 > ln -sf mcimport.20080716 mcimport # was mcimport.20080630 Check file sizes with du -sm /minos/data/mcimport/STAGE/${TOP}/*/* | sort -n | less TOP=daikon_04/L010200N/near RDIRS=`ls /minos/data/mcimport/STAGE/${TOP}` echo $RDIRS 700 for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport -n -T ${TOP}/${DIR} | grep NFILES \ | grep -v 'NFILES 0' done 700 NFILES 185 for DIR in ${RDIRS}; do ./mcimport -T ${TOP}/${DIR} done OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010200N/near/700/mcimport.log Tue Jul 22 12:26:06 CDT 2008 ############# # MDSUM_LOG # ############# Added this to the kreymer crontab on minos26 ln -sf crontab.minos26.20080722 crontab.dat # was crontab.minos26.20080402 MINOS26 > crontab crontab.dat ============================================================================= 2008 07 21 ============================================================================= ######## # FARM # ######## State of recent near spill data processing ? Pending : PEND - have 22/24 subruns for N00014187_*.spill.sntp.cedar.0.root 63 05/19 03:07 0 22 PEND - have 9/10 subruns for N00014551_*.spill.sntp.cedar.0.root 3 07/18 04:18 0 9 PEND - have 7/8 subruns for N00014562_*.spill.sntp.cedar.0.root 1 07/20 08:11 0 7 PEND - have 3/20 subruns for N00014584_*.spill.sntp.cedar.0.root 0 07/21 03:34 0 3 Far det, per habig query, runs Friday 18 July seem to be F00041400_0015.mdaq.root ... F00041412_0001.mdaq.root ######## # FARM # ######## Date: Thu, 17 Jul 2008 15:05:02 -0500 From: Howard Rubin Run F00040421 should be forced out of farcat. The 'snarl' counts are all over the place, and 4 of the subruns ran forever before I finally killed them. It's not worth pursuing the remainder of the run. --------------------------------------------------------------- SRV1> ./roundup -n -s F00040421 -r cedar far PEND - have 4/6 subruns for F00040421_*.all.sntp.cedar.0.root 132 03/11 11:18 0 4 PEND - have 4/6 subruns for F00040421_*.spill.bntp.cedar.0.root 132 03/11 11:18 0 4 PEND - have 4/6 subruns for F00040421_*.spill.sntp.cedar.0.root 132 03/11 11:18 0 4 ./roundup -f 1 -s F00040421 -r cedar far Mon Jul 21 11:51:30 CDT 2008 Ugh, many many ECRC problems reported since Friday, SRV1> ./roundup -n -r cedar far 2>&1 | grep 'No such file' | tee /tmp/nsf.lis SRV1> wc -l /tmp/nsf.lis 165 /tmp/nsf.lis SRV1> DUPC=`cat /tmp/nsf.lis | cut -f 7 -d / | cut -f 1 -d :` SRV1> for ${DUP} in ${DUPC} ; do sam locate ${DUP} ; done These are all declared to SAM. 
SRV1> printf "$DUPC\n" | cut -f 1 -d _ | sort -u F00041342 Thu Jul 10 13:51:50 CDT 2008 F00041348 F00041351 F00041360 F00041363 F00041366 F00041369 F00041372 F00041375 F00041380 F00041383 F00041388 F00041393 Tue Jul 15 16:02:29 CDT 2008 Moved them all to DUP/farcat SRV1> for FIL in ${DUPC} ; do mv /minos/data/minfarm/WRITE/${FIL} /minos/data/minfarm/DUP/farcat/${FIL} ; done SRV1> date Mon Jul 21 14:57:47 CDT 2008 This is all messed up , BADRUNS F00041348_0003.spill.bntp.cedar.0.root PEND - have 45/23 subruns for F00041348_*.spill.bntp.cedar.0.root 4 07/17 14:35 23 22 PEND - have 20/18 subruns for F00041393_*.spill.bntp.cedar.0.root 4 07/17 13:18 17 3 PEND - have 6/22 subruns for F00041418_*.spill.bntp.cedar.0.root 0 07/20 23:39 0 6 BADRUNS F00041348_0003.spill.sntp.cedar.0.root PEND - have 45/23 subruns for F00041348_*.spill.sntp.cedar.0.root 4 07/17 14:35 23 22 PEND - have 20/18 subruns for F00041393_*.spill.sntp.cedar.0.root 4 07/17 13:18 17 3 PEND - have 6/22 subruns for F00041418_*.spill.sntp.cedar.0.root 0 07/20 23:39 0 6 41348 - spill cand's for subruns 0-2, 5-23 subrun 3 is bad, what about 4 ? These have already been concatenated and written, both bntp and sntp mv /minos/data/minfarm/farcat/F00041348* /minos/data/minfarm/DUP/farcat/ For F00041393, have 0/1/2 subruns in farcat, have 0/1 already concatenated as F00041393_0000.all.sntp.cedar.0.root F00041393_0003.spill.bntp.cedar.0.root Move the extra 0/1 to DUP, force out remaining _0002. mv /minos/data/minfarm/farcat/F00041393_0000* /minos/data/minfarm/DUP/farcat/ mv /minos/data/minfarm/farcat/F00041393_0001* /minos/data/minfarm/DUP/farcat/ SRV1> ./roundup -n -f 1 -s F00041393_0002 -r cedar far Odd, why does the script not detect that we HAVE the other subruns ? They are declared to SAM. Will have to debug this all over again !!! SRV1> ./roundup -f 1 -s F00041393_0002 -r cedar far ############# # MDSUM_LOG # ############# Created mdsum_log for daily /minos/data space usage summary. Removed strays : rmdir /minos/data/BAD rm -r /minos/data/analysis/database these were stray database backups , set log entry 2007 11 26 MINOS26 > find asousa -type f -atime -130 -exec ls -ltu {} \; -rw-r--r-- 1 asousa e875 89327267 Mar 14 06:04 asousa/N00009098_0015.mdaq.root -rw-r--r-- 1 asousa e875 698 Apr 7 11:41 asousa/makeShortSNTP.C mv asousa users/asousa Ran initial pass round 17:20 See http://www-numi.fnal.gov/computing/dh/mdsum/2008/07/21.txt ####### # AFS # ####### Sometime this last week : MINOS01 > ./bluwatch: line 71: /afs/fnal.gov/files/data/minos/log_data/bluwatch/last/minos01.txt: Connection timed out ./bluwatch: line 71: /afs/fnal.gov/files/data/minos/log_data/bluwatch/last/minos01.txt: Connection timed out ############ # MCIMPORT # ############ Free space is down to 700GB. Good thing we archived over 1 TB last week ! 
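A quick way to keep an eye on this ( a sketch ; output not from the log ) :
df -h /minos/data                                                        # free space on the BlueArc area
du -sm /minos/data/mcimport/STAGE/*/* 2>/dev/null | sort -n | tail -20   # biggest staging areas last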
MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/L010185N/near 1109172 /minos/data/mcimport/STAGE/daikon_04/L010185N/near MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/L010185N/near/* 90862 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/700 101257 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/701 101891 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/702 101538 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/703 101930 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/704 99776 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/705 101306 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/706 107014 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/707 103844 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/708 101137 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/709 2186 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/710 2123 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/711 ============================================================================= 2008 07 20 ============================================================================= ######## # DATA # ######## Many condor complaints about a file not removable on minos25. MINOS01 > dds logs/glideafs/.*c -rw-r--r-- 1 kreymer g020 0 Jul 17 08:40 logs/glideafs/.nfs532415d50000014c MINOS01 > rm logs/glideafs/.*c MINOS01 > dds logs/glideafs/.*c MINOS01 > date Sun Jul 20 21:31:30 CDT 2008 ============================================================================= 2008 07 17 ============================================================================= ########## # CONDOR # ########## First draft document from rbpatter on increased computing http://www.hep.caltech.edu/~rbpatter/computing.pdf ############ # MCIMPORT # ############ Proceed with over 2 TB of mainline MC, leaving alone the first 110 runs.
RDIRS=`ls /minos/data/mcimport/STAGE/daikon_04/L010185N/near | grep -v ^70` for DIR in ${RDIRS} ; do printf "${DIR} " du -sm /minos/data/mcimport/STAGE/daikon_04/L010185N/near/${DIR} done 710 2186 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/710 711 2123 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/711 712 1835 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/712 713 1926 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/713 714 2112 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/714 715 1953 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/715 716 1885 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/716 717 1901 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/717 718 2048 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/718 719 98473 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/719 720 10924 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/720 721 1030 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/721 722 1034 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/722 723 1054 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/723 724 1049 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/724 725 1027 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/725 726 1041 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/726 727 1014 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/727 728 1033 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/728 729 1048 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/729 730 1035 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/730 731 1030 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/731 732 1466 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/732 733 1029 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/733 734 1039 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/734 735 7098 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/735 736 5339 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/736 737 1045 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/737 738 1013 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/738 739 983 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/739 740 1005 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/740 741 1042 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/741 742 1049 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/742 743 1026 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/743 744 1023 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/744 745 1023 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/745 746 1057 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/746 747 99778 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/747 748 101765 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/748 749 106939 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/749 750 11259 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/750 751 1097 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/751 752 1043 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/752 753 1095 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/753 754 1129 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/754 755 1130 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/755 756 1145 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/756 757 1146 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/757 758 1454 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/758 759 1143 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/759 760 1127 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/760 761 1130 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/761 762 1104 
/minos/data/mcimport/STAGE/daikon_04/L010185N/near/762 763 1127 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/763 764 1094 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/764 765 1104 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/765 766 1364 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/766 767 1121 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/767 768 1117 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/768 769 1430 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/769 770 1146 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/770 771 1132 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/771 772 1143 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/772 773 101015 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/773 774 101705 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/774 775 101436 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/775 776 99742 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/776 777 99526 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/777 778 99725 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/778 779 99893 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/779 999 12076 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/999 Much of the run-space has been archived. Scan for what remains to do : for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport.20080716 -n -T daikon_04/L010185N/near/${DIR} | grep NFILES \ | grep -v 'NFILES 0' done 719 NFILES 309 720 NFILES 31 747 NFILES 304 748 NFILES 306 749 NFILES 320 750 NFILES 31 773 NFILES 308 774 NFILES 310 775 NFILES 309 776 NFILES 304 777 NFILES 303 778 NFILES 304 779 NFILES 304 999 NFILES 33 RDIRS='719 720 747 748 749 750 773 774 775 776 777 778 779 999' for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport.20080716 -T daikon_04/L010185N/near/${DIR} done 719 OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N/near/719/mcimport.log Thu Jul 17 08:24:35 CDT 2008 ============================================================================= 2008 07 16 ============================================================================= ######## # FARM # ######## Several cedarnear.log ECRC missing files, DFILES=' N00014520_0000.cosmic.cand.cedar.0.root N00014520_0011.cosmic.cand.cedar.0.root N00014520_0012.cosmic.cand.cedar.0.root N00014520_0015.cosmic.cand.cedar.0.root N00014523_0000.cosmic.cand.cedar.0.root ' These are all duplicates. Why not picked up by the DUP checking code ? For now, renamed to DUP SRV1> cd /minos/data/minfarm/WRITE SRV1> for FILE in ${DFILES} ; do ls -l ${FILE} ; done -rw-rw-r-- 1 minospro numi 111176051 Jul 12 20:09 N00014520_0000.cosmic.cand.cedar.0.root -rw-rw-r-- 1 minospro numi 113477854 Jul 12 19:50 N00014520_0011.cosmic.cand.cedar.0.root -rw-rw-r-- 1 minospro numi 113270695 Jul 12 19:24 N00014520_0012.cosmic.cand.cedar.0.root -rw-rw-r-- 1 minospro numi 113336210 Jul 12 18:59 N00014520_0015.cosmic.cand.cedar.0.root -rw-rw-r-- 1 minospro numi 112857605 Jul 12 19:50 N00014523_0000.cosmic.cand.cedar.0.root SRV1> for FILE in ${DFILES} ; do mv ${FILE} ../DUP/nearcat/${FILE} ; done SRV1> date Wed Jul 16 18:10:35 CDT 2008 ########## # DCACHE # ########## Date: Wed, 16 Jul 2008 14:26:06 -0500 (CDT) From: Michael Zalokar To: kreymer@fnal.gov Cc: moibenko@fnal.gov Subject: missing minos files We recently ran a scan of STKen PNFS. In that scan the following 6 files were flagged as being in pnfs, but not on tape. 
/pnfs/fs/usr/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/Unable /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.all.sntp.cedar.0.root.08dec2006.bad /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.spill.sntp.cedar.0.root.08dec2006.bad /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2006-06/F00035724_0013.all.sntp.cedar.0.root.08dec2006.bad /pnfs/fs/usr/minos/reco_near/R1_18_4/snts_data/2006-10/N00011077_0013.spill.snts.R1_18_4.0.root.08dec2006.ba d /pnfs/fs/usr/minos/reco_near/cedar/cand_data/2006-10/N00011134_0038.spill.cand.cedar.0.root.18dec2006.bad If you still have the originals, feel free to rewrite them. If not, please remove them from PNFS. We apologize for the inconvenience. Mike BFILES=' /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/Unable /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.all.sntp.cedar.0.root.08dec2006.bad /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.spill.sntp.cedar.0.root.08dec2006.bad /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035724_0013.all.sntp.cedar.0.root.08dec2006.bad /pnfs/minos/reco_near/R1_18_4/snts_data/2006-10/N00011077_0013.spill.snts.R1_18_4.0.root.08dec2006.bad /pnfs/minos/reco_near/cedar/cand_data/2006-10/N00011134_0038.spill.cand.cedar.0.root.18dec2006.bad ' for FIL in ${BFILES} ; do ls -l ${FIL} ; done -rw-rw-r-- 1 rubin e875 0 Jun 9 12:06 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/Unable -rw-r--r-- 1 rubin e875 17865772 Nov 10 2006 /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.all.sntp.cedar.0.root.08dec2006.bad -rw-r--r-- 1 rubin e875 805743 Nov 10 2006 /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.spill.sntp.cedar.0.root.08dec2006.bad -rw-r--r-- 1 rubin e875 22792353 Nov 10 2006 /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035724_0013.all.sntp.cedar.0.root.08dec2006.bad -rw-r--r-- 1 rubin e875 7106097 Oct 19 2006 /pnfs/minos/reco_near/R1_18_4/snts_data/2006-10/N00011077_0013.spill.snts.R1_18_4.0.root.08dec2006.bad -rw-r--r-- 1 rubin e875 443821667 Dec 10 2006 /pnfs/minos/reco_near/cedar/cand_data/2006-10/N00011134_0038.spill.cand.cedar.0.root.18dec2006.bad for FIL in ${BFILES} ; do ./dc_stat ${FIL} ; done As rubin, for FIL in ${BFILES} ; do rm ${FIL} ; done ############ # MCIMPORT # ############ Urgently need to tar up some more files, for space on /minos/data. Let's move to the current mcimport, which now supports tar archiving As usual, work on minos-sam03 MCI3 > cp AFSS/mcimport.20080630 mcimport.20080630 MCI3 > ln -sf mcimport.20080630 mcimport # was AFSS/mcimport.20071102 Let's see what there is to chew on. 
$ du -sm /minos/data/mcimport/STAGE/daikon_00/* 158600 /minos/data/mcimport/STAGE/daikon_00/L010185N done 2008/07/16 13332 /minos/data/mcimport/STAGE/daikon_00/L010185N_nue MCIMPORTARCHIVELIST $ du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N -t 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N done 7/23 2235895 /minos/data/mcimport/STAGE/daikon_04/L010185N done 7/20 126550 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 7/29 -t 368713 /minos/data/mcimport/STAGE/daikon_04/L010185N_helium done 7/28 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh done 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N done 7/22 338 min (700) 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N done 7/23 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N done 7/25 27834 /minos/data/mcimport/STAGE/daikon_04/L250200N done $ du -sm /minos/data/mcimport/STAGE/daikon_00/L010185N/near/*/* | sort -n ... 11 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/704/n12017041_0010_L010185N_D00.tar.gz 12 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/142/n12011429_0010_L010185N_D00.tar.gz 317 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/142/n11011424_0003_L010185N_D00.tar.gz 319 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/143/n11011431_0009_L010185N_D00.tar.gz ... 349 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/145/n11011450_0001_L010185N_D00.tar.gz $ du -sm /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/*/* | sort -n ... 67 /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/144/n14111446_0008_L010185N_D00_nue.tar.gz RDIRS=`ls /minos/data/mcimport/STAGE/daikon_00/L010185N/near` MCI3 > echo $RDIRS 141 142 143 144 145 704 for DIR in ${RDIRS}; do ./mcimport.20080716 -n -T daikon_00/L010185N/near/${DIR} done \ | grep NFILES NFILES 99 NFILES 107 NFILES 107 NFILES 109 NFILES 11 NFILES 31 Had to update to mcimport.20080716 to get correct MCIN path, so try one file first : ./mcimport.20080716 -b 1 -T daikon_00/L010185N/near/141 Hung up indefinitely waiting for CMS BACKFILL2 jobs in Enstore, 2237 queued up, and the queue is not getting shorter. Start time: Wed Jul 16 15:02:04 2008 Wed Jul 16 17:31:32 CDT 2008 OK, let er rip on this modest 158 GB of files. for DIR in ${RDIRS}; do ./mcimport.20080716 -T daikon_00/L010185N/near/${DIR} done ########## # CONDOR # ########## loiacono jobs are still not running. MINOS25 > condor_history -l 165974.0 > /tmp/histark MINOS25 > condor_q -l 165515.0 > /tmp/histlau sdiff -s /tmp/histark /tmp/histlau reveals that they are setting JobLeaseDuration = 360000 and not setting X509USERPROXY = /local/scratch25/$ENV(LOGNAME)/grid/$ENV(LOGNAME).proxy Sent mail, Laura confirmed that this resolves the problem. ######## # FARM # ######## Date: Wed, 16 Jul 2008 09:07:53 -0500 (CDT) From: Steven Timm To: fermigrid-announce@fnal.gov Subject: fnpcsrv1 reboot now node fnpcsrv1 was hopelessly confused with NFS errors and has to be rebooted. I am rebooting now. Steve Timm ######## # FARM # ######## Checking last night's roundups : Problems since Sun Jul 13 18:07:47 CDT 2008 cat: /export/stage/minfarm/ROUNDUP/ECRC/N00014520_0000.cosmic.cand.cedar.0.root: No such file or directory etc. 
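A count and list of the affected files can be pulled out of a dry run, along the lines of the extraction in the 2008 07 21 entry above. A sketch, not the command run on this date ; assumes the same ECRC path layout :
./roundup -n -r cedar near 2>&1 | grep 'No such file' > /tmp/necrc.lis
wc -l /tmp/necrc.lis
# the file name is the 7th /-separated field, up to the trailing colon
cut -f 7 -d / /tmp/necrc.lis | cut -f 1 -d : | sort -u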
Did pick up OK adding N00012941_0007.spill.sntp.cedar.0.root 1 OK adding N00013375_0000.spill.sntp.cedar.0.root 30 OK adding N00013434_0000.spill.sntp.cedar.0.root 18 OK adding N00013793_0011.spill.sntp.cedar.0.root 1 BIG - Splitting due to size 2163459957 OK adding N00014184_0000.spill.sntp.cedar.0.root 23 OK adding N00014184_0023.spill.sntp.cedar.0.root 1 Jul 14 19:25 But these are all old, beam came up Sunday around 06:00, in near 14526/14 far 41348/16 There are 5 spill cand's for 14508 and 14520, before beam came back. ls /pnfs/minos/reco_near/cedar/cand_data/2008-07/*spill* No spill data in ls /minos/data/minfarm/neardet Alec will try to reach Matt Strait, Howie is away. Date: Wed, 16 Jul 2008 17:35:28 -0500 (CDT) I have submitted for processing runs 14526_0000 through 145548_0000. Since I don't know what's happening with Howie's jobs and don't want to step on them unncessarily, the output will appear in my directory: /minos/data/minfarm/farmtest_strait/nearcat/ and logs at: /minos/data/minfarm/farmtest_strait/logs/cedar/near/ When Howie comes back, he can decide whether to rerun them with his scripts or copy my output. It's not a whole lot of processing, so it doesn't much matter. -Matt ============================================================================= 2008 07 15 ============================================================================= ######### # ADMIN # ######### Account request for zkrahn , no FNALU account yet. ############ # MCIMPORT # ############ 13:00 roughly MINOS26 > ./pnfsdirs near cedar_phy_bhcurv daikon_05 L010185N write MINOS26 > ./pnfsdirs far cedar_phy_bhcurv daikon_05 L010185N write Oops, this is Mock Data. Need to have a modified /pnfsdirs to handle this. The default script created /pnfs/minos/mcin_data/far/daikon_05/L010185N needed /pnfs/minos/mcin_data/fmock/daikon_05/L010185N Finally updated pnfsdirs, ran this 15:00 2008 07 30 The 49 GB of files were copied, 00:37 through 04:40. That's 3.4 MB/sec. ########## # CONDOR # ########## These processes have been running CPU-bound under gfactory since Sunday 13:00 4 R gfactory 14664 32564 64 76 0 - 2618 - Jul12 ? 1-20:16:27 condor_gridmanager -f -C (Owner=?="gfactory"&&JobUniverse==9) -o gfactory -S /tmp/condor_g_scratch.0xbb57928.32564 4 R gfactory 14670 14664 61 77 0 - 2298 - Jul12 ? 1-18:14:38 /opt/condor/sbin/gahp_server Per sfiligoi, killed these, first checking the time cycle of gfactory : LDIR=/home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/log tail -1 ${LDIR}/factory_info.20080715.log [2008-07-15T16:44:10-05:00 15536] Sleep 90s [gfactory@minos25 ~]$ ps xf PID TTY STAT TIME COMMAND 14664 ? R 2940:55 condor_gridmanager -f -C (Owner=?="gfactory"&&JobUniverse==9) -o gfactory -S /tmp/condor_g_scratch.0xbb57928.32564 14670 ? R 2804:05 \_ /opt/condor/sbin/gahp_server 17635 pts/10 Ss 0:00 -bash 25203 pts/10 R+ 0:00 \_ ps xf 15533 ? S 71:41 python glideFactory.py 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ 15535 ? S 86:09 \_ /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 15533 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ gpgeneral 15536 ? 
S 91:42 \_ /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 15533 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ gpminos kill 15533 kill 15536 kill -9 15536 kill 14670 that got them both ./start_factory.sh F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 4 S gfactory 17635 17634 0 75 0 - 1407 wait Jul14 pts/10 00:00:00 -bash 0 S gfactory 25254 1 6 77 0 - 6194 - 16:48 pts/10 00:00:13 python glideFactory.py 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ 0 S gfactory 25256 25254 6 76 0 - 8047 - 16:48 pts/10 00:00:14 /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 25254 90 4 /home/gfactory/glideinsubmit/glidein_t 0 S gfactory 25257 25254 6 76 0 - 8258 - 16:48 pts/10 00:00:13 /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 25254 90 4 /home/gfactory/glideinsubmit/glidein_t 4 R gfactory 25478 32564 88 78 0 - 2806 - 16:50 ? 00:01:28 condor_gridmanager -f -C (Owner=?="gfactory"&&JobUniverse==9) -o gfactory -S /tmp/condor_g_scratch.0x1122a558.32564 4 R gfactory 25480 25478 85 85 0 - 2041 - 16:50 ? 00:01:22 /opt/condor/sbin/gahp_server 0 R gfactory 25515 17635 0 77 0 - 1011 - 16:51 pts/10 00:00:00 ps -flu gfactory [gfactory@minos25 ~]$ ps xf PID TTY STAT TIME COMMAND 25478 ? R 2:02 condor_gridmanager -f -C (Owner=?="gfactory"&&JobUniverse==9) -o gfactory -S /tmp/condor_g_scratch.0x1122a558.32564 25480 ? R 1:54 \_ /opt/condor/sbin/gahp_server 17635 pts/10 Ss 0:00 -bash 25524 pts/10 R+ 0:00 \_ ps xf 25254 pts/10 S 0:13 python glideFactory.py 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ 25256 pts/10 S 0:14 \_ /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 25254 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ gpgeneral 25257 pts/10 S 0:13 \_ /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 25254 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ gpminos My test glideins are running. But the gridmanager processes are still running 100% cpu limited. ####### # SAM # ####### SAM taking stock meeting, 09:00 to 09:45, Julie Trumbo Anil Kumar Dianne Bonham Arthur Kreymer One result was to identify large test are in Integration which was being backed up. Removed, cutting backups from 1.5 hours to 10 minutes. ============================================================================= 2008 07 14 ============================================================================= ######### # MYSQL # ######### Date: Mon, 14 Jul 2008 17:59:19 +0100 From: Jeff Hartnell To: Nick West , rhatcher , Arthur Kreymer Cc: Nick Devenish Subject: URGENT (database problem) Hi all, Just tried to phone each one of you... Nick D has accidentally removed all the entries from the CALADCTOPESVLD table in offline on minos-db1 (he meant to do it to caltest). He was about to do a large import of new gain numbers so was trying to test it out on caltest. One worry: will this get imminently exported to other sites? Could this table be reprimed? It hasn't been updated since the end of May this year. Cheers, Jeff. ------------------------------------------------------------------ The latest backup was done on July 2, the gzipped file is /data/archive/COPY/20080702/offline/CALADCTOPESVLD.MYD.gz This is only 6 MB, so a restore should go quickly. I'll coordinate with Robert and Nick to see whether we can use this. At present, the existing 0 length database file is being held open by mysqld, so we cannot just swap this under the running mysqld. 
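In outline, the plan is to bring the backup copy up in a scratch database, rebuild the
index from the .frm, and verify the row count, so that nothing is swapped under the
running mysqld (a minimal sketch using the paths quoted above; the actual session
is transcribed below):

# Sketch : restore CALADCTOPESVLD into a scratch database 'recover' .
# The directory under the datadir becomes the 'recover' database .
ARC=/data/archive/COPY/20080702/offline
REC=/data/database/recover
mkdir -p ${REC}
cp ${ARC}/CALADCTOPESVLD.frm    ${REC}/
cp ${ARC}/CALADCTOPESVLD.MYD.gz ${REC}/
gunzip ${REC}/CALADCTOPESVLD.MYD.gz
# USE_FRM rebuilds the missing .MYI index from the table definition
mysql -u root recover -e 'REPAIR TABLE CALADCTOPESVLD QUICK USE_FRM ;'
mysql -u root recover -e 'SELECT COUNT(*) FROM CALADCTOPESVLD ;'
mysqldump -u root recover CALADCTOPESVLD > ${ARC}/CALADCTOPESVLD.dump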
------------------------------------------------------------------ gunzip /data/database/recover/CALADCTOPESVLD.MYD.gz Mysql> mkdir /data/database/recover Mysql> cp /data/database/offline/db.opt /data/database/recover/db.opt Mysql> ARC=/data/archive/COPY/20080702/offline/ Mysql> cp ${ARC}/CALADCTOPESVLD.MYD.gz /data/database/recover/ Mysql> cp ${ARC}/CALADCTOPESVLD.frm /data/database/recover/ Mysql> gunzip /data/database/recover/CALADCTOPESVLD.MYD.gz Mysql> mysqlshow -u root recover Database: recover +----------------+ | Tables | +----------------+ | CALADCTOPESVLD | +----------------+ Mysql> mysql -u root recover mysql> repair no_write_to_binlog table CALADCTOPESVLD quick use_frm ; +------------------------+--------+----------+-----------------------------------------+ | Table | Op | Msg_type | Msg_text | +------------------------+--------+----------+-----------------------------------------+ | recover.CALADCTOPESVLD | repair | warning | Number of rows changed from 0 to 115593 | | recover.CALADCTOPESVLD | repair | status | OK | +------------------------+--------+----------+-----------------------------------------+ 2 rows in set (0.46 sec) Mysql> mysqlshow -u root recover CALADCTOPESVLD Database: recover Table: CALADCTOPESVLD Rows: 115593 +--------------+------------+-----------+------+-----+---------------------+----------------+---------------------------------+---------+ | Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment | +--------------+------------+-----------+------+-----+---------------------+----------------+---------------------------------+---------+ | SEQNO | int(11) | NULL | | PRI | | auto_increment | select,insert,update,references | | | TIMESTART | datetime | NULL | | MUL | 0000-00-00 00:00:00 | | select,insert,update,references | | | TIMEEND | datetime | NULL | | MUL | 0000-00-00 00:00:00 | | select,insert,update,references | | | DETECTORMASK | tinyint(4) | NULL | YES | | | | select,insert,update,references | | | SIMMASK | tinyint(4) | NULL | YES | | | | select,insert,update,references | | | TASK | int(11) | NULL | YES | | | | select,insert,update,references | | | AGGREGATENO | int(11) | NULL | YES | | | | select,insert,update,references | | | CREATIONDATE | datetime | NULL | | | 0000-00-00 00:00:00 | | select,insert,update,references | | | INSERTDATE | datetime | NULL | | | 0000-00-00 00:00:00 | | select,insert,update,references | | +--------------+------------+-----------+------+-----+---------------------+----------------+---------------------------------+---------+ Mysql> mysqldump -u root recover CALADCTOPESVLD > /data/archive/COPY/20080702/offline/CALADCTOPESVLD.dump Second iteration, this should have been done with Mysql> REC=/data/archive/COPY/20080702/recover Mysql> mkdir -p ${REC} Mysql> cp ${ARC}/CALADCTOPESVLD.MYD.gz ${REC}/ Mysql> cp ${ARC}/CALADCTOPESVLD.frm ${REC}/ Mysql> gunzip ${REC}/CALADCTOPESVLD.MYD.gz mysql> drop table CALADCTOPESVLD ; mysql> restore table CALADCTOPESVLD from '/data/archive/COPY/20080702/recover' ; Mysql> mysqldump -u root recover CALADCTOPESVLD > /data/archive/COPY/20080702/offline/CALADCTOPESVLD.dump2 Mysql> diff /data/archive/COPY/20080702/offline/CALADCTOPESVLD.dump /data/archive/COPY/20080702/offline/CALADCTOPESVLD.dump2 ########## # CONDOR # ########## The number of held glideins built up gradually since the Tue 8 July 10 AM, peak over 16K Fri 10 AM, down gradually to 12K 12:30 Monday. 4664 ? 
R 1267:16 condor_gridmanager -f -C (Owner=?="gfactory"&&JobUniverse==9) -o gfactory -S /tmp/condor_g_scratch.0xbb57928.32564 14670 ? R 1216:00 \_ /opt/condor/sbin/gahp_server 17635 pts/10 Ss 0:00 -bash 17731 pts/10 R+ 0:00 \_ ps xf The condor_gridmanager is still running. MINOS25 > condor_q -l 164258.0 -- Submitter: minos25.fnal.gov : <131.225.193.25:64545> : minos25.fnal.gov MyType = "Job" TargetType = "Machine" ClusterId = 164258 QDate = 1215957560 CompletionDate = 0 ... EnteredCurrentStatus = 1215963651 HoldReason = "Globus error 9: the system cancelled the job" HoldReasonCode = 2 HoldReasonSubCode = 9 ReleaseReason = UNDEFINED NumSystemHolds = 1 Managed = "Schedd" ServerTime = 1216050068 MINOS25 > condor_q gfactory -hold ... 11525 jobs; 0 idle, 0 running, 11525 held Mixture of 11330 Globus error 17: the job failed when the jo 3 Globus error 43: the job manager failed to 192 Globus error 9: the system cancelled the jo HJOBS=`grep 'Globus' gfactoryhold.log | cut -f 1 -d ' '` for JOB in ${HJOBS} ; do usleep 100000 ; condor_rm ${JOB} ; done MINOS25 > condor_q gfactory > cqfact.log 164013.1 gfactory 7/12 13:52 0+00:00:00 H 0 0.0 glidein_startup.sh 164021.2 gfactory 7/12 14:54 1+22:20:12 R 0 0.0 glidein_startup.sh 164186.0 gfactory 7/13 04:08 1+08:34:16 R 0 0.0 glidein_startup.sh ... 164248.1 gfactory 7/13 08:42 1+04:27:09 R 0 0.0 glidein_startup.sh 164248.2 gfactory 7/13 08:42 1+04:27:09 R 0 0.0 glidein_startup.sh 164248.3 gfactory 7/13 08:42 1+04:27:09 R 0 0.0 glidein_startup.sh 164258.1 gfactory 7/13 08:59 0+00:00:00 X 0 0.0 glidein_startup.sh 164258.2 gfactory 7/13 08:59 0+00:00:00 X 0 0.0 glidein_startup.sh 164260.0 gfactory 7/13 09:02 0+00:00:00 X 0 0.0 glidein_startup.sh ... 164262.2 gfactory 7/13 09:19 0+00:00:00 X 0 0.0 glidein_startup.sh 164267.0 gfactory 7/13 09:55 0+00:00:00 X 0 0.0 glidein_startup.sh 164278.0 gfactory 7/13 10:45 0+00:00:00 X 0 0.0 glidein_startup.sh 164443.0 gfactory 7/14 09:28 0+00:00:00 I 0 0.0 glidein_startup.sh 164443.1 gfactory 7/14 09:28 0+00:00:00 I 0 0.0 glidein_startup.sh 164445.0 gfactory 7/14 09:35 0+00:00:00 I 0 0.0 glidein_startup.sh 164445.1 gfactory 7/14 09:35 0+00:00:00 I 0 0.0 glidein_startup.sh 164445.2 gfactory 7/14 09:35 0+00:00:00 I 0 0.0 glidein_startup.sh 164447.0 gfactory 7/14 09:49 0+00:00:00 I 0 0.0 glidein_startup.sh 164447.1 gfactory 7/14 09:49 0+00:00:00 I 0 0.0 glidein_startup.sh 164447.2 gfactory 7/14 09:49 0+00:00:00 I 0 0.0 glidein_startup.sh 164453.0 gfactory 7/14 10:22 0+00:00:00 I 0 0.0 glidein_startup.sh 164461.0 gfactory 7/14 11:18 0+00:00:00 I 0 0.0 glidein_startup.sh 49 jobs; 10 idle, 38 running, 1 held MINOS25 > RJOBS=`condor_q -run gfactory | grep gfactory | cut -f 1 -d ' '` MINOS25 > condor_rm 164021.2 Job 164021.2 marked for removal Went into stat 'X' MINOS25 > for JOB in ${IJOBS} ; do sleep 1 ; condor_rm ${JOB} ; done Strange, my glideafs jobs kept running up through 09:00 Sunday : MINOS25 > dds -tr log/glideafs/*.out ... -rw-r--r-- 1 kreymer g020 5991 Jul 13 08:59 logs/glideafs/probe.164249.0.out -rw-r--r-- 1 kreymer g020 0 Jul 13 09:00 logs/glideafs/probe.164259.0.out ... Here is a clue from GridJobId = "gt2 fngp-osg.fnal.gov:2119/jobmanager-condor https://fnpcosg1.fnal.gov:40028/29205/1215888755/" Looking in the gfactory config file glideinWMS/creation/glideinWMS.xml ... 
[ diff of glideinWMS/creation/glideinWMS.xml against the previous version :
  lines 5, 13 and 21 changed ]

cd ~
vi start_factory.sh

Igor
--------------------------------------------------------
=============================================================================
2008 07 11
=============================================================================

#########
# ADMIN #
#########

Date: Fri, 11 Jul 2008 10:13:47 -0500
From: Jason Allen

Attached is a quote specifying the config of the new Minos servers.
The servers will have dual quad core 2.66GHz CPUs, 16GB RAM, mirrored 250GB
system and data disks, and redundant power supplies.  Additionally the
systems are compatible with SLF4.5 and SLF5.1.  These are very nice machines!

We have you down on the list for 3 servers, is that number correct?
Please reserve about 2K in the Minos budget for a rack and PDUs.
As I mentioned yesterday, placement of the new Minos servers is yet to be
determined.

---------------------------------------------------------------------
200 West North Avenue, Lombard, IL 60148.
Tel No.(630) 627-8811; Fax No. (630) 627-8877
www.koicomputer.com
FERMILAB  ATTN: GLENN COOPER/JASON ALLEN
EMAIL: gcooper@fnal.gov/jallen@fnal.gov
Quotation#20080707-02/INTEL (Revised)

Qty  Description                                            Unit Cost  Total Amount
 1   2U Dual Intel Xeon E5430 2.66GHz General Rack Server   $3,300.00  $3,300.00
     Breakdown:
 1   Supermicro SC823T-R500LPPB, 2U Black Rack chassis, 500W Redundant Power
     Supply. 6 x 3.5" Hot-swap SAS/SATA Drive bays, 1 x 5.25" + 1 x Slim CD-ROM
     Drive + 1 x 3.5" Floppy Drive Bays. Cooling: 4 x 80mm 6300RPM Fans.
     2U Slide Rails included.
 1   X7DBE, Intel 5000P (Blackford) Chipset, quad/dual core Intel 64-bit Xeon
     Support, 667/1066/1333MHz FSB. 8 x 240-pin DIMM sockets support up to 32GB
     DDR2 667 ECC Fully Buffered DIMM in dual channel. Onboard 6 x SATA 3.0Gbps
     Ports via ESB2 SATA Controller, ATI ES1000 16MB Graphics, Intel 82563EB
     Dual-port Gigabit Ethernet Controller. Expansion Slots: 2 (x8) & 1 (x4)
     PCI-Express, 2 x 64-bit 133MHz PCI-X, 1 x 64-bit 100MHz PCI-X.
     Extended ATX 12" x 13.05" Form Factor.
 1   AOC-SIMLP-B+ IPMI 2.0 Adapter
 2   Intel Xeon E5430 QC LGA771 2.66GHz 12MB 1333MHz Processor
 8   2GB DDR2-667 ECC Fully Buffered DIMM
 1   8x+ 24x24x24x Internal Black Slim Tray-type DVD/CDRW Combo Drive
 1   3Ware 9650SE-4LPML, 4 Port SATA2 PCIEx4 Multi-Lane RAID Controller
 4   Seagate ST3250310NS 250GB 32MB 7200RPM SATA Enterprise Ver.
HDD 1 Server Labor/3Year Parts and Labor On-site Repair Warranty 1 Test with Scientific Fermi Linux 5.1 TOTAL: $3,300.00 Checked out seagate ST3250310NS http://www.seagate.com/www/en-us/products/servers/barracuda_es/barracuda_es.2/ Interface Capacity Model # SAS 3Gb/s 500GB ST3500620SS ST3500620SS 500000.0 SAS 3Gb/s 750GB ST3750630SS ST3750630SS 750000.0 SAS 3Gb/s 1000GB ST31000640SS ST31000640SS 1000000.0 SATA 3.0Gb/s 250GB ST3250310NS ST3250310NS 250000.0 SATA 3.0Gb/s 500GB ST3500320NS ST3500320NS 500000.0 SATA 3.0Gb/s 750GB ST3750330NS ST3750330NS 750000.0 SATA 3.0Gb/s 1000GB ST31000340NS Pricewatch : ST3250310NS $ 79 to 95 ST3500320NS $ 100 to 120 ST3750330NS $ 148 to 250 ST31000340NS $ 230 to 255 ============================================================================= 2008 07 10 ============================================================================= ######### # MYSQL # ######### reforwarded this to minosdb-support Date: Tue, 20 May 2008 16:52:55 +0000 (UTC) From: Arthur Kreymer To: minosdb-support@fnal.gov Subject: Re: new MINOS hardware (fwd) Here's an extract of recent discussions regarding purchase of a server-class ( 64 bit Intel architecture ) replacement for the present minos-mysql1 production Mysql server. We would get 24x7 class hardware, and ask for 8to17by7 hardware and software support. ---------- Forwarded message ---------- Date: Thu, 15 May 2008 14:38:28 -0500 From: Robert Hatcher To: Joseph Boyd Cc: minos-admin@fnal.gov Subject: Re: new MINOS hardware On May 13, 2008, at 2:06 PM, Joseph Boyd wrote: > I believe everything you want is possible. Please let me know how > much data disk space and scratch space you want on each of the > servers. I think the only server that has any real requirement on disk space is the one that we will use for the MySQL warehouse and even those needs are relatively moderate in this day-and-age. The sizes I'll list are really only minimums, if larger disks are standard order those at your discretion. MySQL replacement: 2 * 230GB /data (mirrored) + 1 * 230GB /local/scratchX ... ####### # NET # ####### Wireless seems to be out on WH12 generally. My laptop, from my office KREYMERFNALGOV.dhcp.fnal.gov, connects to : 131.225.94.231 is connected to w-s-wh11se-g on port radio Last detected on this switch at 2008/07/10/14:07 25 MAC addresses have been seen on port radio of w-s-wh11se-g. Date: Thu, 10 Jul 2008 14:23:30 -0500 (CDT) Subject: HelpDesk ticket 118505 ___________________________________________ Short Description: WH12W wireless non-functional Problem Description: We do not seem to be getting wireless connections on WH12. When I connect to wireless from my WH12 SW office (1260), I am connected at follows : 131.225.94.231 is connected to w-s-wh11se-g on port radio When I scan for available networks, I see fgz, and tuftswireless on demand Unsecured computer-to-computer network The WH12SW wireless access point is at the entrance to my office, so this is not a signal strength problem. ___________________________________________ NB - MRTG shows no nodes connected to w-s-wh12sw-g ___________________________________________ See also tickets 118488 7/10 plunk assigned to andrews 118406 7/09 perdue assigned to andrews ___________________________________________ The CD leave request page shows that Dave Coder is not here this week. Please reassign this to someone who is here. See also two other tickets, assigned to Chuck Andrews 118488 7/10 118406 7/09 which have had no action indicated in Remedy. 
Chuck is not listed on the leave page, but is he around ?
___________________________________________

Spoke to Chuck Andrews around 14:10.
Our access point failed to reboot, he will come reset it.

My own laptop was broadcasting tuftswireless !
(Wireless Network Connection)
Change the order of preferred networks (Preferred Networks)
tuftswireless  Remove  OK
View Wireless Networks

There are several more doing this now on other floors.

The wireless transmitters have been reducing their power to minimize
interference, based on traffic they see.  This is a recent policy change.
When they see rogue access points, which have always been around, they drop
their power, causing access problems.  The network group is working today to
roll this change back until some way can be found to filter out the rogues.
___________________________________________

The w-s-wh12sw-g AP has been rebooted and firmware reloaded.
It is still not picking up any clients.
But several rogue access points have been located and corrected by candrews,
so we should be getting much better service now.
___________________________________________

Date: Fri, 11 Jul 2008 15:52:50 -0500 (CDT)
This ticket has been reassigned to ANDREWS, CHARLES of the CD-LSCS/CNCS/SN
Group.
___________________________________________

Chuck visited the WH12SW area this afternoon, and removed several rogue
access points, including one on my own laptop in WH12W 1260.

He restarted the w-s-wh12sw-n wireless access point, which eventually seems
to have downloaded software, and resumed normal operation, according to its
status lights.

As of 17:28, I see several connections to the access point.
Note that the name ends in -n , not -g .

This ticket can be closed, thanks !
___________________________________________

Date: Mon, 14 Jul 2008 09:35:20 -0500 (CDT)
Solution: Adhoc laptops were causing localized problems - removed adhoc SSID
from WCS reported laptops - OK now.
This ticket was resolved by ANDREWS, CHARLES of the CD-LSCS/CNCS/SN group.
___________________________________________

000d93ee8658 118402 7/8/2008 10:23:14 PM
wireless reception bad in WH11SW since replacement of wireless hub

On 9-July-2008, two laptops (one on WH12 SW & 1 on WH13 SW) were found that
were acting as Adhoc Rogues.  Both laptops were advertising themselves as
Access Points, transmitting unauthorized SSIDs and interfering with the Fermi
Wireless network on WH10, WH11, WH12 and WH13.  Both laptops were returned to
correct operating configurations, and normal wireless operation was restored.

-Chuck- 840-2721

=============================================================================
2008 07 09
=============================================================================

#######
# CVS #
#######

Updated WebDocs/cvs-rep.html to deprecate ssh key pairs,
and document kerberos access.
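For reference, the documented kerberos path amounts to carrying a fresh ticket over
ssh instead of using a key pair (a minimal sketch; the host and repository path below
are placeholders, not the values given on the page):

# Sketch : kerberized CVS checkout , no ssh key pair involved .
# Host and repository path are placeholders -- see WebDocs/cvs-rep.html .
kinit ${USER}@FNAL.GOV
export CVS_RSH=ssh
cvs -d :ext:minoscvs@cvsserver.example.fnal.gov:/cvs/minoscvs co WebDocs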
####### # CVS # ####### Added all Minos Cluster users to minoscvs .k5login ypcat passwd | cut -f 1 -d : | sort | wc -l 211 ypcat passwd | cut -f 1 -d : | sort > minosusers Removed non-Minos entries baisley boyd condor dawson ettab fromm jlkaiser joes kevinh lisa lsfadm mgreaney mindata minfarm minoscvs minsoft products sam samread sfiligoi timm cp .k5login .k5loginnew NUSERS=`cat minosusers` for username in ${NUSERS} ; do if grep -q \^${username}@FNAL.GOV ${HOME}/.k5loginnew then printf "HAVE ${username}\n" else printf "ADD ${username}\n" echo ${username}@FNAL.GOV >> ${HOME}/.k5loginnew fi done HAVE 37 ADD 153 -bash-3.00$ wc -l .k5login* 43 .k5login 14 .k5login.20050527 15 .k5login.20070122 13 .k5login.20070214 42 .k5login.bak 196 .k5loginnew 28 .k5login~ cp ${HOME}/.k5login ${HOME}/.k5login.bak sort -u -o ${HOME}/.k5login ${HOME}/.k5loginnew -bash-3.00$ wc -l .k5login* 196 .k5login 14 .k5login.20050527 15 .k5login.20070122 13 .k5login.20070214 42 .k5login.bak 196 .k5loginnew -bash-3.00$ date Wed Jul 9 17:43:04 CDT 2008 ####### # SRM # ####### MCFILS=`cat /minos/data/minfarm/mcnear/cp_to_dc` CPFILS=`for FIL in ${MCFILS} ; do sam locate ${FIL} ; done 2>&1 | \ grep Datafile | cut -f 2 -d "'"` n13047014_0025_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047014_0027_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047014_0028_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047014_0029_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0000_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0001_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0002_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0014_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0015_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0016_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0021_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0030_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047042_0000_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047042_0002_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047042_0004_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047042_0006_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root SRV1> printf "${CPFILS}\n" | wc -l 18 for FIL in ${CPFILS} ; do ls -l /minos/data/minfarm/mcnear/${FIL} ; done These are all 500 MB files . 
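For scale, 18 files of roughly 528 MB is about 9.5 GB per pass, so the full-pass
wall clocks recorded below work out to roughly 10.9 MB/sec for the GUC pass and
7.6 MB/sec for the srmcp pass (back-of-the-envelope, using the 'real' times below):

echo 'scale=2; 18*528/872'  | bc    # GUC pass   : 9504 MB / 872 s  ~ 10.9 MB/sec
echo 'scale=2; 18*528/1253' | bc    # srmcp pass : 9504 MB / 1253 s ~  7.6 MB/sec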
/usr/local/grid/setup.sh export X509_USER_PROXY=/local/globus/minfarm/.grid/x509up_u1334 setup encp v3_6d -q stken SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL GPATH=gsiftp://stkendca2a.fnal.gov:2811///NULL LPATH=file:////minos/data/minfarm/mcnear EPATH=file:////export/stage/minfarm/CPFILS SRV1> FIL=n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root SRV1> time ecrc /minos/data/minfarm/mcnear/${FIL} CRC 883220849 real 0m7.075s user 0m1.924s sys 0m2.423s SRV1> time srmcp ${LPATH}/${FIL} ${SPATH}/${FIL} real 0m48.549s user 0m9.628s sys 0m9.133s chmod 600 /local/globus/minfarm/.grid/x509up_u1334 SRV1> time globus-url-copy ${LPATH}/${FIL} ${GPATH}/${FIL} real 0m19.137s user 0m1.607s sys 0m6.124s SRV1> time srmcp ${LPATH}/${FIL} ${SPATH}/${FIL} real 0m50.392s user 0m9.757s sys 0m8.851s Let's tune the GUC to be simlar to srmcp as we use it, based on options listed with globus-url-copy -help : GUC='globus-url-copy -binary -create-dest -fast -rst-retries 24 -rst-interval 600 -rst-timeout 10000 -block-size 1M -parallel 1 ' GUCV="${GUC} -vb" SRV1> time ${GUCV} ${LPATH}/${FIL} ${GPATH}/${FIL} SRV1> time ${GUCV} ${LPATH}/${FIL} ${GPATH}/${FIL} Source: file:////minos/data/minfarm/mcnear/ Dest: gsiftp://stkendca2a.fnal.gov:2811///NULL/ n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root 528312697 bytes 16.91 MB/sec avg 16.91 MB/sec inst real 0m31.644s user 0m0.097s sys 0m4.082s SRV1> time ${GUCV} ${LPATH}/${FIL} ${GPATH}/${FIL} Source: file:////minos/data/minfarm/mcnear/ Dest: gsiftp://stkendca2a.fnal.gov:2811///NULL/ n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root 528312697 bytes 15.89 MB/sec avg 15.89 MB/sec inst real 0m33.663s user 0m0.097s sys 0m5.681s SRMC='srmcp -streams_num=1 -server_mode=active -protocols=gsiftp' ecrc times are consistent, real 0m7.075s real 0m6.157s real 0m7.208s real 0m6.089s srmcp elapsed times vary real 0m50.392s real 0m40.475s real 0m45.932s SRMC elapsed times can be real 0m42.429s real 0m42.493s real 0m37.965s real 0m45.403s GUC times have been real 0m19.137s real 0m31.644s real 0m33.663s real 1m10.560s real 0m57.457s real 0m28.931s real 0m28.527s real 0m26.159s real 0m51.112s real 0m45.346s real 0m27.960s 15:05 time { for FIL in ${CPFILS} ; do ecrc /minos/data/minfarm/mcnear/${FIL} ${SRMC} ${LPATH}/${FIL} ${SPATH}/${FIL} done ; } date real 20m52.922s user 3m29.292s sys 3m8.408s Wed Jul 9 15:25:53 CDT 2008 time { for FIL in ${CPFILS} ; do ecrc /minos/data/minfarm/mcnear/${FIL} ${GUC} ${LPATH}/${FIL} ${GPATH}/${FIL} done ; } date real 14m31.935s user 0m37.659s sys 2m17.358s Wed Jul 9 15:40:59 CDT 2008 Repeated with 'globus-url-copy' instead of GUC real 21m32.638s user 1m3.671s sys 2m25.088s Wed Jul 9 16:42:16 CDT 2008 setup dcap klist -f DPATH=dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/NULL MINOS26 > type dccp dccp is hashed (/afs/fnal.gov/ups/dcap/v2_26_f0213/Linux+2.4/bin/dccp) MINOS26 > dccp /minos/data/minfarm/mcnear/${FIL} ${DPATH}/${FIL} 540263111 bytes in 118 seconds (4471.19 KB/sec) MINOS26 > dccp /minos/data/minfarm/mcnear/${FIL} ${DPATH}/${FIL} 540263111 bytes in 14 seconds (37685.76 KB/sec) SRV1> type dccp dccp is hashed (/fnal/ups/prd/dcap/v2_42_f0710/Linux-2-6/bin/dccp) SRV1> dccp /minos/data/minfarm/mcnear/${FIL} ${DPATH}/${FIL} Command failed! Server error message for [1]: "path /pnfs/fnal.gov/usr/minos/NULL/n13047014_0025_L010185N_D04.cand.cedar_phy_bhcurv.1.root not found" (errno 10001). 540263111 bytes in 33 seconds (15987.90 KB/sec) ... 
540263111 bytes in 20 seconds (26380.03 KB/sec) 540263111 bytes in 20 seconds (26380.03 KB/sec) Testing the bulk copy, rates reported as uniformly 29 to 30 MB/sec ( 18 sec ) not sure why the messy command failed messages. SRV1> time { for FIL in ${CPFILS} ; do ecrc /minos/data/minfarm/mcnear/${FIL} dccp /minos/data/minfarm/mcnear/${FIL} ${DPATH}/${FIL} done ; } date 540263111 bytes in 18 seconds (29311.15 KB/sec) 534044330 bytes in 18 seconds (28973.76 KB/sec) 535609734 bytes in 22 seconds (23775.29 KB/sec) 521546727 bytes in 21 seconds (24253.48 KB/sec) 530692815 bytes in 28 seconds (18509.10 KB/sec) 535020334 bytes in 22 seconds (23749.13 KB/sec) 536381262 bytes in 20 seconds (26190.49 KB/sec) 536748102 bytes in 19 seconds (27587.79 KB/sec) 541163157 bytes in 31 seconds (17047.73 KB/sec) 531467411 bytes in 23 seconds (22565.70 KB/sec) 554185242 bytes in 21 seconds (25771.26 KB/sec) 546379858 bytes in 33 seconds (16168.91 KB/sec) 532681782 bytes in 33 seconds (15763.55 KB/sec) 527743225 bytes in 62 seconds (8312.49 KB/sec) 536695576 bytes in 31 seconds (16906.99 KB/sec) 528616952 bytes in 45 seconds (11471.72 KB/sec) 527392960 bytes in 27 seconds (19075.27 KB/sec) 528312697 bytes in 21 seconds (24568.11 KB/sec) real 17m43.132s user 1m7.907s sys 2m20.122s Wed Jul 9 18:33:26 CDT 2008 Running on minos26, where we've seen better rates Noted that the CRC's were taking longer than the copies ! MINOS26> time { for FIL in ${CPFILS} ; do ecrc /minos/data/minfarm/mcnear/${FIL} dccp /minos/data/minfarm/mcnear/${FIL} ${DPATH}/${FIL} done ; } date 540263111 bytes in 30 seconds (17586.69 KB/sec) 534044330 bytes in 10 seconds (52152.77 KB/sec) 535609734 bytes in 16 seconds (32691.02 KB/sec) 521546727 bytes in 15 seconds (33954.87 KB/sec) 530692815 bytes in 24 seconds (21593.95 KB/sec) 535020334 bytes in 48 seconds (10885.02 KB/sec) 536381262 bytes in 31 seconds (16897.09 KB/sec) 536748102 bytes in 16 seconds (32760.50 KB/sec) 541163157 bytes in 32 seconds (16514.99 KB/sec) 531467411 bytes in 17 seconds (30530.07 KB/sec) 554185242 bytes in 15 seconds (36079.77 KB/sec) 546379858 bytes in 12 seconds (44464.51 KB/sec) 532681782 bytes in 11 seconds (47290.64 KB/sec) 527743225 bytes in 18 seconds (28631.90 KB/sec) 536695576 bytes in 43 seconds (12188.76 KB/sec) 528616952 bytes in 13 seconds (39709.81 KB/sec) 527392960 bytes in 14 seconds (36788.01 KB/sec) 528312697 bytes in 30 seconds (17197.68 KB/sec) real 29m55.317s user 1m2.824s sys 0m39.356s MINOS26 > date Wed Jul 9 19:07:55 CDT 2008 Make a local copy on /local/scratch26 mkdir /local/scratch26/kreymer/CPFILS time { for FIL in ${CPFILS} ; do echo ${FIL} cp /minos/data/minfarm/mcnear/${FIL} /local/scratch26/kreymer/CPFILS/${FIL} done ; } date Wed Jul 9 19:07:55 CDT 2008 real 16m4.144s user 0m1.670s sys 0m53.230s Wed Jul 9 19:25:41 CDT 2008 Test dccp from /local/scratch26 In many cases, elapsed dccp time is much longer than reported. Ganglia shows many minute or more long gaps with no I/O. including 19:38 to 19:42 ( 4 minutes ! 
) saved minos26net.20080709.png image MINOS26> time { for FIL in ${CPFILS} ; do ecrc /local/scratch26/kreymer/CPFILS/${FIL} dccp /local/scratch26/kreymer/CPFILS/${FIL} ${DPATH}/${FIL} done ; } date 540263111 bytes in 13 seconds (40584.67 KB/sec) 534044330 bytes in 15 seconds (34768.51 KB/sec) 535609734 bytes in 17 seconds (30768.02 KB/sec) 521546727 bytes in 15 seconds (33954.87 KB/sec) 530692815 bytes in 16 seconds (32390.92 KB/sec) 535020334 bytes in 16 seconds (32655.05 KB/sec) 536381262 bytes in 33 seconds (15873.03 KB/sec) 536748102 bytes in 39 seconds (13440.21 KB/sec) 541163157 bytes in 21 seconds (25165.70 KB/sec) 531467411 bytes in 17 seconds (30530.07 KB/sec) 554185242 bytes in 17 seconds (31835.09 KB/sec) 546379858 bytes in 38 seconds (14041.42 KB/sec) 532681782 bytes in 14 seconds (37156.93 KB/sec) 527743225 bytes in 27 seconds (19087.93 KB/sec) 536695576 bytes in 28 seconds (18718.46 KB/sec) 528616952 bytes in 18 seconds (28679.31 KB/sec) 527392960 bytes in 18 seconds (28612.90 KB/sec) 528312697 bytes in 18 seconds (28662.80 KB/sec) real 24m25.210s user 0m59.146s sys 0m37.979s Wed Jul 9 19:50:30 CDT 2008 The next day, run again on minos26, timing each dccp Strange to see minute delays connecting to the pool queueinfo shows only 8 stores w-stkendca11a-6 1 writePools w-stkendca20a-2 3 ExpDbWritePools w-stkendca20a-3 4 ExpDbWritePools Min/Max delays are 8 83 540263111 bytes in 16 seconds (32975.04 KB/sec) real 0m24.324s 534044330 bytes in 27 seconds (19315.84 KB/sec) real 0m44.544s 535609734 bytes in 23 seconds (22741.58 KB/sec) real 0m48.259s 521546727 bytes in 29 seconds (17562.86 KB/sec) real 0m55.006s 530692815 bytes in 22 seconds (23557.03 KB/sec) real 0m58.104s 535020334 bytes in 30 seconds (17416.03 KB/sec) real 112.799s 536381262 bytes in 20 seconds (26190.49 KB/sec) real 0m29.646s 536748102 bytes in 19 seconds (27587.79 KB/sec) real 0m30.349s 541163157 bytes in 35 seconds (15099.42 KB/sec) real 60.711s 531467411 bytes in 29 seconds (17896.94 KB/sec) real 0m58.351s 554185242 bytes in 32 seconds (16912.39 KB/sec) real 104.013s 546379858 bytes in 35 seconds (15244.97 KB/sec) real 90.253s 532681782 bytes in 30 seconds (17339.90 KB/sec) real 0m45.880s 527743225 bytes in 29 seconds (17771.53 KB/sec) real 62.893s 536695576 bytes in 34 seconds (15415.20 KB/sec) real 92.619s 528616952 bytes in 30 seconds (17207.58 KB/sec) real 70.083s 527392960 bytes in 30 seconds (17167.74 KB/sec) real 0m53.432s 528312697 bytes in 15 seconds (34395.36 KB/sec) real 0m25.090s real 20m27.129s user 0m59.083s sys 0m37.151s Thu Jul 10 15:38:01 CDT 2008 dccp -d 4 /local/scratch26/kreymer/CPFILS/${FIL} ${DPATH}/${FIL} [Thu Jul 10 15:51:43 2008] Going to open file dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/NULL/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root in cache. Connected in 0.00s. Cache open succeeded in 1.66s. 528312697 bytes in 29 seconds (17790.70 KB/sec) MINOS26 > time dccp -d 4 /local/scratch26/kreymer/CPFILS/${FIL} ${DPATH}/${FIL} [Thu Jul 10 15:54:08 2008] Going to open file dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/NULL/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root in cache. Connected in 0.00s. Cache open succeeded in 3.93s. 528312697 bytes in 33 seconds (15634.25 KB/sec) real 4m7.100s user 0m1.649s sys 0m1.201s Weird. ganglia shows a blip of activity at 15:54 to 15:56, then nothing through 16:00, the timestamp on the file in PNFS. 
MINOS26 > time dccp -d999 /local/scratch26/kreymer/CPFILS/${FIL} ${DPATH}/${FIL} extra option: -alloc-size=528312697 Real file name: /local/scratch26/kreymer/CPFILS/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root. Using system native open for /local/scratch26/kreymer/CPFILS/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root. [Thu Jul 10 16:04:35 2008] Going to open file dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/NULL/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root in cache. Allocated message queues 0, used 0 Allocated message queues 1, used 1 Creating a new control connection to fndca1.fnal.gov:24725. Activating IO tunnel. Provider: [/afs/fnal.gov/ups/dcap/v2_26_f0213/Linux+2.4/lib/libgssTunnel.so]. Added IO tunneling plugin /afs/fnal.gov/ups/dcap/v2_26_f0213/Linux+2.4/lib/libgssTunnel.so for fndca1.fnal.gov:24725. Connected in 0.00s. Sending control message: 0 0 client hello 0 0 2 26 -uid=1060 -pid=7649 Server reply: welcome. Connected to fndca1.fnal.gov:24725 Setting hostname to minos26.fnal.gov. Sending control message: 1 0 client open "dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/NULL/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root" w minos26.fnal.gov 51944 -timeout=-1 -onerror=default -alloc-size=528312697 -uid=1060 Got callback connection from stkendca10a.fnal.gov:51426 for session 1, myID 1. Enabling checksumming on write. Cache open succeeded in 1.80s. [6] Sending IOCMD_WRITE. [6] Expected position: 1048570 @ 0 bytes written. [6] Sending IOCMD_WRITE. [6] Expected position: 2097140 @ 0 bytes written. ... [6] Sending IOCMD_WRITE. [6] Expected position: 528312697 @ 0 bytes written. Using system native close for [3]. [6] unpluging node File checksum is: 1989075314 Sending CLOSE for fd:6 ID:1. ... N.B. long delay here ... Server reply: ok destination [1]. Removing unneeded queue [1] [6] destroing node 528312697 bytes in 17 seconds (30348.85 KB/sec) real 0m58.003s user 0m1.638s sys 0m1.094s The problem is not recycling file names, same delays with unique names. Tried disabling crc and allowing unsafe writes with -c and -u, no gain. v2_26_f0213/Linux+2.4 MINOS26 > setup dcap v2_41_f0610 Make a local copy on srv1 mkdir /export/stage/minfarm/CPFILS time { for FIL in ${CPFILS} ; do echo ${FIL} cp /minos/data/minfarm/mcnear/${FIL} /export/stage/minfarm/CPFILS done ; } date real 4m0.040s user 0m1.675s sys 1m52.409s Wed Jul 9 17:00:18 CDT 2008 Let's try local disk GUC to PNFS time { for FIL in ${CPFILS} ; do ecrc /export/stage/minfarm/CPFILS/${FIL} ${GUC} ${EPATH}/${FIL} ${GPATH}/${FIL} done ; } date real 18m14.040s user 0m37.082s sys 1m19.753s SRV1> date Wed Jul 9 17:24:16 CDT 2008 For reference, check ecrc times : time { for FIL in ${CPFILS} ; do ecrc /export/stage/minfarm/CPFILS/${FIL} done ; } date real 2m38.338s user 0m28.999s sys 0m16.279s Wed Jul 9 17:50:36 CDT 2008 time { for FIL in ${CPFILS} ; do ecrc /minos/data/minfarm/mcnear/${FIL} done ; } date real 2m9.727s user 0m34.299s sys 0m38.907s Wed Jul 9 17:52:58 CDT 2008 ############ # BLUWATCH # ############ blutickle is not preventing expiration of /minos/data on fnpcsrv1, and we keep getting errors. 
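One way to pin down whether the automount really drops, independent of bluwatch's own
access attempts, is to watch the mount table passively (a minimal sketch; the interval,
log file name and grep pattern are assumptions, not an existing script):

# Sketch : log whenever the /minos/data NFS mount is absent from /proc/mounts .
# Reading /proc/mounts does not itself trigger the automounter ;
# the pattern matches the NFS entry, not an autofs trigger .
while true ; do
    grep -q ' /minos/data nfs' /proc/mounts || \
        echo "`date '+%Y-%m-%d %H:%M:%S'` /minos/data not mounted"
    sleep 30
done >> ${HOME}/bluexpire.log &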
Made local 55 second sleep copy of bluwatch in /home/kreymer, started this up around 09:20 ####### # NET # ####### LOW admin Date: Wed, 09 Jul 2008 09:02:40 -0500 (CDT) Subject: HelpDesk ticket 118397 ___________________________________________ Short Description: DNS port radomization security test seems to fail Problem Description: A worldwide coordinated DNS security fix was apparently deployed yesterday. http://www.doxpara.com/ The DNS checker at that site reports that the Fermilab DNS server is vulnerable. This seems odd to me, as we deployed new DNS servers at that time. Here is the test result : Your name server, at 131.225.8.120, appears vulnerable to DNS Cache Poisoning. All requests came from the following source port: 32770Requests seen for 4e7a194afa10.toorrr.com: 131.225.8.120:32770 TXID=22669 131.225.8.120:32770 TXID=24309 131.225.8.120:32770 TXID=38045 131.225.8.120:32770 TXID=19642 131.225.8.120:32770 TXID=59774 ___________________________________________ Date: Wed, 09 Jul 2008 09:15:43 -0500 (CDT) This ticket has been reassigned to TANG, DAVID of the CD-LSCS/CNCS/SN Group. ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 07 08 ============================================================================= ####### # NET # ####### Date: Tue, 08 Jul 2008 15:19:02 -0500 (CDT) Subject: HelpDesk ticket 118378 ___________________________________________ Short Description: DNS lookups failing on fnsrv0 Problem Description: DNS network lookups are failing on DNS server fnsrv0, but not fnsrv1. This is causing a myriad of problems site wide. A recent example : MINOS01 > nslookup 131.225.193.11 fnsrv0 Server: fnsrv0 Address: 131.225.8.120#53 ** server can't find 11.193.225.131.in-addr.arpa: SERVFAIL MINOS01 > nslookup 131.225.193.11 fnsrv1 Server: fnsrv1 Address: 131.225.17.150#53 11.193.225.131.in-addr.arpa name = minos11.fnal.gov. ___________________________________________ Date: Tue, 08 Jul 2008 16:04:50 -0500 (CDT) Note To Requester: tang@fnal.gov sent this Notes To Requester: Resolved. Related to DNS server glitch this afternoon ___________________________________________ Date: Wed, 09 Jul 2008 08:33:03 -0500 (CDT) Solution: Resolved. Related to DNS server glitch this afternoon This ticket was resolved by CODER, DAVE of the CD-LSCS/CNCS/SN group. ___________________________________________ ___________________________________________ ########## # DCACHE # ########## LOW Date: Tue, 08 Jul 2008 13:31:52 -0500 (CDT) Subject: HelpDesk ticket 118354 ___________________________________________ Short Description: FNDCA write pool time settings Problem Description: The time threshold for writing files from FNDCA write pools are intended to be : 4 hours - writePools 24 hours - RawDataWritePools I have good reason to suspect that these have reverted to small values. This probably happened even before the DCache 1.8 upgrade. Please reset the time thresholds to these normal values, to improve overall efficiency and reduce wear on the tapes and robots. This is particularly important for the Raw pools, where Minos writes one file per hour. ___________________________________________ Date: Mon, 08 Sep 2008 14:02:41 -0500 (CDT) Note To Requester: swhicks@fnal.gov sent this Notes To Requester: Arthur, > This ticket is over 60 days old. If this problem still exists, please let us know and a new ticket can be issued. 
If not, I will close this ticket now. Thanks, Stanley ___________________________________________ Date: Mon, 08 Sep 2008 19:45:21 +0000 (GMT) I have had not response to this ticket.. I believe that the problems persists. ___________________________________________ Date: Mon, 22 Sep 2008 19:06:08 +0000 (GMT) The raw data files continue to be written to tape immediately, which is bad : MINOS26 > ls -l /pnfs/minos/fardet_data/2008-09 ... -rw-r--r-- 1 buckley e875 37049096 Sep 22 07:31 F00041967_0010.mdaq.root -rw-r--r-- 1 buckley e875 19926529 Sep 22 08:32 F00041967_0011.mdaq.root -rw-r--r-- 1 buckley e875 43658989 Sep 22 09:33 F00041967_0012.mdaq.root -rw-r--r-- 1 buckley e875 37256313 Sep 22 10:34 F00041967_0013.mdaq.root -rw-r--r-- 1 buckley e875 19872473 Sep 22 11:35 F00041967_0014.mdaq.root -rw-r--r-- 1 buckley e875 43571565 Sep 22 12:36 F00041967_0015.mdaq.root -rw-r--r-- 1 buckley e875 37174411 Sep 22 13:30 F00041967_0016.mdaq.root The latest of these files is already on tape as of 13:31 Thanks to the CD Ops meeting report, I went to Bugzilla, http://www-ccf.fnal.gov/Bugzilla/show_bug.cgi?id=100 I see a reply to me last Friday, but I do not have a record of receiving this email. ___________________________________________ Date: Mon, 22 Sep 2008 14:57:30 -0500 (CDT) I have sent this info to the dcache developers and included it in the bugzilla ticket. Please let us know if you don't hear from somebody in a reasonable (24 hr?) amount of time. Stanley ___________________________________________ Date: Wed, 29 Oct 2008 15:24:15 +0000 (GMT) This ticket was initially logged July 8. Updated 8 Sept. Updated 22 Sept. The problem persists. This is not academic. Our raw data tapes are being mounted once per file, which is extremly bad for the tapes. For example, the current, partially filled far detector data tape has been mounted over 1000 times : MINOS26 > enstore info --vol VO8699 {'blocksize': 131072, 'capacity_bytes': 214748364800L, 'comment': '', 'declared': 1199726325.0, 'eod_cookie': '0000_000000000_0001335', 'external_label': 'VO8699', 'first_access': 1206710452.0, 'last_access': 1225291110.0, 'library': 'CD-9940B', 'media_type': '9940B', 'remaining_bytes': 160116203520L, 'si_time': [1222728047.0, 1130679712.0], 'sum_mounts': 1153, 'sum_rd_access': 65, 'sum_rd_err': 0, 'sum_wr_access': 1335, 'sum_wr_err': 1, 'system_inhibit': ['none', 'none'], 'user_inhibit': ['none', 'none'], 'volume_family': 'minos.fardet_data.cpio_odc', 'wrapper': 'cpio_odc', 'write_protected': 'n'} ___________________________________________ http://www-ccf.fnal.gov/Bugzilla/show_bug.cgi?id=100 ------- Comment #3 From Stanley W. Hicks 2008-10-29 14:36:29 ------- ARTHUR KREYMER wrote again about this problem. He says it is on-going and continues to be an issues (since July 9). I am upping the priority from P5 to P2 due to this being nearly 4 months old now: ___________________________________________ Date: Thu, 30 Oct 2008 15:22:29 -0500 (CDT) From: Dmitry Litvintsev Hi Alex, just not to make impression that we are walking in circles here. Last time Art reported about it - I looked in log files and indeed the files get written to tape almost immediately. Without regard to what is specified in pool setup. We promised to raise the priority of this, but then it slipped through. Dmitry ___________________________________________ Date: Thu, 30 Oct 2008 20:51:13 +0000 (GMT) From: Arthur Kreymer All recent raw data files have been written 'too soon'. I claim that files are on tape based on valid Level 4 metadata. 
The PNFS file time seen by 'ls' changes when the file moves to tape. So the simplest way to see the problem is to look at files in any of our recent data directories. The data directories are under /pnfs/minos/ fardet_data/2008-10 neardet_data/2008-10 beam_data/2008-10 far_dcs_data/2008-10 near_dcs_data/2008-10 For example, /pnfs/minos/far_dcs_data/2008-10/F081029_000012.mdcs.root was written to DCache at 2008-10-29 22:53:26 The time in PNFS is Oct 29 22:55 Similarly, /pnfs/minos/near_dcs_data/2008-10/N081029_000002.mdcs.root was written to DCache at 2008-10-30 00:10:00 The time in PNFS is Oct 30 00:12 As of around 15:20 CDT, the latest far detector data file, /pnfs/minos/fardet_data/2008-10/F00042108_0002.mdaq.root is on tape VO8699, with time stamp Oct 30 14:41 That is consistent with the enstore info for the volume, the last_access field. Since Wed, 29 Oct 2008 15:24:15 +0000 (GMT), our latest Far Detector tape VO8699 has been written to 33 times, and mounted 29 times. Statistics around 15:25 today : MINOS26 > enstore info --vol VO8699 ... 'last_access': 1225395669.0, ... 'sum_mounts': 1182, 'sum_rd_access': 65, 'sum_rd_err': 0, 'sum_wr_access': 1368, 'sum_wr_err': 1, Previous statistics > > MINOS26 > enstore info --vol VO8699 ... > > 'sum_mounts': 1153, > > 'sum_rd_access': 65, > > 'sum_rd_err': 0, > > 'sum_wr_access': 1335, > > 'sum_wr_err': 1, ___________________________________________ Date: Fri, 31 Oct 2008 16:29:51 -0500 Hi Art, we still looking on the issue of minos files written to the tape too often. Dmitri came up with possible solution. We will try to change store queue parameter on the pool but we will defer this change to Monday to observe system behavior. Sorry for inconvenience, Alex. ___________________________________________ Date: Tue, 04 Nov 2008 14:33:03 -0600 (CST) From: Dmitry Litvintsev To: Alex Kulyavtsev Cc: Arthur Kreymer , swhicks@fnal.gov, dcache-admin@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 118354 has additional info. Hi Art, We believe that since yesterday the 24 hour policy has been reinstated and is in fact observed. Let us know if you still find exceptions from this rule. The policy has been set in effect only on pools belonging to RawDataWritePools. ___________________________________________ Yes, this looks OK, no writes since around noon 4 Nov, as of 07:30 5 Nov, in fardet_data ___________________________________________ ######## # FARM # ######## PEND - have 16/24 subruns for N00012636_*.cosmic.sntp.cedar_phy.0.root 5 Right, these subruns 16-23 don't appear in Ben's to-be-processed lists (nor do they appear in the 'suppressed' lists). So this run should be forced out. SRV1> ./roundup -n -f 0 -s N00012636 -r cedar_phy near MISSING N00012636_0016..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0017..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0018..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0019..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0020..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0021..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0022..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0023..cosmic.sntp.cedar_phy.0.root SRV1> ./roundup -f 0 -s N00012636 -r cedar_phy near Tue Jul 8 11:13:15 CDT 2008 SRV1> ln -sf roundup.20080703 roundup # was roundup.20080624 This puts in the 1+ minute delay after concatenation, so that fresh file will be written. 
And abandons SRM_CONFIG, we now use X509_USER_PROXY SRV1> ./roundup -s N00012636 -r cedar_phy near Picked up the other partial run after missing subruns were rerun N00012596 Forced out the remaining run, which spanned months N00012681 SRV1> ./roundup -f 0 -r cedar_phy near ########## # CONDOR # ########## Added rbpatter ( Patterson ) to Minos Analysis and Production groups. ============================================================================= 2008 07 07 ============================================================================= ########## # CONDOR # ########## Date: Mon, 07 Jul 2008 15:16:42 -0500 (CDT) Subject: HelpDesk ticket 118305 ___________________________________________ Short Description: Minos Cluster - condor 7.0.3 preinstallation run2-sys : Please install the following RPM on nodes minos01 thru minos25 . http://fermigrid.fnal.gov/files/condor/condor-7.0.3-linux-x86-rhel3-dynamic-1.i386.rpm This rpm places new files in /opt/condor-7.0.3, and should not interfere with existing operations. I will submit a separate request for configuration files when we are ready. ___________________________________________ Date: Mon, 07 Jul 2008 15:48:35 -0500 (CDT) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 07 Jul 2008 16:04:33 -0500 (CDT) Solution: schmitz@fnal.gov sent this solution: installed new condor rpms ___________________________________________ Date: Mon, 07 Jul 2008 17:32:45 -0500 Putting it back now. Mark ___________________________________________ Condor 7.0.1 has been reinstalled and configured, and test jobs have been run. Thanks to the run2-sys group for quickly correcting this ! We seem to be back up. Running jobs probably were lost. The outage seems to have been from 16:00 to 18:00 CDT. Sorry about this ! Probably an 'rpm --upgrade' was done instead of the intended 'rpm --install' . ___________________________________________ Glidein jobs - last success 16:00, first new job 18:05 N.B. Files in /local/stage1/condor had wrong ownership, daemon instead of condor. How did that happen ? MIN > for NODE in ${NODES} ; do printf "$NODE "; ssh -ax ${NODE} 'du -sm /opt/condor-*' ; done minos01 1 /opt/condor-6.8.6 1 /opt/condor-6.9.5 1 /opt/condor-7.0.1 242 /opt/condor-7.0.3 ... minos25 1 /opt/condor-6.8.6 1 /opt/condor-6.9.5 1 /opt/condor-7.0.1 242 /opt/condor-7.0.3 minos26 237 /opt/condor-6.8.6 MINOS01 > cd /opt MINOS01 > dds total 44 drwxr-xr-x 9 root root 4096 Jul 7 15:54 ./ drwxr-xr-x 32 root root 4096 Dec 4 2007 ../ lrwxrwxrwx 1 root root 12 Apr 16 14:07 condor -> condor-7.0.1/ drwxr-xr-x 6 root root 4096 Jul 7 15:54 condor-6.8.6/ drwxr-xr-x 4 root root 4096 Jul 7 15:54 condor-6.9.5/ drwxr-xr-x 5 root root 4096 Jul 7 15:54 condor-7.0.1/ drwxr-xr-x 13 root root 4096 Jul 7 15:54 condor-7.0.3/ ########## # DCACHE # ########## Date: Mon, 07 Jul 2008 11:26:40 -0500 (CDT) From: HelpDesk ___________________________________________ Short Description: Most FNDCA readPools pools are down Problem Description: dcache-admin : A user reported a failure to open a file in DCache this morning. Only two of the 13 readPools pools seem to be active. 
r-stkendca16a-6 r-stkendca9a-2 The rest are absent from http://fndca.fnal.gov:2288/cellInfo and http://fndca.fnal.gov:2288/queueInfo ___________________________________________ Sent poolstat summary to this ticket MINOS26 > ./poolstat.20080707 verb Mon Jul 7 11:47:27 CDT 2008 DOWN TOT POOL GROUP 14 ExpDbWritePools 4/ 10 FermigridVolPools v-gwdca01-1 v-gwdca01-2 v-stkendca6a-1 v-stkendca6a-2 1/ 12 KTeVReadPools r-stkendca13a-2 15 MinosPrdReadPools 8 RawDataWritePools 11/ 13 readPools r-gwdca01-1 r-gwdca01-2 r-stkendca13a-5 r-stkendca13a-6 r-stkendca14a-5 r-stkendca14a-6 r-stkendca15a-5 r-stkendca15a-6 r-stkendca16a-5 r-stkendca6a-1 r-stkendca6a-2 6/ 16 writePools w-gwdca01-1 w-gwdca01-2 w-stkendca12a-4 w-stkendca12a-6 w-stkendca6a-1 w-stkendca6a-2 ___________________________________________ Two files are stuck in the generate write pools _ PNFS status for /pnfs/minos/reco_near/cedar_phy/cand_data/2007-04/N00012120_0001.spill.cand.cedar_phy.0.root -rw-r--r-- 1 rubin e875 395142882 Jul 3 23:48 N00012120_0001.spill.cand.cedar_phy.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:956eded9;l=395142882; LEVEL 4 PNFS status for /pnfs/minos/reco_near/cedar_phy/cand_data/2007-04/N00012120_0003.cosmic.cand.cedar_phy.0.root -rw-r--r-- 1 rubin e875 112667526 Jul 3 23:50 N00012120_0003.cosmic.cand.cedar_phy.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:5440fe70;l=112667526; LEVEL 4 These are still in /minos/data/minfarm/WRITE/ ___________________________________________ Date: Mon, 07 Jul 2008 12:38:11 -0500 From: Stanley Hicks Hi, Didn't want you to think nobody is reading your messages; I just don't have an answer yet and am looking for help from others here. Thanks for all the input and I or somebody will be getting back with you on this before too long. Stanley ___________________________________________ Date: Mon, 07 Jul 2008 14:39:06 -0500 Art, "missing" pools stem from the configuration inconsistency when pools are described in PoolManager.conf (and shown on page http://fndca3a.fnal.gov:2288/poolInfo/pools/* - I guess your script starts from there) and these pools are not actually connected to the head node. E.g. gwdca01 and stkendca6a are test pools and were used to test dcache v1.8 before moving in production. We are looking to fix configuration. Alex. ___________________________________________ Date: Mon, 07 Jul 2008 16:03:44 -0500 Hi Art, besides few pools from test system described in configuration, there were ten pools which did not start properly after dcache upgrade due to communication error during pool startup. I restarted these ten pools, the list is attached below. Thanks for noticing and reporting the issue. For helpdesk: please close the ticket, we filed separate ticket in dcache support on dcache.org to address the root cause in the dcache code. Alex. ___________________________________________ Date: Mon, 07 Jul 2008 22:12:41 +0000 (UTC) Thanks, I see that the pools are back as advertised. We are still missing two files : /pnfs/minos/reco_near/cedar_phy/cand_data/2007-04/N00012120_0001.spill.cand.cedar_phy.0.root /pnfs/minos/reco_near/cedar_phy/cand_data/2007-04/N00012120_0003.cosmic.cand.cedar_phy.0.root I could easily remove and rewrite these to PNFS. But for present, I will leave them alone for your investigation. ___________________________________________ Date: Tue, 08 Jul 2008 19:03:32 -0500 Hi Art, each file was present in two write pools : precious copy (in now dead pool) and the cached copy on other pool. 
We checked CRC for existing copy and set files to precious thus both files were written to tapes. Please close the ticket. Best regards, Alex, Vladimir. ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ Mon Jul 7 14:13:55 CDT 2008 DOWN TOT POOL GROUP 14 ExpDbWritePools 4/ 10 FermigridVolPools 1/ 12 KTeVReadPools 15 MinosPrdReadPools 8 RawDataWritePools 11/ 13 readPools 6/ 16 writePools Mon Jul 7 15:32:42 CDT 2008 DOWN TOT POOL GROUP 14 ExpDbWritePools 4/ 10 FermigridVolPools 1/ 12 KTeVReadPools 15 MinosPrdReadPools 8 RawDataWritePools 11/ 13 readPools 4/ 16 writePools Mon Jul 7 16:56:30 CDT 2008 DOWN TOT POOL GROUP 14 ExpDbWritePools 4/ 10 FermigridVolPools 12 KTeVReadPools 15 MinosPrdReadPools 8 RawDataWritePools 4/ 13 readPools 4/ 16 writePools ############ # poolstat # ############ poolstat.20080707 - genaralized pool name match, anything with 2 non-consec - MINOS26 > ln -sf poolstat.20080707 poolstat # was poolstat.20070611 ######### # ADMIN # ######### Created jsm62 account ######### # ADMIN # ######### Date: Mon, 07 Jul 2008 10:56:11 -0500 (CDT) Subject: HelpDesk ticket 118265 ___________________________________________ Short Description: Default login shell on Minos Cluster - please change to bash Problem Description: run2-sys : Please change the default login shell for new users on the Minos Cluster from /usr/local/bin/tcsh to /bin/bash This is the preferred shell, in policy and practice. ___________________________________________ Date: Mon, 07 Jul 2008 11:00:53 -0500 (CDT) This ticket has been reassigned to ALLEN, JASON of the CD-SF/FEF Group. ___________________________________________________________________ Date: Tue, 05 Aug 2008 17:00:03 -0500 (CDT) Solution: jonest@fnal.gov sent this solution: > I updated the /root/bin/add_minos_user script. >> I edited the function 'user_info' to use /bin/bash as the default >> shell rather than the user's >> fnalu account shell. > ___________________________________________________________________ ######## # FARM # ######## cedar_phy near stuck , Sun Jul 6 20:17:06 CDT 2008 PURGING WRITE files 442 ... rm: remove write-protected regular file `N00012620_0013.cosmic.cand.cedar_phy.0.root'? -rw-r--r-- 1 minfarm numi 801602717 Jul 6 16:26 /home/minfarm/ROUNTMP/WRITE/N00012620_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 minfarm numi 801602717 Jul 6 16:26 /home/minfarm/ROUNTMP/WRITE/N00012620_0000.spill.sntp.cedar_phy.0.root This is not group writeable, SRV1> ls -l ~/ROUNTMP/WRITE/ | grep minospro | grep 'r--r--' -rw-r--r-- 1 minospro numi 112909866 Jul 3 00:25 N00012620_0013.cosmic.cand.cedar_phy.0.root -rw-r--r-- 1 minospro numi 111808381 Jul 3 00:27 N00012626_0016.cosmic.cand.cedar_phy.0.root -rw-r--r-- 1 minospro numi 82860929 Jul 3 00:28 N00012633_0006.cosmic.cand.cedar_phy.0.root MIN > ssh -l minospro minos26 PRO> cd /minos/data/minfarm/WRITE PRO> for SUB in N00012620_0013 N00012626_0016 N00012633_0006 ; do ls -l ${SUB}.cosmic.cand.cedar_phy.0.root ; done PRO> for SUB in N00012620_0013 N00012626_0016 N00012633_0006 ; do chmod 664 ${SUB}.cosmic.cand.cedar_phy.0.root ; done 22616 pts/1 T 0:00 | \_ rm N00012620_0013.cosmic.cand.cedar_phy.0.root kill 22616 kill -9 22616 This is now a Zombie process Killed off the parent, also needed -9 15813 pts/1 Z 0:01 | \_ [roundup] Killed loopCPn, will wait for DCache to recover before restarting. 
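A pre-check for this class of problem would keep the purge from stalling on an rm
prompt again (a minimal sketch; the WRITE path is the one used above, and the
commented fix-up is the same chmod applied interactively today):

# Sketch : flag files in the WRITE area that are not group writeable .
WRITE=/minos/data/minfarm/WRITE
find ${WRITE} -maxdepth 1 -type f ! -perm -020 -ls
# run as the owning account ( minospro here ) to apply the same fix :
# find ${WRITE} -maxdepth 1 -type f ! -perm -020 -exec chmod 664 {} \;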
############ # BLUWATCH # ############ bluwatch.20080707 Created lasterr directory, shifted files there. Renamed latest to last, for consistency, symlink for transition Renamed LASTERR to LASTBAD for consistency with log message. cd ${MINOS_DATA}/log_data/bluwatch mkdir lastbad mkdir lastslo mv latest last ; ln -s last latest mv *.txt lastbad/ Stopped and restarted minos26, looks OK. ln -sf bluwatch.20080707 bluwatch Restarted minos25, minos01, minos-sam03, fnpcsrv1 cd ${MINOS_DATA}/log_data/bluwatch mv LASTERR LASTBAD Added README file. Started up on fnpcsrv1 at 09:59 Jul 7 12:25:30 fnpcsrv1 automount[14027]: expired /minos/data ... Jul 7 12:53:02 fnpcsrv1 automount[10043]: attempting to mount entry /minos/data Started getting failures, 12:12:14 12:17:14 12:19:14 12:21:15 Started up a tickle script interactively, around 12:10 while true ; do sleep 50 ; ls -d /minos/data/minfarm 2>&1 > /dev/null ; done This disappeared, put it in a script ./blutickle & # around 13:25 ####### # SAM # ####### Date: Sat, 05 Jul 2008 16:24:22 -0500 From: Rashid Mehdiyev Hi Art, do you know what is wrong with this SAM query command below ? I used it this way in Jan,08, but now it does not return me anythinng interesting,,, sam list files --dim="run_type physics% and data_tier mc-far and mc.beam='L010185N' and mc.bfield='1' and mc.flavor='0' and mc.release='daikon_00' and mc.vtxregion='3' and run_number>=0 and run_number<=1111" No files match the given constraints. ------------------------------------------------------------------------- SAMDIM='run_type physics% and data_tier mc-far and mc.beam=L010185N \ and mc.bfield=1 \ and mc.flavor=0 \ and mc.release=daikon_00' MINOS26 > sam list files --summaryonly --dim="${SAMDIM}" File Count: 586 SAMDIM='run_type physics% and data_tier mc-far and mc.beam=L010185N \ and mc.bfield=1 \ and mc.flavor=0 \ and mc.release=daikon_00 \ and mc.vtxregion=3 \ ' MINOS26 > sam locate f21011015_0000_L010185N_D00.reroot.root ['/pnfs/minos/mcin_data/far/daikon_00/L010185N/101,317@voc139'] I see no vtxregion 3 files : MINOS26 > ls /pnfs/minos/mcin_data/far/daikon_00/L010185N/*/f23* ls: /pnfs/minos/mcin_data/far/daikon_00/L010185N/*/f23*: No such file or directory ============================================================================= 2008 07 03 ============================================================================= 17:06:45, adjusted bluwatch sleep on fnpcsrv1 to 58 seconds Also note the last expiration in /var/log/messages was Jul 3 13:28:55 fnpcsrv1 automount[17005]: expired /minos/data Perhaps Steve adjusted something. ########## # DCACHE # ########## Created HOWTO.dcachetest Need to replicate the pre-upgrade tests into scripts, and run these regularly, perhaps monthly. ######## # FARM # ######## Started up aggressive cedar_phy near and far concatenation, SRV1> ./loopCPn & SRV1> ./loopCPf & Will stop these when they have caught up in a day or so, over the weekend. 
loopCPf got caught up, only one run to process, F00037968 The script is unable to move the files to target Bluearc areas, Set STOP flag touch /minos/data/minfarm/STOP.cedar_phynear mv: cannot move `N00012004_0000.cosmic.sntp.cedar_phy.0.root' to `/minos/data/reco_near/cedar_phy/sntp_data/2007-04/N00012004_0000.cosmic.sntp.cedar_phy.0.root': Permission denied ln: `N00012004_0000.cosmic.sntp.cedar_phy.0.root': File exists MINOS26 > ls -ld /minos/data/reco_near/cedar_phy/sntp_data/2007-04 drwxr-xr-x 2 mindata e875 2048 Feb 15 11:07 /minos/data/reco_near/cedar_phy/sntp_data/2007-04 GRRRRRRRRRRRRRRRR Directories under /minos/data/reco_near/cedar_phy/sntp_data 2007-04 through 2008-12 are mode 755, not 775. How did this happen ? Why did these all get created on Feb 15 11:07 ? $ chmod 775 /minos/data/reco_near/cedar_phy/sntp_data/2007-* $ chmod 775 /minos/data/reco_near/cedar_phy/sntp_data/2008-* This is a mess, will have to stop and restart loopCPn, and try to move and repair these symlinks. The previous pass was on cand_data, so no problem there. SRV1> grep 'cannot move' cedar_phynear.log | wc -l 26 MFILES=`grep 'cannot move' cedar_phynear.log | cut -c 18- | cut -f 1 -d "'"` for FILE in ${MFILES} ; do ls -l ${FILE} BLUE=/minos/data/reco_near/cedar_phy/sntp_data/2007-04/${FILE} mv ${FILE} ${BLUE} ln -s ${BLUE} ${FILE} done WHEW - looks OK now. rm /minos/data/minfarm/STOP.cedar_phynear ####### # TWW # ####### Regarding ticket 087003 Date: Mon, 30 Jun 2008 12:59:59 -0500 (CDT) From: Margaret_Greaney I have not heard back from Frank Nagy on this, but from what I see the upgrade of TWW caused new perl modules to be available and kcroninit does work on my attempts on fnalu on linux nodes. ------------------------------------ Still fails for me, same way, FLXI04 > kcroninit Can't locate Net/Domain.pm in @INC (@INC contains: /usr/krb5/lib /opt/TWWfsw/libdb42/lib/perl586 /opt/TWWfsw/imagemagick62/lib/perl586 /opt/TWWfsw/readline50/lib/perl586 /opt/TWWfsw/pe FLXI04 > type perl perl is /opt/TWWfsw/bin/perl FLXI04 > ls -l /opt/TWWfsw/bin/perl lrwxr-xr-x 1 kevinh root 41 May 23 2006 /opt/TWWfsw/bin/perl -> /opt/TWWfsw/perl586/bin/.perl.tww-wrapper FLXI04 > ls -l /opt/TWWfsw/perl586/bin/.perl.tww-wrapper -rwxr-xr-x 1 kevinh root 12363 Apr 11 2006 /opt/TWWfsw/perl586/bin/.perl.tww-wrapper ######### # ADMIN # ######### MINOS01 > cmd add_minos_user jsm62 INVALID: jsm62 does NOT have a valid fnalu account. Informed minos-admin and jsm62. Date: Mon, 07 Jul 2008 16:44:52 +0100 From: Jessica Mitchell I now have a FNALU account, so if you could set up my minos one that would be great! Created account at 10:45 ######### # ADMIN # ######### scan for AFS AUTOMATIC needing reset MIN > for NODE in ${UNODES} ; do printf "$NODE " ssh -ax ${NODE} 'grep ^OPTIONS= /etc/sysconfig/afs' ; done flxi02 OPTIONS=$LARGE flxi03 OPTIONS=$LARGE flxi04 OPTIONS=$MEDIUM flxi05 OPTIONS=$SMALL flxi06 OPTIONS=$MEDIUM flxi07 OPTIONS=$LARGE flxi09 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). You have new mail in /var/spool/mail/kreymer MIN > date Thu Jul 3 14:50:49 UTC 2008 See ticket 117526 ########### # BLUEARC # ########### Date: Thu, 03 Jul 2008 07:53:08 -0500 From: Andy Romero To: site-nas-announce@fnal.gov Subject: BlueArc problems Both RHEA cluster nodes are being rebooted. Should be back in ~15min more info to come CMS, Minos, FermiGrid and Windows are effected. 
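To see which monitored nodes saw this outage, the published bluwatch logs can be scanned for BAD entries. A sketch only, using the node list and URL pattern quoted in the /minos/data ticket later in this log:

for NODE in fnpcsrv1 minos-sam03 minos01 minos25 minos26 ; do
    echo "==== ${NODE}"
    wget -q -O - http://www-numi.fnal.gov/computing/dh/bluwatch/log/${NODE}.txt | grep BAD
done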
Date: Thu, 03 Jul 2008 08:35:04 -0500 From: Andy Romero To: site-nas-announce@fnal.gov Subject: BlueArc NAS service back on-line BlueArc NAS service back on-line ----------------------------------- bluwatch shows errors on minos nodes at 08:01 and 08:02 Also 08:21 and 08:23 on fnpcsrv1 ============================================================================= 2008 07 02 ============================================================================= ######## # FARM # ######## Clear out some cedar_phy candidates, and test new roundup ( no SRM_CONFIG ) SRV1> ./roundup.20080703 -b 2 -s cand -r cedar_phy near MINOS26 > sam locate N00012004_0000.cosmic.cand.cedar_phy.0.root ['/pnfs/minos/reco_near/cedar_phy/cand_data/2007-04,3@dcache'] MINOS26 > sam locate N00012004_0000.spill.cand.cedar_phy.0.root ['/pnfs/minos/reco_near/cedar_phy/cand_data/2007-04,3@dcache'] OK, let's do the rest of cand's on hand Rates are pretty lousy, 2 MB/second for these 2 files. Check this out tomorrow, using NULL, try dccp x509 again, after the DCache 1.8 upgrade SRV1> ./roundup.20080703 -s cand -r cedar_phy near ######## # DATA # ######## Clearing 1.3 TB of space, from removal of bad-field D04 data 2008 04 29 SRV1 > du -sm /minos/data/BAD/D4CLEAN 1372961 /minos/data/BAD/D4CLEAN SRV1> date ; time rm -r /minos/data/BAD/D4CLEAN Wed Jul 2 14:42:57 CDT 2008 real 14m56.042s user 0m0.028s sys 0m0.657s ######## # FARM # ######## Date: Wed, 02 Jul 2008 14:25:59 -0500 (CDT) Subject: HelpDesk ticket 118137 ___________________________________________ Short Description: Please create minospro account on minos26 Problem Description: run2-sys : Recent changes in Grid authentication have changed file ownership of new Minos production analysis files from rubin to minospro. We need a local minospro account in order to manage these files. Please create account minospro, group e875 on node minos26, with a local login area similar to mindata, probably /home/minospro. For initial access, please copy /home/mindata/.k5login to /home/minospro/ ( and change ownership to minospro ) ___________________________________________ Date: Wed, 02 Jul 2008 14:35:33 -0500 (CDT) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 02 Jul 2008 14:50:27 -0500 (CDT) Solution: ettab@fnal.gov sent this solution: Account has been created. ___________________________________________ Date: Wed, 2 Jul 2008 20:39:43 +0000 (UTC) From: Arthur Kreymer To: ettab@fnal.gov Thanks ! I have logged in successfully. Can you change the login shell to /bin/bash ? ___________________________________________ ___________________________________________ ########## # DCACHE # ########## Per blake,ochoa,arms Date: Wed, 02 Jul 2008 10:43:55 -0500 (CDT) Subject: HelpDesk ticket 118111 ___________________________________________ Short Description: Password reset for mindata ftp access Problem Description: dcache-admin : The password for ftp read access by user mindata seems to have changed after the 24 June upgrade to DCache 1.8. Please contact me (x4261) to arrange a reset of the password. ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Thu, 03 Jul 2008 15:09:58 -0500 From: Timur Perelmutov Could you please try using the weak ftp door again? I think we found and fixed the problem. 
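The verification was a plain ftp session against the weak (unsecured) door. A sketch of the manual check, using the host, port and account from this thread; the password is typed at the prompt, never stored in a script:

$ ftp fndca1.fnal.gov 24126
Name (fndca1.fnal.gov:mindata): mindata
Password:                  ( the usual mindata password )
ftp> ls
ftp> bye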
___________________________________________ Date: Thu, 03 Jul 2008 15:11:40 -0500 From: Timur Perelmutov I spoke with Art on the phone and he confirmed that the door works again. ___________________________________________ Date: Thu, 03 Jul 2008 20:18:32 +0000 (UTC) From: Arthur Kreymer Timur reports the door being repaired around 15:10 today Thu 3 July, I tested an 'ls' command successfully at ftp fndca1.fnal.gov 24126 using account mindata, and the usual password. Thanks, and have a good Holiday weekend ! ___________________________________________ ___________________________________________ ############ # MCIMPORT # ############ MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N 2235895 /minos/data/mcimport/STAGE/daikon_04/L010185N 126550 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 368713 /minos/data/mcimport/STAGE/daikon_04/L010185N_helium 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N 27834 /minos/data/mcimport/STAGE/daikon_04/L250200N MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04 3599213 /minos/data/mcimport/STAGE/daikon_04 ############ # MCIMPORT # ############ 10:58 Restarted crontab, now that we have a working mcimport script again. crontab crontab.dat ####### # DAQ # ####### Checked for missing near dcs files, on dcsdcp-nd, using a password different than for daqdcp-nd, but in an obvious way. No files since June 27 21:26 N080628_000002.mdcs.root The archiver is running, there are no input files. Reported to Run Coordinator habig, he will restart the dcs scripts. ============================================================================= 2008 07 01 ============================================================================= ########### # MONTHLY # ########### DATASETS 7/1 PREDATOR 7/1 VAULT 7/3 MYSQL 7/2 Wed Jul 2 09:50:30 CDT 2008 Wed Jul 2 10:26:03 CDT 2008 ############ # PREDATOR # ############ Oops, restarted predator cronjob, was down since Monday morning, due to DCache outage. ########## # CONDOR # ########## Drafting 7.0.3 installation request ticket Short Description: Minos Cluster - condor 7.0.1 preinstallation run2-sys : Please install the following RPM on nodes minos01 thru minos25 . http://fermigrid.fnal.gov/files/condor/condor-7.0.3-linux-x86-rhel3-dynamic-1.i386.rpm This rpm places new files in /opt/condor-7.0.3, and should not interfere with existing operations. I will submit a separate request for configuration files when we are ready. 
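Once the install request is filled, a quick way to confirm the rpm landed everywhere (a sketch, assuming the ${NODES} cluster list used elsewhere in this log; the rpm is expected to create /opt/condor-7.0.3):

for NODE in ${NODES} ; do
    printf "${NODE} "
    ssh -ax ${NODE} 'ls -d /opt/condor-7.0.3 2>/dev/null || echo MISSING'
done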
___________________________________________ ######## # DATA # ######## mcimport Clearing space in /minos/data, $ du -sm /minos/data/mcimport/OVERLAY/mcin/DUP 77963 /minos/data/mcimport/OVERLAY/mcin/DUP These are mostly from 2007 Dec and 2008 Jan 09, one from Feb 20 17:44 n13037306_0006_L010185N_D04.reroot.root Removed them all ANALYSIS $ du -sm /minos/data/analysis/* 110141 /minos/data/analysis/NuMuBar 62019 /minos/data/analysis/database 466571 /minos/data/analysis/nc 672500 /minos/data/analysis/nonap 1404200 /minos/data/analysis/nue USERS 18 /minos/data/users/bckhouse 506457 /minos/data/users/boehm 1 /minos/data/users/jjling 29 /minos/data/users/kreymer 72289 /minos/data/users/loiacono 1 /minos/data/users/minsoft 142391 /minos/data/users/mishi 104 /minos/data/users/nickd 2289641 /minos/data/users/pawloski 1 /minos/data/users/rhatcher 102630 /minos/data/users/rmehdi 27825 /minos/data/users/rustem CAND strays Nothing in reco_near or reco_far SRV1> ls -l /minos/data/mcout_data/*/*/*/*/cand_data/* /minos/data/mcout_data/daikon_04/L250200N/near/cedar_phy_bhcurv/cand_data/700: total 1131008 -rw-rw-r-- 1 minospro numi 1158151211 Apr 21 20:20 n13037004_0009_L250200N_D04.cand.cedar_phy_bhcurv.1.root /minos/data/mcout_data/daikon_04/L250200N/near/cedar_phy_bhcurv/cand_data/701: total 1129128 -rw-rw-r-- 1 minospro numi 1156226310 Apr 21 20:21 n13037014_0008_L250200N_D04.cand.cedar_phy_bhcurv.1.root Remove these new minospro@minos26 account Oops, the directory is owned by minfarm, so do it from that account. SRV1> rm -r /minos/data/mcout_data/*/*/*/*/cand_data ########## # CONDOR # ########## Scanning /local/stage1 sizes ( symlinked to /local/scratch??/stage1 ) to see whether this could be decoupled from /local/scratch device to prevent full-disk problems. MIN > for NODE in ${NODES} ; do printf "$NODE " ; ssh -ax ${NODE} 'du -sm /local/scratch??/stage1' ; done minos01 26 /local/scratch01/stage1 minos02 32 /local/scratch02/stage1 minos03 81 /local/scratch03/stage1 minos04 77 /local/scratch04/stage1 minos05 76 /local/scratch05/stage1 minos06 68 /local/scratch06/stage1 minos07 34 /local/scratch07/stage1 minos08 65 /local/scratch08/stage1 minos09 63 /local/scratch09/stage1 minos10 76 /local/scratch10/stage1 minos11 1 /local/scratch11/stage1 minos12 67 /local/scratch12/stage1 minos13 1 /local/scratch13/stage1 minos14 78 /local/scratch14/stage1 minos15 80 /local/scratch15/stage1 minos16 64 /local/scratch16/stage1 minos17 67 /local/scratch17/stage1 minos18 70 /local/scratch18/stage1 minos19 61 /local/scratch19/stage1 minos20 70 /local/scratch20/stage1 minos21 79 /local/scratch21/stage1 minos22 64 /local/scratch22/stage1 minos23 60 /local/scratch23/stage1 minos24 75 /local/scratch24/stage1 minos25 458 /local/scratch25/stage1 minos26 1 /local/scratch26/stage1 Perhaps these file could go under /var, say /var/condor MIN > for NODE in ${NODES} ; do printf "$NODE " ; ssh -ax ${NODE} 'df -h /var/tmp | grep /var' ; done minos01 /dev/hda7 13G 285M 12G 3% /var minos02 /dev/hda6 22G 273M 21G 2% /var minos03 /dev/hda6 22G 269M 21G 2% /var minos04 /dev/hda6 22G 373M 21G 2% /var minos05 /dev/hda6 22G 268M 21G 2% /var minos06 /dev/hda6 22G 254M 21G 2% /var minos07 /dev/hda6 22G 260M 21G 2% /var minos08 /dev/hda6 22G 285M 21G 2% /var minos09 /dev/hda6 22G 273M 21G 2% /var minos10 /dev/hda6 22G 269M 21G 2% /var minos11 /dev/hda6 22G 138M 21G 1% /var minos12 /dev/hda6 22G 2.0G 19G 10% /var minos13 /dev/hda6 22G 261M 21G 2% /var minos14 /dev/hda6 22G 263M 21G 2% /var minos15 /dev/hda6 22G 260M 21G 2% /var minos16 /dev/hda6 
22G 261M 21G 2% /var minos17 /dev/hda6 22G 264M 21G 2% /var minos18 /dev/hda6 22G 267M 21G 2% /var minos19 /dev/hda6 22G 262M 21G 2% /var minos20 /dev/hda6 22G 266M 21G 2% /var minos21 /dev/hda6 22G 576M 21G 3% /var minos22 /dev/hda6 22G 265M 21G 2% /var minos23 /dev/hda6 22G 265M 21G 2% /var minos24 /dev/hda6 22G 259M 21G 2% /var minos25 /dev/hda6 22G 268M 21G 2% /var minos26 /dev/hda6 22G 416M 21G 2% /var ######## # DATA # ######## Date: Tue, 01 Jul 2008 11:20:37 -0500 (CDT) Subject: HelpDesk ticket 118050 ___________________________________________ Short Description: /minos/data error report from fnpcsrv1 Problem Description: Since about May 20, we have been running tests of file access to /minos/data, from fnpcsrv1.txt minos-sam03.txt minos01.txt minos25.txt minos26.txt The test script reads a few bytes from a different file every minute. /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/bluwatch We have seen failures only on fnpcsrv1 since 1 June. The failures on fnpcsrv1 continue on occasion. See the lines containing 'BAD' at http://www-numi.fnal.gov/computing/dh/bluwatch/log/fnpcsrv1.txt ___________________________________________ Date: Thu, 03 Jul 2008 15:31:00 -0500 (CDT) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/BLU Group. ___________________________________________ Date: Thu, 03 Jul 2008 15:31:00 -0500 (CDT) Note To Requester: I am assigning this ticket to the bluearc admins. All of those times correspond to times where we also saw failures in other parts of FermiGrid monitoring. The latest one at 7/3 08:23 was due to the reboot of the bluearc system, so maybe that will have fixed the problem. Steve Timm ___________________________________________ Date: Thu, 03 Jul 2008 16:36:25 -0500 (CDT) Note To Requester: I am corelating your file with the BlueArc logs. Some of the times in your log do match up to days on which we had known problems. for example: May 20 ~19:30 June 1 ~05:30 ... throughout the morning July 3 ~08:00 (today) Other times corelate to backups and snapshot management Other times do not corelate to anything. (possible net or host issues ?) Andy ___________________________________________ Date: Thu, 03 Jul 2008 22:11:58 +0000 (UTC) From: Arthur Kreymer <-- # @@@ Enter Update below this line. @@@ # --> Thanks for checking the BlueArc end. The latest failure on fnpcsrv1 is at Thu Jul 3 08:23:18 The following lines from /var/log/messages may be relevant : Jul 3 08:23:18 fnpcsrv1 automount[16061]: expired /minos/data ... Jul 3 08:25:18 fnpcsrv1 automount[10043]: attempting to mount entry /minos/data Jul 3 08:25:18 fnpcsrv1 automount[16384]: mount(nfs): mounted minos-nas-0.fnal.gov:/minos/data on /minos/data Jul 3 08:26:18 fnpcsrv1 automount[16472]: expired /minos/data It seems that the automount mount of /minos/data expired just as an access was being attempted. It seems that the automounter mounts for /minos/data expire in 1 minute, precisely the time that I wait between file checks. This is probably why I have occasional failures in my test scripts, but we have not seen global problems in farm processing. I would suggest a much more gentle automount timeout, like 1 hour. I do see that the /home/data expirations stopped after 13:28:55, so perhaps Steve has already adjusted something. <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ ####### # SAM # ####### Date: Fri, 27 Jun 2008 11:05:59 -0500 From: Stephen P. 
White To: Arthur Kreymer Cc: sam-design mailing list Subject: Fix for Minos cached_files data. Art, Dbserver v8_4_5 fixes the cached_files bug that was introduced in v8_4_1. Please upgrade your dbserver to v8_4_5. Also, I've created a database fix for cached_file records that were created by dbservers from v8_4_1 to v8_4_4. After you upgrade the dbserver please run the update statement I've supplied. (Since MINOS has so few cahced_file records this can be done in 1 update statement.) update cached_files set owner_work_grp_id=uncaching_work_grp_id, uncaching_work_grp_id=null where cached_file_id in (select cached_file_id from cached_files where owner_work_grp_id is null and uncaching_work_grp_id is NOT null) Call me if you have questions..... Steve ------------------------------------------------------------ export SAM_ORACLE_CONNECT="samdbs/" bin/rlwrap sqlplus samdbs/@minosdev select cached_file_id from cached_files where owner_work_grp_id is null and uncaching_work_grp_id is NOT null ; This list is empty for dev and int There are 32 rows in prd CACHED_FILE_ID -------------- 596229 596235 596236 596237 596242 596250 596060 596064 596022 596023 596024 CACHED_FILE_ID -------------- 596025 596038 596043 596048 595743 595556 595557 595607 595621 595625 595626 CACHED_FILE_ID -------------- 595481 595482 595483 595491 595492 595493 595499 595402 595403 595405 32 rows selected. update cached_files set owner_work_grp_id=uncaching_work_grp_id, uncaching_work_grp_id=null where cached_file_id in (select cached_file_id from cached_files where owner_work_grp_id is null and uncaching_work_grp_id is NOT null) ; 32 rows updated. SQL> select cached_file_id from cached_files 2 where owner_work_grp_id is null and 3 uncaching_work_grp_id is NOT null ; no rows selected Tue Jul 1 11:44:11 CDT 2008 ####### # SAM # ####### On minos-sam01 and minos-sam02, upd install -j sam_db_srv_pkg v8_4_5 Fired this up, failed. Traceback (most recent call last): File "/home/sam/products/db_server_base_cx/v1_8/NULL/bin/DbListener.py", line 31, in ? import DbCORBAomni File "/home/sam/products/db_server_base_cx/v1_8/NULL/lib/DbCORBAomni.py", line 12, in ? from omniORB import CORBA ImportError: No module named omniORB Killed process: 6016 I needed to do upd install without the -j upd install sam_db_srv_pkg v8_4_5 informational: installed sam_pnfs_srv v8_4_1. informational: installed sam_dimension_server_prototype v8_4_2. informational: installed sam_config v7_1_6. informational: installed sam_idl_pylib v8_4_1. informational: installed omniORB v4_1_2. informational: installed sam_common_pylib v8_4_3. informational: installed sam_server_pylib v8_4_2. informational: installed sam_db_srv v8_4_5. informational: installed python v2_4_5. Warning: For product "sam_config"local node version v7_1_5 does not match distribution node version v7_1_6 Warning: For product "sam_config"local node version v7_1_5 does not match distribution node version v7_1_6 upd install succeeded. This works, running on dev/int Tested with 1/10/100 file projects, looks clean. Installed products on minos-sam01 also, same list. MINOS26 > sam get dbserver connection info Number of connections: 1 11:04 restarted production dbserver UNI=prd for N in 1 2 3 4 5 6 7 8 9 10 ; do echo ${N} date ./sam_test_py minos ${UNI} st-onesmall ./sam_test_py minos ${UNI} st-ten ./sam_test_py minos ${UNI} st-cen done ; date Tue Jul 1 11:06:19 CDT 2008 ... oops, that was done in dev. Repeated in prd Tue Jul 1 11:20:53 CDT 2008 ... 
Tue Jul 1 12:06:59 CDT 2008

=============================================================================
2008 06 30
=============================================================================

Kreymer unavailable since 24 June due to family emergency

Summary of activity during that period :
Condor 7.0.3 is available - minosadmin
stken/fndca downtime Wed - stkusers
pnfs restore Wed 25 Jun 16:41:17 - stkusers
stken coming up 18:50:35 - stkusers
fndca up 19:46:52 - stkusers
no file loss ?
rubin file access failure 08-06-26 10:17:41 - minosdata
117526 AFS - flxi02, 03 have been updated - minosadmin
cvs key request lwhitehead - minoscvs
sjc SLF 5 query - minosadmin
rubin stray dogwood cands - minosdata
Dbserver v8_4_5 needed, and db repair - minossamadmin
minossoft draft minutes - minossoft
saranen drive request LTO - minosdata
mcimport discussion 27 Jun 2008 14:48:52 arms - minossim
Sun, 29 Jun 2008 09:34:29 ftp access down bspeak - minosdata
Sun, 29 Jun 2008 11:35:59 beam data rhatcher/habig - minosdata
Mon, 30 Jun 2008 06:06:34 enstore down berg - stkusers
Mon, 30 Jun 2008 10:09:35 enstore up zalokar - stkusers

#########
# ADMIN #
#########

Date: Thu, 26 Jun 2008 16:28:31 -0400
From: Stephen Coleman
To: Kregg E Arms
Cc: Arthur Kreymer , Robert Hatcher
Subject: MC and Scientific Linux

Hi, With delivery of W&M's new farm mere weeks away, the lattice QCD co-owner
of the cluster and I are trying to hash out some details regarding software.
He's agreed to use Scientific Linux, but wants the most recent version (5.x)
because he's concerned about up-to-date drivers for all of our Infiniband cards.
Can you think of any reason that this would be a bad idea for either Minossoft
or Daikon/Eggplant/Fava/Garlic production? Our current farm is running obsolete
303, and I know the MINOS farm runs SL Fermi 4.4...
Thanks, -Stephen

My answer :
Minos has no publicly available SLF5 systems at present.
Not on FNALU interactive or batch systems, or the Cluster, or FermiGrid.
So we have no way to test compatibility at present.
SLF5 is being primarily deployed on desktops and laptops,
where bleeding edge hardware support is more important.

########
# FARM #
########

$ RFILE=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/mcin_data/far/daikon_03/CosmicLE/120/f20031209_0001_CosmicLE_D03.reroot.root
$ srmls ${RFILE}
236326221 /pnfs/fnal.gov/usr/minos/mcin_data/far/daikon_03/CosmicLE/120/f20031209_0001_CosmicLE_D03.reroot.root
$ srmcp ${RFILE} file:////tmp/testfile.root
Did this with grid and doe proxy

############
# MCIMPORT #
############

$ du -sm /minos/data/mcimport/*/mcin
6674 /minos/data/mcimport/arms/mcin
1 /minos/data/mcimport/boehm/mcin
1 /minos/data/mcimport/hgallag/mcin
391 /minos/data/mcimport/howcroft/mcin
1 /minos/data/mcimport/kordosky/mcin
12 /minos/data/mcimport/kreymer/mcin
1 /minos/data/mcimport/mtavera/mcin
542 /minos/data/mcimport/mualem/mcin
99561 /minos/data/mcimport/OVERLAY/mcin
1 /minos/data/mcimport/rhatcher/mcin
2149 /minos/data/mcimport/sjc/mcin

lrwxrwxrwx 1 mindata e875 8 May 16 14:08 mcimport.20080516 -> mcimport*
$ dds mcimport*
lrwxrwxrwx 1 mindata e875 17 Feb 15 13:54 mcimport -> mcimport.20080211*
...
mcimport.20080630

cp mcimport.20080516 mcimport.20080630
Added these changes .
/minos/scratch/app/OSG1/setup.sh export X509_USER_PROXY=/home/mindata/.grid/kreymer-grid.proxy $ AFSS/mcimport.20080630 -n -b 1 OVERLAY OK - processing 1 directories OVERLAY OOPS - /minos/data/mcimport/STAGE/OVERLAY not a directory daikon_00 daikon_04 URK - need to resolve mcimport.20080211 ( last version used for import ) mcimport.20080303 ( draft tar version ) mcimport.20080311 ( second draft tar , with local copy ) mcimport.20080326 ( last version used for Tarring, with -s ,dismount time ) mcimport.20080516 ( latest revision, sets X509_USER_PROXY ) TYPO, cp mcimport.20080326 mcimport.20080516 Recovered this from backup, MIN > cp -a /afs/fnal.gov/files/backup/home/room1/kreymer/minos/scripts/mcimport.20080516 . $ AFSS/mcimport.20080630 -n -b 1 OVERLAY OOPS - found /minos/data/mcimport/CRON/mcimport.pid OK - stale pid file OK - processing 1 directories OVERLAY Mon Jun 30 17:53:17 CDT 2008 OK - version mcimport.20080516 processing from /minos/data/mcimport/OVERLAY NOOP BAIL at 1 LOGS STAGE, MCINPURGE, MCINWRITE OK - staging 0 files OK - purging 0 MCIN files ? Mon Jun 30 17:53:17 CDT 2008 MCIN processing 63 files Mon Jun 30 17:53:17 CDT 2008 MCIN configuration n1303 _L010185N_D04.reroot.root srmcp -streams_num=1 -server_mode=active -protocols=gsiftp file:///n13037495_0010_L010185N_D04.reroot.root srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_04/L010185N/749/n13037495_0010_L010185N_D04.reroot.root BAIL aftar 1 files ~/saddmc --declare daikon_04 near/daikon_04/L010185N/749 >> /minos/scratch/mindata/log/saddmc/prd-near-daikon_04-L010185N.log 99565 /minos/data/mcimport/OVERLAY/ 1 /minos/data/mcimport/OVERLAY/tar 1 /minos/data/mcimport/OVERLAY/dcache 99561 /minos/data/mcimport/OVERLAY/mcin 1 /minos/data/mcimport/OVERLAY/mcin/dcache Mon Jun 30 17:53:22 CDT 2008 $ AFSS/mcimport.20080630 -b 1 OVERLAY $ less /minos/data/mcimport/OVERLAY/log/mcimport.log $ dds /pnfs/minos/mcin_data/near/daikon_04/L010185N/749/n13037495_0010_L010185N_D04.reroot.root -rw-r--r-- 1 kreymer e875 365410709 Jun 30 17:55 /pnfs/minos/mcin_data/near/daikon_04/L010185N/749/n13037495_0010_L010185N_D04.reroot.root MINOS26 > sam locate n13037495_0010_L010185N_D04.reroot.root ['/pnfs/minos/mcin_data/near/daikon_04/L010185N/749,30@dcache'] $ AFSS/mcimport.20080630 OVERLAY $ cp -a AFSS/mcimport.20080630 . $ ln -sf mcimport.20080630 mcimport # was mcimport.20080211 Let this cook, then restart the crontab.dat entry. ############ # MCIMPORT # ############ We need a grid proxy, to retain kreymer ownership of mcimport files. cd /local/scratch26/kreymer/grid . /minos/scratch/kreymer/VDT/setup.sh MINOS26 > grid-proxy-info -all -file kreymer-grid.proxy subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=2127860370 issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : kreymer-grid.proxy timeleft : 6432:20:40 (268.0 days) Let's test this mindata@minos-sam03 cd /home/mindata/.grid scp kreymer@minos26:/local/scratch26/kreymer/grid/kreymer-grid.proxy . $ export X509_USER_PROXY=/home/mindata/.grid/kreymer-grid.proxy $ srmcp -debug -streams_num=1 -server_mode=active file:////minos/scratch/parrot/F00031300_0000.mdaq.root srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/grid.dat $ dds /pnfs/minos/NULL/grid.dat -rw-r--r-- 1 kreymer e875 41379 Jun 30 14:33 /pnfs/minos/NULL/grid.dat This proxy was already present on minos26. 
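Since mcimport now depends on this proxy, a periodic lifetime check is worth having. A sketch only, using the proxy path above; the 30 day threshold is illustrative:

# sketch - warn when the mcimport grid proxy is close to expiring
PROXY=/home/mindata/.grid/kreymer-grid.proxy
MINDAYS=30
LEFT=`grid-proxy-info -file ${PROXY} -timeleft`    # seconds remaining
if [ ${LEFT} -lt $(( MINDAYS * 24 * 3600 )) ] ; then
    echo "OOPS - proxy ${PROXY} has only $(( LEFT / 86400 )) days left"
fi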
########## # PARROT # ########## Investigating cache area, on fnpc338 try -t Where to store temporary files. (PARROT_TEMP_DIR) export PARROT_TEMP_DIR=/local/stage1/parrot parrot -m ${PARROT_DIR}/mountfile.grow -d remote -t /local/stage1/parrot bash P> ls -l /tmp/par* ls: /tmp/par*: No such file or directory The environment variable by itself does not seem to work. ########## # DCACHE # ########## Reported Wed 16:31 to 19:47 outage to minos-data, minos_software_discussion ########## # DCACHE # ########## Public enstore down 06:06 - probably several hours. Up at 10:09 - reported to minos-data and msd. ########## # DCACHE # ########## No neardet data since Sunday morning But far continued thru current time ( how, with DCache down ? ) F00041292_0017.mdaq.root Mon Jun 30 14:07:39 UTC 2008 Beam files missing Thu/Fri/Sat/Sun Near and Far DCS look roughly OK Cannot investigate further until the public Enstore system comes back up. ============================================================================= 2008 06 24 ============================================================================= ########## # DCACHE # ########## Tests for DCache 1.8 upgrade 24 June SRM srls OK srmcp read OK srmcp write OK srmmkdir OK ( wrong ownership ) 6/23 DCACHE unsecured dccp OK loon OK DAQ write OK 11 files transferred, at 15:00 Transition, need to shift to OSG 1 for mcimport roundup ########## # DCACHE # ########## kreymer@minos26 cd ~/minos/scripts crontab crontab.dat minfarm@fnpcsrv1 cd /home/minfarm/ROUNTMP mv NOCAT NOCAT.ok mindata@minos26 ######## # FARM # ######## SRV1> cp -a AFSS/roundup.20080624 . SRV1> ln -sf roundup.20080624 roundup # was roundup.20080515 ./roundup -n -s 40403 -f 0 -r cedar far This would work, but ... OOPS - POOLS ACTIVE NEED 14 10 11 Adjusted pool limit to 10, now WRITING to DCache 82 subject : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990/CN=2146134877 timeleft : 2217:51:00 SRMCP 1/82 -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00041019_0000.all.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/sntp_data/2008-06 org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 425 Cannot open p ort: java.lang.Exception: Pool manager error: No write pools configured for ]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: C ustom message: Unexpected reply: 425 Cannot open port: java.lang.Exception: Pool manager error: No write pools configured for SRV1> ./roundup -n -w -s 40403 -r cedar far SRMCP 1/15 -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00040403_0001.spill.bntp.cedar.0.root /pnfs/minos/reco_far/cedar/.bntp_data/2008-03 srmcp -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00040403_0001.spill.bntp.cedar.0.root srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/reco_far/cedar/.bntp_data/2008-03/F00040403_0001.spill.bntp.cedar.0.root Try manually, round about 20:40 . 
/usr/local/grid/setup.sh export X509_USER_PROXY=/export/stage/minfarm/.grid/x509up_u1334 cd /minos/data/minfarm/WRITE srmcp -streams_num=1 -server_mode=active -protocols=gsiftp -debug=true \ file:///F00040403_0001.spill.bntp.cedar.0.root \ srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/F00040403_0001.spill.bntp.cedar.0.root srmcp -streams_num=1 -server_mode=active -protocols=gsiftp -debug=true \ file:///F00040403_0001.spill.bntp.cedar.0.root \ srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/reco_far/cedar/.bntp_data/2008-03/F00040403_0001.spill.bntp.cedar.0.root MINOS26 > ls -alF /pnfs/minos/reco_far/cedar/.bntp_data/2008-03/F00040403_0001.spill.bntp.cedar.0.root -rw-r--r-- 1 rubin e875 17425962 Jun 24 20:45 /pnfs/minos/reco_far/cedar/.bntp_data/2008-03/F00040403_0001.spill.bntp.cedar.0.root cp roundup roundupsrm2 SRMQ="-streams_num=1 -server_mode=active -protocols=gsiftp -debug=true" SRV1> ./roundupsrm2 -w -s 40403 -r cedar far D U H My bad, this was the pool shutdown last night. I was looking at the wrong log file. The current copies are running correctly, URK, need to check the support files for F00040403_0001.spill.bntp.cedar.0.root They seem to be there ( READ/SAM, ECRC ) The file is on tape. SRV1> ./roundup -w -s 40403 -r cedar far But the content of the ECRC/F00040403_0001.spill.bntp.cedar.0.root is the file, not the checksum. SRV1> setup encp -q stken SRV1> ecrc WRITE/F00040403_0001.spill.bntp.cedar.0.root CRC 334121273 SRV1> nedit ECRC/F00040403_0001.spill.bntp.cedar.0.root SRV1> ./roundup -w -s 40403 -r cedar far PURGING WRITE files 1 PURGED WRITE/F00040403_0001.spill.bntp.cedar.0.root SRV1> mv NOCAT NOCAT.ok ######## # FARM # ######## Per Howie's note ( minosdata ) SRMLS_PATH=srm://fndca1:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos srmls -debug=true \ "$SRMLS_PATH"/reco_near/cedar_phy/cand_data/2007-04/N00012001_0020.cosmic.cand.cedar_phy.0.root This works. Howie had been using /usr/local/vdt/setup.sh, still the old SRM, not /usr/local/grid/setup.sh, the new one. ########## # DCACHE # ########## DAQ logging seems to have halted last night after 23:09:30 First success after 11:04:35 As of 14:00, there are several unfinished copies pending from around 11:30 $ ps xf PID TTY STAT TIME COMMAND 17665 pts/1 Ss+ 0:00 -bash 13187 pts/0 Ss 0:00 -bash 11344 pts/0 R+ 0:00 \_ ps xf 3632 ? S 33:57 python /home/minos/bin/archiver_krb.py 3736 ? Z 0:00 \_ [kinit] 3578 ? S 3123:49 rotorooter -p9011 From ND RC stopped and started archiver. Files are moving again. Same for FD RC, files are moving again. > ls $DAQ_DATA_TO_ARCHIVE_DIR -l total 0 -rw-r--r-- 1 minos e875 0 Jun 24 11:29 F00041034_0018.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 12:29 F00041034_0019.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 12:36 F00041034_0020.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 13:05 F00041035_0000.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 13:07 F00041036_0000.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 13:07 F00041037_0000.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 14:07 F00041038_0000.mdaq.root ... > ls $DAQ_DATA_TO_ARCHIVE_DIR -l total 0 -rw-r--r-- 1 minos e875 0 Jun 24 14:07 F00041038_0000.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 15:07 F00041038_0001.mdaq.root ####### # SAM # ####### Upgraded to v6_0_5_24_srm just before the oracle upgrade. Projects run, but very slowly, several minutes per batch of 5 files. MINOS26 > sam stop project --station=minos --project=sam_test_project_20080624125114 --force Killed after 45. Try again after the downtime. 
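To quantify the per-batch stalls on the next pass, the test output can be timestamped line by line. A sketch, assuming sam_test_py writes one 'Got ...' line per delivered file as in the transcript that follows:

./sam_test_py minos prd st-cen 2>&1 | while read LINE ; do
    echo "`date +%H:%M:%S` ${LINE}"
done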
Reran st-cen job, ran quickly up through file 45, then slowly. Long delay for each batch of 5 files, then they are delivered 1 per second. For example, delay after Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030619_0007.mdaq.root file 45 Than ok for Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030620_0000.mdaq.root file 46 Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030621_0000.mdaq.root file 47 Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030622_0000.mdaq.root file 48 Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030622_0001.mdaq.root file 49 Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030622_0002.mdaq.root file 50 Delay is about 2 minutes Next pass was at full speed, 118 sec. for 100 files. ######## # DATA # ######## DAQ logging failed starting around midnight. Reran DC18 test , succeeds on test stand. Reran DC18 test using fndca1 host, production system, Cannot open port: java.lang.Exception: Pool manager error: No write pools configured for OK, this is consistent with draining the write queues prior to the upgrades. N.B. test stand web server at http://stkendca3a:2288/ ============================================================================= 2008 06 23 ============================================================================= ######## # DATA # ######## Date: Mon, 23 Jun 2008 16:42:04 -0500 (CDT) Subject: HelpDesk ticket 117640 ___________________________________________ Short Description: Unrequested roles being assigned by GUMS/VOMS , seen by dcache 1.8 Problem Description: We have seen a difference in behaviour between DCache 1.7 and 1.8, seen on the test stand recently. With DCache 1.7, if I do not assign a role to my grid proxy, I get the expected mapping of proxy to file ownership, in my case to the kreymer username. With DCache 1.8, they are seeing a Production role assigned, and the file ownership is accordingly mapped to minospro, something I do not want. But the proxy has no Production role. Is this something that can or should be corrected before the end of tomorrow's public DCache upgrade ? I wonder whether this same issue is contributing to recent confusion regarding Minos farm reconstruction file ownership ? The following is the latest exchange with DCache developers. Date: Mon, 23 Jun 2008 15:56:56 -0500 From: Timur Perelmutov To: Arthur Kreymer Cc: dcache-admin@fnal.gov, minos-data@fnal.gov, rubin@fnal.gov Subject: Re: File ownerships wrong using kreymer cert, dcache 1.8 But this voms-proxy-info output just confirms what I have said, your proxy had vo attributes and in this case the mapping was done by GUMs into the user minospro. If you want your proxy to be mappend into the user kreymer, you need to create the proxy with grid-proxy-init. In this case voms-proxy-info output will contain no " === VO fermilab extension information ===" line and no voms attributes listing. You want you voms proxy mapped to user kreymer even if you have VO attributes in your proxy, you need to resolve this with administrators of GUMS/VOMS servers. These servers are not controlled by dCache project or storage administrators. 
Thanks, Timur Arthur Kreymer wrote: > On Mon, 23 Jun 2008, Timur Perelmutov wrote: > > >> From the gPlazma logs I see: >> >> 06/23 09:01:53,489 VOAuthorizationPlugin: authRequestID 332223121 >> Requesting mapping for User with DN: /DC=org/DC=doegrids/OU=People/CN=Arthur >> E Kreymer 261310 and Role /fermilab/minos/Role=Production/Capability=NULL >> 06/23 09:01:47,889 VOAuthorizationPlugin: authRequestID 793306152 >> Mapping Service URL configuration: https://gums.fnal.gov:8443/gums/services/ >> GUMSAuthorizationServicePort >> 06/23 09:01:48,263 VOAuthorizationPlugin: authRequestID 793306152 VO >> mapping service returned Username: minospro >> > > > I happened to do a voms-proxy-info -all on Friday, > when the directories were being created : > > $ voms-proxy-info -all > WARNING: Unable to verify signature! Server certificate possibly not > installed. > Error: Cannot find certificate of AC issuer for vo fermilab > subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy > issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 > identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 > type : proxy > strength : 512 bits > path : /home/mindata/.grid/kreymer-doe.proxy > timeleft : 6606:10:02 > === VO fermilab extension information === > VO : fermilab > subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 > issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov > attribute : /fermilab/minos/Role=Production/Capability=NULL > attribute : /fermilab/Role=NULL/Capability=NULL > attribute : /fermilab/minos/Role=NULL/Capability=NULL > timeleft : 0:00:00 > > The file and directory at issue are : > > MINOS26 > ls -al /pnfs/minos/NULL/dc18dir > total 41 > drwxrwxr-x 1 42411 e875 512 Jun 23 09:01 . > drwxrwxr-x 1 mindata e875 512 Jun 23 14:40 .. > -rw-r--r-- 1 42411 e875 41379 Jun 23 09:02 F00031300_0000.mdaq.root > -rw-r--r-- 1 42411 e875 41379 Jun 23 15:18 F00031300_0000.mdaq.root2 > > I repeateded the file copy at 15:18, > see the entry just above 2008 06 20 in > > http://www-numi.fnal.gov/minwork/computing/dh/worklog.txt > > ___________________________________________ Date: Thu, 03 Jul 2008 15:10:18 -0500 (CDT) Note To Requester: After the grid users meeting we met together and agreed on a list of problems that need to be resolved. As of today I have communicated that list to the developers and dcache admins. Steve Timm ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ ######## # DATA # ######## Preparing for PNFS/DCache maintenance Jun 24 kreymer@minos26 echo "crontab -r" | at 05:30 mindata@minos26 echo "crontab -r" | at 01:00 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 ####### # DAQ # ####### Testing gsiftp as used by DAQ for archiving, with DCache 1.8 teststand $ ssh -l minos minos-gateway-nd.fnal.gov $ ssh daqdcp-nd Observed 3632 ? S 33:16 python /home/minos/bin/archiver_krb.py $ . shrc/kreymer # get rid of ls alias, create dds for my fun $ cd /home/minos/bin $ cp archiver_krb.py DC18.py We need to test gssftp. Most details come from config file $DAQ_CONFIG_DIR/archiver_near_daq.config Initial environment comes from .profile Now hacking DC18.py to test fndcat. 
Created config/DC18.config CONFIG FILE CHANGES ftp_host_name#T=fndcat.fnal.gov; pnfs_dir#T=NULL; check_freq#I=10; lock_file#T=/var/lock/daq/DC18.pid; err_mail#T=kreymer@fnal.gov,g; toarchive#T=/var/tmp/DC18/toarchive; archived#T=/var/tmp/DC18/archived; data_dir#T=/var/tmp/DC18/data_dir; #toarchive#T=$DAQ_DATA_TO_ARCHIVE_DIR; /daqdata/archiver/data-to-archive #archived#T=$DAQ_DATA_ARCHIVED_DIR; /daqdata/archiver/data-archived #data_dir#T=$DAQ_DATA_DIR; /daqdata Modified bin/DC18.py config just DC18, no other options Remove expansion of toarchive, archived, data_dir strings Changed msg_srv to be ECHO os.chdir scan - just to toarchive and data_dir toarchive from config, data_dir from changed krb_cache to /tmp/krb5c_DC18 $ mkdir /var/tmp/DC18 $ mkdir /var/tmp/DC18/toarchive $ mkdir /var/tmp/DC18/archived $ mkdir /var/tmp/DC18/data_dir for NN in 00 01 02 03 04 05 06 07 08 09 10 11 ; do cp -av /daqdata/N00014362_00${NN}.mdaq.root /var/tmp/DC18/data_dir ; done $ du -sm /var/tmp/DC18/data_dir/ 955 /var/tmp/DC18/data_dir/ for NN in 00 01 02 03 04 05 06 07 08 09 10 11 ; do touch /var/tmp/DC18/toarchive/N00014362_00${NN}.mdaq.root ; done We also need an output directory, MINOS26 > mkdir /pnfs/minos/NULL/2008-06 MINOS26 > chmod 775 /pnfs/minos/NULL/2008-06 MINOS26 > chgrp e875 /pnfs/minos/NULL/2008-06 Tried with wrong group, $ bin/DC18.py /home/minos/kftp/v3_6/NULL/lib/gssftp.py:1: RuntimeWarning: Python C API version mismatch for module gss: This Python has API version 1012, module gss has version 1011. import gss ftp_host_name fndcat.fnal.gov port 24127 username buckley data_dir /var/tmp/DC18/data_dir toarchive /var/tmp/DC18/toarchive archived /var/tmp/DC18/archived pnfs_dir NULL check_freq 10 ticket_cache /home/minos/kt/minoskt lock_file /var/lock/daq/DC18.pid err_mail kreymer@fnal.gov /// MESSAGE /// -l I Will archive files in /pnfs/minos/NULL /// MESSAGE /// -l I Will look for new files every 10 seconds ['N00014362_0000.mdaq.root', 'N00014362_0001.mdaq.root', 'N00014362_0002.mdaq.root', 'N00014362_0003.mdaq.root', 'N00014362_0004.mdaq.root', 'N00014362_0005.mdaq.root', 'N00014362_0006.mdaq.root', 'N00014362_0007.mdaq.root', 'N00014362_0008.mdaq.root', 'N00014362_0009.mdaq.root', 'N00014362_0010.mdaq.root', 'N00014362_0011.mdaq.root'] /// MESSAGE /// -l I Processing file N00014362_0000.mdaq.root /// MESSAGE /// -l I Getting credentials /// MESSAGE /// -l I Got credentials /// MESSAGE /// -l I Trying ftp connect to disk cache /// MESSAGE /// -l I Ftp connect succeeded /// MESSAGE /// -l W File N00014362_0000.mdaq.root failed to transfer, try again: STOR N00014362_0000.mdaq.root: Permission denied /// MESSAGE /// -l I Processing file N00014362_0001.mdaq.root ... Try one file, $ bin/DC18.py ... 
/// MESSAGE /// -l I Will archive files in /pnfs/minos/NULL /// MESSAGE /// -l I Will look for new files every 10 seconds ['N00014362_0000.mdaq.root'] /// MESSAGE /// -l I Processing file N00014362_0000.mdaq.root /// MESSAGE /// -l I Getting credentials /// MESSAGE /// -l I Got credentials /// MESSAGE /// -l I Trying ftp connect to disk cache /// MESSAGE /// -l I Ftp connect succeeded /// MESSAGE /// -l I File N00014362_0000.mdaq.root transferred to /pnfs/minos/NULL/2008-06 Try all 11 files, /// MESSAGE /// -l I Will archive files in /pnfs/minos/NULL /// MESSAGE /// -l I Will look for new files every 10 seconds ['N00014362_0000.mdaq.root', 'N00014362_0001.mdaq.root', 'N00014362_0002.mdaq.root', 'N00014362_0003.mdaq.root', 'N00014362_0004.mdaq.root', 'N00014362_0005.mdaq.root', 'N00014362_0006.mdaq.root', 'N00014362_0007.mdaq.root', 'N00014362_0008.mdaq.root', 'N00014362_0009.mdaq.root', 'N00014362_0010.mdaq.root', 'N00014362_0011.mdaq.root'] /// MESSAGE /// -l I Processing file N00014362_0000.mdaq.root /// MESSAGE /// -l I Getting credentials /// MESSAGE /// -l I Got credentials /// MESSAGE /// -l I Trying ftp connect to disk cache /// MESSAGE /// -l I Ftp connect succeeded /// MESSAGE /// -l W File N00014362_0000.mdaq.root failed to transfer, try again: STOR N00014362_0000.mdaq.root: Permission denied /// MESSAGE /// -l I Processing file N00014362_0001.mdaq.root /// MESSAGE /// -l I Getting credentials /// MESSAGE /// -l I Got credentials /// MESSAGE /// -l I Trying ftp connect to disk cache /// MESSAGE /// -l I Ftp connect succeeded /// MESSAGE /// -l W File N00014362_0001.mdaq.root failed to transfer, try again: STOR N00014362_0001.mdaq.root: Permission denied /// MESSAGE /// -l I Processing file N00014362_0002.mdaq.root ... /// MESSAGE /// -l I File N00014362_0010.mdaq.root transferred to /pnfs/minos/NULL/2008-06 /// MESSAGE /// -l I Processing file N00014362_0011.mdaq.root /// MESSAGE /// -l I Getting credentials /// MESSAGE /// -l I Got credentials /// MESSAGE /// -l I Trying ftp connect to disk cache /// MESSAGE /// -l I Ftp connect succeeded /// MESSAGE /// -l I File N00014362_0011.mdaq.root transferred to /pnfs/minos/NULL/2008-06 MINOS26 > ls -l /pnfs/minos/NULL/2008-06 total 976125 -rw-r--r-- 1 buckley e875 88929278 Jun 23 14:59 N00014362_0000.mdaq.root -rw-r--r-- 1 buckley e875 78320762 Jun 23 15:00 N00014362_0001.mdaq.root -rw-r--r-- 1 buckley e875 88274514 Jun 23 15:01 N00014362_0002.mdaq.root -rw-r--r-- 1 buckley e875 78394261 Jun 23 15:02 N00014362_0003.mdaq.root -rw-r--r-- 1 buckley e875 87837099 Jun 23 15:02 N00014362_0004.mdaq.root -rw-r--r-- 1 buckley e875 78282913 Jun 23 15:02 N00014362_0005.mdaq.root -rw-r--r-- 1 buckley e875 87983068 Jun 23 15:03 N00014362_0006.mdaq.root -rw-r--r-- 1 buckley e875 78285285 Jun 23 15:04 N00014362_0007.mdaq.root -rw-r--r-- 1 buckley e875 88048732 Jun 23 15:04 N00014362_0008.mdaq.root -rw-r--r-- 1 buckley e875 78299797 Jun 23 15:05 N00014362_0009.mdaq.root -rw-r--r-- 1 buckley e875 88505253 Jun 23 15:05 N00014362_0010.mdaq.root -rw-r--r-- 1 buckley e875 78394163 Jun 23 15:05 N00014362_0011.mdaq.root ######## # FARM # ######## Corrected bhhi and bhlo directory permissions ( at the top ) using mismapped kreymer cert, allowing access to 42411 minospro. 
08:51 for BH in ${SLO} ${SHI} ; do for DET in far near ; do srm-set-permissions -type=CHANGE -group=RWX ${BH}/${DET} srm-get-permissions ${BH}/${DET} srm-set-permissions -type=CHANGE -group=RWX ${BH}/${DET}/daikon_04 srm-get-permissions ${BH}/${DET}/daikon_04 srm-set-permissions -type=CHANGE -group=RWX ${BH}/${DET}/daikon_04/L010185N srm-get-permissions ${BH}/${DET}/daikon_04/L010185N done ; done for BH in ${SLO} ${SHI} ; do for DET in far near ; do srm-set-permissions -type=CHANGE -group=RWX ${BH}/${DET}/daikon_04/L010185N/cand_data srm-get-permissions ${BH}/${DET}/daikon_04/L010185N/cand_data done ; done Back on kreymer@minos26, ./pnfsdirs near cedar_phy_bhhi daikon_04 L010185N write Mon Jun 23 09:04:04 CDT 2008 STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_04/L010185N FAMSET mcin_near_daikon_04 :28: RuntimeWarning: Python C API version mismatch for module _locale: This Python has API version 1012, module _locale has version 1011. FAMILY mcin_near_daikon_04 OUTPUT /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N OK - created /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/mrnt_data OK - created /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/sntp_data OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 42411 e875 512 Jun 11 18:34 /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/cand_data chmod: changing permissions of `/pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/cand_data': Operation not permitted OK - have set permissions drwxrwxr-x drwxr-xr-x 1 42411 e875 512 Jun 11 18:34 /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/cand_data FAMSET mcout_cedar_phy_bhhi_near_daikon_04_cand ... ./pnfsdirs near cedar_phy_bhlo daikon_04 L010185N write ./pnfsdirs far cedar_phy_bhlo daikon_04 L010185N write ./pnfsdirs far cedar_phy_bhhi daikon_04 L010185N write ########## # DCACHE # ########## mindata@minos26 SPATH17=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/NULL/dc17dir SPATH18=srm://fndcat.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/NULL/dc18dir srmmkdir ${SPATH17} $ ls -l /pnfs/minos/NULL total 161 -rw-r--r-- 1 kreymer e875 41379 Jun 20 13:44 F00031300_0000.mdaq.root -rw-r--r-- 1 42411 e875 41379 Jun 20 19:10 F00031300_0000.mdaq.root1 -rw-r--r-- 1 42411 e875 41379 Jun 20 14:16 F00031300_0000.mdaq.root2 -rw-r--r-- 1 42411 e875 41379 Jun 20 19:10 F00031300_0000.mdaq.root3 drwxr-xr-x 1 kreymer e875 512 Jun 23 08:27 dc17dir srmmkdir ${SPATH18} $ ls -l /pnfs/minos/NULL SRMClientV2 : put: try # 0 failed with error SRMClientV2 : org.xml.sax.SAXException: Invalid element in org.dcache.srm.v2_2.SrmMkdirRequest - directoryPath SRMClientV2 : put: try again SRMClientV2 : put: try # 1 failed with error SRMClientV2 : org.xml.sax.SAXException: Invalid element in org.dcache.srm.v2_2.SrmMkdirRequest - directoryPath SRMClientV2 : put: try again OK, that's with a client mismatch, let's use the proper client $ srmmkdir ${SPATH18} $ ls -l /pnfs/minos/NULL total 161 -rw-r--r-- 1 kreymer e875 41379 Jun 20 13:44 F00031300_0000.mdaq.root -rw-r--r-- 1 42411 e875 41379 Jun 20 19:10 F00031300_0000.mdaq.root1 -rw-r--r-- 1 42411 e875 41379 Jun 20 14:16 F00031300_0000.mdaq.root2 -rw-r--r-- 1 42411 e875 41379 Jun 20 19:10 F00031300_0000.mdaq.root3 drwxr-xr-x 1 kreymer e875 512 Jun 23 08:27 dc17dir drwxrwxr-x 1 42411 e875 512 Jun 23 08:30 dc18dir Note that the ownership has changed. And permissions. 
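For the report to dcache-admin, the 1.7 versus 1.8 behaviour can be captured side by side. A sketch, assuming the SPATH17 / SPATH18 variables set above are still defined:

for DIR in dc17dir dc18dir ; do ls -ld /pnfs/minos/NULL/${DIR} ; done
srm-get-permissions ${SPATH17}
srm-get-permissions ${SPATH18}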
Reported to dcache-admin at 08:35 Let's check permissions : $ srm-set-permissions -type=CHANGE -group=RX ${SPATH18} $ ls -l /pnfs/minos/NULL total 161 drwxr-xr-x 1 kreymer e875 512 Jun 23 08:27 dc17dir drwxr-xr-x 1 42411 e875 512 Jun 23 08:42 dc18dir For reference, from my session on Friday, $ voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : /home/mindata/.grid/kreymer-doe.proxy timeleft : 6606:10:02 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov attribute : /fermilab/minos/Role=Production/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 0:00:00 Repeating the file copy, around 15:18 $ srmcp -debug -streams_num=1 -server_mode=active file:////minos/scratch/parrot/F00031300_0000.mdaq.root srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/dc18dir/F00031300_0000.mdaq.root2 Storage Resource Manager (SRM) implementation version 2.0.3 Copyright (c) 2002-2008 Fermi National Accelerator Laboratory Specification Version 2.0 by SRM Working Group (http://sdm.lbl.gov/srm-wg) SRM Configuration: debug=true gsissl=true help=false pushmode=false userproxy=true buffer_size=131072 tcp_buffer_size=0 streams_num=1 config_file=/minos/scratch/app/OSG1/srm-client-fermi/etc/config-2.xml glue_mapfile=/minos/scratch/app/OSG1/srm-client-fermi/conf/SRMServerV1.map webservice_path=srm/managerv1 webservice_protocol=https gsiftpclinet=globus-url-copy protocols_list=http,gsiftp save_config_file=null srmcphome=/minos/scratch/app/OSG1/srm-client-fermi urlcopy=sbin/urlcopy.sh x509_user_cert= x509_user_key= x509_user_proxy=/home/mindata/.grid/kreymer-doe.proxy x509_user_trusted_certificates=/minos/scratch/app/OSG1/globus/TRUSTED_CA globus_tcp_port_range=null gss_expected_name=null storagetype=permanent retry_num=20 retry_timeout=10000 wsdl_url=null use_urlcopy_script=false connect_to_wsdl=false delegate=true full_delegation=true server_mode=active srm_protocol_version=1 request_lifetime=86400 access latency=null overwrite mode=null priority=0 from[0]=file:////minos/scratch/parrot/F00031300_0000.mdaq.root to=srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/dc18dir/F00031300_0000.mdaq.root2 Mon Jun 23 15:17:36 CDT 2008: starting SRMPutClient Mon Jun 23 15:17:36 CDT 2008: In SRMClient ExpectedName: host Mon Jun 23 15:17:36 CDT 2008: SRMClient(https,srm/managerv1,true) SRMClientV1 : user credentials are: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 SRMClientV1 : SRMClientV1 calling org.globus.axis.util.Util.registerTransport() SRMClientV1 : connecting to srm at httpg://stkendca3a.fnal.gov:8443/srm/managerv1 Mon Jun 23 15:17:38 CDT 2008: connected to server, obtaining proxy Mon Jun 23 15:17:38 CDT 2008: got proxy of type class org.dcache.srm.client.SRMClientV1 Mon Jun 23 15:17:38 CDT 2008: source file#0 : /minos/scratch/parrot/F00031300_0000.mdaq.root copy_jobs is empty SRMClientV1 : put, sources[0]="srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/dc18dir/F00031300_0000.mdaq.root2" SRMClientV1 : put, 
dests[0]="srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/dc18dir/F00031300_0000.mdaq.root2" SRMClientV1 : put, protocols[0]="gsiftp" SRMClientV1 : put, protocols[1]="dcap" SRMClientV1 : put, protocols[2]="http" SRMClientV1 : put, contacting service httpg://stkendca3a.fnal.gov:8443/srm/managerv1 Mon Jun 23 15:17:40 CDT 2008: srm returned requestId = -2147441805 Mon Jun 23 15:17:40 CDT 2008: sleeping 4 seconds ... Mon Jun 23 15:17:44 CDT 2008: FileRequestStatus with SURL=srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/dc18dir/F00031300_0000.mdaq.root2 is Ready Mon Jun 23 15:17:44 CDT 2008: received TURL=gsiftp://stkendca6a.fnal.gov:2811///NULL/dc18dir/F00031300_0000.mdaq.root2 copy_jobs is not empty copying CopyJob, source = file:////minos/scratch/parrot/F00031300_0000.mdaq.root destination = gsiftp://stkendca6a.fnal.gov:2811///NULL/dc18dir/F00031300_0000.mdaq.root2 GridftpClient: memory buffer size is set to 131072 GridftpClient: connecting to stkendca6a.fnal.gov on port 2811 GridftpClient: gridFTPClient tcp buffer size is set to 1048576 GridftpClient: gridFTPWrite started, source file is java.io.RandomAccessFile@1eb5666 destination path is //NULL/dc18dir/F00031300_0000.mdaq.root2 GridftpClient: gridFTPWrite started, destination path is //NULL/dc18dir/F00031300_0000.mdaq.root2 GridftpClient: set local data channel authentication mode to None GridftpClient: stream mode transfer GridftpClient: adler32 for file java.io.RandomAccessFile@1eb5666 is bdce1a9a GridftpClient: waiting for completion of transfer GridftpClient: starting a transfer to //NULL/dc18dir/F00031300_0000.mdaq.root2 GridftpClient: DiskDataSink.close() called GridftpClient: gridFTPWrite() wrote 41379bytes GridftpClient: closing client : org.dcache.srm.util.GridftpClient$FnalGridFTPClient@e34726 GridftpClient: closed client execution of CopyJob, source = file:////minos/scratch/parrot/F00031300_0000.mdaq.root destination = gsiftp://stkendca6a.fnal.gov:2811///NULL/dc18dir/F00031300_0000.mdaq.root2 completed setting file request -2147441804 status to Done copy_jobs is empty stopping copier ============================================================================= 2008 06 20 ============================================================================= ####### # SAM # ####### export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.bhlo samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.bhhi done New applicationFamilyId = 255 New applicationFamilyId = 256 New applicationFamilyId = 92 New applicationFamilyId = 93 New applicationFamilyId = 346 New applicationFamilyId = 347 ####### # OSG # ####### minsoft@minos26 mkdir /minos/scratch/app/OSG1 cd /minos/scratch/app/OSG1 11:47 pacman -get OSG:client Do you want to add [http://software.grid.iu.edu/pacman/] to [trusted.caches]? (y or n): y Package [client] found in [OSG]... Package [OSG:vo-client-0.6] found in [OSG]... Package [vo-client-0.6] found in [OSG]... Do you want to add [http://vdt.cs.wisc.edu/vdt_1101_cache] to [trusted.caches]? (y or n): y Package [VDT-Client] found in [http://vdt.cs.wisc.edu/vdt_1101_cache]... ... Downloading [vo-client-0.6-11.tar.gz] from [http://software.grid.iu.edu/pacman/tarballs]... Untarring [vo-client-0.6-11.tar.gz]... Downloading [vdt-common-1-287.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-common/1]... 
Downloading [vdt-core-1.2.1-404.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-core/1.2.1]... Downloading [vdt-environment-1-233.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-environment/1]... Downloading [vdt-core-bin-2.1.x86_rhas_4.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-core-bin/2.1-4]... Downloading [vdt-version-5-281.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-version/5]... Downloading [vdt-version-info-1.10.1-19.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-version-info/1.10.1]... Downloading [vdt-system-profiler-3-267.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-system-profiler/3]... Downloading [vdt-questions-2.0-324.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-questions/2.0]... Downloading [remove-rpaths-1--0.13-2.x86_rhas_4.tar.gz] from [http://vdt.cs.wisc.edu/software//remove-rpaths/1--0.13-2]... Beginning VDT prerequisite checking script vdt-common/vdt-prereq-check... All prerequisite checks are satisfied. Downloading [VDT-Client-1.10.1.tar.gz] from [http://vdt.cs.wisc.edu/software/questions/1.10.1]... VDT 1.10.1 installs a variety of software, each with its own license. In order to continue, you must agree to the licenses. You can view the licenses online at: http://vdt.cs.wisc.edu/licenses/1.10.1 After the installation has completed, you will also be able to view the licenses in the "licenses" directory. Do you agree to the licenses? [y/n] y The VDT typically installs public certificates and signing policy files for the well-known public CA's. This is necessary in order for you to perform GSI authentication with any remote Grid services (that have service/host certificates signed by these CA's). For more information please refer to the VDT documentation: http://vdt.cs.wisc.edu/releases/1.10.1/setup_ca.html Where would you like to install CA files? Choices: l (local) - install into $VDT_LOCATION/globus/share/certificates n (no) - do not install l Downloading [CA-Certificates-1.10.1.tar.gz] from [http://vdt.cs.wisc.edu/software/questions/1.10.1]... ... Package [Configure-Condor] found in [http://vdt.cs.wisc.edu/vdt_1101_cache]... Downloading [configure_condor-1-301.tar.gz] from [http://vdt.cs.wisc.edu/software//configure_condor/1]... ... Downloading [LCG-Infosites-1.10.1.tar.gz] from [http://vdt.cs.wisc.edu/software/questions/1.10.1]... Downloading [lcg-infosites-2.6-2.tar.gz] from [http://vdt.cs.wisc.edu/software//lcg-infosites/2.6-2]... The VDT version 1.10.1 has been installed. The OSG Client package version 1.0.0 has been installed. 12:02 ######## # TEST # ######## Putting a file into TEST and NULL mindata@minos26 $ setup java v1.5.0 $ setup srmcp v1_25_1 $ unset SRM_PATH $ export X509_USER_PROXY=/home/mindata/.grid/kreymer-doe.proxy srmcp -debug \ -streams_num=1 -server_mode=active \ file:////minos/scratch/parrot/F00031300_0000.mdaq.root \ srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/TEST/F00031300_0000.mdaq.root ############ # OSG TEST # ############ $ export X509_USER_PROXY=/home/mindata/.grid/kreymer-doe.proxy $ voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. 
Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : /home/mindata/.grid/kreymer-doe.proxy timeleft : 6674:39:11 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov attribute : /fermilab/minos/Role=Production/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 0:00:00 SPATH2=srm://fndcat.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/beam_data/2004-12 srmls ${SPATH2} OK, got all files SPATH=srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr S1PATH=srm://fndcat.fnal.gov:8443/srm/managerv1?SFN=/pnfs/fnal.gov/usr S2PATH=srm://fndcat.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr IFILE=N00004502_0000.mdaq.root IPATH=minos/neardet_data/2004-11 SFILE=${SPATH}/${IPATH}/${IFILE} S1FILE=${S1PATH}/${IPATH}/${IFILE} srmcp -streams_num=1 -server_mode=active ${SFILE} \ file:////var/tmp/TEST.dat gave up after 10 minutes srmcp -debug \ -streams_num=1 -server_mode=active \ file:////minos/scratch/parrot/F00031300_0000.mdaq.root \ srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/TEST/F00031300_0000.mdaq.root fails, first file protection, then authorization $ T1FILE=srm://fndcat.fnal.gov:8443/srm/managerv1?SFN=/pnfs/fnal.gov/usr/minos/TEST/F00031300_0000.mdaq.root $ srmcp -streams_num=1 -server_mode=active -debug=true ${T1FILE} file:////var/tmp/TEST.dat URK , this works, but not the short form, and not with managerv2. N.B. that's not correct, the real problem was hitting a bad door. WFILE=F00031300_0000.mdaq.root2 WPATH=minos/NULL SW1FILE=${S1PATH}/${WPATH}/${WFILE} $ echo $SW1FILE srm://fndcat.fnal.gov:8443/srm/managerv1?SFN=/pnfs/fnal.gov/usr/minos/NULL/F00031300_0000.mdaq.root2 srmcp -debug \ -streams_num=1 -server_mode=active \ file:////minos/scratch/parrot/F00031300_0000.mdaq.root \ ${SW1FILE} Dmitri has examined the logs. The commands work correctly when they happen to get ftp paths like TURL=gsiftp://stkendca3a.fnal.gov:2812///TEST/F00031300_0000.mdaq.root The commands fail with TURL=gsiftp://gwdca01.fnal.gov:2811///TEST/F00031300_0000.mdaq.root He will examine the door configurations. The configuration has been updated by Timur, reads and writes both work consistently. DCCP TESTS - irrelevant ? kreymer@minos26 setup dcap # kerberized DCPOR=24725 # kerberos (default) WFILE=F00031300_0000.mdaq.root WPATH=minos/NULL WFILE=dcap://fndcat.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${WPATH}/${WFILE} cd /minos/scratch/parrot dccp -d 4 ${IFILE} ${WFILE} [Fri Jun 20 13:24:14 2008] Going to open file dcap://fndcat.fnal.gov:24125/pnfs/fnal.gov/usr/minos/NULL/F00031300_0000.mdaq.root in cache. Connected in 0.00s. Error on control line [4] Failed to create a control line Failed open file in the dCache. Can't open destination file : Server rejected "hello" System error: No such file or directory I am not sure that I was ever authorized, this is not important. 
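For future door checks, a quick consistency loop along the following lines might help. This is a sketch only, not run here; it assumes the T1FILE definition and the proxy exported above, and that the -debug output still prints the 'received TURL=' lines seen in the transcript above.
for N in 1 2 3 4 5 ; do
  srmcp -streams_num=1 -server_mode=active -debug=true \
    ${T1FILE} file:////var/tmp/TEST.${N}.dat 2>&1 | grep 'TURL='
done
ls -l /var/tmp/TEST.*.dat    # all five copies should be present and the same size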
########## # DCACHE # ########## Date: Fri, 20 Jun 2008 10:56:44 -0500 From: Dan Yocum https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/ClientInstallationGuide Date: Fri, 20 Jun 2008 11:26:14 -0500 (CDT) From: Steven Timm To: Dan Yocum , fermigrid-help@fnal.gov Cc: Arthur Kreymer , timur@fnal.gov, minos-data@fnal.gov Subject: Re: yes, OSG v1.0 client software The new version of SRM clients are already available on fnpcsrv1 (and fnpcsrv1 only, right now). They have, in fact, been available in that location since February. Log on to fnpcsrv1 do . /usr/local/grid/setup.sh and the srmcp that will be in your path is /opt/d-cache/srm/bin/srmcp That is the right version. As Dan says, we will be installing this on the worker nodes as of June 24, but it will be in the /usr/local/grid area once we have done that, and the /usr/local/grid/setup.sh will automatically place it in your path on all worker nodes, not just on fnpcsrv1. ####### # AFS # ####### Per kordosky question regarding AFS problems in Minerva, happening on flxi05 but not flxi06 rescanned AFS tuning : MIN > for NODE in ${NODES} ; do printf "$NODE " ; ssh -ax ${NODE} 'grep ^OPTIONS= /etc/sysconfig/afs' ; done minos01 OPTIONS=$LARGE ... minos26 OPTIONS=$LARGE MIN > for NODE in ${UNODES} ; do printf "$NODE " ; ssh -ax ${NODE} 'grep ^OPTIONS= /etc/sysconfig/afs' ; done flxi02 OPTIONS=AUTOMATIC flxi03 OPTIONS=AUTOMATIC flxi04 OPTIONS=$MEDIUM flxi05 OPTIONS=AUTOMATIC flxi06 OPTIONS=$MEDIUM flxi07 OPTIONS=$LARGE flxi09 OPTIONS=AUTOMATIC FNALU batch is all $LARGE ( cannot log into 21, 22 ) ___________________________________________ Date: Fri, 20 Jun 2008 09:26:14 -0500 (CDT) Subject: HelpDesk ticket 117526 ___________________________________________ Short Description: AFS client tuning is needed on FNALU hosts Problem Description: fnalu-admin : Mike Kordosky has been observing inconsistent AFS service on some of the FNALU nodes, including flxi05. I see the following problematic setting in etc/sysconfig/afs : OPTIONS=AUTOMATIC For reliable service, we had to set this in the Minos Cluster to OPTIONS=$LARGE Please set this to at least $MEDIUM, if not $LARGE Here is a summary of present FNALU settings : for NODE in ${UNODES} ; do printf "$NODE " ssh -ax ${NODE} 'grep ^OPTIONS= /etc/sysconfig/afs' ; done flxi02 OPTIONS=AUTOMATIC flxi03 OPTIONS=AUTOMATIC flxi04 OPTIONS=$MEDIUM flxi05 OPTIONS=AUTOMATIC flxi06 OPTIONS=$MEDIUM flxi07 OPTIONS=$LARGE flxi09 OPTIONS=AUTOMATIC ___________________________________________ Date: Fri, 20 Jun 2008 09:53:40 -0500 (CDT) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Fri, 20 Jun 2008 10:01:25 -0500 (CDT) From: Margaret_Greaney Art, could you please ask Mike to provide more details about "inconsistent" AFS service? thanks, Margaret ___________________________________________ Date: Fri, 20 Jun 2008 10:04:51 -0500 (CDT) From: Margaret_Greaney flxi09 is not a node released to users and won't be for some time. ___________________________________________ Date: Fri, 20 Jun 2008 10:23:27 -0500 (CDT) Art, I will plan to work on this afs update next week. It will require a reboot or at least an unmount of afs and remount, so it will take a maintenance day. We have something else scheduled for the early part of the week, but for flxi02 and 03, I will try to make that part of the move for these nodes. Flxi05 may be later on in the week. 
Margaret ___________________________________________ Date: Fri, 20 Jun 2008 10:42:19 -0500 (CDT) This ticket has been reassigned to GREANEY, MARGARET of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ ___________________________________________ ___________________________________________ ########## # PARROT # ########## Tested current 2008-06-19 on fnpc338 with dcache, OK ( with ^D hack ) Tested without -d remote, still OK Tested with squid , checksums do not match for d199- remote checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d local checksum: 8294e8e248aa71fc003cb306d2ca0db5266aeaec disabled squid, local checksum: 8294e8e248aa71fc003cb306d2ca0db5266aeaec export http_proxy=http://squid.fnal.gov:3128 # for curl export HTTP_PROXY=http://squid.fnal.gov:3128 # for parrot curl http://www-numi.fnal.gov:80/computing/d199//.growfschecksum curl is not available wget http://www-numi.fnal.gov:80/computing/d199//.growfschecksum -O - -q 6f63107de1a1e42d3a10b8847ebffea250f0895d - unset http_proxy wget http://www-numi.fnal.gov:80/computing/d199//.growfschecksum -O - -q 8294e8e248aa71fc003cb306d2ca0db5266aeaec - squid is being inconsistent : MINOS26 > wget http://www-numi.fnal.gov:80/computing/d199//.growfschecksum -O CHKS ; cat CHKS --09:42:26-- http://www-numi.fnal.gov/computing/d199//.growfschecksum => `CHKS' Resolving squid.fnal.gov... 131.225.107.161 Connecting to squid.fnal.gov|131.225.107.161|:3128... connected. Proxy request sent, awaiting response... 200 OK Length: 44 [text/plain] 100%[====================================================================================================================================================================================================================================================================>] 44 --.--K/s 09:42:26 (3.23 MB/s) - `CHKS' saved [44/44] 6f63107de1a1e42d3a10b8847ebffea250f0895d - MINOS26 > wget http://www-numi.fnal.gov:80/computing/d199//.growfschecksum -O CHKS ; cat CHKS --09:42:29-- http://www-numi.fnal.gov/computing/d199//.growfschecksum => `CHKS' Resolving squid.fnal.gov... 131.225.107.161 Connecting to squid.fnal.gov|131.225.107.161|:3128... connected. Proxy request sent, awaiting response... 200 OK Length: 44 [text/plain] 100%[====================================================================================================================================================================================================================================================================>] 44 --.--K/s 09:42:29 (3.23 MB/s) - `CHKS' saved [44/44] 8294e8e248aa71fc003cb306d2ca0db5266aeaec - Lets hit this more ... while true ; do sleep 4 ; curl http://www-numi.fnal.gov:80/computing/d199//.growfschecksum ; done no failures in 1/2 hour. while true ; do sleep 4 ; wget -q http://www-numi.fnal.gov:80/computing/d199//.growfschecksum -O -; done As of 11:30, the squid cache seems to be up to date. Have run loon repeatedly using squid, local and dcache, on fnpc338. ============================================================================= 2008 06 19 ============================================================================= ####### # SAM # ####### why is N00004502_0000.mdaq.root not declared to SAM ???? 
DIRS=`ls /pnfs/minos/neardet_data | grep 2004`
for DIR in ${DIRS} ; do
  echo ${DIR}
  find /pnfs/minos/neardet_data/${DIR} -type f | wc -l
  SAMDIM="FULL_PATH /pnfs/minos/neardet_data/${DIR}"
  sam list files --dim="${SAMDIM}" --summary_only
done
2004-07  697  File Count: 696
2004-11 1081  File Count: 0
2006-10  905  File Count: 904
##########
# DCACHE #
##########
Tests to do before the 24 June upgrade :
  read a file using dccp
  loon a file using dcap path
  write a file with srmcp, normal
  write a file with srmcp, production role
  write a file from DAQ, using gsiftp
Should write to a TEST directory and file family, for recycling
Should write to a NULL directory, directed to a NULL mover, for extended tests.
mindata@minos26
mkdir /pnfs/minos/NULL
mkdir /pnfs/minos/TEST
( cd /pnfs/minos/NULL ; enstore pnfs --file_family NULL )
( cd /pnfs/minos/TEST ; enstore pnfs --file_family TEST )
read dccp
MINOS26 > setup dcap -q unsecured
MINOS26 > IFILE=N00004502_0000.mdaq.root
MINOS26 > IPATH=minos/neardet_data/2004-11
MINOS26 > DCPOR=24125
MINOS26 > DFILE=dcap://fndcat.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE}
MINOS26 > cd /local/scratch??/`whoami`
MINOS26 > dccp ${DFILE} TEST.dat    # around 14:34
Unknow replay from server: "0 0 server shutdown "
On Thu, 19 Jun 2008, Timur Perelmutov wrote:
> Could you please let us know what is the status of the tests. We would
> like to make a final decision about the upgrade schedule tomorrow during
> the Storage Weekly meeting at 10:30 AM.
The tests have not yet started.
In order to do the tests, I need to monitor the system,
which I thought would be at http://fndcat.fnal.gov
But I got no response from that address.
So I presumed that the system was down.
I also have not been told which version of VDT to use for srmcp.
Is there a version installed anywhere for public use ?
Somewhere in /grid/app would be fine for me.
I tried to read a file at 13:34, via unsecured dccp,
but am still waiting 20 minutes later.
The file is on tape VO5041.
I see no activity for that tape.
MINOS26 > enstore info --vol=VO5041 | grep last_access 'last_access': 1203539287.0, datesec 1203539287 Wed Feb 20 14:28:07 CST 2008 File details follow : IFILE=N00004502_0000.mdaq.root IPATH=minos/neardet_data/2004-11 DCPOR=24125 DFILE=dcap://fndcat.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} dcap://fndcat.fnal.gov:24125/pnfs/fnal.gov/usr/minos/neardet_data/2004-11/N00004502_0000.mdaq.root setup dcap -q unsecured dccp ${DFILE} TEST.dat ######## # FARM # ######## mstrait email indicates that adjusted field runs have started I presume Detectors - near far Releases - cedar_phy_bhhi cedar_phy_bhlo MC rel - daikon_04 Beam - L010185N ./pnfsdirs near cedar_phy_bhhi daikon_04 L010185N Thu Jun 19 14:14:50 CDT 2008 STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_04/L010185N FAMSET mcin_near_daikon_04 FAMILY mcin_near_daikon_04 OUTPUT /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 42411 e875 512 Jun 10 22:23 /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04 OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 42411 e875 512 Jun 11 18:34 /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 42411 e875 512 Jun 11 18:34 /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/cand_data FAMSET mcout_cedar_phy_bhhi_near_daikon_04_cand FAMILY minos OOPS - need file family mcout_cedar_phy_bhhi_near_daikon_04_cand ls: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/mrnt_data: No such file or directory OOPS, need permissions drwxrwxr-x ls: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/mrnt_data: No such file or directory FAMSET mcout_cedar_phy_bhhi_near_daikon_04_mrnt ./pnfsdirs: line 87: cd: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/mrnt_data: No such file or directory FAMILY OOPS - need file family mcout_cedar_phy_bhhi_near_daikon_04_mrnt ls: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/sntp_data: No such file or directory OOPS, need permissions drwxrwxr-x ls: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/sntp_data: No such file or directory FAMSET mcout_cedar_phy_bhhi_near_daikon_04_sntp ./pnfsdirs: line 87: cd: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/sntp_data: No such file or directory FAMILY OOPS - need file family mcout_cedar_phy_bhhi_near_daikon_04_sntp What a mess ! Let's correct permissions/families ./pnfsdirs near cedar_phy_bhhi daikon_04 L010185N write Csnnot correct this, directories owned by minospro 2008 06 20 BLO=/pnfs/minos/mcout_data/cedar_phy_bhlo BHI=/pnfs/minos/mcout_data/cedar_phy_bhhi for BH in ${BLO} ${BHI} ; do for DET in far near ; do ls -ld ${BH}/${DET} ls -ld ${BH}/${DET}/daikon_04 ls -ld ${BH}/${DET}/daikon_04/L010185N ls -l ${BH}/${DET}/daikon_04/L010185N done ; done All these need to have group write. There are stray candidates in /pnfs/minos/mcout_data/cedar_phy_bhlo/near/daikon_04/L010185N /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N Need to do this under mindata@minos26, where we have access to SRM v2 scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/x509up_u1334 /home/mindata/.grid/ export X509_USER_PROXY=/home/mindata/.grid/x509up_u1334 SRMP=srm://fndcat.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos SLO=${SRMP}/mcout_data/cedar_phy_bhlo SHI=${SRMP}/mcout_data/cedar_phy_bhhi for BH in ${SLO} ${SHI} ; do for DET in far near ; do srm-get-permissions ${BH}/${DET} done ; done WOW THIS WORKED !!! 
for BH in ${SLO} ${SHI} ; do for DET in far near ; do srm-set-permissions -type=CHANGE -group=RWX ${BH}/${DET} srm-get-permissions ${BH}/${DET} done ; done This would work, but rubin's proxy is mapping to 1334(rubin) And the files are owned by 42411 ( minospro ) 2008 06 23 - Ran the set permissions script above, using kreymer cert. Because this is presently mis-mapped to minospro, I can get this work done ! ######### # ADMIN # ######### MINOS01 > cmd add_minos_user bain Creating account... /var/yp gmake[1]: Entering directory `/var/yp/minos' gmake[1]: `ypservers' is up to date. gmake[1]: Leaving directory `/var/yp/minos' gmake[1]: Entering directory `/var/yp/minos' Updating passwd.byname... Updating passwd.byuid... Updating netid.byname... gmake[1]: Leaving directory `/var/yp/minos' Adding user to Minos AFS group... Sending mail to subscribe to minos-user mailing list ... Sending email to user... ########## # PARROT # ########## continue tests using current version, with fix for ups on x86_64 Still problems on fcdflnx3, perhaps due to Linux+2.6-2.3.4 flavor of gcc root hangs up after loading libvector.dll Changed the flavor to Linux+2.6 by hacking the .version file, then rebuilt the growfs index, time make_growfs -k /afs/fnal.gov/files/data/minos/d141 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d141/.growfsdir make_growfs: scanning directory tree for changes... make_growfs: 991412 files, 6817 links, 107259 dirs, 0 checksums computed real 9m19.467s user 1m6.978s sys 1m28.444s ls -l /afs/fnal.gov/files/code/e875/general/ups/prd/gcc Oops, had to rename the old mountfiles in /grid/app/minos/parrot ######## # DATA # ######## Date: Thu, 19 Jun 2008 10:30:49 -0500 From: root To: owner-minos-data@listserv.fnal.gov Subject: FermiGrid Thu Jun 19 10:30:49 CDT 2008 Usage limit approaching on fermigrid-app Total disk allocated (GB): 30.0 Percent disk used: 80.0% Date: Thu, 19 Jun 2008 10:30:46 -0500 From: root To: owner-minos-data@listserv.fnal.gov Subject: FermiGrid Thu Jun 19 10:30:46 CDT 2008 Usage limit approaching on fermigrid-data Total disk allocated (GB): 400.0 Percent disk used: 85.4% ============================================================================= 2008 06 18 ============================================================================= ########## # CONDOR # ########## spotting users with excessively good priorities HOTS=`condor_userprio -all -allusers \ | grep -v gfactory \ | grep -v kreymer \ | grep -v rhatcher \ | grep ' 1.00 ' \ | cut -f 1 -d @ ` for HOT in ${HOTS} ; do printf "condor_userprio -setfactor ${HOT}@fnal.gov 100.\n" done condor_userprio -setfactor boehm@fnal.gov 100. condor_userprio -setfactor asousa@fnal.gov 100. ########## # PARROT # ########## Continue 2008 06 14 work ######## # BMNT # ######## Following the plan of 2008 01 17 to clear bmnt files out of farcat area. List the bmnt files and runs Generate mrnt list, verify that they are in PNFS and MD Move the mrnt files aside in PNFS and MD 0) BMNT LIST - kreymer BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` printf "${BFILES}\n" | wc -w 546 mkdir /minos/scratch/kreymer/bmnt2 printf "${BFILES}\n" > /minos/scratch/kreymer/bmnt2/BFILES BFILES runs from F00033808_0000.spill.bmnt.cedar_phy_bhcurv.0.root 2006-03 to F00034618_0003.spill.bmnt.cedar_phy_bhcurv.0.root 2006-03 1) MRNT LIST - kreymer/mindata/rubin/minfarm MRUNS=`printf "${BFILES}\n" | cut -f 1 -d _ | sort -u` printf "${MRUNS}\n" | wc -w 49 F00033808 F00033814 F00033818 ... 
F00034607 F00034610 F00034618 Rough check for _000000 subruns for MRUN in ${MRUNS} ; do sam locate ${MRUN}_0000.spill.mrnt.cedar_phy_bhcurv.0.root done ... all files are found ... Detailed check via SAM for MRUN in ${MRUNS} ; do RUN=`echo ${MRUN} | cut -c 5-` SAMDIM=" DATA_TIER mrnt-far and VERSION cedar.phy.bhcurv and PHYSICAL_DATASTREAM_NAME spill and RUN_NUMBER ${RUN} " sam list files --dim="${SAMDIM}" --nosummary done > /minos/scratch/kreymer/bmnt2/MFILES wc -l /minos/scratch/kreymer/bmnt2/MFILES 49 /minos/scratch/kreymer/bmnt2/MFILES grep -v '_0000' /minos/scratch/kreymer/bmnt2/MFILES .. nothing ... MFILES=`cat /minos/scratch/kreymer/bmnt2/MFILES` printf "${MFILES}\n" | wc -l 49 for MFILE in ${MFILES} ; do MON=`sam locate ${MFILE} | cut -f 7 -d / | cut -f 1 -d ,` printf "reco_far/cedar_phy_bhcurv/mrnt_data/${MON}/${MFILE}\n" \ | tee -a /minos/scratch/kreymer/bmnt2/MFILEPS done MFILEPS=`cat /minos/scratch/kreymer/bmnt2/MFILEPS` for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} ; done for MFILEP in ${MFILEPS} ; do ls -l /minos/data/${MFILEP} ; done ... continuing 2008 06 18 ... for each account, do BFILES=`cat /minos/scratch/kreymer/bmnt2/BFILES` MFILES=`cat /minos/scratch/kreymer/bmnt2/MFILES` MFILEPS=`cat /minos/scratch/kreymer/bmnt2/MFILEPS` 2a) /minos/data - minfarm@fnpcsrv1 MOVE TO BMNT2 IN /minos/data for MFILEP in ${MFILEPS} ; do MFILER=`echo ${MFILEP} | sed s/mrnt_data/BMNT2/g` MFILED=`dirname ${MFILER}` mkdir -p /minos/data/${MFILED} mv /minos/data/${MFILEP} /minos/data/${MFILER} done find /minos/data/reco_far/cedar_phy_bhcurv/BMNT2 -type f | wc -l 49 2b) /pnfs/minos - rubin@fnpcsrv1 cat shrc/kreymer # cut and paste the result to get into bash for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} rm /pnfs/minos/${MFILEP} done 2c) SAM/READ Do this as minfarm@fnpcsrv1 cd /export/stage/minfarm/ROUNDUP mkdir READBMNT2 for MFILE in ${MFILES} ; do ls READ/SAM/${MFILE} mv READ/SAM/${MFILE} READBMNT2/${MFILE} done 2d) SAM kreymer@minos26 for MFILE in ${MFILES} ; do sam undeclare file ${MFILE} done 3) rename bmnt - minfarm@fnpcsrv1 cd /minos/data/minfarm/farcat for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` [ -r ${MFILE} ] && ls -l ${MFILE} done no conflicting files were found for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` mv ${BFILE} ${MFILE} done for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` [ -r ${MFILE} ] && ls -l ${MFILE} done | wc -l 546 Ready to rock n' roll ! cd /home/minfarm/scripts ./roundup -n -r cedar_phy_bhcurv far adding messages look good to my eye, let's count them ./roundup -n -r cedar_phy_bhcurv far | grep adding | wc -l 49 OK, let's do this. ./roundup -r cedar_phy_bhcurv far OK - processing /minos/data/minfarm/farcat version 20080515 Wed Jun 18 11:11:15 CDT 2008 ... HADD rate 1 Mbytes/second Wed Jun 18 11:24:56 CDT 2008 WRITING to DCache 48 Wed Jun 18 11:40:16 CDT 2008 ??? we just added 49, why writing 48 ? probably timing, files need to have been around a little while. 
Yes, missing just the last 3'ish F00034602_0000.spill.mrnt.cedar_phy_bhcurv.0.root 22 F00034610_0000.spill.mrnt.cedar_phy_bhcurv.0.root 4 F00034618_0000.spill.mrnt.cedar_phy_bhcurv.0.root 4 ./roundup -w -r cedar_phy_bhcurv far WRITING to DCache 3 Wed Jun 18 11:58:16 CDT 2008 Waited a bit for files to get onto tape Wed Jun 18 13:21:31 CDT 2008 PURGING WRITE files 3 PURGED WRITE/F00034602_0000.spill.mrnt.cedar_phy_bhcurv.0.root PURGED WRITE/F00034610_0000.spill.mrnt.cedar_phy_bhcurv.0.root PURGED WRITE/F00034618_0000.spill.mrnt.cedar_phy_bhcurv.0.root ########## # DCACHE # ########## Preparing for DCache 1.8 tests, prior to Jun 24 upgrade. Do we have NULL movers set up ? Date: Tue, 17 Jun 2008 09:48:26 -0500 (CDT) From: Dmitry Litvintsev Art, you need to do to change "fndca" -> "fndcat" and make sure you are using these ports: ports: SRM : 8443 kerberized dcap doors : 24725, 24736 kerberized ftp door : 24127 weak ftp door : 24126 gsi dcap : 24128 grid ftp : 2811 grid ftp doors run on all three nodes available fndcat stkendca6a gwdca01 Date: Tue, 17 Jun 2008 09:51:22 -0500 From: Timur Perelmutov Could you let us know what path you are planning to write into? We will configure pools for that path. ============================================================================= 2008 06 17 ============================================================================= ######## # FARM # ######## Date: Mon, 16 Jun 2008 21:56:27 +0100 From: Alexandre Sousa To: Howard Rubin , Matthew Strait , Arthur Kreymer , Robert Hatcher , Nick West Subject: Hypothetical reprocessing with increased CPU resources. ... cost of a full reprocessing of CPB, for subshower nue correction ? ============================================================================= 2008 06 09 through 14 WEEK IN THE WOODS WORKSHIP IN ELY, MN ============================================================================= 2008 06 14 ============================================================================= ########## # PARROT # ########## Checking out use of ups/upd from /afs/fnal.gov/files/code/e875/general/ups/db . /afs/fnal.gov/files/code/e875/general/ups/etc/setups.sh export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup setup_minos() { . $MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } setup_minos -r R1.24.2 ups/upd seem to work OK there, but setup_minos fails due to lack of gcc v3_4_3 MINOS26 > du -sm /afs/fnal.gov/ups/gcc/v3_4_3/* prd/gcc/v3_4_3 159 /afs/fnal.gov/ups/gcc/v3_4_3/IRIX+6.5 119 /afs/fnal.gov/ups/gcc/v3_4_3/Linux+2.4-2.2.4 380 /afs/fnal.gov/ups/gcc/v3_4_3/Linux+2.4-2.3.2 380 /afs/fnal.gov/ups/gcc/v3_4_3/Linux+2.6-2.3.4 132 /afs/fnal.gov/ups/gcc/v3_4_3/SunOS+5.6 165 /afs/fnal.gov/ups/gcc/v3_4_3/SunOS+5.8 MINOS26 > mkdir prd/gcc/v3_4_3 MINOS26 > cp -r /afs/fnal.gov/ups/gcc/v3_4_3/Linux+2.6-2.3.4 prd/gcc/v3_4_3 ... continue on 2008 06 18 MINOS26 > mkdir db/gcc MINOS26 > cp /afs/fnal.gov/ups/db/gcc/v3_4_3.table db/gcc/ MINOS26 > cp /afs/fnal.gov/ups/db/gcc/v3_4_3.version db/gcc/ MINOS26 > nedit db/gcc/v3_4_3.version removed all but LInux_2.6... stanzas MINOS26 > ups list -aK+ gcc "gcc" "v3_4_3" "Linux+2.6-2.3.4" "" "" The setups look good now, and root runs. 
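Before cloning, a quick sanity check that the trimmed gcc product really sets up from the copied flavor directory would be something like this sketch (assumes the setups.sh environment sourced above; not run here):
ups list -aK+ gcc v3_4_3
setup gcc v3_4_3
which gcc          # expect a path under prd/gcc/v3_4_3/Linux+2.6-2.3.4 if the copy is picked up
gcc --version | head -1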
Now clone this to the d141 working copy of products, cd /afs/fnal.gov/files/data/minos/d141 cp -r /afs/fnal.gov/files/code/e875/general/ups/prd/gcc prd/gcc cp -r /afs/fnal.gov/files/code/e875/general/ups/db/gcc db/gcc For easier testing, as mindata, did cd /grid/app/minos/parrot FILES=' HOWTO.parrot firstlast.C F00031300_0000.mdaq.root ' for FILE in ${FILES} ; do curl http://www-numi.fnal.gov/computing/parrot/${FILE} -o ${FILE} done Successfully ran loon, without /usr/local/etc/setups.sh Now repeated testing... strange inconsistencies. setups produces ERROR: ld.so: object '/cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. which dthain states can be ignored ( or use parrot -H ) root runs, produces the usual batch messages, but hangs on input. ERROR: ld.so: object '/cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. ******************************************* * * * W E L C O M E to R O O T * * * * Version 5.12/00f 23 October 2006 * * * * You are welcome to visit our Web site * * http://root.cern.ch * * * ******************************************* Compiled on 25 March 2007 for linux with thread support. CINT/ROOT C/C++ Interpreter version 5.16.13, June 8, 2006 Type ? for help. Commands must be C++ statements. Enclose multiple statements between { }. .exit PID TTY STAT TIME COMMAND 2966 pts/0 Ss 0:00 -bash 3013 pts/0 R+ 0:00 \_ ps xf 923 pts/7 Ss 0:00 -bash 1029 pts/7 S+ 0:09 \_ parrot -m /cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/mountfile.grow -d remote bash 1030 pts/7 T 0:00 \_ bash 2884 pts/7 T 0:00 \_ root -b 2885 pts/7 T 0:00 \_ /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v5-12-00f/bin/root.exe -splash -b 2886 pts/7 T 0:00 \_ sh -c ldd /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v5-12-00f/bin/root.exe Duh, forgot to rebuild the parrot index for d141 after adding gcc. dthain report that he can run root and loon, with proper output. Installed parrot 2_4_3 as mindata@minos26, per HOWTO.parrot Rebuilt indexes $ REL=2_4_3 ; ARC='i686-linux-2.6' ; DAT='' $ VER=cctools-${REL}${DAT}-${ARC} $ export PARROT_DIR=${PRO}/${VER} $ export PATH=${PARROT_DIR}/bin:${PATH} $ dds /afs/fnal.gov/files/data/minos/d141/.g* -rw-r--r-- 1 kreymer g020 44 Feb 7 08:42 /afs/fnal.gov/files/data/minos/d141/.growfschecksum -rw-r--r-- 1 kreymer g020 70827161 Feb 7 08:42 /afs/fnal.gov/files/data/minos/d141/.growfsdir $ mkdir /afs/fnal.gov/files/data/minos/d141/oldparrot $ mv /afs/fnal.gov/files/data/minos/d141/.g* /afs/fnal.gov/files/data/minos/d141/oldparrot $ time make_growfs -k /afs/fnal.gov/files/data/minos/d141 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d141/.growfsdir make_growfs: no directory exists, this might be quite slow... make_growfs: scanning directory tree for changes... make_growfs: 991412 files, 6817 links, 107259 dirs, 0 checksums computed real 9m10.921s user 0m46.863s sys 1m25.696s du -sk /afs/fnal.gov/files/data/minos/d141/.growfsdir 41195 /afs/fnal.gov/files/data/minos/d141/.growfsdir $ mkdir /afs/fnal.gov/files/data/minos/d199/oldparrot $ mv /afs/fnal.gov/files/data/minos/d199/.g* /afs/fnal.gov/files/data/minos/d199/oldparrot $ time make_growfs -k /afs/fnal.gov/files/data/minos/d199 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d199/.growfsdir make_growfs: no directory exists, this might be quite slow... 
make_growfs: scanning directory tree for changes... make_growfs: 727335 files, 16246 links, 59829 dirs, 0 checksums computed real 6m41.867s user 0m33.389s sys 1m5.625s P> ups list -aK+ ERROR: ld.so: object '/cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. GRRR, all products are gone now. Noted that fcdflnx3 is SL 5.0 Let's try SL 4.4, PRO=/minos/scratch/parrot ============================================================================= 2008 06 13 ============================================================================= ########## # CONDOR # ########## 4 email indicate that minos02 /local/scratch1 seems to have filled around Date: Fri, 13 Jun 2008 14:44:48 -0500 Date: Fri, 13 Jun 2008 14:44:57 -0500 Date: Fri, 13 Jun 2008 14:45:09 -0500 Date: Fri, 13 Jun 2008 14:45:22 -0500 The disk looks OK now ( 14:50 ) Ganglia shows 30 GB used up, starting 14:14, ran out around 14:44, freed up around 14:50 ########### # BLUEARC # ########### Date: Fri, 13 Jun 2008 14:51:43 -0500 (CDT) Subject: HelpDesk ticket 117175 ___________________________________________ Short Description: Quota request for BlueArc served /minos/scratch, for loiacono Problem Description: LSC/CSI : Please set an individual storage quota of 1000 GBytes for user loiacono on the BlueArc served /minos/scratch volume. This increases the existing 500 GBytes quota. ___________________________________________ ########## # PARROT # ########## Checking out X86_64 on fcdflnx3 REL=2_4_3 ; ARC='x86_64-linux-2.6' ; DAT='' copied and modified setup.sh which had gotten lost on fcdflnx2 at $HOME, moved to parrot After ls -d ... 53 /tmp/parrot.1060/ catting a file works, P> . /usr/local/etc/setups.sh ERROR: ld.so: object '/cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. P> uname -a Linux fcdflnx3.fnal.gov 2.6.18-53.1.6.el5 #1 SMP Wed Jan 23 11:37:57 EST 2008 x86_64 x86_64 x86_64 GNU/Linux P> cat /etc/redhat-release Scientific Linux SLF release 5.0 (Lederman) model name : Intel(R) Xeon(R) CPU E5335 @ 2.00GHz P> file /cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so /cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped This may be harmless After ls 53 /tmp/parrot.1060/ After setup_minos 121 /tmp/parrot.1060/ After root -v - the splash came up, After loon with local file, stuck at vector.dll, 378 /tmp/parrot.1060/ fcdflnx3 > ps xf PID TTY STAT TIME COMMAND 7682 pts/3 Ss 0:00 -bash 7750 pts/3 S+ 0:00 \_ script parrot.log 7751 pts/3 S+ 0:00 \_ script parrot.log 7752 pts/7 Ss 0:00 \_ bash -i 7826 pts/7 S+ 0:15 \_ parrot -m /cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/mountfile.grow -d remote bash 7827 pts/7 T 0:00 \_ bash 8071 pts/7 T 0:02 \_ loon -bq firstlast.C F00031300_0000.mdaq.root 8073 pts/7 T 0:00 \_ sh -c ldd /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.24.2/bin/Linux2.6-GCC_3_4/loon 3297 pts/5 Ss 0:00 -bash 8076 pts/5 R+ 0:00 \_ ps xf ########## # PARROT # ########## Checked sizes on fcdflnx2 fcdflnx2 > du -sm /tmp/parrot.1060/ 384 /tmp/parrot.1060/ Now a clean start, After ls -d ... 
53 /tmp/parrot.1060/ After setup_minos 121 /tmp/parrot.1060/ After loon 384 /tmp/parrot.1060/ Repeated loon run seems to be OK Repeated parrot seems to be OK no increase in size of /tmp/parrot.1060/ setup and ran root v5_14_00g -q after setup_minos, OK created setup.sh for quicker usage test ran with squid, OK ran with DCache , uses libdcap.so , OK 437 /tmp/parrot.1060/ fcdflnx2 > du -sm /tmp/parrot.1060/* | sort -n ... 11 /tmp/parrot.1060/7a 13 /tmp/parrot.1060/6e 18 /tmp/parrot.1060/8f 52 /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 52 /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199-- 68 /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d141-- ============================================================================= 2008 06 12 ============================================================================= ########## # PARROT # ########## INSTALLATION simplified, for major release install kreymer@KREYMERLAP very slow moving files to VCC/Ely Tried also on fcdflnx2 ( SLF 4.5 32 bit ) cd ${HOME}/parrot FILES=' HOWTO.parrot mountfile.grow firstlast.C F00031300_0000.mdaq.root ' for FILE in ${FILES} ; do curl http://www-numi.fnal.gov/computing/parrot/${FILE} -o ${FILE} done VER=2_4_3 KERN='i686' TARD=cctools-${VER}-${KERN}-linux-2.6 TARP=${TARD}.tar.gz curl http://www.cse.nd.edu/~ccl/software/files/${TARP} -o ${TARP} tar xzvf ${TARP} ln -s ../mountfile.grow ${TARD}/ TESTING VER=cctools-${VER}-${KERN}-linux-2.6 export PARROT_DIR=${HOME}/parrot/${TARD} export PATH=${PARROT_DIR}/bin:${PATH} #export HTTP_PROXY="http://squid.fnal.gov:3128" parrot -m ${PARROT_DIR}/mountfile.grow -d remote bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft PS1='P> ' . /usr/local/etc/setups.sh export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup setup_minos() { . $MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } setup_minos -r R1.24.2 No default SAM configuration exists at this time. MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory type loon DFILE=F00031300_0000.mdaq.root loon -bq firstlast.C ${DFILE} ============================================================================= 2008 06 11 ============================================================================= ######## # GRID # ######## MINOS26 > quota -g e875 Disk quotas for group e875 (gid 5111): Filesystem blocks quota limit grace files quota limit grace blue2:/fermigrid-data 324961152 0 419430400 126348 0 0 blue2:/fermigrid-app 31344704 0 31457280 328480 0 0 MINOS26 > du -sm * 1834 Minossoft du: `VDT/vdt/extract': Permission denied du: `VDT/vdt/backup': Permission denied du: `VDT/vdt/services': Permission denied 287 VDT 1 bin du: `minfarm/Minossoft/EXTERNAL/mysql-5.0.22/sql/share/japanese-sjis': Permission denied 15192 minfarm 488 parrot 4458 products 56 sam 6269 scripts 7209 users MINOS26 > du -sm users/* 1700 users/boehm 2818 users/loiacono 1 users/pawloski 2681 users/rustem 10 users/scavan MINOS26 > chmod -R 755 products MINOS26 > rm -r products ########## # CONDOR # ########## Regarding intermittent kxlist failures/timeouts in kproxy These stopped after the KCA server upgrade The last was on 21 May, upgrade was on 28 May ######## # LSF # ######## Date: Wed, 11 Jun 2008 13:24:29 -0500 From: Joseph Boyd To: Arthur E. Kreymer , Robert W. Hatcher Subject: LSF on Minos Hi Art, What is the state of LSF on minos? 
We got a ticket today that running bjobs (or anything) on minos13 caused and error. That was confirmed. If I kill all the lsf daemons running on that machine though then everything works (it presumably goes and talks to some other server). Looking at all the minos machines, various machines have various things running. Some have nothing. Can you please let me know what the current state is so I can fix minos13 if I've broken it and also so I can document what it's supposed to look like. Is it going away soon? Thanks, joe ---------------------------------------------------------------------- Could not log in, around 14:00 This is working fine, as of 14:47 bjobs, bhosts, bqueues ######## # GRID # ######## Clean out a little space in /grid/app/minos, per email warning 99.7% full ( 30 GB ) ############ # MCIMPORT # ############ ------------------------------------------------------------- Date: Tue, 10 Jun 2008 11:23:51 -0500 (CDT) From: Kregg E Arms To: Arthur Kreymer Cc: Ben Speakman Subject: short runs (fwd) Hi Art, Ben found four of the AtmosNu files I generated for him had problems. I will rerun these (today?) and upload new copies of the reroot files. Can you remove the old ones from pnfs, etc.? /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330002_0004_AtmosNu_D04.reroot.root /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330002_0005_AtmosNu_D04.reroot.root /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330004_0005_AtmosNu_D04.reroot.root /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330004_0008_AtmosNu_D04.reroot.root ------------------------------------------------------------- kreymer@minos26 FILES=' f21330002_0004_AtmosNu_D04.reroot.root f21330002_0005_AtmosNu_D04.reroot.root f21330004_0005_AtmosNu_D04.reroot.root f21330004_0008_AtmosNu_D04.reroot.root ' for FILE in ${FILES} ; do ls -l /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/${FILE} rm /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/${FILE} sam undeclare ${FILE} done -rw-r--r-- 1 kreymer e875 74925261 Apr 6 22:43 /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330002_0004_AtmosNu_D04.reroot.root -rw-r--r-- 1 kreymer e875 15996235 Apr 6 22:44 /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330002_0005_AtmosNu_D04.reroot.root -rw-r--r-- 1 kreymer e875 64018225 Apr 6 23:07 /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330004_0005_AtmosNu_D04.reroot.root -rw-r--r-- 1 kreymer e875 66456480 Apr 6 23:14 /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330004_0008_AtmosNu_D04.reroot.root MINOS26 > date Wed Jun 11 07:31:14 CDT 2008 ============================================================================= 2008 06 09 ============================================================================= MINOS01 > setup systools MINOS01 > cmd add_minos_user djalbrec Creating account... /var/yp gmake[1]: Entering directory `/var/yp/minos' gmake[1]: `ypservers' is up to date. gmake[1]: Leaving directory `/var/yp/minos' gmake[1]: Entering directory `/var/yp/minos' Updating passwd.byname... Updating passwd.byuid... Updating netid.byname... gmake[1]: Leaving directory `/var/yp/minos' Adding user to Minos AFS group... Sending mail to subscribe to minos-user mailing list ... Sending email to user... ============================================================================= 2008 06 06 ============================================================================= ####### # LSF # ####### fsun02 is up again. MRTG network rates indicate a spike in before drop Wed Jun 4 11:00 ish And resumed activity Fri Jun 06 Jun 11:00. 
########### # ENSTORE # ########### ( cd /pnfs/minos/reco_near ; enstore pnfs --tags ) ( cd /pnfs/minos/reco_far ; enstore pnfs --tags ) .(tag)(library) = CD-9940B Date: Fri, 06 Jun 2008 21:33:51 -0500 (CDT) Subject: HelpDesk ticket 116813 ___________________________________________ Short Description: Please move future /pnfs/minos/reco_near and reco_far writes to CD-LTO-3 Problem Description: enstore-admin : It seems to be a good time to move the bulk of remaining Minos writes from 9940B tape to LTO-3 tape. Therefore, please do something like the following to direct future writes under these paths toward LTO-3 tape : cd /pnfs/minos/reco_near enstore pnfs --library CD-LTO3 cd /pnfs/minos/reco_far enstore pnfs --library CD-LTO3 ___________________________________________ This ticket is assigned to BERG, DAVID of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Mon, 09 Jun 2008 17:31:47 -0500 (CDT) Solution: berg@fnal.gov sent this solution: Art, I changed the library tags under reco_near and reco_far to CD-LTO3. Everything below those points in the tree inherits the tags, and will now write to LTO3, except R1_18_2, which for some reason has primary tags (dated Nov 16, 2005): 1 CD-9940B minos/reco_far/R1_18_2 1 CD-9940B minos/reco_near/R1_18_2 I don't know if anything will be written under these directories in the future, but they are still set to CD-9940B. I can changes these directories to inherit from the level above like the others, if you like. - David ___________________________________________ Date: Mon, 16 Jun 2008 01:44:21 +0000 (UTC) Thanks ! Nothing should be written to the R1_18_2 directories again. It is fine with us if you change these to inherit. ######## # FARM # ######## Following up on Rubin note of 16 May, regarding undeclare cand's in /minos/data/minfarm/mcnear/cp_to_dc MINOS26 > MCFILS=`cat /minos/data/minfarm/mcnear/cp_to_dc` MINOS26 > for FIL in ${MCFILS} ; do sam locate ${FIL} ; done ... Datafile with name 'n13047014_0025_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047014_0027_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047014_0028_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047014_0029_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0000_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0001_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0002_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0014_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0015_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0016_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0021_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0030_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047042_0000_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047042_0002_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047042_0004_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047042_0006_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. 
Datafile with name 'n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. These files were not copied to DCache. I may copy these manually, using them to test srmcp versus gsi_ftp rates. ######## # DATA # ######## Date: Fri, 06 Jun 2008 11:23:27 -0500 (CDT) Subject: HelpDesk ticket 116781 ___________________________________________ Short Description: Quota request for BlueArc served /minos/scratch, for jjling Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user jjling on the BlueArc served /minos/scratch volume. This in an increase from the existing default 100 GBytes quota. ___________________________________________ Date: Fri, 06 Jun 2008 11:33:23 -0500 (CDT) This ticket has been reassigned to HILL, KEVIN of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ ============================================================================= 2008 06 05 ============================================================================= ######### # ADMIN # ######### Will try again to use setup systools cmd add_minos_user djalbrec after this user gets a FNALU account ( for home area ) Updated web link for account, at /afs/fnal.gov/files/expwww/numi/html/minwork/computing/account.html Changed this to a symlink to a dated file. ############## # AFSERRSCAN # ############## Made the MON=${2} selection optional Removed the default being the current month. ########## # PARROT # ########## INSTALLATION mindata@minos26 cd /grid/app/minos/parrot VER=current VERX="-20080604" TARD=cctools-${VER}-x86_64-linux-2.6 TARX=cctools-${VER}${VERX}-x86_64-linux-2.6 TARP=${TARD}.tar.gz TARL=${TARX}.tar.gz curl http://www.cse.nd.edu/~ccl/software/files/${TARP} -o ${TARL} tar xzvf ${TARL} [ -n "${VERX}" ] && mv ${TARD} ${TARX} ln -s ../mountfile2.grow ${TARX}/ cat /grid/app/minos/parrot/mountfile2.grow /afs/fnal.gov/files/code/e875/general/minossoft /grow/www-numi.fnal.gov/computing/d199/ /afs/fnal.gov/files/code/e875/general/ups /grow/www-numi.fnal.gov/computing/d141/ TESTING ssh fnpc194 PVER=cctools-current-20080604-x86_64-linux-2.6 export PARROT_DIR=/grid/app/minos/parrot/${PVER} export PATH=${PARROT_DIR}/bin:${PATH} parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft This is OK ! PVER=cctools-current-20080604-x86_64-linux-2.6 export PARROT_DIR=/grid/app/minos/parrot/${PVER} export PATH=${PARROT_DIR}/bin:${PATH} export HTTP_PROXY="http://squid.fnal.gov:3128" parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft Squid worked ! P> . /usr/local/etc/setups.sh bash: child setpgid (27573 to 27572): Operation not permitted ERROR: ld.so: object '/grid/app/minos/parrot/cctools-current-20080604-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. 
P> setup_minos -r R1.24.2 MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory P> cd /minos/scratch/kreymer/condor/loonb P> DFILE=F00031300_0000.mdaq.root P> loon -bq firstlast.C ${DFILE} Host: squid.fnal.gov 2008/06/05 11:45:45.185560 [27820] parrot: http: HTTP/1.0 200 OK 2008/06/05 11:45:45.185573 [27820] parrot: http: Date: Thu, 05 Jun 2008 16:45:45 GMT 2008/06/05 11:45:45.185583 [27820] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/06/05 11:45:45.185593 [27820] parrot: http: Last-Modified: Sat, 03 Nov 2007 02:37:05 GMT 2008/06/05 11:45:45.185602 [27820] parrot: http: ETag: "4ec3eb58-2721c0-43dfd28a89641" 2008/06/05 11:45:45.185611 [27820] parrot: http: Accept-Ranges: bytes 2008/06/05 11:45:45.185619 [27820] parrot: http: Content-Length: 2564544 2008/06/05 11:45:45.185630 [27820] parrot: http: Content-Type: application/x-msdownload 2008/06/05 11:45:45.185639 [27820] parrot: http: X-Cache: MISS from fg2x3.fnal.gov 2008/06/05 11:45:45.185648 [27820] parrot: http: Via: 1.0 fg2x3.fnal.gov:3128 (squid/2.6.STABLE17) 2008/06/05 11:45:45.185657 [27820] parrot: http: Proxy-Connection: close 2008/06/05 11:45:45.185665 [27820] parrot: http: 2008/06/05 11:45:45.185673 [27820] parrot: grow: open http://www-numi.fnal.gov:80/computing/d141///prd/MINOS_ROOT/Linux2.4-GCC_3_4/v5-12-00f/cint/stl/vector.dll and no further action Around 12:00, 27437 pts/0 Ss 0:00 -bash 27553 pts/0 S+ 0:15 \_ parrot -m /grid/app/minos/parrot/cctools-current-20080604-x86_64-linux-2.6/mountfile2.grow -d remote bash 27554 pts/0 T 0:00 \_ bash 27820 pts/0 T 0:02 \_ loon -bq firstlast.C F00031300_0000.mdaq.root 27897 pts/0 T 0:00 \_ sh -c ldd /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.24.2/bin/Linux2.6-GCC_3_4/loon MIN > curl http://www-numi.fnal.gov:80/computing/d141///prd/MINOS_ROOT/Linux2.4-GCC_3_4/v5-12-00f/cint/stl/vector.dll -o /var/tmp/vector.dll killed the parrot session Trying again without squid P> loon -bq firstlast.C ${DFILE} ERROR: ld.so: object '/grid/app/minos/parrot/cctools-current-20080604-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. Warning in : class timespec already in TClassTable P> loon -bq firstlast.C ${DFILE} ERROR: ld.so: object '/grid/app/minos/parrot/cctools-current-20080604-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. Trying again with a clean cache rm -r /tmp/parrot.1060 Stuck again at the same place, vector.dll For the record, application tests : rm -r /tmp/parrot.1060 PVER= export PARROT_DIR=/grid/app/minos/parrot/${PVER} export PATH=${PARROT_DIR}/bin:${PATH} export HTTP_PROXY="http://squid.fnal.gov:3128" parrot -m ${PARROT_DIR}/mountfile2.grow bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft PS1='P> ' . /usr/local/etc/setups.sh export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup setup_minos() { . 
$MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } setup_minos -r R1.24.2 type loon cd /minos/scratch/kreymer/condor/loonb DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root DFILE=F00031300_0000.mdaq.root loon -bq firstlast.C ${DFILE} ######## # FARM # ######## ./roundup -r cedar_phy_bhcurv mcnear Reported long PEND list to minos_batch Reported long time PEND cedar items to minos_batch ####### # LSF # ####### First reports via email around 04:39 UTC ( 23:39 CDT ) MINOS26 > bqueues batch system daemon not responding ... still trying batch system daemon not responding ... still trying ... fsun02 is down Date: Thu, 05 Jun 2008 10:44:00 -0500 (CDT) Subject: HelpDesk ticket 116724 ___________________________________________ Short Description: fsui02 is off the network, taking FNALU batch down Problem Description: The fsui02 system is off the network. This takes down the FNALU LSF batch system. This seems to have happened as long ago as 23:30 last night. ___________________________________________ Date: Thu, 05 Jun 2008 11:03:43 -0500 (CDT) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Fri, 6 Jun 2008 02:06:38 +0000 (UTC) From: Arthur Kreymer To: minos-users@fnal.gov Correction, it is fsun02 that is down. We still have no status update from the managers. They are aware of the problem. ___________________________________________ Correction, it is fsun02 that is down, I expect you knew that. It is hard to avoid typing fsui02, force of habit. ___________________________________________ Date: Sat, 07 Jun 2008 02:50:56 +0000 (UTC) The LSF queues seem to be active again, and jobs are running. MRTG monitoring shows fsun02 active again Fri 2008 Jun 06 11:00 ___________________________________________ Date: Mon, 17 Nov 2008 12:50:34 -0600 (CST) Solution: fsui02 was decommissioned. This ticket was resolved by BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST group. 
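For the next outage of this sort, a quick LSF health check could be kept handy, along these lines (a sketch only; lsid and bhosts are the standard LSF client commands and assume the usual LSF environment on an FNALU node):
lsid            # cluster name and current master host; fails fast if the master is unreachable
bhosts -w       # per-host batch status; unreach/unavail would point at the broken server
bqueues | head -5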
============================================================================= 2008 06 04 ============================================================================= ########## # PARROT # ########## Resume testing latest version INSTALLATION 2.4.2, which has proxy problems, is still latest point release But there is a more recent x86_64 version Man 29, after last tests 2008 05 23 mindata@minos26 cd /grid/app/minos/parrot VER=current VERX="-20080529" TARD=cctools-${VER}-x86_64-linux-2.6 TARX=cctools-${VER}${VERX}-x86_64-linux-2.6 TARP=${TARD}.tar.gz TARL=${TARX}.tar.gz curl http://www.cse.nd.edu/~ccl/software/files/${TARP} -o ${TARL} tar xzvf ${TARL} [ -n "${VERX}" ] && mv ${TARD} ${TARX} ln -s ../mountfile2.grow ${TARX}/ cat /grid/app/minos/parrot/mountfile2.grow /afs/fnal.gov/files/code/e875/general/minossoft /grow/www-numi.fnal.gov/computing/d199/ /afs/fnal.gov/files/code/e875/general/ups /grow/www-numi.fnal.gov/computing/d141/ TESTING ssh fnpc194 PVER=cctools-current-20080529-x86_64-linux-2.6 export PARROT_DIR=/grid/app/minos/parrot/${PVER} export PATH=${PARROT_DIR}/bin:${PATH} parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft bash-3.00$ ls -d /afs/fnal.gov/files/code/e875/general/minossoft 2008/06/04 16:08:38.639664 [31451] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 2008/06/04 16:08:38.639771 [31451] parrot: grow: fetching checksum: 2008/06/04 16:08:38.639800 [31451] parrot: http: connect www-numi.fnal.gov port 80 2008/06/04 16:08:38.641832 [31451] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfschecksum HTTP/1.0 Host: www-numi.fnal.gov 2008/06/04 16:08:38.650005 [31451] parrot: http: HTTP/1.1 200 OK 2008/06/04 16:08:38.650059 [31451] parrot: http: Date: Wed, 04 Jun 2008 21:08:38 GMT 2008/06/04 16:08:38.650070 [31451] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/06/04 16:08:38.650080 [31451] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:19 GMT 2008/06/04 16:08:38.650090 [31451] parrot: http: ETag: "534d65ce-2c-44592a26655c3" 2008/06/04 16:08:38.650098 [31451] parrot: http: Accept-Ranges: bytes 2008/06/04 16:08:38.650106 [31451] parrot: http: Content-Length: 44 2008/06/04 16:08:38.650115 [31451] parrot: http: Connection: close 2008/06/04 16:08:38.650123 [31451] parrot: http: Content-Type: text/plain 2008/06/04 16:08:38.650130 [31451] parrot: http: 2008/06/04 16:08:38.650142 [31451] parrot: grow: remote checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d 2008/06/04 16:08:38.650187 [31451] parrot: grow: fetching directory: http://www-numi.fnal.gov:80/computing/d199//.growfsdir 2008/06/04 16:08:38.650253 [31451] parrot: http: connect www-numi.fnal.gov port 80 2008/06/04 16:08:38.651219 [31451] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfsdir HTTP/1.0 Host: www-numi.fnal.gov 2008/06/04 16:08:38.654426 [31451] parrot: http: HTTP/1.1 200 OK 2008/06/04 16:08:38.654466 [31451] parrot: http: Date: Wed, 04 Jun 2008 21:08:38 GMT 2008/06/04 16:08:38.654476 [31451] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/06/04 16:08:38.654486 [31451] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:09 GMT 2008/06/04 16:08:38.654496 [31451] parrot: http: ETag: "5350b140-33ac14e-44592a1cdbf44" 2008/06/04 16:08:38.654505 [31451] parrot: http: Accept-Ranges: bytes 2008/06/04 16:08:38.654513 [31451] parrot: http: Content-Length: 54182222 
2008/06/04 16:08:38.654522 [31451] parrot: http: Connection: close 2008/06/04 16:08:38.654529 [31451] parrot: http: Content-Type: text/plain 2008/06/04 16:08:38.654537 [31451] parrot: http: 2008/06/04 16:08:39.836793 [31451] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 2008/06/04 16:08:40.545656 [31451] parrot: grow: local checksum: 2adb169a42c725ccfbe2b2174da7d8d9e46121f4 2008/06/04 16:08:40.545802 [31451] parrot: grow: checksum does not match, reloading... 2008/06/04 16:08:40.545897 [31451] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 2008/06/04 16:08:40.545943 [31451] parrot: grow: fetching checksum: http://www-numi.fnal.gov:80/computing/d199//.growfsdir 2008/06/04 16:08:40.545993 [31451] parrot: http: connect www-numi.fnal.gov port 80 2008/06/04 16:08:40.547439 [31451] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfschecksum HTTP/1.0 Host: www-numi.fnal.gov 2008/06/04 16:08:40.551883 [31451] parrot: http: HTTP/1.1 200 OK 2008/06/04 16:08:40.551994 [31451] parrot: http: Date: Wed, 04 Jun 2008 21:08:40 GMT 2008/06/04 16:08:40.552036 [31451] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/06/04 16:08:40.552076 [31451] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:19 GMT 2008/06/04 16:08:40.552115 [31451] parrot: http: ETag: "534d65ce-2c-44592a26655c3" 2008/06/04 16:08:40.552150 [31451] parrot: http: Accept-Ranges: bytes 2008/06/04 16:08:40.552183 [31451] parrot: http: Content-Length: 44 2008/06/04 16:08:40.552218 [31451] parrot: http: Connection: close 2008/06/04 16:08:40.552251 [31451] parrot: http: Content-Type: text/plain 2008/06/04 16:08:40.552283 [31451] parrot: http: 2008/06/04 16:08:40.552320 [31451] parrot: grow: remote checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d 2008/06/04 16:08:40.552392 [31451] parrot: grow: fetching directory: http://www-numi.fnal.gov:80/computing/d199//.growfsdir 2008/06/04 16:08:40.552488 [31451] parrot: http: connect www-numi.fnal.gov port 80 2008/06/04 16:08:40.553482 [31451] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfsdir HTTP/1.0 Host: www-numi.fnal.gov 2008/06/04 16:08:40.559354 [31451] parrot: http: HTTP/1.1 200 OK 2008/06/04 16:08:40.559467 [31451] parrot: http: Date: Wed, 04 Jun 2008 21:08:39 GMT 2008/06/04 16:08:40.559508 [31451] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/06/04 16:08:40.559548 [31451] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:09 GMT 2008/06/04 16:08:40.559588 [31451] parrot: http: ETag: "5350b140-33ac14e-44592a1cdbf44" 2008/06/04 16:08:40.559626 [31451] parrot: http: Accept-Ranges: bytes 2008/06/04 16:08:40.559661 [31451] parrot: http: Content-Length: 54182222 2008/06/04 16:08:40.559695 [31451] parrot: http: Connection: close 2008/06/04 16:08:40.559728 [31451] parrot: http: Content-Type: text/plain 2008/06/04 16:08:40.559771 [31451] parrot: http: 2008/06/04 16:08:56.323562 [31451] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 2008/06/04 16:08:57.046740 [31451] parrot: grow: local checksum: 2adb169a42c725ccfbe2b2174da7d8d9e46121f4 2008/06/04 16:08:57.046890 [31451] parrot: grow: checksum does not match, reloading... 2008/06/04 16:08:57.046977 [31451] parrot: grow: directory and checksum are inconsistent, retry in 2 seconds ... 
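One way to narrow down whether the stale index is on the server or in a cache would be to checksum the served .growfsdir directly and through squid. A sketch only; it assumes the grow checksum is a plain SHA-1 of .growfsdir, which matches the 40-hex values above but has not been verified here.
URL=http://www-numi.fnal.gov:80/computing/d199
wget -q ${URL}/.growfschecksum -O -                          # published checksum
wget -q ${URL}/.growfsdir -O - | sha1sum                     # direct fetch
http_proxy=http://squid.fnal.gov:3128 \
  wget -q ${URL}/.growfsdir -O - | sha1sum                   # same fetch via squid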
Try this with 2.4.2 ( which will not use Squid ) ssh fnpc194 PVER=cctools-2_4_2-x86_64-linux-2.6 export PARROT_DIR=/grid/app/minos/parrot/${PVER} export PATH=${PARROT_DIR}/bin:${PATH} parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft ########### # MONTHLY # ########### DATASETS 6/4 PREDATOR 6/4 VAULT 6/3 from cron, overall rate 32 MB/sec MYSQL 6/5 Thu Jun 5 11:03:07 CDT 2008 Thu Jun 5 11:53:27 CDT 2008 ########## # CONDOR # ########## Glideins stopped getting scheduled to run at around 04:30. Probably due to this on fnpcsrv1 : SRV1> condor_q rustem -- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 2008-06-04 08:33:25-05 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1879557.0 rustem 6/4 02:07 0+06:20:17 R 0 0.0 run_study.sh /grid 1879557.2 rustem 6/4 02:07 0+06:20:17 R 0 0.0 run_study.sh /grid ... 1879691.38 rustem 6/4 03:49 0+00:00:00 I 0 0.0 run_study.sh /grid 1879691.39 rustem 6/4 03:49 0+00:00:00 I 0 0.0 run_study.sh /grid 423 jobs; 50 idle, 373 running, 0 held This is using the entire Minos group allocation No more of our jobs will run until these jobs finish. ########## # CONDOR # ########## Corrected default factors for newer users condor_userprio -all mtavera@fnal.gov 0.50 0.50 1.00 0 8.09 6/03/2008 10:40 6/03/2008 20:15 pittam@fnal.gov 0.57 0.57 1.00 0 149.31 5/27/2008 11:20 6/04/2008 00:35 naples@fnal.gov 1.04 1.04 1.00 0 167.95 5/09/2008 15:14 6/03/2008 22:00 jjling@fnal.gov 25.71 25.71 1.00 36 4038.77 5/28/2008 16:30 6/04/2008 08:50 condor_userprio -setfactor mtavera@fnal.gov 100. condor_userprio -setfactor pittam@fnal.gov 100. condor_userprio -setfactor naples@fnal.gov 100. condor_userprio -setfactor jjling@fnal.gov 100. condor_userprio -setfactor pawloski@fnal.gov 100. 
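Since new pool members keep arriving with the default 1.00 factor, the correction can be done in one pass. A minimal sketch of the pattern above; the user list here is only an example:

# set the standard 100. priority factor for a list of pool users (example list)
NEWUSERS='mtavera pittam naples jjling pawloski'
for U in ${NEWUSERS} ; do
  condor_userprio -setfactor ${U}@fnal.gov 100.
done
condor_userprio -all   # verify the new factors took effect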
# no longer needs boost
=============================================================================
2008 06 03
=============================================================================
############
# MCIMPORT #
############
Planning for
rm /home/mindata/STAGE # was /local/scratch26/mindata
ln -s /minos/data/mcimport /home/mindata/STAGE
$ ls -l /local/scratch26/mindata
total 12
drwxr-xr-x 2 mindata e875 4096 Mar 6 16:17 141
drwxr-xr-x 2 mindata e875 4096 Jun 3 08:37 CRON
drwxr-xr-x 14 mindata e875 4096 Mar 3 17:22 MOVED
lrwxrwxrwx 1 mindata e875 28 Nov 18 2007 OVERLAY -> /minos/data/mcimport/OVERLAY
lrwxrwxrwx 1 mindata e875 25 Oct 30 2007 arms -> /minos/data/mcimport/arms
lrwxrwxrwx 1 mindata e875 26 Nov 5 2007 boehm -> /minos/data/mcimport/boehm
lrwxrwxrwx 1 mindata e875 28 Oct 30 2007 buckley -> /minos/data/mcimport/buckley
lrwxrwxrwx 1 mindata e875 26 Oct 30 2007 gmieg -> /minos/data/mcimport/gmieg
lrwxrwxrwx 1 mindata e875 28 Oct 31 2007 hgallag -> /minos/data/mcimport/hgallag
lrwxrwxrwx 1 mindata e875 27 Oct 30 2007 himmel -> /minos/data/mcimport/himmel
lrwxrwxrwx 1 mindata e875 29 Oct 31 2007 howcroft -> /minos/data/mcimport/howcroft
lrwxrwxrwx 1 mindata e875 29 Nov 3 2007 kordosky -> /minos/data/mcimport/kordosky
lrwxrwxrwx 1 mindata e875 28 Oct 31 2007 kreymer -> /minos/data/mcimport/kreymer
lrwxrwxrwx 1 mindata e875 30 Nov 3 2007 mcinwrite -> /minos/data/mcimport/mcinwrite
lrwxrwxrwx 1 mindata e875 28 Feb 25 17:42 mtavera -> /minos/data/mcimport/mtavera
lrwxrwxrwx 1 mindata e875 27 Oct 31 2007 mualem -> /minos/data/mcimport/mualem
lrwxrwxrwx 1 mindata e875 26 Feb 25 18:23 nwest -> /minos/data/mcimport/nwest
lrwxrwxrwx 1 mindata e875 29 Oct 30 2007 rhatcher -> /minos/data/mcimport/rhatcher
lrwxrwxrwx 1 mindata e875 24 Nov 2 2007 sjc -> /minos/data/mcimport/sjc
lrwxrwxrwx 1 mindata e875 27 Oct 30 2007 urheim -> /minos/data/mcimport/urheim
$ ls -l /local/scratch26/mindata/CRON
total 4
-rw-r--r-- 1 mindata e875 6 Mar 6 13:57 mcimport.tar.pid
$ ln -sf /minos/data/mindata /home/mindata/STAGE
$ date
Tue Jun 3 11:59:25 CDT 2008
Cleanup - remove the MOVED and 141 directory files.
There is more that can be archived,
MDS3 > du -sm /home/mindata/STAGE/STAGE
3699254 STAGE
=============================================================================
2008 06 02
=============================================================================
########
# FARM #
########
Tracking down size mismatch in loopCPB1
n13037702_0018_L010185N_D04.cand.cedar_phy_bhcurv.1.root
-rw-r--r-- 1 rubin numi 570061246 May 21 21:04 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/770/n13037702_0018_L010185N_D04.cand.cedar_phy_bhcurv.1.root
This is odd, as the file was moved to WRITE on May 28, long after this file went into PNFS.
This file is declared to SAM already. Seems like a classic duplicate.
For the present,
FIL=n13037702_0018_L010185N_D04.cand.cedar_phy_bhcurv.1.root
mv WRITE/${FIL} DUP/${FIL}
loopCPB1 is now complete
loopCPB gets nowhere, as everything is pending.
########
# FARM #
########
Rustem points out that cedar_phy_bhcurv Far run 37901 is not in PNFS.
This data was processed on Dec 12 2007.
Concatenation has been stalled due to a missing subrun. The missing subrun is 15, whose size is normal in the raw data, see /pnfs/minos/fardet_data/2007-04/F00037901*
Subrun F00037901_0015 produced output during the cedar_phy pass on the data.
I have forced concatenation of F00037901 without subrun 15.
I suggest that someone look at Farm logs to see why subrun 15 is missing.
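A sketch for enumerating which subruns of the run have reco output waiting for concatenation, compared against the raw subruns in PNFS. The farcat path and the stream glob are illustrative assumptions, not the actual roundup bookkeeping:

RUN=F00037901
RAWDIR=/pnfs/minos/fardet_data/2007-04      # raw subruns, as noted above
CATDIR=/minos/data/minfarm/farcat           # assumed concatenation input area
SUBS=`ls ${RAWDIR} | grep "^${RUN}_" | cut -c 11-14 | sort -u`
for SUB in ${SUBS} ; do
  ls ${CATDIR}/${RUN}_${SUB}.*.cedar_phy_bhcurv.0.root > /dev/null 2>&1 \
    || echo "no reco output for subrun ${SUB}"
done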
SRV1> ./roundup -n -s 37901 -r cedar_phy_bhcurv far
...
OK - 568 Mbytes in 1 runs
PEND - have 23/24 subruns for F00037901_*.all.sntp.cedar_phy_bhcurv.0.root 173 12/12 14:58 0 23
OK - stream spill.bntp.cedar_phy_bhcurv
OK - 102 Mbytes in 1 runs
PEND - have 23/24 subruns for F00037901_*.spill.bntp.cedar_phy_bhcurv.0.root 173 12/12 14:59 0 23
OK - stream spill.mrnt.cedar_phy_bhcurv
OK - 65 Mbytes in 1 runs
PEND - have 23/24 subruns for F00037901_*.spill.mrnt.cedar_phy_bhcurv.0.root 173 12/12 14:59 0 23
OK - stream spill.sntp.cedar_phy_bhcurv
OK - 69 Mbytes in 1 runs
PEND - have 23/24 subruns for F00037901_*.spill.sntp.cedar_phy_bhcurv.0.root 173 12/12 14:58 0 23
SRV1> ./roundup -s 37901 -r cedar_phy_bhcurv far
SRV1> ./roundup -f 1 -s 37901 -r cedar_phy_bhcurv far
##########
# cflsum #
##########
Corrected cflsum to use release_data not log_data ( space issues )
MIN > ln -sf cflsum.20080602 cflsum # was 20070702
MINOS26 > ${HOME}/minos/scripts/cflsum > cflsum.`date +%Y%m%d`
###########
# MINOS25 #
###########
System is in desperate trouble.
condor_q no longer responds.
Dozens of processes in 'D' state.
Load average started rising around 09:30, with pawloski submission, but root cause is probably rmehdi loon job using 2.3 GB of memory.
Trying to shut down condor gracefully.
[gfactory@minos25 ~]$ ps xf
PID TTY STAT TIME COMMAND
6226 ? Z 0:02 [condor_gridmana]
9547 pts/26 Ss 0:00 -bash
9595 pts/26 R+ 0:00 \_ ps xf
4314 ? S 68:21 python glideFactory.py 90 4 /home/gfactory/glideinsubmit/glidein_t12_glexec/
4316 ? S 624:33 \_ /usr/bin/python glideFactoryEntry.py 4314 90 4 /home/gfactory/glideinsubmit/glidein_t12_glexec/ gpgeneral
4317 ? S 687:45 \_ /usr/bin/python glideFactoryEntry.py 4314 90 4 /home/gfactory/glideinsubmit/glidein_t12_glexec/ gpminos
15986 ? S 0:00 \_ /bin/bash ./job_submit.sh gpminos gpminos@t12_glexec@minos@my2 10 std -- GLIDEIN_Collector minos25.dot,fnal.dot,gov
15991 ? S 0:00 \_ condor_submit -name minos25.fnal.gov entry_gpminos/job.condor
Killed all gfrontend and gfactory processes
MINOS25 > condor_off -peaceful -all -subsystem startd
Can't connect to master FNAL_858@fnpc344.fnal.gov
Can't connect to master FNAL_31050@fnpc339.fnal.gov
Date: Mon, 02 Jun 2008 12:34:42 -0500 (CDT)
Subject: HelpDesk ticket 116503
___________________________________________
Short Description: User process in D state, please kill this, or reboot
Problem Description: run2-sys :
Around 09:30 this morning, user rmehdi ran a 'loon' process, which has gone into a 'D' state, according to the top display.
The load average has gone up over 100, and many other processes are behaving strangely.
Rashid cannot kill this process,
top - 12:08:50 up 130 days, 23:57, 14 users, load average: 123.08, 123.02, 122.64
Tasks: 311 total, 2 running, 305 sleeping, 0 stopped, 4 zombie
Cpu(s): 25.2% us, 1.5% sy, 0.0% ni, 0.0% id, 73.3% wa, 0.0% hi, 0.0% si
Mem: 4151264k total, 4125572k used, 25692k free, 64764k buffers
Swap: 4192944k total, 208k used, 4192736k free, 972376k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME #C COMMAND
3860 rmehdi 15 0 2457m 2.3g 44m D 0 58.0 22:34 2 loon
There are now dozens of other processes in 'D' state.
I do not see any problems accessing disk on the system, so I am inclined to blame this on the runaway 2.3 GByte user process.
Please let us know whether this is a correct assessment of the situation.
Please use any and all tricks you know of to kill this process.
If we need to reboot later this afternoon, contact minos-admin, and I'll let the users know, and will try to drain the Condor queues first. ___________________________________________ Date: Mon, 02 Jun 2008 13:07:30 -0500 (CDT) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 02 Jun 2008 13:59:44 -0500 The loon process, and many other processes are in D state which cannot be killed. I don't see any problem with the afs or nfs space, not the local disk. So, we'll have to reboot the machine. Please let me know if it is ok to reboot it now. ___________________________________________ At around 14:10, MINOS25 > condor_off -fast minos25 Sent "Kill-All-Daemons-Fast" command to master minos25.fnal.gov ___________________________________________ Date: Mon, 02 Jun 2008 14:18:42 -0500 (CDT) Note To Requester: kreymer@fnal.gov sent this Notes To Requester: I have stopped condor on minos25. Please go ahead with the reboot of minos25 as soon as you can. _________________________________________________________________ Rebooted Restarted startd on minos03 MINOS25 > condor_on minos03 -subsystem startd Removes stale jjling jobs MINOS25 > condor_rm 140901.0 Started and tested minos04 MINOS25 > condor_on minos04 -subsystem startd Started them all up MINOS25 > condor_on -all -subsystem startd [gfrontend@minos25 ~]$ ./start_frontend.sh [gfactory@minos25 ~]$ ./start_factory.sh ######## # DATA # ######## Blue arc failures, fnpcsrv1 Sun Jun 1 05:41:48 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-05 ... Sun Jun 1 07:31:54 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-11 Sun Jun 1 10:03:00 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-12 Sun Jun 1 10:04:00 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-01 minos-sam03 Sun Jun 1 05:26:27 CDT 2008 SLO N00007148_0002.spill.sntp.cedar_phy_bhcurv.0.root 14 head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-04/N00007148_0005.spill.sntp.cedar_phy_bhcurv.0.root' for reading: Stale NFS file handle ... Sun Jun 1 05:43:48 CDT 2008 BAD N00007188_0000.spill.sntp.cedar_phy_bhcurv.0.root 0 head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-04/N00007194_0000.spill.sntp.cedar_phy_bhcurv.0.root' for reading: Stale NFS file handle Sun Jun 1 05:44:48 CDT 2008 BAD N00007194_0000.spill.sntp.cedar_phy_bhcurv.0.root 0 head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-04/N00007194_0005.spill.sntp.cedar_phy_bhcurv.0.root' for reading: No such device or address ... Sun Jun 1 06:24:49 CDT 2008 BAD N00007604_0011.spill.sntp.cedar_phy_bhcurv.0.root 0 head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-04/N00007607_0002.spill.sntp.cedar_phy_bhcurv.0.root' for reading: No such device or address Sun Jun 1 06:25:49 CDT 2008 BAD N00007607_0002.spill.sntp.cedar_phy_bhcurv.0.root 0 Sun Jun 1 07:31:51 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-11 Sun Jun 1 10:03:04 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-12 Sun Jun 1 10:04:04 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-01 minos01 Sun Jun 1 05:37:58 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-05 ... 
Sun Jun 1 07:31:53 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-01 Sun Jun 1 10:03:06 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-02 Sun Jun 1 10:04:07 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 minos25 Fri May 30 02:05:17 CDT 2008 SLO N00010645_0000.spill.sntp.cedar_phy_bhcurv.0.root 14 Sun Jun 1 05:37:58 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-05 ... Sun Jun 1 07:29:53 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-11 Sun Jun 1 10:03:01 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-12 Sun Jun 1 10:04:01 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-01 minos26 Sun Jun 1 05:37:58 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-05 ... Sun Jun 1 07:32:09 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-01 Sun Jun 1 10:03:08 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-02 Sun Jun 1 10:04:08 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 Date: Sun, 01 Jun 2008 08:03:20 -0500 From: Andy Romero To: site-nas-announce@fnal.gov Cc: Storage Admins Subject: BlueArc Problem ... Status update Only blue2 and minos-nas-0 customers were effected by this problem .. everyone else ignore Early this morning array Minossata01 partially failed. Of course this caused file system MINOS-r6sata-0 to fail; however, it also caused cms-r5-at-1 and cms-r5-at-2 to fail. Initial sttempts to get these filesystems back online failed ...(system calls were timing out). At this point I decided that the only course of action which had any chance of quickly getting CMS back online was to reboot NAS head RHEA-1. The Reboot completed, all CMS file systems are back online. All other blue2 hosted file systems are also back online. I am now going to contact someone to get the Minossata01 array back online. Then I will re-enable EVS minos-nas-0 Andy ============================================================================= ============================================================================= * * * * * KREYMER IS ON FURLOUGH 26 THROUGH 31 MAY * * * * * Blue Arc failures Sunday Jun 1, as noted above CFL - failed again, path error, adjusted and retried, OK, see above KCA update - interrupted farm processing, Rubin's robot cert needed registration ============================================================================= ============================================================================= 2008 05 25 ============================================================================= ######## # FARM # ######## Leaving loopCPB1 running through the furlough. There are enough PEND partial runs that the normal roundup processing done under the corral script would not be moving any cand files through. 
######## # DATA # ######## One more quick scan for cand/bcnd files MINOS26 > du -sm /minos/data/reco_*/*/cand_data 144 /minos/data/reco_far/cedar_phy/cand_data 331197 /minos/data/reco_far/cedar_phy_bhcurv/cand_data 1066 /minos/data/reco_near/cedar_phy_bhcurv/cand_data MINOS26 > du -sm /minos/data/reco_*/*/.bcnd_data 78964 /minos/data/reco_far/cedar/.bcnd_data 27 /minos/data/reco_far/cedar_phy/.bcnd_data 6054 /minos/data/reco_far/cedar_phy_bhcurv/.bcnd_data FARM03 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 27T 1.8T 94% /minos/data FARM03 > DIRS=`ls -d /minos/data/reco_*/*/cand_data` FARM03 > for DIR in ${DIRS} ; do echo rm -r ${DIR} ; done rm -r /minos/data/reco_far/cedar_phy_bhcurv/cand_data rm -r /minos/data/reco_far/cedar_phy/cand_data rm -r /minos/data/reco_near/cedar_phy_bhcurv/cand_data FARM03 > rm -r /minos/data/reco_far/cedar_phy_bhcurv/cand_data FARM03 > rm -r /minos/data/reco_far/cedar_phy/cand_data FARM03 > rm -r /minos/data/reco_near/cedar_phy_bhcurv/cand_data FARM03 > DIRS=`ls -d /minos/data/reco_*/*/.bcnd_data` FARM03 > for DIR in ${DIRS} ; do echo rm -r ${DIR} ; done rm -r /minos/data/reco_far/cedar/.bcnd_data rm -r /minos/data/reco_far/cedar_phy/.bcnd_data rm -r /minos/data/reco_far/cedar_phy_bhcurv/.bcnd_data FARM03 > for DIR in ${DIRS} ; do echo rm -r ${DIR} rm -r ${DIR} ; done rm -r /minos/data/reco_far/cedar/.bcnd_data About 2.2 TB free in /minos/data now. ######## # DATA # ######## jdejong asks to process sntp's from far cedar, problem is 2000 files per month period. That's correct, 2005-04 through 2007-04 were not concatenated. MINOS26 > for DIR in ${DIRS} ; do printf "${DIR} " ; ls /minos/data/reco_far/cedar/sntp_data/${DIR} | wc -l ; done 2005-04 688 2005-05 741 2005-06 716 2005-07 735 2005-08 742 2005-09 721 2005-10 738 2005-11 720 2005-12 748 2006-01 745 2006-02 616 2006-03 554 2006-06 721 2006-07 702 2006-08 696 2006-09 719 2006-10 746 2006-11 720 2006-12 745 2007-01 738 2007-02 672 2007-03 744 2007-04 715 2007-05 96 2007-11 40 2007-12 66 2008-01 75 2008-02 82 2008-03 84 2008-04 74 2008-05 50 MINOS26 > for DIR in ${DIRS} ; do printf "${DIR} " ; ls /pnfs/minos/reco_far/cedar/sntp_data/${DIR} | wc -l ; done 2005-04 1392 2005-05 1486 2005-06 1432 2005-07 1470 2005-08 1486 2005-09 1445 2005-10 1482 2005-11 1440 2005-12 1497 2006-01 1490 2006-02 1237 2006-03 1151 2006-06 1442 2006-07 1429 2006-08 1402 2006-09 1440 2006-10 1492 2006-11 1443 2006-12 1490 2007-01 92 2007-02 98 2007-03 80 2007-04 74 2007-05 76 2007-11 76 2007-12 66 2008-01 75 2008-02 82 2008-03 84 2008-04 74 2008-05 50 That's truly weird, 2007-01 through 2007-04 are concatenated in PNFS, but not on /minos/data. OHHHH, not so weird after all. Files were copied from afs to nfs, and afs files were not concatenated. 
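The same comparison can be reduced to a single pass that prints only the months whose counts disagree. A sketch, assuming DIRS holds the YYYY-MM month list used in the loops above (its setting was not logged):

NFSD=/minos/data/reco_far/cedar/sntp_data
PNFD=/pnfs/minos/reco_far/cedar/sntp_data
for DIR in ${DIRS} ; do
  NN=`ls ${NFSD}/${DIR} | wc -l`     # files in the NFS copy
  NP=`ls ${PNFD}/${DIR} | wc -l`     # files in PNFS
  [ "${NN}" -ne "${NP}" ] && printf "${DIR} nfs ${NN} pnfs ${NP}\n"
done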
Also, /minos/data seems to have only spill data through 2007-04 ============================================================================= 2008 05 23 ============================================================================= ########## # PARROT # ########## Resume testing latest version INSTALLATION 2.4.2, which has proxy problems, is still latest point release cd /grid/app/minos/parrot VER=current VERX="-20080520" TARD=cctools-${VER}-x86_64-linux-2.6 TARX=cctools-${VER}${VERX}-x86_64-linux-2.6 TARP=${TARD}.tar.gz TARL=${TARX}.tar.gz curl http://www.cse.nd.edu/~ccl/software/files/${TARP} -o ${TARL} tar xzvf ${TARL} [ -n "${VERX}" ] && mv ${TARD} ${TARX} ln -s ../mountfile.grow ${TARX}/ ln -s ../mountfile2.grow ${TARX}/ ln -s ../mountfile.html ${TARX}/ TESTING Checked ganglia at http://rexganglia2.fnal.gov/farms/?c=GP%20Farm&m=&r=hour&s=descending&hc=4 ssh fnpc194 Checksums fail. running on current-20080520 x86_64 fnpc194, remote checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d local checksum: 2adb169a42c725ccfbe2b2174da7d8d9e46121f4 Fails, and retries indefinitely. running 2.4.2 on fngp-osg, 32 bit system, see local checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d running 2.4.2 x96_64 on fnpc194, 1211664873.544326 [14605] parrot: grow: local checksum: c56014e206c26c1ba13e5a321c3155b95689bf4a 1211664873.544453 [14605] parrot: grow: checksum does not match, reloading... 1211664873.548607 [14605] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1211664873.548663 [14605] parrot: grow: fetching checksum: wget --no-cache -q -O /tmp/parrot.1060/grow.checksum.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfschecksum 1211664873.563510 [14605] parrot: grow: remote checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d 1211664873.563624 [14605] parrot: grow: fetching directory: wget --no-cache -q -O /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfsdir 1211664878.222728 [14605] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 1211664878.951315 [14605] parrot: grow: local checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d running on current-20080520 x86_64 fnpc194, another attempt bash-3.00$ ls -d /afs/fnal.gov/files/code/e875/general/minossoft 2008/05/24 16:36:40.927396 [14740] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 2008/05/24 16:36:40.927602 [14740] parrot: grow: fetching checksum: 2008/05/24 16:36:40.927664 [14740] parrot: http: connect www-numi.fnal.gov port 80 2008/05/24 16:36:40.930494 [14740] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfschecksum HTTP/1.0 Host: www-numi.fnal.gov 2008/05/24 16:36:40.938232 [14740] parrot: http: HTTP/1.1 200 OK 2008/05/24 16:36:40.938249 [14740] parrot: http: Date: Sat, 24 May 2008 21:36:40 GMT 2008/05/24 16:36:40.938260 [14740] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/05/24 16:36:40.938270 [14740] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:19 GMT 2008/05/24 16:36:40.938279 [14740] parrot: http: ETag: "534d65ce-2c-44592a26655c3" 2008/05/24 16:36:40.938288 [14740] parrot: http: Accept-Ranges: bytes 2008/05/24 16:36:40.938296 [14740] parrot: http: Content-Length: 44 2008/05/24 16:36:40.938307 [14740] parrot: http: Connection: close 2008/05/24 16:36:40.938315 [14740] parrot: http: Content-Type: text/plain 2008/05/24 
16:36:40.938323 [14740] parrot: http: 2008/05/24 16:36:40.938333 [14740] parrot: grow: remote checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d 2008/05/24 16:36:40.938349 [14740] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 2008/05/24 16:36:41.632329 [14740] parrot: grow: local checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d /afs/fnal.gov/files/code/e875/general/minossoft So suddenly things are OK !!! Try again on fnpc195, things fail again forever 2008/05/24 16:38:36.871472 [28006] parrot: grow: local checksum: 2adb169a42c725ccfbe2b2174da7d8d9e46121f4 Can clean it up again by running 2.4.1 once. ============================================================================= 2008 05 23 ============================================================================= Istvan (I.Z.) Danko - P [izdanko@pitt.edu] 412-624-7159 ########## # CONDOR # ########## Submit a 10 minute glidein job, then change the proxy with /local/scratch25/grid/kproxyvnew while job is running. MINOS25 > ln -s /local/scratch25/grid grid condor_submit glideafs10min.run 117890.0 kreymer 5/23 10:59 0+00:00:00 I 0 0.0 probe MINOS25 > grid/kproxyvnew -rw------- 1 kreymer g020 5302 May 23 11:03 kreymer.proxy.2008052311 MINOS25 > dds logs/10min/*117890.0 -rw-r--r-- 1 kreymer g020 0 May 23 10:59 logs/10min/probe.err.117890.0 -rw-r--r-- 1 kreymer g020 247 May 23 11:01 logs/10min/probe.log.117890.0 -rw-r--r-- 1 kreymer g020 0 May 23 10:59 logs/10min/probe.out.117890.0 MINOS25 > cat logs/10min/probe.log.117890.0 000 (117890.000.000) 05/23 10:59:14 Job submitted from host: <131.225.193.25:62903> ... 001 (117890.000.000) 05/23 11:01:26 Job executing on host: <131.225.166.120:61927> ... 006 (117890.000.000) 05/23 11:01:34 Image size of job updated: 6332 ... 005 (117890.000.000) 05/23 11:11:27 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 9974 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 9974 - Total Bytes Sent By Job 0 - Total Bytes Received By Job MINOS25 > less logs/10min/probe.out.117890.0 RUN STARTED Fri May 23 11:01:26 CDT 2008 ... ########## # PROXY # ########## PROXY /local/stage1/condor/execute/dir_10848/glide_J10883/tmp/starter-tmp-dir-xF9smf/execute/dir_12333/kreymer.proxy subject : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer/CN=proxy/CN=proxy issuer : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer/CN=proxy identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer/CN=proxy type : unknown strength : 512 bits path : /local/stage1/condor/execute/dir_10848/glide_J10883/tmp/starter-tmp-dir-xF9smf/execute/dir_12333/kreymer.proxy timeleft : 9:51:59 ###################################################### # CHECK THE GRID ENVIRONMENT IF WE ARE ON THE GRID # ###################################################### OK - we do not seem to be on an OSG host RUN FINISHED Fri May 23 11:11:26 CDT 2008 Tried this again, showing proxy and identity at start and end of job. RUN STARTED Fri May 23 14:27:39 CDT 2008 PROXY /local/stage1/condor/execute/dir_22749/glide_W22784/tmp/starter-tmp-dir-wk19ES/execute/dir_25851/kreymer.proxy identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. 
Kreymer/USERID=kreymer/CN=proxy ... PROXY /local/stage1/condor/execute/dir_22749/glide_W22784/tmp/starter-tmp-dir-wk19ES/execute/dir_25851/kreymer.proxy identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer/CN=proxy RUN FINISHED Fri May 23 14:37:40 CDT 2008 ####### # SAM # ####### Need to add upcoming daikon_05, heck let's throw in daikon_06/07/08 Inspired by 2007 09 23 log entry for UNI in dev int prd ; do for VEG in daikon_05 daikon_06 daikon_07 daikon_08 ; do setup sam -q ${UNI} export SAM_ORACLE_CONNECT samadmin add application family --appFamily=simulation --appName=gminos --appVersion=${VEG} export -n SAM_ORACLE_CONNECT done done New applicationFamilyId = 251 New applicationFamilyId = 252 New applicationFamilyId = 253 New applicationFamilyId = 254 New applicationFamilyId = 88 New applicationFamilyId = 89 New applicationFamilyId = 90 New applicationFamilyId = 91 New applicationFamilyId = 342 New applicationFamilyId = 343 New applicationFamilyId = 344 New applicationFamilyId = 345 MINOS26 > date Fri May 23 09:47:08 CDT 2008 ######## # DATA # ######## No error reported since yesterday's bluearc reboot, per http://www-numi.fnal.gov/computing/dh/bluwatch/ Will resume normal activities. ######## # FARM # ######## loopCPB0 - finished its work on CPB 0 candidates started loopCPB, to get all mrnt and sntp's concatenated ============================================================================= 2008 05 22 ============================================================================= ########## # CONDOR # ########## Testing new KCA form of certificate ############ # BLUWATCH # ############ bluwatch.20080522 kcron was not being run in bluwatch.20080522 Added kcron based on expiration time in the file loop Added LATEST file to show time of latest error Moved full logs to bluwatch/log/* Output goes to *.txt - latest error log/*.txt - full log latest/*.txt - latest status MINOS26 > touch LASTERR -d 'May 21 10:19:02 CDT' ============================================================================= 2008 05 21 ============================================================================= ############ # BLUWATCH # ############ SRV1> echo 'touch /minos/data/minfarm/roundup/STOP' | at 05:00 job 13 at 2008-05-22 05:00 bluwatch.20080521 . add kcron as in other monitoring scripts . cut down verbosity, OK goes to latest 1 line log . add *.latest.txt file with heartbeat . add time limit and report . add STOP file add /grid/data monitoring add lasterr directory/files Started up on all systems, MINOS26 > echo 'touch /afs/fnal.gov/files/data/minos/log_data/bluwatch/STOP' | at 05:00 job 8 at 2008-05-22 05:00 MINOS26 > at -l 8 2008-05-22 05:00 a kreymer MINOS26 > mkdir /afs/fnal.gov/files/data/minos/log_data/bluwatch/lasterr ########## # CONDOR # ########## Testing new KCA per http://security.fnal.gov/pki/newkcafaq.html Got new kx509 image from http://security.fnal.gov/tools/kx509.tar /local/scratch25/grid/kx509 -s winserver.fnal.gov Seems I do need to do kdestroy before testing interactively /local/scratch25/grid/kproxynew Date: Wed, 21 May 2008 18:05:49 -0500 (CDT) Subject: HelpDesk ticket 116054 ___________________________________________ Short Description: Testing new KCA proxies - voms-proxy-init fails Problem Description: I am attempting to get a new style KCA proxy for testing under the fermilab/minos group before the upgrade next week. 
I want to submit a Minos glidein job, then change the original proxy from the old to the new form, and verify that the job can complete correctly. This is what will happen to users' jobs next Wednesday. voms-proxy-init fails, as follows : Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Failed Error: fermilab: Unable to satisfy G/fermilab/minos Request! This is not too surprising, as the VO does not seem to know about the new style DN's. I am on furlough next week. Will the new DN's be registered in time for some advanced testing before next week ? ___________________________________________ Registered the new form certificate at DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer It shows up as 'new' under https://vomrs.fnal.gov:8443/vomrs/vo-fermilab/vomrs?path=/RootNode/MemberAction/MemberDNs&action=execute&do=select ___________________________________________ Date: Thu, 22 May 2008 11:17:32 -0500 (CDT) From: fermilab-vomrs-admin@fnal.gov To: kreymer@fnal.gov Subject: Automated email from fermilab vomrs: You have a request to add a new certificate Dear VO Member, A request to add a new certificate DN: /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer CA: /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA to your certificate list was made by a Member DN: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/UID=kreymer CA: /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA The following reason was provided: Testing new KCA proxy format. Please contact VO administrator if you have any questions. VOMRS fermilab Service ___________________________________________ Date: Thu, 22 May 2008 15:38:37 -0500 (CDT) Note To Requester: Art, can you change the status via the certificates/ set certificate status on your own entry in vomrs? you should be able to do that. If you can't, let me know and I will do it for you. Steve Timm ___________________________________________ Date: Thu, 22 May 2008 16:07:11 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: I have set the certificate status to approved. Should get into VOMS within the hour. Steve ___________________________________________ Date: Wed, 21 May 2008 19:43:55 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: The new DN's will start getting loaded into the Fermilab VO on Monday night May 26. We anticipate, based on the same procedure done on our test machine, that it will take 14-15 hours to load them all. If you want to test before that, you can manually add the new DN to your VOMRS entry via VOMRS. Or voms-proxy-init against fgtest2.fnal.gov where the entry should already be there. Steve Timm ___________________________________________ Date: Thu, 22 May 2008 21:05:36 +0000 (UTC) The new DN has been 'Approved', and I have generated a new style KCA proxy. Thanks, you can close this ticket ! ___________________________________________ Date: Fri, 23 May 2008 20:23:06 +0000 (UTC) Most of the Minos glidein users, and users in general, will be generating one of the new KCA DN's for the first time at the time of the global conversion next Wednesday. 
The new DN's contain the name of the machine on which the proxy was generated, like .../CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer Is this hostname field needed for authorization purposes ? If so, how will you know which hostname to use when doing the automatic load of the new DN's into the Fermilab VO ? ___________________________________________ Date: Fri, 23 May 2008 15:31:46 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: Yes, the hostname field is needed for authorization purposes. No, our auto-add of the new CN=UID:username doesn't account for this. But given this request, we will take any /OU=Robots/CN=cron/CN=.... certs that are in the "minos" group of the fermilab VO and make sure that /OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=User Name/CN=UID:username gets added for all ofthem. The farms production people may also need one added for fnpcsrv1. Steve Timm ___________________________________________ Date: Fri, 23 May 2008 20:37:40 +0000 (UTC) For completeness, here is a list from my whiteboard of users who are probable near term users of Minos glideins, hence need new KCA registrations with /CN=minos25.fnal.gov asousa bckhouse boehm djauty hartnell loiacono mishi nickd pawloski rustem rhatcher ######## # FARM # ######## ./roundup -n -m L010185N -r cedar_phy_bhcurv mcnear ########## # CONDOR # ########## MIN > NODES0='minos11 minos13 minos26' MIN > NODES1='minos01 minos02 minos07' MIN > NODES2='minos03 minos04 minos05 minos06 minos08 minos09 minos10 minos12 minos14 minos15 minos16 minos17 minos18 minos19 minos20 minos21 minos22 minos23 minos24' MIN > REFFIL=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local0.20080512 MIN > for NODE in ${NODES0} ; do printf "${NODE}\n" ssh -ax ${NODE} diff /opt/condor-7.0.1/local/condor_config.local ${REFFIL} ; done MIN > REFFIL=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local1.20080512 MIN > for NODE in ${NODES1} ; do printf "${NODE}\n" ssh -ax ${NODE} diff /opt/condor-7.0.1/local/condor_config.local ${REFFIL} ; done MIN > REFFIL=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local2.20080512 MIN > for NODE in ${NODES2} ; do printf "${NODE}\n" ssh -ax ${NODE} diff /opt/condor-7.0.1/local/condor_config.local ${REFFIL} ; done ssh -ax minos25 diff /opt/condor-7.0.1/local/condor_config.local \ /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local.minos25 The config files all look OK. 
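A sketch folding the three per-group checks above into one loop; the NODES0/1/2 lists and the condor_config.local{0,1,2}.20080512 reference names are as set above:

CDIR=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701
for GRP in 0 1 2 ; do
  eval NODES=\"\${NODES${GRP}}\"      # pick up NODES0, NODES1, NODES2 in turn
  REFFIL=${CDIR}/condor_config.local${GRP}.20080512
  for NODE in ${NODES} ; do
    printf "${NODE}\n"
    ssh -ax ${NODE} diff /opt/condor-7.0.1/local/condor_config.local ${REFFIL}
  done
done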
Now will do the promised reconfigure MINOS25 > condor_reconfig minos01 Sent "Reconfig" command to master minos01.fnal.gov MINOS25 > condor_config_val CONDOR_ADMIN -name minos01 minos-admin@fnal.gov MINOS25 > date Wed May 21 14:33:36 CDT 2008 MINOS25 > condor_reconfig -all Sent "Reconfig" command to master minos10.fnal.gov Sent "Reconfig" command to master minos01.fnal.gov Sent "Reconfig" command to master minos02.fnal.gov Sent "Reconfig" command to master minos20.fnal.gov Sent "Reconfig" command to master minos21.fnal.gov Sent "Reconfig" command to master minos03.fnal.gov Sent "Reconfig" command to master minos04.fnal.gov Sent "Reconfig" command to master minos22.fnal.gov Sent "Reconfig" command to master minos14.fnal.gov Sent "Reconfig" command to master minos05.fnal.gov Sent "Reconfig" command to master minos23.fnal.gov Sent "Reconfig" command to master minos15.fnal.gov Sent "Reconfig" command to master minos06.fnal.gov Sent "Reconfig" command to master minos24.fnal.gov Sent "Reconfig" command to master minos16.fnal.gov Sent "Reconfig" command to master minos25.fnal.gov Sent "Reconfig" command to master minos07.fnal.gov Sent "Reconfig" command to master minos17.fnal.gov Sent "Reconfig" command to master minos08.fnal.gov ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5004:Failed to get authorization from server. Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile) AUTHENTICATE:1004:Failed to authenticate using FS Can't send Reconfig command to master minos26.fnal.gov Sent "Reconfig" command to master minos09.fnal.gov Sent "Reconfig" command to master minos18.fnal.gov Sent "Reconfig" command to master minos19.fnal.gov Sent "Reconfig" command to master FNAL_1621@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_5500@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_8103@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_8211@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_2224@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_5215@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_1913@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_5660@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_2426@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_2808@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_2466@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_1537@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_1467@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_1396@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_8652@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_5594@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_8464@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_1797@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_25110@fnpc341.fnal.gov Sent "Reconfig" command to master FNAL_21093@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_27403@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_10820@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_23605@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_31094@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_30125@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_32550@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_28105@fnpc341.fnal.gov Sent "Reconfig" command to master FNAL_29012@fnpc343.fnal.gov Sent "Reconfig" command to master 
FNAL_11911@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_27081@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_22662@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_26620@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_30442@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_29630@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_11644@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_30295@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_32408@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_25528@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_18184@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_15557@fnpc341.fnal.gov Sent "Reconfig" command to master FNAL_32745@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_23589@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_29392@fnpc343.fnal.gov Sent "Reconfig" command to master FNAL_32709@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_27077@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_29745@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_27767@fnpc341.fnal.gov Sent "Reconfig" command to master FNAL_29318@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_28882@fnpc343.fnal.gov Sent "Reconfig" command to master FNAL_15897@fnpc341.fnal.gov Sent "Reconfig" command to master FNAL_29175@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_28477@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_29596@fnpc339.fnal.gov MINOS25 > condor_config_val CONDOR_ADMIN minos-admin@fnal.gov MINOS25 > condor_config_val CONDOR_ADMIN -name minos11 Can't find address for this master Perhaps you need to query another pool. MINOS25 > condor_config_val CONDOR_ADMIN -name minos12 Can't find address for this master Perhaps you need to query another pool. MINOS25 > condor_config_val CONDOR_ADMIN -name minos13 Can't find address for this master Perhaps you need to query another pool. On minos12, sudo /etc/init.d/condor start ########## # CONDOR # ########## Date: Wed, 21 May 2008 13:46:01 -0500 (CDT) Subject: HelpDesk ticket 116039 ___________________________________________ Short Description: asousa certificate is misregistered in VOMS ? Problem Description: Alex Sousa ( asousa@fnal.gov ) is unable to get a proxy as a member of the fermilab/minos group. His x509 identity is /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/ CN=Alexandre B. Pereira sousa/USERID=asousa His registered certificate as seen in VOMRS is /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alexandre B. Pereira Sousa/UID=asousa Note that sousa is lower case in the X509 string, but is Sousa in the VO. Please resolve this. I am not authorized to add cert's for other people, so I cannot fix this by adding the alternate DN myself. There is the larger question of why we have so many malformed DN's. Please reply to minos-admin and/or asousa ___________________________________________ Date: Wed, 21 May 2008 15:37:39 -0500 (CDT) New Information: The two robot certs /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Alexandre B. Pereira sousa/UID=asousa and /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Alexandre B. Pereira sousa/USERID=asousa have been added to vomrs. The failure appears to be that the robots cert was missing altogether in vomrs, at least we hope that is the case. vomrs and voms are not supposed to be case sensitive, i.e. Alexandre B. Pereira Sousa and Alexandre B. Pereira sousa resolve to the same person. 
I will check again in about 1/2 hour to make sure the change propagated through the rest of the system. The incredible changing names of KCA certs are a problem of long standing which are due to be finally fixed when the new KCA is rolled out on May 28 next week. Historically we find that a name changes when a user renews his or her Fermi ID. Steve Timm ########## # CONDOR # ########## Date: Wed, 21 May 2008 12:05:11 -0500 (CDT) Subject: HelpDesk ticket 116023 ___________________________________________ Short Description: Use bckhouse KCA cert has extra space Problem Description: Christopher Backhouse ( bckhouse@fnal.gov ) cannot activate a kx509 proxy in the fermilab/minos group because his kx509 subject has an extra space after his name : From kx509 Service kx509/certificate issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA subject= /DC=gov/DC=fnal/O=Fermilab/OU=People/ CN=Christopher J. Backhouse /0.9.2342.19200300.100.1.1=bckhouse From voms-proxy-init Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/ CN=Christopher J. Backhouse /USERID=bckhouse Please do what it takes to correct this condition for Chris, and do the same for other users who may have this problem. I know that Rustem Ospanov had the seme problem earlier. Please reply to minos-admin and bckhouse ___________________________________________ Date: Wed, 21 May 2008 15:25:36 -0500 (CDT) Note To Requester: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Christopher J. Backhouse /UID=bckhouse /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Christopher J. Backhouse /USER ID=bckhouse /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Christopher J. Backhouse /UID=bckhouse /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Christopher J. Backhouse /USERID=bckhouse All added to Christopher Backhouse in VOMRS. Will check in 1/2 hour or so to make sure they make it to VOMS. Steve Timm ___________________________________________ ######## # FARM # ######## Tue May 20 20:51:22 CDT 2008 BAD N00008517_0000.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 20:52:22 CDT 2008 BAD N00008523_0000.spill.sntp.cedar_phy_bhcurv.0.root SRV1> ./loopCPB0: line 5: 17604 Killed ./roundup -c -b 100 -w -s "cand.cedar_phy_bhcurv.0" -r cedar_phy_bhcurv mcnear Connection to fnpcsrv1 closed. SRV1> dds -tr -rw-rw-r-- 1 minfarm numi 740890 May 20 20:48 cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.0.log -rw-r--r-- 1 minfarm numi 3215797 May 21 06:09 cedarfar.log -rw-r--r-- 1 minfarm numi 2325596 May 21 06:40 cedarnear.log SRV1> tail cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.0.log SRMCP 70/100 -streams_num=1 -server_mode=active -protocols=gsiftp file:///n13037760_0012_L010185N_D04.cand.cedar_phy_bhcurv.0.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L01018 5N/cand_data/776 Latest mount failure ( except /home/ftp ) is May 20 19:26:41 fnpcsrv1 automount[16603]: failed to mount /minos/data May 20 20:47:16 fnpcsrv1 autofs: automount shutdown failed May 20 20:59:03 fnpcsrv1 automount[9888]: starting automounter version 4.1.3-231, path = /farm, maptype = yp, mapname = auto.farm ============================================================================= 2008 05 20 ============================================================================= ######## # FARM # ######## Will not run loopCPB until the BlueArc problems are resolved. 
mcnearcat 837 472139 cand.cedar_phy_bhcurv.0.root 4974 2871267 cand.cedar_phy_bhcurv.1.root WRITE 1 6 7571.root 811 468125 cand.cedar_phy_bhcurv.0.root 414 238927 cand.cedar_phy_bhcurv.1.root So firing up loopCPB0 again, to clear 1.5 TB . No, cannot do this, see /var/log/messages on fnpcsrv1 : May 20 19:24:54 fnpcsrv1 automount[16429]: failed to mount /home/carneiro May 20 19:25:15 fnpcsrv1 automount[16518]: failed to mount /minos/data May 20 19:25:17 fnpcsrv1 automount[16526]: failed to mount /grid/wnclient May 20 19:25:28 fnpcsrv1 kernel: nfs_statfs: statfs error = 512 There is a defective output file, 7571.root Let's try to get some disk cleared while we fight BlueArc. SRV1> ./loopCPB0 & [3] 17603 SRV1> date Tue May 20 19:37:31 CDT 2008 ######## # DATA # ######## Date: Tue, 20 May 2008 17:04:01 -0500 (CDT) From: HelpDesk Subject: CC: Help Desk Ticket 000000000115925 Has Been Updated. ___________________________________________________________________ New Information: It looks like we are having communications issues with one of the Minos logical drives on the minossata01 array. From our logs: May 19 23:05:16 blue1.fnal.gov 2054 Warning: FCP nexus 1 (host port 4; target port name 0x5000402301FC41F7 address 0x692E00; LUN 10) of SCSI device 88 has failed. May 20 03:18:58 blue1.fnal.gov 2054 Warning: FCP nexus 0 (host port 2; target port name 0x5000402201FC41F7 address 0x692600; LUN 10) of SCSI device 88 has failed. May 20 07:18:05 blue1.fnal.gov 2054 Warning: FCP nexus 1 (host port 4; target port name 0x5000402301FC41F7 address 0x692E00; LUN 10) of SCSI device 88 has failed. May 20 11:26:09 blue1.fnal.gov 2054 Warning: FCP nexus 0 (host port 2; target port name 0x5000402201FC41F7 address 0x692600; LUN 10) of SCSI device 88 has failed. May 20 14:22:02 blue1.fnal.gov 2054 Warning: FCP nexus 1 (host port 4; target port name 0x5000402301FC41F7 address 0x692E00; LUN 10) of SCSI device 88 has failed. Jason/Art, can you check if the array is okay or if there is any indication in the logs that there is a problem with the minos array. The device in question, according to our notes, is Lun 10 on array minossata01. Each time the lun fails, the BlueArc goes into recovery mode and tries to replay the filesystem, attempting to access the luns via any paths it thinks it can use. In the log it looks like we are bouncing between the two ports on the nexsan array. ___________________________________________________________________ Requester Name: GREGORY PAWLOSKI Short Description: BlueArc Server for /minos/scratch/ Down? Problem Description: I wonder if the BlueArc server that provides access to the /minos/data/ and /minos/scratch/ areas on the Minos cluster is down. I cannot access these areas. Greg ___________________________________________________________________ We have just now had another timeout of /minos/data. Here are the relevant bits from the logs at http://www-numi.fnal.gov/computing/dh/bluwatch.html Note the three to four minute delays in monitoring on the nodes where we did not have an outright failure. 
fnpcsrv1 Tue May 20 19:23:59 CDT 2008 OK N00008203_0000.spill.sntp.cedar_phy_bhcurv.0.root head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008206_0000.spill.sntp.cedar_phy_bhcurv.0.root' for reading: No such file or directory Tue May 20 19:25:25 CDT 2008 BAD N00008206_0000.spill.sntp.cedar_phy_bhcurv.0.root head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008214_0000.spill.sntp.cedar_phy_bhcurv.0.root' for reading: No such file or directory Tue May 20 19:26:51 CDT 2008 BAD N00008214_0000.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 19:28:06 CDT 2008 OK N00008218_0000.spill.sntp.cedar_phy_bhcurv.0.root minos-sam03 Tue May 20 19:24:07 CDT 2008 OK N00007983_0000.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 19:27:59 CDT 2008 OK N00007988_0000.spill.sntp.cedar_phy_bhcurv.0.root minos01 ue May 20 19:24:27 CDT 2008 OK N00008218_0000.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 19:27:59 CDT 2008 OK N00008221_0000.spill.sntp.cedar_phy_bhcurv.0.root minos25 Tue May 20 19:23:37 CDT 2008 OK N00008200_0007.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 19:27:51 CDT 2008 OK N00008203_0000.spill.sntp.cedar_phy_bhcurv.0.root minos26 Tue May 20 19:24:37 CDT 2008 OK /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-08 Tue May 20 19:28:05 CDT 2008 OK N00008227_0000.spill.sntp.cedar_phy_bhcurv.1.root This all happened at the moment at which I tried to cat a file from BlueArc mounted /home/minfarm/scripts, on fnpcsrv1. Here is the tail of /var/log/messages there . May 20 18:08:52 fnpcsrv1 telnetd[15872]: ttloop: peer died: Invalid or incomplete multibyte or wide character May 20 19:24:54 fnpcsrv1 automount[16429]: failed to mount /home/carneiro May 20 19:25:15 fnpcsrv1 automount[16518]: failed to mount /minos/data May 20 19:25:17 fnpcsrv1 automount[16526]: failed to mount /grid/wnclient May 20 19:25:28 fnpcsrv1 kernel: nfs_statfs: statfs error = 512 May 20 19:26:41 fnpcsrv1 automount[16603]: failed to mount /minos/data This seems odd, /home/carneiro and /grid/wnclient do not seem to exist. Their failure to mount seems tightly correlated with the /minos/data timeout. I find one earlier such incident , involving /minos/scratch : May 19 16:08:44 fnpcsrv1 automount[32100]: failed to mount /home/rubin May 19 16:10:17 fnpcsrv1 automount[32244]: failed to mount /grid/wnclient May 19 16:12:34 fnpcsrv1 kernel: nfs_statfs: statfs error = 6 May 19 16:13:44 fnpcsrv1 automount[809]: >> mount: minos-nas-0.fnal.gov:/minos/scratch failed, reason given by server: Input/output error May 19 16:13:44 fnpcsrv1 automount[809]: mount(nfs): nfs: mount failure minos-nas-0.fnal.gov:/minos/scratch on /minos/scratch May 19 16:13:44 fnpcsrv1 automount[809]: failed to mount /minos/scratch And there are many more failures to mount file systems like /home/carneiro /home/rubin There are problems that extend beyond /minos/data. <-- # @@@ Enter Update above this line. @@@ # --> _____________________________________________________________________ Date: Tue, 20 May 2008 21:04:15 -0500 (CDT) From: Steven Timm I've restarted autofs on fnpcsrv1 in debug mode so if there are future problems which there probably will be, we whave more information. 
___________________________________________________________________ ___________________________________________________________________ _________________________________________________________________ ######## # DATA # ######## Setting up file pings for bluearc, BASE=/minos/data/reco_near/cedar_phy_bhcurv/sntp_data DIRS=`ls -d ${BASE}/*` for DIR in ${DIRS} ; do if [ -d "${DIR}" ] ; then printf "`date` OK ${DIR}\n" else printf "`date` BAD ${DIR}\n" fi FILS=`ls ${DIR}` for FIL in ${FILS} ; do if head -c 8 ${DIR}/${FIL} > /dev/null ; then printf "`date` OK ${FIL}\n" else printf "`date` BAD ${FIL}\n" fi sleep 5 ; done # FILS sleep 5 ; done # DIRS Put this into a bluwatch self logging script, MINOS26 > ./bluwatch & MINOS26 > cat /afs/fnal.gov/files/data/minos/log_data/bluwatch/minos26.txt Tue May 20 14:38:12 CDT 2008 OK /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-03 Tue May 20 14:38:12 CDT 2008 OK /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-04 Tue May 20 14:38:13 CDT 2008 OK N00007119_0000.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 14:39:13 CDT 2008 OK N00007122_0000.spill.sntp.cedar_phy_bhcurv.0.root We have another timeout around 19:24 SRV1> mkdir /export/stage/minfarm/bluwatch SRV1> cp -a /var/log/messages /export/stage/minfarm/bluwatch/messages.2008052119 ######## # DATA # ######## DATA=reco_near/cedar_phy_bhcurv/sntp_data FARM03> ./dc2nfs -d ${DATA} 2>&1 | tee -a /minos/scratch/minfarm/dc2nfs/cpbnear.log /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 982327 free 47/ 51 N00011987_0000.spill.sntp.cedar_phy_bhcurv.0.root 16261802 bytes in 1 seconds (15880.67 KB/sec) FARM03 > dds -tr /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 | tail -rw-r--r-- 1 minfarm e875 844353660 May 20 12:32 N00011981_0011.spill.sntp.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minfarm e875 1835651305 May 20 12:32 N00011984_0000.spill.sntp.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minfarm e875 16261802 May 20 12:32 N00011987_0000.spill.sntp.cedar_phy_bhcurv.0.root drwxrwxr-x 2 minfarm e875 10240 May 20 12:32 ./ -rw-r--r-- 1 minfarm e875 2030928611 May 20 12:35 N00011992_0000.spill.sntp.cedar_phy_bhcurv.0.root FARM03 > dds /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-03/N00011992_0000.spill.sntp.cedar_phy_bhcurv.0.root -rw-r--r-- 1 rubin e875 2030928611 Nov 2 2007 /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-03/N00011992_0000.spill.sntp.cedar_phy_bhcurv.0.root This looks like the classical dccp glitch, after a successful copy. 
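A sketch for confirming that such a copy really did complete before the hung dccp client is killed by hand; the paths follow the N00011992 example above:

FIL=N00011992_0000.spill.sntp.cedar_phy_bhcurv.0.root
MON=2007-03
SSIZ=`stat -c %s /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/${MON}/${FIL}`   # PNFS source
DSIZ=`stat -c %s /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/${MON}/${FIL}`   # NFS destination
echo "pnfs ${SSIZ} nfs ${DSIZ}"
# if the sizes match, the copy is complete and the stalled dccp can be killed manually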
FARM03 > ps xf 3880 pts/3 R+ 73:39 \_ dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-03/N00011992_0000.spill.sntp.cedar_phy_bhcurv.0.root /minos/data FARM03> ./dc2nfs -d ${DATA} 2>&1 | tee -a /minos/scratch/minfarm/dc2nfs/cpbnear.log /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 934685 free 50/ 51 N00011998_0000.spill.sntp.cedar_phy_bhcurv.0.root 1770611397 bytes in 43 seconds (40211.92 KB/sec) NEEDED 2/51 reco_near/cedar_phy_bhcurv/sntp_data/2007-03 33 Mbytes/second STARTED Tue May 20 13:49:21 CDT 2008 FINISHED Tue May 20 13:51:14 CDT 2008 ######## # DATA # ######## BlueArc timeouts and failures continue ----------------------------------------------------------------- Subject: Cron ${HOME}/scripts/farmgsum_log Date: Tue, 20 May 2008 00:15:02 -0500 Summarizing /minos/data/minfarm/*cat du: cannot access `/minos/data/minfarm/mcnearcat/n13037716_0014_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root': No such file or directory du: cannot access `/minos/data/minfarm/mcnearcat/n13037716_0022_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root': No such file or directory ----------------------------------------------------------------- Date: Tue, 20 May 2008 00:46:32 -0500 From: Howard Rubin Subject: [Fwd: HelpDesk ticket 115906] Short Description: /grid/app/minos/ mount lost, srm probably dead Problem Description: This is the output of a cron job run at 23:09 /grid/app/minos/scripts/get_daq_submit: line 28: cd: /minos/data/minfarm/lists: No such device or address /grid/app/minos/scripts/get_daq_submit: line 100: /minos/data/minfarm/lists/current_version: No such device or address There is no FD timestamp file in /minos/data/minfarm/lists/daq_lists. Unable to proceed. Earlier in the evening grid jobs weren't starting with failed authentication, and srm was hung in one process. ----------------------------------------------------------------- Date: Tue, 20 May 2008 01:29:46 -0500 roundup cedar_phy_bhcurv mcnear 20126 stale pidfile on fnpcsrv1 ----------------------------------------------------------------- Date: Tue, 20 May 2008 03:21:45 -0500 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron ${HOME}/minos/scripts/condorglide rm: cannot remove `logs/glideafs/probe.116295.0.err': No such device or address find: logs/glideafs/probe.116295.0.out: No such device or address find: logs/glideafs/probe.116295.0.log: No such device or address ... ----------------------------------------------------------------- Date: Tue, 20 May 2008 05:34:32 -0500 roundup cedar_phy_bhcurv mcnear 3481 stale pidfile on fnpcsrv1 ----------------------------------------------------------------- Date: Tue, 20 May 2008 07:20:55 -0500 From: Cron Daemon /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorglide: line 12: cd: /minos/scratch/kreymer/condor/probe: Not a directory find: logs/glideafs: No such file or directory ----------------------------------------------------------------- Date: Tue, 20 May 2008 09:27:44 -0400 From: Steven Cavanaugh I can't run loon (rather it starts, and as soon as I press a single key, I get a FPE).. I think it can't read all of the libraries. This problem started around 9am EST... 
it was working at 830am EST ============================================================================= 2008 05 19 ============================================================================= ########## # CONDOR # ########## MINOS25 > condor_config_val CONDOR_ADMIN fermigrid-root@fnal.gov cd ~kreymer/minos/scripts/condor701 MINOS25 > diff /opt/condor/local/condor_config.local ~/minos/scripts/condor701/condor_config.local.minos25 24c24 < CONDOR_ADMIN = fermigrid-root@fnal.gov --- > CONDOR_ADMIN = minos-admin@fnal.gov MINOS01 > diff /opt/condor/local/condor_config.local ~/minos/scripts/condor701/condor_config.local 25c25 < CONDOR_ADMIN = fermigrid-root@fnal.gov --- > CONDOR_ADMIN = minos-admin@fnal.gov MINOS25 > diff /opt/condor/etc/condor_config ~/minos/scripts/condor701/condor_config 77c77 < CONDOR_ADMIN = fermigrid-root@fnal.gov --- > CONDOR_ADMIN = minos-admin@fnal.gov condor_config -> condor_config.20080512 condor_config.local -> condor_config.local.20080512 condor_config.local.minos25 -> condor_config.local.minos25.20080512 Date: Mon, 19 May 2008 18:27:03 -0500 (CDT) Subject: HelpDesk ticket 115902 ___________________________________________ Short Description: Update requested to Minos condor_config files. Problem Description: The administrative email from the Minos Condor pool has been going to fermigrid-root@fnal.gov rather than minos-admin. Please propagate new configuration files to minos01 through minos25 as follows . The new scripts are contained under /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701 On all of minos01 through minos25, propagate the new condor_config to /opt/condor-7.0.1/etc/condor_config On minos01 through minos24, propagate condor_config.local to /opt/condor-7.0.1/local/condor_config.local On minos25, propagate condor_config.local.minos25 to /opt/condor-7.0.1/local/condor_config.local After these have been updated, I will make them effective with condor_reconfig -all ___________________________________________ ___________________________________________ Date: Wed, 21 May 2008 09:46:22 -0500 (CDT) Subject: Help Desk Ticket 115902 Has Been Resolved. Solution: The updated configs are now on all the minos machines. ___________________________________________ ########## # CONDOR # ########## http://gratia-fermi.fnal.gov:8880/gratia-reporting/ ######### # ADMIN # ######### Date: Mon, 19 May 2008 12:48:16 -0500 (CDT) Subject: HelpDesk ticket 115867 ___________________________________________ Short Description: Login shells for minfarm and minsoft on minos-sam03 ( cluster ) Problem Description: At your next convenience, please change the login shells for minfarm and minsoft on the Minos Cluster to /bin/bash. ( The accounts are only active on minos-sam03 at present. ) ___________________________________________ Date: Mon, 19 May 2008 12:50:38 -0500 (CDT) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. 
___________________________________________ ___________________________________________ ___________________________________________ ######## # DATA # ######## minfarm @ minos-sam03 ln -s ~kreymer/minos/scripts/dc2nfs.20080118 dc2nfs DATA=reco_near/cedar_phy_bhcurv/sntp_data ./dc2nfs -n -d ${DATA}/2006-07 # need 7/15 ./dc2nfs -d ${DATA}/2006-07 # need 7/15 Added rate printout ./dc2nfs -d ${DATA}/2005-08 # need 1/55 integrated rate with NEEDED ./dc2nfs -d ${DATA}/2005-10 # 33/33 NEEDED 33/33 reco_near/cedar_phy_bhcurv/sntp_data/2005-10 35 Mbytes/second STARTED Mon May 19 12:40:21 CDT 2008 FINISHED Mon May 19 12:52:25 CDT 2008 mkdir -p /minos/scratch/minfarm/dc2nfs ./dc2nfs -d ${DATA}/2005-04 2>&1 | \ tee -a /minos/scratch/minfarm/dc2nfs/cpbnear.log ./dc2nfs -d ${DATA} 2>&1 | \ tee -a /minos/scratch/minfarm/dc2nfs/cpbnear.log ######## # DATA # ######## /minos/data seems to have glitched, seen on fnpcsrv1 and minos-sam03 FARM03> stat /minos/scratch/minfarm/dc2nfs/cpbnear.log File: `/minos/scratch/minfarm/dc2nfs/cpbnear.log' Size: 110557 Blocks: 216 IO Block: 32768 regular file Device: 15h/21d Inode: -944082904 Links: 1 Access: (0644/-rw-r--r--) Uid: (10871/ minfarm) Gid: ( 5111/ e875) Access: 2008-05-19 12:59:56.753000000 -0500 Modify: 2008-05-19 16:43:18.136000000 -0500 Change: 2008-05-19 16:43:18.136000000 -0500 2006-08 28/ 41 /minos/data/reco_near/cedar_phy_bhcurv/sntp_data 1263632 N00010634_0011.spill.sntp.cedar_phy_bhcurv.0.root 200581432 bytes in 4 seconds (48970.08 KB/sec) FARM03> FARM03> ls -ltr /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-08 ... -rw-r--r-- 1 minfarm e875 807030972 May 19 16:07 N00010634_0012.spill.sntp.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minfarm e875 705818624 May 19 16:07 N00010639_0003.spill.sntp.cedar_phy_bhcurv.0.root Cleaned up dc2nfs printout ( full path at top of each directory only ) ./dc2nfs -d ${DATA} 2>&1 | \ tee -a /minos/scratch/minfarm/dc2nfs/cpbnear.log FARM cleanup SRV1> less cedar_phy_bhcurvmcnearsntp.cedar_phy_bhcurv.log SRMCP 4/5 -streams_num=1 -server_mode=active -protocols=gsiftp file:///n13037664_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/ sntp_data/766 srm client error: No such device or address OOPS - SRMCP failed, bailing Mon May 19 16:12:32 CDT 2008 rm: cannot remove `/minos/data/minfarm/roundup/cedar_phy_bhcurvmcnearsntp.cedar_phy_bhcurv.pid': No such device or address Very Very Very odd. On minos-sam03, /var/log/messages, May 19 15:57:59 minos-sam03 kernel: oom-killer: gfp_mask=0xd0 May 19 15:57:59 minos-sam03 kernel: Mem-info: ... May 19 15:58:01 minos-sam03 kernel: Free pages: 14836kB (1664kB HighMem) May 19 15:58:01 minos-sam03 kernel: Active:48391 inactive:966431 dirty:1 writeback:200392 unstable:0 free:3709 slab:14123 mapped:47542 pagetables:577 May 19 15:58:01 minos-sam03 kernel: DMA free:12532kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:4334989 all_unreclaimable? yes May 19 15:58:01 minos-sam03 kernel: protections[]: 0 0 0 May 19 15:58:01 minos-sam03 kernel: Normal free:640kB min:928kB low:1856kB high:2784kB active:236kB inactive:793240kB present:901120kB pages_scanned:1229679 all_unreclaimable? yes May 19 15:58:01 minos-sam03 kernel: protections[]: 0 0 0 May 19 15:58:01 minos-sam03 kernel: HighMem free:1664kB min:512kB low:1024kB high:1536kB active:193328kB inactive:3072484kB present:3801088kB pages_scanned:0 all_unreclaimable? 
no May 19 15:58:01 minos-sam03 kernel: protections[]: 0 0 0 May 19 15:58:01 minos-sam03 kernel: DMA: 3*4kB 3*8kB 3*16kB 3*32kB 3*64kB 3*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 2*4096kB = 12532kB May 19 15:58:01 minos-sam03 kernel: Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 640kB May 19 15:58:01 minos-sam03 kernel: HighMem: 288*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1664kB May 19 15:58:01 minos-sam03 kernel: Swap cache: add 212613, delete 212613, find 80435/92424, race 0+0 May 19 15:58:01 minos-sam03 kernel: 0 bounce buffer pages May 19 15:58:01 minos-sam03 kernel: Free swap: 4192072kB May 19 15:58:01 minos-sam03 kernel: 1179648 pages of RAM May 19 15:58:01 minos-sam03 kernel: 819136 pages of HIGHMEM May 19 15:58:01 minos-sam03 kernel: 141832 reserved pages May 19 15:58:01 minos-sam03 kernel: 213957 pages shared May 19 15:58:01 minos-sam03 kernel: 0 pages swap cached May 19 15:58:01 minos-sam03 kernel: Out of Memory: Killed process 6340 (python). PID USER PR NI VIRT RES SHR S %CPU %MEM TIME #C COMMAND 30691 sam 16 0 54412 11m 3060 S 0 0.3 0:00 0 python And on fnpcsrv1, May 19 15:36:01 fnpcsrv1 automount[17398]: failed to mount /home/scripts May 19 16:08:44 fnpcsrv1 automount[32100]: failed to mount /home/rubin May 19 16:10:17 fnpcsrv1 automount[32244]: failed to mount /grid/wnclient May 19 16:12:34 fnpcsrv1 kernel: nfs_statfs: statfs error = 6 May 19 16:13:44 fnpcsrv1 automount[809]: >> mount: minos-nas-0.fnal.gov:/minos/scratch failed, reason given by server: Input/output error May 19 16:13:44 fnpcsrv1 automount[809]: mount(nfs): nfs: mount failure minos-nas-0.fnal.gov:/minos/scratch on /minos/scratch May 19 16:13:44 fnpcsrv1 automount[809]: failed to mount /minos/scratch SRV1> grep 'failed to mount' /var/log/messages May 19 16:08:44 fnpcsrv1 automount[32100]: failed to mount /home/rubin May 19 16:10:17 fnpcsrv1 automount[32244]: failed to mount /grid/wnclient May 19 16:13:44 fnpcsrv1 automount[809]: failed to mount /minos/scratch SRV1> grep 'failed to mount' /var/log/messages | cut -f 9 -d ' ' | sort -u /grid/wnclient /home/condor_log /home/ftp /home/lists /home/mclogs /home/rubin /home/scripts /minos/scratch And on minos25 Date: Mon, 19 May 2008 16:12:25 -0500 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron ${HOME}/minos/scripts/condorglide /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorglide: line 12: cd: /minos/scratch/kreymer/condor/probe: Not a directory find: logs/glideafs: No such file or directory Date: Mon, 19 May 2008 17:28:52 -0500 (CDT) Subject: HelpDesk ticket 115900 ___________________________________________ Short Description: /minos/data and /minos/scratch interruption around 16:07 CDT Problem Description: LSC/CSI At around 16:07 CDT, the mounts of /minos/data and /minos/scratch seem to have timed out on several nodes, including fnpcsrv1 minos-sam03 Here are some relevant lines from /var/log/messages on fnpcsrv1 : May 19 16:08:44 fnpcsrv1 automount[32100]: failed to mount /home/rubin May 19 16:10:17 fnpcsrv1 automount[32244]: failed to mount /grid/wnclient May 19 16:13:44 fnpcsrv1 automount[809]: failed to mount /minos/scratch My user application failed to write a file to /minos/data at around 16:07. Was there a global BlueArc or network problem around 16:07 ? ___________________________________________ Date: Tue, 20 May 2008 10:02:43 -0500 (CDT) Note To Requester: Is it working now? How long was it down? 
___________________________________________ Date: Tue, 20 May 2008 21:02:43 +0000 (UTC) We continue to see failures of /minos/data, /minos/scratch and /grid/app on several hosts, on at least fnpcsrv1 - farm head node minos25 - Minos Condor submission node minos-sam03 - doing dccp's from FNDCA to /minos/data. Some of the additonal failures are at : Tue, 20 May 2008 00:15:02 - fnpcsrv1 Tue, 20 May 2008 03:21:45 - minos25 Tue, 20 May 2008 07:20:55 - minos25 and sometime around 09:00 I have started a script reading a few bytes of data from a fresh file under /minos/data once per minute, on fnpcsrv1 minos01 minos25 minos26 minos-sam03 The logs are on the web under http://www-numi.fnal.gov/computing/dh/bluwatch.html I plan to check these logs for correlated 'BAD' messages the next time we see a timeout. ___________________________________________ ___________________________________________ ___________________________________________ Date: Thu, 22 May 2008 15:57:50 +0000 (UTC) I have restarted a streamlined 'bluwatch' script, monitoring access to /minos/data files every minute, and running on fnpcsrv1 minos-sam03 minos01 minos25 minos26 The scan results are presented at http://www-numi.fnal.gov/computing/dh/bluwatch This directory has *.txt files with the latest error from each node. There is a LASTERR file whose time stamp is reset when any error is found. Full logs of all errors, as well as starts and stops for each node, are under the 'log' subdirectory Likewise, files with the latest result for each node are under the 'latest' subdirectory. So far this morning, no reported errors. Enjoy ! ___________________________________________ ============================================================================= 2008 05 17 Sat ============================================================================= ####### # SAM # ####### Tried to update the Issue Tracker regarding station problems, SAM-IT/3582 https://plone3.fnal.gov/SAMGrid/tracking/base_view Browse issues New search Search Navigation: Show issue # Site error This site encountered an error trying to fulfill your request. The errors were: Error Type IOError Error Value [Errno 28] No space left on device Request made at 2008/05/17 18:06:33.083 GMT-5 Date: Sat, 17 May 2008 18:13:11 -0500 (CDT) Subject: HelpDesk ticket 115829 ___________________________________________ Short Description: The SAM issue tracker is down Problem Description: On connecting to https://plone3.fnal.gov/SAMGrid/tracking/base_view I see a web page indicating disk problems : Browse issues New search Search Navigation: Show issue # Site error This site encountered an error trying to fulfill your request. The errors were: Error Type IOError Error Value [Errno 28] No space left on device Request made at 2008/05/17 18:06:33.083 GMT-5 ___________________________________________ Date: Mon, 19 May 2008 07:47:21 -0500 (CDT) This ticket has been reassigned to RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Mon, 19 May 2008 08:17:57 -0500 (CDT) This ticket has been reassigned to MENGEL, MARC of the CD-LSCS/CSI/CS/EST Group. 
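NOTE - on the bluwatch monitor described in ticket 115900 above :
the sketch below shows the general shape of a once-per-minute /minos/data
probe. The file names and log layout are made up for illustration ;
this is not the actual bluwatch script, whose results are published at
http://www-numi.fnal.gov/computing/dh/bluwatch

#!/bin/sh
# write then read back a small fresh file under /minos/data once a minute,
# logging OK or BAD with a UTC timestamp ( paths are illustrative only )
NODE=`hostname -s`
LOG=/minos/scratch/bluwatch/${NODE}.log
PROBE=/minos/data/bluwatch/${NODE}.probe
while true ; do
  STAMP=`date -u +%Y-%m-%dT%H:%M:%S`
  if echo ${STAMP} > ${PROBE} && head -c 8 ${PROBE} > /dev/null ; then
    echo "${STAMP} OK"  >> ${LOG}
  else
    echo "${STAMP} BAD" >> ${LOG}
  fi
  sleep 60
done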
___________________________________________ ___________________________________________ ####### # SAM # ####### Prepare a slightly bigger standard test TENFILES=`ls /pnfs/minos/fardet_data/2005-04 | head -10` MINOS26 > printf "${TENFILES}\n" F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root F00030613_0000.mdaq.root F00030613_0001.mdaq.root F00030613_0002.mdaq.root F00030613_0003.mdaq.root F00030613_0004.mdaq.root F00030613_0005.mdaq.root F00030613_0006.mdaq.root { for FILE in `printf "${TENFILES}\n" | head -9` ; do printf "${FILE},"; done ; printf "${TENFILES}\n" | tail -1 ; } > /tmp/STENFILES STENFILES=`cat /tmp/STENFILES` sam list files --dim="FILE_NAME in ${STENFILES}" sam create definition \ --definitionName='st-ten' \ --dimensions="FILE_NAME in ${STENFILES}" \ --group='minos' DatasetDefinition saved with definitionId = 3519 sam describe definition --definitionName='st-ten' sam list files --dim="__set__ st-ten" Also created CENFILES dataset CENFILES=`ls /pnfs/minos/fardet_data/2005-04 | head -100` { for FILE in `printf "${CENFILES}\n" | head -99` ; do printf "${FILE},"; done ; printf "${CENFILES}\n" | tail -1 ; } > /tmp/CENFILES CENFILES=`cat /tmp/CENFILES` sam list files --dim="FILE_NAME in ${CENFILES}" sam create definition \ --definitionName='st-cen' \ --dimensions="FILE_NAME in ${CENFILES}" \ --group='minos' DatasetDefinition saved with definitionId = 3521 sam describe definition --definitionName='st-cen' sam list files --dim="__set__ st-cen" ( did this wrong once, 3520, deleted definition ) MINOS26 > ./sam_test_py minos dev st-onesmall MINOS26 > ./sam_test_py minos dev st-ten about a 30 second delay between files 5 and 6, then 6-10 came at the usual rate 1 per second MINOS26 > ./sam_test_py minos dev st-ten hung up, here's trace, from minos-sam02 05/17/08 16:06:29 minos.SM.ProjectManager 8934: Constraining delivery to the 1 consumption sites , priority : hi 05/17/08 16:06:29 minos.SM.CacheFitter_constrained 8934: Delivery of 1047113 is constrained to (none), i.e., impossible 05/17/08 16:06:29 minos.SM.CacheMan minos 8934: Could not fit files on disk, possibly due to fragmentation MINOS-SAM02 > cd private/station__minos-sam02__station_dev__minos MINOS26 > sam stop project --force --project=sam_test_project_20080517210337 restarted dev station 05/17/08 16:10:44 minos.SM.Repler 16692: No authorized requests 05/17/08 16:11:43 minos.SM.Repler 16692: No authorized requests MINOS26 > ./sam_test_py minos dev st-onesmall good, fast did it 9 more times, looks OK MINOS26 > ./sam_test_py minos dev st-ten OK, fast MINOS26 > ./sam_test_py minos dev st-onesmall OK, fast MINOS26 > ./sam_test_py minos dev st-ten OK, fast MINOS26 > ./sam_test_py minos dev st-ten 05/17/08 16:15:09 minos.SM.CacheFitter_constrained 16692: Delivery of 1047113 is constrained to (none), i.e., impossible 05/17/08 16:15:09 minos.SM.CacheMan minos 16692: Could not fit files on disk, possibly due to fragmentation 05/17/08 16:15:09 minos.SM.Repler 16692: No more deliveries possible MINOS26 > sam stop project --force --project=sam_test_project_20080517211506 Fall back to v6_0_5_23_srm from v6_0_5_24_srm Repeased same tests one ten one x 9 ten ten ten 05/17/08 16:24:52 minos.SM.CacheMan minos 19557: Could not fit files on disk, possibly due to fragmentation MINOS26 > sam stop project --force --project=sam_test_project_20080517212449 Falling back to the old station on minos-sam01 dev per notes ups declare -c sam_cp_config v7_0 ups declare -c sam_station v6_0_1_17 -q GCC-3.1 ups undeclare -c 
sam_gsi_config -q vdt ups declare -c sam_gsi_config v2_2_8 one ten one x 9 ten for N in 2 3 4 5 6 7 8 9 10 ; do echo ${N} ./sam_test_py minos dev st-ten done date; for N in 1 2 3 4 5 6 7 8 9 10 ; do echo ${N} ./sam_test_py minos dev st-onesmall ./sam_test_py minos dev st-ten ./sam_test_py minos dev st-cen done ; date Sat May 17 16:39:22 CDT 2008 ... Sat May 17 17:04:42 CDT 2008 Prepared production station for fallback MINOS26 > export SAM_STATION=minos BADPROJ=`sam dump station --projects \ | grep project \ | cut -f 2 -d ' ' \ | cut -f 1 -d '(' ` gemma3-Cedar-near-all-sntp-2008-3-w2muondrift-2008-05-15-18-30-19.303301000-0500 gemma3-Cedar-near-all-sntp-2008-3-w1muondrift-2008-05-15-18-30-27.506402000-0500 gemma3-Cedar-near-all-sntp-2008-4-w1muondrift-2008-05-15-18-45-19.588175000-0500 gemma3-Cedar-near-all-sntp-2008-4-w2muondrift-2008-05-15-18-45-22.696954000-0500 gemma3-Cedar-near-all-sntp-2008-4-w3muondrift-2008-05-15-18-55-23.363232000-0500 gemma3-Cedar-near-all-sntp-2008-4-w4muondrift-2008-05-15-18-55-28.628555000-0500 gemma-Cedar-far-all-sntp-2008-05-11muondrift-2008-05-16-05-41-33.678666000-0500 gemma-Cedar-near-all-sntp-2008-05-11muondrift-2008-05-16-05-43-24.221559000-0500 gemma-Cedar-far-all-sntp-2008-03-w3muondrift-2008-05-16-12-10-21.330785000-0500 gemma-Cedar-far-all-sntp-2008-03-w1muondrift-2008-05-16-12-10-21.525181000-0500 gemma-Cedar-far-all-sntp-2008-03-w2muondrift-2008-05-16-12-10-25.796463000-0500 gemma-Cedar-far-all-sntp-2008-03-w4muondrift-2008-05-16-12-10-29.350059000-0500 gemma-Cedar-far-all-sntp-2008-04-w2muondrift-2008-05-16-12-10-29.397919000-0500 gemma-Cedar-far-all-sntp-2008-04-w1muondrift-2008-05-16-12-10-29.411734000-0500 gemma-Cedar-far-all-sntp-2008-04-w3muondrift-2008-05-16-12-19-29.526171000-0500 gemma-Cedar-far-all-sntp-2008-04-w4muondrift-2008-05-16-12-19-31.759466000-0500 gemma-Cedar-far-all-sntp-2008-05-05muondrift-2008-05-17-05-40-45.360945000-0500 gemma-Cedar-far-all-sntp-2008-05-07muondrift-2008-05-17-05-41-04.215081000-0500 gemma-Cedar-far-all-sntp-2008-05-09muondrift-2008-05-17-05-41-39.177638000-0500 gemma-Cedar-far-all-sntp-2008-05-11muondrift-2008-05-17-05-41-51.113199000-0500 gemma-Cedar-near-all-sntp-2008-04-19muondrift-2008-05-17-05-42-09.456687000-0500 gemma-Cedar-near-all-sntp-2008-04-18muondrift-2008-05-17-05-42-29.405132000-0500 gemma-Cedar-near-all-sntp-2008-05-04muondrift-2008-05-17-05-43-05.304878000-0500 for PROJ in ${BADPROJ} ; do sleep 2 ; sam stop project --force --project=${PROJ} ; done 16:54 shutdown minos prd for downgrade ups declare -c sam_cp_config v7_0 ups declare -c sam_station v6_0_1_17 -q GCC-3.1 ups undeclare -c sam_gsi_config -q vdt ups declare -c sam_gsi_config v2_2_8 16:55 - started downgraded station Created st-ten and st-cen in production, as above DatasetDefinition saved with definitionId = 5002 DatasetDefinition saved with definitionId = 5004 17:09 Restarted prd station, accidentally started as _dev date; for N in 1 2 3 4 5 6 7 8 9 10 ; do echo ${N} ./sam_test_py minos prd st-onesmall ./sam_test_py minos prd st-ten ./sam_test_py minos prd st-cen done ; date Sat May 17 17:10:56 CDT 2008 ... Sat May 17 17:33:08 CDT 2008 ####### # SAM # ####### Date: Sat, 17 May 2008 10:28:13 +0100 From: Gemma Tinti ????? I see again an intermittent odd behaviour in SAM. It looks like the same problem we had before, some jobs just stay there as they are still waiting for some files to be delivered. 
------------------------------------------------------------ Date: Sat, 17 May 2008 13:17:46 +0000 (UTC) From: Arthur Kreymer I was waiting for a time when the SAM station was idle, to restart it with larger virtual disks, to prevent this problem in the future. Since it is failing again, I have done the restart now, around 08:15 CDT. This should prevent future failures of this sort. ------------------------------------------------------------ Date: Sat, 17 May 2008 13:59:48 +0000 (UTC) From: Arthur Kreymer Restarted station after gemma reported file deliver errors again. Disks are larger now : MINOS26 > sam dump station --disks *** BEGIN DUMP STATION minos version v6_0_5_24_srm running at minos-sam01.fnal.gov 9 days 20 hours 18 minutes 57 seconds, admins: buckley kreymer rhatcher sam Replica selection: prefer (enstore), avoid (empty) There are 0 authorized transfer groups Full delivery unit is enforced; external deliveries are constrained to dcap://minos-01 dcap://minos-02 Excess consumer satisfaction: 0 STATION DISKS: disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 1285777209B/52428800KB = 2.4% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 205357411B/52428800KB = 0.4% free station disk total: 1491134620B/104857600KB = 1.4% free *** END OF STATION DUMP *** MINOS26 > sam dump station --disks *** BEGIN DUMP STATION minos version v6_0_5_24_srm running at minos-sam01.fnal.gov 11 seconds, admins: buckley kreymer rhatcher sam Replica selection: prefer (enstore), avoid (empty) There are 0 authorized transfer groups Full delivery unit is enforced; external deliveries are constrained to dcap://minos-01 dcap://minos-02 Excess consumer satisfaction: 0 STATION DISKS: disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 849026969KB/900200128KB = 94.3% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 847971872KB/900200128KB = 94.2% free station disk total: 1696998842KB/1800400256KB = 94.3% free *** END OF STATION DUMP *** But a test of a 400+ file project failed, MINOS26 > sam_test_py minos ${UNIV} evansj-CC0325-RunI-L010z185-ND-Data OK running station minos dbserver prd dataset evansj-CC0325-RunI-L010z185-ND-Data project sam_test_project_20080517131921 fileCut 0 cid 8154 cpid 29777 job SAMStation.JobCount(jobsAtNode=1, jobsAll=1) Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2005-06/N00008019_ 0002.spill.sntp.cedar_phy_bhcurv.0.root file 1 Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008083_ 0000.spill.sntp.cedar_phy_bhcurv.0.root file 2 ... Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008143_ 0000.spill.sntp.cedar_phy_bhcurv.0.root file 18 RetryHandler.getNextFile(29777L)> initial retriable exception TRANSIENT('CORBA.TRANSIENT(omniORB.TRANSIENT_CallTimedout, CORBA.COMPLETED_MAYBE)') RetryHandler.getNextFile(29777L)> will retry in 1.95 seconds Traceback (most recent call last): File "./sam_test_py", line 162, in ? 
c = 'TRUE' File "sam_common_pylib/SamCommand/BlessedCommandInterfacePlaceHolder.py", line 81, in __call__ File "sam_common_pylib/SamCommand/CommandInterface.py", line 251, in __call__ File "sam_common_pylib/SamCommand/SamCommandInterface.py", line 240, in apiWrapper File "sam_user_pyapi/src/samConsumer.py", line 752, in implementation File "sam_common_pylib/SamCorba/SamServerProxy.py", line 230, in _callRemoteMethod File "sam_common_pylib/SamCorba/SamServerProxyRetryHandler.py", line 266, in handleCall PreviousFileNotReleased: Previous file not released, CPID: 29777 MINOS26 > sam list files --dim='__set__ evansj-CC0325-RunI-L010z185-ND-Data' Files: N00007787_0000.spill.sntp.cedar_phy_bhcurv.0.root N00007799_0002.spill.sntp.cedar_phy_bhcurv.0.root ... File Count: 445 Average File Size: 680.52MB Total File Size: 295.73GB Total Event Count: 338837208 ------------------------------------------------------------ Date: Sat, 17 May 2008 14:01:26 +0000 (UTC) From: Arthur Kreymer To: Gemma Tinti Cc: minos_sam_admin@fnal.gov, minos_software_discussion@fnal.gov Subject: RE: SAM behaviour The Minos SAM station is still having problems, even after the restart. I will work on it this afternoon. Meanwhile, please stand by ( do not run SAM projects ) ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ######## # FARM # ######## Ran loopCPB ( -w mrnt and sntp ) to clean out remaining mrnt/sntp in WRITE Ran loopCPB ( -b 1000 and without -w ) to catch up on mrnt/sntp now that the farm is writing cand's directly ( loopCPB0/1 cannot compete with the direct farm write, and was falling behind. ) ============================================================================= 2008 05 16 ============================================================================= ############### # MINOS_SAM03 # ############### Date: Fri, 16 May 2008 15:49:43 -0500 (CDT) Subject: HelpDesk ticket 115813 ___________________________________________ Short Description: Account request on minos-sam03 Problem Description: Please enable two more accounts on minos-sam03 minfarm - as exists on fnpcsrv1 - for Farm I/O operations /home/minfarm login area. uid=10871(minfarm) gid=5111(e875) .k5login can be copied from mindata@minos26 initially minsoft - as exists on minos-mysql1, for testing Mysql5, and to test migration to the new minos-mysql1 hardware Minos-sam03 may end up being the Farm mysql server. /home/minsoft login area .k5login can be copied from mindata@minos26 initially Please give minos-sam03 root access to kreymer, so that we can fully test the Mysql operation. ___________________________________________ ######## # DATA # ######## PLANNING - need minfarm account on minos-sam03 to keep the I/O load off fnpcsrv1 DATA=reco_near/cedar_phy_bhcurv/sntp_data AFSS/dc2nfs.20080118 -n -d ${DATA}/2006-07 # need 7/15 NDIRS=`ls /pnfs/minos/${DATA}` AFSS/dc2nfs -d ${DATA} 2>&1 | \ tee /minos/scratch/log/dc2nfs/cpbnear.log ############ # MCIMPORT # ############ Per tests on mindata@minos26, it seems that we do not need SRM_CONFIG, and we are impervious to default proxies like /tmp/x509up_u3648 if we do export X509_USER_PROXY=/home/mindata/.grid/kreymer-doe.proxy We had been hosed by : -rw------- 1 mindata e875 5071 May 6 11:14 /tmp/x509up_u3648 $ cp -a AFSS/mcimport.20080516 . 
$ ln -sf mcimport mcimport.20080516 # was mcimport.20080211 ####### # SRM # ####### Rubin reports problems with srm on worker nodes. srmcp is working on fpcserv1 / roundup But my manual test in mindata@minos26 fails MDS3 > srmls ${SPATH2} SRMClientV2 : put: try # 0 failed with error SRMClientV2 : ; nested exception is: org.globus.common.ChainedIOException: Authentication failed [Caused by: Defective credential detected [Caused by: [JGLOBUS-96] Certificate "DC=org,DC=DOEGrids,OU=Certificate Authorities,CN=DOEGrids CA 1" expired]] Renewed the proxy in /local/scratch26/kreymer/grid cd /local/scratch26/kreymer/grid . /minos/scratch/kreymer/VDT/setup.sh voms-proxy-init \ -voms fermilab:/fermilab/minos \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymer-doe.proxy \ -valid 10000:0 Testing minfarm@fnpcsrv1 cd /local/globus/minfarm/.grid scp kreymer@minos26:/local/scratch26/kreymer/grid/kreymer-doe.proxy . export SRM_CONFIG=/export/stage/minfarm/.srmconfig/kreymer.xml export SRM_CONFIG=/local/globus/minfarm/.srmconfig/kreymer.xml SRV1> cp ax /export/stage/minfarm/.srmconfig .srmconfig nedit /local/globus/minfarm/.srmconfig/kreymer.xml use kreymer-doe.proxy from /local/globus/minfarm/.grid Looks good, try rubin, MIN > ssh -l rubin fnpcsrv1 fnpcsrv1% date Fri May 16 14:22:16 CDT 2008 fnpcsrv1% source /usr/local/vdt/setup.csh fnpcsrv1% set SPATH2='srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/cand_data/310/n13023109_0003_L010185N_D00.cand.cedar_phy_brev.root' fnpcsrv1% srmls "${SPATH2}" 542849643 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/cand_data/310/n13023109_0003_L010185N_D00.cand.cedar_phy_brev.root fnpcsrv1% srmls -debug "${SPATH2}" Storage Resource Manager (SRM) CP Client version 1.25 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory SRM Configuration: debug=true gsissl=true help=false pushmode=false userproxy=true buffer_size=131072 tcp_buffer_size=0 streams_num=10 config_file=config.xml glue_mapfile=conf/SRMServerV1.map webservice_path=srm/managerv2 webservice_protocol=https gsiftpclinet=globus-url-copy protocols_list=http,gsiftp save_config_file=null srmcphome=.. urlcopy=sbin/urlcopy.sh x509_user_cert=/home/timur/k5-ca-proxy.pem x509_user_key=/home/timur/k5-ca-proxy.pem x509_user_proxy=/tmp/x509up_u1334 x509_user_trusted_certificates=/usr/local/vdt-1.8.1/globus/TRUSTED_CA globus_tcp_port_range=null gss_expected_name=null storagetype=permanent retry_num=20 retry_timeout=10000 wsdl_url=null use_urlcopy_script=false connect_to_wsdl=false delegate=true full_delegation=true server_mode=passive srm_protocol_version=2 request_lifetime=86400 priority=0 action is ls recursion depth=1 is long listing mode=false surl[0]=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/cand_data/310/n13023109_0003_L010185N_D00.cand.cedar_phy_brev.root Fri May 16 14:25:21 CDT 2008: In SRMClient ExpectedName: host Fri May 16 14:25:21 CDT 2008: SRMClient(https,srm/managerv2,true) SRMClientV2 : user credentials are: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. 
Kreymer/UID=kreymer SRMClientV2 : connecting to srm at httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : srmLs, contacting service httpg://stkendca2a.fnal.gov:8443/srm/managerv2 542849643 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/cand_data/310/n13023109_0003_L010185N_D00.cand.cedar_phy_brev.root fnpcsrv1% srmls -version Storage Resource Manager (SRM) CP Client version 1.25 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory srm client error: No URL(s) specified Try, and fail, with V2 SRM. fnpcsrv1% source /usr/local/grid/setup.csh fnpcsrv1% srmls "${SPATH2}" [main] ERROR client.Call - Exception: org.xml.sax.SAXException: Invalid element in org.dcache.srm.v2_2.TMetaDataPathDetail - surl at org.apache.axis.encoding.ser.BeanDeserializer.onStartChild(BeanDeserializer.java:258) at org.apache.axis.encoding.DeserializationContext.startElement(DeserializationContext.java:1035) at org.apache.axis.message.SAX2EventRecorder.replay(SAX2EventRecorder.java:165) at org.apache.axis.message.MessageElement.publishToHandler(MessageElement.java:1141) at org.apache.axis.message.RPCElement.deserialize(RPCElement.java:236) at org.apache.axis.message.RPCElement.getParams(RPCElement.java:384) at org.apache.axis.client.Call.invoke(Call.java:2467) at org.apache.axis.client.Call.invoke(Call.java:2366) at org.apache.axis.client.Call.invoke(Call.java:1812) at org.dcache.srm.v2_2.SrmSoapBindingStub.srmLs(SrmSoapBindingStub.java:2089) at org.dcache.srm.client.SRMClientV2.srmLs(SRMClientV2.java:575) at gov.fnal.srm.util.SRMLsClientV2.start(SRMLsClientV2.java:136) at gov.fnal.srm.util.SRMDispatcher.work(SRMDispatcher.java:779) at gov.fnal.srm.util.SRMDispatcher.main(SRMDispatcher.java:372) SRMClientV2 : put: try # 0 failed with error SRMClientV2 : ; nested exception is: org.xml.sax.SAXException: Invalid element in org.dcache.srm.v2_2.TMetaDataPathDetail - surl SRMClientV2 : put: try again fnpcsrv1% srmls -version Storage Resource Manager (SRM) CP Client version 2.0 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory srm client error: No URL(s) specified java.lang.IllegalArgumentException: No URL(s) specified at gov.fnal.srm.util.SRMDispatcher.checkURLSUniformity(SRMDispatcher.java:786) at gov.fnal.srm.util.SRMDispatcher.work(SRMDispatcher.java:463) at gov.fnal.srm.util.SRMDispatcher.main(SRMDispatcher.java:372) fnpcsrv1% srmls -version Storage Resource Manager (SRM) CP Client version 1.25 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory srm client error: No URL(s) specified ######## # FARM # ######## Cleanup interrupted copy of FILE=n13037749_0025_L010185N_D04.cand.cedar_phy_bhcurv.0.root SRV1> dds /minos/data/minfarm/WRITE/${FILE} -rw-rw-r-- 1 minospro numi 573160605 May 1 09:59 /minos/data/minfarm/WRITE/n13037749_0025_L010185N_D04.cand.cedar_phy_bhcurv.0.root SRV1> dds /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/774//${FILE} -rw-r--r-- 1 rubin numi 0 May 15 21:05 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/774//n13037749_0025_L010185N_D04.cand.cedar_phy_bhcurv.0.root RUBIN > rm /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/774//${FILE} Things seem to be running OK now. 
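The check behind this sort of cleanup can be scripted ; a minimal sketch,
not part of roundup itself, using the file from the case above :

FILE=n13037749_0025_L010185N_D04.cand.cedar_phy_bhcurv.0.root
WCOPY=/minos/data/minfarm/WRITE/${FILE}
PCOPY=/pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/774/${FILE}
# compare sizes, and only remove the PNFS entry if it is the zero length stub
WSIZ=`stat -c %s ${WCOPY}`
PSIZ=`stat -c %s ${PCOPY}`
echo "WRITE ${WSIZ}  PNFS ${PSIZ}"
[ "${PSIZ}" = "0" ] && rm ${PCOPY}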
######## # FARM # ######## URK - in cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.1.log occasional messages like SRV1> grep cat: cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.1.log cat: /export/stage/minfarm/ROUNDUP/READ/n13037409_0008_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037412_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037426_0009_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037428_0023_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037429_0016_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037430_0011_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory SRV1> grep cat: cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.0.log cat: /export/stage/minfarm/ROUNDUP/READ/n13037476_0001_L010185N_D04.cand.cedar_phy_bhcurv.0.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037737_0004_L010185N_D04.cand.cedar_phy_bhcurv.0.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037743_0003_L010185N_D04.cand.cedar_phy_bhcurv.0.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037749_0025_L010185N_D04.cand.cedar_phy_bhcurv.0.root: No such file or directory SRV1> grep cat: cedar_phyfarF00037835_0008.log cat: /export/stage/minfarm/ROUNDUP/READ/F00037835_0008.all.cand.cedar_phy.0.root: No such file or directory SRV1> grep cat: ../2008-04/*.log SRV1> grep cat: ../2008-03/*.log ../2008-03/cedar_phy_bhcurvnear.log:cat: /export/stage/minfarm/ROUNDUP/READ/N00011059_0000.spill.mrnt.cedar_phy_bhcurv.0.root: No such file or directory SRV1> grep cat: ../2008-02/*.log SRV1> grep cat: ../2008-01/*.log Specifically, recently, SRMCP 56/100 -streams_num=1 -server_mode=active -protocols=gsiftp file:///n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L01018 5N/cand_data/743 PURGE FARM n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root cat: /export/stage/minfarm/ROUNDUP/READ/n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory for REA in `cat ${ROUNTMP}/${CAT}READ/${FILE}` ; do ${ECHO} rm -f ${INDIR}/${REA} ; done ls -l /export/stage/minfarm/ROUNDUP/READ/SAM/n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root SRV1> ls -l /export/stage/minfarm/ROUNDUP/READ/SAM/n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 minfarm numi 57 May 14 00:09 /export/stage/minfarm/ROUNDUP/READ/SAM/n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root OK this makes sense, saddmc is moving the READ file to READ/SAM ============================================================================= 2008 05 15 ============================================================================= ########## # DCACHE # ########## Power out to DCache/Enstore and Oracle around 21:10 tonight. Stopped predator cronjob on minos26, and loopCP0/1 on fnpcsrv1 ########## # CONDOR # ########## Per man page , can use some variables in the .run files : X509USERPROXY = /local/scratch25/$ENV(LOGNAME)/grid/$ENV(LOGNAME).proxy Updated glidefile.run, glide.run This works !! 
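For reference, a minimal submit file fragment using the $ENV() macro.
The executable and log file names are placeholders, not the actual
glide.run contents :

Universe      = vanilla
Executable    = probe.sh
X509USERPROXY = /local/scratch25/$ENV(LOGNAME)/grid/$ENV(LOGNAME).proxy
Log           = probe.log
Output        = probe.out
Error         = probe.err
Queue

$ENV(LOGNAME) is expanded from the submitter's environment at
condor_submit time, so each user picks up their own proxy path.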
########## # CONDOR # ########## HOWTO condor updated with cert registration instructions, per Date: Fri, 09 May 2008 10:00:37 -0500 (CDT) From: HelpDesk Solution: yocum@fnal.gov sent this solution: Hi Art, Yesterday I re-enabled email notification on the fermilab VOMRS server, so I am now in a position to give you the correct method of performing the action you request, yourself. The users can and should add their own Robot certificates to their membership in the fermilab VO per the instructions I sent you last Monday, with the following addition (see 2a, 2b, 2c): 1) Load your KCA certificate (current, not expired!) and visit this URL: https://vomrs.fnal.gov:8443/vomrs/vo-fermilab/vomrs 2) Click on the [+] next to the "Members" 2a) Click on "Change Email Address" 2b) Enter your last name and "Search" 2c) Enter your correct email address and "Submit" 3) Click on the [+] next to the "Certificates" 4) Click on "Add certificate" 5) Enter your last name and "Search" 6) Enter your 'new' DN in the New DN field, and select the Fermi KCA from the pull-down list in the "New CA" list. 7) Enter some text in the "Reason" field and click "Submit" Next, the members representative (You!) will receive an email from VOMRS requesting you to approve the addition of the DN. The email will contain a handy link for you to click on to get to the right page. I should note, that when the new DN format is implemented the users will NOT need to add this DN, we'll do this for them automatically. Cheers, Dan ########## # CONDOR # ########## Removed from all .run files JOBLEASEDURATION = 1000000 ######## # FARM # ######## Request for test job submission of farm jobs, looking into it Last minfarm jobs were 9 April This was an old vanilla test. SRV1> condor_history -l 1780232.0> /tmp/chl Iwd = "/home/minfarm/bckhousetest/test-scripts" Cmd = "/home/minfarm/bckhousetest/test-scripts/reco_far_cosmic_daikon04_base_dogwoodtest0.sh" UserLog = "/home/minfarm/bckhousetest/restructure-test/test-results/dogwoodtest1/reco_far_cosmic_daikon04_base_dogwoodtest0/log.txt" LastRemoteHost = "slot1@fnpc225.fnal.gov" JobUniverse = 5 http://www.cs.wisc.edu/condor/manual/v6.6/2_5Submitting_Job.html JobUniverse : Integer which indicates the job universe, where 1 = Standard, 4 = PVM, 5 = Vanilla, 7 = Scheduler, 8 = MPI, 9 = Globus, and 10 = Java. Testing a clone of probe, copied to\ /minos/data/minfarm/probe Inspired by /minos/data/minfarm/condor_submit SRV1> condor_submit probe.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 1856378. 000 (1856378.000.000) 05/15 15:10:11 Job submitted from host: <131.225.167.44:63082> ... 012 (1856378.000.000) 05/15 15:10:14 Job was held. Failed to initialize GAHP Code 0 Subcode 0 ... 
SRV1> condor_rm 1856378.0 Job 1856378.0 marked for removal ######## # FARM # ######## Made safer copies of the crontab.dat files SRV1> cp -a crontab.dat.20071226 AFSS/crontab.minfarm.20071226 SRV1> cp -a crontab.dat.20070503 AFSS/crontab.minfarm.20070503 SRV1> cp -a crontab.dat.20060829 AFSS/crontab.minfarm.20060829 SRV1> rm crontab.dat.* Added farmgsum_log 15 00 * * * ${HOME}/scripts/farmgsum_log SRV1> cp crontab.dat AFSS/crontab.minfarm.20080515 ============================================================================= 2008 05 14 ============================================================================= ########## # DCACHE # ########## Date: Wed, 14 May 2008 15:45:17 -0500 (CDT) Subject: HelpDesk ticket 115648 ___________________________________________ Short Description: Recent FTP Tranfers - web page is empty Problem Description: The Recent FTP Transfers web page is empty, at http://fndca3a.fnal.gov/cgi-bin/dcache_files.py ___________________________________________ ############ # MCIMPORT # ############ MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N 2030288 /minos/data/mcimport/STAGE/daikon_04/L010185N 126550 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 368713 /minos/data/mcimport/STAGE/daikon_04/L010185N_helium 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N 27834 /minos/data/mcimport/STAGE/daikon_04/L250200N Per rhatcher, will probably next safely archive 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 126550 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 368713 /minos/data/mcimport/STAGE/daikon_04/L010185N_helium ####### # SAM # ####### export SAM_ORACLE_CONNECT="samdbs/" export SAM_STATION=minos setup sam -q dev sam dump station --disks disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 401805604B/52428800KB = 0.7% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 393714515B/52428800KB = 0.7% free Typical cdf-caf disk is disk 719 dcap://cdfcaf-door1:dcap://cdfdca1.fnal.gov:25125/pnfs/fnal.gov/usr, 69464633B/209715200KB = 0% free Typical cdf-cnaf disk is disk 533 cdfsam.cnaf.infn.it:/cdf/data/gpfs01/sam/cache1, 1518181B/2252800000KB = 0% free MINOS26 > samadmin resize station disk \ --mountPoint=dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr \ --size=900200100KB Disk size for dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr changed to 858.50GB samadmin resize station disk \ --mountPoint=dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr \ --size=900200100KB Disk size for dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr changed to 858.50GB Restarted the station, disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 848163716KB/900200128KB = 94.2% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 848155814KB/900200128KB = 94.2% free ./sam_test_py minos dev Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root file 1 Decrementing the job count. 
Stopping the project PRODUCTION setup sam -q prd disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 1417330916B/52428800KB = 2.6% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 575415386B/52428800KB = 1.1% free samadmin resize station disk \ --mountPoint=dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr \ --size=900200100KB Disk size for dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr changed to 858.50GB samadmin resize station disk \ --mountPoint=dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr \ --size=900200100KB Disk size for dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr changed to 858.50GB Waiting for an idle time for station restart gemma has projects running, mostly started around 05:40 ######## # DATA # ######## per 2008 04 29 plan of action : minfarm@fnpcsrv1 DATA=reco_near/cedar_phy_bhcurv/sntp_data NDIRS=`ls /pnfs/minos/${DATA}` AFSS/dc2nfs.20080118 -n -d ${DATA}/2007-04 NEEDED 0/45 reco_near/cedar_phy_bhcurv/sntp_data/2007-04 204G /minos/data/reco_near/cedar_phy_bhcurv/sntp_data SRV1> AFSS/dc2nfs.20080118 -n -d ${DATA} | grep NEEDED NEEDED 0/0 reco_near/cedar_phy_bhcurv/sntp_data/2005-03 NEEDED 67/67 reco_near/cedar_phy_bhcurv/sntp_data/2005-04 NEEDED 98/98 reco_near/cedar_phy_bhcurv/sntp_data/2005-05 NEEDED 65/65 reco_near/cedar_phy_bhcurv/sntp_data/2005-06 NEEDED 54/56 reco_near/cedar_phy_bhcurv/sntp_data/2005-07 NEEDED 1/55 reco_near/cedar_phy_bhcurv/sntp_data/2005-08 NEEDED 55/55 reco_near/cedar_phy_bhcurv/sntp_data/2005-09 NEEDED 33/33 reco_near/cedar_phy_bhcurv/sntp_data/2005-10 NEEDED 46/46 reco_near/cedar_phy_bhcurv/sntp_data/2005-11 NEEDED 42/42 reco_near/cedar_phy_bhcurv/sntp_data/2005-12 NEEDED 52/52 reco_near/cedar_phy_bhcurv/sntp_data/2006-01 NEEDED 50/50 reco_near/cedar_phy_bhcurv/sntp_data/2006-02 NEEDED 0/0 reco_near/cedar_phy_bhcurv/sntp_data/2006-03 NEEDED 0/0 reco_near/cedar_phy_bhcurv/sntp_data/2006-04 NEEDED 0/0 reco_near/cedar_phy_bhcurv/sntp_data/2006-05 NEEDED 73/74 reco_near/cedar_phy_bhcurv/sntp_data/2006-06 NEEDED 7/15 reco_near/cedar_phy_bhcurv/sntp_data/2006-07 NEEDED 41/41 reco_near/cedar_phy_bhcurv/sntp_data/2006-08 NEEDED 36/36 reco_near/cedar_phy_bhcurv/sntp_data/2006-09 NEEDED 49/51 reco_near/cedar_phy_bhcurv/sntp_data/2006-10 NEEDED 36/36 reco_near/cedar_phy_bhcurv/sntp_data/2006-11 NEEDED 39/39 reco_near/cedar_phy_bhcurv/sntp_data/2006-12 NEEDED 51/51 reco_near/cedar_phy_bhcurv/sntp_data/2007-01 NEEDED 52/52 reco_near/cedar_phy_bhcurv/sntp_data/2007-02 NEEDED 50/51 reco_near/cedar_phy_bhcurv/sntp_data/2007-03 NEEDED 0/45 reco_near/cedar_phy_bhcurv/sntp_data/2007-04 NEEDED 0/48 reco_near/cedar_phy_bhcurv/sntp_data/2007-05 NEEDED 0/42 reco_near/cedar_phy_bhcurv/sntp_data/2007-06 NEEDED 0/32 reco_near/cedar_phy_bhcurv/sntp_data/2007-07 SRV1> AFSS/dc2nfs.20080118 -n -d reco_far/cedar_phy_bhcurv/sntp_data | grep NEEDED NEEDED 0/59 reco_far/cedar_phy_bhcurv/sntp_data/2003-07 NEEDED 0/140 reco_far/cedar_phy_bhcurv/sntp_data/2003-08 NEEDED 0/164 reco_far/cedar_phy_bhcurv/sntp_data/2003-09 NEEDED 0/199 reco_far/cedar_phy_bhcurv/sntp_data/2003-10 NEEDED 0/166 reco_far/cedar_phy_bhcurv/sntp_data/2003-11 NEEDED 0/135 reco_far/cedar_phy_bhcurv/sntp_data/2003-12 NEEDED 0/118 reco_far/cedar_phy_bhcurv/sntp_data/2004-01 NEEDED 0/108 reco_far/cedar_phy_bhcurv/sntp_data/2004-02 NEEDED 0/137 reco_far/cedar_phy_bhcurv/sntp_data/2004-03 NEEDED 0/152 reco_far/cedar_phy_bhcurv/sntp_data/2004-04 NEEDED 0/119 reco_far/cedar_phy_bhcurv/sntp_data/2004-05 NEEDED 
0/102 reco_far/cedar_phy_bhcurv/sntp_data/2004-06 NEEDED 0/103 reco_far/cedar_phy_bhcurv/sntp_data/2004-07 NEEDED 0/111 reco_far/cedar_phy_bhcurv/sntp_data/2004-08 NEEDED 0/112 reco_far/cedar_phy_bhcurv/sntp_data/2004-09 NEEDED 0/102 reco_far/cedar_phy_bhcurv/sntp_data/2004-10 NEEDED 0/115 reco_far/cedar_phy_bhcurv/sntp_data/2004-11 NEEDED 0/105 reco_far/cedar_phy_bhcurv/sntp_data/2004-12 NEEDED 0/105 reco_far/cedar_phy_bhcurv/sntp_data/2005-01 NEEDED 0/95 reco_far/cedar_phy_bhcurv/sntp_data/2005-02 NEEDED 0/112 reco_far/cedar_phy_bhcurv/sntp_data/2005-03 NEEDED 0/214 reco_far/cedar_phy_bhcurv/sntp_data/2005-04 NEEDED 0/220 reco_far/cedar_phy_bhcurv/sntp_data/2005-05 NEEDED 0/210 reco_far/cedar_phy_bhcurv/sntp_data/2005-06 NEEDED 0/216 reco_far/cedar_phy_bhcurv/sntp_data/2005-07 NEEDED 0/151 reco_far/cedar_phy_bhcurv/sntp_data/2005-08 NEEDED 0/83 reco_far/cedar_phy_bhcurv/sntp_data/2005-09 NEEDED 0/111 reco_far/cedar_phy_bhcurv/sntp_data/2005-10 NEEDED 0/98 reco_far/cedar_phy_bhcurv/sntp_data/2005-11 NEEDED 0/102 reco_far/cedar_phy_bhcurv/sntp_data/2005-12 NEEDED 0/104 reco_far/cedar_phy_bhcurv/sntp_data/2006-01 NEEDED 0/70 reco_far/cedar_phy_bhcurv/sntp_data/2006-02 NEEDED 0/102 reco_far/cedar_phy_bhcurv/sntp_data/2006-03 NEEDED 0/43 reco_far/cedar_phy_bhcurv/sntp_data/2006-04 NEEDED 0/68 reco_far/cedar_phy_bhcurv/sntp_data/2006-05 NEEDED 0/84 reco_far/cedar_phy_bhcurv/sntp_data/2006-06 NEEDED 0/110 reco_far/cedar_phy_bhcurv/sntp_data/2006-07 NEEDED 0/98 reco_far/cedar_phy_bhcurv/sntp_data/2006-08 NEEDED 0/78 reco_far/cedar_phy_bhcurv/sntp_data/2006-09 NEEDED 0/106 reco_far/cedar_phy_bhcurv/sntp_data/2006-10 NEEDED 0/97 reco_far/cedar_phy_bhcurv/sntp_data/2006-11 NEEDED 0/68 reco_far/cedar_phy_bhcurv/sntp_data/2006-12 NEEDED 0/90 reco_far/cedar_phy_bhcurv/sntp_data/2007-01 NEEDED 0/94 reco_far/cedar_phy_bhcurv/sntp_data/2007-02 NEEDED 0/80 reco_far/cedar_phy_bhcurv/sntp_data/2007-03 NEEDED 0/72 reco_far/cedar_phy_bhcurv/sntp_data/2007-04 NEEDED 0/66 reco_far/cedar_phy_bhcurv/sntp_data/2007-05 NEEDED 0/72 reco_far/cedar_phy_bhcurv/sntp_data/2007-06 NEEDED 0/88 reco_far/cedar_phy_bhcurv/sntp_data/2007-07 803G /minos/data/reco_far/cedar_phy_bhcurv/sntp_data Need over 1 TB free space before doing near catchup, wait till we have 2. AFSS/dc2nfs -d ${DATA} 2>&1 | \ tee /minos/scratch/log/dc2nfs/cpbnear.log ########## # DCACHE # ########## Date: Wed, 14 May 2008 14:36:26 +0000 (UTC) From: Arthur Kreymer To: jdejong@fnal.gov Cc: minos-data@fnal.gov Subject: jdejong jobs overloading raw data DCache pools I have been seeing timeouts for the last couple of days in the jobs which declare raw data files to SAM. This is apparently due to an overload of the DCache pools by jdejong jobs running under LSF. These jobs are collectively holding dozens of raw data files open, which is beyond the capacity of these pools. The net effect is to delay other users of raw data, including most of your own jobs. Please adjust your jobs to take a local copy of the files before processing. Thanks ! To see the overload, look at the w-stkendca9a-3 line at the web page http://fndca.fnal.gov:2288/queueInfo ==================================================================== Date: Wed, 14 May 2008 13:07:13 -0500 (CDT) From: Jeff K deJong Can you indentify for me which are the jobs that are the problem, are they the jobs in the 12hr queue? If it is the 12hr job then each job that is running is holding 1 dcache file. 
I'll modify them shortly so that at the start of each job the file is copied from dcache to the local directory Sorry for the problems ==================================================================== Thanks ! You can see the dcache connections at http://fndca3a.fnal.gov/dcache/DOORS.html You can get a text listing of your connections with curl http://fndca3a.fnal.gov/dcache/DOORS.html 2>&1 | grep Dejong ######## # FARM # ######## CP far all done except : OK - stream all.sntp.cedar_phy OK - 825 Mbytes in 2 runs PEND - have 17/24 subruns for F00037835_*.all.sntp.cedar_phy.0.root 11 05/02 11:36 0 17 SUPPRESS F00037838_0024.all.sntp.cedar_phy.0.root PEND - have 17/24 subruns for F00037838_*.all.sntp.cedar_phy.0.root 11 05/02 12:02 0 17 OK - stream spill.bntp.cedar_phy OK - 144 Mbytes in 2 runs PEND - have 17/24 subruns for F00037835_*.spill.bntp.cedar_phy.0.root 11 05/02 11:36 0 17 SUPPRESS F00037838_0024.spill.bntp.cedar_phy.0.root PEND - have 16/24 subruns for F00037838_*.spill.bntp.cedar_phy.0.root 11 05/02 12:02 0 16 OK - stream spill.sntp.cedar_phy OK - 97 Mbytes in 2 runs PEND - have 17/24 subruns for F00037835_*.spill.sntp.cedar_phy.0.root 11 05/02 11:36 0 17 SUPPRESS F00037838_0024.spill.sntp.cedar_phy.0.root PEND - have 16/24 subruns for F00037838_*.spill.sntp.cedar_phy.0.root 11 05/02 12:02 0 16 Stopped loopCPF Stopped looper, created loopCPB0, loopCPB1 for pass 0 and 1 files, run them in parallel SRV1> cp AFSS/roundup.20080515 . SRV1> ln -sf roundup.20080515 roundup # was roundup.20080514 SRV1> ./roundup -n -b 100 -p -s "cand.cedar_phy_bhcurv.1" -r cedar_phy_bhcurv mcnear PURGED 100/1600 SRV1> ./roundup -n -b 1600 -w -s "cand.cedar_phy_bhcurv.1" -r cedar_phy_bhcurv mcnear It seems that MOST of WRITE needs to be written to PNFS ( all pass 1 )! So let's run loopCPB1 with -w set SRV1> cat loopCPB1 #!/bin/sh while true ; do ./roundup -c -b 100 -w -s "cand.cedar_phy_bhcurv.1" -r cedar_phy_bhcurv mcnear sleep 1200 done ./loopCPB1 & less cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.1.log ============================================================================= 2008 05 13 ============================================================================= ########### # ROUNDUP # ########### herber suggests an inline alternate to invoking MAIN : exec > ... 2> ... 
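That is, rather than defining MAIN and invoking it in the background with
its output redirected, the script can redirect its own stdout/stderr near
the top. A minimal sketch, with a placeholder log path rather than the
real roundup log location :

#!/bin/sh
LOG=/tmp/roundup.$$.log       # placeholder path
exec >> ${LOG} 2>&1           # everything below here goes to ${LOG}
echo "started `date`"
# ... the rest of the script runs in the foreground, so $$ stays the
# real process id for the pidfile ...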
########### # ROUNDUP # ########### Subject: Cron ${HOME}/scripts/corral Date: Tue, 13 May 2008 06:05:01 -0500 From: root@fnpcsrv1.fnal.gov (Cron Daemon) To: rubin@fnal.gov PID TTY TIME CMD need to kill header, like ps -p ${prepid} --no-headers ########### # MONTHLY # ########### DATASETS 5/13 PREDATOR 5/13,14,15 had to retry due to DCache overload, OK on 15th VAULT 5/7 from cron, logs are OK MYSQL 5/19 Mon May 19 09:43:13 CDT 2008 Mon May 19 10:14:42 CDT 2008 crontab.dat - changed vault to run on 4th of the month Renamed all scripts/crontab.dat.* to crontab.minos26.* Predator - B080430_080001.mbeam.root Tue May 13 16:50:30 UTC 2008 OOPS - run_dbu is stuck for 207, killing it Try again, watching 25509 pts/0 S+ 0:04 \_ loon -bq /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/firstlast.C dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/beam_data B080430_080001.mbeam.root Tue May 13 20:06:13 UTC 2008 OOPS - run_dbu is stuck for 207, killing it setup dcap -q unsecured IFILE=B080430_080001.mbeam.root IPATH=minos/beam_data/2008-04 DCPOR=24125 # unsecured DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} cd /local/scratch??/`whoami` dccp -d 4 ${DFILE} TEST.dat MINOS26 > dccp -d 4 ${DFILE} TEST.dat [Tue May 13 15:16:11 2008] Going to open file dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/beam_data/2008-04/B080430_080001.mbeam.root in cache. Connected in 0.00s. not making much progress. I see why , under PolRequestQueue, CellName DomainName Active Max Queued movers w-stkendca9a-3 w-stkendca9a-3Domain 10 12 43 Jeff Dejong is reading large numbers of neardet_data files from LSF, Several beam_data files are open from fnpc341 MINOS25 > condor_q -r | grep fnpc341 113258.6 loiacono 5/13 06:06 0+00:26:09 vm2@7829@fnpc341.fnal.gov 113258.13 loiacono 5/13 06:06 0+00:26:11 vm2@14117@fnpc341.fnal.gov 113258.14 loiacono 5/13 06:06 0+00:26:15 vm2@28520@fnpc341.fnal.gov 113258.15 loiacono 5/13 06:06 0+00:26:15 vm2@11484@fnpc341.fnal.gov 113258.16 loiacono 5/13 06:06 0+00:26:07 vm2@23347@fnpc341.fnal.gov 113333.140 pawloski 5/13 10:09 0+01:25:06 vm2@992@fnpc341.fnal.gov bin/sh /minos/scratch/loiacono/gnumi/prod/condor/condor_job_le150_20060603_1.sh 16 000 \_ /bin/csh -f ./run-gnumi fluka05_le150i000_20060603 36 99999999999 \_ /minos/scratch/loiacono/gnumi/src/gnumi/linux/gnumi ########### # CRONTAB # ########### for YMD in 20050817 20050902 20051005 20060504 20060807 20080402 ; do mv crontab.dat.${YMD} crontab.minos26.${YMD} ; done ########## # CONDOR # ########## Date: Tue, 13 May 2008 10:42:32 -0500 From: Sfiligoi Igor ... The Joint EGEE and OSG Workshop on VO Management in Production Grids will be held June 24, 2008 in conjunction with HPDC 2008. ... 
######## # FARM # ######## farmgsum_log - logs farmgsum to the tee looper keeps on running as desired, 29627 5586119 mcnearcat mcnearcat 2537 1453681 cand.cedar_phy_bhcurv.0.root 6058 3497257 cand.cedar_phy_bhcurv.1.root WRITE 1731 1000277 cand.cedar_phy_bhcurv.1.root set up a looper for CP F SRV1> cp looper loopCPF containing ./roundup -c -b 100 -r cedar_phy far Tue May 13 09:15:10 CDT 2008 PURGING WRITE files 100 PURGED WRITE/F00038316_0006.spill.cand.cedar_phy.0.root ============================================================================= 2008 05 12 ============================================================================= ########## # DCACHE # ########## Sent in high priority helpdesk ticket re lack of CPB candidate writes The predator logs show a backlog of STARTED Sat May 10 02:11:26 2008 40 FILES STARTED Sun May 11 02:11:57 2008 1799 FILES STARTED Mon May 12 02:11:26 2008 1435 FILES all clear STARTED Tue May 13 02:28:34 2008 1408 FILES many pending, 30 thru 1407 cc'd this ticket to rubin Date: Mon, 12 May 2008 23:45:42 -0500 (CDT) Subject: HelpDesk ticket 115493 ___________________________________________ Short Description: some Minos files pending in FNDCA write queues for 1 day Problem Description: dcache-admin We have a large amount of data ( hundreds of GBytes ) waiting to get out of the public write pools onto LTO-3 tape. Some of these files have been been waiting nearly a day. I do not see large backlogs in Enstore. There are modest queues ( net 500 ) on the write pools, but I do not think that explains a 1 day delay. A sample delayed file follows : ============================ PNFS status for /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/7 34/n13037340_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root -rw-r--r-- 1 rubin e875 574165580 May 12 02:03 n13037340_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root LEVEL 2 2,0,0,0.0,0.0 :c=1:4c3ee836;h=yes;l=574165580; w-stkendca12a-6 r-stkendca16a-5 LEVEL 4 ============================ ___________________________________________ Date: Tue, 13 May 2008 16:32:19 +0000 (UTC) From: Arthur Kreymer There was a similar large backlog Saturday night, which cleared during the day Sunday. Our files started moving to tape again Monday night, so we are making some progress again. The file listed below is on tape. I will keep an eye on things for a while, to see whether another backlog develops tomorrow. ___________________________________________ Date: Mon, 07 Jul 2008 14:41:38 -0500 (CDT) Previously assigned to: BERG, DAVID This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. ___________________________________________ Date: Mon, 07 Jul 2008 19:52:11 +0000 (UTC) From: Arthur Kreymer This problem did not recur, as nearly as I can tell. This ticket can be closed out. Thanks ! ########## # CONDOR # ########## MINOS25 > condor_config_val CONDOR_ADMIN fermigrid-root@fnal.gov MINOS25 > condor_config_val -config Configuration source: /etc/condor/condor_config Local configuration source: /opt/condor/local/condor_config.local MINOS25 > condor_config_val -set "CONDOR_ADMIN=minos-admin@fnal.gov" Attempt to set configuration "CONDOR_ADMIN=minos-admin@fnal.gov" on master minos25.fnal.gov <131.225.193.25:62694> failed. MINOS25 > condor_config_val -name minos01 -set "CONDOR_ADMIN=minos-admin@fnal.gov" Attempt to set configuration "CONDOR_ADMIN=minos-admin@fnal.gov" on master minos01.fnal.gov <131.225.193.1:61561> failed. 
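The -set attempts above are expected to fail unless the pool explicitly
allows remote configuration. Assuming the standard Condor knobs ( not
checked against this pool's config ), the relevant settings can be
inspected with :

condor_config_val ENABLE_RUNTIME_CONFIG
condor_config_val ENABLE_PERSISTENT_CONFIG

With these at their defaults, plus the SETTABLE_ATTRS_* restrictions, the
config files themselves have to be edited, which is what ticket 115902
( 2008 05 19, above ) requests.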
######### # MINOS # ######### Date: Mon, 12 May 2008 12:14:47 -0500 (CDT) Subject: HelpDesk ticket 115465 ___________________________________________ Short Description: ssh to minos05 fails Problem Description: run2-sys : ssh to minos05 has been failing since at least last Friday : MIN > ssh -v minos05 ; date OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos05 [131.225.193.5] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 debug1: identity file /home/kreymer/.ssh/id_dsa type -1 ssh_exchange_identification: Connection closed by remote host Mon May 12 17:10:47 UTC 2008 ___________________________________________ Date: Mon, 12 May 2008 12:59:04 -0500 sshd has been restarted on minos05. Logins via ssh are working again. ___________________________________________ ######## # GRID # ######## du: `/grid/app/minos/VDT/vdt/extract': Permission denied du: `/grid/app/minos/VDT/vdt/backup': Permission denied du: `/grid/app/minos/VDT/vdt/services': Permission denied MINOS26 > du -sm /grid/app/minos 31249 /grid/app/minos MINOS26 > du -sm /grid/app/minos/* 1834 /grid/app/minos/Minossoft 287 /grid/app/minos/VDT 1 /grid/app/minos/bin du: `/grid/app/minos/minfarm/Minossoft/EXTERNAL/mysql-5.0.22/sql/share/japanese-sjis': Permission denied 17376 /grid/app/minos/minfarm 445 /grid/app/minos/parrot 4458 /grid/app/minos/products 56 /grid/app/minos/sam 10 /grid/app/minos/scripts 6787 /grid/app/minos/users MINOS26 > du -sm /grid/app/minos/minfarm/* ########### # ROUNDUP # ########### roundup.20080513 Corrected pid code, to find single Process ID. more exact script match ########### # ROUNDUP # ########### Suppressed printing from non-NOOP PID code, in order to avoid CRON email to Howie SRV1> cp AFSS/roundup.20080512 . SRV1> ln -sf roundup.20080512 roundup # was roundup.20080509 SRV1> ls -l roundup lrwxrwxrwx 1 minfarm numi 16 May 12 12:01 roundup -> roundup.20080512 Date: Mon, 12 May 2008 12:05:04 -0500 From: minfarm To: minos-data@fnal.gov Subject: roundup cedar far 13815 ? stale pidfile on fnpcsrv1 roundup cedar far 13815 ? stale pidfile on fnpcsrv1 OK, that is due to a 2-line pid file, 13815 26952 ? Ss 0:00 /bin/sh /home/minfarm/scripts/corral 26954 ? S 0:09 \_ /bin/bash /home/minfarm/scripts/roundup -c -r cedar far 24529 ? S 0:00 \_ /bin/sh /usr/local/vdt-1.8.1/srm-v1-client/bin/srmcp -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00040860_00 24533 ? S 0:00 \_ /bin/sh /usr/local/vdt-1.8.1/srm-v1-client/sbin/srm -copy -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00 24535 ? Sl 0:12 \_ java -cp /usr/local/vdt-1.8.1/srm-v1-client/lib/srm_client.jar:/usr/local/vdt-1.8.1/srm-v1-client/lib/srm.jar:/usr/loc 23229 pts/10 Ss 0:08 /bin/bash 15383 pts/10 R+ 0:00 \_ ps xf 23131 ? Sl 7413:02 /fnal/ups/prd/mysql/v5_0_22/Linux-2-4/libexec/mysqld --defaults-file=/export/stage/minfarm/my.cnf --basedir=/fnal/ups/prd/mysql/v5_0_ 7968 pts/2 Ss+ 0:06 /bin/bash 7953 pts/0 S 0:00 ksu minfarm 7955 pts/0 S+ 0:00 \_ -bin/tcsh 4544 ? S 0:00 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t 4520 ? Ss 0:00 /bin/sh /home/minfarm/scripts/corral 4522 ? S 0:01 \_ /bin/bash /home/minfarm/scripts/roundup -c -r cedar far 15378 ? S 0:00 \_ /bin/bash /home/minfarm/scripts/roundup -c -r cedar far 15379 ? 
R 0:00 \_ sampy /export/stage/minfarm/ROUNDUP/SAM/current/bin/sam locate F00040863_0018.mdaq.root 15380 ? S 0:00 \_ grep /pnfs/minos 15381 ? S 0:00 \_ cut -f 5 -d / 15382 ? S 0:00 \_ cut -f 1 -d , 32219 pts/10 S 0:11 /bin/bash ./roundup -w -r cedar_phy far 15097 pts/10 S 0:00 \_ /bin/sh /usr/local/vdt-1.8.1/srm-v1-client/bin/srmcp -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00038349_0008.s 15101 pts/10 S 0:00 \_ /bin/sh /usr/local/vdt-1.8.1/srm-v1-client/sbin/srm -copy -streams_num=1 -server_mode=active -protocols=gsiftp file:///F000383 15103 pts/10 Sl 0:09 \_ java -cp /usr/local/vdt-1.8.1/srm-v1-client/lib/srm_client.jar:/usr/local/vdt-1.8.1/srm-v1-client/lib/srm.jar:/usr/local/v SRV1> cat /minos/data/minfarm/roundup/cedarfar.pid 4522 ? SRV1> ls -l /minos/data/minfarm/roundup/cedarfar.pid -rw-r--r-- 1 minfarm numi 7 May 12 12:05 /minos/data/minfarm/roundup/cedarfar.pid pico /minos/data/minfarm/roundup/cedarfar.pid # 26954 .... cedar far is stuck in an srmcp, /pnfs/fnal.gov/usr/minos/reco_far/cedar/.bcnd_data/2008-05/F00040860_0014.spill.bcnd.cedar.0.root When I killed the second copy, during its PURGE phase, all further output to the log file ceased, and the srmcp got stuck. SRV1> touch /minos/data/minfarm/roundup/STOP.cedar_phyfar SRV1> date Mon May 12 13:21:53 CDT 2008 OOPS - SRMCP failed, bailing Mon May 12 13:23:03 CDT 2008 corral is cranking along, looking at cedar nearm, will proceed to CPfar and CPBmcnear SRV1> ./roundup -r cedar_phy far rm /minos/data/minfarm/roundup/STOP.cedar_phyfar SRV1> ./roundup -r cedar_phy far running, let's poke at the 600 files in WRITE for CPB SRV1> ./roundup -w -b 20 -r cedar_phy_bhcurv mcnear SRV1> ./roundup -w -b 100 -r cedar_phy_bhcurv mcnear SRV1> ./roundup -w -b 200 -r cedar_phy_bhcurv mcnear slight delay, this is also srmcp'ing 200 files, mostly non cand 16:45 mv ../ROUNTMP/NOCAT.ok NOCAT PID is messed up for CPB : SRV1> cat MDMR/cedar_phy_bhcurvmcnear.pid 7538 7587 7588 pts/10 Plan, modify corral to run just C far and near, set up loops running continuous scans in 'c' cron mode CP far -s "nd.cedar" CPB mcnear thinking about doing while true ; do ./roundup -c -b 50 -s "nd.cedar" -r cedar_phy_bhcurv mcnear ; done while true ; do ./roundup -c -b 50 -s "nd.cedar" -r cedar_phy far ; done 22:20 - updated corral There's not enough CP far left to be worth chewing on. Try pushing out CPB mcnear candidates in a fairly tight loop : cat looper #!/bin/sh while true ; do ./roundup -c -b 100 -s "cand.cedar" -r cedar_phy_bhcurv mcnear sleep 1200 done Oops, false start with 'far' detector. Try again at 22:55 And again at 22:56 SRV1> df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 28T 168G 100% /minos/data SRV1> df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 28T 182G 100% /minos/data Checking top file in WRITE, n13037340_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root is in dcache, not on tape yet. 
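( For the record, one way to check by hand whether such a file has reached tape is the PNFS layer-4 lookup. This is a sketch, assuming the generic dCache/Enstore ".(use)(4)(file)" convention rather than any Minos script :
  cd /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/734
  cat ".(use)(4)(n13037340_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root)"
  # empty output - not yet flushed to tape
  # once on tape, this lists the Enstore volume, location cookie, size and bfid
)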
######## # FARM # ######## Free space down to 388181 Mon May 12 10:40:20 CDT 2008 ./farmgsum | tee FGS/sum.`date +%Y%m%d%H` WRITE 847 113334 all.cand.cedar_phy.0.root 32 12290 all.sntp.cedar_phy.0.root 848 19025 spill.bcnd.cedar_phy.0.root 32 1359 spill.bntp.cedar_phy.0.root 848 12053 spill.cand.cedar_phy.0.root 32 887 spill.sntp.cedar_phy.0.root 509 293914 cand.cedar_phy_bhcurv.1.root 54 5540 mrnt.cedar_phy_bhcurv.1.root 54 19233 sntp.cedar_phy_bhcurv.1.root ./roundup -w -r cedar_phy far Cleaned up messages, and PID code a bit more 85500 Mon May 12 20:40:57 CDT 2008 WRITE 42 5676 all.cand.cedar.0.root 649 86728 all.cand.cedar_phy.0.root 1 582 all.sntp.cedar.0.root 12 5843 all.sntp.cedar_phy.0.root 254 146751 cand.cedar_phy_bhcurv.1.root 136 44698 spill.cand.cedar.0.root ./roundup -b 256 -w -s 'cand.cedar_phy_bhcurv.1' -r cedar_phy_bhcurv mcnear 21:00 SRV1> condor_q rubin 26 jobs; 0 idle, 26 running, 0 held N.B. - why are there pass 0 files from CPB mcnear ? this reprocessing was to be at pass 1, to avoid file name conflicts. ./roundup -b 100 -w -s 'cand.cedar_phy.0' -r cedar_phy far ./roundup -b 200 -w -s 'cand.cedar_phy.0' -r cedar_phy far There are about 230 writes for CP far queues in Enstore more may dribble out as the tapes make progress. SRV1> cat ~/scripts/MDMR/cedarnear.pid 5357 7967 GRRRRR this is silly ! Yes, there are two roundups running, under separate trees. Pure luck that the first matched correctly. Let's go back to setting MYPID=${$} which seems to be correct lately. SRV1> ./roundup.20080514 -w -r cedar near SRV1> cat ~/scripts/MDMR/cedarnear.pid 18163 SRV1> ps xf 18175 pts/10 S 0:00 /bin/bash ./roundup.20080514 -w -r cedar near GRRRRRR !!!!!!!! pico ~/scripts/MDMR/cedarnear.pid SRV1> ./roundup -n -w -r cedar far SRV1> cat ~/scripts/MDMR/cedarfar.pid 7967 pts/10 pts/10 pts/10 GRRRRRRRR !!! This entirely missed the proper 27338 The root cause of all these problems is probably running MAIN in the background. stop doing this, background with ^z if needed. This seems to work as desired, $$ does the right thing. SRV1> cp -a AFSS/roundup.20080514 . SRV1> ln -sf roundup.20080514 roundup # was 13 ============================================================================= 2008 05 09 ============================================================================= ######## # FARM # ######## Test improved pid calc, and SRMCP count printout. 
AFSS/roundup.20080509 -w -b 100 -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 20608 20619 pts/10 S 0:00 /bin/sh AFSS/roundup.20080509 -w -b 100 -r cedar_phy far AFSS/roundup.20080509 -w -b 10 -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 28005 28016 pts/10 S 0:00 /bin/sh AFSS/roundup.20080509 -w -b 10 -r cedar_phy far MYOWNPID is 28005 Try /bin/bash not /bin/sh SRV1> AFSS/roundup.20080509 -w -b 10 -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 29375 29386 pts/10 S 0:00 /bin/bash AFSS/roundup.20080509 -w -b 10 -r cedar_phy far Try adding ps diagnostic, with filtering SRV1> AFSS/roundup.20080509 -w -b 10 -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 31524 31535 pts/10 S 0:00 /bin/bash AFSS/roundup.20080509 -w -b 10 -r cedar_phy far Try without filtering Fri May 9 09:05:33 CDT 2008 PID TTY TIME CMD 23229 pts/10 00:00:08 bash 32660 pts/10 00:00:00 roundup.2008050 32661 pts/10 00:00:00 ps SRV1> AFSS/roundup.20080509 -w -b 10 -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 1707 1709 1710 1713 Cleanup the parsing of ps, with PSO="`ps --no-header`" PSOU=" `printf "${PSO}\n" | grep roundup`" PSOUT=`printf "${PSOU}\n" | tr -s ' ' | cut -f 2 -d ' '` Test shifting the symlink out from under a running process, to see whether we can upgrade on the fly. cp AFSS/roundup.20080509 ./roundup.20080509test # with test message cp AFSS/roundup.20080509 ./roundup.20080509 ln -sf roundup.20080509test roundup.work ./roundup.work -w -b 10 -r cedar_phy far OK, was running the test version ./roundup.work -w -b 10 -r cedar_phy far ln -sf roundup.20080509 roundup.work OK, was running the test version ./roundup.work -w -b 10 -r cedar_phy far < no message > It seems safe enough to shift the symlink on the fly ! 
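( Summarizing where the PID bookkeeping ends up, per the May 14 notes above : record the script's own PID with $$ and treat the file as stale when that process is gone. A minimal sketch, not the actual roundup code ; the pidfile path follows the cedar_phyfar.pid pattern used in these tests.
  PIDFILE=/minos/data/minfarm/roundup/cedar_phyfar.pid
  if [ -r ${PIDFILE} ] ; then
      OLDPID=`head -1 ${PIDFILE}`
      if kill -0 ${OLDPID} 2>/dev/null ; then
          echo "OOPS - roundup already running as ${OLDPID}" ; exit 1
      fi
      echo "stale pidfile ${PIDFILE} ( ${OLDPID} ), removing"
  fi
  echo ${$} > ${PIDFILE}   # ${$} is the script PID when MAIN is not backgrounded
  # ... real work here ...
  rm -f ${PIDFILE}
)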
ln -sf roundup.20080509 roundup date Fri May 9 09:52:27 CDT 2008 ./roundup -w -r cedar_phy far ./farmgsum | tee FGS/sum.`date +%Y%m%d%H` 12:07 both CPB and CP finished a few seconds/minutes to late for cron Started next cycle manually ./roundup -r cedar_phy far ./roundup -r cedar_phy_bhcurv mcnear ########## # CONDOR # ########## glide4hr.run is stuck, 111815.0 boehm 5/8 14:54 0+17:16:02 R 0 9.8 probe Submitted glide1hr 2 3 4 at 08:15 112066.0 boehm 5/9 08:17 0+00:00:57 R 0 0.0 probe 112067.0 boehm 5/9 08:17 0+00:00:00 I 0 0.0 probe 112068.0 boehm 5/9 08:17 0+00:00:00 I 0 0.0 probe Submitted 1 2 3 4 under kreymer account 112071.0 kreymer 5/9 08:20 0+00:00:00 I 0 0.0 probe 112072.0 kreymer 5/9 08:22 0+00:00:21 R 0 0.0 probe 112073.0 kreymer 5/9 08:22 0+00:00:00 R 0 0.0 probe 112074.0 kreymer 5/9 08:22 0+00:00:00 R 0 0.0 probe 112075.0 kreymer 5/9 08:22 0+00:00:00 I 0 0.0 probe MINOS25 > dds logs/glide/probe*hr* -rw-r--r-- 1 kreymer g020 38 May 9 09:22 logs/glide/probe1hr.112072.0.err -rw-r--r-- 1 kreymer g020 696 May 9 09:22 logs/glide/probe1hr.112072.0.log -rw-r--r-- 1 kreymer g020 2227 May 9 09:22 logs/glide/probe1hr.112072.0.out -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe2hr.112073.0.err -rw-r--r-- 1 kreymer g020 247 May 9 08:22 logs/glide/probe2hr.112073.0.log -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe2hr.112073.0.out -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe3hr.112074.0.err -rw-r--r-- 1 kreymer g020 247 May 9 08:22 logs/glide/probe3hr.112074.0.log -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe3hr.112074.0.out -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe4hr.112075.0.err -rw-r--r-- 1 kreymer g020 245 May 9 08:25 logs/glide/probe4hr.112075.0.log -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe4hr.112075.0.out Created 50 and 70 minute jobs, 112133.0 kreymer 5/9 14:31 0+00:00:00 I 0 0.0 probe 112134.0 kreymer 5/9 14:31 0+00:00:00 I 0 0.0 probe Removed the 3 and 4 hour tests MINOS25 > condor_rm 112074.0 112075.0 Job 112074.0 marked for removal Job 112075.0 marked for removal condor_rm -forcex 112074.0 112075.0 Per sfiligoi, will try leaving an intact proxy for the next tests also killing the 10 minute probes, till this is settled MINOS25 > crontab /tmp/ct25 Fri May 9 14:52:34 CDT 2008 MINOS25 > condor_submit glidelease.run 1 job(s) submitted to cluster 112139. Per sfiligoi, condor_config_val SEC_DEFAULT_SESSION_DURATION 3600 This should be, for a 4 day maximum SEC_DEFAULT_SESSION_DURATION = 345600 Date: Fri, 09 May 2008 15:40:51 -0500 (CDT) Subject: HelpDesk ticket 115385 ___________________________________________ Short Description: Please update Minos Cluster condor.config files Problem Description: run2-sys : Most of our analysis jobs have been failing to return their logs, and are getting hung up on exit, since Monday's upgrade of the Condor glidein system to use glexec. The experts believe that the root cause of this is a parameter in /opt/condor-7.0.1/etc/condor_config The line SEC_DEFAULT_SESSION_DURATION = 3600 needs to be updated to SEC_DEFAULT_SESSION_DURATION = 345600 I have provided a modified condor_config file under /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/ Please update and cfengine this file to all of minos01 through minos25 before the weekend, so that we can resume Analysis batch processing. Thanks ! ___________________________________________ Date: Fri, 09 May 2008 15:51:07 -0500 (CDT) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. 
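( For reference, the requested edit amounts to one line per node ; a sketch of doing it by hand with sed, using the /opt/condor-7.0.1 path named in the ticket - the actual fix below was pushed out with cfengine :
  sed -i.bak \
   -e 's/^SEC_DEFAULT_SESSION_DURATION *=.*/SEC_DEFAULT_SESSION_DURATION = 345600/' \
   /opt/condor-7.0.1/etc/condor_config             # 345600 seconds = 4 days
  condor_reconfig -all
  condor_config_val SEC_DEFAULT_SESSION_DURATION   # expect 345600
)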
___________________________________________ Date: Fri, 09 May 2008 16:13:15 -0500 (CDT) I have made the change in cfengine. The file has been updated. ___________________________________________ ___________________________________________ ___________________________________________ MIN > for NODE in ${NODES} ; do printf "${NODE} "; ssh -ax ${NODE} grep SEC_DEFAULT_SESSION_DURATION /opt/condor/etc/condor_config ; done minos01 SEC_DEFAULT_SESSION_DURATION = 345600 ... condor_reconfig -all MINOS25 > date Fri May 9 17:10:56 CDT 2008 MINOS25 > condor_submit glide2hr.run 1 job(s) submitted to cluster 112146. 112146.0 kreymer 5/9 17:12 0+00:00:00 I 0 0.0 probe MINOS25 > condor_q 111448 | grep boehm | wc -l 15 MINOS25 > condor_q 111448 MINOS25 > condor_q 111448 | grep boehm | wc -l -- Submitter: minos25.fnal.gov : <131.225.193.25:65336> : minos25.fnal.gov ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 111448.0 boehm 5/7 10:01 2+06:41:37 R 0 732.4 condor_job.sh 111448.1 boehm 5/7 10:01 2+06:41:37 R 0 732.4 condor_job.sh 111448.2 boehm 5/7 10:01 2+06:41:34 R 0 732.4 condor_job.sh 111448.3 boehm 5/7 10:01 2+06:41:34 R 0 732.4 condor_job.sh 111448.4 boehm 5/7 10:01 2+06:41:32 R 0 488.3 condor_job.sh 111448.5 boehm 5/7 10:01 2+06:41:34 R 0 732.4 condor_job.sh 111448.6 boehm 5/7 10:01 2+06:38:22 R 0 732.4 condor_job.sh 111448.7 boehm 5/7 10:01 2+06:38:20 R 0 732.4 condor_job.sh 111448.8 boehm 5/7 10:01 2+06:38:20 R 0 732.4 condor_job.sh 111448.9 boehm 5/7 10:01 2+06:38:21 R 0 732.4 condor_job.sh 111448.10 boehm 5/7 10:01 2+06:38:04 R 0 488.3 condor_job.sh 111448.14 boehm 5/7 10:01 2+06:38:02 R 0 488.3 condor_job.sh 111448.84 boehm 5/7 10:01 2+01:33:43 R 0 732.4 condor_job.sh 111448.88 boehm 5/7 10:01 2+01:29:02 R 0 732.4 condor_job.sh 111448.98 boehm 5/7 10:01 2+01:26:17 R 0 732.4 condor_job.sh condor_rm 111448 condor_rm 111448 -forcex MINOS25 > condor_submit probe.run 112148.0 kreymer 5/9 17:17 0+00:00:02 R 0 0.0 kcron less logs/probe/probe.112148.0.out this short job looks OK Killing off a lot more of boehm stuck jobs : 111420 - 1 111422 - 100 111438 - 100 condor_rm 111422 condor_rm 111422 -forcex Fri May 9 17:23:06 CDT 2008 condor_config_val SEC_DEFAULT_SESSION_DURATION Try killing off a few of the stuck gfactories there are a dozen or so idle at present. condor_rm 111423 -forcex 17:45 condor_rm boehm -forcex condor_rm gfactory -forcex This was a bad idea, sfiligoi informs me this creates a huge mess. Standing by. Date: Fri, 09 May 2008 18:22:37 -0500 I have cleaned up the old glideins in the queue using the fork queue. The new glideins should be starting soon. Igor MINOS25 > condor_submit glide.run 112155.0 kreymer 5/9 18:24 0+00:00:01 R 0 0.0 probe MINOS25 > condor_submit glide2hr.run 112156.0 kreymer 5/9 18:25 0+00:00:07 R 0 0.0 probe ============================================================================= 2008 05 08 ============================================================================= ######## # FARM # ######## Corral Enabled cedar_phy far Enabled cedar_phy_bhcurv mcnear SRV1> ./roundup -s charm -r cedar_phy_bhcurv mcnear this picked up the remaining charm files touched roundup to drop pool limit from NPOOLS-1 to NPOOLS-3 allowing a single server to be down. 
( 11 is down today ) SRV1> ./roundup -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 23736 That process is gone, this should be 23747 Edited it manually Ran again, to purge the files now declared to sam SRV1> ./roundup -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 15130 15141 pts/10 S 0:09 /bin/sh ./roundup -r cedar_phy far Try shifting PID setting to MAIN. AFSS/roundup.new -r cedar_phy_bhcurv mcnear SRV1> cat /minos/data/minfarm/roundup/cedar_phy_bhcurvmcnear.pid 25271 25282 pts/10 S 0:00 /bin/sh AFSS/roundup.new -r cedar_phy_bhcurv mcnear Corrected the file order to be the same for PURGE CONCAT WRITE AFSS/roundup.newer -w -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 21099 21116 pts/10 S 0:00 /bin/sh AFSS/roundup.newer -w -r cedar_phy far Try chaing to /bin/bash for better shell control ? Try setting MYOWNPID=${$} echo ${MYOWNPID} ... ########### # ROUNDUP # ########### Changed default bail limit to 1000, to stay clear of DCache limits SRV1> cp -a AFSS/roundup.20080708 . SRV1> ln -sf roundup.20080508 roundup # was roundup.20080506 ########## # CONDOR # ########## Date: Thu, 8 May 2008 16:37:14 +0000 (UTC) From: Arthur Kreymer To: sfiligoi@fnal.gov Cc: minos-admin@fnal.gov Subject: Re: Analysis glideins ramping up on GPFARM (fwd) Our first large scale users seems to have many hung jobs, stuck as they are trying to finish. It seems odd to me that I see only two active glideins when I log into fnpc340, but condor_q shows 8 boehm jobs running. For example, MINOS25 > condor_q -r | grep fnpc340 111422.34 boehm 5/7 09:25 1+01:51:58 vm2@31266@fnpc340.fnal.gov 111422.40 boehm 5/7 09:25 1+01:48:55 vm2@642@fnpc340.fnal.gov 111422.42 boehm 5/7 09:25 1+01:48:58 vm2@377@fnpc340.fnal.gov 111422.61 boehm 5/7 09:25 1+01:45:54 vm2@2592@fnpc340.fnal.gov 111422.68 boehm 5/7 09:25 1+01:45:50 vm2@3381@fnpc340.fnal.gov 111422.72 boehm 5/7 09:25 1+01:45:54 vm2@3697@fnpc340.fnal.gov 111422.92 boehm 5/7 09:25 1+01:42:51 vm2@6865@fnpc340.fnal.gov 111422.93 boehm 5/7 09:25 1+01:42:51 vm2@6309@fnpc340.fnal.gov MINOS25 > condor_q -l 111422.34 | grep UserLog UserLog = "/minos/data/users/boehm/RyanFiles/distcutoff/log.111422.34" MINOS25 > cat /minos/data/users/boehm/RyanFiles/distcutoff/log.111422.34 000 (111422.034.000) 05/07 09:25:34 Job submitted from host: <131.225.193.25:62172> ... 001 (111422.034.000) 05/07 09:32:47 Job executing on host: <131.225.166.119:61601> ... 006 (111422.034.000) 05/07 09:32:55 Image size of job updated: 132728 ... 006 (111422.034.000) 05/07 10:32:55 Image size of job updated: 447540 ... 006 (111422.034.000) 05/07 11:32:55 Image size of job updated: 582148 ... MINOS25 > ls -l /minos/data/users/boehm/RyanFiles/distcutoff/*.111422.34 -rw-r--r-- 1 boehm e875 0 May 7 09:25 /minos/data/users/boehm/RyanFiles/distcutoff/err.111422.34 -rw-r--r-- 1 boehm e875 397 May 7 11:32 /minos/data/users/boehm/RyanFiles/distcutoff/log.111422.34 -rw-r--r-- 1 boehm e875 0 May 7 09:25 /minos/data/users/boehm/RyanFiles/distcutoff/out.111422.34 fnpc340 $ ps axfwww > /minos/data/users/kreymer/fnpc340.psaxf Date: Thu, 08 May 2008 11:56:54 -0500 (CDT) Subject: HelpDesk ticket 115305 Short Description: recent Minos glidein jobs on GPFARM not terminating properly ? Problem Description: Date: Thu, 08 May 2008 11:52:55 -0500 From: Sfiligoi Igor To: Arthur Kreymer Cc: minos-admin@fnal.gov Subject: Re: Analysis glideins ramping up on GPFARM (fwd) From all you write below, it would seem it is running just fine. 
Also looking on fnpc340, I see 8 starters, so Condor-wise there do not seem to be any obvious problems. Unfortunately, I am not able to read the files in fnpc340:/local/stage1/condor/execute/dir_2051/glide_dx2086/tmp/starter-tmp- dir-SfzaAS so I cannot look at all the details. .... further information, as listed above in this log ... ___________________________________________ Date: Fri, 09 May 2008 13:02:07 -0500 (CDT) Hi Igor--you have root on fnpc211 temporarily. cfengine will wipe you out of the .k5login again pretty soon. ... ___________________________________________ Date: Fri, 09 May 2008 15:02:01 -0500 From: Sfiligoi Igor But talking with Dan B. from the Condor team, we think we may have found the problem: condor_config_val SEC_DEFAULT_SESSION_DURATION 3600 This seems to confuse the glideins a lot, at least when using glexec: after the glidein has been up for 2h, it starts misbehaving. Probably a Condor bug. For now, I would suggest you increase this value to the max expected lifetime of the glidein. This should be OK: SEC_DEFAULT_SESSION_DURATION = 30000 I would suggest you change this for all the Condor daemons in the pool. Cheers, Igor ___________________________________________ Date: Fri, 09 May 2008 15:06:35 -0500 (CDT) From: Steven Timm I would concur--SEC_DEFAULT_SESSION_DURATION=3600 was left over from the GP Grid cluster config and we have lengthened it there too now. ___________________________________________ Subject: HelpDesk ticket 115385 update condor_config, see 2008/04/09 ___________________________________________ This has been changed on the Minos pool, by update the /opt/condor/etc/condor_config file, then issuing condor_reconfig -all I have submitted another 2 hour test job, but it has not started running yet, after 20 minutes. I have also done condor_rm -forcex on over 100 of the stuck jobs running for user boehm. But the probe job is still not running. An equivalent local poll probe job ran right away. I will try removing some of the gfactory jobs, to make room for new ones to run. ___________________________________________ Date: Fri, 09 May 2008 17:46:28 -0500 From: Sfiligoi Igor The glideins are most probably stuck. You should remove them all from the queue. ___________________________________________ Date: Fri, 09 May 2008 22:55:10 +0000 (UTC) From: Arthur Kreymer I've done this. The old pilots are still running on GPFARM. Perhaps Steve will have to remove these on the GPFARM end. New gfactory processes are being created, but are all Idle. rubin also has 350 jobs running on GPFARM, using the minos quota till some of them finish. ___________________________________________ Date: Fri, 09 May 2008 18:00:30 -0500 From: Sfiligoi Igor For the future, do not use condor_rm -forcex , unless you have a very good reason. It creates a huge mess in the system. ___________________________________________ Date: Fri, 09 May 2008 18:22:37 -0500 From: Sfiligoi Igor I have cleaned up the old glideins in the queue using the fork queue. The new glideins should be starting soon. ___________________________________________ Date: Fri, 09 May 2008 23:28:23 +0000 (UTC) From: Arthur Kreymer Thanks ! The old jobs seem to have cleared off of GPFARM, at least on node fnpc344. I submitted a short glide job, which completed. I submitted a 2 hr glide job, will check in 2 hours. 
___________________________________________ Date: Tue, 20 May 2008 09:01:09 -0500 (CDT) Solution: Per discussion with Art Kreymer, he changed the SEC_DEFAULT_SESSION_CONFIGURATION on his glideins after consulting Igor and the condor team. After this was lengthened this problem hasn't recurred. ___________________________________________ ___________________________________________ Sample job, glide.run condor_history 111803 ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD 111803.0 boehm 5/8 13:56 0+00:00:35 C 5/8 14:00 /minos/scratch/ condor_history -l 111803 /minos/scratch/boehm/probe.111803.0.log 14:20 - while investigating, improve josh's factor from 10 to 1 condor_userprio -setfactor boehm@fnal.gov 1. Also cleaned up one of the erroneous entries : MINOS25 > condor_userprio -delete deb4@fnal.gov@fnal.gov The accountant record named deb4@fnal.gov@fnal.gov was deleted condor_submit glide4hr.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 111815. 111815.0 boehm 5/8 14:54 0+00:00:00 I 0 0.0 probe ########## # CONDOR # ########## condor_userprio -setfactor rearmstr@fnal.gov 100. ########## # CONDOR # ########## None of my glideafs jobs have run since 10:00 yesterday. ########## # CONDOR # ########## Added random delay to the proxy regeneration by /local/scratch25/grid/kproxy in order to avoid piling up on the KCA's. [ "x$USER" = "x" ] || [ "x$PS1" = "x" ] && sleep $(( ${RANDOM} % 600 )) Tested in kreymer account, at 09:56. 30695 ? Ss 0:00 /usr/krb5/bin/kcron /local/scratch25/grid/kproxy 30701 ? S 0:00 \_ /bin/sh /local/scratch25/grid/kproxy 30702 ? S 0:00 \_ sleep 505 This worked, producing -rw------- 1 kreymer g020 5189 May 8 10:04 kreymer.proxy.2008050810 ######## # FARM # ######## SRV1> grep SRMCP cedar_phy_bhcurvmcnearnd.cedar.log | wc -l 826 SRV1> tail cedar_phyfar.log PURGE FARM F00038005_0011.all.cand.cedar_phy.0.root DCACHE WRITE BACKLOG at 3000 after 3001 files SRV1> dds cedar_phyfar.log -rw-rw-r-- 1 minfarm numi 2207507 May 7 18:57 cedar_phyfar.log OK, that's intrinsic, will never copy more then DCQLIM files per pass Wait for existing iteration to finish, then make roundup.20080507 the default ( with -b support ) ########## # CONDOR # ########## The new 250 limit seems to be effective, MINOS25 > condor_q gfactory -r | wc -l 220 ============================================================================= 2008 05 07 ============================================================================= ######## # FARM # ######## Date: Wed, 07 May 2008 17:47:30 -0500 (CDT) Subject: HelpDesk ticket 115269 ___________________________________________ Short Description: /fnal/ups is not mounted on fnpc213 Problem Description: /fnal/ups has become unmounted on fnpc213, causing user jobs to fail. ___________________________________________ ############ # RELEASES # ############ MINOS26 > cd /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-02-16-R1-28/Mad/data cvs update ( added back nearly 100 MB of root files. ) ########## # CONDOR # ########## Investigating cause of 300+ glideins Seems we have a different configuration file, probably happened when glexec was enabled. 
drwxrwxr-x 4 gfrontend gfrontend 4096 Nov 27 09:18 myvofrontend1/ drwxrwxr-x 4 gfrontend gfrontend 4096 Feb 6 09:50 myvofrontend2/ [gfrontend@minos25 ~]$ ls -l myvofrontend1/etc total 8 -rw-rw-r-- 1 gfrontend gfrontend 887 Dec 21 09:25 vofrontend.cfg -rw-rw-r-- 1 gfrontend gfrontend 889 Nov 27 09:19 vofrontend.cfg.20071127 [gfrontend@minos25 ~]$ ls -l myvofrontend2/etc total 4 -rw-rw-r-- 1 gfrontend gfrontend 1456 Apr 23 13:39 vofrontend.cfg [gfrontend@minos25 ~]$ diff myvofrontend1/etc/vofrontend.cfg myvofrontend2/etc/vofrontend.cfg 4c4 < frontend_name='fe1' --- > frontend_name='my2' 18c18,25 < match_string='1' --- > # GLIDEIN_Has_MINOSAFS can handle all jobs > # The other only the ones that do not have Require_MINOSAFS set to true > # Also only match glexec glideins with jobs that have a proxy > match_string='(glidein["attrs"]["GLIDEIN_Has_MINOSAFS"] or (not (job.has_key("Require_MINOSAFS") and job["Require_MINOSAFS"]))) and ((not glidein["attrs"].has_key("GLIDEIN_UseGLEXEC")) or (not glidein["attrs"]["GLIDEIN_UseGLEXEC"]) or job.has_key("x509userproxysubject"))' > > > # old > #match_string='glidein["attrs"]["GLIDEIN_Has_MINOSAFS"] or (not (job.has_key("Require_MINOSAFS") and job["Require_MINOSAFS"]))' 21c28 < max_idle_glideins_per_entry=20 --- > max_idle_glideins_per_entry=10 24c31 < max_running_jobs=100 --- > max_running_jobs=1000 32c39 < log_dir='/home/gfrontend/myvofrontend1/log' --- > log_dir='/home/gfrontend/myvofrontend2/log' Changed limit from 1000 to 250 cd myvofrontend2/etc cp -a vofrontend.cfg vofrontend.cfg.20080423 nedit vofrontend.cfg [gfrontend@minos25 etc]$ diff vofrontend.cfg.20080423 vofrontend.cfg 31c31 < max_running_jobs=1000 --- > max_running_jobs=250 kill -9 6931 ./start_frontend.sh [gfrontend@minos25 ~]$ grep Total myvofrontend2/log/frontend_info.20080507.log | tail [2008-05-07T17:31:53-05:00 6931] Total running 339 limit 1000 [2008-05-07T17:33:30-05:00 6931] Total running 349 limit 1000 [2008-05-07T17:35:06-05:00 6931] Total running 349 limit 1000 [2008-05-07T17:36:41-05:00 6931] Total running 349 limit 1000 [2008-05-07T17:38:17-05:00 6931] Total running 349 limit 1000 [2008-05-07T17:38:40-05:00 18528] Total running 349 limit 250 [2008-05-07T17:40:16-05:00 18528] Total running 349 limit 250 [2008-05-07T17:41:52-05:00 18528] Total running 349 limit 250 [2008-05-07T17:43:27-05:00 18528] Total running 345 limit 250 [2008-05-07T17:45:05-05:00 18528] Total running 347 limit 250 ########### # ROUNDUP # ########### Capturing the current roundup, renamed as 20080507, adds -b BAIL count. SRV1> cp AFSS/roundup.20080507 . SRV1> ln -sf roundup.20080506 roundup # was roundup.20080501 ########## # CONDOR # ########## spotted users with excessively good priorities condor_userprio -setfactor idanko@fnal.gov 100. condor_userprio -setfactor djauty@fnal.gov 100. Wed May 7 14:03:26 CDT 2008 condor_userprio --all -allusers rahaman@fnal.gov 0.50 0.50 1.00 0 241.19 5/01/2008 14:30 5/03/2008 04:10 rhatcher@fnal.gov 0.50 0.50 1.00 0 3712.76 3/19/2008 11:30 5/01/2008 16:40 nickd@fnal.gov 0.50 0.50 1.00 0 392.49 4/28/2008 11:29 5/04/2008 18:30 kreymer@fnal.gov 0.54 0.54 1.00 0 1409.22 10/24/2007 09:00 5/07/2008 09:50 condor_userprio -setfactor rahaman@fnal.gov 100. condor_userprio -setfactor nickd@fnal.gov 100. It seems that newly running users come in with a factor of 1. 
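( A hedged sketch of sweeping such users up in one pass instead of one -setfactor at a time ; the awk field position is read off the listing above, where field 4 is the priority factor, and the output format should be re-checked before trusting this. Note it would also catch anyone deliberately left at factor 1.
  condor_userprio -all -allusers | grep '@fnal.gov' | \
  awk '$4 == "1.00" { print $1 }' | \
  while read PUSER ; do
      condor_userprio -setfactor ${PUSER} 100.
  done
)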
######## # FARM # ######## CP far is running, finished concatenation and started writes at 01:24 Wed May 7 09:46:14 CDT 2008 SRV1> du -sm /minos/data/minfarm/WRITE/ 549973 /minos/data/minfarm/WRITE/ Wed May 7 01:24:39 CDT 2008 WRITING to DCache 8798 SRV1> date Wed May 7 15:04:06 CDT 2008 SRV1> grep SRMCP cedar_phyfar.log | wc -l 2342 Started rate tests with CPB mcnear , in parallel AFSS/roundup.20080507 -n -b 20 -s 'nd.cedar' -r cedar_phy_bhcurv mcnear AFSS/roundup.20080507 -b 1000 -s 'nd.cedar' -r cedar_phy_bhcurv mcnear ########## # CONDOR # ########## /local/scratch25/grid/VDT - hacked setups.sh to change path, this seems to have worked fngp-osg > du -sh /usr/local/vdt-1.8.1 3.0G /usr/local/vdt-1.8.1 SRV1> du -sh /usr/local/vdt-1.8.1 1.2G /usr/local/vdt-1.8.1 SRV1> tar cf /tmp/vdt181.tar -C /usr/local/vdt-1.8.1 . SRV1> du -sh /tmp/vdt181.tar 1.2G /tmp/vdt181.tar ########## # CONDOR # ########## OSG variables On fngp-osg, using /usr/local/vdt-1.8.1/monitoring/osg-attributes.conf cp ########## # CONDOR # ########## Added users to fermilab/minos/Analysis role ( not yet used ) djauty boehm hartnell loiacono mishi nickd pawloski rustem ########## # CONDOR # ########## For the record, OSG environment from a non-glexec job : OSG_GRID ^/usr/local/grid^ OSG_DATA ^/grid/data^ OSG_APP ^/grid/app^ OSG_WN_TMP ^/local/stage1^ See https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/StorageParameterOsgWnTmp ########## # CONDOR # ########## Date: Wed, 07 May 2008 08:23:59 -0500 (CDT) Subject: HelpDesk ticket 115229 ___________________________________________ Short Description: Request KCA cert addition for devenish and auty Problem Description: Please add KCA certs, for glexec support, for User Lastname nickd devenish djauty auty I think these would look like /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Nicholas Devenish/UID=nickd /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=David J. Auty/USERID=djauty We will eventually want such certs for all Minos users, as well as certs compatible with the new format coming next week. ___________________________________________ Date: Fri, 09 May 2008 09:58:10 -0500 Resolved Hi Art, Yesterday I re-enabled email notification on the fermilab VOMRS server, so I am now in a position to give you the correct method of performing the action you request, yourself. The users can and should add their own Robot certificates to their membership in the fermilab VO per the instructions I sent you last Monday, with the following addition (see 2a, 2b, 2c): 1) Load your KCA certificate (current, not expired!) and visit this URL: https://vomrs.fnal.gov:8443/vomrs/vo-fermilab/vomrs 2) Click on the [+] next to the "Members" 2a) Click on "Change Email Address" 2b) Enter your last name and "Search" 2c) Enter your correct email address and "Submit" 3) Click on the [+] next to the "Certificates" 4) Click on "Add certificate" 5) Enter your last name and "Search" 6) Enter your 'new' DN in the New DN field, and select the Fermi KCA from the pull-down list in the "New CA" list. 7) Enter some text in the "Reason" field and click "Submit" Next, the members representative (You!) will receive an email from VOMRS requesting you to approve the addition of the DN. The email will contain a handy link for you to click on to get to the right page. I should note, that when the new DN format is implemented the users will NOT need to add this DN, we'll do this for them automatically. 
Cheers, Dan ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 05 06 ============================================================================= ########## # CONDOR # ########## spotted users with excessively good priorities condor_userprio -setfactor idanko@fnal.gov 100. condor_userprio -setfactor djauty@fnal.gov 100. Wed May 7 14:03:26 CDT 2008 ########## # CONDOR # ########## Date: Tue, 06 May 2008 18:30:04 -0500 (CDT) Subject: HelpDesk ticket 115222 ___________________________________________ Short Description: OSG_* environment missing from Minos glidein/glexec jobs Problem Description: Even since we activated glexec execution of Minos glidein jobs on GPFARM yesterday aroung 13:00, the OSG_* environment variables have been undefined on the user jobs. This could be a problem for people needing to do source ${OSG_GRID}/setup.sh ___________________________________________ Date: Tue, 06 May 2008 18:38:56 -0500 (CDT) Note To Requester: chadwick@fnal.gov sent this Notes To Requester: As a workaround, here is how you can manually invoke the script to define the missing environment variables: if [ -z "$VDT_LOCATION" ] then if [ -e "/usr/local/vdt" ] then export VDT_LOCATION='/usr/local/vdt' elif [ -e "/usr/local/grid" ] then export VDT_LOCATION='/usr/local/grid' elif [ -n "$OSG_LOCATION" ] then export VDT_LOCATION="$OSG_LOCATION" fi fi if [ -n "$VDT_LOCATION" ] then source $VDT_LOCATION/setup.sh if [ -e "$VDT_LOCATION/monitoring/osg-attributes.conf" ] then source $VDT_LOCATION/monitoring/osg-attributes.conf fi fi -Keith. ___________________________________________ Date: Tue, 06 May 2008 19:48:14 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: The stripping of the OSG* environment variables is a feature of glexec and glideWMS. It purposely strips out any environment that was set up before that. I suggest you contact the GlideWMS developer to see how he deals with that. He does have a mechanism. Steve Timm ___________________________________________ Date: Wed, 07 May 2008 08:50:33 -0500 From: Sfiligoi Igor Uhm... this was not expected. I thought Condor preserved the environment even when using gLExec; but now that I think about it, I never tried it out! I'll see what I can do... but it could be a few days. ___________________________________________ Date: Mon, 12 May 2008 13:50:22 -0500 (CDT) Note To Requester: yocum@fnal.gov sent this Notes To Requester: Art, Are you satisfied with the answers Steve and Keith gave? Can we close this ticket? Thanks, Dan ___________________________________________ Date: Mon, 12 May 2008 14:09:30 -0500 (CDT) Note To Requester: A clarification to Keith's earlier E-mail. $VDT_LOCATION/monitoring/osg-attributes.conf is not visible from the worker nodes. /usr/local/grid/setup.sh willl be but that, in itself, does not define the various OSG* variables that you are looking for. It is probably best to modify your glidein startup script to (1) detect the OSG environment variables before it launches the glidein and (2) somehow pass them on to the user job. It may be necessary to send these arguments along with the user job independent of the glidewms. Glexec is configurable. By default it passes only a very few environment variables to the glidein. 
This configuration could be changed but we would need very good reason from the glidewms to prove it couldn't be done any other way before we do that. Steve Timm ___________________________________________ Our users can easily work around the problem with OSG variables on glidein. So we will stand by for a longer term solution. See the following comment from Igor : Date: Mon, 12 May 2008 16:10:34 -0500 From: Sfiligoi Igor Hi Art. This is a Condor bug. I am in contact with Madison, and hope to have a beta to test by the end of the week. Igor ___________________________________________ Date: Thu, 22 May 2008 10:31:18 -0500 (CDT) Note To Requester: Igor Sfiligoi is now testing a pre-release of condor 7.0.2 which is supposed to beat the problem of stripping out all the environment variables. That, accompanied with an upgrade of the cluster to 7.0.2 when it comes out and a change in the glexec configuration file, should eventually resolve this problem. I am marking this ticket pending until we hear from condor that condor 7.0.2 is out. Steve Timm ___________________________________________ ########## # CONDOR # ########## per Mayly request, at around 18:25 MINOS25 > condor_userprio -setfactor pawloski@fnal.gov 10. The priority factor of pawloski@fnal.gov was set to 10.000000 MINOS25 > condor_userprio -setfactor boehm@fnal.gov 10. The priority factor of boehm@fnal.gov was set to 10.000000 ######### # PROBE # ######### probe - changed from ps ax to ps -H --forest resticts display to current process tree ######## # FARM # ######## Overnight, while waiting for dccp -x509 assistance, will proceed to concatenate cedar_phy far data. For write rate tests, can use a STOP file to halt this. ./roundup -r cedar_phy far ######## # FARM # ######## Enabled corral again, now that we have a STOP ability, but just for catchup. Catching up on clearing cand space, now that WRITE is clean. SRV1> du -sm /minos/data/reco_near/cedar/cand_data 601568 /minos/data/reco_near/cedar/cand_data SRV1> du -sm /minos/data/reco_far/cedar/cand_data 292934 /minos/data/reco_far/cedar/cand_data rm -r /minos/data/reco_near/cedar/cand_data rm -r /minos/data/reco_far/cedar/cand_data Space is up to 2.9 TB. ######## # FARM # ######## Prepare for switch to /local/globus from /export chmod 740 /export/stage/minfarm/.grid/backup cp -vax /export/stage/minfarm/.grid \ /local/globus/minfarm/.grid ######## # FARM # ######## Trying dccp ( x509 ) We use a grid proxy instead of a voms proxy because we think voms is unsafe, given the caching of roles by the present FNDCA system. SRV1> setup dcap -q x509 SRV1> export X509_USER_PROXY=/export/stage/minfarm/.grid/x509up_u1334 SRV1> dccp F00037838_0004.spill.bcnd.cedar_phy.0.root dcap://fndca1.fnal.gov:24536/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy/.bcnd_data/2007-04/F00037838_0004.spill.bcnd.cedar_phy.0.root Error ( POLLIN POLLERR POLLHUP) (with data) on control line [6] Failed to create a control line Error ( POLLIN POLLERR POLLHUP) (with data) on control line [8] Failed to create a control line Failed open file in the dCache. 
Can't open destination file : Server rejected "hello" System error: Input/output error SRV1> dccp F00037838_0004.spill.bcnd.cedar_phy.0.root dcap://fndca1.fnal.gov:24525/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy/.bcnd_data/2007-04/F00037838_0004.spill.bcnd.cedar_phy.0.root Error ( POLLIN POLLERR POLLHUP) (with data) on control line [6] Failed to create a control line Error ( POLLIN POLLERR POLLHUP) (with data) on control line [8] Failed to create a control line Failed open file in the dCache. Can't open destination file : Server rejected "hello" System error: Input/output error SRV1> voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. Error: VOMS extension not found! subject : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990/CN=2146134877 issuer : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 identity : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 type : unknown strength : 512 bits path : /export/stage/minfarm/.grid/x509up_u1334 timeleft : 3377:44:33 SRV1> grid-proxy-info -all ERROR: Couldn't find a valid proxy. Use -debug for further information. Date: Tue, 06 May 2008 16:47:05 -0500 (CDT) Subject: HelpDesk ticket 115219 ___________________________________________ Short Description: Cannot write via dcap -q x509 using Howard Rubin proxy ... This ticket is assigned to HARRISON, MICHAEL of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Mon, 07 Jul 2008 14:35:49 -0500 (CDT) This ticket has been reassigned to SCHUMACHER, KEN of the CD-SF/DMS/DSC/SSA Group. ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ ######## # FARM # ######## ./farmgsum | tee FGS/sum.`date +%Y%m%d%H` ########### # ROUNDUP # ########### Capturing the current roundup, renamed as 20080506, making it current. SRV1> cp AFSS/roundup.20080506 . 
SRV1> ln -sf roundup.20080506 roundup # was roundup.20080501 ######## # FARM # ######## Howie stopped new mcnear runs before midnight 690132 Tue May 6 00:30:17 CDT 2008 644756 Tue May 6 01:30:21 CDT 2008 581928 Tue May 6 02:30:24 CDT 2008 575817 Tue May 6 03:30:28 CDT 2008 562353 Tue May 6 04:30:31 CDT 2008 559085 Tue May 6 05:30:35 CDT 2008 557717 Tue May 6 06:30:39 CDT 2008 AFSS/roundup.20080502 -n -w -s helium -r cedar_phy_bhcurv mcnear PURGING WRITE files 981 OOPS - mismatched Enstore and local size/crc SIZE 325788525/325788525 CRC 20242414/20242414 n13037030_0026_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root n13037022_0001_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root WRITING to DCache 981 SRV1> cat ../../ECRC/n13037030_0026_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root 20242414 8 SRV1> ecrc /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/sntp_data/703/n13037030_0026_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root CRC 20242414 SRV1> cat ../../ECRC/n13037022_0001_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root 2612195126 SRV1> ecrc /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/sntp_data/702/n13037022_0001_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root CRC 2356580040 These are likely due to the two scripts running simultaneously echo 20242414 > ../../ECRC/n13037030_0026_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root echo 2356580040 > ../../ECRC/n13037022_0001_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root This corrected the problem, let's purge : AFSS/roundup.20080502 -w -s charm -r cedar_phy_bhcurv mcnear SRV1> du -sm /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/cand_data/ 645847 /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/cand_data/ The helium files were already purged, per OK - processing /minos/data/minfarm/mcnearcat version 20080501 SELECT files containing helium Mon May 5 23:31:57 CDT 2008 SRV1> du -sm /minos/data/mcout_data/daikon_04/L010185N_helium/near/cedar_phy_bhcurv/cand_data/ 617245 /minos/data/mcout_data/daikon_04/L010185N_helium/near/cedar_phy_bhcurv/cand_data/ As minfarm@fnpcsrv1, cd /minos/data/mcout_data/daikon_04 rm -r L010185N_charm/near/cedar_phy_bhcurv/cand_data/ rm -r L010185N_helium/near/cedar_phy_bhcurv/cand_data/ Tue May 6 09:08:12 CDT 2008 Also cleared the cedar WRITE backlog, AFSS/roundup.20080502 -w -r cedar far AFSS/roundup.20080502 -w -r cedar near Updated roundup default to roundup.20080506 == former 20080502 ./roundup -n -s charm -r cedar_phy_bhcurv mcnear OK - stream L010185N_D04_charm.mrnt.cedar_phy_bhcurv OK - 765 Mbytes in 1 runs PEND - have 26/29 subruns for n13037021_*_L010185N_D04_charm.mrnt.cedar_phy_bhcurv.0.root 5 04/30 11:56 0 26 OK - stream L010185N_D04_charm.sntp.cedar_phy_bhcurv OK - 2167 Mbytes in 1 runs PEND - have 26/29 subruns for n13037021_*_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root 5 04/30 11:56 0 26 ./roundup -n -s helium -r cedar_phy_bhcurv mcnear OK - processing 0 files And cleared the CP WRITE files. ./roundup -w -s nd.cedar_phy -r cedar_phy far Now testing rates, using cedar_phy far cand's ./roundup -n -b 5 -s nd.cedar_phy -r cedar_phy far Oops, no bail option ! ./roundup -n -s F00037835 -s nd.cedar_phy -r cedar_phy far AFSS/roundup.new -n -b 10 -s nd.cedar -r cedar_phy far AFSS/roundup.new -b 10 -s nd.cedar -r cedar_phy far WRITE rate 3 Mbytes/second dccp with x509 is failing, cannot test right now. 
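( The ECRC mismatch above was patched by hand with echo ; a minimal sketch of the same check done per file, using one of the files from that session and the relative ../../ECRC path as used above :
  FILE=n13037022_0001_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root
  DFILE=/minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/sntp_data/702/${FILE}
  OCRC=`cat ../../ECRC/${FILE}`
  NCRC=`ecrc ${DFILE} | cut -f 2 -d ' '`   # ecrc prints "CRC <value>"
  if [ "${OCRC}" != "${NCRC}" ] ; then
      echo "OOPS - cached ECRC ${OCRC} != ${NCRC} for ${FILE} , refreshing"
      echo ${NCRC} > ../../ECRC/${FILE}
  fi
)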
########## # CONDOR # ########## Boehm jobs are still failing, in spite of good cert : From kproxy /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Joshua A. Boehm/USERID=boehm From VOMRS /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Joshua A. Boehm/UID=boehm Date: Tue, 06 May 2008 15:00:33 -0500 From: Dan Yocum Oops - I cut-n-pasted the wrong DNs to edit. I'll get the USERID ones in a little bit. ============================================================================= 2008 05 05 ########## # CONDOR # ########## VDT access from other accounts fails ! $ voms-proxy-init -noregen -voms fermilab:/fermilab/minos Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses VOMS Server for fermilab not known! MINOS25 > find /minos/scratch/kreymer/VDT -type d -exec ls -ld {} \; | grep -v 'drwxr-xr-x' drwx------ 2 kreymer g020 2048 Jan 10 12:12 /minos/scratch/kreymer/VDT/vdt/extract drwx------ 3 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup drwx------ 5 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt drwx------ 3 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/vdt drwx------ 3 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/vdt/setup drwx------ 2 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/vdt/setup/questions drwx------ 3 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/fetch-crl drwx------ 2 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/fetch-crl/sbin drwx------ 3 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/globus drwx------ 2 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/globus/etc for DIR in extract backup ; do chmod 755 /minos/scratch/kreymer/VDT/vdt/${DIR} ; done for DIR in \ vdt vdt/setup vdt/setup/questions fetch-crl fetch-crl/sbin globus globus/etc do chmod 755 /minos/scratch/kreymer/VDT/vdt/backup/vdt/${DIR} done chmod 755 /minos/scratch/kreymer/VDT/vdt/backup/vdt Still no luck. Solved by copying the file to ${HOME}/.glite/vomses and specifying userconfig ${HOME}/.glite/vomses per advice from timm. My glideins are working now. Cannot create proxy for loiacono with fermilab/minos, but can for fermilab. Needed to add her to the minos group via VOMRS. Also added, around 13:00 hartnell loiacono mishi pawloski 14:12 - loiacono can create the fermilab/minos proxy Date: Tue, 06 May 2008 15:00:33 -0500 From: Dan Yocum Oops - I cut-n-pasted the wrong DNs to edit. I'll get the USERID ones in a little bit. ########## # CONDOR # ########## timm : You have to add /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E Kreymer/USERID=kreymer to the VO. Any KCA certs as called by GLEXEC have the field spelled out as /USERID and have to be entered into the VO that way.
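( A sketch of the vomses workaround noted above. The source path under the scratch VDT install and the -userconf option name are assumptions - the log only says 'specifying userconfig' - so check voms-proxy-init -help before relying on this.
  mkdir -p ${HOME}/.glite
  cp /minos/scratch/kreymer/VDT/glite/etc/vomses ${HOME}/.glite/vomses   # assumed source path
  voms-proxy-init -noregen -voms fermilab:/fermilab/minos \
      -userconf ${HOME}/.glite/vomses
)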
######## # FARM # ######## SRV1> ln -s /minos/data/minfarm/maint/farmgsum FGS SRV1> ./farmgsum | tee FGS/sum.2008050510 Mon May 5 10:57:31 CDT 2008 grep -A 9999 'WRITING to DCache 1021' cedar_phy_bhcurvmcnear.log \ | grep SRMCP | wc -l 1496 ########### # ROUNDUP # ########### Added STOPPER function, bailing out for any of GDM=/minos/data/minfarm STOPFILES=" ${GDM}/roundup/STOP ${GDM}/roundup/STOP.${REL}${DETPAR} ${GDM}/roundup/STOP.${REL}${DETPAR}${SEL} " Testing this, and express writing : There are 2002 files to write, WRITING to DCache 981 WRITING to DCache 1021 AFSS/roundup.20080502 -n -s F00037835_0007 -r cedar_phy far OK adding F00037835_0007.all.cand.cedar_phy.0.root 1 OK adding F00037835_0007.spill.bcnd.cedar_phy.0.root 1 OK adding F00037835_0007.spill.cand.cedar_phy.0.root 1 AFSS/roundup.20080502 -s F00037835_0007 -r cedar_phy far SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00037835_0007.all.cand.cedar_phy.0.root /pnfs/minos/reco_far/cedar_phy/cand_data/2007-04 RequestFileStatus#-2144959352 failed with error:[ at Mon May 05 11:31:39 CDT 2008 state Failed : user has no permission to write into path /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy/cand_data/2007-04 ] Corrected CCOP logic added print of proxy AFSS/roundup.20080502 -w -S -s F00037835_0007.all -r cedar_phy far Looks OK, dds /pnfs/minos/reco_far/cedar_phy/cand_data/2007-04 -rw-r--r-- 1 rubin numi 131802778 May 5 12:06 F00037835_0007.all.cand.cedar_phy.0.root Write the other 2 AFSS/roundup.20080502 -w -S -s F00037835_0007 -r cedar_phy far Looks OK, let's try another subrun AFSS/roundup.20080502 -s F00037835_0008 -r cedar_phy far SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00037835_0008.all.cand.cedar_phy.0.root /pnfs/minos/reco_far/cedar_phy/cand_data/2007-04 CCOP ls: /minos/data/reco_far/cedar_phy/cand_data/2007-04/F00037835_0008.all.cand.cedar_phy.0.root: No such file or directory Oops, need to stop using CCDEST, as cand's are not copied. Instead, use ls -lL and the local file path. AFSS/roundup.20080502 -s F00037835_0008 -r cedar_phy far PURGE FARM F00037835_0008.all.cand.cedar_phy.0.root cat: /export/stage/minfarm/ROUNDUP/READ/F00037835_0008.all.cand.cedar_phy.0.root: No such file or directory that's ok, no harm no foul, this was due to problems on the first pass. AFSS/roundup.20080502 -n -s F00037835*cand -r cedar_phy far Nope, this did not find any files Try a safer selection AFSS/roundup.20080502 -n -s F00037835 -r cedar_phy far PEND - have 17/24 subruns for F00037835_*.all.sntp.cedar_phy.0.root 3 05/02 11:36 0 17 So this looks like good place to work up for global cand processing. AFSS/roundup.20080502 -s F00037835 -r cedar_phy far If this looks OK, put -s nd.cedar into corral. SRV1> AFSS/roundup.20080502 -n -s nd.cedar -r cedar far ########## # CONDOR # ########## my glideafs tests stopped getting the KCA proxy because I continued running the vanilla glideafs.run, failed to switch to glideafsp.run 256 running glideings this morning, most started up at around 03:06 through 04:00 Last glidein to run was at 09:10. Probably due to reported fngp-osg problems. > fgannounce ######## # FARM # ######## Current cands for reco, should remove and avoid. Will take care of this first thing Monday, taking care not to remove files pending in writes to dcache. This should clear 2 TB of space pretty quickly. 
MINOS26 > ls -d /minos/data/mcout_data/*/*/*/*/cand_data /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/cand_data /minos/data/mcout_data/daikon_04/L010185N_helium/near/cedar_phy_bhcurv/cand_data /minos/data/mcout_data/daikon_04/L250200N/near/cedar_phy_bhcurv/cand_data MINOS26 > du -sm /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/cand_data 645847 /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/cand_data MINOS26 > du -sm /minos/data/mcout_data/daikon_04/L010185N_helium/near/cedar_phy_bhcurv/cand_data 304494 /minos/data/mcout_data/daikon_04/L010185N_helium/near/cedar_phy_bhcurv/cand_data MINOS26 > du -sm /minos/data/mcout_data/daikon_04/L250200N/near/cedar_phy_bhcurv/cand_data 2208 /minos/data/mcout_data/daikon_04/L250200N/near/cedar_phy_bhcurv/cand_data MINOS26 > du -sm /minos/data/reco_near/cedar/cand_data 601568 /minos/data/reco_near/cedar/cand_data MINOS26 > du -sm /minos/data/reco_far/cedar/cand_data 292934 /minos/data/reco_far/cedar/cand_data -rw-rw-r-- 1 42411 e875 112406206 May 4 03:58 N00014130_0013.cosmic.cand.cedar.0.root -rw-rw-r-- 1 42411 e875 504357789 May 4 03:58 N00014130_0013.spill.cand.cedar.0.root -rw-rw-r-- 1 42411 e875 138087890 May 3 23:36 F00040838_0004.all.cand.cedar.0.root -rw-rw-r-- 1 42411 e875 33487392 May 3 23:36 F00040838_0004.spill.cand.cedar.0.root ============================================================================= 2008 05 02 ######## # FARM # ######## Date: Fri, 02 May 2008 12:14:21 -0500 From: Howard Rubin FYI after a false start (mixed-up directory structure) there is now an rsync backup in place for the following (recursed) directories: /grid/app/minos/scripts /minos/data/minfarm/lists loonexe /minos/data/minfarm/farmtest/lists loonexe The current size on AFS is < 90M Date: Fri, 02 May 2008 12:23:22 -0500 Just for documentation purposes, the script doing the rsync backup is run by cron at 02:00 and 14:00. The (very simple) script is /grid/app/minos/scripts/farm_backup. Right now there is a -v option which I'll remove after I see it run a couple of times via cron. Matt, your script directory is excluded because rsync won't let me copy it. Art, do you know how to get around this? Actually, it's probably because I keep the existing permissions, and it doesn't like me trying to write files to AFS with someone else's ownership. I'll check the man pages for options to override this. ########### # ROUNDUP # ########### roundup.20080502 Reviewing all local files, so we can run on other hosts First manual scan, then iterate for export HOME ROUNTMP=/export/stage/minfarm/ROUNDUP SOCFILE=/export/stage/minfarm/.grid/samdbs_prd PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin SRM_CONF=/export/stage/minfarm/.srmconfig/config.xml mkdir -p /tmp/minfarm/ROUNDUP ROOTREL=`grep " ${REL} " ROOTRELS | cut -f 1 -d ' '` ROOTSYS="/farm/minsoft2/Minossoft/ROOT/rootv5.12.00e" . /home/minfarm/scripts/setup_minossoft_R1_18_4.sh R1.18.4 . /export/osg/grid/setup.sh This should be /usr/local/vdt/setup.sh /export/osg/grid -> /usr/local/vdt ${ROUNTMP}/${CAT}ECRC/${FILE} ${ROUNTMP}/SUPPRESSED DFILES=/tmp/minfarm/ROUNDUP/DFILES.${REL}.${DETPAR}.${$} DFILESL=/tmp/minfarm/ROUNDUP/DFILESL.${REL}.${DETPAR}.${$} SAMDUPS=`${HOME}/scripts/samdup -s "${SEL}" ${INDIR}` SAMSUBS=`${HOME}/scripts/samsub ${INDIR} | grep "${SEL}"` ROUNTMP/DFARM review the usage of this . Why is is here ? It needs purging, 78K files ! 
SAM ${ECHO} scripts/saddreco SLOG=${HOME}/ROUNTMP/LOG/saddreco/${MCREL}/${REL}/${DET}_${CONF}.log entry PENDLOG=${ROUNTMP}/${CAT}LOG/${REL}${DETPAR}.pend mkdir -p ${ROUNTMP}/${CAT}LOG/${YEMON} mkdir -p ${ROUNTMP}/${CAT}HADDLOG/${YEMON} PIFL=${ROUNTMP}/${REL}${DETPAR}.pid ${ROUNTMP}/${CAT}LOG/${YEMON}/${REL}${DETPAR}.log 2>&1 & ... quoted ${SEL} in samdups call ############ # PREDATOR # ############ Last NearDCS file was N080415_000002.mdcs.root Wed Apr 16 10:13:59 UTC 2008 ######## # FARM # ######## GRRRRRRRRRR roundup seems to have been running twice, what happened to the PID check ? SRV1> pwd /export/stage/minfarm/ROUNDUP/LOG/2008-05 SRV1> less cedar_phy_bhcurvmcnear.log OK - processing /minos/data/minfarm/mcnearcat version 20080501 SELECT files containing charm Thu May 1 17:01:35 CDT 2008 ... OK - processing /minos/data/minfarm/mcnearcat version 20080501 SELECT files containing charm Fri May 2 01:34:31 CDT 2008 OK adding n13037026_0002_L010185N_D04_charm.cand.cedar_phy_bhcurv.0.root 1 -rw-rw-r-- 1 minospro numi 767436554 Apr 30 12:05 /minos/data/minfarm/WRITE/n13037026_0002_L010185N_D04_charm.cand.cedar_phy_bhcurv.0.root ls: n13037025_0027_L010185N_D04_charm.cand.cedar_phy_bhcurv.0.root: No such file or directory ... OK - stream L010185N_D04_charm.cand.cedar_phy_bhcurv OK - 112465 Mbytes in 6 runs /home/minfarm/scripts/roundup: line 665: ((: SSIF = : syntax error: operand expected (error token is " ") WRITING to DCache 981 WRITING to DCache 981 OK - creating /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N_charm/cand_data/700 OK - creating /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N_charm/cand_data/700 srm client error: credential remaining lifetime is less then a minute ... WRITE rate 0 Mbytes/second Fri May 2 08:11:54 CDT 2008 SADD less +F /home/minfarm/ROUNTMP/LOG/saddreco/daikon_04/cedar_phy_bhcurv/near_L010185N_charm.log Fri May 2 08:11:54 CDT 2008 SRV1> voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. 
Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Masaki Ishitsuka/USERID=mishi/CN=proxy issuer : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Masaki Ishitsuka/USERID=mishi identity : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Masaki Ishitsuka/USERID=mishi type : proxy strength : 512 bits path : /tmp/x509up_u10871 timeleft : 0:00:00 === VO fermilab extension information === VO : fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Masaki Ishitsuka/USERID=mishi issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov attribute : /fermilab/Role=NULL/Capability=NULL timeleft : 0:00:00 SRV1> voms-proxy-destroy ######### # PROXY # ######### Testing alternate proxy locations, and utilities kx509 kxlist -p mv /tmp/x509up_u10871 /tmp/x509upalt export X509_USER_PROXY=/tmp/x509upalt voms-proxy-info timeleft : 167:26:04 ============================================================================= 2008 05 01 ######## # FARM # ######## Summarizing /minos/data/minfarm/*cat 206 9993 nearcat 1199 7764 farcat 9345 2249798 mcnearcat 0 1 mcfarcat 0 1 mcfmockcat 542 4 WRITE 11292 2267561 TOTAL files, GBytes nearcat 111 3171 cosmic.sntp.cedar.0.root 95 7304 spill.sntp.cedar.0.root farcat 108 4295 all.sntp.cedar.0.root 23 568 all.sntp.cedar_phy_bhcurv.0.root 546 905 spill.bmnt.cedar_phy_bhcurv.0.root 134 993 spill.bntp.cedar.0.root 23 102 spill.bntp.cedar_phy_bhcurv.0.root 208 509 spill.mrnt.cedar_phy_bhcurv.0.root 134 690 spill.sntp.cedar.0.root 23 69 spill.sntp.cedar_phy_bhcurv.0.root mcnearcat 2787 1888751 cand.cedar_phy_bhcurv.0.root 280 161232 cand.cedar_phy_bhcurv.1.root 2787 65568 mrnt.cedar_phy_bhcurv.0.root 314 6080 mrnt.cedar_phy_bhcurv.1.root 37 1480 mrnt.cedar_phy_bhcurv.root 2787 210628 sntp.cedar_phy_bhcurv.0.root 314 21024 sntp.cedar_phy_bhcurv.1.root 37 4296 sntp.cedar_phy_bhcurv.root mcfarcat mcfmockcat WRITE ./roundup -n -s ".cand." -r cedar_phy_bhcurv mcnear ... ZAPPING BAD n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047133_0022_L010185N_D04.0 136 2008-03-23 01:23:45 caf1640 OK - processing 3473 files OK - stream L010185N_D04.cand.cedar_phy_bhcurv OK - 960171 Mbytes in 88 runs OK adding n13037233_0001_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 OK adding n13037233_0004_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 Updated roundup to check pass number AFSS/roundup.20080501 -n -s n13047133_0022_L010185N_D04 -r cedar_phy_bhcurv mcnear OK - 524 Mbytes in 1 runs BADRUNS n13047133_0020_L010185N_D04.cand.cedar_phy_bhcurv.1.root +BADRUNS+ n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root OK adding n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 SRV1> ls /minos/data/minfarm/mcnearcat | grep charm | wc -l 2613 SRV1> ls /minos/data/minfarm/mcnearcat | grep helium | wc -l 2757 SRV1> ls /minos/data/minfarm/mcnearcat | grep -v charm | grep -v helium | wc -l 5394 Started manual concatenation of charm, after ugrades to samdup and roundup ./roundup -s charm -r cedar_phy_bhcurv mcnear ########### # ROUNDUP # ########### roundup.20080501 Added pass number to BAD file check, from field 4 or 5 of MC or Raw data Development : ./roundup -n -s ".cand." -r cedar_phy_bhcurv mcnear ... 
ZAPPING BAD n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047133_0022_L010185N_D04.0 136 2008-03-23 01:23:45 caf1640 OK - processing 3473 files OK - stream L010185N_D04.cand.cedar_phy_bhcurv OK - 960171 Mbytes in 88 runs OK adding n13037233_0001_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 OK adding n13037233_0004_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 AFSS/roundup.20080501 -n -s n13047133_0022_L010185N_D04 -r cedar_phy_bhcurv mcnear OK - 524 Mbytes in 1 runs BADRUNS n13047133_0020_L010185N_D04.cand.cedar_phy_bhcurv.1.root +BADRUNS+ n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root OK adding n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 Updated roundup to check pass number, and to clean up bad_runs select SRV1> cp -a AFSS/roundup.20080501 . SRV1> ln -sf roundup.20080501 roundup # was roundup.20080422 SRV1> AFSS/roundup.20080501 -n -s n13047133_0022_L010185N_D04 -r cedar_phy_bhcurv mcnear OK adding n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 PEND - have 1/30 subruns for n13047133_*_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root 1 04/29 19:59 0 1 ########## # SAMDUP # ########## samdup.20080501 Added -s "SEL" file selection, to speed up tests MINOS26 > time ./samdup.20080501 /minos/data/minfarm/nearcat real 0m14.230s user 0m1.392s sys 0m0.264s real 0m4.613s real 0m4.568s MINOS26 > time ./samdup.20080501 /minos/data/minfarm/nearcat -s N00012941_0001 real 0m1.235s MINOS26 > time ./samdup.20080501 /minos/data/minfarm/farcat real 1m1.725s real 0m21.079s real 0m21.105s MINOS26 > time ./samdup.20080501 /minos/data/minfarm/farcat -s F00031874_0000 real 0m1.247s roundup run before the fix, AFSS/roundup.20080501 -n -s charm -r cedar_phy_bhcurv mcnear Thu May 1 16:11:48 CDT 2008 Thu May 1 16:41:57 CDT 2008 After the fix, a quick test, AFSS/roundup.20080501 -n -s n13037017 -r cedar_phy_bhcurv mcnear Thu May 1 16:56:01 CDT 2008 Thu May 1 16:56:58 CDT 2008 SRV1> ln -sf roundup.20080501 roundup # was roundup.20080422 ######## # FARM # ######## Date: Wed, 30 Apr 2008 23:47:45 -0500 From: Howard Rubin Can you suggest a backed-up AFS volume that I can rsynch my bookkeeping files to? Right now the entire data area is 74M and the scripts are 9M. NDDIRS=`ls $MINOS_DATA | grep -v '^d..$' | grep -v '^d...$'` for DIR in $NDDIRS ; do printf "${DIR} " ; fs listquota ${MINOS_DATA}/${DIR} | grep -v Volume | tr -s ' ' | cut -f 2 -d ' ' ; done beam_data 50000000 beam_data1 50000000 beam_data2 5000 crl_data 2000000 1% database_dumps 5000000 0% db_cache 2000000 0% farm_logs 50000000 1% farm_mclogs 50000000 4% log_data 8000000 88% logbook 2000000 1% offline_monitor 8000000 5% release_data 8000000 0% validation 8000000 2% MINOS26 > cp -vax log_data/CFL release_data/CFL MINOS26 > diff -r log_data/CFL release_data/CFL MINOS26 > pwd /afs/fnal.gov/files/home/room1/kreymer/minos MINOS26 > rm CFL MINOS26 > ln -s /afs/fnal.gov/files/data/minos/release_data/CFL CFL # CONDOR # Killed running glideins, for a fresh glidein with my proxy condor_rm 107199 condor_rm 107201 condor_submit glide.run condor_q gfactory kreymer ============================================================================= 2008 04 30 ########## # CONDOR # ########## glidemachine.run - runs on fnpc333, trick to get a new glidein, perhaps with glexec this is still pending - ####### # KCA # ####### http://security.fnal.gov/pki/newkcafaq.html feedback to jklemenc, x3311 cc: minos-admin Q What is a KCA server A You use it to get a Grid certificate based on your Kerberos ticket. 
Q What is not affected A your kerberos ticket ssh or rsh access to accounts dcache access via dcap Q What are possible common uses of KCA certs A Web browser access to DocDB Helpdesk Computing Division personnel information Grid proxies used internally by some Condor systems Q Why is 'been' spelled 'bene' in the answer to FAQ A 1.8 A To confuse the users ? ######## # DATA # ######## jdejong - space for rev field ND cosmic analysis ( 150-230 GB ) in $MINOS_DATA MINOS26 > mkdir /minos/data/analysis/nonap MINOS26 > chmod 775 /minos/data/analysis/nonap MINOS26 > chgrp e875 /minos/data/analysis/nonap ######## # FARM # ######## See notes under 2008/04/29 regarding removal of files. Completed SAM retirement of these files. ########### # ROUNDUP # ########### Investigating issues with cedar_phymcnear.log /home/minfarm/scripts/roundup: line 389: [: missing `]' and HAVE n13037413__L010185N_D04.mrnt.cedar_phy_bhcurv.0.root:30 grep: /minos/data/minfarm/lists/bad_runs_mc.cedar_phy: No such file or directory ============================================================================= 2008 04 29 ######## # FARM # ######## Proceeding with the removal of D04MCNEAR bad field files cd scripts/AFSS/d4clean mkdir /pnfs/minos/BAD/D4CLEAN Need to mangle PNFS path to DATA path /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/700 2 3 4 5 6 7 8 9 10 to /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data/700 7 8 6 5 9 10 MINOS26 > ./filemove L250200N Processing beam L250200N Processed 67 files MINOS26 > ./filemove L010185N Processing beam L010185N PNFS CLEANUP as rubin on fnpcsrv1, move the PNFS files rubin@SRV1> pwd /home/minfarm/scripts/AFSS/d4clean SRV1> time . L250200N.movepnfs real 0m11.595s user 0m0.066s sys 0m0.825s SRV1> date Tue Apr 29 15:36:15 CDT 2008 SRV1> wc -l L010185N.movepnfs 75868 L010185N.movepnfs 08:56 { time . L010185N.movepnfs ; } 2>&1 | tee L1010185N.logpnfs Wed Apr 30 08:58:06 CDT 2008 real 53m16.213s user 0m19.237s sys 4m55.445s DATA CLEANUP as minfarm on fnpcsrv1, movedata . ./L250200N.movedata Oops, kreymer had made some directories, remove them and try again . ./L250200N.movedata 2>&1 | tee L250200.logdata2 date ; { time . L010185N.movedata ; } 2>&1 | tee L1010185N.logdata real 35m44.553s user 0m7.401s sys 1m39.938s SRV1> du -sm /minos/data/BAD/D4CLEAN/mcout_data/daikon_04/* 1364232 /minos/data/BAD/D4CLEAN/mcout_data/daikon_04/L010185N 8729 /minos/data/BAD/D4CLEAN/mcout_data/daikon_04/L250200N SAM disabling : OPW=... setup oracle_client v10_1_0_2_0b sqlplus samdbs/${OPW}@minosprd Commands will look like this UPDATE DATA_FILES SET RETIRED_DATE = SYSDATE WHERE FILE_NAME IN ( 'realfilenames', 'FLINTSTONE,FRED') AND RETIRED_DATE IS NULL ; COMMIT Created retirehead.sql retiretail.sql with the head and tail of these, just needing the quoted file lists. Created filesql script to make ${BEAM}.sqlf file lists ... 
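filesql itself is not reproduced in this log. A minimal sketch of the quoting step, assuming a plain list of file names as input ( ${BEAM}.files is a made-up name, and the comma handling between retirehead.sql and retiretail.sql is glossed over ) :
BEAM=L250200N
# wrap each file name in single quotes with a trailing comma,
# ready to paste into the IN ( ... ) list between the head and tail SQL
sed "s/^.*$/'&',/" ${BEAM}.files > ${BEAM}.sqlf
wc -l ${BEAM}.sqlf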
interruption for full disk, described below 2008 04 30 time ./filesql L010185N Processing beam L010185N Processed 18911 files real 7m36.646s user 0m29.815s sys 3m12.462s MIN > wc -l L010185N.sqlf 18911 L010185N.sqlf cp L250200N.sqlf L250200N.sql nedit L250200N.sql & Testing with one file cp L250200N.sql onefile.sql MINOS26 > sam locate n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root ['/pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L250200N/cand_data/700,193@voa102'] MINOS26 > sam list files --dim='FILE_NAME n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root' Files: n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root File Count: 1 Average File Size: 1.08GB Total File Size: 1.08GB Total Event Count: 800 sqlplus samdbs/${OPW}@minosprd SQL> @onefile.sql 1 row updated. @onefile.sql SQL> @onefile.sql 0 rows updated. MINOS26 > sam locate n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root Datafile with name 'n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root' not found. Try file id: 2138767 MINOS26 > sam list files --dim='FILE_NAME n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root' No files match the given constraints. MINOS26 > sam get metadata --fileId=2138767 ImportedSimulatedFile({ 'fileName' : 'n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root', ... MINOS26 > sqlplus samdbs/${OPW}@minosprd SQL> @L250200N.sql 66 rows updated. For L010185N, 18911 files, need 23 files, N=0 while [ ${N} -lt 24 ] ; do (( K = ( N * 800 ) + 1 )) (( L = K + 799 )) printf "${N} ${K} ${L}\n" cat retirehead.sql > L010185N${N}.sql tail +${K} L010185N.sqlf | head -800 >> L010185N${N}.sql cat retiretail.sql >> L010185N${N}.sql (( N++ )) done These files look reasonable. MINOS26 > date Wed Apr 30 11:02:05 CDT 2008 MINOS26 > ${HOME}/minos/bin/rlwrap sqlplus samdbs/${OPW}@minosprd SQL> @L010185N0.sql 800 rows updated. ... SQL> @L010185N23.sql 511 rows updated. MINOS26 > date Wed Apr 30 11:04:09 CDT 2008 ( 2008 04 29 ) GRRRRRRRRRRRRRRRRRRRRRRRRRRRR Once again ran out of AFS quota du -sm * | sort -n ... 16 OSF1 21 msql 36 IRIX 36 msqldata 57 isajet 134 minos MIN > ls -alF IRIX/webmaker/ total 22921 drwxr-xr-x 8 kreymer kreymer 10240 Jun 3 1996 ./ drwxr-xr-x 8 kreymer kreymer 2048 May 9 1997 ../ lrwxr-xr-x 1 kreymer kreymer 48 May 16 1996 FrameMaker -> /afs/fnal.gov/products/UNIX/frame/v5_1/bin/maker* drwxr-xr-x 3 kreymer kreymer 2048 May 15 1996 doc/ drwxr-xr-x 3 kreymer kreymer 2048 May 15 1996 examples/ drwxr-xr-x 2 kreymer kreymer 2048 May 15 1996 lib/ drwxr-xr-x 5 kreymer kreymer 2048 Jun 4 1996 misc/ drwxr-xr-x 2 kreymer kreymer 2048 May 15 1996 patches/ -rw-r--r-- 1 kreymer kreymer 14242 May 10 1996 readme.txt -rw-r--r-- 1 kreymer kreymer 958 May 10 1996 support.txt drwxr-xr-x 2 kreymer kreymer 2048 Jun 5 1996 ups/ -rwxr-xr-x 1 kreymer kreymer 23429120 Apr 26 1996 webmaker* -rw-r--r-- 1 kreymer kreymer 17 May 16 1996 webmaker-2-unix.reg MINOS26 > mkdir -p /minos/data/users/kreymer/AFS/IRIX MINOS26 > cp -ax IRIX/webmaker /minos/data/users/kreymer/AFS/IRIX/webmaker gzipped a few logs minos/log/top_20050714.log minos/log/saddmc_old/* Started all over generating the L010185N file moves 16:52 MINOS26 > time ./filemove L010185N Processing beam L010185N Processed 18911 files real 12m53.838s user 0m27.966s sys 4m27.664s Putting further notes in-line above, for legibility. 
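The PNFS to /minos/data path mangling noted under 2008 04 29 above can be rendered as a one-liner. This is just an illustration of the field reordering, not the actual filemove code :
PNFSDIR=/pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/700
# with -F/ the fields are $2=pnfs $3=minos $4=mcout_data $5=cedar_phy_bhcurv
#                         $6=near $7=daikon_04 $8=L010185N $9=cand_data $10=700
DATADIR=`echo ${PNFSDIR} | awk -F/ '{ printf "/minos/data/%s/%s/%s/%s/%s/%s/%s\n", $4, $7, $8, $6, $5, $9, $10 }'`
echo ${DATADIR}
# -> /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data/700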
============================================================================= 2008 04 28 Condor - glideafs.run 14:01 - added PROXY statement /local/scratch25/kreymer/grid/kreymer.proxy - links to .2008042815, nonesuch 15:24 - corrected link back to 2008042813 ln -sf kreymer.proxy.2008042813 /local/scratch25/kreymer/grid/kreymer.proxy Scanning logs under -rw-r--r-- 1 kreymer g020 5012 Apr 28 13:30 logs/glideafs/probe.106153.0.out -rw-r--r-- 1 kreymer g020 4487 Apr 28 13:47 logs/glideafs/probe.106154.0.out -rw-r--r-- 1 kreymer g020 4487 Apr 28 13:50 logs/glideafs/probe.106156.0.out -rw-r--r-- 1 kreymer g020 4487 Apr 28 14:00 logs/glideafs/probe.106157.0.out -rw-r--r-- 1 kreymer g020 4621 Apr 28 14:04 logs/glideafs/probe.106158.0.out -rw-r--r-- 1 kreymer g020 4621 Apr 28 14:10 logs/glideafs/probe.106159.0.out -rw-r--r-- 1 kreymer g020 4621 Apr 28 14:20 logs/glideafs/probe.106160.0.out -rw-r--r-- 1 kreymer g020 4621 Apr 28 14:31 logs/glideafs/probe.106161.0.out -rw-r--r-- 1 kreymer g020 4621 Apr 28 14:40 logs/glideafs/probe.106162.0.out -rw-r--r-- 1 kreymer g020 0 Apr 28 14:50 logs/glideafs/probe.106163.0.out -rw-r--r-- 1 kreymer g020 0 Apr 28 15:00 logs/glideafs/probe.106165.0.out -rw-r--r-- 1 kreymer g020 0 Apr 28 15:10 logs/glideafs/probe.106167.0.out -rw-r--r-- 1 kreymer g020 0 Apr 28 15:20 logs/glideafs/probe.106169.0.out -rw-r--r-- 1 kreymer g020 6019 Apr 28 15:58 logs/glideafs/probe.106171.0.out -rw-r--r-- 1 kreymer g020 6008 Apr 28 15:58 logs/glideafs/probe.106176.0.out -rw-r--r-- 1 kreymer g020 6008 Apr 28 15:58 logs/glideafs/probe.106180.0.out -rw-r--r-- 1 kreymer g020 4626 Apr 28 16:00 logs/glideafs/probe.106185.0.out -rw-r--r-- 1 kreymer g020 10595 Apr 25 15:20 logs/glideafs/probe.78296.0.out -rw-r--r-- 1 kreymer g020 9721 Apr 25 15:30 logs/glideafs/probe.78297.0.out -rw-r--r-- 1 kreymer g020 9721 Apr 25 15:40 logs/glideafs/probe.78302.0.out -rw-r--r-- 1 kreymer g020 10863 Apr 25 15:50 logs/glideafs/probe.84485.0.out -rw-r--r-- 1 kreymer g020 10863 Apr 25 16:00 logs/glideafs/probe.91503.0.out -rw-r--r-- 1 kreymer g020 10586 Apr 25 16:10 logs/glideafs/probe.97891.0.out grep identity logs/glideafs/probe.106157.0.out # 13:30 default proxy identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy grep identity logs/glideafs/probe.106158.0.out # 14:31 with KCA proxy ? identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy grep identity logs/glideafs/probe.106171.0.out # 15:58 fixed KCA proxy ? identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy MINOS25 > grep identity logs/glideafs/probe.*.0.out logs/glideafs/probe.102398.0.out:identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy ... logs/glideafs/probe.106157.0.out:identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy logs/glideafs/probe.106158.0.out:identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy ... It is clear that my proxy started getting passed as soon as it was specified. I'm not sure why I got confused earlier yesterday. ####### # SIM # ####### 14:10 Sent email to deb4 asking for confirmation of the file list for D04 MD reprocessing. 
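Re the dangling kreymer.proxy link earlier today, a quick pre-submit check would catch this. Illustrative only, using the same path and tools already in use here :
PROXY=/local/scratch25/kreymer/grid/kreymer.proxy
# a dangling symlink fails the -r test
[ -r ${PROXY} ] || echo "OOPS - ${PROXY} does not resolve"
# report the remaining lifetime of whatever the link points to
X509_USER_PROXY=${PROXY} voms-proxy-info | grep timeleft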
########## # DC2NFS # ########## First check state of sntp's in DCache for DIR in `ls /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data` ; do echo $DIR ; ./stage -d -p 0 reco_near/cedar_phy_bhcurv/sntp_data/${DIR} ; done \ | grep '^200\|Needed' for DIR in `ls /pnfs/minos/reco_far/cedar_phy_bhcurv/sntp_data` ; do echo $DIR ; ./stage -d -p 0 reco_far/cedar_phy_bhcurv/sntp_data/${DIR} ; done \ | grep '^200\|Needed' None of these needed staging. 2008 04 29 plan of action : minfarm@fnpcsrv1 DATA=reco_near/cedar_phy_bhcurv/sntp_data NDIRS=`ls /pnfs/minos/${DATA}` AFSS/dc2nfs.20080118 -n -d ${DATA} AFSS/dc2nfs -d reco_near/cedar_phy_bhcurv/sntp_data 2>&1 | \ tee /minos/scratch/log/dc2nfs/cpbnear.log ########## # AUTOFS # ########## To check mounts, cat /etc/auto.master ypcat auto.master ####### # LSF # ####### ----------------------------------------------------------- Date: Mon, 28 Apr 2008 08:42:41 +0100 From: David John Auty To: minos_software_discussion@fnal.gov Subject: lsf I can't submit to the lsf queue at the moment any idea's? ----------------------------------------------------------- MINOS26 > bjobs batch system daemon not responding ... still trying MINOS26 > date Mon Apr 28 08:41:07 CDT 2008 MINOS26 > bjobs batch system daemon not responding ... still trying No tickets, submitting helpdesk ticket for NODE in flxi02 flxi03 flxi04 flxi05 fsui03 ; do printf "${NODE} "; ssh -ax ${NODE} ". /usr/local/etc/setups.sh ; setup lsf" ; done flxi02 /tmp/filefh7jcc: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh: No such file or directory flxi03 /tmp/filejv2bhJ: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh: No such file or directory flxi04 bash: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh: No such file or directory flxi05 /tmp/fileayQG7D: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh: No such file or directory fsui03 bash: /home/room1/lsf/v6_1/conf/profile.lsf.sh: Permission denied Date: Mon, 28 Apr 2008 09:03:26 -0500 (CDT) Subject: HelpDesk ticket 114816 ___________________________________________ Short Description: FNALU Batch system is not responding Problem Description: fnalu-admin : Since at least about 02:40 this morning, it seems that the FNALU Batch system has been unavailable. Here is a recent test : MINOS26 > date Mon Apr 28 08:41:07 CDT 2008 MINOS26 > bjobs batch system daemon not responding ... still trying batch system daemon not responding ... still trying It seems that the /home/room1/lsf configuration files are missing : for NODE in flxi02 flxi03 flxi04 flxi05 fsui03 ; do printf "${NODE} "; ssh -ax ${NODE} ". /usr/local/etc/setups.sh ; setup lsf" done flxi02 /tmp/filefh7jcc: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh:No such file or directory flxi03 /tmp/filejv2bhJ: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh:No such file or directory flxi04 bash: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh: No suchfile or directory flxi05 /tmp/fileayQG7D: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh:No such file or directory fsui03 bash: /home/room1/lsf/v6_1/conf/profile.lsf.sh: Permission denied ___________________________________________ Noticed that /home is automounted from fsun02 fsui03 > ypcat auto.home -rw,hard,intr fsun02:/export/home Updated MINOS status at https://computing.fnal.gov/cdsystemstatus/customersupport/inpform.htm ___________________________________________ Date: Mon, 28 Apr 2008 11:38:00 -0500 (CDT) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. 
___________________________________________ Wayne is on furlough this week. Please reassign this to somebody else. Thanks ! ___________________________________________ Date: Mon, 28 Apr 2008 12:51:29 -0500 (CDT) From: Margaret_Greaney Art, I've been in FCC1 all morning working on fsun02. I will take a look at this helpdesk ticket now. ___________________________________________ Date: Mon, 28 Apr 2008 13:05:49 -0500 (CDT) fsun02 serves the lsf home area and it was up then down twice this morning and now it is down again. It has a bad cpu board. The vendor is ordering a replacement part and we hope to swap it out tomorrow. Also, our console server to that node was down this moring, and we needed to get that up before we could access fsun02. ___________________________________________ Date: Mon, 28 Apr 2008 16:18:31 -0500 (CDT) This ticket has been reassigned to GREANEY, MARGARET of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Tue, 29 Apr 2008 11:37:14 -0500 (CDT) D1 replaced the cpu board on fsun02 and now lsf home area is again being served. Batch nodes were checked and minos users notified. ___________________________________________ ########## # CONDOR # ########## The bspeak job backlog cleared. ============================================================================= 2008 04 25 ########## # CONDOR # ########## bspeak submitted over 20K jobs at about 4/25 15:44 ########## # CONDOR # ########## ANODES='339 340 341 342 343 344 345 346' for NODE in ${ANODES} ; do printf "fnpc${NODE} " ; ssh -ax fnpc${NODE} ls -ld /afs ; done fnpc339 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc340 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc341 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc342 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc343 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc344 drwxrwxrwx 2 root root 4096 Jan 31 10:57 /afs fnpc345 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc346 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs for NODE in ${ANODES} ; do printf "fnpc${NODE} " ; ssh -ax fnpc${NODE} ls -ld /afs/fnal.gov ; done fnpc339 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc340 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc341 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc342 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc343 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc344 ls: /afs/fnal.gov: No such file or directory fnpc345 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc346 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov for NODE in ${ANODES} ; do printf "fnpc${NODE} " ; ssh -ax fnpc${NODE} rpm -q openafs ; done fnpc339 openafs-1.4.6-58.SL4.x86_64 fnpc340 openafs-1.4.4-46.SL4.x86_64 fnpc341 openafs-1.4.6-58.SL4.x86_64 fnpc342 openafs-1.4.6-58.SL4.x86_64 fnpc343 openafs-1.4.6-58.SL4.x86_64 fnpc344 openafs-1.4.6-58.SL4.x86_64 fnpc345 openafs-1.4.6-58.SL4.x86_64 fnpc346 openafs-1.4.6-58.SL4.x86_64 Fri Apr 25 19:08:55 UTC 2008 Date: Fri, 25 Apr 2008 14:24:09 -0500 (CDT) Subject: HelpDesk ticket 114790 Short Description: fnpc344 AFS mount needed Problem Description: fnpc344 seems to have rebooted around 24 May 16:30 CDT . The AFS file system is not mounted, and is needed by Minos jobs. Please remount AFS. Thanks ! ___________________________________________ This ticket is assigned to TIMM, STEVE of the CD-SF/GF/FGS. ___________________________________________ The reboot was on 24 April, not 24 May. Sorry for the typo. 
Also, note that Steve Timm is on furlough, so this ticket needs to be assigned to someone else. ___________________________________________ Date: Fri, 25 Apr 2008 14:45:39 -0500 (CDT) This ticket has been reassigned to YOCUM, DAN of the CD-SF/GF/FGS Group. ___________________________________________ Date: Mon, 28 Apr 2008 10:33:38 -0500 Resolved Installed correct openafs kernel module and restarted afs service. ___________________________________________ N.B. as a workaround, loiacono and pawloski are selecting && ( Machine != "fnpc344.fnal.gov" ) ============================================================================= 2008 04 24 ####### # SAM # ####### Date: Thu, 24 Apr 2008 09:17:36 -0500 From: Nelly Stanfield To: minosdb-support@fnal.gov Minosdev db is available.?? Oracle's April Quarterly patch completed. ######## # FARM # ######## Making file lists, with associated scripts, in scripts/d4clean MINOS26 > ./filelist L250200N Processing beam L250200N n13037004 n13037004 n13037014 n13037014 Config n1303 Run range 7004 7004 Config n1303 Run range 7014 7014 MINOS26 > wc -l L250200N.locations 67 L250200N.locations MINOS26 > for STR in sntp mrnt cand ; do printf "${STR} " ; grep $STR L250200N.locations | wc -l ; done sntp 6 mrnt 0 cand 61 MINOS26 > ./filelist L010185N Processing beam L010185N n13037140 n13037140 n13037233 n13037233 n13037244 n13037245 n13037250 n13037260 n13037263 n13037470 n13037553 n13037735 n13047013 n13047014 n13047041 n13047100 n13047103 n13047103 n13047106 n13047191 Config n1303 Run range 7140 7140 Config n1303 Run range 7233 7233 Config n1303 Run range 7244 7245 Config n1303 Run range 7250 7260 Config n1303 Run range 7263 7470 Config n1303 Run range 7553 7735 Config n1304 Run range 7013 7014 Config n1304 Run range 7041 7100 Config n1304 Run range 7103 7103 Config n1304 Run range 7106 7191 MINOS26 > wc -l L010185N.locations 18911 L010185N.locations MINOS26 > for STR in sntp mrnt cand ; do printf "${STR} " ; grep $STR L010185N.locations | wc -l ; done sntp 944 mrnt 944 cand 17023 ######## # FARM # ######## SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and FILE_NAME n1303% and RUN_NUMBER >= 7250 and RUN_NUMBER <= 7260 " ./samlocate "${SAMDIM}" Specialize to no-pass and pass 0 PASS='' SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and FILE_NAME n1303%cedar.phy.bhcurv${PASS}.root and RUN_NUMBER >= 7250 and RUN_NUMBER <= 7260 " SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and FILE_NAME n%cedar_phy_bhcurv${PASS}.root and RUN_NUMBER >= 7250 and RUN_NUMBER <= 7260 " Counting pass 1 files already present: SAMDIM=" DATA_TIER cand-near and MC.RELEASE daikon_04 and VERSION cedar.phy.bhcurv and FILE_NAME n%_D04.cand.cedar_phy_bhcurv.1.root " ./samlocate "${SAMDIM}" | sort | wc -l 65 Howie's test reprocessing run produced 66, but one of these was produced with pass 0, n13037251_0028_L010185N_D04.cand.cedar_phy_bhcurv.0.root in /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/725 He will put a copy back into mcnearcat, renamed as .1. , so that I can purge this along with the actual bad data files. ########## # CONDOR # ########## Increased Greg's priority, to allow Laura to glide in ( fair share seems not so fair at present ) condor_userprio -setfactor pawloski@fnal.gov 10000. 
condor_userprio -all Thu Apr 24 11:00:30 CDT 2008 ########## # CONDOR # ########## Date: Wed, 23 Apr 2008 16:42:01 -0500 From: Sfiligoi Igor It seems the changes below were not done on minos25: condor_config_val GLEXEC_STARTER Not defined: GLEXEC_STARTER Without it, we cannot use glexec. ------------------------------------- Date: Thu, 24 Apr 2008 05:20:50 +0000 (UTC) From: Arthur Kreymer To: Sfiligoi Igor Strange, we have the correct GLEXEC content for condor_config.local.master but not condor_config.local These should have been identical ! I am pretty sure that I checked the files at the time of the upgrade. File time stamps are odd, with the incorrect file being the most recently updated . It is as if we had the correct files at 13:40, then something or someone put the old file back 13 minutes later. MINOS25 > ls -l /opt/condor-7.0.1/local total 16 -rw-r--r-- 1 root root 6371 Apr 21 13:53 condor_config.local -rw-r--r-- 1 root root 6590 Apr 21 13:40 condor_config.local.master MINOS25 > ls -l /opt/condor-7.0.1/local.minos25 total 28 -rw-r--r-- 1 root root 6590 Apr 21 13:18 condor_config.local -rw-r--r-- 1 root root 6590 Apr 21 13:19 condor_config.local.master drwxrwxrwt 2 daemon root 4096 Mar 17 16:36 execute drwxr-xr-x 2 daemon root 4096 Mar 17 16:36 log drwxr-xr-x 2 daemon root 4096 Mar 17 16:36 spool Unfortunately the Condor pool is pretty busy at the moment. Is it OK to update local/condor_config.local while the master runs ? If this does no harm, would a restart still be necessary in order to benefit from the changed configuration ? Or do we need to shut down, modify the file, and restart ? ------------------------------------- Date: Thu, 24 Apr 2008 07:11:15 -0500 From: Igor Sfiligoi Yes, all you need is change the files and issue a local condor_reconfig. ------------------------------------- Date: Thu, 24 Apr 2008 09:44:10 -0500 (CDT) Subject: HelpDesk ticket 114713 ___________________________________________ Short Description: Update /opt/condor-7.0.1/local/condor_config.local Problem Description: run2-sys : The content of opt/condor-7.0.1/local/condor_config.local seems to have changed from what was set on Monday at 13:40, going back to a copy of the old file. MINOS25 > ls -l /opt/condor-7.0.1/local total 16 -rw-r--r-- 1 root root 6371 Apr 21 13:53 condor_config.local -rw-r--r-- 1 root root 6590 Apr 21 13:40 condor_config.local.master Please copy the correct content from condor_config.local.master on minos25: cd opt/condor-7.0.1/local cp condor_config.local.master condor_config.local We do not need to pause or restart Condor for this change, so please do this at a time of your convenience. Thanks ! ___________________________________________ Date: Thu, 24 Apr 2008 09:51:50 -0500 (CDT) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ________________________________________________________________ ________________________________________________________________ MINOS25 > condor_reconfig minos25 Sent "Reconfig" command to master minos25.fnal.gov MINOS25 > date Thu Apr 24 10:32:25 CDT 2008 ________________________________________________________________ ________________________________________________________________ Date: Thu, 24 Apr 2008 10:46:38 -0500 My test jobs are running and changing uid to uscms466 as expected. 
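For future reference, the restored settings can be confirmed on minos25 with the same condor_config_val check Igor used, plus a diff of the two local files. A sketch :
diff /opt/condor-7.0.1/local/condor_config.local \
     /opt/condor-7.0.1/local/condor_config.local.master
condor_config_val GLEXEC_STARTER
condor_config_val GLEXEC
# expect GLEXEC_STARTER = True and GLEXEC = /bin/false per condor_config.local.master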
============================================================================= 2008 04 23 ########### # BLUEARC # ########### Attended 10:30 BlueArc Users' Meeting Discussed purchase and upgrade plans for BlueArc servers ######## # FARM # ######## Per email from rubin, with followups, run ranges for reprocessing due to incorrect Bfield seen by grid nodes L250200N n13037004 n13037004 n13037014 n13037014 L010185N n13037140 n13037140 n13037233 n13037233 n13037244 n13037245 n13037250 n13037260 n13037263 n13037470 n13037553 n13037735 n13047013 n13047014 n13047041 n13047100 n13047103 n13047103 n13047106 n13047191 Sample queries use SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L250200N and VERSION cedar.phy.bhcurv and FILE_NAME n1303% and RUN_NUMBER 7004 " SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and FILE_NAME n1303% and RUN_NUMBER >= 7250 and RUN_NUMBER <= 7260 " ./samlocate "${SAMDIM}" ########## # CONDOR # ########## Modified glide.run to set X509USERPROXY = /local/scratch25/kreymer/grid/kreymer.proxy This works ! ########## # CONDOR # ########## Draft version of kproxy, creating user proxy in /local/scratch25/${USERNAME}/grid/${USERNAME}.proxy crontab.minos25 runs this, 07 1-23/2 * * * /usr/krb5/bin/kcron /local/scratch25/grid/kproxy ######### # DOCDB # ######### Registered Phil Adamson for numirw and beamrw groups, pre email request https://minos-docdb.fnal.gov:440/cgi-bin/EmailAdminister Username minos-adm Password ***** ============================================================================= 2008 04 22 ######### # ADMIN # ######### minos04 does not allow ssh logins MIN > ssh -v minos04 OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos04 [131.225.193.4] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 debug1: identity file /home/kreymer/.ssh/id_dsa type -1 ssh_exchange_identification: Connection closed by remote host MIN > date Tue Apr 22 21:30:30 UTC 2008 MIN > rsh minos04 ... MINOS04 > The last sshd messages in /var/log/messages are Apr 22 02:19:19 minos04 sshd(pam_unix)[9569]: session opened for user djauty by djauty(uid=0) Apr 22 02:41:21 minos04 sshd: pam_krb5[10190]: authentication fails for 'djauty' (djauty@FNAL.GOV): Authentication service cannot retrieve authentication info. (Cannot contact any KDC for requested realm) Condor jobs are running, apparently OK. Date: Tue, 22 Apr 2008 16:47:40 -0500 (CDT) Subject: HelpDesk ticket 114615 ___________________________________________ Short Description: ssh logins to minos04 fail Problem Description: run2-sys I cannot ssh to mino04. MIN > ssh -v minos04 OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos04 [131.225.193.4] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 debug1: identity file /home/kreymer/.ssh/id_dsa type -1 ssh_exchange_identification: Connection closed by remote host MIN > date Tue Apr 22 21:30:30 UTC 2008 But kerberized rsh works. 
The last sshd messages in /var/log/messages are : Apr 22 02:19:19 minos04 sshd(pam_unix)[9569]: session opened for user djauty by djauty(uid=0) Apr 22 02:41:21 minos04 sshd: pam_krb5[10190]: authentication fails for 'djauty' (djauty@FNAL.GOV): Authentication service cannot retrieve authentication info. (Cannot contact any KDC for requested realm) ___________________________________________ Date: Wed, 23 Apr 2008 08:22:46 -0500 (CDT) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ___________________________________________ Apr 23 11:34:58 minos04 sshd(pam_unix)[24095]: session opened for user hartnell by hartnell(uid=0) ___________________________________________ ___________________________________________ ############ # SADDRECO # ############ saddreco.new Adding support for pass numbers in MC files, like n13037251_0028_L010185N_D04.cand.cedar_phy_bhcurv.0.root or n13037251_0028_L010185N_D04.cand.cedar_phy_bhcurv.root as opposed to data files like N00013434_0000.spill.sntp.cedar.0.root ####### # SAM # ####### Closed out IT 3538 re station upgrade versus running projects Why was this ticket assigned to user mundim ? Oh well. ######## # FARM # ######## Cleaned up dangling WRITE file pointing to now-deleted /minos/data CC area rm /export/stage/minfarm/ROUNDUP/WRITE/n13037251_0028_L010185N_D04.cand.cedar_phy_bhcurv.0.root ########### # ROUNDUP # ########### roundup.20080422 Uses new CCOP variable to move only non-cand/bcnd files to CC area SRV1> cp AFSS/roundup.20080422 . SRV1> ln -sf roundup.20080422 roundup # was roundup.20080412 SRV1> ${HOME}/scripts/roundup -r cedar_phy_bhcurv mcnear ############ # MCIMPORT # ############ ######## # FARM # ######## checking the disabling of previous passes MINOS26 > SAMDIM='FILE_NAME n13037252_0003_L010185N_D04.cand.cedar_phy_bhcurv%' MINOS26 > sam list files --dim="${SAMDIM}" Files: n13037252_0003_L010185N_D04.cand.cedar_phy_bhcurv.0.root n13037252_0003_L010185N_D04.cand.cedar_phy_bhcurv.1.root File Count: 2 Average File Size: 545.75MB Total File Size: 1.07GB Total Event Count: 1600 This is no good ! ######## # DATA # ######## doing manual full inventory ( should do this regularly ? ) The big problem is nearly 3 TBytes of /minos/data/minfarm/farmtest/mcnearcat MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/L010185N 2027116 /minos/data/mcimport/STAGE/daikon_04/L010185N MINOS26 > du -sm /minos/data/users/* 2 /minos/data/users/bckhouse 17186 /minos/data/users/boehm 1 /minos/data/users/kreymer 10195 /minos/data/users/loiacono 1 /minos/data/users/minsoft 99003 /minos/data/users/pawloski 102630 /minos/data/users/rmehdi 1 /minos/data/users/rustem 12335360 /minos/data/mcout_data 4594802 /minos/data/minfarm 3730545 /minos/data/mcimport 3437784 /minos/data/reco_near 2297204 /minos/data/reco_far 1835532 /minos/data/analysis 268493 /minos/data/mysql 259008 /minos/data/beam_data 229014 /minos/data/users 99901 /minos/data/flux 3437 /minos/data/log_data 1 /minos/data/mindata 1 /minos/data/release_data MINOS26 > du -sm /minos/data/minfarm/* | sort -n ... 
10625 /minos/data/minfarm/nearcat 11919 /minos/data/minfarm/mcfar 23940 /minos/data/minfarm/DUP 163716 /minos/data/minfarm/mcnear 4358441 /minos/data/minfarm/farmtest MINOS26 > du -sm /minos/data/minfarm/farmtest/* | sort -n 45 /minos/data/minfarm/farmtest/logs 72 /minos/data/minfarm/farmtest/mclogs 1865 /minos/data/minfarm/farmtest/neardet 57986 /minos/data/minfarm/farmtest/nearcat_old 1362419 /minos/data/minfarm/farmtest/nearcat 2936050 /minos/data/minfarm/farmtest/mcnearcat ######## # DATA # ######## As minfarm on fnpcsrv1, SRV1> du -sm /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data 8431447 /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data SRV1> ls /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data 701 705 707 709 711 713 715 717 719 724 726 728 730 732 734 736 738 740 742 744 746 755 757 759 761 763 765 767 769 771 773 704 706 708 710 712 714 716 718 723 725 727 729 731 733 735 737 739 741 743 745 747 756 758 760 762 764 766 768 770 772 SRV1> time rm -r /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data real 9m2.200s user 0m0.112s sys 0m3.687s This was done a couple of minutes before Tue Apr 22 11:33:00 CDT 2008 SRV1> du -sm /minos/data/mcout_data 3909216 /minos/data/mcout_data SRV1> df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 21T 20T 792G 97% /minos/data SRV1> date ; df -h /minos/data Tue Apr 22 13:38:18 CDT 2008 Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 21T 20T 1.2T 95% /minos/data Tue Apr 22 14:56:57 CDT 2008 Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 22T 20T 1.5T 94% /minos/data ============================================================================= 2008 04 21 ######## # GRID # ######## fnpc342 - requested AFS restoration at 16:38 Date: Mon, 21 Apr 2008 16:40:03 -0500 (CDT) Subject: HelpDesk ticket 114555 ___________________________________________ Short Description: fnpc342 AFS mount is missing Problem Description: fnpc342 is one of the system which has AFS available, selected by HAVEMINOSAFS as I recall. The AFS mount seems to be missing, causing user jobs to fail. Please remount AFS, and inform minos-admin. Thanks ! _________________________________________________________________ Note To Requester: kreymer@fnal.gov sent this Notes To Requester: Thanks for looking at this ! If AFS cannot be restored quickly to fnpc342, please remove fnpc342 from ISMINOSAFS list This reduces our Grid AFS capacity by only 1/8, and avoid immediate user job failures. $ condor_status slot1@fnpc342 -l | grep ISMINOSAFS ISMINOSAFS = stringlistimember(My.Machine, "fnpc339.fnal.gov, fnpc340.fnal.gov, fnpc341.fnal.gov, fnpc342.fnal.gov, fnpc343.fnal.gov, fnpc344.fnal.gov, fnpc345.fnal.gov, fnpc346.fnal.gov") ___________________________________________________________________ Date: Wed, 23 Apr 2008 15:10:30 -0500 (CDT) Solution: yocum@fnal.gov sent this solution: After some digging around I found the openafs module rpm on the scientificlinux ftp server and installed it. I've informed Troy Dawson that the updated rpm package isn't where it should be in the Fermi Linux directories and he's taking steps to fix this omission. 
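A quick follow-up on fnpc342, along the lines of the fnpc344 checks above, would verify AFS is usable again before closing the ticket ( illustrative, not part of the ticket thread ) :
ssh -ax fnpc342 "ls -ld /afs/fnal.gov ; rpm -q openafs"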
___________________________________________ ########## # CONDOR # ########## 12:51 MINOS25 > condor_off -fast minos25 Sent "Kill-All-Daemons-Fast" command to master minos25.fnal.gov MINOS25 > sudo /etc/init.d/condor stop Shutting down Condor (fast-shutdown mode) Date: Mon, 21 Apr 2008 12:53:48 -0500 (CDT) Subject: HelpDesk ticket 114534 Short Description: minos25 condor upgrade request Problem Description: run2-sys : We are ready to proceed with the Condor 7.0.1 upgrade on minos25. Unlike the Condor upgrades last week, we need a new local configuration file differing from the condor-6.8.6/local/condor_config.local in the addition of GLEXEC_STARTER = True GLEXEC = /bin/false And there appears to be a second copy of local/condor_config.local, named condor_config.local.master . So please do the following ( or equivalent ) on minos25 , then inform minos-admin : cd /opt NEWCONF=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/cond or_config.local.minos25 cp -r ${NEWCONF} condor-7.0.1/local/condor_config.local cp -r ${NEWCONF} condor-7.0.1/local/condor_config.local.master ln -sf condor-7.0.1 condor ___________________________________________ Date: Mon, 21 Apr 2008 13:05:13 -0500 (CDT) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 21 Apr 2008 13:24:18 -0500 (CDT) Note To Requester: ling@fnal.gov sent this Notes To Requester: Done. [root@minos25 ~]# cd /opt [root@minos25 opt]# NEWCONF=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local.minos25 [root@minos25 opt]# cp -r ${NEWCONF} condor-7.0.1/local.minos25/condor_config.local cp: overwrite `condor-7.0.1/local.minos25/condor_config.local'? y [root@minos25 opt]# cp -r ${NEWCONF} condor-7.0.1/local.minos25/condor_config.local.master [root@minos25 opt]# ln -s condor-7.0.1/ condor [root@minos25 opt]# ___________________________________________ Sorry, I should have triple proofread my request. And the ln -s seems not to taken effect, as /opt/condor seems to still point to condor-6.8.6 It should have been : cd /opt NEWCONF=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local.minos25 mkdir condor-7.0.1/local chmod 755 condor-7.0.1/local cp ${NEWCONF} condor-7.0.1/local/condor_config.local cp ${NEWCONF} condor-7.0.1/local/condor_config.local.master ln -sf condor-7.0.1 condor Please try again ! ___________________________________________ I have had to remove the old link first and then make the noew link > So, /bin/rm /opt/condor ; then ln -sf /opt/condor-7.0.1 /opt/condor ___________________________________________ Date: Mon, 21 Apr 2008 19:02:45 +0000 (UTC) ___________________________________________ Thanks ! The condor system is up and running. I am gradually enabling the workers. Igor Sfiligoi is upgrading the glideins for glexec. They will start what that upgrade is done. 13:48 Per sfiligoi, checked master knowledge of the pool with condor_status -master - all nodes are listed. Fired up one node, condor_on minos07 -subsystem startd brebel job started to run Fired up a few more, condor_on minos01 -subsystem startd condor_on minos02 -subsystem startd condor_on minos03 -subsystem startd condor_on minos04 -subsystem startd condor_on minos05 -subsystem startd condor_on minos06 -subsystem startd There are not enough jobs queues locally to use these. Running some probes. They looked good. 
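The probe*.run submit files used here are not reproduced in this log. A minimal stand-in that just reports the execute host might look like this ( illustrative only; the real probes also report the proxy identity, per the logs/glideafs/probe.*.out files above ) :
mkdir -p logs/probe
cat > probe.run << 'EOF'
universe     = vanilla
executable   = /bin/hostname
output       = logs/probe/probe.$(Cluster).$(Process).out
error        = logs/probe/probe.$(Cluster).$(Process).err
log          = logs/probe/probe.$(Cluster).$(Process).log
notification = NEVER
queue 1
EOF
condor_submit probe.run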
condor_on -all -subsystem startd Submitted a 100 process probe MINOS25 > condor_submit probex100.run Submitting job(s).................................................................................................... Logging submit event(s).................................................................................................... 100 job(s) submitted to cluster 68220. glideins are running as of about 14:45 Igor had to remove a log file formerly owned by gfactory, now by condor. We should change the administator email to minos-admin, not fermigrid-root ============================================================================= 2008 04 20 Did peaceful shutdown of Minos Condor, prep for Master upgrade ============================================================================= 2008 04 17 ####### # SAM # ####### Production station upgraded to sam_station v6_0_5_24_srm via sam_products v4_32 Had to forcibly remove stale projects, see 2008 04 17 log entry ########### # MONITOR # ########### Here are pointers to various summaries of Minos batch resource usage, for planning purposes. Minos Cluster - condor I don't yet have CondorView running, which would give per-user plots. You can get a text based summary from condor_userprio -all The overall usage pattern is pretty well seen in the long term Ganglia plots, by looking at the 'nice' ( yellow ) CPU usage. http://rexganglia2.fnal.gov/minos/?m=load_one&r=year&s=descending&c=MINOS+Cluster&h=&sh=1&hc=4 Summary - the cluster is pretty well used, sometimes saturated at the roughly 40 process capacity. ( This shows up as 40% in CPU usage ) Hyperthreading is inflating the quoted capacity. GPFARM Condorview is at http://fnpcsrv1.fnal.gov/condorview/viewdir/index.html or http://fermigrid.fnal.gov/ Condor View in the left frame, under FermiGrid Monitoring, Metrics and Accounting: CondorView monitoring For the past month , under Pool User (Job) Statistics or pick a month of your choice Then under 'User', click on : group_numi.minospro - for farm usage group_numi.minos - for Condor glideins group_numi.rustem - for Rustem's direct submissions glideins have been used by boehm, hartnell, loiacono, pawloski You can also use the Configure box under the plot to further select data. I don't presently know how to select a longer time frame than 1 month. FNALU BATCH ( presently LSF ) I do not know how to get accounting information for this. FNALU batch has been pretty idle recently. Within a few months, this system will move off of LSF, probably to Condor. ####### # SAM # ####### Upgrading the production station per sam_products v4_32 Only one project dating from April, MINOS26 > sam dump project --station=minos --project=ahimmel-Cedar_phy_bhcurvNearSpillDaikon04-20080415-1315 | grep delivered ... 2304346: n13037581_0000_L010185N_D04.sntp.cedar_phy_bhcurv.0.root, size=1199787338K, swapped out, node = dcap://minos-02, delivered on 15 Apr 22:10:20 So will just stop and restart the station. 
changed station station_prd v6_0_1_17 minos --preferred-loc=enstore --excess-satisfaction=0 --pmaster-arg=--consumption-map=\.\*::dcap://minos-01,dcap://minos-02 --constrain-delivery=dcap://minos-01,dcap://minos-02 --route=dcap://minos-01::dcap://minos-01 --route=dcap://minos-02::dcap://minos-02 to station station_prd v6_0_5_24_srm minos --preferred-loc=enstore --excess-satisfaction=0 --pmaster-arg=--consumption-map=\.\*::dcap://minos-01,dcap://minos-02 --constrain-delivery=dcap://minos-01,dcap://minos-02 --route=dcap://minos-01::dcap://minos-01 --route=dcap://minos-02::dcap://minos-02 ups declare -c sam_cp_config v7_1 ups declare -c sam_station v6_0_5_24_srm -q GCC-3.1 ups declare -c sam_gsi_config v2_3_3 -q vdt ups declare -c sam_ns_ior v7_1_0 This failed, this was in the trace file : Non-compliant application error detected: operator->() was used on null pointer or nil object reference smaster: /fnal/ups/prd/orbacus/Linux-2-4/v3_3_4p1GCC-3-1/include/OB/Template.h:557: T* OBObjVar::operator->() const [with T = SAMStation_FileConsumer]: Assertion `(int)(ptr_ != 0)' failed. Falling back : ups declare -c sam_cp_config v7_0 ups declare -c sam_station v6_0_1_17 -q GCC-3.1 ups declare -c sam_gsi_config v2_2_8 Cleaned out messy products/upsdb/ups_config.bad on minos-sam02 ups list -aK+ | grep current | sort > ups01 scp minos-sam02:ups02 . MINOS-SAM01 > sdiff -s ups01 ups02 "oracle_client" "v8_1_7a" "Linux+2" "" "current" | "oracle_client" "v10_2_0_3" "Linux+2" "" "current" "oracle_tnsnames" "v42" "NULL" "" "current" | "oracle_tnsnames" "v45" "NULL" "" "current" > "sam_config" "v7_1_5" "NULL" "dbs_dev2" "current" > "sam_config" "v7_1_5" "NULL" "dbs_prd2" "current" > "sam_config" "v7_1_5" "NULL" "station_int" "current" "sam_cp_config" "v7_0" "NULL" "" "current" | "sam_cp_config" "v7_1" "NULL" "" "current" "sam_station" "v6_0_1_17" "Linux+2.4" "GCC-3.1" "current" | "sam_station" "v6_0_5_23_srm" "Linux+2.4" "GCC-3.1" "current" On minos-sam01 per the prescription from minos-sam02 2007 03 27 ups copy -G "oracle_client v10_2_0_3" oracle_instant_client v10_2_0_3 ups declare oracle_client v10_2_0_3 -f "Linux+2" -q "" -r "oracle_instant_client/v10_2_0_3/Linux+2" -z "/home/sam/products/upsdb" -U "ups" -m "oracle_instant_client.table" ups declare -c oracle_client v10_2_0_3 ln -s libclntsh.so /home/sam/products/oracle_instant_client/v10_2_0_3/Linux+2/libclntsh.so.8.0 Updated oracle_tnsnames on minos-sam01/2 upd install -j oracle_tnsnames v48 ups declare -c oracle_tnsnames v48 Let's try the upgrade again : less private/station__minos-sam01__station_prd__minos/trace 04/17/08 15:53:13 minos.SM.REVIVER 26613: 17 projects found Non-compliant application error detected: operator->() was used on null pointer or nil object reference smaster: /fnal/ups/prd/orbacus/Linux-2-4/v3_3_4p1GCC-3-1/include/OB/Template.h:557: T* OBObjVar::operator->() const [with T = SAMStation_FileConsumer]: Assertion `(int)(ptr_ != 0)' failed. MINOS-SAM01 > ups declare -c sam_gsi_config v2_2_8 -q vdt ERROR: Invalid Specification for Declare - Specification must include a single flavor EH ?????? I can no longer redeclare sam_gsi_config The old one did not have the vdt qualifier! 
A clean fallback : ups declare -c sam_cp_config v7_0 ups declare -c sam_station v6_0_1_17 -q GCC-3.1 ups undeclare -c sam_gsi_config -q vdt ups declare -c sam_gsi_config v2_2_8 2008 04 18 Another product comparison : ups list -aK+ | grep current | sort > ups01a sdiff -s ups01a ups02 "oracle_tnsnames" "v48" "NULL" "" "current" | "oracle_tnsnames" "v45" "NULL" "" "current" > "sam_config" "v7_1_5" "NULL" "dbs_dev2" "current" > "sam_config" "v7_1_5" "NULL" "dbs_prd2" "current" > "sam_config" "v7_1_5" "NULL" "station_int" "current" "sam_cp_config" "v7_0" "NULL" "" "current" | "sam_cp_config" "v7_1" "NULL" "" "current" > "sam_gsi_config" "v2_3_3" "NULL" "vdt" "current" "sam_station" "v6_0_1_17" "Linux+2.4" "GCC-3.1" "current" | "sam_station" "v6_0_5_23_srm" "Linux+2.4" "GCC-3.1" "current" captured ups tailor sam_config -> station_prd on sam01, station_dev on sam02 cat sc01 | tr -s ' ' | cut -f 3 -d ' ' | sort > sc01s cat sc02 | tr -s ' ' | cut -f 3 -d ' ' | sort > sc02s Trying again, with station v6_0_5_23_srm ups declare -c sam_cp_config v7_1 ups declare -c sam_station v6_0_5_23_srm -q GCC-3.1 ups undeclare -c sam_gsi_config ups declare -c sam_gsi_config v2_3_3 -q vdt Failed as before, fell back WITHOUT changing product declarations. Removed all the stale projects : MINOS26 > sam dump station --projects *** BEGIN DUMP STATION minos version v6_0_1_17 running at minos-sam01.fnal.gov 1 minutes 49 seconds, admins: buckley kreymer rhatcher sam Replica selection: prefer (enstore), avoid (empty) There are 207 authorized transfer groups Full delivery unit is enforced; external deliveries are constrained to dcap://minos-01 dcap://minos-02 Excess consumer satisfaction: 0 PROJECT MANAGER: fileReleaseTO = 1 days : maxConsumer Wait time = 1 days, max prefetched files : 5 STATION PROJECTS (0 already ended, 0 prematurely): project hartnell-PTSimFDCosmicMuLowE-20071111-1403(6350) user hartnell.minos started 18 Apr 14:56:24 UNIX pid 1222 contains 1164 total files: 0 given to project, 0 delivery errors, and 1164 still wanted (of these 0 in cache, 0 locked) project hartnell-PTSimFDCosmicMuLowE-20071111-1436(6351) user hartnell.minos started 18 Apr 14:56:24 UNIX pid 1575 contains 1162 total files: 0 given to project, 0 delivery errors, and 1162 still wanted (of these 0 in cache, 0 locked) project rmehdi-PTSimNDL010185N-D00-R0-R1001-20071205-1051(6588) user rmehdi.minos started 18 Apr 14:56:24 UNIX pid 10464 contains 10 total files: 0 given to project, 0 delivery errors, and 10 still wanted (of these 0 in cache, 0 locked) project rmehdi-PTSimFDL010185N-D00-R0-R1011-20071205-1123(6592) user rmehdi.minos started 18 Apr 14:56:24 UNIX pid 10871 contains 9 total files: 0 given to project, 0 delivery errors, and 9 still wanted (of these 0 in cache, 0 locked) project rmehdi-PTSimFDL010185N-D00-R0-R1011-20071205-1127(6593) user rmehdi.minos started 18 Apr 14:56:24 UNIX pid 10876 contains 9 total files: 0 given to project, 0 delivery errors, and 9 still wanted (of these 0 in cache, 0 locked) project rmehdi-PTSimNDL010185N-D00-R0-R1001-20071205-1131(6594) user rmehdi.minos started 18 Apr 14:56:24 UNIX pid 11069 contains 8 total files: 0 given to project, 0 delivery errors, and 8 still wanted (of these 0 in cache, 0 locked) project ahimmel-PreRevBfld2007-20080130-1315(6930) user ahimmel.minos started 18 Apr 14:56:25 UNIX pid 31599 contains 2436 total files: 0 given to project, 0 delivery errors, and 2436 still wanted (of these 0 in cache, 0 locked) project ahimmel-PreRevBfld2007-20080130-1316(6931) user ahimmel.minos started 
18 Apr 14:56:25 UNIX pid 31602 contains 2444 total files: 0 given to project, 0 delivery errors, and 2444 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080221-1704(7454) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 1487 contains 367 total files: 0 given to project, 0 delivery errors, and 367 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080221-1716(7456) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 1742 contains 367 total files: 0 given to project, 0 delivery errors, and 367 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0303(7465) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 12547 contains 367 total files: 0 given to project, 0 delivery errors, and 367 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0616(7469) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 16133 contains 367 total files: 0 given to project, 0 delivery errors, and 367 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0627(7470) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 16148 contains 367 total files: 0 given to project, 0 delivery errors, and 367 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04_many-20080222-0904(7472) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 19392 contains 87 total files: 0 given to project, 0 delivery errors, and 87 still wanted (of these 0 in cache, 0 locked) project ahimmel-Cedar_phyDaikon00-20080228-1332(7513) user ahimmel.minos started 18 Apr 14:56:26 UNIX pid 5654 contains 1305 total files: 0 given to project, 0 delivery errors, and 1305 still wanted (of these 0 in cache, 0 locked) project ahimmel-Cedar_phy_bhcurvNearSpillDaikon04-20080328-1734(7634) user ahimmel.minos started 18 Apr 14:56:26 UNIX pid 26913 contains 919 total files: 0 given to project, 0 delivery errors, and 919 still wanted (of these 0 in cache, 0 locked) project ahimmel-Cedar_phy_bhcurvNearSpillDaikon04-20080415-1315(7711) user ahimmel.minos started 18 Apr 14:56:26 UNIX pid 18062 contains 709 total files: 0 given to project, 0 delivery errors, and 709 still wanted (of these 0 in cache, 0 locked) SAMPS=' hartnell-PTSimFDCosmicMuLowE-20071111-1436 rmehdi-PTSimNDL010185N-D00-R0-R1001-20071205-1051 rmehdi-PTSimFDL010185N-D00-R0-R1011-20071205-1123 rmehdi-PTSimFDL010185N-D00-R0-R1011-20071205-1127 rmehdi-PTSimNDL010185N-D00-R0-R1001-20071205-1131 ahimmel-PreRevBfld2007-20080130-1315 ahimmel-PreRevBfld2007-20080130-1316 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080221-1704 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080221-1716 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0303 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0616 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0627 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04_many-20080222-0904 ahimmel-Cedar_phyDaikon00-20080228-1332 ahimmel-Cedar_phy_bhcurvNearSpillDaikon04-20080328-1734 ahimmel-Cedar_phy_bhcurvNearSpillDaikon04-20080415-1315 ' SAMP=hartnell-PTSimFDCosmicMuLowE-20071111-1403 sam stop project --project=${SAMP} --force for SAMP in $SAMPS ; do sam stop project --project=${SAMP} --force ; done Reported as Sam IT ########## # CONDOR # ########## minos03 condor/local cloned from 
condor-6.8.6 to condor-7.0.1 about 09:15 MINOS03 > sudo /etc/init.d/condor start Starting up Condor This worked, requested minos 01-02 04-06 08-13 updates. The cluster is quite idle, will do second half today. CONODES='minos01 minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10 minos11 minos12 minos13' for NODE in ${CONODES} ; do printf "${NODE} " ssh -ax ${NODE} "ls -l /opt/condor-7.0.1/local" ; done for NODE in ${CONODES} ; do printf "${NODE} " ssh -ax ${NODE} "grep NUM_CPUS /opt/condor-7.0.1/local/condor_config.local" ; done minos01 NUM_CPUS = 1 minos02 NUM_CPUS = 1 minos03 NUM_CPUS = 2 minos04 NUM_CPUS = 2 minos05 NUM_CPUS = 2 minos06 NUM_CPUS = 2 minos07 NUM_CPUS = 1 minos08 NUM_CPUS = 2 minos09 NUM_CPUS = 2 minos10 NUM_CPUS = 2 minos11 NUM_CPUS = 0 minos12 NUM_CPUS = 2 minos13 NUM_CPUS = 0 for NN in 01 02 04 05 06 08 09 10 ; do printf "${NN} " ssh -ax minos${NN} "sudo /etc/init.d/condor start" ; done 01 Starting up Condor 02 Starting up Condor 04 Starting up Condor 05 Starting up Condor 06 Starting up Condor 08 Starting up Condor 09 Starting up Condor 10 Starting up Condor CONN='14 15 16 17 18 19 20 21 22 23 24' for SYS in ${CONN} ; do condor_off -peaceful minos${SYS} -subsystem startd ; done Sent "Set-Peaceful-Shutdown" command to startd minos14.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos14.fnal.gov ... MINOS25 > condor_status | grep minos2 vm1@minos20.f LINUX INTEL Claimed Retiring 1.000 2026 0+00:00:04 vm2@minos20.f LINUX INTEL Unclaimed Idle 0.000 2026 0+00:06:16 MINOS25 > condor_status | grep minos1 slot1@minos10 LINUX INTEL Unclaimed Idle 0.000 2026 0+00:05:04 slot2@minos10 LINUX INTEL Unclaimed Idle 0.000 2026 0+00:05:05 vm1@minos14.f LINUX INTEL Claimed Retiring 1.000 2026 0+00:00:04 vm2@minos14.f LINUX INTEL Claimed Retiring 1.000 2026 0+00:00:05 vm1@minos16.f LINUX INTEL Claimed Retiring 1.000 2026 0+00:00:04 vm2@minos16.f LINUX INTEL Unclaimed Idle 0.050 2026 0+01:42:04 So waiting for 14, 16 20 to retire At 13:50, two of the four have finished. MINOS25 > condor_q -r scavan | grep vm 66761.0 scavan 4/15 17:57 1+03:19:21 vm1@minos20.fnal.gov 66764.0 scavan 4/15 18:07 1+04:01:50 vm1@minos14.fnal.gov Some of these jobs migrated to the new nodes. Looking in the log : MINOS25 > condor_q -l 66744.0 | grep UserLog UserLog = "/minos/scratch/scavan/CondorTest/tmp/entR.log.66744.0" MINOS25 > less /minos/scratch/scavan/CondorTest/tmp/entR.log.66744.0 000 (66744.000.000) 04/15 16:27:09 Job submitted from host: <131.225.193.25:63984> ... 001 (66744.000.000) 04/16 02:45:50 Job executing on host: <131.225.193.24:64690> ... 006 (66744.000.000) 04/16 02:45:58 Image size of job updated: 97844 ... 022 (66744.000.000) 04/16 07:23:02 Job disconnected, attempting to reconnect Socket between submit and execute hosts closed unexpectedly Trying to reconnect to vm1@minos24.fnal.gov <131.225.193.24:64690> ... 024 (66744.000.000) 04/16 07:23:02 Job reconnection failed Job disconnected too long: JobLeaseDuration (3600 seconds) expired Can not reconnect to vm1@minos24.fnal.gov, rescheduling job ... 001 (66744.000.000) 04/16 07:40:06 Job executing on host: <131.225.193.23:62702> ... 006 (66744.000.000) 04/16 07:40:14 Image size of job updated: 334880 ... 022 (66744.000.000) 04/16 12:07:53 Job disconnected, attempting to reconnect Socket between submit and execute hosts closed unexpectedly Trying to reconnect to vm2@minos23.fnal.gov <131.225.193.23:62702> ... 
Seems to fail every 04:40 or so 02:45 07:23 07:40 12:07 13:10 19:04 19:06 23:23 23:25 03:42 03:45 08:03 08:09 12:25 12:30 on host: <131.225.193.4:62679> There's only 1 job left at 6.8.6 as of 14:00, less /minos/scratch/scavan/CondorTest/tmp/entR.log.66761.0 04:50 09:27 14:50 20:27 20:30 00:57 00:59 05:20 05:25 09:42 09:50 001 (66761.000.000) 04/17 09:50:06 Job executing on host: <131.225.193.20:62378> ... 005 (66761.000.000) 04/17 14:03:16 Job terminated. (1) Normal termination (return value 0) Usr 0 04:03:04, Sys 0 00:09:15 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 04:03:04, Sys 0 00:09:15 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 1703849 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 1703849 - Total Bytes Sent By Job 0 - Total Bytes Received By Job for NN in ${CONN} ; do ssh -ax minos${NN} "sudo /etc/init.d/condor stop" ; done Thu Apr 17 19:24:41 UTC 2008 Date: Thu, 17 Apr 2008 19:29:19 +0000 (UTC) Subject: Re: HelpDesk ticket 114294 has additional info. Sent request for remaining copies and ln's for minos14 though 24 ( not 25 ) Apr 17 15:34 /opt/condor -> condor-7.0.1 for NN in ${CONN} ; do ssh -ax minos${NN} "sudo /etc/init.d/condor start" ; done ########## # CONDOR # ########## Created probenode.run control file, Ran successfully on minos07. Renamed to probemachine.run Renamed logs to logs/machine/ ############ # MCIMPORT # ############ RDIRS='712 713 714 715 716 717 718' for DIR in ${RDIRS}; do ./mcimport.20080326 -n -T -s n1104 daikon_04/L010185N/near/${DIR} done \ | grep NFILES NFILES 0 NFILES 298 NFILES 308 NFILES 309 NFILES 308 NFILES 310 NFILES 309 RDIRS='713 714 715 716 717 718' for DIR in ${RDIRS}; do ./mcimport.20080326 -T -s n1104 daikon_04/L010185N/near/${DIR} done Thu Apr 17 09:50:32 CDT 2008 ============================================================================= 2008 04 16 11188687 /pnfs/minos/stage/daikon_04 ########## # CONDOR # ########## cd scripts/condor686 for NODE in ${NODES} ; do printf "${NODE} " rcp ${NODE}:/opt/condor-6.8.6/local/condor_config.local local.${NODE} ; done ####### # SAM # ####### On minos-sam02, ./init_sam -n minos minos v4_32 All looks clean In production, ./init_sam -n minos minos v4_32 ####### # SAM # ####### export SAM_ORACLE_CONNECT="samdbs/" samadmin purge zombie projects --station=minos --startedBefore=yesterday --test 18 candidate projects found in the database... Determining which projects are still registered in the NameService... The following 17 projects are still registered in the NameService and are not eligible for termination: ####### # SAM # ####### sam_products v4_32 For sam_station v6_0_5_24_srm -q GCC-3.1" and updating In kreymer products environment on minos26, version=v4_32 oversion=v4_31 samprod=sam_products FLVR=NULL cd ${PRODUCTS}/../prd/${samprod} cp -ar ${oversion} ${version} cd ${version}/${FLVR} ups declare ${samprod} ${version} -f ${FLVR} -r ${samprod}/${version}/${FLVR} -m ${samprod}.table nedit ups/${samprod}.table sam_station v6_0_5_24_srm sam_bootstrap v8_1_1 cd ~/minos/scripts ./updadd ${FLVR} ${samprod} ${version} upd list -l ${samprod} ${version} upd modproduct -g "minos" ${samprod} ${version} -f ${FLVR} 09:52 ########## # CONDOR # ########## The condor queues have drained as desired. 
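For the record, the drain can be confirmed with the usual tools before stopping the daemons. A sketch :
# no slots should remain Claimed, and no jobs should still be running
condor_status | grep Claimed
condor_q -run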
CONODES='minos01 minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10 minos11 minos12 minos13' for NODE in ${CONODES} ; do printf "${NODE} " ssh -ax ${NODE} "ps -u condor" ; done condor_master running on all but minos02 and minos12 for NODE in ${CONODES} ; do printf "${NODE} " ssh -ax ${NODE} "sudo /etc/init.d/condor stop" ; done minos01 Shutting down Condor (fast-shutdown mode) minos02 Condor not running minos03 Shutting down Condor (fast-shutdown mode) minos04 Shutting down Condor (fast-shutdown mode) minos05 Shutting down Condor (fast-shutdown mode) minos06 Shutting down Condor (fast-shutdown mode) minos07 Shutting down Condor (fast-shutdown mode) minos08 Shutting down Condor (fast-shutdown mode) minos09 Shutting down Condor (fast-shutdown mode) minos10 Shutting down Condor (fast-shutdown mode) minos11 Shutting down Condor (fast-shutdown mode) minos12 Condor not running minos13 Shutting down Condor (fast-shutdown mode) Date: Wed, 16 Apr 2008 09:29:15 -0500 (CDT) Subject: HelpDesk ticket 114294 ___________________________________________ Short Description: Part 2 of 4 Condor 7.0.1 upgrade for Minos Problem Description: run2-sys : I have drained the virtual machines, and stopped condor on minos01-13 Please, at your next convenience, as root on minos01 through minos13 cd /opt ln -sf condor-7.0.1 condor and inform minos-admin. I will then restart Condor on most of these nodes. We plan to upgrade minos14 through minos24 tomorrow, and the Condor master minos25 next week. Thanks ! ___________________________________________ Date: Wed, 16 Apr 2008 09:36:44 -0500 (CDT) This ticket has been reassigned to JONES, TERRY of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 16 Apr 2008 14:12:14 -0500 (CDT) Solution: jonest@fnal.gov sent this solution: > This task is complete > > minos01= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos02= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos03= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos04= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos05= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos06= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos07= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos08= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos09= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos10= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos11= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos12= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos13= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 ___________________________________________ As root on minos03 : cd /opt cp -r condor-6.8.6/local condor-7.0.1/local ___________________________________________ 17 April I have drained the queues on the remaining Minos Condor workers. Please update these remaining nodes : On minos14 through minos24 ( but NOT on minos25 ) cd /opt cp -r condor-6.8.6/local condor-7.0.1/local ln -sf condor-7.0.1 condor ___________________________________________ for NODE in ${CONODES} ; do printf "${NODE} " ssh -ax ${NODE} "ls -l /opt/condor " ; done Select nodes which should run, and start them. 
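Before starting the selected nodes, a quick sanity check of the symlink and the new local config would catch the "Can't read config source" errors seen below (a sketch, reusing the ${CONODES} list defined above):

for NODE in ${CONODES} ; do printf "${NODE} "
  ssh -ax ${NODE} "ls -ld /opt/condor ; test -r /opt/condor/local/condor_config.local && echo 'config OK' || echo 'config MISSING'"
done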
CUNODES='minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10' for NODE in ${CUNODES} ; do printf "${NODE} " ssh -ax ${NODE} "sudo /etc/init.d/condor start" ; done minos02 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos03 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos04 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos05 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos06 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos07 Starting up Condor minos08 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos09 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos10 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local ########## # CONDOR # ########## minos01 through minos10 graceful off, minos11 through minos13 are not active, update them anyway. cd scripts/condor701 for NODE in ${NODES} ; do printf "${NODE} " rcp ${NODE}:/opt/condor-7.0.1/local.${NODE}/condor_config.local condor_config.local.${NODE} ; done diff condor_config.local.minos01 condor_config.local.minosNN differences are like CONDOR_HOST = minos01.fnal.gov CONDOR_ADMIN = root@minos01.fnal.gov COLLECTOR_NAME = Personal Condor at minos01.fnal.gov LOCK = /tmp/condor-lock.$(HOSTNAME)0.251031510372744 Except minos11, minos24 which contain additional > > ## Java parameters: > ## If you would like this machine to be able to run Java jobs, > ## then set JAVA to the path of your JVM binary. If you are not > ## interested in Java, there is no harm in leaving this entry > ## empty or incorrect. > > JAVA = /usr/bin/java > > > ## Some JVMs need to be told the maximum amount of heap memory > ## to offer to the process. If your JVM supports this, give > ## the argument here, and Condor will fill in the memory amount. > ## If left blank, your JVM will choose some default value, > ## typically 64 MB. The default (-Xmx) works with the Sun JVM. > > JAVA_MAXHEAP_ARGUMENT = -Xmx > Let's also grab the etc/condor_config's for NODE in ${NODES} ; do printf "${NODE} " rcp ${NODE}:/opt/condor-7.0.1/etc/condor_config condor_config.${NODE} ; done ============================================================================= 2008 04 15 ############ # MCIMPORT # ############ Updated kreymer-doe.proxy, per condorproxy content kreymer@minos26 cd /local/scratch26/kreymer/grid . /minos/scratch/kreymer/VDT/setup.sh HOURS=10000 # 8760 ? voms-proxy-init \ -voms fermilab:/fermilab/minos \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymer-doe.proxy \ -valid 10000:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Creating temporary proxy ................................................... Done Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Done Warning: fg6x1.fnal.gov:15001: The validity of this VOMS AC in your proxy is shortened to 86400 seconds! Creating proxy ....................................................... 
Done Warning: your certificate and proxy will expire Wed Mar 25 14:45:40 2009 which is within the requested lifetime of the proxy voms-proxy-info -all -file kreymer-doe.proxy scp kreymer-doe.proxy mindata@minos26:/home/mindata/.grid/kreymer-doe.proxy Tue Apr 15 12:10:09 CDT 2008 srmcp is failing like Tue Apr 15 12:37:19 CDT 2008: rs.state = Failed rs.error = RequestFileStatus#-2145068121 failed with error:[ at Tue Apr 15 12:37:15 CDT 2008 state Failed : user has no permission to write into path /pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_04/L010185N/747 ] voms-proxy-init \ -voms fermilab:/fermilab/minos/Role=Production \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymer-doe.proxy \ -valid 10000:0 Still no good, tested with ./mcimport -b 1 OVERLAY tail /home/mindata/STAGE/OVERLAY/log/mcimport.log grid-proxy-init \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymer-grid.proxy \ -valid 999999:00 Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Enter GRID pass phrase for this identity: Creating proxy ..................................... Done Warning: your certificate and proxy will expire Wed Mar 25 14:45:40 2009 which is within the requested lifetime of the proxy renamed this file to /home/mindata/.grid/kreymer-grid.proxy ln -s kreymer-doe.proxy kreymer-grid.proxy This works ! Bottom line, the voms proxy cannot write to DCache via SRM. ####### # SAM # ####### On minos-sam02, preparing for station upgrade, sam_station v6_0_5_23_srm -q GCC-3.1" ./init_sam -n minos minos v4_31 could not find sam_par_ret Our init_sam is out of date, MINOS-SAM02 > dds init* -rwxr-xr-x 1 sam 5024 21142 Apr 15 11:13 init_new* -rwxr-xr-x 1 sam 5024 20941 Oct 18 2005 init_sam* MINOS-SAM02 > cp -a init_sam init_sam.20051018 MINOS-SAM02 > cp -a init_new init_sam MINOS-SAM02 > diff init_sam init_new name change from sam_par_ret to sam_test_project MINOS-SAM02 > ./init_sam -n minos minos v4_31 =========================================================== =========================================================== == SAM station installation Tue Apr 15 11:18:05 CDT 2008 == == init_sam Version 2005 02 28 == == on minos-sam02.fnal.gov in /home/sam =========================================================== =========================================================== OK - experiment minos OK - station minos OK - UPD_HOST fnkits.fnal.gov OK - cleaning out local configuration files OK - can create files in /home/sam OK - others can read this directory OK - checking source of distribution OK - we can ftp to fnkits.fnal.gov OK - PRODUCTS_ROOT = /home/sam/products is present OK - getting installation scripts from fnkits.fnal.gov OK - got bootups and config scripts OK - init_sam is up to date OK - already have products setup script OK - setting up ups OK - set up ups OK - sam_products v4_31 specified on command line OK - upd install -j sam_products v4_31 -h fnkits.fnal.gov informational: installed sam_products v4_31. upd install succeeded. 
OK, not really installing, because of -n option Listing existing and needed products below - have it, and it is current ups declare -c - have it, would make it current NEED - would need to install the product orbacus v3_3_4p1 -q GCC-3.1 python v2_1 ups declare -c sam_bootstrap v8_1_0 sam_cp v7_2 NEED sam_cp_config v7_1 sam_dcache_cp v7_1 sam_kerberos_rcp v4_0_11 NEED sam_station v6_0_5_23_srm -q GCC-3.1 setpath v1_11 perl v5_8 sam_gridftp v2_1_2 -q vdt NEED sam_gsi_config v2_3_3 -q vdt sam_gsi_config_util v2_1 -q vdt vdt v1_3_0_1 pacman v2_116 sam v8_2_2 samgrid_batch_adapter v7_0_0 ups declare -c sam_ns_ior v7_1_0 sam_config v7_1_5 Installed the needed bits by hand upd install -j sam_cp_config v7_1 Creating version link in /home/sam/products/upsdb/sam_cp_config/Symlinks for sam_cp_config v7_1. Note: the sam_cp_config template MAY have changed. Please, merge the differences (if any) between your current configuration (/home/sam/products/upsdb/sam_cp_config/Config/sam_cp_config.py) and the new template (/home/sam/products/sam_cp_config/v7_1/NULL/ups/sam_cp_config_template.py) sam_cp_config configuration complete. informational: installed sam_cp_config v7_1. upd install succeeded. upd install -j sam_station v6_0_5_23_srm -q GCC-3.1 informational: installed sam_station v6_0_5_23_srm. upd install succeeded. upd install -j sam_gsi_config v2_3_3 -q vdt ************************************************************************** If you are installing the product for the first time, you should execute the command ups tailor sam_gsi_config v2_3_3 ************************************************************************** informational: installed sam_gsi_config v2_3_3. upd install succeeded. disabled dev station, nedit private/minos-sam02_server_list.txt disabled station, added new station version v6_0_5_23_srm ups update sam_bootstrap ups list -K+ sam_bootstrap "sam_bootstrap" "v8_1_1" "NULL" "" "current" ups list -K+ sam_cp_config "sam_cp_config" "v7_0" "NULL" "" "current" ups list -K+ sam_station -q GCC-3.1 "sam_station" "v6_0_1_17" "Linux+2.4" "GCC-3.1" "current" ups list -K+ sam_gsi_config "sam_gsi_config" "v2_2_8" "NULL" "" "current" ups declare -c sam_bootstrap v8_1_0 ups declare -c sam_cp_config v7_1 ups declare -c sam_station v6_0_5_23_srm -q GCC-3.1 ups declare -c sam_gsi_config v2_3_3 -q vdt ups update sam_bootstrap MINOS-SAM02 > sam dump station --station=minos --all *** BEGIN DUMP STATION minos version v6_0_5_23_srm running at minos-sam02.fnal.gov 40 seconds, admins: buckley kreymer rhatcher sam Replica selection: prefer (enstore), avoid (empty) There are 0 authorized transfer groups Full delivery unit is enforced; external deliveries are constrained to dcap://minos-01 dcap://minos-02 Excess consumer satisfaction: 0 AUTHORIZED GROUPS: group minos: admins: sam , swap policy: LRU, fair share: 1, quotas (cur/max): projects = 0/1000, disk: 104080724KB/10240000MB, locks:0B/0B STATION DISKS: disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 401805604B/52428800KB = 0.7% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 393714515B/52428800KB = 0.7% free station disk total: 795520119B/104857600KB = 0.7% free REQUESTED FILES: PROJECT MANAGER: fileReleaseTO = 1 days : maxConsumer Wait time = 1 days, max prefetched files : 5 NO PROJECTS ever run FAIR SHARE MAN: Benefit weights: volumes mounted: 0.2, CPU: 0.2, KBytes transferred from MSS: 0.2, KBytes transferred inter-station: 0.2, files consumed: 0.2 *** END OF STATION DUMP *** MINOS26 > ./sam_test_py 
minos dev MINOS26 > ./sam_test_project minos dev OK, move on to the latest station, v6_0_5_24_srm upd install -j sam_station v6_0_5_24_srm -q GCC-3.1 ups update sam_bootstrap # shut down dev station ups declare -c sam_station v6_0_5_23_srm -q GCC-3.1 edited server list ups update sam_bootstrap # shut down dev station Noticed that sam_bootstrap is already newer, put it back ups declare -c sam_bootstrap v8_1_1 OK, move on the preinstall on production station minos-sam01 . shrc/kreymer . setups.sh ./init_sam -n minos minos v4_31n # reported newer script, run again ./init_sam -n minos minos v4_31 =========================================================== =========================================================== == SAM station installation Tue Apr 15 14:12:12 CDT 2008 == == init_sam Version $Revision: 1.52 $ == on minos-sam01.fnal.gov in /home/sam =========================================================== =========================================================== OK - experiment minos OK - station minos OK - UPD_HOST fnkits.fnal.gov OK - cleaning out local configuration files OK - can create files in /home/sam OK - others can read this directory OK - checking source of distribution OK - we can ftp to cdfkits.fnal.gov OK - PRODUCTS_ROOT = /home/sam/products is present OK - getting installation scripts from cdfkits.fnal.gov OK - got bootups and config scripts OK - init_sam is up to date OK - already have products setup script OK - setting up ups OK - set up ups OK - backing up old config files to /home/sam/maint/backup/200804151412 cp: missing destination file Try `cp --help' for more information. cp: missing destination file Try `cp --help' for more information. cp: missing destination file Try `cp --help' for more information. OK - sam_products v4_31 specified on command line OK - upd install -j sam_products v4_31 -h fnkits.fnal.gov Unable to close datastream at /home/sam/products/upd/v4_6/NULL/src/updxfr.pm line 184 error: while attempting to ftp to ftp.fnal.gov: error: can't transfer //.register_test from ftp.fnal.gov to /tmp/upd19929_register_test Notice: Either this node is not registered on ftp.fnal.gov or ftp.fnal.gov is down informational: installed sam_products v4_31. upd install succeeded. OK, not really installing, because of -n option Listing existing and needed products below - have it, and it is current ups declare -c - have it, would make it current NEED - would need to install the product orbacus v3_3_4p1 -q GCC-3.1 python v2_1 ups declare -c sam_bootstrap v8_1_0 sam_cp v7_2 NEED sam_cp_config v7_1 sam_dcache_cp v7_1 sam_kerberos_rcp v4_0_11 NEED sam_station v6_0_5_23_srm -q GCC-3.1 setpath v1_11 perl v5_8 sam_gridftp v2_1_2 -q vdt NEED sam_gsi_config v2_3_3 -q vdt sam_gsi_config_util v2_1 -q vdt vdt v1_3_0_1 pacman v2_116 NEED sam v8_2_2 samgrid_batch_adapter v7_0_0 ups declare -c sam_ns_ior v7_1_0 sam_config v7_1_5 ------------------------------------------- ups list -K+ sam_cp_config "sam_cp_config" "v7_0" "NULL" "" "current" ups list -K+ sam_station -q GCC-3.1 "sam_station" "v6_0_1_17" "Linux+2.4" "GCC-3.1" "current" ups list -K+ sam_gsi_config "sam_gsi_config" "v2_2_8" "NULL" "" "current" ups list -K+ sam "sam" "v7_6_0" "Linux+2" "" "current" setup upd upd install -j sam_cp_config v7_1 Creating version link in /home/sam/products/upsdb/sam_cp_config/Symlinks for sam_cp_config v7_1. Note: the sam_cp_config template MAY have changed. 
Please, merge the differences (if any) between your current configuration (/home/sam/products/upsdb/sam_cp_config/Config/sam_cp_config.py) and the new template (/home/sam/products/sam_cp_config/v7_1/NULL/ups/sam_cp_config_template.py) sam_cp_config configuration complete. informational: installed sam_cp_config v7_1. upd install succeeded. upd install -j sam_station v6_0_5_24_srm -q GCC-3.1 informational: installed sam_station v6_0_5_24_srm. upd install succeeded. upd install -j sam_gsi_config v2_3_3 -q vdt ************************************************************************** If you are installing the product for the first time, you should execute the command ups tailor sam_gsi_config v2_3_3 ************************************************************************** informational: installed sam_gsi_config v2_3_3. upd install succeeded. upd install -j sam v8_2_2 Creating version link in /home/sam/products/upsdb/sam/Symlinks for sam v8_2_2. informational: installed sam v8_2_2. upd install succeeded. ups declare -c sam v8_2_2 Removing current link in /home/sam/products/upsdb/sam/Symlinks for sam v7_6_0. Creating current link in /home/sam/products/upsdb/sam/Symlinks for sam v8_2_2. When ready to upgrade, shut down station and ups declare -c sam_cp_config v7_1 ups declare -c sam_gsi_config v2_3_3 -q vdt ups declare -c sam_station v6_0_5_23_srm -q GCC-3.1 ups declare -c sam_ns_ior v7_1_0 ########## # CONDOR # ########## 08:57 for SYS in minos03 ; do condor_off -peaceful ${SYS} -subsystem startd ; done for SYS in 04 05 06 07 08 09 10 ; do condor_off -peaceful minos${SYS} -subsystem startd ; done ########## # CONDOR # ########## Need to uncomment #CREATE_CORE_FILES = True in all of /opt/condor-7.0.1/etc/condor_config Request this as soon as minos07 is peaceful, do it on minos01 through minos25 Actually, do not request this now, per advice from sfiligoi. This affects core files from condor processes, not user processes. Until we have actual condor crashes, there is no need for this. Igor will be available for the master condor v7_0_1 upgrade with glexec support next week. So we can gradually migrate the workers this week. ============================================================================= 2008 04 14 ######## # DATA # ######## Date: Mon, 14 Apr 2008 15:42:13 -0500 (CDT) Subject: HelpDesk ticket 114191 ___________________________________________ Short Description: Quota request for BlueArc served /minos/scratch, for rahaman Problem Description: LSC/CSI : Please set an individual storage quota of 700 GBytes for user rahaman on the BlueArc served /minos/scratch volume. This in an increase from the existing 500 GBytes quota. ___________________________________________ Date: Mon, 14 Apr 2008 15:49:08 -0500 (CDT) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Solution: joes@fnal.gov sent this solution: Hi Art, rahaman quota has been increased to 700G ############ # MCIMPORT # ############ Noting network glitches on minos-sam03, every 10 minutes, with read data rate 6 MB/sec. The glitches toward 0 in ganlia monitoring seem to last 1 to 2 bins, There seem to be more than 10 bins per 5 minutes, probably 15. So probably 20 seconds samples. With data rates up around 12 MBytes/second, the glitches are at 5 minute intervals. Read rates were about 6 MB/s through 12:20, and about 12 MB/s after 14:40 Today, write rates are 14 to 17 MB/sec, glitches at intervals of roughly 4 minutes. 
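For the record, the ganglia sample-interval estimate above is just 5 minutes divided by the apparent bin count (bc assumed available):

echo "5 * 60 / 15" | bc    # 20 second bins, if ~15 bins per 5 minutes
echo "5 * 60 / 10" | bc    # 30 second bins, if only 10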
########## # CONDOR # ########## Waiting for minos07 graceful off, MINOS25 > condor_q -r | grep minos07 66127.0 scavan 4/13 06:18 0+07:55:58 minos07.fnal.gov Date: Mon, 14 Apr 2008 15:37:17 -0500 (CDT) Subject: HelpDesk ticket 114190 ___________________________________________ Short Description: Initial Condor 7.0.1 upgrade for Minos Problem Description: I have shut down the condor master on node mins07, for our first test of the upgrade to condor 7.0.1. I have already drained the virtual machine, and stopped condor. Please, at your next convenience, as root cd /opt ln -sf condor-7.0.1 condor and inform minos-admin. I will then try to restart condor on this single node. Thanks ! ___________________________________________ Date: Mon, 14 Apr 2008 15:49:10 -0500 (CDT) This ticket has been reassigned to JONES, TERRY of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 14 Apr 2008 16:01:29 -0500 MINOS07 > sudo /etc/init.d/condor start Starting up Condor ERROR "The following configuration macros appear to contain default values that must be changed before Condor will run. These macros are: hostallow_write (found on line 215 of /etc/condor/condor_config) " at line 242 in file condor_config.C ########## # CONDOR # ########## Disabled factproxy in crontab.minos26, obsolete. ########## # CONDOR # ########## Try getting a proxy before admin command : cd /local/scratch25/kreymer/.grid/ scp minos26:/local/scratch26/kreymer/grid/kreymerdoe.pem . scp minos26:/local/scratch26/kreymer/grid/kreymerdoekey.pem . scp minos26:/local/scratch26/kreymer/grid/kreymerdoe.inf . . /grid/app/minos/VDT/setup.sh . /minos/scratch/kreymer/VDT/setup.sh echo kreymerdoe.inf | voms-proxy-init \ -voms fermilab:/fermilab/minos \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -vomslife 1:0 \ -pwstdin Igor suggests setting X509_USER_PROXY Apparently not needed with the defaults as done above. Trying a kx509 proxy, see if I'm authorized kx509 kxlist -p voms-proxy-init \ -noregen \ -voms fermilab:/fermilab/minos/Role=pilot \ -vomslife 1:0 \ -valid 1:0 Nope, not authorized to write to DCache. Repeated with the DOE proxy, this seems to be a harmless test of good authorization. MINOS25 > condor_off -peaceful minos07 -subsystem startd Can't find address for startd minos07.fnal.gov Perhaps you need to query another pool. Sent "Kill-Daemon-Peacefully" command to master minos07.fnal.gov Now shot down condor on minos07 MINOS07 > ps -flu condor F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 5 S condor 3087 1 0 76 0 - 2017 - 2007 ? 00:39:03 /opt/condor/sbin/condor_master ########## # CONDOR # ########## Sent email to sfiligoi and minos_admin, regarding the following. The minos07 vm did not shut down on request. 
Tried again, MINOS07 > condor_off -peaceful minos07 -subsystem startd Sent "Set-Peaceful-Shutdown" command to startd minos07.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos07.fnal.gov MINOS07 > date Mon Apr 14 09:09:52 CDT 2008 Looking in /local/stage1/condor/log MasterLog 4/14 09:09:41 DaemonCore: PERMISSION DENIED to kreymer@fnal.gov from host <131.225.193.7:64699> for command 483 (DAEMON_OFF_PEACEFUL) StartLog 4/14 09:09:41 DC_AUTHENTICATE: received DC_AUTHENTICATE from <131.225.193.7:65345> 4/14 09:09:41 DC_AUTHENTICATE: received following ClassAd: MyType = "(unknown type)" TargetType = "(unknown type)" AuthMethods = "FS,GSI" CryptoMethods = "3DES,BLOWFISH" OutgoingNegotiation = "PREFERRED" Authentication = "OPTIONAL" Encryption = "OPTIONAL" Integrity = "OPTIONAL" Enact = "NO" Subsystem = "TOOL" ServerPid = 29099 SessionDuration = "3600" NewSession = "YES" RemoteVersion = "$CondorVersion: 6.8.6 Sep 13 2007 $" Command = 60016 4/14 09:09:41 DC_AUTHENTICATE: our_policy: MyType = "" TargetType = "" AuthMethods = "FS,GSI" CryptoMethods = "3DES,BLOWFISH" OutgoingNegotiation = "REQUIRED" Authentication = "REQUIRED" Encryption = "OPTIONAL" Integrity = "REQUIRED" Enact = "NO" Subsystem = "STARTD" ParentUniqueID = "minos07:3087:1195500376" ServerPid = 3088 SessionDuration = "3600" 4/14 09:09:41 DC_AUTHENTICATE: the_policy: MyType = "" TargetType = "" Authentication = "YES" Encryption = "NO" Integrity = "YES" AuthMethodsList = "FS,GSI" AuthMethods = "FS" CryptoMethods = "3DES,BLOWFISH" SessionDuration = "3600" Enact = "YES" 4/14 09:09:41 DC_AUTHENTICATE: generating 3DES key for session minos07:3088:1208182181:9838... 4/14 09:09:41 SECMAN: Sending following response ClassAd: MyType = "" TargetType = "" Authentication = "YES" Encryption = "NO" Integrity = "YES" AuthMethodsList = "FS,GSI" AuthMethods = "FS" CryptoMethods = "3DES,BLOWFISH" SessionDuration = "3600" Enact = "YES" 4/14 09:09:41 DC_AUTHENTICATE: generating 3DES key for session minos07:3088:1208182181:9838... 4/14 09:09:41 SECMAN: Sending following response ClassAd: MyType = "" TargetType = "" Authentication = "YES" Encryption = "NO" Integrity = "YES" AuthMethodsList = "FS,GSI" AuthMethods = "FS" CryptoMethods = "3DES,BLOWFISH" SessionDuration = "3600" Enact = "YES" RemoteVersion = "$CondorVersion: 6.8.6 Sep 13 2007 $" 4/14 09:09:41 SECMAN: new session, doing initial authentication. 4/14 09:09:41 DC_AUTHENTICATE: authenticating RIGHT NOW. 4/14 09:09:41 AUTHENTICATE: in authenticate( addr == NULL, methods == 'FS,GSI') 4/14 09:09:41 AUTHENTICATE: can still try these methods: FS,GSI 4/14 09:09:41 HANDSHAKE: in handshake(my_methods = 'FS,GSI') 4/14 09:09:41 HANDSHAKE: handshake() - i am the server 4/14 09:09:41 HANDSHAKE: client sent (methods == 36) 4/14 09:09:41 HANDSHAKE: i picked (method == 4) 4/14 09:09:41 HANDSHAKE: client received (method == 4) 4/14 09:09:41 AUTHENTICATE: will try to use 4 (FS) 4/14 09:09:41 FS: client template is /tmp/FS_XXXXXXXXX 4/14 09:09:41 FS: client filename is /tmp/FS_XXXWlt97z 4/14 09:09:41 AUTHENTICATE_FS: used dir /tmp/FS_XXXWlt97z, status: 1 4/14 09:09:41 AUTHENTICATE: auth_status == 4 (FS) 4/14 09:09:41 Authentication was a Success. 4/14 09:09:41 DC_AUTHENTICATE: mutual authentication to 131.225.193.7 complete. 4/14 09:09:41 DC_AUTHENTICATE: message authenticator enabled with key id minos07:3088:1208182181:9838. 
4/14 09:09:41 DC_AUTHENTICATE: sending session ad: MyType = "" TargetType = "" User = "kreymer@fnal.gov" Sid = "minos07:3088:1208182181:9838" ValidCommands = "5,60007,60011,448,452,457,470,60004,1200,1000,60005,60006,60012,60013,60015,60016" 4/14 09:09:41 DC_AUTHENTICATE: sent session minos07:3088:1208182181:9838 info! 4/14 09:09:41 DC_AUTHENTICATE: added incoming session id minos07:3088:1208182181:9838 to cache for 3600 seconds (return address is unknown). MyType = "" TargetType = "" Authentication = "YES" Encryption = "NO" Integrity = "YES" AuthMethodsList = "FS,GSI" CryptoMethods = "3DES,BLOWFISH" SessionDuration = "3600" Enact = "YES" AuthMethods = "FS" Subsystem = "TOOL" ServerPid = 29099 RemoteVersion = "$CondorVersion: 6.8.6 Sep 13 2007 $" User = "kreymer@fnal.gov" Sid = "minos07:3088:1208182181:9838" ValidCommands = "5,60007,60011,448,452,457,470,60004,1200,1000,60005,60006,60012,60013,60015,60016" 4/14 09:09:41 DC_AUTHENTICATE: setting sock->decode() 4/14 09:09:41 DC_AUTHENTICATE: allowing an empty message for sock. 4/14 09:09:41 DC_AUTHENTICATE: Success. 4/14 09:09:41 IPVERIFY: hoststring: minos07.fnal.gov 4/14 09:09:41 DaemonCore: PERMISSION DENIED to kreymer@fnal.gov from host <131.225.193.7:65345> for command 60016 (DC_SET_PEACEFUL_SHUTDOWN) StarterLog And for the record, on minos25 : MINOS25 > condor_off -peaceful minos07 -subsystem startd ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5004:Failed to get authorization from server. Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile) AUTHENTICATE:1004:Failed to authenticate using FS Can't send Set-Peaceful-Shutdown command to startd minos07.fnal.gov ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5004:Failed to get authorization from server. Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile) AUTHENTICATE:1004:Failed to authenticate using FS Sent "Kill-Daemon-Peacefully" command to master minos07.fnal.gov After doing this again with a proxy, MasterLog 4/14 10:00:19 DaemonCore: Command received via TCP from kreymer@fnal.gov from host <131.225.193.25:62454> 4/14 10:00:19 DaemonCore: received command 483 (DAEMON_OFF_PEACEFUL), calling handler (admin_command_handler) 4/14 10:00:19 Handling daemon-specific command for "STARTD" 4/14 10:00:19 Sent SIGTERM to STARTD (pid 3088) StartLog 4/14 10:00:19 DaemonCore: Command received via TCP from kreymer@fnal.gov from host <131.225.193.25:65144> 4/14 10:00:19 DaemonCore: received command 60016 (DC_SET_PEACEFUL_SHUTDOWN), calling handler (handle_set_peaceful_shutdown()) ############ # MCIMPORT # ############ du -sm /pnfs/minos/stage/daikon_04 9518682 /pnfs/minos/stage/daikon_04 Consistent with VOLUME_QUOTAS summary under enstore, 9316 GB 5.1 TB 31 March 7.6 TB 07 April 9.5 TB 14 April 10.9 TB 22 April , 15 tapes du -sm /minos/data/mcimport/STAGE/daikon_04 4390697 /minos/data/mcimport/STAGE/daikon_04 Using 13 tapes so far. So need 13 * ( 14 / 9.5 ) = 19.1 ( 20 ) tapes total. Have 15 allocated. Requesting 6 more, as we are still producing data. 
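A quick re-derivation of the tape estimate above (bc assumed; the 14 TB projected total is the 9.5 TB already on tape plus the ~4.4 TB still sitting in /minos/data/mcimport/STAGE/daikon_04):

echo "scale=1; 13 * 14 / 9.5" | bc    # 19.1 tapes, round up to ~20-21 while data keeps arriving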
There seem to be 12 available 81 CD-LTO4G1 none: emergency N/A N/A N/A 12 12/12 Date: Mon, 14 Apr 2008 08:49:01 -0500 (CDT) Ticket #: 114105 ___________________________________________ Short Description: Request 6 more LTO-4 tapes for Minos archives Problem Description: We have written Minos archival data to 13 LTO-4 tapes so far, and are continuing to archive data. We expect to need about 21 tapes for the present data set, but have only 15 allocated. Please make an additional 6 tapes available at your next convenience. ___________________________________________ This ticket is assigned to HARRISON, MICHAEL of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Tue, 15 Apr 2008 09:55:41 -0500 (CDT) Solution: mircea@fnal.gov sent this solution: The minos quota has been increased to 21 volumes. -Mike ___________________________________________ ######### # ADMIN # ######### Tried to update system status for minos09, 2008-04-03 06:24 minos09 No Estimate MINOS09 went down around 06:24. Reported to run2-sys. The resolution message is missing. This status can not be set for more than 3 days. Please go back and correct the date and time. ============================================================================= 2008 04 12 Saturday ######## # DATA # ######## RESUMED ALL CRONTABS AND TASKS by 15:12 CDT / 20:12 UTC kreymer@minos01 crontab crontab.minos01 kreymer@minos26 crontab crontab.dat mindata@minos26 crontab crontab.dat minfarm@fnpcsrv1 mv NOCAT.ok NOCAT mindata@minos-sam03 restarted, see below ######## # DATA # ######## Scanning D0 LTO4 tapes for errors, on d0mino01 NN=-1 while [ ${NN} -lt 500 ] ; do usleep 200000 (( NN++ )) VOL=`printf "PSA%3.3d\n" ${NN}` printf "${VOL}\n" enstore info --vol=${VOL} | grep wr_err | grep -v ': 0,' done PSA090 'sum_wr_err': 1, PSA251 'sum_wr_err': 1, PSA252 'sum_wr_err': 1, PSA253 'sum_wr_err': 1, ########### # ROUNDUP # ########### Fixes to roundup LISTS=/minos/data/minfarm/lists global substitution /home/minfarm/lists-> ${LISTS} Corrected logic handling type 1 and 3 errors ( ignore them ) cp -a AFSS/roundup.20080412 . ln -sf roundup.20080412 roundup # was roundup.20080409 ############ # MCIMPORT # ############ Tried to restart, it started up writing to 735, but files belonged in 736. Command line: encp --delayed-dismount 5 --verbose 4 /home/mindata/TAPE/n11037360_0002_L010185N_D04.tar.gz /pnfs/minos/stage/daikon_04/L010185N/near/735/n11037360_0002_L010185N_D04.tar.gz Interrupted. Removed the misplaced PAPER'd file from PNFS $ rm /pnfs/minos/stage/daikon_04/L010185N/near/735/n11037360_0003_L010185N_D04.tar.gz Checking ecrc files $ ls TAPE | wc -l 263 $ ls /minos/data/mcimport/TAR/daikon_04/L010185N/near/736 | wc -l 261 $ rm /pnfs/minos/stage/daikon_04/L010185N/near/735/n11037360_0003_L010185N_D04.tar.gz Start with the correct directory, wherever interrupted FDIRS='736 737 738 739 740 741 742 743 744 745 746 765 766 767 768 769 770 771 772' for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done ######## # DATA # ######## Apparently we've used another 200 GB /minos/data while the network was glitching. 
http://www-numi.fnal.gov/computing/dh/mdfree/2008/04/11.txt 552168 Fri Apr 11 17:56:05 CDT 2008 529399 Fri Apr 11 18:56:08 CDT 2008 483226 Fri Apr 11 19:56:12 CDT 2008 395550 Fri Apr 11 20:56:16 CDT 2008 346141 Fri Apr 11 21:56:19 CDT 2008 341292 Fri Apr 11 22:56:21 CDT 2008 332561 Fri Apr 11 23:56:25 CDT 2008 ########### # NETWORK # ########### Network status page indicates : http://computing.fnal.gov/cdsystemstatus/system/NETWORKS.html 2008-04-11 19:00 Hub Router Work on hub router complete. 2008-04-11 17:15 Hub Router 2 hours Hub Router ACL configuration problem, service should be normal but work continues to complete configuration. VB The AFS monitoring indicates global timeouts 16:38 through 16:57 17:19 through 17:22 ============================================================================= 2008 04 11 ########### # NETWORK # ########### Posted note to http://computing.fnal.gov/cdsystemstatus/system/MINOS.html 2008-04-11 17:00 network No Estimate Major network disruptions at Fermilab. Intermittent connections. Helpdesk interface is down. The network flaked out for a few minutes, 17:05 to 17:08 CDT. I also see data dropouts from ganglia 16:36 through 16:56. ECRC /home/mindata/TAPE/n11037368_0007_L010185N_D04.tar.gz ./mcimport.20080326: line 308: ecrc: command not found COPY n11037368_0008_L010185N_D04.tar.gz ECRC /home/mindata/TAPE/n11037368_0008_L010185N_D04.tar.gz ./mcimport.20080326: line 308: ecrc: command not found COPY n11037368_0009_L010185N_D04.tar.gz ECRC /home/mindata/TAPE/n11037368_0009_L010185N_D04.tar.gz ./mcimport.20080326: line 308: ecrc: command not found I'm not sure how much else is dead. Saved this file as /home/kreymer/minosLOG20080411 on desktop Shutting down all that I can. kreymer@minos01 crontab -r kreymer@minos26 crontab -r mindata@minos26 crontab -r minfarm@fnpcsrv1 mv NOCAT.ok NOCAT mindata@minos-sam03 Interrupted at COPY n11037369_0001_L010185N_D04.tar.gz $ find /minos/data/mcimport/TAR/daikon_04/L010185N/near/736/ -size 0 /minos/data/mcimport/TAR/daikon_04/L010185N/near/736/n11037368_0007_L010185N_D04.ecrc /minos/data/mcimport/TAR/daikon_04/L010185N/near/736/n11037368_0008_L010185N_D04.ecrc /minos/data/mcimport/TAR/daikon_04/L010185N/near/736/n11037368_0009_L010185N_D04.ecrc $ find /minos/data/mcimport/TAR/daikon_04/L010185N/near/736/ -size 0 -exec rm {} \; -print ######## # FARM # ######## Reviewing handling of type 1 and 3 errors in the bad list. These should both be ignored. Instead, they seem to have been selected for ZAP runs. Need to recall distinction between ZAP and other runs. The logic seems to be reversed, Also rubin states we should be using bad_runs files under /minos/data/minfarm/lists DUH. When did this change ? The old files still sit in /home/minfarm/lists SRV1> ls -l /home/minfarm/lists/bad* -tr ... 
-rw-rw-r-- 1 rubin numi 5240 Feb 18 17:19 /home/minfarm/lists/bad_runs.cedar -rw-rw-r-- 1 rubin numi 5537 Feb 18 17:19 /home/minfarm/lists/bad_runs.cedar_phy_bhcurv -rw-rw-r-- 1 rubin numi 6012 Feb 23 15:26 /home/minfarm/lists/bad_runs_mc.cedar_phy -rw-rw-r-- 1 rubin numi 8570 Feb 24 14:26 /home/minfarm/lists/bad_runs_mc.cedar_phy_bhcurv -rw-rw-r-- 1 rubin numi 254357 Feb 25 05:16 /home/minfarm/lists/bad_runs_mrcc_mc.cedar_phy SRV1> ls -l /minos/data/minfarm/lists/bad* -tr -rw-rw-r-- 1 rubin numi 3401 Mar 5 02:53 /minos/data/minfarm/lists/bad_runs_mc.cedar -rw-rw-r-- 1 rubin numi 5797 Mar 25 16:31 /minos/data/minfarm/lists/bad_runs.cedar_phy_bhcurv -rw-rw-r-- 1 minospro numi 0 Mar 27 00:20 /minos/data/minfarm/lists/bad_runs.cedar_phy_mboone -rw-rw-r-- 1 rubin numi 5794 Apr 8 23:50 /minos/data/minfarm/lists/bad_runs.cedar -rw-rw-r-- 1 rubin numi 9301 Apr 11 14:20 /minos/data/minfarm/lists/bad_runs_mc.cedar_phy_bhcurv Reviewing /home/minfarm usage in roundup # SUPDIR - contains *.sup suppressed subrun lists SUPDIR=/home/minfarm/lists/daq_lists/sup /home/minfarm/lists/daq_lists -> /minos/data/minfarm/lists/daq_lists/ . /home/minfarm/scripts/setup_minossoft_R1_18_4.sh R1.18.4 That's OK cat /home/minfarm/lists/daq_lists/sup/*.sup > ${ROUNTMP}/SUPPRESSED That should use SUPDIR NOSPILL=/home/minfarm/lists/no_spill.${REL} These exist in /minos/data/minfarm/lists, but only C, CPB, CP_mboone if [ "${MCDET}" ] ; then BADRUNS=/home/minfarm/lists/bad_runs_mc.${REL} ZAPRUNS=/home/minfarm/lists/zap_runs_mc.${REL} else BADRUNS=/home/minfarm/lists/bad_runs.${REL} ZAPRUNS=/home/minfarm/lists/zap_runs.${REL} fi BADRUNS=/home/minfarm/lists/bad_runs.${REL} [ "${STRP}" == "mrnt" ] && BADRUNS=/home/minfarm/lists/bad_runs_mrcc.${REL} [ "${MCDET}" ] && BADRUNS=/home/minfarm/lists/bad_runs_mc.${REL} [ ! -r "${BADRUNS}" ] && BADRUNS=/dev/null -------------------------- Fixes to roundup LISTS=/minos/data/minfarm/lists global substitution /home/minfarm/lists-> ${LISTS} cp -a roundup.20080410 . ln -sf roundup.20080410 roundup # was ############ # MCIMPORT # ############ Continue with forward, per rhatcher advice, FDIRS=' 735 736 737 738 739 740 741 742 743 744 745 746 765 766 767 768 769 770 771 772 ' for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done Then can continue with reverse, RDIRS='712 713 714 715 716 717 718' ' for DIR in ${RDIRS}; do ./mcimport.20080326 -T -s n1104 daikon_04/L010185N/near/${DIR} done Started this, saw message, and interrupted. WILL ENCP 1 files HAVE n11037259_0017_L010185N_D04.tar.gz OOPS n11037259_0017_L010185N_D04.tar.gz not in PNFS This is the same stray file spotted before going to 765 $ less /minos/data/mcimport/TAR/daikon_04/L010185N/near/725/mcimport.log Yes, processing was interrupted and restarted Thu Apr 3 07:29:32 CDT 2008 when handing this file, and started up with an immediate ECRC. Re-interrupted and started with COPY/ECRC, oops, the ECRC did not happen. So we have a bad CRC for this. This has been getting copied again and again. XSETS=`grep n11037259_0017_L010185N_D04 \ /minos/data/mcimport/TAR/daikon_04/L010185N/near/*/mcimport.log \ | cut -f 9 -d / | uniq` FILE=n11037259_0017_L010185N_D04.tar.gz for SET in ${XSETS} ; do ls -l /pnfs/minos/stage/daikon_04//L010185N/near/${SET}/${FILE} done /pnfs/minos/stage/daikon_04//L010185N/near/751/n11037259_0017_L010185N_D04.tar.gz: No such file or directory The rest have dates 3 through 10 April. 
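Before the cleanup steps below, a spot check that the stored .ecrc really disagrees with a fresh CRC of the staged tarball (a sketch; it reuses the ecrc | cut idiom used elsewhere in this log, and assumes the file is still present under TAPE/):

FILE=n11037259_0017_L010185N_D04.tar.gz    # as identified above
STORED=`cat /minos/data/mcimport/TAR/daikon_04/L010185N/near/725/${FILE/.tar.gz}.ecrc`
FRESH=`ecrc TAPE/${FILE} | cut -f 2 -d ' '`
[ "${STORED}" = "${FRESH}" ] && echo "CRC OK" || echo "CRC mismatch: stored ${STORED} vs fresh ${FRESH}"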
1) remove the bad files from /pnfs/minos/stage for SET in ${XSETS} ; do rm /pnfs/minos/stage/daikon_04//L010185N/near/${SET}/${FILE} done 2) remove the bad ECRC $ cat /minos/data/mcimport/TAR/daikon_04/L010185N/near/725/${FILE/.tar.gz}.ecrc 3051355852 ecrc TAPE/${FILE} | cut -f 2 -d ' ' > \ /minos/data/mcimport/TAR/daikon_04/L010185N/near/725/${FILE/.tar.gz}.ecrc 3) rewrite to pnfs ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/725 NOW RESUME ARCHIVE for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done ============================================================================= 2008 04 10 ########### # KREYMER # ########### Due to a family emergency, I'll very likely be out of town today, Thursday 10 April 2008. I can be reached at cell 630 697 0469, and will try to check in via the network. I will try to be back by Friday morning. ########### # ROUNDUP # ########### new version which ignores errors Type 1 ( per rubin ) These are input I/O errors, which cannot produce output, but which may be hanging around from previous attempts. SRV1> cp -a AFSS/roundup.20080410 . SRV1> ln -sf roundup.20080410 roundup # was roundup.20080409 SRV1> date Wed Apr 9 17:40:45 CDT 2008 Never used this, moved on to 20080412 ############ # MCIMPORT # ############ Why is n11037259_0017_L010185N_D04.tar.gz being copied to 765 ? Command line: encp --delayed-dismount 5 --verbose 4 /home/mindata/TAPE/n11037259_0017_L010185N_D04.tar.gz /pnfs/minos/stage/daikon_04/L010185N/near/765/n11037259_0017_L010185N_D04.tar.gz ########## # CONDOR # ########## MINOS07 > condor_off -peaceful minos07 -subsystem startd Sent "Set-Peaceful-Shutdown" command to startd minos07.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos07.fnal.gov ############ # MCIMPORT # ############ MINOS26 > du -sm /pnfs/minos/stage/daikon_04 8742149 /pnfs/minos/stage/daikon_04 $ du -sm /minos/data/mcimport/STAGE/daikon_04 dds TAPE 5111041 /minos/data/mcimport/STAGE/daikon_04 ============================================================================= 2008 04 09 ########## # CONDOR # ########## MINOS25 > condor_off -peaceful minos07 -subsystem startd ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5004:Failed to get authorization from server. Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile) AUTHENTICATE:1004:Failed to authenticate using FS Can't send Set-Peaceful-Shutdown command to startd minos07.fnal.gov ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5004:Failed to get authorization from server. Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile) AUTHENTICATE:1004:Failed to authenticate using FS Sent "Kill-Daemon-Peacefully" command to master minos07.fnal.gov ######## # FARM # ######## Sent note to rubin,minos-data re CPB mcnear ZAP and stale PEND files. ########### # ROUNDUP # ########### Comparison of roundup.new to roundup... 
First cedar near ./roundup -n -r cedar near 2>&1 | tee /tmp/cnold.log AFSS/roundup.new -n -r cedar near 2>&1 | tee /tmp/cnnew.log diff /tmp/cnold.log /tmp/cnnew.log Then the biggie, CPB mcnear ( filter out the new ECRC messages from purge ) ./roundup -n -r cedar_phy_bhcurv mcnear 2>&1 | tee /tmp/cmold.log AFSS/roundup.new -n -r cedar_phy_bhcurv mcnear 2>&1 | tee /tmp/cmnew.log diff /tmp/cmold.log /tmp/cmnew.log | grep -v ECRC There are many more HAVE messages in the cmold.log. Understandable, we generate one per run, versus one per concatenated file. For cand files, that's a big but moot difference. This is ready for production use. $ mv roundup.new roundup.20080409 SRV1> cp -a AFSS/roundup.20080409 . SRV1> ln -sf roundup.20080409 roundup # was roundup.20080225 SRV1> date Wed Apr 9 17:40:45 CDT 2008 ####### # CVS # ####### per hartnell request, added to NtupleUtils and NuMubar : dja25 David Auty djauty * mtavera Marta Tavera * nickd Nicholas Devenish * rbpatter Ryan B. Patterson * and did ./adduser ######### # ADMIN # ######### Checking existing Minos nodes for 64bit capacity Per http://www.cyberciti.biz/faq/linux-how-to-find-if-processor-is-64-bit-or-not/ cat /proc/cpuinfo | grep flags | grep ' lm ' for NODE in ${NODES} ; do printf "${NODE} " ; ssh -ax ${NODE} "cat /proc/cpuinfo | grep flags | uniq | tr -s ' ' \\\n | grep lm" ; done Have lm for all Cluster and Servers, and flxb flxb31 and above ============================================================================= 2008 04 08 ########### # ROUNDUP # ########### Continuing to adjust roundup.new to use samsub ########## # ORACLE # ########## Date: Tue, 08 Apr 2008 16:34:49 +0000 (UTC) From: Arthur Kreymer To: minosdb-support@fnal.gov Subject: Minos Oracle server purchases for FY 2008 We need to review the status of minosora1/3 and their disks, and make a plan for the purchase of either replacements, or extended service plans, as appropriate. I am not aware of any performance issues requiring upgrades to these systems. The long term Ganglia monitoring shows an average of 1/4 process, with a CPU load of around one percent. http://rexganglia2.fnal.gov/minos/?r=year&c=MINOS+DB&h=minosora1.fnal.gov The disk space presently configured is about 550 GBytes, of which 110 GBytes is used. http://cdcvs.fnal.gov/cgi-bin/fnal-only/cvsweb.cgi/syscollect/minos/minosora1-config.html?rev=1.1.1.1030&con tent-type=text/x-cvsweb-markup&cvsroot=syscollect Issues to be dealt with in the plan : 0) What is the end of warranty coverage for the hosts and disks ? 1) What is the end of service life for these unique Sun/AMD systems ? 2) What is the end of service life for the disks ? 3) What are the costs of replacement versus continued maintenance ? 4) Are we satisfied with the level of service being actually provided, given our experience with last year's 6 month minosora3 repair ? 5) If replacement is an option, what system and/or disks are preferred ? 6) The plan should provide a policy good for the next 3 years, which should cover the end of Minos data taking. Please coordinate this with Robert Hatcher, who is taking on Liz's role as Minos liaison to the Computing Division. Robert is on the minosdb-support mailing list. ######## # FARM # ######## The beam dbu information seems to have returned. 
SRV1> /grid/app/minos/scripts/beam_mon fnpcsrv1 Inquiring of fnpcsrv1 on port 3307 as reader_old:minos_db B080408_000001.mbeam.root from 2008-04-08 00:00:04 to 2008-04-08 07:59:57 6739 spills 107920358 bytes, found: 28793, missed: 0 seconds B080407_160001.mbeam.root from 2008-04-07 16:00:01 to 2008-04-07 23:59:49 10983 spills 183291638 bytes, found: 57581, missed: 15 seconds B080407_080001.mbeam.root from 2008-04-07 08:00:01 to 2008-04-07 15:59:58 12905 spills 220909976 bytes, found: 86378, missed: 18 seconds B080407_000001.mbeam.root from 2008-04-07 00:00:00 to 2008-04-07 07:59:59 12827 spills 208561586 bytes, found: 115177, missed: 20 seconds ============================================================================= 2008 04 07 ######## # FARM # ######## SRV1> /grid/app/minos/scripts/beam_mon minos-db1 Inquiring of minos-db1 on port 3306 as reader_old:minos_db B080408_000001.mbeam.root from 2008-04-08 00:00:04 to 2008-04-08 07:59:57 6739 spills 107920358 bytes, found: 28793, missed: 0 seconds B080407_160001.mbeam.root from 2008-04-07 16:00:01 to 2008-04-07 23:59:49 10983 spills 183291638 bytes, found: 57581, missed: 15 seconds B080407_080001.mbeam.root from 2008-04-07 08:00:01 to 2008-04-07 15:59:58 12905 spills 220909976 bytes, found: 86378, missed: 18 seconds B080407_000001.mbeam.root from 2008-04-07 00:00:00 to 2008-04-07 07:59:59 12827 spills 208561586 bytes, found: 115177, missed: 20 seconds SRV1> /grid/app/minos/scripts/beam_mon fnpcsrv1 Inquiring of fnpcsrv1 on port 3307 as reader_old:minos_db beam_mon returns null -- no updates recently Mon Apr 7 17:19:45 CDT 2008 ########### # ROUNDUP # ########### roundup.new - using samdup, samsub SRV1> ./roundup -n -r cedar near -> /minos/data/minfarm/maint/cnold.log Testing with a partially purges run AFSS/roundup.new -n -s N00013775 -r cedar near AFSS/roundup.new -n -W -S -v -s N00013775 -r cedar nea ############ # MCIMPORT # ############ mindata@minos-sam03 Had to restart the copy to tape, due to my desktop crashing. tail -1 /minos/data/mcimport/TAR/daikon_04/L010185N/near/755/mcimport.log COPY n11037551_0015_L010185N_D04.tar.gz $ dds TAPE | tail -rw-r--r-- 1 mindata e875 340717229 Apr 7 14:17 n11037551_0014_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 232898560 Apr 7 14:18 n11037551_0015_L010185N_D04.tar.gz $ rm TAPE/n11037551_0015_L010185N_D04.tar.gz FDIRS='755 756 757 758 759 760 761 762 763 764 765 ' for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done ########### # DESKTOP # ########### Locked up displaying a small PDF file with xpdf. Displays fine in acroread No access via ssh from the net, unable to switch local console. The attachment was PJO-APS-Survey-4-7-08.PDF Displays fine in acroread ######## # FARM # ######## Latest farm output concatenated : MINOS26 > dds /pnfs/minos/fardet_data/2008-04/F00040732_0000.mdaq.root -rw-r--r-- 1 buckley e875 18646825 Apr 3 11:43 /pnfs/minos/fardet_data/2008-04/F00040732_0000.mdaq.root MINOS26 > dds /pnfs/minos/neardet_data/2008-04/N00013887_0002.mdaq.root -rw-r--r-- 1 buckley e875 77409802 Apr 3 17:33 /pnfs/minos/neardet_data/2008-04/N00013887_0002.mdaq.root ############ # MCIMPORT # ############ Another encp 1.5 hour delay Sunday 17:00 ish less /minos/data/mcimport/TAR/daikon_04/L010185N/near/751/mcimport.log Volume VOJ554 is marked NOACCESS. Error after transferring 0 bytes in 1 files in 5246.348979 sec. Overall rate = 0 MB/sec. Transfer rate = 0 MB/sec. Network rate = 0 MB/sec. Drive rate = 0 MB/sec. Disk rate = 0 MB/sec. Exit status = 1. 
Start time: Sun Apr 6 18:19:59 2008 In summary, for 11 LTO-4 volumes written, 8 have write errors, for a total of 14 write errors. MINOS26 > ./volumes vols MINOS26 > VOLS4=`./volumes stage | grep VOJ` MINOS26 > printf "${VOLS4}\n" VOJ545 VOJ546 VOJ547 VOJ548 VOJ549 VOJ550 VOJ551 VOJ552 VOJ553 VOJ554 VOJ555 MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep wr_err ; done VOJ545 'sum_wr_err': 2, VOJ546 'sum_wr_err': 1, VOJ547 'sum_wr_err': 2, VOJ548 'sum_wr_err': 1, VOJ549 'sum_wr_err': 0, VOJ550 'sum_wr_err': 2, VOJ551 'sum_wr_err': 2, VOJ552 'sum_wr_err': 0, VOJ553 'sum_wr_err': 0, VOJ554 'sum_wr_err': 3, VOJ555 'sum_wr_err': 1, MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep remaining ; done VOJ545 'remaining_bytes': 792764928L, VOJ546 'remaining_bytes': 0L, VOJ547 'remaining_bytes': 0L, VOJ548 'remaining_bytes': 244734464L, VOJ549 'remaining_bytes': 765575680L, VOJ550 'remaining_bytes': 247753728L, VOJ551 'remaining_bytes': 0L, VOJ552 'remaining_bytes': 175473664L, VOJ553 'remaining_bytes': 34666496L, VOJ554 'remaining_bytes': 192283136L, VOJ555 'remaining_bytes': 630473728000L, MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep eod_cookie ; done VOJ545 'eod_cookie': '0000_000000000_0001049', VOJ546 'eod_cookie': '0000_000000000_0001051', VOJ547 'eod_cookie': '0000_000000000_0001683', VOJ548 'eod_cookie': '0000_000000000_0002324', VOJ549 'eod_cookie': '0000_000000000_0000924', VOJ550 'eod_cookie': '0000_000000000_0001079', VOJ551 'eod_cookie': '0000_000000000_0002281', VOJ552 'eod_cookie': '0000_000000000_0002358', VOJ553 'eod_cookie': '0000_000000000_0002246', VOJ554 'eod_cookie': '0000_000000000_0002318', VOJ555 'eod_cookie': '0000_000000000_0000445', MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep sum_mounts ; done VOJ545 'sum_mounts': 9, VOJ546 'sum_mounts': 9, VOJ547 'sum_mounts': 10, VOJ548 'sum_mounts': 9, VOJ549 'sum_mounts': 7, VOJ550 'sum_mounts': 58, VOJ551 'sum_mounts': 10, VOJ552 'sum_mounts': 8, VOJ553 'sum_mounts': 8, VOJ554 'sum_mounts': 12, VOJ555 'sum_mounts': 4, ######## # FARM # ######## Report from Rubin, who cannot attend today's Grid Users' meeting: About the only thing to report is that the reconfiguration of fermigrid1 seems to have almost eliminated the hold problem. There has only been one held run since the evening of March 30, and that run 'auto-released' with no problem. (Auto-release means was released by the cron job.) Right now (Sunday at noon) the db updater on fnpcsrv1 hasn't run for a couple of days. Steve has checked that this is *not* a system problem, and I've turned it over to Alex and Nick. I've tried running the update procedure manually, but it just terminates almost immediately with no error (or any) messages. And my check indicates that there have been no updates done. One can check with /grid/app/minos/scripts beam-mon which will look at fnpcsrv1 and at minos-db1 with the argument 'minos-db1'. ####### # AFS # ####### Ticket 107032 ########### # MONTHLY # ########### DATASETS 4/7 PREDATOR 4/7 VAULT 4/8 via cron MYSQL 4/9 started Wed Apr 9 09:45:44 CDT 2008 after posting notice to CRL, and telling shifter Unlocked 10:18. I failed to purge older BINLOG's last month, I see lots of 1 GB logs through 3 Feb. Did an initial supplemental purge, before the archive of BINLOG mysql -u root offline PURGE MASTER LOGS BEFORE DATE_SUB( NOW( ), INTERVAL 40 DAY); EXIT; Final cp back to COPY took 53 minutes ( 18 GB ) at 5 MB/sec. 
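Sanity check on the quoted copy rate, 18 GB in 53 minutes (bc assumed):

echo "scale=2; 18 * 1024 / (53 * 60)" | bc    # 5.79 MB/sec, consistent with the ~5 MB/sec noted above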
Writes to /M/D were at over 20 MB/sec, per ganglia ################# # VAULT_MONTHLY # ################# mv vault.20060807 vault_monthly simplified MONTH calculation from DAY=`date +%d` let " DOFF = ( DAY + 15) " MONTH=`date +%Y-%m -d "${DOFF} days ago"` to (( DOFF = `date +%d` + 1 )) MONTH=`date +%Y-%m -d "${DOFF} days ago"` Scheduled this for tonight, by activating in crontab.dat, based on 2008-02 times Far - 2 hours Near - 8 hours 11 20 07 * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/vault_monthly ============================================================================= 2008 04 04 ######## # DATA # ######## Checking again, around 17:44, from mindata@minos26 $ ps xf | cut -f 2 -d : | cut -c 4- | grep '^scp\|.tar.gz$' scp -t STAGE/mualem $ time ls -alF /minos/data/mysql/archive/20080303/offline real 2m4.571s real 0m10.478s minos-sam03.fnal.gov real 0m7.622s $ time ls -alF /minos/data/mysql/archive/20080204/offline real 0m33.646s real 0m5.015s real 0m4.920s ########## # SAMSUB # ########## Check this agains the current logs, SRV1> grep PEND cedarnear.log | grep -v ' 0 ' ... PEND - have 1/7 subruns for N00013775_*.spill.sntp.cedar.0.root 22 03/09 23:41 4 5 SRV1> AFSS/samsub /minos/data/minfarm/nearcat | grep -v '0$' N00013775_.spill.sntp.cedar.0.root 4 SRV1> grep PEND cedar_phy_bhcurvmcnear.log | grep -v ' 0 ' PEND - have 17/30 subruns for n13037094_*_L250200N_D04.mrnt.cedar_phy_bhcurv.root 73 01/18 00:40 8 25 PEND - have 3/29 subruns for n13037095_*_L250200N_D04.mrnt.cedar_phy_bhcurv.root 73 01/18 00:21 22 25 PEND - have 1/30 subruns for n13037097_*_L250200N_D04.mrnt.cedar_phy_bhcurv.root 73 01/18 00:38 4 5 PEND - have 17/30 subruns for n13037094_*_L250200N_D04.sntp.cedar_phy_bhcurv.root 73 01/18 00:40 8 25 PEND - have 3/29 subruns for n13037095_*_L250200N_D04.sntp.cedar_phy_bhcurv.root 73 01/18 00:21 22 25 PEND - have 1/30 subruns for n13037097_*_L250200N_D04.sntp.cedar_phy_bhcurv.root 73 01/18 00:38 4 5 SRV1> AFSS/samsub /minos/data/minfarm/mcnearcat | grep -v '0$' n13037094__L250200N_D04.mrnt.cedar_phy_bhcurv.root 8 n13037095__L250200N_D04.mrnt.cedar_phy_bhcurv.root 22 n13037097__L250200N_D04.mrnt.cedar_phy_bhcurv.root 4 n13037094__L250200N_D04.sntp.cedar_phy_bhcurv.root 8 n13037095__L250200N_D04.sntp.cedar_phy_bhcurv.root 22 n13037097__L250200N_D04.sntp.cedar_phy_bhcurv.root 4 ( reorderered for clarity , those listed below are ZAP files ) n13037260__L010185N_D04.mrnt.cedar_phy_bhcurv.0.root 26 n13037270__L010185N_D04.mrnt.cedar_phy_bhcurv.0.root 27 n13037260__L010185N_D04.sntp.cedar_phy_bhcurv.0.root 26 n13037270__L010185N_D04.sntp.cedar_phy_bhcurv.0.root 27 n13037260__L010185N_D04.cand.cedar_phy_bhcurv.0.root 14 n13037270__L010185N_D04.cand.cedar_phy_bhcurv.0.root 27 Bottom line, this is looking pretty good, Next step is to use this in roundup.new, and compare details of some dry runs. Then put it in production. ########## # SAMSUB # ########## This has been built for use with ROUNDUP, see entry under 2008 04 01 SRV1> ls /minos/data/minfarm/mcnearcat | wc -l 2741 SRV1> time AFSS/samsub /minos/data/minfarm/mcnearcat ... 
real 0m5.978s user 0m2.062s sys 0m0.162s SRV1> AFSS/samsub /minos/data/minfarm/mcnearcat | wc -l 114 ############## # AFSERRSCAN # ############## Made the month automaticaly be `date +%b`, can override like ./afserrscan '' Jan ########## # SOUDAN # ########## Checking /var/log/messages on minos-db behind minos-gateway.minos-soudan.org, Many messages like Apr 3 23:31:10 minos-db kernel: afs: Lost contact with file server 131.225.68.65 in cell fnal.gov (all multi-homed ip addresses down for the server) Apr 3 23:31:11 minos-db kernel: afs: failed to store file (110) Apr 3 23:33:53 minos-db kernel: afs: file server 131.225.68.65 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) [root@minos-db ~]# host 131.225.68.65 65.68.225.131.in-addr.arpa domain name pointer fsus-minos01.fnal.gov. Always the same server. Usually at *:31:10 Sometimes *:32:53 Almost every hour Test with grep afs: /var/log/messages | grep -v Tokens | uniq chmod 555 messages* Testing as kreymer, ln -s /afs/fnal.gov/files/home/room1/kreymer AFSK cd AFSK while true ; do date ; wc -l foo ; sleep 20 ; done No interruption around 13:31, but this is the wrong server. ln -s /afs/fnal.gov/files/data/minos MD MDLDF=/afs/fnal.gov/files/data/minos/log_data/R1.16.0.log_data.tar wc -c ${MDLDF} 30720 /afs/fnal.gov/files/data/minos/log_data/R1.16.0.log_data.tar while true ; do date ; wc -c ${MDLDF} ; sleep 20 ; done Fri Apr 4 13:48:53 CDT 2008 ... ============================================================================= 2008 04 03 ######### # DOCDB # ######### Tested administrative ( minos-adm ) access to Minos DocDB by kreymer and rhatcher. This lets us approve new users, etc. ######## # DATA # ######## Observing very different file access times to /minos/data Mysql> ls -alF /minos/data/mysql/archive/20080303/offline minos-sam03 $ time ls -alF /minos/data/mysql/archive/20080303/offline real 0m0.710s user 0m0.017s sys 0m0.015s MINOS26 > time dds /minos/data/mysql/archive/20080303/offline real 1m39.509s user 0m0.022s sys 0m0.081s MINOS26 > time ls -alF /minos/data/mysql/archive/20080303/offline real 1m5.076s user 0m0.019s sys 0m0.077s 30389 ? D 0:00 \_ md5sum n11037465_0029_L010185N_D04.tar.gz 30292 ? Ss 0:00 scp -t STAGE/mtavera 30289 ? Ss 0:00 scp -t STAGE/mtavera 30247 ? Ss 0:00 scp -t STAGE/mtavera 30030 ? Ss 0:01 scp -t STAGE/mualem 29816 ? Ss 0:00 scp -t STAGE/mtavera 29556 ? Ss 0:00 scp -t STAGE/mtavera 29497 ? Ss 0:00 scp -t STAGE/mtavera 29208 ? Ss 0:01 scp -t STAGE/mtavera 30389 ? D 0:00 \_ md5sum n11037465_0029_L010185N_D04.tar.gz 30292 ? Ss 0:00 scp -t STAGE/mtavera 30289 ? Ss 0:00 scp -t STAGE/mtavera 30247 ? Ss 0:00 scp -t STAGE/mtavera 30030 ? Ss 0:01 scp -t STAGE/mualem 29816 ? Ss 0:00 scp -t STAGE/mtavera 29556 ? Ss 0:00 scp -t STAGE/mtavera 29497 ? Ss 0:00 scp -t STAGE/mtavera 29208 ? Ss 0:01 scp -t STAGE/mtavera Performance is back to good again on minos26, Thu Apr 3 13:53:40 CDT 2008 real 0m2.157s user 0m0.018s sys 0m0.050s 31407 ? Ss 0:00 scp -t STAGE/mtavera 31346 ? Ss 0:00 scp -t STAGE/mtavera 30931 ? Ss 0:00 scp -t STAGE/mtavera 30889 ? Ss 0:01 scp -t STAGE/mtavera 30844 ? Ss 0:00 scp -t STAGE/mtavera And edging down, Thu Apr 3 14:01:33 CDT 2008 real 0m11.622s 31407 ? Ss 0:01 scp -t STAGE/mtavera 1263 ? D 0:00 \_ md5sum n11037465_0009_L010185N_D04.tar.gz 1235 ? Ss 0:00 scp -t STAGE/mualem 1005 ? Ss 0:00 scp -t STAGE/mtavera 955 ? Ss 0:00 scp -t STAGE/mtavera 952 ? Ss 0:00 scp -t STAGE/mtavera 886 ? Ss 0:00 scp -t STAGE/mtavera 746 ? Ss 0:00 scp -t STAGE/mtavera 729 ? 
Ss 0:00 scp -t STAGE/mtavera ps xf | cut -f 2 -d : | cut -c 4- | grep '^scp\|.tar.gz$' real 1m34.086s \_ md5sum n11037465_0013_L010185N_D04.tar.gz \_ md5sum n11037465_0015_L010185N_D04.tar.gz \_ md5sum n11037465_0004_L010185N_D04.tar.gz scp -t STAGE/mtavera scp -t STAGE/mtavera Did the ls as mindata@minos26, real 0m0.150s Same short time now for kreymer@minos26, real 0m0.079s But now no scp's or md5sum's are running ! still no activity, somewhat slow access real 0m6.821s real 0m0.032s At an earlier time, during the 1 minute slowdowns, MINOS26 > lsof -N /minos/data COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME ls 29272 kreymer 3r DIR 0,23 67584 1994944521 /minos/data/mysql/archive/20080303/offline (minos-nas-0.fnal.gov:/minos/data) mindata@minos26 $ /usr/sbin/lsof -N /minos/data COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME mcimport 22085 mindata cwd DIR 0,23 116736 3948960453 /minos/data/mcimport/OVERLAY/mcin/dcache (minos-nas-0.fnal.gov:/minos/data) mcimport 22085 mindata 1w REG 0,23 4044209 2031215561 /minos/data/mcimport/OVERLAY/log/mcimport.log (minos-nas-0.fnal.gov:/minos/data) scp 29497 mindata 3w REG 0,23 295895040 583008038 /minos/data/mcimport/mtavera/n11037465_0022_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 29556 mindata 3w REG 0,23 278495232 4192479982 /minos/data/mcimport/mtavera/n11037465_0011_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 29816 mindata 3w REG 0,23 199196672 152377133 /minos/data/mcimport/mtavera/n11037465_0016_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 30030 mindata 3w REG 0,23 319619072 347627699 /minos/data/mcimport/mualem/n11037744_0003_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 30247 mindata 3w REG 0,23 81985536 2015710680 /minos/data/mcimport/mtavera/n11037465_0008_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 30289 mindata 3w REG 0,23 50003968 3017774333 /minos/data/mcimport/mtavera/n11037465_0026_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 30292 mindata 3w REG 0,23 48332800 2514584717 /minos/data/mcimport/mtavera/n11037465_0014_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) bash 30387 mindata cwd DIR 0,23 100352 322878581 /minos/data/mcimport/mtavera (minos-nas-0.fnal.gov:/minos/data) md5sum 30389 mindata cwd DIR 0,23 100352 322878581 /minos/data/mcimport/mtavera (minos-nas-0.fnal.gov:/minos/data) md5sum 30389 mindata 1w REG 0,23 0 504262275 /minos/data/mcimport/mtavera/n11037465_0029_L010185N_D04.tar.gz.md5 (minos-nas-0.fnal.gov:/minos/data) md5sum 30389 mindata 3r REG 0,23 334088614 375232167 /minos/data/mcimport/mtavera/n11037465_0029_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) bash 30455 mindata cwd DIR 0,23 100352 322878581 /minos/data/mcimport/mtavera (minos-nas-0.fnal.gov:/minos/data) md5sum 30457 mindata cwd DIR 0,23 100352 322878581 /minos/data/mcimport/mtavera (minos-nas-0.fnal.gov:/minos/data) md5sum 30457 mindata 1w REG 0,23 0 181349684 /minos/data/mcimport/mtavera/n11037465_0025_L010185N_D04.tar.gz.md5 (minos-nas-0.fnal.gov:/minos/data) md5sum 30457 mindata 3r REG 0,23 339105079 401305239 /minos/data/mcimport/mtavera/n11037465_0025_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) mcimport 30491 mindata cwd DIR 0,23 116736 3948960453 /minos/data/mcimport/OVERLAY/mcin/dcache (minos-nas-0.fnal.gov:/minos/data) ecrc 30492 mindata cwd DIR 0,23 116736 3948960453 /minos/data/mcimport/OVERLAY/mcin/dcache (minos-nas-0.fnal.gov:/minos/data) ecrc 30492 mindata 3r REG 0,23 338891388 2309851038 
/minos/data/mcimport/OVERLAY/mcin/dcache/n13047170_0025_L010185N_D04.reroot.root (minos-nas-0.fnal.gov:/minos/data) cut 30493 mindata cwd DIR 0,23 116736 3948960453 /minos/data/mcimport/OVERLAY/mcin/dcache (minos-nas-0.fnal.gov:/minos/data) ######### # ADMIN # ######### Sent email to minos_software_discussion, asking whether anyone is using SL3 at Fermilab, or from AFS ( can we upgrade minos11 ? ) ######## # DATA # ######## MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N 6184434 /minos/data/mcimport/STAGE/daikon_04/L010185N 6622 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N 27834 /minos/data/mcimport/STAGE/daikon_04/L250200N MINOS26 > du -sm /pnfs/minos/stage/daikon_04/* 2790357 /pnfs/minos/stage/daikon_04/L010185N 3412300 /pnfs/minos/stage/daikon_04/L250200N ############### # CONDORPROXY # ############### Corrected some typo errors, corrected to use /usr/krb5/bin/kinit Tested in cron via /tmp/ctd at 07:02, looks good As an added challenge, this was during the BlueArc maintenancem, which stalled global file operations on the Cluster. ######## # DATA # ######## 2008 04 02 Preparing for the 7 AM 5 minute BlueArc outage minfarm@fnpcsrv1 SRV1> pwd /home/minfarm/ROUNTMP SRM1> mv NOCAT.ok NOCAT mindata@minos26 crontab -r mindata@minos-sam03 Manually interrupt cp phase, if this is running at 07:00 and remove the partial file. COPY n11037259_0017_L010185N_D04.tar.gz $ dds TAPE/n11037259_001* -rw-r--r-- 1 mindata e875 345557859 Apr 3 06:32 TAPE/n11037259_0010_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 340648977 Apr 3 06:32 TAPE/n11037259_0011_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 347040982 Apr 3 06:33 TAPE/n11037259_0012_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 331563261 Apr 3 06:34 TAPE/n11037259_0013_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 338125810 Apr 3 06:34 TAPE/n11037259_0014_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 342672643 Apr 3 06:35 TAPE/n11037259_0015_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 334430343 Apr 3 06:36 TAPE/n11037259_0016_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 146292736 Apr 3 06:36 TAPE/n11037259_0017_L010185N_D04.tar.gz rm TAPE/n11037259_0017_L010185N_D04.tar.gz Date: Thu, 03 Apr 2008 07:21:12 -0500 From: Andy Romero To: site-nas-announce@fnal.gov Subject: BlueArc Maintenance complete The upgrade to firmware v5.1.1156.13 is complete. You may resume normal operations. To reverse this, did minfarm@fnpcsrv1 mv NOCAT NOCAT.ok mindata@minos26 crontab crontab.dat mindata@minos-sam03 FDIRS='725 726 727 728 729 730 731 732 733 734 735 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 ' for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done ########### # MINOS09 # ########### Date: Thu, 03 Apr 2008 07:30:55 -0500 (CDT) Subject: HelpDesk ticket 113516 ___________________________________________ Short Description: minos09 down Problem Description: run2-sys : Node minos09 seems to be off the network. 
Ganglia monitoring indicates that it may have been down since about 06:24 : Cluster Report for Thu, 3 Apr 2008 07:23:14 -0500 minos09.fnal.gov load_one: down Last heartbeat 0 days, 1:09:29 ago ___________________________________________ Date: Thu, 03 Apr 2008 08:40:29 -0500 (CDT) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 04 02 ######### # PROBE # ######### Corrected stderr rerouting of time bash keyword per Google advice at http://www.cs.tut.fi/~jarvi/tips/bash.html { time ... ; } 2>&1 ######### # VAULT # ######### mv vault.20060807 vault.monthly # reverse earlier confusion, this is a script which vaults the previous month's data. This is still not being done via cron, quite yet. Maybe next month ! ######## # FARM # ######## Some of our 8 nodes seem to be missing from the farm fnpc339 fnpc340 fnpc341 Based on SRV1> condor_status | grep fnpc39 SRV1> condor_status | grep fnpc34 Strange, fnpc339 is present now... oops bad grep above, needed fnpc339 Date: Wed, 02 Apr 2008 17:53:35 -0500 (CDT) Subject: HelpDesk ticket 113509 ___________________________________________ Short Description: fnpc340 down, fnpc341 not running jobs - FYI Problem Description: Two of the eight Minos/AFS GPFARM worker nodes seem to be not running jobs. fnpc340 seems to be down, not on the network. fncp341 is up, but does not have AFS mounted, and is not running jobs. condor_status returns no information for either node. This is low priority, as the user demand is low at present, and we are continuing to expand non-AFS means of running our jobs. ___________________________________________ Note To Requester: timm@fnal.gov sent this Notes To Requester: Karen Shepelak reported last week that there is a hardware problem with fnpc340 and a service call is in. I restarted afs on fnpc341. Steve _________________________________________________________________ ############### # CONDORPROXY # ############### Added condorproxy to crontab.dat Removed gridappsync, now obsolete due to PARROT 14:35 ln -sf crontab.dat.20080402 crontab.dat # was crontab.dat.20060504 crontab crontab.dat MINOS26 > crontab -l MAILTO=kreymer@fnal.gov 06 1-23/2 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/predator 10 04 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/condorproxy # 11 01 5 * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/vault.monthly ########## # CONDOR # ########## Testing new condorproxy script, amid the startup of the glideafs10min . lrwxrwxrwx 1 gfactory gfactory 29 Apr 2 14:20 kreymer-condor.proxy -> kreymer-condor.proxy.20080408 This seems to have worked smoothly. Added to this to crontab, see above. ########## # CONDOR # ########## Testing access via glidein to our 8 nodes, condor_submit glideafs10min.run 100 sections, cluster 63295 ########## # CONDOR # ########## Get a proxy with my new certificate, this time on minos26 . 
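( Before the renewal steps that follow, a quick lifetime check on the current proxy can confirm the renewal is actually needed ; a sketch, assuming voms-proxy-info from the VDT setup used below, and a hypothetical local proxy path )
. /minos/scratch/kreymer/VDT/setup.sh
PROXY=${HOME}/kreymer-condor.proxy     # hypothetical location of the current proxy file
LEFT=`voms-proxy-info -file ${PROXY} -timeleft 2>/dev/null`
LEFT=${LEFT:-0}
[ ${LEFT} -lt 21600 ] && echo "under 6 hours left (${LEFT}s), renew" || echo "still good for ${LEFT}s"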
/grid/app/minos/VDT/setup.sh openssl pkcs12 -in kreymerdoe.p12 -clcerts -nokeys -out kreymerdoe.pem Enter Import Password: MAC verified OK openssl pkcs12 -in kreymerdoe.p12 -nocerts -out kreymerdoekey.pem Enter Import Password: MAC verified OK Enter PEM pass phrase: Verifying - Enter PEM pass phrase: chmod 600 kreymerdoe* vomses is out of date, As mindata, in /grid/app/minos/VDT/glite/etc scp kreymer@fnpcsrv1:/usr/local/vdt-1.8.1/glite/etc/vomses vomses cp -a vomses vomses.20080107 # based on date of file on fnpcsrv1 Still no luck , get message from vpi, VOMS Server for fermilab not known! Switch to another installation of VDT . /minos/scratch/kreymer/VDT/setup.sh DAYS=20 (( HOURS = DAYS * 24 )) DAPR=`date -d "today + ${DAYS}days" +%Y%m%d` voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife ${HOURS}:0 \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymercondor.proxy.${DAPR} \ -valid ${HOURS}:0 This seems to work, let's try this with an inline password. echo ${PPH} | voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife ${HOURS}:0 \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymer-condor.proxy.${DAPR} \ -valid ${HOURS}:0 \ -pwstdin Creating proxy ......................................................................... Done Your proxy is valid until Tue Apr 22 10:58:48 2008 scp kreymer-condor.proxy.${DAPR} gfactory@minos25:.grid/kreymer-condor.proxy.${DAPR} ssh gfactory@minos25 \ "cd .grid ; ln -sf kreymer-condor.proxy.${DAPR} kreymer-condor.proxy" Did this at around 11:02 The Idle kreymer glideins immediately started running. ============================================================================= 2008 04 01 ######## # DATA # ######## At about 11:00, data rates on minos-sam03 reading /minos/data dropped from 10 MB/sec to under 5. From 15:00 to 16:00, the rate dropped from 4 to 1/2 MB/sec, and has remained there through 16:30. This slowdown appears to be global. SRV1> du -sk /minos/data/mcimport/STAGE/daikon_04/L010185N/near/720/n12037209_0010_L010185N_D04.tar.gz 10640 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/720/n12037209_0010_L010185N_D04.tar.gz SRV1> time sum /minos/data/mcimport/STAGE/daikon_04/L010185N/near/720/n12037209_0010_L010185N_D04.tar.gz 01559 10639 real 0m19.112s user 0m0.066s sys 0m0.031s Blue2 seems to be OK SRV1> time sum /grid/data/minos/minfarm/OLDBAD/n13014007_0004_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root 20893 23304 real 0m0.623s user 0m0.162s sys 0m0.086s Date: Tue, 01 Apr 2008 17:01:18 -0500 (CDT) Subject: HelpDesk ticket 113446 ___________________________________________ Short Description: /minos/data BlueArc access has slowed down to a crawl today Problem Description: LSC/CSI : /minos/data is mounted from minos-nas-0.fnal.gov:/minos/data At about 11:00 today, data rates on minos-sam03 reading /minos/data dropped from 10 MB/sec to under 5. From 15:00 to 16:00, the rate dropped from 4 to 1/2 MB/sec, and has remained there through 16:30. This slowdown appears to be global. I see the same terrible data rates now from fnpcsrv1. I do not offhand see unreasonable user loads coming from Minos. Are there global BlueArc problems ? I do not see a slowdown for files served by blue2. Are there problems with the Minos data array ? ___________________________________________ Date: Wed, 02 Apr 2008 08:42:46 -0500 (CDT) This ticket has been reassigned to MENGEL, MARC of the CD-LSCS/CSI/CS/EST Group. 
____________________________________________ Date: Wed, 02 Apr 2008 11:12:57 -0500 (CDT) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/BLU Group. ___________________________________________ 12:00 romero adjusted some parameters to allow more requests to the array. We need to wait for another minos-sam03 read cycle to see the effect. Presently, there are 3 md5sums and 7 scp's running on minos26. The 6-8 MB/sec rates seen on sam03 seem kind of normal. __________________________________________ The next reading pass started on minos-sam03, around 13:00 CDT. Rates seem to be around 6 to 8 MBytes per second, and fairly stable. This is probably the expected rate, given the 9 active file access processes to /minos/data on minos26. __________________________________________ Solution: Performance is back to normal levels ___________________________________________________________________ N.B. after 13:00, the 1 minute data dropouts occur every 11 to 12 minutes, not the previous 4 minutes. ########## # CONDOR # ########## Glideins have been queued since 03/31 20:10 The last gfactory started around 3/31 19:31 Found log file for one of the glideins, via condor_q -l 62924.4 /home/gfactory/glideinsubmit/glidein_t11/entry_gpminos/log/condor_activity_20080331_gpminos@t11@minos@my2.log Found message there like 000 (62934.008.000) 03/31 20:14:17 Job submitted from host: <131.225.193.25:63984> ... 020 (62934.002.000) 03/31 20:14:20 Detected Down Globus Resource RM-Contact: fngp-osg.fnal.gov:2119/jobmanager-condor ... 026 (62934.002.000) 03/31 20:14:20 Detected Down Grid Resource GridResource: gt2 fngp-osg.fnal.gov:2119/jobmanager-condor ... Test overall load with 100 5-minute glideins to AFS nodes, ########### # ROUNDUP # ########### Checking timing of samdup, with repeated runs. In case we need to do this for the HAVE scan in roundup. First, about 323 files SRV1> time ./samdup /minos/data/minfarm/nearcat real 0m7.892s user 0m1.685s sys 0m0.231s real 0m6.606s user 0m1.664s sys 0m0.153s Now something hefty, 1699 files SRV1> time ./samdup /minos/data/minfarm/farcat real 0m58.310s user 0m5.066s sys 0m0.335s real 0m30.409s user 0m4.981s sys 0m0.348s real 0m31.177s user 0m5.062s sys 0m0.335s What we need is a count of existing declared subruns in SAM, for each run. It will be easier to draft a new samsub, derived from samdup, counting all subruns declared to SAM for the files in the given directory. The existing roundup stores the counts in shell arrays HAVE${FENU}[10#${FRUN:1}] ---- 2008 04 03 FENU is the filename past run/subrun, '.' changed to '_' with the leading delimiter removed. So any convenient filename with subrun removed would do.
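( A minimal sketch of this key-and-count idea, counting the local files per run rather than the SAM declarations that the real samsub queries ; the sed pattern is an assumption about the _NNNN subrun field in the names above )
CATDIR=/minos/data/minfarm/nearcat        # any of the *cat directories
ls ${CATDIR} \
  | sed 's/_[0-9][0-9][0-9][0-9]\([._]\)/_\1/' \
  | sort | uniq -c \
  | awk '{print $2, $1}'
# output is "key count" lines, the same shape as the samsub listings above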
In the samdup script, we have this as TAI, with the original delimiter For use in roundup, it is simplest to generate RUN_TAI strings, writing these and the subrun count for each unique RUN_TAI Create a RUNTAISET of these RUN_END's Search each RUNTAISET member for parents, ---- 2008 04 04 Implemented this, see note under 2008 04 04 Testing during development with AFSS/samsub /minos/scratch/kreymer/nearcat AFSS/samsub /minos/scratch/kreymer/mcnearcat ============================================================================= 2008 03 31 ########### # ENSTORE # ########### Copies stuck again, as noted in alarms (2008-Mar-31 16:53:03) stkenmvr140a 5292 root E (1) LTO4_40MV MOUNTFAILED max_consecutive_failures (3) reached Date: Mon, 31 Mar 2008 17:53:36 -0500 (CDT) Subject: HelpDesk ticket 113375 ___________________________________________ Short Description: Mover LTO4_40 stuck again Problem Description: An encp copy to LTO-4 tape has been hung up since Mon Mar 31 16:44:53 2008 Apparently due to another triple failure to mount on mover LTO4_40 . Please free this up, and take this mover out of service if appropriate. Start time: Mon Mar 31 16:44:53 2008 User: mindata(3648) Group: e875(5111) Euser: mindata(3648) Egroup: e875(5111) Command line: encp --delayed-dismount 5 --verbose 4 /home/mindata/TAPE/n11037160_0000_L010185N_D04.tar.gz /pnfs/minos/stage/daikon_04/L010185N/near/716/n11037160_0000_L010185N_D04.t ar.gz Version: v3_7 CVS $Revision: 1.866 $ OS: Linux 2.6.9-55.0.2.ELsmp i686 Release: Scientific Linux Fermi LTS release 4.4 (Wilson) Library: CD-LTO4G1 Storage Group: minos File Family: stage FF Wrapper: cpio_odc FF Width: 1 Current working directory: minos-sam03.fnal.gov:/minos/data/mcimport/STAGE/daikon_04/L010185N/near/716 Submitting /pnfs/minos/stage/daikon_04/L010185N/near/716/n11037160_0000_L010185N_D04.t ar.gz write request to LM. elapsed=1.010sec File queued: /home/mindata/TAPE/n11037160_0000_L010185N_D04.tar.gz library: CD-LTO4G1 family: stage bytes: 343266785 elapsed=1.10642313957 Mover called back. elapsed=2.73378705978 Input file /home/mindata/TAPE/n11037160_0000_L010185N_D04.tar.gz opened. elapsed=2.75560522079 Submitting /pnfs/minos/stage/daikon_04/L010185N/near/716/n11037160_0000_L010185N_D04.t ar.gz write request to LM. elapsed=490.330sec Submitting /pnfs/minos/stage/daikon_04/L010185N/near/716/n11037160_0000_L010185N_D04.t ar.gz write request to LM. elapsed=1390.460sec ___________________________________________ Date: Mon, 31 Mar 2008 18:18:30 -0500 (CDT) Solution: berg@fnal.gov sent this solution: Art, The mover is offline, the tape it was writing is full and available. I'll look at the mover in more detail tomorrow. - David __________________________________________________________________ 18:01 - mounting and writing ####### # NAS # ####### Date: Mon, 31 Mar 2008 16:57:11 -0500 From: Andy Romero To: site-nas-announce@fnal.gov Subject: BlueArc Maintenance This Thursday (4/3/2008) 7:00am To address the stability problems we have experienced over the past few weeks, we will be upgrading the firmware on BlueArc cluster node RHEA-1 to version 5.1.1156.13. 
(node RHEA-2 is already at v5.1.1156.13) Users of the following enterprise virtual servers (EVSs) will experience approximately 5min of downtime as we shift/re-balance the workload between the two cluster nodes blue1 blue2 dirserver1 minos-nas-0 ppdserver Users of the following EVSs will not be effected blue3 bluetest fermi-nas-1 mb-nas-0 ########### # ROUNDUP # ########### Working on roundup.new, with o ECRC file removal after purge samdup for duplicates samdup for HAVES purge of READ and SAM/READ files specific MC subdirs in saddreco Review old DUPLICATES code, for each FILE, finds corresponding files under READ from any subrun with special case, exact match for mock data. This produces the HAVE messages for each run already partially catted. Also produces the HAVES count used later, for automatic flushing ################ # SAM_PRODUCTS # ################ Per query from CDF, noted minos usage of sam_products : MINOS26 > upd modproduct -g minos sam_products v4_31 -f NULL notice: Adding flags -O "public" upd modproduct succeeded. ######### # ADMIN # ######### Received email from jklemenc listing web servers not recertified, and to be removed. ( -> minosadmin ) Only one seems to be related to Minos To be removed: +-----------------+----------------------------+-------------------+----------+---------------------+ | IP Address | Hostname | MAC Address |Updated By| Time updated | +-----------------+----------------------------+-------------------+----------+---------------------+ | 198.124.213.7 | nemean.minos-soudan.org | 00:02:B3:98:5E:01 | saranen | 2008-03-31 09:40:58 | Cannot connect to this node. Email to saranen ########## # SAMDUP # ########## Updated to get SAMQ from sam.pingDbServer, so that this can run within roundup on fnpcsrv1 cp -a AFSS/samdup samdup ./samdup /minos/data/minfarm/mcfarcat time ./samdup /minos/data/minfarm/mcnearcat real 1m0.347s user 0m3.040s sys 0m0.221s real 0m16.768s user 0m2.965s sys 0m0.222s Test setting a variable to the list SAMDUPS=`./samdup /minos/scratch/kreymer/mcnearcat` [ -n "${SAMDUPS}" ] && echo HAVE DUPS && printf "${SAMDUPS}\n" HAVE DUPS n13047100_0000_L010185N_D04.sntp.cedar_phy_bhcurv.0.root n13047100_0003_L010185N_D04.sntp.cedar_phy_bhcurv.0.root SAMDUPS=`./samdup /minos/scratch/kreymer/farcat` [ -n "${SAMDUPS}" ] && echo HAVE DUPS && printf "${SAMDUPS}\n" ########## # DCACHE # ########## ticket 113172 - corrupt file in write queue was removed friday, written OK by the standard scripts. ######## # FARM # ######## ./roundup -r cedar mcfar Mon Mar 31 10:10:18 CDT 2008 1038 34331 nearcat 1699 14289 farcat 923 46746 mcnearcat 0 1 mcfarcat 0 1 mcfmockcat 39 2 minfarm/WRITE 3699 95370 TOTAL files, GBytes nearcat 110 3141 cosmic.sntp.cedar.0.root 134 3881 cosmic.sntp.cedar_phy.0.root 84 2312 cosmic.sntp.cedar_phy.1.root 372 6007 spill.mrnt.cedar_phy.0.root 62 789 spill.mrnt.cedar_phy.1.root 213 15735 spill.sntp.cedar.0.root 63 4125 spill.sntp.cedar_phy.1.root farcat 264 7995 all.sntp.cedar.0.root 87 2174 all.sntp.cedar_phy_bhcurv.0.root 546 905 spill.bmnt.cedar_phy_bhcurv.0.root 264 1903 spill.bntp.cedar.0.root 23 102 spill.bntp.cedar_phy_bhcurv.0.root 208 509 spill.mrnt.cedar_phy_bhcurv.0.root 264 1304 spill.sntp.cedar.0.root 43 82 spill.sntp.cedar_phy_bhcurv.0.root mcnearcat 11 7502 cand.cedar_phy_bhcurv.0.root 419 8033 mrnt.cedar_phy_bhcurv.0.root 37 1480 mrnt.cedar_phy_bhcurv.root 419 27699 sntp.cedar_phy_bhcurv.0.root 37 4296 sntp.cedar_phy_bhcurv.root NEARCAT Let's do nearcat first, to defer the bmnt cleanup. 
We previously did CPB, now let's do CP. ./samdup /minos/data/minfarm/nearcat < clean > And as before, check for 0/1 duplicates in nearcat FONES=`cd /minos/data/minfarm/nearcat ; ls *.1.root` printf "${FONES}\n" | wc -w 209 for FILE in ${FONES} ; do ( cd /minos/data/minfarm/nearcat FZER=`echo ${FILE} | sed 's/\.1\./\.0\./g'` [ -r "${FZER}" ] && \ ls -l ${FILE} && ls -l ${FZER} && printf "\n" ) done So it appears we have no internal duplicates Let's see what's what with CP ./roundup -n -r cedar_phy near I see nothing strange in the messages, beyond lots of pending, ./roundup -n -r cedar_phy near | grep PEND PEND - have 18/19 subruns for N00007148_*.cosmic.sntp.cedar_phy.1.root 318 05/17 15:37 0 18 PEND - have 3/13 subruns for N00008357_*.cosmic.sntp.cedar_phy.1.root 318 05/18 11:13 7 10 PEND - have 3/4 subruns for N00008366_*.cosmic.sntp.cedar_phy.0.root 303 06/02 05:10 0 3 PEND - have 2/18 subruns for N00008564_*.cosmic.sntp.cedar_phy.0.root 300 06/04 19:55 0 2 PEND - have 1/24 subruns for N00010195_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:38 0 1 PEND - have 1/18 subruns for N00010236_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:40 0 1 PEND - have 1/24 subruns for N00010265_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:41 0 1 PEND - have 1/24 subruns for N00010268_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:41 0 1 PEND - have 1/22 subruns for N00010271_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:38 0 1 PEND - have 2/24 subruns for N00010277_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:41 0 2 PEND - have 4/24 subruns for N00010283_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:42 0 4 PEND - have 1/24 subruns for N00010286_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:50 0 1 PEND - have 1/17 subruns for N00010329_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:49 0 1 PEND - have 2/24 subruns for N00010338_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:49 0 2 PEND - have 1/24 subruns for N00010341_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:50 0 1 PEND - have 5/24 subruns for N00010347_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:49 0 5 PEND - have 12/13 subruns for N00010371_*.cosmic.sntp.cedar_phy.0.root 300 06/05 08:51 0 12 PEND - have 21/23 subruns for N00010494_*.cosmic.sntp.cedar_phy.0.root 300 06/04 16:49 0 21 PEND - have 3/24 subruns for N00010660_*.cosmic.sntp.cedar_phy.1.root 318 05/18 09:32 0 3 PEND - have 16/24 subruns for N00010724_*.cosmic.sntp.cedar_phy.1.root 317 05/18 13:37 0 16 PEND - have 44/45 subruns for N00011134_*.cosmic.sntp.cedar_phy.1.root 317 05/18 15:39 0 44 PEND - have 22/24 subruns for N00011437_*.cosmic.sntp.cedar_phy.0.root 170 10/12 17:17 0 22 PEND - have 30/31 subruns for N00011651_*.cosmic.sntp.cedar_phy.0.root 294 06/11 03:18 0 30 PEND - have 23/24 subruns for N00011710_*.cosmic.sntp.cedar_phy.0.root 294 06/11 05:02 0 23 PEND - have 11/12 subruns for N00007127_*.spill.mrnt.cedar_phy.0.root 305 05/30 17:58 0 11 PEND - have 9/10 subruns for N00007148_*.spill.mrnt.cedar_phy.0.root 327 05/08 16:04 0 9 PEND - have 1/3 subruns for N00007176_*.spill.mrnt.cedar_phy.0.root 327 05/08 16:19 0 1 PEND - have 1/2 subruns for N00007188_*.spill.mrnt.cedar_phy.0.root 305 05/30 18:40 0 1 PEND - have 4/6 subruns for N00007194_*.spill.mrnt.cedar_phy.0.root 305 05/30 18:41 0 4 PEND - have 19/30 subruns for N00007197_*.spill.mrnt.cedar_phy.0.root 305 05/30 20:43 0 19 PEND - have 28/30 subruns for N00007929_*.spill.mrnt.cedar_phy.0.root 327 05/09 09:20 0 28 PEND - have 21/22 subruns for N00008029_*.spill.mrnt.cedar_phy.0.root 303 06/01 13:55 0 21 PEND - have 17/18 subruns for 
N00008305_*.spill.mrnt.cedar_phy.0.root 324 05/12 08:35 0 17 PEND - have 23/24 subruns for N00008345_*.spill.mrnt.cedar_phy.0.root 303 06/02 04:54 0 23 PEND - have 1/2 subruns for N00008366_*.spill.mrnt.cedar_phy.0.root 303 06/02 05:10 0 1 PEND - have 23/24 subruns for N00008492_*.spill.mrnt.cedar_phy.1.root 302 06/02 12:49 0 23 PEND - have 14/15 subruns for N00008510_*.spill.mrnt.cedar_phy.1.root 302 06/02 12:33 0 14 PEND - have 2/11 subruns for N00009635_*.spill.mrnt.cedar_phy.1.root 300 06/04 18:05 0 2 PEND - have 9/10 subruns for N00009770_*.spill.mrnt.cedar_phy.0.root 300 06/04 13:32 0 9 PEND - have 12/14 subruns for N00010371_*.spill.mrnt.cedar_phy.0.root 300 06/05 08:52 0 12 PEND - have 18/19 subruns for N00010577_*.spill.mrnt.cedar_phy.0.root 300 06/04 23:24 0 18 PEND - have 22/23 subruns for N00010583_*.spill.mrnt.cedar_phy.0.root 299 06/05 20:53 0 22 PEND - have 23/24 subruns for N00010586_*.spill.mrnt.cedar_phy.0.root 300 06/05 09:16 0 23 PEND - have 15/19 subruns for N00010589_*.spill.mrnt.cedar_phy.0.root 299 06/05 22:41 0 15 PEND - have 22/24 subruns for N00010631_*.spill.mrnt.cedar_phy.0.root 298 06/06 11:46 0 22 PEND - have 6/7 subruns for N00011047_*.spill.mrnt.cedar_phy.0.root 298 06/07 10:53 0 6 PEND - have 26/27 subruns for N00011113_*.spill.mrnt.cedar_phy.0.root 321 05/15 08:48 0 26 PEND - have 39/40 subruns for N00011134_*.spill.mrnt.cedar_phy.0.root 321 05/15 09:21 0 39 PEND - have 22/24 subruns for N00011437_*.spill.mrnt.cedar_phy.0.root 170 10/12 17:18 0 22 PEND - have 23/24 subruns for N00011710_*.spill.mrnt.cedar_phy.0.root 294 06/11 05:02 0 23 PEND - have 11/23 subruns for N00011728_*.spill.mrnt.cedar_phy.1.root 317 05/19 02:28 0 11 PEND - have 4/8 subruns for N00011750_*.spill.mrnt.cedar_phy.1.root 317 05/19 02:47 0 4 PEND - have 8/24 subruns for N00011772_*.spill.mrnt.cedar_phy.1.root 317 05/19 02:34 0 8 PEND - have 1/12 subruns for N00007127_*.spill.sntp.cedar_phy.1.root 159 10/23 12:27 0 1 PEND - have 23/24 subruns for N00008492_*.spill.sntp.cedar_phy.1.root 302 06/02 12:49 0 23 PEND - have 14/15 subruns for N00008510_*.spill.sntp.cedar_phy.1.root 302 06/02 12:33 0 14 PEND - have 2/11 subruns for N00009635_*.spill.sntp.cedar_phy.1.root 300 06/04 18:05 0 2 PEND - have 11/23 subruns for N00011728_*.spill.sntp.cedar_phy.1.root 317 05/19 02:28 0 11 PEND - have 4/8 subruns for N00011750_*.spill.sntp.cedar_phy.1.root 317 05/19 02:47 0 4 PEND - have 8/24 subruns for N00011772_*.spill.sntp.cedar_phy.1.root 317 05/19 02:34 0 8 ./roundup -f 10 -r cedar_phy near Mon Mar 31 11:33:08 CDT 2008 These seem to be on tape, ./roundup -r cedar_phy near This cleared out WRITE, looks OK ########## # CFLSUM # ########## cflsum - updated to include BAD and stage in miscellaneous summary ########### # ENSTORE # ########### Checking status of LTO-4 tapes, and archives to stage Strangely large number of write errors, but the tapes seem to be properly filled. 
MINOS26 > du -sm /pnfs/minos/stage/daikon_04 5127706 /pnfs/minos/stage/daikon_04 MINOS26 > du -sm /pnfs/minos/stage/daikon_00 12006 /pnfs/minos/stage/daikon_00 MINOS26 > ./volumes vols OK , refreshing volume listing in /tmp/vols -rw-r--r-- 1 kreymer g020 215774 Mar 31 09:14 /tmp/vols MINOS26 > ./volumes stage VO9430 VOC445 VOC483 VOC493 VOC630 VOJ545 VOJ546 VOJ547 VOJ548 VOJ549 VOJ550 VOJ551 VOLS4=' VOJ545 VOJ546 VOJ547 VOJ548 VOJ549 VOJ550 VOJ551 ' MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep wr_err ; done VOJ545 'sum_wr_err': 2, VOJ546 'sum_wr_err': 1, VOJ547 'sum_wr_err': 2, VOJ548 'sum_wr_err': 1, VOJ549 'sum_wr_err': 0, VOJ550 'sum_wr_err': 2, VOJ551 'sum_wr_err': 1, MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep remaining ; done VOJ545 'remaining_bytes': 792764928L, VOJ546 'remaining_bytes': 0L, VOJ547 'remaining_bytes': 0L, VOJ548 'remaining_bytes': 244734464L, VOJ549 'remaining_bytes': 765575680L, VOJ550 'remaining_bytes': 247753728L, VOJ551 'remaining_bytes': 132352000000L, MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep eod_cookie ; done VOJ545 'eod_cookie': '0000_000000000_0001049', VOJ546 'eod_cookie': '0000_000000000_0001051', VOJ547 'eod_cookie': '0000_000000000_0001683', VOJ548 'eod_cookie': '0000_000000000_0002324', VOJ549 'eod_cookie': '0000_000000000_0000924', VOJ550 'eod_cookie': '0000_000000000_0001079', VOJ551 'eod_cookie': '0000_000000000_0001946', ============================================================================= 2008 03 30 Sunday ############ # MCIMPORT # ############ Started Forward archive, MDFREE was 728418 Sun Mar 30 08:36:22 CDT 2008 See log details under 2008 03 25 for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done Sun Mar 30 09:02:07 CDT 2008 ============================================================================= 2008 03 28 ######## # FARM # ######## ROUNTMP/ECRC Need to clean this up these files are short, 1 line, contain just the ecrc int We recently are producing more of these, due to cand handling. roundup should be updated to remove them immediately after using them. SRV1> ls ../ROUNTMP/ECRC | wc -l 69965 SRV1> find . -atime +10 | wc -l 63036 SRV1> find . -atime -10 | wc -l 6243 SRV1> du -sm ../ROUNTMP/ECRC 278 ../ROUNTMP/ECRC cd /export/stage/minfarm/ROUNDUP/ECRC SRV1> tar cf ../ECRC.20080328.tar . SRV1> du -sm ../ECRC.20080328.tar 69 ../ECRC.20080328.tar SRV1> tar tf ../ECRC.20080328.tar | wc -l 69989 Check for any particluarly old WRITE files which might need ECRC's ls -lutL /minos/data/minfarm/WRITE ... rubin 61821998 Dec 27 13:29 f20011128_0009_CosmicMu_D02.sntp.cedar.root minfarm 496353221 Dec 27 13:28 f20011128_0000_CosmicMu_D02.sntp.cedar.root minfarm 63 Dec 11 18:27 Merged.1751.root minfarm 20552457 Dec 11 10:57 Merged.21430.root SRV1> rm /minos/data/minfarm/WRITE/Merged.1751.root SRV1> rm /minos/data/minfarm/WRITE/Merged.21430.root The two cedar mcfar files got caught in a DCache write backlog at Thu Dec 27 13:29:47 CST 2007 This was never picked up after the holidays. ./roundup -r cedar mcfar Oops, in tarring these up have modified access times, so cannot do find . -atime +10 -exec echo rm {} \; Check the file count, based on modification time, find . -mtime +10 | wc -l 63730 This seems right. Many were accessed after the cand files were on tape. So update the two oldies we still need, and purge based on mtime. 
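( Side note for a future pass : GNU tar can avoid clobbering the atimes in the first place ; a sketch, assuming --atime-preserve is available in the tar on fnpcsrv1, noting that restoring the atime updates the ctime )
cd /export/stage/minfarm/ROUNDUP/ECRC
tar --atime-preserve -cf ../ECRC.20080328.tar .
# after this, find . -atime +10 would still be meaningful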
touch f20011128_0009_CosmicMu_D02.sntp.cedar.root touch f20011128_0000_CosmicMu_D02.sntp.cedar.root find . -mtime +10 | wc -l 63728 OK, let's purge them, at about 18:34 find . -mtime +10 -exec rm {} \; done at around 18:39 SRV1> ls /export/stage/minfarm/ROUNDUP/ECRC | wc -l 6274 ############ # MCIMPORT # ############ ./pnfsdirs far cedar_phy_bhcurv daikon_04 AtmosNu ./pnfsdirs far cedar_phy_bhcurv daikon_04 AtmosNu write Should be ready to import arms files now, from /minos/data/mcimport/arms/mcin $ mv STAGE/arms/NOIMPORT STAGE/arms/IMPORT $ ./mcimport -n -b 3 arms 14:32 GRRRRRRRRRRR - stuck doing 32156 pts/2 Ss 0:01 -bash 24132 pts/2 S+ 0:00 \_ /bin/sh ./mcimport -n -b 3 arms 26030 pts/2 S+ 0:01 \_ du -sm /home/mindata/STAGE/arms/ This is also stuck locally. And also stuck doing direct du -sm /minos/data/mcimport/arms/mcin It was just going slowly, eventually finished at 14:47 $ ./mcimport -b 1 arms OK, logging activity to /home/mindata/STAGE/arms/log/mcimport.log Files seem to be declared to SAM Will let the scheduled import pick up the rest of these files. ########## # ISAJET # ########## Using isajet-users as a testbed for mailing list changes. The only present subscriber is syoon@fnal.gov, Phil Yoon, Subscribed on 14 Aug 2001 Removed syoon, added kreymer ######### # ADMIN # ######### MINOS-USERS Spam came in from Received: from msgmmp-4.gci.net (msgmmp-4.gci.net [209.165.130.14]) To avoid more spam from offsite, changed SEND to Private via the wizard. ########## # CONDOR # ########## Brian's jobs have finished at high priority. Reset him to normal : condor_userprio -setfactor brebel@fnal.gov 100. condor_userprio -all ########## # CONDOR # ########## HOWTO.condor - changed draft grid environment documents, from $APP - NFS shared, backed except on USCMS T1 $DATA - NFS shared, not backed up $WN_TMP - 50 GB on worker to ${OSG_GRID} - NFS shared, grid support software ${OSG_DATA} - NFS shared, not backed up ${OSG_APP} - NFS shared, backed except on USCMS T1 ${OSG_WN_TMP} - 50 GB on worker ${OSG_SQUID_LOCATION} - which squid to use, if needed https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/StorageParameterOsgWnTmp ########## # CONDOR # ########## regularized file names, like probe.$(Cluster).$(Process).out added JOBLEASEDURATION = 1000000 to glideafs.run glide.run probe.run wms*.run - removed, these are now handled by glide.run MINOS25 > dds *.run -rw-r--r-- 1 kreymer g020 726 Jan 11 15:10 glide150.run -rw-r--r-- 1 kreymer g020 825 Mar 26 12:18 glideafs4hr.run -rw-r--r-- 1 kreymer g020 849 Mar 27 15:31 glideafs70min.run -rw-r--r-- 1 kreymer g020 855 Mar 28 10:37 glideafs.run -rw-r--r-- 1 kreymer g020 731 Mar 27 14:02 glide.run -rw-r--r-- 1 kreymer g020 609 Oct 26 17:23 probe10.run -rw-r--r-- 1 kreymer g020 588 Mar 28 10:38 probe.run -rw-r--r-- 1 kreymer g020 721 Dec 14 14:50 wms1.run -rw-r--r-- 1 kreymer g020 561 Nov 27 20:24 wms2.run -rw-r--r-- 1 kreymer g020 822 Feb 8 15:46 wmsafs.run -rw-r--r-- 1 kreymer g020 719 Dec 12 23:42 wms.run ######## # FARM # ######## Proceed to force out farcat, as there are no DUP's there. But will not force out CPB quite yet, due to bmnt issue. 
farcat 236 7321 all.sntp.cedar.0.root 6 144 all.sntp.cedar_phy.0.root 1 23 all.sntp.cedar_phy.1.root 87 2174 all.sntp.cedar_phy_bhcurv.0.root 546 905 spill.bmnt.cedar_phy_bhcurv.0.root 236 1697 spill.bntp.cedar.0.root 2 2 spill.bntp.cedar_phy.0.root 23 102 spill.bntp.cedar_phy_bhcurv.0.root 208 509 spill.mrnt.cedar_phy_bhcurv.0.root 236 1165 spill.sntp.cedar.0.root 2 1 spill.sntp.cedar_phy.0.root 43 82 spill.sntp.cedar_phy_bhcurv.0.root MINOS-SAM02 > ./samdup /minos/data/minfarm/farcat ./roundup -n -r cedar_phy far Fri Mar 28 08:53:34 CDT 2008 HAVE /export/stage/minfarm/ROUNDUP/READ/SAM/F00030612_0005.spill.bntp.cedar_phy.0.root 3 HAVE /export/stage/minfarm/ROUNDUP/READ/SAM/F00030612_0005.spill.sntp.cedar_phy.0.root 3 OK - processing 11 files OK - stream all.sntp.cedar_phy OK - 167 Mbytes in 5 runs PEND - have 2/8 subruns for F00030612_*.all.sntp.cedar_phy.0.root 268 07/03 14:12 0 2 PEND - have 1/24 subruns for F00034635_*.all.sntp.cedar_phy.1.root 311 05/21 10:10 0 1 PEND - have 2/24 subruns for F00034647_*.all.sntp.cedar_phy.0.root 290 06/11 21:36 0 2 PEND - have 1/7 subruns for F00034675_*.all.sntp.cedar_phy.0.root 290 06/11 21:48 0 1 PEND - have 1/19 subruns for F00034700_*.all.sntp.cedar_phy.0.root 288 06/13 12:09 0 1 OK - stream spill.bntp.cedar_phy OK - 2 Mbytes in 1 runs PEND - have 2/8 subruns for F00030612_*.spill.bntp.cedar_phy.0.root 268 07/03 14:12 3 5 OK - stream spill.sntp.cedar_phy OK - 1 Mbytes in 1 runs PEND - have 2/8 subruns for F00030612_*.spill.sntp.cedar_phy.0.root 268 07/03 14:12 3 5 ./roundup -r cedar_phy far ./roundup -f 1 -r cedar_phy far Proceed to force out nearcat older passes. This may have duplicates in nearcat. ./samdup /minos/data/minfarm/nearcat nearcat 123 3525 cosmic.sntp.cedar.0.root 134 3881 cosmic.sntp.cedar_phy.0.root 84 2312 cosmic.sntp.cedar_phy.1.root 372 6007 spill.mrnt.cedar_phy.0.root 62 789 spill.mrnt.cedar_phy.1.root 212 7318 spill.mrnt.cedar_phy_bhcurv.0.root 282 8478 spill.mrnt.cedar_phy_bhcurv.1.root 202 15164 spill.sntp.cedar.0.root 63 4125 spill.sntp.cedar_phy.1.root 100 6692 spill.sntp.cedar_phy_bhcurv.0.root 43 1511 spill.sntp.cedar_phy_bhcurv.1.root FONES=`cd /minos/data/minfarm/nearcat ; ls *.1.root` printf "${FONES}\n" | wc -w for FILE in ${FONES} ; do ( cd /minos/data/minfarm/nearcat FZER=`echo ${FILE} | sed 's/\.1\./\.0\./g'` [ -r "${FZER}" ] && \ ls -l ${FILE} && ls -l ${FZER} && printf "\n" ) done -rw-rw-r-- 1 rubin numi 38414123 Nov 16 20:01 N00008165_0013.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 38415184 Oct 24 23:15 N00008165_0013.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 9330118 Nov 21 12:42 N00012040_0015.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 9331014 Nov 21 02:14 N00012040_0015.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 17493799 Nov 21 12:42 N00012040_0015.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 17493843 Nov 21 02:14 N00012040_0015.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 224841 Nov 21 11:23 N00012051_0009.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 224815 Nov 21 00:54 N00012051_0009.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 279374 Nov 21 11:23 N00012051_0009.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 279374 Nov 21 00:54 N00012051_0009.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 242964 Nov 21 11:22 N00012051_0022.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 242964 Nov 21 00:54 N00012051_0022.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 
rubin numi 321751 Nov 21 11:22 N00012051_0022.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 321751 Nov 21 00:54 N00012051_0022.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 245992 Nov 21 11:23 N00012051_0023.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 245998 Nov 21 00:54 N00012051_0023.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 327453 Nov 21 11:23 N00012051_0023.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 327453 Nov 21 00:54 N00012051_0023.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 10963512 Nov 21 12:59 N00012054_0000.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 10963274 Nov 21 02:30 N00012054_0000.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 20300550 Nov 21 12:59 N00012054_0000.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 20300544 Nov 21 02:30 N00012054_0000.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 308544 Nov 21 11:24 N00012054_0003.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 308533 Nov 21 00:55 N00012054_0003.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 535687 Nov 21 11:24 N00012054_0003.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 535687 Nov 21 00:55 N00012054_0003.spill.sntp.cedar_phy_bhcurv.0.root sam list files --dim="${SAMDIM}" --nosummary SAMDIM="DATA_TIER mrnt-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 8165" SAMDIM="DATA_TIER cand-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 8165" 13 is pass 1 SAMDIM="DATA_TIER mrnt-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12040" SAMDIM="DATA_TIER sntp-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12040" SAMDIM="DATA_TIER cand-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12040" 15 is pass 1 SAMDIM="DATA_TIER mrnt-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12051" SAMDIM="DATA_TIER sntp-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12051" SAMDIM="DATA_TIER cand-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12051" 9, 22, 23 are pass 1 SAMDIM="DATA_TIER mrnt-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12054" SAMDIM="DATA_TIER sntp-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12054" SAMDIM="DATA_TIER cand-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12054" 0, 3 are pass 1 Bottom line, declare all the 0 passes to be duplicates. 
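( For reference, the run-by-run checks above could be looped rather than pasted one dimension at a time ; a sketch using the same query form )
for RUN in 8165 12040 12051 12054 ; do
  for TIER in mrnt-near sntp-near cand-near ; do
    SAMDIM="DATA_TIER ${TIER} and VERSION cedar.phy.bhcurv and RUN_NUMBER ${RUN}"
    echo "==== ${TIER} ${RUN}"
    sam list files --dim="${SAMDIM}" --nosummary
  done
done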
for FILE in ${FONES} ; do ( cd /minos/data/minfarm/nearcat FZER=`echo ${FILE} | sed 's/\.1\./\.0\./g'` [ -r "${FZER}" ] && echo ${FZER} ) done > /tmp/FDUPS FDUPS=`cat /tmp/FDUPS` printf "${FDUPS}\n" for FILE in ${FDUPS} ; do ( cd /minos/data/minfarm/nearcat mv ${FILE} ../DUP/nearcat/${FILE} ) done done at 13:41 ./roundup -n -r cedar_phy_bhcurv near PEND - have 2/3 subruns for N00007506_*.spill.mrnt.cedar_phy_bhcurv.0.root 156 10/23 17:20 0 2 PEND - have 19/20 subruns for N00008165_*.spill.mrnt.cedar_phy_bhcurv.0.root 155 10/24 21:37 0 19 PEND - have 14/15 subruns for N00008276_*.spill.mrnt.cedar_phy_bhcurv.1.root 73 01/14 14:48 0 14 PEND - have 16/24 subruns for N00008345_*.spill.mrnt.cedar_phy_bhcurv.1.root 73 01/14 19:33 0 16 PEND - have 3/7 subruns for N00008433_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 01:05 0 3 PEND - have 2/22 subruns for N00008436_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 09:02 0 2 PEND - have 1/12 subruns for N00008439_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 09:58 0 1 PEND - have 4/19 subruns for N00008454_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:51 0 4 PEND - have 2/24 subruns for N00008457_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 10:59 0 2 PEND - have 2/24 subruns for N00008460_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:17 0 2 PEND - have 4/20 subruns for N00008463_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 09:31 0 4 PEND - have 1/24 subruns for N00008469_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:33 0 1 PEND - have 1/2 subruns for N00008472_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 09:20 0 1 PEND - have 1/24 subruns for N00008478_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:13 0 1 PEND - have 6/24 subruns for N00008481_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 10:29 0 6 PEND - have 12/24 subruns for N00008486_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 14:30 0 12 PEND - have 6/24 subruns for N00008489_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/14 06:08 0 6 PEND - have 13/24 subruns for N00008492_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/14 08:40 0 13 PEND - have 4/6 subruns for N00008495_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 20:46 0 4 PEND - have 16/24 subruns for N00008498_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 19:06 0 16 PEND - have 11/17 subruns for N00008501_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 19:51 0 11 PEND - have 10/24 subruns for N00008504_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 19:52 0 10 PEND - have 6/14 subruns for N00008507_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 21:00 0 6 PEND - have 11/15 subruns for N00008510_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 13:46 0 11 PEND - have 1/2 subruns for N00008517_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 10:17 0 1 PEND - have 4/21 subruns for N00008523_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 09:24 0 4 PEND - have 9/22 subruns for N00008526_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:52 0 9 PEND - have 13/24 subruns for N00008529_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 18:35 0 13 PEND - have 4/24 subruns for N00008532_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:23 0 4 PEND - have 1/17 subruns for N00008538_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 19:02 0 1 PEND - have 2/18 subruns for N00008564_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:09 0 2 PEND - have 1/7 subruns for N00008568_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 13:06 0 1 PEND - have 1/13 subruns for N00008672_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 13:59 0 1 PEND - have 1/2 subruns 
for N00008692_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 13:11 0 1 PEND - have 7/12 subruns for N00008695_*.spill.mrnt.cedar_phy_bhcurv.1.root 132 11/16 15:24 0 7 PEND - have 23/24 subruns for N00009629_*.spill.mrnt.cedar_phy_bhcurv.1.root 129 11/20 01:21 0 23 PEND - have 15/16 subruns for N00009659_*.spill.mrnt.cedar_phy_bhcurv.1.root 129 11/20 07:30 0 15 PEND - have 13/14 subruns for N00009705_*.spill.mrnt.cedar_phy_bhcurv.1.root 129 11/20 04:19 0 13 PEND - have 14/15 subruns for N00009816_*.spill.mrnt.cedar_phy_bhcurv.1.root 128 11/20 16:45 0 14 PEND - have 20/23 subruns for N00009836_*.spill.mrnt.cedar_phy_bhcurv.1.root 128 11/20 17:57 0 20 PEND - have 22/23 subruns for N00011059_*.spill.mrnt.cedar_phy_bhcurv.0.root 151 10/29 04:17 0 22 PEND - have 26/27 subruns for N00011113_*.spill.mrnt.cedar_phy_bhcurv.0.root 151 10/29 12:55 0 26 PEND - have 39/40 subruns for N00011134_*.spill.mrnt.cedar_phy_bhcurv.0.root 150 10/29 19:12 0 39 PEND - have 23/24 subruns for N00012129_*.spill.mrnt.cedar_phy_bhcurv.0.root 127 11/21 21:18 0 23 PEND - have 20/21 subruns for N00012431_*.spill.mrnt.cedar_phy_bhcurv.0.root 125 11/24 02:18 0 20 PEND - have 1/19 subruns for N00008165_*.spill.sntp.cedar_phy_bhcurv.1.root 132 11/16 20:01 0 1 PEND - have 21/22 subruns for N00008251_*.spill.sntp.cedar_phy_bhcurv.1.root 73 01/14 14:00 0 21 PEND - have 14/24 subruns for N00008254_*.spill.sntp.cedar_phy_bhcurv.1.root 74 01/14 12:18 0 14 PEND - have 1/2 subruns for N00008366_*.spill.sntp.cedar_phy_bhcurv.1.root 73 01/14 19:39 0 1 PEND - have 39/40 subruns for N00011134_*.spill.sntp.cedar_phy_bhcurv.0.root 150 10/29 19:12 0 39 All these are a couple of months stale. We also are picking up the internal duplicates just moved out of the way. ./roundup -f 30 -r cedar_phy_bhcurv near ============================================================================= 2008 03 27 ######## # FARM # ######## SRV1> ls /minos/data/minfarm/nearcat/*0.root | wc -l 1143 SRV1> ls /minos/data/minfarm/nearcat/*1.root | wc -l 534 SRV1> ls /minos/data/minfarm/farcat/*0.root | wc -l 1625 SRV1> ls /minos/data/minfarm/farcat/*1.root | wc -l 1 SRV1> ls /minos/data/minfarm/farcat/F00034635_0000* /minos/data/minfarm/farcat/F00034635_0000.all.sntp.cedar_phy.1.root ########## # MDFREE # ########## Started keeping track of /minos/data free disk ${HOME}/minos/scripts/mdfree_log & ######### # ADMIN # ######### Seen rtoner entry at 2008 02 25 She has access, per blake MINOS01 > pts adduser -user rtoner -group minos MINOS01 > pts membership minos | grep toner ######## # FARM # ######## 11:30 SRV1> ./farmgsum Summarizing /grid/data/minos/*cat 1677 57043 nearcat 1626 13483 farcat 2944 493599 mcnearcat 0 1 mcfarcat 0 1 mcfmockcat 1726 332816 minfarm/WRITE 7973 896943 TOTAL files, GBytes ######## # FARM # ######## Ticket 112276 is resolved ( overloaded fnpcsrv1 due to carneiro jobs ) See details below. ============================================================================= 2008 03 26 ######## # FARM # ######## 1996 107037 nearcat 1839 23836 farcat 4725 830248 mcnearcat 513 12063 mcfarcat 0 1 mcfmockcat 91 6504 minfarm/WRITE 9164 979689 TOTAL files, GBytes There are many files with both passes 0 and 1 in these areas. MCFARCAT mcfarcat 513 12646 mrnt.cedar_phy.root All these are subrun 0, Feb 24/25 ssh minos-sam02 minos ./samdup /minos/data/minfarm/mcfarcat ./roundup -n -r cedar_phy mcfar This would have added all files, no complaints Do it ! 
./roundup -r cedar_phy mcfar and the next day, to purge, did again ./roundup -r cedar_phy mcfar The saddreco logs look OK, mcfarcat is clear ! ######## # BMNT # ######## Urkkh. BMNT files have resurfaced again, farcat 546 905 spill.bmnt.cedar_phy_bhcurv.0.root 208 509 spill.mrnt.cedar_phy_bhcurv.0.root See log entries 2008 01 17 These were all produced on Feb 28 and 29. There seem to be no corresponding mrnt's waiting for concatenation. Let's set this aside, till we get the rest of the files concatenated. ########## # CONDOR # ########## Investigating file ownership created under Condor glideins, The probe job shows ID uid=7927(minos) gid=5111(numi) groups=5111(numi) MINOS25 > ypcat passwd | grep minoscvs minoscvs:KERBEROS:7927:5111:E875 Minos:/home/minoscvs:/home/minoscvs/bin/cvsh likewise, here is another minos account : [minos@minos-evd ~]$ id uid=500(minos) gid=5111(e875) groups=100(users),5111(e875),1100545895 context=user_u:system_r:initrc_t ########## # CONDOR # ########## Investigating Glidein job disconnects, logs/errs under /minos/data/users/pawloski/Nue/AttenuationStudy/Far_Beam_GainPlus10 /minos/data/users/pawloski/Nue/AttenuationStudy/Far_Beam_MEUMinus30 Jobs submitted as /minos/scratch/boehm/CondorTest/GregSub/condor_submit_gliden_Gain_10High_FD.sh running this : /minos/scratch/boehm/CondorTest/GregSub/condor_jobs_gliden_Gain_10High_FD.sh These were resubmitted, setting this : JobLeaseDuration = 360000 These all completed properly, but many had to reconnect, success this time ! Scanned all 100 logs for disconnect messges. MINOS25 > grep disconnected /minos/data/users/pawloski/Nue/AttenuationStudy/Far_Beam_GainPlus10/log.* | wc -l 94 MINOS25 > grep disconnected /minos/data/users/pawloski/Nue/AttenuationStudy/Far_Beam_MEUMinus30/log.* | wc -l 80 Most of the disconnecting jobs lasted about 70 minutes. Most of the clean jobs lasted 64 minutes, but a few were over 71 minutes. In all cases the job ran and produced an output file. The disconnect was at the time of job completion, based on file times. PROBE TESTS Created 4 hour version of probe ( 1440 iterations, 10 seconds per, of tiny) condor_submit glideafs.run Started up on fnpc341 around 12:20, files go to logs/4hr Running probenew, allowing tiny or sleep , setting delay in seconds probenew ${SEC} tiny 4200 #( for about 70 minutes ) Running 5 in parallel. condor_submit glideafs70min.run Got lucky, these all started right away, at 17:17 And they bailed out in the classic way, eventually failing entirely MINOS25 > cat logs/70min/probe.log.61486.0 000 (61486.000.000) 03/26 17:17:21 Job submitted from host: <131.225.193.25:63984> ... 001 (61486.000.000) 03/26 17:17:26 Job executing on host: <131.225.166.130:63103> ... 006 (61486.000.000) 03/26 17:17:34 Image size of job updated: 10152 ... 022 (61486.000.000) 03/26 18:27:26 Job disconnected, attempting to reconnect Socket between submit and execute hosts closed unexpectedly Trying to reconnect to vm2@4190@fnpc342.fnal.gov <131.225.166.130:63103> ... 024 (61486.000.000) 03/26 18:27:26 Job reconnection failed Job disconnected too long: JobLeaseDuration (3600 seconds) expired Can not reconnect to vm2@4190@fnpc342.fnal.gov, rescheduling job ... 001 (61486.000.000) 03/26 18:27:53 Job executing on host: <131.225.166.130:63103> AND MORE OF THE SAME, TILL THE END : 001 (61486.000.000) 03/27 05:20:15 Job executing on host: <131.225.166.120:64512> ... 009 (61486.000.000) 03/27 05:20:23 Job was aborted by the user. 
The system macro SYSTEM_PERIODIC_REMOVE expression '(JobRunCount > 10) || (JobRunCount>=1 && ImageSize>1000000 && JobStatus==1)' evaluated to TRUE ... ADDED JobLeaseDuration = 360000 to glideafs70min.run - added Reran at 11:46, cluster 61836 MINOS25 > cat logs/70min/probe.log.61836.0 000 (61836.000.000) 03/27 11:48:02 Job submitted from host: <131.225.193.25:63984> ... 001 (61836.000.000) 03/27 11:48:07 Job executing on host: <131.225.166.131:62119> ... 006 (61836.000.000) 03/27 11:48:15 Image size of job updated: 10152 ... 022 (61836.000.000) 03/27 12:58:11 Job disconnected, attempting to reconnect Socket between submit and execute hosts closed unexpectedly Trying to reconnect to vm2@32007@fnpc344.fnal.gov <131.225.166.131:62119> ... 023 (61836.000.000) 03/27 12:58:11 Job reconnected to vm2@32007@fnpc344.fnal.gov startd address: <131.225.166.131:62119> starter address: <131.225.166.131:62606> Now try a test with a short lease, 5 minutes and a 10 minute execution time. JobLeaseDuration = 300 cluster 61850 at 13:15 All jobs completed normally in 10 minutes. Gave myself a boost with condor_userprio -setfactor kreymer@fnal.gov 1. Let's try using sleep, instead of tiny, in the failing test: cluster 61859 at 13:42 MINOS25 > cat logs/70min/probe.log.61859.0 000 (61859.000.000) 03/27 13:42:10 Job submitted from host: <131.225.193.25:63984> ... 001 (61859.000.000) 03/27 13:44:14 Job executing on host: <131.225.166.118:62566> ... 006 (61859.000.000) 03/27 13:44:22 Image size of job updated: 14956 ... 022 (61859.000.000) 03/27 14:54:15 Job disconnected, attempting to reconnect Socket between submit and execute hosts closed unexpectedly Trying to reconnect to vm2@1277@fnpc339.fnal.gov <131.225.166.118:62566> ... 024 (61859.000.000) 03/27 14:54:15 Job reconnection failed Job disconnected too long: JobLeaseDuration (3600 seconds) expired Can not reconnect to vm2@1277@fnpc339.fnal.gov, rescheduling job Same for sections 0, 1, 2 Good enough, killing these, rather than let them waste VM's. condor_rm 61859 For long term workaround, use JOBLEASEDURATION = 1000000 ########## # DCACHE # ########## MINOS26 > date Wed Mar 26 09:07:08 CDT 2008 MINOS26 > ~kreymer/minos/scripts/dc_stat /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.ro ============================ PNFS status for /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root -rw-r--r-- 1 rubin e875 541154973 Mar 21 12:48 n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:bee4af9c;l=541154973; w-stkendca11a-5 LEVEL 4 ============================ Date: Wed, 26 Mar 2008 09:10:50 -0500 (CDT) Subject: HelpDesk ticket 113172 ___________________________________________ Short Description: Minos file in w-stkendca11a-5 for 5 days still not on tape Problem Description: dcache-admin : The following file has been in the write pool since Friday, and is till not on an tape. Many other similar files have been written since that time. Please investigate and flush this file to tape. 
MINOS26 > date Wed Mar 26 09:07:08 CDT 2008 MINOS26 > ~kreymer/minos/scripts/dc_stat /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/7 12/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root ============================ PNFS status for /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root -rw-r--r-- 1 rubin e875 541154973 Mar 21 12:48 n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:bee4af9c;l=541154973; w-stkendca11a-5 LEVEL 4 ============================ ___________________________________________ This ticket is assigned to SSA Primary of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 26 Mar 2008 12:10:23 -0500 (CDT) Note To Requester: berg@fnal.gov sent this Notes To Requester: Art, This file is failing in the encp from dcache because of a CRC mismatch. I'm looking into it. - David _________________________________________________________________ Date: Wed, 26 Mar 2008 12:48:49 -0500 (CDT) That CRC doesn't match either what's in dcache, or what's being returned by encp. The path you give below - is that in BlueArc? I'm a little confused by the difference in the order of the path elements between that and pnfs, but it seems to refer to the same file. What is the length of that file? I'll talk with experts about how to proceed. _________________________________________________________________ Date: Fri, 28 Mar 2008 18:20:16 -0500 (CDT) Art, The developer says for you to go ahead and remove the file from pnfs space, then rewrite. We will have to go in later to remove the file from dcache, because cleaner will not automatically remove a precious file when it is removed from pnfs, but the encp will stop. - David _________________________________________________________________ Solution: berg@fnal.gov sent this solution: The file was written to VOH293 on Mar 29, after being rewritten to dcache by the user. Evidently the file itself became corrupted in dcache. The CRC of the rewritten file as recorded by dcache is the same as it was before. Attempts to write it to tape failed because the CRC calculated by encp as the file went to tape was different, so the file content must have changed. ___________________________________________________________________ Observed LTO3_12.mover alive : busy mounting volume VOE096 stkenmvr112a 2008-Mar-26 10:49:09 Odd, this is not showing up yet in LEVEL4 This is strange, most of the files on this tape seem to be removed ! 
Files are under pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/ only two remain, 713/n13047134_0013_L010185N_D04.cand.cedar_phy_bhcurv.0.root 738/n13037389_0023_L010185N_D04.cand.cedar_phy_bhcurv.0.root SRV1> dds /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 541154973 Mar 21 10:57 /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root ecrc : CRC 2086317979 And the original ECRC record, cat /export/stage/minfarm/ROUNDUP/ECRC/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root 2086317979 So per developer recommendation, as rubin on fnpcsrv1, at about Fri Mar 28 18:31 rm /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root 2008 03 31 This is on tape, VOH293 0000_000000000_0000571 541154973 ######## # FARM # ######## cedar_phy_mboone This has never been concatenated before, or declared to SAM. SAMDIM="VERSION cedar.phy.mboone" MINOS26 > sam list files --dim="${SAMDIM}" --summary_only File Count: 0 Added this to ROUNTMP/ROOTRELS But we do not actually want to concatenate c_p_m, The volume of data is small, and there is one file per run already. ============================================================================= 2008 03 25 ############ # MCIMPORT # ############ Cleaned up one bad copy from last week, this was at the time of the BlueArc problem. $ dds /home/mindata/TAPE/ total 1508 drwxr-xr-x 2 mindata e875 12288 Mar 17 23:57 ./ drwxr-xr-x 10 mindata e875 4096 Mar 25 16:21 ../ -rw-r--r-- 1 mindata e875 1523712 Mar 17 08:52 n11037135_0020_L250200N_D04.tar.gz $ dds /minos/data/mcimport/STAGE/daikon_04/L250200N/near/713/n11037135_0020_L250200N_D04.tar.gz -rw-r--r-- 1 mindata e875 752680142 Dec 13 20:44 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/713/n11037135_0020_L250200N_D04.tar.gz $ dds /pnfs/minos/stage/daikon_04/L250200N/near/713/n11037135_0020_L250200N_D04.tar.gz -rw-r--r-- 1 mindata e875 1523712 Mar 17 10:35 /pnfs/minos/stage/daikon_04/L250200N/near/713/n11037135_0020_L250200N_D04.tar.gz less /minos/data/mcimport/TAR/daikon_04/L250200N/near/713/mcimport.log this ends with COPY n11037135_0019_L250200N_D04.tar.gz Not too surprising, the BlueArc outage killed the log file. $ rm /pnfs/minos/stage/daikon_04/L250200N/near/713/n11037135_0020_L250200N_D04.tar.gz $ rm /home/mindata/TAPE/n11037135_0020_L250200N_D04.tar.gz Let's redo,catch another 95 GB of files $ du -sm /minos/data/mcimport/STAGE/daikon_04/L250200N/near/* 1 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700 1385 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/701 ... 
96027 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/713 1517 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/714 154 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/715 846 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/717 DIR=713 ./mcimport.20080311 -T daikon_04/L250200N/near/${DIR} Tue Mar 25 17:10:17 CDT 2008 OK - version mcimport.20080303 processing from /minos/data/mcimport/STAGE/daikon_04/L250200N/near/713 LOGS TAPER NFILES 129 ############ # MCIMPORT # ############ Most proceed to archive more pre-overlay files, /minos/data space was down to 1.9 TB free Monday, down to 1.2 today Forward field: 7101-7350 7501-7650 Reverse field: 7001-7120 Forward field files are like n1103* n1203* Reversed field files are like n1104* n1204* FDIRS='710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 ' RDIRS='700 701 702 703 704 705 706 707 708 709 710 711 712 ' Can use -s to select correct configurations, Check min run : 700 - min run is 7001 750 - min run is 7501 710 - min run is 7100 Assuming Robert is overlaying using 7001 to 7099 ( not 7100 ) we can just archive these full directories. This would look like for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1203 daikon_04/L010185N/near/${DIR} done for DIR in ${RDIRS}; do ./mcimport.20080326 -T -s n1104 daikon_04/L010185N/near/${DIR} done for DIR in ${RDIRS}; do ./mcimport.20080326 -T -s n1204 daikon_04/L010185N/near/${DIR} done ls /minos/data/mcimport/STAGE/daikon_04/L010185N/near 700 | wc -l 746 700/n1104* | wc -l 276 700/n1204* | wc -l 97 2008 03 26 Testing with DIR=700 AFSS/mcimport.20080326 -n -T -s n1104 daikon_04/L010185N/near/${DIR} for DIR in ${RDIRS}; do ./mcimport.20080326 -n -T -s n1104 daikon_04/L010185N/near/${DIR} done \ | grep NFILES NFILES 275 NFILES 309 NFILES 305 NFILES 309 NFILES 308 NFILES 308 NFILES 308 NFILES 309 NFILES 310 NFILES 308 NFILES 305 NFILES 305 NFILES 309 $ for DIR in ${RDIRS}; do ./mcimport.20080326 -n -T -s n1204 daikon_04/L010185N/near/${DIR} done | grep NFILES NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 OK, let's do the reverse detector files, the rocks are more like pebbles, too small to be directly copied. $ for DIR in ${RDIRS}; do ./mcimport.20080326 -T -s n1104 daikon_04/L010185N/near/${DIR}; done OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N/near/700/mcimport.log Wed Mar 26 10:24:33 CDT 2008 OK - version mcimport.20080325 processing from /minos/data/mcimport/STAGE/daikon_04/L010185N/near/700 Copies are running at 10 MB/sec, better than usual ! 
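The scan-then-archive pattern above ( dry run with -n, count NFILES, then run for real ) could be wrapped in one small script. A minimal sketch, assuming mcimport.20080326 keeps the -n / -T / -s options used here ; the skip-on-zero test and the NFILES parsing are illustrative, not part of mcimport :
#!/bin/sh
# Sketch : dry-run each directory, archive only where the selection matches files.
SEL=n1104                                # file selection, as in the n1104 pass above
BASE=daikon_04/L010185N/near
RDIRS='700 701 702 703 704 705 706 707 708 709 710 711 712'
for DIR in ${RDIRS} ; do
  NFIL=`./mcimport.20080326 -n -T -s ${SEL} ${BASE}/${DIR} | grep NFILES | awk '{print $2}'`
  if [ -z "${NFIL}" -o "${NFIL}" = "0" ] ; then
    echo "SKIP    ${DIR}"
    continue
  fi
  echo "ARCHIVE ${DIR} ${NFIL} files"
  ./mcimport.20080326 -T -s ${SEL} ${BASE}/${DIR}
done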
After 700 is complete, scanned sizes, to see what we've gained in clearing out reversed field files, $ du -sm /minos/data/mcimport/STAGE/daikon_04/L010185N/near/70* 90862 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/700 197379 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/701 196882 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/702 197665 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/703 197688 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/704 195568 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/705 197000 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/706 203266 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/707 200288 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/708 197098 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/709 Wed Mar 26 17:08:17 2008 Stuck due to LTO4_40.mover failure, alive : ERROR - ('MOUNTFAILED', 'max_consecutive_failures (3) reached') This cleared at 10:58 Thursday, now running normally, on mover 42. Mover 40 seems to be OK also. MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 28T 699G 98% /minos/data MINOS26 > date Thu Mar 27 11:29:55 CDT 2008 We are in serious trouble, let's hope we can get ahead at last. 2008 03 30 The RDIR copies finished Sat Mar 29 15:53:22 CDT 2008 Starting forward, Oops, ran the first n1103 scan without the -n, interrupted and did rm TAPE/n11037100_0010_L010185N_D04.tar.gz Scanning first, for DIR in ${FDIRS}; do printf "${DIR} " ./mcimport.20080326 -n -T -s n1103 daikon_04/L010185N/near/${DIR} \ | grep NFILES done 710 NFILES 307 711 NFILES 304 712 NFILES 304 713 NFILES 302 714 NFILES 308 715 NFILES 308 716 NFILES 306 717 NFILES 308 718 NFILES 305 719 NFILES 304 720 NFILES 307 721 NFILES 309 722 NFILES 306 723 NFILES 309 724 NFILES 308 725 NFILES 307 726 NFILES 309 727 NFILES 306 728 NFILES 306 729 NFILES 307 730 NFILES 309 731 NFILES 309 732 NFILES 305 733 NFILES 303 734 NFILES 306 735 NFILES 327 750 NFILES 276 751 NFILES 307 752 NFILES 307 753 NFILES 304 754 NFILES 308 755 NFILES 308 756 NFILES 304 757 NFILES 308 758 NFILES 305 759 NFILES 307 760 NFILES 307 761 NFILES 307 762 NFILES 306 763 NFILES 308 764 NFILES 308 765 NFILES 307 for DIR in ${FDIRS}; do printf "${DIR} " ./mcimport.20080326 -n -T -s n1203 daikon_04/L010185N/near/${DIR} \ | grep NFILES done 710 NFILES 0 711 NFILES 0 712 NFILES 0 713 NFILES 0 714 NFILES 0 715 NFILES 0 716 NFILES 0 717 NFILES 0 718 NFILES 0 719 NFILES 0 720 NFILES 0 721 NFILES 0 722 NFILES 0 723 NFILES 0 724 NFILES 0 725 NFILES 0 726 NFILES 0 727 NFILES 0 728 NFILES 0 729 NFILES 0 730 NFILES 0 731 NFILES 0 732 NFILES 0 733 NFILES 0 734 NFILES 0 735 NFILES 0 750 NFILES 0 751 NFILES 0 752 NFILES 0 753 NFILES 0 754 NFILES 0 755 NFILES 0 756 NFILES 0 757 NFILES 0 758 NFILES 0 759 NFILES 0 760 NFILES 0 761 NFILES 0 762 NFILES 0 763 NFILES 0 764 NFILES 0 765 NFILES 0 Let's run for real : for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done ########## # CONDOR # ########## Last time we had a BlueArc hangup, released all the gfactory processes that had been held : I think this overshot the running job goal. This time, will condor_rm them. Neither works, they keep getting held. Odd, jobs do seem to have been running sometimes last week. There were failures, seen under /minos/scratch/kreymer/condor, -rw-r--r-- 1 kreymer g020 139 Mar 22 13:40 logs/glideafs/probe.err.57870.0 ... 
-rw-r--r-- 1 kreymer g020 139 Mar 24 15:40 logs/glideafs/probe.err.60041.0 -rw-r--r-- 1 kreymer g020 0 Mar 24 15:50 logs/glideafs/probe.err.60062.0 The error files are normal, and there are normal output files for these jobs. 60062.0 and subsequent jobs are still Idle Igor finds that the proxy has, after all, expired. See messages in /home/gfactory/glideinsubmit/glidein_t11/entry_gpminos/log/*err* /home/gfactory/glideinsubmit/glidein_t11/entry_gpgeneral/log/*err* Per sfiligoie, killed off the old harmless Held gfactory jobs, which had lost track of their state with respect to the GPfarm, with condor_rm -forcex In some cases, had to specify the full job.section And the removal was not immediate, took a few minutes to take effect as seen in condor_q ########### # PHYSICS # ########### http://newsinfo.iu.edu/web/page/normal/7294.html Mufsen/Rebal cosmic results ####### # SAM # ####### Resolved old IT items at https://plone3.fnal.gov/SAMGrid/tracking/pcng_search_form Searched for items submitted by kreymer/ SAM-IT/1979] SAM C++ API compiler warnings (#3/resolve) These warnings are gone in sam_cpp_api v8_4_0_1 -q GCC-3.4.3 Thanks ! SAM-IT/1751: multiple parameter selection The root cause was a bug in dbserver v7_6_1, resolved by upgrading to v8_3_0 on 2007/10/15. See IT 2257 for the dbserver fix. The damaged Minos database was repaired on 2007/10/12 with sqlplus, per instructions from herber. See details in http://www-numi.fnal.gov/minwork/computing/dh/worklog.txt UPDATE DIMENSIONS SET DIM_ALIAS = 'param_values##261' where DIMENSION_NAME = 'MC.BFIELD' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_categories##261' where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_category' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_types##261' where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_type' ; SAM-IT/1642: VERSION_ANALYZED selection Resolved on 2007/05/24 Learned that one should use VERSION, not VERSION_ANALYZED SAM-IT/1226 KITS sam_web_services_client offsite access Resolved by John Inkmann, 2005/10/21 File protections updated. For future reference, filling out webform http://fnkits.fnal.gov/specialprod.html can prevent this problem. ########### # SCRATCH # ########### Date: Tue, 25 Mar 2008 10:24:57 -0500 (CDT) Subject: HelpDesk ticket 113131 ___________________________________________ Short Description: Quota request for rhatcher on BlueArc served /minos/scratch Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user rhatcher on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. ___________________________________________ Date: Tue, 25 Mar 2008 10:34:25 -0500 (CDT) This ticket has been reassigned to HILL, KEVIN of the CD-LSCS/CSI/CS/EST Group. __________________________________________________________________ Date: Tue, 25 Mar 2008 11:20:52 -0500 (CDT) Solution: quota increased ______________________________________________________________ ########### # SCRATCH # ########### Date: Tue, 25 Mar 2008 13:38:15 -0500 (CDT) Subject: HelpDesk ticket 113147 ___________________________________________ Short Description: Quota request for scavan on BlueArc served /minos/scratch Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user scavan on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. 
___________________________________________ Date: Tue, 25 Mar 2008 13:42:39 -0500 (CDT) This ticket has been reassigned to HILL, KEVIN of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ ============================================================================= 2008 03 24 ################## # WEEK IN REVIEW # ################## BlueArc outage Monday 17 March 08:55 to 09:05 tripped up glideins Actually 08:42 to 09:10 ( m> nas ) + noted kreymer DOE cert will be expiring ( m> 2008 ) Parrot HTTP_PROXY corrected in current Condor 7.0.1 predeployed, ticket 112641 Vahle condor jobs being held ? + resolved FNAL Central Unix Web Service Town Meeting - Liz CRL time stamps shifted to UTC ( gysin fixing ) + fixed apparenly by gysin /home/minfarm filled - rubin cleared it + fixed by Howie /grid/data/minos filling ( rustem ) Condor jobs removed accidentally Wed + yep factproxy had a problem Thu, 20 Mar 2008 13:30:22 -0500 Jason query re server replacements SAM updates stopped after F000040476 ( Sun m> minosdata ) + corrected The daikon_00 archives to tape around midnight Monday 17 March. But we lost ground for the week : Tufts farm upgrade questions ######## # DATA # ######## MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 27T 1.6T 95% /minos/data MINOS26 > du -sm /minos/data/mcimport/STAGE/* 171932 /minos/data/mcimport/STAGE/daikon_00 9014169 /minos/data/mcimport/STAGE/daikon_04 MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N 8045715 /minos/data/mcimport/STAGE/daikon_04/L010185N 6622 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N 121349 /minos/data/mcimport/STAGE/daikon_04/L250200N arms suggests removing all under runs 7000. But everything we have is over 7000. ####### # WEB # ####### Wed 3/19 http://www-css.fnal.gov/csi/webdocs/townmeeting/ Nothing of much interest there, beyond the agenda. ########## # CONDOR # ########## ########## # PARROT # ########## Need to grab 'current', and adjust HTTP_PROXY to http://squid.fnal.gov:3128 adding the http:// Can try using the new HA prototype squid at fg3x3. ######### # MYSQL # ######### Long term connections noted by west, | 11338268 | reader | minosaur.maps.susx.ac.uk:38076 | litest | Sleep | 1342001 | | | | 11338270 | reader | minosaur.maps.susx.ac.uk:38077 | temp | Sleep | 1342000 | | | | 11338275 | reader | minosaur.maps.susx.ac.uk:38078 | offline | Query | 1341998 | Writing to net | select * from PLEXPIXELSPOTTOSTRIPEND where SEQNO between 200001001 and 200001004 or SEQNO between 2 | Should kill these ######### # ADMIN # ######### Could not enter the BlueArc status into the System Status page, as it was more than 3 days old. 
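Back to the long-idle reader connections flagged in the MYSQL section above : a minimal sketch of generating the KILL statements, assuming an account with processlist privilege on minos-mysql1 ; the host name and the one-day cutoff here are illustrative, not taken from the log.
# Sketch : emit KILL statements for reader sessions idle more than a day.
# Review the list by hand before piping it back into mysql.
mysql -h minos-mysql1 -N -e 'SHOW PROCESSLIST' | \
  awk '$2 == "reader" && $5 == "Sleep" && $6 > 86400 { print "KILL", $1, ";" }'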
####### # AFS # ####### Created afserrscan script, for cluster scan sorted by date #!/bin/sh EXT=${1} NODES='minos01 minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10 minos11 minos12 minos13 minos14 minos15 minos16 minos17 minos18 minos19 minos20 minos21 minos22 minos23 minos24 minos25 minos26' for NODE in ${NODES} ; do ssh -ax ${NODE} "grep afs: /var/log/messages${EXT} | grep 'Mar ' | grep -v Tokens | uniq" done | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' | sort date Usage : ./afserrscan ./afserrscan '.1' ############ # PREDATOR # ############ N00013828_0006.mdaq.root Thu Mar 20 20:06:39 UTC 2008 killed, ok N00013828_0007.mdaq.root Thu Mar 20 20:08:34 UTC 2008 stuck N00013829_0000.mdaq.root Thu Mar 20 20:11:09 UTC 2008 F00040476_0001.mdaq.root Fri Mar 21 22:12:19 UTC 2008 stuck cd /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/neardet_data/2008-03 grep v00-00 *.sam.py* F00040476_0001.sam.py: applicationFamily=ApplicationFamily('online','rotorooter','v00-00--1'), mv N00013828_0007.sam.py N00013828_0007.sam.pybad mv N00013829_0000.sam.py N00013829_0000.sam.pybad cd /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/fardet_data/2008-03 mv F00040476_0001.sam.py F00040476_0001.sam.pybad ============================================================================= 2008 03 17-23 KREYMER ON FURLOUGH ============================================================================= 2008 03 16 Sunday ######## # FARM # ######## cp AFSS/samdup samdup MINOS-SAM02 > ./samdup /minos/data/minfarm/nearcat | tee /minos/data/minfarm/DUP/nearcat.20080317.lis DET=near for FILE in `cat /minos/data/minfarm/DUP/${DET}cat.20080317.lis` ; do mv /minos/data/minfarm/${DET}cat/${FILE} \ /minos/data/minfarm/DUP/${DET}cat/${FILE} done DET=far ########## # samdup # ########## Cloned this from samlocate, infused with saddreco ./samdup /minos/data/minfarm/mcfarcat Test with recent near examples, which should be dups mkdir /minos/scratch/kreymer/mcnearcat touch /minos/scratch/kreymer/mcnearcat/n13047100_0000_L010185N_D04.sntp.cedar_phy_bhcurv.0.root touch /minos/scratch/kreymer/mcnearcat/n13047100_0003_L010185N_D04.sntp.cedar_phy_bhcurv.0.root mkdir /minos/scratch/kreymer/nearcat touch /minos/scratch/kreymer/nearcat/N00013755_0000.cosmic.sntp.cedar.0.root touch /minos/scratch/kreymer/nearcat/N00013755_0006.cosmic.sntp.cedar.0.root And a couple of pending files which should not be dups mkdir /minos/scratch/kreymer/mcfarcat touch /minos/scratch/kreymer/mcfarcat/f21311483_0000_L010185N_D00.mrnt.cedar_phy.root mkdir /minos/scratch/kreymer/farcat touch /minos/scratch/kreymer/farcat/F00040225_0000.all.sntp.cedar.0.root touch /minos/scratch/kreymer/farcat/F00040225_0012.all.sntp.cedar.0.root ./samdup -y /minos/scratch/kreymer/mcnearcat Test with 513 files time ./samdup /minos/data/minfarm/mcfarcat real 0m53.918s user 0m2.415s sys 0m0.419s repeat real 0m12.383s user 0m2.398s sys 0m0.350s Test with mcnearcat, 2676 files real 3m38.113s user 0m7.619s sys 0m0.577s repeat real 3m25.990s user 0m7.617s sys 0m0.634s This was all in dev Test again on idle minos-sam02, in prd mcfarcat real 0m10.236s user 0m1.984s sys 0m0.246s real 0m9.704s user 0m2.000s sys 0m0.216s mcnearcat real 2m40.157s user 0m6.627s sys 0m0.343s real 0m48.428s user 0m6.653s sys 0m0.348s Continue testing on minos-sam02, a bit quicker. ./samdup /minos/data/minfarm/farcat | wc -l 300 ./samdup /minos/data/minfarm/nearcat | wc -l 597 Added test for sam location of concatenated file, and existence in PNFS. 
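The added concatenated-file test could look roughly like this. A sketch only, assuming the plain 'sam locate' command and its usual ['/pnfs/dir,volume'] output form ; the parsing and the messages are mine, not from samdup :
# Sketch : does the concatenated output have a SAM location, and is it in PNFS ?
FILE=N00013755_0000.cosmic.sntp.cedar.0.root
LOC=`sam locate ${FILE} 2>/dev/null | tr -d "[]'" | cut -f 1 -d ','`
if [ -z "${LOC}" ] ; then
  echo "NOLOC  ${FILE}"            # no SAM location - not safe to call the input a dup
elif [ -r "${LOC}/${FILE}" ] ; then
  echo "INPNFS ${FILE} ${LOC}"     # located, and the file is visible in PNFS
else
  echo "NOPNFS ${FILE} ${LOC}"     # SAM says located, but the file is not in PNFS
fi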
./samdup /minos/data/minfarm/farcat | wc -l 300 ./samdup /minos/data/minfarm/nearcat | wc -l 597 ######## # FARM # ######## SRV1> ./farmgsum Summarizing /grid/data/minos/*cat 2178 69269 nearcat 1815 14211 farcat 2676 473169 mcnearcat 513 12063 mcfarcat 0 1 mcfmockcat 706 556 minfarm/WRITE 7888 569269 TOTAL files, GBytes ############ # MCIMPORT # ############ Noon batch of ENCP's to LTO-4 hung up From /minos/data/mcimport/TAR/daikon_04/L250200N/near/711/mcimport.log WILL ENCP 249 files Start time: Sun Mar 16 11:40:02 2008 User: mindata(3648) Group: e875(5111) Euser: mindata(3648) Egroup: e875(5111) Command line: encp --delayed-dismount 5 --verbose 4 /home/mindata/TAPE/n11037110_0000_L250200N_D04.tar.gz /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz Version: v3_7 CVS $Revision: 1.866 $ OS: Linux 2.6.9-55.0.2.ELsmp i686 Release: Scientific Linux Fermi LTS release 4.4 (Wilson) Library: CD-LTO4G1 Storage Group: minos File Family: stage FF Wrapper: cpio_odc FF Width: 1 Current working directory: minos-sam03.fnal.gov:/minos/data/mcimport/STAGE/daikon_04/L250200N/near/711 Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=0.990sec File queued: /home/mindata/TAPE/n11037110_0000_L250200N_D04.tar.gz library: CD-LTO4G1 family: stage bytes: 755032831 elapsed=1.09949278831 Mover called back. elapsed=1.6713218689 Input file /home/mindata/TAPE/n11037110_0000_L250200N_D04.tar.gz opened. elapsed=1.72233390808 Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=486.270sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=1386.360sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=2286.490sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=3186.580sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=4086.750sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=4986.920sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=5887.060sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=6787.310sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=7687.980sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=8588.060sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=9488.160sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=10388.230sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. 
elapsed=11288.340sec Web status page shows CD-LTO4G1.library_manager alive : unlocked stkensrv4 2008-Mar-16 15:02:28 Ongoing Transfers 0 Pending Transfers 1 Full Queue Elements Pending write for stage from minos-sam03 by mindata [VOLS_IN_WORK] Date: Sun, 16 Mar 2008 15:22:38 -0500 (CDT) Subject: HelpDesk ticket 112734 ___________________________________________ Short Description: encp from minos-03 to LTO4 tape hanging up since noon sunday. Problem Description: enstore-admin : After running smoothly for several days, our copies from minos-sam03 to /pnfs/minos/stage are hung up. Messages look like Mover called back. elapsed=1.6713218689 Input file /home/mindata/TAPE/n11037110_0000_L250200N_D04.tar.gz opened. elapsed=1.72233390808 Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.t ar.gz write request to LM. elapsed=486.270sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.t ar.gz write request to LM. elapsed=1386.360sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.t ar.gz write request to LM. elapsed=2286.490sec The Enstore Server Status web page shows CD-LTO4G1.library_manager alive : unlocked stkensrv4 2008-Mar-16 15:02:28 Ongoing Transfers 0 Pending Transfers 1 Full Queue Elements Pending write for stage from minos-sam03 by mindata [VOLS_IN_WORK] See the log file /minos/data/mcimport/TAR/daikon_04/L250200N/near/711/mcimport.log ___________________________________________ Data started up at about 18:44 Updated ticket ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 03 14 ############ # MCIMPORT # ############ 21:47 Interrupted during DIR=708, due to impending stkensrv2a Enstore master node down time. rm /home/mindata/TAPE/n11037083_0008_L250200N_D04.tar.gz Will do Sat Mar 15 08:26:21 CDT 2008 DIRS='708 709 710 711 712 713 714 715 717' for DIR in ${DIRS} ; do ./mcimport.20080311 -T daikon_04/L250200N/near/${DIR} done ############ # SADDRECO # ############ minfarm@fnpcsrv1 and prepare per HOWTO.saddreco cd to /afs/fnal.gov/files/home/room1/kreymer/minos/scripts ./saddreco.new -d near -r cedar -p 2007-11 --verify There are 502 files in /pnfs/minos/reco_near/cedar/cand_data/2007-11 Found a smaller month to work with, ./saddreco.new -d near -r cedar -p 2006-05 --verify 18 ./saddreco.new -d near -r cedar -p 2007-10 --verify 37 could not use -v , as this activated verbosity of the sam calls. Changed to y for yack Retained -v, if we want both yackiness and sam verbosity SRV1> ./saddreco.new -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N --verify STARTED Sat Mar 15 02:35:01 2008 This is now picking up the .0.root files containing pass numbers. Let's get caught up on this SLOG=${HOME}/ROUNTMP/LOG/saddreco/daikon_04/cedar_phy_bhcurv/near_L010185N.log ./saddreco.20080315 -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N -P 714 --declare That looks OK, Let's pick up 3 more cand's and one mrnt. ./saddreco.20080315 -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N -P 738 --declare Looks good on the surface, several parents. Now get caught up on all runs , with logging ./saddreco.20080315 \ -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N -P 738 --declare \ 2>&1 | tee -a ${SLOG} Made this the default saddreco on fnpcsrv1 SRV1> cp -a AFSS/saddreco.20080315 . 
SRV1> ln -sf saddreco.20080315 saddreco # was saddreco.20071117 SRV1> date Fri Mar 14 22:20:11 CDT 2008 ######### # FNALU # ######### Programs continue to crash and have strange problems, and nodes get hung up. For node system information, see http://cdcvs.fnal.gov/cgi-bin/fnal-only/cvsweb.cgi/syscollect/fnalu/?cvsroot=syscollect From lsload, and bhosts, MINOS26 > bhosts | cut -f 1 -d ' ' | sort Host MAX cpu mem flxb09 - 2 x 999 510360 kB flxb10 2 2 x 999 449M flxb11 - flxb12 - flxb13 2 449M flxb14 - flxb15 - flxb16 4 2 x 2667 905M 1034584 kB flxb17 4 2 x 2667 965M ditto flxb18 4 2 x 2667 928M flxb19 4 2 x 2667 921M flxb20 4 2 x 2667 962M flxb21 4 2 x 2667 965M flxb22 - flxb23 4 2 x 2667 965M flxb24 4 2 x 2667 963M flxb25 4 2 x 2667 964M flxb26 4 2 x 2667 965M flxb27 4 2 x 2667 964M flxb28 4 2 x 2667 929M flxb29 4 2 x 2667 964M flxb30 4 2 x 2667 597M 1034584 kB flxb31 4 2 x 2194 1902M 2074908 kB flxb32 4 2 x 2193 1899M 2074908 kB flxb33 4 2 x 2193 1644M 2074908 kB flxb34 4 2 x 2195 1637M 2074908 kB flxb35 4 2 x 2393 3570M 4038672 kB flxi04 - flxi06 2 4 x 3600 3753M 4095356 kB Intel for NODE in $BNODES ; do printf "${NODE} " ; ssh -ax ${NODE} 'cat /proc/cpuinfo | grep MHz | uniq' ; done for NODE in $BNODES ; do printf "${NODE} " ; ssh -ax ${NODE} 'cat /proc/cpuinfo | grep MHz | wc -l' ; done I don't trust the lsload memory information, scanned it for NODE in $BNODES ; do printf "${NODE} " ; ssh -ax ${NODE} 'cat /proc/meminfo | grep MemTotal' ; done ___________________________________________ Date: Fri, 14 Mar 2008 12:39:28 -0500 (CDT) Subject: HelpDesk ticket 112694 ___________________________________________ Short Description: Please turn off the 1 GHz FNALU batch nodes Problem Description: dss-est : Please turn off the FNALU batch systems flxb09, flxb10 and flxb13, and retire these systems. They are dual processor 1 GHz systems, with only 1/2 GB of memory. They are a small fraction of the FNALU capacity. Minos batch jobs have been failing repeatedly on these nodes, due to the lack of memory. ___________________________________________ Date: Fri, 14 Mar 2008 13:05:34 -0500 (CDT) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Fri, 14 Mar 2008 13:17:05 -0500 (CDT) Art, I closed the lsf queues for these 3 nodes. I will have to check with Wayne before retiring them.
margaret ___________________________________________ Date: Mon, 27 Oct 2008 12:55:49 -0500 (CDT) Solution: flxb11-30 decommissioned ___________________________________________ ___________________________________________ ####### # AFS # ####### The symlink for afssum is out of date, try the new version MIN > cp afssum.20070828 afssum.test time ./afssum.test MINOS01 > time ./afssum.test real 131m8.298s user 22m32.763s sys 69m18.231s Updated to use afssum.20070828 MIN > ln -sf afssum.20070828 afssum # was afssum.20060614 ########### # ENSTORE # ########### Checking the Enstore log for yesterday, for client versions http://www-stken.fnal.gov/enstore/enstore_logs.html http://www-stken.fnal.gov/enstore/log//LOG-2008-03-13 MIN > grep v3_7 LOG-2008-03-13.htm | wc -l 2915 MIN > grep v3_6 LOG-2008-03-13.htm | wc -l 49461 MIN > grep v3_6g LOG-2008-03-13.htm | wc -l 46332 MIN > grep v3_6c LOG-2008-03-13.htm | wc -l 2296 MIN > grep v3_6d LOG-2008-03-13.htm | wc -l 636 MINOS26 > grep v3_6i LOG-2008-03-13.htm | wc -l 197 MINOS26 > grep 'Version:' LOG-2008-03-13.htm | cut -f 9 -d ' ' | sort -u v3_6c v3_6d v3_6g v3_6i v3_7 MINOS26 > grep 'Version: v3_6c ' LOG-2008-03-13.htm | cut -f 2 -d ' ' | sort -u logjam.fnal.gov southport.fnal.gov MINOS26 > grep 'Version: v3_6d ' LOG-2008-03-13.htm | cut -f 2 -d ' ' | sort -u lynx.fnal.gov minos-om.fnal.gov minos-sam03.fnal.gov MINOS26 > grep 'Version: v3_6g ' LOG-2008-03-13.htm | cut -f 2 -d ' ' | sort -u cmsstor101.fnal.gov cmsstor103.fnal.gov ... cmsstor98.fnal.gov cmsstor99.fnal.gov stkendca10a.fnal.gov stkendca11a.fnal.gov ... stkendca19a.fnal.gov stkendca20a.fnal.gov MINOS26 > grep 'Version: v3_6i ' LOG-2008-03-13.htm | cut -f 2 -d ' ' | sort -u cdfensrv3.fnal.gov stkensrv3.fnal.gov MINOS26 > grep 'Version: v3_7 ' LOG-2008-03-13.htm | cut -f 2 -d ' ' | sort -u d0ensrv3n.fnal.gov des02.fnal.gov des04.fnal.gov minos-sam03.fnal.gov sdssdp30.fnal.gov sdssdp44.fnal.gov sdssdp45.fnal.gov sdssdp48.fnal.gov sdssdp51.fnal.gov sdssdp53.fnal.gov sdssdp55.fnal.gov sdssdp56.fnal.gov sdssdp57.fnal.gov Why is minos-om writing directly with encp ? MINOS26 > grep 'minos-om' LOG-2008-03-13.htm | tr ' ' \\\n | grep /pnfs/minos | less MINOS26 > OMDIRS=`grep 'minos-om' LOG-2008-03-13.htm | tr -d "'" | tr ' ' \\\n | grep /pnfs/minos` MINOS26 > for DIR in ${OMDIRS} ; do dirname ${DIR} ; done | sort -u /pnfs/minos/fardet_logs /pnfs/minos/fardet_logs/msglog /pnfs/minos/fardet_logs/om /pnfs/minos/fardet_logs/om/postscript /pnfs/minos/fardet_logs/om/rootfiles /pnfs/minos/fardet_logs/om/rootfiles/00040000-00049999 /pnfs/minos/fardet_logs/om/summaries /pnfs/minos/fardet_logs/timing /pnfs/minos/neardet_logs /pnfs/minos/neardet_logs/msglog /pnfs/minos/neardet_logs/om /pnfs/minos/neardet_logs/om/rootfiles /pnfs/minos/neardet_logs/om/rootfiles/00010000_00019999 /pnfs/minos/neardet_logs/om/summaries /pnfs/minos/neardet_logs/timing MINOS26 > grep 'minos-om' LOG-2008-03-13.htm | cut -f 1 -d ' ' 04:13:40...04:13:42 04:18:09...04:18:48 05:19:54...05:19:56 05:22:50...05:23:06 05:35:30...05:35:31 05:37:42...05:38:54 05:44:06...05:44:07 05:46:27...05:46:43 Check Wednesday, curl http://www-stken.fnal.gov/enstore/log/LOG-2008-03-12 -o LOG-2008-03-12.htm Nothing, it seems yesterday was a one-shot archival.
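The per-version host survey above can be done in one pass over the same downloaded log. A minimal sketch, assuming the 'Version:' and host fields stay in the field positions used above :
# Sketch : one loop instead of one grep per client version.
LOG=LOG-2008-03-13.htm
for VER in `grep 'Version:' ${LOG} | cut -f 9 -d ' ' | sort -u` ; do
  printf "\n${VER}\n"
  grep "Version: ${VER} " ${LOG} | cut -f 2 -d ' ' | sort -u
done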
for DAY in 13 01 02 03 04 05 07 08 09 10 ; do LOGF=LOG-2008-03-${DAY} printf "${LOGF}\n" curl -s http://www-stken.fnal.gov/enstore/log/${LOGF} -o ${LOGF} du -sm ${LOGF} grep 'minos-om' ${LOGF} | wc -l rm -f ${LOGF} done LOG-2008-03-13 328 LOG-2008-03-13 247 LOG-2008-03-01 924 LOG-2008-03-01 0 LOG-2008-03-02 608 LOG-2008-03-02 0 LOG-2008-03-03 694 LOG-2008-03-03 0 LOG-2008-03-04 668 LOG-2008-03-04 0 LOG-2008-03-05 608 LOG-2008-03-05 0 LOG-2008-03-07 743 LOG-2008-03-07 0 LOG-2008-03-08 559 LOG-2008-03-08 0 LOG-2008-03-09 484 LOG-2008-03-09 0 LOG-2008-03-10 493 LOG-2008-03-10 0 ============================================================================= 2008 03 13 ####### # AFS # ####### summaries missing from /afs/fnal.gov/files/expwww/numi/html/computing/dh/afssum Manually ran /usr/krb5/bin/kcron ${HOME}/minos/scripts/afssum quiet It looked OK, but produced no output ########## # PARROT # ########## Repeatin test before reporting to cctools@listserv.nd.edu 2_4_2 no proxy FNPC144 > export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_2-i686-linux-2.6 FNPC144 > export PATH=${PARROT_DIR}/bin:${PATH} FNPC144 > parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash FNPC144 > PS1='P> ' P> ls -d /afs/fnal.gov/files/code/e875/general/minossoft 1205436334.559503 [19916] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1205436334.559698 [19916] parrot: grow: fetching checksum: wget --no-cache -q -O /tmp/parrot.1060/grow.checksum.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfschecksum 1205436334.826238 [19916] parrot: grow: remote checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d 1205436334.826301 [19916] parrot: grow: fetching directory: wget --no-cache -q -O /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfsdir 1205436338.337115 [19916] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 1205436339.339436 [19916] parrot: grow: local checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d /afs/fnal.gov/files/code/e875/general/minossoft 2_4_2 with proxy FNPC145 > export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_2-i686-linux-2.6 FNPC145 > export PATH=${PARROT_DIR}/bin:${PATH} FNPC145 > export HTTP_PROXY="squid.fnal.gov:3128" FNPC145 > parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash FNPC145 > ls -d /afs/fnal.gov/files/code/e875/general/minossoft 1205436801.996700 [3408] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1205436801.996908 [3408] parrot: grow: fetching checksum: wget --no-cache -q -O /tmp/parrot.1060/grow.checksum.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfschecksum 1205436802.020919 [3408] parrot: grow: remote checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d 1205436802.021135 [3408] parrot: grow: fetching directory: wget --no-cache -q -O /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfsdir 1205436805.265410 [3408] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 1205436806.258024 [3408] parrot: grow: local checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d /afs/fnal.gov/files/code/e875/general/minossoft 2_4_0 with proxy FNPC146 > export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 FNPC146 > export PATH=${PARROT_DIR}/bin:${PATH} FNPC146 > export 
HTTP_PROXY="squid.fnal.gov:3128" FNPC146 > parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash FNPC146 > ls -d /afs/fnal.gov/files/code/e875/general/minossoft 1205437116.856289 [7912] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1205437116.856398 [7912] parrot: http: connect squid.fnal.gov port 3128 1205437116.858341 [7912] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfsdir HTTP/1.0 Host: squid.fnal.gov 1205437117.186037 [7912] parrot: http: HTTP/1.0 200 OK 1205437117.186081 [7912] parrot: http: Date: Thu, 21 Feb 2008 22:35:11 GMT 1205437117.186093 [7912] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.5 1205437117.186104 [7912] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:09 GMT 1205437117.186115 [7912] parrot: http: ETag: "5350b140-33ac14e-44592a1cdbf44" 1205437117.186131 [7912] parrot: http: Accept-Ranges: bytes 1205437117.186140 [7912] parrot: http: Content-Length: 54182222 1205437117.186158 [7912] parrot: http: Content-Type: text/plain 1205437117.186168 [7912] parrot: http: X-Cache: HIT from fermigrid4.fnal.gov 1205437117.186178 [7912] parrot: http: Via: 1.0 fermigrid4.fnal.gov:3128 (squid/2.6.STABLE9) 1205437117.186188 [7912] parrot: http: Proxy-Connection: close 1205437117.186197 [7912] parrot: http: 1205437117.186208 [7912] parrot: grow: loading filesystem directory... 1205437126.783397 [7912] parrot: grow: directory checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d 1205437126.783585 [7912] parrot: grow: fetching checksum from wget --no-cache -q -O /tmp/grow.checksum.1060.7909 http://www-numi.fnal.gov:80//computing/d199//.growfschecksum 1205437126.828840 [7912] parrot: grow: actual checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d /afs/fnal.gov/files/code/e875/general/minossoft Sent mail to cctools@listserv.nd.edu ########## # CONDOR # ########## Sent this to sfiligoi for courtesy review, 13:55 : Date: Thu, 13 Mar 2008 14:46:22 -0500 (CDT) Subject: HelpDesk ticket 112641 ___________________________________________ Short Description: Minos Cluster - condor 7.0.1 preinstallation run2-sys : Please install the following RPM in all the minos01 thru minos25 . http://fermigrid.fnal.gov/files/condor/condor-7.0.1-linux-x86-rhel3-dynamic-1.i386.rpm This rpm places new files in /opt/condor-7.0.1, and should not interfere with existing operations. Please also copy the configuration files into /opt/condor-7.0.1 on each node. HNAME=`hostname -s` cp /opt/condor-6.9.5/etc/condor_config \ /opt/condor-7.0.1/etc/condor_config cp /opt/condor-6.9.5/local.${HNAME}/condor_config.local \ /opt/condor-7.0.1/local.${HNAME}/condor_config.local Background : We want to upgrade the Condor version on the Minos Cluster from 6.9.5 to 7.0.1 the week of 24 March. Installing the rpm and prepositioning the configuration files will let us review the installation ahead of time, and perhaps upgrade a node or two ahead of the general upgrade. ___________________________________________ Date: Thu, 13 Mar 2008 14:53:22 -0500 (CDT) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. ________________________________________________________________ ######## # PNFS # ######## Date: Thu, 13 Mar 2008 11:24:51 -0500 From: Robert Hatcher To: Arthur Kreymer Subject: MINOS: created charm directories for D04 I've created the input/output directories for D04 L010185N_charm using your script that also sets the file families. 
cd ~kreymer/minos/scripts ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010185N_charm write ########## # DCACHE # ########## Some DCache failures yesterday, Date: Thu, 13 Mar 2008 11:58:37 -0400 (EDT) From: Josh Boehm Mar 12 14:10 Error ( POLLIN POLLERR POLLHUP) (with data) on control line [32] Failed to create a control line Failed open file in the dCache. Error ( POLLIN POLLERR POLLHUP) (with data) on control line [32] Failed to create a control line Error in : file dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy/cand_data/2005-10/N00009003_0015.spill.cand.cedar_phy.0.root does not exist The rest I have are identical with different file names. The errors appear to have all occurred between 14:10-14:30 yesterday. ######## # ENCP # ######## Installed v3_7, just in case this helps MINOS26 > upd install -j encp v3_7 This seems to be happy, and connects to stkensrv2 without further config or qualifiers. MINOS26 > ups declare -c encp v3_7 MINOS26 > date Thu Mar 13 09:58:54 CDT 2008 ######## # DATA # ######## Killed mcimport at /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037047_0007_L250200N_D04.tar.gz Set --delayed_dismount 5, to try to improve tape mounting ( observed 30 second HAVEBOUND period, then dismount perhaps the Enstore system is just responding too slowly ) Restarted, with initial LOCALLIM 210000 , reset to 30000 in PAPER. Rats, the tape dismounted after 30 seconds in HAVE-BOUND status ! label mover tot.time status system_inhibit rq. host updated volume family VOJ550 LTO4_48.mover 230 DISMOUNT_WAIT (16 ) (none none) ['minos'] 03-13-08 08:34:39 minos.stage.cpio_odc Into the ECRC/COPY phase of 704 now Interrupted to restore normal LOCALLIM, and remove --delayed-dismount per developers' request ( moot ) $ rm /minos/data/mcimport/TAR/daikon_04/L250200N/near/704/n11037048_0002_L250200N_D04.ecrc Also switched to encp v3_7, just in case this helps. $ cp -a AFSS/mcimport.20080311 . $ ./mcimport.20080311 -T daikon_04/L250200N/near/704 Observed following timings ECRC - 75 to 110 " COPY - 25 " Earlier tests showed COPY - 110 " ECRC - 3 " Interrupted at ECRC n11037048_0018_L250200N_D04.tar.gz Updated to do the COPY before ECRC. Will implicitly get encp v3_7 ( current ). $ cp -a AFSS/mcimport.20080311 . $ ./mcimport.20080311 -T daikon_04/L250200N/near/704 Thu Mar 13 10:33:48 CDT 2008 Rates look good COPY - 55 to 110 " ECRC - 5" net - 60 to 115" WILL ENCP 80 files Start time: Thu Mar 13 11:45:53 2008 LTO4_43.mover 10 files moved so far, as of 11:54, About 30 seconds per file elapsed. timings like Starting /home/mindata/TAPE/n11037047_0025_L250200N_D04.tar.gz transfer. elapsed=4.94632411003 File /home/mindata/TAPE/n11037047_0025_L250200N_D04.tar.gz transfered. elapsed=22.6599259377 Something seems to be slowing down transfers, every 3 minutes. Extra 20 to 30 seconds in transfer. I see nothing directly correlated with this in Ganglia ( except network rates, which is where I saw this first.
) This finished up at Thu Mar 13 12:31:13 CDT 2008 Launched the full set again : $ DIRS='705 706 707 708 709 710 711 712 713 714 715 717' for DIR in ${DIRS} ; do ./mcimport.20080311 -T daikon_04/L250200N/near/${DIR} done MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 25T 3.1T 90% /minos/data MINOS26 > date Thu Mar 13 13:12:59 CDT 2008 Date: Thu, 13 Mar 2008 17:15:32 +0000 (UTC) From: Arthur Kreymer To: Stan Naymola Cc: Jon Bakken , enstore-admin@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 112593 has additional info. On Thu, 13 Mar 2008, Stan Naymola wrote: > We have added back 2 LTO4 drives, so if things work right, there should > be drives available. Thanks ! The next batch of Minos data writes has started, and is now running at full speed, 30 MB/sec net. There are brief ( 30 second ) delays about every 3 minutes, but nothing like the problems we had before. A few things have changed since the last measurement 1) You added 2 drives, thanks ! 2) CMS writes to CCRC08LoadTest are not active, like they were before. 3) The LTO-3 manager was restored to service, after robot repairs. 4) I upgraded our default client from encp v3_6d to encp v3_7 . ============================================================================= 2008 03 12 ########### # SCRATCH # ########### Date: Wed, 12 Mar 2008 14:43:46 -0500 (CDT) Subject: HelpDesk ticket 112578 ___________________________________________ Short Description: Quota request for boehm on BlueArc served /minos/scratch Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user boehm on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. ___________________________________________ Date: Wed, 12 Mar 2008 15:06:12 -0500 (CDT) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Solution: Hi Art, minos-nas-0:/minos/scratch for boehm quota has been increased to 500GB ___________________________________________ ############ # MCIMPORT # ############ Severe overloads from RAL again, Mar 12 04:07:44 minos26 kernel: oom-killer: gfp_mask=0xd0 Mar 12 04:07:48 minos26 kernel: Out of Memory: Killed process 14571 (scp). MINOS26 > ps -u mindata | grep -c scp ; ps -u mindata | grep -c md5sum 4 54 NSCP=`ps -u mindata | grep -c scp` NMD5=`ps -u mindata | grep -c md5sum` (( NMCI = NSCP + NMD5 )) echo ${NMCI} The rate of clearing gets drastically better with under 35 md5sum's. Load average 35 -> 25 in 5 minutes. 11:41 - 9 md5sum's 11:42 - 0 md5sum Load average dropped from 18 to 0 in under 3 minutes ( 11:41 to 11:43 ) Back up to 33 around 13;15, as the next batch arrives. MINOS01 > time md5sum /minos/data/mcimport/mtavera/n12037402_0008_L010185N_D04.tar.gz e4a7c5c80e0fffdbf72ee9224d4d05f1 /minos/data/mcimport/mtavera/n12037402_0008_L010185N_D04.tar.gz real 0m9.532s user 0m0.037s sys 0m0.020s MINOS01 > time md5sum /minos/data/mcimport/mtavera/n12037402_0008_L010185N_D04.tar.gz e4a7c5c80e0fffdbf72ee9224d4d05f1 /minos/data/mcimport/mtavera/n12037402_0008_L010185N_D04.tar.gz real 0m0.037s user 0m0.027s sys 0m0.008s Here are my working guesses about the capacity of minos26. Let's assume that we sustain 10 scp's at a time, and regulate the data flow to avoid overload. 10 scp's, at about 1 MB/sec each is about 1 TByte per day. We really do not need to ingest data at that rate ! md5sum of a 10 MByte file which is still in memory takes 40 milliseconds. 
From a badly overloaded /home/data disk, it takes 10 seconds. So md5sum's should play no role in this, as long as they are done while still in memory. This will be the case if we do not badly overload the system. In short, unless I've slipped a decimal somewhere, we should be just fine, on average, if we stay well away from saturation of the system. Whether the import limit is 5, 10, or 20 should not be critical. Even 5 should sustain .5 TByte per day, more than we need. Note that the /minos/data disk, used by the farm, overlaying, and various users, also plays a role in this. We may need to be careful about how many streams we write at once. The job I'm running to archive older mcimport/STAGE data has slowed from 15 MB/sec to 1 MB/sec during these overload periods, due to slow data delivery from /home/data. From top: Cpu(s): 0.2% us, 0.1% sy, 0.0% ni, 0.0% id, 99.6% wa, 0.0% hi, 0.0% si 14:32 md5sums dropped from 27 to 26 !!!! data had been moving in at about 1 MB/sec 14:43 24 14:45 minos-sam03 data rates back up to 8 MB/sec 14:46 21 47 20 48 20 54 18 55 14 56 12 57 8 58 8 59 6 15:00 0 Summary - looking at the minos26 ganglia plots, we suffered about a 12 hour delay in all major /minos/data activities. ######## # DATA # ######## Date: Wed, 12 Mar 2008 18:23:33 -0500 (CDT) Subject: HelpDesk ticket 112593 ___________________________________________ Short Description: minos.stage writes to LTO-4 stalled - why ? Problem Description: enstore-admin : The good news : The writes to /pnfs/minos/stage/.. ( minos.stage family ) have been running at around 5 MB/sec, as they copy from BlueArc mounted /minos/data. I restructured this to copy to local disk first, then write directly. The encp overall rate to tape is now over 30 MB/sec , as of around Wed Mar 12 17:48:44 CDT 2008. The bad news : The drive LTO4-43 was apparently preempted for CMS load test. There was then a nearly minute delay before my next file got written . 2 of the 11 drives were dead, three are idle, and my encp command was just sitting waiting for a mover. One more file got copied using drive LTO4_44, then I was preempted for another CMS load test. I am on furlough next week. We must get at least a few TBytes of data archived this week. Please do what it takes so that we do not get preempted. Here are some details, fyi: Start time: Wed Mar 12 17:52:26 2008 User: mindata(3648) Group: e875(5111) Euser: mindata(3648) Egroup: e875(5111) Command line: encp --verbose 4 /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037040_0003_L250200N_D04.t ar.gz Version: v3_6d CVS $Revision: 1.829 $ OS: Linux 2.6.9-55.0.2.ELsmp i686 Release: Scientific Linux Fermi LTS release 4.4 (Wilson) Library: CD-LTO4G1 Storage Group: minos File Family: stage FF Wrapper: cpio_odc FF Width: 1 Current working directory: minos-sam03.fnal.gov:/minos/data/mcimport/STAGE/daikon_04/L250200N/near/704 Got error while trying to obtain configuration: ('KEYERROR', "Configuration Server: no such name: 'pnfs_agent'") Submitting /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037040_0003_L250200N_D04.t ar.gz write request to LM. elapsed=0.440sec File queued: /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz library: CD-LTO4G1 family: stage bytes: 758433401 elapsed=0.536798000336 Submitting /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037040_0003_L250200N_D04.t ar.gz write request to LM. elapsed=900.470sec Mover called back. elapsed=904.14408493 Input file /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz opened. 
elapsed=904.165093899 Input file /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz opened. elapsed=904.165093899 Starting /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz transfer. elapsed=1062.3872509 File /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz transfered. elapsed=1079.18411994 Waiting for final mover dialog. elapsed=1079.280sec Received final dialog for minos-sam03.fnal.gov-1205362346-1509-0. elapsed=1087.200sec Verifying /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037040_0003_L250200N_D04.t ar.gz transfer. elapsed=1087.11204982 File status after verification: ('ok', None) elapsed=1087.29182482 Transfer /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz -> /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037040_0003_L250200N_D04.t ar.gz: 758433401 bytes copied to VOJ550 at 29.3 MB/S (43.1 MB/S network) (168 MB/S drive) (43.1 MB/S disk) (3.95 MB/S overall) (29.3 MB/S transfer) drive_id=ULTRIUM-TD4 drive_sn=1310019745 drive_vendor=IBM mover=LTO4_44.mover media_changer=SL8500.media_changer elapsed=1087.31 Completed transferring 758433401 bytes in 1 files in 1087.30618596 sec. Overall rate = 3.95 MB/sec. Transfer rate = 29.3 MB/sec. Network rate = 43.1 MB/sec. Drive rate = 168 MB/sec. Disk rate = 43.1 MB/sec. Exit status = 0. PURGED n11037040_0003_L250200N_D04.tar.gz Start time: Wed Mar 12 18:10:34 2008 ___________________________________________ Date: Wed, 12 Mar 2008 20:03:12 -0500 (CDT) This ticket is assigned to SSA Primary of the CD-SF/DMS/DSC/SSA group. ___________________________________________ Date: Thu, 13 Mar 2008 05:13:13 +0000 (UTC) Oops, an important correction to a typo in this report. Instead of There was then a nearly minute delay before my next file got written . I meant to say There was then a nearly 20 minute delay before my next file got written . The 20 minute delays are continuing. I seem to get 1 to 3 files copied, then am kicked off the drive for 20 minutes. In copying 95 files so far, there have been 18 tape mounts. The progression has been : mover=LTO4_43 mover=LTO4_44 mover=LTO4_46 mover=LTO4_43 mover=LTO4_49 mover=LTO4_46 mover=LTO4_43 mover=LTO4_47 mover=LTO4_45 mover=LTO4_49 mover=LTO4_44 mover=LTO4_46 mover=LTO4_49 mover=LTO4_47 mover=LTO4_51 mover=LTO4_47 mover=LTO4_43 mover=LTO4_46 At 00:10, I have been sitting waiting for a move for 15 minutes. There are 3 IDLE LTO-4 drives. ___________________________________________ Date: Thu, 13 Mar 2008 02:12:03 -0500 (CDT) This ticket is assigned to SSA Primary of the CD-SF/DMS/DSC/SSA group. ___________________________________________ Date: Thu, 13 Mar 2008 13:18:30 +0000 (UTC) Has this ticket been seen by anyone ? It was assigned to SSA Primary at 20:03:12 -0500 (CDT) It was assigned to SSA Primary at 02:12:03 -0500 (CDT) The problems are continuing. Right now, there are three IDLE LTO-4 drives, but my encp commands are still waiting 20 minutes for a tape mount for each file copied. The full log for this job is in file /minos/data/mcimport/TAR/daikon_04/L250200N/near/704/mcimport.log mounted on all FNALU and FermiGrid nodes . ___________________________________________ Date: Thu, 13 Mar 2008 13:47:31 +0000 (UTC) I changed my script to force a 5 minute HAVE BOUND period, encp --delayed-dismount 5 just in case the overloaded Enstore systems need more leeway. But drive LTO4-48 dismounted my tape after 30 seconds in the HAVE_BOUND state. Here is a status line from http://cmsdca.fnal.gov/cgi-bin/enstore_drives.sh label mover tot.time status system_inhibit rq. 
host updated $ VOJ550 LTO4_48.mover 230 DISMOUNT_WAIT (16 ) (none none) ['minos'] 03-13-08 $ There are still several IDLE drives. ___________________________________________ Date: Thu, 13 Mar 2008 08:51:31 -0500 (CDT) From: Jon Bakken _This just ties up tape drives more - I don't think this is a good idea, and it certainly just makes things worse. It's not fair to other experiments. Basically, enstore is broken, and no options you set are going to help it. __________________________________________ Date: Thu, 13 Mar 2008 09:02:54 -0500 From: Stan Naymola As Jon said the system is broken. The developers are trying to fix the LM. Please use the defaults for transfers. We have 2 LTO4's that are offline for a special test. I am canceling that test and will put these online. That should relieve the LTO4 resource issue. Stan ___________________________________________ Date: Thu, 13 Mar 2008 11:28:43 -0500 From: Stan Naymola We have added back 2 LTO4 drives, so if things work right, there should be drives available. __________________________________________ Date: Thu, 13 Mar 2008 17:15:32 +0000 (UTC) From: Arthur Kreymer To: Stan Naymola Cc: Jon Bakken , enstore-admin@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 112593 has additional info. On Thu, 13 Mar 2008, Stan Naymola wrote: > We have added back 2 LTO4 drives, so if things work right, there should > be drives available. Thanks ! The next batch of Minos data writes has started, and is now running at full speed, 30 MB/sec net. There are brief ( 30 second ) delays about every 3 minutes, but nothing like the problems we had before. A few things have changed since the last measurement 1) You added 2 drives, thanks ! 2) CMS writes to CCRC08LoadTest are not active, like they were before. 3) The LTO-3 manager was restored to service, after robot repairs. 4) I upgraded our default client from encp v3_6d to encp v3_7 . __________________________________________ Date: Thu, 13 Mar 2008 14:40:59 -0500 (CDT) From: David Berg __________________________________________ I sent 3 replies to this ticket yesterday evening. Did you not see them? If not, something is broken in the helpdesk auto forwarding. (See attachments.) In my opinion, (4) is the most significant change as it probably fixed the communication deadlock between the encp client and the mover. That was the real cause of your problems. It never had anything to do with not enough available drives, or CMS bumping your tape. I will look into why the --delayed-dismount option didn't work. Jon is wrong - this is exactly a situation it was intended for. It is not a change applied to all movers, which would be wrong, but just this one. You have a job that will be writing to a single tape continuously for many hours; there is no reason it shouldn't stay mounted the whole time. There is a one-time penalty of a few extra mintues when your job finishes, which is insignificant compared to the time avoided in extra mounts, seeks, and dismounts. I'm glad it's working better for you now. VOJ550 is about half full, with 50 mounts. __________________________________________ ######## # DATA # ######## Need to encp from local disk, for speed. Interrupted near end of 704 ECRC phase, 08:30 Removed the empty n11037048_0000_L250200N_D04.ecrc cp -a AFSS/mcimport.20080311 . ./mcimport.20080311 -T daikon_04/L250200N/near/704 Rates look good 10 to 15 MB/sec even in face of overload from minos26. 
( for the copy phase of things ) Oops, need to interrupt this to correct oversight in PURGEFILE, which need to remove both FILE and LOCAL/FILE Will catch this in the ecrc phase early this afternoon. 17:45 - interrupted to correct the import script, removed partial file rm ${LOCAL}/n11037047_0012_L250200N_D04.tar.gz Changed LOCALLIM to 50000, so I can see some purging right now. $ LOCALFREE=`df -m ${LOCAL} | tr -s ' ' | grep '% /' | cut -f 4 -d ' '` ; echo $LOCALFREE 46736 ./mcimport.20080311 -T daikon_04/L250200N/near/704 Wed Mar 12 17:48:44 CDT 2008 OK - version mcimport.20080303 processing from /minos/data/mcimport/STAGE/daikon_04/L250200N/near/704 ... NFILES 305 WILL ENCP 225 files Overall rates look good, like Overall rate = 30.3 MB/sec. Transfer rate = 30.5 MB/sec. Network rate = 43 MB/sec. Drive rate = 176 MB/sec. Disk rate = 43 MB/sec. Exit status = 0. And we've just been preempted from the LTO-4 drive by CMS. Sent helpdesk request, high priority, around 18:15 Note, the CMS writes are to file family CCRC08LoadTest ============================================================================= 2008 03 11 ############ # MCIMPORT # ############ mualem restarted imports from caltech at about 13:41 Data write rates dropped to around 1 MB/sec from 6, gradually starting around 13:45 through 14:30. Check rates on minos26 : cd /local/scratch26/kreymer/DATA BAF=/minos/data/mcimport/STAGE/daikon_04/L250200N/near/710/n11037100_0000_L250200N_D04.tar.gz time dd if=${BAF} of=TEST.dat bs=759175734 1+0 records in 1+0 records out real 3m9.092s user 0m0.001s sys 0m5.075s MINOS26 > time ecrc TEST.dat CRC 2473399715 real 0m4.186s user 0m2.344s sys 0m0.777s minos-sam03 data rates gradually recovered to 6 MB/se by 15:00 Ganglia shows a similar dip this morning 00:00 to 00:30 ######## # FARM # ######## Checking out CPB far logs , par batch meeting. Many duplicates reported, from F00030612_0000.all.sntp.cedar_phy_bhcurv.0.root to F00037885_0000.spill.sntp.cedar_phy_bhcurv.0.root Most of the PENDing files are before run 30,000. Exceptions in all.sntp , F00034650 F00034744 F00035640 F00037901 These are not duplicates, could be forced out if desired, along with the 19 sub-30K runs from F00019953 to F00028451 ============================================================= Date: Tue, 11 Mar 2008 15:44:08 -0500 From: Howard Rubin To: Arthur Kreymer Subject: Re: cedar_phy_bhcurv far concatenation status - FYI Art, I've looked at a couple of runs, and I think the problem is that you're checking against mdaq files before there were suppression lists, which started when there was beam in 2005-03. For the limited sample I checked, the missing output is in runs which are not in Ben's list from which I run. Here are the subruns for the first of your mystery runs: fnpcsrv1% ls -l F00019953* -rw-r--r-- 1 buckley numi 41790452 Oct 6 2003 F00019953_0000.mdaq.root -rw-r--r-- 1 buckley numi 48620509 Oct 6 2003 F00019953_0001.mdaq.root -rw-r--r-- 1 buckley numi 81994333 Oct 6 2003 F00019953_0002.mdaq.root -rw-r--r-- 1 buckley numi 634052 Oct 6 2003 F00019953_0003.mdaq.root It looks like subrun 0003 wasn't in the runlist, /minos/data/minfarm/lists/daq_lists/2003-10.farlist for good reason. I think you should just force them out. Howie =============================================================== ./roundup -f 1 -s F0001 -r cedar_phy_bhcurv far Tue Mar 11 16:32:53 CDT 2008 ./roundup -f 1 -s F0002 -r cedar_phy_bhcurv far Tue Mar 11 16:38:42 CDT 2008 Send email when this is finished. 
This completed at 19:22 ######## # DATA # ######## files are still moving well to LTO-4 from /minos/data/mcimport/..., still around 6 MB/sec. ####### # LOG # ####### Added this log to the computing/dh/dhmain.html web page, as WORK LOG Shifted the mid term tasks to the bottom of the file, for legibility. Remember to go look there once in a while ! ln -sf dhmain.20080311.html dhmain.html # was dhmain.20080131.html Made a new link worklog.txt to replace samlog.txt. Sent email to minos-data, minos-admin, minos_batch ============================================================================= 2008 03 10 ########### # MINOS10 # ########### Date: Mon, 10 Mar 2008 10:03:47 -0500 (CDT) Subject: HelpDesk ticket 112430 ___________________________________________ Short Description: minos10 down since early Sunday afternoon Problem Description: run2-sys : Node minos10 disappeared from Ganglia monitoring early Sunday afternoon. It is still off the network ( no response to ping. ) Please investigate when you get a chance. ___________________________________________ Date: Mon, 10 Mar 2008 10:12:56 -0500 (CDT) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. ________________________________________________________________ Date: Mon, 10 Mar 2008 10:43:52 -0500 (CDT) Solution: schmitz@fnal.gov sent this solution: power-cycled ___________________________________________________________________ ############ # PREDATOR # ############ Problems in ND sam declares, N00013775_0003.mdaq.root Sat Mar 8 23:06:12 UTC 2008 OOPS - run_dbu is stuck for 147, killing it N00013775_0004.mdaq.root Sat Mar 8 23:09:17 UTC 2008 OOPS - run_dbu is stuck for 137, killing it Sun Mar 9 20:11:48 UTC 2008 /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/run_dbu: line 128: 13756 Segmentation fault dbu -bq ${HOME}/minos/scripts/dbu_sampy.C ${FILE} >>${logname} 2>&1 N00013778_0018.sam.py was not generated - check log for error N00013778_0018.log cd /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/neardet_data/2008-03 rm N00013775_0004.sam.py ######## # DATA # ######## Interrupted early in ECRC for 702, to get rid of the spurious OOPS cp -a AFSS/mcimport.20080303 Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 25T 3.5T 88% /minos/data for DIR in ${DIRS} ; do ./mcimport.20080303 -T daikon_04/L250200N/near/${DIR} done Started up overlaying on minos12 About 10 MB/sec ( input files) 13:25 to 13:45. No impact on minos-sam03 data rates (still running ecrc) Tested local copy on minos26, cd /local/scratch26/kreymer/DATA du -sm /minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0001_L250200N_D04.tar.gz 726 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0001_L250200N_D04.tar.gz MINOS26 > time cp -v /minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0001_L250200N_D04.tar.gz TEST.dat `/minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0001_L250200N_D04.tar.gz' -> `TEST.dat' real 1m9.342s user 0m0.144s sys 0m4.051s Rate is 726 MB/110 sec => 10 MB/sec. 
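For future rate checks, a minimal helper that times the copy and computes MB/sec from the measured wall time, instead of doing the division by hand. This is only a sketch; 'ratecp' is a made-up name, and the example path is the same test file used above.

ratecp() {
  # time a single file copy and report MBytes/sec
  SRC=$1 ; DST=$2
  MB=`du -sm ${SRC} | cut -f 1`
  T0=`date +%s`
  cp ${SRC} ${DST} || return 1
  T1=`date +%s`
  SEC=$(( T1 - T0 )) ; [ ${SEC} -eq 0 ] && SEC=1
  echo "${MB} MB in ${SEC} sec => $(( MB / SEC )) MB/sec"
}
ratecp /minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0001_L250200N_D04.tar.gz TEST.dat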
time ecrc TEST.dat CRC 1502044195 real 0m3.277s user 0m2.388s sys 0m0.824s Let's try dd, for kicks on the next file, BAF=/minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0002_L250200N_D04.tar.gz MINOS26 > time dd if=${BAF} of=TEST.dat bs=10M 70+1 records in 70+1 records out real 1m10.369s user 0m0.000s sys 0m4.046s MINOS26 > time ecrc TEST.dat CRC 3187614117 real 0m5.670s user 0m2.312s sys 0m0.865s MINOS26 > for DIR in ${DIRS} ; do du -sm /minos/data/mcimport/STAGE/daikon_04/L250200N/near/${DIR} ; done 1 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700 1385 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/701 225475 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/702 219404 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/703 222473 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/704 219192 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/705 221078 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/706 221048 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/707 221730 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/708 217331 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/709 220377 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/710 223250 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/711 220463 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/712 222180 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/713 222384 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/714 154 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/715 117468 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/717 We have 200 GB free on /home, total size 252 GB Can we clear 30GB more ? $ du -sm /home/* du: `/home/buckley': Permission denied du: `/home/kreymer': Permission denied 478 . du: `/home/lost+found': Permission denied 5691 /home/mindata removed 141 1 /home/room1 du: `/home/sam/products/man': Permission denied du: `/home/sam/products/catman': Permission denied 2954 /home/sam 1 /home/samread ============================================================================= 2008 03 09 ( Sunday ) ######## # DATA # ######## Scanning for files over 500 MBytes. These can be written directly without tarring em up. $ DIRS=`ls /minos/data/mcimport/STAGE/daikon_04/L250200N/near` $ echo $DIRS 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 717 for DIR in ${DIRS} ; do ( printf "${DIR} " ; find /minos/data/mcimport/STAGE/daikon_04/L250200N/near/${DIR} -name \*.gz -size +500000000c -exec du -sm {} \; | wc -l ) done 700 0 701 307 702 309 703 298 704 305 705 297 706 303 707 303 708 304 709 298 710 302 711 306 712 301 713 303 714 305 715 31 717 161 -T option for direct write of large ( > 500 MB ) .tar files. This will clear most of daikon_04/L250200N/near/* TIND=daikon_04/L250200N/near/715 AFSS/mcimport.20080303 -T ${TIND} Sun Mar 9 17:08:48 CDT 2008 OK - version mcimport.20080303 processing from /minos/data/mcimport/STAGE/daikon_04/L250200N/near/715 Sun Mar 9 18:24:28 CDT 2008 OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L250200N/near/715/mcimport.log Purging did not happen this time, ECRCFILE lacked full path in PURGEFILE. Strange, seeing little 1 minute interruptions in data flow every 12 minutes, on the hourly ganglia network plot. Norm data rate seems to average 5 to 6 MB/sec. 
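As a follow-up to the large-file scan above, a sketch that also totals the size per run directory that would qualify for direct write ( over 500 MB, no tarring ). Paths and the size cut are taken from the scan above; this is a check only, nothing is moved.

STAGE=/minos/data/mcimport/STAGE/daikon_04/L250200N/near
for DIR in `ls ${STAGE}` ; do
  # files over 500 MB are candidates for direct write, without tarring
  SIZES=`find ${STAGE}/${DIR} -name \*.gz -size +500000000c -exec du -sm {} \; | cut -f 1`
  NUM=0 ; TOT=0
  for S in ${SIZES} ; do NUM=$(( NUM + 1 )) ; TOT=$(( TOT + S )) ; done
  printf "%s %5d files %8d MB\n" ${DIR} ${NUM} ${TOT}
done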
Corrected and reran, oops ECRC n11037150_0000_L250200N_D04.tar.gz no harm done, corrected 1 more typo $ df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 25T 3.4T 89% /minos/data Now let's run on all daikon_04/L250200N/near cp -a AFSS/mcimport.20080303 . for DIR in ${DIRS} ; do ./mcimport.20080303 -T daikon_04/L250200N/near/${DIR} done Sun Mar 9 18:38:24 CDT 2008 Oops, left in the extra PURGEFILE in the ENCP file loop so lots of harmless OOPSes in the logs. ECRC... Sun Mar 9 22:21:23 2008 grep Overall /minos/data/mcimport/TAR/daikon_04/L250200N/near/701/mcimport.log mostly 6 MB/sec. Ganalia of minos-sam03 shows 15 MB/sec during ECRC phase, 5 MB/sec during encp phase ============================================================================= 2008 03 07 Date: Fri, 07 Mar 2008 10:02:11 -0600 (CST) Art, I have downloaded the condor 7.0.1 RPMS for MINOS. In future, the latest and greatest RPMs for Minos condor will always be stored at http://fermigrid.fnal.gov/files/condor/RPMS/i386/condor-latest.rpm and (if necessary) http://fermigrid.fnal.gov/files/condor/RPMS/x86_64/condor-latest.rpm Steve Timm forwarded to minos-admin I found these files actually at http://fermigrid.fnal.gov/files/condor/ specifically, we use the x86-rhel3 version, http://fermigrid.fnal.gov/files/condor/condor-7.0.1-linux-x86-rhel3-dynamic-1.i386.rpm ============================================================================= 2008 03 06 ######## # DATA # ######## MINOS26 > date Thu Mar 6 09:46:06 CST 2008 MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 25T 25T 601G 98% /minos/data Date: Thu, 06 Mar 2008 13:17:33 -0600 (CST) From: David Berg To: kreymer@fnal.gov, oleynik@fnal.gov Cc: crawdad@fnal.gov, minos-data@fnal.gov, enstore-admin@fnal.gov Subject: Re: Request Minos write access to LTO-4 to clear 10 TB backlog Art, CMS has kindly loaned 15 blank LTO4 tapes to CD for this purpose. I have created a quota of 15 for minos and reassigned 15 tapes to the common blank pool. I changed the library tag under /pnfs/minos/stage to CD-LTO4G1. Write away. ######## # FARM # ######## roundup is falling behind, copying only 750 out of 1351 files since 13:45 top - 09:10:13 up 72 days, 20:00, 19 users, load average: 8.32, 8.12, 8.13 Tasks: 230 total, 9 running, 220 sleeping, 1 stopped, 0 zombie Cpu(s): 93.6% us, 4.0% sy, 0.0% ni, 2.2% id, 0.2% wa, 0.0% hi, 0.0% si Mem: 16629324k total, 8084144k used, 8545180k free, 7324k buffers Swap: 19454704k total, 240k used, 19454464k free, 1602072k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 22010 carneiro 25 0 597m 592m 636 R 96 3.7 1251:44 astra_022608 23435 carneiro 25 0 597m 592m 636 R 96 3.7 1251:22 astra_022608 23437 carneiro 25 0 597m 592m 636 R 93 3.7 1251:13 astra_022608 23432 carneiro 25 0 597m 592m 636 R 91 3.7 1251:13 astra_022608 23430 carneiro 25 0 597m 592m 636 R 91 3.7 1251:23 astra_022608 23440 carneiro 25 0 597m 592m 616 R 90 3.6 1251:17 astra_022608 Date: Thu, 06 Mar 2008 10:04:35 -0600 (CST) Subject: HelpDesk ticket 112276 ___________________________________________ Short Description: fnpcsrv1 overloaded by carneiro astra_022608 jobs Problem Description: The Minos farm I/O operations from fnpcsrv1 are falling seriously behind. A contributing factor is probably the six CPP bound processes running since about noon yesterday. http://fermigrid2.fnal.gov/ganglia/?r=day&c=FermiGrid&h=fnpcsrv1.fnal.gov These seem to be winding down now, as of about 10:00. 
In future, please do not overload this central server in this way.
___________________________________________
Date: Thu, 27 Mar 2008 08:59:17 -0500 (CDT) Subject: Help Desk Ticket 112276 Has Been Resolved.
___________________________________________________________________
Solution: Carneiro has been shown how to run grid universe jobs and has successfully done so. This problem should not happen again. Let us know if it does. Steve Timm
___________________________________________________________________
########## # PURIFY # ##########
Date: Thu, 06 Mar 2008 18:01:59 +0000 (UTC) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Subject: Software Product "Purify" (fwd)
---------- Forwarded message ----------
Date: Thu, 06 Mar 2008 11:44:35 -0600 From: Peter J. Rzeminski II To: linux-users@fnal.gov Subject: Software Product "Purify"
The license for Purify is coming up for renewal soon. The licensing group has requested that we find out if anybody is currently using it. Troy checked the license server and in the past year nobody requested a license for the software. Additionally, nobody I have spoken to seems to know of anybody using the software. If anybody is currently using it and wants the license to be renewed, please speak up now. Otherwise, I will advise them to let the license expire. Thank you.
--
____________________________________________________________
Peter J. Rzeminski II Email: ptr@fnal.gov CD/LSC/CSI/Central Services - Web Team Phone: 630.840.5524 Fermi National Accelerator Laboratory Pager: 630.905.0540
####### # AFS # #######
Global failures,
Mar 6 06:42:37 minos26 kernel: afs: Lost contact with file server 131.225.68.6
...
Mar 6 07:24:57 minos05 kernel: afs: file server 131.225.68.17 is back up
Same problem on minos-mysql1
Helpdesk ticket 112250 - WEB down - 12:58 UTC / 06:58 CST
Date: Thu, 06 Mar 2008 10:09:38 -0600 (CST) Subject: HelpDesk ticket 112277
___________________________________________
Short Description: AFS down 06:42 to 07:25 - status ?
Problem Description: CSI : AFS seems to have been down from about 06:42 to 07:25 this morning, as seen on the Minos Cluster and on fnpcsrv1 ( presumably a server problem ). I see no announcement at http://computing.fnal.gov/cdsystemstatus/system/AFS.html Is the system stable and usable, or should we be shutting down ? Please post a status announcement. Thanks !
___________________________________________
Date: Thu, 06 Mar 2008 10:14:30 -0600 (CST) This ticket has been reassigned to RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST Group.
___________________________________________________________________
Solution: The AFS fileserver processes core dumped this morning when we attempted to add a service key for the FERMI.WIN.FNAL.GOV AD domain. This process was to be non-disruptive as the processes only needed to re-read a keyfile. This was not the case. The service outage was from 06:43 -> 07:19. AFS is currently stable. I will see about getting a message posted to the status page.
___________________________________________________________________
####### # AFS # #######
Date: Wed, 05 Mar 2008 19:15:04 -0600 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron /usr/krb5/bin/kcron ${HOME}/minos/scripts/cfl
kinit: Invalid message type while getting initial credentials
aklog: Couldn't get fnal.gov AFS tickets: aklog: Invalid argument while getting AFS tickets
CFL.new: No such file or directory ? ?
CFL.new: Permission denied ?
rm: cannot remove `CFL.old': Permission denied mv: cannot move `CFL' to `CFL.old': Permission denied mv: cannot stat `CFL.new': No such file or directory kdestroy: No credentials cache file found while destroying cache Ticket cache ^GNOT^G destroyed! ####### # AFS # ####### Date: Thu, 06 Mar 2008 06:43:02 -0600 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron /usr/krb5/bin/kcron ${HOME}/minos/scripts/condorweb sh: /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorweb: Connection timed out ============================================================================= 2008 03 05 ########## # CONDOR # ########## echo "ln -sf kreymer-condor.proxy.20080404 kreymer-condor.proxy/home/gfactory/.grid/kreymer-condor.proxy" \ | at Apr 01 ############ # MCIMPORT # ############ Urgent need to archive older and/or select gaf files to tape. Revive the tar and write sections. -t option for running tar, specifying input path. Switch to direct ENCP in write, this is archival, no DCache. Means we can purge immediately on a successful copy. encp the whole directory in one command. typically 50 to 100 files. Add ecrc files as each tarfile is built For example, TIND=/minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 mcimport -t ${TIND} will produce /minos/data/mcimport/STAGE/TAR/daikon_00/L010185N_nue/near/141 containing files like n14111411_0000_L010185N_D00_nue-n14111414_0000_L010185N_D00_nue.tar n14111411_0000_L010185N_D00_nue-n14111414_0000_L010185N_D00_nue.ecrc n14111411_0000_L010185N_D00_nue-n14111414_0000_L010185N_D00_nue.index WRITE will encp these directly to tape, Files will go to /pnfs/minos/stage/TAR/... Test like TIND=daikon_00/L010185N_nue/near/141 AFSS/mcimport.20080303 -n -t ${TIND} $ du -sm /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 5682 /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 cp -vax /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 \ /local/scratch26/mindata/141 ... too slow, too much mcimport on minos-sam03, time cp -vax /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 \ /home/mindata/141 real 21m17.551s user 0m1.059s sys 0m31.474s $ AFSS/mcimport.20080303 -t ${TIND} OOPS - found /minos/data/mcimport/CRON/mcimport.tar.pid OK - stale pid file Thu Mar 6 16:43:31 CST 2008 ... For config n1411 _L010185N_D00_nue.tar.gz 99 files from n14111411_0000_L010185N_D00_nue.tar.gz to n14111419_0010_L010185N_D00_nue.tar.gz 99/99 TOTAL FILES 5682 /minos/data/mcimport/TAR/daikon_00/L010185N_nue/near/141 Thu Mar 6 17:17:43 CST 2008 Oops, ecrc was done on the wrong file, correcting manually. 
cd /minos/data/mcimport/TAR/daikon_00/L010185N_nue/near/141 FIS=' n14111411_0000_L010185N_D00_nue-n14111413_0006_L010185N_D00_nue n14111413_0007_L010185N_D00_nue-n14111416_0003_L010185N_D00_nue n14111416_0004_L010185N_D00_nue-n14111418_0010_L010185N_D00_nue n14111419_0000_L010185N_D00_nue-n14111419_0010_L010185N_D00_nue ' for FI in ${FIS} ; do echo ${FI} ecrc ${FI}.tar | cut -f 2 -d ' ' > ${FI}.ecrc done Requested R/W mount of /pnfs/minos on minos-sam03 Corrected group, allowing mindata to write chown kreymer.e875 /pnfs/minos/stage MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 25T 25T 460G 99% /minos/data MINOS26 > date Thu Mar 6 17:47:02 CST 2008 Date: Thu, 06 Mar 2008 17:41:37 -0600 (CST) Subject: HelpDesk ticket 112323 ___________________________________________ Short Description: Please mount /pnfs/minos read/write on minos-sun03 ( presently readonly ) Problem Description: run2-sys : We urgently need to move some 10 TBytes of data from /minos/data to PNFS. minos-sun03 is the system at most capable of doing this, but /pnfs/minos is mounted readonly there. Please change this to a read/write mount ASAP. Thanks ! ___________________________________________ Corrected this to minos-sam03 Date: Thu, 06 Mar 2008 17:55:25 -0600 From: Jason Harrington I have remounted /pnfs/minos read/write on minos-sam03. I was trying to get the change into cfengine and ran into an issue with classes containing "-" (it's a syntax error), so I commented the fstab edits for the time being (next update run at 18:00, so needed to keep things working everywhere). ___________________________________________ All done on the MINOS side. The cfengine repairs are an internal issue. This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Fri, 07 Mar 2008 10:00:25 -0600 (CST) Note To Requester: csieh@fnal.gov sent this Notes To Requester: The system name minos-sun03 does not exist in DNS. Is this the correct name of the system you are asking about? _________________________________________________________________ Tape data rates from minos26 are terrible, about 1 MB/sec. $ AFSS/mcimport.20080303 -t ${TIND} Thu Mar 6 17:40:08 CST 2008 AFSS/mcimport.20080303: line 908: [: too many arguments OK - version mcimport.20080303 processing from /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 Completed transferring 5957857280 bytes in 4 files in 3360.73328114 sec. Overall rate = 1.84 MB/sec. Transfer rate = 1.87 MB/sec. Network rate = 1.97 MB/sec. Drive rate = 94 MB/sec. Disk rate = 1.97 MB/sec. Exit status = 0. Corrected a few more flaws, purged 141. Let's try the whole thing on 142 TIND=daikon_00/L010185N_nue/near/141 AFSS/mcimport.20080303 -t ${TIND} Rates are good, about 6 MB/sec tarring, about 6 MB/sec to tape. Oops, typo error making ecrc's, correct this $ cd /minos/data/mcimport/TAR/daikon_00/L010185N_nue/near/142 $ FILES=`ls *.tar` $ for FILE in $FILES ; do echo ${FILE} ; ecrc ${FILE} | cut -f 2 -d ' ' > ${FILE%.tar}.ecrc ; done AFSS/mcimport.20080303 -t ${TIND} Hmmm, log files are not working properly, oh well. 
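Since the .ecrc files have now had to be rebuilt by hand twice, here is a sketch of a single pass that (re)creates them for every .tar in a TAR output directory, skipping any that already exist. It assumes ecrc prints 'CRC <value>' as in the corrections above; 'mkecrc' is a made-up name.

mkecrc() {
  TARDIR=$1
  cd ${TARDIR} || return 1
  for TARF in `ls *.tar` ; do
    ECRCF=${TARF%.tar}.ecrc
    [ -s ${ECRCF} ] && continue    # non-empty checksum file already present
    echo ${TARF}
    ecrc ${TARF} | cut -f 2 -d ' ' > ${ECRCF}
  done
}
mkecrc /minos/data/mcimport/TAR/daikon_00/L010185N_nue/near/142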
Let's grab a bigger bite $ du -sm /minos/data/mcimport/STAGE/daikon_00/L010185N/near/* 33846 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/141 36561 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/142 36670 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/143 37375 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/144 3763 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/145 10388 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/704 NNS=`ls /minos/data/mcimport/STAGE/daikon_00/L010185N/near` $ echo $NNS 141 142 143 144 145 704 date df -h /minos/data for NN in ${NNS} ; do TIND=daikon_00/L010185N/near/${NN} AFSS/mcimport.20080303 -t ${TIND} date df -h /minos/data done date df -h /minos/data This will go more slowly than it should, most all.md5 files are missing Only present in 704 Thu Mar 6 21:23:26 CST 2008 $ df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 25T 25T 440G 99% /minos/data .... this is no good, the md5sums will take much too long. killed this midstream, $ cd /minos/data/mcimport/STAGE/daikon_00/L010185N/near/141 $ cp 2103.md5 all.md5 Let's grab a bigger piece of pie, $ du -sm /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700 TIND=daikon_04/L250200N/near/700 AFSS/mcimport.20080303 -t ${TIND} $ ls /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700 | wc -l 372 $ wc -l /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700/md5 wc: /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700/md5: No such file or directory $ wc -l /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700/all.md5 369 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700/all.md5 MAIN >> /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700/log/mcimport.log 2>&1 & Thu Mar 6 21:48:50 CST 2008 OK - version mcimport.20080303 processing from /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700 LOGS TAR, WRITE, PURGE IN TAR Thu Mar 6 21:48:50 CST 2008 For config n1103 _L250200N_D04.tar.gz 274 files from n11037001_0000_L250200N_D04.tar.gz to n11037009_0030_L250200N_D04.tar.gz md5sum n11037008_0025_L250200N_D04.tar.gz n11037001_0000_L250200N_D04-n11037001_0001_L250200N_D04.tar 2 n11037001_0000_L250200N_D04.tar.gz to n11037001_0001_L250200N_D04.tar.gz The ecrc file looks OK now, let's see how this looks tomorrow. If lucky, we'll get nearly a 200 GB from this, and nearly a TB from farm concatenation overnight. 
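Before the next batch, a pre-flight sketch to spot run directories that lack an all.md5, since recomputing md5sums is what made the killed pass above too slow. It only reports; the path is the L010185N/near area used above.

STAGE=/minos/data/mcimport/STAGE/daikon_00/L010185N/near
for NN in `ls ${STAGE}` ; do
  if [ -s ${STAGE}/${NN}/all.md5 ] ; then
    echo "OK   ${NN} `wc -l < ${STAGE}/${NN}/all.md5` entries in all.md5"
  else
    echo "OOPS ${NN} missing all.md5 - the md5sum pass will be slow"
  fi
done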
####### # AFS # ####### for NODE in ${NODES} ; do ssh -ax ${NODE} \ 'grep afs: /var/log/messages | grep "Mar " | grep -v Tokens | uniq'; done \ | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' Mar 3 22:12:55 minos05 kernel: afs: Lost contact with volume location server 198.128.3.21 in cell es.net Mar 3 22:13:52 minos05 kernel: afs: Lost contact with volume location server 198.128.3.23 in cell es.net Mar 3 22:14:49 minos05 kernel: afs: Lost contact with volume location server 198.128.3.22 in cell es.net Mar 3 22:16:01 minos05 kernel: afs: Lost contact with volume location server 192.204.203.218 in cell sinenomine.net Mar 4 05:30:07 minos05 kernel: afs: volume location server 198.128.3.22 in cell es.net is back up Mar 4 05:30:07 minos05 kernel: afs: volume location server 198.128.3.23 in cell es.net is back up Mar 5 08:30:07 minos05 kernel: afs: volume location server 198.128.3.21 in cell es.net is back up ######## # FARM # ######## top - 11:40:44 up 71 days, 22:30, 22 users, load average: 6.41, 2.93, 2.21 Tasks: 223 total, 7 running, 215 sleeping, 1 stopped, 0 zombie Cpu(s): 76.1% us, 1.6% sy, 0.0% ni, 16.2% id, 6.0% wa, 0.1% hi, 0.0% si Mem: 16629324k total, 9604968k used, 7024356k free, 5184k buffers Swap: 19454704k total, 240k used, 19454464k free, 3748952k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 22010 carneiro 25 0 597m 490m 576 R 12 3.0 1:15.15 astra_022608 23430 carneiro 25 0 597m 490m 576 R 12 3.0 0:58.07 astra_022608 23432 carneiro 25 0 597m 490m 576 R 12 3.0 0:53.79 astra_022608 23435 carneiro 25 0 597m 490m 576 R 12 3.0 0:47.90 astra_022608 23437 carneiro 25 0 597m 490m 576 R 12 3.0 0:44.19 astra_022608 23440 carneiro 25 0 597m 490m 556 R 12 3.0 0:41.75 astra_022608 ... ####### # SAM # ####### Various reports of SAM project trouble on LSF nodes ( flxb* ) MINOS26 > sam dump project --station=minos --project=evansj-CC0325-RunII-L250z200-ND-Data-20080304-0419 MINOS26 > SAMDIM='project_name evansj-CC0325-RunII-L250z200-ND-Data-20080304-0419' MINOS26 > sam list files --dim="${SAMDIM}" ... 101 files listed ... Justin specified a project having trouble, with this dataset : MINOS26 > sam_test_py minos ${UNIV} evansj-CC0325-RunI-L010z185-ND-Data OK running station minos dbserver prd dataset evansj-CC0325-RunI-L010z185-ND-Data project sam_test_project_20080305195142 fileCut 0 cid 7686 cpid 26266 job SAMStation.JobCount(jobsAtNode=1, jobsAll=1) ... Got dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008200_0006.spill.sntp.cedar_phy_bhcurv.0.root file 440 Decrementing the job count. Stopping the project FLXB19 > time sam_test_py minos ${UNIV} evansj-CC0325-RunI-L010z185-ND-Data FLXB29 > ./sam_cli_py minos prd sam_test_project_20080305200307 RetryHandler.getNextFile(26271L)> initial retriable exception ProjectNotFound('Project 'sam_test_project_20080305200307' on station 'minos' not responding.') RetryHandler.getNextFile(26271L)> will retry in 1.28 seconds Got dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008076_0000.spill.sntp.cedar_phy_bhcurv.0.root file 297 Decrementing the job count. 
Stopping the project real 7m4.591s Projects mentioned are : evansj-CC0325-RunI-L010z185-ND-Data-20080304-1125 evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830 evansj-CC0325-RunII-L010z185-ND-Data-20080303-1013 Checking out the other datasets mentioned FLXB19 > time sam_test_py minos ${UNIV} evansj-CC0325-RunII-L250z200-ND-Data Got dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2006-07/N00010583_0023.spill.sntp.cedar_phy_bhcurv.0.root file 101 Decrementing the job count. Stopping the project real 2m7.226s FLXB19 > time sam_test_py minos ${UNIV} evansj-CC0325-RunII-L010z185-ND-Data Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-03/N00011855_0000.spill.sntp.cedar_phy_bhcurv.0.root file 413 Decrementing the job count. Stopping the project real 7m27.047s What is the state of the projects : MINOS26 > sam dump project --station=minos --project=evansj-CC0325-RunI-L010z185-ND-Data-20080304-1125 MINOS26 > sam dump project --station=minos --project=evansj-CC0325-RunI-L010z185-ND-Data-20080304-1125 \ > /minos/scratch/kreymer/log/samproj/evansj-CC0325-RunI-L010z185-ND-Data-20080304-1125 MINOS26 > sam dump project --station=minos --project=evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830 Project 'evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830' on station 'minos' not responding. MINOS26 > sam dump project --station=minos --project=evansj-CC0325-RunII-L010z185-ND-Data-20080303-1013 TRANSIENT; CORBA.TRANSIENT(omniORB.TRANSIENT_ConnectFailed, CORBA.COMPLETED_NO) Two are dead, and the third seems to be in active use. 26265: evansj(loon:dev)@dcap://minos-01[Loon Analysis Process], busy since 05 Mar 13:46:11, 05 Mar 13:46:11, 1958488 26267: evansj(loon:dev)@dcap://minos-02[Loon Analysis Process], busy since 05 Mar 13:53:51, 05 Mar 13:53:51, 1958498 26268: evansj(loon:dev)@dcap://minos-01[Loon Analysis Process], busy since 05 Mar 13:54:40, 05 Mar 13:54:40, 1958489 Looking in trace for evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830 Find messages like 03/03/08 20:39:42 minos.evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830.PollProcess.Worker 26800: Notifing 03/03/08 20:39:42 minos.evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830.PollProcess.Worker 26800: System exception `COMM_FAILURE' Reason: Connection refused Completed: no Minor code: 1330577418 (connect() failed) 03/05/08 01:24:19 minos.evansj-CC0325-RunI-L010z185-ND-Data-20080304-1125.PollProcess.Worker 18021: System exception `COMM_FAILURE' Reason: Connection refused Completed: no Minor code: 1330577418 (connect() failed) But these are not unique to evansj jobs, perhaps 'good' error messages. ########## # CONDOR # ########## Released all the gfactory processes that had been held : ########## # CONDOR # ########## Created newer proxy for gfactory, SRV1> cd /export/stage/minfarm/.grid DAYS=20 (( HOURS = DAYS * 24 )) DAPR=`date -d "today + ${DAYS}days" +%Y%m%d` voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife ${HOURS}:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy.${DAPR} \ -valid ${HOURS}:0 Your proxy is valid until Tue Mar 25 10:28:01 2008 DAYS=30 [gfactory@minos25 ~]$ cd .grid/ DAPR=20080325 DAPR=20080404 scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy.${DAPR} . DAPR=20080325 ln -sf kreymer-condor.proxy.${DAPR} kreymer-condor.proxy ########## # CONDOR # ########## All data stopped just before Noon on Tuesday. Investigating. The proxy expired. 
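Since the expired proxy silently stopped all data, a sketch of a cron-able expiry check. It reads the proxy end date with plain openssl rather than any VOMS tooling; the 5 day threshold and the messages are assumptions, only the proxy path comes from the notes above.

PROXY=/home/gfactory/.grid/kreymer-condor.proxy
WARNDAYS=5
# notAfter of the first certificate in the proxy file is the proxy expiry
END=`openssl x509 -enddate -noout -in ${PROXY} | cut -f 2 -d '='`
ENDSEC=`date -d "${END}" +%s`
NOWSEC=`date +%s`
DAYSLEFT=$(( ( ENDSEC - NOWSEC ) / 86400 ))
if [ ${DAYSLEFT} -lt ${WARNDAYS} ] ; then
  echo "OOPS - ${PROXY} expires in ${DAYSLEFT} days ( ${END} )"
else
  echo "OK - ${PROXY} good for ${DAYSLEFT} days"
fi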
####### # SAM # ####### Sam declares for FD data stopped working . Other declares seem OK. //////////////////////////////////////////////// STARTED Mon Mar 3 21:08:41 2008 FINISHED Mon Mar 3 21:08:44 2008 Traceback (most recent call last): File "./sadd", line 110, in ? SAMLOC=sam.locate( args = FILER ) File "sam_common_pylib/SamCommand/BlessedCommandInterfacePlaceHolder.py", line 81, in __call__ File "sam_common_pylib/SamCommand/CommandInterface.py", line 251, in __call__ File "sam_common_pylib/SamCommand/SamCommandInterface.py", line 240, in apiWrapper File "sam_user_pyapi/src/samLocate.py", line 75, in implementation File "sam_common_pylib/SamStruct/NameOrId.py", line 51, in __init__ File "sam_common_pylib/SamCorba/SamIdlStructWrapperBase.py", line 405, in __init__ File "sam_common_pylib/SamStruct/NameOrId.py", line 71, in initialize_fromPython_countedArgs ArgumentError: NameOrId: Invalid input arguments Input args = (None,) fardet_data/2008-03 STARTED Mon Mar 3 23:08:02 2008 //////////////////////////////////////////////// Generating .py for /pnfs/minos/fardet_data/2008-03 STARTING Mon Mar 3 23:06:13 UTC 2008 Treating 78 files Scanning 2 files F00040390_0003.mdaq.root Mon Mar 3 23:06:15 UTC 2008 ? F00040390_0004.mdaq.root Mon Mar 3 23:07:14 UTC 2008 ? FINISHED Mon Mar 3 23:07:59 UTC 2008 //////////////////////////////////////////////// MINOS26 > pwd /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/fardet_data/2008-03 MINOS26 > dds F00040390_0004* -rw-r--r-- 1 kreymer g020 0 Mar 3 17:07 F00040390_0004.log -rw-r--r-- 1 kreymer g020 0 Mar 3 17:07 F00040390_0004.sam.py MINOS26 > find . -size 0 ./F00040390_0003.log ./F00040390_0003.sam.py MINOS26 > dds F00040390_0003* -rw-r--r-- 1 kreymer g020 0 Mar 3 17:06 F00040390_0003.log -rw-r--r-- 1 kreymer g020 0 Mar 3 17:07 F00040390_0003.sam.py MINOS26 > rm F00040390_0003* //////////////////////////////////////////////// mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root Are these files being tried now? If not, could they be? ============================================================================= 2008 03 04 ########### # BLUEARC # ########### Date: Tue, 04 Mar 2008 13:44:42 -0600 (CST) Subject: HelpDesk ticket 112164 ___________________________________________ Short Description: /minos/data size adjustment in BlueArc Problem Description: LSC/CSI : We seem to have been even more successful than before in filling up /minos/data, before we have had a chance to archive some of the older files. ( See previous helpdesk ticket 111148 from 14 Feb. ) We have under 1 TB free, out of 25 TB. Please adjust the size of the /minos/data area upward from 25 TB to 28 TB. If necessary take this space from /minos/scratch, which is presently using under 3 TB. We are actively working on archiving nearly 10 TB of files from /minos/data, but it will take a while to revive old scripts, and get these on tape. We should get this cleared up by about 10 March. ___________________________________________ Date: Tue, 04 Mar 2008 13:53:50 -0600 (CST) This ticket has been reassigned to RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Tue, 04 Mar 2008 13:58:39 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/BLU Group. 
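A sketch of a simple free-space watch for /minos/data, so the next size adjustment can be requested before the area is nearly full. The 2 TB threshold is an arbitrary assumption; the df parsing follows the same idiom used by roundup elsewhere in this log.

AREA=/minos/data
MINFREE=2000000    # MBytes, roughly 2 TB
FREE=`df -m ${AREA} | tail -1 | tr -s ' ' | cut -f 4 -d ' '`
if [ ${FREE} -lt ${MINFREE} ] ; then
  echo "OOPS - ${AREA} down to ${FREE} MB free ( threshold ${MINFREE} )"
else
  echo "OK - ${AREA} has ${FREE} MB free"
fi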
___________________________________________ ########### # BLUEARC # ########### Yesterday, had no earlier than 13:33 MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 25T 24T 1.7T 94% /minos/data MINOS26 > df -h /minos/scratch Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/scratch 9.0T 2.3T 6.7T 26% /minos/scratch MINOS26 > du -sm /minos/data/users/rustem/ 87015 /minos/data/users/rustem/ ########## # DCACHE # ########## Status of helpdesk tickets 111951 kreymer, 112020 rubin From ticket resolution, georges, x4515, ssa-help@fnal.gov Trying a copy , once more MINOS26 > dccp dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root TEST.data http://fndca3a.fnal.gov/dcache/DOORS.html DCap00-stkendca2a-unknown-550430 DCap00-stkendca2a-unknown-550430 minos26.fnal.gov active Mar 04 11:15:11 Mar 04 11:15:11 1060/14515 DCap00-stkendca2a-unknown-550430 Arthur Kreymer ? ? ? ? open minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root ######## # FARM # ######## > Please give me a brief map of the process of moving files from mcnearcat > to dcache/enstore. The files are moved to /minos/data/minfarm/WRITE/ until they are confirmed to be on tape. That should usually happen by the next cron cycle. Right now there are 739 waiting to get on tape. MINOS26 > ls /minos/data/minfarm/WRITE/*cand* | wc -l 739 > Incidentally, grid processing has added a pass number to mc output > files. Does this disturb you? Somewhat, as I am swamped by some other high priority activities clearing some space on /minos/data ... we have 10.5 TBytes of MC imports !!! clearing out duplicates from /minos/data/minfarm requires some major changes to my scripts But if the files are showing up concatenated, I guess things are fine. I do see one concatenated file dated Mar 4 02:48 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758/n13037581_0000_L010185N_D04.sn tp.cedar_phy_bhcurv.0.root It's not in SAM yet, I presume bacause of the backlog. I'll look for it later today. ... /home/minfarm/ROUNTMP/LOG/saddreco/daikon_04/cedar_phy_bhcurv/near_L010185N.log Needed /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758 Added sam tape location /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758 Treating 3 files in /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758 OK - declared n13037583_0000_L010185N_D04.sntp.cedar_phy_bhcurv.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758(voe170.184) OK - declared n13037582_0000_L010185N_D04.sntp.cedar_phy_bhcurv.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758(voe170.185) OK - declared n13037580_0000_L010185N_D04.sntp.cedar_phy_bhcurv.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758(voe170.186) This skipped the .0 files. Looking into this, using saddreco.new time ./saddreco.new -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N --verify URK --- looking at the latest files showing up in mcnearcat, they are owned by minospro.numi . 
SRV1> ypcat passwd | grep minospro minospro:x:42411:5111:minos e875 production:/grid/home/minospro:/sbin/nologin ============================================================================= 2008 03 03 ########### # MINOS26 # ########### disk filled up... perhaps due to vault processing. $ df -h . Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 218G 0 100% /local/scratch26 cd /local/scratch26/mindata/MOVED/ OOPS vault error processing near 2008-02 $ rm -r kreymer/ $ rm -r mtavera 17:13 ( 23:13 UTC ) $ df -h . Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 25G 194G 12% /local/scratch26 Date: Mon, 03 Mar 2008 16:37:01 -0600 From: Minos Data To: minos-data@fnal.gov Subject: MCIMPORT DISABLED DUE TO SCRATCH SPACE WARNING, minos26 scratch space 0 under 3000 MBytes .k5login has been restricted to administrators .k5login will be restored within 10 minutes when space goes over 5000 This worked, $ dds .k5* lrwxrwxrwx 1 mindata e875 12 Mar 3 17:17 .k5login -> .k5loginfull Restarted vault rm -r /local/scratch26/kreymer/SHEEP/neardet_data/2008-02 rm /var/tmp/rawcopy/TARWORK/*.root ############ # MCIMPORT # ############ MUST shift something out. 183937 /minos/data/mcimport/STAGE/daikon_00 10535116 /minos/data/mcimport/STAGE/daikon_04 MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N 6248133 /minos/data/mcimport/STAGE/daikon_04/L010185N 6622 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N 3440138 /minos/data/mcimport/STAGE/daikon_04/L250200N The RUN subdirectories of L010185N and L250200N are nearly all 100 to 200 GByte in size. I'll revive and extend the tarring option of mcimport. $ du -sm /pnfs/minos/stage/* 268720 /pnfs/minos/stage/arms 1 /pnfs/minos/stage/buckley 1 /pnfs/minos/stage/gmieg 758471 /pnfs/minos/stage/hgallag 1358789 /pnfs/minos/stage/howcroft 5596976 /pnfs/minos/stage/kordosky 11457 /pnfs/minos/stage/kreymer 202838 /pnfs/minos/stage/mualem 1 /pnfs/minos/stage/rhatcher 1 /pnfs/minos/stage/sjc 1 /pnfs/minos/stage/urheim Date: Mon, 03 Mar 2008 15:50:59 -0600 (CST) Subject: HelpDesk ticket 112120 ___________________________________________ Short Description: STKEN - request change of /pnfs/minos/stage library to LTO-3 Problem Description: We need to write about 10 TBytes of new data to /pnfs/minos/stage. We need to start these new write this week. Please change the library from CD-9940B to CD-LTO3, so that we do not exhaust the supply of 9940-B tapes. ( 10 TB is about 50 9940-B tapes, versus 120 on hand. ) Thanks ! ___________________________________________ Date: Tue, 04 Mar 2008 08:03:06 -0600 (CST) This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. ########## # CONDOR # ########## Rustem--when you are looking at the slots that are allocated for MINOS you need to add up all users in the MINOS group which are running at the time. For instance, right now there are 300 slots being used by the sum of all three users minos, minospro, and rustem, and it is the sum of these three that matters. Having said that, there was a time this morning at around 9AM when less than 300 total minos jobs were running but it also looked like none were waiting at that time. 
For more information on what use is being made of slots, you can do condor_userprio -all, and look at the entry for group_numi, that will tell you how many slots the minos group is using at one time. There is a standing request from Art Kreymer to boost up the quota of total slots for MINOS which we should get to later this week. Steve Timm > Ticket #: 112087 > Priority: Medium > System Name: > > ____________________________________________ > Requester Information > ____________________________________________ > Name: RUSTEM OSPANOV > Phone: 6460 > E-Mail Address: RUSTEM@FNAL.GOV > _____________________________________________ > Ticket Details > _____________________________________________ > Problem Category: Grid > Type: Fermilab Sup Ctr > Item: FermiGrid > Urgency: Medium > Short Description: FermiGrid/general purpose farms quota allocation > > Problem Description: Hello, > > I am running jobs under minos group account on general purpose farms. > This account has 300 slots allocated for the jobs but I observe that > often less than 300 jobs are running even when there are idle nodes: > > http://home.fnal.gov/~rustem/tmp/condor_q_2008_03_03.txt > http://home.fnal.gov/~rustem/tmp/condor_status_2008_03_03.txt > > There seems to be something with counting of jobs toward the minos > allocation that I do not understand because we should be able to run > 300 jobs at any given time. > > Thank you, > Rustem SRV1> condor_userprio -all | grep numi Last Priority Update: 3/3 13:39 Effective Real Priority Res Accumulated Usage Last User Name Priority Priority Factor Used Usage (hrs) Start Time Usage Time ------------------------------ --------- -------- ------------ ---- ----------- ---------------- ---------------- group_numi.rustem@fnal.gov 11.31 11.31 1.00 141 25711.36 9/22/2007 11:22 3/03/2008 13:40 group_numi.minos@fnal.gov 77.09 77.09 1.00 51 22886.28 11/27/2007 16:52 3/03/2008 13:40 group_numi.minospro@fnal.gov 87.87 87.87 1.00 108 8923.31 12/18/2007 13:51 3/03/2008 13:40 group_numi 177.72 177.72 1.00 300 1040569.32 4/19/2006 19:49 3/03/2008 13:40 <-- # @ Enter Update below this line. @ # --> Steve, thanks for looking into this. Rustem, for timeline information regarding loads, you can use CondorView, Follow the links under http://fermigrid.fnal.gov/ -> Left Frame - FermiGrid Monitoring, Metrics and Accounting: [Metrics and Service Monitors ] -> FermiGrid - Production Clusters / fngp-osg.fnal.gov Condor View -> Pool User (Job) Statistics [week] http://fnpcsrv1.fnal.gov/condorview/viewdir/UserWeek.html Use the 'configure' box to the lower left of the graphics display to select items of interest, like group_numi.minospro@fnal.gov jobsRunning group_numi.rustem@fnal.gov jugsRunning group_numi.rustem@fnal.gov jobsIdle You had a few idle jobs as you ramped up from 21:30 to 22:00 last night You have more idle jobs now, as you are competing with production. <-- # @ Enter Update above this line. @ # --> > > http://webserver.infn.it/cdf/docs/cafcondorOperations.pdf User condor_userprio to change the priority factory for USER in tinti sjc kreymer masaki scavan boehm pawloski rmehdi loiacono ; do condor_userprio -setfactor ${USER}@fnal.gov 100. ; done Got more complete list with USERS=`condor_userprio -allusers | grep ' 0' | cut -f 1 -d ' ' | grep -v gfactory` for USER in ${USERS} ; do condor_userprio -setfactor ${USER} 100. 
; done ########### # MONTHLY # ########### DATASETS 3/3 PREDATOR 3/3 VAULT 3/3 12:00 - restarted near due to full /local/scratch26 MYSQL 3/ 13:30 - MYSQL waiting for heavy usage by tagg@minos11 to abate. This has restarted, but is only reader acces Mysql> du -sm . 58366 . Mon Mar 3 13:30:54 CST 2008 DCS_HV.MYD real 15m26.458s PULSERGAIN.MYD real 11m55.229s the rest real 50m4.492s Mon Mar 3 14:48:34 CST 2008 ######### # MISER # ######### The new Miser uses cert's for access https://appora.fnal.gov/pls/cert/miscomp.miser.html Upgrade described at http://computing.fnal.gov/news/misermar08.html ######## # GRID # ######## voms-proxy-init - are we at required >= 1.6.16.10 SRV1> voms-proxy-init -version voms-proxy-init Version: 1.7.20 Compiled: Jul 3 2007 13:49:34 MINOS26 > cd /grid/app/minos/VDT MINOS26 > . ./setup.sh MINOS26 > voms-proxy-init -version voms-proxy-init Version: 1.7.20 Compiled: Jul 3 2007 13:49:34 ####### # WEB # ####### server validations due by Mar 31 ? follow up to email ######## # FARM # ######## All processing has moved to the Grid universe. ######## # FARM # ######## howie - SAM for Grid jobs ? will try /grid/app/minos/sam clone of /export/stage/minfarm/ROUNDUP/SAM as minfarm@fnpcsrv1 : cp -ax /export/stage/minfarm/ROUNDUP/SAM /grid/app/minos/sam ============================================================================= 2008 02 29 ####### # DOG # ####### Maisey Kreymer is coming home this morning around 10:00 On vacation. Woof. ######## # DATA # ######## Date: Thu, 28 Feb 2008 17:16:38 -0800 From: J. Pedro Ochoa I was successfully getting them using your script "ftpfiles" but then when?it got to n11011026_0000_L010185N_D00.sntp.cedar_phy.root it seems to have gotten stuck (by "stuck" I mean it's been on that file for several hours, and the size does not increase). 15:10 MINOS26 > ./dc_stat /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root ============================ PNFS status for /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root -rw-r--r-- 1 rubin e875 583586556 Nov 28 20:19 n11011026_0000_L010185N_D00.sntp.cedar_phy.root LEVEL 2 2,0,0,0.0,0.0 :c=1:6329d874;h=yes;l=583586556; LEVEL 4 VO9663 0000_000000000_0000008 583586556 mcout_cedar_phy_near_daikon_00_sntp /pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root 000F00000000000006BDD8D8 CDMS119630277100000 stkenmvr16a:/dev/rmt/tps0d0n:479000017059 2252920947 ============================ MINOS26 > dccp -P dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root http://www-stken.fnal.gov/enstore/tape_inventory/VO9663 shows this file at VO9663 CDMS119630277100000 583586556 0000_000000000_0000008 no /pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root MINOS26 > cd /local/scratch26/kreymer/DATA MINOS26 > dccp dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root TEST.data Submitted helpdesk ticket to MSS / dcache-stken Note that Software/MSS is no longer available. 
Date: Fri, 29 Feb 2008 09:34:49 -0600 (CST) Subject: HelpDesk ticket 111951 ___________________________________________ Short Description: Minos file fails to stage Problem Description: dcache-admin : Please reply to minos-data. The following file fails to stage to DCache. We have a user trying to access it : ./dc_stat /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n110 11026_0000_L010185N_D00.sntp.cedar_phy.root ============================ PNFS status for /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n110 11026_0000_L010185N_D00.sntp.cedar_phy.root -rw-r--r-- 1 rubin e875 583586556 Nov 28 20:19 n11011026_0000_L010185N_D00.sntp.cedar_phy.root LEVEL 2 LEVEL 2 2,0,0,0.0,0.0 :c=1:6329d874;h=yes;l=583586556; LEVEL 4 VO9663 0000_000000000_0000008 583586556 mcout_cedar_phy_near_daikon_00_sntp /pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_ data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root 000F00000000000006BDD8D8 CDMS119630277100000 stkenmvr16a:/dev/rmt/tps0d0n:479000017059 2252920947 ============================ ___________________________________________ This ticket is assigned to NAYMOLA, STAN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Mon, 03 Mar 2008 14:37:50 -0600 (CST) This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. ___________________________________________ Date: Wed, 05 Mar 2008 13:42:19 -0600 From: Vladimir Podstavkov The last three files are on LT03 tape, and there is a backlog of 2000 transfers to/from LTOs. Nine drives are used by CMS, two others - by database backups. I would say - the problem requires some action on the enstore side. There is nothing we can do on dcache side. This ticket can be closed. ___________________________________________ Date: Mon, 03 Mar 2008 15:47:41 +0900 From: Howard Rubin Ticket #: 112020 ___________________________________________ Short Description: Unable to read some files Problem Description: Here is a list of files I cannot read from dcache using dccp and srmcp: ... ============================================================================= 2008 02 28 ############ # SADDRECO # ############ Interactive testing of sets ./saddreco.new -d near -r cedar -p 2007-11 --verify sampy import ... os.chdir('/pnfs/minos/reco_near/cedar/sntp_data/2007-11') ... stuff from candfiles ... ######## # MAIL # ######## N.B. kerio mail server / versus outlook ####### # AFS # ####### Predator bailed at 01:06 UTC ( 19:06 CST ), /tmp/filedxZ2eC: line 614: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: Connection timed out Updated ticket : Date: Thu, 28 Feb 2008 19:33:33 +0000 (UTC) Subject: Re: HelpDesk ticket 107032 has additional info. Per Ray's question about pinpointing which areas are being accessed. Generally, the /var/log/messages do not tell me which path is being accessed. Just the IP address of the server which has timed out. I have put a summary of our various scans on the web at http://www-numi.fnal.gov/computing/afs.txt I will add specific file information when available. 
The most recent timeout is interesting, as it involved two hosts, and one of my scripts failed at exactly this time, giving me the path to a file being accessed : Feb 27 05:28:11 minos22 kernel: afs: Lost contact with file server 131.225.68.7 Feb 27 05:30:10 minos22 kernel: afs: file server 131.225.68.7 is back up Feb 27 19:06:35 minos26 kernel: afs: Lost contact with file server 131.225.68.6 /tmp/filedxZ2eC: line 614:/afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: Connection timed out Feb 27 19:07:45 minos26 kernel: afs: file server 131.225.68.6 is back up ####### # SAM # ####### 10 sec limit moveCachedFiles.py -c /var/tmp/kreymer/minosdev.cfg <2008/02/28 08:56:31> Total processing time in seconds: 11 <2008/02/28 08:56:31> Number of records in cached_files. Starting: 553940 Ending: 543977 <2008/02/28 08:56:31> Difference between start and end: 9963 600 sec limit <2008/02/28 09:06:25> Total cached_file records copied: 543116 deleted: 0 <2008/02/28 09:06:25> Total cache_file_project_usages records copied: 578100 deleted: 0 <2008/02/28 09:06:25> Total processing time in seconds: 483 <2008/02/28 09:06:25> Number of records in cached_files. Starting: 543977 Ending: 861 <2008/02/28 09:06:25> Difference between start and end: 543116 ============================================================================= 2008 02 27 ######### # ADMIN # ######### Testing direct page web access, at http://csdserver.fnal.gov/arsys/forms/csdserver/DirectContact/SSAView/?mode=create Date: Wed, 27 Feb 2008 11:35:43 -0600 (CST) Subject: HelpDesk ticket 111830 ___________________________________________ Short Description: WEB MSS PH 630-840-4261 FN ARTHUR This is a test. This is only a test. Problem Description: This is a test. This is only a test. ___________________________________________ ####### # SAM # ####### Date: Wed, 27 Feb 2008 09:07:13 -0600 From: Stephen P. White Subject: Cached_files kreymer@minos-sam01 mkdir /minos/scratch/kreymer/sam cd /minos/scratch/kreymer/sam Getting moveCachedFiles.py from http://cdcvs.fnal.gov/cgi-bin/fnal-only/cvsweb.cgi/sam_maintenance_tools/General/PurgeCachedFiles/ CVSROOT=cvsuser@cdcvs.fnal.gov:/cvs/cd # set the repository unset CVS_RSH # remove ssh used by Minos cvs -d ${CVSROOT co sam_maintenance_tools/General/PurgeCachedFiles cd sam_maintenance_tools/General/PurgeCachedFiles The config file needs to be on local disk, and protected, as it contains database passwords mkdir /var/tmp/kreymer cp example_moveCachedFiles.cfg /var/tmp/kreymer/minosdev.cfg chmod 700 /var/tmp/kreymer/minosdev.cfg nedit /var/tmp/kreymer/minosdev.cfg unset SETUP_UPS UPS_DIR . ~sam/setups.sh setup sam_python v2_4_4 setup cx_Oracle v4_3_3_py2_4_4 moveCachedFiles.py -h [MCF] max_seconds = 3600 # The maximum number of seconds this appliction is to run. log_screen = false # Prints log data to the screen if True. log_path = path # Prints log data to a file at this directory. Set to "" # for no log file. database = dbname # Oracle database to connect to username = uname # Account to use for connection password = paswd # Password to use to obtain connection Added -n NOOP option for a preview run moveCachedFiles.py -n -c /var/tmp/kreymer/minosdev.cfg database connection times out OPW=... setup oracle_client v10_1_0_3_0 sqlplus samdbs/${OPW}@minosdev That failed also, try a newer oracle_tnsnames than 42, setup oracle_tnsnames v46 Now sqlplus works NOOP true <2008/02/27 11:58:37> Max run time is set to 30 seconds. 
<2008/02/27 11:58:37> Establishing database connection as samdbs/xxxx@minosdev <2008/02/27 11:58:38> Obtaining number of records in cached_files.... <2008/02/27 11:58:38> Number of records in cached_files: 429983 <2008/02/27 11:58:40> Selected 409 cached_file_id records in 0 seconds. <2008/02/27 11:58:40> for minDate: 10/23/2005 maxDate: 10/24/2005 (min/max endTime seconds: 1) <2008/02/27 11:58:40> Error: - local variable 'copyCnt' referenced before assignment <2008/02/27 11:58:40> Total cached_file records copied: 0 deleted: 0 <2008/02/27 11:58:40> Total cache_file_project_usages records copied: 0 deleted: 0 <2008/02/27 11:58:40> Total processing time in seconds: 1 <2008/02/27 11:58:40> Number of records in cached_files. Starting: 429983 Ending: 429983 <2008/02/27 11:58:40> Difference between start and end: 0 Checked connections with cd ~/minos/oracle export TOPDB_CONN=monitor/... ./topdb minosdev This was useful for diagnostics, reverted to the original script for production running. moveCachedFiles.py -c /var/tmp/kreymer/minosdev.cfg Repeated a few times, with 100 second time limit. Did the same for int, took only 2 passes ( 190K files ) Will do production tomorrow. Ran test project in production, success ! DB BEFORE AFTER TIME dev 429973 1160 306 sec int 191907 3060 150 sec prd 553079 861 493 sec ####### # AFS # ####### for NODE in ${NODES} ; do ssh -ax ${NODE} 'grep afs: /var/log/messages | grep "Feb " | grep -v Tokens | uniq'; done | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' Feb 24 11:50:43 minos05 kernel: afs: Lost contact with file server 131.225.68.65 Feb 24 11:53:33 minos05 kernel: afs: file server 131.225.68.65 is back up Feb 25 12:11:05 minos09 kernel: afs: Waiting for busy volume 1685714815 Feb 25 19:21:49 minos19 kernel: afs: Lost contact with file server 131.225.68.49 Feb 25 19:21:50 minos19 kernel: afs: failed to store file Feb 25 19:21:51 minos19 kernel: afs: failed to store file Feb 25 19:22:08 minos19 kernel: afs: file server 131.225.68.49 is back up Feb 25 19:21:08 minos22 kernel: afs: Lost contact with file server 131.225.68.49 Feb 25 19:21:09 minos22 kernel: afs: failed to store file Feb 25 19:21:59 minos22 kernel: afs: file server 131.225.68.49 is back up Feb 27 05:28:11 minos22 kernel: afs: Lost contact with file server 131.225.68.7 Feb 27 05:30:10 minos22 kernel: afs: file server 131.225.68.7 is back up Feb 27 19:06:35 minos08 kernel: afs: Lost contact with file server 131.225.68.6 Feb 27 19:06:50 minos08 kernel: afs: file server 131.225.68.6 is back up Feb 27 19:06:35 minos26 kernel: afs: Lost contact with file server 131.225.68.6 /tmp/filedxZ2eC: line 614: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: Connection timed out Feb 27 19:07:45 minos26 kernel: afs: file server 131.225.68.6 is back up ============================================================================= 2008 02 26 ######## # GRID # ######## Grid school this morning and afternoon. 
FermiGrid 201 - Scripting and running grid jobs [ Steve Timm ] Four Minos people ( kreymer, loiacoli, rustem, urish ) Two ( kreymer, urish ) at FermiGrid 202 - Grid Storage Access ============================================================================= 2008 02 25 ######## # GRID # ######## Date: Mon, 25 Feb 2008 18:10:03 -0600 (CST) Subject: HelpDesk ticket 111723 ___________________________________________ Short Description: Minos users needing login to worker nodes Problem Description: As requested, here is a list of Minos users who should have interactive login access to FermiGrid worker nodes : asousa bckhouse kreymer mstrait nwest rhatcher rubin scavan ___________________________________________ ######### # ADMIN # ######### Per Helpdesk/NGOP program, the support levels available are : 24by7 ( commonly called 24x7 ) 8to00by7 8to17by7 6to22by7 ( obsolete ) 830to1630by7 ( obsolete ) 8to17by5 ( commonly called 8x5, incorrectly, should be 9x5 ) zero ######### # ADMIN # ######### minos08 - still no rsh or telnet access no sar since Dec 12 telnetd is running to jonest <-- # @@@ Enter Update below this line. @@@ # --> The system has been behaving normally since Friday. So there is no immedate need for further action. Some residual issues remain unexplained : 1) I cannot log in via rsh or telnet 2) SAR is not running. 3) There is a telnetd process running, unlike rest of the cluster. <-- # @@@ Enter Update above this line. @@@ # --> Oops, as of 17:20 or so, no longer can ssh to minos08 ####### # AFS # ####### Date: Mon, 25 Feb 2008 14:31:16 -0600 (CST) Subject: Your ticket 107032 has been reassigned to PASETES, RAY ########### # WEATHER # ########### Date: Mon, 25 Feb 2008 13:45:58 -0600 From: Fermilab Today To: allhands@fnal.gov Subject: All Hands - weather alert To: All Hands From: Bruce Chrisman, Chief Operating Officer Winter storm A severe winter storm is predicted to move across the Chicago area this evening and into Tuesday. Please be careful as you walk to your car and drive home. Despite the weather, Fermilab will very likely remain open. If Fermilab closes, a note will be posted on the home page: http://www.fnal.gov, and recorded on the Fermilab inclement weather hotline: (630) 840-5995. The National Weather Service Web site, http://www.noaa.gov, is the best source for the latest weather information. And remember, whatever the weather, please drive carefully and walk with caution. ########## # CONDOR # ########## Note that glidein jobs run under account ID uid=7927(minos) gid=5111(numi) groups=5111(numi) This account seems not to exist anywhere else. So if directories are created in grid jobs, and are not group writeable, they will be hard to deal with. Perhaps we need a 7927 account somewhere, like minos01 and minos26. ######### # ADMIN # ######### Account for Ruth Toner Submitted via http://computing.fnal.gov/cd/forms/requirements_offsite_new.html Process outlined in http://computing.fnal.gov/cd/forms/requirements.html Approval described at http://computing.fnal.gov/cd/forms/offsite_instructions.html Approval should come from some in the minos-approved mail list. sent to listserv : review minos-approved There is no MINOS-APPROVED list on this server. Try sending a "LIST" command Per helpdesk printout, these are wojcicki,plunk,ayres,buckley,rameika N.B. - 2008 02 28 /afs/fnal/files/home/room1/rtoner dates from 2005. N.B. 
- 2008 03 27 - she is able to log in now, I have done MINOS01 > pts adduser -user rtoner -group minos MINOS01 > pts membership minos | grep toner rtoner ####### # AFS # ####### for NODE in ${NODES} ; do ssh -ax ${NODE} 'grep afs: /var/log/messages | grep "Feb " | grep -v Tokens | uniq'; done | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' Feb 24 11:50:43 minos05 kernel: afs: Lost contact with file server 131.225.68.65 Feb 24 11:53:33 minos05 kernel: afs: file server 131.225.68.65 is back up ############ # MCIMPORT # ############ minos26 disk has filled quickly Data is coming into mindata from mtavera cd /local/scratch26/mindata/ ls mtavera/ | wc -l 450 10:44 cp -vax mtavera /minos/data/mcimport/mtavera time diff -r mtavera /minos/data/mcimport/mtavera oops, several l*.log and n*.log files have come in. Waited for this to drain. Last change around 14:39, as of 17:13. rsync -n -r mtavera/ /minos/data/mcimport/mtavera --perms --times --size-only -v time rsync -r mtavera/ /minos/data/mcimport/mtavera --perms --times --size-only -v sent 18262419838 bytes received 4060 bytes 10651749.14 bytes/sec total size is 135283318592 speedup is 7.41 real 28m34.211s user 3m19.541s sys 2m7.979s MINOS26 > du -sm /minos/data/mcimport/mtavera 129020 /minos/data/mcimport/mtavera mv mtavera MOVED/mtavera ln -s /minos/data/mcimport/mtavera /local/scratch26/mindata/mtavera 17:44 time diff -r MOVED/mtavera /minos/data/mcimport/mtavera real 391m32.443s user 3m56.464s sys 5m24.418s finished around 01:00 rsync -n -r MOVED/mtavera/ /minos/data/mcimport/mtavera --perms --times --size-only -v building file list ... done sent 62097 bytes received 20 bytes 1656.45 bytes/sec total size is 135283318592 speedup is 2177879.14 cd /minos/data/mcimport/mtavera mv NOIMPORT MCIMPORT ########### # ROUNDUP # ########### roundup.20080225 - changed limit test on ROUNTMP to ignore data, which now resides in /minos/data cp -a AFSS/roundup.20080225 . ln -sf roundup.20080225 roundup ########### # ROUNDUP # ########### stalled due to full disk space ? LOGS/2008-02/cedar_phy_bhcurvmcnear.log OK - stream L010185N_D04.cand.cedar_phy_bhcurv OK - 1298792 Mbytes in 79 runs OOPS - Stream size 1298792 too big for free space 381587 - 10000 This is (unnecessarily) monitoring ROUNTMP=/export/stage/minfarm/ROUNDUP df -m $ROUNTMP | tail -1 | tr -s ' ' | cut -f 4 -d ' ' But the problem is the 1.3 TByte reported size. This went from Wed Feb 20 13:57:18 CST 2008 OK - stream L010185N_D04.cand.cedar_phy_bhcurv OK - 170122 Mbytes in 11 runs to Mon Feb 25 02:21:20 CST 2008 OK - processing 7973 files OK - stream L010185N_D04.cand.cedar_phy_bhcurv OK - 1298792 Mbytes in 79 runs farmgsum shows mcnearcat 2271 1298792 cand.cedar_phy_bhcurv.root SRV1> ls -ltr /minos/data/minfarm/mcnearcat | wc -l 10446 SRV1> ls -ltr /minos/data/minfarm/mcnearcat | grep mrnt | wc -l 5323 SRV1> ls -ltr /minos/data/minfarm/mcnearcat | grep cand | wc -l 2271 cand's are almost all c_p_b, 500 MBytes in size. 
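For reference, a minimal sketch of that limit test, assuming illustrative names for everything except ROUNTMP (this is not the script's actual code):

ROUNTMP=/export/stage/minfarm/ROUNDUP
FREEMB=`df -m ${ROUNTMP} | tail -1 | tr -s ' ' | cut -f 4 -d ' '`   # free MBytes on the staging disk
MARGIN=10000                                                        # headroom held back, in MBytes
STREAMMB=1298792                                                    # size of the pending stream, in MBytes
if [ ${STREAMMB} -gt $(( FREEMB - MARGIN )) ] ; then
    echo "OOPS - Stream size ${STREAMMB} too big for free space ${FREEMB} - ${MARGIN}"
fi

The guard itself is sound; the real trouble is the 1.3 TByte stream size, and as noted in the roundup.20080225 entry above, the check against ROUNTMP is no longer needed now that the data resides in /minos/data.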
============================================================================= 2008 02 24 ############ # SADDRECO # ############ Speed up file scanning, test with minfarm@fnpcsrv1 prepare per HOWTO.saddreco, cd to /afs/...scripts ./saddreco.new -d near -r cedar -p 2007-11 --verify DETECT near RELEASE cedar MONTH 2007-11 BAIL 999999 STARTED Sun Feb 24 22:25:18 2008 saddreco 2007117 Declaring to SAM prd near cedar 2007-11 verify Needed /pnfs/minos/reco_near/cedar/cand_data/2007-11 Treating 291 files in /pnfs/minos/reco_near/cedar/cand_data/2007-11 Needed /pnfs/minos/reco_near/cedar/sntp_data/2007-11 Treating 25 files in /pnfs/minos/reco_near/cedar/sntp_data/2007-11 STARTED Sun Feb 24 22:25:18 2008 FINISHED Sun Feb 24 22:25:44 2008 SAMDIM="FULL_PATH /pnfs/minos/reco_near/cedar/cand_data/2007-11" time sam list files --summaryOnly --dim="${SAMDIM}" File Count: 502 Average File Size: 182.21MB Total File Size: 89.33GB Total Event Count: 50718886 real 0m1.596s user 0m1.114s sys 0m0.136s SAMDIM=" VERSION cedar and DATA_TIER cand-near and PHYSICAL_DATASTREAM_NAME spill and FULL_PATH /pnfs/minos/reco_near/cedar/cand_data/2007-11 " File Count: 211 Average File Size: 289.66MB Total File Size: 59.69GB Total Event Count: 21504842 real 0m1.476s user 0m1.074s sys 0m0.145s Initial conclusions We get a 20 fold speedup ( 26" -> 1.5" ) by doing sam list rather than sam locate. It does not speed things up to select the version and data stream. Use examples of code from samlocate or saddcache. .... 2008 02 26 ... testing saddreco.new 2.2 sec including sam.translateconstraints 5.8 sec including stc and candfiles Longer future test will be ./saddreco.new -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N \ --verify ######## # GRID # ######## Date: Sun, 24 Feb 2008 14:48:45 -0600 (CST) Subject: HelpDesk ticket 111653 ___________________________________________ Short Description: loiacono account on fnpcsrv1 and fngp-osg Problem Description: Please create a loiacono account for this Minos user on fnpcsrv1 and fngp-osg, so that she can test direct grid submission of select Minos analysis jobs. We have been trying to run these jobs via glideinWMS from minos25, on ISMINOSAFS nodes, but have been suffering from timeouts. We need to find out whether the fault lies with the glidein mechanism, or with the jobs themselves. ___________________________________________ ######## # FARM # ######## Sun Feb 24 14:58:29 CST 2008 mv NOCAT NOCAT.ok ============================================================================= 2008 02 22 ########### # MINOS08 # ########### Date: Fri, 22 Feb 2008 15:34:40 -0600 (CST) Subject: HelpDesk ticket 111626 ___________________________________________ Short Description: minos08 logins failing, system in distress Problem Description: run2-sys : We can no longer log into minos08 via ssh. A connection is made, but never progresses to an interactive session. Ganglia monitoring at http://rexganglia2.fnal.gov/minos/?c=MINOS%20Cluster&h=minos08.fnal.gov&m=& r=day&s=descending&hc=4 shows a heavy wait-I/O load starting soon after 14:00 . If you can get to the console, please check for system level problems. ___________________________________________ Date: Fri, 22 Feb 2008 15:39:11 -0600 (CST) This ticket has been reassigned to JONES, TERRY of the CD-SF/FEF Group. ___________________________________________ Date: Fri, 22 Feb 2008 15:54:52 -0600 (CST) Note To Requester: jonest@fnal.gov sent this Notes To Requester: I have been able to use ssh to login. 
> "load average: 3.43, 3.44, 4.25" ___________________________________________ Note To Requester: jonest@fnal.gov sent this Notes To Requester: > the files in /var/log/sa are from December. _________________________________________________________________ MINOS08 > sudo /etc/init.d/condor start Starting up Condor MINOS08 > date Fri Feb 22 16:01:37 CST 2008 <-- # @@@ Enter Update below this line. @@@ # --> Kerberized rsh connections to minos08 are immediately rejected. They work for the rest of the Minos Cluster . MIN > rsh -N -X minos01 pwd /afs/fnal.gov/files/home/room1/kreymer MIN > rsh -N -X minos08 pwd minos08.fnal.gov: Connection refused trying normal rsh (/usr/bin/rsh) WARNING: NO ENCRYPTION! rsh: invalid option -- N usage: rsh [-nd] [-l login] host [command] <-- # @@@ Enter Update above this line. @@@ # --> rsh -N -X minos08 'pwd' minos08.fnal.gov: Connection refused <-- # @@@ Enter Update below this line. @@@ # --> Thanks, the system seems to have recovered. There are still odd problems. I guess we can wait till next week to investigate. The SAR data seems to stop after 12 February MINOS08 > ls /var/log/sa sa05 sa06 sa07 sa08 sa09 sa10 sa11 sa12 sa13 sar04 sar05 sar06 sar07 sar08 sar09 sar10 sar11 sar12 condor was not running ( I have restarted it. ) I still cannot log in via kerberized rsh. <-- # @@@ Enter Update above this line. @@@ # --> 2008 02 25 ########## # PARROT # ########## Running tests on fnpc132, 32 bit kernel, Intel 2 GB memory, Intel(R) Xeon(TM) CPU 3.06GHz For the first time I see all the root .so files being opened. VERS proxy dcache 242 n n CPU bound, killed after 10 minutes 240 n n OK, quick ( 12 sec ) 240 n y NO, cpu bound, killed after 1 minute, same parrot cur n y OK, quick cur n y NO, CPU bound on rerun, killed after 1 minute cur n n OK, quick Mysteries why can loon not be run twice why does parrot sometimes hang up compute-bound why does -d remote list detailed files, on fnpc132, Linux fnpc132.fnal.gov 2.6.9-42.0.10.ELsmp #1 SMP Tue Feb 27 08:38:56 CST 2007 i686 i686 i386 GNU/Linux not fngp-osg Linux fngp-osg.fnal.gov 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 26 11:21:10 CDT 2007 i686 i686 i386 GNU/Linux ########## # PARROT # ########## cat > /minos/scratch/kreymer/log/parrot/242prdc.0221 top - 10:59:52 up 200 days, 1:45, 1 user, load average: 15.80, 16.11, 19.51 Tasks: 581 total, 18 running, 538 sleeping, 24 stopped, 1 zombie Cpu(s): 71.6% us, 27.6% sy, 0.7% ni, 0.0% id, 0.0% wa, 0.1% hi, 0.0% si Mem: 6229924k total, 6061752k used, 168172k free, 112288k buffers Swap: 17583132k total, 77904k used, 17505228k free, 3110160k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4157 kreymer 25 0 532m 425m 233m R 27 7.0 15243:54 parrot 29530 kreymer 25 0 526m 516m 302m R 25 8.5 245:40.30 parrot 8712 kreymer 25 0 669m 640m 304m R 23 10.5 250:15.06 parrot 20557 kreymer 25 0 669m 586m 229m R 22 9.6 252:20.28 parrot 4355 kreymer 17 0 2528 1396 860 R 0 0.0 0:00.37 top 4159 kreymer 16 0 7076 1592 1244 T 0 0.0 0:00.12 bash 8713 kreymer 16 0 6752 1520 1244 T 0 0.0 0:00.05 bash 14149 kreymer 16 0 141m 82m 40m T 0 1.4 0:03.39 129f324e 15289 kreymer 16 0 5276 1208 992 T 0 0.0 0:00.01 sh 15292 kreymer 16 0 5276 564 348 T 0 0.0 0:00.00 sh 15293 kreymer 15 0 4608 452 376 T 0 0.0 0:00.00 cat 15294 kreymer 15 0 107m 1944 1212 T 0 0.0 0:00.01 129f324e 18961 kreymer 16 0 141m 82m 40m T 0 1.4 0:03.36 129f324e 20153 kreymer 16 0 5276 1208 992 T 0 0.0 0:00.01 sh 20158 kreymer 17 0 5276 564 348 T 0 0.0 0:00.00 sh 20159 kreymer 15 0 4608 452 376 T 0 0.0 0:00.00 cat 20160 kreymer 15 0 
107m 1944 1212 T 0 0.0 0:00.00 129f324e 20558 kreymer 16 0 7400 1596 1244 T 0 0.0 0:00.08 bash 22637 kreymer 15 0 141m 77m 35m T 0 1.3 0:03.36 129f324e 22984 kreymer 16 0 5276 1208 992 T 0 0.0 0:00.01 sh 22987 kreymer 17 0 5276 564 348 T 0 0.0 0:00.00 sh 22988 kreymer 15 0 4608 452 376 T 0 0.0 0:00.00 cat 22989 kreymer 15 0 107m 1608 876 T 0 0.0 0:00.02 129f324e 23904 kreymer 21 0 141m 66m 24m T 0 1.1 0:03.19 129f324e 23928 kreymer 19 0 5276 1208 992 T 0 0.0 0:00.01 sh 23932 kreymer 20 0 5276 564 348 T 0 0.0 0:00.00 sh 23933 kreymer 15 0 4608 452 376 T 0 0.0 0:00.00 cat 23934 kreymer 17 0 107m 1464 732 T 0 0.0 0:00.01 129f324e 29532 kreymer 16 0 6344 1596 1240 T 0 0.0 0:00.09 bash killed off the parrots, load dropped to under 2. ============================================================================= 2008 02 21 ########## # PARROT # ########## Attempting to reestablish parrot tests, and use HTTP_PROXY="squid.fnal.gov:3128" -current- and DCache hanging up : P> loon -bq firstlast.C ${DFILE} Warning in : class timespec already in TClassTable Tested VERS proxy dcache 240 y y Failed open cur y y OK 242 ############ # MCIMPORT # ############ n13037306_0006_L010185N_D04.reroot.root (No such device or address) /pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_04/L010185N/730/n13037306_0006_L010185N_D04.reroot.root ls -l /pnfs/minos/mcin_data/near/daikon_04/L010185N/730/n13037306_0006_L010185N_D04.reroot.root -rw-r--r-- 1 kreymer e875 0 Feb 21 11:29 /pnfs/minos/mcin_data/near/daikon_04/L010185N/730/n13037306_0006_L010185N_D04.reroot.root rm /pnfs/minos/mcin_data/near/daikon_04/L010185N/730/n13037306_0006_L010185N_D04.reroot.root The timing, 11:29, matches the glitch on Condor submission. ######## # DATA # ######## Restarted crontab.dat for kreymer@minos26 mindata@minos26 Holding of on farm concatenation for a little while. ######## # DATA # ######## Date: Thu, 21 Feb 2008 11:30:01 -0600 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron ${HOME}/minos/scripts/condorglide /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorglide: line 12: cd: /minos/scratch/kreymer/condor/probe: Not a directory find: logs/glideafs: No such file or directory subsequent jobs look OK. MIN > for NODE in $NODES ; do printf "${NODE} " ; ssh -ax ${NODE} 'du -sm /minos/scratch/kreymer/condor/probe' ; done minos01 17 /minos/scratch/kreymer/condor/probe minos02 17 /minos/scratch/kreymer/condor/probe ... minos25 17 /minos/scratch/kreymer/condor/probe minos26 17 /minos/scratch/kreymer/condor/probe ######### # DOCDB # ######### Date: Thu, 21 Feb 2008 09:43:03 -0600 (CST) Subject: HelpDesk ticket 111541 Short Description: DodDB is offline Problem Description: The DocDB system seems to be offline. This affects both the Minos DocDB area and the public areas . The problem may have started late last night, before midnight. http://cd-docdb.fnal.gov/cgi-bin/DocumentDatabase/ Forbidden You don't have permission to access /cgi-bin/DocumentDatabase/ on this server. Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request. Apache/2.0.46 (Scientific Linux) Server at cd-docdb.fnal.gov Port 80 ___________________________________________ Date: Thu, 21 Feb 2008 10:03:39 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. ___________________________________________________________________ Date: Thu, 21 Feb 2008 11:47:22 -0600 (CST) Solution: NAS (the file server that supplies files to docdb server) had outage. It is back online. 
Docdb is back online. Try your access to docdb server now. ___________________________________________________________________ ####### # SAM # ####### 09:30 - restarted prd dbserver v8_4_3 False start earlier, had installed all secondary products, but not upd install -j sam_db_srv_pkg v8_4_3 # FARM # roundup -c -r cedar_phy_bhcurv mcnear still running, since Wed Feb 20 16:54:22 CST 2008 WRITING to DCache 329 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/730 /home/minfarm/scripts/roundup: line 801: dccp: command not found SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp \ file:///n13037299_0002_L010185N_D04.cand.cedar_phy_bhcurv.root \ /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/729 Interestingly, these copies are continuing successfully, right through the BlueArc outages and PNFS maintenance. ============================================================================= 2008 02 20 ########## # PARROT # ########## HOWTO.parrot added INSTALL section. Tested with VER=2_4_2 Tested using current, runs loon OK Tested setting setenv HTTP_PROXY "squid.fnal.gov:8080" setenv HTTP_PROXY "squid.fnal.gov:80" setenv HTTP_PROXY "squid.fnal.gov:" This seems to have no effect, set before or after running parrot Connected to older version, export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 This tries the squid, and fails for port 8080, 1203552033.342214 [8799] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1203552033.342287 [8799] parrot: http: connect squid.fnal.gov port 8080 1203552033.345079 [8799] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfsdir HTTP/1.0 Host: squid.fnal.gov 1203552033.346009 [8799] parrot: http: HTTP/1.1 302 Found 1203552033.346029 [8799] parrot: http: Date: Thu, 21 Feb 2008 00:00:33 GMT 1203552033.346038 [8799] parrot: http: Server: Apache/2.2.3 (Unix) mod_ssl/2.2.3 OpenSSL/0.9.7d mod_python/3.2.6 Python/2.3.5 mod_jk/1.2.18 PHP/4.4.2 ... 1203552033.349700 [8799] parrot: http: error: server gave 302 redirect from https://squid.fnal.gov:8443/computing/d199//.growfsdir back to the same url! 
for ports 80, and null, FNGP-OSG > parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash FNGP-OSG > ls -d /afs/fnal.gov/files/code/e875/general/minossoft 1203551898.085693 [4300] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1203551898.085911 [4300] parrot: http: connect squid.fnal.gov port 80 1203551898.087505 [4300] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199 1203551898.087625 [4300] parrot: http: connect squid.fnal.gov port 80 1203551898.088621 [4300] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing 1203551898.088716 [4300] parrot: http: connect squid.fnal.gov port 80 1203551898.089746 [4300] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80 1203551898.089842 [4300] parrot: http: connect squid.fnal.gov port 80 ls: /afs/fnal.gov/files/code/e875/general/minossoft: No such file or directory ######## # DATA # ######## Preparing for PNFS/DCache maintenance 21 Feb kreymer@minos26 echo "crontab -r" | at 05:30 Feb 21 mindata@minos26 echo "crontab -r" | at 01:00 Feb 21 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 Feb 21 ============================================================================= 2008 02 19 ####### # DAQ # ####### Noted that there have been no Near DCS files archived since Feb 4, when the minos-mysql1 database server was restarted. MINOS26 > ls -l /pnfs/minos/near_dcs_data/2008-02 ... -rw-r--r-- 1 buckley e875 477434 Feb 3 19:29 N080203_000002.mdcs.root -rw-r--r-- 1 buckley e875 357868 Feb 4 17:52 N080204_000002.mdcs.root less 2008-02-04.daq.log RCS E Mon 4-02-2008 12:22:03 rcServer 8523 131.225.192.134 5859 31631 run 13591 Socket error on dcsdcp-nd.fnal.gov 9089: Read EOF: Success RCS E Mon 4-02-2008 12:22:08 rcServer 8523 131.225.192.134 5860 31632 run 13591 Connect to dcsdcp-nd.fnal.gov:9089 failed: Connection refused RCS N Mon 4-02-2008 12:22:19 rcServer 8523 131.225.192.134 5861 31633 run 13591 Connected to node DCS(100) on dcsdcp-nd.fnal.gov 9089 RCS N Mon 4-02-2008 12:22:19 rcServer 8523 131.225.192.134 5862 31634 run 13591 Binding socket to DCS(100) N.B. - these showed up in the morning Predator run 2008 02 20 ######## # FARM # ######## > The nue group is interested in running over the HE ND data sample. However, > it looks like only half the RunII HE ND data has been processed with > cedar_phy_bhcurv. Looking at pnfs, only half of the HE ND sntp files for > 2006-06, 2006-07, 2006-08 are there. Would it be possible to get these > missing sntp files? 
MINOS26 > find /pnfs/minos/mcin_data/near/daikon_04/L250200N -type f | wc -l 4638 MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L250200N/cand_data -type f | wc -l 4515 MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L250200N/sntp_data -type f | wc -l 387 SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L250200N and VERSION cedar.phy.bhcurv " sam list files --dim="${SAMDIM}" File Count: 387 Average File Size: 1.25GB Total File Size: 484.43GB Total Event Count: 3582400 SFILES=`sam list files --dim="${SAMDIM}" --nosummary` printf "${SFILES}\n" | wc -l 387 for FILE in ${SFILES} ; do sam get metadata --file=${FILE} \ | grep parents \ | tr "'" \\\n \ | grep root done | wc -l 4478 OOPS, this was irrelevant, the question was about ND data, not MC, 805 raw data files in /pnfs/minos/neardet_data/2006-07 117 candidates in /pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2006-07 632 candidates in /pnfs/minos/reco_near/cedar_phy/cand_data/2006-07 Greg sent a list of missing files . QFILES=`cat ../qfiles` for FILE in ${QFILES} ; do FRU=`echo ${FILE} | cut -f 1 -d _` FTA=`echo ${FILE} | cut -f 2- -d .` SUB=`echo ${FILE} | cut -f 2 -d _ | cut -f 1 -d .` SAMDIM="FILE_NAME ${FRU}_%.${FTA} and PARENT_BY_NAME ${FRU}_${SUB}.mdaq.root" printf "\n${FILE}\n" sam list files --nosummary --dim="${SAMDIM}" done ... all these are concatenated, except N00010583_0000.spill.sntp.cedar_phy_bhcurv.0.root N00010583_0010.spill.sntp.cedar_phy_bhcurv.0.root N00010583_0012.spill.sntp.cedar_phy_bhcurv.0.root N00010583_0014.spill.sntp.cedar_phy_bhcurv.0.root N00010583_0023.spill.sntp.cedar_phy_bhcurv.0.root ls /minos/data/minfarm/nearcat/N00010583*.spill.sntp.cedar_phy_bhcurv.0.root ####### # DAQ # ####### Habig requested access to minossrv-nd I do not seem to have access ( minos, root, kreymer ) MIN > ssh minossrv-nd Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). 
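A quick non-interactive probe of that sort of access (a sketch only; the host list is illustrative, and BatchMode simply keeps ssh from prompting):

for H in minossrv-nd dcsdcp-nd.fnal.gov ; do
  printf "${H} "
  ssh -ax -o BatchMode=yes -o ConnectTimeout=10 ${H} hostname 2>&1 | tail -1
done
# prints the remote hostname on success, otherwise the Permission denied or timeout message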
############ # PREDATOR # ############ latest Near DCS file was N080204_000002.mdcs.root Tue Feb 5 11:13:43 UTC 2008 ####### # SAM # ####### received version information for sam_products v4_31 sam_station v6_0_5_23_srm -q GCC-3.1" saved in transit to minos/sam_products.table.v4_31 Referred to http://cdfkits.fnal.gov/CdfCode/source/Distribution/SAM/HOWTO.kits Or better yet, ~/minos/HOWTO.products version=v4_31 oversion=v4_30 samprod=sam_products FLVR=NULL Have to bootstrap, have not done this before for Minos upd install -j sam_products ${oversion} cd ${PRODUCTS}/../prd/${samprod} cp -ar ${oversion} ${version} cd ${version}/${FLVR} ups declare ${samprod} ${version} -f ${FLVR} -r ${samprod}/${version}/${FLVR} -m ${samprod}.table # then edit ups/${samprod}.table as necessary # # You may for example go to a station running the new versions, # and run ./init_sam -n cdf ${oversion} # Then use ups list -K+ to see what's in use for changed products # this could be scripted cd ~/minos/scripts ./updadd ${FLVR} ${samprod} ${version} OK - adding sam_products v4_31 NULL -q OK - reporting space used 12 /afs/fnal.gov/files/code/e875/general/ups/prd/sam_products/v4_31/NULL OK - testing file permissions OK - no file permission problems OK - tar command is gtar 4 /var/tmp/sam_products.tar.gz OK - upd addproduct notice: Adding flags -O "public" error output of move_ups_dir: Authenticated kreymer@FNAL.GOV Account updadmin: authorization for kreymer@FNAL.GOV for execution of /usr/krb5/k5arc/scripts/upd successful Changing uid to updadmin (100) ####### # SAM # ####### Upgraded development dbserver to allow cleanup of old file history upd install -j sam_db_srv_pkg v8_4_3 # was v8_3_0, in server list upd install -j sam v8_2_2 # was v7_5_1 current ups declare -c sam v8_2_2 Startup failed, before I installed all the required products : MINOS-SAM02 > cat dbs__minos-sam02__dbs_dev/trace.15464 warning: Python C API version mismatch for module struct: This Python has API version 1012, module struct has version 1010. warning: Python C API version mismatch for module strop: This Python has API version 1012, module strop has version 1010. warning: Python C API version mismatch for module time: This Python has API version 1012, module time has version 1010. Traceback (most recent call last): File "/home/sam/products/db_server_base/v3_3_17/NULL/bin/DbListener.py", line 30, in ? import Monitor File "/home/sam/products/db_server_base/v3_3_17/NULL/lib/Monitor.py", line 73, in ? import EventPoster File "/home/sam/products/db_server_base/v3_3_17/NULL/lib/EventPoster.py", line 11, in ? import ConfigMgr File "/home/sam/products/db_server_base/v3_3_17/NULL/lib/ConfigMgr.py", line 49, in ? import Parameter File "/home/sam/products/db_server_base/v3_3_17/NULL/lib/Parameter.py", line 8, in ? import DbLog File "/home/sam/products/db_server_base/v3_3_17/NULL/lib/DbLog.py", line 4, in ? import os, os.path, re, exceptions File "/home/sam/products/python/v2_1/Linux+2.4/lib/python2.1/re.py", line 28, in ? from sre import * File "/home/sam/products/python/v2_1/Linux+2.4/lib/python2.1/sre.py", line 17, in ? import sre_compile File "/home/sam/products/python/v2_1/Linux+2.4/lib/python2.1/sre_compile.py", line 15, in ? 
assert _sre.MAGIC == MAGIC, "SRE module mismatch" AssertionError: SRE module mismatch Killed process: 15464 Pick up missing products, per ups list -l sam_db_srv_pkg v8_4_3 db_server_base_cx v1_8 -j') cx_Oracle v4_3_3_py2_4_4 -j') oracle_instant_client v10_2_0_3 -j') oracle_tnsnames v46 -j') sam_python v2_4_4 -j') sam_db_srv v8_4_3 -j') -q "${UPS_REQ_QUALIFIERS}" sam_server_pylib v8_4_1 -j') -q "${UPS_REQ_QUALIFIERS}" sam_common_pylib v8_4_2 -j') omniORB v4_1_1 -q GCC-3.4.3-PYTHON-2.4 -j') sam_idl_pylib v8_4 -j') HTMLgen v2_1 -j') -q "${UPS_REQ_QUALIFIERS}" sam_config -j') sam_ns_ior -j') -q "${UPS_REQ_QUALIFIERS}" sam_dimension_server_prototype v8_4_0 -j') -q "${UPS_REQ_QUALIFIERS}" sam_pnfs_srv v8_4_0 -j') encp v3_6g -j') ups list -aK+ db_server_base_cx ups list -aK+ cx_Oracle ups list -aK+ oracle_instant_client ups list -aK+ oracle_tnsnames ups list -aK+ sam_python ups list -aK+ sam_db_srv ups list -aK+ sam_server_pylib ups list -aK+ sam_common_pylib ups list -aK+ omniORB -q GCC-3.4.3-PYTHON-2.4 ups list -aK+ sam_idl_pylib ups list -aK+ HTMLgen ups list -aK+ sam_config ups list -aK+ sam_ns_ior ups list -aK+ sam_dimension_server_prototype ups list -aK+ sam_pnfs_srv ups list -aK+ encp Installed the needed products on minos-sam03 upd install -j db_server_base_cx v1_8 # was v1_4 upd install -j cx_Oracle v4_3_3_py2_4_4 upd install -j sam_db_srv v8_4_3 upd install -j sam_server_pylib v8_4_1 upd install -j sam_common_pylib v8_4_2 upd install -j omniORB v4_1_1 -q GCC-3.4.3-PYTHON-2.4 upd install -j sam_idl_pylib v8_4 upd install -j sam_dimension_server_prototype v8_4_0 upd install -j sam_pnfs_srv v8_4_0 Dev dbserver is restarted, passes its tests. int dbserver is restarted, passes its tests. Installed the needed products on minos-sam03 Created private/minos-sam01_server_list.txt.20080219 sam v8_1_6 is current at CDF sam v8_4_0 is current at D0 sam v8_2_2 is current in upd ... previously ... sam_db_srv_pkg v8_3_0 ( was sam_db_srv v7_6_1 ) sam_bootstrap v8_1_0 ( was v6_1_2, required for use of sam_db_srv_pkg ) sam_config v7_1_5 ( was v4_2_34 ) sam v8_2_0 ( was v7_6_5, on clients ) ============================================================================= 2008 02 18 ####### # AFS # ####### for NODE in ${NODES} ; do ssh -ax ${NODE} 'grep afs: /var/log/messages | grep -v Tokens | uniq'; done \ | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' done Feb 17 05:20:01 minos05 kernel: afs: Lost contact with file server 131.225.68.6 Feb 17 05:20:32 minos05 kernel: afs: file server 131.225.68.6 is back up Dec 12 17:08:58 minos08 kernel: afs: Lost contact with file server 131.225.68.7 Dec 12 17:09:09 minos08 kernel: afs: file server 131.225.68.7 is back up Feb 17 05:23:27 minos09 kernel: afs: Lost contact with file server 131.225.68.6 Feb 17 05:23:28 minos09 kernel: afs: Lost contact with file server 131.225.68.6 Feb 17 05:24:45 minos09 kernel: afs: file server 131.225.68.6 is back up ########## # CONDOR # ########## Date: Mon, 18 Feb 2008 12:04:59 -0600 From: Laura Loiacono As you know I am running on the new batch nodes. I had been running alot of jobs since yesterday and it appears that there is a problem with the connection between the node the job is running on and the node it was submitted from. This causes some of the jobs to take twice as long as the should. See log file messages below... 000 (35477.000.000) 02/18 05:43:23 Job submitted from host: <131.225.193.25:63984> ... 001 (35477.000.000) 02/18 06:12:03 Job executing on host: <131.225.166.118:62340> ... 
006 (35477.000.000) 02/18 06:12:11 Image size of job updated: 160400 ... 022 (35477.000.000) 02/18 07:22:58 Job disconnected, attempting to reconnect ??? Socket between submit and execute hosts closed unexpectedly ??? Trying to reconnect to vm2@18764@fnpc339.fnal.gov <131.225.166.118:62340> ... 024 (35477.000.000) 02/18 07:22:58 Job reconnection failed ??? Job disconnected too long: JobLeaseDuration (3600 seconds) expired ??? Can not reconnect to vm2@18764@fnpc339.fnal.gov, rescheduling job ... 001 (35477.000.000) 02/18 07:25:06 Job executing on host: <131.225.166.131:65494> ... 022 (35477.000.000) 02/18 08:36:56 Job disconnected, attempting to reconnect ??? Socket between submit and execute hosts closed unexpectedly ??? Trying to reconnect to vm2@8274@fnpc344.fnal.gov <131.225.166.131:65494> ... 024 (35477.000.000) 02/18 08:36:56 Job reconnection failed ??? Job disconnected too long: JobLeaseDuration (3600 seconds) expired ??? Can not reconnect to vm2@8274@fnpc344.fnal.gov, rescheduling job ... Checking that all nodes are involved. $ cd /minos/scratch/loiacono/condor $ grep 'Trying to reconnect' log.* | cut -f 3 -d @ | cut -f 1 -d . | sort -u fnpc339 fnpc340 fnpc341 fnpc342 fnpc343 fnpc344 fnpc345 fnpc346 ============================================================================= 2008 02 15 ######## # GRID # ######## Date: Fri, 15 Feb 2008 21:30:25 -0600 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron /usr/krb5/bin/kcron ${HOME}/minos/scripts/factproxy verify_recvd_packet: generated hash did not compare /usr/krb5/bin/kxlist: Matching credential not found while finding the credentials containing the private key and certificate in the credentials cache ####### # SAM # ####### sam_cpp_api with sam locate support , bug corrected. upd install -j sam_cpp_api v8_4_0_1 -q GCC-3.4.3 ############ # MCIMPORT # ############ corrected md5sum , needed to cd ${MCSTA} to find the recently moved FILE ln -sf mcimport.20080211 mcimport # was mcimport.20071120 ########## # CONDOR # ########## Updated glidein proxy, see 2008 02 13 entry ####### # WEB # ####### In dhmain.html, changed CDSystemStatus to cdsystemstatus This was not needed earlier this week. ######## # FARM # ######## The review below is a usefull warmup for the broader cleanup of *cat areas . Drivers many duplicates being reported ( based on READ SAM/READ files ) are these legit ? too many READ SAM/READ files need to purge these, base dup calculations on SAM stale READ SAM/READ files some may be hanging around after the concatenated files wer ######## # FARM # ######## Date: Thu, 14 Feb 2008 15:39:09 -0600 From: Howard Rubin There are a few files, all from F00040092_0005 2007-12 cedar in /minos/data/minfarm/fardet. These may have been from a repeated run. There are candidates on pnfs (cand and bcnd) but I can't track the ntuples all.sntp, spill.sntp, and spill.bntp because whatever might have been there has been cleared from farcat. Can you see if they've been concatenated? The same is true for N00012048_0017 2007-04 cedar_phy_bhcurv and, from 2007-12 cedar, N0001390_0009, 10, 11, 12, and 13. Again, all candidates have correspondences on pnfs, and I can't track the cosmic or spill ntuples. 
MINOS26 > dds /minos/data/minfarm/neardet total 1996088 drwxrwxr-x 2 rubin e875 6144 Jan 29 10:22 ./ drwxrwxr-x 31 10871 e875 4096 Feb 14 13:10 ../ -rw-rw-r-- 1 rubin e875 79565829 Nov 21 12:35 N00012048_0017.spill.cand.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 8601534 Nov 21 12:35 N00012048_0017.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 15696718 Nov 21 12:35 N00012048_0017.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 111781571 Dec 22 07:55 N00013190_0009.cosmic.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 29508780 Dec 22 07:55 N00013190_0009.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 243623537 Dec 22 07:57 N00013190_0009.spill.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 48111340 Dec 22 07:57 N00013190_0009.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 111378136 Dec 22 07:48 N00013190_0010.cosmic.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 29327780 Dec 22 07:48 N00013190_0010.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 224232407 Dec 22 07:49 N00013190_0010.spill.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 44408037 Dec 22 07:49 N00013190_0010.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 110519969 Dec 22 07:50 N00013190_0011.cosmic.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 29113746 Dec 22 07:50 N00013190_0011.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 249715245 Dec 22 07:51 N00013190_0011.spill.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 49711909 Dec 22 07:51 N00013190_0011.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 112480567 Dec 22 07:43 N00013190_0012.cosmic.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 29439690 Dec 22 07:44 N00013190_0012.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 220021097 Dec 22 07:46 N00013190_0012.spill.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 43400685 Dec 22 07:46 N00013190_0012.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 111656926 Dec 22 07:51 N00013190_0013.cosmic.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 29300465 Dec 22 07:51 N00013190_0013.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 112343659 Jan 3 14:54 N00013190_0017.cosmic.cand.cedar.0.root MINOS26 > dds /minos/data/minfarm/fardet total 463468 drwxrwxr-x 2 rubin e875 73728 Feb 14 16:55 ./ drwxrwxr-x 31 10871 e875 4096 Feb 14 13:10 ../ -rw-rw-r-- 1 rubin e875 138049103 Dec 10 19:51 F00037144_0012.all.cand.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 25245746 Dec 10 19:51 F00037144_0012.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 31015742 Dec 10 19:51 F00037144_0012.spill.bcnd.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 4678962 Dec 10 19:51 F00037144_0012.spill.bntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 20469882 Dec 10 19:51 F00037144_0012.spill.cand.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 2934022 Dec 10 19:51 F00037144_0012.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 3156006 Dec 10 19:51 F00037144_0012.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 132515788 Dec 22 06:28 F00040092_0005.all.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 23832557 Dec 22 06:28 F00040092_0005.all.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 48796440 Dec 22 06:32 F00040092_0005.spill.bcnd.cedar.0.root -rw-rw-r-- 1 rubin e875 7709121 Dec 22 06:32 F00040092_0005.spill.bntp.cedar.0.root -rw-rw-r-- 1 rubin e875 31031046 Dec 22 06:30 F00040092_0005.spill.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 5039787 Dec 22 06:30 F00040092_0005.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 504 Feb 12 11:11 c_list -rw-rw-r-- 1 rubin e875 1920 Feb 11 19:57 copy_list for FILE in `ls /minos/data/minfarm/neardet | grep cand` ; do sam 
locate ${FILE} ; done ['/pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2007-04,391@vo9747'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,86@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,84@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,81@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,85@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,87@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,88@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,78@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,80@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,83@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,91@vob319'] for FILE in `ls /minos/data/minfarm/fardet | grep cand` ; do sam locate ${FILE} ; done ['/pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2006-12,1465@vob825'] ['/pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2006-12,1471@vob825'] ['/pnfs/minos/reco_far/cedar/cand_data/2007-12,712@vo5628'] ['/pnfs/minos/reco_far/cedar/cand_data/2007-12,748@vo5628'] Now testing non-cand files which may be concatenated . Try one manually. FILE=N00012048_0017.spill.mrnt.cedar_phy_bhcurv.0.root FRU=`echo ${FILE} | cut -f 1 -d _` FTA=`echo ${FILE} | cut -f 2- -d .` SUB=`echo ${FILE} | cut -f 2 -d _ | cut -f 1 -d .` SAMDIM="FILE_NAME ${FRU}_%.${FTA} and PARENT_BY_NAME ${FRU}_${SUB}.mdaq.root" for FILE in `ls /minos/data/minfarm/neardet | grep -v cand` ; do FRU=`echo ${FILE} | cut -f 1 -d _` FTA=`echo ${FILE} | cut -f 2- -d .` SUB=`echo ${FILE} | cut -f 2 -d _ | cut -f 1 -d .` SAMDIM="FILE_NAME ${FRU}_%.${FTA} and PARENT_BY_NAME ${FRU}_${SUB}.mdaq.root" printf "\n${FILE}\n" sam list files --nosummary --dim="${SAMDIM}" done N00012048_0017.spill.mrnt.cedar_phy_bhcurv.0.root N00012048_0000.spill.mrnt.cedar_phy_bhcurv.0.root N00012048_0017.spill.sntp.cedar_phy_bhcurv.0.root N00012048_0000.spill.sntp.cedar_phy_bhcurv.0.root N00013190_0009.cosmic.sntp.cedar.0.root N00013190_0000.cosmic.sntp.cedar.0.root N00013190_0009.spill.sntp.cedar.0.root N00013190_0000.spill.sntp.cedar.0.root N00013190_0010.cosmic.sntp.cedar.0.root N00013190_0000.cosmic.sntp.cedar.0.root N00013190_0010.spill.sntp.cedar.0.root N00013190_0000.spill.sntp.cedar.0.root N00013190_0011.cosmic.sntp.cedar.0.root N00013190_0000.cosmic.sntp.cedar.0.root N00013190_0011.spill.sntp.cedar.0.root N00013190_0000.spill.sntp.cedar.0.root N00013190_0012.cosmic.sntp.cedar.0.root N00013190_0000.cosmic.sntp.cedar.0.root N00013190_0012.spill.sntp.cedar.0.root N00013190_0000.spill.sntp.cedar.0.root N00013190_0013.cosmic.sntp.cedar.0.root N00013190_0000.cosmic.sntp.cedar.0.root MINOS26 > for FILE in `ls /minos/data/minfarm/fardet | grep -v cand` ; do F00037144_0012.all.sntp.cedar_phy_bhcurv.0.root F00037144_0000.all.sntp.cedar_phy_bhcurv.0.root F00037144_0012.spill.bcnd.cedar_phy_bhcurv.0.root F00037144_0012.spill.bcnd.cedar_phy_bhcurv.0.root F00037144_0012.spill.bntp.cedar_phy_bhcurv.0.root F00037144_0000.spill.bntp.cedar_phy_bhcurv.0.root F00037144_0012.spill.mrnt.cedar_phy_bhcurv.0.root F00037144_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00037144_0012.spill.sntp.cedar_phy_bhcurv.0.root F00037144_0000.spill.sntp.cedar_phy_bhcurv.0.root F00040092_0005.all.sntp.cedar.0.root F00040092_0000.all.sntp.cedar.0.root F00040092_0005.spill.bcnd.cedar.0.root F00040092_0005.spill.bcnd.cedar.0.root F00040092_0005.spill.bntp.cedar.0.root F00040092_0000.spill.bntp.cedar.0.root F00040092_0005.spill.sntp.cedar.0.root F00040092_0000.spill.sntp.cedar.0.root c_list copy_list 
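The same parentage query can be wrapped to flag only those leftovers with no concatenated counterpart in SAM; a sketch along those lines (my own, not run here):

for FILE in `ls /minos/data/minfarm/fardet | grep -v cand | grep root` ; do
  FRU=`echo ${FILE} | cut -f 1 -d _`
  FTA=`echo ${FILE} | cut -f 2- -d .`
  SUB=`echo ${FILE} | cut -f 2 -d _ | cut -f 1 -d .`
  SAMDIM="FILE_NAME ${FRU}_%.${FTA} and PARENT_BY_NAME ${FRU}_${SUB}.mdaq.root"
  NCAT=`sam list files --nosummary --dim="${SAMDIM}" | wc -l`
  if [ ${NCAT} -eq 0 ] ; then printf "NOT CONCATENATED ${FILE}\n" ; fi
done
# note - streams that are declared without concatenation ( the bcnd files above ) match themselves,
# so they show up as found here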
============================================================================= 2008 02 14 ################ # IMAGEMAGICK # ################ MINOS11 > rpm -qf /usr/bin/convert ImageMagick-5.5.6-24 Per pawloski/tjyun, requested ImageMagick on minos cluster. Requested installation Date: Thu, 14 Feb 2008 15:30:44 -0600 (CST) Subject: HelpDesk ticket 111178 ___________________________________________ Short Description: ImageMagick on Minos Cluster Problem Description: run2-sys : Please install ImageMagick on the Minos Cluster. minos01 through minos26 This is available via yum. ImageMagick seems to have been dropped back when we upgraded to SLF 4. It is installed on minos11, which stayed at SLF 3. This can be done at your convenience. ___________________________________________ Date: Thu, 14 Feb 2008 15:42:50 -0600 (CST) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 14 Feb 2008 17:30:41 -0600 (CST) ___________________________________________________________________ Solution: ling@fnal.gov sent this solution: ImageMagik installed on these nodes. cc'd pawloski,tjyang,minos-admin ####### # SAM # ####### Need to pick up missing cedar_phy files declarations. There is no SLOG directory for cedar_phy, perhaps we never cleaned up. cut and source shrc/kreymer cd /export/stage/minfarm/ROUNDUP PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9010 SOCFILE=/export/stage/minfarm/.grid/samdbs_prd export SAM_ORACLE_CONNECT=`cat ${SOCFILE}` DET=near RELEASE=cedar_phy SLOG=${HOME}/ROUNTMP/LOG/saddreco/${RELEASE}/${DET}.log mkdir -p ${HOME}/ROUNTMP/LOG/saddreco/${RELEASE} MONTH=2007-02 SRV1> ./saddreco ${DET} ${RELEASE} ${MONTH} verify | grep -v verified DETECT near RELEASE cedar_phy MONTH 2007-02 BAIL 999999 STARTED Thu Feb 14 17:49:00 2008 saddreco 2007117 Declaring to SAM prd near cedar_phy 2007-02 verify Needed /pnfs/minos/reco_near/cedar_phy/cand_data/2007-02 Treating 667 files in /pnfs/minos/reco_near/cedar_phy/cand_data/2007-02 Needed 234 files, Rate was 2.584 Needed /pnfs/minos/reco_near/cedar_phy/mrnt_data/2007-02 Treating 47 files in /pnfs/minos/reco_near/cedar_phy/mrnt_data/2007-02 obsolete N00011798_0000.spill.mrnt.cedar_phy.0.root Needed 11 files, Rate was 1.066 Needed /pnfs/minos/reco_near/cedar_phy/sntp_data/2007-02 Treating 50 files in /pnfs/minos/reco_near/cedar_phy/sntp_data/2007-02 obsolete N00011798_0000.spill.sntp.cedar_phy.0.root Needed 23 files, Rate was 1.220 STARTED Thu Feb 14 17:49:00 2008 FINISHED Thu Feb 14 17:51:00 2008 ./saddreco -d ${DET} -r ${RELEASE} -p ${MONTH} --declare 2>&1 \ | tee -a ${SLOG} STARTED Thu Feb 14 18:05:21 2008 FINISHED Thu Feb 14 18:07:42 2008 for MONTH in `(cd /pnfs/minos/reco_${DET}/${RELEASE}/sntp_data ; ls -d 20??-??)` ; do ./saddreco -d ${DET} -r ${RELEASE} -p ${MONTH} --declare 2>&1 \ | tee -a ${SLOG} done STARTED Thu Feb 14 19:56:24 2008 FINISHED Thu Feb 14 21:39:40 2008 Picked up files in MONTH 2005-03 MONTH 2005-04 MONTH 2005-09 ( obsolete only ) MONTH 2006-06 MONTH 2006-12 MONTH 2007-01 MONTH 2007-02 MONTH 2007-03 MONTH 2007-04 DET=far SLOG=${HOME}/ROUNTMP/LOG/saddreco/${RELEASE}/${DET}.log Verify scan : for MONTH in `(cd /pnfs/minos/reco_${DET}/${RELEASE}/sntp_data ; ls -d 20??-??)` ; do ./saddreco -d ${DET} -r ${RELEASE} -p ${MONTH} --verify done 2005-03 cand 2006-04 cand Declare the few missing files for MONTH in 2005-03 2006-04 ; do ./saddreco -d 
${DET} -r ${RELEASE} -p ${MONTH} --declare 2>&1 \ | tee -a ${SLOG} done ######## # DATA # ######## Date: Thu, 14 Feb 2008 11:25:18 -0600 (CST) Subject: HelpDesk ticket 111148 ___________________________________________ Short Description: minos group quota adjustment in BlueArc Problem Description: LSC/CSI : We have been too successful using our /minos/data disks ! We seem to have used the entire 18 TBytes of group quota. MINOS26 > quota -s -v -g e875 .. minos-nas-0.fnal.gov:/minos/scratch 2165G 0 10240G 459k 0 0 minos-nas-0.fnal.gov:/minos/data 16384G* 0 16384G 554k 0 0 To give us some breathing room, while we move some files off to tape, please shift 5 TBytes of e875 group quota from /minos/scratch to /minos/data. ___________________________________________ Date: Thu, 14 Feb 2008 12:22:16 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Thu, 14 Feb 2008 13:46:26 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/WST Group. ___________________________________________ Converstation with Andy, the volume quota is now 25 TB. There was no need to reduce scratch quota. Still a mystery why quota is reporting 16384G* That is a suspicious number 2^14 . ########## # DC2NFS # ########## AFSS/dc2nfs -n -d reco_near/cedar_phy/sntp_data/2007-02 $ du -sm /minos/data/reco_near/* 17 /minos/data/reco_near/R1_18 270547 /minos/data/reco_near/R1_18_2 8813 /minos/data/reco_near/R1_18_3 142294 /minos/data/reco_near/R1_18_4 22221 /minos/data/reco_near/R1_21 10001 /minos/data/reco_near/R1_23 9336 /minos/data/reco_near/R1_23a 9958 /minos/data/reco_near/R1_24 10089 /minos/data/reco_near/R1_24a 11734 /minos/data/reco_near/R1_24b 41283 /minos/data/reco_near/R1_24c 24122 /minos/data/reco_near/R1_24cal 1 /minos/data/reco_near/S06-05-25-R1-22 10002 /minos/data/reco_near/S06-06-22-R1-22 895381 /minos/data/reco_near/cedar 221619 /minos/data/reco_near/cedar_phy 341345 /minos/data/reco_near/cedar_phy_bhcurv $ du -sm /minos/data/reco_near 2028757 /minos/data/reco_near $ du -sm /minos/data/reco_far 1343793 /minos/data/reco_far MINOS26 > du -sm /minos/data/* 1273917 /minos/data/analysis 1706 /minos/data/asousa 259008 /minos/data/beam_data 2 /minos/data/d10 87905 /minos/data/flux 0 /minos/data/foo 3437 /minos/data/log_data 1 /minos/data/maint 9099614 /minos/data/mcimport 2362905 /minos/data/mcout_data 1 /minos/data/mindata 406473 /minos/data/minfarm 405888 /minos/data/mysql 1343793 /minos/data/reco_far 2028757 /minos/data/reco_near 87995 /minos/data/users for DIR in `ls /pnfs/minos/reco_near/cedar_phy/sntp_data` ; do echo $DIR ; ./stage -d -p 0 reco_near/cedar_phy/sntp_data/${DIR} ; done 992G /minos/data/reco_near/cedar_phy/sntp_data STARTED Thu Feb 14 14:33:00 CST 2008 FINISHED Fri Feb 15 11:07:15 CST 2008 cat > /tmp/dc2nfs.2008 ########### # BLUEARC # ########### 05:00 to 08:00 scheduled BlueArc firmware upgrade. mindata@minos26 echo "crontab -r" | at 01:00 Feb 14 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 Feb 14 Restarted corral and mcimport around 15:50 N.B. subscribed to site-nas-announce, Date: Thu, 14 Feb 2008 07:20:03 -0600 Reply-To: Ray Pasetes Sender: Site-Wide NAS Service Announcements and Status From: Ray Pasetes Subject: NAS Servers have been upgraded Comments: To: site-nas-announce@fnal.gov Content-type: text/plain; format=flowed; charset=ISO-8859-15 The Central NAS Servers have been upgraded and service has been resumed. 
If you notice any problems, please open a helpdesk ticket -- ============================================================================= 2008 02 13 ####### # SAM # ####### For dogwood, per HOWTO.saddreco, sam get registered application families | grep loon export SAM_ORACLE_CONNECT="samdbs/" for DWR in dogwoodtest0 dogwood0 dogwood1 ; do for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon \ --appVersion=${DWR} done done ########## # CONDOR ########## condor queue time limit http://www.uscms.org/SoftwareComputing/UserComputing/BatchSystem.html look at queue from any node : condor_q -submitter gfactory see a summary condor_status -submitter see the queues known to minos25 condor_q -name minos25 http://www.astro.northwestern.edu/AstCCwiki/index.php?title=Typhoon(Cluster)_Documentation condor_q -goodputs ########## # CONDOR # ########## Created newer proxy for gfactory, SRV1> cd /export/stage/minfarm/.grid DAYS=10 (( HOURS = DAYS * 24 )) DAPR=`date -d "today + ${DAYS}days" +%Y%m%d` voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife ${HOURS}:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy.${DAPR} \ -valid ${HOURS}:0 Your proxy is valid until Sun Feb 17 16:59:24 2008 voms-proxy-info -file kreymer-condor.proxy.${DAPR} [gfactory@minos25 ~]$ cd .grid/ DAPR=20080223 scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy.${DAPR} . DAPR=20080304 scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy.${DAPR} . cp -a kreymer-condor.proxy.${DAPR} kreymer-condor.proxy [gfactory@minos25 .grid]$ date Fri Feb 15 13:38:13 CST 2008 ######### # FNALU # ######### fsus02 - down for hardware repairs, holds LSF processing. Date: Wed, 13 Feb 2008 09:06:54 -0600 (CST) the Sun tech finished the hardware task on fsun02, reprogramming the nvram, and fsun02 is back. The lsf batch nodes were reopened, and minos jobs are running on them. margaret Margaret Greaney Telephone: 630-840-4623 Fermilab E-mail: mgreaney@fnal.gov ######### # MYSQL # ######### products > . etc/setups.sh export PRODUCTS=/local/ups/db setup upd upd install -j mysql v5_0_51 ============================================================================= 2008 02 12 ######### # MYSQL # ######### Making space on minos-sam03 for mysql tests. 
kreymer > cd /home/kreymer/MYSQL MINOS-SAM03 > du -sm * 9352 20060421 14021 20070207 14186 20070305 14790 20070403 16492 20070705 16939 20070815 17519 20070919 17542 20071002 17579 20071107 62437 BINLOG 8956 /minos/data/mysql/archive/20060418 9343 /minos/data/mysql/archive/20060421 17753 /minos/data/mysql/archive/20071218 17885 /minos/data/mysql/archive/20080116 18028 /minos/data/mysql/archive/20080204 Removed the binlogs, these live in /minos/data/mysql/BINLOG now rm -r BINLOG Checked the files already copied to /m/d/m/a/20060421 MMON=20060421 MDMA=/minos/data/mysql/archive diff -r ${MMON} ${MDMA}/${MMON} rm -r ${MMON} for MMON in 20070207 20070305 20070403 20070705 20070815 20070919 20071002 20071107 ; do printf "${MMON} " ; date ; time cp -ax ${MMON} ${MDMA}/${MMON} ; done 20070207 Tue Feb 12 19:14:12 CST 2008 real 6m29.750s user 0m0.306s sys 0m42.422s 20070305 Tue Feb 12 19:20:41 CST 2008 real 6m34.734s user 0m0.291s sys 0m43.041s 20070403 Tue Feb 12 19:27:16 CST 2008 real 7m21.917s user 0m0.284s sys 0m44.296s 20070705 Tue Feb 12 19:34:38 CST 2008 real 7m35.256s user 0m0.329s sys 0m48.148s 20070815 Tue Feb 12 19:42:14 CST 2008 real 8m57.493s user 0m0.348s sys 0m50.528s 20070919 Tue Feb 12 19:51:11 CST 2008 real 8m30.084s user 0m0.386s sys 0m54.253s 20071002 Tue Feb 12 19:59:41 CST 2008 real 8m49.770s user 0m0.369s sys 0m53.047s 20071107 Tue Feb 12 20:08:31 CST 2008 real 11m26.015s user 0m0.365s sys 0m55.130s for MMON in 20070207 20070305 20070403 20070705 20070815 20070919 20071002 20071107 ; do printf "${MMON} " ; date ; time diff -r ${MMON} ${MDMA}/${MMON} ; done 20070207 Tue Feb 12 22:44:35 CST 2008 real 7m15.662s user 0m24.176s sys 0m30.761s 20070305 Tue Feb 12 22:51:51 CST 2008 real 7m33.303s user 0m23.318s sys 0m30.847s 20070403 Tue Feb 12 22:59:24 CST 2008 real 10m12.483s user 0m25.223s sys 0m33.597s 20070705 Tue Feb 12 23:09:37 CST 2008 real 8m32.953s user 0m28.791s sys 0m37.324s 20070815 Tue Feb 12 23:18:10 CST 2008 real 9m33.451s user 0m29.187s sys 0m38.286s 20070919 Tue Feb 12 23:27:44 CST 2008 real 11m4.116s user 0m30.180s sys 0m39.555s 20071002 Tue Feb 12 23:38:48 CST 2008 real 8m56.781s user 0m30.327s sys 0m40.164s 20071107 Tue Feb 12 23:47:45 CST 2008 real 10m14.862s user 0m30.470s sys 0m40.086s rm -r 20070207 20070305 for MMON in 20070403 20070705 20070815 20070919 20071002 20071107 ; do printf "${MMON} " ; date ; rm -r ${MMON} ; done 20070403 Wed Feb 13 08:56:08 CST 2008 20070705 Wed Feb 13 08:56:32 CST 2008 20070815 Wed Feb 13 08:56:59 CST 2008 20070919 Wed Feb 13 08:57:25 CST 2008 20071002 Wed Feb 13 08:58:01 CST 2008 20071107 Wed Feb 13 08:58:24 CST 2008 df -h . Filesystem Size Used Avail Use% Mounted on /dev/hdb2 225G 8.2G 205G 4% /home ######### # MYSQL # ######### MINOS26 > upd install -j mysql v5_0_51 informational: installed mysql v5_0_51. upd install succeeded. 
MINOS26 > setup mysql v5_0_51 cat: /afs/fnal.gov/files/code/e875/general/ups/db/mysql/config/minos26.fnal.gov.: No such file or directory cat: /mysql.socket: No such file or directory cat: /mysql.port: No such file or directory cat: /mysql.user: No such file or directory Setup:mysql datadir = Setup:port=; socket= MINOS26 > type mysql mysql is /afs/fnal.gov/files/code/e875/general/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysql MINOS26 > mysql -V mysql Ver 14.12 Distrib 5.0.51a, for redhat-linux-gnu (i686) using EditLine wrapper ######### # FNALU # ######### bsub -R "hname!=flxi06" hostname bsub -R "hname!=flxi06 & hname!=flxb10 & hname!=flxb11 & hname!=flxb35" hostname ########### # BLUEARC # ########### Date: Tue, 12 Feb 2008 13:53:33 -0600 (CST) Subject: HelpDesk ticket 111030 ___________________________________________ Short Description: Quota request for BlueArc served /minos/scratch, for loiacono Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user loiacono on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. ___________________________________________ Date: Tue, 12 Feb 2008 14:41:03 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. ####### # AFS # ####### My previous scans for AFS timeouts were flawed, searching for 'Jan '. There continue to be AFS timeouts this month. for NODE in ${NODES} ; do ssh -ax ${NODE} 'grep afs: /var/log/messages | grep -v Tokens | uniq'; done \ | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' done Feb 1 13:55:48 minos04 kernel: afs: Lost contact with file server 131.225.68.19 Feb 1 13:57:02 minos04 kernel: afs: file server 131.225.68.19 is back up Feb 1 07:39:00 minos10 kernel: afs: Lost contact with file server 131.225.68.65 Feb 1 07:39:05 minos10 kernel: afs: Lost contact with file server 131.225.68.7 Feb 1 07:40:26 minos10 kernel: afs: file server 131.225.68.65 is back up Feb 1 07:40:26 minos10 kernel: afs: file server 131.225.68.7 is back up Feb 4 17:56:06 minos04 kernel: afs: Lost contact with file server 131.225.68.17 Feb 4 17:56:23 minos04 kernel: afs: file server 131.225.68.17 is back up Feb 5 05:38:05 minos18 kernel: afs: Lost contact with file server 131.225.68.49 Feb 5 05:39:30 minos18 kernel: afs: file server 131.225.68.49 is back up Feb 6 06:08:50 minos04 kernel: afs: Lost contact with file server 131.225.68.49 Feb 6 06:10:35 minos04 kernel: afs: file server 131.225.68.49 is back up Feb 7 10:42:02 minos19 kernel: libafs: module license 'http://www.openafs.org/dl/license10.html' taints kernel. Feb 7 10:42:02 minos19 kernel: libafs: no version for "sys_close" found: kernel tainted. 
Feb 7 10:42:02 minos19 kernel: libafs: Ignoring new-style parameters in presence of obsolete ones Feb 7 11:58:57 minos19 kernel: afs: Lost contact with file server 192.168.67.1 Feb 7 15:51:35 minos05 kernel: afs: Lost contact with file server 131.225.68.17 Feb 7 15:52:01 minos05 kernel: afs: file server 131.225.68.17 is back up Feb 8 11:47:59 minos05 kernel: afs: Lost contact with file server 131.225.68.7 Feb 8 11:48:39 minos05 kernel: afs: file server 131.225.68.7 is back up Feb 9 09:09:23 minos06 kernel: afs: Lost contact with file server 131.225.68.7 Feb 9 09:12:32 minos06 kernel: afs: file server 131.225.68.7 is back up Feb 10 16:50:08 minos16 kernel: afs: Lost contact with file server 131.225.68.17 Feb 10 16:51:58 minos16 kernel: afs: file server 131.225.68.17 is back up Feb 11 11:15:35 minos11 kernel: afs: Lost contact with file server 192.168.67.1 Feb 11 12:55:03 minos05 kernel: afs: Lost contact with file server 131.225.68.49 Feb 11 12:55:21 minos05 kernel: afs: Lost contact with file server 131.225.68.7 Feb 11 12:55:41 minos05 kernel: afs: file server 131.225.68.49 is back up Feb 11 12:55:41 minos05 kernel: afs: file server 131.225.68.7 is back up Feb 12 06:07:06 minos23 kernel: afs: Lost contact with file server 131.225.68.17 Feb 12 06:09:49 minos23 kernel: afs: file server 131.225.68.17 is back up Times : 14 76 17 85 105 26 40 209 110 20 160 14 17 20 26 40 76 85 105 110 160 209 Servers 131.225.68.7 131.225.68.17 131.225.68.49 131.225.68.65 ########## # DCACHE # ########## Write queue jumped sharply to 20000 yesterday afternoon/evening. And again from 17K to 25K around 02:00 MINOS26 > ./dcache/datasets w Run Tue Feb 12 08:58:40 CST 2008 Data from 12-Feb-2008 06:33 Pool group writePools FILES GBYTES FAMILY 31846 551 cdms.cdms That's about 12 MBytes/file. Date: Mon, 11 Feb 2008 21:11:46 -0600 (CST) Ticket #: 110975 ___________________________________________ Short Description: stken overload Problem Description: Since about 3:30 today a user has dumped over 16000 files into dcache. According to its agreement with MSS, MINOS throttled back its running when the write pools exceeded 2500 files. We will therefore be down until the write pools clear. Please communicate with the user to avoid this sort of situation in the future. Howie Rubin ___________________________________________ <-- # @@@ Enter Update below this line. @@@ # --> This is apparently due to addition of over 30,000 files by CDMS . These file average a little over 10 MBytes in size. The peak queued stores has gone as high as 24,000. http://fndca.fnal.gov/dcache/queue/allpools.jpg Note that in the past, this level of backlog has led to global DCache system failures. Here's a summary of the pool content : ... If nothing is done, Minos data handling will remain shut down for about a week. These CDMS files need to be removed. As before, we'll be glad to provide advice and planning support, and even share our scripts. <-- # @@@ Enter Update above this line. @@@ # --> I looked at one of the CDMS tapes, http://www-stken.fnal.gov/enstore/tape_inventory/VOC132 About half the files are 20 byte files , like /pnfs/fs/usr/cdms/Raw_Soudan_data_sync/180108_0736/180108_0736_F0309.gz.status Fermi National Accelerator Laboratory D.A. Bauer, M.B. Crisler, D. Holmgren, E. Ramberg, J. 
Yoo 2745 Queues : 15:00 - 20136 15:20 - 18572 rapid clearing, lately 16:00 - 16026 23:00 - 14857 and back to a slow clearing, about 200/hour 08:40 - 11844 roughly 3000/10 hours Berg reports that the Tsunami is passing quickly, will not recur, via email from djholm. They are unrepentant about the 10's of thousands of 20 byte files. I find this utterly irresponsible and unacceptable. RESOLVED Wed 02/13 CDMS files were removed by DCache developers A backlog of about 2K cand files has cleared to tape. ============================================================================= 2008 02 11 ########### # MINOS11 # ########### Date: Mon, 11 Feb 2008 09:39:56 -0600 (CST) Subject: HelpDesk ticket 110904 Short Description: minos11 is down Problem Description: run2-sys : Node minos11 ( our only remsining SLF 3 system ) seems to have gone off the network at about 10:00 Sunday 10 Feb, according to the Ganglia plots. I get no response to ping. Please investigate. Thanks ! ___________________________________________ Date: Mon, 11 Feb 2008 09:48:58 -0600 (CST) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ___________________________________________________________________ Date: Mon, 11 Feb 2008 10:10:02 -0600 (CST) Solution: ling@fnal.gov sent this solution: System rebooted. ___________________________________________________________________ Feb 10 04:02:03 minos11 syslogd 1.4.1: restart. Feb 10 05:19:56 minos11 sshd(pam_unix)[17841]: session opened for user rhatcher by (uid=0) Feb 10 05:19:57 minos11 sshd(pam_unix)[17970]: session opened for user rhatcher by (uid=0) Feb 10 10:40:53 minos11 sshd(pam_unix)[23565]: session opened for user rhatcher by rhatcher(uid=0) Feb 11 11:02:36 minos11 syslogd 1.4.1: restart. ============================================================================= 2008 02 08 ########## # PARROT # ########## mindata : cd /grid/app/minos/parrot curl http://www.cse.nd.edu/~ccl/software/files/cctools-current-i686-linux-2.6.tar.gz \ -o cctools-current-i686-linux-2.6.tar.gz tar xzvf cctools-current-i686-linux-2.6.tar.gz Now testing this, export PARROT_DIR=/grid/app/minos/parrot/cctools-current-i686-linux-2.6 export PATH=${PARROT_DIR}/bin:${PATH} parrot -m ${PARROT_DIR}/mountfile2.grow bash PS1='P> ' P> . /usr/local/etc/setups.sh P> export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup P> setup_minos() { . $MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } P> setup dcap -q unsecured P> type dcap bash: type: dcap: not found P> which dccp /fnal/ups/prd/dcap/v2_38_f0512/Linux-2-4/bin/dccp P> type dccp dccp is /fnal/ups/prd/dcap/v2_38_f0512/Linux-2-4/bin/dccp P> DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F P> DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root P> dccp ${DFILE} /var/tmp/TEST.root 41379 bytes in 0 seconds P> hostname fngp-osg.fnal.gov P> setup_minos -r R1.24.2 No default SAM configuration exists at this time. MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory P> type loon loon is /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.24.2/bin/Linux2.6-GCC_3_4/loon cd /minos/scratch/kreymer/condor/loonb P> loon -bq firstlast.C ${DFILE} Warning in : class timespec already in TClassTable loon [0] Processing firstlast.C... Spin(1 in 1 out 0 filt.) 
1) +RawRecCounts::Ana n=1 ( 1/ 0) t=( 0.01/ 0.00) RawRecCounts Report: F00031300_0000.mdaq.root root version: v04-02-00 VldContexts: First: { Far| Data|2005-04-25 16:36:07.621514000Z} First Snarl: {Unknow|Unknow|1970-01-01 00:00:00.000000000Z} # -1 Last: { Far| Data|2005-04-25 16:36:07.621514000Z} Last Snarl: {Unknow|Unknow|1970-01-01 00:00:00.000000000Z} # -1 in 11 records of 0 record sets RawConfigFilesBlock 7 RawDaqHeaderBlock 11 RawRunCommentBlock 1 RawRunConfigBlock 1 RawRunEndBlock 1 RawRunStartBlock 1 RawRecCounts done ============================================================================= 2008 02 07 ####### # SAM # ####### sam_cpp_api with sam locate support ! upd install -j sam_cpp_api v8_4_0 -q GCC-3.4.3 ########## # PARROT # ########## Testing the d141/d199 clones of products/minossoft cd /afs/fnal.gov/files/expwww/numi/html/computing ln -s /afs/fnal.gov/files/data/minos/d141 d141 ln -s /afs/fnal.gov/files/data/minos/d199 d199 MINOS26 > fs listquota /afs/fnal.gov/files/data/minos/d141 Volume Name Quota Used %Used Partition nb.minos.d141 50000000 27714790 55% 56% MINOS26 > fs listquota /afs/fnal.gov/files/data/minos/d199 Volume Name Quota Used %Used Partition nb.minos.d199 50000000 28146482 56% 77% MINOS26 > find /afs/fnal.gov/files/data/minos/d141 -type f | wc -l 963571 time make_growfs -k /afs/fnal.gov/files/data/minos/d141 real 9m52.730s user 0m48.330s sys 3m14.385s du -sk /afs/fnal.gov/files/data/minos/d141/.growfsdir 69168 /afs/fnal.gov/files/data/minos/d141/.growfsdir time make_growfs -k /afs/fnal.gov/files/data/minos/d199 real 8m31.961s user 0m36.871s sys 2m27.319s FNGP-OSG > export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 FNGP-OSG > export PATH=${PARROT_DIR}/bin:${PATH} FNGP-OSG > parrot -m ${PARROT_DIR}/mountfile2.grow bash FNGP-OSG > . /usr/local/etc/setups.sh FNGP-OSG > export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup FNGP-OSG > setup_minos() { . $MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } FNGP-OSG > setup_minos No default SAM configuration exists at this time. MINOSSOFT release "development" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=trunk EXTERN=v03 CONFIG=v01 bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory cd /minos/scratch/kreymer/condor/loonb DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root loon -bq firstlast.C ${DFILE} FNGP-OSG > setup_minos -r R1.24.2 No default SAM configuration exists at this time. MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 bash: child setpgid (13580 to 13579): Operation not permitted bash: child setpgid (13581 to 13579): Operation not permitted bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory FNGP-OSG > type loon loon is /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.24.2/bin/Linux2.6-GCC_3_4/loon FNGP-OSG > type dcap bash: type: dcap: not found FNGP-OSG > type dccp dccp is /afs/fnal.gov/files/code/e875/general/ups/prd/dcap/v2_36_f0506/Linux-2-4/bin/dccp FNGP-OSG > loon -bq firstlast.C ${DFILE} Warning in : class timespec already in TClassTable P> dccp ${DFILE} F00031300_0000.mdaq.root getControlMessage: poll fail. Failed to create a control line Failed open file in the dCache. 
Can't open source file : Server rejected "hello" System error: Input/output error P> loon -bq firstlast.C F00031300_0000.mdaq.root Warning in : class timespec already in TClassTable loon [0] Processing firstlast.C... Spin(1 in 1 out 0 filt.) 1) +RawRecCounts::Ana n=1 ( 1/ 0) t=( 0.00/ 0.00) RawRecCounts Report: F00031300_0000.mdaq.root root version: v04-02-00 VldContexts: First: { Far| Data|2005-04-25 16:36:07.621514000Z} First Snarl: {Unknow|Unknow|1970-01-01 00:00:00.000000000Z} # -1 Last: { Far| Data|2005-04-25 16:36:07.621514000Z} Last Snarl: {Unknow|Unknow|1970-01-01 00:00:00.000000000Z} # -1 in 11 records of 0 record sets RawConfigFilesBlock 7 RawDaqHeaderBlock 11 RawRunCommentBlock 1 RawRunConfigBlock 1 RawRunEndBlock 1 RawRunStartBlock 1 RawRecCounts done ============================================================================= 2008 02 06 ############# # MINOSSOFT # ############# Second pass on minossoft symlinks finds nothing to do. Second pass on products symlinks finds /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/startup/ups_startup /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/startup/ups_startup /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/shutdown/ups_shutdown /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/shutdown/ups_shutdown /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root_minimal.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/tar_files /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_3/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so Corrected .fnal.gov to a local symlink, both in my copy, and in the original cd /afs/fnal.gov/files/data/minos/d141/db/.upsfiles/startup ln -sf ups_startup ups_startup.csh ln -sf ups_startup ups_startup.sh cd /afs/fnal.gov/files/data/minos/d141/db/.upsfiles/startup ln -sf ups_shutdown ups_shutdown.csh ln -sf ups_shutdown ups_shutdown.sh cd /afs/fnal.gov/files/code/e875/general/ups/db/.upsfiles/startup cd /afs/fnal.gov/files/code/e875/general/ups/db/.upsfiles/shutdown Removed useless vdt versions in ups which have explicit links to a products path. ups undeclare -Y vdt v1_1_14_13 ups undeclare -Y vdt v1_6_1_0 ups undeclare -Y vdt v1_8_1_1 chmod -R 755 /afs/fnal.gov/files/data/minos/d141/prd/vdt rm -r /afs/fnal.gov/files/data/minos/d141/prd/vdt rm -r /afs/fnal.gov/files/data/minos/d141/db/vdt printf "${SLINKS}\n" | cut -f 2 -d : /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root_minimal.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/tar_files /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_3/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so This seems like something that should be cleaned up but I'll just grab them for now . 
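One way to 'grab them' is to replace each symlink in the clone whose target lies outside the copied tree with a real copy of that target. A minimal sketch, assuming the clone is d141 and that only plain-file targets need copying; the variable names and loop are illustrative, not the actual procedure used here:

UPO=/afs/fnal.gov/files/data/minos/d141
find ${UPO} -type l | while read LNK ; do
    TGT=`readlink -f "${LNK}"`                 # resolve to an absolute target path
    case "${TGT}" in
        ${UPO}/*) ;;                           # target already inside the clone, leave the link
        *) [ -f "${TGT}" ] && rm "${LNK}" && cp -p "${TGT}" "${LNK}" ;;   # external file, copy it in
    esac
done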
There are 215 MBytes of tar files.

grep ':/' ${SLINKF} | grep -v "${UPI}" | cut -f 2 -d :
/ftp/products/sam/v8_2_0/Linux+2/sam_v8_2_0_Linux+2.ups.tar
/ftp/products/sam_ns_ior/v7_1_0/NULL/sam_ns_ior_v7_1_0_NULL.ups.tar
/fnal/ups/prd/oracle_client/v10_1_0_2_0a/Linux-2/nls/lbuilder/lbuilder
/fnal/ups/prd/oracle_client/v10_1_0_2_0a/Linux-2/jdk/man/ja_JP.eucJP
/afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh
/usr/share/libtool/config.guess
/usr/share/libtool/config.sub
/usr/share/libtool/ltmain.sh
/usr/share/automake-1.6/install-sh
/usr/share/automake-1.6/mkinstalldirs
/usr/share/automake-1.6/missing
/usr/share/automake-1.6/depcomp
/usr/lib/libz.so.1
/usr/lib/libz.so.1
/ftp/products/samgrid_batch_adapter/v7_1_0/NULL/samgrid_batch_adapter_v7_1_0_NULL.ups.tar
/ftp/products/geant/v3_21_14a/Linux+2.6/geant_v3_21_14a_Linux+2.6.ups.tar

Just 1 file to actually copy. Done

#########
# FNALU #
#########

Date: Wed, 06 Feb 2008 14:08:52 -0600 (CST)
Subject: HelpDesk ticket 087003 has additional info.
_________________________________________________________________
Ticket #: 087003
_________________________________________________________________
Note To Requester: mgreaney@fnal.gov sent this Notes To Requester:
Art, the TWW software has a perl that is in your path when you login,
probably. This is the default because the TWW took the place of a lot of
the ups products. If you use the perl that is on the system or the ups
perl, then kcroninit works. try doing setup perl before you run the
kcroninit. Let me know if that works, thank you, Margaret
_________________________________________________________________
This is still a problem, as the TWW perl is still in people's paths.
perl is a standard part of Linux installations.
I think we should remove the TWW perl, and put this problem behind us.

Unfortunately, now that I look closer, perl is the tip of the iceberg.
On flxi03, there are 756 files in /opt/TWWfsw/bin, dating from May/Jun 2006.
531 already exist in /usr/bin.

for BIN in `ls /opt/TWWfsw/bin` ; do
  [ -r "/usr/bin/${BIN}" ] && ls /usr/bin/${BIN} ; done | wc -l
531

Four are in /usr/X11R6/bin
/usr/X11R6/bin/cxpm
/usr/X11R6/bin/nc
/usr/X11R6/bin/nedit
/usr/X11R6/bin/sxpm

Another eight exist in /bin :
/bin/bash
/bin/gettext
/bin/gtar
/bin/gunzip
/bin/gzip
/bin/mktemp
/bin/tcsh
/bin/zcat

Perhaps we should not put 2-year-old versions of these in people's paths.
So at minimum we should remove perl from TWW, along with the packages that
duplicate /bin/*. Perhaps we should consider removing all of TWW !
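For completeness, a sketch of the duplicate scan extended to /bin and /usr/X11R6/bin as well as /usr/bin, using the same loop as above; the counts it prints would need to be regenerated, they are not taken from the original scan:

for BIN in `ls /opt/TWWfsw/bin` ; do
    for DIR in /usr/bin /bin /usr/X11R6/bin ; do
        [ -r "${DIR}/${BIN}" ] && echo "${DIR}"        # this TWW name shadows a stock binary here
    done
done | sort | uniq -c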
_________________________________________________________________ FLXI03 > for BIN in `ls /opt/TWWfsw/bin` ; do [ -r "/usr/bin/${BIN}" ] || ls -l /opt/TWWfsw/bin/${BIN} ; done | cut -f 2 -d \> | cut -f -4 -d / | sort -u /opt/TWWfsw/aspell06 /opt/TWWfsw/bash30 /opt/TWWfsw/ddd33 /opt/TWWfsw/diffutils28 /opt/TWWfsw/emacs21 /opt/TWWfsw/expect54 /opt/TWWfsw/fcpackage22 /opt/TWWfsw/ghostscript70r /opt/TWWfsw/gpatch25 /opt/TWWfsw/groff119 /opt/TWWfsw/gzip13 /opt/TWWfsw/imagemagick62 /opt/TWWfsw/ispell32 /opt/TWWfsw/liblcms11 /opt/TWWfsw/libttf21 /opt/TWWfsw/libungif /opt/TWWfsw/libwmf02 /opt/TWWfsw/lsof47 /opt/TWWfsw/lynx28 /opt/TWWfsw/m4 /opt/TWWfsw/metamail27 /opt/TWWfsw/mktemp15 /opt/TWWfsw/mutt14 /opt/TWWfsw/mysql4112 /opt/TWWfsw/ncurses54 /opt/TWWfsw/nedit55 /opt/TWWfsw/netpbm92 /opt/TWWfsw/perl586 /opt/TWWfsw/pine46 /opt/TWWfsw/pkgutils15 /opt/TWWfsw/plotutils24 /opt/TWWfsw/python242 /opt/TWWfsw/python242p /opt/TWWfsw/tar11 /opt/TWWfsw/tcsh61 /opt/TWWfsw/texinfo48 /opt/TWWfsw/tk84p /opt/TWWfsw/xemacs214 /opt/TWWfsw/xpm _________________________________________________________________ Date: Mon, 30 Jun 2008 12:59:59 -0500 (CDT) From: Margaret_Greaney I have not heard back from Frank Nagy on this, but from what I see the upgrade of TWW caused new perl modules to be available and kcroninit does work on my attempts on fnalu on linux nodes. _________________________________________________________________ Still fails for me, same way, FLXI04 > kcroninit Can't locate Net/Domain.pm in @INC (@INC contains: /usr/krb5/lib /opt/TWWfsw/libdb42/lib/perl586 /opt/TWWfsw/imagemagick62/lib/perl586 /opt/TWWfsw/readline50/lib/perl586 /opt/TWWfsw/pe FLXI04 > type perl perl is /opt/TWWfsw/bin/perl FLXI04 > ls -l /opt/TWWfsw/bin/perl lrwxr-xr-x 1 kevinh root 41 May 23 2006 /opt/TWWfsw/bin/perl -> /opt/TWWfsw/perl586/bin/.perl.tww-wrapper FLXI04 > ls -l /opt/TWWfsw/perl586/bin/.perl.tww-wrapper -rwxr-xr-x 1 kevinh root 12363 Apr 11 2006 /opt/TWWfsw/perl586/bin/.perl.tww-wrapper _________________________________________________________________ Date: Mon, 27 Oct 2008 13:02:33 -0500 (CDT) Solution: kcron may not work on flxi06 which has a 5.1 install. It works on the rest of the fnalu cluster. _________________________________________________________________ Testing 27 Oct, inconsistent results, with or without PATH=${PATH/\/opt\/TWWfsw\/bin://} FLXI03 > kcron kinit: Preauthentication failed while getting initial credentials FLXI03 > PATH=${PATH/\/opt\/TWWfsw\/bin://} FLXI03 > PATH=${PATH/\/opt\/TWWfsw\/bin://} FLXI03 > kcron FLXI03 > kcron kinit: Preauthentication failed while getting initial credentials 17:23 - kcron continues to be intermittent, with and without TWW in the path, at least on flxi04. time for N in 1 2 3 4 5 6 7 8 9 0 ; do printf "${N} " ; kcron ; sleep 1 ; done 17:31 - OK,, this has run OK on flxi02/3/4. So perhaps some flakiness in the network or KDC earlier. _________________________________________________________________ Date: Mon, 27 Oct 2008 22:35:22 +0000 (GMT) kcron is working for me now in flxi02/3/4 There are intermittent failures, like FLXI03 > kcron kinit: Preauthentication failed while getting initial credentials but these do not seem to have anything to do with TWW. Please go ahead and close this ticket. Thanks ! 
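A slightly more quantitative version of the intermittency test, assuming kcron exits non-zero when the underlying kinit fails (that exit behavior is an assumption, not something verified here):

FAIL=0
for N in `seq 1 20` ; do
    kcron 2>/dev/null || FAIL=`expr ${FAIL} + 1`     # count the Preauthentication failures
    sleep 1
done
echo "kcron failed ${FAIL} of 20 tries on `hostname -s`"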
_________________________________________________________________ ============================================================================= 2008 02 05 ############# # MINOSSOFT # ############# See HOWTO.afssoftprod pts examine buckley:minosrecodata pts setfields buckley:minosrecodata -access SOMar [root@minos-mysql1 ~]# time up ${UPI} ${D199} Unable to set owner-id for /afs/fnal.gov/files/data/minos/d199/setup/CVS/Root to 1019 ... Unable to set owner-id for /afs/fnal.gov/files/data/minos/d199 to 1019 Unable to set group-id for /afs/fnal.gov/files/data/minos/d199 to 5111 real 114m24.882s user 0m19.311s sys 11m38.279s Ran the slinky copies, with -head 1 -head 3 then everything. See /minos/scratch/kreymer/slinky/minossoft.log Rates are about 1 MB/sec ( probably due to failing chown's ) ############### # FNALU BATCH # ############### for node in flxb11 flxb13 flxb21 flxb22 flxb24 minos setup_minos -r R1.24.2 cd /minos/scratch/kreymer/condor/loonb DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root loon -bq firstlast.C ${DFILE} flxb11 OK flxb13 OK flxb16 flxb21 Connection to flxb21 closed. flxb22 Connection to flxb22 closed. flxb24 ============================================================================= 2008 02 04 ############### # FNALU BATCH # ############### Jobs which are NOT exiting instantly without logs or output : cat /minos/scratch/rahaman/releases/minos/Mad/macros/*.log | grep executed | cut -f 2 -d '<' | cut -f 1 -d '>' | sort -u flxb17.fnal.gov flxb18.fnal.gov flxb19.fnal.gov flxb20.fnal.gov flxb21.fnal.gov flxb22.fnal.gov flxb23.fnal.gov flxb24.fnal.gov flxb25.fnal.gov flxb26.fnal.gov flxb27.fnal.gov flxb28.fnal.gov flxb30.fnal.gov flxb31.fnal.gov flxi06.fnal.gov Scanning nodes that are up and taking batch jobs, for N in 10 11 13 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 ; do bsub -R flxb${N} ls -ld /minos/data /minos/scratch ; done 10 11 13 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. x Summary, it seems /minos/data and scratch are missing on flxb32 bsub -R "hname!=flxb32" ls -ld /minos/data /minos/scratch ___________________________________________ Date: Mon, 04 Feb 2008 18:46:40 -0600 (CST) Subject: HelpDesk ticket 110624 Short Description: flxb32 is missing the /minos/data and /minos/scratch mounts Problem Description: fnalu-admin : Several Minos LSF batch jobs are failing, due to the lack of /minos/scratch and /minos/data mounts on flxb32. This tends to quickly suck jobs out of the execution queue. Please hold host flxb32 in LSF until the mounts are established. 
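A sketch of how the mount scan above could be automated; ssh probes are used here instead of bsub probe jobs purely for illustration, with the same node list:

for N in 10 11 13 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 ; do
    ssh -ax flxb${N} 'ls -d /minos/data /minos/scratch > /dev/null 2>&1' \
        || echo "flxb${N} is missing a /minos mount"
done
# then submit around the bad host(s), e.g.
# bsub -R "hname!=flxb32" ls -ld /minos/data /minos/scratch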
___________________________________________ ######## # DATA # ######## Per query from grashorn, re 'missing' subruns in cedar, RUNSUBS=' 24183:0001 24183:0002 24183:0003 24183:0004 24186:0001 24186:0002 24186:0003 24955:0002 25020:0001 25020:0002 25021:0000 25021:0001 25021:0002 25022:0001 25022:0002 25023:0001 25023:0002 25024:0001 25024:0002 25025:0001 25025:0002 26836:0000 27666:0001 27666:0002 27666:0003 27669:0001 27669:0002 27669:0003 27669:0004 27669:0005 27669:0006 27669:0007 28273:0001 28466:0004 28636:0007 29127:0004 29217:0004 29232:0006 29233:0006 29233:0007 30107:0003 30108:0001 30108:0002 30108:0003 34220:0001 34220:0002 34220:0003 34220:0004 34220:0005 34224:0001 34224:0002 35640:0009 35724:0013 35727:0005 36869:0009 37230:0017 37676:0020 37676:0022 ' for RUNSUB in ${RUNSUBS} ; do printf "${RUNSUB} " RUN=`printf "${RUNSUB}" | cut -f 1 -d :` SUB=`printf "${RUNSUB}" | cut -f 2 -d :` SAMDIM=" VERSION cedar \ and DATA_TIER sntp-far \ and PARENT_BY_NAME F000${RUN}_${SUB}.mdaq.root \ " #echo ${SAMDIM} sam list files --nosummary --dim="${SAMDIM}" done ########### # MONTHLY # ########### DATASETS 2/4 PREDATOR 2/4 VAULT 2/4 MYSQL 2/4 ######### # MYSQL # ######### > Sometime in the past hour (it's now Mon Feb 4 06:48:29 CST 2008) we > lost connectivity to minos-db1.fnal.gov:- mysql server down on minos-mysql1 I see a problem in /var/log/messages.1 : Feb 3 02:15:02 minos-mysql1 kernel: hdc: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Feb 3 02:15:02 minos-mysql1 kernel: hdc: drive_cmd: error=0x04Aborted Command Feb 3 02:15:02 minos-mysql1 kernel: hdc: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Feb 3 02:15:02 minos-mysql1 kernel: hdc: drive_cmd: error=0x04Aborted Command Feb 3 02:15:02 minos-mysql1 kernel: cdrom: open failed. Feb 3 02:15:02 minos-mysql1 kernel: cdrom: open failed. But that's just the CD rom, so this should be harmless. ( But why was anything accessing this at 2 AM ? ) Uh-Oh, the usual archive area Mysql> ls -l /data/archive/COPY total 0 -rw-r--r-- 1 root root 0 Jan 23 14:15 empty_file_4_tibs What's going on ? Where are the backup files ? I see that /etc/resolv.conf lists only one nameserver, search fnal.gov nameserver 131.225.8.120 less /data/database/minos-mysql1.fnal.gov.err 080204 6:27:00 [Note] /local/ups/prd/mysql/v4_1_11/Linux-2/libexec/mysqld: Shutdown complete But the usual message about mysqld ended is missing, like : 070821 15:34:09 mysqld ended Backing up mysql database, need for monthly anyway cd ${DBHOME}/offline dds -tr ... -rw-rw---- 1 minsoft e875 31744 Feb 4 06:26 BEAMMONFILESUMMARY.MYI -rw-rw---- 1 minsoft e875 4096 Feb 4 06:26 BEAMMONCUTSVLD.MYI -rw-rw---- 1 minsoft e875 2048 Feb 4 06:26 BEAMMONCUTS.MYI Mysql> script -a ${DBCOPY}/offline/offline.log Script started, file is /minos/data/mysql/archive/20080204/offline/offline.log [minsoft@minos-mysql1 offline]$ time cp -av --target-directory=/minos/data/mysql/archive/20080204/offline *.frm [minsoft@minos-mysql1 offline]$ time cp -av --target-directory=/minos/data/mysql/archive/20080204/offline *.MYD real 30m47.148s user 0m0.990s sys 2m16.440s [root@minos-mysql1 ~]# /etc/init.d/mysql start Starting MySQL................................... 
[FAILED] 080204 10:41:49 mysqld started 080204 10:41:49 [ERROR] /local/ups/prd/mysql/v4_1_11/Linux-2/libexec/mysqld: unknown variable '--binlog-do-db=crl_v1' 080204 10:41:49 mysqld ended Changed /data/database/my.cnf to binlog-do-db = crl_v1 binlog-do-db = offline [root@minos-mysql1 ~]# /etc/init.d/mysql start Starting MySQL [ OK ] 080204 10:44:11 mysqld started /local/ups/prd/mysql/v4_1_11/Linux-2/libexec/mysqld: ready for connections. Version: '4.1.11-log' socket: '/data/database/mysql.sock' port: 3306 Source distribution Checked for recent temp entries , there are non since restart ( 372 ) There are plenty before, ( 371 ) Mysql> mysqlbinlog -s -d offline ${DBBINS}/minos.000372 | less Mysql> mysqlbinlog -s -d crl_v1 ${DBBINS}/minos.000372 | less Mysql> mysqlbinlog -s -d temp ${DBBINS}/minos.000372 | less ============================================================================= 2008 02 01 ############ # PRODUCTS # ############ Preparing for global up of products etc. Options : -m < check that we have no mount points before copying > AFSC=/afs/fnal.gov/files/code/e875 UPI=${AFSC}/general/products UPO=/afs/fnal.gov/files/data/minos/d141 time up ${UPI} ${UPO} Unable to set owner-id for /afs/fnal.gov/files/data/minos/d141/db/neugen3/CVS/Entries to 5922 Unable to set group-id for /afs/fnal.gov/files/data/minos/d141/db/neugen3/CVS/Entries to 5111 ... Unable to set group-id for /afs/fnal.gov/files/data/minos/d141/prd/pacman/v3_20/NULL/src/pythonCheck.pyc to 5111 Unable to set group-id for /afs/fnal.gov/files/data/minos/d141/prd/pacman/v3_20/NULL/src/Alias.pyc to 5111 real 16m55.340s user 0m1.469s sys 2m9.036s MIN > fs listquota /afs/fnal.gov/files/code/e875/general/products Volume Name Quota Used %Used Partition c.e875.d1 8000000 3266914 41% 59% MIN > fs listquota /afs/fnal.gov/files/data/minos/d141 Volume Name Quota Used %Used Partition nb.minos.d141 50000000 3327363 7% 53% PLINKS=' GENIE LOG4CPP MINOS_EXTERN MINOS_ROOT NEUGEN3 PYTHIA6 stdhep ' D141=/afs/fnal.gov/files/data/minos/d141 for PLINK in ${PLINKS} ; do printf " OK - copying ${PLINK} " date du -sm ${AFSC}/releases/${PLINK} UPX=${D141}/prd/${PLINK} UPI=${AFSC}/releases/${PLINK} UPO=${D141}/prd/${PLINK} [ -L "${UPX}" ] && ls -l ${UPX} && rm ${UPX} time up ${UPI} ${UPO} done 2>&1 | tee /tmp/plinkup.log MINOS26 > grep -v 'Unable to set ' /tmp/plinkup.log OK - copying GENIE Fri Feb 1 17:00:34 CST 2008 600 /afs/fnal.gov/files/code/e875/releases/GENIE real 1m36.736s user 0m0.203s sys 0m18.265s OK - copying LOG4CPP Fri Feb 1 17:02:12 CST 2008 91 /afs/fnal.gov/files/code/e875/releases/LOG4CPP real 1m16.498s user 0m0.220s sys 0m9.826s OK - copying MINOS_EXTERN Fri Feb 1 17:03:30 CST 2008 2048 /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN real 17m30.968s user 0m3.110s sys 2m20.089s OK - copying MINOS_ROOT Fri Feb 1 17:21:27 CST 2008 20537 /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT lrwxr-xr-x 1 kreymer g020 49 Feb 1 15:00 /afs/fnal.gov/files/data/minos/d141/prd/MINOS_ROOT -> /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT real 194m50.950s user 0m23.238s sys 19m52.623s OK - copying NEUGEN3 Fri Feb 1 20:41:50 CST 2008 841 /afs/fnal.gov/files/code/e875/releases/NEUGEN3 lrwxr-xr-x 1 kreymer g020 46 Feb 1 15:00 /afs/fnal.gov/files/data/minos/d141/prd/NEUGEN3 -> /afs/fnal.gov/files/code/e875/releases/NEUGEN3 real 3m0.463s user 0m0.413s sys 0m28.601s OK - copying PYTHIA6 Fri Feb 1 20:44:55 CST 2008 183 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 lrwxr-xr-x 1 kreymer g020 46 Feb 1 15:00 /afs/fnal.gov/files/data/minos/d141/prd/PYTHIA6 
-> /afs/fnal.gov/files/code/e875/releases/PYTHIA6 real 0m51.456s user 0m0.116s sys 0m7.127s OK - copying stdhep Fri Feb 1 20:45:47 CST 2008 27 /afs/fnal.gov/files/code/e875/releases/stdhep lrwxr-xr-x 1 kreymer g020 45 Feb 1 15:00 /afs/fnal.gov/files/data/minos/d141/prd/stdhep -> /afs/fnal.gov/files/code/e875/releases/stdhep real 0m10.084s user 0m0.019s sys 0m1.218s ############## # MINOS_DATA # ############## CDIRS=`(cd /afs/fnal.gov/files/data/minos/d10/indexes ; ls *_near.cedar.index)` MINOS26 > ls -l ${CDIRS} -rw-r--r-- 1 rubin e875 4500 Sep 26 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-04_near.cedar.index -rw-r--r-- 1 rubin e875 32200 Oct 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-05_near.cedar.index -rw-r--r-- 1 rubin e875 31000 Feb 5 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2005-06_near.cedar.index -rw-r--r-- 1 rubin e875 33350 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-07_near.cedar.index -rw-r--r-- 1 rubin e875 35000 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-08_near.cedar.index -rw-r--r-- 1 rubin e875 35200 Mar 23 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2005-09_near.cedar.index -rw-r--r-- 1 rubin e875 19700 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-10_near.cedar.index -rw-r--r-- 1 rubin e875 32800 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-11_near.cedar.index -rw-r--r-- 1 rubin e875 24150 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-12_near.cedar.index -rw-r--r-- 1 rubin e875 33550 Feb 5 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2006-01_near.cedar.index -rw-r--r-- 1 rubin e875 25950 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-02_near.cedar.index -rw-r--r-- 1 rubin e875 0 Nov 26 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-05_near.cedar.index -rw-r--r-- 1 rubin e875 33400 Dec 26 16:19 /afs/fnal.gov/files/data/minos/d10/indexes/2006-06_near.cedar.index -rw-r--r-- 1 rubin e875 5900 Dec 26 16:19 /afs/fnal.gov/files/data/minos/d10/indexes/2006-07_near.cedar.index -rw-r--r-- 1 rubin e875 14450 Dec 13 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-08_near.cedar.index -rw-r--r-- 1 rubin e875 20450 Dec 11 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-09_near.cedar.index -rw-r--r-- 1 rubin e875 30850 Dec 11 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-10_near.cedar.index -rw-r--r-- 1 rubin e875 26850 Dec 4 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-11_near.cedar.index -rw-r--r-- 1 rubin e875 35900 Mar 23 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2006-12_near.cedar.index -rw-rw-r-- 1 rubin e875 30000 Oct 24 14:22 /afs/fnal.gov/files/data/minos/d10/indexes/2007-01_near.cedar.index -rw-r--r-- 1 rubin e875 31650 Mar 2 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2007-02_near.cedar.index -rw-r--r-- 1 rubin e875 36250 Apr 2 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2007-03_near.cedar.index -rw-r--r-- 1 rubin e875 35000 May 2 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2007-04_near.cedar.index -rw-r--r-- 1 rubin e875 2900 May 5 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2007-05_near.cedar.index mindata : cd ~kreymer/minos/scripts for CDIR in ${CDIRS} ; do ./afs2nfs -i ${CDIR} -n ; done Complete except for STREAM sntp to /minos/data/reco_near/cedar/sntp_data/2007-01 1569 cp recodata76/N00011446_0018.spill.sntp.cedar.0.root /minos/data/reco_near/cedar/sntp_data/2007-01/N00011446_0018.spill.sntp.cedar.0.root cp recodata77/N00011648_0003.spill.sntp.cedar.0.root 
/minos/data/reco_near/cedar/sntp_data/2007-01/N00011648_0003.spill.sntp.cedar.0.root $ dds /afs/fnal/files/data/minos/d10/recodata76/N00011446_0018.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 73341213 Jan 2 2007 /afs/fnal/files/data/minos/d10/recodata76/N00011446_0018.spill.sntp.cedar.0.root $ dds /afs/fnal/files/data/minos/d10/recodata77/N00011648_0003.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 2181427 Jan 26 2007 /afs/fnal/files/data/minos/d10/recodata77/N00011648_0003.spill.sntp.cedar.0.root ./afs2nfs -i 2007-01_near.cedar.index This is picking up much more than 2 files, the whole month was missing, all 600 files 600/ 600 recodata77/N00011648_0003.spill.sntp.cedar.0.root STREAM sntp rate 17865 38G /minos/data/reco_near/cedar/sntp_data/2007-01 STARTED Fri Feb 1 10:45:10 CST 2008 FINISHED Fri Feb 1 11:22:03 CST 2008 CFDIRS=`(cd /afs/fnal.gov/files/data/minos/d10/indexes ; ls *_far.cedar.index)` for CFDIR in ${CDIRS} ; do ./afs2nfs -i ${CFDIR} -n ; done all present and accounted for CHECKING FOR EMPTY LARGE DATA VOLUMES, NONE AT PRESENT cd /afs/fnal.gov/files/data/minos DIRS=`ls -d d?? d???` for DIR in ${DIRS} ; do printf "${DIR} " ; fs listquota ${DIR} | grep nb ; done > /tmp/mdd cat /tmp/mdd | sort -rk 5 Removed cedar for DIR in ${DIRS} ; do printf "${DIR} " ; fs listquota ${DIR} | grep nb ; done > /tmp/mdd2 cat /tmp/mdd2 | sort -rk 5 d199 nb.minos.d199 50000000 172 0% 79% d198 nb.minos.d198 50000000 59211 0% 79% dbm nb.data.minosd11 4000000 551 0% 76% d86 nb.minos.d86 50000000 6 0% 72% d58 nb.minos.d58 8000000 8 0% 72% d10 nb.data.minosd10 8000000 8439 0% 72% d245 nb.minos.d245 50000000 29 0% 65% d243 nb.minos.d243 50000000 34 0% 61% d141 nb.minos.d141 50000000 242 0% 52% for RDIR in d86 d141 d198 d199 d243 d245 ; do find ${RDIR} -type d ; done for RDIR in d141 d198 d199 ; do ls -R ${RDIR} ; done for RDIR in d141 d198 d199 ; do find ${RDIR} -type f ; done d198/recodata72/c10000845_0005.sntp.cedar.root grep c10000845_0005.sntp.cedar.root d10/indexes/*.index d10/indexes/mc_cosmic.bfld201.cedar.index:recodata72/c10000845_0005.sntp.cedar.root Bottom line, d141 and d199 are clear. rm -r d141/reco* rm -r d199/reco* rm d199/indexes rm d141/indexes rm d10/recodata52 rm d10/recodata73 ############## # MINOS_DATA # ############## REMOVING CEDAR NTUPLES FROM AFS rubin@fnpcsrv1 : cat shrc/kreymer cut/paste cd /afs/fnal.gov/files/data/minos/d10/indexes ls -l *ar.cedar.index Ran a test pass, counting files ./rv cedar noop | grep -v rm Removed net 28698 files ./rv cedar noop > /var/tmp/rv.cedar.log 14:06 ./rv cedar 2>&1 | tee /var/tmp/rvdo.cedar.log This procedure will erase all cedar ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Removing 2004-08_far.cedar.index Removed 0 files Removing 2004-09_far.cedar.index Removed 0 files ... 
Removing 2007-05_near.cedar.index Removed 58 files Removed net 28698 files grep ^rm /var/tmp/rv.cedar.log | cut -f 2 -d / | sort -u recodata23 recodata26 recodata34 recodata35 recodata36 recodata37 recodata38 recodata39 recodata40 recodata42 recodata43 recodata44 recodata45 recodata46 recodata47 recodata51 recodata52 recodata53 recodata54 recodata56 recodata57 recodata58 recodata59 recodata60 recodata61 recodata62 recodata63 recodata64 recodata65 recodata66 recodata67 recodata68 recodata69 recodata70 recodata71 recodata72 recodata73 recodata74 recodata75 recodata76 recodata77 recodata78 recodata79 recodata80 recodata81 recodata82 recodata83 recodata84 recodata85 recodata86 recodata88 recodata89 recodata90 recodata91 recodata92 recodata93 recodata94 recodata95 recodata96 recodata97 recodata98 ######### # FNALU # ######### for NODE in flxb16 $INODES ; do printf "${NODE} " ; ssh -ax ${NODE} 'df -h /minos/data | grep nas' ; done flxb16 minos-nas-0.fnal.gov:/minos/data flxb10 minos-nas-0.fnal.gov:/minos/data flxb11 minos-nas-0.fnal.gov:/minos/data flxb13 minos-nas-0.fnal.gov:/minos/data flxb17 minos-nas-0.fnal.gov:/minos/data flxb23 Could not chdir to home directory /afs/fnal.gov/files/home/room1/kreymer: No such file or directory minos-nas-0.fnal.gov:/minos/data flxb24 minos-nas-0.fnal.gov:/minos/data flxb25 minos-nas-0.fnal.gov:/minos/data flxb30 minos-nas-0.fnal.gov:/minos/data flxb31 minos-nas-0.fnal.gov:/minos/data flxb32 minos-nas-0.fnal.gov:/minos/data flxb33 minos-nas-0.fnal.gov:/minos/data flxb34 minos-nas-0.fnal.gov:/minos/data flxb35 minos-nas-0.fnal.gov:/minos/data This recovered by 11:00 ( 17:00 UTC ) ######## # FARM # ######## IMANODES='fnpc239 fnpc240 fnpc241 fnpc242 fnpc243 fnpc244 fnpc245 fnpc246' for NODE in ${IMANODES} ; do printf "${NODE} " ssh -ax minos25 'grep ^OPTIONS /etc/sysconfig/afs ; ps -flu root | grep vice' done ============================================================================= 2008 01 31 ########## # CONDOR # ########## Drafting minoscondorsupport.txt ( home of desktop ) emailed to berman, timm, sfiligoi, jallen ######### # FNALU # ######### flxb16 was upgraded to SLF 4 around 15:00 today, lacks /minos/data and /minos/scratch. Reported to mgreaney, logged under ticket 110383 ######## # GRID # ######## for ID in 339 340 341 342 343 344 345 346 ; do printf "fnpc${ID} " ssh -ax fnpc${ID} ls -ld /afs/fnal/files/code/e875 ; done ############## # MINOS_DATA # ############## From CFL 30991 1807 reco_near/cedar/.*nt._data/ 83803 1298 reco_far/cedar/.*nt._data/ PNFS COUNTS grep /pnfs/minos/reco_near/cedar/sntp_data CFL/CFL | wc -l 22856 grep /pnfs/minos/reco_far/cedar/sntp_data CFL/CFL | wc -l 39703 grep /pnfs/minos/reco_far/cedar/.bntp_data CFL/CFL | wc -l 13990 /minos/data COUNTS find /minos/data/reco_near/cedar/sntp_data -type f | wc -l 11874 find /minos/data/reco_far/cedar/sntp_data -type f | wc -l 16655 find /minos/data/reco_far/cedar/.bntp_data -type f | wc -l 88 index COUNTS wc -l /afs/fnal.gov/files/data/minos/d10/indexes/*_near.cedar.index 12220 wc -l /afs/fnal.gov/files/data/minos/d10/indexes/*_far.cedar.index 16478 Let's purge the near cedar first ################ # AFS SYMLINKS # ################ N.B. - there is an 'up' command for cloning AFS file trees, which preserves ACL's. 
up ####### # NET # ####### Date: Thu, 31 Jan 2008 10:17:25 -0600 (CST) Subject: HelpDesk ticket 110393 Short Description: MRTG plots not available for r-s-fcc2-server Problem Description: I cannot view MRTG plots for nodes on r-s-fcc2-server, Things seem to be OK for nodes like minorora1 (s-s-fcc1-server) For example, at http://fndcg0.fnal.gov/~netadmin/NodeLocator/mrtg-search.cgi?hname=minos01. fnal.gov I see Search Results for minos01.fnal.gov 131.225.193.1 is connected to r-s-fcc2-server on port Gi8/26 (minos01) Last detected on this switch at 2008/01/31/09:41 1 node is connected to port Gi8/26 of r-s-fcc2-server. Looking Glass Error: Unknown area name for Device ___________________________________________ Date: Thu, 31 Jan 2008 10:22:18 -0600 (CST) This ticket has been reassigned to WOHLT, DARRYL of the CD-LSCS/CNCS/SN Group. ___________________________________________ Date: Thu, 31 Jan 2008 10:29:10 -0600 (CST) Note To Requester: darryl@fnal.gov sent this Notes To Requester: Art, it looks like you're using the old NodeLocator. Please change your bookmark to http://www-dcn.fnal.gov/~netadmin/m-s-nodelocator/NodeLocator/search.html and let me know if this fixes the problem. Darryl ___________________________________________ Yes, corrected this in /afs/fnal.gov/files/expwww/numi/html/computing/dh dhmain.html.20080131 ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n"; ssh -ax ${NODE} 'grep afs: /var/log/messages | grep "Jan " | grep -v Tokens | uniq'; done minos03 Jan 27 12:56:37 minos03 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 27 12:56:42 minos03 kernel: afs: file server 131.225.68.49 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos11 Jan 30 14:25:45 minos11 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 30 14:25:46 minos11 kernel: afs: failed to store file (110) Jan 30 14:26:43 minos11 kernel: afs: failed to store file (110) Jan 30 14:27:58 minos11 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos21 Jan 31 16:27:46 minos21 kernel: afs: Lost contact with file server 131.225.68.65 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 31 16:30:23 minos21 kernel: afs: file server 131.225.68.65 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos23 Jan 31 16:40:30 minos23 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 31 16:40:32 minos23 kernel: afs: Lost contact with file server 131.225.68.11 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 31 16:43:38 minos23 kernel: afs: file server 131.225.68.49 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Jan 31 16:43:38 minos23 kernel: afs: file server 131.225.68.11 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos25 Jan 29 12:07:01 minos25 kernel: libafs: Ignoring new-style parameters in presence of obsolete ones Jan 29 12:19:35 minos25 kernel: afs: Lost contact with file server 192.168.67.1 in cell fnal.gov (multi-homed address; other same-host interfaces maybe up) ######### # FNALU # ######### Date: Thu, 31 Jan 2008 08:54:57 -0600 (CST) Subject: 
HelpDesk ticket 110383 ___________________________________________ Ticket #: 110383 ___________________________________________ Short Description: FNALU batch node flxb17 shows no recent activity Problem Description: fnalu-admin : FNALU batch Node flxb17 seems to be stuck. The last active job was started over 4 days ago. $ bjobs -u all -r JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME 122890 pawlosk RUN minos minos02.fna flxb17.fnal *>& log193 Jan 27 01:51 122897 pawlosk RUN minos minos02.fna flxb17.fnal *>& log200 Jan 27 01:51 .. bjobs -l 122890 .. Sun Jan 27 02:00:26: Resource usage collected. The CPU time used is 217 seconds. MEM: 437 Mbytes; SWAP: 559 Mbytes; NTHREAD: 3 PGID: 29728; PIDs: 29728 29748 29749 ___________________________________________ Date: Thu, 31 Jan 2008 09:04:26 -0600 (CST) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Thu, 31 Jan 2008 09:11:28 -0600 (CST) Note To Requester: mgreaney@fnal.gov sent this Notes To Requester: Art, I will reset it it today and let you know. Margaret ___________________________________________ Date: Thu, 31 Jan 2008 09:11:29 -0600 (CST) Note To Requester: mgreaney@fnal.gov sent this Notes To Requester: I can't access this from the console and will reset it later today. thanks, margaret ___________________________________________ ################ # AFS SYMLINKS # ################ Summary of symlink crosslinks code/e875/general/MINOS_EXTERNAL none code/e875/general/ROOT /code/e875/releases2/ROOT data/minos/d04/libraries code/e875/general/minossoft code/e875/releases/SRT_BINLIBTMP code/e875/releases1 code/e875/releases2 data/minos/d04/libraries/DatabaseTables/HEAD code/e875/general/products code/e875/general/ups code/e875/releases/GENIE code/e875/releases/LOG4CPP code/e875/releases/MINOS_EXTERN code/e875/releases/MINOS_ROOT code/e875/releases/NEUGEN3 code/e875/releases/PYTHIA6 code/e875/releases/stdhep code/e875/releases code/e875/general/MINOS_EXTERNAL code/e875/general/ROOT code/e875/general/bin code/e875/general/ups/prd/MINOS_EXTERN code/e875/general/ups/prd/PYTHIA6 code/e875/releases1 none code/e875/releases2 code/e875/general/ROOT/config_build_root.sh code/e875/general/ups/prd/MINOS_ROOT code/e875/sim miscellaneous data/minos/d04/libraries none ============================================================================= 2008 01 30 ########## # CONDOR # ########## glideinWMS 1.1 available ########## # CONDOR # ########## per http://www.cs.wisc.edu/condor/ Stable series: Condor Version 7.0.0 released January 22nd, 2008 Development series: Condor Version 7.1.0 is coming soon Previous Stable series: Condor Version 6.8.8 released December 20th, 2007 ############ # PREDATOR # ############ 10:05 Corrupt .sam.py due to timeout of genpy for F00040244_0003 ( and timeout of _0004 ) /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/fardet_data/2008-01 MINOS26 > mv F00040244_0003.sam.py F00040244_0003.sam.py.BAD MINOS26 > mv F00040244_0004.sam.py F00040244_0004.sam.py.BAD These were picked up cleanly on the next cycle. 
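A sketch of the same quarantine applied to a whole month at once; the assumption that a corrupt .sam.py can be detected by a python compile check is mine, and has not been tested against predator's own validation:

cd /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/fardet_data/2008-01
for PY in *.sam.py ; do
    python -c "compile(open('${PY}').read(),'${PY}','exec')" > /dev/null 2>&1 \
        || mv ${PY} ${PY}.BAD      # quarantine, so the next cycle regenerates it
done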
########## # PARROT # ########## Tracking down symlinks in afs can be messy, because of equivalent /afs/fnal.gov /afs/.fnal.gov /afs/fnal There are just a few of these SIM ( /afs/fnal/ ) /afs/fnal.gov/files/code/e875/sim/ /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOGEANT.ddl /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOMETRY.ddl /afs/fnal/files/code/e875/sim/hermes_db/include/partap.inc MINOS26 > printf "${SLINKS}\n" | grep /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEO lrwxr-xr-x 1 para 1507 63 Apr 19 1996 /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/ddl/GEOMETRY.ddl -> /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOMETRY.ddl lrwxr-xr-x 1 para 1507 63 Apr 19 1996 /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/ddl/GEOGEANT.ddl -> /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOGEANT.ddl MINOS26 > find /afs/fnal.gov/files/code/e875/sim/hermes -name GEOMETRY.ddl /afs/fnal.gov/files/code/e875/sim/hermes/ddl/GEOMETRY.ddl /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/ddl/GEOMETRY.ddl /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/hermes/ddl/GEOMETRY.ddl MINOS26 > diff /afs/fnal.gov/files/code/e875/sim/hermes/ddl/GEOMETRY.ddl /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/hermes/ddl/GEOMETRY.ddl Corrected these broken symlinks cd /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/ddl ln -sf ../hermes/ddl/GEOGEANT.ddl GEOGEANT.ddl ln -sf ../hermes/ddl/GEOMETRY.ddl GEOMETRY.ddl MINOS26 > printf "${SLINKS}\n" | grep /afs/fnal/files/code/e875/sim/hermes_db/include/partap.inc lrwxr-xr-x 1 para 1507 58 Apr 19 1996 /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/hermes/src/partap.inc -> /afs/fnal/files/code/e875/sim/hermes_db/include/partap.inc MINOS26 > find /afs/fnal/files/code/e875/sim/hermes -type f -name partap.inc /afs/fnal/files/code/e875/sim/hermes/hermes_db/include/partap.inc cd /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/hermes/src ln -sf ../../include/partap.inc partap.inc PRODUCTS /afs/fnal.gov/files/code/e875/general/products /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/shutdown/ups_shutdown /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/startup/ups_startup ################ # AFS SYMLINKS # ################ OK, we really need a map of AFS directories, vols and symlinks needed for parrot. First, a symlink map. 
SLINKS=`find ${BASE} -type l -exec ls -l {} \;` printf "${SLINKS}\n" \ | cut -f 2 -d '>' \ | tr -d '[:blank:]' \ | grep ^/afs \ | grep -v ${BASE} \ | sort -u BASE=/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL 7/8 G code.e875.general ( none ) BASE=/afs/fnal.gov/files/code/e875/general/ROOT 7/8 G code.e875.general /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_2/v3_05_05 /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-00-02 /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-00-03A /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-00-04 /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-00-08 /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-01-02 /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-02-00 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3-05-07 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3-05-07-opt /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3-10-01 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3-10-01-opt /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-02 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-03A /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-04/ /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-08 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-08d /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-08e /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-08f /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-01-02/ /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-01-04 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-02-00 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-02-00-opt /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/bleeding-edge /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v4-02-00 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v4-02-00-opt /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v4-04-02 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v4-04-02-opt /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v5-08-00 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v5-08-00-opt /afs/fnal.gov/files/data/minos/d04/libraries/IRIX6.5-GCC_3_2/v3_05_03 /afs/fnal.gov/files/data/minos/d04/libraries/IRIX6.5-GCC_3_2/v3_05_04 /afs/fnal.gov/files/data/minos/d04/libraries/Linux2.4-GCC_3_2/v3_05_00 /afs/fnal.gov/files/data/minos/d04/libraries/Linux2.4-GCC_3_2/v3_05_04 BASE=/afs/fnal.gov/files/code/e875/general/minossoft 7/8 G code.e875.general /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.18.4/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.18.4/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.18.4/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.22/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.22/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.22/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.23/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.23/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.23/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.0/bin 
/afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.0/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.0/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.1/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.1/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.1/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.2/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.2/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.2/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.3/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.3/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.3/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.28/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.28/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.28/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-02-16-R1-21/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-02-16-R1-21/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-02-16-R1-21/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-09-29-R1-24/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-09-29-R1-24/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-09-29-R1-24/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-10-12-R1-24/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-10-12-R1-24/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-10-12-R1-24/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-11-10-R1-24/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-11-10-R1-24/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-11-10-R1-24/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-02-23-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-02-23-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-02-23-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-09-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-09-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-09-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-24-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-24-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-24-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-04-06-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-04-06-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-04-06-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-06-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-06-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-06-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-17-R1-25/bin 
/afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-17-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-17-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-04-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-04-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-04-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-20-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-20-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-20-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-13-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-13-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-13-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-27-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-27-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-27-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-06-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-06-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-06-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-20-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-20-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-20-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-10-22-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-10-22-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-10-22-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-11-10-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-11-10-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-11-10-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-12-22-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-12-22-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-12-22-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/tmp /afs/fnal.gov/files/code/e875/releases1/R0.18.0/bin /afs/fnal.gov/files/code/e875/releases1/R0.18.0/lib /afs/fnal.gov/files/code/e875/releases1/R0.18.0/tmp /afs/fnal.gov/files/code/e875/releases1/R0.20.0/bin /afs/fnal.gov/files/code/e875/releases1/R0.20.0/lib /afs/fnal.gov/files/code/e875/releases1/R0.20.0/tmp /afs/fnal.gov/files/code/e875/releases1/R0.21/bin /afs/fnal.gov/files/code/e875/releases1/R0.21/lib /afs/fnal.gov/files/code/e875/releases1/R0.21/tmp /afs/fnal.gov/files/code/e875/releases1/R0.22/bin /afs/fnal.gov/files/code/e875/releases1/R0.22/lib /afs/fnal.gov/files/code/e875/releases1/R0.22/tmp /afs/fnal.gov/files/code/e875/releases1/R0.8.0/bin /afs/fnal.gov/files/code/e875/releases1/R0.8.0/lib /afs/fnal.gov/files/code/e875/releases1/R0.8.0/tmp /afs/fnal.gov/files/code/e875/releases1/R1.10/bin /afs/fnal.gov/files/code/e875/releases1/R1.10/lib /afs/fnal.gov/files/code/e875/releases1/R1.10/tmp /afs/fnal.gov/files/code/e875/releases1/R1.11/bin /afs/fnal.gov/files/code/e875/releases1/R1.11/lib /afs/fnal.gov/files/code/e875/releases1/R1.11/tmp /afs/fnal.gov/files/code/e875/releases1/R1.17/bin /afs/fnal.gov/files/code/e875/releases1/R1.17/lib /afs/fnal.gov/files/code/e875/releases1/R1.17/tmp 
/afs/fnal.gov/files/code/e875/releases1/R1.18.1/bin /afs/fnal.gov/files/code/e875/releases1/R1.18.1/lib /afs/fnal.gov/files/code/e875/releases1/R1.18.1/tmp /afs/fnal.gov/files/code/e875/releases1/R1.18.2/bin /afs/fnal.gov/files/code/e875/releases1/R1.18.2/lib /afs/fnal.gov/files/code/e875/releases1/R1.18.2/tmp /afs/fnal.gov/files/code/e875/releases1/R1.18/bin /afs/fnal.gov/files/code/e875/releases1/R1.18/lib /afs/fnal.gov/files/code/e875/releases1/R1.18/tmp /afs/fnal.gov/files/code/e875/releases1/R1.2/bin /afs/fnal.gov/files/code/e875/releases1/R1.2/lib /afs/fnal.gov/files/code/e875/releases1/R1.2/tmp /afs/fnal.gov/files/code/e875/releases1/R1.20/bin /afs/fnal.gov/files/code/e875/releases1/R1.20/lib /afs/fnal.gov/files/code/e875/releases1/R1.20/tmp /afs/fnal.gov/files/code/e875/releases1/R1.21/bin /afs/fnal.gov/files/code/e875/releases1/R1.21/lib /afs/fnal.gov/files/code/e875/releases1/R1.21/tmp /afs/fnal.gov/files/code/e875/releases1/R1.3/bin /afs/fnal.gov/files/code/e875/releases1/R1.3/lib /afs/fnal.gov/files/code/e875/releases1/R1.3/tmp /afs/fnal.gov/files/code/e875/releases1/development/bin /afs/fnal.gov/files/code/e875/releases1/development/lib /afs/fnal.gov/files/code/e875/releases1/development/tmp /afs/fnal.gov/files/code/e875/releases1/doxygen/loon /afs/fnal.gov/files/code/e875/releases2/R1.0/bin /afs/fnal.gov/files/code/e875/releases2/R1.0/lib /afs/fnal.gov/files/code/e875/releases2/R1.0/tmp /afs/fnal.gov/files/code/e875/releases2/R1.12/bin /afs/fnal.gov/files/code/e875/releases2/R1.12/lib /afs/fnal.gov/files/code/e875/releases2/R1.12/tmp /afs/fnal.gov/files/code/e875/releases2/R1.13/bin /afs/fnal.gov/files/code/e875/releases2/R1.13/lib /afs/fnal.gov/files/code/e875/releases2/R1.13/tmp /afs/fnal.gov/files/code/e875/releases2/R1.14/bin /afs/fnal.gov/files/code/e875/releases2/R1.14/lib /afs/fnal.gov/files/code/e875/releases2/R1.14/tmp /afs/fnal.gov/files/code/e875/releases2/R1.15/bin /afs/fnal.gov/files/code/e875/releases2/R1.15/lib /afs/fnal.gov/files/code/e875/releases2/R1.15/tmp /afs/fnal.gov/files/code/e875/releases2/R1.16/bin /afs/fnal.gov/files/code/e875/releases2/R1.16/lib /afs/fnal.gov/files/code/e875/releases2/R1.16/tmp /afs/fnal.gov/files/code/e875/releases2/R1.5/bin /afs/fnal.gov/files/code/e875/releases2/R1.5/lib /afs/fnal.gov/files/code/e875/releases2/R1.5/tmp /afs/fnal.gov/files/code/e875/releases2/R1.6/bin /afs/fnal.gov/files/code/e875/releases2/R1.6/lib /afs/fnal.gov/files/code/e875/releases2/R1.6/tmp /afs/fnal.gov/files/code/e875/releases2/R1.7/bin /afs/fnal.gov/files/code/e875/releases2/R1.7/lib /afs/fnal.gov/files/code/e875/releases2/R1.7/tmp /afs/fnal.gov/files/code/e875/releases2/R1.8/bin /afs/fnal.gov/files/code/e875/releases2/R1.8/lib /afs/fnal.gov/files/code/e875/releases2/R1.8/tmp /afs/fnal.gov/files/code/e875/releases2/R1.9/bin /afs/fnal.gov/files/code/e875/releases2/R1.9/lib /afs/fnal.gov/files/code/e875/releases2/R1.9/tmp /afs/fnal.gov/files/data/minos/d04/libraries/DatabaseTables/HEAD BASE=/afs/fnal.gov/files/code/e875/general/products 8G c.e875.d1 /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/shutdown/ups_shutdown /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/startup/ups_startup /afs/fnal.gov/files/code/e875/general/ups/db/sam_batch_adapter/v0_9_9_5 /afs/fnal.gov/files/code/e875/general/ups/db/sam_config/v4_2_34 /afs/fnal.gov/files/code/e875/general/ups/db/sam_config/v4_2_34/config.env /afs/fnal.gov/files/code/e875/general/ups/prd/misweb/v2_23_5/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam/v7_6_3/Linux-2 
/afs/fnal.gov/files/code/e875/general/ups/prd/sam/v7_6_5/Linux-2 /afs/fnal.gov/files/code/e875/general/ups/prd/sam/v7_7_1/Linux-2 /afs/fnal.gov/files/code/e875/general/ups/prd/sam/v8_1_3/Linux-2 /afs/fnal.gov/files/code/e875/general/ups/prd/sam/v8_2_0/Linux-2 /afs/fnal.gov/files/code/e875/general/ups/prd/sam_batch_adapter/v0_9_9_5/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_bootstrap/v4_4_1/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_config/v4_2_28/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_config/v4_2_34/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_cpp_api/v7_2_1/Linux-2-4-GCC-3-4-3 /afs/fnal.gov/files/code/e875/general/ups/prd/sam_cpp_api/v7_4_2/Linux-2-4-GCC-3-4-3 /afs/fnal.gov/files/code/e875/general/ups/prd/sam_cpp_api/v7_4_3/Linux-2-4-GCC-3-4-3 /afs/fnal.gov/files/code/e875/general/ups/prd/sam_ns_ior/v7_0_0/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_ns_ior/v7_1_0/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_web_services/v0_9_8/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_web_services/v0_9_9/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_web_services_client/v0_9_2/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/samgrid_batch_adapter/v7_1_0/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/vdt/v1_1_14_13/Linux /afs/fnal.gov/files/code/e875/general/ups/prd/vdt/v1_6_1_0/Linux /afs/fnal.gov/files/code/e875/general/ups/prd/vdt/v1_8_1_1/Linux /afs/fnal.gov/files/code/e875/releases/GENIE /afs/fnal.gov/files/code/e875/releases/LOG4CPP /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT /afs/fnal.gov/files/code/e875/releases/NEUGEN3 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 /afs/fnal.gov/files/code/e875/releases/stdhep BASE=/afs/fnal.gov/files/code/e875/releases 40/50G nb.minos.d133 /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_3/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/tar_files /afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root_minimal.sh /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_EXTERN/Linux2.4-GCC_3_2/v03/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_EXTERN/Linux2.4-GCC_4_1/v04/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/ups/prd/PYTHIA6/Linux2.4-GCC_3_4/v6_406/inc /afs/fnal.gov/files/code/e875/general/ups/prd/PYTHIA6/Linux2.4-GCC_3_4/v6_409/inc BASE=/afs/fnal.gov/files/code/e875/releases1 6/8 G (none) BASE=/afs/fnal.gov/files/code/e875/releases2 6/8 G /afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-02-00 /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-02-00-opt /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-04-02b /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-04-02b-opt /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-04-02f /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-04-02f-opt BASE=/afs/fnal.gov/files/code/e875/sim 7/8 G code.e875.sim /afs/fnal.gov/files/code/e875/general/minossoft/releases/development/BField/bfld_imap.C /afs/fnal.gov/files/data/minos/d1/nuflux/newfiles/nuhist_far.rz 
/afs/fnal.gov/files/data/minos/d1/nuflux/newfiles/nuhist_near_v2.rz /afs/fnal.gov/files/data/minos/d12/root_files/AAA_README.TXT /afs/fnal.gov/files/data/minos/d12/root_files/AAA_README.TXT.~1~ /afs/fnal.gov/files/data/minos/d12/root_files/AAA_README.TXT.~2~ /afs/fnal.gov/files/data/minos/d12/root_files/emu_tauCC.root /afs/fnal.gov/files/data/minos/d12/root_files/far_e_hitbits.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu-tau_v5.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu_801.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu_811.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu_899.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu_hitbits.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu_reco.root /afs/fnal.gov/files/data/minos/d12/root_files/far_muon_hitbits.root /afs/fnal.gov/files/data/minos/d12/root_files/far_muon_noelos.root /afs/fnal.gov/files/data/minos/d12/root_files/far_muon_noscat.root /afs/fnal.gov/files/data/minos/d12/root_files/far_nc_hitbits.root /afs/fnal.gov/files/data/minos/d12/root_files/far_tau_hitbits.root /afs/fnal.gov/files/data/minos/d12/root_files/far_tau_reco.root /afs/fnal.gov/files/data/minos/d12/root_files/overlay_ph2me.root /afs/fnal.gov/files/data/minos/d17/gnumi_flux /afs/fnal.gov/files/data/minos/d7/hitbits/far_e_hitbits.fz_gaf /afs/fnal.gov/files/data/minos/d7/hitbits/far_mu_hitbits.fz_gaf /afs/fnal.gov/files/data/minos/d7/hitbits/far_muon_hitbits.fz_gaf /afs/fnal.gov/files/data/minos/d7/hitbits/far_nc_hitbits.fz_gaf /afs/fnal.gov/files/data/minos/d7/hitbits/far_tau_hitbits.fz_gaf /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOGEANT.ddl /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOMETRY.ddl /afs/fnal/files/code/e875/sim/hermes_db/include/partap.inc BASE=/afs/fnal.gov/files/data/minos/d04/libraries 8/8 GB nb.data.minosd4 (none) MINOS26 > du -sm /afs/fnal.gov/files/code/e875/releases/* 600 /afs/fnal.gov/files/code/e875/releases/GENIE 91 /afs/fnal.gov/files/code/e875/releases/LOG4CPP 2048 /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN 20538 /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT 841 /afs/fnal.gov/files/code/e875/releases/NEUGEN3 183 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 13432 /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP 1687 /afs/fnal.gov/files/code/e875/releases/base_release_build 27 /afs/fnal.gov/files/code/e875/releases/stdhep ============================================================================= 2008 01 29 ########## # PARROT # ########## Trying again to build -f indexes for the export directories and verifying with a direct sha1sum AFSDIR=/afs/fnal.gov/files/data/minos/d04/libraries AFSDIR=/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL real 0m29.130s mv ${AFSDIR}/.growfschecksum ${AFSDIR}/.growfschecksumNL mv ${AFSDIR}/.growfsdir ${AFSDIR}/.growfsdirNL time make_growfs -k -f ${AFSDIR} ls -l /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL fails as before, but the .growfschecksum file contains the correct checksum. 
directory checksum is 8721a99dea7b07e57189f611263f24b5929528e8 That's nonsense, MINOS26 > curl http://www-numi.fnal.gov:80//computing/MINOS_EXTERNAL//.growfsdir -o /tmp/growdir % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 3991k 100 3991k 0 0 46.4M 0 --:--:-- --:--:-- --:--:-- 52.6M MINOS26 > sha1sum /tmp/growdir e59ac0b8ab17d5f94c2fd165012d2fc5192998b6 /tmp/growdir MINOS26 > sha1sum ${AFSDIR}/.growfsdir e59ac0b8ab17d5f94c2fd165012d2fc5192998b6 /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/.growfsdir Trying again with -d all 1201646406.226614 [11797] parrot: http: GET http://www-numi.fnal.gov:80/computing/MINOS_EXTERNAL//.growfsdir HTTP/1.1 Host: www-numi.fnal.gov Cache-Control: max-age=0 1201646406.236623 [11797] parrot: http: HTTP/1.1 200 OK 1201646406.236650 [11797] parrot: http: Date: Tue, 29 Jan 2008 22:40:05 GMT 1201646406.236659 [11797] parrot: http: Server: Apache/1.3.37 (Unix) PHP/5.2.4 mod_layout/2.1 mod_fastcgi/2.4.2 mod_ssl/2.8.25 OpenSSL/0.9.8d 1201646406.236668 [11797] parrot: http: Last-Modified: Tue, 29 Jan 2008 22:29:49 GMT 1201646406.236676 [11797] parrot: http: ETag: "3c68a40e-3e5f9a-479fa8dd" 1201646406.236682 [11797] parrot: http: Accept-Ranges: bytes 1201646406.236688 [11797] parrot: http: Content-Length: 4087706 1201646406.236697 [11797] parrot: http: Content-Type: text/plain 1201646406.236709 [11797] parrot: http: 1201646406.236716 [11797] parrot: grow: loading filesystem directory... 1201646406.501638 [11797] parrot: tcp: disconnected from 131.225.70.20:80 1201646406.501759 [11797] parrot: grow: directory checksum is 8721a99dea7b07e57189f611263f24b5929528e8 Summary - the downloaded directory seems to have a bad checksum, although the original is fine. ######### # FNALU # ######### An update of CPU batch power, since last week's upgrades to flxb10/11/13/17/23/24/25 ( 10-13 are 1 GHz, counting as 1/3 core x 2 = 2/3 core, net 2 cores) HOSTS bsub -R Cores FLXB10-30 SL3 "linux24" 20 FLXB10-30 SL4 "linux24" 10 FLXB31-34 SL4 "linux26" 10 ssh to flxb17 hangs up ssh to flxb30 lacks AFS token ######## # GRID # ######## Trying to get 60 hour proxy, per chadwick advice MINOS25 > date Tue Jan 29 15:03:12 CST 2008 MINOS25 > d=kreymer/cron/minos25.fnal.gov MINOS25 > kcron -f TMXbT5oIyGkEaix7kSZvZg MINOS25 > kinit ${d} -k -t /var/adm/krb5/`kcron -f` MINOS25 > klist -f Ticket cache: /tmp/krb5cc_1060_YH8644 Default principal: kreymer/cron/minos25.fnal.gov@FNAL.GOV Valid starting Expires Service principal 01/29/08 15:04:38 01/30/08 01:04:38 krbtgt/FNAL.GOV@FNAL.GOV Flags: FIA 01/29/08 15:04:39 01/30/08 01:04:38 afs@FNAL.GOV Flags: FA MINOS25 > voms-proxy-init -noregen -voms fermilab:/fermilab -valid 60:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Done Warning: fg6x1.fnal.gov:15001: The validity of this VOMS AC in your proxy is shortened to 86400 seconds! Creating proxy .................................. Done Warning: your certificate and proxy will expire Tue Jan 29 23:30:06 2008 which is within the requested lifetime of the proxy MINOS25 > voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. 
Error: Cannot find certificate of AC issuer for vo fermilab
subject : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy/CN=proxy
issuer : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy
identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy
type : unknown
strength : 512 bits
path : /tmp/x509up_u1060
timeleft : 8:22:37
=== VO fermilab extension information ===
VO : fermilab
subject : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer
issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov
attribute : /fermilab/Role=NULL/Capability=NULL
attribute : /fermilab/minos/Role=NULL/Capability=NULL
timeleft : 23:59:30
=============================================================================
2008 01 28
#######
# AFS #
#######
Scanning for AFS configurations :
for NODE in ${NODES} ; do printf "${NODE} "; ssh ${NODE} 'grep OPTIONS= /etc/sysconfig/afs'; done
minos01 OPTIONS=$LARGE
minos02 OPTIONS=$LARGE
...
minos25 OPTIONS=AUTOMATIC
minos26 OPTIONS=$LARGE
Checked dates and content,
for NODE in ${NODES} ; do printf "${NODE} "; ssh ${NODE} 'diff /etc/sysconfig/afs /minos/scratch/kreymer/sysafs'; done
< OPTIONS=AUTOMATIC
---
> OPTIONS=$LARGE
for NODE in ${NODES} ; do printf "${NODE} "; ssh ${NODE} 'ls -l /etc/sysconfig/afs'; done
minos01 -rw-r--r-- 1 root root 4724 Aug 20 13:03 /etc/sysconfig/afs
...
minos06 -rw-r--r-- 1 root root 4724 Aug 20 13:04 /etc/sysconfig/afs
...
minos11 -rw-r--r-- 1 root root 1922 Aug 21 15:59 /etc/sysconfig/afs
minos12 -rw-r--r-- 1 root root 4724 Aug 20 13:04 /etc/sysconfig/afs
...
minos25 -rw-r--r-- 1 root root 4727 Oct 19 10:52 /etc/sysconfig/afs
minos26 -rw-r--r-- 1 root root 4724 Aug 20 13:04 /etc/sysconfig/afs
##########
# CONDOR #
##########
Date: Mon, 28 Jan 2008 14:10:02 -0600
From: Cron Daemon
To: kreymer@fnal.gov
Subject: Cron ${HOME}/minos/scripts/condorglide
/bin/sh: /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorglide: No such file or directory
__________________________________________________
/var/log/messages is full of lines like
Jan 28 10:52:41 minos25 kernel: afs_NewVCache: warning none freed, using 300 of 300
Jan 28 10:52:41 minos25 kernel: afs_NewVCache - none freed
grep afs_NewVCache: /var/log/messages | wc -l
32355
grep -v afs_NewVCache /var/log/messages
Rates can be tens per second, or little gaps.
Skipping gaps under 5 minutes,
grep afs_NewVCache: /var/log/messages.2 | uniq | less
Date: Mon, 28 Jan 2008 14:49:29 -0600 (CST)
Subject: HelpDesk ticket 110193
___________________________________________
Short Description: AFS messages on minos25
Problem Description: run2-sys
I failed to access a file on AFS from minos25 today, as follows, at 14:10:02
/bin/sh: /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorglide: No such file or directory
Again at 16:12:04
In /var/log/messages, I see 32355 sets of messages starting at Jan 28 10:52:41 and continuing through 14:18:54 :
Jan 28 10:52:41 minos25 kernel: afs_NewVCache: warning none freed, using 300 of 300
Jan 28 10:52:41 minos25 kernel: afs_NewVCache - none freed
Similar problems are seen in earlier messages.N files, for N in 2,3,4
The messages can come as slowly as every minute or so, or at tens per second.
Here is a summary of lines from existing messages files, where I have omitted any messages coming at intervals under 5 minutes, so that we can see the time periods of interest.
I do not see any such messages on the other nodes minos01 through minos24. messages.4 Jan 1 03:05:59 Jan 1 03:07:33 Jan 1 03:12:42 Jan 1 03:12:44 Jan 2 11:54:58 Jan 2 11:56:38 messages.3 Jan 8 13:33:08 Jan 8 13:39:57 Jan 11 11:43:21 Jan 11 11:52:57 messages.2 Jan 17 12:44:22 Jan 17 12:52:22 Jan 17 13:06:34 Jan 17 13:06:44 messages Jan 28 10:52:41 Jan 28 11:01:13 Jan 28 13:40:47 Jan 28 13:43:18 Jan 28 14:06:02 Jan 28 14:18:54 Jan 28 15:49:33 Jan 28 15:54:46 Jan 28 16:01:01 Jan 28 16:01:04 Jan 28 16:11:53 Jan 28 16:29:28 Jan 28 17:03:23 Jan 28 17:08:39 __________________________________________ Date: Mon, 28 Jan 2008 15:10:59 -0600 (CST) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 29 Jan 2008 12:11:04 -0600 (CST) Solution: ettab@fnal.gov sent this solution: I am not sure how the /etc/sysconfig/afs file got modified. I have made the requested change and restarted afsd. ___________________________________________ N.B. I see no afs_NewVCache messages past the 12:07 restart of AFS with OPTIONS=$LARGE Updated MINOS status page ########### # MINOS03 # ########### Date: Mon, 28 Jan 2008 14:52:00 -0600 (CST) Subject: HelpDesk ticket 110194 ___________________________________________ Short Description: Cannot ssh to minos03 Problem Description: run2-sys : I cannot log into minos03 via ssh, but can reach the rest of the Minos Cluster. I can log in with kerberized rsh. MIN > date Mon Jan 28 20:45:54 UTC 2008 MIN > ssh -v minos03 OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos03 [131.225.193.3] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 debug1: identity file /home/kreymer/.ssh/id_dsa type -1 ssh_exchange_identification: Connection closed by remote host ___________________________________________ Date: Mon, 28 Jan 2008 15:11:08 -0600 (CST) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 29 Jan 2008 09:06:57 -0600 (CST) Solution: ettab@fnal.gov sent this solution: The sshd daemon was restarted. ============================================================================= 2008 01 25 ####### # SAM # ####### MC examples for vahle SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_00 and VERSION cedar.phy and RUN_NUMBER 1024 " SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_00 and MC.BEAM L010185N and VERSION cedar.phy and RUN_NUMBER in 1024,1025,1026 " ########## # PARROT # ########## upd install -j cern 2004 upd install -j oracle_client v10_1_0_2_0b time make_growfs -k ${AFSB}/general/products real 0m44.483s setup_minos ERROR: Found no match for product 'oracle_tnsnames' ERROR: Action parsing failed on "unsetuprequired(oracle_tnsnames)" explicitly setting up GCC3_4_3 version of GEANT INFORMATIONAL: Product 'geant' (with qualifiers 'GCC3_4_3'), has no current chain (or may not exist) upd install -j oracle_tnsnames ups declare -c oracle_tnsnames v48 -f NULL upd install -j geant v3_21_14a -f Linux+2.6 ups declare -c geant v3_21_14a -f Linux+2.6 FNGP-OSG > setup_minos No default SAM configuration exists at this time. 
MINOSSOFT release "development" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=trunk EXTERN=v03 CONFIG=v01 bash: child setpgid (9107 to 9106): Operation not permitted setup "test" version of LABYRINTH [ linux , FNALU ] setup NEUGEN3 development explicitly setting up GCC3_4_3 version of GEANT using PYTHIA6 (v6_412) for LUND bash: child setpgid (9566 to 9565): Operation not permitted bash: child setpgid (9743 to 9742): Operation not permitted bash: child setpgid (10201 to 10200): Operation not permitted Still lacking loon and root. /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT -> /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT Symlinks are not being followed Try this with symlinks, getting into a loop under /afs/fnal.gov/files/code/e875/general/products/prd/MINOS_ROOT/Linux2.4-GCC_3_2/v3-05-05/v3_05_05 ls -l /afs/fnal.gov/files/code/e875/general/products/prd/MINOS_ROOT/Linux2.4-GCC_3_2/v3-05-05/v3_05_05 lrwxr-xr-x 1 rhatcher e875 68 Feb 14 2006 /afs/fnal.gov/files/code/e875/general/products/prd/MINOS_ROOT/Linux2.4-GCC_3_2/v3-05-05/v3_05_05 -> /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05 rm /afs/fnal.gov/files/code/e875/general/products/prd/MINOS_ROOT/Linux2.4-GCC_3_2/v3-05-05/v3_05_05 time make_growfs -k -f ${AFSB}/general/products real 10m37.602s user 0m51.874s sys 4m4.926s We need to serve general/ROOT and general/MINOS_EXTERNAL These have links to d04/libraries time make_growfs -k -f ${AFSB}/general/ROOT Oops, needed cleanup/removal of one loop /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 links to /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05 but /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 links back to /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05 rm /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 rm /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 MINOS26 > dds /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 lrwxr-xr-x 1 rhatcher e875 68 Jul 17 2003 /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 -> /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/ MINOS26 > dds /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 lrwxr-xr-x 1 rhatcher e875 68 Jul 17 2003 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 -> /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/ MINOS26 > rm /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 MINOS26 > rm /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 rm: cannot remove `/afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05': No such file or directory MINOS26 > dds /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 ls: /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05: No such file or directory MINOS26 > dds /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 ls: /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05: No such file or directory time make_growfs -k -f ${AFSB}/general/ROOT real 2m27.806s time make_growfs -k -f ${AFSB}/general/MINOS_EXTERNAL real 0m34.504s ln -s /afs/fnal.gov/files/code/e875/general/ROOT ROOT ln -s /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL MINOS_EXTERNAL ln -s 
/afs/fnal.gov/files/data/minos/d04/libraries d04libs FNGP-OSG > setup_minos 1201288429.716438 [21203] parrot: fatal : directory and checksum still inconsistent after 60 seconds 1201288429.716711 [21203] parrot: notice: received signal 15 (Terminated), killing all my children... 1201288429.716887 [21203] parrot: notice: sending myself 15 (Terminated), goodbye! Oops, problem with products directory, try again without -f That builds quickly, but now symlinks are not being followed Try a rebuild again with symlinks : real 11m2.270s ls -l /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT 1201290364.388026 [9257] parrot: fatal : directory and checksum still inconsistent after 60 seconds ls -l /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/ GRRRRRR - I seem to be hitting a fundamental problem, after a very promising start. Symlinks do not seem to be followed when I do not use -f, and make_growfs -f seems to produce .growfsdir files with incorrect checksums. Let's try a smaller directory, near the base of the chain, time make_growfs -k -f /afs/fnal.gov/files/data/minos/d04/libraries FNGP-OSG > ls -l /afs/fnal.gov/files/data/minos/d04/libraries total 6 drwxr-xr-x 3 kreymer numi 2048 Mar 12 2003 DatabaseTables drwxr-xr-x 4 kreymer numi 2048 May 22 2003 IRIX6.5-GCC_3_2 drwxr-xr-x 4 kreymer numi 2048 May 22 2003 Linux2.4-GCC_3_2 Testing symlinks across served directories, they do not seem to work : ln -s /afs/fnal.gov/files/code/e875/general/products products ls -l /afs/fnal.gov/files/code/e875/sim/products/etc ls: /afs/fnal.gov/files/code/e875/sim/products/etc: No such file or directory PLAN - we need to merge these multivolume trees Look for links like : find /afs/fnal.gov/files/code/e875/general/products -type l -exec ls -l {} \; | cut -f 2 -d '>' ============================================================================= 2008 01 24 ########## # PARROT # ########## HOWTO.parrot - created export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 export PATH=${PARROT_DIR}/bin:${PATH} parrot -m ${PARROT_DIR}/mountfile.grow -d remote bash export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup setup_minos() { . $MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } setup_minos 1201210251.123664 [14892] parrot: grow: open http://www-numi.fnal.gov:80/computing/minossoft///setup/setup_minossoft_FNALU_parser.sh 1201210252.401424 [15100] parrot: notice: cannot execute the program /usr/sbin/userhelper because it is setuid. execl() error, errno=13 FNGP-OSG > ls -alF /usr/sbin/userhelper -rws--x--x 1 kreymer numi 31560 May 4 2007 /usr/sbin/userhelper* Hmmm, let's be more cautious , check some products Still some trouble, seem to have lost the definition of 'setup' FNGP-OSG > ups list -aK+ root WARNING: '/afs/fnal.gov/files/code/e875/general/ups/db' is not a directory OK, need to add the ups->products symlink to our exports ln -s /afs/fnal.gov/files/code/e875/general/products ups bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory ln -s /afs/fnal.gov/files/code/e875/sim/labyrinth labyrinth MINOS26 > echo $AFSB /afs/fnal.gov/files/code/e875 pts adduser -user kreymer -group buckley:minsoft time make_growfs -k ${AFSB}/sim/labyrinth GRRRRRRR - cannot write into labyrinth, in spite of membership. 
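One likely explanation, and a quick way to check it (a sketch using standard OpenAFS client commands; LABDIR is just shorthand for the labyrinth path above, and the touch filename is a scratch name): a freshly granted pts group membership is generally not honored until fresh tokens are obtained, and the fileserver can cache the old membership for a while, so an aklog after the pts adduser is worth trying before giving up on writing here.
LABDIR=/afs/fnal.gov/files/code/e875/sim/labyrinth
fs listacl ${LABDIR}      # which users/groups actually hold rlidwk(a) on this directory
tokens                    # do we hold current afs@fnal.gov tokens ?
pts membership kreymer    # does the new buckley:minsoft membership show up ?
aklog                     # refresh tokens so the new membership can take effect
touch ${LABDIR}/.writetest && rm ${LABDIR}/.writetest   # minimal write test, scratch file name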
Back up 1 level, export sim as a whole ln -s /afs/fnal.gov/files/code/e875/sim sim time make_growfs -k ${AFSB}/sim real 0m23.568s Now, on setup_minos, ERROR: Action parsing failed on "SetupRequired(cern 2004)" GEANTINCS not in any location ... expect trouble loon is not happy. we need to stop using /afs/fnal.gov/ups But let's see if anything runs : FNGP-OSG > sam locate foo Datafile with name 'foo' not found. FNGP-OSG > sam locate F00031300_0000.mdaq.root ['/pnfs/minos/fardet_data/2005-04,1515@vo4245'] FNGP-OSG > sam get metadata --file=F00031300_0000.mdaq.root ImportedDetectorFile({ 'fileName' : 'F00031300_0000.mdaq.root', Excellent ! ########### # MINOS02 # ########### Cannot log into minos02 via ssh ( pawloski ) Ganglia shows it dead 04:30 through 08:30, and 12:00 to 12:50 today. for NODE in ${NODES} ; do printf "${NODE} " ; ssh -a ${NODE} pwd ; done minos01 /afs/fnal.gov/files/home/room1/kreymer minos02 ssh_exchange_identification: Connection closed by remote host minos03 /afs/fnal.gov/files/home/room1/kreymer ... NO ssh logins since last night. tail of /var/log/messages : Jan 23 22:25:30 minos02 sshd(pam_unix)[19570]: session opened for user grashorn by (uid=0) Jan 23 22:35:06 minos02 sshd(pam_unix)[19604]: session opened for user bspeak by (uid=0) Jan 24 05:51:24 minos02 xfs[4602]: re-reading config file Jan 24 05:51:24 minos02 xfs: xfs -USR1 succeeded Jan 24 05:51:24 minos02 xfs[4602]: ignoring font path element /usr/X11R6/lib/X11/fonts/100dpi:unscaled (unreadable) Jan 24 14:08:53 minos02 login: kreymer preauthenticated login on pts/3 from minos-93198.dhcp ___________________________________________ Date: Thu, 24 Jan 2008 14:27:43 -0600 (CST) Subject: HelpDesk ticket 110032 ___________________________________________ Short Description: Cannot log into minos02 via ssh Problem Description: run2-sys : We cannot log into minos02 via ssh. I am able to login via rsh. The other Minos Cluster nodes are unaffected. Here is an example : MIN > date Thu Jan 24 20:19:27 UTC 2008 MIN > ssh -v minos02 OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos02 [131.225.193.2] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 debug1: identity file /home/kreymer/.ssh/id_dsa type -1 ssh_exchange_identification: Connection closed by remote host ___________________________________________ Date: Thu, 24 Jan 2008 14:37:54 -0600 (CST) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 24 Jan 2008 15:12:03 -0600 (CST) Solution: jonest@fnal.gov sent this solution: > I was able to restart sshd. It should be fine now. ___________________________________________ ============================================================================= 2008 01 23 ######## # FARM # ######## Purging old files / duplicates ./roundup -n -D -r cedar_phy_bhcurv far All the duplicates are pass 0 Contrary to the content of the READ and READ/SAM files, many of these files are NOT declared to SAM, and not in PNFS. Let's chew on this case, and develop the proper tools for cases where READ or READ/SAM are stale. 
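A sketch of one such tool, classifying each suspected duplicate by whether SAM and PNFS actually know about it. DUPFILES is a hypothetical input list of candidate file names (one per line), and the locate-output parsing follows the same cut idiom used elsewhere in this log; the real READ and SAM/READ bookkeeping is deliberately not consulted, since the point is to cross-check it.
DUPFILES=/tmp/cpbf.names          # hypothetical list, one file name per line
for FILE in `cat ${DUPFILES}` ; do
  SLOC=`sam locate ${FILE} 2>&1`  # grep the output rather than trust the exit status
  case "${SLOC}" in
    *"not found"*)
      printf "NOSAM        ${FILE}\n" ;;
    *)
      # sam locate prints something like ['/pnfs/minos/<path>,<tape info>']
      PDIR=`echo "${SLOC}" | cut -f 2 -d "'" | cut -f 1 -d ,`
      if [ -r "${PDIR}/${FILE}" ] ; then
        printf "SAM   PNFS   ${FILE}\n"
      else
        printf "SAM  NOPNFS  ${FILE}\n"
      fi ;;
  esac
done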
./roundup -n -D -r cedar_phy_bhcurv far | uniq > /tmp/cpbf.dup SRV1> grep DUPE /tmp/cpbf.dup | wc -l 321 ######### # FNALU # ######### Date: Wed, 23 Jan 2008 11:01:43 -0600 (CST) From: Margaret_Greaney To: minos-users@fnal.gov Cc: dss-est@fnal.gov Subject: status of batch node updates To all, The operating system updates on the fnalu batch nodes are progressing. Yesterday flxb10,11,13 were updated. This morning flxb17,23-25 were updated. If you are having any problems with these nodes please report them. The remainder will be updated as schedule permits. Date: Wed, 23 Jan 2008 18:31:58 +0000 (UTC) From: Arthur Kreymer To: Margaret_Greaney Cc: minos-users@fnal.gov, dss-est@fnal.gov Subject: Re: status of batch node updates On Wed, 23 Jan 2008, Margaret_Greaney wrote: ... > The operating system updates on the fnalu batch nodes are progressing. > Yesterday flxb10,11,13 were updated. This morning flxb17,23-25 were > updated. Simple LSF jobs seem to be failing for me on these nodes. I have tried all of 10, 11, 13, 17, 23, 24, 25 The same trivial job runs OK on unupgraded nodes like flxb16 and flxb26. For example MINOS26 > bsub -R flxb10 echo HELLO Job <120645> is submitted to default queue <30min>. MINOS26 > bjobs -l 120645 Job <120645>, User , Project , Status , Queue <30min>, Command Wed Jan 23 12:27:18: Submitted from host , CWD , Requested Resources ; Wed Jan 23 12:27:24: Started on ; Wed Jan 23 12:27:24: Exited with exit code 255. The CPU time used is 0.0 second s. SCHEDULING PARAMETERS: r15s r1m r15m ut pg io ls it tmp swp mem loadSched - 0.5 0.7 - - - - - - - - loadStop - 7.0 6.0 - - - - - - - - ----------------------------------------- Date: Thu, 24 Jan 2008 13:07:33 -0600 (CST) these nodes were missing a directory that does not come with the openafs rpm but was still on the slf3 nodes. Now that /usr/afsws/bin is in place and something called sbatchd restarted on nodes, they process jobs. ----------------------------------------- bsub -R flxb25 echo HELLO 10 OK 11 OK 13 OK 17 OK 23 OK 24 OK 25 OK These nodes allow interactive login now. Unavailable are 16 18 19 20 21 22 26 27 28 29 ########### # MINOS25 # ########### Date: Wed, 23 Jan 2008 10:34:20 -0600 (CST) Subject: HelpDesk ticket 109940 Short Description: minos25 in distress - may need reboot Problem Description: run2-sys : The minos25 system seems to be in distress. Yesterday, around 14:50, memory usage shot up, and almost all the CPU is in Wait CPU state, according to Ganglia. Condor jobs are not being started, and the condor_q command is failing. The immediate cause may be a set of five 'loon' jobs running under brebel. These are each using a large amount of memory, 1/3 to 3/4 GB each. Brian is unable to kill these, which are all in 'D' state according to top. run2-sys : Please kill these brebel processes if you can. If that does not work, and you cannot restore normal operation, please reboot minos25. ------------------------------------------------ Date: Wed, 23 Jan 2008 10:52:53 -0600 (CST) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. ------------------------------------------------ Date: Wed, 23 Jan 2008 11:23:00 -0600 (CST) Subject: Help Desk Ticket 109940 Has Been Resolved. ------------------------------------------------ Condor recovered, and several old jobs are still running. My glidein tests cleared quickly. 
condor_q kreymer -- Submitter: minos25.fnal.gov : <131.225.193.25:63984> : minos25.fnal.gov ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 27469.0 kreymer 1/22 18:50 0+00:00:00 I 0 9.8 probe 0 0 here are 27471.0 kreymer 1/22 20:00 0+00:00:00 I 0 9.8 probe 0 0 here are 27472.0 kreymer 1/22 20:10 0+00:00:00 I 0 9.8 probe 0 0 here are 27473.0 kreymer 1/22 20:20 0+00:00:00 I 0 9.8 probe 0 0 here are 27474.0 kreymer 1/22 20:30 0+00:00:00 I 0 9.8 probe 0 0 here are 27475.0 kreymer 1/22 20:40 0+00:00:00 I 0 9.8 probe 0 0 here are 27476.0 kreymer 1/22 20:50 0+00:00:00 I 0 9.8 probe 0 0 here are 27477.0 kreymer 1/22 21:00 0+00:00:00 I 0 9.8 probe 0 0 here are 27478.0 kreymer 1/22 21:10 0+00:00:00 I 0 9.8 probe 0 0 here are 27479.0 kreymer 1/22 21:20 0+00:00:00 I 0 9.8 probe 0 0 here are 27480.0 kreymer 1/22 21:30 0+00:00:00 I 0 9.8 probe 0 0 here are 27481.0 kreymer 1/22 21:40 0+00:00:00 I 0 9.8 probe 0 0 here are 27482.0 kreymer 1/22 21:50 0+00:00:00 I 0 9.8 probe 0 0 here are 27483.0 kreymer 1/22 22:00 0+00:00:00 I 0 9.8 probe 0 0 here are 27484.0 kreymer 1/22 22:10 0+00:00:00 I 0 9.8 probe 0 0 here are 27485.0 kreymer 1/22 22:20 0+00:00:00 I 0 9.8 probe 0 0 here are 27486.0 kreymer 1/22 22:30 0+00:00:00 I 0 9.8 probe 0 0 here are 27487.0 kreymer 1/22 22:40 0+00:00:00 I 0 9.8 probe 0 0 here are 27488.0 kreymer 1/22 22:50 0+00:00:00 I 0 9.8 probe 0 0 here are 27489.0 kreymer 1/22 23:00 0+00:00:00 I 0 9.8 probe 0 0 here are 27490.0 kreymer 1/22 23:10 0+00:00:00 I 0 9.8 probe 0 0 here are 27491.0 kreymer 1/22 23:20 0+00:00:00 I 0 9.8 probe 0 0 here are 27492.0 kreymer 1/22 23:30 0+00:00:00 I 0 9.8 probe 0 0 here are 27493.0 kreymer 1/22 23:40 0+00:00:00 I 0 9.8 probe 0 0 here are 27494.0 kreymer 1/22 23:50 0+00:00:00 I 0 9.8 probe 0 0 here are 27502.0 kreymer 1/23 11:20 0+00:00:00 I 0 9.8 probe 0 0 here are 27504.0 kreymer 1/23 11:30 0+00:00:00 I 0 9.8 probe 0 0 here are 27505.0 kreymer 1/23 11:40 0+00:00:00 I 0 9.8 probe 0 0 here are 27506.0 kreymer 1/23 11:50 0+00:00:00 I 0 9.8 probe 0 0 here are ########## # CONDOR # ########## No kreymer jobs submitted between 16:50 and 18:50 yesterday, then job 27469 at 18:50 MINOS25 > condor_q -- Failed to fetch ads from: <131.225.193.25:64973> : minos25.fnal.gov Ganglia monitoring for the cluster shows a sharp drop around 17:00 on 22 Jan, a little blip up around 18:50, then continued low process count. On minos26, memory used ramped up starting at 16:30, reached 2.3 GB around 16:50. Wait CPU spiked to 100% around 14:50. I see no swap being used. 
top - 10:08:43 up 96 days, 11 min, 8 users, load average: 17.10, 16.97, 16.44 Tasks: 261 total, 1 running, 259 sleeping, 0 stopped, 1 zombie Cpu(s): 0.1% us, 0.2% sy, 0.0% ni, 11.7% id, 87.9% wa, 0.0% hi, 0.0% si Mem: 4151264k total, 4099084k used, 52180k free, 111472k buffers Swap: 4192944k total, 208k used, 4192736k free, 1223368k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME #C COMMAND 4031 brebel 18 0 1051m 791m 40m D 0 19.5 25:25 1 loon 4055 brebel 18 0 1050m 788m 40m D 0 19.5 24:46 0 loon 4525 brebel 18 0 545m 400m 40m D 0 9.9 20:41 1 loon 4734 brebel 18 0 443m 325m 40m D 0 8.0 15:24 3 loon 4723 brebel 18 0 442m 325m 40m D 0 8.0 15:24 0 loon 5023 condor 16 0 32364 27m 3452 S 0 0.7 56:39 2 condor_negotiat 5022 condor 15 0 23592 17m 3480 S 0 0.4 115:11 0 condor_collecto 593 gfactory 16 0 46484 17m 2816 S 0 0.4 247:39 3 python 32587 gfronten 16 0 15288 11m 1560 S 0 0.3 29:47 2 python 2804 ntp 16 0 5472 5472 3524 S 0 0.1 0:11 2 ntpd ########## # PARROT # ########## MINOS26 > make_growfs -h Use: /grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4/bin/make_growfs [options] Where options are: -f Follow symbolic links. -v Give verbose messages. -K Create checksums for files. (default enabled) -k Disable checksums for files. -h Show this help file. 15:03 UTC time make_growfs -f -k ${AFSB}/releases1 real 0m36.613s user 0m4.067s sys 0m18.874s parrot -m ${PARROT_DIR}/mountfile.grow -d remote bash ls /afs/fnal.gov/files/code/e875/releases1 ... directory checksum is dc880108e9c28b1ab8e411629ed503fbf790924d actual checksum is a3edbef5da19cb66c6604b4c27dea1b1c669dd96 1201101290.221921 [2200] parrot: grow: loading filesystem directory... 1201101290.520844 [2200] parrot: grow: directory checksum is dc880108e9c28b1ab8e411629ed503fbf790924d 1201101290.521023 [2200] parrot: grow: fetching checksum from wget --no-cache -q -O /tmp/grow.checksum.1060.30772 http://www-numi.fnal.gov:80//computing/releases1//.growfschecksum 1201101290.573307 [2200] parrot: grow: actual checksum is a3edbef5da19cb66c6604b4c27dea1b1c669dd96 1201101290.590914 [2200] parrot: fatal : directory and checksum still inconsistent after 60 seconds 1201101290.591300 [2200] parrot: notice: received signal 15 (Terminated), killing all my children... 1201101290.591576 [2200] parrot: notice: sending myself 15 (Terminated), goodbye! Terminated Trying this without following symlinks make_growfs -k ${AFSB}/releases1 OK, can see this directory how, checksum matches time make_growfs -k ${AFSB}/releases2 real 1m5.835s user 0m5.574s sys 0m18.773s time make_growfs -k ${AFSB}/releases real 8m48.406s user 0m44.452s sys 2m31.912s time make_growfs -k ${AFSB}/general/products real 0m34.637s user 0m2.962s sys 0m10.926s Along the way, needed to create symlinks : minossoft -> /afs/fnal.gov/files/code/e875/general/minossoft/ products -> /afs/fnal.gov/files/code/e875/general/products/ releases -> /afs/fnal.gov/files/code/e875/releases/ releases1 -> /afs/fnal.gov/files/code/e875/releases1/ releases2 -> /afs/fnal.gov/files/code/e875/releases2/ MIN > du -sm */.growfsdir 1 dh/.growfsdir 67 minossoft/.growfsdir 5 products/.growfsdir 67 releases/.growfsdir 5 releases1/.growfsdir 9 releases2/.growfsdir ============================================================================= 2008 01 22 ######## # FARM # ######## Purging old files / duplicates SRV1> cp -a AFSS/roundup.20080110 . 
SRV1> ln -sf roundup.20080110 roundup ./roundup -D -r cedar_phy_bhcurv mcfar ./roundup -f 5 -r cedar_phy_bhcurv mcfar cleared out files hanging around since 10/26 ./roundup -f 10 -r cedar_phy_bhcurv mcnear cleared out files from Dec ########## # PARROT # ########## Checking sizes and permissions of the interesting paths GB Path under /afs/fnal.gov/files/code/e875/ 3 general/products minos rlidwka 7 general/minossoft 40 releases 6 releases1 6 releases2 rhatcher:minsoft rlidwk buckley:minsoft rlidwka MINOS26 > pts membership rhatcher:minsoft Members of rhatcher:minsoft (id: -1397) are: buckley kreymer rhatcher AFSB=/afs/fnal.gov/files/code/e875 MIN > find /afs/fnal.gov/files/code/e875/general/minossoft -type d | wc -l 55286 21:06 UTC time make_growfs ${AFSB}/general/minossoft ... entering dir /afs/fnal.gov/files/code/e875/general/minossoft/srt entering dir /afs/fnal.gov/files/code/e875/general/minossoft/temp real 396m3.524s user 24m46.764s sys 170m36.388s FNGP-OSG > export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 FNGP-OSG > export PATH=${PARROT_DIR}/bin:${PATH} FNGP-OSG > parrot -m ${PARROT_DIR}/mountfile.html -d remote bash FNGP-OSG > time ls -l /afs/fnal.gov/files/code/e875/general/minossoft/packages ... 1201100190.657741 [11726] parrot: grow: loading filesystem directory... ... real 0m35.371s user 0m0.000s sys 0m0.001s MINOS26 > ls -alF /afs/fnal.gov/files/code/e875/general/minossoft/.grow* -rw-r--r-- 1 kreymer g020 44 Jan 22 21:41 /afs/fnal.gov/files/code/e875/general/minossoft/.growfschecksum -rw-r--r-- 1 kreymer g020 69866912 Jan 22 21:41 /afs/fnal.gov/files/code/e875/general/minossoft/.growfsdir MINOS26 > du -sm /afs/fnal.gov/files/code/e875/general/minossoft/.growfsdir 67 /afs/fnal.gov/files/code/e875/general/minossoft/.growfsdir ######### # MYSQL # ######### http://dev.mysql.com/doc/refman/5.0/en/binary-log.html mysqld options that control binlog : --log-bin[=base_name] --binlog-do-db=db_name --binlog-ignore-db=db_name these affect only access via the default USE database mysqlbinlog will display the logs the binary log format is different in MySQL 5.0 from previous versions of MySQL, due to enhancements in replication. I have added to /data/database/my.cnf --binlog-do-db = crl_v1 --binlog-do-db = offline See LOG.mysql ######### # MYSQL # ######### 17:47 UTC Pushing recent BINLOGS to /minos/data/mysql/BINLOGS, so we can clear space on mysql1 local disk time rsync -r ${DBBINS}/ ${DBCOPB} --perms --times --size-only -v building file list ... 
done ./ minos.000357 minos.000358 minos.index sent 2110642300 bytes received 80 bytes 21212486.23 bytes/sec total size is 140733284935 speedup is 66.68 real 1m38.983s user 0m24.538s sys 0m14.512s mysql -u root offline PURGE MASTER LOGS BEFORE DATE_SUB( NOW( ), INTERVAL 3 DAY); Query OK, 0 rows affected (2 min 22.27 sec) PURGE MASTER LOGS BEFORE DATE_SUB( NOW( ), INTERVAL 2 DAY); Query OK, 0 rows affected (1 min 27.75 sec) EXIT; Mysql> dds /data/archive/BINLOG/ total 9410228 drwxr-xr-x 2 minsoft e875 4096 Jan 22 11:52 ./ drwxr-xr-x 5 minsoft e875 4096 Jan 15 14:57 ../ -rw-rw---- 1 minsoft e875 1073744708 Jan 21 08:57 minos.000350 -rw-rw---- 1 minsoft e875 1073747386 Jan 21 08:58 minos.000351 -rw-rw---- 1 minsoft e875 1073743411 Jan 21 08:59 minos.000352 -rw-rw---- 1 minsoft e875 1073749504 Jan 21 09:00 minos.000353 -rw-rw---- 1 minsoft e875 1073743433 Jan 21 09:02 minos.000354 -rw-rw---- 1 minsoft e875 1073746201 Jan 21 09:03 minos.000355 -rw-rw---- 1 minsoft e875 1073743213 Jan 21 09:09 minos.000356 -rw-rw---- 1 minsoft e875 1073743727 Jan 21 21:40 minos.000357 -rw-rw---- 1 minsoft e875 1036638377 Jan 22 11:48 minos.000358 -rw-rw---- 1 minsoft e875 306 Jan 22 11:52 minos.index Mysql> df -h /data Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 87G 132G 40% /data ============================================================================= 2008 01 21 MLKJ Holiday ####### # AFS # ####### MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Jan " | grep -v Tokens | uniq'; done minos03 Jan 21 10:24:54 minos03 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 21 10:26:19 minos03 kernel: afs: file server 131.225.68.49 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Jan 21 18:06:28 minos03 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 21 18:08:43 minos03 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos19 Jan 20 12:30:42 minos19 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 20 12:32:06 minos19 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos21 Jan 21 08:57:32 minos21 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 21 08:59:19 minos21 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos26 Jan 20 15:31:24 minos26 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 20 15:32:35 minos26 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) ########### # MINOS01 # ########### The backlogged, apparently failed email from minos01 seems to be making its way out, and is being delivered. ######### # MYSQL # ######### 16:20 Pushing recent BINLOGS to /minos/data/mysql/BINLOGS, so we can clear space on mysql1 local disk time rsync -r ${DBBINS}/ ${DBCOPB} --perms --times --size-only -v minos.000240 ... 
minos.000356 minos.000357 minos.index sent 126405334115 bytes received 2400 bytes 21332433.81 bytes/sec total size is 139384494903 speedup is 1.10 real 98m45.365s user 23m25.373s sys 12m41.155s mysql -u root offline PURGE MASTER LOGS BEFORE DATE_SUB( NOW( ), INTERVAL 5 DAY); EXIT; ####### # NET # ####### Date: Mon, 21 Jan 2008 15:59:22 -0600 (CST) Subject: HelpDesk ticket 109827 Short Description: fnsrv1 not providing name service Problem Description: fnsrv1 ( 131.225.17.150 ) seems not to be providing name services. MIN > nslookup 131.225.17.150 fnsrv0.fnal.gov Server: fnsrv0.fnal.gov Address: 131.225.8.120#53 150.17.225.131.in-addr.arpa name = fnsrv1.fnal.gov. MIN > nslookup 131.225.17.150 fnsrv1.fnal.gov ;; connection timed out; no servers could be reached This caused service failures on nodes such as minos01, which were incorrectly configured to use only nameserver fnsrv1, since sometime Saturday 19 Jan 2008. Oops, never mind. While I was submitting this ticket, service seems to have been restored. MIN > nslookup 131.225.17.150 fnsrv1.fnal.gov Server: fnsrv1.fnal.gov Address: 131.225.17.150#53 150.17.225.131.in-addr.arpa name = fnsrv1.fnal.gov. MIN > date Mon Jan 21 21:56:32 UTC 2008 And the failing services on minos01 are working again. So consider this an informational report. Strange that one of the primary nameservers would be down so long without NGOP generating a helpdesk ticket. ------------------------------------------- Date: Tue, 22 Jan 2008 09:52:29 -0600 (CST) Note To Requester: tang@fnal.gov sent this Notes To Requester: > Resolved. Rebooted server. ########### # MINOS01 # ########### Justin Evans reports lack of CVS commit email to MSD, since Friday 18 Jan 23:43. Date: Mon, 21 Jan 2008 09:01:42 -0600 (CST) Subject: HelpDesk ticket 109825 Short Description: minos01 cannot send email Problem Description: run2-sys : Outgoing email from minos01.fnal.gov seems to have stopped working sometime after Friday 18 Jan 23:43 . For example, the following produces no received mail : echo TESTDIRECTMAIL | /bin/mail -s "DIRECTMAILTEST" kreymer@fnal.gov The same command sends mail from other nodes such as minos02 and minos26. Per bv (viren) suggestion, MINOS01 > echo TEST | /bin/mail -v -s TEST kreymer@fnal.gov /afs/fnal.gov/files/home/room1/kreymer/outbox: Permission denied fnal.gov: Name server timeout kreymer@fnal.gov... Transient parse error -- message queued for future delivery kreymer@fnal.gov... queued Indeed, the nameserver capability is hosed, MINOS01 > host www.fnal.gov ;; connection timed out; no servers could be reached MINOS01 > cat /etc/resolv.conf search fnal.gov nameserver 131.225.17.150 MINOS02 > cat /etc/resolv.conf search fnal.gov nameserver 131.225.8.120 nameserver 131.225.227.254 nameserver 131.225.17.150 MIN > cat /etc/resolv.conf ; generated by /sbin/dhclient-script search fnal.gov dhcp.fnal.gov nameserver 131.225.17.150 nameserver 131.225.8.120 The problems seems to be with 131.225.17.150, fnsrv1.fnal.gov MIN > nslookup 131.225.17.150 Server: 131.225.8.120 Address: 131.225.8.120#53 150.17.225.131.in-addr.arpa name = fnsrv1.fnal.gov. 
MIN > nslookup 131.225.17.150 fnsrv1.fnal.gov ;; connection timed out; no servers could be reached MINOS26 > cat > /minos/scratch/kreymer/resolv.conf search fnal.gov nameserver 131.225.8.120 nameserver 131.225.227.254 nameserver 131.225.17.150 MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh -ax ${NODE} 'diff /etc/resolv.conf /minos/scratch/kreymer/resolv.conf'; done minos01 aklog: Couldn't get fnal.gov AFS tickets: aklog: Cannot resolve network address for KDC in requested realm while getting AFS tickets 1a2,3 > nameserver 131.225.8.120 > nameserver 131.225.227.254 minos02 minos03 ... minos11 1d0 < ; generated by /sbin/dhclient-script ... minos24 2a3,4 > nameserver 131.225.227.254 > nameserver 131.225.17.150 MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh -ax ${NODE} 'diff -q /etc/resolv.conf /minos/scratch/kreymer/resolv.conf || cat /etc/resolv.conf'; done minos01 aklog: Couldn't get fnal.gov AFS tickets: aklog: Cannot resolve network address for KDC in requested realm while getting AFS tickets Files /etc/resolv.conf and /minos/scratch/kreymer/resolv.conf differ search fnal.gov nameserver 131.225.17.150 ... minos11 Files /etc/resolv.conf and /minos/scratch/kreymer/resolv.conf differ ; generated by /sbin/dhclient-script search fnal.gov nameserver 131.225.8.120 nameserver 131.225.227.254 nameserver 131.225.17.150 ... minos24 Files /etc/resolv.conf and /minos/scratch/kreymer/resolv.conf differ search fnal.gov nameserver 131.225.8.120 To : HelpDesk Cc : minos-admin@fnal.gov Attchmnt: Subject : Re: HelpDesk ticket 109825 ----- Message Text ----- On Mon, 21 Jan 2008, HelpDesk wrote: <-- # @@@ Enter Update below this line. @@@ # --> Thanks to bv ( Brett Viren ) for suggesting use of mail -v for diagnostics. MINOS01 > echo TEST | /bin/mail -v -s TEST kreymer@fnal.gov /afs/fnal.gov/files/home/room1/kreymer/outbox: Permission denied fnal.gov: Name server timeout ... The nameserver fnsrv1 = 131.225.17.150 is not working, and minos01 is configured to use only this server. minos01, minos11 and minos24 all have nonstandard /etc/resolv.conf This is probably also causing the problem on mino01 with AFS tokens, reported in helpdesk ticket 109813 . Date: Tue, 22 Jan 2008 09:40:22 -0600 (CST) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. Date: Tue, 22 Jan 2008 10:21:01 -0600 (CST) Solution: sether@fnal.gov sent this solution: I just sent a test mail from minos01 and it was sent properly. Whatever was wrong appears to have cleared up. If you're still seeing problems, let me know. N.B. 
/etc/resolv.conf is updated now ####### # CVS # ####### Noted many changed to CVSROOT configuration files, not committed to CVS -r--r--r-- 1 minoscvs e875 1026 Jan 11 11:28 verifymsg -r--r--r-- 1 minoscvs e875 879 Jan 11 11:28 taginfo -r--r--r-- 1 minoscvs e875 649 Jan 11 11:28 rcsinfo -r--r--r-- 1 minoscvs e875 266 Jan 11 11:28 numisoft.list -r--r--r-- 1 minoscvs e875 564 Jan 11 11:28 notify -r--r--r-- 1 minoscvs e875 109 Jan 11 11:28 neugen3.list -r--r--r-- 1 minoscvs e875 12977 Jan 11 11:28 modules -r--r--r-- 1 minoscvs e875 58 Jan 11 11:28 minospub.list -r--r--r-- 1 minoscvs e875 1717 Jan 11 11:28 loginfo -r--r--r-- 1 minoscvs e875 82 Jan 11 11:28 labyrinth.list -r--r--r-- 1 minoscvs e875 796 Jan 11 11:28 framework.list -r--r--r-- 1 minoscvs e875 1025 Jan 11 11:28 editinfo -r--r--r-- 1 minoscvs e875 753 Jan 11 11:28 cvswrappers -r-xr-xr-x 1 minoscvs e875 695 Jan 11 11:28 cvs.log -r--r--r-- 1 minoscvs e875 364 Jan 11 11:28 config -r--r--r-- 1 minoscvs e875 803 Jan 11 11:28 commitinfo -r--r--r-- 1 minoscvs e875 585 Jan 11 11:28 checkoutlist -r-xr-xr-x 1 minoscvs e875 101985 Jan 11 11:28 check_access,v -r-xr-xr-x 1 minoscvs e875 16251 Jan 11 11:28 check_access Odd, I seem to be fingered here , in cvshlog : Fri Jan 11 11:27:38 2008 (kreymer@(null)) : cvsh -c cvs server [sk] Fri Jan 11 11:28:33 2008 (kreymer@(null)) : cvsh -c cvs server [sk] grep -v 'cvsh \-' cvshlog Sat Dec 1 22:33:35 2007 (rhatcher@(null)) : -cvsh [sk] Tue Dec 4 16:10:26 2007 (rhatcher@(null)) : -cvsh [sSk] Mon Jan 7 17:20:51 2008 (kreymer@(null)) : -cvsh [sk] Thu Jan 10 13:18:50 2008 (kreymer@(null)) : -cvsh [sk] Mon Jan 21 07:48:38 2008 (kreymer@(null)) : -cvsh [sk] Mon Jan 21 07:49:19 2008 (kreymer@(null)) : -cvsh [sk] Mon Jan 21 08:23:47 2008 (kreymer@(null)) : -cvsh [sk] Mon Jan 21 09:04:56 2008 (kreymer@(null)) : -cvsh [sk] ============================================================================= 2008 01 19 Date: Sat, 19 Jan 2008 18:48:05 -0600 (CST) Subject: HelpDesk ticket 109813 Short Description: minos01 cannot issue afs tokens Problem Description: run2-sys : Starting sometime today, Saturday 19 January 2008, it seems that node minos01 can no longer issue AFS tokens via aklog. There is no such problem on the other minos02 through minos26 nodes, or on the FNALU interactive nodes. The klog command, using a password, works OK on minos01. For example, MINOS01 > date Sat Jan 19 18:36:22 CST 2008 MINOS01 > klist -f Ticket cache: /tmp/krb5cc_1060_jq7966 Default principal: kreymer@FNAL.GOV Valid starting Expires Service principal 01/19/08 18:23:17 01/20/08 04:17:56 krbtgt/FNAL.GOV@FNAL.GOV renew until 01/26/08 18:15:24, Flags: FfRA MINOS01 > aklog -d Authenticating to cell fnal.gov (server fsus01.fnal.gov). We've deduced that we need to authenticate to realm FNAL.GOV. Getting tickets: afs/@FNAL.GOV Kerberos error code returned by get_cred: -1765328165 aklog: Couldn't get fnal.gov AFS tickets: aklog: unknown RPC error (-1765328165) while getting AFS tickets MINOS01 > klog Password: MINOS01 > tokens Tokens held by the Cache Manager: User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Jan 25 19:19] --End of list-- ----------------------------------------- Date: Tue, 22 Jan 2008 10:30:42 -0600 (CST) Solution: sether@fnal.gov sent this solution: This appears to be working now. We noticed that /etc/resolv.conf only had one name server listed. Having problems with both afs and sendmail could have been caused by the name server being down. I added two more fnal dns servers to the file to prevent this happening again. 
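For future reference, a quick check of every nameserver a node is configured to use (a sketch; www.fnal.gov is just a convenient name to resolve, and host -W sets a short timeout). This could be wrapped in the usual for NODE in ${NODES} ssh loop to survey the whole cluster and catch single-nameserver configurations before they bite again.
for NS in `grep '^nameserver' /etc/resolv.conf | awk '{print $2}'` ; do
  printf "${NS} "
  if host -W 3 www.fnal.gov ${NS} > /dev/null 2>&1 ; then
    printf "OK\n"
  else
    printf "FAIL\n"
  fi
done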
============================================================================= 2008 01 18 ########## # PARROT # ########## mindata : mkdir /grid/app/minos/parrot cd /grid/app/minos/parrot curl http://www.cse.nd.edu/~ccl/software/files/cctools-2_4_0-i686-linux-2.4.tar.gz \ -o cctools-2_4_0-i686-linux-2.4.tar.gz tar xzvf cctools-2_4_0-i686-linux-2.4.tar.gz export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 export PATH=${PARROT_DIR}/bin:${PATH} USAGE Before setting up grow indexes, parrot -M \ /afs/fnal.gov/files/code/e875/releases=/http/www-numi.fnal.gov/computing/releases \ bash cat /afs/fnal.gov/files/code/e875/releases/GENIE/Linux2.4-GCC_3_4/checkout.sh or parrot -m ${PARROT_DIR}/mountfile.html /bin/bash cat /afs/fnal.gov/files/code/e875/releases/GENIE/Linux2.4-GCC_3_4/checkout.sh or parrot -m ${PARROT_DIR}/mountfile.grow [ -d all ] [ -d remote ] mountfile.glow : /afs/fnal.gov/files/code/e875/general/minossoft /grow/www-numi.fnal.gov/computing/minossoft /afs/fnal.gov/files/code/e875/general/products /grow/www-numi.fnal.gov/computing/products /afs/fnal.gov/files/code/e875/releases /grow/www-numi.fnal.gov/computing/releases /afs/fnal.gov/files/code/e875/releases1 /grow/www-numi.fnal.gov/computing/releases1 /afs/fnal.gov/files/code/e875/releases2 /grow/www-numi.fnal.gov/computing/releases2 mountile.html : /afs/fnal.gov/files/code/e875/general/minossoft /html/www-numi.fnal.gov/computing/minossoft /afs/fnal.gov/files/code/e875/general/products /html/www-numi.fnal.gov/computing/products /afs/fnal.gov/files/code/e875/releases /html/www-numi.fnal.gov/computing/releases /afs/fnal.gov/files/code/e875/releases1 /html/www-numi.fnal.gov/computing/releases1 /afs/fnal.gov/files/code/e875/releases2 /html/www-numi.fnal.gov/computing/releases2 PROXY export HTTP_PROXY squid.fnal.gov GROW make_growfs seems to properly walk the tree, but not follow symlinks, making .growfschecksum .growfsdir GRRRRRRRR the mountfiles just are not working, for anything but anonftp, on fngp-osg Let's try GROW on a smaller bit of the web : MINOS26 > make_growfs /afs/fnal.gov/files/expwww/numi/html/computing/dh entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/db entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/badfiles entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/protons entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/pnfslog entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/pnfslog/2005 entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/pnfslog/2005/11 entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/pnfslog/2005/12 ... FNGP-OSG > parrot -M /dh=/grow/www-numi.fnal.gov/computing/dh/ -d remote bash Both of these work ####### # AFS # ####### Per a Ray Pasetes phone conversation earlier today : All Minos AFS file systems are now on AFS servers with upgraded software and firmware. We have seen no AFS timeouts on the Minos Cluster since 15 January ! ######## # FARM # ######## Need to clean up after repeated /m/d failures, esp. 
Dec 13, in 2007-12/cedarfar.log /pnfs/minos/reco_far/cedar/sntp_data/2007-12 F00040057_0000.all.sntp.cedar.0.root F00040057_0000.spill.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/.bntp_data/2007-12 F00040057_0000.spill.bntp.cedar.0.root AFSS/dc2nfs -n -d reco_far/cedar/sntp_data/2007-12 SRV1> chmod 775 /minos/data/reco_far/cedar/sntp_data/* $ AFSS/dc2nfs.20080118 -d reco_far/cedar/sntp_data/2007-12 22/ 66 /minos/data/reco_far/cedar/sntp_data 2384 F00040054_0000.spill.sntp.cedar.0.root dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar/sntp_data/2007-12/F00040057_0000.all.sntp.cedar.0.root /minos/data/reco_far/cedar/sntp_data/2007-12/F00040057_0000.all.sntp.cedar.0.root 23/ 66 /minos/data/reco_far/cedar/sntp_data 2384 F00040057_0000.all.sntp.cedar.0.root dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar/sntp_data/2007-12/F00040057_0000.spill.sntp.cedar.0.root /minos/data/reco_far/cedar/sntp_data/2007-12/F00040057_0000.spill.sntp.cedar.0.root 66/ 66 /minos/data/reco_far/cedar/sntp_data 2384 F00040124_0000.spill.sntp.cedar.0.root $ AFSS/dc2nfs.20080118 -n -d reco_far/cedar/.bntp_data/2007-12 11/ 33 /minos/data/reco_far/cedar/.bntp_data 2383 F00040054_0000.spill.bntp.cedar.0.root dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar/.bntp_data/2007-12/F00040057_0000.spill.bntp.cedar.0.root /minos/data/reco_far/cedar/.bntp_data/2007-12/F00040057_0000.spill.bntp.cedar.0.root 33/ 33 /minos/data/reco_far/cedar/.bntp_data 2383 F00040124_0000.spill.bntp.cedar.0.root Oops, should I have just moved the concatenated file from farcat ? Yup, checking and removing the extra copies, on fnpcsrv1 cd /minos/data/minfarm/WRITE MDP=/minos/data/reco_far/cedar/sntp_data/2007-12 FILE=F00040057_0000.all.sntp.cedar.0.root ls -l ${FILE} ${MDP}/${FILE} diff ${FILE} ${MDP}/${FILE} rm ${MDP}/${FILE} mv ${FILE} ${MDP}/${FILE} ln -s ${MDP}/${FILE} ${FILE} FILE=F00040057_0000.spill.sntp.cedar.0.root ls -l ${FILE} ${MDP}/${FILE} diff ${FILE} ${MDP}/${FILE} rm ${MDP}/${FILE} mv ${FILE} ${MDP}/${FILE} ln -s ${MDP}/${FILE} ${FILE} MDP=/minos/data/reco_far/cedar/.bntp_data/2007-12 FILE=F00040057_0000.spill.bntp.cedar.0.root ls -l ${FILE} ${MDP}/${FILE} mv ${FILE} ${MDP}/${FILE} ln -s ${MDP}/${FILE} ${FILE} ############## # MINOS_DATA # ############## cedar sntp/bntp total is 30991 from CFLSUM, /M/D near sntp 11837 /M/D far sntp wc -l /afs/fnal.gov/files/data/minos/d10/indexes/*_near.cedar.index 12220 wc -l /afs/fnal.gov/files/data/minos/d10/indexes/*_far.cedar.index 16478 ####### # WEB # ####### Updated dhmain.html from dhmain.20071019.html to dhmain.20080118.html pointing the the numi08 new elog ######## # GRID # ######## Date: Fri, 18 Jan 2008 09:33:54 -0600 (CST) From: Steven Timm To: fermigrid-users@fnal.gov Subject: GP Grid job evictions. There were a large number of jobs from many different users that were evicted from the farms last night. We are currently investigating why this happened. Once we get it fixed the jobs will restart with no further action requrired on your part. ########## # CONDOR # ########## At midnight, condor glideins started queuing up midnight, cleared at 07:00 Condorview looks pretty normal ############# # CHECKLIST # ############# minos-sam01 - ganglia stops around 23:30 yesterday GANGLIA for minos cluster shows hosts dropping off starting around midnight. Nothing is really wrong with the hosts. Same behaviour for Minos Cluster and Minos Server nodes, but Minos DB looks OK. 
########### # GANGLIA # ########### Date: Fri, 18 Jan 2008 08:35:27 -0600 (CST) Subject: HelpDesk ticket 109750 Short Description: Ganglia monitoring for Minos Cluster and Minos Server nodes is broken Problem Description: run2-sys : Starting around 23:30 yesterday 17 Jan 2008, the Ganglia monitoring at http://rexganglia2.fnal.gov/minos started gradually losing contact with the monitored nodes. Ganglia now claims that all the Minos Cluster and most Server nodes are down. The monitored nodes are in fact up, and seem to be acting normally. Ganglia monitoring of the MINOS DB nodes minosora1 and minosora3 is normal, as well as monitoring of minos-mysql1 in the Minos Server group. ------------------------------------------- Date: Fri, 18 Jan 2008 08:42:23 -0600 (CST) This ticket has been reassigned to JONES, TERRY of the CD-SF/FEF Group. ------------------------------------------- Date: Fri, 18 Jan 2008 09:19:53 -0600 (CST) Note To Requester: jonest@fnal.gov sent this Notes To Requester: The d0om server has lost its network. this could be the problem. > Or, just a distractionj ------------------------------------------- Date: Fri, 18 Jan 2008 13:25:18 -0600 (CST) Solution: jonest@fnal.gov sent this solution: > Ganglia data is collecting again. ============================================================================= 2008 01 17 ########## # CONDOR # ########## Email from chadwick ( -> minosadmin ) The limit is 2.5 days (60 hours) - here is how to get a robot proxy with this lifetime (thanks to Matt Crawford for his assistance). -Keith. [chadwick@fermigrid0 ~]$ p=chadwick/cron/fermigrid0.fnal.gov [chadwick@fermigrid0 ~]$ kinit $p -k -t /var/adm/krb5/`kcron -f` [chadwick@fermigrid0 ~]$ klist Ticket cache: /tmp/krb5cc_1021_P31493 Default principal: chadwick/cron/fermigrid0.fnal.gov@FNAL.GOV Valid starting Expires Service principal 01/17/08 16:39:26 01/18/08 02:39:26 krbtgt/FNAL.GOV@FNAL.GOV renew until 01/20/08 04:39:26 [chadwick@fermigrid0 ~]$ voms-proxy-init -noregen -voms fermilab:/fermilab -userconf $HOME/vomses/vomses.voms -valid 60:0 ####### # POT # ####### Updated current and historic links to POT plots ####### # WEB # ####### Around 14:00, stopped getting service from www-numi.fnal.gov Can ping the host : MINOS26 > ping -c 3 expwww17.fnal.gov PING expwww17.fnal.gov (131.225.70.20) 56(84) bytes of data. 64 bytes from expwww17.fnal.gov (131.225.70.20): icmp_seq=0 ttl=125 time=0.442 ms 64 bytes from expwww17.fnal.gov (131.225.70.20): icmp_seq=1 ttl=125 time=0.487 ms 64 bytes from expwww17.fnal.gov (131.225.70.20): icmp_seq=2 ttl=125 time=0.460 ms Helpdesk tickets 109719, 109720 issued by NGOP around 20:03 UTC Several other pages are down, including COUPP , Miniboone, SDSS 15:02 - COUPP and SDSS are back 15:05 - NUMI is back The problem was network routing, per ticket 109719 ####### # WEB # ####### Received request from lauram to test Apache 2 servers during beta period ending Jan 23. Forwarded the email to minos-admin, Liz forwarded it to Cat. 
In /etc/hosts or /C:\WINDOWS\system32\drivers\etc\hosts do something like 131.225.70.203 webstats.fnal.gov 131.225.70.203 computing.fnal.gov 131.225.70.203 cdorg.fnal.gov 131.225.70.203 www-numi.fnal.gov 131.225.70.202 www.fnal.gov ############ # RELEASES # ############ R1.28 tagging has begun, with ROOT 5.18/00 ########## # CONDOR # ########## GPfarm condor upgrade to 6.9.5 and OSG 0.8.0 scheduled 09:00 Completed around 12:00 For present, need to do condor_q -direct schedd pending a fix in the local configuration ######## # BMNT # ######## Plan to rename and remove bmnt and mrnt files for farcat 2915 11365 spill.bmnt.cedar_phy_bhcurv.0.root Per Howie, some of these runs have been done with mrnt/bmnt, and many with just the normal mrnt files. So I need a list of bmnt files, need to remove just those particular mrnt's. BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` PLAN : 0) Get file name list for bmnt 1) Get corresponding mrnt file list 2) Clean up the mrnt files a) move from /M/D/RF/CPB/mrnt_data to /M/D/RF/CPB/BMNT b) Remove the files from PNFS c) Set aside farm bookkeeping files READ, SAM/READ d) Undeclare from SAM 3) Rename the bmnt files to mrnt ------------------------- EXECUTION : 0) BMNT LIST - kreymer BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` printf "${BFILES}\n" | wc -w 2915 printf "${BFILES}\n" > /minos/scratch/kreymer/bmnt/BFILES BFILES=`cat /minos/scratch/kreymer/bmnt2/BFILES` BFILES runs from F00032481_0000.spill.bmnt.cedar_phy_bhcurv.0.root 2005-08 to F00038559_0023.spill.bmnt.cedar_phy_bhcurv.0.root 2007-07 1) MRNT LIST - kreymer/mindata/rubin/minfarm MRUNS=`printf "${BFILES}\n" | cut -f 1 -d _ | sort -u` printf "${MRUNS}\n" | wc -w 179 F00032481 F00032484 ... F00038556 F00038559 Rough check for _000000 subruns for MRUN in ${MRUNS} ; do sam locate ${MRUN}_0000.spill.mrnt.cedar_phy_bhcurv.0.root done Datafile with name 'F00038266_0000.spill.mrnt.cedar_phy_bhcurv.0.root' not found. Detailed check via SAM for MRUN in ${MRUNS} ; do RUN=`echo ${MRUN} | cut -c 5-` SAMDIM=" DATA_TIER mrnt-far and VERSION cedar.phy.bhcurv and PHYSICAL_DATASTREAM_NAME spill and RUN_NUMBER ${RUN} " sam list files --dim="${SAMDIM}" --nosummary done > /minos/scratch/kreymer/bmnt/MFILES wc -l /minos/scratch/kreymer/bmnt/MFILES 179 /tmp/MFILES grep -v '_0000' /minos/scratch/kreymer/bmnt/MFILES F00032481_0007.spill.mrnt.cedar_phy_bhcurv.0.root This makes sense I think. 
We picked up one non subrun 0 file, and there is one run missing, F00038266 MFILES=`cat /minos/scratch/kreymer/bmnt/MFILES` printf "${MFILES}\n" | wc -l 179 for MFILE in ${MFILES} ; do MON=`sam locate ${MFILE} | cut -f 7 -d / | cut -f 1 -d ,` printf "reco_far/cedar_phy_bhcurv/mrnt_data/${MON}/${MFILE}\n" \ | tee -a /minos/scratch/kreymer/bmnt/MFILEPS done MFILEPS=`cat /minos/scratch/kreymer/bmnt/MFILEPS` for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} ; done for MFILEP in ${MFILEPS} ; do ls -l /minos/data/${MFILEP} ; done for each account, did BFILES=`cat /minos/scratch/kreymer/bmnt/BFILES` MFILES=`cat /minos/scratch/kreymer/bmnt/MFILES` MFILEPS=`cat /minos/scratch/kreymer/bmnt/MFILEPS` 2a) /minos/data - minfarm@fnpcsrv1 for MFILEP in ${MFILEPS} ; do MFILER=`echo ${MFILEP} | sed s/mrnt_data/BMNT/g` MFILED=`dirname ${MFILER}` mkdir -p /minos/data/${MFILED} mv /minos/data/${MFILEP} /minos/data/${MFILER} done find /minos/data/reco_far/cedar_phy_bhcurv/BMNT -type f 179 2b) /pnfs/minos - rubin for MFILEP in ${MFILEPS} ; do #ls -l /pnfs/minos/${MFILEP} rm /pnfs/minos/${MFILEP} done 2c) READ, SAM/READ Do this as minfarm@fnpcsrv1 cd /export/stage/minfarm/ROUNDUP mkdir READBMNT for MFILE in ${MFILES} ; do #ls READ/SAM/${MFILE} mv READ/SAM/${MFILE} READBMNT/${MFILE} done 2d) SAM kreymer@minos26 for MFILE in ${MFILES} ; do sam undeclare file ${MFILE} done 3) rename bmnt - minfarm cd /minos/data/minfarm/farcat for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` [ -r ${MFILE} ] && ls -l ${MFILE} done One run, 38266, is missing subrun 11 in both mrnt and bmnt. -rw-rw-r-- 1 rubin numi 2959459 Dec 12 20:52 F00038266_0000.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2920808 Dec 12 20:49 F00038266_0001.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2822482 Dec 12 20:51 F00038266_0002.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2979061 Dec 12 20:52 F00038266_0003.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2977375 Dec 12 20:53 F00038266_0004.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2953857 Dec 12 20:53 F00038266_0005.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2846984 Dec 12 20:53 F00038266_0006.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2930329 Dec 12 20:53 F00038266_0007.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2909285 Dec 12 20:54 F00038266_0008.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 3030490 Dec 12 20:51 F00038266_0009.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2132720 Dec 12 20:50 F00038266_0010.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2326530 Dec 12 20:54 F00038266_0012.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2539648 Dec 12 20:53 F00038266_0013.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2943127 Dec 12 20:55 F00038266_0014.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2993096 Dec 12 20:55 F00038266_0015.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2720666 Dec 12 20:50 F00038266_0016.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2980667 Dec 12 20:56 F00038266_0017.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2799555 Dec 12 20:52 F00038266_0018.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 1961562 Dec 12 20:54 F00038266_0019.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2887540 Dec 12 20:55 F00038266_0020.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2966054 Dec 12 20:54 
F00038266_0021.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2945561 Dec 12 20:53 F00038266_0022.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2943774 Dec 12 21:04 F00038266_0023.spill.mrnt.cedar_phy_bhcurv.0.root Moved these files out of the way, mkdir /minos/data/minfarm/DUP/BMNT mv F00038266*mrnt*root /minos/data/minfarm/DUP/BMNT/ 14:36 for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` mv ${BFILE} ${MFILE} done Later, 175 far mrnt's were added to PNFS and MD by roundup. ============================================================================= 2008 01 16 ####### # AFS # ####### MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Jan " | grep -v Tokens | grep Lost | grep 131.225 | uniq'; done minos03 Jan 15 20:10:43 minos03 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) minos17 Jan 14 11:19:24 minos17 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) minos23 Jan 13 20:37:53 minos23 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) ######## # BMNT # ######## Plan to rename and remove bmnt and mrnt files for farcat 2915 11365 spill.bmnt.cedar_phy_bhcurv.0.root SAMDIM=" DATA_TIER mrnt-far and VERSION cedar.phy.bhcurv and PHYSICAL_DATASTREAM_NAME spill " MINOS26 > sam list files --dim="${SAMDIM}" Files: F00031305_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00030618_0000.spill.mrnt.cedar_phy_bhcurv.0.root ... F00032638_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00032644_0000.spill.mrnt.cedar_phy_bhcurv.0.root File Count: 1371 Average File Size: 22.49MB Total File Size: 30.11GB Total Event Count: 282038696 This does not quite add up, as the mrnt's are concatenated, and there should be more bmnt's waiting for concatenation. sam list files --dim="${SAMDIM}" --nosummary | sort F00030612_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00030613_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00030614_0000.spill.mrnt.cedar_phy_bhcurv.0.root ... 
F00038553_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00038556_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00038559_0000.spill.mrnt.cedar_phy_bhcurv.0.root MINOS26 > ls /minos/data/reco_far/cedar_phy_bhcurv/mrnt_data/ | sort 2005-03 2005-04 2005-05 2005-06 2005-07 2005-08 2005-09 2005-10 2005-11 2005-12 2006-01 2006-02 2006-05 2006-06 2006-07 2006-08 2006-09 2006-10 2006-11 2006-12 2007-01 2007-02 2007-03 2007-04 2007-05 2007-06 2007-07 2007-08 2007-09 check counts in /M/D/RF/CPB/ MINOS26 > find /minos/data/reco_far/cedar_phy_bhcurv/mrnt_data -type f | wc -l 1371 MINOS26 > find /minos/data/reco_far/cedar_phy_bhcurv/sntp_data -type f | wc -l 2767 MINOS26 > find /minos/data/reco_far/cedar_phy_bhcurv/.bntp_data -type f | wc -l 1400 Need to : 0) Get file name list for mrnt and bmtn 1) Set aside pnfs and MD mrnt files 2) Set aside farm bookkeeping files READ, SAM/READ 3) Undeclare the files from SAM 4) Rename the bmnt files to mrnt 0) SAMDIM=" DATA_TIER mrnt-far and VERSION cedar.phy.bhcurv and PHYSICAL_DATASTREAM_NAME spill " MFILES=`sam list files --dim="${SAMDIM}" --nosummary | sort` BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` MINOS26 > printf "${MFILES}\n" | wc -w 1371 MINOS26 > printf "${BFILES}\n" | wc -w 2915 MFILES runs from F00030612_0000.spill.mrnt.cedar_phy_bhcurv.0.root 2005-03 to F00038559_0000.spill.mrnt.cedar_phy_bhcurv.0.root 2007-07 BFILES runs from F00032481_0000.spill.bmnt.cedar_phy_bhcurv.0.root 2005-08 to F00038559_0023.spill.bmnt.cedar_phy_bhcurv.0.root 2007-07 1) mv /minos/data/reco_far/cedar_phy_bhcurv/mrnt_data \ /minos/data/reco_far/cedar_phy_bhcurv/mrnt_data_removed mv /pnfs/minos/reco_far/cedar_phy_bhcurv/mrnt_data \ /pnfs/minos/reco_far/cedar_phy_bhcurv/mrnt_data_removed 2) ... 3) ... ####### # AFS # ####### ############ # PREDATOR # ############ Near and Far genpy failed, at 11:06 and 11:10 . HOWTO.dccp - test succeeds, full speed. These files got picked up on the next cycle, at 13:06-ish ######### # MYSQL # ######### Monthly backups, per HOWTO.dbarchive.20080115 Shifted montly backups to archive subdirectory for MON in 20060418 20060421 20071218 ; do mv /minos/data/mysql/${MON} /minos/data/mysql/archive/${MON} ; done Started main copy around 11:00, informed the control room. Mysql> du -sm . 57955 . `DCS_HV.MYD' -> `/minos/data/mysql/archive/20080116/offline/DCS_HV.MYD' real 6m36.177s 13457 /minos/data/mysql/archive/20080116/offline/DCS_HV.MYD `PULSERGAIN.MYD' -> `/minos/data/mysql/archive/20080116/offline/PULSERGAIN.MYD' real 8m32.173s 14363 /minos/data/mysql/archive/20080116/offline/PULSERGAIN.MYD real 24m55.808s Net 40', had been 70' PULSERDRIFTPINVLD.MYD 21' md5sum was 31 53' gzip was 99 ( 41 CPU ) COPY TO DBCOPL real 9m44.978s user 0m3.115s sys 1m55.999s Copy binlog - these binlogs are unreasonably large ! Mysql> du -sm /data/archive/BINLOG/ 43581 /data/archive/BINLOG/ Jan 3 - 9 GB Jan 4 - 20 GB Jan 14 - 7 GB Jan 15 - 4 GB rsync : real 25m37.174s user 8m32.292s sys 4m32.641s ( corrected BINLOG copy to /minos/data/mysql/BINLOG from archive/BINLOG, rm -r BINLOG mv archive/BINLOG BINLOG ######### # MYSQL # ######### ln -sf HOWTO.dbarchive.20080115 HOWTO.dbarchive # was HOWTO.dbarchive.20070705 ########## # CONDOR # ########## The 7:50 glide jobs are still queued up, no glideins running. gfactory and gfrontend jobs are running. There are 300 rubin jobs running, 275 idle. That has probably pushed us aside for awhile. The backup cleared and the jobs ran round 12:00 . 
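The checksum and compression of PULSERDRIFTPINVLD.MYD in the MYSQL monthly backup above are logged only as timings. A minimal sketch of that step, assuming the archive layout used elsewhere in this entry; the actual commands in HOWTO.dbarchive.20080115 may differ.
ARCH=/minos/data/mysql/archive/20080116/offline
time md5sum ${ARCH}/PULSERDRIFTPINVLD.MYD > ${ARCH}/PULSERDRIFTPINVLD.MYD.md5   # record the checksum before compressing
time gzip ${ARCH}/PULSERDRIFTPINVLD.MYD                                         # leaves PULSERDRIFTPINVLD.MYD.gz in place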
######## # FARM # ######## Also removed the extra files from /minos/data/reco_far/cedar_phy_bhcurv/*_data/2007-08 /minos/data/reco_far/cedar_phy_bhcurv/*_data/2007-09 rm /minos/data/reco_far/cedar_phy_bhcurv/*_data/2007-09/*.root rm /minos/data/reco_far/cedar_phy_bhcurv/.*_data/2007-09/*.root rm /minos/data/reco_far/cedar_phy_bhcurv/*_data/2007-08/*.root rm /minos/data/reco_far/cedar_phy_bhcurv/.*_data/2007-08/*.root ########## # DCACHE # ########## The e907/mipp geant write backlog cleared out last night. ######### # MYSQL # ######### News storied report that Sun is purchasing Mysql. ============================================================================= 2008 01 15 ######## # FARM # ######## Date: Tue, 15 Jan 2008 15:16:43 -0600 From: Howard Rubin To: Art Kreymer Cc: Alex Sousa Subject: Files removed from enstore/dcache and /minos/data for 2007-08 and -09 Art, Here are the run numbers for which all SAM entries must be undeclared. I have deleted the files, including the hidden files, from /pnfs/minos/reco_far/cedar_phy_bhcurv/*_data/2007-0[89] *except* for candidates for F00038559 which complete the last run of 2007-07. Note that there are no ntuples for this run in 2007-08 because they were properly concatenated onto the 2007-07 runs. Howie F00038562 F00038565 F00038568 F00038571 F00038572 F00038575 F00038580 F00038585 F00038588 F00038591 F00038594 F00038597 F00038600 F00038603 F00038822 F00038825 F00038828 F00038869 F00038891 F00038893 F00038897 F00038902 F00038914 F00038918 F00038928 F00038869 F00038891 F00038893 F00038897 F00038902 F00038914 F00038918 F00038928 F00039044 F00039047 F00039050 F00039070 F00039281 F00039284 F00039306 F00039309 F00039312 F00039316 F00039334 F00039337 fnpcsrv1$ cat rm.2007-09 F00039337 F00039340 F00039345 F00039348 F00039349 F00039350 F00039353 F00039356 F00039359 F00039362 F00039571 F00039574 F00039577 F00039580 F00039583 F00039586 F00039589 F00039592 F00039595 F00039603 F00039607 F00039608 F00039610 F00039615 F00039618 F00039622 F00039625 F00039628 F00039631 F00039653 F00039676 F00039679 F00039682 F00039685 F00039688 F00039691 F00039694 F00039697 F00039700 F00039704 F00039707 F00039710 F00039713 F00039716 F00039719 STRM=cand SAMDIM=" DATA_TIER ${STRM}-far and VERSION cedar.phy.bhcurv and RUN_NUMBER > 38560 " sam list files --dim="${SAMDIM}" --nosummary | cut -f 1 -d _ | sort -u > /tmp/samcandlist ./samundeclare -n "${SAMDIM}" | wc -l 2077 MINOS26 > ./samundeclare "${SAMDIM}" Found 2075 files undeclared F00038562_0004.spill.cand.cedar_phy_bhcurv.0.root undeclared F00038562_0011.spill.cand.cedar_phy_bhcurv.0.root undeclared F00038562_0013.all.cand.cedar_phy_bhcurv.0.root ... undeclared F00039716_0003.spill.cand.cedar_phy_bhcurv.0.root undeclared F00039716_0021.spill.cand.cedar_phy_bhcurv.0.root undeclared F00039719_0002.all.cand.cedar_phy_bhcurv.0.root MINOS26 > sam list files --dim="${SAMDIM}" No files match the given constraints. STRM=bcnd File Count: 816 MINOS26 > ./samundeclare "${SAMDIM}" Found 816 files undeclared F00038562_0001.spill.bcnd.cedar_phy_bhcurv.0.root undeclared F00038565_0021.spill.bcnd.cedar_phy_bhcurv.0.root undeclared F00038568_0009.spill.bcnd.cedar_phy_bhcurv.0.root ... 
undeclared F00039713_0018.spill.bcnd.cedar_phy_bhcurv.0.root undeclared F00039713_0015.spill.bcnd.cedar_phy_bhcurv.0.root undeclared F00039716_0016.spill.bcnd.cedar_phy_bhcurv.0.root sam list files --dim="${SAMDIM}" Picking up sntp, bntp, mrnt files SAMDIM=" VERSION cedar.phy.bhcurv and RUN_NUMBER > 38560 " sam list files --dim="${SAMDIM}" File Count: 288 Average File Size: 121.48MB Total File Size: 34.17GB Total Event Count: 85685481 MINOS26 > ./samundeclare "${SAMDIM}" Found 288 files undeclared F00038562_0000.spill.bntp.cedar_phy_bhcurv.0.root undeclared F00038568_0000.all.sntp.cedar_phy_bhcurv.0.root undeclared F00038588_0000.spill.sntp.cedar_phy_bhcurv.0.root ... undeclared F00039710_0000.all.sntp.cedar_phy_bhcurv.0.root undeclared F00039713_0000.spill.bntp.cedar_phy_bhcurv.0.root undeclared F00039719_0000.spill.bntp.cedar_phy_bhcurv.0.root MINOS26 > sam list files --dim="${SAMDIM}" No files match the given constraints. Sent mail to minos_batch regarding this. ######### # MYSQL # ######### Preparing HOWTO.dbarchive.20080115 writing to /minos/data/mysql/archive... instead of /data/archive/... Cleaned up directory, cd /data/archive rm locate rmdir CP rmdir DUMP mv COPY/20071218 20071218 rmdir COPY ######### # MYSQL # ######### Heavy load on minos-mysql, varying, blocks of mininum x5, max x25 load Queries from flxb*, like select max(TIMEEND) from DCS_MAG_FARVLD where TIMEEND ... bjobs -u all -r bjobs -u all -r | cut -f 3 -d ' ' | sort -u jdejong jyuko pawlosk scavan sjc Mysql> mysqladmin processlist -u root | cut -f 6 -d ' ' | cut -f 1 -d ':' | sort -u flxb10.fnal.gov flxb11.fnal.gov flxb13.fnal.gov flxb18.fnal.gov flxb19.fnal.gov flxb20.fnal.gov flxb22.fnal.gov flxb24.fnal.gov flxb25.fnal.gov flxb26.fnal.gov flxb27.fnal.gov flxb30.fnal.gov flxb31.fnal.gov flxb32.fnal.gov flxb33.fnal.gov flxb34.fnal.gov flxi06.fnal.gov minos02.fnal.gov minos03.fnal.gov minos04.fnal.gov minos06.fnal.gov minos07.fnal.gov minos09.fnal.gov minos10.fnal.gov minos14.fnal.gov minos15.fnal.gov minos18.fnal.gov minos20.fnal.gov minos21.fnal.gov minos22.fnal.gov minos23.fnal.gov minos24.fnal.gov minos26.fnal.gov But none of the DCS_MAG_FARVLD queries are coming from minos* jobs. 
mysqladmin processlist -u root | grep DCS_MAG_FARVLD | cut -f 6 -d ' ' | cut -f 1 -d ':' | sort flxb10.fnal.gov flxb10.fnal.gov flxb11.fnal.gov flxb11.fnal.gov flxb13.fnal.gov flxb13.fnal.gov flxb18.fnal.gov flxb18.fnal.gov flxb19.fnal.gov flxb20.fnal.gov flxb22.fnal.gov flxb22.fnal.gov flxb24.fnal.gov flxb24.fnal.gov flxb26.fnal.gov flxb27.fnal.gov flxb27.fnal.gov flxb30.fnal.gov flxb31.fnal.gov flxb31.fnal.gov flxb32.fnal.gov flxb33.fnal.gov flxb34.fnal.gov flxb34.fnal.gov flxi06.fnal.gov flxi06.fnal.gov ########## # CONDOR ########## Created newer proxy for gfactory, SRV1> cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 800:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy \ -valid 800:0 Your proxy is valid until Sun Feb 17 16:59:24 2008 [gfactory@minos25 ~]$ cd .grid/ scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy \ kreymer-condor.proxy.20080217 cp -a kreymer-condor.proxy.20080217 kreymer-condor.proxy ============================================================================= 2008 01 14 ####### # WEB # ####### Date: Mon, 14 Jan 2008 13:25:49 -0600 From: John Inkmann To: Liz Buckley-Geer , kreymer@fnal.gov Cc: "webteam@fnal.gov" Subject: www-numi website Suspect files shared to system:anyuser The files listed below showed up on our scans for suspect files. Currently, it is being shared (to anyone) If the file is not needed, it can simply be deleted. Otherwise, issue the following commands, in the follow /usr/afsws/bin/fs sa -dir -acl lauram:expwwwmachine rl /usr/afsws/bin/fs sa -dir -acl system:anyuser none Either approach will prevent the file from showing up on next week's scan, after which, we usually make the Thanks for your attention in this matter. - see /home/kreymer/password.txt Date: Mon, 14 Jan 2008 13:40:04 -0600 (CST) From: Liz Buckley-Geer These are not password files. They are all copies of the same ROOT header file and it's associated dependency file. This file does not contain any passwords. It just happen to have the string Passwd in the name. They are visible by design. Liz ########## # VOMSES # ########## SRV1> find /export/stage/minfarm -name vomses find: /export/stage/minfarm/.grid/backup: Permission denied /export/stage/minfarm/homegrid/vdt-1.3.10/voms/etc/vomses ######### # FNALU # ######### Date: Mon, 14 Jan 2008 10:07:12 -0600 (CST) Subject: HelpDesk ticket 109510 Short Description: Mount /minos/data and /minos/scratch on flxi07 Problem Description: fnalu-admin : Please mount /minos/data and /minos/scratch on flxi05 and flxi07. flxi05 is needed to test 32 bit SLF 4.4 operations. flxi07 is needed so that we can transfer large files to AFS, and test 64bit kernel operations. The /minos areas are already mounted on flxi04 and flxi06 (SLF 3) -------------------------------------------- Date: Mon, 14 Jan 2008 10:20:51 -0600 (CST) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. -------------------------------------------- Date: Mon, 14 Jan 2008 10:32:11 -0600 (CST) Note To Requester: mgreaney@fnal.gov sent this Notes To Requester: Art, these mounts are done. -------------------------------------------- ########## # DCACHE # ########## Write pool backlog remains high, 2727 Queues, 145 Active as of 09:50. e907 ( Mipps ) total data usage is now 9940 6 TB 9940B 21 TB LTO-3 8 TB with several TB still queued in the writePool buffer. 
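Back to the DCS_MAG_FARVLD load check under 2008 01 15 above: the processlist pipeline there lists the client hosts one line per query. Adding uniq -c to the same pipeline (the only change here) gives a per-host query count, which makes the heavy clients easier to spot.
mysqladmin processlist -u root | grep DCS_MAG_FARVLD | cut -f 6 -d ' ' | cut -f 1 -d ':' | sort | uniq -c | sort -rn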
####### # AFS # ####### Testing large file access MIN > for NODE in ${UNODES} flxi07 ; do printf "${NODE} "; ssh ${NODE} 'ls -d /minos/data'; done flxi02 ls: /minos/data: No such file or directory flxi03 ls: /minos/data: No such file or directory flxi04 /minos/data flxi05 ls: /minos/data: No such file or directory flxi06 /minos/data flxi07 ls: /minos/data: No such file or directory MIN > for NODE in ${UNODES} flxi07 ; do printf "${NODE} "; ssh ${NODE} 'cat /etc/redhat-release'; done flxi02 Scientific Linux Fermi LTS release 4.4 (Wilson) flxi03 Scientific Linux Fermi LTS release 4.4 (Wilson) flxi04 Scientific Linux release 3.0.5 (Fermi) flxi05 Scientific Linux Fermi LTS release 4.5 (Wilson) flxi06 Scientific Linux release 3.0.5 (Fermi) flxi07 Scientific Linux Fermi LTS release 4.4 (Wilson) MIN > for NODE in ${UNODES} flxi07 ; do printf "${NODE} "; ssh ${NODE} 'cat /proc/cpuinfo | grep address | uniq'; done flxi02 address sizes : 36 bits physical, 48 bits virtual flxi03 address sizes : 36 bits physical, 48 bits virtual flxi04 flxi05 flxi06 flxi07 address sizes : 36 bits physical, 48 bits virtual MIN > for NODE in ${UNODES} flxi07 ; do printf "${NODE}\n"; ssh ${NODE} 'uname -a'; done flxi02 Linux flxi02.fnal.gov 2.6.9-55.0.9.ELsmp #1 SMP Fri Sep 28 09:24:48 CDT 2007 x86_64 x86_64 x86_64 GNU/Linux flxi03 Linux flxi03.fnal.gov 2.6.9-55.0.9.ELsmp #1 SMP Fri Sep 28 09:24:48 CDT 2007 x86_64 x86_64 x86_64 GNU/Linux flxi04 Linux flxi04.fnal.gov 2.4.21-32.0.1.ELsmp #1 SMP Wed May 25 15:42:26 CDT 2005 i686 i686 i386 GNU/Linux flxi05 Linux flxi05.fnal.gov 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 26 11:21:10 CDT 2007 i686 i686 i386 GNU/Linux flxi06 Linux flxi06.fnal.gov 2.4.21-47.ELsmp #1 SMP Thu Jul 20 09:54:04 CDT 2006 i686 i686 i386 GNU/Linux flxi07 Linux flxi07.fnal.gov 2.6.9-55.0.9.ELsmp #1 SMP Fri Sep 28 09:24:48 CDT 2007 x86_64 x86_64 x86_64 GNU/Linux FLXI02 > time sum /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz sum: /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz: Input/output error real 3m27.187s user 0m7.127s sys 0m4.183s FLXI03 > time sum /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz sum: /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz: Input/output error real 3m29.324s user 0m7.240s sys 0m4.183s FLXI07 > time sum /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz 51137 4755327 real 1m23.968s user 0m11.257s sys 0m15.685s ============================================================================= 2008 01 11 ######## # FARM # ######## Need to clean up after repeated /m/d failures, esp. Dec 13, in 2007-12/cedarfar.log /pnfs/minos/reco_far/cedar/sntp_data/2007-12 F00040057_0000.all.sntp.cedar.0.root F00040057_0000.spill.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/.bntp_data/2007-12 F00040057_0000.spill.bntp.cedar.0.root ########## # DCACHE # ########## Date: Fri, 11 Jan 2008 15:28:21 -0600 (CST) Subject: HelpDesk ticket 109468 Short Description: fndca read pools miserly with movers ? On Fri, 11 Jan 2008, J. Pedro Ochoa wrote: > Everything was working OK but since ~10.00 AM this morning it seems to just be stuck. > I am trying to get file > > reco_near/cedar_phy_bhcurv/sntp_data/2006-02/N00009767_0000.spill.sntp.cedar_phy_bhcurv.0.root That file is in DCache, in pool r-stkendca18a-6 . But that pool presently has 4 queued read requests, which is strange, as it should allow up to 50 reads, and only 2 are active. I'm reporting this to the experts. 
------------------------------------------ Date: Fri, 11 Jan 2008 15:33:26 -0600 (CST) This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. ------------------------------------------ ########## # CONDOR # ########## New gfactory processes disappear without running. That's because condorweb was removing the stage subdirectory, which exists only in AFS ( created by create_glidein ) Changed script to push only the monitor piece, per sfiligoi : Corrected around 14:25 CST ( 20:25 UTC ) LOCWG=/home/gfactory/web/monitor/ AFSWG=/afs/fnal.gov/files/expwww/numi/html/gfactory/monitor Note the trailing / on LOGWG, required by rsync /afs/fnal.gov/files/expwww/numi/html/gfactory/stage/glidein_t6 recreated at 20:32 UTC Previously, 23834.0 boehm 1/11 13:12 0+00:15:50 R 0 175.8 RunTemp01-11-08_13 23835.0 boehm 1/11 13:12 0+00:15:50 R 0 175.8 RunTemp01-11-08_13 23836.0 boehm 1/11 13:12 0+00:15:50 R 0 107.4 RunTemp01-11-08_13 23837.0 boehm 1/11 13:12 0+00:15:50 R 0 175.8 RunTemp01-11-08_13 23838.0 boehm 1/11 13:12 0+00:15:32 R 0 166.0 RunTemp01-11-08_13 ... Now 23834.0 boehm 1/11 13:12 0+01:20:07 I 0 175.8 RunTemp01-11-08_13 23835.0 boehm 1/11 13:12 0+01:20:03 I 0 175.8 RunTemp01-11-08_13 23836.0 boehm 1/11 13:12 0+01:20:07 I 0 107.4 RunTemp01-11-08_13 23837.0 boehm 1/11 13:12 0+01:20:08 I 0 175.8 RunTemp01-11-08_13 23838.0 boehm 1/11 13:12 0+01:19:51 I 0 166.0 RunTemp01-11-08_13 23839.0 boehm 1/11 13:12 0+00:00:00 I 0 9.8 RunTemp01-11-08_13 23839.0 boehm 1/11 13:12 0+00:00:00 I 0 9.8 RunTemp01-11-08_13 ... Jobs have ramped up, 14:57 cq gfactory 104 jobs; 0 idle, 104 running, 0 held 111 jobs; 7 idle, 104 running, 0 held 111 jobs; 0 idle, 111 running, 0 held ####### # AFS # ####### Nick reports 2 GB file size limit in AFS. /afs/fnal.gov/files/data/minos/d210/database_dumps/ Mysql> pwd /data/archive/COPY/20071218/offline 4649 MBytes in DCS_HV.MYD.gz Mysql> time cp DCS_HV.MYD.gz /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz cp: writing `/afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz': File too large real 3m6.080s user 0m0.565s sys 0m47.171s produced a 2 GB file http://www.openafs.org/pipermail/openafs-info/2006-March/021852.html Build it from source and use --enable-largefile-fileserver https://lists.openafs.org/pipermail/openafs-info/2002-September/005812.html 65K file per directory limit, translates to file name length limit Test on fsui03 fsui03 > cd /var/tmp fsui03 > scp -c blowfish minsoft@minos-mysql1:/data/archive/COPY/20071218/offline/DCS_HV.MYD.gz DCS_HV.MYD.gz Oops, the root partition did not have enough space ( 2.3 GB free ) Copy to /tmp instead fsui03 > time scp -c blowfish minsoft@minos-mysql1:/data/archive/COPY/20071218/offline/DCS_HV.MYD.gz /tmp/DCS_HV.MYD.gz real 14m8.760s user 0m4.180s sys 1m56.970s fsui03 > time cp /tmp/DCS_HV.MYD.gz /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz cp: /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz: File too large real 17m58.690s user 0m0.150s sys 4m22.220s produced no output file after the failure. 
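Worth noting before the next attempt: the copies are truncating right at the 2 GB ( 2^31 = 2147483648 byte ) boundary. A quick pre-check on the source file, an addition here and not from the log, flags at-risk copies before spending the transfer time.
FILE=/tmp/DCS_HV.MYD.gz
SIZE=`stat -c %s ${FILE}`                # size in bytes
[ ${SIZE} -ge 2147483648 ] && echo "OOPS - ${FILE} is 2 GB or larger, the AFS copy may truncate or fail"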
TRY AGAIN ON FLXI07, A 64BIT system FLXI07 > time scp -c blowfish minsoft@minos-mysql1:/data/archive/COPY/20071218/offline/DCS_HV.MYD.gz /tmp/DCS_HV.MYD.gz real 2m56.531s user 1m5.632s sys 0m49.417s FLXI07 > time cp /tmp/DCS_HV.MYD.gz /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz real 8m11.278s user 0m0.855s sys 1m6.878s ( a few minutes hangup after the copy seems complete, before exiting from cp ) Try this also on flxi05 FLXI05 > rpm -q openafs openafs-1.4.4-46.SL4 FLXI05 > time scp -c blowfish minsoft@minos-mysql1:/data/archive/COPY/20071218/offline/DCS_HV.MYD.gz /usr/scratch/sect1/DCS_HV.MYD.gz real 4m57.852s user 0m57.639s sys 0m46.515s FLXI05 > time cp /usr/scratch/sect1/DCS_HV.MYD.gz /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz2 cp: writing `/afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz2': File too large real 2m8.493s user 0m0.459s sys 0m37.696s FLXI05 > dds /afs/fnal.gov/files/data/minos/d88 total 6850453 drwxrwxrwx 3 root root 6144 Jan 11 17:32 ./ drwxr-xr-x 3 lisa g150 10240 Nov 1 15:28 ../ -rw-r--r-- 1 kreymer g020 4869454341 Jan 11 13:50 DCS_HV.MYD.gz -rw-r--r-- 1 kreymer g020 2145390592 Jan 11 17:34 DCS_HV.MYD.gz2 drwxr-xr-x 3 shanahan ktev 2048 Jan 2 11:04 ndphys/ FLXI05 > time cp /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz /usr/scratch/sect1/DCS_HV.MYD.gzbig cp: reading `/afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz': Input/output error real 2m22.158s user 0m0.318s sys 0m14.797s FLXI07 > rpm -q openafs openafs-1.4.4-46.SL4.x86_64 Mysql> rpm -q openafs openafs-1.4.4-46.SL4 For reference, copy this test file to /minos/data/mysql/DCS_HV.MYD.gz Rate from mysql1 was about real 2m56.960s user 0m0.096s sys 0m13.086s 4870 mbytes/ 177 sec = 28. MB/sec Try, for reference, the uncompressed database table, as will do for backups : Mysql> pwd /data/database/offline dds DCS_HV.MYD 14110664704 Mysql> time cp DCS_HV.MYD /minos/data/mysql/DCS_HV.MYD real 6m51.467s user 0m0.266s sys 0m38.744s Rate 14111. MB/411 sec = 34 MB/sec ########## # DCACHE # ########## Date: Fri, 11 Jan 2008 10:22:50 -0600 (CST) Subject: HelpDesk ticket 109440 dcache-admin : Over the last few days, the DCache write queue has grown gradually to a peak of 5000, and seems to be still climbing. See http://fndca.fnal.gov/dcache/queue/allpools.jpg As before, this is shutting down Minos farm and data import activity. Also, the writePool group seems to be very close to capacity and in danger of filling. See pools w-stkendca[10,11,12]a-[4,5,6] at http://fndca3a.fnal.gov:2288/usageInfo The major write activity again seems to be to the e907 'geant' family, over 2 TBytes in the last few days. See http://www-stken.fnal.gov/enstore/burn-rate/CD-LTO3.e907.jpg This time the problem is not the large number of small files, but the total volume. Please take action to clear this backlog. ---------------------------------------------- Date: Fri, 11 Jan 2008 10:30:17 -0600 (CST) This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. ---------------------------------------------- On 2 January, the write pools contained 1/4 TB of e907 data files. http://www-numi.fnal.gov/computing/dh/datasets/2008/01/current.w.20080102 Today it has nearly 6 TBytes of e907 data files. 
http://www-numi.fnal.gov/computing/dh/datasets/2008/01/current.w.20080111 ============================================================================= 2008 01 10 ######## # FARM # ######## Cleaning up, first the easy stuff Four duplicates to move out of the way : ./roundup -D -r cedar_phy_oldbhcurv mcnear fails with messages mv n13035098_0004_L010185N_D03.sntp.cedar_phy_oldbhcurv.root ../minfarm/DUP/n13035098_0004_L010185N_D03.sntp.cedar_phy_oldbhcurv.root mv: cannot move `n13035098_0004_L010185N_D03.sntp.cedar_phy_oldbhcurv.root' to `../minfarm/DUP/n13035098_0004_L010185N_D03.sntp.cedar_phy_oldbhcurv.root': No such file or directory AFSS/roundup.20080110 -D -r cedar_phy_oldbhcurv mcnear That removed them cleanly. ########## # CONDOR # ########## Corrected the configuration of the gfactory to write to local web, with assistance from sfiligoi. Edited glideinWMS/creation/glideinWMS.xml This is used in the creation of a new glidein_t* configuration. Changed glidein factory_name and monitor base_dir [gfactory@minos25 creation]$ diff glideinWMS.xml glideinWMS.xml.save 1,2c1,2 < < --- > > ./create_glidein glideinWMS.xml cd ./start_factory.sh Note that all the older _tN with N less that 6 are now obsolete. Monitoring plots are at links like http://www-numi.fnal.gov/gfactory/monitor/glidein_t6/total/ ########## # CONDOR # ########## ln -s ../../../../../home/room1/kreymer/minos/HOWTO.condor HOWTO.condor This puts the HOWTO at http://www-numi.fnal.gov/computing/HOWTO.condor ########### # BLUEARC # ########### Date: Thu, 10 Jan 2008 13:31:14 -0600 (CST) Subject: HelpDesk ticket 109383 Short Description: Quota request for BlueArc served /minos/scratch, for tjyang Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user tjyang on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. ---------------------------------------------------- Date: Thu, 10 Jan 2008 13:36:07 -0600 (CST) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. ---------------------------------------------------- Date: Thu, 10 Jan 2008 14:11:04 -0600 (CST) Solution: Hi Art, minos-nas-0:/scratch/tjyang quota has been increased to 500GB ---------------------------------------------------- ########## # NEXSAN # ########## After firmware update, need to mindata@minos26 crontab crontab.dat minfarm@fnpcsrv1 mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok ########## # CONDOR # ########## Testing glideins after outage, seems OK now. Factories submitted at 08:39, started running at about 08:43 ######## # GRID # ######## diff /usr/local/vdt-1.8.1/glite/etc/vomses /minos/scratch/kreymer/VDT/glite/etc/vomses Hacking my copy, changed fermigrid2.fnal.gov to voms.fnal.gov changed CN=host/voms.fnal.gov to CN=http/voms.fnal.gov Still some residual changes. How is this file to be kept up to date ? Where is the DOE CA information kept ? scp fnpcsrv1:/usr/local/vdt-1.8.1/glite/etc/vomses vomses MINOS26 > vdt-version --show You have installed a subset of VDT version 1.8.1a: CA Certificates v31 (includes IGTF 1.17 CAs) Fetch CRL 2.6.2 GPT 3.2 MINOS26 > pacman -update CA-Certificates Update of [/minos/scratch/kreymer/VDT:http://vdt.cs.wisc.edu/vdt_181_cache:VDT-Common] found... Update of [/minos/scratch/kreymer/VDT:http://vdt.cs.wisc.edu/vdt_181_cache:VDT-Environment] found... Update of [/minos/scratch/kreymer/VDT:http://vdt.cs.wisc.edu/vdt_181_cache:VDT-Version-Info] found... 
Update of [/minos/scratch/kreymer/VDT:http://vdt.cs.wisc.edu/vdt_181_cache:CA-Certificates-Base] found... Update of [/minos/scratch/kreymer/VDT:http://vdt.cs.wisc.edu/vdt_181_cache:Licenses] found... Updating [VDT-Environment]... Downloading [vdt-environment-1-193.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-environment/1]... Updating [VDT-Version-Info]... Downloading [vdt-version-info-1.8.1-26.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-version-info/1.8.1]... Updating [VDT-Common]... Downloading [vdt-common-1-228.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-common/1]... Updating [CA-Certificates-Base]... Downloading [certificates-33-1.tar.gz] from [http://vdt.cs.wisc.edu/software/certificates/33]... Installing package [CA-Certificates-Base]. Downloading [certificates-install-4-256.tar.gz] from [http://vdt.cs.wisc.edu/software//certificates-install/4]... Updating [Licenses]... Downloading [licenses-1.8.1-12.tar.gz] from [http://vdt.cs.wisc.edu/software//licenses/1.8.1]... This did not help, as I am not using this vdt for srmcp on minos26. cd /home/minfarm/.grid mv certificates certificates.20070206 scp -r minfarm@fnpcsrv1:/local/ups/grid/globus/share/certificates certificates This fixed the srmtest problem under mindata. $ dds /minos/scratch/kreymer/VDT/globus/share/certificates-33-1 | wc -l 394 SRV1> dds /usr/local/vdt-1.8.1/globus/share/certificates-33-1 | wc -l 471 Need to track down and update or remove /export/stage/minfarm/homegrid/vdt-1.3.10 /grid/app/minos/VDT . ./setup.sh pacman -update CA-Certificates certificates-32-1 updated to certificates-33-1 ######## # GRID # ######## Helpdesk submission ( apparently not submitted ) Grid / Fermilab Sup Ctr High Kreymer DOE cert claims to have expired, is not expired Starting as early as 02:20 this morning, the Kreymer DOE grid certificate seems to be expired. This seen by SRM, and by the web browser certificate test at https://security.fnal.gov/cgi-bin/doetest/displaycert.cgi But the certificate is not due to expire until April 2008. And I can generate a grid proxy with the cert : SRV1> cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 800:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ ######## # GRID # ######## srmcp is failing now, claims cert is expired : SRMClientV1 : org.globus.common.ChainedIOException: Authentication failed [Caused by: Defective credential detected [Caused by: [JGLOBUS-96] Certificate "DC=org,DC=DOEGrids,OU=Certificate Authorities,CN=DOEGrids CA 1" expired]] SRV1> cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 800:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-test.proxy \ -valid 800:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Creating temporary proxy .............................................. Done Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Done Warning: fg6x1.fnal.gov:15001: The validity of this VOMS AC in your proxy is shortened to 86400 seconds! Creating proxy ................................ 
Done Your proxy is valid until Tue Feb 12 16:23:36 2008 ============================================================================= 2008 01 09 ############### # CONDORGLIDE # ############### Run this script via cron ( crontab.minos25) to keep the factory alive Logs go to condor/log/glide/ Had to add these, to get path to condor commands : source /usr/local/etc/setups.sh setup shrc source /etc/bashrc ########### # BLUEARC # ########### Date: Wed, 09 Jan 2008 15:43:44 -0600 (CST) Subject: HelpDesk ticket 109326 Short Description: Quota request for BlueArc served /minos/scratch, for rmehdi Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user rmehdi on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. --------------------------- Date: Wed, 09 Jan 2008 15:58:05 -0600 (CST) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. --------------------------- Date: Wed, 09 Jan 2008 16:22:56 -0600 (CST) Solution: joes@fnal.gov sent this solution: increased volume quota to 500GB (was 100GB) for minos-nas-0:/minos/scratch/rmehdi This ticket was resolved by SYU, JOSEPH of the CD-LSCS/CSI/CS/EST group. --------------------------- ########## # CONDOR # ########## [gfactory@minos25 ~]$ find . -type f -exec grep -q /afs/fnal.gov/files/expwww/numi {} \; -print ./glideinWMS/install/glideinWMS_install ./glideinWMS/creation/glideinWMS.xml.20071217 ./glideinWMS/creation/glideinWMS.xml ./glideinsubmit/glidein_t3/glideinWMS.xml ./glideinsubmit/glidein_t5/glideinWMS.xml ./glideinsubmit/glidein_t1/glideinWMS.xml ./glideinsubmit/glidein_t4/glideinWMS.xml ./glideinsubmit/glidein_t2/glideinWMS.xml ./.bash_history XMLS=' glideinWMS/creation/glideinWMS.xml glideinsubmit/glidein_t3/glideinWMS.xml glideinsubmit/glidein_t5/glideinWMS.xml glideinsubmit/glidein_t1/glideinWMS.xml glideinsubmit/glidein_t4/glideinWMS.xml glideinsubmit/glidein_t2/glideinWMS.xml ' for XML in ${XMLS} ; do cp -a ${XML} ${XML}.save ; done for XML in ${XMLS} ; do nedit ${XML} & ; done replace /afs/fnal.gov/files/expwww/numi/html/gfactory with /home/gfactory/web for XML in ${XMLS} ; do sdiff -s ${XML} ${XML}.save ; done 15:22 ./start_factory.sh 15:23 gfactory processes 23364.0 1 2 3 4 are idle 15:26 gfactory processes have been running about 1 minute. 15:28 gfactory processes usee 3:25 seconds, kreymer glidein is running MINOS25 > crontab ${HOME}/minos/scripts/crontab.minos Oops, had to add this to get aklog to work in condorweb : PATH="/usr/krb5/bin:${PATH}" Date: Wed, 9 Jan 2008 22:26:08 +0000 (UTC) From: Arthur Kreymer To: Burt Holzman Cc: Igor Sfiligoi Subject: Re: Follow-up on WMS/grid discussions ... I have modified these files : glideinWMS/creation/glideinWMS.xml glideinsubmit/glidein_t3/glideinWMS.xml glideinsubmit/glidein_t5/glideinWMS.xml glideinsubmit/glidein_t1/glideinWMS.xml glideinsubmit/glidein_t4/glideinWMS.xml glideinsubmit/glidein_t2/glideinWMS.xml after first creating *.save versions I changed /afs/fnal.gov/files/expwww/numi/html/gfactory to /home/gfactory/web I have started a one per minute cron job rsync'ing the afs web area to the home web directory. /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorweb This may not be ideal, but should keep the gfactory process happy for now. Can the one minute update interval be increased ? ####### # NET # ####### Slow and failing network connections reported on Minos systems, late morning. MIN > ssh minos01 Last login: Wed Jan 9 12:02:44 2008 from 131.225.56.147 ... 
aklog: Couldn't get fnal.gov AFS tickets: aklog: Cannot resolve network address for KDC in requested realm while getting AFS tickets 12:05 - OK again ? r-s-fcc2-server ######### # ADMIN # ######### Helpdesk ticket 095815 Updated minos-sam02 status, system is up. Successful ! ######## # FARM # ######## Date: Tue, 08 Jan 2008 17:08:52 -0600 (CST) From: Steven Timm To: fermigrid-announce@fnal.gov Subject: new worker nodes, CDF and GP Grid clusters 46 of the 48 new worker nodes have been deployed on the General Purpose Grid cluster today. We expect the last two nodes to be ready tomorrow morning. Thanks to Rennie Scott and Jason Allen for fast work. The high 8 nodes fnpc339-fnpc346, MINOS will have priority on. We are still working on the details of getting AFS implemented on those nodes as they requested, hopefully within 7-10 days. These 48 new worker nodes will all be available for production for a while. Eventually some of them may be redeployed for integration and testing projects. As reported in operations meeting yesterday, 155 new worker nodes were also deployed as CDF Grid cluster 3 and they are available for use by FermiGrid. these are fcdfcaf1502-1656. CDF will have priority on all of these nodes. Non-cdf users should only access them via the fermigrid1 site job gateway, not by direct submission. ============================================================================= 2008 01 08 ########## # NEXSAN # ########## Scheduled cron shutdowns for NEXSAN upgrades Thur 10 Jan 08:00 mindata@minos26 echo "crontab -r" | at 02:00 Jan 10 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 02:00 Jan 10 ########## # CONDOR # ########## http://www-numi.fnal.gov/gfactory/stage/glidein_t5/condor_config file:/home/gfactory/web/stage/glidein_t5/condor_config condorweb - script does an rsync from /home/gfactory/web/ to /afs/fnal.gov/files/expwww/numi/html/gfactory ############ # MCIMPORT # ############ Removed a bad imported file from sjc, FILE=n11047018_0002_L010185N_D04.tar.gz would have moved to MCSTA, ${MDSTAGE}/${MCREL}_${MCRN}/${CONF}/${DET}/${RUN} MCSTA=/minos/data/mcimport/STAGE/daikon_04/L010185N/near/701 mkdir /minos/data/mcimport/sjc/BAD mv ${MCSTA}/${FILE} /minos/data/mcimport/sjc/BAD/${FILE} ########### # ENSTORE # ########### We are very short of 9940-B tapes in the library, Inventory this morning was 108 tapes See also http://www-stken.fnal.gov/enstore/burn-rate/CD-9940B.jpg Recent rates are under Enstore Plots Bytes Written per Storage Group Plots http://www-stken.fnal.gov/enstore/burn-rate/plot_enstore_system.html Month Week TapesBlank ALL_9940B 241 63 294 CD-9940B 159 31 108 lqcd 36 9 miniboone 4 0 minos 68 16 sdss 18 2 ########### # ENSTORE # ########### Date: Tue, 08 Jan 2008 14:42:12 -0600 (CST) HelpDesk ticket 109251 Short Description: Please move future /pnfs/minos/mcin_data and mcout_data writes to CD-LTO-3 Problem Description: enstore-admin : The STKEN 9940-B tape inventory is running critically low. Minos is a major user ( 68 the past month, 16 last week ) Most of our current use is to paths /pnfs/minos/mcout_data and /pnfs/minos/mcin_data Therefore, please do something like the following to direct future writes under these paths toward LTO-3 tape : cd /pnfs/minos/mcin_data enstore pnfs --library CD-LTO3 cd /pnfs/minos/mcout_data enstore pnfs --library CD-LTO3 -------------------------------- Date: Tue, 08 Jan 2008 15:01:58 -0600 (CST) This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. 
-------------------------------- Date: Tue, 08 Jan 2008 18:32:13 -0600 (CST) Solution: berg@fnal.gov sent this solution: The library tags have been changed to CD-LTO3. Thanks, Art! ########## # CONDOR # ########## The gfactory stoppages are due to losing a token for writing to /afs/fnal.gov/files/expwww/numi/html/gfactory Set up rsync via cron job on minos25, kreymer account for present Test timing, AFSWG=/afs/fnal.gov/files/expwww/numi/html/gfactory LOCWG=/home/gfactory/web TESWG=/afs/fnal.gov/files/expwww/numi/html/test time cp -ax ${AFSWG} ${LOCWG} real 0m35.957s user 0m0.063s sys 0m1.526s diff -r ${AFSWG} ${LOCWG} real 0m8.566s user 0m0.237s sys 0m0.953s mkdir ${TESWG} time rsync -r ${LOCWG} ${TESWG} --perms --times --links --size-only --delete -v sent 66808099 bytes received 65640 bytes 3110406.47 bytes/sec total size is 66530899 speedup is 0.99 real 0m20.566s user 0m0.955s sys 0m3.520s Repeat again at about 12:13 [gfactory@minos25 ~]$ time rsync -r ${LOCWG} ${TESWG} --perms --times --links --size-only --delete -v building file list ... done sent 129656 bytes received 20 bytes 13650.11 bytes/sec total size is 66530899 speedup is 513.05 real 0m9.075s user 0m0.052s sys 0m3.327s real 0m9.625s user 0m0.057s sys 0m3.383s Around 13:20 real 0m8.478s Added slash after ${LOCWG} to put the the output files directly in ${TESWG} rm -r ${TESWG}/web time rsync -r ${LOCWG}/ ${TESWG} --perms --times --links --size-only --delete -v real 0m19.969s Tried this from kreymer on minos25, could not access /home/gfactory, mode 700 chmod 755 /home/gfactory time rsync -r ${LOCWG}/ ${TESWG} --perms --times --links --size-only --delete -v real 0m8.535s ######## # MAIL # ######## Removed RFC2369 headers from minos-cdops for which they are not appropriate, to eliminate the PINE messages [ Note: This message contains email list management information ] To disable the headers, added to the head of the options list, Misc-Options= NO_RFC2369 ########## # NEXSAN # ########## Date: Tue, 08 Jan 2008 07:53:39 -0600 From: Etta Burns To: minos-admin@fnal.gov Cc: Jason Allen , dbell@fnal.gov Subject: Request For Satabeast Downtime The new NexSAN firmware (Gn60) is available for installation. Would it be possible to have a 20 minute downtime on Thursday morning, beginning at 8:00, to upgrade the firmware and reboot the satabeast? Etta B -- Etta Burns Fermi National Accelerator Laboratory ettab@fnal.gov P.O. Box 500 (630) 840-8300 Batavia, IL 60510 Announced this to minos_software_discussion minos_batch minos-data minos-users ########## # CONDOR # ########## Date: Tue, 08 Jan 2008 12:28:01 +0100 From: Igor Sfiligoi To: Arthur Kreymer Cc: Burt Holzman Subject: Re: Follow-up on WMS/grid discussions Hi Art. It is again the AFS problem :( The monitoring pages are on AFS, so when the AFS token expires, the factory cannot anymore update the monitoring info and it exits. I don't have an easy answer for this problem :( -------------------------- Date: Tue, 08 Jan 2008 10:01:42 -0600 From: Burt Holzman Keep the monitoring pages local and have a kcron job push it to AFS every minute? -------------------------- I have tested this. There are about 66 MBytes of files in the present web area. I created a copy of the AFS web area AFSWG=/afs/fnal.gov/files/expwww/numi/html/gfactory in LOCWG=/home/gfactory/web I tested the speed of rsync writing to TESWG=/afs/fnal.gov/files/expwww/numi/html/test time rsync -r ${LOCWG} ${TESWG} --perms --times --links --size-only --delete -v real 0m20.566s Subsequent updates take about 8 seconds elapsed time. 
Igor : How often would such updates need to be performed ? Should they be synchromized with any other process ? I see from the log the gfrontend runs about every 90 seconds. ######## # FARM # ######## cedar_phy_bhcurv mcnear processing cleared out the WRITE backlog, as of 8 this morning. Remaining issues farcat cedar_phy_bhcurv bmnt files nearcat cedar_phy_bhcurv 0 and 1 spill file duplicates mcnearcat cedar_phy_oldbhcurv a few files mcnearcat cedar_phy_bhcurv over 250 files pending ============================================================================= 2008 01 07 ############### # MINOS-USERS # ############### Liz added kreymer as an owner, added Giles Barr barr@FNAL.GOV ######## # FARM # ######## Cleanup and catchup WRITE - Dec 22 c_p_b mcnear stuck due to missing /minos/data/mcout_data/daikon_04/L010000N/near/cedar_phy_bhcurv/sntp_data/701/n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root This is present in pnfs, -rw-r--r-- 1 rubin numi 519562958 Dec 22 22:04 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010000N/sntp_data/701/n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root The rename after the srmcp seems to have failed on Dec 22, probably due to a NexSAN flakeout. SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010000N/sntp_data/701 mkdir: cannot create directory `/minos/data/mcout_data': No such device or address OOPS - cannot create CC area /minos/data/mcout_data/daikon_04/L010000N/near/cedar_phy_bhcurv/sntp_data/701 This is still there in WRITE, but not moved and symlinked. SRV1> dds /minos/data/minfarm/WRITE/n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root -rw-r--r-- 1 minfarm numi 519562958 Dec 22 19:50 /minos/data/minfarm/WRITE/n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root FILE=n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root GDW=/minos/data/minfarm/WRITE CCDEST=/minos/data/mcout_data/daikon_04/L010000N/near/cedar_phy_bhcurv/sntp_data/701 ls -l ${CCDEST} mv ${GDW}/${FILE} ${CCDEST}/${FILE} ln -s ${CCDEST}/${FILE} ${GDW}/${FILE} ############ # PNFSDIRS # ############ Added test that encp is set up, so that we have an enstore command ######### # ADMIN # ######### Helpdesk ticket 095815 Tried entering a Minos status at https://computing.fnal.gov/cdsystemstatus/customersupport/inpform.htm MINOS AFS Minor No Estimate AFS timeouts continue at a low rate on the Cluster. We are awaiting a schedule for server software upgrades. On hitting ENTER, got an attempt to open enter_status.pl This seems to be an empty file. Tried this again, got Internal Server Error The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, helpdesk@fnal.gov and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log. Reported this to trb@fnal.gov Date: Tue, 08 Jan 2008 17:13:52 -0600 (CST) Subject: Help Desk Ticket 109168 Has Been Resolved. Solution: Someone, in an attempt to make the .htaccess file for that area more readable, broke it. It's better now. 
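The pnfsdirs change above ( a test that encp is set up ) is described but not quoted. A minimal sketch of that sort of guard, assuming the intent is to stop early when the enstore command is missing; the actual test added to pnfsdirs may differ.
if ! type enstore > /dev/null 2>&1 ; then
    echo "OOPS - enstore command not found, setup encp first" >&2
    exit 1
fi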
########## # CONDOR # ########## To : Burt Holzman Cc : sfiligoi@fnal.gov Attchmnt: Subject : Re: Follow-up on WMS/grid discussions ----- Message Text ----- On Mon, 7 Jan 2008, Burt Holzman wrote: > A few months ago we had a discussion of running your software via WMS on > the grid. Has there been any progress? Do you need any help in getting > started? This has been set up for initial testing, with major assistance from Igor S. Glideins seem to be working most of the time. We have had problems with the gfactory process disappearing. We are still using DOE Grid Certs for glidein, instead of the preferred KCA based certs. The KCA certs seem to be too short lived. I still need to learn to configure and monitor Condor. In particular, we'd like to control the running of jobs with differing time limits ( 30 min, 4 hour, 1 day, 4 day ), and eventually run both Farm and User Analysis jobs via WMS. We are not allowed to use WMS in production until we have upgraded to Condor 6.9 with glExec support, both on the Minos Cluster and GPFARM. ######### # ADMIN # ######### Date: Mon, 07 Jan 2008 10:57:06 -0600 (CST) HelpDesk ticket 109145 run2-sys : Please set minos25 /var/log/messages* protections to 644, as on the rest of the Minos Cluster, to allow monitoring of the ongoing AFS timeouts. chmod 644 /var/log/messages* Date: Mon, 07 Jan 2008 11:06:35 -0600 (CST) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. Date: Mon, 07 Jan 2008 11:33:49 -0600 (CST) Solution: schmitz@fnal.gov sent this solution: protections changed as per Art's request. ####### # AFS # ####### pts adduser -user mgoodman -group wadmnumi:numiweb pts: Permission denied ; unable to remove user belias from group wadmnumi:numiweb ####### # AFS # ####### Timeouts continue, at a low rate.
MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages.1 | grep "Jan " | grep -v Tokens | grep Lost | grep 131.225 | uniq'; done minos05 Jan 4 11:15:19 minos05 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 4 14:18:22 minos05 kernel: afs: Lost contact with file server 131.225.68.11 in cell fnal.gov (all multi-homed ip addresses down for the server) minos09 Jan 3 13:28:58 minos09 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) ============================================================================= ============================================================================= 2008 01 03 ######## # DATA # ######## non-root files, per email; mcgowan /data/minos/root_data/reco_far/R1_18_4/.bntp_data/2006-07 F00035859_0008.spill.bntp.R1_18_4.0.root F00035947_0014.spill.bntp.R1_18_4.0.root F00036019_0019.spill.bntp.R1_18_4.0.root DCPOR=24136 DPAT=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/reco_far/R1_18_4/.bntp_data/2006-07 DFILE=${DPAT}/${FIL} setup_minos -r R1.18.4 hadd -f /local/scratch26/kreymer/DATA/Merged.root ${DFILE} ${DFILE} ############ # PNFSDIRS # ############ Per rhatcher/arms ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010185N_nccoh write ============================================================================= 2008 01 02 ########### # MONTHLY # ########### DATASETS 1/2 PREDATOR 1/2 VAULT 1/2 MYSQL 1/ Vault - encp - Got error while trying to obtain configuration: ('KEYERROR', "Configuration Server: no such name: 'pnfs_agent'") ~/minos/log/rawcopy/${DET}/encp.2007-12.log This seems to be normal since 2006-12 ####### # AFS # ####### MINOS26 > pts removeuser -user belias -group wadmnumi:numiweb pts: Permission denied ; unable to remove user belias from group wadmnumi:numiweb Asked liz to add me to wadmnumi:wadmnumigr Done, and done ============================================================================= 2008 01 01 Vacation notes : minosadmin Subject: Your ticket 095815 has been reassigned to BOZONELOS, TOM Solution: Added the following KX509 DN's to the .htaccess file: |kreymer\ |buckley\ |rhatcher\ |urish\ Please go to the following URL to make manual updates, be sure to pick the correct system (MINOS) from the dropdown list: https://computing.fnal.gov/cdsystemstatus/customersupport/inpform.htm rameika - ssh login problems minosadmin lusers - java updates boehm - minosadmin - glidein guidance rarmstr - minoscvs - access windows - 2008 - windows domain exp Jan 14 habig - minosshift - looking for net downtime rubin - minosbatch - cannot use new proxy to copy with srm found gfactory processes missing again, restarted ============================================================================= 2007 12 28 Kreymer on vaction until 2 January 2008 Happy New Year ! 
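The hadd check under 2008 01 03 above is shown for a single ${FIL}. A sketch of the same check over all three files reported by mcgowan, with names and paths exactly as listed there; the loop is the only addition.
# after setup_minos -r R1.18.4, as above
DCPOR=24136
DPAT=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/reco_far/R1_18_4/.bntp_data/2006-07
for FIL in F00035859_0008.spill.bntp.R1_18_4.0.root \
           F00035947_0014.spill.bntp.R1_18_4.0.root \
           F00036019_0019.spill.bntp.R1_18_4.0.root ; do
    DFILE=${DPAT}/${FIL}
    hadd -f /local/scratch26/kreymer/DATA/Merged.root ${DFILE} ${DFILE}   # hadd fails if the input is not a readable ROOT file
done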
============================================================================= 2007 12 27 ########## # CONDOR # ########## Restarted missing gfactory processes Verified correct running of probe, with wms.run script ########## # CONDOR # ########## Temporary safety copy of examples, cd /local/scratch26/kreymer cp -vax /minos/scratch/kreymer/condor condor ########## # DCACHE # ########## HelpDesk ticket 108857 Short Description: FNDCA overloaded with e907 writes to DCache pools Problem Description: dcache-admin : At about 02:00 this morning, a flood of over 12,000 writes started to FNDCA write pools, pushing the Pool Request Queue for Stores over 8000. The writes continued through 06:00. The Minos MC import and farm concatenation processes have throttled down as designed, trying to keep the queue under the recommended 2-3K limit. The backlog is clearing at roughly 10 seconds/file, apparently all going to a single LTO-3 tape, file family 'geant'. These seem to be e907 files, all under 10 MBytes in size. This is very inefficient for LTO-3 tape ( or even 9940 ). It may take a day or more for this backlog to clear out, and for Minos processing to resume, assuming that no more files of this sort are written, and assuming that there are no global DCache service failures, as have happened before when backlogs got this large. Please contact E907 to understand the source of the problem, and if at all possible, remove this backlog. Date: Thu, 27 Dec 2007 14:30:58 -0600 (CST) This ticket has been reassigned to HARRISON, MICHAEL of the CD-SF/DMS/DSC/SSA Group. Date: Thu, 27 Dec 2007 14:48:46 -0600 (CST) mircea@fnal.gov sent this Notes To Requester: Hi Arthur, I've started looking into it, will let you know when I have an update. -Mike Harrison Date: Thu, 27 Dec 2007 15:04:27 -0600 From: Holger Meyer To: Mike Harrison , kreymer@fnal.gov, dcache-admin@fnal.gov Subject: Re: Fw: HelpDesk ticket 108857 has been assigned to you HARRISON, MICHAEL Mike, Art, dcache-admins, I started these file transfers. They are small stdhep files that I will run through Monte Carlo simulation and reconstruction jobs on the grid. I plan to copy more files soon, so we should find a less disruptive way for me to do so. I assumed that my script initiating one copy at a time would not cause a problem. I want these files backed up on tape (even if that is inefficient for these small files). The Monte Carlo output and reconstructed files will all be much larger. If there is a way to throttle the transfer of these files to tape to a rate that allows MINOS to continue its work, please do so. Best regards, Holger 2997 ########## # DCACHE # ########## Write queues are up to 8000. All dumped in starting 02:00, peak at 06:00 roughly I see no DCache writes in STKEN. 
Restore list is out of date, Oct 11 Login list, I set mostly access to files like minos/reco_near/cedar_phy/cand_data/2005-11/N00009104_0003.spill.cand.cedar_phy.0.root minos/mcin_data/near/daikon_00/L010185N/101/n11011012_0006_L010185N_D00.reroot.root DCache New Plots, write transfers, confirms 1500 to 2000 transfers/hour, 03:00 through 08:00, again at 11:00 Billing shows 12373 transfers from Clients, versus normal 1500ish http://fndca3a.fnal.gov/dcache/billing.html grep w-stken billing.lis | grep e907 | wc -l 12240 PRQ Stores are still at 7186 14:00 - 7169 14:01 - 8161 14:32 - 7102 16:15 - 6954 ######## # FARM # ######## Clearing out deadwood 13:30 ./roundup -f 1 -r cedar mcfar 1 run, vintage Jun 30 ########## # PARROT # ########## ---------- Forwarded message ---------- Date: Thu, 27 Dec 2007 17:54:12 +0000 (UTC) From: Arthur Kreymer To: webteam@fnal.gov, rayp@fnal.gov, minos-data@fnal.gov Subject: Parrot tests - FYI Per my conversation with Laura Mengel earlier today, this is just an FYI heads-up note to let relevant people know that we are looking into the possible use of Parrot as a means of accessing Minos software releases from FermiGrid worker nodes. At present, this is just a technology investigation. But if the tests are successful, this scheme might be rapidly deployed, as it has been used in production by CDF, and requires relatively little new infrastructure. For initial single client testing, I have linked the Minos release and product areas under the existing web pages. For larger scale testing and deployment, we would plan to use squids to take the load off central servers, and would make a formal deployment and support plan. References : http://www.cse.nd.edu/~ccl/software/parrot/ http://www.cse.nd.edu/~dthain/papers/cdf-parrot-chep06.pdf ######## # WEB # ######## User list from WUSERS=`{ pts membership wadmnumi:wadmnumigr ; pts membership wadmnumi:numiweb;} | sort | uniq | grep -v wadmnumi` echo $WUSERS admarino alberto arms asousa avva ayres belias boehm brebel bseilhan bspeak buckley cbs cjames costas cwhite dave_b dharris efalk gfp gmieg grossman habig hgallag hylen jkn jmusser kreymer lang lauram mdier med messier mgoodman michael msanchez murgia niki nwest para petyt pjl plunk rameika rgran rhatcher rubin shanahan tagg thomsonm thosieck urheim webera wehmann MINOS26 > WMAIL= MINOS26 > for WUSER in ${WUSERS} ; do WMAIL=${WMAIL}${WUSER}, ; done MINOS26 > echo $WMAIL admarino,alberto,arms,asousa,avva,ayres,belias,boehm,brebel,bseilhan,bspeak,buckley,cbs,cjames,costas,cwhite,dave_b,dharris,efalk,gfp,gmieg,grossman,habig,hgallag,hylen,jkn,jmusser,kreymer,lang,lauram,mdier,med,messier,mgoodman,michael,msanchez,murgia,niki,nwest,para,petyt,pjl,plunk,rameika,rgran,rhatcher,rubin,shanahan,tagg,thomsonm,thosieck,urheim,webera,wehmann, Removed michael, invalid. Removed belias, email could not go to RAL. admarino,alberto,arms,asousa,avva,ayres,boehm,brebel,bseilhan,bspeak,buckley,cbs,cjames,costas,cwhite,dave_b,dharris,efalk,gfp,gmieg,grossman,habig,hgallag,hylen,jkn,jmusser,kreymer,lang,lauram,mdier,med,messier,mgoodman,msanchez,murgia,niki,nwest,para,petyt,pjl,plunk,rameika,rgran,rhatcher,rubin,shanahan,tagg,thomsonm,thosieck,urheim,webera,wehmann, Sent mail to minos-data , cc: the full list : Yesterday between 16:54 and 16:59 CST, the www-numi web server lost access to its AFS files. Apparently this ACL entry was removed : system:anyuser rl Service was restored this morning when Dave Bell restored the ACL entry. 
Laura Mengel has determined the time of change from the server logs. We also see that the /afs/fnal.gov/files/expwww/numi directory changed at Dec 26 16:57. Because we see no new files there, we guess that something was removed. This mail is being sent to the full list of people with access. Did somebody remove a file and/or change the ACL yesterday ? -------------------------------- We have late-breaking news from the web team that one of their scripts was very likely guilty of removing the ACL. They will try to ensure that it does not get loose from its sandbox in the future. ######## # WEB # ######## Date: Thu, 27 Dec 2007 11:01:21 -0600 From: Laura Mengel To: kreymer@fnal.gov Cc: lauram@fnal.gov Subject: user cert access restriction on web servers http://www.fnal.gov/docs/products/apache/SSLNotes.html under "Setting up to allow Kerberos/kx509 authentication" under "and then you can use, in .htaccess files, entries" This is the general URL with web help http://www-css.fnal.gov/csi/webdocs/webmaster_info.html -- Laura ####### # AFS # ####### More AFS timeouts : minos17 Dec 26 21:09:15 minos17 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) minos21 Dec 26 22:46:11 minos21 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) minos26 Dec 25 02:06:19 minos26 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) ######## # WEB # ######## Numi web sites down, Helpdesk Ticket 108833 Urish - Urgent 12/27/2007 12:26:50 AM The Minos experiment relies on the central web server for accessing some operational information. The www-numi.fnal.gov web site will not allow access. The URL responds with "Forbidden - You don't have permissions to access / on this server." I attempted to access the AFS space directory and was unable to mount the directory /afs/fnal.gov/files/expwww/numi where the files for this site are stored. BELL, DAVE , x4482, csi-est@fnal.gov / medium Modified trb 12/27/2007 2:45:45 PM Audit Log : 12/27/2007 2:41:44 PM jereboze The Assigned To Group was changed from Help Desk to CD-LSCS/CSI/CS/EST. The Assigned To Individual was changed from HelpDesk to BELL, DAVE. The Assigned To E-mail Address was changed from helpdesk@fnal.gov to csi-est@fnal.gov. 12/27/2007 3:34:47 PM resolution The www-numi.fnal.gov web site is now accessible. I added the rights for system:anyuser.
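Since a single missing system:anyuser entry took the whole site offline, a periodic check of that ACL is cheap insurance; the fs la listing just below shows the restored state such a check would test for. A minimal sketch only, assuming a valid AFS token and reusing the minos-data@fnal.gov address from the mails above; it only warns, it does not try to repair the ACL.

# Sketch: warn if the web area no longer grants system:anyuser rl.
# Assumes a valid AFS token; the mail address is reused from this log.
WEBDIR=/afs/fnal.gov/files/expwww/numi
if ! fs listacl ${WEBDIR} | grep -q 'system:anyuser rl' ; then
    echo "`date` : system:anyuser rl missing on ${WEBDIR}" | \
        mail -s "numi web ACL check" minos-data@fnal.gov
fi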
# fs la /afs/fnal.gov/files/expwww/numi Access list for /afs/fnal.gov/files/expwww/numi is Normal rights: lauram:expwwwread rl wadmnumi:numiweb rlidwka wadmnumi:wadmnumigr rlidwka lauram:expwwwadm rlidwka system:administrators rlidwka system:anyuser rl MINOS26 > ls -altr total 126 -rw-r--r-- 1 1866 oss 274 Mar 20 1997 expwww.dat -rw-r--r-- 1 7979 oss 8014 May 12 1997 README.numi_webserver drwxr-xr-x 2 7979 oss 2048 Feb 4 1998 CERN_conf drwxr-xr-x 2 7979 oss 2048 Feb 4 1998 wwwstat-1.0 lrwxr-xr-x 1 bin root 5 Feb 10 1998 hyper-news -> babar drwxr-xr-x 2 7979 oss 2048 Apr 21 1999 NCSA_conf drwxr-xr-x 2 1866 1530 2048 Oct 7 1999 admin_pre1_3_9 drwxr-xr-x 2 1866 1530 2048 Mar 6 2001 admin_standalone drwxrwxrwx 6 7979 1530 2048 Oct 4 2001 numinotes drwxr-xr-x 4 1866 1530 2048 May 28 2003 conf_standalone drwxr-xr-x 2 1222 g020 2048 Jul 30 2003 admin -rw-r--r-- 1 10599 e875 8014 Jul 30 2003 README_standalone -rw-r--r-- 1 10599 e875 9314 Sep 5 2003 README drwxr-xr-x 3 para oss 2048 Apr 29 2004 babar drwxrwxrwx 2 root root 8192 Jul 16 2004 file_upload -rwxr-xr-x 1 buckley e875 63 Jul 16 2004 cleanup_query_files drwxr-xr-x 2 1222 g020 2048 Mar 14 2005 conf -rw-r--r-- 1 10599 e875 837 May 26 2006 README.switch drwxr-xr-x 3 boehm e875 2048 Mar 16 2007 youngminos drwxr-xr-x 2 1222 g020 2048 Apr 6 2007 auth drwxr-xr-x 5 7979 oss 8192 Sep 4 17:52 cgi-bin drwxr-xr-x 38 7979 oss 4096 Nov 27 14:27 html drwxr-xr-x 12 9999 root 6144 Dec 21 11:15 .. drwxrwxrwx 2 root root 45056 Dec 25 14:11 query_files drwxrwxrwx 15 7979 root 2048 Dec 26 16:57 . ============================================================================= 2007 12 26 ########### # BLUEARC # ########### minos_software_discussion To: minos_software_discussion@fnal.gov Cc: minos_sim@fnal.gov, minos_batch@fnal.gov, minos-data@fnal.gov,minos-admin Here is a summary of a conversation with Ray Pasetes, who leads the group maintaining our BlueArc file service. This weekend's timeouts were again due to communication problems between the NexSAN SataBeast array and the Fiber Channel fabric. This is not a unique problem; other customers are also suffering. There is a new set of firmware from NexSAN which should correct this. But it has just been received, and is not yet field tested. Therefore, it will be best for us to minimize use of these areas until this firmware is well tested, probably after the Austin meeting. Major users are : GPFarm analysis pioneers ( Rustem and Josh ) The Farm - has finished essentially all the reprocessing MC import - has a few days of files to import. ########### # BLUEARC # ########### Per Pasetes conversation, about 15:45 this afternoon. Two files were lost from /minos/scratch , cleaned up by fsck : 2007-12-25 05:29:37 File System: Deleting corrupted file: near_L010z185i_mc.uDST_strip.root 2007-12-25 05:29:37 File System: Deleting corrupted file: near_L010z185i_mc.uDST_strip.root The problem was again FC resets on the fabric to the NexSAN array. The fsck took 18.5 hours to run ( with many resets slowing things down ) CSI is testing new firmware from NexSAN, version M K - crashed heads L - production, heads OK, but has bus timeouts that we are seeing M - crashes heads N - beta firmware , but claims to correct the timeout problems. Recommendation : Minos - minimize use of /minos/data till N is tested out ( 2008 ) Minos - prepare a client test suite to provide a realistic load. AFS issues - had tested and were ready to deploy new software last week, then started seeing failures again.
These turn out to be due to a hardware failure. When the hardware is replaced, and tests are repeated, we may be OK. Meanwhile, the minos software is being shifted to the minos-2 AFS server, which has not seen any timeouts recently. ############## # MINOS_DATA # ############## for DIR in ${DIRS} ; do fs listquota d${DIR} | grep nb ; done > /tmp/mdd sort -n -k 4 /tmp/mdd nb.minos.d88 50000000 6 0% 64% nb.minos.d141 50000000 4590792 9% 60% nb.minos.d118 50000000 5292952 11% 60% nb.minos.d119 50000000 6013856 12% 74% d141/recodata52 - F/N cedar d118/recodata41 - N cedar only 41 files d119/recodata42 - F/N cedar , n R1_18_2 shifting d118 to d105/recodata34 nb.minos.d105 50000000 41376549 83% 64% FILES=`ls d118/recodata41 | grep root` date for FILE in ${FILES} ; do cp -av d118/recodata41/${FILE} d105/recodata34/${FILE} done date Wed Dec 26 15:27:28 CST 2007 Wed Dec 26 15:37:10 CST 2007 for FILE in ${FILES} ; do echo ${FILE} diff d118/recodata41/${FILE} d105/recodata34/${FILE} done for FILE in ${FILES} ; do grep ${FILE} d10/indexes/*.cedar.index done | cut -f 1 -d : | sort | uniq d10/indexes/2006-06_near.cedar.index d10/indexes/2006-07_near.cedar.index 16:20 - Changed recodata41 to recodata34 in these indexes. for FILE in ${FILES} ; do echo ${FILE} ; echo rm d118/recodata41/${FILE} done cd d118 rmdir recodata41 rm recodata* rm indexes Now back to minos26, give this to nonap fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d118 \ -acl minos:nonap rlidwka fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d118 \ -acl minos:admin rlidwka fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d118 \ -acl buckley:minosrecodata none rm d10/recodata41 ####### # AFS # ####### Created d10/analysis directories to keep track of assigned volumes DIR=d186 fs listacl ../../../${DIR} ln -s ../../../${DIR} ${DIR} rustem d186 d203 d221 d260 ana_ntuples ( buckley:ana_ntuples, should be minos:nc ) d271 d272 beam d188 d239 d266 d268 d269 d270 cc d86 nc d138 d147 d169 d187 d204 d211 d228 d229 nd d88 nonap d240 d261 d262 d263 d264 d265 d240 needs minos:nonap added, members adjusted nubar d227 nue d241 d242 d243 d244 d245 reco d267 ######## # FARM # ######## scripts/crontab.dat updated ( previously May 8 2007 ) changed comment ENSTORE to pnfs . saved as crontab.dat.20071226 ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Dec " | grep -v Tokens | grep Lost | grep 131.225 | uniq'; done minos08 Dec 12 17:08:58 minos08 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) /var/log/messages.1 minos02 Dec 21 00:15:19 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 21 08:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Bottom line, no timeouts since Dec 21 update after 13:42 to /home/minfarm/scripts/web_status But the /minos/scratch area has been down since Saturday, and there was very little activity over the Christmas holiday. 
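The per-node grep above gets repeated by hand for each month of logs; a variant that also counts events per file server makes the problem servers stand out at a glance. A sketch in the same style, assuming the same ${NODES} list and working kerberized ssh used in the scans above.

# Sketch: count AFS 'Lost contact' events per file server across the cluster.
# Same assumptions as the scans above: ${NODES} is set, ssh to each node works.
for NODE in ${NODES} ; do
    ssh ${NODE} 'grep "afs: Lost contact with file server" /var/log/messages /var/log/messages.1 2>/dev/null'
done | grep -o '131\.225\.[0-9]*\.[0-9]*' | sort | uniq -c | sort -rn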
########### # BLUEARC # ########### BlueArc /minos/data and scratch areas went offline, Last files in /minos/data/minfarm/nearcat were Saturday evening : Dec 22 14:36 N00008224_0019.spill.sntp.cedar_phy_bhcurv.1.root Dec 22 18:47 F00039704_0004.spill.mrnt.cedar_phy_bhcurv.0.root Dec 22 20:57 n13037065_0011_L010000N_D04.mrnt.cedar_phy_bhcurv.root /minos/* announced online and fsck'd at Dec 26, 2007 7:32 AM Help Desk Ticket 108777 ============================================================================= 2007 12 21 ####### # AFS # ####### NEWGROUP=nd pts creategroup -name kreymer:${NEWGROUP} group kreymer:nd has id -2708 pts setfields kreymer:${NEWGROUP} -access SOMar for GUSER in buckley kreymer shanahan kordosky rgran ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts chown kreymer:${NEWGROUP} minos:admin pts membership minos:${NEWGROUP} ####### # AFS # ####### fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d88 \ -acl minos:cc none fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d88 \ -acl buckley:minosrecodata none fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d88 \ -acl minos:nd rlidwka buckley:minosrecodata ############## # MINOS_DATA # ############## Let's clear the R1_2* reco data. rubin@fnpcsrv1 : cd /afs/fnal.gov/files/data/minos/d10/indexes INDS=`ls *R1_23*.index` ./rv R1_23 noop | grep -v rm Removing 2005-11_far.R1_23.index Removed 661 files touch -t 200101010000 2005-11_far.R1_23.index Removing 2005-11_near.R1_23.index Removed 827 files touch -t 200101010000 2005-11_near.R1_23.index mostly 51,52,53 SRV1> ./rv R1_23 This procedure will erase all R1_23 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Removing 2005-11_far.R1_23.index Removed 661 files Removing 2005-11_near.R1_23.index Removed 827 files R1_23a most 53 SRV1> ./rv R1_23a This procedure will erase all R1_23a ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Removing 2005-11_far.R1_23a.index Removed 660 files Removing 2005-11_near.R1_23a.index Removed 817 files ./rv 'S06-05-25-R1-22' S06-05-25-R1-22 Removing 2005-11_far.S06-05-25-R1-22.index Removed 661 files Removing 2005-11_near.S06-05-25-R1-22.index Removed 662 files S06-06-22-R1-22 mostly 53 Removing 2005-11_far.S06-06-22-R1-22.index Removed 661 files Removing 2005-11_near.S06-06-22-R1-22.index Removed 827 files R1_24a mostly 54 Removing 2005-11_far.R1_24a.index Removed 720 files Removing 2005-11_near.R1_24a.index Removed 886 files R1_24b 55, 56 Removing 2005-11_far.R1_24b.index Removed 720 files Removing 2005-11_near.R1_24b.index Removed 1058 files R1_24c 56 57 58 Removing 2005-11_far.R1_24c.index Removed 720 files Removing 2005-11_near.R1_24c.index Removed 1368 files R1_24cal 96 97 98 Removing 2005-11_far.R1_24cal.index Removed 692 files Removing 2005-11_near.R1_24cal.index Removed 1304 files Let's get more ambitious, removing R1_18_2 Removed net 15220 files SRV1> cat 2*R1_18_2.index | wc -l 15220 SRV1> find /minos/data/reco_near/R1_18_2/ -type f | wc -l 6501 SRV1> find /minos/data/reco_far/R1_18_2/ -type f | wc -l 8719 = 15224 R1_18_4 Removed net 6280 files SRV1> find /minos/data/reco_near/R1_18_4/ -type f | wc -l 1752 SRV1> find /minos/data/reco_far/R1_18_4/ -type f | wc -l 3946 = 5698 SRV1> wc -l *_far.R1_18_4.index 3946 total That's complete. 
SRV1> wc -l *_near.R1_18_4.index 0 2006-03_near.R1_18_4.index 0 2006-05_near.R1_18_4.index 669 2006-06_near.R1_18_4.index 669 118 2006-07_near.R1_18_4.index 118 5 2006-08_near.R1_18_4.index 5 395 2006-09_near.R1_18_4.index 273 * 612 2006-10_near.R1_18_4.index 152 * 535 2006-11_near.R1_18_4.index 535 2334 total Note that indexes 2006-09 and 10 were repaired after the copies. Rerun a catchup copy ./afs2nfs -i 2006-10_near.R1_18_4.index STREAM sntp to /minos/data/reco_near/R1_18_4/sntp_data/2006-10 4618 612/ 612 recodata15/N00011137_0004.spill.sntp.R1_18_4.0.root STREAM sntp rate 15250 34G /minos/data/reco_near/R1_18_4/sntp_data/2006-10 STARTED Fri Dec 21 18:42:00 CST 2007 FINISHED Fri Dec 21 19:12:11 CST 2007 $ ./afs2nfs -i 2006-09_near.R1_18_4.index STREAM sntp to /minos/data/reco_near/R1_18_4/sntp_data/2006-09 4574 395/ 395 recodata15/N00010903_0003.spill.sntp.R1_18_4.0.root STREAM sntp rate 15573 19G /minos/data/reco_near/R1_18_4/sntp_data/2006-09 STARTED Fri Dec 21 19:12:41 CST 2007 FINISHED Fri Dec 21 19:17:55 CST 2007 find /minos/data/reco_near/R1_18_4/ -type f | wc -l 2334 ./rv 'R1_18_2' Removed net 15220 files ./rv 'R1_18_4' Removed net 6280 files ./rv 'R1.16a' Removed net 612 files from recodata17 / d88 for DIR in ${DIRS} ; do echo ${DIR} ; find d${DIR} -name \*R1_17\*; done | less R1_17* files are only in d88, not indexed find d88/recodata17 -name \*R1_17\* | wc -l 3155 find d88/recodata17 -name \*R1_17\* -exec rm {} \; rm d88/recodat17/*R1_17* ./rv 'R1_18' Removed net 2174 files ./rv 'R1_21' Removed net 359 files ./rv 'R1_24' Removed net 883 files ./rv 'R1_18_2a Removed net 720 files ./rvm _far.carrot.R1_18_2 ####### # AFS # ####### fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d86 \ -acl minos:admin rlidwka fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d86 \ -acl minos:cc rlidwka ########## # PARROT # ########## Created web links for Minos releases and products These links seem to work OK, following subsequent links cleanly. cd /afs/fnal.gov/files/expwww/numi/html/computing ln -s /afs/fnal.gov/files/code/e875/releases releases ln -s /afs/fnal.gov/files/code/e875/general/minossoft minossoft ln -s /afs/fnal.gov/files/code/e875/general/products products ln -s /afs/fnal.gov/files/code/e875/general/ ########## # DCACHE # ########## FAM=reco_near_cedar_phy_bhcurv_sntp ./volumes ${FAM} VO3914 VO4613 VO7018 VO9572 VOC321 VOC494 VOC501 VOLS=` ./volumes ${FAM}` for VOL in ${VOLS} ; do ./stage -d -p 0 ${VOL} | grep -v pnfs | tr -d . ; done VO3914 Needed 27/186 VO4613 Needed 11/77 VO7018 Needed 0/240 VO9572 Needed 70/135 VOC321 Needed 0/44 VOC494 Needed 0/42 VOC501 Needed 39/438 The file count seems much lower than far. Still, let's pull 'em. 
for VOL in ${VOLS} ; do ./stage -w ${VOL} ; done ####### # AFS # ####### regarding HelpDesk ticket 107323 Recent timeouts : minos02 Dec 20 01:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 20 03:15:14 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 20 07:15:13 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 20 09:15:17 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 20 11:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 20 12:15:15 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 21 00:15:19 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 21 08:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) minos09 Dec 19 19:16:52 minos09 kernel: afs: Lost contact with file server 131.225.68.65 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 19 19:16:56 minos09 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) MINOS02 > printf 'sleep 5 ; top -b -n 1 -i > /tmp/top1015.log\n' | at 10:15 job 3 at 2007-12-21 10:15 HOURS='00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23' for HH in ${HOURS} ; do echo ${HH} ; done mkdir /minos/scratch/kreymer/afsscan Found connections like Dec 21 12:15:13 minos02 sshd(pam_unix)[17054]: session opened for user rubin by (uid=0) Here is a more complete comparison of this with reuben connections : RUBIN SSHD AFS TIMEOUT DELAY(sec) Dec 20 01:15:13 01:15:16 3 Dec 20 03:15:12 03:15:14 2 Dec 20 07:15:12 07:15:13 1 Dec 20 09:15:15 09:15:17 2 Dec 20 11:15:14 11:15:16 2 Dec 20 12:15:13 12:15:15 2 Dec 21 00:15:17 00:15:19 2 Dec 21 08:15:14 08:15:16 2 These come from his cron job on fnpcsrv1, 15 00,01,02,03,04,05,06,07,08,09,10,11,12,14,16,18,20,22,23 * * * /usr/krb5/bin/kcron /home/minfarm/scripts/web_status This script does , among other things, kcron EXEC_DIR=/home/minfarm WEB_DIR=$EXEC_DIR/web cd $WEB_DIR webuser=rubin webhost=minos02.fnal.gov webarea=/afs/fnal.gov/files/expwww/numi/html/minwork/computing/batch_monitor scp farm_status.html ${webuser}@${webhost}:$webarea SRV1> time scp farm_status.html ${webuser}@${webhost}:$webarea farm_status.html 100% 37KB 37.4KB/s 00:00 real 0m5.209s user 0m0.035s sys 0m0.056s for N in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ; do > date ; echo ${N} ; scp -q farm_status.html ${webuser}@${webhost}:$webarea ; sleep 60 ; done Fri Dec 21 13:22:42 CST 2007 ... Fri Dec 21 13:42:36 CST 2007 The last scp copy was as follows, from /var/log/messages.1 Dec 21 13:42:37 minos02 sshd(pam_unix)[21162]: session opened for user rubin by (uid=0) ############# # DBARCHIVE # ############# MINOS-SAM03 > du -sm 20060421 9352 20060421 time scp -vr 20060421 minsoft@minos-mysql1:/minos/data/mysql/20060421 # GRID # CDF Grid workshop 7/8 Jan ( 2 hours each ) filed email in cdfcaf ######## # GRID # ######## So, where are the vomses files ? 
You can get a clue from voms-proxy-init -debug Let's have a look at /grid/app/minos/VDT /minos/scratch/kreymer/VDT ######## # GRID # ######## Date: Mon, 17 Dec 2007 15:52:52 -0600 From: Dan Yocum To: fermigrid-announce@fnal.gov Subject: upcoming change to VOMS server at Fermilab My apologies for missing the fermigrid-announce mailing list. On Tuesday, Dec. 18, 2007 FermiGrid will be moving the Virtual Organization Management Servers (VOMS) for the following VOs to the host voms.fnal.gov: fermilab dzero sdss des gadu nanohub ilc lqcd i2u2 osg Currently, these voms servers are on fermigrid2.fnal.gov and this server will remain in service for several months to alleviate the pain of migration. However, some users may experience problems when attempting to create voms proxy certificates (i.e., voms-proxy-init) for the above VOs. Generally, these users have a mismatch of hostname and host certificate name in their vomses file. The solution is to tell these users to edit their vomses files and make these 2 changes: 1) change all 'host/fermigrid2.fnal.gov' to 'http/voms.fnal.gov' 2) change all instances of the name fermigrid2.fnal.gov to voms.fnal.gov If you have any further questions, feel free to send questions to fermigrid-help@fnal.gov. Thanks, Dan -- Dan Yocum Fermilab 630.840.6509 yocum@fnal.gov, http://fermigrid.fnal.gov Fermilab. Just zeros and ones. ============================================================================= 2007 12 20 s-s-wh8w-7 ######## # FARM # ######## Howie has stopped the copy of minfarm/web/indexes to AFS. The AFS copy is now the primary version ######### # STAGE # ######### file families are set for CPB ntuples, scanning : FAM=reco_far_cedar_phy_bhcurv_sntp ./volumes vols ./volumes ${FAM} VOLS=` ./volumes ${FAM}` for VOL in ${VOLS} ; do ./stage -d -p 0 ${VOL} | grep -v pnfs ; done Staging files from tape VO9677 ................ Needed 545/1492 Staging files from tape VOC674 Needed 0/902 Staging files from tape VOC691 ...................... Needed 0/220 The files do need to move to the new pools : ./stage -d -p 0 -g MinosPrdReadPools VOC691 | grep -v pnfs Sample file is /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/sntp_data/2007-05/F00038163_0000.all.sntp.cedar_phy_bhcurv.0.root ./stage -b 3 -g MinosPrdReadPools VOC691 r-stkendca18a-5 - this is in the correct pool, good, was not there prestage w-stkendca12a-4 FAM=reco_far_cedar_phy_bhcurv_sntp 17:47 for VOL in ${VOLS} ; do ./stage -w -g MinosPrdReadPools ${VOL} ; done ########## # DCACHE # ########## Date: Thu, 20 Dec 2007 15:14:24 +0000 (UTC) From: Arthur Kreymer To: dcache-admin@fnal.gov Subject: dcap version for security scan immunity Which version of dcap should be used to avoid the known problems with recent Fermilab security scans ? Minos has been using the 'current' version, v2_32_f0408, which I know is too old. How should dcap be set up, or what environment variable should be set, for scan immunity ? What are the operational impacts of running in this mode ? ============================================================================= 2007 12 19 ########## # DCACHE # ########## Date: Wed, 19 Dec 2007 08:44:18 -0600 (CST) Subject: HelpDesk ticket 108563 Short Description: Minos writes pending 3 days Problem Description: dcache-admin There is a set of Minos data files which were written to FNDCA 3 days ago. These are still not on tape, as reported at http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt ( I see that there are problems again with the 9* pools. 
) Please report that status of recovery of these files to minos-data. Date: Wed, 19 Dec 2007 08:57:42 -0600 (CST) This ticket has been reassigned to BERG, DAVID of the CD-SF/DMS/DSC/SSA Group. Thu Dec 20 08:31:44 CST 2007 - waiting ######## # FARM # ######## In cedar_phymcnear.log, see message OOPS - POOLS ACTIVE NEED 12 10 11 but writing continued... I see that the 9* pools are down ############ # HELPDESK # ############ Arthur Kreymer wrote: > I cannot use the usual Web page to submit helpdesk requests > I see, at https://computing.fnal.gov/cgi-bin/remedy/Helpdesk.pl > as of about 22:50 on 2007 Dec 18, This error was inadvertently caused by scheduled maintenance and only temporarily affected the web interface to reporting helpdesk issues. Sorry for the inconvenience. ============================================================================= 2007 12 18 ############## # MINOS/DATA # ############## near cedar_phy_bhcurv preparation PNFS=/pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data YEMOS=`cd ${PNFS} ; find . -type d -maxdepth 1 -exec basename {} \; | grep -v '\.' | sort` $ for MO in $YEMOS ; do AFSS/stage -d -p 0 reco_near/cedar_phy_bhcurv/sntp_data/$MO ; done | grep Needed . Needed 8/67 Needed 22/98 ... Needed 6/65 ...... Needed 0/54 . Needed 0/1 ..... Needed 2/55 .... Needed 0/33 Needed 3/46 ... Needed 2/42 . Needed 3/52 OK, need to get these directed to Minos read pools, and staged, then can do the bin/dc2nfs -d reco_near/cedar_phy_bhcurv/sntp_data 2>&1 | \ tee /minos/scratch/log/dc2nfs/cpbnear.log ########### # MONTHLY # ########### completed dbarchive 15:15 DCS_HV.MYD 35m rm -r /data/archive/COPY/20071107 # possibly speed up the copies ? PULSERGAIN.MYD 63m the rest 74m Tue Dec 18 20:47:21 CST 2007 md5sum real 30m44.050s 46G gzip real 98m52.524s 18G Repeated copies, with fresh ticket Ran out of space on minos-sam03 time cp -vax ${DBCOPY} /minos/data/mysql/${DAY} real 16m43.108s time diff -r ${DBCOPY} /minos/data/mysql/${DAY} real 33m19.808s time rsync -r \ real 8m7.283s ran out of space again MINOS-SAM03 > time scp -vr 20060418 minsoft@minos-mysql1:/minos/data/mysql/20060418 real 7m25.788s user 3m0.609s sys 1m4.927s time rsync -r ${DBBINS} /minos/data/mysql/BINLOG --perms --times --size-only -v interrupted during 143, resumed time rsync -r ${DBBINS} /minos/data/mysql --perms --times --size-only -v 40m20.398 The problem is that BINLOGS contains 54 GB of recent changes ! [minsoft@minos-mysql1 ~]$ ls -alF /data/archive/BINLOG | grep 'Dec 8' | wc -l 15 [minsoft@minos-mysql1 ~]$ ls -alF /data/archive/BINLOG | grep 'Dec 9' | wc -l 24 [minsoft@minos-mysql1 ~]$ ls -alF /data/archive/BINLOG | grep 'Dec 14' | wc -l 9 ########### # BLUEARC # ########### Date: Tue, 18 Dec 2007 19:33:16 -0600 (CST) Subject: Help Desk Ticket 108225 Has Been Resolved. Solution: I forgot to close this ticket. I called Art on day of incident .... we had experienced a head/crash and failover at the day/time of the ticket Andy ( romero@fnal.gov x4733 ) Problem Description: LSC/CSI : Today at around 11:30, till around 11:35 ( roughly ) the NFS mounts of the BlueArc served /minos/data and /minos/scratch timed out many or all of the Minos Cluster nodes. The mounts seem to have recovered. ... ########### # BLUEARC # ########### Date: Tue, 18 Dec 2007 15:35:38 -0600 (CST) HelpDesk ticket 108542 Short Description: Quota request for BlueArc served /minos/scratch, for rahaman Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user rahaman on the BlueArc served /minos/scratch volume. 
This overrides the existing default 100 GBytes quota. Date: Tue, 18 Dec 2007 15:44:25 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. Date: Tue, 18 Dec 2007 16:04:09 -0600 (CST) Solution: Quota adjusted. This ticket was resolved by INKMANN, JOHN of the CD-LSCS/CSI/CS/EST group. ########## # ANNUAL # ########## Created new data directories for next year, per procedure at the bottom of LOG per Rubin reminder. ########## # DCACHE # ########## Rustem reports slow access to /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_04/L010185N/sntp_data 1376 files Ticket #: 107808 The restores took about 12 hours, as expected. RUNS=`ls /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_04/L010185N/sntp_data` for RUN in $RUNS ; do ./stage -w mcout_data/cedar_phy_bhcurv/far/daikon_04/L010185N/sntp_data/${RUN} ; done /minos/data/mcout_data/daikon_04/L010185N/far/cedar_phy_bhcurv/sntp_data ########## # DCACHE # ########## Requesting additional file families for MinosPrdReadPools DPAT=reco_near/cedar_phy_bhcurv/sntp_data DPAT=mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data DPAT=mcout_data/cedar_phy_bhcurv/far/daikon_04/L010185N/sntp_data ( cd /pnfs/minos/${DPAT} ; enstore pnfs --tags ) | grep '^.(tag)(file_family)' | cut -f 2 -d = reco_far_cedar_phy_bhcurv_sntp reco_near_cedar_phy_bhcurv_sntp mcout_cedar_phy_bhcurv_far_daikon_04_sntp mcout_cedar_phy_bhcurv_near_daikon_04_sntp Date: Tue, 18 Dec 2007 23:02:45 +0000 (UTC) Subject: HelpDesk ticket 108564 I cannot use the usual web page to submit this, so please enter this manually tomorrow : Please direct to Software / MSS / DCache , ( dcache-admin ) Low priority Please add these file families to the MinosPrdReadPools selection list, described at http://fndca3a.fnal.gov:2288/poolInfo/ugroups/MinosPrdSelGrp minos.reco_far_cedar_phy_bhcurv_mrnt minos.reco_far_cedar_phy_bhcurv_sntp minos.reco_near_cedar_phy_bhcurv_mrnt minos.reco_near_cedar_phy_bhcurv_sntp minos.mcout_cedar_phy_bhcurv_far_daikon_04_mrnt minos.mcout_cedar_phy_bhcurv_far_daikon_04_sntp minos.mcout_cedar_phy_bhcurv_near_daikon_04_mrnt minos.mcout_cedar_phy_bhcurv_near_daikon_04_sntp Date: Wed, 19 Dec 2007 08:55:28 -0600 (CST) This ticket is assigned to BERG, DAVID of the CD-SF/DMS/DSC/SSA. ####### # AFS # ####### Per CD OPS meeting, AFS timeout cause may have been located. 
Latest Minos timeouts : for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Dec " | grep -v Tokens | grep Lost | uniq'; done minos02 Dec 16 07:16:10 minos02 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 16 11:15:14 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 16 16:15:15 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 17 07:15:18 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) minos08 Dec 12 17:08:58 minos08 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) minos09 Dec 16 07:16:10 minos09 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) ########## # CONDOR # ########## Now testing the run limits in condor, with 100 josh processes running ( and the minos farms stuck due to cert problems ) condor_submit glide150.run It seems that all 150 ran, pretty quickly, with no increase in glideins. MINOS25 > grep HOSTNAME logs/glide150/24032*.out | cut -f 2 -d : | sort -u HOSTNAME fnpc300.fnal.gov HOSTNAME fnpc309.fnal.gov HOSTNAME fnpc323.fnal.gov HOSTNAME fnpc335.fnal.gov Let's try this again, with CPU cranked up to 3 minutes Still only 111 jobs, 111 running cq kreymer ... 24034.* 150 jobs; 139 idle, 11 running, 0 held MINOS25 > grep HOSTNAME logs/glide150/24034*.out | cut -f 2 -d : | sort -u HOSTNAME fnpc300.fnal.gov HOSTNAME fnpc309.fnal.gov HOSTNAME fnpc323.fnal.gov HOSTNAME fnpc335.fnal.gov ########## # CONDOR # ########## Glidein management 0.1 : Glideins are controlled by two accounts on minos25 : gfactory gfrontend To start up the system, run these scripts in the home areas respectively start_factory.sh start_frontend.sh To stop these, just kill the python scripts respectively python glideFactory.py 90 4 /home/gfactory/glideinsubmit/glidein_t5/ python glideinFrontend.py 90 4 /home/gfrontend/myvofrontend1/etc/vofrontend.cfg The primary configuration files are respectively glideinWMS/creation/glideinWMS.xml myvofrontend1/etc/vofrontend.cfg Stopped the pythons, adjusted the max execution limit in vofrontend.cfg, restarted. Submitted wms.run at 14:19 Ran with 10K limit, per myvofrontend1/log/frontend_info.20071218.log Killed process with -9, restarted, now running like Total running 0 limit 50 ============================================================================= 2007 12 17 ########### # BLUEARC # ########### Date: Mon, 17 Dec 2007 14:29:37 -0600 (CST) HelpDesk ticket 108471 Short Description: Export /minos/data and /minos/scratch to *.fnal.gov Problem Description: LSC/CSI : Please export to *.FNAL.GOV , readonly and rootsquashed, the BlueArc served /minos/data and /minos/scratch Motivation - To make these available on Minos laptops and desktops. Security - These are already mounted on the GPFARM Open Science Enclave, making the data readable even to non-Fermilab users. The readonly/rootsquash export should prevent inappropriate writes. Load - These are already mounted on hundreds of GPFARM nodes. The laptops and desktops are a small increment. Timing - Whenever convenient. 
Compatibility - This request presumes that existing exports and mounts remain functional without modification. We still write to some of these ares from the Minos Cluster, and from the GPFARM. Date: Mon, 17 Dec 2007 15:57:20 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. Date: Mon, 17 Dec 2007 16:13:24 -0600 (CST) Solution: Added read-only to following nfs mounts: 131.225.*.* (read_only,root_squash) /minos/data 131.225.*.* (read_only,root_squash) /minos/scratch ######## # FARM # ######## # 2007 12 17 - reenabled cedar_phy mcnear, for recent mrnt processing ############ # PREDATOR # ############ Has been disabled since Thursday 14 Dec, forgot to start after Oracle patches. MINOS26 > ./predator 2007-12 STARTED Mon Dec 17 15:10:30 UTC 2007 FINISHED Mon Dec 17 20:02:49 UTC 2007 ########## # CONDOR # ########## factproxy - cleaned up to copy to ${PFIL}.new, then rename to ${PFIL} this should minimize exposure to transition problems. ########## # CONDOR # ########## Igor notes that ~gfactory is in AFS. Not good, contains proxies. Dec 17 13:10 gfactory - this seems to be corrected. Created new proxy, SRV1> cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 800:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy \ -valid 800:0 ... Your proxy is valid until Sat Jan 19 21:14:24 2008 scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy \ .grid/kreymer-condor.proxy.20080119 The kreymer-pilot.proxy is also refreshed after 13:10. ============================================================================= 2007 12 14 ########## # CONDOR # ########## administrative access, try this with an active proxy, MINOS25 > cd /local/scratch25/kreymer/.grid/ MINOS25 > scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-doekey.pem . MINOS25 > scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-doe.pem . voms-proxy-init \ -voms fermilab:/fermilab/minos \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem MINOS25 > condor_off -peaceful minos01 -subsystem startd Sent "Set-Peaceful-Shutdown" command to startd minos01.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos01.fnal.gov MINOS25 > condor_q ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 80.0 jdejong 11/2 13:39 0+00:00:01 H 0 0.0 loon /minos/scratc 1789.0 hartnell 11/11 11:36 0+00:00:01 H 0 9.8 tiny ... MINOS25 > condor_rm 80.0 Job 80.0 marked for removal MINOS25 > condor_rm 1789.0 Job 1789.0 marked for removal voms-proxy-destroy MINOS25 > condor_rm 15835.3 Job 15835.3 marked for removal This was a job stuck due to the bluearc timeout on 11 Dec. So I still seem to be a superuser for managing jobs MINOS25 > condor_off -peaceful minos02 -subsystem startd ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5003:Failed to authenticate. Globus is reporting error (851968:40). There is probably a problem with your credentials. (Did you run grid-proxy-init?) AUTHENTICATE:1004:Failed to authenticate using FS Can't send Set-Peaceful-Shutdown command to startd minos02.fnal.gov ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5003:Failed to authenticate. Globus is reporting error (851968:80). There is probably a problem with your credentials. (Did you run grid-proxy-init?) 
AUTHENTICATE:1004:Failed to authenticate using FS Sent "Kill-Daemon-Peacefully" command to master minos02.fnal.gov ########## # CONDOR # ########## SRV1> minfarm, cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 800:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy \ -valid 800:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Creating temporary proxy .................................................... Done Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Done Warning: fermigrid2.fnal.gov:15001: validity shortened to 86400 seconds! Creating proxy ................................................ Done Your proxy is valid until Wed Jan 16 19:41:03 2008 [gfactory@minos25 ~]$ scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy \ .grid/kreymer-condor2.proxy ########## # DCACHE # ########## Request to recycle VO2114 | CD-9940B | 0000_000000000_0000637 | minos | reco_near_cedar_phy_bhcurv_cand VO3170 | CD-9940B | 0000_000000000_0000444 | minos | reco_near_cedar_phy_bhcurv_cand VO3899 | CD-9940B | 0000_000000000_0000124 | minos | stage_kordosky VO4319 | CD-9940B | 0000_000000000_0000129 | minos | stage_kordosky VO4616 | CD-9940B | 0000_000000000_0000050 | minos | stage_kordosky VO7080 | CD-9940B | 0000_000000000_0000302 | minos | reco_mc_cosmic_cedar VO9164 | CD-9940B | 0000_000000000_0000128 | minos | stage_kordosky VOA280 | CD-9940B | 0000_000000000_0000128 | minos | stage_kordosky VOC347 | CD-9940B | 0000_000000000_0000579 | minos | mcin_near_daikon_04 VOC588 | CD-9940B | 0000_000000000_0000128 | minos | stage_kordosky for VOL in VO2114 VO3170 VO3899 VO4319 VO4616 VO7080 VO9164 VOA280 VOC347 VOC588 ; do echo VOLUME ${VOL} enstore info --list=${VOL} ; done > /minos/scratch/kreymer/recycle20071214.lis ########## # CONDOR # ########## Need to register Kreymer KCA DN in Minos CAF, for glidein usage Date: Fri, 14 Dec 2007 11:12:07 -0600 (CST) HelpDesk ticket 108403 Problem Description: run2-sys : We are moving to a more secure environment, with shorter lived certificates for the Minos Condor Analysis Facility. Please add the following DN to this file on minos25 : /etc/grid-security/condor-grid-mapfile "/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer" gfactory2 ( This is one line, including gfactory2 . This may have been split by the Helpdesk entry form and/or email. ) We would like to get this done today of possible. Date: Fri, 14 Dec 2007 11:34:06 -0600 (CST) Subject: Your ticket 108403 has been reassigned to ALLEN, JASON Added kreymer/cron/minos25.fnal.gov@FNAL.GOV to gfactory@minos25:.k5login The helpdesk ticket request was wrong, a blunder in my part, should have been for "/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer" gfactory2 Reported this to Jason, this was corrected soon : MINOS25 > ls -l /etc/grid-security/condor-grid-mapfile -rw-r--r-- 1 root root 1797 Dec 14 16:19 /etc/grid-security/condor-grid-mapfile ######## # FARM # ######## This is a mess ! The catchup run seems to have failed to set it's pid, and we had two running in parallel, making a scrambled hash of the log file. 
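A guard along the following lines keeps a second copy of a script from starting when the first is still running. This is only a sketch of the general pidfile idea, not the roundup script's actual mechanism, and the lock file path is hypothetical.

# Sketch: refuse to start if another instance is already running.
# LOCK path is hypothetical; not how roundup itself records its pid.
LOCK=/var/tmp/roundup.pid
if [ -f ${LOCK} ] && kill -0 `cat ${LOCK}` 2>/dev/null ; then
    echo "already running as pid `cat ${LOCK}`, exiting"
    exit 1
fi
echo $$ > ${LOCK}
trap "rm -f ${LOCK}" EXIT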
PURGED WRITE/F00032678_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00031989_0000.all.sntp.cedar_phy_bhcurv.0.root do_ypcall: clnt_call: RPC: Timed out Traceback (most recent call last): File "/export/stage/minfarm/ROUNDUP/SAM/current/bin/sam", line 4, in ? sys.exit(Sam.main(sys.argv)) File "sam_user_pyapi/bin/Sam.py", line 6368, in main File "sam_common_pylib/SamCommand/CommandInterfaceSuite.py", line 120, in dispatch File "sam_common_pylib/SamCommand/CommandInterfaceSuite.py", line 118, in dispatchCommand File "sam_common_pylib/SamCommand/CommandInterface.py", line 61, in mainDispatch File "sam_common_pylib/SamCommand/BlessedCommandInterfacePlaceHolder.py", line 38, in dispatch File "sam_common_pylib/SamCommand/SamCommandInterface.py", line 208, in cliDispatch File "sam_common_pylib/SamCommand/CommandInterface.py", line 331, in cliDispatch File "sam_common_pylib/SamCommand/CommandInterface.py", line 344, in _baseClass_cliDispatch File "sam_user_pyapi/src/samLocate.py", line 75, in implementation File "sam_common_pylib/SamCorba/SamServerProxy.py", line 257, in _callRemoteMethod File "sam_common_pylib/SamCorba/SamServerProxyRetryHandler.py", line 266, in handleCall KeyError: 'getpwuid(): uid not found: 10871' Log modified at 06:45:03 Note the do_ypcall, this is an NIS problem ! Net effect, we still have a lot to do, will let the scripts run : SRV1> ls WRITE | grep ^F | wc -l 4026 Present summary : SRV1> find . -name "f*" -type l | wc -l 701 SRV1> find . -name "f*" -type f | wc -l 7 mcfar written around 10:00, into saddreco Saturday 15:48 CST , still copying cedar_phy_bhcurvfar started around 19:00 yesterday SRV1> find WRITE/ -name "F*" -type f | wc -l 1475 SRV1> find WRITE/ -name "F*" -type l | wc -l 2431 ============================================================================= 2007 12 13 ####### # AFS # ####### Per tagg, for AFSUSER in cherdack rearmstr rodriges ; do pts adduser -user ${AFSUSER} -group minos:cc done pts membership minos:cc buckley kreymer urheim tagg cherdack rearmstr rodriges I cannot find a Fermilab ID for Tony. ######## # FARM # ######## 4614 files in WRITE/ are too much for ls. roundup.20071213 PURGE and WRITE sections got too many files for the ls command to swallow changed to 'find' In testing, the CC mv/ln were not disabled by ${ECHO} need to shift some back : SRV1> find . -type l -name "F*" ./F00031721_0000.spill.mrnt.cedar_phy_bhcurv.0.root ./F00031721_0000.spill.bntp.cedar_phy_bhcurv.0.root ./F00031721_0000.all.sntp.cedar_phy_bhcurv.0.root ./F00031721_0000.spill.sntp.cedar_phy_bhcurv.0.root ./F00040057_0000.all.sntp.cedar.0.root ./F00040057_0000.spill.bntp.cedar.0.root ./F00040057_0000.spill.sntp.cedar.0.root FILES=`find . -type l -name "F*" | cut -f 2 -d /` for FILE in ${FILES} ; do FLIN=`ls -l ${FILE} | cut -f 2 -d '>'` FDIR=`dirname ${FLIN}` mv ${FDIR}/${FILE} ${FILE} ls -l ${FILE} done OK, that's clean. Put the new roundup in production, and run catchup SRV1> cp -a AFSS/roundup.20071213 . SRV1> ln -sf roundup.20071213 roundup SRV1> roundup -r cedar_phy_bhcurv far Thu Dec 13 22:15:32 CST 2007 PURGING WRITE files 4497 ######## # GRID # ######## Date: Thu, 13 Dec 2007 11:51:10 -0600 Subject: Re: ALLEN, JASON HelpDesk ticket 108321 Has Been Updated. Resolved Mounted /grid/data and grip/app as requested. I have verified this with a scan of all nodes. 
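The same kind of scan is easy to repeat from the Minos side whenever the mount list changes. A sketch only, reusing the ${NODES}/ssh convention from the AFS scans above; it reports whether each area is mounted, not whether it is writable.

# Sketch: confirm /grid/data and /grid/app are mounted on every cluster node.
# Assumes ${NODES} is set and kerberized ssh works, as in the AFS scans.
for NODE in ${NODES} ; do
    printf "%-10s" ${NODE}
    ssh ${NODE} 'for M in /grid/data /grid/app ; do
        mount | grep -q " on ${M} type " && printf " %s ok " ${M} || printf " %s MISSING " ${M}
    done ; echo'
done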
########### # MONTHLY # ########### DATASETS 12/13 PREDATOR 12/13 VAULT 12/13 MYSQL 12/ Date: Thu, 13 Dec 2007 11:36:47 -0600 (CST) HelpDesk ticket 108345 AFS corruption of /afs/fnal.gov/files/home/room1/kreymer/minos/log/rawcopy/far/encp.2007-11.log Date: Thu, 13 Dec 2007 12:06:06 -0600 (CST) Subject: Your ticket 108345 has been reassigned to HILL, KEVIN Checked other files in rawcopy/far and near with 'file', one other data file, far/2007-08.log, with one @ byte, rerun, because the disk filled at that time. Checked all the topdb and pnfslogs, these all look intact. 2007 12 18 - 15:15 - dbarchive DCS_HV.MYD 35m PULSERGAIN.MYD rm -r /data/archive/COPY/20071107 ####### # SAM # ####### Tested dev universe after Tuesday's upgrades, using new TEST section of HOWTO.sam. ####### # SAM # ####### Downtime scheduled 10:00 for Oracle/System patches of production minosora1 Date: Thu, 13 Dec 2007 10:32:35 -0600 oracle database patch and host server reboot are complete.? minosprd database is available for use. Date: Thu, 13 Dec 2007 17:09:36 +0000 (UTC) Thanks ! The Minos production dbserver and station rode throught the maintenance, and are still functioning normally. ============================================================================= 2007 12 12 ####### # AFS # ####### NEWGROUP=cc pts creategroup -name kreymer:${NEWGROUP} group kreymer:cc has id -2502 for GUSER in buckley kreymer tagg urheim ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts membership kreymer:${NEWGROUP} buckley kreymer urheim tagg pts examine kreymer:${NEWGROUP} Name: kreymer:cc, id: -2502, owner: kreymer, creator: kreymer, membership: 4, flags: SOMar, group quota: 0. pts chown kreymer:${NEWGROUP} minos:admin Now assign this to d88 fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d88 \ -acl minos:admin rlidwka fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d88 \ -acl minos:cc rlidwka ########## # CONDOR # ########## Date: Wed, 12 Dec 2007 17:08:39 -0600 (CST) HelpDesk ticket 108310 Trying to set up a frequent KCA based proxy for use by the factory Script factproxy kx509 kxlist -p voms-proxy-init \ -noregen \ -voms fermilab:/fermilab/minos/Role=pilot \ -vomslife 12:0 \ -valid 12:0 \ -out /local/scratch25/kreymer/kreymer-pilot.proxy This works OK for my normal ticket, MINOS25 > voms-proxy-info -all -file /local/scratch25/kreymer/kreymer-pilot.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy/CN=proxy issuer : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy identity : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy type : unknown strength : 512 bits path : /local/scratch25/kreymer/kreymer-pilot.proxy timeleft : 11:57:32 === VO fermilab extension information === VO : fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. 
Kreymer/USERID=kreymer issuer : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov attribute : /fermilab/minos/Role=pilot/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 23:59:18 But fails for the kcron ticket, even for the minimal vpi form MINOS25 > kcron MINOS25 > klist -f Ticket cache: /tmp/krb5cc_1060_c22360 Default principal: kreymer/cron/minos25.fnal.gov@FNAL.GOV MINOS25 > kx509 MINOS25 > kxlist -p Service kx509/certificate issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA subject= /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/0.9.2342.19200300.100.1.1=kreymer serial=CF0A30 hash=f6c1da48 MINOS25 > voms-proxy-init -noregen -voms fermilab:/fermilab -debug Detected Globus version: 22 Unspecified proxy version, settling on Globus version: 2 Number of bits in key :512 Using configuration file /minos/scratch/kreymer/VDT/glite/etc/vomses Using configuration file /home/condor/execute/dir_11128/userdir/glite/etc/vomses Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Files being used: CA certificate file: none Trusted certificates directory : /minos/scratch/kreymer/VDT/globus/TRUSTED_CA Proxy certificate file : /tmp/x509up_u1060 User certificate file: /tmp/x509up_u1060 User key file: /tmp/x509up_u1060 Output to /tmp/x509up_u1060 Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer Using configuration file /minos/scratch/kreymer/VDT/glite/etc/vomses Using configuration file /home/condor/execute/dir_11128/userdir/glite/etc/vomses Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Failed Error: fermilab: User unknown to this VO. None of the contacted servers for fermilab were capable of returning a valid AC for the user. I suspect that this kcron cert is not known to the VO Trying to self register, page [-] Members . Re-sign Grid and VO AUPs [-] Certificates . Add Certificate The search form is missing. You are logged in as /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 /DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1 and You are logged in as /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/UID=kreymer /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA 2007 12 14 - Chadwick added the CN=cron certificate, with pilot role. I can now create a proxy : MINOS25 > voms-proxy-init \ -noregen \ -voms fermilab:/fermilab/minos/Role=pilot \ -vomslife 12:0 \ -valid 12:0 \ -out /local/scratch25/kreymer/kreymer-pilot.proxy Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Done Creating proxy ........................................................ Done Your proxy is valid until Fri Dec 14 21:36:53 2007 MINOS25 > voms-proxy-info -all -file /local/scratch25/kreymer/kreymer-pilot.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. 
Kreymer/USERID=kreymer/CN=proxy issuer : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer identity : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer type : proxy strength : 512 bits path : /local/scratch25/kreymer/kreymer-pilot.proxy timeleft : 11:59:33 === VO fermilab extension information === VO : fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer issuer : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov attribute : /fermilab/minos/Role=pilot/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 11:59:32 But this was for a normal KCA cert which somehow crept in, not the cron MINOS25 > voms-proxy-info -all -file ${PDIR}/${PFIL} WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy issuer : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer type : proxy strength : 512 bits path : /local/scratch25/kreymer/.grid/kreymer-pilot.proxy timeleft : 8:35:42 === VO fermilab extension information === VO : fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer issuer : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov attribute : /fermilab/minos/Role=pilot/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 10:35:47 ########## # OFFICE # ########## Copied white/black boards to file:///minos/scratch/kreymer/ File: dscn0550.jpg 1123 KB 12/12/2007 03:06:41 PM File: dscn0552.jpg 1052 KB 12/12/2007 03:06:44 PM File: dscn0553.jpg 1038 KB 12/12/2007 03:06:46 PM ########## # CONDOR # ########## Date: Wed, 12 Dec 2007 15:04:21 -0600 (CST) HelpDesk ticket 108299 Short Description: Minos Cluster - condor 6.9.5 preinsatllation Problem Description: run2-sys : Please install the following RPM in all the minos01 through minos26 systems. http://fermigrid.fnal.gov/files/condor/condor-6.9.5-linux-x86-rhel3-dynamic-1.i386.rpm This rpm places new files in /opt/condor-6.9.5, and should not interfere with existing operations. The rpm is about 95 MB, and unwinds into about 250 MBytes. The actual upgrade is still being planned, and will consist roughly of Condor shutdown/swap config files/Condor start. Background : We will need to upgrade the Condor version on the Minos Cluster from 6.8.6 to 6.9.5 sometime soon, by next week, in order to be compatible with the new Condor being deployed next week on the GPfarm It is probably best to do this soon, before the holidays. Date: Wed, 12 Dec 2007 15:20:20 -0600 (CST) This ticket has been reassigned to ALLEN, JASON of the CD-SF/FEF Group. 
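Once run2-sys reports back, the pre-installation is easy to verify without reopening the ticket. A sketch under the assumption that the package registers under a name starting with condor- (guessed from the RPM file name in the ticket) and that ssh to minos01 through minos26 works as in the other scans.

# Sketch: report any installed condor RPMs on each cluster node.
# Package name pattern is assumed from the RPM file name above.
for N in `seq -w 1 26` ; do
    NODE=minos${N}
    printf "%-10s " ${NODE}
    ssh -o ConnectTimeout=10 ${NODE} 'rpm -qa | grep ^condor- || echo no-condor-rpm'
done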
########## # CONDOR # ########## From 15 Oct plan from Timm, we should be able to pre-install http://fermigrid.fnal.gov/files/condor/condor-6.9.5-linux-x86-rhel3-dynamic-1.i386.rpm We presently have http://fermigrid.fnal.gov/files/condor/condor-6.8.6-linux-x86-rhel3-dynamic-1.i386.rpm The suggested Minos config files are in http://fermigrid.fnal.gov/files/condor/minos/ ###### # CD # ###### FYI - IT/Comp Prof jobs renaming, http://wdrs.fnal.gov/job_descript/info_tech/IT_Job_Description_Review.ppt ######## # FARM # ######## Investigating mia sntp's for N00008218 N00008221 N00008224 N00008227 N00008230 N00008233 N00008238 ########### # BLUEARC # ########### Date: Wed, 12 Dec 2007 09:16:38 -0600 (CST) Etta Burns and Dave Bell worked on the array this morning. We believe the Minos area is now stable. ettab - 8300 dbell - 4482 Date: Wed, 12 Dec 2007 16:44:57 +0000 (UTC) From: Arthur Kreymer To: minos-admin@fnal.gov, csi-mgmt@fnal.gov Cc: rayp@fnal.gov, ettab@fnal.gov, dbell@fnal.gov Subject: Re: HelpDesk ticket 108251 - resolved, executive summary This is an executive summary of the resolution of HelpDesk ticket 108251, mount failures of the BlueArc served /minos/data, /minos/scratch, based on a conversation with Dave Bell. If there are no corrections from the experts, I will forward this to the Minos collaboration in general. Observations : 1) No disk or array failures were observed internal to the array. 2) There is a level of soft errors in the Fiber Channel fabric which is consistent with the presently normal operation of similar arrays at Fermilab. These do not seem to be obviously correlated with our mount timeouts. 3) There were communication failures to the array's FC ports which seem identical to those seen previously on similar CMS servers. Those problems were resolved about a month ago in CMS by changing Nexsan controller settings from 'active/active' to 'active/passive' This change from a/a to a/p was applied to the Minos server this morning. Actions : Minos should resume normal use of the data and scratch areas, expecting to see no further timeouts. ============================================================================= 2007 12 11 ########### # BLUEARC # ########### Date: Tue, 11 Dec 2007 19:48:52 -0600 (CST) HelpDesk ticket 108251 LSC/CSI : As of about 19:40, the /minos/data and /minos/scratch NFS mounts have timed out on the Minos Cluster and on fnpcsrv1. This shuts down all Minos farm processing, and most analysis. ############## # MINOS_DATA # ############## cd $MINOS_DATA/d10 DIRS=`ls -ld recodata* | grep lrwx | cut -f 2 -d / | cut -c 2- | sort -n` cd .. 
for DIR in ${DIRS} ; do fs listquota d${DIR} | grep nb ; done All are 50000 except d11 d21 d22 d46-49 d71-776 Look at a block of not so full directories nb.minos.d81 50000000 31401634 63% 78% nb.minos.d86 50000000 32841842 66% 85% nb.minos.d88 50000000 34690590 69% 78% nb.minos.d89 50000000 29209809 58% 78% nb.minos.d90 50000000 33385630 67% 77% nb.minos.d91 50000000 44757332 90% 77% d86 had an old kreymer file cp -a d86/kreymer/F00034242_0013.mdaq.root /minos/scratch/kreymer/ and files like F00036724_0008.spill.sntp.R1_18_4.0.root vintage Oct 2006 N00011059_0001.spill.sntp.R1_18_4.0.root vintage Oct 2006 a20000180_0001.cnts.R1.14.root vintage May/Jun 2005 c10000659_0001.cnts.R1.14.root vintage May/Jun 2005 MINOS26 > grep c10000695_0009.cnts.R1.14.root d10/indexes/*.index d10/indexes/mc_far.R1.14.index:recodata16/c10000695_0009.cnts.R1.14.root I'm removing all from mc_far.R1.14.index Doing this on fnpcsrv1 as rubin cut/paste from shrc/kreyemr cd /afs/fnal.gov/files/data/minos/d10/indexes cp remove_vsn.mc /rvm hacked this to preview, looks OK, files will come out of recodata16/17/18/19 cd ../../d86/recodata16 SRV1> rm recodata16 SRV1> ls | wc -l 287 SRV1> for FILE in ${FILES} ; do grep ${FILE} ../../d10/indexes/*R1_18_4*.index; done | cut -f 1 -d : | cut -f 5 -d / | sort -u 2006-10_far.R1_18_4.index 2006-10_near.R1_18_4.index 2006-11_far.R1_18_4.index 2006-11_near.R1_18_4.index INDS=`` SRV1> for IND in ${INDS} ; do grep recodata16 ../../d10/indexes/${IND} ; done | wc -l 287 Now shift these to recodata17 on d88 See that we have space fs listquota ../../d88 Volume Name Quota Used %Used Partition nb.minos.d88 50000000 15788915 32% 74% Move em for FILE in ${FILES} ; do cp -va ${FILE} ../../d88/recodata17/${FILE} done Diff em for FILE in ${FILES} ; do echo ${FILE} diff ${FILE} ../../d88/recodata17/${FILE} done Reindex em for IND in ${INDS} ; do nedit ../../d10/indexes/${IND} ; done changed recodata16 to recodata17 for IND in ${INDS} ; do scp ../../d10/indexes/${IND} fnpcsrv1:~minfarm/web/indexes/${IND} done Clear em for FILE in ${FILES} ; do rm ${FILE} done Tue Dec 11 20:15:30 CST 2007 ############ # NOACCESS # ############ VOC083 181.57GB (NOACCESS 1210-2159 none 0731-0922) CD-9940B minos.mcin_near_daikon_04.cpio_odc ########### # BLUEARC # ########### HelpDesk ticket 108225 LSC/CSI : Today at around 11;30, till around 11:35 ( roughly ) the NFS mounts of the BlueArc served /minos/data and /minos/scratch timed out many or all of the Minos Cluster nodes. Is there a known problem ? For reference, here are some sample mounts : MINOS26 > grep blue /etc/fstab blue2:/fermigrid-data /grid/data nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-app /grid/app nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 Date: Tue, 11 Dec 2007 12:13:05 -0600 (CST) This ticket has been reassigned to HILL, KEVIN of the CD-LSCS/CSI/CS/EST Group. Date: Tue, 11 Dec 2007 13:43:04 -0600 (CST) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. ######## # FARM # ######## Why are write queues so long, and why did they shoot up to over 5000 ? 
w-stkendca10a-4
w-stkendca11a-5
w-stkendca9a-5
w-stkendca9a-6

FAMS=`cat /tmp/pool9a6 | cut -f 6,6 -d ' ' | sort -u | grep 'si=' \
 | cut -f 2 -d '{' | cut -f 1 -d '}' | grep -v unknown`

for FAM in ${FAMS} ; do
  printf "${FAM} " ; grep "{${FAM}}" /tmp/pool9a6 | wc -l
done

for POOL in 10a-4 11a-5 9a-5 9a-6 ; do
  curl -o /tmp/pool${POOL} http://fndca3a.fnal.gov/dcache/files/w-stkendca${POOL}.files
done

for POOL in 10a-4 11a-5 9a-5 9a-6 ; do
  printf "\n${POOL}\n"
  FAMS=`cat /tmp/pool${POOL} | cut -f 6,6 -d ' ' | sort -u | grep 'si=' \
   | cut -f 2 -d '{' | cut -f 1 -d '}' | grep -v unknown | grep -v minos`
  for FAM in ${FAMS} ; do
    printf "${FAM} " ; grep "{${FAM}}" /tmp/pool${POOL} | wc -l
  done
done

=============================================================================
2007 12 10

##########
# DCACHE #
##########

848 files dated 2007-12-08 are listed at
http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt
Some are on tape now, some are not, like
/pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2006-01/F00033671_0020.all.cand.cedar_phy_bhcurv.0.root

Took a snapshot :
curl -s -o /var/tmp/kreymer/minos.txt http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt

#######
# CFL #
#######

'Newline appended' - comes from the 'ed' step.

GRRRRRRRRRR - out of quota in minos.log_data
-rw-r--r-- 1 kreymer 1525 260832953 Dec 10 17:10 CFL.new
-rw-r--r-- 1 kreymer 1525 176661307 Dec  8 19:16 CFL.old

MINOS26 > fs listquota
Volume Name                    Quota       Used %Used          Partition
minos.log_data               8000000    7813906   98%<<              26% <

wc -l CFL.new
1488395 CFL.new

MINOS26 > dds CFL.new
-rw-r--r-- 1 kreymer 1525 279971484 Dec 10 17:59 CFL.new

no more message 'Newline appended'
wc matches

########
# GRID #
########

According to Steve Timm, the attribute to select Minos AFS nodes in GPFARM
will be ISMINOSAFS
This is boolean, true or false.

##########
# DC2NFS #
##########

$ AFSS/dc2nfs -d beam_data 2>&1 | tee -a /tmp/dc2nfs.beam_data.log
...
STARTED
FINISHED

#########
# ADMIN #
#########

HelpDesk ticket 108182

LSC/CSI :
Please set an individual storage quota of 500 GBytes for user pawloski
on the BlueArc served /minos/scratch volume.
This overrides the existing default 100 GBytes quota.

Date: Tue, 11 Dec 2007 09:50:29 -0600 (CST)
Solution: quota increased
This ticket was resolved by HILL, KEVIN of the CD-LSCS/CSI/CS/EST group.

########
# GRID #
########

/grid/app and data mount on minos02-minos25
Scanned - these are presently mounted only on 01 and 26.

HelpDesk ticket 108170

run2sys :
Please mount /grid/data and /grid/app on Minos Cluster nodes
minos02 through minos25.
/grid/data should be read/write .
/grid/app should be readonly .
These are already mounted r/w on minos01 and 26, and should remain so.

###########
# MEETING #
###########

sent travel request to Rachel Rauchmiller 4514

#########
# ADMIN #
#########

MINOS26 > setup systools
MINOS26 > cmd add_minos_user boyd
cmd: Unable to determine your group name, gid = 1525

MINOS01 > setup systools
MINOS01 > cmd add_minos_user boyd
You are not authorized to run this command!
MINOS01 > date
Tue Dec 11 09:13:32 CST 2007

#######
# AFS #
#######

For the following scans, piped the output through uniq.
Should do this directly in the 'for' statement next time.
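Next time, fold the uniq into the loop itself. An untested sketch,
assuming NODES is the usual minos01 - minos26 list :

for NODE in ${NODES} ; do
  printf "${NODE}\n"
  ssh ${NODE} 'grep afs: /var/log/messages | grep "Dec " | grep -v Tokens' | uniq
done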
messages MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Dec " | grep -v Tokens'; done minos02 Dec 10 05:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 10 05:15:54 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) messages.1 MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages.1 | grep "Dec " | grep -v Tokens'; done minos01 Dec 6 19:16:34 minos01 kernel: afs: Waiting for busy volume 1685736052 (minos.log_data) in cell fnal.gov minos02 Dec 2 06:15:17 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 2 06:18:03 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 2 08:15:12 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 2 08:16:13 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 4 04:15:46 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 04:21:18 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 4 09:15:41 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 09:16:28 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 4 18:16:10 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 18:20:51 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 4 22:15:33 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 22:17:20 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 5 05:18:06 minos02 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 05:21:07 minos02 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 5 10:15:38 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 10:16:33 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 5 12:15:53 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 12:17:58 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 5 16:15:22 minos02 kernel: afs: Lost contact with file server 131.225.68.19 
in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 16:17:26 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 5 22:15:22 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 22:18:34 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 6 16:15:20 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 6 16:17:48 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 7 07:15:19 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 7 07:17:25 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 7 18:15:25 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 7 18:18:38 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos04 Dec 5 05:19:45 minos04 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 05:22:38 minos04 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 8 15:28:52 minos04 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 8 15:30:39 minos04 kernel: afs: file server 131.225.68.49 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos05 Dec 5 05:20:16 minos05 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 05:23:01 minos05 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos06 Dec 5 05:21:15 minos06 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 05:23:56 minos06 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos08 Dec 4 05:57:39 minos08 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 06:00:57 minos08 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 8 16:11:15 minos08 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 8 16:11:16 minos08 kernel: afs: failed to store file (110) Dec 8 16:11:47 minos08 kernel: afs: file server 131.225.68.49 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos11 Dec 3 13:25:52 minos11 kernel: afs: failed to store file (over quota) Dec 8 22:28:59 minos11 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all 
multi-homed ip addresses down for the server) Dec 8 22:30:35 minos11 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos12 Dec 4 05:57:33 minos12 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 05:58:08 minos12 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos15 Dec 3 04:46:46 minos15 kernel: afs: Lost contact with volume location server 131.225.68.4 in cell fnal.gov Dec 3 04:49:41 minos15 kernel: afs: volume location server 131.225.68.4 in cell fnal.gov is back up Dec 7 09:36:02 minos15 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 7 09:38:12 minos15 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos16 Dec 4 05:58:00 minos16 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 05:58:37 minos16 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos23 Dec 7 09:36:02 minos23 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 7 09:37:16 minos23 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos24 Dec 4 05:57:40 minos24 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 06:00:56 minos24 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 6 11:32:51 minos24 kernel: afs: Lost contact with file server 131.225.68.65 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 6 11:34:35 minos24 kernel: afs: file server 131.225.68.65 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos26 Dec 6 08:06:07 minos26 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 6 08:06:58 minos26 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) ############ # MCIMPORT # ############ 11:06 cycle complains of full disk Found /tmp full -rw-r--r-- 1 mindata e875 934518784 Dec 10 10:45 junk.file Removed this file $ df -h minos-nas-0.fnal.gov:/minos/data 12T 8.6T 2.6T 78% /minos/data minos-nas-0.fnal.gov:/minos/scratch 3.3T 783G 2.6T 24% /minos/scratch ############# # CHECKLIST # ############# queued stores peaking over 6000, averaging 3500 since 12/8 activity ramped up 12/2, gap 12/6 through 12/7 staging sharp spikes over 2000 on 12/3 noon and 12/8 evening Enstore servers - see backlog writing reco_far_cedar_phy_bhcurv_cand ################## # VACATION NOTES # ################## sam-design Thursday ( Nov ? ) 2 SRM-s, problems, srm restarted stkendca9a raid array repaired, back in service ? d0ora2 crashes 11/26 and 12/3 ( back disk/bios ? ) sam-users - d0ora2 down again 12/8 fnoaa - will be retired, what is this ? 
Helpdesk assessment report 14 Dec minos-admin jpfitz set up tools to let us add new users setup systools cmd add_minos_user bspeak having trouble on some minos?? systems with kcron/kcroninit minos_software_discussion FNALU meeting Dec 19 - hartnell/young minos xTravel - make arrangements xminosdb - x farm db reconnects ? x open file limit increased to 4K minos-data CCPID needs space in afs 10-20 GB recycle request 8 Dec from berg kschu - 4 Dec metadata xAFS x 3 Dec - update requested by rayp x sent new list of errors BATCH x need massive undeclare of MC D04 - or not x deferred, just mrcc were missing 12/7 - is corral using the mysql data disk ? SHIFT - T962 DAQ access SIM kordosky teragrid suggestion 12/5 x corrupt/duplicate files 12/7 x corrupt at origin xCFL x Wed 5 Dec email received 'Newline appended' x log_data partition had filled up xCONDOR x Vahle - cannot read file in d195 12/7 x. liz fixed it ######### # FNALU # ######### Date: Mon, 10 Dec 2007 09:18:11 -0600 From: mgreaney@fnal.gov To: kreymer@fnal.gov Subject: FNALU General meeting, December 19, WH1W 1:30pm To all, There will be a general meeting for experimenters and users using the FNALU cluster on December 19, in WH 1 West from 1:30-3:00pm. The purpose of the meeting is get input from experimenters and users on what resources are needed and to identify experiments using FNALU. Also the status or changes to support for FNALU will be discussed. If you are not able to attend this meeting, please send an email response to dss-est@fnal.gov with details of your project. Please include these details: 1. Name of experiment and scope of the project 2. CPU needs for the duration of the project 3. Disk space needs 4. Applications (licensed) needed 5. Whether or not you are using a local filesystem mounted on fnalu 6. Whether or not you use LSF Thank you, DSS Group ============================================================================= ############ # VACATION # ############ on vacation Dec 3-7 ============================================================================= 2007 11 30 ########## # CONDOR # ########## Could not add fermilab/minos group role of pilot. Helpdesk ticket ############# # CHECKLIST # ############# Enstore ball is red - probably LTO3 failure No ND data since 01:00 UTC. 
Beam returned around 07:00 UTC Tabs to preserve in office swap - http://www.cs.wisc.edu/condor/tutorials/barcelona-2006/ http://www.lnf.infn.it/computing/afs/doc/adm/adm02.htm http://www.lnf.infn.it/computing/afs/doc/adm/adm02.htm http://glite.web.cern.ch/glite/packages/R3.0/deployment/glite-WMS/glite-WMS.asp ============================================================================= 2007 11 29 ########## # OFFICE # ########## Kreymer/Ayres moved from 1270/1265 to 1260 Plunkett moved from 1260 to 1270 Networks moving tomorrow 07:00 ######## # FARM # ######## ./volumes vols ./volumes mcout_cedar_phy_near_daikon_00_cand CVOLS=` ./volumes mcout_cedar_phy_near_daikon_00_cand` MINOS26 > ./stage -d -p 0 VOC472 Needed 119/356 for VOL in ${CVOLS} ; do ./stage -w -g readPools ${VOL} done 2>&1 | tee -a /tmp/stagecand.log enstore info --list=VOC472 ####### # AFS # ####### MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 28" | grep -v Tokens'; done minos02 Nov 28 04:15:20 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 04:15:56 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Nov 28 11:15:18 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 11:17:03 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Nov 28 14:15:25 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 14:17:25 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Nov 28 20:15:27 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 20:17:05 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos19 Nov 28 04:51:25 minos19 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 04:51:42 minos19 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) ============================================================================= 2007 11 28 ############## # BLACKBOARD # ############## Prior to kreymer office move from 1270 to 1250 Here's notes from the blackboard ( Not worth photographing into DocDB ) HOSTS bsub -R Cores FLXB11-30 SL3 "linux24" 30 FLXB31-34 SL4 "linux26" 10 condor SL4 35 RECO PATH DISCUSSION RECO DET / REL / STR / MO MCOUT REL / DET / MC / CONF / STR / RUN MC in /minos/data and in future ? MCR / CONF / DET / REL / STR / RUN TODO LIST .forward ? 
crontab vs /usr/bin/aklog minos workgroup has shepelak in the .k5login of root ####### # CRL # ####### MINOS26 > fs listacl /afs/fnal.gov/files/data/minos/crl_data/WWWdirectory/crlwforms Access list for /afs/fnal.gov/files/data/minos/crl_data/WWWdirectory/crlwforms is Normal rights: kschu:crlweb2 rlidwk bgreen:minoscrladmin rlidwka bgreen:minoscrl rlidwk spanacek:crladmin rlidwka system:administrators rlidwka system:anyuser rl buckley rlidwka bgreen rlidwka habig needs access someone should fs setacl \ -dir /afs/.fnal.gov/files/data/minos/crl_data/WWWdirectory/crlwforms \ -acl habig rlidwka MINOS26 > pts membership bgreen:minoscrladmin buckley bgreen avva saranen MINOS26 > pts membership spanacek:crladmin dave_b spanacek stephen bgreen mccusker 131.225.110.8 131.225.110.61 ########## # CONDOR # ########## Date: Wed, 28 Nov 2007 13:38:06 -0600 From: Sfiligoi Igor To: jason@fnal.gov Cc: kreymer@fnal.gov, minos-admin@fnal.gov Subject: Change to allow Art to manage the MINOS Condor pool Hi Jason. In order for Art to administer the MINOS Condor pool, the following line "/DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310" kreymer should be put into the /etc/grid-security/condor-grid-mapfile of all the Condor worker nodes (minos25 would need it, too, but since we at the moment use the same DN also for the glideins, it should not change from what it is now). This change does not need any Condor reconfig to be effective... as soon as it is in, Art will be allowed to issue administrative commands. Thanks, Igor & Art Date: Wed, 28 Nov 2007 15:34:42 -0600 From: Jason Harrington This has been done. ########### # NETWORK # ########### ... work in progress ... KREYMERNOTE a.k.a. KREYMERLAPFNAL.dhcp.fnal.gov 131.225.56.160 MAC's 00-0E-35-A2-22-59 00-01-4A-04-65-23 The IP assigned to 131.225.56.160 shifted to the wireless and got the new name KREYMERFNALGOV-1024593-dp.dhcp.fnal.gov It has not worked stably on the wired network for several weeks. Earlier today, the wire was offline and wireless had SRV1> host 131.225.94.156 156.94.225.131.in-addr.arpa domain name pointer G-Bs-Computer.dhcp.fnal.gov. ########### # ROUNDUP # ########### SRV1> ./farmgsum nearcat 4370 201652 mcnearcat mcnearcat 1 19 mrnt.cedar_phy_oldbhcurv.root 4021 192594 mrnt.cedar_phy.root 3 197 sntp.cedar_phy_oldbhcurv.root 345 18624 sntp.cedar_phy.root OK, need to proceed with mcnearcat today to clear backlog, to test MC sam declares and moved to /m/d/... AFSS/roundup.20071126 -n -r cedar_phy_oldbhcurv mcnear all 4 are duplicates AFSS/roundup.20071126 -n -r cedar_phy mcnear 2>&1 | tee /tmp/cpmc.log OK adding n13011004_0000_L010185N_D00.mrnt.cedar_phy.root 11 OK adding n13011006_0000_L010185N_D00.mrnt.cedar_phy.root 1 OK adding n13011007_0000_L010185N_D00.mrnt.cedar_phy.root 1 OK adding n13011008_0000_L010185N_D00.mrnt.cedar_phy.root 1 OK adding n13011009_0000_L010185N_D00.mrnt.cedar_phy.root 1 OK adding n13011009_0009_L010185N_D00.mrnt.cedar_phy.root 2 OK adding n13011027_0000_L010185N_D00.mrnt.cedar_phy.root 11 OK adding n13011043_0000_L010185N_D00.mrnt.cedar_phy.root 11 OK adding n13011046_0000_L010185N_D00.mrnt.cedar_phy.root 11 OK adding n13011053_0000_L010185N_D00.mrnt.cedar_phy.root 11 ... 
Monitor with less LOG/2007-11/cedar_phymcnear.log Run 1 file AFSS/roundup.20071126 -r cedar_phy -s n13011006 mcnear mkdir: cannot create directory `/minos/data/mcout_data/daikon_00': Permission denied OOPS - cannot create CC area /minos/data/mcout_data/daikon_00/L010185N/near/cedar_phy/mrnt_data/100 as mindata, cd /minos/data chmod 775 mcout_data AFSS/roundup.20071126 -r cedar_phy -s n13011007 mcnear Corrected roundup to allow MC saddreco AFSS/roundup.20071126 -r cedar_phy -s n13011008 mcnear The scope of saddreco is overly broad, repeating all run ranges. Live with it for now. In future, try to narrow down. MCREL=daikon_00 DET=near REL=cedar_phy CONF=L010185N/cand_data/141 ./saddreco -m ${MCREL} -d ${DET} -r ${REL} -p ${CONF} -b 1 --verify Changed the SLOG directory and name to ${HOME}/ROUNTMP/LOG/saddreco/${MCREL}/${REL}/${DET}_${CONF}.log AFSS/roundup.20071126 -r cedar_phy -s n13011004 mcnear oops, mkdir -p of log directory was misplaced, try again AFSS/roundup.20071126 -r cedar_phy -s n13011009 mcnear The SADD phase is taking about 30 minutes for L010185N, due to the large number of cand files in about 200 RUN ranges Let's rip on the catchup now ! MIN > mv roundup.20071126 roundup.20071128 SRV1> cp -a AFSS/roundup.20071128 . SRV1> ln -sf roundup.20071128 roundup # was roundup.20071125 ./roundup -r cedar_phy mcnear Wed Nov 28 18:12:44 CST 2007 ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 27" | grep -v Tokens'; done minos02 Nov 27 11:16:05 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 27 11:18:08 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov Nov 27 14:15:22 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 27 14:19:23 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 28" | grep -v Tokens'; done minos02 Nov 28 04:15:20 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 04:15:56 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos19 Nov 28 04:51:25 minos19 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 04:51:42 minos19 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) ########## # CONDOR # ########## Updated wms.run to have +RunOnGrid=True Requirements = ((Arch=?="X86_64") || (Arch=?="INTEL")) && (GLIDEIN_Site=!=UNDEFINED) 7367.0 kreymer 11/28 08:34 0+00:00:00 I 0 9.8 probe Submitted 10 process wms, on grid Ran on fnpc206 and 264 Cranked up to 100 processes, cleaned up probe printout csub wms.run 8227.99 kreymer 11/28 11:02 0+00:00:00 I 0 9.8 probe 20 99 here a ########### # ROUNDUP # ########### Did one more catchup on cedar ./roundup -r cedar far Wed Nov 28 08:15:31 CST 2007 Wed Nov 28 08:33:43 CST 2007 ./roundup -r cedar near Wed Nov 28 08:35:45 CST 2007 Wed Nov 28 09:16:18 CST 2007 SRV1> ./farmgsum nearcat 4370 201652 mcnearcat mcnearcat 1 19 mrnt.cedar_phy_oldbhcurv.root 4021 192594 mrnt.cedar_phy.root 3 197 sntp.cedar_phy_oldbhcurv.root 345 18624 sntp.cedar_phy.root OK, need to proceed with mcnearcat today to clear backlog, to test MC sam declares and moved to /m/d/... 
AFSS/roundup.20071126 -n -r cedar_phy_oldbhcurv mcnear all 4 are duplicates AFSS/roundup.20071126 -n -r cedar_phy mcnear 2>&1 | tee /tmp/cpmc.log ============================================================================= 2007 11 27 ########## # CONDOR # ########## Testing glidein factory alias csub='condor_submit $*' alias cq='condor_q $*' cd /minos/scratch/kreymer/condor/probe wms.run RunOnGrid=True wms2.run +RunOnGrid=True skip kcron condor_queue 7206.0 kreymer 11/27 20:23 0+00:00:00 I 0 9.8 kcron /minos/scrat 7207.0 kreymer 11/27 20:25 0+00:00:00 I 0 9.8 probe 7208.0 gfactory 11/27 20:26 0+00:00:00 I 0 9.8 glidein_startup.sh 7208.1 gfactory 11/27 20:26 0+00:00:00 I 0 9.8 glidein_startup.sh 7208.2 gfactory 11/27 20:26 0+00:00:00 I 0 9.8 glidein_startup.sh 7208.3 gfactory 11/27 20:26 0+00:00:56 R 0 9.8 glidein_startup.sh 7208.4 gfactory 11/27 20:26 0+00:00:00 I 0 9.8 glidein_startup.sh 20:32 cq 7206.0 kreymer 11/27 20:23 0+00:00:00 I 0 9.8 kcron /minos/scrat 7207.0 kreymer 11/27 20:25 0+00:00:00 I 0 9.8 probe 7208.0 gfactory 11/27 20:26 0+00:01:06 R 0 9.8 glidein_startup.sh 7208.1 gfactory 11/27 20:26 0+00:02:06 R 0 9.8 glidein_startup.sh 7208.2 gfactory 11/27 20:26 0+00:02:06 R 0 9.8 glidein_startup.sh 7208.3 gfactory 11/27 20:26 0+00:04:06 R 0 9.8 glidein_startup.sh 7208.4 gfactory 11/27 20:26 0+00:02:06 R 0 9.8 glidein_startup.sh 7209.0 gfactory 11/27 20:29 0+00:01:06 R 0 9.8 glidein_startup.sh MINOS25 > condor_status | grep fnp vm1@9790@fnpc LINUX X86_64 Owner Idle 1.000 39 0+00:05:30 vm2@9790@fnpc LINUX X86_64 Owner Idle 1.960 3891 0+00:05:31 vm1@20628@fnp LINUX X86_64 Owner Idle 1.000 39 0+00:00:08 vm2@20628@fnp LINUX X86_64 Owner Idle 2.190 3891 0+00:00:09 vm1@10273@fnp LINUX X86_64 Owner Idle 1.000 39 0+00:00:11 vm2@10273@fnp LINUX X86_64 Owner Idle 2.450 3891 0+00:00:12 vm1@17343@fnp LINUX X86_64 Owner Idle 1.000 39 0+00:00:11 vm2@17343@fnp LINUX X86_64 Owner Idle 1.990 3891 0+00:00:12 vm1@8821@fnpc LINUX X86_64 Owner Idle 1.000 39 0+00:00:12 vm2@8821@fnpc LINUX X86_64 Owner Idle 2.600 3891 0+00:00:13 vm1@11264@fnp LINUX X86_64 Owner Idle 1.000 39 0+00:00:07 vm2@11264@fnp LINUX X86_64 Owner Idle 1.760 3891 0+00:00:08 condor_q -l 7207.0 ... Arguments = "" RunOnGrid = TRUE GlobalJobId = "minos25.fnal.gov#1196216712#7207.0" ProcId = 0 AutoClusterId = 10 AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,vm2_RemoteUser,User,GLIDEIN_Is_Monitor,RunO$ WantMatchDiagnostics = TRUE LastRejMatchReason = "PREEMPTION_REQUIREMENTS == False" LastRejMatchTime = 1196217579 ServerTime = 1196217767 Possibly a problem with preemption ? 
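For the record, a minimal sketch of what these grid-bound submit files amount to.
The executable / output / log names below are placeholders, not the actual
wms.run / wms2.run contents - only the +RunOnGrid and Requirements lines are
the real point, copied from the wms.run notes above :

# placeholder submit-file sketch
universe     = vanilla
executable   = probe
output       = probe.$(Cluster).$(Process).out
error        = probe.$(Cluster).$(Process).err
log          = probe.log
+RunOnGrid   = True
Requirements = ((Arch=?="X86_64") || (Arch=?="INTEL")) && (GLIDEIN_Site=!=UNDEFINED)
queue 10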
########## # CONDOR # ########## Shutting down further jobs on minos12 : Inspired by condor_off -peaceful -all -startd MINOS12 > condor_status minos12 Name OpSys Arch State Activity LoadAv Mem ActvtyTime vm1@minos12.f LINUX INTEL Claimed Busy 1.910 2026 0+01:33:45 vm2@minos12.f LINUX INTEL Claimed Busy 2.360 2026 0+01:33:47 Total Owner Claimed Unclaimed Matched Preempting Backfill INTEL/LINUX 2 0 2 0 0 0 0 Total 2 0 2 0 0 0 0 MINOS12 > condor_off -peaceful minos12 -subsystem startd Sent "Set-Peaceful-Shutdown" command to startd minos12.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos12.fnal.gov 14:38 MINOS01 > condor_off -peaceful minos01 -subsystem startd Sent "Set-Peaceful-Shutdown" command to startd minos01.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos01.fnal.gov No obvious effect, but nothing running yet, so MINOS01 > sudo /etc/init.d/condor stop Shutting down Condor (fast-shutdown mode) MINOS01 > sudo /etc/init.d/condor start Shutting down Condor (fast-shutdown mode) Still not running jobs, so stopping startd had the desired effect. Oops, spoke too soon, a job started running. ########## # CONDOR # ########## cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 500:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy \ -valid 500:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Creating temporary proxy ...................................................................... Done Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Done Warning: fermigrid2.fnal.gov:15001: validity shortened to 86400 seconds! Creating proxy ........................................................................... Done Your proxy is valid until Tue Dec 18 08:12:08 2007 SRV1> voms-proxy-info -all -file kreymer-condor.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : kreymer-condor.proxy timeleft : 499:56:36 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov attribute : /fermilab/minos/Role=NULL/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL timeleft : 23:56:35 [gfactory@minos25 ~]$ scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy .grid/ Date: Tue, 27 Nov 2007 12:36:10 -0600 From: Jason Harrington Done. Now needs writeable web page mkdir /afs/fnal.gov/files/expwww/numi/html/gfactory fs setacl -dir gfactory -acl sfiligoi rlidwka Date: Tue, 27 Nov 2007 17:24:11 -0600 From: Sfiligoi Igor To: Arthur Kreymer Subject: glideinWMS up and running Hi Art. The glideinWMS is up and running on the MINOS pool. All you need to do to get jobs there is add +RunOnGrid=True to your condor submit file. Well, maybe you want also add (Arch=?="X86_64") || (Arch=?="INTEL") to be able to run on 64-bit machines (most of the GPfarms). 
P.S.: gLExec is not in use right now, as I wanted something simple. Once we put that one in, too, a few more lines will be needed. Cheers, Igor ####### # DAQ # ####### minos-gateway-nd - found disabled account ( !! in /etc/shadow ) hartnell:!!:13059:0:99999:7::: cmetelko:!!:13494:0:99999:7::: koskinen:!!:13552:0:99999:7::: Tested with temporary .k5login in hartnell. ####### # AFS # ####### Looks clean so far today, further timeout last night. for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 27" | grep -v Tokens'; done ######### # MYSQL # ######### minos-mysql1 has load average around 30, since 16:30 yesterday. These are DCS_MAG_FARVLD queries, from minos* nodes. ============================================================================= 2007 11 26 ########### # MINOS12 # ########### Cleaned out /local/scratch12/kreymer files : -rw-r--r-- 1 kreymer 1525 1887436800 Aug 22 2005 offline.aa -rw-r--r-- 1 kreymer 1525 1887436800 Aug 22 2005 offline.ab ... -rw-r--r-- 1 kreymer 1525 1887436800 Aug 22 2005 offline.at Requested removal of root-owned /l/s12/database files MINOS12 > cd /local/scratch12/database/offline MINOS12 > ls -l total 63568964 -rw-rw---- 1 root root 47137798818 Aug 9 2005 PULSERDRIFT.MYD -rw-rw---- 1 root root 17631876096 Aug 10 2005 PULSERDRIFT.MYI -rw-rw---- 1 root root 130156696 Aug 9 2005 PULSERDRIFTPIN.MYD -rw-rw---- 1 root root 49219584 Aug 9 2005 PULSERDRIFTPIN.MYI -rw-rw---- 1 root root 9046 Jul 8 2005 PULSERDRIFTPIN.frm -rw-rw---- 1 root root 69760831 Aug 9 2005 PULSERDRIFTPINVLD.MYD -rw-rw---- 1 root root 12163072 Aug 9 2005 PULSERDRIFTPINVLD.MYI -rw-rw---- 1 root root 8828 Oct 14 2004 PULSERDRIFTPINVLD.frm HelpDesk ticket 107510 Short Description: Please remove files from minos12:/local/scratch12/database/... Problem Description: run2-sys : Please remove the directory and all the files under minos12: /local/scratch12/database These are old database backups from 2005, and are no longer needed. ( I do not know offhand how they got to be owned by root. ) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. Date: Mon, 26 Nov 2007 13:25:48 -0600 (CST) Note To Requester: boyd@fnal.gov sent this Notes To Requester: Art, I changed them to be owned by you. You can delete them if you'd like. joe ..................... Date: Tue, 27 Nov 2007 08:37:49 -0600 (CST) From: Arthur Kreymer Thanks ! I have removed the files. ........................ Copied the files to /minos/data/analysis/database, sum * removed the originals Actually, did a final checksum, from /minos/data on minos02 ( sustained about 20 MBytes/sec ) minos02: 25650 46033007 PULSERDRIFT.MYD 04031 17218629 PULSERDRIFT.MYI 10027 127107 PULSERDRIFTPIN.MYD 38932 48066 PULSERDRIFTPIN.MYI 27312 9 PULSERDRIFTPIN.frm 18962 68126 PULSERDRIFTPINVLD.MYD 06033 11878 PULSERDRIFTPINVLD.MYI 08483 9 PULSERDRIFTPINVLD.frm MINOS02 > date Tue Nov 27 09:26:17 CST 2007 minos12: 25650 46033007 PULSERDRIFT.MYD 04031 17218629 PULSERDRIFT.MYI 10027 127107 PULSERDRIFTPIN.MYD 38932 48066 PULSERDRIFTPIN.MYI 27312 9 PULSERDRIFTPIN.frm 18962 68126 PULSERDRIFTPINVLD.MYD 06033 11878 PULSERDRIFTPINVLD.MYI 08483 9 PULSERDRIFTPINVLD.frm MINOS12 > date Tue Nov 27 10:19:05 CST 2007 10:48 : rm -r database ####### # AFS # ####### fsus02 seems to have been stable since the Saturday 24 Nov upgrades. 
But I have seen timeouts since then on the Minos cluster for fsus05 131.225.68.17 fsus07 131.225.68.6 fsus08 131.225.68.19 for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 25" | grep -v Tokens'; done clean for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 26" | grep -v Tokens'; done minos02 Nov 26 07:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 26 07:17:14 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov minos04 Nov 26 03:56:54 minos04 kernel: afs: Lost contact with file server 131.225.68.6 Nov 26 03:59:05 minos04 kernel: afs: file server 131.225.68.6 in cell fnal.gov i Nov 26 08:17:23 minos04 kernel: afs: Lost contact with file server 131.225.68.17 Nov 26 08:23:44 minos04 kernel: afs: file server 131.225.68.17 in cell fnal.gov minos09 Nov 26 08:17:50 minos09 kernel: afs: Lost contact with file server 131.225.68.17 Nov 26 08:19:00 minos09 kernel: afs: file server 131.225.68.17 in cell fnal.gov minos14 Nov 26 08:18:02 minos14 kernel: afs: Lost contact with file server 131.225.68.17 Nov 26 08:19:35 minos14 kernel: afs: file server 131.225.68.17 in cell fnal.gov minos17 Nov 26 08:18:14 minos17 kernel: afs: Lost contact with file server 131.225.68.17 Nov 26 08:19:23 minos17 kernel: afs: file server 131.225.68.17 in cell fnal.gov MIN > host 131.225.68.6 6.68.225.131.in-addr.arpa domain name pointer fsus07.fnal.gov. MIN > host 131.225.68.17 17.68.225.131.in-addr.arpa domain name pointer fsus05.fnal.gov. MIN > host 131.225.68.19 19.68.225.131.in-addr.arpa domain name pointer fsus08.fnal.gov. Sent this information as update to Helpdesk ticket 107032 ########### # ROUNDUP # ########### Corrected PURGE code to use ${CCDEST}/${FILE} size, not GDW/FILE Added one more test file : AFSS/roundup.20071115 -r cedar_phy_bhcurv -s N00009300 near less +F /home/minfarm/ROUNTMP/LOG/saddreco/cedar_phy_bhcurv/near.log Tue Nov 27 01:09:41 CST 2007 OK - stream spill.mrnt.cedar_phy_bhcurv OK - 166050 Mbytes in 327 runs ... OK - stream spill.sntp.cedar_phy_bhcurv OK - 177300 Mbytes in 162 runs ... Tue Nov 27 15:02:58 CST 2007 WRITING to DCache 453 ... looks OK through about 16:53 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00009873_0000.spill.mrnt.cedar_phy_bhcurv.1.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2006-02 PURGE FARM N00009873_0000.spill.mrnt.cedar_phy_bhcurv.1.root SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00012001_0000.spill.mrnt.cedar_phy_bhcurv.0.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2007-03 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00012001_0000.spill.sntp.cedar_phy_bhcurv.0.root /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00012004_0000.spill.mrnt.cedar_phy_bhcurv.0.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2007-04 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00012004_0000.spill.sntp.cedar_phy_bhcurv.0.root /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-04 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00012007_0000.spill.mrnt.cedar_phy_bhcurv.0.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2007-04 Odd, lots of files not being purged. 
Some of these duplicate files written back on 3 Nov, according to LOG/2007-11/cedar_phy_bhcurvnear.log ########### # ROUNDUP # ########### The above looks good, let's get cedar back into keepup : MINOS26 > mv roundup.20071115 roundup.20071125 SRV1> cp -a AFSS/roundup.20071125 . SRV1> ln -sf roundup.20071125 roundup # was roundup.20070809 ./roundup -n -r cedar far OK - 3882 Mbytes in 9 runs ./roundup -n -r cedar near OK - 4239 Mbytes in 13 runs ./roundup -r cedar far Mon Nov 26 11:23:07 CST 2007 Mon Nov 26 11:56:25 CST 2007 ./roundup -r cedar near Mon Nov 26 11:59:50 CST 2007 ####### # AFS # ####### Date: Mon, 26 Nov 2007 10:30:02 -0600 (CST) HelpDesk ticket 107484 Short Description: Can't see AFS backup area Problem Description: ( I still can not see /afs/fnal.gov/files/backup/home/room1/kreymer. But that is a different problem, which should be tracked separately. This ticket is assigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST. Date: Mon, 26 Nov 2007 12:03:41 -0600 (CST) Solution: joes@fnal.gov sent this solution: Remounted afs backup directory ============================================================================= 2007 11 25 Sunday ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 25" | grep -v Tokens'; done minos01 minos02 ... minos25 grep: /var/log/messages: Permission denied minos26 Nov 25 19:14:42 minos26 kernel: afs: Waiting for busy volume 1685441905 (expwww.numi.fnalminos) in cell fnal.gov ########### # ROUNDUP # ########### Trying to get data/mc keepup restarted, with mc saddreco, before next week, so can concentrate on condor analysis glideins and related issues. Probed cedar_phy_bhcurv friday, > /tmp/cpbn.log Here are some to chew on. OK adding N00009280_0014.spill.mrnt.cedar_phy_bhcurv.1.root 6 OK adding N00009283_0000.spill.mrnt.cedar_phy_bhcurv.1.root 1 OK adding N00009300_0000.spill.mrnt.cedar_phy_bhcurv.1.root 24 OK adding N00009303_0000.spill.mrnt.cedar_phy_bhcurv.1.root 22 OK adding N00009306_0000.spill.mrnt.cedar_phy_bhcurv.1.root 1 OK adding N00009309_0000.spill.mrnt.cedar_phy_bhcurv.1.root 1 OK adding N00009322_0000.spill.mrnt.cedar_phy_bhcurv.1.root 24 OK adding N00009325_0000.spill.mrnt.cedar_phy_bhcurv.1.root 24 OK adding N00009328_0000.spill.mrnt.cedar_phy_bhcurv.1.root 1 Corrected to check for dup catted file directly in ${GDW}/${SFINI} rather than using the symlink on ROUNDUP/WRITE AFSS/roundup.20071115 -n -r cedar_phy_bhcurv -s N00009283 near adjusted move to CCDEST, try one for real : AFSS/roundup.20071115 -r cedar_phy_bhcurv -s N00009283 near ................ SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00009283_0000.spill.mrnt.cedar_phy_bhcurv.1.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11 PURGE FARM N00009283_0000.spill.mrnt.cedar_phy_bhcurv.1.root Sun Nov 25 19:32:03 CST 2007 SADD less +F /home/minfarm/ROUNTMP/LOG/2005-11/declare_near_cedar_phy_bhcurv.log Sun Nov 25 19:32:04 CST 2007 ........... Oops, needed to make /minos/data/reco* directories group writeable, to that minfarm can write. Oops, the saddreco MC/Data clauses were reversed, and data logs were still going to LOG/SAMMON. 
as mindata, cd /minos/data find reco* -type d -exec ls -ld {} \; find reco* -type d -exec chmod 775 {} \; Trying again, with a single file AFSS/roundup.20071115 -n -r cedar_phy_bhcurv -s N00009306 near AFSS/roundup.20071115 -r cedar_phy_bhcurv -s N00009306 near less LOG/2007-11/cedar_phy_bhcurvnear.log per this log, less +F /home/minfarm/ROUNTMP/LOG/saddreco/cedar_phy_bhcurv/near.log declared 1 file ! ls -l /minos/data/reco_near/cedar_phy_bhcurv/mrnt_data/2005-12 Try one more file, this time 6 concatenated : AFSS/roundup.20071115 -r cedar_phy_bhcurv -s N00009280 near OK adding N00009280_0014.spill.mrnt.cedar_phy_bhcurv.1.root 6 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00009280_0014.spill.mrnt.cedar_phy_bhcurv.1.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11 less +F /home/minfarm/ROUNTMP/LOG/saddreco/cedar_phy_bhcurv/near.log Correct earlier misplacement of our first test case : cd ROUNTMP/WRITE FILE=N00009283_0000.spill.mrnt.cedar_phy_bhcurv.1.root mv ${FILE} /minos/data/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11/${FILE} ln -s /minos/data/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11/${FILE} ${FILE} ########### # BLUEARC # ########### Subject: HelpDesk ticket 107457 Short Description: Quota request for rustem on BlueArc - /minos/scratch Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user rustem, on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. Date: Mon, 26 Nov 2007 08:32:14 -0600 (CST) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. Date: Mon, 26 Nov 2007 08:42:10 -0600 (CST) Quota for user rustem has been increased to 500GB. This ticket was resolved by PASETES, RAY of the CD-LSCS/CSI/CS/EST group. ============================================================================= 2007 11 24 Saturday ####### # AFS # ####### Maintenance started right at 06:00 My home area looks OK. Updated LOG with content of LOG1120 Restarted cron jobs kreymer@minos26 mindata@minos26 ####### # AFS # ####### Date: Sat, 24 Nov 2007 10:50:57 -0600 From: Ray Pasetes To: pc-manager@fnal.gov, unix-managers@fnal.gov, linux-users@fnal.gov, macusers@fnal.gov, ppdhelpdesk@fnal.gov, James C Hammer , John J. Konc , Michael J. Kuc , Michael J. Woods , Thomas W. Ackenhusen , snolan@fnal.gov, bd-net-patch@fnal.gov, Jud Parker , csi-mgmt@fnal.gov, CSG , Desktop & Server Support - Enterprise , HelpDesk , Arthur Kreymer , Liz Buckley-Geer , Steven Timm Subject: Re: Status: AFS Outage 11/24 -- Need additional hour [ The following text is in the "ISO-8859-15" character set. ] [ Your display is set for the "ISO-8859-1" character set. ] [ Some special characters may be displayed incorrectly. ] The AFS servers have been upgraded. Please check your systems to make sure they are communicating with the servers. In some cases, AFS clients may need to reboot to properly flush their cache. 
-Ray -- ============================================== Ray Pasetes Email: rayp@fnal.gov CD/LSC/CSI/CS Phone: 630-840-5250 Fermilab, Batavia, IL Fax : 630-840-6345 ============================================== ============================================================================= 2007 11 21 ########### # ROUNDUP # ########### Still need to restart roundup Priorities : Cleanly handle the new /minos/data structure Declare MC to SAM CC sntp to /minos/data SRV1> AFSS/roundup.20071115 -n -r cedar_phy_bhcurv near 2>&1 | tee /tmp/cpbn.los SRV1> AFSS/roundup.20071115 -n -r cedar_phy_bhcurv -s N00009283 near #################### # AFS LOG RECOVERY # #################### < recovered > This is a copy of LOG.recovered, restored this afternoon. The usual LOG file went to 0 length, due to an AFS glitch today. Restored per Helpdesk ticket 107415, apparently to -rw-r--r-- 1 7695 bin 1844447 Nov 21 15:08 LOG.restored ( That's Ray Pasetes ) Trying to reconstruct notes from earlier today : ############ # MCIMPORT # ############ Corrected paths and links to mcin, for those still having /far/mcin Accounts having empty far/mcin for DIR in boehm hgallag kordosky ; do echo ${DIR} cd ${DIR} ls -R far rmdir far/mcin/dcache rmdir far/mcin ln -s ../mcin far/mcin cd .. done For accounts with empty mcin, files in far/mcin, mkdir -p howcroft/mcin/dcache for DIR in howcroft kreymer mualem ; do printf "\n\n${DIR}\n" cd ${DIR} du -sk far ls -lR mcin cd .. rmdir mcin/dcache rmdir mcin mv far/mcin mcin ln -s ../mcin far/mcin du -sk mcin done ############ # MCIMPORT # ############ Started cron keepup : $ cat crontab.dat MAILTO=minos-data@fnal.gov 37 0-22/4 * * * ${HOME}/mcimport -c ALL crontab crontab.dat And did top off run, 50 files before 12:37 cron pass ./mcimport -b 50 OVERLAY ############ # MCIMPORT # ############ Did catchup on the mcimport run without sam enabled, DET=near MCREL=daikon_00 CONF=L010185N_nue DET=near MCREL=daikon_04 CONF=L010185N DET=far MCREL=daikon_04 CONF=L010185N SADDIR=${DET}/${MCREL}/${CONF}/* echo $SADDIR ~/saddmc --verify -n 1 ${MCREL} ${SADDIR} ~/saddmc --declare ${MCREL} ${SADDIR} >> /minos/scratch/mindata/log/saddmc/prd-${DET}-${MCREL}-${CONF}.log 2>&1 ######## # FARM # ######## Copy of recent ntuples for nearline analysis cd /minos/data/minfarm/nearcat MDDIR=/minos/data/reco_near/cedar/sntp_data/2007startup mkdir /minos/data/reco_near/cedar/sntp_data/2007startup for FILE in N*.spill.*.cedar.* ; do echo ${FILE} ; cp ${FILE} ${MDDIR}/${FILE} ; done ============================================================================= 2007 11 20 ####### # AFS # ####### Subject : Re: HelpDesk ticket 107323 ----- Message Text ----- <-- # @@@ Enter Update below this line. @@@ # --> As per discussions today at all levels, please close out this ticket, and withdraw the request for the global firewalling of AFS. Minos will take a couple of actions until the Saturday upgrades : 1) We will try to decouple our Control Room beam data logging from AFS. 2) We will keep the Shift personnel ( x3368 ) informed as to how to report AFS outages via the call center, if such reports are needed. We will follow up with Ray to clarify details like how long to wait before calling the call center, and how to know whether a problem is being worked on. <-- # @@@ Enter Update above this line. 
@@@ # --> ######### # ADMIN # ######### per brebel email MINOS25 > kcroninit ************************************************************************* * * * This system is not properly configured to initialize * * authenticating cron jobs in a secure fashion. * * * * Please contact your sysadmin regarding the ownership and/or * * permissions on the /var/adm/krb5 directory. * * * ************************************************************************* MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'ls -ld /var/adm/krb5 | grep -v "^drwx--s--x"'; done minos25 drwxr-xr-x 2 root root 4096 Oct 19 11:12 /var/adm/krb5 HelpDesk ticket 107348 Short Description: Cannot kcroninit on minos25 due permissions for /var/adm/krb5 Problem Description: run2-sys : We cannot use 'kcroninit' on minos25, apparently due to incorrect permissions on /var/adm/krb5 This directory is drwx--s--x on systems where kcroninit works, and on minos25 is drwxr-xr-x Please investigate and correct this. ############ # MCIMPORT # ############ Testing saddmc.20071114 with dcache location support. Ease to test, need any fresh Previously failed on file /pnfs/minos/mcin_data/near/daikon_04/L010185N/700 n13037002_0009_L010185N_D04.reroot.root MIN > mv saddmc.20071114 saddmc.20071120 $ cp -a AFSS/saddmc.20071120 . $ ln -sf AFSS/saddmc.20071120 saddmc AFSS/mcimport.20071118 -f 10 -m OVERLAY Again, copies to tape started almost immediately, not with a 4 hour delay 14:45 cp -a AFSS/mcimport.20071120 mcimport.20071120 ln -sf mcimport.20071120 mcimport # was mcimport.20071109 In STAGE/arms, rmdir mcin/dcache rmdir mcin ln -s far/mcin mcin ( should go the other way, but imports are active ) $ ./mcimport -b 3 -f 10 -m arms N.B. Perhaps we should try kx509 dccp , X509_CERT_DIR with cert. But how does it know the cert name ? Defer this for now, let's get rolling . Rate has been about 30 GBytes/hour. STAGE/arms/far/mcin has 300 GB. So run a manual mcimport, then start the cron tomorrow morning. 15:56 ./mcimport ALL rm: remove write-protected regular file `n13037002_0011_L010185N_D04.reroot.root'? n rm: remove write-protected regular file `n13037002_0012_L010185N_D04.reroot.root'? ??? what is this ??? why is it going to the terminal ? The 644 protected files are owned by rhatcher. Look at old purged files, n11011020_0001_L010185N_D00.reroot.root They are gone. Needed to hack mcimport.20071120 to do rm -f ${FILE} Restarted around 16:18 ./mcimport -c ALL ####### # AFS # ####### Scanned recent logs, see failures for fsus02 131.225.68.7 fsus03 131.225.68.4 fsus08 131.225.68.19 Sent reply to rayp : > I've identified the following minos volumes on fsus02. I'm going to > move them to fsus07 for now and see if we can isolate minos from the > issues affecting fsus02. These are the areas. Please let me know if > there are more. Thanks for shifting these, this may help with the web page stability. fsus07 is not immune to these problems. We saw failures of fsus07 on Nov 13. We saw failures yesterday for fsus02, fsus03 and fsus08 I do not think that we can avoid this problem by switching servers. 
Nevertheless, to answer your direct question : The AFS volumes read by the Control Room are the release and product areas, /afs/fnal.gov/files/code/e875/general/minossoft/ /afs/fnal.gov/files/code/e875/general/products/ /afs/fnal.gov/files/code/e875/general/minossoft/packages /afs/fnal.gov/files/code/e875/releases /afs/fnal.gov/files/code/e875/releases1 /afs/fnal.gov/files/code/e875/releases2 There are a lot of symbolic links, so it is hard to know whether these suffice. There remains the problem that there are many users heavily engaged in the analysis of post-shutdown data, which is critical to establishing the running condition for the experiment. They remain sensitive to fsus02. Rayp sent list of servers /afs/fnal.gov/files/code/e875/general/minossoft/ fsus05.fnal.gov /vicepb RW /afs/fnal.gov/files/code/e875/general/products/ fsus07.fnal.gov /vicepb RW /afs/fnal.gov/files/code/e875/general/minossoft/packages fsus05.fnal.gov /vicepb RW /afs/fnal.gov/files/code/e875/releases fsus-minos01.fnal.gov /viceph RW /afs/fnal.gov/files/code/e875/releases1 fsus08.fnal.gov /vicepd RW /afs/fnal.gov/files/code/e875/releases2 fsus09.fnal.gov /vicepb RW ============================================================================= 2007 11 19 ####### # AFS # ####### fsus02 timing out again : Nov 19 17:25:54 minos26 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 19 17:59:18 minos26 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) HelpDesk ticket 107323 Date: Mon, 19 Nov 2007 18:18:46 -0600 (CST) From: Arthur Kreymer To: helpdesk@fnal.gov, schmidt@fnal.gov, rayp@fnal.gov, inkmann@fnal.gov Cc: habig@fnal.gov, plunk@fnal.gov, wojcicki@fnal.gov, buckley@fnal.gov, rhatcher@fnal.gov, urish@fnal.gov Subject: Request AFS restriction to FNAL hosts At about 17:23 this afternoon, AFS server fsus02 again became unavailable, taking with it many of the lab's Web servers, the Minos Control Room Logbok, etc. This is seriously hurting detector operations for Minos, and is massively disruptive to our ability to analyze data. It is my understanding that the direct cause of this instability is the interaction of NAT clients with our AFS servers, something that will be corrected Saturday morning. But I do not think we can afford to run under the present conditions through Thanksgiving. Until the Saturday upgrades are completed, I request that we limit AFS clients to the fnal.gov and minos-soudan.org subnets, via firewalls or other technical means. This is a drastic action, but should eliminate NAT clients, and give us stable operation through the holiday. ####### # LSF # ####### for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} '. /usr/local/etc/setups.sh ; setup lsf ; bjobs' done minos13 Failed in an LSF library call: Slave LIM configuration is not ready yet ######## # FARM # ######## Files were concatenated this morning, OK adding F00039965_0000.all.sntp.cedar.0.root 11 NSFIL SSIZ MSIZ DSIZ 11 262594663 261592421 100224 -rw-r--r-- 1 minfarm numi 261592421 Nov 19 00:05 OK adding F00039968_0000.all.sntp.cedar.0.root 2 NSFIL SSIZ MSIZ DSIZ 2 45970263 45758780 211483 -rw-r--r-- 1 minfarm numi 45758780 Nov 19 00:06 PEND - have 17/23 subruns for F00039971_*.all.sntp.cedar.0.root 0 11/18 23:40 0 17 PEND - have 17/24 subruns for F00039971_* I see no cedar files for near detector. Informed minos_batch . 
Howie is not seeing recent beam database info : minfarm on fnpcsrv1% scripts/beam_mon minos-db1 Inquiring of minos-db1 on port 3306 as reader_old:minos_db beam_mon returns null -- no updates recently ####### # CAF # ####### Normally 35 running 08:00 minos25 :condor_off -peaceful -all -startd condor_q | grep running ? 29 running 09:00 18 running, 12:00 8 running 14:30 216 jobs; 214 idle, 0 running, 2 held 14:36 216 jobs; 175 idle, 39 running, 2 held I see a jump in load average around 14:33 Hurray ! ########### # ROUNDUP # ########### ############ # MCIMPORT # ############ AFSS/mcimport.20071118 -b 1 -m OVERLAY Looks OK, see /pnfs/minos/mcin_data/near/daikon_00/L010185N/102/n11011020_0000_L010185N_D00.reroot.root AFSS/mcimport.20071118 -m OVERLAY Mon Nov 19 08:04:41 CST 2007 Mon Nov 19 11:27:00 CST 2007 DET=near MCREL=daikon_00 CONF=L010185N for RUN in 102 103 104 ; do SADDIR=${DET}/${MCREL}/${CONF}/${RUN} #~/saddmc --verify -n 1 ${MCREL} ${SADDIR} ~/saddmc --declare ${MCREL} ${SADDIR} \ >> /minos/scratch/mindata/log/saddmc/prd-${DET}-${MCREL}-${CONF}.log 2>&1 done looks good ############ # MCIMPORT # ############ Let's check out the sam declares once again : 18:52 AFSS/mcimport.20071118 -b 1 -f 10 -m OVERLAY Dropped \n from MCINDS, to get clean path, Now find we need to handle files which are not yet on tape Update to saddmc is needed, similar to saddreco. saddmc 20070924 processing mcin_data STARTED Tue Nov 20 01:13:22 2007 Declaring to SAM v8_2_0 prd daikon_04 declare 999999 Scanning /pnfs/minos/mcin_data/near/daikon_04/L010185N ['700'] Needed /pnfs/minos/mcin_data/near/daikon_04/L010185N/700 Treating 37 files in /pnfs/minos/mcin_data/near/daikon_04/L010185N/700 OOPS - short Enstore data at Tue Nov 20 01:13:28 2007 ENLIN [] ENFILE n13037002_0009_L010185N_D04.reroot.root WARNING WARNING WARNIGN - these files are going to tape much too fast, there should be a 4 hour delay, writes are immediate. ####### # AFS # ####### Sent this to minos_all, minos_software_discussion, CRL Date: Mon, 19 Nov 2007 08:12:33 -0600 From: Ray Pasetes To: CSG , Desktop & Server Support - Enterprise , HelpDesk , Kristen J. Webb , Liz Buckley-Geer , Arthur E Kreymer , Steven Timm Subject: AFS outage Saturday, 11/24 6A-10A [ The following text is in the "ISO-8859-15" character set. ] [ Your display is set for the "ISO-8859-1" character set. ] [ Some special characters may be displayed incorrectly. ] On Saturday, 11/24, from 6A-10A, the AFS service will be out for an emergency upgrade. It has been determined that the current release of code, OpenAFS v1.4.4 can have issues with clients that are behind a NAT. These issues can indirectly cause a resource problem on the fileservers which could have resulted in the outages last week and the "Connection timed out" issues we have been seeing as of late. Please let any other interested parties know about this outage. -- ============================================== Ray Pasetes Email: rayp@fnal.gov CD/LSC/CSI/CS Phone: 630-840-5250 Fermilab, Batavia, IL Fax : 630-840-6345 ============================================== bv will contact rhatcher, see whether we can decouple hartnell may assist ============================================================================= 2007 11 18 Sun ############ # MCIMPORT # ############ mcimport.20071118 - cleaned up and tested, $ AFSS/mcimport.20071118 -b 1 -m OVERLAY Oops, /pnfs/minos/mcin_data/near/daikon_00/L010185N/102 et.al. are owned by rhatcher, but not group writeable. 
as rhatcher cd /pnfs/minos/mcin_data/near find daikon_* -type d -user rhatcher -exec ls -ld {} \; 288 directories find daikon_* -type d -user rhatcher -exec chmod 775 {} \; does not work on systems where I can be rhatcher Sent mail to rhatcher. Did the same for kreymer files : d /pnfs/minos/mcin_data/near find daikon_* -type d -user kreymer -exec ls -ld {} \; 173 directories find daikon_* -type d -user kreymer -exec chmod 775 {} \; ############ # MCIMPORT # ############ Manually imported some kordosky files which had piled up. AFSS/mcimport.20071118 -b 3 -m kordosky AFSS/mcimport.20071118 -m kordosky Manually imported some arms files which had piled up. AFSS/mcimport.20071118 -m arms ############ # SADDRECO # ############ saddreco mc far completed early this morning ============================================================================= 2007 11 17 Sat ############ # SADDRECO # ############ saddreco.20071117 - added RECODIRS.sort() to get a more readable log as we do the full mcout declares 12:15 SRV1> cp -a AFSS/saddreco.20071117 . ; ln -sf saddreco.20071117 saddreco # was saddreco.20070913 ############ # SADDRECO # ############ NDDS=`ls -d /pnfs/minos/mcout_data/*/near/daikon*` FDDS=`ls -d /pnfs/minos/mcout_data/*/far/daikon*` printf "${NDDS}\n" MINOS26 > printf "${NDDS}\n" /pnfs/minos/mcout_data/R1_24cal/near/daikon_00 /pnfs/minos/mcout_data/R1_24calB/near/daikon_00 /pnfs/minos/mcout_data/cedar/near/daikon_00 /pnfs/minos/mcout_data/cedar/near/daikon_01 /pnfs/minos/mcout_data/cedar_bx113/near/daikon_00 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00 /pnfs/minos/mcout_data/cedar_phy/near/daikon_03 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04 /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00 /pnfs/minos/mcout_data/cedar_phy_oldbhcurv/near/daikon_03 /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00 /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00 MINOS26 > printf "${FDDS}\n" /pnfs/minos/mcout_data/R1_24spill/far/daikon_02 /pnfs/minos/mcout_data/cedar/far/daikon_00 /pnfs/minos/mcout_data/cedar/far/daikon_01 /pnfs/minos/mcout_data/cedar/far/daikon_02 /pnfs/minos/mcout_data/cedar_phy/far/daikon_00 /pnfs/minos/mcout_data/cedar_phy/far/daikon_02 /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_03 /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_04 /pnfs/minos/mcout_data/cedar_phy_safitter/far/daikon_02 /pnfs/minos/mcout_data/cedar_phy_srsafitter/far/daikon_02 PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9010 export SAM_ORACLE_CONNECT='samdbs/...' 
for DIR in ${FDDS} ; do CONFS=`ls ${DIR}` echo ${DIR} ${CONFS} done /pnfs/minos/mcout_data/cedar/far/daikon_00 L010185N L100200N L250200N /pnfs/minos/mcout_data/cedar/far/daikon_01 L010185N /pnfs/minos/mcout_data/cedar/far/daikon_02 CosmicLE CosmicMu /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_03 CosmicLE CosmicMu /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_04 L010185 L010185N L250200 L250200N /pnfs/minos/mcout_data/cedar_phy/far/daikon_00 L010185N L100200N L250200N /pnfs/minos/mcout_data/cedar_phy/far/daikon_02 CosmicLE CosmicMu /pnfs/minos/mcout_data/cedar_phy_safitter/far/daikon_02 CosmicLE CosmicMu /pnfs/minos/mcout_data/cedar_phy_srsafitter/far/daikon_02 CosmicLE CosmicMu /pnfs/minos/mcout_data/R1_24spill/far/daikon_02 CosmicMu for DIR in ${NDDS} ; do CONFS=`ls ${DIR}` echo ${DIR} ${CONFS} done /pnfs/minos/mcout_data/cedar_bx113/near/daikon_00 L010185N_bfldx113 /pnfs/minos/mcout_data/cedar/near/daikon_00 L010000N L010170N L010185N L010185N_bfldx113 L010185N_charm L010185N_lowi L010185N_medi L010185N_nccoh L010200N L100200N L150200N L250200N L250200N_nccoh /pnfs/minos/mcout_data/cedar/near/daikon_01 L010185N /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03 CosmicLE CosmicMu L010185N /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04 CosmicLE L010000N L010170N L010185N L010200N L100200N L150200N L250200N /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00 L010185N /pnfs/minos/mcout_data/cedar_phy/near/daikon_00 L010000N L010170N L010185N L010200N L100200N L150200N L250200N /pnfs/minos/mcout_data/cedar_phy/near/daikon_03 CosmicMu L010185N /pnfs/minos/mcout_data/cedar_phy_oldbhcurv/near/daikon_03 L010185N /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00 L010185N L010185N_bfldx113 /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00 L010185N L010185N_bfldx113 /pnfs/minos/mcout_data/R1_24calB/near/daikon_00 L010185N /pnfs/minos/mcout_data/R1_24cal/near/daikon_00 L010185N L010185N_24cal Test one, ./saddreco -m daikon_00 -d far -r cedar_phy -p L250200N -b 1 --verify AFSS/saddreco.20071117 -m daikon_02 -d far -r cedar -p CosmicLE -b 1 --verify for DIR in ${FDDS} ; do CONFS=`ls ${DIR}` echo ${DIR} ${CONFS} for CONF in ${CONFS} ; do echo ${DIR}/${CONF} REL=`echo ${DIR} | cut -f 5 -d '/'` DET=`echo ${DIR} | cut -f 6 -d '/'` MCREL=`echo ${DIR} | cut -f 7 -d '/'` ls ${DIR}/${CONF} LOGDIR=${HOME}/ROUNTMP/LOG/saddreco/${MCREL} mkdir -p ${LOGDIR} # date ; ./saddreco -m ${MCREL} -d ${DET} -r ${REL} -p ${CONF} -b 1 --verify # date ; ./saddreco -m ${MCREL} -d ${DET} -r ${REL} -p ${CONF} -b 1 \ date ; ./saddreco -m ${MCREL} -d ${DET} -r ${REL} -p ${CONF} \ --declare 2>&1 | tee -a ${LOGDIR}/${REL}_${CONF}_${DET}.log done done Tested this first with one moderate configuration, DIR=/pnfs/minos/mcout_data/cedar/far/daikon_00 CONF=L100200N Ran twice, with corrected LOGDIR Ran full fardet, 1 event Needed r1.24spill for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=r1.24spill done New applicationFamilyId = 228 New applicationFamilyId = 82 New applicationFamilyId = 282 DIR=/pnfs/minos/mcout_data/R1_24spill/far/daikon_02 Tested this, OK now. 
Added missing release for near, r1.24calb for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=r1.24calb done 12:55 Ran bail-1 test for NDDS, Sat Nov 17 13:41:45 CST 2007 grep -v declared ${HOME}/ROUNTMP/LOG/saddreco/daikon*/*.log | less Added _${DET} to the log file name Found OOPS problem with cedar.phy.oldbhcurv for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.oldbhcurv done New applicationFamilyId = 230 New applicationFamilyId = 84 New applicationFamilyId = 302 Launched the full fardet processing Sat Nov 17 13:51:38 CST 2007 Sat Nov 17 16:10:16 CST 2007 No OOPS in grep -v declared ${HOME}/ROUNTMP/LOG/saddreco/daikon*/*far.log | less Launched the full neardet processing for DIR in ${NDDS} ; do ... same as for far, see above ... Sat Nov 17 16:12:55 CST 2007 Sun Nov 18 02:46:41 CST 2007 No OOPS in grep -v declared ${HOME}/ROUNTMP/LOG/saddreco/daikon*/*near.log | less ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs /var/log/messages | grep -v Tokens | grep "Nov 16"' ; done minos02 Nov 16 11:15:15 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 16 11:20:17 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Nov 16 18:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 16 18:17:07 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos06 Nov 16 15:04:12 minos06 kernel: afs: Lost contact with volume location server 131.225.68.4 in cell fnal.gov Nov 16 15:06:25 minos06 kernel: afs: volume location server 131.225.68.4 in cell fnal.gov is back up minos16 Nov 16 20:19:49 minos16 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 16 20:22:11 minos16 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs /var/log/messages | grep -v Tokens | grep "Nov 17"' ; done minos02 Nov 17 07:15:12 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 17 07:16:35 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Updated HelpDesk ticket 107032 Lost access to fsus02, as follows <-- # @@@ Enter Update below this line. @@@ # --> We lost access to fsus02 this evening. This removed access to several things, most critically the Minos Control Room Log Book, at http://www-minoscrl2.fnal.gov/minos/Index.jsp and the helpdesk web page. 
Some of the /var/log/messages messages were like : minos01 Nov 17 20:37:05 minos01 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi$ Nov 17 21:08:39 minos01 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed addr$ minos02 Nov 17 20:36:46 minos02 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi$ Nov 17 21:05:56 minos02 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed addr$ <-- # @@@ Enter Update above this line. @@@ # --> On minos-beamdata, Nov 17 20:33:14 minos-beamdata kernel: afs: Lost contact with file server 192.168.67.1 in cell fnal.gov (multi-homed address; other same-host interfaces maybe up) That is a weird address, not 131.225 Fermilab ============================================================================= 2007 11 16 ####### # CAF # ####### Date: Fri, 16 Nov 2007 16:03:59 -0600 (CST) From: Arthur Kreymer To: minos-admin@fnal.gov Cc: sfiligoi@fnal.gov Subject: Proposed schedule for security enhancements on the Minos Condor Analysis Facility. Proposed schedule for security enhancements on the Minos Condor Analysis Facility. ( I think this is what we have already been talking about, but it is good to have a summary . ) Monday morning ( 8:00 or so, the earlier the better ) root starts draining the Minos Condor worker nodes ( commands to be provided by Igor ) From minos25, issue condor_off -peaceful -all -startd Monday afternoon ( 13:00 or so ) root stops condor ( commands to be provided by Igor ) From minos25, issue condor_off -fast -all -startd followed by condor_off -all -master root needs to have installed host certificates and /etc/grid-security/certificates root pushes out the new configuration files with cfengine root starts condor ( commands to be provided by Igor ) This should be the usual /etc/init.d/condor start on all the affected nodes. sfiligoi and kreymer test proper operation of Minos Condor, processing of user jobs will resume by Monday evening. Date: Fri, 16 Nov 2007 17:35:53 -0600 (CST) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Cc: minos-admin@fnal.gov Subject: Minos Condor Analysis Facility upgrade Monday We plan to upgrade the Minos Condor configuration files Monday 19 Nov, in preparation for providing a gateway to larger Grid resources. We plan to drain the worker nodes during the morning, and perform the configuration upgrade in the afternoon, restoring service by the evening. The queues should be retained. Any jobs process which we kill around noon will be rerun, probably transparently the the users ( aside from the delay ). Thanks for your patience ! ####### # CAF # ####### Igor has pointed out to me that we should also modify a line in minos25:/etc/condor/condor_config , as follows change QUEUE_SUPER_USERS = root, condor to QUEUE_SUPER_USERS = root, condor, buckley, kreymer, rhatcher, sfiligoi, timm This change can be pushed with the rest of the security changes. I'll send a separate email with a proposed schedule. ####### # CAF # ####### HelpDesk ticket 107197 Short Description: Condor stop/start sudo access on Minos Cluster Problem Description: run2-sys Please set up sudo access to invoke /etc/init.d/codor on the Minos Cluster, for the following users : buckley, kreymer, rhatcher, sfiligoi, timm This would let the experiment management stop and start Condor as needed. This will be especially useful during the present commissioning phase. 
Date: Fri, 16 Nov 2007 16:45:05 -0600 (CST) Solution: schmitz@fnal.gov sent this solution: added user list to sudoers giving condor start/stop access MINOS25 > sudo -l User kreymer may run the following commands on this host: (root) NOPASSWD: /etc/init.d/codor MINOS01 > ps axf | grep condor 7305 pts/3 S+ 0:00 | \_ grep condor 29713 ? Ss 8:33 /opt/condor/sbin/condor_master 29714 ? Ss 16:42 \_ condor_startd -f ####### # LSF # ####### Date: Fri, 16 Nov 2007 14:42:06 -0600 (CST) From: Margaret_Greaney To: minos-users@fnal.gov, minos-admin@fnal.gov Cc: dss-est@fnal.gov Subject: flxi06 hardware problems we will be taking down flxi06 momentarily to try to fix a bad system disk. Date: Fri, 16 Nov 2007 15:02:04 -0600 (CST) Due to a problem we need to reschedule the hardware repair of flxi06 for Monday, November 19. ############ # PREDATOR # ############ Is up to date. Glitch processing N071007_000002.mdcs.root Fri Nov 16 11:11:29 UTC 2007 Wait for next cycle. ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs /var/log/messages | grep -v Tokens | grep "Nov 15"' ; done minos02 Nov 15 12:15:20 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 15 12:16:31 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov Nov 15 18:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 15 18:15:28 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov Nov 15 22:15:14 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 15 22:17:54 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov minos03 Nov 15 18:26:20 minos03 kernel: afs: Lost contact with file server 131.225.68.49 Nov 15 18:28:38 minos03 kernel: afs: file server 131.225.68.49 in cell fnal.gov minos04 Nov 15 13:46:05 minos04 kernel: afs: Lost contact with file server 131.225.68.49 Nov 15 13:47:20 minos04 kernel: afs: file server 131.225.68.49 in cell fnal.gov minos12 Nov 15 13:21:18 minos12 kernel: afs: Lost contact with file server 131.225.68.49 Nov 15 13:21:18 minos12 kernel: afs: failed to store file (110) Nov 15 13:23:00 minos12 kernel: afs: file server 131.225.68.49 in cell fnal.gov for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs /var/log/messages | grep -v Tokens | grep "Nov 16"' ; done 09:38 - sent this information as followup to ticket 107032 ============================================================================= 2007 11 15 ############ # PREDATOR # ############ Ran by hand to catchup since yesterday's AFS problem. 
16:27 ./predator 2007-11 16:50 predator is still running, but the .pid file will save us neardet SAM data is generated cleanly, so we're OK crontab crontab.dat ######## # FARM # ######## Completing transition to /minos/data/minfarm Pick up strays 26 cand files showed up in farcat on Nov 8, pending in WRITE, like F00039916_0014.spill.cand.cedar.0.root mv /grid/data/minos/minfarm/WRITE/*cand* /grid/data/minos/minfarm/DUP/ FILES=`ls /grid/data/minos/minfarm/DUP` for FILE in ${FILES} ; do sam locate ${FILE} ; done all of these files are in SAM for DIR in BAD DUP N7760 SAFE WRITE ; do du -sm /grid/data/minos/minfarm/${DIR} cp -vax /grid/data/minos/minfarm/${DIR} /minos/data/minfarm/${DIR} du -sm /minos/data/minfarm/${DIR} diff -r /grid/data/minos/minfarm/${DIR} /minos/data/minfarm/${DIR} mv /grid/data/minos/minfarm/${DIR} /grid/data/minos/minfarm/OLD${DIR} ln -s /minos/data/minfarm/${DIR} /grid/data/minos/minfarm/${DIR} done 107 /grid/data/minos/minfarm/BAD 107 /minos/data/minfarm/BAD 2981 /grid/data/minos/minfarm/DUP 2981 /minos/data/minfarm/DUP 1888 /grid/data/minos/minfarm/N7760 1888 /minos/data/minfarm/N7760 1 /grid/data/minos/minfarm/SAFE 1 /minos/data/minfarm/SAFE 4684 /grid/data/minos/minfarm/WRITE 4683 /minos/data/minfarm/WRITE cd /export/stage/minfarm/ROUNDUP SRV1> ls -l | grep grid lrwxrwxrwx 1 minfarm numi 28 May 19 12:14 DUP -> /grid/data/minos/minfarm/DUP lrwxrwxrwx 1 minfarm numi 29 May 14 2007 GDS -> /grid/data/minos/minfarm/SAFE lrwxrwxrwx 1 minfarm numi 30 May 11 2007 WRITE -> /grid/data/minos/minfarm/WRITE for DIR in DUP WRITE ; do rm ${DIR} ; ln -sf /minos/data/minfarm/${DIR} ${DIR} ; done rm GDS ; ln -s /minos/data/minfarm/SAFE GDS ---------------- ls N00012*cedar_phy_bhcurv*0.root | wc -l These date Nov 2 throug Nov 5 FILES=`ls N00012*cedar_phy_bhcurv*0.root` SRV1> for FILE in ${FILES} ; do sam locate ${FILE} ; done SRV1> printf "${FILES}\n" | wc -l 330 FILES=`ls N00012*cedar_phy_bhcurv*0.root | grep -v 00012001` for FILE in ${FILES} ; do mv ${FILE} ../REMOVED/${FILE} ; done cd .. mv REMOVED ../READREM mv This file lists all 24 subruns ( 0-2 in 2007-03, 3-24 in 2007-04 ) READ/SAM/N00012001_0000.spill.mrnt.cedar_phy_bhcurv.0.root MINOS26 > sam undeclare file N00012001_0000.spill.mrnt.cedar_phy_bhcurv.0.root MINOS26 > sam undeclare file N00012001_0000.spill.sntp.cedar_phy_bhcurv.0.root MINOS26 > sam undeclare file N00012001_0000.spill.cand.cedar_phy_bhcurv.0.root MINOS26 > sam undeclare file N00012001_0001.spill.cand.cedar_phy_bhcurv.0.root MINOS26 > sam undeclare file N00012001_0002.spill.cand.cedar_phy_bhcurv.0.root SRV1> mv READ/SAM/N00012001_0000.spill.sntp.cedar_phy_bhcurv.0.root READREM/ SRV1> mv READ/SAM/N00012001_0000.spill.mrnt.cedar_phy_bhcurv.0.root READREM/ Rubin : "Since we were removing the old files, I decided to treat this as if they never existed at all, and thus this is pass 0. " ####### # AFS # ####### No further AFS messages in syslog, except discarded tokens ############ # MCIMPORT # ############ Declares mcin to sam Bails for boring users having no files in top, mcin, mcin/dcache $ cp -a AFSS/mcimport.20071109 . $ ln -sf mcimport.20071109 mcimport # was mcimport.20071022 08:31 crontab crontab.dat ############ # MCIMPORT # ############ For immedate processing of recent arms files, will also import them the old fashioned way . 
FILES=`find /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near -type f -name \*.gz` for FILE in ${FILES} ; do cp -a ${FILE} STAGE/arms/ ; done ./mcimport.20071022 -F arms Thu Nov 15 09:27:10 CST 2007 SRMCPed n14111411_0000_L010185N_D00_nue-n14111450_0000_L010185N_D00_nue.tar Need to rerun this afternoon, to clear dcache directory. N.B. why Sorting 18517 logs in /local/scratch26/mindata/arms/far/mcin/log ? 13:07 - holding off while more of these show up, disabling arms mcimport mv STAGE/arms/MCIMPORT STAGE/arms/NOIMPORT OOPS, these are all to be removed, and some more files generated. Will leave ARMS in NOIMPORT state through the weekend, until the 40 runs x 10 subruns are uploaded and tarred with old mcimport. $ cd STAGE/arms $ rm index/n14111411_0000_L010185N_D00_nue-n14111450_0000_L010185N_D00_nue.index $ rm -r /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near MINOS26 > rm /pnfs/minos/stage/arms/n14111411_0000_L010185N_D00_nue-n14111450_0000_L010185N_D00_nue.tar ============================================================================= 2007 11 14 ######## # FARM # ######## Redeclaring cedar_phy_bhcurv which failed due to lack of MONS=`ls */decl*cedar_phy_bhcurv.log | cut -f 1 -d '/' cat */decl*cedar_phy_bhcurv.log > /tmp/cpblog grep -v declared /tmp/cpblog | less Lots of failures Sep 11, for MON in 2005-12 2006-01 2006-02 ; do ./roundup -c -m "${MON}" -r cedar_phy_bhcurv near done Need cleanup from previous errors: 2005-12 OOPS, need location for N00009544_0009.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009303_0009.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009530_0019.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009331_0003.spill.cand.cedar_phy_bhcurv.0.root 2006-01 OOPS, need location for N00009647_0006.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009583_0000.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009589_0001.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009586_0024.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009589_0000.spill.cand.cedar_phy_bhcurv.0.root 2006-02 OOPS, need location for N00009839_0007.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009755_0019.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009732_0022.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009732_0008.spill.cand.cedar_phy_bhcurv.0.root PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9010 export SAM_ORACLE_CONNECT='samdbs/...' DET=near REL=cedar_phy_bhcurv SAMMON=2005-12 AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} addloc SAMMON=2006-01 AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} addloc SAMMON=2006-02 AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} addloc ####### # AFS # ####### -------------------------------------------------------------------------- HelpDesk ticket 107048 Short Description: AFS server(s) down again Problem Description: All the Minos Cluster nodes have lost contact with AFS servers again. Here is a typical message : Nov 14 10:39:13 minos26 kernel: afs: Lost contact with file server 131.225.68.47 in cell fnal.gov (all multi-homed ip addresses down for the server) This is an urgent problem, we cannot access the Minos software without this server. 
-------------------------------------------------------------------------- 11:04 This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. 11:20 I see network traffic, the server seems up again. This affects products and releases. Instead of hanging, we see : $ ls /afs/fnal.gov/files/code/e875/products ls: /afs/fnal.gov/files/code/e875/products: No such file or directory MRTG data flow stopped around 10:40, see http://www-dcn.fnal.gov/~netadmin/m-s-fcc-mrtg/cgi/mrtg-rrd.fcgi/r-s-fcc2-server/r-s-fcc2-server_gi1_24.html Scanned again, around 17:20 for NODE in ${NODES} ; do printf "${NODE}\n" ; ssh ${NODE} 'grep afs /var/log/messages | grep "Nov 14"' ; done 131.225.68.47 fsus06 all nodes, 10:40 - 11:22 131.225.68.19 fsus08 minos02 10:15:12 - 10:16:12 14:15:11 - 14:17:39 Many messages like Nov 14 16:15:12 minos02 kernel: afs: Tokens for user of AFS id 1334 for cell fnal.gov are discarded (rxkad error=19270407) 131.225.68.4 fsus03 minos18 15:36:37 - 15:38:04 ########## # SADDMC # ########## Now working with the mindata account, on minos26 Checking that we are up to date, before doing output Nothing was needed, per the following scans. DET=near VEGS='daikon_00 daikon_01 daikon_03 daikon_04' for VEG in ${VEGS} ; do for DIR in `ls /pnfs/minos/mcin_data/${DET}/${VEG} | sort` ; do echo ${VEG} ${DIR} ./saddmc --verify -n 1 ${VEG} ${DET}/${VEG}/${DIR}/* done ; done 2>&1 | tee -a /tmp/saddmc.near AFS went down after daikon_03/spill_cedarphyMRE Repeated the scan with VEGS='daikon_03' Clean, aside from MRE files which should not be declared DET=far VEGS='daikon_00 daikon_01 daikon_02 daikon_03' for VEG in ${VEGS} ; do for DIR in `ls /pnfs/minos/mcin_data/${DET}/${VEG} | sort` ; do echo ${VEG} ${DIR} ./saddmc --verify -n 1 ${VEG} ${DET}/${VEG}/${DIR}/* done ; done 2>&1 | tee -a /tmp/saddmc.far Need daikon_03 CosmicMu 127 to 132 VEG=daikon_03 DIR=CosmicMu ./saddmc --verify ${VEG} ${DET}/${VEG}/${DIR}/* ./saddmc --declare ${VEG} ${DET}/${VEG}/${DIR}/* 2>&1 \ | tee -a /minos/scratch/mindata/log/saddmc/prd-${DET}-${VEG}-${DIR}.log Done by 15:16 ########## # SADDMC # ########## Shifted logs to /minos/scratch/mindata/log/saddmc mkdir -p /minos/scratch/mindata/log cp -vax /minos/scratch/kreymer/log/saddmc \ /minos/scratch/mindata/log/saddmc From now on, will be doing saddmc from the mindata account. 
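A quick verification sketch for the saddmc log copy above, reusing the du / diff pattern applied to the /grid/data moves elsewhere in this log ; only the two paths named above are assumed :

OLDLOG=/minos/scratch/kreymer/log/saddmc
NEWLOG=/minos/scratch/mindata/log/saddmc
du -sm ${OLDLOG} ${NEWLOG}              # sizes should agree
diff -r ${OLDLOG} ${NEWLOG} && echo COPY OK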
######## # FARM # ######## Preparing for FARM redeclares Continuing from 2007 11 12, but will use run number rather than month Checking run numbers April 2007, March has through 12001 April has 12002 through 12135 for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and PHYSICAL_DATASTREAM_NAME spill and VERSION cedar.phy.bhcurv and RUN_NUMBER > 12001 " for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and PHYSICAL_DATASTREAM_NAME spill and VERSION cedar.phy.bhcurv and RUN_NUMBER > 12001 " ./samlocate "${SAMDIM}" | sort -k 2,2 done for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and PHYSICAL_DATASTREAM_NAME spill and VERSION cedar.phy.bhcurv and RUN_NUMBER > 12001 " ./samlocate "${SAMDIM}" | wc -l done 1783 123 116 date for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and PHYSICAL_DATASTREAM_NAME spill and VERSION cedar.phy.bhcurv and RUN_NUMBER > 12001 " ./samundeclare "${SAMDIM}" done date Wed Nov 14 08:55:46 CST 2007 ####### # AFS # ####### from minos26 /var/log/messages Nov 13 16:57:38 minos26 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 13 17:29:08 minos26 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) for NODE in ${NODES} ; do printf "${NODE}\n" ; ssh ${NODE} 'grep afs /var/log/messages' ; done > /tmp/afsmsg HelpDesk ticket 107032 08:36 This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. Short Description: AFS time outs continuing - status reqeust ? Problem Description: What is the status of AFS ? Scanning the Minos Cluster /var/log/messages logs, I see several AFS timeouts before and after yesterday's 16:00 to 18:00 outage. These occurred as recently as 05:15 this morning. I can find no detailed information at http://computing.fnal.gov/cdsystemstatus/system/AFS.html or in the helpdesk tickets, or in FNALU login messages. Here are the timeout details from the Minos Cluster : 131.225.68.7 fsus02 minos01 through minos26 ( except the nodes listed below ) Nov 13 16:57:38 - 17:29:08 minos21, minos22 - no time out minos04 Nov 11 06:05:03 - 06:05:19 Nov 13 16:57:39 - 17:29:08 minos05 Nov 12 22:29:59 - 22:32:20 Nov 13 16:57:37 - 17:28:56 131.225.68.4 fsus03 minos03 Nov 12 18:52:25 - 18:55:56 minos04 Nov 13 19:31:54 - 19:33:32 minos08 Nov 12 18:52:11 - 18:55:39 minos15 Nov 13 16:22:15 - 16:25:14 minos18 Nov 12 18:52:07 - 18:54:59 minos20 Nov 12 18:16:56 - 18:17:53 Nov 13 16:21:53 - 16:25:15 131.225.68.6 fsus07 minos22 Nov 13 16:30:48 - 16:32:29 131.225.68.19 fsus08 minos02 - Nov 13 20:15:13 - 20:17:00 Nov 13 23:15:15 - 23:17:07 Nov 14 01:15:12 - 01:15:25 Nov 14 03:15:15 - 03:16:50 Nov 14 05:15:16 - 05:18:07 minos18 Nov 11 19:16:29 - 19:17:06 131.225.68.49 fsus09 minos04 Nov 13 19:31:54 - 19:33:32 minos15 Nov 13 16:22:15 - 16:25:14 minos20 Nov 12 18:16:56 - 18:17:53 Nov 13 16:21:53 - 16:25:15 ----------------------------------------------- MRTG shows fsus02 16:00 - 17:45 gap , heavy traffic ( 3 MB/sec ) for hours preceding fsus03 17:20 - 17:45 gap fsus07 15 minute gaps 18:05 through 19:15 fsus08 low traffic 18:15 - 20:30, spike at 19:10 fsus09 18:05 - 19:10 mostly gap, big spike at 19:10, 20:00 - 20:30 gap several gaps 04:00 - 06:00 ============================================================================= 2007 11 13 ####### # AFS # ####### Lost contact with AFS around 14:00. 
Parts of the system are back, around 15:00 Other parts still time out From minos26 and my desktop, /afs/fnal.gov/files/home/room1/kreymer is OK on minos26, bad on fnpcsrv1 /afs/fnal.gov/files/home/room1/kreymer/minos times out 17:00 - running process count climbing again on Minos Cluster ganglia ####### # VDT # ####### Try installation into /grid/app/minos/VDT setup pacman v3_20 mkdir -p /grid/app/minos/VDT cd /grid/app/minos/VDT pacman -get VDT:VOMS-Client Do you want to add [http://vdt.cs.wisc.edu/vdt_cache/] to [trusted.caches]? (y or n): y Package [VOMS-Client] found in [VDT]... Do you want to add [http://vdt.cs.wisc.edu/vdt_181_cache] to [trusted.caches]? (y or n): y ... Do you agree to the licenses? [y/n] y ... Where would you like to install CA files? Choices: l (local) - install into $VDT_LOCATION/globus/share/certificates n (no) - do not install l mkdir -p /grid/app/minos/VDT/glite/etc chmod 755 /grid/app/minos/VDT/glite/etc cp -a /minos/scratch/kreymer/VDT/glite/etc/vomses glite/etc/vomses See /grid/app/minos/VDT/vdt/etc/package_data/VDT-Version-Info.filelist But I see no versions there, just lists of files. cd /grid/app/minos/VDT . setup.sh klist -f kx509 kxlist -p voms-proxy-init -noregen -voms fermilab:/fermilab -valid 168:0 /minos/scratch/kreymer/VDT seems to be 1.8.1-19 based on vdt-install.log Doing this under /minos/scratch/kreymer/VDT gets MINOS26 > voms-proxy-init -noregen -voms fermilab:/fermilab -valid 168:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E Kreymer/USERID=kreymer Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Failed Error: fermilab: User unknown to this VO. None of the contacted servers for fermilab were capable of returning a valid AC for the user. /grid/app/minos/VDT seem to be 1.8.1-21 Doing this under /grid/app/minos/VDT gets VOMS Server for fermilab not known! 
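A diagnostic sketch for the vomses confusion above, comparing the two VDT installs ; it uses only the paths and setup.sh shown above, and the env filter is a guess at which variables matter :

for VDT in /minos/scratch/kreymer/VDT /grid/app/minos/VDT ; do
  echo ${VDT}
  ls -l ${VDT}/glite/etc/vomses
  ( cd ${VDT} ; . setup.sh ; env | grep -i -e vdt -e voms | sort )
done
diff /minos/scratch/kreymer/VDT/glite/etc/vomses /grid/app/minos/VDT/glite/etc/vomses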
######## # FARM # ######## for AREA in neardet fardet farcat nearcat AREA=farcat AREA=nearcat DO THE FOLLOWING THINGS date ls -ltr /grid/data/minos/${AREA} | tail -1 ls /grid/data/minos/${AREA} | wc -l du -sm /grid/data/minos/${AREA} time cp -vax /grid/data/minos/${AREA} /minos/data/minfarm/${AREA} ls /minos/data/minfarm/${AREA} | wc -l du -sm /minos/data/minfarm/${AREA} diff -r /grid/data/minos/${AREA} /minos/data/minfarm/${AREA} mv /grid/data/minos/${AREA} /grid/data/minos/OLD${AREA} ln -s /minos/data/minfarm/${AREA} /grid/data/minos/${AREA} date DID THE ABOVE AREA=neardet 2 594 /grid/data/minos/neardet real 0m26.515s 2 594 /minos/data/minfarm/neardet Tue Nov 13 13:58:01 CST 2007 AREA=fardet R > date Tue Nov 13 13:59:19 CST 2007 0 1 /grid/data/minos/fardet real 0m0.010s 0 1 /minos/data/minfarm/fardet Tue Nov 13 13:59:48 CST 2007 AREA=farcat Tue Nov 13 14:01:23 CST 2007 -rw-rw-r-- 1 rubin numi 7138755 Nov 12 23:38 F00039950_0001.spill.bntp.cedar.0.root 172 1972 /grid/data/minos/farcat real 1m46.929s 172 1969 /minos/data/minfarm/farcat Tue Nov 13 14:07:00 CST 2007 ########## # DCACHE # ########## Problems with stuck open transfers on stkendca19a all uid = 13234 FILS=' reco_far/cedar/sntp_data/2004-09/F00027184_0002.all.sntp.cedar.0.root reco_far/cedar/sntp_data/2005-04/F00030628_0007.all.sntp.cedar.0.root reco_far/cedar/sntp_data/2004-10/F00027603_0004.all.sntp.cedar.0.root reco_near/cedar/sntp_data/2005-12/N00009530_0020.cosmic.sntp.cedar.0.root ' for FIL in ${FILS} ; do root -b -q dcap://fndca1.fnal.gov:${DCPORT}/pnfs/fnal.gov/usr/minos/${FIL} done for FIL in ${FILS} ; do dccp dcap://fndca1.fnal.gov:${DCPORT}/pnfs/fnal.gov/usr/minos/${FIL} \ /local/scratch26/kreymer/COPY/ done rm /local/scratch26/kreymer/COPY/*cedar* ============================================================================= 2007 11 12 ######## # FARM # ######## Preparing for FARM redeclares for MON in 05 06 07 ; do for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and PHYSICAL_DATASTREAM_NAME spill and VERSION cedar.phy.bhcurv and FULL_PATH /pnfs/minos/reco_near/cedar_phy_bhcurv/${STRM}_data/2007-${MON} " ./samundeclare -n "${SAMDIM}" done ; done < see 2007 11 14 > ############ # MCIMPORT # ############ Added 0 length check to MCINWRITE ####### # LSF # ####### Old brebel ticket 106750 status ? Cannot submit from minos01 - minos13 11/12/2007 12:53:38 PM ticket closed, "minos nodes not showing up in lsf configuration; minos exp now using condor instead of lsf." Well, that is true for the execution nodes, but not for submission. ####### # LSF # ####### for NODE in ${NODES} ; do printf "${NODE} " ; ssh ${NODE} 'source /usr/local/etc/setups.sh ; setup lsf ; bjobs' ; done minos01 Request from non-LSF host rejected ... minos13 Request from non-LSF host rejected minos14 No unfinished job found minos15 No unfinished job found ... 18:12 HelpDesk ticket 106966 run2-sys : On the Minos Cluster, we have lost access to LSF from nodes minos01 through minos13, Commands like 'bsub' result in Request from non-LSF host rejected We have normal access from hosts minos14 through minos26. Please restore access from minos02 to minos13, as this is causing confusion, and is a substantial inconvenience to the users. Thanks ! 13 Nov 2007 08:30:30 This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. 
13 Nov 2007 08:55:20 schmitz restarted LSF on minos01 - still no good MINOS01 > bjobs Failed in an LSF library call: Slave LIM configuration is not ready yet ######### # FNALU # ######### FNALU batch jobs failed for pawloski over the weekend. His account is absent on FNALU. There are 17 missing accounts. MINOS01 > ypcat passwd | grep '/afs/fnal' | cut -f 1 -d : | sort > /tmp/users FLXI05 > scp minos01:/tmp/users /tmp/users FLXI05 > for user in `cat /tmp/users` ; do grep -q ^${user} /tmp/pwd || echo ${user} ; done bckhouse blake idanko kimjj llhsu mbt mstrait mtavera pawloski pittam rahaman rearmstr rmehdi rodriges scavan tinti whitehd FLXI05 > for user in `cat /tmp/users` ; do grep -q ^${user} /tmp/pwd || echo ${user} ; done > /tmp/MISS 13:24 HelpDesk ticket 106946 forwarded to all these users 13:42 - assigned to mgreaney 16:10 - Except for mtb (who is expired in nas), the accounts were added back. mbt:KERBEROS:13574:5111:Meagan Thompson:/afs/fnal/files/home/room2/mbt:/usr/local/bin/tcsh FLXI05 > for user in `cat /tmp/MISS` ; do grep -q ^${user} /tmp/pwd2 || echo ${user} ; done mbt 17:00 Notified users via email ######## # DESK # ######## Restarting after planned Sunday power outage ############# # CHECKLIST # ############# Mysql1 has been saturated since early morning last Saturday, load averages up to 18. Probably the pawloski jobs running since then Queries like select min(TIMEEND) from DCS_MAG_FARVLD where TIMEEND > '2007-03-15 04:15:08' and ... ####### # MRE # ####### find d* -name N\*MRE\* > /tmp/MRE.lis Found the one 0 length file remaining in AFS ( othere had been deleted ) So the 4 problem-files were 0 length in AFS, have removed them in PNFS. ============================================================================= 2007 11 10 Undeclaring April 2007 files SAMDIM="FULL_PATH /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-04" SFILES=`sam list files --dim="${SAMDIM}" --nosummary` printf "${SFILES}\n" printf "${SFILES}\n" | wc -w 45 06:44 for FILE in ${SFILES} ; do sam locate ${FILE} ; done for FILE in ${SFILES} ; do echo ${FILE} ; sam undeclare file ${FILE} ; done SAMDIM="FULL_PATH /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2007-04" 44 06:45 SAMDIM="FULL_PATH /pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2007-04" 702 06:51 ============================================================================= 2007 11 09 ######## # GRID # ######## HelpDesk ticket 106879 Short Description: Minos Cluster - need grid host certificates for use with Condor Problem Description: run2-sys : We are preparing to improve the configuration of Condor on the Minos Cluster, by installing the Glidein WMS system already being used by CMS. This should act as our gateway to GPFARM and other Fermigrid resources. Igor Sfiligoi will be assisting with the configuration of this. His first advice is that we need to obtain Grid Host certificates for the existing systems, to improve the internal security. I suspect that this is something that run2-sys has done before for similar grid installations. If so, please make these available. If not, we will need to get advice from the people doing this in CMS. We would like to proceed with this project early in the week of Nov 12. ( next week ) Date: Wed, 14 Nov 2007 12:16:37 -0600 (CST) Eventually I got the grid-cert-request done for all 26 servers, and submittied them to the website. 
Request numbers are: 29489 29497 29498 29500 29501 29502 29503 29504 29505 29506 29507 29508 29509 29510 29511 29512 29513 29514 29515 29516 29517 29518 29519 29520 29521 29522 Just waiting for return mail to get the URL and install the certificates. ############ # MCIMPORT # ############ boehm files still pending ####### # LSF # ####### REQUESTED MINOS CLUSTER LSF SHUTDOWN Date: Fri, 9 Nov 2007 14:02:28 -0600 (CST) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Cc: minos-admin@fnal.gov, mgreaney@fnal.gov Subject: Re: LSF problem on Minos Cluster Having heard no objection from Minos, and per a discussion with Joe Boyd, we are officially asking that LSF job slots be removed from the Minos Cluster nodes ( minos14 through minos26 ). We will still be submitting jobs to the traditional LSF FNALU batch system, but they should not run on the Minos Cluster nodes. Let me repeat our thanks to the Computing Division people who set this up for us earlier this year ! The scheme allowed us to keep doing physics through this phase of our transition to Condor and Grid computing. On Fri, 9 Nov 2007, Arthur Kreymer wrote: > Date: Fri, 9 Nov 2007 09:52:24 -0600 (CST) > From: Arthur Kreymer > To: minos_software_discussion@fnal.gov > Cc: minos-admin@fnal.gov > Subject: LSF problem on Minos Cluster > > > There were global LSF problems a couple of day ago, > which were corrected for the FNALU batch system, > but which are still lingering on the Minos Cluster. > > Some of our initial volunteers ( Brian, Greg, Josh ) > are making successful use of Condor on the Cluster, > and this is our planned direction. > > So I intend to ask the LSF managers to abandon attempts > to revive these Minos Cluster LSF execution slots ( hosts minosNN ). > > The other existing FNALU batch slots are unaffected. > > Please let me know if this would be a problem for anyone. > ( This is somewhat moot, these are still broken in LSF. ) ============================================================================= 2007 11 08 ####### # NET # ####### HelpDesk ticket 106799 Short Description: MRTG timezone error ? Problem Description: When attempting to veiw the MRTG traffic plots for recently rebooted fnpcsrv1, via a host search at http://fndcg0.fnal.gov/~netadmin/NodeLocator/search.html I get the following message at http://fndcg0.fnal.gov/~netadmin/NodeLocator/mrtg-search.cgi?hname=fnpcsrv1 131.225.167.44 is connected to s-f-grid-fcc1 on port Gi0/1 Last detected on this switch at 2007/11/08/10:49 But the local time was only 10:06 Somebody's clock is off by an hour. ( It would be better if all data were logged and presented in UTC. ) 09 Nov 2007 16:03 reassigned to CLIFFORD, ALDEN of the CD-LSCS/CNCS/SN Group 10 Nov 2007 19:01 reassigned to wohlt ########## # CONDOR # ########## Created /minos/scratch/kreymer/condor/loont, which runs loon on a small data file, under tcsh Had to set the PATH environment variable, cloned from path. Created /minos/scratch/kreymer/condor/loonb running under bash No path fiddling was needed. alias csub='condor_submit $*' alias cq='condor_q $*' ######## # FARM # ######## Ticket 106771 - 04:19 fnpcsrv1:Host is unpingable for a least 10 minutes by NGOP Noticed by asousa, email to backhouse,rubin,timm,kreymer Date: Thu, 08 Nov 2007 10:44:18 -0600 (CST) From: Steven Timm Hi Alexandre, I got the fnpcsrv1 back up about an hour ago. It had crashed with kernel panic at 04:05 local time. 
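For the CONDOR loonb / loont note above ( 2007 11 08 ) : a minimal sketch of a submit description file for such a wrapper ; the file names and argument are made up for illustration, not the actual test job :

# loonb.sub - hypothetical Condor submit description
universe   = vanilla
executable = /minos/scratch/kreymer/condor/loonb
arguments  = small_test_file.root
output     = loonb.$(Cluster).$(Process).out
error      = loonb.$(Cluster).$(Process).err
log        = loonb.$(Cluster).log
queue 1

Submit with the csub alias above, csub loonb.sub ; watch it with cq.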
############ # MCIMPORT # ############ boehm files, 332/434 files are now on tape, 102 remain stuck in DCache These are in pools like w-stkendca10a-5 which are still offline. 10:29 We did find some problems on the stkendca10a node which have now been corrected. The dCache pools on stkendca10a are up and available again. Please let us know if the writes to tape are moving along properly again. Ken S. -- SSA Group The pools came back at about 09:56 I see data moving out of 10a-5 ( cand data ) to tape MRE data is on the way to tape. Still 4 left, in 11a-4 and 11a-6, at 13:30. 09 Nov 2007 14:57 podstvkv - I am working on it. List of files is /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE/957/ n00009573_0000_spill_D03_cedarphyMRE.reroot.root /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE/969/ n00009696_0019_spill_D03_cedarphyMRE.reroot.root n00009696_0020_spill_D03_cedarphyMRE.reroot.root n00009696_0021_spill_D03_cedarphyMRE.reroot.root ============================================================================= 2007 11 07 ######### # ADMIN # ######### For interactive limits, see http://www-cdf.fnal.gov/offline/runii/ILP/ILPUG/ilpug-4.html User processes have their priority changed to 19 (lowest priority), if using 50% of a CPU or more for 10 minutes, or more. The user is not notified in the event of a renice. User processes are killed if using 10% of a CPU or more for 30 minutes or more. The user is notified via email when a process is killed. ######## # GRID # ######## ticket 105784 pending since 10/18, requests /minos/data and scratch on GPFARM and fnpcsrv1 DONE and tested !!! ###################### # AFS / LSF problems # ###################### 106736 - 09:00 - fsun02 unpinged for 10 minutes 106750 - 13:30 brebel - cannot connect to lsf server no information in the ticket arms informs me of license problems, ####### # LSF # ####### Observe many brebel jobs, submitted around 16:05 ####### # AFS # ####### for NODE in $NODES ; do echo ${NODE} ; ssh ${NODE} 'grep "afs: failed" /var/log/messages' ; done or for NODE in $NODES ; do echo ${NODE} ; ssh ${NODE} 'grep \(110\) /var/log/messages' ; done minos05 Nov 6 12:50:00 minos05 kernel: afs: failed to store file (110) Nov 6 12:50:37 minos05 kernel: afs: failed to store file (110) Nov 7 14:15:01 minos05 kernel: afs: failed to store file (110) minos17 Nov 6 22:29:22 minos17 kernel: afs: failed to store file (110) Nov 6 22:30:26 minos17 kernel: afs: failed to store file (110) Nov 7 15:59:23 minos17 kernel: afs: failed to store file (110) Nov 7 15:59:24 minos17 kernel: afs: failed to store file (110) minos26 Nov 6 12:50:01 minos26 kernel: afs: failed to store file (110) Nov 6 22:30:03 minos26 kernel: afs: failed to store file (110) Nov 7 01:00:05 minos26 kernel: afs: failed to store file (110) Nov 7 04:44:57 minos26 kernel: afs: failed to store file (110) Nov 7 14:15:02 minos26 kernel: afs: failed to store file (110) for NODE in $NODES ; do echo ${NODE} ; ssh ${NODE} 'grep \(110\) /var/log/messages.1' ; done minos04 Oct 30 15:29:53 minos04 kernel: afs: failed to store file (110) minos14 Nov 2 15:19:22 minos14 kernel: afs: failed to store file (110) Nov 2 15:19:31 minos14 kernel: afs: failed to store file (110) minos17 Oct 30 21:04:17 minos17 kernel: afs: failed to store file (110) Nov 1 10:54:42 minos17 kernel: afs: failed to store file (110) for NODE in $NODES ; do echo ${NODE} ; ssh ${NODE} 'grep \(110\) /var/log/messages.2' ; done minos26 Oct 27 02:09:50 minos26 kernel: afs: failed to store file (110) Oct 27 02:09:51 minos26 kernel: 
afs: failed to store file (110) ####### # SAM # ####### IT 3128 - sam_products for sam_station v6_0_5_22_srm ############ # MCIMPORT # ############ boehm files, total 1546 PURGED 1112 PENDING 434 There are 291 queued writes 106 for w-stkendca10a-4, most of the rest for -5 and -6 Reported to dcache-admin twice. Issued helpdesk ticket 106748 14:49 - investigating, contacting developers re stkendca10a pools >> We still have four files waiting to be written from >> w-stkendca11a-4 >> and w-stkendca11a-6 This was resolved, closed out ticket 13 Nov The system disk filled due to some large lqcd files not being handled by an older encp configuration. The configuration was updated and disk space cleared. ########## # CONDOR # ########## jboehm is running on our Condor pool, keeping a queue of about 50. The load average on the cluster jumped up around 13:00 today ! MIN > for NODE in $NODES ; do printf "${NODE} " ; ssh ${NODE} 'du -sm /local/scratch*/boehm' ; done minos02 1 /local/scratch02/boehm minos03 3296 /local/scratch03/boehm minos04 5 /local/scratch04/boehm minos06 7 /local/scratch06/boehm ============================================================================= 2007 11 06 ######## # GRID # ######## /minos/data and scratch mounts on GPFARM - ticket 106721 07:19 - exports were added ############ # MCIMPORT # ############ mcimport.20071102 - continuing to restructure for new scheme ${INPAT}/mcin for reroot files, was ${INPAT}/near/mcin Extended autodest for two-part MC configurations, as done in roundup Updated .grid/kreymer-doe.proxy Created STAGE/CRON to hold the pid $ ./mcimport boehm OOPS - found /home/mindata/STAGE/boehm/log/mcimport.pid OK - stale pid file OK, logging activity to /home/mindata/STAGE/boehm/log/mcimport.log Tue Nov 6 14:52:17 CST 2007 OK - processing from /home/mindata/STAGE/boehm version mcimport.20071102 LOGS PURGE, TAR, WRITE, MCINPURGE, MCINWRITE ... 177624 /home/mindata/STAGE/boehm/ 1 /home/mindata/STAGE/boehm/tar 1 /home/mindata/STAGE/boehm/dcache 177624 /home/mindata/STAGE/boehm/mcin 177623 /home/mindata/STAGE/boehm/mcin/dcache Wed Nov 7 02:15:09 CST 2007 $ ./mcimport boehm OK - purging 1546 MCIN files ? Wed Nov 7 07:33:46 CST 2007 ############ # PNFSDIRS # ############ Added support for a release MCIN which disables output ( this is for archives of some special files from boehm . Also useful for testing. ) ./pnfsdirs near MCIN daikon_03 spill_cedarphyMRE ./pnfsdirs near MCIN daikon_03 spill_cedarphyMRE write ######### # MYSQL # ######### Overloaded with brebel cron jobs since Nov 5 09:45 He will restart with newer code with efficient database access. ============================================================================= 2007 11 05 ############ # MCIMPORT # ############ boehm reroot files : Suggested names like n13011432_0000_L010185N_D03_D00cedarMRE.reroot.root But the initial files are like N00009146_0008_D03_spillcedar_phyMRE.reroot.root These started as cedar_phy mrcc, had MRE run with D03, so should be named like n00009146_0008_spill_D03_cedarphyMRE.reroot.root New file name is n${FILE:1:13}_spill_D03_cedarphyMRE.reroot.root 1547 for FILE in N*_D03_spillcedar_phyMRE.reroot.root ; do echo ${FILE} ; done | wc -l 1546 Stray list.txt file.
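A quick sanity check of the substring arithmetic above, using the example file name already quoted :

FILE=N00009146_0008_D03_spillcedar_phyMRE.reroot.root
echo ${FILE:1:13}                                      # 00009146_0008
echo n${FILE:1:13}_spill_D03_cedarphyMRE.reroot.root   # n00009146_0008_spill_D03_cedarphyMRE.reroot.root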
for FILE in N*_D03_spillcedar_phyMRE.reroot.root ; do echo mv ${FILE} n${FILE:1:13}_spill_D03_cedarphyMRE.reroot.root ; done for FILE in N*_D03_spillcedar_phyMRE.reroot.root ; do mv ${FILE} n${FILE:1:13}_spill_D03_cedarphyMRE.reroot.root ; done 17;236 - ready to move these, with revised mcimport script handle mcin top path do not concatenate see 2007 11 06 ####### # DAQ # ####### to habig : From my logs, based on what I did for a similar shutdown last time , here is a set of commands that would shut down Sunday morning, feel free to adjust . shutting down minos-evd last ( it is an NFS server ) shutting down minos-beamdata after acnet ( it exports to acnet ) ssh -ax -l root minos-rc 'echo "shutdown -h now" | at 05:30 Nov 11' ssh -ax -l root minos-om 'echo "shutdown -h now" | at 05:32 Nov 11' ssh -ax -l root minos-acnet 'echo "shutdown -h now" | at 05:34 Nov 11' ssh -ax -l root minos-beamdata 'echo "shutdown -h now" | at 05:36 Nov 11' ssh -ax -l root minos-evd 'echo "shutdown -h now" | at 05:38 Nov 11' Check the at status with 'at -l ' ########### # MONTHLY # ########### DATASETS 11/5 PREDATOR 11/5 VAULT 11/5 MYSQL 11/7 waited for brebel monthly processing on FNALU ./stage -g RawDataWritePools -d -p 0 fardet_data/2007-10 Needed 403/817 STARTED Mon Nov 5 09:52:15 CST 2007 FINISHED Mon Nov 5 10:48:14 CST 2007 db archives mysql> system time cp -av --target-directory=/data/archive/COPY/20071107/offline DCS_HV.MYD ; real 16m30.116s mysql> system time cp -av --target-directory=/data/archive/COPY/20071107/offline PULSERGAIN.MYD ; real 18m25.131s mysql> system time cp -av --target-directory=/data/archive/COPY/20071107/offline `cat /tmp/offiles` ; real 41m29.246s [minsoft@minos-mysql1 offline]$ time md5sum * >> ../offline.md5sum real 19m41.403s [minsoft@minos-mysql1 offline]$ time gzip -1 *.MYD real 62m10.466s Mysql> time scp -r -c blowfish -qv ${DBCOPY} ${REPATH} real 13m21.059s Mysql> time rsync -r \ real 0m15.821s Wed Nov 7 11:55:36 CST 2007 ######## # GRID # ######## Dear VO Member, Your status with the VO has been changed from Approved to Suspended due to the following reason: Suspended on 200711050500. Please contact VO administrator if you have any questions. VOMRS fermilab Service There are 6885 Fermilab KCA cert's. Of these, 3867 are suspended. Of these, 2130 were suspended this morning. grep 'Suspended on' vomrs.txt | sort -k 5,5 -n | tr -s ' ' | cut -f 5 -d ' ' | sort -u 200708061248 ... 
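A variant of the pipeline above that tallies suspensions per day rather than listing the unique stamps ; it assumes the same vomrs.txt dump and the YYYYMMDDhhmm stamp format shown :

grep 'Suspended on' vomrs.txt | tr -s ' ' | cut -f 5 -d ' ' | cut -c 1-8 | sort | uniq -c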
######## # FARM # ######## mcnearcat 2739 53543 mrnt.cedar_phy_oldbhcurv.root 8 63 sntp.cedar_phy_bhcurv.root 3166 180491 sntp.cedar_phy_oldbhcurv.root 186 12280 sntp.cedar_phy.root corral - added cedar_phy_oldbhcurv MFILES=`find /grid/data/minos/mcnearcat -name \*oldbhcurv\* -exec basename {} \;` printf "${MFILES}\n" | wc -l 5662 Need to clear some working space, based on farmgsum nearcat 109 8110 spill.sntp.cedar_phy.0.root WFILES=`find /grid/data/minos/nearcat -name \*spill.sntp.cedar_phy.0.root -exec basename {} \;` printf "${WFILES}\n" | wc -l 109 for FILE in ${WFILES} ; do cp /grid/data/minos/nearcat/${FILE} /export/stage/minfarm/test/${FILE} ; done for FILE in ${WFILES} ; do echo ${FILE} diff /grid/data/minos/nearcat/${FILE} /export/stage/minfarm/test/${FILE} ; done for FILE in ${WFILES} ; do echo ${FILE} touch -r /grid/data/minos/nearcat/${FILE} /export/stage/minfarm/test/${FILE} ; done for FILE in ${WFILES} ; do echo ${FILE} rm /grid/data/minos/nearcat/${FILE}; done in ROUNTMP mv NOCAT.bck NOCAT in scripts ./roundup -c -M -r cedar_phy_oldbhcurv mcnear ######## # FARM # ######## Removed ROUNTMP/WRITE.old left over from migration to /grid/data 2007 05 11 ============================================================================= 2007 11 03 Sat ############ # MCIMPORT # ############ Moved kordosky directory, see entry at 2007 10 30 Sat Nov 3 09:59:59 CDT 2007 Sat Nov 3 10:48:13 CDT 2007 File copy rate for small log files seems to be about 200 files/second, much better than last Thursday when DMA was off. Size of log files : MINOS26 > MCD=/local/scratch26/mindata/kordosky/log MINOS26 > find ${MCD} -type f | wc -l ; du -sm ${MCD} 58784 5710 /local/scratch26/mindata/kordosky/log Sent out an all-clear message daikon_04 cleanup : I have completed the cleanup of the old daikon_04 beam MC. The /pnfs/minos/stage/*D04* files are purged. The appropriate index, md5, and log files have been cleaned up. mcimport : All of the mcimport/STAGE areas have been moved to /minos/data/mcimport including the kordosky area. Everyone should be able to resume production. processing : Given the much larger capacity of the STAGE/ cache area, and its ability to handle large numbers of small files, Robert and I have decided to simplify the overly processing. We will overlay directly from the /minos/data/mcimport// directories, without first creating tarfiles and indexes. We may or may not still create tarfiles later, for archival purposes, but this is no longer in the processing pipeline. This has no impact on people producing the MC files. Files are copied to the same place, with the same integrity checks. They will remain there a bit longer than before. startup : It is the weekend, so please exercise caution and restraint. If things break, they may need to wait till Monday. Note that minos26 had severe problem starting last Wednesday. These were probably resolved by the reboot on Friday, combined with our new processing model reducing the load on local disk. ########### # MONITOR # ########### Restarted monitoring per HOWTO.monitor ( except beam ) ######### # FNALU # ######### FNALU batch jobs failed for pawloski over the weekend. 
His account is absent on FNALU MINOS01 > ypcat passwd | grep '/afs/fnal' | cut -f 1 -d : | sort > /tmp/users FLXI05 > scp minos01:/tmp/users /tmp/users FLXI05 > for user in `cat /tmp/users` ; do grep -q ^${user} /tmp/pwd || echo MISSING ${user} ; done | cut -f 1 -d ':' bckhouse blake idanko kimjj llhsu mbt mstrait mtavera pawloski pittam rahaman rearmstr rmehdi rodriges scavan tinti whitehd ============================================================================= 2007 11 02 ############ # MCIMPORT # ############ daikon_04 cleanup Logs to remove are all in the run number range 7000 - 7200. cd kordosky/log find . -type f -name L??????_\*_7???_\*.log | wc -l 4949 All seem to be newer then Oct 19, 14 days ago $ find . -type f -name L??????_\*_7???_\*.log -mtime +13 -exec ls -l {} \; | wc -l 335 $ find . -type f -name L??????_\*_7???_\*.log -mtime +14 -exec ls -l {} \; | wc -l 0 $ mv L*.log badd04/ $ mv n*.log badd04/ $ find . -type f -name L??????_\*_7???_\*.log -exec echo mv {} baddo4/ \; $ find L* -type f -name L??????_\*_7???_\*.log -exec mv {} badd04/ \; $ find L* -type f -name n\*_D04.log | wc -l 4901 $ find L* -type f -name n\*_D04.log -exec mv {} badd04/ \; $ find badd04 -type f -name n\*_D04.log | wc -l 4949 $ tar czf badd04.tgz -C badd04 . $ tar tzf badd04.tgz | wc -l 9899 That is a correct count, includes . $ rm -r badd04/ Purge the tar.gz incoming $ rm *D04.tar.gz $ rm tar/n* Clean the mf5.all file cd kordosky/md5 $ wc -l all.md5 31106 all.md5 $ grep D04.tar.gz all.md5 | wc -l 4969 $ grep -v D04.tar.gz all.md5 > all.md5new $ mv all.md5 all.md5.badd04 $ mv all.md5new all.md5 Clean the indexes $ ls *D04.index | wc -l 756 $ mkdir badd04 $ mv *D04.index badd04/ $ cat badd04/*.index | wc -l 4838 $ tar czvf ../badd04.index.tgz -C badd04 . $ tar tzf ../badd04.index.tgz | wc -l 757 PNFS MINOS26 > cd /pnfs/minos/stage/kordosky/ MINOS26 > ls | wc -l 4387 MINOS26 > ls *D04.tar | wc -l 754 MINOS26 > find . -name n\*_D04.tar | wc -l 754 MINOS26 > find . -name n\*_D04.tar -mtime +13 | wc -l 22 MINOS26 > find . -name n\*_D04.tar -mtime +14 | wc -l 0 MINOS26 > find . -name n\*_D04.tar -exec rm {} \; ########## # CONDOR # ########## and zwaska To: brebel@fnal.gov, habig@fnal.gov, jdejong@fnal.gov, pawloski@fnal.gov, petyt@fnal.gov, rustem@fnal.gov, tinti@fnal.gov Cc: minos-admin@fnal.gov, minos_batch@fnal.gov Subject: Condor queues available on Minos Cluster This note is going out to our identified Analysis Batch 'power users'. Last week, we successfully installed a condor pool on the Minos Cluster. Greg has been doing some preliminary tests, and I have done some stress tests to determine that the system can handle thousands of jobs without rolling over. Documention is rough, and I have very little Condor experience. Nevertheless, the system may already be useful for running jobs. Please have a look at an early draft document, ~kreymer/minos/HOWTO.condor a.k.a. http://home.fnal.gov/~kreymer/minos/HOWTO.condor and give things a try. Enjoy ! ############ # MCIMPORT # ############ Cleanup - a lot of daikon_04 was declared to SAM, on 2007 10 25 This was all near CosmicLE, from sjc, directly imported. 
Also, cleaned up after the oom Killer, which zapped kordosky's mcimport at 09:48 n11037118_0018_L010185N_D04-n11037118_0022_L010185N_D04.tar 5 n11037118_0018_L010185N_D04.tar.gz to n11037118_0022_L010185N_D04.tar.gz rm tar/n11037118_0018_L010185N_D04-n11037118_0022_L010185N_D04.tar rm /var/tmp/mindata/MCTAR/kordosky/*.gz ############ # MCIMPORT # ############ Created overlay directory for overlaid reroot files. We will write them to PNFS from mcimport, like any other reroots. ############ # MCIMPORT # ############ Rearranged sjc/far/mcin per new arrangements, sharing near and far files in /mcin. mv far/mcin mcin ln -s ../mcin far/mcin ########### # MINOS26 # ########### Have been seeing oom killer messages in /var/log/messages Oct 31 03:52:01 minos26 kernel: oom-killer: gfp_mask=0xd0 Oct 31 03:58:28 minos26 kernel: oom-killer: gfp_mask=0xd0 Oct 31 05:29:12 minos26 kernel: oom-killer: gfp_mask=0xd0 ... Nov 1 21:37:43 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 1 22:55:03 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 03:30:32 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 03:37:43 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 04:19:11 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 04:28:05 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 04:28:06 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 04:28:08 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 04:28:11 minos26 kernel: oom-killer: gfp_mask=0xd0 MINOS26 > grep Killed /var/log/messages | grep -v sleep | grep -v scp | grep -v _log Oct 31 07:22:56 minos26 kernel: Out of Memory: Killed process 17228 (bash). Oct 31 07:22:57 minos26 kernel: Out of Memory: Killed process 4664 (bash). Oct 31 12:49:32 minos26 kernel: Out of Memory: Killed process 32117 (sendmail). Oct 31 12:49:33 minos26 kernel: Out of Memory: Killed process 6445 (bash). Oct 31 12:49:34 minos26 kernel: Out of Memory: Killed process 946 (bash). Oct 31 12:49:40 minos26 kernel: Out of Memory: Killed process 19156 (mcimport). Oct 31 15:53:02 minos26 kernel: Out of Memory: Killed process 10295 (sendmail). Oct 31 15:53:14 minos26 kernel: Out of Memory: Killed process 19867 (bash). Oct 31 20:55:33 minos26 kernel: Out of Memory: Killed process 7926 (sh). Ticket 106517 09:42 jpfitz crontab -r for kreymer and mindata From /var/log/messages Nov 2 10:22:59 minos26 exiting on signal 15 Nov 2 12:33:18 minos26 syslogd 1.4.1: restart. Digging into the history, $ grep -v 'session opened for user' /var/log/messages | less Oct 31 01:51:11 minos26 kernel: hdb: dma_timer_expiry: dma status == 0x61 Oct 31 01:51:21 minos26 kernel: hdb: DMA timeout error Oct 31 01:51:21 minos26 kernel: hdb: dma timeout error: status=0xd0 { Busy } Oct 31 01:51:21 minos26 kernel: Oct 31 01:51:21 minos26 kernel: ide: failed opcode was: unknown Oct 31 01:51:21 minos26 kernel: hda: DMA disabled Oct 31 01:51:21 minos26 kernel: hdb: DMA disabled Oct 31 01:51:21 minos26 kernel: ide0: reset: success Oct 31 03:52:01 minos26 kernel: oom-killer: gfp_mask=0xd0 ... 
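For a quick picture of how often the oom-killer fired, the same /var/log/messages can be summarized per hour ; a sketch only, assuming the stock syslog timestamp format shown above :
grep 'oom-killer:' /var/log/messages | \
  awk '{ print $1, $2, substr($3,1,2)":00" }' | sort | uniq -c
# each output line is: count  Month Day HH:00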
The ganglia plots indicate a load average around 12 around 02:00 ============================================================================= 2007 11 01 ########## # CONDOR # ########## Testing large scale submit with a touch script ./touch n M touches file n in subdirectory M Submitted 1000 processes, rate of running is about 4/second, using full Minos Cluster minos25 Load average running 1000 was about 1 3000 was about 1.4 3000 rate with 100 second sample, is about 2 /second at 200 2.4/second at 900 3.2/second at 1500 3.9/second at 2000 3.9/second at 2500 17:57 10000 rate with 100 second sleep is 0.7/second at 200 0.7/second at 800 0.7/second at 1200 0.8/second at 1700 0.8/second at 2100 0.7/second at 2600 0.9/second at 3100 1.0/second at 3700 1.1/second at 4300 1.2/second at 4900 1.4/second at 5700 Times from 17:58 to 20:18 MINOS25 > condor_q -run -- Failed to fetch ads from: <131.225.193.25:62586> : minos25.fnal.gov ############ # MCIMPORT # ############ Looking at cleanup of incorrect D04 processing. $ ls /pnfs/minos/stage/kordosky/*D04* | wc -l 730 $ ls index/*D04.index | wc -l 730 Plan : Wait for incoming files to abate ( tomorrow ) Remove all the pnfs and index files Logs - let them rot ? Move kordosky to bluearc Removed all kordosky/DUP files Removed almost all kordosky/BAD files for FILE in kordosky/BAD/*.gz ; do echo ${FILE} ; gunzip -t ${FILE} ; done All but two are actually bad kordosky/BAD/n11011003_0001_L010185N_D01.tar.gz kordosky/BAD/n12011003_0001_L010185N_D01.tar.gz Removed them all. MINOS26 > du -sm /pnfs/minos/stage/* 268720 /pnfs/minos/stage/arms 1 /pnfs/minos/stage/buckley 1 /pnfs/minos/stage/gmieg 758471 /pnfs/minos/stage/hgallag 1358789 /pnfs/minos/stage/howcroft 6736664 /pnfs/minos/stage/kordosky 11457 /pnfs/minos/stage/kreymer 202838 /pnfs/minos/stage/mualem 1 /pnfs/minos/stage/rhatcher 1 /pnfs/minos/stage/sjc 1 /pnfs/minos/stage/urheim 9336936 /pnfs/minos/stage Plan to reorganize this into /minos/data/mcimport Hierarchy for long term storage should look like MC release Config Detector Run/subrun breakout For input to overlays, MCR/CONF/DET is adequate Checking for dup's among recent import is simple No tarring of the files, as all is on Bluearc/NFS Data can then be archived after the fact, in large files, without paranoid CRC checksum tests, and in very large tars. We could put this to LTO3 or LTO4 tape. ####### # AFS # ####### Per brebel, requested volumes /afs/fnal.gov/files/data/minos/d271 /afs/fnal.gov/files/data/minos/d272 for nc work system:administrators rlidwka minos:admin rlidwka system:anyuser rl buckley:ana_ntuples rlidwka Ticket 106484 assigned to mengel 15:31 done ######## # GRID # ######## Bluearc maintenance 06:00 this morning seems to have induced an NFS stale file handle problem on minos01/26 and fnpcsrv1, and probably elsewhere. Noted in fermigrid-announce by timm. 14:09 - sent ticket to run2-s6s ticket 106477 Cleared at 14:40 by jpfitz ######## # FARM # ######## /grid/data glitch requires removal of some cand's - are these declared to SAM ?
RUNSUBS=' N00011852_0000 N00011861_0017 N00011861_0019 N00011861_0021 N00011878_0014 N00011878_0017 N00011896_0003 N00011896_0008 N00011908_0006 N00011908_0022 N00011911_0010 N00011911_0014 N00011911_0021 N00011911_0023 N00011914_0014 N00011917_0001 N00011920_0017 N00011923_0000 N00011923_0017 ' for RUNSUB in ${RUNSUBS} ; do sam locate ${RUNSUB}.spill.cand.cedar_phy_bhcurv.0.root done for RUNSUB in ${RUNSUBS} ; do echo ${RUNSUB} ( cd /pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2007-03 ; cat ".(use)(4)(${RUNSUB}.spill.cand.cedar_phy_bhcurv.0.root)" | head -1 ) done ============================================================================= 2007 10 31 ############ # MCIMPORT # ############ Output is falling behind, due to long kordosky tarring, started 22:49, getting about 7 MB/sec slowed to 1 MB/sec at n11037106_0022_L010185N_D04-n11037106_0028_L010185N_D04.tar ran till 07:23 ####### # BOO # ####### ============================================================================= 2007 10 30 ########### # ENSTORE # ########### Big ( over 600 ) queues, delaying farm running due to lack of mcin_data. CMS data challenge is underway, I also see lqcd activity. Only 2 minos reads are pending, from VOB372 The file needed is /pnfs/minos/mcin_data/near/daikon_03/L010185N/500/n13035001_0000_L010185N_D03.reroot.root on VO3403 ############ # MCIMPORT # ############ Second attempt to import bad file, around 17:45 Oct 29 gunzip: n11037088_0014_L010185N_D04.tar.gz: unexpected end of file ####### # AFS # ####### Mounting subdirectories : http://osdir.com/ml/file-systems.openafs.general/2003-03/msg00092.html fs exportafs nfs -submounts on Freelance mode AFS, readonly (windows only) http://ezine.daemonnews.org/200605/afs.html mentions freelance mode, with no home cell, no tokens ( circa 2002, oops Windows only ) too bad, this would have been useful in the OSE grid Translator : http://www.nabble.com/Bug-405982:-cannot-stop-all-afsd-process-when-start-with--rmtsys-t2935598.html No translator : http://www.openafs.org/pipermail/openafs-devel/2001-May/006056.html Usage at INFN http://www.lnf.infn.it/computing/afs/doc/adm/adm02.htm ####### # AFS # ####### For reference, for grid computing, we would want to have Releases /afs/fnal.gov/files/code/e875/general/minossoft/ Products /afs/fnal.gov/files/code/e875/general/products/ symlinked to ups/ But under releases, there are 3 symlinks outside /afs/fnal.gov/files/code/e875/general/minossoft/packages for bin, lib and tmp, like /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.3/lib/ Scanned all releases, found links like /afs/fnal.gov/files/code/e875/releases /afs/fnal.gov/files/code/e875/releases1 /afs/fnal.gov/files/code/e875/releases2 releases is a 50 GB disk, links are all to SRT_BINLIBTMP releases1/2 are 8 GB. ls -al /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.24.3 | \ grep /afs/fnal | \ grep -v /afs/fnal.gov/files/code/e875/general/minossoft/packages MSP=/afs/fnal.gov/files/code/e875/general/minossoft cd /afs/fnal.gov/files/code/e875/general/minossoft/releases There are many relative symlinks, up 1 level, from include/ to ../ find R1.24.3 -type l -exec ls -l {} \; | grep ' \.\./' But I see no links up 2 levels find R1.24.3 -type l -exec ls -l {} \; | grep ' \.\./\.\.' 
There are many symlinks to ${MSP}/packages find R1.24.3 -type l -exec ls -l {} \; | grep -v ' \.\./' There are 3 symlinks to bin/lib/tmp with an explicit AFS path find R1.24.3 -type l -exec ls -l {} \; | grep -v ' \.\./' | grep -v ${MSP} Get symlink paths in a searchable form find . -type l -exec ls -l {} \; | grep -v ' \.\./' | grep -v ${MSP} | cut -f 2 -d '>' | tee -a /tmp/minrel There are only 2 stray symlinks find . -type l -exec ls -l {} \; | grep -v ' \.\./' | grep -v ${MSP} | grep -v /afs/fnal.gov lrwxr-xr-x 1 rhatcher e875 23 Sep 27 19:56 ./S07-09-20-R1-26/G3PTSim/LinkDef.h -> TGeant3/geant3LinkDef.h lrwxr-xr-x 1 rhatcher e875 23 Jul 14 00:00 ./S07-07-26-R1-26/Linux2.6-GCC_3_4-maxopt -> Linux2.4-GCC_3_4-maxopt MINOS26 > du -sm /afs/fnal.gov/files/code/e875/releases/* 551 /afs/fnal.gov/files/code/e875/releases/GENIE 54 /afs/fnal.gov/files/code/e875/releases/LOG4CPP 2048 /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN 22299 /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT 840 /afs/fnal.gov/files/code/e875/releases/NEUGEN3 183 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 16280 /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP 1425 /afs/fnal.gov/files/code/e875/releases/base_release_build 27 /afs/fnal.gov/files/code/e875/releases/stdhep ############ # MCIMPORT # ############ Kregg is concerned again with minos26 capacity, Looking at Ganglia, I see incoming rates as high as 40 GB/3 hours or 4 MBytes/second. Load average runs about 4 , spikes to 6 during this influx. But the average incoming rate is around 1 MB/second, per ganglia plots. ############ # MCIMPORT # ############ Per discussion with arms, will shift all users but kordosky over to /minos/data/mcimport/... $ du -sm * 1 CRON 5805 arms 1 buckley 1 gmieg 468 hgallag 1 himmel 2054 howcroft 26365 kordosky 3621 kreymer 1 mcinwrite 4928 mualem 1 nohup.out 1 rhatcher 23093 sjc 1 urheim Small users are buckley gmieg himmel rhatcher urheim There are no symlinks, per find . -type l cd /local/scratch26/mindata/ mkdir MOVED MCUSER=buckley [ -r "/minos/data/mcimport/${MCUSER}" ] && echo OOPS DUPLICATE in data [ -r "MOVED/${MCUSER}" ] && echo OOPS DUPLICATE MOVED [ -r ${MCUSER}/MCIMPORT ] && echo TOBLUE && mv ${MCUSER}/MCIMPORT ${MCUSER}/TOBLUE du -sm ${MCUSER} find ${MCUSER} -type f | wc -l date time \ cp -ax ${MCUSER} /minos/data/mcimport/${MCUSER} find /minos/data/mcimport/${MCUSER} -type f | wc -l du -sk ${MCUSER} /minos/data/mcimport/${MCUSER} time \ diff -r ${MCUSER} /minos/data/mcimport/${MCUSER} mv ${MCUSER} MOVED/${MCUSER} ln -s /minos/data/mcimport/${MCUSER} ${MCUSER} [ -r ${MCUSER}/TOBLUE ] && echo MCIMPORT && mv ${MCUSER}/TOBLUE ${MCUSER}/MCIMPORT date 14:18 - did the other small guys MCUSER=gmieg MCUSER=himmel MCUSER=rhatcher MCUSER=urheim 14:23 - did the inactive guys MCUSER=arms ... ( copy took hours, due to small files ? 
) interrupted the diff -r after real 48m53.509s user 0m5.524s sys 0m17.290s moved anyway, then ran the diff : real 54m8.381s user 0m13.886s sys 0m53.854s 2007 10 31 - continuing MCUSER=hgallag real 1m23.301s real 0m21.857s MCUSER=howcroft 21138 2102772 howcroft 2062388 /minos/data/mcimport/howcroft real 13m22.700s sys 0m16.462s real 7m15.946s sys 0m13.733s MCUSER=kreymer 49 real 12m2.772s 3707768 kreymer 3703972 /minos/data/mcimport/kreymer real 14m25.219s MCUSER=mualem Wed Oct 31 11:02:47 CDT 2007 6571 real 19m4.375s 5045996 mualem 5029340 /minos/data/mcimport/mualem real 21m33.584s 2007 11 01 MCUSER=sjc Fri Nov 2 13:52:49 CDT 2007 16592 real 3m10.135s 2237676 sjc 2201240 /minos/data/mcimport/sjc real 1m22.925s 2007 11 03 MCUSER=kordosky TOBLUE 6756 kordosky 62434 Sat Nov 3 09:59:59 CDT 2007 real 23m26.866s 6917552 kordosky 6803072 /minos/data/mcimport/kordosky real 23m54.821s Sat Nov 3 10:48:13 CDT 2007 MCUSER=mcinwrite 1 mcinwrite 2 Sat Nov 3 10:51:12 CDT 2007 real 0m0.042s 2 16 mcinwrite 16 /minos/data/mcimport/mcinwrite real 0m0.003s Sat Nov 3 10:51:18 CDT 2007 ============================================================================= 2007 10 29 ####### # SAM # ####### sam_bootstrap v8_1_1 current on minos-sam01 and minos-sam02 In a pinch, can fall back by using the older v8_1_0 directly ups update sam_bootstrap v8_1_0 Version v8_1-1 has improved retries in case of station/dbserver restarts, backs off rate of retries to lower limit of once per hour. ############ # INDEXNFS # ############ ./indexnfs reco_near/cedar_phy/sntp_data/2005-04 RDIRS=`cd /minos/data ; find reco_near -type d -name 2???-??` for DIR in ${RDIRS} ; do ./indexnfs ${DIR} ; done ########### # BLUEARC # ########### From CD ops meeting : 11/1: 6-6:15am Site NAS Server (BlueArc) will be down for a major firmware upgrade. ########## # DCACHE # ########## Removed stray directories under May 22 rubin /pnfs/minos/mcout_data/cedar_phy/bfld201_lowE fnpcsrv1% rmdir bfld201_lowE/sntp_data/ fnpcsrv1% rmdir bfld201_lowE/cand_data/ fnpcsrv1% rmdir bfld201_lowE ############ # MCIMPORT # ############ Strange, a bad input file kordosky/n11037054_0014_L010185N_D04.tar.gz was correctly detected and moved to BAD, but it seems to have remained in the FILES list ! Will have to re-test this code somehow. Impact is just a somewhat messy printout. ######## # GRID # ######## 104371 - marked resolved, waiting for my reply to something ? The accounts are present. Need to follow up on cleanup of old users ? ( mail filed in minos-admin ) Michael Kordosky Brandon Seilhan Durga Rajaram Howard Rubin Thomas Brennan Mark Messier (He is also on MIPP) Steven Cavanaugh Deborah Harris Valeri Garkusha Sergei STriganov Hugh Gallagher Adam Para Mayly Sanchez Tingjun Yang George Irwin Byron Lundberg Robert Bernstein John Urheim Alexandre Sousa Regina Rameika Carol Ward Liz Buckley Joshua Boehm 105638 - waiting for information from me ? Mount of /minos/scratch and data on FNALU int and batch I think questions were answered on 16 Oct. Mounts are in place on FNALU batch and some interactive ######## # GRID # ######## New ticket by rubin, 106232 Spontaneous Condor restarts continue on the Farm. Note - farm is running Condor 6.8.5, which has problems. The problem is believed solved in the Sep 13 Condor 6.8.6 which we run on the Minos Cluster ######## # GRID # ######## ticket 105784 pending since 10/18, requests /minos/data and scratch on GPFARM and fnpcsrv1 See fs exportafs - translator ? 
See Administration Guide, Appendix A, under http://www.openafs.org/doc/index.htm http://www.openafs.org/pages/doc/AdminGuide/auagd022.htm#HDRWQ595 ########### # ENSTORE # ########### Finished review of tapes listed 2007 10 09 for recycling . Approved all but NULL31 ( not a tape ) copy to georges@fnal.gov who sent a recent reminder listing 19 of these ============================================================================= 2007 10 26 ####### # LSF # ####### Checking tokens, they are cloned from submission process : for NODE in $BNODES ; do bsub -R ${NODE} "tokens" ; done ######## # GRID # ######## Mail to minos-admin , chadwick, timm, berman MINOS-doc-3776, version 1 Based on our successful experience so far running Condor on the Minos Cluster, here is a more detailed plan for the new nodes being installed in GCC. Please let us know if there are any adjustments needed to this plan. Condor on new Minos computing ( 8 x Dell PE 6850 ) Driver : Make these nodes available for Minos Analysis batch computing. Issues : To provide compatibility with the existing Minos Cluster Condor system, we should install the following in addition to the default SLF 4 OS , on the eight dedicated Minos nodes : Condor installation to match the Minos Cluster, using the minos25 schedd. Configurations should probably be for 12 VM's per host ( 30% oversubscription ) load average limit of 20 ( generous ) no preemption no suspension AFS Accounts via NIS from minos01/02 , to match the Minos Cluster Allow interactive logins, so that people can do 'kcroninit' Timeline : Install the above within a week after initial acceptance burnin. ----------------------------------------------------- Timm asked about selection of the particular nodes Berman asked about Condor expert support Installation of condor Support level Updated the document https://minos-docdb.fnal.gov:440/cgi-bin/RetrieveFile?docid=3776&version=2&filename=minosanalysis.txt ####### # CVS # ####### Note contact information for cdcvs migration, sforrest Stanley Forrester ( UC Davis, now a contractor ) x4417 , was formerly x8473 ########## # SADDMC # ########## Started FARDET mcin declares ( see below ) ####### # SAM # ####### per nwest having problems with samLocate remotely, ran test : 08:53 NIT=0 while [ ${NIT} -lt 301 ] ; do (( NIT++ )) printf "${NIT} " sleep 1 samLocate --file=F00018000_0000.mdaq.root \ --wsdl=http://www-numi.fnal.gov/sam_web_services/wsdl/DataFileService.wsdl.xml done ran cleanly ########## # CONDOR # ########## Try to control chatter, with when_to_transfer_output = ON_EXIT This works ! Created tinywr.run for this test Creating probe and probe.run tests OK not too thorough yet. Note, that once a job is held, then released, it may not run until an other job is submitted. Rediscovered that to use kcron, this must be the EXECUTEABLE. Adjusted probe.run appropriately. 
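For reference, a minimal submit-file sketch along the lines of the probe.run test, assuming stock Condor submit syntax ; the file name, argument path and output names here are illustrative, not the actual probe.run contents :
# illustrative only - kcron is the executable, the real script is its argument
cat > probestyle.run << 'EOF'
universe                = vanilla
executable              = /usr/krb5/bin/kcron
arguments               = $ENV(HOME)/minos/scripts/probe
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = probe.$(Cluster).$(Process).out
error                   = probe.$(Cluster).$(Process).err
log                     = probe.log
queue
EOF
condor_submit probestyle.run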
########## # CONDOR # ########## Testing independence of tokens Submitted probe contining embedded 20*tiny, to minos10 Submitted another 2 minutes later Token expiration times are 2 minutes apart, both at the start and the end of the jobs ==> no interference ============================================================================= 2007 10 25 ########## # SADDMC # ########## Shift the logs into /minos/scratch/kreymer cd /afs/fnal.gov/files/home/room1/kreymer/minos/log mkdir -p /minos/scratch/kreymer/log cp -vax saddmc /minos/scratch/kreymer/log/saddmc mv saddmc saddmc_old ln -s /minos/scratch/kreymer/log/saddmc saddmc ########## # SADDMC # ########## export SAM_ORACLE_CONNECT="samdbs/..." DET=near VEG=daikon_00 DIR=L010000N ./saddmc --verify -n 1 ${VEG} ${DET}/${VEG}/${DIR}/100 ./saddmc --verify -n 1 ${VEG} ${DET}/${VEG}/${DIR}/* ./saddmc --declare -n 1 ${VEG} ${DET}/${VEG}/${DIR}/100 sam get metadata --file=n13011007_0007_L010000N_D00.reroot.root sam locate n13011007_0007_L010000N_D00.reroot.root N E A R DET=near MINOS26 > ls /pnfs/minos/mcin_data/${DET} | grep dai daikon_00 daikon_01 daikon_03 daikon_04 DET=near VEGS='daikon_00 daikon_01 daikon_03 daikon_04' for VEG in ${VEGS} ; do for DIR in `ls /pnfs/minos/mcin_data/${DET}/${VEG} | sort` ; do echo ${VEG} ${DIR} #./saddmc --verify -n 1 ${VEG} ${DET}/${VEG}/${DIR}/* ./saddmc --declare ${VEG} ${DET}/${VEG}/${DIR}/* done 2>&1 | tee -a /minos/scratch/kreymer/log/saddmc/prd-${DET}-${VEG}-${DIR}.log done STARTED Thu Oct 25 21:37:11 2007 FINISHED Fri Oct 26 01:33:33 2007 grep -v declared ../log/saddmc/prd*.log | grep -v Needed | grep -v Treating | less grep -v declared ../log/saddmc/prd-${DET}*.log | grep -v "Needed\|Treating\|Declaring\|Scanning\|MODE" | less F A R DET=far MINOS26 > ls /pnfs/minos/mcin_data/${DET} | grep dai daikon_00 daikon_01 daikon_02 daikon_03 daikon_04 DET=far VEGS='daikon_00 daikon_01 daikon_02 daikon_03 daikon_04' for ... done as above STARTED Fri Oct 26 13:29:15 2007 FINISHED Fri Oct 26 14:04:10 2007 ########## # DC2NFS # ########## Dated version dc2nfs.20071025 - takes single -d argument for path BEAM DATA ( anticipating needs of beam group soon ) $ AFSS/dc2nfs -d beam_data 2>&1 | tee -a /tmp/dc2nfs.beam_data.log STARTING Thu Oct 25 12:03:59 CDT 2007 Running dc2nfs for DATA beam_data Processing 37 months ... STARTED Thu Oct 25 12:03:59 CDT 2007 FINISHED Thu Oct 25 17:00:56 CDT 2007 ####### # NFS # ####### http://osdir.com/ml/linux.nfs/2004-05/msg00108.html http://oss.sgi.com/projects/xfs/ 13:10 email to minos-admin regarding /minos filesystem sizes The NFS mounts of /minos/data and /minos/scratch seem to be working fine, and quota shows roughly the expected quotas. But df shows a device size much smaller than the expected size of about 20 to 10 TBytes for data and scratch MINOS26 > df -h /minos/* Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 3.1T 2.3T 846G 73% /minos/data minos-nas-0.fnal.gov:/minos/scratch 851G 5.7G 846G 1% /minos/scratch Is it possible that somehow we have made NFS V2 client mounts ? The fstab entries contain nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 The man page for nfs mentions a nfsvers=3 option, not vers=3. 
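One way to check what the client actually negotiated, rather than what fstab asked for ( a sketch, assuming the standard Linux nfs-utils tools on the client ) :
# the kernel's view of the live mount options, including the NFS version
grep minos /proc/mounts
# nfsstat -m reports the same per-mount flags in a more readable form
nfsstat -m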
============================================================================= 2007 10 24 ########## # DC2NFS # ########## BEAM DATA ( anticipating needs of beam group soon ) BMOS=`cd /pnfs/minos/beam_data ; ls` for MO in ${BMOS} ; do ./stage -d -p 0 beam_data/${MO} ; done interrupted, many files missing ./volumes vols ./volumes beam_data VO4933 VO7427 VO8433 VO8538 VO8976 VO9835 VOB445 VOB557 VOC009 BVOLS=` ./volumes beam_data` for VOL in ${BVOLS} ; do ./stage -d -p 0 ${VOL} ; done | tee /tmp/beamvols MINOS26 > grep "Staging files\|Needed" /tmp/beamvols | tr -d '.' Staging files from tape VO4933 Needed 659/814 Staging files from tape VO7427 Needed 17/46 Staging files from tape VO8433 Needed 0/455 Staging files from tape VO8538 Needed 3/444 Staging files from tape VO8976 Needed 1/79 Staging files from tape VO9835 Needed 0/21 Staging files from tape VOB445 Staging files from tape VOB557 Needed 0/707 Staging files from tape VOC009 Needed 0/45 Let's restore the missing files for VOL in ${BVOLS} ; do ./stage -w ${VOL} ; done | tee /tmp/beamstage For reference, setting the scale, du -sm /pnfs/minos/beam_data 259002 /pnfs/minos/beam_data ########## # CONDOR # ########## Steve Timm corrected a typo in the condor_config files, Mark Schumitz restarted all the daemons. Jobs are running ! tiny.run - single job tiny.run3 - 3 jobs tiny.run50 - 50 jobs ######## # PNFS # ######## Corrected directories for new Monte Carlo ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010000N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010170N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010185N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010200N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L100200N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L150200N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L250200N write ./pnfsdirs far cedar_phy_bhcurv daikon_04 L010185N write ./pnfsdirs far cedar_phy_bhcurv daikon_04 L250200N write removed bad mcin for DIR in L010000 L010170 L010185 L010200 L100200 L150200 L250200 ; do ls -l /pnfs/minos/mcin_data/near/daikon_04/${DIR} ; done for DIR in L010000 L010170 L010185 L010200 L100200 L150200 L250200 ; do rmdir /pnfs/minos/mcin_data/near/daikon_04/${DIR} ; done for DIR in L010185 L250200 ; do ls -l /pnfs/minos/mcin_data/far/daikon_04/${DIR} ; done for DIR in L010185 L250200 ; do rmdir /pnfs/minos/mcin_data/far/daikon_04/${DIR} ; done removed bad mcout for DIR in L010000 L010170 L010185 L010200 L100200 L150200 L250200 ; do ls -lr /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/${DIR} ; done for DIR in L010000 L010170 L010185 L010200 L100200 L150200 L250200 ; do rm -r /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/${DIR} ; done ####### # AFS # ####### Changed acl's for 5 volumes for nue analysis group, created 2007 05 23 d241 d242 d243 d244 d245 minos:admin rlidwka boehm rlidwka msanchez rlidwka Created minos:nue group NEWGROUP=nue pts creategroup -name kreymer:${NEWGROUP} group kreymer:nue has id -2487 NEWUSERS='boehm msanchez' for GUSER in ${NEWUSERS} ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts setfields kreymer:${NEWGROUP} -access SOMar pts membership kreymer:${NEWGROUP} pts examine kreymer:${NEWGROUP} Name: kreymer:nue, id: -2487, owner: kreymer, creator: kreymer, membership: 2, flags: SOMar, group quota: 0. 
pts chown kreymer:${NEWGROUP} minos pts examine minos:${NEWGROUP} pts membership minos:${NEWGROUP} Added the rest of the buckley:nue group NEWUSERS='annah1 buckley cbs cherdack howcroft ochoa pawloski tjyang scavan vahle' for GUSER in ${NEWUSERS} ; do pts adduser -user ${GUSER} -group minos:${NEWGROUP} ; done Added this acl cd $MINOS_DATA for DIR in d241 d242 d243 d244 d245 ; do fs listacl ${DIR} ; done for DIR in d241 d242 d243 d244 d245 ; do fs setacl -dir ${DIR} -acl minos:nue rlidwka ; done ########### # afs2nfs # ########### created afs2nfschk to check input file existence and non-0 size asf2nfschk -i 2006-10_near.R1_18_4.index cd /afs/fnal.gov/files/data/minos/d10/indexes for INDEX in *.index ; do ~/minos/scripts/afs2nfschk -i ${INDEX} ; done 2>&1 | tee afs2nfschk.log Created afs2nfschk.sum with a summary of just damaged indexes 2005-07_far.R1.16a.index 2005-11_near.R1_24b.index 2006-09_near.R1_18_4.index 2006-10_near.R1_18_4.index 2007-01_near.cedar.index BAD_mc_far.daikon_00.cedar.index mc_far.carrot.R1_24.index mc_near.R1_18_2.index mc_near.daikon_00.cedar.index for INDEX in ${BIND} ; do ~/minos/scripts/afs2nfschk -i ${INDEX} ; done | grep index 15/ 15 2005-07_far.R1.16a.index 1/ 339 2005-11_near.R1_24b.index 14/ 409 2006-09_near.R1_18_4.index 17/ 629 2006-10_near.R1_18_4.index 1/ 601 2007-01_near.cedar.index 781/ 816 BAD_mc_far.daikon_00.cedar.index 1/ 41 mc_far.carrot.R1_24.index 1/ 2289 mc_near.R1_18_2.index 1/ 10923 mc_near.daikon_00.cedar.index mkdir BAD for INDEX in ${BIND} ; do cp -a ${INDEX} BAD/${INDEX} ; done rm 2005-07_far.R1.16a.index rm BAD_mc_far.daikon_00.cedar.index Edited the remaining files to remove missing files. for INDEX in ${BIND} ; do ( nedit ${INDEX} & ) ; done for INDEX in ${BIND} ; do ~/minos/scripts/afs2nfschk -i ${INDEX} ; done Looks OK now ########### # afs2nfs # ########### Corrected .bntp_data directory name mv /minos/data/reco_far/cedar_phy/bntp_data /minos/data/reco_far/cedar_phy/.bntp_data ############ # MCIMPORT # ############ Assisting boehm move of nue pseudo MC files to PNFS. These started as daikon_00 files reco'd with cedar, then muon removed and electrons simulated replacing the mu. I have suggested names like n13011432_0000_L010185N_D03_D00cedarMRE.reroot.root ============================================================================= 2007 10 23 ####### # AFS # ####### See entry on 2007 10 04 Repeated scan for anyuser, just one problem-user Repeated scan for authuser, See /home/kreymer/afsscan.log Got a response from Chadwick before I sent the note to nightwatch. Must be some tachyons round here. ############ # MCIMPORT # ############ Keepin' up, 135 GB minimum space last night. Messages still a bit messy fron CRON pid, OOPS - found /local/scratch26/mindata/CRON/mcimport.pid PID TTY TIME CMD 19913 ? 00:00:00 mcimport 08:30 Cleaned up these messages, hacked into mcimport.20071022 ########## # DCACHE # ########## Email from Sue Kasahara, DCache/Root read rates with root HEAD are as good as old xrootd, aside from a 24 second real time delay, reading concatenated sntp files. 
Using tree->SetCacheSize(50000000) and/or TTreeCache::SetLearnEntries(1) ########### # afs2nfs # ########### Reviewing afs2nfs.log Leaving concatenated <\m> statuslines intact, makes it easier to find diagnostic messages via nedit 2006-10_near.R1_18_4.index recodata15/N00011001_0004.spill.sntp.R1_18_4.0.root 2005-11_near.R1_24b.index 187/ 339 recodata55/N00009146_0005.spill.sntp.R1_24b.0.root 2007-01_near.cedar.index 1/ 601 recodata77F00037242_0003.spill.sntp.cedar.0.root OOPS - copied all files to each stream target directory, for 2005-04_far.cedar_phy.index /minos/data/reco_far/cedar_phy/bntp_data/2005-04 2005-05_far.cedar_phy.index 2005-06_far.cedar_phy.index 2005-07_far.cedar_phy.index 2005-08_far.cedar_phy.index 2005-09_far.cedar_phy.index 2005-10_far.cedar_phy.index 2005-11_far.cedar_phy.index 2005-12_far.cedar_phy.index 2006-01_far.cedar_phy.index 2006-02_far.cedar_phy.index 2006-03_far.cedar_phy.index 2006-06_far.cedar_phy.index 2006-07_far.cedar_phy.index 2006-08_far.cedar_phy.index 2006-09_far.cedar_phy.index 2006-10_far.cedar_phy.index 2006-11_far.cedar_phy.index 2006-12_far.cedar_phy.index 2007-01_far.cedar_phy.index 2007-02_far.cedar_phy.index 2007-03_far.cedar_phy.index MONS=' 2005-05 2005-06 2005-07 2005-08 2005-09 2005-10 2005-11 2005-12 2006-01 2006-02 2006-03 2006-06 2006-07 2006-08 2006-09 2006-10 2006-11 2006-12 2007-01 2007-02 2007-03' for MON in ${MONS} ; do echo ${MON} ls /minos/data/reco_far/cedar_phy/bntp_data/${MON}/*.root | wc -l ls /minos/data/reco_far/cedar_phy/bntp_data/${MON}/*.sntp.*.root | wc -l ls /minos/data/reco_far/cedar_phy/sntp_data/${MON}/*.bntp.*.root | wc -l done for MON in ${MONS} ; do echo ${MON} rm /minos/data/reco_far/cedar_phy/bntp_data/${MON}/*.sntp.*.root rm /minos/data/reco_far/cedar_phy/sntp_data/${MON}/*.bntp.*.root done for MON in ${MONS} ; do echo ${MON} ls /minos/data/reco_far/cedar_phy/bntp_data/${MON}/*.bntp.*.root | wc -l ls /minos/data/reco_far/cedar_phy/sntp_data/${MON}/*.sntp.*.root | wc -l done Now clean up the R1_18 overwriting for DIR in cbdl cnts sntp ; do ls /minos/data/reco_far/R1_18/cbdl_data/2005-04 | wc -l ; done rm /minos/data/reco_far/R1_18/cbdl_data/2005-04/*.sntp.*.root rm /minos/data/reco_far/R1_18/cbdl_data/2005-04/*.cnts.*.root rm /minos/data/reco_far/R1_18/cnts_data/2005-04/*.cbdl.*.root rm /minos/data/reco_far/R1_18/cnts_data/2005-04/*.sntp.*.root rm /minos/data/reco_far/R1_18/sntp_data/2005-04/*.cbdl.*.root rm /minos/data/reco_far/R1_18/sntp_data/2005-04/*.cnts.*.root ============================================================================= 2007 10 22 ######## # GRID # ######## 105638 /minos is still mounted on FNALU on flxi04 and 6 we need at least flxi07, for IA64 testing ######## # GRID # ######## 104371 account request for fnpcsrv1 - awaiting information from me ? Art, we will take this request into consideration. Final approval or denial will be based on the details of how the Open Science Enclave security plan is implemented. I suggest you visit the meeting at 3 PM today. 
########### # afs2nfs # ########### Ran till about 10:00 Sunday 21 Oct log file ran out of quota ( d10 ) Captured most of it from the screen, saved in /tmp/afs2nfs.log First clear some space on recodata01 mc_cosmic.bfld201.cedar.index mc_far.R1.14.index mc_far.carrot.cedar.index Copying 137 c* files from 01 to 113 ( 7.8 GB ) grep recodata01 indexes/*.index | wc -l 137 14:03 - 14:23 for FILE in recodata01/* ; do cp -a ${FILE} recodata113/ done for FILE in recodata01/* ; do echo ${FILE} ; diff ${FILE} recodata113/ done nedit mc_cosmic.bfld201.cedar.index mc_far.R1.14.index mc_far.carrot.cedar.index changed recodata01 to recodata113 grep recodata113 mc_cosmic.bfld201.cedar.index mc_far.R1.14.index mc_far.carrot.cedar.index | wc -l 137 ############ # MCIMPORT # ############ RAL claimed we were not keeping up, I see no evidence of that Looked at ganglia plots, see minos26free.20071022.png Thursday there was a nice clean run, reducing free space from 230 to 70 GB in about 14 hours, 150 GB/14 hours or 2.9 MBytes/second. Concatenation writes to DCache at 6 to 10 MB/second. Concatenation writes local tars at a similar rate. So we should just keep up. 18:47 CDT Fri 19 Oct 'Not keeping up' free disk down to 38 GB 19:34 300 running jobs, holding rest 20:08 cronjob changed from 6 hours to 4 hour interval 09:00 100 GB free 13:00 180 GB free 19:00 noted that jobs had been held. Issues raised in email : Why delay for second pass/clearing of files ? Nick requests turorial on copy to /pnfs/minos/fermigrid/volatile Why not use FermiGrid SE for volatile storage, to clear minos26 What if all farms run at once. mcimport.20071022 - exits quietly if CRON job is still running. 17:45 ln -sf mcimport.20071022 mcimport # was mcimport.20070912 Updated crontab.dat to run every 2 hours, saved as scripts/crontab.mcimport.20071011 ============================================================================= 2007 10 19 ########### # afs2nfs # ########### $ ./afs2nfs -i 2005-11_far.cedar.index STARTING Fri Oct 19 11:06:06 CDT 2007 Running dc2nfs for INDEX 2005-11_far.cedar.index MO 2005-11 DET far REL cedar FILES = 720 STREAMS = sntp 720/ 720 /minos/data/reco_far/cedar/sntp_data/2005-11 16384 recodata59/F00033256_0005.spill.sntp.cedar.0.root 1.2G /minos/data/reco_far/cedar/sntp_data/2005-11 STARTED Fri Oct 19 11:06:06 CDT 2007 FINISHED Fri Oct 19 11:08:05 CDT 2007 Oops, cannot always get release from file name or index name, due to embedded dots in older releases. Allow this on the command line via -r OK, let's look at the big picture. Presently, in /minos/data have only reco_far R1.16 and R1_16a. find . -name 200\*index -size +0 -exec basename {} \; | grep R1.16 2005-03_far.R1.16a.index 2005-07_far.R1.16a.index find . -name mc\*index -size +0 -exec basename {} \; | grep 'R1\.' mc_far.R1.14.index OK, here are the reco dir's without a . find . -name 20\*index -size +0 -exec basename {} \; | grep -v 'R1\.' 
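The 'exits quietly if CRON job is still running' behaviour is presumably a pid file guard ; a minimal sketch of that pattern, reusing the CRON/mcimport.pid path already noted in this log, everything else here illustrative :
PIDFILE=/local/scratch26/mindata/CRON/mcimport.pid
if [ -r "${PIDFILE}" ] && kill -0 "$(cat ${PIDFILE})" 2>/dev/null ; then
    exit 0                      # earlier instance still running, leave quietly
fi
echo $$ > "${PIDFILE}"          # record this instance
# ... import work goes here ...
rm -f "${PIDFILE}"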
RCDIRS=`find /afs/fnal.gov/files/data/minos/d10/indexes -name 20\*index -size +0 -exec basename {} \; | grep -v 'R1\.'` Testing rate calculation ./afs2nfs -i 2005-11_near.cedar.index observed rates are around 16 MBytes/second 41G /minos/data/reco_near/cedar/sntp_data/2005-11 STARTED Fri Oct 19 14:34:51 CDT 2007 FINISHED Fri Oct 19 15:20:39 CDT 2007 Oops, forgot to include the rate calculation, let's try something shorter $ ./afs2nfs -i 2006-01_near.cedar_phy.index 15/ 15 /minos/data/reco_near/cedar_phy/sntp_data/2006-01 16274 recodata109/N00009714_0000.spill.sntp.cedar_phy.0.root STREAM sntp rate 19151 9.4G /minos/data/reco_near/cedar_phy/sntp_data/2006-01 STARTED Fri Oct 19 15:30:19 CDT 2007 FINISHED Fri Oct 19 15:38:50 CDT 2007 Adjusted format, and changed to use of dd instead of cp $ ./afs2nfs -i 2006-02_near.cedar_phy.index ... STREAM sntp rate 17366 9.1G /minos/data/reco_near/cedar_phy/sntp_data/2006-02 STARTED Fri Oct 19 15:39:51 CDT 2007 FINISHED Fri Oct 19 15:48:59 CDT 2007 Now trying a blocksize equal to the file size $ ./afs2nfs -i 2006-07_near.cedar_phy.index ... STREAM sntp rate 16344 3.2G /minos/data/reco_near/cedar_phy/sntp_data/2006-07 STARTED Fri Oct 19 17:23:50 CDT 2007 FINISHED Fri Oct 19 17:27:09 CDT 2007 Cleaned up the formatting $ ./afs2nfs -i 2006-08_near.cedar_phy.index ... STREAM sntp rate 13822 $ ./afs2nfs -i 2006-09_near.cedar_phy.index STREAM sntp rate 20154 Adjust format a bit more $ ./afs2nfs -i 2006-10_near.cedar_phy.index STREAM sntp rate 19693 8.7G /minos/data/reco_near/cedar_phy/sntp_data/2006-10 And a bit more $ ./afs2nfs -i 2006-11_near.cedar_phy.index STREAM sntp rate 19723 8.1G /minos/data/reco_near/cedar_phy/sntp_data/2006-11 Let's rock and roll ! tokens { for INDEX in ${RCDIRS} ; do ./afs2nfs -i ${INDEX} ; done } 2>&1 \ | tee -a /afs/fnal.gov/files/data/minos/d10/indexes/afs2nfs.log Oops, format is not quite clean, interrupted ( to test interruptions ) rm /minos/data/reco_far/cedar/sntp_data/2006-01/F00033499_0018.spill.sntp.cedar.0.root $ ./afs2nfs -i 2006-12_near.cedar_phy.index STREAM sntp rate 18629 11G /minos/data/reco_near/cedar_phy/sntp_data/2006-12 And once again a full run, at 18:23 Note - Have reverted to cp, seems to work well with the AFS source. ########### # MINOS25 # ########### SLF 4.4 upgrade started around 10:25, ganglia up around 10:57 13:10 - schmitz is trying to get condor configured. 
13:50 - timm has root access ####### # UPS # ####### for NODE in $BNODES ; do bsub -R ${NODE} "ls -l /usr/local/etc/setups.sh" ; done flxb10 local flxb11 local flxb13 local flxb16 local flxb17 local flxb18 local flxb19 local flxb20 local flxb21 local flxb22 local flxb23 local flxb24 local flxb25 local flxb26 local flxb27 local flxb28 local flxb29 local flxb30 local flxb31 fnal flxb32 fnal flxb33 fnal flxb34 fnal flxb35 fnal Summary, per 2007 09 18 survey : /local/ups seems preferred, Minos Cluster at SL4 flxi02 flxi06 flxb at SL 3 /fnal/ups seems to have crept in recently minos11 post reinstall flxi04/5/7 flxb at SL 4 ############ # mcimport # ############ hacked ganglia/minos26 into DH web page ln -sf dhmain.20071019.html dhmain.html # was dhmain.20070501.html ============================================================================= 2007 10 18 ########### # afs2nfs # ########### Moves existing files from $MINOS_DATA/d10/* to /minos/data, based on indexes/*.index files find /afs/fnal.gov/files/data/minos/d10/indexes/ -name \*index -exec ls -l {} \; | wc -l 297 -size 0 99 -size +0 198 -size +0 -name mc\* 43 -size +0 -name 20\* 154 This is quite a mess, there are more than just sntp files here, sntp cnts bntp snts cbdl File names for mc are not at all uniform, like mc_atmos.bfld201.R1_18_2.index mc_far .R1_18_2.index mc_far .carrot .R1_18_2.index mc_far .v17 .R1_18_2.index Let's proceed with non-mc data first, will have to get target path file-by-file, parsing from names like F00037789_0000.spill.bntp.cedar_phy.0.root ######### # BATCH # ######### Only 10 of the 40 cores in FNALU batch systems were upgraded to SLF 4. I announced my intent to ask for the upgrade of the rest, to minos_software_discussion ######## # CRON # ######## Global scan of crontabs, triggered by find on minos25 15 09 * * * ${HOME}/minos/scripts/prehour > /tmp/prehour.log This was just testing the hour selection in predator, did not do anything but print. for NODE in $NODES ; do printf "${NODE} " ; ssh ${NODE} 'crontab -l' ; done minos01 MAILTO=kreymer@fnal.gov 15 19 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/cfl 01 23 * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/afsfree quiet 05 23 * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/afssum quiet minos06 02 17 * * * date >> /var/tmp/FOO minos26 MAILTO=kreymer@fnal.gov 06 1-23/2 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/predator 10 05 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/gridappsync Removed the minos06 crontab ########### # MINOS25 # ########### The system was upgraded to SLF 4 long before the rest of the cluster. Joe Boyd sees that is running SELinux, and has other configuration problems. Will reinstall it from scratch tomorrow : Email to minos-admin at 15:30: Per our discussion, please schedule the reinstall of the OS on minos25 to match the other Minos Cluster SL 4 systems. Let's do this as soon as possible, so that we can continue with the Condor work. Let me know a specific time, and I'll announce it to Minos. We have already discussed the usual local file issues kcron - mine was the only one, can drop it mail - only lsfadmin had email, can be dropped crontab - mine was the only one, I have removed it. ganglia - please restart after the upgrade lsf - can omit this, as we are moving to Condor. Condor will need to be reinstalled and reconfigured to match the existing configuration, after the upgrade. 
####### # SAM # ####### SAMDIM="PARENT_BY_NAME F00030612_0005.mdaq.root" sam list files --dim="${SAMDIM}" Files: F00030612_0005.all.snts.R1.14.root F00030612_0005.all.snts.R1_18.0.root ... F00030612_0005.all.cand.R1_18.0.root F00030612_0005.all.cnts.R1_18.0.root File Count: 32 Average File Size: 19.72MB Total File Size: 630.92MB Total Event Count: 663136 ######## # GRID # ######## Ticket 105784 Please have the Bluearc served /minos/scratch and /minos/data volumes mounted on the FNAL_GPFARM nodes ( including fnpcsrv1 etc. ) These are already mounted on the Minos Cluster and Server nodes, and all FNALU Batch nodes. /minos/scratch will allow analysis users to use existing test releases and files. /minos/data will be evaluated for possible use by Farm processing, and provides access to analysis ntuples. ####### # VDT # ####### Per Timm, need to remove the trailing "32" from vomses, and change -voms fermilab: to -voms fermilab:/fermilab Did not yet remove the 32, but fixing the -voms argument gets a proxy. ########### # SCRATCH # ########### 07:45 Solution: ettab@fnal.gov sent this solution: User directories have been created under /minos/scratch ######## # GRID # ######## MINOS25 > condor_submit tiny.run Timm finds that startd's are trying and failing to connect ######## # GRID # ######## For normal useage, see extended introductory user tutorial at http://www.cs.wisc.edu/condor/tutorials/barcelona-2006/ ########## # ORACLE # ########## minosora1 upgrade started at 07:24. Contact resumed at 08:44 project failed at 09:50, lost connection. OK at 09:10 10:05 - notified by mmihalek ============================================================================= 2007 10 17 ########### # SCRATCH # ########### Requested directory creation 14:15 Ticket 105745 USERS=`ypcat passwd | cut -f 1 -d ':' | sort` echo $USERS for USER in $USERS ; do printf "${USER}\n" ; finger ${USER}@fnal.fnal.gov | grep failed ; done condor lsfadm mindata minoscvs mssg products sam samread vanconan # Create all the directories for SUSER in ${USERS} ; do ( su ${SUSER} ; mkdir -p /minos/scratch/${SUSER} ) done # Remove a few stray users who do not need scratch for SUSER in condor lsfadm mssg products samread ; do ( su ${SUSER} ; rmdir /minos/scratch/${SUSER} ) done DONE - 2007 10 18 07:45 see above ########## # CONDOR # ########## schmitz has updated the configs, and started condor globally MINOS25 > ps -flu condor F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 5 S condor 26687 1 0 76 0 - 2100 - 11:11 ? 00:00:00 /opt/condor/sbin/condor_master 4 S condor 26688 26687 0 76 0 - 2252 - 11:11 ? 00:00:00 condor_collector -f 4 S condor 26689 26687 0 76 0 - 2072 - 11:11 ? 00:00:00 condor_negotiator -f 4 S condor 26690 26687 0 76 0 - 2130 - 11:11 ? 00:00:00 condor_schedd -f This is what Timm says we should expend on the schedd system. condor_q - runs and reports no jobs ########## # ORACLE # ########## mmihalek restarted gmond on minosora1/3, had not been logging data since Oct 11. This restored the data flow. 
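Referring back to the CONDOR check above, a compact way to confirm that all four schedd-node daemons are up ( a sketch ; assumes pgrep from procps is available ) :
for D in condor_master condor_collector condor_negotiator condor_schedd ; do
  pgrep -u condor -f ${D} > /dev/null && echo ${D} OK || echo ${D} MISSING
done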
######## # GRID # ######## FNALU batch /minos mounts are complete and tested, see 2007 10 16 ####### # VDT # ####### vdt 1.6.1 is being used on fnpcsrv1 This exists in UPD upd install -j vdt v1_6_1_0 upd install -j pacman v3_19 ups declare -c pacman v3_19 -f NULL ups tailor vdt v1_6_1_0 2>&1 | tee -a /tmp/vdtinstall.log FAILED - needs perl > 5.8.0 ups undeclare -Y vdt v1_6_1_0 upd install -j vdt v1_6_1_0 unsetup perl ups tailor vdt v1_6_1_0 2>&1 | tee -a /tmp/vdtinstall.log 11:10 - 14:12 Stuck after 3 hours at : Installing Condor Globus EDG-Make-Gridmap MyProxy VOMS (on some systems this may take more than 30 min) Installing package [CA-Certificates-Base]. MINOS26 > ps xfwww ... 4517 pts/6 S+ 0:00 \_ ups tailor vdt v1_6_1_0 4519 pts/6 S+ 0:00 | \_ sh -c . /tmp/file5krZVp 4542 pts/6 S+ 0:00 | \_ /bin/sh /afs/fnal.gov/files/code/e875/general/ups/prd/vdt/v1_6_1_0/Linux/ups/install.sh 6099 pts/6 S+ 0:15 | \_ python /afs/fnal.gov/files/code/e875/general/ups/prd/pacman/v3_19/NULL/bin/pacman -install http://vdt.cs.wisc.edu/vdt_161_cache:Condor http://vdt.cs.wisc.edu/vdt_161_cache:Globus http://vdt.cs.wisc.edu/vdt_161_cache:EDG-Make-Gridmap http://vdt.cs.wisc.edu/vdt_161_cache:MyProxy http://vdt.cs.wisc.edu/vdt_161_cache:VOMS 4518 pts/6 S+ 0:00 \_ tee -a /tmp/vdtinstall.log Try again, this time the current vdt v1_8_1_0 ( vdt.cs.wisc.edu ) upd install -j vdt v1_8_1_1 date ups tailor vdt v1_8_1_1 2>&1 | tee -a /tmp/vdt1811.log date GGGGGRRRRRRRRRRRRRR cannot setup pacman, needs upd install -j python v2_4_2_sam unsetup perl setup pacman date ups tailor vdt v1_8_1_1 2>&1 | tee -a /tmp/vdt1811.log 14:43 - 14:54 pacman version [3.19] must be >= [3.20]. upd install -j pacman v3_20 ups undeclare -Y vdt v1_8_1_1 upd install -j vdt v1_8_1_1 date ups tailor vdt v1_8_1_1 2>&1 | tee -a /tmp/vdt1811a.log date 14:57 - STILL FAILS - garaozli advises me that these kits probably do no work.. Did direct installation into /minos/scratch/kreymer/VDT mkdir -p /minos/scratch/kreymer/VDT cd /minos/scratch/kreymer/VDT pacman -get VDT:VOMS-Client ... Choices: l (local) - install into $VDT_LOCATION/globus/share/certificates n (no) - do not install l . setup.sh echo $VDT_LOCATION /minos/scratch/kreymer/VDT voms-proxy-init -noregen -voms fermilab: -valid 168:0 --debug Detected Globus version: 22 Unspecified proxy version, settling on Globus version: 2 Number of bits in key :512 Using configuration file /minos/scratch/kreymer/VDT/glite/etc/vomses Using configuration file /home/condor/execute/dir_11128/userdir/glite/etc/vomses Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Copied similar file from fnpcsrv1 to glite/etc/vomses, from /usr/local/vdt-1.6.1/glite/etc/vomses MINOS25 > voms-proxy-init -noregen -voms fermilab: -valid 168:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Failed Error: fermilab: Unable to satisfy Request! None of the contacted servers for fermilab were capable of returning a valid AC for the user. Note that fnpcsrv1 always complains about lack of $prefix/etc/vomses Tried export prefix=/minos/scratch/kreymer/VDT/glite no change Here is a fresh test : MINOS25 > cd /minos/scratch/kreymer/VDT MINOS25 > . 
setup.sh MINOS25 > klist -f Ticket cache: /tmp/krb5cc_1060_Tf3886 Default principal: kreymer@FNAL.GOV Valid starting Expires Service principal 10/17/07 17:24:12 10/18/07 19:15:46 krbtgt/FNAL.GOV@FNAL.GOV renew until 10/24/07 17:15:46, Flags: FfRA 10/17/07 17:24:13 10/18/07 19:15:46 afs@FNAL.GOV renew until 10/24/07 17:15:46, Flags: FfRA MINOS25 > kx509 MINOS25 > kxlist -p Service kx509/certificate issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA subject= /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/0.9.2342.19200300.100.1.1=kreymer serial=70F791 hash=3fb2f7c8 MINOS25 > voms-proxy-init -noregen -voms fermilab: -valid 168:0 -debug Detected Globus version: 22 Unspecified proxy version, settling on Globus version: 2 Number of bits in key :512 Using configuration file /minos/scratch/kreymer/VDT/glite/etc/vomses Using configuration file /home/condor/execute/dir_11128/userdir/glite/etc/vomses Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Files being used: CA certificate file: none Trusted certificates directory : /minos/scratch/kreymer/VDT/globus/TRUSTED_CA Proxy certificate file : /tmp/x509up_u1060 User certificate file: /tmp/x509up_u1060 User key file: /tmp/x509up_u1060 Output to /tmp/x509up_u1060 Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer Using configuration file /minos/scratch/kreymer/VDT/glite/etc/vomses Using configuration file /home/condor/execute/dir_11128/userdir/glite/etc/vomses Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Failed Error: fermilab: Unable to satisfy Request! None of the contacted servers for fermilab were capable of returning a valid AC for the user. MINOS25 > date Wed Oct 17 17:25:10 CDT 2007 Try installation into /grid/app/minos/products/VDT setup pacman v3_20 mkdir -p /grid/app/minos/products/VDT cd /grid/app/minos/products/VDT pacman -get VDT:VOMS-Client mkdir -p /grid/app/minos/products/VDT/glite/etc chmod 755 /grid/app/minos/products/VDT/glite/etc . setup.sh voms-proxy-init -noregen -voms fermilab: -valid 168:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses VOMS Server for fermilab not known! cp -a /minos/scratch/kreymer/VDT/glite/etc/vomses glite/etc/vomses This works the same as the /minos/scratch copy on minos25, but on fnpcsrv1, returns VOMS Server for fermilab not known! Weird ! Note, there is documentation for the VOMS client installation at http://vdt.cs.wisc.edu/VOMS-documentation.html ######## # GRID # ######## Around 09:00, processes touching /minos/data or /minos/scratch are getting hung up. can ping servers ping -c 1 -w 2 minos-nas-0 09:23 /minos is OK again ! ######## # FARM # ######## fnpcsrv1 cannot perform simple commands like uptime cd /bin/ls /tmp MRTG data query brings up a message : 131.225.167.44 is connected to s-f-grid-fcc1 on port Gi0/1 Last detected on this switch at 2007/10/17/09:11 1 node is connected to port Gi0/1 of s-f-grid-fcc1. Looking Glass Error: Unknown area name for Device s-f-grid-fcc1 No plots available for any nodes on this switch. ( other switches are OK ) N.B. - all bluearc seems to have been affected, including CMS ============================================================================= 2007 10 16 ####### # X11 # ####### Ran 2 clean scans of Minos Cluster nodes, running gimp. No hangups. 
On minos26, saw several messages like executable not found: '/usr/lib/gimp/2.0/plug-ins/gap_frontends' ######## # GRID # ######## Helpdesk ticket 105638 fnalu-admin : Please mount the BlueArc served /minos/scatch and /minos/data areas on all FNALU interactive and batch systems. /minos/scratch should be writable by users. /minos/data should be exported and readonly on FNALU at present. Thanks ! 16:00 - mounted on all batch, and some interactive systems MINOS26 > BNODES='flxb10 flxb11 flxb13 flxb16 flxb17 flxb18 flxb19 flxb20 flxb21 flxb22 flxb23 flxb24 flxb25 flxb26 flxb27 flxb28 flxb29 flxb30 flxb31 flxb32 flxb33 flxb34 flxb35' for NODE in $BNODES ; do bsub -R ${NODE} "hostname > /minos/scratch/kreymer/BNODES/${NODE}" ; done Failed due to directory missing on flxb13 flxb27 bsub -R flxb13 'grep minos /etc/fstab' for NODE in $BNODES ; do bsub -R ${NODE} "hostname > /minos/data/mindata/kreymer/BNODES/${NODE}" ; done >>> 2007 10 17 Mounts have been updated on flxb13/27/35 for NODE in flxb13 flxb27 ; do AOK ! ######## # GRID # ######## ######## # FARM # ######## Tracking down reason for N00010639_0009.spill.sntp.cedar_phy.0.root pending sam list files --nosum --dim="data_tier raw-near and run_number 10639" | cut -f1 -d '.' | sort | wc -l 24 There are 20 subruns already written, plus 1 pending (_0009 ) But there are only 20 subruns expected 24 subruns in raw data 3 nospills ( 0/1/2 ) 1 badruns not present in nearcat ( 16 ) 20 subruns expected This throws off the simple minded logic of the script. The problem is that subrun 16 was written, but is still listed in the badruns list. ============================================================================= 2007 10 15 ######## # GRID # ######## Submitted timm's Condor plan to minos-admin, Ticket 105607 ########### # ROUNDUP # ########### Added SOCFILE for oracle admin connection cp -a AFSS/roundup.20070809 . ln -sf roundup.20070809 roundup # was roundup.20070803 ########## # DCACHE # ########## schubert cannot access /pnfs/minos/reco_near/R1_18_2/sntp_data/2005-05/N00007815_0000.spill.sntp.R1_18_2.0.root looks OK to me, bases on metadata. IFILE=N00007815_0000.spill.sntp.R1_18_2.0.root IPATH=minos/reco_near/R1_18_2/sntp_data/2005-05 DCPOR=24136 # unsecured DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} cd /local/scratch??/`whoami` dccp ${DFILE} TEST.dat # do the copy Still stuck after 10 minutes at 13:34, here is the login list : DCap01-stkendca2a-unknown-57776 DCap01-stkendca2a-unknown-57776 minos26.fnal.gov active Oct 15 13:24:54 Oct 15 13:24:54 1060/15611 DCap01-stkendca2a-unknown-57776 Arthur Kreymer ? ? ? ? open minos/reco_near/R1_18_2/sntp_data/2005-05/N00007815_0000.spill.sntp.R1_18_2.0.root My guess is that they have reconfigured pools, and this files needs restore. But the restore queue page dates from 13 Oct, at http://fndca3a.fnal.gov/dcache/RC.html I still see no Enstore activity on the DCache data plots today. See Enstore notes below 15:22 - there are idle drives, shubert's test file is restored to r-stkendca15a-6 ######## # FARM # ######## Tracking down N00010639_0016.spill.sntp.cedar_phy.0.root per rubin request /2007-06/cedar_phynear.log Files were written on Thu Jun 7 13:32:59 CDT 2007 wrote files 0, 10 17 BADRUNS N00010639_0009.cosmic.sntp.cedar_phy.0.root BADRUNS N00010639_0016.cosmic.sntp.cedar_phy.0.root File was written this morning to /pnfs/minos/reco_near/cedar_phy/sntp_data/2006-08 saddreco did not run due to lack of SAM_ORACLE_CONNECT Created new roundup, tried again. 
./roundup -w -r cedar_phy near ####### # SAM # ####### 11:00 Upgraded production dbserver allows parameters to be added cleanly allows query on CHILD_BY_NAME sam_db_srv_pkg v8_3_0 ( was sam_db_srv v7_6_1 ) sam_bootstrap v8_1_0 ( was v6_1_2, required for use of sam_db_srv_pkg ) sam_config v7_1_5 ( was v4_2_34 ) sam v8_2_0 ( was v7_6_5, on clients ) Updated sam on AFS ups declare -c sam v8_2_0 # was v7_6_5 The queries for listing parents now work, where they returned extra results before : FILE=F00030612_0005.spill.bntp.cedar_phy.0.root SAMDIM=" DATA_TIER raw-far \ and FULL_PATH like /pnfs/minos/fardet_data/2005-04 \ and FILE_NAME like F0003061% \ and CHILD_BY_NAME ${FILE} \ " SAMDIM="CHILD_BY_NAME ${FILE}" sam list files --dim="${SAMDIM}" --nosummary | sort F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root MINOS26 > sam get metadata --file=${FILE} \ | grep parents \ | tr "'" \\\n \ | grep root \ | sort F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root ####### # SAM # ####### Added the new MC parameters in production as previously done in dev and int. samadmin add param suite --param-file=MCPARAMS.py Param Category 'mc': ... paramType 'bfield': registered as type 'string' (new dimension 'mc.bfield') ... paramType 'volume': (no change) ... paramType 'beam': (no change) ... paramType 'split': (no change) ... paramType 'vtxregion': registered as type 'string' (new dimension 'mc.vtxregion') ... paramType 'release': (no change) ... paramType 'flavor': (no change) The parameters have good indexes, as verified with sqlplus. ########### # ENSTORE # ########### Ticket 105574 Sometime this morning the www-stken web pages stopped responding. I see no enstore transfers to FNDCA since 06:00, http://fndca2a.fnal.gov:8090/dcache/outplot?lvl=0&filename=billing-2007.10. daily.brd.png&day=15&fmt=lin The CMS tape activity monitor shows active LTO3 drives, but most 9940B drives have been stuck several hours : http://cmsdca.fnal.gov/cgi-bin/enstore_drives.sh 10:03 Some IP addresses were changed this morning. The enstore monitoring web pages for all three enstore systems are not accessible. We are in the process of identifying and correcting the problem. More when it is known. George Szmuksta SSA ( pages look OK to me A.K. ) 12:45 The web pages have been fixed. As far as tape activity we are experiencing a media changer queue full which is delaying mounts and dismounts. We are looking at it. George szmuksta 13:55 - still no tape activity. 14:19 - seeing tape activity mostly writes, 168 queue elements 15:22 - there are idle drive, shubert's test file is restored to r-stkendca15a-6 ####### # SIM # ####### Request from arms for near/daikon_04/CosmicLE near/daikon_04/L010000 near/daikon_04/L010170 near/daikon_04/L010185 near/daikon_04/L010200 near/daikon_04/L100200 near/daikon_04/L150200 near/daikon_04/L250200 far/daikon_04/L010185 far/daikon_04/L250200 Waiting to see whether these are all cedar_phy_bhcurv . 
11:30 - confirmed ./pnfsdirs near cedar_phy_bhcurv daikon_04 CosmicLE write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010000 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010170 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010185 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010200 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L100200 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L150200 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L250200 write ./pnfsdirs far cedar_phy_bhcurv daikon_04 L010185 write ./pnfsdirs far cedar_phy_bhcurv daikon_04 L250200 write ####### # SAM # ####### Performed database repairs in dev as described 2007 10 12, on advice from Herber this morning. setup oracle_client ../bin/rlwrap sqlplus samdbs/...@minosdev ============================================================================= 2007 10 12 ######## # FARM # ######## Rubin is doing cedar_phy near cleanup. Requests status of N00011772 .1.root files Existing catted files are N00011772_0000.spill.sntp.cedar_phy.0.root N00011772_0000.spill.mrnt.cedar_phy.0.root sam get metadata --file=${FILE} \ | grep parents \ | tr "'" \\\n \ | grep root \ | sort The original .0. files seem complete. I don't know why there was reprocessing of 8 of the subruns. ####### # SAM # ####### Testing definition creation, failing for ahimmel ( not in group minos ) sam create definition --definitionName='kreymer-test' \ --dimensions='FILE_NAME = F00031300_0000.mdaq.root' \ --group='minos' sam describe definition --definitionName='kreymer-test' sam delete definition --definitionName='kreymer-test' ####### # SAM # ####### dbs v8_3_0 work, see log_data/LOG/sam03.log Upgrade to sam_config v7_1_5 ups declare -c sam_config v7_1_5 Cannot start integration dbserver v8_3_0 lacking compat-libstdc++-33-3.2.3-47.3 Requested this, ticket 105533 assigned to Jason Done ! 
MINOS26 > upd install -j sam v8_2_0 ####### # SAM # ####### Assess damage to the parameters setup oracle_client ../bin/rlwrap sqlplus samdbs/...@minosdev select dimension_name,dim_alias from dimensions where dimension_name like 'MC.%' ; select dimension_name,dim_alias from dimension_addons where dimension_name like 'MC.%' ; The dev declarations are definitely mangled, containing bad DIM_ALIAS fields for MC.BFIELD and MC.VTXREGION , like param_types##1 param_categories##1 Adding new parameters to int using sam v8_2_0 and dbs v8_3_0 setup sam -q int v8_2_0 export SAM_ORACLE_CONNECT="samdbs/" samadmin add param suite --param-file=MCPARAMS.py Looked with sqlplus, the param values are unique ( 261 and 252 ) Plan to correct these problems on Monday SET PAGESIZE 1000 SET LINESIZE 100 SET NEWPAGE NONE SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSIONS where DIMENSION_NAME = 'MC.BFIELD' ; SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSION_ADDONS where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_category' ; SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSION_ADDONS where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_type' ; SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSIONS where DIMENSION_NAME = 'MC.VTXREGION' ; SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSION_ADDONS where DIMENSION_NAME = 'MC.VTXREGION' and DIM_COLUMN = 'param_category' ; SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSION_ADDONS where DIMENSION_NAME = 'MC.VTXREGION' and DIM_COLUMN = 'param_type' ; UPDATE DIMENSIONS SET DIM_ALIAS = 'param_values##261' where DIMENSION_NAME = 'MC.BFIELD' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_categories##261' where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_category' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_types##261' where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_type' ; UPDATE DIMENSIONS SET DIM_ALIAS = 'param_values##262' where DIMENSION_NAME = 'MC.VTXREGION' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_categories##262' where DIMENSION_NAME = 'MC.VTXREGION' and DIM_COLUMN = 'param_category' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_types##262' where DIMENSION_NAME = 'MC.VTXREGION' and DIM_COLUMN = 'param_type' ; ######## # GRID # ######## Submitted Minos Cluster grid plan via email to timm, chadwick, berman, minos-admin ########## # DC2NFS # ########## AFSS/dc2nfs -d far -r R1.16 -s sntp STARTING Fri Oct 12 14:16:02 CDT 2007 Running dc2nfs for DET far REL R1.16 STR sntp Processing 5 months STARTED Fri Oct 12 14:16:02 CDT 2007 FINISHED Fri Oct 12 15:42:39 CDT 2007 ============================================================================= 2007 10 11 ####### # SAM # ####### Resuming dbs v8_3_0 work, see log_data/LOG/sam03.log products are installed, ready to bite the bullet and upgrade sam_config ?
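Before flipping the current chain, it would be prudent to record what is declared now so the change can be backed out. A sketch, assuming the usual ups commands :
ups list -aK+ sam_config > /tmp/sam_config.before 2>&1   # note what is current today
ups declare -c sam_config v7_1_5
# rollback : re-declare whatever /tmp/sam_config.before shows as current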
######### # STAGE # ######### stage.20071012 Added printout of ${NCHECK}/${NFILES} in WAITER Added -b bailout option, for testing Added VERSION and printout thereof Added STARTED / FINISHED time summary 11:15 ln -sf stage.20071012 stage # was stage.20061012 ########## # DC2NFS # ########## $ AFSS/dc2nfs -d far -r R1.16a -s sntp STARTING Thu Oct 11 08:53:06 CDT 2007 Running dc2nfs for DET far REL R1.16a STR sntp Processing 1 months cranked along at about 5 files/second R1_15 far reco need months 2005-0* 3 5 6 7 8 ./stage reco_far/R1.16/sntp_data/2005-03 for MON in 5 6 7 8 ; do ./stage -w reco_far/R1.16/sntp_data/2005-0${MON} ; done ######### # ADMIN # ######### Discussed 8 nodes GP CPU deployment with Jason Allen ( Boyd is on vacation) We will need AFS mounted on these, unlike rest of GP_GRID nodes Probably no special network/location requirements. ============================================================================= 2007 10 10 ############## # parameters # ############## Multiple parameter selections are not working for mc.vtxregion. Same problem as before, herber has corrected the database content previously via direct SQL. It was necessary to samadmin flush dbserver cache The problem was the non-unique numbers in DIMENSIONS.DIM_ALIAS and DIMENSION_ADDONS.DIM_ALIAS To see this, use the database browser, select development SAM parameters mc., and note the value param_values##1 ########## # DC2NFS # ########## Cloned from dc2afs Test on far R1.16a sntp_data then R1.16 ./stage -d -p 0 reco_far/R1.16a/sntp_data/2005-03 Needed 460/460 These are all in the minos file family, so forget tape optimization, just prestage them as is. ./stage reco_far/R1.16a/sntp_data/2005-03 ######### # ADMIN # ######### bspeak asks that grashorn login shell be bash on Minos Cluster I submitted helpdesk ticket 105387 for this and FNALU. Done for Minos Cluster 14:50 ############ # MCIMPORT # ############ mcimport.20070912 tune up/debug ( improved diskfull handling ) AFSS/mcimport.20070913 -n kreymer Corrected quotations around print statements 11:43 $ cp -a AFSS/mcimport.20070912 . $ ln -sf mcimport.20070912 mcimport # was mcimport.20070910 ############ # MCIMPORT # ############ mualem is importing lots of CosmicLE_D03.reroot.root to mualem/ rather than mualem/far/mcin/ Renamed these, and informed mualem and minos_sim for FILE in *root ; do echo ${FILE} ; mv ${FILE} far/mcin/${FILE} ; done Created working directories ./pnfsdirs far cedar_phy_bhcurv daikon_03 CosmicLE write chmod 775 /pnfs/minos/mcin_data/far/daikon_03 chgrp e875 /pnfs/minos/mcin_data/far/daikon_03 chmod 775 /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_03 chgrp e875 /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_03 Ran mcimport -w mualem around 18:40 CDT, clearing space quickly. Hacked crontab to run next cycle at 20:37, when this has cleared the nearly 800 files. 
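For the next stray-file cleanup of this kind, a slightly safer form of the move loop, refusing to clobber anything already sitting in the target - a sketch only :
for FILE in *root ; do
  if [ -e far/mcin/${FILE} ] ; then
    echo "SKIP ${FILE} - already present in far/mcin"
  else
    echo ${FILE} ; mv ${FILE} far/mcin/${FILE}
  fi
done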
######### # NEDIT # ######### requested installation on minos-sam02 to match rest of Cluster/Server Ticket 105361 ############ # pnfsdirs # ############ Updated to set group and permission of basei/baseo on the level above basein/baseou ./pnfsdirs far cedar_phy_bhcurv daikon_03 CosmicMu write ./pnfsdirs near cedar_phy_bhcurv daikon_03 CosmicLE write ============================================================================= 2007 10 09 ########### # ENSTORE # ########### Recycle request for tapes as follows : Minos currently has 33 tapes with no active files: Checked volumes with enstore info --list=${VOL} VO8166 | none | 9940B | far_dcs_data OK unknown path VO2067 | none | 9940 | log_data_caldet VO5689 | none | 9940 | log_data_R1_14 VO6504 | none | 9940B | log_data_R1_14 VO8500 | none | 9940B | log_data_R1_18 VO8514 | none | 9940B | log_data_R1_18 VO8547 | none | 9940B | log_data_R1_18_2 VO8564 | none | 9940B | log_data_R1_18_2 OK Moved to $MINOS_DATA/log_data/... VO4432 | full | 9940B | mcout_far_daikon_02_cand VO6615 | full | 9940B | mcout_far_daikon_02_cand VOC485 | full | 9940B | mcout_far_daikon_02_cand VOC488 | full | 9940B | mcout_far_daikon_02_cand OK reprocessed NULL31 | none | null | neardet_data NO moibenko files, but this is NULL MOVER data, not a tape VO4435 | full | 9940B | reco_mc_near_cedar VO4460 | readonly | 9940B | reco_mc_near_cedar VO4461 | full | 9940B | reco_mc_near_cedar VO4465 | full | 9940B | reco_mc_near_cedar VO4475 | full | 9940B | reco_mc_near_cedar VO4553 | full | 9940B | reco_mc_near_cedar VO4554 | full | 9940B | reco_mc_near_cedar VO4613 | full | 9940B | reco_mc_near_cedar VO4716 | full | 9940B | reco_mc_near_cedar OK no paths VOB870 | full | 9940B | reco_mc_near_cedar_cand VOB931 | full | 9940B | reco_mc_near_cedar_cand VOC465 | full | 9940B | reco_mc_near_cedar_cand OK no paths VO9913 | full | 9940B | reco_mc_cosmic_cedar VOB691 | full | 9940B | reco_mc_cosmic_cedar VOC043 | full | 9940B | reco_mc_cosmic_cedar VOC151 | full | 9940B | reco_mc_cosmic_cedar VO7080 | readonly | 9940B | reco_mc_cosmic_cedar OK bfld201_lowE some deleted/reprocessed, some not VO8437 | full | 9940B | reco_near_R1_18_4 VOB644 | full | 9940B | reco_near_R1_18_4 OK no path VOB414 | none | 9940B | reco_near_S06-05-25-R1-22 OK test processing runs, obsolete ####### # SRM # ####### SRM is down around 07:30 due to fndca2a failure/replacement. Estimate 3 to 4 hours. Tried pinging servers with telnet fndca1.fnal.gov 8443 telnet stkendca2a.fnal.gov 8443 But both of these succeed in connecting to the port ( exit code of 1 after a normal quit/exit/close ) even though the SRM service is down. 14:50 Network monitoring indicates that fndca2a went down around 03:15, and came up around 10:00 . But the SRM service is still down. Is there a revised time estimate for restoring SRM ? 
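Since a raw telnet to 8443 connects even with SRM dead, a service-level probe is more telling. A sketch, reusing the short-retry srmls options and the beam_data surl from the 10/01 entry below :
SPATH2="srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/beam_data/2004-12"
if srmls -retry_timeout=10 -retry_num=0 "${SPATH2}" > /dev/null 2>&1 ; then
  echo "SRM is answering"
else
  echo "SRM is not answering"
fi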
15:05 - Timur requested database and SRM restart 16:40 - still no estimate ####### # SAM # ####### Testing sam_db_srv_pkg on minos-sam02 ( int ) upd list -l sam_db_srv_pkg v8_3_0 See log_data/LOG/sam02.log ============================================================================= 2007 10 08 ############# # CHECKLIST # ############# Ticket 105076 Minos Server Ganglia still missing - jpfitz send reminder 13:40 fixed by jonest 15:26 Ticket 105113 - jyuko group and scratch access 2007 10 10 - assigned to shepelak reassigned to terry jones completed 13:21 Sam shifter Marcia Begalli sent message re IT 1146 no way to check existence of sam tape location note gltail.rb at www.fudgie.org real time monitoring tools ######## # GRID # ######## Grid user meeting discussed overall tactics, Made a short but failed attempt to see why proxies with minos/production could not write to DCache. Created srmwtest for write tests need to add controls for PNFS/volatile, normal/roled proxy, file size, ############ # SRMWATCH # ############ New monitoring pages under DCache http://fndca2a.fnal.gov:8090/srmwatch/ ============================================================================= 2007 10 05 vacation ============================================================================= 2007 10 04 ####### # SRM # ####### Down since Wednesday sometime. about 16:55, helpdesk ticket 105117 ####### # SRM # ####### Report failure to write using production role in grid proxy, to fermigrid-users ######## # DATA # ######## Added jyuko to minos:beam, per request. pts membership minos:beam pts adduser -user jyuko -group minos:beam Created /minos/scratch/jyuko /minos/scratch/kreymer ########### # GANGLIA # ########### Reported minos25 and sam/mysql1 ganglia monitoring missing since Wed glitch. Ticket 105076 jpfitz restored and reconfigured, we've lost Minos Server links. ####### # AFS # ####### Scanned AFS for system:anyuser protections of home directories system:anyuser includes everyone in the world who can gain access to your cell. system:authuser includes everyone who is currently authenticated in your cell AFSH=/afs/fnal.gov/files/home ROOM=room1 AUTH=anyuser for DIR in ${AFSH}/${ROOM}/* ; do printf "\n${DIR} " ; fs listacl ${DIR} | grep system:anyuser | grep -v 'system:anyuser rl$' ; done This revealed enough exceptions to be worth summarizing for DIR in ${AFSH}/${ROOM}/* ; do ACL=`fs listacl ${DIR} 2> /dev/null | grep system:${AUTH} | grep -v "system:${AUTH} rl$"` [ -n "${ACL}" ] && printf "${DIR} ${ACL}\n" done Redirected stderr to /dev/null, as cannot access all directories I do have a valid token when running this scan. Not listing certain security problems here, but reporting to fnalu-admin and nightwatch. ADIR= At csf.rl.ac.uk, ( cd ${ADIR} ; echo HELLO > HELLO ; ls -l HELLO ; cat HELLO ; rm HELLO ) ls -l ${ADIR}/HELLO ; cat ${ADIR}/HELLO The world can indeed write to these areas. 
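The per-room results presumably got captured into the /tmp/home*.any files read back in the next step ; a sketch of how that capture loop might have looked ( the room list and file naming are guesses ) :
AFSH=/afs/fnal.gov/files/home
AUTH=anyuser
for ROOM in room1 room2 room3 ; do
  for DIR in ${AFSH}/${ROOM}/* ; do
    ACL=`fs listacl ${DIR} 2> /dev/null | grep system:${AUTH} | grep -v "system:${AUTH} rl$"`
    [ -n "${ACL}" ] && printf "${DIR} ${ACL}\n"
  done > /tmp/home${ROOM#room}.${AUTH:0:3}
done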
$ cat /tmp/home*.any | grep rlidwka There are 11 rlidwka users, and one rlw cat /tmp/home*.any | grep 'rl.*w Passed the list to Joe Klemencic jklemenc ######## # GRID # ######## /minos/data and /minos/scratch are in /etc/fstab on Cluster and Servers ============================================================================= 2007 10 03 ############ # SADDRECO # ############ Retesting saddreco.20070913 after adjustments for regular data PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.dev:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9000 export SAM_ORACLE_CONNECT='samdbs/pass' RELS=cedar MCRL=daikon_00 MODS=/pnfs/minos/mcout_data/${RELS}/near/${MCRL} DIRS=L100200N AFSS/saddreco.20070913 near ${RELS} ${DIRS} verify -m ${MCRL} -b 1 -v corrected code for copy of params AFSS/saddreco.20070913 near ${RELS} ${DIRS} verify -m ${MCRL} This verified cleanly on cand/mrnt/sntp ; 100-105 Now re-retest for reco data, as below AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} verify -b 1 -v -s F00039716_0005 AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} verify -b 1 AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} verify OOPS, need location for F00039586_0005.all.cand.cedar.0.root AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} addloc OK - add location F00039586_0005.all.cand.cedar.0.root /pnfs/minos/reco_far/cedar/cand_data/2007-09(vo2363.1246) Ran single file declaration AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} declare -b 1 \ 2>&1 | tee -a ${HOME}/ROUNTMP/LOG/${SAMMON}/declare_${DET}_${REL}.log Ran the rest AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} declare \ 2>&1 | tee -a ${HOME}/ROUNTMP/LOG/${SAMMON}/declare_${DET}_${REL}.log SRV1> ln -sf saddreco.20070913 saddreco # was saddreco.20070507 Restored SAM as appropriate, to corral ########## # SADDMC # ########## Asking permission at Monday's MC meeting to proceed with MC declares Metadata now includes Params({ 'mc' : CaseInsensitiveDictionary({ 'beam' : DataType('string'), # from directory, like L010185N_bfldx113 'bfield' : DataType('string'), # field 5 in releases daikon and later 'flavor' : DataType('string'), # field 4 'release' : DataType('string'), # from directory, like daikon_00 'split' : DataType('string'), # field 5 in releases carrot and earlier 'volume' : DataType('string'), # changed vtxregion 'vtxregion' : DataType('string'), # field 3 [itgt] })}) Event counts and first/last event numbers in SAM metadata are faked, as we are not reading the mcin files to get those numbers ( and I lack the code to do so. 
) ######### # ADMIN # ######### Apparent reboot of Minos Cluster and Server nodes BNODES='flxb10 flxb11 flxb13 flxb16 flxb17 flxb18 flxb19 flxb20 flxb21 flxb22 flxb23 flxb24 flxb25 flxb26 flxb27 flxb28 flxb29 flxb30 flxb31 flxb32 flxb33 flxb34 flxb35' INODES='flxb10 flxb24 flxb30 flxb31 flxb32 flxb33 flxb34 flxb35' UNODES="flxi02 flxi03 flxi04 flxi05 flxi06" SNODES="minos-mysql1 minos-sam01 minos-sam02 minos-sam03" NODES="minos01 minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10 minos11 minos12 minos13 minos14 minos15 minos16 minos17 minos18 minos19 minos20 minos21 minos22 minos23 minos24 minos25 minos26" for NODE in ${NODES} ; do printf "${NODE} " ; ssh ${NODE} uptime ; done BNODES - stayed up UNODES - stayed up SNODES - all rebooted at 7:02 NODES - minos22 through 26 rebooted at 7:02 Lost /minos/* mounts, these are not yet in fstab roundup was cleanly finished by 06:13 mcimport was cleanly finished at 06:35 minos-sam01 - ups start sam_bootstrap ./sam_test_py minos - OK ######## # GRID # ######## Outline of file system mount permissions, per Chadwick whiteboard DISK HOME GCI OSE GCE OSE | ------------------------- Computing : GCE | RWX | RWX | NO | RWX | | | | | | OSE | NO ?| RWX | NO | RWX | -------------------------- R-- RW- ============================================================================= 2007 10 02 ######## # GRID # ######## /minos/data and /minos/scratch mounted on Cluster and Servers ####### # CVS # ####### Backed up previous passwd file MINOSCVS > cd /cvs/minoscvs/rep1/CVSROOT/ MINOSCVS > mv passwd passwd.20010918 ; cp -a passwd.20010918 passwd ; ls -l pass* Created new passwd file with no password MINOSCVS > cp passwd passwd.20071002 MINOSCVS > nedit passwd.20071002 Deployed and test passwordless pserver, with fallback, about 12:00 cp -a passwd.20071002 passwd Tested MINOS26 > cvs -d $loc checkout BubbleSpeak ; rm -r BubbleSpeak test cvs checkout: warning: failed to open /afs/fnal.gov/files/home/room1/kreymer/.cvspass for reading: No such file or directory cvs checkout: Updating BubbleSpeak and tested from csf.rl.ac.edu ########### # MONTHLY # ########### DATASETS 10/2 PREDATOR 10/2 VAULT 10/2 Note these went to LTO-3 library this time MYSQL 10/2 ############ # SADDRECO # ############ Final test of saddreco.20070913 for predator Log into fnpcsrv1 /export/stage/minfarm/.grid/ cd scripts cp -a AFSS/saddreco.20070913 . PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9010 export SAM_ORACLE_CONNECT=`cat /export/stage/minfarm/.grid/samdbs_prd` DET=far REL=cedar SAMMON='2007-09' AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} verify -b 1 -v -s F00039716_0005 copymeta problem hacked copy of saddreco.20070507 to print MYMETA, for comparison saddreco.old ${DET} ${REL} ${SAMMON} verify 1 > /tmp/log.old AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} verify -b 1 -v -s F00039716_0005 > /tmp/log.20070913 ... did not do the following yet, want to re-test MC first ... 
Ran single file declaration AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} declare -b 1 \ 2>&1 | tee -a ${HOME}/ROUNTMP/LOG/${SAMMON}/declare_${DET}_${REL}.log Ran the rest AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} declare \ 2>&1 | tee -a ${HOME}/ROUNTMP/LOG/${SAMMON}/declare_${DET}_${REL}.log ============================================================================= 2007 10 01 ############ # SADDRECO # ############ Moving latest saddreco.20070913 to production use in roundup. Creates tape storage locations as needed PLAN : disable SAM declares for a while, via corral saddreco --verify saddreco --declare integrate ####### # CVS # ####### Testing pserver password removal. mkdir -p /tmp/kreymer cd /tmp/kreymer loc=":pserver:anonymous@minoscvs.fnal.gov:/cvs/minoscvs/rep1" cvs -d $loc checkout BubbleSpeak cvs checkout: warning: failed to open /afs/fnal.gov/files/home/room1/kreymer/.cvspass for reading: No such file or directory cvs checkout: authorization failed: server minoscvs.fnal.gov rejected access to /cvs/minoscvs/rep1 for user anonymous cvs checkout: used empty password; try "cvs login" with a real password changed .cvspass ( saving old as .cvspass.20050420 ) in minoscvs@minoscvs, this had not effect. Probably need a pserver restart. The old password is still working :pserver:anonymous@minos01.fnal.gov:/cvs/minoscvs/rep1 A+.(=0BB& was :pserver:anonymous@minos01.fnal.gov:/cvs/minoscvs/rep1 Ay=0=h cat /cvs/minoscvs/rep1/CVSROOT/passwd anonymous:y/6MJprbDjVZ.:minoscvs In accordance with CDFCVS > cat run2/CVSROOT/passwd anonymous::cdfcvs Note that CDF has a passwd,v file ####### # SRM # ####### Manual test with short retry timeout and num, srmls --debug=true -retry_timeout=10 -retry_num=1 ${SPATH2} Stuck, try -retry_num=0, still stuck SRMClientV2 : srmLs, contacting service httpg://stkendca2a.fnal.gov:8443/srm/managerv2 ... eventually, times out with status 1 SRV1> date ; srmls --debug=true -retry_timeout=1000 -retry_num=1 ${SPATH2} ; date Mon Oct 1 09:57:06 CDT 2007 Storage Resource Manager (SRM) CP Client version 1.25 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory SRM Configuration: debug=true gsissl=true ... surl[0]=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/beam_data/2004-12 Mon Oct 01 09:57:06 CDT 2007: In SRMClient ExpectedName: host Mon Oct 01 09:57:06 CDT 2007: SRMClient(https,srm/managerv2,true) SRMClientV2 : user credentials are: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 SRMClientV2 : connecting to srm at httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : srmLs, contacting service httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : put: try # 0 failed with error SRMClientV2 : ; nested exception is: java.net.SocketTimeoutException: Read timed out SRMClientV2 : put: try again SRMClientV2 : sleeping for 1000 milliseconds before retrying SRMClientV2 : put: try # 1 failed with error SRMClientV2 : ; nested exception is: java.net.SocketTimeoutException: Read timed out Exception in thread "main" AxisFault faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException faultSubcode: faultString: java.net.SocketTimeoutException: Read timed out faultActor: faultNode: faultDetail: {http://xml.apache.org/axis/}stackTrace:java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) ... 
at gov.fnal.srm.util.SRMDispatcher.work(SRMDispatcher.java:721) at gov.fnal.srm.util.SRMDispatcher.main(SRMDispatcher.java:342) {http://xml.apache.org/axis/}hostname:fnpcsrv1.fnal.gov java.net.SocketTimeoutException: Read timed out at org.apache.axis.AxisFault.makeFault(AxisFault.java:101) at org.apache.axis.transport.http.HTTPSender.invoke(HTTPSender.java:154) ... at gov.fnal.srm.util.SRMDispatcher.main(SRMDispatcher.java:342) Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) ... at org.apache.axis.transport.http.HTTPSender.readHeadersFromSocket(HTTPSender.java:583) at org.apache.axis.transport.http.HTTPSender.invoke(HTTPSender.java:143) ... 14 more Mon Oct 1 10:17:11 CDT 2007 That's 20 minutes 11:47 podstvkv investigating 12:00 srm seems to be back. Just in time to catch the noon cron for mindata(mcimport) and minfarm(corral) Copies are working in corral. ============================================================================= 2007 09 29 ####### # SRM # ####### SRM offline, errors like these in LOG/2007-09/cedarfar.log Sun Sep 30 00:06:36 CDT 2007 WRITING to DCache 1 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00039713_0000.all.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/sntp_data/2007-09 SRMClientV1 : put: try # 0 failed with error SRMClientV1 : java.net.SocketTimeoutException: Read timed out srm copy of at least one file failed or not completed Command failed! Server error message for [1]: "can't get pnfsId (not a pnfsfile)" (errno 666). Failed open file in the dCache. dc_stage fail : "can't get pnfsId (not a pnfsfile)" System error: Input/output error ============================================================================= 2007 09 28 ####### # CVS # ####### copied adduser from cdfcvs server, tested on wojcicki ####### # DBB # ####### From RAL, port 8080 is OK, but RL > curl dbb.fnal.gov:80 curl: (7) socket error: 110 ######## # MAIL # ######## for UUSER in bishai kafka wojcicki ; do finger ${UUSER}@fnal.fnal.gov | grep '@' ; done bishai@fsui02.fnal.gov kafka@fnalu.fnal.gov wojcicki@fnalu.fnal.gov Requested wojcicki SGWEG@SLAC.Stanford.EDU via helpdesk email cc: wojcicki done around 16:08 bishai is trying to connect to imap - done at about 15:45 ============================================================================= 2007 09 27 ######## # MAIL # ######## for UUSER in alberto bishai escobar kafka para wojcicki ; do finger ${UUSER}@fnal.fnal.gov | grep '@' ; done alberto@fnalu.fnal.gov bishai@fsui02.fnal.gov djensen@imapserver1.fnal.gov escobar@fsui02.fnal.gov kafka@fnalu.fnal.gov para@fsui02.fnal.gov wojcicki@fnalu.fnal.gov ######## # FARM # ######## /grid/data/minos filled up quota quota -s -v -g e875 2> /dev/null | grep -A 1 'fermigrid\-data' | grep -v fermi 400G* 0 400G 8139 0 0 SRV1> ./farmgsum Summarizing /grid/data/minos/*cat 1598 53160 nearcat 21 250 farcat 632 27411 mcnearcat 9 535 mcfarcat 0 1 mcfmockcat 742 278670 minfarm/WRITE 3002 360027 TOTAL files, GBytes ... srmcp fails showing : Last good copy was 13:04 on 26 Sep. 
srm client error: credential remaining lifetime is less then a minute SRM_CONF=/export/stage/minfarm/.srmconfig/config.xml /export/stage/minfarm/.grid/x509up_u1334 /export/stage/minfarm/.srmconfig/config.xml SRV1> grid-proxy-info -all -file /export/stage/minfarm/.grid/x509up_u1334 subject : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990/CN=687673363 issuer : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 identity : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : /export/stage/minfarm/.grid/x509up_u1334 timeleft : 0:00:00 SRV1> cd /export/stage/minfarm/.grid SRV1> mv x509up_u1334 x509up_u1334.20061220 SRV1> cp /home/rubin/.grid/x509up_u1334 x509up_u1334 SRV1> chmod 700 x509up_u1334 Cleared off 8.6 GB from DUP cd /grid/data/minfarm cp -vax DUP /export/stage/minfarm/DUP SRV1> du -sm /export/stage/minfarm/DUP 8986 /export/stage/minfarm/DUP SRV1> diff -r DUP /export/stage/minfarm/DUP SRV1> rm DUP/*.root rm: remove write-protected regular file `DUP/c10000607_0003.cand.cedar.root'? y SRV1> quota -s -v -g numi 2> /dev/null | grep -A 1 'fermigrid\-data' | grep -v fermi 395G 0 400G 8193 0 0 SRV1> ./roundup -r cedar_phy mcnear SRV1> ./roundup -r cedar_phy_bhcurv mcnear these two ran in parallel. c_p was doing srmcp while c_p_b was doing hadd 19:00 392G used ./roundup -w -r cedar_phy_bhcurv mcnear 365G Not so outstanding, but cedar_phy mcnear is still writing. Will start a purge of that in an hour or so. GRRRRRRRRR - writes for daikon_03 cedar_phy mcnear failed. No such directory. ./pnfsdirs near cedar_phy daikon_03 L010185N write Group is e875, protections 775, OK 22:01 ./roundup -w -r cedar_phy mcnear 233G ./roundup -w -r cedar_phy_bhcurv mcnear 216G ./roundup -w -r cedar_phy mcnear 200G Midnight corral should clear the remaining 120 GB, which are on the way to tape already. Suggested that Howie restart the farm, around 22:30. ============================================================================= 2007 09 26 set minos:beam for early volumes in for VOL in d188 d239 d266 d268 d269 d270 ; do fs listacl $MINOS_DATA/${VOL} ; done ######### # BATCH # ######### New GRID nodes are ( Req PO 577128 ) D0 L3 30 KEK 17 CDF 138 D0 205 GPFarm 47 Minos 8 MiniBoone 8 Dell PowerEdge 1950, 2 x Quad-Core Intel Xeon 2.66 Ghz, 16GB RAM,500GB SATAu HD, 3 year NBD Warranty, fully integrated, burned-in, tested and installed. 22 Compute Servers per Rack with balance in last Rack. Survey of CLUBS node speeds Normalize to 3 GHz minos26, tiny rating 1033 . flxb tiny 10 416 11 414 416 13 419 419 17 1055 18 1063 19 1067 20 1063 21 1066 22 1064 23 1071 25 1060 26 1068 28 1062 30 1066 31 900 897 899 32 901 901 901 33 896 901 901 34 900 898 35 987 989 985 988 Check parallel capacity Can log into 10 24 30-35 cd Linux/tiny for N in 1 2 3 4 ; do ( time ./tiny & ) ; done flxb10 2 real 0m18.109s user 0m18.100s 4 real 0m36.142s user 0m18.050s flxb24 2 4 flxb31 2 real 0m7.570s user 0m7.559s 4 real 0m15.153s user 0m7.536s flxb35 2 real 0m6.895s user 0m6.892s 4 real 0m13.838s user 0m6.919s For Minos nodes, base time is 7 seconds. Is hyperthreading helping ? 
minos01 real 0m11.720s user 0m11.715s minos26 real 0m11.606s user 0m11.535s Summary : All CLUBS/FLXB nodes act as 2 core See MHz summary 2007 09 11 AGE NODES GHZ/core cores GHz Ancient 10/11/13 1.5 6 9 Old 16-30 3 30 100 Mid 31-35 2.7 8 22 New 35 3 2 6 CLUBS 137 GHz Cluster 150 GHz ( 50 * 3 ) ( LSF 75 ( 25 * 3 ) ) New 170 GHz ( 64 * 2.66 ) ########### # STORAGE # ########### In 284 AFS disk volumes, we have DIRS=`ls` SIZES=`for DIR in ${DIRS} ; do fs listquota ${DIR} | grep 0000 | tr -s ' ' | cut -f 2 -d ' '; done` SIZEU=`for DIR in ${DIRS} ; do fs listquota ${DIR} | grep 0000 | tr -s ' ' | cut -f 3 -d ' '; done` printf "${SIZEU}\n" | ./count 10457000000 9068980997 We use 9 of 10.5 TBytes of capacity Draft document is going into Minos Doc 3601 ============================================================================= 2007 09 25 ######## # FARM # ######## Rubin reports big backlogs writing. I think it's all us : MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03/CosmicMu/cand_data -type f | wc -l 11560 ############ # SADDRECO # ############ Need to declare daikon_01 and daikon_03 reco in dev, for final tests. near cedar 0 1 cedar_phy 0 3 cedar_phy_bhcurv 3 4 far cedar 0 1 2 cedar_phy 0 2 cedar_phy_bhcurv - none For now, let's do near cedar_phy daikon_03 Log into fnpcsrv1 cd scripts tokens AFSK=/afs/fnal.gov/files/home/room1/kreymer/minos/log/saddreco PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.dev:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9000 export SAM_ORACLE_CONNECT='samdbs/pass' RELS=cedar_phy MCRL=daikon_03 MODS=/pnfs/minos/mcout_data/${RELS}/near/${MCRL} DIRS=`ls $MODS` echo $DIRS CosmicMu tested with AFSS/saddreco.20070913 near ${RELS} ${DIR} verify -m ${MCRL} -b 1 -s sntp_data/205 -v Ran single file declaration for DIR in ${DIRS} ; do AFSS/saddreco.20070913 near ${RELS} ${DIR} declare -m ${MCRL} -b 1 \ 2>&1 | tee -a /tmp/saddreco.${MCRL}.${DIR}.declare.log done Ran the rest for DIR in ${DIRS} ; do AFSS/saddreco.20070913 near ${RELS} ${DIR} declare -m ${MCRL} \ 2>&1 | tee -a /tmp/saddreco.${MCRL}.${DIR}.declare.log done STARTED Tue Sep 25 23:47:49 2007 FINISHED Wed Sep 26 00:05:04 2007 looks clean in the log, as follows : grep -v declared /tmp/saddreco.${MCRL}.${DIR}.declare.log | less ============================================================================= 2007 09 24 ########## # SADDMC # ########## saddmc.20070924 ln -sf saddmc.20070924 saddmc # was saddmc.20070608 export SAM_ORACLE_CONNECT="samdbs/..." ./saddmc --declare -n 1 ${VEG} near/${VEG}/${DIR}/504 sam get metadata --file=n13035044_0008_L010185N_D03.reroot.root for VEG in daikon_01 daikon_03 ; do for DIR in `ls /pnfs/minos/mcin_data/near/${VEG}` ; do echo ${VEG} ${DIR} #./saddmc -v --verify -n 1 ${VEG} near/${VEG}/${DIR}/* ./saddmc --declare ${VEG} near/${VEG}/${DIR}/* done ; done 2>&1 | tee -a ${HOME}/minos/log/saddmc/${VEG}.log ######## # DISK # ######## Per conversation with Ling 8018, The array is rebuilt with 2 hot spare disks. He was told by Jason Allen not to do the hot-spare disk tests on our array. ( 13:53) He will check again with Jason, and get back to me. Spoke to Jason, authorized delaying as needed to do the 1/2/3 disk rebuild time and performance tests. It might not be available by Thursday . I authorized this anyway. We really need to know that our particular hot spare disks work. 
######## # FARM # ######## Need to remove READ files for /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03/CosmicMu files which Howie is removing and reprocessing today. Duplicates were detected in the noon run, will have to clean up. Per discussion with Howie , remove ^n1004...cedar_phy_bhcur which should all be CosmicMu, with reversed field. SRV1> ls | grep ^n1004 | grep cedar_phy_bhcurv | wc -l 228 BADRS=`ls | grep ^n1004 | grep cedar_phy_bhcurv` printf "${BADRS}\n" | wc -l 228 for FILE in ${BADRS} ; do mv ${FILE} ../BADREAD/${FILE} ; done Now cleaning up an aborted start from this morning, MINOS26 > ls /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03/CosmicMu/cand_data/* | grep ^n1004 | wc -l 972 MINOS26 > find \ /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03/CosmicMu/cand_data/ \ -type f -name n1004\* | wc -l 972 FILES=`find \ /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03/CosmicMu/cand_data/ \ -type f -name n1004\*` for FILE in ${FILES} ; do usleep 200000 FPA=`dirname ${FILE}` FNA=`basename ${FILE}` ( cd ${FPA} ; L4=`cat ".(use)(4)(${FNA})"` if [ -n "${L4}" ] ; then VOL=`printf ${L4}\n" | head -1"` printf "${VOL} ${FNA}\n" else printf "pend ${FNA}\n" fi ; ) done ####### # SAM # ####### ./sam_test_py minos prd zeval-far-cand-physicsm-spill-r1_16 This tests a 96 file project Note that the Universe qualifier is not working, you keep what you have before running s_t_p Per hartnell request, here's a hack to count up progress : export SAM_STATION=minos export SAM_PROJECT=yourprojectname if PROJDUMP=`sam dump project -s --retryMaxCount=2` ; then NEED=`printf "${PROJDUMP}\n" | grep 'unbuffered yet' | wc -l` HAVE=`printf "${PROJDUMP}\n" | grep 'delivered on' | wc -l` (( TOT = NEED + HAVE )) printf " Delivered ${HAVE}/${TOT} Need ${NEED}\n" else printf " The project is unavailable ( unstarted or completed )\n" fi ============================================================================= 2007 09 23 Sunday ####### # SAM # ####### MCPARMS.py - edited to add 'bfield' : 'string' , replaces split for post-carrot MC 'vtxregion' : 'string' , replaces volume for all, per discussions setup sam -q dev export SAM_ORACLE_CONNECT="samdbs/" samadmin add param suite --param-file=MCPARAMS.py Param Category 'mc': ... paramType 'bfield': registered as type 'string' (new dimension 'mc.bfield') ... paramType 'volume': (no change) ... paramType 'beam': (no change) ... paramType 'split': (no change) ... paramType 'vtxregion': registered as type 'string' (new dimension 'mc.vtxregion') ... paramType 'release': (no change) ... 
paramType 'flavor': (no change) MINOS26 > sam get registered parameters Params({ 'mc' : CaseInsensitiveDictionary({ 'beam' : DataType('string'), 'bfield' : DataType('string'), 'flavor' : DataType('string'), 'release' : DataType('string'), 'split' : DataType('string'), 'volume' : DataType('string'), 'vtxregion' : DataType('string'), })}) ########## # SADDMC # ########## for VEG in daikon_01 daikon_02 daikon_03 daikon_04; do for UNI in dev int prd ; do setup sam -q ${UNI} export SAM_ORACLE_CONNECT samadmin add application family --appFamily=simulation --appName=gminos --appVersion=${VEG} export -n SAM_ORACLE_CONNECT done done ########## # SADDMC # ########## saddmc.20070924 Removed all RECOREL support, now that saddreco works for MC Removed enupdate, no longer used Changed volume to vtxregion New FILECH4 variable for split vs bfield Set to split for recorel[0] < d ./saddmc.20070924 --verify -n 1 daikon_01 near/daikon_01/L010185N/140 -v ./saddmc.20070924 --verify daikon_01 near/daikon_01/L010185N/140 ============================================================================= 2007 09 21 ####### # LSF # ####### Rustem has submitted about 12K jobs to the 4hr queue, like JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME 425482 rustem RUN 4hr flxi06.fnal flxb10.fnal *1006_0007 Sep 21 17:41 bjobs -u rustem | wc -l ~18:15 11410 18:37 11376 18:39 11367 13:00 954 ############ # MCIMPORT # ############ XFILE=/pnfs/minos/mcin_data/near/daikon_03/CosmicMu/218/n10032185_0006_CosmicMu_D03.reroot.root as reported 9/19, too short, apparently damaged. mv ${XFILE} /pnfs/minos/BAD/BAD_n10032185_0006_CosmicMu_D03.reroot.root ######## # GRID # ######## Authorized certs in minos group ( not minossoft or production ) https://voms.fnal.gov:8443/vomrs/vo-fermilab/vomrs brebel - rebel - FNAL only habig jdejong - dejong petyt rhatcher existed rustem tinti And other accounts that exist on fnpcsrv1 backhouse kordosky messier cavanaugh -added FNAL,had DOE gallagher sanchez Tingjun Yang George Irwin John Urheim Alexandre Sousa Joshua Boehm ######## # GRID # ######## HOWTO.fermigrid updated for vdt setup kx509 voms-proxy-init Tested kerberos based grid cert per chadwick email 2007 07 18 fgusers ######### # ADMIN # ######### 10:00 restarted cronjobs and NOCAT ####### # LSF # ####### Continued tests of tcsh scripts, success with ( unset PRODUCTS SETUPS_DIR SETUP_UPS INFO_DIR UPS_DIR SETUP_SHRC SETUP_INFO SETUP_LOGIN ; bsub -R "linux26" -q minos test_sub.scr ) based on a scan of all SETUP_ * environment variable at submission motivated by shrc messages when I unset PRODUCTS SETUPS_DIR SETUP_UPS UPS_DIR seem to need UPS_DIR ( unset UPS_DIR SETUPS_DIR SETUP_UPS SETUP_SHRC ; bsub -R "linux26" -q minos test_sub.scr ) ( unset UPS_DIR SETUPS_DIR SETUP_UPS SETUP_SHRC ; bsub -R "linux26" -q minos test_lsf_csh ) ####### # LSF # ####### 08:58 31 32 34 are updated 33 35 active, status set to closed 10:30 35 is updated, awaiting job on 33 10:52 updates are complete for NODE in flxb10 flxb24 flxb30 flxb31 flxb32 flxb33 flxb34 flxb35 ; do printf "\n${NODE} `date`\n"; ssh -a ${NODE} "grep OPTION /etc/sysconfig/afs" ; done ########### # BLUEARC # ########### Per our discussions verbally yesterday, here is what I understand of our deployment plan : Ling will be doing the following in preparation : 1) Add two hot-spare disks ( 40 disk in use, to hot spares ) 2) Verify and measure the raid rebuild time required for failover for 1 disk failure 2 disk failure 3 disk failure ( this rebuild will fail, see what it look like ) 3) Make 
the full array available to BlueArc, with quotas as specified below. The split between /minos/scratch and /minos/data is to be dynamic, and handled via quotas. 4) Mount /minos/scratch and /minos/data on the Minos Cluster and servers as specified below. Arthur will specify the list of client nodes, and the initial directory structures. 1) Clients - please mount /minos/scratch and /minos/data on all Minos Cluster and server nodes. minos01 through minos26 minos-mysql1 minos-sam01 minos-sam02 minos-sam03 2) /minos/data - roughly 20 TBytes or 2/3 of the disk capacity Let's start with /minos/data/mindata owned by mindata, group e875, group writeable. 3) /minos/scratch - roughly 10 TBytes Directories for each of the 186 minos users on the Minos Cluster ypcat passwd | cut -f 1 -d : | sort Each user gets 100 GB default quota. This is an oversubscription, but most of these are not active. This should be enough for initial testing. Please provide a means ( sudo ? ) for kreymer, buckley, rhatcher and urish to adjust quotas in /minos/scratch. After testing and discussion, we will probably move all the users' files from existing nodes' scratch areas such as /local/scratch01/ to /minos/scratch//minos01 directories. ============================================================================= 2007 09 20 ####### # AFS # LSF ####### Rustem reports problems on flxb31-35 similar to those during the Cluster upgrade loon: error while loading shared libraries: libEG.so: cannot open shared object file: No such file or directory Checking Cluster and FNALU and batch nodes : MIN > for NODE in $UNODES ; do printf "\n${NODE} `date`\n"; ssh -a ${NODE} "grep OPTIONS /etc/sysconfig/afs" ; done on all but flxi02, OPTIONS=$MEDIUM flxi07 has OPTIONS=AUTOMATIC minos* has OPTIONS=$LARGE for NODE in $BNODES ; do printf "\n${NODE} `date`\n"; ssh -a flxb${NODE} "grep OPTIONS /etc/sysconfig/afs" ; done 10 24 30 31 OPTIONS=$MEDIUM 31 32 33 34 35 OPTIONS=AUTOMATIC ######### # ADMIN # ######### Preparing for all-day shutdown later today predator MINOS26 > echo 'crontab -r' | at 03:30 mcimport M26 > echo 'crontab -r' | at 03:30 job 21 at 2007-05-24 03:30 corral SRV1> echo 'mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT' \ | at 03:30 ######## # GRID # ######## Please create accounts on fnpcsrv1 and fngp-osg for the following users, so that they can start submitting jobs to Fermigrid : brebel NC habig Run Coord jdejong Calib petyt rhatcher Admin rustem Reco tinti Reco The fnpcsrv1 accounts are needed for access to AFS. The fngp-osg accounts are needed for testing without AFS. ============================================================================= 2007 09 19 ####### # AFS # ####### Per koskinen, requested volumes afs/fnal.gov/files/data/minos/d268 afs/fnal.gov/files/data/minos/d269 afs/fnal.gov/files/data/minos/d270 cloned from d266 for beam systematics work system:administrators rlidwka minos:admin rlidwka minos:beam rlidwka minos rl ######### # MYSQL # ######### Finally doing monthly backups, now that brebel load has dropped Local copy rates are still miserable, about 4 MBytes/second with cp -av ... Will slug it through, then try dd if= of= bs=10M Times for big copies were DCS_HV.MYD real 41m59.161s PULSERGAIN.MYD real 19m9.964s the rest real 67m43.457s md5sum real 22m43.163s gzip real 79m54.981s oops, minos-sam03 kreymer account moved to AFS. needed to adjust REPATH to /home/kreymer/...
scp: real 20m8.355s BINLOGS real 3m37.674s ####### # SSH # ####### curl http://www-numi.fnal.gov/computing/dh/sshkrb5.tgz -o sshkrb5.tgz The original sl3 shared libraries were not correctly named. My tests on csf.rl.ac.uk were falling back to system libraries. Needed various symlinks for kinit/klist : ln -s libkrb5.so.3 libkrb5.so ln -s libcrypto.so.4 libcrypto.so ln -s libcom_err.so.3 libcom_err.so With these symlinks, all of kinit/klist/ssh/scp are fairly clean, using only the same 3 glibc libraries : RL > ldd ./klist | grep -v FOO libc.so.6 => /lib/tls/libc.so.6 (0x00111000) libdl.so.2 => /lib/libdl.so.2 (0x0035d000) /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x00f8b000) Cleaned up sshlib under SL3 by removing unused ld-linux.so.2 tmp/libdl.so.2 Added kkinit and kklist scripts, adjusted the aliases in setupssh.[c]sh changed kinit/klist aliases to kkinit/kklist for consistency ############ # MCIMPORT # ############ /pnfs/minos/mcin_data/near/daikon_03/CosmicMu/218/n10032185_0006_CosmicMu_D03.reroot.root is reported by howie to be unreadable too short ( 31645696 vs usual 45000000 size ) ########## # DCACHE # ########## /pnfs/minos/fardet_data/2007-09/F00039679_0019.mdaq.root stuck in genpy since 16:06:13 UTC 2007 ============================================================================= 2007 09 18 ####### # SSH # ####### For SL4, I have just put together http://www-numi.fnal.gov/computing/dh/sshkrb5_sl4.txt For SL3, I have renamed the previous files http://www-numi.fnal.gov/computing/dh/sshkrb5_sl3.txt SL4 requires different shared libraries, to avoid getting message You don't exist, go away! ####### # UPS # ####### Per reports from users failing to run LSF jobs, surveying usage of /fnal/ups versus /local/ups We should have installed upsupdbootstrap-local In all cases there is a symlink from /usr/local/etc/setups.* /local/ups minos01-minos10 minos12-minos26 minos25 is a copy, not symlink flxi02 flxi06 flxb10 flxb24 flxb30 /fnal/ups minos11 flxi04 flxi05 flxi07 flxb31-flxb35 /afs/fnal.gov/ups flxi03 Cannot log into flxb11 flxb13 flxb16-flxb23 flxb25-flxb29 Correction, scanning Cluster for /u/l/e/setups links, 11 -> /fnal/ups rest -> /local/ups 25 is not a symlink, but a direct copy of the files. ####### # LSF # ####### lhsu is having trouble in batch. Jobs that try to run #!/bin/tcsh exit cp ~llhsu/scripts/batch/test_sub.scr . bsub -R "linux26" -q minos test_sub.scr Exited for me, but with this message : /local/ups/prd/ups/v4_7_2/Linux-2/bin/ups: Command not found. But I can run a similar test job in bash, bsub -R "linux26" -q minos test_lsf ######## # GRID # ######## kreymer@minos26 : mkdir /grid/data/minos/users chmod 775 /grid/data/minos/users MINOS26 > mkdir /grid/data/minos/users/boehm MINOS26 > mkdir /grid/data/minos/users/brebel MINOS26 > mkdir /grid/data/minos/users/habig MINOS26 > mkdir /grid/data/minos/users/jdejong MINOS26 > mkdir /grid/data/minos/users/kreymer MINOS26 > mkdir /grid/data/minos/users/petyt MINOS26 > mkdir /grid/data/minos/users/rustem MINOS26 > mkdir /grid/data/minos/users/scavan MINOS26 > mkdir /grid/data/minos/users/tinti chmod 775 /grid/data/minos/users/* boehm brebel habig jdejong kreymer petyt rustem scavan tinti ============================================================================= 2007 09 17 ####### # SSH # ####### Testing portable access at RAL At Fermilab, in computing/dh, did tar cvzf sshkrb5.tar -C /afs/fnal.gov/files/home/room3/hartnell/programs/sshkrb5 . 
Then at csf.rl.ac.uk mkdir -p ${HOME}/programs/sshkrb5 cd ${HOME}/programs/sshkrb5 curl http://www-numi.fnal.gov/computing/dh/sshkrb5.tgz -o sshkrb5.tgz tar xzvf sshkrb5.tgz . setupkssh.sh kinit kreymer@FNAL.GOV /usr/kerberos/bin/klist -f kssh -l kreymer minos26.fnal.gov pwd Updated sshkrb5.tgz on web server to include setupkssh.sh adjusted for bash and to make alias kkinit instead of kinit, using /usr/kerberos/bin/kinit Some data from sjc attempts, Sep 17 14:14:39 minos26 sshd[21042]: error: PAM: Authentication failure for sjc from nova.physics.wm.edu Sep 17 14:14:39 minos26 sshd[21042]: Connection closed by ::ffff:128.239.52.85 grep -v 'session opened for user' /var/log/messages | less Sep 17 14:14:39 minos26 sshd: pam_krb5[21043]: authentication fails for 'sjc' (sjc@FNAL.GOV): Authentication service cannot retrieve authentication info. (Cannot contact any KDC for requested realm) ######## # FARM # ######## Further permission problems under /pnfs/minos/mcout_data/cedar_phy_bhcurv Needed chmod 775 cedar_phy_bhcurv MINOS26 > stat near File: `near' Size: 512 Blocks: 1 IO Block: 512 directory Device: 14h/20d Inode: 254702744 Links: 1 Access: (0755/drwxr-xr-x) Uid: ( 1060/ kreymer) Gid: ( 5111/ e875) Access: 2007-09-14 15:42:01.000000000 -0500 Modify: 2007-09-14 15:42:01.000000000 -0500 Change: 2007-08-30 12:58:40.000000000 -0500 chmod 775 cedar_phy_bhcurv/near cd near MINOS26 > stat daikon_04 File: `daikon_04' Size: 512 Blocks: 1 IO Block: 512 directory Device: 14h/20d Inode: 255895248 Links: 1 Access: (0755/drwxr-xr-x) Uid: ( 1060/ kreymer) Gid: ( 5111/ e875) Access: 2007-09-14 15:42:01.000000000 -0500 Modify: 2007-09-14 15:42:01.000000000 -0500 Change: 2007-09-14 15:42:01.000000000 -0500 The dakkon_04/L010185N and .../*_data directories have proper ownership and groups set. ============================================================================= 2007 09 15 Sat ####### # DAQ # ####### Added habig root access to minos-beamdata # also had to add myself, getting access via new password minos-rc minos-evd minos-acnet Had previously done minos-om ######## # FARM # ######## Did chgrp -R e875 /pnfs/minos/mcout_data/cedar_phy_bhcurv per rubin request. ============================================================================= 2007 09 14 ############# # CHECKLIST # ############# VO8597 is available again, went NOACCESS to be copied on 2007 09 12 ######## # SADD # ######## Corrected names of older versions for MD in 0418 0420 0503 0513 0516 0520 0624 0707 0711 ; do mv sadd.${MD} sadd.2005${MD} ; done ######## # FARM # ######## ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010185N write ############ # SADDRECO # ############ Corrected name of file, to reflect current date. mv saddreco.20070707 saddreco.20070913 ######### # MYSQL # ######### Still have a heavy load from Tuesday's brebel jobs, MINOS26 > bjobs -u brebel | grep flxb | wc -l 19 MINOS26 > bjobs -u brebel -l 405371 Job <405371>, User , Project , Status , Queue <1day>, Com mand Tue Sep 11 16:39:55: Submitted from host , CWD , Requested Resources ; Tue Sep 11 17:34:02: Started on , Execution Home , Execution CWD ; Fri Sep 14 09:24:16: Resource usage collected. The CPU time used is 5978 seconds. MEM: 248 Mbytes; SWAP: 399 Mbytes; NTHREAD: 5 PGID: 6520; PIDs: 6520 6544 6545 7215 7116 ... These are 1 day jobs, CPU limit normalized to flxi06 (CPUF 1390.00) Most nodes have CPUF 1200.00 ######### # CDOPS # ######### Requested mailing list, to archive my summaries. 
============================================================================= 2007 09 13 ########## # CONDOR # ########## 09:00 Steve Timm found a typo in a config file. Nodes are now registering. All workers need a reconfigure and restart. ####### # AFS # ####### Created minos:reco group NEWGROUP=reco pts creategroup -name kreymer:${NEWGROUP} group kreymer:reco has id -2481 NEWUSERS='boehm masaki jmusser naples rustem sjc sujeewa tinti' for GUSER in ${NEWUSERS} ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts setfields kreymer:${NEWGROUP} -access SOMar pts membership kreymer:${NEWGROUP} pts examine kreymer:${NEWGROUP} Name: kreymer:nonap, id: -1941, owner: kreymer, creator: kreymer, membership: 5, flags: SOMar, group quota: 0. pts chown kreymer:${NEWGROUP} minos:admin pts examine minos:${NEWGROUP} pts membership minos:${NEWGROUP} ####### # AFS # ####### may need to go back and pts chown minos:GROUP minos:admin ( change ownership ) ####### # AFS # ####### Requesting volume for rustem's reco studies using the new group, per HOWTO.afs Size 50000 not backed up Volume /afs/fnal.gov/files/data/minos/d267 ACL's system:administrators rlidwka system:anyuser rl minos:admin rlidwka minos rl minos:reco rlidwka ####### # AFS # ####### DVOLS=`ls -d d??? | sort` for VOL in $DVOLS ; do echo $VOL; fs listacl ${MINOS_DATA}/${VOL} | grep -v system | grep rlidw; done 2>&1 | less ######## # DATA # ######## Checking reported segfaults reading /pnfs/minos/reco_near/cedar_phy/sntp_data/2005-11/N00009259_0000.spill.sntp.cedar_phy.0.root 2131512181 May 15 Pretty close to SLIM=2147483647 # 2^32 - 1 DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy/sntp_data/2005-11/N00009259_0000.spill.sntp.cedar_phy.0.root DFILE1=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy/sntp_data/2005-11/N00009259_0022.spill.sntp.cedar_phy.0.root setup_minos -r R1.24.2 loon -bq firstlast.C ${DFILE} does not crash, but does not find anything ( designed to read raw data ) in /local/scratch26/kreymer/DATA, tried direct hadd from dcache, too slow MINOS26 > dccp $DFILE1 . 201063216 bytes in 4 seconds (49087.70 KB/sec) MINOS26 > dccp $DFILE . 2131512181 bytes in 53 seconds (39274.62 KB/sec) MINOS26 > hadd testhadd.root N00009259_0000.spill.sntp.cedar_phy.0.root N00009259_0022.spill.sntp.cedar_phy.0.root Target file: testhadd.root MINOS26 > ls -l N00009259* testhadd.root -rw-r--r-- 1 kreymer 1525 2131512181 Sep 13 14:20 N00009259_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 kreymer 1525 201063216 Sep 13 14:18 N00009259_0022.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 kreymer 1525 2332536940 Sep 13 14:22 testhadd.root Tried this also on minos11, which might have 32 bit limit size is 2332535450 ############ # SADDRECO # ############ 08:55 start a fat declaration ! 
AFSS/saddreco.20070707 near cedar_phy L010185N declare -m daikon_00 2>&1 | tee /var/tmp/saddreco.declare.log STARTED Thu Sep 13 13:57:40 2007 FINISHED Thu Sep 13 14:53:01 2007 grep -v declared /var/tmp/saddreco.declare.log Now for the rest : MODS=/pnfs/minos/mcout_data/cedar_phy/near/daikon_00 DIRS=`ls $MODS` MINOS26 > echo $DIRS L010000N L010170N L010185N L010200N L100200N L150200N L250200N MINOS26 > for DIR in ${DIRS} ; do echo ${DIR} ; ls -R ${MOD}/${DIR} | wc -l ; done L010000N 2003 L010170N 165 L010185N 8275 L010200N 163 L100200N 704 L150200N 290 L250200N 1356 11:11 for DIR in L010000N L010170N L010200N L100200N L150200N L250200N ; do AFSS/saddreco.20070707 near cedar_phy ${DIR} declare -m daikon_00 \ 2>&1 | tee /var/tmp/saddreco.${DIR}.declare.log done SAMDIM=' RUN_TYPE physics% and MC.BEAM L010185N and DATA_TIER sntp-near and VERSION cedar.phy ' OOPS, the parameters are not being written to SAM. needed to add this to copymeta, as was done in saddmc Need to remove all these files ! sam undeclare n13011010_0000_L010200N_D00.sntp.cedar_phy.root reran tests, looked at the metadata. Readjusted fileType to importedSimulated in enupdate Declared this one file, after removal , SAMDIM=" RUN_TYPE physics% and VERSION cedar.phy and FULL_PATH like /pnfs/minos/mcout_data/cedar_phy/near/daikon_00% " MINOS26 > sam list files --dim="${SAMDIM}" --summaryonly File Count: 12217 Average File Size: 590.79MB Total File Size: 6.88TB Total Event Count: 8785200 MINOS26 > ./samlocate "${SAMDIM}" | wc -l 12217 real 4m22.920s user 0m31.905s sys 0m1.878s MINOS26 > ./samundeclare -b 1 "${SAMDIM}" -v MINOS26 > ./samundeclare "${SAMDIM}" -b 10 MINOS26 > ./samundeclare "${SAMDIM}" -b 100 real 0m27.710s MINOS26 > ./samundeclare "${SAMDIM}" -b 1 BAIL after 1 Found 12104 files undeclared n13011034_0010_L250200N_D00.cand.cedar_phy.root MINOS26 > ./samundeclare "${SAMDIM}" -b 1 BAIL after 1 Found 12103 files undeclared n13011034_0004_L250200N_D00.cand.cedar_phy.root This looks pretty good, let's go for it all ! 
MINOS26 > date ; time ./samundeclare "${SAMDIM}" Thu Sep 13 18:30:24 CDT 2007 real 27m23.749s user 0m29.169s sys 0m1.742s MINOS26 > ./samlocate "${SAMDIM}" OK, now we can redeclare everything MODS=/pnfs/minos/mcout_data/cedar_phy/near/daikon_00 DIRS=`ls $MODS` DIR=L010000N AFSS/saddreco.20070707 near cedar_phy ${DIR} verify -m daikon_00 -b 1 22:20 for DIR in ${DIRS} ; do AFSS/saddreco.20070707 near cedar_phy ${DIR} declare -m daikon_00 \ 2>&1 | tee /var/tmp/saddreco.${DIR}.declare.log done for DIR in ${DIRS} ; do grep -v declared /var/tmp/saddreco.${DIR}.declare.log ; done | less Declaring to SAM dev near cedar_phy L010000N declare STARTED Fri Sep 14 03:20:18 2007 FINISHED Fri Sep 14 03:36:26 2007 Declaring to SAM dev near cedar_phy L010170N declare STARTED Fri Sep 14 03:36:28 2007 FINISHED Fri Sep 14 03:37:38 2007 Declaring to SAM dev near cedar_phy L010185N declare STARTED Fri Sep 14 03:37:40 2007 FINISHED Fri Sep 14 04:48:10 2007 Declaring to SAM dev near cedar_phy L010200N declare STARTED Fri Sep 14 04:48:12 2007 FINISHED Fri Sep 14 04:49:23 2007 Declaring to SAM dev near cedar_phy L100200N declare STARTED Fri Sep 14 04:49:25 2007 FINISHED Fri Sep 14 04:54:50 2007 Declaring to SAM dev near cedar_phy L150200N declare STARTED Fri Sep 14 04:54:52 2007 FINISHED Fri Sep 14 04:56:59 2007 Declaring to SAM dev near cedar_phy L250200N declare STARTED Fri Sep 14 04:57:01 2007 FINISHED Fri Sep 14 05:08:07 2007 looks good, informed hartnell and m_s_d sam list files --summaryOnly \ --dim="RUN_TYPE 'physics%' \ and MC.RELEASE 'daikon_00' \ and VERSION 'cedar.phy'" File Count: 12217 Average File Size: 590.79MB Total File Size: 6.88TB Total Event Count: 8785200 sam list files --summaryOnly \ --dim="RUN_TYPE physics% \ and MC.BEAM='L010185N' \ and VERSION='cedar.phy'" File Count: 7851 Average File Size: 553.83MB Total File Size: 4.15TB Total Event Count: 5641200 MINOS26 > find /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N -type f | wc -l 7851 sam list files --summaryOnly \ --dim="RUN_TYPE physics% \ and MC.BEAM='L010185N' \ and DATA_TIER sntp-near \ and VERSION='cedar.phy'" File Count: 796 Average File Size: 555.35MB Total File Size: 431.70GB Total Event Count: 2819200 ============================================================================= 2007 09 12 ############ # PREDATOR # ############ Removed damaged ( full disk ) F00039589_0000.sam.py under /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/fardet_data/2007-09 13:09 CDT rm F00039589_0000.sam.py last good FD declare was F00039586_0022.mdaq.root at 16:08:56 2007 UTC STARTING Mon Sep 10 18:06:17 UTC 2007 Treating 472 files Scanning 4 files F00039586_0023.mdaq.root Mon Sep 10 18:06:28 UTC 2007 F00039587_0000.mdaq.root Mon Sep 10 18:07:54 UTC 2007 F00039588_0000.mdaq.root Mon Sep 10 18:08:29 UTC 2007 F00039589_0000.mdaq.root Mon Sep 10 18:09:13 UTC 2007 ? FINISHED Mon Sep 10 18:09:58 UTC 2007 Try manually MINOS26 > cds MINOS26 > HOSTNA=`hostname -s | cut -c 1-5` MINOS26 > HOSTNU=`hostname -s | cut -c 6-` MINOS26 > LOGPAT=/local/scratch${HOSTNU}/kreymer/log MINOS26 > setup sam -q prd MINOS26 > DET=fardet_data MINOS26 > MONTH=2007-09 ./sadd ${DET}/${MONTH} declare 2>&1 | tee -a ${LOGPAT}/samadd/${DET}/${MONTH}.log failed, backed off and did a verify, that looks OK, after actually deleting the damaged F00039589_0000.sam.py Will let the next predator cycle clean up at 15:06 CDT. That worked OK ! 
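A follow-up thought : after any disk-full incident, a quick scan for empty or syntactically broken .sam.py files would catch this sooner. A sketch ; the compile test is just one plausible validity check :
GDAT=/afs/fnal.gov/files/home/room1/kreymer/minos/GDAT
find ${GDAT}/fardet_data/2007-09 -name '*.sam.py' -size 0
for PY in ${GDAT}/fardet_data/2007-09/*.sam.py ; do
  python -c "compile(open('${PY}').read(),'${PY}','exec')" > /dev/null 2>&1 \
    || echo "BAD ${PY}"
done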
############ # SADDRECO # ############ Resuming work Cleaned up gnu_getops handling May want to test on /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010200N/sntp_data/100 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010200N/sntp_data/100 In case it is needed, the full lowest directory list is find /pnfs/minos/mcout_data/cedar_phy/near/daikon_00 -type d -name \?\?\? Testing with PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.dev:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9000 Try like AFSS/saddreco.20070707 near cedar_phy L010200N verify 3 -m daikon_00 AFSS/saddreco.20070707 near cedar_phy 2007-03 verify 3 recodirs looks OK, after a tweak. For MC OK - getopt OK - set MCRELEASE OK - RECODIR modified OK - .mdaq.root modified OK - bypass pass/obsolete calculation for MC OK - veto os.rename of index file for SAMQ not prd OK - sam add tape location check SAM_ORACLE_CONNECT add the location and report OK - correct fake last event number for MC Run validation of a fat directory AFSS/saddreco.20070707 near cedar_phy L010185N verify -m daikon_00 2>&1 | tee /var/tmp/saddreco.verify.log STARTED Thu Sep 13 05:23:49 2007 FINISHED Thu Sep 13 06:12:14 2007 grep -v verified /var/tmp/saddreco.verify.log Looks clean ! ############# # CHECKLIST # ############# VO8597 0.15GB (NOTALLOWED 0911-0917 full 0117-1142) CD-9940B minos.reco_near_R1_18_2.cpio_odc Being copied to new media 091107 mysql1 load to about 20 ramp up 17:00 to 18:00 yesterday ######### # MYSQL # ######### Load average went to about 20 yesterday, ramped up 14:30 to 18:00 backups of mysql will have to wait till the load comes down. show full processlist shows queries like select min(TIMESTART) from DCS_MAG_FARVLD where TIMESTART > '2006-09-19 04:37:20' and DetectorMask & 2 and SimMask & 1 and CREATIONDATE >= '2006-09-19 04:18:33' and Task = 0 select max(TIMESTART) from DCS_MAG_FARVLD where TIMESTART < '2007-02-19 08:15:44' and DetectorMask & 2 and SimMask & 1 and CREATIONDATE >= '2007-02-19 08:32:51' and Task = 0 select * from DCS_MAG_FARVLD where TimeStart <= '2005-05-23 23:51:42' and TimeEnd > '2005-05-23 23:15:42' and DetectorMask & 2 and SimMask & 1 and Task = 0 order by CREATIONDATE desc No single command seems to take more than 15 seconds. The first two commands seems to be sending data, the latter seems to be sorting informed brebel, rhatcher rhatcher has found the root of the problem, may be able to implement an improvement if not an optimal solution. Will let the jobs run to completion, as they have done before. ============================================================================= 2007 09 11 ############### # GRIDAPPSYNC # ############### Cloned a new script to rsync /grid/app/minos/products from afs Added it to crontab.dat, running at 05:10 daily ######## # GRID # ######## Need write access to /grid/data and app HelpDesk ticket 103963 done 13:53 ######## # GRID # ######## rsync products per 2007 08 02 example after getting mounts corrected real 0m48.178s user 0m0.823s sys 0m6.144s ####### # LSF # ####### Tested submitting to SL3 vs SL4 nodes MINOS26 > bsub -R "linux24" pwd MINOS26 > bsub -R "linux26" pwd Tested cross kernel submission from minos11 bsub -R "linux26" ". /usr/local/etc/setups.sh ; setup encp ; type encp" from minos26 bsub -R "linux24" ". /usr/local/etc/setups.sh ; setup encp ; type encp" Look OK to me. 
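The same cross-kernel check, written as a loop so it can be rerun after future kernel updates - a sketch :
for RES in linux24 linux26 ; do
  bsub -R "${RES}" ". /usr/local/etc/setups.sh ; setup encp ; type encp"
done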
BNODES='10 11 13 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35' for NODE in $BNODES ; do printf "${NODE} " ; ssh -a -x -n flxb${NODE} grep MHz /proc/cpuinfo | head -1 | grep MHz ; done 10 cpu MHz : 999.564 11 13 16 17 18 19 20 21 22 23 24 cpu MHz : 2666.815 25 26 27 28 29 30 cpu MHz : 2666.831 31 cpu MHz : 2194.864 32 cpu MHz : 2194.740 33 cpu MHz : 2194.891 34 cpu MHz : 2194.950 35 cpu MHz : 2394.301 ####### # AFS # ####### request from brebel for pittam access to NC disks NCVOLS='138 147 169 187 204 211 228 229' for VOL in $NCVOLS ; do echo $VOL; fs listacl ${MINOS_DATA}/d${VOL} | grep -v system | grep rlidw; done | less 138 buckley:ana_ntuples rlidwka 147 buckley rlidwka brebel rlidwka 169 buckley:ana_ntuples rlidwka kreymer rlidwka 187 buckley rlidwka brebel rlidwka 204 buckley rlidwka brebel rlidwka 211 buckley rlidwka brebel rlidwka 228 minos:admin rlidwka brebel rlidwka 229 minos:admin rlidwka brebel rlidwka pts membership buckley:ana_ntuples Members of buckley:ana_ntuples (id: -1536) are: buckley brebel cd $MINOS_DATA fs setacl -dir d228 -acl buckley:ana_ntuples rlidwka fs setacl -dir d229 -acl buckley:ana_ntuples rlidwka fs setacl -dir d169 -acl minos:admin rlidwka pts adduser -user pittam -group buckley:ana_ntuples ============================================================================= 2007 09 10 ########### # MONTHLY # ########### DATASETS 9/10 PREDATOR 9/10 SADDRECO 9/10 # the last time for this, no longer needed VAULT 9/11 MYSQL 9/20 MINOS26 > ./dcache/datasets g ./dcache/datasets: line 122: [: too many arguments Removed SADDRECO step, this is handled by roundup unless cand files are produced without any sntp This should never happen. Vault failed the first time through, ran out of disk $ du -sm /pnfs/minos/vault/neardet/2007-05 89614 /pnfs/minos/vault/neardet/2007-05 $ du -sm /pnfs/minos/vault/neardet/2007-06 91205 /pnfs/minos/vault/neardet/2007-06 $ du -sm /pnfs/minos/vault/neardet/2007-07 85421 /pnfs/minos/vault/neardet/2007-07 $ du -sm /pnfs/minos/neardet_data/2007-08 28618 /pnfs/minos/neardet_data/2007-08 $ du -sm /pnfs/minos/vault/fardet/2007-05 25446 /pnfs/minos/vault/fardet/2007-05 $ du -sm /pnfs/minos/vault/fardet/2007-06 28914 /pnfs/minos/vault/fardet/2007-06 $ du -sm /pnfs/minos/vault/fardet/2007-07 28262 /pnfs/minos/vault/fardet/2007-07 $ du -sm /pnfs/minos/fardet_data/2007-08 124641 /pnfs/minos/fardet_data/2007-08 waited for mcimport to catch up ############ # MCIMPORT # ############ Adding automatic move of non-tar files to BAD, contining processing Typical time to gunzip -t is du -sm n11035090_0027_L010185N_D03.tar.gz 319 n11035090_0027_L010185N_D03.tar.gz real 0m21.086s user 0m9.072s sys 0m0.644s real 0m9.292s user 0m8.983s sys 0m0.305s So will leave this test where it is, as the files are about to be tarred. 14:09 AFSS/mcimport.20070910 kordosky Crashed due to script error ( corrected now ) after finding corrupt file ( this file also needed local md5sum ) n11035090_0010_L010185N_D03.tar.gz Removed the good but unchecked tarfile mv tar/n11035090_0002_L010185N_D03-n11035090_0027_L010185N_D03.tar BAD/n11035090_0002_L010185N_D03-n11035090_0027_L010185N_D03.tar mv BAD/n11035090_0002_L010185N_D03-n11035090_0027_L010185N_D03.tar DUP/ 17:25 AFSS/mcimport.20070910 -f 100 kordosky $ cp AFSS/mcimport.20070910 . 
$ ln -sf mcimport.20070910 mcimport was mcimport.20070711 finished at 17:50, restarted cronjob ####### # AFS # ####### Need to add zarko to buckley:beamdata group, and/or add minos:beam to d239 and d188 MINOS26 > pts adduser -user zarko -group buckley:minosbeam MINOS26 > pts membership buckley:minosbeam MINOS26 > fs setacl -dir d239 -acl minos:beam rlidwka fs: You don't have the required access rights on 'd239' Need a global addition of minos:admin by buckley ============================================================================= kreymer vacation Sep 1-9 Sent Minos cluster description to timm, for condor planning Set up corral for cedar_phy_bhcurv minosora3 memory problems are gone REC M.C. timesheet signed and submitted for Sep left mesage with Joe Boyd regarding BlueArc disk, and condor planning ============================================================================= 2007 08 31 ########### # ROUNDUP # ########### Added full cedar_phy_bhcurv stanza to corral. Ran corral manually at 08:40, to test before I leave on vacation this PM. /home/minfarm/ROUNTMP/ROOTRELS added cedar_phy_bhcurv Ran corral manually at 09:17 Oops, needed to add locations could use the new version of saddreco ! export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do ./samtapeloc /pnfs/minos/reco_near/cedar_phy_bhcurv ${REL} ; done MINOS26 > sam get metadata --file=N00008579_0004.spill.cand.cedar_phy_bhcurv.0.root MINOS26 > sam add location --file=N00008579_0004.spill.cand.cedar_phy_bhcurv.0.root --loc='/pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2005-09(dcache.31)' MINOS26 > sam locate N00008579_0004.spill.cand.cedar_phy_bhcurv.0.root ['/pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2005-09,31@dcache'] OK, all 32 concatenated files are declared, along with cands. ####### # AFS # ####### Tested mengel's suggestion for /usr/bin/aklog problem : 0-59 * * * * /usr/krb5/bin/kcron "/usr/krb5/bin/aklog ; ${HOME}/minos/scripts/crontestark" ####### # AFS # ####### MINOS26 > pts adduser -user zarko -group minos:beam MINOS26 > pts membership minos:beam ############# # MINOSORA3 # ############# --------------------------------------------- Date: Fri, 31 Aug 2007 09:41:55 -0500 From: Maurine Mihalek so far, so good. no warning messages since dimm's were replaced tuesday afternoon. --------------------------------------------- The previous rate was a few errors per hour. I declare victory ! ============================================================================= 2007 08 30 ######### # ADMIN # ######### MIN > for NODE in $NODES ; do printf "\n${NODE} `date`\n"; ssh -ax ${NODE} "echo HELLO" ; done minos01 Thu Aug 30 14:17:26 CDT 2007 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). minos02 Thu Aug 30 14:17:27 CDT 2007 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). ... minos20 Thu Aug 30 14:18:39 CDT 2007 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). The rest were OK Ran another pass minos02 Thu Aug 30 14:25:39 CDT 2007 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). minos20 Thu Aug 30 14:26:47 CDT 2007 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). 
in cvshlog, found 38 instances of Thu Aug 30 14:00:07 2007 (west@163.1.244.104) : cvsh -c cvs server [sS] In minos01 /var/log/messages, About 38 messages like Aug 30 13:32:18 minos01 sshd(pam_unix)[31107]: session opened for user root by root(uid=0) node 163.1.244.104 is pplxgenng.physics.ox.ac.uk Then there is a veritable flood of about 128 repititions ... Aug 30 14:02:49 minos01 sshd[32618]: rexec line 45: Deprecated option RhostsAuthentication Aug 30 14:02:50 minos01 sshd[32618]: Invalid user minoscvs from 163.1.244.104 Aug 30 14:02:50 minos01 sshd[32618]: input_userauth_request: invalid user minoscvs Aug 30 14:02:50 minos01 sshd[32618]: Failed none for invalid user minoscvs from 163.1.244.104 port 54925ssh2 Aug 30 14:02:50 minos01 sshd[32618]: Failed publickey for invalid user minoscvs from 163.1.244.104 port54925 ssh2 Aug 30 14:02:50 minos01 sshd[32618]: Failed keyboard-interactive for invalid user minoscvs from163.1.244.104 port 54925 ssh2 Aug 30 14:02:50 minos01 sshd[32618]: Connection closed by 163.1.244.104 Looks like it cleared up around 14:51. I see nothing remarkable on minos02 /var/log/messages, but see minos20 : Aug 30 13:36:41 minos02 ypbind: ypbind shutdown succeeded Aug 30 13:36:41 minos02 ypbind: ypbind startup succeeded Aug 30 13:36:42 minos02 ypbind: bound to NIS server minos02.fnal.gov Aug 30 13:39:45 minos02 ypserv: ypserv shutdown succeeded Aug 30 13:39:45 minos02 ypserv[13584]: WARNING: no securenets file found! Aug 30 13:39:45 minos02 ypserv[13584]: Support for SLP (line 20) is not compiled in. Aug 30 13:39:45 minos02 ypserv[13584]: Support for SLP (line 22) is not compiled in. Aug 30 13:39:45 minos02 ypserv: ypserv startup succeeded Aug 30 13:46:00 minos02 ypbind: ypbind shutdown succeeded Aug 30 13:46:00 minos02 ypbind: ypbind startup succeeded Aug 30 13:46:01 minos02 ypbind: bound to NIS server minos02.fnal.gov Aug 30 13:55:24 minos02 sshd(pam_unix)[13791]: session opened for user root by root(uid=0) Aug 30 14:25:44 minos02 sshd(pam_unix)[13963]: session opened for user root by root(uid=0) Aug 30 14:27:11 minos02 rpc.ypxfrd[14035]: WARNING: no securenets file found! Aug 30 14:27:11 minos02 rpc.ypxfrd[14035]: Support for SLP (line 20) is not compiled in. Aug 30 14:27:11 minos02 rpc.ypxfrd[14035]: Support for SLP (line 22) is not compiled in. Aug 30 14:27:11 minos02 ypxfrd: rpc.ypxfrd startup succeeded Aug 30 14:28:11 minos02 sshd(pam_unix)[14071]: session opened for user kreymer by kreymer(uid=0) Aug 30 14:28:35 minos02 sshd(pam_unix)[14396]: session opened for user kreymer by (uid=0) Aug 30 14:32:36 minos02 ypserv: ypserv shutdown succeeded Aug 30 14:32:36 minos02 ypserv[14440]: WARNING: no securenets file found! Aug 30 14:32:36 minos02 ypserv[14440]: Support for SLP (line 20) is not compiled in. Aug 30 14:32:36 minos02 ypserv[14440]: Support for SLP (line 22) is not compiled in. Aug 30 14:32:36 minos02 ypserv: ypserv startup succeeded Aug 30 14:32:45 minos02 ypbind: ypbind shutdown succeeded Aug 30 14:32:45 minos02 ypbind: ypbind startup succeeded Aug 30 14:32:45 minos02 ypbind: bound to NIS server minos01.fnal.gov I don't see this on other nodes. Reply from Jason Harrington - minos02 was misconfigured, referred to itself. This was corrected by 14:30 . ( My guess - NIS load shifted to minos02, then got stuck. 
) ############ # MCIMPORT # ############ DUP cleanup $ du -sm */DUP 1 arms/DUP 45250 hgallag/DUP 29 howcroft/DUP 1553 kordosky/DUP 75 kreymer/DUP 1 sjc/DUP for DIR in `ls -d */DUP` ; do printf "$DIR " find ${DIR} -type f -ctime +10 -exec echo {} \; | wc -l done arms/DUP 0 hgallag/DUP 172 howcroft/DUP 3 kordosky/DUP 6 kreymer/DUP 6 sjc/DUP 0 for DIR in `ls -d */DUP` ; do printf "$DIR " find ${DIR} -type f -ctime +10 -exec rm {} \; done ############ # MCIMPORT # ############ From cron job email /home/mindata/mcimport.20070203: line 368: srmcp: command not found From kordosky/log/mcimport.log Thu Aug 30 07:54:33 CDT 2007 exeAccess failed for java OOPS, found the cronjob, per crontab.dat, running mcimport.20070203 That is a truly ancient and dysfunctional version. This was due to the restoration of the mindata account from an old copy. Corrected 37 0,6,12,18 * * * ${HOME}/mcimport.20070203 -c ALL to 37 0,6,12,18 * * * ${HOME}/mcimport -c ALL While we're at it, updated /home/mindata/.srmconfig/kreymer.xml to reference kreymer-voms.proxy copied thusly $ pwd /home/mindata/.grid $ scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-voms.proxy kreymer-voms.proxy tested with .srmtest Thu Aug 30 10:42:46 CDT 2007: rs.state = Failed rs.error = RequestFileStatus#-2146260890 failed with error:[ at Thu Aug 30 10:42:42 CDT 2007 state Failed : user has no permission to write into path /pnfs/fnal.gov/usr/minos/stage/kordosky ] Reverted to kreymer-doe.proxy First file copy to /pnfs/minos/stage/kordosky/n11035001_0000_L010185N_D03-n11035001_0004_L010185N_D03.tar at 11:02 started ok, then slowed down linearly over 5 mintues to near 0 rate, took 10 minutes for 1.7 GB. Next file was close to 1 minute. ####### # SAM # ####### export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.bhcurv done ######### # KCRON # ######### /usr/bin/aklog does not work with kcron tickets FLXI06 > klist -f Ticket cache: /tmp/krb5cc_1060_z11732 Default principal: kreymer/cron/flxi06.fnal.gov@FNAL.GOV Valid starting Expires Service principal 08/30/07 10:22:17 08/30/07 20:22:17 krbtgt/FNAL.GOV@FNAL.GOV Flags: FIA 08/30/07 10:22:18 08/30/07 20:22:17 afs@FNAL.GOV Flags: FA FLXI06 > /usr/bin/aklog aklog: Couldn't get fnal.gov AFS tickets: aklog: Improper format of Nov 11ion database entry while getting AFS tickets GRRRRRRR Doing vanilla test of cron on flxi0* with crontab crontest.dat Some nodes fail, with kinit: Client not found in Kerberos database while getting initial credentials flxi02 - ok flxi03 - ok flxi04 - ok flxi05 - kinit fails flxi06 - kinit fails flxi07 - kinit fails and kcron fails interactively ============================================================================= 2007 08 29 ########### # ROUNDUP # ########### ./roundup -M -r cedar_phy mockfar ########### # ROUNDUP # ########### Updated to user kreymer-voms.proxy /export/stage/minfarm/.srmconfig/kreymer.xml Also commented out the .key and .pem file references, not needed. 
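A sketch for probing kcron on each flxi node in one pass, instead of waiting for the crontest.dat entries to fire. Assumes kcron accepts a command string (as in the crontab entries elsewhere in this log) and that kcroninit has already been run wherever it can be.

for N in 02 03 04 05 06 07 ; do
  # klist -s just returns success/failure for the ticket kcron obtains
  printf "flxi${N} " ; ssh -ax -n flxi${N} '/usr/krb5/bin/kcron klist -s && echo kcron OK || echo kcron FAILED'
done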
######## # CRON # ######## Cron works OK on minos11 and minos12 with 0-59 * * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/crontestark MINOS12 > rpm -qa | grep cron anacron-2.3-32 vixie-cron-4.1-47.EL4 crontabs-1.10-7 MINOS01 > rpm -qa | grep cron anacron-2.3-32 vixie-cron-4.1-47.EL4 crontabs-1.10-7 Thiings started working today, on minos01, with MAILTO='kreymer@fnal.gov' 0-59 * * * 0,1,2,3,4,5,6 /usr/krb5/bin/kcron ${HOME}/minos/scripts/crontestark MAILTO='kreymer@fnal.gov' 0-59 * * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/crontestark quiet Restored crontab.minos01 to use of 1,3,5 ######### # FNALU # ######### Requested CLUBS upgrade to SLF 4.4 of fnalu-admin. Spoke to Wayne Baisley today, after Unix Users meeting. He will consider retiring/retaining some flxb nodes in an SL3 queue, upgrading the rest. Told him of our current problem having upgraded to SL4 interactive, and the upcoming Collaboration meeting in Sep. ============================================================================= 2007 08 28 ####### # CVS # ####### For tjyang, added to .k5login Verified that updates got correct username in the log In earlier tests, verified that username is logged without having ssh1 identity ####### # AFS # ####### cfl, afssum and afsfree cron jobs on minos01 have been silent. Missing tokens due to /usr/bin/aklog, fix by setting PATH="/usr/krb5/bin:${PATH}" and running aklog in the scripts. crontab ignores list of day of week 0-59 * * * 2-6/2 works ok 0-59 * * * 0,1,2,3,4,5,6 is ignored tested using /tmp/ct1 ########### # GANGLIA # ########### Minos Server is back online, along with Minos Cluster and Minos Oracle. ############## # CRYPTOCARD # ############## All Minos Cluster nodes now have cryptocard access. rennie restarted all sshd servers. Minos01 sshd was hosed, would not restart, and sshd.cvs crashed. Restarted both around 11:30 . We have cryptocard and cvs access again. ######### # GENPY # ######### [minos@minos-offline2 root_files]$ ls -l /data/root_files/F00039050_0008.mdaq.root -rw-r--r-- 1 minos e875 17413425 Aug 27 00:05 /data/root_files/F00039050_0008.mdaq.root [minos@minos-offline2 root_files]$ md5sum F00039050_0008.mdaq.root b3e0edf63239755dfa55a23d486b2049 F00039050_0008.mdaq.root MINOS26 > scp -c blowfish minos@minos-offline2.minos-soudan.org:/data/root_files/F00039050_0008.mdaq.root F00039050_0008m.mdaq.root MINOS26 > ecrc F00039050_0008m.mdaq.root CRC 1093528200 There is no level4 PNFS info yet ! 
MINOS26 > DCPOR=24125 # unsecured MINOS26 > IFILE=F00039050_0008.mdaq.root MINOS26 > IPATH=minos/fardet_data/2007-08 MINOS26 > DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} MINOS26 > dccp ${DFILE} ${IFILE} 17413425 bytes in 0 seconds MINOS26 > md5sum F0003* 20da2c880577cb4cad059ac68438975f F00039050_0008.mdaq.root b3e0edf63239755dfa55a23d486b2049 F00039050_0008m.mdaq.root MINOS26 > ~/minos/scripts/run_dbu F00039050_0008.mdaq.root /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/run_dbu: line 128: 24008 Segmentation fault dbu -bq ${HOME}/minos/scripts/dbu_sampy.C ${FILE} >>${logname} 2>&1 F00039050_0008.sam.py was not generated - check log for error F00039050_0008.log MINOS26 > ~/minos/scripts/run_dbu F00039050_0008m.mdaq.root Moving the bad file out of the way, and saving the good one : MINOS26 > pwd /local/scratch26/kreymer/DATA asked enstore-admin to do enmv /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root \ /pnfs/minos/BAD/F00039050_0008.BAD.mdaq.root no can do, file is still pending for write MINOS26 > ./dc_stat /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root ============================ PNFS status for /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root -rw-r--r-- 1 buckley e875 17413425 Aug 27 00:09 F00039050_0008.mdaq.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:05f4ea89;l=17413425; w-stkendca9a-3 LEVEL 4 ============================ MINOS26 > MINOS26 > ./dc_stat /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root ============================ PNFS status for /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root -rw-r--r-- 1 buckley e875 17413425 Aug 27 00:09 F00039050_0008.mdaq.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:05f4ea89;l=17413425; w-stkendca9a-3 LEVEL 4 ============================ rm /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root cd /local/scratch26/kreymer/DATA setup dcap DCPOR=24725 # kerberos DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} dccp F00039050_0008m.mdaq.root ${DFILE} 17413425 bytes in 1 seconds (17005.30 KB/sec) chmod 664 /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root ============================================================================= 2007 08 27 ############## # CRYPTOCARD # ############## Working now only on flxi02, 4, 7 Not 3, 5, 6 On 3 and 5, there are multiple sshd's with PPID 1. On 2, 4, 7 there is only 1. MIN > ssh flxi02 'ps -flu root | grep sshd' 5 S root 2798 1 0 75 0 - 774 - May31 ? 00:04:54 /usr/sbin/sshd MIN > ssh flxi04 'ps -flu root | grep sshd' 5 S root 3712 1 0 75 0 - 786 - May31 ? 00:01:13 /usr/sbin/sshd MIN > ssh flxi07 'ps -flu root | grep sshd' 5 S root 19307 1 0 76 0 - 5754 - Aug24 ? 00:00:00 /usr/sbin/sshd FLXI03 > ps -flu root | grep sshd 5 S root 25491 1 0 85 0 - 1043 - Jul30 ? 00:01:57 /usr/sbin/sshd 5 S root 25857 1 0 85 0 - 1043 - Jul30 ? 00:01:51 /usr/sbin/sshd 5 S root 26109 1 0 85 0 - 1043 - Jul30 ? 00:01:58 /usr/sbin/sshd 5 S root 901 1 0 75 0 - 879 - Aug23 ? 00:00:00 /usr/sbin/sshd 5 S root 18823 1 0 75 0 - 798 - Aug24 ? 00:00:00 /usr/sbin/sshd 5 S root 7406 1 0 75 0 - 988 - Aug24 ? 00:00:05 /usr/sbin/sshd MIN > ssh flxi05 'ps -flu root | grep sshd' /usr/X11R6/bin/xauth: timeout in locking authority file /afs/fnal.gov/files/home/room1/kreymer/.Xauthority 5 S root 13089 1 0 75 0 - 931 - Aug10 ? 00:02:02 /usr/sbin/sshd 5 S root 2913 1 0 75 0 - 815 - Aug23 ? 00:00:13 /usr/sbin/sshd 5 S root 3403 1 0 75 0 - 815 - Aug23 ? 00:00:00 /usr/sbin/sshd 5 S root 24057 1 0 75 0 - 986 - Aug24 ? 
00:00:01 /usr/sbin/sshd MIN > ssh flxi06 'ps -flu root | grep sshd' 5 S root 28508 1 0 75 0 - 986 - Aug24 ? 00:00:02 /usr/sbin/sshd MIN > ssh flxi07 'ps -flu root | grep sshd' 5 S root 19307 1 0 76 0 - 5754 - Aug24 ? 00:00:00 /usr/sbin/sshd Can see /var/log/secure on minos-mysql1. Buckley failed login produces Aug 27 14:08:38 minos-mysql1 sshd[5168]: Failed gssapi-with-mic for buckley from ::ffff:131.225.193.6 port 37596 ssh2 Aug 27 14:08:38 minos-mysql1 sshd[5168]: Failed keyboard-interactive for buckley from ::ffff:131.225.193.6 port 37596 ssh2 Aug 27 14:08:38 minos-mysql1 sshd[5168]: Connection closed by ::ffff:131.225.193.6 ######### # GENPY # ######### 2007-08-27 00:09:29 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/fardet_data/2007-08/F00039050_0008.mdaq.root daqdcp.minos-soudan.org 18 17413425 0 OK dbu fails on /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root R__unzip: error in inflate (zlib) Error in : fNbytes = 13452, fKeylen = 84, fObjlen = 41620, noutot = 0, nout=0, nin=13368, nbuf=41620 Test per HOWTO.genpy MINOS26 > cd ${HOME}/minos/test TIER=mdaq IFILE=F00039050_0008 DATADIR=fardet_data/2007-08 rm -f ${IFILE}.log rm -f ${IFILE}.sam.py rm -f ${IFILE}.sam.pyc minos setup_minos -r R1.22 dbu fails as before setup_minos -r R1.26 dbu fails as before, but with successful error code and producing output Let's see if we have any more old failures : find * -name \*.log -exec grep -H unzip {} \; and on minos06, cd /local/scratch06/kreymer/genpy/fardet_data find ????-?? -name \*.log -exec grep -H unzip {} \; Last previous fardet unzip error was in 2004-10 ########### # ROUNDUP # ########### corral - removed corralsrs from corral, these files are done. ########### # ROUNDUP # ########### 100 2116 mcfmockcat ./roundup -M -r cedar_phy mockfar Finished at about 10:12 About 12 seconds per 20 MBytes file writing with srmcp Need to rerun this afternoon, to flush WRITE ########### # MINOS11 # ########### Trying to clean up SLF 3 afs login, via yum install openafs-krb5 MIN > rpm -ql openafs-krb5 /usr/bin/aklog /usr/sbin/asetkey This works, and gives a long lived token. ######## # PNFS # ######## /pnfs/minos seems mounted ro on minos01 and 26, not rw. Requested that this be corrected. fixed on minos01 around 11:10, minos26 around 11:30 ######## # FARM # ######## SUMMARY OF THE 8 JOB PROCESSING OF SCALED FIELD STUDIES FORMERLY ON THE KREYMER BLACKBOARD N cedar_phy_ Det Request 1 srsafitter f CosmicLE_D02 CosmicMC_D02 2 srsafitter F 6 months cosmic 3 srsafitterbx113 n 550 files D00 L010185N 4 srsafitter n 541 files D00 L010185N_bfldx113 5 srsafitterbx113 n 541 files D00 L010185N_bfldx113 6 srsafitter N 3 mo 2005 spill 7 srsafitterbx113 N 3 mo 2005 spill 8 srsafitter n 550 files D00 L010185N N boundaries 2005-08 8433_0002 2005-09 8433_0003 10 11 9280_0018 F boundaries 2005-11 33077_0002 2006-02 33805_0006 2007-01 37162_0006 2007-02 37709_0000 CHART near far mcnear mcfar srs 6/8 2/ /4 1/ srsbx113 7/ 3/5 ============================================================================= 2007 08 25 Saturday ########### # ROUNDUP # ########### Getting flooded by processing of near cedar_phy daikon_00 L010185N Typically 50 GB/6 hours. 
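Follow-up sketch for the CRYPTOCARD entry above: count how many root sshd daemons with PPID 1 each flxi node is running, since the nodes with multiple parents were the broken ones. The node list is just the ones checked above.

for N in 02 03 04 05 06 07 ; do
  # PPID is field 5 of ps -fl output; no awk action means print, then count
  printf "flxi${N} " ; ssh -ax -n flxi${N} "ps -flu root | grep sshd | awk '\$5 == 1' | wc -l"
done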
All going into the minos file family Try correcting as rubin@fnpcsrv1 with ./pnfsdirs near cedar_phy daikon_00 L010185N Sat Aug 25 09:33:40 CDT 2007 STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_00/L010185N FAMSET mcin_near_daikon_00 FAMILY mcin_near_daikon_00 OUTPUT /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N chgrp: invalid group name `e875' OK - created /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/mrnt_data OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 rubin numi 512 Aug 25 09:33 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N chgrp: invalid group name `e875' OK - have set permissions drwxrwxr-x drwxrwxr-x 1 rubin numi 512 Aug 25 09:33 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 rubin numi 512 Aug 25 09:21 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/cand_data chgrp: invalid group name `e875' OK - have set permissions drwxrwxr-x drwxrwxr-x 1 rubin numi 512 Aug 25 09:21 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/cand_data FAMSET mcout_cedar_phy_near_daikon_00_cand FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_00_cand OK - setting family to mcout_cedar_phy_near_daikon_00_cand FAMSET mcout_cedar_phy_near_daikon_00_mrnt FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_00_mrnt OK - setting family to mcout_cedar_phy_near_daikon_00_mrnt OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 rubin numi 512 Aug 25 09:25 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data chgrp: invalid group name `e875' OK - have set permissions drwxrwxr-x drwxrwxr-x 1 rubin numi 512 Aug 25 09:25 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data FAMSET mcout_cedar_phy_near_daikon_00_sntp FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_00_sntp OK - setting family to mcout_cedar_phy_near_daikon_00_sntp Ran this as rubin@fnpcsrv1 setup encp -q stken ./pnfsdirs near cedar_phy daikon_00 L010185N write ============================================================================= 2007 08 24 ########### # MINOS11 # ########### cryptocard - yum install zz_sshd_pam service sshd restart tried this on minos06, with reload, no cryptocard access Had lots of trouble getting access, turns out AklogCmd is no longer tried restartof ssh, still no good zz_sshd_aklog-3.9-5 was needed, to remove the obsolete AklogCmd entry from /etc/ssh/sshd_config ########### # MINOS11 # ########### AFS - the system was booted around noon, without DHCP. Robert has done 3 clean root builds. There have been no further afs timeout in /var/log/messages, aside from a single pair of timeouts to a private network. Aug 24 11:57:49 minos11 kernel: afs: Lost contact with file server 192.168.67.1 in cell fnal.gov (multi-homed address; other same-host interfaces maybe up) Success ! ########### # ROUNDUP # ########### Forcing out srsafitter remnants that existed tested with -n -W ${HOME}/scripts/roundup -f 2 -r cedar_phy_srsafitter near ${HOME}/scripts/roundup -f 2 -M -r cedar_phy_srsafitter mcfar ${HOME}/scripts/roundup -f 2 -M -r cedar_phy_srsafitterbx113 mcnear ####### # DAQ # ####### minos-gateway-nd - sshd was not running per Peter and Alec in pit x5875 Urish got console control, removed AklogCmd from /etc/sshd/sshd_config rebooted, we're good. ########### # UPGRADE # ########### Short AFS tokens for bash users were due to typo in a config file pushed manually to all nodes. This is corrected by Rennie Scott, lifetimes look good. 
Remaining issues cryptocard support no .Xauthorization access under SL3 ( minos11 ) Ganglia CLUBS to SLF 4 ============================================================================= 2007 08 23 ########### # MINOS11 # ########### Went off the network sometime after 10:00 sar -n DEV | grep -v 'lo' shows 0 tx packets/data at 11:20 through 12:40 But no errors in the EDEV report And no errors in the MRTG web page information 16:22 copied a file twice via the net, to test rates, Rates look just fine, about 20 MBytes/second MINOS11 > time rcp minos26:/local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root . real 0m3.675s MINOS11 > time rcp minos26:/local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root TEST.dat real 0m3.258s MINOS10 > time rcp minos26:/local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root . real 0m3.961s MINOS10 > time rcp minos26:/local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root TEST.dat real 0m3.035s But 2 or 4 of Robert's rebuilds continue to fail with file timeouts. Trying regular file manipulations : time tar cf /local/scratch11/kreymer/DATA/home1.tar . real 1m23.948s time tar cf /local/scratch11/kreymer/DATA/home2.tar . real 0m57.161s for N in 3 4 5 6 7 8 9 10 ; do time tar cf /local/scratch11/kreymer/DATA/home${N}.tar . ; done real 0m58.113s real 0m51.739s real 0m54.350s real 0m49.661s real 0m51.646s real 0m54.644s real 0m53.916s real 0m57.419s Try something that hits new directories more in AFS, writing and deleting . date time cp -ax ~kreymer ${MINOS_DATA}/release_data/TEST/kreymer date du -sm ~kreymer ${MINOS_DATA}/release_data/TEST/kreymer date diff -r ~kreymer ${MINOS_DATA}/release_data/TEST/kreymer date du -sm ~kreymer ${MINOS_DATA}/release_data/TEST/kreymer date rm -r ${MINOS_DATA}/release_data/TEST/kreymer date Sort of messy, some file were write protected, and many stray symlinks ########### # UPGRADE # ########### S.A.G. is back, per Liz Hi Art, so it is working again. It seemed to be a victim of the local ups area vanishing during the upgrade. I have pointed the PRODUCTS variable at the MINOS products area in afs. ============================================================================= 2007 08 22 ########### # UPGRADE # ########### Library symbolic links like ln -s ../../usr/X11R6/lib/libGL.so.1.2 /usr/lib/libGL.so were part of xorg-x11-devel , installed late Tuesday morning. ########### # UPGRADE # ########### Residual issues, roughly highest priority first DONE - LSF batch servers - not running yet on minos15 16 17 18 20 21 23 24 25 minos11 issues AFS stability - consult an expert ? stray aklog message at login TiBS reinstallation Ganglia monitoring - needed on minos01->26 and minos-sam01/02/03 LANG-en_US.UTF-8 causes a different order for the output of ls Can we remove this, or add LC_COLLATE=C ? cryptocard access - will come soon with new ssh version and zz_sshd_pam token lifetimes are 1 day on login ( kerberos ticket expiration time) rather than 1 week ( based on ticket renewability time) This may be addressed by the new sshd version coming tomorrow. It is as if we were running AklogCmd /usr/bin/aklog instead of AklogCmd /usr/krb5/bin/aklog A few users still cannot login via ssh perhaps a client issue ? SamAtAGlance is not running under buckley@minos-sam01 buckley files may need to be copied from /scratch/sam01/buckley to /home/buckley kcroninit crontab Cannot read /etc/ssh/sshd_config and /var/log/messages Can this be enabled ? 
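For the LC_COLLATE item in the list above, a quick demonstration of the ls ordering difference, using a throw-away test directory; the expected orderings are for a typical glibc locale.

mkdir /tmp/coltest ; cd /tmp/coltest ; touch A b C d
LC_COLLATE=en_US.UTF-8 ls    # A b C d  - dictionary order, case folded
LC_COLLATE=C ls              # A C b d  - plain byte order
cd / ; rm -r /tmp/coltest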
####### # LSF # ####### LSF batch jobs are running only on 14, 19 , 22 Scanning with ps -fu root | grep lsf root 4902 1 0 Aug20 ? 00:01:01 /afs/fnal.gov/ups/lsf/v6_1/i386_linux24/etc/lim root 4905 1 0 Aug20 ? 00:00:00 /afs/fnal.gov/ups/lsf/v6_1/i386_linux24/etc/res root 4908 1 0 Aug20 ? 00:00:01 /afs/fnal.gov/ups/lsf/v6_1/i386_linux24/etc/sbatchd root 5635 4902 0 Aug20 ? 00:00:08 /afs/fnal.gov/ups/lsf/v6_1/i386_linux24/etc/pim This correlates with the bhosts lists. Requested startup. NS='x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x' for N in ${NS} ; do bsub -q minos \ 'HOST=`hostname --short | cut -c -5` ; [ "${HOST}" = "minos" ] && { hostname ; sleep 120 ; }' done All but 25 are running jobs. Try to probe just that node again, for N in ${NS} ; do bsub -q minos \ 'HOST=`hostname --short` ; [ "${HOST}" = "minos25" ] && { hostname ; sleep 120 ; }' done ######### # GENPY # ######### genpy.20070822 ln -sf genpy.20070822 genpy # was genpy.20061103 Need to increase timeout for large files Typical recent timings GDAT/fardet_data/2007-08 Size du -sm /pnfs/minos/fardet_data/2007-08/${RUN}.mdaq.root Time grep '2007/08' ${RUN}.log TZ=UTC stat ${RUN}.log RUN=F00038842_0000 1183 / 599 RUN=F00038846_0000 546 / 340 RUN=F00038869_0000 55 / 36 RUN=F00038871_0000 177 / 93 RUN=F00038872_0000 964 / 610 RUN=F00038889_0000 7 / 22 RUN=F00038891_0001 18 / 32 A safe limit would seem to be 60 sec + (Size in MB) Tested this per HOWTO.genpy ########### # SRMTEST # ########### srmtest.20070822 automatically runs for both minfarm and mindata ============================================================================= 2007 08 21 ########### # STARTUP # ########### Need afs restart on minos19 ( two afsd ) lsf client OK on minos02/3/4 bhosts works bsub works ####### # AFS # on minos11 ####### CACHESIZE= OPTIONS=$LARGE CACHESIZE had been 100000 rebooted around 16:26 w thosieck pts/0 tigris.hep.utexa 3:36pm 42:38 0.68s 0.44s ssh minos05 rhatcher pts/1 rhatcher03.dhcp. 3:55pm 9:06 0.26s 0.26s -bash root pts/2 hyperion.dhcp.fn 3:55pm 12:52 0.06s 0.06s -bash kreymer pts/3 minos-93198.dhcp 4:19pm 0.00s 0.12s 0.02s w jyuko pts/5 argut.hep.utexas 1:59pm 8:27 1.85s 0.37s -tcsh blake pts/6 pcgj.hep.phy.cam 11:30am 4:45m 0.53s 0.46s ssh minos10 blake pts/10 pcgj.hep.phy.cam 11:31am 4:46m 1.24s 1.18s ssh minos12 ####### # AFS # on minos-mysql1 ####### Getting message at login ( as once got on minos11 ) df: `afs': No such file or directory Nick has been having AFS trouble, I presume on minos-mysql1 minos-mysql1 is running openafs 1.4.4, so should have an /etc/sysconfig/afs file like those on the SL 4.4 cluster. But its file is identical to that on flxi04. 
This is odd because the file specifies CACHEDIR=/usr/vice/cache CACHEINFO=/usr/vice/etc/cacheinfo But the active cache files are in [root@minos-mysql1 ~]# ls -l /var/cache/openafs/ total 200 -rw------- 1 root root 137516 Aug 21 11:35 CacheItems -rw------- 1 root root 20 Aug 16 12:26 CellItems drwx------ 2 root root 32768 Aug 16 12:24 D0 drwx------ 2 root root 20480 Aug 16 12:24 D1 -rw------- 1 root root 2288 Aug 21 11:04 VolumeItems The /usr/vice/cache files are old [root@minos-mysql1 ~]# ls -l /usr/vice/cache total 632 -rw------- 1 root root 440016 Aug 16 07:04 CacheItems -rw------- 1 root root 20 Dec 20 2004 CellItems drwx------ 2 root root 32768 Nov 29 2004 D0 drwx------ 2 root root 36864 Nov 29 2004 D1 drwx------ 2 root root 36864 Nov 29 2004 D2 drwx------ 2 root root 36864 Nov 29 2004 D3 drwx------ 2 root root 32768 Nov 29 2004 D4 -rw------- 1 root root 16952 Aug 3 00:23 VolumeItems [root@minos-mysql1 ~]# ls -l /usr/vice/etc total 180 -rw------- 1 root root 0 Nov 29 2004 AFSLog -rw-r--r-- 1 root root 364 Feb 26 2005 CellAlias -rw-r--r-- 1 root root 31919 Jun 7 13:31 CellServDB -rw-r--r-- 1 root root 157 Mar 22 09:32 SuidCells -rw-r--r-- 1 root root 9 Aug 16 12:12 ThisCell -rw-r--r-- 1 root root 9 Mar 9 2004 ThisCell.FNAL -rwxr-xr-x 1 root root 121788 Jun 7 13:31 afsd -rw-r--r-- 1 root root 30 Jun 7 13:31 cacheinfo -rwxr-xr-x 1 root root 425 Feb 26 2005 killafs Reported to run2-sys ( rennie ) Discussed with him and Joe Boyd. Per advice of mengel, we have put in the standard SLF 4.4 /etc/sysconfig/afs file, and rebooted ( reboots are recommended. ) I stopped the mysql database just before the reboot. Connection details are in LOG.mysql df is now happy with the afs partition afsd shows the ============================================================================= 2007 08 20 ########### # STARTUP # ########### Requested restore of /home/mindata from /local/scratch26/kreymer/homemindata-sam02.tar if necessary ############ # MCIMPORT # ############ rennie restored /home/mindata files from minos-sam02 ( Vintage March ) $ cp AFSS/mcimport.20070711 mcimport.20070711 $ ln -s mcimport.20070711 mcimport $ rmdir STAGE/ $ ln -s /local/scratch26/mindata STAGE ####### # AFS # ####### The afssum scripts failed ( no access ) on Friday PM Seem to be OK this morning. Correcting /etc/sysconfig/afs OPTIONS=AUTOMATIC to OPTIONS=$LARGE Restarted most HOWTO.monitor tasks on minos26 Minos11 config had to be restored from flxi04, rebooted around ########### # SRMTEST # ########### lost from /home/mindata, recopied from minfarm@fnpcsrv1, and put in scripts/ ############## # MINOS-DATA # ############## Removed tzanakos@PHYS.UOA.GR from minos-data list, due to repeated quota problems in gr. ######## # TIME # ######## for NODE in $NODES ; do printf "\n${NODE} `date`\n"; ssh ${NODE} "printf \"${NODE} \" ; ntpstat | grep correct" done All nodes are like minos01 time correct to within 11 ms or minos15 time correct to within 10 ms ####### # X11 # ####### Still need XFree86-devel, per boehm ######## # KRB5 # ######## rhatcher lacks /usr/krb5/bin in path ============================================================================= 2007 08 18 Sat ########### # STARTUP # ########### tokens are appearing now on all nodes 1 day expiration on most 8 day expiration on minos11, minos25 Funny messages on minos11 -bash: aklog: command not found can't exec /local/ups/prd/perl/v5_006_1a/Linux-2/bin/perl:: No such file or directory Terminal type is xterm There are no available articles. 
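A sketch for surveying the 1-day versus 8-day token lifetimes noted above across the whole cluster; assumes the usual $NODES list used elsewhere in this log and that tokens is on the default PATH.

for NODE in $NODES ; do
  printf "${NODE} " ; ssh -ax -n ${NODE} 'tokens | grep afs@'
done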
# PREDATOR # 17:25 UTC Started manual ./predator, now that DCache is delivering files again # note that kcronit is needed globally, sent mail to m_s_d kinit: No such file or directory while getting initial credentials I needed to run kcroninit on 2 5-9 11-25 kcroninit fails on minos11, can't exec /local/ups/prd/perl/v5_006_1a/Linux-2/bin/perl:: No such file or directory MINOS13 > kcron kinit: Client not found in Kerberos database while getting initial credentials and 14, 15, 16 Note that kcrondestroy fails : MINOS05 > kcrondestroy KCRONINIT_DIR is not defined ... we are quitting. BEGIN failed--compilation aborted at /usr/krb5/bin/kcrondestroy line 35. ############ # MCIMPORT # ############ The mindata account /home/mindata area is missing from minos26 ############ # PREDATOR # ############ Fardet data is showing up with root version 5.16.0 . Starting with data from 2007 08 F00038604_0000.mdaq.root export SAM_ORACLE_CONNECT="samdbs/password" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=online --appName=rotorooter --appVersion=v05-16-00 done New applicationFamilyId = 187 New applicationFamilyId = 76 New applicationFamilyId = 222 ============================================================================= 2007 08 17 ########### # STARTUP # ########### OK - minos24 down - has been rebooted OK - cvs anonymous asousa cvs [update aborted]: end of file from server (consult above messages if any) gfp - mine OK - pnfs - missing on minos02 OK - /grid/data on minos01 OK - minos11 - host id up, but funny message about can't exec /local/ups/prd/perl/v5_006_1a/Linux-2/bin/perl:: No such file or directory emacs is no longer a link to xemacs type xemacs to run xemacs tokens jdejong Thur midnight ? missing from all but minos25 interactive login minos26 - host id lsf files are present on all nodes, but cannot use from minos02 minos03 minos04 MINOS02 > bhosts Failed in an LSF library call: Slave LIM configuration is not ready yet ssh logins - miscellaneous ? ---------------- ####### # CVS # ####### cd /tmp export loc=":pserver:anonymous@minoscvs.fnal.gov:/cvs/minoscvs/rep1" cvs -d $loc checkout BubbleSpeak Fails from csf.rl.ac.uk 08:59 Renie restored the /etc/hosts.allow, contining cvs: ALL This restores access, tested at ral. ####### # LSF # ####### Checking all nodes for lsf availability for NODE in $NODES ; do printf "${NODE} `date`\n" ssh ${NODE} '. /usr/local/etc/setups.sh ; setup lsf ; bhosts minos25' ; done Fails on minos02 minos03 minos04 Try running a little job on each : for NODE in $NODES ; do printf "${NODE} `date`\n" ssh ${NODE} '. /usr/local/etc/setups.sh ; setup lsf ; bhosts minos25' ; done ####### # AFS # ####### 10:05 Rennie left phone message 13:15 Rennie Scott setting up test fix on minos21 14:49 Rennie updated minos22 as a test, looks OK 17:20 - pushed out everywhere bad minos01-10 bad minos12-20 bad minos23-24 OK minos11 minos21 minos22 minos25 minos26 ########### # MINOS26 # ########### MIN > ssh minos26 1208: Disconnecting: Protocol error: didn't expect packet type 34 This is a problem only from my desktop ? 
MIN > rpm -qf /usr/bin/ssh openssh-clients-3.5p1f11-1rh7x should be openssh-clients-3.5p1f12-1SL3 Into minos26 via minos01 Needed mount of /local/scratch26 /grid/data /grid/app Started processes per HOWTO.monitor Ran sam_test_py minos sam_test_py minos dev Tested loon per HOWTO.genpy Declared and undeclared and located per HOWTO.sam Started crontab at 17:37 Next iteration at 19:06 ./predator ####### # SSH # ####### mcgowan using crypto-card ( problem with account ? ) annah using crypto-card from local host at UCL to minos09 OK to flxi0* ssh(6258) Permission denied [annah@localhost ~]$ ssh -Y OpenSSH_4.3p2-4.cern-hpn, OpenSSL 0.9.7a Feb 19 2003 skips directly to keyboard-interactive zarko using crypto-card dnieper to minos07/08, tried 1-20 Linux 2.4 OpenSSH_3.6.1p2, SSH protocols 1.5/2.0, OpenSSL 0x0090701f skips directly to keyboard-interactive kreymer@csf.rl.ac.uk /usr/kerberos/bin/kinit kreymer@FNAL.GOV ssh -2 -v minos01.fnal.gov fails ssh -2 -v flxi04.fnal.gov fails ssh -1 -v flxi04.fnal.gov OK user fermilab principal lcgui0359.gridpp.rl.ac.uk to minos01 can connect to flxi02 OpenSSH_3.6.1p2-CERN20030917, SSH protocols 1.5/2.0, OpenSSL 0x0090701f minos01 debug1: Doing challenge response authentication. debug1: No challenge. Permission denied. flxi02 debug1: Trying Kerberos v5 authentication. debug3: Trying to reverse map address 131.225.68.42. debug1: Kerberos v5 authentication accepted. debug1: Kerberos v5 TGT forwarding failed: KDC can't fulfill requested option debug1: Requesting pty. debug3: tty_make_modes: ospeed 38400 Present off-site connections include ######## # TIME # ######## for NODE in $NODES ; do printf "\n${NODE} `date`\n"; ssh ${NODE} "printf \"${NODE} \" ; date" ; done ============================================================================= 2007 08 16 ############ # SADDRECO # ############ Now that all of daikon_00 is declared to mcin in dev, let's proceed with mcout, via saddreco.20070707... ############ # SHUTDOWN # ############ root@minos-mysql1 # /etc/init.d/mysql stop Shutting down MySQL. [ OK ] # /etc/init.d/mysql start Starting MySQL/bin/bash: /root/.bashrc: Permission denied # /etc/init.d/mysql stop Shutting down MySQL. [ OK ] [ OK ] sam@minos-sam03/2/1 ups stop sam_bootstrap no shrc/kreymer on minos-sam01 ? dangling processes on minos-sam01, killed UID PID PPID C STIME TTY TIME CMD sam 12507 1 0 May31 ? 00:00:00 /bin/sh /home/sam/products/sam_bootstrap/v6_1_2/NULL/bin/run.sh start logger log_prd v4_2_0 --stdout=no --info=/dev/nu sam 12671 12507 0 May31 ? 00:07:36 SamLogServer --port=40583 --host-alias=minos-sam01.fnal.gov --log=/home/sam/private/logger__minos-sam01__log_prd/log - 26317 ? S 0:06 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/minos26free_log 9020 ? S 0:00 sleep 3600 14909 ? S 0:43 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/oracle/topdb_log minosdev 16344 ? S 0:00 sleep 600 14908 ? S 0:41 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/oracle/topdb_log minosprd 16458 ? S 0:00 sleep 600 14783 ? S 1:08 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/pnfs_log 16541 ? S 0:00 sleep 300 14781 ? S 0:34 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/ftp_log 16361 ? 
S 0:00 sleep 600 ########### # STARTUP # ########### minos01 pserver needed to be started, missing script, Then export CVSROOT=:pserver:minoscvs@minoscvs.fnal.gov:/cvs/minoscvs/rep1 cvs checkout -P Candidate cvs checkout: authorization failed: server minoscvs.fnal.gov rejected access to /cvs/minoscvs/rep1 for user minoscvs cvs checkout: used empty password; try "cvs login" with a real password cvs checkout failed. started up sam servers on minos-sam01, looks OK started up sam servers on minos-sam03 started up sam servers on minos-sam02, after minosora3 firmware fix Pending issues : pserver password ? cvcspserver offsite access ? Some users cannot log in , perhaps ? 1 with cryptocard minos11 login problem minos24 down minos26 login problem 18:23 started mysqld /home/room1/lsf empty on all but minos01 ============================================================================= 2007 08 15 ######### # ADMIN # ######### Preparing for all-day shutdown tomorrow predator MINOS26 > echo 'crontab -r' | at 03:30 mcimport M26 > echo 'crontab -r' | at 03:30 job 21 at 2007-05-24 03:30 corral SRV1> echo 'mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT' \ | at 03:30 ######### # MYSQL # ######### See LOG.mysql Making space on disk, copying PULSERDRIFT.DAT to samread@minos-sam02 ######### # ADMIN # ######### Survey local mail stash on Minos Cluster for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'du -sm /var/spool/mail/*' done > AFSK/minos/maint/vsmail.20070815 Highlights, filtering out lsfadm minos01 Wed Aug 15 14:04:45 CDT 2007 1 /var/spool/mail/arms 1 /var/spool/mail/brebel 2 /var/spool/mail/buckley 1 /var/spool/mail/howcroft 1 /var/spool/mail/jyuko 1 /var/spool/mail/kreymer 1 /var/spool/mail/mcgo0109 225 /var/spool/mail/niki 1 /var/spool/mail/saranen minos02 Wed Aug 15 14:04:47 CDT 2007 1 /var/spool/mail/arms 204 /var/spool/mail/rubin minos03 Wed Aug 15 14:04:49 CDT 2007 1 /var/spool/mail/ahimmel minos04 Wed Aug 15 14:04:50 CDT 2007 1 /var/spool/mail/admarino 1 /var/spool/mail/bspeak 0 /var/spool/mail/kreymer 1 /var/spool/mail/shepelak minos05 Wed Aug 15 14:04:51 CDT 2007 1 /var/spool/mail/arms minos06 Wed Aug 15 14:04:59 CDT 2007 1 /var/spool/mail/kreymer minos07 Wed Aug 15 14:05:01 CDT 2007 1 /var/spool/mail/zarko minos08 Wed Aug 15 14:05:07 CDT 2007 1 /var/spool/mail/admarino 2 /var/spool/mail/jdejong minos09 Wed Aug 15 14:05:10 CDT 2007 minos10 Wed Aug 15 14:05:15 CDT 2007 minos11 Wed Aug 15 14:05:18 CDT 2007 1 /var/spool/mail/arms 38 /var/spool/mail/jdejong 23 /var/spool/mail/rhatcher minos12 Wed Aug 15 14:05:20 CDT 2007 1 /var/spool/mail/arms 1 /var/spool/mail/boehm minos13 Wed Aug 15 14:05:23 CDT 2007 1 /var/spool/mail/boehm 1 /var/spool/mail/rhatcher minos14 Wed Aug 15 14:05:27 CDT 2007 minos15 Wed Aug 15 14:05:30 CDT 2007 minos16 Wed Aug 15 14:05:33 CDT 2007 minos23 Wed Aug 15 14:05:50 CDT 2007 1 /var/spool/mail/arms minos24 Wed Aug 15 14:05:52 CDT 2007 minos25 Wed Aug 15 14:05:54 CDT 2007 1 /var/spool/mail/root minos26 Wed Aug 15 14:05:57 CDT 2007 3 /var/spool/mail/buckley 0 /var/spool/mail/kreymer 1 /var/spool/mail/mindata Let me clean up my own stuff : for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'ls -l /var/spool/mail/kreymer' ; done minos01 Wed Aug 15 14:08:30 CDT 2007 -rw------- 1 kreymer mail 5220 May 25 2006 /var/spool/mail/kreymer Four predator pid warnings, all May 24/25 2006. 
Removed minos04 Wed Aug 15 14:08:34 CDT 2007 -rw------- 1 kreymer mail 0 Jun 14 2006 /var/spool/mail/kreymer minos06 Wed Aug 15 14:08:36 CDT 2007 -rw------- 1 kreymer mail 391083 Apr 21 2006 /var/spool/mail/kreymer kinit messages from predator and afssum from Apr 5 through Apr 21 2006 Removed with pine minos26 Wed Aug 15 14:09:07 CDT 2007 -rw------- 1 kreymer mail 0 Mar 28 17:09 /var/spool/mail/kreymer The nontrivial files are minos01 Wed Aug 15 14:04:45 CDT 2007 225 /var/spool/mail/niki - copied by niki minos02 Wed Aug 15 14:04:47 CDT 2007 204 /var/spool/mail/rubin - removed by rubin minos11 Wed Aug 15 14:05:18 CDT 2007 38 /var/spool/mail/jdejong - removed by jdejong 23 /var/spool/mail/rhatcher Sent email to these users, suggesting an extra copy before the upgrades. ########## # SADDMC # ########## Declared another file in development ./saddmc.20070815 --declare -n 1 daikon_00 near/daikon_00/L010000N/129 Beam has been taken as REFILE[15:22] needs to change to get things like L010185N_bfldx113 , from files like n13014009_0006_L010185N_D00_bfldx113.reroot.root That's easy to get from the directory L010185N_bfldx113, a mess to get from the file. ls /pnfs/minos/mcin_data/near/daikon_00/L010185N_bfldx113/400 ls /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00/L010185N_bfldx113/sntp_data/400 Cut up to first '.' If < 5 _ fields, take third [2] If 5 _ fields, take 3_5 [2]+'_'+[4] ./saddmc.20070815 --verify daikon_00 near/daikon_00/L010000N/129 -n 1 -v ./saddmc.20070815 --verify daikon_00 near/daikon_00/L010185N_bfldx113/400 -n 1 -v find /pnfs/minos/mcin_data/near/daikon_00 -type d | wc -l 221 find /pnfs/minos/mcin_data/near/daikon_00 -type f -name \*.reroot.root | wc -l 17520 Checking rate ./saddmc.20070815 --verify daikon_00 near/daikon_00/L010185N_bfldx113/400 ... Needed 99 files, Rate was 3.472 STARTED Wed Aug 15 19:38:26 2007 FINISHED Wed Aug 15 19:38:56 2007 D00DIRS=`find /pnfs/minos/mcin_data/near/daikon_00 -type d | cut -f 5- -d /` for D00DIR in ${D00DIRS} ; do echo $D00DIR ; done 19:48:54 UTC to 21:25:07 UTC for D00DIR in ${D00DIRS} ; do ./saddmc.20070815 --verify daikon_00 ${D00DIR} ; done \ > /var/tmp/saddvard00.log 2>&1 & MINOS26 > grep Rate /var/tmp/saddvard00.log | wc -l 184 mv /var/tmp/saddvard00.log ../log/saddmc/D00.ver.log ./saddmc.20070815 --declare daikon_00 near/daikon_00/L010185N_lowi/140 \ >> ${HOME}/minos/log/saddmc/D00.log 2>&1 & 21:31:07 UTC to 23:52:04 UTC for D00DIR in ${D00DIRS} ; do ./saddmc.20070815 --declare daikon_00 ${D00DIR} ; done \ >> ${HOME}/minos/log/saddmc/D00.log 2>&1 & ############# # MINOSORA3 # ############# Firmware upgrades scheduled for tomorrow around 14:30. ######### # MYSQL # ######### ============================================================================= 2007 08 14 ########## # SADDMC # ########## saddmc.20070815 Dropped mcout_data from path. Restored --addloc qualifier Added samAdmin.addPnfsTapeLocation Added printout of SAM version Resumed testing, picking a small directory, ./saddmc.20070815 -v --verify daikon_00 mcin_data/near/daikon_00/L010000N/129 ./saddmc.20070815 --declare -n 1 daikon_00 /pnfs/minos/mcin_data/near/daikon_00/L010000N/129 -v OK - declared n13011290_0005_L010000N_D00.reroot.root /pnfs/minos/mcin_data/near/daikon_00/L010000N/129(voc553.456) OOPS , addLocation error in n13011290_0005_L010000N_D00.reroot.root CLASS SamException.SamExceptions.DataStorageLocationNotFound INSTANCE Location with name '/pnfs/minos/mcin_data/near/daikon_00/L010000N/129' not found. 
STARTED Tue Aug 14 21:22:37 2007 FINISHED Tue Aug 14 21:22:39 2007 Corrected this with --addloc, which worked using addPnfsTapeLocation ######### # ADMIN # ######### Doing an X11/gimp scan, found messages like this from minos24 : executable not found: '/usr/lib/gimp/2.0/plug-ins/print' Created working directory on minos24 mkdir -p /var/tmp/kreymer/.gimp-2.0/ ######## # FARM # ######## Howie ran 105 more files for jobs 3 and 8, leaving a partial run n13011055 Flushed this : ./roundup -M -f 0 -s n13011055 -r cedar_phy_srsafitter mcnear ./roundup -M -f 0 -s n13011055 -r cedar_phy_srsafitterbx113 mcnear ############ # SADDRECO # ############ SRV1> ls READ | grep '^n' | wc -l 2889 SRV1> ls READ | grep '^f' | wc -l 1238 SRV1> ls READ | grep '^F2' | wc -l 498 Adding missing locations found in the cleanup PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9010 ./saddreco near cedar_phy_srsafitterbx113 2005-08 addloc OK - add location N00008433_0002.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-08(voc678.57) OK - add location N00008433_0002.spill.sntp.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/sntp_data/2005-08(vo4349.129) ./saddreco near cedar_phy_srsafitterbx113 2005-09 addloc OK - add location N00008451_0001.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(voc678.2) OK - add location N00008451_0000.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(voc678.3) OK - add location N00008675_0001.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(vo2180.284) OK - add location N00008454_0014.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(voc678.4) OK - add location N00008669_0018.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(vo2180.168) OK - add location N00008669_0023.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(vo2180.266) OK - add location N00008436_0006.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(voc678.1) OK - add location N00008612_0013.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(vo2180.245) Added 8 locations ./saddreco near cedar_phy_srsafitterbx113 2005-10 addloc OK - add location N00008695_0007.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.321) OK - add location N00008988_0008.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.164) OK - add location N00008972_0008.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.166) OK - add location N00008905_0010.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.302) OK - add location N00008905_0011.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.249) OK - add location 
N00009000_0017.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.163) OK - add location N00008920_0017.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.247) Added 7 locations ./saddreco near cedar_phy_srsafitterbx113 2005-11 addloc OK - add location N00009219_0015.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.4) OK - add location N00009098_0009.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.467) OK - add location N00009059_0009.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.282) OK - add location N00009238_0015.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.6) OK - add location N00009059_0005.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.285) OK - add location N00009059_0003.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.327) OK - add location N00009238_0010.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.2) OK - add location N00009059_0010.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.301) OK - add location N00009059_0019.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.294) Added 9 locations SRV1> ls READ | grep '^n' | wc -l 2889 SRV1> ls READ | grep '^f' | wc -l 1238 SRV1> ls READ | grep '^F2' | wc -l 498 Let's clear out the MDC files, for cleanliness, as we have no present plans to put these in SAM mkdir READ/MDC mv READ/F2* READ/MDC/ ============================================================================= 2007 08 13 ######## # FARM # ######## One srsafitter mcnear file pending n13014038_0008_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitter.root ZAPRUNS n13014038_0008_L010185N_D00_bfldx113 1 2007-07-30 11:28:55 fnpc262 n13014038_0008_L010185N_D00_bfldx113 1 2007-07-30 11:46:14 fnpc210 n13011015_0007_L010185N_D00 1 2007-08-09 16:00:23 fnpc232 howie is back from vacation, going through large backlog of email ######## # FARM # ######## rhatcher corrected defective row in cedar database connecting by mysql --user=writer --password=###### --host=fnpcsrv1.fnal.gov --port=3307 cedar mysql> select * from SPILLTIMENDVLD where SEQNO=700003590; +-----------+---------------------+---------------------+--------------+---------+------+-------------+---------------------+---------------------+ | SEQNO | TIMESTART | TIMEEND | DETECTORMASK | SIMMASK | TASK | AGGREGATENO | CREATIONDATE | INSERTDATE | +-----------+---------------------+---------------------+--------------+---------+------+-------------+---------------------+---------------------+ | 700003590 | 2007-08-05 18:00:00 | 2007-08-05 18:00:00 | 1 | 1 | 3 | -1 | 2007-08-05 18:00:00 | 2007-08-08 02:30:15 | +-----------+---------------------+---------------------+--------------+---------+------+-------------+---------------------+---------------------+ 1 row in set (0.00 sec) mysql> delete from SPILLTIMENDVLD where SEQNO=700003590; Query OK, 1 row affected (0.11 sec) ######### # MYSQL # 
######### ~rhatcher/public_html/MySQLRefCard.ps ############ # SADDRECO # ############ Preparing for MC declares, worried about READ and SAM/READ sizes SRV1> ls READ/SAM | wc -l 12644 SRV1> ls READ | wc -l 4887 Most of this is MC, will be moved to READ/SAM, but this should not immediately break anything SRV1> ls READ | grep ^n | wc -l 2864 SRV1> ls READ | grep ^f | wc -l 1238 And for cleanup, SRV1> ls READ | grep ^N | wc -l 243 SRV1> ls READ | grep ^F | wc -l 541 Catching up, ls READ | grep ^N | grep \.cedar\\. roundup -m '2007-04' -r cedar near 57 files ... 2007 08 14 roundup -m '2005-10' -r cedar near roundup -m '2005-11' -r cedar near roundup -m '2006-12' -r cedar near roundup -m '2007-01' -r cedar near roundup -m '2007-05' -r cedar near roundup -m '2007-01' -r cedar_phy near OOPS, need location for N00011455_0023.spill.cand.cedar_phy.0.root sam add location --file=N00011455_0023.spill.cand.cedar_phy.0.root \ --loc='/pnfs/minos/reco_near/cedar_phy/cand_data/2007-01(voc503.742)' rm READ/N00011669_0000.cosmic.sntp.cedar_phy.0.root.bck this was an editor backup file, containing .cedar. parents verified single parent, and sam metadata, then cleaned up one old stray mv READ/N00008463_0019.spill.sntp.cedar.0.root READ/SAM/N00008463_0019.spill.sntp.cedar.0.root FILE=N00012135_0013.cosmic.cand.cedar.0.root mv READ/${FILE} READ/SAM/${FILE} FILE=N00012135_0021.cosmic.cand.cedar.0.root mv READ/${FILE} READ/SAM/${FILE} And more cleanup in FAR, SRV1> ls READ | grep ^F | grep -v \.cedar_phy_srsa | wc -l 541 ls READ | grep ^F0 | grep \.cedar\\. roundup -m '2005-04' -r cedar far roundup -m '2005-05' -r cedar far roundup -m '2005-07' -r cedar far roundup -m '2007-01' -r cedar far ls READ | grep ^F0 | grep \.cedar_phy\\. roundup -m '2006-07' -r cedar_phy far verified parents, and moved to SAM : for FILE in F00035862_0000.spill.bntp.cedar_phy.0.root F00035862_0000.spill.sntp.cedar_phy.0.root F00035868_0000.spill.bntp.cedar_phy.0.root F00035868_0000.spill.sntp.cedar_phy.0.root ; do mv READ/${FILE} READ/SAM/${FILE} ; done Lots of srsafitterbx113 locations messed up, export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do ./samtapeloc /pnfs/minos/reco_near/cedar_phy_srsafitterbx113 ${REL} ; done Corrected locations ============================================================================= 2007 08 10 ############# # CHECKLIST # ############# queues plots still stale in dcache All the 9* pools are offline, but they are not in any groups at present ######### # BATCH # ######### for N in ${NS} ; do bsub -q minos \ '[ `hostname` = "minos24.fnal.gov" ] && { hostname ; sleep 120 ; }' done JIDI=371274 to JIDF=371313 JID=${JIDI} ; while [ ${JID} -le ${JIDF} ] ; do bjobs -l ${JID} | grep "CPU\|Started" (( JID = JID + 1 )) ; done 371314 371353 Jim Fromm suggests lsf wrapper looks like #! /bin/sh $LSB_TRAPSIGS $LSB_RCP1 $LSB_RCP2 $LSB_RCP3 # LSBATCH: User input /afs/fnal.gov/files/home/room1/brebel/gen_iuntuple_cron_sam ExitStat=$? wait # LSBATCH: End user input true exit `expr $? "|" $ExitStat` bsub -q minos 'echo LSB_TRAPSIGS ; echo $LSB_TRAPSIGS' 371354 LSB_TRAPSIGS trap # 15 10 12 2 1 bsub -q minos 'echo $LSB_RCP1 ; echo $LSB_RCP1' bsub -q minos 'echo $LSB_RCP2 ; echo $LSB_RCP2' bsub -q minos 'echo $LSB_RCP3 ; echo $LSB_RCP3' All of these are null Final check : bsub -q minos 'echo LSF STUFF ; echo $LSB_TRAPSIGS; $LSB_RCP1; $LSB_RCP2 ; $LSB_RCP3 ; echo LSF STUFF' LSF STUFF trap # 15 10 12 2 1 LSF STUFF So the effective script is #! /bin/sh trap # 15 10 12 2 1 # LSBATCH: User input pwd ExitStat=$? 
wait # LSBATCH: End user input true exit `expr $? "|" $ExitStat` As a test, ran a few slow jobs on other nodes, bsub -q minos '[ `hostname` != "minos24.fnal.gov" ] && { hostname ; sleep 120 ; }' MINOS26 > cat ~/.lsbatch/1186759470.371387 #! /bin/sh $LSB_TRAPSIGS $LSB_RCP1 $LSB_RCP2 $LSB_RCP3 # LSBATCH: User input [ `hostname` != "minos24.fnal.gov" ] && { hostname ; sleep 120 ; } ExitStat=$? wait # LSBATCH: End user input true exit `expr $? "|" $ExitStat` Note from laszlo at 14:00, try again, a 40 run shot one job did run OK on minos24 Ran 2 more passes, things are OK The problem was the lack of files in /usr/afsws ( all are symlinks to /usr/sbin ) Now looking at the difference in RPM's between minos24 and minos25 : On minos25, found set LC_COLLATE=C undid this. rpm -qa | sort > minos24.rpmlis rpm -qa | sort > minos24.rpmlis env | sort > minos24.env env | sort > minos24.env minos24 seems to be missing a2ps firefox ghostscript gimp gv java nedit screen seamonkey-nspr seamonkey-nss tetex-xdvi thunderbird vim-common vim-enhanced apel-xemacs xemacs xemacs-common xemacs-el xemacs-info xemacs-sumo xml-common xorg-x11-tools zz_a2ps_stdout zz_emacs_link zz_ntp_configure zz_tex_tweaks I'm sure we should have gv, ghostview, ============================================================================= 2007 08 09 ######### # ADMIN # ######### Created HOWTO.upgrade for draft upgrade plan ######### # BATCH # ######### Testing minos24, included in minos14-24 for N in 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 ; do bsub -q minos "sleep 300 ; hostname" ; done 370875 to 370904 Ran on 14 xx 15 x 16 17 x 18 xx 19 xx 20 x 21 x 22 xx 23 xx 24 25 only 14/30 came back NS='' N=0 while [ ${N} -lt 40 ] ; do (( N = N + 1 )) ; NS="${NS} ${N}" ; done for N in ${NS} ; do bsub -q minos "sleep 120 ; hostname" ; done 370934 to 370973 14 xx 15 x 16 xx 17 x 18 xxx 19 xx 20 x 21 x 22 xx 23 xx 24 25 only 17/40 came back Ran load on minos24/25 up to 4 with 4 instances each of ( while true ; do true ; done ) & for N in ${NS} ; do bsub -q minos "sleep 120 ; hostname" ; done 370974 thru 371013 JID=370974 ; while [ ${JID} -le 371013 ] ; do bjobs -l ${JID} | grep "CPU\|Started" (( JID = JID + 1 )) ; done OK, everything ran Now look back, JID=370934 ; while [ ${JID} -le 370973 ] ; do bjobs -l ${JID} | grep "CPU\|Started" (( JID = JID + 1 )) ; done No minos25 jobs seen, all the minos24 jobs failed. Now take the load off 24, try again for N in ${NS} ; do bsub -q minos "sleep 120 ; hostname" ; done 371018 to 371057 JID=371018 ; while [ ${JID} -le 371057 ] ; do bjobs -l ${JID} | grep "CPU\|Started" (( JID = JID + 1 )) ; done Removed test load from minos24. ########### # ROUNDUP # ########### roundup.20070809 purge READ/SAM area Added ... 
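The JID polling above is typed out by hand for each block of test jobs. A minimal sketch of a reusable version, assuming the job IDs are contiguous and that bjobs -l reports the execution host on its 'Started' line (as in the greps above); the final tally is just grep -o on the minosNN host names:

# Sketch only - summarize which minos nodes a contiguous block of LSF jobs ran on
JIDI=371018 ; JIDF=371057        # first and last job ID, as in the tests above
JID=${JIDI}
while [ ${JID} -le ${JIDF} ] ; do
    bjobs -l ${JID} | grep "Started"
    (( JID = JID + 1 ))
done | grep -o "minos[0-9]*" | sort | uniq -c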
####### # AFS # ####### Per loiacono, requested volume afs/fnal.gov/files/data/minos/d266 cloned from d239, d188 per lloiaco request for beam systematics work system:administrators rlidwka minos:admin rlidwka minos:beam rlidwka minos rl ####### # AFS # ####### Created minos:beam group NEWGROUP=beam pts creategroup -name kreymer:${NEWGROUP} group kreymer:beam has id -2192 pts membership buckley:minosbeam | sort admarino arms buckley dharris koskinen loiacono messier morfin rhatcher sjc szleper wehmann yumiceva BUSERS=' admarino arms buckley dharris koskinen kreymer loiacono messier morfin rhatcher sjc szleper wehmann yumiceva' for GUSER in ${BUSERS} ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts setfields kreymer:${NEWGROUP} -access SOMar pts membership kreymer:${NEWGROUP} pts examine kreymer:${NEWGROUP} Name: kreymer:nonap, id: -1941, owner: kreymer, creator: kreymer, membership: 5, flags: SOMar, group quota: 0. pts chown kreymer:${NEWGROUP} minos pts examine minos:${NEWGROUP} pts membership minos:${NEWGROUP} ######### # MYSQL # ######### User tinti has been hitting mysql hard, many connections to temp, in batch jobs running on flxb*, since about 10:00 Tuesday Aug 7 ( based on Ganglia, and today's mysqladmin processlist ) ######## # FARM # ######## Matching up new job 8 ( job 3 with srsafitter ) Input seems to be /pnfs/minos/mcin_data/near/daikon_00/L010185N DIR IN OUT3 100 98 98 101 168 168 102 109 109 103 109 109 104 110 66 Job 3 output is under /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00/L010185N ============================================================================= 2007 08 08 ######## # FARM # ######## Flushing month endpoints for srsa processing NEAR 2005-08 N00008433_0002 2005-11 N00009280_0018 FAR 2005-11 F00033077_0002 2006-02 F00033805_0006 2007-01 F00037162_0006 2007-02 F00037709_0000 ./roundup -n -M -W -f 1 -s "N00008433_\|N00009280_" -r cedar_phy_srsafitter near Missing N00009280_0011..spill.sntp.cedar_phy_srsafitter.0.root This was NOSPILL in cedar_phy_srsafitterbx113 So write out the first run : ./roundup -n -M -W -f 1 -s "N00008433_" -r cedar_phy_srsafitter near ./roundup -f 1 -s "N00008433_" -r cedar_phy_srsafitter near ./roundup -n -M -W -f 1 -s "F00033077_\|F00033805_" -r cedar_phy_srsafitter far ./roundup -f 1 -s "F00033077_\|F00033805_" -r cedar_phy_srsafitter far ./roundup -n -M -W -f 1 -r cedar_phy_srsafitterbx113 near ./roundup -f 1 -r cedar_phy_srsafitterbx113 near this clears out cedar_phy_srsafitterbx113 near ./roundup -n -M -W -f 1 -r cedar_phy_srsafitter far ./roundup -f 1 -r cedar_phy_srsafitter far this clears out far Now go back and force out cedar_phy_srsafitter N00009280, given that subrun 11 is NOSPILL in bx113. ./roundup -n -M -W -f 1 -s "N00009280_" -r cedar_phy_srsafitter near ./roundup -f 1 -s "N00009280_" -r cedar_phy_srsafitter near roundup has 149 of 299 subruns in partial runs, missing 150. 
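Every flush above is the same two-step: a -n -M -W dry run, then the identical roundup command for real. A small sketch of that pattern as a shell function; flushrun is a hypothetical name, the flags are passed through unchanged with roundup's own meanings, and it assumes roundup exits nonzero when the dry run finds trouble (not verified here):

# Sketch only - dry run roundup, then run it for real with the same arguments
flushrun () {
    ./roundup -n -M -W "$@" || { echo "OOPS - dry run failed, not flushing" ; return 1 ; }
    ./roundup "$@"
}
# example, matching one of the flushes above
flushrun -f 1 -s "N00008433_" -r cedar_phy_srsafitter near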
MINOS26 > find /pnfs/minos/reco_near/cedar_phy_srsafitter/cand_data -type f | wc -l
1550
MINOS26 > find /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data -type f | wc -l
1749

1749/1550 = 1.128

The sntp ratio is larger,

MINOS26 > find /pnfs/minos/reco_near/cedar_phy_srsafitter/sntp_data -type f | wc -l
125
MINOS26 > find /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/sntp_data -type f | wc -l
133

133/125 = 1.064

Rustem's POT ratio is 4.64/3.63 = 1.28

Now looking at jobs 4 and 5 ,

MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00/L010185N_bfldx113/cand_data -type f | wc -l
541
MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00/L010185N_bfldx113/cand_data -type f | wc -l
534
MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00/L010185N_bfldx113/sntp_data -type f | wc -l
59
MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00/L010185N_bfldx113/sntp_data -type f | wc -l
55

MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00/L010185N_bfldx113/cand_data -type f > /tmp/srbxc.lis
MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00/L010185N_bfldx113/cand_data -type f > /tmp/srbxbxc.lis
MINOS26 > for FI in `cat /tmp/srbxc.lis` ; do basename ${FI} ; done | cut -f 1 -d . | sort > /tmp/srbxc.fil
MINOS26 > for FI in `cat /tmp/srbxbxc.lis` ; do basename ${FI} ; done | cut -f 1 -d . | sort > /tmp/srbxbxc.fil
MINOS26 > diff /tmp/srbxbxc.fil /tmp/srbxc.fil
n13014010_0007_L010185N_D00_bfldx113
n13014013_0001_L010185N_D00_bfldx113
n13014032_0006_L010185N_D00_bfldx113
n13014034_0000_L010185N_D00_bfldx113
n13014034_0001_L010185N_D00_bfldx113
n13014048_0003_L010185N_D00_bfldx113
n13014050_0003_L010185N_D00_bfldx113

#######
# DAQ #
#######

F00038583_0000.mdaq.root Tue Aug 7 18:10:35 UTC 2007
had been timing out ( 10 minutes )
This was successfully declared
F00038583_0000.mdaq.root Wed Aug 8 10:09:17 UTC 2007
Most runs simply hung up, at
Processing /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/dbu_sampy.C...
then one got through
Open mysql:odbc://minos-db1.fnal.gov/offline_dev?option=1;
then success.

##########
# DCACHE #
##########

Near dcs got stuck Tuesday,
N070731_000001.mdcs.root Tue Aug 7 10:08:00 UTC 2007
and all other recent dcs files timing out after 600 seconds

FILE=N070731_000001.mdcs.root
DPATH=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/near_dcs_data/2007-08
DFILE=${DPATH}/${FILE}
loon -bq ${HOME}/minos/scripts/firstlastreroot.C ${DFILE}

This hangs, as does dccp.

12:21 - podstvkv reports that it is working
14:10 - I concur, it is working now.
14:11 - ./predator 2007-08 ( to declare near dcs data )
predator is happy, cleared the near dcs backlog

=============================================================================
2007 08 07

#######
# DAQ #
#######

F00038583_0000.mdaq.root Tue Aug 7 18:10:35 UTC 2007
times out in DBU after 10 minutes
size is relatively small, 7 MBytes.
6951697 Aug 7 11:30 F00038583_0000.mdaq.root

###############
# SAMRELOCATE #
###############

Picking up near, similar to far, just a lot more files.

FAR OUTPUT REVIEW

MINOS26 > SAMDIM="DATA_TIER mc-near"
MINOS26 > sam list files --dim="${SAMDIM}" --count
10454 files match the given constraints.
MINOS26 > sam list files --dim="${SAMDIM}" --nosummary | cut -c 16- | sort -u L010000.reroot.root L010170.reroot.root L010185.reroot.root L010200.reroot.root L100200.reroot.root L250200.reroot.root MINOS26 > ls /pnfs/minos/mcout_data/R1_18_2/near cand_data mrnt_data sntp_data snts_data STREAMS=`ls /pnfs/minos/mcout_data/R1_18_2/near` for STREAM in ${STREAMS} ; do printf "${STREAM} " ls /pnfs/minos/mcout_data/R1_18_2/near/${STREAM} | wc -l done cand_data 10518 mrnt_data 1596 sntp_data 11691 snts_data 10485 It seems these files remain in their original unhealthy paths. for STREAM in ${STREAMS} ; do ./samrelocate -n mcout_data/R1_18_2/near/${STREAM} ; done MINOS26 > for STREAM in ${STREAMS} ; do ./samrelocate -n mcout_data/R1_18_2/near/${STREAM} ; done NOOP STARTED Tue Aug 7 21:48:26 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/near/cand_data 10518 FILES 10401 OK locations 0 fixed locations 117 files undeclared 10518 / 10518 STARTED Tue Aug 7 21:48:26 2007 FINISHED Tue Aug 7 21:51:51 2007 NOOP STARTED Tue Aug 7 21:51:51 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data 1596 FILES 0 OK locations 0 fixed locations 1596 files undeclared 1596 / 1596 STARTED Tue Aug 7 21:51:51 2007 FINISHED Tue Aug 7 21:52:24 2007 NOOP STARTED Tue Aug 7 21:52:24 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/near/sntp_data 11691 FILES 10401 OK locations 0 fixed locations 1290 files undeclared 11691 / 11691 STARTED Tue Aug 7 21:52:24 2007 FINISHED Tue Aug 7 21:56:13 2007 NOOP STARTED Tue Aug 7 21:56:14 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/near/snts_data 10485 FILES 10401 OK locations 0 fixed locations 84 files undeclared 10485 / 10485 STARTED Tue Aug 7 21:56:14 2007 FINISHED Tue Aug 7 21:59:35 2007 MINOS26 > SAMDIM="MC.RELEASE carrot_06" MINOS26 > sam list files --dim="${SAMDIM}" --count 41657 files match the given constraints. TIERS="cand-near mrnt-near sntp-near snts-near mc-near" for TIER in $TIERS ; do printf "${TIER} " SAMDIM="DATA_TIER ${TIER}" sam list files --dim="${SAMDIM}" --count done cand-near 106024 files match the given constraints. mrnt-near 915 files match the given constraints. sntp-near 74283 files match the given constraints. snts-near 57045 files match the given constraints. mc-near 10454 files match the given constraints. for TIER in $TIERS ; do printf "${TIER} " SAMDIM="DATA_TIER ${TIER} and MC.RELEASE carrot_06" sam list files --dim="${SAMDIM}" --count done cand-near 10401 files match the given constraints. mrnt-near No files match the given constraints. sntp-near 10401 files match the given constraints. snts-near 10401 files match the given constraints. mc-near 10454 files match the given constraints. sum is 41657, matches all carrot_06 files. ######### # ADMIN # ######### Discussing TiBS backups of minos-sam02 with run2-sys See http://computing.fnal.gov/site-backups/ TiBS cost is 1000/system minimum ( under 250 GB ) $5K + $4K/TB over 1.5 at large scale. ######### # ADMIN # ######### minos23 upgrade to SLF 4 brebel jobs running, MINOS26 > bjobs -u brebel | grep minos24 368175 brebel UNKWN minos flxi04.fnal minos24.fna *_cron_sam Aug 7 09:45 368182 brebel UNKWN minos flxi04.fnal minos24.fna *_cron_sam Aug 7 09:45 These seem to have slowed down, cpu usage dropped sharply around 9:45, when these started. copied mcgowan files to minos23 , as he uses this system time tar cvf /tmp/mcgowan.tar . 
real 5m40.668s user 0m0.000s sys 0m0.000s MINOS24 > time scp -c blowfish /tmp/mcgowan.tar minos23:/tmp/mcgowan.tar real 5m46.597s user 0m1.530s sys 0m36.970s MINOS23 > tar xvf /tmp/mcgowan.tar He has moved this to his own directory I have removed my copy. ============================================================================= 2007 08 06 ########### # ENSTORE # ########### Found several stray files in /pnfs/minos MINOS26 > ls -l /pnfs/minos | grep 42411 -rw-r--r-- 1 42411 e875 598580 Jul 19 12:45 aaa2 -rw-r--r-- 1 42411 e875 598580 Jul 19 12:50 neha10 -rw-r--r-- 1 42411 e875 56 Jul 19 12:50 test11 -rw-r--r-- 1 42411 e875 56 Jul 17 13:52 test5 -rw-r--r-- 1 42411 e875 56 Jul 18 13:41 test7 -rw-r--r-- 1 42411 e875 56 Jul 18 14:55 test90 -rw-r--r-- 1 42411 e875 5 Jul 18 12:01 try.txt Neha Sharma gave a talk on FermiGrid Matchmaking at the 26 March Grid Users meeting. UID 42411 belongs to user minospro on fnpcsrv1. But that user has no .k5login. SRV1> id minospro uid=42411(minospro) gid=5111(numi) groups=5111(numi) That number matches the e875 group. SRV1> dds ~minospro/gramsave total 96 drwxr-xr-x 2 root root 2048 Aug 5 04:26 ./ drwxr-xr-x 4 minospro numi 2048 Jun 17 04:24 ../ -rw-r--r-- 1 root root 3902 Jun 17 04:24 minospro.7.tar.gz 2007 08 07 from neha : These files were created by me when I was testing gPlazma setup on Fermi dCache. I had to try transfers under multiple VOs and MINOS was one of them. I should have cleaned these up...anyways since I no longer need them, you can delete them. FILES='aaa2 neha10 test11 test5 test7 test90 try.txt' for FILE in ${FILES} ; do ls -l /pnfs/minos/${FILE} ; done for FILE in ${FILES} ; do rm /pnfs/minos/${FILE} ; done ######### # POWER # ######### 07:00 power out, all CR and office nodes shut down 07:30 power is up, Urish is bringing up CR fnpcsrv1 was down before 06:30, for its move from fcc2 to fcc1 09:30 Urish has updated and rebooted all consoles, after fixing problems with drivers and xorg.conf ############# # CHECKLIST # ############# Someone dumped in a peak over 3000 stores, from Sat 4 Aug 18:00 through Sun 5 Aug 06:00 Stage plot has not updated since Jul 31 Sunday, Predator found many files not having SAM tape locations Mostly like 149 F00033455_0004.all.cand.cedar_phy_srsafitter.0.root 4 267 N00009626_0022.cosmic.sntp.cedar_phy.0.root 3 1640 F00037703_0000.all.sntp.cedar_phy_srsafitter.0.root 4 These were all picked up in this morning's saddcache run. 
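Checking which files still lack SAM tape locations, as Predator did above, can also be done directly with sam locate. A minimal sketch, assuming one file name per line in a hypothetical /tmp/files.lis, the usual setup sam environment, and that a tape location shows up as a /pnfs path in the locate output (as it does elsewhere in this log); what sam locate prints for a file with no location at all is not verified here, so treat this only as a quick filter:

# Sketch: report files that do not yet show a /pnfs tape location in SAM
# /tmp/files.lis is a hypothetical list, one file name per line
while read FILE ; do
    sam locate ${FILE} 2>/dev/null | grep -q "/pnfs/minos" || echo "NO TAPE LOCATION ${FILE}"
done < /tmp/files.lis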
#############
# SADDCACHE #
#############

ln -sf saddcache.20070806 saddcache # was 20060802
Added -n NOOP option cutting off ENCP access
Changed -n to -b (bail) per standard usage elsewhere

########
# FARM #
########

We seem to be producing many Merged.*.root files of about the same size,

MINOS26 > ls -l /grid/data/minos/minfarm/WRITE/Mer*.root
-rw-r--r-- 1 10871 e875 674179014 Aug 4 06:59 /grid/data/minos/minfarm/WRITE/Merged.10548.root
-rw-r--r-- 1 10871 e875 674178488 Aug 4 19:52 /grid/data/minos/minfarm/WRITE/Merged.20597.root
-rw-r--r-- 1 10871 e875 674177416 Aug 2 18:09 /grid/data/minos/minfarm/WRITE/Merged.21292.root
-rw-r--r-- 1 10871 e875 674177204 Aug 5 12:20 /grid/data/minos/minfarm/WRITE/Merged.2396.root
-rw-r--r-- 1 10871 e875 674177186 Aug 5 18:20 /grid/data/minos/minfarm/WRITE/Merged.24558.root
-rw-r--r-- 1 10871 e875 674177475 Aug 6 06:28 /grid/data/minos/minfarm/WRITE/Merged.31723.root
-rw-r--r-- 1 10871 e875 674176426 Aug 4 13:51 /grid/data/minos/minfarm/WRITE/Merged.3321.root
-rw-r--r-- 1 10871 e875 674176973 Aug 3 22:41 /grid/data/minos/minfarm/WRITE/Merged.3825.root
-rw-r--r-- 1 10871 e875 674177091 Aug 5 06:53 /grid/data/minos/minfarm/WRITE/Merged.384.root
-rw-r--r-- 1 10871 e875 674176117 Aug 4 14:18 /grid/data/minos/minfarm/WRITE/Merged.4235.root
-rw-r--r-- 1 10871 e875 674176900 Aug 5 01:10 /grid/data/minos/minfarm/WRITE/Merged.5583.root
-rw-r--r-- 1 10871 e875 674176756 Aug 3 19:39 /grid/data/minos/minfarm/WRITE/Merged.8649.root
-rw-r--r-- 1 10871 e875 674176464 Aug 6 00:22 /grid/data/minos/minfarm/WRITE/Merged.9810.root

All within 3 KBytes ( sort -n -k 5,5 )
674176117 ... 674179014

Perhaps hadd is failing ?
Look into this when fnpcsrv1 comes back up.
Another of these at 12:23.

Found this in cedar_phy_srsafitterbx113near.log

OK adding n13014007_0000_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root 11
NSFIL SSIZ MSIZ DSIZ 11 688851515 604729958 8412155
OOPS, concatenated file size discrepancy, 8412155 gt 1500000
OK adding n13014008_0000_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root 11
NSFIL SSIZ MSIZ DSIZ 11 707703829 674177823 3352600
OOPS, concatenated file size discrepancy, 3352600 gt 1500000

Has been a problem since Jul 31

Looking in HADDLOG/2007-08/cedar_phy_srsafitterbx113mcnear.log
n13014007_0004_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root is truncated at 23863296 bytes: should be 67577636
n13014007_0005_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root is truncated at 57376768 bytes: should be 67304921
n13014008_0009_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root is truncated at 30310400 bytes: should be 68090967

FILES='
n13014007_0004_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root
n13014007_0005_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root
n13014008_0009_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root
'
for FILE in ${FILES} ; do ls -l /grid/data/minos/mcnearcat/${FILE} ; done

These files were written during the quota problems on July 28.
They are defective. Moving them out of the way.
mkdir /grid/data/minos/minfarm/BAD for FILE in ${FILES} ; do mv /grid/data/minos/mcnearcat/${FILE} /grid/data/minos/minfarm/BAD/${FILE} ; done rm /grid/data/minos/minfarm/WRITE/Merged.*.root ============================================================================= 2007 08 04 Sat ####### # DAQ # ####### Preparing for Monday power outage ssh -ax -l root minos-beamdata 'echo "shutdown -h now" | at 06:30 Aug 06' ssh -ax -l root minos-rc 'echo "shutdown -h now" | at 06:32 Aug 06' ssh -ax -l root minos-evd 'echo "shutdown -h now" | at 06:34 Aug 06' ssh -ax -l root minos-acnet 'echo "shutdown -h now" | at 06:36 Aug 06' ssh -ax -l root minos-om 'echo "shutdown -h now" | at 06:38 Aug 06' ######## # FARM # ######## concatenation is not keeping ahead very well, with old Merged files sitting in WRITE, and many files sitting there over a day : SRV1> dds -tr /grid/data/minos/minfarm/WRITE/Mer* -rw-r--r-- 1 minfarm numi 674177416 Aug 2 18:09 /grid/data/minos/minfarm/WRITE/Merged.21292.root -rw-r--r-- 1 minfarm numi 674176756 Aug 3 19:39 /grid/data/minos/minfarm/WRITE/Merged.8649.root -rw-r--r-- 1 minfarm numi 674176973 Aug 3 22:41 /grid/data/minos/minfarm/WRITE/Merged.3825.root -rw-r--r-- 1 minfarm numi 674179014 Aug 4 06:59 /grid/data/minos/minfarm/WRITE/Merged.10548.root SRV1> dds -tr /grid/data/minos/minfarm/WRITE | grep minfarm Aug 3 12:07 F00038519_0006.all.sntp.cedar.0.root Aug 3 12:09 F00038519_0006.spill.bntp.cedar.0.root Aug 3 12:09 F00038522_0000.spill.bntp.cedar.0.root Aug 3 12:09 F00038519_0006.spill.sntp.cedar.0.root Aug 3 12:26 n13023036_0002_L010185N_D00.sntp.cedar.root Aug 3 12:30 n13023039_0002_L010185N_D00.sntp.cedar.root ... Aug 3 13:09 N00008336_0002.cosmic.sntp.cedar_phy.1.root Aug 3 13:09 N00008336_0011.cosmic.sntp.cedar_phy.1.root Aug 3 13:11 N00009607_0006.cosmic.sntp.cedar_phy.0.root ... MINOS26 > ./dc_stat /pnfs/minos/reco_far/cedar/sntp_data/2007-07/F00038519_0006.all.sntp.cedar.0.root LEVEL 2 2,0,0,0.0,0.0 :c=1:b56e7d1d;h=yes;l=436286249; w-stkendca12a-6 r-stkendca18a-4 MINOS26 > ./dc_stat /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/303/n13023036_0002_L010185N_D00.sntp.cedar.root LEVEL 2 2,0,0,0.0,0.0 :c=1:f715f3f6;h=yes;l=127520458; w-stkendca12a-6 r-stkendca18a-4 MINOS26 > ./dc_stat /pnfs/minos/reco_near/cedar_phy/sntp_data/2005-08/N00008336_0002.cosmic.sntp.cedar_phy.1.root LEVEL 2 2,0,0,0.0,0.0 :c=1:28865ffc;h=yes;l=172820100; w-stkendca12a-6 r-stkendca18a-4 Reported this to dcache-admin round 12:30, via email Ticket 102064 podstvkv restarted 5 pools around 19:07 ============================================================================= 2007 08 03 ########### # MONTHLY # ########### HOWTO.monthly - created from tail of this log. ############### # SAMRELOCATE # ############### Finally, running in earnest, will do dev,int,prd Review old LOG entry around 2006 07 16 ./saddmc --declare carrot_08 mcin_data/far/carrot/L010185 ./saddmc --declare R1_18_2 mcout_data/R1_18_2/far ./saddmc --declare carrot_06 mcin_data/near/carrot_06/L010200 ./saddmc --declare R1_18_2 mcout_data/R1_18_2/near partial, followed by ./saddmc --declare -s L010200 R1_18_2 mcout_data/R1_18_2/near \ 2>&1 | tee -a ../log/saddmc/mcout-R1_18_2-near-prd.log STARTING WITH FAR, DO NEAR LATER INPUT REVIEW MINOS26 > SAMDIM="DATA_TIER mc-far" MINOS26 > sam list files --dim="${SAMDIM}" --count 471 files match the given constraints. 
MINOS26 > sam list files --dim="${SAMDIM}" --nosummary | cut -c 16- | sort -u L010185.reroot.root L100200N_D00.reroot.root ./samrelocate -v -n -b 3 mcin_data/far/carrot/L010185 MINOS26 > SAMDIM="DATA_TIER mc-near" MINOS26 > sam list files --dim="${SAMDIM}" --count 10454 files match the given constraints. MINOS26 > sam list files --dim="${SAMDIM}" --nosummary | cut -c 16- | sort -u L010000.reroot.root L010170.reroot.root L010185.reroot.root L010200.reroot.root L100200.reroot.root L250200.reroot.root Checking input areas MINOS26 > ./samrelocate -n mcin_data/far/carrot/L010185 NOOP STARTED Fri Aug 3 20:11:38 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcin_data/far/carrot/L010185 NFILES 1341 441 OK locations 0 fixed locations 900 files undeclared 1341 / 1341 STARTED Fri Aug 3 20:11:38 2007 FINISHED Fri Aug 3 20:12:05 2007 for DIR in `ls /pnfs/minos/mcin_data/near/carrot_06` ; do ls /pnfs/minos/mcin_data/near/carrot_06/${DIR} ; done Need to clean out files in /pnfs/minos/mcin_data/near/carrot_06/L100200/BAD Still, we can check locations : for DIR in `ls /pnfs/minos/mcin_data/near/carrot_06` ; do ./samrelocate -n mcin_data/near/carrot_06/${DIR} ; done All are OK... double checked these in production, still ok OUTPUT REVIEW ls /pnfs/minos/mcout_data/R1_18_2/far/carrot L010185 L010185_RSCT0 L010185_RSCT2 L010185_tau_test L100200 L250200 BASE=mcout_data/R1_18_2/far/carrot/L010185 for DIR in `ls /pnfs/minos/${BASE}` ; do ./samrelocate -n ${BASE}/${DIR} ; done MINOS26 > ./samrelocate -q ${BASE}/sntp_data QUIET STARTED Fri Aug 3 21:26:46 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data 1969 FILES 0 OK locations 441 fixed locations 1528 files undeclared 1969 / 1969 STARTED Fri Aug 3 21:26:46 2007 FINISHED Fri Aug 3 21:27:41 2007 MINOS26 > setup sam -q int MINOS26 > ./samrelocate -q ${BASE}/sntp_data QUIET STARTED Fri Aug 3 21:45:29 2007 Declaring to SAM int 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data 1969 FILES OOPS , addLocation error for f21001047_0000_L010185.sntp.R1_18_2.root CLASS SamException.SamExceptions.DataStorageLocationNotFound INSTANCE Location with name '/pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data' not found. 
STARTED Fri Aug 3 21:45:29 2007 FINISHED Fri Aug 3 21:45:31 2007 MINOS26 > ./samtapeloc /pnfs/minos/mcout_data/R1_18_2 int MINOS26 > ./samtapeloc /pnfs/minos/mcout_data/R1_18_2 prd MINOS26 > sam add location --file=f21001047_0000_L010185.sntp.R1_18_2.root --loc='/pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data(vo7033,427)' MINOS26 > ./samrelocate -q ${BASE}/sntp_data QUIET STARTED Fri Aug 3 21:57:36 2007 Declaring to SAM int 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data 1969 FILES 1 OK locations 440 fixed locations 1528 files undeclared 1969 / 1969 STARTED Fri Aug 3 21:57:36 2007 FINISHED Fri Aug 3 21:58:36 2007 MINOS26 > setup sam -q prd MINOS26 > ./samrelocate -q ${BASE}/sntp_data QUIET STARTED Fri Aug 3 22:01:28 2007 Declaring to SAM prd 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data 1969 FILES 0 OK locations 441 fixed locations 1528 files undeclared 1969 / 1969 STARTED Fri Aug 3 22:01:28 2007 FINISHED Fri Aug 3 22:02:39 2007 MINOS26 > setup sam -q dev MINOS26 > ./samrelocate -q ${BASE}/snts_data 1968 FILES 0 OK locations 441 fixed locations 1527 files undeclared 1968 / 1968 ########### # MONTHLY # ########### DATASETS 8/3 PREDATOR 8/3 SADDRECO 8/3 VAULT 8/3 MYSQL 8/ ######## # FARM # ######## Cleanup of duplicates/healed runs Added missing location grep N00012596_0002.spill.cand.cedar.0.root ../CFL/CFL sam add location --file=N00012596_0002.spill.cand.cedar.0.root --loc='/pnfs/minos/reco_near/cedar/cand_data/2007-07(voc583.451)' LOG/2007-08/cedarnear.log ./roundup -s "N00012463\|N00012620" -r cedar near updated roundup to complete all partial runs from now on. Now clean out duplicates : YEMO=`date +%Y-%m` cd LOG/${YEMO} for LOG in `ls` ; do less ${LOG} ; done FILES=' mcfarcat/f20011014_0009_CosmicMu_D02.sntp.cedar.root mcnearcat/n13012001_0006_L010185N_D00.sntp.cedar.root nearcat/N00009635_0007.cosmic.sntp.cedar_phy.1.root nearcat/N00008486_0000.spill.sntp.cedar_phy_srsafitter.0.root ' MINOS26 > for FILE in ${FILES} ; do ls -l /grid/data/minos/${FILE} ; done -rw-rw-r-- 1 1334 e875 61303092 Jul 2 19:15 /grid/data/minos/mcfarcat/f20011014_0009_CosmicMu_D02.sntp.cedar.root -rw-rw-r-- 1 1334 e875 67187759 Jul 21 01:20 /grid/data/minos/mcnearcat/n13012001_0006_L010185N_D00.sntp.cedar.root -rw-rw-r-- 1 1334 e875 30460492 Jun 4 18:05 /grid/data/minos/nearcat/N00009635_0007.cosmic.sntp.cedar_phy.1.root -rw-rw-r-- 1 1334 e875 63165604 Aug 1 15:45 /grid/data/minos/nearcat/N00008486_0000.spill.sntp.cedar_phy_srsafitter.0.root MINOS26 > for FILE in ${FILES} ; do mv /grid/data/minos/${FILE} /grid/data/minos/minfarm/DUP/ ; done ########### # ROUNDUP # ########### roundup.20070803 - concatenate when have+new files count is correct, to complete runs which have been partially forced cp AFSS/roundup.20070803 . 
ln -sf roundup.20070803 roundup # was roundup.20070802 ######## # FARM # ######## GDM usage is 301/400, manualy purge most file in WRITE, 13:15 ./roundup -w -M -r cedar_phy_srsafitter near usage down to 258/400 ############ # PREDATOR # ############ Cron had been off since 1 Aug, so did catchup ./predator 2007-08 ============================================================================= 2007 08 02 ############# # SAMLOCATE # ############# example sent to brebel ./samlocate "${SAMDIM}" | while read FILEPATH do FILE=`echo ${FILEPATH} | cut -f 1 -d ' '` FPAT=`echo ${FILEPATH} | cut -f 2 -d ' '` echo FILE/PATH ${FILE} ${FPAT} done ####### # SAM # ####### dev/int oracle security patches ( July ) scheduled 09:00 Up and running around 10:00 dev station and dbserver did not need to be restarted. ######## # FARM # ######## Need cleanup of sam declares... /home/minfarm/ROUNTMP/LOG/2007-07/declare_near_cedar.log N00012596_0002.spill.cand.cedar.0.root LOG/2007-08/cedarmcfar.log DUPE f20011014_0009_CosmicMu_D02.sntp.cedar.root LOG/2007-08/cedarmcnear.log DUPE n13012001_0006_L010185N_D00.sntp.cedar.root LOG/2007-08/cedar_phynear.log many duplicates LOG/2007-08/cedar_phyfar.log several pending runs, long term LOG/2007-08/cedar_phymcfar.log PEND - should flush LOG/2007-08/cedar_phymcnear.log clean LOG/2007-08/cedar_phy_brevmcnear.log clean ######## # FARM # ######## Working on cleanup of special processing : LOG/2007-08/cedar_phy_srsafitternear.log DUPE N00008486_0000.spill.sntp.cedar_phy_srsafitter.0.root LOG/2007-08/cedar_phy_srsafitterfar.log LOG/2007-08/cedar_phy_srsafittermcnear.log LOG/2007-08/cedar_phy_srsafittermcfar.log LOG/2007-08/cedar_phy_srsafitterbx113near.log LOG/2007-08/cedar_phy_srsafitterbx113mcnear.log ######## # MAIL # ######## Mail is stuck on fnpcsrv1, minos26, minos-98193.dhtp Mail -s roundup found duplicates in cedar_phy_srsafitter near on fnpcsrv1 minos-data@fnal.gov killing fnpcsrv1 Mail command manually. Mail does not get stuck when given content in the mail body. echo "TESTING" | Mail -s "hello there" kreymer@fnal.gov ########### # ROUNDUP # ########### roundup.20070802 - added content to body of duplicates email, to avoid the email hangs seen today. cp AFSS/roundup.20070802 . ln -sf roundup.20070802 roundup # was roundup.20070730 ######## # GRID # ######## hadd is running quite slowly, no CPU usage to speak of. 
Read data rates are over 10 MB/second Write rates are good, over 10 MB/second SRV1> time dd if=/dev/zero of=TEST.DAT bs=100M count=1 1+0 records in 1+0 records out real 0m8.871s user 0m0.000s sys 0m1.282s reading/writing is slow, about 1 MB/second SRV1> time cat /grid/data/minos/nearcat/N00009047* > ./TEST.dat du -sm TEST.dat real 4m25.043s user 0m0.045s sys 0m5.634s SRV1> du -sm TEST.dat 383 TEST.dat ######## # GRID # ######## tokens AFSPROD=/afs/fnal.gov/files/code/e875/general/products/ GRIPROD=/grid/app/minos/products time rsync -r ${AFSPROD} ${GRIPROD} --perms --times --links --size-only --delete -v -n db/minos_offline/S07-07-27-R1-26.release.log db/minos_offline/S07-07-27-R1-26.table db/minos_offline/S07-07-27-R1-26.version prd/python/v2_4_sam/Linux-2-4/lib/python2.4/re.pyc prd/python/v2_4_sam/Linux-2-4/lib/python2.4/sre.pyc prd/python/v2_4_sam/Linux-2-4/lib/python2.4/sre_compile.pyc prd/python/v2_4_sam/Linux-2-4/lib/python2.4/sre_constants.pyc wrote 1012360 bytes read 48 bytes 5547.44 bytes/sec total size is 2226133693 speedup is 2198.85 real 3m2.370s user 0m1.870s sys 0m5.740s time rsync -r ${AFSPROD} ${GRIPROD} --perms --times --links --size-only --delete -v MINOS26 > du -sm /afs/fnal.gov/files/code/e875/general/minossoft/* 5534 /afs/fnal.gov/files/code/e875/general/minossoft/packages 435 /afs/fnal.gov/files/code/e875/general/minossoft/releases 16 /afs/fnal.gov/files/code/e875/general/minossoft/setup 1 /afs/fnal.gov/files/code/e875/general/minossoft/srt 1 /afs/fnal.gov/files/code/e875/general/minossoft/temp ============================================================================= 2007 08 01 ############# # CHECKLIST # ############# minos-sam01 load average near 2, CPU at 1/4 ( 1 CPU ) 15:30 - 03:30 , 04:30 to present network active, but 10 KBytes/second 17 processes like PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 22031 sam 15 0 290M 290M 5568 S 4.3 7.2 0:27 1 python /home/sam/products/db_server_base/v3_3_17/NULL/bin/DbListener.py -c=dbs_prd saMINOS26 > sam get dbserver connection info Connection: minfarm@fnpcsrv1.fnal.gov:saddreco_v7_7_1(1008531) Servant creation time: 01-Aug-2007 07:50:00 (CDT) Last method invoked: __init__ (01-Aug-2007 07:50:00 (CDT)) Last method completed in 0.00542998313904 seconds Servant status message: initializing Connection: brebel@flxb31.fnal.gov:sam_v7_6_5(1009387) Servant creation time: 01-Aug-2007 08:02:01 (CDT) Last method invoked: Nov 11eDimensions_v2 (01-Aug-2007 08:02:01 (CDT)) Last method still running. Servant status message: invoking the SQL query for infixString = RUN_TYPE physics% and VERSION ... Connection: brebel@flxb22.fnal.gov:sam_v7_6_5(1009411) Servant creation time: 01-Aug-2007 08:02:22 (CDT) Last method invoked: getReplicaLocationList (01-Aug-2007 08:02:22 (CDT)) Last method completed in 2.46368288994 seconds Servant status message: Marshalling complete in less than 1 second (len: 1) ... and 7 more getReplicaLocationList instances moments later, 12 similar new current connections. MINOS26 > bjobs -u brebel | grep RUN | wc -l 55 MINOS26 > bjobs -u brebel | wc -l 342 MINOS26 > bjobs -u brebel | grep 'Jul 31 09:45' | wc -l 17 MINOS26 > bjobs -u brebel | grep 'Jul 31 05:' | wc -l 13 MINOS26 > bjobs -u brebel | grep 'Aug 1 05:' | wc -l 310 The sam locates are being done by /afs/fnal.gov/files/home/room1/brebel/gen_iuntuple_cron_sam Each job looks up 1 month of files ( cedar, cedar_phy ), about 22 months each. So this activity results in about a 6 hour delay for each job. 
Watching for dropoff, still heavy with 6 or 7 brebel connections MINOS26 > sam get dbserver connection info | grep brebel See below, saddreco is stuck with excessive timeouts sam_test_py also failed similarly, due to brebel load doing sam locate. sam_test_project is slow, but OK Works after the load drops off at 10:30 SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-far \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_far/cedar_phy/sntp_data/2007-03 " SFILES=`sam list files --dim="${SAMDIM}" --nosummary` printf "${SFILES}\n" | wc -l 48 date ; time { for FILE in ${SFILES} ; do sam locate ${FILE} ; done } N time CPU 1 1' 24" 2 2' 47" 3 1m47 4 2m27 4 2m29 25% u 1% s 5 3m11 25% u 4% s 5 3m19 25% u 5% s sam_test_py normally 7s, retries, OK at end of pass 4 2m38 25% u 3% s stp did got cid after 10 retries, then stuck again 3 2m01 25% u 1% s stp got cid first try, 1 retry, then OK second try needed 1 retry for cid, 1 for cpid 2 1m39 24% u .5% s stp 1/5 passes had retry for cid time 16s 1 1m28 17% u 0 s stp 7/7 passes ok, time 9s 0 stp time 6 s ######## # FARM # ######## Stuck in sampy scripts/saddreco far cedar 2007-07 declare tail /home/minfarm/ROUNTMP/LOG/2007-07/declare_far_cedar.log Needed /pnfs/minos/reco_far/cedar/cand_data/2007-07 Treating 725 files in /pnfs/minos/reco_far/cedar/cand_data/2007-07 RetryHandler.getMetadataRequirementDescriptor()> initial retriable exception TRANSIENT('CORBA.TRANSIENT(omniORB.TRANSIENT_CallTimedout, CORBA.COMPLETED_MAYBE)') RetryHandler.getMetadataRequirementDescriptor()> will retry in 600.00 seconds ... killed corral, then saddreco before the 09:00 shutdown of fnpcsrv1 ######### # ADMIN # ######### minos02 ganglia monitoring is down Last heartbeat 12 days, 20:45:16 ago reported as ticket 101878 Corrected, ganglia looks good, thanks, jonest ######## # FARM # ######## fnpcsrv1 up too late for noon roundup cron Clear out WRITE ./roundup -w -r cedar far declared 2007-07 ./roundup -w -r cedar near declared 2007-07 created corralsrs for the special runs .corralsrs ############# # SAMLOCATE # ############# SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-far \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_far/cedar_phy/sntp_data/2007-03 " ./samlocate "${SAMDIM}" | wc -l 48 time ./samlocate "${SAMDIM}" ( 48 files ) real 0m5.390s SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-far \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_far/cedar_phy/sntp_data/2006-01 " ./samlocate "${SAMDIM}" | wc -l 53 time ./samlocate "${SAMDIM}" SAMDIM=" RUN_TYPE physics% \ and VERSION cedar \ and DATA_TIER sntp-far \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_far/cedar/sntp_data/2006-01 " ./samlocate "${SAMDIM}" | wc -l 745 time ./samlocate "${SAMDIM}" real 0m40.879s ============================================================================= 2007 07 31 ######### # ADMIN # ######### minos-om is back in service, around 13:00 No breakin beyond JIRA accounts. ######## # FARM # ######## Cleared enough backlog to resume running, continuing to clear more backlog. See yesterday's entry. ####### # SAM # ####### Does minos-sam02 need to be backed up ? I think not, an occasional snapshot done privately should be OK. MINOS-SAM02 > du -sm * ... 146 private ... 3212 products 799 products.20051018 2665 products.20060413 ... I have removed the old products areas. 
Moved oracle_client aside to Xoracle_client, everything works OK, So we are really using oracle_instant_client , deleting old clients Moved it back drwxr-xr-x 3 sam 5024 4096 Aug 1 2005 v10_1_0_3_0 drwxr-xr-x 3 sam 5024 4096 Jun 19 2006 v10_2_0_1 drwxr-xr-x 3 sam 5024 4096 Mar 25 2005 v8_1_7a ups undeclare -Y oracle_client v8_1_7a ups undeclare -Y oracle_client v10_1_0_3_0 ups undeclare -Y oracle_client v10_2_0_1 MINOS-SAM02 > du -sm sam 298 sam MINOS-SAM02 > ups list -aK+ sam "sam" "v6_0" "NULL" "" "" "sam" "v7_0_1" "Linux+2.4" "" "" "sam" "v7_0_2c" "Linux+2" "" "" "sam" "v7_0_2" "Linux+2" "" "" "sam" "v7_1_2" "Linux+2" "" "" "sam" "v7_1_10" "Linux+2" "" "" "sam" "v7_2_2" "Linux+2" "" "" "sam" "v7_2_6" "Linux+2" "" "" "sam" "v7_1_9" "Linux+2" "" "" "sam" "v7_3_0" "Linux+2" "" "" "sam" "v7_3_1" "Linux+2" "" "" "sam" "v7_3_4" "Linux+2" "" "" "sam" "v7_4_0a_py24" "Linux+2" "" "" "sam" "v7_4_2" "Linux+2" "" "" "sam" "v7_5_1" "Linux+2" "" "current" "sam" "v8_1_3" "Linux+2" "" "" ups undeclare -Y sam v6_0 for SREL in v7_0_1 v7_0_2c v7_0_2 v7_1_2 v7_1_10 v7_1_9 v7_2_2 v7_2_6 ; do ups undeclare -Y sam ${SREL} ; done for SREL in v7_3_0 v7_3_1 v7_3_4 v7_4_0a_py24 v7_4_2 ; do ups undeclare -Y sam ${SREL} ; done MINOS-SAM02 > du -sm sam 22 sam 269 sam_station ups undeclare -Y sam_station v6_0_1_12 -q GCC-3.1 138 sam_station ============================================================================= 2007 07 30 ########### # ROUNDUP # ########### Added 0 length check for merged file Corrected message regarding ROOTRELS cp AFSS/roundup.20070730 . ln -sf roundup.20070730 roundup # was roundup.20070707 ############# # CHECKLIST # ############# DCache data plots stop Saturday morning around 08:00 http://fndca2a.fnal.gov:8090/dcache/outplot?lvl=0&filename=billing-2007.07.daily.brd.png&day=28&fmt=lin Queue plots last update July 24 00:31:25 http://fndca.fnal.gov/dcache/queue/allpools.jpg minos26 disks are full ( 20 GB free ) ####### # DAQ # ####### MINOS26 > ls -l /pnfs/minos/reco_far/cedar/sntp_data/2007-07/F00038544_0000.spill.sntp.cedar.0.root -rw-r--r-- 1 1334 e875 0 Jul 28 06:07 /pnfs/minos/reco_far/cedar/sntp_data/2007-07/F00038544_0000.spill.sntp.cedar.0.root Cannot investigate till OM comes back online ( has DAQ log archive ) /grid/data/minos/minfarm/WRITE is also 0 length. The file difference check was at 1.5 MBytes, this file was just under. So this slipped through. Removing the damaged 0 length output file from WRITE rm /grid/data/minos/minfarm/WRITE/F00038544_0000.spill.sntp.cedar.0.root rm /pnfs/minos/reco_far/cedar/sntp_data/2007-07/F00038544_0000.spill.sntp.cedar.0.root 2007 08 03 13:26 sam undeclare file F00038544_0000.spill.sntp.cedar.0.root ######### # ADMIN # ######### Report of extra JIRA accounts by saranen, discovered by habig. Shut down system, reported to helpdesk around 10:15. urish is on vacation. 
Note : to enable the firewall, As root setup firewall Typed space in 'enable' option OK iptables -I INPUT 1 -s -j ACCEPT # allow iptables -D INPUT -s -j ACCEPT # delete iptables -L INPUT # list 14:58 - lilstrom finished, waiting for permission to resume usage 21:30 - added my desktop for access, could not add catbox.dhcp ( no address ) for tarupp ( did verify that he is sysadmin, in miscomp ) restored my desktop access For habag will do iptables -I INPUT 1 -s neutrino.d.umn.edu -j ACCEPT aka 131.212.37.31 iptables -D INPUT -s owl.fnal.gov -j ACCEPT iptables -D INPUT -s 131.225.82.83 -j ACCEPT # catbox Alec needs web access from home, iptables -I INPUT 1 -s 71-83-38-121.dhcp.dlth.mn.charter.com -j ACCEPT For Nessus scan , iptables -I INPUT 1 -s shamus.fnal.gov -j ACCEPT 12:25 - disabled firewall, ran scanmenow ( clean ) 12:53 - scheduled newquick scan of MINOS-OM port 8080 node MINOS-OM was required, not minos-om.fnal.gov 13:13 returned to service ######### # ADMIN # ######### Giving habig ( future run coordinator ) access to CR systems beamdata Removed merina ( albert ) from minos-gateway-nd ########## # DCACHE # ########## Looking at logs from roundup, SRM has been down since Noon Sat 28 July SRMClientV1 : put: try # 0 failed with error SRMClientV1 : java.net.SocketException: Connection reset srmtest fails as before. 12:15 Submitted High priority ticket 101747 15:30 - srm is up Clearing space on minos26, with ./mcimport -w kordosky Reenabled crontab.dat, now that SRM is working. Only 50 GB free on disk, but we should recover OK. ######## # GRID # ######## /grid/data and /grid/app are unmounted on minos01 Asked to have them mounted, ticket 101717 ########### # ROUNDUP # ########### ROOTRELS - Added cedar_phy_srsafitter cedar_phy_srsafitterbx113 ####### # SAM # ####### for REL in dev int prd ; do ./reloc -s REL cedar_phy_srsafitter ; done many locations declared for REL in dev int prd ; do ./reloc -s ${REL} cedar_phy_srsafitterbx113 ; done nothing found for REL in dev int prd ; do ./reloc -s ${REL} cedar_phy_mboone ; done several near locations were declared, none far export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.srsafitter samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.safitterbx113 samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.mboone done the last 2/3 were needed ######## # FARM # ######## SRV1> quota -s -v -g numi ... Disk quotas for group numi (gid 5111): Filesystem blocks quota limit grace files quota limit grace blue2:/gpfarm-home 114G 0 500G 1189k 0 0 blue2:/gpfarm-stage 629G 0 1639G 3444k 0 0 blue2.fnal.gov:/fermigrid-data 368G 0 400G 12232 0 0 Need to grab no more than 30 GB first chunk. farmgsum > /tmp/farmgsum.20070730 ./roundup -n -W -M -r cedar_phy_srsafitter far 2>&1 | tee -a /tmp/cpsrsf.lis 15:45 ./roundup -r cedar_phy_srsafitter far repeated to flush WRITE files, Quota was around 380/400 a few files into the file purge. At end, 362/400 . This is not enough to do the next block of 80 GB, Will have to take one smaller chunk. 
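The headroom arithmetic above ( 362/400, is the next 80 GB block too big ) is done by eye from the quota output. A minimal sketch of the same check, assuming the single-line quota -s format shown above with used and limit in the 2nd and 4th columns, in GB:

# Sketch: check fermigrid-data headroom before grabbing the next chunk
# NEED is the size of the next block in GB
NEED=80
USED=`quota -s -v -g numi | grep fermigrid-data | awk '{print $2}' | tr -d 'G'`
LIMIT=`quota -s -v -g numi | grep fermigrid-data | awk '{print $4}' | tr -d 'G'`
FREE=$(( LIMIT - USED ))
echo "fermigrid-data ${USED}G of ${LIMIT}G used, ${FREE}G free, need ${NEED}G"
[ ${FREE} -ge ${NEED} ] || echo "OOPS - not enough room, take a smaller chunk"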
Had planned to then do the next sets : ./roundup -M -c -r cedar_phy_srsafitter mcnear ./roundup -M -c -r cedar_phy_srsafitterbx113 mcnear Cleared out a couple of stray files, previously forced out ./roundup -M -r cedar_phy_brev mcnear Now try to take a bite out of mcnearcat SRV1> ./roundup -n -M -W -s n13011 -r cedar_phy_srsafitterbx113 mcnear OK - processing /grid/data/minos/mcnearcat version 20070730 Mon Jul 30 22:40:52 CDT 2007 OK - processing 442 files OK - stream L010185N_D00.sntp.cedar_phy_srsafitterbx113 OK - 29836 Mbytes in 45 runs ... SRV1> quota -s -v -g numi | grep -2 fermigrid-data 363G 0 400G 11801 0 0 OK, will use 30 of 37 GB free space. Growth seems to have stopped recently, let's give it a try. ./roundup -M -s n13011 -r cedar_phy_srsafitterbx113 mcnear 2007 07 31 ran again at 07:56, usage dropped fro 363 to 336 . Grabbing the rest, 08:00 ./roundup -M -r cedar_phy_srsafitterbx113 mcnear done at 9:40, 10:46 grab another 36 GB, still have 336/400 used ./roundup -M -r cedar_phy_srsafitter mcnear 14:30 ./roundup -M -w -r cedar_phy_srsafitterbx113 mcnear usage 336 -> 313 Now get the last large chunk of data, 147 GB 15:00 ./roundup -M -r cedar_phy_srsafitter mcfar 16:30 clearing out WRITE files ./roundup -M -w -r cedar_phy_srsafitter mcnear usage 313 -> 279 ./roundup -M -w -r cedar_phy_brev mcnear one file ./roundup -M -w -r cedar far pick up 2 stuck files from weekend F00038544_0000.spill.bntp.cedar.0.root F00038544_0000.all.sntp.cedar.0.root SRV1> for NUM in 9398 7402 10170 16327 ; do ls -l Merged.${NUM}.root ; done -rw-r--r-- 1 minfarm numi 0 Jul 28 18:05 Merged.9398.root -rw-r--r-- 1 minfarm numi 98304 Jul 29 00:08 Merged.7402.root -rw-r--r-- 1 minfarm numi 0 Jul 29 06:05 Merged.10170.root -rw-r--r-- 1 minfarm numi 0 Jul 29 12:05 Merged.16327.root SRV1> for NUM in 9398 7402 10170 16327 ; do rm Merged.${NUM}.root ; done 2007 08 01 purged WRITE, 317 files ./roundup -M -r cedar_phy_srsafitter mcfar usage 305G -> 173 ============================================================================= 2007 07 27 ####### # OSG # ####### OSG users meeting, day 2 ####### # SAM # ####### minos-sam02 is upgraded to SLF 4.4 ups start sam_boostrap Looks good MINOS26 > setup sam -q dev MINOS26 > sam ping dbserver MINOS26 > sam get dbserver info MINOS26 > sam get dbserver connection info MINOS26 > sam locate foo MINOS26 > sam_test_py minos MINOS26 > sam get metadata --file=f21001001_0000_L010185.cand.R1_18_2.root MINOS26 > sam locate f21001001_0000_L010185.cand.R1_18_2.root ============================================================================= 2007 07 26 ####### # OSG # ####### OSG users meeting ####### # SAM # ####### On minos-sam02, as mindata, samread, sam, in samread, rm SAM03.tar ( 5 GB ) tar czvf /tmp/sam02-mindata-20070726.tgz . tar czvf /tmp/sam02-samread-20070726.tgz . tar czvf /tmp/sam02-sam-20070726.tgz . On minos-sam03, cd ARCH/sam02 scp -c blowfish minos-sam02:/tmp/sam02-mindata-20070726.tgz . scp -c blowfish minos-sam02:/tmp/sam02-samread-20070726.tgz . scp -c blowfish minos-sam02:/tmp/sam02-sam-20070726.tgz . ============================================================================= 2007 07 25 ########### # ROUNDUP # ########### On Wed, 25 Jul 2007, Mayly Sanchez wrote: ... > The newer daikon cosmics ND are not being concatenated, these are > needed for the calibration group. 
I have added this to the corral script run via cron, and have started an early run of the concatenation script, around 15:00 ./roundup -M -r cedar_phy mcnear ######## # GRID # ######## tokens AFSPROD=/afs/fnal.gov/files/code/e875/general/products/ GRIPROD=/grid/app/minos/products time rsync -r ${AFSPROD} ${GRIPROD} --perms --times --links --size-only --delete -v -n real 3m18.267s user 0m1.670s sys 0m4.400s time rsync -r ${AFSPROD} ${GRIPROD} --perms --times --links --size-only --delete -v real 0m52.233s user 0m1.770s sys 0m4.580s ############### # SAMRELOCATE # ############### Continuing to work on relocation, per 14 July kordosky email re f21001001_0000_L010185.cand.R1_18_2.root in /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data declared /pnfs/minos/mcout_data/R1_18_2/far/cand_data MINOS26 > for DIR in `ls /pnfs/minos/mcin_data/near/carrot_06` ; do printf "${DIR}" ; ls /pnfs/minos/mcin_data/near/carrot_06/${DIR} | wc -l ; done L010000 1456 L010170 198 L010185 7483 L010200 198 L100200 729 L250200 391 ./samrelocate -v -n -b 3 mcin_data/near/carrot_06/L010200 MCODIR=mcout_data/R1_18_2/far/carrot/L010185/cand_data ./samrelocate -v -n -b 20 ${MCODIR} ... f21101070_0000_L010185.cand.R1_18_2.root '/pnfs/minos/mcout_data/R1_18_2/far/cand_data,717@vo7033' f22001094_0000_L010185.cand.R1_18_2.root '/pnfs/minos/mcout_data/R1_18_2/far/cand_data,1174@vo7033' Had to go to the source for SamReplicaLocation to discover how to parse out the components: SLOC.getLocationType() SLOC.getFullPath() SLOC.getActualDetails() Can pick up 1 test file with MINOS26 > ./samrelocate -n ${MCODIR} MCODIR=mcout_data/R1_18_2/far/carrot/L010185/cand_data -b 13 NOOP BAIL after 13 STARTED Wed Jul 25 15:43:20 2007 Declaring to SAM dev 13 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data NFILES 1969 f21101070_0000_L010185.cand.R1_18_2.root was /pnfs/minos/mcout_data/R1_18_2/far/cand_data 1 fixed locations 12 files undeclared 13 / 1969 sam.eraseFileLocation( filename = , replicalocation = 'string' ) f21101070_0000_L010185.cand.R1_18_2.root was /pnfs/minos/mcout_data/R1_18_2/far/cand_data f21101070_0000_L010185.cand.R1_18_2.root now /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data(vo7033.717) MINOS26 > ./samrelocate ${MCODIR} MCODIR=mcout_data/R1_18_2/far/carrot/L010185/cand_data -b 13 -v BAIL after 13 VERBOSE DATADIR /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data STARTED Wed Jul 25 16:26:57 2007 Declaring to SAM dev 13 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data NFILES 1969 FILES f21101070_0000_L010185.cand.R1_18_2.root '/pnfs/minos/mcout_data/R1_18_2/far/cand_data,717@vo7033' SLOC '/pnfs/minos/mcout_data/R1_18_2/far/cand_data,717@vo7033' tape /pnfs/minos/mcout_data/R1_18_2/far/cand_data MssLocationDetails({ 'mssInstance' : 'Fermilab', 'mssName' : 'enstore', 'offset' : 717L, 'volumeLabel' : 'vo7033', }) f21101070_0000_L010185.cand.R1_18_2.root was /pnfs/minos/mcout_data/R1_18_2/far/cand_data f21101070_0000_L010185.cand.R1_18_2.root now /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data(vo7033.717) OOPS , addLocation error for f21101070_0000_L010185.cand.R1_18_2.root CLASS SamException.SamExceptions.DataStorageLocationNotFound INSTANCE Location with name '/pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data' not found. STARTED Wed Jul 25 16:26:57 2007 FINISHED Wed Jul 25 16:26:59 2007 OOPS, needed to add the storage locations. 
./samtapeloc /pnfs/minos/mcout_data/R1_18_2 dev IFILE=f21101070_0000_L010185.cand.R1_18_2.root SAMLOC=/pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data(vo7033.717) sam add location --file=${IFILE} --loc=${SAMLOC} OK, we're good to go ! Ran 13, 14, 100 unlimited in development 1 OK locations 0 fixed locations 12 files undeclared 13 / 1969 1 OK locations 1 fixed locations 12 files undeclared 14 / 1969 2 OK locations 0 fixed locations 12 files undeclared 14 / 1969 2 OK locations 20 fixed locations 78 files undeclared 100 / 1969 22 OK locations 419 fixed locations 1528 files undeclared 1969 / 1969 441 OK locations 0 fixed locations 1528 files undeclared 1969 / 1969 ============================================================================= 2007 07 24 ########## # DCACHE # ########## srm server is down, same as sunday, SRMClientV2 : srmLs, contacting service httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : put: try # 0 failed with error SRMClientV2 : ; nested exception is: java.net.SocketTimeoutException: Read timed out Reported to dcache-admin ticket 101455 at 13:33 Note that fndca2a also crashed this morning sometime, and was replaced with a new system. Wonder if this is somehow related to the srm server problems ? Should not be, as this is just a monitoring system. mindata@minos26 crontab -r minfarm@fnpcsrv1 mv NOCAT.ok NOCAT on 15:30 berg is looking into this 16:31 srmtest working again on fnpcsrv1 16:33 podstvkv reports that the srm server has been restarted. no reason yet, Vladimir is just back from vacation Restored crontab and NOCAT ########### # CLUSTER # ########### Following up on upgrade plans, with timl, > being used? > clisp > f2c > fort77 > maxima > maxima-exec_clisp > maxima-xmaxima > mgdiff MIN > for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'stat /usr/bin/maxima | grep "Access: 2"' ; done minos01 Tue Jul 24 11:36:36 CDT 2007 Access: 2007-07-24 06:07:54.000000000 -0500 minos02 Tue Jul 24 11:36:38 CDT 2007 Access: 2006-04-07 10:03:34.000000000 -0500 minos03 Tue Jul 24 11:36:40 CDT 2007 Access: 2002-10-29 16:21:24.000000000 -0600 clistp : similar, plus minos25 Tue Jul 24 11:39:51 CDT 2007 Access: 2007-07-23 11:15:56.000000000 -0500 MIN > for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'stat /usr/bin/f2c | grep "Access: 2"' ; done minos01 Tue Jul 24 11:41:08 CDT 2007 Access: 2007-07-24 06:07:54.000000000 -0500 minos02 Tue Jul 24 11:41:10 CDT 2007 Access: 2006-04-07 10:03:34.000000000 -0500 minos03 Tue Jul 24 11:41:11 CDT 2007 Access: 2000-07-24 15:51:38.000000000 -0500 most like this ... minos25 Tue Jul 24 11:41:56 CDT 2007 Access: 2007-07-23 11:16:51.000000000 -0500 Conclusion : none of these are being used. ############### # SAMRELOCATE # ############### clone samrelocate from saddmc Usage : samrelocate Action: does a sam locate of each file in the given directory, and corrects the location as necessary. ./samrelocate -v -n mcin_data/near/carrot_06/L010185 ############# # CHECKLIST # ############# o near and far dcs missing since Saturday. o Since July 4, OOPS - no tape location in F00038324_0005.sam.py cd GDAT/fardet_data/2007-07 Files look normal on the surface, lacking tape location Set them aside, should clear in the 9:06 predator cycle. 
MINOS26 > mv F00038324_0005.sam.py F00038324_0005.sam.py.bad MINOS26 > mv F00038324_0005.log F00038324_0005.log.bad The next iteration shows : F00038324_0005.mdaq.root Tue Jul 24 14:13:15 UTC 2007 OOPS - run_dbu is stuck for 600, killing it F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 0 S 1060 10483 10472 0 85 0 - 538 wait4 ? 00:00:00 run_dbu F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 0 S 1060 10499 10483 0 85 0 - 28664 schedu ? 00:00:02 dbu kill 10499 Tue Jul 24 14:23:25 UTC 2007 /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/run_dbu: line 128: 10499 Segmentation fault dbu -bq ${HOME}/minos/scripts/dbu_sampy.C ${FILE} >>${logname} 2>&1 F00038324_0005.sam.py was not generated - check log for error F00038324_0005.log then STARTING Tue Jul 24 16:18:35 UTC 2007 Treating 741 files Scanning 3 files F00038324_0005.mdaq.root Tue Jul 24 16:18:55 UTC 2007 Looks OK for now ! ============================================================================= 2007 07 23 ####### # SAM # ####### Got a slew of farm processing requests from scavan 2 cedar_phy_srsafitter 6 month FD cosmic data 3 cedar_phy_srsafitterbx113 near/daikon_00/L010185N 4 cedar_phy_srsafitter near/daikon_00/L010185N_bfldx113 5 cedar_phy_srsafitterbx113 near/daikon_00/L010185N_bfldx113 6 cedar_phy_srsafitter near data 3 months 2005 spill data 7 cedar_phy_srsafitterbx113 near data 3 months 2005 spill data ./pnfsdirs near cedar_phy_srsafitterbx113 daikon_00 L010185N write ./pnfsdirs near cedar_phy_srsafitter daikon_00 L010185N write Preparing for cedar_phy_srsafitter export SAM_ORACLE_CONNECT="samdbs/" REL=cedar.phy.srsafitter setup sam -q dev samadmin add application family --appFamily=reco --appName=loon --appVersion=${REL} setup sam -q int setup sam -q prd < the following is pending creation of the directories > REL=cedar_phy_srsafitter ./reloc -d -s dev cedar_phy_srsafitter # debug test ./reloc -s dev cedar_phy_safitter ./reloc -s int cedar_phy_safitter ./reloc -s prd cedar_phy_safitter and doing the same for REL=cedar.phy.srsafitterbx113 REL=cedar_phy_srsafitterbx113 ####### # AFS # ####### Requested three new data volumes for NONAP /afs/fnal.gov/files/data/minos/d263 /afs/fnal.gov/files/data/minos/d264 /afs/fnal.gov/files/data/minos/d265 ACL's like system:administrators rlidwka system:anyuser rl minos rl minos:admin rlidwka minos:nonap rlidwka ########## # DCACHE # ########## Tried fermigrid/volatile : SRV1> srmls --debug=true srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos Storage Resource Manager (SRM) CP Client version 1.25 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory SRM Configuration: debug=true ... surl[0]=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos Mon Jul 23 08:21:41 CDT 2007: In SRMClient ExpectedName: host Mon Jul 23 08:21:41 CDT 2007: SRMClient(https,srm/managerv2,true) SRMClientV2 : user credentials are: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 SRMClientV2 : connecting to srm at httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : srmLs, contacting service httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : put: try # 0 failed with error ########## # DCACHE # ########## Need to make new grid proxy having the production role, using voms-proxy-init. 
09:10 Stopped mindata@minos26 crontab, and set NOCAT on minfarm@fnpcsrv1 Reviewing notes on proxy-init in HOWTO.srm cd /export/stage/minfarm/.grid voms-proxy-init --help voms-proxy-init \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-voms.proxy \ -valid 8760:0 SRV1> voms-proxy-init \ > -cert kreymer-doe.pem \ > -key kreymer-doekey.pem \ > -out kreymer-voms.proxy \ > -valid 8760:0 Cannot find file or dir: $prefix/etc/vomses Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Enter GRID pass phrase: Creating proxy ................................................. Done Warning: your certificate and proxy will expire Tue Apr 15 11:22:43 2008 which is within the requested lifetime of the proxy OK, now let's try to find some use for fermilab:/fermilab/minos/Role=Production No help from voms-proxy-init, try google voms-proxy-init role http://www.atlasgrid.bnl.gov/GUMS/Presentations/vo-privilege.ppt This gives a clue . Better yet, here is an overall guide : http://www.fnal.gov/docs/products/voprivilege/documents/transition-to-privilege.html Also, trying to set prefix to pick up a vomses file, based on 'locate vomses' SRV1> locate vomses /usr/local/vdt-1.6.1/monitoring/vomses /usr/local/vdt-1.6.1/glite/etc/vomses /opt/glite/etc/vomses prefix=/usr/local/vdt-1.6.1/glite voms-proxy-init \ -voms fermilab:/fermilab/minos/Role=Production \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-voms.proxy \ -valid 8760:0 SRV1> voms-proxy-init -voms fermilab:/fermilab/minos/Role=Production -cert kreymer-doe.pem -key kreymer-doekey.pem -out kreymer-voms.proxy -valid 8760:0 Cannot find file or dir: $prefix/etc/vomses Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Enter GRID pass phrase: Creating temporary proxy ........................................................... Done Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Done Warning: fermigrid2.fnal.gov:15001: validity shortened to 86400 seconds! Creating proxy ........................................... Done Warning: your certificate and proxy will expire Tue Apr 15 11:22:43 2008 which is within the requested lifetime of the proxy SRV1> voms-proxy-info -file kreymer-voms.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : kreymer-voms.proxy timeleft : 6409:43:17 SRV1> voms-proxy-info -all -file kreymer-voms.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. 
Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : kreymer-voms.proxy timeleft : 6408:53:46 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov attribute : /fermilab/minos/Role=Production/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 23:53:29 Great, but this extension expires in a day, and we still have no access to DCache via SRM. 2007 08 29 Note that the Fermi servers ignore the extension expiration, Usage is production is approved by Yocum and Chadwick 11:28 - srm servers were running in duplicate, restarted. ./srmtest works OK now on both minfarm@fnpcsrv1 and mindata@minos26 ============================================================================= 2007 07 22 sunday ############## # DCACHE DAQ # ############## Timur has removed the 9a write pool, we should be OK for data archiving /home/minos/bin/archiver_krb.py [minos@daqdcp bin]$ mv archiver_krb.py archiver_krb.20051103.py ; cp archiver_krb.20051103.py archiver_krb.py Needed to get some kind of useable editor there, lacking x-11 [minos@minos-gateway ~/.ssh]$ scp /usr/bin/pico daqdcp:/home/minos/bin/pico [minos@daqdcp minos]$ cd bin [minos@daqdcp bin]$ pwd /home/minos/bin [minos@daqdcp bin]$ pico archiver_krb.py # It got there successfully # Check file size for confirmation # Wait for ten minutes to make sure that the size info is # known to pnfs # time.sleep(600) # reduced time to 6 seconds, writes to PNFS are now daily 2007 07 22 kreymer time.sleep(6) Started far archiver, MINOS26 > ls -ltr /pnfs/minos/fardet_data/2007-07 | tail -rw-r--r-- 1 buckley e875 18113194 Jul 20 13:01 F00038519_0000.mdaq.root -rw-r--r-- 1 buckley e875 58471373 Jul 20 14:02 F00038519_0001.mdaq.root -rw-r--r-- 1 buckley e875 17846369 Jul 20 15:02 F00038519_0002.mdaq.root -rw-r--r-- 1 buckley e875 18032433 Jul 20 16:03 F00038519_0003.mdaq.root -rw-r--r-- 1 buckley e875 58357421 Jul 20 17:04 F00038519_0004.mdaq.root -rw-r--r-- 1 buckley e875 17871600 Jul 20 18:05 F00038519_0005.mdaq.root -rw-r--r-- 1 buckley e875 17892471 Jul 22 15:29 F00038519_0006.mdaq.root -rw-r--r-- 1 buckley e875 58425513 Jul 22 15:30 F00038519_0007.mdaq.root -rw-r--r-- 1 buckley e875 17976883 Jul 22 15:31 F00038519_0008.mdaq.root -rw-r--r-- 1 buckley e875 0 Jul 22 15:31 F00038519_0009.mdaq.root The non-delayed writes seem to be OK Do the same to near : [minos@daqdcp-nd bin]$ mv archiver_krb.py archiver_krb.20051103.py ; cp archiver_krb.20051103.py archiver_krb.py [minos@minos-gateway-nd ~]$ scp /usr/bin/pico daqdcp-nd:/home/minos/bin/pico MINOS26 > ls -ltr /pnfs/minos/neardet_data/2007-07 | tail -rw-r--r-- 1 buckley e875 86724783 Jul 21 05:16 N00012636_0011.mdaq.root -rw-r--r-- 1 buckley e875 86971937 Jul 21 06:17 N00012636_0012.mdaq.root -rw-r--r-- 1 buckley e875 86965023 Jul 21 07:18 N00012636_0013.mdaq.root -rw-r--r-- 1 buckley e875 87279328 Jul 21 08:14 N00012636_0014.mdaq.root -rw-r--r-- 1 buckley e875 87075793 Jul 21 09:15 N00012636_0015.mdaq.root -rw-r--r-- 1 buckley e875 87341372 Jul 22 15:38 N00012636_0016.mdaq.root -rw-r--r-- 1 buckley e875 87419819 Jul 22 15:38 N00012636_0017.mdaq.root -rw-r--r-- 1 
buckley e875 87476859 Jul 22 15:39 N00012636_0018.mdaq.root -rw-r--r-- 1 buckley e875 87978090 Jul 22 15:39 N00012636_0019.mdaq.root -rw-r--r-- 1 buckley e875 87600671 Jul 22 15:39 N00012636_0020.mdaq.root The backlog is clearing quickly. Started with 47 far, 30 near Near cleared at 15:49 Far cleared at 16:02 ########## # DCACHE # ########## Ticket 101327 by rubin Farm srmcp failed starting 15:08, also other srm's from fnpcsrv1 and minos26. ============================================================================= 2007 07 21 ############## # DCACHE DAQ # ############## Far archiver got stuck after : QOL I Fri 20-07-2007 19:02:48 archiver 17667 198.124.213.171 1 104566 run 38519 Processing file F00038519_0006.mdaq.root QOL I Fri 20-07-2007 19:02:48 archiver 17667 198.124.213.171 1 104567 run 38519 Getting credentials QOL I Fri 20-07-2007 19:02:48 archiver 17667 198.124.213.171 1 104568 run 38519 Got credentials QOL I Fri 20-07-2007 19:02:48 archiver 17667 198.124.213.171 1 104569 run 38519 Trying ftp connect to disk cache QOL I Fri 20-07-2007 19:02:49 archiver 17667 198.124.213.171 1 104570 run 38519 Ftp connect succeeded No further archiver messages PID is OK, process is present. Near archiver got stuck after 2007-07-21 09:13:18 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/neardet_data/2007-07/N00012636_0015.mdaq.root QOL I Sat 21-07-2007 10:14:29 archiver 7887 131.225.192.132 1 266411 run 12636 Processing file N00012636_0016.mdaq.root QOL I Sat 21-07-2007 10:14:29 archiver 7887 131.225.192.132 1 266412 run 12636 Getting credentials QOL I Sat 21-07-2007 10:14:32 archiver 7887 131.225.192.132 1 266413 run 12636 Got credentials QOL I Sat 21-07-2007 10:14:32 archiver 7887 131.225.192.132 1 266414 run 12636 Trying ftp connect to disk cache QOL I Sat 21-07-2007 10:14:32 archiver 7887 131.225.192.132 1 266415 run 12636 Ftp connect succeeded Stopped both archivers till the DCache problem is corrected. ########## # DCACHE # ########## Submitted ticket The Minos Far Detector data archiver got stuck around Fri 20-07-2007 19:02:49 The Minos Near Detector data archiver got stuck around Sat 21-07-2007 10:14:32 after connecting to the ftp server. I have halted the near and far detector archivers. Messages from http://fndca3a.fnal.gov/cgi-bin/dcache_files.py look like 425 Cannot open port: java.lang.Exception: Pool error: Pool is disabled We had a similar problem with the 9a pools in the writePool group Friday. I'm guessing that in this case, w-stkendca9a-3 may be at fault. As we are only taking cosmic ray data, I have not called the call-center to have anyone paged after hours. 
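A cron watchdog could catch this failure mode sooner. A minimal sketch, assuming a
30 minute idle threshold and an archiver log path on the DAQ node (both the path and
the mail target are guesses, not taken from the DAQ configuration) :

# sketch - complain if the archiver log has gone quiet for 30 minutes
ALOG=/var/log/daq/archiver.log          # assumed location of the archiver log
IDLE=$(( `date +%s` - `stat -c %Y ${ALOG}` ))
[ ${IDLE} -gt 1800 ] && \
  echo "archiver log idle for ${IDLE} seconds" \
  | mail -s "archiver stuck ?" minos-data@fnal.gov   # list assumed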
============================================================================= 2007 07 20 ######## # FARM # ######## Searching for missing mc output files reported by arms, FILES=" f21011047_0000_L010185N_D00.sntp.cedar.root f21011048_0000_L010185N_D00.sntp.cedar.root f21011064_0000_L010185N_D00.sntp.cedar.root f21011067_0000_L010185N_D00.sntp.cedar.root f21011073_0000_L010185N_D00.sntp.cedar.root f21011077_0000_L010185N_D00.sntp.cedar.root f21011078_0000_L010185N_D00.sntp.cedar.root f21011100_0000_L010185N_D00.sntp.cedar.root " for FILE in ${FILES} ; do grep ${FILE} /afs/fnal.gov/files/home/room1/kreymer/minos/CFL/CFL ; done minos reco_mc_far_cedar VOC177 0000_000000000_0000327 CDMS117296018000000 49339753 2011788958 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/104/f21011047_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000328 CDMS117296018800000 48749561 2076869019 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/104/f21011048_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000355 CDMS117301801500001 52392083 2911298779 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/106/f21011064_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000356 CDMS117301802200000 49670542 3764044698 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/106/f21011067_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000357 CDMS117301863600000 53344993 3465332460 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/107/f21011073_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000401 CDMS117303152100000 52629644 1210362470 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/107/f21011077_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000402 CDMS117303153100001 54786893 2012291602 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/107/f21011078_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VO4049 0000_000000000_0000086 CDMS117321258900000 62377891 4045578954 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/110/f21011100_0000_L010185N_D00.sntp.cedar.root ######## # FARM # ######## Per rubin request, moved three bad mcin files out of the way cd /pnfs/minos/mcin_data/far/daikon_02/CosmicMu mv 106/f20011068_0004_CosmicMu_D02.reroot.root /pnfs/minos/BAD/ mv 107/f20011072_0003_CosmicMu_D02.reroot.root /pnfs/minos/BAD/ mv 108/f20011088_0005_CosmicMu_D02.reroot.root /pnfs/minos/BAD/ ########## # DCACHE # ########## rhatcher reports errors writing via dccp kerberized since 21:43 yesterday, Command failed! Server error message for [1]: "Pool is disabled" (errno 104). Failed open file in the dCache. Can't open destination file : "Pool is disabled" System error: Input/output error Similar problems for farm cand writing In service status page, w-stkendca10a-4 120 msec rest are 44 to 85 msec I also see three Cells offline, GFTP-stkendca15a GFTP-stkendca7a GFTP-stkendca8a This may be unrelated, as they would not affect Robert's kerberized dccp. 9:50 - 10-4 ping down to 77 msec, Looking for the working nodes, with dcs() { ./dc_stat /pnfs/minos/${1} | grep w-stkendca ; } dcs stage/kordosky/n11012016_0007_L010185N_D00-n11012017_0002_L010185N_D00.tar Around 09:45 thing seem to have recovered. Timur removed the '9a' pools from the configuration. 
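A possible extension of the dcs() helper above, to flag any file whose pool list
includes the troublesome 9a pools. A sketch only; it assumes dc_stat prints one
w-stkendca pool per matching line, as in the examples above :

dcsbad() {   # sketch - report files with a copy on a 9a write pool
  for FILE in "$@" ; do
    ./dc_stat /pnfs/minos/${FILE} | grep w-stkendca | grep -q 9a \
      && echo "ON 9A POOL ${FILE}"
  done
}
dcsbad stage/kordosky/n11012016_0007_L010185N_D00-n11012017_0002_L010185N_D00.tar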
=============================================================================
2007 07 19
#######
# DAQ #
#######
10:15 stopped archivers, due to DCache downtime
11:10 started near archiver, file moved OK
11:14 started far archiver, file moved OK
Copies stalled till just after 18:00. Not sure why.
##########
# DCACHE #
##########
Enstore/dcache downtime seems to have started around 07:00, per ftplog
I saw no pnfs interruption in pnfslog.
DCache servers seem to have come back online from 10:13 to 10:26
FTP access is back.
dcap access kerberized and unsecured are back.
10:50 srmls fails :
$ srmls ${SPATH2}
SRMClientV2 : put: try # 0 failed with error
SRMClientV2 : ; nested exception is: java.net.ConnectException: Connection refused
SRMClientV2 : put: try again
ftplog saw one failure at 11:00, OK at 11:10
11:20 - OOPS, get email from kschu that maintenance continues.
Will leave DAQ archivers logging for now, as it should recover cleanly,
and seems to be running properly at present, clearing the backlog.
email bounced from minos-data due to signature attachment.
Re-enabled attachments for the present.
19:58 -
kreymer@minos26 crontab crontab.dat
mindata@minos26 crontab crontab.dat
minfarm@fnpcsrv1 mv NOCAT NOCAT.ok
http://fndca3a.fnal.gov/cgi-bin/dcache_files.py shows a new category,
globus-mapping:(1334.5111)
this seems to include
  read .mdaq.root from farm
  read reroot.root from farm
  write .cand from farm
Raw data is showing up in PNFS since about 18:00, don't know why,
no trace in the ftp log above
logs indicate archiver restart around 18:00 near and far.
PID is wrong in near, will correct :
20:18 echo 7887 >> /var/lock/daq/archiver.pid
20:19 daqmon is happy
########
# GRID #
########
10:28 timm :
The condor upgrade to condor 6.8.5 on the head nodes of the GP Grid cluster is complete.
=============================================================================
2007 07 18
########
# ROOT #
########
From MSD minutes by nwest :
> From Tigran:
> ReadBuffers() with vector read is implemented.
> Today we have released new dcache version 1.7.0-39 with this
> functionality.
> I have tried with root version 5.14, 5.15 and 5.16. It's amazing! On
> some
> applications I got up ti 12 times performance increase!
##########
# DCACHE #
##########
Preparing for all-day shutdown tomorrow
predator
MINOS26 > echo 'crontab -r' | at 03:30
mcimport
M26 > echo 'crontab -r' | at 03:30
job 21 at 2007-05-24 03:30
corral
SRV1> echo 'mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT' \
 | at 03:30
#######
# DBU #
#######
Testing R1.15 vs R1.22 timing on older files, per rhatcher query
( want to remove R1.15 )
MINOS26 > cd ${HOME}/minos/test
MINOS26 > TIER=mdaq
MINOS26 > setup_minos -r R1.22
MINOS26 > IFILE=F00034242_0013
MINOS26 > DATADIR=fardet_data/2006-03
MINOS26 > ../scripts/run_dbu dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/${DATADIR}/${IFILE}.${TIER}.root
Ran for 15 minutes, produced output eventually.
Watchdog would have killed this.
Log file was present up to this point when stalled :
...
Snarls 1996954 (19122455) NonSnarls 119669 (108808) [MISMATCH]
TermCode 1 Errors 0 TimeFrames 59834 Dropped 0 Consistency 0x1
MINOS26 > setup_minos R1.15
MINOS26 > time ../scripts/run_dbu dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/${DATADIR}/${IFILE}.${TIER}.root
real    12m49.637s
user    10m28.330s
sys     0m8.220s
=============================================================================
2007 07 17
#########
# EMAIL #
#########
For minos-data and minos-sam-users,
was Sizelim= 409600
set Sizelim= 10000
    Attachments= no
##########
# XROOTD #
##########
gmieg adjusted minimum retention from 10 hours to 3 minutes.
Space is purged when disk usage hits 80%, purging down to 60%.
This is visible in ganglia.
Users of /local/scratch07 ( du -sm )
 4550 avva
    1 brebel
  340 bspeak
  185 daikon
 9143 deb4
 2305 dharma
 1545 ebarnes
 3804 giurgiu
76556 gmieg
    1 hartnell
 2280 hjkang
  349 howcroft
 8677 jdejong
    5 kreymer
    1 kschu
   85 li
  938 loiacono
  348 mdier
  530 mskim
11130 niki
  211 panos
11231 petyt
  136 pjl
    1 rhatcher
  263 rustem
  851 tagg
 4129 thosieck
   80 yumiceva
Strange... MRTG reports up to 6 MBytes/second sustained from 04:00 to 16:30
Ganglia concurs.
But the CPU load on minos07 spikes to 14 from just after 12:00, to 16:00
Rustem admits to running 20 clients at once,
only about 7 of which seem to be visible to LSF.
=============================================================================
2007 07 16
##########
# XROOTD #
##########
Rustem reports hundreds of files not readable with xrootd, such as
stout_near_2005_05:Error in : open attempt failed on root://minos07.fnal.gov//stage/N00007751_0003.spill.sntp.cedar_phy.0.root
...
stout_near_2007_03:Error in : open attempt failed on root://minos07.fnal.gov//stage/N00011998_0000.spill.sntp.cedar_phy.0.root
copied from email to maint/xrootdbad.txt
gmieg replied that the problem was a minimum 10 hour retention policy
on the xrootd cache. He has changed this to 3 minutes minimum.
#######
# LSF #   minos cluster batch
#######
Per EAG Ops report, note existence of xlsbatch, an X-11 batch queue viewer.
##########
# SADDMC #
##########
Wow, rediscovered that all of carrot_06 and carrot_08 mcin and mcout
had been declared to production back a year ago, 2006 06 15/16
The files have since been renamed to subdirectories,
file locations need to be adjusted.
=============================================================================
2007 07 13
#######
# LSF #   minos cluster batch
#######
Ticket 98153
Nodes minos14 through 18 started working properly Friday afternoon.
Tested with bsub -q minos "echo `date` `hostnam` >> ${HOME}/lsf/lsf.log ; sleep 120 ; hostname" ############ # MCIMPORT # ############ New files to arrive from sjc, n100BRRRR_SSSS_CosmicMu_D03.reroot.root MINOS26 > ./pnfsdirs near cedar_phy daikon_03 CosmicMu write Fri Jul 13 08:28:40 CDT 2007 STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_03/CosmicMu OK - created /pnfs/minos/mcin_data/near/daikon_03/CosmicMu FAMSET mcin_near_daikon_03 FAMILY mcin_near OOPS - need file family mcin_near_daikon_03 OK - setting family to mcin_near_daikon_03 OUTPUT /pnfs/minos/mcout_data/cedar_phy/near/daikon_03/CosmicMu OK - created /pnfs/minos/mcout_data/cedar_phy/near/daikon_03/CosmicMu OK - created /pnfs/minos/mcout_data/cedar_phy/near/daikon_03/CosmicMu/cand_data OK - created /pnfs/minos/mcout_data/cedar_phy/near/daikon_03/CosmicMu/mrnt_data OK - created /pnfs/minos/mcout_data/cedar_phy/near/daikon_03/CosmicMu/sntp_data FAMSET mcout_cedar_phy_near_daikon_03_cand FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_03_cand OK - setting family to mcout_cedar_phy_near_daikon_03_cand FAMSET mcout_cedar_phy_near_daikon_03_mrnt FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_03_mrnt OK - setting family to mcout_cedar_phy_near_daikon_03_mrnt FAMSET mcout_cedar_phy_near_daikon_03_sntp FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_03_sntp OK - setting family to mcout_cedar_phy_near_daikon_03_sntp ######## # FARM # ######## The previous bad pass of /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/sntp_data/310 were not removed from PNFS, roundup is tripping over them. As rubin on fnpcsrv1, rm /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/sntp_data/310/*.root ============================================================================= 2007 07 12 ######## # GRID # ######## Ganglia is back for FermiGrid ( not CLUBS, Farm ) http://fermigrid2.fnal.gov:801/ganglia/?m=load_one&r=day&s=descending&c=FermiGrid&h=&sh=1&hc=4 ####### # SAM # ####### Trying again later, post 2007 06 08, when predator is idle. ./genpy -l " -r R1.15 " fardet_data/2006-03 HOSTNA=`hostname -s | cut -c 1-5` HOSTNU=`hostname -s | cut -c 6-` LOGPAT=/local/scratch${HOSTNU}/kreymer/log DET=fardet_data MONTH=2006-03 ./sadd ${DET}/${MONTH} declare 2>&1 | tee -a ${LOGPAT}/samadd/${DET}/${MONTH}.log fardet_data/2006-03 STARTED Thu Jul 12 13:50:27 2007 Treating 1414 files OK - declared F00034242_0013.mdaq.root ... 
OK - declared F00034635_0000.mdaq.root Needed to add 16 files STARTED Thu Jul 12 13:50:27 2007 FINISHED Thu Jul 12 13:51:08 2007 ########## # SADDMC # development declares are working now ########## setup sam -q dev MID=mcin_data/far/daikon_00/L100200N ./saddmc -m declare daikon_00 ${MID}/101 MODE declare Processing mcin_data STARTED Thu Jul 12 13:37:04 2007 Declaring to SAM dev daikon_00 declare 999999 Scanning /pnfs/minos/mcin_data/far/daikon_00/L100200N ['101'] Needed /pnfs/minos/mcin_data/far/daikon_00/L100200N/101 Treating 3 files in /pnfs/minos/mcin_data/far/daikon_00/L100200N/101 OK - declared f21411010_0000_L100200N_D00.reroot.root /pnfs/minos/mcin_data/far/daikon_00/L100200N/101(voc328.181) OK - declared f21311010_0000_L100200N_D00.reroot.root /pnfs/minos/mcin_data/far/daikon_00/L100200N/101(voc328.189) OK - declared f21011010_0000_L100200N_D00.reroot.root /pnfs/minos/mcin_data/far/daikon_00/L100200N/101(voc328.194) Needed 3 files, Rate was 1.614 STARTED Thu Jul 12 13:37:04 2007 FINISHED Thu Jul 12 13:37:07 2007 MODE declare Processing mcin_data STARTED Thu Jul 12 13:38:20 2007 Declaring to SAM dev daikon_00 declare 999999 Scanning /pnfs/minos/mcin_data/far/daikon_00/L100200N ['100'] Needed /pnfs/minos/mcin_data/far/daikon_00/L100200N/100 Treating 27 files in /pnfs/minos/mcin_data/far/daikon_00/L100200N/100 ... Needed 27 files, Rate was 2.528 STARTED Thu Jul 12 13:38:20 2007 FINISHED Thu Jul 12 13:38:32 2007 ####### # SAM # ####### Test adding RAL file locations, in development MINOS26 > setup sam -q dev MINOS26 > IFIL=F00028812_0000 MINOS26 > IFILE=${IFIL}.mdaq.root MINOS26 > sam locate $IFILE ['/pnfs/minos/fardet_data/2005-01,1898@vo4919'] MINOS26 > sam add location --file=${IFILE} --loc='ral:/some/where/over/the/rain' Location with name 'ral:/some/where/over/the/rain' not found. MINOS26 > samadmin add pnfs tape location --fullpath='ral:/some/where/over/the/rain' New locationId = 3812 MINOS26 > sam add location --file=${IFILE} --loc='ral:/some/where/over/the/rain' MINOS26 > sam locate $IFILE ['/pnfs/minos/fardet_data/2005-01,1898@vo4919' 'ral:/some/where/over/the/rain,unknown volume'] MINOS26 > sam erase file location --file=${IFILE} --loc='ral:/some/where/over/the/rain' MINOS26 > sam locate $IFILE ['/pnfs/minos/fardet_data/2005-01,1898@vo4919'] ####### # SAM # ####### Tested startup of a minos station in development. server_list.txt configuration is same as production, except name of dbserver is station_dev station started, stager did not. STATION=minos # prd ./sam_test_py ${STATION} Ran successfully We seem not to need a stager. ============================================================================= 2007 07 11 ######### # MYSQL # ######### akumar refreshed development from production this killed the minos-test-station station, not in production ######### # BATCH # ######### minos queue is set up, feeding minos14-18 plus the old minos19-24 bsub -q minos "sleep 60 ; hostname" bsub -q minos "sleep 120 ; hostname" 337545 ... 337564 Try more, and a longer delay for N in 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 ; do bsub -q minos "sleep 300 ; hostname" ; done 337565 - 337594 ############ # MCIMPORT # ############ mcimport.20070711 - expanded DUP message slightly, to reference mcimport.log and DUP and to send full DUP list in email Also, needed to use xargs to grep dup in indexes, due to number of kordosky indexes cp -a AFSS/mcimport.20070711 . 
ln -sf mcimport.20070711 mcimport ######## # FARM # ######## Writing recent R1_24spill MC to PNFS This is a mix of carrot and daikon, just a few files. They go to mcout_data/R1_24spill/far/carrot/L010185/sntp_data mcout_data/R1_24spill/far/daikon_02/CosmicMu/sntp_data/100 I need to write the carrots manually Will do them both, as there are only 5 CosmicMu_D02 files. mkdir /pnfs/minos/mcout_data/R1_24spill/far/daikon_02/CosmicMu/sntp_data/100 chmod 775 /pnfs/minos/mcout_data/R1_24spill/far/daikon_02/CosmicMu/sntp_data/100 setup dcap # kerberized DCPOR=24736 cd /grid/data/minos/mcfarcat # first the carrots FILES=`ls f22*` printf "${FILES}\n" RSPA=minos/mcout_data/R1_24spill/far/carrot/L010185/sntp_data DOUT=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log for FILE in ${FILES} ; do DFIL=${DOUT}/${FILE} PFIL=${POUT}/${FILE} if [ ! -r ${PFIL} ] ; then echo "NEED" ${FILE} dccp ${FILE} ${DFIL} dccp -P ${DFIL} fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log # then the daikons FILES=`ls f20*spill*` printf "${FILES}\n" RSPA=minos/mcout_data/R1_24spill/far/daikon_02/CosmicMu/sntp_data/100 DOUT=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log for FILE in ${FILES} ; do DFIL=${DOUT}/${FILE} PFIL=${POUT}/${FILE} if [ ! -r ${PFIL} ] ; then echo "NEED" ${FILE} dccp ${FILE} ${DFIL} dccp -P ${DFIL} fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log Wed Jul 11 09:11:14 CDT 2007 Will need to wait a few hours, then purge like printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spillpurge.log for FILE in ${FILES} ; do PFIL=${POUT}/${FILE} if [ -r "${PFIL}" ] ; then PINFO=`(cd ${POUT} ; cat ".(use)(4)(${FILE})" | tr '\n' '\t')` ECRC=`printf "${PINFO}" | cut -f 11` if [ -n "${ECRC}" ] ; then LCRC=`ecrc ${FILE} | tr -s ' ' | cut -f 2 -d ' '` echo " ${FILE}" ${LCRC} ${ECRC} [ ${LCRC} = ${ECRC} ] && echo rm ${FILE} && rm ${FILE} fi ; fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spillpurge.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spillpurge.log FILES=`ls f22*` RSPA=minos/mcout_data/R1_24spill/far/carrot/L010185/sntp_data DOUT=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} Wed Jul 11 13:23:12 CDT 2007 ######## # FARM # ######## Discussed many PEND issues with Howie, after today's batch meeting. Action items for me : o Remove /grid/data/minos/farcat/*safitter* - DONE - these are obsolete o Force cedar files pre-dating the 2007 concatenation - DONE Far < 37162 Near < 11449 o Force files where we HAVE the missing subruns ( last digit in PEND message ) ########### # ROUNDUP # ########### roundup.20070710 Using ROUNTMP/ROOTRELS to get list of release using a given root cp AFSS/roundup.20070710 . ln -sf roundup.20070710 roundup # wasroundup.20070707 ######## # FARM # ######## Pre-concatenation cedar : N E A R ./roundup -S -s N00008 -r cedar near N00008433_0000.spill.mrnt.cedar.0.root /pnfs/minos/reco_near/cedar/mrnt_data/2005-08(vo2139.100) INSTANCE Location with name '/pnfs/minos/reco_near/cedar/mrnt_data/2005-08' not found. 
./samtapeloc /pnfs/minos/reco_near/cedar/mrnt_data dev /pnfs/minos/reco_near/cedar/mrnt_data /pnfs/minos/reco_near/cedar/mrnt_data/2005-11 /pnfs/minos/reco_near/cedar/mrnt_data/2005-08 /pnfs/minos/reco_near/cedar/mrnt_data/2005-10 /pnfs/minos/reco_near/cedar/mrnt_data/2005-09 ./samtapeloc /pnfs/minos/reco_near/cedar/mrnt_data int ./samtapeloc /pnfs/minos/reco_near/cedar/mrnt_data prd ./roundup -m 2005-08 -r cedar near IFILE=N00008433_0000.spill.mrnt.cedar.0.root SAMLOC="/pnfs/minos/reco_near/cedar/mrnt_data/2005-08(vo2139.100)" sam add location --file=${IFILE} --loc=${SAMLOC} ./roundup -m 2005-08 -r cedar near OK, picked up this tray mrnt from April 13. F A R ./roundup -S -s F0001 -r cedar far ./roundup -S -s F0002 -r cedar far ./roundup -S -s "F00031\|F00034\|F00035\|F00036" -r cedar far ######## # FARM # ######## HAVE cleanup for cedar ./roundup -f 1 -s "F00038283_\|F00038304_" -r cedar far ./roundup -f 1 -s "N00012425_\|N00012428_\|N00012434_" -r cedar near ============================================================================= 2007 07 10 ####### # SAM # ####### Tested using herber 8.2.0 dbserver, OK MINOS26 > export SAM_NAMING_SERVICE_IOR=IOR:000000000000002a49444c3a6f6f632e636f6d2f436f734e616d696e672f4f424e616d696e67436f6e746578743a312e3000000000000001000000000000002c000100000000001064306f7261312e666e616c2e676f7600232800000000000c4e616d655365727669636500 MINOS26 > export SAM_DB_SERVER_NAME=herber.dev:SAMDbServer MINOS26 > sam list files --dim="${SAMDIM}" | sort | head F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root We need to upgrade. ######## # FARM # ######## New processing of R1_24spill announced, for future use in SAM, need to create release r1.24spill ######## # FARM # ######## Prior to rerunning brev, find . -name \*brev.root ./READ/n13023101_0000_L010185N_D00.sntp.cedar_phy_brev.root ./READ/n13023101_0007_L010185N_D00.sntp.cedar_phy_brev.root ./READ/n13023102_0000_L010185N_D00.sntp.cedar_phy_brev.root ./READ/n13023103_0000_L010185N_D00.sntp.cedar_phy_brev.root ./READ/n13023104_0000_L010185N_D00.sntp.cedar_phy_brev.root ./READ/n13023104_0005_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023101_0000_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023101_0007_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023102_0000_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023103_0000_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023104_0000_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023104_0005_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023101_0000_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023101_0007_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023102_0000_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023103_0000_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023104_0000_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023104_0005_L010185N_D00.sntp.cedar_phy_brev.root find . -name \*brev.root -exec rm {} \; ####### # SAM # IT 2843 ####### In the Minos production database, when selecting files using CHILD_BY_NAME, extra file names are returned. 
For example, $ FILE=F00030612_0005.spill.bntp.cedar_phy.0.root $ SAMDIM=" DATA_TIER raw-far \ and FILE_NAME like F0003061% \ and CHILD_BY_NAME ${FILE} \ " $ sam list files --dim="${SAMDIM}" --nosummary | sort F00030610_0000.mdaq.root F00030611_0000.mdaq.root F00030612_0000.mdaq.root F00030612_0001.mdaq.root F00030612_0002.mdaq.root F00030612_0003.mdaq.root F00030612_0004.mdaq.root F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root F00030613_0000.mdaq.root F00030613_0001.mdaq.root ... This list should be F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root ########## # SADDMC # ########## Verified that mc.release is already being set, sam get metadata --file=f21311005_0000_L100200N_D00.reroot.root ... 'mc' : CaseInsensitiveDictionary({ 'beam' : 'L100200', 'flavor' : '3', 'release' : 'daikon_00', 'split' : '1', 'volume' : '1', ######### # VAULT # ######### Moving the vault copies to LTO3 cd /pnfs/minos/vault enstore pnfs --tags | grep '^.(tag)(library)' .(tag)(library) = CD-9940B enstore pnfs --library CD-LTO3 [Errno 13] Permission denied: '/pnfs/minos/vault/.(tag)(library)' Sent email to enstore-admin asking them to set the libraries : cd /pnfs/minos/vault enstore pnfs --library CD-LTO3 ============================================================================= 2007 07 09 ####### # AFS # ####### Received a new backed up volume /afs/fnal.gov/files/data/minos/release_data thanks to kevinh ####### # SAM # ####### export SAM_ORACLE_CONNECT for UNIV in dev int prd ; do setup sam -q ${UNIV} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.brev done New applicationFamilyId = 261 New applicationFamilyId = 70 New applicationFamilyId = 162 for UNIV in dev int prd ; do ./samtapeloc /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00 ${UNIV} done export -n SAM_ORACLE_CONNECT ########### # ROUNDUP # ########### roundup.20070707 Added cedar_phy_brev Changed IOR to short form corbaname::minos-sam01.fnal.gov:9010 cp AFSS/roundup.20070707 . 
ln -sf roundup.20070707 roundup ############ # FNPCSRV1 # ############ Sorted .k5login into .k5login.20070709 Copied original to .k5login.20070116 Copied sorted version to .k5login, tested from kreymer account ============================================================================= 2007 07 08 Sat ########### # MONTHLY # ########### Updated IOR string to friendlier form Was export SAM_NAMING_SERVICE_IOR=IOR:010000002a00000049444c3a6f6f632e636f6d2f436f734e616d696e672f4f424e616d696e67436f6e746578743a312e30000000010000000000000030000000010100b7150000006d696e6f732d73616d30312e666e616c2e676f7600d132230c0000004e616d655365727669636500 Now export SAM_NAMING_SERVICE_IOR="corbaname::minos-sam01.fnal.gov:9010" ############ # SADDRECO # ############ Adding mc reco support on fnpcsrv1 ln -s /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/saddreco.20070707 srmc Normal form is saddreco ${DET} ${REL} ${MON} For MC, ####### # AFS # ####### Requested a new backed up volume /afs/fnal.gov/files/data/minos/release_data ACL's like system:administrators rlidwka system:anyuser rl minos rl minos:admin rlidwka Saturday, July 7, 2007 at 09:50:31 ######## # ROOT # ######## Sue Kasahara benchmarks : root v5.10/00 with file->UseCache(): 3.9 sec cpu, 6 sec real root v5.10/00 without file->UseCache(): 3.5 sec cpu, 163 sec real root v5.12/00 with (or w/o) file->UseCache(): 5.3 sec cpu, 162 sec real root HEAD with tree->SetCacheSize(50000000);: 4.1 sec cpu, 29 sec real ============================================================================= 2007 07 07 ######### # ADMIN # ######### reviewed status of requisition disk CD103354 195772 our $30,500 Project CD Operations Task MINOS-DISK-EQ 50.01.06.04.05.02 PO 576183 25 July cpu CD103358 195773 our $26,000 total $1,432,600 oddone signed 7/10 http://www-bss2.fnal.gov/reqquery/ Activity code 4676 https://appora.fnal.gov/pls/cert/miscomp.miser.bli_html?report_only=y&fiscal_year=2007&focus_this_identifier=4676 MINOS-COMP-OP 50.01.06.04.01.01 MINOS-CPU-EQ 50.01.06.04.06.02 MINOS-DISK-EQ 50.01.06.04.05.02 MINOS-OFF-SOFT-OP 50.01.06.04.02.01 MINOS-SCI-RESRCH-OP 50.01.06.04.06.03 TAPES-MINOS-OP 50.01.10.13 REX-DEPT-INFRA-OP 50.03.05.05.01 ######## # ROOT # ######## pcanal has patched the head and v5.14 branches of root to restore dcache speed. t->SetCacheSize(50000000); ########## # SADDMC # ########## Lost some modifications to HOWTO.saddmc, perhaps saddmc.20070608 due to hangup of my desktop system. Strange, I don't see any .bck or ~ backups of these nedited files. PLAN - declare a small slug of recent mcin with saddmc declare the corresponding mcout, probably using an adapted saddreco (rather than saddmc). then strip the saddreco functions out of saddmc Identify a working data set. 
5905 /pnfs/minos/mcin_data/far/adamo 31714 /pnfs/minos/mcin_data/far/avocado 0 /pnfs/minos/mcin_data/far/beet symlink 297068 /pnfs/minos/mcin_data/far/carrot 1 /pnfs/minos/mcin_data/far/carrot_06_ral empty 456687 /pnfs/minos/mcin_data/far/daikon_00 1 /pnfs/minos/mcin_data/far/daikon_01 empty 757234 /pnfs/minos/mcin_data/far/daikon_02 35475 /pnfs/minos/mcin_data/far/v17 for DIR in adamo avocado carrot daikon_00 daikon_02 v17 ; do printf "${DIR} " ; find /pnfs/minos/mcin_data/far/${DIR} -type f | wc -l done adamo 20 avocado 119 carrot 1900 daikon_00 1846 daikon_02 2543 v17 120 Consistent with CFL summary 7156 1734 mcin_data/far/ 1001 53 mcin_data/fmock/ 34950 8872 mcin_data/near/ 0 0 mcin_data/near_pHE/ 37 1 mcin_data/near_pME/ 246 60 mcin_data/nmock/ 43451 10746 mcin_data Now look for a little subset of far MINOS26 > du -sm /pnfs/minos/mcin_data/far/daikon_00/* 419480 /pnfs/minos/mcin_data/far/daikon_00/L010185N 12893 /pnfs/minos/mcin_data/far/daikon_00/L100200N 24315 /pnfs/minos/mcin_data/far/daikon_00/L250200N MINOS26 > du -sm /pnfs/minos/mcin_data/far/daikon_00/L100200N/* 11601 /pnfs/minos/mcin_data/far/daikon_00/L100200N/100 1292 /pnfs/minos/mcin_data/far/daikon_00/L100200N/101 Looks good, only 30 files, two run directories and 3 in the 101 directory, excellent for verbose testing MID=/pnfs/minos/mcin_data/far/daikon_00/L100200N oops, need to remove /p/m prefix : MID=mcin_data/far/daikon_00/L100200N ./saddmc -m verify -v daikon_00 ${MID}/101 Metadata look OK to my eyeball. ./saddmc -m declare daikon_00 ${MID}/101 ./saddmc -m declare daikon_00 ${MID}/100 OOPS , declare error in f21411010_0000_L100200N_D00.reroot.root CLASS SamException.SamExceptions.DbSQLException INSTANCE INTERNAL ERROR IN DbOracleMessage.convertUniqueConstraint This happened back on 2006 06 16 Try this in integration setp sam -q int ./saddmc -m declare daikon_00 ${MID}/101 OOPS , addLocation error in f21411010_0000_L100200N_D00.reroot.root CLASS SamException.SamExceptions.DataStorageLocationNotFound INSTANCE Location with name '/pnfs/minos/mcin_data/far/daikon_00/L100200N/101' not found. Added locations as per below, ./saddmc -m addloc daikon_00 ${MID}/101 ./saddmc -m declare daikon_00 ${MID}/101 Needed 27 files, Rate was 2.468 STARTED Fri Jul 6 20:56:28 2007 FINISHED Fri Jul 6 20:56:40 2007 SAMDIM=" DATA_TIER mc-far \ " sam list files --dim="${SAMDIM}" --nosummary sam list files --dim="${SAMDIM}" --count 471 files match the given constraints. SAMDIM=" DATA_TIER mc-far \ and VERSION daikon_00 \ " sam list files --dim="${SAMDIM}" --count 30 files match the given constraints. 
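A quick cross-check that each of the newly declared files also has a location,
along the lines of the sam commands used above (a sketch only; SAMDIM as just defined,
and it assumes sam locate prints the pnfs path when a location exists) :

# sketch - list any declared daikon_00 mc-far file lacking a PNFS location
SAMDIM=" DATA_TIER mc-far and VERSION daikon_00 "
for FILE in `sam list files --dim="${SAMDIM}" --nosummary` ; do
  sam locate ${FILE} | grep -q pnfs || echo "NO LOCATION ${FILE}"
done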
########## # SADDMC # ########## Need mcin storage locations Created samtapeloc ./samtapeloc /pnfs/minos/mcin_data/far/daikon_00/L100200N/101 int ./samtapeloc /pnfs/minos/mcin_data/far/daikon_00 dev ./samtapeloc /pnfs/minos/mcin_data/far/daikon_00 int ./samtapeloc /pnfs/minos/mcin_data/far/daikon_00 prd ########## # SADDMC # ########## Vegetables needed to be registered MINOS26 > sam get registered application families | grep gminos ApplicationFamily('simulation', 'gminos', 'carrot_06') ApplicationFamily('simulation', 'gminos', 'carrot') ApplicationFamily('simulation', 'gminos', 'carrot_08') VEG=daikon_00 for UNI in dev int prd ; do setup sam -q ${UNI} export SAM_ORACLE_CONNECT samadmin add application family --appFamily=simulation --appName=gminos --appVersion=${VEG} export -n SAM_ORACLE_CONNECT done New applicationFamilyId = 257 New applicationFamilyId = 66 New applicationFamilyId = 142 VEG=daikon_02 New applicationFamilyId = 258 New applicationFamilyId = 67 New applicationFamilyId = 143 VEG=avocado New applicationFamilyId = 259 New applicationFamilyId = 68 New applicationFamilyId = 144 VEG=beet New applicationFamilyId = 260 New applicationFamilyId = 69 New applicationFamilyId = 145 ########## # MYSQL # ########## Continuing defragmentation tests Strange, write rates to /data are a few MBytes/second, Mysql> time dd if=/dev/zero of=/data/archive/CP/CALADCTOPE.MYD bs=2025107940 count=1 1+0 records in 1+0 records out real 4m49.504s user 0m0.000s sys 0m18.970s [root@minos-mysql1 root]# ${FRAG} /data/archive/CP/CALADCTOPE.MYD /data/archive/CP/CALADCTOPE.MYD: 212969 extents found, perfection would be 16 extents 1+0 records in 1+0 records out real 0m35.672s user 0m0.010s sys 0m10.740s [root@minos-mysql1 root]# ${FRAG} /var/tmp/CALADCTOPE.MYD /var/tmp/CALADCTOPE.MYD: 361 extents found, perfection would be 16 extents Mysql> time cp -a CALADCTOPE.MYD /var/tmp/CALADCTOPE.MYD real 5m20.852s user 0m0.230s sys 0m20.220s [root@minos-mysql1 root]# ${FRAG} /var/tmp/CALADCTOPE.MYD /var/tmp/CALADCTOPE.MYD: 546 extents found, perfection would be 17 extents PLAN - can probably clear 23 GB of space from retired .MYI files, which could be restored with RESTORE TABLE [root@minos-mysql1 root]# ${FRAG} /data/database/retired/PULSERDRIFT.MYD /data/database/retired/PULSERDRIFT.MYD: 1954494 extents found, perfection would be 560 extents [root@minos-mysql1 root]# ${FRAG} /data/database/retired/PULSERDRIFT.MYI /data/database/retired/PULSERDRIFT.MYI: 641899 extents found, perfection would be 212 extents ######## # GRID # ######## # # # warning - these paths are incorrect # # # # # # see 2007 07 06 ### AFSPROD=/afs/fnal.gov/files/code/e875/general/products/db/ ### GRIPROD=/grid/app/minos/products time rsync -r \ ${AFSPROD} ${GRIPROD} \ --perms --times --links --size-only --delete -v OK, this moved upd to products. Not setting -a rlptgo because do not want group, owner propogated AFSPROD=/afs/fnal.gov/files/code/e875/general/products/ GRIPROD=/grid/app/minos/products mkdir ${GRIPROD} wrote 2206955532 bytes read 728980 bytes 3822830.32 bytes/sec total size is 2203906163 speedup is 1.00 real 9m37.029s user 0m22.770s sys 0m56.100s MINOS26 > time rsync -r ${AFSPROD} ${GRIPROD} --perms --times --links --size-only --delete -v building file list ... 
done prd/encp/v3_6d/Linux-2-4-2-3-2/volume_import/ prd/encp/v3_6d/Linux-2-6/volume_import/ wrote 1010993 bytes read 20 bytes 18550.70 bytes/sec total size is 2203906163 speedup is 2179.90 real 0m53.946s user 0m1.650s sys 0m5.710s rm -r /grid/data/minos/products ######## # FARM # ######## ./pnfsdirs near cedar_phy_brev daikon_00 L010185N STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_00/L010185N FAMSET mcin_near_daikon_00 FAMILY mcin_near_daikon OOPS - need file family mcin_near_daikon_00 OK - setting family to mcin_near_daikon_00 OUTPUT /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N FAMSET mcout_cedar_phy_brev_near_daikon_00_cand FAMILY reco_mc_near_cedar_phy_brev OOPS - need file family mcout_cedar_phy_brev_near_daikon_00_cand OK - setting family to mcout_cedar_phy_brev_near_daikon_00_cand FAMSET mcout_cedar_phy_brev_near_daikon_00_mrnt FAMILY reco_mc_near_cedar_phy_brev OOPS - need file family mcout_cedar_phy_brev_near_daikon_00_mrnt OK - setting family to mcout_cedar_phy_brev_near_daikon_00_mrnt FAMSET mcout_cedar_phy_brev_near_daikon_00_sntp FAMILY reco_mc_near_cedar_phy_brev OOPS - need file family mcout_cedar_phy_brev_near_daikon_00_sntp OK - setting family to mcout_cedar_phy_brev_near_daikon_00_sntp MINOS26 > date Fri Jul 6 17:22:39 CDT 2007 ============================================================================= 2007 07 05 ########## # MYSQL # ########## Planning to break up the lock/copy of the largest tables, to reduce database locking time of crl Had to rsync the BINLOGS up front, to gain working space ( they had gotten up to 10 GB ) wrote 9984606905 bytes read 3236 bytes 9969655.66 bytes/sec total size is 11211535336 speedup is 1.12 real 16m40.950s user 1m33.970s sys 0m37.150s Started archives at about 13:00 Thu Jul 5 13:03:51 CDT 2007 Copying DCS_HV ran at about 14-19 MB/sec through 6 GBytes, then slowed down to 4, then back up to 13, 10 Sizes at 100 sec intervals : Mysql> du -sm /data/archive/COPY/20070705/offline/DCS_HV.MYD 1009 2903 4427 6214 6560 7059 8417 9459 10499 11811 13044 18m6.745s Thu Jul 5 13:22:17 CDT 2007 Watching PULSERGAIN.MYD while true ; do du -sm /data/archive/COPY/20070705/offline/PULSERGAIN.MYD | cut -f 1 ; sleep 100 ; done 971 2012 2243 2475 3703 5135 6270 7038 7422 8464 8879 10021 11270 12584 13159 13366 real 25m20.905s user 0m1.980s sys 2m1.070s Thu Jul 5 13:51:54 CDT 2007 while true ; do du -sm /data/archive/COPY/20070705/offline | cut -f 1 ; sleep 100 ; done 27070 27545 27962 28366 28746 29035 29256 BEAMMONSPILL.MYD 29486 29824 30397 31122 BEAMMONSPILL.MYD done 32030 32365 32678 32988 33929 34911 35297 35752 36599 37619 37961 38326 38617 39000 39447 39884 40247 40611 40952 41376 41929 42369 42821 real 60m27.476s N.B. some of the table sizes : 361 /data/database/offline/DCS_ENV_NEAR.MYD 399 /data/database/offline/CALPMTDRIFT.MYD 609 /data/database/offline/DCS_MAG_FAR.MYD 632 /data/database/offline/DBUVACHIPSPARS_OLD.MYD 762 /data/database/offline/DBUVACHIPSPARS.MYD 911 /data/database/offline/DBUVACHIPPEDS_OLD.MYD 1040 /data/database/offline/DBUVACHIPPEDS.MYD 1260 /data/database/offline/SPILLSERVERMON.MYD 1525 /data/database/offline/CALADCTOPE.MYD 1781 /data/database/offline/PULSERTIMEDRIFT.MYD 2225 /data/database/offline/CALADCTOPES.MYD 3097 /data/database/offline/BEAMMONSPILL.MYD 13366 /data/database/offline/PULSERGAIN.MYD 13464 /data/database/offline/DCS_HV.MYD Mysql> df -h /data Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 211G 7.0G 97% /data 44G . 
[minsoft@minos-mysql1 offline]$ time gzip -1 *.MYD real 76m41.851s user 38m45.640s sys 3m42.710s [minsoft@minos-mysql1 offline]$ du -sh . 17G . [kreymer@minos-sam03 MYSQL]$ du -sm /home/kreymer/MYSQL/* 8965 /home/kreymer/MYSQL/20060418 9352 /home/kreymer/MYSQL/20060421 14021 /home/kreymer/MYSQL/20070207 14186 /home/kreymer/MYSQL/20070305 14790 /home/kreymer/MYSQL/20070403 16492 /home/kreymer/MYSQL/20070705 45792 /home/kreymer/MYSQL/BINLOG ########## # MYSQL # ########## http://www.mysql.com/doc/en/InnoDB_File_Defragmenting.html ALTER TABLE CpuHistory TYPE=INNODB; ALTER TABLE CpuHistory TYPE=MYISAM; Have a look at SHOW TABLE STATUS Found caltest database table CALADCTOPE size 2 GB, last updated 2007-03-24 As root, FRAG=/home/minsoft/maint/filefrag ${FRAG} /data/database/caltest/CALADCTOPE.MYD 396575 extents found, perfection would be 17 extents ${FRAG} /data/database/offline/DCS_HV.MYD /data/database/offline/DCS_HV.MYD: 276037 extents found, perfection would be 106 extents ########## # DCACHE # ########## old Ticket 100349 1 of 12 writePools are offline MINOS26 > ./poolstat verb Thu Jul 5 09:57:17 CDT 2007 DOWN TOT POOL GROUP 14 ExpDbWritePools 6 FermigridVolPools 12 KTeVReadPools 15 MinosPrdReadPools 8 RawDataWritePools 9 readPools 1/ 12 writePools w-stkendca11a-4 10:00 sent update to ticket, no activity under ticket 15:55 Solution: berg@fnal.gov sent this solution: All of the write pools are currently online. We will keep watching them and restart as needed. The developers are aware of the problem and are testing a patch. In the meantime, they have increased the size of java heap memory for the pools that have a history of this problem, though it may take additional restarts for the change to take effect. ######## # GRID # ######## # # # warning - these paths are incorrect # # # # # # see 2007 07 06 AFSPROD=/afs/fnal.gov/files/code/e875/general/products GRIPROD=/grid/app/minos/products date time rsync -r \ ${AFSPROD} ${GRIPROD} \ --perms --times --size-only --delete -v wrote 2206935025 bytes read 728980 bytes 3829425.85 bytes/sec total size is 2203906163 speedup is 1.00 real 9m35.625s user 0m22.850s sys 0m58.770s MINOS26 > date Thu Jul 5 15:30:13 CDT 2007 MINOS26 > du -sm /grid/data/minos/products/ 3336 /grid/data/minos/products Try a second pass, for timing. MINOS26 > date Thu Jul 5 16:24:27 CDT 2007 MINOS26 > time rsync -r \ ${AFSPROD} ${GRIPROD} \ --perms --times --size-only --delete -v building file list ... done skipping non-regular file "products/db/.upsfiles/shutdown/ups_shutdown.csh" skipping non-regular file "products/db/.upsfiles/shutdown/ups_shutdown.sh" ... wrote 990490 bytes read 20 bytes 18174.50 bytes/sec total size is 2203906163 speedup is 2225.02 real 0m54.621s user 0m1.520s sys 0m5.010s Oops, the output directory is not what I wanted, change to GRIPROD=/grid/app/minos/products GRRRRRRRRRRRRRRRRR - The products are full of symlinks, especially the VDT stuff. This wreaks havoc with many utilities like rsync. 
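Given the symlink trouble, a sanity check on the copy might look like this
(a sketch, using the AFSPROD/GRIPROD paths set above; not run) :

# sketch - compare symlink counts in source and destination
for DIR in ${AFSPROD} ${GRIPROD} ; do
  printf "%s symlinks " ${DIR} ; find ${DIR} -type l | wc -l
done
# sketch - list any dangling links left in the copy
find ${GRIPROD} -type l ! -exec test -e {} \; -print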
============================================================================= 2007 07 04 HOLIDAY ============================================================================= 2007 07 03 ########## # DCACHE # ########## Ticket 100349 5 of 12 writePools are offline ########### # MONTHLY # ########### MINOS26 > aklog MINOS26 > tokens DATASETS 7/2 PREDATOR 7/2 SADDRECO 7/2 VAULT 7/2 ok MYSQL 7/5 did crl 7/3, will do rest later this week ########### # MONTHLY # ########### CFL update the web listing cd ${HOME}/minos/CFL $HOME/minos/scripts/cfl $HOME/minos/scripts/cflsum | tee cflsum.`date +%Y%m%d` ln -sf cflsum.`date +%Y%m%d` CFLSUM Updated datasets to write the pool group name, and to include 'q' Removed the following, as it does nothing without file activity ROUNDUP VMON=`date -d '27 days ago' +%Y-%m` ./roundup -m "${VMON}" -r cedar far ./roundup -m "${VMON}" -r cedar near inserted SADDRECO on fnpcsrv1 VMON=`date -d '27 days ago' +%Y-%m` REL=cedar PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=IOR:010000002a00000049444c3a6f6f632e636f6d2f436f734e616d696e672f4f424e616d696e67436f6e746578743a312e30000000010000000000000030000000010100b7150000006d696e6f732d73616d30312e666e616c2e676f7600d132230c0000004e616d655365727669636500 cd ~/scripts for DET in near far ; do ./saddreco ${DET} ${REL} ${VMON} declare \ 2>&1 | tee ${HOME}/ROUNTMP/LOG/${SAMMON}/declare_${DET}_${REL}.log done ########### # CLUSTER # ########### shepelak removed KDE components yesterday. No adverse affects so far. Did two gimp scans, all OK. ######## # FARM # ######## Following up on pending cedar_phy far runs Processing completed overnight for spill data. There are still a few all stream subruns missing : PEND - have 24/12 subruns for F00034632_*.all.sntp.cedar_phy.1.root 42 05/21 10:07 0 PEND - have 1/23 subruns for F00034635_*.all.sntp.cedar_phy.1.root 42 05/21 10:10 0 PEND - have 2/24 subruns for F00034647_*.all.sntp.cedar_phy.0.root 21 06/11 21:36 0 PEND - have 1/7 subruns for F00034675_*.all.sntp.cedar_phy.0.root 21 06/11 21:48 0 PEND - have 1/19 subruns for F00034700_*.all.sntp.cedar_phy.0.root 19 06/13 12:09 0 ########## # DC2AFS # ########## echo >> ../TRACE ./dc2afs -d far -r cedar_phy -s .bntp | tee -a ../TRACE 2>&1 echo >> ../TRACE 2007-02 63/ 63 recodata113 8552842 F00037709_0000.spill.bntp.cedar_phy.0.root 34973878 bytes in 1 seconds (34154.18 KB/sec) 2007-03 48/ 48 recodata113 8552842 F00037832_0000.spill.bntp.cedar_phy.0.root 34973878 bytes in 1 seconds (34154.18 KB/sec) ######## # FARM # ######## Following up on pending cedar mcfar runs SRV1> cat LOG/cedarmcfar.pend PEND - have 7/10 subruns for f20011007_*_CosmicLE_D02.sntp.cedar.root 3 06/29 14:39 0 PEND - have 9/10 subruns for f20011128_*_CosmicMu_D02.sntp.cedar.root 2 06/30 16:45 0 ######## # FARM # ######## Niki's subrun list, from email, put in /tmp/subs SRV1> scp kreymer@minos-93198.dhcp.fnal.gov:/tmp/subs /tmp/subs cd ~/lists LINES=`cat /tmp/subs` printf "${LINES}\n" | wc -l 87 cat /home/minfarm/lists/daq_lists/sup/*.sup >>/tmp/SUP cat /tmp/subs | while read LINE ; do printf "\n${LINE}\n" SRUN=`echo ${LINE} | cut -f 1 -d ' '` printf " BADRUNS `grep ${SRUN} ~/lists/bad_runs.cedar_phy`\n" printf " NOSPILL `grep ${SRUN} ~/lists/no_spill.cedar_phy`\n" printf " SUPPRES `grep ${SRUN} /tmp/SUP`\n" done Everything is in bad_runs, no spill, or suppressed, except F00033713_0017 NOT LAST SUBRUN SIZE NORMAL This was flagged as a bad run by the farms on 5/15 This subrun was rerun 
on 5/21, but no spill output files resulted. F00037351_* PHYSICS TEST processing was not requested F00037691_* !! missing run in pnfs completed today F00037706_* !! missing run in pnfs completed today As of about 14:00, F00033713_0017 finished on the farm, and was rounded up with ./roundup -s F00033713 -f 0 -r cedar_phy far And copied to AFS with echo >> ../TRACE ./dc2afs -d far -r cedar_phy -s .bntp | tee -a ../TRACE 2>&1 echo >> ../TRACE ============================================================================= 2007 07 02 ####### # SAM # ####### Per petyt request, here is a sample SAM query listing all the parents of a given file : FILE=F00030612_0005.spill.bntp.cedar_phy.0.root SAMDIM=" DATA_TIER raw-far \ and FULL_PATH like /pnfs/minos/fardet_data/2005-04 \ and FILE_NAME like F0003061% \ and CHILD_BY_NAME ${FILE} \ " sam list files --dim="${SAMDIM}" --nosummary | sort F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root F00030613_0000.mdaq.root F00030613_0001.mdaq.root ... sam get metadata --file=${FILE} \ | grep parents \ | tr "'" \\\n \ | grep root \ | sort ######## # MAIL # ######## for UUSER in alberto bishai djensen escobar kafka para wojcicki ; do finger ${UUSER}@fnal.fnal.gov | grep '@' ; done alberto@fnalu.fnal.gov \alberto@fsui02.fnal.gov bishai@fsui02.fnal.gov \bishai@fsui02.fnal.gov djensen@fsui02.fnal.gov \djensen@fsui02.fnal.gov escobar@fsui02.fnal.gov escobar@ifi.unicamp.br kafka@fnalu.fnal.gov #\kafka@tuhepf.phy.tufts.edu para@fsui02.fnal.gov \para@fsui02.fnal.gov , adpara@yahoo.com wojcicki@fnalu.fnal.gov SGWEG@SLAC.Stanford.EDU ############ # PURCHASE # ############ CD103354 PO 195772 approved CD103358 PO 195773 approved ######## # FARM # ######## Per berg, note 0 size cand file, -rw-r--r-- 1 1334 5111 0 Jun 30 04:58 f20011014_0009_CosmicMu_D02.cand.cedar.root Removed it. /pnfs/minos/mcout_data/cedar/far/daikon_02/CosmicMu/cand_data/101/f20011014_0009_CosmicMu_D02.cand.cedar.root ####### # CFL # ####### cflsum.20070702 sets MINOS_DATA to /afs/fnal.gov/files/data/minos ######## # FARM # ######## Following up on pending cedar_phy far runs PEND - have 3/8 subruns for F00030612_*.spill.bntp.cedar_phy.0.root 53 05/10 01:31 0 PEND - have 13/24 subruns for F00035724_*.spill.bntp.cedar_phy.0.root 54 05/09 04:14 0 PEND - have 23/24 subruns for F00037691_*.spill.bntp.cedar_phy.0.root 44 05/19 09:26 0 PEND - have 8/9 subruns for F00037706_*.spill.bntp.cedar_phy.0.root 44 05/19 09:02 0 flush F00030612, which spans the April 1 2005 startup cutoff ./roundup -n -f 1 -s F00030612 -r cedar_phy far Howie is rerunning the rest. At 14:36, F00037691 and F00037706 are ready. ./roundup -r cedar_phy far N.B. 
completed overnight ######## # FARM # ######## Following up on pending cedar mcfar runs SRUNS=' f20011007_0001 f20011007_0002 f20011007_0003 f20011007_0004 f20011030_0008 f20011048_0008 f20011048_0009 f20011056_0003 f20011061_0004 f20011066_0009 f20011080_0006 f20011080_0007 f20011080_0008 f20011080_0009 f20011086_0004 f20011086_0005 f20011086_0006 f20011095_0009 f20011098_0004 f20011098_0005 f20011098_0006 f20011098_0007 f20011099_0006 f20011099_0007 f20011099_0008 f20011099_0009 f20011100_0001 f20011104_0009 f20011105_0000 f20011105_0001 f20011121_0006 f20011128_0008 f20011132_0003 f20011132_0004 f20011132_0005 f20011132_0006 f20011143_0009 f20011144_0000 f20011144_0001 f20011145_0002 f20011145_0004 f20011145_0005 ' for SUB in ${SRUNS} ; do RUNG=${SUB:5:3} ; echo ${SUB} ${RUNG} ; ls -l /pnfs/minos/mcin_data/far/daikon_02/CosmicMu/${RUNG}/${SUB}* ; done for SUB in ${SRUNS} ; do ls -l /grid/data/minos/mcfarcat/${SUB}* ; done Note that f20011080 0,2 are missing in mcin. ######## # FARM # ######## Quota out on minfarm account /home/minfarm SRV1> du -sm * | sort -n ... 53 18801dump_table 122 22892dump_table 123 condor_submit 128 FNAL_00030851.dbm.gz 163 lists 215 condor_log 510 west 1166 test 1726 grid 2209 17271dump_table 3333 scavantest grid files were created 20 Dec, 194 days ago, about 25K files. Last access was Mar 24 101 days ago, find . -mtime -195 | less find . -atime -101 | less cd cp -vax grid /export/stage/minfarm/homegrid diff -r grid /export/stage/minfarm/homegrid SRV1> du -sk ~/grid ../homegrid 1766720 /home/minfarm/grid 1177144 ../homegrid mv grid gridx ln -s /export/stage/minfarm/homegrid grid ln: creating symbolic link `grid' to `/export/stage/minfarm/homegrid': Disk quota exceeded rm gridx/pacman-latest.tar.gz rm -r gridx ============================================================================= 2007 06 29 ######## # GRID # ######## Looking at quota, via group, quota -s -v -g numi We have about 30 GB of app space, should be plenty for products. Need to cp -ax /afs/fnal.gov/files/code/e875/general/products \ /grid/app/minos/products cp -ax /afs/fnal.gov/files/code/e875/general/minossoft \ /grid/app/minos/minossoft MINOS26 > fs listquota /afs/fnal.gov/files/code/e875/general/products Volume Name Quota Used %Used Partition c.e875.d1 8000000 2125440 27% 66% MINOS26 > fs listquota /afs/fnal.gov/files/code/e875/general/minossoft Volume Name Quota Used %Used Partition code.e875.general 8000000 6312687 79% 66% MINOS26 > du -sm /afs/fnal.gov/files/code/e875/general/products 2076 /afs/fnal.gov/files/code/e875/general/products ########### # ROUNDUP # ########### New files showing up in cedar near and far, need to round them up. I am REALLY getting tired of new things happening on Fridays, requring manual intervention through the weekend. /pnfs/minos/mcout_data/cedar/far/daikon_02/CosmicLE /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_bfldx113 ./pnfsdirs far cedar daikon_02 CosmicLE ./pnfsdirs far cedar daikon_02 CosmicLE write ./pnfsdirs near cedar daikon_00 L010185N_bfldx113 ./pnfsdirs near cedar daikon_00 L010185N_bfldx113 write Now need to activate cedar mcnear and mcfar in corral. Did this at 18:25 after the current roundup finished. ########## # DCACHE # ########## Timur Perelumtov is going to CERN to meet with Rene Brun and Patrick Fuhrman next week. They will work on our Root/DCache I/O problem. 
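For the corral activation of cedar mcnear and mcfar noted above, the added lines
would presumably follow the pattern of the existing cedar_phy mcfar entry
(a sketch; the actual corral edit is not recorded here) :

# sketch - corral entries for cedar mcnear and mcfar
[ ${BADS} -le 1 ] && ${HOME}/scripts/roundup -c -r cedar mcnear || (( BADS++ ))
[ ${BADS} -le 1 ] && ${HOME}/scripts/roundup -c -r cedar mcfar  || (( BADS++ ))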
######## # DCAP # ######## ups copy dcap v2_38_f0512 -q unsecured -G "dcap v2_38_f0512 -q unsecured" upd install -j dcap v2_41_f0610 ups copy dcap v2_41_f0610 -q unsecured -G "dcap v2_41_f0610 -q unsecured" ####### # CFL # ####### Updated for daily running via cron on minos01 Silent curl No printout Create CFL.YYMM01 and cflsum.YYMM01 on first day of month Argument names working directory, testing in /var/tmp/kreymer Existing montly CFL.200* file headers show times from 04:42 to 09:45 Let's keep as far as possible from these times as possible, Added to crontab.minos01 15 19 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/cfl And corrected the afssum times, which had minutes/hours reversed 01 23 * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/afsfree quiet 05 23 * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/afssum quiet timing daily real 1m12.730s user 0m22.320s sys 0m19.460s timing with cflsum real 6m17.389s user 1m49.970s sys 1m16.750s real 7m59.211s user 1m46.460s sys 1m17.150s Putting this into production, cd /afs/fnal.gov/files/data/minos/log_data/ rm CFL ; cp CFL.20070608 CFL cds time ./cfl real 3m19.566s user 0m27.450s sys 0m29.700s ============================================================================= 2007 06 28 ########### # MINOS12 # ########### Ganglia monitoring Thu, 28 Jun 2007 13:19:37 -0500 Last heartbeat 21 days, 21:46:55 ago MRTG networking is close to flatline at 600 bits/second for about 21.5 days. Helpdesk ticket 100083 Forwarded to minos-admin, run2-sys ============================================================================= 2007 06 27 ######## # MAIL # ######## Sent email to the 7 Minos users receiving mail on fsui02/fnalu warning of the 1 Oct shutdown of this service. ######### # ADMIN # ######### Our req's, in CD as of 26 June CD103354 for the satabeast CD103358 for the nodes. Reference FAGAN,DAVID requisition CD101973 Lab 193587 PO 575035 1U Intel Dual Xeon Quad Core E5335 2Ghz Computer Server. Under http://www-css.fnal.gov/els/useful_links/ https://fncdug1.fnal.gov/miser/req-query.html http://www-bss2.fnal.gov/reqquery/ ============================================================================= 2007 06 26 ########## # DC2AFS # ########## Need to get the cedar_phy .bntp files in to AFS, for the box opening Friday. MINOS26 > du -sm /pnfs/minos/reco_far/cedar_phy/.bntp_data 52448 /pnfs/minos/reco_far/cedar_phy/.bntp_data MINOS26 > fs listquota /afs/fnal.gov/files/data/minos/d10/recodata112 Volume Name Quota Used %Used Partition nb.minos.d258 50000000 35683004 71% 45% MINOS26 > fs listquota /afs/fnal.gov/files/data/minos/d10/recodata113 Volume Name Quota Used %Used Partition nb.minos.d259 50000000 188 0% 45% for YEMO in `ls /pnfs/minos/reco_far/cedar_phy/.bntp_data` ; do ./stage -d -p 0 reco_far/cedar_phy/.bntp_data/${YEMO} ; done All on disk through 2006-09, then most are off disk through 2007-03. for YEMO in `ls /pnfs/minos/reco_far/cedar_phy/.bntp_data` ; do ./stage -w reco_far/cedar_phy/.bntp_data/${YEMO} ; done echo >> ../TRACE date >> ../TRACE ./dc2afs -d far -r cedar_phy -s .bntp | tee -a ../TRACE 2>&1 date >> ../TRACE echo >> ../TRACE STARTING Tue Jun 26 16:30:30 CDT 2007 FINISHED Tue Jun 26 18:05:52 CDT 2007 ####### # AFS # ####### Existing index file sizes based on cd $MINOS_DATA/d10/indexes du -sk * | sort -n ... 
44 BAD_mc_far.daikon_00.cedar.index 65 mc_far.carrot.cedar.index 82 mc_cosmic.bfld201.cedar.index 84 mc_far.R1.14.index 86 mc_far.daikon_00.cedar.index 94 mc_far.carrot.R1_18_2.index 99 2005-04_far.R1_18.index 117 mc_near.R1_18_2.index 501 mc_near.carrot_06.cedar.index 509 mc_near.daikon_00.cedar.index.save 530 mc_near.carrot_06.R1_18_2.index 592 mc_near.daikon_00.cedar.index and many monthly files for R1_18_2 and R1_18_4 du -sk 20*R1_18*.index | cut -f 1 > /tmp/sumin MINOS26 > cat /tmp/sumin | ~/minos/scripts/count Enter numbers to be added : Got 48 /tmp/FOO numbers 1285 ########## # SADDMC # ########## latest examples in this log, and HOWTO, are out of date . checked out some old test data from last year MINOS26 > sam locate n13011068_0000_L010200.reroot.root ['/pnfs/minos/mcin_data/near/carrot_06/L010200,861@vo8034'] MINOS26 > sam get metadata --file=n13011068_0000_L010200.reroot.root did successful verify in the new format : ./saddmc.20070608 -n 2 -m verify carrot_06 mcin_data/near/carrot_06/L010185 ########## # DCACHE # ########## Installed newer dcap upd install -j dcap v2_38_f0512 Could not install the newest one, because of a symlink in the UPD server Helpdesk ticket 099982 upd install -j dcap v2_41_f0610 ftp> pwd 257 "/products/dcap" is current directory. ftp> ls /ftp/products/dcap/v2_41_f0610/Linux+2.6/dcap_v2_41_f0610_Linux+2.6.ups.tar 200 PORT command successful. 150 Opening ASCII mode data connection for directory listing. -rw-rw---- 1 100 3531 3584 Oct 23 2006 /ftp/products/dcap/v2_41_f0610/Linux+2.6/dcap_v2_41_f0610_Linux+2.6.ups.tar ============================================================================= 2007 06 25 ####### # DAQ # ####### F00038278_0000.mdaq.root failed to transfer twice 12:52:36 - 13:27:51 tranferred one other file, then 13:42:56 - 14:14:23 - successful this time Two recent files are very large : ssh -l kreymer minos-gateway.minos-soudan.org ssh -l minos daqdcp cd /daqdata du -sm * | sort -n ... 156 F00038280_0000.mdaq_1.root 176 F00038260_0000.mdaq.root 725 F00038276_0000.mdaq.root 1319 F00038278_0000.mdaq.root 1816 F00038280_0000.mdaq.root Basicly, these large files generated while beam is down today just take a while to copy. ######## # MAIL # Ticket 099858 ######## /var/mail filled up this morning on fsui02 ( a.k.a. fnalu ) Moved niki's email to minos01 /var/spool/mail/niki ######## # MAIL # ######## MUSERS=`ypcat passwd | cut -f 1 -d :` MINOS01 > for MUSE in ${MUSERS} ; do finger ${MUSE}@fnal | grep '@' | grep -v imapserv ; done | grep fnalu alberto@fnalu.fnal.gov michael@fnalu.fnal.gov kafka@fnalu.fnal.gov wojcicki@fnalu.fnal.gov MINOS01 > for MUSE in ${MUSERS} ; do finger ${MUSE}@fnal | grep '@' | grep -v imapserv ; done | grep fsui para@fsui02.fnal.gov djensen@fsui02.fnal.gov escobar@fsui02.fnal.gov bishai@fsui02.fnal.gov for UUSER in alberto bishai djensen escobar kafka para wojcicki ; do du -sk /var/mail/${UUSER} ; done ============================================================================= 2007 06 23 Sat ########### # ROUNDUP # ########### Previous files in mcout_data/cedar_phy/far/daikon_02/CosmicMu and CosmicLE have been removed. These were previously rounded up -S on June 13 (WTW) and WRITE files purged on June 21 ( back home ) Concatenation was OK , but I did not do it due to missing subruns. Listed files in READ index : find READ -name \*Cosmic\* -exec ls -l {} \; | less Set them aside to allow fresh concatenation. mkdir DUP DUP/D02Cosmic FILES=`find . 
-name \*Cosmic\* -exec basename {} \;`
for FILE in ${FILES} ; do mv READ/${FILE} DUP/D02Cosmic/${FILE} ; done

Write them out, and put this into corral also

corral
[ ${BADS} -le 1 ] && ${HOME}/scripts/roundup -c -M -r cedar_phy mcfar || (( BADS++ )) # no SAM yet

ran ./roundup -M -r cedar_phy mcfar

=============================================================================
2007 06 22

#############
# MINOSORA3 #
#############

Memory problems continue, with new motherboard.
June 26 1 PM will switch to single size of dimms.

#########
# FNPPD #
#########

NOTE: please copy off any important files from fnppd. It is unknown
how much longer fnppd will stay on-line as of 6/20/2007.

FNPPD > uptime
9:03am up 1 day, 21:47, 1 user, load average: 0.03, 0.07, 0.01

du -sm /prj/e875
67344416        /prj/e875

Files all are owned by rhbob ( Bob Bernstein )

############
# MINOSCVS #
############

.admin - removed west ( no such Fermi principal )
.k5login added asousa brebel kasahara llhsu

=============================================================================
2007 06 21

###########
# ROUNDUP #
###########

Clearing older WRITE files

>>>> 156 from June 7
./roundup -w -M -r cedar_phy mcfar
Thu Jun 21 10:58:09 CDT 2007
PURGING WRITE files 156
Thu Jun 21 10:58:39 CDT 2007

>>>> 3 from June 8
./roundup -w -M -r cedar_phy_safitter far
Thu Jun 21 11:00:48 CDT 2007
PURGING WRITE files 3

>>>> 1 from Jun 11 06:47
2007 06 11 N00011669_0000.cosmic.sntp.cedar_phy.0.root has a mismatched checksum.
This is a holdover from the 2007 06 11 crash of fnpcsrv1, which I had
supposedly repaired before leaving for WTW the next day.

SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00011669_0000.cosmic.sntp.cedar_phy.0.root /pnfs/minos/reco_near/cedar_phy/sntp_data/2007-01
PURGE FARM N00011669_0000.cosmic.sntp.cedar_phy.0.root
Wed Jun 13 00:16:29 CDT 2007
/export/stage/minfarm/ROUNDUP/ECRC/N00011669_0000.cosmic.sntp.cedar_phy.0.root

Understandable, let's generate that manually

ROUNTMP=/export/stage/minfarm/ROUNDUP
GDW=/grid/data/minos/minfarm/WRITE
SFINI=N00011669_0000.cosmic.sntp.cedar_phy.0.root
ecrc ${GDW}/${SFINI} | cut -f 2 -d ' ' > ${ROUNTMP}/${CAT}ECRC/${SFINI}

That should do it.
Indeed the file cleared out with the Noon cycle of roundup.

##########
# DCACHE #
##########

Write pool files seem to have been flushed to tape, ticket 099603
Times are 05:47 through 06:29
Verified they have tape locations

cd ${IPATH}
for FILE in ${FILES} ; do cat ".(use)(4)(${FILE})" ; done
...
f20011035_0006_CosmicMu_D02.reroot.root 0000_000000000_0000680 Checking files in http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011090_0000_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011093_0009_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0004_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011091_0007_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011091_0009_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011030_0005_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011094_0002_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011094_0001_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011094_0003_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011031_0000_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0007_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011030_0000_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011030_0006_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0009_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0002_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011095_0002_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0006_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0008_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011093_0005_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011031_0002_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011031_0001_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011031_0004_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011091_0000_CosmicMu_D02.cand.cedar_phy.root Times are 05:51 through 06:44 for FLAP in ${FLAPS} ; do echo ${FLAP} IPATH=`dirname ${FLAP}` ; IFILE=`basename ${FLAP}` ( cd ${IPATH} ; cat ".(use)(4)(${IFILE})" | grep '0000_0000' ) done All are on tape. This was due to a tape being NOACCESS. Odd, I did not see such an indication when I looked at the list yesterday. ########## # DCACHE # ########## Per Liz, DCache read rates were helped in early 05 by calling UseCache Perhaps the default arguments need to be tuned. Perhaps something has broken in root. ######### # STAGE # ######### Lots of reco_far_cedar and reco_near_cedar files still being staged. 
/pnfs/minos/reco_far/cedar/sntp_data/2005-10/F00032814_0023.spill.sntp.cedar.0.root is in r-stkendca14a-6 That's a readPools pool. So our cedar ntuples are not going where intended MINOS26 > ( cd /pnfs/minos/reco_far/cedar/sntp_data/2005-10 ; enstore pnfs --tags ) .(tag)(library) = CD-9940B .(tag)(file_family) = reco_far_cedar_sntp VOLS=`./volumes reco_far_cedar` echo $VOLS VO4093 VO4094 VO7415 VO7907 VO8334 VO8363 VO9661 echo >> ../TRACE date >> ../TRACE for VOL in ${VOLS} ; do ./stage -w -s sntp_data ${VOL} done 2>&1 | tee -a ../TRACE STARTING Thu Jun 21 12:10:38 CDT 2007 FINISHED Fri Jun 22 04:49:15 CDT 2007 date >> ../TRACE echo >> ../TRACE ####### # CPU # ####### Looking for AMD vs Intel benchmarks. http://www.cpubenchmark.net/index.php looks great, but charts don't load ============================================================================= 2007 06 20 ############# # MINOSORA3 # ############# Motherboard replaced, to address memory problems ####### # CVS # ####### Removed stray accidental directory from minoscvs cd /cvs/minoscvs/rep1/minossoft/NCUtils/Extrapolation rmdir MCEvent.h This worked as desired. ############ # NOACCESS # ############ Why is a 9940 raw data tape NOACCESS ? All these files are on disk. VO5182 0.39GB (NOACCESS 0619-1525 full 0611-1920) 9940 minos.fardet_data.cpio_odc ########## # DCACHE # ########## Many files are pending writes in mcimport, say sjc /pnfs/minos/mcin_data/far/daikon_02/CosmicMu/103/ f20011033_0000_CosmicMu_D02.reroot.root f20011033_0001_CosmicMu_D02.reroot.root f20011033_0002_CosmicMu_D02.reroot.root f20011033_0003_CosmicMu_D02.reroot.root f20011033_0004_CosmicMu_D02.reroot.root f20011033_0005_CosmicMu_D02.reroot.root f20011033_0006_CosmicMu_D02.reroot.root f20011033_0007_CosmicMu_D02.reroot.root f20011033_0008_CosmicMu_D02.reroot.root f20011033_0009_CosmicMu_D02.reroot.root f20011034_0000_CosmicMu_D02.reroot.root f20011034_0001_CosmicMu_D02.reroot.root f20011034_0002_CosmicMu_D02.reroot.root f20011034_0003_CosmicMu_D02.reroot.root f20011034_0004_CosmicMu_D02.reroot.root f20011034_0005_CosmicMu_D02.reroot.root f20011034_0006_CosmicMu_D02.reroot.root f20011034_0007_CosmicMu_D02.reroot.root f20011034_0008_CosmicMu_D02.reroot.root f20011034_0009_CosmicMu_D02.reroot.root f20011035_0000_CosmicMu_D02.reroot.root f20011035_0001_CosmicMu_D02.reroot.root f20011035_0002_CosmicMu_D02.reroot.root f20011035_0003_CosmicMu_D02.reroot.root f20011035_0004_CosmicMu_D02.reroot.root f20011035_0005_CosmicMu_D02.reroot.root f20011035_0006_CosmicMu_D02.reroot.root ./dc_stat /pnfs/minos/mcin_data/far/daikon_02/CosmicMu/103/f20011033_0000_CosmicMu_D02.reroot.root ============================ PNFS status for /pnfs/minos/mcin_data/far/daikon_02/CosmicMu/103/f20011033_0000_CosmicMu_D02.reroot.root -rw-r--r-- 1 kreymer e875 364301179 Jun 18 20:19 f20011033_0000_CosmicMu_D02.reroot.root LEVEL 2 2,0,0,0.0,0.0 :c=1:45d30373;h=yes;l=364301179; r-stkendca13a-6 w-stkendca10a-6 LEVEL 4 ============================ Sent this as helpdesk ticket 17:55 099603 ============================================================================= 2007 06 19 ######### # STAGE # ######### Jeff Dejong is hitting near R1_18_2 ntuples pretty hard. Set the file family properly for sntp_data, then restore by volume, even though these will go to the general read pools presently. 
From CFL summary : FILES GBYTES PATH 28790 541 reco_near/R1_18_2/.*nt._data/ ( cd /pnfs/minos/reco_near/R1_18_2/sntp_data ; enstore pnfs --tags ) .(tag)(file_family) = reco_near_R1_18_2 ( cd /pnfs/minos/reco_near/R1_18_2/sntp_data ; enstore pnfs --file_family reco_far_R1_18_2_sntp ) ./volumes vols VOLS=`./volumes reco_near_R1_18_2` printf "${VOLS}\n" | wc -l 22 echo >> ../TRACE date >> ../TRACE for VOL in ${VOLS} ; do ./stage -w -s sntp_data ${VOL} done 2>&1 | tee -a ../TRACE date >> ../TRACE echo >> ../TRACE Wed Jun 20 00:23:59 CDT 2007 FINISHED Thu Jun 21 11:30:47 CDT 2007 ####### # WTW # ####### Notes from the meeting Nick West using LCG contained in EGEE ( EGEE sort of like OSG ) Looking at GANGA for user interface ( command or gui ) N.B. can Nick use SAM for local cache locations ? gmieg - rootd server being used... stuck raw data file reported/resolved ? ########## # DCACHE # ########## Checking Rustem's slow reading report for dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy/sntp_data/2005-05 Claim is 30 minutes to read 10K snarls, versus 10. Trying the biggest file in that month N00007861_0000.spill.sntp.cedar_phy.0.root '1.58GB dccp speed : MINOS26 > IFILE=N00007861_0000.spill.sntp.cedar_phy.0.root MINOS26 > IPATH=minos/reco_near/cedar_phy/sntp_data/2005-05 MINOS26 > DCPOR=24125 # unsecured MINOS26 > DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} MINOS26 > ( cd /pnfs/${IPATH} ; cat ".(use)(2)(${IFILE})" ) 2,0,0,0.0,0.0 :h=yes;c=1:818e364b;l=1699476400; w-stkendca11a-2 r-stkendca19a-6 r-stkendca11a-2 MINOS26 > cd /local/scratch??/`whoami` MINOS26 > time dccp ${DFILE} TEST.dat # do the copy 1699476400 bytes in 53 seconds (31314.06 KB/sec) real 0m53.368s user 0m0.110s sys 0m11.380s Looks good to me. MINOS26 > setup_minos -r R1.24.0 MINOS26 > time hadd mTEST.dat TEST.dat TEST.dat real 5m21.285s user 0m22.430s sys 0m34.530s MINOS26 > time hadd mdTEST.dat TEST.dat "${DFILE}" ... TEST.dat tree:NtpSt entries=1234567890 dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy/sntp_data/2005-05/N00007861_0000.spill.sntp.cedar_phy.0.root tree:NtpSt entries=1234567890 ... real 174m23.321s user 0m53.150s sys 1m0.770s this ran very fast through the local file, very slow from dcache. about 0.2 MBytes/second, not CPU limited on the client. Let's try something shorter, and more relevant, the old style ntuple concatenation with a couple of shorter files. 
IFILE=N00007731_0000.spill.sntp.cedar_phy.0.root DFILE1=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} time dccp ${DFILE1} TEST1.root 5342644 bytes in 0 seconds real 0m0.593s user 0m0.000s sys 0m0.040s 'fileSize' : SamSize('5.10MB'), 'lastEvent' : 18121L IFILE=N00007733_0000.spill.sntp.cedar_phy.0.root DFILE2=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} time dccp ${DFILE2} TEST2.root 1890986 bytes in 0 seconds real 0m0.924s user 0m0.000s sys 0m0.040s 'fileSize' : SamSize('1.80MB') 'lastEvent' : 99785L, time hadd TESThloc.root TEST1.root TEST2.root real 0m11.814s user 0m3.030s sys 0m0.350s time hadd TESThdca.root ${DFILE1} ${DFILE2} real 2m17.372s user 0m3.760s sys 0m0.870s MINOS26 > ln -s ~kreymer/minos/scripts/Merger.C Merger.C setup_minos -r R1.24.0 time loon -bq CATTLE/Merger.c TEST1.root TEST2.root real 0m14.316s user 0m11.530s sys 0m0.530s Try again with correct Merger.C, local files, -r R1.24.2 setup_minos -r R1.24.2 time loon -bq Merger.C TEST1.root TEST2.root real 0m12.306s user 0m10.970s sys 0m0.470s 7 MB/ 12.3 sec => .57 Mb/sec time loon -bq Merger.C ${DFILE1} ${DFILE2} real 1m51.101s user 0m10.780s sys 0m0.720s 7 MB/ 111 sec => 63 KB/sec ########## # SADDMC # ########## HOWTO.saddmc Need to match to other then L* when finding directories in mcin ########## # DCACHE # ########## RawDataWritePools write interval needs reset to 24 hours, based on recent file times it seems to be 4 hours now. Ticket 099493 Problem Description: Recently, it seems that pools in the FNDCA RawDataWritePools group have been writing to tape frequently, perhaps on a 4 hour timer. Please reset the timers to the desired 24 hours. For background: The timer for these pools was set to 24 hours early in 2006. Here is an extract from a 25 May email from kennedy : " The general write pools now require the first of any of these three conditions to be met before encp's run: 1) 4 calendar hours have passed 2) 25 GB in file family have accumulated 3) 100 files in file family have accumulated This is distinct from the raw data pools which wait for 24 hours. " ####### # AFS # ####### Requested two new data volumes d261 d262 ACL's like system:administrators rlidwka system:anyuser rl minos rl minos:admin rlidwka minos:nonap rlidwka Tuesday, June 19, 2007 at 12:16:13 Created minos:nonap group NEWGROUP=nonap MINOS26 > pts creategroup -name kreymer:${NEWGROUP} group kreymer:nonap has id -1941 for GUSER in buckley kreymer barr habig jdejong ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts setfields kreymer:${NEWGROUP} -access SOMar MINOS26 > pts membership kreymer:${NEWGROUP} Members of kreymer:nonap (id: -1941) are: buckley kreymer habig barr jdejong MINOS26 > pts examine kreymer:${NEWGROUP} Name: kreymer:nonap, id: -1941, owner: kreymer, creator: kreymer, membership: 5, flags: SOMar, group quota: 0. pts chown kreymer:${NEWGROUP} minos ============================================================================= 2007 06 18 ######## # FLUB # ######## Stuck nodes restarted, per Helpdesk ticket 099372 Out of memory in loon on flxb34, other 2 just stuck. Note that we have no ganglia monitoring. ============================================================================= 2007 06 16 Updated HOWTO.monitor beam_log minos26free_log ######## # FLUB # FNALU BATCH ######## jdejong reports stuck jobs #320407 and 320521 These are on flxb33 and flxb32. 
Cannot log into these

Network activity on flxb33 has been very low since Wed 2007 Jun 13 12:00
flxb32 has been very low since Wed 2007 Jun 13 18:00

Other nodes are affected, not just your jobs
flxb34 has been very low since Thu 2007 Jun 14 18:00

Getting node list
HOSTS=`bjobs -u all | tr -s ' ' | grep RUN | cut -f 6 -d ' ' | cut -f 1 -d . | sort -u`

The only stuck nodes are flxb32 flxb33 flxb34

Helpdesk ticket 099372

=============================================================================
2007 06 13

FARM quota is likely low in /grid/data, purging 57GB in WRITE

./roundup -w -r cedar_phy near
just in time, got to SAM phase before normal nightly run

ran ./roundup -w -M -r cedar_phy mockfar
down to 327 GB in /g/d/m now

Better now, pushing daikon_02 cedar_phy far to PNFS without concatenation,
too many bad runs for my taste.

=============================================================================
2007 06 12

DRIVING TO WEEK IN THE WOODS.

=============================================================================
2007 06 11

########
# FARM #
########

fnpcsrv1 crashed this morning in the middle of concatenating,
see LOG/2007-06/cedar_phynear.log

SUPPRESS N00011669_0024.cosmic.sntp.cedar_phy.0.root
OK adding N00011669_0000.cosmic.sntp.cedar_phy.0.root 24

hadd finished, and the mv to WRITE happened

SRV1> stat /grid/data/minos/minfarm/WRITE/N00011669_0000.cosmic.sntp.cedar_phy.0.root
  File: `/grid/data/minos/minfarm/WRITE/N00011669_0000.cosmic.sntp.cedar_phy.0.root'
  Size: 708547742       Blocks: 1383936    IO Block: 32768  regular file
Device: 1eh/30d Inode: -776408581  Links: 1
Access: (0644/-rw-r--r--)  Uid: (10871/ minfarm)   Gid: ( 5111/    numi)
Access: 2007-06-11 06:46:26.832000000 -0500
Modify: 2007-06-11 06:47:51.174000000 -0500
Change: 2007-06-11 06:51:29.530000000 -0500

The cleanup of GDM/nearcat happened
SRV1> ls -l /grid/data/minos/nearcat/N00011669*
ls: /grid/data/minos/nearcat/N00011669*: No such file or directory

The building of READ/ did not happen :
SRV1> find READ -name N00011669\*
READ/SAM/N00011669_0000.cosmic.sntp.cedar.0.root

That's the old cedar cosmic file, already declared to SAM.
Hacked that old file, changing cedar to cedar_phy :
We should be good to go !
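After a crash like this the question is always the same : for a given run,
what is left in nearcat, in WRITE, in READ/SAM, and in ECRC ?
A sketch of a one-shot status check, not an existing script ; it follows the
path conventions above, and assumes it is run from wherever the READ index
actually lives.

    RUN=N00011669
    GDM=/grid/data/minos
    ROUNTMP=/export/stage/minfarm/ROUNDUP
    echo "nearcat  :" ; ls ${GDM}/nearcat/${RUN}*       2> /dev/null
    echo "WRITE    :" ; ls ${GDM}/minfarm/WRITE/${RUN}* 2> /dev/null
    echo "READ/SAM :" ; find READ -name "${RUN}*"       2> /dev/null
    echo "ECRC     :" ; ls ${ROUNTMP}/ECRC/${RUN}*      2> /dev/null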
############ # POOLSTAT # verbose option ############ poolstat.20070611 add any option, and get a list of down pools added column labels ########## # DCACHE # ########## Mike Harrison started working on pools, we seem to be going down hill Mon Jun 11 12:13:01 CDT 2007 DOWN TOT GROUP 1/ 14 ExpDbWritePools w-stkendca9a-1 5 FermigridVolPools 15 KTeVReadPools 13 MinosPrdReadPools 1/ 7 RawDataWritePools w-stkendca9a-3 8 readPools 10/ 14 writePools w-stkendca10a-2 w-stkendca10a-5 w-stkendca10a-6 w-stkendca11a-2 w-stkendca11a-4 w-stkendca11a-6 w-stkendca9a-2 w-stkendca9a-4 w-stkendca9a-5 w-stkendca9a-6 MINOS26 > ./poolstat v Mon Jun 11 13:16:25 CDT 2007 DOWN TOT POOL GROUP 14 ExpDbWritePools 5 FermigridVolPools 15 KTeVReadPools 13 MinosPrdReadPools 1/ 7 RawDataWritePools w-stkendca11a-3 8 readPools 1/ 14 writePools w-stkendca11a-1 Mon Jun 11 15:52:12 CDT 2007 DOWN TOT POOL GROUP 14 ExpDbWritePools 5 FermigridVolPools 15 KTeVReadPools 13 MinosPrdReadPools 7 RawDataWritePools 7 readPools 14 writePools Authorize close of ticket 098946 ####### # DAQ # ####### DAQ ftp transfers failed : http://fndca3a.fnal.gov/cgi-bin/dcache_files.py 2007-06-11 12:45:51 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/neardet_data/2007-06/N00012361_0022.mdaq.root daqdcp-nd.fnal.gov 1 0 0 ERROR 425 Cannot open port: java.lang.Exception: Illegal Object received : dmg.cells.nucleus.NoRouteToCellException 2007-06-11 12:35:53 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/fardet_data/2007-06/F00038215_0019.mdaq.root daqdcp.minos-soudan.org 1 0 0 ERROR 425 Cannot open port: java.lang.Exception: Illegal Object received : dmg.cells.nucleus.NoRouteToCellException corrected archiver.pid [minos@daqdcp-nd minos]$ cat /var/lock/daq/archiver.pid 18823 Same as I set manually back on June 1. But the archiver is 25907 emacs /var/lock/daq/archiver.pid This makes daqmon happy Copies are still stuck srmcp works OK outbound. ============================================================================= 2007 06 10 crontab - reenabled around 19:48 kreymer@minos26 mindata@minos26 NOCAT under minfarm@fnpcsrv1 ########## # DCACHE # ########## cleanup - removed 2007 05 29 vintage /grid/data/minos/minfarm/SAFE files, after verifying that all are on tape ( cat .(use)(4) ... ) ########## # DCACHE # ########## FILES2=`sam list files --dim="TAPE_LABEL dcache" --nosummary | grep -v mdaq ` printf "${FILES2}\n" | wc -l 71 Very odd, this is exactly the number of files pending back in 2007 05 29 stuck in pools w-stkendca11a-4 w-stkendca11a-6 ============================================================================= 2007 06 09 ####### # SAM # ####### ./genpy -l " -r R1.15 " fardet_data/2006-03 ( this never happened , repeated 2007 07 12 ) ########## # DCACHE # ########## Cleared the empty bad candidate that's been hanging around ls -l /pnfs/minos/reco_far/cedar_phy/.bcnd_data/2007-02/F00037384_0006.spill.bcnd.cedar_phy.0.root -rw-r--r-- 1 1334 e875 0 May 19 05:01 /pnfs/minos/reco_far/cedar_phy/.bcnd_data/2007-02/F00037384_0006.spill.bcnd.cedar_phy.0.root As rubin on fnpcsrv1, at 02:56 UTC 10 Jun cd /pnfs/minos/reco_far/cedar_phy/.bcnd_data/2007-02 rm F00037384_0006.spill.bcnd.cedar_phy.0.root sam undeclare file F00037384_0006.spill.bcnd.cedar_phy.0.root ########## # DCACHE # ########## Over 70 files are in write pools over 24 hours, not on tape. 
Such as: MINOS26 > dc_stat /pnfs/minos/reco_near/cedar_phy/cand_data/2006-12/N00011376_0005.spill.cand.cedar_phy.0.root ============================ PNFS status for /pnfs/minos/reco_near/cedar_phy/cand_data/2006-12/N00011376_0005.spill.cand.cedar_phy.0.root -rw-r--r-- 1 1334 e875 153288546 Jun 8 11:48 N00011376_0005.spill.cand.cedar_phy.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:bfd0d1fb;l=153288546; w-stkendca11a-2 LEVEL 4 ============================ But other recent mcimported files are on tape, such as /pnfs/fnal.gov/usr/minos/mcin_data/far/daikon_02/CosmicMu/101/f20011010_0009_CosmicMu_D02.reroot.root Let's see where they are : FILES=`sam list files --dim="TAPE_LABEL dcache" --nosummary` 77 files 71 of these are non raw data for FILE in ${FILES} ; do PLOC=`sam locate ${FILE} | tr "'" \\\n | grep ^/pnfs | cut -f 1 -d ,` POLE=`cd ${PLOC} ; cat ".(use)(2)(${FILE})" | grep w-stkendca` printf "%70s %s\n" ${FILE} ${POLE} done Every non-raw-data file is in w-stkendca11a-2 Most are cand's, there are few sntp's. Tacked the list onto TRACE DUH, run the handle poolstat script : MINOS26 > ./poolstat Sat Jun 9 23:32:31 CDT 2007 14 ExpDbWritePools 5 FermigridVolPools 15 KTeVReadPools 13 MinosPrdReadPools 7 RawDataWritePools 8 readPools 5/ 14 writePools Dead pools in the writePools group are marked + w-stkendca10a-2 + w-stkendca10a-4 w-stkendca10a-5 + w-stkendca10a-6 w-stkendca11a-1 w-stkendca11a-2 + w-stkendca11a-4 + w-stkendca11a-5 w-stkendca11a-6 + w-stkendca12a-4 w-stkendca9a-2 w-stkendca9a-4 w-stkendca9a-5 w-stkendca9a-6 ############# # CHECKLIST # ############# Cannot contact fndca for queue and stage pages http://fndca.fnal.gov/dcache/queue/allpools.jpg http://fndca.fnal.gov/dcache/logins/stage.jpg also cannot reach http://fndca.fnal.gov/dcache/files/ And under http://fndca.fnal.gov:2288/cellInfo PinManager OFFLINE Checking more links on the DCache page http://fndca.fnal.gov/ Cannot reach these : Recent FTP Transfers http://fndca3a.fnal.gov/cgi-bin/dcache_files.py Active Transfers http://fndca3a.fnal.gov/dcache/transfers.html Billing http://fndca3a.fnal.gov/dcache/billing.html File Lifetime Plots http://fndca3a.fnal.gov/dcache/dc_lifetime_plots.html Pool Directory Listings http://fndca3a.fnal.gov/dcache/files/ Queue Plots http://fndca3a.fnal.gov/dcache/dc_queue_plots.html Sum http://fndca3a.fnal.gov/dcache/queue/allpools.jpg Login Plots http://fndca3a.fnal.gov/dcache/dc_login_plots.html Will report this via helpdesk and to dcache-admin, and call the helpdesk tomorrow if it is not better. Ticket 98946 Closer inspection shows the time stamps of /pnfs/minos/neardet_data/2007-06 are clustered around 4 hour intervals. THE RawDataWritePools POOLS ARE WRITING EVERY 4 HOURS, NOT EVERY 24 THIS IS BAD FOR THE TAPES. 
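( For reference, the clustering above can be confirmed by bucketing the raw
file timestamps by hour ; a sketch, assuming GNU ls with --time-style : )

    ls -l --time-style='+%m-%d %H' /pnfs/minos/neardet_data/2007-06 \
        | grep mdaq.root | awk '{print $6, $7}' | sort | uniq -c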
DEFER THIS TO MONDAY, WE HAVE GREATER PROBLEMS Have a look at /pnfs/minos/reco_near/cedar/2007-06 times, Times are clustered starting at Jun 4 21:29 Jun 4 23:31 Jun 5 00:30 Jun 5 04:29 Jun 5 10:45 -> 12:25 Jun 6 01:36 Jun 8 03:18 -> 09:48 Jun 8 13:47 ============================================================================= 2007 06 08 ########## # SADDMC # ########## setup sam -q dev cds for SAS in `ls saddmc.*` ; do EXT=`echo ${SAS} | cut -f 2 -d .` mv saddmc.${EXT} saddmc.2006${EXT} done cp saddmc.20060612 saddmc.20070608 REVIEW parameters - OK data tiers storage locations application family # # # mcout data tiers sam get registered data tiers | sort setup sam -q dev SAM_ORACLE_CONNECT=samdbs/password export SAM_ORACLE_CONNECT samadmin add datatier --name=mcout-near --description="mcout_data - near" samadmin add datatier --name=mcout-far --description="mcout_data - far" export -n SAM_ORACLE_CONNECT did this for dev/int/prd # # # STORAGE LOCATIONS ####### # AFS # ####### Added Greg Pawloski to the minos group pts membership minos | sort pts adduser -user jyuko -group minos pts adduser -user pawloski -group minos pts: User or group doesn't exist ; unable to add user pawloski to group minos waiting for Greg's AFS account to be created established by 13 Junel ####### # AFS # ####### minos:sysadmin To keep track of the sysadmins with access to the minos group pts creategroup -name kreymer:sysadmin pts adduser -user kreymer -group kreymer:sysadmin pts membership kreymer:sysadmin pts examine kreymer:sysadmin Name: kreymer:admin, id: -1919, owner: kreymer, creator: kreymer, membership: 1, flags: S-M--, group quota: 0. pts setfields kreymer:sysadmin -access SOMar pts examine kreymer:sysadmin Name: kreymer:admin, id: -1919, owner: kreymer, creator: kreymer, membership: 1, flags: SOMar, group quota: 0. pts chown kreymer:sysadmin minos pts membership minos:sysadmin for US in boyd ettab jason jonest ling schmitz shepelak timl ; do pts adduser -user ${US} -group minos:sysadmin ; done ########### # MONTHLY # ########### MINOS26 > aklog MINOS26 > tokens Tokens held by the Cache Manager: User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Jun 15 18:32] CFL 6/8 DATASETS 6/8 PREDATOR 6/8 SADDRECO 6/8 ROUNDUP 6/8 VAULT 6/8 nearly at the end of near tarring, aklog: Couldn't get fnal.gov AFS tickets: aklog: Invalid argument while getting AFS tickets but the vaulting looks OK MYSQL deferred, need to defrag ######## # FARM # ######## Clean up write cache, check status ./roundup -M -r cedar_phy_safitter far PEND - have 18/24 subruns for F00037060_*.all.sntp.cedar_phy_safitter.0.root 8 05/30 17:42 0 2006-11 files are missing, flush the 2006-12 parts PEND - have 22/24 subruns for F00037996_*.all.sntp.cedar_phy_safitter.0.root 7 06/01 02:07 0 2007-05 Miss 11,12 PEND - have 24/17 subruns for F00038182_*.all.sntp.cedar_phy_safitter.0.root 6 06/01 14:04 0 2007-05 needed Predator to declare raw PEND - have 5/24 subruns for F00038185_*.all.sntp.cedar_phy_safitter.0.root 6 06/01 14:43 0 2007-05 needed Predator to declare raw And data runs into 2007-06, flush Did ./roundup -r cedar_phy_safitter far # picked up 38182 after sam declares ./roundup -f 1 -s F00038185 -r cedar_phy_safitter far ./roundup -f 1 -s F00037060 -r cedar_phy_safitter far ####### # SAM # cedar_phy declared ? ####### per vahle query 3 June, are cedar_phy files declared. 
cd /pnfs/minos/reco_far for MON in `ls sntp_data` ; do echo $MON FILES=`ls sntp_data/${MON}` for FILE in $FILES ; do sam locate ${FILE} done ; done MONS=`ls /pnfs/minos/reco_far/cedar_phy/sntp_data` DET=far for MON in $MONS ; do ./saddreco.20070507 ${DET} cedar_phy ${MON} list ; done Needed /pnfs/minos/reco_far/cedar_phy/cand_data/2006-03 need raw F00034632_0023.mdaq.root need raw F00034632_0022.mdaq.root need raw F00034632_0020.mdaq.root need raw F00034632_0019.mdaq.root need raw F00034635_0000.mdaq.root need raw F00034632_0012.mdaq.root need raw F00034632_0013.mdaq.root need raw F00034632_0016.mdaq.root need raw F00034632_0017.mdaq.root need raw F00034632_0014.mdaq.root need raw F00034632_0015.mdaq.root need raw F00034632_0018.mdaq.root That's 12 files missing. DET=near some obsoletes in 2005-09 ####### # SAM # ####### STRAY RAW FILES FROM FAR 2006-03 MINOS26 > ls /pnfs/minos/fardet_data/2006-02 | wc -l 750 MINOS26 > ls /pnfs/minos/fardet_data/2006-03 | wc -l 1414 MINOS26 > ls /pnfs/minos/fardet_data/2006-04 | wc -l 936 MINOS26 > SAMDIM="DATA_TIER raw-far and FULL_PATH like /pnfs/minos/fardet_data/2006-02 " MINOS26 > sam list files --dim="${SAMDIM}" --count 750 files match the given constraints. MINOS26 > SAMDIM="DATA_TIER raw-far and FULL_PATH like /pnfs/minos/fardet_data/2006-03" MINOS26 > sam list files --dim="${SAMDIM}" --count 1398 files match the given constraints. MINOS26 > SAMDIM="DATA_TIER raw-far and FULL_PATH like /pnfs/minos/fardet_data/2006-04" MINOS26 > sam list files --dim="${SAMDIM}" --count 936 files match the given constraints. We seem to need 16 files . ./genpy -d -l " -r R1.22 " fardet_data/2006-03 This seems to be listing everything. OK, this was the time at which we moved from minos06 to minos26. MINOS26 > pwd /local/scratch26/kreymer/genpy/fardet_data MINOS26 > scp -c blowfish -r minos06:/local/scratch06/kreymer/genpy/fardet_data/2006-03 2006-03 Looks better now, see this omitting the dbu commands : MINOS26 > ./genpy -d -l " -r R1.22 " fardet_data/2006-03 OK JUST TESTING Generating .py for /pnfs/minos/fardet_data/2006-03 STARTING Fri Jun 8 18:50:29 CDT 2007 Treating 1414 files Scanning 16 files F00034242_0013.mdaq.root Fri Jun 8 18:51:17 CDT 2007 F00034632_0012.mdaq.root Fri Jun 8 18:51:27 CDT 2007 F00034632_0013.mdaq.root Fri Jun 8 18:51:31 CDT 2007 F00034632_0014.mdaq.root Fri Jun 8 18:51:34 CDT 2007 F00034632_0015.mdaq.root Fri Jun 8 18:51:37 CDT 2007 F00034632_0016.mdaq.root Fri Jun 8 18:51:41 CDT 2007 F00034632_0017.mdaq.root Fri Jun 8 18:51:44 CDT 2007 F00034632_0018.mdaq.root Fri Jun 8 18:51:48 CDT 2007 F00034632_0019.mdaq.root Fri Jun 8 18:51:51 CDT 2007 F00034632_0020.mdaq.root Fri Jun 8 18:51:54 CDT 2007 F00034632_0021.mdaq.root Fri Jun 8 18:51:58 CDT 2007 F00034632_0022.mdaq.root Fri Jun 8 18:52:01 CDT 2007 F00034632_0023.mdaq.root Fri Jun 8 18:52:05 CDT 2007 F00034633_0000.mdaq.root Fri Jun 8 18:52:08 CDT 2007 F00034634_0000.mdaq.root Fri Jun 8 18:52:12 CDT 2007 F00034635_0000.mdaq.root Fri Jun 8 18:52:15 CDT 2007 Let's run it for real MINOS26 > ./genpy -l " -r R1.22 " fardet_data/2006-03 Oops, that should be R1.15 for such old data. Killed, removed generated file ( timed out after 10 minutes ) /local/scratch26/kreymer/genpy/fardet_data/2006-03/F00034632_0012* Try again later, when predator is idle. ./genpy -l " -r R1.15 " fardet_data/2006-03 N.B. on 2007 07 18, copied one of these files to /afs/fnal.gov/files/data/minos/d86/kreymer/F00034242_0013.mdaq.root for further testing with R1.22 N.B. 
on 2007 12 11 copied this to /minos/scratch/kreymer/F00034242_0013.mdaq.root ============================================================================= 2007 06 07 ########### # MINOS26 # ########### Per request to have cjames group corrected on minos26, 09:42 Joe Boyd re-enabled NIS (ypbind) 12:55 Tim Laszlo (timl@fnal.gov) disabled NIS 17:04 As of about 21:04 UTC, Joe moved minos26 to use NIS (YP) for account, with the same short list of authorized users in the local /etc/passwd file, but now taking detailed information from NIS : +shepelak +kreymer +buckley +rhatcher +cjames +mindata This required adding mindata to the global NIS passwd file. Logins for mindata and kreymer are working. Login shells are taken from the NIS passwd file. ( There was about a minute of lost access for mindata around 21:02. That's small compared the interruption this morning. ) ######## # GRID # ticket 98820 ######## Requested export to, and mount of /grid/data and /grid/app on minos01 readonly ######### # ADMIN # ######### Rubin has reviewed and drafted revised Computing section of MOU, word document sent to me in email, Liz original is MINOS-CD-MOU-Oct-06-v3.doc Rubin section is MINOSFermiGrid.doc ============================================================================= 2007 06 06 ############ # PNFSDIRS # ############ Still need to set perms and group on each level of created directory. I guess that means not doing mkdir -p Need + pnfsdirs near cedar daikon_00 L250200N # already exist pnfsdirs near cedar daikon_00 L010185N_bfldx113 write # at 16:42 + pnfsdirs near cedar daikon_00 L010185N # already existed pnfsdirs near cedar daikon_00 L250200N_nccoh write # at 16:45 Oops, failed to set group to e875 MINOS26 > DIRS=' /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_bfldx113 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_bfldx113/cand_data /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_bfldx113/mrnt_data /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_bfldx113/sntp_data /pnfs/minos/mcin_data/near/daikon_00/L250200N_nccoh /pnfs/minos/mcout_data/cedar/near/daikon_00/L250200N_nccoh /pnfs/minos/mcout_data/cedar/near/daikon_00/L250200N_nccoh/cand_data /pnfs/minos/mcout_data/cedar/near/daikon_00/L250200N_nccoh/mrnt_data /pnfs/minos/mcout_data/cedar/near/daikon_00/L250200N_nccoh/sntp_data ' for DIR in ${DIRS} ; do chgrp e875 ${DIR} ; done for DIR in ${DIRS} ; do ls -ld ${DIR} ; done pnfsdirs far cedar_phy daikon_02 CosmicLE write # 17:06 Had to manually fix MINOS26 > chgrp e875 /pnfs/minos/mcin_data/far/daikon_02 MINOS26 > chmod 775 /pnfs/minos/mcin_data/far/daikon_02 MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy/far/daikon_02 MINOS26 > chmod 775 /pnfs/minos/mcout_data/cedar_phy/far/daikon_02 MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy/far MINOS26 > chmod 775 /pnfs/minos/mcout_data/cedar_phy/far pnfsdirs far cedar_phy_safitter daikon_02 CosmicLE write # MINOS26 > chgrp e875 /pnfs/minos/mcin_data/far/daikon_02 MINOS26 > chgrp e875 /pnfs/minos/mcin_data/far MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy_safitter/far/daikon_02/CosmicLE MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy_safitter/far/daikon_02 MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy_safitter/far MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy_safitter MINOS26 > chmod 775 /pnfs/minos/mcin_data/far/daikon_02 MINOS26 > chmod 775 /pnfs/minos/mcin_data/far chmod: changing permissions of `/pnfs/minos/mcin_data/far': Operation not permitted MINOS26 > chmod 775 
/pnfs/minos/mcout_data/cedar_phy_safitter/far/daikon_02 MINOS26 > chmod 775 /pnfs/minos/mcout_data/cedar_phy_safitter/far MINOS26 > chmod 775 /pnfs/minos/mcout_data/cedar_phy_safitter And at the last minute, pnfsdirs far cedar_phy daikon_02 CosmicMu write # 22:22 pnfsdirs far cedar_phy_safitter daikon_02 CosmicMu write # 22:24 As the rest of the tree above CosmicMu is already in place, no need to touch up the higher-up permissions and groups. Two files were already being moved into /pnfs/minos/mcin_data/far/daikon_02/CosmicMu/100 -rw-r--r-- 1 kreymer e875 364696325 Jun 6 22:17 f20011001_0000_CosmicMu_D02.reroot.root -rw-r--r-- 1 kreymer e875 355457581 Jun 6 22:18 f20011001_0001_CosmicMu_D02.reroot.root Changed the family on the fly, should be OK. ####### # SAM # ####### SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and FILE_NAME like N00011434% " or SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and RUN_NUMBER 11434 " MINOS26 > sam list files --dim="${SAMDIM}" --nosummary N00011434_0021.spill.sntp.cedar_phy.0.root N00011434_0000.spill.sntp.cedar_phy.0.root SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and RUN_NUMBER 11434 \ and PARENT_BY_NAME N00011434_0000.mdaq.root \ " MINOS26 > sam list files --dim="${SAMDIM}" --nosummary N00011434_0000.spill.sntp.cedar_phy.0.root Getting a clean list of subruns : sam get metadata --file=N00011434_0000.spill.sntp.cedar_phy.0.root \ | grep parents \ | tr "'" \\\n \ | grep root \ | sort ####### # AFS # ####### Added Daniel Cherdack to the minos group pts membership minos | sort pts adduser -user cherdack -group minos ######## # FARM # ######## Copied test bad short sntp from safitter, temporarily, cp /grid/data/minos/farcat/F00038182_0022.all.sntp.cedar_phy_safitter.0.root /afs/fnal.gov/files/data/minos/d10/recodata113/ ============================================================================= 2007 06 05 ############# # MINOSORA3 # ############# Maureen reboots with mce=off in kernel, per RH advice mce=off disable machine check http://lkml.org/lkml/2003/8/5/126 mce=off turns off MCE reporting for fatal MCE exceptions (however your box may still crash when something really bad happens) ######## # FARM # ######## Investigating /grid/data/backlog SRV1> ./farmgsum Summarizing /grid/data/minos/*cat 1926 56573 nearcat 4921 6046 farcat 0 1 mcnearcat 0 1 mcfarcat 0 1 mcfmockcat 427 26379 minfarm/WRITE 7274 89001 TOTAL files, GBytes nearcat 74 2099 cosmic.sntp.cedar.0.root 372 5679 cosmic.sntp.cedar_phy.0.root 123 3176 cosmic.sntp.cedar_phy.1.root 926 22419 spill.mrnt.cedar_phy.0.root 62 789 spill.mrnt.cedar_phy.1.root 71 4354 spill.sntp.cedar.0.root 236 16644 spill.sntp.cedar_phy.0.root 62 4125 spill.sntp.cedar_phy.1.root farcat 53 1270 all.sntp.cedar.0.root 34 800 all.sntp.cedar_phy.0.root 118 2896 all.sntp.cedar_phy.1.root 4354 295 all.sntp.cedar_phy_safitter.0.root 53 241 spill.bntp.cedar.0.root 128 329 spill.bntp.cedar_phy.0.root 53 158 spill.sntp.cedar.0.root 128 207 spill.sntp.cedar_phy.0.root mcnearcat mcfarcat mcfmockcat minfarm/WRITE 2 1427 cosmic.sntp.cedar.0.root 6 3034 cosmic.sntp.cedar_phy.0.root 1 509 cosmic.sntp.cedar_phy.1.root 399 3694 sntp.cedar_phy.root 6 2263 spill.mrnt.cedar_phy.0.root 1 330 spill.mrnt.cedar_phy.1.root 2 3197 spill.sntp.cedar.0.root 9 11430 spill.sntp.cedar_phy.0.root 1 1764 spill.sntp.cedar_phy.1.root ####### # SAM # 
####### Preparing for cedar_phy_safitter export SAM_ORACLE_CONNECT="samdbs/" setup sam -q dev samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar_phy_safitter setup sam -q int setup sam -q prd OOPS, repeated above with samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.safitter ./reloc -d -s dev cedar_phy_safitter # debug test ./reloc -s dev cedar_phy_safitter ./reloc -s int cedar_phy_safitter ./reloc -s prd cedar_phy_safitter ########### # ROUNDUP # ########### roundup.20070605 - added cedar_phy_safitter AFSS/roundup.20070605 -n -r cedar_phy_safitter far cp AFSS/roundup.20070605 . ln -sf roundup.20070605 roundup ######## # FARM # ######## ./roundup -r cedar_phy_safitter far Tue Jun 5 18:18:38 CDT 2007 killed it before it concatenated anything, regular roundups had kicked in, did not want to run 2 at once. Will run again later tonight. ./roundup -r cedar_phy_safitter far Tue Jun 5 21:56:47 CDT 2007 Needed /pnfs/minos/reco_far/cedar_phy_safitter/cand_data/2007-05 OK - skipping 12 files not yet in SAM Need to repeat sam declare for first 2 months, ./roundup -m 2006-12 -r cedar_phy_safitter far ./roundup -m 2007-01 -r cedar_phy_safitter far STARTED Wed Jun 6 05:06:25 2007 FINISHED Wed Jun 6 05:13:13 2007 due to lack of the cedar.phy.safitter application 2007-02 and later were OK ####### # SAM # ####### checking recent sam counts for brebel REL=cedar RELD=`echo ${REL} | tr . _` for DET in near far ; do for MON in 2007-04 2007-05 2007-06 ; do for STR in cand sntp ; do SAMDIM=" RUN_TYPE physics% \ and VERSION ${REL} \ and DATA_TIER ${STR}-${DET} \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_${DET}/${RELD}/${STR}_data/${MON} \ " printf " ${REL} ${DET} ${MON} ${STR} " ; \ sam list files --dim="${SAMDIM}" --count done ; done ; done ============================================================================= 2007 06 04 ######## # FARM # ######## vahle missing files ? 36 runs listed by howie, not in SAM, PNFS, or AFS ? 00007801 00007899 00008043 00008192 00008564 00008707 00008746 00008826 00008850 00008878 00008925 00008975 00009402 00009441 00009476 00009502 00009582 00009635 00009732 00009892 00010155 00010319 00010383 00010449 00010474 00010510 00010552 00010660 00010678 00010700 00010724 00010749 00011134 00011155 00011218 00011666 SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ " sam list files --dim="${SAMDIM}" File Count: 648 Average File Size: 553.60MB Total File Size: 350.32GB Total Event Count: 428335040 for RUN in ${RUNS} ; do echo $RUN SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and FILE_NAME like N${RUN}% " sam list files --nosummary --dim="${SAMDIM}" | sort done # tested the above with RUN=00008917 Then did full run ... 00009582 N00009582_0000.spill.sntp.cedar_phy.0.root 00009635 N00009635_0000.spill.sntp.cedar_phy.0.root ... 
Opened to all streams and tiers, found spill cand for N00008564 gap 11 N00009582 ok N00009635 ok N00009732 0/1/2/3 N00011134 gap 14-16 MINOS26 > sam locate N00009582_0000.spill.sntp.cedar_phy.0.root ['/pnfs/minos/reco_near/cedar_phy/sntp_data/2005-12,286@vob549'] MINOS26 > ls -l /pnfs/minos/reco_near/cedar_phy/sntp_data/2005-12/N00009582_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 kreymer e875 449697580 May 16 22:16 /pnfs/minos/reco_near/cedar_phy/sntp_data/2005-12/N00009582_0000.spill.sntp.cedar_phy.0.root MINOS26 > sam locate N00009635_0000.spill.cand.cedar_phy.0.root ['/pnfs/minos/reco_near/cedar_phy/cand_data/2006-01,215@vo9531'] MINOS26 > ls -l /pnfs/minos/reco_near/cedar_phy/cand_data/2006-01/N00009635_0000.spill.cand.cedar_phy.0.root -rw-r--r-- 1 1334 e875 414760027 May 16 22:54 /pnfs/minos/reco_near/cedar_phy/cand_data/2006-01/N00009635_0000.spill.cand.cedar_phy.0.root for RUN in ${RUNS} ; do ls /pnfs/minos/reco_near/cedar_phy/cand_data/*/N${RUN}* done something for all runs but N00010678 N00010700 ####### # AFS # ####### Preparing AFS request for a volume for rustem, for his analysis ( cd $MINOS_DATA ; ls -d d??? | sort | tail -3 ) d257 d258 d259 Clone acl from rustem's existing volumes, adjusted from buckley to minos:admin d186 d203 d221 Summary : ask for 50000 MB /afs/fnal.gov/files/data/minos/d260 Not backed up minos rl system:administrators rlidwka system:anyuser rl minos:admin rlidwka rustem rlidwka Sent request about 22:50 ######## # GRID # ######## Approved Rubin's Minos Production role https://voms.fnal.gov:8443/vomrs/vo-fermilab/vomrs + Members . Manage Groupsand Group Roles This will be needed when GPlazma in installed on June 21 in Dcache ============================================================================= 2007 06 01 ####### # DAQ # ####### Filled empty archiver.pid [minos@daqdcp-nd minos]$ ps xf | grep archiver | grep -v grep 18823 ? S 0:20 python /home/minos/bin/archiver_krb.py [minos@daqdcp-nd minos]$ ls -l /var/lock/daq/archiver.pid -rw-r--r-- 1 minos e875 0 May 31 21:20 /var/lock/daq/archiver.pid [minos@daqdcp-nd minos]$ printf "18823\n" >> /var/lock/daq/archiver.pid [minos@daqdcp-nd minos]$ cat /var/lock/daq/archiver.pid 18823 ########### # ROUNDUP # ########### roundup.20070529 - handles DUPS, purging if you set -D cp AFSS/roundup.20070529 . ln -sf roundup.20070529 roundup ./roundup -n -r cedar_phy far Too messy, created roundup.20070601 which lists HAVE count on the PEND line. 
Then select out the ones that are ready now, a few at a time AFSS/roundup.20070601 -f 2 -s "F00036655\|F00036662\|F00036680\|F00036718\|F00036770\|F00036773\| F00036777\|F00036780" -r cedar_phy far AFSS/roundup.20070601 -f 10 -s "F00032737\|F00032788\|F00032791\|F00030642\|F00031163\|F00031201\|F00031202\|F00031203\|F00031280\|F00031286" -r cedar_phy far AFSS/roundup.20070601 -f 18 -s "F00031292\|F00031295\|F00031302\|F00031330\|F00031338\|F00031343\|F00031344\|F00031348\|F00031353\|F00031378" -r cedar_phy far AFSS/roundup.20070601 -f 10 -s "F00031379\|F00031380\|F00031388\|F00031389\|F00031392\|F00031393\|F00031397" -r cedar_phy far hacked roundup to purge file that were written while PNFS was down, by looking in READ/SAM AFSS/roundup.20070601 -n -w -r cedar_phy far Oops, the -n pass writes DFARM files SRV1> find ROUNTMP/DFARM -cmin -40 -type f -exec mv {} ROUNTMP/DFARM/tmp/ \; Try again, -n, with disabled DFARM writing for NOOP OK,now really purgint AFSS/roundup.20070601 -w -M -r cedar_phy far Get a new list and clean up the last strays AFSS/roundup.20070601 -n -W -M -r cedar_phy far | tee /tmp/cpf AFSS/roundup.20070601 -f 10 -s "F00036777\|F00036777\|F00037252\|F00037697\|F00037700\|F00037703\|F00037709\|F00037761\|F00037776" -r cedar_phy far cp AFSS/roundup.20070601 . ln -sf roundup.20070601 roundup Oops, more changes for efficiency of filtering out DUP's. AFSS/roundup.20070601 -n -r cedar_phy near | tee /tmp/cpn Fri Jun 1 15:18:21 CDT 2007 cp AFSS/roundup.20070601 . ln -sf roundup.20070601 roundup N00007760_0011 is suppressed, but is present in output. For the moment,set these aside mv /grid/data/minos/nearcat/N00007760* /grid/data/minos/minfarm/N7760/ Remove duplicates ./roundup -W -M -D -s N00009805 -r cedar_phy near Create current PEND list for far, ./roundup -W -M -r cedar_phy far Now to a regular catchup for ./roundup -r cedar near Fri Jun 1 15:51:58 CDT 2007 ./roundup -r cedar far There is still manual clearing of PENDs from cedar_phy near to do, but we've got to clear the backlog first. Will let the next corral run clear the easy backlog in cedar_phy near Moved NOCAT to NOCAT.okm 17:30 ============================================================================= 2007 05 31 ########## # DCACHE # cedar_phy_safitter ########## ###### # CD # ###### Shutting down most servers for power outage minos-sam01 minos-sam02 minos-sam03 crontab -r on kreymer@minos26 mindata@minos26 minfarm@fnpcsrv1 Created pingall, pingstat scripts, We seem to have up : minos01 minos-mysql1 minos-25 AFS 16:10 AFS seems to have gone offline, processes are stuck 16:18 Mail is down, cannot contact imap3 server cannot ping imap1/2/3 16:25 CD system status server is unpingable from my desktop 17:30 AFS and fnpcsrv1 seem to be back up 21:13 - summary of CD status items E-mail listserv down with disk error, no estimate MSS systems coming back 17:15 - are they back ? Restarting SAM servers : minos-sam01 . setups.sh ; ups start sam_bootstrap ./sam_test_py minos minos-sam02 . setups.sh ; ups start sam_bootstrap minos-sam03 . setups.sh ; ups start sam_bootstrap Restarted monitors, per new HOWTO.monitor Checklist : DCache plots - stale Enstore servers - locked Summary - restarting mindata@minos26 ./srmtest srmls looks OK srmcp stuck for at least a minute dccp is also stuck Large numbers of DCache pools are offline. Enstore is still inactive. Only one of the seven RawDataWritePool Pools is online. Only 4 of the 12 general write queues are active. The above got stuck because pools with the test files are offline. 
Recent data can be copied from write pools. IFILE=N00012297_0023.mdaq.root DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} crl seems to be OK Sent this list of DCache pools down to dcache-admin MIN > ./poolstat Thu May 31 23:19:07 CDT 2007 4/ 14 ExpDbWritePools 6 FermigridVolPools 7/ 15 KTeVReadPools 4/ 13 MinosPrdReadPools 6/ 7 RawDataWritePools 4/ 8 readPools 9/ 13 writePools Fri Jun 1 07:32:12 CDT 2007 3/ 14 ExpDbWritePools 6 FermigridVolPools 3/ 15 KTeVReadPools 4/ 13 MinosPrdReadPools 7 RawDataWritePools 2/ 8 readPools 13 writePools ######## # FARM # ######## 17:40 Cleared the duplicates quickly with AFSS/roundup.20070529 -M -W -D -s N00009653 -r cedar_phy near AFSS/roundup.20070529 -M -W -D -s N00009689 -r cedar_phy near AFSS/roundup.20070529 -M -W -D -s N00009714 -r cedar_phy near AFSS/roundup.20070529 -M -W -D -s F00032654 -r cedar_phy far AFSS/roundup.20070529 -M -W -D -s F00035859 -r cedar_phy far ####### # CRL # ####### The CRL is having problems. That is odd, as the main page is up at http://www-minoscrl2.fnal.gov/minos/Index.jsp And the minos-mysql database is up. But there is no response when we click on "All Categories" at http://www-minoscrl2.fnal.gov/minos/Log.jsp?viewTopic=All The mysql database shows recent connections from crlweb2.fnal.gov Sent this report to the helpdesk, tried to call Suzanne Gysin at 8334 CR reports that CRL has been working... but not for me at present 21:38 UTC ############ # PNFSDIRS # ############ Creating pnfsdirs script to create and check permissions on various PNFS directories : reco_near/... reco_far/... mcin_data/... mcout_data/... ============================================================================= 2007 05 30 ######## # FARM # cosmic copies ######## Trying srmcp due to stuck kerberos doors, setup dcap -q unsecured SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr SOUT=${SPATH}/${RSPA} printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${DET}${REL}${STR}.log for FILE in ${FILES} ; do SFIL=${SOUT}/${FILE} DFIL=${DOUT}/${FILE} PFIL=${POUT}/${FILE} if [ ! -r ${PFIL} ] ; then echo "NEED" ${FILE} srmcp -streams_num=1 -server_mode=active file:///${FILE} ${SFIL} dccp -P ${DFIL} fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/${DET}${REL}${STR}.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${DET}${REL}${STR}.log Wed May 30 08:33:44 CDT 2007 Wed May 30 12:06:27 CDT 2007 This is running at about 12 seconds per 64 MByte file, this will take hours for the nearly 1000 files we have to copy. The dccp copies were reporting rates of over 30 MB/second. Copied 980 files in 93 minutes = 5580 sec => 5.7 Sec/file started purge of grid files cedar sntp near Wed May 30 15:24:08 CDT 2007 Wed May 30 15:41:39 CDT 2007 cedar sntp far Wed May 30 17:36:38 CDT 2007 Wed May 30 17:39:29 CDT 2007 cedar_phy sntp far Wed May 30 18:01:23 CDT 2007 Wed May 30 18:04:49 CDT 2007 cedar_phy sntp near Wed May 30 18:05:38 CDT 2007 ######## # FARM # cleanup ######## Testing new roundup which lists possible and real duplicates HAVE - existing runs DUPE - actual duplicate subruns CEDAR FAR AFSS/roundup.20070529 -f 1 -r cedar far This picked up 3 runs whose subruns had formerly been bad, are now good. 
CEDAR NEAR N00012145 has 17 subruns, rest are MIA, from 27 days ago missing 17-23, informed howie 12179 has good subruns from 10 days ago, so clean it up: AFSS/roundup.20070529 -f 8 -s N00012179 -r cedar near CEDAR MCNEAR no files pending CEDAR_PHY FAR dozens of old pending runs, this could be a challenge AFSS/roundup.20070529 -n -r cedar_phy far DUPE F00032654_0000.spill.bntp.cedar_phy.0.root DUPE F00032654_0000.spill.sntp.cedar_phy.0.root subruns 0-4 DUPE F00035859_0014.spill.bntp.cedar_phy.0.root DUPE F00035859_0014.spill.sntp.cedar_phy.0.root subruns 14, 16-23 CEDAR_PHY MEAR DUPE N00009653_0006.spill.mrnt.cedar_phy.0.root DUPE N00009653_0006.spill.sntp.cedar_phy.0.root 6, 19-22 DUPE N00009689_0000.spill.mrnt.cedar_phy.0.root DUPE N00009689_0000.spill.sntp.cedar_phy.0.root 0, 2, 3, 5, 7, 8 DUPE N00009714_0003.spill.mrnt.cedar_phy.0.root DUPE N00009714_0003.spill.sntp.cedar_phy.0.root 3-5, 16 ######## # FARM # mock ######## ./roundup -M -r cedar_phy mockfar Wed May 30 15:18:01 CDT 2007 Wed May 30 16:42:32 CDT 2007 Could not write to L250200N/000 corrected protections, restarted Wed May 30 17:40:53 CDT 2007 Wed May 30 18:04:25 CDT 2007 Oops, errors were writing to /pnfs/minos/mcout_data/cedar_phy/fmock/daikon_00/L010185N/sntp_data/000 Odd, this was owned by kreymer,created 22 May. Needed to rerun to pick up first 22 files ####### # LSF # minos cluster batch ####### Ticket 98153 ____________________________________________________________________ Presently, the minos cluster nodes minos19 through minos25 are set up to run LSF job for the minos queue. If licenses permit, we would like to expand this to most of the minos cluster in the short term ( by next week ) Please let us know if license are available, and we can discuss the specific list of nodes. I know we want to exclude minos01 minos02 minos11 minos26 and possible one or two others. __________________________________________________________________ ============================================================================= 2007 05 29 ########## # DCACHE # ########## There are 71 reco_far, reco_near, and mcout_data files at http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt waiting to be written. These are cand and .bcnd files written Friday from 2007-05-25 05:56:57 to 2007-05-25 15:10:09 Reported to helpdesk as follows , high priority :98029 _________________________________________________________________________ There are 71 Minos farm output files in the DCache write pools, but which are not yet on tape. These were written before the Friday 25 May PNFS probelems, from 2007-05-25 05:56:57 to 2007-05-25 15:10:09 The file list is reported at http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt When I try to copy one of these files with dccp, I get these messages : MINOS26 > dccp -d 4 ${DPATH}/${FILE} TEST.dat [Tue May 29 11:24:38 2007] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/cosmic/near/cand_data/c10010115_0003.cand.cedar_phy.root in cache. Connected in 0.00s. Command failed! Server error message for [1]: "905" (errno 905). Failed open file in the dCache. 
Can't open source file : "905" System error: Input/output error MINOS26 > echo ${DPATH}/${FILE} dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/cosmic/near/cand_data/c10010115_0003.cand.cedar_phy.root MINOS26 > date Tue May 29 11:44:16 CDT 2007 _________________________________________________________________ Investigating for FILE in `cat /tmp/stales` ; do echo ${FILE} ; PAT=/pnfs/minos/`dirname ${FILE}` ; FIL=`basename ${FILE}` ( cd ${PAT} ; cat ".(use)(2)(${FIL})" | grep stken ) ; done w-stkendca11a-4 w-stkendca11a-6 w-stkendca11a-6 is missing from http://fndca.fnal.gov:2288/queueInfo 13:20 - w-stkendca11a-6 is back online. Apparently, 11 of the 13 pools were offline this morning, according to developers. prestaging the data for a safety copy : for FILE in `cat /tmp/stales` ; do echo ${FILE} ; dccp -P ${DPATH}/${FILE} ; sleep 10 ; done Most files are in read queues now, except : mcout_data/cedar_phy/cosmic/near/cand_data/c10010127_0004.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010128_0004.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010172_0003.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010171_0000.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010171_0004.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010186_0000.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010193_0001.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010199_0004.cand.cedar_phy.root reco_far/cedar/.bcnd_data/2007-05/F00037989_0022.spill.bcnd.cedar.0.root reco_far/cedar/cand_data/2007-05/F00037993_0004.spill.cand.cedar.0.root reco_far/cedar/.bcnd_data/2007-05/F00037993_0004.spill.bcnd.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0009.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0009.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0007.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0010.cosmic.cand.cedar.0.root 13:55 most of these files are on tape now, Remaining files are all reco_near/cedar/cand_data Will check again in a couple of hours. 
16:66 Still need to get these on tape : reco_near/cedar/cand_data/2007-05/N00012182_0002.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0002.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0001.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0001.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0009.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0009.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0007.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0010.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0007.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0010.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0011.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0011.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0009.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0009.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0008.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0008.spill.cand.cedar.0.root for FILE in ${FILES} ; do echo ${FILE} ; PAT=/pnfs/minos/`dirname ${FILE}` ; FIL=`basename ${FILE}` ( cd ${PAT} ; cat ".(use)(4)(${FIL})" ) ; done N.B. removed 2007 06 10, all are on VO4763 ########### # ROUNDUP # ########### roundup.20070529 adding duplicate test, to allow more aggressive force for present, this will be based on READ and SAM/READ files ######## # FARM # ######## cosmic sntp files are in /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE/sntp_data but only cosmic with runs under 10000 are there. cosmic/bfld201_lowE is for far det files, runs under 10000 cosmoc/near is for near det files, runs over 10000 Created HOWTO.cosmicmc REL=cedar STR=sntp DET=near SRV1> printf "${FILES}\n" | wc -w 980 Tue May 29 17:23:32 CDT 2007 Tue May 29 18:56:40 CDT 2007 DET=far MINOS26 > printf "${FILES}\n" | wc -w 198 copied cedar and cedar_phy far Tried running cedar_phy near in the background, lost kerberos ticket when I logged out which hosed the kerberos door. Tried this again with the other door, stuffed the second door. ============================================================================= 2007 05 26 ####### # DAQ # ####### 11:15 - directory problem is resolved by remounting /pnfs on a server Restarted nd archive around 14:40 removed stray empty /var/lock/daq/archive.pid from minos@daqdcp FD daq had restarted around 12:06 ########## # DCACHE # ########## DCache remains healthy aside from SRM, so restarting predator. Special run to catch up : MINOS26 > echo "12 16 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/predator" | crontab Then restored normal cronjob on minos26. Gave berg, podstvkv access to mindata@minos26, created srmtest script there, which does srmls and srmcp Added -debug-true This revealed an explicit trial of httpd,dccp,gsiftp Override this on command with -protocols=gsiftp and get successful copy Hypothesis, we have always been trying httpd,dccp,gsiftp, and dccp only recently woke from the dead on the server end. Something is broken in our .xml config files or their handling. ########### # ROUNDUP # ########### roundup.20070526 - sets -protocols=gsiftp cp AFSS/roundup.20070526 . ln -sf roundup.20070526 roundup Moved NOCAT to NOCAT.ok ############ # MCIMPORT # ############ cp AFSS/mcimport.20070526 . 
ln -sf mcimport.20070526 mcimport ######## # FARM # ######## Spotted an empty cedar_phy bcnd file, written last Saturday MINOS26 > dds /pnfs/minos/reco_far/cedar_phy/.bcnd_data/2007-02/F00037384_0006.spill.bcnd.cedar_phy.0.root -rw-r--r-- 1 1334 e875 0 May 19 05:01 /pnfs/minos/reco_far/cedar_phy/.bcnd_data/2007-02/F00037384_0006.spill.bcnd.cedar_phy.0.root The blinded cand is OK ============================================================================= 2007 05 25 ####### # DAQ # ####### singles runs this morning, during beam downtime. We built to a backlog of about 88 runs. They changed the runs from about 20 to 400 seconds, finishing the running before noon. Backlog of about 83 at 13:00 Will let this clear natuarally, given that we face a holiday weekend. ( Monday is Memorial Day ) ########## # DCACHE # ########## nwest has been having trouble with ftp from RAL and OX this last week. ########## # DC2AFS # ########## Data rates from minos26 leveled at 12MB/sec after 1 AM, had been about 8 to 10 MB/sec before that. according to Ganglia. dc2afs -n -d near -r cedar_phy -s sntp STARTING Fri May 25 09:02:49 CDT 2007 Running dc2afs for DET near REL cedar_phy STR sntp Processing 36 months ... FINISHED Fri May 25 14:23:17 CDT 2007 ####### # LSF # ####### kschu reminded me : we used up the available licenses setting up our minos queue perhaps older slow FNALU nodes could retire ? he has unique knowledge on configuring the minos queue, no NFS shared LSF config directory, files must be rsync'd by admins ########### # ENSTORE # ########### __________________________________________ Ticket #: 97982 ___________________________________________ Short Description: The /pnfs/minos file system has disappeared Problem Description: Sometime after Fri May 25 17:43:54 CDT 2007 and before Fri May 25 17:53:56 CDT 2007 the /pnfs/minos files seem to have disappeared. Likewise, /pnfs/cdf is gone. I cannot list files directly via our PNFS mounts ( i.e. on fnpcsrv1 ) I cannot list files via ftp or srmls. I see that the Enstore library managers are paused. But the ball is not red. ____________________________________________________________________ I disabled cron at kreymer@minos26 mindata@minos26 minfarm@fnpcsrv1 ( mv NOCAT.ok NOCAT ) Note also dbu failures for beam and DCS files this morning in predator. 19:50 - berg announces system up 1 hour, checking Minos writes 22:40 - found bad listing for some directories in normal ftp, and srm beam_data fardet_data/2007-05 neardet_data mcout_data/R0.8.0 SRV1> SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fardet_data/2007-05 SRV1> srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fardet_data/2007-05 SRV1> SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/neardet_data/2007-05 SRV1> srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/neardet_data/2007-05 SRV1> SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/neardet_data SRV1> srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/neardet_data Spoke to Berg around midnight, he will call experts (Vladimir) to restart ftp servers. Note that I can copy a file successfully from one of these lost directories via dccp : /pnfs/minos/beam_data/2004-12/B041201_195652.mbeam.root 00:39 - still waiting word, FTP down. 
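For the record, the dccp sanity check mentioned above was along these lines ( a sketch only - the exact command was not logged, and the choice of the kerberized dcap door on port 24736 is an assumption based on other entries in this log ) :
# sketch - copy one raw file out of a 'lost' directory to show dcap still serves it
FILE=beam_data/2004-12/B041201_195652.mbeam.root
dccp dcap://fndca1.fnal.gov:24736/pnfs/fnal.gov/usr/minos/${FILE} /tmp/`basename ${FILE}`
ls -l /tmp/`basename ${FILE}`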
zzzzzzzz 13:09 srmls is working, but srmcp fails : SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr IFILE=N00004502_0000.mdaq.root IPATH=minos/neardet_data/2004-11 SFILE=${SPATH}/${IPATH}/${IFILE} srmcp -streams_num=1 -server_mode=active \ $SFILE file:///TEST.dat Dcap Version version-1-2-38 Jan 4 2006 10:11:51 Allocated message queues 0, used 0 Allocated message queues 1, used 1 Creating a new control connection to stkendca2a.fnal.gov:24725. Activating IO tunnel. Provider: [/fnal/ups/prd/dcap/v2_38_f0512/Linux-2-4/lib/libgssTunnel.so]. Added IO tunneling plugin /fnal/ups/prd/dcap/v2_38_f0512/Linux-2-4/lib/libgssTunnel.so for stkendca2a.fnal.gov:24725. Sending control message: 0 0 client hello 0 0 2 38 -uid=10871 -pid=31746 -gid=5111 errrrr, this is using dcap, not GridFTP. Trying another file, in a safe path : ============================================================================= 2007 05 24 ########## # DCACHE # ########## PNFS/FTP went down on schedule 07:00 Report from howie, ticket 97825 of dcache unavailable srmcp failed 2007-05-23 22:30:47 Up at about 10:35, but ftp is still down Ticket 97867 12:06 ftp fixed by litvinse ######### # STAGE # ######### cedar restores finished last night : STARTING Thu May 17 15:59:31 CDT 2007 FINISHED Wed May 23 18:38:06 CDT 2007 ####### # SAM # ####### 08:25:50 DB patches have been deployed successfully. Minosprd is available for use. sam locate foo ./sam_test_py minos http://www-numi.fnal.gov/computing/findrun_sam.html selected recent raw files ######## # FARM # ######## Undeclaring files processed with wrong field, MINOS26 > sam list files --nosummary --dim='FILE_NAME like N00012252%cand%root' N00012252_0000.cosmic.cand.cedar.0.root N00012252_0000.spill.cand.cedar.0.root N00012252_0001.cosmic.cand.cedar.0.root N00012252_0001.spill.cand.cedar.0.root N00012252_0002.cosmic.cand.cedar.0.root N00012252_0002.spill.cand.cedar.0.root N00012252_0003.cosmic.cand.cedar.0.root N00012252_0003.spill.cand.cedar.0.root N00012252_0004.cosmic.cand.cedar.0.root N00012252_0004.spill.cand.cedar.0.root FILES=`sam list files --nosummary --dim='FILE_NAME like N00012252%cand%root'` for FILE in ${FILES} ; do echo ${FILE} ; sam undeclare ${FILE} ; done ####### # SAM # ####### Preparing sample query for Nikki, using http://www-numi.fnal.gov/computing/findrun_sam.html and grabbing the dimension from dbs sam Nov 11e constraints --summaryOnly --dim="data_tier sntp-near and VERSION_ANALYZED like r1.18.4" sam list files --dim="run_type physics% and file_name like N000%.spill.sntp.R1_18_4.0.root and start_time <= to_date('2006-09-30','yyyy-mm-dd') and end_time >= to_date('2006-09-24','yyyy-mm-dd')" SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and start_time >= to_date('2006-11-01','yyyy-mm-dd') \ and end_time <= to_date('2006-11-30','yyyy-mm-dd') " GRRRRRRRRRR have been fighting for years with and FAMILY_ANALYZED reco \ and APPL_NAME_ANALYZED loon \ and VERSION_ANALYZED cedar \ Nikki discovered that VERSION works fine sam get registered dimensions - lists VERSION sam get dimension info - seems consistent with this usage Try cedar_phy ####### # SAM # ####### Example for brebel SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_near/cedar_phy/sntp_data/2006-11 \ " MINOS26 > sam list files --dim="${SAMDIM}" --nosummary N00011295_0000.spill.sntp.cedar_phy.0.root N00011295_0002.spill.sntp.cedar_phy.0.root 
N00011277_0000.spill.sntp.cedar_phy.0.root N00011200_0000.spill.sntp.cedar_phy.0.root N00011176_0000.spill.sntp.cedar_phy.0.root N00011259_0000.spill.sntp.cedar_phy.0.root FILES=`sam list files --dim="${SAMDIM}" --nosummary` for FILE in ${FILES} ; do PNFS=`sam locate ${FILE} | tr "'" \\\n | grep ^/pnfs/minos | cut -f 1 -d ,` printf "dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/${PNFS/\/pnfs\/}/${FILE}\n" done ######## # FARM # ######## used this to get list of recent ND files for reprocessing ( did not yet know to use VERSION ) SAMDIM="\ FAMILY_ANALYZED reco \ and APPL_NAME_ANALYZED loon \ and VERSION_ANALYZED cedar \ and RUN_NUMBER >= 12191 \ and RUN_NUMBER <= 12210 \ and FULL_PATH like /pnfs/minos/reco_near/cedar_phy/2007-05 \ " sam list files --dim="${SAMDIM}" --nosummary FILES=`sam list files --dim="${SAMDIM}" --nosummary | sort` 15:12 for FILE in ${FILES} ; do sam undeclare ${FILE} ; done ########## # DC2AFS # ########## Hacked it to use recodata??? dc2afs -n -d far -r cedar_phy -s sntp Started this for real around 21:30 2007-03 87/ 87 recodata108 47523102 F00037832_0000.spill.sntp.cedar_phy.0.root 66925650 bytes in 3 seconds (21785.69 KB/sec) FINISHED Fri May 25 08:07:48 CDT 2007 A couple of false starts, getting the overprinting clean and suppressing dccp output Need to toss in a SPACER call up front to clean up skip messages ============================================================================= 2007 05 23 ########## # DCACHE # ########## Prepare for DCache outage predator MINOS26 > echo 'crontab -r' | at 03:30 mcimport M26 > echo 'crontab -r' | at 03:30 job 21 at 2007-05-24 03:30 corral SRV1> echo 'mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT' \ | at 03:30 ####### # AFS # ####### Added Gemma Tinti to the minos group pts membership minos | sort pts adduser -user tinti -group minos ######## # FARM # ######## corral - Re-enabled mcnear now that roundup supports D01 ####### # AFS # ####### Requested 5 volumes for nue analysis group, per mayly, access to boehm d241 d242 d243 d244 d245 minos:admin rlidwka boehm rlidwka msanchez rlidwka ####### # AFS # ####### Planning to pull all cedar_phy sntp to AFS du -sh /pnfs/minos/reco_near/cedar_phy/sntp_data 217G /pnfs/minos/reco_near/cedar_phy/sntp_data du -sh /pnfs/minos/reco_far/cedar_phy/sntp_data 370G /pnfs/minos/reco_far/cedar_phy/sntp_data for completeness, we don't need these in AFS : du -sh /pnfs/minos/reco_near/cedar_phy/mrnt_data 41G /pnfs/minos/reco_near/cedar_phy/mrnt_data du -sh /pnfs/minos/reco_far/cedar_phy/.bntp_data 51G /pnfs/minos/reco_far/cedar_phy/.bntp_data cd $MINOS_DATA/d10/indexes wc -l *.index | sort -n A short one, 5 files, is 2006-08_near.R1_18_4.index A typical sntp at 519 is 2006-02_near.cedar.index Three mc_near are 10K+ AFSLD=/afs/fnal.gov/files/expwww/numi/html/computing/dh/afssum for INDEX in `ls *.index` ; do (( SUM = 0 )) for FILE in `cat ${INDEX}` ; do SIZ=`ls -l ../${FILE} | tr -s ' ' | cut -f 5 -d ' '` (( SUM += SIZ )) done SUM=`echo "${SUM} / 1000000000" | bc` printf "%5d %s\n" ${SUM} ${INDEX} done 2>&1 | tee ${AFSLD}/indexsum.20070523 sort -n ${AFSLD}/indexsum.20070523 ... 
50 mc_far.R1.14.index 57 2006-06_near.cedar.index 59 mc_near.R1_18_2.index 72 mc_far.daikon_00.cedar.index 120 mc_cosmic.bfld201.cedar.index 276 mc_near.carrot_06.R1_18_2.index 299 mc_near.carrot_06.cedar.index 842 mc_near.daikon_00.cedar.index cd $MINOS_DATA/d10 (( QUOT = 0 )) (( USED = 0 )) for DIR in `ls -d recodata*` ; do LSQ=`fs listquota ${DIR} | grep -v Quota | tr -s ' '` QUO=`echo ${LSQ} | cut -f 2 -d ' '` USE=`echo ${LSQ} | cut -f 3 -d ' '` (( QUOT += QUO )) (( USED += USE )) echo ${QUO} ${USE} done (( FREE = QUOT - USED )) (( QUOT /= 1000000 )) (( USED /= 1000000 )) (( FREE /= 1000000 )) printf "QUOTA ${QUOT}\nUSED ${USED}\nFREE ${FREE}\n" QUOTA 4362 USED 4221 FREE 140 We have 14 8 GB volumes d10 d11 d21 d22 d46 d47 d48 d49 d71 d72 d73 d74 d75 d76 Have submitted a request for 14 move 50 GB volumes. We'll see whether we have the space available. Volumes created 14:00, by inkmann, thanks !! ########## # DC2AFS # ########## dc2afs - script to move a release's ntuples from PNFS to AFS. taking bits from mcimport.20070509 dc2afs -n -d far -r cedar_phy -s sntp Checked out directories, had to cd d252 mv recodata105 recodata106 ============================================================================= 2007 05 22 ####### # SAM # ####### Test extreme sam query times, locally and on CDF database FILESM=`ls /pnfs/minos/fardet_data/2007-02` printf "${FILESM}\n" | wc -w 1004 FILES=`printf "${FILESM}\n" | head -10` FIRST=`printf "${FILES}\n" | head -1` FREST=`printf "${FILES}\n" | tail +2` FRESC=`for FI in ${FREST} ; do printf ", ${FI}" ; done` SAMDIM="( FILE_NAME ${FIRST} ${FRESC} )" time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } 10 real 0m1.703s user 0m0.920s sys 0m0.170s In dbs log, 13:38:09 SqlBuilderImpl.buildSqlQuery < 1 second 13:38:09 DbCore < 1 second Now try 1000 1000 real 1m15.183s user 0m1.320s sys 0m0.280s 13:43:42 SqlBuilderImpl.buildSqlQuery 13:44:50 rpn.infix2dims> rpnList = 13:44:50 rpn.infix2dims> returning dims = 13:44:50 DbCore ... 13:44:50 DbCore 13:44:55 DbFunctions.query::ALARM(2)> exec = 3.556872 secs 13:44:55 DbCore(servantId=98780).query[connId=5]> 1000 rows found CDF fcdflnx4: find /pnfs/cdfen/filesets/GJ/GJ00/ -type f | wc -l FILES=`find /pnfs/cdfen/filesets/GJ/GJ00/ -type f | head -1000 \ | cut -f 9 -d /` 1000 real 1m10.558s user 0m1.230s sys 0m0.210s ######## # FARM # ######## Inventory of roundup/corral, now that the DCache backlog is clear cedarfar PEND - have 23/24 subruns for F00038021_*.all.sntp.cedar*.root 2 05/19 23:37 most got processed Monday morning, 1 short wait a while cedarnear PEND - have 17/24 subruns for N00012145_*.cosmic.sntp.cedar*.root 18 05/03 07:38 missing cand 0017-0023 PEND - have 23/24 subruns for N00012197_*.cosmic.sntp.cedar*.root 7 05/14 15:44 missing cand for _0002 PEND - have 1/18 subruns for N00012200_*.cosmic.sntp.cedar*.root 7 05/14 19:02 all 18 written May 14 18:19 SRV1> dds /grid/data/minos/nearcat/N00012200* -rw-rw-r-- 1 rubin numi 29524001 May 14 19:02 /grid/data/minos/nearcat/N00012200_0004.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin numi 68729489 May 14 19:03 /grid/data/minos/nearcat/N00012200_0004.spill.sntp.cedar.0.root these are duplicates, mv /grid/data/minos/nearcat/N00012200* /grid/data/minos/minfarm/DUP/ PEND - have 20/24 subruns for N00012231_*.cosmic.sntp.cedar*.root 3 05/19 02:37 missing cand 09-11,14 cedarmcnear OK cedar_phyfar PEND 54 different runs, from back to 5/11, let things drain a bit cedar_phynear PEND 14 runs, back to 5/8 ,caught up. 
Digging into cedar_phynear ######## # FARM # ######## The above looks close enough, let's try mockfar ./roundup -W -M -n -r cedar_phy mockfar would add 99 files ./roundup -W -M -r cedar_phy mockfar added 99 to WRITE ./roundup -W -M -r cedar_phy mockfar Tue May 22 11:27:55 CDT 2007 running about 15 sec/file we will be out of the way before the noon corral cycle. ######## # FARM # ######## Correcting file families for daikon_01 files written, FILES=' n13011403_0000_L010185N_D01.sntp.cedar.root n13011406_0000_L010185N_D01.sntp.cedar.root n13011407_0000_L010185N_D01.sntp.cedar.root ' ( cd /pnfs/minos/mcout_data/cedar/near/daikon_01/L010185N/cand_data ; \ enstore pnfs --file_family reco_mc_near_cedar_cand ) ( cd /pnfs/minos/mcout_data/cedar/near/daikon_01/L010185N/mrnt_data ; \ enstore pnfs --file_family reco_mc_near_cedar_mrnt ) ( cd /pnfs/minos/mcout_data/cedar/near/daikon_01/L010185N/sntp_data ; \ enstore pnfs --file_family reco_mc_near_cedar_sntp ) mkdir /pnfs/minos/mcout_data/cedar/near/daikon_01/L010185N/sntp_data/140 DAI1=/pnfs/minos/mcout_data/cedar/near/daikon_01/L010185N/sntp_data/140/ DAI0=/pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/140/ for FILE in ${FILES} ; do mv ${DAI0}/${FILE} ${DAI1}/${FILE} ; done 12:10 ########### # ROUNDUP # ########### roundup.20070522 Updated DETI to use F0, N0 for far, near, to avoid conflict with mock Need to set MCREL from file name, to handle daikon_00 and daikon_01 Better yet, check and bail on any other than D??=daikon for now. Tested on 1 file AFSS/roundup.20070522 -W -M -s n13011401 -r cedar mcnear AFSS/roundup.20070522 -w -s n13011401 -r cedar mcnear Looks good. Ran the rest, not all of which have all subruns in mcin. cp AFSS/roundup.20070522 . ln -sf roundup.20070522 roundup ./roundup -r cedar mcnear Tue May 22 15:17:59 CDT 2007 ######## # FARM # ######## cosmic sntp files are in /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE/sntp_data but only cosmic with runs under 10000 are there. cosmic/bfld201_lowE is for far det files, runs under 10000 cosmoc/near is for near det files, runs over 10000 cd /pnfs/minos/mcout_data/cedar/cosmic ls near/cand_data | wc -l 995 ls bfld201_lowE/sntp_data/c1001* | wc -l 995 I need to move these 995 files now : FILES=`ls bfld201_lowE/sntp_data/c1001* | cut -f 3 -d /` for FILE in ${FILES} ; do mv bfld201_lowE/sntp_data/${FILE} near/sntp_data/${FILE} usleep 200000 ; done Done around 15:55 ============================================================================= 2007 05 21 ######## # FARM # ######## DCCP limited this morning, due to continued cand backlog. cedar_phyfar has lots of pending runs from around 5/11 cedar_phynear - lots pending, nothing added since Sat. /grid/data is up to 180 GB, mostly in WRITE, throttled ######### # STAGE # ######### Still cranking on cedar far, doing far 2005-04, down the home stretch. Most files seem to be needed for this patch of data. ########### # ROUNDUP # ########### roundup.20070518 Putting this into production cp AFSS/roundup.20070518 . 
ln -sf roundup.20070518 roundup ./roundup -r cedar far ./roundup -r cedar near restored 0 length sntp from Friday, below disabled cronjob, while running these manually ./roundup -r cedar_phy far hit a backlog after writing most files queue has backed off a bit, try again ./roundup -r cedar_phy far ######## # FARM # ######## Removed 0 length sntp file written Friday 05:01 MINOS26 > dds /pnfs/minos/reco_far/cedar_phy/sntp_data/2006-11/F00037028_0000.all.sntp.cedar_phy.0.root -rw-r--r-- 1 kreymer e875 0 May 19 05:01 /pnfs/minos/reco_far/cedar_phy/sntp_data/2006-11/F00037028_0000.all.sntp.cedar_phy.0.root srmcp retried 3 times, them failed due to existing file. MINOS26 > rm /pnfs/minos/reco_far/cedar_phy/sntp_data/2006-11/F00037028_0000.all.sntp.cedar_phy.0.root Moved this file back from DUP SRV1> mv DUP/F00037028_0000.all.sntp.cedar_phy.0.root WRITE/F00037028_0000.all.sntp.cedar_phy.0.root ######## # FARM # ######## Duplicate runs with version 1, from Rubin : FILES=' N00011609_0004.spill.cand.cedar_phy.1.root N00011609_0006.spill.cand.cedar_phy.1.root N00011609_0007.spill.cand.cedar_phy.1.root N00011609_0012.spill.cand.cedar_phy.1.root N00011609_0013.spill.cand.cedar_phy.1.root N00011609_0014.spill.cand.cedar_phy.1.root N00011609_0017.spill.cand.cedar_phy.1.root N00011609_0018.spill.cand.cedar_phy.1.root N00011640_0000.spill.cand.cedar_phy.1.root N00011687_0003.spill.cand.cedar_phy.1.root N00011687_0010.spill.cand.cedar_phy.1.root N00011687_0011.spill.cand.cedar_phy.1.root N00011687_0012.spill.cand.cedar_phy.1.root N00011687_0014.spill.cand.cedar_phy.1.root N00011687_0015.spill.cand.cedar_phy.1.root N00011687_0016.spill.cand.cedar_phy.1.root N00011687_0019.spill.cand.cedar_phy.1.root N00011687_0020.spill.cand.cedar_phy.1.root N00011687_0021.spill.cand.cedar_phy.1.root N00011687_0023.spill.cand.cedar_phy.1.root N00011707_0000.spill.cand.cedar_phy.1.root N00011707_0001.spill.cand.cedar_phy.1.root N00011707_0002.spill.cand.cedar_phy.1.root N00011707_0004.spill.cand.cedar_phy.1.root N00011707_0006.spill.cand.cedar_phy.1.root N00011707_0007.spill.cand.cedar_phy.1.root N00011707_0008.spill.cand.cedar_phy.1.root N00011707_0011.spill.cand.cedar_phy.1.root N00011707_0013.spill.cand.cedar_phy.1.root N00011728_0001.spill.cand.cedar_phy.1.root N00011728_0002.spill.cand.cedar_phy.1.root N00011728_0003.spill.cand.cedar_phy.1.root N00011728_0007.spill.cand.cedar_phy.1.root ' for FILE in ${FILES} ; do printf "${FILE}\n" ; ls -1 /grid/data/minos/nearcat/${FILE:0:20}* ; done There were sntp and mrnt files for all these. Move them to DUP/pass1 intending to remove them entirely. mkdir /grid/data/minos/minfarm/DUP/pass1/ for FILE in ${FILES} ; do printf "${FILE}\n" mv /grid/data/minos/nearcat/${FILE:0:20}* /grid/data/minos/minfarm/DUP/pass1/ done Check SAM locations for FILE in ${FILES} ; do sam locate ${FILE} ; done all have locations except Datafile with name 'N00011687_0003.spill.cand.cedar_phy.1.root' not found. for FILE in ${FILES} ; do FOLD=${FILE:0:36}0.root ; sam locate ${FOLD} ; done all have locations unknown volume except N00011687_0003.spill.cand.cedar_phy.0.root Make a shortened list FILSAM=${FILES/N00011687_0003.spill.cand.cedar_phy.1.root} for FILE in ${FILSAM} ; do sam undeclare file ${FILE} ; done for FILE in ${FILSAM} ; do FOLD=${FILE:0:36}0.root ; sam undeclare ${FOLD} ; done SRV1> ./roundup -m 2007-01 -r cedar_phy near SRV1> ./roundup -m 2007-02 -r cedar_phy near We're clean now ! 
for RUN in N00011609 N00011640 N00011687 N00011707 N00011728 ; do sam locate ${RUN}_0000.spill.mrnt.cedar_phy.0.root ; done for RUN in N00011609 N00011640 N00011687 N00011707 N00011728 ; do sam locate ${RUN}_0000.spill.sntp.cedar_phy.0.root ; done for RUN in N00011609 N00011640 N00011687 N00011707 N00011728 ; do sam get metadata --file=${RUN}_0000.spill.mrnt.cedar_phy.0.root | grep parent done for RUN in N00011609 N00011640 N00011687 N00011707 N00011728 ; do sam get metadata --file=${RUN}_0000.spill.sntp.cedar_phy.0.root | grep parent done ============================================================================= 2007 05 20 ######## # FARM # ######## Writing cosmic MC files to PNFS. Note that cosmic MC file naming is entirely different from centrally produced files, and in many cases conflicts. See Swallowing my pride, and at great risk of duplicting file names previously used, I will simply move them as requested to PNFS. From /grid/data/minos/mccosmic /grid/data/minos/mccosmiccat To /pnfs/minos/mcout_data/cedar/cosmic/ SRV1> du -sm /grid/data/minos/mccosmic* 25848 /grid/data/minos/mccosmic 69217 /grid/data/minos/mccosmiccat setup dcap # kerberized DCPOR=24736 RELE=cedar COSMIC CAND cd /grid/data/minos/mccosmic STRM=cand FILES=`ls -1 *${STRM}*${RELE}\.root` Need to pick up stray cands from mccosmic SRV1> ls /grid/data/minos/mccosmic | wc -l 46 SRV1> ls /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE/cand_data | wc -l 175 RSPA=minos/mcout_data/cedar/cosmic/bfld201_lowE/${STRM}_data DOUT=/dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}purge.log for FILE in ${FILES} ; do PFIL=${POUT}/${FILE} if [ -r "${PFIL}" ] ; then PINFO=`(cd ${POUT} ; cat ".(use)(4)(${FILE})" | tr '\n' '\t')` ECRC=`printf "${PINFO}" | cut -f 11` if [ -n "${ECRC}" ] ; then LCRC=`ecrc ${FILE} | tr -s ' ' | cut -f 2 -d ' '` echo " ${FILE}" ${LCRC} ${ECRC} [ ${LCRC} = ${ECRC} ] && echo rm ${FILE} && rm ${FILE} fi ; fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}purge.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}purge.log Test run revealed a duplicate, c10000607_0003.cand.cedar.root mv /grid/data/minos/mccosmic/c10000607_0003.cand.cedar.root \ /grid/data/minos/minfarm/DUP/ Ran the purge for real, Sun May 20 16:35:45 CDT 2007 Now can write the rest : FILES=`ls -1 *${STRM}*${RELE}\.root` FILES=`ls -1 *${STRM}*${RELE}\.root` printf "${FILES}\n" | wc -w 45 printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log for FILE in ${FILES} ; do DFIL=${DOUT}/${FILE} PFIL=${POUT}/${FILE} if [ ! -r ${PFIL} ] ; then echo "NEED" ${FILE} dccp ${FILE} ${DFIL} dccp -P ${DFIL} fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log Sun May 20 16:38:30 CDT 2007 Sun May 20 16:56:22 CDT 2007 cd /grid/data/minos/mccosmiccat STRM=sntp RSPA=minos/mcout_data/cedar/cosmic/bfld201_lowE/${STRM}_data DOUT=/dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} FILES=`ls -1 *${STRM}*${RELE}\.root` printf "${FILES}\n" | wc -w 1194 Cut/paste the dccp commands above : Sun May 20 17:02:59 CDT 2007 Sun May 20 19:18:55 CDT 2007 PURGED CAND FILES ( Forgot to un-comment the initial purge described above, so this re-purges some older files . 
) Sun May 20 19:47:20 CDT 2007 Files were initially not group writeable, ran second pass to pick up c10000607_0000.cand.cedar.root Sun May 20 20:04:25 CDT 2007 Sun May 20 20:04:29 CDT 2007 Moved to mccosmiccat, purged 766 files already on tape : Sun May 20 21:05:25 CDT 2007 Sun May 20 21:20:18 CDT 2007 ============================================================================= 2007 05 19 ######## # FARM # ######## Moved another duplicate to DUP SRV1> mv WRITE/F00037028_0000.all.sntp.cedar_phy.0.root DUP/ ######## # FARM # ######## Shifted local ROUNDUP/DUP files to /grid/data/minos/minfarm/DUP FILES=`ls DUP` Checked for conflicts, there were none SRV1> for FILE in $FILES ; do ls /grid/data/minos/minfarm/${FILE} ; done Copied files SRV1> for FILE in $FILES ; do cp -a DUP/${FILE} /grid/data/minos/minfarm/${FILE} ; done Checked files SRV1> for FILE in $FILES ; do diff DUP/${FILE} /grid/data/minos/minfarm/${FILE} ; done Purged files SRV1> for FILE in $FILES ; do rm DUP/${FILE} ; done Oops, shifted files from minfarm to minfarm/DUP SRV1> for FILE in $FILES ; do ls -l /grid/data/minos/minfarm/${FILE} /grid/data/minos/minfarm/DUP/${FILE} ; done SRV1> for FILE in $FILES ; do mv /grid/data/minos/minfarm/${FILE} /grid/data/minos/minfarm/DUP/${FILE} ; done Relinked DUP SRV1> rmdir DUP ; ln -s /grid/data/minos/minfarm/DUP DUP ########### # ROUNDUP # ########### Certifying roundup.20070518 for general use allowing MDC files to be handled. Previously, these would have been confused with FD files. cedar near and far look OK, comparing to .20070510 ( default ) Write cedar_phy near and far to /tmp for comparison, as these logs are longer. SRV1> AFSS/roundup.20070510 -n -r cedar_phy far 2>&1 | tee /tmp/cpf10 SRV1> AFSS/roundup.20070518 -n -r cedar_phy far 2>&1 | tee /tmp/cpf18 SRV1> diff /tmp/cpf10 /tmp/cpf18 Had to hack to correct AUTODEST for fmock, as had done for MCIN path ########## # DCACHE # ########## Queued stores went up over 3500 midday on 18 May, with nearly 1500 queued restores. These restores should not have been farm activity, as raw data is all on disk. Will have to look at billing files to see the root cause. Almost all the reads now are from flxi04 by jurgen. Why hundreds of reads queued up ? Why to flxi04 ? This is selex data. ============================================================================= 2007 05 18 ######## # FARM # ######## GRRRRRRRRRRRRRRRRRRR ONCE AGAIN, GOING INTO A WEEKEND, A M A J O R CHANGE TO THE MODEL Apparently we will be receiving a substantial number of duplicated files. This has already started to show up as duplicates in WRITE. I am supposed to ignore them, drop them on the floor. This requires some means of detecting them. Not so easy, since things are concatenated. In principle, a grep of the READ and SAM/READ indexes may find the originals. SAM could help, but not for MC files. For the moment, I have moved the duplicated concatenated files to DUP. ######## # FARM # ######## Rubin : Another fact, although I haven't seen anything about it from the systems people, is that dcache was having problems, I think both reading and writing, for several periods yesterday afternoon and evening. If you detect that bad files are bunched, (for example 22:15 - 22:45) that's probably the source.
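The READ / SAM/READ duplicate check described above could be as simple as this ( a sketch only, run from the minfarm area that holds the READ and READ/SAM indexes ; nearcat is just one example, farcat and mcnearcat would be scanned the same way ) :
# sketch - flag incoming subrun files already recorded in a READ or READ/SAM index
for FILE in `ls /grid/data/minos/nearcat` ; do
  grep -q ${FILE} READ/${FILE:0:10}* READ/SAM/${FILE:0:10}* 2>/dev/null \
    && echo DUPE ${FILE}
done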
============================================================================= 2007 05 17 ######### # STAGE # ######### STRM=mrnt date >> ../TRACE for DIR in `ls /pnfs/minos/reco_near/cedar_phy/${STRM}_data` ; do ./stage -w -g MinosPrdReadPools reco_near/cedar_phy/${STRM}_data/${DIR} done 2>&1 | tee -a ../TRACE date >> ../TRACE Thu May 17 08:19:09 CDT 2007 Thu May 17 08:41:30 CDT 2007 date >> ../TRACE REL=cedar_phy DET=far for STRM in `ls -a /pnfs/minos/reco_${DET}/${REL} | grep "bntp\|sntp\|mrnt"` ; do for DIR in `ls /pnfs/minos/reco_${DET}/${REL}/${STRM}` ; do ./stage -w -g MinosPrdReadPools reco_${DET}/${REL}/${STRM}/${DIR} done ; done 2>&1 | tee -a ../TRACE date >> ../TRACE Thu May 17 08:46:43 CDT 2007 Thu May 17 13:34:45 CDT 2007 Now for the old stuff, may take days, using a spare window... printf "\n\n\nSTARTING `date`\n" >> ../TRACE REL=cedar for DET in near far ; do for STRM in `ls -a /pnfs/minos/reco_${DET}/${REL} | grep "bntp\|sntp\|mrnt"` ; do for DIR in `ls /pnfs/minos/reco_${DET}/${REL}/${STRM}` ; do ./stage -w -g MinosPrdReadPools reco_${DET}/${REL}/${STRM}/${DIR} done ; done ; done 2>&1 | tee -a ../TRACE printf "FINISHED `date`\n" >> ../TRACE STARTING Thu May 17 15:59:31 CDT 2007 FINISHED Wed May 23 18:38:06 CDT 2007 ############ # MCIMPORT # ############ Forced recent cosmic MC to disk, only 4 G/10 threshold present now. M26 > ./mcimport -f 60 howcroft Thu May 17 11:47:04 CDT 2007 ============================================================================= 2007 05 16 ############ # MCIMPORT # ############ Mock data has started to arrive from howcroft. Check a slug of these manually. In howcroft, mv NOIMPORT noIMPORT ./mcimport -n -F howcroft Paths look OK to me In howcroft, mv noIMPORT MCIMPORT ./mcimport -F howcroft Failed, proxy has expired Back in fnpcsrv1:/home/minfarm/.grid SRV1> grid-proxy-info -f kreymer-doe.proxy subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=768538851 issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : kreymer-doe.proxy timeleft : 0:00:00 grid-proxy-init -cert kreymer-doe.pem -key kreymer-doekey.pem ( used my usual long many-word pass phrase ) ERROR: Your certificate has expired: Tue May 8 10:08:22 2007 OK, copy my new cert from desktop, where I use it for web browsing Per 2006 10 28 log entry scp kreymer-doe.p12 minfarm@fnpcsrv1:.grid/kreymer-doe.p12 SRV1> openssl pkcs12 -in kreymer-doe.p12 -clcerts -nokeys -out kreymer-doe.pem Enter Import Password: MAC verified OK SRV1> openssl pkcs12 -in kreymer-doe.p12 -nocerts -out kreymer-doekey.pem Enter Import Password: MAC verified OK Enter PEM pass phrase: Verifying - Enter PEM pass phrase: chmod 600 kreymer-doe*.pem Get a grid proxy SRV1> grid-proxy-init -cert kreymer-doe.pem -key kreymer-doekey.pem -out kreymer-doe.proxy -valid 999999:00 Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Enter GRID pass phrase for this identity: Creating proxy ...................................................... 
Done Warning: your certificate and proxy will expire Tue Apr 15 11:22:43 2008 which is within the requested lifetime of the proxy SRV1> grid-proxy-info -f kreymer-doe.proxy subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=1467756922 issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : kreymer-doe.proxy timeleft : 8040:34:54 (335.0 days) Now copy this back to minos26:mindata SRV1> scp kreymer-doe.proxy mindata@minos26:.grid/ ./mcimport -F howcroft RequestFileStatus#-2146774239 failed with error:[ at Wed May 16 10:52:38 CDT 2007 state Failed : user has no permission to write into path /pnfs/fnal.gov/usr/minos/mcin_data/fmock/daikon_00/L010185N/000 $ dds /pnfs/minos/mcin_data/fmock/daikon_00/L010185N/000 total 1 drwxr-xr-x 1 rhatcher e875 512 May 9 14:13 ./ drwxr-xr-x 1 rhatcher e875 512 May 9 14:13 ../ As rubin SRV1> mv daikon_00 daikon_00rh SRV1> mkdir daikon_00 SRV1> chmod 775 daikon_00 As kreymer MINOS26 > mkdir /pnfs/minos/mcin_data/fmock/daikon_00/L010185N MINOS26 > chmod 775 /pnfs/minos/mcin_data/fmock/daikon_00/L010185N MINOS26 > mkdir /pnfs/minos/mcin_data/fmock/daikon_00/L250200N MINOS26 > chmod 775 /pnfs/minos/mcin_data/fmock/daikon_00/L250200N This is running, as of 10:59 Finished 11:25, after a 5 minute delay early on Minos26 data rates plateaued around 6 Mbytes/second, 11:10 to 11:25 ########### # ROUNDUP # ########### roundup.20070516 Added dccp -P prestage of all files. Had been omitted prior to Apr 27 deployment of MinosPrdReadPools ######### # STAGE # ######### ./stage -d -p 0 -g MinosPrdReadPools reco_near/cedar_phy/sntp_data/2007-02 Needed 9/ 9 ./stage -w -g MinosPrdReadPools reco_near/cedar_phy/sntp_data/2007-02 Needed 9/ 9 FINISHED Wed May 16 16:08:28 CDT 2007 ./stage -d -p 0 -g MinosPrdReadPools reco_near/cedar_phy/sntp_data/2007-02 . 
Needed 0/ 9 FINISHED Wed May 16 16:09:15 CDT 2007 for DIR in `ls /pnfs/minos/reco_near/cedar_phy/sntp_data` ; do ./stage -w -g MinosPrdReadPools reco_near/cedar_phy/sntp_data/${DIR} done see TRACE STARTING Wed May 16 16:11:51 CDT 2007 STARTING Wed May 16 16:38:54 CDT 2007 ============================================================================= 2007 05 15 ####### # SAM # ####### Test rapid listing of files declared to sam, for use in saddreco etc, MINOS26 > FILESM=`ls /pnfs/minos/fardet_data/2007-02` MINOS26 > printf "${FILESM}\n" | wc -w 1004 FILES=`printf "${FILESM}\n" | head -10` FIRST=`printf "${FILES}\n" | head -1` FREST=`printf "${FILES}\n" | tail +2` FRESC=`for FI in ${FREST} ; do printf ", ${FI}" ; done` SAMDIM="( FILE_NAME ${FIRST} ${FRESC} )" time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } real 0m1.457s user 0m0.880s sys 0m0.160s for NFI in 1 4 16 64 256 ; do FILES=`printf "${FILESM}\n" | head -${NFI}` FIRST=`printf "${FILES}\n" | head -1` FREST=`printf "${FILES}\n" | tail +2` FRESC=`for FI in ${FREST} ; do printf ", ${FI}" ; done` SAMDIM="( FILE_NAME ${FIRST} ${FRESC} )" time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } done for NFI in 1 4 16 64 256 999; do 1 1 1 real 0m1.615s real 0m1.502s real 0m2.678s user 0m0.870s user 0m0.900s user 0m0.940s sys 0m0.290s sys 0m0.220s sys 0m0.200s 4 4 4 real 0m1.632s real 0m1.716s real 0m1.503s user 0m0.890s user 0m0.860s user 0m0.930s sys 0m0.350s sys 0m0.300s sys 0m0.210s 16 16 16 real 0m1.649s real 0m1.737s real 0m1.531s user 0m0.930s user 0m0.920s user 0m0.880s sys 0m0.290s sys 0m0.260s sys 0m0.250s 64 64 64 real 0m2.005s real 0m2.160s real 0m1.868s user 0m0.950s user 0m0.950s user 0m0.870s sys 0m0.290s sys 0m0.340s sys 0m0.230s 256 256 256 real 0m6.978s real 0m6.269s real 0m6.402s user 0m1.090s user 0m1.030s user 0m1.050s sys 0m0.510s sys 0m0.250s sys 0m0.190s 999 999 real 1m17.383s real 1m11.690s user 0m1.510s user 0m1.350s sys 0m0.140s sys 0m0.330s ( filter out user, sys, from now on ) for NFI in 100 200 300 400 500 600 700 800 900 999 ; do 100 100 real 0m2.434s real 0m5.847s user 0m0.980s user 0m0.930s sys 0m0.240s sys 0m0.150s 200 200 real 0m4.723s real 0m4.622s user 0m0.990s user 0m1.180s sys 0m0.340s sys 0m0.220s 300 300 real 0m8.268s real 0m8.102s user 0m1.040s user 0m1.050s sys 0m0.200s sys 0m0.240s 400 400 real 0m13.307s real 0m12.989s user 0m1.110s user 0m1.260s sys 0m0.210s sys 0m0.190s 500 500 real 0m25.629s real 0m19.199s user 0m1.170s user 0m1.150s sys 0m0.220s sys 0m0.260s 600 600 real 0m54.243s real 0m27.445s user 0m1.200s user 0m1.140s sys 0m0.370s sys 0m0.260s 700 700 real 0m39.220s real 0m36.608s user 0m1.360s user 0m1.330s sys 0m0.260s sys 0m0.220s 800 800 real 0m49.463s real 0m49.048s user 0m1.370s user 0m1.350s sys 0m0.420s sys 0m0.590s 900 900 real 1m6.517s real 0m59.775s user 0m1.430s user 0m1.670s sys 0m0.300s sys 0m0.280s 999 999 real 1m35.493s real 1m12.315s user 0m0.550s user 0m1.660s sys 0m0.260s sys 0m0.150s MINOS26 > FILES=`printf "${FILESM}\n" | head -1002 MINOS26 > time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } real 1m12.128s user 0m0.890s sys 0m0.130s MINOS26 > time { sam list files --nosummary --dim="${SAMDIM}" ; } ...real 1m13.084s user 0m1.450s sys 0m0.220s FILES=`ls /pnfs/minos/fardet_data/2007-02 ; ls /pnfs/minos/fardet_data/2007-03` echo $FILES | wc -w 1841 ... 
time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } ORA-01795: maximum number of expressions in a list is 1000 0 real 3m56.338s user 0m0.830s sys 0m0.160s MINOS26 > time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } ORA-01795: maximum number of expressions in a list is 1000 0 real 3m54.704s user 0m0.920s sys 0m0.320s Looked at dbserver dbg file, all the time is spent in SqlBuilderImpl.buildSqlQuery for NFI in 100 200 300 400 500 600 700 800 900 999 ; do sleep 30 100 1 second in SqlBuilderImpl.buildSqlQuery real 0m3.943s user 0m0.930s sys 0m0.230s 200 2 real 0m4.666s user 0m1.050s sys 0m0.300s 300 7 real 0m8.257s user 0m1.010s sys 0m0.180s 400 11 real 0m13.072s user 0m1.100s sys 0m0.280s 500 17 real 0m19.515s user 0m1.190s sys 0m0.170s 600 25 real 0m27.969s user 0m1.260s sys 0m0.180s 700 34 real 0m37.339s user 0m1.360s sys 0m0.410s 800 44 real 0m47.711s user 0m1.360s sys 0m0.250s 900 56 real 0m59.522s user 0m1.520s sys 0m0.220s 999 69 seconds real 1m12.526s user 0m1.650s sys 0m0.250s This is the same old production Minos sam_db_srv v7_6_1 ######### # VAULT # ######### vault.20070515 Time ordered list of files before encp, for rational order. mv vault_prev vault.20060807 # N.B.- moved this to vault.monthly 2008 04 02 mv vault vault.20070109 ln -s vault.20070515 vault rawsum.20070515 Time ordered list of files before encp, for rational order. mv rawsum.0329a rawsum.20060329 cp rawsum rawsum.20070515 mv rawsum rawsum.20060331 ln -s rawsum.20070515 rawsum ######## # FARM # ######## Complexity calculation for recent running : 2 Detectors ( near/far ) 2 Streams ( spill/cosmic , spill/all ) 4 Types ( data , MC , Mock , cambridge ) 4 Releases ( cedar, R1_24cal, R1_24calB , cedar_phy ) 2 Teams/scripts ( Howie, Art ) ? Calibs ( alpha beta gamma ... final ) 2 Samples ( 1/6 , 5/6 for near data pass ) = 256 ============================================================================= 2007 05 14 ####### # AFS # ####### Requested a new volume for NONAP group MINOS26 > ls -1d $MINOS_DATA/d??? # see what is in use /afs/fnal.gov/files/data/minos/d240 system:administrators rlidwka minos:admin rlidwka habig rlidwka minos rl Expanded minos:admin to match buckley:admin, adding habig pts membership minos:admin for GUSER in boehm dharris messier shanahan ; do pts adduser -user ${GUSER} -group minos:admin ; done pts adduser -user habig -group minos:admin ; done ########### # ENSTORE # ########### Checking that the VO4209 files are all on tape : FILES=`enstore info --list=VO4209 | grep mrnt_data | tr -s ' ' | cut -f 6 -d ' ' | cut -f 1-2,5- -d /` for FILE in ${FILES} ; do FP=`echo ${FILE} | cut -f -7 -d /` # ; echo ${FP} FI=`echo ${FILE} | cut -f 8 -d /` ; printf "\n${FI}\n" ( cd ${FP} ; cat ".(use)(4)(${FI})" | head -2 ) sleep 1 done These are all on VOB506, so can remove the safety copies : 17:18 MINOS26 > rm -r /grid/data/minos/mindata/VO4209 MINOS26 > rm -r /local/scratch26/kreymer/VO4209 ######## # FARM # ######## condor problems, ticket 97158 ######## # FARM # ######## Removed the 'safe' copies written when commissioning roundup.20070510 /grid/data/minos/minfarm/SAFE F00037989* N00012179* mcnearcat ROUNDUP/SAFE ( ${DET}_cedar_phy ) Checking for existence of each file in a READ or READ/SAM file SRV1> for FILE in `ls SAFE/far_cedar_phy` ; do printf "${FILE}" ; grep -q ${FILE} READ/SAM/${FILE:0:10}* ; echo " $?" 
; done 2>&1 | grep -v " 0" SRV1> ls SAFE/near_cedar_phy | wc -l 501 SRV1> for FILE in `ls SAFE/near_cedar_phy` ; do printf "${FILE}" ; grep -q ${FILE} READ/SAM/${FILE:0:10}* ; echo " $?" ; done 2>&1 | grep -v " 0" | wc -l 138 Ah, many of these are pending DET=near (( PEND = 0 )) ; (( OUTS = 0)) ; (( NFIL = 0 )) (( NFIS = `ls SAFE/${DET}_cedar_phy | wc -l` )) for FILE in `ls SAFE/${DET}_cedar_phy` ; do (( NFIL++ )) if [ -r "/grid/data/minos/${DET}cat/${FILE}" ] ; then (( PEND++)) ; else (( OUTS++)) printf "${FILE}" grep -q ${FILE} READ/SAM/${FILE:0:10}* ; echo " STAT=${?}" ; fi [ ${NFIL} -eq ${NFIS} ] && printf " NFIL ${NFIL} 0\n PEND ${PEND} 0\n OUTS ${OUTS} 0\n" done 2>&1 | grep -v "STAT=0" NFIL 501 0 PEND 136 0 OUTS 365 0 DET=far NFIL 3695 0 PEND 340 0 OUTS 3355 0 ln -s /grid/data/minos/minfarm/SAFE GDS DET=far (( PEND = 0 )) ; (( OUTS = 0)) ; (( NFIL = 0 )) (( NFIS = `ls GDS/F00037989* | wc -l` )) for FILE in `ls GDS/F00037989*| cut -f 2 -d /` ; do NFIL 60 0 PEND 0 0 OUTS 60 0 DET=near (( PEND = 0 )) ; (( OUTS = 0)) ; (( NFIL = 0 )) (( NFIS = `ls GDS/N00012179* | wc -l` )) for FILE in `ls GDS/N00012179*| cut -f 2 -d /` ; do NFIL 46 0 PEND 2 0 OUTS 44 0 Pending : N00012179_0018.cosmic.sntp.cedar.0.root N00012179_0018.spill.sntp.cedar.0.root for FILE in N00012179_0018.cosmic.sntp.cedar.0.root \ N00012179_0018.spill.sntp.cedar.0.root ; do ls -l GDS/${FILE} /grid/data/minos/nearcat/${FILE} diff GDS/${FILE} /grid/data/minos/nearcat/${FILE} ; done -rw-rw-r-- 1 minfarm numi 29916893 May 10 17:45 GDS/N00012179_0018.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin numi 29916893 May 10 17:45 /grid/data/minos/nearcat/N00012179_0018.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 minfarm numi 79883531 May 10 17:45 GDS/N00012179_0018.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin numi 79883531 May 10 17:45 /grid/data/minos/nearcat/N00012179_0018.spill.sntp.cedar.0.root These were in badruns through Fri May 11 15:39:18 They are still there, error type 1. DET=mcnear (( PEND = 0 )) ; (( OUTS = 0)) ; (( NFIL = 0 )) (( NFIS = `ls GDS/mcnearcat | wc -l` )) for FILE in `ls GDS/mcnearcat | cut -f 2 -d /` ; do grep -q ${FILE} READ/${FILE:0:10}* ; echo " STAT=${?}" ; fi NFIL 184 0 PEND 1 0 OUTS 183 0 Pending : n13011765_0002_L010185N_D00.sntp.cedar.root This seems to be a duplicate ! Its behaviour in HADDLOG/2007-05/cedarmcnear.log is unremarkable I have moved it to DUP FILE=n13011765_0002_L010185N_D00.sntp.cedar.root ls -l /grid/data/minos/mcnearcat/${FILE} /grid/data/minos/minfarm/DUP/${FILE} mv /grid/data/minos/mcnearcat/${FILE} /grid/data/minos/minfarm/DUP/${FILE} for FILE in n13011765_0002_L010185N_D00.sntp.cedar.root ; do ls -l GDS/mcnearcat/${FILE} /grid/data/minos/mcnearcat/${FILE} diff GDS/mcnearcat/${FILE} /grid/data/minos/mcnearcat/${FILE} ; done -rw-rw-r-- 1 minfarm numi 67927529 May 11 03:23 GDS/mcnearcat/n13011765_0002_L010185N_D00.sntp.cedar.root -rw-rw-r-- 1 rubin numi 67927529 May 11 03:23 /grid/data/minos/mcnearcat/n13011765_0002_L010185N_D00.sntp.cedar.root I see no need to keep any of these safety copy areas. 
17:00 SRV1> rm /grid/data/minos/minfarm/SAFE/N* SRV1> rm /grid/data/minos/minfarm/SAFE/F* SRV1> rm -r /grid/data/minos/minfarm/SAFE/mcnearcat SRV1> rmdir /grid/data/minos/minfarm/SAFE/farphy SRV1> rm -r SAFE SRV1> rmdir /grid/data/minos/minfarm/SAFE # FARM # ============================================================================= 2007 05 11 ######## # FARM # ######## MOVING ROUNDUP/WRITE TO /grid/data/minos/minfarm/WRITE PLAN : 0) Copy the stray duplicates from WRITE to /grid/data/minos/DUPS 1) Around 11:00, most files in WRITE should be on tape Purge them with a -w -M pass 2) Copy the remaining files to /grid/data/minos/minfarm/WRITE 3) change the existing write directory to a symlink Execution : 0) done 08:39 FILES=' N00012135_0013.cosmic.cand.cedar.0.root N00012135_0021.cosmic.cand.cedar.0.root n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root ' for FILE in ${FILES} ; do cp -a ${FILE} /grid/data/minos/DUP/ diff ${FILE} /grid/data/minos/DUP/ done for FILE in ${FILES} ; do rm ${FILE} ; done Shifted DUP under minos/minfarm mv /grid/data/minos/DUP /grid/data/minos/minfarm/DUP 1) ${HOME}/scripts/roundup -c -M -w -r cedar mcnear # done 11:00 ${HOME}/scripts/roundup -c -M -w -r cedar_phy near # done 11:51 ${HOME}/scripts/roundup -c -M -w -r cedar_phy far # done 13:2 only 51/56 of cedar_phy far are on tape at 13:20 only 41 MB, set set aside and copy 2) 14:05 cp -vax /export/stage/minfarm/ROUNDUP/WRITE \ /grid/data/minos/minfarm/WRITE mv /export/stage/minfarm/ROUNDUP/WRITE \ /export/stage/minfarm/ROUNDUP/WRITEold ln -s /grid/data/minos/minfarm/WRITE \ /export/stage/minfarm/ROUNDUP/WRITE IMPACT The existing roundup script accesses the WRITE area entirely by doing 'cd' . The old WRITE becoming a symlink will have no impact. The script does a 'mv' of Merged.root to WRITE, should work OK roundup.20070510 and later will go direct to /grid/data/minos/minfarm/WRITE TESTING Purged the slow WRITE files AFSS/roundup.20070510 -c -M -w -r cedar_phy far Ran -n test pass of all types AFSS/roundup.20070510 -n -r cedar far AFSS/roundup.20070510 -n -r cedar near AFSS/roundup.20070510 -n -M -r cedar mcnear AFSS/roundup.20070510 -n -r cedar_phy near AFSS/roundup.20070510 -n -r cedar_phy far Just 1 run in far, can test cleanly ? 
mkdir /grid/data/minos/minfarm/SAFE cp -va /grid/data/minos/farcat/F00037989* /grid/data/minos/minfarm/SAFE AFSS/roundup.20070510 -W -r cedar far Fri May 11 15:17:54 CDT 2007 Wrote output to WRITE, READ files look OK ECRC files look OK AFSS/roundup.20070510 -w -r cedar far Fri May 11 15:21:28 CDT 2007 SAM declares seem valid, SAM/READ files are there Cleaned up the dups in nearcat, checked first with diff rm -f /grid/data/minos/nearcat/N00012135_0013.cosmic.cand.cedar.0.root rm -f /grid/data/minos/nearcat/N00012135_0021.cosmic.cand.cedar.0.root cp -va /grid/data/minos/nearcat/N00012179* /grid/data/minos/minfarm/SAFE AFSS/roundup.20070510 -c -r cedar near Fri May 11 15:39:17 CDT 2007 Looking good, SAM declared worked for all 4 files cp -vax /grid/data/minos/mcnearcat /grid/data/minos/minfarm/SAFE/mcnearcat AFSS/roundup.20070510 -c -M -r cedar mcnear Fri May 11 16:30:01 CDT 2007 Cleaned up the dup in nearcat, checked first with diff rm -f /grid/data/minos/mcnearcat/n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root Make local copies for safety, too many files for double network xfer File list is too long for simple copy mkdir SAFE/near_cedar_phy # under ROUNDUP FILES=`find /grid/data/minos/nearcat -name \*cedar_phy\* | cut -f 6 -d /` time for FILE in ${FILES} ; do cp -va /grid/data/minos/nearcat/${FILE} SAFE/near_cedar_phy/${FILE} done real 7m50.735s user 0m4.052s sys 3m44.384s Corrected roundup to use Merged.${}.root not Merged.root for thread safety AFSS/roundup.20070510 -c -s N00008011 -r cedar_phy near looks clean, Merged.446.root grew as expected AFSS/roundup.20070510 -c -r cedar_phy near Fri May 11 17:45:14 CDT 2007 Fri May 11 18:06:07 CDT 2007 mkdir SAFE/far_cedar_phy # under ROUNDUP FILES=`find /grid/data/minos/farcat -name \*cedar_phy\* | cut -f 6 -d /` time for FILE in ${FILES} ; do cp -va /grid/data/minos/farcat/${FILE} SAFE/far_cedar_phy/${FILE} done real 7m3.847s user 0m5.719s sys 2m10.599s This is looking good, released cron while doing final catchup cp AFSS/roundup.20070510 . ln -sf roundup.20070510 roundup mv NOCAT NOCAT.old ./roundup.20070510 -c -r cedar_phy far Fri May 11 18:51:29 CDT 2007 ########### # ENSTORE # ########### Some tape mounts have been queues for over a half hour, delaying farm output We are producing 17 data streams on the farm So that's roughly 17 tape mounts/6 hours, 3 per hour. 3 cedar_phy near 4 cedar_phy far 3 cedar near 4 cedar far 3 cedar mcnear That's our intent. Is reality different ? Enstore Drives Fri May 11 11:39:28 CDT 2007 label mover tot.time status system_inhibit rq. 
host updated volume family VO4357 9940B26.mover 2005 DISMOUNT_WAIT (579 ) (none none) southport 05-11-07 11:39:28 miniboone.OpenRootTree.cpio_odc VOB796 9940B24.mover 31 MOUNT_WAIT (4 ) (none none) stkendca11a 05-11-07 11:39:03 minos.reco_far_cedar_phy_sntp.cpio_odc VOD544 9940B33.mover 2373 SETUP (0 ) (none none) southport 05-11-07 10:59:57 miniboone.TankData.cpio_odc VO2146 9940B34.mover 2402 DISMOUNT_WAIT (159 ) (none full) stkendca13a 05-11-07 11:39:09 astro.astro.cpio_odc VO4078 9940B22.mover 2333 MOUNT_WAIT (2305 ) (none none) stkendca18a 05-11-07 11:39:03 exp-db.daily-d0-offline.cpio_odc VO7256 9940B21.mover 572 MOUNT_WAIT (539 ) (none full) flxi04 05-11-07 11:38:57 selex.selex.cpio_odc VOC316 9940B40.mover 505 MOUNT_WAIT (494 ) (none none) minos01 05-11-07 11:39:22 minos.cedar_antp.cpio_odc VO7708 9940B15.mover 2301 ACTIVE-READ (0 ) (none full) flxi04 05-11-07 11:39:28 selex.selex.cpio_odc VOB506 9940B36.mover 2120 MOUNT_WAIT (2099 ) (none none) stkendca9a 05-11-07 11:39:08 minos.reco_near_cedar_phy_mrnt.cpio_odc VOB135 9940B35.mover 1476 ACTIVE-WRITE (4 ) (none none) stkendca11a 05-11-07 11:39:08 minos.reco_near_cedar_phy_cand.cpio_odc VOC295 9940B25.mover 1189 ACTIVE-WRITE (4 ) (none none) stkendca9a 05-11-07 11:39:11 minos.reco_mc_near_cedar_cand.cpio_odc VOB549 9940B41.mover 388 SEEK (21 ) (none none) stkendca11a 05-11-07 11:39:11 minos.reco_near_cedar_phy_sntp.cpio_odc VO5147 9940B20.mover 14572 MOUNT_WAIT (14540) (none full) stkendca13a 05-11-07 11:38:57 lqcd.lqcd.cpio_odc VO6615 9940B16.mover 2482 ACTIVE-WRITE (35 ) (none none) stkendca10a 05-11-07 11:39:07 exp-db.daily-d0-offline.cpio_odc ########### # ENSTORE # ########### Per email from georges , volume VO4209 has been lost due to a drive error. These are all still in DCache. FILES=`enstore info --list=VO4209 | grep mrnt_data | tr -s ' ' | cut -f 6 -d ' ' | cut -f 1-2,5- -d /` for FILE in ${FILES} ; do FP=`echo ${FILE} | cut -f -7 -d /` # ; echo ${FP} FI=`echo ${FILE} | cut -f 8 -d /` ; echo ${FI} ( cd ${FP} ; cat ".(use)(2)(${FI})" ) | grep stken sleep 1 done They all seem to be in the write pools Slip these into /local/scratch26/kreymer/VO4209 mkdir /local/scratch26/kreymer/VO4209 for FILE in ${FILES} ; do FP=`echo ${FILE} | cut -f 3- -d /` ; echo ${FP} FI=`echo ${FILE} | cut -f 8 -d /` # ; echo ${FI} dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/${FP} \ /local/scratch26/kreymer/VO4209/${FI} done Let's put them in their months for FILE in ${FILES} ; do FM=`echo ${FILE} | cut -f 7 -d /` ; echo ${FM} FI=`echo ${FILE} | cut -f 8 -d /` # ; echo ${FI} mkdir -p /local/scratch26/kreymer/VO4209/${FM} mv /local/scratch26/kreymer/VO4209/${FI} \ /local/scratch26/kreymer/VO4209/${FM}/${FI} done And make another copy in /grid/data /grid/data/minos/mindata chmod 775 /grid/data/minos/mindata cp -vax /local/scratch26/kreymer/VO4209 /grid/data/minos/mindata/VO4209 ########## # DCACHE # ########## Five files have been in write queues over 6 hours, not yet queued for output -rw-r--r-- 1 minfarm numi 10183045 May 11 06:56 F00031422_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 minfarm numi 5332142 May 11 06:56 F00031426_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 minfarm numi 6992773 May 11 06:56 F00031428_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 minfarm numi 10819041 May 11 06:56 F00031431_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 minfarm numi 8864273 May 11 06:56 F00031433_0000.spill.sntp.cedar_phy.0.root All are in w-stkendca11a-1 All are under /pnfs/minos/reco_far/cedar_phy/sntp_data/2005-05 The problem is the 94 
queued stores in w-stkendca11a-1 I'm probably stuck behind a slug of cand/bcnd writes Indeed the 'drives' web page http://cmsdca.fnal.gov/cgi-bin/enstore_drives.sh shows w-stkendca11a-1 being pretty active writing to tape. ============================================================================ 2007 05 10 ########### # ROUNDUP # ########### roundup.20070510 for mock data challenge data ? Added test for valid INDIR in /grid/data For reference, scanned old releases, MINOS26 > ls -aF /pnfs/minos/mcout_data/*/fmock /pnfs/minos/mcout_data/R1.12/fmock: ./ ../ .snrl_data/ .trth_data/ cand_data/ sntp_data/ snts_data/ /pnfs/minos/mcout_data/R1.6.1/fmock: ./ ../ cand_data/ snrl_data/ sntp_data/ snts_data/ trth_data/ /pnfs/minos/mcout_data/R1.7/fmock: ./ ../ .snrl_data/ .trth_data/ cand_data/ sntp_data/ snts_data/ /pnfs/minos/mcout_data/R1.9/fmock: ./ ../ .snrl_data/ .trth_data/ cand_data/ sntp_data/ snts_data/ /pnfs/minos/mcout_data/R1_18_2/fmock: ./ ../ cand_data/ sntp_data/ snts_data/ /pnfs/minos/mcout_data/cedar/fmock: ./ ../ carrot/ Per rhatcher, the .trth and .snrl are historic Will should produce the usual sntp, .bntp and .bcnd files I hope for this we could skip cand ######## # FARM # ######## Keepup : reco_far/cedar reco_near/cedar mcout_data/cedar/near/daikon_00 Reprocessing : /reco_far/cedar_phy /reco_near/cedar_phy Data : reco_near/R1_24cal reco_far/R1_24cal Monte Carlo : mcout_data/cedar/cosmic/bfld201_lowE_R1_24cal mcout_data/cedar/cosmic/bfld201_lowE_R1_24calB the above are the special Cambridge cosmid runs, Copied as-is, with no concatenation No far/near or release in the path. mcout_data/R1_24cal/near/daikon_00/... mcout_data/R1_24calB/near/daikon_00/... mcout_data/cedar_phy/fmock/... still to come, this will not be concatenated mcin files are arriving momentarily ######## # FARM # ######## Reported duplicates per 2007 05 07 N00012135_0013.cosmic.cand.cedar.0.root N00012135_0021.cosmic.cand.cedar.0.root n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root This was due to cand reprocessing, lost DCache file. Should remove this duplicate sntp ######## # FARM # ######## Purged WRITE files from special passes ./roundup -w -r R1_24calB mcnear ./roundup -w -r R1_24cal near ./roundup -w -r R1_24cal far Forced several files which were once in bad_runs_mc.cedar on 4/5 May, OK now. ./roundup -f 5 -r cedar mcnear ============================================================================= 2007 05 09 ############ # MCIMPORT # ############ mcimport.20070709 Added support for MCIN mock data files N* and F* 17:29 cp -a AFSS/mcimport.20070509 . ln -sf mcimport.20070509 mcimport ########### # ROUNDUP # ########### roundup.20070509 Control saddreco months with SAMMONS variable, sorted/unique months from all calls to AUTODEST Added R1_24calB release Looks ok in a dry run with AFSS/roundup.20070509 -n -r cedar_phy near Putting this into production cp AFSS/roundup.20070509 . ln -sf roundup.20070509 roundup Oops, corrected typo leaving space after \ running saddreco, caused saddreco output to go the wrong place. Hacked logs with text editor. There are messages like File "scripts/saddreco", line 79, in ? ValueError: invalid literal for int(): These are due to \ being taken as an argument to saddreco. Should have been harmless. Corrected roundup.20070509 to continue correctly, and to issue sample commands for less'ing the saddreco LOGS and to create directories for saddreco LOGS 17:52 cp AFSS/roundup.20070509 . 
ln -sf roundup.20070509 roundup ######### # MYSQL # ######### The heavy load on minos-mysql1 continues since 4 AM yesterday. I see a dozen or so connections to the temp database, and a dozen or so logins in progress at all times, from flxb* nodes. ######## # FARM # ######## SRV1> time md5sum c10000605_0003.cand.R1_24cal.root 6424da9475ba0239642ac6b13b99a757 c10000605_0003.cand.R1_24cal.root real 0m6.832s user 0m1.121s sys 0m2.051s SRV1> time md5sum c10000605_0003.cand.R1_24cal.root 6424da9475ba0239642ac6b13b99a757 c10000605_0003.cand.R1_24cal.root real 0m1.167s user 0m0.861s sys 0m0.307s /grid/data rates are great again. when running below, seeing 200 MBit/sec on MRTG plot of eth0 cd /grid/data/minos/mcfarcat RELE=R1_24cal STRM=cand FILES=`ls -1 *${STRM}*${RELE}\.root` RSPA=minos/mcout_data/cedar/cosmic/bfld201_lowE_${RELE}/${STRM}_data POUT=/pnfs/${RSPA} 08:32 removed STRM=sntp 08:48 RELE=R1_24calB 08:49 STRM=cand 08:51 09:07 done ######## # FARM # ######## As soon as the 18:05 cycle is done, need to do : ./roundup -n -W -r R1_24calB mcnear Wed May 9 21:37:32 CDT 2007 We only had a 15 minute gap this afternoon, due to cedar_phy catchup. ============================================================================= 2007 05 08 ########## # CORRAL # ########## Added veto on the existence of ${HOME}/ROUNTUP/NOCAT file. crontab.dat schedules corral for 05 00,06,12,18 ########### # ROUNDUP # ########### Oops, forgot to put roundup.20070507 with pid protection into production "No harm, no foul." cp AFSS/roundup.20070507 . ln -sf roundup.20070507 roundup ####### # SAM # ####### Test rapid listing of files declared to sam, for use in saddreco etc, sam list files --dim='(FILE_NAME F00037871_0004.mdaq.root, F00037871_0008.mdaq.root, F00037871_0013.mdaq.root )' Files: F00037871_0004.mdaq.root F00037871_0008.mdaq.root F00037871_0013.mdaq.root File Count: 3 Average File Size: 31.78MB Total File Size: 95.35MB Total Event Count: 37651 SAMDIM='(FILE_NAME F00037871_0004.mdaq.root, F00037871_0008.mdaq.root, F00037871_0013.mdaq.root )' sam list files --nosummary --dim="${SAMDIM}" F00037871_0004.mdaq.root F00037871_0008.mdaq.root F00037871_0013.mdaq.root FILES=`ls /pnfs/minos/fardet_data/2007-05` FIRST=`printf "${FILES}\n" | head -1` FREST=`printf "${FILES}\n" | tail +2` FRESC=`for FI in ${FREST} ; do printf ", ${FI}" ; done` SAMDIM="( FILE_NAME ${FIRST} ${FRESC} )" sam list files --nosummary --dim="${SAMDIM}" | wc -l 200 printf "${FILES}\n" | wc -w 204 Good, this seems to work, and fairly quickly real 0m4.402s user 0m1.010s sys 0m0.120s Try agin for 2004, real 0m50.976s user 0m1.230s sys 0m0.190s real 0m48.758s user 0m1.300s sys 0m0.110s 818 files ####### # SAM # ####### export SAM_ORACLE_CONNECT ./reloc -s dev cedar_phy ./reloc -s int cedar_phy ./reloc -s prd cedar_phy export -n SAM_ORACLE_CONNECT ######### # MYSQL # ######### minos-mysql1 has had a load average of about 6 to 8 since about 04:00. 
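For reference, the kind of check behind these mysql load notes, as a sketch only - the 'reader' account is a placeholder, any account with PROCESS privilege would do, and this is not necessarily what was actually run :
ssh minos-mysql1 uptime
# count client threads coming in from the farm worker nodes
mysql -h minos-mysql1 -u reader -p -e 'SHOW PROCESSLIST' | grep -c flxb
# and how many of those are sitting in the temp database
mysql -h minos-mysql1 -u reader -p -e 'SHOW PROCESSLIST' | grep flxb | grep -c temp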
######## # FARM # ######## R1_24cal forced output, verified these subruns were previously skipped N00009235_0001 N00009241_0010 N00009256_0002 N00009256_0008 N00009256_0009 N00009259_0013 N00009162_ mrnt missing 16,17 N00009226_ mrnt missing 21 N00009143_ mrnt missing 18 ./roundup -f 1 -M - r R1_24cal near Tue May 8 13:22:35 CDT 2007 purge Tue May 8 13:22:51 CDT 2007 cat Tue May 8 13:26:13 CDT 2007 write Tue May 8 13:32:21 CDT 2007 done ./roundup -m 2005-11 -r R1_24cal near # and do SAM declares STARTED Tue May 8 18:35:45 2007 FINISHED Tue May 8 18:37:52 2007 FAR 247 files in WRITE to be purged, clear them first ./roundup -w -r R1_24cal far F00028201_ missing 00,01 which are in 2004-11 not 2004-12 so force this ./roundup -f 1 -M -r R1_24cal far ########## # DCACHE # ########## Existing http://fndca3a.fnal.gov:2288/poolInfo/ugroups/MinosPrdSelGrp minos.reco_far_cedar_bntp@enstore minos.reco_far_cedar_mrnt@enstore minos.reco_far_cedar_sntp@enstore minos.reco_mc_far_cedar_mrnt@enstore minos.reco_mc_far_cedar_sntp@enstore minos.reco_mc_near_cedar_mrnt@enstore minos.reco_mc_near_cedar_sntp@enstore minos.reco_near_cedar_mrnt@enstore minos.reco_near_cedar_sntp@enstore Need to add minos.reco_far_cedar_phy_bntp@enstore minos.reco_far_cedar_phy_mrnt@enstore minos.reco_far_cedar_phy_sntp@enstore minos.reco_mc_far_cedar_phy_mrnt@enstore minos.reco_mc_far_cedar_phy_sntp@enstore minos.reco_mc_near_cedar_phy_mrnt@enstore minos.reco_mc_near_cedar_phy_sntp@enstore minos.reco_near_cedar_phy_mrnt@enstore minos.reco_near_cedar_phy_sntp@enstore And set file families for each stream, as on 2006 09 04 cd /pnfs/minos/reco_far/cedar_phy for DIR in .bcnd .bntp cand mrnt sntp ; do (cd ${DIR}_data ; enstore pnfs --tags | grep 'family)' ) ; done These are not properly qualified for DIR in .bcnd .bntp cand mrnt sntp ; do ( cd ${DIR}_data DIRT=`echo ${DIR} | tr -d '.'` enstore pnfs --file_family reco_far_cedar_phy_${DIRT} ) done for DIR in .bcnd .bntp cand mrnt sntp ; do (cd ${DIR}_data/2007-03 ; enstore pnfs --tags | grep 'family)' ) ; done # this was correctly inherited Now do NEAR cd /pnfs/minos/reco_near/cedar_phy for DIR in cand mrnt sntp ; do (cd ${DIR}_data ; enstore pnfs --tags | grep 'family)' ) ; done for DIR in cand mrnt sntp ; do ( cd ${DIR}_data DIRT=`echo ${DIR} | tr -d '.'` enstore pnfs --file_family reco_near_cedar_phy_${DIRT} ) done for DIR in cand mrnt sntp ; do (cd ${DIR}_data/2007-03 ; enstore pnfs --tags | grep 'family)' ) ; done Oops, somehow set /pnfs/minos/reco_near/cedar_phy to snts. 
Moot, but correct this ####### # AFS # ####### Requested volume d239 cloned from d188 per lloiaco request for beam systematics work ######## # FARM # ######## Remove last 2 0 length files from last Thursday diskful event SRV1> rm /grid/data/minos/nearcat/N00009235_0001.cosmic.cand.R1_24cal.0.root SRV1> rm /grid/data/minos/nearcat/N00009256_0008.cosmic.cand.R1_24cal.0.root ######## # FARM # ######## Cambridge Cosmic file cleanup setup encp -q stken RELE=R1_24cal STRM=cand STRM=sntp cd /grid/data/minos/mcfarcat FILES=`ls -1 *${STRM}*${RELE}\.root` RSPA=minos/mcout_data/cedar/cosmic/bfld201_lowE_${RELE}/${STRM}_data POUT=/pnfs/${RSPA} for FILE in ${FILES} ; do PFIL=${POUT}/${FILE} PINFO=`(cd ${POUT} ; cat ".(use)(4)(${FILE})" | tr '\n' '\t')` LCRC=`ecrc ${FILE} | tr -s ' ' | cut -f 2 -d ' '` ECRC=`printf "${PINFO}" | cut -f 11` echo " ${FILE}" ${LCRC} ${ECRC} [ ${LCRC} = ${ECRC} ] && echo rm ${FILE} && rm ${FILE} done 2>&1 | tee /tmp/purge${RELE}${STRM}.log THis is running dog slow SRV1> time md5sum c10000605_0000.cand.R1_24cal.root 9fb3226f8fde606d0f7d5d10887b7671 c10000605_0000.cand.R1_24cal.root real 3m14.168s user 0m1.829s sys 0m1.691s 551M, so about 2 MBytes/sec real 10m26.894s user 0m1.840s sys 0m0.790s Tue May 8 16:28:01 CDT 2007 Speed seems to be back to normal, Tue May 8 23:54:53 CDT 2007 ########### # ROUNDUP # ########### roundup.2070508 supporting cedar_phy cp AFSS/roundup.20070508 . ln -sf roundup.20070508 roundup ######## # FARM # ######## Writing output for cedar_phy !!!!!!!!!! ./roundup -c -M -r cedar_phy far ; ./roundup -c -M -r cedar_phy near Tue May 8 16:45:40 CDT 2007 far Then declared to sam, ( failed first time, had to ( dev/int/prd ) samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy ./roundup -m '2005-04' -r cedar_phy near STARTED Wed May 9 04:31:40 2007 FINISHED Wed May 9 04:31:54 2007 ./roundup -m '2005-05' -r cedar_phy near STARTED Wed May 9 04:32:14 2007 FINISHED Wed May 9 04:33:50 2007 ./roundup -m '2005-04' -r cedar_phy far STARTED Wed May 9 04:35:34 2007 FINISHED Wed May 9 04:37:20 2007 ./roundup -m '2005-05' -r cedar_phy far STARTED Wed May 9 04:37:39 2007 FINISHED Wed May 9 04:39:12 2007 ============================================================================= 2007 05 07 ########## # CORRAL # ########## Run various roundups in cron on fnpcsrv1, to keep the crontab file short Will run all current roundup's , one at a time If one is already running, move on to the next. If one stream is very slow, we'll end up running 2 roundups This should be OK. But allow only one such. Check error, count, bail . Can this be done simply ? ########### # ROUNDUP # ########### roundup.20070507 Adding PID interlocking stealing code from mcimport ####### # SAM # ####### Preparing for cedar_phy export SAM_ORACLE_CONNECT="samdbs/" setup sam -q dev samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar_phy New applicationFamilyId = 251 setup sam -q int New applicationFamilyId = 60 setup sam -q prd New applicationFamilyId = 62 reco directories do not yet exist for cedar_phy Same for R1_24cal ./reloc -d -s dev R1_24cal ./reloc -s dev R1_24cal ./reloc -s int R1_24cal ./reloc -s prd R1_24cal Testing R1_24cal ./roundup -m 2005-11 -r R1_24cal near STARTED Mon May 7 18:21:29 2007 ... 
Treating 666 files in /pnfs/minos/reco_near/R1_24cal/cand_data/2005-11 Oops, appVersion should have been r1.25cal Working OK this time STARTED Mon May 7 18:28:45 2007 Needed /pnfs/minos/reco_near/R1_24cal/cand_data/2005-11 Treating 666 files in /pnfs/minos/reco_near/R1_24cal/cand_data/2005-11 ... Treating 38 files in /pnfs/minos/reco_near/R1_24cal/mrnt_data/2005-11 OOPS - tier known, mrnt ... ############ # SADDRECO # ############ saddreco.20070507 - added mrnt to TIERS SRV1> cp -a AFSS/saddreco.20070507 . SRV1> ln -sf saddreco.20070507 saddreco Ran R1_24cal again ./roundup -m 2005-11 -r R1_24cal near Followed up with the rest of R1_24cal ./roundup -m 2005-11 -r R1_24cal far STARTED Mon May 7 18:57:36 2007 FINISHED Mon May 7 19:10:30 2007 ./roundup -m 2004-12 -r R1_24cal far STARTED Mon May 7 19:10:55 2007 FINISHED Mon May 7 19:16:24 2007 Note that there are no .bcnd for 2004-12 ########### # ENSTORE # ########### nwest reports a duplicate file in COMPLETE_FILE_LIST_minos ls -l /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/100/f21011002_0000_L010185N_D00.sntp.cedar.root -rw-r--r-- 1 3475 e875 65629350 Mar 10 08:07 /pnfs//minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/100/f21011002_0000_L010185N_D00.sntp.cedar.root enstore info --list VOC177 | grep f21011002_0000_L010185N_D00.sntp.cedar.root VOC177 CDMS117247822600000 62959276 0000_000000000_0000266 active /pnfs/fnal.gov/usr/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/100/f21011002_0000_L010185N_D00.sntp.cedar.root enstore info --list VOB971 | grep f21011002_0000_L010185N_D00.sntp.cedar.root VOB971 CDMS117353562400000 65629350 0000_000000000_0000022 active /pnfs/fnal.gov/usr/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/100/f21011002_0000_L010185N_D00.sntp.cedar.root cd $MINOS_DATA/d10/indexes grep f21011002_0000_L010185N_D00.sntp.cedar.root *.index BAD_mc_far.daikon_00.cedar.index:recodata83/f21011002_0000_L010185N_D00.sntp.cedar.root mc_far.daikon_00.cedar.index:recodata89/f21011002_0000_L010185N_D00.sntp.cedar.root MINOS26 > dds recodata89/f21011002_0000_L010185N_D00.sntp.cedar.root -rw-rw-r-- 1 3475 e875 65629350 Mar 10 04:28 recodata89/f21011002_0000_L010185N_D00.sntp.cedar.root ######## # FARM # ######## Writing Cambridge Cosmic MC files from /grid/data/minos/mcfarcat 316 files, a mix of R1_24calB and cedar MINOS26 > ls /grid/data/minos/mcfarcat | wc -l 316 MINOS26 > ls /grid/data/minos/mcfarcat | grep cand | wc -l 158 MINOS26 > ls /grid/data/minos/mcfarcat | grep sntp | wc -l 158 Oops a mixture of calB and cedar File names are like c10000601_0000.cand.cedar.root c10000601_0000.cand.R1_24calB.root Examine existing directories for output ls /pnfs/minos/mcout_data/cedar/cosmic -1 bfld201 bfld201_lowE bfld201_lowE_R1.24.0 bfld201_rock bfld201_vlowE bfldoff bfldrev neutron The cedar names match files already in /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1.24.0/cand_data /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1.24.0/sntp_data Rather than change the names of the old file, I have made new directories, as rubin mkdir -p /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_cedarmay/cand_data mkdir -p /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_cedarmay/sntp_data Then per discussion with howie, have shifted to rmdir /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_cedarmay/cand_data rmdir /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_cedarmay/sntp_data mkdir -p /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1_24cal/cand_data mkdir -p 
/pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1_24cal/sntp_data and make space for some existig R1_24calB files mkdir -p /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1_24calB/cand_data mkdir -p /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1_24calB/sntp_data Renamed all the cedar grid files : cd /grid/data/minos/mcfarcat FILES=`ls -1 *cedar.root` for FILE in ${FILES} ; do sleep 1 echo ${FILE} ${FILE:0:19}.R1_24cal.root mv ${FILE} ${FILE:0:19}.R1_24cal.root done Did this at 17:24 Now write these to PNFS, from minfarm setup dcap # kerberized DCPOR=24736 cd /grid/data/minos/mcfarcat RELE=R1_24cal STRM=cand STRM=sntp FILES=`ls -1 *${STRM}*${RELE}\.root` RSPA=minos/mcout_data/cedar/cosmic/bfld201_lowE_${RELE}/${STRM}_data DOUT=/dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} date | tee ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log for FILE in ${FILES} ; do DFIL=${DOUT}/${FILE} PFIL=${POUT}/${FILE} [ ! -r ${PFIL} ] && echo "NEED" ${FILE} && dccp ${FILE} ${DFIL} done 2>&1 | tee ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log date | tee ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log Test with FILES=c10000601_0000.sntp.R1_24cal.root OK, ran with sntp, R1_24cal, OK ran cand OK Mon May 7 18:16:43 CDT 2007 Mon May 7 18:42:32 CDT 2007 Now copy R1_24calB ( in grid/data/minos/mcfarcat ) sum *sntp*R1_24calB.root RELE=R1_24calB for STRM in sntp cand ; do ... do Oops, forgot to mkdir the directories. Corrected, restarted : Mon May 7 18:46:12 CDT 2007 ######## # FARM # ######## DUPLICATES Duplicate cand near file In LOG/2007-05/cedarnear.log continued problem, wrong ECRC for N00012135_0013.cosmic.cand.cedar.0.root N00012135_0021.cosmic.cand.cedar.0.root /pnfs/minos/reco_near/cedar/cand_data/2007-05/N00012135_0013.cosmic.cand.cedar.0.root /pnfs/minos/reco_near/cedar/cand_data/2007-05/N00012135_0021.cosmic.cand.cedar.0.root The files were first written on May 4 -rw-r--r-- 1 minfarm numi 606314496 May 4 17:41 /export/stage/minfarm/ROUNDUP/WRITE/N00012135_0013.cosmic.cand.cedar.0.root -rw-r--r-- 1 minfarm numi 231964672 May 4 17:42 /export/stage/minfarm/ROUNDUP/WRITE/N00012135_0021.cosmic.cand.cedar.0.root Purged from WRITE on Sat 08:00 Then picked up again from /grid/data Saturday -rw-r--r-- 1 minfarm numi 749183111 May 5 13:30 /export/stage/minfarm/ROUNDUP/WRITE/N00012135_0013.cosmic.cand.cedar.0.root -rw-r--r-- 1 minfarm numi 744304636 May 5 13:30 /export/stage/minfarm/ROUNDUP/WRITE/N00012135_0021.cosmic.cand.cedar.0.root A second srmcp was not attempted. Duplicate sntp mcnear file In LOG/2007-05/cedarmcnear.log continued problem, wrong ECRC for n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/sntp_data/144/n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root Recent attempt -rw-r--r-- 1 minfarm numi 177252703 May 5 15:44 /export/stage/minfarm/ROUNDUP/WRITE/n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root Originally written April 19 MINOS26 > dds /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/sntp_data/144/n13011446_00*_L010185N_D00_nccoh.sntp.cedar.root -rw-r--r-- 1 1334 e875 1939588608 Apr 19 22:01 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/sntp_data/144/n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root ########## # RUSTLE # ########## Removed /grid/data/minos/farcat_safe, from early tests. See notes added under the 2007 04 10 log entry hereunder. 
SRV1> rm /grid/data/minos/farcat_safe/* SRV1> rmdir /grid/data/minos/farcat_safe ============================================================================= 2007 05 06 sunday ######### # STAGE # ######### MINOS26 > NVOLS=`./volumes neardet_data` MINOS26 > echo $NVOLS VO2307 VO3863 VO4531 VO4918 VO5041 VO5042 VO6784 VO7026 VO7175 VO7421 VO7774 VO7896 VO7939 VO8098 VO8187 VO8332 VO8537 VO8556 VO8721 VO8741 VO8791 VO8842 VO8949 VO9752 VO9834 VOC065 VOC443 MINOS26 > echo $NVOLS | wc -w 27 MINOS26 > for VOL in ${NVOLS} ; do ./stage -w ${VOL} ; done | tee ../log/stage/neardet_data.20070506 FINISHED Mon May 7 00:23:27 CDT 2007 Files restored : Staging files from tape VO4918 Needed 308/ 1074 Staging files from tape VO5041 Needed 1636/ 3235 Staging files from tape VO5042 Needed 1310/ 2608 Staging files from tape VO8556 Needed 74/ 937 Staging files from tape VO9752 Needed 58/ 2120 ######## # FARM # ######## ./roundup -M -r R1_24cal near ============================================================================= 2007 05 05 saturday On friday, did ./roundup -r R1_24cal mcnear Fri May 4 20:01:23 CDT 2007 cat Fri May 4 21:19:53 CDT 2007 write OK - creating /pnfs/minos/mcout_data/R1_24cal/near/daikon_00/L010185N_24cal/cand_data/100 no such directory ( N.B. this DID create the full directory ) Immediate cause - stale copy of roundup.20070504 on fnpcsrv1, renamed now to roundup.20070504x Checking inventory and status of misplaced files : SRV1> ls WRITE/n* | wc -l 233 Removed N and F files from WRITE, for clarity ./roundup -w -r cedar near ./roundup -w -r cedar far ./roundup -w -r R1_24cal near SRV1> ls WRITE | wc -l 233 Verified that all these files are in Enstore SRV1> ./roundup.20070504x -n -w -r R1_24cal mcnear | grep 'rm ' | wc -l 233 Made a list for future reference ls WRITE > ~/maint/nR1_24cal.20070505 As Rubin, move the misplaced files back where they belong RUB > find L010185N -type d L010185N L010185N/cand_data L010185N/mrnt_data L010185N/sntp_data RUB > find L010185N_24cal -type d L010185N_24cal L010185N_24cal/cand_data L010185N_24cal/cand_data/100 L010185N_24cal/cand_data/101 L010185N_24cal/cand_data/102 L010185N_24cal/sntp_data L010185N_24cal/sntp_data/100 L010185N_24cal/sntp_data/101 L010185N_24cal/sntp_data/102 The RUN directories do not exist where they belong under L010185N, so they can be moved cleanly from the wrong path L010185N_24cal And we have verified, above, that nothing is pending for write. for STR in cand_data sntp_data ; do for RUN in 100 101 102 ; do mv -v L010185N_24cal/${STR}/${RUN} L010185N/${STR}/${RUN} done ; done RUB > find L010185N -type d L010185N L010185N/cand_data L010185N/cand_data/100 L010185N/cand_data/101 L010185N/cand_data/102 L010185N/mrnt_data L010185N/sntp_data L010185N/sntp_data/100 L010185N/sntp_data/101 L010185N/sntp_data/102 RUB > find L010185N_24cal/ -type d L010185N_24cal/ L010185N_24cal/cand_data L010185N_24cal/sntp_data Now we can purge the WRITE files, using standard roundup ./roundup -w -r R1_24cal mcnear Sat May 5 09:50:05 CDT 2007 Sat May 5 09:57:06 CDT 2007 Scanned for recent files in nearcat/farcat from R1_24cal, they are still currently flowing from far, the last one was around 04:00 from near. There is nothing in mcnearcat. 
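That scan was done by eye ; a one-pass version, as a sketch ( the 12 hour window and the -type f are arbitrary choices, not what was actually typed ) :
for CAT in nearcat farcat mcnearcat mcfarcat ; do
  echo ${CAT}
  find /grid/data/minos/${CAT} -type f -name "*R1_24cal*" -mmin -720 -exec ls -l {} \;
done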
Need a utility like farmgsum to scan all *cat directories, nearcat farcat mcnearcat mcfarcat listing for each stream and directory stream / number / size / last time ./roundup -c -r cedar far ; ./roundup -c -r cedar near Sat May 5 11:29:39 CDT 2007 GRRRRRRRRRRRRRRRRRRRRRRR Stuck once again, something has changed again less LOG/2007-05/cedarfar.log rm: remove write-protected regular file `/grid/data/minos/farcat/F00037968_0000.all.sntp.cedar.0.root'? Odd, the directory is group writeable, but not the files. Check an older directory : SRV1> ls -alF /grid/data/minos/mcfarcat/ total 3637984 drwxrwxr-x 2 rubin numi 2048 May 2 21:13 ./ drwxrwxr-x 20 rubin numi 2048 May 3 09:10 ../ -rw-r--r-- 1 rubin numi 573833281 May 5 01:08 c10000601_0000.cand.cedar.root -rw-r--r-- 1 rubin numi 576581544 May 5 01:10 c10000601_0001.cand.cedar.root -rw-r--r-- 1 rubin numi 575020786 May 5 01:29 c10000601_0002.cand.cedar.root -rw-r--r-- 1 rubin numi 283649445 May 5 00:52 c10000601_0003.cand.cedar.root -rw-r--r-- 1 rubin numi 568409565 May 5 01:11 c10000602_0000.cand.cedar.root -rw-r--r-- 1 rubin numi 576844434 May 5 01:29 c10000602_0001.cand.cedar.root -rw-r--r-- 1 rubin numi 570762049 May 5 01:13 c10000602_0002.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:21 c10000602_0003.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:21 c10000602_0003.sntp.cedar.root Try removing a useless 0 length file : SRV1> type rm rm is /bin/rm SRV1> rm /grid/data/minos/mcfarcat/c10000602_0003.sntp.cedar.root rm: remove write-protected regular empty file `/grid/data/minos/mcfarcat/c10000602_0003.sntp.cedar.root'? y This removed the file Try using -f for a clean removal rm -f /grid/data/minos/mcfarcat/c10000602_0003.cand.cedar.root OK, created roundup.20070505 once again, to do rm -r , and cloned to fnpcsrc1 cp AFSS/roundup.20070505 . 
ln -sf roundup.20070505 roundup ./roundup -r cedar far Sat May 5 13:06:19 CDT 2007 Sat May 5 13:29:20 CDT 2007 And remove the input file that should have been removed : SRV1> dds /pnfs/minos/reco_far/cedar/sntp_data/2007-05/F00037968_0000.all.sntp.cedar.0.root -rw-r--r-- 1 rubin numi 433161527 May 5 11:36 /pnfs/minos/reco_far/cedar/sntp_data/2007-05/F00037968_0000.all.sntp.cedar.0.root SRV1> dds ../ROUNTMP/WRITE/F00037968_0000.all.sntp.cedar.0.root -rw-r--r-- 1 minfarm numi 433161527 May 5 11:32 ../ROUNTMP/WRITE/F00037968_0000.all.sntp.cedar.0.root SRV1> dds /grid/data/minos/farcat/F00037968_0000.all.sntp.cedar.0.root -rw-r--r-- 1 rubin numi 23722211 May 2 23:51 /grid/data/minos/farcat/F00037968_0000.all.sntp.cedar.0.root SRV1> rm -f /grid/data/minos/farcat/F00037968_0000.all.sntp.cedar.0.root for FILE in `cat ../ROUNTMP/READ/F00037968_0000.all.sntp.cedar.0.root` ; do ls -l /grid/data/minos/farcat/${FILE} ; done for FILE in `cat ../ROUNTMP/READ/F00037968_0000.all.sntp.cedar.0.root` ; do rm -f /grid/data/minos/farcat/${FILE} ; done Now back to our regularly scheduled program ./roundup -r cedar near Sat May 5 13:30:03 CDT 2007 cat Sat May 5 13:35:36 CDT 2007 write Sat May 5 13:39:14 CDT 2007 ./roundup -r cedar mcnear Sat May 5 13:49:50 CDT 2007 cat several files hanging round since April 30, should do an -f 4 run Sat May 5 14:15:53 CDT 2007 write Sat May 5 14:46:03 CDT 2007 ./roundup -f 4 -r cedar mcnear Sat May 5 15:38:28 CDT 2007 cat OK - processing 80 files Sat May 5 15:44:40 CDT 2007 write Sat May 5 15:50:59 CDT 2007 done MINOS26 > ./farmgsum Summarizing /grid/data/minos/*cat 229 5580 nearcat 2831 36420 farcat 57 18359 mcnearcat 24 5936 mcfarcat 3141 66295 TOTAL files, GBytes nearcat 2 0 cosmic.cand.R1_24cal.0.root 2 1493 cosmic.cand.cedar.0.root 8 170 cosmic.sntp.R1_24cal.0.root 17 507 cosmic.sntp.cedar.0.root 175 1958 spill.mrnt.R1_24cal.0.root 8 585 spill.sntp.R1_24cal.0.root 17 1129 spill.sntp.cedar.0.root farcat 1426 33936 all.sntp.R1_24cal.0.root 7 169 all.sntp.cedar.0.root 692 2949 spill.bntp.R1_24cal.0.root 7 28 spill.bntp.cedar.0.root 692 1038 spill.sntp.R1_24cal.0.root 7 19 spill.sntp.cedar.0.root mcnearcat 28 17182 cand.cedar.root 29 2065 sntp.cedar.root mcfarcat 12 5712 cand.cedar.root 12 510 sntp.cedar.root ./roundup -n -s F0003317 -r R1_24cal far OK adding F00033174_0000.spill.sntp.R1_24cal.0.root 1 ./roundup: line 507: ((: SSIF = : syntax error: operand expected (error token is " ") ./roundup: line 507: ((: SSIF = : syntax error: operand expected (error token is " ") OK adding F00033178_0000.spill.sntp.R1_24cal.0.root 24 ./roundup -f 4 -s F0003317 -r cedar mcnear OOps, accident when trying additional test of the above. nothing was done, no files ./roundup -n -s F0003317 -r R1_24cal far ./roundup -n -s spill.sntp -r R1_24cal far both are clean I may have typed something at the terminal during the original test ./roundup -n -r R1_24cal far clean this time, go for it, without SAM ! 
./roundup -M -r R1_24cal far Sat May 5 18:59:58 CDT 2007 ============================================================================= 2007 05 04 ######## # FARM # ######## More rate tests, trying direct /grid/data to enstore : cd /grid/data/minos/DUP IFILE=n13011446_0000_L010185N_D00_nccoh.cand.cedar.root export SRM_CONFIG=/export/stage/minfarm/.srmconfig/config.xml SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm SFILE2=${SPATH2}/${IFILE} SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm SFILE=${SPATH}/${IFILE} srmkdir ${SPATH2} srmls ${SPATH2} srmls ${SFILE2} time srmcp file:///${IFILE} ${SFILE} real 1m3.708s user 0m27.422s sys 0m43.969s Rate is 25 MBytes/sec srmls ${SFILE2} 1602194515 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm/n13011446_0000_L010185N_D00_nccoh.cand.cedar.root srmrm ${SFILE2} time srmcp file:///${IFILE} ${SFILE} real 0m49.820s user 0m26.619s sys 0m31.868s Rate is 32 MB/sec cd /grid/data/minos/nearcat FILES=`ls N00012135*cand*` SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm/TEST SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm/TEST srmmkdir ${SPATH2} At 15:00+ for IFILE in ${FILES} ; do echo ${IFILE} time srmcp file:///${IFILE} ${SPATH}/${IFILE} done N00012135_0001.cosmic.cand.cedar.0.root real 0m41.664s user 0m20.561s sys 0m20.849s N00012135_0001.spill.cand.cedar.0.root real 0m24.862s user 0m13.962s sys 0m8.898s ... real 0m16.846s user 0m11.435s sys 0m3.375s N00012135_0022.spill.cand.cedar.0.root real 0m23.094s user 0m14.327s sys 0m9.214s SRV1> TIMS=`grep 'real' /tmp/testcp | cut -c 12-16` SRV1> echo $TIMS 41.664 24.862 46.818 22.677 44.512 23.150 49.938 25.862 39.586 30.845 18.375 27.728 48.427 23.360 18.204 29.931 38.207 50.222 31.942 15.594 26.017 39.594 25.393 38.542 27.424 21.473 16.846 23.094 41.664 24.862 46.818 22.677 44.512 23.150 49.938 25.862 39.586 30.845 18.375 27.728 48.427 23.360 18.204 29.931 38.207 50.222 31.942 15.594 26.017 39.594 25.393 38.542 27.424 21.473 16.846 23.094 41.664 24.862 46.818 22.677 44.512 23.150 49.938 25.862 39.586 30.845 18.375 27.728 48.427 23.360 18.204 29.931 38.207 50.222 31.942 15.594 26.017 39.594 25.393 38.542 27.424 21.473 16.846 23.094 SRV1> for TIM in $TIMS ; do printf "${TIM} + " >> /tmp/times ; done SRV1> printf "0\n" >> /tmp/times SRV1> cat /tmp/times | bc 2610.861 for FILE in ${FILES} ; do SI=`ls -l ${FILE} | tr -s ' ' | cut -f 5 -d ' '` ; printf "${SI} + " ; done > /tmp/sizes printf "0\n" >> /tmp/sizes cat /tmp/sizes | bc 12879467728 Rate is 5 MB/sec ??? Looks lousy to me not consistent with 2x8.5 MB/sec MRTG peaked sharply around 85 mbit/sec on eth0 ######## # FARM # ######## 152G 80G WRITE SRV1> du -sm /grid/data/minos/*cat 9727 /grid/data/minos/farcat 1 /grid/data/minos/mccat 3553 /grid/data/minos/mcfarcat 221977 /grid/data/minos/mcnearcat 37539 /grid/data/minos/nearcat ./roundup -w -r cedar mcnear Oops, typos in roundup.20070503 setting SRMQ. Fixed a couple of times, back on track. 
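Note for later - the echoed TIMS list above repeats the same 28 values three times, probably because the transcript in /tmp/testcp was appended to more than once, so the 2610 second total and the 5 MB/sec figure look pessimistic by about a factor of 3 ; nearer 15 MB/sec, which is consistent with the 2x8.5 MB/sec expectation. A single-pass version of the arithmetic, as a sketch ( run from /grid/data/minos/nearcat with FILES and /tmp/testcp as above ) :
TSEC=`grep real /tmp/testcp | cut -c 12-16 | awk '{ s += $1 } END { print s }'`
SIZB=`ls -l ${FILES} | awk '{ s += $5 } END { print s }'`
echo "scale=1; ${SIZB} / ${TSEC} / 1000000" | bc    # MBytes/sec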
./roundup -w -r cedar mcnear Fri May 4 08:21:53 CDT 2007 Fri May 4 10:42:51 CDT 2007 MRTG shows sustained 40 Mbit/sec on eth0 sar -n DEV shows sustained 4 MBytes/sec on each eth0 and eth1 bond0 does not show correct sum At aroung 09:20, tried to prime the pump getting a few files into memory cache with time md5sum ../ROUNTMP/WRITE/n13011679* real 9m45.719s user 0m23.272s sys 0m16.046s 1 sntp, 11 cand, net about 7 GB. du -sm 580 ../ROUNTMP/WRITE/n13011679_0000_L010185N_D00.cand.cedar.root 702 ../ROUNTMP/WRITE/n13011679_0000_L010185N_D00.sntp.cedar.root This made a spike to 60 Bit/sec in eth0 ( 5 min ave ) in mrtg Purged other files from WRITE, to check size AFSS/roundup.20070504 -w -r R1_24cal near The net rate for this run was 73 GBytes 140 minutes, 8400 seconds 8.9 MBytes/sec Somewhat better than the old 7, not great. ./roundup -r cedar far Fri May 4 12:12:15 CDT 2007 Fri May 4 12:19:39 CDT 2007 Looks OK, then PURGE FARM F00037968_0011.spill.cand.cedar.0.root Datafile with name 'F00037968_0012.mdaq.root' not found. SRMCP -streams_num=1 -server_mode=active file:///F00037968_0012.all.cand.cedar.0.root /pnfs/minos/reco_far/cedar/cand_data/ PURGE FARM F00037968_0012.all.cand.cedar.0.root Datafile with name 'F00037968_0012.mdaq.root' not found. SRMCP -streams_num=1 -server_mode=active file:///F00037968_0012.spill.bcnd.cedar.0.root /pnfs/minos/reco_far/cedar/.bcnd_data/ PURGE FARM F00037968_0012.spill.bcnd.cedar.0.root ?????? What happened to the month ? Predator had not been run, the raw files were not in SAM. The script proceeded to write the files without a month. Ouch !!!! As howie, remove these strays and rewrite, first checking : SRV1> cd /pnfs/minos/reco_far/cedar SRV1> ls -a *_data/F* cand_data/F00037968_0012.all.cand.cedar.0.root cand_data/F00037971_0000.all.cand.cedar.0.root cand_data/F00037968_0012.spill.cand.cedar.0.root cand_data/F00037971_0000.spill.cand.cedar.0.root cand_data/F00037968_0013.all.cand.cedar.0.root cand_data/F00037971_0001.all.cand.cedar.0.root cand_data/F00037968_0013.spill.cand.cedar.0.root cand_data/F00037971_0001.spill.cand.cedar.0.root cand_data/F00037968_0014.all.cand.cedar.0.root cand_data/F00037971_0002.all.cand.cedar.0.root cand_data/F00037968_0014.spill.cand.cedar.0.root cand_data/F00037971_0002.spill.cand.cedar.0.root cand_data/F00037968_0015.all.cand.cedar.0.root cand_data/F00037971_0003.all.cand.cedar.0.root cand_data/F00037968_0015.spill.cand.cedar.0.root cand_data/F00037971_0003.spill.cand.cedar.0.root cand_data/F00037968_0016.all.cand.cedar.0.root cand_data/F00037971_0004.all.cand.cedar.0.root cand_data/F00037968_0016.spill.cand.cedar.0.root cand_data/F00037971_0004.spill.cand.cedar.0.root cand_data/F00037968_0017.all.cand.cedar.0.root cand_data/F00037971_0005.all.cand.cedar.0.root cand_data/F00037968_0017.spill.cand.cedar.0.root cand_data/F00037971_0005.spill.cand.cedar.0.root cand_data/F00037968_0018.all.cand.cedar.0.root cand_data/F00037971_0006.all.cand.cedar.0.root cand_data/F00037968_0018.spill.cand.cedar.0.root cand_data/F00037971_0006.spill.cand.cedar.0.root SRV1> ls -a .*_data/F* .bcnd_data/F00037968_0012.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0000.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0013.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0001.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0014.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0002.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0015.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0003.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0016.spill.bcnd.cedar.0.root 
.bcnd_data/F00037971_0004.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0017.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0005.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0018.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0006.spill.bcnd.cedar.0.root SRV1> FILES=`ls -a *_data/F* ; ls -a .*_data/F*` SRV1> echo $FILES | wc -w 42 SRV1> for FP in ${FILES} ; do FI=`echo ${FP} | cut -f 2 -d /` ; ls -l /export/stage/minfarm/ROUNDUP/WRITE/${FI} ; done SRV1> for FP in ${FILES} ; do rm ${FP} ; done Tested the updated roundup which should abort. SRV1> AFSS/roundup.20070504 -n -w -r cedar far OK - processing /grid/data/minos/farcat version 20070504 Fri May 4 14:23:14 CDT 2007 PURGING WRITE files 147 Datafile with name 'F00037968_0012.mdaq.root' not found. OOPS - raw data not in SAM OK, let's get predator up to date and resume cp AFSS/roundup.20070504 . ln -sf roundup.20070504 roundup SRV1> du -sm /grid/data/minos/*cat 4839 /grid/data/minos/farcat 1 /grid/data/minos/mccat 3553 /grid/data/minos/mcfarcat 155542 /grid/data/minos/mcnearcat 39258 /grid/data/minos/nearcat VMON=2007-05 ./predator ${VMON} MINOS26 > crontab crontab.dat ( later, at 18:41 ) ./roundup -w -r cedar far Fri May 4 17:20:04 CDT 2007 Fri May 4 17:36:49 CDT 2007 ./roundup -r cedar near Fri May 4 17:37:45 CDT 2007 catting Fri May 4 18:00:24 CDT 2007 writing MRTG shows excellent rates, 60 to 80 mbit/s (x2) in spite of 60 GB of files to write ( >> 16 GB local memory ) Fri May 4 19:17:05 CDT 2007 Next... ./roundup -r R1_24cal near Fri May 4 19:49:00 CDT 2007 Fri May 4 19:50:17 CDT 2007 Fri May 4 19:53:41 CDT 2007 Running short of space, purge write ./roundup -w -r cedar mcnear Fri May 4 19:55:56 CDT 2007 Fri May 4 20:00:04 CDT 2007 196G free ./roundup -r R1_24cal mcnear Fri May 4 20:01:23 CDT 2007 cat Fri May 4 21:19:53 CDT 2007 write OK - creating /pnfs/minos/mcout_data/R1_24cal/near/daikon_00/L010185N_24cal/cand_data/100 no such directory ( N.B. this DID create the full directory ) Fixing this Sat morning 5/5 ########### # ROUNDUP # ########### roundup.20070504 Need to handle MCCONF calculation for R1_24cal Added file count to printout of WRITING to DCache AFSS/roundup.20070504 -n -W -s n13011001 -r R1_24cal mcnear .../L010185N_24cal/... STREAM=L250200N_D00.mrnt.cedar STREAM=L010185N_D00_nccoh.sntp.cedar STREAM=L010185N_D00.cand.R1_24cal doing MCPHYS=`echo ${STREAM} | cut -f 3 -d '_' | cut -f 1 -d .` MCCONF=`echo ${STREAM} | cut -f 1 -d '_'`${MCPHYS:+_${MCPHYS}} MCPHYS is getting activated when it shouldn't be Switch to cut on . field first, then third _ MCPHYS=`echo ${STREAM} | cut -f 1 -d . | cut -f 3 -d '_'` Moved to production around 12:08 cp AFSS/roundup.20070504 . ln -sf roundup.20070504 roundup Modified again to abort on files which are not in SAM ######## # FARM # ######## jaboehm (Josh) reports corrupt file at d174/MRCC/TEMP/n13011065_0005_L010185N_D0.mrnt.cedar.root That's actually n13011065_0005_L010185N_D00.mrnt.cedar.root written at 12:11 srmcp'd after 14:46 There were problem with the minos_reco_far familied around then, but not the mc families. The new pools were being deployed. MINOS26 > dirs=`ls -d d*` MINOS26 > for DIR in $dirs ; do echo ${DIR} ; find ${DIR} -name TEMP ; done d170/TEMP Are the raw ntuples there ? 
MINOS26 > for DIR in $dirs ; do echo ${DIR} ; find ${DIR} -name n13011065_0005_L010185N_D00.mrnt.cedar.root ; done MINOS26 > dds /afs/fnal.gov/files/data/minos/d170/TEMP/n13011065_0005_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 11634 e875 291140914 May 4 09:01 /afs/fnal.gov/files/data/minos/d170/TEMP/n13011065_0005_L010185N_D00.mrnt.cedar.root MINOS26 > dds /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/106/n13011065_0005_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 291140914 Apr 28 16:37 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/106/n13011065_0005_L010185N_D00.mrnt.cedar.root ecrc /afs/fnal.gov/files/data/minos/d170/TEMP/n13011065_0005_L010185N_D00.mrnt.cedar.root CRC 1580338695 MINOS26 > ./dc_stat /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/106/n13011065_0005_L010185N_D00.mrnt.cedar.root ... 1580338695 Check the files the usual way with root, using hadd MINOS26 > cd /local/scratch26/kreymer/ MINOS26 > hadd Merged.root /afs/fnal.gov/files/data/minos/d170/TEMP/n13011065_0000_L010185N_D00.mrnt.cedar.root /afs/fnal.gov/files/data/minos/d170/TEMP/n13011065_0005_L010185N_D00.mrnt.cedar.root MINOS26 > du -sm Merged.root 464 Merged.root Input sizes 194911909 291140914 Output 486044846 versus 486052823 ( difference 7977 ) Tested Merged.root as Josh suggested root Merged.root NtpSt->Show(0) ######### # STAGE # ######### ./volumes vols FVOLS=`./volumes fardet_data` MINOS26 > echo $FVOLS VO2064 VO2212 VO2220 VO2225 VO3646 VO3909 VO4136 VO4245 VO4309 VO4335 VO4639 VO4640 VO4919 VO5046 VO5054 VO5182 VO5672 VO5869 VO5871 VO5881 VO6809 VO6876 VO7999 VO8536 VO8555 VO8722 VO8917 VO8968 VO9488 VO9830 VOB499 for VOL in ${FVOLS} ; do ./stage -w ${VOL} ; done This is already picking up a few stray files, like 2003-10/F00020634_0000.mdaq.root Also finding a few files listed on tape, but not in PNFS, and without delflag set to yes on tape VO2212 This is stuff from 2004-11, probably not relevant like F00010926_0000.mdaq.root ============================================================================= 2007 05 03 ########### # ROUNDUP # ########### Added filtering of filename initial ${DETI} cp AFSS/roundup.20070503 . ln -sf roundup.20070503 roundup ######## # FARM # ######## ./roundup -w -r R1_24cal near Thu May 3 07:29:17 CDT 2007 234 GB free Date: Thu, 03 May 2007 09:25:21 -0500 (CDT) From: David Berg These are srm services, and they show as offline until they are used. Please go ahead and use them. Tested srmls and srmcp per HOWTO.srm at 09:27, OK Web page shows services running http://fndca.fnal.gov:2288/cellInfo Get back to work, clear rest of R1_24cal ./roundup -n -M -W -r R1_24cal near reveals several bad_runs files which are present with non-0 content. 
+BADRUNS+ N00009241_0010.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009244_0000.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009256_0002.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009256_0009.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009259_0013.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009265_0001.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009241_0010.cosmic.sntp.R1_24cal.0.root +BADRUNS+ N00009256_0002.cosmic.sntp.R1_24cal.0.root +BADRUNS+ N00009259_0013.cosmic.sntp.R1_24cal.0.root +BADRUNS+ N00009265_0001.cosmic.sntp.R1_24cal.0.root +BADRUNS+ N00009241_0010.spill.cand.R1_24cal.0.root +BADRUNS+ N00009256_0002.spill.cand.R1_24cal.0.root +BADRUNS+ N00009259_0013.spill.cand.R1_24cal.0.root I need to add an up front filter, which is a big pain, because there are two separate bad_run files, one for normal and one for mrnt files. Ignore the mrcc files for now, there are no type 1 or 3 errors there bad_runs_camb.cedar bad_runs.cedar bad_runs_mc.cedar bad_runs_mrcc.cedar bad_runs_mrcc_mc.cedar What are these camb files ? This is TOOOOO MUUUUUCH ! grep ' *[1,3] *....-..-.. *' ~/lists/bad_runs_mc.cedar put this list into zap_files Hacked up roundup.20070503 skipping zap_files ( bad_files errors 1 or 3 ) SRV1> grep ' *[1,3] *....-..-.. *' ~/lists/bad_runs.R1_24cal N00009256_0009.0 2005-11 47487 3 2007-05-02 19:52:12 fnpc104 N00009235_0001.0 2005-11 47643 3 2007-05-02 19:55:07 fnpc111 N00009265_0001.0 2005-11 46531 3 2007-05-02 20:09:18 fnpc228 N00009244_0000.0 2005-11 41122 3 2007-05-02 20:09:38 fnpc117 N00009256_0008.0 2005-11 47913 3 2007-05-02 20:13:24 fnpc91 N00009241_0010.0 2005-11 47591 3 2007-05-02 20:17:52 fnpc37 N00009256_0002.0 2005-11 47818 3 2007-05-02 20:27:39 fnpc30 N00009259_0013.0 2005-11 47132 3 2007-05-02 20:34:37 fnpc171 AFSS/roundup.20070503 -n -M -W -r R1_24cal near > /tmp/rRcalz AFSS/roundup.20070502 -n -M -W -r R1_24cal near > /tmp/rRcal Looks good, has zapped the type 1 and 3 bad runs. cp AFSS/roundup.20070503 . ln -sf roundup.20070503 roundup 232 GB free 100674 /grid/data/minos/nearcat ./roundup -M -r R1_24cal near Thu May 3 15:35:45 CDT 2007 Thu May 3 16:41:11 CDT 2007 Thu May 3 20:51:54 CDT 2007 163G free ./roundup -w -r R1_24cal mcnear Thu May 3 21:21:16 CDT 2007 Thu May 3 21:32:30 CDT 2007 224G free Updated primary roundup.20070503 to add qualifiers srmcp -streams_num=1 -server_mode=active 207712 /grid/data/minos/mcnearcat SRV1> ls /grid/data/minos/mcnearcat/*R1_* | wc -l 410 SRV1> ls /grid/data/minos/mcnearcat | wc -l 745 Let's rip : AFSS/roundup.20070503 -r R1_24cal mcnear GRRRRRRRRRRRRRRRRRRRR Did a test run, and we have problem with BOTH R1_24cal and cedar R1_24cal contains underscores, which fouls up mcin path calculation need to debug/extend/add more special cases to script cedar has many +BAD_RUN+ diagnostics, in spite of new filtering of codes 1 and 3 n13011680_0003 Corrected BADRUNS/ZAPRUNS in roundup.20070503, recloned to fnpcsrv1 ./roundup -r cedar mcnear Thu May 3 21:51:02 CDT 2007 Thu May 3 22:37:02 CDT 2007 srmcp$: command not found typo in roundup.20070503 , corrected, repropogated ######## # FARM # ######## What to do with mcnearcat/c* files, which do not follow naming conventions ? 
Need to rename per minos_sim conventions, ######## # FARM # ######## Pending 0 length files : SRV1> find /grid/data/minos -size 0 /grid/data/minos/mcfarcat/c10000602_0003.cand.cedar.root /grid/data/minos/mcfarcat/c10000602_0003.sntp.cedar.root /grid/data/minos/nearcat/N00009235_0001.cosmic.cand.R1_24cal.0.root /grid/data/minos/nearcat/N00009256_0008.cosmic.cand.R1_24cal.0.root /grid/data/minos/mcnearcat/n13011685_0002_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011684_0009_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011684_0001_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011684_0000_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011683_0006_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011682_0010_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011683_0001_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011683_0009_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011683_0002_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011683_0008_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011684_0002_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011684_0008_L010185N_D00.cand.cedar.root SRV1> find /grid/data/minos -size 0 -exec ls -l {} \; -rw-r--r-- 1 rubin numi 0 May 2 20:21 /grid/data/minos/mcfarcat/c10000602_0003.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:21 /grid/data/minos/mcfarcat/c10000602_0003.sntp.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 19:55 /grid/data/minos/nearcat/N00009235_0001.cosmic.cand.R1_24cal.0.root -rw-r--r-- 1 rubin numi 0 May 2 20:13 /grid/data/minos/nearcat/N00009256_0008.cosmic.cand.R1_24cal.0.root -rw-r--r-- 1 rubin numi 0 May 2 20:24 /grid/data/minos/mcnearcat/n13011685_0002_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:25 /grid/data/minos/mcnearcat/n13011684_0009_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:21 /grid/data/minos/mcnearcat/n13011684_0001_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:21 /grid/data/minos/mcnearcat/n13011684_0000_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:19 /grid/data/minos/mcnearcat/n13011683_0006_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:19 /grid/data/minos/mcnearcat/n13011682_0010_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:20 /grid/data/minos/mcnearcat/n13011683_0001_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:24 /grid/data/minos/mcnearcat/n13011683_0009_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:18 /grid/data/minos/mcnearcat/n13011683_0002_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:18 /grid/data/minos/mcnearcat/n13011683_0008_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:18 /grid/data/minos/mcnearcat/n13011684_0002_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:18 /grid/data/minos/mcnearcat/n13011684_0008_L010185N_D00.cand.cedar.root ######## # FARM # ######## Waiting for DCache GsiFTP door and CopyManager Take this as a chance to test I/O rates SRV1> du -sm REDO/WRITE 3014 REDO/WRITE SRV1> du -sb REDO/WRITE 3155841680 REDO/WRITE local to /grid/data/minos/TEST 08:51 SRV1> time cp -r REDO/WRITE /grid/data/minos/TEST real 3m22.844s user 0m0.108s sys 0m50.829s Rate 3156./263 = 12 MB/sec. /grid/data/minos/TEST to local 08:58 SRV1> time cp -r /grid/data/minos/TEST REDO/TESTREAD real 1m12.285s user 0m0.572s sys 0m25.092s Rate 3156./72 = 44 MB/sec. 
Repeat local to TEST2 09:00 SRV1> time cp -r REDO/TESTREAD /grid/data/minos/TEST2 real 1m49.432s user 0m0.077s sys 0m18.470s Rate 3156./109 = 29 MB/sec. Repeat local to TEST3 09:03 SRV1> time cp -r REDO/TESTREAD /grid/data/minos/TEST3 real 1m23.789s user 0m0.092s sys 0m47.065s Rate 3156./83 = 38 MB/sec. NOW SRMCP/DCCP test, to fermigrid/volatile cd LONG ( 2+ GByte file ) # size is 2283574599 IFILE=N00010819_0000.spill.sntp.R1_18_4.0.root SRMCP and list/remove export SRM_CONFIG=/export/stage/minfarm/.srmconfig/config.xml SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm SFILE2=${SPATH2}/${IFILE} SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm SFILE=${SPATH}/${IFILE} srmkdir ${SPATH2} srmls ${SPATH2} srmls ${SFILE2} time srmcp file:///${IFILE} ${SFILE} real 4m11.374s user 0m31.137s sys 0m46.010s srmrm ${SFILE2} time srmcp file:///${IFILE} ${SFILE} real 1m25.390s user 0m30.004s sys 0m41.447s time srmcp file:///${IFILE} ${SFILE} real 1m14.435s user 0m30.811s sys 0m44.405s time srmcp -streams_num=1 -server_mode=active file:///${IFILE} ${SFILE} real 1m1.463s user 0m19.758s sys 0m36.044s time srmcp -streams_num=1 -server_mode=active file:///${IFILE} ${SFILE} real 0m59.955s user 0m20.694s sys 0m35.184s time md5sum ../REDO/WRITE/* real 5m55.882s user 0m9.317s sys 0m7.823s Rate is 3155841680 / 356. = 8.9 MB/s time srmcp -streams_num=1 -server_mode=active file:///${IFILE} ${SFILE} real 1m0.487s user 0m20.063s sys 0m34.877s time srmcp -streams_num=1 -server_mode=active file:///${IFILE} ${SFILE} real 1m13.355s user 0m20.007s sys 0m34.414s time md5sum ../REDO/WRITE/* real 0m12.335s user 0m8.669s sys 0m3.659s setup dcap -q x509 DCPOR=24525 IPATH=minos/fermigrid/volatile/kreymer DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm/${IFILE} time dccp ${IFILE} ${DFILE} ########## # DCACHE # ########## Dcache advertized as up at 07:09 Replied to dcache-admin at 07:39 Except apparently the following DCache services, as show at http://fndca.fnal.gov:2288/cellInfo CopyManager OFFLINE RemoteGsiftpTransferManager OFFLINE RemoteHttpTransferManager OFFLINE SRM-stkendca2a OFFLINE Logged helpdesk ticket 96639 at 08:18 Berg claims services are up, srmcp/srmls do work for me. But fardet logging is still down. It resumed at 11:00, with an archiver restart. ############ # MCIMPORT # ############ Bounced off kordosky/tar/n12011778_0004_L010185N_D00-n12011795_0011_L010185N_D00.tar Should recover automatically at noon. ============================================================================= 2007 05 02 ########### # ROUNDUP # ########### cp AFSS/roundup.20070502 . ln -sf roundup.20070502 roundup ######## # FARM # ######## 166G free, clear some nd/fd space before running R1_24cal ./roundup -w -r cedar far ./roundup -w -r cedar hear # oops, typo ./roundup -w -r cedar near ./roundup -w -r cedar mcnear 225G free Check a couple of subrun sfrom R1_24cal ./roundup -s N00009095 -r R1_24cal near Looks good, go with ./roundup -r R1_24cal near Wed May 2 15:12:46 CDT 2007 Looking a bit at sar's log of network rates, sar -n DEV | grep bond0 > /tmp/sarn ( copy this to minos01 where we have working gnuplot ) setup gnuplot plot '/tmp/sarn' using 1:6 Typical write rate is 1, peak is 7. This ain't Gigabit, people ! 
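The same numbers can be had without copying the file to minos01 for gnuplot ; a sketch, where $6 is simply the column that was plotted above ( units as sar reports them ) :
awk '{ n++ ; s += $6 ; if ($6 > p) p = $6 }
     END { printf "samples %d  avg %.1f  peak %.1f\n", n, s/n, p }' /tmp/sarn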
92 GB free ./roundup -w -r cedar mcnear Wed May 2 23:27:50 CDT 2007 ./roundup -w -r R1_24cal near Wed May 2 23:33:20 CDT 2007 OOPS - the previous roundup run was still running, so the log is a bit mixed up. Should do no harm. aborted cleanly on a 0 size file 150 GB free ============================================================================= 2007 05 01 ########### # MONTHLY # ########### CFL 5/1 DATASETS 5/1 PREDATOR 5/1 SADDRECO 5/1 via roundup -m '2007-04' -r cedar far and near VAULT 5/15 MYSQL 5/... ########### # ROUNDUP # ########### Status of first intergrated roundup with mcnear and SAM : Good, but note that being 1 May, saddreco needs a catchup pass. roundup.20070502 Manual catchup for 2007-04 via AFSS/roundup.20070502 -m '2007-04' -r cedar near AFSS/roundup.20070502 -m '2007-04' -r cedar far Moved purge of files ahead of concatenation, for best space usage. Pre-purged with 135 G free About 95 GB in *cat, so will prepurge ( I will be in class all day tomorrow morning, so will not deploy the new roundup.20070502 in production yet. ) 102 GB in WRITE AFSS/roundup.20070502 -w -r cedar mcnear AFSS/roundup.20070502 -w -r cedar near AFSS/roundup.20070502 -w -r cedar far 235 G free ############ # DATASETS # ############ Need to change naming of summary files Was g - FermigridVolPools m - RawDataWritePools r - readPools w - writePools Want m for MinosPrdReadPools : Change m files to q for RawDataWritePools g - FermigridVolPools m - MinosPrdReadPools q - RawDataWritePools r - readPools w - writePools cd /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets MIN > FILES=`find . -name current.m.2\* | cut -f 2- -d /` MIN > printf "${FILES}\n" 2006/03/current.m.20060331 2006/04/current.m.20060401 2006/04/current.m.20060404 2006/04/current.m.20060406 2006/04/current.m.20060412 2006/04/current.m.20060413 2006/04/current.m.20060421 2006/04/current.m.20060426 2006/04/current.m.20060427 2006/09/current.m.20060918 2006/09/current.m.20060920 2006/09/current.m.20060925 2006/10/current.m.20061023 2007/02/current.m.20070226 2007/02/current.m.20070228 2007/03/current.m.20070302 2007/03/current.m.20070319 2007/04/current.m.20070402 for FILE in ${FILES} ; do echo mv ${FILE} ${FILE:0:16}q${FILE:17} ; done for FILE in ${FILES} ; do mv ${FILE} ${FILE:0:16}q${FILE:17} ; done Reexamine 7a-1/2, 8a-1/2 write pools, each 885760 MBytes, net 3.5 TB. ( 3.7 Decimal ) Consistent with Minos usage. Current raw data is about 4 TBytes. ####### # X11 # ####### for NODE in $UNODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'echo gimp;gimp;echo done' ; done minos02 Tue May 1 14:07:48 CDT 2007 minos04 Tue May 1 14:08:55 CDT 2007 minos05 Tue May 1 14:09:07 CDT 2007 minos23 Tue May 1 14:11:53 CDT 2007 for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh minos${NODE} 'mkdir -p /var/tmp/kreymer/.gimp-1.2' ; done lockups are repeatable. Reported to minos-admin Note that the initial symptom was an emacs session, stuck with this message in the lower left of window : Loading edt... 
Emacs version is xemacs-21.4.13-8.ent.1 ============================================================================= 2007 04 30 ######## # FARM # ######## 17 GB free ./roundup -w -r cedar mcnear Mon Apr 30 07:41:45 CDT 2007 Mon Apr 30 07:46:43 CDT 2007 127 GB free 103522 /grid/data/minos/mcnearcat Corrected crontab to crontab.dat from crontab.nofntp AFSS/roundup.20070501 -r cedar far Mon Apr 30 08:05:35 CDT 2007 Mon Apr 30 08:05:43 CDT 2007 ./roundup -r cedar mcnear Mon Apr 30 08:09:50 CDT 2007 OK - processing 2042 files Mon Apr 30 14:54:41 CDT 2007 Now wait till about 20:00 to purge WRITE files Saved mrtr traffic plot on desktop in fnpcsrv1-20070430.png 46 GB free ./roundup -w -r cedar mcnear Mon Apr 30 21:40:20 CDT 2007 Mon Apr 30 21:46:17 CDT 2007 143 GB free ######## # GRID # ######## Note that /grid/app is mounted on minos26 ######## # FARM # ######## Cleaning up duplicate subruns for n13011001_0000_L010185N_D00.mrnt.cedar.root n13011059_0000_L010185N_D00.mrnt.cedar.root Each was concatenated with 11 subruns. PA=/pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data for RUN in n13011001 n13011059 ; do for SUB in 01 02 03 04 05 06 07 08 09 10 ; do F=${RUN}_00${SUB}_L010185N_D00.mrnt.cedar.root [ -r ${PA}/${F:5:3}/${F} ] && ls -l ${PA}/${F:5:3}/${F} done ; done -rw-r--r-- 1 1334 e875 48903889 Mar 1 02:44 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0001_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 49539338 Mar 1 02:04 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0002_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 48888805 Mar 1 01:55 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0003_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 48978682 Mar 1 01:09 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0004_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 48989510 Mar 1 02:04 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0005_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 49178857 Mar 1 01:53 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0006_L010185N_D00.mrnt.cedar.root for RUN in n13011001 ; do for SUB in 01 02 03 04 05 06 ; do F=${RUN}_00${SUB}_L010185N_D00.mrnt.cedar.root mv ${PA}/${F:5:3}/${F} /pnfs/minos/BAD/DUP_${F} done ; done Did the above around 11:34 ######## # FARM # ######## Rubin email states that we should ignore any bad_runs lines having error 1. I think this is moot, as I keep existing files. I might force out a run missing a temporarily 'bad' subrun. Not sure how to find the error number, bad_files formats vary : bad_runs_mc.cedar n13025685_0000_L010185 carrot_06 L010185 139 2006-09-28 00:28:13 fnpc31 f21011011_0000_L010185N_D00 136 2007-02-25 20:39:11 fnpc229 bad_runs.cedar F00033570_0007.0 2006-01 92028 136 2006-08-28 10:55:05 fnpc59 Perhaps the error code is the last blank separated field before the year : grep ' *1 *....-..-.. *' ######## # FARM # ######## Added mcnearcat to crontab.dat : 00 08 * * * ${HOME}/scripts/roundup -c -r cedar far ; ${HOME}/scripts/roundup -c -r cedar near ; ${HOME}/scripts/roundup -c -r cedar mcnear ############ # SADDRECO # ############ Testing SAM declares in saddreco.20070501, looking to deploy tomorrow. Move it to production cp AFSS/roundup.20070501 . ln -sf roundup.20070501 roundup AFSS/roundup.20070501 -m -r cedar far 15:00 Oops, forgot to say 'declare' on saddreco commandline, corrected. 
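A sketch of pulling that field out explicitly, which should cope with both bad_runs formats shown above since only the full date field matches the pattern ( untested against the real lists ) :
awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^....-..-..$/) { print $(i-1), $0 ; break } }' \
    ~/lists/bad_runs_mc.cedar ~/lists/bad_runs.cedar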
Corrected, tested, looks OK in LOG/2007-04/declare_far_cedar.log ######## # FARM # ######## Stirred the pot regarding mcout_data/cedar/cosmic and atmos, which start with letters a and c, and do not indicate daikon heritage. Suggested usage of the beam configuration string, and reversion to n, f prefix. And shift to usual mcout_data/cedar/[ne,f]ar/daikon_00// ============================================================================= 2007 04 29 sunday ######## # FARM # ######## ./roundup -w -r cedar mcnear Sun Apr 29 07:41:18 CDT 2007 122 GB free Far writes stuck on sntp, clear the cand's ./roundup -w -s cand -r cedar far Sun Apr 29 07:49:43 CDT 2007 125 GB free crontab crontab.nofntp # adds -s cand to the far roundup DCache write pool has been reconfigures, try 1 file test ./roundup -w -s F00037950_0000.all -r cedar far look good, catch up : ./roundup -w -r cedar far 125 GB free Grab some more mrnt, up to 200 GB now ./roundup -s n130112 -r cedar mcnear # 635 files 32 GB SRV1> ls /grid/data/minos/mcnearcat | grep "n13011[2,3,4]" | wc -l 2689 ./roundup -n -W -s 'n13011[2,3,4]' -r cedar mcnear OK - processing 2689 files OOPS - Stream size 131286 too big for free space 127834 - 10000 SRV1> ./roundup -n -W -s 'n13011[2,3]' -r cedar mcnear OK - processing 1615 files OK - stream L010185N_D00.mrnt.cedar OK - 78869 Mbytes in 172 runs OK, let's do that ./roundup -s 'n13011[2,3]' -r cedar mcnear 54 GB free only 149/ of 180 WRITE files seem to be in enstore, close enough for next batch. roundup -w -r cedar mcnear Sun Apr 29 22:44:14 CDT 2007 112 GB free ./roundup -n -W -s 'n13011[4,5]' -r cedar mcnear OOPS - Stream size 104611 too big for free space 114510 - 10000 roundup -w -r cedar near roundup -w -r cedar far 113 GB free ./roundup -s 'n13011[4,5]' -r cedar mcnear Sun Apr 29 22:59:49 CDT 2007 OK - processing 2143 files OK - stream L010185N_D00.mrnt.cedar OK - 104611 Mbytes in 199 runs ... Writing at about 01:45 ... Mon Apr 30 05:48:36 CDT 2007 Rate is about 104611 MB/ 14480 Sec = 7 MBytes/sec MRTG was reporting sustained 30 MBits/second, not consistent with 7 MBy/sec ============================================================================= 2007 04 28 saturday ######## # FARM # ######## 68 GB free ./roundup -w -r cedar far 82 GB free ./roundup -w -r cedar near 127 GB free ./roundup -s sntp -f 2 -r cedar mcnear ( expect 25 GB ) ./roundup -s cand -r cedar mcnear ( expect 50 GB ) 61 GB free wait 4 hours ./roundup -w -r cedar mcnear 127 GB free ./roundup -s n130110 -r cedar mcnear ( expect 50 GB, 1036 mrnt files ) 64 GB free ./roundup -w -r cedar near 75 GB free mcnear writes are stuck, -rw-r--r-- 1 minfarm numi 536433461 Apr 28 11:27 n13011001_0000_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 rubin numi 48784095 Mar 1 00:51 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/ cd WRITE/ FS=`ls n*mrnt.cedar.root` PA=/pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data for F in ${FS} ; do [ -r ${PA}/${F:5:3}/${F} ] && ls -l ${PA}/${F:5:3}/${F} done -rw-r--r-- 1 rubin numi 48784095 Mar 1 00:51 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0000_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 rubin numi 49142158 Mar 1 02:49 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/105/n13011059_0000_L010185N_D00.mrnt.cedar.root Other subruns also exist from March 1 processing. 
Dodge around this for now by moving the two offenders out of the way As rubin PA=/pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data for F in n13011001_0000_L010185N_D00.mrnt.cedar.root \ n13011059_0000_L010185N_D00.mrnt.cedar.root do mv ${PA}/${F:5:3}/${F} /pnfs/minos/BAD/DUP_${F} done Finding all the mcnear duplicates pending: cd /grid/data/minos/mcnearcat FS=`ls n*mrnt.cedar.root` Now back as minfarm ./roundup -w -r cedar mcnear Sat Apr 28 14:45:59 CDT 2007 75 GB free Now purge them, ./roundup -w -r cedar mcnear Sat Apr 28 20:32:06 CDT 2007 122 GB free ./roundup -s n130111 -r cedar mcnear # expect 47 GB, 961 mrnt files Sat Apr 28 20:48:34 CDT 2007 ############ # SADDRECO # ############ REL=cedar MON=2007-04 for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done ########## # DCACHE # ########## Far det reco files failed in a messy way to go to DCache, the root cause in the messages seems to be : Pool manager error: No write pools configured for reco_near_sntp and reco_far_cand are fine. ########## # DCACHE # ########## sent request to dcache-admin The MinosPrdSelGrp selection group presently contains minos.reco_far_cedar_bntp@enstore minos.reco_far_cedar_mrnt@enstore minos.reco_far_cedar_sntp@enstore After we have corrected the present problem writing to these families, please extend the MinosPrdSelGrp to include minos.reco_near_cedar_bntp@enstore minos.reco_near_cedar_mrnt@enstore minos.reco_near_cedar_sntp@enstore ########### # ROUNDUP # ########### roundup.20070501 Added -m -M options to enable/disable saddreco calls for near, far only ============================================================================= 2007 04 27 ########## # DCACHE # ########## Conversation with podstvkv, New pools have not been effective, because wild carding of file families does not work as a general expression. Will send explicit list of families, he will test with cedar.
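The explicit list can be spelled out from the detector / tier pattern used in the 28 April request above ; a sketch that just regenerates those six ntuple families ( the full set to send still needs thought ) :
for DET in far near ; do for TIER in bntp mrnt sntp ; do
  echo minos.reco_${DET}_cedar_${TIER}@enstore
done ; done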
Note 13 Feb note regarding read families y Note 9 Feb note regarding need for 5+ TB of Minos DAQ capacity ########## # DCACHE # ########## New pools seem to be present, since 25 April 14:00, ExpDbWritePools 8.9 TB w-stkendca17a-1 680960 w-stkendca17a-2 680960 w-stkendca17a-3 680960 w-stkendca18a-1 680960 w-stkendca18a-2 680960 w-stkendca18a-3 680960 w-stkendca19a-1 680960 w-stkendca19a-2 680960 w-stkendca19a-3 680960 w-stkendca20a-1 921600 w-stkendca20a-2 921600 w-stkendca20a-3 921600 FermigridVolPools v-stkendca16a-1 v-stkendca16a-2 v-stkendca16a-3 v-stkendca16a-4 v-stkendca16a-5 v-stkendca16a-6 KTeVReadPools r-stkendca12a-1 r-stkendca12a-2 r-stkendca12a-3 r-stkendca12a-4 r-stkendca13a-1 r-stkendca13a-2 r-stkendca13a-3 r-stkendca13a-4 r-stkendca14a-1 r-stkendca14a-2 r-stkendca14a-3 r-stkendca14a-4 r-stkendca15a-1 r-stkendca15a-2 r-stkendca15a-3 r-stkendca15a-4 MinosPrdReadPools 10.2 TB r-stkendca17a-4 680960 r-stkendca17a-5 906240 r-stkendca17a-6 906240 r-stkendca18a-4 680960 r-stkendca18a-5 906240 r-stkendca18a-6 906240 r-stkendca19a-4 680960 r-stkendca19a-5 906240 r-stkendca19a-6 906240 r-stkendca20a-4 906240 r-stkendca20a-5 906240 r-stkendca20a-6 906240 RawDataWritePools 3.5 TB w-stkendca7a-1 885760 w-stkendca7a-2 885760 w-stkendca8a-1 885760 w-stkendca8a-2 885760 readPools r-stkendca12a-5 r-stkendca12a-6 r-stkendca13a-5 r-stkendca13a-6 r-stkendca14a-5 r-stkendca14a-6 r-stkendca15a-5 r-stkendca15a-6 r-stkendca17a-4 r-stkendca17a-5 r-stkendca17a-6 r-stkendca20a-4 r-stkendca20a-5 r-stkendca20a-6 writePools w-stkendca10a-1 788480 w-stkendca10a-2 788480 w-stkendca10a-3 788480 w-stkendca10a-4 788480 w-stkendca10a-5 788480 w-stkendca10a-6 788480 w-stkendca11a-1 788480 w-stkendca11a-2 788480 w-stkendca11a-3 788480 w-stkendca11a-4 788480 w-stkendca11a-5 788480 w-stkendca11a-6 788480 w-stkendca17a-1 680960 w-stkendca17a-2 680960 w-stkendca17a-3 680960 w-stkendca20a-1 680960 w-stkendca20a-2 680960 w-stkendca20a-3 680960 w-stkendca9a-1 675840 w-stkendca9a-2 675840 w-stkendca9a-3 675840 w-stkendca9a-4 675840 w-stkendca9a-5 675840 w-stkendca9a-6 675840 ######## # FARM # ######## Failed to restart roundup concatenation in crontab. But /export/stage is full ! Cleanup : SRV1> df -h /export/stage Filesystem Size Used Avail Use% Mounted on /dev/sdb3 477G 451G 1.9G 100% /export/stage The 451 GB is mostly not minos : SRV1> du -sm /export/stage/minfarm du: `/export/stage/minfarm/.grid/backup': Permission denied 125245 /export/stage/minfarm But we can help for a while : SRV1> du -sm WRITE 94534 WRITE SRV1> ./roundup -w -r cedar near SRV1> du -sm WRITE 85525 WRITE SRV1> ./roundup -w -r cedar far SRV1> du -sm WRITE 80661 WRITE SRV1> ./roundup -w -r cedar mcnear SRV1> du -sm WRITE 1 WRITE Now to catch up SRV1> du -sm /grid/data/minos/*cat 14686 /grid/data/minos/farcat 1 /grid/data/minos/mccat 1 /grid/data/minos/mcfarcat 172917 /grid/data/minos/mcnearcat 45850 /grid/data/minos/nearcat SRV1> ./roundup -r cedar far SRV1> du -sm WRITE 14612 WRITE Writing rate is misearable ! Concatenation in 15 minutes, 11:13 thru 11:28 srmcp's in 98 minutes, 11:28 thru 13:10, per ls -ltr /pnfs/minos/reco_far/cedar/cand_data/2007-04 mgtr for fnpcsrv1 show sustained 120 Mbits/second but equal input and output rates. 
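Cross-checking that rate is just arithmetic on the file sizes and the wall clock ( a sketch ; it assumes the 2007-04 cand directory holds only this pass's output, otherwise the listing needs a date filter ) :
    # Sketch - effective srmcp rate for the 11:28 - 13:10 window, from file sizes
    DIR=/pnfs/minos/reco_far/cedar/cand_data/2007-04
    T0=`date -d '11:28' +%s`
    T1=`date -d '13:10' +%s`
    ls -l ${DIR} | awk -v sec=`expr ${T1} - ${T0}` \
        '{ byt += $5 } END { printf "%d MB in %d sec = %.1f MB/sec\n", byt/1048576, sec, byt/1048576/sec }'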
SRV1> ./roundup -r cedar far SRV1> du -sm WRITE 60362 WRITE mrtg shows mostly 20 mbit/second data rate, all 'in' ( to net ) very different than for far detector 13:48 thru Will have to get all this on tape, then roundup -w to purge, then split up the mcnear files : SRV1> ls /grid/data/minos/mcnearcat | grep n130110 | wc -l 1036 SRV1> ls /grid/data/minos/mcnearcat | grep n130111 | wc -l 961 ######## # FARM # ######## SRV1> DET=far SRV1> ./saddreco ${DET} ${REL} ${MON} declare \ > >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 ######## # FARM # ######## timm mentions 60 GB quota ? sum of /farm/stage01_minos 1 /farm/stage02_minos 4091 /farm/stage03_minos 1 /farm/minsoft 3347 /farm/minsoft2 52347 /farm/minsoft2/Minossoft 52330 32878 /farm/minsoft2/Minossoft/dbm Per rubin note, this is mostly ancient releases and root versions ============================================================================= 2007 04 26 kreymer on vacation ============================================================================= 2007 04 25 kreymer on vacation ============================================================================= 2007 04 24 ########### # ROUNDUP # ########### roundup.20070424 Corrected defects with cand/bcnd handling bcnd - set SOLO bcnd/cand - disable PEND ############ # MCIMPORT # ############ Arms reports originally truncated files under /pnfs/minos/mcin_data/far/daikon_00/L010185N f21011047_0000_L010185N_D00 f21011048_0000_L010185N_D00 f21011064_0000_L010185N_D00 f21011067_0000_L010185N_D00 f21011073_0000_L010185N_D00 f21011077_0000_L010185N_D00 f21011078_0000_L010185N_D00 f21011100_0000_L010185N_D00 f21311177_0000_L010185N_D00 f21311178_0000_L010185N_D00 for FI in $FIS ; do dds ${FPA}/${FI:5:3}/${FI}.reroot.root ; done -rw-r--r-- 1 kreymer e875 226206750 Mar 1 17:12 /pnfs/minos/mcin_data/far/daikon_00/L010185N/104/f21011047_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 227337073 Mar 1 17:13 /pnfs/minos/mcin_data/far/daikon_00/L010185N/104/f21011048_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 241601846 Mar 2 18:26 /pnfs/minos/mcin_data/far/daikon_00/L010185N/106/f21011064_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 227683157 Mar 2 18:21 /pnfs/minos/mcin_data/far/daikon_00/L010185N/106/f21011067_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 243768829 Mar 2 18:12 /pnfs/minos/mcin_data/far/daikon_00/L010185N/107/f21011073_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 246309331 Mar 2 18:22 /pnfs/minos/mcin_data/far/daikon_00/L010185N/107/f21011077_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 250816517 Mar 2 18:52 /pnfs/minos/mcin_data/far/daikon_00/L010185N/107/f21011078_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 288965539 Mar 6 11:28 /pnfs/minos/mcin_data/far/daikon_00/L010185N/110/f21011100_0000_L010185N_D00.reroot.root ls: /pnfs/minos/mcin_data/far/daikon_00/L010185N/117/f21311177_0000_L010185N_D00.reroot.root: No such file or directory ls: /pnfs/minos/mcin_data/far/daikon_00/L010185N/117/f21311178_0000_L010185N_D00.reroot.root: No such file or directory for FI in $FIS ; do mv ${FPA}/${FI:5:3}/${FI}.reroot.root /pnfs/minos/BAD/BAD_${FI}.reroot.root done mv: cannot stat `/pnfs/minos/mcin_data/far/daikon_00/L010185N/117/f21311177_0000_L010185N_D00.reroot.root': No such file or directory mv: cannot stat `/pnfs/minos/mcin_data/far/daikon_00/L010185N/117/f21311178_0000_L010185N_D00.reroot.root': No such file or directory ########## # DCACHE # ########## For tjyang calibration work, post shutdown cedar 
ntuples needed in DCache, ./stage -s sntp_data/2006 VOB733 Needed 62/ 177 FINISHED Tue Apr 24 15:14:06 CDT 2007 ./stage -s sntp_data/2006 VOB894 ./stage -s sntp_data/2006 VOB357 ./stage -s sntp_data/2006 VO5072 These tapes were already mounted, might as well get all the files. ============================================================================= 2007 04 23 ########## # DCACHE # ########## Minos production read pool group should come online tomorrow. Discussed overall scale with Vlad, may need more pools . First goal is stability of config, then adjust scale. ============================================================================= 2007 04 20 ######## # GRID # ######## Added /fermilab/minos Production to both my cert's ( grid, fermilab ) Will post the procedure to fermigrid-users Steve Timm noted that those who need to write to DCache under both /pnfs/fnal.gov/usr/ and /pnfs/fnal.gov/usr/fermigrid/volatile/ should have a Production role assigned under the Fermilab VO. Here is a specific procedure, obtained with some guidance from Dan Yucum. You *must* use the VOMRS interface: https://voms.fnal.gov:8443/vomrs/vo-fermilab/vomrs The update will be immediate to VOMS, but could take up to 6 hours to migrate to GUMS. Expand the left menu bar as : [-] fermilab Registration Home [-] Members . Re-sign Grid and VO AUPs [+] Certificates . Edit Personal Info . Change Email Address . Change Representative . Change Expiration Date . Set Authorization Status . Manage Groups & Group Roles Click on . Manage Groups & Group Roles You will get a search form. Find the cert's of interest ( I used my last name to select mine ) checking the Member DN and Roles boxes before the search. The report has columns including Group role , Status and Select. Under Select, check the Production box for the appropriate groups(s). Then click the [submit] box at the bottom left corner of the report. You will go to a new web page which announces : "You have successfully assigned member(s) to group/role!" A subsequent search of these certs will show the Production role with Status 'Approved' The owner of each cert will also get a confirming email. ############ # SADDRECO # ############ saddreco.20070420 Added ping of dbserver, like command line sam ping dbserver ----retryMaxCount=1 --retryJitter=0 SAMQ - get from ping , not SETUP_SAM_CONFIG ########### # ROUNDUP # ########### roundup.20070420 Added call to saddreco ########## # DCACHE # ########## Investigating reported bad file in dcache write pool SRV1> pwd /export/stage/minfarm/ROUNDUP/DUP SRV1> IFILE=n13011446_0000_L010185N_D00_nccoh.cand.cedar.root SRV1> IPATH=minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/cand_data/144 SRV1> DCPOR=24125 # unsecured SRV1> DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} SRV1> dccp ${DFILE} . 
1602194515 bytes in 61 seconds (25649.89 KB/sec) MINOS26 > ./dc_stat /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/cand_data/144/n13011446_0000_L010185N_D00_nccoh.cand.cedar.root ============================ PNFS status for /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/cand_data/144/n13011446_0000_L010185N_D00_nccoh.cand.cedar.root -rw-r--r-- 1 1334 e875 1602194515 Apr 13 12:18 n13011446_0000_L010185N_D00_nccoh.cand.cedar.root LEVEL 2 2,0,0,0.0,0.0 :c=1:262b1661;h=yes;l=1602194515; r-stkendca20a-6 w-stkendca10a-1 Looking in pool lists: 9526 n13011446_0000_L010185N_D00_nccoh.cand.cedar.root 000F00000000000005312848 1602194515 si={minos.reco_mc_near_cedar} 799 n13011446_0000_L010185N_D00_nccoh.cand.cedar.root 000F00000000000005312848 <-P---s----L(0)[0]> 1602194515 si={minos.reco_mc_near_cedar} This is the same file reported in the flood of emails Monday 16 April. I have removed it via rubin@fnpcsrv1. ####### # CVS # ####### minoscvs@cdcvs account was created 17 April Accesses by baisley and penny They are using an old cvsh v1_9, CDF needs to have v1_11_1 at least, Minos is using 1.9.1 ============================================================================= 2007 04 19 ########## # DCACHE # ########## DCache access was lost at 06:06 ( as seen by ND data logging ) Apr 19 05:54 N00012077_0028.mdaq.root DCache was shut down at 06:30 for planned but unannounced maintenance of PNFS. Helpdesk ticket 95868 11:06 - PNFS seems back per http://www-numi.fnal.gov/computing/dh/pnfslog/NOW.txt 13:38 - email that Dcache being started 13:49 - some but not all DCache services coming back 13:51 - beam data logging started to succeed, followed by fd ( none yet from ND ) 13:55 - have all but CopyManager RemoteGsiftpTransferManager RemoteHttpTransferManager SRM-stkendca2a 14:02 - Enstore/Dcache announced as being up 14:10 - The above 4 services were restarted ( 14:09:56 ) ND archiver stuck : QOL I Thu 19-04-2007 05:54:45 archiver 6372 131.225.192.132 1 112033 run 12077 Processing file N00012077_0028.mdaq.root QOL I Thu 19-04-2007 05:54:45 archiver 6372 131.225.192.132 1 112034 run 12077 Getting credentials QOL I Thu 19-04-2007 05:54:47 archiver 6372 131.225.192.132 1 112035 run 12077 Got credentials QOL I Thu 19-04-2007 05:54:47 archiver 6372 131.225.192.132 1 112036 run 12077 Trying ftp connect to disk cache QOL I Thu 19-04-2007 05:54:47 archiver 6372 131.225.192.132 1 112037 run 12077 Ftp connect succeeded 14:40 - nd archiver restarted filesize matched N00012077_0028.mdaq.root ls -l --> 143776495 Apr 19 14:40 N00012077_0029.mdaq.root ######## # FARM # ######## 15:02 nothing to do for near, SRV1> ./roundup -r cedar far 15:42 SRV1> ./roundup -r cedar mcnear Oops, there are lots of partial runs, being written anyway with gaps. Will have to come back later and force the gap fillers out. 
OK adding n13011433_0000_L010185N_D00_nccoh.sntp.cedar.root 3 OOPS - SUBRUN gap 4 to 6 OK adding n13011436_0000_L010185N_D00_nccoh.sntp.cedar.root 4 OOPS - SUBRUN gap 4 to 5 OK adding n13011437_0000_L010185N_D00_nccoh.sntp.cedar.root 4 OOPS - SUBRUN gap 9 to 9 OK adding n13011437_0006_L010185N_D00_nccoh.sntp.cedar.root 3 OK adding n13011437_0010_L010185N_D00_nccoh.sntp.cedar.root 1 OK adding n13011438_0001_L010185N_D00_nccoh.sntp.cedar.root 10 OOPS - SUBRUN gap 6 to 6 OK adding n13011439_0000_L010185N_D00_nccoh.sntp.cedar.root 6 OOPS - SUBRUN gap 8 to 9 OK adding n13011439_0007_L010185N_D00_nccoh.sntp.cedar.root 1 OOPS - SUBRUN gap 5 to 5 OK adding n13011440_0000_L010185N_D00_nccoh.sntp.cedar.root 5 OOPS - SUBRUN gap 9 to 9 OK adding n13011440_0006_L010185N_D00_nccoh.sntp.cedar.root 3 OOPS - SUBRUN gap 9 to 9 OK adding n13011456_0000_L010185N_D00_nccoh.sntp.cedar.root 9 OOPS - SUBRUN gap 6 to 6 OK adding n13011458_0000_L010185N_D00_nccoh.sntp.cedar.root 6 OK - stream L010185N_D00.sntp.cedar OK - 26037 Mbytes in 40 runs OOPS - SUBRUN gap 9 to 9 OK adding n13011624_0000_L010185N_D00.sntp.cedar.root 9 OOPS - SUBRUN gap 8 to 8 OK adding n13011631_0001_L010185N_D00.sntp.cedar.root 7 OOPS - SUBRUN gap 2 to 2 OK adding n13011647_0000_L010185N_D00.sntp.cedar.root 2 OOPS - SUBRUN gap 9 to 9 OK adding n13011650_0000_L010185N_D00.sntp.cedar.root 9 ############ # SADDRECO # ############ for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done ============================================================================= 2007 04 18 ########## # STATUS # ########## Requested access to CD System Status Minos web page, for kreymer buckley rhatcher urish Ticket 95815 2007 12 26 - assigned to Richard Thies ####### # AFS # ####### loiacono - requests AFS disk space for beam ntuples Reviewing farm usage : for DIR in $DIRS ; do echo ${DIR} fs listacl ${DIR} | grep -q minosrecodata && fs listquota ${DIR} done | grep 50000000 86 volumes ( 50 GB ) ==> 4.3 TBytes Later, repeated, have 90 volumes, 4.5 TBytes Looking at existing AFS ntuples cd d10/indexes wc -l *.index | sort -n ... 770 2006-10_far.R1_18_4.index 816 BAD_mc_far.daikon_00.cedar.index 1332 mc_far.carrot.cedar.index 1594 mc_far.daikon_00.cedar.index 1844 mc_far.carrot.R1_18_2.index 1984 mc_cosmic.bfld201.cedar.index 2024 mc_far.R1.14.index 2064 2005-04_far.R1_18.index 2289 mc_near.R1_18_2.index 9218 mc_near.daikon_00.cedar.index 10252 mc_near.carrot_06.cedar.index 10435 mc_near.carrot_06.R1_18_2.index 105697 total wc -l *.R1_18_2.index | sort -n 748 2005-12_far.R1_18_2.index 1844 mc_far.carrot.R1_18_2.index 2289 mc_near.R1_18_2.index 10435 mc_near.carrot_06.R1_18_2.index 30824 total ######## # FARM # ######## Scheduled FARM maintenance ( fnpcsrv1/2 ) started around 09:00 Announced to fermigrid-announce ( but not in advance ) No details of downtime duration or work to be done. Nothing on CD Status Page Up at 13:02 N.B. /fnal/ups is now bluearc served, not from fnpcsrv1 ########### # BLUEARC # ########### Spoke to Ray Pasetes x 5250 about long term possibilities no per-disk or volume license charges Hitachi at about $2K/TB managed by then, or external SATAbeasts etc. 
http://computing.fnal.gov/nasan/bluearc.html ########## # DCACHE # ########## Removed and rewrite old damaged file from 13 March : SRV1> ls -l /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112/f21011125_0000_L010185N_D00.sntp.cedar.root -rw-rw-r-- 1 bseilhan numi 0 Mar 13 15:14 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112/f21011125_0000_L010185N_D00.sntp.cedar.root SRV1> rm /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112/f21011125_0000_L010185N_D00.sntp.cedar.root SRV1> setup dcap MINOS26 > grep f21011125_0000_L010185N_D00.sntp.cedar.root ~/minos/CFL/CFL minos reco_mc_far_cedar VO4049 0000_000000000_0000239 CDMS117334037200000 61387201 1783755534 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112/f21011125_0000_L010185N_D00.sntp.cedar.root EH ???? too late, this is not consistent with PNFS listing Copy the AFS copy anyway SRV1> AFSP=/afs/fnal.gov/files/data/minos/d10/recodata90 SRV1> ls -l ${AFSP}/${FILE} -rw-rw-r-- 1 bseilhan numi 64260919 Mar 10 09:44 /afs/fnal.gov/files/data/minos/d10/recodata90/f21011125_0000_L010185N_D00.sntp.cedar SRV1> DFSP=dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112 SRV1> dccp ${AFSP}/${FILE} ${DFSP}/${FILE} 64260919 bytes in 3 seconds (20918.27 KB/sec) ######## # FARM # ######## srmcp has vanished SRV1> less /export/osg/grid/setup.sh SRV1> echo $PATH | tr : \\\n | grep srm /export/osg/grid/srmclient/bin /export/osg/grid/srmclient/sbin /export/osg/grid/srmclient/bin /export/osg/grid/srmclient/sbin But /export/osg/grid/srmclient is not present ! Submitted helpdesk ticket 95864 Forwarded note to minos-data, minos_batch, fermigrid-users SRV1> crontab crontab.noround Here is the ticket content : //////////////////////////////////////////////////////////////// Short Description: srmcp and other commands are missing Problem Description: Since today's maintenance, we seem to be missing the srmcp command et.al. We set up osg as always with source /export/osg/oldgrid/setup.csh The path to srmcp is being set : minfarm on fnpcsrv1% echo $PATH | tr : \\\n | grep srm /export/osg/grid/srmclient/bin /export/osg/grid/srmclient/sbin /export/osg/grid/srmclient/bin /export/osg/grid/srmclient/sbin But ths srmclient directory is missing. The old osg tree was moved today , changed to a symlink : minfarm on fnpcsrv1% ls -l /export/osg total 8 lrwxrwxrwx 1 root root 14 Apr 18 17:16 grid -> /usr/local/vdt/ drwxr-xr-x 31 root root 4096 Jan 17 11:15 oldgrid/ drwxr-xr-x 2 root root 4096 Oct 12 2005 scratch/ srmcp is still there under oldgrid Was the OSG software deliberately changed ? There was no announcement to this effect. //////////////////////////////////////////////////////////////// ########### # ROUNDUP # ########### ln -sf roundup.20070417 roundup this was trying to do the right thing for near data ============================================================================= 2007 04 17 ######## # FARM # ######## re-listed DFARM /minos/*, only cores has remaining files ROUNDUP - did catchup, to pick up several runs stuck since Friday, which has not yet emerged from the farm by the 08:00 cron run 11:09 SRV1> ./roundup -r cedar near ########### # BLUEARC # ########### Check with Ray Pasetes on specs/plans for Blue Arc - 5250 Should we use this as shared work space for Minos Cluster ? Should this supplement AFS/DCache ? Cost ? ########### # ROUNDUP # ########### Starting MC tests, with files presently in GDM/mcnearcat SRV1> ./roundup -n -r cedar mcnear ... 
./roundup -n -W -s n13011451_ -r cedar mcnear OK adding n13011451_0000_L010185N_D00_nccoh.sntp.cedar.root 11 Tue Apr 17 14:10:31 CDT 2007 ./roundup -n -w -s n13011451_ -r cedar mcnear srmcp file:///n13011451_0000_L010185N_D00_nccoh.sntp.cedar.root \ srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/145/n13011451_0000_L010185N_D00_nccoh.sntp.cedar.root OOPS, need to adjust path to include potential beam config suffix like ncc0h. Path should be mcout_data/cedar/near/daikon_00/L010185N_nccoh roundup only worked accidentally, as there are also n13011451_0000_L010185N_D00 run/subruns without the _nccoh. UGH... We have the same run/subrun being used twice, each time with different physics. Not so smart, but it is being done. Roundup needs to append to beam, if non-null, _ and cut -f 5 -d '_' up to . Modified roundup.20070417 accordingly Setting file families for directories : howie : SRV1> DIRS=`ls` SRV1> for DIR in $DIRS ; do ( cd ${DIR}/sntp_data ; enstore pnfs --tags | grep 'ly) =' ) ; done /pnfs/minos/mcout_data/cedar/near/daikon_00/L010000N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_charm/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L150200N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010170N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_lowi/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010200N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L250200N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_medi/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L100200N/sntp_data .(tag)(file_family) = reco_mc_near_cedar SRV1> for DIR in $DIRS ; do ( cd ${DIR}/${STR}_data ; pwd ; enstore pnfs --file_family reco_mc_near_cedar_${STR} ) ; done Same for STR=cand , STR=mrnt Do not bother with lower level RUN subdir's, good enough to pick up new ones SRV1> find . -type d -exec ls -ld {} \; SRV1> find . -type d -exec chmod 775 {} \; chmod: changing permissions of `./L010185N/cand_data/161': Operation not permitted chmod: changing permissions of `./L010185N/cand_data/162': Operation not permitted These were owned by bseilhan, perm's were OK already ============================================================================= 2007 04 16 ########## # DCACHE # ########## Email every 2 minutes or so regarding cand file error in DCache . 
This is being sent to non-minos addressees : Date: Mon, 16 Apr 2007 11:15:18 -0500 From: Enstore To: cdfdh_oper@fnal.gov, cmst1@fnal.gov, jen_a@fnal.gov, stoughto@fnal.gov} Subject: Alarm raised Mon Apr 16 11:15:18 CDT 2007 From: alarm_server ['1176740118.76', 1176740118.7641211, 'stkendca10a.fnal.gov', 13190, 'root', 'C (1)', 'ENCP', 'CRC DCACHE MISMATCH', None, None, None, {'r_a': (('131.225.13.94', 33575), 1L, '131.225.13.94-33575-1176739751.637307-13190'), 'text': {'status': 'CRC dcache mismatch: 640357985 (0x262B1661L) != 3635691144 (0xD8B43E88L)', 'outfile': '/pnfs/fnal.gov/usr/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/cand_data/144/n13011446_0000_L010185N_D00_nccoh.cand.cedar.root', 'infile': '/diska/write-pool-1/data/000F00000000000005312848'}}] Rubin will remove the file. There seems to be no sntp_data file for this subrun. ######## # FARM # ######## The DFarm array failed last Friday 13 April Even /tmp seems to be locked agains writing on fnpcsrv1. No roundup has run since friday Warning : farm mc output is going to /grid/data/mcnearcat and /grid/data/nearcat Trying to log into fngp-osg as minfarm, getting stuck. ####### # DAQ # ####### Checked for updated kerberos for NODE in acnet beamdata om evd rc ; do ssh -l minos minos-${NODE} rpm -q krb5-workstation-fermi gateway-nd done krb5-workstation-fermi-1.8d-1.LTS4 krb5-workstation-fermi-1.8d-1.LTS4 krb5-workstation-fermi-1.8d-1.LTS4 krb5-workstation-fermi-1.8d-1.LTS4 krb5-workstation-fermi-1.8d-1.LTS4 krb5-workstation-fermi-1.8d-1.LTS4 ============================================================================= 2007 04 13 ############## # DBARCHIVES # ############## Scanned exp-db database backups with new script ./dbarchives /home/kreymer/COMPLETE_FILE_LISTING_exp-db.20070410 /weekly/.\*April 134 263 263155 weekly/cdf-offline/cdfofpr2/2007/04-April/08 373 393 393351 weekly/cdf-online/cdfonprd/2007/04-April/08 227 475 475634 weekly/d0-offline/d0oflump/2007/04-April/08 404 771 771552 weekly/d0-offline/d0ofprd1/2007/04-April/08 23 45 45459 weekly/d0-online/d0onlprd/2007/04-April/08 29 3 3736 weekly/minos-offline/minosprd/2007/04-April/08 FILES GB MB SET Noted drop, but not so much, in size of D0 after 26 March drop of event tables MIN > ./dbarchives /home/kreymer/COMPLETE_FILE_LISTING_exp-db.20070410 /weekly/d0-offline/d0ofprd ... 542 1038 1038083 weekly/d0-offline/d0ofprd1/2007/03-March/25 404 771 771552 weekly/d0-offline/d0ofprd1/2007/04-April/08 ########## # DCACHE # ########## 18/19 pools are starting tests in FNDCAT today, need to burn in for 1 week. Plan deployment 23 April. 
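For reference, the grouping the dbarchives scan above does amounts to something like this ( a sketch only ; the field positions assumed for size and path in the COMPLETE_FILE_LISTING are a guess, not checked against the real format, and the real script also reports MB ) :
    # Sketch of a dbarchives-style summary - file count and total GB per backup set
    # ASSUMPTION : size in field 5, full path in the last field of each listing line
    LIST=/home/kreymer/COMPLETE_FILE_LISTING_exp-db.20070410
    PAT='/weekly/.*April'
    grep "${PAT}" ${LIST} | awk '
        { siz = $5 ; set = $NF
          sub("/[^/]*$", "", set)               # strip the file name, keep the set directory
          nf[set]++ ; by[set] += siz }
        END { for (s in nf) printf "%6d %6d  %s\n", nf[s], by[s]/1e9, s }' | sort -k 3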
############ # MCIMPORT # ############ stagesum - notes on how to start plotting MC port data with gnuplot ( simple size vs date plot for now ) ============================================================================= 2007 04 12 ######## # FARM # ######## The final 24 nearcat files are copied to /grid/data/minos/nearcat ( first attempt failed with dfarm timeout ) SRV1> DIRS='cores farcat fardet li mc mccat mcfarcat mcnearcat mctest nearcat neardet test' SRV1> for DIR in ${DIRS} ; do printf "${DIR} " ; dfarm ls /minos/${DIR} | wc -l ; done cores 39 farcat 0 fardet 0 li 18 mc 0 mccat 0 mcfarcat 0 mcnearcat 0 mctest 19 nearcat 0 neardet 0 test 8 Ran roundup at 09:50 to pick up nearcat files PEND review PEND - have 1/19 subruns for N00012007_*.spill.sntp.cedar*.root 5 04/07 04:35 This subrun was in the BADRUN list Wed Apr 4 08:00:02 CDT 2007 ./roundup -S -s N00012007_0007 -f 0 -r cedar near Now some stale mrnt's PEND - have 5/7 subruns for N00008433_*.spill.mrnt.cedar*.root 5 04/06 17:44 Missing _0000 and _0001 PEND - have 18/19 subruns for N00008454_*.spill.mrnt.cedar*.root 5 04/06 16:43 Missing _0010 PEND - have 17/18 subruns for N00008612_*.spill.mrnt.cedar*.root 3 04/09 04:00 Missing _0010 PEND - have 4/12 subruns for N00008695_*.spill.mrnt.cedar*.root 3 04/09 07:55 Missing _0004 through _0011 ########### # ROUNDUP # ########### roundup.20070412 - corrected FILES list to select per type, SEL, REL this had been misplaced in .20070411 shift to 'find' corrected to check bad_runs_mrcc.${REL} for mrnt This cleared up all the stale mrnt files except N00008433 ############ # SADDRECO # ############ saddreco.20070412 Dropped parents from normal printout ########## # DCACHE # ########## Send email to dcache-admin re Minos Read pool deployment ============================================================================= 2007 04 11 ######## # FARM # ######## Howie reports 12 production and 21 mc jobs running which will write DFARM production are all N00012029 Moved roundup.20070411 to production, including 10 minute age requirement Ran on near, far round 10:00 Re-enabled cron job Ran saddreco 12:45 4 of the 12 nearcat files are in dfarm 13:56 8 files are there 17:35 14 of 12 files... let's let this keep running, we are waiting for 12 subruns, 24 files. ####### # AFS # ####### per buckley, requested 250 GG ( 5 x 50 ) additional user space acl's cloned from d01 ls -d $MINOS_DATA/d* | cut -c 33- | sort -n ... 227 228 229 request /afs/fnal.gov/files/data/minos/d230 through d234 ticket 95475 ####### # SAM # ####### Verified unknown volume location for 2 obsolete ND reco files N00010912_0003.cosmic.sntp.R1_18_4.0.root N00010912_0003.spill.sntp.R1_18_4.0.root ####### # AFS # ####### User volumes $MINOS_DATA/d230-d234 are created, by joes. ============================================================================= 2007 04 10 ######## # FARM # ######## rubin is ready to write to /grid/data There are some older mrnt runs I'd like to flush directly from DFARM, but that does not affect the switch to /grid/data for current running. The DFARM farcat area is up to date, containing only current files which I will copy to /grid/data/farcat. The DFARM nearcat area has some older runs with single missing mrnt subruns, which I presume are missing because of the problems reading the cand files. 
PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 39 03/01 17:20:46 PEND - have 23/24 subruns for N00009168_*.spill.mrnt.cedar*.root 11 03/29 17:07:07 PEND - have 22/24 subruns for N00009235_*.spill.mrnt.cedar*.root 10 03/30 09:19:40 We are missing : N00009165_0002.spill.mrnt.cedar.0.root N00009168_0018.spill.mrnt.cedar.0.root N00009235_0019.spill.mrnt.cedar.0.root N00009235_0022.spill.mrnt.cedar.0.root These were listed in the rubin 1 April email to minos_batch N00009162_0013.0 2005-11 47438 139 2007-03-01 14:12:27 fnpc161 N00009165_0002.0 2005-11 46767 134 2007-03-02 16:53:50 fnpc196 N00009168_0018.0 2005-11 47380 139 2007-03-29 17:14:24 fnpc143 N00009235_0019.0 2005-11 47502 134 2007-03-30 06:58:04 fnpc196 N00009235_0022.0 2005-11 47592 132 2007-03-30 07:56:51 fnpc197 So I am forcing them out. Updated roundup to recognize mrnt entries, and added them to no_spill.cedar : SRV1> cp AFSS/roundup.20070410 . SRV1> ln -sf roundup.20070410 roundup N00009165_0002.spill.mrnt.cedar.0.root 2007-04 N00009168_0018.spill.mrnt.cedar.0.root 2007-04 N00009235_0019.spill.mrnt.cedar.0.root 2007-04 N00009235_0022.spill.mrnt.cedar.0.root 2007-04 Ran one more roundup : SRV1> ./roundup -n -r cedar near Done, the oldest files in farcat now are from 04/06 ########### # ROUNDUP # ########### roundup.20070411 - take input to /grid/data/minos rustle - copy existing files from DFARM to /grid/data/minos ########## # RUSTLE # ########## SRV1> AFSS/rustle far moved 21 files, looks good. Still, make a safe copy : SRV1> cp -r /grid/data/minos/farcat /grid/data/minos/farcat_safe SRV1> diff -r /grid/data/minos/farcat /grid/data/minos/farcat_safe upcated rustle to touch the files with their DFARM date SRV1> AFSS/rustle near N.B> 2007 05 07 - removed all files from /grid/data/minos/farcat_safe these were subruns 0 through 6 of F00037871_0000, and are long since concatenated into F00037871_0000.all.sntp.cedar.0.root F00037871_0000.spill.bntp.cedar.0.root F00037871_0000.spill.sntp.cedar.0.root ============================================================================= 2007 04 09 ######## # FARM # ######## Doing cleanup after file removal of files with 0 copies in dfarm frwr- 0 rubin 29668085 04/06 16:48:28 N00008442_0000.spill.mrnt.cedar.0.root frwr- 0 rubin 29892938 04/06 16:53:18 N00008442_0001.spill.mrnt.cedar.0.root many extra status messages, due to leftover stuff in dfarm Moving ahead to roundup.20070409 ( was 20070401 ) with clean file selection, and NOSPILL suppression of mrnt based on sntp ######## # FARM # ######## Rubin reports that 4 files should be undeclared to sam, new files exist in dfarm /minos/neardet I've noted PNFS sizes and dates PNFS DFARM N00012007_0007.cosmic.cand.cedar.0.root 113745471 Apr 4 03:50 113749699 04/07 04:35:34 N00012013_0017.cosmic.cand.cedar.0.root 114483673 Apr 6 15:16 114481305 04/07 05:11:46 N00012010_0000.cosmic.cand.cedar.0.root 113633751 Apr 4 10:26 113634922 04/07 08:24:11 N00012010_0000.spill.cand.cedar.0.root 507054947 Apr 4 15:20 507043027 04/07 08:24:32 These are in /pnfs/minos/reco_near/cedar/cand_data/2007-04 I have undeclared these : MINOS26 > FILES='N00012007_0007.cosmic.cand.cedar.0.root N00012010_0000.cosmic.cand.cedar.0.root N00012010_0000.spill.cand.cedar.0.root' MINOS26 > for FILE in $FILES ; do sam locate ${FILE} ; done ['/pnfs/minos/reco_near/cedar/cand_data/2007-04,44@voc165'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-04,79@voc165'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-04,82@voc165'] MINOS26 > for FILE in $FILES ; do sam undeclare file 
${FILE} ; done ############ # MCIMPORT # ############ Per kordosky email, need to purge mistakenly imported files -rw-r--r-- 1 mindata e875 384K Apr 5 07:59 n11011003_0001_L010185N_D01.tar.gz -rw-r--r-- 1 mindata e875 146K Apr 5 08:20 n12011003_0001_L010185N_D01.tar.gz FILS='n11011003_0001_L010185N_D01.tar.gz n12011003_0001_L010185N_D01.tar.gz' grep ${FIL} kordosky/index/*.index Robert Hatcher moved the original to BAD. I have removed the relevant lines from kordosky/md5/all.md5 $ for FIL in ${FILS} ; do grep ${FIL} kordosky/md5/all.md5 ; done 0b23060c0ae02d21e465a4db8df39300 n11011003_0001_L010185N_D01.tar.gz f25ab8f958425c0b9ec4fa2e2820dc1f n12011003_0001_L010185N_D01.tar.gz $ cat all.md5 | grep -v n11011003_0001_L010185N_D01.tar.gz | grep -v n12011003_0001_L010185N_D01.tar.gz > all2.md5 $ diff all2.md5 all.md5 6550a6551,6552 > 0b23060c0ae02d21e465a4db8df39300 n11011003_0001_L010185N_D01.tar.gz > f25ab8f958425c0b9ec4fa2e2820dc1f n12011003_0001_L010185N_D01.tar.gz $ mv all2.md5 all.md5 ============================================================================= 2007 04 06 V A C A T I O N ============================================================================= 2007 04 05 ####### # CVS # ####### Met with rs, boyd, mengel, + , regarding possible plans to move CDF and/or Minos CVS to central CVS server, with BlueArc connected disks. Will give mengel access to servers ( cdf, zoom, minos ) for prototyping Did so for zoom, minos ########## # ORACLE # ########## Sun tech could not work on minosora3 ; does not have cables for the Opteron system. Ordering cables. ============================================================================= 2007 04 04 ####### # NET # ####### netdown announced downtime on Thur Apr 19, 06:00 to 06:45, s-s-fcc2-server3 Interesting hosts : crlweb2 docdb indico listserv mailgw1/2 numiserver fermi-helpdesk cdops linux1 crlweb mailgw imap1/2/3 ####### # CVS # ####### Total space used is 4.3 GBytes 3 GBytes - Contrib/raufer ( most of it in NikiSys/Attic, can be purged ) .6 GB - DatabaseTables .16 GB - Contrib/RecoCheck *.root files There are many more .root binary files checked into CVS, not to mention *.pdf , *.ps binary documents. Only one obvious MS doc, /WebDocs/reconstruction/MINOS Standard Reconstruction Package.doc MINOSCVS > find .
-name \*\ \*,v\* -exec ls -l {} \; 641769 Feb 23 02:20 ./WebDocs/reconstruction/MINOS Standard Reconstruction Package.doc,v 2467 Feb 23 02:20 ./WebDocs/reconstruction/standard reconstruction software.w2w,v 1496874 Mar 24 15:49 ./DetSim/doc/Simulation Presentation June 2003 collab.sxi,v 24898 Mar 24 15:49 ./EventDisplay/doc/snapshot13 .png,v 25611 Mar 24 15:49 ./EventDisplay/doc/snapshot14 .png,v 3624 Aug 1 2006 ./HWDB/images/left copy.png,v MINOSCVS > dds Attic/ total 3043384 drwxrwxr-x 2 minoscvs e875 4096 Jul 5 2006 ./ drwxrwxr-x 8 minoscvs e875 4096 Apr 4 09:50 ../ -r--r--r-- 1 minoscvs e875 628199184 Jul 5 2006 New_Systematics_0027.tar.gz,v -r--r--r-- 1 minoscvs e875 24635037 Jun 29 2006 PAN_le_mcfar18_2_SKZP_0.root,v -r--r--r-- 1 minoscvs e875 24378672 Jun 29 2006 PAN_le_mcfar18_2_modbyrs3_0.root,v -r--r--r-- 1 minoscvs e875 2796150 Jun 29 2006 Results_Macros_BeamMatrix.tar.gz,v -r--r--r-- 1 minoscvs e875 603125924 Jun 29 2006 Syst_rel_ndfits.tar.gz,v -r--r--r-- 1 minoscvs e875 601748800 Jun 29 2006 Syst_rel_skzp15p.tar.gz,v -r--r--r-- 1 minoscvs e875 608663238 Jun 30 2006 Syst_rel_v1.tar.gz,v -r--r--r-- 1 minoscvs e875 610860683 Jun 30 2006 Syst_rel_v2.tar.gz,v -r--r--r-- 1 minoscvs e875 531113 Jun 29 2006 far_spec_dataTR.root,v -r--r--r-- 1 minoscvs e875 542477 Jun 29 2006 far_spec_dataskzpTR.root,v -r--r--r-- 1 minoscvs e875 2943029 Jun 29 2006 results_comp_nskzp.tar.gz,v -r--r--r-- 1 minoscvs e875 3962436 Jun 29 2006 results_comp_skzp.tar.gz,v -r--r--r-- 1 minoscvs e875 916096 Jun 29 2006 results_comp_skzp15p.tar.gz,v MINOSCVS > pwd /cvs/minoscvs/rep1/minossoft/Contrib/raufer/NikiSys/Attic MINOSCVS > tar cf /local/scratch01/minoscvs/raufer-NikiSys-Attic.tar . MINOSCVS > du -sm . 2973 . MINOSCVS > du -sm /local/scratch01/minoscvs/raufer-NikiSys-Attic.tar 2972 /local/scratch01/minoscvs/raufer-NikiSys-Attic.tar MINOSCVS > tar tvf /local/scratch01/minoscvs/raufer-NikiSys-Attic.tar drwxrwxr-x minoscvs/e875 0 2006-07-05 15:22:27 ./ -r--r--r-- minoscvs/e875 24635037 2006-06-29 09:47:05 ./PAN_le_mcfar18_2_SKZP_0.root,v -r--r--r-- minoscvs/e875 24378672 2006-06-29 09:47:07 ./PAN_le_mcfar18_2_modbyrs3_0.root,v -r--r--r-- minoscvs/e875 2796150 2006-06-29 09:47:08 ./Results_Macros_BeamMatrix.tar.gz,v -r--r--r-- minoscvs/e875 2943029 2006-06-29 09:47:09 ./results_comp_nskzp.tar.gz,v -r--r--r-- minoscvs/e875 3962436 2006-06-29 09:47:11 ./results_comp_skzp.tar.gz,v -r--r--r-- minoscvs/e875 916096 2006-06-29 09:47:11 ./results_comp_skzp15p.tar.gz,v -r--r--r-- minoscvs/e875 531113 2006-06-29 12:02:24 ./far_spec_dataTR.root,v -r--r--r-- minoscvs/e875 542477 2006-06-29 12:02:24 ./far_spec_dataskzpTR.root,v -r--r--r-- minoscvs/e875 603125924 2006-06-29 12:31:45 ./Syst_rel_ndfits.tar.gz,v -r--r--r-- minoscvs/e875 601748800 2006-06-29 12:34:30 ./Syst_rel_skzp15p.tar.gz,v -r--r--r-- minoscvs/e875 608663238 2006-06-30 03:20:10 ./Syst_rel_v1.tar.gz,v -r--r--r-- minoscvs/e875 610860683 2006-06-30 03:24:51 ./Syst_rel_v2.tar.gz,v -r--r--r-- minoscvs/e875 628199184 2006-07-05 15:22:24 ./New_Systematics_0027.tar.gz,v MINOSCVS > cd .. MINOSCVS > rm -r Attic/ MINOSCVS > cd ../../.. MINOSCVS > du -sm . 1285 . 
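For next time, the Attic archive-then-prune above as one guarded sequence ( the size comparison is an extra safety step of mine, not part of what was actually run ) :
    # Sketch - tar the Attic to scratch, sanity-check the sizes, only then remove
    SRC=/cvs/minoscvs/rep1/minossoft/Contrib/raufer/NikiSys/Attic
    TAR=/local/scratch01/minoscvs/raufer-NikiSys-Attic.tar
    ( cd ${SRC} && tar cf ${TAR} . ) || exit 1
    SKB=`du -sk ${SRC} | cut -f 1`
    TKB=`du -sk ${TAR} | cut -f 1`
    DIF=`expr ${SKB} - ${TKB}` ; [ ${DIF} -lt 0 ] && DIF=`expr 0 - ${DIF}`
    if [ ${DIF} -lt 2048 ] && tar tf ${TAR} > /dev/null ; then
        rm -r ${SRC}
    else
        echo "OOPS - not removing ${SRC} : ${SKB} KB on disk vs ${TKB} KB in the tar"
    fi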
MINOSCVS > pwd /cvs/minoscvs/rep1/minossoft ####### # SAM # ####### Duplicate subruns in R1_18_2, per email from asousa Needed to obsolete spill/cosmic, cand/sntp/ntps for these for PASS in 0 1; do for SRUN in N00007821_0019 N00007821_0022 N00007751_0022 N00007759_0008 ; do for SPIL in spill cosmic ; do for STRM in cand sntp snts ; do #echo ${SRUN}.${SPIL}.${STRM}.R1_18_2.${PASS}.root >> /tmp/obsfiles sam locate ${SRUN}.${SPIL}.${STRM}.R1_18_2.${PASS}.root done ; done ; done ; done ./dropfiles /tmp/obsfiles ============================================================================= 2007 04 03 ######### # MYSQL # ######### per HOWTO.dbarchive Mysql> rm -r /data/archive/COPY/20070305 /data has 49 GB free offline real 68m54.075s 100m44.507s md5 real 21m35.787s 32m36.246s 40G gzip real 55m50.531s 80m46.291s 15G scp real 9m36.975s 14m3.862s BINLOGS real 2m59.620s 8m23.896s 8.8 GB free at minimum All tables in all databases were locked during the offline file copies. This is surprising ( even buggy ) behaviour. This is unfortunately documented in http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html Perhaps reinvestigate mysqlhotcopy, which failed a couple of years ago. Mysql> locate mysqlhotcopy /local/ups/prd/mysql/v4_1_11/Linux-2/bin/mysqlhotcopy Not sure this does a proper global table lock per database. Hard to interpret the python. Modified HOWTO.dbarchive to lock and flush tables by name. But am hitting an apparent command length limit in our 4.1.11 server http://bugs.mysql.com/bug.php?id=10119 ############ # MCIMPORT # ############ Some sjc far/mcin files have been imported, apparently correctly, to /pnfs/minos/mcin_data/far/daikon_00/L100200N et. al. ########## # DCACHE # ########## Per Alexander Podovs..., the new DCache pools are being configured into the test stand today. Should deploy to production early next week. Will follow up with Timur et.al. 
then ####### # SAM # ####### Duplicate subruns in R1_18_2, per email from asousa N00007821_0019.spill.sntp.R1_18_2.[0,1].root N00007821_0022.spill.sntp.R1_18_2.[0,1].root N00007751_0022.spill.sntp.R1_18_2.[0,1].root N00007759_0008.spill.sntp.R1_18_2.[0,1].root Need to obsolete spill/cosmic, cand/sntp/ntps for these for PASS in 0 1; do for SRUN in N00007821_0019 N00007821_0022 N00007751_0022 N00007759_0008 ; do for SPIL in spill cosmic ; do for STRM in cand sntp snts ; do #echo ${SRUN}.${SPIL}.${STRM}.R1_18_2.${PASS}.root >> /tmp/obsfiles sam locate ${SRUN}.${SPIL}.${STRM}.R1_18_2.${PASS}.root done ; done ; done ; done ============================================================================= 2007 04 02 ########### # MONTHLY # ########### CFL 4/2 DATASETS 4/2 PREDATOR 4/2 SADDRECO 4/2 VAULT 4/2 OK MYSQL 4/3 OK All tables in all databases locked during offline copy ####### # AFS # ####### On online systems should do fs getcellstatus -cell fnal.gov fs setcell -cell fnal.gov -nosuid fs getcellstatus -cell fnal.gov ####### # DAQ # ####### Investigating clone of /afs/.../ products to /data/minsoft ( hosted on minos-evd, rcp'd to other CR systems ) MINOS26 > du -sm minossoft 3875 minossoft ############ # SADDRECO # ############ REL=cedar MON=2007-03 Do a global verification for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} verify done for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done ############ # MCIMPORT # ############ mcimport.20070402 Adding support to /far/mcin for other than daikon_00 Autodest updated, add more layers of mkdir MINOS26 > cd /pnfs/minos/mcin_data/far MINOS26 > mkdir daikon_01 MINOS26 > cd daikon_01 MINOS26 > enstore pnfs --file_family "mcin_far_daikon_01" Into production before the 18:00 run, no files to process yet. ########## # ORACLE # ########## minosora3 has cpu warnings, needs diagnostics run Scheduled for Thur 5 April ============================================================================= 2007 03 30 ######### # DFARM # ######### Did catchup of far and near Strays : PEND - have 1/24 subruns for F00037230_*.all.sntp.cedar*.root 29 02/28 13:10:33 PEND - have 1/24 subruns for F00037230_*.spill.bntp.cedar*.root 29 02/28 13:11:06 OK duplicates, copied to ROUNTP/DUP PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 28 03/01 17:20:46 PEND - have 23/24 subruns for N00009168_*.spill.mrnt.cedar*.root 0 03/29 17:07:07 PEND crashed in mrnt, investigate NOSPILL N00011347_0017.spill.sntp.cedar.0.root PEND - have 2/17 subruns for N00011347_*.spill.sntp.cedar*.root 7 03/22 23:33:40 PEND - have 1/24 subruns for N00011568_*.spill.sntp.cedar*.root 7 03/22 23:49:47 OK recovered subruns, forced to output Details as follows : FAR - These are duplicates of _0017 produced as a side effect of replacing F00037230_0017.all.sntp.cedar.0.root I have set them aside in /home/minfarm/ROUNTMP/DUP SRV1> dfarm ls /minos/farcat/F00037230* frwrw 1 rubin 23001056 02/28 13:10:33 /minos/farcat/F00037230_0017.all.sntp.cedar.0.root frwrw 1 rubin 3185344 02/28 13:11:06 /minos/farcat/F00037230_0017.spill.bntp.cedar.0.root SRV1> dfarm get /minos/farcat/F00037230_0017.all.sntp.cedar.0.root . SRV1> dfarm get /minos/farcat/F00037230_0017.spill.bntp.cedar.0.root . 
SRV1> date -d '02/28 13:10:33' +%Y%m%d%H%M.%S 200702281310.33 SRV1> touch -t 200702281310.33 F00037230_0017.all.sntp.cedar.0.root SRV1> date -d '02/28 13:11:06' +%Y%m%d%H%M.%S 200702281311.06 SRV1> touch -t 200702281311.06 F00037230_0017.spill.bntp.cedar.0.root SRV1> dfarm rm /minos/farcat/F00037230* NEAR - -N00009165_0002 - 2005-11 crashed 3 times in mrnt processing, waiting -N00009168_0018 - 2005-11 missing, probably like 9165_0002 +N00011347_0004 - 2006-12 +N00011347_0014 - 2006-12 03/22 23:33:40 /minos/nearcat/N00011347_0004.spill.sntp.cedar.0.root 03/22 23:22:12 /minos/nearcat/N00011347_0014.spill.sntp.cedar.0.root Reprocessed, flush SOLO ./roundup -s N00011347 -S -f 0 -r cedar near +N00011568_0008 - 2007-01 Recovered missing subrun, flush ./roundup -s N00011568 -S -f 0 -r cedar near ############ # SADDRECO # ############ PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=IOR:010000002a00000049444c3a6f6f632e636f6d2f436f734e616d696e672f4f424e616d696e67436f6e746578743a312e30000000010000000000000030000000010100b7150000006d696e6f732d73616d30312e666e616c2e676f7600d132230c0000004e616d655365727669636500 export SETUP_SAM_CONFIG='sam_config v4_2_34 -f NULL -z /afs/fnal.gov/files/code/e875/general/ups/db -q prd -r /afs/fnal.gov/files/code/e875/general/ups/prd/sam_config/v4_2_34/NULL -m sam_config_prd.table -M /afs/fnal.gov/files/code/e875/general/ups/db/sam_config/v4_2_34' REL=cedar MON=2007-03 Do a global verification for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} verify done for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done Sent note to admarino regarding /numi_target/mars/real_target, about 158 files, 30 GB, Mars input and .hbook files from 11/21 (year ? ) admarino advises these should be in AFS. timm has removed them ####### # DAQ # ####### AFS usage in control room : Using /usr/sbin/lsof /afs minos-beamdata rotorooter python beam_data_files_monitor.py /home/minos/share/start_bd_files_monitor - explicitly sets up the AFS products minos-evd loon /data/minsoft/mcr/ControlRoomSoftware/bin/mcrrun -> /data/minsoft/mcr/ControlRoomSoftware/mcrrc - sets up AFS products minos-om HistoDisplayMain loon /data/minsoft/mcr/ControlRoomSoftware/bin/mcrrun -> /data/minsoft/mcr/ControlRoomSoftware/mcrrc - sets up AFS products minos-rc rcGui /data/minsoft/mcr/ControlRoomSoftware/bin/mcrrun -> /data/minsoft/mcr/ControlRoomSoftware/mcrrc - sets up AFS products minos- ######## # FARM # ######## Re-enabled roundup in crontab crontab crontab.dat ########### # ROUNDUP # ########### roundup.20070401 - /grid/data version Adding + character to message for NOSPILL, BADRUN, SUPPRESSED files which exist. Writing such files to output. Plan to rustle DFARM files into /grid/data, then reread from there. Reading from /grid/data will require completeness test, in case files are actively being written - aged 1000 seconds perhaps. ============================================================================= 2007 03 29 ####### # AFS # ####### Outage was 06:04 through 06:25 ############ # PRODUCTS # ############ 05:11 cd /afs/fnal.gov/files/code/e875/general mv ups OLDups ln -s products ups MINOS26 > time diff -r OLDups products real 7m57.830s user 0m6.680s sys 0m41.050s Tested minossoft setup, sam, dbu AOK. 
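The ups -> products switch above, written with an explicit rollback branch in case the diff had shown differences ( a sketch ; the rollback was not needed and was not run ) :
    # Sketch - switch the AFS ups area to a symlink, keeping a clean way back
    cd /afs/fnal.gov/files/code/e875/general || exit 1
    mv ups OLDups
    ln -s products ups
    if diff -r OLDups products > /tmp/ups-products.diff 2>&1 ; then
        echo "OK - OLDups and products are identical, symlink stands"
    else
        echo "OOPS - trees differ, rolling back ; see /tmp/ups-products.diff"
        rm ups              # removes only the symlink
        mv OLDups ups
    fi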
########### # ENSTORE # ########### VOC167, with many stage/kordosky files, has been NOACCESS since yesterday Date: Wed, 28 Mar 2007 14:57:11 -0500 From: George Szmuksta Yes, There were a number of transfers today that were successful with this tape. The last one at 12:39 today had an error of "READ_VOL1_WRONG". Implying the internal volser of the tape is wrong. I am trying to check the volser now with an automated tool. So I am waiting my turn for a drive. MINOS26 > ./stage -d -p 0 VOC167 > /var/tmp/kreymer/VOC167.stage Needed 60/ 128 MINOS26 > ./stage -d -p 0 -v VOC167 > /var/tmp/kreymer/VOC167.stage check only writePools group MINOS26 > ./stage -d -p 0 -g readPools -v VOC167 > /var/tmp/kreymer/VOC167.stage Needed 68/ 128 Touched all the files presently on caches, 68/128. 16:05 - the volume is available again. ./stage VOC167 staged cleanly to disk ####### # LSF # ####### Ticket 94831 Short Description: LSF license failure this morning Problem Description: At aroung 10:00 this morning, from several hosts minos01, minos26, flxi04 I was unable to access LSF, due to a lack of a license. For example FLXI04 > setup lsf FLXI04 > bjobs Host does not have a software license This seems to have cleared up as of about 10:05. I have also seen slowness in email, is there a general network problem ? This could affect access to the license servers. Tested around 11:30, bqueues is now working on flxi02 through 6 minos01 through 26 for NODE in $NODES ; do printf "${NODE} " ssh ${NODE} '. /afs/fnal.gov/ups/etc/setups.sh ; setup lsf ; bqueues' ; done ####### # AFS # ####### ticket 94858 brebel requests more NC AFS space, 2 x 50GB Template ? d147 d187 d204 d211 Requested d228 and d229, acls' similar to the above, minos rl system:administrators rlidwka system:anyuser rl minos:admin rlidwka brebel rlidwka Making a minos:admin group modeled on buckley:admin pts creategroup -name kreymer:admin group kreymer:admin has id -1919 pts adduser -user kreymer -group kreymer:admin pts membership kreymer:admin pts examine kreymer:admin pts examine kreymer:admin Name: kreymer:admin, id: -1919, owner: kreymer, creator: kreymer, membership: 1, flags: S-M--, group quota: 0. pts setfields kreymer:admin -access SOMar pts examine kreymer:admin Name: kreymer:admin, id: -1919, owner: kreymer, creator: kreymer, membership: 1, flags: SOMar, group quota: 0. pts chown kreymer:admin minos for GUSER in buckley gmieg rhatcher urish ; do pts adduser -user ${GUSER} -group minos:admin ; done pts adduser -user urheim -group minos:admin ============================================================================= 2007 03 28 ####### # AFS # ####### Schedule shutdown during PNFS outage MINOS26 > echo 'crontab -r' | at 03:30 job 19 at 2007-03-29 03:30 ( the minos01 crontab does not have entries for Thursday ) DCache will go down at 05:00 ######### # DFARM # ######### far - did catchup, only stray is 3001056 02/28 13:10:33 F00037230_0017.all.sntp.cedar.0.root 3185344 02/28 13:11:06 F00037230_0017.spill.bntp.cedar.0.root near - Failed last night, OK adding N00011971_0000.cosmic.sntp.cedar.0.root 24 Transfer initiation timeout OOPS - failed to dfarm get /minos/nearcat/N00011971_0006.cosmic.sntp.cedar.0.root BAILING Tue Mar 27 22:42:36 CDT 2007 Same feilure this morning. cd /export/stage/minfarm/ROUNDUP_TEST/ DFN=`dfarm ls /minos/nearcat | tr -s ' ' | cut -f 7 -d ' '` for DF in ${DFN} ; do date ; dfarm get /minos/nearcat/${DF} ${DF} ; ls ${DF} ; rm ${DF} ; done ... 
Wed Mar 28 09:50:32 CDT 2007 Transfer initiation timeout ls: N00011971_0006.cosmic.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011971_0006.cosmic.sntp.cedar.0.root': No such file or directory Wed Mar 28 09:55:32 CDT 2007 ... Wed Mar 28 09:57:22 CDT 2007 Transfer initiation timeout ls: N00011977_0002.cosmic.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011977_0002.cosmic.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:02:22 CDT 2007 Transfer initiation timeout ls: N00011977_0002.spill.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011977_0002.spill.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:07:22 CDT 2007 Transfer initiation timeout ls: N00011981_0000.cosmic.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011981_0000.cosmic.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:12:22 CDT 2007 Transfer initiation timeout ls: N00011981_0000.spill.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011981_0000.spill.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:17:23 CDT 2007 Transfer initiation timeout ls: N00011981_0001.cosmic.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011981_0001.cosmic.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:22:23 CDT 2007 Transfer initiation timeout ls: N00011981_0001.spill.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011981_0001.spill.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:27:23 CDT 2007 SRV1> dfarm ls /minos/nearcat | grep ' 0 ' frwr- 0 rubin 29235685 03/27 16:48:54 N00011971_0006.cosmic.sntp.cedar.0.root frwr- 0 rubin 4788055 03/27 15:17:16 N00011977_0002.cosmic.sntp.cedar.0.root frwr- 0 rubin 9743078 03/27 15:27:25 N00011977_0002.spill.sntp.cedar.0.root frwr- 0 rubin 29466218 03/27 16:42:23 N00011981_0000.cosmic.sntp.cedar.0.root frwr- 0 rubin 71763200 03/27 16:47:46 N00011981_0000.spill.sntp.cedar.0.root frwr- 0 rubin 29660049 03/27 15:02:26 N00011981_0001.cosmic.sntp.cedar.0.root frwr- 0 rubin 32033773 03/27 15:07:42 N00011981_0001.spill.sntp.cedar.0.root Writing what I can : SRV1> ./roundup -w -r cedar near And keeping below the bad files, SRV1> ./roundup -s N0001196 -r cedar near The above seven files are being reprocessed. ############ # PRODUCTS # ############ /afs/fnal.gov/files/code/e875/general MINOS26 > date ; time diff -r ups products Wed Mar 28 08:32:15 CDT 2007 Only in products/prd/fnorb/v1_1b_8/Linux-2-4: Fnorb real 7m50.582s user 0m6.560s sys 0m40.830s MINOS26 > rm products/prd/fnorb/v1_1b_8/Linux-2-4/Fnorb ============================================================================= 2007 03 27 ######## # FARM # ######## timm : massive errors on fnpcsrv1 external disk arrays, down for reboot at 08:30 20:25 dfarm is healthy and announced to users was allowed to run for Howie earlier, while still rebuilding. ############ # PRODUCTS # ############ Back in 2006 07 21, many old SAM products were disabled. Time to finally remove them : cd /afs/fnal.gov/files/code/e875/general/ups for DIR in db prd ; do for PRD in corba_common corba_util sam_idl_cpplib sam_lib sam_mis_cpplib sam_client_cpplib ; do du -sm ${DIR}/DISABLED${PRD} ; done ; done MINOS26 > fs listquota . Volume Name Quota Used %Used Partition code.e875.general 8000000 7749808 97%<< 49% < fs listquota . 
Volume Name Quota Used %Used Partition code.e875.general 8000000 7176814 90% 49% Also clear out empty sam product directories MINOS26 > SAMS=`ls prd/sam` MINOS26 > for SAM in $SAMS ; do [ -z `ls prd/sam/${SAM}` ] && du -sk prd/sam/${SAM} ; done MINOS26 > for SAM in $SAMS ; do [ -z `ls prd/sam/${SAM}` ] && rmdir prd/sam/${SAM} ; done Requested new disk, backed up, ticket 94695 /afs/fnal.gov/files/code/e875/general/products Created around 16:00 cd /afs/fnal.gov/files/code/e875/general cp -vax ups/catman products for DIR in db etc man prd ; do echo $DIR ; cp -ax ups/${DIR} products/ ; done Done by around 16:20 MIN > fs listquota /afs/fnal.gov/files/code/e875/general/products Volume Name Quota Used %Used Partition c.e875.d1 8000000 2111911 26% 52% MINOS26 > date ; time diff -r ups products Tue Mar 27 16:27:44 CDT 2007 diff: ups/prd/fnorb/v1_1b_8/Linux-2-4/Fnorb: recursive directory loop diff: ups/prd/java/v1.5.0/Linux-2/ups/..tar: No such file or directory diff: products/prd/java/v1.5.0/Linux-2/ups/..tar: No such file or directory diff: ups/prd/misweb/v2_23_5/NULL/www/tmp: No such file or directory diff: products/prd/misweb/v2_23_5/NULL/www/tmp: No such file or directory diff: ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2: Too many levels of symbolic links diff: products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2: Too many levels of symbolic links diff: ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2: Too many levels of symbolic links diff: products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2: Too many levels of symbolic links real 8m6.765s user 0m6.800s sys 0m41.280s Cleaning these out, for future sanity : First verify they should go : DUDS=' ups/prd/fnorb/v1_1b_8/Linux-2-4/Fnorb ups/prd/java/v1.5.0/Linux-2/ups/..tar products/prd/java/v1.5.0/Linux-2/ups/..tar ups/prd/misweb/v2_23_5/NULL/www/tmp products/prd/misweb/v2_23_5/NULL/www/tmp ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2 products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2 ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2 products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2 ' for DUD in ${DUDS} ; do ls -l ${DUD} ; done MINOS26 > for DUD in ${DUDS} ; do ls -l ${DUD} ; done lrwxr-xr-x 1 buckley e875 1 May 27 2004 ups/prd/fnorb/v1_1b_8/Linux-2-4/Fnorb -> . 
lrwxr-xr-x 1 kreymer g020 61 Feb 20 08:17 ups/prd/java/v1.5.0/Linux-2/ups/..tar -> /ftp/products/java/v1.5.0/Linux+2/java_v1.5.0_Linux+2.ups.tar lrwxr-xr-x 1 kreymer g020 61 Mar 27 16:16 products/prd/java/v1.5.0/Linux-2/ups/..tar -> /ftp/products/java/v1.5.0/Linux+2/java_v1.5.0_Linux+2.ups.tar lrwxr-xr-x 1 buckley e875 32 May 28 2004 ups/prd/misweb/v2_23_5/NULL/www/tmp -> /fnal/ups/db/misweb/Symlinks/tmp lrwxr-xr-x 1 kreymer g020 32 Mar 27 16:12 products/prd/misweb/v2_23_5/NULL/www/tmp -> /fnal/ups/db/misweb/Symlinks/tmp lrwxr-xr-x 1 buckley e875 11 May 27 2004 ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2 -> plat-linux2 lrwxr-xr-x 1 kreymer g020 11 Mar 27 16:08 products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2 -> plat-linux2 lrwxr-xr-x 1 buckley e875 11 May 27 2004 ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2 -> plat-linux2 lrwxr-xr-x 1 kreymer g020 11 Mar 27 16:08 products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2 -> plat-linux2 for DUD in ${DUDS} ; do rm ${DUD} ; done rm: cannot lstat `ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2': No such file or directory rm: cannot lstat `products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2': No such file or directory ############ # MCIMPORT # ############ Per arms email, removing kordosky runs n14* 1001-1010 charm files These files were all contained in two big tarfiles, under /pnfs/minos/stage/kordosky . n14011003_0000_L010185N_D00_charm-n14011009_0006_L010185N_D00_charm.tar n14011009_0007_L010185N_D00_charm-n14011010_0010_L010185N_D00_charm.tar $ mv n14011003_0000_L010185N_D00_charm-n14011009_0006_L010185N_D00_charm.index ../BAD/ $ mv n14011009_0007_L010185N_D00_charm-n14011010_0010_L010185N_D00_charm.index ../BAD/ MINOS26 > cd /pnfs/minos/stage/kordosky MINOS26 > rm n14011003_0000_L010185N_D00_charm-n14011009_0006_L010185N_D00_charm.tar MINOS26 > rm n14011009_0007_L010185N_D00_charm-n14011010_0010_L010185N_D00_charm.tar ########### # ROUNDUP # ########### per discussion, should pass SUPPRESS'd files, as with NO_SPILL. ###### # DB # ###### mmihalek and jason have set up minosora1 and minosora3 Ganglia monitoring, http://rexganglia2.fnal.gov/minos/?c=MINOS%20DB Updated links in /afs/fnal.gov/files/expwww/numi/html/computing/dh dhmain.html dhleft.html ####### # SAM # ####### dev/int dbs stuck, CPU bound even with sqlplus problem with oracle_client v10_2_0_1 others are ok v8_1_7a v10_1_0_3_0 Oracle client - hangs up CPU bound on minos-sam02 because system has been up too long, known problem in 10.2 Oracle Client MINOS-SAM02 > upd install -j oracle_instant_client v10_2_0_3 informational: installed oracle_instant_client v10_2_0_3. upd install succeeded. MINOS-SAM02 > ups copy -G "oracle_client v10_2_0_3" oracle_instant_client v10_2_0_3 MINOS-SAM02 > ups declare oracle_client v10_2_0_3 -f "Linux+2" -q "" -r "oracle_instant_client/v10_2_0_3/Linux+2" -z "/home/sam/products/upsdb" -U "ups" -m "oracle_instant_client.table" MINOS-SAM02 > ups declare -c oracle_client v10_2_0_3 trace : missing libclntsh.so.8.0 MINOS-SAM02 > ln -s libclntsh.so /home/sam/products/oracle_client/v10_1_0_3_0/Linux+2/lib/libclntsh.so.8.0 oops, wrong product, MINOS-SAM02 > ln -s libclntsh.so /home/sam/products/oracle_instant_client/v10_2_0_3/Linux+2/libclntsh.so.8.0 DBservers are restarted, seem to be running in dev/int ############ # SADDRECO # ############ rm'd saddreco ( symlink ) in scripts, this runs on fnpcsrv1 now. 
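Leftover from the oracle_client fix above : the first libclntsh.so.8.0 link landed in the wrong product area, and could be tidied up along these lines ( a sketch ; this cleanup is not recorded as having been run ) :
    # Sketch - remove the mistaken link in v10_1_0_3_0 and confirm the intended one
    BAD=/home/sam/products/oracle_client/v10_1_0_3_0/Linux+2/lib/libclntsh.so.8.0
    NEW=/home/sam/products/oracle_instant_client/v10_2_0_3/Linux+2/libclntsh.so.8.0
    [ -L ${BAD} ] && rm ${BAD}
    if [ -e ${NEW} ] ; then
        ls -l ${NEW}        # -e follows the link, so this also proves the target exists
    else
        echo "OOPS - ${NEW} missing or dangling"
    fi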
######### # DFARM # ######### Cleaning up FD files already concatenated and written, had been retained due to stale ROUNDTMP/DFARM files. F00037221 F00037233 F00037776 Verified that these were re-processed without clearing existing ROUNTMP/DCACHE files, so dfarm files were not purged. Concatenated files are written to DCache and purged from WRITE. So removed these from DFARM manually. dfarm rm /minos/farcat/F00037221* dfarm rm /minos/farcat/F00037233* dfarm rm /minos/farcat/F00037776* First, get data logged with ./roundup -r cedar far This leaves few stray files 37230 - all.sntp 24 subruns added 01 18 spill.bntp 24 subruns added 01 30 spill.sntp 24 subruns added 03 01 with dup all.sntp, spill.bntp _0017 dated 02/28 solution : check with howie, then remove 37801_23 3 files dated 03/23, this was incorrectly suppressed. solution : force this subrun 23 out. ./roundup -s F00037801 -r cedar far ============================================================================= 2007 03 26 ########### # ROUNDUP # ########### Deal with existing bntp files listed in no_spill because of empty sntp NOSPILL F00032719_0000.spill.bntp.cedar.0.root PEND - have 1/1 subruns for F00032719_*.spill.bntp.cedar*.root 1 03/21 23:10:28 NOSPILL F00033011_0000.spill.bntp.cedar.0.root NOSPILL F00033011_0001.spill.bntp.cedar.0.root PEND - have 1/0 subruns for F00033011_*.spill.bntp.cedar*.root 1 03/21 23:10:10 AFSS/roundup.20070326 -n -W -s F00032719 -r cedar far F00032719_*.spill.bntp.cedar*.root raw 0/1 dfarm 0 no_spill 0 where is 1 ? F00033011_*.spill.bntp.cedar*.root raw 0/1 dfarm 0 no_spill 0/1 Working on (il)logic of event selection, it seems to help to use a group command, { ... ; } This lets me use the same NOSPILL terms for printing and selecting, negated for the latter. You must have white space surrounding the { and } characters Pending DFARM files and issues : NEAR PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 24 03/01 17:20:46 NOSPILL N00011347_0017.spill.sntp.cedar.0.root PEND - have 2/17 subruns for N00011347_*.spill.sntp.cedar*.root 3 03/22 23:33:40 PEND - have 1/24 subruns for N00011568_*.spill.sntp.cedar*.root 3 03/22 23:49:47 FAR SUPPRESS F00037233_0024.all.sntp.cedar.0.root These messages still need cleanup SUPPRESS F00037801_0023.all.sntp.cedar.0.root PEND - have 1/23 subruns for F00037801_*.all.sntp.cedar*.root 2 03/23 23:36:32 ... SUPPRESS F00037804_0017.all.sntp.cedar.0.root SUPPRESS F00037804_0018.all.sntp.cedar.0.root PEND - have 24/5 subruns for F00037804_*.all.sntp.cedar*.root 2 03/23 23:37:34 PEND - have 1/2 subruns for F00032719_*.spill.bntp.cedar*.root 4 03/21 23:10:28 NOSPILL F00033011_0001.spill.bntp.cedar.0.root OK adding F00033011_0000.spill.bntp.cedar.0.root 1 SUPPRESS F00037801_0023.spill.bntp.cedar.0.root PEND - have 1/23 subruns for F00037801_*.spill.bntp.cedar*.root 2 03/23 23:37:05 ... SUPPRESS F00037804_0018.spill.bntp.cedar.0.root PEND - have 24/5 subruns for F00037804_*.spill.bntp.cedar*.root 2 03/23 23:38:05 SUPPRESS F00037801_0023.spill.sntp.cedar.0.root PEND - have 1/23 subruns for F00037801_*.spill.sntp.cedar*.root 2 03/23 23:36:47 ... SUPPRESS F00037804_0018.spill.sntp.cedar.0.root PEND - have 24/5 subruns for F00037804_*.spill.sntp.cedar*.root 2 03/23 23:37:49 ============================================================================= 2007 03 23 ############ # SADDRECO # ############ C. Test and read the READ/${FILE} parent list Done D. Done. 
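Aside on the group-command note above (2007 03 26 ROUNDUP) - a minimal illustration, with a
hypothetical file and the no_spill.cedar list standing in for the real roundup logic. The point
is that the same test serves for printing and, negated, for selecting, and that the braces need
surrounding whitespace and a trailing ';' :

FILE=F00033011_0000.spill.bntp.cedar.0.root

# report the rejected case
grep -q "${FILE}" no_spill.cedar &&
    { echo "NOSPILL ${FILE}" ; }

# same test, negated, to select the file
grep -q "${FILE}" no_spill.cedar ||
    { echo "OK adding ${FILE}" ; SELECTED="${SELECTED} ${FILE}" ; }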
Skipped first READ parent use final time, event range increment event count append parents Operations : MV to /SAM what to do with the READ lists ? NO move these to minos26 ? YES move when used/declared ? DUH if run on fnpcsrv1, how to monitor activity ? Move used READ/ files to READ/SAM/ Log to ${HOME}/ROUNTP/LOG/${MONTH}/declare_${DET}_${REL}.log SRV1> cp -a AFSS/saddreco.20070322 saddreco REL=cedar MON=2007-01 Do a global verification for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} verify \ >> ${HOME}/ROUNTMP/LOG/${MON}/verify_${DET}_${REL}.log 2>&1 done Declare one event per stream for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare 1 \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done Some .bntp locations are missing export SAM_ORACLE_CONNECT="samdbs/pass_word@minosprd" ./reloc -y 2007 -s dev cedar Hmmmm, better deal with mrnt_data before we declare them. Can clean out unused snts, .bnts directories. MINOS26 > samadmin add datatier --name=mrnt-far --description="Muon removed ntuple - far" New dataTierId = 136 MINOS26 > samadmin add datatier --name=mrnt-near --description="Muon removed ntuple - near" New dataTierId = 137 Fix the unlocated file : MINOS26 > IFILE=F00037185_0000.spill.bntp.cedar.0.root MINOS26 > ITAPE=vob719.10600 MINOS26 > SAMLOC="${IPATH}(${ITAPE})" MINOS26 > sam add location --file=${IFILE} --loc=${SAMLOC} Try single file addition again Looks good, SRV1> FILES='N00011452_0013.spill.cand.cedar.0.root N00011598_0000.cosmic.sntp.cedar.0.root N00011452_0013.cosmic.cand.cedar.0.root N00011598_0000.spill.sntp.cedar.0.root F00037198_0000.spill.cand.cedar.0.root F00037185_0000.spill.sntp.cedar.0.root F00037185_0000.spill.bntp.cedar.0.root F00037198_0000.all.cand.cedar.0.root F00037185_0000.all.sntp.cedar.0.root F00037204_0000.spill.bntp.cedar.0.root F00037246_0011.spill.bcnd.cedar.0.root' SRV1> for FILE in $FILES ; do sam locate ${FILE} ; done ['/pnfs/minos/reco_near/cedar/cand_data/2007-01,89@vo9947'] ['/pnfs/minos/reco_near/cedar/sntp_data/2007-01,2375@vo5072'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-01,108@vo9947'] ['/pnfs/minos/reco_near/cedar/sntp_data/2007-01,2376@vo5072'] ['/pnfs/minos/reco_far/cedar/cand_data/2007-01,616@vo7416'] ['/pnfs/minos/reco_far/cedar/sntp_data/2007-01,12981@vob357'] ['/pnfs/minos/reco_far/cedar/.bntp_data/2007-01,10600@vob719'] ['/pnfs/minos/reco_far/cedar/cand_data/2007-01,609@vo7416'] ['/pnfs/minos/reco_far/cedar/sntp_data/2007-01,12987@vob357'] ['/pnfs/minos/reco_far/cedar/.bntp_data/2007-01,10596@vob719'] ['/pnfs/minos/reco_far/cedar/.bcnd_data/2007-01,2890@vob735'] SRV1> for FILE in $FILES ; do sam get metadata --file=${FILE} ; done Looks good to my eye. The parents seem to be listed in a random order, but they are in order in the Database Browser. 
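Follow-on sketch to the spot checks above (hypothetical, not an installed script) - loop the
same sam locate call over the FILES list and flag anything that comes back without a /pnfs
location, assuming the output format shown above:

for FILE in ${FILES} ; do
    LOC=`sam locate ${FILE} 2>&1`
    case "${LOC}" in
        *pnfs*) echo "OK    ${FILE}" ;;
        *)      echo "NOLOC ${FILE} : ${LOC}" ;;
    esac
done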
Reviewed verify pass , SRV1> grep -v verified LOG/2007-01/verify_far_cedar.log | grep -v PARENT STARTED Fri Mar 23 17:06:23 2007 saddreco 20070323 Declaring to SAM prd far cedar 2007-01 verify Needed /pnfs/minos/reco_far/cedar/cand_data/2007-01 Treating 738 files in /pnfs/minos/reco_far/cedar/cand_data/2007-01 Needed 1476 files, Rate was 4.727 Needed /pnfs/minos/reco_far/cedar/sntp_data/2007-01 Treating 46 files in /pnfs/minos/reco_far/cedar/sntp_data/2007-01 Needed 92 files, Rate was 1.351 Needed /pnfs/minos/reco_far/cedar/.bntp_data/2007-01 Treating 46 files in /pnfs/minos/reco_far/cedar/.bntp_data/2007-01 Needed 46 files, Rate was 1.289 Needed /pnfs/minos/reco_far/cedar/.bcnd_data/2007-01 Treating 738 files in /pnfs/minos/reco_far/cedar/.bcnd_data/2007-01 Needed 738 files, Rate was 4.333 STARTED Fri Mar 23 17:06:23 2007 FINISHED Fri Mar 23 17:16:10 2007 SRV1> grep -v verified LOG/2007-01/verify_near_cedar.log | grep -v PARENT STARTED Fri Mar 23 17:00:08 2007 saddreco 20070323 Declaring to SAM prd near cedar 2007-01 verify Needed /pnfs/minos/reco_near/cedar/cand_data/2007-01 Treating 682 files in /pnfs/minos/reco_near/cedar/cand_data/2007-01 obsolete N00011516_0022.cosmic.cand.cedar.0.root obsolete N00011516_0022.spill.cand.cedar.0.root obsolete N00011516_0021.cosmic.cand.cedar.0.root obsolete N00011516_0021.spill.cand.cedar.0.root obsolete N00011516_0020.cosmic.cand.cedar.0.root obsolete N00011516_0020.spill.cand.cedar.0.root obsolete N00011516_0016.spill.cand.cedar.0.root obsolete N00011516_0016.cosmic.cand.cedar.0.root obsolete N00011516_0017.spill.cand.cedar.0.root obsolete N00011516_0017.cosmic.cand.cedar.0.root obsolete N00011516_0015.cosmic.cand.cedar.0.root obsolete N00011516_0015.spill.cand.cedar.0.root obsolete N00011516_0018.spill.cand.cedar.0.root obsolete N00011516_0018.cosmic.cand.cedar.0.root obsolete N00011516_0019.spill.cand.cedar.0.root obsolete N00011516_0019.cosmic.cand.cedar.0.root Needed 1258 files, Rate was 4.183 Needed /pnfs/minos/reco_near/cedar/sntp_data/2007-01 Treating 58 files in /pnfs/minos/reco_near/cedar/sntp_data/2007-01 Needed 103 files, Rate was 1.427 STARTED Fri Mar 23 17:00:08 2007 FINISHED Fri Mar 23 17:06:22 2007 Indeed the obsoletes are supplanted by .1 files, OK fine. Take a breath, run them all : for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done Looks OK, MON=2007-02 for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done MON=2007-03 for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done Summary of changes : < Added VERSION variable, and print thereof < enupdate has all concatenation support < READFIL has list of input files < READLIN variable has content, skipping first file < PARENT is .mdaq.root < For each file, < bump eventCount < replace lastevent and endTime < append parents < rename READFIL to SAM subdirectory for MODE declare ######## # FARM # ######## Clearing out N00011577* for clean reprocessing, filling the holes after the 22 January problems got too messy. 
as minfarm SRV1> dfarm rm /minos/nearcat/N00011577* SRV1> rm WRITE/N00011577* SRV1> rm READ/N00011577* SRV1> rm ECRC/N00011577* SRV1> rm DFARM/N00011577* as howie SRV1> cd /pnfs/minos/reco_near/cedar/ SRV1> rm *_data/2007-01/N00011577* Final cleanup, howie's reprocessing is done today : SRV1> cp -a AFSS/roundup.20070320 roundup.20070320 SRV1> ln -sf roundup.20070320 roundup SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far Looking at the far log, still ugly : 37230 - single subrun, force this ? On march 1, spill.sntp was complete, all and bntp had 1/24 37233 SUPPRESS message during processing of F00037786 due to stale ROUNTMP/DFARM control file time stamps move them aside, ./roundup -w -r cedar far 37786 is current OK - processing /minos/farcat Fri Mar 23 18:17:42 CDT 2007 OK - processing 286 files OK - stream all.sntp.cedar OK - 2302 Mbytes in 5 runs PEND - have 1/24 subruns for F00037230_*.all.sntp.cedar*.root 23 02/28 13:10:33 SUPPRESS F00037233_0024.all.sntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.all.sntp.cedar*.root 5 03/17 23:46:05 OK - stream spill.bntp.cedar OK - 416 Mbytes in 7 runs NOSPILL F00032719_0000.spill.bntp.cedar.0.root PEND - have 1/1 subruns for F00032719_*.spill.bntp.cedar*.root 1 03/21 23:10:28 NOSPILL F00033011_0000.spill.bntp.cedar.0.root NOSPILL F00033011_0001.spill.bntp.cedar.0.root PEND - have 1/0 subruns for F00033011_*.spill.bntp.cedar*.root 1 03/21 23:10:10 PEND - have 1/24 subruns for F00037230_*.spill.bntp.cedar*.root 23 02/28 13:11:06 SUPPRESS F00037233_0024.spill.bntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.spill.bntp.cedar*.root 5 03/17 23:46:35 OK - stream spill.sntp.cedar OK - 280 Mbytes in 4 runs SUPPRESS F00037233_0024.spill.sntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.spill.sntp.cedar*.root 5 03/17 23:46:19 Fri Mar 23 18:18:18 CDT 2007 PEND - have 1/24 subruns for F00037230_*.all.sntp.cedar*.root 23 02/28 13:10:33 SUPPRESS F00037233_0024.all.sntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.all.sntp.cedar*.root 5 03/17 23:46:05 NOSPILL F00032719_0000.spill.bntp.cedar.0.root PEND - have 1/1 subruns for F00032719_*.spill.bntp.cedar*.root 1 03/21 23:10:28 NOSPILL F00033011_0000.spill.bntp.cedar.0.root NOSPILL F00033011_0001.spill.bntp.cedar.0.root PEND - have 1/0 subruns for F00033011_*.spill.bntp.cedar*.root 1 03/21 23:10:10 PEND - have 1/24 subruns for F00037230_*.spill.bntp.cedar*.root 23 02/28 13:11:06 SUPPRESS F00037233_0024.spill.bntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.spill.bntp.cedar*.root 5 03/17 23:46:35 OK - stream spill.sntp.cedar OK - 280 Mbytes in 4 runs SUPPRESS F00037233_0024.spill.sntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.spill.sntp.cedar*.root 5 03/17 23:46:19 Further cleanup of old strays, set aside the old N00008460_0002.spill.sntp.cedar.0.root to make room for a new one, matching the recent cand from this subrun. SRV1> sam undeclare file N00008460_0002.spill.sntp.cedar.0.root SRV1> cd /pnfs/minos/reco_near/cedar/sntp_data/2005-09 SRV1> mv N00008460_0002.spill.sntp.cedar.0.root /pnfs/minos/BAD/BAD_N00008460_0002.spill.sntp.cedar.0.root SRV1> ./roundup -S -f 0 -s N00008460_ -r cedar near ============================================================================= 2007 03 22 ############ # SADDRECO # ############ Cleaned up version numbers for old versions for FIL in `ls saddreco.0*` ; do DT=${FIL:9} ; mv ${FIL} saddreco.2005${DT} ; done saddreco.20070322 - adding concatenated file support See notes from 2006 10 24 A. 
Pick a short victim run from 2007-01, test run this through saddreco B. Review fields to modify C. Test and read the READ/${FILE} parent list D. Iterate over parent metadata to get the numbers. A. F00037292_*.all.sntp.cedar.0.root has 3 subruns, in 2007-01 B. hacked candfiles to select the desired file Should add getops to saddreco, so can do -n, -s Usage : saddreco far cedar 2007-01 [mode] [bail] PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=IOR:010000002a00000049444c3a6f6f632e636f6d2f436f734e616d696e672f4f424e616d696e67436f6e746578743a312e30000000010000000000000030000000010100b7150000006d696e6f732d73616d30312e666e616c2e676f7600d132230c0000004e616d655365727669636500 export SETUP_SAM_CONFIG='sam_config v4_2_34 -f NULL -z /afs/fnal.gov/files/code/e875/general/ups/db -q prd -r /afs/fnal.gov/files/code/e875/general/ups/prd/sam_config/v4_2_34/NULL -m sam_config_prd.table -M /afs/fnal.gov/files/code/e875/general/ups/db/sam_config/v4_2_34' AFSS/saddreco.20070322 far cedar 2007-01 verify 1 Set READDIR MINOS26 > sam get metadata --file="F00037126_0020.all.sntp.cedar.0.root" 'eventCount' : 13956L, 'firstEvent' : 204013L, 'lastEvent' : 214368L, 'startTime' : SamTime(1166619877.0), 'endTime' : SamTime(1166623477.0), 'parents' : NameOrIdList(['F00037126_0020.mdaq.root']), 'runDescriptorList' : RunDescriptorList([RunDescriptor(runType='physics', runNumber=37126)]), }) ########### # GANGLIA # ########### Requested Ganglia host group for minosora1, ticket 94504 For reference, on fnpca, minosora1 and minos Cluster addresses were like http://fnpca.fnal.gov/ganglia/?r=day&c=MINOS Servers&h=minosora1.fnal.gov http://fnpca.fnal.gov/ganglia/?m=load_one&r=day&c=MINOS+Cluster&h=minos-mysql1.fnal.gov ########### # ROUNDUP # ########### fnpcsrv1 - updated copy of roundup.20070319 ( mc support ) to include corrected comment PEND FAR today : F00032138 - 2005-07 F00036066 - 2006-07 F00037221 - 2007-01 F00037230 - 2007-01 Added -S solo option to force direct output for older subruns Forced the pre-2007 far files : SRV1> AFSS/roundup.20070320 -S -f 0 -s F00032138 -r cedar far SRV1> AFSS/roundup.20070320 -S -f 0 -s F00036066 -r cedar far Howie has rerun, last night, F00032719_0000 2005-09 F00033011_0000 2005-10 PEND NEAR today SRV1> AFSS/roundup.20070320 -n -W -r cedar near > /tmp/nearpend PEND - have 18/24 subruns for N00011577_*.cosmic.sntp.cedar*.root 62 01/18 15:12:27 PEND - have 6/24 subruns for N00011956_*.cosmic.sntp.cedar*.root 1 03/21 00:36:51 PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 20 03/01 17:20:46 PEND - have 1/20 subruns for N00008463_*.spill.sntp.cedar*.root 0 03/21 18:12:45 PEND - have 6/24 subruns for N00011956_*.spill.sntp.cedar*.root 1 03/21 00:37:08 N00008463 - 2005-09 N00009165 - 2005-11 N00011595 - 2007-01 N00011956 - 2007-03 Adding : N00011565 onward - 2007-01 Clear the pre-2007 files : AFSS/roundup.20070320 -S -f 0 -s N00008463 -r cedar near This conflicted with an existing N00008463_0019.spill.sntp.cedar.0.root removed this file from WRITE, READ, ECRC Holding off pending investigation, did not do : AFSS/roundup.20070320 -f 0 -s N00009165 -r cedar near mrnt files which need concatenation holding off for validation, per rubin advice Bottom line, going ahead with the near pend cleanup : AFSS/roundup.20070320 -r cedar near Rubin requests putting the newer N00008463_0019.spill.sntp.cedar.0.root into enstore. 
cd /pnfs/minos/reco_near/cedar/sntp_data/2005-09 mv N00008463_0019.spill.sntp.cedar.0.root /pnfs/minos/BAD/BAD_N00008463_0019.spill.sntp.cedar.0.root ============================================================================= 2007 03 21 ###### # DB # ###### Root Cause meeting scheduled 10 AM FCC1-small meeting room for the Wed 14 March minorora1 outage (loose network cable) Report by jtrumbo emailed to minosdb-support ############ # MCIMPORT # ############ corrupt file f21011195_0000_L010185N_D00.reroot.root ? MINOS26 > ls -l /pnfs/minos/mcin_data/far/daikon_00/L010185N/119/f21011195_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 283698210 Mar 21 00:00 /pnfs/minos/mcin_data/far/daikon_00/L010185N/119/f21011195_0000_L010185N_D00.reroot.root mv mcin_data/far/daikon_00/L010185N/119/f21011195_0000_L010185N_D00.reroot.root ../BAD/ mv BAD/f21011195_0000_L010185N_D00.reroot.root BAD/BAD_f21011195_0000_L010185N_D00.reroot.root verified BAD permissions 775 per rubin request, later changed group to e875 per rubin request ########### # ROUNDUP # 20070320 ########### Checking old mc files in /minos/mccat, like frwrw 3 rubin 136984820 09/29 10:40:32 f21001006_0000_L100200.sntp.cedar.root frwrw 3 rubin 139335356 09/29 10:46:13 f21001025_0000_L100200.sntp.cedar.root ... frwrw 3 rubin 29103428 09/29 14:25:51 n13021079_0000_L010170.sntp.cedar.root frwrw 3 rubin 29468053 09/29 15:12:15 n13021080_0000_L010170.sntp.cedar.root From last survey 2006 09 29 dfarm ls /minos/mccat | tr -s ' ' | cut -f 4 -d ' ' > /tmp/mccatsz wc -l /tmp/mccatsz 288 /tmp/mccatsz echo \(`cat /tmp/mccatsz | tr \\\n +` 0 \) / 1000000000 | bc 9 dfarm ls /minos/mccat | tr -s ' ' | cut -f 7 -d ' ' > /tmp/mccatlis About right, average 30 MB each. for FILE in `cat /tmp/mccatlis | grep ^f` ; do CONF=`echo ${FILE} | cut -f 3 -d '_' | cut -f 1 -d .` ls -l /pnfs/minos/mcout_data/cedar/far/carrot/${CONF}/sntp_data/${FILE} done for FILE in `cat /tmp/mccatlis | grep ^n` ; do CONF=`echo ${FILE} | cut -f 3 -d '_' | cut -f 1 -d .` ls -l /pnfs/minos/mcout_data/cedar/near/carrot_06/${CONF}/sntp_data/${FILE} done All files present and accounted for, removing these old files dfarm rm /minos/mccat/f* dfarm rm /minos/mccat/n* Cleared out working files from testing, dfarm rm /minos/mcnearcat/n* dfarm rm /minos/mcfarcat/f* Shifted the nd mc test files out of the way in /grid/data mv /grid/data/minos/mcnearcat \ /grid/data/minos/mcnearcattest Further cleanup of N00011935 N00011938 These were concatenated, partially purged from DFARM due to permissions on Sat 2007 Mar 17. These were purged from WRITE on Sunday, Howie will fix, I will purge from DFARM asap. dfarm rm /minos/nearcat/N00011935* dfarm rm /minos/nearcat/N00011938* Done 14:55 Howie is reprocessing some runs round Jan 22. Need to remove from SAM : N00008463_0019.spill.cand.cedar.0.root MINOS26 > sam locate N00008463_0019.spill.cand.cedar.0.root ['/pnfs/minos/reco_near/cedar/cand_data/2005-09,725@vob428'] MINOS26 > sam undeclare file N00008463_0019.spill.cand.cedar.0.root ######### # FARMS # ######### Grid server maintenance today, started 08:42. Will follow up to see whether this interrupted daily 08:00 roundup Cron jobs finished at 08:17, well ahead of the shutdown. OSG software moved from /export/osg/grid to /usr/local/vdt with compatibility symlinks. Will adjust roundup.20070320 Due to reprocessing, have disabled roundup in crontab tomorrow : crontab crontab.noround ####### # AFS # ####### Security advisory re allowing suid in AFS, don't do it ! 
http://openafs.org/security/OPENAFS-SA-2007-001.txt For those who are unable to upgrade, setuid status can always be disabled by running, as the super user on any client: fs setcell -cell fnal.gov -nosuid MIN > cat /usr/vice/etc/ThisCell fnal.gov MIN > fs getcellstatus -cell fnal.gov Cell fnal.gov status: setuid allowed MIN > fs setcell -cell fnal.gov -nosuid MIN > fs getcellstatus -cell fnal.gov Cell fnal.gov status: no setuid allowed See /usr/vice/etc/afs.rc ? or /etc/sysconfig/afs on our systems ============================================================================= 2007 03 20 ########### # ROUNDUP # ########### Added support for Monte Carlo SRV1> cp -a AFSS/roundup.20070319 . SRV1> ln -sf roundup.20070319 roundup ( hacked this to update VERSION at about 13:15 ) roundup.20070320 - adding no_spill.txt scan handling absence of no_spill and bad_runs more cleanly Running a preview in near, removing NOSPILL subruns, we would have the following pending runs for spill files PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 18 03/01 17:20:46 PEND - have 18/17 subruns for N00011577_*.spill.sntp.cedar*.root 61 01/18 15:12:57 PEND - have 4/ 6 subruns for N00011580_*.spill.sntp.cedar*.root 56 01/22 20:38:35 PEND - have 6/ 7 subruns for N00011589_*.spill.sntp.cedar*.root 56 01/22 16:32:32 PEND - have 23/24 subruns for N00011935_*.spill.sntp.cedar*.root 4 03/16 10:51:34 PEND - have 1/11 subruns for N00011938_*.spill.sntp.cedar*.root 4 03/16 10:54:27 PEND - have 6/25 subruns for N00011953_*.spill.sntp.cedar*.root 0 03/20 04:06:57 Note that N00011577 claims to have 18/17 runs, N00011577_0008.spill.sntp.cedar.0.root is in no_spill.cedar but exists in /minos/nearcat/ SRV1> AFSS/roundup.20070320 -s N00011577 -n -W -r cedar near ############ # MCIMPORT # ############ rubin reports two corrupt files f21311496_0000_L010185N_D00.reroot.root This file was imported by howcroft on Sat Mar 17 . The size looks pretty normal, no obvious truncation. MINOS26 > grep f21311496_0000_L010185N_D00 CFL minos mcin_far_daikon VO4125 0000_000000000_0000461 CDMS117412486400000 130644338 148713862 /pnfs/minos/mcin_data/far/daikon_00/L010185N/149/f21311496_0000_L010185N_D00.reroot.root MINOS26 > ls -l /pnfs/minos//mcin_data/far/daikon_00/L010185N/149/f21311496_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 130644338 Mar 17 04:47 /pnfs/minos//mcin_data/far/daikon_00/L010185N/149/f21311496_0000_L010185N_D00.reroot.root f21011135_0000_L010185N_D00.reroot.root Partial transfer saved in sjc/far/mcin/BAD on 9 March, and removed from PNFS at that time. ####### # NET # ####### Thursday - 3/29 - 6:00 AM - 45 minutes - Operating system upgrade for s-s-fcc2-server switch- This affects Minos Cluster Minos Servers ( SAM , Mysql1 ) Minos AFS Enstore acsminos01 rip9 stkendca3a stkendca7a stkendca8a stkendm1a stkendm2a stkendm3a stkendm4a stkensrv4 stkensrv5 stkensrv6 stkensrv7 stkensrv8 stkensrv0n many movers Asked lamore,netmanager whether this is definitely scheduled. Outage announced at about 13:25 to netdown@fnal.gov . ######## # FARM # ######## Scheduled down Wed 21 March 08:30 ######## # FARM # ######## fnpcsrv1 issued this message to sessions logged in : Message from syslogd@fnpcsrv1 at Tue Mar 20 14:06:46 2007 ... fnpcsrv1 kernel: journal commit I/O error The system has gone off the network, as of about 14:30. NGOP paged timm at 14:30 . 
SCSI error
Rebooted 14:50

=============================================================================
2007 03 19

#######
# DAQ #
#######

urish upgraded kernels on CR systems, starting with minos-beamdata,
during a brief FD outage for a fuse replacement.
BNL had to intervene to restart beam logging processes.

##########
# DCACHE #
##########

Sent query to dcache-admin regarding Minos Read pools in FNDCA,
due around mid-March.

#########
# ADMIN #
#########

Reported multiple /pnfs/minos /etc/fstab entries, ticket 94266
Email to minos-admin

###########
# ROUNDUP #
###########

roundup.20070319 has final tweaks for mc, dropping roundup.20070314
Seems to work, path looks good, tested with -C
Plan : deploy it tomorrow
Next : use no_spill.${REL} to filter files with no spill output,
       after Howie has done some more tests of existing data.
Then : handle input from /grid/data instead of DFARM,
       or all 3 : DFARM, /grid/data, DCache
But first - get SAM declares going.

=============================================================================
2007 03 17

#######
# WEB #
#######

Per rhatcher 11 Nov 2006 email suggestion,
html/minwork/computing/enstore.html.20070317
Changed email from buckley to minos-data
Noted write access from only minos01
    Since ticket 73674, 2006 Feb 08
    Checked this actively, it's true
Changed to user path /pnfs/minos/users/

#########
# ADMIN #
#########

/pnfs/minos has multiple /etc/fstab entries on
minos03 minos08 minos09 minos11-12 minos14-16 minos18-24

=============================================================================
2007 03 16

#######
# DOE #
#######

Prepared minos/plan/doesum0703.txt for Gina's DOE review presentation.
Security visit coming next week
Put all media into cabinets

###########
# ROUNDUP #
###########

The old mccat mcfardcat files will not do, they have carrot names

NMCF=`ls /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/123 | grep n13011230`
DCCP_PATH=dcap://fndca1:24136/pnfs/fnal.gov/usr/minos/

for FILE in ${NMCF} ; do
    dccp ${DCCP_PATH}/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/123/${FILE} \
        /grid/data/minos/mcnearcat/${FILE}
done

66820161 bytes in 1 seconds (65254.06 KB/sec)
68172273 bytes in 1 seconds (66574.49 KB/sec)
67004068 bytes in 1 seconds (65433.66 KB/sec)
67009239 bytes in 1 seconds (65438.71 KB/sec)
65355460 bytes in 1 seconds (63823.69 KB/sec)
67156459 bytes in 2 seconds (32791.24 KB/sec)
67347107 bytes in 3 seconds (21922.89 KB/sec)
66109571 bytes in 1 seconds (64560.13 KB/sec)
68086532 bytes in 3 seconds (22163.58 KB/sec)
66813257 bytes in 1 seconds (65247.32 KB/sec)
67456146 bytes in 2 seconds (32937.57 KB/sec)

for FILE in ${NMCF} ; do chmod 775 /grid/data/minos/mcnearcat/${FILE} ; done

for FILE in ${NMCF} ; do
    dfarm put -n 1 -v /grid/data/minos/mcnearcat/${FILE} /minos/mcnearcat/${FILE}
done

#############
# minosora1 #
#############

Before Wed 14 Mar network problems, network activity showed a short spike
around 07:00 and an hour at 60 MBit/sec around 18:00
Wednesday the large peak was around 22:00, at 130 MBit/sec
Thursday there was no large peak
Friday the small peak ( if that ) was around noon, 2 MBit/sec

Note that minosora3 is spending nearly 2 hours/day at 6 MBytes/second.
Why ? This is dev/int, does not need backups.
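Back to the /grid/data staging above - a size check that could be run before the dfarm put
(hypothetical, was not run at the time), comparing each /grid/data copy against its PNFS
source with plain ls -l, no checksums:

SDIR=/pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/123
for FILE in ${NMCF} ; do
    SSZ=`ls -l ${SDIR}/${FILE} | tr -s ' ' | cut -f 5 -d ' '`
    DSZ=`ls -l /grid/data/minos/mcnearcat/${FILE} | tr -s ' ' | cut -f 5 -d ' '`
    [ "${SSZ}" = "${DSZ}" ] || echo "SIZE MISMATCH ${FILE} ${SSZ} ${DSZ}"
done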
============================================================================= 2007 03 15 ####### # AFS # ####### Per howcroft request, asked for 50GB not backup disk at /afs/fnal.gov/files/data/minos/d227 for Requested for "myself, Jeff, Pedro, Alex, Justi, David J, rustem" myself = howcroft Jeff jkn Jeffery Nelson jdejong Jeffrey Dejong hartnell Jeffrey Hartnell *** per email address Pedro ochoa Juan Pedro Ochoa Alex asousa Alexandre_Sousa *** a guess ( wrong ) ahimmel Alexander Himmel *** set this once disk exists Justi evansj Justin Evans David J djaffe David Jaffe Rustem rustem Rustem Ospanov ACL ( inspired by d221 ) minos rl system:administrators rlidwka system:anyuser rl buckley:kreymer rlidwka howcroft:asousa:djaffe:evansj:hartnell:ochoa rlidwka Helpdesk ticket 94115 by inkmann Oops, format is wrong needed buckley rlidwka kreymer rlidwka howcroft rlidwka ahimmel rlidwka djaffe rlidwka * * * no such user evansj rlidwka hartnell rlidwka ochoa rlidwka Removed stray entry : fs setacl -dir /afs/.fnal.gov/files/data/minos/d227 -acl buckley:minosrecodata none # minos.nubar group # Hi Art, I would create a new AFS group for these guys to use. If you make it owned by buckley:admin then anyone on that group can add users. Liz Documents at http://www.openafs.org/pages/doc/UserGuide/auusg008.htm#HDRWQ60 pts creategroup -name kreymer:nubar group kreymer:nubar has id -1917 for GUSER in buckley kreymer howcroft ahimmel evansj hartnell ochoa ; do pts adduser -user ${GUSER} -group kreymer:nubar ; done pts membership kreymer:nubar pts examine kreymer:nubar pts setfields kreymer:nubar -access SOMar pts chown kreymer:nubar minos pts adduser -user asousa -group minos:nubar pts removeuser -user asousa -group minos:nubar OK, now add this to the directory : fs setacl -dir /afs/.fnal.gov/files/data/minos/d227 -acl minos:nubar rlidwka fs setacl -dir /afs/.fnal.gov/files/data/minos/d227 -acl buckley:admin rlidwka for GUSER in howcroft ahimmel evansj hartnell ochoa ; do fs setacl -dir /afs/.fnal.gov/files/data/minos/d227 -acl ${GUSER} none ; done ########### # ROUNDUP # ########### editing roundup.20070314 for mc support ######## # FARM # ######## 14:00 rubin, bseilhan met with timm to discuss Minos farm/grid compatibility Issued identified : 1) AFS access - Steve could provide this, but rubin prefers to drop AFS as soon as we can count on DCache for ntuples. Logs can be handled somehow, they are small. 2) timm would prefer jobs to run in group account like minos, rather than rubin, bseilhan. This should be OK, as grid jobs have valid cert's giving output file access. 3) Can certain job steps be forced to fnpcsrv1 ? Steve says yes. 
4) DFARM retirement - switch to /grid/data and/or fermigrid/volative Dcache rubin will proceed now with /grid/data tests 5) Software distribution, presently via /home/minfarm, use /grid/app 6) timm strongly prefers no interactive logins to workers That's a problem for monitoring, tools are lacking ( top, ps ) ============================================================================= 2007 03 14 ########### # ROUNDUP # ########### roundup.20070314 - adding monte carlo Now logging to HADDLOG/${YEMON} Aligned PURGED DFARM message with SRMCP message FILES=`dfarm ls /minos/mccat | tr -s ' ' | cut -f 7 -d ' '` Shifted 7 old files to /minos/mcfarcat dfarm mkdir /minos/mcnearcat dfarm chmod rwrw /minos/mcnearcat dfarm mkdir /minos/mcfarcat dfarm chmod rwrw /minos/mcfarcat FILES=`dfarm ls /minos/mccat | grep f2 | tr -s ' ' | cut -f 7 -d ' '` cd /export/stage/minfarm/ROUNDUP_TEST for FILE in ${FILES} ; do echo ${FILE} dfarm get /minos/mccat/${FILE} ${FILE} dfarm put -n 1 -v ${FILE} /minos/mcfarcat/${FILE} rm ${FILE} ; done ########## # DCACHE # ########## 7 empty files reported by dcache-admin one was a command error leaving an empty 'ls' file, oops, my bad. 6 are output of dakion_00 processing, informed Howie and Brandon /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112/f21011125_0000_L010185N_D00.sntp.cedar.root /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/160/n13011601_0001_L010185N_D00.cand.cedar.root /pnfs/minos/reco_far/cedar/.bntp_data/2007-03/ls /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/160/n13011601_0006_L010185N_D00.cand.cedar.root /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/160/n13011602_0003_L010185N_D00.cand.cedar.root /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/160/n13011602_0001_L010185N_D00.cand.cedar.root /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/160/n13011601_0000_L010185N_D00.cand.cedar.root ############# # minosora1 # ############# 13:30 mmihalek reports power cord work completed on minosora3 14:05 normal contact with minosora1 recorded at http://www-numi.fnal.gov/computing/database/oracle/topdb/minosprd/2007/03/14/14.txt 14:15 no access to minosora1, off the network 14:52 Helpdesk ticket 94055 issued 15:01 mmihalek verifies no outage was scheduled for minosora1 15:03 ticket 94055 assigned to jtrumbo 15:20 mmihalek verifies that minosora1 is up and running at local console 15:36 131.225.107.24 is connected to s-s-fcc1-server on port 7/37 (minosora1) Pinged neighboring nodes at 7/35 uscmsdb01 7/36 uscmsdb03 7/38 appora 7/39 appora-dev I have searched for minosora1 in the Tissue data base of blocked nodes, it does not seem to be blocked. 14:39 jtrumbo asked that ticket be assigned to networking 14:41 jtrumbo calls lamore directly, no ticket to networking yet 15:50 ticket assigned to networking, vbravov 16:04 ret called, will check that we get a response from networking ( no response to page yet ) 16:15 minosora1 is back, restarted dbserver 16:16 duplicate ticket 94069 issued by Remedy software, in response to kreymer email - should not have been issued, ret will investigate 16:23 received additional history from Orlando via mmihalek, contractors were working in that area. 
16:41 ticket 94055 assigned to Jack Schmidt 16:51 detailed reply from orlando ticket 94055 resolved by Jack Schmidt ( strange time stamp on email, 13:52 ) ============================================================================= 2007 03 13 ############ # PREDATOR # ############ Predator cronjob has not been triggering beam or dcs activity since the DST shift. Strange, this is keyed on the local HOUR being 5 or 23 The cron jobs were running on the even hours, see below. ######## # CRON # ######## cron on minos26 is still running in DST. crond probably needs a restarted crond restarted before 16:37 by Tim Laszlo ########### # ROUNDUP # ########### roundup.20070313 - corrected SOLO test setting DELT, wrote separate files for F00037755 N00011914 I will put these back into dfarm, for concatenation ( after setting aside for safety ) Not because we cannot stand a few small files, but because is seriously inflates the entries in the reco directories. Checked tape status, everything is on tape, do we can delete. MINOS26 > cd /pnfs/minos/reco_far/cedar/sntp_data/2007-03 MINOS26 > for FIL in `ls F00037755*` ; do printf "${FIL} " ; cat ".(use)(4)(${FIL})" | head -2 ; done F00037755_0000.all.sntp.cedar.0.root VOD520 0000_000000000_0000172 ... MINOS26 > ls F00037755* | wc -l 48 MINOS26 > cd /pnfs/minos/reco_far/cedar/.bntp_data/2007-03 MINOS26 > for FIL in `ls F00037755*` ; do printf "${FIL} " ; cat ".(use)(4)(${FIL})" | head -2 ; done F00037755_0000.spill.bntp.cedar.0.root VOB719 0000_000000000_0010701 ... MINOS26 > ls F00037755* | wc -l 24 MINOS26 > cd /pnfs/minos/reco_near/cedar/sntp_data/2007-03 MINOS26 > for FIL in `ls N00011914*` ; do printf "${FIL} " ; cat ".(use)(4)(${FIL})" | head -2 ; done N00011914_0000.cosmic.sntp.cedar.0.root VO2116 0000_000000000_0000039 ... MINOS26 > ls N00011914* | wc -l 48 Set aside ROUNDUP summary files DFARM - time stamps ECRC - check sums READ - input list WRITE - files -> dfarm for FILE in `ls WRITE | grep N\*root` ; do mv DFARM/${FILE} REDO/DFARM/${FILE} mv ECRC/${FILE} REDO/ECRC/${FILE} mv READ/${FILE} REDO/READ/${FILE} done Oops, this selected them all, OK, done in 1 pass. Should have been grep N.*root for FILE in `ls WRITE | grep root` ; do echo mv WRITE/${FILE} REDO/WRITE/${FILE} ; done for FILE in `ls REDO/WRITE | grep N.*root` ; do dfarm put -n 1 REDO/WRITE/${FILE} /minos/nearcat ; done for FILE in `ls REDO/WRITE | grep F.*root` ; do dfarm put -n 1 REDO/WRITE/${FILE} /minos/farcat ; done All set, now remove the PNFS copies to make room using rubin account on srv1 SRV1 > cd /pnfs/minos/reco_far/cedar/.bntp_data/2007-03 SRV1 > for FIL in `ls F00037755*` ; do ls -l ${FIL} ; done SRV1 > for FIL in `ls F00037755*` ; do rm -v ${FIL} ; done SRV1 > cd /pnfs/minos/reco_far/cedar/sntp_data/2007-03 SRV1 > for FIL in `ls F00037755*` ; do ls -l ${FIL} ; done SRV1 > for FIL in `ls F00037755*` ; do rm -v ${FIL} ; done SRV1 > cd /pnfs/minos/reco_near/cedar/sntp_data/2007-03 SRV1 > for FIL in `ls N00011914*` ; do ls -l ${FIL} ; done SRV1 > for FIL in `ls N00011914*` ; do rm -v ${FIL} ; done ============================================================================= 2007 03 12 ####### # DST # ####### Problems in DAQ in beam logging, Big Button, DCS Processes had to be restarted to pick up DST support Proposed moving all Control room and DAQ system to localtime GMT, during the summer shutdown. Received favorably by Cat and Rob, needs discussion. 
Process : Adjust crontab entries ########### # ENSTORE # ########### Informed enstore-admin of bad dakion renames of 1632 files, 479424 MB cd /pnfs/minos/mcout_data/cedar/far mv daikon_00 bad_daikon_00 L010185N/cand_data/143/f21411431_0000_L010185N_D00.cand.cedar.root ########### # ROUNDUP # ########### Activated handling of cand files ( if any show up ) Informed Howie and Brandon SRV1> cp -a AFSS/roundup.20070309 . SRV1> ln -sf roundup.20070309 roundup ########### # ENSTORE # ########### Stan Naymola reports 600 rewrites tried for /pnfs/fnal.gov/usr/minos/mcout_data/cedar/far/daikon_00/L010185N/cand_data/143/f21411431_0000_L010185N_D Check bad_daikon_00 for pending files : cd /pnfs/fnal.gov/usr/minos/mcout_data/cedar/far/bad_daikon_00 FILES=`find . -type f | cut -f 2 -d /` for FILE in ${FILES} ; do DIR=`dirname ${FILE}` ; FIL=`basename ${FILE}` TL=`( cd ${DIR} ; cat ".(use)(4)(${FIL})" ) | head -2` printf "${FILE} " ; echo ${TL} usleep 300000 ; done | tee /tmp/badvols L010185N/cand_data/143/f21411431_0000_L010185N_D00.cand.cedar.root MINOS26 > ./dc_stat /pnfs/minos/mcout_data/cedar/far/bad_daikon_00/L010185N/cand_data/143/f21411431_0000_L010185N_D00.cand.cedar.root ============================ PNFS status for /pnfs/minos/mcout_data/cedar/far/bad_daikon_00/L010185N/cand_data/143/f21411431_0000_L010185N_D00.cand.cedar.root -rw-r--r-- 1 1334 e875 615062441 Mar 9 09:52 f21411431_0000_L010185N_D00.cand.cedar.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:992b6eb3;l=615062441; w-stkendca17a-1 LEVEL 4 ============================ Fix this by mv'ing the files to make DCache/Enstore happy : PAB=/pnfs/minos/mcout_data/cedar/far/bad_daikon_00 PAG=/pnfs/minos/mcout_data/cedar/far/daikon_00 FILE=L010185N/cand_data/143/f21411431_0000_L010185N_D00.cand.cedar.root mv ${PAG}/${FILE} ${PAG}/${FILE}.good mv ${PAB}/${FILE} ${PAG}/${FILE} < wait for the file to be written > mv ${PAG}/${FILE} ${PAB}/${FILE} mv ${PAG}/${FILE}.good ${PAG}/${FILE} Details, first had to get access to bseilhan account hr> chmod 775 ${PAB}/L010185N/cand_data/143 bs> chmod 775 ${PAG}/L010185N/cand_data/143 At 13:58, I did the initial file moves, under the bseilhan account, to allow tape writing to proceed, At 14:00 I saw the tape write underway, on VOC105, using 9940B24.mover. At 14:07 I moved the files back to their original locations. The bad_daikon_00 copy is tape VOC105 file 283 The daikon_00 copy is tape VOC105 file 133 All is well ! ============================================================================= 2007 03 09 ######## # JAVA # ######## Per discussion on lusers, do we need a java upgrade for dates ? Is there a test case ? SUMMARY : no upgrade needed, we have a test script herber's email suggest we need, for each series 1.3.1_18 1.4.2_13 5.0_u9 http://java.sun.com/developer/technicalArticles/Intl/USDST_Faq.html#jdkversion But the given link suggests 5.0_u6 in section 7 Control room systems run jre-1.5.0_07-fcs Google search for java test daylight savings reveals a test program http://ablogofideas.net/blog/2007/02/19/test-your-java-for-new-daylight-saving-time-changes/ based on a javascript test http://www.mkville.com/blog/index.cfm/2007/2/15/Quick-test-for-Daylight-Saving-Time-updates There is also an applet browser test, http://ablogofideas.net/blog/2007/02/24/test-your-browsers-jre-for-daylight-saving-time-changes/ had to hack ” to " ″ to " & to & … to . 
Had to rename source to DSTCheck.java This is in minos/scripts/DSTCheck.java javac DSTCheck.java minos@minos-beamdata tmp]$ java DSTCheck Hello, you are running Sun Microsystems Inc. JVM version: 1.4.2_12 OLD Daylight Saving Time (DST) dates: Apr 1 - Oct 28 NEW DST dates: Mar 11 - Nov 4 Now (2007-03-09 09:05:34 CST) DST offset: 0 hours 2007-03-12 01:00:00 CDT DST Offset: 1 hours 2007-04-02 01:00:00 CDT DST Offset: 1 hours 2007-10-27 01:00:00 CDT DST Offset: 1 hours 2007-11-03 01:00:00 CDT DST Offset: 1 hours ............... . Your JVM is OK with the new DST changes . ............... Sent summary to linux-users, put copy in http://~kreymer/DSTCheck.java OK SLF 4.2 / j2sdk-1.4.2_12-fcs SLF 3.0.5 / j2sdk-1.4.2_12-fcs BAD java v1.5.0 in kits ########### # ENSTORE # ########### No response to our 6 March request to enmv misplaced files. The requested enmv commands are in ~kreymer/minos/maint/daikonmove.txt I am proceeding with a normal mv , or would do so if I could become rubin, the file owner. DDIR=/pnfs/minos/mcout_data/cedar/near/daikon_00 DIRS='L010200N L010185N' for DIR in ${DIRS} ; do FILES=`ls ${DDIR}/${DIR} | grep .root` for FILE in ${FILES} ; do RUN=`echo ${FILE} | cut -c 6-8` mv ${DDIR}/${DIR}/${FILE} ${DDIR}/${DIR}/cand_data/${RUN}/${FILE} usleep 300000 done ; done Did this around 11:00 to 11:30, from rubin@fnpcsrv1 ########### # ROUNDUP # ########### roundup.20070309 Added SOLO , set for cand streams, which pops DELT to 2000, so that files will not be concatenated. Test with a few recent cand files, DPAT=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar/cand_data/2007-03 FILES=' F00037737_0000.spill.cand.cedar.0.root F00037737_0001.spill.cand.cedar.0.root F00037740_0000.spill.cand.cedar.0.root F00037740_0001.spill.cand.cedar.0.root F00037740_0002.spill.cand.cedar.0.root F00037740_0003.spill.cand.cedar.0.root ' cd /export/stage/minfarm/ROUNDUP_TEST/CAND for FILE in $FILES ; do dccp ${DPAT}/${FILE} . ; done for FILE in $FILES ; do dfarm put -n 1 ${FILE} /minos/farcat/${FILE} ; done dfarm ls /minos/farcat/*cand* cleanup after testing for FILE in $FILES ; do dfarm rm /minos/farcat/${FILE} ; done Informed minos_batch, let's deploy this next week. ######### # BATCH # ######### Need to clear all /pnfs/minos/mcout_data/cedar/far/daikon_00 MINOS26 > find . -type f | wc -l 1632 MINOS26 > pwd /pnfs/minos/mcout_data/cedar/far/daikon_00 MINOS26 > du -sm . 479424 . SRV1> cd /pnfs/minos/mcout_data/cedar/far cat "daikon_00/L010185N/.(tag)(file_family)"; reco_mc_far_cedar setup encp v3_6d -q stken mv daikon_00 bad_daikon_00 mkdir daikon_00 cd daikon_00 enstore pnfs --file_family reco_mc_far_cedar mkdir L010185N mkdir L010185N/cand_data mkdir L010185N/sntp_data mkdir L250200N mkdir L250200N/cand_data mkdir L250200N/sntp_data ( cd L010185N/cand_data ; enstore pnfs --file_family reco_mc_far_cedar_cand ) ( cd L250200N/cand_data ; enstore pnfs --file_family reco_mc_far_cedar_cand ) ( cd L010185N/sntp_data ; enstore pnfs --file_family reco_mc_far_cedar_sntp ) ( cd L250200N/sntp_data ; enstore pnfs --file_family reco_mc_far_cedar_sntp ) ########### # GANGLIA # ########### The ganglia monitor unblocking was authorized a week ago, finally got unblocked this afternoon ( lost message somewhere . ) Authorized nodes are listed on the registration web page. 
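One more check worth noting for the daikon_00 rebuild above (hypothetical, not run at the
time) - confirm each recreated *_data directory picked up the intended file family, via the
same .(tag)(file_family) trick, before farm writes resume:

cd /pnfs/minos/mcout_data/cedar/far/daikon_00
for CONF in L010185N L250200N ; do
    for STREAM in cand sntp ; do
        FAM=`( cd ${CONF}/${STREAM}_data ; cat ".(tag)(file_family)" )`
        [ "${FAM}" = "reco_mc_far_cedar_${STREAM}" ] ||
            echo "BAD family ${FAM} in ${CONF}/${STREAM}_data"
    done
done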
Had to reload KCS cert after regeneration with kx509 kxlist -p openssl pkcs12 -export -passout pass:"" -in /tmp/x509up_u1060 -out /tmp/kreymer.p12 -name Fermilab ########## # INDICO # ########## Got registered so I can create minutes and categories under Experiments -> Minos Created Core Software, and a Mar 08 meeting draft ============================================================================= 2007 03 08 ########### # ROUNDUP # ########### roundup.20070308 - has bad_run.${REL} filtering no additional files are being picked up by this at present. Still, made this current, will run tomorrow SRV1> ln -sf roundup.20070308 roundup # was 20070302 ########## # INDICO # ########## HOWTO.indico ######## # GRID # ######## HOWTO.fermigrid ########### # GANGLIA # ########### Sent the following sample .ssh/config file to msd : # You can use an ssh tunnel to see ganglia or chan13 offsite # First, put this file in ~/.ssh/config # Then, in one terminal window: $ ssh gate # this must connect via kerberos credentials or crypto card # Then browse locally to: http://localhost:20000/minos # or: http://localhost:20013/notifyservlet/www Host gate HostName flxi06.fnal.gov LocalForward 20000 rexganglia2.fnal.gov:80 LocalForward 20013 www-bd.fnal.gov:80 ============================================================================= 2007 03 07 ############ # MCIMPORT # ############ mcimport.20070306 Added duplicate detection 11:33 cp -a AFSS/mcimport.20070306 . ln -sf mcimport.20070306 mcimport ####### # CVS # ####### write access failed at 05:03, when yum updates ran these did not restart the sshd.cvs server shepelak did this at 10:46, AOK Helpdesk ticket 93608 As a side benefit, minos-admin email is now forwarded to run2-sys Educated run2-sys about minos-admin ( reached buckley/rhatcher/kreymer/urish) ####### # NET # ####### Network was down in some fashion from about 13:20 to 13:45. Informed control room, msd. N.B. - according to SSA Primary report, An FCC hub router rebooted at 13:20 ########### # ROUNDUP # ########### roundup.200703 - working on bad_run removal ============================================================================= 2007 03 06 ########### # ENSTORE # ########### Rubin reports several misplaced cand files in mc_out : MINOS26 > DDIR=/pnfs/minos/mcout_data/cedar/near/daikon_00 MINOS26 > DIRS=`ls ${DDIR}` MINOS26 > for DIR in ${DIRS} ; do echo ${DIR} ; find ${DDIR}/${DIR} -type f -maxdepth 1 | wc -l ; done L010000N 0 L010170N 0 L010185N 3760 L010200N 11 L100200N 0 L150200N 0 L250200N 0 Request to enstore-admin : Several recent Minos farm files have been placed in the wrong directories. Please move these with enmv, so that the internal Enstore metadata is corrected. The misplaced files are /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/*.root ( 3760 files ) /pnfs/minos/mcout_data/cedar/near/daikon_00/L010200N/*.root ( 11 files ) Please do the equivalent to : DDIR=/pnfs/minos/mcout_data/cedar/near/daikon_00 DIRS='L010200N L010185N' for DIR in ${DIRS} ; do FILES=`ls ${DDIR}/${DIR} | grep .root` for FILE in ${FILES} ; do RUN=`echo ${FILE} | cut -c 6-8` enmv ${DDIR}/${DIR}/${FILE} ${DDIR}/${DIR}/cand_data/${RUN}/${FILE} done ; done An explicit set of enmv commands may be found in ~kreymer/minos/maint/daikonmove.txt L010185N needs 45 RUN directories 106 through 160 The directories are present, but empty ######### # BATCH # ######### rustem reports two duplicated files, with no clue as to where dups are. 
n13011450_0007_L010185N_D00.sntp.cedar.root n13011451_0005_L010185N_D00.sntp.cedar.root cd /pnfs/minos/mcout_data/cedar/near/daikon_00 CONFS=`ls` for CONF in ${CONFS} ; do find ${CONF}/sntp_data -name n13011450_0007_L010185N_D00.sntp.cedar.root ; done L010185N/sntp_data/145/n13011450_0007_L010185N_D00.sntp.cedar.root for CONF in ${CONFS} ; do find ${CONF}/sntp_data -name n13011451_0005_L010185N_D00.sntp.cedar.root ; done L010185N/sntp_data/145/n13011451_0005_L010185N_D00.sntp.cedar.root DUH, the duplicates are in AFS, not PNFS. Howie will clean this up. ############ # MCIMPORT # ############ kordosky file copy failed, seems to be retrying indefinitely : Time User Type Oper File Node dT 1 File Size dT 2 Status Details 2007-03-6 15:21:01 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) None 0 NOT_FINISHED 0 NOT_FINISHED 2007-03-6 15:19:09 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:17:25 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:15:52 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 1 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:14:30 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 1 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:13:19 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:12:16 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 1 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:11:24 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: 
CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:10:42 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:10:10 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:08:49 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 60 0 0 ERROR 426 Transfer aborted, closing connection :PANIC : Unexpected message arrived class dmg.cells.nucleus.NoRouteToCellException 2007-03-6 15:06:54 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0004_L010185N_D00-n11011619_0008_L010185N_D00.tar minos26.fnal.gov 50 1711288320 0 OK 2007-03-6 15:04:57 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011618_0010_L010185N_D00-n11011619_0003_L010185N_D00.tar minos26.fnal.gov 52 1734010880 0 OK The problem is not so much that this failed, but that it did not report an error to the client, and kept retrying. ############ # MCIMPORT # ############ mcimport.20070306 Adding duplicate detection ============================================================================= 2007 03 05 ####### # NET # ####### Many helpdesk tickets today re networking problems. Cannot browse to sites like www.irs.gov, lwn.net . Ticket 93479 for example, seems to be the parent. These are assigned to vyto (Vyto Grigaliunas). Sometime, the following note showed up in Notes to Requester: Hello, The CD-HelpDesk has just been advised that a workaround has been implemented to allow access to the affected off-site web addresses. We're asking that you re-try the web-sites letting us know if the problem persists. 
Thank you, CD-HelpDesk ######## # MRCC # ######## Over the weekend, copied all MRCC files to DCache/Enstore, and verified with MRCCIN=/afs/fnal.gov/files/data/minos/d170/MRCC/sntp/ ./mrccarch -n ${MRCCIN}/MC/Near-L010185 mcout_data/R1_18_2/near/mrnt_data for MON in 2005-11 2006-01 ./mrccarch -n ${MRCCIN}/Data/Near/${MON} reco_near/R1_18_2/mrnt_data/${MON} ########### # MONTHLY # ########### MYSQL per HOWTO.dbarchive offline real 68m54.075s md5 real 21m35.787s gzip real 55m50.531s scp real 9m36.975s BINLOGS real 2m59.620s ############ # MCIMPORT # ############ Found 3 duplicates in howcroft ( since Friday ) M26 > FILES=`ls *.gz` M26 > for FILE in ${FILES} ; do grep ${FILE} index/*.index ; done index/n12011178_0011_L010185N_D00-n12011314_0006_L010185N_D00.index:n12011193_0011_L010185N_D00.tar.gz index/n12011178_0011_L010185N_D00-n12011314_0006_L010185N_D00.index:n12011197_0011_L010185N_D00.tar.gz index/n12011178_0011_L010185N_D00-n12011314_0006_L010185N_D00.index:n12011201_0011_L010185N_D00.tar.gz M26 > dds n12011197_0011_L010185N_D00.tar.gz -rw-r--r-- 1 mindata e875 9767022 Mar 2 15:33 n12011197_0011_L010185N_D00.tar.gz M26 > dds n12011201_0011_L010185N_D00.tar.gz -rw-r--r-- 1 mindata e875 10206615 Mar 2 15:33 n12011201_0011_L010185N_D00.tar.gz $ mv n12011193_0011_L010185N_D00.tar.gz DUP/ $ mv n12011197_0011_L010185N_D00.tar.gz DUP/ $ mv n12011201_0011_L010185N_D00.tar.gz DUP/ Removed former tarfile, which keep beeing concatenated to, getting over 7 GB in size : PNFS status for /pnfs/minos/stage/howcroft/n12011193_0011_L010185N_D00-n12011400_0012_L010185N_D00.tar -rw-r--r-- 1 kreymer e875 1 Mar 5 12:42 n12011193_0011_L010185N_D00-n12011400_0012_L010185N_D00.tar $ dds tar total 8228592 drwxr-xr-x 2 mindata e875 8192 Mar 5 12:37 ./ drwxr-xr-x 9 mindata e875 53248 Mar 5 12:39 ../ -rw-r--r-- 1 mindata e875 7391375360 Mar 5 10:30 n12011193_0011_L010185N_D00-n12011400_0012_L010185N_D00.tar -rw-r--r-- 1 mindata e875 1026396160 Mar 5 12:38 n12011311_0011_L010185N_D00-n12011400_0012_L010185N_D00.tar Reran ./mcimport -w howcroft, as previous run bailed on the fat tarfile. ############ # MCIMPORT # ############ mcimport.20070305 - detect duplicates via index search, put em in DUP check for existing output tarfile ########### # GANGLIA # ########### Saturday, found ssh tunnel prescription at http://souptonuts.sourceforge.net/sshtips.htm 1) Create local .ssh/config with Host gate HostName 131.225.193.1 LocalForward 20000 131.225.217.201:80 # User kreymer # User needed only if there is a username mismatch 2) ssh gate 3) browse to http://localhost:20000/minos This config also works : Host gate HostName flxi06.fnal.gov LocalForward 20000 rexganglia2.fnal.gov:80 ######## # GRID # ######## checking adler32 checksum of copied files : from log below, in /local/scratch/kreymer, MINOS26 > srmls -l ${SPATH2}/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root - Checksum value: 1cf17836 MINOS26 > time adler32 N00010819_0000.spill.sntp.R1_18_4.0.root 1cf17836 real 0m38.951s user 0m5.150s sys 0m5.590s That's good. But here's the bad. Using adler32 requires that SRM_PATH be set. MINOS26 > unset SRM_PATH MINOS26 > time adler32 N00010819_0000.spill.sntp.R1_18_4.0.root SRM_PATH is not set But srmcp etc require that it NOT be set. GRRRRRRRRRRRRR ============================================================================= 2007 03 02 ########### # MONTHLY # ########### CFL DATASETS PREDATOR VAULT Leave for next week MYSQL N.B. 
(done Monday) ############ # MCIMPORT # ############ Date: Thu, 01 Mar 2007 18:40:32 -0600 From: Cron Daemon To: minos-data@fnal.gov Subject: Cron ${HOME}/mcimport -c ALL du: `/local/scratch26/mindata/kordosky/n14011012_0008_L010185N_D00_charm.tar.gz.md5': No such file ordirectory Odd, this was picked up later $ dds kordosky/index/n14011012_0003_L010185N_D00_charm-n14011012_0009_L010185N_D00_charm.index -rw-r--r-- 1 mindata e875 287 Mar 2 06:56 kordosky/index/n14011012_0003_L010185N_D00_charm-n14011012_0009_L010185N_D00_charm.index ######## # GRID # ######## Test a >2 GB copy, this works ! MINOS26 > time srmcp -streams_num=1 -server_mode=active file:///N00010819_0000.spill.sntp.R1_18_4.0.root ${SPATH}/kreymer/N00010819_0000.spill.sntp.R1_18_4.0.root real 1m30.052s user 0m19.140s sys 0m14.430s But the directory listing is hosed : Listing is OK for the single file: MINOS26 > srmls ${SPATH2}/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root 2283574599 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root MINOS26 > srmls -l ${SPATH2}/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root 2283574599 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root type:PERMANENT - Checksum value: 1cf17836 - Checksum type: adler32 UserPermission: uid=1060 PermissionsRW GroupPermission: gid=5111 PermissionsRW WorldPermission: R created at:2007/03/02 09:54:39 modified at:2007/03/02 09:54:39 - Original SURL: srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root - Status: null - Type: FILE Get length int : MINOS26 > srmls ${SPATH2}/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root | head -1 | tr -s ' ' | cut -f 2 -d ' ' 2283574599 But the directory listing is hosed : ########### # ROUNDUP # ########### roundup.20070302 - corrected typo which failed to set SDEST2 srmcp was failing SRV1> ./roundup -w -r cedar near SRV1> ./roundup -w -r cedar far ######## # MRCC # ######## Drop the splitting MC into RUNS, 1598 files is tolerable Running 14:35, on minos-sam02 mrccarch -n MC/Near-L010185 mcout_data/R1_18_2/near/mrnt_data NETWORK - strange, as reported by Ganglia and mrtg, network input and output rates are identical, and about 1.5 MBytes/second DUUUUH, of course ! The input files are in AFS. 
Pending : for MON in 2005-11 2006-01 mrccarch -n Data/Near/{$MON} reco_near/R1_18_2/mrnt_data/${MON} ########### # ROUNDUP # ########### PEND - have 21/24 subruns for N00009104_*.spill.mrnt.cedar*.root 1 02/28 14:14:19 PEND - have 23/24 subruns for N00009143_*.spill.mrnt.cedar*.root 1 02/28 15:16:27 PEND - have 23/24 subruns for N00009146_*.spill.mrnt.cedar*.root 1 02/28 13:40:43 PEND - have 23/24 subruns for N00009162_*.spill.mrnt.cedar*.root 0 03/01 15:48:59 PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 0 03/01 17:20:46 SRV1> grep N00009104 /home/minfarm/lists/bad_runs_mrcc.cedar N00009104_0017.0 2005-11 45832 1 2007-02-28 12:52:00 fnpc230 N00009104_0016.0 2005-11 46210 1 2007-02-28 12:52:08 fnpc230 N00009104_0015.0 2005-11 46073 1 2007-02-28 12:53:22 fnpc201 SRV1> grep N00009143 /home/minfarm/lists/bad_runs_mrcc.cedar N00009143_0004.0 2005-11 45947 1 2007-02-28 12:57:42 fnpc229 SRV1> grep N00009146 /home/minfarm/lists/bad_runs_mrcc.cedar N00009146_0000.0 2005-11 45443 1 2007-02-28 13:29:38 fnpc230 SRV1> grep N00009162 /home/minfarm/lists/bad_runs_mrcc.cedar N00009162_0013.0 2005-11 47438 139 2007-03-01 14:12:27 fnpc161 SRV1> grep N00009165 /home/minfarm/lists/bad_runs_mrcc.cedar So flush 9104, 9143, 9146, 9162 SRV1> ./roundup -f 0 -s 9104.*mrnt -W -r cedar near SRV1> ./roundup -f 0 -s 9143.*mrnt -W -r cedar near SRV1> ./roundup -f 0 -s 9146.*mrnt -W -r cedar near SRV1> ./roundup -f 0 -s 9162.*mrnt -W -r cedar near ########### # GANGLIA # ########### Thanks to sether (Seth Graham) who has moved the minos systems to rexganglia2.fnal.gov/minos/?cMinos Cluster rexganglia2.fnal.gov/minos/?cMinos Server ============================================================================= 2007 03 01 ####### # NET # ####### The 06:00 failover of ESNET to an alternate link seems to have failed. No network traffic from Hirise to the Border Router 06:00 to 06:20, load had been 20 Mbit/sec / ESNET load had been 300 Mbit, failover advertised as 600. Helpdesk ticket 93337, assigned to Andrew Raider # AFS # maintenance done ? helpdesk 93322, assigned to Mengel Reply: work was done from 06:00 to 06:05 ######## # GRID # ######## Try again to access volatile grid dcache, now that we have a /pnfs/minos/farmigrid/volatile/ link. MINOS26 > setup srmcp v1_25_1 unset SRM_PATH setup java v1.5.0 export SRM_CONFIG=/local/scratch26/kreymer/.srmconfig/kreymer.xml SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/fermigrid/volatile srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/kordosky srmcp -debug=true file:///Merged.root ${SPATH2}/kreymer/Merged.root srmls ${SPATH2}/kreymer 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/kreymer 27009843 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/kreymer/Merged.root THIS WORKS !!!! ############ # MCIMPORT # ############ At round 12:00, lots of kordosky files flooded in. At round 13:00, the minos26 load average went up close to 38. $ ps xf | grep md5sum | wc -l 71 It seems to easing off round 13:45, load average at 32, some free CPU. 
$ ps xf | grep md5sum | wc -l 63 ######## # MRCC # ######## Will copy *mrnt* files from these for MON in 2005-11 2006-01 Data/Near/{$MON} -> /pnfs/minos/reco_near/R1_18_2/mrnt_data/${MON} MC/Near-L010185 -> /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data/${RUN} MAKE THE TARGET DIRECTORIES AND SET FAMILIES mkdir /pnfs/minos/reco_near/R1_18_2/mrnt_data chmod 775 /pnfs/minos/reco_near/R1_18_2/mrnt_data ( cd /pnfs/minos/reco_near/R1_18_2/mrnt_data ; \ enstore pnfs --file_family reco_near_R1_18_2_mrnt ) mkdir /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data chmod 775 /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data ( cd /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data; \ enstore pnfs --file_family reco_near_R1_18_2_mrnt ) Created scripts/mrccarch to do the copies MRCCIN=/afs/fnal.gov/files/data/minos/d170/MRCC/sntp/ ./mrccarch -n ${MRCCIN}/MC/Near-L010185 mcout_data/R1_18_2/near/mrnt_data for MON in 2005-11 2006-01 ./mrccarch -n ${MRCCIN}/Data/Near/${MON} reco_near/R1_18_2/mrnt_data/${MON} Sets RUN automatically for latter input path Checks checksum for existing files ? or -q quality ? ============================================================================= 2007 02 28 ############ # SADDRECO # ############ DECLARED 2003 2004 2005-1/2/3 cedar ./saddreco far cedar 2003-07 list 3 DET=far HOSTNU=`hostname -s | cut -c 6-` LOGPAT=/local/scratch${HOSTNU}/kreymer/log FARM=cedar YEAR=2003 ; MONS='07 08 09 10 11 12' YEAR=2004 ; MONS='01 02 03 04 05 06 07 08 09 10 11 12' YEAR=2005 ; MONS='01 02 03' Needed 679 files, Rate was 2.494 Needed /pnfs/minos/reco_far/cedar/.bntp_data/2005-03 Treating 0 files in /pnfs/minos/reco_far/cedar/.bntp_data/2005-03 Needed /pnfs/minos/reco_far/cedar/.bcnd_data/2005-03 Treating 0 files in /pnfs/minos/reco_far/cedar/.bcnd_data/2005-03 Needed /pnfs/minos/reco_far/cedar/.bnts_data/2005-03 Treating 0 files in /pnfs/minos/reco_far/cedar/.bnts_data/2005-03 STARTED Wed Feb 28 06:34:53 2007 FINISHED Wed Feb 28 06:49:07 2007 omniORB: Assertion failed. This indicates a bug in the application using omniORB, or maybe in omniORB itself. file: ../../../../../src/lib/omniORB/orbcore/SocketCollection.cc line: 682 info: pd_refcount > 0 for MON in ${MONS} ; do ./saddreco ${DET} ${FARM} ${YEAR}-${MON} declare 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done grep -v declare /local/scratch26/kreymer/log/saddreco/declare_far_cedar.log | less ############ # DATASETS # ############ datasets.20070228 Corrected m to RawDataWritePools Added g DPOOLG FermigridVolPools had problems with this yesterday, before pools lists existed. OK today ln -s datasets.20070228 datasets ########### # ROUNDUP # ########### roundup.20070228 - added mrnt\. 
to file selection for DFILESL SRV1> ln -sf roundup.20070228 roundup # was roundup.20070224 Created working directories mkdir /pnfs/minos/reco_near/cedar/mrnt_data ( cd /pnfs/minos/reco_near/cedar/mrnt_data ; \ enstore pnfs --file_family reco_near_cedar_mrnt ) mkdir /pnfs/minos/reco_far/cedar/mrnt_data ( cd /pnfs/minos/reco_far/cedar/mrnt_data ; \ enstore pnfs --file_family reco_far_cedar_mrnt ) SRV1> ./roundup -s mrnt -r cedar near OOPS, forgot to set ownership to rubin MINOS01 > chown rubin /pnfs/minos/reco_near/cedar/mrnt_data MINOS01 > chown rubin /pnfs/minos/reco_far/cedar/mrnt_data SRV1> ./roundup -w -r cedar near And recreated far directory with correct protection mkdir /pnfs/minos/reco_far/cedar/mrnt_data chmod 775 /pnfs/minos/reco_far/cedar/mrnt_data ( cd /pnfs/minos/reco_far/cedar/mrnt_data ; \ enstore pnfs --file_family reco_far_cedar_mrnt ) MINOS01 > chown rubin /pnfs/minos/reco_far/cedar/mrnt_data N.B. - rubin normally makes the directories ahead of time. the 'mkdir' in roundup is useless, change to srmmkdir ############ # MCIMPORT # ############ One bad file identified near end of January, removed from index by rhatcher " Just for the record I modified: n11011172_0002_L010185N_D00-n11011172_0010_L010185N_D00.index and removed the line: n11011172_0002_L010185N_D00.tar.gz leaving only the line: n11011172_0010_L010185N_D00.tar.gz This removes the duplicate and *corrupted* copy of subrun 0002 from the index files. The working version comes from: n11011172_0002_L010185N_D00-n11011172_0006_L010185N_D00.tar " ============================================================================= 2007 02 27 ######## # MRCC # ######## Per 13 Feb request, plan to write all files from ${MINOS_DATA}/d170 thru d174 to /pnfs/reco_* or /pnfs/mcout_data/* for DAT in d170 d171 d172 d173 d174 Files are under ${MINOS_DATA}/${DAT}/MRCC/sntp/Data ${MINOS_DATA}/${DAT}/MRCC/sntp/MC/Near-L010185/*.mrnt.R1.18.2.root All files seem to be individually symlinked back to d170 MRCC/MRCCDRIVES has links to d170-d175,d193-d195 No mrnt files are on the d19x drives MC MINOS26 > ls MRCC/sntp/MC/Near-L010185/*.root | wc -l 1596 MINOS26 > find d170 -type f -name \n*mrnt\*root | wc -l 848 MINOS26 > find d172 -type f -name \n*mrnt\*root | wc -l 748 SUM 1596 Data MINOS26 > ls d171/MRCC/sntp/Data/Near/2005-11/N*mrnt*root | wc -l 656 MINOS26 > ls d171/MRCC/sntp/Data/Near/2006-01/N*mrnt*root | wc -l 669 SUM 1325 MINOS26 > for DAT in d170 d171 d172 d173 d174 ; do printf ${DAT} ; find ${DAT} -type f -name N\*mrnt\*root | wc -l ; done d170 0 d171 575 d172 81 d173 474 d174 195 SUM 1325 Where do they go ? MRCC/sntp/Data/Near/{$MON} -> /pnfs/minos/reco_near/R1_18_2/mrnt_data/${MON} MRCC/sntp/MC/ -> /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data/${RUN} ============================================================================= 2007 02 26 ########### # ROUNDUP # ########### Review of PEND subruns for roundup. 
FAR PEND - have 22/24 subruns for F00037676_*.all.sntp.cedar*.root 6 02/19 23:41:20 PEND - have 4/17 subruns for F00037697_*.all.sntp.cedar*.root 0 02/25 23:43:40 PEND - have 22/24 subruns for F00037676_*.spill.bntp.cedar*.root 6 02/19 23:41:54 PEND - have 4/17 subruns for F00037697_*.spill.bntp.cedar*.root 0 02/25 23:44:15 PEND - have 19/24 subruns for F00037221_*.spill.sntp.cedar*.root 45 01/11 23:53:32 PEND - have 23/24 subruns for F00037230_*.spill.sntp.cedar*.root 42 01/15 07:58:09 PEND - have 18/24 subruns for F00037233_*.spill.sntp.cedar*.root 40 01/16 12:25:21 PEND - have 22/24 subruns for F00037676_*.spill.sntp.cedar*.root 6 02/19 23:41:36 PEND - have 4/17 subruns for F00037697_*.spill.sntp.cedar*.root 0 02/25 23:43:57 for RUN in 37676 37697 37676 37697 37221 37230 37233 37676 37697 ; do grep ${RUN} /home/minfarm/lists/bad_runs.cedar ; done ./roundup -f 0 -s F00037676 -r cedar -n far ./roundup -f 0 -s F00037676 -r cedar far for RUN in 37221 37230 37233 37697 ; do grep ${RUN} /home/minfarm/lists/runs_done.cedar ; done NEAR PEND - have 18/24 subruns for N00011577_*.cosmic.sntp.cedar*.root 38 01/18 15:12:27 PEND - have 5/13 subruns for N00011580_*.cosmic.sntp.cedar*.root 34 01/22 20:37:13 PEND - have 4/24 subruns for N00011586_*.cosmic.sntp.cedar*.root 34 01/22 21:32:57 PEND - have 6/24 subruns for N00011589_*.cosmic.sntp.cedar*.root 34 01/22 16:31:44 PEND - have 16/24 subruns for N00011592_*.cosmic.sntp.cedar*.root 34 01/22 22:51:26 PEND - have 18/24 subruns for N00011595_*.cosmic.sntp.cedar*.root 34 01/22 15:00:05 PEND - have 30/31 subruns for N00011651_*.cosmic.sntp.cedar*.root 27 01/29 10:51:09 PEND - have 21/24 subruns for N00011824_*.cosmic.sntp.cedar*.root 2 02/24 01:05:03 PEND - have 20/24 subruns for N00011827_*.cosmic.sntp.cedar*.root 1 02/25 01:49:57 PEND - have 2/16 subruns for N00011830_*.cosmic.sntp.cedar*.root 0 02/26 03:11:21 PEND - have 21/24 subruns for N00011565_*.spill.sntp.cedar*.root 41 01/15 11:29:49 PEND - have 23/24 subruns for N00011568_*.spill.sntp.cedar*.root 41 01/15 14:16:12 PEND - have 18/24 subruns for N00011577_*.spill.sntp.cedar*.root 38 01/18 15:12:57 PEND - have 4/13 subruns for N00011580_*.spill.sntp.cedar*.root 34 01/22 20:38:35 PEND - have 6/24 subruns for N00011586_*.spill.sntp.cedar*.root 34 01/22 21:34:11 PEND - have 6/24 subruns for N00011589_*.spill.sntp.cedar*.root 34 01/22 16:32:32 PEND - have 16/24 subruns for N00011592_*.spill.sntp.cedar*.root 34 01/22 22:52:22 PEND - have 17/24 subruns for N00011595_*.spill.sntp.cedar*.root 34 01/22 15:00:31 PEND - have 5/ 6 subruns for N00011621_*.spill.sntp.cedar*.root 30 01/26 21:56:10 PEND - have 11/12 subruns for N00011643_*.spill.sntp.cedar*.root 29 01/27 23:58:32 PEND - have 4/24 subruns for N00011648_*.spill.sntp.cedar*.root 30 01/27 00:09:36 PEND - have 3/ 5 subruns for N00011701_*.spill.sntp.cedar*.root 20 02/06 01:51:05 PEND - have 23/24 subruns for N00011728_*.spill.sntp.cedar*.root 15 02/11 02:58:48 PEND - have 21/24 subruns for N00011734_*.spill.sntp.cedar*.root 13 02/13 03:27:54 PEND - have 21/24 subruns for N00011804_*.spill.sntp.cedar*.root 5 02/20 23:45:43 PEND - have 21/23 subruns for N00011819_*.spill.sntp.cedar*.root 3 02/23 05:14:33 PEND - have 21/24 subruns for N00011824_*.spill.sntp.cedar*.root 2 02/24 01:06:23 PEND - have 20/24 subruns for N00011827_*.spill.sntp.cedar*.root 1 02/25 01:50:18 PEND - have 2/16 subruns for N00011830_*.spill.sntp.cedar*.root Three 0 02/26 03:11:52 Cosmic 11577 11580 11586 11589 11592 11595 11651 Spill 11565 11568 11577 11580 11586 11589 11592 11595 
11621 11643 11648 11701 11728 11734 ./roundup -f 20 -s N00011651_ -r cedar -n near ./roundup -f 20 -s N00011651_ -r cedar near ############## # MCOUT_DATA # ############## 5 files need to be moved to the Run subdirectories : /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/ n13011458_0009_L010185N_D00.cand.cedar.root n13011453_0007_L010185N_D00.cand.cedar.root n13011455_0002_L010185N_D00.cand.cedar.root n13011456_0000_L010185N_D00.cand.cedar.root n13011457_0010_L010185N_D00.cand.cedar.root ########### # GANGLIA # ########### Note patrol information available offsite, http://d0ora2.fnal.gov/Patrol/sys-config-info/ http://d0ora2.fnal.gov/Patrol/sys-config-info/d0ora2-config.html Also, quite a lot of Ganglia pages: http://cmssrv02.fnal.gov/ganglia/ http://d0om.fnal.gov/d0admin/ganglia/ http://d0online3.fnal.gov/ganglia/ There may be a simpler way to limit kernel data, just do not feed valid data to the ganglia server ! ######### # FNALU # ######### Asked with flxi07 (x86_64) will be available for interactive login email to fnalu-admin ############ # MCIMPORT # ############ First sjc files have been archived ( 11 ), on Sunday. Alphabetized .k5loginfull, .k5loginmin corrected gallag to hgallag in .k5loginmin ########### # ROUNDUP # ########### Running cleanly in cron since Sunday 2007 Feb 25 SRV1> dfarm usage rubin Used: 121267 + Reserved: 0 / Quota: 500000 (MB) Need to examine/flush many pending runs, mostly near. ############ # HELPDESK # ############ Scanned database for kreymer tickets outstanding, 2 in stage assigned 90783 Assigned 1/9/2007 Please install encp v3_6d in AFS ... 87003 Assigned 10/16/2006 kcroninit fails on flxi04, flxi05 and flxi06 ########### # GANGLIA # ########### Note patrol information available offsite, http://d0ora2.fnal.gov/Patrol/sys-config-info/ for example, http://d0ora2.fnal.gov/Patrol/sys-config-info/d0ora2-config.html ============================================================================= 2007 02 23 ########## # DCACHE # ########## Need to delete old files, owned by rubin. Try this from flxi04, where I have tested latest srmcp v1_25_1 Updated .grid with howie's certs. Updated .srmconfig/config.xml per local file locations FLXI04 > scp minfarm@fnpcsrv1:.grid/user* .grid/ usercert.pem 100% |**********************************************************************************************| 1533 00:00 userkey.pem 100% |**********************************************************************************************| 1131 00:00 FLXI04 > scp minfarm@fnpcsrv1:.grid/x5* .grid/ FLXI04 > for SRMP in `head -1 FLOSS`; do srmls ${SRMP} ; done 559681707 srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037621_0000.all.sntp.cedar.0.root Try some other functions : FLXI04 > SRMTD=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/reco_far/cedar/sntp_data/TEST FLXI04 > srmmkdir ${SRMTD} looked OK, did nothing... OK now I see it. Need to use the extended srm path for v2 functons : srm://fndca1.fnal.gov:8443/pnfs/ becomes srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs Edited FLOSS file, reran FLXI04 > for SRMP in `cat FLOSS`; do srmrm ${SRMP} ; done Running clean. check for files : SRV1> for FILL in `cat /tmp/FLOGS` ; do ls -l ${FILL} ; done Rewrote the files SRV1> ./roundup -c -w -r cedar far ; ./roundup -c -w -r cedar near Note to admins : There were 42 Minos farm output files listed, plus 4 not in the original list. I removed all 46 files from dcache ( using srmrm ), at around 11:30 this morning. 
I have rewritten all 46 files to DCache, as of about 12:20 . ( We do not remove these files from our write buffer till they are on tape. ) They should normally be moving to tape in about 4 hours. ####### # DAQ # ####### Informed minos-data and gfp The web page claims that the far_dcs_archiver is running, and that it last logged F070214_000008.mdcs.root Feb 14 20:53 In fact it last logged F070215_000010.mdcs.root Feb 16 19:10 It probably needs a shutdown/restart . ########### # ROUNDUP # ########### Informed minos_batch Today I have placed the regular concatenation of ntuples for farm output into the crontab of minfarm@fnpcsrv1. It is scheduled to run at 08:00 daily. See file ~minfarm/scripts/crontab.dat This was ready to go a week ago, but I've been busy with other data handling issues since then, for obvious reasons. ######## # GRID # ######## Try again to access volatile grid dcache, now that we have a tested srmcp v1_25_1 on central systems. MINOS26 > setup srmcp v1_25_1 unset SRM_PATH setup java v1.5.0 export SRM_CONFIG=/local/scratch26/kreymer/.srmconfig/kreymer.xml SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos MINOS26 > srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/kordosky MINOS26 > srmmkdir ${SPATH2}/kreymer MINOS26 > srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/kordosky 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/kreymer Cannot write, MINOS26 > srmcp -debug=true file:///Merged.root ${SPATH}/kreymer/Merged.root ... Exception: user's path ///pnfs/fnal.gov/usr/fermigrid/volatile/minos/kreymer/Merged.root is not subpath of the user's root ... Reported to dcache-admin ============================================================================= 2007 02 22 ######## # MRCC # ######## Per 13 Feb request, plan to write all files from $MINOS_DATA}/d170 thru d174 to /pnfs/reco_* or /pnfs/mcout_data/* ############ # MCIMPORT # ############ 08:08 M26 > ln -sf .k5loginfull .k5login 08:30 - crontab.dat updated to NOT run mcimport.20070203 Reran manual catchup 08:45 ./mcimport ALL 13:46 - crontab crontab.dat - so the above change is effective ! ####### # AFS # ####### tjyang reported $MINOS_DATA/d167 not accessible, since Monday AFS outage. MINOS26 > ls /afs/fnal.gov/files/data/minos/d167 ls: /afs/fnal.gov/files/data/minos/d167: No such file or directory Note that this directory does show in the next higher level directory: MINOS26 > ls -l /afs/fnal.gov/files/data/minos ls: /afs/fnal.gov/files/data/minos/d167: No such file or directory total 788 drwxrwxrwx 14 root root 2048 Jan 4 2006 beam_data drwxrwxrwx 18 root root 2048 Jan 4 2006 beam_data1 drwxrwxrwx 2 root root 2048 Jan 11 2006 beam_data2 .. Helpdesk ticket 92974 issued around 10:56. Resolved around 13:26. Files are available again. ########## # DCACHE # ########## DCache admins report files lost in DCache write pools Monday 19 Feb during the PNFS outage. 
/pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037621_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037621_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037624_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037624_0000.spill.bntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037624_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037628_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037628_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037629_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037629_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037633_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037633_0000.spill.bntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037633_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037636_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037636_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037639_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037639_0000.spill.bntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037639_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037642_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037642_0000.spill.bntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037642_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037645_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037645_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037648_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037648_0000.spill.bntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037648_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011742_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011742_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011745_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011745_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011750_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011750_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011755_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011755_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011758_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011758_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011761_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011761_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011764_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011764_0000.spill.sntp.cedar.0.root 
/pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011769_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011769_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037621_0000.spill.bntp.cedar.0.root Four more files are sitting in WRITE : F00037628_0000.spill.bntp.cedar.0.root F00037629_0000.spill.bntp.cedar.0.root F00037636_0000.spill.bntp.cedar.0.root F00037645_0000.spill.bntp.cedar.0.root Reported to dcache-admin, awaiting guidance. Got OK to remove files from PNFS, and rewrite. Created /tmp/FLOGS on fnpcsrv1, containing the above 42+4 files. Checked they are in WRITE with for FILE in `ls WRITE/` ; do grep -q ${FILE} /tmp/FLOGS || ls WRITE/${FILE} ; done for FILL in `cat /tmp/FLOGS` ; do FIL=`echo $FILL | cut -f 8 -d /` SL=`ls WRITE/${FIL} -l | tr -s ' ' | cut -f 5 -d ' '` SD=`ls -l ${FILL} | tr -s ' ' | cut -f 5 -d ' '` printf "${SL}\n${SD}\n" [ ${SL} -ne ${SD} ] && echo OOPS done for FILL in `cat /tmp/FLOGS` ; do ls -l ${FILL} ; done Oops, files are owned by rubin. So in ROUNTMP/FLOSS, created list of files with srm paths, like srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037621_0000.all.sntp.cedar.0.root for FILL in `cat FLOSS` ; do srmls ${FILL} ; done SRV1> date ; for FILL in `cat FLOSS` ; do srmrm ${FILL} ; done Thu Feb 22 19:38:11 CST 2007 date Return code: SRM_FAILURE Explanation: java.lang.NullPointerException MINOS26 > setup srmcp v1_25_1 MINOS26 > unset SRM_PATH MINOS26 > export SRM_CONFIG=/home/mindata/.srmconfig/config.xml MINOS26 > setup java v1.5.0 ..... down the drain, nothing is working..... Will have to start fresh tomorrow, try to get rubin certificate on minos26 where we have the current srm, for which srmrm might work. ============================================================================= 2007 02 21 ############ # MCIMPORT # ############ mcimport.20070220 log sorting, for howcroft, the GLOGFS directories are L010185N_n1101 3346 L010185N_n1201 3405 the JLOGFS directories are L010185_near 3346 L010185_rock 3405 M26 > ./mcimport.20070220 -w -f 144 howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log M26 > ln -s mcimport.20070220 mcimport M26 > ./mcimport -w kordosky OK, logging activity to /local/scratch26/mindata/kordosky/log/mcimport.log mcimport.20070222 Links .k5login to .k5loginmin when disk space is low. on minos-sam02 and 03, ln -sf .k5loginfull .k5login Do this on minos26 as soon as we have an idle patch This is so that mcimport can do ln -sf .k5loginmin .k5login when the disk is full.
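For the record, the disk-space guard could look something like the sketch below. This is not the actual mcimport.20070222 code; the 90% threshold and the scratch path are assumptions, and only the ln -sf of .k5loginmin / .k5loginfull comes from the note above.
# sketch only - swap .k5login based on free space in the mindata scratch area
SCRATCH=/local/scratch26/mindata            # assumed mount point
PCT=`df -P ${SCRATCH} | tail -1 | tr -s ' ' | cut -f 5 -d ' ' | tr -d '%'`
if [ ${PCT} -ge 90 ] ; then
    ln -sf .k5loginmin  ${HOME}/.k5login    # shut out new imports when nearly full
else
    ln -sf .k5loginfull ${HOME}/.k5login    # restore normal access
fi
The symlink flip is the whole mechanism; the real script presumably also logs the switch.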
LOG corruption, arms reports corrupt log files, n12011205_0001_L010185N_D00.log n12011205_0002_L010185N_D00.log n11011205_0001_L010185N_D00.log n11011205_0002_L010185N_D00.log Scanning all howcroft logs, for DIR in `ls -d howcroft/log/L*` ; do echo $DIR ; for FIL in `ls ${DIR}` ; do wc -w ${DIR}/${FIL} ; done ; done | grep ' 0 ' 0 howcroft/log/L010185N_n1101/n11011205_0001_L010185N_D00.log 0 howcroft/log/L010185N_n1101/n11011205_0002_L010185N_D00.log 0 howcroft/log/L010185N_n1201/n12011205_0001_L010185N_D00.log 0 howcroft/log/L010185N_n1201/n12011205_0002_L010185N_D00.log 0 howcroft/log/L010185_near/L010185_near_1205_1.log 0 howcroft/log/L010185_near/L010185_near_1205_2.log 0 howcroft/log/L010185_rock/L010185_rock_1205_1.log 0 howcroft/log/L010185_rock/L010185_rock_1205_2.log Jan 24 04:59 howcroft/log/L010185N_n1101/n11011205_0001_L010185N_D00.log Jan 24 05:09 howcroft/log/L010185N_n1101/n11011205_0002_L010185N_D00.log Jan 24 05:10 howcroft/log/L010185N_n1201/n12011205_0001_L010185N_D00.log Jan 24 05:10 howcroft/log/L010185N_n1201/n12011205_0002_L010185N_D00.log Jan 24 04:59 howcroft/log/L010185_near/L010185_near_1205_1.log Jan 24 05:09 howcroft/log/L010185_near/L010185_near_1205_2.log Jan 24 05:09 howcroft/log/L010185_rock/L010185_rock_1205_1.log Jan 24 05:10 howcroft/log/L010185_rock/L010185_rock_1205_2.log kordosky logs are clean. ============================================================================= 2007 02 20 ####### # SRM # ####### per timur, installed java v1.5.0 This works, setup java v1.5.0 can use srmls from srmcp v1_25 Still need to unset SRM_PATH ########## # DCACHE # ########## Kennedy found that 20a was holding files, this has been released, files going to tape now. Checking also raw data, MINOS26 > cd /pnfs/minos/neardet_data/2007-02 MINOS26 > for FIL in `ls` ; do printf "${FIL} " ; head -1 ".(use)(4)(${FIL})" ; sleep 1 ; done N00011672_0002.mdaq.root VO2307 N00011672_0003.mdaq.root VO2307 ... N00011798_0001.mdaq.root N00011799_0000.mdaq.root N00011800_0000.mdaq.root N00011801_0000.mdaq.root N00011802_0000.mdaq.root N00011803_0000.mdaq.root N00011804_0000.mdaq.root N00011804_0001.mdaq.root N00011804_0002.mdaq.root N00011804_0003.mdaq.root N00011804_0004.mdaq.root N00011804_0005.mdaq.root N00011804_0006.mdaq.root N00011804_0007.mdaq.root N00011804_0008.mdaq.root N00011804_0009.mdaq.root N00011804_0010.mdaq.root MINOS26 > cd /pnfs/minos/fardet_data/2007-02 MINOS26 > for FIL in `ls` ; do printf "${FIL} " ; head -1 ".(use)(4)(${FIL})" ; sleep 1 ; done ... F00037654_0017.mdaq.root F00037665_0000.mdaq.root F00037670_0000.mdaq.root F00037671_0000.mdaq.root F00037676_0001.mdaq.root F00037676_0002.mdaq.root F00037676_0003.mdaq.root F00037676_0004.mdaq.root F00037676_0005.mdaq.root F00037676_0006.mdaq.root F00037676_0007.mdaq.root F00037676_0008.mdaq.root F00037676_0009.mdaq.root F00037676_0010.mdaq.root F00037676_0011.mdaq.root F00037676_0012.mdaq.root F00037676_0013.mdaq.root F00037676_0014.mdaq.root F00037676_0015.mdaq.root F00037676_0016.mdaq.root F00037676_0017.mdaq.root F00037676_0018.mdaq.root These are less than 24 hours old. 
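A variant of the loop above that reports only the raw files with no tape label yet (a sketch; it assumes the layer-4 metadata is simply empty until Enstore has written the file, as the blank entries in the listing suggest):
# sketch - list raw files in a month directory that are not yet on tape
cd /pnfs/minos/fardet_data/2007-02
for FIL in `ls` ; do
    VOL=`head -1 ".(use)(4)(${FIL})"`
    [ -z "${VOL}" ] && echo "NOT ON TAPE ${FIL}"
    sleep 1
done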
############ # MCIMPORT # ############ MCIN M26 > ./mcimport.20070216 -w howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log FARM OUTPUT Checking farm output locations, many not in proper RUN subdirectories SRV1> FILS=`ls /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data | grep '^n'` SRV1> printf "${FILS}\n" | head n11011010_0002_L010185N_D00.sntp.cedar.root n13011014_0008_L010185N_D00.sntp.cedar.root n13011014_0010_L010185N_D00.sntp.cedar.root n13011015_0000_L010185N_D00.sntp.cedar.root n13011015_0003_L010185N_D00.sntp.cedar.root n13011015_0005_L010185N_D00.sntp.cedar.root n13011015_0006_L010185N_D00.sntp.cedar.root n13011029_0003_L010185N_D00.sntp.cedar.root n13011033_0010_L010185N_D00.sntp.cedar.root n13011034_0002_L010185N_D00.sntp.cedar.root SRV1> printf "${FILS}\n" | wc -l 119 for FIL in ${FILS} ; do RUN=`echo ${FIL} | cut -c 6-8` ; FIN=`echo ${FIL} | cut -f 1 -d . ; ls /pnfs/minos/mcin_data/near/daikon_00/L010185N/${RUN}/${FIN}.reroot.root ; done SRV1> find /pnfs/minos/mcout_data/cedar/near/daikon_00 -name n\* -maxdepth 3 | wc -l 386 SRV1> find /pnfs/minos/mcout_data/cedar/near/daikon_00 -name n\* -maxdepth 3 | cut -f 7,8 -d '/' | sort -u daikon_00/L010185N daikon_00/L010200N daikon_00/L100200N daikon_00/L150200N daikon_00/L250200N FILES=`find /pnfs/minos/mcout_data/cedar/near/daikon_00 -name n\* -maxdepth 3` FIRST=`printf ${FILES} | head` Check that the files are not on the same tape, via VP1/VP2 Check that the checksums are the same, EC1/EC2 for FILE in ${FILES} ; do for FILE in ${FIRST} ; do PAT=`echo ${FILE} | cut -f -9 -d /` FIL=`echo ${FILE} | cut -f 10 -d /` RUN=`echo ${FIL} | cut -c 6-8` if [ -r ${PAT}/${RUN}/${FIL} ] ; then MD1=`( cd ${PAT} ; cat ".(use)(4)(${FIL})" )` MD2=`( cd ${PAT}/${RUN} ; cat ".(use)(4)(${FIL})" )` VP1=`printf "${MD1}\n" | head -2` VP2=`printf "${MD2}\n" | head -2` LN1=`printf "${MD1}\n" | tail +3 | head -1` LN2=`printf "${MD2}\n" | tail +3 | head -1` CS1=`printf "${MD1}\n" | tail -1` CS2=`printf "${MD2}\n" | tail -1` # printf "${MD1}\n" echo ${VP1} ${LN1} ${CS1} echo ${VP2} ${LN2} ${CS2} [ "${VP1}" == "${VP2}" ] && printf "${PAT}/${FIL}\n OOPS, same volume \n" [ "${LN1}" != "${LN2}" ] && printf "${PAT}/${FIL}\n OOPS, wrong length \n" [ "${CS1}" != "${CS2}" ] && printf "${PAT}/${FIL}\n OOPS, wrong checksum \n" else printf " Missing ${PAT}/${RUN}/${FIL} \n" fi done 2>&1 | tee /tmp/runscan.log Missing /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/n13011458_0009_L010185N_D00.cand.cedar.root Missing /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/n13011453_0007_L010185N_D00.cand.cedar.root Missing /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/n13011455_0002_L010185N_D00.cand.cedar.root Missing /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/n13011456_0000_L010185N_D00.cand.cedar.root Missing /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/n13011457_0010_L010185N_D00.cand.cedar.root SRV1> grep -B 1 length /tmp/runscan.log | grep pnfs/minos | sort /pnfs/minos/mcout_data/cedar/near/daikon_00/ L010185N/cand_data/n11011010_0002_L010185N_D00.cand.cedar.root L010185N/cand_data/n13011029_0003_L010185N_D00.cand.cedar.root L010185N/cand_data/n13011049_0002_L010185N_D00.cand.cedar.root L010185N/sntp_data/n11011010_0002_L010185N_D00.sntp.cedar.root L010185N/sntp_data/n13011029_0003_L010185N_D00.sntp.cedar.root L010185N/sntp_data/n13011049_0002_L010185N_D00.sntp.cedar.root 
L010200N/cand_data/n13011009_0005_L010200N_D00.cand.cedar.root L010200N/sntp_data/n13011009_0005_L010200N_D00.sntp.cedar.root L100200N/cand_data/n13011015_0008_L100200N_D00.cand.cedar.root L100200N/cand_data/n13011045_0001_L100200N_D00.cand.cedar.root L100200N/sntp_data/n13011015_0008_L100200N_D00.sntp.cedar.root L100200N/sntp_data/n13011045_0001_L100200N_D00.sntp.cedar.root L250200N/cand_data/n13011004_0007_L250200N_D00.cand.cedar.root L250200N/cand_data/n13011012_0008_L250200N_D00.cand.cedar.root L250200N/cand_data/n13011015_0006_L250200N_D00.cand.cedar.root L250200N/sntp_data/n13011004_0007_L250200N_D00.sntp.cedar.root L250200N/sntp_data/n13011012_0008_L250200N_D00.sntp.cedar.root L250200N/sntp_data/n13011015_0006_L250200N_D00.sntp.cedar.root ============================================================================= 2007 02 19 ########### # ROUNDUP # ########### roundup.20070219 Added -c (cron/current) to run in foreground, as in mcimport Tested : SRV1> ./roundup.20070219 -c -r cedar far ; ./roundup.20070219 -c -r cedar near This worked as expected, move to this in production SRV1> ln -sf roundup.20070219 roundup ########### # GANGLIA # ########### Requested split from Minos to Minos Cluster and Minos Servers Requested move from rexganglia2/farms to rexganglia2/minos ####### # AFS # ####### AFS timeouts starting around 08:50 09:17 - crontab -r on kreymer,minodata@minos26 A PDU serving many of the servers has failed, no estimate (09:12) 10:00 call from helpdesk, service is back. ########### # ENSTORE # ########### After about 12:52 ( PNFS sampler ) or 12:58 ( raw data logging ) PNFS went offline Helpdesk ticket 92758 14:21 assigned to HARRISON, MICHAEL of the CD-SF/DMS/DSC/SSA Group 14:55 assigned to TIMM, STEVE of the CD-SF/GF/FGS Group. 15:09 assigned to HARRISON, MICHAEL of the CD-SF/DMS/DSC/SSA Group howcroft mcimport was running, failed on n11011329_0008_L010185N_D00-n11011330_0004_L010185N_D00.tar n11011330_0005_L010185N_D00-n11011330_0009_L010185N_D00.tar should pick them up next iteration Estimate is service back by 15:00 Data logging resumed at about 14:47 ####### # SSH # ####### Created new id_rsa (.pub) on desktop, with ssh-keygen -t rsa for use in connecting to csf.rl.ac.uk for grid data tests ####### # SRM # ####### Testing bootleg installation of srm v1.24 under mindata@minos26 Can create directory with srmmkdir, no special config. M26 > pwd /local/scratch26/kreymer/SRM M26 > export SRM_CONFIG=/home/mindata/.srmconfig/kreymer.xml M26 > SPATH3=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcin_data/far/daikon_00/L010185N/161 M26 > srmclient/bin/srmls ${SPATH3} srm client error: srm ls responce path details array is null! M26 > srmclient/bin/srmmkdir ${SPATH3} M26 > srmclient/bin/srmls ${SPATH3} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcin_data/far/daikon_00/L010185N/161 Now try a test of flxi04, for public testing : FLXI04 > mkdir /var/tmp/kreymer FLXI04 > cd /var/tmp/kreymer FLXI04 > scp -r mindata@minos26:/home/mindata/.srmconfig .srmconfig FLXI04 > scp -r mindata@minos26:/home/mindata/.grid .grid FLXI04 > scp -r minos26:/local/scratch26/kreymer/SRM SRM FLXI04 > cp -vax /var/tmp/kreymer /usr/scratch/kreymer FLXI04 > nedit .srmconfig/kreymer.xml changed /home/mindata/.grid to /usr/scratch/kreymer/.grid FLXI04 > . 
/afs/fnal.gov/ups/etc/setups.sh FLXI04 > export PRODUCTS=/afs/fnal.gov/files/code/e875/general/ups/db FLXI04 > setup srmcp v1_25_1 FLXI04 > export SRM_CONFIG=/usr/scratch/kreymer/.srmconfig/kreymer.xml FLXI04 > SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos The usual failures with java traceback. Now try the copy via /local/scratch26/kreymer/SRM/srmclient from https://srm.fnal.gov/twiki/pub/SrmProject/SrmcpClient/srmcp_v1_24_NULL.tar FLXI04 > SRM/srmclient/bin/srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/neardet_data 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/hpss 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fardet_logs ############ # MCIMPORT # ############ REROOTS - touch em up No new directories, so just run the script blindly M26 > cp -a AFSS/mcimport.20070216 . M26 > ./mcimport.20070216 howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log ============================================================================= 2007 02 16 ########## # DCACHE # ########## Have not restarted cron jobs for kreymer or mindata Came up around 15:30 yesterday, no announcement. Many problems, data archiving is stuck. Summary since the shutdown : 2007-02-16 08:06:49 N00011758_0001.mdaq.root OK 2007-02-16 07:59:40 F00037642_0001.mdaq.root OK 2007-02-16 07:58:39 F00037642_0001.mdaq.root NOT_FINISHED 2007-02-15 21:54:46 N070215_000006.mdcs.root OK 2007-02-15 19:08:20 F070215_000010.mdcs.root OK 2007-02-15 15:49:40 F00037642_0000.mdaq.root OK 2007-02-15 15:48:21 N00011758_0000.mdaq.root OK Geoff Pearce restarted archiver at 08:06, archiver noted N00011758_0000 complete, now stuck on _0001 far archiver seems to be running twice ------------------------------------------------------------------- Issued helpdesk ticket 92654 ------------------------------------------------------------------- Short Description: FNDCA DCache system is failing Problem Description: Since the upgrades yesterday, we have observed at least the following : Minos raw data logging fails, FTP's do not complete. dccp -P commands hang up, should complete immediately dccp copy commands move data, but time out after 5 minutes Details have been reported to dcache-admin via email All Minos data handling via DCache is down, specifically raw data logging Monte Carlo import Farm processing Analysis ------------------------------------------------------------------- ------------------------------------------------------------------- 16:15 - restarted FD archivers, after DCache restart ND is busy with an access. normal dccp and dccp -P commands seem to be OK again. 16:17 - successful F00037642_0004.mdaq.root, need to wait 10' for next 16:28 - successful F00037642_0005.mdaq.root 16:38 - successful F00037642_0006.mdaq.root ... ############ # MCIMPORT # ############ Cleared off some tarfiles preparing for resumed imports : Kordosky had nothing queued for writing, just a bit to purge.
M26 > ./mcimport -w kordosky 18:33 M26 > ./AFSS/mcimport.20070216 howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log mkdir misspelled midir, reran reran forgot to change MIPNFS to PNFS, so size check failed on f21011501_0000_L010185N_D00.reroot.root moved to howcroft/far/mcin/dcache reran OK until a new directory was created, owned by mindata not kreymer need to do srm_mkdir M26> rmdir /pnfs/minos/mcin_data/far/daikon_00/L010185N/151 This will take some work, using srmmkdir for the first time. Early tests failed with srm v1_21 Make these by hand for these files : MINOS26 > mkdir /pnfs/minos/mcin_data/far/daikon_00/L010185N/151 for 151 through 160 reran, drat, ran it twice and the pid interlock failed !!!! copying both these, file sizes match OK f21011511_0000 moved manually to dcache f21011510_0000 reran, OK so far ! Still need to implement working srmmkdir. ############ # ARCHIVER # ############ Tracking down archiver, minos@minos-beamdata crontab runs /home/minos/bin/archiverstatus.sh This references /home/minos/bin/init/archiver script which, for starting, runs /home/minos/bin/archiver_krb.py 1> /data/logs/archiver.log 2>&1 & which does the actual ftp with import gssftp gssftp gssftp.py seems to be vintage May 9 2006 /home/minos/kftp/v3_6/NULL/lib/gssftp.py Note empty pid file, /var/lock/beam/archiver.pid init/archived gets pid from ps -f --cols 132 -u minos | grep archiver_ | grep -v grep | awk '{print $2}' right after starting archiver_krb.py, with no delay. This seems corrected as of 18:05, on a restart of the server Perhaps the server crashed instantly on restart, per buckley. ############ # PREDATOR # ############ 18:07 ./predator 2007-02 Ran cleanly, doing something in all streams 18:58 crontab crontab.dat ============================================================================= 2007 02 15 ########## # DCACHE # ########## down at 06:00 for maintenance ####### # AFS # ####### HOWTO.afs - created with AFS management guidance MINOS26 > fs listcells | tee /tmp/listcells MINOS26 > wc -l /tmp/listcells 159 /tmp/listcells MINOS26 > grep fnal /tmp/listcells Cell fnal.gov on hosts fsus03.fnal.gov fsus01.fnal.gov fsus04.fnal.gov.
MINOS26 > for DIR in `ls` ; do fs whereis ${DIR} ; done | cut -f 6 -d ' ' | sort -u fsus-minos01.fnal.gov fsus02.fnal.gov fsus05.fnal.gov fsus06.fnal.gov fsus07.fnal.gov fsus08.fnal.gov MINOS26 > DIRS=`ls` MINOS26 > WHERES=`for DIR in ${DIRS} ; do fs whereis ${DIR} ; done` MINOS26 > printf "${WHERES}\n" | grep minos01 | wc -l 211 MINOS26 > printf "${WHERES}\n" | grep fsus02 | wc -l 6 MINOS26 > printf "${WHERES}\n" | grep fsus05 | wc -l 3 MINOS26 > printf "${WHERES}\n" | grep fsus06 | wc -l 1 MINOS26 > printf "${WHERES}\n" | grep fsus07 | wc -l 1 MINOS26 > printf "${WHERES}\n" | grep fsus08 | wc -l 3 MINOS26 > printf "${WHERES}\n" | grep fsus02 File d08 is on host fsus02.fnal.gov File d35 is on host fsus02.fnal.gov File d50 is on host fsus02.fnal.gov File d59 is on host fsus02.fnal.gov File d63 is on host fsus02.fnal.gov File validation is on host fsus02.fnal.gov MINOS26 > printf "${WHERES}\n" | grep fsus05 File crl_data is on host fsus05.fnal.gov File logbook is on hosts fsus05.fnal.gov fsus08.fnal.gov File offline_monitor is on host fsus05.fnal.gov MINOS26 > printf "${WHERES}\n" | grep fsus06 File beam_docs is on host fsus06.fnal.gov MINOS26 > printf "${WHERES}\n" | grep fsus07 File log_data is on host fsus07.fnal.gov MINOS26 > printf "${WHERES}\n" | grep fsus08 File d31 is on host fsus08.fnal.gov File d67 is on host fsus08.fnal.gov File logbook is on hosts fsus05.fnal.gov fsus08.fnal.gov ============================================================================= 2007 02 14 ########## # DCACHE # ########## Schedule shutdown during PNFS outage MINOS26 > echo 'crontab -r' | at 05:30 job 13 at 2007-02-15 05:30 M26 > echo 'crontab -r' | at 20:00 job 14 at 2007-02-14 20:00 ########### # ROUNDUP # ########### ############ # MINOSCVS # ############ Created .k5login backups, cleaned up removed west, minoscvs ######### # MYSQL # ######### Speakman is having trouble connectiong, from a particular offsite host. Probably a firewall problem. ERROR 2003 (HY000): Can't connect to MySQL server on 'minos-db1.fnal.gov' (113) ERROR 2003 (HY000): Can't connect to MySQL server on 'minos-mysql1.fnal.gov' (113) Suggested telnetting to the port : MINOS26 > telnet minos-mysql1 3306 Trying 131.225.193.13... Connected to minos-mysql1.fnal.gov (131.225.193.13). Escape character is '^]'. 8 4.1.11-log%? uC.YY44H,M,1R*yG6vrYG exit #08S01Bad handshakeConnection closed by foreign host. Confirmed, this port is being blocked by his firewall. My preference is to say OK, they have got what they want. I prefer not to override security policies of remote sites. ########### # GANGLIA # ########### ########### # BEAMLOG # ########### Updated to remove HTML content ( internal ) and suppress entries for NCYCLE or NBEAM 0, we had previously duplicated the previous entry ! 
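The suppression just described amounts to a guard of roughly this shape (a sketch, not the actual beam_log code; NCYCLE and NBEAM are taken from the note above, everything else is assumed):
# sketch - inside the per-record loop, skip a bad beam record
# rather than repeating the previous entry
case "${NCYCLE}${NBEAM}" in
    ''|*[!0-9]*) continue ;;             # empty or non-numeric (e.g. NaN)
esac
[ "${NCYCLE}" -eq 0 -o "${NBEAM}" -eq 0 ] && continue
The non-numeric case would also cover the '[: NaN' complaints noted below.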
Getting some messages like /afs/...beam_log: line 61: [: NaN time dd if=/dev/zero of=thous bs=1M count=1000 1000+0 records in 1000+0 records out real 0m12.699s user 0m0.000s sys 0m5.600s AKS3 > ls -l thous -rw-r--r-- 1 kreymer 1525 1048576000 Feb 13 11:34 thous AKS3 > time dd if=/dev/zero of=single bs=1048576000 count=1 1+0 records in 1+0 records out real 0m13.814s user 0m0.000s sys 0m5.380s AKS3 > ls -l total 2460012 -rw-r--r-- 1 kreymer 1525 419430400 Feb 6 12:00 TEST -rw-r--r-- 1 kreymer 1525 1048576000 Feb 13 11:35 single -rw-r--r-- 1 kreymer 1525 1048576000 Feb 13 11:34 thous AKS3 > time dd if=/dev/zero of=funny bs=1048576123 count=1 1+0 records in 1+0 records out real 0m13.911s user 0m0.000s sys 0m5.220s AKS3 > ls -l funny -rw-r--r-- 1 kreymer 1525 1048576123 Feb 13 12:00 funny ########## # DCACHE # ########## Pool allocation adjustments for expanded pools Thursday Taking an inventory of tags on various sntp directories MINOS26 > for DIR in `ls -d reco_near/*/sntp_data` ; do printf "${DIR} `cat ${DIR}/'.(tag)(file_family)'`\n" ; done reco_near/R1.11/sntp_data sntp_near_R1_11 reco_near/R1.12/sntp_data sntp_near_R1_12 reco_near/R1.14/sntp_data sntp_near_R1_14 reco_near/R1.16/sntp_data reco_near_R1_16 reco_near/R1/sntp_data sntp_near_R1 reco_near/R1_17/sntp_data reco_near_R1_17 reco_near/R1_18/sntp_data sntp_near_R1_18_0 reco_near/R1_18_2/sntp_data reco_near_R1_18_2 reco_near/R1_18_2_temp/sntp_data reco_near_R1_18 reco_near/R1_18_3/sntp_data reco_near_R1_18_3 reco_near/R1_18_4/sntp_data reco_near_R1_18_4 reco_near/R1_21/sntp_data reco_near_R1_21 reco_near/R1_23/sntp_data reco_near_R1_23 reco_near/R1_23a/sntp_data reco_near_R1_23a reco_near/R1_24/sntp_data reco_near_R1_24 reco_near/R1_24a/sntp_data reco_near_R1_24a reco_near/R1_24b/sntp_data reco_near_R1_24b reco_near/R1_24c/sntp_data reco_near_R1_24c reco_near/S06-05-25-R1-22/sntp_data reco_near_S06-05-25-R1-22 reco_near/S06-06-22-R1-22/sntp_data reco_near_S06-06-22-R1-22 reco_near/cedar/sntp_data reco_near_cedar_sntp MINOS26 > for DIR in `ls -d reco_far/*/sntp_data` ; do printf "${DIR} `cat ${DIR}/'.(tag)(file_family)'`\n" ; done reco_far/R1.11/sntp_data sntp_data_R1_11 reco_far/R1.12/sntp_data sntp_data_R1_12 reco_far/R1.14/sntp_data sntp_data_R1_14 reco_far/R1.16/sntp_data reco_far_R1_16 reco_far/R1.16a/sntp_data sntp_near_R1_16a reco_far/R1_17/sntp_data reco_far_R1_17 reco_far/R1_17a.0/sntp_data reco_far_R1_17 reco_far/R1_18/sntp_data reco_far_R1_18 reco_far/R1_18_2/sntp_data reco_far_R1_18_2 reco_far/R1_18_2_temp/sntp_data minos reco_far/R1_18_2a/sntp_data reco_far_R1_18_2a reco_far/R1_18_4/sntp_data reco_far_R1_18_4 reco_far/R1_21/sntp_data reco_far_R1_21 reco_far/R1_23/sntp_data reco_far_R1_23 reco_far/R1_23a/sntp_data reco_far_R1_23a reco_far/R1_24/sntp_data reco_far_R1_24 reco_far/R1_24a/sntp_data reco_far_R1_24a reco_far/R1_24b/sntp_data reco_far_R1_24b reco_far/R1_24c/sntp_data reco_far_R1_24c reco_far/S06-05-25-R1-22/sntp_data reco_far_S06-05-25-R1-22 reco_far/S06-06-22-R1-22/sntp_data reco_far_S06-06-22-R1-22 reco_far/cedar/sntp_data reco_far_cedar_sntp MINOS26 > for DIR in `ls -d reco_far/*/.bntp_data` ; do printf "${DIR} `cat ${DIR}/'.(tag)(file_family)'`\n" ; done reco_far/R1_18/.bntp_data reco_far_R1_18 reco_far/R1_18_2/.bntp_data reco_far_R1_18_2 reco_far/R1_18_2_temp/.bntp_data minos reco_far/R1_18_2a/.bntp_data reco_far_R1_18_2a reco_far/R1_18_4/.bntp_data reco_far_R1_18_4 reco_far/R1_23/.bntp_data reco_far_R1_23 reco_far/R1_23a/.bntp_data reco_far_R1_23a reco_far/R1_24/.bntp_data reco_far_R1_24 
reco_far/R1_24a/.bntp_data reco_far_R1_24a reco_far/R1_24b/.bntp_data reco_far_R1_24b reco_far/R1_24c/.bntp_data reco_far_R1_24c reco_far/S06-05-25-R1-22/.bntp_data reco_far_S06-05-25-R1-22 reco_far/S06-06-22-R1-22/.bntp_data reco_far_S06-06-22-R1-22 reco_far/cedar/.bntp_data reco_far_cedar_bntp Sent request to dcache-admin, kennedy, ( omitting the not-yet-existing *_mrnt files, thanks Robert ! ) ########### # GANGLIA # ########### email ============================================================================= 2007 02 12 ########### # ROUNDUP # ########### roundup.20070212 setup encp v3_6d versus c ( since 2007 Jan 22 ) use /export/state/minfarm/.srmconfig and .grid, for local non-NFS copy ran far with old roundup, near with new, looks OK ######### # VOMRS # ######### Plunging into the brave new VO registration world, for Grid access. Started at fermigrid.fnal.gov, User Guide. Directed to register at https://voms.fnal.gov:8443/vomrs/vo-fermilab/vomrs Steps are ( in language clear as mud to me ) Visitor anyone Registration (Phase I) magic identity check ? Candidate Registration (Phase II) agreeing to OSG Usage Rules Applicant Approval by Administrator Member Phase I - mail kreymer@fnal.gov rep Steven Timm rights Full First Arthur Last Kreymer Phone 630 840 4261 Immediately get message You have successfully submitted phase I of fermilab VO registration! You now have candidate status in the fermilab VO. You will receive an email providing further instructions about second phase of registration. Click on the Registration (Phase I) link to update the left hand menu. Got the email immediately, clicked the email link. Great, the /fermilab/minos group is obvious. But what about the roles ? GratiaFermilabAdmin GratiaGlobalAdmin minossoft Production root VOMS-Query and why is minossoft a role for ALL groups ? Got confirmation web page, You have successfully submitted phase II of fermilab VO registration! You will receive an email notice from the VOMRS fermilab Service indicating that you've been approved (or denied) as a VO member. This could take up to a few days; it depends on how soon your representative completes this task. You now have applicant status in the fermilab VO, and as such can access more screens. Click on the fermilab VO Registration Home link in the left hand menu to update the menu. Applicant to fermilab VO may: * Change your groups and group roles selection * Browse groups * Browse institutions and sites * Browse required personal information * Browse CAs recognized by fermilab VO * Browse your own personal information * Re-sign usage rules * Browse your own authorization status * Browse required personal information * Browse CAs recognized by fermilab VO * Unsubscribe and resubscribe to personal event notification Got an immediate email, " you have been assigned " But that is neither 'approved' nor 'denied'. N.B. Phase I - must select a representative ( of what, for what ? ) Labels have magic pop up boxes, unreadable black on dark blue Why is Peter Shanahan listed ? Why is fermigrid2 listed ? Why are Steven Timm and Steven C. Timm listed ? I selected Steven Timm Note that Authorization Status that I can browse is listed as 'new', which is not any of the states described above. Found this on 2007 02 15, select 'Roles', called Phases on the welcome page. My Role is still listed as Applicant UGH, on a reconnect to https://voms.fnal.gov:8443/vomrs/vo-fermilab/vomrs I get a popup box message : You have attempted to establish a connection to "voms.fnal.gov".
However, the security certificate presented belongs to "http/fermigrid2.fnal.gov". It is possible, though unlikely, that someone may be trying to intercept your communication with this web site. If you suspect that the certificate shown does not belong to "voms.fnal.gov", please cancel the connection and notify the site administrator. Your status with the VO has been changed from New to Approved due to the following reason: Approved. Please contact VO administrator if you have any questions. Status, So now we have 'Phases', 'Roles' and 'Status' to describe the same thing. Ugh. ============================================================================= 2007 02 09 ########## # BREBEL # ########## Finished archiving minos22:/local/scratch22/brebel/R1.14 to /pnfs/minos/users/brebel/R1.14/* ntupleSt - 149 GB monthly directories MINOS22 > cd ntupleSt MINOS22 > DIRS=`ls` MINOS22 > for DIR in ${DIRS} ; do du -sh ${DIR} ; done 14G 2005-05 13G 2005-06 14G 2005-07 14G 2005-08 14G 2005-09 13G 2005-10 13G 2005-11 13G 2005-12 13G 2006-01 11G 2006-02 11G 2006-03 9.6G 2006-04 setup ecrc DCPOR=24736 setup dcap klist -f ##################################### for DIR in ${DIRS} ; do date TAF=/tmp/br/${DIR}.tar echo make ${TAF} time tar cf ${TAF} -C ${DIR} . echo diff ${TAF} time tar df ${TAF} -C ${DIR} . du -sm ${TAF} ls -l ${TAF} ECRC=`ecrc ${TAF}` printf "`echo $ECRC | cut -f 2 -d ' '` ${DIR}\n" date DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/users/brebel/R1.14/${DIR}.tar echo copy ${TAF} time dccp ${TAF} ${DFILE} ( cd /pnfs/minos/users/brebel/R1.14 ; cat ".(use)(2)(${DIR}.tar)" ) rm -f ${TAF} done 2>&1 | tee -a /tmp/br.log ##################################### Did this for DIRS=2005-05 DIRS='2005-06 2005-07 2005-08 2005-09 2005-10 2005-11 2005-12 2006-01 2006-02 2006-03 2006-04' Wait for enstore to get ECRC. BRLIS=$MINOS_DATA/log_data/users/brebel/R1.14 cp /tmp/br.log ${BRLIS}/ Create ecrc listings nedit ${BRLIS}/ecrc.lis # Create lines with ecrcdir MINOS26 > mkdir ${BRLIS} -p for DIR in ${DIRS} ; do ls -l ${DIR} > ${BRLIS}/${DIR}.lis ; done cd .. ; DIR=ntupleStUp_v3.5 ls -l ${DIR} > ${BRLIS}/${DIR}.lis Next, check CRC's. TARS=`ls /pnfs/minos/users/brebel/R1.14 | cut -f 1 -d .` TARS=${TARS}.5 # adds missing .5 to last tarfile name. for TAR in ${TARS} ; do OCRC=`grep ${TAR} ${BRLIS}/ecrc.lis | cut -f 1 -d ' '` echo ${OCRC} ${TAR} ECRC=`(cd /pnfs/minos/users/brebel/R1.14 ; cat ".(use)(4)(${TAR}.tar)" | tail -1)` echo ${ECRC} ${TAR} [ "${ECRC}" != "${OCRC}" ] && printf " OOPS, mismatch \n" done 2007 02 10 all these are now on tape, the above tests succeed ####### # CFL # ####### Corrected raw data to include all months cdm ; cd CFL $HOME/minos/scripts/cflsum.20070209 | tee cflsum.20070206 ln -sf cflsum.20070206 CFLSUM cds ln -sf cflsum.20070206 cflsum ############ # MCIMPORT # ############ Note heavy load from kordosky, as many as 10 scp's and nearly 20 md5sum's running all at once. 
SAR: 02:50:01 PM all 0.20 0.00 0.17 0.21 99.42 03:00:01 PM all 0.61 0.00 0.64 0.65 98.09 03:10:01 PM all 9.37 0.00 2.08 1.18 87.37 03:20:01 PM all 11.50 0.00 3.67 14.72 70.10 03:30:01 PM all 6.15 0.00 4.70 13.01 76.14 03:40:02 PM all 8.50 0.00 7.40 46.63 37.47 03:50:01 PM all 5.95 0.00 5.16 87.77 1.13 04:00:02 PM all 3.22 0.00 2.91 93.75 0.12 04:10:02 PM all 3.12 0.00 2.61 94.22 0.05 04:20:01 PM all 1.95 0.00 2.11 95.84 0.09 04:30:02 PM all 0.73 0.00 1.79 97.39 0.08 04:40:01 PM all 1.27 0.00 2.34 96.39 0.00 04:50:02 PM all 1.12 0.00 2.12 96.69 0.07 05:00:01 PM all 1.10 0.00 2.38 96.48 0.04 05:10:02 PM all 1.62 0.00 2.22 96.17 0.00 05:20:02 PM all 16.46 0.00 3.55 79.97 0.02 05:20:02 PM CPU %user %nice %system %iowait %idle 05:30:01 PM all 2.65 0.00 2.13 95.20 0.02 05:40:01 PM all 2.81 0.00 2.64 94.55 0.01 05:50:01 PM all 2.65 0.00 2.34 94.83 0.19 06:00:01 PM all 5.67 0.00 7.51 86.39 0.43 06:10:01 PM all 1.81 0.00 2.43 52.03 43.73 06:20:02 PM all 3.03 0.00 1.65 29.64 65.68 Average: all 4.21 0.00 2.80 40.62 52.37 ============================================================================= 2007 02 08 ############ # MCIMPORT # ############ running slowly again, up to 14 kordosky scp's running at once. Fortunately my local broadband connection is slow at uploads ( 50 KB/sec ), so it is easy to create a slow source of data. ( Easy, that is, if I drive home to do the test. ) I copied 6 files at once, with scp -c blowfish , each of them 10 MBytes. ( This is not an unusual situation, I saw 14 kordosky scp's running today. ) The files were named frag0 through frag5. Then I pre-created 6 more files with for N in a b c d e f ; do dd if=/dev/zero of=frag${N} bs=1M count=10 ; done Then ran 6 copies again, to fraga through fragf. The files are 2560 blocks long ( 4096 byte blocks ) filefrag reports as follow : R > FRAG=/home/minsoft/maint/filefrag R > for N in 0 1 2 3 4 5 ; do ${FRAG} frag${N} ; done frag0: 1420 extents found, perfection would be 1 extent frag1: 1601 extents found, perfection would be 1 extent frag2: 1694 extents found, perfection would be 1 extent frag3: 1384 extents found, perfection would be 1 extent frag4: 1480 extents found, perfection would be 1 extent frag5: 1346 extents found, perfection would be 1 extent R > for N in a b c d e f ; do ${FRAG} frag${N} ; done fraga: 1 extent found fragb: 1 extent found fragc: 1 extent found fragd: 1 extent found frage: 1 extent found fragf: 2 extents found, perfection would be 1 extent ########### # GANGLIA # ########### fnpca seems to be down, existing trouble ticket 92258 ============================================================================= 2007 02 07 ############ # DATABASE # ############ See LOG.mysql ####### # X11 # ####### Gimp scan, stuck on minos15, minos15 Wed Feb 7 13:11:54 CST 2007 Globally, the swap directory was missing, /var/tmp/kreymer/.gimp-2.0 MIN > for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'ls -l /var/tmp | grep drw' ; done minos01 Wed Feb 7 13:16:30 CST 2007 drwx------ 4 minoscvs e875 4096 Jan 26 17:59 cvs-serv22533 minos13 Wed Feb 7 13:16:38 CST 2007 drwxr-xr-x 4 kreymer 1525 4096 Jan 27 00:25 kreymer minos26 Wed Feb 7 13:16:50 CST 2007 drwxr-xr-x 5 mindata e875 4096 Feb 5 12:41 mindata drwxr-xr-x 3 kreymer g020 4096 Jan 9 11:53 rawcopy MIN > for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'mkdir -p /var/tmp/kreymer/.gimp-1.2 | grep drw' ; done MINOS26 > mv /var/tmp/kreymer/.gimp-1.2 /var/tmp/kreymer/.gimp-2.0 Rescanned, stuck on 11 and 15 minos11 Wed Feb 7 13:28:58 CST 2007 minos15 
Wed Feb 7 13:29:39 CST 2007 ############ # MCIMPORT # ############ keepup - ran very fast last night for howcroft, 6 to 9 MB/sec from 06:48 to 07:33 during which time files continued to be imported, about 10 minutes each size time 348051424 Feb 7 06:10 n11011272_0007_L010185N_D00.tar.gz 348915700 Feb 7 06:21 n11011272_0008_L010185N_D00.tar.gz 11 344127618 Feb 7 06:31 n11011272_0009_L010185N_D00.tar.gz 10 351102396 Feb 7 06:42 n11011273_0000_L010185N_D00.tar.gz 11 354129779 Feb 7 06:53 n11011273_0001_L010185N_D00.tar.gz 11 345883174 Feb 7 07:14 n11011273_0002_L010185N_D00.tar.gz 19 345407102 Feb 7 07:24 n11011273_0003_L010185N_D00.tar.gz 10 350804073 Feb 7 07:35 n11011273_0004_L010185N_D00.tar.gz 11 346207685 Feb 7 07:45 n11011273_0005_L010185N_D00.tar.gz 11 ### DUPLICATE RUNNING ### Reindexing kordosky duplicate running per 03 Feb email L010185, 1071-1090 cd kordosky/index FIND THE FILES for N in 7 8 9 ; do grep "10${N}._.*L010185" *.index ; done for N in 7 8 9 ; do grep "10${N}._.*L010185" *.index ; done | wc -l 432 FIND THE TARS for N in 7 8 9 ; do grep "10${N}._.*L010185" *.index ; done | \ cut -f 1 -d : | sort -u M26> for N in 7 8 9 ; do grep "10${N}._.*L010185" *.index ; done | \ cut -f 1 -d : | sort -u | wc -l 48 DUPTS=`for N in 7 8 9 ; do grep "10${N}._.*L010185" *.index ; done | \ cut -f 1 -d : | sort -u` for DUPT in $DUPTS ; do cat ${DUPT} ; done | wc -l 432 GOOD ! These tarfiles contain only the duplicated runs. So the index files can just be moved out of the way. mkdir ../index.dup1071 for DUPT in $DUPTS ; do mv ${DUPT} ../index.dup1071/ ; done ============================================================================= 2007 02 06 ####### # CFL # ####### aklog cdm ; cd CFL $HOME/minos/scripts/cfl 1110230 CFL $HOME/minos/scripts/cflsum | tee cflsum.20070206 ln -sf cflsum.20070206 CFLSUM ############ # MCIMPORT # ############ test of file preallocation , kreymer@minos-sam03 SS3 > cd MCIMPORT SS3 > dd if=/dev/zero of=TEST bs=1M count=1000 1000+0 records in 1000+0 records out SS3 > du -sb TEST 1049604096 TEST SS3 > rm TEST SS3 > time dd if=/dev/zero of=TEST bs=1M count=1000 1000+0 records in 1000+0 records out real 0m12.710s user 0m0.000s sys 0m5.090s AKS3 > du -sm /var/tmp/* 3193 /var/tmp/DCS 329 /var/tmp/FIX 1 /var/tmp/rc_host_0 AKS3 > time cp /var/tmp/FIX TEST real 0m8.033s user 0m0.060s sys 0m2.960s M26> TIF=/var/tmp/mindata/TMP/n11011219_0003_L010185N_D00.tar.gz M26> time dd if=$TIF of=TIF bs=1M real 0m10.066s user 0m0.000s sys 0m3.340s M26> rm TIF M26> time dd if=$TIF of=TIF bs=1M 338+1 records in 338+1 records out real 0m4.353s user 0m0.000s sys 0m2.170s M26> time dd if=/dev/zero of=TIF bs=1M count=400 400+0 records in 400+0 records out real 0m5.346s user 0m0.000s sys 0m2.030s M26> stat TIF File: `TIF' Size: 419430400 Blocks: 820008 IO Block: 4096 Regular File Device: 341h/833d Inode: 17301506 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 3648/ mindata) Gid: ( 5111/ e875) Access: 2007-02-06 12:07:54.000000000 -0600 Modify: 2007-02-06 12:07:59.000000000 -0600 Change: 2007-02-06 12:07:59.000000000 -0600 M26> time dd if=$TIF of=TIF bs=1M 338+1 records in 338+1 records out real 0m4.868s user 0m0.000s sys 0m2.480s M26> stat TIF File: `TIF' Size: 354953788 Blocks: 693960 IO Block: 4096 Regular File Device: 341h/833d Inode: 17301506 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 3648/ mindata) Gid: ( 5111/ e875) Access: 2007-02-06 12:07:54.000000000 -0600 Modify: 2007-02-06 12:08:39.000000000 -0600 Change: 2007-02-06 12:08:39.000000000 -0600 ############ # MCIMPORT # ############ 
mcimport_init - initializes mindata account, as on minos-sam02/3 ########### # ROUNDUP # ########### Did catchup. SRMCP errors , retried OK for /pnfs/minos/reco_far/cedar/sntp_data/2007-02/F00037375_0000.spill.sntp.cedar.0.root at 12:12. Succeeded at 12:13 in spite of message org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception messa ge: Custom message: Unexpected reply: 425 Cannot open port: java.lang.Exception: Pool manager error: No write pool available for ]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 425 Cannot open port: java.lang.Exception: Pool manager error: No write pool available for at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:195) at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109) at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1456) at org.globus.ftp.GridFTPClient.extendedPut(GridFTPClient.java:508) at org.globus.ftp.GridFTPClient.extendedPut(GridFTPClient.java:474) at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:846) at java.lang.Thread.run(Thread.java:534) try again sleeping for 10000 before retrying That's reasonable, all the 9a pools were restarted at 12:12 w-stkendca9a-1 w-stkendca9a-1Domain 0 10 69 msec 02/06 12:12:29 production-1-7-0(1.130.2.2) ============================================================================= 2007 02 05 ############ # PREDATOR # ############ VMON=2007-01 ./predator ${VMON} ../HOWTO.predator ${VMON} ######### # VAULT # ######### for DET in far near; do ./vault ${DET} ${VMON} ; done Ran correctly ######## # DATA # ######## Runs to be reprocessed in cedar, from rubin mail : N00008009_0000 N00009582_0005 N00009586_0000 N00009586_0001 N00009586_0002 N00009586_0004 N00009586_0005 N00009586_0006 N00009586_0008 N00009586_0009 N00009586_0010 N00010163_0001 N00010163_0003 N00010163_0004 N00010163_0005 N00010163_0006 N00010163_0007 N00010163_0008 N00010163_0009 N00010163_0010 N00010163_0011 N00010163_0012 N00010163_0015 N00010184_0002 N00010184_0003 N00010184_0004 N00010184_0005 N00010184_0007 N00010184_0008 N00010184_0009 N00010184_0010 N00010184_0011 N00010184_0012 N00010184_0013 N00010184_0014 N00010184_0015 N00010184_0016 Are they in SAM ? First, what is in reco_near ? Only two of these files ! for RUN in ${RUNS} ; do for RUN in N00010184_0016 ; do # one bad run for RUN in N00010218_0020 ; do # one good run printf "${RUN}\n" ; MON=`sam locate ${RUN}.mdaq.root | cut -f 5 -d '/' | cut -f 1 -d ,` ls /pnfs/minos/reco_near/cedar/*/${MON}/${RUN}* done ... N00010163_0015 /pnfs/minos/reco_near/cedar/cand_data/2006-06/N00010163_0015.cosmic.cand.cedar.0.root /pnfs/minos/reco_near/cedar/sntp_data/2006-06/N00010163_0015.cosmic.sntp.cedar.0.root ... MINOS26 > sam locate N00010163_0015.cosmic.cand.cedar.0.root ['/pnfs/minos/reco_near/cedar/cand_data/2006-06,1614@vob884'] MINOS26 > sam locate N00010163_0015.cosmic.sntp.cedar.0.root ['/pnfs/minos/reco_near/cedar/sntp_data/2006-06,3090@vob657'] for RUN in ${RUNS} ; do for TYP in cosmic spill ; do for STR in cand sntp ; do sam locate ${RUN}.${TYP}.${STR}.cedar.0.root ; done ; done ; done ... Datafile with name 'N00010163_0012.spill.sntp.cedar.0.root' not found. ['/pnfs/minos/reco_near/cedar/cand_data/2006-06,1614@vob884'] ['/pnfs/minos/reco_near/cedar/sntp_data/2006-06,3090@vob657'] Datafile with name 'N00010163_0015.spill.cand.cedar.0.root' not found. ... 
Another check for RUN in ${RUNS} ; do sam list files --dim="file_name like ${RUN}%cedar%.root" ; done ... Files: N00010163_0015.cosmic.cand.cedar.0.root N00010163_0015.cosmic.sntp.cedar.0.root OK, undeclare these two : sam undeclare file N00010163_0015.cosmic.cand.cedar.0.root sam undeclare file N00010163_0015.cosmic.sntp.cedar.0.root ############ # MCIMPORT # ############ kordosky/ n11011519_0006_L010185N_D00.tar.gz is reported corrupt, M26> ls -alF n11011519_0006_L010185N_D00.tar.gz -rw-r--r-- 1 mindata e875 349098958 Feb 5 10:12 n11011519_0006_L010185N_D00.tar.gz M26> md5sum n11011519_0006_L010185N_D00.tar.gz 67de1f44e0c8820ed5ad53975978e834 n11011519_0006_L010185N_D00.tar.gz M26> grep n11011519_0006_L010185N_D00.tar.gz md5/all.md5 5e42eae486d2b49271affc29241e165d n11011519_0006_L010185N_D00.tar.gz mv n11011519_0006_L010185N_D00.tar.gz BAD/ Perhaps we can avoid such problems, and help fragmentation, by first copying to /var/tmp/mindata/MCIN/* ? M26> for DIR in `ls /local/scratch26/mindata` ; do mkdir /var/tmp/mindata/MCIN/${DIR} ; done Time some copies ( 333 MByte file ) M26> FIX=n11011504_0002_L010185N_D00.tar.gz M26> dds $FIX -rw-r--r-- 1 mindata e875 348302809 Feb 5 10:04 n11011504_0002_L010185N_D00.tar.gz M26> time dd if=$FIX of=/var/tmp/mindata/MCIN/FIX 680278+1 records in 680278+1 records out real 2m6.155s user 0m0.370s sys 0m7.540s ( retry later, vault copies are running now, same disks ) Try some tests on minos-sam03 S03> FIX=n11011503_0002_L010185N_D00.tar.gz S03> time scp -c blowfish mindata@minos26:/local/scratch26/mindata/kordosky/${FIX} FIX real 1m37.945s user 0m0.040s sys 0m2.210s >S03 md5sum FIX 32c8146409dff0f5318c63f3b4e05810 FIX M26> md5sum n11011503_0002_L010185N_D00.tar.gz 32c8146409dff0f5318c63f3b4e05810 n11011503_0002_L010185N_D00.tar.gz S03> ls -l FIX -rw-r--r-- 1 samread 5024 343676718 Feb 5 15:35 FIX S03> time gunzip -t FIX real 0m9.682s user 0m9.490s sys 0m0.190s S03> time cp FIX /var/tmp/FIX real 0m3.146s user 0m0.030s sys 0m1.950s S03> time dd if=FIX of=/var/tmp/FIX 671243+1 records in 671243+1 records out real 0m11.299s user 0m0.410s sys 0m6.840s S03> time dd if=FIX of=/var/tmp/FIX bs=1M 327+1 records in 327+1 records out real 0m4.172s user 0m0.000s sys 0m2.100s Move a big file to /var/tmp, to flush memory. SS3 > time dd if=DCS_HV.MYD.gz of=/var/tmp/DCS bs=1M 3188+1 records in 3188+1 records out real 1m51.970s user 0m0.010s sys 0m26.070s S03> time dd if=FIX of=/var/tmp/FIX bs=1M real 0m9.743s user 0m0.000s sys 0m3.040s ########## # BREBEL # ########## Jan 31 request to backup minos22:/local/scratch22/brebel/R1.14 I suggest to /pnfs/minos/users/brebel/R1.14/* ntupleSt - 149 GB monthly directories MINOS22 > for DIR in `ls ntupleSt` ; do du -sh ntupleSt/${DIR} ; done 14G ntupleSt/2005-05 13G ntupleSt/2005-06 14G ntupleSt/2005-07 14G ntupleSt/2005-08 14G ntupleSt/2005-09 13G ntupleSt/2005-10 13G ntupleSt/2005-11 13G ntupleSt/2005-12 13G ntupleSt/2006-01 11G ntupleSt/2006-02 11G ntupleSt/2006-03 9.6G ntupleSt/2006-04 ntupleStUp_v3.5 - 4.9 GB Simplest solution : Make tarfiles of the whole directories. Make a listing, safe in afs. Very little free space... so tar to /tmp one at a time, record ecrc, then dccp to write pool. Can use tar -d to check content Let's try one : cd /local/scratch22/brebel/R1.14 DIR=ntupleStUp_v3.5 MINOS22 > time tar cf /tmp/br/${DIR}.tar -C ${DIR} . real 3m21.508s user 0m0.500s sys 0m38.390s MINOS22 > time tar df /tmp/br/${DIR}.tar -C ${DIR} . 
real 2m32.246s user 0m9.250s sys 0m23.880s MINOS22 > du -sm /tmp/br/${DIR}.tar 5008 /tmp/br/ntupleStUp_v3.5.tar MINOS22 > ls -l /tmp/br/${DIR}.tar -rw-r--r-- 1 kreymer 1525 5245091840 Feb 5 18:15 /tmp/br/ntupleStUp_v3.5.tar MINOS26 > mkdir /pnfs/minos/users/brebel MINOS26 > chmod 775 /pnfs/minos/users/brebel MINOS26 > cd /pnfs/minos/users/brebel MINOS26 > enstore pnfs --file_family minos_users_brebel MINOS26 > mkdir R1.14 MINOS22 > DCPOR=24736 MINOS22 > DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/users/brebel/${DIR}.tar MINOS22 > setup dcap MINOS22 > setup encp MINOS22 > time ecrc /tmp/br/${DIR}.tar CRC 503464758 real 1m51.515s user 0m15.480s sys 0m14.350s MINOS22 > time dccp /tmp/br/${DIR}.tar ${DFILE} real 5m45.927s user 0m0.000s sys 0m0.010s < interrupted > < readjusted path from minos/brebel to minos/users/brebel > < readusted family from minos_brebel to minos_users_brebel > MINOS22 > time dccp /tmp/br/${DIR}.tar ${DFILE} < observed 22 MB/sec on ganglia during copy > 5245091840 bytes in 235 seconds (21796.43 KB/sec) real 3m57.383s user 0m15.900s sys 0m20.340s As expected , ls shows a file size of 1 in PNFS. MINOS26 > cat '.(use)(2)(ntupleStUp_v3.5.tar)' 2,0,0,0.0,0.0 :h=yes;c=1:308e4337;l=5245091840; w-stkendca9a-2 Need to wait for enstore to get ECRC. MINOS26 > ./dc_stat /pnfs/minos/users/brebel/R1.14/ntupleStUp_v3.5.tar PNFS status for /pnfs/minos/users/brebel/R1.14/ntupleStUp_v3.5.tar -rw-r--r-- 1 kreymer e875 1 Feb 5 23:01 ntupleStUp_v3.5.tar LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:308e4337;l=5245091840; w-stkendca9a-2 LEVEL 4 VOC416 0000_000000000_0000001 5245091840 minos_users_brebel /pnfs/fnal.gov/usr/minos/users/brebel/R1.14/ntupleStUp_v3.5.tar 000F00000000000004E7CA50 CDMS117073809300000 stkenmvr35a:/dev/rmt/tps2d0n:479000010076 503464758 Size and ecrc match, looks good ! MINOS22 > rm /tmp/br/${DIR}.tar ============================================================================= 2007 02 04 ############ # MCIMPORT # ############ kordosky pass took over 6 hours howcroft input idle at present. min free disk was srmcp was trying, no output from 07:00 to 07:05 cron pid detection is working ( 06:37 ) Maybe should go back to tarring to /var/tmp, with a copy back to scratch ? Might reduce fragmentation, and be faster overall ? $ time dd if=STAGE/kordosky/tar/n11011014_0001_L150200N_D00-n11011014_0009_L150200N_D00.tar \ of=/var/tmp/mindata/TMP/n11011014_0001_L150200N_D00-n11011014_0009_L150200N_D00.tar ( spot checked at about 1.2 MBytes/sec ) ( cancelled at 836 MBytes ) 1709544+0 records in 1709544+0 records out real 9m37.147s user 0m1.230s sys 0m19.330s Now copy a 330 MB file from /var/tmp to /tmp ( same disk ) $ time dd if=/var/tmp/mindata/TMP/n11011001_0000_L010185N_D00.tar.gz of=/tmp/mindata/TEST.dat 672549+1 records in 672549+1 records out real 0m34.817s user 0m0.560s sys 0m8.950s ( speed is about 10 MBytes/second ) ============================================================================= 2007 02 03 ############ # MCIMPORT # ############ mcimport.20070203 added missing writing and clearing of CRON/mcimport.pid miserable rates, .3 MB/sec for kordosky when 6 kordosky and 1 howcroft scp's kordosky rate is .6 MB/sec no output at all for srmcp. later, round 10:00, single kordosky scp runs at .3 mB/sec, howcroft at .6 later, round 10:23, no kordosky scp's at all, howcroft at .6 MB/sec srmcp's vary from 3 to 20 MB/sec, with up to 2-5 minutes idle between files Killed cronjob, restart with this new mcimport this afternoon, after the present run finished up. 
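The pid file writing and clearing keeps needing fixes, so here is a minimal sketch of the interlock a cron wrapper needs; this is not the mcimport code, and the path is illustrative.

#!/bin/sh
# Sketch of a cron-safe pid interlock for a long-running import pass.
PIDFILE=/local/scratch26/mindata/CRON/mcimport.pid

if [ -s "${PIDFILE}" ] && kill -0 `cat ${PIDFILE}` 2>/dev/null ; then
    echo "OOPS - `cat ${PIDFILE}` still running, quitting"
    exit 1
fi
echo $$ > ${PIDFILE}                 # take the lock
trap "rm -f ${PIDFILE}" EXIT         # clear it even on error or signal

# ... the real import pass would run here ...

A mkdir-based lock ( mkdir fails if the directory already exists ) would close the small race window between the check and the write.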
============================================================================= 2007 02 02 ############ # MCIMPORT # ############ crontab crontab.dat around 09:00 mcimport.20070201 - commented out MAILTO, FREETIME test lines Shifted crontab run times to 37 3,9,15,21 * * * ${HOME}/mcimport -c ALL 17:45 - hacked crontab to allow catchup, 37 1,6,12,20 * * * ${HOME}/mcimport -c ALL M26 > echo 'crontab crontab.dat' | at 02:30 Going very slowly through kordosy files, under 1 MB/sec ( 3 x 500MB per tar ) Rehacked round midnight to 37 2,7,12,18 * * * ${HOME}/mcimport -c ALL ####### # SAM # ####### kordosky reports samwebservices problem with samTranslateDimensions \ --dim="run_type physics% and file_name like N000%.spill.sntp.R1_18_4.0.root and start_time <= to_date('2006-09-30','yyyy-mm-dd') and end_time >= to_date('2006-09-24','yyyy-mm-dd')" \ --wsdl="http://www-numi.fnal.gov/sam_web_services/wsdl/DimensionsService.wsdl.xml" Usage: samTranslateDimensions --dim= [--verbose] or samTranslateDimensions --query= [--verbose] ... Local query is OK, MINOS26 > sam list files --dim="run_type physics% and file_name like N000%.spill.sntp.R1_18_4.0.root and start_time <= to_date('2006-09-30','yyyy-mm-dd') and end_time >= to_date('2006-09-24','yyyy-mm-dd')" Files: N00010861_0020.spill.sntp.R1_18_4.0.root ... File Count: 116 Average File Size: 39.62MB Total File Size: 4.49GB Total Event Count: 11759539 Testing sws in my clean PRODUCTS window, MINOS26 > setup sam_web_services_client ERROR: Found no match for product 'python' ERROR: Action parsing failed on "unsetuprequired(python v2_2_3_sam)" WARNING: Unsetup of sam_web_services_client failed, continuing with setup ERROR: Found no match for product 'python' ERROR: Found no match for product 'python' MINOS26 > setup python v2_4_sam MINOS26 > samTranslateDimensions \ --wsdl="http://www-numi.fnal.gov/sam_web_services/wsdl/DimensionsService.wsdl.xml" \ --dim="file_name like N00008698_000%.cosmic.sntp.R1_18.0.root" Dimension string: file_name like N00008698_000%.cosmic.sntp.R1_18.0.root Dataset file list: : ['N00008698_0000.cosmic.sntp.R1_18.0.root', 'N00008698_0001.cosmic.sntp.R1_18.0.root', 'N00008698_0002.cosmic.sntp.R1_18.0.root', 'N00008698_0003.cosmic.sntp.R1_18.0.root', 'N00008698_0004.cosmic.sntp.R1_18.0.root', 'N00008698_0005.cosmic.sntp.R1_18.0.root', 'N00008698_0006.cosmic.sntp.R1_18.0.root', 'N00008698_0007.cosmic.sntp.R1_18.0.root', 'N00008698_0008.cosmic.sntp.R1_18.0.root', 'N00008698_0009.cosmic.sntp.R1_18.0.root'] Dataset size: 768939259.0 bytes So the generic test query works Nick's query works for me : MINOS26 > samTranslateDimensions --wsdl="http://www-numi.fnal.gov/sam_web_services/wsdl/DimensionsService.wsdl.xml" --dim="run_type physics% and file_name like N000%.spill.sntp.R1_18_4.0.root and start_time <= to_date('2006-09-30','yyyy-mm-dd') and end_time >= to_date('2006-09-24','yyyy-mm-dd')" Dataset file list: :['N00010861_0020.spill.sntp.R1_18_4.0.root', ... , 'N00010893_0012.spill.sntp.R1_18_4.0.root'] I suggested trying to ping www-numi.fnal.gov and minos-sam03.fnal.gov, and samLocate --file=foo Scanning logs on minos-sam03, note that trace is itegrated, and that there are daily logs, both filled with minute by minute messages about Checking on opened file streams. MINOS-SAM03 > grep -v 'Checking on' wsLog__02_02_07 NB per Liz, they are running a hacked client, which has a built-in wsdl, you cannot set --wsdl on command line. HOWTO.samwebservices - updated to reflect cleaner usage, more functions and to note minos-sam03 server. 
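The checks suggested to kordosky, gathered in one place as a sketch; the host names are the ones above, and the samLocate file name is just an example of a file known to SAM.

# Is it the network, the web server, or the dimension parser ?
for HOST in www-numi.fnal.gov minos-sam03.fnal.gov ; do
    ping -c 3 ${HOST} > /dev/null 2>&1 && echo "${HOST} ok" || echo "${HOST} UNREACHABLE"
done
# A trivial web services call, independent of dimension syntax :
samLocate --file=F00037343_0003.mdaq.root
# Server side, skip the minute-by-minute stream checks when reading the log :
grep -v 'Checking on' wsLog__02_02_07 | tail -20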
Added nwest and kordosky to samread .k5login on minos-sam03, created WEBLOGS link to make logs easy to find. ########### # ROUNDUP # ########### touched up around 17:00 ########## # DCACHE # ########## Reviewing file families for DCache pool planning MINOS26 > cd ../reco_far MINOS26 > for DIR in `ls` ; do printf "${DIR} " ; ( cd ${DIR}/sntp_data ; enstore pnfs --tags | grep 'ily) ' ) ; done R1.11 .(tag)(file_family) = sntp_data_R1_11 R1.12 .(tag)(file_family) = sntp_data_R1_12 R1.14 .(tag)(file_family) = sntp_data_R1_14 R1.14_201 -bash: cd: R1.14_201/sntp_data: No such file or directory .(tag)(file_family) = minos R1.16 .(tag)(file_family) = reco_far_R1_16 R1.16a .(tag)(file_family) = sntp_near_R1_16a R1_17 .(tag)(file_family) = reco_far_R1_17 R1_17a.0 .(tag)(file_family) = reco_far_R1_17 R1_18 .(tag)(file_family) = reco_far_R1_18 R1_18_2 .(tag)(file_family) = reco_far_R1_18_2 R1_18_2_temp .(tag)(file_family) = minos R1_18_2a .(tag)(file_family) = reco_far_R1_18_2a R1_18_4 .(tag)(file_family) = reco_far_R1_18_4 R1_21 .(tag)(file_family) = reco_far_R1_21 R1_23 .(tag)(file_family) = reco_far_R1_23 R1_23a .(tag)(file_family) = reco_far_R1_23a R1_24 .(tag)(file_family) = reco_far_R1_24 R1_24a .(tag)(file_family) = reco_far_R1_24a R1_24b .(tag)(file_family) = reco_far_R1_24b R1_24c .(tag)(file_family) = reco_far_R1_24c S06-05-25-R1-22 .(tag)(file_family) = reco_far_S06-05-25-R1-22 S06-06-22-R1-22 .(tag)(file_family) = reco_far_S06-06-22-R1-22 cedar .(tag)(file_family) = reco_far_cedar_sntp MINOS26 > for DIR in `ls` ; do printf "${DIR} " ; ( cd ${DIR}/.bntp_data ; enstore pnfs --tags | grep 'ily) ' ) ; done R1_18 .(tag)(file_family) = reco_far_R1_18 R1_18_2 .(tag)(file_family) = reco_far_R1_18_2 R1_18_2_temp .(tag)(file_family) = minos R1_18_2a .(tag)(file_family) = reco_far_R1_18_2a R1_18_4 .(tag)(file_family) = reco_far_R1_18_4 R1_21 -bash: cd: R1_21/.bntp_data: No such file or directory .(tag)(file_family) = minos R1_23 .(tag)(file_family) = reco_far_R1_23 R1_23a .(tag)(file_family) = reco_far_R1_23a R1_24 .(tag)(file_family) = reco_far_R1_24 R1_24a .(tag)(file_family) = reco_far_R1_24a R1_24b .(tag)(file_family) = reco_far_R1_24b R1_24c .(tag)(file_family) = reco_far_R1_24c S06-05-25-R1-22 .(tag)(file_family) = reco_far_S06-05-25-R1-22 S06-06-22-R1-22 .(tag)(file_family) = reco_far_S06-06-22-R1-22 cedar .(tag)(file_family) = reco_far_cedar_bntp Hmmmm, only 3 very old releases, and cedar, have sntp or bntp tags. Sent email back to kennedy, minos_data, dcache-admin with outline. ============================================================================= 2007 02 01 ####### # DAQ # ####### file-exist errors started again from fd data logging, since 2007-01-31 18:41:12 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/fardet_data/2007-02/F00037343_0003.mdaq.root daqdcp.minos-soudan.org 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/fardet_data/2007-02/F00037343_0003.mdaq.root: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) cleanly written at 18:25:18 ( 00:25 UTC Feb 1 ) previous clean subrun was 0002, at 17:29:37 ( 23:29 UTC ) in /pnfs/minos/fardet_data/2007-01/ continues through current subrun, 2007-02/F00037343_0018.mdaq.root at 10:16:58 cleanly written at 09:27:10 Here are the timestamps of existing files in cache : MINOS26 > dds /pnfs/minos/fardet_data/2007-01/ ... 
-rw-r--r-- 1 buckley e875 16956401 Jan 31 11:17 F00037330_0017.mdaq.root -rw-r--r-- 1 buckley e875 30157757 Jan 31 11:17 F00037330_0018.mdaq.root -rw-r--r-- 1 buckley e875 44126200 Jan 31 11:17 F00037330_0019.mdaq.root -rw-r--r-- 1 buckley e875 17007894 Jan 31 11:20 F00037330_0020.mdaq.root -rw-r--r-- 1 buckley e875 30224860 Jan 31 12:20 F00037330_0021.mdaq.root -rw-r--r-- 1 buckley e875 41813754 Jan 31 13:11 F00037330_0022.mdaq.root -rw-r--r-- 1 buckley e875 18219275 Jan 31 13:27 F00037331_0000.mdaq.root -rw-r--r-- 1 buckley e875 6912098 Jan 31 13:37 F00037332_0000.mdaq.root -rw-r--r-- 1 buckley e875 6912366 Jan 31 13:48 F00037333_0000.mdaq.root -rw-r--r-- 1 buckley e875 17815950 Jan 31 13:58 F00037334_0000.mdaq.root -rw-r--r-- 1 buckley e875 956721 Jan 31 14:08 F00037335_0000.mdaq.root -rw-r--r-- 1 buckley e875 18184587 Jan 31 14:24 F00037336_0000.mdaq.root -rw-r--r-- 1 buckley e875 6906838 Jan 31 14:34 F00037337_0000.mdaq.root -rw-r--r-- 1 buckley e875 18271837 Jan 31 14:45 F00037338_0000.mdaq.root -rw-r--r-- 1 buckley e875 6910722 Jan 31 14:55 F00037339_0000.mdaq.root -rw-r--r-- 1 buckley e875 18242005 Jan 31 15:06 F00037340_0000.mdaq.root -rw-r--r-- 1 buckley e875 6909233 Jan 31 15:21 F00037341_0000.mdaq.root -rw-r--r-- 1 buckley e875 18240463 Jan 31 15:32 F00037342_0000.mdaq.root -rw-r--r-- 1 buckley e875 37010458 Jan 31 15:47 F00037343_0000.mdaq.root -rw-r--r-- 1 buckley e875 37442321 Jan 31 16:28 F00037343_0001.mdaq.root -rw-r--r-- 1 buckley e875 17120975 Jan 31 17:29 F00037343_0002.mdaq.root MINOS26 > ls -alF /pnfs/minos/fardet_data/2007-02/ total 518959 drwxrwxr-x 1 kreymer e875 512 Feb 1 10:28 ./ drwxrwxr-x 1 buckley e875 512 Dec 14 11:50 ../ -rw-r--r-- 1 buckley e875 36935441 Jan 31 18:25 F00037343_0003.mdaq.root -rw-r--r-- 1 buckley e875 37282857 Jan 31 19:26 F00037343_0004.mdaq.root -rw-r--r-- 1 buckley e875 17129575 Jan 31 20:28 F00037343_0005.mdaq.root -rw-r--r-- 1 buckley e875 37047418 Jan 31 21:25 F00037343_0006.mdaq.root -rw-r--r-- 1 buckley e875 37405153 Jan 31 22:28 F00037343_0007.mdaq.root -rw-r--r-- 1 buckley e875 16996342 Jan 31 23:26 F00037343_0008.mdaq.root -rw-r--r-- 1 buckley e875 36888338 Feb 1 00:30 F00037343_0009.mdaq.root -rw-r--r-- 1 buckley e875 37113715 Feb 1 01:31 F00037343_0010.mdaq.root -rw-r--r-- 1 buckley e875 17062111 Feb 1 02:30 F00037343_0011.mdaq.root -rw-r--r-- 1 buckley e875 36863240 Feb 1 03:29 F00037343_0012.mdaq.root -rw-r--r-- 1 buckley e875 37372085 Feb 1 07:53 F00037343_0013.mdaq.root -rw-r--r-- 1 buckley e875 16979392 Feb 1 08:04 F00037343_0014.mdaq.root -rw-r--r-- 1 buckley e875 37208848 Feb 1 08:14 F00037343_0015.mdaq.root -rw-r--r-- 1 buckley e875 37341783 Feb 1 08:25 F00037343_0016.mdaq.root -rw-r--r-- 1 buckley e875 17165763 Feb 1 08:42 F00037343_0017.mdaq.root -rw-r--r-- 1 buckley e875 37324693 Feb 1 09:27 F00037343_0018.mdaq.root -rw-r--r-- 1 buckley e875 37300769 Feb 1 10:28 F00037343_0019.mdaq.root Note that the directory is based on UTC. So this problem is correlated with the directory we're writing to. Files were written to Enstore around 11:28 , based on times from MINOS26 > ls -alF /pnfs/minos/fardet_data/2007-02/ N.B. ftp client is returning SIZE=None, and failure code. As with kennedy, kftp shows size OK MINOS26 > ../bin/rlwrap ftp fndca1.fnal.gov 24127 Connected to stkendca2a.fnal.gov. 220 Kerberos FTP Door ready 334 ADAT must follow GSSAPI accepted as authentication type GSSAPI authentication succeeded Name (fndca1.fnal.gov:kreymer): 200 User kreymer logged in Remote system type is UNIX. 
Using binary mode to transfer files. ftp> cd fardet_data/2007-02 250 CWD command succcessful. New CWD is ftp> size F00037343_0003.mdaq.root 213 36935441 Per kennedy, restarted archiver around 13:30, no further problems. Probably due to data having been written to tape. This may have happened in previous months. Experts will investigate. ############ # MCIMPORT # ############ DUPLICATES ? M26> cat kordosky/index/*.index > /var/tmp/mindata/TMP/kordosky.index M26> cat /var/tmp/mindata/TMP/kordosky.index | wc -l 4212 M26> cat /var/tmp/mindata/TMP/kordosky.index | sort -u | wc -l 4212 M26> cat howcroft/index/*.index > /var/tmp/mindata/TMP/howcroft.index M26> cat /var/tmp/mindata/TMP/howcroft.index | wc -l 4394 M26> cat /var/tmp/mindata/TMP/howcroft.index | sort -u | wc -l 4326 M26> sort /var/tmp/mindata/TMP/howcroft.index > /tmp/ksor M26> sort -u /var/tmp/mindata/TMP/howcroft.index > /tmp/ksou M26> sdiff -s /tmp/ksor /tmp/ksou > /tmp/ksod M26> nedit /tmp/ksod M26> cat /tmp/ksod | wc -l 68 M26> for FIL in `cat /tmp/ksod` ; do grep ${FIL} howcroft/index/*.index ; done | wc -l 136 So, each duplicate exists in two index files. M26> for FIL in `cat /tmp/ksod` ; do grep ${FIL} howcroft/index/*.index ; done | cut -f 1 -d ':' | sort -u | wc -l 21 So, 21 tarfiles contribute to this problem n11011027_0000_L010185N_D00-n11011031_0000_L010185N_D00.tar n11011028_0000_L010185N_D00-n11011033_0000_L010185N_D00.tar n11011032_0000_L010185N_D00-n11011036_0000_L010185N_D00.tar n11011172_0002_L010185N_D00-n11011172_0006_L010185N_D00.tar n11011172_0002_L010185N_D00-n11011172_0010_L010185N_D00.tar n11011217_0000_L010185N_D00-n11011221_0000_L010185N_D00.tar n11011218_0004_L010185N_D00-n11011219_0001_L010185N_D00.tar n11011219_0007_L010185N_D00-n11011220_0000_L010185N_D00.tar n11011221_0000_L010185N_D00-n11011221_0004_L010185N_D00.tar n11011221_0010_L010185N_D00-n11011222_0003_L010185N_D00.tar n11011222_0000_L010185N_D00-n11011226_0000_L010185N_D00.tar n11011222_0009_L010185N_D00-n11011223_0002_L010185N_D00.tar n11011223_0008_L010185N_D00-n11011224_0001_L010185N_D00.tar n11011224_0007_L010185N_D00-n11011225_0000_L010185N_D00.tar n11011226_0000_L010185N_D00-n11011226_0004_L010185N_D00.tar n11011227_0000_L010185N_D00-n11011227_0004_L010185N_D00.tar n11011227_0000_L010185N_D00-n11011231_0000_L010185N_D00.tar n11011227_0010_L010185N_D00-n11011228_0003_L010185N_D00.tar n11011228_0009_L010185N_D00-n11011229_0002_L010185N_D00.tar n12011005_0010_L010185N_D00-n12011209_0001_L010185N_D00.tar n12011005_0010_L010185N_D00-n12011222_0003_L010185N_D00.tar crontab.dat updated to contain 37 3,9,15,21 * * * ${HOME}/mcimport -c ALL Hold off, start running this tomorrow. 
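The duplicate hunt above could be wrapped up for reuse; a sketch only, not part of mcimport, with the user area name as the single argument.

# Sketch : report file names that appear in more than one index file for a user area.
dupscan () {
    AREA=$1                                        # e.g. howcroft or kordosky
    cat ${AREA}/index/*.index | sort | uniq -d > /tmp/${AREA}.dup
    echo "`wc -l < /tmp/${AREA}.dup` duplicated names in ${AREA}"
    # and the index files ( hence tarfiles ) that contain them
    for FIL in `cat /tmp/${AREA}.dup` ; do
        grep -l ${FIL} ${AREA}/index/*.index
    done | sort -u
}
# dupscan howcroft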
Run manually this afternoon and evening, mcimport.20070201 Moved print of VERSION to MAIN, enabled full time TRIGTIME, TRIGSIZE trigger concatenation in generic running NOIMPORT disables running Added ALL users, using MCIMPORT to control activity Added pid check outside user loop for ALL and CRON MINOS26 > ln -sf mcimport.20070201 mcimport # was mcimport.20070130 M26 > cp /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mcimport.20070201 mcimport 16:32 ./mcimport ALL OOPS, needed to hack this to set CRON when directory ALL is specified, so that mcimport runs serially 16:40 ./mcimport ALL ============================================================================= 2007 01 31 ############ # MCIMPORT # ############ $ du -sm * 69799 howcroft 20365 kordosky HOWCROFT CLEANUP Moved badmd5.log to maint/howbad/ $ BADT=`cat ~/maint/howbad/badmd5.log | cut -f 1 -d ':' | cut -f 1 -d . | sort -u` 12 tarfiles, not so bad. Per rhatcher discussion, will modify the index files, leaving the tars alone. $ for BAD in ${BADT} ; do echo ${BAD} ; cp -a index/${BAD}.index ~/maint/howbad/ ; done $ for BAD in ${BADT} ; do for FIL in ${BADF} ] ; do grep ${FIL} index/${BAD}.index ; done ; nedit index/${BAD}.index ; done $ for BAD in ${BADT} ; do echo $BAD ; for FIL in ${BADF} ] ; do grep ${FIL} index/${BAD}.index ; done ; nedit index/${BAD}.index ; done Logged and annotated to maint/howbad/fix.log Sent this email to minos-data, arms, howcroft, kordosky, about 12:10 There are 34 mangled files, residing in 12 tarfiles. Per rhatcher's suggestion, I have edited the howcroft/*.index files for those tarfiles to remove the mangled file names. This makes the mangled files invisible, nearly as good as rebuilding the tars, and Robert can proceed. Notes on this are under mindata/maint/howbad/ Enjoy ! Note that 4 of the tarfiles are now moot, so have deleted them from /pnfs/minos/stage/howcroft n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar n11011007_0000_L010185N_D00-n11011011_0000_L010185N_D00.tar n11011012_0000_L010185N_D00-n11011016_0000_L010185N_D00.tar n11011017_0000_L010185N_D00-n11011021_0000_L010185N_D00.tar This is to avoid conflicts on re-import.
Just in time, as these show up in current mcimport, from howcroft/log/mcimport.log n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar 5 n11011007_0000_L010185N_D00-n11011011_0000_L010185N_D00.tar 5 n11011012_0000_L010185N_D00-n11011016_0000_L010185N_D00.tar 5 n11011017_0000_L010185N_D00-n11011021_0000_L010185N_D00.tar 5 MINOS26 > cd /pnfs/minos/stage/howcroft MINOS26 > for FILE in $FILES ; do ls -l ${FILE} ; done -rw-r--r-- 1 kreymer e875 1740851200 Jan 25 03:16 n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar -rw-r--r-- 1 kreymer e875 1757184000 Jan 25 03:18 n11011007_0000_L010185N_D00-n11011011_0000_L010185N_D00.tar -rw-rw-r-- 1 kreymer e875 1744199680 Jan 25 03:21 n11011012_0000_L010185N_D00-n11011016_0000_L010185N_D00.tar -rw-rw-r-- 1 kreymer e875 1764659200 Jan 25 03:25 n11011017_0000_L010185N_D00-n11011021_0000_L010185N_D00.tar MINOS26 > for FILE in $FILES ; do rm ${FILE} ; done ########### # ROUNDUP # ########### touched up around 08:30 ============================================================================= 2007 01 30 ############ # MCIMPORT # ############ Relaunched mcimport on kordosky with hack, select last not first all.md5 match, there may be duplicate entries as for n11011006_0005_L010000N_D00.tar.gz mcimport.20070127 -n - do continue to set pid, but do not do ecrc This is too much change since 1/27, rename it to 1/30 MINOS26 > mv mcimport.20070127 mcimport.20070130 $ ln -sf /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mcimport.20070130 mci MINOS26 > ln -sf mcimport.20070130 mcimport # was mcimport.20070126 Found mcimport local copy pointing to /afs/.../mcimport, OOPS, corrected this back to pure local copy around 08:51, hope this did not pull run from under running mcimport kordosky ( looks OK ) Update notes from mcimport.20070130 Added test for valid .gz file, gunzip -t this required tar -r , 1 file at a time Added test for free disk space in TAR Added test for existence of INPAT directory Take final all.md5sum match, not first, to handle duplicates Added sort of file ALLFILES Added rate report for TAR Added PURGE ahead of TAR Added log message for PURGE files not in PNFS Added MINAGE variable to set minimum file age, changed from 10 to 30 Changed CLASS variable name to CONFIG Do not do ecrc in PURGE when NOOP is set For next version : + Added ALL users, using MCIMPORT to control activity + NOIMPORT, TRIGTIME, TRIGSIZE trigger concatenation in generic running BAD MD5 from howcroft, 34 files FILES=`cat ~kreymer/minos/maint/badmd5.txt cd STAGE/howcroft/index/ for FILE in $FILES ; do grep $FILE *.index ; done | wc -l 36 for FILE in $FILES ; do grep $FILE *.index ; done > ../log/badmd5.log Duplicates are n11011038_0005_L010185N_D00-n12011010_0002_L010185N_D00.index:n11011038_0007_L010185N_D00.tar.gz n11011038_0007_L010185N_D00-n11011039_0002_L010185N_D00.index:n11011038_0007_L010185N_D00.tar.gz n11011137_0010_L010185N_D00-n11011138_0001_L010185N_D00.index:n11011138_0001_L010185N_D00.tar.gz n11011138_0001_L010185N_D00-n11011138_0005_L010185N_D00.index:n11011138_0001_L010185N_D00.tar.gz To remove them : rebuild the tarfiles ? use --remove_files ? no , just rebuild. And remove these entries from the index files. ########### # ROUNDUP # ########### bntp files have been missing this year, recovery being debated in minos_batch I'd prefer leaving existing files alone and rerunning to produce just bntp. Howie is proceeding with this plan ( 1b ) this afternoon. 
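The per-file checks that went into mcimport.20070130 boil down to a few lines; this sketch is not the script itself, and the md5 directory layout is the one used above.

# Sketch : validate one incoming tarball before it is added to a tar with tar -r.
checkfile () {
    FILE=$1                              # e.g. n11011519_0006_L010185N_D00.tar.gz
    gunzip -t ${FILE} || { echo "OOPS - ${FILE} not a valid gzip file" ; return 1 ; }
    # take the LAST all.md5 entry, since re-imported files appear twice
    WANT=`grep ${FILE} md5/all.md5 | tail -1 | cut -f 1 -d ' '`
    HAVE=`md5sum ${FILE} | cut -f 1 -d ' '`
    [ "${HAVE}" = "${WANT}" ] || { echo "OOPS - md5 mismatch for ${FILE}" ; return 1 ; }
}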
============================================================================= 2007 01 29 ############ # PREDATOR # ############ Note that far_dcs_data finally showed up Sat 2007 01 27 SRV1> dfarm usage rubin Used: 63759 + Reserved: 0 / Quota: 1000000 (MB) SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far SRV1> dfarm usage rubin ############ # MCIMPORT # ############ Discussed things at 11:30 non-meeting ( off week ) arms kreymer kordosky howcroft rhatcher Howcroft access errors ( no access, permission denied ) have probably been due to local removal of his /tmp/* ticket He will modify copy scripts to detect ( klist -s ) and correct. People have not been using -c blowfish with scp, will do so. rhatcher found zero-file n11011001_0000_L010185N_D00.tar.gz [howcroft@positron02 L010185_near_1001_0]$ du n11011001_0000_L010185N_D00.tar.gz 336612 n11011001_0000_L010185N_D00.tar.gz [howcroft@positron02 L010185_near_1001_0]$ md5sum n11011001_0000_L010185N_D00.tar.gz e8ba468e14a44870337470722fb98111 n11011001_0000_L010185N_D00.tar.gz Locally, $ grep n11011001_0000_L010185N_D00 *.index n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.index:n11011001_0000_L010185N_D00.tar.gz $ setup dcap v2_36_f0506 -q unsecured $ pwd /var/tmp/mindata/TMP $ FIN=n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar $ dccp dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/stage/howcroft/${FIN} . 1740851200 bytes in 41 seconds (41464.63 KB/sec) $ FIL=n11011001_0000_L010185N_D00.tar.gz $ od $FIL 0000000 000000 000000 000000 000000 000000 000000 000000 000000 * 2441445140 000000 000000 000000 2441445145 $ time wc $FIL 0 0 344345189 n11011001_0000_L010185N_D00.tar.gz real 0m3.665s user 0m3.460s sys 0m0.190s $ ls -l $FIL -rw-r--r-- 1 mindata e875 344345189 Jan 24 04:26 n11011001_0000_L010185N_D00.tar.gz $ du -b $FIL 344690688 n11011001_0000_L010185N_D00.tar.gz $ time gunzip -t $FIL gunzip: n11011001_0000_L010185N_D00.tar.gz: not in gzip format real 0m0.002s user 0m0.010s sys 0m0.000s $ echo $? 1 $ for FILE in `ls /local/scratch26/mindata/howcroft/*.tar.gz` ; do wc ${FIL} ; done ... 1077744 6822523 354953788 /local/scratch26/mindata/howcroft/n11011219_0003_L010185N_D00.tar.gz 419746 2654670 138010624 /local/scratch26/mindata/howcroft/n11011219_0004_L010185N_D00.tar.gz 1034217 6558738 341289917 /local/scratch26/mindata/howcroft/n11011219_0010_L010185N_D00.tar.gz 211736 1345673 70098944 /local/scratch26/mindata/howcroft/n11011229_0001_L010185N_D00.tar.gz 31312 197469 10150947 /local/scratch26/mindata/howcroft/n12011209_0002_L010185N_D00.tar.gz SUMMARY :::::: Unlike earlier problem with kordosky files copied when the disk was full, du is not providing a diagnostic for these sparse files. wc -w would seem to give a good robust test. gunzip -t could be even better. 
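Putting that summary to work is a one-liner; a sketch over the howcroft area, using the same gunzip -t test timed below.

# gunzip -t catches both the all-NUL files ( "not in gzip format", fails immediately )
# and genuinely corrupt ones ( costs a full decompress, roughly 10 s per file ) :
for FILE in /local/scratch26/mindata/howcroft/*.tar.gz ; do
    gunzip -t ${FILE} 2>/dev/null || echo "BAD ${FILE}"
done
# wc -w is an alternative of similar cost ; 0 words means nothing but NUL bytes :
# [ `wc -w < ${FILE}` -eq 0 ] && echo "ALL NUL ${FILE}"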
Time these for a valid file : $ dd if=/local/scratch26/mindata/howcroft/n11011219_0003_L010185N_D00.tar.gz of=n11011219_0003_L010185N_D00.tar.gz 693269+1 records in 693269+1 records out $ time wc -w n11011219_0003_L010185N_D00.tar.gz 6822523 n11011219_0003_L010185N_D00.tar.gz real 0m9.069s user 0m6.500s sys 0m0.390s $ time gunzip -t n11011219_0003_L010185N_D00.tar.gz real 0m10.207s user 0m10.040s sys 0m0.150s ########## # DCACHE # ########## kennedy reported corruption of /pnfs/fnal.gov/usr/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root PNFSid = 000F00000000000004D6B328 in w-stkendca10a-3 cd /export/stage/minfarm/ROUNDUP_TEST/TEST IFILE=N00011648_0015.cosmic.cand.cedar.0.root IPATH=minos/reco_near/cedar/cand_data/2007-01 DCPOR=24136 DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} dccp ${DFILE} . 113837134 bytes in 3 seconds (37056.36 KB/sec) SRV1> ls -l /pnfs/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root -rw-r--r-- 1 rubin numi 113837134 Jan 29 10:51 /pnfs/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root MINOS26 > ./dc_stat /pnfs/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root ============================ PNFS status for /pnfs/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root -rw-r--r-- 1 1334 e875 113837134 Jan 29 10:51 N00011648_0015.cosmic.cand.cedar.0.root LEVEL 2 2,0,0,0.0,0.0 :c=1:7d20ff59;h=yes;l=113837134; w-stkendca10a-3 r-stkendca14a-3 LEVEL 4 ============================ I have removed the file, per kennedy request. about 17:40 rm /pnfs/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root ============================================================================= 2007 01 27 ############## # MCTARCHECK # ############## 00:25 MINOS13 > cd /local/scratch13/kreymer MINOS13 > ~/minos/scripts/mctarcheck howcroft > howcroft.log 2>&1 & OOPS, accidentally ran this with kordosky briefly, lost kordosky.log Copied kordosky files to afs, cp -ax kordosky \ /afs/fnal.gov/files/data/minos/log_data/mcimport/kordosky/mccheck howcroft finished around 15:29. cp -ax howcroft \ /afs/fnal.gov/files/data/minos/log_data/mcimport/howcroft/mccheck ============================================================================= 2007 01 26 ########## # DCACHE # ########## SRM doors down, expired host tickets, helpdesk ticket by rubin 91593 Some certs updated by Berg at 8 PM last night, not sufficient. 11:10 srm servers restarted by kennedy, cleared caches, OK now To run srmls on minos26, need to $ cd /local/scratch26/kreymer/SRM $ srmclient/bin/srmls ${SPATH2} ############ # MCIMPORT # ############ mcimport.20070126 - group by configuation (all but run/subrun) Changed the config of some test files, in all three relevant fields cd /local/scratch26/mindata/kreymer cp -a TMP/* . 
mv n12011054_0003_L250200N_D00.tar.gz n12021054_0003_L250200N_D00.tar.gz mv n12011054_0004_L250200N_D00.tar.gz n12021054_0004_L250200N_D00.tar.gz mv n12011054_0005_L250200N_D00.tar.gz n12011054_0005_L100200N_D00.tar.gz mv n12011054_0006_L250200N_D00.tar.gz n12011054_0006_L100200N_D00.tar.gz mv n12011054_0007_L250200N_D00.tar.gz n12011054_0007_L250200N_D01.tar.gz mv n12011054_0008_L250200N_D00.tar.gz n12011054_0008_L250200N_D01.tar.gz ( made a little pop script to do this for testing ) MINOS26 > ln -sf mcimport.20070126 mcimport # was 20070125 $ cp -a afsmcimport mcimport # was 20070118 At about 14:00, let's get back to work ./mcimport kordosky kordosky scan is nearly done on minos13. While they're on disk, lets touch up MINOS26 > ./stage -T stage/kordosky staging howcroft next ########### # ROUNDUP # ########### SRV1> dfarm usage rubin Used: 56656 + Reserved: 0 / Quota: 1000000 (MB) SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far SRV1> dfarm usage rubin Used: 47838 + Reserved: 0 / Quota: 1000000 (MB) ############ # PREDATOR # ############ Clearing empty .py files since 24 Jan, restarting .sam.py files under 1 KB MINOS26 > for DIR in `ls GDAT/` ; do find GDAT/${DIR}/2007-01 -type f -name \*.sam.py -mtime -3 -size -1 -exec ls -l {} \; ; done -rw-r--r-- 1 kreymer g020 0 Jan 24 05:11 GDAT/beam_data/2007-01/B070123_080001.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:12 GDAT/beam_data/2007-01/B070123_160001.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:12 GDAT/beam_data/2007-01/B070124_000001.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:13 GDAT/far_dcs_data/2007-01/F070101_170032.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:13 GDAT/far_dcs_data/2007-01/F070102_000021.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:13 GDAT/far_dcs_data/2007-01/F070103_000000.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:10 GDAT/fardet_data/2007-01/F00037265_0014.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:11 GDAT/fardet_data/2007-01/F00037265_0015.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 07:10 GDAT/fardet_data/2007-01/F00037265_0016.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 07:11 GDAT/fardet_data/2007-01/F00037265_0017.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 09:10 GDAT/fardet_data/2007-01/F00037265_0018.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 09:10 GDAT/fardet_data/2007-01/F00037265_0019.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:12 GDAT/near_dcs_data/2007-01/N070123_000002.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:07 GDAT/neardet_data/2007-01/N00011615_0012.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:09 GDAT/neardet_data/2007-01/N00011615_0013.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 07:08 GDAT/neardet_data/2007-01/N00011615_0014.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 07:09 GDAT/neardet_data/2007-01/N00011615_0015.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 09:08 GDAT/neardet_data/2007-01/N00011615_0016.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 09:09 GDAT/neardet_data/2007-01/N00011615_0017.sam.py MINOS26 > for DIR in `ls GDAT/` ; do find GDAT/${DIR}/2007-01 -type f -name \*.sam.py -mtime -3 -size -1 -exec rm {} \; ; done ============================================================================= 2007 01 25 ########### # NETWORK # ########### Maintenance scheduled 06:00 to 06:30. Restarted afsd on desktop /etc/init.d/afsd restart did not help ( did this too early, AFS server was unstable.) OK as of 08:00. Lost a couple of windows (minos26, LOG, nedit(LOG) ) AFS has been unstable, seems to be up around 08:00 Web servers have been down, seem to be up now. but not the fndca3a pages. 
08:12 - web server is down again (www-numi.fnal.gov) DAQ Archiver started succeeding about 08:12 ####### # SAM # ####### Nelly installed quarterly patches on minosprd. Ran test projects on minos26, looks good. Had to restart dbserver. ############ # MCIMPORT # ############ Updated mcimport to correctly set and lock the pid file and to use md5sum to much more quickly check tarfile content. A number of kordosky files are empty ( du -sb ), but show a size in dir. Cross check against mike's list in maint/no_space_L010000.txt maint/no_space_L250200.txt for FILES=`cat ~kreymer/minos/maint/no_space_L010000.txt | cut -f 1 -d .` and FILES=`cat ~kreymer/minos/maint/no_space_L250200.txt | cut -f 1 -d .` for FIL in ${FILES} ; do du -sb kordosky/${FIL}.tar.gz ; done for FIL in ${FILES} ; do rm kordosky/${FIL}.tar.gz ; done for FIL in ${FILES} ; do rm kordosky/log/${FIL}.log ; done L010000 files were all 0 bytes length Some L250 files were not, but will delete them all per kordosky $ for FIL in ${FILES} ; do du -sb kordosky/${FIL}.tar.gz ; done 667193344 kordosky/n11011049_0010_L250200N_D00.tar.gz 660463616 kordosky/n11011050_0000_L250200N_D00.tar.gz 191254528 kordosky/n11011050_0010_L250200N_D00.tar.gz 687894528 kordosky/n11011050_0001_L250200N_D00.tar.gz 416866304 kordosky/n11011050_0002_L250200N_D00.tar.gz 614211584 kordosky/n11011050_0003_L250200N_D00.tar.gz 373862400 kordosky/n11011050_0004_L250200N_D00.tar.gz 605798400 kordosky/n11011050_0005_L250200N_D00.tar.gz 691228672 kordosky/n11011050_0006_L250200N_D00.tar.gz 401604608 kordosky/n11011050_0007_L250200N_D00.tar.gz 173281280 kordosky/n11011050_0009_L250200N_D00.tar.gz 0 kordosky/n11011051_0010_L250200N_D00.tar.gz 64204800 kordosky/n11011051_0002_L250200N_D00.tar.gz 169431040 kordosky/n11011051_0004_L250200N_D00.tar.gz 0 kordosky/n11011051_0005_L250200N_D00.tar.gz 0 kordosky/n11011051_0007_L250200N_D00.tar.gz 0 kordosky/n11011051_0009_L250200N_D00.tar.gz 0 kordosky/n11011052_0000_L250200N_D00.tar.gz 43642880 kordosky/n11011052_0010_L250200N_D00.tar.gz 12505088 kordosky/n11011052_0002_L250200N_D00.tar.gz 0 kordosky/n11011052_0003_L250200N_D00.tar.gz 0 kordosky/n11011052_0004_L250200N_D00.tar.gz 191016960 kordosky/n11011052_0005_L250200N_D00.tar.gz 0 kordosky/n11011052_0006_L250200N_D00.tar.gz 0 kordosky/n11011052_0007_L250200N_D00.tar.gz 74678272 kordosky/n11011053_0000_L250200N_D00.tar.gz 560844800 kordosky/n11011053_0010_L250200N_D00.tar.gz 0 kordosky/n11011053_0001_L250200N_D00.tar.gz 127692800 kordosky/n11011053_0002_L250200N_D00.tar.gz 0 kordosky/n11011053_0003_L250200N_D00.tar.gz 0 kordosky/n11011053_0005_L250200N_D00.tar.gz 0 kordosky/n11011053_0006_L250200N_D00.tar.gz 0 kordosky/n11011053_0008_L250200N_D00.tar.gz 353742848 kordosky/n11011054_0000_L250200N_D00.tar.gz 0 kordosky/n11011054_0010_L250200N_D00.tar.gz 0 kordosky/n11011054_0001_L250200N_D00.tar.gz 452763648 kordosky/n11011054_0002_L250200N_D00.tar.gz 0 kordosky/n11011054_0003_L250200N_D00.tar.gz 97021952 kordosky/n11011054_0004_L250200N_D00.tar.gz 0 kordosky/n11011054_0005_L250200N_D00.tar.gz 0 kordosky/n11011054_0006_L250200N_D00.tar.gz 90382336 kordosky/n11011054_0007_L250200N_D00.tar.gz 0 kordosky/n11011054_0008_L250200N_D00.tar.gz 0 kordosky/n11011054_0009_L250200N_D00.tar.gz 0 kordosky/n11011055_0000_L250200N_D00.tar.gz 0 kordosky/n11011055_0010_L250200N_D00.tar.gz 0 kordosky/n11011055_0001_L250200N_D00.tar.gz 190640128 kordosky/n11011055_0002_L250200N_D00.tar.gz 0 kordosky/n11011055_0003_L250200N_D00.tar.gz 0 kordosky/n11011055_0005_L250200N_D00.tar.gz 0 
kordosky/n11011055_0006_L250200N_D00.tar.gz 0 kordosky/n11011055_0007_L250200N_D00.tar.gz 0 kordosky/n11011055_0008_L250200N_D00.tar.gz 0 kordosky/n11011055_0009_L250200N_D00.tar.gz I found and purged one more such empty file not in your list, $ du -sb kordosky/* | sort -nr 0 kordosky/n11011006_0004_L010000N_D00.tar.gz $ ls -l kordosky/n11011006_0004_L010000N_D00.tar.gz -rw-r--r-- 1 mindata e875 0 Jan 24 09:55 kordosky/n11011006_0004_L010000N_D00.tar.gz Now run a more full scale test, for rates $ time cp -a kordosky/n11011008* kreymer/ real 11m52.958s user 0m0.210s sys 0m17.710s That's 2000 mb/799 sec, 220 kordosky/n11011001_0007_L010000N_D00.tar.gz $ time md5sum STAGE/kordosky/n11011001_0007_L010000N_D00.tar.gz 4cdd99fd0a6eb208b03da906b800afbd STAGE/kordosky/n11011001_0007_L010000N_D00.tar.gz real 3m3.608s user 0m0.730s sys 0m0.660s Ugh, that's about 1 MBytes/second... miserable !!! I did have the copy from kordosky to kreymer, plus an scp import running by howcroft, at present. Try this is on a similar file with the cp running : $ time md5sum STAGE/kordosky/n11011010_0010_L010000N_D00.tar.gz f779b46d4be31c710518c7cb5a1ab210 STAGE/kordosky/n11011010_0010_L010000N_D00.tar.gz real 0m43.791s user 0m0.690s sys 0m0.580s That's better, 5 MB/sec, but still a tenth what I expect from modern disks. MD5SUM Testing remote example : FILE=n11011008_0000_L010000N_D00.tar.gz RUSE=kreymer ssh mindata@minos26.fnal.gov "cd STAGE/${RUSE} ; md5sum ${FILE} \ > md5/${FILE}.md5 ; \ cat md5/${FILE}.md5 >> md5/all.md5 ; \ cat md5/${FILE}.md5 ; \ rm md5/${FILE}.md5 " mccheck script, to dump sizes/checksums of existing tars local tests, rates : MINOS13 > dccp ${DPATH}/${FUSE}/${TAR} . 1619066880 bytes in 43 seconds (36770.23 KB/sec) MINOS13 > time tar xf ${TAR} -C /var/tmp/kreymer/${FUSE} real 0m38.792s user 0m0.150s sys 0m11.170s MINOS13 > du -sm n11011001_0000_L100200N_D00-n11011001_0001_L100200N_D00.tar 1546 n11011001_0000_L100200N_D00-n11011001_0001_L100200N_D00.tar MINOS13 > ~/minos/scripts/mctarcheck kordosky 2>&1 | tee kordosky.log INFORMATIONAL: Product 'dcap' (with qualifiers 'unsecured'), has no current chain (or may not exist) Thu Jan 25 17:52:34 CST 2007 n11011001_0000_L100200N_D00-n11011001_0001_L100200N_D00.tar ============================================================================= 2007 01 24 ############ # MCIMPORT # ############ /local/stage filled around 03:00, due to mcimport flood, about 10 to 20 GB/hour from kordosky Disabled further input, via .k5loginmin ( omits project principals ) will restore from .k5loginfull $ du -sm STAGE/*/dcache 15267 STAGE/howcroft/dcache 20121 STAGE/kordosky/dcache 1 STAGE/kreymer/dcache $ du -sm STAGE/*/tar 20806 STAGE/howcroft/tar 34010 STAGE/kordosky/tar 1 STAGE/kreymer/tar Total of 90 GB of tarred files pending to tape. Shifted some kordosky files to /var/tmp/mindata, to breathe : $ du -sm n11011044* 748 n11011044_0005_L250200N_D00.tar.gz 724 n11011044_0006_L250200N_D00.tar.gz 733 n11011044_0008_L250200N_D00.tar.gz 746 n11011044_0009_L250200N_D00.tar.gz 745 n11011044_0010_L250200N_D00.tar.gz $ mkdir /var/tmp/mindata/TMP $ cp -a n11011044*/var/tmp/mindata/TMP/ $ for FIL in n11011044* ; do echo $FIL ; diff ${FIL} /var/tmp/mindata/TMP/${FIL} ; done n11011044_0005_L250200N_D00.tar.gz n11011044_0006_L250200N_D00.tar.gz n11011044_0008_L250200N_D00.tar.gz n11011044_0009_L250200N_D00.tar.gz n11011044_0010_L250200N_D00.tar.gz $ rm STAGE/kordosky/n11011044_0005_L250200N_D00.tar.gz $ #rm n11011044* rhatcher also cleared off 10 GB of space. 
so I have restored this file to kordosky cp -a /var/tmp/mindata/TMP/n11011044_0005_L250200N_D00.tar.gz . diff /var/tmp/mindata/TMP/n11011044_0005_L250200N_D00.tar.gz . $ ./mcimport -w kordosky OOPS - found /local/scratch26/mindata/kordosky/log/mcimport.pid PID TTY TIME CMD OK - stale pid file OK, logging activity to /local/scratch26/mindata/kordosky/log/mcimport.log SRMCP phase ran at about 5' per file, system 200% iowait PURGE phase ran at about 30" per file, system 130% iowait There is one 0 length tarfile : 0 Jan 24 10:56 n11011001_0001_L010000N_D00-n11011044_0005_L250200N_D00.tar $ dds dcache/n11011001_0001_L010000N_D00-n11011044_0005_L250200N_D00.tar -rw-r--r-- 1 mindata e875 0 Jan 24 03:44 dcache/n11011001_0001_L010000N_D00-n11011044_0005_L250200N_D00.tar removed it MINOS26 > rm /pnfs/minos/stage/kordosky/n11011001_0001_L010000N_D00-n11011044_0005_L250200N_D00.tar Similar problem in howcroft ( in tar, none in pnfs ): $ dds howcroft/tar/n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar -rw-r--r-- 1 mindata e875 0 Jan 24 07:37 howcroft/tar/n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar $ rm howcroft/tar/n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar Launched same for howcroft, and enabled inflow : cp .k5loginfull .k5login 13:03 $ ./mcimport -w howcroft OOPS - found /local/scratch26/mindata/howcroft/log/mcimport.pid PID TTY TIME CMD OK - stale pid file OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log MCIN_DATA Have waited long enough for enstore-admin to run enmv. Have rerun ./mcinfix and reported this to enstore-admin Now finish purging dcache files : $ ./mcimport -w kordosky OK, logging activity to /local/scratch26/mindata/kordosky/log/mcimport.log $ ./mcimport -w howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log MCIMPORT - bad news, pid interlocking is just not working within cron jobs. Will take quite some time to debug, I am mystified. Lots more print statements, I guess. Let's tar up the 38 GB howcroft files, as a start. $ du -sm STAGE/* ... 37818 STAGE/howcroft 110232 STAGE/kordosky ... Wed Jan 24 18:45:53 CST 2007 $ ${HOME}/mcimport howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log Further problem, logs indicate we run at about 15 GB/hour just making tarfiles. That's pretty lousy, I guess too many copies going on. ============================================================================= 2007 01 23 ############ # MCIMPORT # ############ MINOS26 > for DIR in L010170 L010185 L010000 L010200 L100200 L150200 L250200 ; do rmdir /pnfs/minos/mcin_data/near/daikon_00/${DIR} ; done ########### # ROUNDUP # ########### SRV1> dfarm usage rubin Used: 55910 + Reserved: 0 / Quota: 1000000 (MB) SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far SRV1> dfarm usage rubin Used: 50692 + Reserved: 0 / Quota: 1000000 (MB) ####### # AFS # ####### The dh web area has filled up. MIN > fs listquota /afs/fnal.gov/files/expwww/numi Volume Name Quota Used %Used Partition room.numi 2000000 2000003 100%<< 87% MIN > du -sm * | sort -n du: cannot change to directory `minwork/daqlogs': Permission denied 1 Contacts 1 DocDBSite 1 Gallery ... 2 xstyles 3 collab 4 mtg 7 public 9 PublicInfo 13 offline_software 14 doe_may_04_review 24 MinosAEM 29 doe_feb_05_review 59 Minos 95 DataQuality 113 sam 164 numi_pics 198 computing 268 talks 299 numwork 310 workgrps 889 internal 959 minwork Repeat the study done on 2006 05 26 Note the fs lsquota no longer shows mount points.
So must explicitly sniff out mount points : WEB=/afs/fnal.gov/files/expwww/numi WEBS=`find ${WEB} -type d ` printf "${WEBS}\n" | wc -l 6130 for DIR in ${WEBS} ; do fs lsmount ${DIR} ; done | grep 'is a mount' | tee /tmp/afsmounts MIN > cat /tmp/afsmounts '/afs/fnal.gov/files/expwww/numi' is a mount point for volume '#room.numi' '/afs/fnal.gov/files/expwww/numi/html/talks' is a mount point for volume '#expwww.numi.talks' '/afs/fnal.gov/files/expwww/numi/html/fnal_minos' is a mount point for volume '#expwww.numi.fnalminos' '/afs/fnal.gov/files/expwww/numi/html/minwork' is a mount point for volume '#expwww.numi.minwork' '/afs/fnal.gov/files/expwww/numi/html/numwork' is a mount point for volume '#expwww.numi.numwork' '/afs/fnal.gov/files/expwww/numi/query_files' is a mount point for volume '#nb.w.numi.d1' '/afs/fnal.gov/files/expwww/numi/numinotes' is a mount point for volume '#room.numi.1' '/afs/fnal.gov/files/expwww/numi/numinotes/public/ps' is a mount point for volume '#w.numi.d2' '/afs/fnal.gov/files/expwww/numi/numinotes/restricted/html' is a mount point for volume '#w.numi.d1' '/afs/fnal.gov/files/expwww/numi/file_upload' is a mount point for volume '#expwww.numi.fileupload' for DIR in `cat /tmp/afsmounts | cut -f 2 -d "'"` ; do fs listquota ${DIR} ; done Let's be selfish, and take the whole fnal_minos partition for computing, copying the files, comparing, then renaming the original to computing_retired_20070123 MIN > cp -ax computing fnal_minos MINOS26 > du -sm /afs/fnal.gov/files/expwww/numi/html/fnal_minos/computing 198 /afs/fnal.gov/files/expwww/numi/html/fnal_minos/computing MINOS25 > time diff -r computing fnal_minos/computing MIN > mv computing computing_retired_20070123 ; ln -s fnal_minos/computing computing out of quota, had to clean up. meanwhile, script had recreated computing confusing things when I did MIN > ln -s fnal_minos/computing computing Cleaned up, tried again, MIN > mv computing computingx ; ln -s ../fnal_minos/computing computing that was the wrong path, once again cleanly : MIN > mv computing computingy ; ln -s fnal_minos/computing computing That looks good. 
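For the record, the copy-verify-swap order that finally worked, as a sketch to reuse the next time a web volume fills; assumed to run from the directory that holds computing and fnal_minos.

# Move a directory onto a roomier AFS volume, then leave a symlink behind.
cp -ax computing fnal_minos/                  # copy into the mounted volume
diff -r computing fnal_minos/computing        # verify before touching the original
mv computing computing_retired_20070123       # retire, do not remove, the original
ln -s fnal_minos/computing computing          # link is relative to this directory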
Now pick up a few stray bits of content, from an earlier diff -r MIN > FIL=computing/dh/beamlog/2007/01/23.txt MIN > nedit $FIL < 190 Tue Jan 23 14:43:12 CST 2007 < 190 Tue Jan 23 14:44:13 CST 2007 < 186 Tue Jan 23 14:45:15 CST 2007 < 188 Tue Jan 23 14:46:12 CST 2007 < 187 Tue Jan 23 14:47:13 CST 2007 < 188 Tue Jan 23 14:48:12 CST 2007 < 188 Tue Jan 23 14:49:12 CST 2007 MIN > FIL=computing/dh/ftplog/2007/01/23.txt MIN > nedit $FIL < 2 Tue Jan 23 14:48:21 CST 2007 557 MIN > FIL=computing/dh/pnfslog/2007/01/23.txt MIN > nedit $FIL < 3 Tue Jan 23 14:42:37 CST 2007 < 1 Tue Jan 23 14:47:38 CST 2007 Picked up more bits from computingx MIN > find computingx -type f computingx/dh/pnfslog/2007/01/23.txt computingx/dh/beamlog/2007/01/23.txt computingx/database/oracle/topdb/minosprd/2007/01/23/14.txt MIN > cat computingx/dh/pnfslog/2007/01/23.txt 1 Tue Jan 23 14:52:39 CST 2007 MIN > cat computingx/dh/beamlog/2007/01/23.txt 187 Tue Jan 23 14:53:11 CST 2007 187 Tue Jan 23 14:54:11 CST 2007 187 Tue Jan 23 14:55:12 CST 2007 MIN > cat computingx/database/oracle/topdb/minosprd/2007/01/23/14.txt Tue Jan 23 14:55:20 CST 2007 All user connections to minosprd Access account User name Logon Client machine Program STATUS Time -------------------- -------------------- ---------- ------------------------------ ------------------------------ -------- ---------- MONITOR kreymer 14:55:16 minos26.fnal.gov sqlplus@minos26.fnal.gov (TNS ACTIVE 0 DBSNMP oracle 07:39:34 minosora1.fnal.gov emagent@minosora1.fnal.gov (T ACTIVE 2 DBSNMP oracle 07:39:32 minosora1.fnal.gov emagent@minosora1.fnal.gov (T INACTIVE 26 Elapsed: 00:00:00.01 COUNT(*) ---------- 3 Elapsed: 00:00:00.00 Database server cpu used for sessions terminating within past minute: User name Total cpu Sessions cpu/session -------------------- --------- -------- ----------- oracle .1 3 .02 Elapsed: 00:00:00.91 Edited this into current computing. Copied database file 14.txt topdb stopped around 10:54. ============================================================================= 2007 01 22 ############ # MCIMPORT # ############ Oops, massive write errors in DCache due to my directory renames. Writes to /pnfs/minos/mcin_data/near/daikon_00/L010185N/ started around 17:00 Friday 19 Jan. Directories were renamed around 19:40. Need to unfix damage done by the mcinfix script, created mcunfix script, ran it after doing manual $ mkdir /pnfs/minos/mcin_data/near/daikon_00/L010185N/10 $ chmod 775 /pnfs/minos/mcin_data/near/daikon_00/L010185N $ chmod 775 /pnfs/minos/mcin_data/near/daikon_00/L010185N/10 $ ./mcunfix 2>&1 | tee mcunfix.log OK - 45 files in 100 Mon Jan 22 14:32:38 CST 2007 OK - 60 files in 101 Mon Jan 22 14:33:29 CST 2007 Mon Jan 22 14:34:36 CST 2007 ################ # CONTROL ROOM # ################ The free space is about 500 MB, with files currently being written to /home/minos/BD/devel/BeamData/java/NuMIMon mkdir /acnet/minos/NuMIMon $ cd /home/minos/BD/devel/BeamData/java/NuMIMon $ ls xml*.dat | wc -l 29 $ FILES=`ls xml*.dat | head -28` for FILE in ${FILES} ; do cp ${FILE} /acnet/minos/NuMIMon/${FILE} done for FILE in ${FILES} ; do if diff --brief ${FILE} /acnet/minos/NuMIMon/${FILE} ; then echo rm ${FILE} echo ln -s /acnet/minos/NuMIMon/${FILE} ${FILE} else printf " OOPS, copy error for ${FILE} \n" fi done ########### # ROUNDUP # ########### SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far NO go, the script is not executeable. 
Modified around 14:16 by howie, to change from /usr/local/etc/setups.sh to /fnal/ups/etc/setups.sh chmod 775 round* chmod 775 dfarmsum chmod 775 remove_duplicates GRRRRRRRRRRRRRRRRRRRRRRR encp v3_5c no longer sets up ( it work on Friday ). envp v3_6d seems OK, but was just installed after 14:00 this afternoon. afs products are suddenly in the path. dfarm no longer works ( just hangs ) SRV1> setup dfarm SRV1> type dfarm dfarm is /fnal/ups/prd/dfarm/v3_1a/Linux/bin/dfarm SRV1> ups list -aK+ dfarm "dfarm" "v3_1a" "Linux" "" "current" SRV1> dfarm usage rubin Traceback (most recent call last): File "/fnal/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 1236, in ? usg, res, qta = c.getUsage(args[0]) File "/fnal/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 303, in getUsage self.connect() File "/fnal/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 232, in connect ans = self.DStr.sendAndRecv('HELLO %s' % self.Username) File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 386, in sendAndRecv return self.recv(tmo = tmo) File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 379, in recv while not self.readMore(maxmsg, tmo): File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 278, in readMore r,w,e = select.select([fd],[],[],tmo) KeyboardInterrupt MIN > ssh -l minfarm fnpc146 minfarm on fnpc146% source /fnal/ups/etc/setups.csh minfarm on fnpc146% setup dfarm minfarm on fnpc146% date ; dfarm usage rubin ; date Mon Jan 22 18:24:59 CST 2007 Traceback (most recent call last): File "/local/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 1236, in ? usg, res, qta = c.getUsage(args[0]) File "/local/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 303, in getUsage self.connect() File "/local/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 232, in connect ans = self.DStr.sendAndRecv('HELLO %s' % self.Username) File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 386, in sendAndRecv return self.recv(tmo = tmo) File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 379, in recv while not self.readMore(maxmsg, tmo): File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 278, in readMore r,w,e = select.select([fd],[],[],tmo) KeyboardInterrupt Mon Jan 22 18:27:02 CST 2007 ============================================================================= 2007 01 21 ########### # DESKTOP # ########### Found desktop system powered down. Restarted cleanly ( unclean shutdown ) from /var/log/messages : Jan 20 09:37:20 minos-93198 sshd[21899]: Failed none for illegal user scanner from 131.225.110.131 port 52730 ssh2 Jan 20 09:37:20 minos-93198 sshd[21899]: Connection closed by 131.225.110.131 Jan 20 09:51:14 minos-93198 telnetd[21900]: ttloop: peer died: Invalid or incomplete multibyte or wide character Jan 20 10:28:16 minos-93198 kernel: e1000: eth0: e1000_watchdog: NIC Link is Down Jan 21 14:34:48 minos-93198 syslogd 1.4.1: restart. Jan 21 14:34:48 minos-93198 syslog: syslogd startup succeeded N.B. - this powerdown was due to electrical maintenance last weekend. ############ # MCIMPORT # ############ per kordosky email 20 Jan 03:37:02, pid check is not effective ! Why does exit 1 not exit ? Because pid is invoked as a funtion ? 
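On the "why does exit 1 not exit" question : exit inside a shell function does terminate the calling script when the function is called directly, but not when the function runs in a subshell (a pipeline stage, command substitution, or background job). A small demo of the shell behaviour only, not the mcimport code; pid_check here is a made-up stand-in :

    #!/bin/sh
    pid_check() {
        echo "OOPS - previous copy still running, quitting"
        exit 1
    }
    pid_check | tee /dev/null    # pipeline : the function runs in a subshell,
                                 # so exit 1 ends only the subshell
    echo "this line still runs"
    pid_check                    # called directly : exit 1 stops the script here
    echo "this line is never reached"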
( No harm done in this case, previous process was just purging ) But note tar failures Fri Jan 5 03:37:02 CST 2007 /var/tmp/mindata/MCTAR/kordosky/n11011418_0004_L010185N_D00.tar.gz Mon Jan 8 19:37:01 CST 2007 OOPS - tar file corrupt /var/tmp/mindata/MCTAR/kordosky/n11011453_0006_L010185N_D00-n12011455_0006_L010185N_D00.tar Tue Jan 9 11:37:02 CST 2007 OOPS - tar file corrupt /var/tmp/mindata/MCTAR/kordosky/n11011426_0000_L010185N_D00-n11011430_0000_L010185N_D00.tar Wed Jan 10 03:37:01 CST 2007 OOPS - tar file corrupt /var/tmp/mindata/MCTAR/kordosky/n11011459_0006_L010185N_D00-n12011459_0006_L010185N_D00.tar Wed Jan 10 19:37:01 CST 2007 OOPS - tar file corrupt /var/tmp/mindata/MCTAR/kordosky/n11011001_0000_L100200N_D00-n11011001_0002_L100200N_D00.tar ============================================================================= 2007 01 19 ############ # MCIMPORT # ############ 10:35 : $ cp afsmcimport mcimport This will copy data from the write to read pool. MINOS26 > ./stage -d -p 0 stage/kordosky Needed 120/ 373 FINISHED Fri Jan 19 10:39:02 CST 2007 MINOS26 > ./stage -d -p 0 stage/howcroft ........ Needed 119/ 341 FINISHED Fri Jan 19 10:40:16 CST 2007 MINOS26 > ./stage stage/kordosky MINOS26 > ./stage stage/howcroft MINOS26 > date Fri Jan 19 10:42:43 CST 2007 This should get older files on disk, newer ones should start there. ########### # ROUNDUP # ########### fnpcsrv1 There have been periods this morning, from about 10:20 through 10:45, when simple commands have hung up for minutes ls ps top cat /proc/meminfo ./roundup -n -r cedar near ./roundup -r cedar near There were network problems earlier today, per timm ( fermigrid-help ) Do not use farm-admin in future. Successfully ran ( after 11;40 ) SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far ########### # FIREFOX # ########### firefox crashed on my desktop going to the network speed test page /FIREFOX/run-mozilla.sh: line 424: 15208 Segmentation fault "$prog" ${1+"$@"} restarted ############# # mcinwrite # ############# cp mcimport mcinwrite will move files from the given source path to the proper release in mcin_data, Example : ./mcinwrite -f -r daikon_00 $ MCI=/afs/fnal.gov/files/data/minos/d185/daikon_00/fnal/ $ ./mcinwrite -n -v -s n11011001_0001_L010185N_D00 -r daikon_00 ${MCI} Write one selected file $ ./mcinwrite -s n11011001_0001_L010185N_D00 -r daikon_00 ${MCI} 16:55 $ ./mcinwrite -r daikon_00 ${MCI} ( need to clean up PID clearing and exit from write ) ============================================================================= 2007 01 18 ############ # MCIMPORT # ############ mcimport.20070118 - added dccp -P to move file into read pools sooner ln -sf mcimport.20070118 mcimport # was 20070104 tested in kreymer cp howcroft/n*.tar.gz kreymer/ rm kreymer/n11011165_0002_L010185N_D00.tar.gz # drop a transit file ~kreymer/minos/scripts/mcimport.20070118 -W kreymer afsmcimport -w kreymer Seems to have worked : $ ~kreymer/minos/scripts/dc_stat /pnfs/minos/stage/kreymer/n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar ============================ PNFS status for /pnfs/minos/stage/kreymer/n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar -rw-r--r-- 1 kreymer e875 1744189440 Jan 18 17:03 n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:94e26d79;l=1744189440; w-stkendca10a-3 LEVEL 4 ============================ then a couple of minutes later : $ ~kreymer/minos/scripts/dc_stat /pnfs/minos/stage/kreymer/n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar 
============================ PNFS status for /pnfs/minos/stage/kreymer/n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar -rw-r--r-- 1 kreymer e875 1744189440 Jan 18 17:03 n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar LEVEL 2 2,0,0,0.0,0.0 :c=1:94e26d79;h=yes;l=1744189440; r-stkendca15a-5 w-stkendca10a-3 LEVEL 4 ============================ TOMORROW - should ( while crons are idle ) cp afsmcimport mcimport ########### # ROUNDUP # ########### roundup.20060118 - suppressed 0 length files, using string ' rubin 0 ' ln -sf roundup.20070118 roundup # was roundup.2070117 MINOS26 > scp minfarm@fnpcsrv1:scripts/roundup.20070118 . 09:55 running full fardet catchup SRV1> ./roundup -r cedar far srmcp is failing, like java.io.IOException: credential remaining lifetime is less then a minute ... srm client error: java.io.IOException: credential remaining lifetime is less then a minute ( The error message is misspelled, should be 'less than a minute' ) SRV1> grid-proxy-info -f /home/minfarm/.grid/x509up_u1334 subject : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990/CN=687673363 issuer : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 identity : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : /home/minfarm/.grid/x509up_u1334 timeleft : 6024:51:56 (251.0 days) SRV1> grid-proxy-info -f /home/minfarm/.grid/kreymer-doe.proxy subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=768538851 issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : /home/minfarm/.grid/kreymer-doe.proxy timeleft : 2638:01:18 (109.9 days) But this works if I explicitly give the proxy specified in the config.xml or kreymer.xml SRV1> srmcp -streams_num=1 -server_mode=active --x509_user_proxy=/home/minfarm/.grid/x509up_u1334 $SFILE file:///TEST.dat SRV1> ls -l TEST.dat -rw-rw-r-- 1 minfarm numi 15761813 Jan 18 11:28 TEST.dat SRV1> srmcp -streams_num=1 -server_mode=active --x509_user_proxy=/home/minfarm/.grid/kreymer-doe.proxy $SFILE file:///TEST.dat SRV1> dds TEST.dat -rw-rw-r-- 1 minfarm numi 15761813 Jan 18 11:17 TEST.dat What has changed since yesterday ? What changed is the the /tmp proxy expired. SRV1> rm /tmp/x509up_u10871 Now the default proxy is absent, the config file works as intended. Even an empty file in the default path causes failure : org.globus.gsi.GlobusCredentialException: [JGLOBUS-11] No certificates loaded CONFIG FILE UPGRADE ? see kreymer2.xml vs kreymer.xml Should we move from /fnal/ups paths to /export/osg ? Reran roundup near and far, OK ! HOWTO.dccp - removed srm information to HOWTO.srm ============================================================================= ########### # ROUNDUP # ########### roundup.20060117 - changed required by fnpcsrv1 upgrades changed setup encp from v3_5a to v3_6c -q stken srmcp v1.25 seems to working ok, tested with srmls and srmcp per HOWTO.dccp ln -sf roundup.20070117 roundup # was roundup.2070116 Howie is having trouble with srmls under tcsh, OK under bash. The problem was the need for a ? character in the srm path, which needed to be escaped like \? in tcsh. SRV1> ./roundup -s F00037213_ -r cedar far Looks OK to me ( made safety copies if input files in ROUNDUP_TEST/TEST Will run full catchup tomorrow. 
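Given that the real culprit was a stale default proxy in /tmp silently overriding the one named in the config file, a lifetime guard before each srmcp would save the next round of confusion. A sketch only; the one-hour threshold is arbitrary, and SFILE is assumed to be set as in the tests above :

    PROXY=/home/minfarm/.grid/x509up_u1334
    # insist on at least an hour of proxy lifetime before trusting it
    if LEFT=`grid-proxy-info -f ${PROXY} -timeleft` && [ "${LEFT}" -gt 3600 ] ; then
        srmcp -streams_num=1 -server_mode=active \
            --x509_user_proxy=${PROXY} ${SFILE} file:///TEST.dat
    else
        echo "OOPS - proxy ${PROXY} has ${LEFT:-0} seconds left, not running srmcp"
    fi
    rm -f /tmp/x509up_u`id -u`    # optional : clear any stale default proxy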
============================================================================= 2007 01 16 ########### # ROUNDUP # ########### Purging STRAYS and ODDS, per plan STRAYS ( 115) and ODDS (8) match the total count (123) Verify the files are still there in DFARM, then remove them. for DET in N F ; do if [ "${DET}" = "N" ] ; then det=near ; else det=far ; fi for FIL in `cat SDFARM${DET}` ; do dfarm ls /minos/${det}cat/${FIL} done ; done for DET in N F ; do if [ "${DET}" = "N" ] ; then det=near ; else det=far ; fi for FIL in `cat SDFARM${DET}` ; do dfarm rm /minos/${det}cat/${FIL} done ; done All files in DFARM are now recent (2007). Two files have 0 length : frwrw 2 rubin 0 01/15 07:54:42 F00037233_0002.all.sntp.cedar.0.root frwrw 2 rubin 0 01/15 08:29:30 F00037230_0019.all.sntp.cedar.0.root (N.B. these files were removed, rewritten around 01/16 14:13:08 ) Preview the catchup run : SRV1> ./roundup -n -r cedar near OK - processing 684 files OK - stream cosmic.sntp.cedar OK - 9949 Mbytes in 29 runs SUPPRESS N00011446_0024.cosmic.sntp.cedar.0.root PEND - have 7/24 subruns for N00011446_*.cosmic.sntp.cedar*.root 14 01/02 10:20:22 PEND - have 9/13 subruns for N00011481_*.cosmic.sntp.cedar*.root 11 01/05 04:47:37 PEND - have 11/13 subruns for N00011488_*.cosmic.sntp.cedar*.root 11 01/05 05:48:22 PEND - have 32/24 subruns for N00011516_*.cosmic.sntp.cedar*.root 8 01/08 05:21:31 PEND - have 15/17 subruns for N00011542_*.cosmic.sntp.cedar*.root 0 01/15 14:16:24 PEND - have 3/4 subruns for N00011552_*.cosmic.sntp.cedar*.root 0 01/15 14:18:43 PEND - have 14/24 subruns for N00011565_*.cosmic.sntp.cedar*.root 1 01/15 11:29:27 PEND - have 19/24 subruns for N00011568_*.cosmic.sntp.cedar*.root 0 01/15 14:15:40 PEND - have 14/25 subruns for N00011574_*.cosmic.sntp.cedar*.root 0 01/16 06:21:19 ... 
PEND - have 7/24 subruns for N00011446_*.spill.sntp.cedar*.root 14 01/02 10:20:52 PEND - have 9/13 subruns for N00011481_*.spill.sntp.cedar*.root 11 01/05 04:48:09 PEND - have 11/13 subruns for N00011488_*.spill.sntp.cedar*.root 11 01/05 05:48:53 PEND - have 32/24 subruns for N00011516_*.spill.sntp.cedar*.root 8 01/08 05:22:04 PEND - have 15/17 subruns for N00011542_*.spill.sntp.cedar*.root 0 01/15 14:17:56 PEND - have 3/4 subruns for N00011552_*.spill.sntp.cedar*.root 0 01/15 14:19:35 PEND - have 11/24 subruns for N00011565_*.spill.sntp.cedar*.root 1 01/15 11:29:49 PEND - have 18/24 subruns for N00011568_*.spill.sntp.cedar*.root 0 01/15 14:16:12 PEND - have 14/25 subruns for N00011574_*.spill.sntp.cedar*.root 0 01/16 06:21:40 SRV1> ./roundup -n -r cedar far 2>&1 | tee /tmp/far.pre.log SRV1> grep PEND /tmp/far.pre.log PEND - have 13/19 subruns for F00037162_*.all.sntp.cedar*.root 14 01/01 23:47:22 PEND - have 23/24 subruns for F00037221_*.all.sntp.cedar*.root 4 01/11 23:53:18 PEND - have 22/24 subruns for F00037230_*.all.sntp.cedar*.root 1 01/15 07:57:52 PEND - have 22/24 subruns for F00037233_*.all.sntp.cedar*.root 1 01/15 07:54:42 PEND - have 1/18 subruns for F00037239_*.all.sntp.cedar*.root 0 01/16 00:17:22 PEND - have 13/19 subruns for F00037162_*.spill.sntp.cedar*.root 14 01/01 23:47:35 PEND - have 18/24 subruns for F00037221_*.spill.sntp.cedar*.root 4 01/11 23:53:32 PEND - have 20/24 subruns for F00037230_*.spill.sntp.cedar*.root 1 01/15 07:58:09 PEND - have 15/24 subruns for F00037233_*.spill.sntp.cedar*.root 1 01/15 08:14:03 PEND - have 1/18 subruns for F00037239_*.spill.sntp.cedar*.root 0 01/16 00:17:35 The intial near and far PENDS are due to files already written in 2006-12 : SRV1> ls -l /pnfs/minos/reco_near/cedar/sntp_data/2006-12/N00011446*cosmic* | wc -l 17 SRV1> ls -l /pnfs/minos/reco_near/cedar/sntp_data/2006-12/N00011446*spill* | wc -l 17 SRV1> ls -l /pnfs/minos/reco_far/cedar/sntp_data/2006-12/F00037162_*all* | wc -l 6 SRV1> ls -l /pnfs/minos/reco_far/cedar/sntp_data/2006-12/F00037162_*spill* | wc -l 6 N00011516 was partially reprocessed, subruns 15-22. roundup.20070116 - makes YEMON subdirectories of LOG and HADDLOG ln -sf roundup.20070116 roundup # was roundup.2070110 Round up the initial runs for near and far : ./roundup -n -s N00011446_ -f 0 -r cedar near ./roundup -n -s F00037162_ -f 0 -r cedar far round 16:25 : # # # first output # # # ./roundup -s N00011446_ -f 0 -r cedar near ./roundup -s F00037162_ -f 0 -r cedar far ~minfarm/lists/bad_runs.cedar has list of problem runs. 
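The PEND bookkeeping above comes down to counting which subruns of a run are actually present before concatenating. A rough sketch for a single run and stream; RUN and STRM are the first example from the list above, and the hand-set EXPECT is an assumption, since roundup derives the expected count from its own records :

    DET=near
    RUN=N00011446
    STRM=cosmic.sntp.cedar
    EXPECT=24                    # assumed here, not queried
    HAVE=`dfarm ls /minos/${DET}cat/${RUN}_*.${STRM}.*.root 2>/dev/null | wc -l`
    if [ "${HAVE}" -ge "${EXPECT}" ] ; then
        echo "OK   - have ${HAVE}/${EXPECT} subruns for ${RUN}"
    else
        echo "PEND - have ${HAVE}/${EXPECT} subruns for ${RUN}_*.${STRM}*.root"
    fi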
Recently, N00011468_0000.0 2007-01 106 2 2007-01-03 23:22:41 fnpc166 N00011481_0008.0 2007-01 99445 136 2007-01-04 23:37:16 fnpc192 N00011481_0010.0 2007-01 100083 136 2007-01-04 23:55:07 fnpc174 N00011481_0005.0 2007-01 99799 136 2007-01-05 00:01:10 fnpc188 N00011481_0006.0 2007-01 99189 136 2007-01-05 00:01:59 fnpc146 This leaves a mystery N00011488 missing subruns 10,11 Meanwhile, clean up and write the good data : ./roundup -s N00011481_ -f 0 -r cedar near ./roundup -r cedar near ./roundup -r cedar far Per rubin, removing unwanted duplicates from N00011516 SRV1> dfarm ls /minos/nearcat/*.1.* frwrw 2 rubin 30475986 01/10 16:07:36 /minos/nearcat/N00011516_0015.cosmic.sntp.cedar.1.root frwrw 2 rubin 77598268 01/10 16:08:31 /minos/nearcat/N00011516_0015.spill.sntp.cedar.1.root frwrw 2 rubin 30682257 01/10 16:23:31 /minos/nearcat/N00011516_0016.cosmic.sntp.cedar.1.root frwrw 2 rubin 77519165 01/10 16:24:18 /minos/nearcat/N00011516_0016.spill.sntp.cedar.1.root frwrw 2 rubin 30168133 01/10 18:02:18 /minos/nearcat/N00011516_0017.cosmic.sntp.cedar.1.root frwrw 2 rubin 86727707 01/10 18:03:22 /minos/nearcat/N00011516_0017.spill.sntp.cedar.1.root frwrw 2 rubin 30197596 01/10 16:08:16 /minos/nearcat/N00011516_0018.cosmic.sntp.cedar.1.root frwrw 2 rubin 77902299 01/10 16:09:37 /minos/nearcat/N00011516_0018.spill.sntp.cedar.1.root frwrw 2 rubin 30242143 01/10 18:25:16 /minos/nearcat/N00011516_0019.cosmic.sntp.cedar.1.root frwrw 2 rubin 86283961 01/10 18:26:36 /minos/nearcat/N00011516_0019.spill.sntp.cedar.1.root frwrw 2 rubin 30415410 01/10 16:17:21 /minos/nearcat/N00011516_0020.cosmic.sntp.cedar.1.root frwrw 2 rubin 78316537 01/10 16:18:04 /minos/nearcat/N00011516_0020.spill.sntp.cedar.1.root frwrw 2 rubin 30336800 01/10 14:55:28 /minos/nearcat/N00011516_0021.cosmic.sntp.cedar.1.root frwrw 2 rubin 78352443 01/10 14:56:04 /minos/nearcat/N00011516_0021.spill.sntp.cedar.1.root frwrw 2 rubin 30105619 01/10 13:40:02 /minos/nearcat/N00011516_0022.cosmic.sntp.cedar.1.root frwrw 2 rubin 59368513 01/10 13:40:35 /minos/nearcat/N00011516_0022.spill.sntp.cedar.1.root SRV1> dfarm rm /minos/nearcat/N00011516_*.1.* ############ # predator # ############ See 2006 12 11 note, we did apparently get up to date on saddreco on 12/12. Need to top this off. Easist thing is to restore saddreco to predator ( done just now ) and run VMON=2006-12 ./predator ${VMON} Then re-disable saddreco in predator, this needs to be done by roundup in future. ============================================================================= 2007 01 12 ############ # MCIMPORT # ############ Adding sjc area to mindata@fnal.gov for Stephen Coleman, per arms. .k5login - added sjc@FNAL.GOV mkdir STAGE/sjc mkdir $MINOS_DATA/log_data/mcimport/sjc for USER in sjc do mkdir -m 775 /pnfs/minos/stage/${USER} ; done do ( cd /pnfs/minos/stage/${USER} ; enstore pnfs --file_family stage_${USER} ) ; done do ( cd /pnfs/minos/stage/${USER} ; enstore pnfs --tags | grep 'file_family) =' ) ; done ########### # ROUNDUP # ########### Ran further tests of size-mismatched files SRV1> cat ~/maint/oops.1.files N00010230_0004.spill.sntp.cedar.0.root N00010274_0018.spill.sntp.cedar.0.root N00010485_0001.cosmic.sntp.cedar.0.root N00011408_0006.spill.sntp.cedar.0.root N00011408_0021.cosmic.sntp.cedar.0.root N00011408_0021.spill.sntp.cedar.0.root F00037025_0005.all.sntp.cedar.0.root F00037147_0013.all.sntp.cedar.0.root Four of these exist in AFS, with sizes and dates matching DFARM, all after the PFNS times. 
N00010230_0004.spill.sntp.cedar.0.root N00010274_0018.spill.sntp.cedar.0.root N00011408_0006.spill.sntp.cedar.0.root N00011408_0021.spill.sntp.cedar.0.root Go ahead and purge the files which do match in size : SRV1> ./remove_duplicates 2>&1 | tee ../maint/remdup.log Fri Jan 12 15:04:19 CST 2007 Fri Jan 12 15:04:28 CST 2007 PURGE N00010077_0000.cosmic.sntp.cedar.0.root PURGE N00010077_0001.cosmic.sntp.cedar.0.root ... SRV1> dfarm usage rubin Used: 97525 + Reserved: 0 / Quota: 1000000 (MB) shifted /tmp/PNFS to $HOME/maint/PNFS, adjsted remove_duplicates STRAYS Now get a copy of all the DFARM file missing from PNFS, put it into /export/stage/minfarm/STRAYS Made a shorter list of DFARM files cd /home/minfarm/maint dfarm ls /minos/nearcat > SDFARMNF dfarm ls /minos/farcat > SDFARMFF Edited this with nedit to exclude 01/* files from 2007. cat SDFARMNF | tr -s ' ' | cut -f 7 -d ' ' > SDFARMN cat SDFARMFF | tr -s ' ' | cut -f 7 -d ' ' > SDFARMF cd /home/minfarm/maint mkdir /export/stage/minfarm/STRAYS List them : for DET in N F ; do for FIL in `cat SDFARM${DET}` ; do grep -q ${FIL} PNFS || grep ${FIL} SDFARM${DET}F done ; done Clone them : for DET in N F ; do if [ "${DET}" = "N" ] ; then det=near ; else det=far ; fi for FIL in `cat SDFARM${DET}` ; do grep -q ${FIL} PNFS || dfarm get /minos/${det}cat/${FIL} /export/stage/minfarm/STRAYS/${FIL} done ; done SRV1> ls -l /export/stage/minfarm/STRAYS total 1209476 -rw-rw-r-- 1 minfarm numi 275564 Jan 12 17:58 F00028071_0001.all.sntp.cedar.0.root -rw-rw-r-- 1 minfarm numi 452419 Jan 12 17:58 F00033105_0010.spill.sntp.cedar.0.root ... SRV1> ls /export/stage/minfarm/STRAYS | wc -l 115 Check sizes for DET in N F ; do for FIL in `cat SDFARM${DET}` ; do if grep -q ${FIL} PNFS ; then true ; else DSIZ=`grep ${FIL} SDFARM${DET}F | tr -s ' ' | cut -f 4 -d ' '` PSIZ=`ls -l /export/stage/minfarm/STRAYS/${FIL} | tr -s ' ' | cut -f 5 -d ' '` echo ${FIL} ${DSIZ} ${PSIZ} [ ${DSIZ} != ${PSIZ} ] && echo OOPS fi done ; done Now put a copy of odd length files into export/stage/minfarm/ODDS mkdir /export/stage/minfarm/ODDS ODDN=' N00010230_0004.spill.sntp.cedar.0.root N00010274_0018.spill.sntp.cedar.0.root N00010485_0001.cosmic.sntp.cedar.0.root N00011408_0006.spill.sntp.cedar.0.root N00011408_0021.cosmic.sntp.cedar.0.root N00011408_0021.spill.sntp.cedar.0.root ' ODDF=' F00037025_0005.all.sntp.cedar.0.root F00037147_0013.all.sntp.cedar.0.root ' for FIL in ${ODDN} ; do echo $FIL dfarm get /minos/nearcat/${FIL} /export/stage/minfarm/ODDS/${FIL} done for FIL in ${ODDF} ; do echo $FIL dfarm get /minos/farcat/${FIL} /export/stage/minfarm/ODDS/${FIL} done ============================================================================= 2007 01 11 SRV1> dfarm usage rubin Used: 415346 + Reserved: 0 / Quota: 1000000 (MB) While waiting for resolution, prepare to purge the majority of files in PNFS. 
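That purge amounts to : a dfarm file may be dropped only if a file of the same name is already in PNFS and the sizes agree. A rough sketch of the check, not the remove_duplicates script itself; it uses the /tmp list names built below, looks only at the near sntp_data area, and the dfarm rm is left commented :

    DET=near
    for FIL in `cat /tmp/DFARMN` ; do
        grep -q "^${FIL}$" /tmp/PNFS || continue     # not on tape yet, keep it
        DSIZ=`grep " ${FIL}$" /tmp/DFARMNF | tr -s ' ' | cut -f 4 -d ' '`
        PSIZ=`ls -l /pnfs/minos/reco_${DET}/cedar/sntp_data/*/${FIL} 2>/dev/null \
              | tr -s ' ' | cut -f 5 -d ' '`
        if [ -n "${PSIZ}" ] && [ "${DSIZ}" = "${PSIZ}" ] ; then
            echo "PURGE ${FIL}"
            # dfarm rm /minos/${DET}cat/${FIL}       # enable after reviewing the list
        else
            echo "OOPS - size mismatch for ${FIL} : dfarm ${DSIZ} pnfs ${PSIZ}"
        fi
    done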
Check more carefully , need a little remove_duplicates script Put dfarm list in /tmp/DFARM[N|F} [F] as before check existince in /tmp/PNFS, check file sizes SRV1> ./remove_duplicates 2>&1 | tee /tmp/reduppre.log ( fixed problem with bntp -> .bntp ), reran for just far SRV1> ./remove_duplicates 2>&1 | tee -a /tmp/reduppre.log Note several near detector file size mismatches ============================================================================= 2007 01 10 VMON=2006-12 ./predator ${VMON} ########### # ROUNDUP # ########### Explore the 2006/7 boundary, purge dfarm of all files already in PNFS roundup.20070110 - moved /DFARM/ to /${CAT}/DFARM likewise for LOG and HADDLOG and WRITE ln -sf roundup.20070110 roundup SRV1> mv DFARM CAT/DFARM SRV1> mkdir DFARM SRV1> mv LOG CAT/LOG SRV1> mkdir LOG SRV1> mv HADDLOG CAT/HADDLOG SRV1> mkdir HADDLOG SRV1> mkdir CAT/WRITE make tmp/PNFS list of recent reco files DET=near for DIR in /pnfs/minos/reco_${DET}/cedar/sntp_data ; do ls ${DIR}/2006-11 >> /tmp/PNFS ls ${DIR}/2006-12 >> /tmp/PNFS ls ${DIR}/2006-05 >> /tmp/PNFS ls ${DIR}/2006-06 >> /tmp/PNFS ls ${DIR}/2006-07 >> /tmp/PNFS ls ${DIR}/2006-08 >> /tmp/PNFS ls ${DIR}/2006-09 >> /tmp/PNFS ls ${DIR}/2006-10 >> /tmp/PNFS done SRV1> wc -l /tmp/PNFS 2696 /tmp/PNFS DET=far for DIR in /pnfs/minos/reco_${DET}/cedar/sntp_data /pnfs/minos/reco_${DET}/cedar/.bntp_data ; do ls ${DIR}/2006-11 >> /tmp/PNFS ls ${DIR}/2006-12 >> /tmp/PNFS ls ${DIR}/2006-05 >> /tmp/PNFS ls ${DIR}/2006-06 >> /tmp/PNFS ls ${DIR}/2006-07 >> /tmp/PNFS ls ${DIR}/2006-08 >> /tmp/PNFS ls ${DIR}/2006-09 >> /tmp/PNFS ls ${DIR}/2006-10 >> /tmp/PNFS done SRV1> wc -l /tmp/PNFS 7091 /tmp/PNFS SRV1> dfarm ls /minos/nearcat > /tmp/DFARMNF SRV1> dfarm ls /minos/farcat > /tmp/DFARMFF SRV1> dfarm ls /minos/nearcat | tr -s ' ' | cut -f 7 -d ' ' > /tmp/DFARMN SRV1> dfarm ls /minos/farcat | tr -s ' ' | cut -f 7 -d ' ' > /tmp/DFARMF SRV1> wc -l /tmp/DFARMN 2692 /tmp/DFARMN SRV1> wc -l /tmp/DFARMF 3404 /tmp/DFARMF SRV1> for FIL in `cat /tmp/DFARMF` ; do grep -q ${FIL} /tmp/PNFS || grep ${FIL} /tmp/DFARMFF ; done SRV1> for FIL in `cat /tmp/DFARMN` ; do grep -q ${FIL} /tmp/PNFS || grep ${FIL} /tmp/DFARMNF ; done frwrw 2 rubin 37297557 11/10 15:39:27 N00010163_0001.spill.sntp.cedar.0.root frwrw 2 rubin 35117005 11/10 15:28:44 N00010163_0003.spill.sntp.cedar.0.root frwrw 2 rubin 44780044 11/10 16:28:54 N00010163_0004.spill.sntp.cedar.0.root frwrw 2 rubin 47876194 11/10 17:06:10 N00010163_0005.spill.sntp.cedar.0.root frwrw 2 rubin 46815316 11/10 16:44:38 N00010163_0006.spill.sntp.cedar.0.root frwrw 2 rubin 45944255 11/10 18:37:33 N00010163_0007.spill.sntp.cedar.0.root frwrw 2 rubin 33988356 11/10 17:24:19 N00010163_0008.spill.sntp.cedar.0.root frwrw 2 rubin 41349361 11/10 17:52:56 N00010163_0009.spill.sntp.cedar.0.root frwrw 2 rubin 34032079 11/10 16:38:20 N00010163_0010.spill.sntp.cedar.0.root frwrw 2 rubin 28499385 11/10 15:39:04 N00010163_0011.spill.sntp.cedar.0.root frwrw 2 rubin 32534482 11/10 15:42:31 N00010163_0012.spill.sntp.cedar.0.root frwrw 2 rubin 69803776 12/08 07:10:33 N00011347_0014.spill.sntp.cedar.0.root ... rest are from 2007 10163 is from 2006-06, these spill and cosmic sntp files are missing in PNFS Informed rubin, waiting for resolution . 
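The per-file grep loops above rescan the whole PNFS list once per dfarm file; with both sides reduced to sorted bare file names the stray check collapses to a single comm. A sketch using the /tmp list names already in play; STRAYS.list is a new, arbitrary name :

    sort -u /tmp/PNFS > /tmp/PNFS.sorted
    sort -u /tmp/DFARMN /tmp/DFARMF > /tmp/DFARM.sorted
    # names present in dfarm but absent from PNFS, i.e. the strays
    comm -23 /tmp/DFARM.sorted /tmp/PNFS.sorted > /tmp/STRAYS.list
    wc -l /tmp/STRAYS.list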
============================================================================= 2007 01 09 ######### # VAULT # ######### vault - changed encp from v3_5a to v3_6d per current per HOWTO.vault VMON=2006-12 for DET in far near; do ./vault ${DET} ${VMON} ; done Completed cleanly ############ # beam_log # ############ The script was stuck since http://www-numi.fnal.gov/computing/dh/beamlog/2007/01/08.txt 296 Mon Jan 8 11:26:29 CST 2007 ps xf : 25430 ? S 25:01 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/beam_log 15077 ? S 0:00 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/beam_log 15078 ? S 0:02 \_ curl -s http://www-bd.fnal.gov/notifyservlet/www 15079 ? S 0:00 \_ grep SC time MINOS26 > ps -f -p 15079 UID PID PPID C STIME TTY TIME CMD kreymer 15079 15077 0 Jan08 ? 00:00:00 grep SC time Killed the grep and curl, running now since http://www-numi.fnal.gov/computing/dh/beamlog/2007/01/09.txt 181 Tue Jan 9 11:18:21 CST 2007 ########### # ROUNDUP # ########### grep -e PEND -e CST LOG/R1_18_4near.log > /tmp/pendrn grep 'root 09/' /tmp/pendrn > /tmp/pendrn09 grep 'root 10/' /tmp/pendrn > /tmp/pendrn10 grep 'root 11/' /tmp/pendrn > /tmp/pendrn11 nedit /tmp/pendrm # select latest batch of PEND, file prior to December FILES=`cat /tmp/pendrn09 | cut -f 8 -d ' '` for FILE in $FILES ; do echo $FILE ; dfarm ls /minos/${DET}cat/${FILE} ; done for FILE in $FILES ; do echo $FILE ; dfarm rm /minos/${DET}cat/${FILE} ; done This was a moot exercise, cleanup had already occurred. Checked out far with ./roundup -C -r R1_18_4 -W -n far roughly consistent with log, just a few runs, so let's clean up Format of log changed, need slightly different selection for 09 10 11 grep -e PEND -e CST LOG/R1_18_4far.log > /tmp/pendrn nedit /tmp/pendrn # cut out all but latest PENDs grep ' 09/' /tmp/pendrn > /tmp/pendrn09 grep ' 10/' /tmp/pendrn > /tmp/pendrn10 grep ' 11/' /tmp/pendrn > /tmp/pendrn11 FILES=`cat /tmp/pendrn09 | cut -f 8 -d ' '` FILES=`cat /tmp/pendrn10 | cut -f 8 -d ' '` FILES=`cat /tmp/pendrn11 | cut -f 8 -d ' '` for FILE in $FILES ; do echo $FILE ; dfarm ls /minos/${DET}cat/${FILE} ; done for FILE in $FILES ; do echo $FILE ; dfarm rm /minos/${DET}cat/${FILE} ; done SRV1> dfarm usage rubin Used: 480608 + Reserved: 0 / Quota: 1000000 (MB) Now clean out some cedar stuff : ./roundup -C -r cedar -W -n near 2>&1 | tee /tmp/pendcn ./roundup -C -r cedar -W -n far 2>&1 | tee /tmp/pendcf grep 'root.*09/' /tmp/pendcn > /tmp/pendcn09 grep 'root.*10/' /tmp/pendcn > /tmp/pendcn10 grep 'root.*11/' /tmp/pendcn > /tmp/pendcn11 (10 is empty) DET=near FILES=`cat /tmp/pendcn09 | cut -f 8 -d ' '` FILES=`cat /tmp/pendcn11 | cut -f 8 -d ' '` grep 'root.*09/' /tmp/pendcf > /tmp/pendcf09 grep 'root.*10/' /tmp/pendcf > /tmp/pendcf10 grep 'root.*11/' /tmp/pendcf > /tmp/pendcf11 (10 is empty) DET=far FILES=`cat /tmp/pendcf09 | cut -f 8 -d ' '` FILES=`cat /tmp/pendcf11 | cut -f 8 -d ' '` Still a lot of stuff there, SRV1> dfarm usage rubin Used: 461875 + Reserved: 0 / Quota: 1000000 (MB) Let's just purge all of R1_18_4 There are many obsolete short ntuple files there : SRV1> dfarm ls /minos/nearcat/*R1_18_4* | wc -l 2564 SRV1> dfarm ls /minos/nearcat/*snts**R1_18_4* | wc -l 2545 SRV1> dfarm ls /minos/farcat/*R1_18_4* | wc -l 4545 SRV1> dfarm ls /minos/farcat/*nts**R1_18_4* | wc -l 4476 SRV1> dfarm usage rubin Used: 402857 + Reserved: 0 / Quota: 1000000 (MB) Everything is cedar now, look at the 2006 vs 2007 breakdown SRV1> dfarm ls /minos/farcat/ | grep ' 01/' | wc -l 372 SRV1> dfarm ls /minos/farcat/*cedar* | grep -v 
' 01/' | wc -l 2984 SRV1> dfarm ls /minos/farcat/*cedar* | wc -l 3356 SRV1> dfarm ls /minos/nearcat/ | grep ' 01/' | wc -l 416 SRV1> dfarm ls /minos/nearcat/*cedar* | grep -v ' 01/' | wc -l 2283 SRV1> dfarm ls /minos/nearcat/*cedar* | wc -l 2699 One quick check of short ntuples, SRV1> dfarm ls /minos/farcat/*nts* SRV1> dfarm ls /minos/nearcat/*nts* frwrw 2 rubin 512714 11/10 10:20:48 /minos/nearcat/N00010072_0000.cosmic.snts.cedar.0.root frwrw 2 rubin 11483882 11/10 10:43:40 /minos/nearcat/N00010077_0000.cosmic.snts.cedar.0.root ... SRV1> dfarm ls /minos/nearcat/*nts* | wc -l 19 SRV1> dfarm rm /minos/nearcat/*nts* Let's check out the cutover boundary SRV1> dfarm ls /minos/farcat/ | grep ' 12/' | wc -l 2881 SRV1> dfarm ls /minos/farcat/ | grep ' 12/3' | wc -l 96 ============================================================================= 2007 01 08 ############ # MCIMPORT # ############ ############# # checklist # ############# minosora1 ganglia monitoring shows no data, but system is up. Reported to minos-dbsupport Minos-servers category is missing entirely. Ganglia plots are back as of about 15:00 ########### # ROUNDUP # ########### SRV1> ./roundstat Mon Jan 8 15:22:31 CST 2007 OK - 5219 files , 181 GBytes in near 2655 files , 156 GBytes in near cedar 2564 files , 25 GBytes in near R1_18_4 OK - 8821 files , 55 GBytes in far 4039 files , 47 GBytes in far cedar 4782 files , 7 GBytes in far R1_18_4 OK - WRITE OK - 2 files , 0 GBytes in near 2 files , 0 GBytes in near cedar OK - 0 files , 0 GBytes in far OK - READ OK - 0 files , 0 GBytes in near OK - 0 files , 0 GBytes in far SRV1> dfarm usage rubin Used: 509062 + Reserved: 0 / Quota: 1000000 (MB) Moved to latest version, supporting -C : ln -sf roundup.20061220 roundup # was roundup.20061215 Cleaned up the two cedar near WRITE files left from 20 Dec ./roundup -C -r cedar -w near ============================================================================= 2006 12 29 ############ # MCIMPORT # ############ Testing fermigrid cache, on the side : SRV1> srmclient/bin/srmls ${SPATH2}/usr/fermigrid/volatile/minos 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos srmclient/bin/srmmkdir ${SPATH2}/mcimport Now try writing to this from mindata on minos26 . 
/usr/local/etc/setups.sh setup upd export PRODUCTS=/afs/fnal.gov/files/code/e875/general/ups/db setup vdt setup srmcp v1_21 export SRM_CONFIG=/home/mindata/.srmconfig/kreymer.xml cd /local/scratch26/mindata/kordosky IPATH=fermigrid/volatile/minos/kordosky IFILE=n11011401_0001_L010185N_D00.tar.gz SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr SFILE=${SPATH}/${IPATH}/${IFILE} $ srmcp -streams_num=1 -server_mode=active file:///${IFILE} ${SFILE} Fri Dec 29 10:17:50 CST 2006: rs.state = Failed rs.error = RequestFileStatus#-2147274830 failed with error:[ at Fri Dec 29 10:17:46 CST 2006 state Failed : can not obtain turl for file:org.dcache.srm.SRMException: user's path ///pnfs/fnal.gov/usr/fermigrid/volatile/minos/kordosky/n11011401_0001_L010185N_D00.tar.gz is not subpath of the user's root] GRRRR, no good for now, will just go ahead with mcimport development without a safety copy of files in fermigrid/volatile/minos Added command qualifiers to mcimport, tested srmcp on kreymer files Grabbed 10 more kordosky files for full test in kreymer, with fresh login cd /local/scratch26/mindata FILES=`ls /local/scratch26/mindata/kordosky/ | head -40 | tail -10` for FILE in ${FILES} ; do cp -v kordosky/${FILE} kreymer/ ; done ./afsmcimport kreymer Seems to be ok, need to still test the purging code ( wait for data on tape) Will lauch full compression of howcroft, while waiting for little kreymer files to be archived. Tarred up /local/scratch/kreymer/ARCHIVE into /tmp/ARCHIVE.tar, sccp -c blowfish to minos25:/local/scratch25/kreymer/ARCHIVE.tar checked with md5sum. ============================================================================= 2006 12 28 ############ # MCIMPORT # ############ Tested srmcp successfully, readig a single file In minos products area, upd install -j srmcp v1_25_1 -f NULL upd install -j vdt v1_1_14_13 upd install -j pacman v2_116_1 ups declare -c pacman v2_116_1 -f NULL ups tailor vdt v1_1_14_13 > /tmp/vdtinstall.log ups declare -c vdt v1_1_14_13 ups declare -c srmcp v1_25_1 SRV1> java -version java version "1.4.2_10" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_10-b03) Java HotSpot(TM) Client VM (build 1.4.2_10-b03, mixed mode) MINOS26 > java -version java version "1.4.2_12" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_12-b03) Java HotSpot(TM) Client VM (build 1.4.2_12-b03, mixed mode) Tried various varsions, same result, then finally srmcp v1_21 works ! Had to update kreymer.xml for local product paths to srmcp, and had to clone a local copy of CA certificates, scp -r minfarm@fnpcsrv1:/local/ups/grid/globus/share/certificates certificates This now works , at least copying files to disk per HOWTO.dccp. Now try going to dcache from local disk. cd /local/scratch26/mindata/kreymer/tar export SRM_CONFIG=/home/mindata/.srmconfig/kreymer.xml IPATH=minos/stage/kreymer IFILE=n11011401_0001_L010185N_D00-n11011401_0005_L010185N_D00.tar SFILE=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/${IPATH}/${IFILE} srmcp -streams_num=1 -server_mode=active \ file:///${IFILE} ${SFILE} ============================================================================= 2006 12 23 ############ # MCIMPORT # ############ cd $MINOS_DATA/log_data mkdir mcimport cd mcimport fs setacl . minos rlidwka cd .. fs setacl . 
minos rlidwka mkdir howcroft mkdir kordosky Tested this on 19 kordosky files cloned to kreymer Tarring to kreymer/tar/ looks good, now need to add archiving ============================================================================= 2006 12 22 ############### # minos26free # ############### Created script to report free space on /local/scratch26 hourly, at http://www-numi.fnal.gov/computing/dh/minos26free/NOW.txt ( for daily reports ) http://www-numi.fnal.gov/computing/dh/minos26free/FREE.txt ( use in scripts ) Updated web at /afs/fnal.gov/files/expwww/numi/html/computing/dh/dhmain.200601222.html ln -sf dhmain.20061222.html dhmain.html # was dhmain.20060918.html ############ # mcimport # ############ This will to be cloned from vault ( tars/vaults raw data) rawcopy ( actually does the tarring ) roundup ( writes concatenated ntuples ) Files are under /local/scratch26/mindata// There are data files like and log files in */log/ Data files should be tarred and feathxxxxarchived, and logs should ber rsync'd to AFS. /afs/fnal.gov/files/data/minos/log_data/mcimport/ Let's get the tar going first, to avoid a space crunch. ============================================================================= 2006 12 20 ####### # SRM # ####### SRV1> time srmcp -debug=false -streams_num=1 -server_mode=active -protocols=gsiftp $SFILE file:///TEST.dat real 0m11.100s user 0m3.550s sys 0m0.250s Compare this to the roughly .8 elapsed, .3 CPU cost of globus-url-copy X11 - clean scan today ########### # ROUNDUP # ########### SRV1> ./roundstat Wed Dec 20 09:48:47 CST 2006 OK - 4009 files , 120 GBytes in near 1445 files , 95 GBytes in near cedar 2564 files , 25 GBytes in near R1_18_4 OK - 7585 files , 38 GBytes in far 2803 files , 30 GBytes in far cedar 4782 files , 7 GBytes in far R1_18_4 OK - WRITE OK - 2 files , 2 GBytes in near 2 files , 2 GBytes in near R1_18_4 OK - 48 files , 9 GBytes in far 48 files , 9 GBytes in far R1_18_4 OK - READ OK - 701 files , 0 GBytes in near 517 files , 0 GBytes in near cedar 184 files , 0 GBytes in near R1_18_4 OK - 2181 files , 0 GBytes in far 1726 files , 0 GBytes in far cedar 455 files , 0 GBytes in far R1_18_4 Removed duplicated rerounded file from Far, rm WRITE/F00036196* Removed old second half of large file dating from Nov 9 rm /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0030.spill.sntp.R1_18_4.0.root Cleared the write area : ./roundup -C -r R1_18_4 -w near ./roundup -C -r R1_18_4 -w far Rerun R1_18_4 to get list of PEND's for removal ./roundup -C -r R1_18_4 -W near PEND - have 7/8 subruns for N00011295_*.spill.sntp.R1_18_4*.root 19 12/01 07:59:01 PEND - have 10/12 subruns for N00011301_*.spill.sntp.R1_18_4*.root 19 12/01 08:31:34 SUPPRESS N00011315_0024.spill.sntp.R1_18_4.0.root PEND - have 2/24 subruns for N00011315_*.spill.sntp.R1_18_4*.root 18 12/01 16:38:28 Try a roundup ./roundup -C -r R1_18_4 -W -R near fails, the missing subruns are really missing. ./roundup -C -r R1_18_4 -W far But first, clean up protections for DET in near far ; do FILES=` dfarm ls /minos/${DET}cat | tr -s ' ' | grep '\- ' | cut -f 7 -d ' '` for FILE in $FILES ; do dfarm ls /minos/${DET}cat/${FILE} ; done for FILE in $FILES ; do dfarm chmod rwrw /minos/${DET}cat/${FILE} ; done done Howie has done this. 
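For the record, the minos26free reporter described above (2006 12 22) need not be more than a few lines. A sketch only, not the actual script; the web directory and the NOW.txt / FREE.txt names are the ones quoted above, the date format and cron line are illustrative :

    #!/bin/sh
    # hourly free-space report for /local/scratch26
    WEBDIR=/afs/fnal.gov/files/expwww/numi/html/computing/dh/minos26free
    FREE=`df -m /local/scratch26 | tail -1 | tr -s ' ' | cut -f 4 -d ' '`
    echo "`date -u` ${FREE} MB free" >> ${WEBDIR}/NOW.txt    # readable history
    echo "${FREE}" > ${WEBDIR}/FREE.txt                      # bare number for scripts
    # crontab entry, on the hour :
    # 0 * * * * ${HOME}/minos/scripts/minos26free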
######### # SRMCP # ######### Testing roundup.20061220 using Howie's cert to srmcp get some fodder SRV1> ./roundup.20061220 -C -r cedar -W -s N00011356 near ============================================================================= 2006 12 19 ####### # SRM # ####### Trying a fresh VDT install on fnpcsrv1, per http://fermigrid.fnal.gov/user-guide-new.html cd ~/grid wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-latest.tar.g tar xzf pacman-latest.tar.gz LATEST=3.19 export PATH='pwd'/pacman-${LATEST}:$PATH cd pacman-${LATEST} source setup.sh Pacman requires at least Python version 2.2. Your Python version, 2.1, is too old for Pacman. Installing Python 2.4.1 locally... Downloading Python 2.4.1... download successful. Unzipping... unzip successful. Untarring... untar successful. Configuring... configure successful. Making Python 2.4.1... make successful.make install successful. Python 2.4.1 has been built locally. Ready to use Pacman. export VDT_LOCATION=${HOME}/grid/vdt-1.3.10 mkdir $VDT_LOCATION cd $VDT_LOCATION Looked for latest in http://software.grid.iu.edu/pacman/ client-0.4.1-2.pacman 19-Sep-2006 20:44 868 pacman -get OSG:client-0.4.1-2 SRV1> pacman -get OSG:client-0.4.1-2 warning: Python C API version mismatch for module struct: This Python has API version 1012, module struct has version 1010. warning: Python C API version mismatch for module bsddb: This Python has API version 1012, module bsddb has version 1010. warning: Python C API version mismatch for module gdbm: This Python has API version 1012, module gdbm has version 1010. warning: Python C API version mismatch for module dbm: This Python has API version 1012, module dbm has version 1010. warning: Python C API version mismatch for module strop: This Python has API version 1012, module strop has version 1010. warning: Python C API version mismatch for module time: This Python has API version 1012, module time has version 1010. Traceback (most recent call last): File "/home/minfarm/grid/pacman-3.19/bin/pacman", line 18, in ? import Pacman File "/home/minfarm/grid/pacman-3.19/src/Pacman.py", line 83, in ? import lock File "/home/minfarm/grid/pacman-3.19/src/lock.py", line 4, in ? from Base import * File "/home/minfarm/grid/pacman-3.19/src/Base.py", line 4, in ? import sys,os,string,commands,copy,time,popen2,cPickle,pwd,grp,socket,anydbm,shutil ImportError: /local/ups/prd/python/v2_1/Linux-2-4/lib/python2.1/lib-dynload/cPickle.so: undefined symbol: PyUnicode_DecodeRawUnicodeEscape Tried a direct copy : SRV1> globus-url-copy gsiftp://stkendca2a.fnal.gov:2811///neardet_data/2004-11/N00004502_0000.mdaq.root file:////export/stage/minfarm/ROUNDUP_TEST/TEST/TEST.dat Now test a bit : GSIF=gsiftp://stkendca2a.fnal.gov:2811///neardet_data/2004-11/N00004502_0000.mdaq.root LOPE=file:////export/stage/minfarm/ROUNDUP_TEST/TEST Trying various sizes SRV1> N=4 SRV1> globus-url-copy -dbg -p ${N} ${GSIF} ${LOPE}/TEST.dat 2>&1 | wc -l 3145 I get the following size log files : N Lines 1 3143 2 3145 4 3149 8 3157 16 3173 time globus-url-copy -p ${N} ${GSIF} ${LOPE}/TEST.dat N=1 real 0m0.828s user 0m0.050s <--- this was a fluke, repeated copies are around .1 sys 0m0.230s N=8 real 0m0.823s user 0m0.110s sys 0m0.210s ########## # DCACHE # ########## Cleaned up the damaged file from SAM, sam undeclare file N00011134_0038.spill.cand.cedar.0.root This file was damaged in DCache, producing thousands of files on 9 tapes, discovered back on 2006 12 14. Tapes are released for consolidation via migration. 
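The stream-count test above (2006 12 19) is easy to run as a loop so all the timings land in one log; same GSIF and LOPE as above, the log file name is arbitrary :

    GSIF=gsiftp://stkendca2a.fnal.gov:2811///neardet_data/2004-11/N00004502_0000.mdaq.root
    LOPE=file:////export/stage/minfarm/ROUNDUP_TEST/TEST
    for N in 1 2 4 8 16 ; do
        echo "parallel streams ${N}"
        ( time globus-url-copy -p ${N} ${GSIF} ${LOPE}/TEST.dat ) 2>&1 | grep real
    done | tee /tmp/gucspeed.log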
########## # IMPORT # ########## export FTP_PASSIVE ; FTP_PASSIVE=1 FTP_HOST=minos26.fnal.gov # Test connection ftpout=`printf "user mindata\nquit\n" | ftp -n ${FTP_HOST}` if [ "${ftpout}" = "GSSAPI authentication succeeded" ] then echo " OK - we can ftp to ${FTP_HOST}" else echo " " echo " OOPS - ftp output = ${ftpout}" echo " " echo " OOPS - we cannot access ftp at ${FTP_HOST} ," fi Copy file printf "user mindata\n \ cd STAGE/kreymer \n \ put ${FILE} ${FILE} \n \ quit\n" \ | ftp -n ${FTP_HOST} ============================================================================= 2006 12 18 ########## # DCACHE # ########## Another rubin dccp -P got stuck CPU-bound for over 2 days on fnpcserv1. Killed it. Farms were stuck, due to srmcp failing. Trying it on srv1, per revised HOWTO.dccp. dccp works ============================================================================= 2006 12 15 ########### # ROUNDUP # ########### SRV1> ./roundstat Fri Dec 15 09:25:50 CST 2006 OK - 5379 files , 188 GBytes in near 2096 files , 126 GBytes in near cedar 3283 files , 61 GBytes in near R1_18_4 OK - 8479 files , 47 GBytes in far 2803 files , 30 GBytes in far cedar 5676 files , 16 GBytes in far R1_18_4 OK - WRITE OK - 2 files , 2 GBytes in near 2 files , 0 GBytes in near R1_18_4 OK - 0 files , 0 GBytes in far OK - READ OK - 701 files , 0 GBytes in near 517 files , 0 GBytes in near cedar 184 files , 0 GBytes in near R1_18_4 OK - 2136 files , 0 GBytes in far 1726 files , 0 GBytes in far cedar 410 files , 0 GBytes in far R1_18_4 grep -e PEND -e CST LOG/R1_18_4near.log > /tmp/pendrn nedit /tmp/pendrm # select latest batch of PEND, file prior to December FILES=`cat /tmp/pendrn | cut -f 8 -d ' '` SRV1> for FILE in $FILES ; do dfarm rm minos/nearcat/${FILE} ; done Error deleting /minos/nearcat/N00008460_0002.cosmic.sntp.cedar.0.root: PERM Permission denied Error deleting /minos/nearcat/N00008463_0019.spill.sntp.cedar.0.root: PERM Permission denied PERM Permission denied Moving on to R1_18_4 far, find quite a few Nov files ready to round up, SRV1> ./roundup -r R1_18_4 far SRV1> ./roundup -C -r R1_18_4 -s F00036196_ -R far Oops, that was a mistake... the Rustling worked, as did the concatenation, but the file was the first one concatenated, using loon. So it is already in PNFS, with a slightly different size. ############### # LARGE FILES # ############### Renamed the long file, ( cd /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09 ; \ mv N00010819_0000.spill.sntp.R1_18_4.0.root N00010819_0000.spill.sntp.R1_18_4.99.root ) ./dc_stat /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.99.root ============================================================================= 2006 12 14 ######## # DATA # ######## created data directories for 2007, as indicate above in ANNUAL section ######### # STAGE # ######### mkdir -m 775 /pnfs/minos/stage ( cd /pnfs/minos/stage ; enstore pnfs --file_family stage ) ( cd /pnfs/minos/stage ; enstore pnfs --tags ) for USER in arms buckley gallag gmieg howcroft kordosky kreymer rhatcher urheim do mkdir -m 775 /pnfs/minos/stage/${USER} ; done do ( cd /pnfs/minos/stage/${USER} ; enstore pnfs --file_family stage_${USER} ) ; done do ( cd /pnfs/minos/stage/${USER} ; enstore pnfs --tags | grep 'file_family) =' ) ; done ######## # ENCP # ######## Need to upgrade to v3_6d due to security scans coming by 18 Dec. But which one ? Linux+2.6 or Linx+2.4-2.3.2 ? -q dcache or normal ? 
MINOS26 > upd install -j encp v3_6d MINOS26 > upd install -j encp v3_6d -q dcache got word from zalokar that the dcache version is for pool nodes MINOS26 > ups undeclare -Y encp v3_6d -q dcache MINOS26 > upd install -j encp v3_6d -f Linux+2.6 shell-init: could not get current directory: getcwd: cannot access parent directories: No such file or directory OOPS, I was sitting in the removed encp -q dcache removed and started over with -f Linux+2.6 MINOS26 > ups undeclare -Y encp v3_6d -f Linux+2.6 MINOS26 > upd install -j encp v3_6d -f Linux+2.6 Edited v3_6d.table to use stkensrv2 by default MINOS26 > ups list -aK+ encp "encp" "v3_3" "Linux+2.4" "" "" "encp" "v3_4" "Linux+2.4-2.3.2" "" "" "encp" "v3_5a" "Linux+2.4-2.3.2" "" "current" "encp" "v3_6d" "Linux+2.4-2.3.2" "" "" "encp" "v3_6d" "Linux+2.6" "" "" MINOS26 > ups declare -c encp v3_6d WARNING: Unless you know what you are doing, use a qualifier in your ups declare command! MINOS26 > ups declare -c encp v3_6d -f Linux+2.6 WARNING: Unless you know what you are doing, use a qualifier in your ups declare command! MINOS26 > ups list -aK+ encp "encp" "v3_3" "Linux+2.4" "" "" "encp" "v3_4" "Linux+2.4-2.3.2" "" "" "encp" "v3_5a" "Linux+2.4-2.3.2" "" "" "encp" "v3_6d" "Linux+2.4-2.3.2" "" "current" "encp" "v3_6d" "Linux+2.6" "" "current" ============================================================================= 2006 12 13 ####### # SAM # ####### per akumar : Date: Wed, 13 Dec 2006 10:47:44 -0600 v6_3 version of SAM schema has been deployed successfully on minosprd. This version added a column called retired_date on data_files and build the index on file_name and retired_date. N.B. - this allows v8 dbservers to be deployed MINOS26 > sam ping dbserver The server 'SAMDbServer.prd:SAMDbServer' is alive. MINOS26 > sam locate foo RetryHandler.getReplicaLocationList('foo')> will retry in 18.33 seconds Datafile with name 'foo' not found. MINOS26 > ./sam_test_py minos OK MINOS26 > sam get metadata --file=F00031300_0000.mdaq.root OK MINOS26> ~/minos/HOWTO.predator OK SRV1> ./dfarmsum ########### # ROUNDUP # ########### Wed Dec 13 11:30:57 CST 2006 OK - 5304 files , 184 GBytes in near 2024 files , 122 GBytes in near cedar 3280 files , 61 GBytes in near R1_18_4 OK - 8479 files , 47 GBytes in far 2803 files , 30 GBytes in far cedar 5676 files , 16 GBytes in far R1_18_4 ============================================================================= 2006 12 12 ############ # saddreco # ############ DECLARE 2006 cedar reco thru November grep -v declare /local/scratch26/kreymer/log/saddreco/declare_near_cedar.log | less grep -v declare /local/scratch26/kreymer/log/saddreco/declare_far_cedar.log | less HOSTNU=`hostname -s | cut -c 6-` LOGPAT=/local/scratch${HOSTNU}/kreymer/log FARM=cedar MONS='01 02 03 04 05 06 07 08 09 10 11' for DET in near far ; do for MON in ${MONS} ; do ./saddreco ${DET} ${FARM} 2006-${MON} declare 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done ; done ########### # ROUNDUP # ########### SRV1> dfarm usage rubin Used: 523415 + Reserved: 0 / Quota: 1000000 (MB) SRV1> ./dfarmsum Tue Dec 12 08:26:54 CST 2006 OK - 5119 files , 177 GBytes in near 1832 files , 115 GBytes in near cedar 3287 files , 61 GBytes in near R1_18_4 OK - 8335 files , 45 GBytes in far 2659 files , 29 GBytes in far cedar 5676 files , 16 GBytes in far R1_18_4 Looking for CST in LOG/cedarnear.log, OK - processing /minos/nearcat Mon Dec 11 13:09:16 CST 2006 Tue Dec 12 00:05:00 CST 2006 SRV1> df -h . 
Filesystem Size Used Avail Use% Mounted on /dev/sdb3 485G 180G 306G 37% /export/stage Run again, to purge WRITE area primarily ./roundup.20061208 -r cedar near OK - processing /minos/nearcat Tue Dec 12 08:41:50 CST 2006 SRV1> du -sm WRITE 23813 WRITE ./roundup.20061208 -r R1_18_4 -w near SRV1> du -sm WRITE 3171 WRITE SRV1> ./dfarmsum Tue Dec 12 15:00:01 CST 2006 OK - 5121 files , 177 GBytes in near 1834 files , 115 GBytes in near cedar 3287 files , 61 GBytes in near R1_18_4 OK - 8407 files , 46 GBytes in far 2731 files , 29 GBytes in far cedar 5676 files , 16 GBytes in far R1_18_4 MINOS26 > ls -R /pnfs/minos/reco_far/R1_18_4/CAT | wc -l 297 MINOS26 > ls -R /pnfs/minos/reco_near/R1_18_4/CAT | wc -l 200 MINOS26 > ls -R /pnfs/minos/reco_far/cedar/CAT | wc -l 1365 MINOS26 > ls -R /pnfs/minos/reco_near/cedar/CAT | wc -l 557 Moved the long file out of the way, for cleanup. SRV1> mv WRITE/N00010819_0000.spill.sntp.R1_18_4.0.root LONG/ ############### # LARGE FILES # ############### Cleaning up, added test for WRITE vs PNFS file size found stray file from Sep 26 OOPS - Size mismatch , BAILING -rw-r--r-- 1 minfarm numi 2283574599 Sep 25 17:22 N00010819_0000.spill.sntp.R1_18_4.0.root -rw-r--r-- 1 1060 numi 1 Sep 26 07:39 /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.0.root Removed the bad file, requeued : rm /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.0.root ./roundup.20061208 -r R1_18_4 -w near Failed again OK !!!!! This is a file which is too big. The hadd and dccp were happy, it is the ls via pnfs which is unhappy. MINOS26 > ./dc_stat /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.0.root ============================ PNFS status for /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.0.root -rw-r--r-- 1 kreymer e875 1 Dec 12 12:01 N00010819_0000.spill.sntp.R1_18_4.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:1cf17836;l=2283574599; w-stkendca11a-1 LEVEL 4 ============================ So the level-2 information is good. MINOS26 > unset DCACHE_IO_TUNNEL MINOS26 > cd /local/scratch??/`whoami` MINOS26 > IFILE=N00010819_0000.spill.sntp.R1_18_4.0.root MINOS26 > IPATH=minos/reco_near/R1_18_4/CAT/sntp_data/2006-09 MINOS26 > DCPOR=24125 # unsecured MINOS26 > DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} MINOS26 > dccp ${DFILE} ${IFILE} 2283574599 bytes in 49 seconds (45511.29 KB/sec) MINOS26 > loon -bq ~/minos/scripts/firstlast.C ${IFILE} Not too informative, does not crash, but no counts. MINOS26 > loon -bq ~/minos/scripts/Merger.C ${IFILE} ... Floating point exception MINOS26 > dds /local/scratch26/kreymer/*root -rw-r--r-- 1 kreymer g020 323946379 Dec 12 16:59 /local/scratch26/kreymer/Merged.root -rw-r--r-- 1 kreymer g020 842860369 Dec 12 15:30 /local/scratch26/kreymer/N00010819_0000.cosmic.sntp.R1_18_4.0.root MINOS26 > mv Merged.root Merged.cosmic.root MINOS26 > IFILE=N00010819_0000.spill.sntp.R1_18_4.0.root MINOS26 > loon -bq ~/minos/scripts/Merger.C ${IFILE} very quickly, 27009843 in Merged.root Floating point exception SRV1> mv WRITE/N00010819_0000.spill.sntp.R1_18_4.0.root LONG/ ============================================================================= 2006 12 11 ############ # saddreco # ############ saddreco was failing due to Application with family 'reco', applName 'loon', version 'cedar' not found. 
MINOS26 > export SAM_ORACLE_CONNECT="samdbs/@minosprd" MINOS26 > samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar same for dev, int Need to do global declaration for all of cedar before resuming keepup in predator. crontab -r HOSTNU=`hostname -s | cut -c 6-` LOGPAT=/local/scratch${HOSTNU}/kreymer/log FARM=cedar MON=2005-04 for DET in near far ; do ./saddreco ${DET} ${FARM} ${MON} verify 5 ; done for DET in near far ; do ./saddreco ${DET} ${FARM} ${MON} declare 5 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done INSTANCE Location with name '/pnfs/minos/reco_near/cedar/cand_data/2005-04' not found. ./reloc cedar for DET in near far ; do ./saddreco ${DET} ${FARM} ${MON} declare 5 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done ... OOPS, need location for N00007354_0015.cosmic.cand.cedar.0.root DET=near ./saddreco near cedar 2005-04 addloc | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log ... OK - add location N00007354_0015.cosmic.cand.cedar.0.root /pnfs/minos/reco_near/cedar/cand_data/2005-04(vob884.1037) Do all of 2005 MONS='04 05 06 07 08 09 10 11 12' for DET in near far ; do for MON in ${MONS} ; do ./saddreco ${DET} ${FARM} 2005-${MON} declare 5 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done ; done for DET in near far ; do for MON in ${MONS} ; do ./saddreco ${DET} ${FARM} 2005-${MON} declare 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done ; done ######### # reloc # ######### Updated to use sam from afs, not local copy, so this can be run on minos26 and elsewhere MINOS26 > cp reloc.1201 reloc.20061211 MINOS26 > ln -sf reloc.20061211 reloc MINOS26 > ./reloc -s dev cedar Declaring locations to SAM for cedar ... MINOS26 > ./reloc -s int cedar ########### # ROUNDUP # ########### SRV1> ./dfarmsum Mon Dec 11 09:44:58 CST 2006 OK - 8437 files , 326 GBytes in near 5150 files , 264 GBytes in near cedar 3287 files , 61 GBytes in near R1_18_4 OK - 8335 files , 45 GBytes in far 2659 files , 29 GBytes in far cedar 5676 files , 16 GBytes in far R1_18_4 roundup.20061208 - enforces 1.5 MByte per subrun file size match requirement ./roundup.20061208 -r cedar -n near ./roundup.20061208 -r cedar near ########## # DCACHE # ########## Note false alarm regarding write pool corruption of neardet_data/2004-08/N00003307_0037.mdaq.root Email in http://listserv.fnal.gov/scripts/wa.exe?A2=ind0612&L=dcache-admin&T=0&X=710F7C3E310242DD53&Y=baisley%40fnal.gov&P=7034 Mentioned in the 8 Dec developer's plone log. 
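Related to the oversized N00010819 file noted above (12 Dec), where the PNFS directory entry shows a 1-byte size but dc_stat level 2 still carries l=2283574599 : the real length can be pulled straight from the layer-2 companion file. A sketch, assuming the usual PNFS '.(use)(2)(file)' convention :

    FILE=/pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.0.root
    DIR=`dirname ${FILE}`
    FIL=`basename ${FILE}`
    # the l= field of layer 2 holds the true length, even when ls shows 1 byte
    LEN=`cat "${DIR}/.(use)(2)(${FIL})" | tr ';' '\n' | grep '^l=' | cut -c 3-`
    echo "${FIL} : ${LEN} bytes per PNFS layer 2"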
============================================================================= 2006 12 09 ########## # DCACHE # ########## Removed the Oct 19 bad file, lost in DCache maintenance MINOS26 > sam locate N00011077_0013.spill.snts.R1_18_4.0.root ['/pnfs/minos/reco_near/R1_18_4/snts_data/2006-10,21@dcache'] MINOS26 > sam undeclare file N00011077_0013.spill.snts.R1_18_4.0.root MINOS26 > ls -l /pnfs/minos/reco_near/R1_18_4/snts_data/2006-10/N00011077_0013.spill.snts.R1_18_4.0.root ls: /pnfs/minos/reco_near/R1_18_4/snts_data/2006-10/N00011077_0013.spill.snts.R1_18_4.0.root: No such file or directory Removed the file written July 1, corrupted July 22, never on tape MINOS26 > sam locate N00010368_0008.spill.cand.R1_18_4.0.root ['/pnfs/minos/reco_near/R1_18_4/cand_data/2006-06,3@dcache'] MINOS26 > sam undeclare file N00010368_0008.spill.cand.R1_18_4.0.root MINOS26 > ls /pnfs/minos/reco_near/R1_18_4/cand_data/2006-06/N00010368_0008.spill.cand.R1_18_4.0.root ls: /pnfs/minos/reco_near/R1_18_4/cand_data/2006-06/N00010368_0008.spill.cand.R1_18_4.0.root: No such file or directory Both of these files have been reported regularly in the saddcache summary scripts. Three cedar far sntp files are reported lost in DCache, not on tape. /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035724_0013.all.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.all.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.spill.sntp.cedar.0.root These exist in the CAT stream. MINOS26 > dds /pnfs/minos/reco_far/cedar/CAT/sntp_data/2006-06/F00035727* -rw-r--r-- 1 kreymer e875 140341160 Dec 4 14:43 /pnfs/minos/reco_far/cedar/CAT/sntp_data/2006-06/F00035727_0000.all.sntp.cedar.0.root -rw-r--r-- 1 kreymer e875 5791142 Dec 4 19:24 /pnfs/minos/reco_far/cedar/CAT/sntp_data/2006-06/F00035727_0000.spill.sntp.cedar.0.root MINOS26 > dds /pnfs/minos/reco_far/cedar/CAT/sntp_data/2006-05/F00035724* -rw-r--r-- 1 kreymer e875 587664400 Dec 4 14:43 /pnfs/minos/reco_far/cedar/CAT/sntp_data/2006-05/F00035724_0000.all.sntp.cedar.0.root And in AFS, for the spill stream recodata26/F00035727_0005.spill.sntp.cedar.0.root ============================================================================= 2006 12 08 ############ # predator # ############ 2006-11 is caught up, veiwed with ./HOWTO.predator 2006-11 crontab crontab.dat at about 08:17 ######### # VAULT # ######### per HOWTO.vault VMON=2006-11 for DET in far near; do ./vault ${DET} ${VMON} ; done Start Fri Dec 8 10:56:14 CST 2006 Finish Fri Dec 8 18:21:28 CST 2006 ########### # ROUNDUP # ########### Testing file size checking, with these: OK adding F00037000_0000.spill.bntp.R1_18_4.0.root 24 OK adding F00037003_0000.spill.bntp.R1_18_4.0.root 7 OK adding F00037006_0000.spill.bntp.R1_18_4.0.root 9 OK adding F00037010_0000.spill.bntp.R1_18_4.0.root 1 OK adding F00037013_0000.spill.bntp.R1_18_4.0.root 24 OK adding F00037016_0000.spill.bntp.R1_18_4.0.root 28 OK adding F00037019_0000.spill.bntp.R1_18_4.0.root 4 like ./roundup.20061208 -r R1_18_4 -s F00037019 -W far OK adding F00037003_0000.all.sntp.R1_18_4.0.root 7 NSFIL SSIZ MSIZ DSIZ 7 134962138 134334043 104682 OK adding F00037003_0000.spill.bntp.R1_18_4.0.root 7 NSFIL SSIZ MSIZ DSIZ 7 16511218 15811337 116646 OK adding F00037003_0000.spill.sntp.R1_18_4.0.root 7 NSFIL SSIZ MSIZ DSIZ 7 8199703 7531334 111394 OK adding F00037006_0000.all.sntp.R1_18_4.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 170847188 170142123 88133 OK adding F00037006_0000.spill.bntp.R1_18_4.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 21240611 20338483 112766 
OK adding F00037006_0000.spill.sntp.R1_18_4.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 10606868 9692753 114264 OK adding F00037019_0000.all.sntp.R1_18_4.0.root 4 NSFIL SSIZ MSIZ DSIZ 4 76506078 75924341 193912 OK adding F00037019_0000.spill.bntp.R1_18_4.0.root 4 NSFIL SSIZ MSIZ DSIZ 4 24714683 24232176 160835 OK adding F00037019_0000.spill.sntp.R1_18_4.0.root 4 NSFIL SSIZ MSIZ DSIZ 4 11188674 10783956 134906 OK adding F00037000_0000.all.sntp.R1_18_4.0.root 24 NSFIL SSIZ MSIZ DSIZ 24 486227081 484040886 95051 OK adding F00037000_0000.spill.bntp.R1_18_4.0.root 24 NSFIL SSIZ MSIZ DSIZ 24 62424368 59987997 105929 OK adding F00037000_0000.spill.sntp.R1_18_4.0.root 24 NSFIL SSIZ MSIZ DSIZ 24 30507880 28106032 104428 Test this for near, ./roundup.20061208 -r cedar -n near OK adding N00010589_0000.cosmic.sntp.cedar.0.root 19 OK adding N00010592_0000.cosmic.sntp.cedar.0.root 24 OK adding N00010733_0000.cosmic.sntp.cedar.0.root 3 OK adding N00010755_0000.cosmic.sntp.cedar.0.root 2 OK adding N00010772_0000.cosmic.sntp.cedar.0.root 4 OK adding N00010789_0000.cosmic.sntp.cedar.0.root 3 OK adding N00010794_0000.cosmic.sntp.cedar.0.root 5 OK adding N00010801_0000.cosmic.sntp.cedar.0.root 13 OK adding N00010822_0000.cosmic.sntp.cedar.0.root 14 OK adding N00010847_0000.cosmic.sntp.cedar.0.root 3 OK adding N00010855_0000.cosmic.sntp.cedar.0.root 14 OK adding N00010864_0000.cosmic.sntp.cedar.0.root 9 OK adding N00011271_0000.cosmic.sntp.cedar.0.root 12 OK adding N00010755_0000.cosmic.sntp.cedar.0.root 2 NSFIL SSIZ MSIZ DSIZ 2 53056255 52786823 269432 OK adding N00010733_0000.cosmic.sntp.cedar.0.root 3 NSFIL SSIZ MSIZ DSIZ 3 82046883 81600948 222967 OK adding N00010789_0000.cosmic.sntp.cedar.0.root 3 NSFIL SSIZ MSIZ DSIZ 3 72441275 71982343 229466 OK adding N00010847_0000.cosmic.sntp.cedar.0.root 3 NSFIL SSIZ MSIZ DSIZ 3 67450420 67054480 197970 OK adding N00010847_0000.spill.sntp.cedar.0.root 3 NSFIL SSIZ MSIZ DSIZ 3 155247257 154631823 307717 OK adding N00010772_0000.cosmic.sntp.cedar.0.root 4 NSFIL SSIZ MSIZ DSIZ 4 106806743 106075425 243772 OK adding N00010794_0000.cosmic.sntp.cedar.0.root 5 NSFIL SSIZ MSIZ DSIZ 5 134388261 133632186 189018 OK adding N00010794_0000.spill.sntp.cedar.0.root 5 NSFIL SSIZ MSIZ DSIZ 5 254050313 253037704 253152 OK adding N00010864_0000.cosmic.sntp.cedar.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 245878476 244307392 196385 OK adding N00010864_0000.spill.sntp.cedar.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 317547287 316161340 173243 OK adding N00011271_0000.cosmic.sntp.cedar.0.root 12 NSFIL SSIZ MSIZ DSIZ 12 348282322 346803428 134444 OK adding N00011271_0000.spill.sntp.cedar.0.root 12 NSFIL SSIZ MSIZ DSIZ 12 892507773 889431218 279686 OK adding N00010822_0000.cosmic.sntp.cedar.0.root 14 NSFIL SSIZ MSIZ DSIZ 14 409630441 407233416 184386 OK adding N00010822_0000.spill.sntp.cedar.0.root 14 NSFIL SSIZ MSIZ DSIZ 14 1010879902 1007664431 247343 OK adding N00010589_0000.cosmic.sntp.cedar.0.root 19 NSFIL SSIZ MSIZ DSIZ 19 542678607 539639362 168846 OK - 3700 Mbytes in 1 runs BIG - Splitting due to size 2269580593 OK adding N00010589_0000.spill.sntp.cedar.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 2029993078 2026219894 471648 OK adding N00010589_0009.spill.sntp.cedar.0.root 10 NSFIL SSIZ MSIZ DSIZ 10 1670337440 1667239859 344175 OK adding N00010592_0000.cosmic.sntp.cedar.0.root 24 NSFIL SSIZ MSIZ DSIZ 24 721184487 717231223 171881 Max observed DSIZ ( difference/(secs-1) ) is under 500 KB. Should run with DSIZ limit of 2 MBytes,to be really generous. 
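To spell out the check these numbers motivate : the concatenated file should fall short of the summed input sizes by no more than about 2 MBytes per subrun boundary. A sketch of the arithmetic for one case, using the F00037000 all.sntp numbers above; reading the 'secs-1' in the DSIZ definition as (number of subrun files - 1) is an assumption :

    NSFIL=24                      # input subrun files
    SSIZ=486227081                # summed input size
    MSIZ=484040886                # size of the hadd output
    LIMIT=2097152                 # proposed 2 MByte per-subrun limit
    DSIZ=$(( ( SSIZ - MSIZ ) / ( NSFIL - 1 ) ))
    echo "NSFIL ${NSFIL} SSIZ ${SSIZ} MSIZ ${MSIZ} DSIZ ${DSIZ}"
    if [ ${DSIZ} -gt ${LIMIT} ] ; then
        echo "OOPS - losing ${DSIZ} bytes per subrun, over the ${LIMIT} limit"
    else
        echo "OK - DSIZ ${DSIZ} is within the ${LIMIT} limit"
    fi

On the F00037000 numbers this gives DSIZ 95051, matching the log line above.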
=============================================================================
2006 12 07
###########
# ROUNDUP #
###########
roundup.20061202
Added FREETMP calculation
Added DCFLIM file limit based on DCache write queue length
Added POOLMIN DCache write pool limit, at most 1 pool may be inactive.
SRV1> ./dfarmsum
Thu Dec 7 19:04:41 CST 2006
OK - 6573 files , 232 GBytes in near
2568 files , 144 GBytes in near cedar
4005 files , 88 GBytes in near R1_18_4
OK - 8176 files , 44 GBytes in far
2368 files , 26 GBytes in far cedar
5808 files , 17 GBytes in far R1_18_4
SRV1> ./roundup.20061202 -r R1_18_4 near
############
# predator #
############
11:32 Catch up with ./predator 2006-11
This has lots to process, following 20 Nov, so crontab -r
#######
# X11 #
#######
scanned gimp, some hangups
minos06 Thu Dec 7 11:08:34 CST 2006
minos18 Thu Dec 7 11:11:31 CST 2006
###########
# NETWORK #
###########
Ganglia of minos-mysql1 suggests outage was 06:37 thru 06:52
Email to net suggests partial outages 06:32 through 08:00
Network was not really up at 07:05,
ssh to minos26 from off site hung with no response. Succeeded on second try.
MINOS26 > minos
-bash: /afs/fnal.gov/files/code/e875/general/minossoft/setup/setup_minossoft_FNALU.sh: Connection timed out
MINOS26 > ls /afs/fnal.gov/files/code/e875/general/minossoft/setup/setup_minossoft_FNALU.sh
/afs/fnal.gov/files/code/e875/general/minossoft/setup/setup_minossoft_FNALU.sh
MINOS26 > minos
leaving crontab disabled for minosora1 Oracle upgrade to 10.2.0.2
#######
# SAM #
#######
Production minosora1 upgraded to 10.2.0.2, complete about 09:43
Restarted dbserver at 10:32.
Tested station, OK
MINOS26 > sam ping dbserver
The server 'SAMDbServer.prd:SAMDbServer' is alive.
MINOS26 > sam locate foo
RetryHandler.getReplicaLocationList('foo')> will retry in 18.33 seconds
Datafile with name 'foo' not found.
MINOS26 > ./sam_test_py minos
MINOS26 > sam get metadata --file=F00031300_0000.mdaq.root
=============================================================================
2006 12 06
#########
# mysql #
#########
Need to correct grants for reader, reader_old, writer etc.
Recently broken for dbu by a wildcard change by nwest,
but then we should not have been writing using reader_old.
###########
# ROUNDUP #
###########
SRV1> dfarm usage rubin
Used: 532463 + Reserved: 0 / Quota: 1000000 (MB)
SRV1> df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdb3 485G 145G 340G 30% /export/stage
Purge written files
./roundup -r cedar -w far
./roundup -r cedar -w near
SRV1> df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdb3 485G 20G 465G 5% /export/stage
=============================================================================
2006 12 05
###########
# ROUNDUP #
###########
SRV1> date
Tue Dec 5 08:27:40 CST 2006
SRV1> dfarm usage rubin
Used: 720934 + Reserved: 0 / Quota: 1000000 (MB)
SRV1> ./dfarmsum
Tue Dec 5 08:28:13 CST 2006
OK - 8298 files , 305 GBytes in near
4293 files , 217 GBytes in near cedar
4005 files , 88 GBytes in near R1_18_4
OK - 7957 files , 41 GBytes in far
2149 files , 23 GBytes in far cedar
5808 files , 17 GBytes in far R1_18_4
08:49
SRV1> kinit -R
./roundup -r cedar near
###########
# ENSTORE #
###########
ENSTORE - vet empty tapes for recycling per berg request 21 Nov 2006
These are all reco_near_R1_18_4.cpio_odc tapes, 9940B
VOLS='VO7416 VOB684 VOB685 VOB688 VOB691 VOB693 VOB695 VOB699 VOB701 VOB714 VOB724 VOB729 VOB732 VOB738 VOB739'
for VOL in ${VOLS} ; do enstore info --list="${VOL}" | less ; done
None of these files have names, all are deleted.
for VOL in ${VOLS} ; do enstore info --gvol="${VOL}" | less ; done
last_access ranges from 21 through 31 July 2006
=============================================================================
2006 12 04
###########
# ROUNDUP #
###########
roundup aborted on a dfarm read error, not clearing the disks.
From the log,
OK adding F00033174_0000.spill.bntp.cedar.0.root 1
Transfer initiation timeout
OOPS - failed to dfarm get /minos/farcat/F00033174_0000.spill.bntp.cedar.0.root
BAILING Sat Dec 2 21:23:40 CST 2006
Last previous timestamp was about 21:17.
99% dfarm capacity as of this morning.
SRV1> ./roundup -r cedar -w far
SRV1> ./dfarmsum
Mon Dec 4 09:04:24 CST 2006
OK - 8133 files , 294 GBytes in near
4128 files , 206 GBytes in near cedar
4005 files , 88 GBytes in near R1_18_4
OK - 22215 files , 197 GBytes in far
16407 files , 179 GBytes in far cedar
5808 files , 17 GBytes in far R1_18_4
SRV1> dfarm usage rubin
Used: 991707 + Reserved: 0 / Quota: 1000000 (MB)
( after about 4.5 GB had been recovered)
So I think we did not quite hit 100% . But we came very very close.
Or maybe we did hit 100%, as dfarm stores 2 copies of each file.
Checking status of that file :
SRV1> dfarm get /minos/farcat/F00033174_0000.spill.bntp.cedar.0.root TEST.DAT
Transfer initiation timeout
SRV1> time dfarm get /minos/farcat/F00033174_0000.spill.bntp.cedar.0.root TEST.DAT
Transfer initiation timeout
real 5m0.141s
user 0m0.110s
sys 0m0.070s
This dfarm cp has been failing since 27 Sep.
Only recently did this cause a bailout from the roundup script.
I have removed the offending file from dfarm,
as dfarm is about to be retired, and there is no point pursuing this.
dfarm rm /minos/farcat/F00033174_0000.spill.bntp.cedar.0.root
The cleanup helped somewhat, as of 12:30
SRV1> dfarm usage rubin
Used: 745531 + Reserved: 0 / Quota: 1000000 (MB)
14:00 - files are moving to tape pretty well
Running another pass on cedar/far, to finish it up.
There is plenty of space in the ROUNTMP area, 341 GB free.
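Before kicking off that cedar/far pass, a quota check along the following lines could flag another near-100% dfarm fill earlier. This is only a sketch, assuming the 'Used: ... / Quota: ... (MB)' line printed by dfarm usage above; the 90 percent threshold and the warning text are invented here, not part of any existing script.

# Sketch only - warn when dfarm usage for rubin nears quota.
# Expects output of the form: Used: 745531 + Reserved: 0 / Quota: 1000000 (MB)
USAGE=`dfarm usage rubin`
USED=`echo ${USAGE} | awk '{print $2}'`
QUOTA=`echo ${USAGE} | awk '{print $8}'`
PCT=$(( 100 * USED / QUOTA ))
echo "dfarm rubin at ${PCT}% of quota"
if [ ${PCT} -ge 90 ] ; then
    echo "WARNING - dfarm nearly full, purge written files with roundup -w"
fi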
./roundup -r cedar far
=============================================================================
2006 12 02
###########
# ROUNDUP #
###########
DFARM is getting full, 75/79/85 % on Thu/Fri/Sat
created dfarmsum on fnpcsrv1 :
SRV1> ./dfarmsum
Sat Dec 2 12:40:19 CST 2006
OK - 6974 files , 238 GBytes in near
OK - 22163 files , 196 GBytes in far
SRV1> ./dfarmsum
Sat Dec 2 12:58:36 CST 2006
OK - 6974 files , 238 GBytes in near
2969 files , 150 GBytes in near cedar
4005 files , 88 GBytes in near R1_18_4
OK - 22163 files , 196 GBytes in far
16355 files , 179 GBytes in far cedar
5808 files , 17 GBytes in far R1_18_4
Make space for the weekend :
SRV1> ./roundup -r cedar far
###########
# CRONTAB #
###########
crontab crontab.dat
Apparently off since 15 Nov ( kreymer on vacation since 23 Nov )
=============================================================================
2006 11 21
########
# GRID #
########
SRV1> pwd
/export/stage/minfarm/ROUNDUP_TEST/TEST
SRV1> grid-proxy-init -cert kreymer-doe.pem -key kreymer-doekey.pem
Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310
Enter GRID pass phrase for this identity:
Creating proxy .................................. Done
Your proxy is valid until: Tue Nov 21 22:51:32 2006
SRV1> setup dcap -q x509
IFILE=N00004502_0000.mdaq.root
IPATH=minos/neardet_data/2004-11
DCPOR=24525
DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE}
SFILE=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/${IPATH}/${IFILE}
SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/${IPATH}
srmcp $SFILE file:////export/stage/minfarm/ROUNDUP_TEST/TEST/TEST.dat
srmcp $SFILE file:///TEST.dat
N.B. 2006 12 12 should instead use
srmcp -streams_num=1 -server_mode=active $SFILE file:///TEST.dat
srmls ${SPATH} - fails ?
SRV1> dccp dcap://fndca1.fnal.gov:24536/pnfs/fnal.gov/usr/minos/sim_root/far/camb_cosmic/bfld201/cosmic_mu_r651.root TEST.dat
731987214 bytes in 33 seconds (21661.55 KB/sec)
Per timur, trying a newer version of srm client, v1_24 versus v1.23 on fnpcsrv1
MINOS26 > pwd
/local/scratch26/kreymer/SRM
MINOS26 > curl https://srm.fnal.gov/twiki/pub/SrmProject/SrmcpClient/srmcp_v1_24_NULL.tar -o srm.tar -k
MINOS26 > tar xfv srm.tar
########
# nscd #
########
verified that nscd is off on all minos cluster nodes
/sbin/chkconfig --list nscd
#######
# SRM #
#######
=============================================================================
2006 11 17
##########
# DCACHE #
##########
Security scans were causing x509 door failures, disabled for now.
=============================================================================
2006 11 16
#######
# SAM #
#######
12:25 - restarted prd dbserver ( sam locate foo hung up )
after the production database and OS patches today ( done by 10:00 )
MINOS26 > ./sam_test_py minos prd
##########
# DCACHE #
##########
#########
# USERS #
#########
Made directory for user files in PNFS
cd /pnfs/minos
mkdir users
chmod 775 users
Perhaps should change this back to 755. In fact, did so on 17 Nov.
=============================================================================
2006 11 15
Registered DOE grid certificate for access to CD forms ( vacation request, etc )
X11 - clean scan
############
# SHUTDOWN #
############
Need to shut down servers around 05:30, to match the Enstore/DCache shutdown
MINOS-SAM01 > echo '.
./samstop > samstop.log 2>&1' | at 05:30
job 7 at 2006-06-07 03:00
MINOS26 > echo 'crontab -r' | at 05:30
=============================================================================
2006 11 14
#######
# X11 #
#######
minos19 Tue Nov 14 14:43:52 CST 2006
#########
# VAULT #
#########
per HOWTO.vault
VMON=2006-10
for DET in far near; do ./vault ${DET} ${VMON} ; done
Start 15:37:02 CST 2006
Finish 23:05
=============================================================================
2006 11 13
#######
# X11 #
#######
minos19 Mon Nov 13 14:06:49 CST 2006
=============================================================================
2006 11 10
X11 scan - clean for both acroread and gimp
15:20 CHECKLIST - write queues gradually up over 1000,
starting around 01:00 yesterday
perhaps p929 NOVA
=============================================================================
2006 11 09
X11 scan - clean for both acroread and gimp
###########
# SLF 4.4 #
###########
Ran genpy per HOWTO.genpy on minos25, looks OK
###########
# ROUNDUP #
###########
Remove Suppressed subruns from the RAWS list, for cleaner keepup
Testing RUNN=00011116
Corrected typo in AUTODEST ( had been using stale STRP )
Added up-front removal of SUPPRESSED subruns from RAWS list,
for cleaner keepup running.
SRV1> ln -sf roundup.20061109 roundup # was roundup.20061018
Catchup !!!
SRV1> ./roundup -r R1_18_4 near
SRV1> ./roundup -r R1_18_4 far
=============================================================================
2006 11 08
#######
# X11 #
#######
Around 09:00, gimp scan stuck on 06 07 17.
NODES="minos01 minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10 minos11 minos12 minos13 minos14 minos15 minos16 minos17 minos18 minos19 minos20 minos21 minos22 minos23 minos24 minos25 minos26"
CNODES="fcdflnx4 fcdflnx5 fcdflnx6 fcdflnx7 fcdflnx8 fcdflnx9"
UNODES="flxi02 flxi03 flxi04 flxi05 flxi06"
for NODE in $UNODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'mkdir -p /var/tmp/kreymer/.gimp-1.2' ; done
for NODE in $UNODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'echo gimp;gimp;echo done' ; done
Scans clean on cdf and FNALU nodes
###########
# ROUNDUP #
###########
Issue 2366
I cannot select minos files using the lum_min and lum_max dimensions.
MINOS26 > sam list files --dim=" DATA_TIER = raw-far and LUM_MIN = 288120000 "
ORA-00904: "DATA_FILES"."LUM_MIN": invalid identifier
Looking at the LUM_MIN dimension in the database browser,
it seems that this dimension is in the data_files table,
rather than the data_files_lumblocks table as in CDF and D0.
Is there a problem with our schema or database initialization ?
See
http://dbb2.fnal.gov:8520/cdfr2/databases?smdim=LUM_MIN&dimorder=%2Bdim&skip=&limit=125&rc=n&email=&type=sam-dim&do=r&nsrc=cdfofpr2&fsrc=cdfofpr2&gsrc=cdfofpr2&rc=n&dcbk=FILECATALOG
http://dbb.fnal.gov:8520/minos/databases?smdim=LUM_MIN&dimorder=%2Bdim&skip=&limit=125&rc=n&email=&type=sam-dim&do=r&nsrc=&fsrc=minosprd&gsrc=minosprd&rc=n&dcbk=FILECATALOG
Improve reporting of missing subruns
Scan for suppressed subruns, report as such
SRV1> cat /home/minfarm/lists/daq_lists/sup/*.sup | grep -v Failed | wc -l
298
#########
# BATCH #
#########
minos queue is set up, feeding minos19-24
bsub -q minos "cat /proc/cpuinfo"
=============================================================================
2006 11 07
#######
# X11 #
#######
At around 13:50, a scan with gimp revealed problems on
minos05 - several logged in, system idle, 188076k free memory
minos07 - several logged in, system busy running 1 loon process, 2689772k free
minos08 - root logged in, system idle, 3050428k free
minos14 - nobody logged in, system idle, 98968k free
minos23 - two logged in, system idle, 775776k free
###########
# ROUNDUP #
###########
setup sam -q dev
IFIL=F00028812_0001
sam verify metadata --descriptionFile=${IFIL}.sam.py
sam declare file ${IFIL}.sam.py
sam get metadata --file=$IFIL.mdaq.root
sam add location --fileName=F00028812_0001.mdaq.root --loc=/pnfs/minos/fardet_data/2005-01
MINOS26 > sam list files --dim=" DATA_TIER = raw-far and run_number = 28812 "
Files:
F00028812_0001.mdaq.root
F00028812_0000.mdaq.root
MINOS26 > sam list files --dim=" DATA_TIER = raw-far and LUM_MIN = 288120000 "
ORA-00904: "DATA_FILES"."LUM_MIN": invalid identifier
fcdflnx5 > sam list files --dim="data_tier = raw and LUM_MIN = 8877899795 and LUM_MAX = 8877899795"
Files:
br02112a.0013phys
fcdflnx5 > sam list files --dim="data_tier = raw and LUM_MIN = 8877899795"
Files:
br02112a.0013phys
Need to file a SAM issue for this: why can I not select on LUM_MIN ?
TO CLEANUP WILL DO
sam erase file location --fileName=F00028812_0001.mdaq.root --loc="/pnfs/minos/fardet_data/2005-01(v01234.5)"
sam undeclare file $IFIL.mdaq.root
=============================================================================
2006 11 06
##########
# DCACHE #
##########
kennedy noted door 0 stuck, restarted Sunday 11/05 around 17:53
##########
# K5PALL #
##########
Created k5pall script on desktop, to push keys to all ssh sessions,
using ${HOME}/k5push. Presumes connections are made via one of
ssh host
ssh -l user host
###########
# ROUNDUP #
###########
Added MISSING printout for missing subruns
Reviewing content :
./roundup.20061101 -r R1_18_4 -f 10 -W -n near 2>&1 | tee /tmp/missing
MISS=`grep MISSING /tmp/missing | tr -s ' ' | cut -f 3 -d ' '`
for MIS in $MISS ; do sam locate $MIS ; done
159 files total
Some of the missing subruns are
N00011099_0001.cosmic.sntp.R1_18_4.0.root
N00010816_0000.spill.sntp.R1_18_4.0.root
N00010840_0001.spill.sntp.R1_18_4.0.root
N00010852_0010.spill.sntp.R1_18_4.0.root
N00010867_0018.spill.sntp.R1_18_4.0.root
...
None of these are in SAM.
Trouble in run N00010912, due to reprocessing ( 0 and 1 versions )
This fouls up the gap calculations.
Detecting this by test for DELT=0 and veto for manual cleanup,
Now test fardet, more aggressive purging ( 2 days )
./roundup.20061101 -r R1_18_4 -f 2 -W -n far 2>&1 | tee /tmp/missingf
MISSF=`grep MISSING /tmp/missingf | tr -s ' ' | cut -f 3 -d ' '`
for MIS in $MISSF ; do sam locate $MIS ; done
Several runs missing many subruns, from 1 or small numbers through 24/25.
And many are in SAM.
Probably due to my too aggressive purging
Looking again with 10 day flush, most of these are F0003619, plus
F00036753_0010.spill.bntp.R1_18_4.0.root Oct 24
F00036551_0009.spill.sntp.R1_18_4.0.root Sep 7
SRV1> dfarm ls /minos/farcat/F00036753*.bntp.*
...
frwrw 2 rubin 6636701 10/24 11:43:24 /minos/farcat/F00036753_0009.spill.bntp.R1_18_4.0.root
frwrw 2 rubin 8303375 10/24 00:44:58 /minos/farcat/F00036753_0011.spill.bntp.R1_18_4.0.root
SRV1> dfarm ls /minos/farcat/F00036551_*.spill.sntp.R1_18_4.0.root
...
frwrw 2 rubin 1164444 09/07 07:29:53 /minos/farcat/F00036551_0008.spill.sntp.R1_18_4.0.root
frwrw 2 rubin 988439 09/07 04:52:04 /minos/farcat/F00036551_0010.spill.sntp.R1_18_4.0.root
#######
# X11 #
#######
gimp scan, minos08/14 sticking
minos14 is quite idle, just a couple of idle interactive logins
=============================================================================
2006 11 03
#########
# STAGE #
#########
stage.20061012 hacked to correct restore queue feedback ( changed 8/10 to 5/7 )
staging ran amok, as the restore/queue feedback was misaligned;
it was reading store, not restore, quantities.
This produced a peak restore queue of about 900 last night, before midnight.
The stage queue went over 1000, at http://fndca3a.fnal.gov/dcache/logins//stage.jpg.
The scripts seem to have gotten stuck around 03:27
per /local/scratch26/kreymer/log/stage/VOB057.20061103.log
Door 1 logins peaked at about 28, around 3 am
Door 0 logins remained under 10 till around 3 am
Dcache services show door 0 offline
Restart staging with VOB057 :
REVOLS="VOB057 VOB428 VOB441 VOB612 VOB641 VOB727 VOB884"
for VOL in ${REVOLS} ; do ./stage -w -s 'spill.cand' ${VOL} ; done
##########
# DCACHE #
##########
Verified that door 0 is down, with dccp from port 24136
Adding -o 10 ( 10 second open timeout ) did not help, the dccp remained inactive.
dccp from 24136 succeeded.
#########
# GENPY #
#########
Switched from port 24125 to 24136, made this a variable in genpy
ln -sf genpy.20061103 genpy # was genpy.20060714
Killed some dbu processes for neardet_data, to get predator unstuck
at about 09:58
###########
# ROUNDUP #
###########
Added CHART section to end of roundup.20061101
this explains critical variables
did some cleanup
Test writing runs with trailing missing subrun,
./roundup.20061101 -r R1_18_4 -n near
./roundup.20061101 -r R1_18_4 -s N00010777 -n near
Concatenate a run with one subrun missing at end :
./roundup.20061101 -r R1_18_4 -s N00010777_ -f 24 -W near
works !
Implemented the -f
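For the next time a door goes quiet ( 2006 11 03 DCACHE above ), a small probe like the one below could confirm which dcap ports answer before pointing genpy at one. This is only a sketch: it assumes dccp returns a nonzero exit status when the copy fails, it reuses the -o 10 open timeout tried above, and the test file is just the raw file from the 2006 11 07 SAM tests, picked here as a convenient placeholder.

# Sketch only - probe the dcap ports mentioned above ( 24125, 24136 ).
# PFILE is a placeholder; any small file known to PNFS should do.
PFILE=/pnfs/fnal.gov/usr/minos/fardet_data/2005-01/F00028812_0001.mdaq.root
for DPOR in 24125 24136 ; do
    printf "port %s " "${DPOR}"
    if dccp -o 10 dcap://fndca1.fnal.gov:${DPOR}${PFILE} /tmp/doortest.dat > /dev/null 2>&1 ; then
        echo "OK"
    else
        echo "no response"
    fi
    rm -f /tmp/doortest.dat
done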