2009 02 04 ################################################################## SHORT TERM PROJECTS ################################################################## ______________________________________________________________ OPEN TICKETS 125925 aklog of kcron tickets under SLF 4.7 124576 Account and group alignment _____________________________________________________________ ACTIVE - _____________________________________________________________ CONDOR - 7.2.0 upgrade on cluster, to fix write to Iwd of .proxy FARM running - on some nodes, lacking libxml2-2.6.16-12.6 dlopen error: libxml2.so.2: cannot open shared object file: No such FARM duplicates in nearcat / cedar, from 1 Nov . AFS timeouts - pursue minos-mysql1 timeouts correlated to nwest ssh connx Check 'WasIScanned' at security web page Test DCache unsecured door capacity, OK to 4000 ? use root -b to hold files open, Rename volumes d141 - testups d199 - testminossoft and test mk/rm and rename impact on running processes Subject: HelpDesk ticket 115219 has additional info. Short Description: Cannot write via dcap -q x509 using Howard Rubin proxy CRL - reproduce and report multi-header and java crash issues dc2nfs -cedar far .bntp and sntp all months, to catch up. See notes 5/25 bluwatch add /grid/data , /minos/data2 /minos/scratch monitoring Add write-mode monitoring libssl.so.4 on flxb35 ( 64 bit ) report, send advice to skip this node ############################################################################# ############################################################################# W O R K L O G ############################################################################# ############################################################################# ============================================================================= 2009 02 06 ============================================================================= ######## # DATA # ######## Date: Thu, 05 Feb 2009 21:29:12 -0600 From: ssa-group@fnal.gov The libraries for stken: LTO3, LTO4F1 and LTO4G1 are currently down. The cause is not yet known, but somebody is working on investigating and we will post more info as we are aware. __________________________________________________________________________ Date: Fri, 06 Feb 2009 01:18:43 -0600 From: ssa-group@fnal.gov The disruption of service to stken libraries has been fixed and all libraries are back in production. 
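A quick way to survey volume state after an outage like this is to loop enstore info over the suspect volumes; a minimal sketch (volume names are examples, and the python-style dict output is only grepped, not parsed):

# Spot check of Minos volume state after the Enstore outage.
# Volume names are illustrative.
VOLS='VO2307 VO2432 VO4335 VO8536 VOK237'
for VOL in ${VOLS} ; do
  echo ${VOL}
  enstore info --vol=${VOL} | egrep "'library'|'sum_mounts'|'sum_rd_err'|'sum_wr_err'"
done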
__________________________________________________________________________ I see many Minos data tapes in no-access : VO2307 0.04GB (NOACCESS 0205-2223 full 0209-0621) CD-9940B minos.neardet_data.cpio_odc VO2432 45.58GB (NOACCESS 0205-2224 readonly 0909-1538) CD-9940B minos.fardet_data.cpio_odc write-protected VO4335 49.10GB (NOACCESS 0205-2225 readonly 0511-1029) CD-9940B minos.fardet_data.cpio_odc VO8536 55.29GB (NOACCESS 0205-2225 readonly 0311-1010) CD-9940B minos.fardet_data.cpio_odc Copied to new media 03/14/06 VOK237 361.69GB (NOACCESS 0205-2202 none 0526-0123) CD-LTO3 minos.reco_far_cedar_cand.cpio_odc VO4209 177.36GB (NOTALLOWED 0511-1347 none 0831-0853) CD-9940B minos.reco_near_cedar_phy_mrnt.cpio_odc BOT overwritten 05/11/07 VO4956 0.37GB (NOTALLOWED 0112-1046 migrating 0106-0720) 9940 minos.caldet_data.cpio_odc Volume is missing 01/06/2009, not in drop slot VOB445 200.00GB (NOTALLOWED 0908-1806 none ) CD-9940B minos.beam_data.cern Cannot seem to write a label; needs investigation VOK330 331.09GB (NOTALLOWED 0731-1124 readonly 0716-1115) CD-LTO3 minos.reco_far_cedar_bcnd.cpio_odc Volume needs to be cloned due to repeated errors Tapes with raw data : V02307, staging in ND raw data. caught up in Enstore restart ? 'sum_mounts': 312, 'sum_rd_err': 2, Thu Feb 5 22:23:57 CST 2009 VO2432 minos.fardet_data.cpio_odc 'last_access': 1233894269.0, 'library': 'CD-9940B', 'sum_mounts': 3, 'sum_rd_err': 2, Thu Feb 5 22:24:29 CST 2009 VO4335 minos.fardet_data.cpio_odc 'sum_mounts': 996, 'sum_rd_err': 2, 'sum_wr_err': 1, Thu Feb 5 22:25:36 CST 2009 VO8536 minos.fardet_data.cpio_odc 'sum_mounts': 133, 'sum_rd_err': 2, 'sum_wr_err': 3, Thu Feb 5 22:25:00 CST 2009 __________________________________________________________________________ Date: Fri, 06 Feb 2009 17:28:41 -0600 (CST) Subject: HelpDesk ticket 128947 Short Description: STKEN Minos tapes in NOACCESS Problem Description: enstore-admin : There are four Minos tapes that have gone NOACCESS recently at http://www-stken.fnal.gov/enstore/tape_inventory/NOACCESS The tapes all show last access times around Thu Feb 5 22:23:57 CST 2009 . The time was during last night's unscheduled Enstore outage. Please restore access to these tapes. ___________________________________________ This ticket is assigned to HICKS, STAN of the CD-SF/DMS/DSC/SSA. ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ ######## # FARM # ######## Date: Thu, 05 Feb 2009 17:07:07 -0600 Subject: cleanup All but one cedar_phy_bhcurv near MC (a helium job) have completed their reruns, so you could rerun the concatenation. I'm currently running the far failures (57 of them plus). In light of this I'm going to rerun all cedar and dogwoodtest4 data failures. ============================================================================= 2009 02 05 ============================================================================= ########## # DCACHE # ########## W A R N I N G Our raw data seems to be getting migrated to LTO-4 tape : Reading VOO107(neardet_data-MIGRATION) using LTO4_112.mover from stkendca24a by root This may really foul up our raw data restoration, as the files may no longer be in tape-order, if they are currently being moved. This may also destroy our policy of having primary and vault copies in different buildings. 
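To see whether the migration is actually scattering raw files across volumes, the tape label for each file can be read from PNFS; a sketch assuming the usual Enstore layer-4 layout, with the volume label on the first line (the directory is an example):

# List the tape label for each raw file in one month, sorted by volume,
# to check whether migration is breaking tape order.
DIR=/pnfs/minos/neardet_data/2009-01
cd ${DIR}
for FILE in N*.mdaq.root ; do
  VOL=`head -1 ".(use)(4)(${FILE})"`
  printf '%-10s %s\n' "${VOL}" "${FILE}"
done | sort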
######## # ENCP # ######## Make current the version of encp having enmv MINOS26 > date Thu Feb 5 15:09:39 CST 2009 MINOS26 > ups declare -c encp v3_7d WARNING: Unless you know what you are doing, use a qualifier in your ups declare command! ########### # ENSTORE # ########### Date: Thu, 05 Feb 2009 14:38:52 -0600 From: ssa-group@fnal.gov To: cdweb@fnal.gov, helpdesk@fnal.gov, oleynik@fnal.gov, stan@fnal.gov, wolbers@fnal.gov, crawdad@fnal.gov, white@fnal.gov, moibenko@fnal.gov, timur@fnal.gov, stk-users@fnal.gov, cms-t1@fnal.gov, fermigrid-announce@fnal.gov, enstore-admin@fnal.gov Subject: Announcement: Service disruption for enstore on stken for a duration of 30 min The media_changer needs some work and we are backing out of the previous upgrade. We should be down for about 30 minutes. Please stand-by... Stanley __________________________________________________________ Date: Thu, 05 Feb 2009 15:20:53 -0600 The libraries are back in production. Total outage was only 20 minutes. __________________________________________________________ __________________________________________________________ ############ # MCIMPORT # ############ Date: Thu, 05 Feb 2009 14:20:38 -0500 From: Daniel D. Cherdack The next decade (75*) of Daikon06 singles is ready for import. __________________________________________________________ touch STAGE/cherdack/MCIMPORT __________________________________________________________ ######### # STAGE # ######### Running this on minos27, for speed ./volumes vols NVOLS=`./volumes neardet_data` { for VOL in ${NVOLS} ; do ./stage -w -g q ${VOL} done ; } > /minos/scratch/kreymer/log/stage/ndstage.log 2>&1 & STARTING Thu Feb 5 13:01:00 CST 2009 ####### # EVO # ####### Send HELP error report regarding window scroll bar sliders needing decoration. ######## # FARM # ######## ./roundup -p -r cedar_phy_bhcurv mcnear SRV1> ./roundup -b 2000 -r cedar_phy_bhcurv mcnear Thu Feb 5 12:46:18 CST 2009 ######## # FARM # ######## Why are a large number of these files not declared to sam ? Like n13047122_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root ########## # DCACHE # ########## Date: Thu, 05 Feb 2009 08:56:46 -0600 (CST) Subject: HelpDesk ticket 128816 Short Description: FNDCA PNFS is down Problem Description: enstore-admin, dcache-admin : The FNDCA PNFS server is down. This takes down the public DCache system. There is scheduled Enstore maintenance today. But the announcement stated : The following services will be available. - Stken PNFS. - Public dCache. - The rest of Enstore. Therefore I assume that this is unrelated to the maintenance. ___________________________________________ This ticket is assigned to HICKS, STAN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Thu, 05 Feb 2009 09:07:04 -0600 (CST) Note To Requester: swhicks@fnal.gov sent this Notes To Requester: Art, We mistakenly took the pnfs manager down as part of our maintenance. It has since been brought back up and you should see dcache back to normal. Please let us know if this doesn't happen right away. Sorry for the inconvenience. ___________________________________________ ___________________________________________ 02/05 08:53:29 PNFS manager is back up Daq logging failed with messages in Recent FTP like 2009-02-5 08:49:10 451 Operation failed: PANIC : Unexpected message arrived class dmg.cells.nucleus.NoRouteToCellException ... 
2009-02-5 07:27:13 last success 2009-02-5 08:06:45 first failure ########## # DCACHE # ########## To : minos_software_discussion@fnal.gov Cc : minos-data@fnal.gov Attchmnt: Subject : Fermilab DCache unscheduled outage this morning ----- Message Text ----- It appears that between about 07:30 and 08:55 this morning, the DCache PNFS manager went offline, taking DCache down. I have submitted a helpdesk ticket. Here is the reply : We mistakenly took the pnfs manager down as part of our maintenance. It has since been brought back up and you should see dcache back to normal. Please let us know if this doesn't happen right away. Sorry for the inconvenience. ########### # ENSTORE # ########### Date: Wed, 04 Feb 2009 15:51:26 -0600 From: ssa-group@fnal.gov To: cdweb@fnal.gov, helpdesk@fnal.gov, oleynik@fnal.gov, stan@fnal.gov, wolbers@fnal.gov, crawdad@fnal.gov, white@fnal.gov, moibenko@fnal.gov, timur@fnal.gov, d0en-announce@fnal.gov, stk-users@fnal.gov, cms-t1@fnal.gov, fermigrid-announce@fnal.gov, cdf_dh_help@fnal.gov, enstore-admin@fnal.gov Subject: Announcement: Service scheduled outage for enstore on d0en, stken, cdfen for a duration of 4 Hours This is a reminder that Stken and parts of Cdfen and D0en will be down Feb 5, 2009 from 0730 - 1130. The work to be done is: - Replace/repair 2 Bots in SL8500#2. - Update Enstore database. - Enstore code update. The following services will be unavailable. - Stken Enstore. This includes all Stken Library Managers. - CDF-LTO4F1 and D0-LTO4F1 Library Managers during the Bot repair. The following services will be available. - Stken PNFS. - Public dCache. - The rest of Enstore. The Bots will be repaired first. This will allow the CDF and D0 Library Mangers to be available as soon as that work is done. ~2 hours. _____________________________________________________________________ Via CDF JIRA Date: Thu, 05 Feb 2009 10:12:08 -0600 (CST) The libraries at GCC for D0 and cdf have been un-paused and returned to service. You should see your submitted jobs begin to run. The affected libraries for this maintenance were: CDF-LTO3 CDF-LTO4G1 D0-LTO4G1 _____________________________________________________________________ Date: Thu, 05 Feb 2009 11:56:15 -0600 From: ssa-group@fnal.gov We are experiencing a bit of difficulty in getting back into production. We will be over by an additional 30 minutes. _____________________________________________________________________ Date: Thu, 05 Feb 2009 12:25:35 -0600 The maintenance is completed on stken and everything is being brought back into production. Please let us know if you experience any errors as they relate to the libraries and enstore being down. 
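A post-maintenance spot check is worth running at this point; a minimal sketch using the dccptest helper seen later in this log (file names are examples, and nonzero exit on failure is an assumption):

# Pull a couple of known raw files through DCache after the maintenance.
FILES='N00015460_0010.mdaq.root F00042228_0003.mdaq.root'
for FILE in ${FILES} ; do
  ./dccptest ${FILE} || echo "FAILED ${FILE}"
done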
_____________________________________________________________________ _____________________________________________________________________ ============================================================================= 2009 02 04 ============================================================================= ######## # FARM # ######## SRV1> ./roundup -b 6000 -D -r cedar_phy_bhcurv mcnear Wed Feb 4 07:18:52 CST 2009 Wed Feb 4 19:03:00 CST 2009 ########## # DCACHE # ########## Date: Wed, 04 Feb 2009 17:21:40 -0600 (CST) Subject: HelpDesk ticket 128800 ___________________________________________ Short Description: FNDCA Recent FTP Transfers is out of date again Problem Description: dcache-admin : Once again, the Recent FTP Transfer page is out of date, showing only transfers from user pagedcache http://fndca3a.fnal.gov/cgi-bin/dcache_files.py Please correct this, we use this to verify Minos raw data archiving. ___________________________________________ This ticket is assigned to HICKS, STAN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 04 Feb 2009 17:50:37 -0600 (CST) From: HelpDesk Note To Requester: swhicks@fnal.gov sent this Notes To Requester: Art, I have re-opened our bug #189 in which the developers said this was fixed. An email has gone out to them to correct this problem. ___________________________________________ ___________________________________________ ___________________________________________ ######### # STAGE # ######### Raw data far stage. ./volumes vols NVOLS=`./volumes neardet_data` enstore info --vol=VOO267 { for VOL in ${NVOLS} ; do ./stage -d -p 0 -g q ${VOL} done ; } Run the full stage { for VOL in ${NVOLS} ; do ./stage -w -g q ${VOL} done ; } > /minos/scratch/kreymer/log/stage/ndstage.log 2>&1 & ___________________________________________ Date: Wed, 04 Feb 2009 15:38:26 -0600 (CST) Subject: HelpDesk ticket 128790 Short Description: Permisson to reload files to FNDCA RawDataWritePools Problem Description: dcache-admin : The new RawDataWritePools pools are deployed in FNDCA. We have current Pool Directory Listings available I would like to reload all the Minos raw data to this pool group. My scripts for loading the files in optimal tape-order are ready to run. Please send permission from the admins, and I will start these scripts. Standing by ... ___________________________________________ This ticket is assigned to HICKS, STAN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 04 Feb 2009 17:23:46 -0600 (CST) From: HelpDesk Note To Requester: swhicks@fnal.gov sent this Notes To Requester: I thought this would be addressed to the dcache admins (which I don't believe is SSA) and forwarded it on to the Dcache Admin maillist we have; his reply was that he thought it was our call and that he had no problems with this. > > So, I guess I have no problems with your reloading them either. > ___________________________________________ Date: Wed, 04 Feb 2009 23:26:08 +0000 (GMT) From: Arthur Kreymer Thanks for the green light ! I will start the file restores after the Enstore maintenance tommorrow. This ticket can be closed. ___________________________________________ Date: Thu, 05 Feb 2009 15:26:48 -0600 (CST) This has been done. ___________________________________________ ######### # STAGE # ######### stage.20090204 Modify pool option to use suffix ( r w m q )and the files.* summary file, as we no longer have Layer 2 metadata. 
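A sketch of the intended lookup, assuming the daily list.<suffix> summary is consulted in place of Layer 2 (this shows the idea, not the actual stage code; FILE is an example):

# Check whether a file is already on a pool via the daily list.<suffix>
# summary, now that Layer 2 no longer carries pool data. SUF is one of r w m q.
SUF=q
FILE=N00015460_0010.mdaq.root
LISTS=/afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/`date +%Y/%m`
POOL=`grep "^${FILE} " ${LISTS}/list.${SUF} | awk '{print $2}'`
if [ -n "${POOL}" ] ; then
  echo "CACHED ${FILE} ${POOL}"
else
  echo "STAGE  ${FILE}"
fi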
./stage.20090204 -n -g q fardet_data/2009-02 MINOS26 > ln -sf stage.20090204 stage # was stage.20081203 ############ # DATASETS # ############ Added symlink CFL/list.* to listing file MIN > mv datasets.20090116 datasets.20090204 MIN > ln -sf datasets.20090204 datasets ########## # DCACHE # ########## Updating file lists MINOS26 > dcache/datasets q '' '' list Run Wed Feb 4 10:53:08 CST 2009 Data from 04-Feb-2009 06:15 Pool group RawDataWritePools Caches = 12 LIST w-raw-minos-stkendca21a-1.files LIST w-raw-minos-stkendca22a-1.files LIST w-raw-minos-stkendca24a-1.files LIST w-raw-minos-stkendca26a-1.files LIST w-stkendca7a-1.files LIST w-stkendca7a-2.files LIST w-stkendca8a-1.files LIST w-stkendca8a-2.files LIST w-stkendca9a-3.files LIST w-stkendca10a-3.files LIST w-stkendca11a-3.files LIST w-stkendca12a-3.files /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2009/02/list.q ############ # PREDATOR # ############ crc values seem to be 666 since /pnfs/minos/fardet_data/2008-12/F00042259_0018.mdaq.root ######## # DATA # ######## tagg reports corruption of files : Error in : error reading all requested bytes from file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015460_0010.mdaq.root got 16057 of 42564 Error in : error reading all requested bytes from file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2008-11/F00042228_0003.mdaq.root got 8878 of 45732 I can copy these to disk OK. ecrc values are OK. ./dccptest N00015460_0010.mdaq.root '' '' '' keep ./dccptest F00042228_0003.mdaq.root '' '' '' keep ecrc /local/scratch26/kreymer/DCCPTEST/N00015460_0010.mdaq.root ecrc /local/scratch26/kreymer/DCCPTEST/F00042228_0003.mdaq.root setup_minos -r R1.22 loon -bq firstlast.C /local/scratch26/kreymer/DCCPTEST/N00015460_0010.mdaq.root loon -bq firstlast.C dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015460_0010.mdaq.root loon -bq firstlast.C dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2008-11/F00042228_0003.mdaq.root All these look OK to me. ls /pnfs/minos/reco_near/cedar/cand_data/2009-01/N00015460_0010.* /pnfs/minos/reco_near/cedar/cand_data/2009-01/N00015460_0010.cosmic.cand.cedar.0.root /pnfs/minos/reco_near/cedar/cand_data/2009-01/N00015460_0010.spill.cand.cedar.0.root ls /pnfs/minos/reco_far/cedar/cand_data/2008-11/F00042228_0003* /pnfs/minos/reco_far/cedar/cand_data/2008-11/F00042228_0003.all.cand.cedar.0.root /pnfs/minos/reco_far/cedar/cand_data/2008-11/F00042228_0003.spill.cand.cedar.0.root MINOS26 > grep N00015460_0010.mdaq.root /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2009/02/list.q N00015460_0010.mdaq.root w-raw-minos-stkendca22a-1 MINOS26 > grep F00042228_0003.mdaq.root /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2009/02/list.q F00042228_0003.mdaq.root w-stkendca11a-3 _________________________________________________________________________ Date: Wed, 04 Feb 2009 16:45:21 +0000 (GMT) From: Arthur Kreymer To: Nathaniel Tagg Cc: minos_software_discussion@fnal.gov Subject: Re: More errors! On Wed, 4 Feb 2009, Nathaniel Tagg wrote: > I now see an identical error from a Far detector run as well as a Near > detector run. I can run loon R1.22 on both these files in Dcache, and from local copies, on minos01.fnal.gov. ( loon -bq ~kreymer/minos/scripts/firstlast.C ... ) Copies made from DCache seem to have the correct Enstore checksums. There are cedar keepup output files for both of these subruns. I suspect a problem with your loon, or your node . 
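For the record, the check above generalizes to any reported file; a compact sketch of the same steps, using only the commands and paths already shown:

# Reproduce a reported read error : dcap copy, checksum, and loon open.
FILES='N00015460_0010.mdaq.root F00042228_0003.mdaq.root'
setup_minos -r R1.22
for FILE in ${FILES} ; do
  ./dccptest ${FILE} '' '' '' keep
  ecrc /local/scratch26/kreymer/DCCPTEST/${FILE}
  loon -bq firstlast.C /local/scratch26/kreymer/DCCPTEST/${FILE}
done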
_________________________________________________________________________ MINOS26 > grep N00015460_0010.mdaq.root /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2009/02/list.q N00015460_0010.mdaq.root w-raw-minos-stkendca22a-1 MINOS26 > grep F00042228_0003.mdaq.root /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2009/02/list.q F00042228_0003.mdaq.root w-stkendca11a-3 ============================================================================= 2009 02 03 ============================================================================= Continued mysql testing, see LOG.mysql ============================================================================= 2009 02 02 ============================================================================= ####### # NET # ####### Non-disruptive network maintenance on r-s-edge-1 router. 9:00 - 9:30AM CST Tuesday February 3, 2009 ######## # FARM # ######## Digging into the Mother Lode of backlog : SRV1> ls /minos/data/minfarm/mcnearcat | grep cedar_phy_bhcurv | grep sntp | wc -l 2931 SRV1> ls /minos/data/minfarm/mcnearcat | grep cedar_phy_bhcurv | grep mrnt | wc -l 2930 SRV1> ./roundup -r cedar_phy_bhcurv mcnear Mon Feb 2 12:56:07 CST 2009 Mon Feb 2 13:08:33 CST 2009 Found 204 duplicate files Running a pass over all 6000 files, do clear DUP's and get master PEND list SRV1> ./roundup -b 6000 -D -r cedar_phy_bhcurv mcnear Mon Feb 2 14:46:04 CST 2009 __________________________________________________________________________ Date: Mon, 02 Feb 2009 21:38:31 +0000 (GMT) From: Arthur Kreymer I have run a full pass of ./roundup -b 6000 -D -r cedar_phy_bhcurv mcnear 204 duplicate files were removed. There is a fresh pend file, /home/minfarm/ROUNTMP/LOG/cedar_phy_bhcurvmcnear.pend There are also several files present which are flagged as bad : SRV1> grep ZAPPING LOG/2009-02/cedar_phy_bhcurvmcnear.log ZAPPING BAD n13037415_0009_L010185N_D04.cand.cedar_phy_bhcurv.1.root ZAPPING BAD n13037415_0009_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root ZAPPING BAD n13037415_0009_L010185N_D04.sntp.cedar_phy_bhcurv.1.root ZAPPING BAD n13037436_0005_L010185N_D04.cand.cedar_phy_bhcurv.1.root ZAPPING BAD n13037436_0005_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root ZAPPING BAD n13037436_0005_L010185N_D04.sntp.cedar_phy_bhcurv.1.root Should the badfiles list be edited, or should these files be discarded ? These files do seem to be a missing subruns that are needed. Note that some of these are candidate files. 
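To keep track of what these passes are zapping and what is still pending, the month's log can be summarized; a small sketch using the log and pend paths above:

# Summarize this month's roundup activity for cedar_phy_bhcurv mcnear :
# files zapped as bad, pending-run count, and size of the fresh pend list.
LOGDIR=/home/minfarm/ROUNTMP/LOG
grep ZAPPING ${LOGDIR}/2009-02/cedar_phy_bhcurvmcnear.log | awk '{print $3}' | sort -u
grep -c PEND ${LOGDIR}/2009-02/cedar_phy_bhcurvmcnear.log
wc -l ${LOGDIR}/cedar_phy_bhcurvmcnear.pend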
__________________________________________________________________________ Date: Mon, 02 Feb 2009 16:13:18 -0600 From: Howard Rubin To: Arthur Kreymer Subject: Re: cedar_phy_bhcurvmcnear.pend Here's the history on the 'bad' files: n13037415_0009_L010185N_D04.0 28665 analyze fnpc260 BEGIN 2008-03-26 08:30:39 n13037415_0009_L010185N_D04.0 28665 analyze fnpc260 ERROR 2008-03-26 09:12:22 136 n13037415_0009_L010185N_D04.0 18430 analyze caf1626 BEGIN 2008-04-29 14:34:31 n13037415_0009_L010185N_D04.0 18430 analyze caf1586 BEGIN 2008-04-29 14:34:53 n13037415_0009_L010185N_D04.0 18430 analyze caf1626 ERROR 2008-04-29 16:14:31 136 n13037415_0009_L010185N_D04.0 18430 analyze caf1586 END 2008-04-29 18:12:11 n13037436_0005_L010185N_D04.0 15737 analyze fnpc255 BEGIN 2008-04-05 00:33:55 n13037436_0005_L010185N_D04.0 15737 analyze fnpc255 END 2008-04-05 03:55:30 n13037436_0005_L010185N_D04.1 2159 analyze caf1606 BEGIN 2008-05-09 17:57:26 n13037436_0005_L010185N_D04.1 2159 analyze fnpc304 BEGIN 2008-05-09 18:20:28 n13037436_0005_L010185N_D04.1 2159 analyze fnpc304 ERROR 2008-05-09 18:53:51 136 n13037436_0005_L010185N_D04.1 2159 analyze caf1606 END 2008-05-09 21:10:19 I don't remember the reason for 'Pass 1' but it appears that, at least in that pass, the 2 jobs were multiply submitted. I've removed the lines from bad_runs_mc.cedar_phy_bhcurv. I'm beginning processing of the pend list except for the 'MRE' run whose history I don't remember at all, and whose naming convention doesn't match our recent running. (Were they perhaps run by Matt as a special pass?) I suggest we just forget about the one pair of files. __________________________________________________________________________ Date: Mon, 02 Feb 2009 17:05:59 -0600 From: Howard Rubin To: Art Kreymer Subject: cleanup Art, Not counting the MRE jobs, there are 714 files in the cedar_phy_bhcurv pend list. Of these, 652 have already run successfully, meaning that they produced *some* output which has made it to dcache or is already in mcnearcat. This means that this cleanup will produce *lots* of duplicates, lots of them probably being candidates, and some of which I will catch if the existing output is in mcnearcat (or candidates in dcache), and the rest of which you will find when you concatenate. I think that *all* of these duplicates should simply be deleted, not kept around in a 'duplicates' area. Presumably we understand why they've been produced, and there's no reason for anyone to ever look at them. Incidentally, all the files I produce in this cleanup will be pass 0. __________________________________________________________________________ Date: Tue, 03 Feb 2009 15:01:33 +0000 (GMT) From: Arthur Kreymer I ran roundup this morning. It picked up the subruns which had formerly been flagged as bad. But nothing more. The newest files in /minos/data/minfarm/mcnearcat are from Nov 26. __________________________________________________________________________ Date: Tue, 03 Feb 2009 10:23:10 -0600 From: Howard Rubin Jobs seem to be running now so I'm submitting more. __________________________________________________________________________ Date: Wed, 04 Feb 2009 09:38:06 -0600 From: Howard Rubin Art, The cedar_phy_bhcurv cleanup processing is complete. Again, I expect *many* duplicates, and I think these should be deleted, not moved. Everything has been forced to use pass 0. 
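Since the cleanup is forced to pass 0, duplicates will show up as the same subrun with more than one pass number; a quick survey sketch of mcnearcat for that pattern (roundup -D remains the tool that actually removes them):

# List subruns present with more than one pass number.
# Names look like n13037415_0009_L010185N_D04.sntp.cedar_phy_bhcurv.<pass>.root
ls /minos/data/minfarm/mcnearcat | grep cedar_phy_bhcurv \
  | sed 's/\.[0-9]*\.root$//' | sort | uniq -d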
__________________________________________________________________________ __________________________________________________________________________ ######## # FARM # ######## Sunday, purged D06, charm, helium files on farm Started cleanup of helium; SRV1> ./roundup -s helium -r cedar_phy_bhcurv mcnear & This finished, I have done the purge : SRV1> ./roundup -p -s helium -r cedar_phy_bhcurv mcnear ####### # NAS # ####### Reported previous delays to romero ######## # GRID # ######## Tracking down dejong problems access condor_submit but not condor_q ######### # ADMIN # ######### Planning for gfactory/gfrontend ID, versus Condor 7.2 ============================================================================= 2009 01 30 ============================================================================= ######## # FARM # ######## helium files are complete for M100200R Still missing subruns for M100200N SRV1> ./roundup -s helium -r cedar_phy_bhcurv mcnear Fri Jan 30 11:30:44 CST 2009 SRV1> ./roundup -s charm -r cedar_phy_bhcurv mcnear Fri Jan 30 13:55:56 CST 2009 ######## # FARM # ######## SRV1> ./roundup -s D06 -r cedar_phy_bhcurv mcnear Fri Jan 30 11:42:37 CST 2009 ./roundup -n -p -s D06 -r cedar_phy_bhcurv mcnear ============================================================================= 2009 01 29 ============================================================================= ######### # DOCDB # ######### The old ticket regarding DNS failover is closed. I think nothing was done. Noted. Date: Thu, 29 Jan 2009 12:16:54 -0600 (CST) Subject: Help Desk Ticket 121522 Has Been Resolved. Solution: related to DNS issue ___________________________________________________________________ This ticket was resolved by INKMANN, JOHN of the CD-LSCS/CSI/CS/EST group. ######### # AKLOG # ######### Surveyed FNALU for system running LSF 4.7, to test aklog/kcron combo There are none : MIN > for NODE in ${UNODES} ; do printf "${NODE} " ; ssh -ax ${NODE} 'cat /etc/redhat-release' ; done flxi02 Scientific Linux Fermi LTS release 4.4 (Wilson) x86_64 flxi03 Scientific Linux Fermi LTS release 4.4 (Wilson) x86_64 flxi04 Scientific Linux Fermi LTS release 4.5 (Wilson) i686 flxi05 Scientific Linux Fermi LTS release 4.5 (Wilson) i686 flxi06 Scientific Linux SLF release 5.1 (Lederman) i686 no kcron exists flxi07 Scientific Linux Fermi LTS release 4.4 (Wilson) x86_64 flxi09 Scientific Linux Fermi LTS release 4.5 (Wilson) i686 ####### # SAM # ####### minosora3 - Oracle patches and OS patches being performed, Seems to be complete as of 09:00 09:12 - ran HOWTO.sam tests, including station test AOK Updated the MINOS status page Date: Thu, 29 Jan 2009 15:27:40 +0000 (GMT) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Cc: minos_sam_admin@fnal.gov Subject: SAM maintenance completed this morning The Minos SAM Oracle database was upgraded from 08:00 to 09:12. The SAM station and dbservers have resumed normal operation . 
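The HOWTO.sam tests are not reproduced here; the cheapest spot check after a database maintenance is a timed metadata lookup, sketched with the sam locate command used elsewhere in this log (the file name is an example):

# Minimal post-maintenance check : time one metadata lookup through the dbserver.
time sam locate n15037001_0000_L010185N_D06_nccohbkg.reroot.root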
kreymer@minos26 : crontab crontab.dat mindata@minos26 : crontab crontab.dat minfarm@fnpcsrv1 : mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok ######## # FARM # ######## Concatenation of D00_medi.sntp.cedar_phy_linfix and D00_lowi.mrnt.cedar_phy_linfix completed Tue Jan 27 15:12:54 CST 2009 ######## # FARM # ######## The concatenation of daikon06 files is up to date, as of Wed Jan 28 14:37:26 CST 2009 There is one run pending, due to a missing subrun : PEND - have 28/29 subruns for n13037060_*_L010185N_D06_nccohbkg.mrnt.cedar_phy_bhcurv.0.root MISS 0002 PEND - have 28/29 subruns for n13037060_*_L010185N_D06_nccohbkg.sntp.cedar_phy_bhcurv.0.root MISS 0002 ============================================================================= 2009 01 28 ============================================================================= ######## # DATA # ######## Tue Jan 27 13:18:14 CST 2009 cherdack imports have completed, with one false DUP entry, n15037088_0000_L010185N_D06_nccohbkg.reroot.root mv DUP/n15037088_0000_L010185N_D06_nccohbkg.reroot.root dcache/ ./mcimport cherdack less /minos/data/mcimport/cherdack/log/mcimport.log Wed Jan 28 14:42:12 CST 2009 PURGED n15037088_0000_L010185N_D06_nccohbkg.reroot.root rm STAGE/cherdack/MCIMPORT ######### # ADMIN # ######### Date: Wed, 28 Jan 2009 11:59:20 -0600 (CST) Subject: HelpDesk ticket 128348 Short Description: Request UID assignment for gfactory account. Problem Description: Please assign an official UID for the gfactory account. The UID assignment should be similar to that of gfrontend, including the Minos 5111 group, and Ryan Patterson as the registered user : 43598 5111 PATTERSON RYAN GFRONTEND ___________________________________________ Date: Wed, 28 Jan 2009 12:07:40 -0600 (CST) This ticket has been reassigned to VALADEZ, YOLANDA of the CD-LSCS/CSI/HD ___________________________________________ Date: Wed, 28 Jan 2009 13:08:56 -0600 (CST) Solution: Refresh your uid/gid list. The following new uid/gid assignment has been entered: UID:GID:LASTNAME: FIRSTNAME:USERNAME 43680 5111 Patterson Ryan gfactory This ticket was resolved by VALADEZ, YOLANDA of the CD-LSCS/CSI/HD group. ___________________________________________ ########## # DCACHE # ########## The backlog has cleared, we are back in operation . See ticket 128156 below ######### # ADMIN # ######### Date: Wed, 28 Jan 2009 11:30:49 -0600 From: HelpDesk To: Arthur Kreymer Subject: uid and gid listings http://www-giduid.fnal.gov/cd/FUE/uidgid/uid.lis http://www-giduid.fnal.gov/cd/FUE/uidgid/gid_id.lis ============================================================================= 2009 01 27 ============================================================================= ########## # DCACHE # ########## Forwarded to M_B, M-D, CC: M_S_D Date: Tue, 27 Jan 2009 14:46:14 -0600 From: ssa-group@fnal.gov All STKEN dCache pools will be restarted in 10 minutes (at approximately 2:55pm) for an emergency configuration change. Another notification will be sent after the restart is complete. ___________________________________________________________________ Date: Tue, 27 Jan 2009 15:06:12 -0600 From: ssa-group@fnal.gov The restart of the public dCache pools is complete. ___________________________________________________________________ Stores Queued 15:08 - 29000 approx 15:25 - 27609 15:36 - 27412 DES writes to tape are continuing apace, about 5 sec/file. 
16:56 - 19763 16:57 - 18649 That's more like it, 1000/minute, ########### # BLUEARC # ########### To fermigrid-users, minos-data : I run a set of processes which check the readability and speed of files in the /minos/data2 piece of Bluearc. In response to questions during the Grid Users meeting yesterday, here is a summary of recent slow access, as seem from Minos nodes. There have been no recent outright failures . File access times over 10 seconds are logged. ( Normal access times are under a second. ) The full logs are at http://www-numi.fnal.gov/computing/dh/bluwatch/log/ I am listing here the large blocks of slow response. See the full logs for a few other short term slowdowns. fnpcsrv1 Jan 14 04:36 -> 06:26 to 50 sec Jan 24 18:17 -> 23:29 to 92 sec Jan 25 18:37 -> 21:07 to 41 sec minos-sam03 Jan 24 18:20 -> 23:21 to 43 sec Jan 25 18:37 -> 21:35 to 40 sec minos02 Jan 24 18:16 -> 23:18 to 36 sec Jan 25 18:40 -> 21:07 to 44 sec minos26 Jan 24 18:16 -> 23:21 to 69 sec Jan 25 18:37 -> 22:44 to 43 sec ######## # FARM # ######## Concatenating lowi and medi files These are the only linfix files pending ./roundup -n -r cedar_phy_linfix mcnear ./looper '-r cedar_phy_linfix mcnear' & Tue Jan 27 13:56:55 CST 2009 ============================================================================= 2009 01 26 ============================================================================= ######## # FARM # ######## ls /minos/data/minfarm/mcnearcat | grep D06 | wc -l 8970 ls /minos/data/minfarm/mcnearcat -ltr | grep D06 | tail -rw-rw-r-- 1 minospro e875 20802267 Jan 20 19:52 n13037137_0047_L010185N_D06_nccohbkg.mrnt.cedar_phy_bhcurv.0.root rm /minos/data/minfarm/roundup/STOP.LOOPER SRV1> ./roundup -n -s D06 -r cedar_phy_bhcurv mcnear OK - processing /minos/data/minfarm/mcnearcat version 20090121 SELECT files containing D06 Mon Jan 26 17:32:48 CST 2009 ... OK adding n13037015_0000_L010185N_D06_nccohbkg.mrnt.cedar_phy_bhcurv.0.root 31 ./roundup: line 764: ((: SSIF = : syntax error: operand expected (error token is " ") ./roundup: line 764: ((: SSIF = : syntax error: operand expected (error token is " ") OK adding n13037016_0000_L010185N_D06_nccohbkg.mrnt.cedar_phy_bhcurv.0.root 31 ... BIG - Splitting due to size 2168015605 OK adding n13037015_0000_L010185N_D06_nccohbkg.sntp.cedar_phy_bhcurv.0.root 30 OK adding n13037015_0030_L010185N_D06_nccohbkg.sntp.cedar_phy_bhcurv.0.root 1 ./roundup: line 764: ((: SSIF = : syntax error: operand expected (error token is " ") BIG - Splitting due to size 2201111372 OK adding n13037016_0000_L010185N_D06_nccohbkg.sntp.cedar_phy_bhcurv.0.root 30 OK adding n13037016_0030_L010185N_D06_nccohbkg.sntp.cedar_phy_bhcurv.0.root 1 The errors were because I tried to cancel the interactive roundup with ^C and ^Y. ./looper '-s D06 -r cedar_phy_bhcurv mcnear' & LOG/2009-01/cedar_phy_bhcurvmcnearD06.log Mon Jan 26 18:04:51 CST 2009 ######## # DATA # ######## Testing cherdack imports, 500 files, typically 600 MBytes, a few over 1 GByte. $ ./mcimport -b 2 cherdack OK, logging activity to /minos/data/mcimport/cherdack/log/mcimport.log ... 
Mon Jan 26 16:43:23 CST 2009 MCIN configuration n1503 _L010185N_D06_nccohbkg.reroot.root SRMCPed n15037001_0000_L010185N_D06_nccohbkg.reroot.root SRMCPed n15037002_0000_L010185N_D06_nccohbkg.reroot.root BAIL aftar 2 files ~/saddmc --declare daikon_06 near/daikon_06/L010185N_nccohbkg/700 324213 /minos/data/mcimport/cherdack/ du: cannot access `/minos/data/mcimport/cherdack/tar': No such file or directory 1 /minos/data/mcimport/cherdack/dcache 324212 /minos/data/mcimport/cherdack/mcin 1240 /minos/data/mcimport/cherdack/mcin/dcache Mon Jan 26 16:46:43 CST 2009 Sam data is OK MINOS26 > sam locate n15037001_0000_L010185N_D06_nccohbkg.reroot.root ['/pnfs/minos/mcin_data/near/daikon_06/L010185N_nccohbkg/700,26@dcache'] Allow the rest to run on the normal cycle MINOS26 > touch /minos/data/mcimport/cherdack/MCIMPORT And start an early cycle, as we just missed the 4 hour boat by 10 minutes. $ ./mcimport cherdack OK, logging activity to /minos/data/mcimport/cherdack/log/mcimport.log ######## # DATA # ######## One of the Minos caldet tapes has been found MIA during migration. http://www-stken.fnal.gov/enstore/tape_inventory/VO4956 'library': '9940' 'sum_mounts': 629 'sum_rd_access': 1169 'sum_rd_err': 5 'sum_wr_access': 581 'sum_wr_err': 0 'volume_family': 'minos.caldet_data.cpio_odc' ########## # CONDOR # ########## Jobs took a nosedive around 00:00 today. Confirmed by GPFarm Ganglia http://rexganglia2.fnal.gov/farms/?m=load_one&r=day&s=descending&c=GP+Farm&h=&sh=1&hc=4 Processes dropped from 800 at 00:00 sharply to 400, then to under 200 at 01:00 The latest running gfactory is 268378.16 gfactory 1/24 14:29 1+18:50:24 My glide test jobs stayed idle from 00:00 through 07:45. They are idle again, starting 08;20 Jobs started up again around 09:44 .. ########## # DCACHE # ########## Huge write backlog, over 16k files. Built up during the day Sunday 25 Jan 21K Stores queued in DCache, mostly w-pub-minos-stkendca21a-2 4122 w-pub-minos-stkendca22a-2 1143 w-pub-minos-stkendca23a-2 2493 w-sstkendca10a-4 1768 w-sstkendca10a-5 2140 w-sstkendca10a-6 811 w-stkendca11a-4 3007 w-stkendca11a-5 249 w-stkendca11a-6 2666 w-stkendca12a-4 550 w-stkendca12a-5 296 w-stkendca12a-6 536 w-stkendca9a-4 1222 Appears to be file family des. Recent files on tape No visible Minos backlog so far. My eyeball suggests the backlog will takeat least 3 days to clear. The latest 38 files on VOK823 average 8 MBytes in size. http://www-stken.fnal.gov/enstore/tape_inventory/VOK823 Many are under 10 KBytes. Date: Mon, 26 Jan 2009 09:11:33 -0600 (CST) Subject: HelpDesk ticket 128156 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> Short Description: writePools backlog Problem Description: dcache-admin : It appears that a write backlog of over 20,000 files has built up in the FNDCA writePools group. On the surface, these seem to be in file family des. The average file size is under 10 MBytes, with many files under 1 MByte, based on a glance at recent files on volume VOK823. At the present rate, this backlog will take at least 3 days to clear. I urge that the user be contacted, and that these writes be reorganized. ___________________________________________ Date: Mon, 26 Jan 2009 11:23:19 -0600 (CST) This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. We are looking at this and working to resolve it. I do see quite a large number of queued write requests. What monitoring page are you using to determine the user? _ Thank you for letting us know. 
__________________________________________ Date: Mon, 26 Jan 2009 18:05:12 +0000 (GMT) From: Arthur Kreymer I am deducing the user from the backlogs in Enstore : Servers page, see CD-LTO3.library_manager http://www-stken.fnal.gov/enstore/status_enstore_system.html CD-LT03 Full Queue Elements http://www-stken.fnal.gov/enstore/CD-LTO3.library_manager-full.html ___________________________________________ Date: Mon, 26 Jan 2009 20:56:34 +0000 (GMT) From: Arthur Kreymer We were making a little progress this morning, but now the Queued Stores are up by another 15K files, to 29133 as of about 14:55. The new files seem to have all been queued in a single sampling of the plot at http://fndca.fnal.gov/dcache/queue/allpools.jpg ___________________________________________ Date: Tue, 27 Jan 2009 19:50:03 +0000 (GMT) From: Arthur Kreymer Almost a full day ago, we were told at the Grid Users meeting that the errant des files would be removed by their creator at UIC. The write backlog remains at nearly 28,000, as seen at http://fndca.fnal.gov:2288/queueInfo Minos production reprocessing remains shut down, as it has been since last Sunday, due to our service level agreement. I request that either 1) Administrators remove these DES files. or 2) Minos be given permission to write to the DCache pools without any service limitations. If thing remain as they are, it will be several more days before we can resume processing. This is unacceptable. ___________________________________________ Date: Tue, 27 Jan 2009 14:52:48 -0600 (CST) The user ceased writing yesterday afternoon, but the jobs already in the system have continued to clog it up. We are taking action immediately to resolve this problem in conjunction with a pool restart. The backlog should clear out shortly after we restart pools and the files are removed from pnfs space. I will let you know when we are done. Thanks for your patience. -Tim Messer ___________________________________________ Date: Tue, 27 Jan 2009 18:00:15 -0600 (CST) The FNDCA write queue is now below 3,000 and continues to drop. ___________________________________________ Date: Wed, 28 Jan 2009 15:56:54 +0000 (GMT) From: Arthur Kreymer Thanks ! The DCache writePools backlog cleared yesterday, around 18:00 . Minos reconstruction processing has resumed. This ticket can be closed. ___________________________________________ Solution: DES user was writing >28,000 small files to dCache, slowing the system down and causing MINOS processing to stop. User was notified and files were removed from pnfs. ___________________________________________ 16:42 - 31177 ########## # CONDOR # ########## ============================================================================= 2009 01 23 ============================================================================= ######### # SHIFT # ######### Where do the minos-shiftscript item come from, sent to minos-shifters ? From minos23, probably habig account. ######### # ADMIN # ######### Date: Fri, 23 Jan 2009 14:20:03 -0600 (CST) Subject: HelpDesk ticket 128129 ___________________________________________ Short Description: Please create minsoft account on minos-sam04 Problem Description: run2-sys : Please create account minsoft on minos-sam04. This is for development testing of mysql. The account can be NIS served, as on minos-mysql2. Please copy the .k5login file from minos-mysql2. ___________________________________________ Date: Fri, 23 Jan 2009 15:49:05 -0600 (CST) This ticket has been reassigned to SHEPELAK, KAREN of the CD-SF/FEF Group. 
___________________________________________ Date: Fri, 23 Jan 2009 16:05:37 -0600 (CST) Solution: shepelak@fnal.gov sent this solution: Minsoft account has been created on minos-sam04 with .k5login as copied from minos-mysql2 as requested. -- karen This ticket was resolved by SHEPELAK, KAREN of the CD-SF/FEF group. ___________________________________________ Date: Fri, 23 Jan 2009 22:30:57 +0000 (GMT) From: Arthur Kreymer Thanks, I can log in ! ___________________________________________ ########## # CONDOR # ########## Increased analysis to 350, leaving headroom for Farm jobs on GPFARM. Somehow I typed the wrong file names last time. I had edited the old vofrontend.cfg.20081119, not vofrontend.cfg.20090122, and symlinked to it. I REALLY NEED A BETTER MONITOR ! cd /home/gfrontend/myvofrontend2/etc cp vofrontend.cfg.20081125 vofrontend.cfg.20090123 nedit - max_running_jobs=350 ln -sf vofrontend.cfg.20090123 vofrontend.cfg # was vofrontend.cfg.20081119 kill -9 22120 cd ./start_frontend.sh Fri Jan 23 15:59:46 CST 2009 [gfrontend@minos25 ~]$ tail /home/gfrontend/myvofrontend2/log/frontend_info.`date +%Y%m%d`.log [2009-01-23T15:59:45-05:00 28984] Iteration at Fri Jan 23 15:59:45 2009 [2009-01-23T15:59:48-05:00 28984] Match [2009-01-23T15:59:48-05:00 28984] Total running 205 limit 350 [2009-01-23T15:59:48-05:00 28984] For gpgeneral@t22_glexec@minos Idle 898 Running 200 [2009-01-23T15:59:48-05:00 28984] Advertize gpgeneral@t22_glexec@minos Request idle 40 max_run 1142 [2009-01-23T15:59:48-05:00 28984] For gpminos@t22_glexec@minos Idle 898 Running 205 [2009-01-23T15:59:48-05:00 28984] Advertize gpminos@t22_glexec@minos Request idle 40 max_run 1148 [2009-01-23T15:59:48-05:00 28984] For cdf@t22_glexec@minos Idle 898 Running 200 [2009-01-23T15:59:48-05:00 28984] Advertize cdf@t22_glexec@minos Request idle 40 max_run 1142 [2009-01-23T15:59:48-05:00 28984] Sleep ######## # DATA # ######## From: ssa-group@fnal.gov Subject: Announcement: Service scheduled outage for dCache on stken for a duration of 30 minutes The following STKEN services will be restarted at 3:30pm today to allow new gri$ fndca3a - gPlazma stkendca2a - dcap, gridftp and srm services stkendca13a-20a - dCache and gridftp doors A notice will be sent when the services have been restarted. _____________________________________________________________________ Date: Fri, 23 Jan 2009 15:45:04 -0600 From: ssa-group@fnal.gov Subject: Announcement: Service restoration for dCache on stken for a duration of 30 minutes The STKEN dCache and grid services on fndca3a and stkendca2a,13a-20a have been restarted. Thank you. ######### # MYSQL # ######### Date: Fri, 23 Jan 2009 11:47:09 -0600 (CST) Subject: HelpDesk ticket 128116 ___________________________________________ Short Description: Please change the system timezone for minos-mysql2 to UTC Problem Description: run2-sys : Because the new minos-mysql2 server connects with various grid hosts, we would like to have the system time zone set to UTC, at your next convenience. Please contact minos-admin and minosdb_support to schedule the change. I suggest waiting until next week, to avoid weekend surprises. ___________________________________________ Date: Fri, 23 Jan 2009 11:53:27 -0600 (CST) This ticket has been reassigned to SHEPELAK, KAREN of the CD-SF/FEF Group. ___________________________________________ Date: Fri, 23 Jan 2009 13:07:00 -0600 (CST) I reset the timezone to UTC on minos-mysql2. I'm still able to login as root after the change so this change looks to be working. 
Please confirm that you are able to login. Let me know that this change works for you, then I'll close the ticket. [root@minos-mysql2 ~]# date Fri Jan 23 19:02:06 UTC 2009 ___________________________________________ Date: Fri, 23 Jan 2009 13:17:20 -0600 (CST) This ticket has been reassigned to ALLEN, JASON of the CD-SF/FEF Group. ___________________________________________ Date: Fri, 23 Jan 2009 20:26:28 +0000 (GMT) From: Arthur Kreymer Gamglia monitoring of minos-mysql2 seems to have stopped, as of almost precisely 12:00 today. http://rexganglia2.fnal.gov/minos/?m=load_one&r=day&c=MINOS Server&h=minos-mysql2.fnal.gov Perhaps the Ganglia server is confused about the time shift. If this does not fix itself, please look into it Monday. ___________________________________________ Date: Fri, 23 Jan 2009 14:27:21 -0600 From: Jason Allen I don't want to make these machines unique. Please set the timezone back to CST. If using CST is really causing minos legitimate problems then I'll discuss the matter with Margaret V. ___________________________________________ Date: Fri, 23 Jan 2009 21:01:02 +0000 (GMT) From: Arthur Kreymer Jason has asked that the UTC test on mysql2 be suspended. So be it. I share the desire for uniform systems. I would like to have all Minos Servers running UTC. minos-sam01 minos-sam02 minos-sam03 minos-sam04 minos-mysql1 minos-mysql2 I'll be glad to discuss a deployment plan for this, when you have time. This is not urgent. I would also like to see the entire Minos Cluster at UTC, but this takes a lot more planning and testing. Users' crontab files are affected by such a change, We might set a local TZ in /etc/profile or /etc/csh.cshrc so that interactive users would get the old times reported. This would likely be done at a major system upgrade. ___________________________________________ Date: Fri, 23 Jan 2009 15:48:44 -0600 Minos-mysql2 is back on CS Time. ___________________________________________ Date: Mon, 26 Jan 2009 14:38:19 -0600 (CST) Solution: jallen@fnal.gov sent this solution: Converting just the Minos nodes to UTC time zone introduces unnecessary system management complexity to the FEF Dept. Minos has been operating for many years with the nodes set to CST time zone. The FEF Dept will evaluate the potential benefits of converting all nodes CDF/D0/Minos/MiniBoone/EAG/etc to UTC. This ticket was resolved by ALLEN, JASON of the CD-SF/FEF group. __________________________________________ Date: Fri, 23 Jan 2009 19:27:49 +0000 (GMT) Thanks, I can log in and access the database server normally. This ticket can be closed. __________________________________________ Mysql2 > dds /etc/sysconfig/clock -rw-r--r-- 1 root root 42 Jan 23 18:50 /etc/sysconfig/clock ########## # DCACHE # ########## Date: Fri, 23 Jan 2009 11:36:31 -0600 (CST) Subject: HelpDesk ticket 128115 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 128115 ___________________________________________ Short Description: FNDCA dcache monitor page is out of date Problem Description: There is a web page which monitors for internal file problems in FNDCA : http://www-stken.fnal.gov/enstore/dcache_monitor/ The content of this page seems not to have changed since 23-Jul-2008 ___________________________________________ This ticket is assigned to JONES, TERRY of the CD-SF/DMS/DSC/SSA. 
____________________________________________
Date: Fri, 23 Jan 2009 12:50:14 -0600 (CST)
Note To Requester: jonest@fnal.gov sent this Notes To Requester:
Have you looked at this URL yet?
> http://www-stken.fnal.gov/enstore/www-stken_pnfs_monitor
___________________________________________
Date: Fri, 23 Jan 2009 19:19:54 +0000 (GMT)
From: Arthur Kreymer
That web page is new to me.
The minos data files listed there are normal.
Problem files were once sorted into files like
http://www-stken.fnal.gov/enstore/dcache_monitor/p929_bad.txt
This seemed to be useful.
___________________________________________
Date: Wed, 28 Jan 2009 16:48:21 +0000 (GMT)
From: Arthur Kreymer
This ticket can be closed.
I have adjusted my web pages to point to the new url
http://www-stken.fnal.gov/enstore/www-stken_pnfs_monitor
Because this listing includes files from all experiments, both bad files
and files queued for write to tape, it is not trivial to scan for bad
Minos files. That is OK. I continue to rely on the SSA group to alert us
to problem files. There is no need for me to scan this list on a regular basis.
___________________________________________
Note To Requester: jhendry@fnal.gov sent this Notes To Requester:
Hi Art,
For your information, this web page can be accessed from the top level
dcache web page: http://www-stken.fnal.gov/enstore/enstore_saag.html
Click on "CD dCache", which goes to the FNAL General dCache System Status
url: http://fndca.fnal.gov/
Then click on "Meta-Data Checks", which takes you to the pnfs monitor
output page: http://www-stken.fnal.gov/enstore/www-stken_pnfs_monitor
John
___________________________________________
##########
# DCACHE #
##########
Date: Fri, 23 Jan 2009 10:11:04 -0600
From: jonest
To: kreymer@fnal.gov, bernstein@fnal.gov
Cc: Dcache Admin
Subject: minos files not in dcache or pnfs.
----------------------------------------
These files are not in dcache. They have layer 2 information but no pool data.
Do you still have the files? If so, please delete from pnfs and retransfer
these files, or regenerate the files and retransfer.
Recent PNFS Database minos files:  timestamp                  |                   pnfsid | layer1 | layer2 | layer4 | path  2009-01-21 15:12:57.930745 | 000F00000000000009073A88 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0000.mdaq.root  2009-01-21 16:13:55.177935 | 000F00000000000009073AC8 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0001.mdaq.root  2009-01-21 17:14:23.413759 | 000F00000000000009073B38 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0002.mdaq.root  2009-01-21 18:14:58.21879  | 000F00000000000009073C48 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0003.mdaq.root  2009-01-21 22:11:48.698747 | 000F00000000000009073F70 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0007.mdaq.root  2009-01-22 01:13:41.053332 | 000F00000000000009074C08 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0010.mdaq.root  2009-01-22 04:09:34.693716 | 000F00000000000009075B10 |      n |      y |      n | /pnfs/fs/usr/minos/fardet_data/2009-01/F00042731_0013.mdaq.root Terry Jones, jonest@fnal.gov, run2-sys@fnal.gov FCC/2/252/T x5200 __________________________________________________________________ FFILES=' F00042731_0000.mdaq.root F00042731_0001.mdaq.root F00042731_0002.mdaq.root F00042731_0003.mdaq.root F00042731_0007.mdaq.root F00042731_0010.mdaq.root F00042731_0013.mdaq.root ' for FILE in ${FFILES} ; do ./dc_stat ${FILE} ; done MINOS26 > for FILE in ${FFILES} ; do ./dccptest ${FILE} ; done Everthing is OK. __________________________________________________________________ Date: Fri, 23 Jan 2009 17:16:16 +0000 (GMT) From: Arthur Kreymer Layer 2 no longer has pool data, so this is normal. The files are all on tape, and I can read them all from DCache. It is a serious problem that we have no pool information. But this is a design issue, for which we have no solution planned. ######### # MYSQL # ######### Testing file copy speeds data to /var/tmp ( over 50 MBytes/second ) Note, we can set per-user limits. http://dev.mysql.com/doc/refman/5.1/en/user-resources.html The existing user table has field max_connection ########## # CONDOR # ########## The glideins adjusted to the new 200 target around midnight. Farm jobs are running, ============================================================================= 2009 01 22 ============================================================================= ########## # CONDOR # ########## 16:30 farm jobs are not getting started, due to existing glideins. 
Adjusted max_running_jobs=650 to max_running_jobs=200 in /home/gfrontend/myvofrontend2/etc/vofrontend.cfg.20081119 ln -sf vofrontend.cfg.20081119 vofrontend.cfg # was vofrontend.cfg.20081125 [2009-01-22T16:38:37-05:00 16173] Iteration at Thu Jan 22 16:38:37 2009 [2009-01-22T16:38:48-05:00 16173] Match [2009-01-22T16:38:48-05:00 16173] Total running 417 limit 2050 [2009-01-22T16:38:48-05:00 16173] For gpgeneral@t22_glexec@minos Idle 3666 Running 417 [2009-01-22T16:38:48-05:00 16173] Advertize gpgeneral@t22_glexec@minos Request idle 40 max_run 4247 [2009-01-22T16:38:48-05:00 16173] For gpminos@t22_glexec@minos Idle 3672 Running 417 [2009-01-22T16:38:48-05:00 16173] Advertize gpminos@t22_glexec@minos Request idle 40 max_run 4253 [2009-01-22T16:38:48-05:00 16173] For cdf@t22_glexec@minos Idle 3666 Running 417 [2009-01-22T16:38:48-05:00 16173] Advertize cdf@t22_glexec@minos Request idle 40 max_run 4247 [2009-01-22T16:38:48-05:00 16173] Sleep That does not match the config file. Tried to kill gfrontent without kill -9, noffect. Tried again, kill -9 [gfrontend@minos25 ~]$ cd [gfrontend@minos25 ~]$ ./start_frontend.sh [2009-01-22T16:41:58-05:00 22120] Iteration at Thu Jan 22 16:41:58 2009 [2009-01-22T16:42:08-05:00 22120] Match [2009-01-22T16:42:08-05:00 22120] Total running 417 limit 200 [2009-01-22T16:42:08-05:00 22120] For gpgeneral@t22_glexec@minos Idle 3664 Running 417 [2009-01-22T16:42:08-05:00 22120] Advertize gpgeneral@t22_glexec@minos Request idle 0 max_run 4245 [2009-01-22T16:42:08-05:00 22120] For gpminos@t22_glexec@minos Idle 3670 Running 417 [2009-01-22T16:42:08-05:00 22120] Advertize gpminos@t22_glexec@minos Request idle 0 max_run 4251 [2009-01-22T16:42:08-05:00 22120] For cdf@t22_glexec@minos Idle 3664 Running 417 [2009-01-22T16:42:08-05:00 22120] Advertize cdf@t22_glexec@minos Request idle 0 max_run 4245 [2009-01-22T16:42:08-05:00 22120] Sleep ########## # ORACLE # ########## Benchmarks comparing sparc vs x86 http://it.toolbox.com/blogs/david/ultrasparc-vs-x86-servers-which-one-runs-oracle-faster-22776 http://www.tpc.org/ spec.org int2006 rates Sun T5440 4x8core, x8 threads/core 1.4 GHz, 256 GB memory 255 copies, rate 270 peak 301 http://www.spec.org/cpu2006/results/res2008q4/cpu2006-20080929-05410.txt HP ProLiant DL580 G5 (2.66 GHz, Intel Xeon X7460 ) 64 GB memory 24 copies rate 267 peak 291 http://www.spec.org/cpu2006/results/res2008q3/cpu2006-20080901-05166.txt Scanning our heavy hitting Oracle SAM servers via the browser, Oracle Database Administration Offline System Session and License Usage http://dbb.fnal.gov/d0/databases http://rexganglia1.fnal.gov/d0/?r=month&c=d0central&h=d0ora2.fnal.gov 36 DBS_* connections, 3 active, all DBS_USER_POOL http://cdfdbb.fnal.gov/cdfr2/databases/ http://rexganglia2.fnal.gov/cdf/?r=month&c=CDF+Central&h=fcdfora4.fnal.gov 46 DBS_* connections, 1 active, DBS_CDF_USER_PRD The D0 load is very non-uniform. Something is going on every 15 minutes which is most of the load. ####### # CVS # ####### Date: Thu, 22 Jan 2009 11:36:08 -0600 (CST) To: KREYMER@FNAL.GOV Subject: HelpDesk ticket 128037 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 128037 ___________________________________________ Short Description: Please remove minoscvs from NIS hosts file on Minos Cluster Problem Description: run2-sys : The Minos CVS production server minoscvs has moved to cdcvs. 
But the NIS hosts file on the Minos Cluster contains MINOS26 > ypcat hosts | grep minoscvs 131.225.193.33 minoscvs.fnal.gov minoscvs 131.225.193.33 minoscvs.fnal.gov minoscvs Please remove minoscvs from this file and push out the change, so that Minos Cluster users will connect to the correct server. We plan to run the old CVS server on minos01, in readonly mode, for another week or two, before shutting down that CVS server. P.S. All lines in the NIS hosts file seem to entered twice. You might want to clean that up. ___________________________________________ This ticket is assigned to HelpDesk of the Help Desk. ___________________________________________ Date: Thu, 22 Jan 2009 11:40:21 -0600 (CST) This ticket has been reassigned to SHEPELAK, KAREN of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 22 Jan 2009 13:28:41 -0600 (CST) Solution: Minoscvs has been removed from nis hosts file as requested. This ticket was resolved by SHEPELAK, KAREN of the CD-SF/FEF group. ___________________________________________ Thanks for updating the file. minoscvs is now being properly resolved to cdcvs. ############ # TERAGRID # ############ Date: Thu, 22 Jan 2009 09:25:08 -0600 From: help@teragrid.org To: rmehdi@fnal.gov Subject: Re: Parrot and Teragrid FROM: Eijkhout, Victor (Concerning ticket No. 166606) Rashid, I'm still not convinced that compiling libraries locally is going to work, but let's assume it does. Can your proposed arrangement with parrot handle the fact that jobs have to be run through a batch system? You are not allowed to run production codes on the login nodes; those are for compiling only. By the way, lonestar is on the teragrid, so you can have shared software areas there. I suspect that's what you mean by "web cache". Or are you really talking about something relating to http connections? Victor. ######## # MAIL # ######## Tracking down A-hat characters in Mac and PC mail Perhaps they use UTF-8 or Windows-1252 rather than ISO-8859-1 or USASCII : http://objectmix.com/pine/330696-pc-alpine-character-set.html ########## # DCACHE # ########## Scheduled maintenance seems to have begun around 08:30 Finished around ######## # DATA # ######## Date: Thu, 22 Jan 2009 14:08:15 +0000 (GMT) From: Arthur Kreymer To: Mayly Sanchez Cc: Minos_Batch , minos-data@fnal.gov Subject: Re: low/medium intensity MC On Wed, 21 Jan 2009, Mayly Sanchez wrote: > We need processed with very high priority the following set using > cedar_phy_linfix: > /pnfs/minos/mcin_data/near/daikon_00/L010185N_medi/ > /pnfs/minos/mcin_data/near/daikon_00/L010185N_lowi/ I have created the output directories, and corrected the file families for the input directories : cd ~kreymer/minos/scripts ./pnfsdirs near cedar_phy_linfix daikon_00 L010185N_medi write ./pnfsdirs near cedar_phy_linfix daikon_00 L010185N_lowi write ============================================================================= 2009 01 21 ============================================================================= ######## # DATA # ######## There will be a downtime on Wednesday Jan 21, 2009 starting at 7:30am until 1:30pm. Fire Suppression maintenance will be done on the 9310's, SL8500 and AML in FCC2. ######## # FARM # ######## roundup.20090121 - added SEL string to PENDLOG, and added timestamp AFSS/roundup.20090121 -s charm -r cedar_phy_bhcurv mcnear AFSS/roundup.20090121 -s helium -r cedar_phy_bhcurv mcnear SRV1> cp AFSS/roundup.20090121 . 
SRV1> ln -sf roundup.20090121 roundup # was roundup.20081209 ./roundup -s L010185_D04 -r cedar_phy_bhcurv mcnear ============================================================================= 2009 01 20 ============================================================================= ######## # DATA # ######## Date: Tue, 20 Jan 2009 16:47:26 -0600 The SL8500 at FCC had a problem with one of the gripper/bots. SUN/STK replaced two gripper/bots This is effected the following libraries: D0-LTO4F1.library_manager CD-LTO4F1.library_manager CDF-LTO4F1.library_manager They libraries have been returned to service. The libraries were unavailable from ~1pm to 4:30pm ############ # HELPDESK # ############ Date: Tue, 20 Jan 2009 15:12:24 -0600 (CST) Subject: HelpDesk ticket 127906 ___________________________________________ Ticket #: 127906 ___________________________________________ Short Description: System Status Input Page - shortcut request Problem Description: We have been making good use of the System Status Input Page, https://computing.fnal.gov/cdsystemstatus/customersupport/inpform.htm But in spite of being careful, I have at least once updated the status of the wrong System. I suggest setting up a URL to pre-select the system, something like : https://computing.fnal.gov/cdsystemstatus/customersupport/inpform.htm?s=MINOS ___________________________________________ This ticket is assigned to HelpDesk of the Help Desk. ___________________________________________ Date: Tue, 20 Jan 2009 15:20:42 -0600 (CST) This ticket has been reassigned to ARENA, MATTHEW of the CD-LSCS/DBAP/IS ___________________________________________ Date: Wed, 21 Jan 2009 11:27:05 -0600 (CST) Solution: mengel@fnal.gov sent this solution: Art, That input form is just a static HTML form. It has no smarts behind it to prefill in data, etc. However, if you wanted to make a copy of the form, put the full URL in the
tag, and add a few 'value="..."' attributes to the input tags, you could make a version that has various fields pre-filled, which should work fine, just do a "save page as.." in your browser and edit the HTML. This ticket was resolved by ARENA, MATTHEW of the CD-LSCS/DBAP/IS group. ___________________________________________ N.B. I have copied this page to my desktop, and removed all the non-Minos options. This seems to work as desired. There is a risk of version drift, if the original page chanages. ___________________________________________ ######## # FARM # ######## I have removed the vanilla L010185N_D04 duplicates. There are thousands of concatenated subruns. For a partial list of missing subruns, see /home/minfarm/ROUNTMP/LOG/2009-01/cedar_phy_bhcurvmcnearL010185N_D04.log There are still many missing charm and helium subruns /home/minfarm/ROUNTMP/LOG/2009-01/cedar_phy_bhcurvmcnearcharm.log /home/minfarm/ROUNTMP/LOG/2009-01/cedar_phy_bhcurvmcnearhelium.log ########## # DCACHE # ########## Date: Tue, 20 Jan 2009 20:50:31 +0000 (GMT) From: Arthur Kreymer To: minos_software_discussion@fnal.gov, minos_batch@fnal.gov Cc: minos-data@fnal.gov Subject: PNFS/DCache down Thursday morning 22 Jan 2009 The Minos PNFS and DCache systems will be down this Thursday morning. ---------- Forwarded message ---------- Date: Fri, 16 Jan 2009 15:11:18 -0600 From: ssa-group@fnal.gov Subject: Announcement: Service schedule outage for enstore, dCache on d0en, stken, cdfen for a duration of 4 Hours There will be a downtime on Thursday Jan 22, 2009 starting at 7:30am until 11:30pm. The following is the agenda for the downtime. Not all services will be available during the maintenance. Listed below are the services that will be effected and when they will be up or down. 7:30 Stop Stken Enstore Here is the list of Library Managers that will be unavailable. 9940.library CD-9940B CD-LTO3 CD-LTO4F1 CD-LTO4G1 CDF-LTO4G1 D0-LTO4G1 Replace library robot arm in GCC SL8500. Begin repair work on srv0, srv1, srv2 and srv4. Begin SW changes and upgrades after repair work. dCache pool config change. 10:30 SL8500 work done. Verify work. The following Library Managers will be made available. CDF-LTO4G1 D0-LTO4G1 11:30 All Stken srv's back online, Enstore service restored. All Library Managers available. Stan Naymola ssa-group@fnal.gov ######### # MYSQL # ######### Trying to tailor Port 3307 server for crl Sort of succeeded, got port 3308. That's OK. ########## # ORACLE # ########## Oracle's January Quarterly patch has been deployed on minosora3~minosdev successfully. Can we proceed with the Production Oracle & O/S patch? Jan 28 or 29th? 8 a.m. ? We will do the reboot for the o/s at this time as well. In sum, requesting ~2 hr downtime. Please advice ___________________________________________________________________ I have done the usual minimal tests. The development station seems to be working normally. No dbserver or station restarts were needed. Thursday Jan 29 at 8 AM is OK with me. Consider it scheduled ============================================================================= 2009 01 16 ============================================================================= ############ # DATASETS # ############ Why are the *minos* pools not listed ? cp datasets.20081201 datasets.20090116 Because they are not in the daily pool lists. ######## # DATA # ######## New Minos pools are showing up, and in use. Sizes are odd, I see over 26 TB of minos-names pools, but we only bought about 14 TB of new pools. 
Pool + indicates presence in Cell Services list 1.8 TB each + r-minos-stkendca21a-3 + r-minos-stkendca22a-3 + r-minos-stkendca23a-3 r-minos-stkendca25a-3 r-minos-stkendca26a-3 r-minos-stkendca27a-2 2.8 TB each + w-pub-minos-stkendca21a-2 + w-pub-minos-stkendca22a-2 + w-pub-minos-stkendca23a-2 w-pub-minos-stkendca25a-3 1.8 TB each + w-raw-minos-stkendca21a-1 + w-raw-minos-stkendca22a-1 + w-raw-minos-stkendca24a-1 w-raw-minos-stkendca26a-1 Date: Fri, 16 Jan 2009 16:11:58 -0600 (CST) Subject: HelpDesk ticket 127805 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Short Description: Minos DCache pool file listings ? Problem Description: dcache-admin : I do not see the new Minos DCache pools in the pool file listings at http://fndca3a.fnal.gov/dcache/files/ The pools are online, so it is important that we have these lists, especially now that we lack Layer 2 PNFS data. Some of the pools are also missing from the Cell Services list at http://fndca3a.fnal.gov:2288/cellInfo r-minos-stkendca25a-3 r-minos-stkendca26a-3 r-minos-stkendca27a-2 w-pub-minos-stkendca25a-3 w-raw-minos-stkendca26a-1 ___________________________________________ Date: Fri, 16 Jan 2009 16:11:58 -0600 (CST) This ticket is assigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 28 Jan 2009 16:55:48 +0000 (GMT) From: Arthur Kreymer There are still no pool directory listings for the 'minos' pools under http://fndca3a.fnal.gov/dcache/files/ Is progress being made on this ? ___________________________________________ Date: Thu, 29 Jan 2009 09:49:03 -0600 (CST) Note To Requester: georges@fnal.gov sent this Notes To Requester: This was forwarded to the dcache developers. ___________________________________________ Date: Mon, 02 Feb 2009 21:02:26 +0000 (GMT) From: Arthur Kreymer Is there a time estimate for getting these listings ? ___________________________________________ Date: Tue, 03 Feb 2009 14:29:47 -0600 From: Vladimir Podstavkov Fixed. Minos DCache pool file listings have been generated and will be generated daily from now on. About the pool info on http://fndca3a.fnal.gov:2288/cellInfo page. r-minos-stkendca25a-3 --> No such pool r-minos-stkendca26a-3 --> Are present r-minos-stkendca27a-2 --> Are present w-pub-minos-stkendca25a-3 --> No such pool w-raw-minos-stkendca26a-1 --> Are present The ticket can be closed. ___________________________________________ ___________________________________________ ######## # DATA # ######## Dealing with plans to import reroot files in /minos/data/users/cherdack/Daikon06/singles_reroot/7* 264 GBytes 12769 files Average size 21 MBytes Typically 31 subruns per run Typically 6 GBytes, 10 runs, 300 files and per directory ######## # GRID # ######## Submitted account request to TACC, for interactive access to UT Austin Teragrid resources. 
https://portal.tacc.utexas.edu/gridsphere/gridsphere
Approved, logged in OK at
https://portal.tacc.utexas.edu/gridsphere/gridsphere?cid
Login to longhorn.tacc.utexas.edu via ssh
This does not work, needed to be registered in a valid group by rmehdi

########
# FARM #
########

Continuing cleanup of dups for general CPB mcnear, only a few runs have dupes:
n13037009 - already removed as a test
n13037010
n13037064
n13037137
n13037138
n13037139
n13037148

To clean up,
./roundup -M -D -s "n13037010" -r cedar_phy_bhcurv mcnear

for RUN in n13037010 n13037064 n1303713 n13037148 ; do
    ./roundup -M -D -s "${RUN}" -r cedar_phy_bhcurv mcnear
    date
done
Fri Jan 16 13:14:59 CST 2009
Fri Jan 16 13:16:33 CST 2009
Fri Jan 16 13:18:03 CST 2009
Fri Jan 16 13:19:57 CST 2009

Now one more full pass on D04 files
Most files are L010185N_D04, a few are L250200N_D04
./roundup -s L250200N_D04 -r cedar_phy_bhcurv mcnear
./roundup -s L010185N_D04 -r cedar_phy_bhcurv mcnear

#############
# SADDCACHE #
#############

ln -sf saddcache.20090116 saddcache # was saddcache.20070806
Shifted date to the second field
Increased minimum file name width to 45, from 25, for cleaner reading

########
# MAIL #
########

Removed RFC2369 headers from lists for which they are not appropriate,
to eliminate the PINE messages
[ Note: This message contains email list management information ]
To disable the headers, added to the head of the options list,
Misc-Options= NO_RFC2369
minos_software_discussion

########
# MAIL #
########

Date: Fri, 16 Jan 2009 09:42:52 -0600 (CST)
Subject: HelpDesk ticket 127751

<-- # @@@ Enter Update below this line. @@@ # -->
<-- # @@@ Enter Update above this line. @@@ # -->
___________________________________________
Ticket #: 127751
___________________________________________
Short Description: SpamAssassin mistuned ?
Problem Description:
I am getting a lot more spam lately ( what else is new . )
But some of the items coming through have negative SpamAssassin ratings,
in spite of having the single most common signature of SPAM,
an IP address which is not registered in DNS.
For example,
Received: by hepa2.fnal.gov (Postfix, from userid 102)
    id 42BDBBA2F7; Fri, 16 Jan 2009 05:01:12 -0600 (CST)
Received: from ffhzi.comunitel.net (unknown [77.224.46.178])
    by hepa2.fnal.gov (Postfix) with SMTP id 8EB35BA2EE
    for ; Fri, 16 Jan 2009 05:01:10 -0600 (CST)
This produced
X-Spam-Status: No, score=-0.4 required=5.0 tests=BAYES_00,
    DATE_IN_PAST_96_XX, HTML_MESSAGE,RDNS_NONE,URI_HEX autolearn=no
    version=3.2.4
With so many flags set, how could this be getting a negative score ?
Perhaps something is broken in the SpamAssassin configuration.
___________________________________________
Date: Fri, 16 Jan 2009 09:59:52 -0600 (CST)
This ticket has been reassigned to MIHALEK, MAURINE of the CD-LSCS/CSI/CS/EST Group.
___________________________________________
Note To Requester:
i spoke to the mail expert. SpamAssassin is not broken.
the spam you are getting was crafted to score more negatively.
his suggestion is that you report this to the spam reporting page: http://computing.fnal.gov/cgi-bin/email/teachspam.cgi ___________________________________________ ___________________________________________ ============================================================================= 2009 01 15 ============================================================================= ######## # FARM # ######## Clearing out cedar_phy_bhcurv mcnear duplicates ./roundup -n -M -D -s n13037009 -r cedar_phy_bhcurv mcnear Looks OK, do this limited thing manually, ./roundup -M -D -s n13037009 -r cedar_phy_bhcurv mcnear Then clear out the D04_charm duplicates, ./roundup -n -M -D -s D04_charm -r cedar_phy_bhcurv mcnear OK - processing 660 files still many PEND files. Go ahead and drop the duplicates ./roundup -M -D -s D04_charm -r cedar_phy_bhcurv mcnear ============================================================================= 2009 01 14 ============================================================================= Date: Wed, 14 Jan 2009 08:19:08 -0600 (CST) Subject: HelpDesk ticket 127564 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 127564 ___________________________________________ Short Description: Recent Minos raw data file not readable in DCache Problem Description: dcache-admin,minos-data : I get an empty file when accessing a recent Minos raw data file in DCache. /pnfs/minos/fardet_data/2009-01/F00042685_0003.mdaq.root But the file should not be empty, see ls and Layer 4 metadata below : -rw-r--r-- 1 buckley e875 18677506 Jan 13 08:01 F00042685_0003.mdaq.root VO8699 0000_000000000_0003456 18677506 fardet_data /pnfs/fs/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root 000F00000000000009016408 CDMS123185530300000 stkenmvr10a:/dev/rmt/tps2d0n:479002012194 According to today's pool listings, this file is in w-stkendca8a-2 I see no files listed in the DCache filemonitor page. But that may not mean much, as those listing seem dated 23-Jul-2008 http://www-stken.fnal.gov/enstore/dcache_monitor/ Please restore F00042685_0003.mdaq.root in DCache. ___________________________________________ This ticket is assigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA. ____________________________________________ Date: Wed, 14 Jan 2009 09:11:02 -0600 This problem was forwarded to the dcache developers. ___________________________________________ Date: Wed, 14 Jan 2009 13:23:04 -0600 (CST) georges@fnal.gov sent this Notes To Requester: Notes To requester Please advise user to try to use file once again. Zero length copy of file has been removed from the pool after it has been verified that file exists on the tape and is of correct size. ___________________________________________ Date: Wed, 14 Jan 2009 20:23:26 +0000 (GMT) From: Arthur Kreymer The file is gone from DCache, as confirmed by dc_check. But it has not been restored yet, despite my attempt to dccp it, and my doing a manual dccp -P The restore is visible in the DCache restore queue pages http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/lazy http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/* It has been going to w-raw-minos-stkendca21a-1, since about 13:09. But there is no tape mounted, in spite of light library activity. I see the problem. At http://fndca3a.fnal.gov:2288/queueInfo, Restores Max is set to 0 for all the w-*-minos-* pools. This restore shows up as Queued, but will not get scheduled. 
Please boost the restore limit for these pools. __________________________________________ __________________________________________ Date: Wed, 14 Jan 2009 21:44:32 +0000 (GMT) From: Arthur Kreymer The file is now queued for restore, but is not yet on disk. I see no tape mounted, though the tape system is not busy. The restore request was in the restore queues for a while, but is now gone. dc_check reports the file as unavailable : MINOS26 > dc_check dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00 042685_0003.mdaq.root dc_stage fail : File not cached System error: Resource temporarily unavailable I have run another dccp by hand, at 15:35:35 I see the transfer in the Restore Queues again, but again no tape is mounted, and there is no active transfer. This time the file is targeted to w-raw-minos-stkendca22a-1 Still no tape activity, as of 15:44. __________________________________________ Date: Wed, 14 Jan 2009 15:18:58 -0600 From: Vladimir Podstavkov Max number of restores for the new pools set to 1. __________________________________________ Date: Wed, 14 Jan 2009 16:08:04 -0600 (CST) From: Dmitry Litvintsev Let us know if anything changed for the better. __________________________________________ Date: Wed, 14 Jan 2009 23:33:49 +0000 (GMT) From: Arthur Kreymer Restoring the file took about 12 minutes, from around 17:10, due to a modest backlog of restore requests ( your testing ? ). The file has been processed and declared to SAM. This ticket can be closed. What was done to kick things loose ? __________________________________________ Date: Wed, 14 Jan 2009 20:35:55 -0600 (CST) From: Dmitry Litvintsev Two issues: 1) pools are part of read/write link but number of restores was set to zero preventing staging files from tape. This has been noted by you and then fixed by us. 2) in addition the encp, which is as you know is HSM client used to put data to/from tape has been missing on the pools. So the HSM jobs will spawn and quit w/ errors. This has been spotted and corrected. The long time to restore a file was due to a backlog of store/restore jobs coming from these pools. __________________________________________ __________________________________________ __________________________________________ http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/lazy PnfsId Subnet PoolCandidate Started Clients Retries Status 000F00000000000009016408 0.0.0.0/0.0.0.0-*/* w-raw-minos-stkendca21a-1 01.14 13:09:37 3 0 Staging 01.14 13:09:37 /pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/* PnfsId Subnet PoolCandidate Started Clients Retries Status 000F00000000000009016408 0.0.0.0/0.0.0.0-*/* w-raw-minos-stkendca21a-1 01.14 13:09:37 3 0 Staging 01.14 13:09:37 Wed Jan 14 14:10:46 CST 2009 000F00000000000009016408 0.0.0.0/0.0.0.0-*/* w-raw-minos-stkendca24a-1 01.14 15:16:34 1 1 Waiting 01.14 15:16:34 While waiting, get a safe copy : [minos@minos-gateway ~]$ scp -c blowfish daqdcp.minos-soudan.org:/daqdata/F00042685_0003.mdaq.root /tmp/ [minos@minos-gateway ~]$ sum /tmp/F00042685_0003.mdaq.root 45765 18240 [minos@minos-gateway ~]$ scp /tmp/F00042685_0003.mdaq.root kreymer@minos26.fnal.gov:/minos/scratch/kreymer/ MIN > sum /minos/scratch/kreymer/F00042685_0003.mdaq.root 45765 18240 MINOS26 > ./dccptest /fardet_data/2009-01/F00042685_0003.mdaq.root PORT 24136 Connected in 0.00s. 
[Wed Jan 14 15:35:35 2009] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root in cache. Command failed! Server error message for [2]: "902" (errno 902). Failed open file in the dCache. Can't open source file : "902" System error: Input/output error ls: /local/scratch26/kreymer/DCCPTEST/F00042685_0003.mdaq.root: No such file or directory 000F00000000000009016408 0.0.0.0/0.0.0.0-*/* w-raw-minos-stkendca22a-1 01.14 15:35:35 1 1 Waiting 01.14 15:35:35 /pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root 17:10 - MINOS26 > dc_check dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root dc_stage fail : File not cached System error: Resource temporarily unavailable MINOS26 > date ; ./dccptest /fardet_data/2009-01/F00042685_0003.mdaq.root Wed Jan 14 17:11:36 CST 2009 PORT 24136 Connected in 0.00s. [Wed Jan 14 17:11:37 2009] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root in cache. Cache open succeeded in 877.22s. 18677506 bytes in 0 seconds -rw-r--r-- 1 kreymer g020 18677506 Jan 14 17:26 /local/scratch26/kreymer/DCCPTEST/F00042685_0003.mdaq.root 000F00000000000009016408 0.0.0.0/0.0.0.0-*/* w-raw-minos-stkendca22a-1 01.14 17:10:47 3 0 Staging 01.14 17:10:47 /pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root Restores are queued up for w-raw-minos-stkendca21a-1 8 w-raw-minos-stkendca22a-1 3 w-raw-minos-stkendca24a-1 2 By 17:26, the queue for 22a-1 was down to 0, 1 active ============================================================================= 2009 01 13 ============================================================================= ####### # DAQ # ####### Last file archived : Tue Jan 13 16:01:01 CST 2009 F00042685_0002.mdaq.root Jan 12 15:03 N00015450_0000.mdaq.root Jan 13 15:36 F090112_000000.mdcs.root Jan 12 19:43 N090112_000003.mdcs.root Jan 12 19:16 B090113_080001.mbeam.root Jan 13 10:14 Near DAQ is OK, Far is bad, both DCS are bad since yesterday. Slight slowdowns reading yesterday, http://www-numi.fnal.gov/computing/dh/ftplog/2009/01/12.txt 8 Mon Jan 12 14:50:53 CST 2009 557 10 Mon Jan 12 15:01:03 CST 2009 557 10 Mon Jan 12 15:11:13 CST 2009 557 5 Mon Jan 12 15:21:18 CST 2009 557 Looked at logs in LDIR=/local/scratch26/kreymer/genpy/neardet_data/2009-01 Testing a few of the difficult mdaq files with loon, firstlast.C Created freebird to allow me to run on minos26 setup_minos -r R1.22 The failed command was dbu -bq /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/dbu_sampy.C \ dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015439_0008.mdaq.root Trying loon -bq firstlast.C \ dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015439_0008.mdaq.root Processing firstlast.C... Spin(104292 in 104292 out 0 filt.) 
1) +RawRecCounts::Ana n=104292(104292/ 0) t=( 1.97/ 0.08) This looks OK on the surface, now try export ENV_TSQL_URL="mysql:odbc://minos-db1.fnal.gov/temp;mysql:odbc://minos-db1.fnal.gov/offline;mysql:odbc://minos-db1.fnal.gov/offl ine_dev" dbu -bq dbu_sampy.C \ dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015439_0008.mdaq.root This is also OK Try genpy in its normal context per HOWTO.genpy ./genpy -d -f N00015439_0008.mdaq.root neardet_data/2009-01 ./genpy -f N00015439_0008.mdaq.root neardet_data/2009-01 This looks OK ./predator Only one file fails, F00042685_0003.sam.py was not generated - check log for error Unable to open dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root ./dc_stat /pnfs/minos/fardet_data/2009-01/F00042685_0003.mdaq.root PNFS status for /pnfs/minos/fardet_data/2009-01/F00042685_0003.mdaq.root -rw-r--r-- 1 buckley e875 18677506 Jan 13 08:01 F00042685_0003.mdaq.root LEVEL 2 2,0,0,0.0,0.0 :l=18677506; LEVEL 4 VO8699 0000_000000000_0003456 18677506 fardet_data /pnfs/fs/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root 000F00000000000009016408 CDMS123185530300000 stkenmvr10a:/dev/rmt/tps2d0n:479002012194 2356856715 ============================ dc_check dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root Check passed for file "dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root" loon -bq firstlast.C \ dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003.mdaq.root Error in : error reading all requested bytes ./dccptest /fardet_data/2009-01/F00042685_0003.mdaq.root PORT 24136 Connected in 0.00s. [Tue Jan 13 22:13:23 2009] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-01/F00042685_0003. mdaq.root in cache. Cache open succeeded in 0.20s. 0 bytes in 0 seconds This file claims to be OK on tape, but is 0 length in DCache. Ran ./predator 2009-01 - clean except for this one file. 22:20 crontab crontab.dat ####### # EVO # ####### Registered. http://evo.caltech.edu/evoGate/index.jsp This was immediate via automatic email/web validation. ######### # MYSQL # ######### Date: Tue, 13 Jan 2009 14:56:04 -0600 (CST) Subject: HelpDesk ticket 127533 ___________________________________________ Ticket #: 127533 ___________________________________________ Short Description: On minos-mysql2, need /data/crl owned by minsoft Problem Description: run2-sys On minos-mysql2, please create directory /data/crl owned by minsoft,mysql just like /data/database. ___________________________________________ Date: Tue, 13 Jan 2009 15:02:24 -0600 (CST) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 13 Jan 2009 15:10:41 -0600 (CST) Solution: Request completed. ___________________________________________ ___________________________________________ ___________________________________________ ######### # MYSQL # ######### Added Minos Cluster account for our Mysql DBA MINOS01 > cmd add_minos_user svetlana Date: Tue, 13 Jan 2009 14:28:52 -0600 (CST) Subject: HelpDesk ticket 127529 ___________________________________________ Ticket #: 127529 ___________________________________________ Short Description: On minos-mysql1 and minos-mysql2, add svetlana access, remove bgreen, Problem Description: run2-sys : Please add local account access to minos-mysql1 and minos-mysql2 for user svetlana. 
This account has recently been added to the Minos Cluster. So you can just add to /etc/passwd : +svetlana::::: This would also be a good time to remove the bgreen accounts from minos-mysql1 and minoos-mysql2. Bruce left the lab years ago. I have examined the /home/bgreen files on minos-mysql1. They should be removed along with the account. ___________________________________________ Date: Tue, 13 Jan 2009 15:27:30 -0600 (CST) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 14 Jan 2009 14:36:44 -0600 (CST) Solution: Requests complete. This ticket was resolved by SCOTT, RENNIE of the CD-SF/FEF group. ___________________________________________ ######## # PNFS # ######## Date: Tue, 13 Jan 2009 11:53:12 -0600 From: John Hendry We only show Liz Buckley-Geer as the sole pnfs authorizations contact for minos. I know she is working in dark energy survey now. Should I make you the minos contact in our pnfs authorizations list? On http://www-ccf.fnal.gov/ISA/PNFS_Auth_list.html For these mountpoints: These users are authorized to request mountpoint exports: Experiment also known as: * stkensrv1:/minos * Liz Buckley-Geer (buckley@fnal.gov) E875/E934 _________________________________________________________________ Date: Tue, 13 Jan 2009 19:53:38 +0000 (GMT) From: Arthur Kreymer Yes, I should be the primary contact. Robert Hatcher (rhatcher) should also be authorized. ######## # PINE # ######## http://mailman2.u.washington.edu/pipermail/alpine-info/2008-July/000971.html ####### # DAQ # ####### FD data gap ? -rw-r--r-- 1 buckley e875 59609235 Jan 11 16:03 F00042682_0002.mdaq.root -rw-r--r-- 1 buckley e875 18551291 Jan 12 07:56 F00042679_0015.mdaq.root ============================================================================= 2009 01 12 ============================================================================= ######### # MYSQL # ######### Mysql> ./dbar -I STARTED DBARCHIVES Mon Jan 12 16:06:32 CST 2009 Archiving OFFLINE Mon Jan 12 16:06:32 CST 2009 69377 . 
Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 174G 44G 80% /data /minos/data/mysql/archive/20090112/offline/offline.log ########### # CRONTAB # ########### Restarted kreymer@minos26, mindata@minos26, down since AFS problems 7 Jan Ran condorproxy and predator manually ############ # PREDATOR # ############ Many, not all, ND files timed out N00015420_0005.mdaq.root Mon Jan 12 20:54:14 UTC 2009 N00015420_0008.mdaq.root Mon Jan 12 21:00:44 UTC 2009 N00015425_0000.mdaq.root Mon Jan 12 21:36:40 UTC 2009 N00015427_0002.mdaq.root Mon Jan 12 21:46:32 UTC 2009 N00015427_0008.mdaq.root Mon Jan 12 21:59:12 UTC 2009 N00015427_0009.mdaq.root Mon Jan 12 22:02:38 UTC 2009 N00015427_0010.mdaq.root Mon Jan 12 22:06:13 UTC 2009 N00015427_0011.mdaq.root Mon Jan 12 22:09:38 UTC 2009 N00015427_0012.mdaq.root Mon Jan 12 22:13:14 UTC 2009 N00015427_0014.mdaq.root Mon Jan 12 22:20:14 UTC 2009 N00015427_0016.mdaq.root Mon Jan 12 22:25:29 UTC 2009 N00015427_0017.mdaq.root Mon Jan 12 22:28:55 UTC 2009 N00015427_0019.mdaq.root Mon Jan 12 22:34:15 UTC 2009= N00015427_0020.mdaq.root Mon Jan 12 22:37:45 UTC 2009= N00015430_0000.mdaq.root Mon Jan 12 22:46:41 UTC 2009 N00015430_0002.mdaq.root Mon Jan 12 22:51:51 UTC 2009 N00015430_0003.mdaq.root Mon Jan 12 22:55:26 UTC 2009 N00015431_0000.mdaq.root Mon Jan 12 22:59:11 UTC 2009= N00015433_0002.mdaq.root Mon Jan 12 23:04:25 UTC 2009= N00015433_0005.mdaq.root Mon Jan 12 23:11:15 UTC 2009= N00015433_0006.mdaq.root Mon Jan 12 23:14:45 UTC 2009= N00015433_0009.mdaq.root Mon Jan 12 23:22:01 UTC 2009 N00015433_0010.mdaq.root Mon Jan 12 23:25:31 UTC 2009 N00015433_0014.mdaq.root Mon Jan 12 23:33:56 UTC 2009= N00015433_0015.mdaq.root Mon Jan 12 23:37:16 UTC 2009= N00015433_0017.mdaq.root Mon Jan 12 23:43:31 UTC 2009 N00015433_0018.mdaq.root Mon Jan 12 23:47:02 UTC 2009 N00015433_0019.mdaq.root Mon Jan 12 23:50:22 UTC 2009 N00015433_0020.mdaq.root Mon Jan 12 23:53:52 UTC 2009= N00015433_0021.mdaq.root Mon Jan 12 23:57:12 UTC 2009 N00015433_0022.mdaq.root Mon Jan 12 23:58:52 UTC 2009 N00015433_0023.mdaq.root Tue Jan 13 00:02:12 UTC 2009= N00015435_0000.mdaq.root Tue Jan 13 00:08:57 UTC 2009 N00015436_0004.mdaq.root Tue Jan 13 00:18:33 UTC 2009 N00015436_0007.mdaq.root Tue Jan 13 00:25:18 UTC 2009 N00015436_0009.mdaq.root Tue Jan 13 00:30:33 UTC 2009 N00015436_0010.mdaq.root Tue Jan 13 00:34:08 UTC 2009 N00015436_0014.mdaq.root Tue Jan 13 00:42:23 UTC 2009 N00015436_0015.mdaq.root Tue Jan 13 00:45:43 UTC 2009 N00015436_0017.mdaq.root Tue Jan 13 00:51:14 UTC 2009 N00015436_0020.mdaq.root Tue Jan 13 00:58:03 UTC 2009 N00015436_0021.mdaq.root Tue Jan 13 01:01:24 UTC 2009 N00015436_0023.mdaq.root Tue Jan 13 01:06:54 UTC 2009 N00015439_0004.mdaq.root Tue Jan 13 01:20:53 UTC 2009 N00015439_0008.mdaq.root Tue Jan 13 01:29:17 UTC 2009X N00015439_0009.mdaq.root Tue Jan 13 01:32:38 UTC 2009X N00015439_0010.mdaq.root Tue Jan 13 01:36:08 UTC 2009X N00015439_0011.mdaq.root Tue Jan 13 01:39:28 UTC 2009X N00015439_0012.mdaq.root Tue Jan 13 01:42:58 UTC 2009X N00015439_0013.mdaq.root Tue Jan 13 01:44:08 UTC 2009X N00015439_0014.mdaq.root Tue Jan 13 01:47:38 UTC 2009X N00015439_0015.mdaq.root Tue Jan 13 01:50:58 UTC 2009X N00015439_0016.mdaq.root Tue Jan 13 01:54:29 UTC 2009X N00015439_0017.mdaq.root Tue Jan 13 01:57:34 UTC 2009X N00015439_0018.mdaq.root Tue Jan 13 01:59:03 UTC 2009X N00015439_0019.mdaq.root Tue Jan 13 02:02:24 UTC 2009X Many of these were picked up the next cycle. 
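To spot-check whether timed-out subruns like those above are now readable from DCache,
a small loop over dc_check works. This is only a sketch, using the same dcap prefix as
the dccptest calls below, with two of the file names above substituted as examples :

# assumes the 'Check passed' line goes to stdout, as in the dc_check transcripts below
DCPOR=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01
for FILE in N00015420_0005.mdaq.root N00015427_0016.mdaq.root ; do
  if dc_check ${DCPOR}/${FILE} | grep -q 'Check passed' ; then
    echo "CACHED     ${FILE}"
  else
    echo "NOT CACHED ${FILE}"
  fi
done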
New timeouts, N00015439_0020.mdaq.root Tue Jan 13 08:12:30 UTC 2009 N00015439_0021.mdaq.root Tue Jan 13 08:14:05 UTC 2009 dccptest neardet_data/2009-01/N00015427_0016.mdaq.root OK, 2 seconds to copy dccp testneardet_data/2009-01/N00015420_0005.mdaq.root PORT 24136 Datafile with name 'neardet_data/2009-01/N00015420_0005.mdaq.root' not found. Connected in 0.00s. [Mon Jan 12 16:32:38 2009] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos//neardet_data/2009-01/N00015420_0005.mdaq.root in cache. Cache open succeeded in 317.21s. 115960637 bytes in 4 seconds (28310.70 KB/sec) ####### # CRL # ####### Reset password for rlt ( Richard Talaga ) via Administer menu.in CRL. minoscrl-admin added kreymer removed howcroft ######## # FARM # ######## touch /minos/data/minfarm/roundup/STOP.LOOPER ####### # DAQ # ####### Control Room network went down, back up around 8:50 Date: Mon, 12 Jan 2009 06:39:10 -0600 From: MINOS DAQ To: c.j.metelko@rl.ac.uk, gfp@fnal.gov, kreymer@fnal.gov Subject: Copy of N00015439_0012.mdaq.root to minos-om.fnal.gov:/data/root_files/ FAILED ============================================================================= 2009 01 10 ============================================================================= ######## # FARM # ######## ./looper '-r cedar_phy_bhcurv mcnear' & rm /minos/data/minfarm/roundup/STOP.LOOPER ./looper '-r cedar_phy_bhcurv mcnear' & ######### # MYSQL # ######### Moved XTRA databases to mysql2 ============================================================================= 2009 01 09 ============================================================================= ######## # DATA # ######## Date: Wed, 07 Jan 2009 09:29:02 -0500 From: Daniel D. Cherdack To: Arthur Kreymer Subject: Data Storage I have about 365G of Daikon04 AnaNue files that I would like to hold onto for at least the near term. Right now they sit on /minos/data/users/cherdack/ while I trim and merge them for transport analysis etc. When I finish where is the preferred location for storage? _________________________________________________________________________ Date: Sat, 10 Jan 2009 00:17:08 +0000 (GMT) I have started looking at this. For archival, we need most files to be at least 1 GB in size, and no more than about 100 per directory. Your files seem too small, with over 1000 in one directory. So we will need to tar them up in some simple fashion, then archive. ########## # PARROT # ########## Created indexparrot script under /home/mindata. Will put this into CVS soon. ####### # CRL # ####### Date: Fri, 09 Jan 2009 17:55:04 -0600 (CST) Subject: HelpDesk ticket 127347 ___________________________________________ Ticket #: 127347 ___________________________________________ Short Description: Minos CRL software sometimes silently fails to post an entry Problem Description: minoscrl-admin,gysin : As a spinoff of discussions earlier today, we verified that entries posted to the CRL without a specified Topic could be silently dropped. This happens if the item is Save'd without a Preview and without a Topic. Suzanne Gysin has added the warning in version v1_14 as deployed to Minos. Please update the official software, and the release notes. ___________________________________________ This ticket is assigned to HelpDesk of the Help Desk. ___________________________________________ Date: Mon, 12 Jan 2009 09:26:59 -0600 (CST) This ticket has been reassigned to GYSIN, SUZANNE of the CD-ILC/FP Group. 
___________________________________________ Date: Mon, 12 Jan 2009 09:48:08 -0600 (CST) Note To Requester: gysin@fnal.gov sent this Notes To Requester: The official software and release notes are updated and all CRLW installations are patched. ___________________________________________ Date: Tue, 20 Jan 2009 09:54:56 -0600 (CST) Solution: gysin@fnal.gov sent this solution: The software is updated and the release notes published This ticket was resolved by GYSIN, SUZANNE of the CD-ILC/FP group. ___________________________________________ ########## # DCACHE # ########## Subject : Re: HelpDesk ticket 126958 has additional info. The FTP transfers page seems to be up to date today. I presume that something was done to address the previous problem. If that is the case, this ticket can be closed. Thanks ! ####### # CRL # ####### Suzanne Gysin gave me admin access. Web master email: minoscrl-admin@fnal.gov Path to CRL configuration directory: /afs/fnal.gov/files/data/minos/crl_data/ Memo Pad URL: http://www-bd.fnal.gov/issues/wiki/MINOSMemopad////////// MINOS26 > fs listacl /afs/fnal.gov/files/data/minos/crl_data/ Access list for /afs/fnal.gov/files/data/minos/crl_data/ is Normal rights: bens:crlweb2 rlidwk bgreen:minoscrladmin rlidwka bgreen:minoscrl rlidwk spanacek:crladmin rlidwka system:administrators rlidwka system:anyuser rl buckley rlidwka bgreen rlidwka We should create minos:minoscrladmin to replace the breen's, Meanwhile, CRLCONF=/afs/fnal.gov/files/data/minos/crl_data/LogBook_admin/LogbookConfigParms.properties grep Logbook.file_location.entry_directory ${CRLCONF} Logbook.file_location.entry_directory = /afs/fnal.gov/files/data/minos/crl_data/CRLdata grep Logbook.file_location.www_directory ${CRLCONF} Logbook.file_location.www_directory = /afs/fnal.gov/files/data/minos/crl_data/WWWdirectory ####### # CRL # ####### Checklist for CRL: S.Gysin 1-9-08 AFS The symptoms for AFS failure are that one can not see the entries, and can not make new entries or annotations. In many cases the CRL administrator chooses to store the CRL entries in an AFS directory. This directory can be found in the configuration file. The path to the configuration file is noted in the CRL on line on the configure page: 1.On the CRL's index page (Index.jsp) select configure (upper left). Only a user with admin priviledges has access to this page. 2.The path to the CRL configuration directory is noted there. 3.Go to this directory and > cd LogBook_admin open the file LogbookConfigParms.properties This file contains the path to all data stored within the logbook. The directory for storing the entries is noted by the property: Logbook.file_location.entry_directory Logbook.file_location.www_directory 4.If this is an afs directory, one can check this by attempting to cd to it. Database The symptom of a Database failure are that one can not see any entries nor make new entries or annotations. Similarily to the AFS directory the database information is also stored in the configuration file. 1.On the CRL's index page (Index.jsp) select configure (upper left). Only a user with admin priviledges has access to this page. 2.The path to the CRL configuration directory is noted there. Go to this directory and > cd LogBook_admin open the file LogbookConfigParms.properties This file contains the database information in the following properties: Logbook.database.server Logbook.database.dbms_name Logbook.database.username 3.If this database is in your control, log in and execute a mysql command to see if it is up and running. 
If it is not in your control, open a help desk ticket, specifying the logbook and database. WebServer The symptom of a webserver failure are that one can not see the webpage and usually sees an error 500. The webserver is also specific to each logbook. Most logbooks at FNAL run currently under an alias on crlweb2. The alias has to be correct to see images, because each alias is assigned a virtual host with its own home directory (see afs) where the images are stored. The CRL runs under Tomcat and Apache. The distinction is irrelevant since either can be down and both have to be restarted if one is down. The webserver is restarted every day at 4 am to re-issue an AFS token. If you see that your webserver is down, open a help desk ticket stating the name of your logbook and the symptom. CRL application The symptoms of a CRL application failure vary greatly. You may see an exception and an error page, or you may find a single entry missing. If you are sure AFS, the database, and the Webserver are running, open a helpdesk ticket for the CRL application. ############ # PREDATOR # ############ Predator - look into beam declares, seemed to be choking STARTING Wed Jan 7 11:31:06 UTC 2009 B090106_080001.mbeam.root Wed Jan 7 11:31:07 UTC 2009 B090106_160002.mbeam.root Wed Jan 7 11:36:23 UTC 2009 ? ########## # ORACLE # ########## Found a list of HP Oracle certified servers, http://h18004.www1.hp.com/products/servers/linux/hplinuxcert-oracle.html For example, http://h10010.www1.hp.com/wwpc/us/en/en/WF25a/15351-15351-3328412-241644-3328422-3454575.html HP DL580 G5 Rack $25K 4x4core 2.9 GHz 64 GB, 4x350GB disk, dual ps, dual FC http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=431&FamilyId=2635&BaseId=23381&oi=E9CED&BEID=19701&SBLID= 3 GHz Intel is like 6 HGz Sparc ? HP DL580 G5 7400 Rack $28K 4x6core 2.7 GHz 64 GB, 4x250GB disk, dual ps, dual FC http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=431&FamilyId=2635&BaseId=28012&oi=E9CED&BEID=19701&SBLID= Similar Sun servers, like Netra X4250 Server, $7K The baseline Oracle server is Sun T5440 Config 2 http://shop.sun.com/is-bin/INTERSHOP.enfinity/WFS/Sun_NorthAmerica-Sun_Store_US-Site/en_US/-/USD/ViewStandardCatalog-Browse?CategoryName=SPARC_T5440&CategoryDomainName=Sun_NorthAmerica-Sun_Store_US-SunCatalog $92K, 8-Core 4 x 1.2 GHz UltraSPARC T2 Plus 64 GB (32 x 2 GB DIMMs) DELL www.dell.com/oracle http://www.dell.com/content/topics/global.aspx/alliances/en/oracle_builds?c=us&cs=555&l=en&s=biz&~tab=3 Dell 900 Oracle Validated $37K 4x6core 2.7 GHz 64 GB, 4x300GB disk, dual ps, dual FC http://configure.us.dell.com/dellstore/config.aspx?c=us&cs=555&l=en&oc=MLB1041&s=biz ============================================================================= 2009 01 08 ============================================================================= ######### # BATCH # ######### Finished stray cpl far files (3 each mrnt, sntp) ./roundup -r cedar_phy_linfix mcfar Waited aobut 5 minutes for files to be written, iterated. ./roundup -r cedar_phy_linfix mcfar Date: Thu, 08 Jan 2009 15:31:10 -0600 (CST) From: Howard@agni.phys.iit.edu, Rubin@agni.phys.iit.edu To: kreymer@fnal.gov Subject: On vacation I'm on vacation and will not be reading my mail for a while. Your mail will be dealt with when I return on or about January 12. 
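For the record, the stray cedar_phy_linfix mcfar pickup above was just run, wait,
run again. A minimal sketch of that pattern, where the 300 second sleep is only a
stand-in for the roughly 5 minute wait noted above, not a tuned value :

for PASS in 1 2 ; do
  ./roundup -r cedar_phy_linfix mcfar
  [ ${PASS} -lt 2 ] && sleep 300   # allow time for the remaining files to be written
done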
=============================================================================
2009 01 07
=============================================================================

#######
# AFS #
#######

Date: Thu, 08 Jan 2009 10:38:56 -0600
From: Ramon C. Pasetes
To: 'Arthur Kreymer'
Subject: RE: CENTRAL WEB Servers & AFS Status (fwd)

Hi Art,

We don't know what caused the incident yesterday,
but we have been stable as of 10:22.

-Ray
_____________________________________________________________________

Date: Thu, 08 Jan 2009 17:08:44 +0000 (GMT)
From: Arthur Kreymer
To: minos-admin@fnal.gov
Cc: minos_all@fnal.gov, minos_software_discussion@fnal.gov, minos-shifters@fnal.gov
Subject: Re: Fermilab AFS problems since 05:00 CST ( 11:00 UTC )

The root cause of yesterday's AFS problems is not yet understood.
But AFS is stable and back to normal, since about 10:30 yesterday.
We can resume normal operations.

#########
# MYSQL #
#########

Archived the extra databases,

Mysql> ./dbar -X
tee: /minos/data/mysql/archive/20090107/archive.log: No such file or directory
STARTED DBARCHIVES Wed Jan 7 16:50:06 CST 2009
STARTED DBARCHIVES Wed Jan 7 16:50:06 CST 2009
FINISHED DBARCHIVES Wed Jan 7 17:14:32 CST 2009

Helping nwest get connected to the server on mysql2,
getting dbarchive ready for running in cron,
and cleaning the non-database files out of /data/database

mkdir -p /var/tmp/minsoft/maint
mkdir -p /var/tmp/minsoft/maint/grid
chmod 700 /var/tmp/minsoft/maint/grid
chmod 700 /var/tmp/minsoft/maint/grid/mysqlroot

Test this on each system:
export MYSQL_PWD=`cat /var/tmp/minsoft/maint/grid/mysqlroot`
. ${HOME}/setups.sh
setup mysql
mysqladmin -u root processlist

###########
# MINOS01 #
###########

Load average on minos01 is increasing.
Many processes seen in ps axfu, like arms /usr/krb5/bin/kcron /usr/krb5/bin/aklog cp -p /minos/data/mcimport/hgallag/md5/all.md5 /afs/fnal/files/home/room3/arms/public_html/hgallag.all.md5

Date: Wed, 7 Jan 2009 17:22:37 +0000 (GMT)
From: Arthur Kreymer
To: arms@fnal.gov
Cc: minos-admin@fnal.gov
Subject: arms cron jobs piling up on minos01

Kregg :

You have many cron jobs piling up on minos01.fnal.gov,
starting every 10 minutes, doing things like

cp -p /minos/data/mcimport/hgallag/md5/all.md5 \
 /afs/fnal/files/home/room3/arms/public_html/hgallag.all.md5

These are getting stuck due to the present AFS problems.
You should kill off most of these,
and put in a test to keep more than one from running at once.

Done, the load is back down.

########
# PNFS #
########

Isolated slow access to PNFS, but this may have been due to AFS

http://www-numi.fnal.gov/computing/dh/pnfslog/NOW.txt
102 Wed Jan 7 04:35:22 CST 2009
 50 Wed Jan 7 05:01:22 CST 2009
776 Wed Jan 7 05:29:21 CST 2009
 52 Wed Jan 7 07:32:11 CST 2009
203 Wed Jan 7 08:20:48 CST 2009

http://www-numi.fnal.gov/computing/dh/ftplog/NOW.txt
136 Wed Jan 7 04:35:24 CST 2009 557
823 Wed Jan 7 05:29:23 CST 2009 557
116 Wed Jan 7 07:32:13 CST 2009 557

#######
# CRL #
#######

D0 logbook is at
http://www-d0online.fnal.gov/crlw/Index.jsp?inquiry=/CRLWindex/2_Hr_All_Entries
D0 uses v1_8_28 September 6, 2006
Minos uses v1_13 November 6, 2008

Available support levels are :
24by7    ( commonly called 24x7 )
8to00by7
8to17by7
8to17by5 ( commonly called 8x5, incorrectly, should be 9x5 )

Per discussion with Tom B, we need an SLA (Service Level Agreement)
listing the components of CRL and their support.
D0 is doing a full review of their online systems, much larger scale.

##########
# CONDOR #
##########

Moved condorglide out of AFS, due to today's global failure.
MINOS25 > pwd /minos/scratch/kreymer/condor/probe MINOS25 > cp ${HOME}/minos/scripts/condorglide ../condorglide crontab.minos25 MAILTO=kreymer@fnal.gov 0-59/10 * * * * /minos/scratch/kreymer/condor/condorglide 07 1-23/2 * * * /usr/krb5/bin/kcron /local/scratch25/grid/kproxy ####### # AFS # ####### Date: Wed, 07 Jan 2009 15:49:13 +0000 (GMT) From: Arthur Kreymer To: minos-admin@fnal.gov Cc: minos_all@fnal.gov, minos_software_discussion@fnal.gov, minos-shifters@fnal.gov Subject: Fermilab AFS problems since 05:00 CST ( 11:00 UTC ) There have been severe AFS problems since about 05:00 CST ( 11:00 UTC ). This has affected most of the Fermilab web pages, and has severely slowed down logins to Minos Cluster nodes, to the point of uselessness. The Fermilab experts are working to resolve the problem. Please minimize use of /afs/fnal.gov . Please stand by for a further announcement. ########## # ORACLE # ########## Date: Wed, 07 Jan 2009 09:42:04 -0600 From: Maurine Mihalek To: Arthur Kreymer Cc: dsg-entire group , minosdb-support@fnal.gov, csi unix group Subject: Re: new kernel for minosora1/minosora3 minosora3 is back up. julie checked db's and they are up. will co-ordinate minosora1 with nelly. maurine ----- Original Message ----- From: Arthur Kreymer Date: Monday, January 5, 2009 2:50 pm Subject: Re: new kernel for minosora1/minosora3 To: Maurine Mihalek Cc: dsg-entire group , minosdb-support@fnal.gov, csi unix group > On Fri, 2 Jan 2009, Maurine Mihalek wrote: > > > there is a new linux kernel that needs to be made effective on > minosora1 and > > minosora3. > > > > i would like to do minosora3 on wednesday (1/7/2009) and reboot > around 8:30 > > am. minosora3 should be up by 9 am. minosora3 has been up for 231 days. > > > > for minosora1, i would like to upgrade the kernel and reboot on > tuesday > > (1/13) morning at 8 am. minosora1 has been up for 192 days. > > > > are these days and times acceptable? > > You can do minosora3 any time, just let me know when it is done. > > Since the January quarterly patches are coming out soon, > I would rather combine the minosora1 kernel update with those patches, > to minimize service interruptions. > > Note - please do not cc: minos_software_discussion or minos-data. > Those lists are not related to database support. ============================================================================= 2009 01 06 ============================================================================= ######### # BATCH # ######### Date: Tue, 06 Jan 2009 15:30:51 -0600 (CST) From: HelpDesk Subject: HelpDesk ticket 127096 ___________________________________________ Ticket #: 127096 ___________________________________________ Short Description: Ten runaway emacs session for user elllis Problem Description: User ellis has ten emacs sessions running on fnpcsrv1. 9 of these are running CPU bound, each using nearly 19 hours of CPU so far. This is bogging down the the fnpcsrv1 server. fnpcsrv1% ps -flu ellis F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 0 R ellis 340 1 83 85 0 - 2149 - Jan05 ? 18:53:47 xemacs rubmit2.al=0.1 4 S ellis 2029 1370 0 75 0 - 2033 - 10:24 pts/1 00:00:00 -tcsh 0 R ellis 4417 1 77 85 0 - 2149 - Jan05 ? 18:56:06 xemacs submit1.al=0.1 0 R ellis 5520 1 77 85 0 - 2149 - Jan05 ? 18:50:51 xemacs copy 0 R ellis 7202 1 78 85 0 - 2149 - Jan05 ? 19:06:37 xemacs symlin 0 R ellis 7720 1 78 85 0 - 2149 - Jan05 ? 18:54:52 xemacs symlin 0 R ellis 10945 1 78 85 0 - 2149 - Jan05 ? 
18:55:42 xemacs submit1.al=0.01 4 S ellis 12023 12016 0 75 0 - 1732 - 12:03 pts/7 00:00:00 [Message 1 copied to "minosbatch" in and deleted] cc: ellis@fnal.gov ___________________________________________ Date: Tue, 06 Jan 2009 15:39:08 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: Art, I had already seen these emacs processes and killed them by the time this ticket got to me. ___________________________________________ ___________________________________________ ####### # CRL # ####### Found an interesting CRL review, when responding to an errant Helpdesk ticket 127060 by Elvin Harms (ILC) Electronic Logbooks for Use at FNAL ILC Test Areas http://docdb.fnal.gov/ILC/DocDB/0003/000306/001/ElectronicLogbooks_060529.ppt They mention a PSI logbook : PSI logbook: This product was used at MINOS for a while but it's use is declining due to support problems. PSI: MINOS liked it but they tried to make some changes and the server now hangs frequently. Archaic architecture is blamed for the difficulty in finding the problem. Rejected. ######### # BATCH # ######### LINFIX status : less LOG/2008-12/cedar_phy_linfixmcnear.log Files were picked up Thu Dec 25 18:38:25 CST 2008 There are 7 each sntp/mrnt linfix files left in mcnearcat n11011001_0009_L010185N_D00.mrnt.cedar_phy_linfix.0.root n11011001_0009_L010185N_D00.sntp.cedar_phy_linfix.0.root n11011015_0002_L010185N_D00.mrnt.cedar_phy_linfix.0.root n11011015_0002_L010185N_D00.sntp.cedar_phy_linfix.0.root n13011112_0010_L010185N_D00.mrnt.cedar_phy_linfix.0.root n13011112_0010_L010185N_D00.sntp.cedar_phy_linfix.0.root n13011318_0000_L010185N_D00.mrnt.cedar_phy_linfix.0.root n13011318_0000_L010185N_D00.sntp.cedar_phy_linfix.0.root n13011493_0009_L010185N_D00.mrnt.cedar_phy_linfix.0.root n13011493_0009_L010185N_D00.sntp.cedar_phy_linfix.0.root n13011494_0010_L010185N_D00.mrnt.cedar_phy_linfix.0.root n13011494_0010_L010185N_D00.sntp.cedar_phy_linfix.0.root n13012017_0003_L010185N_D00.mrnt.cedar_phy_linfix.0.root n13012017_0003_L010185N_D00.sntp.cedar_phy_linfix.0.root These are all marked as bad : n11011001_0009_L010185N_D00.0 136 2008-10-29 21:06:51 fcdfcaf1581 n11011001_0009_L010185N_D00.0 136 2008-10-29 21:06:51 fcdfcaf1581 n11011015_0002_L010185N_D00.0 136 2008-10-29 21:02:37 fcdfcaf1508 n11011015_0002_L010185N_D00.0 136 2008-10-29 21:02:37 fcdfcaf1508 n13011112_0010_L010185N_D00.0 136 2008-10-30 07:51:53 fcdfcaf1599 n13011112_0010_L010185N_D00.0 136 2008-10-30 07:51:53 fcdfcaf1599 n13011318_0000_L010185N_D00.0 136 2008-11-01 16:21:35 fcdfcaf1614 n13011318_0000_L010185N_D00.0 136 2008-11-01 16:21:35 fcdfcaf1614 n13011493_0009_L010185N_D00.0 136 2008-11-04 04:53:12 fcdfcaf1664 n13011493_0009_L010185N_D00.0 136 2008-11-04 04:53:12 fcdfcaf1664 n13011494_0010_L010185N_D00.0 136 2008-11-04 07:35:02 fcdfcaf1613 n13011494_0010_L010185N_D00.0 136 2008-11-04 07:35:02 fcdfcaf1613 n13012017_0003_L010185N_D00.0 136 2008-12-09 15:46:47 fnpc263 n13012017_0003_L010185N_D00.0 136 2008-12-09 15:46:47 fnpc263 The HAVE messages show the following missing subruns run missing n11011001 - 0000 n11011015 - 0000 0001 n13011112 - complete n13011318 - 0010 n13011493 - complete n13011494 - complete n13012017 - complete I have taken the liberty of editing the bad_runs file : RUBIN> cd /minos/data/minfarm/lists RUBIN> cp -a bad_runs_mc.cedar_phy_linfix bad_runs_mc.cedar_phy_linfix.20081211 RUBIN> nedit bad_runs_mc.cedar_phy_linfix ####### # NET # ####### Date: Tue, 06 Jan 2009 13:40:09 -0600 From: Rick Finnegan Thursday, January 15, 2008  6:00pm - 7:00pm 
Upgrade S-S-WH8W-5 network switch chassis Minos/Numi nodes connected : wh12whp800c wh12w-hp4200 wh12w-xerox8400 numi-koizumilt numi-92582 numi-94790 numi-lucaspc numi-plunkettpc ######### # ADMIN # ######### This shows status of Minos and other systems ( Ganglia up/down ) catecgorized by date of purchase. http://d0om.fnal.gov/d0admin/faultlog/ ######### # MYSQL # ######### dbarchive - adding -X option to archive non-offline/crl tables ######### # MYSQL # ######### Mysql> mysqladmin -u root processlist | grep crlweb | wc -l 30 ######### # ADMIN # ######### MINOS01 > setup systools MINOS01 > cmd add_minos_user lueking Creating account... /var/yp gmake[1]: Entering directory `/var/yp/minos' gmake[1]: `ypservers' is up to date. gmake[1]: Leaving directory `/var/yp/minos' gmake[1]: Entering directory `/var/yp/minos' Updating passwd.byname... Updating passwd.byuid... Updating netid.byname... gmake[1]: Leaving directory `/var/yp/minos' Adding user to Minos AFS group... Sending mail to subscribe to minos-user mailing list ... Sending email to user... ######## # MAIL # ######## minos-shifters mail list To cut spam, but allow mail from stk-users, changed configuration from Send= Private to Local=fnal.gov,*.fnal.gov Service=Local Send=Service ########### # MONTHLY # ########### DATASETS 1/6 PREDATOR 1/6 VAULT 1/3 MYSQL 1/6 rm -r /data/archive/COPY/20081114 scripts/dbarchive Tue Jan 6 12:03:37 CST 2009 FINISHED DBARCHIVES Tue Jan 6 15:14:55 CST 2009 ============================================================================= 2009 01 05 ============================================================================= ########## # DCACHE # ########## id=164 - use dc_check or srmls to check disk copies MINOS26 > dc_check dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015411_0008.mdaq.root Check passed for file "dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01/N00015411_0008.mdaq.root" dc_check does dccp -P -t -1 $* Reviewed http://www-dcache.desy.de/manuals/dccp.html Need a file which is not on disk : /pnfs/minos/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root ./dc_stat /pnfs/minos/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root MINOS26 > time dc_check dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root dc_stage fail : File not cached System error: Resource temporarily unavailable Check FAILED for file "dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root" real 0m0.382s user 0m0.004s sys 0m0.020s MINOS26 > time dccp -P -t -1 dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root dc_stage fail : File not cached System error: Resource temporarily unavailable real 0m0.804s user 0m0.001s sys 0m0.010s real 0m0.192s user 0m0.001s sys 0m0.010s Time this for FILES=`ls /pnfs/minos/neardet_data/2009-01` MINOS26 > printf "${FILES}\n" | wc -l 181 DCP=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01 time for FILE in ${FILES} ; do dccp -P -t -1 ${DCP}/${FILE} ; done real 0m41.377s user 0m0.172s sys 0m1.557s Rate is 4.4 files/second Check using Layer 2, MINOS26 > time ./stage -n neardet_data/2009-01 Staging files from /pnfs/minos/neardet_data/2009-01 Prestaging 183 files ................... Needed 183/183 STARTED Wed Jan 7 11:56:52 CST 2009 FINISHED Wed Jan 7 11:56:59 CST 2009 Rate is 23 files/second. 
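Since the point of this exercise (id=164) is a fast check of which files have disk
copies, the timing loop above extends naturally into a cached/uncached tally. A sketch
only, assuming dccp -P -t -1 exits nonzero when dc_stage fails ( dc_check, which wraps
the same command, does report Check FAILED in that case ) :

DCP=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2009-01
FILES=`ls /pnfs/minos/neardet_data/2009-01`
CACHED=0 ; TOTAL=0
for FILE in ${FILES} ; do
  TOTAL=`expr ${TOTAL} + 1`
  dccp -P -t -1 ${DCP}/${FILE} > /dev/null 2>&1 && CACHED=`expr ${CACHED} + 1`
done
echo "Cached ${CACHED} of ${TOTAL} files in neardet_data/2009-01"

At the 4.4 files/second rate measured above this is fine for one month of raw data,
but the Layer 2 based stage -n check remains much faster for bulk scans.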
Testing srmls 15 files in beam_data/2004-12 time srmls real 0m4.313s user 0m7.715s sys 0m0.263s time srmls -l real 0m9.623s user 0m8.063s sys 0m0.276s 181 files in neardet_data/2009-01 time srmls real 0m6.865s user 0m10.177s sys 0m0.266s time srmls -l real 1m9.835s user 0m11.597s sys 0m0.368s Perhaps disk status is given by access latency:NEARLINE locality:ONLINE_AND_NEARLINE Test non-local file SPATH2z=${S2MINOS}/reco_far/R1.11/snts_data/2004-02/F00022928_0000.snts.R1.11.root access latency:NEARLINE locality:NEARLINE real 0m4.758s user 0m6.214s sys 0m0.235s Now test a larger directory, with over 1K files : SPATH2B=${S2MINOS}/fardet_data/2008-12 time { srmls ${SPATH2B} | tee /tmp/bigls ; } real 0m24.986s user 0m16.601s sys 0m0.483s $ time srmls -l ${SPATH2B} real 6m9.159s user 0m18.488s sys 0m0.557s Rates are 40 files/second for the list, 2.7 files/second for the full list ######### # EMAIL # ######### minos-shifters - allow email from stk admins ? ########## # DCACHE # ########## Data logging failing since 02:06, picked up 09:51 ftplog gap 02:23 to 09:41 Gap in pagedcache ftp transferf 02:15 to 09:38 PREDATOR genpy large interval 07:09 to 15:39 ___________________________________________ Date: Mon, 05 Jan 2009 10:48:37 -0600 (CST) Subject: HelpDesk ticket 126954 ___________________________________________ Ticket #: 126954 ___________________________________________ Short Description: FNDCA ftp transfers failing since Problem Description: FTP transfers from DCache seem to have failed from about 02:30 to 09:30 today, 2008 Jan 05. This includes password access FTP reads and kerberized FTP writes . Was there a known outage ? ___________________________________________ Date: Mon, 05 Jan 2009 11:43:05 -0600 (CST) Solution: jhendry@fnal.gov sent this solution: Hi Art, Yes there was a problem which has already been resolved. I clicked the d0 box instead of stken when I made the initial announcement so that may be why you did not see it. However when I sent the resolution announcement I also clicked the stken box on that web page form. Its confusing as on the announcment web page form the boxes are on the left of each instance whereas on other forms we use they are on the left of each instance. Sorry for any confusion. Date: Mon, 05 Jan 2009 10:17:13 -0600 The stken pnfs server matter has been resolved. Please report any further problems. Thanks, John Hendry SSA Primary ___________________________________________ ___________________________________________ ___________________________________________ This ticket is assigned to NAYMOLA, STAN of the CD-SF/DMS/DSC/SSA. Date: Mon, 05 Jan 2009 10:54:55 -0600 (CST) Subject: HelpDesk ticket 126958 ___________________________________________ Ticket #: 126958 ___________________________________________ Short Description: FNDCA Recent FTP Transfers web page is missing entries Problem Description: The only entries in the Recent FTP Transfers web page are for pagedcache(5744.6209). http://fndca3a.fnal.gov/cgi-bin/dcache_files.py There have been many other recent transfers from other accounts, but these are not showing up on the web page. ___________________________________________ This ticket is assigned to NAYMOLA, STAN of the CD-SF/DMS/DSC/SSA. ____________________________________________ Date: Mon, 05 Jan 2009 12:48:21 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: Opened bugzilla 186 for dcache developers. 
___________________________________________ Date: Mon, 05 Jan 2009 12:54:52 -0600 (CST) This ticket has been reassigned to HENDRY, JOHN of the CD-SF/DMS/DSC/SSA Group. __________________________________________ The FTP transfers page seems to be up to date today. I presume that something was done to address the previous problem. If that is the case, this ticket can be closed. Thanks ! __________________________________________ Date: Mon, 12 Jan 2009 17:27:59 +0000 (GMT) As of 11:24 today, the FTP transfers page only shows pagedcache items, Times from 2009-01-11 11:15:08 to 2009-01-12 11:15:33 No entries for any other users. So please continue to investigate. __________________________________________ Date: Wed, 14 Jan 2009 17:56:56 +0000 (GMT) The FTP transfers page is still incomplete, showing only recent transfers by pagedcache, none of the 'buckley' transfers of Minos raw data. Is there any progress on bringing it back to life ? __________________________________________ Date: Wed, 14 Jan 2009 13:02:11 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: Dmitry believes he has solved this issue. I will await your OK prior to closing this ticket. Comment #4 from Dmitry Litvintsev 2009-01-14 12:33:33 The pages shows up to date info (I updated manually). The cause of files not being copied to the right destination is being investigated. Comment #5 from Dmitry Litvintsev 2009-01-1412:42:13 lock file discovered and removed. Page is up to date. Lock file dated Jan 09. __________________________________________ Date: Wed, 14 Jan 2009 19:53:47 +0000 (GMT) From: Arthur Kreymer The web page is up to date again, after the clearing of a lock file by Dmitri. You can close this ticket. Thanks ! __________________________________________ Date: Wed, 14 Jan 2009 14:21:55 -0600 (CST) Originator concurs issue has been resolved with lock file removal by Dmitry. This ticket was resolved by HENDRY, JOHN of the CD-SF/DMS/DSC/SSA group. __________________________________________ __________________________________________ ####### # CDF # ####### no cvs backups since 22 Dec, low disk space. ######### # ADMIN # ######### jmusser/musser password reset needed MINOS26 > finger musser@fnal.fnal.gov OpenLDAP Finger Service... Search failed to find anything. MINOS26 > finger jmusser@fnal.fnal.gov OpenLDAP Finger Service... 1 exact match found for "jmusser": "jmusser, People" Users Name: james mussser User ID: jmusser E-Mail forwarded to: jmusser@indiana.edu Suggested that he call the helpdesk for a password reset. ######### # DOCDB # ######### Updated ticket 126747, yes do get the developers involved. This is resolved ! See the ticket below. Informed m_s_d ########## # CONDOR # ########## condor admins schedule - Jan 9 ############ # minverva # ############ Lee Leuking joins as Minerva liaison ########### # ENSTORE # ########### Restarted 29 Dec 15:30 through 17:13 ########### # ENSTORE # ########### 9940 and 9940B library managers down Tue 30 Dec 04:00 through 11:33 ######### # ADMIN # ######### 126613 ncurses.i386 installed on minos-mysql2 ######### # MYSQL # ######### beam dbu processes hung Tue, 30 Dec 2008 15:59:18 Date: Mon, 05 Jan 2009 19:09:33 +0000 (GMT) From: Arthur Kreymer The database server Ganglia monitoring shows a low load average, and very little network activity at this time, as viewed at http://rexganglia2.fnal.gov/minos/?r=week&c=MINOS+Server&h=minos-mysql1. 
fnal.gov Dan Cherdack did have about 250 grid jobs running at that time, There are a lot of connections right now from grid nodes, with long open connections to temp, like : | Id | User | Host | db | Command |Time | 194405069 | reader_old | fncdf279.fnal.gov:51647 | temp | Sleep |9491 MINOS25 > condor_q -run | grep -v gfactory | grep fncdf279 254128.10 cherdack 1/5 10:16 0+02:51:30 glidein_28550@fncdf279.fnal.gov Dan, we need to see what is going wrong with your jobs. ########## # TRAVEL # ########## signed up for cambridge meeting, http://www.hep.phy.cam.ac.uk/~thomson/meetings/collabmtg2009/ ########## # ORACLE # ########## Date: Fri, 02 Jan 2009 15:30:36 -0600 From: Maurine Mihalek To: Arthur Kreymer Cc: dsg-entire group , minosdb-support@fnal.gov, csi unix group , minos_software_discussion@fnal.gov, minos-data@fnal.gov Subject: new kernel for minosora1/minosora3 there is a new linux kernel that needs to be made effective on minosora1 and minosora3. i would like to do minosora3 on wednesday (1/7/2009) and reboot around 8:30 am. minosora3 should be up by 9 am. minosora3 has been up for 231 days. for minosora1, i would like to upgrade the kernel and reboot on tuesday (1/13) morning at 8 am. minosora1 has been up for 192 days. are these days and times acceptable? maurine ___________________________________________________________________ You can do minosora3 any time, just let us know when it is done. Since the January quarterly patches are coming out soon, I would rather combine the minosora1 kernel update with those patches, to minimize service interruptions. Note - please do not cc: minos_software_discussion or minos-data. Those lists are not related to database support. ########## # CONDOR # ########## Date: Sun, 04 Jan 2009 00:38:55 -0600 (CST) From: Jeff K deJong I seem to be having some trouble getting my jobs to run on Condor, they ran for me at the start of december and now I run my scripts again and my jobs are just sitting idle in the queue, while other jobs submited after I submitted mine run and complete OK. I've run a condor analyze command ... 253380.034: Run analysis summary. Of 145 machines, 39 are rejected by your job's requirements 106 reject your job because of their own requirements 0 match but are serving users with a better priority in the pool 0 match but reject the job for unknown reasons 0 match but will not currently preempt their existing job 0 are available to run your job The Requirements expression for your job is: ( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) && ( target.HasFileTransfer ) Condition Machines Matched Suggestion --------- ---------------- ---------- 1 ( target.Arch == "X86_64" ) 106 2 ( target.OpSys == "LINUX" ) 145 3 ( target.Disk >= 3 ) 145 4 ( ( 1024 * target.Memory ) >= 3 ) 145 5 ( target.HasFileTransfer ) 145 The following attributes are missing from the job ClassAd: RunOnGrid x509userproxysubject Reply : The file suggests that you are requiring X86_64 ( 64 bit kernel ) while submitting to the local Minos Cluster. We do not have any such nodes. 
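One way to confirm and relax that requirement, sketched below; the submit file name is hypothetical, and the job id is the one analyzed above.

  # Confirm the 64-bit clause in the job ClassAd
  condor_q -l 253380.034 | grep '^Requirements'
  # In the submit file, target the 32-bit local nodes instead, e.g.
  #   requirements = ( Arch == "INTEL" ) && ( OpSys == "LINUX" )
  # then resubmit ( file name hypothetical )
  condor_submit myjobs.run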
For local cluster submission, see for example /minos/scratch/kreymer/condor/probe/probe.run ============================================================================= 2008 12 26 ============================================================================= KREYMER IS ON VACATION UNTIL 2009 JANUARY 5 nodified minos-admin,minos_batch,minos-data ============================================================================= 2008 12 25 ============================================================================= ON SHIFT 00:00 - 07:00 ######## # FARM # ######## To : Howard Rubin Cc : Alexandre Sousa , minos-data@fnal.gov Attchmnt: Subject : Re: linfix reprocessing ----- Message Text ----- On Wed, 24 Dec 2008, Howard Rubin wrote: > The linfix reprocessing is complete. The latest file in /minos/data/minfarm/mcnearcat is dated 9 Dec. But I see 89 each sntp and mrnt files in mcnear ( without the cat ), 15 from Dec 9 and the rest from Dec 24. Should these be shifted to mcnearcat ? ######## # DATA # ######## On Wed, 24 Dec 2008, Robert Hatcher wrote: > On Dec 24, 2008, at 8:26 AM, "Musser, James A." wrote: > > Robert: >  Could you copy > > /minos/scratch/petyt/FDfiles_2008/.bntp/all_events_cphy_bfld.root > > to someplace I can access it in afs space, if it still exists?  > Sorry, I have lost the capability of logging on to the minos cluster. cp /minos/scratch/petyt/FDfiles_2008/.bntp/all_events_cphy_bfld.root \ /afs/fnal.gov/files/data/minos/d13/musser/all_events_cphy_bfld.root Done, Merry Christmas ! ######### # DOCDB # ######### Date: Wed, 24 Dec 2008 06:25:09 -0600 (CST) From: HelpDesk Subject: HelpDesk ticket 126747 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126747 ___________________________________________ Short Description: Minos DocDB pages are extremely slow to load Problem Description: Several Minos DocDB pages are extremely slow to load. This is true with a variety of browsers, and both Linux and XP clients. Therefore this is probably a server side problem. Specific examples : 15 seconds to load https://minos-docdb.fnal.gov:440/cgi-bin/ShowCalendar 19 seconds to load http://minos-docdb.fnal.gov:8080/cgi-bin/DisplayMeeting?sessionid=1307 ___________________________________________ This ticket is assigned to TECKENBROCK, MARCIA of the CD-CDO/CO. ___________________________________________ Date: Sat, 27 Dec 2008 12:15:23 -0600 Dear MINOS DocDB Admins, Please see the complaint below. This issue was brought to our attention prior to the upgrade, but we felt we had obtained a manageable speed. Of course, instances with more events will have a slower load time. Do you feel the 19 second load time for the events page is serious enough to get the developer involved? Thank you. -Marcia ___________________________________________ Yes, I think this is too slow, given that these pages formerly loaded in a fraction of a second. Something is clearly wrong. There is nothing being done that should take anything like this long. This is well worth getting the developers involved. For example, it takes 15 seconds to load a meeting containing a single talk and a single document : https://minos-docdb.fnal.gov:440/cgi-bin/DisplayMeeting?sessionid=1264 ___________________________________________ Date: Mon, 05 Jan 2009 17:14:16 -0600 (CST) Solution: E. Vaandering has fixed the problem and notes we should let him know if we notice other speed issues. 
The MINOS DocDB instance has been upgraded and testing complete. ___________________________________________ Date: Mon, 05 Jan 2009 23:41:49 +0000 (GMT) Thanks ! This seems to have restored access to full speed. I had to Clear Private Data/Authenticated Sessions under Firefox in order to restore access to the certificate-protected pages. Some large Calendar pages are still a bit slow, but tolerable. 5 seconds for https://minos-docdb.fnal.gov:440/cgi-bin/ShowCalendar?year=2008&month=12 ___________________________________________ The DocDB developers ( Eric Vaandering ) have corrected the problems that led to slow loading of our Calendars and Meeting pages. You may need to Clear Private Data/Authenticated Sessions under Firefox in order to regain certificated based access to internal pages. ___________________________________________ Date: Tue, 06 Jan 2009 10:24:38 -0600 From: Eric Vaandering The ShowCalendar function should be nearly instantaneous again. It's not on my test setup, but I have a thousand or so events on a single day, so that takes some time to write out. :-) This is in stable/8.7.5 ___________________________________________ Date: Tue, 06 Jan 2009 17:28:46 +0000 (GMT) Thanks ! I am not seeing the speedup in Minos DocDB yet, but I expect that the new code has not been deployed for us. ___________________________________________ Date: Tue, 06 Jan 2009 11:30:41 -0600 From: Eric Vaandering Probably not, but you can always check the version number in the lower left corner of a page. ___________________________________________ Date: Tue, 6 Jan 2009 17:32:53 +0000 (GMT) Thanks ! As expected, we are at 8.7.4 . ___________________________________________ Date: Tue, 06 Jan 2009 12:01:10 -0600 Nope. I will do this in just a few minutes. ___________________________________________ Date: Tue, 06 Jan 2009 12:21:15 -0600 From: Marcia Teckenbrock I've upgraded the code, but the December calendar is still 7-8 seconds for me. It has quite a few events, though, so I don't think this is unreasonable. The January calendar loads in 2 seconds. ___________________________________________ Date: Tue, 06 Jan 2009 18:25:37 +0000 (GMT) Thanks ! That is odd. The speed loading the December calendar is slower than before, 7 seconds under 8.7.4 8 seconds under 8.7.5. https://minos-docdb.fnal.gov:440/cgi-bin/ShowCalendar?year=2008&month=12 DocDB Version 8.7.5, contact Document Database Administrators Execution time: 8 wallclock secs ( 4.02 usr + 3.89 sys = 7.91 CPU) ___________________________________________ Date: Tue, 06 Jan 2009 12:45:02 -0600 From: Eric Vaandering That is weird, because in my test setup which has a day with hundreds of events, it takes 4 seconds using my laptop as a server. So let me take a look again. The difference may be that I don't have sessions and talks in those events, so it may be slowing down somewhere else. It's also possible it is unavoidable. ___________________________________________ Date: Tue, 06 Jan 2009 19:50:21 -0600 From: Eric Vaandering Marcia, can you update again? I didn't bump the version number yet, but I *think* I've got the time consuming calls taken out of this without side effects. Some other things still seem as if they might be slower than I would like. Perhaps we can find an evening to turn on the debugging so I can see what is happening live on the Minos DocDB. (Debugging will just add a bunch of text output to the end of the page, so DocDB is still fully usable, but doesn't look quite as nice.) 
___________________________________________ Date: Wed, 07 Jan 2009 18:02:23 -0600 From: Marcia Teckenbrock Sorry, I ran out of time today, but I will do this first thing tomorrow. ___________________________________________ Date: Thu, 08 Jan 2009 10:25:36 -0600 From: Marcia Teckenbrock This seems to have done the trick. Art, would you please take a look? ___________________________________________ Date: Thu, 08 Jan 2009 18:17:59 +0000 (GMT) From: Arthur Kreymer I see no change at present : https://minos-docdb.fnal.gov:440/cgi-bin/ShowCalendar?year=2008&month=12 DocDB Version 8.7.5, contact Document Database Administrators Execution time: 6 wallclock secs ( 2.51 usr + 2.84 sys = 5.35 CPU) ___________________________________________ Date: Thu, 08 Jan 2009 12:52:29 -0600 From: Marcia Teckenbrock Hmm. There's something about the December calendar, because the change caused the current month's calendar display time to be halved. ___________________________________________ December had some heavy sessions, due to the Collaboration meeting. October and November were more normal, with no such meeting, but they are also loading more slowly than you might expect : October Execution time: 7 wallclock secs ( 2.96 usr + 3.09 sys = 6.05 CPU) November Execution time: 4 wallclock secs ( 1.93 usr + 1.74 sys = 3.67 CPU) February, with no talks, does load very quickly Execution time: 0 wallclock secs ( 0.37 usr + 0.06 sys = 0.43 CPU) ___________________________________________ Date: Thu, 08 Jan 2009 13:32:30 -0600 From: Marcia Teckenbrock To: Eric Vaandering Cc: Art Kreymer Subject: Re: Help Desk Ticket 126747 Has Been Resolved. Yes, I am willing to work on this next week, but the earliest I can do it is Wednesday. ___________________________________________ Date: Thu, 08 Jan 2009 13:45:51 -0600 From: Eric Vaandering Ok. On the other hand, if the problem is the sessions and talks, then it *should* be no different than it was several months ago. But we'll look at this. Marcia, Wednesday is fine for me and I will try to reproduce something like a "normal" MINOS month in the meantime and see what improvements can be made. I think MINOS is using this part of DocDB more heavily than it was ever tested before so I'm not too surprised things might be popping up. ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 12 24 ============================================================================= ON SHIFT 00:00 - 07:00 ######### # DOCDB # ######### Ticket submitted ============================================================================= 2008 12 23 ============================================================================= ON SHIFT 00:00 - 07:00 ######### # DOCDB # ######### Display of the calendar remains slow, as noted below 19 seconds for http://minos-docdb.fnal.gov:8080/cgi-bin/DisplayMeeting?sessionid=1307 ============================================================================= 2008 12 22 ============================================================================= ON SHIFT 00:00 - 07:00 ######### # DOCDB # ######### Host cert update schedule 12:00 to 12:20 If display of calendar is still slow, report this. 
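For the record, a rough client-side timing check, as a sketch: the query string is borrowed from the December calendar link above, the plain-http :8080 URL avoids the certificate prompt, and this measures the full page fetch rather than the server's 'Execution time' footer.

  for I in 1 2 3 ; do
    curl -s -o /dev/null -w '%{time_total}s\n' \
      'http://minos-docdb.fnal.gov:8080/cgi-bin/ShowCalendar?year=2008&month=12'
  done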
Presently, takes 12 seconds to display https://minos-docdb.fnal.gov:440/cgi-bin/ShowCalendar and http://minos-docdb.fnal.gov:8080/cgi-bin/ShowCalendar Meetings can also take a long time, 14 seconds for http://minos-docdb.fnal.gov:8080/cgi-bin/DisplayMeeting?sessionid=1307 ============================================================================= 2008 12 20 Sun ============================================================================= ######## # SPAM # ######## to minos-docdb minos_software_discussion Need to let ssa-group post to minos-shifters ============================================================================= 2008 12 20 Sat ============================================================================= Changed Send= Public to Send= Private for minos-docdb minos_sam_users minoscrl_admin and Send= Owner for minos_comp minos_linux_users numi-pc-users ============================================================================= 2008 12 19 ============================================================================= ######### # MYSQL # ######### dbarchive.20081219 Draft dbarchive supports -I to copy indexes, and do just the offline database, nothing else And self logging to ${DBCOPY}/archive.log Mysql> ./scripts/dbarchive -I PARSING ARGS Archiving OFFLINE Fri Dec 19 18:03:26 CST 2008 68683 . Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 172G 47G 79% /data /minos/data/mysql/archive/20081219/offline/offline.log FINISHED DBARCHIVES Fri Dec 19 19:29:06 CST 2008 [1]+ Done nedit scripts/dbarchive ########## # DCACHE # ########## rubin reported srmcp failure, cp /minos/data2/minfarm/farmtest/mclogs/dogwoodtest4/near/daikon_04/L010185N/706/n13037064_0009_L010185N_D04.0.dogwoodtest4.log.gz /var/tmp/dogtest.gz gunzip /var/tmp/dogtest.gz MINOS26 > ./dccptest n13037064_0009_L010185N_D04.reroot.root PORT 24136 Connected in 0.00s. [Fri Dec 19 13:06:15 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_04/L010185N/706/n13037064_0009_L010185N_D04.reroot.root in cache. Cache open succeeded in 99.57s. 355257111 bytes in 6 seconds (57821.80 KB/sec) Tested an srmcp , with upgraded srmtest.20081219, looks OK. Date: Fri, 19 Dec 2008 22:15:20 +0000 (GMT) From: Arthur Kreymer The log indicated that the copy first started at Thu Dec 18 13:00:36, while the Enstore robots were still down. Perhaps srm/Dcache failed to properly execute the retries ? 
I can dccp and srmcp the file now : n13037064_0009_L010185N_D04.reroot.root ######## # MAIL # ######## minos-shifters mail list To cut spam, changed configuration Send= Public to Send= Private And added Owner= zwaska ####### # CVS # ####### Created new admin package cd /local/scratch26/kreymer/trel/testrel mkdir admin cd admin touch .cvsignore MINOS26 > cvs import minossoft/admin kreymer start aklog: Couldn't get fnal.gov AFS tickets: aklog: Incorrect net address while getting AFS tickets nedit: the current locale is utf8 (en_US.UTF-8) nedit: changed locale to non-utf8 (en_US) N minossoft/admin/.cvsignore No conflicts created by this import MINOS26 > date Fri Dec 19 13:13:28 CST 2008 cd /local/scratch26/kreymer/trel/CVSROOT cvs update nedit modules cvs commit -m 'Added admin for scripts and HOWTOs' modules nedit check_access cvs commit -m 'Added admin module' check_access ######### # MYSQL # ######### Added older HOWTO's to admin/mysql addpkg -h admin cd admin mkdir mysql cvs add -m 'mysql database' mysql cd mysql cp ~/minos/HOWTO.dbarchive.20051021 HOWTO.dbarchive cvs add HOWTO.dbarchive cvs commit -m 'HOWTO.dbarchive.20051021' HOWTO.dbarchive cp ~/minos/HOWTO.dbarchive.20070403 HOWTO.dbarchive cvs commit -m "HOWTO.dbarchive.20070403" HOWTO.dbarchive for DAT in 20070703 20070705 20080115 20080409 20080804 20081014 ; do cp ~/minos/HOWTO.dbarchive.${DAT} HOWTO.dbarchive cvs commit -m "HOWTO.dbarchive.${DAT}" HOWTO.dbarchive done Also added dbarchive script cd /minos/scratch/kreymer/admin ######### # DOCDB # ######### To minos_all DocDB will be down next Monday 22 Dec at noon, for 20 minutes. There is no Minos meeting conflict, we have none scheduled next week. ---------- Forwarded message ---------- Date: Fri, 19 Dec 2008 12:53:20 -0600 From: Marcia Teckenbrock To: cd-docdb-users@fnal.gov Subject: Re: DocDB Outage on Monday, December 22nd, Noon Hi All, Just want you to know we ARE going forward with the outage on Monday at Noon. Thank you for your prompt responses, and happy holidays! -Marcia Marcia Teckenbrock wrote: > Dear DocDB Users, > > The system administrators for the DocDB machine would like to schedule an > approximately 20 minute outage on Monday, December 22nd at Noon. The purpose > of the outage is to install the new ssl certificate on the server. > > Before we schedule, I just want to make sure this will not interfere with > your activities. Please let me know ASAP. Thanks, > > -Marcia > marcia@fnal.gov ####### # NET # ####### netdown email , Tuesday, December 23, 2008 6:00am - 7:00am Upgrade S-S-WH8W-9 network switch Scattered locations in Wilson Hall on the following VLANS. Not all users on these VLANS will be down - only those connected to switch #9. VLAN 18 - Beams-WH VLAN 19 - Dir VLAN 27 - BSS VLAN 31 - LSS VLAN 55 - PPD VLAN 92 - Conference Note that the Control Room is on subnet 55. 
But probably not routing through S-S-WH8W-9 minos-rc is connected to s-s-wh8w-7 on port Fa3/30 minos-acnet same minos-om same minos-evd is connected to s-s-wh8w-7 on port Fa3/31 ########## # MYSQL2 # ########## Continuing tests, see LOG.mysql2 ============================================================================= 2008 12 18 ============================================================================= ########## # MYSQL2 # ########## Date: Thu, 18 Dec 2008 17:19:25 -0600 (CST) Subject: HelpDesk ticket 126613 ___________________________________________ Ticket #: 126613 ___________________________________________ Short Description: minos-mysql2 needs compatibility libncurses.so.5 for mysql Problem Description: Some of the 32 bit mysql programs on minos-mysql2 need libncurses.so.5 . Please find out which compability rpm's are needed, and install them, on all of our new SLF 4.7 systems ( mysql2, sam04, minos25, minos27 ) This is not urgent, I've copied the library from minos-mysql1 and put it in my path for testing purposes. ___________________________________________ Date: Fri, 19 Dec 2008 08:23:13 -0600 (CST) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 30 Dec 2008 09:36:47 -0600 (CST) Solution: schmitz@fnal.gov sent this solution: installed ncurses.i386 ___________________________________________ ___________________________________________ ########## # MYSQL2 # ########## Tests needed for commissioning : Make all this into a formal deployment and support plan. OS - find missing .so files OS - measure and boost open file limits as appropriate OS - set time zone to UTC ? DB - copy mysql table from mysql1, to get accounts DB - Time mysqldump for backups 3 minutes/GB for production table, to bluearc DB - Time mysql_upgrade 2h 35m for offline database DB - set connection limits consistent with OS file limits and capacity test performance vs connections DB - Test recovery from data file, with index rebuild time from mysqldump files adding binlogs to base restore DB - Test full database snapshot ( backup plus indexes ) DB - Copy all but offline and crl databases from mysql1 to mysql2 DB - Test defragmentation MINOS - Test dbmauto - Nick ######## # DATA # ######## Date: Thu, 18 Dec 2008 13:26:27 -0600 From: George Szmuksta The enstore outage is over. All enstore libraries are available for work. ######### # ADMIN # ######### Date: Thu, 18 Dec 2008 12:35:11 -0600 (CST) Subject: HelpDesk ticket 126575 ___________________________________________ Ticket #: 126575 ___________________________________________ Short Description: sshd not responding on minos21 Problem Description: I cannot log into minos21 with ssh. But kerberized rsh is working. Please restart the sshd. MIN > ssh minos21 ssh_exchange_identification: Connection closed by remote host MIN > date Thu Dec 18 18:31:28 GMT 2008 ___________________________________________ This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. 
Solution: schmitz@fnal.gov sent this solution: restarted sshd ___________________________________________ Date: Thu, 18 Dec 2008 13:11:25 -0600 (CST) ___________________________________________ ######### # ADMIN # ######### Date: Thu, 18 Dec 2008 12:35:12 -0600 (CST) Subject: HelpDesk ticket 126576 ___________________________________________ Ticket #: 126576 ___________________________________________ Short Description: minos26 mount of /grid/data Problem Description: The /grid/data file handle is stale on minos26 : MINOS26 > ls -ld /grid/data ls: /grid/data: Stale NFS file handle MINOS26 > date Thu Dec 18 12:32:43 CST 2008 ___________________________________________ This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 18 Dec 2008 13:13:30 -0600 (CST) Solution: schmitz@fnal.gov sent this solution: cleared stale mount and remounted ___________________________________________ ___________________________________________ ########## # CONDOR # ########## To : Nicholas Devenish Cc : "Ryan B. Patterson" , minos-admin@fnal.gov, nickd@fnal.gov, xbhuang@fnal.gov, minos_software_discussion@fnal.gov Attchmnt: Subject : Re: minos25, /minos/data2? (fwd) ----- Message Text ----- On Thu, 18 Dec 2008, Nicholas Devenish wrote: > I've been noticing it for a couple of days - my mass jobs are submitting at a > rate of about 0.5 per second (horrendously slow). Thanks to detective work by Ryan, we have found the cause of recent Condor slowness and failures. User xbhuang has been running some grid jobs which have memory leaks. These eventually grow above 2 GBytes on the CDF worker nodes, crashing the glidein processes on the workers, and causing large overheads in our Condor scripts as they reconnect and eventually restart these jobs. I have just added a 1.8 GBytes memory limit to the paloon script, which should help prevent future global problems. Xiaobo - you need to debug your jobs to eliminate the memory leak before running more of these. I have removed all your existing jobs from Condor, some of which had grown to nearly 4 Gbytes in size.
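A quick way to spot such runaway jobs before they reach the glidein limit, sketched against the condor_q long output of that era (ImageSize is reported in KBytes):

  # Five largest image sizes among the user's queued/running jobs
  condor_q -l xbhuang | grep '^ImageSize ' | sort -t = -k 2 -n | tail -5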
########## # CONDOR # ########## My test glide jobs last completed at Dec 18 08:51 There are log files through Dec 18 10:20 logs/glide/probe.246186.0.out No further submissions, as of 10:50 Ganglia showa a load average that is low now, Load spike to 70 around 00:40 Load spike to 100 around 03:00 Networking showed sustained 6 MB/sec input 22:45 through 09:52 3 MB/sec output same time, with spike at end to 10 MB/sec condor_status looks OK, Name OpSys Arch State Activity LoadAv Mem ActvtyTime minos01.fnal.gov LINUX INTEL Claimed Busy 0.990 4053 0+03:46:01 minos02.fnal.gov LINUX INTEL Claimed Busy 0.990 4053 0+04:09:52 slot1@minos03.fnal LINUX INTEL Claimed Busy 2.690 2026 0+03:26:54 slot2@minos03.fnal LINUX INTEL Claimed Busy 2.020 2026 0+02:51:36 glidein_12452@fnpc LINUX X86_64 Claimed Busy 1.030 16053 0+08:04:32 monitor_12452@fnpc LINUX X86_64 Owner Idle 1.000 1605 0+08:05:13 Total Owner Claimed Unclaimed Matched Preempting Backfill INTEL/LINUX 39 0 39 0 0 0 0 X86_64/LINUX 58 29 29 0 0 0 0 Total 97 29 68 0 0 0 0 condor_q is failing, MINOS25 > condor_q -- Failed to fetch ads from: <131.225.193.25:63348> : minos25.fnal.gov 10:55 - condor_status has decayed, MINOS25 > condor_status Name OpSys Arch State Activity LoadAv Mem ActvtyTime slot1@minos04.fnal LINUX INTEL Claimed Busy 1.010 2026 0+03:43:15 slot2@minos04.fnal LINUX INTEL Claimed Busy 0.980 2026 0+03:58:16 slot1@minos09.fnal LINUX INTEL Claimed Busy 0.980 2026 0+02:35:41 slot2@minos09.fnal LINUX INTEL Claimed Busy 0.970 2026 0+02:29:29 slot1@minos15.fnal LINUX INTEL Claimed Busy 0.980 2026 0+04:09:00 slot2@minos15.fnal LINUX INTEL Claimed Busy 0.970 2026 0+04:34:05 slot1@minos18.fnal LINUX INTEL Claimed Busy 0.940 2026 0+04:24:05 slot2@minos18.fnal LINUX INTEL Claimed Busy 1.050 2026 0+03:14:04 slot1@minos19.fnal LINUX INTEL Claimed Busy 0.980 2026 0+04:18:02 slot2@minos19.fnal LINUX INTEL Claimed Busy 1.020 2026 0+04:13:27 slot1@minos20.fnal LINUX INTEL Claimed Busy 1.040 2026 0+04:19:11 slot2@minos20.fnal LINUX INTEL Claimed Busy 1.900 2026 0+03:14:34 slot1@minos21.fnal LINUX INTEL Claimed Busy 1.020 2026 0+03:14:20 slot2@minos21.fnal LINUX INTEL Claimed Busy 0.970 2026 0+04:24:23 Total Owner Claimed Unclaimed Matched Preempting Backfill INTEL/LINUX 14 0 14 0 0 0 0 Total 14 0 14 0 0 0 0 I see no full disks, MINOS25 > w 10:57:03 up 13 days, 2:33, 15 users, load average: 0.21, 0.44, 0.30 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT nickd pts/12 ndevenish-macboo 10:26 28:57 3.44s 0.00s /bin/sh /afs/fnal.gov/files/code/e875/general/condor/scripts/remote_wrappers/condor_submit Condor_FCSyst.run MINOS25 > ps -flu gfactory F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 4 Z gfactory 4293 20163 0 76 0 - 0 exit 00:24 ? 00:00:39 [condor_gridmana] 4 S gfactory 22056 22055 0 76 0 - 14047 - Dec17 pts/9 00:00:00 -bash MINOS25 > ps -flu gfrontend F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD Checking Condor log, 12/18 09:22:51 Return from HandleReq 12/18 09:27:39 Preen pid is 20970 12/18 09:27:39 DaemonCore: pid 20970 exited with status 0, invoking reaper 1 12/18 09:27:39 Child 20970 died, but not a daemon -- Ignored 12/18 09:27:39 DaemonCore: return from reaper for pid 20970 12/18 09:34:51 Calling HandleReq (0) ... 
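Once the echoed list above looks right, the same loop without the echo does the actual cleanup, using the -forcex flag from the dry run above (a sketch):

  PROCS=`condor_q xbhuang | grep xbhuang | cut -f 1 -d ' '`
  for PROC in ${PROCS} ; do
    condor_rm -forcex ${PROC}
    sleep 1
  done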
12/18 10:52:14 Calling Handler 12/18 10:52:14 ZKM: setting default map to rbpatter@fnal.gov 12/18 10:52:14 DaemonCore: Command received via TCP from rbpatter@fnal.gov from host <131.225.193.25:62745>, access level ADMINISTRATOR 12/18 10:52:14 DaemonCore: received command 454 (DAEMONS_OFF), calling handler (admin_command_handler) 12/18 10:52:14 Calling HandleReq (0) 12/18 10:52:14 Sent SIGTERM to COLLECTOR (pid 7451) 12/18 10:52:14 Sent SIGTERM to NEGOTIATOR (pid 7452) 12/18 10:52:14 Sent SIGTERM to SCHEDD (pid 20163) 12/18 10:52:14 Return from HandleReq 12/18 10:52:14 Return from Handler 12/18 10:52:14 DaemonCore: pid 7452 exited with status 0, invoking reaper 1 12/18 10:52:14 The NEGOTIATOR (pid 7452) exited with status 0 12/18 10:52:14 DaemonCore: return from reaper for pid 7452 12/18 10:52:53 DaemonCore: pid 7451 exited with status 0, invoking reaper 1 12/18 10:52:53 The COLLECTOR (pid 7451) exited with status 0 12/18 10:52:53 DaemonCore: return from reaper for pid 7451 11:02 condor_status is back to a full list, condor_q still hung condor_q still fails In Shadowlog, 12/18 04:27:50 (245787.13) (23786): attempt to connect to <131.225.211.155:48026> failed: Connection refused (connect errno = 111). fcdfcaf1507.fnal.gov grep attempted /local/stage1/condor/log/ShadowLog /minos/data/users/xbhuang/new_run3/log.245787.0 245869.99 PROCS=`condor_q xbhuang | grep xbhuang | cut -f 1 -d ' '` Need to clear the local side of some of these, PROCS=`condor_q xbhuang | grep xbhuang | cut -f 1 -d ' '` 11:56:30 for PROC in ${PROCS} ; do printf "${PROC} " ; echo condor_rm -forcex ${PROC} ; sleep 1 ; done ########## # PALOON # ########## Added 1.8 GBytes virtual memory limit, to stop future crashes as above, ulimit -v 1800000 cp paloon paloon.20081218 Tested this, cd /grid/fermiapp/minos/parrot ./paloon.20081218 Moved new version to production ln -sf paloon.20081218 paloon ######## # DATA # ######## Early fardet_data was not in monthly directories : grep fardet_data/F ../CFL/CFL | wc -l 326 MINOS26 > ls /pnfs/minos/fardet_data/2001-09 | wc -l 316 MINOS26 > ls /pnfs/minos/fardet_data/2001-10 | wc -l 55 OFILES=`grep fardet_data/F ../CFL/CFL | cut -f 8 -d ' ' | sort` printf "${OFILES}\n" | sort /pnfs/minos/fardet_data/F00000508_0000.mdaq.root /pnfs/minos/fardet_data/F00000535_0000.mdaq.root ... /pnfs/minos/fardet_data/F00000983_0000.mdaq.root /pnfs/minos/fardet_data/F00000985_0000.mdaq.root Where are the first and last files ? 
MINOS26 > ls /pnfs/minos/fardet_data/*/F00000508_0000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000508_0000.mdaq.root MINOS26 > ls /pnfs/minos/fardet_data/*/F00000985_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000985_0000.mdaq.root for FILE in ${OFILES} ; do FIL=`echo ${FILE} | cut -f 5 -d / | cut -f 1 -d '}'` ls /pnfs/minos/fardet_data/2001-*/${FIL} done Mostly in 2001-09, except /pnfs/minos/fardet_data/2001-10/F00000965_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000966_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000967_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000968_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000969_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000970_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000974_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000980_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000983_0000.mdaq.root /pnfs/minos/fardet_data/2001-10/F00000985_0000.mdaq.root Also, have a few three digit subruns : /pnfs/minos/fardet_data/2001-09/F00000570_000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000573_000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000574_000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000575_000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000576_000.mdaq.root /pnfs/minos/fardet_data/2001-09/F00000577_000.mdaq.root First, let's enmv everything to its current locations, for the CFL listing. setup encp v3_7d -q stken FILES09=`printf "${OFILES}\n" | head -316 | cut -f 5 -d /` for FILE in ${FILES09} ; do printf "${FILE}\n" enmv /pnfs/minos/fardet_data/2001-09/${FILE} \ /pnfs/minos/fardet_data/2001-09/${FILE} sleep 1 done ... F00000964_0000.mdaq.root FILES10=`printf "${OFILES}\n" | tail -10 | cut -f 5 -d /` for FILE in ${FILES10} ; do printf "${FILE}\n" enmv /pnfs/minos/fardet_data/2001-10/${FILE} \ /pnfs/minos/fardet_data/2001-10/${FILE} sleep 1 done Rename the short subruns to standard form SHORTS='570 573 574 575 576 577' Verified absence from CFL with some other subrun: for SHORT in ${SHORTS} ; do grep F00000${SHORT} ../CFL/CFL done minos fardet_data VO6876 0000_000000000_0002625 CDMS109626825400000 245916094 3230905666 /pnfs/minos/fardet_data/F00000570_000.mdaq.root minos fardet_data VO6876 0000_000000000_0002649 CDMS109626881500000 83623425 3755784777 /pnfs/minos/fardet_data/F00000573_000.mdaq.root minos fardet_data VO6876 0000_000000000_0002650 CDMS109626883600000 50080898 3771648130 /pnfs/minos/fardet_data/F00000574_000.mdaq.root minos fardet_data VO6876 0000_000000000_0002651 CDMS109626888400000 37155578 2198592335 /pnfs/minos/fardet_data/F00000575_000.mdaq.root minos fardet_data VO6876 0000_000000000_0002652 CDMS109626890700000 24823718 526838901 /pnfs/minos/fardet_data/F00000576_000.mdaq.root minos fardet_data VO6876 0000_000000000_0002653 CDMS109626892000000 12363247 542879600 /pnfs/minos/fardet_data/F00000577_000.mdaq.root for SHORT in ${SHORTS} ; do printf "F00000${SHORT}_000.mdaq.root\n" enmv /pnfs/minos/fardet_data/2001-09/F00000${SHORT}_000.mdaq.root \ /pnfs/minos/fardet_data/2001-09/F00000${SHORT}_0000.mdaq.root done TO DO - check CFL tomorrow ######### # DOCDB # ######### Registered Rustem for numirw ( and not all groups as requested ) Note that the email should be modified : If this is correct, please visit https://minos-docdb.fnal.gov:440/cgi-bin/EmailAdministerForm, select "Modify", select the user, check "Verify", and click to Submit. If the groups are not correct, select the correct groups before clicking Submit. 
Check the box to the left of 'User is Verified' Click on 'Modify Personal Account Got a request from Gary W. Smith gsmish@fnal.gov Not an author, asked whether this was a real request. ============================================================================= 2008 12 17 ============================================================================= ########## # SAMSUB # ########## samsub.20081217 Removed leftover diagnostic printouts. Somehow that had been used in production up to now. ln -sf samsub.20081217 samsub # was samsub.20081118 SRV1> cp -a AFSS/samsub.20081217 . SRV1> ln -sf samsub.20081217 samsub ============================================================================= 2008 12 16 ============================================================================= ########## # DCACHE # ########## Date: Tue, 16 Dec 2008 12:26:14 -0600 From: ssa-group@fnal.gov We need to restart pnfs on stken for a logging issue. The restart should only take a couple of minutes. Date: Tue, 16 Dec 2008 12:27:40 -0600 The restart of pnfs will happen at 1245 pm. Date: Tue, 16 Dec 2008 12:46:52 -0600 Done. ============================================================================= 2008 12 15 ============================================================================= ####### # CFL # ####### minos/CFL data files filled my AFS quota. MINOS26 > fs listquota . Volume Name Quota Used %Used Partition d.minos.d5 8000000 7997943 100%<< 25% < CFL/lists Adjusted cfl script accordingly cp -a cfl cfl.20081210 cp cfl cfl.20081215 nedit cfl.20081215 ln -sf cfl.20081215 cfl cp -va CFL.* lists/ CFILES=`ls CFL.* for FILE in ${CFILES} ; do echo ${FILE} ; diff ${FILE} lists/${FILE} ; done rm CFL.* Still a problem, as ed works in /tmp, which is only 1 GByte in minos26 !!! WOW Time edit with sed : cat CFL.new | sed 's./fs/usr/./.g'| sed 's./fnal.gov/usr/./.g' Pipe CFL.new through sed MINOS26 > time cat CFL.new | sed 's^/fs/usr/^/^g' | sed 's^/fnal.gov/usr/^/^g' > CFL.newer real 0m49.075s user 1m14.395s sys 0m2.452s Seems OK, let's put that filter on the original curl output. MINOS26 > time ./cfl real 1m0.709s user 1m16.221s sys 0m9.321s MINOS27 > time ./cfl real 0m40.826s user 0m40.643s sys 0m5.202s ####### # NET # ####### Date: Mon, 15 Dec 2008 12:51:43 -0600 (CST) Subject: HelpDesk ticket 126359 Short Description: Wireless problems reported on WH12W Problem Description: This is a low priority request to check out Wireless networking around WH2W, including the Control Room at WH12NW. During the Minos Collaboration meeting this weekend, and this morning, there have been some reports of connectivity problems. I cannot reproduce the problems with my own laptop. Unfortunately , I have no specific report ( this is third hand. ) So please just cast an eye on your monitoring logs, see whether there are any obvious issues. ___________________________________________ This ticket is assigned to FINNEGAN, RICK of the CD-LSCS/CNCS/SN. ___________________________________________ This ticket has been reassigned to ANDREWS, CHARLES of the CD-LSCS/CNCS/SN Group. ___________________________________________ Date: Tue, 16 Dec 2008 16:13:48 -0600 (CST) Solution: Art - The microwave oven in the control room is leaking RF - levels at 1 meter are about -7 to -10 Dbm - more than enough to interfere with some wireless operatrions - I will e-mail the screen captures to you. 
-Chuck- ___________________________________________ ######## # GRID # ######## Date: Mon, 15 Dec 2008 08:26:49 -0600 (CST) Subject: HelpDesk ticket 126325 ___________________________________________ Ticket #: 126325 ___________________________________________ Short Description: Make rbpatter a /fermilab/minos manager Problem Description: Please give Ryan Patterson, rbpatter@fnal.gov the same control over /fermilab/minos that I have, so that he can authorize new members and assign roles. ___________________________________________ This ticket is assigned to TIMM, STEVE of the CD-SF/GF/FGS. ___________________________________________ Date: Mon, 15 Dec 2008 12:24:07 -0600 (CST) Solution: This has been done. ___________________________________________________________________ This ticket was resolved by TIMM, STEVE of the CD-SF/GF/FGS group. ============================================================================= 2008 12 14 SUN ============================================================================= ########## # DCACHE # ########## I see no data movement since 13 Dec 23:00 roundup and predator are stuck Services are all up. ftplog failed since 6 Sat Dec 13 23:33:58 CST 2008 557 3601 Sun Dec 14 00:43:59 CST 2008 1 pnfslog 400 seconds since 2 Sat Dec 13 23:40:42 CST 2008 309 Sat Dec 13 23:50:51 CST 2008 No recent helpdesk tickets Date: Sun, 14 Dec 2008 12:17:02 -0600 (CST) Subject: HelpDesk ticket 126310 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126310 ___________________________________________ Short Description: PNFS not responding in FNDCA DCache Problem Description: User access to FNDCA has been failing since about Sat Dec 13 23:45 See my FTP and PNFS listing monitoring logs at http://www-numi.fnal.gov/computing/dh/ftplog/2008/12/13.txt http://www-numi.fnal.gov/computing/dh/pnfslog/2008/12/13.txt Reply to minos-data I can be reached at 630 697 0469 ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ Calling helpdesk , option 5 to page get message 'is not available at the moment at the tone record your message... " ___________________________________________ Yolanda paged SSA, ticket 126312, around 16:45 ___________________________________________ ___________________________________________ Date: Mon, 15 Dec 2008 14:44:58 -0600 (CST) From: Dmitry Litvintsev Info was requested of me by HelpDesk: dcache developer primary has been contacted at about 7:30 PM Sunday 12/14. It was discovered that an obscure log file used by pnfs daemon has reached 2GB in size generating error "File size exceeds limit". The file has been moved and the pnfs daemon has been restarted. System ws back to normal just after 8PM. ============================================================================= 2008 12 13 ============================================================================= ############ # MCIMPORT # ############ Working to get arms approved for pushing data to volatile area from the grid /DC=org/DC=doegrids/OU=People/CN=Kregg Arms 875233 Date: Sat, 13 Dec 2008 11:43:40 -0600 (CST) Subject: HelpDesk ticket 126305 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. 
@@@ # --> ___________________________________________ Ticket #: 126305 ___________________________________________ Short Description: FNDCA permissions for Kregg Arms to fermigrid/volatile/minos Problem Description: Please authorize Kregg to write to fermigrid/volatile/minos , using /DC=org/DC=doegrids/OU=People/CN=Kregg Arms 875233 User/group mapping should probably be arms/e875 Kregg intends to write from Teragrid sites, probably with Grid FTP. ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ ___________________________________________ ___________________________________________ ####### # CRL # ####### Ticket 126307 CRLWEB problem- Minos Control room - 630-840-3368 Detailed Problem Description (if supplied): Masaki called off hours helpdesk from the Minos Control rool that they are having problems with the CRL logbook (CRLWEB) with images, Extensive Work Log, datail of people being paged, Ticket 126303 Short Description: (from weekend) Proble with Control Room Logbook for MINOS Problem Description: Our instance of the Control Room Logbook has stopped loading images. It appears to be some problem with the html wrapper for the image. http://crlweb2.fnal.gov/minos/Index.jsp We seem to be able to enter images into entries, but none will load or display onto pages (including navigation buttons). You may contact the control room at x3368 - it is manned 24 hours a day. Or, contact me at 630-240-6842. ___________________________________________________________________ This ticket was resolved by RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST group. ___________________________________________________________________ ___________________________________________________________________ Ticket 126304 Web images problem from MINOS control romm hi from MINOS control room We are having a problem to view images in the following web page. http://crlweb2.fnal.gov/minos/Index.jsp For example : When we click one of images that we pasted, the images sometimes does't appear at all, but we are able to see the images in some other computers... error message we get is ************************************************* Not Found The requested URL /Entries/2008/12month/13day/06hour/General/Operations/Log/Text_82191_0_dec_ 13_08_day_plot1_png_wrapper.htm was not found on this server. Apache/2.0.46 (Scientific Linux) Server at crlweb2.fnal.gov Port 80 ************************************************** Can you help to solve the problem? other contacts for MINOS experts : kreymer@fnal.gov phone number for MINOS control room : x3368 x8751 x2482 x6913 The Work Log has discussions of missing files ? in /afs/fnal.gov/files/data/crl/dr/WWWdirectory Files have been restored to /afs/fnal.gov/files/restored/d.crl.1 ___________________________________________________________________ ___________________________________________________________________ Date: Sat, 13 Dec 2008 22:43:00 -0600 From: kreymer To: Maurine Mihalek Cc: Wayne Baisley , Hugh Gallagher , Desktop & Server Support - Enterprise , Arthur Kreymer Maureen, sorry that I could not get you more specific information earlier this evening, I was on the way out the door, and not able to get to a computer until just now. The ticket numbers were 126307 and 126304. My memory was faulty, this issue came up this morning, not last night. 
One of the missing URL's is http://crlweb2.fnal.gov/Entries/2008/12month/13day/06hour/General/Operations /Log/Text_82191_0_dec_13_08_day_plot1_png_wrapper.htm This is visible from my Desktop system at Fermilab, but not my laptop, the CRL, or my home system. This cannot have been cached on my desktop, as the graphic is much newer than my previous access to CRL from that system. =========================================================== Date: Sat, 13 Dec 2008 23:41:06 -0600 From: Wayne Baisley To: "kreymer@fnal.gov" Cc: Maurine Mihalek , Hugh Gallagher , Desktop & Server Support - Enterprise Subject: Re: Minos control room crlweb tickets > One of the missing URL's is > http://crlweb2.fnal.gov/Entries/2008/12month/13day/06hour/General/Operatio > ns/Log/Text_82191_0_dec_13_08_day_plot1_png_wrapper.htm > > This is visible from my Desktop system at Fermilab, but not my laptop, > the CRL, or my home system. > This cannot have been cached on my desktop, > as the graphic is much newer than my previous access to CRL from that > system. It seems to be cached somewhere, because most of the directory tree under Entries has gone missing, excepting a couple of hours for Thursday. The deepest directories in the vicinity are ... /afs/fnal.gov/files/data/crl/dr/LogBook_admin /afs/fnal.gov/files/data/crl/dr/CRLinquiries/CRLWindex /afs/fnal.gov/files/data/crl/dr/CRLdata/Entries/2008/12month/11day/10hour/Si mulation/SLIC_studies/Log /afs/fnal.gov/files/data/crl/dr/CRLdata/Entries/2008/12month/11day/13hour/Si mulation/SLIC_studies/Log /afs/fnal.gov/files/data/crl/dr/WWWdirectory/Entries/2008/12month/11day/10ho ur/Simulation/SLIC_studies/Log /afs/fnal.gov/files/data/crl/dr/WWWdirectory/Entries/2008/12month/11day/13ho ur/Simulation/SLIC_studies/Log /afs/fnal.gov/files/data/crl/dr/CRLmaillists I get 404s for anything below 12month, aside from the 11day directories which give me a 403 (Forbidden). Hope that helps some. Wayne ==================================================================== Date: Sun, 14 Dec 2008 06:15:56 +0000 (GMT) From: Arthur Kreymer To: Wayne Baisley Cc: Maurine Mihalek , Hugh Gallagher , Desktop & Server Support - Enterprise , kreymer@fnal.gov Subject: Re: Minos control room crlweb tickets On Sat, 13 Dec 2008, Wayne Baisley wrote: > It seems to be cached somewhere, because most of the directory tree under > Entries has gone missing, excepting a couple of hours for Thursday. The > deepest directories in the vicinity are ... I have created a minos account on my desktop, where things worked this afternoon: minos-93198.dhcp.fnal.gov I have given access to kreymer, baisley, mmihalek, and minos-wh-cr/minos/minos-om.fnal.gov@FNAL.GOV which is the principal used in the control room. You can run 'firefox' there, Unfortunately, running this remotely from home, the graphics seem to have disappared from my desktop also. I cannot do anything about the AFS areas under /afs/fnal.gov/files/data/crl I do not even have read access. Date: Sun, 14 Dec 2008 00:22:18 -0600 From: Maurine Mihalek To: Arthur Kreymer , mmihalek@fnal.gov Cc: Wayne Baisley , Hugh Gallagher , Desktop & Server Support - Enterprise , kreymer@fnal.gov Subject: Re: Minos control room crlweb tickets the directory wayne requested to be restored was /afs/fnal.gov/files/data/crl/dr/WWWdirectory. i am doing a tibs restore of the volume d.crl.1. this is the only way i see in the afs and tibs restore documentation to do it. i conferred with joe syu and he agrees. the restore is started. 
once it is finished, i will be mounting it in a restored area. i will advise you of the name and when that restored volume is available. maurine Date: Sun, 14 Dec 2008 01:05:26 -0600 From: Maurine Mihalek To: Maurine Mihalek , Arthur Kreymer , Wayne Baisley , Hugh Gallagher , Desktop & Server Support - Enterprise , kreymer@fnal.gov Subject: Re: Minos control room crlweb tickets I restored from 12/12/2008 tibs backup. the restored volume d.crl.1 is mounted under /afs/fnal.gov/files/restored/d.crl.1 there is a dr directory that has the WWWdirectory from Dec 12 tibs backup you can restore whatever files you need from there. Date: Sun, 14 Dec 2008 22:26:38 +0000 (GMT) From: Arthur Kreymer To: helpdesk-forwarder@fnal.gov Cc: Maurine Mihalek , Wayne Baisley , Desktop & Server Support - Enterprise Subject: HelpDesk ticket 126304 <-- # @@@ Enter Update below this line. @@@ # --> Minos has access to these CRL/AFS support areas only through the CRL web interface. We cannot read the /afs paths, or perform maintenance. My desktop system minos-93198.dhcp.fnal.gov, running SLF 5, with Firefox 3.0.4, continues to see new graphics files, like the recent http://www-minoscrl2.fnal.gov/Entries/2008/12month/14day/06hour/General/Operati$ This file was added long after the problem started yesterday. This file is not visible to the Control Room, or to most other systems. The problem cannot be in the loss of the data files on the Web Server, just a failure to serve them to some ( but not all ) clients. Perhaps there is a change in the way the server handles or caches usernames/passwords for restricted pages. I think that all these images are password protected, which is historically quite a nuisance when viewing pages. My successful Firefox 3.0.4 browser has a stored password for : http://crlweb2.fnal.gov(CRLW) User minos on the same system with the same browser, cannot see images. <-- # @@@ Enter Update above this line. @@@ # --> Fails http://crlweb2.fnal.gov/Entries/2008/12month/15day/00hour/General/Operations/Log/Text_82255_0_dec_15_08_night_plot1_png_wrapper.htm Works http://www-minoscrl2.fnal.gov/Entries/2008/12month/15day/00hour/General/Operations/Log/Text_82255_0_dec_15_08_night_plot1_png_wrapper.htm ---------- Forwarded message ---------- Date: Mon, 15 Dec 2008 11:08:06 -0600 From: Suzanne Gysin To: Arthur Kreymer Subject: Re: Minos control room crlweb tickets (fwd) Hi Art, just a little more information. The web address has not changed, not since years. The way this works is the webserver (crlweb2) has many logbooks, each having an alias. The alias maps to the specific log's image and entry directory. This is nothing new. Maybe the images were cashed until now in the control room, I don't know why it worked before. Maybe they used the alias before. Suzanne _____________________________________________________________________________ Corrected crlweb2 to www-minoscrl2 at http://www-numi.fnal.gov/Minos/ControlRoom/index.html _____________________________________________________________________________ Date: Mon, 15 Dec 2008 17:22:00 +0000 (GMT) From: Arthur Kreymer To: helpdesk@fnal.gov Cc: mmihalek@fnal.gov, baisley@fnal.gov, dss-est@fnal.gov, zwaska@fnal.gov, kreymer@fnal.gov, votava@fnal.gov Subject: Re: Minos control room crlweb tickets (fwd) Tickets 126307, 126303, and 126304 can be closed. We had been using an incorrect web address, which stopped working over the weekend. Apparently the correct web address works fine. There appears to have been no problem with the server itself. 
I have updated the link on the Minos Control Room web page. _____________________________________________________________________________ Per baisley, also updated /afs/fnal.gov/files/expwww/numi/html/documentation/alphabetical_index.html _____________________________________________________________________________ HISTORY - digging through email for www-minoscrl2 references minos: Date: Fri, 02 Jun 2006 16:31:00 -0500 saranen, new account minosadmin: Date: Wed, 05 Apr 2006 14:44:32 -0500 (CDT) referenced in passsing, re mysql upgrade Date: Mon, 24 Apr 2006 11:04:51 -0500 (CDT) reference to the old CRL, www-minoscrl out: ============================================================================= 2008 12 12 ============================================================================= ######## # FARM # ######## To minos_batch, rubin : Please run cedar catchup spill processing on the horn off runs, N00015187 MISS 0021 0022 0023 N00015190 MISS 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 0010 0011 0012 0013 0014 0015 0016 0017 0018 0019 0020 Xiaobo - there are spill sntp files waiting to be concatenated for these in /minos/data/minfarm/nearcat The third horn off run has been fully processed, N00015193 This may take special sntp-only processing, as it seems that cand and cosmic files exist for all the missing subruns. ---------------------------- Howie submitted these, picked up Dec 12 19:31 /minos/data/minfarm/nearcat/N00015187_0021.spill.sntp.cedar.0.root Dec 12 20:06 /minos/data/minfarm/nearcat/N00015187_0022.spill.sntp.cedar.0.root Dec 12 20:12 /minos/data/minfarm/nearcat/N00015187_0023.spill.sntp.cedar.0.root N00015190_0000 through 20 I have concatenated ahead of schedule, SRV1> ./roundup -r cedar near Fri Dec 12 20:43:32 CST 2008 OK adding N00015187_0000.spill.sntp.cedar.0.root 9 OK adding N00015187_0010.spill.sntp.cedar.0.root 14 SUPPRESS N00015190_0024.spill.sntp.cedar.0.root OK adding N00015190_0000.spill.sntp.cedar.0.root 24 Fri Dec 12 20:56:30 CST 2008 Informed minos_batch, xbhuang ######### # MYSQL # ######### Resuming work on mysql installation HOWTO.mysqladmin - updating per mysql2 work HOWTO.mysqladmin.20080820 - describes minos-sam03 work ########## # DCACHE # ########## Requested closeout of ticket 121533 as there no planned action to test DCache failover to secondary DNS. 9/12/2008 11:04:51 AM _______________________________________________________________________ During last night's fnsrv0 primary DNS server outage, it appears that all FNDCA and STKEN data transfers stopped. The DCache and Enstore data rate plots show no activity, and many user jobs failed. Likewise, I see no Enstore data transfers in CDFEN or D0EN from 22:45 through 04:00 last night ( Sep 11/12 ) All password ftp reads from FNDCA failed during this period. Access to PNFS was very slow, typically 3 minutes instead of 3 seconds. Strangly, the Minos Data Acquisition kerberized ftp copies all succeeded during this downtime. It would be very desirable for Enstore and DCache smoothly fail over to secondary nameservers. _______________________________________________________________________ 9/22/2008 3:59:20 PM Remedy Application Service The following was e-mailed to the Requester: jonest@fnal.gov sent this Notes To Requester: All enstore movers and servers have primary and secondary > nameservers defined. > nameserver 131.225.8.120 > nameserver 131.225.17.150 > Also, each mover and server has all enstore related nodes listed in the /etc/hosts file. 
_______________________________________________________________________ Date: Fri, 12 Dec 2008 20:03:32 +0000 (GMT) From: Arthur Kreymer As no further action on this issue seems to be planned, I suggest that this ticket be closed. It would be nice to test failover to secondary DNS servers in the test stand. But that is not best tracked via a helpdesk ticket. Thanks ! _______________________________________________________________________ Solution: jonest@fnal.gov sent this solution: > Art feels this ticket can be close, Thanks Art! _______________________________________________________________________ _______________________________________________________________________ ########## # CONDOR # ########## Date: Fri, 12 Dec 2008 10:33:31 -0600 (CST) Subject: HelpDesk ticket 126261 ___________________________________________ Ticket #: 126261 ___________________________________________ Short Description: xbhuang Robot Cert needs approval. Problem Description: User xbhuang added a Robot cert a couple of days ago, but this is still listed as 'new', not 'approved'. Please approve this cert : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Xiaobo Huang/CN=UID:xbhuang ___________________________________________ This ticket is assigned to of the . ___________________________________________ Date: Fri, 12 Dec 2008 10:43:56 -0600 (CST) This ticket has been reassigned to TIMM, STEVE of the CD-Grid/Fermi Group. ___________________________________________ Date: Fri, 12 Dec 2008 10:56:35 -0600 (CST) Note To Requester: chadwick@fnal.gov sent this Notes To Requester: The new certificate has been approved. ___________________________________________ Solution: This cert has been added and approved. This ticket was resolved by TIMM, STEVE of the CD-Grid/Fermi group. ___________________________________________ Date: Mon, 15 Dec 2008 12:19:54 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: The logs of VOMRS show that the certificate was never successfully added but I have added it now and approved it. Steve Timm ___________________________________________ ######## # DATA # ######## html/computing/dh/dhleft.html.20081212 updated for data/data2/scratch free ############# # DBARCHIVE # ############# Cleaned up messages : Non-verbose file copy Final time stamp ########### # MONTHLY # ########### DATASETS 12/5 PREDATOR 12/5 VAULT 12/4 MYSQL 12/12 Thu Dec 11 15:49:39 CST 2008 Archiving OFFLINE Thu Dec 11 15:49:55 CST 2008 68608 . 
Archiving BINLOGS
Thu Dec 11 16:53:49 CST 2008
real    1m41.312s
Compressing archives
Thu Dec 11 16:55:34 CST 2008
real    55m36.702s
Copying back to local disk
Thu Dec 11 20:57:49 CST 2008
real    64m0.852s

Mysql> du -sm /data/archive/COPY/*
34485   /data/archive/COPY/20080902
22233   /data/archive/COPY/20081013
22445   /data/archive/COPY/20081114
22509   /data/archive/COPY/20081211

=============================================================================
2008 12 11
=============================================================================

##########
# CONDOR #
##########

Submitted glidex100 ( 100 sections ), around 14:19
82 jobs; 71 idle, 10 running, 1 held   14:20
72 jobs; 61 idle, 10 running, 1 held   Thu Dec 11 14:21:03 CST 2008
50 jobs;  0 idle, 49 running, 1 held   Thu Dec 11 14:22:00 CST 2008
 1 jobs;  0 idle,  0 running, 1 held   Thu Dec 11 14:23:00 CST 2008

Checking startup time, from the logs :

Removed old held job
242864.0   kreymer    12/10 23:20   0+00:14:02 H  0   0.0  probe
HoldReason = "Error from starter on glidein_24419@fcdfcaf1695.fnal.gov:
Failed to execute
'/local/stage1/condor/execute/dir_23884/glide_e23941/condor_job_wrapper.sh'
with arguments 0 sleep 30: Connection reset by peer"

Checking start times from logs
grep executing logs/glide/probe.243057.*.log \
    | cut -f 4 -d ' ' | sort > probex100.logtimes
14:18:48 ... 14:22:19
Never more than two jobs per second, and these are always isolated.
Net rate 100/210 seconds.

Checking job starts from the *.out files, for consistency
grep STARTED logs/glide/probe.243057.*.out \
    | cut -f 7 -d ' ' | sort > probex100.outtimes
14:18:49 ... 14:22:21
Triplets at
14:21:47
14:21:47
14:21:50
14:21:50
14:21:50
14:21:51
14:21:52
14:21:52
14:21:52
14:21:54
14:21:54
14:21:56

This did not get 100 running at once, 30 sec was too short a sleep.
And did not have 100 glideins up front.
Increased the sleep to 210 seconds, 243074
primed the pump with another run at 15:30,
Farm glideins: R=189 I=0 H=0

condor_submit glidex100.run
100 job(s) submitted to cluster 243087.
100 jobs; 100 idle, 0 running, 0 held
Thu Dec 11 15:55:22 CST 2008
MINOS25 > condor_q kreymer | tail -1 ; date
100 jobs; 0 idle, 100 running, 0 held
Thu Dec 11 15:55:35 CST 2008

grep executing logs/glide/probe.243087.*.log | cut -f 4 -d ' ' | sort \
    > probex100a.logtimes
grep STARTED logs/glide/probe.243087.*.out | cut -f 7 -d ' ' | sort \
    > probex100a.outtimes

#########
# MYSQL #
#########

Preparing for replication testing of mysql2, see
http://www-numi.fnal.gov/offline_software/srt_public_context/DatabaseMaintenance/doc/dbmauto_index.html

##########
# DCACHE #
##########

Date: Thu, 11 Dec 2008 11:28:06 -0600
From: ssa-group@fnal.gov
To: cdweb@fnal.gov, helpdesk@fnal.gov, oleynik@fnal.gov, stan@fnal.gov,
    wolbers@fnal.gov, crawdad@fnal.gov, white@fnal.gov, moibenko@fnal.gov,
    timur@fnal.gov, stk-users@fnal.gov, cms-t1@fnal.gov, dcache-admin@fnal.gov
Subject: Announcement: Service disruption for dCache on stken for a duration of is back up.

There was a short disruption of the fndca (public) dcache.
A hardware repair contractor accidentally rebooted the head node.
Dcache is running now.

###########
# ENSTORE #
###########

Date: Thu, 11 Dec 2008 10:38:59 -0600
From: ssa-group@fnal.gov

Just as a reminder there will be an upgrade to the STKEN movers at 11:00 AM.
Approximately 20 minutes from now.

Date: Thu, 11 Dec 2008 11:27:49 -0600
From: ssa-group@fnal.gov

The update is complete. Thank you for your patience.
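Regarding the glidex100 timing files above, the peak and overall start rates can be pulled straight from the *.logtimes lists ( a sketch, just post-processing the grep/cut output already saved ) :

    # starts per second, busiest seconds first
    sort probex100a.logtimes | uniq -c | sort -rn | head -5
    # first start, last start, total number of starts
    head -1 probex100a.logtimes ; tail -1 probex100a.logtimes ; wc -l probex100a.logtimes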
########## # DCACHE # ########## Closed out ticket 126014 FNDCA - three files unavailable via DCache All three files are readable. ####### # CRL # ####### Date: Thu, 11 Dec 2008 08:51:29 -0600 (CST) Subject: HelpDesk ticket 126186 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126186 ___________________________________________ Short Description: Incorrect email address in error message when CRL was down Problem Description: On Dec 3, there was a problem with the Minos CRL web server. The message that was seen referred to an incorrect email address, webmaster@crlweb2.fnal.gov At low priority, I suggest that this be tracked down and corrected. Here is the message that was seen at the time : The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, webmaster@crlweb2.fnal.gov and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log. ____________________________________________________________________________ Apache/2.0.46 (Scientific Linux) Server at crlweb2.fnal.gov Port 80 ___________________________________________ Date: Thu, 11 Dec 2008 09:10:03 -0600 (CST) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/WST ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 12 10 ============================================================================= ########## # CONDOR # ########## Helping xbhuang on minos25 MINOS_CONDOR=/afs/fnal.gov/files/code/e875/general/condor ${MINOS_CONDIR}/scripts/proxyconfig fails to find libssl3.so In particular, ${MINOS_CONDOR}/scripts/get-cert/get-cert.sh -i ####### # CFL # ####### Sometimes the COMPLETE_FILE_LISTING is truncated. This results in 'newline appended' messages from he cronjob. I would be better to bail on incomplete CFL files. The last line should be like (1832559 rows) Test with tail -1 | grep -q '^(.* rows)$' Updated cfl script accordingly ########### # BLUEARC # ########### Date: Wed, 10 Dec 2008 11:58:43 -0600 From: Andrew J. Romero To: "'site-nas-announce@fnal.gov'" Subject: Emergency Reboot of all nodes in RHEA cluster to occur at noon (Dec 10, 2008) We are experiencing severe heap allocation problems on both RHEA cluster nodes. At noon we will be rebooting both RHEA-1 and RHEA-2 (We will also install a firmware upgrade which will addresses this issue) The following EVSs are effected: BLUE1 BLUE2 MINOS-NAS-0 CDSERVER PPDSERVER DIRSERVER1 ESHSERVER1 CDFSERVER1 LSSERVER NUMISERVER PSEEKITS The RHEA cluster should be back online at 12:30 Andy ____________________________________________________________________ bluwatch resumed at 12:34 But the logs are strangely incomplete. fnpcrv1 - no errors or timeouts minos-sam03 - no errors or timeouts minos01 - timeout ending 12:34:57 minos26 - no errors or timeouts ____________________________________________________________________ Date: Wed, 10 Dec 2008 13:15:54 -0600 From: Andrew J. Romero To: "'site-nas-announce@fnal.gov'" Subject: BLUEARC .... Emergency Reboot .... 
COMPLETE All EVSs and Filesystems were back online at 12:42PM ____________________________________________________________________ ____________________________________________________________________ ####### # DAQ # ####### Started ND archiver, had been stopped by GFP Restarted FD archiver, it appeared to be stuck, valid PID, process running, no activity. Restarted BEAM archiver, no files had moved, as with FD. Restarted NDCS around 14:07 , had not archived Dec 8 18:00 N081209_000002.mdcs.root got archived 14:16 For consistency with daq, beam systems, midir bin/init cp -a /etc/init.d/archiver bin/init/ Restarted FDCS around 14:30 Dec 8 18:00 F081209_000006.mdcs.root got archived at 14:32 midir bin/init cp -a /etc/init.d/archiver bin/init/ Noted that archiverstatus.sh is in scripts on N/F DAQ, in bin for N/F DCS and BEAM ( where it has been in crontabs ) For the first time in over a year, all five archivers have current status and recently archived files at http://minos-om.fnal.gov/cgi-bin/archiver_status.cgi ######### # ADMIN # ######### Old rbpatter helpdesk ticket :124576 11/10/2008 11:18:33 PM ___________________________________________ Please add the following names to the NIS database for the MINOS Cluster: UID 42918 -- gfrontend UID 43021 -- minosana and GID 5468 -- e898 Thanks in advance. --Ryan ___________________________________________ 11/11/2008 5:22:07 PM jereboze Checked out uids listed on this request, there was no uid/gid assignment for "gfrontend". UID 42918 was assigned to user "whend". Created uid 43498 for gfrontend. UID for minosana is correct. Minos Cluster admin you can now add the two uids: UID 43598 -- gfrontend UID 43021 -- minosana Will assign ticket to the Minos admins.. Yolanda Valadez CD/Helpdesk 11/11/2008 4:53:44 PM valadez Checked out uids listed on this request, there was no uid/gid assignment for "gfrontend". UID 42918 was assigned to user "whend". Created uid 43498 for gfrontend. UID for minosana is correct. Minos Cluster admin you can now add the two uids: UID 43598 -- gfrontend UID 43021 -- minosana Will assign ticket to the Minos admins.. Yolanda Valadez CD/Helpdesk ___________________________________________ Assigned to Arthur Kreymer 1/11/2008 5:21:35 PM jereboze The Assigned To Group was changed from CD-SF/FEF to CD-SP. The Assigned To Individual was changed from HO, LING to KREYMER, ARTHUR. The Assigned To E-mail Address was changed from run2-sys@fnal.gov to kreymer@fnal.gov. helpdesk@fnal.gov was put into the CC Address 1 field. 11/11/2008 4:56:11 PM valadez The Assigned To Group was changed from CD-LSCS/CSI/HD to CD-SF/FEF. The Assigned To Individual was changed from VALADEZ, YOLANDA to HO, LING. The Assigned To E-mail Address was changed from helpdesk@fnal.gov to run2-sys@fnal.gov. 11/11/2008 4:53:44 PM valadez The Assigned To Group was changed from Help Desk to CD-LSCS/CSI/HD. The Assigned To Individual was changed from HelpDesk to VALADEZ, YOLANDA. ___________________________________________ Date: Wed, 28 Jan 2009 17:52:00 +0000 (GMT) Subject: Re: KREYMER, ARTHUR HelpDesk ticket 124576 Reminder We need an official UID for gfactory. It presently has none. I will submit a separate request. The UID we now use, 42917, belongs to Stefan Lammel/cdfprd_svx. Once that is assigned, I will ask to shift this ticket back to FEF ( run2-sys ) so that we can schedule a shift of the gfactory and gfrontend accounts and files to their proper UID's. 
___________________________________________ Date: Wed, 28 Jan 2009 20:48:12 +0000 (GMT) Subject: Re: KREYMER, ARTHUR HelpDesk ticket 124576 Has Been Updated. Helpdesk : Please reassign this ticket to run2-sys (FEF) We should change the gfactory and gfrontent accounts to use their assigned UID's 43598 gfrontend 43680 gfactory This needs to be coordinated with rbpatter and kreymer, as the factory and frontend processes needed to be stopped while the NIS and file ownerships are being changed. ___________________________________________ Date: Wed, 28 Jan 2009 15:10:22 -0600 (CST) This ticket has been reassigned There is no need for you to continue working on this problem. The ticket has been reassigned to: COOPER, GLENN ___________________________________________ ___________________________________________ ######## # SSHD # ######## Date: Wed, 10 Dec 2008 10:50:56 -0600 (CST) Subject: HelpDesk ticket 126126 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126126 ___________________________________________ Short Description: sshd not responding on minos03, minos05, minos22 Problem Description: run2-sys : rlogin works, but sshd logins are not available for nodes minos05 minos08 minos22 Please restart the sshd servers on these nodes. ___________________________________________ Date: Wed, 10 Dec 2008 10:55:27 -0600 (CST) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 10 Dec 2008 11:26:11 -0600 (CST) Solution: The sshd service has been restarted on the three machines. This ticket was resolved by BURNS, ETTA of the CD-SF/FEF group. ___________________________________________ ___________________________________________ ######### # AKLOG # ######### Date: Wed, 10 Dec 2008 15:52:27 +0000 (GMT) Subject: Re: [Fwd: HelpDesk ticket 125925 has additional info.] <-- # @@@ Enter Update below this line. @@@ # --> This rebuilt version of aklog does get me a token. Unfortunately, the token cannot be used to write to AFS. I can write using tokens derived from my default ticket MINOS25 > touch /afs/fnal.gov/files/home/room1/kreymer/testafs.native MINOS25 > tokens > /tmp/tokens.ok But not with kcron tokens MINOS25 > kcron MINOS25 > aklog MINOS25 > tokens Tokens held by the Cache Manager: User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Dec 10 19:44] --End of list-- MINOS25 > touch /afs/fnal.gov/files/home/room1/kreymer/testafs.kcron touch: cannot touch `/afs/fnal.gov/files/home/room1/kreymer/testafs.kcron': Permission denied MINOS25 > tokens > /tmp/tokens.kcron MINOS25 > diff /tmp/tokens.ok /tmp/tokens.kcron 4c4 < User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Dec 11 10:53] --- > User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Dec 10 19:44] <-- # @@@ Enter Update above this line. @@@ # --> ____________________________________________________________________________ 12/10/2008 4:02:11 PM dawson The following was e-mailed to the Requester: Hi Art, I am reassigning this to the kcron maintainer. This is beyond my expertise, as I usually deal with just tickets gotten the usual way. ____________________________________________________________________________ Date: Mon, 02 Feb 2009 21:19:43 +0000 (GMT) From: Arthur Kreymer Is there any progress on this ticket ? We really need an aklog that functions with kcron tickets on the new Minos SLF 4.7 servers. This ticket has been pending now for almost two months. 
We are tantalizingly close to a solution. Ling's patched version of aklog delivers AFS tokens. But those tokens to not give us access to AFS, so they must be broken in some invisible way. As the ticket is assigned to Frank, I have created a nagy account on the Minos Cluster. Nodes minos25 and minos27 are at SLF 4.7 . ____________________________________________________________________________ ____________________________________________________________________________ ____________________________________________________________________________ ########## # DCACHE # ########## DCache seems to be down Last ftplog 20:57 Last MRTG activity around 21:10 for fndca3a aka fndca Traffic Analysis for 3/20 fndca3a -- s-s-fcc1-server kreymer@minos26 crontab -r mindata@minos26 crontab -r minfarm@fnpcsrv1 mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT Updated MINOS status page _____________________________________________________________ Date: Wed, 10 Dec 2008 11:05:18 -0600 Subject: Announcement: Service disruption for dCache on stken for a duration of public dcache system back up The fndca ( public ) dcache system is back up. The head node was replaced. _____________________________________________________________ Near and far data have been archived since 11:00 _____________________________________________________________ 16:55 - restarted all ============================================================================= 2008 12 09 ============================================================================= ########## # DCACHE # ########## Found some 3 day old files in write pools : n13036711_0004_L010185N_D04_helium.reroot.root kreymer e875 0 Dec 4 09:08 n13036710_0013_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root minospro e875 0 Dec 4 05:44 n13036710_0014_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root minospro e875 0 Dec 4 09:57 n13036705_0028_L010185N_D04_helium.sntp.cedar_phy_bhcurv.0.root minospro e875 0 Dec 4 04:59 -rw-r--r-- 1 kreymer e875 0 Dec 4 09:08 /pnfs/minos/mcin_data/near/daikon_04/L010185N_helium/671/n13036711_0004_L010185N_D04_helium.reroot.root -rw-r--r-- 1 minospro e875 0 Dec 4 05:44 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N_helium/cand_data/671/n13036710_0013_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minospro e875 0 Dec 4 09:57 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N_helium/cand_data/671/n13036710_0014_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minospro e875 0 Dec 4 04:59 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N_helium/sntp_data/670/n13036705_0028_L010185N_D04_helium.sntp.cedar_phy_bhcurv.0.root ZFILES=' n13036711_0004_L010185N_D04_helium.reroot.root n13036710_0013_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root n13036710_0014_L010185N_D04_helium.cand.cedar_phy_bhcurv.0.root n13036705_0028_L010185N_D04_helium.sntp.cedar_phy_bhcurv.0.root ' Removed the files from PNFS for FILE in ${ZFILES} ; do FPAT=`sam locate ${FILE} | cut -f 2 -d "'" | grep '^/pnfs' | cut -f 1 -d,`/${FILE} rm ${FPAT} done Removed the files from SAM for FILE in ${ZFILES} ; do sam undeclare file ${FILE} ; done ######## # FARM # ######## Date: Tue, 09 Dec 2008 15:29:17 +0000 (GMT) From: Arthur Kreymer To: minos_batch@fnal.gov Cc: rubin@fnal.gov Subject: linfix concatenation up to date The mcnearcat cedar_phy_bhcurv linfix concatenation is up to date. There are 36 runs which are pending missing subruns. 
See ROUNTUP/LOG/cedar_phy_linfixmcnear.pend ########### # ROUNDUP # ########### rounudup.20081209 Corrected logic to set aside old PENDFILE when NOOP is null SRV1> cp AFSS/roundup.20081209 . SRV1> ln -sf roundup.20081209 roundup # was roundup.20081126 ============================================================================= 2008 12 08 ============================================================================= ######### # STAGE # ######### Review state of carrot_06 for pawloski CARROT=/pnfs/minos/mcout_data/cedar/near/carrot_06/L010185/sntp_data CDIRS=`ls ${CARROT}` for DIR in ${CDIRS} ; do cd ${CARROT}/${DIR} CFILES=`ls ${CARROT}/${DIR}` for FILE in ${CFILES} ; do head -1 ".(use)(4)(${FILE})" done done 2>&1 | tee /tmp/CVOLS wc -l /tmp/CVOLS 7457 /tmp/CVOLS CVOLS=`sort -u /tmp/CVOLS` echo $CVOLS VO8219 VO8366 VOB656 VOB862 VOB873 VOB879 VOB883 VOB887 VOB895 VOB903 VOB907 VOB908 VOB913 VOB920 VOB927 for VOL in ${CVOLS} ; do printf "${VOL} " ; grep ${VOL} /tmp/CVOLS | wc -l ; done VO8219 38 VO8366 582 VOB656 575 VOB862 592 VOB873 600 VOB879 582 VOB883 583 VOB887 590 VOB895 492 VOB903 563 VOB907 481 VOB908 504 VOB913 593 VOB920 608 VOB927 74 cd ~/minos/scripts { for VOL in ${CVOLS} ; do ./stage -w -p 5 -s carrot_06/L010185/sntp_data ${VOL} done ; } >> /minos/scratch/kreymer/log/stage/carrot06.log 2>&1 & STARTING Mon Dec 8 15:15:18 CST 2008 FINISHED Tue Dec 9 04:06:02 CST 2008 ########## # DCACHE # ########## Date: Mon, 08 Dec 2008 09:43:51 -0600 From: David Saranen To: Arthur Kreymer Subject: Re: Daq archiving started again this evening Ticket 125959 FNDCA recent FTP web page listing is empty. _____________________________________________________________________ Similar to tickets 123960 and 124357. Web page: http://fndca3a.fnal.gov/cgi-bin/dcache_files.py is empty. Art Kreymer has restarted archiving from MINOS Far Detector, so I assume that files are being transfered normally. _____________________________________________________________________ Date: Thu, 11 Dec 2008 15:44:33 +0000 (GMT) The FTP Recent Transfers list is still empty, at http://fndca3a.fnal.gov/cgi-bin/dcache_files.py Our data transfers seem to be running normally, but it would be nice to have this diagnostic page available. _____________________________________________________________________ Date: Thu, 11 Dec 2008 16:26:46 -0600 (CST) From: Dmitry Litvintsev I believe I have fixed the issue. Issue was having to do with log gathering script that has been stoppped/killed but left lock file that prevented this script from starting over. Lock file was dated Dec 03 _____________________________________________________________________ ########### # BLUEARC # ########### Date: Mon, 08 Dec 2008 14:23:00 -0600 (CST) Subject: HelpDesk ticket 126005 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126005 ___________________________________________ Short Description: BlueArc r/w exports for new Minos servers Problem Description: LSC/CSI Please make read/write exports for /minos/data, data2, scratcch for the new Minos servers : minos27 minos-sam04 minos-mysql2 minos-mysql3 ___________________________________________ This ticket is assigned to HelpDesk of the Help Desk. 
___________________________________________ Date: Mon, 08 Dec 2008 15:30:33 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST ___________________________________________ Note To Requester: added blue2.fnal.gov:/minos/data for hosts minos27 , minos-sam04 , minos-mysql2 , minos-mysql3 But we want to confirm, is that minos-mysql2 with letter "l" or number one? We were extending the pattern in the current host list. (hosts with plus signs added during this request) minos-mysql1.fnal.gov(rw,no_root_squash) flxb10.fnal.gov (rw) flxb11.fnal.gov (rw) flxb12.fnal.gov (rw) flxb13.fnal.gov (rw) flxb14.fnal.gov (rw) flxb15.fnal.gov (rw) flxb16.fnal.gov (rw) flxb17.fnal.gov (rw) flxb18.fnal.gov (rw) flxb19.fnal.gov (rw) flxb20.fnal.gov (rw) flxb21.fnal.gov (rw) flxb22.fnal.gov (rw) flxb23.fnal.gov (rw) flxb24.fnal.gov (rw) flxb25.fnal.gov (rw) flxb26.fnal.gov (rw) flxb27.fnal.gov (rw) flxb28.fnal.gov (rw) flxb29.fnal.gov (rw) flxb30.fnal.gov (rw) flxb31.fnal.gov (rw) flxb32.fnal.gov (rw) flxb33.fnal.gov (rw) flxb34.fnal.gov (rw) flxb35.fnal.gov (rw) flxi02.fnal.gov (rw) flxi03.fnal.gov (rw) flxi04.fnal.gov (rw) flxi05.fnal.gov (rw) flxi06.fnal.gov (rw) flxi07.fnal.gov (rw) minos01.fnal.gov (rw) minos02.fnal.gov (rw) minos03.fnal.gov (rw) minos04.fnal.gov (rw) minos05.fnal.gov (rw) minos06.fnal.gov (rw) minos07.fnal.gov (rw) minos08.fnal.gov (rw) minos09.fnal.gov (rw) minos10.fnal.gov (rw) minos11.fnal.gov (rw) minos12.fnal.gov (rw) minos13.fnal.gov (rw) minos14.fnal.gov (rw) minos15.fnal.gov (rw) minos16.fnal.gov (rw) minos17.fnal.gov (rw) minos18.fnal.gov (rw) minos19.fnal.gov (rw) minos20.fnal.gov (rw) minos21.fnal.gov (rw) minos22.fnal.gov (rw) minos23.fnal.gov (rw) minos24.fnal.gov (rw) minos25.fnal.gov (rw) minos26.fnal.gov (rw) minos27.fnal.gov (rw) +++++++ minos-mysql11.fnal.gov (rw) minos-mysql12.fnal.gov (rw) ++++++++ ??????????? minos-mysql13.fnal.gov (rw) ++++++++ ??????????? minos-sam01.fnal.gov (rw) minos-sam02.fnal.gov (rw) minos-sam03.fnal.gov (rw) minos-sam04.fnal.gov (rw) ++++++++ 131.225.166.* (rw) 131.225.167.* (rw) 131.225.208.0/22 (rw) 131.225.212.0/23 (rw) 131.225.240.0/24 (rw) 131.225.238.0/23 (rw) 131.225.*.* (read_only,root_squash) ___________________________________________ Date: Tue, 09 Dec 2008 17:40:46 +0000 (GMT) Thanks for updating these. The mysql node names start with minos-mysql followed by the numbers one, two three. There is an extra 1(one) in these in your list below, minos-mysql11 should be minos-mysql1 etc. ___________________________________________ Date: Tue, 09 Dec 2008 13:17:32 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/WST ___________________________________________ Date: Tue, 09 Dec 2008 14:12:19 -0600 (CST) Solution: Gave the following hosts: minos27 minos-sam04 minos-mysql2 minos-mysql3 (rw) access to minos-nas-0.fnal.gov:/minos/data minos-nas-0.fnal.gov:/minos/scratch blue2.fnal.gov:/minos/data This ticket was resolved by ROMERO, ANDY of the CD-LSCS/CSI/CS/WST group. 
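A quick client-side check of the new exports would be a loop like this ( sketch only, not run ; assumes the mount points are already in fstab on each host and that a writable kreymer subdirectory exists under each area ) :

    for NODE in minos27 minos-sam04 minos-mysql2 minos-mysql3 ; do
        for AREA in /minos/data /minos/scratch ; do
            ssh -ax ${NODE} \
                "touch ${AREA}/kreymer/rwtest && rm ${AREA}/kreymer/rwtest" \
                > /dev/null 2>&1 \
                && echo "rw OK   ${NODE} ${AREA}" \
                || echo "rw FAIL ${NODE} ${AREA}"
        done
    done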
___________________________________________ ___________________________________________ ___________________________________________ ########### # MINOS27 # ########### Date: Mon, 08 Dec 2008 14:52:58 -0600 (CST) Subject: HelpDesk ticket 126011 ___________________________________________ Ticket #: 126011 ___________________________________________ Short Description: minos27 lacks /afs/fnal.gov and /pnfs/minos mounts Problem Description: The /afs/fnal.gov and /pnfs/minos mounts seem to have disappeared from node minos27.fnal.gov . Please investigate and restore. ( Please also make /var/log/messages world readable on minos27 and the other new servers (minos25, minos-sam04, minos-mysql2 ) ___________________________________________ Date: Mon, 08 Dec 2008 15:38:56 -0600 (CST) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 08 Dec 2008 16:14:39 -0600 (CST) Note To Requester: Please make read/write exports for /minos/data, data2, scratcch (is this typo?) for the new Minos servers : minos27 minos-sam04 minos-mysql2 minos-mysql3 ===================================================================================== scratcch (is this typo?) Added rights for minos27 , minos-sam04 , minos-mysql2 , minos-mysql3 to export /minos/data Added rights for minos27 , minos-sam04 , minos-mysql2 , minos-mysql3 to export /minos/scratch No export with "data2" found. Is this correct name of export? ___________________________________________ Date: Mon, 08 Dec 2008 22:49:14 +0000 (GMT) Sorry for the typo, and the imprecise list. My request should have been in terms of your exports, not our mounts. Thanks for taking care of the first two of these : minos-nas-0.fnal.gov:/minos/data minos-nas-0.fnal.gov:/minos/scratch blue2.fnal.gov:/minos/data ___________________________________________ Date: Tue, 09 Dec 2008 08:17:28 -0600 (CST) Note To Requester: Please verify which minos machines should have /pnfs/minos mounted. ___________________________________________ Date: Tue, 09 Dec 2008 14:29:29 -0600 (CST) This ticket has been reassigned to TIMM, STEVE of the CD-Grid/Fermi Group. ___________________________________________ Date: Tue, 09 Dec 2008 14:42:48 -0600 (CST) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 09 Dec 2008 20:50:43 +0000 (GMT) Scanned existing nodes, ro everywhere but minos01 - rw minos25 - missing deliberately minos26 - rw minos-mysql2 - missing minos-sam03 - rw minos-sam04 - missing Since you ask, I have reviewed the existing mounts, and compared this to what we might be needing soon. Request : Mount /pnfs/minos ro on all Minos Cluster and Server nodes, except : Do not mount on minos25 Mount rw on minos01 minos26 minos-mysql2 minos-sam03 minos-sam04 Compared the present status, this adds rw mounts on minos-mysql2 minos-sam04 ___________________________________________ Date: Wed, 10 Dec 2008 10:16:23 -0600 (CST) Solution: Verified that the /pnfs/minos mount is in the /etc/fstab file for the entire cluster of machines. Made /var/log/messages world readable as requested. This ticket was resolved by BURNS, ETTA of the CD-SF/FEF group. ___________________________________________ ___________________________________________ ########## # ANNUAL # ########## Created new directories for 2009, per HOWTO.annual ########## # DCACHE # ########## Date: Mon, 08 Dec 2008 12:06:28 -0600 (CST) Subject: HelpDesk ticket 125989 <-- # @@@ Enter Update below this line. 
@@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125989 ___________________________________________ Short Description: FNDCA KFTP-stkendca2a status report claims to be offline Problem Description: The KFTP-stkendca2a status report at http://fndca.fnal.gov:2288/cellInfo claims that the service is offline. KFTP-stkendca2a gridftp-stkendca2aDomain OFFLINE Yet the Minos raw data logging, via KFTP, seems to be OK. ___________________________________________ This ticket is assigned to NAYMOLA, STAN of the CD-SF/DMS/DSC/SSA. ____________________________________________ Date: Mon, 08 Dec 2008 13:19:20 -0600 (CST) The process was restarted but seems to not have resolved the problem of restarting every three minutes. The log output states that there is a Java exception. I have created Bug # 176 so the d-cache group will take a look at this. Glenn ___________________________________________ Date: Wed, 10 Dec 2008 23:19:40 +0000 (GMT) Since the DCache restart this morning, the KFTP-stkendca2a status is no longer OFFLINE at http://fndca.fnal.gov:2288/cellInfo In fact, it shows a creation time of 12/05 16:10:42 . That is well before the problem was first reported on Monday 8 Dec. Oh well. Our data transfers seem to be running OK. I guess this ticket can be closed. __________________________________________ Date: Mon, 15 Dec 2008 14:56:44 -0600 (CST) Solution: Door properly reported after a reboot of the server. This ticket was resolved by NAYMOLA, STAN of the CD-SF/DMS/DSC/SSA group. ######## # DATA # ######## > They all worked with the exception of 3 > /reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar$ > /reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0005.spill.bcnd.cedar$ > /reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0006.spill.bcnd.cedar$ MINOS26 > grep F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root ../CFL/list.r F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root r-stkendca16a-6 MINOS26 > grep F00032507_0005.spill.bcnd.cedar_phy_bhcurv.0.root ../CFL/list.r F00032507_0005.spill.bcnd.cedar_phy_bhcurv.0.root r-stkendca9a-2 MINOS26 > grep F00032507_0006.spill.bcnd.cedar_phy_bhcurv.0.root ../CFL/list.r MINOS26 > ./dccptest F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root Connected in 0.00s. [Mon Dec 8 09:17:36 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root in cache. MINOS26 > ./dc_stat F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root ============================ PNFS status for /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root -rw-r--r-- 1 rubin e875 39198093 Dec 21 2007 F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root LEVEL 2 2,0,0,0.0,0.0 :c=1:4d900d72;h=yes;l=39198093; LEVEL 4 VOC190 0000_000000000_0002756 39198093 reco_far_cedar_phy_bhcurv_bcnd /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root 000F0000000000000730EA60 CDMS119826912800000 stkenmvr16a:/dev/rmt/tps0d0n:479000017059 217648497 ============================ Tested again, MINOS26 > ./dccptest F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root Connected in 0.00s. [Thu Dec 11 09:07:58 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root in cache. 
Cache open succeeded in 138.29s. 39198093 bytes in 1 seconds (38279.39 KB/sec) MINOS26 > ./dccptest F00032507_0005.spill.bcnd.cedar_phy_bhcurv.0.root Connected in 0.00s. [Thu Dec 11 09:26:42 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0005.spill.bcnd.cedar_phy_bhcurv.0.root in cache. Cache open succeeded in 0.21s. 38144336 bytes in 1 seconds (37250.33 KB/sec) MINOS26 > ./dccptest F00032507_0006.spill.bcnd.cedar_phy_bhcurv.0.root Connected in 0.00s. [Thu Dec 11 09:27:09 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F00032507_0006.spill.bcnd.cedar_phy_bhcurv.0.root in cache. Cache open succeeded in 0.20s. 46186096 bytes in 1 seconds (45103.61 KB/sec) ___________________________________________ Date: Mon, 08 Dec 2008 15:09:40 -0600 (CST) Subject: HelpDesk ticket 126014 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 126014 ___________________________________________ Short Description: FNDCA - three files unavailable via DCache Problem Description: One of the Minos users has reported three files to be unavailable via dccp. Their metadata looks good, and the first two are in recent pool listings. I have verified that a 'dccp' of the first of these gets stuck indefinitely : Under /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/ File Pool listing F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root r-stkendca16a-6 F00032507_0005.spill.bcnd.cedar_phy_bhcurv.0.root r-stkendca9a-2 F00032507_0006.spill.bcnd.cedar_phy_bhcurv.0.root MINOS26 > ./dccptest F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root Connected in 0.00s. [Mon Dec 8 09:17:36 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhc urv/.bcnd_data/2005-08/F00032507_0004.spill.bcnd.cedar_phy_bhcurv.0.root in cache. ( still stuck as of 15:00 ) Odd, the entry in the Lazy Restore Queue for the first file is dated 11.28, and indicates a pool to pool transfer to a write queue ??? 000F0000000000000730EA60 0.0.0.0/0.0.0.0-*/* r-stkendca16a-6->w-stkendca10a-6 11.28 19:32:25 136 0 Pool2Pool 11.28 19:32:25 /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-08/F0003 2507_0004.spill.bcnd.cedar_phy_bhcurv.0.root ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Thu, 11 Dec 2008 15:29:23 +0000 (GMT) All three of the files are available via DCache now. This ticket can be closed. Thanks ! ___________________________________________ ___________________________________________ ####### # CRL # ####### CRL web page got stuck last Friday, 5 Dec. Mail bounced : Date: Mon, 08 Dec 2008 00:30:46 -0600 (CST) From: Internet Mail Delivery To: kreymer@fnal.gov Subject: Delivery Notification: Delivery has been delayed ... Recipient address: webmaster@crlweb2.fnal.gov Reason: unable to deliver this message after 4 days Delivery attempt history for your mail: Sun, 7 Dec 2008 18:56:20 -0600 (CST) TCP active open: Failed connect() Error: Connection refused ... Wed, 3 Dec 2008 17:32:57 -0600 (CST) TCP active open: Failed connect() Error: Connection refused The mail system will continue to try to deliver your message for an additional 3 days. ... 
On Wed, 3 Dec 2008, Mayly Sanchez wrote:
> We are getting the following error when trying to access the CRL. This
> started happening around 15.15 and ti still not fixed at 17:00.
> Mayly

I have looked at the minos-db1 mysql server.
The server seems to be acting normally,
with several current connections from the CRL web server :

=============================================================================
2008 12 06
=============================================================================

The KFTP-stkendca2a kerberized FTP door is down.
This is needed for Minos raw data archiving.
The last file archived seems to have been
Dec 5 21:29 /pnfs/minos/neardet_data/2008-12/N00015256_0021.mdaq.root

FTP transfer page is empty :
http://fndca3a.fnal.gov/cgi-bin/dcache_files.py

#######
# DAQ #
#######

restarted the ND archiver

Last login: Fri Dec 5 21:03:55 2008 from 131.225.192.193
[minos@daqdcp-nd ~]$ ps xf
  PID TTY   STAT TIME COMMAND
 5701 pts/0 Ss   0:00 -bash
 5737 pts/0 R+   0:00  \_ ps xf
14689 ?     S    0:32 python /home/minos/bin/archiver_krb.py
15678 ?     Z    0:00  \_ [kdestroy]
15681 ?     Z    0:00  \_ [kinit]

[minos@daqdcp-nd ~]$ bin/init/archiver restart
Stopping archiver - try graceful exit first. Please wait ......
Killing archiver with USR1
Starting archiver

[minos@daqdcp-nd ~]$ ps xf
  PID TTY   STAT TIME COMMAND
 5701 pts/0 Ss   0:00 -bash
 6687 pts/0 R+   0:00  \_ ps xf
 6394 pts/0 S    0:09 python /home/minos/bin/archiver_krb.py
 6683 pts/0 Z    0:00  \_ [kinit]

Caught up to Dec 6 15:00 N00015268_0016.mdaq.root
Next cycle was OK at 16:00
[minos@daqdcp-nd ~]$ ls -l -tr /daqdata/archiver/data-archived/ | tail -2
-rw-r--r-- 1 minos e875 0 Dec 6 16:05 N00015268_0017.mdaq.root

On Sunday, restarted the beamdata archiver, after clearing empty PID file.
Need to research archiverstatus.sh scripts, updating web status ?
Ran one of these manually, seemed to work. Why is this not in cron ?

Date: Sun, 07 Dec 2008 14:04:26 +0000 (GMT)
From: Arthur Kreymer
To: kreymer@fnal.gov
Subject: beam_data archiver

08:00 - removed empty pid file, started archiver.

=============================================================================
2008 12 05
=============================================================================

############
# PREDATOR #
############

N00015238_0017.mdaq.root bad .py data needs to be cleared/added

Set the damaged file aside
cd /local/scratch26/kreymer/genpy/neardet_data/2008-11
MINOS26 > dds N00015238_0017*
-rw-r--r-- 1 kreymer g020 4662 Nov 30 01:09 N00015238_0017.log
-rw-r--r-- 1 kreymer g020 1284 Nov 30 01:09 N00015238_0017.sam.py
-rw-r--r-- 1 kreymer g020 1266 Nov 30 01:15 N00015238_0017.sam.pyc
mv N00015238_0017.sam.py N00015238_0017.sam.pybad2

Note that there is a .pyc for this file !
MINOS26 > dds *.pyc
-rw-r--r-- 1 kreymer g020 1266 Nov 30 01:15 N00015238_0017.sam.pyc

MINOS26 > ./predator 2008-11
Looks OK this time.

#########
# VAULT #
#########

Far encp failed, because the first copy was complete and clean on 3 Dec.
Removed working files
rm -r /local/scratch26/kreymer/SHEEP/fardet_data/2008-11

###########
# MINOS25 #
###########

Date: Fri, 05 Dec 2008 19:28:29 +0000 (GMT)
From: Arthur Kreymer
To: Ling C. Ho
Cc: minos-admin@fnal.gov, rbpatter@fnal.gov
Subject: Re: minos25 hardware swap tomorrow, 5 Dec at 08:00

Update on kcron / aklog problem.
I have tested this on minos27 and minos25 ( both recently installed. )
After doing kcron, aklog seems to succeed,
but does not provide a working AFS token.
aklog with a normal ticket MINOS25 > kinit kreymer MINOS25 > /usr/krb5/bin/aklog MINOS25 > tokens Tokens held by the Cache Manager: User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Dec 6 15:26] --End of list-- aklog with a kcron ticket MINOS25 > kcron MINOS25 > klist Ticket cache: /tmp/krb5cc_1060_L11665 Default principal: kreymer/cron/minos25.fnal.gov@FNAL.GOV Valid starting Expires Service principal 12/05/08 13:27:12 12/05/08 23:27:12 krbtgt/FNAL.GOV@FNAL.GOV MINOS25 > /usr/krb5/bin/aklog MINOS25 > echo $? 0 MINOS25 > tokens Tokens held by the Cache Manager: Tokens for afs@fnal.gov [Expires Dec 5 23:27] --End of list-- ================================================================ 13:31 Trying mengel workaround from FNALU, * * * * * /usr/krb5/bin/kcron "/usr/krb5/bin/aklog ; ${HOME}/minos/scripts/crontestark" 13:32 - still no good, removed aklog from crontestark Trying again interactively MINOS25 > aklog -d Authenticating to cell fnal.gov (server fsus01.fnal.gov). Trying to authenticate to user's realm FNAL.GOV. Getting tickets: afs/fnal.gov@FNAL.GOV We've deduced that we need to authenticate to realm FNAL.GOV. Getting tickets: afs/fnal.gov@FNAL.GOV Getting tickets: afs/fnal.gov@FNAL.GOV Getting tickets: afs@FNAL.GOV Using Kerberos V5 ticket natively About to resolve name kreymer.cron to id in cell fnal.gov. Id 32766 Set username to kreymer.cron Setting tokens. kreymer.cron / @ FNAL.GOV MINOS25 > tokens Tokens held by the Cache Manager: Tokens for afs@fnal.gov [Expires Dec 5 23:27] --End of list-- Success on minos26 looks like : MINOS26 > aklog -d Authenticating to cell fnal.gov (server fsus01.fnal.gov). We've deduced that we need to authenticate to realm FNAL.GOV. Getting tickets: afs/@FNAL.GOV endTime = 1228540648, Fri Dec 5 23:17:28 2008 About to resolve name kreymer to id in cell fnal.gov. Id 1060 Set username to AFS ID 1060 Setting tokens. AFS ID 1060 / @ FNAL.GOV Test 64bit node fnpc344, SLF 4.5 Getting tickets: afs/@FNAL.GOV endTime = 1229144749, Fri Dec 12 23:05:49 2008 About to resolve name kreymer to id in cell fnal.gov. Id 1060 Set username to AFS ID 1060 Setting tokens. AFS ID 1060 / @ FNAL.GOV touch /afs/fnal.gov/files/home/room1/kreymer/maint/touchafs rm /afs/fnal.gov/files/home/room1/kreymer/maint/touchafs Trying a copy of aklog.slf45 on minos25: MINOS25 > ./aklog.slf45 -d Authenticating to cell fnal.gov (server fsus01.fnal.gov). We've deduced that we need to authenticate to realm FNAL.GOV. Getting tickets: afs/@FNAL.GOV endTime = 1228802052, Mon Dec 8 23:54:12 2008 About to resolve name kreymer to id in cell fnal.gov. Id 1060 Set username to AFS ID 1060 Setting tokens. AFS ID 1060 / @ FNAL.GOV aklog.slf45: unable to obtain tokens for cell fnal.gov (status: 11862788). DISABLED AKLOG AND VOMSES CODE IN KPROXY, TILL THIS IS FIXED edited /local/scratch25/grid/kproxy CHECKING FNALU MIN > for NODE in ${UNODES} ; do printf "${NODE} " ; ssh -ax ${NODE} 'cat /etc/redhat-release' ; done flxi02 Scientific Linux Fermi LTS release 4.4 (Wilson) flxi03 Scientific Linux Fermi LTS release 4.4 (Wilson) flxi04 Scientific Linux Fermi LTS release 4.5 (Wilson) flxi05 Scientific Linux Fermi LTS release 4.5 (Wilson) OK flxi06 Scientific Linux SLF release 5.1 (Lederman) NA flxi07 Scientific Linux Fermi LTS release 4.4 (Wilson) OK flxi09 Scientific Linux Fermi LTS release 4.5 (Wilson) OK Ling will pursue getting an SFL 4.7 aklog. can reproduce the problem on minos25 with ling account. 
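The whole kcron check above boils down to a few lines that could be rerun on any node ( a sketch of the by-hand procedure above ; the testafs path is my own home area, as used earlier ) :

    # does a kcron-derived token actually grant AFS write access ?
    TFILE=/afs/fnal.gov/files/home/room1/kreymer/testafs.kcron
    kcron                   # get the /cron/ principal
    /usr/krb5/bin/aklog     # convert it to an AFS token
    tokens                  # healthy output shows 'AFS ID 1060', broken shows only kreymer.cron
    if touch ${TFILE} 2>/dev/null ; then
        echo "kcron token OK" ; rm ${TFILE}
    else
        echo "kcron token BROKEN - no AFS write access"
    fi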
minos27 is having other problems, kcron fails a random 1/2 the time : MINOS27 > kcron MINOS27 > kcron kinit: Preauthentication failed while getting initial credentials MINOS27 > kcron kinit: Preauthentication failed while getting initial credentials MINOS27 > kcron kinit: Preauthentication failed while getting initial credentials MINOS27 > kcron MINOS27 > kcron ######## # FARM # ######## rm /minos/data/minfarm/roundup/STOP.LOOPER Inspect bad file : -rw-r--r-- 1 minospro e875 0 Dec 4 01:19 /pnfs/minos/mcout_data/cedar_phy_linfix/near/daikon_00/L010185N/mrnt_data/108/n13011086_0000_L010185N_D00.mrnt.cedar_phy_linfix.0.root ls -l /minos/data/minfarm/WRITE/n13011086_0000_L010185N_D00.mrnt.cedar_phy_linfix.0.root -rw-r--r-- 1 minfarm e875 211699277 Dec 3 20:25 minospro@minos26 rm /pnfs/minos/mcout_data/cedar_phy_linfix/near/daikon_00/L010185N/mrnt_data/108/n13011086_0000_L010185N_D00.mrnt.cedar_phy_linfix.0.root ./looper '-r cedar_phy_linfix mcnear' & [1] 19786 OK - processing /minos/data/minfarm/mcnearcat version 20081126 Fri Dec 5 13:06:50 CST 2008 PURGING WRITE files 496 ####### # DAQ # ####### Predator fails on N00015253_0012.mdaq.root Thu Dec 4 23:09:36 UTC 2008 0 length -rw-r--r-- 1 buckley e875 0 Dec 4 05:09 N00015253_0012.mdaq.root -rw-r--r-- 1 buckley e875 0 Dec 4 02:08 B081204_000001.mbeam.root rm /pnfs/minos/neardet_data/2008-12/N00015253_0012.mdaq.root rm /pnfs/minos/beam_data/2008-12/B081204_000001.mbeam.root Need to manually restart the archivers ? And update email address in archiver restart scripts to minos-data. ########### # MINOS25 # ########### MINOS25 > condor_off -fast minos25 -master Sent "Kill-Daemon-Fast" command for "master" to master minos25.fnal.gov MINOS25 > condor_status CEDAR:6001:Failed to connect to <131.225.193.25:9618> Error: Couldn't contact the condor_collector on minos25.fnal.gov. MINOS25 > ps -flu condor F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 5 S condor 29187 1 0 76 0 - 2685 - Dec01 ? 00:23:29 /opt/condor/sbin/condor_master MINOS25 > date Fri Dec 5 08:04:16 CST 2008 08:12 - ling shuttind down both systems for the swap. 239047.0 tinti job that completed ?, yes, logs look OK 10:00 rbpatter started up gfactory. 11:09 - released 6 pawloski jobs 11:11 - pawloski jobs start to run pawloski 6 0 565 12/4 13:39 0+03:03:31 paloon.sh release 100 scavan jobs SJOBS=`condor_q -hold scavan | grep scavan | head -100 | cut -f 1 -d ' '` 11:15 for JOB in ${SJOBS} ; do condor_release ${JOB} ; done Released another 100 pawloski SJOBS=`condor_q -hold pawloski | grep pawloski | head -100 | cut -f 1 -d ' '` 11:21 for JOB in ${SJOBS} ; do condor_release ${JOB} ; done 11:33 condor_release pawloski scavan released his own jobs, apparently ######## # DATA # ######## rubin@fnpcsrv1 cut/paste shrc/kreymer export PRODUCTS=/afs/fnal.gov/files/code/e875/general/ups/db setup encp v3_7d -q stken CARROT=/pnfs/minos/mcout_data/cedar/near/carrot_06/L010185 STREAM=cand_data BADDIR=${CARROT}/${STREAM} BFILES=`ls -C1 ${BADDIR}` printf "${BFILES}\n" | wc -l 7457 for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` echo ${SDIR} done | sort -u | wc -l 389 100 through 599 cd ${BADDIR} NMOV=0 date for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` [ ! 
-d "${SDIR}" ] && printf " MAKING ${SDIR}\n" && mkdir ${SDIR} (( NMOV++ )) enmv ${FILE} ${SDIR}/${FILE} printf "\r ${NMOV} ${FILE}" sleep 1 done printf "\n" date Fri Dec 5 11:45:18 CST 2008 MAKING 100 7 n11001009_0000_L010185.cand.cedar.root MAKING 101 17 n11001019_0000_L010185.cand.cedar.root MAKING 102 27 n11001029_0000_L010185.cand.cedar.root MAKING 103 ... ============================================================================= 2008 12 04 ============================================================================= ######### # STAGE # ######### Restarted staging ; Dropped limit to 500 files, slow down to 5 sec/file. FVOLS="VOB594 VOB990" { for VOL in ${FVOLS} ; do ./stage -w -p 5 -s cedar_phy_bhcurv/.bcnd_data ${VOL} done ; } >> /minos/scratch/kreymer/log/stage/cpbcnd3.log 2>&1 & Good, not all files are needed on this pass. ######## # DATA # ######## Create subdirectories for carrot_06 files rubin@fnpcsrv CARROT=/pnfs/minos/mcout_data/cedar/near/carrot_06/L010185 ( for STREAM in snts_data cand_data sntp_data ; do ) STREAM=snts_data BADDIR=${CARROT}/${STREAM} BFILES=`ls -C1 ${BADDIR}` printf "${BFILES}\n" | wc -l 802 for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` echo ${SDIR} done | sort -u | wc -l 93 cd ${BADDIR} for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` [ ! -d "${SDIR}" ] && mkdir ${SDIR} mv ${FILE} ${SDIR}/${FILE} ; usleep 100000 done $ find . -type f | wc -l 802 Now do use enmv to correct this metadata for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` enmv ${SDIR}/${FILE} ${SDIR}/${FILE} ; sleep 1 done REVISED FOR ENMV : CARROT=/pnfs/minos/mcout_data/cedar/near/carrot_06/L010185 STREAM=sntp_data BADDIR=${CARROT}/${STREAM} BFILES=`ls -C1 ${BADDIR}` printf "${BFILES}\n" | wc -l 7457 for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` echo ${SDIR} done | sort -u | wc -l 389 100 through 599 cd ${BADDIR} NMOV=0 date for FILE in ${BFILES} ; do SDIR=`echo ${FILE} | cut -c 6-8` [ ! -d "${SDIR}" ] && printf " MAKING ${SDIR}\n" && mkdir ${SDIR} (( NMOV++ )) enmv ${FILE} ${SDIR}/${FILE} printf "\r ${NMOV} ${FILE}" sleep 1 done printf "\n" date Thu Dec 4 16:03:07 CST 2008 # N.B. - this is running about 3 seconds/file But it has sped up to 1 sec/file in dirctory 400 MAKING 598 3832 n13015989_0000_L010185.sntp.cedar.root MAKING 599 7457 n13025999_0000_L010185.sntp.cedar.rootfnpcsrv1$ printf "\n" fnpcsrv1$ date Thu Dec 4 19:53:16 CST 2008 fnpcsrv1$ Connection to fnpcsrv1 closed. ####### # DAQ # ####### /home/minos/bin/archiverstatus.sh changed mail form buckley to minos-data $ crontab -l 1 * * * * /home/minos/bin/archiverstatus.sh > /dev/null 2>&1 # Update the #pot. This is a bit buggy still! # 0 0,8,16 * * * /home/minos/BD/R1.16/BeamData/ana/bv/run_npot.sh # Run the DBU job 10 minutes after every hour. It will just exit if there is nothing to do. 
20 * * * * /bin/bash /home/minos/BD/dbu/BeamDataDbi/scripts/run_bdbu_fnal_cron.sh ########## # DCACHE # ########## http://www-numi.fnal.gov/computing/dh/ftplog/2008/12/03.txt FTP got slow : 115 Wed Dec 3 18:05:01 CST 2008 557 6 Wed Dec 3 19:56:38 CST 2008 557 7 Wed Dec 3 20:06:45 CST 2008 557 21 Wed Dec 3 20:17:06 CST 2008 557 11 Wed Dec 3 20:27:17 CST 2008 557 24 Wed Dec 3 20:37:41 CST 2008 557 5 Wed Dec 3 20:47:47 CST 2008 557 48 Wed Dec 3 20:58:35 CST 2008 557 5 Wed Dec 3 21:08:40 CST 2008 557 75 Wed Dec 3 21:19:55 CST 2008 557 65 Wed Dec 3 21:31:00 CST 2008 557 244 Wed Dec 3 21:45:04 CST 2008 557 392 Wed Dec 3 22:01:36 CST 2008 557 13 Wed Dec 3 22:11:49 CST 2008 557 10 Wed Dec 3 22:21:59 CST 2008 557 1781 Wed Dec 3 23:01:40 CST 2008 557 220 Wed Dec 3 23:15:20 CST 2008 557 373 Wed Dec 3 23:31:33 CST 2008 557 644 Wed Dec 3 23:52:17 CST 2008 557 http://www-numi.fnal.gov/computing/dh/ftplog/2008/12/04.txt Failing now : 40 Thu Dec 4 00:02:57 CST 2008 557 1418 Thu Dec 4 00:36:35 CST 2008 557 110 Thu Dec 4 00:48:25 CST 2008 557 64 Thu Dec 4 00:59:29 CST 2008 557 45 Thu Dec 4 01:10:14 CST 2008 557 76 Thu Dec 4 01:21:30 CST 2008 557 1330 Thu Dec 4 01:53:40 CST 2008 557 3603 Thu Dec 4 03:03:43 CST 2008 1 3602 Thu Dec 4 04:13:45 CST 2008 1 3603 Thu Dec 4 05:23:48 CST 2008 1 3603 Thu Dec 4 06:33:51 CST 2008 1 3602 Thu Dec 4 07:43:53 CST 2008 1 Date: Thu, 04 Dec 2008 03:05:02 -0600 From: MINOS DAQ To: carl.metelko@stfc.ac.uk, geoff.pearce@stfc.ac.uk, kreymer@fnal.gov, miller@sudan.umn.edu, saranen@sudan.umn.edu Subject: FARDAQ: 1 file(s) waiting more than 1h for archival Predator is failing to read recent files. Starting with N00015253_0008.mdaq.root Thu Dec 4 05:09:22 UTC 2008 ( 23:09 CST ) Continuing with repeated failures, N00015253_0009.mdaq.root N00015253_0011.mdaq.root N00015253_0012.mdaq.root MINOS26 > ./dc_stat /pnfs/minos/beam_data/2008-12/B081203_080002.mbeam.root ============================ PNFS status for /pnfs/minos/beam_data/2008-12/B081203_080002.mbeam.root -rw-r--r-- 1 buckley e875 8068 Dec 3 18:09 B081203_080002.mbeam.root LEVEL 2 2,0,0,0.0,0.0 :c=1:a4ccb18c;h=yes;l=8068; LEVEL 4 VOC009 0000_000000000_0000270 8068 beam_data /pnfs/fnal.gov/usr/minos/beam_data/2008-12/B081203_080002.mbeam.root 000F000000000000089499D8 CDMS122834935400000 stkenmvr25a:/dev/rmt/tps0d0n:479000022613 2236133771 Cannot read this file from DCache. MINOS26 > ./dccptest /pnfs/minos/beam_data/2008-12/B081203_080002.mbeam.root Connected in 0.00s. Try another file : ./dccptest n13047018_0029_L010185N_D04.reroot.root Connected in 0.00s. failed also. /pnfs/minos/mcin_data/near/daikon_04/L010185N/701/n13047018_0029_L010185N_D04.reroot.root Vaults are also apparently stuck : tail ~/minos/log/rawcopy/near/2008-11.log neardet_data.2008-11.8.tar N00015143_0001.mdaq.root to N00015147_0000.mdaq.root ................... 
MINOS26 > ls -l /local/scratch26/kreymer/SHEEP/neardet_data/2008-11/ total 11974696 -rw-r--r-- 1 kreymer g020 1758433280 Dec 3 23:12 neardet_data.2008-11.1.tar -rw-r--r-- 1 kreymer g020 1733775360 Dec 3 23:42 neardet_data.2008-11.2.tar -rw-r--r-- 1 kreymer g020 1691781120 Dec 4 00:03 neardet_data.2008-11.3.tar -rw-r--r-- 1 kreymer g020 1766010880 Dec 4 00:34 neardet_data.2008-11.4.tar -rw-r--r-- 1 kreymer g020 1783511040 Dec 4 00:51 neardet_data.2008-11.5.tar -rw-r--r-- 1 kreymer g020 1726822400 Dec 4 01:04 neardet_data.2008-11.6.tar -rw-r--r-- 1 kreymer g020 1789736960 Dec 4 01:24 neardet_data.2008-11.7.tar MINOS26 > ls -l /var/tmp/rawcopy/TARWORK/ total 1557196 -rw-r--r-- 1 kreymer g020 117972943 Dec 4 01:29 N00015143_0001.mdaq.root -rw-r--r-- 1 kreymer g020 89554683 Dec 4 01:32 N00015143_0002.mdaq.root -rw-r--r-- 1 kreymer g020 86609970 Dec 4 01:32 N00015143_0003.mdaq.root -rw-r--r-- 1 kreymer g020 76573633 Dec 4 01:32 N00015143_0004.mdaq.root -rw-r--r-- 1 kreymer g020 87348917 Dec 4 01:32 N00015143_0005.mdaq.root -rw-r--r-- 1 kreymer g020 76753020 Dec 4 01:32 N00015143_0006.mdaq.root -rw-r--r-- 1 kreymer g020 89123053 Dec 4 01:33 N00015143_0007.mdaq.root -rw-r--r-- 1 kreymer g020 276591749 Dec 4 01:38 N00015144_0000.mdaq.root -rw-r--r-- 1 kreymer g020 13252439 Dec 4 01:43 N00015145_0000.mdaq.root -rw-r--r-- 1 kreymer g020 81400956 Dec 4 01:44 N00015146_0000.mdaq.root -rw-r--r-- 1 kreymer g020 90560856 Dec 4 01:46 N00015146_0001.mdaq.root -rw-r--r-- 1 kreymer g020 80464850 Dec 4 01:50 N00015146_0002.mdaq.root -rw-r--r-- 1 kreymer g020 90798583 Dec 4 01:54 N00015146_0003.mdaq.root -rw-r--r-- 1 kreymer g020 80166536 Dec 4 01:59 N00015146_0004.mdaq.root -rw-r--r-- 1 kreymer g020 90765663 Dec 4 02:05 N00015146_0005.mdaq.root -rw-r--r-- 1 kreymer g020 77672824 Dec 4 02:11 N00015146_0006.mdaq.root -rw-r--r-- 1 kreymer g020 87265249 Dec 4 02:30 N00015146_0007.mdaq.root There were about 900 pawloski jobs, accessing files in MINOS26 > grep cand_data/ ../CFL/CFL | wc -l 7450 MINOS26 > grep /pnfs/minos/mcout_data/cedar/near/carrot_06/L010185/sntp_data/ ../CFL/CFL | wc -l 7455 MINOS26 > grep /pnfs/minos/mcout_data/cedar/near/carrot_06/L010185/snts_data/ ../CFL/CFL | wc -l 802 MINOS26 > grep /pnfs/minos/mcout_data/cedar/near/carrot_06/L010185/mrnt_data/ ../CFL/CFL | wc -l 0 This probably was the prime cause of the PNFS overload. We must break up these files into more directories. =============================================================== Killing off all activity : MINOS26 > ps xf PID TTY STAT TIME COMMAND 29193 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/vault_monthly 29195 ? S 0:00 \_ /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/vault_monthly 2377 ? S 0:00 \_ /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/vault near 2008-11 2487 ? S 0:00 \_ /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/rawcopy neardet_data/2008-11 19740 ? S 0:00 \_ dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/neardet_data/2008-11/N00015146_0009.mdaq.root /var/tmp/rawcopy/TARWORK/N00015146_0009.mdaq. 
MINOS26 > kill 29193 MINOS26 > kill 29195 MINOS26 > kill 2377 MINOS26 > kill 2487 MINOS26 > kill 19740 MINOS26 > crontab -r kill %2 ( kills off prestage ) Updated MINOS CD status page mindata@minos26 11:54 crontab -r minfarm@fnpcsrv1 11:55 mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT touch /minos/data/minfarm/roundup/STOP.LOOPER 12:01 =============================================================== http://fndca3a.fnal.gov:2288/context/transfers.html found unusual activity , authentication since Thu Dec 04 11:14:32 CST 2008 KFTP-stkendca2a-Unknown-31478 kerberizedftpdoor-stkendca2aDomain 38532 GFtp-1 1602 0 null N.N. bzora1.fnal.gov checking permissions via permission handler 07:09:13 Staging and many more like these, from various ExpDbWritePools clients 11:14 - 7:09 = 04:05, well after the problem started. The latest FTP tranfer indicated at http://fndca3a.fnal.gov/cgi-bin/dcache_files.py oracle(1602.2752) 2008-12-3 16:00:23 =============================================================== Date: Thu, 04 Dec 2008 09:35:54 -0600 (CST) Subject: HelpDesk ticket 125810 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125810 ___________________________________________ Short Description: FNDCA RawDataWritePools severe problems Problem Description: There seems to be a global DCache problem, which started last night. Kerberized ftp writes of Minos raw data to RawDataWritePools have been failing since about 03:00. dccp from both RawDataWritePools and readPools are failing, for all files that I have tried We were doing the monthly raw data safetly copies at the time of the failure The last successful dccp by this script was on minos26, to -rw-r--r-- 1 kreymer g020 87265249 Dec 4 02:30 N00015146_0007.mdaq.root We were doing regulated dccp -P prestages of files, which got stuck soon after this status report : File/needed Time stamp Restore queue depth 5181/6794 Thu Dec 4 02:11:30 CST 2008 queue=1351/2000 Anonymous ftp reads were very slow last night, then failed altogether. Here are detailed logs of some access attempts. Elapsed Attempted ftp at Bytes returned sec 115 Wed Dec 3 18:05:01 CST 2008 557 6 Wed Dec 3 19:56:38 CST 2008 557 7 Wed Dec 3 20:06:45 CST 2008 557 21 Wed Dec 3 20:17:06 CST 2008 557 11 Wed Dec 3 20:27:17 CST 2008 557 24 Wed Dec 3 20:37:41 CST 2008 557 5 Wed Dec 3 20:47:47 CST 2008 557 48 Wed Dec 3 20:58:35 CST 2008 557 5 Wed Dec 3 21:08:40 CST 2008 557 75 Wed Dec 3 21:19:55 CST 2008 557 65 Wed Dec 3 21:31:00 CST 2008 557 244 Wed Dec 3 21:45:04 CST 2008 557 392 Wed Dec 3 22:01:36 CST 2008 557 13 Wed Dec 3 22:11:49 CST 2008 557 10 Wed Dec 3 22:21:59 CST 2008 557 1781 Wed Dec 3 23:01:40 CST 2008 557 220 Wed Dec 3 23:15:20 CST 2008 557 373 Wed Dec 3 23:31:33 CST 2008 557 644 Wed Dec 3 23:52:17 CST 2008 557 40 Thu Dec 4 00:02:57 CST 2008 557 1418 Thu Dec 4 00:36:35 CST 2008 557 110 Thu Dec 4 00:48:25 CST 2008 557 64 Thu Dec 4 00:59:29 CST 2008 557 45 Thu Dec 4 01:10:14 CST 2008 557 76 Thu Dec 4 01:21:30 CST 2008 557 1330 Thu Dec 4 01:53:40 CST 2008 557 3603 Thu Dec 4 03:03:43 CST 2008 1 3602 Thu Dec 4 04:13:45 CST 2008 1 3603 Thu Dec 4 05:23:48 CST 2008 1 3603 Thu Dec 4 06:33:51 CST 2008 1 3602 Thu Dec 4 07:43:53 CST 2008 1 ___________________________________________ This ticket is assigned to JONES, TERRY of the CD-SF/DMS/DSC/SSA. 
___________________________________________ Date: Thu, 04 Dec 2008 12:33:21 -0600 From: Timur Perelmutov These are pnfs problem, it can not handle the increased load. We requested to upgrade the pnfs software that should allow it to perform under increased load. We are waiting for the reply from SSA group. ___________________________________________ Date: Thu, 04 Dec 2008 12:56:58 -0600 From: Timur Perelmutov Service should be back up again. ___________________________________________ As of 13:40, I have successfully tested dccp srmls ftp =============================================================== Restating activity SRV1> mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok mindata@minos26 $ crontab crontab.dat kreymer@minos26 crontab crontab.dat rm -r /local/scratch26/kreymer/SHEEP/neardet_data/2008-11 rm /var/tmp/rawcopy/TARWORK/*.root hacked crontab to run vault tonight ============================================================================= 2008 12 03 ============================================================================= ####### # CRL # ####### Date: Wed, 03 Dec 2008 16:58:48 -0600 From: Mayly Sanchez To: webmaster@crlweb2.fnal.gov Cc: Robert Bernstein , Bob Zwaska , Arthur Kreymer Subject: CRL is dead Parts/Attachments: 1 OK 29 lines Text 2 Shown ~44 lines Text ---------------------------------------- Hi,  We are getting the following error when trying to access the CRL. This started happening around 15.15 and ti still not fixed at 17:00.   Mayly  OK The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, webmaster@crlweb2.fnal.gov and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log. ____________________________________________________________________________ Apache/2.0.46 (Scientific Linux) Server at crlweb2.fnal.gov Port 80 Mysql> mysqladmin processlist -u root | grep crlweb | 57308859 | crl | crlweb.fnal.gov:38519 | crl_v1 | Sleep | 16892388 | 57308860 | crl | crlweb.fnal.gov:38520 | crl_v1 | Sleep | 16892388 | 110774986 | crl | crlweb.fnal.gov:53489 | crl_v1 | Sleep | 12257425 | 110774987 | crl | crlweb.fnal.gov:53490 | crl_v1 | Sleep | 12257425 | 110774988 | crl | crlweb.fnal.gov:53491 | crl_v1 | Sleep | 12257425 | 189141425 | crl | crlweb.fnal.gov:47178 | crl_v1 | Sleep | 2489 | 189143126 | crl | crlweb.fnal.gov:47180 | crl_v1 | Sleep | 1585 | 189143129 | crl | crlweb.fnal.gov:47181 | crl_v1 | Sleep | 1583 | 189144484 | crl | crlweb.fnal.gov:47183 | crl_v1 | Sleep | 680 | 189144487 | crl | crlweb.fnal.gov:47184 | crl_v1 | Sleep | 678 ########## # CONDOR # ########## Date: Wed, 03 Dec 2008 14:51:37 -0600 (CST) Subject: HelpDesk ticket 125792 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125792 ___________________________________________ Short Description: minos25 condor configuration file update Problem Description: run2-sys : Please update the minos25 local configuration file, /opt/condor-7.0.1/local/condor_config.local to have the content of /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/ condor_config.local.minos25.20081203 This is not urgent. This change reduces the rate at which our Grid jobs start, to prevent the global SAZ overloads seem last weekend. 
___________________________________________ Date: Wed, 03 Dec 2008 15:10:39 -0600 (CST) This ticket has been reassigned to SIMMONDS, EDWARD of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 03 Dec 2008 15:33:38 -0600 (CST) Solution: esimm@fnal.gov sent this solution: Done. I did not restart any services. ___________________________________________ ___________________________________________ ####### # SRM # ####### Testing publicly useable srm on my desktop, using mcimport as a model. MIN > scp minfarm@fnpcsrv1:/local/globus/minfarm/.grid/kreymer-production.proxy . MIN > cd .grid . /minos/scratch/app/OSG1/setup.sh export X509_USER_PROXY=/home/kreymer/.grid/kreymer-production.proxy srmls srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport 0 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/ 512 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/howcroft/ 512 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/kordosky/ ########### # BLUEARC # ########### Monitoring showed no failures, delays on minos-sam03 Wed Dec 3 04:49:13 CST 2008 SLO N00013822_0000.spill.sntp.cedar_phy_bhcurv.0.root 600 minos26 Wed Dec 3 04:49:28 CST 2008 SLO N00013289_0000.spill.sntp.cedar_phy_bhcurv.0.root 625 After bluearc outage 04:30 minfarm@fnpcsrv1 mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok kreymer@minos26 crontab crontab.dat ######## # DATA # ######## On Wed, 3 Dec 2008, Steven Cavanaugh wrote: > I am running grid jobs which dccp a number of files which have all been > prestaged. The problem is that not all the files were prestaged. Short story : I need to spend about 1/2 hour modifying scripts to get a list of volumes containing these files, so that I can stage the rest. Longer story : about half these .bcnd files got written to the generic 'minos' file family, not the reco_far_cedar_phy_bhcurv_bcnd family. My original tape list was therfore incomplete. I think that I can get an accurate tape list using SAM. With 25K files to check, 5 second queries are too slow. I can adapt an existing python script to get tape data with one query. I'm not sure if it helps (I don't have tape information), but I have a file with a list of all of the files that I need: /minos/scratch/scavan/mrcc_trimmer/final/bcnd_far_cedar_phy_bhcurv wc -l /minos/scratch/scavan/mrcc_trimmer/final/bcnd_far_cedar_phy_bhcurv 22981 wc -l /tmp/CPBVOL.lis 22963 /tmp/CPBVOL.lis MINOS26 > sort -u /tmp/CPBVOL.lis vob235 vob570 vob594 vob990 voc190 voc193 voh334 Volumes previously restored were VOC190 VOC193 VOH334 VOK485 New volumes are FVOLS="VOB235 VOB570 VOB594 VOB990" Why does VOK485 not show up in this list ? Files seem to be in 2007-11, not declared to SAM. VFILES=`enstore info --list=VOK485 | grep cedar_phy_bhcurv | cut -f 10 -d /` for FILE in ${VFILES} ; do sam locate ${FILE} ; done Datafile with name 'F00039984_0011.spill.bcnd.cedar_phy_bhcurv.0.root' not found. ... Datafile with name 'F00039987_0011.spill.bcnd.cedar_phy_bhcurv.0.root' not found. All files are missing from SAM. But they were prestaged already. Let's pick up the remaining files : { for VOL in ${FVOLS} ; do ./stage.20081125 -w -p 1 -s cedar_phy_bhcurv/.bcnd_data ${VOL} done ; } >> /minos/scratch/kreymer/log/stage/cpbcnd2.log 2>&1 & Hmmm, this looks grim. Encp history http://www-stken.fnal.gov/enstore/encp_enstore_system.html shows 1 minute between file copies from VOB570, though the files are in tape-order. Almost all time is being spent seeking. 
9940B27.mover 2937 xfers 5878 - 11:05:20 5879 11:05:42 5883 11:07:55 9940B27.mover alive : HAVE BOUND volume (VOB570) - IDLE stkenmvr27a 2008-Dec-03 11:06:28 Completed Transfers 5882 Failed Transfers 0 Last Read (bytes) 29,878,277 Volume VOC197 Last Write (bytes) 29,878,082 Location Cookie 45 /pnfs/fs/usr/.(access)(000F00000000000006E55678) --> stkendca14a.fnal.gov:/diskc/read-pool-5/data/000F00000000000006E55678 Stared at ENTV display on WH8E for a while. Seek times vary from 2 to 60 seconds. Average is 20 MINOS26 > echo '(38 + 44 + 6 + 12 + 17 + 10 + 60 + 30 + 10 + 12 + 2 + 8 + 23 ) / 13' | bc 20 08:43 - grinding away on VOB594, 5181/6794 Thu Dec 4 02:11:30 CST 2008 queue=1351/2000 ######### # STAGE # ######### < Changed test for exiting file from using level 2 ( no longer valid ) < to grepping through a consolidated file dump POOLFILES. ln -sf stage.20081203 stage # was stage.20071012 ############# # SAMLOCATE # ############# Adding -t option, for tape label Steal code from http://d0db-prd.fnal.gov/rexipedia/illingworth/add_pnfs_location.py SAMDIM="FILE_NAME F00030617_0002.spill.bcnd.cedar_phy_bhcurv.0.root" ./samlocate "${SAMDIM}" F00030617_0002.spill.bcnd.cedar_phy_bhcurv.0.root /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2005-04 SAMDIM=" DATA_TIER bcnd-far and VERSION cedar.phy.bhcurv " ./samlocate -t "${SAMDIM}" | tee /tmp/CPBVOL.lis wc -l /tmp/CPBVOL.lis 22963 /tmp/CPBVOL.lis ######### Date: Tue, 02 Dec 2008 19:15:50 -0600 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron /usr/krb5/bin/kcron ${HOME}/minos/scripts/cfl Newline appended Found the end of CFL with partial file name : minos reco_near_cedar_phy_bhcurv_cand VO9747 0000_000000000_0000453 CDMS119569487200000 326877613 4165507 /pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2007-04/N00012063_0020.spil not the customary (1832538 rows) File: `CFL' Size: 304196973 Blocks: 594136 IO Block: 4096 regular file Device: 18h/24d Inode: 27722344 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 1060/ kreymer) Gid: ( 1525/ g020) Access: 2008-12-08 19:16:51.574824459 -0600 Modify: 2008-12-08 19:16:51.485102646 -0600 Change: 2008-12-08 19:16:51.485102646 -0600 ######## # FARM # ######## Test linfix concat, do just one run ./roundup -s n11011001 -r cedar_phy_linfix mcnear OK - processing /minos/data/minfarm/mcnearcat version 20081126 SELECT files containing n11011001 Wed Dec 3 09:35:09 CST 2008 Wed Dec 3 09:50:11 CST 2008 SADD less +F /home/minfarm/ROUNTMP/LOG/saddreco/daikon_00/cedar_phy_linfix/near_L010185N.log Wed Dec 3 11:21:09 CST 2008 Declared thousands of candidates Start up looper on linfix ./looper '-r cedar_phy_linfix mcnear' & OK - processing /minos/data/minfarm/mcnearcat version 20081126 Wed Dec 3 11:40:47 CST 2008 Traceback (most recent call last): File "/home/minfarm/scripts/samdup", line 162, in ? 
SUB = FILE.strip().split('_')[1].split('.')[0] IndexError: list index out of range SRV1> ls /minos/data/minfarm/mcnearcat | grep -v .root marker_end marker_start wc SRV1> dds /minos/data/minfarm/mcnearcat/mar* -rw-rw-r-- 1 rubin e875 0 Apr 29 2008 /minos/data/minfarm/mcnearcat/marker_end -rw-rw-r-- 1 rubin e875 0 Apr 29 2008 /minos/data/minfarm/mcnearcat/marker_start SRV1> dds /minos/data/minfarm/mcnearcat/w* -rw-r--r-- 1 asousa e875 18496 Oct 14 08:37 /minos/data/minfarm/mcnearcat/wc SRV1> mv /minos/data/minfarm/mcnearcat/mar* /minos/data/minfarm/maint/ SRV1> mv /minos/data/minfarm/mcnearcat/wc /minos/data/minfarm/maint/ ============================================================================= 2008 12 02 ============================================================================= ######## # FARM # ######## Prepare for linfix concat, do just one run ./roundup -n -s n11011001 -r cedar_phy_linfix mcnear Tue Dec 2 16:28:10 CST 2008 ... OK adding n11011001_0001_L010185N_D00.sntp.cedar_phy_linfix.0.root 8 OK adding n11011001_0010_L010185N_D00.sntp.cedar_phy_linfix.0.root 1 HADD rate 0 Kbytes/second Tue Dec 2 16:49:32 CST 2008 ... ######## # DATA # ######## Staged a test copy , using door 0 ./dccptest n13047018_0029_L010185N_D04.reroot.root 24125 15:30:30 Checked transfer page, see nothing for dcap00 or minos26 as of 15:33 Showed up at 15:34:17 DCap00-stkendca2a-unknow-113 dcap00-stkendca2aDomain 2 dcap-3 1060 30485 000F000000000000074C0348 N.N. minos26.fnal.gov WaitingForGetPool 00:02:08 Staging ######## # DATA # ######## Date: Tue, 02 Dec 2008 14:37:50 -0600 From: Keith Chadwick To: fermigrid-announce@fnal.gov Subject: FermiGrid BlueArc filesystems... The BlueArc filesystems (/grid/app, /grid/data, /grid/home, etc.) appear to have "burped" across all of FermiGrid about 5 minutes ago. We are investigating... -Keith. ------------------------------------------------ Date: Tue, 02 Dec 2008 14:47:40 -0600 From: Andrew J. Romero To: "'site-nas-announce@fnal.gov'" Subject: BlueArc node RHEA-1 crashed at ~2:00pm today The following EVSs failed over to RHEA-2 blue2 minos-nas-0 and appear to be operating normally I will provide more information soon. ---------- Forwarded message ---------- Date: Tue, 02 Dec 2008 15:12:56 -0600 From: Andrew J. Romero To: 'Jon Bakken' , 'Steve Timm' , 'Arthur Kreymer' Subject: RHEA-1 crash We are still working with BlueArc to determine why RHEA-1 crashed. We will need to rebalance the EVSs (Virtual Servers) as soon as possible. This will involve a short 10-15min downtime for the following EVSs (as they are moved from RHEA-2 to RHEA-1) - blue2 - minos-nas-0 Assuming RHEA-1 does not have a hardware problem, we would like to rebalance the EVS load at 4:30am tomorrow. Let me know if this time is acceptable Andy ------------------------------------------------ There were no failures to read files. But the reads took 20 minutes to complete, finishing around 14:41 . Strangely, the process on fnpcsrv1 seems to have continued with no logged error or delay around 14:41 ! There have been many instances of 20 to 40 second delays this last week. See samples recorded under http://www-numi.fnal.gov/computing/dh/bluwatch/log/ ------------------------------------------------ Date: Tue, 02 Dec 2008 22:09:44 +0000 (GMT) From: Arthur Kreymer To: minos_batch@fnal.gov, minos_software_discussion@fnal.gov Subject: BlueArc downtime/stall tomorrow at 04:30 There is a scheduled emergency interruption to BlueArc service, affecting /grid/* and /minos/* file systems. 
Files reads may be slow, with 10 to 20 minute delays. I do not know what will happen to file writes. I am shutting down file concatenation for the evening. ------------------------------------------------ mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT Tue Dec 2 16:11:27 CST 2008 And disabled mdsum_log on minos26. ######### # FNALU # ######### Brebel reports condor on FNALU not giving him tokens. Repeat my tests, cd /local/stage1/kreymer/condor condor_submit probe ####### # DAQ # ####### Date: Tue, 02 Dec 2008 12:33:39 -0600 From: John Urish To: Brett M Viren , Mary R Bishai , Arthur E Kreymer , zwaska@fnal.gov Subject: minos-beamdata minos-beamdata is set up in it's rack in FCC. The new IP address is 131.225.107.196. I'm able to connect to it via SSH. You can go ahead and check the minos software on it. Let me know if you find any problems. ######## # FARM # ######## Can you get the linfix files prestaged yet? The runs are in /minos/data/minfarm/lists/mclist.cedar_phy_linfix. There are also ~1100 dogwoodtest4 jobs to rerun which should be prestaged if possible. They'll be found in /minos/data/minfarm/farmtest/mclist.dogwoodtest4. dcache/datasets r '' '' list /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2008/12/list.r PFILES=/afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2008/12/list.r Linfix input files LFILES=`cat /minos/data/minfarm/lists/mclist.cedar_phy_linfix | cut -f 1 -d .` for FILE in ${LFILES} ; do grep ${FILE}.reroot.root ${PFILES} ; done | wc -l 79 list files not in pools for FILE in ${LFILES} ; do grep -q ${FILE}.reroot.root ${PFILES} || echo ${FILE} ; done 83 Dogwood test input files MFILES=`cat /minos/data/minfarm/lists/mclist.dogwoodtest4.matt` 2385 count files in pools for FILE in ${MFILES} ; do grep ${FILE}.reroot.root ${PFILES} ; done | wc -l 2648 list files not in pools for FILE in ${MFILES} ; do grep -q ${FILE}.reroot.root ${PFILES} || echo ${FILE} ; done More mail from Howie wc -l /minos/data/minfarm/farmtest/lists/mclist.dogwoodtest4 1135 HFILES=`cat /minos/data/minfarm/farmtest/lists/mclist.dogwoodtest4 | cut -f 1 -d .` count files in pools for FILE in ${HFILES} ; do grep ${FILE}.reroot.root ${PFILES} ; done | wc -l 1285 list files not in pools for FILE in ${HFILES} ; do grep -q ${FILE}.reroot.root ${PFILES} || echo ${FILE} ; done ########### # MINOS25 # ########### Date: Tue, 02 Dec 2008 12:02:00 -0600 (CST) Reply-To: HelpDesk Subject: HelpDesk ticket 125690 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125690 ___________________________________________ Short Description: minos-mysql3 swap with minos25 - OS reinstall requested Problem Description: run2-sys : Per discussions with Jason Allen this morning, please reinstall the OS on the node presently called minos-mysql3, configured as a Minos Cluster node ( presently configured as a mysql server.) Please create partitions and accounts, and copy files as described in http://www-numi.fnal.gov/computing/minos25.txt or provide appropriate revisions to this plan. ___________________________________________ This ticket is assigned to HelpDesk of the Help Desk. ___________________________________________ Date: Tue, 02 Dec 2008 12:06:14 -0600 (CST) This ticket has been reassigned to SIMMONDS, EDWARD of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 03 Dec 2008 09:55:09 -0600 (CST) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. 
___________________________________________ Date: Wed, 03 Dec 2008 14:55:58 -0600 (CST) The OS has been reinstalled. [root@minos-mysql3 ~]# uname -a Linux minos-mysql3.fnal.gov 2.6.9-78.0.8.ELsmp #1 SMP Wed Nov 19 13:11:58 CST 2008 x86_64 x86_64 x86_64 GNU/Linux [root@minos-mysql3 ~]# df -l Filesystem 1K-blocks Used Available Use% Mounted on /dev/sda1 20641788 3585528 16007620 19% / none 8219568 0 8219568 0% /dev/shm /dev/sda7 192822076 5818512 177208736 4% /home /dev/sda6 2063504 35948 1922736 2% /tmp /dev/sda3 10317860 182728 9611012 2% /var /dev/sda2 10317860 55844 9737896 1% /var/tmp AFS 9000000 0 9000000 0% /afs /dev/sdb1 240292420 98892 227987344 1% /local/scratch25 /dev/sdb mounted as /local/scratch25. Not sure if you really want /minos/scratch25. 10GB /var/tmp created. gfrontend and gfactory users and groups added to NIS server. Only these three users are recognized on the machine, besides the system logins. Home areas have been rsyned. /home/condor from minos25:/local/stage1/condor /local/scratch25/condor does not exist. Guessing this is /local/scratch25/stage1/condor (or /local/stage1/condor) Copied /local/stage1/condor to minos-mysql3:/home/condor Copied kcron files under /var/adm/krb/ Copied ONLY the grid directories under the user directories. Not sure if other content in the user directories need to be copied. Copied crontabs file under /var/spool/cron . System cron files were not copied. ___________________________________________ Date: Wed, 03 Dec 2008 23:42:45 +0000 (GMT) > /dev/sdb mounted as /local/scratch25. Not sure if you really want /minos/scra$ Yes, /minos/scratch25 was yet another typo. Your interpretation was correct. I have corrected the original document. >> Create accounts for Condor management, cloned from minos25, >> and rsync the home areas. >> Account Home Size >> condor /home/condor 844 MB >> symlink to this from /local/stage1/condor >> gfrontend /home/gfrontend 175 MB >> gfactory /home/gfactory 4614 MB > > gfrontend and gfactory users and groups added to NIS server. > Only these three users are recognized on the machine, besides the system logi$ Please allow the full set of Minos users to log in, as is the case on minos25. That will let me log in as kreymer, to do some configuration. > Home areas have been rsyned. > /home/condor from minos25:/local/stage1/condor Thanks, I have logged into gfactory and gfrontend, looks OK. > /local/scratch25/condor does not exist. Guessing this is /local/scratch25/sta$ > /local/stage1/condor) > Copied /local/stage1/condor to minos-mysql3:/home/condor Correct again. I have correct the document. The condor home is /local/stage1/condor, which is a symlink to /local/scratch25/stage1/condor. > Copied kcron files under /var/adm/krb/ > Copied ONLY the grid directories under the user directories. Not sure if othe$ > directories need to be copied. Not needed for this migration. Will copy them elsewhere, later. > Copied crontabs file under /var/spool/cron . System cron files were not copie$ Let me know when rpbatter and kreymer et.al. can log in, we will start doing more tests. ___________________________________________ Date: Wed, 03 Dec 2008 17:45:56 -0600 From: Ling C. Ho I have corrected this. You should be able to log in now. ___________________________________________ Date: Wed, 03 Dec 2008 17:48:19 -0600 From: Ling C. Ho By the way I don't thee the user rpbatter on minos25 nor the nis map. ___________________________________________ Date: Wed, 03 Dec 2008 15:48:51 -0800 (PST) From: Ryan B. 
Patterson Any word on the new Condor versions? Are they *really* going to be available this week, or should we ask for 7_0_3 to be installed. I'd like to do hard testing this week/weekend if possible, and my gut tells me that we aren't going to see these new RPMs in a timely manner. ___________________________________________ Date: Wed, 03 Dec 2008 17:49:57 -0600 From: Ling C. Ho SOrry, I am slow at the end of the day. "rbpatter" is there. ___________________________________________ Date: Wed, 03 Dec 2008 16:01:51 -0800 (PST) From: Ryan B. Patterson /minos/scratch and /minos/data are present but appear to be read-only at the moment: bash$ touch /minos/data/hi touch: cannot touch `/minos/data/hi': Read-only file system bash$ touch /minos/scratch/hi touch: cannot touch `/minos/scratch/hi': Read-only file system Perhaps this is temporary, as 'mount' suggests they should be rw: minos-nas-0.fnal.gov:/minos/scratch on /minos/scratch type nfs (rw,rsize=32768,timeo=600,proto=tcp,nfsvers=3,hard,intr,addr=131.225.111.115 ) minos-nas-0.fnal.gov:/minos/data on /minos/data type nfs (rw,rsize=32768,timeo=600,proto=tcp,nfsvers=3,hard,intr,addr=131.225.111.115 ) ___________________________________________ Date: Thu, 04 Dec 2008 10:30:21 -0800 (PST) From: Ryan B. Patterson To: Arthur Kreymer Subject: Re: HelpDesk ticket 125690 has additional info. An additional observation: The permissions of /var/adm/krb5 seem to disallow proper kcron operation. On minos25[old] this was: drwx--s--x 2 root root 4096 Nov 18 14:56 krb5 on minos25-mysql3, this is: drwxr-xr-x 2 root root 4096 Dec 3 11:44 krb5 --Ryan ___________________________________________ Date: Mon, 08 Dec 2008 13:21:28 -0600 (CST) Machine swap completed on Friday 12/5/08. This ticket was resolved by HO, LING of the CD-SF/FEF group. ___________________________________________ ########## # DCACHE # ########## > Stan Naymola > I may have found a dCache monitor page that may help you with door status. > Look at http://fndca3a.fnal.gov:2288/context/transfers.html . Look at the > first column, it tells you which door is being used or queued. Let me know if > this is helpful. Thanks ! This should be very useful. I seems to be updated about every 2 minutes, which should be good enough for load balancing. -------------------------------------------------------- Time stamps from the bottom of the page 11:33:24 11:35:14 11:37:05 11:38:54 CST 2008 ########## # DCACHE # ########## Date: Tue, 02 Dec 2008 11:47:16 -0600 (CST) Reply-To: HelpDesk Subject: HelpDesk ticket 125688 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125688 ___________________________________________ Short Description: STKEN door DCap00-stkendca2a , port 24125 is stuck Problem Description: According to the login plots at http://fndca3a.fnal.gov/dcache/logins//DCap00-stkendca2a.jpg DCap00-stkendca2a , port 24125, seems to have been down since early Saturday 29 November, I cannot connect to this door : Failed to create a control line [Tue Dec 2 11:24:45 2008] Going to open file dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/neardet_data/2004-11/N 00004502_0000.mdaq.root in cache. Failed to create a control line Failed open file in the dCache. Can't open source file : Unable to connect to server System error: Connection refused ___________________________________________ This ticket is assigned to JONES, TERRY of the CD-SF/DMS/DSC/SSA. 
___________________________________________ Date: Tue, 02 Dec 2008 15:43:20 -0600 (CST) Solution: jonest@fnal.gov sent this solution: > Door was not working with no obvious reason. Nothing in the log > file. The door > has been restarted and now works fine. This ticket was resolved by JONES, TERRY of the CD-SF/DMS/DSC/SSA group. ___________________________________________ ___________________________________________ ============================================================================= 2008 12 01 ============================================================================= ########### # MINOS25 # ########### First draft migration plan to swap minos-mysql3 with minos25 We are having severe performance problems with minos25, presently our Condor gateway system. Unexplained very high load averages, with very high I/O wait delays are common. The problems continue with each of the suspected software components disabled. Even without these overloads, we need a more capable Condor host system. Therefore, we would like to swap the new minos-sam03 hardware with minos25. The following items can be done in preparation : Install condor v7_0_3 on minos-mysql3 as on the rest of the Minos Cluster. Install condor v7_0_3 configuration files to be provided by Minos to FEF. Create accounts for Condor management, cloned from minos25, and rsync the home areas. Account Home Size condor /local/stage1/condor 844 MB gfrontend /home/gfrontend 175 MB gfactory /home/gfactory 4614 MB Change mount from /data to /local/scratch25 on minos-mysql3, and set permissions per minos25. Copy /local/scratch25/condor to minos-mysql3. Copy all user kcron files to minos-mysql3. Copy all user /local/scratch25//grid files to minos-mysql3. Copy all user crontabs from minos25 to minos-mysql3 For the actualy identity transplant : Shut down the minos25 condor processes. Disable automatic Condor startup on minos25 (old and new). Shut down minos25 and minos-mysql. Exchange host Grid certificates . Reboot the new minos25, without starting condor. rsync files from the condor, gfrontend and gfactory accounts. Restart condor and verify proper operation. N.B. shift ~condor from /local/stage1/condor to /home/condor, with a symlink to new space for compatibility. N.B. - changed /local/scratch25 to /local/scratch25 above, typo in first draft N.B - added correction to afs login problem on minos-mysql3 ########## # DCACHE # ########## Requesting more doors : Date: Mon, 01 Dec 2008 18:02:55 -0600 (CST) From: HelpDesk <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125645 ___________________________________________ Short Description: FNDCA needs more unauthenticated dcap doors Problem Description: Minos is ramping up grid usage, with a goal of 5000 jobs running on FermiGrid. We are getting close to 1000 jobs running recently. But we still have only four unauthenticated dcap doors. We are hitting door limits even when using all four. To handle 5000 jobs, figuring 250 connections per door, we would need about 20 doors. Please increase the number of doors to 10 as soon as possible, to handle the present load. And increase to 20 doors as soon as convenient to handle the longer term load. ___________________________________________ This ticket is assigned to JONES, TERRY of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Mon, 01 Dec 2008 17:38:15 -0600 From: Jon Bakken That's interesting. 
I configured the CMS doors for a max of 4000 each (running multiple doors per node). We routinely have more than a 1000 per door without troubles. ___________________________________________ Date: Tue, 02 Dec 2008 15:01:26 -0600 (CST) Note To Requester: jonest@fnal.gov sent this Notes To Requester: Hi, A dcache expert has responded. should I close this ticket? > --- Comment #1 from Vladimir Podstavkov > 2008-12-02 14:53:01 --- > The current limit for dcap doors is 4000 connections per door. Two > doors allow > to have up to 8000 open connections. No additional doors needed. ___________________________________________ Date: Tue, 02 Dec 2008 21:16:55 +0000 (GMT) From: Arthur Kreymer To: HelpDesk Cc: minos-data@fnal.gov, dcache-admin@fnal.gov Subject: Re: HelpDesk ticket 125645 has additional info. <-- # @@@ Enter Update below this line. @@@ # --> Thanks, it is good to know that the limit is at 4000. Door 0 seems to have died Saturday with 1000 connections. Doors 1/2/3 seem to have historic peaks under 400. Door 2 is at 536 right now, as high as I've ever seen it. I'll keep watching, and think about setting up a load test. I'll start regular monitoring of service availability. <-- # @@@ Enter Update above this line. @@@ # --> ############ # PREDATOR # ############ InvalidMetadata: Invalid Metadata specified for file 'N00015238_0017.mdaq.root' of type 'importedDetector': Application with family 'online', applName 'rotorooter', version 'v00-00--1' not found. FINISHED Sun Nov 30 05:12:08 2008 Set the damaged file aside cd /local/scratch26/kreymer/genpy/neardet_data/2008-11 mv N00015238_0017.sam.py N00015238_0017.sam.pybad Note that there is a .pyc for this file ! MINOS26 > dds *.pyc -rw-r--r-- 1 kreymer g020 1266 Nov 30 01:15 N00015238_0017.sam.pyc Subruns 16 and 18 are OK. MINOS26 > ./predator 2008-11 ########### # MINOS27 # ########### /pnfs/minos is not mounted ######## # DATA # ######## Link for file listings from dcache pools, in minos/CFL/ ln -s /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets/2008/12/list.q dcache_q.txt Many raw data files are still not on tape, since Thursday 27 Nov. cat > /tmp/oldfiles DFILES=`grep -v ' 30' /tmp/oldfiles | cut -c 7- | cut -f 1 -d ' '` MINOS27 > for FILE in ${DFILES} ; do grep ${FILE} ../CFL/dcache_q.txt ; done w-stkendca10a-3 w-stkendca11a-3 w-stkendca12a-3 w-stkendca8a-1 w-stkendca8a-2 w-stkendca9a-3 ######## # DATA # ######## Date: Mon, 01 Dec 2008 09:23:19 -0500 From: Steven Cavanaugh To: Arthur Kreymer Subject: [Fwd: dccp issues with grid] Hi Art, I forgot to include you on this email last night Thanks, Steve -------- Original Message -------- Subject: dccp issues with grid Date: Sun, 30 Nov 2008 21:55:47 -0500 From: Steven Cavanaugh To: Helpdesk , Mayly Sanchez Hi, This is a high priority issue, as minos grid jobs are currently unable to dccp files. I am submitting some jobs on minos25 to the condor grid jobs executing on the following machines (and possibly more) are unable to dccp a file: 131.225.167.4 fnpc237 131.225.166.81 fnpc303 131.225.211.160 fcdfcaf1512 131.225.167.8 fnpc240 131.225.166.90 fnpc312 They all get the error : Failed to create a control line Failed open file in the dCache. Can't open source file : Unable to connect to server System error: Connection refused however, running the dccp on minos09 works: dccp /pnfs/minos//reco_far/cedar_phy_bhcurv/.bcnd_data/2005-03/F00030612_0004.spi ll.bcnd.cedar_phy_bhcurv.0.root . 
5802487 bytes in 0 seconds ___________________________________________________________________ Thanks, Steve New Information: scavan@fas.harvard.edu sent in this update: > > > Some additional information: These jobs were attempting to copy 10 files using dccp (a separate dccp for each file, which would not run until the previous dccp command completed) usually the first 3-8 files would copy without issue, and the remainder would fail as described below So this is not simply a connection problem ___________________________________________________________________ <-- # @ Enter Update below this line. @ # --> I think I see the reason for the dccp failures. Messages like Unable to connect to server - usually come from Doors ( dccp ports ) that are overloaded. Looking at the door login plots under http://fndca3a.fnal.gov/dcache/dc_login_plots.html, particularly those for the unauthenticated doors we use, DCap00/01/02/03 I see repeated peaks over 300, sometimes to 1000. The doors can only handle about 200 to 300 connections before they stop taking connections. Steve - I presume that you are spreading your connections randomly between ports 24125, 24136, 24137, 24138. doors 00 01 02 03 We need to have more doors added to DCache. I will put in a new helpdesk ticket for this. Until the new doors come, you need to reduce the number of simultaneous dccp's so as not to overload any single door. You may need to do something similar to Greg, submitting then holding the jobs, then releasing a few hundred at a time. He has some scripts to regulate the number of jobs running. This should let you run your jobs in 10 file mode, if you use all four doors/ports randomly, and keep your total per door under 100 ( there are other users.) <-- # @ Enter Update above this line. @ # --> ######## # GRID # ######## Date: Sun, 30 Nov 2008 22:47:29 -0600 (CST) From: Steven Timm To: minos-data@fnal.gov, scavan@fnal.gov Cc: fermigrid-operations@fnal.gov Subject: HUGE number of very short jobs from user scavan of MINOS We are seeing a HUGE number of jobs (17000!) from minos user scavan (Steve Cavanaugh) being processed through the minosgli glideins. The mean finishing time of these jobs is 2-5 minutes. This is very much frowned upon. It is pushing our SAZ server to near-record and near-failure levels and threatening to disable all of FermiGrid. Make the jobs longer, NOW. Jobs designed to run only 1-2 minutes are not allowed on FermiGrid, period. If our SAZ server alarms for high load again tonight, which it has already done once, we will not hesitate to block all MINOS glideins until the problem is fixed. Steve Timm -------------------------- Per previous discussions, JOB_START_COUNT = 8 JOB_START_DELAY = 2 -------------------------- condor_config_val -schedd JOB_START_COUNT 8 condor_config_val -schedd -rset "JOB_START_COUNT = 1" Attempt to set configuration "JOB_START_COUNT = 1" on schedd minos25.fnal.gov <131.225.193.25:65226> failed. export X509_USER_PROXY=/local/scratch25/kreymer/grid/kreymer.proxy Still fails condor_reconfig Sent "Reconfig" command to local master Trying a sample from the man page, MINOS25 > condor_config_val -schedd -rset "MAX_JOBS_RUNNING = 2001" Attempt to set configuration "MAX_JOBS_RUNNING = 2001" on schedd minos25.fnal.gov <131.225.193.25:65226> failed. 
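For what it's worth, the -rset failures look like runtime configuration simply being locked down on the schedd. Condor only honors condor_config_val -rset when knobs along these lines are present in the config; this is an assumption about the minos25 setup, shown as a sketch, not something we enabled :

# hypothetical settings needed before condor_config_val -rset would be accepted
ENABLE_RUNTIME_CONFIG = TRUE
SETTABLE_ATTRS_CONFIG = JOB_START_COUNT, JOB_START_DELAY, MAX_JOBS_RUNNING
HOSTALLOW_CONFIG = minos25.fnal.gov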
Let's do this through config files.
cd /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701
cp condor_config.local.minos25.20080925 condor_config.local.minos25.20081203
From
    JOB_START_COUNT = 3
    JOB_START_DELAY = 2
To
    JOB_START_COUNT = 1
    JOB_START_DELAY = 1
ln -sf condor_config.local.minos25.20081203 condor_config.local.minos25
# was condor_config.local.minos25.20080925 Sep 25
condor_config.20081203 added rbpatter to QUEUE_SUPER_USERS, removed buckley

###########
# MINOS25 #
###########

Test hdb disk access, using a similar script to the DATA test below.
RFILES=`cat /minos/data/minfarm/lists/mmm.D00 | cut -f 1 -d .`
for FILE in ${RFILES}; do
    time ./dccptest ${FILE}.reroot.root
done 2>&1 | tee /tmp/dccptest.log
Data rates are typically 30 MB/sec, as on minos26.
But the load average is running about 8, versus 2,
and the CPU average is running about 70% wait, versus 40% wait.
26 GBytes of files were copied, no errors.
The CPU load stayed high about 5 minutes past the last network traffic.
Question : is DMA enabled on this local disk ?

Do a similar test on minos-mysql3, candidate for minos-grid.
Extended dccptest to take a 4th argument, the destination path; it will supply a DCCPTEST subdirectory.
MINOS-MYSQL3 > RFILES=`cat /minos/data/minfarm/lists/mmm.D00 | cut -f 1 -d .`
for FILE in ${RFILES}; do
    time ./dccptest ${FILE}.reroot.root '' '' /var/tmp/kreymer/DCCPTEST
done 2>&1 | tee /tmp/dccptest.log
Data rates are a uniform 40 MBytes/sec.
Load average on minos-mysql3 was around 1, CPU usage 10% wait I/O, no post-copy delays.
Elapsed time 10 minutes.

Continuing tests, with a local file :
DTDIR=/local/scratch26/${LOGNAME}/DCCPTEST
time dd if=/dev/urandom of=${DTDIR}/10G.dat bs=10M count=1000
Top - 26% system, dd is 100% CPU bound, load around 1.5
real 31m41.954s user 0m0.009s sys 31m38.962s

DTDIR=/local/scratch25/${LOGNAME}/DCCPTEST
time dd if=/dev/urandom of=${DTDIR}/10G.dat bs=10M count=1000
Top - 26% system, dd is 100% CPU bound, load around 1.2
real 31m57.528s user 0m0.011s sys 31m53.885s

DTDIR=/var/tmp/${LOGNAME}/DCCPTEST
time dd if=/dev/urandom of=${DTDIR}/10G.dat bs=10M count=1000
Top - 12% system, dd is 100% CPU bound, load around .6
real 17m48.005s user 0m0.000s sys 17m46.705s

Not such a good test; dd seems to be burning CPU making the urandom content.
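A cheaper way to build the 10 GB test file, keeping the random-number generation out of the timing. A sketch; /dev/zero data is compressible, but that does not matter for a raw local-disk write test :

DTDIR=/var/tmp/${LOGNAME}/DCCPTEST
# write 10 GB of zeros; the trailing sync keeps the page cache from hiding the real write time
time ( dd if=/dev/zero of=${DTDIR}/10G.dat bs=10M count=1000 && sync )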
Deferred minos25 test, load average started growing aroun 14:20, The 10G.dat creation finished at 14:38:12 Overload lasted 14:20 to 14:50 MINOS25 > time dd if=${DTDIR}/10G.dat of=${DTDIR}/10G.dat2 top - 16:07:47 up 41 days, 16 min, 9 users, load average: 3.07, 0.99, 0.49 Cpu(s): 4.4% us, 5.1% sy, 0.0% ni, 30.8% id, 59.4% wa, 0.2% hi, 0.0% si MINOS26 > time dd if=${DTDIR}/10G.dat of=${DTDIR}/10G.dat2 top - 14:35:42 up 395 days, 3:03, 11 users, load average: 3.52, 1.91, 1.59 Cpu(s): 2.9% us, 3.7% sy, 0.0% ni, 22.0% id, 70.9% wa, 0.5% hi, 0.0% si 20480000+0 records in 20480000+0 records out real 9m8.772s user 0m13.858s sys 2m17.817s MINOS-MYSQL3 > time dd if=${DTDIR}/10G.dat of=${DTDIR}/10G.dat2 top - 14:36:02 up 16 days, 21:26, 2 users, load average: 1.81, 1.57, 0.98 Cpu(s): 0.0% us, 0.5% sy, 0.0% ni, 61.0% id, 38.3% wa, 0.0% hi, 0.1% si real 3m56.376s user 0m3.955s sys 0m56.385s ####### # SRM # ####### Test FermiGrid volatile : SRV1> export X509_USER_PROXY=/local/globus/minfarm/.grid/kreymer-production.proxy SRV1> srmls srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport 0 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/ 512 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/howcroft/ 512 /pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport/kordosky/ ########## # CONDOR # ########## minos24 I/O wait overloads continued through the weekend. Abandoning ship, will move to a new host, minos-grid formerly minos-mysql3 ######## # DATA # ######## Date: Thu, 27 Nov 2008 23:24:55 -0500 From: Howard Rubin There remain 79 srmcp failures listed in file /minos/data/minfarm/lists/mmm.D00. Can you check whether these might be resulting from a NONACCESS (or other problem) tape? The symptoms are ... N.B., yes, these were all on VOC495. MINOS26 > head -1 /minos/data/minfarm/lists/mmm.D00 n13023108_0002_L010185N_D00.0 1 2008-11-27 03:37:14 fcdfcaf1056 RFILES=`cat /minos/data/minfarm/lists/mmm.D00 | cut -f 1 -d .` for FILE in ${RFILES}; do ./dccptest ${FILE}.reroot.root rm -f /local/scratch26/kreymer/${FILE}.reroot.root done ############ # DCCPTEST # ############ Extended, look up path of simple file name in SAM. Third argument is Debug flag Fourth argument is DEST , will supply DCCPTEST subdirectory ============================================================================= 2008 11 27 ============================================================================= T H A N K S G I V I N G ============================================================================= 2008 11 26 ============================================================================= ############ # NOACCESS # ############ This tape has been on the list much of the last week. Status ? VOC495 0.05GB (NOACCESS 1118-2328 full 0629-0105) CD-9940B minos.mcin_near_daikon.cpio_odc Howie needs some of these files for farm processing. 
MINOS26 > enstore info --vol=VOC495 {'blocksize': 131072, 'capacity_bytes': 214748364800L, 'comment': '', 'declared': 1180707936.0, 'eod_cookie': '0000_000000000_0000438', 'external_label': 'VOC495', 'first_access': 1181290454.0, 'last_access': 1227736330.0, 'library': 'CD-9940B', 'media_type': '9940B', 'remaining_bytes': 48729600L, 'si_time': [1227736121.0, 1183097105.0], 'sum_mounts': 54, 'sum_rd_access': 701, 'sum_rd_err': 2, 'sum_wr_access': 437, 'sum_wr_err': 0, 'system_inhibit': ['none', 'full'], 'user_inhibit': ['none', 'none'], 'volume_family': 'minos.mcin_near_daikon.cpio_odc', 'wrapper': 'cpio_odc', 'write_protected': 'y'} Date: Wed, 26 Nov 2008 15:59:37 -0600 (CST) Subject: HelpDesk ticket 125550 ___________________________________________ Ticket #: 125550 ___________________________________________ Short Description: VOC495 status Problem Description: We need to read some files from Minos 9940B volume VOC495. The tape is NOACCESS, I think for several days. What is the status of this tape ? ___________________________________________ This ticket is assigned to HENDRY, JOHN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 26 Nov 2008 22:07:21 +0000 (GMT) From: Arthur Kreymer The NOACCESS list is at http://www-stken.fnal.gov/enstore/tape_inventory/NOACCESS The Minos tapes seem to be VO4209 - not a problem, copied elsewhere long ago. VOB445 - not a problem, no files written to this tape. VOC495 0.05GB (NOACCESS 1118-2328 full 0629-0105) CD-9940B minos.mcin_near_daikon.cpio_odc VOK330 331.09GB (NOTALLOWED 0731-1124 readonly 0716-1115) CD-LTO3 minos.reco_far_cedar_bcnd.cpio_odc Volume needs to be cloned due to repeated errors ( This one has been on the list for weeks ) ___________________________________________ cleared by Timur ___________________________________________ Date: Wed, 26 Nov 2008 16:22:26 -0600 (CST) Solution: jhendry@fnal.gov sent this solution: Hi Art, Glenn cleared this tape a bit earlier c. 15:49 today Nov 26 2008 CST. ___________________________________________ ########### # ROUNDUP # ########### Corrected SOCFILE from insecure /export/stage/minfarm/.grid/samdbs_prd to /local/globus/minfarm/.grid/samdbs_prd cp -a AFSS/roundup.20081126 . ln -sf roundup.20081126 roundup # was roundup.20081118 Wed Nov 26 11:14:41 CST 2008 And clean up, rm roundup.20080* ######## # GRID # ######## M25 overloads seem to come from gfrontent, Igor asks that we upgrade to current code. ######## # DATA # ######## Big backlog of restores for files in mcout_data/cedar_phy/far/daikon_00/L010185N/cand_data Total data is about .5 TB, under 2k files, this should clear up by itself. ============================================================================= 2008 11 25 ============================================================================= ######## # FARM # ######## killed looper on charm, stable results, many duplicate and missing subruns. Reported to rubin. Reenabled charm and helium mcnearcat in corral. 
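For the duplicate-subrun bookkeeping, a quick way to see which outputs are present more than once in mcnearcat. A sketch only, assuming the usual trailing pass-number field in the file names; this is not the samdup logic :

# list base names that occur with more than one pass number
ls /minos/data/minfarm/mcnearcat | grep '\.root$' \
    | sed 's/\.[0-9]*\.root$//' | sort | uniq -d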
######## # DATA # ######## Date: Tue, 25 Nov 2008 15:09:18 -0600 (CST) Subject: HelpDesk ticket 125498 ___________________________________________ Ticket #: 125498 ___________________________________________ Short Description: Increase space available to /minos/scratch Problem Description: LSC/CSI : At your next convenience, please shift 3 TBytes of capacity on minos-nas: from /minos/data to /minos/scratch ___________________________________________ Date: Wed, 26 Nov 2008 08:13:32 -0600 (CST) This ticket has been reassigned to WILLIAMS, CARL of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Wed, 26 Nov 2008 08:40:22 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/WST ___________________________________________ Date: Wed, 26 Nov 2008 09:24:01 -0600 (CST) Solution: Quotas have been adjusted /minos/data .... decreased to 24TB /minos/scratch ... increased to 9TB ___________________________________________ ######## # DATA # ######## Date: Tue, 25 Nov 2008 15:07:12 -0600 (CST) Subject: HelpDesk ticket 125497 ___________________________________________ Ticket #: 125497 ___________________________________________ Short Description: Quota request for rahaman on BlueArc served /minos/scratch Problem Description: LSC/CSI : Please increase the individual storage quota to 1000 GBytes for user rahaman on the BlueArc served /minos/scratch volume. Please try to do this before the Thanksgiving weekend. ___________________________________________ Date: Tue, 25 Nov 2008 15:11:24 -0600 (CST) This ticket has been reassigned to WILLIAMS, CARL of the CD-LSCS/CSI/CS/EST ___________________________________________ Date: Tue, 25 Nov 2008 15:21:53 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/WST Group. ___________________________________________ Date: Tue, 25 Nov 2008 16:26:49 -0600 (CST) Solution: Quota for rahaman on /minos/scratch is now 1000GB ___________________________________________ ########## # ADMIN # ########## mgoodman,zwaska,plunk,cjames Need to roll back the collab web page: In particular, I am looking for a version of http://www-numi.fnal.gov/collab/index.html from before September of this year. At www.archive.com, typed in the web address 'Take Me Back' Got pleny of them, through Aug 2007. http://web.archive.org/web/*/http://www-numi.fnal.gov/collab/index.html http://web.archive.org/web/20070813035359/http://www-numi.fnal.gov/collab/index.html Put a copy of Aug 13 link in to ~kreymer/minos/collabindex.html Maury cannot access these links : The Argonne IT Administrator Review Group (IT-ARG) has chosen to block this url based on its category affilication. The site you requested is blocked under the following categories: *Anonymizing Utilities* Anonymizing Utilities RESOLVED : To : mgoodman@fnal.gov, zwaska@fnal.gov, plunk@fnal.gov, cjames@fnal.gov Cc : Attchmnt: Subject : Re: [Fwd: Re: Fwd: Re: IB web page] ----- Message Text ----- ANL blocked Maury's access to www.archive.org. He could not log into Fermilab to get the copy I had made. I took the liberty of cleaning this up myself. I found our own backup copy, html/collab/index07.html. I renamed the various older copies, and copied index.20070910.html to index.html, first saving the stray copy of the ib index, as follows : index.20050115.html - was index_old.html index.20070813.html - from www.archive.org index.20070910.html - was index07.html index.ib.20080918.html - problematic copy of ib/ib.index The Collaboration page is working again. Enjoy ! 
( If this is too stale, and needs rolling forward to Sept 08, we could issue a helpdesk request for file restoration. ) ########## # DCACHE # ########## Date: Tue, 25 Nov 2008 14:40:07 -0600 (CST) Subject: HelpDesk ticket 125495 ___________________________________________ Ticket #: 125495 ___________________________________________ Short Description: Level 2 metadata seems very out of date Problem Description: Some Minos data management scripts use the PNFS Level 2 metadata to get an estimate of whether a file is on disk in DCache. For example, ( cd /pnfs/minos/neardet_data/2004-11 ; \ cat ".(use)(2)(N00004502_0000.mdaq.root)" ) 2,0,0,0.0,0.0 :l=15761813; w-stkendca7a-1 r-stkendca4a-1 It is understood that this information is not precise. But it has always been very close to reality. For at least the last several months, the pool information seems to be very out of date. Recently written raw data files do no have pool information, even files which have been in the pools for over a month. The following file was written on Oct 1, and is in pool stkendca11a-3 ( cd /pnfs/minos/neardet_data/2008-10 ; \ cat ".(use)(2)(N00014898_0024.mdaq.root)" ) 2,0,0,0.0,0.0 :c=1:d22ec926;h=yes;l=100979; Please investigate, and correct this condition. ___________________________________________ This ticket is assigned to HENDRY, JOHN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Tue, 25 Nov 2008 15:24:00 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: This issue had been recorded as dcache bugzilla Bug 164. We will update this ticket upon action from the dcache developers. ___________________________________________ Date: Tue, 25 Nov 2008 17:14:18 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: An update has been received from the dcache developers: --- Comment #1 from Alex Kulyavtsev 2008-11-25 16:36:13 --- (In reply to comment #0) > > This is from helpdesk remedy ticket 125495 submitted by ARTHUR KREYMER. > > > > Some Minos data management scripts use the PNFS Level 2 metadata > > to get an estimate of whether a file is on disk in DCache. During last dcache upgrade on June 24 the pool code was switched from version 2 to version 3, which is default and has been used by CMS for a while. This version of code does not keep pool location information for the file replica in pnfs layer(2). Instead it stores cacheinfo in so called "pnfs companion" DB. Files cached earlier may keep cacheinfo in layer(2). Alex. ___________________________________________ Date: Wed, 26 Nov 2008 17:49:07 +0000 (GMT) From: Arthur Kreymer Thanks, the switch to the 'companion' explains the lack of layer(2) data. How to I obtain PNFS companion data ? I occasionally need a fair estimate of which pools a file is in. The layer(2) data worked very nicely. The daily pool directory listings are a bit too stale, and do not include file paths. ___________________________________________ Date: Wed, 26 Nov 2008 12:04:18 -0600 From: Timur Perelmutov I would not recommend to grant you access to the companion database, this is critical internal dCache service. What do you use information for? ___________________________________________ Date: Wed, 26 Nov 2008 20:52:25 +0000 (GMT) From: Arthur Kreymer It is reasonable that users should not have direct access. 
Two recent examples where I would have used the old level(2) information : 1) A Minos user was scanning through 24,000 files in a particular family, generating uncontrolled large tape restore backlogs, and many tape mounts. These files were on only 4 tapes. I have a script which pre-stages such files ( using dccp -P ) tape by tape, and in tape order. The script issues each dccp then waits 5 seconds, to keep slightly ahead of actual file delivery from tape. It backs off when the Pool Request Queues page shows a backlog. For this to work efficiently, I need to avoid the dccp/delay for files which are already on disk. 2) Last Friday afternoon, it seemed that hundreds of files were being restored to a single pool. I could have been mistaken about this. I wanted to see which pools these files were in after staging. If we have no present means of getting this data, we might try a readonly database replica ( perhaps using Slony-I ) with direct access for experts, and a simple web interface for people like me. A readonly replica might let us deploy more agressive monitoring tools. ___________________________________________ Date: Mon, 01 Dec 2008 13:57:40 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: The originator, made this reply outside of the helpdesk remedy system: ( see above ) ___________________________________________ Date: Mon, 01 Dec 2008 14:04:05 -0600 (CST) Note To Requester: jhendry@fnal.gov sent this Notes To Requester: Hi Art, I have appended your comments to enstore/dcache bugzilla 164. ___________________________________________ Date: Tue, 30 Dec 2008 09:34:38 -0600 (CST) > --- Comment #3 from Timur Perelmutov 2008-12-29 13:24:26 --- We do not have resources to implement the replication of the companion database into a read-only database replica. We suggest that you use either dc_check or srmls to find out if file has a copy on disk.This will allow MINOS to estimate if the file is on disk without access to internal databases of dCache. ___________________________________________ Date: Wed, 07 Jan 2009 11:31:50 -0600 From: John Hendry May I please have your agreement to close this ticket? ___________________________________________ Date: Wed, 07 Jan 2009 22:02:03 +0000 (GMT) From: Arthur Kreymer This ticket can be closed as an operational issue, We should do something to improve the situation longer term. The suggested workarounds are to use dc_check, or srmls to check for the existence of a file on DCache disk. >>>> dc_check <<<< dc_check runs dccp -P -t -1 This runs at a rate of about 4.5 files per second. The same test using PNFS Layer 2 runs at 23 files per second. This slowdown is tolerable for the occasional global pre-stage. But dc_check does not give any estimate of which pool holds the file, something we had with the PNFS Layer 2 data. >>>> srmls <<<< srmls -l seems to give file location information via locality:ONLINE_AND_NEARLINE The per-query overhead of srmls is about 8 seconds, making it too slow for testing tens of thousands of files. srmls -l is slower than dc_check even for large directories, about 3 seconds per file. But srmls has a fatal flaw. It quietly lists only the first 999 files in a given directory. This can be worked around using the count and offset options, but this requires private knowledge of the directory content, and special logic in each application for making multiple queries. Not worth it, when the rate is so slow. I do not trust a tool which has such a deep flaw. 
The bottom line is that we can work around the problem by using dccp -P -t -1 , but this is very inefficient. I would like to know where dccp gets its information, to avoid the cost of activating the dccp image. Formerly this information came from the PNFS Layer 2 data. ___________________________________________ ####### # AFS # ####### loiacono has no access to $MINOS_DATA/d190 /afs/fnal.gov/files/data/minos/d190 MINOS26 > fs listacl d190 Access list for d190 is Normal rights: minos rlidwka system:administrators rlidwka buckley rlidwka pts membership minos | grep loiacono nuthin pts adduser -user loiacono -group minos pts membership minos | grep loiacono loiacono She still has no access. fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d190 \ -acl system:authuser rl That worked, can read and listacl. ######### # STAGE # ######### The form of the http://fndca.fnal.gov:2288/queueInfo has changed, such that we no longer get valid queue feedback. The data was once in a single line, headed by Total. Now it is split across many lines in the HTML source. ######### # STAGE # ######### We need the full set of near cedar_phy_bhcurv/.bcnd_data files 2005-03 through 2008-07 Check file counts MINOS26 > find /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data -type f | wc -l 24182 MINOS26 > find /pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data -type f | wc -l 17599 Check file sizes MINOS26 > ( cd /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2007-10 ; du -sm * ; ) 21 F00039719_0003.spill.bcnd.cedar_phy_bhcurv.0.root 23 F00039719_0004.spill.bcnd.cedar_phy_bhcurv.0.root 22 F00039719_0005.spill.bcnd.cedar_phy_bhcurv.0.root 22 F00039719_0006.spill.bcnd.cedar_phy_bhcurv.0.root 24 F00039719_0007.spill.bcnd.cedar_phy_bhcurv.0.root Net size would be somewhat over 24182*22 = 532 GBytes. Check file families, so we can do this by volume ( cd /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data ; enstore pnfs --tags ; ) .(tag)(library) = CD-LTO3 .(tag)(file_family) = reco_far_cedar_phy_bhcurv_bcnd ./volumes vols FVOLS=`./volumes reco_far_cedar_phy_bhcurv_bcnd` echo $FVOLS VOC190 VOC193 VOH334 VOK485 Test one volume enstore info --list=VOC190 ./stage -d -s cedar_phy_bhcurv/.bcnd_data VOC190 Staging files from tape VOC190 OK JUST TESTING Staging VOC190 Version 20071012 STARTING Tue Nov 25 13:30:37 CST 2008 Prestaging 3389 files Selecting cedar_phy_bhcurv/.bcnd_data .NEED /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2007-01/F00037210_0008.spill.bcnd.cedar_phy_bhcurv.0.root dccp -P dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2007-01/F00037210_0008.spill.bcnd.cedar_phy_bhcurv.0.root Run the full stage { for VOL in ${FVOLS} ; do ./stage -w -s cedar_phy_bhcurv/.bcnd_data ${VOL} done ; } > /minos/scratch/kreymer/log/stage/cpbcnd.log 2>&1 & Staging VOC190 Version 20071012 STARTING Tue Nov 25 13:35:28 CST 2008 ... I do not see much backlog building. Restarted with a 1 second pause, not the default 5 Change stage_limit to 2000, from 200. These are small files. 
kill %2 { for VOL in ${FVOLS} ; do ./stage -w -p 1 -s cedar_phy_bhcurv/.bcnd_data ${VOL} done ; } >> /minos/scratch/kreymer/log/stage/cpbcnd.log 2>&1 & Staging VOC190 Version 20071012 STARTING Tue Nov 25 14:04:56 CST 2008 Prestaging 3389 files 1/3389 Tue Nov 25 14:03:47 CST 2008 queue=0/2000 Staging VOC193 Version 20071012 STARTING Tue Nov 25 15:28:44 CST 2008 Prestaging 571 files Staging VOH334 Version 20071012 STARTING Tue Nov 25 15:43:05 CST 2008 Prestaging 5520 files Killed at 16:44 CST, queue is up to 1974, and the stage script will not back off due to web page changes. less /local/scratch26/kreymer/log/stage/VOH334.20081125.log 2511/5520 Tue Nov 25 16:43:29 CST 2008 queue=0/2000 VOH334 is mounted, transferring data. Check out the last file from VOC193 ./dccptest /reco_far/cedar_phy_bhcurv/.bcnd_data/2006-03/F00034263_0013.spill.bcnd.cedar_phy_bhcurv.0.root Cache open succeeded in 1.09s. 14207324 bytes in 0 seconds Corrected stage to handle new web page format, for proper metering { for VOL in VOH334 VOK485 ; do ./stage.20081125 -w -p 1 -s cedar_phy_bhcurv/.bcnd_data ${VOL} done ; } >> /minos/scratch/kreymer/log/stage/cpbcnd.log 2>&1 & FINISHED Tue Nov 25 23:00:20 CST 2008 But only a net of about 9K files were restored. Date: Wed, 26 Nov 2008 16:23:56 +0000 (GMT) From: Arthur Kreymer To: scavan@fnal.gov, msanchez@fnal.gov Cc: Patricia Vahle , minos-data@fnal.gov Subject: Re: cedar_phy_bhcurv .bntp file staging The prestaging of these files finished late last night. It should be OK to run full blast on the files under /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data ########## # PARROT # ########## Added GROWFS section to HOWTO.parrot, for rebuilding growfsdir Changed archives to GROWFS subdirectory, cleaned up d120 Made new directory for d119 make_growfs: 1241589 files, 5675 links, 177026 dirs, 0 checksums computed ########## # PARROT # ########## Tracking down SAM problems under parrot. On CDF and GP farm nodes. ( fnpc338, fcdfcaf1502 With and without squid. With or without fresh PTD cache. export PRO=/grid/fermiapp/minos/parrot REL=current ; ARC='x86_64-linux-2.6' ; DAT='' export VER=cctools-${REL}${DAT}-${ARC} export PARROT_DIR=${PRO}/${VER} export PATH=${PARROT_DIR}/bin:${PATH} PTD=/local/stage1/${LOGNAME}/parrot rm -r ${PTD} parrot -m ${PARROT_DIR}/mountfile.grow -H -t ${PTD} /bin/bash PS1='P> ' export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup ls -d /afs/fnal.gov/files/code/e875/general/minossoft unset SETUP_UPS SETUPS_DIR . /afs/fnal.gov/files/code/e875/general/ups/etc/setups.sh setup sam No default SAM configuration exists at this time. Works OK on fnpc340, which has AFS. Try rebuilding the products index I can now setup sam, but get segmentation fault. Still fail to be able to run sam. 
/tmp/minossoft_30632.setup_script: line 5: 30939 Segmentation fault /afs/fnal.gov/files/code/e875/general/minossoft/setup/datagram/datagram_client.py "[sh] kreymer minos_offline R1.24.2 -q GCC_3_4 # kernel 2.6.9-78.0.1.ELsmp OS 4.5 " ######## # FARM # ######## DIR=313 ./stage ${MCIND}/${DIR} http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/lazy to sorted r-stkendca14a-5 r-stkendca14a-5 + r-stkendca14a-5 r-stkendca14a-5 r-stkendca14a-5 r-stkendca14a-5 r-stkendca14a-6 r-stkendca14a-5 r-stkendca16a-6 r-stkendca14a-5 r-stkendca14a-5 r-stkendca14a-5 r-stkendca14a-6 r-stkendca14a-5 r-stkendca14a-6 r-stkendca14a-6 + r-stkendca16a-6 r-stkendca14a-6 r-stkendca14a-5 r-stkendca14a-6 r-stkendca14a-6 r-stkendca14a-6 r-stkendca14a-5 r-stkendca14a-6 r-stkendca14a-5 r-stkendca16a-6 + r-stkendca14a-6 r-stkendca16a-6 This is a reasonable balance I suppose, 3 pools involved, 2 hosts Will continue with rest of the stages : MCIND=mcin_data/near/daikon_00/L010185N DIRS=`cat /minos/data/minfarm/lists/mclist.cedar_phy_linfix | cut -c 6-8 | sort -u ` echo $DIRS 304 305 306 307 310 311 312 313 Removed those already done : DIRS='304 305 306 307 310 311' FINISHED Tue Nov 25 10:16:40 CST 2008 ============================================================================= 2008 11 24 ============================================================================= ######## # FARM # ######## Draft to rubin : On Fri, 21 Nov 2008, Howard Rubin wrote: > You'll find the list of still-to-be-run jobs in > /minos/data/minfarm/lists/mclist.cedar_phy_bhcurv. This file is empty. I think you meant /minos/data/minfarm/lists/mclist.cedar_phy_linfix MINOS27 > wc -l /minos/data/minfarm/lists/mclist.cedar_phy_linfix 980 /minos/data/minfarm/lists/mclist.cedar_phy_linfix I've gotten a list of directories based on this list : DIRS=`cat /minos/data/minfarm/lists/mclist.cedar_phy_linfix | cut -c 6-8 | sort -u ` echo $DIRS 223 224 225 303 304 305 306 307 310 311 312 313 I've checked that the file counts are about right MCIND=mcin_data/near/daikon_00/L010185N for DIR in ${DIRS}; do ls /pnfs/minos/${MCIND}/${DIR} | wc -l ; done 109 86 10 99 109 109 107 64 96 109 108 11 I've not started the file restores : 08:54 cd ~kreymer/minos/scripts { for DIR in ${DIRS}; do ./stage -w ${MCIND}/${DIR} done ; } > /minos/scratch/kreymer/log/stage/linfix.log & This could take a while Of the 13 LTO_3 drives 1 writing mcin nccohbkg 6 in mount or dismount wait 6 writing exp-db, 3 dismount/mount 1 seek 2 active ########## # DCACHE # ########## Tests that all raw data is on tape per below for FILE in F081119_000006.mdcs.root B081120_000001.mbeam.root ; do ./dc_stat ${FILE} ; done for FILE in ${NFILES} ; do ./dc_stat ${FILE} ; done for FILE in ${FFILES} ; do ./dc_stat ${FILE} ; done ############### # MINOS-SAM04 # ############### Requested sam and samread accounts, copy of .k5login from minos-sam01. Date: Mon, 24 Nov 2008 17:57:52 -0600 (CST) Subject: HelpDesk ticket 125435 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Ticket #: 125435 ___________________________________________ Short Description: minos-sam04 needs /home/sam and samread Problem Description: runs-sys : Please create /home/sam and /home/samread login areas on minos-sam04, and copy the .k5login files from minos-sam01. ___________________________________________ Date: Tue, 25 Nov 2008 08:19:59 -0600 (CST) This ticket has been reassigned to COOPER, GLENN of the CD-SF/FEF Group. 
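An aside on the restore-pool balance check in the FARM section above : rather than eyeballing
the pool list from the lazy restore handler page, the distribution can be tallied mechanically.
A sketch, assuming the pool names can be pulled from that page with a simple pattern match
( the page layout itself is not verified here ) :

URL=http://fndca3a.fnal.gov:2288/poolInfo/restoreHandler/lazy
# pending restores per pool, busiest first
curl -s ${URL} | grep -o 'r-stkendca[0-9]*a-[0-9]*' | sort | uniq -c | sort -rn
# and per host, to see how many servers are involved
curl -s ${URL} | grep -o 'r-stkendca[0-9]*a' | sort | uniq -c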
___________________________________________ Date: Tue, 25 Nov 2008 10:51:42 -0600 (CST) Solution: gcooper@fnal.gov sent this solution: Added home areas and .k5login files. ___________________________________________ ___________________________________________ ########### # ENSTORE # ########### library is CD-LTO4F1, per ( cd /pnfs/minos/mcout_data/cedar_phy_linfix/near/daikon_00; enstore pnfs --tags ) I do not recall setting this up. ########## # DCACHE # ########## Date: Mon, 24 Nov 2008 23:23:23 +0000 (GMT) From: Arthur Kreymer To: Gene Oleynik Cc: minos-data@fnal.gov Subject: Re: dCache upgrade/expansion On Fri, 21 Nov 2008, Gene Oleynik wrote: > The new hardware is in place. We still have to install OS etc, and plan > the migration from new hardware. Seems to me you will get most benefit > from bringing up the minos expansion first. > > How do you want these new pools configured? What file families, > read/write, etc. We would like to deploy the additional 12 TB of disk as follows : Expand RawDataWritePools from 6 TB to 8 TB Selection unchanged. Expand MinosPrdReadPools from 13 TB to 23 TB Selection has been a long list of file families like minos.mcout_cedar_phy_bhcurv_far_daikon_04_sntp If wild cards worked the way we might wish, this would be minos.*sntp minos.*mrnt We could discuss shifting the explicit selection to the general readPools, with a somewhat shorter list like minos.*bcnd minos.*cand minos.mcin* The family selections can be updated after deployment, if this is desired. ########## # CONDOR # ########## Removed 'fnpc374.fnal.gov' from entry_gpgeneral/nodes.blacklist as the file system mounts have been restored ############# # MDSUM_LOG # ############# Corrected to use fine for a subdirectory list, due to files at top level of minfarm. MIN > ln -sf mdsum_log.20081124 mdsum_log # was mdsum_log.20081118 ######### # ADMIN # ######### Date: Mon, 24 Nov 2008 11:55:34 -0600 (CST) Subject: HelpDesk ticket 125388 ___________________________________________ Short Description: Minos Cluster has stale NFS mounts of /grid/data Problem Description: run2-sys : During the maintenance period last Thursday 20 Nov, the /grid/data files were moved to a new server There are stale NFS mounts of /grid/data on most Minos Cluster systems, and minos-sam01 minos-sam02 minos-sam03 minos-mysql2 minos-mysql3 We do not use /grid/data heavily on the Cluster or servers, but it would be nice to have the mounts cleaned up at your next convenience. ___________________________________________ Date: Mon, 24 Nov 2008 12:37:01 -0600 (CST) This ticket has been reassigned to COOPER, GLENN of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 24 Nov 2008 14:43:21 -0600 (CST) Solution: gcooper@fnal.gov sent this solution: /grid/data remounted on minos[01-24], minos-sam[01-03], minos-mysql[2-3]. This ticket was resolved by COOPER, GLENN of the CD-SF/FEF group. 
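An aside on the "is it on tape yet" checks in this entry : dc_stat is the tool used here,
but the same question can be asked of the PNFS layer 4 metadata directly. A sketch; the
layer-4 convention ( volume label on the first line ) is standard Enstore usage, not
something taken from dc_stat itself, so treat this as illustrative.

FILE=/pnfs/minos/fardcs_data/2008-11/F081119_000006.mdcs.root   # example from the dc_stat loop above
DIR=`dirname ${FILE}` ; NAME=`basename ${FILE}`
VOL=`head -1 "${DIR}/.(use)(4)(${NAME})" 2>/dev/null`
if [ -n "${VOL}" ] ; then
    echo "${NAME} is on tape ${VOL}"
else
    echo "${NAME} has no tape copy yet"
fi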
___________________________________________ ___________________________________________ ============================================================================= 2008 11 22 Sat ============================================================================= ########## # DCACHE # ########## Looking at precious space in RawDataWritePools Pool MB precious w-stkendca10a-3 2494 w-stkendca11a-3 165 w-stkendca12a-3 0 w-stkendca8a-1 0 w-stkendca8a-2 0 w-stkendca9a-3 468 ############ # POOLSTAT # ############ hacked poolstat.20081124 to give full pool listing : ./poolstat full ln -sf poolstat.20081124 poolstat # was poolstat.20080707 ########## # DCACHE # ########## Date: Sat, 22 Nov 2008 14:04:37 -0600 (CST) Subject: HelpDesk ticket 125350 ___________________________________________ Short Description: FNDCA has not written tape since Thursday maintenance ? Problem Description: dcache-admin - It appears that no Minos raw data files have been written to tape from the RawDataWritePools group since Thursday 19 November. Looking at http://fndca3a.fnal.gov:2288/usageInfo, I see many precious files in write pools across the whole DCache system, over half of the capacity of some pools. Most of the writes I see active in Enstore are for one file family. minos.mcout_cedar_phy_linfix_near_daikon_00_cand.cpio_odc There are also problems with file restores from tape. There are hundreds of file restores pending for the readPools group, but all are directed to one pool, r-stkendca15a I will ask the Minos team to shut down production processing, to help take some of the load off the system until these problems are resolved. ___________________________________________ Date: Sat, 22 Nov 2008 15:17:33 -0600 From: Howard Rubin This is almost certainly related to the read problems I've been having, which I first reported several weeks ago, and which have, in general, been either ignored or 'turned over to the developers.' At the present time there are 542 linfix jobs in the system, either running or idle, so when they finish the system will be empty except for (data) keep-up. So far today (2008-11-22) 133 jobs have failed on srm input while 1618 jobs have finished successfully. This is roughly consistent with the error rate I've quoted in recent tickets. There have been no output failures due to the lost mounts. ___________________________________________ Date: Mon, 24 Nov 2008 12:12:08 -0600 From: Timur Perelmutov The problems were due to the fact that pnfs was not mounted on pools after restart. This was fixed by Vladimir over the weekend. ___________________________________________ Date: Mon, 24 Nov 2008 19:01:07 +0000 (GMT) We still have raw data files not archived to tape since Wednesday 19 Nov. For example, /pnfs/minos/fardet_data/2008-11/F00042222_0005.mdaq.root The nightly pool listing shows this file in w-stkendca12a-3. http://fndca3a.fnal.gov:2288/usageInfo shows 0 MB of precious files in this pool w-stkendca12a-3. Why is this file not being written ? Here is a summary of precious space presently reported at usageInfo . As noted above, these space reports are not consistent with the list of files not on tape. Pool MB precious w-stkendca10a-3 2494 w-stkendca11a-3 165 w-stkendca12a-3 0 w-stkendca8a-1 0 w-stkendca8a-2 0 w-stkendca9a-3 468 ___________________________________________ Date: Mon, 24 Nov 2008 13:14:43 -0600 From: George Szmuksta From your perspective are the dcache problems fixed? 
___________________________________________ Date: Mon, 24 Nov 2008 13:21:39 -0600 From: Margaret Votava i think minos is still having a problem? Is it fixed from their perspective? ___________________________________________ Date: Mon, 24 Nov 2008 13:22:09 -0600 From: Margaret Votava I don't think so. Art? Gene Oleynik wrote: Minos has production back up now, correct? ___________________________________________ Date: Mon, 24 Nov 2008 13:24:35 -0600 From: Gene Oleynik To: Margaret Votava I asked George Szmuksta to follow up on the ticket. Art, if there are still issues let me know asap. ___________________________________________ Date: Mon, 24 Nov 2008 13:37:50 -0600 From: Timur Perelmutov Many of the precious files remain on disk, because they are deleted from pnfs. In order to prevent potential data loss, the files in these cases are not deleted automatically. They can not be written to tape either, as enstore will detect that they are deleted from pnfs. We perform a periodic manual clean up of these files. The behavior will be different in dCache 1.9. So certain accumulation of the precious space on the write pools is normal and should not be considered a system malfunction. ___________________________________________ georges,minos_batch,dcache-admin,minos-data,timur,votava,oleynik The following files, dating from 19 through 21 November are not on tape . I have listed the pools that they seem to be in, per today's pool listings. Other more recent files are on tape. /pnfs/minos/fardet_data/2008-11 pools F00042222_0002.mdaq.root 11a-3 F00042222_0003.mdaq.root 10a-3 12a-3 F00042222_0004.mdaq.root 10a-3 F00042222_0005.mdaq.root 10a-3 12a-3 F00042222_0006.mdaq.root 10a-3 F00042222_0007.mdaq.root 10a-3 F00042222_0008.mdaq.root 10a-3 11a-3 F00042222_0009.mdaq.root 10a-3 F00042222_0010.mdaq.root 10a-3 F00042222_0011.mdaq.root 10a-3 F00042222_0012.mdaq.root 10a-3 12a-3 F00042222_0013.mdaq.root 10a-3 F00042222_0014.mdaq.root 10a-3 F00042222_0015.mdaq.root 10a-3 F00042222_0016.mdaq.root 10a-3 F00042222_0017.mdaq.root 10a-3 F00042222_0018.mdaq.root 10a-3 F00042222_0019.mdaq.root 10a-3 F00042222_0020.mdaq.root 10a-3 F00042222_0021.mdaq.root 10a-3 11a-3 /pnfs/minos/fardcs_data/2008-11 F081119_000006.mdcs.root 10a-3 /pnfs/minos/neardet_data/2008-11 N00015199_0014.mdaq.root 11a-3 N00015199_0015.mdaq.root 10a-3 N00015199_0016.mdaq.root 10a-3 11a-3 N00015199_0017.mdaq.root 10a-3 N00015199_0018.mdaq.root 10a-3 N00015199_0019.mdaq.root 10a-3 N00015199_0020.mdaq.root 10a-3 11a-3 N00015199_0021.mdaq.root 10a-3 N00015199_0022.mdaq.root 10a-3 N00015199_0023.mdaq.root 10a-3 N00015199_0024.mdaq.root 10a-3 N00015200_0000.mdaq.root 10a-3 N00015201_0000.mdaq.root 10a-3 N00015202_0000.mdaq.root 10a-3 N00015202_0001.mdaq.root 10a-3 N00015202_0002.mdaq.root 10a-3 N00015202_0003.mdaq.root 10a-3 N00015202_0004.mdaq.root 10a-3 N00015202_0005.mdaq.root 10a-3 N00015202_0006.mdaq.root 10a-3 N00015202_0007.mdaq.root 10a-3 N00015202_0008.mdaq.root 10a-3 /pnfs/minos/beam_data/2008-11 B081120_000001.mbeam.root 10a-3 ___________________________________________ Date: Mon, 24 Nov 2008 15:03:45 -0600 From: Alex Kulyavtsev I was looking on one file you referred before and I was going to ask do you know other files like that. You do. Thanks for the info - I'll let you know as we learn more. ___________________________________________ Date: Mon, 24 Nov 2008 17:31:54 -0600 From: Alex Kulyavtsev the issue was due to pnfs not mounted during pool restart. dcache did not find file in pnfs and decided to deactivate requests. 
I restarted pools 10-3, 11-3 and 12-3 and requests were flushed to tape. Could you please confirm files were written to tape ? ___________________________________________ Date: Tue, 25 Nov 2008 00:10:38 +0000 (GMT) From: Arthur Kreymer I have re-scanned the full file list. They all appear to be on tape. Thanks ! ___________________________________________ Date: Mon, 01 Dec 2008 19:39:13 +0000 (GMT) From: Arthur Kreymer RawDataWritePools files dated before Nov 30, but not yet on tape, Pools are determined from the nightly Pool Directory Listings at http://fndca3a.fnal.gov/dcache/files/ F00042247_0005.mdaq.root w-stkendca10a-3 F00042247_0006.mdaq.root w-stkendca10a-3 F00042247_0007.mdaq.root w-stkendca10a-3 F00042247_0008.mdaq.root w-stkendca10a-3 F00042247_0009.mdaq.root w-stkendca10a-3 F00042247_0010.mdaq.root w-stkendca10a-3 F00042247_0011.mdaq.root w-stkendca10a-3 F00042247_0012.mdaq.root w-stkendca10a-3 F00042247_0013.mdaq.root w-stkendca10a-3 F00042247_0014.mdaq.root w-stkendca10a-3 F00042247_0015.mdaq.root w-stkendca10a-3 F00042247_0016.mdaq.root w-stkendca10a-3 F00042247_0017.mdaq.root w-stkendca10a-3 F00042247_0018.mdaq.root w-stkendca10a-3 F00042247_0019.mdaq.root w-stkendca10a-3 F00042247_0020.mdaq.root w-stkendca10a-3 F00042247_0021.mdaq.root w-stkendca10a-3 F00042247_0022.mdaq.root w-stkendca10a-3 F00042248_0000.mdaq.root w-stkendca10a-3 F00042250_0000.mdaq.root w-stkendca10a-3 F00042250_0001.mdaq.root w-stkendca10a-3 F00042250_0002.mdaq.root w-stkendca10a-3 F00042250_0003.mdaq.root w-stkendca10a-3 F00042250_0004.mdaq.root w-stkendca10a-3 F00042250_0005.mdaq.root w-stkendca10a-3 F00042250_0006.mdaq.root w-stkendca10a-3 F00042250_0007.mdaq.root w-stkendca10a-3 F00042250_0008.mdaq.root w-stkendca10a-3 F00042250_0009.mdaq.root w-stkendca10a-3 F00042250_0010.mdaq.root w-stkendca10a-3 F00042250_0011.mdaq.root w-stkendca10a-3 F00042250_0015.mdaq.root w-stkendca10a-3 F00042250_0016.mdaq.root w-stkendca10a-3 F00042250_0017.mdaq.root w-stkendca10a-3 F00042250_0018.mdaq.root w-stkendca10a-3 F00042250_0019.mdaq.root w-stkendca10a-3 F00042250_0020.mdaq.root w-stkendca10a-3 F00042252_0000.mdaq.root w-stkendca10a-3 F00042253_0000.mdaq.root w-stkendca10a-3 F00042253_0001.mdaq.root w-stkendca10a-3 F00042253_0002.mdaq.root w-stkendca10a-3 F00042253_0003.mdaq.root w-stkendca10a-3 F00042253_0004.mdaq.root w-stkendca10a-3 F00042253_0005.mdaq.root w-stkendca10a-3 F00042253_0006.mdaq.root w-stkendca10a-3 F00042253_0007.mdaq.root w-stkendca10a-3 F00042253_0008.mdaq.root w-stkendca10a-3 F00042253_0009.mdaq.root w-stkendca10a-3 F00042253_0012.mdaq.root w-stkendca10a-3 F00042253_0013.mdaq.root w-stkendca10a-3 F00042253_0014.mdaq.root w-stkendca10a-3 F00042253_0015.mdaq.root w-stkendca10a-3 F00042253_0016.mdaq.root w-stkendca10a-3 F00042253_0017.mdaq.root w-stkendca10a-3 F00042253_0018.mdaq.root w-stkendca10a-3 F00042253_0019.mdaq.root w-stkendca10a-3 F00042253_0020.mdaq.root w-stkendca10a-3 F00042253_0021.mdaq.root w-stkendca10a-3 F00042253_0022.mdaq.root w-stkendca10a-3 F00042253_0023.mdaq.root w-stkendca10a-3 F00042255_0000.mdaq.root w-stkendca10a-3 F00042256_0000.mdaq.root w-stkendca10a-3 F00042256_0001.mdaq.root w-stkendca10a-3 F00042256_0002.mdaq.root w-stkendca10a-3 F00042256_0003.mdaq.root w-stkendca10a-3 F00042256_0006.mdaq.root w-stkendca10a-3 F00042256_0007.mdaq.root w-stkendca10a-3 F00042256_0008.mdaq.root w-stkendca10a-3 F00042256_0009.mdaq.root w-stkendca10a-3 F00042256_0010.mdaq.root w-stkendca10a-3 F00042259_0011.mdaq.root w-stkendca10a-3 F081126_000010.mdcs.root w-stkendca10a-3 
F081127_000010.mdcs.root w-stkendca10a-3 F081128_000002.mdcs.root w-stkendca10a-3 N081126_000002.mdcs.root w-stkendca10a-3 N081127_000002.mdcs.root w-stkendca10a-3 N081128_000003.mdcs.root w-stkendca10a-3 F00042247_0023.mdaq.root w-stkendca11a-3 F00042250_0012.mdaq.root w-stkendca11a-3 F00042250_0014.mdaq.root w-stkendca11a-3 F00042253_0009.mdaq.root w-stkendca11a-3 N00015235_0008.mdaq.root w-stkendca11a-3 N00015237_0000.mdaq.root w-stkendca11a-3 N00015238_0003.mdaq.root w-stkendca11a-3 F00042249_0000.mdaq.root w-stkendca12a-3 F00042250_0013.mdaq.root w-stkendca12a-3 F00042251_0000.mdaq.root w-stkendca12a-3 F00042253_0010.mdaq.root w-stkendca12a-3 F00042253_0011.mdaq.root w-stkendca12a-3 F00042253_0015.mdaq.root w-stkendca12a-3 F00042254_0000.mdaq.root w-stkendca12a-3 F00042256_0004.mdaq.root w-stkendca12a-3 F00042256_0005.mdaq.root w-stkendca12a-3 N00015231_0000.mdaq.root w-stkendca12a-3 N00015234_0000.mdaq.root w-stkendca12a-3 N00015235_0009.mdaq.root w-stkendca12a-3 N00015235_0024.mdaq.root w-stkendca12a-3 N00015238_0002.mdaq.root w-stkendca12a-3 F00042245_0000.mdaq.root w-stkendca8a-1 F081125_000010.mdcs.root w-stkendca8a-1 N00015225_0000.mdaq.root w-stkendca8a-1 N00015228_0000.mdaq.root w-stkendca8a-1 F00042247_0003.mdaq.root w-stkendca8a-2 B081124_160001.mbeam.root w-stkendca9a-3 B081125_000001.mbeam.root w-stkendca9a-3 B081125_080001.mbeam.root w-stkendca9a-3 F081124_000012.mdcs.root w-stkendca9a-3 N081124_000002.mdcs.root w-stkendca9a-3 N081125_000002.mdcs.root w-stkendca9a-3 ___________________________________________ Date: Tue, 02 Dec 2008 22:37:43 +0000 (GMT) From: Arthur Kreymer To: dcache-admin@fnal.gov Cc: minos-data@fnal.gov Subject: Re: HelpDesk ticket 125350 (fwd) Is there any progress on this ? There are precious files in RawDataWritePools as old as 24 November For example, Nov 24 18:00 /pnfs/minos/beam_data/2008-11/B081124_160001.mbeam.root _____________________________________________ Date: Wed, 03 Dec 2008 17:25:28 +0000 (GMT) From: Arthur Kreymer To: helpdesk-forwarder@fnal.gov, helpdesk@fnal.gov, dcache-admin@fnal.gov Cc: minos-data@fnal.gov, timur@fnal.gov, oleynik@fnal.gov, georges@fnal.gov Subject: Re: HelpDesk ticket 125350 (fwd) <-- # @@@ Enter Update below this line. @@@ # --> I have had no feedback on this ticket since November 24. Some of our raw data files have been pending for over a week now ! There are precious files in RawDataWritePools as old as 24 November For example, Nov 24 18:00 /pnfs/minos/beam_data/2008-11/B081124_160001.mbeam.root I have updated the ticket, sent mail to dcache-admin, and raised the issue in CD ops and Grid Ops meetings. Still no response at all. Is anyone there ? Hello ? <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Date: Wed, 03 Dec 2008 11:59:44 -0600 From: Vladimir Podstavkov Yes, we are working on this. If you noticed some of these files have been written to the tape yesterday evening. We are investigating what caused this problem and don't want just blindly flush everything. Sorry, we have had to let you know that we are looking into it. _________________________________________ Date: Wed, 03 Dec 2008 18:07:07 +0000 (GMT) From: Arthur Kreymer Thanks for the feedback. Because the Minos beam is off for another week, you can take whatever time is needed to investigate this. _________________________________________ Date: Wed, 03 Dec 2008 14:14:14 -0600 From: Vladimir Podstavkov Finally I have found the cause of the problem. 
It turned out that the timeout for Minos pools has been set to 10 days instead of one by mistake. I have changed the setup files and will change the actual values on all pools, so all files will be flushed within a day. Sorry for inconvenience and thank you for your patience! _________________________________________ ============================================================================= 2008 11 21 ============================================================================= ######### # MYSQL # ######### Testing minos-mysql2 gcooper created /home/minsoft and /data/database directories, and copied .k5login Next, requested minsoft be in group mysql 9531 per ups tailor mysql Iterated account/files, initially local group file had mysql/27 group . The base server is running ! Date: Tue, 25 Nov 2008 12:25:20 -0600 (CST) From: Glenn Cooper I finally got back to this. Removed the mysql entry from local /etc/group file, and also changed nsswitch.conf to use the NIS map as well as the local file. ########## # CONDOR # ########## The load average took off starting at 12:00, up over 55 at 14:00 MINOS25 > touch /minos/scratch/kreymer/test1121 MINOS25 > rm /minos/scratch/kreymer/test1121 MINOS25 > touch /minos/data/users/kreymer/test1121 MINOS25 > rm /minos/data/users/kreymer/test1121 condor response time is fine. MINOS25 > condor_q | tail -1 3016 jobs; 1651 idle, 1365 running, 0 held I see nothing recent in /var/log/messages I see no change in the condor queues around 12:00 Last output file in logs/glide/probe*out is Nov 20 10:50 logs/glide/probe.229053.0.log ${HOME}/minos/scripts/condorglide no output, nothing in queue condor_history kreymer ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD MINOS25 > condor_submit glide.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 229927. I see that over 200 pilots started up on CDF pool 3, from 11:22 through 12:18. These show up in MINOS25 > condor_q gfactory | tail -1 685 jobs; 33 idle, 652 running, 0 held But no user jobs seem to be making use of them yet, as of 14:40. 
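The two condor_q counts just below break the running pilots down by gatekeeper by hand.
The same breakdown per grid resource can be had in one pass; this assumes the condor_q -run
output format shown later in this entry, with the grid resource in the last field :

condor_q -run gfactory | awk '/jobmanager/ { print $NF }' | cut -d: -f1 | sort | uniq -c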
MINOS25 > condor_q -run gfactory | grep fnpcfg1 | wc -l 400 MINOS25 > condor_q -run gfactory | grep fermigridosg1 | wc -l 247 ######## # GRID # ######## for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ; ssh -ax ${HOST} 'ls -ld /fnal/ups/etc' ; done fcdfcaf1528 ls: /fnal/ups/etc: No such file or directory fcdfcaf1539 ls: /fnal/ups/etc: No such file or directory fcdfcaf1559 ls: /fnal/ups/etc: No such file or directory fcdfcaf1566 ls: /fnal/ups/etc: No such file or directory for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /fnal/ups /usr/local/etc/setups.* > /dev/null && echo'; done ; date fcdfcaf1528 ls: /fnal/ups: No such file or directory /usr/local/etc/setups.*: No such file or directory fcdfcaf1539 ls: /fnal/ups: No such file or directory /usr/local/etc/setups.*: No such file or directory fcdfcaf1559 ls: /fnal/ups: No such file or directory /usr/local/etc/setups.*: No such file or directory fcdfcaf1566 ls: /fnal/ups: No such file or directory /usr/local/etc/setups.*: No such file or directory Ran again, reversed file order, fcdfcaf1528 ls: /usr/local/etc/setups.*: No such file or directory ls: /fnal/ups: No such file or directory fcdfcaf1539 ls: /usr/local/etc/setups.*: No such file or directory ls: /fnal/ups: No such file or directory fcdfcaf1559 ls: /usr/local/etc/setups.*: No such file or directory ls: /fnal/ups: No such file or directory fcdfcaf1566 ls: /usr/local/etc/setups.*: No such file or directory ls: /fnal/ups: No such file or directory Date: Fri, 21 Nov 2008 12:52:33 -0600 (CST) Subject: HelpDesk ticket 125314 ___________________________________________ Short Description: Four fcdfcaf nodes lack /fnal/ups and /usr/local/etc/setups.* scripts Problem Description: Four of the fcdfcaf nodes lack a local UPS installation in /fnal/ups, and the associated /usr/local/etc/setups.[c]sh scripts : fcdfcaf1528 fcdfcaf1539 fcdfcaf1559 fcdfcaf1566 ___________________________________________ This ticket is assigned to TIMM, STEVE of the CD-SF/GF/FGS. ___________________________________________ Date: Fri, 21 Nov 2008 13:02:56 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: This is the first I knew that there is a /fnal/ups installation on any fcdfcaf nodes. There did not use to be any /fnal/ups installation on the cdf nodes at all and FermiGrid never gave Minos any representations that it would work or that you could count on it. Your scripts should not be dependent on /fnal/ups anywhere outside the GP Grid cluster, and it would be good to get rid of the dependency there too. That said, it appears that in the latest round of CDF node reinstalls that is ongoing, there is now a upsupdbootstrap rpm on most of the new ones. The four nodes that you mention in the ticket are currently being drained to be reinstalled again for other reasons and this problem will be resolved by FEF at that time. Steve Timm ___________________________________________ Date: Mon, 24 Nov 2008 10:50:24 -0600 (CST) Solution: These 4 systems have been removed from the batch system for a software reinstall that was scheduled to take place even before this ticket was filed. By the way--the FEF person who does these reinstalls is aware of what went wrong and it should be automatically fixed next time. Nevertheless FermiGrid makes no warranty of presence or usability of /fnal/ups outside the GP Grid cluster. Steve Timm This ticket was resolved by TIMM, STEVE of the CD-SF/GF/FGS group. 
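Given Steve Timm's point that /fnal/ups cannot be counted on outside the GP Grid cluster,
a job script can at least fail loudly up front instead of dying later with "command not found".
A minimal sketch of such a guard; the AFS fallback is the setups.sh path used elsewhere in
this log, and whether AFS is mounted on a given worker is itself not guaranteed :

if [ -r /usr/local/etc/setups.sh ] ; then
    . /usr/local/etc/setups.sh
elif [ -r /afs/fnal.gov/files/code/e875/general/ups/etc/setups.sh ] ; then
    . /afs/fnal.gov/files/code/e875/general/ups/etc/setups.sh
else
    echo "No UPS setups script on `hostname` - giving up" >&2
    exit 1
fi
# ... then setup sam, srt_setup, etc. as usual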
___________________________________________ ########## # CONDOR # ########## Date: Fri, 21 Nov 2008 10:37:21 -0600 (CST) Subject: HelpDesk ticket 125300 <-- # @@@ Enter Update below this line. @@@ # --> <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Short Description: Minos glideins to CDF nodes not starting Problem Description: This morning at around 08:00, according to CondorView, the minosgli jobs disappeared sharply from the fcdfosg3 served nodes, No new Minos glideins to the CDF nodes have started since then. There are plenty of jobs still running on the fcdfosg3 pool, and there seems to be plenty of unused capacity there. Glideins continue to start normally on the gpfarm pool. fnpc4x1 looks OK, only one globus-job-manager running for minosgli. It seems that rubin's minospro jobs are also not running, with about 49 idle. ___________________________________________ Date: Fri, 21 Nov 2008 10:21:30 -0600 From: Marian Zvada To: cdf_caf_user@fnal.gov, cdf_jointphysics@fnal.gov Cc: cdfoom@fnal.gov Subject: cdfgrid fcdfhead10 down Dear Users, during the night we've experienced trouble with cdfgrid cluster. Now it's under recovery and not available for the users. We will announce when the system is back in normal. Sorry for any inconvenience, Marian (for the CAF team) ___________________________________________ Date: Fri, 21 Nov 2008 10:45:41 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: Art, how long had the glideins been running at the time? Also, were they going directly to fcdfosg3 or coming through fg1x1? There was a disruption at 07:52 on the CDF side of things from their submitter that submits to that grid. Maybe something was connected to that. I'll have a look. Steve Timm ___________________________________________ Date: Fri, 21 Nov 2008 11:19:16 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: It appears that most of the glideins which exited from the fcdfosg3/4 cluster around 8:00 exited of their own accord, with status zero. This was not at all due to any problems in the "cdfgrid" which is the CDF submission machine that would normally feed this cluster. There were a few, 10-20 glideins in the last number of days that were removed from the CDF cluster because they were above the 2GB/process memory limit on this cluster and got killed. Steve Timm ___________________________________________ Date: Mon, 24 Nov 2008 10:52:55 -0600 (CST) Solution: Information was given on the status of the recent minosgli glideins. This ticket was resolved by TIMM, STEVE of the CD-SF/GF/FGS group. ########## # CONDOR # ########## CDFCAF glideins seem to have stopped, probably around 08:00 ( sharp drop in client jobs, to near 0 ) Fri Nov 21 10:08:29 CST 2008 MINOS25 > condor_q -run | grep -v gfactory | grep fcdfcaf | wc -l 0 MINOS25 > condor_q -run | grep -v gfactory | grep fnpc | wc -l 387 MINOS25 > condor_q gfactory | tail -1 635 jobs; 83 idle, 552 running, 0 held No new glideins are getting started on cdf nodes, check the grid gateway ssh -ax fnpc4x1 'ps axfu | grep globus-job-manager | grep minos | grep -v grep' minosgli 1044 0.0 0.0 111932 5000 ? S 10:08 0:00 globus-job-manager -conf /usr/local/vdt-1.10.1/globus/etc/globus-job-manager.conf -type managedfork -rdn jobmanager-managedfork -machine-type unknown -publish-jobs Check that factory and frontends are running, looks OK ps -flu gfactory ps -flu gfrontend condor_q -run | grep gfactory ... 
229132.0 gfactory 11/20 13:22 0+20:48:26 gt2 fermigridosg1.fnal.gov:2119/jobmanager-condor 229132.1 gfactory 11/20 13:22 0+20:48:26 gt2 fermigridosg1.fnal.gov:2119/jobmanager-condor 229594.0 gfactory 11/21 06:31 0+02:42:39 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor 229615.0 gfactory 11/21 06:59 0+02:14:08 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor ... ########## # DCACHE # ########## PO 582564 F1F-141000HDRG SATABOY storage device configured with (14) 1TB disks I presume 12 TB net capacity,due to raid 5. We would like to deploy the additional 12 TB of disk as follows : Expand RawDataWritePools from 6 TB to 8 TB Expand MinosPrdReadPools from 13 TB to 23 TB ########## # DCACHE # ########## Date: Fri, 21 Nov 2008 09:43:34 -0600 From: Gene Oleynik The new hardware is in place. We still have to install OS etc, and plan the migration from new hardware. Seems to me you will get most benefit from bringing up the minos expansion first. How do you want these new pools configured? What file families, read/write, etc. ----------------------------------------------------------------------- ######## # FARM # ######## Start clearing out the backlog ------------------------------------------------------------------------ Summarizing /minos/data/minfarm/*cat Fri Nov 21 09:03:17 CST 2008 1719 232668 nearcat 5031 54693 farcat 30240 1287973 mcnearcat 26 1218 mcfarcat 0 1 mcfmockcat 7 1 WRITE 37023 1576554 TOTAL files, GBytes nearcat 66 1896 cosmic.sntp.cedar.0.root 412 164242 spill.cand.cedar_phy_bhcurv.0.root 587 25393 spill.mrnt.cedar_phy_bhcurv.0.root 1 47 spill.mrnt.cedar_phy_bhcurv.1.root 65 5649 spill.sntp.cedar.0.root 587 46647 spill.sntp.cedar_phy_bhcurv.0.root 1 87 spill.sntp.cedar_phy_bhcurv.1.root farcat 23 544 all.sntp.cedar.0.root 1359 33137 all.sntp.cedar_phy_bhcurv.0.root 23 170 spill.bntp.cedar.0.root 1201 8951 spill.bntp.cedar_phy_bhcurv.0.root 1201 8551 spill.mrnt.cedar_phy_bhcurv.0.root 23 112 spill.sntp.cedar.0.root 1201 5870 spill.sntp.cedar_phy_bhcurv.0.root mcnearcat 2 1151 cand.cedar_phy_bhcurv.1.root 2336 68216 mrnt.cedar_phy_bhcurv.0.root 2881 54129 mrnt.cedar_phy_bhcurv.1.root 54 1737 mrnt.cedar_phy_bhcurv.root 9813 190711 mrnt.cedar_phy_linfix.0.root 65 3114 mrnt.cedar_phy.root 2336 198362 sntp.cedar_phy_bhcurv.0.root 2881 185731 sntp.cedar_phy_bhcurv.1.root 54 5188 sntp.cedar_phy_bhcurv.root 9813 641990 sntp.cedar_phy_linfix.0.root 2 135 sntp.cedar_phy.root mcfarcat 3 120 mrnt.cedar_phy_linfix.0.root 4 209 sntp.cedar_phy_bhcurv.0.root 16 813 sntp.cedar_phy_bhcurv.root 3 132 sntp.cedar_phy_linfix.0.root ------------------------------------------------------------------------ mcfar linfix would write cleanly, waiting for permission. mcfar bhcurv has many several DUPs, would write nothing, ran this to produce a log for reference, ./roundup -r cedar_phy_bhcurv mcfar ~minfarm/ROUNTMP/LOG/2008-11/cedar_phy_bhcurvmcfar.log mcnear CBP charm files can be written, First a small test probe ./roundup -s charm -b 10 -r cedar_phy_bhcurv mcnear ~minfarm/ROUNTMP/LOG/2008-11/cedar_phy_bhcurvmcnearcharm.log Would start writing them all ./looper '-s charm -r cedar_phy_bhcurv mcnear' & But first, there may be too many duplicates, like n13037067 ./roundup -n -s charm -r cedar_phy_bhcurv mcnear 2>&1 | tee /tmp/cpbmcnc.log DUPEs include just 3 runs. 
DUPE n13037065_0027_L010185N_D04_charm.mrnt.cedar_phy_bhcurv.0.root DUPE n13037066_0002_L010185N_D04_charm.mrnt.cedar_phy_bhcurv.0.root DUPE n13037067_0001_L010185N_D04_charm.mrnt.cedar_phy_bhcurv.0.root Let's proceed with the rest : ./looper '-s charm -r cedar_phy_bhcurv mcnear' & rm /minos/data/minfarm/roundup/STOP.LOOPER ./looper '-s charm -r cedar_phy_bhcurv mcnear' & ============================================================================= 2008 11 20 ============================================================================= ########## # CONDOR # ########## rbpatter and Igor have cut back monitoring, to avoid overloads on minos25. Looks healthy with 660 user jobs glided, 690 pilots ########## # CONDOR # ########## rbpatter has implemented group priorities for high priority tasks. Announced to primer users. ############ # SHUTDOWN # ############ Thu Nov 20 19:31:49 CST 2008 kreymer@minos26 cd minos/scripts crontab crontab.dat mindata@minos26 cd crontab crontab.dat minfarm@fnpcsrv1 mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok ########## # DCACHE # ########## Date: Thu, 20 Nov 2008 20:28:10 +0000 (GMT) From: Arthur Kreymer To: minos-data@fnal.gov Cc: dcache-admin@fnal.gov, minos_batch@fnal.gov Subject: Holding of Minos writing to DCache for now Most FNDCA DCache services seem to have come back up around 13:13 today. Raw data files are being archived successfully. But 5 of the 12 pools in the writePools group are still offline. Summary from ~kreymer/minos/scripts/poolstat Thu Nov 20 14:26:24 CST 2008 DOWN TOT POOL GROUP 3/ 14 ExpDbWritePools 4/ 10 FermigridVolPools 12 KTeVReadPools 3/ 15 MinosPrdReadPools 2/ 8 RawDataWritePools 4/ 13 readPools 7/ 14 writePools Pools down in writePools : w-stkendca11a-4 w-stkendca12a-5 w-stkendca12a-6 w-stkendca6a-1 w-stkendca6a-2 I will not restart the Farm concatenation and MC import jobs, in order to reduce the load on that system, until we hear more from the DCache people, ------------------------------------------------------ Date: Thu, 20 Nov 2008 15:43:56 -0600 From: ssa-group@fnal.gov We believe we have resolved the pnfs problems. As far as we know everything is once again operational. Please report any problems. ------------------------------------------------------ Thu Nov 20 19:20:00 CST 2008 DOWN TOT POOL GROUP 3/ 14 ExpDbWritePools 4/ 10 FermigridVolPools 12 KTeVReadPools 3/ 15 MinosPrdReadPools 2/ 8 RawDataWritePools 4/ 13 readPools 4/ 14 writePools This looks OK'ish to me ------------------------------------------------------ ######## # DCAP # ######## ups copy dcap v2_42_f0710 -q unsecured -G "dcap v2_42_f0710 -q unsecured" ups declare -c dcap v2_42_f0710 -q unsecured copy succeeded around 13:30 ######## # DATA # ######## PNFS/ftp test succeeded 10 Thu Nov 20 13:18:37 CST 2008 557 Date: Thu, 20 Nov 2008 12:09:07 -0600 (CST) From: Steven Timm The /grid/data file system is now available again for use, on the new disk with increased size. 
Tested roundup, too few write pools : OOPS - POOLS ACTIVE NEED 12 7 8 ########## # PARROT # ########## mindata@minos26 PD=/minos/scratch/parrot MD=/afs/fnal.gov/files/data/minos/d120 MDB=${MD}/20081106 cd ${PD} MDB=${MD}/GROWFSDIR/20081106 mkdir -p ${MDB} cp -va ${MD}/.grow* ${MDB}/ date ; time ./make_growfs.auto -k ${MD} Thu Nov 20 10:44:05 CST 2008 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d120/.growfsdir make_growfs: 2710628 files, 8291 links, 125209 dirs, 0 checksums computed real 28m3.788s user 2m56.720s sys 10m48.974s MDB=${MD}/20081120 mkdir -p ${MDB} cp -va ${MD}/.grow* ${MDB}/ $ du -sm $MD/*/.growfsdir 120 /afs/fnal.gov/files/data/minos/d120/20081106/.growfsdir 123 /afs/fnal.gov/files/data/minos/d120/20081120/.growfsdir ######## # FARM # ######## Moving to the new scripts, which give correct MISS lists SRV1> cp -a AFSS/roundup.20081118 . SRV1> cp -a AFSS/samsub.20081118 . ######## # DATA # ######## Date: Wed, 19 Nov 2008 18:03:36 -0600 I am having problems with: fcdfcaf1566.fnal.gov It was around 5:15 pm yesterday. I would get the following errors on that node: /minos/scratch/pawloski/EntProc/condor_job_glidein_Reco_SameCali_FarTauMC.sh : line 8: srt_setup: command not found /minos/scratch/pawloski/EntProc/condor_job_glidein_Reco_SameCali_FarTauMC.sh : line 32: dccp: command not found /minos/scratch/pawloski/EntProc/condor_job_glidein_Reco_SameCali_FarTauMC.sh : line 42: loon: command not found Note I used a release at: /grid/app/minos/ by sourcing the following script: /grid/app/minos/users/boehm/setup_minossoft_MINOS_BATCH_GRID.sh R1.24.2 Greg Reply - It scanned OK at 09:17, ypwhich fcdfcaf1575 N.B. - see 2008 11 21 GRID note - lack of /fnal/ups ########## # CONDOR # ########## MINOS25 > uptime 09:01:22 up 29 days, 17:09, 8 users, load average: 213.06, 213.04, 213.00 MINOS25 > lsof ^C MINOS25 > df -h Filesystem Size Used Avail Use% Mounted on /dev/hda1 9.9G 7.1G 2.4G 76% / none 2.0G 0 2.0G 0% /dev/shm /dev/hda5 1012M 535M 426M 56% /tmp /dev/hda6 22G 244M 21G 2% /var /dev/hdb1 230G 180G 38G 83% /local/scratch25 ^C MINOS25 > cat /etc/fstab # This file is edited by fstab-sync - see 'man fstab-sync' for details LABEL=/ / ext3 defaults 1 1 none /dev/pts devpts gid=5,mode=620 0 0 none /dev/shm tmpfs defaults 0 0 none /proc proc defaults 0 0 none /sys sysfs defaults 0 0 LABEL=/tmp /tmp ext3 defaults 1 2 LABEL=/var /var ext3 defaults 1 2 LABEL=SWAP-hda2 swap swap defaults 0 0 LABEL=SWAP-hda3 swap swap defaults 0 0 stkensrv1:/minos /pnfs/minos nfs user,intr,bg,hard,ro,noac 0 0 LABEL=/local/scratch25 /local/scratch25 ext3 defaults 0 0 minos-nas-0.fnal.gov:/minos/data /minos/data nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 minos-nas-0.fnal.gov:/minos/scratch /minos/scratch nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-data /grid/data nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-app /grid/app nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-fermiapp /grid/fermiapp nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 /dev/hdc /media/cdrom auto pamconsole,exec,noauto,managed 0 0 /dev/fd0 /media/floppy auto pamconsole,exec,noauto,managed 0 0 blue2.fnal.gov:/minos/data /minos/data2 nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 Grabbing the ps axf output, as of around 09:00 cat > /local/scratch25/kreymer/psaxf25.20081120 MINOS25 > grep /minos/scratch /local/scratch25/kreymer/psaxf25.20081120 11247 ? 
D 0:02 condor_submit /minos/scratch/tinti/Cleaning/RecoStudies.run 12276 ? D 0:00 cat /minos/scratch/tinti/condor_logs/datamc.log.229046.0 14360 ? D 0:00 cat /minos/scratch/tinti/condor_logs/datamc.log.229046.0 26143 ? D 0:00 /bin/sh submit_badChanSntpGen.sh 1 2 /minos/scratch/med/badChannels/N00014833_0002.mdaq.root 26743 ? D 0:00 rm /minos/scratch/med/badChannels/R1.24/cmdfile MINOS25 > cat /minos/scratch/tinti/condor_logs/datamc.log.229046.0 000 (229046.000.000) 11/20 04:30:04 Job submitted from host: <131.225.193.25:63223> ^Y^Y^Y^Y^Y^Y [gfrontend@minos25 ~]$ ps xf PID TTY STAT TIME COMMAND 2072 pts/18 Ss 0:00 -bash 2107 pts/18 R+ 0:00 \_ ps xf 14360 ? D 0:00 cat /minos/scratch/tinti/condor_logs/datamc.log.229046.0 [gfrontend@minos25 ~]$ kill -9 14360 [gfrontend@minos25 ~]$ kill -9 14360 [gfactory@minos25 ~]$ ps xf PID TTY STAT TIME COMMAND 22865 pts/13 Ss 0:00 -bash 31838 pts/13 D+ 0:00 \_ cat /minos/data/users/scavan/mrcc_cand_filter/log.228592.100 18216 ? Z 1:07 [condor_gridmana] 2128 pts/18 Ss 0:00 -bash 2162 pts/18 R+ 0:00 \_ ps xf Nov 20 03:09:48 minos25 kernel: afs: Tokens for user of AFS id 4356 for cell fnal.gov have expired Nov 20 03:29:07 minos25 kernel: afs: Tokens for user of AFS id 13849 for cell fnal.gov are discarded (rxkad error=19270407) Nov 20 05:22:51 minos25 kernel: afs: Tokens for user of AFS id 5922 for cell fnal.gov are discarded (rxkad error=19270407) Nov 20 09:02:18 minos25 kernel: nfs: server stkensrv1 not responding, still trying Nov 20 09:08:14 minos25 kernel: nfs_statfs: statfs error = 512 Load average plummeted around 10:42 Date: Thu, 20 Nov 2008 09:49:08 -0600 (CST) Subject: HelpDesk ticket 125219 ___________________________________________ Short Description: minos25 cannot write to /minos/scratch, load average is over 200 Problem Description: run2-sys : Starting around 04:00 CST today, the load average on minos25 started climbing sharply. It is now over 200. Writes to the Bluearc served /minos/scratch seem to get hung up. I can read some files from /minos/scratch, but others hang up : cat /minos/scratch/tinti/condor_logs/datamc.log.229046.0 This command displays the file, but fails to exit, and cannot be killed. This same file can be read from other hosts such as minos26. I see no interesting messages in /var/log/messages. Please see whether we can determine the cause for these hangups. If necessary, please reboot minos25. ___________________________________________ Date: Thu, 20 Nov 2008 09:50:16 -0600 From: Ling C. Ho This mount is the reason lsof is hanging: stkensrv1:/minos on /pnfs/minos type nfs (ro,noexec,nosuid,nodev,intr,bg,hard,noac,addr=131.225.13.1) ___________________________________________ Date: Thu, 20 Nov 2008 15:55:32 +0000 (GMT) From: Arthur Kreymer It would be OK to remove the /pnfs/minos mount from minos25. We do not need it on that node. ___________________________________________ Date: Thu, 20 Nov 2008 10:42:57 -0600 (CST) This ticket has been reassigned to SHEPELAK, KAREN of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 20 Nov 2008 10:45:15 -0600 From: Ling C. Ho All the stuck close calls seems to have returned. Not sure what happened. /pnfs/minos has been unmounted. /grid/data was remounted. ___________________________________________ Date: Thu, 20 Nov 2008 10:42:06 -0600 From: condor To: minos-admin@fnal.gov Subject: [Condor] Problem minos25.fnal.gov: condor_schedd killed (unresponsive) This is an automated email from the Condor system on machine "minos25.fnal.gov". Do not reply. 
"/opt/condor/sbin/condor_schedd" on "minos25.fnal.gov" was killed because it was no longer responding. Condor will automatically restart this process in 10 seconds. ___________________________________________ Date: Thu, 20 Nov 2008 17:01:57 +0000 (GMT) From: Arthur Kreymer The condor_schedd seems to have restarted itself at 10:42:06 At just about that time, the load average dropped, and the condor system became responsive. User jobs are running again. /minos/scratch is writeable again from minos25. The formerly bad file is OK : /minos/scratch/tinti/condor_logs/datamc.log.229046.0 ___________________________________________ Date: Thu, 20 Nov 2008 15:49:34 -0600 (CST) Hi Art, The problem seems to have resolved itself. Load average is ok right now, and condor_q doesn't hang. [root@minos25 ~]# uptime 15:39:36 up 29 days, 23:47, 15 users, load average: 1.25, 1.10, 1.21 [root@minos25 ~]# condor_q 6832 jobs; 5457 idle, 1375 running, 0 held Let me know if you think it is ok to mark this ticket as resolved, karen ___________________________________________ Date: Sun, 30 Nov 2008 22:04:48 -0600 (CST) From: HelpDesk This request will be automatically closed in two weeks. If you wish this problem to remain open please contact the HelpDesk. ___________________________________________ ___________________________________________ ============================================================================= 2008 11 19 ============================================================================= ######## # FARM # ######## samsub.20081118 and roundup.20081118 ready for production. Do this coming out of the shutdown tomorrow. ############ # SHUTDOWN # ############ Prepared for PNFS/DCache maintenance Nov 20 kreymer@minos26 echo "crontab -r" | at 06:30 job 20 at 2008-11-20 06:30 mindata@minos26 echo "crontab -r" | at 01:00 job 21 at 2008-11-20 01:00 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 job 17 at 2008-11-20 01:00 ######## # GRID # ######## Investigating Ticket #: 125092 /minos/data2 mounts unstable on CDF grid nodes MIN > ssh fcdfcaf1502 -bash-3.00$ domainname fcdfosg1 -bash-3.00$ ypwhich -d fcdfosg1 fcdfcaf1550.fnal.gov -bash-3.00$ ypwhich -d fcdfosg1 fcdfcaf1425.fnal.gov -bash-3.00$ ypwhich -d fcdfosg1 fcdfcaf1450.fnal.gov for HOST in `head -2 /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'printf "`ypwhich` " ; ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null && echo' done fcdfcaf1502 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1503 fcdfcaf1425.fnal.gov fcdfcaf1504 fcdfcaf1450.fnal.gov fcdfcaf1505 fcdfcaf1425.fnal.gov fcdfcaf1506 fcdfcaf1501.fnal.gov fcdfcaf1507 fcdfcaf1425.fnal.gov fcdfcaf1508 fcdfcaf1550.fnal.gov fcdfcaf1509 fcdfcaf1425.fnal.gov fcdfcaf1510 fcdfcaf1425.fnal.gov fcdfcaf1511 fcdfcaf1450.fnal.gov fcdfcaf1512 fcdfcaf1450.fnal.gov fcdfcaf1513 fcdfcaf1425.fnal.gov fcdfcaf1514 fcdfcaf1475.fnal.gov fcdfcaf1515 fcdfcaf1550.fnal.gov fcdfcaf1516 fcdfcaf1450.fnal.gov fcdfcaf1517 fcdfcaf1425.fnal.gov fcdfcaf1518 fcdfcaf1450.fnal.gov fcdfcaf1519 fcdfcaf1475.fnal.gov fcdfcaf1520 fcdfcaf1425.fnal.gov fcdfcaf1521 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'printf "`ypwhich` " ; ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null && echo' done 2>&1 | tee /tmp/scan1119a.lis ( selecting only failing nodes ) fcdfcaf1519 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1521 
fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1525 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1529 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1540 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1544 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1557 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1560 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1670 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1674 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1677 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1680 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1681 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1683 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1684 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1686 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1687 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1689 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1695 do_ypcall: clnt_call: RPC: Timed out fcdfcaf1703 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1712 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory fcdfcaf1713 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory Get list of ypwhich'es : MIN > cat /tmp/scan1119a.lis | cut -f 2 -d ' ' | sort -u do_ypcall: fcdfcaf1150.fnal.gov fcdfcaf1201.fnal.gov fcdfcaf1425.fnal.gov fcdfcaf1450.fnal.gov fcdfcaf1475.fnal.gov fcdfcaf1501.fnal.gov fcdfcaf1525.fnal.gov fcdfcaf1550.fnal.gov fcdfcaf1575.fnal.gov fcdfcaf1601.fnal.gov List the maps -bash-3.00$ ypwhich -m passwd.byuid fcdf0x4.fnal.gov auto.des fcdf0x4.fnal.gov-h auto.grid fcdf0x4.fnal.gov auto.master fcdf0x4.fnal.gov auto.home fcdf0x4.fnal.gov auto.minos fcdf0x4.fnal.gov auto.cdf fcdf0x4.fnal.gov passwd.byname fcdf0x4.fnal.gov ypservers fcdf0x4.fnal.gov auto.ilc fcdf0x4.fnal.gov group.bygid fcdf0x4.fnal.gov group.byname fcdf0x4.fnal.gov Look at the maps : -bash-3.00$ ypcat -h fcdfcaf1450 auto.grid blue2:/fermigrid-home blue2:/fermigrid-products/opt/condorsleeper/${ARCH}/condor-7.0.3 blue2:/fermigrid-products/opt/condorsleeper_sl5/${ARCH}/condor-7.0.3 blue2:/fermigrid-products/usr/local/grid-1.0.0-${ARCH} blue2:/fermigrid-fermiapp blue2:/fermigrid-app blue2:/fermigrid-data -bash-3.00$ ypcat -h fcdfcaf1525 auto.minos -ro minos-nas-0.fnal.gov:/minos/scratch -noexec blue2.fnal.gov:/minos/data -noexec minos-nas-0.fnal.gov:/minos/data -bash-3.00$ ypcat -h fcdfcaf1450 auto.minos -ro minos-nas-0.fnal.gov:/minos/scratch -noexec blue2.fnal.gov:/minos/data -noexec minos-nas-0.fnal.gov:/minos/data -bash-3.00$ ypcat -h fcdfcaf1450 auto.master auto.minos -o proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600 auto.grid -o proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600 auto.ilc -o proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600 auto.home -o proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600 auto.cdf -o proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600 -bash-3.00$ ypcat ypservers fcdfcaf1325.fnal.gov fcdfcaf1525.fnal.gov fcdf0x4.fnal.gov fcdfcaf1001.fnal.gov fcdfcaf1250.fnal.gov fcdfcaf1450.fnal.gov fcdfcaf1225.fnal.gov fcdfcaf1375.fnal.gov fcdfcaf1425.fnal.gov 
fcdfcaf1575.fnal.gov fcdfcaf1150.fnal.gov fcdfcaf1350.fnal.gov fcdfcaf1275.fnal.gov fcdfcaf1501.fnal.gov fcdfcaf1475.fnal.gov fcdfcaf1050.fnal.gov fcdfcaf1201.fnal.gov fcdfcaf1401.fnal.gov fcdfcaf1601.fnal.gov fcdfcaf1101.fnal.gov fcdfcaf1550.fnal.gov Schmitz will reboot fcdfcaf1525 after permission from Timm. on fcdfcaf1519, scan all the maps MAPS=`ypwhich -m | cut -f 1 -d ' '` for HOST in fcdf0x4 fcdfcaf1525 fcdfcaf1450; do for MAP in ${MAPS} ; do ypcat -h ${HOST} ${MAP} done >> /tmp/ypmaps.${HOST} done There are differences in passwd and group files try again without these, MAPS=`ypwhich -m | cut -f 1 -d ' ' | grep -v passwd | grep -v group` for HOST in fcdf0x4 fcdfcaf1525 fcdfcaf1450; do for MAP in ${MAPS} ; do ypcat -h ${HOST} ${MAP} done > /tmp/ypmaps.${HOST} done No differences, but still : -bash-3.00$ ypwhich fcdfcaf1525.fnal.gov -bash-3.00$ ls -ld /minos/data2 ls: /minos/data2: No such file or directory /usr/lib64/autofs/autofs-ldap-auto-maste 11:27 - fcdfcaf1525 has been rebooted, try a new scan done 2>&1 | tee /tmp/scan1119b.lis Wed Nov 19 17:48:30 GMT 2008 Clean, aside from timeout logging into 1525 - cleared up now. Second scan, just to be sure ! MIN > ssh fcdfcaf1525 -bash-3.00$ uptime ; date 11:36:34 up 3 min, 2 users, load average: 0.12, 0.11, 0.04 Wed Nov 19 11:36:34 CST 2008 date 11:43, poking in parallel, while stuck on connection to 1694 MIN > ssh -ax fcdfcaf1693 'ypwhich ; ls -ld /minos/data /minos/data2 /minos/scratch' do_ypcall: clnt_call: RPC: Timed out fcdfcaf1525.fnal.gov done 2>&1 | tee /tmp/scan1119d.lis ; date Wed Nov 19 17:57:39 GMT 2008 Clean, the ticket is closed. Back to work ! ============================================================================= 2008 11 18 ============================================================================= ############# # MDSUM_LOG # ############# created new version, mdsum_log.20081118 ran test around 16:23 ########### # ROUNDUP # ########### roundup.20081118 Make use of the extended subrun list. Present usage of samsubs : printed in HAVE message get VAL subrun count from cut -f 2 -d ':' Due to new whitespace in SAMSUBS, changed for SAMSUB in ${SAMSUBS} to printf "${SAMSUBS}\n" | while read SAMSUB Therefore must deploy new samsubs and roundup together. Test with AFSS/roundup.20081118 -n -W -r cedar near AFSS/roundup.20081118 -n -W -s n13037095 -r cedar_phy_bhcurv mcnear samsub tripped on stray files in mcnearcat, Oops, my parsing of subruns does not work for mcout files, which have many more underscores. Test on n13011168 Weird, no longer need the string.join, Needed to split on _ first, then ., for sake of mcin parents. For quicker testing, hacked INDIR=/minos/data/minfarm/testroundup SRV1> mkdir /minos/data/minfarm/testroundup cp -a /minos/data/minfarm/mcnearcat/n1303709*.root /minos/data/minfarm/testroundup/ ######## # FARM # ######## ######## # GRID # ######## Date: Tue, 18 Nov 2008 11:52:50 -0800 (PST) From: Ryan B. Patterson To: minos-admin@fnal.gov Cc: pawloski@fnal.gov Subject: Glidein throughput issue seems to be addressed Hi, We were suffering from glideins not willing to run new jobs after finishing their first one. This was limiting computing throughput to the rate of new glidein production. Igor was eventually able to track it down to a communication problem between the starter and the schedd/shadow. 
On his suggestion, I added two obscure settings to the factory configuration: WANT_UDP_COMMAND_SOCKET = True STARTD_SENDS_ALIVES = False Initial testing suggests that this fix has worked, as new glideins seem to be returning to the pool in the "Unclaimed" state and seem willing to run new jobs. Let me know if this doesn't seem true in more complex situations over the next few days. --Ryan ########### # MINOS27 # ########### Date: Tue, 18 Nov 2008 19:38:36 +0000 (GMT) From: Arthur Kreymer To: Ling C. Ho Cc: minos-admin@fnal.gov Subject: Re: DB servers, and other Grid items On Fri, 14 Nov 2008, Ling C. Ho wrote: > I have minos27 ready. Please log in and take a look. The virtual network > interface is not set up. We can swap this with minos01 once you determine > everything need is installed properly. Thanks ! /minos/data and /minos/scratch need to be NFS mounted, as on other Cluster systems. The local scratch space should be mounted as /minos/scratch27, group writeable by the e875 group, similar to other Cluster systems. /local/scratch27 should be configured as a single 500 GB volume. We would rename /minos/scratch27 to /minos/scratch01, and clone content from minos01, when the system eventually goes to production. We will move the minoscvs repository to the CD's cdcvs server before we retire the old minos01 server. So we will not need the /cvs area, or the special sshd cvs server ( /usr/sbin/sshd -f /etc/ssh/sshd_config.cvs ) And we will not need to run the pserver on minos27. Date: Tue, 18 Nov 2008 14:14:26 -0600 From: Ling C. Ho > /minos/data and /minos/scratch need to be NFS mounted, > as on other Cluster systems. Corrected. Do you mean /local/scratch27? I have repartitioned the data disk and mounted as /local/scratch27. Date: Tue, 18 Nov 2008 20:55:56 +0000 (GMT) From: Arthur Kreymer Yes, my bad. On the other Minos Cluster nodes, /local/scratchNN has permissions 777, and is owned by root.root It is probaby best to do this also on minos27, rather than set ownership root.e875, and mode 775 as I had requested. I'll let the users know that minos27 is available now. Date: Tue, 18 Nov 2008 20:56:14 +0000 (GMT) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Cc: minos-admin@fnal.gov Subject: minos27 testing - SLF 4.7 and x86_64 kernel Node minos27 is available for testing. This node is intended to become the replacement for minos01. It is running SLF 4.7 . The rest of the Minos Cluster runs SLF 4.4. It is running the x86_64 64 bit kernel. Please test the working environment there, particularly whether programs built on minos27 can be used on other Minos Cluster nodes. Do not put anything permanent into /local/scratch27, as that area may be cloned from /local/scratch01 eventually. ######## # GRID # ######## Date: Tue, 18 Nov 2008 12:07:49 -0600 (CST) Subject: HelpDesk ticket 125094 ___________________________________________ Ticket #: 125094 ___________________________________________ Short Description: /minos/data and data2 timeouts on fnpcsrv1 Problem Description: Howie Rubin reports repeated failures to access /minos/data2 on fnpcsrv1. My once-per minute scans have not seen a recent failure, but they are not accessing the system as often as Howie. In /var/log/messages, it is apparent that /minos/data and /minos/data2 are being dismounted and remounted about every 20 minutes, even though I am accessing files there every minute. 
These dismounts ( 'expired /minos/data2' ) seem to have started at Nov 14 01:02:27 but have not occured since Nov 18 09:42:59 Please tune fnpcsrv1 so that these filesystems are not dismounted so frequently, or confirm that this tuning was done this morning after 09:42. ___________________________________________ This ticket is assigned to TIMM, STEVE of the CD-SF/GF/FGS. ___________________________________________ Verbal 12:30, Steve Timm The script which kept /minos/data* mounted was stopped when the stale NFS handles were cleared last Friday. He is restarting the script. I see no further dismounts as of 13:50, of /minos/data2. But /minos/data continues to be dismounted every 20 minutes . ___________________________________________ Date: Tue, 18 Nov 2008 14:07:22 -0600 (CST) Our script which keeps the minos areas mounted on fnpcsrv1 is now running again. We had disabled it during last week's incident with /minos/data when we had to recover from the stale file handles. Note, however, that it is the belief of FGS that whether the file system is temporarily umounted has nothing to do with the problems that Howie is seeing. So let us know if Howie's problem persists. Steve Timm ___________________________________________ N.B. - 2008 11 19 08:35 - no further expired messages in messages ___________________________________________ Date: Thu, 20 Nov 2008 11:24:44 -0600 (CST) From: HelpDesk Solution: We restored the processes which keep the /minos/data, /minos/data2, and /minos/scratch areas mounted all the time. Steve Timm ___________________________________________ ######## # GRID # ######## Date: Tue, 18 Nov 2008 11:55:19 -0600 (CST) Subject: HelpDesk ticket 125092 Short Description: /minos/data2 mounts unstable on CDF grid nodes Problem Description: The mounts of /minos/data2 are sometimes present, sometimes absent from the CDF grid nodes ( fcdfcaf1502 through fcdfcaf1716 ) This seems to be the same problem previously resolved on Friday 14 Nov, Helpdesk Ticket 124929 Here is the relevant language from that ticket, quoting Steve Timm : There is probably an old out-of-sync yp slave server somewhere on the CDF grid cluster 2 that does not yet have the new map that includes /minos/data2. I re-pushed out the map to all existing nodes. Will ask FEF to re-enable the 3 slave servers that are down. This might not happen before the weekend. ___________________________________________ This ticket is assigned to Box, Dennis of the CDF. ___________________________________________ Date: Tue, 18 Nov 2008 12:05:39 -0600 (CST) Reassign Please reassign this to someone who can update yp mapfiles on cdf machines (FEF?) ___________________________________________ Date: Tue, 18 Nov 2008 12:09:56 -0600 (CST) Note To Requester: timm@fnal.gov sent this Notes To Requester: Somehow this ticket went to Grid/CDF and should have gone to FEF instead. But Art misunderstood my E-mail to him. It is not a problem with the automount maps now, it is just that some of the worker nodes had a missing /minos/data2 left over from Nov. 14 when the automount maps were fixed and that needs to get reset. Steve Timm ___________________________________________ I still don't understand. /minos/data2 was mounted on all nodes on Friday. It is now intermittent, coming and going apparently randomly. ___________________________________________ Date: Tue, 18 Nov 2008 13:25:24 -0600 (CST) This ticket has been reassigned to SHEPELAK, KAREN of the CD-SF/FEF Group. 
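On the FGS script ( ticket 125094 above ) that keeps the /minos areas mounted on fnpcsrv1 :
the script itself is not shown in this log, but the idea is just to touch each automounted
area more often than the autofs expire interval. A sketch of that keep-alive, with the area
list and the sleep interval as assumptions :

AREAS="/minos/data /minos/data2 /minos/scratch"
while true ; do
    for AREA in ${AREAS} ; do
        ls ${AREA} > /dev/null 2>&1      # any access resets the autofs idle timer
    done
    sleep 300                            # well under the ~20 minute expiry seen above
done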
___________________________________________ Date: Tue, 18 Nov 2008 22:12:21 +0000 (GMT) From: Arthur Kreymer To: Howard Rubin Cc: minos-data@fnal.gov, shepelak@fnal.gov Subject: Re: cdf nodes On Tue, 18 Nov 2008, Howard Rubin wrote: > They may have done something. The last failure was at 13:40. My last scan, around 15:00, picked up 5 failures, out of 151 hosts. One of the 5 failures was an rcp timeout: fcdfcaf1510 ls: /minos/data2: No such file or directory fcdfcaf1525 ls: /minos/data2: No such file or directory fcdfcaf1672 ls: /minos/data2: No such file or directory fcdfcaf1677 ls: /minos/data2: No such file or directory fcdfcaf1686 do_ypcall: clnt_call: RPC: Timed out ___________________________________________ Date: Tue, 18 Nov 2008 23:03:52 +0000 (GMT) We are continuing to see failures, as of 17:00 today. ___________________________________________ Date: Tue, 18 Nov 2008 17:53:35 -0600 (CST) Note To Requester: investigating ___________________________________________ Date: Wed, 19 Nov 2008 15:25:59 +0000 (GMT) From: Arthur Kreymer The problem seems only to occur when yp sever fcdfcaf1502 happens to be chosen . I have repeated a scan of the fcdfdcaf nodes, this time checking which yp server is being used as follows : for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'printf "`ypwhich` " ; ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null && echo' done 2>&1 > /tmp/scan1119a.lis A typical line of failing output is : fcdfcaf1519 fcdfcaf1525.fnal.gov ls: /minos/data2: No such file or directory Various hosts fail, but the problem always occurs when fcdfcaf1525 happens to be used to serve the yp maps. I am still puzzled, I see no difference in the auto.minos map served by fcdfcaf1525 : -bash-3.00$ ypcat -h fcdfcaf1525 auto.minos -ro minos-nas-0.fnal.gov:/minos/scratch -noexec blue2.fnal.gov:/minos/data -noexec minos-nas-0.fnal.gov:/minos/data -bash-3.00$ ypcat -h fcdfcaf1450 auto.minos -ro minos-nas-0.fnal.gov:/minos/scratch -noexec blue2.fnal.gov:/minos/data -noexec minos-nas-0.fnal.gov:/minos/data ___________________________________________ Date: Wed, 19 Nov 2008 09:30:58 -0600 (CST) From: Steven Timm To: Arthur Kreymer Cc: HelpDesk , shepelak@fnal.gov, run2-sys@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 125092 has additional info. What I have been trying to say is the following: On Nov. 14, when the yp server fcdfcaf1525 came back up, it came up with a bad map, which I fixed yesterday when I saw that problem. However, any nodes which were bound to that server between Nov 14 and now, and tried to access /minos/data2 during that time, got the "no such file or directory" error. Killing and restarting automount may not be enough to fix this error, it may require a reboot of the nodes in question. ___________________________________________ Date: Wed, 19 Nov 2008 09:34:33 -0600 (CST) I restarted ypbind on fcdfcaf1502. It seems to have resolved the mount issue. Check it out and let me know. Mark ___________________________________________ Date: Wed, 19 Nov 2008 11:13:44 -0600 From: Mark Schmitz To: Arthur Kreymer Subject: fcdfcaf1525 This node is being rebooted now. Mark ___________________________________________ Date: Wed, 19 Nov 2008 10:39:30 -0600 From: Howard Rubin To: Art Kreymer , Steve Timm Subject: I/O failures on /minos/data2 Art (Steve, FYI), Since 23:42 yesterday there have been a total of 612 input or output failures. I'm going to shut down processing (except for keep-up) until we get a response from FEF. 
However, you should be aware that some input failures are also occurring on GPGrid nodes as well. These appear to be the same old SRM problem to which I don't think I've ever received a satisfactory response to old tickets beyond "Turned over to developers" which has been the standard response lately (after a couple of weeks of aging). The several random checks I've made of input failures on CDFGrid are *not* SRM related. Being the good soldier that I am, I will submit a ticket on this. Wed Nov 19 09:12:59 CST 2008: ====> fileStatus state ==Failed java.io.IOException: rs.state = Failed rs.error = at Wed Nov 19 07:08:26 CST 20 08 state Pending : created RequestFileStatus#-2144281128 failed with error:[ at Wed Nov 19 09:12:03 CST 20 08 state Failed : Pinning failed] at gov.fnal.srm.util.SRMGetClientV1.start(SRMGetClientV1.java:298) at gov.fnal.srm.util.SRMDispatcher.work(SRMDispatcher.java:795) at gov.fnal.srm.util.SRMDispatcher.main(SRMDispatcher.java:374) srm copy of at least one file failed or not completed I have a complete list of nodes and times. Howie ___________________________________________ Date: Wed, 19 Nov 2008 10:42:04 -0600 (CST) From: Steven Timm To: Howard Rubin Cc: Art Kreymer Subject: Re: I/O failures on /minos/data2 Internal E-mail from Mark Schmitz of FEF tells me he is working on the /minos/data2 issue on the CDF nodes at the moment. This is something that I would have the permission to do myself but since I am in the workshop I can't get to it today or tomorrow and you are better to stay with them. Steve Timm ___________________________________________ Date: Wed, 19 Nov 2008 17:25:32 +0000 (GMT) From: Arthur Kreymer To: HelpDesk Cc: shepelak@fnal.gov, run2-sys@fnal.gov, timm@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 125092 has additional info. <-- # @@@ Enter Update below this line. @@@ # --> I earlier wrote " The problem seems only to occur when yp sever fcdfcaf1502 happens to be chosen " As always, I cannot type correctly. The yp server at issue is fcdfcaf1525 , not 1502, 1525 was mentioned correctly several times later in the posting. <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ ######## # DATA # ######## Date: Fri, 14 Nov 2008 13:23:05 -0600 (CST) Subject: HelpDesk ticket 124929 has additional info. _________________________________________________________________ Ticket #: 124929 _________________________________________________________________ Note To Requester: timm@fnal.gov sent this Notes To Requester: There is probably an old out-of-sync yp slave server somewhere on the CDF grid cluster 2 that does not yet have the new map that includes /minos/data2. I re-pushed out the map to all existing nodes. Will ask FEF to re-enable the 3 slave servers that are down. This might not happen before the weekend. As far as stale file handles the process is the following, which FEF can do in my absence 1) kill any stale processes of the form gidd_alloc or procd 2) umount /minos/data (and /minos/scratch if necessary) 3) kill auto.minos automount process 4) umount /minos 5) service autofs reload Art--I would suggest that a ticket be opened to FEF to get this done expediently., Steve _________________________________________________________________ Date: Wed, 19 Nov 2008 11:35:44 -0600 (CST) Note To Requester: Hi Art, Steve, I'd like to reboot fcdfcaf1502. The /minos/data2 directory is still not mounting even after ypbind and autofs services have been restarted. Can you start draining condor jobs? 
Automount maps also appear to be ok, same mapping as the other machines which mount the directory ok. [root@fcdfcaf1502 ~]# ypcat -k auto.minos scratch -ro minos-nas-0.fnal.gov:/minos/scratch data2 -noexec blue2.fnal.gov:/minos/data data -noexec minos-nas-0.fnal.gov:/minos/data Since services restarted I am now seeing: [root@fcdfcaf1502 ~]# mount /minos/data2 Unsupported nfs mount option: o [root@fcdfcaf1502 ~]# mount blue2.fnal.gov:/minos/data /minos/data2 mount: blue2.fnal.gov:/minos/data already mounted or /minos/data2 busy mount: according to mtab, blue2.fnal.gov:/minos/data is already mounted on /minos/data2 [root@fcdfcaf1502 ~]# cat /etc/mtab |grep data2 blue2.fnal.gov:/minos/data /minos/data2 nfs rw,noexec,o,proto=tcp,nfsvers=3,wsize=32768,rsize=32768,hard,intr,timeo=600, addr=131.225.111.93 0 0 thanks, karen _________________________________________________________________ Date: Wed, 19 Nov 2008 11:53:45 -0600 From: Mark Schmitz To: Arthur Kreymer Cc: HelpDesk , shepelak@fnal.gov, run2-sys@fnal.gov, timm@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 125092 has additional info. <-- # @@@ Enter Update below this line. @@@ # --> worklog fcdfcaf1525 has been restarted <-- # @@@ Enter Update above this line. @@@ # --> _________________________________________________________________ Date: Wed, 19 Nov 2008 17:58:49 +0000 (GMT) From: Arthur Kreymer To: Mark Schmitz Cc: HelpDesk , shepelak@fnal.gov, run2-sys@fnal.gov, timm@fnal.gov, minos-data@fnal.gov, minos_batch@fnal.gov, minos_software_discussion@fnal.gov Subject: Re: HelpDesk ticket 125092 has additional info. <-- # @@@ Enter Update below this line. @@@ # --> I have run two full scans on the cdf caf nodes since the reboot of fcdfcaf1525. There are no failures to mount the /minos areas. The fcdfcaf1525 NIS server is being used heavily. We can consider this probelem resolved. Thanks ! <-- # @@@ Enter Update above this line. @@@ # --> _________________________________________________________________ Date: Wed, 19 Nov 2008 13:57:08 -0600 (CST) This ticket was resolved by SHEPELAK, KAREN of the CD-SF/FEF group. ######## # FARM # ######## Rubin reports failure to read or write /minos/data2 on several hosts. For exmaple, reading /minos/data/minfarm/loonexe/set_tsql_override.C > 2008-11-18 04:01:03 fcdfcaf1554 > 2008-11-18 04:01:16 fcdfcaf1688 > 2008-11-18 04:01:16 fcdfcaf1672 > 2008-11-18 04:01:40 fcdfcaf1672 > 2008-11-18 04:01:58 fcdfcaf1672 > 2008-11-18 04:02:28 fcdfcaf1504 > > and those with output errors: > > 2008-11-18 07:54:36 fcdfcaf1675 > 2008-11-18 08:14:50 fcdfcaf1563 > 2008-11-18 08:21:34 fcdfcaf1522 > 2008-11-18 08:21:42 fcdfcaf1665 > 2008-11-18 08:26:38 fcdfcaf1555 > 2008-11-18 08:27:20 fcdfcaf1670 > 2008-11-18 08:33:31 fcdfcaf1535 > 2008-11-18 08:34:17 fcdfcaf1555 > 2008-11-18 09:16:26 fcdfcaf1540 Rescanning for /minos/data, as before for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null && echo' done 2>&1 | tee /tmp/scan1118a.lis done 2>&1 | tee /tmp/scan1118b.lis Scanned gpfarm nodes, AOK for HOST in `cat /tmp/gphosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null ;echo' done on fnpcsrv1, find expired /minos/data2 every 20 minutes since Nov 14 01:02:27 through Nov 18 09:42:59 but not since then. 
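The ad-hoc host scans above are all variants of the same loop. A consolidated sketch, assuming password-less ssh to the workers (as in the scans above) and a host list file passed as the argument, printing only the failures:

#!/bin/sh
# Sketch: report hosts where any of the /minos areas fails to list.
# Usage: ./minosmountscan /tmp/cdfhosts   ( or /tmp/gphosts )
HOSTFILE=${1:-/tmp/cdfhosts}
AREAS="/minos/scratch /minos/data /minos/data2"
for HOST in `cat ${HOSTFILE}` ; do
  ERR=`ssh -ax -o ConnectTimeout=10 ${HOST} "ls -ld ${AREAS}" 2>&1 > /dev/null`
  [ -n "${ERR}" ] && printf "%s : %s\n" ${HOST} "${ERR}"
done

A healthy node prints nothing ; stale handles, missing mounts and ypcall timeouts all show up in the error text.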
######## # DATA # ######## /grid/data usage message Total disk allocated (GB): 400.0 Percent disk used: 80.1% du -sm /grid/data/minos/* 315373 /grid/data/minos/users 9659 /grid/data/minos/minfarm 1972 /grid/data/minos/OLDfarcat 594 /grid/data/minos/OLDneardet 110 /grid/data/minos/condor_log ... du -sm /grid/data/minos/users/* | sort -n ... 91 /grid/data/minos/users/mishi 315282 /grid/data/minos/users/rustem Forwarded to rustem ============================================================================= 2008 11 17 ============================================================================= ######### # MYSQL # ######### Date: Mon, 17 Nov 2008 15:15:19 -0600 From: Ling C. Ho To: Arthur Kreymer Cc: minos-admin@fnal.gov Subject: Re: DB servers, and other Grid items Hi Art, Minos-mysql2 and 3 are ready. I have installed NIS client on these nodes, but login is limited to the few accounts that were on the local password files. The default database directory is /var/lib/mysql. Please let me know how you would like to set up /data (ie, if you want subdirectories like on minos-mysql1) and I can create symbolic links to point /var/lib/mysql to the right place. ####### # SAM # ####### Date: Mon, 17 Nov 2008 17:08:43 -0600 From: Ling C. Ho To: Arthur Kreymer Cc: minos-admin@fnal.gov Subject: Re: DB servers, and other Grid items Minos-sam04 is ready too. ######## # DATA # ######## mcimport - mtavera, duplicates over the weekend - inform her ######## # DATA # ######## mcimport OVERLAY Mon Nov 17 08:48:32 CST 2008 MCIN configuration n1303 _L010185N_D06_nccohbkg.reroot.root SRMClientV1 : put: try # 0 failed with error SRMClientV1 : java.net.ConnectException: Connection timed out srm copy of at least one file failed or not completed MRTG shows stkendca2a off net just after 08:30 this morning. Now it is back up, services up since 10:57 stkendca7a is also offline, as of 11:11 MRTG has no data for that node Date: Mon, 17 Nov 2008 11:01:35 -0600 (CST) Subject: HelpDesk ticket 125001 ___________________________________________ Short Description: stkendca2a seems down, doors offline in FNDCA Problem Description: According to MRTG data, stkendca2a went off the net around$ morning. All the stlendca2a services are offline, including dcap dcapK dcapG SRM GFTP0/1 KFTP WFTP Minos raw data archiving has stopped. ___________________________________________ Date: Mon, 17 Nov 2008 11:26:43 -0600 (CST) The stkendca2a node should be back on-line now. It was moved from FCC1 to the Mezzanine this morning. I had believed that the node was only running the test instance of dCache. I was not aware it was serving public dCache services. I'm sorry for the inconvenience. No further interruptions are anticipated. Ken S. -- SSA Group ___________________________________________ Date: Mon, 17 Nov 2008 17:37:49 +0000 (GMT) From: Arthur Kreymer Thanks for restoring stkendca2a. I see that node stkendca7a is also off the network. This serves two of the RawDataWritePools pools, and one GFTP door. ___________________________________________ Date: Mon, 17 Nov 2008 12:13:44 -0600 Art, The stkendca7a node was recently brought back on-line. On Friday, it was found that one of the two RAID sets was inaccessible. In trying to recover that RAID partition, I issued a command to reboot the system. We have been unable to get the system to boot properly since then. It has been down since Friday afternoon. This is a serious hardware problem. And it is a rather old node. 
I'm meeting with my manager after lunch to discuss options for how we can get this node back on-line and recover the pools. Ken S. ___________________________________________ ######## # DATA # ######## Continue cleanout of MOVED files, find /minos/data/reco_near.MOVED -user mindata | wc -l 14747 find /minos/data/reco_near.MOVED -user minfarm | wc -l 3348 find /minos/data/reco_near.MOVED -user rubin | wc -l 206 minfarm@fnpcsrv1> find /minos/data/reco_near.MOVED -user minfarm -exec chmod g+w {} \; rubin@fnpcsrv1> find /minos/data/reco_near.MOVED -user rubin -exec chmod g+w {} \; df -m /minos/data 28311552 20197451 8114102 72% /minos/data mindata@minos26> time rm -r /minos/data/reco_near.MOVED rm: remove write-protected regular file `/minos/data/reco_near.MOVED/cedar_phy_bhcurv/sntp_data/2007-03/libMyPainterSL4_51902.so'?y -rwxr-xr-x 1 rodriges e875 124459 Mar 12 2008 libMyPainterSL4_51902.so* real 190m44.659s user 0m0.156s sys 0m2.681s df -m /minos/data 28311552 18119726 10191827 65% /minos/data ============================================================================= 2008 11 14 ============================================================================= ######## # DATA # ######## Date: Fri, 14 Nov 2008 16:44:29 +0000 (GMT) From: Arthur Kreymer To: plunk@fnal.gov Cc: minos-data@fnal.gov, votava@fnal.gov Subject: FYI, people involved in new /minos/data disk deployment : CSI/SVC - Andy Romero BlueArc deployment and migration, intervening as necessary past 10 PM, and as early as 4 AM. Ray Pasetes CSI group HeadC CSI/DSS ( FNALU ) Margaret Greaney mounted the new disks and fixed stale handles on FNALU Wayne Baisley DSS group head Jack Schmidt CSI Dept Head FEF - Glenn Cooper - assisted in planning and coordination Jason Harrington -/minos/data2 mounts on Minos systems Ling Ho - stale file handle cleanup 10 PM Thursday Jason Allen FEF Dept Head Grid - Steve Timm assisted in planning, set up the new /minos/data2 mounts, cleaned up stale file handles Thursday night. Keith Chadwick Grid Services group head Eileen Berman Grid Dept Head ############ # BLUWATCH # ############ ln -sf bluwatch.20081114 bluwatch # was bluwatch.20080724 rm /afs/fnal.gov/files/data/minos/log_data/bluwatch/STOP set nohup ; ${HOME}/minos/scripts/bluwatch & ######### # ADMIN # ######### minos27 is available for testing. Stray message at login, aklog: unable to obtain tokens for cell fnal.gov (status: 11862788). ######## # FARM # ######## Restarted concatenation ( cedar ) 14:50 mv NOCAT NOCAT.ok Need to investigate DUP files in N00015122 ############ # DATABASE # ############ Doing monthly backups, cut/paste from the new dbarchive script. Next month will run the script as such ! ######## # GRID # ######## /minos/data2 mounts on cdf nodes seem flaky condor_status | grep @ | grep '\. LINUX' | cut -f 2 -d @ | cut -f 1 -d . | sort -u > /tmp/cdfhosts scp fcdfcaf1699:/tmp/cdfhosts /tmp/cdfhosts for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/data /minos/data2 > /dev/null' sleep 1 ; done 2>&1 | tee /tmp/scan1.lis Most nodes lack /minos/data2, but many are intermittent. 
NTHOSTS=' fcdfcaf1573 fcdfcaf1583 fcdfcaf1586 fcdfcaf1663 fcdfcaf1669 fcdfcaf1670 fcdfcaf1672 fcdfcaf1680 fcdfcaf1681 fcdfcaf1684 ' ls: /minos/data: Stale NFS file handle fcdfcaf1674 do_ypcall: clnt_call: RPC: Timed out Scan for FS that should be good: for HOST in `cat /tmp/cdfhosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/scratch /minos/data > /dev/null && echo' done 2>&1 | tee /tmp/scansd.lis Everything is fine, except for the NFS file handles on select nodes. Repeated scan for just data2, missing on all but fcdfcaf1512 Scanning GPFARM hosts condor_status | grep LINUX | wc -l 1089 condor_status | grep fnpc | cut -f 2 -d @ | \ cut -f 1 -d . | sort -u > /tmp/gphosts wc -l /tmp/gphosts 204 /tmp/gphosts scp fnpc340:/tmp/gphosts /tmp/gphosts for HOST in `cat /tmp/gphosts`; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/scratch /minos/data /minos/data2 > /dev/null ;echo' sleep 1 ; done 2>&1 | tee /tmp/scangp1.lis CDF data2 mounts were corrected about 13:25 by Timm. Passed the stale NFS handles on to FEF. for HOST in ${NTHOSTS} ; do printf "${HOST} " ssh -ax ${HOST} 'ls -ld /minos/data > /dev/null && echo' ; done ######## # DATA # ######## Date: Fri, 14 Nov 2008 12:10:01 -0600 (CST) Subject: HelpDesk ticket 124929 ___________________________________________ Short Description: CDF node mounts intermittent for /minos/data2, and some stale NFS handles Problem Description: The /minos/data2 file system seems to come and go on FermiGrid CDF nodes. For example, on fcdfcaf1502, it is visible about half the time, as seen by 'df -h /minos/data2' or 'ls -ld /minos/data2' A few nodes have stale NFS file handles for /minos/data : fcdfcaf1573 fcdfcaf1583 fcdfcaf1586 fcdfcaf1663 fcdfcaf1669 fcdfcaf1670 fcdfcaf1672 fcdfcaf1680 fcdfcaf1681 fcdfcaf1684 ___________________________________________ ######## # DATA # ######## Date: Fri, 14 Nov 2008 16:59:30 +0000 From: Philip Rodrigues To: Arthur Kreymer Subject: Re: /minos/data status - coming soon - please stand by ! Hi Art, > There may still be stale NFS handles on FNALU batch nodes, > we hope to fix that tomorrow morning. Just to let you know, I'm seeing stale file handles on CDF nodes. Other nodes seem to be working fine. 
Thanks, Phil ######## # DATA # ######## Continue cleanout of MOVED files, rubin@fnpcsrv1 find /minos/data/reco_far.MOVED -user rubin -exec ls -l {} \; 1542 find /minos/data/reco_far.MOVED -user rubin -exec chmod g+w {} \; find /minos/data/reco_near.MOVED -user rubin -exec ls -l {} \; find /minos/data/reco_near.MOVED -user rubin -exec chmod g+w {} \; mindata@minos26 $ time rm -r /minos/data/reco_far.MOVED real 9m20.399s user 0m0.112s sys 0m2.966s minfarm@fnpcsrv1 time rm -r /minos/data/reco_far.MOVED real 46m33.447s Stopped to set g+w for mindata files find /minos/data/reco_far.MOVED -user mindata -exec ls -l {} \; 52258 find /minos/data/reco_far.MOVED -user minfarm -exec ls -l {} \; 19266 Lots fewer minfarm files, let's chmod them, and remove under mindata find /minos/data/reco_far.MOVED -user minfarm -exec chmod g+w {} \; time rm -r /minos/data/reco_near.MOVED real 98m42.239s user 0m0.121s sys 0m3.328s ============================================================================= 2008 11 13 ============================================================================= ######## # DATA # ######## Creating rsync command for data -> data2 replica Steal from gridappsync, and HOWTO.rsync Preview, DIR=mcimport/boehm/mcin time rsync -r /minos/data/${DIR}/ /minos/data2/${DIR} \ -n --perms --times --links --size-only --delete --verbose { echo ; date printf "time rsync -r /minos/data/${DIR}/ /minos/data2/${DIR} \ -n --perms --times --links --size-only --delete --verbose " time rsync -r /minos/data/${DIR}/ /minos/data2/${DIR} \ --perms --times --links --size-only --delete --verbose } 2>&1 | tee -a /home/minsoft/datasync.log Test the boehm files, Restarted, after adding / to the source path, to avoid creation of an extra subdirectory at the distination. RDIRS=`ls /minos/data | grep -v analysis | grep -v users` beam_data condor-limbo condor-tmp d10 flux log_data maint mcimport mcout_data mindata minfarm mysql reco_far reco_near release_data validation for DIR in ${RDIRS} ; do { echo ; date printf "time rsync -r /minos/data/${DIR}/ /minos/data2/${DIR} \ -n --perms --times --links --size-only --delete --verbose " time rsync -r /minos/data/${DIR}/ /minos/data2/${DIR} \ --perms --times --links --size-only --delete --verbose } 2>&1 | tee -a /home/minsoft/datasync.log done Ganglia shows stable data rates of 15 to 20 MBytes/seconds on minos-mysql1, starting around 16:12 101 minutes to update mcimport, done at 16:40 ... Thu Nov 13 18:12:24 CST 2008 time rsync -r /minos/data/validation/ /minos/data2/validation -n --perms --times --links --size-only --delete --verbose real 7m29.180s Correct directory ownerships chown 3648 /minos/data2/condor-limbo chown 3648 /minos/data2/condor-tmp chown 3648 /minos/data2/validation Create symlinks TEST for DIR in ${RDIRS} ; do echo mv ${DIR} ${DIR}.MOVED echo ln -s /minos/data2/${DIR} ${DIR} ls -ld ${DIR} echo done MOVE cd /minos/data for DIR in ${RDIRS} ; do mv ${DIR} ${DIR}.MOVED ln -s /minos/data2/${DIR} ${DIR} ls -ld ${DIR} echo done Clean up after not haveing done cd before the first move cd for DIR in ${RDIRS} ; do ls -ld ${DIR} done Generally, files moved seemed to match what was expected, except for reco_near and reco_far lists, nothing was copied. Repeated the NUFILES scan, Thu Nov 13 19:29:51 CST 2008 GOT 3238 Which files are not present ? GOT=0 ; for FILE in ${NUFILES} ; do [ ! 
-r /minos/data2/${FILE} ] && (( GOT++)) && ls -l /minos/data/${FILE} done ; date ; printf " GOT ${GOT} \n" These are all condor-tmp and minfarm/DBM/dbtables/checksum files DANGER DANGER DANGER It seems I should have used option --archive, which preserves perms, links, times, group, owner, devices or have added --owner --group The files which I rsync'd are now owned by root !!!! Files were written to condor-tmp mcimport minfarm DIR=condor-tmp DIR=minfarm 1041 DIR=mcimport 2747 ( the find took about 20' ) ROOFILES=`find /minos/data2/${DIR} -user root | cut -f 5- -d /` for FILE in ${ROOFILES} ; do ls -ld /minos/data2/${DIR}/${FILE} chown --reference=/minos/data/${DIR}.MOVED/${FILE} /minos/data2/${DIR}/${FILE} done DONE ! Prepare to remove the .MOVED directories for mcout_data, reco_near, reco_far Mysql> time du -sm /minos/data/mcout_data.MOVED 5480489 /minos/data/mcout_data.MOVED real 0m7.086s user 0m0.070s sys 0m0.797s 5480489 /minos/data2/mcout_data real 0m7.095s user 0m0.059s sys 0m0.692s Now trying to remove mcout_data.MOVED, hard because so many files are owned by rubin. rubin@fnpcsrv1 find /minos/data/mcout_data.MOVED -user rubin -exec chmod g+w {} \; 22:54 -bash-3.00$ time rm -r /minos/data/mcout_data.MOVED real 207m39.066s user 0m0.127s sys 0m2.798s ######## # DATA # ######## The second replication to data2 is still running. Checking access times of something scanned by bluwatch /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-11 -rw-r--r-- 1 minfarm e875 1922308910 Nov 11 18:58 /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-11/N00009300_0000.spill.sntp.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minfarm e875 1922308910 Nov 12 21:19 /minos/data2/reco_near/cedar_phy_bhcurv/sntp_data/2005-11/N00009300_0000.spill.sntp.cedar_phy_bhcurv.0.root Things not recently scanned. -rw-r--r-- 1 minfarm e875 1047749340 Oct 4 16:10 /minos/data/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11/N00009300_0000.spill.mrnt.cedar_phy_bhcurv.1.root -rw-r--r-- 1 minfarm e875 1047749340 Nov 12 16:19 /minos/data2/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11/N00009300_0000.spill.mrnt.cedar_phy_bhcurv.1.root Also some of the old mcin/boehm/mcin files -rw-r--r-- 1 mindata e875 383622203 Nov 8 22:14 n00009592_0001_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 420489636 Nov 8 22:14 n00009592_0002_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 423754360 Nov 8 22:14 n00009592_0003_spill_D04_cedarphybhcurvMRE.reroot.root and dcache -rw-r--r-- 1 mindata e875 388314221 Nov 9 02:51 n00009219_0011_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 375754081 Nov 9 02:51 n00009219_0012_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 356026673 Nov 9 02:51 n00009219_0013_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 400541244 Nov 9 02:51 n00009219_0014_spill_D04_cedarphybhcurvMRE.reroot.root romero has found the problem, removal of the previous snapshot caused a full copy of data to data2. This could take another week. Creating a summary of files to be copies to data2, based on the summary in /minos/scratch/mindata/newdata.log NDL=/minos/scratch/mindata/newdata.log cd ~kreymer/minos/scripts grep /minos/data ${NDL} | tr -s ' ' | cut -f 5 -d ' ' | count Enter numbers to be added : Got 3519 /tmp/FOO numbers 56645368420 57 GBytes. Select only those modified in November : MINOS26 > grep /minos/data ${NDL} | grep ' Nov ' | tr -s ' ' | cut -f 5 -d ' ' | count Enter numbers to be added : Got 3056 /tmp/FOO numbers 35871684801 36 GBytes. 
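The root-ownership repair above can be generalized to every replicated directory. A sketch only, assuming the originals are still present as /minos/data/DIR.MOVED and that it runs with enough privilege to chown:

# Sketch: give files in the data2 copy back the ownership of the originals.
# The .MOVED source trees and the directory list are as used above.
for DIR in condor-tmp mcimport minfarm ; do
  find /minos/data2/${DIR} -user root | while read COPY ; do
    ORIG=/minos/data/${DIR}.MOVED${COPY#/minos/data2/${DIR}}
    [ -e "${ORIG}" ] && chown --reference="${ORIG}" "${COPY}"
  done
done

Running the rsync with --archive ( or adding --owner --group ) in the first place avoids the repair entirely, as noted under DANGER above.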
Summary by directory : FDIRS=`ls /minos/data | grep -v analysis | grep -v users` for DIR in mcimport ; do for DIR in ${FDIRS} ; do COUNTS=`grep /minos/data/${DIR} ${NDL} | tr -s ' ' | cut -f 5 -d ' ' | count` BYTES=`printf "${COUNTS}\n" | tail -1` (( MB = BYTES / 1000000 )) NFILES=`printf "${COUNTS}\n" | grep Got | cut -f 3 -d ' '` printf "%12s %6d %6d\n" ${DIR} ${NFILES} ${MB} done DIRECTORY FILES MB beam_data 0 0 condor-limbo 0 0 condor-tmp 915 2 d10 0 0 flux 0 0 log_data 0 0 maint 0 0 mcimport 1468 18089 mcout_data 0 0 mindata 0 0 minfarm 1100 25986 mysql 0 0 reco_far 23 3058 reco_near 13 9508 release_data 0 0 validation 0 0 ######## # GRID # ######## scavan still having trouble with cert. Looks OK in VOMRS /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven Cavanaugh/CN=UID:scavan His robot cert is not registered. ============================================================================= 2008 11 12 ============================================================================= ########## # CONDOR # ########## Added analysis in /fermilab/minos for asousa rodriges scavan In VOMRS, Status Approved Roles Member Groups /fermilab/minos Group Roles Analysis My Groups Only - check Good listing, but still finds 4912 records, non-group members Group/Group Role Status - approved - limits this to approved users, but only those with analysis role. Removed Group Roles, get 153 records, but only see approved roles. Removed Status and Group Roles entries, tried again, Get 182 rows ! Let's get a list of active Condor users : condor_userprio -all -allusers | grep fnal.gov | sort \ | grep -vfnal.gov@fnal.gov | cut -f 1 -d @ | wc -l 46 ahimmel himmel asousa sousa bckhouse backhouse brebel rebel bspeak speakman cherdack cherdack deb4 Bhattacharya djauty auty idanko danko jdejong dejong jjling ling jyuko ma koskinen koskinen masaki watabe mtavera tavera naples naples nsmayer mayer ochoa ochoa petyt petyt pittam pittam rahaman rahaman rearmstr armstrong rhatcher hatcher rmehdi mehdiyev sfarrell farrell sfiligoi sfiligoi sjc coleman tagg tagg tinti tinti tjyang yang vahle vahle whitehd whitehead zisvan isvan Pre approved hartnell hartnell kreymer kreymer loiacono loiacono mishi ishitsuka med dorman nickd devenish pawloski pawloski rbpatter patterson rodriges rodrigues rustem ospanov scavan cavanaugh ######## # DATA # ######## Scan for files newer than the first snapshot. $ { date ; find /minos/data -type f -ctime -7 -exec ls -ld {} \; ; date ; } 2>&1 | tee /tmp/newdata.log Wed Nov 12 10:39:14 CST 2008 Oops finding lots in the .snapshot directory, and wasting time in anaysis and users. $ FDIRS=`ls /minos/data | grep -v analysis | grep -v users` for DIR in ${FDIRS} ; do { printf "\n${DIR} `date`\n" find /minos/data/${DIR} -type f -ctime -7 -exec ls -ld {} \; } 2>&1 | tee -a /tmp/newdata.log done condor-tmp ... 
d10 Wed Nov 12 10:57:15 CST 2008 flux Wed Nov 12 10:57:15 CST 2008 log_data Wed Nov 12 11:02:22 CST 2008 maint Wed Nov 12 11:02:22 CST 2008 mcimport Wed Nov 12 11:02:22 CST 2008 First pass missed logging of headers, restarted beam_data Wed Nov 12 11:04:13 CST 2008 condor-limbo Wed Nov 12 11:04:13 CST 2008 condor-tmp Wed Nov 12 11:04:13 CST 2008 d10 Wed Nov 12 11:04:19 CST 2008 flux Wed Nov 12 11:04:19 CST 2008 log_data Wed Nov 12 11:04:29 CST 2008 maint Wed Nov 12 11:04:29 CST 2008 mcimport Wed Nov 12 11:04:29 CST 2008 mcout_data Wed Nov 12 12:20:14 CST 2008 mindata Wed Nov 12 12:25:12 CST 2008 minfarm Wed Nov 12 12:25:13 CST 2008 mysql Wed Nov 12 13:04:40 CST 2008 reco_far Wed Nov 12 13:06:59 CST 2008 reco_near Wed Nov 12 13:13:53 CST 2008 release_data Wed Nov 12 13:16:23 CST 2008 validation Wed Nov 12 13:16:27 CST 2008 MINFARM > wc -l /tmp/newdata.log 3551 /tmp/newdata.log MINFARM > grep /minos/data /tmp/newdata.log | wc -l 3519 Count files modified in each month : MINFARM > for N in 4 5 6 7 8 9 ; do printf " ${N} " ; grep /minos/data /tmp/newdata.log | grep "Nov ${N}" | wc -l ; done MINFARM > for N in 10 11 12 ; do printf " ${N} " ; grep /minos/data /tmp/newdata.log | grep "Nov ${N}" | wc -l ; done 4 0 5 537 6 1145 7 300 8 205 9 53 10 56 11 191 12 37 echo '537 +1145 +300 +205 +53 +56 +191+37' | bc 2524 NUFILES=`grep /minos/data /minos/scratch/mindata/newdata.log | cut -f 4- -d /` GOT=0 ; for FILE in ${NUFILES} ; do [ -r /minos/data2/${FILE} ] && (( GOT++)) done ; date ; printf " GOT ${GOT} \n" Wed Nov 12 14:39:01 CST 2008 GOT 230 Thu Nov 13 08:22:21 CST 2008 GOT 230 Thu Nov 13 19:29:51 CST 2008 GOT 3238 Let's get a new review of these files, with change times rather than the default modification times. for FILE in ${NUFILES} ; do ls -lc /minos/data/${FILE} ; done \ > /minos/scratch/mindata/newchange.log for N in 4 5 6 7 8 9 ; do printf " ${N} " grep /minos/data /minos/scratch/mindata/newchange.log | grep "Nov ${N}" | wc -l ; done for N in 10 11 12 ; do printf " ${N} " grep /minos/data /minos/scratch/mindata/newdata.log | grep "Nov ${N}" | wc -l ; done Day Files 4 0 5 565 6 675 7 336 8 1598 9 55 10 56 11 191 12 37 Sum is 3513 ============================================================================= 2008 11 11 ============================================================================= ############# # DBARCHIVE # ############# Creating script from HOWTO.dbarchive. Dropping use of script command, tee into log files instead, now what we do not cut/paste from the terminal. ######## # GRID # ######## Subject: Help Desk Ticket 119292 Has Been Resolved. 
Ticket closed, no more globus errors 17 or 43 since our glideinWMS upgraded to Condor 7.1.3 ######## # MAIL # ######## As usual, an email to stk-users bounces from minos-shifters : Your message cannot be delivered to the following recipients: Recipient address: c.bungau@SUSSEX.AC.UK Reason: Remote SMTP server has rejected address Diagnostic code: smtp;550 unknown user, or c.bungau has a bad forwarding address Remote system: dns;smtp2.susx.ac.uk (TCP|131.225.111.11|43150|139.184.14.93|25) (sivits.uscs.susx.ac.uk ESMTP Exim 4.64 Tue, 11 Nov 2008 16:15:41 +0000) Recipient address: kafv1@SUSSEX.AC.UK Reason: Remote SMTP server has rejected address Diagnostic code: smtp;550 unknown user, or kafv1 has a bad forwarding address Remote system: dns;smtp2.susx.ac.uk (TCP|131.225.111.11|43150|139.184.14.93|25) (sivits.uscs.susx.ac.uk ESMTP Exim 4.64 Tue, 11 Nov 2008 16:15:41 +0000) minos shifters does contain Cristian Bungau Elisabeth Falk I find no Bungau under the Sussex shift index. bungau does exist at Fermilab, forwarded to c.bungau@SUSSEX.AC.UK Removed bungau from minos-shifters Changed to kafv1 to e.falk ########### # ENSTORE # ########### Date: Tue, 11 Nov 2008 08:54:15 -0600 SSA Group needs to reboot the system which manages the STK Silos for Public Enstore %28STK%29 and D0en Enstore. We need to do this as soon as we can, but we want to do this carefully so we cause as little disruption of service as possible. This will only affect the STK libraries. STKen: 9940.library_manager & CD-9940B.library_manager D0en: D0-9940B.library_manager & mezsilo.library_manager CDFen: CDF-9940B-D0.library_manager We will begin draining these libraries at 09:00. Users will still be able to submit requests which will be queued up for the library manager. Draining will allow any tape work that is already in progress to complete. No new mount requests will be handled until after the reboot. Once any copies that are already in progress finish and those tapes get dismounted, we will be able to reboot the %27fntt%27 library front end system. Once the reboot is accomplished, we will restart Enstore processes as necessary. We will then re-open the library managers for normal use. We will have these re-opened as soon as possible. Again, this will only affect access to 9940 type tapes. ----------------------------------------------------------------------- Date: Tue, 11 Nov 2008 10:13:07 -0600 SSA Group has reboot the %27fntt%27 system. Everything went smoothly. All libraries are re-opened. STKen: 9940.library_manager & CD-9940B.library_manager D0en: D0-9940B.library_manager & mezsilo.library_manager CDFen: CDF-9940B-D0.library_manager ######## # GRID # ######## Date: Tue, 11 Nov 2008 09:32:04 -0600 From: Edward Simmonds To: kreymer@fnal.gov Subject: New Grid certificates Art, I don't think we've met, but I'm the "new guy" in Jason Allen's department. I've been asked to install new Grid certificates on minos01-26, because the current certs expire Thursday. I'd like to update one server, minos26 if that works for you, and have someone test to make sure the new certificates are working properly. In other words, I don't want to install all twenty-six and have it break something. Can I install the new cert on minos26 and have you (or anyone you suggest) test it before I update the other 25 servers? Thanks much, Edward Simmonds ----------------------------------------- Date: Tue, 11 Nov 2008 15:42:11 +0000 (GMT) From: Arthur Kreymer The Minos Cluster host cert's are used by the Condor batch system. 
I suggest updating the cert on minos01 first.
The real test is to see Condor jobs start and finish after the upgrade.
Let me know when the cert is upgraded on minos01, and I will run a test job.
Then minos02 through 24 can be upgraded.
minos25 is the Condor master node, and the most sensitive to problems.
It should be upgraded last.
-----------------------------------------
Date: Tue, 11 Nov 2008 09:49:29 -0600
From: Edward Simmonds
To: Arthur Kreymer
Cc: minos-admin@fnal.gov
Subject: Re: New Grid certificates
Arthur Kreymer wrote:
> I suggest updating the cert on minos01 first.
> The real test is to see Condor jobs start and finish after the upgrade.
> Let me know when the cert is upgraded on minos01,
Okay, I'll do this right now and send you an email.
-----------------------------------------
Date: Tue, 11 Nov 2008 09:57:49 -0600
The new cert is installed on minos01.
Please test and let me know the results.
-----------------------------------------
Date: Tue, 11 Nov 2008 16:06:47 +0000 (GMT)
From: Arthur Kreymer
A Condor test job has run on minos01 after the cert update.
Please go ahead with the rest of the Minos Cluster cert updates.
-----------------------------------------
Date: Tue, 11 Nov 2008 10:59:15 -0600
All Grid certificates have been installed on minos01 through 26.
Please let me know if you have any issues.
-----------------------------------------
Date: Tue, 11 Nov 2008 17:02:16 +0000 (GMT)
From: Arthur Kreymer
Thanks !
I have seen at least one new Condor glideinWMS job run,
so I think we are in good shape.
-----------------------------------------
########
# DATA #
########
mysql and validation are in data2 this morning !
date ; df -m /minos/data2 | grep '^ '
Tue Nov 11 14:17:25 GMT 2008
28311552 22455831 5855722 80% /minos/data2
The output of df seems constant, perhaps the first pass is complete !
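Rather than spot checks by hand, the same df can be left polling. A sketch, with an arbitrary 10 minute interval and an assumed log path under /tmp :

# Sketch: timestamped free-space tracking while the replication runs.
# Interval and log path are arbitrary ; stop it with ctrl-c when done.
while true ; do
  { date ; df -m /minos/data /minos/data2 ; echo ; } >> /tmp/data2free.log
  sleep 600
done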
Andy Romero released the snapshot date ; df -m /minos/data | grep '^ ' Tue Nov 11 14:24:00 GMT 2008 28311552 28261948 49605 100% /minos/data Tue Nov 11 14:26:30 GMT 2008 28311552 28196767 114786 100% /minos/data romero is starting the next replication Tue Nov 11 14:38:02 GMT 2008 28311552 27934584 376969 99% /minos/data Tue Nov 11 14:57:58 GMT 2008 28311552 27285167 1026386 97% /minos/data 28311552 22455831 5855722 80% /minos/data2 Tue Nov 11 15:49:51 GMT 2008 28311552 26727884 1583669 95% /minos/data 28311552 22455869 5855684 80% /minos/data2 Second replication started around 12:00 CST Tue Nov 11 12:26:20 CST 2008 28311552 26728252 1583301 95% /minos/data 28311552 22093690 6217863 79% /minos/data2 Tue Nov 11 13:12:40 CST 2008 28311552 26728382 1583171 95% /minos/data 28311552 22240488 6071065 79% /minos/data2 Tue Nov 11 15:59:20 CST 2008 28311552 26728807 1582746 95% /minos/data 28311552 21826461 6485092 78% /minos/data2 Check removed files in mcin : du -sm /minos/data2/mcimport/boehm/mcin/dcache MIN > ls /minos/data2 beam_data condor-limbo condor-tmp d10 flux log_data maint mcimport mcout_data mindata minfarm mysql reco_far reco_near release_data validation MIN > du -sm /minos/data2/* 259008 /minos/data2/beam_data 1 /minos/data2/condor-limbo 3286 /minos/data2/condor-tmp 2 /minos/data2/d10 du: cannot read directory `/minos/data2/flux/gnumi/v19/fluka05_le010z185i_old': Permission denied 3862464 /minos/data2/mcimport 5480489 /minos/data2/mcout_data 1 /minos/data2/mindata 1 /minos/data2/mindata du: cannot read directory `/minos/data2/minfarm/farmtest/.certs/rubin': Permission denied 2687538 /minos/data2/minfarm 445852 /minos/data2/mysql 2131478 /minos/data2/reco_far 4094215 /minos/data2/reco_near 2251 /minos/data2/release_data 85846 /minos/data2/validation Sum 19052432 Checking known deleted files : du -sm /minos/data/mcimport/boehm/ 2558 /minos/data/mcimport/boehm/ du -sm /minos/data2/mcimport/boehm/ 521751 /minos/data2/mcimport/boehm/ du -sm /minos/data2/mcimport/boehm/mcin/dcache 274817 /minos/data2/mcimport/boehm/mcin/dcache MINOS26 > date Tue Nov 11 13:19:18 CST 2008 Date: Tue, 11 Nov 2008 23:02:24 +0000 (GMT) From: Arthur Kreymer The second pass of replication to /minos/data2 is still running. This seems likely to finish this evening, perhaps early tomorrow morning. We will then need one more replication to /minos/data2, with the file system unmounted. I will sent a note when this starts. Again, please stand by, and minimize access to /minos/data or scratch. PLAN FOR FINAL /minos/data2 CUTOVER mindata@minos26 minfarm@minos26 minsoft@minos26 cd /minos/data DDIRS=`find . -type d -maxdepth 1 -user ${LOGNAME} -exec basename {} \; \ | sort |grep -v analysis | grep -v users` printf "${DDIRS}\n" TEST for DIR in ${DDIRS} ; do echo mv ${DIR} ${DIR}.MOVED echo ln -s /minos/data2/${DIR} ${DIR} ls -ld ${DIR} echo done MOVE for DIR in ${DDIRS} ; do mv ${DIR} ${DIR}.MOVED ln -s /minos/data2/${DIR} ${DIR} ls -ld ${DIR} echo done ============================================================================= 2008 11 10 ============================================================================= ######## # GRID # ######## Date: Mon, 10 Nov 2008 15:47:13 -0800 (PST) From: Ryan B. Patterson FYI: We have increased he total number of allowed glidein pilots from 250 to 400, of which 50 may be on CDF nodes. 
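A quick cross check on how many pilots are actually in the pool, counting execute hosts by the GP and CDF worker name patterns used in the scans elsewhere in this log. A sketch only :

# Sketch: count slots in the pool, split by worker name pattern.
condor_status -format '%s\n' Machine > /tmp/poolhosts
printf "total slots   : " ; wc -l < /tmp/poolhosts
printf "GP  ( fnpc )    : " ; grep -c fnpc /tmp/poolhosts
printf "CDF ( fcdfcaf ) : " ; grep -c fcdfcaf /tmp/poolhosts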
######## # DATA # ######## MIN > df -m /minos/data* Filesystem 1M-blocks Used Available Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28311552 28256092 55461 100% /minos/data blue2.fnal.gov:/minos/data 28311552 22503981 5807572 80% /minos/data2 mysql and validation are still not in data2. ######## # DATA # ######## Date: Mon, 10 Nov 2008 12:35:48 -0600 (CST) Subject: HelpDesk ticket 124533 ___________________________________________ Short Description: Deployment plan for new BlueArc /minos/data disk Problem Description: Please forward this to CSI fermigrid-help run2-sys fnalu-admin Per conversation with Andy Romero this morning, here is a plan for active deployment of the new Minos data disks. 1) CSI - export the new disks as blue2:/minos/data to all systems presently mounting minos-nas:/minos/data Do this ASAP, and inform minos-data, fermigrid-help, run2-sys, fnalu-admin 2) run2-sys , fermigrid-help , fnalu-admin Mount blue2:/minos/data as /minos/data2, on all systems where /minos/data is presently mounted. Do this as soon as blue2 is exported, see step 1) above 3) CSI - complete the data copy from minos-nas:/minos/data to blue2:/minos/data, including touchup copies. This is likely to finish today. 4) CSI/Kreymer coordinated deployment : CSI - Make the final touchup copy, with readonly file systems Kreymer - rename the copied directories to *.MOVED in minos-nas create symlinks for each directory from minos-nas to blue2 CSI - Make file systems writeable. Arthur Kreymer can be reached at x4261, or cell 630 697 0469 ___________________________________________ Date: Mon, 10 Nov 2008 13:20:29 -0600 (CST) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Mon, 10 Nov 2008 13:35:04 -0600 (CST) From: Steven Timm To: kreymer@fnal.gov Cc: fermigrid-help@fnal.gov FermiGrid has mounted the /minos/data2 = blue2:/minos/data on all the places where the other minos volumes are mounted. Steve Timm ___________________________________________ Date: Mon, 10 Nov 2008 15:11:12 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/WST Group. ___________________________________________ Date: Mon, 10 Nov 2008 15:13:27 -0600 (CST) Note To Requester: The following (read-only ... for now) export has been created: blue2.fnal.gov:/minos/data The suggested NFS mount options are: -o rsize=32768,wsize=32768,timeo=600,proto=tcp,vers=3,hard,intr ___________________________________________ Date: Mon, 10 Nov 2008 15:29:51 -0600 (CST) From: Margaret_Greaney fnalu nodes were updated for the new mount. ___________________________________________ Date: Mon, 10 Nov 2008 21:52:02 +0000 (GMT) From: Arthur Kreymer To: Margaret_Greaney On FNALU batch nodes, I see the new blue2:/minos/data file system mounted on /minos/data, rather than /minos/data2. Please restore the original minos-nas:/minos/data mount on /minos/data, and add a new mount of blue2:/minos/data on /minos/data2. Thanks ___________________________________________ Date: Mon, 10 Nov 2008 16:03:53 -0600 (CST) From: Margaret_Greaney done __________________________________________ Date: Mon, 10 Nov 2008 15:44:58 -0800 (PST) From: Ryan B. Patterson To: kreymer@fnal.gov Subject: condor-tmp and condor-limbo ownership I've changed ownership of these areas to 'mindata'. Enjoy. 
___________________________________________ Date: Tue, 11 Nov 2008 10:54:15 -0600 From: Jason Harrington To: Arthur E Kreymer Cc: run2-sys@fnal.gov Subject: /minos/data2 /minos/data2 has been installed on all nodes listed in the 'minos-cluster' sysadmin db cluster with the following exceptions: > minos-mysql2 (no /minos/data) > minos-mysql3 (login permission denied) > minos-sam04 (no /minos/data) > minos27 (ssh connection refused, telnet login permission denied) ___________________________________________ Verbal - This morning's replication crashed, needed to clear the snapshot. Andy restarted it around 12:00 Will evaluate again around 16:30, after his class. ___________________________________________ Date: Wed, 12 Nov 2008 16:14:04 +0000 (GMT) From: Arthur Kreymer blue2:/minos/data is mounted on all our clients as /minos/data2. I am now tracking free space in /minos/data2 hourly at http://www-numi.fnal.gov/computing/dh/mdfree/data2/NOW.txt Replication seems to be both adding and removing files. 300 GB was freed up just before 09:44. I am watching for removal of 520 GB of *.reroot.root files under /minos/data2/mcimport/boehm/mcin. ___________________________________________ Date: Wed, 12 Nov 2008 23:41:09 +0000 (GMT) From: Arthur Kreymer To: minos-data@fnal.gov Cc: romero@fnal.gov Subject: Replication status and estimates Andy, yes, I did a full filescan earlier today. Between 11:04 and 13:16, I did a find in replicating directories. This produced file listing /minos/scratch/mindata/newdata.log There are 3519 files. I have done a day by day scan of the 'change' times of these files, an improvement over 'mod' times previously reported. Day Files 4 0 5 565 6 675 7 336 8 1598 9 55 10 56 11 191 12 37 I have occasionally been doing : NUFILES=`grep /minos/data /minos/scratch/mindata/newdata.log | cut -f 4- -d /` GOT=0 ; for FILE in ${NUFILES} ; do [ -r /minos/data2/${FILE} ] && (( GOT++)) done ; date ; printf " GOT ${GOT} \n" Wed Nov 12 14:39:01 CST 2008 GOT 230 ... Wed Nov 12 17:05:10 CST 2008 GOT 230 We do not seem to be getting more of the files that I expect to see, nor are the files being removed from /minos/data2/mcimport/boehm/mcin/. But there are certainly files being copied, as df indicates net file system size changes, at http://www-numi.fnal.gov/computing/dh/mdfree/data2/NOW.txt I do not know of any other activity that would be doing global file scans. ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ Date: Thu, 13 Nov 2008 19:42:43 +0000 (GMT) From: Arthur Kreymer __ The CSI internal replication is being canceled. Due to premature removal of a Bluearc snapshot, this turned in to a full replication, which would have run for days. The modified plan step 4) is : CSI - cancel the second replication CSI - remove all exports of /minos/data from minos-nas: and blue2: CSI - export these root-enabled to minos-mysql1 kreymer - finish the file system copies using rsync ( estimate 4 hours ) I will log in as root@minos-mysql1 ( already have access ) CSI - restore full read/write exports after the rsync is done. FEF/Fermigrid/FNALU - remounts will likely be needed, due to stale file handles _________________________________________ Date: Thu, 13 Nov 2008 19:45:09 +0000 (GMT) From: Arthur Kreymer We are starting the final replication pass. The exports of /minos/data and data2 have been removed. I have reason to expect this to take less than 4 hours. 
Then file systems will have to be re-exported and remounted. ___________________________________________ Date: Thu, 13 Nov 2008 21:01:36 +0000 (GMT) From: Arthur Kreymer The rsync copies started around 14:36 ___________________________________________ Date: Thu, 13 Nov 2008 22:33:09 +0000 (GMT) From: Arthur Kreymer The rsync copies are cranking along. Files are presently moving to mcimport/mtavera. I guess we are about halfway through this pass. I will check again on progress before about 19:00 this evening. ___________________________________________ Date: Fri, 14 Nov 2008 01:22:56 +0000 (GMT) From: Arthur Kreymer Mount/remounts of /minos/data, /minos/data2 needed as follows : run2-sys - We need dismount/remounts on the Minos Cluster and servers to clear the stale file handles in /minos/data /minos/data2 ( except for minos-mysql1, where I have done this already. ) fermigrid-help - we need remounts of /minos/data on fnpcsrv1, and possibly other nodes. Perhaps we can get this by waiting an hour or so for automount to time out ? fnalu-admin - we need remounts desribed above on FNALU batch nodes, which are seeing stale NFS handles. --------- Summary of plan execution ----------- CSI - cancel the second replication DONE CSI - remove all exports of /minos/data from minos-nas: and blue2: DONE CSI - export these root-enabled to minos-mysql1 DONE kreymer - finish the file system copies using rsync ( estimate 4 hours ) I will log in as root@minos-mysql1 ( already have access ) DONE CSI - restore full read/write exports after the rsync is done. DONE around 19:00 CST, thanks Andy FEF/Fermigrid/FNALU - remounts will likely be needed, due to stale file handles TRUE - requests are listed above ___________________________________________ Date: Thu, 13 Nov 2008 19:49:07 -0600 (CST) From: Steven Timm Stale file handles for /minos/data cleared on fnpcsrv1. Will check workers later. ___________________________________________ Date: Thu, 13 Nov 2008 20:31:07 -0600 (CST) From: Steven Timm To: Arthur Kreymer Subject: Re: HelpDesk ticket 124533 has additional info. Stale file handles on 7 worker nodes in gp grid cleared too, Will look for stale ones on cdf grid later. ___________________________________________ Date: Fri, 14 Nov 2008 02:37:27 +0000 (GMT) From: Arthur Kreymer To: run2-sys@fnal.gov Cc: minos-data@fnal.gov Subject: Urgent - please remount /minos/data and data2 on Minos Cluster Please, if you get a chance this evening, correct the stale file handles on the Minos Cluster, as noted below. This is the last thing that needs to be done before we announce availalbility of the disks to our users. ... ___________________________________________ 20:50 - called helpdesk, requested page of FEF run2-sys 21:00 - helpdesk will page FEF run2-sys Date: Thu, 13 Nov 2008 21:15:56 -0600 (CST) From: HelpDesk Subject: HelpDesk ticket 124803 Short Description: MINOS Cluster - Art Kreymer- 630-840-4261 Problem Description: Detailed Problem Description (if supplied)FEF primary call Art Kreymer X4271 regarding the MINOS Cluster. This ticket is assigned to HO, LING of the CD-SF/FEF. _________________________________________ 21:23 Ling responded, will remount disks by about 22:00 ___________________________________________ Date: Thu, 13 Nov 2008 22:29:23 -0600 From: Ling C. Ho I have remounted /minos/data and /minos/data on minos01-26, minos-sam01-03. Is there anything I missed? 
___________________________________________
Date: Thu, 13 Nov 2008 22:49:45 -0600 (CST)
Thanks, things look good on all nodes,
except for minos07, which still shows NFS timeouts.
___________________________________________
Date: Fri, 14 Nov 2008 04:56:42 +0000 (GMT)
Looks like I missed minos07. It should be fine now.
___________________________________________
Date: Fri, 14 Nov 2008 04:58:09 +0000 (GMT)
From: Arthur Kreymer
Thanks, minos07 is better now.
Thanks again for attending to this so late in the evening.
I will announce availability of the file systems.
This ticket can be closed !
__________________________________________
Date: Thu, 13 Nov 2008 23:04:11 -0600 (CST)
From: HelpDesk
Solution: ling@fnal.gov sent this solution:
NFS mounts restored.
This ticket was resolved by HO, LING of the CD-SF/FEF group.
=============================================================================
2008 11 07
=============================================================================
#########
# ADMIN #
#########
Date: Fri, 07 Nov 2008 23:38:55 +0000 (GMT)
From: Arthur Kreymer
To: ling@fnal.gov
Cc: minos-admin@fnal.gov
Subject: Re: DB servers, and other Grid items
On Wed, 22 Oct 2008, Arthur Kreymer wrote:
> Here is a strawman configuration, for discussion
>
> minos-mysql2
> ...
> minos-sam04
> ...
> minos01 replacement - configured just like minos01, as NIS server.
> ...
> minos-mysql3
> ...
These systems seem to have been on the network for a week or two.
I can log into
kreymer@minos-mysql2
kreymer@minos27 ( using rsh, not ssh )
but not the other two systems.
I do not see the requested minsoft account on minos-mysql2.
When will these systems be available for us to test ?
#########
# MYSQL #
#########
Bootstrapping mysql product onto samread@minos-sam03
scp minsoft@minos-sam03:setups.sh .
mkdir -p ups/db/foo
mkdir -p ups/db/.upsfiles
mkdir -p ups/db/.updfiles
AFSP=/afs/fnal.gov/files/code/e875/general/ups
cp ${AFSP}/db/.upsfiles/dbconfig ups/db/.upsfiles/dbconfig
nedit ups/db/.upsfiles/dbconfig
changed /afs/fnal.gov/files/code/e875/general/ups to /home/samread/ups
cp ${AFSP}/db/.updfiles/updconfig ups/db/.updfiles/updconfig
setup upd
upd install -j mysql v5_0_67
ups declare -c mysql v5_0_67
########
# DATA #
########
/grid/data/minos is over 80% capacity ( 400 GB ).
Again, it is Rustem.
SRV1> du -sm /grid/data/minos/users/*
1 /grid/data/minos/users/boehm
1 /grid/data/minos/users/brebel
1 /grid/data/minos/users/habig
1 /grid/data/minos/users/jdejong
1 /grid/data/minos/users/jjling
1 /grid/data/minos/users/kreymer
1 /grid/data/minos/users/masaki
91 /grid/data/minos/users/mishi
1 /grid/data/minos/users/petyt
315282 /grid/data/minos/users/rustem
1 /grid/data/minos/users/scavan
1 /grid/data/minos/users/tinti
##########
# CONDOR #
##########
Removed the cronjob entry that releases held gfactory jobs;
this should no longer happen, since our Condor 7.1.3 upgrade in gfactory.
#########
# POWER #
#########
recover from planned outage 05:00
###########
# BLUEARC #
###########
Date: Fri, 07 Nov 2008 05:44:00 -0600
From: Andrew J. Romero
To: 'lisa' , 'Jon Bakken' , 'Steven Timm' , "'kreymer@fnal.gov'"
Subject: BlueArc maintenance complete
BlueArc maintenance complete
bluwatch saw the outage, 05:24 through 05:42
########
# DATA #
########
./mcimport boehm
less /minos/data/mcimport/boehm/log/mcimport.log
Fri Nov 7 07:08:35 CST 2008
OK - purging 87 MCIN files ?
MCIN processing 511 files Fri Nov 7 08:04:30 CST 2008 74 files copied by about 9:12 this is a much better rate than yesterday, 1' per file versus 10' per file ./mciboehm -n boehm | tee /tmp/mcib.log $ grep SIZE /tmp/mcib.log | wc -l 1519 $ grep PNFS /tmp/mcib.log | wc -l 81 $ ls /minos/data/analysis/nue/MRERerootFiles/CedarPhyData | wc -l 1600 MINOS26 > du -sm /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE 177736 /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE MINOS26 > find /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE -type f | wc -l 1543 N.B. - should mcimprt -M to suppress sam declarations. ============================================================================= 2008 11 06 ============================================================================= ####### # CVS # ####### global lock for the repository http://www.mail-archive.com/info-cvs@gnu.org/msg33409.html create an empty $CVSROOT/CVSROOT/writers file http://ximbiot.com/cvs/manual/cvs-1.11.20/cvs_2.html#SEC36 ######## # DATA # ######## Moving the remaining nearly 600 files, ./mcimport -b 1 boehm ./mcimport boehm ########### # BLUEARC # ########### Data has been moving from /minos/data to /minos/data2 since Tuesday. Internally, via FC. Only directories : mcimport mcout_data minfarm reco_far reco_near Will change these to symlinks in /M/D/ when the copy is complete. Existing exports : minos-nas-0.fnal.gov:/minos/data /minos/data nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 minos-nas-0.fnal.gov:/minos/scratch /minos/scratch nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 New export : blue2.fnal.gov:/minos/data /minos/data2 nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 ######## # DATA # ######## Test copy of a file which Rubin cannot read via srmcp MINOS26 > ./dc_stat n13011670_0005_L010185N_D00.reroot.root ============================ PNFS status for /pnfs/minos/mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root -rw-r--r-- 1 rhatcher e875 371431752 May 1 2007 n13011670_0005_L010185N_D00.reroot.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;l=371431752; LEVEL 4 VO4722 0000_000000000_0000129 371431752 mcin_near_daikon /pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root 000F00000000000005412E40 CDMS117804817300000 stkenmvr26a:/dev/rmt/tps0d0n:479000037979 455900240 ============================ MINOS26 > ./dccptest /mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root 2,0,0,0.0,0.0 :h=yes;l=371431752; [Thu Nov 6 11:18:31 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root in cache. Connected in 0.00s. Cache open succeeded in 94.03s. 371431752 bytes in 5 seconds (72545.26 KB/sec) -rw-r--r-- 1 kreymer g020 371431752 Nov 6 11:20 /local/scratch26/kreymer/n13011670_0005_L010185N_D00.reroot.root MINOS26 > ./dccptest /mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root 2,0,0,0.0,0.0 :h=yes;l=371431752; [Thu Nov 6 13:23:16 2008] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_00/L010185N/167/n13011670_0005_L010185N_D00.reroot.root in cache. Connected in 0.00s. Cache open succeeded in 0.34s. 
371431752 bytes in 24 seconds (15113.60 KB/sec) -rw-r--r-- 1 kreymer g020 371431752 Nov 6 13:23 /local/scratch26/kreymer/n13011670_0005_L010185N_D00.reroot.root ########### # SRMTEST # ########### srmtest.20081106 - now using X509_USER_PROXY instead of SRM_CONFIG Tested on fnpcsrv1, OK ./mcimport -b 1 boehm ######## # FARM # ######## Setting a flag which will tell the Minos Farm scripts not to reconstruct the archived mcin data from Josh mkdir /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurvMRE/NORECO Cleaning up daikon_04/spill_cedarphybhcurvMRE files produced in error on the farm minospro@minos26 cd /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/spill_cedarphybhcurvMRE PRO> du -sm . 35670 . ( cd /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/spill_cedarphybhcurvMRE/cand_data ; enstore pnfs --tags ) .(tag)(library) = CD-LTO3 .(tag)(file_family) = minos .(tag)(file_family_wrapper) = cpio_odc .(tag)(storage_group) = minos .(tag)(file_family_width) = 1 PRO> find . -type f | wc -l 88 cd /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04 rm -r spill_cedarphybhcurvMRE date Thu Nov 6 10:48:32 CST 2008 ######## # GRID # MILESTONE ######## Upgraded to use the analysis role, jobs will run under minosana MINOS25 > cd /local/scratch25/grid MINOS25 > cp ~rbpatter/computing/sandbox/kproxy.20081106 . MINOS25 > cp ~rbpatter/computing/sandbox/kproxyv.20081106 . MINOS25 > ln -sf kproxy.20081106 kproxy MINOS25 > ln -sf kproxyv.20081106 kproxyv MINOS25 > date Thu Nov 6 08:39:37 CST 2008 MINOS25 > condor_q | tail -1 1291 jobs; 369 idle, 42 running, 880 held MINOS25 > condor_q -run | grep -v minos ID OWNER SUBMITTED RUN_TIME HOST(S) 218557.3 gfactory 11/6 05:20 0+03:20:02 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor 218564.0 gfactory 11/6 06:11 0+02:28:30 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor 218578.3 gfactory 11/6 08:01 0+00:38:27 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor Updated my proxy at 08:47 MINOS25 > /local/scratch25/grid/kproxy MINOS25 > /local/scratch25/grid/kproxyi attribute : /fermilab/minos/Role=Analysis/Capability=NULL ######## # GRID # MILESTONE ######## Ryan has run the first Minos glidein job on CDF nodes those nodes lack /usr/local/etc/setups.sh ########### # ENSTORE # ########### Data Rates plot shows zero from 15:00 through 19:30 yesterday ########## # PARROT # ########## mindata@minos26 PD=/minos/scratch/parrot MD=/afs/fnal.gov/files/data/minos cd ${PD} date ; time ./make_growfs.auto -k ${MD}/d120 Thu Nov 6 07:53:10 CST 2008 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d120/.growfsdir ####### # SAM # ####### Reviewing sam web pages , be sure we have no samzilla running : SAM User registration http://www-numi.fnal.gov/cgi-bin/autoRegister.py Get list of files http://www-numi.fnal.gov/computing/findrun_sam.html SAG http://www-numi.fnal.gov/sam_local/SamAtAGlance/ Web home is /afs/fnal.gov/files/expwww/numi Did global search under cgi-bin, find cgi-bin -name samzilla ============================================================================= 2008 11 05 ============================================================================= 14:37 or so - site wide power outage 15:35 - power back up ######## # DATA # ######## Creating specal mciboehm script to purge archived files from /minos/data/analysis/nue/MRERerootFiles/CedarPhyData Nov 5 13:09 STAGE/boehm/log/mcimport.log The previous iteration of mcimport failed, SRMCPed n00009238_0007_spill_D04_cedarphybhcurvMRE.reroot.root SRMCPed 
n00009238_0008_spill_D04_cedarphybhcurvMRE.reroot.root SRMCPed n00009238_0009_spill_D04_cedarphybhcurvMRE.reroot.root SRMClientV1 : getRequestStatus: try #0 failed with error SRMClientV1 : java.net.ConnectException: Connection timed out setting File Request to "Done" failed java.lang.RuntimeException: java.net.ConnectException: Connection timed out at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1101) at gov.fnal.srm.util.SRMV1CopyJob.done(SRMV1CopyJob.java:188) at gov.fnal.srm.util.Copier.run(Copier.java:359) at java.lang.Thread.run(Thread.java:595) srm client error: java.net.ConnectException: Connection timed out SRMClientV1 : getRequestStatus: try #0 failed with error SRMClientV1 : java.rmi.RemoteException: srm setFileStatus failed; nested exception is: java.lang.RuntimeException: java.lang.IllegalArgumentException: FileRequest fileRequestId =-2144325443does not belong to this Request Exception in thread "Thread-1" java.lang.RuntimeException: java.rmi.RemoteException: srm setFileStatus failed; nested exception is: java.lang.RuntimeException: java.lang.IllegalArgumentException: FileRequest fileRequestId =-2144325443does not belong to this Request at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1101) at gov.fnal.srm.util.SRMV1CopyJob.done(SRMV1CopyJob.java:188) at gov.fnal.srm.util.Copier.cleanup(Copier.java:672) at gov.fnal.srm.util.Copier.run(Copier.java:274) at java.lang.Thread.run(Thread.java:595) OOPS - srmcp failed , bailing Last file copied seems to be : $ ls -l /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurvMRE/923/n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 kreymer e875 332164037 Nov 5 13:08 /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurvMRE/923/n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root $ ls -l STAGE/boehm/mcin/n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root -rw-r--r-- 1 mindata e875 332164037 Sep 28 04:12 STAGE/boehm/mcin/n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root Power outage, then recovery, resume the copies : $ cd STAGE/boehm/mcin $ mv n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root dcache/n00009238_0010_spill_D04_cedarphybhcurvMRE.reroot.root ./mcimport boehm less /minos/data/mcimport/boehm/log/mcimport.log OK - purging 916 MCIN files ? Wed Nov 5 15:56:28 CST 2008 PURGED n00008451_0001_spill_D04_cedarphybhcurvMRE.reroot.root ... $ ls /minos/data/analysis/nue/MRERerootFiles/CedarPhyData | wc -l 1600 Rats, the file sizes do not match ! $ dds /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE/900/n00009003_0000_spill_D03_cedarphyMRE.reroot.root -rw-r--r-- 1 kreymer e875 144656067 Nov 6 2007 /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE/900/n00009003_0000_spill_D03_cedarphyMRE.reroot.root $ dds /minos/data/analysis/nue/MRERerootFiles/CedarPhyData/n00009003_0000_spill_D03_cedarphyMRE.reroot.root -rw-r--r-- 1 mindata e875 110607413 Dec 13 2007 /minos/data/analysis/nue/MRERerootFiles/CedarPhyData/n00009003_0000_spill_D03_cedarphyMRE.reroot.root And data rates for present srmcp's is over 10 minutes per file !!!!!! 
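Regarding the "file sizes do not match" comparison above - a loop like the following would flag every mismatched PNFS/BlueArc pair instead of spot-checking single files. This is only a sketch: the three-digit PNFS subdirectory rule (characters 6-8 of the file name) is inferred from the example paths above, and GNU stat is assumed.
# compare sizes of the CedarPhyData copies against the daikon_03 PNFS copies
MCID=/minos/data/analysis/nue/MRERerootFiles/CedarPhyData
PNFD=/pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE
for FILE in `ls ${MCID}` ; do
  SUB=`echo ${FILE} | cut -c 6-8`          # e.g. n00009003... -> 900
  PFIL=${PNFD}/${SUB}/${FILE}
  [ -f "${PFIL}" ] || { echo "NOPNFS ${FILE}" ; continue ; }
  LSIZ=`stat -c %s ${MCID}/${FILE}`
  PSIZ=`stat -c %s ${PFIL}`
  [ "${LSIZ}" = "${PSIZ}" ] || echo "SIZE ${FILE} local ${LSIZ} pnfs ${PSIZ}"
done
Any NOPNFS or SIZE lines mark files that should not be purged from /minos/data.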
Killed the script, ran one more iteration to clean the pid file, ./mcimport -b 1 boehm ######## # SPAM # ######## Election spam, from unregistered network address, hitting minos_sam_admin minos_software_discussion minos_sim ######## # GRID # ######## Date: Wed, 05 Nov 2008 08:48:57 -0600 (CST) Subject: HelpDesk ticket 124250 ___________________________________________ Short Description: Many files at the top of /grid/data - see closed ticket 120763 Problem Description: There are over 7000 files at the top level of /grid/data with names like 2008-11-05T14:08:01Z-gridftp-probe-test-file-remote.7515 There are about 200 of these per day, dating back through Oct 1. Perhaps the daily purge script has stopped working. ___________________________________________ Date: Thu, 06 Nov 2008 14:37:56 -0600 (CST) Note To Requester: I have manually removed these files again. There is some reason why this script isn't working in cron and we are not sure why. Will keep the ticket open until we get it solved. Steve Timm ___________________________________________ Date: Mon, 10 Nov 2008 13:27:41 -0600 (CST) Solution: The cleanup script is working now. Steve Timm This ticket was resolved by TIMM, STEVE of the CD-SF/GF/FGS group. ___________________________________________ ============================================================================ 2008 11 04 ============================================================================= ######## # GRID # ######## Date: Tue, 04 Nov 2008 14:57:04 -0600 (CST) From: Steven Timm To: kreymer@fnal.gov Subject: minos quota bump Art--as a result of the meeting with Steve W. and Patty M. this morning it was determined that MINOS quota would be (and now has been) increased to 400 slots. I know it is not full opportunistic yet but it will help some. ######## # DATA # ######## Ticket 118354 - raw data writes run daily again, see below ####### # SAM # ####### I think the following is moot, our logs are not browsable Date: Tue, 04 Nov 2008 13:19:04 -0600 From: Robert Illingworth To: Angela Bellavance , Arthur Kreymer Subject: SamZilla web vulnerabilities There are apparently security vulnerabilities with the SamZilla web log file browser. If this is installed on the CDF or Minos webservers, I recommend you either remove it or at least remove execute permission from the python scripts in the ups product cgi directory, until we discovered how serious the problem is and what can be done about it. Robert ########## # ORACLE # ########## Date: Tue, 04 Nov 2008 11:46:19 -0600 minosdev & minosint database hosted on minosora3 will be down for Oracle DB Security patches on Tuesday 11/04/2008 starting 1:30PM Interruption is expected to last 1 hour. -Nelly p.s.  i have cc'd the sysadmin maillist,  giving them a heads up that i will need a script run as root during the patching process Date: Tue, 04 Nov 2008 14:19:56 -0600 minosdev & minosint database hosted on minosora3 october oracle quarterly  work  is completed.   let us know if you have any issues at oem-admin@fnal.gov   our goal is to patch minosora1~minosprd on thursday nov 20, 2008 ============================================================================= 2008 11 03 ============================================================================= ########### # MONTHLY # ########### DATASETS 11/03 PREDATOR 11/03 VAULT 11/03 MYSQL 11/14 using new dbarchive script ( still cut/paste ) ######## # DATA # ######## boehm volunteered about 1 TB of reroot files that can be archived. 
/minos/data/analysis/nue/MRERerootFiles/ See notes 10/30 ./pnfsdirs near MCIN daikon_04 spill_cedarphybhcurveMRE write Mon Nov 3 10:52:04 CST 2008 STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurveMRE FAMSET mcin_near_daikon_04 FAMILY mcin_near_daikon_04 Oops, ./pnfsdirs near MCIN daikon_04 spill_cedarphybhcurvMRE write rmdir /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurveMRE Shifted a few files, for testing MCIF=/minos/data/analysis/nue/MRERerootFiles MCID=${MCIF}/CedarPhyBhcurvData/NearDetector MCIN=/minos/data/mcimport/boehm/mcin $ ls ${MCIF}/CedarPhyBhcurvData/NearDetector | grep n00008454 n00008454_0000_spill_D04_cedarphybhcurvMRE.reroot.root n00008454_0007_spill_D04_cedarphybhcurvMRE.reroot.root n00008454_0014_spill_D04_cedarphybhcurvMRE.reroot.root n00008454_0020_spill_D04_cedarphybhcurvMRE.reroot.root 10:57 mv ${MCIF}/CedarPhyBhcurvData/NearDetector/n00008454* ${MCIN}/ Cleaning up stray dcache files in boehm/mcin/dcache : -rw-r--r-- 1 mindata e875 0 Oct 25 2007 n00009573_0000_spill_D03_cedarphyMRE.reroot.root -rw-r--r-- 1 mindata e875 0 Oct 24 2007 n00009696_0019_spill_D03_cedarphyMRE.reroot.root -rw-r--r-- 1 mindata e875 0 Oct 24 2007 n00009696_0020_spill_D03_cedarphyMRE.reroot.root -rw-r--r-- 1 mindata e875 0 Oct 24 2007 n00009696_0021_spill_D03_cedarphyMRE.reroot.root In log/mcimport.log, these were pending a year ago. MCIN processing 0 files Thu Nov 8 15:14:26 CST 2007 The files were cleanly written to /pnfs/minos/mcin_data/near/daikon_04/spill_cedarphybhcurvMRE/845/ Let's move the rest. $ ls ${MCID} | wc -l 1515 FILES=`ls ${MCID}` printf "${FILES}\n" | wc -l 1515 time for FILE in ${FILES} ; do mv ${MCID}/${FILE} ${MCIN}/${FILE} ; done real 0m56.457s user 0m1.088s sys 0m2.653s $ du -sm ${MCIN} 521751 /minos/data/mcimport/boehm/mcin The above was done around 17:00 ######## # DATA # ######## /minos/data nearly filled on Sunday, then cleared out 237536 Sun Nov 2 00:56:07 CDT 2008 210189 Sun Nov 2 01:56:10 CDT 2008 187179 Sun Nov 2 01:56:11 CST 2008 165368 Sun Nov 2 02:56:12 CST 2008 151886 Sun Nov 2 03:56:13 CST 2008 132018 Sun Nov 2 04:56:14 CST 2008 131748 Sun Nov 2 05:56:16 CST 2008 118768 Sun Nov 2 06:56:18 CST 2008 108241 Sun Nov 2 07:56:19 CST 2008 94099 Sun Nov 2 08:56:22 CST 2008 101133 Sun Nov 2 09:56:25 CST 2008 106683 Sun Nov 2 10:56:29 CST 2008 101025 Sun Nov 2 11:56:31 CST 2008 83327 Sun Nov 2 12:56:35 CST 2008 80768 Sun Nov 2 13:56:38 CST 2008 71174 Sun Nov 2 14:56:40 CST 2008 64220 Sun Nov 2 15:56:43 CST 2008 118041 Sun Nov 2 16:56:47 CST 2008 232765 Sun Nov 2 17:56:51 CST 2008 391707 Sun Nov 2 18:56:55 CST 2008 ########## # CONDOR # ########## minos25 went into overload, average over 100, starting around 02:30 Average is up to 140 around 06:00 Condor activity pretty much shut down globally. Can write to /minos/scratch /minos/data . /grid/data /grid/app MINOS25 > lsof | wc -l 1372 Nothing interesting in the /var/log/messages gfactory plot entries end at 3:30 No unusual activity around 02:30, about 210 running glideins, few idle stuck doing MINOS25 > echo FOO > /grid/fermiapp/touchga MINOS25 > lsof /grid/fermiapp COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME bash 5721 kreymer 1u REG 0,24 4 3937155499 /grid/fermiapp/touchga (blue2:/fermigrid-fermiapp) Load dropped sharply at 09:48. No condor processes are running. System is still about 25% wait I/O, Ganalia shows sustained 4 to 9 MB/sec I/O. 
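An aside on the touchga test above - each stuck write costs an interactive shell. A detached probe along these lines keeps the login usable; a sketch only, the probe path and the 30 second limit are arbitrary, and a write stuck in D state will survive the kill (the probe still reports it).
# timed write probe for a BlueArc/NFS path
PROBE=/grid/fermiapp/minos/touchprobe.$$
( echo FOO > ${PROBE} ) &
WPID=$!
sleep 30
if kill -0 ${WPID} 2>/dev/null ; then
  echo "`date` STUCK writing ${PROBE}"
  kill -9 ${WPID} 2>/dev/null      # best effort, a D-state write ignores this
else
  echo "`date` OK    wrote ${PROBE}"
  rm -f ${PROBE}
fi
Run from cron, the OK/STUCK lines give a timeline comparable to the bluwatch SLO entries.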
MINOS25 > lsof | wc -l 620 MINOS25 > date Mon Nov 3 10:38:07 CST 2008 MINOS25 > cat logs/glide/probe.217336.0.log 000 (217336.000.000) 11/02 14:50:03 Job submitted from host: <131.225.193.25:64258> ... 001 (217336.000.000) 11/02 14:50:32 Job executing on host: <131.225.166.120:60817> ... 007 (217336.000.000) 11/03 09:48:13 Shadow exception! Assertion ERROR on (result) 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job Pawloski reports errors writing to /minos/data last night. cherdack 9777 1.2 0.0 3692 720 pts/2 D+ 09:05 1:10 | \_ mv /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0000_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0001_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0002_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0003_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0004_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0005_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0006_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0007_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0008_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0009_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0010_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0000_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0001_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0002_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0003_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0004_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0005_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0006_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0007_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0008_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011412_0010_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0000_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0001_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0002_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0003_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0004_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0005_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0006_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0007_L010185N_D00_nccoh.sntp.cedar_phy.root 
/minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0008_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0009_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011413_0010_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0000_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0001_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0002_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0003_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0004_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0005_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0006_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/users/cherdack/linearity/sntp_noBUG/n13011414_0007_L010185N_D00_nccoh.sntp.cedar_phy.root /minos/data/u Date: Mon, 03 Nov 2008 08:54:04 -0600 (CST) Subject: HelpDesk ticket 124102 ___________________________________________ Short Description: minos25 is overloaded - why ? Problem Description: run2-sys, minos-admin : At around 02:30 Monday 3 Nov, the load average on minos25 increased sharply to 100. It has built up to 140 through the morning. Many of the usual suspects do not seem to be at fault. I can write to /minos/data and scratch, and /grid/data and app. In testing, I created a new stuck interactive process, doing echo FOO > /grid/fermiapp/touchga The file was written, and is readable from minos25 and elsewhere, but the shell that did the writing is stuck, cannot be interrupted. I see nothing interesting in /var/log/messages. We are not out of file descriptors, as I can log in and write new files. What is going on ? ___________________________________________ Date: Mon, 03 Nov 2008 09:14:02 -0600 (CST) This ticket has been reassigned to BRICHACEK, MATTHEW of the CD-SF/FEF Group. x3982 ___________________________________________ Date: Mon, 03 Nov 2008 11:38:26 -0600 There is a move command that has made /minos/data io-bound. This is causing the load to jump on the server. I have stopped condor but the move is still completing. The move command PID is 9777. Once the move is complete I will restart condor. ___________________________________________ Date: Mon, 03 Nov 2008 11:40:06 -0600 The move command just completed and condor is on it's way back up. ___________________________________________ Date: Mon, 03 Nov 2008 11:43:42 -0600 (CST) Solution: A move command was causing IO on /minos/data to hang. Condor was stopped, the move command completed and condor has been restarted ___________________________________________ Date: Mon, 03 Nov 2008 17:48:04 +0000 (GMT) Would this be the command ? cherdack 9777 1.2 0.0 3692 720 pts/2 D+ 09:05 1:10 | \_ mv /minos/data/users/cherdack/linearity/sntp_noBUG/n13011411_0000_L010185N_D00_ nccoh.sntp.cedar_phy.root ... ___________________________________________ Date: Mon, 03 Nov 2008 11:48:50 -0600 Yes, that was the one. 
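For the next time a Cluster node bogs down like this, the culprit can usually be spotted without waiting for the ticket: look for long-lived processes in uninterruptible sleep (state D) touching a BlueArc path. A sketch, using nothing beyond plain procps:
# list D-state (I/O-bound) processes with elapsed time and command line
ps -eo stat,pid,user,etime,args | awk 'NR == 1 || $1 ~ /^D/'
# narrow to the usual suspects if the list is long
ps -eo stat,pid,user,args | awk '$1 ~ /^D/' | grep -e /minos/data -e /minos/scratch -e /grid/
The cherdack mv above would have shown up immediately in such a listing.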
___________________________________________ ============================================================================= 2008 10 31 ============================================================================= ########## # CONDOR # ########## The initial working directory seems to be in _CONDOR_SCRATCH_DIR Confirmed this, http://osg-docdb.opensciencegrid.org/0003/000382/002/NFS-lite.doc ############### # CONDORGLIDE # ############### script/condorglide switched from glideafs to glide. do not select afs nodes do not set REMOTE_INITIALDIR ########## # CONDOR # ########## Probe job is taking over 6 minutes to do du -sh /grid/home/minos better remove this from probe, for now. Holding glideins for the moment, with touch /minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE 09:59 rm /minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE ########## # CONDOR # ########## About 54 gpfarm nodes are down, including 295 -> 318 323 -> 346 ( 339-346 are the Minos AFS nodes ) Ganglia shows a short drop in farm capacity around 00:00 today. Sometime around 09:46, AFS came back. ########## # CONDOR # ########## ssh fnpc4x1 ps axfu | grep globus-job-manager | grep minos | grep -v grep minos 15734 0.0 0.0 111924 5008 ? S 08:33 0:00 globus-job-manager -conf /usr/local/vdt-1.10.1/globus/etc/globus-job-manager.conf -type managedfork -rdn jobmanager-managedfork -machine-type unknown -publish-jobs ssh -ax fnpc4x1 'ps axfu | grep globus-job-manager | grep minos | grep -v grep' minos 15734 0.0 0.0 111924 5008 ? S 08:33 0:00 globus-job-manager -conf /usr/local/vdt-1.10.1/globus/etc/globus-job-manager.conf -type managedfork -rdn jobmanager-managedfork -machine-type unknown -publish-jobs Date: Fri, 31 Oct 2008 09:06:08 -0500 (CDT) From: Steven Timm To: Arthur Kreymer Cc: fermigrid-help@fnal.gov Subject: Re: Userid for MINOS glideins That's right--since you aren't user "minos" you can't actually strace the process. Only root can do that. But if there's a globus-job-manager out there, particularly if it has any subprocesses such as globus-gass-cache-util sitting out there, let us know and we can deal with it. We are now pretty sure that this problem is related to the "feature" of bluearc snapshotting, in which sometimes a hard link can be removed from a directory on the bluearc but it isn't really gone. thus the globus-gass-cache-util spins in a tight loop: rm "data" no such file or directory ln "data" file exists and so forth. for whatever reason an strace is enough to jar it out of the loop. Steve ============================================================================= 2008 10 30 ============================================================================= ######## # DATA # ######## Removed .removed files from the 2008 10 24 simulation cleanup. Details have been appended to that entry. ########## # CONDOR # ########## No new glidein user jobs seem to have started since about 12:00 today.
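A quick way to check the "no new glidein user jobs" impression is to ask the schedd for the most recent start times directly. A sketch; JobStatus, Owner and JobStartDate are standard job ClassAd attributes, and the output is epoch seconds:
condor_q -constraint 'JobStatus == 2' \
         -format "%-10s " Owner -format "%d\n" JobStartDate \
  | sort -n -k 2 | tail -10
If the newest values all predate about 12:00, the stall is real and not just an impression from the queue listing.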
Bluearc has been healthy, can write from minos25 In /home/gfrontend/myvofrontend2/log/frontend_info.20081030.log, there was unusual activity around noon : [2008-10-30T11:57:27-05:00 7042] Iteration at Thu Oct 30 11:57:27 2008 [2008-10-30T11:57:33-05:00 7042] Match [2008-10-30T11:57:33-05:00 7042] Total running 239 limit 250 [2008-10-30T11:57:33-05:00 7042] For gpminos@t20_glexec@minos Idle 432 Running 239 [2008-10-30T11:57:33-05:00 7042] Advertize gpminos@t20_glexec@minos Request idle 10 max_run 698 [2008-10-30T11:57:33-05:00 7042] For gpgeneral@t20_glexec@minos Idle 432 Running 239 [2008-10-30T11:57:33-05:00 7042] Advertize gpgeneral@t20_glexec@minos Request idle 10 max_run 698 [2008-10-30T11:57:33-05:00 7042] Sleep [2008-10-30T11:59:03-05:00 7042] Iteration at Thu Oct 30 11:59:03 2008 [2008-10-30T11:59:13-05:00 7042] Match [2008-10-30T11:59:13-05:00 7042] Total running 239 limit 250 [2008-10-30T11:59:13-05:00 7042] For gpminos@t20_glexec@minos Idle 922 Running 239 [2008-10-30T11:59:13-05:00 7042] Advertize gpminos@t20_glexec@minos Request idle 10 max_run 1208 [2008-10-30T11:59:13-05:00 7042] For gpgeneral@t20_glexec@minos Idle 922 Running 239 [2008-10-30T11:59:13-05:00 7042] Advertize gpgeneral@t20_glexec@minos Request idle 10 max_run 1208 [2008-10-30T11:59:13-05:00 7042] Sleep currently [2008-10-30T16:02:38-05:00 7042] Iteration at Thu Oct 30 16:02:38 2008 [2008-10-30T16:02:42-05:00 7042] Match [2008-10-30T16:02:42-05:00 7042] Total running 81 limit 250 [2008-10-30T16:02:42-05:00 7042] For gpminos@t20_glexec@minos Idle 294 Running 81 [2008-10-30T16:02:42-05:00 7042] Advertize gpminos@t20_glexec@minos Request idle 10 max_run 391 [2008-10-30T16:02:42-05:00 7042] For gpgeneral@t20_glexec@minos Idle 269 Running 81 [2008-10-30T16:02:42-05:00 7042] Advertize gpgeneral@t20_glexec@minos Request idle 10 max_run 365 [2008-10-30T16:02:43-05:00 7042] Sleep /home/gfactory/glideinsubmit/glidein_t20_glexec/log/factory_info.20081030.log I see nothing interesting going on around 12:00 or later From: Steven Timm To: Sfiligoi Igor Cc: Arthur Kreymer , Ryan B. Patterson , fermigrid-help@fnal.gov Subject: Re: Userid for MINOS glideins 10 globus-job-manager processes were hung on fnpcfg1 (a.k.a. fnpc4x1) since 11:34AM I found the right one, straced it, and they all cleared up. Since both Art and Ryan have the "admin" access to the gatekeepers and worker nodes as requested by MINOS it is possible for them to log in to fnpc4x1 as kreymer and rbpatter respectively and check for this themselves. Look for any globus-job-manager process owned by minos that is more than 1 hr old, that's a sign of this problem. We have a feature request in to condor team so that the condor-G client can corrrectly detect this error condition too. Expect they'll get it done in the next release or two. Steve Timm ######## # DATA # ######## boehm volunteered about 1 TB of reroot files that can be archived. /minos/data/analysis/nue/MRERerootFiles/ Warning, most of the CedarPhyData files are already in PNFS, they can just be removed. CedarPhyBhcurvData/NearDetector CedarPhyDaikon00 CedarPhyDaikon00/NearL010185N CedarPhyDaikon00/FarL010185N CedarPhyData MINOS26 > find . -type d -exec du -sm {} \; 844299 . 
519535 ./CedarPhyBhcurvData/NearDetector 143030 ./CedarPhyDaikon00 1 ./CedarPhyDaikon00/NearL010185N 26516 ./CedarPhyDaikon00/FarL010185N 181734 ./CedarPhyData Typical files CedarPhyBhcurvData/NearDetector n00008451_0001_spill_D04_cedarphybhcurvMRE.reroot.root CedarPhyDaikon00 n13011001_0001_L010185N_D00_sntp_D03_cedarphyMRE.reroot.root Josh will rename these like n13011001_0001_L010185ND00sntp_D03_cedarphyMRE.reroot.root ./pnfsdirs near MCIN daikon_03 L010185ND00sntp_cedarphyMRE write CedarPhyDaikon00/FarL010185N reroot_f21011005_0000.root CedarPhyData n00009000_0000_spill_D03_cedarphyMRE.reroot.root File name forms, reroot* is NG, ignore for now The major item them is the 520 GB in CedarPhyBhcurvData/NearDetector ls CedarPhyBhcurvData/NearDetector | grep -v '^n........_...._spill_D04_cedarphybhcurvMRE.reroot.root$' MINOS26 > ls CedarPhyBhcurvData/NearDetector | wc -l 1519 MINOS26 > ls CedarPhyBhcurvData/NearDetector | grep '^n........_...._spill_D04_cedarphybhcurvMRE.reroot.root$' | wc -l 1519 CPD were last imported around 2007 11 06, at that time just like those in CedarPhyData. Remember that pnfsdirs supports an MCIN release just for this stuff Previously used /home/mindata/STAGE/boehm/mcin now this is /minos/data/mcimport/boehm/mcin Files had been written to /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE MCIF=/minos/data/analysis/nue/MRERerootFiles MDIR=... ( find ${MCIF}/${MDIR} -type f -name \*reroot.root -exec du -sm {} \; | cut -f 1 ) > /minos/scratch/kreymer/MCIN.gpl printf 'plot "/minos/scratch/kreymer/MCIN.gpl"\n' | gnuplot -persist MDIR=CedarPhyData mostly 80 to 250, some around 1 MDIR=CedarPhyBhcurvData/NearDetector mostly 250 to 700 MB MDIR=CedarPhyDaikon00 tight, 200 to 250 MB MDIR=CedarPhyDaikon00/FarL010185N file names are reroot_*.root tight around 160 MB for 1/4, then tight around 20-25 MB for 3/4 of files Let's get the CPD stuff going, should just need to move files to mcin, as similar files were previously imported JOSH=/minos/data/analysis/nue/MRERerootFiles MCIN=/minos/data/mcimport/boehm/mcin BOEH=/minos/data/mcimport/boehm cd /minos/data/analysis/nue/MRERerootFiles mv CedarPhyData/n00009000* ${MCIN}/ ########## # SAMSUB # ########## Updating to provide list of subrun, not a count, for use in roundup. ln -sf samsub samsub.20080408 cp -a samsub samsub.20081030 ######## # FCOE # ######## converged network adapters (CNAs) from Emulex and QLogic ? ============================================================================= 2008 10 29 ============================================================================= ########## # BUEARC # ########## Spoke to Andy Romero, they are making plans, will contact us in a few days regarding specific actions for deployment of new /minos/data disk. ########### # SERVERS # ########### MRTG shows network activity since last Friday 24 Oct. No logins yet, but : MINOS26 > host minos-mysql2 minos-mysql2.fnal.gov has address 131.225.193.32 MINOS26 > host minos-mysql3 minos-mysql3.fnal.gov has address 131.225.193.34 MINOS26 > host minos-sam04 minos-sam04.fnal.gov has address 131.225.193.35 MINOS26 > host minos27 minos27.fnal.gov has address 131.225.193.31 ########## # DCACHE # ########## Date: Wed, 29 Oct 2008 15:05:17 -0500 From: David Saranen To: Arthur Kreymer Subject: file transfers from mine http://fndca3a.fnal.gov/cgi-bin/dcache_files.py doesn't show any data. Is this related to stk problems earlier this week? 
-Dave Date: Wed, 29 Oct 2008 15:17:35 -0500 (CDT) Subject: HelpDesk ticket 123960 ___________________________________________ Short Description: FNDCA recent FTP web page listing is empty. Problem Description: The web page listing recent FNDCA FTP transfers contains only the header line and Legal Notices, no tranfers. I am not sure how long this has been so. The problem was present today 29 Oct, around 15:10 CDT I know that recent transfers have occurred, the page should not be empty. See : http://fndca3a.fnal.gov/cgi-bin/dcache_files.py ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 29 Oct 2008 16:28:40 -0500 (CDT) Note To Requester: The process which gathers the log files appeared to have been stuck since Monday morning and left a lockfile behind. It is now running and the page lists recent transfers. ___________________________________________ Date: Wed, 29 Oct 2008 17:42:16 -0500 (CDT) Solution: Killed stuck transfers of log files, removed stale lockfile. ftp_gather ran at the next scheduled time. dcache_files.py now generates output. This ticket was resolved by MESSER, TIM of the CD-SF/DMS/DSC/SSA group. ######## # DATA # ######## dcache/datasets - add capacity calculation to the summary Get from Pool Usage, http://fndca3a.fnal.gov:2288/usageInfo Applied 1049/1000 scale factory to the usageinfo MBytes numbers. ln -sf datasets.20081029 datasets # was datasets.20070703 For daq pools, dcache/datasets.20081029 q ... Size = 3188 Capacity = 3190 Historically, MIN > grep Size */*/current.q.* 2006/09/current.q.20060918:Size = 1610 2006/09/current.q.20060920:Size = 1610 2006/09/current.q.20060925:Size = 3265 2006/10/current.q.20061023:Size = 3397 2007/02/current.q.20070226:Size = 3714 2007/02/current.q.20070228:Size = 3714 2007/03/current.q.20070302:Size = 3714 2007/03/current.q.20070319:Size = 3714 2007/04/current.q.20070402:Size = 3714 2007/05/current.q.20070501:Size = 3714 2007/06/current.q.20070609:Size = 5505 2007/07/current.q.20070703:Size = 5671 2007/08/current.q.20070803:Size = 5206 2007/09/current.q.20070910:Size = 5206 2007/10/current.q.20071002:Size = 5631 2007/10/current.q.20071029:Size = 5432 2007/11/current.q.20071105:Size = 5468 2007/12/current.q.20071213:Size = 5782 2008/01/current.q.20080102:Size = 5723 2008/02/current.q.20080204:Size = 5925 2008/03/current.q.20080303:Size = 6100 2008/04/current.q.20080407:Size = 6193 2008/05/current.q.20080513:Size = 6418 2008/06/current.q.20080604:Size = 6508 2008/07/current.q.20080701:Size = 6604 2008/08/current.q.20080804:Size = 3188 2008/09/current.q.20080902:Size = 3188 2008/10/current.q.20081013:Size = 3187 2008/10/current.q.20081029:Size = 3188 In July, there were eight pools Tue Jul 1 06:02:06 CDT 2008 w-stkendca7a-1.files Tue Jul 1 06:06:42 CDT 2008 w-stkendca7a-2.files Tue Jul 1 06:11:20 CDT 2008 w-stkendca8a-1.files Tue Jul 1 06:16:00 CDT 2008 w-stkendca8a-2.files Tue Jul 1 06:13:05 CDT 2008 w-stkendca9a-3.files Tue Jul 1 06:13:14 CDT 2008 w-stkendca10a-3.files Tue Jul 1 06:13:22 CDT 2008 w-stkendca11a-3.files Tue Jul 1 06:13:06 CDT 2008 w-stkendca12a-3.files In August, this dropped to four Mon Aug 4 06:02:46 CDT 2008 w-stkendca9a-3.files Mon Aug 4 06:03:43 CDT 2008 w-stkendca10a-3.files Mon Aug 4 06:03:55 CDT 2008 w-stkendca11a-3.files Mon Aug 4 06:03:48 CDT 2008 w-stkendca12a-3.files The pools group is back to eight now, but only four are active. 
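For the record, the Capacity number now printed by datasets is just this arithmetic - sum the per-pool MBytes totals read off the usageInfo page and apply the 1049/1000 factor. A sketch; the four pool totals below are placeholders, not real readings:
# per-pool total MBytes for the active RawDataWritePools members (placeholders)
POOLMB="760000 760000 760000 760000"
echo ${POOLMB} | tr ' ' '\n' |
  awk '{ s += $1 } END { printf "Capacity = %d\n", s * 1.049 / 1000 }'
With the real usageInfo numbers this should reproduce the Capacity = 3190 line above.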
Date: Wed, 29 Oct 2008 12:27:22 -0500 (CDT) Subject: HelpDesk ticket 123932 ___________________________________________ Short Description: RawDataWritePools decreased in August, needs an increase Problem Description: The RawDataWritePools group is sized to hold all the Minos raw data, about 6 TB. This was true through last July. But in August the pool group capacity decreased to 3 TB, and remains there. It appears that four of the eight pools were removed from the group : w-stkendca7a-1 w-stkendca7a-2 w-stkendca8a-1 w-stkendca8a-2 As of 13 October, these pools seem to have returned to the group. But these pools are not listed today in cellInfo, or in usageInfo. The file listings under http://fndca3a.fnal.gov/dcache/files/ indicate that these four pools are empty. We need an increase to 8 TB to handle the next year or so of data taking. Please review the status of the RawDataWritePools group, and take action consistent with the new round of hardware deployments coming over the next few weeks. ___________________________________________ ######## # DATA # ######## Date: Wed, 29 Oct 2008 15:24:15 +0000 (GMT) Tickled helpdesk ticket 118354 , 2008 07 08 regarding aggressive writes to tape. The current FD volume has been mounted 1153 times. enstore info --vol VO8699 ... 'eod_cookie': '0000_000000000_0001335', 'sum_mounts': 1153, 'sum_rd_access': 65, 'sum_wr_access': 1335, ./volumes vols FVOLS=`./volumes fardet_data BVOLS=` for VOL in ${FVOLS} ; do printf "${VOL} " enstore info --vol ${VOL} | grep library done | grep 9940B | cut -f 1 -d ' ' ` for VOL in ${BVOLS} ; do printf "${VOL} " enstore info --vol ${VOL} | grep sum_mounts done VO2432 'sum_mounts': 2, VO3899 'sum_mounts': 282, VO4298 'sum_mounts': 1758, VO4335 'sum_mounts': 996, VO6876 'sum_mounts': 800, VO8536 'sum_mounts': 133, VO8555 'sum_mounts': 1151, VO8699 'sum_mounts': 1153, VO9488 'sum_mounts': 1048, VO9830 'sum_mounts': 163, VOA187 'sum_mounts': 307, VOB499 'sum_mounts': 194, VOB737 'sum_mounts': 108, VOC268 'sum_mounts': 144, VOC475 'sum_mounts': 1133, VOC513 'sum_mounts': 771, VOC538 'sum_mounts': 13, VOC560 'sum_mounts': 253, ############ # GRELEASE # ############ Added daily release of all gfactory processes, at 05:34, logged to /afs/fnal.gov/files/expwww/numi/html/gfactory/release.txt http://www-numi.fnal.gov/gfactory/release.txt ######## # DATA # ######## Closed helpdesk ticket ############# # CHECKLIST # ############# Ganglia monitoring for Minos Cluster offline yesterday 15:00 - 17:50 Same for all rexganglia2 monitoring MRTG shows dropout, 15:50 - 16:40 ####### # SAM # ####### +Includes corrected in all SAM products by illingwo, closed SAMDEV-25 in jira ============================================================================= 2008 10 28 ============================================================================= ######## # DATA # ######## boehm volunteers about 1 TB of reroot files that can be archived. /minos/data/analysis/nue/MRERerootFiles/ Some are similar to files previously imported. Some may need to be renamed, to fit our conventions for mcimport. ########### # ENSTORE # ########### Date: Tue, 28 Oct 2008 15:54:27 -0500 From: Tim Messer To: stk-users@fnal.gov, "cms-t1@fnal.gov" Cc: Enstore Admin Subject: STK Enstore mover code update Hello, An emergency update of the Enstore code is ready for deployment. SSA and the Enstore developers have determined that this code update is necessary in order to prevent movers from going offline in certain cases when switching from write mode to read mode. 
SSA will begin to deploy the code shortly and will restart the mover processes after the code is copied into place. This will not affect transfers in progress and is anticipated to be transparent to users. ######## # DATA # ######## Removed temporary data copies. MINOS26 > rm /local/scratch26/kreymer/DAQ/*.root ######## # FARM # ######## Requesting minfarm account on the Minos Cluater Scanning existing account in NIS, are they all in AFS home* ? Yes they are, either /afs/fnal.gov or /afs/fnal/, except for strictly local home areas. condor:KERBEROS:4716:3302:condor:/local/stage1/condor:/sbin/nologin sam:KERBEROS:7816:5024:sam users:/home/sam:/bin/bash minoscvs:KERBEROS:7927:5111:E875 Minos:/home/minoscvs:/home/minoscvs/bin/cvsh products:KERBEROS:1342:4525:products account:/local/ups:/bin/sh minsoft:KERBEROS:9979:5111:Minos Software:/home/minsoft:/bin/bash minfarm:KERBEROS:10871:5111:Minos Farm:/home/minfarm:/bin/bash lsfadm:KERBEROS:7628:5443:Admin_Load_Sharing_Facility:/home/room1/lsf/v6_1:/bin/bash samread:KERBEROS:12160:5024:Sequential Access - Run II:/home/samread:/bin/bash mindata:KERBEROS:3648:5111:Minos Data:/home/mindata:/bin/bash Minfarm exists on minos-sam03, that's why it is in the NIS list. Updated the .k5login per fnpcsrv1, removing servers and obsolete users. Sent note to minos_batch. Date: Tue, 28 Oct 2008 16:45:27 -0500 (CDT) Subject: HelpDesk ticket 123886 ___________________________________________ Short Description: Please create minfarm local account on minos01 and minos26 Problem Description: Please create a minfarm local account on minos26. This is for the purpose of building software in /grid/farmiapp. We would prefer to do this on the Minos Cluster, for software uniformity. The account is already in the NIS passwd file, and is enabled on minos_sam03. Please create the /home/minfarm area on minos26, and copy .k5login from minos_sam03. ___________________________________________ Date: Wed, 29 Oct 2008 08:36:49 -0500 (CDT) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 30 Oct 2008 15:58:16 -0500 (CDT) Solution: Request completed. This ticket was resolved by SCOTT, RENNIE of the CD-SF/FEF group. ___________________________________________ The account works, for kreymer ####### # SAM # ####### SAMDEV-25 Per Fermilab security recommendation ( inkmann@fnal.gov ) we need to change all .htaccess files from the use of Options +Includes to Options +IncludesNOEXEC ####### # SAM # ####### Jira categories, at https://fermilab.onjira.com/secure/BrowseProject.jspa need to be clarified. Project Key Project Lead URL D0 Grid Data Production Initiative DZGDPI Adam Lyon No URL D0 SAM Operations DZSAM D0 SAM Shifter No URL SAMGrid development SAMDEV Adam Lyon No URL ######## # ZOOM # ######## Final snapshot from cvsuser@cdfcode, after the move to cdcvs. [cvsuser@cdfcode cvsuser]$ time tar czvf /var/tmp/zoomcvs.tgz . real 1m0.955s user 0m24.220s sys 0m3.280s [cvsuser@cdfcode cvsuser]$ du -sm . 331 . 
[cvsuser@cdfcode cvsuser]$ du -sm * 9 archive 1 ark 2 bin 1 check_access.bak 1 check_access.hold 1 crontab.dat 0 cvsh 1 cvshlog 1 Desktop 1 genser 1 LOG 1 LOG~ 1 log.bak 1 maint 170 repository 149 repository_work 1 rsyncZoom.sh 1 shrc [cvsuser@cdfcode cvsuser]$ du -sm /var/tmp/zoomcvs.tgz 55 /var/tmp/zoomcvs.tgz [cvsuser@cdfcode cvsuser]$ scp -c blowfish /var/tmp/zoomcvs.tgz kreymer@minos26:/minos/data/users/kreymer/zoomcvs.tgz [cvsuser@cdfcode cvsuser]$ crontab -l 55 23 * * * ${HOME}/archive/archive 1> ${HOME}/archive/archive.log 2>&1 Sent mail to garren, rs, suggesting shutdown of the nightly cron job. ============================================================================= 2008 10 27 ============================================================================= ######## # PNFS # ######## Date: Mon, 27 Oct 2008 12:18:17 -0500 (CDT) Subject: HelpDesk ticket 123773 ___________________________________________ Short Description: PNFS not responding Problem Description: Starting soon after 11:57 today, the /pnfs file system does not seem to be responding. ftp file transfers hang up, ls hangs up. ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Mon, 27 Oct 2008 12:26:57 -0500 From: Stanley Hicks To: stk-users@fnal.gov, cms-t1@fnal.gov, Enstore Admin Subject: Problems with stk Users and interested parties, We received notification about a power supply failure and possible disk failure on the raid on two stk servers. The situation is currently being investigated and we will update you with more information as we discover the cause and potential uptime. Currently there are no transfers happening on stken. Sorry for the inconvenience and please stay tuned for further information. Thanks, Stanley ___________________________________________ Date: Mon, 27 Oct 2008 12:42:23 -0500 (CDT) Note To Requester: There was a facilities power problem with the STKEN server rack. It has been corrected and we are working on restoring services. ___________________________________________ Date: Mon, 27 Oct 2008 16:59:03 -0500 From: Tim Messer To: stk-users@fnal.gov, cms-t1@fnal.gov Cc: Enstore Admin Subject: Re: Problems with stk Hi, STK Enstore has been returned to service. The cause of the outage was the accidental power-off of a breaker on the circuit feeding most of the STKEN server rack. Power was restored, and after running consistency checks, the system is now believed to be stable. We apologize for the inconvenience and thank you for your patience. Please let us know if you encounter any further trouble. Thank you. ___________________________________________ Date: Tue, 28 Oct 2008 13:49:52 +0000 (GMT) Services were restored yesterday. This ticket can be closed. ___________________________________________ Date: Tue, 28 Oct 2008 16:32:01 -0500 (CDT) Solution: PNFS was not responding due to loss of power to the STKEN server rack. 
___________________________________________ 17:20 reeneabled crontab for mindata@minos26, and renamed NOCAT to NOCAT.ok on fnpcsrv1 ######## # DATA # ######## ./savedata 2>&1 | tee -a ../maint/daqwrite/savedata.log Mon Oct 27 12:02:04 CDT 2008 OOPS, bad file, F00042086_0000.all.sntp.cedar.0.root OOPS, bad file, F00042089_0006.spill.bcnd.cedar.0.root OOPS, bad file, F00042089_0007.spill.bcnd.cedar.0.root OOPS, bad file, F00042089_0011.spill.bcnd.cedar.0.root OOPS, bad file, F00042089_0015.spill.bcnd.cedar.0.root OOPS, bad file, F00042089_0020.spill.bcnd.cedar.0.root OOPS, bad file, F00042089_0021.spill.bcnd.cedar.0.root COPYING F00042089_0022.mdaq.root Mon Oct 27 12:02:13 CDT 2008 ... interrupted at 12:07, no progress, no tape mount, ... MINOS26 > ./dc_stat F00042089_0021.spill.bcnd.cedar.0.root no response, killed ftplog/NOW.txt 4 Mon Oct 27 11:36:59 CDT 2008 557 7 Mon Oct 27 11:47:06 CDT 2008 557 5 Mon Oct 27 11:57:11 CDT 2008 557 3604 Mon Oct 27 13:07:15 CDT 2008 1 3603 Mon Oct 27 14:17:18 CDT 2008 1 5776 Mon Oct 27 16:03:34 CDT 2008 1 6 Mon Oct 27 16:13:40 CDT 2008 557 6 Mon Oct 27 16:23:46 CDT 2008 557 pnfslog/NOW.txt 2 Mon Oct 27 11:48:51 CDT 2008 4 Mon Oct 27 11:53:55 CDT 2008 2 Mon Oct 27 11:58:57 CDT 2008 10804 Mon Oct 27 15:04:01 CDT 2008 3 Mon Oct 27 15:09:04 CDT 2008 3 Mon Oct 27 15:14:07 CDT 2008 3 Mon Oct 27 15:19:10 CDT 2008 3 Mon Oct 27 15:24:13 CDT 2008 2 Mon Oct 27 15:29:15 CDT 2008 2 Mon Oct 27 15:34:17 CDT 2008 3 Mon Oct 27 15:39:20 CDT 2008 Sent in PNFS helpdesk ticket, above PNFS is back up, rescanned for stale files, they all seem to be on tape now . MINOS26 > ./saddcache --list | grep -v vo MODE list Relocating files in SAM as needed, in prd STARTED Mon Oct 27 17:16:47 2008 324 FILES STARTED Mon Oct 27 17:16:47 2008 FINISHED Mon Oct 27 17:17:00 2008 A typical line is : 304 F00042090_0000.mdaq.root /pnfs/minos/fardet_data/2008-10(vo8699.1189) ============================================================================= 2008 10 25 ============================================================================= ./saddcache --list > ./maint/daqwrite/oct25.pend cd ../maint/daqwrite grep root oct25.pend | cut -c 7- | cut -f 1 -d ' ' | sort > oct25.files wc -l oct25.files 90 oct25.files for FILE in `cat ../maint/daqwrite/oct25.files` ; do ./dc_stat ${FILE} ; done | less None of the files have pools listed in Level2, none are on tape MINOS26 > ./dccptest /fardet_data/2008-10/F00042089_0011.mdaq.root 2,0,0,0.0,0.0 :h=yes;c=1:4e8ae670;l=72334795; 72334795 bytes in 2 seconds (35319.72 KB/sec) -rw-r--r-- 1 kreymer g020 72334795 Oct 25 08:55 /local/scratch26/kreymer/F00042 089_0011.mdaq.root Set up a safety copy on minos26. MINOS26 > cp dccptest dccpdata Increased debug level to 2, to get name of originating pool ./savedata 2>&1 | tee -a ../maint/daqwrite/savedata.log 95 files copied ============================================================================= 2008 10 24 ============================================================================= ####### # WEB # ####### /afs/.fnal.gov/files/expwww/numi/html Focused scan of sam docs MIN > find sam -name .htaccess -exec grep Includes {} \; -print Options +Includes sam/doc/design/samBootstrapRedesign/.htaccess Options +Includes sam/doc/install/.htaccess Options +Includes sam/sam_doc/doc/design/samBootstrapRedesign/.htaccess Options +Includes sam/sam_doc/doc/install/.htaccess Options +Includes sam/sam_doc/www/.htaccess find sam -name .htaccess -exec grep Includes {} \; -exec nedit {} \; added NOEXEC Globel scan find computing/products. 
-follow -name .htaccess -exec grep Includes {} \; -print Too many symlinks under computing/products/prd/MINOS_ROOT/Linux2.4-GCC_3_3/v4-00-08f/v4-00-08f links to itself lrwxr-xr-x 1 5922 5111 9 Feb 14 2006 computing/products/prd/MINOS_ROOT/Linux2.4-GCC_3_3/v4-00-08f/v4-00-08f -> v4-00-08f rm computing/products/prd/MINOS_ROOT/Linux2.4-GCC_3_3/v4-00-08f/v4-00-08f Options +IncludesNOEXEC computing/products/db/sam_config/Symlinks/v4_2_34/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_config/Symlinks/current/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_bootstrap/Symlinks/v4_4_1/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_bootstrap/Symlinks/current/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_web_services/Symlinks/v0_9_8/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_web_services/Symlinks/current/www/.htaccess Options +IncludesNOEXEC computing/products/db/sam_web_services/Symlinks/v0_9_9/www/.htaccess Options +Includes computing/products/prd/sam_config/v4_2_28/NULL/www/.htaccess Options +Includes computing/products/prd/sam_config/v4_2_34/NULL/www/.htaccess Options +Includes computing/products/prd/sam_bootstrap/v4_4_1/NULL/www/.htaccess Options +Includes computing/products/prd/sam_web_services/v0_9_8/NULL/www/.htaccess Options +Includes computing/products/prd/sam_web_services/v0_9_9/NULL/www/.htaccess Removed old products, no longer needed MINOS26 > ups list -aK+ sam_config v4_2_28 "sam_config" "v4_2_28" "NULL" "" "" "sam_config" "v4_2_28" "NULL" "minos" "" "sam_config" "v4_2_28" "NULL" "minos_prd" "" "sam_config" "v4_2_28" "NULL" "prd" "" "sam_config" "v4_2_28" "NULL" "dev" "" ups undeclare sam_config v4_2_28 -q dev -f NULL ups undeclare sam_config v4_2_28 -q int -f NULL ups undeclare sam_config v4_2_28 -q prd -f NULL ups undeclare sam_config v4_2_28 -q minos -f NULL ups undeclare sam_config v4_2_28 -q minos_prd -f NULL ups undeclare -Y sam_config v4_2_28 -f NULL ############ # PREDATOR # ############ Several raw data files are not on tape after nearly 1 day . Good, this is the desired behaviour ! Write pools should only go to tape daily. 4 F00042086_0006.mdaq.root 24 13 N00015034_0002.mdaq.root 23 14 N00015034_0005.mdaq.root 24 23 N00015034_0004.mdaq.root 24 31 N00015034_0006.mdaq.root 24 38 F00042086_0002.mdaq.root 23 48 F00042086_0007.mdaq.root 24 55 F00042085_0000.mdaq.root 23 57 N00015033_0000.mdaq.root 23 RUNS="F00042086_0006 N00015034_0002 N00015034_0005 N00015034_0004 N00015034_0006 F00042086_0002 F00042086_0007 F00042085_0000 N00015033_0000" for RUN in ${RUNS} ; do ./dc_stat ${RUN}.mdaq.root ; done The oldest unwritten file is -rw-r--r-- 1 buckley e875 17651382 Oct 23 12:13 F00042085_0000.mdaq.root This is under 1 day old, good ! The latest written file is -rw-r--r-- 1 buckley e875 114201451 Oct 24 07:53 N00015034_0002.mdaq.root Tested again at 15:00, not so good ! No further files have been written to tape. 
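An aside - the stale-file checks here all come down to whether a file has Level 4 (tape) metadata yet. A standalone scan of a raw data month directory could look like this; a sketch only, assuming the usual PNFS ".(use)(N)(file)" layer files are readable from minos26, as they are for dc_stat:
# list raw files in a PNFS month directory with no Level 4 (tape) metadata
PDIR=/pnfs/minos/fardet_data/2008-10
cd ${PDIR}
for FILE in `ls *.mdaq.root` ; do
  L4=`cat ".(use)(4)(${FILE})" 2>/dev/null | head -1`
  [ -z "${L4}" ] && ls -l ${FILE}
done
Anything listed with a modification time more than a day old is a candidate for the next ticket update.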
The files are OK, but not on tape MINOS26 > ./dccptest /fardet_data/2008-10/F00042085_0000.mdaq.root 2,0,0,0.0,0.0 :h=yes;c=1:dab8b986;l=17651382; 17651382 bytes in 1 seconds (17237.68 KB/sec) -rw-r--r-- 1 kreymer g020 17651382 Oct 24 15:02 /local/scratch26/kreymer/F00042085_0000.mdaq.root Made a fresh file list ./saddcache --list OFILES=" F00042086_0022.mdaq.root B081023_160001.mbeam.root B081024_000001.mbeam.root F00042085_0000.mdaq.root F00042086_0002.mdaq.root F00042086_0006.mdaq.root F00042086_0007.mdaq.root F00042086_0008.mdaq.root F00042086_0009.mdaq.root F00042086_0010.mdaq.root F00042086_0011.mdaq.root F00042086_0012.mdaq.root F00042086_0013.mdaq.root F00042086_0014.mdaq.root F00042086_0015.mdaq.root F00042086_0016.mdaq.root F00042086_0017.mdaq.root F00042086_0018.mdaq.root F00042086_0019.mdaq.root F00042086_0020.mdaq.root F00042086_0021.mdaq.root F00042086_0022.mdaq.root F00042086_0023.mdaq.root F00042087_0000.mdaq.root F00042088_0000.mdaq.root F00042089_0000.mdaq.root F00042089_0001.mdaq.root F081023_000010.mdcs.root N00015033_0000.mdaq.root N00015034_0004.mdaq.root N00015034_0005.mdaq.root N00015034_0006.mdaq.root N00015034_0007.mdaq.root N00015034_0008.mdaq.root N00015034_0009.mdaq.root N00015034_0010.mdaq.root N00015034_0011.mdaq.root N00015034_0012.mdaq.root N00015034_0013.mdaq.root N00015034_0014.mdaq.root N00015034_0015.mdaq.root N00015034_0016.mdaq.root N00015034_0017.mdaq.root N00015034_0018.mdaq.root N00015034_0019.mdaq.root N00015034_0020.mdaq.root N00015034_0021.mdaq.root N00015034_0022.mdaq.root N00015034_0023.mdaq.root N00015034_0024.mdaq.root N00015035_0000.mdaq.root N00015036_0000.mdaq.root N00015037_0000.mdaq.root N081023_000002.mdcs.root " Scanned for stale files, at about 15:30 for FILE in ${OFILES} ; do ./dc_stat ${FILE} ; done | less /pnfs/minos/fardet_data/2008-10/F00042085_0000.mdaq.root -rw-r--r-- 1 buckley e875 17651382 Oct 23 12:13 F00042085_0000.mdaq.root /pnfs/minos/fardet_data/2008-10/F00042086_0002.mdaq.root -rw-r--r-- 1 buckley e875 28326005 Oct 23 15:15 F00042086_0002.mdaq.root /pnfs/minos/neardet_data/2008-10/N00015033_0000.mdaq.root -rw-r--r-- 1 buckley e875 13390345 Oct 23 13:50 N00015033_0000.mdaq.root Date: Fri, 24 Oct 2008 15:41:56 -0500 (CDT) Subject: HelpDesk ticket 123698 ___________________________________________ Short Description: Minos raw data files not moving to tape Problem Description: Some Minos raw data files seem to have not moved to tape in the last day. This is based on the absence of PNFS Level 4 metadata. The latest file to be written seems to be /pnfs/minos/neardet_data/2008-10/N00015034_0002.mdaq.root around Oct 24 07:53 I noticed the delay writing to tape this morning, I had hoped that the normal 1 day delay had been restored, as requested. At that time, all the files were under 24 hours old. But now a few files are over 24 hours in the pools, without being on tape. /pnfs/minos/fardet_data/2008-10/F00042085_0000.mdaq.root -rw-r--r-- 1 buckley e875 17651382 Oct 23 12:13 F00042085_0000.mdaq.root /pnfs/minos/fardet_data/2008-10/F00042086_0002.mdaq.root -rw-r--r-- 1 buckley e875 28326005 Oct 23 15:15 F00042086_0002.mdaq.root /pnfs/minos/neardet_data/2008-10/N00015033_0000.mdaq.root -rw-r--r-- 1 buckley e875 13390345 Oct 23 13:50 N00015033_0000.mdaq.root There are plenty of free movers; the delay is not due to an Enstore backlog. ___________________________________________ Date: Sat, 25 Oct 2008 15:46:15 +0000 (GMT) From: Arthur Kreymer Subject: Re: HelpDesk ticket 123698 has additional info. 
<-- # @@@ Enter Update below this line. @@@ # --> The files are all in the RawDataWritePools pool group. It is hard to tell which specific pool is involved, because the pool name is not in the Level 2 PNFS data. Perhaps this is a clue to an underlying problem ! I have made a safety copy of all pending files to node minos26, using dccp -d 2 to determine the pools holding the files. The 95 files copied came from pools as follows : 1 stkendca9a 87 stkendca11a 7 stkendca12a It would seem that there is a problem with w-stkendca11a There many files over a day old, one of which is /pnfs/minos/near_dcs_data/2008-10/N081023_000002.mdcs.root <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ Date: Mon, 27 Oct 2008 08:45:48 -0500 (CDT) Note To Requester: We are looking into this Art. This ticket is assigned to JONES, TERRY of the CD-SF/DMS/DSC/SSA group. ___________________________________________ Date: Tue, 28 Oct 2008 13:52:28 +0000 (GMT) All of our files in RawDataWritePools have moved to tape. You can close this ticket. Thanks ! ___________________________________________ Date: Tue, 04 Nov 2008 11:11:42 -0600 (CST) This ticket was resolved by JONES, TERRY of the CD-SF/DMS/DSC/SSA group. ######## # DATA # ######## Date: Thu, 23 Oct 2008 17:27:56 -0500 (CDT) From: Kregg E Arms Only the files within the run range 7481 - 7500 were uploaded to FNAL from this "corrupted" set (i.e. only the runs listed below by Nick as "L010185_near_production" & "L010185_rock_pro"), the others were merely test runs not actually used. So, the list derived from yours with this correction is /minos/scratch/arms/remove.badd04.lis Repeated the above , with SAMDIM=" RUN_TYPE physics% and DATA_TIER mc-near and MC.BEAM L010185N and RUN_NUMBER in 7481,7482,7483,7484,7485,7486,7487,7488,7489, 7490,7491,7492,7493,7494,7495,7496,7497,7498,7499,7500 " ~/minos/scripts/samlocate "${SAMDIM}" > reroot.lis File Count: 626 Average File Size: 342.95MB Total File Size: 209.66GB Total Event Count: 500800 for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and RUN_NUMBER in 7481,7482,7483,7484,7485,7486,7487,7488,7489, 7490,7491,7492,7493,7494,7495,7496,7497,7498,7499,7500 " ~/minos/scripts/samlocate "${SAMDIM}" > ${STRM}.lis done "; sam list files --dim="${SAMDIM}" --summaryOnly ; done File Count: 412 Average File Size: 551.04MB Total File Size: 221.71GB Total Event Count: 329600 File Count: 29 Average File Size: 1.34GB Total File Size: 38.73GB Total Event Count: 500800 File Count: 28 Average File Size: 408.75MB Total File Size: 11.18GB Total Event Count: 500800 MINOS26 > wc -l *.lis 412 cand.lis 28 mrnt.lis 626 reroot.lis 29 sntp.lis 1095 total 469 mcout_data Total size is 481 GB. Made a backed up copy of these in AFS MINOS26 > cp -vax . 
~/minos/maint/badd04 Set aside the files minospro@minos26 cd /minos/scratch/kreymer/badd04 for STRM in sntp mrnt cand reroot ; do cat ${STRM}.lis | while read LINE ; do FNAM=`echo ${LINE} | cut -f 1 -d ' '` FPAT=`echo ${LINE} | cut -f 2 -d ' '` echo mv ${FPAT}/${FNAM} ${FPAT}/${FNAM}.removed mv ${FPAT}/${FNAM} ${FPAT}/${FNAM}.removed usleep 100000 done done Tested the copy for STRM in sntp mrnt cand reroot; do cat ${STRM}.lis | while read LINE ; do FNAM=`echo ${LINE} | cut -f 1 -d ' '` FPAT=`echo ${LINE} | cut -f 2 -d ' '` ls ${FPAT}/${FNAM}.removed usleep 100000 done done Removed them from /minos/data for STRM in sntp mrnt ; do # head -1 ${STRM}.lis | while read LINE ; do cat ${STRM}.lis | while read LINE ; do FNAM=`echo ${LINE} | cut -f 1 -d ' '` FPAT=`echo ${LINE} | cut -f 2 -d ' '` FMD=/minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/${STRM}_data NUM=`echo ${FPAT} | cut -f 10 -d /` ls -l ${FMD}/${NUM}/${FNAM} rm ${FMD}/${NUM}/${FNAM} done minfarm@fnpcsrv1 chmod 775 /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/mrnt_data/750 chmod 775 /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/sntp_data/750 Removed the files from SAM ./samundeclare "${SAMDIM}" Found 626 files OOPS, did not undeclare n13037481_0049_L010185N_D04.reroot.root OK, let's try cand instead Worked OK. Same for sntp and mrnt Now the reroots can be removed. OK! Of course, they are now no longer parents. 2008 10 30 removing the .removed files, for real. cd /minos/scratch/kreymer/badd04 for STRM in sntp mrnt cand reroot; do cat ${STRM}.lis | while read LINE ; do FNAM=`echo ${LINE} | cut -f 1 -d ' '` FPAT=`echo ${LINE} | cut -f 2 -d ' '` ls -l ${FPAT}/${FNAM}.removed rm -f ${FPAT}/${FNAM}.removed usleep 100000 done done Did this at about 16:45 CDT ============================================================================= 2008 10 23 ============================================================================= ######## # DATA # ######## Purging sam/pnfs/md for bad v18 D04 mcnear files See notes in this log, 2008 09 18 Date: Fri, 12 Sep 2008 16:18:09 +0100 From: Nick West Attachment was copied to /minos/scratch/kreymer/badd04 SAMDIM=" DATA_TIER ${STRM}-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and RUN_NUMBER >= 7450 and RUN_NUMBER <= 7500 " sam list files --dim="${SAMDIM} File Count: 76 Average File Size: 1.28GB Total File Size: 97.56GB Total Event Count: 1260800 All are n1303*, Sorted and uniqued the run list, regardless of configuration, SAMDIM=" DATA_TIER ${STRM}-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and RUN_NUMBER in 7450,7451,7455,7481,7482,7483,7484,7485,7486,7487,7488,7489, 7490,7491,7492,7493,7494,7495,7496,7497,7498,7499,7500,7655 " STRM=sntp File Count: 32 Average File Size: 1.39GB Total File Size: 44.49GB Total Event Count: 575200 STRM=mrnt File Count: 31 Average File Size: 424.03MB Total File Size: 12.84GB Total Event Count: 575200 STRM=cand File Count: 530 Average File Size: 550.95MB Total File Size: 285.16GB Total Event Count: 424000 SAMDIM=" RUN_TYPE physics% and DATA_TIER mc-near and MC.BEAM=L010185N and RUN_NUMBER in 7450,7451,7455,7481,7482,7483,7484,7485,7486,7487,7488,7489, 7490,7491,7492,7493,7494,7495,7496,7497,7498,7499,7500,7655 " File Count: 749 Average File Size: 342.88MB Total File Size: 250.80GB Total Event Count: 599200 MINOS26 > sam list files --dim="${SAMDIM}" --nosummary | cut -f 1 -d _ | sort -u n13037450 n13037451 n13037455 n13037481 n13037482 n13037483 n13037484 
n13037485 n13037486 n13037487 n13037488 n13037489 n13037490 n13037491 n13037492 n13037493 n13037494 n13037495 n13037496 n13037497 n13037498 n13037499 n13037500 n13037655 ~/minos/scripts/samlocate "${SAMDIM}" > reroot.lis for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and RUN_NUMBER in 7450,7451,7455,7481,7482,7483,7484,7485,7486,7487,7488,7489, 7490,7491,7492,7493,7494,7495,7496,7497,7498,7499,7500,7655 " ~/minos/scripts/samlocate "${SAMDIM}" > ${STRM}.lis done MINOS26 > wc -l *.lis 530 cand.lis 31 mrnt.lis 749 reroot.lis 32 sntp.lis 1342 total ########## # PARROT # ########## paloonew - arguments -m - mountfile name e.g. mountfile.grow mountfile.d199d141.grow -p - parrot arguments e.g. "-d remote" -r - parrot release e.g. current current-20081010 2_4_3 -s - script to run e.g. /grid/fermiapp/minos/parrot/loonar "/grid/fermiapp/minos/parrot/loonar -r R1.24.2" loonar - arguments -r - loon release e.g. R1.24.2 S08-08-28-R1-30 -s - script to run e.g. firstlast.C This is working well, making this the new paloon: PW=/afs/fnal.gov/files/expwww/numi/html/computing/parrot /grid/fermiapp/minos/parrot mv paloon paloon.20081013 ; cp paloonew paloon cp paloon ${PW}/paloon cp loonar ${PW}/loonar ./paloon -r current -s "/grid/fermiapp/minos/parrot/loonar -r S08-08-28-R1-30" ./paloon -m mountfile.d199d141.grow ./paloon -p "-d remote" ########## # PARROT # ########## Bootstrap process to run reco : WP=http://www-numi.fnal.gov/computing/parrot wget ${WP}/paloon wget ${WP}/mountfile.grow wget ${WP}/loonar wget ${WP}/firstlast.C wget ${WP}/reco_near_spill_cedar.C wget ${WP}/N00009870_0002.mdaq.root wget ${WP}/N00009870_0002.log chmod 755 loonar paloon Modify paloon to set the path to parrot, and the default verson of parrot for your site. Copy mountfile.grow to the parrot home directory Run a firstlast.C event count test of the file ./paloon -s "./loonar -f N00009870_0002.mdaq.root" Reconstruct the file { time ./paloon -s "./loonar -f N00009870_0002.mdaq.root -s reco_near_spill_cedar.C" ; } 2>&1 | tee N00009870_0002.log2 Look for reco .root files : -rw-r--r-- 1 kreymer e875 3186229 Oct 23 11:33 ntupleStS.root -rw-r--r-- 1 kreymer e875 17805680 Oct 23 11:33 CandS.root Compare N00009870_0002.log2 to N00009870_0002.log ============================================================================= 2008 10 22 ============================================================================= ######## # FARM # ######## > the minos mysql instance is taking 729% of cpu on fnpcsrv1 > currently, with only 250 production jobs running. Isn't that > a bit more than usual? Could someone please have a look? According to fnpcsrv1 ganglia plots, this happened in coincidence with sustained network rates of 15 MBytes/sec out, 5 MBytes/sec in, The lastest overloads were at : 16:20 to 16:40 16:50 to 17:05 Earlier episodes started after 12:30. Condorview for fcdfosg3 shows minospro usage peaks, from roughly 17:40 to 20:55 UTC, then from 21:20 through the current time. 12:40 to 15:55 CDT 16:20 CDT This web site is http://fcdfcm3.fnal.gov/UserDay.html So I think this overload is consistent with our recent farm startups. I think we need to ramp up more gradually, to avoid these overloads. 
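A gradual ramp-up could be scripted; below is a minimal hedged sketch, to run on fnpcsrv1.
submit_batch.sh and the 300% mysqld CPU threshold are assumptions for illustration,
not the actual farm submission machinery.

  for BATCH in 1 2 3 4 5 ; do
      ./submit_batch.sh ${BATCH}        # hypothetical helper, ~50 jobs per batch
      while true ; do                   # pause until mysqld load settles
          CPU=`ps -C mysqld -o pcpu= | head -1 | cut -f 1 -d . | tr -d ' '`
          [ "${CPU:-0}" -lt 300 ] && break
          sleep 600
      done
  done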
############ # BLUWATCH # ############ Restarted bluwatch on minos25, down since yesterday's reboot ########### # BLUEARC # ########### /minos/data service slowdown seen by all clients, [TXT] fnpcsrv1.txt 22-Oct-2008 13:37 87 [TXT] minos-sam03.txt 22-Oct-2008 13:38 87 [TXT] minos01.txt 22-Oct-2008 13:38 87 [TXT] minos26.txt 22-Oct-2008 13:38 87 This produced Howie's doubling or missing content in condor job files in /minos/data/minfarm But this doubling disappeared on later inspection of the same files. Sounds to me like a local cache defect, induced by these delays. The roundup concatenator script was not running at the time. log files minos01.txt Tue Oct 21 09:01:47 CDT 2008 SLO N00011513_0000.spill.sntp.cedar_phy_bhcurv.0.root 176 Wed Oct 22 13:04:30 CDT 2008 SLO N00011896_0000.spill.sntp.cedar_phy_bhcurv.0.root 168 Wed Oct 22 13:38:04 CDT 2008 SLO N00011995_0000.spill.sntp.cedar_phy_bhcurv.0.root 210 minos26.txt Tue Oct 21 09:01:50 CDT 2008 SLO N00009689_0000.spill.sntp.cedar_phy_bhcurv.0.root 161 Tue Oct 21 15:00:42 CDT 2008 SLO N00011411_0020.spill.sntp.cedar_phy_bhcurv.0.root 36 Wed Oct 22 13:04:33 CDT 2008 SLO N00010265_0019.spill.sntp.cedar_phy_bhcurv.0.root 210 Wed Oct 22 13:38:08 CDT 2008 SLO N00010350_0000.spill.sntp.cedar_phy_bhcurv.0.root 238 Ganglia on minos26 shows low network activity consistent with this. Ganglia on fnpcsrv1 shows a load average around 40, since about 12:30 mysqld shows cpu usage around 700% ######## # GRID # ######## Cleaned up stray files in /grid/home/minos empty directory 0 empty file foo SRV1> stat /grid/home/minos/0 File: `/grid/home/minos/0' Size: 2048 Blocks: 64 IO Block: 32768 directory Device: 1dh/29d Inode: 148324388 Links: 2 Access: (0755/drwxr-xr-x) Uid: ( 7927/ minos) Gid: ( 5111/ numi) Access: 2008-10-22 08:10:38.218000000 -0500 Modify: 2005-08-12 11:51:57.000000000 -0500 Change: 2006-09-19 13:49:04.035000000 -0500 SRV1> stat /grid/home/minos/foo File: `/grid/home/minos/foo' Size: 0 Blocks: 0 IO Block: 32768 regular empty file Device: 1dh/29d Inode: 1869770220 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 7927/ minos) Gid: ( 5111/ numi) Access: 2008-01-22 13:43:38.288000000 -0600 Modify: 2008-01-22 13:43:38.288000000 -0600 Change: 2008-01-22 13:43:38.289000000 -0600 ######## # GRID # ######## find /grid/home/minos -maxdepth 1 -type d -name gram_scratch_\* -mtime +60 | wc -l 1641 Date: Wed, 22 Oct 2008 09:04:23 -0500 (CDT) Subject: HelpDesk ticket 123533 ___________________________________________ Short Description: /grid/home/minos old gram_scratch directories Problem Description: The /grid/home/minos area is getting pretty large, nearly 1 TByte. 725M /grid/home/minos There are 1641 directories dating from July , all but one from July 31 : For comparison, there are many fewer current files : SRV1> ls -l -t /grid/home/minos | grep gram_scratch | grep Oct | wc -l 297 SRV1> ls -l -t /grid/home/minos | grep gram_scratch | grep Jul | wc -l 1641 SRV1> ls -l -t /grid/home/minos | grep gram_scratch | grep 'Jul 31' | wc -l 1640 These July files should probably be removed. ___________________________________________ Date: Wed, 22 Oct 2008 14:22:40 +0000 (GMT) From: Arthur Kreymer Correction, that is almost a GBYte, not a TByte, used. Still, it is probably good to clear up the old directories. ___________________________________________ Date: Wed, 22 Oct 2008 09:36:27 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: This has now been done. Most of those directories were blank anyway. 
It exposed a minor bug in our gram_scratch clearing script, we will get that fixed. Steve ___________________________________________ Date: Mon, 27 Oct 2008 12:35:31 -0500 (CDT) Subject: Help Desk Ticket 123533 Has Been Resolved. Solution: These directories were removed as requested Steve timm ============================================================================= 2008 10 21 ============================================================================= ########## # CONDOR # ########## Date: Tue, 21 Oct 2008 15:30:23 -0500 (CDT) Subject: HelpDesk ticket 123517 ___________________________________________ Short Description: Minos25 stuck writing to /minos/data, Condor is hung Problem Description: Starting around 13:30 today, processes writing to /minos/data seem to be stuck on node minos25 . Please have a look to see whether there is any obvious cause for this. ( System level file descriptors, bad user processes, etc. ? ) I see nothing in /var/log/messages since : Oct 21 08:59:38 minos25 kernel: lockd: server 131.225.111.115 not responding, still trying Oct 21 09:00:58 minos25 last message repeated 5 times Oct 21 09:01:08 minos25 last message repeated 6 times Oct 21 09:01:46 minos25 kernel: lockd: server 131.225.111.115 OK If this cannot be corrected gently , we may need to reboot the system. ___________________________________________ Date: Tue, 21 Oct 2008 15:35:22 -0500 (CDT) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 21 Oct 2008 15:58:05 -0500 (CDT) Note To Requester: sether@fnal.gov sent this Notes To Requester: There are a rather large number of defunct processes on the system from around 1pm - 2pm, which I assume is the result of some issues with the bluarc server. Everything appears okay now in terms of access, but it looks like the system never really recovered. A reboot is probably the best choice. ___________________________________________ Date: Tue, 21 Oct 2008 21:16:08 +0000 (GMT) From: Arthur Kreymer Please reboot minos25 as soon as possible, to clear this condition. ___________________________________________ Date: Tue, 21 Oct 2008 21:25:27 +0000 (GMT) From: Arthur Kreymer I wonder whether it is worth trying a forced dismount, then remount of /minos/data, which is less drastic than a full reboot ? I suspect that any process trying to dismount /minos/data is likely to get stuck itself, so please do not waste too much time trying this. ___________________________________________ Date: Tue, 21 Oct 2008 16:55:43 -0500 (CDT) Solution: The machine has been rebooted. Things look okay in general, but let us know if there are any more problems. ___________________________________________ MINOS25 > uptime 16:55:43 up 4 min, 2 users, load average: 0.29, 0.37, 0.17 MINOS25 > date Tue Oct 21 16:55:48 CDT 2008 ___________________________________________ Date: Tue, 21 Oct 2008 22:11:23 +0000 (GMT) From: Arthur Kreymer Thanks for the reboot ! Most of the running jobs seem to be accounted for. New jobs have started, both local and through glideinWMS. ___________________________________________ Date: Tue, 21 Oct 2008 22:14:47 +0000 (GMT) From: Arthur Kreymer Today between about 13:30 and 17:00 CDT, the Minos25 system bogged down with stuck writes to /minos/data. This pretty much stopped the Condor system from running. minos25 was rebooted around 16:50. Condor jobs seem to have resumed running. New jobs have started, both on the Cluster and through glideinWMS. 
The system seem to have kept track of most of the existing running jobs. ######## # DATA # ######## Planning to remove all the R* ntuples, we should really only need the cedar* releases on disk. Per minos batch meeting today. cd /minos/data/reco_near du -sm R* 17 R1_18 270547 R1_18_2 8813 R1_18_3 142294 R1_18_4 22221 R1_21 10001 R1_23 9336 R1_23a 9958 R1_24 10089 R1_24a 11734 R1_24b 41283 R1_24c 24122 R1_24cal find R1* -atime -365 -type f | wc -l 583 for FILE in ${FILES} ; do dirname ${FILE} ; done | sort -u R1_18_4/sntp_data/2006-09 R1_18_4/sntp_data/2006-10 R1_18/sntp_data/2005-04 There is only one file in R1_18 R1_18/sntp_data/2005-04/N00007184_0000.spill.sntp.R1_18.0.root find R1* -atime -304 -type f R1_18/sntp_data/2005-04/N00007184_0000.spill.sntp.R1_18.0.root So most of these were last accessed 305 days ago. Removing them all, mkdir ZAP mv R1* ZAP/ du -sm ZAP 560411 ZAP date ; time rm -r ZAP Tue Oct 21 14:09:44 CDT 2008 real 59m54.901s user 0m0.042s sys 0m1.024s ######### # MDSUM # ######### mdsum_log wrapped around, is running twice. Killed the younger one 26132 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 28734 ? D 0:00 \_ du -sm users/scavan 9283 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 21228 ? S 0:29 \_ du -sm users 26132 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 4221 ? S 0:13 \_ du -sm users 9283 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 20120 ? S 0:01 \_ du -sm analysis/nue Tokens will have expired, kill em both MINOS26 > kill 26132 MINOS26 > kill 9283 Updated mdsum_log to check for existing process running. ############ # PREDATOR # ############ Why is predator linked to NORECO ? lrwxr-xr-x 1 kreymer kreymer 24 Dec 12 2006 predator -> predator.20061209.NORECO* -rwxr-xr-x 1 kreymer kreymer 4638 Dec 9 2006 predator.20061209* -rwxr-xr-x 1 kreymer kreymer 4803 Oct 20 15:00 predator.20061209.NORECO* I hacked the NORECO file inadvertantly, presumably it dates 2006 12 12 MIN > diff predator.20061209.NORECO predator.20061209 69,73d68 < # 2008 10 20 < # work around loon suppression on minos25/26, temporarily < < PATH=${PATH/#\/afs\/fnal.gov\/files\/code\/e875\/general\/minos25_bin:/} < 120,121c115 < if false ; then < #if [ ${HOUR} = "23" -o -n "${FORCE}" ] ; then --- > if [ ${HOUR} = "23" -o -n "${FORCE}" ] ; then OK, this is the version that leaves saddreco up to the concatenator. Cutting a new version with this content, removing the reco code altogether, as predator.20081020 Renaming predator.20061209.NORECO to predator.20061212 cp -a predator.20061209.NORECO predator.20081020 nedit predator.20081020 ln -sf predator.20081020 predator mv predator.20061209.NORECO predator.20061212 ############ # PREDATOR # ############ genpy/sadd are clean today, after hacks to correct the path to loon. saddcache timed out : /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/predator: line 144: ./saddcache: Connection timed out MINOS26 > ./saddcache --list STARTED Tue Oct 21 10:50:14 2008 looks OK to me, finds 5 files to add. ########## # PARROT # ########## recopa - support single file name specification recopa "" FILE Added printout of loon command Updated in PG and PW Generated logs for two sample files in PG, PW : MINOS24 > { date ; time ${PG}/recopa "" N00009761_0010.mdaq.root ; } 2>&1 | tee N00009761_0010.log Tue Oct 21 10:39:48 CDT 2008 ... 
real 4m9.124s user 3m40.282s sys 0m6.296s MINOS24 > { date ; time ${PG}/recopa "" N00009870_0002.mdaq.root ; } 2>&1 | tee N00009870_0002.log Tue Oct 21 10:45:36 CDT 2008 ... real 15m29.448s user 15m5.123s sys 0m14.673s ============================================================================= 2008 10 20 ============================================================================= ########## # PARROT # ########## Final local test of a large file, about 10' reco time ssh minos24 cd /local/scratch24/kreymer PW=/afs/fnal.gov/files/expwww/numi/html/computing/parrot export PRO=/local/scratch24/kreymer cp ${PG}/reco_near_spill_cedar.C ${PRO} cp ${PG}/N00009870_0002.mdaq.root ${PRO} MINOS24 > date ; time ${PG}/recopa Mon Oct 20 17:58:05 CDT 2008 real 17m58.747s user 15m42.702s sys 0m14.986s ########## # PARROT # ########## minosadmin Update from rbpatter, re Grid items Where to run dbserver tests ( sam02/3 prob'ly ) How to get to CDf nodes ( See Steve and Igor ) How to cleanup CONDOR_TMP - perhaps sudo script, where ? ############ # PREDATOR # ############ Starting Sat AM, near/fardcs and beam genpy failing, like B081017_080002.mbeam.root Sat Oct 18 10:11:57 UTC 2008 /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/run_dbu: line 70: [: too many arguments /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/run_dbu: line 92: [: too many arguments ERROR: List of process IDs must follow -p. This was due to the setup_minos fix stopping users from running loon on minos26. Hacked PATH in predator, to remove the killer path. MINOS26 > ./predator 2008-10 This found nothing to do for dcs/beam, we must have damaged .py files around. cd /local/scratch26/kreymer/genpy/near_dcs_data/2008-10 Some files are like : MINOS26 > cat N081017_000001.sam.py from SamFile.SamDataFile import SamDataFile from SamFile.SamDataFile import ApplicationFamily from SamFile.SamDataFile import CRC from SamFile.SamDataFile import SamTime from SamFile.SamDataFile import RunDescriptorList from SamFile.SamDataFile import SamSize import SAM metadata = SamDataFile( fileName = 'N081017_000001.mdcs.root', fileType = 'physicsGeneric', fileContentStatus = SAM.DataFileContentStatus_Good, fileFormat = SAM.DataFileFormat_ROOT, fileSize = SamSize('479932B'), crc = CRC(1106728658L,SAM.CRC_Adler32Type), group = 'minos', applicationFamily = ApplicationFamily('online','rotorooter',''), dataTier = 'dcs-near', datastream = 'alldata', startTime = SamTime ( '(UTC)' , '%Y-%m-%d %H:%M:%S(UTC)' ), endTime = SamTime ( '(UTC)' , '%Y-%m-%d %H:%M:%S(UTC)' ), eventCount = , firstEvent = 0, lastEvent = ) #/pnfs/minos/near_dcs_data/2008-10(vo8508.1304) Automatic scan for this problem, FILES=`ls` for FILE in ${FILES} ; do grep '= ,' ${FILE} && ls -l ${FILE} ; done eventCount = , -rw-r--r-- 1 kreymer g020 1032 Oct 18 05:12 N081017_000001.sam.py eventCount = , -rw-r--r-- 1 kreymer g020 1031 Oct 18 05:12 N081017_235956.sam.py eventCount = , -rw-r--r-- 1 kreymer g020 1032 Oct 19 05:13 N081018_000003.sam.py eventCount = , -rw-r--r-- 1 kreymer g020 1032 Oct 20 05:13 N081019_000003.sam.py Removed the bad files : for FILE in ${FILES} ; do grep '= ,' ${FILE} && rm ${FILE} ; done Repeated for fardcs, beam cd /local/scratch26/kreymer/genpy/far_dcs_data/2008-10 -rw-r--r-- 1 kreymer g020 1031 Oct 18 05:13 F081017_000010.sam.py -rw-r--r-- 1 kreymer g020 1031 Oct 20 05:13 F081018_000008.sam.py -rw-r--r-- 1 kreymer g020 1031 Oct 20 05:13 F081019_000011.sam.py cd /local/scratch26/kreymer/genpy/beam_data/2008-10 -rw-r--r-- 1 kreymer g020 1027 Oct 18 05:12 
B081017_080002.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 18 05:12 B081017_160001.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 18 05:12 B081018_000001.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 19 05:12 B081018_080001.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 19 05:12 B081018_160001.sam.py -rw-r--r-- 1 kreymer g020 1026 Oct 19 05:13 B081019_000002.sam.py -rw-r--r-- 1 kreymer g020 1026 Oct 20 05:12 B081019_080001.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 20 05:12 B081019_160001.sam.py -rw-r--r-- 1 kreymer g020 1027 Oct 20 05:12 B081020_000001.sam.py ./predator 2008-10 Failed again, removed .py files again. Needed to update genpy cp -a genpy.20080915 genpy.20081020 nedit genpy.2008102 PATH=${PATH/#\/afs\/fnal.gov\/files\/code\/e875\/general\/minos25_bin:/} ln -sf genpy.20081020 genpy # was genpy.20080915 ./predator 2008-10 Successful ########## # PARROT # ########## Updated directory for latest snapshot ls -l /afs/fnal.gov/files/expwww/numi/html/computing/parrot releases -> .../d120 MD=/afs/fnal.gov/files/data/minos PG=/grid/fermiapp/minos/parrot cd ${PG} mkdir ${MD}/d120/GROWFSDIR mv ${MD}/d120/GROW ${MD}/d120/GROWFSDIR/20080814 mkdir ${MD}/d120/GROWFSDIR/20080829 cp -a ${MD}/d120/.grow* ${MD}/d120/GROWFSDIR/20080829 du -sm ${MD}/d120/GROWFSDIR/*/.growfsdir 30 /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080814/.growfsdir 203 /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080829/.growfsdir Mon Oct 20 13:20:58 CDT 2008 time ./make_growfs.auto -k ${MD}/d120 ; date real 26m20.190s user 3m34.661s sys 10m35.154s Mon Oct 20 13:47:30 CDT 2008 30 /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080814/.growfsdir 203 /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080829/.growfsdir 119 /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20081020/.growfsdir $ grep '^L ' /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080814/.growfsdir | wc -l 16425 $ grep '^L ' /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20080829/.growfsdir | wc -l 12 $ grep '^L ' /afs/fnal.gov/files/data/minos/d120/GROWFSDIR/20081020/.growfsdir | wc -l 7979 mkdir ${MD}/d120/GROWFSDIR/20081020 cp -a ${MD}/d120/.grow* ${MD}/d120/GROWFSDIR/20081020 Test in fnpc185 PG=/grid/fermiapp/minos/parrot mkdir /local/stage1/kreymer cd /local/stage1/kreymer Before the new index time ${PG}/paloonew "" "" ${PG}/recopa mountfile.grow real 8m2.860s user 3m3.453s sys 0m58.444s -rw-r--r-- 1 kreymer numi 1533038 Oct 20 12:12 CandS.root -rw-r--r-- 1 kreymer numi 233549 Oct 20 12:12 ntupleStS.root After the new index real 4m33.607s user 2m56.538s sys 0m57.384s -rw-r--r-- 1 kreymer numi 1533038 Oct 20 14:17 CandS.root -rw-r--r-- 1 kreymer numi 233549 Oct 20 14:17 ntupleStS.root real 4m15.273s user 2m53.797s sys 0m55.029s ######### # BATCH # ######### per asousa, for numiwrk/Batch we pages, pts membership wadmnumi:numiweb pts membership wadmnumi:numiweb | grep masaki pts adduser -user masaki -group wadmnumi:numiweb ============================================================================= 2008 10 16 ============================================================================= ########## # PARROT # ########## mindata@minos26 Repeat test of new d141(ups) d199(minsoft) copies, with make_growfs.auto MD=/afs/fnal.gov/files/data/minos PD=/minos/scratch/parrot PG=/grid/fermiapp/minos/parrot cd ${PD} time ./make_growfs.auto -k ${MD}/d199 make_growfs: 2557394 files, 7810 links, 117623 dirs, 0 checksums computed real 25m59.916s user 2m55.113s sys 11m24.894s 116 .growfsdir time ./make_growfs.auto -k ${MD}/d141 make_growfs: 1088642 files, 5588 links, 150368 dirs, 0 
checksums computed real 11m0.743s user 1m27.281s sys 2m36.532s 47 .growfsdir Suggestion, progress messages could be shorter, by omitting the common path to the area beng indexed : make_growfs: following link /afs/fnal.gov/files/data/minos/d199/releases/.. could be make_growfs: following link releases/.. There are some differences in these indexes, e.g. in d199 $ diff .growfsdir oldparrot/20080912/.growfsdir | less < F arch_spec_doc.mk 33188 4217 22456815 0 --- > F arch_spec_doc.mk 33188 4228 -14522322 0 Test this with a full reco job Find an idle node at http://rexganglia2.fnal.gov/farms/?c=GP Farm&m=&r=hour&s=descending&hc=4 fnpc387 PG=/grid/fermiapp/minos/parrot mkdir /local/stage1/kreymer cd /local/stage1/kreymer Regular minimal pallon ${PG}/paloon ${PG}/paloon "" "" ${PG}/paloon ${PG}/paloon "" "" ${PG}/recopa seg faults sending datagram set faults running recopa Try a somewhat older node fnpc238 ${PG}/paloon Still a segfault sending datagram ${PG}/paloonew "" "" ${PG}/recopa Still setfault running loon Try a node formerly used fnpc185 - formerly tested cd $PG ./paloon "" "" ./recopa no segfault for datagram no segfault running recopa mkdir /local/stage1/kreymer cd /local/stage1/kreymer ${PG}/paloon "" "" ${PG}/recopa Spill(100000 in 750 out 99250 filt.) Try a shorter file, N00009761_0010.mdaq.root -bash-3.00$ time ./paloon "" "" ./recopa real 4m11.336s user 3m4.323s sys 0m29.075s cd ${PG} -bash-3.00$ time ${PG}/paloon "" "" ${PG}/recopa real 4m56.638s user 3m10.155s sys 0m52.410s OK, wrote root files. Now try paloonew, old files time ${PG}/paloonew "" "" ${PG}/recopa mountfile.grow real 4m39.656s user 3m7.654s sys 0m42.132s updated mountfile.d199d141.grow, adding MINOS_EXTERNAL, sim, release_data rm -r /local/stage1/kreymer/parrot time ${PG}/paloonew "" "" ${PG}/recopa mountfile.d199d141.grow real 5m51.853s user 3m19.409s sys 0m27.931s ########## # CONDOR # ########## rhatcher hacked setup_minos so that users will have a dummy loon, root, etc in their path The setup ends like using PYTHIA6 (v6_409) for LUND *********************************************************************** * WARNING: do NOT run loon or root on minos25 *********************************************************************** Running loon gets you ******************************************************************* ******************************************************************* ** MINOS25.FNAL.GOV: condor head node ** user should not run executables here ** attempted to run: "loon" ******************************************************************* ******************************************************************* ####### # DAQ # ####### changed buckley to minos-data in email from archiver cp -a archiver_near_daq.config archiver_near_dcs.config.20070531 nedit archiver_near_daq.config cp -a archiver_near_dcs.config archiver_near_dcs.config.20051103 nedit archiver_near_dcs.config cp -a archiver_far_daq.config archiver_far_daq.config.20070531 nedit archiver_far_daq.config cp -a archiver_far_dcs.config archiver_far_dcs.config.20051103 nedit archiver_far_dcs.config cp -a archiver_beam.config archiver_beam.config.20080724 nedit archiver_beam.config Also in minfarm@fnpcsrv1 : /home/minfarm/scripts/check_delivery ######## # GRID # ######## Date: Mon, 06 Oct 2008 12:06:03 -0500 (CDT) Subject: minos-admin HelpDesk ticket 122469 Reminder ___________________________________________________________________ Requester Name: JOSHUA BOEHM Phone: 3316 E-Mail Address: BOEHM@PHYSICS.HARVARD.EDU Incident Time: 10/1/2008 1:59:18 PM 
System Name: Priority: Medium Problem Category: Software Type: Other Item: Other Urgency: Medium Short Description: condor jobs losing permissions Problem Description: I've submitted a large number of jobs through the glide-in system to the minos part of the farm. Oddly a random subset of these jobs appear to be dying with a permission denied error trying to access the condor scripts. Its not universal, new jobs are successfully running, but most are dying. This started around 13:00 cst, I thought perhaps my tokens had expired, but I logged out and in with a new ticket and even new submissions are demonstrating this problem. Things ran perfectly smoothly as best as I can tell for the previous 20 hours or so. Is there an obvious setting I missed that would be causing this? Have I missed a setting? the scripts that are running are located in /minos/scratch/boehm/MREGeneration/SummaryMake And assuming they haven't all died the current batch of jobs showing issues is condor cluster 199262 Thanks, Josh ___________________________________________________________________ Date: Mon, 13 Oct 2008 12:05:42 -0500 (CDT) reminder ___________________________________________________________________ Date: Fri, 17 Oct 2008 19:07:24 +0000 (GMT) I was on vacation Sep 26 through Oct 12. Josh - has this problem cleared up ? I do not see reports of such problems from our current active users. I do not see unusual activity at around that time in the glidein statistics plots. Were your jobs failing on a specific node ? Sometimes a single node with a filesystem problem can consume much more than its share of failing jobs, until the FermiGrid people spot this and take it out of the configuration. ___________________________________________________________________ Date: Fri, 17 Oct 2008 23:38:17 +0000 (GMT) Resolved I hear from rhatcher that your problem was related to the 'setuid' issue, which has since been resolved. I am marking this helpdesk ticket Resolved. ============================================================================= 2008 10 15 ============================================================================= ######## # DATA # ######## requested scan of near cedar sntp cosmics check that they are in pnfs MINOS26 > ./stage -d -p 0 -s cosmic reco_near/cedar/sntp_data/2008-10 most claim to be off disk, not likely. ./dc_stat N00014901_0000.cosmic.sntp.cedar.0.root This claims the file is not in dcache. Not so, quick response from MINOS26 > ./dccptest /reco_near/cedar/sntp_data/2008-10/N00014901_0000.cosmic.sntp.cedar.0.root 2,0,0,0.0,0.0 :c=1:2b854664;h=yes;l=708962619; 708962619 bytes in 13 seconds (53257.41 KB/sec) -rw-r--r-- 1 kreymer g020 708962619 Oct 16 17:39 /local/scratch26/kreymer/N00014901_0000.cosmic.sntp.cedar.0.root So level2 information is stale or wrong. Run anyway, extra dccp -P will be run, that should be OK. 
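For reference, a minimal sketch of such a prestage pass over one month of cosmic sntp
files, assuming the mounted /pnfs paths used elsewhere in this log; dccp -P only issues
a stage request, no data is copied. The real work is done by the ./stage loop that follows.

  DDIR=/pnfs/minos/reco_near/cedar/sntp_data/2008-10
  for FILE in `ls ${DDIR} | grep cosmic.sntp` ; do
      dccp -P ${DDIR}/${FILE}           # prestage only
  done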
CDIRS=`ls for DIR in $CDIRS ; do ./stage -w -s cosmic reco_near/cedar/sntp_data/${DIR} ; done > minos/log/stage/reco_near_cedar_cosmic.log MINOS26 > grep Needed reco_near_cedar_cosmic.log Needed 1/1 Needed 0/632 Needed 0/772 Needed 0/736 Needed 0/740 Needed 0/743 Needed 0/718 Needed 0/704 Needed 418/716 Needed 288/745 Needed 633/754 Needed 412/605 Needed 0/18 Needed 33/704 Needed 0/526 Needed 12/743 Needed 1/576 Needed 0/695 Needed 0/703 Needed 0/742 Needed 0/55 Needed 0/47 Needed 0/43 Needed 0/36 Needed 0/48 Needed 0/44 Needed 0/59 Needed 0/20 Needed 0/5 Needed 3/26 Needed 0/43 Needed 4/71 Needed 0/42 Needed 2/36 Needed 0/49 Needed 7/37 Needed 6/34 Needed 48/48 Needed 48/48 Needed 40/40 Needed 15/15 # DATA # Date: Fri, 10 Oct 2008 09:49:17 -0500 From: George Szmuksta To: Arthur Kreymer Cc: Dcache Admin Subject: Minos dcache pnfs files with no layer information There are 2 files in pnfs dated 10/07/08 that not have any pnfs layer information. These files are not in dcache and should be deleted and retransfered. /pnfs/fs/usr/minos/reco_near/cedar_phy_bhcurv/cand_data/tmp1.27374 /pnfs/fs/usr/minos/reco_near/cedar_phy_bhcurv/cand_data/tmp1.27412 Sorry for the inconvenience. George Szmuksta SSA _________________________________________________________ -rw-rw-r-- 1 mstrait e875 0 Oct 7 12:20 tmp1.27374 -rw-rw-r-- 1 mstrait e875 0 Oct 7 12:20 tmp1.27412 I have removed these. _________________________________________________________ ######## # DATA # ######## Helping to make space, rhatcher is archiving files for pawloski mindata@minos26 mkdir /pnfs/minos/analysis enstore pnfs --tags file family is minos, want analysis enstore pnfs --file_family analysis $ mkdir nue $ cd nue $ enstore pnfs --file_family analysis_nue Date: Thu, 16 Oct 2008 15:33:57 -0500 (CDT) Subject: HelpDesk ticket 123303 ___________________________________________ Short Description: Please assign CD-LTO4G1 to /pnfs/minos/analsys and /pnfs/minos/analysis/nue Problem Description: enstore-admin : We need to archive a few TBytes of minos files, similar to what we did previously in /pnfs/minos/stage. This time we will write under /pnfs/minos/analysis Please assign the CD-LTO4G1 library to /pnfs/minos/analysis /pnfs/minos/analysis/nue And see to it that there are a few tapes available, 10 should be more than enough for now. Thanks ! ___________________________________________ Date: Fri, 31 Oct 2008 16:20:43 -0500 (CDT) Your request has been put into a pending status by the expert working on the problem. Pending Reason: On Hold By Expert ___________________________________________ Date: Mon, 03 Nov 2008 11:04:25 -0600 (CST) Note To Requester: Art, The library tag for the /pnfs/minos/analysis directory has been changed from CD-9940B to CD-LTO4G1. No other tags were changed. We have also increased the quota of LTO4 tapes for Minos by 10. The quota was set to 25 tapes but is now set to 35. Minos currently has 19 tapes in use. The /pnfs/minos/analysis/nue directory is also updated. It automatically inherits the library tag of its parent directory. Please try this out and then let me know if I can close this Remedy request. Ken S -- SSA Group ___________________________________________ Date: Mon, 03 Nov 2008 17:25:43 +0000 (GMT) Thanks for updating the /pnfs/minos/analysis tags, for future writes, and adding the tapes. We completed this round of file copies a couple of weeks ago, hence that data went to 9940B tapes. We will keep an eye on things the next time we write. This ticket can be closed out. 
___________________________________________
Date: Mon, 03 Nov 2008 11:41:15 -0600 (CST)
Solution: Library tag was updated and quota was increased.
This ticket was resolved by SCHUMACHER, KEN of the CD-SF/DMS/DSC/SSA group.

########
# FARM #
########

cedar_phy_bhcurvmcnearcharm.log and helium, errors in samsub,
starting Tue Oct 14 12:11:53 CDT 2008
Traceback (most recent call last):
  File "/home/minfarm/scripts/samsub", line 159, in ?
    SUB = FILE.strip().split('_')[1].split('.')[0]
IndexError: list index out of range
Many pending runs in helium and charm, regardless of this problem.
Disabled linfix ( complete ) and helium/charm in corral ( stuck )

##########
# CONDOR #
##########

At around 06:00, an interactive loon job by bckhouse on minos25
lost its network connection.
Condor processes on the cluster seem to have stopped,
as well as most other schedd activity, including gfactory plots.
Process writing to /minos/scratch succeeded, but hung up after the write.
He killed the lost loon around 11:14; schedd and user processes broke loose at that time.

##########
# RUSTEM #
##########

Per his request, to allow building his code on Minos Cluster
MINOS26 > upd install -j mysql v5_0_22
informational: installed mysql v5_0_22.
upd install succeeded.

############
# PREDATOR #
############

Stuck running dbu on
N00014991_0007.mdaq.root Tue Oct 14 14:06:13 UTC 2008
recovered.
Stuck permanently on
N00014995_0000.mdaq.root Tue Oct 14 22:06:14 UTC 2008
N00014995_0001.mdaq.root Tue Oct 14 22:09:09 UTC 2008
N00014996_0000.mdaq.root Tue Oct 14 22:11:29 UTC 2008
N00014997_0000.mdaq.root Tue Oct 14 22:17:35 UTC 2008
N00014998_0000.mdaq.root Wed Oct 15 00:19:23 UTC 2008
through
N00014998_0015.mdaq.root Wed Oct 15 15:05:26 UTC 2008
Similar for far,
F00042018_0016.mdaq.root Mon Oct 6 22:06:14 UTC 2008
through
F00042058_0018.mdaq.root Wed Oct 15 15:48:00 UTC 2008
And
B081014_080001.mbeam.root Wed Oct 15 11:28:41 UTC 2008
B081014_160001.mbeam.root Wed Oct 15 11:33:27 UTC 2008
B081015_000001.mbeam.root Wed Oct 15 11:35:16 UTC 2008
N081010_145548.mdcs.root Wed Oct 15 11:37:49 UTC 2008
This cleared up when the DCache queues cleared Thursday.

########
# DATA #
########

11:00 dbu has been stuck since last night, see above
Stuck in
MINOS26 > ./dccptest /neardet_data/2008-10/N00014995_0000.mdaq.root
2,0,0,0.0,0.0
:c=1:d0bf66c5;h=yes;l=81922203;
Big mover queues on 10a-3, 11a-3, 12a-3, 9a-3 ( write pools )
Many connections to beam_data from fnpc34* nodes.
But I don't believe this listing; Started/Active times are like
Aug 25 08:34:56 Aug 25 08:40:49
The login plot for door 0 shows a spike to nearly 250 late last night,
down to 120 this morning, down to 56 right now ( 14:00 )
Queues on 10,12a-3 have cleared, 37,30 on 12,9
15:45, queue down to 15
15:55 queue down to 12, only on w-stkendca9a-3
17:24 - queues have cleared

Date: Wed, 15 Oct 2008 11:50:35 -0500 (CDT)
Subject: HelpDesk ticket 123203
___________________________________________
Short Description: FNDCA login list is stale ?
Problem Description:
I am trying to hunt down the source of an overload of STKEN
RawDataWritePools, as indicated by queues in
http://fndca.fnal.gov:2288/queueInfo
I look at the login list, at
http://fndca3a.fnal.gov/dcache/DOORS.html
The time stamp on the listing is current, Wed Oct 15 11:23:02 2008
But almost all connections show Started/Active times like
Aug 25 08:34:56 Aug 25 08:40:49
Is this page stale ?
We need to track down the user who is overloading this pool.
This ticket is assigned to SCHUMACHER, KEN of the CD-SF/DMS/DSC/SSA.
___________________________________________ Date: Wed, 15 Oct 2008 14:20:35 -0500 (CDT) The problem has been turned over to the developers. It is bugzilla ticket number 125. Thank you ___________________________________________ Date: Wed, 15 Oct 2008 22:32:04 +0000 (GMT) The queuing has cleared from the RawDataWritePools, as of around 17:00 CDT. My programs that access this pool group have resumed normal operation. The non-timestamp content of http://fndca3a.fnal.gov/dcache/DOORS.html has not changed between 14:19 and 17:27 today, So it seems the content of this page is indeed stale. ___________________________________________ > Just another data point. > > The content of http://fndca.fnal.gov:2288/queueInfo today > is identical to the content yesterday, > with the exception of the time stamp. ___________________________________________ Date: Fri, 17 Oct 2008 10:55:16 -0500 We are looking at this problem. Thanks, Timur ___________________________________________ Date: Fri, 17 Oct 2008 11:27:45 -0500 The problem was looked at by enstore developers and experts at the storage meeting. The following was determined. These are requests for small files. The requests cause multiple mount / rewind operations and because of these delays the effective transfer rates are low. We will continue looking at how the situation can be improved. Thanks, Timur ___________________________________________ I don't quite understand your reply. The problem is not with the transfers listed on the login page. The problem is that the contents of the page is 2 months old. ___________________________________________ Date: Fri, 17 Oct 2008 11:34:56 -0500 From: Timur Perelmutov It is not old for me, can it be cached in your web browser? ___________________________________________ Date: Fri, 17 Oct 2008 17:55:10 +0000 (GMT) From: Arthur Kreymer The content of the page is different each time I display it so this is not a caching problem : MIN > diff DOORS1017.html DOORS1419.html 7c7 < Thu Oct 16 17:11:02 2008 --- > Wed Oct 15 14:19:01 2008 97c97 < Finished at Thu Oct 16 17:11:05 CDT 2008 --- > Finished at Wed Oct 15 14:19:04 CDT 2008 MIN > The time stamps are changing. It is the login list data that is stale. ___________________________________________ Date: Fri, 17 Oct 2008 14:05:36 -0500 _Then I do not understand what particular information on the page http://fndca.fnal.gov:2288/queueInfo is stale? And why do you think it is stale? __________________________________________ Date: Fri, 17 Oct 2008 19:12:07 +0000 (GMT) _ The contents of the page, aside from the overall page timestamps, has apparently not changed since August 25, the latest Started entry on that page. I am quite sure that there have been new DCache logins since Aug 25, and that some of there are active. At the time that I first spotted this problem, there were at least 30 recent, open, active logins, not reflected in the login list. JobId door Node State Started Last Active UID/PID Role Username Pool [PNFS Id] [Timer] [File Seq] [Client Id] [Client Pid] Kind Status(time-in-state) Command DCap00-stkendca2a-unknow-93225 DCap00-stkendca2a-unknow-93225 fnpc341.fnal.gov active Aug 25 17:41:21 Aug 25 17:54:56 7927/9134 DCap00-stkendca2a-unknow-93225 E875 Minos ? ? ? ? stat minos/beam_data/2007-11/B071121_224612.mbeam.root __________________________________________ The stale page is http://fndca3a.fnal.gov/dcache/DOORS.html __________________________________________ Date: Fri, 17 Oct 2008 14:20:17 -0500 Ok, I was looking at a completely different page. 
__________________________________________ Date: Fri, 17 Oct 2008 14:23:46 -0500 Art, We have watched the resotres changed so that is not stale. We have come across an enstore bug we think. See if this makes sense to you. Minos is currently reading almost all, if not all, of the files off of a certain tape containing thousands of files. dCache requests these files over time in no particular order. enstore should order these and assure that the tape progressively moves forward. What is happening is new requests, some behind the current tape position, are not getting ordered properly such that the tape is seeking back and forth and not reading sequentially. The rate is really slow due to this seeking, so the queues are being drained very slowly, and the restore queues in dCache aren't changing by very much. Development is working on the problem. Gene N.B. - AK - these are reco_far/cedar_phy_bhcurv/.bcnd_data spill files, like 2007-02/F00037654_0001.spill.bcnd.cedar_phy_bhcurv.0.root __________________________________________ Date: Wed, 29 Oct 2008 13:43:09 +0000 (GMT) The http://fndca3a.fnal.gov/dcache/DOORS.html login list continued to show August data earlier this week. But this morning it is up to date, as of Oct 29 08:36:12 CDT 2008 I am curious as to the root cause of the problem. Thanks, this ticket can be closed. __________________________________________ Date: Fri, 31 Oct 2008 14:12:53 -0500 (CDT) Solution: The "stale" report may have been related to a backlog caused by a large number of small files being read from tape and possibly related to a bug in Enstore sequencing of these read from tape requests. This ticket was resolved by SCHUMACHER, KEN of the CD-SF/DMS/DSC/SSA group. __________________________________________ Date: Fri, 31 Oct 2008 14:12:54 -0500 (CDT) Note To Requester: Art, I'm not sure if this is actually the root cause you asked about. But I did find the following information in a dCache developer report from two weeks ago. I will mark this request as resolved. If you encounter further problems, please open a new request. Ken S. -- SSA Group __________________________________________ ============================================================================= 2008 10 14 ============================================================================= ######## # FARM # ######## MINOS26 > ./pnfsdirs far cedar_phy_linfix daikon_00 L010185N write MINOS26 > ./pnfsdirs near cedar_phy_linfix daikon_00 L010185N write SRV1> nedit ~/ROUNTMP/ROOTRELS added cedar_phy_linfix export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.linfix samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.linfix done New applicationFamilyId = 258 Application family already exists: id = 258 New applicationFamilyId = 95 Application family already exists: id = 95 New applicationFamilyId = 368 Application family already exists: id = 368 Note that daikon_00 has 1 subrun per run, concatenation is swift. Picked up the existing 12 runs ( mrnt and sntp ) SRV1> ./roundup -r cedar_phy_linfix mcfar Ran cleanly, declared files to sam ( most already on tape ). 
corral : [ ${BADS} -le 1 ] && ${HOME}/scripts/roundup -c -r cedar_phy_linfix mcfar || (( BADS++ )) #[ ${BADS} -le 1 ] && ${HOME}/scripts/roundup -c -r cedar_phy_linfix mcnear || (( BADS++ )) ####### # WEB # ####### MIN > ln -sf protons.20081014.html protons.html # was protons.20080117.html ######## # DATA # ######## MINOS26 > du -sm /grid/app/minos/* 840 /grid/app/minos/Minossoft du: cannot read directory `/grid/app/minos/VDT/vdt/extract': Permission denied du: cannot read directory `/grid/app/minos/VDT/vdt/backup': Permission denied du: cannot read directory `/grid/app/minos/VDT/vdt/services': Permission denied 288 /grid/app/minos/VDT 1 /grid/app/minos/bin du: cannot read directory `/grid/app/minos/minfarm/Minossoft/EXTERNAL/mysql-5.0.22/sql/share/japanese-sjis': Permission denied 18007 /grid/app/minos/minfarm 1 /grid/app/minos/parrot 848 /grid/app/minos/parrotold 56 /grid/app/minos/sam 5 /grid/app/minos/scripts 1 /grid/app/minos/test 9471 /grid/app/minos/users Rustem is using 9 GB, including recently built ROOT version. Requested his removal of files from /grid/app via email, cc minos_batch ============================================================================= 2008 10 13 ============================================================================= ########### # MONTHLY # ########### DATASETS 10/13 PREDATOR 10/13 VAULT 10/3 MYSQL 10/16 mysql timing, offline copies Mon Oct 13 14:20:26 CDT 2008 Mon Oct 13 15:08:13 CDT 2008 Adjusted HOWTO.dbarchive to use /tmp/*.sql for gzip phase, no more cut/paste. ############ # VACATION # ############ Predator - no neardcs since Mon Oct 6 10:09:15 2008 UTC Checklist - Cluster ganglia shows mostly wait state Oct 07 15:00 through Wed 08 Oct 06:00 and a high load avarage ( over 150 ) High on minos25, low on the rest 1.5 GB data free, needs attention Blue Arc was clean, ############# # MAIL SCAN # ############# Nearly 1000 emails to dig through cdfdev - zoomcvs moved from cdfcvs to cdcvs Fri, 26 Sep 2008 10:13:01 -0500 minosshift - Many messages FAR DAQ web status unable to contact minos-om.fnal.gov Sep 26 Oct 3 Thursday ( with and without web status string ) 12/30/2007 NEAR daqautoclean.sh refused kerberos ticket by minos-om.fnal.gov nas - Reboot RHEA 2 Fri 10/10 affects Windows - cdserver1 numiserver1 farm - down Sep 22 through Oct 5 due to DCache/SRM authentication problems. lusers - parrot on 2.6.9 ? firefox update from 1.5 to 3.0 Thu 2 Oct., SLF 4.5 and older Also affected my desktop minosbatch masaki starrrting web page, needs access to /afs/fnal.gov/files/expwww/numi/html/workgrps doing 1700 runs of mcfar cedar_phy_linfix mnv fermigrid mounts allowed ? x need for VO - nope. minosdata ticket 122261 - fermiapp quota - done HDS disk configuration - plan ? parrot setuid problem discussion - new nodes at 2.6.9-78 kernel ticket 122483 2 hour emergency Enstore downtime Thu Oct 9 10:00 - web outage scheduled 9 Oct 06:00 - 07:00 minosadmin 121790 - fnalu mounts /m/d and /m/s mounted , /grid is not mounted. jyuko - how to setup under condor ? - referred to loont/loonb jdejong - 7 day job cannot write to afs. Yep. 121520 - crl during dns - was a database problem, closed 122270 - rodriges - missing function.h on fnpc339 working OK now. 122469 - condor script access 13:00 cst Wed, 01 Oct 2008 119292 - Thu, 02 Oct 2008 13:09:15 gahp_server upgrade for grid errors we need condor upgrad ( to ? 
) parrot setuid problem - scavan minossim - hgallag keytab problems - resolved, bad ssh client mail list ( new coordinater hennessy ) continue to use minos_sim parrot meeting 1211, grid 2008 - health exam ============================================================================= KREYMER ON VACATION THROUGH 12 OCTOBER ============================================================================= 2008 09 26 ============================================================================= DCache seminar 09:00 WH10NW Write queue timers/limits should act per pool group, not per pool Need wild cards in file family pool associations ( e.g. all Ntuples ) Kerberos doors hang up due to single client access with expired cert Management of ports for doors clients need list of valid ports, or automatic port assignment File leveling and migration for pool additions and retirement Per-file time overheads ( small file management ) ######## # GRID # ######## Date: Fri, 26 Sep 2008 08:42:47 -0500 (CDT) Subject: HelpDesk ticket 122261 ___________________________________________ Short Description: Increase to e875/minos/numi/5111 group quota in /grid/fermiapp Problem Description: The group quota for 5111 a.k.a. e875/minos/numi in /grid/fermiapp is only 30 GB. This is not enough to hold all the files formerly in /home/minfarm etc. We will shift some of these to /minos/data, but we could still use more space in /grid/fermiapp. Please increase this quota to 100 GBytes, at your next convenience. This will expedite our retirement our use of /grid/app. Thanks ! Please reply to minos-data, as user kreymer is leaving on vacation today. ___________________________________________ Date: Mon, 29 Sep 2008 12:12:32 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: We have passed the quota request on to CSI-WST group. It may take a bit, as it turned out that this volume is currently configured with user-by-user quotas and not group-by-group quotas so it could take some time to reconfigure it. Steve Timm ___________________________________________ Date: Mon, 29 Sep 2008 13:00:56 -0500 (CDT) Solution: This request has been completed. ============================================================================= 2008 09 25 ============================================================================= ########## # CONDOR # ########## Date: Thu, 25 Sep 2008 12:22:36 -0500 (CDT) Subject: HelpDesk ticket 122224 ___________________________________________ Short Description: minos25 configuration file needs to be update before the weekend Problem Description: run2-sys : Please update the minos25 local configuration file, /opt/condor-7.0.1/local/condor_config.local to have the content of /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/ condor_config.local.minos25.20080925 We need to have this done today if at all possible. This is to correct a parameter which has been limiting out glideins to Fermigrid to only 100 out of the 350 jobs we should normally get. ___________________________________________ Date: Thu, 25 Sep 2008 12:38:04 -0500 (CDT) This ticket has been reassigned to COOPER, GLENN of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 25 Sep 2008 12:55:23 -0500 (CDT) From: Glenn Cooper I copied the file in. Do I need to restart/reload condor, or will it read the file each time a job is submitted? 
____________________________________________ Date: Thu, 25 Sep 2008 19:31:06 +0000 (GMT) From: Arthur Kreymer I see no change, aside from the modified time, in /opt/condor-7.0.1/local/condor_config.local The file should end like : MINOS26 > tail -6 /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local.minos25.20080925 ########################################################## # Set the number of jobs that can be submitted to glide in, default 100 # setting this to the full gpfarm, set a tighter limit via gfrontend ########################################################## GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=1000 ___________________________________________ Date: Thu, 25 Sep 2008 21:08:05 +0000 (GMT) From: Arthur Kreymer Thanks for updating the file, this went fine. Unfortunately, at or around Sep 18 13:53, /opt/condor-7.0.1/etc/condor_config was updated on all the Minos Cluster nodes, including minos25. Please, immediately, restore the correct content. My copy of the former file is in /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config __________________________________________ Date: Thu, 25 Sep 2008 16:27:29 -0500 (CDT) From: Glenn Cooper Art's version put into cfengine and pulled to minos25. The other nodes should get it over the next few hours. Glenn __________________________________________ 16:29 - condor_reconfig minos25 # by rhatcher 16:32 - restarted condor_gfactory process Date: Thu, 25 Sep 2008 21:45:57 +0000 (GMT) From: Arthur Kreymer Thanks, we have run condor_reconfig minos25 and have restarted the gfactory process. The gfactory processes are registered again, seen by gfrontend. New glidein processes are now running, and user jobs have started again. Thanks !!! __________________________________________ Date: Thu, 25 Sep 2008 16:32:27 -0500 (CDT) From: Glenn Cooper Not sure how the incorrect file got there, epecially with a Sep 18 date. Our subversion logs show no changes to this file since May 20 (until today, of course). I'll investigate further and let you know if I find anything. __________________________________________ MINOS25 > condor_q gfactory | tail -1 94 jobs; 18 idle, 74 running, 2 held MINOS25 > date Thu Sep 25 16:35:09 CDT 2008 MINOS25 > condor_q gfactory | tail -1 ; date 96 jobs; 15 idle, 79 running, 2 held Thu Sep 25 16:35:44 CDT 2008 MINOS25 > condor_q gfactory | tail -1 ; date 116 jobs; 13 idle, 100 running, 3 held Thu Sep 25 16:42:38 CDT 2008 The plots have come alive, at http://www-numi.fnal.gov/gfactory/monitor/glidein_t20_glexec/total/ MINOS25 > condor_q gfactory | tail -1 ; date 169 jobs; 14 idle, 155 running, 0 held Thu Sep 25 17:09:04 CDT 2008 ######### # MYSQL # ######### SOFT03 > ups declare -c mysql v5_0_67 DECLARE: A UPS start/stop exists for this product SOFT03 > ups tailor mysql Enter valid path for mysql data directory: /home/minsoft/database Never use default port number 3306 for any mysql server instances! Assign your port number here:3306 You can update mysql server options in my.cnf file before you start mysql server. Please assign a new username for your mysql daemon. For security it is recommended to substitute this name for mysql root in a mysql database. See README file in your mysql datadir for more details. Do not forget to set a strong password for root user IMMEDIATELY after initial startup of mysql daemon! Then replace root username with the newly assigned username. 
Enter your new username here:root Mysql server with server_id = 1 was already configured on minos-sam03.fnal.gov machine. Would you like to configure next mysql server on minos-sam03.fnal.gov machine (y,n)? n SOFT03 > ups start mysql Setup:mysql datadir = /home/minsoft/database Setup:port=3306; socket=/home/minsoft/database/mysql.sock SOFT03 > WARNING: Found /home/minsoft/database/my.cnf Datadir is deprecated place for my.cnf, please move it to /home/minsoft/ups/prd/mysql/v5_0_67/Linux-2-6 Starting mysqld daemon with databases from /home/minsoft/database SOFT03 > ups rootpass mysql Setup:mysql datadir = /home/minsoft/database Setup:port=3306; socket=/home/minsoft/database/mysql.sock Enter password for root user: Setup root password for root@localhost is O.K. You also need to set this password for root@minos-sam03.fnal.gov when you start mysql client. You can do it using following command in mysql: mysql> SET PASSWORD FOR root@minos-sam03.fnal.gov=PASSWORD('new_password'); See user table in mysql database. ============================================================================= 2008 09 24 ============================================================================= 195912.0 gfactory 9/24 08:41 0+00:00:00 I 0 0.0 glidein_startup.sh 195932.0 gfactory 9/24 10:16 0+00:00:00 I 0 0.0 glidein_startup.sh 195932.1 gfactory 9/24 10:16 0+00:00:00 I 0 0.0 glidein_startup.sh 195932.2 gfactory 9/24 10:16 0+00:00:00 I 0 0.0 glidein_startup.sh 195932.3 gfactory 9/24 10:16 0+00:00:00 I 0 0.0 glidein_startup.sh 195940.0 gfactory 9/24 10:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195944.0 gfactory 9/24 10:59 0+00:00:00 I 0 0.0 glidein_startup.sh 195967.0 gfactory 9/24 13:16 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.0 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.1 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.2 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.3 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.4 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.5 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.6 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.7 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.8 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 195974.9 gfactory 9/24 13:46 0+00:00:00 I 0 0.0 glidein_startup.sh 196017.2 gfactory 9/24 16:05 0+00:00:00 I 0 0.0 glidein_startup.sh 196018.0 gfactory 9/24 16:08 0+00:00:00 I 0 0.0 glidein_startup.sh 196020.0 gfactory 9/24 16:11 0+00:00:00 I 0 0.0 glidein_startup.sh 196021.0 gfactory 9/24 16:19 0+00:00:00 I 0 0.0 glidein_startup.sh 196024.0 gfactory 9/24 16:36 0+00:00:00 I 0 0.0 glidein_startup.sh 196025.0 gfactory 9/24 16:39 0+00:00:00 I 0 0.0 glidein_startup.sh 196027.0 gfactory 9/24 16:44 0+00:00:00 I 0 0.0 glidein_startup.sh 196031.0 gfactory 9/24 16:54 0+00:00:00 I 0 0.0 glidein_startup.sh 196032.0 gfactory 9/24 16:56 0+00:00:00 I 0 0.0 glidein_startup.sh 196034.0 gfactory 9/24 17:05 0+00:00:00 I 0 0.0 glidein_startup.sh MINOS25 > for CLU in ${CLUS} ; do printf "${CLU} " ; condor_q -l ${CLU} | grep GlideinEntryName ; done 195912.0 GlideinEntryName = "gpminos" 195932.0 GlideinEntryName = "gpminos" 195932.1 GlideinEntryName = "gpminos" 195932.2 GlideinEntryName = "gpminos" 195932.3 GlideinEntryName = "gpminos" 195940.0 GlideinEntryName = "gpminos" 195944.0 GlideinEntryName = "gpminos" 195967.0 GlideinEntryName = "gpminos" 195974.0 GlideinEntryName = "gpminos" 195974.1 GlideinEntryName = "gpminos" 
195974.2 GlideinEntryName = "gpminos" 195974.3 GlideinEntryName = "gpminos" 195974.4 GlideinEntryName = "gpminos" 195974.5 GlideinEntryName = "gpminos" 195974.6 GlideinEntryName = "gpminos" 195974.7 GlideinEntryName = "gpminos" 195974.8 GlideinEntryName = "gpminos" 195974.9 GlideinEntryName = "gpminos" 196017.2 GlideinEntryName = "gpgeneral" 196018.0 GlideinEntryName = "gpgeneral" 196020.0 GlideinEntryName = "gpgeneral" 196021.0 GlideinEntryName = "gpgeneral" 196024.0 GlideinEntryName = "gpgeneral" 196025.0 GlideinEntryName = "gpgeneral" 196027.0 GlideinEntryName = "gpgeneral" 196031.0 GlideinEntryName = "gpgeneral" 196032.0 GlideinEntryName = "gpgeneral" 196034.0 GlideinEntryName = "gpgeneral" MINOS25 > for CLU in ${CLUS} ; do printf "${CLU} " ; condor_q -l ${CLU} | grep QDate ; done 195912.0 QDate = 1222263702 195932.0 QDate = 1222269414 195932.1 QDate = 1222269414 195932.2 QDate = 1222269414 195932.3 QDate = 1222269414 195940.0 QDate = 1222271191 195944.0 QDate = 1222271941 195967.0 QDate = 1222280188 195974.0 QDate = 1222281985 195974.1 QDate = 1222281985 195974.2 QDate = 1222281985 195974.3 QDate = 1222281985 195974.4 QDate = 1222281985 195974.5 QDate = 1222281985 195974.6 QDate = 1222281985 195974.7 QDate = 1222281985 195974.8 QDate = 1222281985 195974.9 QDate = 1222281985 196017.2 QDate = 1222290308 196018.0 QDate = 1222290494 196020.0 QDate = 1222290683 196021.0 QDate = 1222291149 196024.0 QDate = 1222292175 196025.0 QDate = 1222292360 196027.0 QDate = 1222292641 196031.0 QDate = 1222293296 196032.0 QDate = 1222293389 196034.0 QDate = 1222293947 MINOS25 > datesec 1222290308 Wed Sep 24 16:05:08 CDT 2008 MINOS25 > datesec 1222293947 Wed Sep 24 17:05:47 CDT 2008 ########## # CONDOR # ########## Date: Wed, 24 Sep 2008 16:05:54 -0500 (CDT) Subject: HelpDesk ticket 122184 ___________________________________________ Short Description: Too few jobs running on GPFarm Problem Description: I see far fewer jobs than expected running on GPfarm. Our analysis users are getting well under half normal capacity, and we have several high priority jobs that we are trying to get through. The overall load on GPFarm seems pretty light, according to Ganglia, The 'nice' CPU is running around 20%. Condorview shows about 250 running processes, out of the 850 capacity. User jobs are getting in and running, but at nothing like normal capacity. According to condor_q, rubin has 97 jobs idle, only 14 running. The Minos glideins have 100 processes running, with over 20 idle. A few new pilots have gotten started during the day, with no net gain. We usually have more like 200 running. Any idea what has gone wrong ? ___________________________________________ Date: Wed, 24 Sep 2008 16:30:32 -0500 (CDT) From: HelpDesk Note To Requester: timm@fnal.gov sent this Notes To Requester: Art--with respect to the glideins, I checked the glideins and those that have not started, haven't started because they are waiting for the nodes with AFS. As far as Howie's jobs are concerned, it appears that his condor_gridmanager on fnpcsrv1 got stuck, I have now unstuck it. If you continue to see gfactory jobs sitting "unsubmitted" or "pending" on minos25 for any length of time, keep us posted. Steve ___________________________________________ Date: Wed, 24 Sep 2008 22:40:09 +0000 (GMT) From: Arthur Kreymer Thanks for unsticking the rubin jobs, they seem to have finished. You are right, the glideins submitted earlier today, through 16:05, were all going toward the saturated AFS nodes. 
There are 10 newer glideins submitted between 16:05 and 17:05, which I think are not tied to AFS, but which are also all idle. 196017.2 QDate = 1222290308 196018.0 QDate = 1222290494 196020.0 QDate = 1222290683 196021.0 QDate = 1222291149 196024.0 QDate = 1222292175 196025.0 QDate = 1222292360 196027.0 QDate = 1222292641 196031.0 QDate = 1222293296 196032.0 QDate = 1222293389 196034.0 QDate = 1222293947 Only the first of these has started to run, as of 17:34. ___________________________________________ Date: Thu, 25 Sep 2008 16:15:25 +0000 (GMT) From: Arthur Kreymer FYI, Minos glideinWMS status plots are available at http://www-numi.fnal.gov/gfactory/monitor/glidein_t20_glexec/total/0Status.day.large.html The gpgeneral glideins are not restricted to AFS nodes. The gpminos glideins are restricted to AFS nodes. We seem to have hit a ceiling of about 80 to 100 glideins. This is about the level of our hardware priority allocation. Is this a coincidence ? Recent glidein jobs are continuing to get started, but at a very restricted rate, consistent with some GPFarm limit, although the GPFarm nodes are mostly idle. ___________________________________________ ___________________________________________ ___________________________________________ ######## # FARM # ######## Studying confused F00041882 status, 25/24 reported All subruns 00 thru 23 are present in cand files. dds /pnfs/minos/reco_far/cedar/cand_data/2008-08/F00041882 Aug 29 through Sep 02 Have only 0, 7, 10, 12 in pass 0, Aug 31, for sntp and bntp SRV1> FILE=F00041882_0000.all.sntp.cedar.0.root SRV1> sam get metadata --file=${FILE} | grep parents | tr "'" \\\n | grep root | sort F00041882_0000.mdaq.root F00041882_0001.mdaq.root F00041882_0002.mdaq.root F00041882_0003.mdaq.root F00041882_0004.mdaq.root F00041882_0007.mdaq.root F00041882_0008.mdaq.root F00041882_0010.mdaq.root F00041882_0012.mdaq.root F00041882_0013.mdaq.root F00041882_0014.mdaq.root F00041882_0015.mdaq.root F00041882_0016.mdaq.root F00041882_0017.mdaq.root F00041882_0018.mdaq.root F00041882_0019.mdaq.root F00041882_0020.mdaq.root F00041882_0021.mdaq.root F00041882_0022.mdaq.root In farcat, have 05 06 09 10 11 23 dating Aug 31 and Sep 02 The problem is subrun 10, a duplicate ? Why is this not detected ? From logs, Sun Aug 31 06:07:28 CDT 2008 BADRUNS F00041882_0005.all.sntp.cedar.0.root BADRUNS F00041882_0006.all.sntp.cedar.0.root BADRUNS F00041882_0009.all.sntp.cedar.0.root BADRUNS F00041882_0011.all.sntp.cedar.0.root BADRUNS F00041882_0023.all.sntp.cedar.0.root The files were processed at that time.
SRV1> dds /minos/data/minfarm/farcat/F00041882_0010* -rw-rw-r-- 1 minospro numi 23029986 Aug 31 22:28 /minos/data/minfarm/farcat/F00041882_0010.all.sntp.cedar.0.root -rw-rw-r-- 1 minospro numi 8183846 Aug 31 22:28 /minos/data/minfarm/farcat/F00041882_0010.spill.bntp.cedar.0.root -rw-rw-r-- 1 minospro numi 5223452 Aug 31 22:28 /minos/data/minfarm/farcat/F00041882_0010.spill.sntp.cedar.0.root SRV1> SRV1> dds /pnfs/minos/reco_far/cedar/sntp_data/F00041882_0010* ls: /pnfs/minos/reco_far/cedar/sntp_data/F00041882_0010*: No such file or directory SRV1> dds /pnfs/minos/reco_far/cedar/sntp_data/2008-08/F00041882_0010* -rw-r--r-- 1 rubin numi 23031199 Aug 31 07:43 /pnfs/minos/reco_far/cedar/sntp_data/2008-08/F00041882_0010.all.sntp.cedar.0.root -rw-r--r-- 1 rubin numi 5223452 Aug 31 07:40 /pnfs/minos/reco_far/cedar/sntp_data/2008-08/F00041882_0010.spill.sntp.cedar.0.root SRV1> dds /pnfs/minos/reco_far/cedar/.bntp_data/2008-08/F00041882_0010* -rw-r--r-- 1 rubin numi 8183846 Aug 31 07:37 /pnfs/minos/reco_far/cedar/.bntp_data/2008-08/F00041882_0010.spill.bntp.cedar.0.root Let's check out DUP checking SRV1> . ./samsetup SRV1> ./samdup /minos/data/minfarm/farcat F00041882_0010.spill.bntp.cedar.0.root F00041882_0010.all.sntp.cedar.0.root F00041882_0010.spill.sntp.cedar.0.root The roundup script was escaping quotation marks around the -s \"${SEL}" argument being sent to samdup, causing nothing to be selected. Corrected this and a related typo in the DUP code ( extre [] brackets ) Cut a new roundup.20080923 version on the fly. Testing this out, this is also the first test of the new proxy. SRV1> ./roundup -D -r cedar far ########### # ROUNDUP # ########### Put today's changes for DUP handling into roundup.20080924 SRV1> cp -a AFSS/roundup.20080924 . SRV1> ln -sf roundup.20080924 roundup ######### # MYSQL # ######### On minos-sam03, created setups.sh script in home area of minsoft, unset UPS_DIR unset SETUP_UPS . 
/usr/local/etc/setups.sh export PRODUCTS=${HOME}/ups/db:/local/ups/db Test this, and move to the newer mysql, ########## # CONDOR # ########## No held jobs for the last couple of days, then a few this morning : MINOS25 > condor_q -hold gfactory -- Submitter: minos25.fnal.gov : <131.225.193.25:64961> : minos25.fnal.gov ID OWNER HELD_SINCE HOLD_REASON 195887.5 gfactory 9/24 07:55 Globus error 17: the job failed when the jo 195887.7 gfactory 9/24 07:55 Globus error 43: the job manager failed to 195887.9 gfactory 9/24 07:55 Globus error 17: the job failed when the jo ######### # MYSQL # ######### Make room in samread on minos-sam02 for database tests cd DBARCH/ -rw-r----- 1 samread 5024 8806 Aug 15 2007 PULSERDRIFT.frm -rw-r----- 1 samread 5024 75037580970 Aug 15 2007 PULSERDRIFT.MYD -rw-r--r-- 1 samread 5024 32655442989 Aug 17 2007 PULSERDRIFT.MYD.gz -rw-r----- 1 samread 5024 28319080448 Aug 15 2007 PULSERDRIFT.MYI -rw-r--r-- 1 samread 5024 8937925900 Aug 18 2007 PULSERDRIFT.MYI.gz Test integrity of the zipped PD files, remove originals, copy to /minos/data/mysql/old MINOS-SAM02 > gunzip -c PULSERDRIFT.MYI.gz > PDI MINOS-SAM02 > diff PULSERDRIFT.MYI PDI rm PULSERDRIFT.MYI PDI time gunzip -c PULSERDRIFT.MYD.gz > PDD real 59m8.515s user 18m53.518s time md5sum PULSERDRIFT.MYD PDD 96e5cb77b49526184e10d78f12969636 PULSERDRIFT.MYD 96e5cb77b49526184e10d78f12969636 PDD rm PULSERDRIFT.MYD PDD MINOS-SAM02 > mkdir /minos/data/mysql/old MINOS-SAM02 > time cp -va PULSER* /minos/data/mysql/old/ `PULSERDRIFT.frm' -> `/minos/data/mysql/old/PULSERDRIFT.frm' `PULSERDRIFT.MYD.gz' -> `/minos/data/mysql/old/PULSERDRIFT.MYD.gz' real 15m55.623s user 0m0.753s sys 1m54.251s Wow, data transfers peak up to 50 MBytes/second ! BlueArc must be happy today. But there are frequent minute long interruptions with 0 data rate. Packet rates are steady around 30K/second. du -sm shows rates about 46 MBytes/sec, maybe the du sample was lucky. Net transfer was about 40 GB/1000 sec or 40 MB/sec. SOFT03 > du -sm restore/20080902/ 66492 restore/20080902/ mkdir ~/MYSQL cd ~/MYSQL time scp -r -c blowfish minsoft@minos-sam03:restore/20080902 restore Typical rates are reported around 40 MB/sec ============================================================================= 2008 09 23 ============================================================================= ######## # FARM # ######## Howie's cert was about to expire created a fresh kreymer cert with Role=Production, on minos26 MINOS26 > . /minos/scratch/kreymer/VDT/setup.sh MINOS26 > cd /local/scratch26/kreymer/grid MINOS26 > voms-proxy-init -voms fermilab:/fermilab/minos/Role=Production -cert kreymerdoe.pem -key kreymerdoekey.pem -out kreymer-production.proxy -valid 10000:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Enter GRID pass phrase: phrase is too short, needs to be at least 4 chars Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Creating temporary proxy ............................................. Done Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Done Warning: fg6x1.fnal.gov:15001: The validity of this VOMS AC in your proxy is shortened to 86400 seconds! Creating proxy ............................................................. 
Done Warning: your certificate and proxy will expire Wed Mar 25 14:45:40 2009 which is within the requested lifetime of the proxy MINOS26 > voms-proxy-info -all -file kreymer-production.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : kreymer-production.proxy timeleft : 4385:57:02 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov attribute : /fermilab/minos/Role=Production/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL This will give a production role proxy for use by roundup. Copied this to /local/globus/minfarm/.grid SRV1> pwd /local/globus/minfarm/.grid SRV1> scp kreymer@minos26:/local/scratch26/kreymer/grid/kreymer-production.proxy . SRV1> cd /export/stage/minfarm/.grid Created draft local srmtestp, using production proxy, and adding a write and cleanup to /pnfs/minos/NULL Created a new roundup, using the new cert in the correct location. SRV1> ln -sf roundup.20080923 roundup # was roundup.20080915 SRV1> date Tue Sep 23 21:18:47 CDT 2008 ####### # WEB # ####### Per request of inkmann, reviewing all .htaccess files with Options +Includes These should be Options +IncludesNOEXEC MIN > find /afs/fnal.gov/files/data/minos/d119 -name .htaccess /afs/fnal.gov/files/data/minos/d119/prd/sam_config/v4_2_28/NULL/www/.htaccess /afs/fnal.gov/files/data/minos/d119/prd/sam_config/v4_2_34/NULL/www/.htaccess /afs/fnal.gov/files/data/minos/d119/prd/sam_bootstrap/v4_4_1/NULL/www/.htaccess /afs/fnal.gov/files/data/minos/d119/prd/sam_web_services/v0_9_8/NULL/www/.htaccess /afs/fnal.gov/files/data/minos/d119/prd/sam_web_services/v0_9_9/NULL/www/.htaccess /afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_4/v03/boost_1_34_1/regression/.htaccess Only the sam files are SSI enabled First checking all .shtml files to see that we are not using #exec MIN > find . -name \*\.shtml ./prd/sam_config/v4_2_28/NULL/www/index.shtml ./prd/sam_config/v4_2_34/NULL/www/index.shtml ./prd/sam_bootstrap/v4_4_1/NULL/www/index.shtml So the sam_web_services entry seem frivolous. None of these .shtml files contain the string exec. Corrected all sam* .htaccess files, in d119 and d141. Checked on minos-sam01 FILES=`find . -name \*\.shtml` for FILE in $FILES ; do echo ${FILE} ; grep exec ${FILE} ; done Found no #exec elements of directives ####### # SAM # ####### Note to sam-design I just received an email from the Fermilab Web security team, noting that several of the Minos .htaccess files contained Options +Includes This is apparently dangerous. 
They should be set up to prohibit #exec directives on the server side: Options +IncludesNOEXEC This is a sam issue because none of these particular .includes are from active Minos code, but are parts of various sam products whose files are incidentally being served to the web : sam_config/v4_2_28 sam_config/v4_2_34 sam_bootstrap/v4_4_1 sam_web_services/v0_9_8 sam_web_services/v0_9_9 Looking at products on the Minos station/dbserver, it seems that many sam products have Options +Includes sam_bootstrap sam_config sam_cp sam_gridftp sam_kerberos_rcp The good news is that none of our .shtml files seem to use the dangerous #exec element, so there is no immediate risk. The bad news is that the web security people will bug us until we change all the +Includes to +IncludesNOEXEC ############ # PRODUCTS # ############ Per loiacono request, upd install -j geant4 v4_8_1_p02 -q GCC_3_4_3 -f Linux+2.4-2.3.2 informational: installed geant4 v4_8_1_p02. upd install succeeded. ######## # GRID # ######## MINOS26 > cd /grid/app/minos MINOS26 > du -sm * $ du -sm /grid/app/minos/users/* 3202 /grid/app/minos/users/boehm 2975 /grid/app/minos/users/loiacono 1 /grid/app/minos/users/pawloski 2683 /grid/app/minos/users/rustem 10 /grid/app/minos/users/scavan MINOS26 > quota -v -s -g e875 Disk quotas for group e875 (gid 5111): Filesystem blocks quota limit grace files quota limit grace blue2:/fermigrid-data 315G 0 400G 128k 0 0 blue2:/fermigrid-app 26436M 0 30720M 391k 0 0 minos-nas-0.fnal.gov:/minos/scratch 5084G 0 6144G 1549k 0 0 minos-nas-0.fnal.gov:/minos/data 16384G* 0 16384G 1500k 0 0 blue2:/fermigrid-fermiapp 15291M 0 30720M 217k 0 0 ########## # PARROT # ########## 10:50 - as planned, move to use of /grid/fermiapp/minos/parrot mv /grid/app/minos/parrot /grid/app/minos/parrotold ln -s /grid/fermiapp/minos/parrot /grid/app/minos/parrot ######## # DATA # ######## rbpatter will create door lists in something like computing/config/dcachedoor pts membership wadmnumi:numiweb pts adduser -user rbpatter -group wadmnumi:numiweb ============================================================================= 2008 09 22 ============================================================================= ######### # CLUBS # ######### HOWTO.nodes - updated per current condor nodes flxb31 flxb32 flxb33 flxb34 flxb36 flxi09 flxi10 Can also log into flxb19 flxb35 ########## # DCACHE # ########## Why is level 2 information not indicating a pool for raw data? ./dc_stat /pnfs/minos/fardet_data/2008-09/F00041967_0000.mdaq.root ============================ PNFS status for /pnfs/minos/fardet_data/2008-09/F00041967_0000.mdaq.root -rw-r--r-- 1 buckley e875 43610714 Sep 21 21:36 F00041967_0000.mdaq.root LEVEL 2 2,0,0,0.0,0.0 :c=1:5cb8e1c2;h=yes;l=43610714; LEVEL 4 VO8699 0000_000000000_0000396 43610714 fardet_data /pnfs/fnal.gov/usr/minos/fardet_data/2008-09/F00041967_0000.mdaq.root 000F0000000000000864A588 CDMS122205097100000 stkenmvr25a:/dev/rmt/tps0d0n:479000022613 3277382081 ============================ ############ # DCCPTEST # ############ Created dccptest script, can copy recent raw data. 
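For the record, a minimal sketch of the sort of thing dccptest does ( not the actual script ; the door port, scratch path and choice of raw file here are illustrative assumptions ) :

  # copy one recent raw file through the unsecured dcap door,
  # then compare sizes against the PNFS listing
  DOOR=dcap://fndca1.fnal.gov:24125
  RAW=fardet_data/2008-09/F00041967_0000.mdaq.root
  dccp ${DOOR}/pnfs/fnal.gov/usr/minos/${RAW} /local/scratch26/kreymer/dccptest.root
  ls -l /local/scratch26/kreymer/dccptest.root /pnfs/minos/${RAW}
  rm -f /local/scratch26/kreymer/dccptest.root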
######## # FARM # ######## SRV1> ls -ltr /minos/data/minfarm/mcnearcat | grep charm | wc -l 1714 -rw-rw-r-- 1 minospro numi 29619026 Sep 19 18:49 n13037053_0003_L010185N_D04_charm.mrnt.cedar_phy_bhcurv.0.root SRV1> ls -ltr /minos/data/minfarm/mcnearcat | grep helium | wc -l 830 -rw-rw-r-- 1 minospro numi 39034470 Sep 21 00:01 n13038001_0015_M100200N_D04_helium.mrnt.cedar_phy_bhcurv.0.root ./roundup -b 2000 -s helium -r cedar_phy_bhcurv mcnear Mon Sep 22 11:00:39 CDT 2008 Need to set up helium and charm in corral. ######### # ADMIN # ######### CD105723 produced requisition 204475 last week. Buyer Gloinski PO 582475 SuperMicro server Promised Date: 09-Oct-2008 No PO yet for Sataboy It is there now, 17:00 CDT, PO 582564 F1F-141000HDRG SATABOY storage device configured with (14) 1TB disks ORDER DATE 22-Sep-2008 Promised Date: 13-Oct-2008 ============================================================================= 2008 09 21 Sun ============================================================================= ######### # MYSQL # ######### minsoft@minos-sam03 - added rearmstr ########## # PARROT # ########## Cloned to /grid/fermiapp, stop supporting /grid/app mindata@minos26 MINOS26 > du -sm /grid/app/minos/parrot 848 /grid/app/minos/parrot $ cp -vax /grid/app/minos/parrot /grid/fermiapp/minos/parrot $ date Sun Sep 21 19:17:27 CDT 2008 $ diff -r /grid/app/minos/parrot /grid/fermiapp/minos/parrot mountfile2.grow was missing link $ cp -a /grid/app/minos/parrot/cctools-current-20080717-i686-linux-2.6/mountfile2.grow /grid/app/minos/parrot $ cp -a /grid/fermiapp/minos/parrot/cctools-current-20080717-i686-linux-2.6/mountfile2.grow /grid/fermiapp/minos/parrot $ diff -r /grid/app/minos/parrot /grid/fermiapp/minos/parrot clean Will shift tomorrow. after confirming /g/fa mounts at Grid Users Meeting, mv /grid/app/minos/parrot /grid/app/minos/parrotold ln -s /grid/fermiapp/minos/parrot /grid/app/minos/parrot paloon - adjusted paths to /grid/fermiapp/... ########## # PARROT # ########## Test file for grid tests. mindata@minos26 . /afs/fnal.gov/ups/etc/setups.sh export PRODUCTS=/afs/fnal.gov/files/code/e875/general/ups/db cd /afs/fnal.gov/files/data/minos/release_data/parrot DFILE='dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root' dccp ${DFILE} . This is one of our short, 50KBytes test files. Where is a somewhat longer, 10 miinue MIN > ssh fnpc176 -bash-3.00$ cd /grid/fermiapp/minos/parrot -bash-3.00$ time ./paloon SETTING UP UPS SETTING UP MINOS real 0m55.410s OK , found my test files, cp /grid/fermiapp/minos/parrot/N00009870_0002.mdaq.root \ /afs/fnal.gov/files/data/minos/release_data/parrot/N00009870_0002.mdaq.root time ./paloon "" "" /grid/fermiapp/minos/parrot/recopa Spin(103760 in 103760 out 0 filt.) real 2m58.785s user 1m1.069s sys 0m49.411s Need a yet larger file for realistic testing. No, need to correct typos, and run in a writeable area; FAP=/grid/fermiapp/minos/parrot cd /local/scratch1/kreymer time ${FAP}/paloon "" "" ${FAP}/recopa Spill(100000 in 750 out 99250 filt.) 
real 13m58.890s user 10m10.147s sys 2m27.029s -bash-3.00$ ls -ltr total 20544 drwxr-xr-x 259 kreymer numi 4096 Sep 22 11:51 parrot -rw-r--r-- 1 kreymer numi 17819119 Sep 22 12:05 CandS.root -rw-r--r-- 1 kreymer numi 3179130 Sep 22 12:05 ntupleStS.root -bash-3.00$ du -sm * 18 CandS.root 4 ntupleStS.root ============================================================================= 2008 09 19 ============================================================================= ######## # FARM # ######## Once again, I have run CPB far concatenation without first removing the mrnt files, and renaming bmnt to mrnt. I have added an appropriate comment to the corral scripts. Doing a test run of roundup, it seems we have a clean boundary, all runs presently in farcat have all subruns present. 171 files of each, in 9 runs. 660 bmnt files, some previously concatenated. So we can swap out the bmnt/mrnt in farcat, then remove the previously written mrnt files in SAM. ----------------------------------------------------------- BMNT LIST BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` MFILES=`ls /minos/data/minfarm/farcat | grep mrnt | sort` printf "${BFILES}\n" | wc -w 660 printf "${MFILES}\n" | wc -w 171 ----------------------------------------------------------- MOVE MRNT OUT OF THE WAY mkdir -p /minos/data/minfarm/FMRNT cd /minos/data/minfarm/farcat for MFILE in ${MFILES} ; do mv ${MFILE} /minos/data/minfarm/FMRNT/${MFILE} done ----------------------------------------------------------- RENAME BMNT TO MRNT cd /minos/data/minfarm/farcat check for conflicts for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` [ -r ${MFILE} ] && ls -l ${MFILE} done for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` mv ${BFILE} ${MFILE} done ----------------------------------------------------------- PNFS, MINOS_DATA and SAM cleanup prepration Get run list of possible bmnt MRUNS=`printf "${BFILES}\n" | cut -f 1 -d _ | sort -u` printf "${MRUNS}\n" | wc -w 36 cd ~/scripts . 
./samsetup Detailed check via SAM for MRUN in ${MRUNS} ; do RUN=`echo ${MRUN} | cut -c 5-` SAMDIM=" DATA_TIER mrnt-far and VERSION cedar.phy.bhcurv and PHYSICAL_DATASTREAM_NAME spill and RUN_NUMBER ${RUN} " sam list files --dim="${SAMDIM}" --nosummary done > /minos/data/minfarm/maint/MFILES I expect 27 ( = 36 - 9 ) runs to remove SRV1> wc -l /minos/data/minfarm/maint/MFILES 28 /minos/data/minfarm/maint/MFILES One run is split, F00040213_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00040213_0019.spill.mrnt.cedar_phy_bhcurv.0.root So have 28 files to remove grep -v '_0000' /minos/data/minfarm/maint/MFILES MFILES=`cat /minos/data/minfarm/maint/MFILES` printf "${MFILES}\n" | wc -l 28 Added the paths for MFILE in ${MFILES} ; do MON=`sam locate ${MFILE} | cut -f 7 -d / | cut -f 1 -d ,` printf "reco_far/cedar_phy_bhcurv/mrnt_data/${MON}/${MFILE}\n" \ | tee -a /minos/data/minfarm/maint/MFILEPS done MFILEPS=`cat /minos/data/minfarm/maint/MFILEPS` for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} ; done for MFILEP in ${MFILEPS} ; do ls -l /minos/data/${MFILEP} ; done ----------------------------------------------------------- /minos/data - minfarm@fnpcsrv1 for MFILEP in ${MFILEPS} ; do MFILER=`echo ${MFILEP} | sed s/mrnt_data/BMNT2/g` MFILED=`dirname ${MFILER}` mkdir -p /minos/data/${MFILED} mv /minos/data/${MFILEP} /minos/data/${MFILER} done find /minos/data/reco_far/cedar_phy_bhcurv/BMNT2/2008-01 -type f | wc -l 28 ----------------------------------------------------------- /pnfs/minos - rubin@fnpcsrv1 cat shrc/kreymer # cut and paste the result to get into bash MFILES=`cat /minos/data/minfarm/maint/MFILES` MFILEPS=`cat /minos/data/minfarm/maint/MFILEPS` for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} rm /pnfs/minos/${MFILEP} done ----------------------------------------------------------- SAM/READ cd /export/stage/minfarm/ROUNDUP mkdir -p READBMNT2 for MFILE in ${MFILES} ; do ls -l READ/SAM/${MFILE} mv READ/SAM/${MFILE} READBMNT2/${MFILE} done ----------------------------------------------------------- SAM for MFILE in ${MFILES} ; do sam undeclare file ${MFILE} done 13:30 ----------------------------------------------------------- WRITE clean up the items which I left dangling. for MFILE in ${MFILES} ; do [ -L "/minos/data/minfarm/WRITE/${MFILE}" ] \ && rm /minos/data/minfarm/WRITE/${MFILE} done Now we should be able to roundup the remaining 9 runs. 
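Before that, a quick cross check could confirm the cleanup really took ( a sketch only ; it reuses the MFILES/MFILEPS lists built above, and every entry should now come back as not found ) :

  MFILES=`cat /minos/data/minfarm/maint/MFILES`
  MFILEPS=`cat /minos/data/minfarm/maint/MFILEPS`
  # expect 'not found' style errors from each of these
  for MFILE in ${MFILES} ; do sam locate ${MFILE} ; done
  for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} ; done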
Needed to adjust the BAIL limit to over the default 1000 First purge the stale WRITE links ./roundup -w -r cedar_phy_bhcurv far One CPBF file is left, not in DCache or Tape ( stkendca9a problem ) F00040151_0000.spill.sntp.cedar_phy_bhcurv.0.root ./roundup -b 1500 -r cedar_phy_bhcurv far Fri Sep 19 14:33:54 CDT 2008 OK - processing 1173 files ############ # PREDATOR # ############ N00014862_0013.mdaq.root Fri Sep 19 14:06:19 UTC 2008 F00041958_0003.mdaq.root Fri Sep 19 10:13:58 UTC 2008 B080918_080001.mbeam.root Fri Sep 19 10:18:07 UTC 2008 N080918_000003.mdcs.root Fri Sep 19 10:28:30 UTC 2008 repeatedly time out in dbu F080917_000007.mdcs.root Thu Sep 18 10:14:17 UTC 2008 is ok Many files queued for write to tape : STARTED Fri Sep 19 02:11:47 2008 302 FILES 3 N00014861_0000.mdaq.root 18 32 N00014862_0006.mdaq.root 19 65 F00040170_0022.spill.cand.cedar_phy_bhcurv.0.root 18 76 F00041955_0018.mdaq.root 18 98 F00040167_0012.all.cand.cedar_phy_bhcurv.0.root 18 102 F00040167_0018.all.cand.cedar_phy_bhcurv.0.root 18 111 F00040176_0019.all.cand.cedar_phy_bhcurv.0.root 18 113 F00040151_0000.spill.sntp.cedar_phy_bhcurv.0.root 18 114 F00041955_0021.mdaq.root 19 115 F00041955_0022.mdaq.root 19 119 N00014862_0004.mdaq.root 18 146 F00041955_0019.mdaq.root 19 155 F00040145_0023.spill.bcnd.cedar_phy_bhcurv.0.root 18 172 F00040148_0021.spill.cand.cedar_phy_bhcurv.0.root 18 215 F00040176_0011.spill.cand.cedar_phy_bhcurv.0.root 18 217 N00014862_0007.mdaq.root 19 224 F00041955_0012.mdaq.root 18 247 F00040148_0003.all.cand.cedar_phy_bhcurv.0.root 18 249 F00040148_0018.all.cand.cedar_phy_bhcurv.0.root 18 252 F00040167_0005.spill.cand.cedar_phy_bhcurv.0.root 18 259 F00040173_0019.spill.cand.cedar_phy_bhcurv.0.root 18 262 F00040176_0019.spill.cand.cedar_phy_bhcurv.0.root 18 266 N00014859_0022.mdaq.root 18 272 F00040145_0012.spill.bcnd.cedar_phy_bhcurv.0.root 18 284 F00040145_0014.spill.cand.cedar_phy_bhcurv.0.root 18 296 F00040170_0005.spill.cand.cedar_phy_bhcurv.0.root 18 299 N00014862_0005.mdaq.root 19 300 F00041955_0020.mdaq.root 19 301 N00014862_0008.mdaq.root 19 Full paths of the dbu trouble files : /pnfs/minos/neardet_data/2008-09/N00014862_0013.mdaq.root /pnfs/minos/fardet_data/2008-09/F00041958_0003.mdaq.root /pnfs/minos/beam_data/2008-09/B080918_080001.mbeam.root /pnfs/minos/near_dcs_data/2008-09/N080918_000003.mdcs.root MINOS26 > ./dc_stat /pnfs/minos/neardet_data/2008-09/N00014862_0013.mdaq.root ============================ PNFS status for /pnfs/minos/neardet_data/2008-09/N00014862_0013.mdaq.root -rw-r--r-- 1 buckley e875 111237454 Sep 19 01:16 N00014862_0013.mdaq.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:8097bf89;l=111237454; LEVEL 4 ============================ Same for all 4 files. 
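A quick way to watch whether any of the four reach tape ( a sketch ; it assumes the NFS mounted /pnfs view and the standard PNFS layer 4 files that dc_stat appears to read ) :

  # empty layer 4 means the file has no tape copy yet
  for F in \
    neardet_data/2008-09/N00014862_0013.mdaq.root \
    fardet_data/2008-09/F00041958_0003.mdaq.root \
    beam_data/2008-09/B080918_080001.mbeam.root \
    near_dcs_data/2008-09/N080918_000003.mdcs.root ; do
    DIR=/pnfs/minos/`dirname ${F}` ; FIL=`basename ${F}`
    VOL=`cat "${DIR}/.(use)(4)(${FIL})" | head -1`
    echo "${FIL} ${VOL:-not-yet-on-tape}"
  done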
Scanning the write backlog for SAM locations : FILES=' N00014861_0000.mdaq.root N00014862_0006.mdaq.root F00040170_0022.spill.cand.cedar_phy_bhcurv.0.root F00041955_0018.mdaq.root F00040167_0012.all.cand.cedar_phy_bhcurv.0.root F00040167_0018.all.cand.cedar_phy_bhcurv.0.root F00040176_0019.all.cand.cedar_phy_bhcurv.0.root F00040151_0000.spill.sntp.cedar_phy_bhcurv.0.root F00041955_0021.mdaq.root F00041955_0022.mdaq.root N00014862_0004.mdaq.root F00041955_0019.mdaq.root F00040145_0023.spill.bcnd.cedar_phy_bhcurv.0.root F00040148_0021.spill.cand.cedar_phy_bhcurv.0.root F00040176_0011.spill.cand.cedar_phy_bhcurv.0.root N00014862_0007.mdaq.root F00041955_0012.mdaq.root F00040148_0003.all.cand.cedar_phy_bhcurv.0.root F00040148_0018.all.cand.cedar_phy_bhcurv.0.root F00040167_0005.spill.cand.cedar_phy_bhcurv.0.root F00040173_0019.spill.cand.cedar_phy_bhcurv.0.root F00040176_0019.spill.cand.cedar_phy_bhcurv.0.root N00014859_0022.mdaq.root F00040145_0012.spill.bcnd.cedar_phy_bhcurv.0.root F00040145_0014.spill.cand.cedar_phy_bhcurv.0.root F00040170_0005.spill.cand.cedar_phy_bhcurv.0.root N00014862_0005.mdaq.root F00041955_0020.mdaq.root N00014862_0008.mdaq.root ' MINOS26 > for FILE in $FILES ; do sam locate $FILE ; done The cand files are all /pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2008-01 bcnd are /pnfs/minos/reco_far/cedar_phy_bhcurv/.bcnd_data/2008-01 sntp are /pnfs/minos/reco_far/cedar_phy_bhcurv/sntp_data/2008-01 Some are on tape now, F00040167_0012.all.cand.cedar_phy_bhcurv.0.root F00040176_0019.all.cand.cedar_phy_bhcurv.0.root F00040148_0018.all.cand.cedar_phy_bhcurv.0.root F00040176_0019.spill.cand.cedar_phy_bhcurv.0.root F00040145_0012.spill.bcnd.cedar_phy_bhcurv.0.root F00040145_0014.spill.cand.cedar_phy_bhcurv.0.root F00040170_0005.spill.cand.cedar_phy_bhcurv.0.root Submitted helpdesk ticket I see now that w-stkendca9a-* pools are offline stkendca9a is up, on the network MRTG traffic stops around 01:45 MRTG traffic starts around 14:30 Date: Fri, 19 Sep 2008 16:15:26 +0000 (GMT) From: Arthur Kreymer To: HelpDesk Cc: minos-data@fnal.gov, dcache-admin@fnal.gov Subject: Re: HelpDesk ticket 121930 <-- # @@@ Enter Update below this line. @@@ # --> According to the MRGT network monitoring, stkendca9a is up and on the network, but stopped most activity around 01:45 this morning. This node serves both writePools and RawDataWritePools. This could explain the absence of our files. <-- # @@@ Enter Update above this line. @@@ # --> From: Arthur Kreymer To: HelpDesk Cc: minos-data@fnal.gov, dcache-admin@fnal.gov Subject: Re: HelpDesk ticket 121930 <-- # @@@ Enter Update below this line. @@@ # --> According to the MRGT network monitoring, stkendca9a started moving data again around 14:30. All of the previously backlogged files seem to have made it to tape, <-- # @@@ Enter Update above this line. @@@ # --> ############ # MINOS_OM # ############ Investigating failure of FarWeb to contact minos-om since Fri, 19 Sep 2008 07:50:39 -0500 /var/log/messages is flooded with Aug 24 04:02:30 minos-om pam_timestamp_check: pam_timestamp: `/var/run/' owner UID != 0 /var/run is owned by apache. ####### # DAQ # ####### [root@minos-evd ~]# cat /etc/exports # # export /data/mcr to other control room pc's. SA 1/19/05 /data/mcr 131.225.55.0/255.255.0.0(rw) minos-beamdata.fnal.gov(rw) /data/minsoft 131.225.55.0/255.255.0.0(rw) minos-beamdata.fnal.gov(rw) This is dangerously wrong, exports rw to 131.225.* , all of Fermilab Probably want an exlicit list of CR systems. 
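Something along these lines would tighten it up ( a sketch only ; the host list is a placeholder, to be replaced with the real set of CR systems ) :

  # /etc/exports - export only to the specific control room hosts,
  # not to the whole 131.225/16 range
  /data/mcr     minos-rc.fnal.gov(rw) minos-om.fnal.gov(rw) minos-acnet.fnal.gov(rw) minos-beamdata.fnal.gov(rw)
  /data/minsoft minos-rc.fnal.gov(rw) minos-om.fnal.gov(rw) minos-acnet.fnal.gov(rw) minos-beamdata.fnal.gov(rw)

Then exportfs -r on minos-evd to pick up the change.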
============================================================================= 2008 09 18 ============================================================================= ######## # FARM # ######## Cleanup of v18 flux error files in D04 GPB Got list of configs and runs from nwest email Date: Fri, 12 Sep 2008 16:18:09 +0100 From: Nick West scp minos-93198.dhcp:baddo4 /minos/scratch/kreymer/badd04 These are things whose mapping to files I do not understand L010185_near_bhcurv 00007450 L010185_near_bhcurv_test 00007655 L010185_near_production 00007484 L010185_rock_pro 00007481 ######## # GRID # ######## 14:37 Most worker nodes have booted up, as recently as 14:20 A couple of nodes are still down ( formerly running pawloski jobs ) fnpc209.fnal.gov fnpc219.fnal.gov 16:24 - Timm states that FermiGrid is and has been up. Started gfactory and gfrontend ########## # CONDOR # ########## Clean up pilots that think they are running. Nodes supposedly running jobs XNO=`condor_q -run | grep fnpc | grep -v gfactory | cut -f 3 -d @ | sort -u` Of these, some respond to ping XUP=`for NO in ${XNO} ; do ping -c 1 -w 2 ${NO} > /dev/null && echo ${NO} ; done` Of these, scan for condor processes for UP in ${XUP} ; do echo ${UP} ; ssh -ax ${UP} "ps -fu condor" ; done mostly got UID PID PPID C STIME TTY TIME CMD condor 3773 1 0 08:04 ? 00:00:00 /opt/condor/sbin/condor_master condor 3788 3773 0 08:04 ? 00:00:05 condor_startd -f unauthorized on fnpc207.fnal.gov fnpc218.fnal.gov fnpc253.fnal.gov fnpc257.fnal.gov fnpc236.fnal.gov This rsh session is using DES encryption for all data transmissions. UID PID PPID C STIME TTY TIME CMD condor 3818 1 0 08:04 ? 00:00:00 /opt/condor/sbin/condor_master condor 3826 3818 0 08:04 ? 00:00:14 condor_startd -f condor 6712 3826 0 08:41 ? 00:00:00 condor_starter -f -a slot2 fnpcosg1.fnal.gov condor 6713 3826 0 08:41 ? 00:00:00 condor_starter -f -a slot3 fnpcosg1.fnal.gov condor 6714 3826 0 08:41 ? 00:00:00 condor_starter -f -a slot1 fnpcosg1.fnal.gov condor 6715 3826 0 08:41 ? 00:00:00 condor_starter -f -a slot4 fnpcosg1.fnal.gov These are jobs for username engage Checking again for minos processes, there were none : for UP in ${XUP} ; do echo ${UP} ; ssh -ax ${UP} "ps -fu minos" ; done Bottom line, nothing useful is running for us. Shall/can I remove these gfactory's ? No need, they are all held now ! MINOS25 > condor_q gfactory | tail -1 ; date 73 jobs; 0 idle, 0 running, 73 held Thu Sep 18 09:43:26 CDT 2008 MINOS25 > condor_rm gfactory User gfactory's job(s) have been marked for removal. six of these went back into X status then back to H after a minute MINOS25 > condor_rm gfactory all clear Held pawloski jobs 66 jobs; 0 idle, 66 running, 0 held MINOS25 > condor_hold pawloski 66 jobs; 0 idle, 0 running, 66 held The pawloski jobs are now in X state ######## # FARM # ######## GCC power maintenance has started, Condor glideins are shut down since 00:45. My glideafs stopped getting scheduled at 01:30 ############## # AFSERRSCAN # ############## Added printout of nodes failing to connect via ssh ######### # ADMIN # ######### Two nodes unavailable to ssh, they are OK with rsh minos03 minos16 Ticket 121841 Date: Thu, 18 Sep 2008 08:26:32 -0500 (CDT) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group.
ling ============================================================================= 2008 09 17 ============================================================================= ########## # CONDOR # ########## Have over 150 gfactory pilots running, but few user jobs MINOS25 > condor_q gfactory | tail -1 176 jobs; 25 idle, 151 running, 0 held MINOS25 > condor_q -run | grep fnpc | grep -v gfactory | wc -l 34 MINOS25 > condor_q -run | grep ahimmel | tr -s ' ' | cut -f 6 -d ' ' | cut -f 3 -d @ | sort -u fnpc340.fnal.gov fnpc341.fnal.gov fnpc342.fnal.gov fnpc343.fnal.gov fnpc345.fnal.gov On 343, see 6035 ? Ss 17:36 /opt/condor/sbin/condor_master 6055 ? Ss 304:10 \_ condor_startd -f 7336 ? S 49:30 \_ condor_procd -A /local/stage1/condor/log/condor-lock.fnpc3430.0366944801799072/procd_pipe.STARTD -S 60 -C 4716 and many condor_starter -f under which our jobs run On 344, just have 8657 ? Ss 22:17 /opt/condor/sbin/condor_master 8658 ? Ss 371:24 \_ condor_startd -f 8740 ? S 60:23 \_ condor_procd -A /local/stage1/condor/log/condor-lock.fnpc3440.0366944801799072/procd_pipe.STARTD -S 60 -C 4716 We know that a pawloski job finished on fnpc300 recently ( 14:00 ) and that it was running loon OK. 25553 ? Ss 13:42 /opt/condor/sbin/condor_master 25554 ? Ss 465:30 \_ condor_startd -f 25733 ? S 44:06 \_ condor_procd -A /local/stage1/condor/log/condor-lock.fnpc3000.0366944801799072/procd_pipe.STARTD -S 60 -C 4716 22113 ? Ss 0:00 \_ condor_starter -f -a slot4 fnpcfg1.fnal.gov 22114 ? SNs 0:00 | \_ /bin/bash /grid/home/minos/.globus/.gass_cache/local/md5/68/c3/98/63a9242a845ee20f3cf6078aa0/md5/6c/85/f9/a30b5e7bee1d26d3c297802c16/data -v 22514 ? SN 0:00 | \_ /bin/bash ./condor_startup.sh glidein_config 22699 ? SN 0:10 | \_ /local/stage1/condor/execute/dir_22113/glide_m22150/condor/sbin/condor_master -r 359 -dyn -f 22700 ? SN 1:09 | \_ condor_startd -f 22925 ? SN 0:34 | \_ condor_procd -A /local/stage1/condor/execute/dir_22113/glide_m22150/log.131.225.166.78-22699/procd_address.STARTD -L /local/ 16310 ? SN 0:00 | \_ /local/stage1/condor/execute/dir_22113/glide_m22150/condor/sbin/condor_starter -f -a vm2 minos25.fnal.gov 16509 ? SN 0:22 | \_ condor_procd -A /local/stage1/condor/execute/dir_22113/glide_m22150/tmp/starter-tmp-dir-wu3e3x/log/procd_pipe.STARTER -L 16510 ? SN 0:00 | \_ /bin/sh /minos/scratch/pawloski/EntProc/paloon 148 0 /minos/scratch/pawloski/EntProc/SntpFileListsForSept2008Meeting/con 16513 ? RN 498:00 | \_ parrot -m /grid/app/minos/parrot/cctools-current-20080708-x86_64-linux-2.6/mountfile.grow -H -t /local/stage1/minos/ 16514 ? TN 0:00 | \_ /minos/scratch/pawloski/EntProc/SntpFileListsForSept2008Meeting/condor_job_glidein_FarDataAll_HornOn_SUN.sh /min 17009 ? TN 302:58 | \_ loon -bq /minos/scratch/pawloski/Nue/nue_standard_Firebird_SUN/NueAna/macros/MakeAnaNueTreePECut.C dcap://fndca1 7623 ? Ss 0:00 \_ condor_starter -f -a slot1 fnpcosg1.fnal.gov 7626 ? RNs 21:46 | \_ condor_exec.exe pd_45mA_errors12_1p0.in 7637 ? Ss 0:00 \_ condor_starter -f -a slot5 fnpcosg1.fnal.gov 7638 ? RNs 18:15 | \_ condor_exec.exe pd_45mA_errors16_1p0.in 7639 ? Ss 0:00 \_ condor_starter -f -a slot3 fnpcosg1.fnal.gov 7641 ? RNs 17:43 | \_ condor_exec.exe pd_45mA_errors15_1p0.in 7640 ? Ss 0:00 \_ condor_starter -f -a slot2 fnpcosg1.fnal.gov 7642 ? RNs 18:12 | \_ condor_exec.exe pd_45mA_errors14_1p0.in 7653 ? Ss 0:00 \_ condor_starter -f -a slot7 fnpcosg1.fnal.gov 7655 ? RNs 17:26 | \_ condor_exec.exe pd_45mA_errors107_1p0.in 7654 ? Ss 0:00 \_ condor_starter -f -a slot6 fnpcosg1.fnal.gov 7657 ? 
RNs 17:21 | \_ condor_exec.exe pd_45mA_errors40_1p0.in 7668 ? Ss 0:00 \_ condor_starter -f -a slot8 fnpcosg1.fnal.gov 7671 ? RNs 17:11 \_ condor_exec.exe pd_45mA_errors51_1p0.in MINOS25 > condor_q gfactory | tail -1 176 jobs; 25 idle, 151 running, 0 held Sees that this fnpc300 pilot is gone, but we were not informed. 25553 ? Ss 13:42 /opt/condor/sbin/condor_master 25554 ? Ss 465:34 \_ condor_startd -f 25733 ? S 44:07 \_ condor_procd -A /local/stage1/condor/log/condor-lock.fnpc3000.0366944801799072/procd_pipe.STARTD -S 60 -C 4716 7623 ? Ss 0:00 \_ condor_starter -f -a slot1 fnpcosg1.fnal.gov 7626 ? RNs 41:25 | \_ condor_exec.exe pd_45mA_errors12_1p0.in 7637 ? Ss 0:00 \_ condor_starter -f -a slot5 fnpcosg1.fnal.gov 7638 ? RNs 37:59 | \_ condor_exec.exe pd_45mA_errors16_1p0.in 7639 ? Ss 0:00 \_ condor_starter -f -a slot3 fnpcosg1.fnal.gov 7641 ? RNs 37:26 | \_ condor_exec.exe pd_45mA_errors15_1p0.in 7640 ? Ss 0:00 \_ condor_starter -f -a slot2 fnpcosg1.fnal.gov 7642 ? RNs 37:54 | \_ condor_exec.exe pd_45mA_errors14_1p0.in 7653 ? Ss 0:00 \_ condor_starter -f -a slot7 fnpcosg1.fnal.gov 7655 ? RNs 37:09 | \_ condor_exec.exe pd_45mA_errors107_1p0.in 7654 ? Ss 0:00 \_ condor_starter -f -a slot6 fnpcosg1.fnal.gov 7657 ? RNs 37:04 | \_ condor_exec.exe pd_45mA_errors40_1p0.in 7668 ? Ss 0:00 \_ condor_starter -f -a slot8 fnpcosg1.fnal.gov 7671 ? RNs 36:53 \_ condor_exec.exe pd_45mA_errors51_1p0.in Starting around 02:00, getting fewer and fewer running client jobs. And my probe jobs have not run since 20:40, based on *.out sizes. Next job to run should be 190628.0 Submitting a non-afs test job, MINOS25 > condor_submit glide.run Checking proxy in gfactory, it looks fine. Steve Timm logged into fnpc4x1, found 10 stuck minos account globus-job-manager processes Sent them a signal by running ptrace, this kicked them loose. May be able to do this with kill -s SIGCONG ( see man 7 signal ) Our condor_q info is now up to date, pilots are starting, and jobs are starting to run. I do not seem to be able to log into fnpcrx1 or fnpcfg1 sfiligoi notes that condor_status -l contains the information necessary to trace a gfactory job to a specific execution node, searching for things like GLIDEIN_ClusterId = 190823 GLIDEIN_ProcId = 5 for 190823.5 ######### # ADMIN # ######### Date: Wed, 17 Sep 2008 11:21:19 -0500 (CDT) Subject: Help Desk Ticket 121592 Has Been Resolved. ___________________________________________________________________ Solution: Added resolv.conf to cfengine for the minos cluster. ___________________________________________________________________ Short Description: minos-mysql1 /etc/resolv.conf has been updated Problem Description: run2-sys : The /etc/resolv.conf file on minos-mysql1 consisted of: search fnal.gov nameserver 131.225.8.120 This of course caused severe problems for mysql service, including unavailability of the Minos Control Room Logbook, during the recent service problems with fnsrv0/131.225.8.120 Because this has been causing current operational problems, I have taken the liberty of renaming this file to resolv.conf.2004 and have created a new file, copied from the Minos Cluster systems, but putting fnsrv1 first in the list: search fnal.gov nameserver 131.225.17.150 nameserver 131.225.227.254 nameserver 131.225.8.120 The old resolv.conf file seems to be older than minos-mysql1 [root@minos-mysql1 etc]# ls -l resolv.conf.2004 -rw-r--r-- 1 root root 41 Nov 22 2004 resolv.conf.2004 Action items : Please put this change in via your usual means ( cfengine ? ), rather than my hack. 
Please also update resolv.conf on minos-sam02 . ___________________________________________________________________ This ticket was resolved by SCOTT, RENNIE of the CD-SF/FEF group. _________________________________________________________________ Note To Requester: Hi Art, I put resolv.conf in cfengine. Of note, you are correct that the nameserver on minos-mysql-1 was older. This is the only system in the cluster that was upgraded and not re-installed during last years upgrade of the minos cluster. This method was done due to the uncertainty of the mysql database configurations and what would happen if upgraded i.e. new mysql database issue, etc. Since resolv.conf is configured during our kickstart process, this system never received an updated resolv.conf like the rest of the cluster. I can't remeber what we did with minos-sam02 so I'm not sure why it's resolv.conf did not get updated. This cfengine file entry should rectify the issues. Best regards, Rennie ######## # MAIL # ######## Adjusting SPAM filter to directly reject mail over 5, rather than send to managers. The level might even go lower, but I don't have time to test, and I've never seen good mail at or over 5. ___________________________________________________________________ Date: Tue, 16 Sep 2008 11:25:58 -0500 (CDT) Subject: Help Desk Ticket 121629 Has Been Resolved. Solution: Hi, Sounds like your list is configured with the default setting of sending spam to the owners for moderation. You can change that by adding a filter to the list configuration on listserv. The instructions for doing so are here: http://computing.fnal.gov/email/listserv/listserv-spam.html ___________________________________________________________________ List Management select a list Templates Select a template to view or edit Rules for filtering list messages based on their contents [CONTENT FILTER] Edit Form X-Spam-Flag: YES Action: Discard SPAM ( was Action: Moderate ) Update Did this to isajet-users and minos_sam_admin Also did minos-cdops, set spam level of 2 X-Spam-ListServ-Level:-- Action: Discard SPAM ######### # FNALU # ######### Date: Wed, 17 Sep 2008 10:22:13 -0500 (CDT) Subject: HelpDesk ticket 121790 ___________________________________________ Short Description: FNALU Batch mount needed for /grid/data,app,fermiapp and /minos/data,scratch Problem Description: fnalu-admin In order to let Minos make effective use of the FNALU Condor batch system, please mount on the interactive and worker nodes : /minos/data /minos/scratch /grid/data /grid/fermiapp /grid/app As FNALU interactive and batch nodes are all GCE managed nodes, these file systems can be exported and mounted with full RWX access. Thanks ! ___________________________________________ Date: Wed, 17 Sep 2008 10:42:32 -0500 (CDT) This ticket has been reassigned to HILL, KEVIN of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Wed, 17 Sep 2008 11:26:00 -0500 (CDT) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Fri, 26 Sep 2008 09:31:01 -0500 (CDT) Art, I've mounted /minos/data and scratch on the fnalu condor nodes, but not the /grid mounts. Wayne is going to be sending an e-mail to you clarifying or expanding on what we discussed in our meeting this week. Please remember that DSS considers this condor pool as a test set up for a couple months. 
I will wait to announce that condor is on fnalu because I will not be here next week and there will be very restricted support for it during that time. Margaret ___________________________________________ Date: Tue, 14 Oct 2008 09:02:40 -0500 (CDT) Solution: file systems were mounted on 9/21. ___________________________________________ N.B. this is only the /minos/files. /grid is not mounted. ___________________________________________ ============================================================================= 2008 09 16 ============================================================================= ######### # FNALU # ######### Testing condor submission from /local/stage1/kreymer/condor, per http://cdorg.fnal.gov/dss/condor/condor.html http://cdorg.fnal.gov/dss/condor/condor2.html Had to move to /local/stage1/kreymer/condor, submission from $HOME/condor resulted in a running job per condor, but no useful work on the worker ( FLXI09 > condor_submit probe.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 532. WARNING: File /afs/fnal.gov/files/home/room1/kreymer/condor/logs/probe/probe.532.0.err is not writable by condor. WARNING: File /afs/fnal.gov/files/home/room1/kreymer/condor/logs/probe/probe.532.0.out is not writable by condor. FLXI09 > condor_q kreymer -run -- Quill: quilld@flxi09.fnal.gov : <131.225.68.37:5432> : quill2 : 2008-09-17 11:31:03-05 ID OWNER SUBMITTED RUN_TIME HOST(S) 532.0 kreymer 9/17 11:30 0+00:01:03 slot1@flxb32.fnal.gov 32678 ? Ss 0:09 /opt/condor/sbin/condor_master 32679 ? Ss 0:34 \_ condor_startd -f 7137 ? S 0:07 \_ condor_procd -A /tmp/condor-lock.flxb320.342033398644165/procd_pipe.STARTD -S 60 -C 4716 ####### # DAQ # ####### Date: Tue, 16 Sep 2008 14:55:23 -0500 Subject: [Fwd: Hardware Service Request] System Node Name: minos-beamdata Tag Number: 106501 Manufacturer/Model: Dell Equipment Location: precision 390 Task Name: 50 Task Number: 50.01.06.04.01.01 Problem Details: Move the minos-beamdata PC to FCC computer rooms. This computer logs accelerator beam data for the Minos experiment. It's currently located in WH12NW. The form factor is a mini-tower. It is currently in the 131.225.52.xxx subnet with a fixed IP. ######### # FNALU # ######### Date: Tue, 16 Sep 2008 14:22:13 -0500 (CDT) Subject: HelpDesk ticket 121735 ___________________________________________ Short Description: LSF has been shut down as scheduled - please announce Problem Description: The FNALU LSF queues have been shut down as scheduled. Please put an announcement on the System Status / FNALU web page, and/or a login message on FNALU. Please let us know when Condor queues will be available. ___________________________________________ Date: Tue, 16 Sep 2008 16:11:40 -0500 (CDT) This ticket has been reassigned to GREANEY, MARGARET of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Tue, 16 Sep 2008 15:27:23 -0500 (CDT) From: Margaret_Greaney The condor cluster is available. I specifically asked Wayne to contact you before I made an announcement. Could you please try to call him? This condor cluster is not grid specific. I have two html documents set up that describe the way this pool is configured. These are at http://cdorg.fnal.gov/dss under the FNALU heading. I am still setting up a few more nodes for the cluster, but you should be able to login to flxi09 and look at the current set up. If you would like to have a meeting with Wayne and me I would welcome that to answer your questions if you have any. 
Also, remember that we had no budget to buy any new batch nodes yet because of the budget cuts last year. ___________________________________________ Date: Tue, 16 Dec 2008 12:32:39 -0600 (CST) Solution: motd on fnalu nodes updated; system status page updated for lsf replacement. ___________________________________________ Date: Tue, 16 Dec 2008 12:32:40 -0600 (CST) Note To Requester: Art, the helpdesk gave me permissions to update the system status page and this has been done for condor/lsf. margaret ___________________________________________ ######## # FARM # ######## Forcing out cedar_phy near recent processing, not needing concatenation, per batch meeting discussion AFSS/roundup.20080915 -F -r cedar_phy near AFSS/roundup.20080915 -F -r cedar_phy near All done SRV1> cp -a AFSS/roundup.20080915 . ########### # ROUNDUP # ########### 20080915 version hacked to filter SAMSUBS list on \.${REL}\. Restored -F option ######## # FARM # ######## Rename the n1303*.0.root files to .root SRV1> FILES=`ls /minos/data/minfarm/mcnearcat | grep .0.root | grep n1303` SRV1> printf "${FILES}\n" | wc -l 62 SRV1> cd /minos/data/minfarm/mcnearcat for FILE in ${FILES} ; do mv ${FILE} ${FILE/.0./.} ; done SRV1> ln -sf roundup.20080915 roundup # was roundup.20080728 ./roundup -n -s n130370 -r cedar_phy_bhcurv mcnear Missing subrun 29 of n13037097, failed in pass0, but have shifted to null Created a fake null pass line via SRV1> nedit /minos/data/minfarm/lists/bad_runs_mc.cedar_phy_bhcurv Test with AFSS/roundup.20080915 -n -W -s n13037097 -r cedar_phy_bhcurv mcnear PEND - have 30/31 subruns for n13037097_*_L250200N_D04.sntp.cedar_phy_bhcurv.root 8 09/08 06:25 4 26 MISS 0005 0006 0007 0008 0029 RAWF is like n13037097_0029_L250200N_D04 Changed all \.${PASS} to .${PASS} so this can match null passes RUNTAI n13037097__L250200N_D04.sntp.cedar_phy_bhcurv.root SAMSUBS n13037097__L250200N_D04.mrnt.cedar_phy_bhcurv.root:4 ============================================================================= 2008 09 15 ============================================================================= ######## # FARM # ######## n13037078_0024_L250200N_D04.sntp.cedar_phy_bhcurv.0.root is in mcnearcat. Subruns 10 and 26 have no input files All other subruns are concatenated in 00 11 27 So why is the MISS list so long, and why is there no output ? I do not see the expected HAVE n13037078__L250200N_D04 Probably because the old files have no pass number, and the new ones have pass 0. Wiped out the pend file in testing this. Ugh, hacked it back into place from the full log. Let's see how many .0.root files have .root friends MINOS26 > ls /minos/data/minfarm/mcnearcat > /tmp/mcn MINOS26 > grep '.0.root' /tmp/mcn | wc -l 66 MINOS26 > grep '.0.root' /tmp/mcn | cut -f 1 -d _ | sort -u The first 4 are CosmicMu n10032095 n10042089 n10042106 n10042115 The next 4 are L250200N_D04 n13037073 n13037078 n13037095 n13037097 for RUN in n10032095 n10042089 n10042106 n10042115 ; do sam list files \ --dim="FILE_NAME ${RUN}%L250200N_D04.sntp.cedar_phy_bhcurv.root" done No files match the given constraints. No files match the given constraints. No files match the given constraints. No files match the given constraints.
for RUN in n13037073 n13037078 n13037095 n13037097 ; do sam list files \ --dim="FILE_NAME ${RUN}%L250200N_D04.sntp.cedar_phy_bhcurv.root" done n13037073_0000_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037073_0018_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037073_0021_L250200N_D04.sntp.cedar_phy_bhcurv.root File Count: 3 Average File Size: 1.08GB Total File Size: 3.25GB Total Event Count: 24000 Files: n13037078_0000_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037078_0011_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037078_0025_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037078_0027_L250200N_D04.sntp.cedar_phy_bhcurv.root File Count: 4 Average File Size: 773.30MB Total File Size: 3.02GB Total Event Count: 22400 Files: n13037095_0000_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0003_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0006_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0012_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0014_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0018_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0020_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0023_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037095_0028_L250200N_D04.sntp.cedar_phy_bhcurv.root File Count: 9 Average File Size: 269.94MB Total File Size: 2.37GB Total Event Count: 17600 Files: n13037097_0005_L250200N_D04.sntp.cedar_phy_bhcurv.root File Count: 1 Average File Size: 442.30MB Total File Size: 442.30MB Total Event Count: 3200 ############ # PREDATOR # ############ B080906_160000.mbeam.root Fri Sep 12 10:26:16 UTC 2008 OOPS - loon status is 139 genpy sets up loon R1.22 Created predatorbfx to just do beam_data Test manually, DFILE=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/beam_data/2008-09/B080906_160000.mbeam.root minos setup_minos -r R1.22 loon -bq ${HOME}/minos/scripts/firstlast.C ${DFILE} loon [0] Processing /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/firstlast.C... Warning in : no dictionary for class RecJobHistory is available Warning in : The StreamerInfo of class RawDataBlock read from file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/beam_data/2008-09/B080906_160000.mbeam.root has the same version (=1) as the active class but a different checksum. You should update the version to ClassDef(RawDataBlock,2). Do not try to write objects with the current class definition, the files will not be readable. MINOS26 > setup_minos -r R1.26 # root v5-16-00 - fails MINOS26 > setup_minos -r R1.28 # root v5-18-00 , good output ... root version: v05-21-03 Need to declare this root version to SAM before trying to declare files. export SAM_ORACLE_CONNECT="samdbs/password" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=online --appName=rotorooter --appVersion=v05-21-03 done New applicationFamilyId = 257 New applicationFamilyId = 94 New applicationFamilyId = 348 S08-01-10-R1-27 v5-17-08 OK S08-02-16-R1-28 v5-18-00 S08-02-24-R1-28 v5-18-00a S08-03-20-R1-28 v5-18-00a S08-04-24-R1-28 v5-19-02a S08-07-25-R1-29 v5-20-00 S08-08-28-R1-30 v5-20-00 R1.29 v5-18-00c R1.30 v5-20-00 Try running root from development, get v5-21-03 !!!
The Bottom Line : These beam_data files require R1.28, S08-01-10-R1-27 or later for reading ( root >= v5-17-08 ) ./predatorbfx B080906_080002.mbeam.root Mon Sep 15 18:20:53 UTC 2008 OOPS - run_dbu is stuck for 208, killing it OK - declared B080906_160000.mbeam.root OK - declared B080915_080001.mbeam.root DFILE=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/beam_data/2008-09/B080906_080002.mbeam.root MINOS26 > time loon -bq ${HOME}/minos/scripts/firstlast.C ${DFILE} real 4m59.461s user 0m28.788s sys 0m3.123s repeat, real 4m57.145s OK, let's hack predatorbfx to genpy -t 500 ./predatorbfx Removed predatorbfx, no longer needed ######### # GENPY # ######### Added printing of TIMEX when set. Added a few more {} in getopts parsing ########## # CONDOR # ########## Only three help gfactory pilots this morning, not the usual dozen/day All were Globus error 43: 189597.3 gfactory 9/13 14:31 0+00:00:00 H 0 0.0 glidein_startup.sh 189944.9 gfactory 9/14 22:49 0+00:00:00 H 0 0.0 glidein_startup.sh 189995.3 gfactory 9/15 03:24 0+00:00:00 H 0 0.0 glidein_startup.sh ######### # ADMIN # ######### sar data is bunk, no idle time listed, contrary to top, uptime. ARK > ssh -ax minos01.fnal.gov rpm -q sysstat sysstat-5.0.5-11.rhel4 MIN > ssh -ax minos01 cat /etc/redhat-release Scientific Linux Fermi LTS release 4.4 (Wilson) ARK > ssh -ax fnpcsrv1.fnal.gov rpm -q sysstat sysstat-5.0.5-16.rhel4 MIN > ssh -ax fnpcsrv1 cat /etc/redhat-release Scientific Linux Fermi LTS release 4.6 (Wilson) ============================================================================= 2008 09 12 ============================================================================= ########## # PARROT # ########## Test of new d141(ups) d199(minsoft) copies, with make_growfs.auto MD=/afs/fnal.gov/files/data/minos PD=/minos/scratch/parrot $ time ./make_growfs.auto -k ${MD}/d141 WOW, lots of broken symliks, both relative and to /fnal/ups/... real 9m10.773s user 0m54.767s sys 1m39.739s $ time ./make_growfs.auto -k ${MD}/d199 ####### # DAQ # ####### DNS problems continue with fnsrv0 / 131.225.8.120, according to the helpdesk. for ROLE in rc evd om acnet beamdata gateway-nd ; do echo ${ROLE} ssh -l minos minos-${ROLE} cat /etc/resolv.conf done All are ; generated by /sbin/dhclient-script search fnal.gov dhcp.fnal.gov nameserver 131.225.17.150 nameserver 131.225.8.120 Same for my desktop ######### # ADMIN # ######### xbhuang account - created this for xiaobo ####### # CRL # ####### CRL not responding well, ( no login, or content ) Mail to gysin : Date: Fri, 12 Sep 2008 17:22:50 +0000 (GMT) From: Arthur Kreymer To: gysin@fnal.gov Subject: Minos CRL Sorry not to have gotten back to you sooner. I presume that you have now seen the helpdesk ticket 121520. In addition, just recently, we seem to have additional Minos CRL problems. I cannot login : To login you must be authenticated: Login for kreymer was invalid - please try again. If the problem persists, contact your CRL Administrator to verify your user name, password, and your status as an active, remote user. User name: Password: And when I try to view http://www-minoscrl2.fnal.gov/minos/Index.jsp, I get two header bars, and no content. The minos-mysql1 database server looks normal to me, and has crl connections. 
######## # FARM # ######## SRV1> ./looper '-r cedar_phy_bhcurv mcnear' & Fri Sep 12 12:04:44 CDT 2008 ZAPPING BAD n13037415_0009_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037415_0009_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037415_0009_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037436_0005_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 ZAPPING BAD n13037436_0005_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 ZAPPING BAD n13037436_0005_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 This ran out of steam, to many pending files, restarted as : SRV1> ./looper '-b 10000 -r cedar_phy_bhcurv mcnear' & Fri Sep 12 16:00:08 CDT 2008 ####### # CRL # ####### Where to go for help ( ticket, mail ??? ) not on CRL or HELP Help spelling ( hyrarchy of topics ) ############ # PREDATOR # ############ B080906_160000.mbeam.root Fri Sep 12 10:26:16 UTC 2008 OOPS - loon status is 139 OOPS - cannot read B080906_160000.sam.py ####### # NET # ####### The Primary DNS server fnsrv0 131.225.8.120 failed last night. It was on the network, but not providing DNS service. Test : host www.fnal.gov 131.225.8.120 # failed host www.fnal.gov 131.225.17.150 # succeeded Problems : Predator : genpy failed EVD stopped working CRL stopped responding Ticket #: 121520 - closed, not a CRL issue, was database Helpdesk expert login and ticket submission failed. Ticket #: 121521 ( closed in Oct, no real cause found ) DocDB failed Ticket #: 121522 ( closed 2009 Jan 29, as is ) MRTG network monitoring data disappeared around 22:45 CDT, back at 03:30 fnsrv0 monitoring was back at 04:00 Ticket #: 121529 - closed 13 Oct 2008 Solution: darryl@fnal.gov sent this solution: The MRTG monitors are configured to use fnsrv1 as a secondary DNS server. However, there is insufficient evidence that the DNS failover mechanism on an MRTG host would have singlehandedly prevented loss of data, as there are ongoing external factors affecting host performance. DCache was down, no data transfers in FNDCA CDFDCA seems to have stayed up, but not tape activity pnfslog times were over 3 minutes, not 3 seconds. Strangely, Daq kftp writes to DCache claim to have succeeded. Ticket #: 121533 Ticket 121506 9/12/2008 6:59:14 AM by plunk Resolved 9/12/2008 9:12:32 AM rader@fnal.gov sent this solution: FNSRV0 dns server needed a reboot. Looking into the cause of the failure... ============================================================================= 2008 09 11 ============================================================================= xbhuang account /fermilab/mions VO entries ######## # GRID # ######## Sweeping up all Minos folk into /fermilab/minos VO MINOS26 > ypcat passwd | cut -f 5 -d : > /tmp/names sort /tmp/names -k 2,2 > /tmp/namess Registering them one at a time , cp /tmp/namess /minos/scratch/kreymer/namess Completed this tomorrow ( 9/12 ), about 152 total users.. Got note from Yocum : Date: Fri, 12 Sep 2008 15:30:26 -0500 From: Dan Yocum I notice that you're adding a lot of suspended members to the minos group. 
For instance: Brandon Sielhan Vitali Semenov Christopher Smith Philip Symes Edward Tetteh-Lartey Carol Ward Qun Wu Hai Zheng Anyone who only has a DN with '../UID=...' has been suspended for a long time. ######### # ADMIN # ######### Make sure everyone is in group e875 5111 MINOS26 > GPS=`ypcat passwd | cut -f 4 -d : | sort -u` MINOS26 > for GP in ${GPS} ; do grep ${GP} /tmp/group ; done g020:x:1525:kreymer epp:x:1535: e791:x:1720:cjames e781:x:1747: oss:x:5023: us_cms:-:5063:gaines,odell e875:x:5111:kreymer,pgouffon,bishai,cjames,jyuko,rbpatter lsfadm:-:5443: Wow, there are 64 users not in the e875/5111 group ! Here are the apparent Minos users who need addition, beyond present kreymer,pgouffon,bishai,cjames,jyuko,rbpatter GUSERS=' ahimmel baller bock costas cwhite dave_b dawson diwan djensen erwin escobar grossman hartouni jkn joffem jpaley kafka kschu kulik llhsu mmichel3 moeller murgia naples nevans niki paolone para qkwu rtoner shanahan thosieck tzanakos verebryu ' Date: Thu, 11 Sep 2008 16:48:37 -0500 (CDT) Subject: HelpDesk ticket 121501 ___________________________________________ Short Description: Please add users to groups list Problem Description: We are preparing to scale up our grid usage. Users need to write to group protected areas under /minos/data/... Quite a few Minos Cluster users are not in the e875/5111 group, and whose group id is not 5111. The e875 list is presently : MINOS26 > ypcat group | grep e875 e875:x:5111:kreymer,pgouffon,bishai,cjames,jyuko,rbpatter Please add these : ahimmel baller bock costas cwhite dave_b dawson diwan djensen erwin escobar grossman hartouni jkn joffem jpaley kafka kschu kulik llhsu mmichel3 moeller murgia naples nevans niki paolone para qkwu rtoner shanahan thosieck tzanakos verebryu ___________________________________________ Date: Fri, 12 Sep 2008 09:27:25 -0500 (CDT) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. ___________________________________________ ___________________________________________ ######## # FARM # ######## Let's just nibble away at CPB mcnearcat for a while. SRV1> ls /minos/data/minfarm/mcnearcat | grep n13047 | wc -l 7452 SRV1> ls /minos/data/minfarm/mcnearcat | grep n13037 | wc -l 2332 SRV1> ls /minos/data/minfarm/mcnearcat | grep L250 | wc -l 296 First some L250 items, then n13037 ./looper '-b 2000 -s L250 -r cedar_phy_bhcurv mcnear' & Thu Sep 11 14:30:55 CDT 2008 ############ # FERMIAPP # ############ MINOS25 > mkdir /grid/fermiapp/minos MINOS25 > chgrp e875 /grid/fermiapp/minos MINOS25 > chmod 775 /grid/fermiapp/minos MINOS25 > mkdir /grid/fermiapp/minos/kreymer MIN > for NODE in ${NODES} ; do printf "${NODE} " ; ssh -ax ${NODE} touch /grid/fermiapp/minos/kreymer/${NODE} ; done minos01 touch: cannot touch `/grid/fermiapp/minos/kreymer/minos01': No such file or directory minos02 touch: cannot touch `/grid/fermiapp/minos/kreymer/minos02': Read-only file system ... 
minos25 minos26 touch: cannot touch `/grid/fermiapp/minos/kreymer/minos26': Read-only file system MIN > for NODE in ${NODES} ; do printf "${NODE} " ; ssh -ax ${NODE} touch /grid/app/minos/test/${NODE} ; done Updated ticket accordingly ######### # ADMIN # ######### Four links under http://computing.fnal.gov/xms/Internal/Budget_%26_Finance ICON - Web Interface for Electronic Purchase Request http://fncdug1.fnal.gov/miser/ redirects a new window to https://appora.fnal.gov/pls/cert/miscomp.miser.html LINK - Create/Edit Purchase Requisition https://appora.fnal.gov/pls/cert/miscomp.miser.html ICON - Puchase Requisition Query https://miscomp.fnal.gov/miser/req-query.html redirects a new window to https://appora.fnal.gov/pls/cert/miscomp.miser.html LINK - Query Purchase Requisition https://appora.fnal.gov/miser_ora/www/req-query.html Followed this latter link, got query form, Put in CD105723 and/or CD%105723 Get a result which should have had the Lab_Number and PO_Number These fields were blank. ######## # FARM # ######## UGH, looking closely at concatenation, why would this be written? In /pnfs/minos/mcin_data/near/daikon_04/L250200N/709/n13037094*, all but subrun 1 are present. SRV1> ./roundup -n -s L250 -r cedar_phy_bhcurv mcnear ... MISS n13037094_*._L250200N_D04.mrnt.cedar_phy_bhcurv.root 0006 0007 0020 0021 0022 0023 0029 0030 OOPS - SUBRUN gap 1 to 1 OK adding n13037094_0000_L250200N_D04.mrnt.cedar_phy_bhcurv.root 1 OOPS - SUBRUN gap 6 to 7 OK adding n13037094_0002_L250200N_D04.mrnt.cedar_phy_bhcurv.root 4 OOPS - SUBRUN gap 20 to 23 OK adding n13037094_0008_L250200N_D04.mrnt.cedar_phy_bhcurv.root 12 OK adding n13037094_0024_L250200N_D04.mrnt.cedar_phy_bhcurv.0.root 5 SRV1> ./roundup -n -s n13037094 -r cedar_phy_bhcurv mcnear Same behaviour, would write this sntp and mrnt, in spite of gaps. SRV1> ./roundup -n -v -s n13037094 -r cedar_phy_bhcurv mcnear Note that the reco files are both .0.root and .root HAVE n13037094__L250200N_D04.mrnt.cedar_phy_bhcurv.root:8 HAVE n13037094__L250200N_D04.sntp.cedar_phy_bhcurv.root:8 MINOS26 > ls /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L250200N/sntp_data/709 | grep n13037094 n13037094_0006_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037094_0020_L250200N_D04.sntp.cedar_phy_bhcurv.root n13037094_0029_L250200N_D04.sntp.cedar_phy_bhcurv.root So this is legitemate, filling in missing subruns. ########## # DCACHE # ########## Date: Thu, 11 Sep 2008 11:14:04 -0500 (CDT) Subject: HelpDesk ticket 121458 ___________________________________________ Short Description: Additional unsecured dcap doors needed FNDCA Problem Description: There are presently only two unsecured dcap doors in FNDCA, each capable of a few hundred connections. Minos is now routinely running over 400 analysis jobs. We plan to scale up toward several thousand jobs running on Fermigrid. We will clearly need more doors. Please add a few more doors now, if technically possible. Please contact us ( minos-data ) to prepare a long term plan. 
___________________________________________ Date: Mon, 22 Sep 2008 14:12:47 -0500 (CDT) From: Dmitry Litvintsev I started two additional dcap doors on fndca1: port numbers : 24137,24138 ___________________________________________ ___________________________________________ ######## # FARM # ######## Date: Thu, 11 Sep 2008 14:34:10 +0100 From: Robert Pittam To: Arthur Kreymer , Alex Sousa Subject: RE: Cedar cedar_phy differences In a similar vein as the last email there are some near detector files for which there are a cedar_phy_bhcurv version but no cedar_phy equivalent. Some of them exist in cedar as well. I checked /minos/data/minfarm/nearcat/ But theres no sign of them. Any idea why they're missing? Jul 05 N00008046_0000.spill.sntp.cedar_phy_bhcurv.0.root Oct 06 N00011134_0000.spill.sntp.cedar_phy_bhcurv.0.root N00011134_0017.spill.sntp.cedar_phy_bhcurv.0.root Dec 06 N00011437_0000.spill.sntp.cedar_phy_bhcurv.0.root Jan 07 N00011468_0000.spill.sntp.cedar_phy_bhcurv.0.root Feb 07 N00011710_0000.spill.sntp.cedar_phy_bhcurv.0.root Apr 07 N00012074_0000.spill.sntp.cedar_phy_bhcurv.0.root N00012083_0000.spill.sntp.cedar_phy_bhcurv.0.root SUBS='N00008046_0000 N00011134_0000 N00011134_0017 N00011437_0000 N00011468_0000 N00011710_0000 N00012074_0000 N00012083_0000' MINOS26 > for SUB in ${SUBS} ; do grep ${SUB} /minos/data/minfarm/lists/bad_runs.cedar_phy ; done N00008046_0000.0 2005-07 8 2 2007-06-01 13:37:07 fnpc269 N00011468_0000.0 2007-01 106 2 2007-06-10 16:08:50 fnpc282 N00012074_0000.0 2007-04 2 2008-07-01 15:47:06 fnpc219 N00012083_0000.0 2007-04 2 2008-07-01 15:53:51 fnpc183 MINOS26 > for SUB in ${SUBS} ; do grep ${SUB} /minos/data/minfarm/lists/good_runs.cedar_phy ; done N00011134_0000.0 2006-10 101468 2007-05-15 09:22:28 fnpc237 N00011134_0000.1 2006-10 101468 2007-05-18 15:39:42 fnpc226 N00011134_0017.0 2006-10 100520 2007-05-15 08:36:42 fnpc242 N00011134_0017.1 2006-10 100520 2007-05-18 16:20:20 fnpc279 N00011437_0000.0 2006-12 100792 2007-10-12 17:18:27 fnpc136 N00011710_0000.0 2007-02 99755 2007-06-11 05:02:27 fnpc222 grep N00011134_0000.spill.*sntp */cedar_phynear.log __________________________________________________ Howie is resubmitting the missing runs. __________________________________________________ __________________________________________________ ============================================================================= 2008 09 10 ============================================================================= ######### # ADMIN # ######### OLD - Subject: HelpDesk ticket 118265 MINOS01 > cmd add_minos_user jcravens Creating account... /var/yp gmake[1]: Entering directory `/var/yp/minos' gmake[1]: `ypservers' is up to date. gmake[1]: Leaving directory `/var/yp/minos' gmake[1]: Entering directory `/var/yp/minos' Updating passwd.byname... Updating passwd.byuid... Updating netid.byname... gmake[1]: Leaving directory `/var/yp/minos' Adding user to Minos AFS group... Sending mail to subscribe to minos-user mailing list ... Sending email to user... MINOS01 > ypcat passwd | grep jcravens jcravens:KERBEROS:43513:5111:John Cravens:/afs/fnal/files/home/room2/jcravens:/usr/local/bin/tcsh The user got /usr/local/bin/tcsh rather than /bin/bash send mail to jonest ######### # ADMIN # ######### CD105723 https://appora.fnal.gov/pls/cert/miscomp.miser.html?action=view&cd_req_number=CD105723&submit_label=Go! 
State Entry Requires Role Exit Via Transition By Actor Comments E-Ready 23-Jul-2008 15:34:50.406 Terminal State CheckOut 22-Jul-2008 15:28:25.855 CheckOut_Approver 23-Jul-2008 15:34:50.392 E-Ready cbruce ######## # FARM # ######## Start to work again on CPB mcnear, howie is doing cleanup runs. The 10K file backlog is too-too much, Breaking it down. SRV1> ls /minos/data/minfarm/mcnearcat/n1303709* | wc -l 288 SRV1> ls /minos/data/minfarm/mcnearcat/n130374* | wc -l 1196 SRV1> ./roundup -n -s n1303709 -r cedar_phy_bhcurv mcnear This mostly added files SRV1> ./roundup -b 2000 -n -s n130374 -r cedar_phy_bhcurv mcnear This almost entirely added, with a few ZAPPED ZAPPING BAD n13037415_0009_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037415_0009_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037415_0009_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037415_0009_L010185N_D04.1 136 2008-04-29 16:14:31 fcdfcaf1626 ZAPPING BAD n13037436_0005_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 ZAPPING BAD n13037436_0005_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 ZAPPING BAD n13037436_0005_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037436_0005_L010185N_D04.1 136 2008-05-09 18:53:51 fnpc304 ./looper '-b 2000 -s n130374 -r cedar_phy_bhcurv mcnear' & Wed Sep 10 15:46:23 CDT 2008 ######## # FARM # ######## From: Arthur Kreymer To: rubin@fnal.gov Two files are in mcfarcat which are in bad_runs_mc.cedar_phy_bhcurv ZAPPING BAD f21438026_0000_M100200N_D04_helium.mrnt.cedar_phy_bhcurv.0.root f21438026_0000_M100200N_D04_helium.0 136 2008-09-04 21:34:47 fcdfcaf1605 ZAPPING BAD f21438026_0000_M100200N_D04_helium.sntp.cedar_phy_bhcurv.0.root f21438026_0000_M100200N_D04_helium.0 136 2008-09-04 21:34:47 fcdfcaf1605 The file times seem to predate the entries in bad_runs : -rw-rw-r-- 1 minospro numi 109085210 Sep 3 16:45 f21438026_0000_M100200N_D04_helium.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 126672629 Sep 3 16:44 f21438026_0000_M100200N_D04_helium.sntp.cedar_phy_bhcurv.0.root The candiate is similar : -rw-r--r-- 1 minospro e875 1596835361 Sep 3 17:58 /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_04/M100200N_helium/cand_data/802/f21438026_0000_M10 0200N_D04_helium.cand.cedar_phy_bhcurv.0.root Date: Wed, 10 Sep 2008 12:45:47 -0500 From: Howard Rubin To: Arthur Kreymer Subject: Re: Two 'bad' files in mcfarcat I can't be sure I understand this, but I have a possible scenario. Suppose this is one of those cases where a job was spontaneously restarted -- or perhaps not spontaneously because there were several jobs as mentioned by Steve in the meeting where they finished but held on termination. He released them but they restarted from the beginning and reran. If, on the second pass, they hit the 'random' failure, they might have failed, producing the bad_runs_mc entry. In fact, this seems to be borne out by the existence of a line in the good_runs_mc file: f21438026_0000_M100200N_D04_helium.0 2008-09-03 16:45:00 fcdfcaf1573 Since the pass number is determined upon submission, not upon processing, the pass for both processes would be 0. The operative procedure would be to delete the line(s) from bad_runs_mc. Do you want to do it or should I? 
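Before editing anything, a quick cross check in the spirit of the good_runs/bad_runs greps above for cedar_phy. A sketch only ; it assumes the _mc lists live in the usual /minos/data/minfarm/lists area :

RUN=f21438026_0000_M100200N_D04_helium
grep ${RUN} /minos/data/minfarm/lists/good_runs_mc.cedar_phy_bhcurv
grep ${RUN} /minos/data/minfarm/lists/bad_runs_mc.cedar_phy_bhcurv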
Rubin re activity : If /grid/app/minos/scripts is on your path, lj s will give you the current activity. If Matt's jobs are running there may be some formatting errors, but the final count will be correct. SRV1> /grid/app/minos/scripts/lj 0 jobs running. The ENSTORE write pool contains 0 files at 13:49 on 09/10/08. Updated the bad_runs file. fnpcsrv1% ls -l bad_runs_mc.cedar_phy_bhcurv -rw-rw-r-- 1 rubin numi 13465 Sep 9 17:20 bad_runs_mc.cedar_phy_bhcurv fnpcsrv1% cp -a bad_runs_mc.cedar_phy_bhcurv bad_runs_mc.cedar_phy_bhcurvnew fnpcsrv1% nedit bad_runs_mc.cedar_phy_bhcurvnew fnpcsrv1% diff bad_runs_mc.cedar_phy_bhcurvnew bad_runs_mc.cedar_phy_bhcurv 162a163 > f21438026_0000_M100200N_D04_helium.0 136 2008-09-04 21:34:47 fcdfcaf1605 fnpcsrv1% mv bad_runs_mc.cedar_phy_bhcurvnew bad_runs_mc.cedar_phy_bhcurv ######## # FARM # ######## MINOS26 > ./samdup /minos/data/minfarm/mcnearcat ######## # GRID # ######## Date: Wed, 10 Sep 2008 10:03:29 -0500 (CDT) Subject: HelpDesk ticket 121371 ___________________________________________ Short Description: Please mount /grid/fermiapp on Minos Cluster and Servers run2sys : The existing /grid/app application are is to assume a new role in Fermigrid, such that we will need to reinstall our software in a new area, /grid/farmiapp. Please mount /grid/fermiapp on Minos Cluster nodes minos02 through minos26 and on the Minos SAM servers minos-sam01 minos-sam02 minos-sam03 The mount should be similar to that of /grid/app : blue2:/fermigrid-app /grid/app nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-fermiapp /grid/fermiapp nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 ___________________________________________ Date: Wed, 10 Sep 2008 10:08:00 -0500 (CDT) Subject: Your ticket 121371 has been reassigned to SCOTT, RENNIE ___________________________________________ Date: Thu, 11 Sep 2008 16:53:04 -0500 (CDT) Solution: Request completed. ============================================================================= 2008 09 09 ============================================================================= ######### # ADMIN # ######### reviewed status of requisition 14-Aug-2008 ALLEN 203579 CD105749.1 CD105749 for FL/CD/SCF/FEF Storage and Servers for Minos PO 582085 Page 4 2U Dual Intel Xeon E5430 2.66GHz General Rack computer Server Promised Date: 06-Oct-2008 Deliver To: ALLEN, JASON M 4.00 EACH 3,422.00 PO 203579 13,688.00 582126 Configuration # 1 - TagmaStore Adaptable Storage and TagmaStore Workgroup Storage Hardware - 30TB Additional Capacity Promised Date: 06-Oct-2008 Deliver To: ALLEN, JASON M 2.0 EACH 14,000.0 PO 203579 28,000.00 Project CD Operations Task MINOS-COMP-OP Task Number 50.01.06.04.01.01 Exp. Org CD - FERMILAB EXPERIMENTS FACILITIES Exp. Type MATERIAL PURCHASES Task Org CD - RUNNING EXPERIMENTS Service Type OP-EXST PRGM OP-DET ########## # PARROT # ########## Added MINOS_EXTERNAL and release_data to mountfile.grow, renaming the mountfile.MX.grow previously tested. $ diff mountfile.3119d120MX.grow mountfile.grow 3d2 < /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL /grow/www-numi.fnal.gov/computing/parrot/MINOS_EXTERNAL 5d3 < /afs/fnal.gov/files/data/minos /grow/www-numi.fnal.gov/computing/parrot/release_data $ ln -sf mountfile.3119d120MX.grow mountfile.grow $ date Tue Sep 9 14:50:33 CDT 2008 $ pwd /grid/app/minos/parrot ######### # DOCDB # ######### Added Mark Messier 581879, group numirw Actual actions are : Find Name/ID in list, and click it. 
Select Action: Modify Verify User Click on Modify Personal Account. Instructions say to Select, nonesuch. ######## # FARM # ######## Pushing out mcfar CPB helium files ./looper '-r cedar_phy_bhcurv mcfar' & ######## # FARM # ######## Repeated dccp tests, per Ken S request SRV1> cd /minos/data/minfarm/mcnear SRV1> FILE=n13047014_0025_L010185N_D04.cand.cedar_phy_bhcurv.1.root SRV1> setup dcap -q x509 SRV1> source /usr/local/vdt/setup.sh SRV1> export X509_USER_PROXY=/export/stage/minfarm/.grid/x509up_u1334 SRV1> dccp ${FILE} \ dcap://fndca1.fnal.gov:24536/pnfs/fnal.gov/usr/minos/NULL/${FILE} Error ( POLLIN) (with data) on control line [6] Failed to create a control line Error ( POLLIN) (with data) on control line [7] Failed to create a control line Failed open file in the dCache. Can't open destination file : Server rejected "hello" System error: Input/output error SRV1> voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. Error: VOMS extension not found! subject : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990/CN=2146134877 issuer : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 identity : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 type : unknown strength : 512 bits path : /export/stage/minfarm/.grid/x509up_u1334 timeleft : 358:01:23 I think I need a cert with production role. ######### # FNALU # ######### Still on schedule for shutdown next week. For the record, lest we forget : MINOS26 > bqueues QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP debug 99 Open:Active - 5 1 - 0 0 0 0 test 98 Open:Active - 15 1 - 0 0 0 0 30min 10 Open:Active - - 1 - 0 0 0 0 4hr 8 Open:Active - - 1 - 6 0 6 0 12hr 6 Open:Active - - 1 - 1 0 1 0 1day 4 Open:Active - - 1 - 0 0 0 0 selex 4 Open:Active - 5 1 - 0 0 0 0 minos 4 Open:Active - - 1 - 0 0 0 0 1day_ex 4 Open:Active - 4 1 - 0 0 0 0 4day 2 Open:Active - 5 1 - 0 0 0 0 8day 1 Open:Active - 2 1 - 0 0 0 0 MINOS26 > bhosts HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV flxb16.fnal.gov unavail - 4 1 1 0 0 0 flxb17.fnal.gov ok - 4 0 0 0 0 0 flxb18.fnal.gov ok - 4 0 0 0 0 0 flxb19.fnal.gov ok - 4 0 0 0 0 0 flxb20.fnal.gov unavail - 4 0 0 0 0 0 flxb21.fnal.gov unavail - 4 0 0 0 0 0 flxb22.fnal.gov unavail - 4 0 0 0 0 0 flxb23.fnal.gov unavail - 4 0 0 0 0 0 flxb24.fnal.gov unavail - 4 1 1 0 0 0 flxb25.fnal.gov ok - 4 0 0 0 0 0 flxb26.fnal.gov closed - 4 2 2 0 0 0 flxb27.fnal.gov ok - 4 0 0 0 0 0 flxb28.fnal.gov closed - 4 2 2 0 0 0 flxb29.fnal.gov ok - 4 0 0 0 0 0 flxb30.fnal.gov ok - 4 0 0 0 0 0 flxb31.fnal.gov ok - 4 0 0 0 0 0 flxb32.fnal.gov ok - 4 0 0 0 0 0 flxb33.fnal.gov ok - 4 0 0 0 0 0 flxb34.fnal.gov ok - 4 0 0 0 0 0 flxb35.fnal.gov ok - 4 0 0 0 0 0 flxi04.fnal.gov unavail - 1 0 0 0 0 0 flxi06.fnal.gov ok - 2 0 0 0 0 0 flxi07.fnal.gov unavail - 2 0 0 0 0 0 fsui03.fnal.gov unavail - 5 0 0 0 0 0 minos14.fnal.gov unavail - 2 0 0 0 0 0 minos15.fnal.gov unavail - 2 0 0 0 0 0 minos16.fnal.gov unavail - 2 0 0 0 0 0 minos17.fnal.gov unavail - 2 0 0 0 0 0 minos18.fnal.gov unavail - 2 0 0 0 0 0 minos19.fnal.gov unavail - 2 0 0 0 0 0 minos20.fnal.gov unavail - 2 0 0 0 0 0 minos21.fnal.gov unavail - 2 0 0 0 0 0 minos22.fnal.gov unavail - 2 0 0 0 0 0 minos23.fnal.gov unavail - 2 0 0 0 0 0 minos24.fnal.gov unavail - 2 0 0 0 0 0 minos25.fnal.gov unavail - 2 0 0 0 0 0 minos26.fnal.gov unavail - 2 0 0 0 0 0 ============================================================================= 2008 09 08 ============================================================================= ######## # MAIL # 
######## Removed RFC2369 headers from lists for which they are not appropriate, to eliminate the PINE messages [ Note: This message contains email list management information ] To disable the headers, added to the head of the options list, Misc-Options= NO_RFC2369 minos-admin minos-docdb minos_sam_admin minos_sam_users Need to get ownership of some other lists minosdb-support MINOS-ACCOUNTS ? ######## # MCIN # ######### Checked sized, for budget planning MINOS26 > du -sh /pnfs/minos/mcin_data/near/daikon* 7.3T /pnfs/minos/mcin_data/near/daikon_00 33G /pnfs/minos/mcin_data/near/daikon_01 2.2T /pnfs/minos/mcin_data/near/daikon_03 15T /pnfs/minos/mcin_data/near/daikon_04 1.0K /pnfs/minos/mcin_data/near/daikon_05 ######### # MYSQL # ######### export PRODUCTS=/${HOME}/ups/db RE=/home/minsoft/restore/20080902/offline OF=/home/minsoft/database//offline cp -va ${RE}/BEAMMONCUTS.* ${OF}/ SOFT03 > ups start mysql SOFT03 > ups status mysql Setup:mysql datadir = /home/minsoft/database Setup:port=3306; socket=/home/minsoft/database/mysql.sock Check mysqld status: Uptime: 25 Threads: 1 Questions: 1 Slow queries: 0 Opens: 12 Flush tables: 1 Open tables: 6 Queries per second avg: 0.040 export MYSQL_PWD mysqladmin processlist -u root mysql> show tables ; +-------------------+ | Tables_in_offline | +-------------------+ | BEAMMONCUTS | +-------------------+ mysql> show columns from BEAMMONCUTS ; +-------------+---------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------+---------+------+-----+---------+-------+ | SEQNO | int(11) | NO | PRI | 0 | | | ROW_COUNTER | int(11) | NO | PRI | 0 | | | CUTVALUES | text | YES | | NULL | | +-------------+---------+------+-----+---------+-------+ 3 rows in set (0.00 sec) mysql> show index from BEAMMONCUTS ; +-------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | +-------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ | BEAMMONCUTS | 0 | PRIMARY | 1 | SEQNO | A | NULL | NULL | NULL | | BTREE | | | BEAMMONCUTS | 0 | PRIMARY | 2 | ROW_COUNTER | A | 12 | NULL | NULL | | BTREE | | +-------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ 2 rows in set (0.00 sec) mysql> check table BEAMMONCUTS ; +---------------------+-------+----------+-------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +---------------------+-------+----------+-------------------------------------------------------+ | offline.BEAMMONCUTS | check | warning | 1 client is using or hasn't closed the table properly | | offline.BEAMMONCUTS | check | status | OK | +---------------------+-------+----------+-------------------------------------------------------+ 2 rows in set (0.00 sec) ============================================================================= 2008 09 06 ============================================================================= ###### # WH # ###### Power out 06:00 to 18:00 ########## # CONDOR # ########## MINOS25 > condor_q gfactory 351 jobs; 16 idle, 248 running, 87 held MINOS25 > IDLES=`condor_q gfactory -hold | grep gfactory | cut -f 1 -d ' '` MINOS25 > date Sat Sep 6 02:41:48 CDT 2008 for IDLE in ${IDLES} ; 
do condor_release ${IDLE} ; sleep 10 ; done ============================================================================= 2008 09 05 ============================================================================= ########### # MINOS26 # ########### ./vault near 2008-08 ============================================================================= 2008 09 04 ============================================================================= ######## # FARM # ######## SRV1> ./roundup -b 2000 -r cedar_phy far SRV1> ./roundup -b 2000 -r cedar_phy far OK - processing 351 files ####### # DAQ # ####### [minos@dcsdcp ~]$ cat /dcsdata/logs/archiver.log /home/minos/kftp/v3_5/NULL/lib/gssftp.py:1: RuntimeWarning: Python C API version mismatch for module gss: This Python has API version 1012, module gss has version 1011. import gss Traceback (most recent call last): File "/home/minos/bin/archiver_krb.py", line 395, in ? os.remove(lock_file) OSError: [Errno 2] No such file or directory: '/var/lock/dcs/archiver.pid' [minos@dcsdcp ~]$ ls -l /var/lock/dcs/ total 0 -rw-r--r-- 1 minos e875 0 Sep 4 15:40 archiver.pid -r--r----- 1 minos e875 0 Sep 4 15:22 dcs_mysql2rotod.lock Checking out the near detector [minos@dcsdcp-nd dcsdata]$ cat /var/lock/dcs/archiver.pid 3046 [minos@dcsdcp-nd dcsdata]$ ps -f -p 3046 UID PID PPID C STIME TTY TIME CMD minos 3046 1 0 Jun27 ? 00:00:01 python /home/minos/bin/archiver_krb.py [minos@dcsdcp-nd dcsdata]$ /etc/init.d/archiver status Archiver is running This looks like the classic empty pid file, try clearing it FDCS > ls -l /dcsdata/archiver/data-to-archive total 0 -rw-r--r-- 1 minos e875 0 Jan 1 2007 F070101_163119.mdcs.root -rw-r--r-- 1 minos e875 0 Aug 28 19:00 F080829_000008.mdcs.root -rw-r--r-- 1 minos e875 0 Aug 29 19:00 F080830_000009.mdcs.root -rw-r--r-- 1 minos e875 0 Aug 30 19:00 F080831_000004.mdcs.root -rw-r--r-- 1 minos e875 0 Aug 31 19:00 F080901_000006.mdcs.root -rw-r--r-- 1 minos e875 0 Sep 1 19:00 F080902_000012.mdcs.root -rw-r--r-- 1 minos e875 0 Sep 2 19:00 F080903_000013.mdcs.root -rw-r--r-- 1 minos e875 0 Sep 3 19:00 F080904_000001.mdcs.root FDCS > ls -l /dcsdata/2007/F070101* -rw-r--r-- 1 minos e875 11273 Dec 31 2006 /dcsdata/2007/F070101_000034.mdcs.root -rw-r--r-- 1 minos e875 1303994 Jan 1 2007 /dcsdata/2007/F070101_000057.mdcs.root -rw-r--r-- 1 minos e875 721315 Jan 23 2007 /dcsdata/2007/F070101_170032.mdcs.root FDCS > mkdir /dcsdata/archiver/data-to-archivestray/ FDCS > mv /dcsdata/archiver/data-to-archive/F070101_163119.mdcs.root /dcsdata/archiver/data-to-archivestray/F070101_163119.mdcs.root FDCS > /etc/init.d/archiver start Starting archiver FDCS > cat logs/archiver.log /home/minos/kftp/v3_5/NULL/lib/gssftp.py:1: RuntimeWarning: Python C API version mismatch for module gss: This Python has API version 1012, module gss has version 1011. import gss MINOS26 > dds /pnfs/minos/far_dcs_data/2008-08 -rw-r--r-- 1 buckley e875 2510149 Aug 27 23:40 F080827_000002.mdcs.root -rw-r--r-- 1 buckley e875 2518981 Sep 4 15:38 F080828_000010.mdcs.root -rw-r--r-- 1 buckley e875 2494006 Sep 4 15:55 F080829_000008.mdcs.root I think the archiver still has the 10 minute cycle time. Files are continuing to move. On Thu, 4 Sep 2008, Alec T. Habig wrote: > Art Kreymer writes: >> File "/home/minos/bin/archiver_krb.py", line 395, in ? >> os.remove(lock_file) >> OSError: [Errno 2] No such file or directory: '/var/lock/dcs/archiver.pid' > > This was when I was trying to clean up processes, and had deleted the > lockfile but not killed the zombie process. 
> > Interestingly, I haven't been able to get the scripts to write to that > logfile since, although I can run the archiver manually (without the > startup scripts). The startup scripts do nothing. > > The data's there to be archived, flag files are in /dcsdata/archiver > > I did try and fixing ./dcs_startup/dcs_init_functions, which as far as I > could tell was trying to invoke the archiver with a nonexistant path. > diff it with ./dcs_startup/dcs_init_functions.cya to see what I mean. I did restart the archiver one more time, after removing a stale /var/lock/dcs/archiver.pid It has transferred two files so far, so we should be in business. I did correct one other problem. There was a tag file /dcsdata/archiver/data-to-archive/F070101_163119.mdcs.root for which there was no corresponding file in /dcsdata Perhaps this was tripping things up ( time bomb triggered by new python ? ) ############ # PREDATOR # ############ Dealing with effects of the full disk Bad files for N00014791_0007.mdaq.root N00014791_0008.mdaq.root N00014791_0009.mdaq.root N00014791_0010.mdaq.root F00041897_0009.mdaq.root Thu Sep 4 06:09:49 UTC 2008 F00041897_0010.mdaq.root Thu Sep 4 06:10:39 UTC 2008 F00041897_0011.mdaq.root Thu Sep 4 08:09:24 UTC 2008 F00041897_0012.mdaq.root Thu Sep 4 08:10:19 UTC 2008 B080903_080001.mbeam.root Thu Sep 4 10:11:14 UTC 2008 cat: write error: No space left on device ? B080903_160001.mbeam.root Thu Sep 4 10:11:59 UTC 2008 cat: write error: No space left on device ? B080904_000001.mbeam.root Thu Sep 4 10:12:38 UTC 2008 N080903_000002.mdcs.root Thu Sep 4 10:13:20 UTC 2008 cd /local/scratch26/kreymer/genpy/beam_data/2008-09 rm B080903_080001.sam.py B080903_160001.sam.py B080904_000001.sam.py cd /local/scratch26/kreymer/genpy/fardet_data/2008-09 for SR in 09 10 11 12 13 14 15 16 ; do rm F00041897_00${SR}.sam.py ; done cd /local/scratch26/kreymer/genpy/near_dcs_data/2008-09 rm N080903_000002.sam.py cd /local/scratch26/kreymer/genpy/neardet_data/2008-09 for SR in 07 08 09 10 11 12 13 14 ; do rm N00014791_00${SR}.sam.py ; done These were picked up on the 15:06 cycle Now pick up the beam and dcs MINOS26 > ./predator 2008-09 ########### # MINOS26 # ########### Disk was filled, due to monthly vault copies combined with jdejong use of this disk MINOS26 > du -sm /pnfs/minos/neardet_data/2008-08 88047 /pnfs/minos/neardet_data/2008-08 Just the neardet failed, the far was OK. MINOS26 > du -sm /local/scratch26/jdejong 130137 /local/scratch26/jdejong MINOS26 > du -sm * 1 fardet_data 48262 neardet_data Checking other big users, mainly mindata/MOVED MINOS26 > du -sm * 1692 141 1 CRON 22195 MOVED mindata $ rm -r /local/scratch26/mindata/MOVED kreymer MINOS26 > cd /local/scratch26/kreymer/SHEEP/neardet_data MINOS26 > rm -r 2008-08 Jeff will remove his files, they are no longer needed. 
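To catch a full /local/scratch26 sooner next time, a quick survey sketch built from the same du calls as above ; perhaps worth a cron, thresholds still to be chosen :

cd /local/scratch26
df -h .                                    # overall fill level
du -sm * 2>/dev/null | sort -n | tail -5   # the five largest users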
######### # MYSQL # ######### Shifted MYI files out of the way, before continuing with the gzip and local phase of archives mkdir ${DBCOPY}/offlineindex mv ${DBCOPY}/offline/*.MYI ${DBCOPY}/offlineindex/ Mysql> du -sm ${DBCOPY}/off* 53650 /minos/data/mysql/archive/20080902/offline 12777 /minos/data/mysql/archive/20080902/offlineindex ============================================================================= 2008 09 03 ============================================================================= ######### # MYSQL # ######### export PRODUCTS=/${HOME}/ups/db ups stop mysql SOFT03 > du -sm /minos/data/mysql/archive/20080902/offline 66426 /minos/data/mysql/archive/20080902/offline SOFT03 > rm -r restore mkdir restore mkdir restore/20080902 time cp -vax /minos/data/mysql/archive/20080902/offline restore/20080902/offline/ real 132m44.680s user 0m10.905s sys 6m14.102s Ganglia rate is 6 to 10 MBytes/second, with the usual 15 second drop outs every 10 minutes. For example, after 600 sec of 6 MB/sec ( 3.6 GB ) Peak rate interval 240 s 15 MB/sec ( 3.4 Gb ) SOFT03 > du -sk /minos/data/mysql/archive/20080902/offline restore/20080902/offline/ 68020112 /minos/data/mysql/archive/20080902/offline 68087604 restore/20080902/offline/ ============================================================================= 2008 09 02 ============================================================================= ########## # PARROT # ########## Test thain's new symlink hack to make_growfs.auto, using $ cat mountlink.grow /parrot /grow/www-numi.fnal.gov/computing/parrot/link The usual parrot test, but parrot -m /minos/scratch/parrot/mountlink.grow -d remote /bin/bash P> ls -l /parrot total 2514 -r--r--r-- 1 kreymer numi 1283387 Jul 29 14:56 data -rw-r--r-- 1 kreymer numi 6483 Aug 26 16:12 HOWTO.parrot -r--r--r-- 1 kreymer numi 1283387 Jul 29 14:56 releasedata P> head -3 /parrot/releasedata /// /// Data and script to load the CalStripAtten database /// with ND mapper attenuation curves. P> head -3 /parrot/data ... 2008/09/02 14:20:44.589679 [20793] parrot: grow: failed to open http://www-numi.fnal.gov:80/computing/parrot/link//data head: cannot open `/parrot/data' for reading: Permission denied This is the expected result, as /minos/data cannot be web served. Now test this sort of directory on ups/minossoft in d199/d141 Need a fresh copy of these, for testing with current software minsoft@minos-mysql1 MD=/afs/fnal.gov/files/data/minos ECHO=echo for DIR in packages releases setup srt ; do date ; echo ${DIR} ${ECHO} rm -r ${MD}/d199/${DIR} ${ECHO} cp -vax ${MD}/d120/${DIR} \ ${MD}/d199/${DIR} date ; done for DIR in catman db etc man prd ; do date ; echo ${DIR} ${ECHO} rm -r ${MD}/d141/${DIR} ${ECHO} cp -vax ${MD}/d119/${DIR} \ ${MD}/d141/${DIR} date ; done Tue Sep 2 16:52:25 CDT 2008 packages ... Tue Sep 2 19:37:54 CDT 2008 man rm: remove write-protected regular file `/afs/fnal.gov/files/data/minos/d141/man/man1/tclsh.1'? OOOPS, should not have used 'v' option for cp, Mysql> fs listacl /afs/fnal.gov/files/data/minos/d141/man/man1 Access list for /afs/fnal.gov/files/data/minos/d141/man/man1 is Normal rights: minos rlidwka system:administrators rlidwka system:anyuser rl Mysql> tokens Tokens held by the Cache Manager: User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Sep 4 20:47] --End of list-- Mysql> rm /afs/fnal.gov/files/data/minos/d141/man/man1/tclsh.1 rm: remove write-protected regular file `/afs/fnal.gov/files/data/minos/d141/man/man1/tclsh.1'? 
no Mysql> chmod 755 /afs/fnal.gov/files/data/minos/d141/man/man1/tclsh.1 Mysql> rm /afs/fnal.gov/files/data/minos/d141/man/man1/tclsh.1 ?????????????? what gives ???????????? Since when do AFS files care about file permissions ? Mysql> ls -l /afs/fnal.gov/files/data/minos/d141/man/man1 total 36 -rw-r--r-- 1 kreymer 1525 5887 Oct 17 2005 dropit.1 -rw-r--r-- 1 kreymer 1525 11704 Apr 26 2005 python.1 -rwxr-xr-- 1 kreymer 1525 4448 Oct 17 2005 upd.1 -r--r--r-- 1 kreymer 1525 12668 Jan 31 2005 wish.1 Mysql> whoami minsoft Mysql> chmod u+w /afs/fnal.gov/files/data/minos/d141/man/man1/wish.1 This allowed the file to be removed. There are many more files lacking u+w permission, using code from addpkg : DIR=/afs/fnal.gov/files/data/minos/d141 #find ${DIR} ! -perm -200 -exec ls -l {} \; #find ${DIR} ! -perm -200 -exec chmod u+w {} \; Many things in man tcl tk blt python perl xfig imagelibs java oracle_client MINOS_EXTERN Mysql> find ${DIR} ! -perm -200 | wc -l 125553 -r-xr-xr-x 1 kreymer 1525 7771765 May 7 2007 /afs/fnal.gov/files/data/minos/d141/prd/encp/v3_6g/Linux-2-4-2-3-2/enstore total 52 -r--r--r-- 1 kreymer 1525 1766 May 7 2007 ECRC.c -r--r--r-- 1 kreymer 1525 1019 May 7 2007 Makefile -r--r--r-- 1 kreymer 1525 4820 May 7 2007 add_to_tape.c -r--r--r-- 1 kreymer 1525 5960 May 7 2007 cpio.c ... ???? what gives ????? where did the path go ?????? find /afs/fnal.gov/files/code/e875/general/products/man/man1 ! -perm -200 -exec ls -l {} \; see similar issues Checking out minossoft : Mysql> find /afs/fnal.gov/files/data/minos/d120 ! -perm -200 -exec ls -l {} \; ... nothing found ... find ${DIR} ! -perm -200 -exec chmod u+w {} \; Mysql> find ${DIR} ! -perm -200 -exec chmod u+w {} \; Mysql> date Wed Sep 3 11:27:25 CDT 2008 Let's correct the original working ups files, DIR=/afs/fnal.gov/files/data/minos/d119 date Wed Sep 3 11:55:52 CDT 2008 time find ${DIR} ! -perm -200 -exec chmod u+w {} \; real 22m32.942s user 0m51.270s sys 7m49.561s for DIR in catman db etc man prd ; do date ; echo ${DIR} ${ECHO} rm -r ${MD}/d141/${DIR} ${ECHO} cp -ax ${MD}/d119/${DIR} \ ${MD}/d141/${DIR} date ; done Wed Sep 3 12:20:42 CDT 2008 catman Wed Sep 3 12:20:44 CDT 2008 Wed Sep 3 12:20:44 CDT 2008 db Wed Sep 3 12:20:53 CDT 2008 Wed Sep 3 12:20:53 CDT 2008 etc Wed Sep 3 12:20:53 CDT 2008 Wed Sep 3 12:20:53 CDT 2008 man Wed Sep 3 12:21:02 CDT 2008 Wed Sep 3 12:21:02 CDT 2008 prd Wed Sep 3 15:45:33 CDT 2008 ######## # DISK # ######## Date: Tue, 02 Sep 2008 13:27:51 -0500 (CDT) Subject: HelpDesk ticket 120912 ___________________________________________ Short Description: Quota request for jdejong on BlueArc served /minos/scratch Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user jdejong on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. Please cc: jdejong and minos-data ( my mail comes through imap3, which is down right now ) ___________________________________________ Date: Tue, 02 Sep 2008 14:36:35 -0500 (CDT) This ticket has been reassigned to RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Tue, 02 Sep 2008 15:17:15 -0500 (CDT) The quota has been moved up to 500GB as requested. This ticket was resolved by RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST group. ___________________________________________ ######## # IMAP # ######## 10:50 imapserver3 seems to be down, not on the network. Stopped getting mail service around 10:50 CDT. Submitted helpdesk ticket. 13:30 imap3 is up. 
Date: Tue, 02 Sep 2008 13:27:53 -0500 (CDT) Subject: HelpDesk ticket 120917 ___________________________________________ Short Description: imapserver3 down Problem Description: At around 10:50, imapserver3 seems to have gone off the network. ( I will not be able to receive email regarding this, as this is where kreymer@fnal.gov mail goes. ) ___________________________________________ Date: Tue, 02 Sep 2008 14:02:36 -0500 (CDT) Solution: Imapserver3 experienced a hardware failure at 11:00am this morning. The imapserver3 service is back up as of 1:30pm on other temporary hardware. Another downtime will be scheduled to switch the service back once the hardware problem has been repaired. email will be sent when we know exactly when the downtime will be. This ticket was resolved by BOZONELOS, JERE of the CD-LSCS/CSI/HD group. __________________________________________ ___________________________________________ Date: Tue, 02 Sep 2008 13:34:41 -0500 From: Fermilab Postmaster To: all-imap3-users@imapserver3.fnal.gov Subject: Imapserver3 back up Hi, Imapserver3 experienced a hardware failure at 11:00am this morning. The imapserver3 service is back up as of 1:30pm on other temporary hardware. We will have to schedule a downtime to switch the service back once the hardware problem has been repaired. We will send email when we know exactly when the downtime will be. Fermilab Email Team ############ # PREDATOR # ############ No far_dcs_data files this month Last file standing was F080827_000002.mdcs.root Thu Aug 28 10:14:58 UTC 2008 ########### # MONTHLY # ########### DATASETS 9/2 PREDATOR 9/2 VAULT 9/5 MYSQL 9/5 ============================================================================= 2008 08 29 ============================================================================= ########## # PARROT # ########## Fails when running paloon/loonar with S08-08-28-R1-30 OK with old R1.24.2 and S07-12-22-R1-26 could not find a gcc version for release "S08-08-28-R1-30" on Linux+2 ERROR: Need unique instance but multiple "products" found INFORMATIONAL: Product '*' (with qualifiers ','), has no S08-08-28-R1-30 version (or may not exist) RUNNING LOON /grid/app/minos/parrot/loonar: line 47: loon: command not found Should look like No default SAM configuration exists at this time. MINOSSOFT release "S07-12-22-R1-26" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-17-08 EXTERN=v03 CONFIG=v01 setup "test" version of LABYRINTH [ linux , FNALU ] setup NEUGEN3 development explicitly setting up GCC3_4_3 version of GEANT using PYTHIA6 (v6_412) for LUND Needed $ date ; time make_growfs -k -f /afs/fnal.gov/files/code/e875/general/ups make_growfs: 1133707 files, 361 links, 153891 dirs, 0 checksums computed real 9m1.392s user 1m19.533s sys 2m6.919s ########## # CONDOR # ########## Slow response, especially for condor_submit ( seconds per process ) since late last night, reported by pawloski and loiacono Also, recently held gfactory jobs, 108 jobs; 1 idle, 19 running, 88 held MINOS25 > condor_q -l 181282.1 HoldReason = "Globus error 10: data transfer to the server failed" MINOS25 > condor_q -l -hold gfactory | grep EnteredCurrentStatus | sort EnteredCurrentStatus = 1220027205 Fri Aug 29 11:26:45 CDT 2008 many ... EnteredCurrentStatus = 1220027597 Fri Aug 29 11:33:17 CDT 2008 many ... EnteredCurrentStatus = 1220027780 Fri Aug 29 11:36:20 CDT 2008 few ... 
EnteredCurrentStatus = 1220029654 Fri Aug 29 12:07:34 CDT 2008 many bluwatch did see slow access ( no failure ) Fri Aug 29 03:25:30 CDT 2008 SLO N00013286_0000.spill.sntp.cedar_phy_bhcurv.0.root 13 Fri Aug 29 03:29:46 CDT 2008 SLO N00013299_0000.spill.sntp.cedar_phy_bhcurv.0.root 12 ... Fri Aug 29 11:16:17 CDT 2008 SLO N00008017_0000.spill.sntp.cedar_phy_bhcurv.0.root 58 Fri Aug 29 11:17:47 CDT 2008 SLO N00008019_0000.spill.sntp.cedar_phy_bhcurv.0.root 30 Fri Aug 29 11:19:28 CDT 2008 SLO N00008019_0002.spill.sntp.cedar_phy_bhcurv.0.root 41 Fri Aug 29 11:21:15 CDT 2008 SLO N00008020_0000.spill.sntp.cedar_phy_bhcurv.0.root 47 Fri Aug 29 11:22:49 CDT 2008 SLO N00008021_0000.spill.sntp.cedar_phy_bhcurv.0.root 34 Fri Aug 29 12:23:08 CDT 2008 SLO N00008218_0000.spill.sntp.cedar_phy_bhcurv.0.root 12 Released 84 held gfactory jobs, a few are running, more are held ( error 10 ) ########## # PARROT # ########## Updated minossoft, for latest snapshot ( noopt only ) $ date ; time make_growfs -v -k -f /afs/fnal.gov/files/code/e875/general/minossoft Fri Aug 29 14:20:44 CDT 2008 Interrupted, there were changes to the latest snapshot. Reduced verbosity $ date ; time make_growfs -k -f /afs/fnal.gov/files/code/e875/general/minossoft Fri Aug 29 14:56:34 CDT 2008 make_growfs: loading existing directory from /afs/fnal.gov/files/code/e875/general/minossoft/.growfsdir make_growfs: scanning directory tree for changes... Broken link, /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-08-28-R1-30/include/CodeMgtTools/include rhatcher repaired this, $ date ; time make_growfs -k -f /afs/fnal.gov/files/code/e875/general/minossoft Fri Aug 29 15:58:56 CDT 2008 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.16/bin/Linux2.6-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.16/bin/Linux-sl3-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.16/bin/Linux2.4-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/bin/Linux2.6-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/bin/Linux2.6-GCC_3_4-maxopt/Linux2.4-GCC_3_4-maxopt warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/bin/Linux2.4-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/bin/Linux2.4-GCC_3_4-maxopt/Linux2.4-GCC_3_4-maxopt warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/lib/Linux2.6-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/lib/Linux2.6-GCC_3_4-maxopt/Linux2.4-GCC_3_4-maxopt warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/lib/Linux2.4-GCC_3_4/Linux2.4-GCC_3_4 warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-03-20-R1-28/lib/Linux2.4-GCC_3_4-maxopt/Linux2.4-GCC_3_4-maxopt warning: broken symbolic link /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1-29/Linux2.6-GCC_3_4-maxopt make_growfs: 4422692 files, 12 links, 176657 dirs, 0 checksums computed real 36m58.916s user 3m35.032s sys 20m50.270s From -rw-r--r-- 1 kreymer 5111 31326785 Aug 14 19:29 .growfsdir to 
-rw-r--r-- 1 kreymer 5111 211982753 Aug 29 21:35 .growfsdir Following symlinks takes us from 31 MB to 211 MB directory size. ############# # MILESTONE # ############# Successfully ran a standard cedar near detector spill reco job under Parrot. ########## # PARROT # ########## Added release_data service /afs/fnal.gov/files/expwww/numi/html/computing/parrot MIN > ln -s /afs/fnal.gov/files/data/minos release_data Corrected single file sim data file link, for reco test ln -sf /afs/fnal.gov/files/data/minos/release_data/bmaps/bfld_160.dat \ /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat Would restore with ln -sf /minos/data/release_data/bmaps/bfld_160.dat \ /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat Checked file identity with diff /minos/data/release_data/bmaps/bfld_160.dat \ /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat Before retest, rebuilt sim index, and created such for release_data time make_growfs -v -k /afs/fnal.gov/files/data/minos/release_data make_growfs: 292 files, 1 links, 53 dirs, 0 checksums computed real 0m1.885s user 0m0.028s sys 0m0.083s time make_growfs -v -k -f /afs/fnal.gov/files/code/e875/sim Interrupted, circular symlink under gmieg/Mesa/Mesa-2.6 $ ls -alF /afs/fnal.gov/files/code/e875/sim/gmieg/Mesa/Mesa-2.6/Mesa-2.6 lrwxr-xr-x 1 gmieg e875 53 Jan 25 1999 /afs/fnal.gov/files/code/e875/sim/gmieg/Mesa/Mesa-2.6/Mesa-2.6 -> /afs/fnal.gov/files/code/e875/sim/gmieg/Mesa/Mesa-2.6/ $ rm /afs/fnal.gov/files/code/e875/sim/gmieg/Mesa/Mesa-2.6/Mesa-2.6 time make_growfs -v -k -f /afs/fnal.gov/files/code/e875/sim make_growfs: 393475 files, 77 links, 63801 dirs, 0 checksums computed real 3m33.670s user 0m22.977s sys 0m57.210s Repeaded loon run, got stuck at : explicitly setting up GCC3_4_3 version of GEANT using PYTHIA6 (v6_412) for LUND 6160 kreymer 25 0 6592 1976 1248 R 100 0.1 13:39.02 5b33ce4febe1b4b 5819 pts/0 S 0:11 \_ parrot -m ./mountfile.MX.grow -H /bin/bash 5822 pts/0 T 0:00 \_ /bin/bash 6160 pts/0 R+ 13:46 \_ python /afs/fnal.gov/files/code/e875/general/minossoft/setup/datagram/datagram_client.py [sh] kreymer minos_offline R1. Noted that /local/stage1/minos was empty, Fresh login, fresh parrot sesssion, parrot -m /grid/app/minos/parrot/mountfile.MX.grow -H /bin/bash ... BfldLoanPool::GetMap new map, type 2 'Rect2dGrid', variant 160 BfldMapRect2d read file: /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat BfldMapRect2d: near detector, 40kAturns CurrentForward -- FXY Bob Wands 08/09/2005 ... =E= Bfld 2008/08/29 11:59:27 [9870|200520] BfldMapRect2d.cxx,v1.26:87> can not open input file: '/afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_161.dat' Floating point exception diff /minos/data/release_data/bmaps/bfld_161.dat \ /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_161.dat ln -sf /afs/fnal.gov/files/data/minos/release_data/bmaps/bfld_161.dat \ /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_161.dat $ time make_growfs -v -k -f /afs/fnal.gov/files/code/e875/sim make_growfs: 393475 files, 77 links, 63801 dirs, 0 checksums computed real 2m41.662s user 0m30.416s sys 0m54.832s Spill(100000 in 750 out 99250 filt.) P> du -sk *root 17420 CandS.root 81772 N00009870_0002.mdaq.root 3112 ntupleStS.root S U C C E S S ! rhatcher will revise all the sim symlinks from /minos/data to afs, as was done with the minossoft area. Repeated this run, after rebuilding minsoft index, looks OK Spill(100000 in 750 out 99250 filt.) ... 
Channels with the most errors: Errors: 13 [ Near| 46 Vf| 39|*W] Errors: 13 [ Near| 136 Vf| 64|*W] Errors: 13 [ Near| 136 Vf| 72|*W] Errors: 11 [ Near| 146 Vf| 65|*W] Errors: 11 [ Near| 146 Vf| 73|*W] Errors: 10 [ Near| 6 Vf| 81|*W] Errors: 10 [ Near| 16 Vf| 48|*W] Errors: 10 [ Near| 136 Vf| 65|*W] Errors: 10 [ Near| 136 Vf| 73|*W] Errors: 9 [ Near| 136 Vf| 80|*W] DatabaseInterface shutdown not requested ============================================================================= 2008 08 28 ============================================================================= ########## # PARROT # ########## continue reco test, this time kreymer@minos26 cd ${HOME}/minos . ./setup_minos setup_minos -r R1.24.0 export ENV_TSQL_URL='mysql:odbc://fnpcsrv1.fnal.gov:3307/temp;mysql:odbc://fnpcsrv1.fnal.gov:3307/cedar' export ENV_TSQL_USER=reader export ENV_TSQL_PSWD=minos_db cd /local/scratch26/kreymer/DATA FIN=N00009870_0002.mdaq.root time loon -b -q reco_near_spill_cedar.C ${FIN} 2>&1 | tee loon.log ... Spill(100000 in 750 out 99250 filt.) ... real 14m10.572s user 13m43.216s sys 0m12.443s Try this again using the public database Changed host to minos-db1 port to 3306 cedar to offline export ENV_TSQL_URL='mysql:odbc://minos-db1.fnal.gov:3306/temp;mysql:odbc://minos-db1.fnal.gov:3306/offline' export ENV_TSQL_USER=reader export ENV_TSQL_PSWD=minos_db This looks good, let's gear up for Parrot tests. mindata@Minos26: $ cd /grid/app/minos/parrot $ cp /local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root . $ cp /local/scratch26/kreymer/DATA/reco_near_spill_cedar.C . kreymer@fnpc170 Run standard usage test, with the setups and env's as above mkdir -p /local/stage1/minos cd /local/stage1/minos cp /grid/app/minos/parrot/N00009870_0002.mdaq.root . cp /grid/app/minos/parrot/reco_near_spill_cedar.C . time loon -b -q reco_near_spill_cedar.C ${FIN} 2>&1 | tee loon.log ^D ( needed for parrot stickiness ) Ended after libFiltration.so Try again with printf, time { printf "" | loon -b -q reco_near_spill_cedar.C ${FIN} } 2>&1 | tee loon.log P> time { printf "" | loon -b -q reco_near_spill_cedar.C ${FIN} ; } 2>&1 | tee loon.log ; ERROR: ld.so: object '/grid/app/minos/parrot/cctools-current-20080708-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. Warning in : class timespec already in TClassTable Processing reco_near_spill_cedar.C... Warning in : class CandDigitListHandleKeyFunctor already in TClassTable Warning in : class CandDigitListHandleKeyFunc already in TClassTable Warning in : class CandDigitListHandleItr already in TClassTable Segmentation fault real 0m12.186s user 0m0.000s sys 0m0.000s The next log messages would have been Successfully opened connection to: mysql:odbc://minos-db1.fnal.gov:3306/temp?option=1; Successfully opened connection to: mysql:odbc://minos-db1.fnal.gov:3306/offline?option=1; On rerunning, got an additional message, segmentation fault Trying a fresh test, parrot -m ${PARROT_DIR}/mountfile.grow -H /bin/bash # for production PS1='P> ' export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup unset SETUP_UPS SETUPS_DIR . /afs/fnal.gov/files/code/e875/general/ups/etc/setups.sh setup_minos() { . 
$MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } setup_minos -r R1.24.0 cd /local/stage1/minos export ENV_TSQL_URL='mysql:odbc://minos-db1.fnal.gov:3306/temp;mysql:odbc://minos-db1.fnal.gov:3306/offline' export ENV_TSQL_USER=reader export ENV_TSQL_PSWD=minos_db FIN=N00009870_0002.mdaq.root printf "" | loon -b -q reco_near_spill_cedar.C ${FIN} Warning in : class timespec already in TClassTable Processing reco_near_spill_cedar.C... Warning in : class CandDigitListHandleKeyFunctor already in TClassTable Warning in : class CandDigitListHandleKeyFunc already in TClassTable Warning in : class CandDigitListHandleItr already in TClassTable Segmentation fault With -d all, see message just before the crash , cannot open /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/etc/odbcinst.ini This file is indeed not visible, MINOS_EXTERNAL is not exported. MIN > pwd /afs/fnal.gov/files/expwww/numi/html/computing/parrot ln -s /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL MINOS_EXTERNAL rm MINOS_EXTERNAL/.gr* $ time make_growfs -v -k /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL make_growfs: 37346 files, 1077 links, 2425 dirs, 0 checksums computed real 0m22.704s user 0m1.763s sys 0m3.694s parrot -m ./mountfile.MX.grow -H /bin/bash This connects to the database, cranking along at 17:02 CDT, BfldLoanPool::GetMap new map, type 2 'Rect2dGrid', variant 160 =E= Bfld 2008/08/28 17:02:56 [9870|200314] BfldMapRect2d.cxx,v1.26:87> can not open input file: '/afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat' Floating point exception That is reasonable, we need to shift more symlinks : $ ls -l /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat lrwxr-xr-x 1 rhatcher e875 43 Aug 5 16:16 /afs/fnal.gov/files/code/e875/sim/labyrinth/ephemera/bfield/bfld_160.dat -> /minos/data/release_data/bmaps/bfld_160.dat ####### # NAS # ####### Date: Thu, 28 Aug 2008 13:05:52 -0500 From: Andrew J. Romero To: "'site-nas-announce@fnal.gov'" Subject: Reminder: BlueArc Maintenance Tuesday Sept 2, 2008 from 6:00am to 6:20am (FERMI-BLUE cluster ONLY) We will be performing maintenance on the FERMI-BLUE BlueArc cluster on Tuesday Sept 2, 2008 from 6:00am to 6:20am During the maintenance outage we will be upgrading the BlueArc Titan firmware UNIX NFS Clients should recover gracefully when the maintenance is complete. Note: There are no production Windows shares hosted on the FERMI-BLUE cluster The following BlueArc hosted file servers (EVSs) are effected by this maintenance outage ----------------------------------------------- blue3 bluetest fermi-nas-1 mb-nas-0 The following BlueArc hosted file servers (EVSs) are **NOT** effected by this maintenance outage ----------------------------------------------- blue1 blue2 cdfserver1 cdserver dirserver1 eshserver1 lsserver minos-nas-0 numiserver ppdserver pseekits ############# # MINOSSOFT # ############# Preparing to move all the /minos/data/release_data sylinks to /afs/fnal.gov/files/data/minos/release_data Build on the methods described in HOWTO.afssoftprod LOGD=/minos/scratch/minsoft/afssoft SLINKF=${LOGD}/slink/recodata.links SLINKL=${LOGD}/slink/recodata.log PVOL=/afs/fnal.gov/files/data/minos/d120 DOUT=${PVOL} find ${DOUT} -type l -exec ls -l {} \; \ | cut -f 2- -d / \ | sed 's/ -> /:/g' \ | grep ':/' \ | grep :/minos/data/release_data/ \ | tee ${SLINKF} * * * Not proceeding with this. * * * rhatcher already has scripts in place which can do this, as part of normal release management. 
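For completeness, had we proceeded, the repointing would have been a simple loop over ${SLINKF}, in the spirit of the single bfld_16x.dat links fixed above. A sketch only, never run ; the find/cut/sed pipeline above strips the leading slash from the link path, and the echo keeps this a dry run :

while IFS=: read LINK TARG ; do
  NEWT=`echo ${TARG} | sed 's|^/minos/data/release_data|/afs/fnal.gov/files/data/minos/release_data|'`
  echo ln -sf ${NEWT} /${LINK}
done < ${SLINKF}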
####### # WEB # ####### Date: Thu, 28 Aug 2008 11:35:53 -0500 (CDT) Subject: HelpDesk ticket 120775 ___________________________________________ Short Description: public web server allows browsing of local files in /etc Problem Description: Web administrators : The CD public web servers are wollowing symlinks to local files, such as /etc/passwd. The actual files you see vary from time to time, depending on which backend web server you actually get connected to. For example, see http://www-numi.fnal.gov/computing/parrot/link/etcdir/ which is a symlink to /etc, which contains interesting files like ftpusers group hosts passwd release resolv.conf These are probably things we do not want served to the world. ___________________________________________ Date: Thu, 28 Aug 2008 15:38:56 -0500 (CDT) This ticket has been reassigned to PASETES, RAY of the CD-LSCS/CSI/CS/EST Group ___________________________________________ Date: Fri, 29 Aug 2008 10:33:07 -0500 (CDT) Solution: Hi Art, Thank you for bringing this up. Currently, we are allowing links on the central web servers. However, we are working with security to change this policy. This will be a HUGE disruption to many sites which rely on soft links heavily. There are a couple of other options we can place which could prevent the people from linking to local files, but these options would also severely break many other sites. So, for now, we are balancing security with practicality. We accept the current risk for now and are working towards a more secure infrastructure while also providing the least painful path for our users. We have removed your links to the /etc directory. Please do not do that again. Thank you. This ticket was resolved by PASETES, RAY of the CD-LSCS/CSI/CS/EST group. ___________________________________________ Thanks for the clarification. I had not intented to suggest disabling symlinks, which would be disastrous. Instead, something more gentle, limiting served files to /afs/fnal.gov ( and maybe /afs/.fnal.gov ) Thanks for your attention to this. I have removed the rest of my test links to local file systems. ___________________________________________ ############ # STARTUP # ############ PNFS is back 2 Thu Aug 28 08:40:19 CDT 2008 13219 Thu Aug 28 12:25:38 CDT 2008 4 Thu Aug 28 12:30:42 CDT 2008 FTP is back 6 Thu Aug 28 12:36:01 CDT 2008 557 So DCache seems to be fine, restarting tasks Thu Aug 28 13:33:24 CDT 2008 kreymer@minos26 cd minos/scripts crontab crontab.dat mindata@minos26 cd crontab crontab.dat minfarm@fnpcsrv1 mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok ########### # MINOS01 # ########### Date: Thu, 28 Aug 2008 10:09:47 -0500 (CDT) Subject: HelpDesk ticket 120768 ___________________________________________ Short Description: minos01 is not accepting ssh connections Problem Description: minos01 will not accept ssh connections. rsh is still working. The last login I see in /var/log/messages is Aug 27 12:03:05 minos01 sshd(pam_unix)[8288]: session opened for user rustem by (uid=0) It is rather important to fix this, as this system is our CVS server, which is accessed primarily via ssh. ___________________________________________ Date: Thu, 28 Aug 2008 10:13:49 -0500 From: Mark Schmitz I restarted sshd. Seems OK now. ___________________________________________ Date: Thu, 28 Aug 2008 10:16:26 -0500 (CDT) This ticket has been reassigned to HARRINGTON, JASON of the CD-SF/FEF Group. 
___________________________________________ Date: Thu, 28 Aug 2008 10:16:27 -0500 (CDT) Solution: restarted sshd ######## # GRID # ######## Date: Thu, 28 Aug 2008 09:32:50 -0500 (CDT) Subject: HelpDesk ticket 120763 ___________________________________________ Short Description: Many files at the top of /grid/data Problem Description: There are over 4600 files at the top level of /grid/data . MINOS26 > ls /grid/data | grep remote$ | wc -l 4681 An initial 'ls /grid/data' command can take over two minutes. The files are owned by fnalgrid, and have names like 2008-07-25T17:08:01Z-gridftp-probe-test-file-remote I suggest moving these to a subdirectory of /grid/data , perhaps /grid/data/fnalgrid/ . ___________________________________________ Date: Thu, 28 Aug 2008 09:37:35 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: Art--most of those files are the by-product of the automatic OSG RSV system probe testing. Unfortunately it is not easy to change the directory into which they go but we could easily develop a cron to purge them after a day or so and we will do that. Steve Timm ___________________________________________ Date: Mon, 08 Sep 2008 09:06:25 -0500 (CDT) Solution: These files in /grid/data are now being purged daily by a script. Steve Timm ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 08 27 ============================================================================= ############ # SHUTDOWN # ############ Prepared for PNFS/DCache maintenance Aug 28 kreymer@minos26 echo "crontab -r" | at 05:30 job 18 at 2008-08-28 05:30 mindata@minos26 echo "crontab -r" | at 01:00 job 19 at 2008-08-28 01:00 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 job 16 at 2008-08-28 01:00 ####### # AFS # ####### Spoke to Ray Pasetes ( kreymer, rhatcher ) We can add mounts for d119 and d120 to replace the symlinks anytime we want. 
Testing, fs mkmount art room.kreymer fs rmmount art If a mount fails, it still needs to be rmmount'd Got the volume names for mounting with fs examine $MINOS_DATA/d119 fs examine $MINOS_DATA/d120 Will do cd /afs/fnal.gov/files/code/e875/general rm ups fs mkmount ups nb.minos.d119 rm minossoft fs mkmount minossoft nb.minos.d120 Survey total Minos usage in AFS vos partinfo fsus-minos02 Free space on partition /vicepa: 282137398 K blocks out of total 898040549 Free space on partition /vicepb: 327383281 K blocks out of total 898040549 Free space on partition /vicepc: 328670058 K blocks out of total 898040549 Free space on partition /vicepd: 415337195 K blocks out of total 896348377 Free space on partition /vicepe: 379669157 K blocks out of total 899211060 Free space on partition /vicepf: 325347830 K blocks out of total 841974743 Free space on partition /vicepg: 296747358 K blocks out of total 841974743 Free space on partition /viceph: 380486512 K blocks out of total 841974743 Free space on partition /vicepi: 349533267 K blocks out of total 841974743 Free space on partition /vicepj: 334027396 K blocks out of total 841974743 Free space on partition /vicepk: 397945771 K blocks out of total 841974743 Free space on partition /vicepl: 343095448 K blocks out of total 841974743 Free space on partition /vicepm: 336612230 K blocks out of total 841974743 Free space on partition /vicepn: 362614880 K blocks out of total 841974743 Free space on partition /vicepo: 386638702 K blocks out of total 841974743 MINOS26 > vos partinfo fsus-minos02 | wc -l 15 ########## # PARROT # ########## Date: Wed, 27 Aug 2008 12:39:21 +0100 From: Alexandre Sousa Sorry this is coming a bit late, but the Dogwood validation meeting went all the way to 20:40 local time, so I got home a little too late. So cedar reconstruction uses R1.24.0 for data and R1.24.1 for MC. Therefore, to run a Near detector data job in fnpcsrv1 you would do: source /grid/app/minos/minfarm/Minossoft/setup_minossoft_MINOS_BATCH_GRID_CEDAR.[sh;csh] R1.24.0 This sets up , root v5.12.00 and the mysql environment variables: echo $ENV_TSQL_URL mysql:odbc://fnpcsrv1.fnal.gov:3307/temp;mysql:odbc://fnpcsrv1.fnal.gov:3307/cedar echo $ENV_TSQL_USER reader echo $ENV_TSQL_PSWD that allow you to connect to the far DB. Then, to run the job, do: loon -b -q /home/minfarm/loonexe/reco_near_spill_cedar.C To run a Near Detector MC job, you would do: source /grid/app/minos/minfarm/Minossoft/setup_minossoft_MINOS_BATCH_GRID_CEDAR.[sh ;csh] R1.24.1 loon -b -q /home/minfarm/loonexe/GoodSpillTime.C /home/minfarm/loonexe/reco_MC_daikon_near_cedar.C The cedar scripts are also found in the Production package under: Production/Cedar Hope this helps. Let me know if you run into problems. Cheers, Alex FNPCSRV1 > pwd /home/kreymer/DATA FNPCSRV1 > scp -c blowfish minos26:/local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root . 
FNPCSRV1 > FIN=N00009870_0002.mdaq.root FNPCSRV1 > time loon -b -q /home/minfarm/loonexe/reco_near_spill_cedar.C ${FIN} 2>&1 | tee loon.log real 10m31.341s user 10m21.704s sys 0m7.702s MINOS26 > du -sk /pnfs/minos/reco_near/cedar/cand_data/2006-02/N00009870_0002* 112612 /pnfs/minos/reco_near/cedar/cand_data/2006-02/N00009870_0002.cosmic.cand.cedar.0.root 18170 /pnfs/minos/reco_near/cedar/cand_data/2006-02/N00009870_0002.spill.cand.cedar.0.root MINOS26 > du -sk /pnfs/minos/reco_near/cedar/sntp_data/2006-02/N00009870_0002* 29271 /pnfs/minos/reco_near/cedar/sntp_data/2006-02/N00009870_0002.cosmic.sntp.cedar.0.root 3235 /pnfs/minos/reco_near/cedar/sntp_data/2006-02/N00009870_0002.spill.sntp.cedar.0.root [kreymer@fnpcsrv1 ~/DATA]$ du -sk * 17440 CandS.root 32 loon.log 81696 N00009870_0002.mdaq.root 3104 ntupleStS.root ########## # PARROT # ########## /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link ln -s /etc/mail etcmail ln -s /etc/mail/access access make_growfs: loading existing directory from /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/.growfsdir make_growfs: no directory exists, this might be quite slow... make_growfs: scanning directory tree for changes... /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/HOWTO.parrot /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/access make_growfs: 1 files, 2 links, 0 dirs, 0 checksums computed MIN > cat .growfsdir D root 16877 2048 20706406 0 F HOWTO.parrot 33188 6483 20621540 0 L etcmail 41453 9 20706367 0 /etc/mail L access 41453 16 20706378 0 /etc/mail/access E This fails to follow symlinks $ make_growfs -v -f -k /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link make_growfs: loading existing directory from /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/.growfsdir make_growfs: scanning directory tree for changes... 
/afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/HOWTO.parrot /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/virtusertable.db /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/submit.cf /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/mailertable /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/submit.mc /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/domaintable /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/helpfile /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/Makefile /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/virtusertable /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/access /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/sendmail.cf /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/access.db /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/local-host-names /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/trusted-users /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/domaintable.db /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/sendmail.mc /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/etcmail/mailertable.db /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link/access make_growfs: 18 files, 0 links, 1 dirs, 0 checksums computed MIN > cat .growfsdir D root 16877 2048 20706583 0 F HOWTO.parrot 33188 6483 20621540 0 D etcmail 16877 4096 -11869490 0 F virtusertable.db 33184 12288 -11869522 0 F submit.cf 33060 41313 -48846718 0 F mailertable 33188 0 -48846717 0 F submit.mc 33188 952 -48846718 0 F domaintable 33188 0 -48846717 0 F helpfile 33188 5588 -48846718 0 F Makefile 33188 1035 -48846717 0 F virtusertable 33188 0 -48846718 0 F access 33188 331 -48846717 0 F sendmail.cf 33188 58049 -11869490 0 F access.db 33184 12288 -11869522 0 F local-host-names 33188 64 -48846718 0 F trusted-users 33188 127 -48846718 0 F domaintable.db 33184 12288 -11869522 0 F sendmail.mc 33188 6736 -11869490 0 F mailertable.db 33184 12288 -11869522 0 E F access 33188 331 -48846717 0 E This DOES follow symlinks Check again on symlinks in d120 (minossoft) They are all local ( with or without full path ) /minos/data/... Test web access to /minos/data MIN > ln -s /minos/data/release_data/minossoft/Calibrator/macros/GenerateNdAttenConstants-r1.2.C data Test availablility of release_data via the web MIN > ln -s /afs/fnal.gov/files/data/minos/release_data/minossoft/Calibrator/macros/GenerateNdAttenConstants-r1.2.C releasedata ============================================================================= 2008 08 26 ============================================================================= ########## # PARROT # ########## Try following a simple symlink : Testing under /afs/fnal.gov/files/expwww/numi/html/computing/parrot/link ######### # FNALU # minos-users ######### From CD ops notes 9/3: 8-5 There will be an all-day FNALU downtime for rack consolidation on FCC-1.  This will affect everything - interactive and batch nodes - except for fsui03. Login message NOTE: Downtime Sept. 3, 2008 all day for all fnalu nodes except fsui03. This includes batch nodes. Rack consolidation will be done in FCC1. ########## # CONDOR # ########## 66 jobs; 10 idle, 27 running, 29 held MINOS25 > condor_release gfactory User gfactory's job(s) released. 
MINOS25 > date Tue Aug 26 13:04:08 CDT 2008
N.B. I have lately spotted a few jobs in state 'C', which I think means Completing. This seems unusual.
####### # AFS # #######
Mysql> pwd /afs/fnal.gov/files/code/e875/general
Mysql> rm ups
Mysql> ln -s /afs/fnal.gov/files/data/minos/d119 ups
Mysql> mv minossoft minossoftold
Mysql> ln -1 /afs/fnal.gov/files/data/minos/d120 ups ln: invalid option -- 1 Try `ln --help' for more information.
Mysql> ln -s /afs/fnal.gov/files/data/minos/d120 minossoft
Mysql> date Tue Aug 26 11:16:05 CDT 2008
MINOS26 > minos -bash: /afs/fnal.gov/files/code/e875/general/minossoft/setup/setup_minossoft_FNALU.sh: No such file or directory
Mysql> ln -s /afs/fnal.gov/files/data/minos/d120 minossoft
Mysql> date Tue Aug 26 11:18:31 CDT 2008
Mysql> rm ups/d120
Examined Minos Cluster Ganglia report, no glitch seen around 11:16, remains busy running batch jobs. Some loiacono condor jobs, submitted at 11:10, failed due to this.
####### # SAM # #######
Test in integration export SAM_ORACLE_CONNECT="samdbs/" samadmin add dimension \ --name=EVENT_COUNT \ --table=data_files \ --column=EVENT_COUNT \ --type=number \ --desc='select DATA_FILES EVENT_COUNT'
Test using file N00009521_0024.mdaq.root 'eventCount' : 129L,
NEARDIM=`printf "DATA_TIER raw-near and RUN_NUMBER 9521 and EVENT_COUNT < 1000"`
sam list files --dim="${NEARDIM}" --nosummary | sort
This worked in integration, now did this in production
MINOS26 > sam list files --dim="${NEARDIM}" --nosummary | sort N00009521_0024.mdaq.root
============================================================================= 2008 08 25 =============================================================================
####### # AFS # #######
Date: Mon, 25 Aug 2008 22:11:30 +0000 (GMT) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Cc: minos-admin@fnal.gov Subject: Minos AFS symlink adjustment tomorrow for ups, minossoft
We have copied all of the Minos UPS and Release files into single 50 GB volumes, in order to clean up the tangled web of symlinks which have grown as various smaller 2 to 8 GB volumes have overflowed. ups is already a symlink to an alternate volume, this new copy is just to a larger volume. These copies have been tested with real Analysis jobs, via Parrot. We have scheduled to symlink these new volumes into production use tomorrow morning, as follows : cd /afs/fnal.gov/files/code/e875/general rm ups ln -s /afs/fnal.gov/files/data/minos/d119 ups mv minossoft minossoftold ln -s /afs/fnal.gov/files/data/minos/d120 ups Ideally, shifting the links will be entirely transparent, even for running jobs. We do not plan to remove the original files in the near future. With the exception of the development release, we think that the copies are identical to the originals.
============================================================================= 2008 08 23 Sat =============================================================================
######### # ADMIN # #########
In ganglia, Minos cluster system mode CPU kicked up to 20% out of 40%, by Friday 24:00, starting Friday 22 Aug before noon. Circumstantial evidence suggests this is caused by tinti condor jobs. ( no problem on minos08, where they are not running ) This cleared up by Sunday at noon.
######### # CONDOR # #########
/home/gfactory/glideinsubmit/glidein_t20_glexec entry_gpminos/log condor_activity_20080822_gpminos@t20_glexec@minos@my2.log
000 (177730.005.000) 08/22 14:04:33 Job submitted from host: <131.225.193.25:61451> ...
017 (177730.000.000) 08/23 04:36:36 Job submitted to Globus RM-Contact: fnpcfg1.fnal.gov:2119/jobmanager-condor JM-Contact: https://fnpcfg1.fnal.gov:40032/31893/1219484189/ Can-Restart-JM: 1 ... 027 (177730.000.000) 08/23 04:36:36 Job submitted to grid resource GridResource: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor GridJobId: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor https://fnpcfg1.fnal.gov:40032/31893/1219484189/ factory_info.20080823.log Look at latest err and .out files, less entry_gpgeneral/log/job.177567.6.out ran on fnpc266, looks fine to me less entry_gpgeneral/log/job.177567.6.err errors at the end : MasterLog ======== gzip | uuencode ============= ./condor_startup.sh: line 195: uuencode: command not found gzip: stdout: Broken pipe StartdLog ======== gzip | uuencode ============= ./condor_startup.sh: line 195: uuencode: command not found gzip: stdout: Broken pipe But I see similar things back through August 18. Date: Sat, 23 Aug 2008 10:56:17 -0500 (CDT) Subject: HelpDesk ticket 120532 _____________________________________________________________________ Sometime after Friday 22 August 18:00, the Minos glideinWMS pilots seem to have disappeared from the GPFarm nodes. The minos25 condor system shows 134 gfactory processes running, from 176094.2 gfactory 8/18 15:35 0+19:55:38 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor to 177726.9 gfactory 8/22 13:59 0+19:56:38 gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor The last user job that I see completing normally was 177760.0 kreymer 8/22 18:00 0+00:00:44 C 8/22 18:00 /minos/scratch/ MINOS25 > condor_q gfactory ... 150 jobs; 20 idle, 130 running, 0 held The condor logs look pretty normal for recently running gfactory jobs, /home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/log/condor_activity_20080822_gpminos@t20_glexec@minos@my2.log 000 (177726.000.000) 08/22 13:59:51 Job submitted from host: <131.225.193.25:61451> 017 (177726.000.000) 08/22 13:59:59 Job submitted to Globus RM-Contact: fnpcfg1.fnal.gov:2119/jobmanager-condor JM-Contact: https://fnpcfg1.fnal.gov:40012/3970/1219431596/ Can-Restart-JM: 1 ... 027 (177726.000.000) 08/22 13:59:59 Job submitted to grid resource GridResource: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor GridJobId: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor https://fnpcfg1.fnal.gov:40012/3970/1219431596/ ... 001 (177726.000.000) 08/22 14:02:02 Job executing on host: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor But there is a big gap in activity in the log 000 (177730.005.000) 08/22 14:04:33 Job submitted from host: <131.225.193.25:61451> ... 017 (177730.000.000) 08/23 04:36:36 Job submitted to Globus RM-Contact: fnpcfg1.fnal.gov:2119/jobmanager-condor JM-Contact: https://fnpcfg1.fnal.gov:40032/31893/1219484189/ Can-Restart-JM: 1 Nothing seems to have started or finished since that gap. _____________________________________________________________________ _____________________________________________________________________ _____________________________________________________________________ _____________________________________________________________________ Cannot submit helpdesk ticket, get download window for Helpdesk.pl Cannot run firefox2, complains about missing library. Submitted the above from my laptop, FF 2. ######## # GRID # ######## /grid/data has 4471 files at the top level. On fnpc341, an 'ls' takes about 150 seconds. On minos26, 112 sec -rw-r--r-- 1 fnalgrid fnalgrid 134 May 8 17:08 2008-05-08T22:08:00Z-gridftp-probe-test-file-remote ... 
-rw-r--r-- 1 fnalgrid fnalgrid 134 Aug 23 09:08 2008-08-23T14:08:02Z-gridftp-probe-test-file-remote -bash-3.00$ ls /grid/data | grep -remote | wc -l 4227 Mon Aug 25, MINOS26 > ls /grid/data | grep remote$ | wc -l 4425 ============================================================================= 2008 08 22 ============================================================================= ########## # CONDOR # ########## gfrontend - reset limit to 250, now that pawloski jobs are corrected to not hold database connections open Stopped and restarted after changing limit Found about 20 Globus error 17 and 43 held gfactories, MINOS25 > condor_release gfactory User gfactory's job(s) released. ######## # DATA # ######## pittam suggests lack of cedar_phy files from 2007-04/5/6/7, in farcat F00037835_0000.spill.bntp.cedar_phy.0.root F00037835 F00037838_0000.spill.bntp.cedar_phy.0.root F00037838 F00037841_0000.spill.bntp.cedar_phy.0.root F00037841 F00037868_0000.spill.bntp.cedar_phy.0.root F00037868 F00037871_0000.spill.bntp.cedar_phy.0.root F00037871 F00037947_0000.spill.bntp.cedar_phy.0.root F00037947 F00037956_0000.spill.bntp.cedar_phy.0.root F00037956 F00037974_0000.spill.bntp.cedar_phy.0.root F00037974 F00037977_0000.spill.bntp.cedar_phy.0.root F00037977 F00037989_0000.spill.bntp.cedar_phy.0.root F00037986 F00037989_0017.spill.bntp.cedar_phy.0.root F00037989 F00037989_0018.spill.bntp.cedar_phy.0.root F00037993 F00037989_0021.spill.bntp.cedar_phy.0.root F00037996 F00037993_0000.spill.bntp.cedar_phy.0.root F00038221 F00037993_0007.spill.bntp.cedar_phy.0.root F00038266 F00037996_0000.spill.bntp.cedar_phy.0.root F00038283 F00038221_0000.spill.bntp.cedar_phy.0.root F00038307 F00038266_0000.spill.bntp.cedar_phy.0.root F00038559 F00038283_0000.spill.bntp.cedar_phy.0.root F00038283_0006.spill.bntp.cedar_phy.0.root F00038304_0016.spill.bntp.cedar_phy.0.root F00038307_0000.spill.bntp.cedar_phy.0.root F00038559_0000.spill.bntp.cedar_phy.0.root ######## # DESK # ######## Upgrading kreymer desktop to SLF 5, limited availability this AM ============================================================================= 2008 08 21 ============================================================================= ############ # BLUWATCH # ############ Restarted bluwatch on minos01/25/26, stopped since the desktop reboot. This time being sure to set nohup before ./bluwatch &, and verifying updates after logging out. ######## # DESK # ######## kreymer recovered from /home filesystem error on desktop, required boot with recovery disk to fsck. the system continues to hang up... this is distracting ######## # GRID # ######## Date: Thu, 21 Aug 2008 12:16:15 -0500 (CDT) Subject: HelpDesk ticket 120454 ___________________________________________ Short Description: fnpc339 lacks mount of /grid/app Problem Description: We had a set of user jobs go into a block hole recently on node fnpc339. They all failed to access /grid/app . Indeed, mounts seem to be broken there for /grid/app /grid/data /home/kreymer As of 12:13, this problem is still present. ___________________________________________ Date: Thu, 21 Aug 2008 12:31:50 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: I saw the problem of the black hole jobs last night and disabled condor on that node. I was waiting for all the existing jobs to finish which they now have done. I will notify FEF that the node needs a reboot. 
Steve Timm ___________________________________________ Note To Requester: timm@fnal.gov sent this Notes To Requester: I have now sent a ticket to FEF asking for a reboot. This is the second of the AFS machines that has crashed this way in less than a week. Steve Timm ___________________________________________ Date: Fri, 22 Aug 2008 14:07:21 -0500 (CDT) Note To Requester: FEF rebooted the node this morning and said it was OK but it is not. I still can't login. Have asked them to look again. ___________________________________________ The grid mount points are back on fnpc339, and jobs are running there. But /afs is missing, and jobs are again failing. In fact, the openafs is not installed : -bash-3.00$ rpm -q openafs package openafs is not installed Rather than having afs reinstalled on this one node, please just remove the ISMINOSAFS flag for this node. You may also need to gracefully terminate the glidein pilots which are presently running there. ___________________________________________ Note To Requester: timm@fnal.gov sent this Notes To Requester: The ISMINOSAFS tag has been removed temporarily from node fnpc339. We will have them put the AFS back on on Monday. Sorry for the inconvenience. Steve Timm ___________________________________________ Date: Mon, 25 Aug 2008 16:55:55 -0500 (CDT) Solution: FEF added AFS back to the node. I changed the condor config back to make ISMINOSAFS true It is good to go. Steve Timm ___________________________________________ ============================================================================= 2008 08 20 ============================================================================= ######## # FARM # ######## Per petyt mail, 8 Jan, Run I thru IIb is 31720 - 38449 Run IIb (2007-04 - 2007-07) ######### # MYSQL # ######### Oops, finished up gzip/local phases of this month's archives ######### # MYSQL # ######### HOWTO.mysqladmin - details installation and operation of mysql, initially on minos-sam01 This will document two major modes 1) primary warehouse, for disaster recovery for upgrades and tests 2) replica Observe this on minos-mysql1 , from ps xfwww /local/ups/prd/mysql/v4_1_11/Linux-2/libexec/mysqld --basedir=/local/ups/prd/mysql/v4_1_11/Linux-2 --datadir=/data/database --pid-file=/data/database/minos-mysql1.fnal.gov.pid --skip-locking --port=3306 --socket=/data/database/mysql.sock 'start') setup mysql user=`/usr/bin/whoami` if [ "${user}" = "minsoft" ]; then $MYSQL_DIR/bin/mysql.server start else ulimit -n 4096 su minsoft -c "setup mysql; $MYSQL_DIR/bin/mysql.server start" fi ;; 'stop') setup mysql $MYSQL_DIR/bin/mysql.server stop ;; On minos-mysql1, /etc/my.cnf -> /data/database/my.cnf According to mysql.server comments, the default location should be basedir, which is $MYSQL_DIR, rather than datadir I don't like this much, as UPS products should not be hacked. So will continue to use the deprecated datadir/my.cnf, avoiding /etc/my.cnf to minimize root activity. cp database/my.cnf.minos-mysql1 database/my.cnf.20080820 As a last resort, reading the README file at $MYSQL_DIR This is very confusing document, too many forward references and unexplained options. many irrelevant comments about group ownership, for cases where several accounts share the mysql server recommends against default port 3306 ( we use this ) References to 'starting mysql client' are confusing to me. 
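A sketch, assuming the ups mysql product is set up in the shell (not something run above), of how to cross-check which option files mysqld will actually read, and which basedir/datadir the running server ended up with :
setup mysql
# mysqld prints the option-file search order near the top of its help output
${MYSQL_DIR}/bin/mysqld --verbose --help 2>/dev/null | grep -A 1 'Default options'
# once the server is up, confirm what it actually chose
${MYSQL_DIR}/bin/mysql -u root -p -e "SHOW VARIABLES LIKE 'basedir'; SHOW VARIABLES LIKE 'datadir';"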
____________________________________________________________________________ Let's try the tailor process : ____________________________________________________________________________ SOFT03 > ups tailor mysql Enter valid path for mysql data directory: /home/minsoft/database Never use default port number 3306 for any mysql server instances! Assign your port number here:3306 You can update mysql server options in my.cnf file before you start mysql server. Please assign a new username for your mysql daemon. For security it is recommended to substitute this name for mysql root in a mysql database. See README file in your mysql datadir for more details. Do not forget to set a strong password for root user IMMEDIATELY after initial startup of mysql daemon! Then replace root username with the newly assigned username. Enter your new username here:root There are small,medium,large or huge cnf files in /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/share/mysql directory. Which one you would like to use (s/m/l/h)? h Installing MySQL system tables... 080820 13:53:51 [Warning] One can only use the --user switch if running as root OK Filling help tables... 080820 13:53:51 [Warning] One can only use the --user switch if running as root OK To start mysqld at boot time you have to copy support-files/mysql.server to the right place for your system PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER ! To do so, start the server, then issue the following commands: /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysqladmin -u root password 'new-password' /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysqladmin -u root -h minos-sam03.fnal.gov password 'new-password' Alternatively you can run: /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysql_secure_installation which will also give you the option of removing the test databases and anonymous user created by default. This is strongly recommended for production servers. See the manual for more instructions. You can start the MySQL daemon with: cd /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6 ; /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysqld_safe & You can test the MySQL daemon with mysql-test-run.pl cd mysql-test ; perl mysql-test-run.pl Please report any problems with the /home/minsoft/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysqlbug script! The latest information about MySQL is available on the web at http://www.mysql.com Support MySQL by buying support/licenses at http://shop.mysql.com chgrp: invalid group name `products' You can on/off skip-innodb option in my.cnf file before starting mysqld. Disable InnoDB tables now (y/n)? n Mysql server successfuly configured. chgrp: invalid group name `mysql' !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Cannot change group name to mysql in your data directory! Group ownership for files in your data area should belong to mysql group. Reserved gid is 9531. Please, add an entry for a mysql group to your systems /etc/group file or NIS group map.If you are setting up multiple databases to be managed by different UNIX acccounts, add these account names to the group entry. !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
There are following ups function you can use: To start/stop mysql server : ups start/stop mysql To set root password : ups rootpass mysql To start mysql client : ups client mysql To see run-time server variables: ups variables mysql To see short status message : ups status mysql To ping mysql server : ups ping mysql ____________________________________________________________________________ Set the initial password with SOFT03 > ups rootpass mysql Setup:mysql datadir = /home/minsoft/database Setup:port=3306; socket=/home/minsoft/database/mysql.sock Enter password for root user: Setup root password for root@localhost is O.K. You also need to set this password for root@minos-sam03.fnal.gov when you start mysql client. You can do it using following command in mysql: mysql> SET PASSWORD FOR root@minos-sam03.fnal.gov=PASSWORD('new_password'); See user table in mysql database. Changed password back to the existing one, for testing mysqladmin -u root -p password themostsecrecpasswordever ____________________________________________________________________________ Loading data tables SOFT03 > du -sm /minos/data/mysql/archive/20080804/offline 53095 /minos/data/mysql/archive/20080804/offline SOFT03 > time cp -a /minos/data/mysql/archive/20080804/offline/*.frm \ > ${HOME}/database/mysql/ real 0m6.293s user 0m0.009s sys 0m0.067s SOFT03 > time cp -a /minos/data/mysql/archive/20080804/offline/*.MYD \ > ${HOME}/database/mysql/ This runs at 6 MB/sec. Interrupted around 15 GB into the copy. Removed partial copies, SOFT03 > export LANG="C"; SOFT03 > rm database/mysql/[A-Z]* SOFT03 > time scp -c blowfish minsoft@minos-mysql1:/data/archive/COPY/20080702/offline/* database/mysql/ real 21m42.975s user 5m6.842s sys 3m51.649s rates were typically 25 to 30 MB/sec for larger files SOFT03 > rm database/mysql/*.log SOFT03 > rm database/mysql/CALADCTOPESVLD.dump SOFT03 > rm database/mysql/CALADCTOPESVLD.dump2 What about db.opt ? -rw-r----- 1 minsoft e875 65 Aug 20 15:40 db.opt SOFT03 > cat database/mysql/db.opt default-character-set=latin1 default-collation=latin1_swedish_ci SOFT03 > du -sm database/mysql 20265 database/mysql SOFT03 > time gunzip database/mysql/*.gz real 33m53.119s user 12m15.257s sys 3m19.913s Shifted offline tables to the offline database SOFT03 > mv database/mysql/[A-Z]* database/offline/ SOFT03 > mv database/mysql/db.opt database/offline/db.opt Shifted files to /home/minsoft/recover directory Started database mysql> use offline mysql> restore table BEAMMONCUTS from '/home/minsoft/restore' ; +-------------+---------+----------+----------------------------------------+ | Table | Op | Msg_type | Msg_text | +-------------+---------+----------+----------------------------------------+ | BEAMMONCUTS | restore | error | Failed generating table from .frm file | +-------------+---------+----------+----------------------------------------+ 1 row in set, 1 warning (0.00 sec) ============================================================================= 2008 08 19 ============================================================================= ######### # MYSQL # ######### minsoft@minos-sam03 SOFT03 > mkdir -p ups/db/foo SOFT03 > mkdir -p ups/db/.upsfiles SOFT03 > mkdir -p ups/db/.updfiles SOFT03 > AFSP=/afs/fnal.gov/files/code/e875/general/ups SOFT03 > cp ${AFSP}/db/.upsfiles/dbconfig ups/db/.upsfiles/dbconfig SOFT03 > nedit ups/db/.upsfiles/dbconfig changed path to /home/minsoft/ups/... SOFT03 > cp ${AFSP}/db/.updfiles/updconfig ups/db/.updfiles/updconfig . 
/usr/local/etc/setups.sh setup upd export PRODUCTS=/${HOME}/ups/db upd install -j mysql v5_0_51 informational: installed mysql v5_0_51. upd install succeeded. SOFT03 > ups list -aK+ "mysql" "v5_0_51" "Linux+2.6" "" "" SOFT03 > ups declare -c mysql v5_0_51 DECLARE: A UPS start/stop exists for this product Setup seems to use configs from ${PRODUCTS}/mysql/config/${MACHID}.${UPS_OPTIONS} Mysql> cat /local/ups/db/mysql/config/minos-mysql1.fnal.gov. /data/database ########### # GNUPLOT # ########### Date: Tue, 19 Aug 2008 16:13:53 -0500 (CDT) ___________________________________________ Ticket #: 120371 ___________________________________________ Short Description: Install gnuplot on Minos cluster and servers Problem Description: run2-sys : Please install gnuplot on the Minos Cluster and the minos servers, including minos26, minos-sam01/2/3 and minos-mysql1 . This should be a standard part of Minos installations. This is not at all urgent, please do it at your next convenience. ___________________________________________ Date: Tue, 19 Aug 2008 16:22:53 -0500 (CDT) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 20 Aug 2008 09:44:08 -0500 (CDT) Solution: The gnuplot rpm has been installed on the cluster and server machines. ___________________________________________ This works --- for batch, needed to add a - for stdin echo " ... " | gnuplot -persist - ___________________________________________ ___________________________________________ ######### # ADMIN # ######### Date: Tue, 19 Aug 2008 15:36:18 -0500 (CDT) Subject: HelpDesk ticket 120369 ___________________________________________ Short Description: Please add rbpatter to the e875 group on the Minos Cluster Problem Description: run2-sys : Please add rbpatter to the e875 group on the Minos Cluster Thanks ! _________________________________________________________________ Date: Tue, 19 Aug 2008 15:41:10 -0500 (CDT) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. __________________________________________ Date: Wed, 20 Aug 2008 08:59:11 -0500 (CDT) Solution: rbpatter has been added to e875 group. ######### # FNALU # ######### Date: Tue, 19 Aug 2008 11:46:27 -0500 (CDT) Subject: HelpDesk ticket 120350 ___________________________________________ Short Description: FNALU status needs update on CD systems page Problem Description: The FNALU status is shown as Down with a scheduled outage, at http://computing.fnal.gov/cdsystemstatus/system/FNALU.html due to a problem with fsui03 back in June. But the system is up. Please update the status. ___________________________________________ Date: Tue, 19 Aug 2008 12:50:31 -0500 (CDT) This ticket has been reassigned to GREANEY, MARGARET of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Mon, 25 Aug 2008 13:40:12 -0500 (CDT) Note To Requester: mgreaney@fnal.gov sent this Notes To Requester: I will ask the CSI group to update the status page for FNALU as I see it has a red ball on the system status page. When you click on this page, it does not show any message. As for the remainder, none of the nodes are marked down in NGOP. I think there is still a problem with ngop updating the system status pages. I make entries for down or problem nodes in ngop and these are updated to the system status page, but it is not something that I do, but is supposed to get done automatically. 
___________________________________________ Solution: There is some way other than ngop updates which changes the system status page, and the helpdesk has access to this. ___________________________________________ ####### # AFS # ####### Mail to nwest and minos-data : I recently looked at /var/log/messages on minos-mysql1, and saw a surprising number of AFS timeouts, highly correleated with nwest ssh logins from minos-db.minos-soudan.org. These are very similar to the short timeouts that have been plaguing the Minos Cluster, typically once per month per node, not correleated with any other activity as far as I can tell. I would love to understand what activity your connections are performing which might trigger these timeouts. Maybe we can reproduce and try to eliminate the problem. Here is a sample from this morning : ... a subset of the listing below ... ####### # AFS # ####### Timeouts typically look like this on minos-mysql1 Aug 19 02:15:12 minos-mysql1 sshd(pam_unix)[10442]: session opened for user nwest by (uid=0) Aug 19 02:15:15 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:15:15 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:18:24 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:18:24 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:37:29 minos-mysql1 sshd(pam_unix)[11895]: session opened for user nwest by (uid=0) Aug 19 02:37:32 minos-mysql1 kernel: afs: Tokens for user of AFS id 4777 for cell fnal.gov are discarded (rxkad error=19270407) Aug 19 02:37:37 minos-mysql1 sshd(pam_unix)[12031]: session opened for user nwest by (uid=0) Aug 19 02:38:54 minos-mysql1 sshd(pam_unix)[12169]: session opened for user nwest by (uid=0) Aug 19 02:38:57 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:38:57 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:39:01 minos-mysql1 sshd(pam_unix)[12305]: session opened for user nwest by (uid=0) Aug 19 02:40:10 minos-mysql1 sshd(pam_unix)[12506]: session opened for user nwest by (uid=0) Aug 19 02:40:12 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:40:12 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:40:13 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:40:13 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:40:14 minos-mysql1 kernel: afs: Tokens for user of AFS id 4777 for cell fnal.gov are discarded (rxkad error=19270407) Aug 19 02:40:17 minos-mysql1 sshd(pam_unix)[12642]: session opened for user nwest by (uid=0) Aug 19 02:40:30 minos-mysql1 sshd(pam_unix)[12778]: session opened for user nwest by (uid=0) Aug 19 02:40:33 minos-mysql1 kernel: 
afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:40:33 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 19 02:40:37 minos-mysql1 sshd(pam_unix)[12914]: session opened for user nwest by (uid=0) Aug 19 02:40:46 minos-mysql1 sshd(pam_unix)[13050]: session opened for user nwest by (uid=0) Aug 19 02:40:50 minos-mysql1 sshd(pam_unix)[13186]: session opened for user nwest by (uid=0) Aug 19 02:43:20 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:43:20 minos-mysql1 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:43:20 minos-mysql1 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 19 02:43:20 minos-mysql1 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) In /var/log/secure, find Aug 19 02:15:11 minos-mysql1 sshd[10441]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:15:11 minos-mysql1 sshd[10441]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56335 ssh2 Aug 19 02:37:29 minos-mysql1 sshd[11894]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:37:29 minos-mysql1 sshd[11894]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56340 ssh2 Aug 19 02:37:36 minos-mysql1 sshd[12030]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:37:36 minos-mysql1 sshd[12030]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56341 ssh2 Aug 19 02:38:53 minos-mysql1 sshd[12168]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:38:53 minos-mysql1 sshd[12168]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56342 ssh2 Aug 19 02:39:00 minos-mysql1 sshd[12304]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:39:00 minos-mysql1 sshd[12304]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56343 ssh2 Aug 19 02:40:10 minos-mysql1 sshd[12505]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:10 minos-mysql1 sshd[12505]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56344 ssh2 Aug 19 02:40:16 minos-mysql1 sshd[12641]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:16 minos-mysql1 sshd[12641]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56345 ssh2 Aug 19 02:40:30 minos-mysql1 sshd[12777]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:30 minos-mysql1 sshd[12777]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56346 ssh2 Aug 19 02:40:36 minos-mysql1 sshd[12913]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:36 minos-mysql1 sshd[12913]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56347 ssh2 Aug 19 02:40:46 
minos-mysql1 sshd[13049]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:46 minos-mysql1 sshd[13049]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56348 ssh2 Aug 19 02:40:49 minos-mysql1 sshd[13185]: Authorized to nwest, krb5 principal nwest/cron/minos-db.minos-soudan.org@FNAL.GOV (krb5_kuserok) Aug 19 02:40:49 minos-mysql1 sshd[13185]: Accepted external-keyx for nwest from ::ffff:198.124.213.10 port 56349 ssh2 Aug 19 03:31:22 minos-mysql1 sshd[15994]: Authorized to nwest, krb5 principal nwest@FNAL.GOV (krb5_kuserok) Aug 19 03:31:22 minos-mysql1 sshd[15994]: Accepted external-keyx for nwest from ::ffff:163.1.136.71 port 48518 ssh2 Mysql> host 198.124.213.10 10.213.124.198.in-addr.arpa domain name pointer minos-db.minos-soudan.org. Mysql> host 163.1.136.71 71.136.1.163.in-addr.arpa domain name pointer pplxint1.physics.ox.ac.uk. # grep nwest /etc/passwd nwest:x:4777:5111:Nick_West:/afs/fnal.gov/files/home/room3/nwest:/usr/local/bin/tcsh ######## # FARM # -> 2008 ######## Date: Tue, 19 Aug 2008 09:15:35 -0500 From: Phyllis Rubin Reply-To: phyllis.rubin@comcast.net To: kreymer@fnal.gov, asousa@fnal.gov Subject: Howie Rubin Howie won't be at today's phone meeting. He is in the hospital after having a heart attack on Saturday night. He is doing well now. ... ============================================================================= 2008 08 18 ============================================================================= ####### # AFS # ####### Summary of AFS ups/minossoft copy issues Questions are prefixed with > We will ask the CSI group to rearrange the AFS mounts : I have sniffed out volumes using fs examine MD=/afs/fnal.gov/files/data/minos MG=/afs/fnal.gov/files/code/e875/general MG/minossoft is not presently its own volume, it sits under MG. AFS volume present mount future mount vid = 1685748770 named c.e875.d1 MG/ups MG/oldups (vid = 1685404769 named code.e875.general) MG/minossoft MG/oldminossoft vid = 1685735879 named nb.minos.d119 MD/d119 MG/ups vid = 1685735882 named nb.minos.d120 MD/d120 MG/minossoft > What happens to old MG/minossoft files when we mount MD/d120 there ? > How long do the remounts take, and do they cause failures of running jobs ? UPS Several files were duplicated in removing symlinks, see /minos/scratch/minsoft/afssoft/slink/prod2 Mainly, config_build_root.sh config_build_root_minimal.sh and broken links to mengel There are more broken links to non-AFS space > Should we set up local symlinks for config_build_root*.sh ? MINOSSOFT /minos/scratch/minsoft/afssoft/slink/soft1 Beyond the expected bin/lib/tmp copies, there were symlinks to packages/DatabaseTables/HEAD 53 MB packages/WebDocs/HEAD/doxygen/loon 415 MB The doxygen files copied extremely slowly. > Should these be left in-line, cleaning up the originals later ? /afs/fnal.gov/files/data/minos/d04/libraries/DatabaseTables/HEAD /afs/fnal.gov/files/code/e875/releases1/doxygen/loon ######### # ADMIN # ######### Date: Mon, 18 Aug 2008 14:17:57 -0500 (CDT) Subject: HelpDesk ticket 120307 ___________________________________________ Short Description: minos21 sshd not accepting logins Problem Description: run2-sys : minos21 does not accept ssh logins. 
rsh does allow connections $ ssh minos21 ssh_exchange_identification: Connection closed by remote host The latest ssh login seems to be Aug 16 05:27:10 minos21 sshd(pam_unix)[28947]: session opened for user rhatcher by (uid=0) ___________________________________________ Date: Mon, 18 Aug 2008 14:27:05 -0500 (CDT) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 18 Aug 2008 14:45:17 -0500 (CDT) Solution: I restarted the ssh daemon. ___________________________________________ ___________________________________________ ######## # FARM # ######## SIESTA=`date -u +'%Y-%m-%d %H:%M:%S' -d 'now - 1 day'` FARDIM=`printf "DATA_TIER raw-far and RUN_TYPE physics%s and EVENT_COUNT >=16000 and END_TIME >= to_date(\'${SIESTA}\',\'yyyy-mm-dd hh24:mi:ss\')" %` sam list files --dim="${FARDIM}" ~kreymer/minos/scripts/samlocate "${FARDIM}" This does not work, no such dimension. Several metadata items seem not to be dimensions, therefore not selectable. Found the column name via db browser, EVENT_COUNT=16083 apparently in the FILE_DATA table as in F00041835_0003.mdaq.root Test in development : FARDIM=`printf "DATA_TIER raw-far and RUN_NUMBER 28812"` MINOS26 > sam get metadata --file=F00028812_0000.mdaq.root ... 'eventCount' : 7L, export SAM_ORACLE_CONNECT="samdbs/" samadmin add dimension \ --name=EVENT_COUNT \ --table=file_data \ --column=EVENT_COUNT \ --type=number \ --desc='select eventCount, DATA_FILES EVENT_COUNT' New dimensionName 'EVENT_COUNT' added. MINOS26 > sam get dimension info --category=datafile EVENT_COUNT (category: datafile) select eventcount data_files event_count FARDIM=`printf "DATA_TIER raw-far and RUN_NUMBER 28812 and EVENT_COUNT 7"` MINOS26 > sam list files --dim="${FARDIM}" table or view does not exist SQL> select file_name,event_count from data_files where file_name = 'F00028812_0000.mdaq.root' ; FILE_NAME -------------------------------------------------------------------------------- EVENT_COUNT ----------- F00028812_0000.mdaq.root 7 Try again with data_files table, in development samadmin add dimension \ --name=EVENT_COUNTS \ --table=data_files \ --column=EVENT_COUNT \ --type=number \ --desc='select eventCount based on DATA_FILES EVENT_COUNT' New dimensionName 'EVENT_COUNTS' added. FARDIM=`printf "DATA_TIER raw-far and RUN_NUMBER 28812 and EVENT_COUNTS 7"` That's fine. ####### # SAM # ####### ########## # CONDOR # ########## Released some older held pilots 11:28 condor_release gfactory They all went to Idle state, OK ########## # CONDOR # ########## [gfrontend@minos25 etc]$ stat vofrontend.cfg Access: 2008-07-31 14:18:24.000000000 -0500 [gfrontend@minos25 etc]$ ps -flu gfrontend F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 4 S 42918 13839 13838 0 75 0 - 1536 wait Jul31 pts/2 00:00:00 -bash 0 S 42918 15585 1 1 76 0 - 28850 - Jul14 ? 13:54:57 python glideinFrontend.py 90 4 /home/gfrontend/myvo 0 R 42918 23272 13839 0 77 0 - 1008 - 09:23 pts/2 00:00:00 ps -flu gfrontend That July 31 time was me looking at the .cfg file. 
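A sketch, not from the log, of dumping the HoldReason strings for the held gfactory pilots before a blanket condor_release, to confirm they are the usual Globus error 17 / 43 holds :
# one line per held pilot : cluster.proc and the hold reason
condor_q gfactory -hold \
  -format '%d.' ClusterId -format '%-3d ' ProcId -format '%s\n' HoldReason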
Check the log, /home/gfrontend/myvofrontend2/log/frontend_info.20080714.log [2008-07-14T15:32:46-05:00 15585] Starting up Kill and restart, soon after sleep [2008-08-18T09:29:06-05:00 15585] Sleep kill 15585 [2008-08-18T09:30:39-05:00 15585] Sleep Reverted to kill -9 15585 [gfrontend@minos25 ~]$ ./start_frontend.sh 2008-08-18T09:31:30-05:00 23634] Starting up [2008-08-18T09:31:30-05:00 23634] Iteration at Mon Aug 18 09:31:30 2008 [2008-08-18T09:31:33-05:00 23634] Match [2008-08-18T09:31:33-05:00 23634] Total running 255 limit 125 MINOS25 > condor_q pawloski -run | grep pawl | wc -l 251 MINOS25 > date ; condor_q pawloski -run | grep pawl | wc -l Mon Aug 18 09:44:32 CDT 2008 242 Mon Aug 18 09:44:45 CDT 2008 249 Could not connect to CRL, finally got a connection at MINOS25 > date ; condor_q pawloski -run | grep pawl | wc -l Mon Aug 18 09:48:28 CDT 2008 238 10:50 Greg killed the presently idle jobs, to avoid new starts 10:51 Grek killed all the jobs Mysql> date ; mysqladmin processlist -u root | grep fnal | wc -l Mon Aug 18 10:51:23 CDT 2008 16 Adjusted limit down to 100, for safety [2008-08-18T11:23:56-05:00 645] Starting up [2008-08-18T11:23:56-05:00 645] Iteration at Mon Aug 18 11:23:56 2008 [2008-08-18T11:23:56-05:00 645] Match [2008-08-18T11:23:56-05:00 645] Total running 0 limit 100 ######## # SOFT # ######## 14:04 UTC Cleaned up protections in minos/scripts, MINOS26 > chmod g+x * MINOS26 > chmod o+x * ============================================================================= 2008 08 15 ============================================================================= ######## # FARM # ######## REQUEST FOR FILE INPUT LISTS FROM SAM Date: Thu, 14 Aug 2008 21:35:18 -0500 From: Howard Rubin To: Arthur Kreymer What I need is a file with the subruns from the 'UDT previous day' delivered before 23:19 local time to directory /minos/data/minfarm/lists. The content of the file is subrun month where subrun is obvious and month is the month subdirectory in which the mdaq is to be found. Life would be easiest for me if the file were named fardet.month (neardet.month); example: fardet.2008-08 Howie __________________________________________________________________________ SIESTA=`date -u +'%Y-%m-%d %H:%M:%S' -d 'now - 1 day'` FARDIM=`printf "DATA_TIER raw-far and END_TIME >= to_date(\'${SIESTA}\',\'yyyy-mm-dd hh24:mi:ss\')"` sam list files --dim="${FARDIM}" ~kreymer/minos/scripts/samlocate "${FARDIM}" ######## # DATA # ######## Date: Fri, 15 Aug 2008 15:18:39 -0500 (CDT) From: Kregg E Arms To: Arthur Kreymer Cc: Minos Data , Marta Tavera Subject: New pnfs directories for MC Hi Art, It appears we will soon need directories in pnfs for the following new MC samples: near/daikon_04/M100200N_helium near/daikon_04/M100200R_helium far/daikon_04/M100200N_helium far/daikon_04/M100200R_helium _______________________________________________________________ cd ~kreymer/minos/scripts ./pnfsdirs near cedar_phy_bhcurv daikon_04 M100200N_helium write ./pnfsdirs near cedar_phy_bhcurv daikon_04 M100200R_helium write ./pnfsdirs far cedar_phy_bhcurv daikon_04 M100200N_helium write ./pnfsdirs far cedar_phy_bhcurv daikon_04 M100200R_helium write MINOS26 > date Fri Aug 15 16:38:08 CDT 2008 ######## # FARM # ######## farcat 3674 25987 spill.bmnt.cedar_phy_bhcurv.0.root 3674 27225 spill.bntp.cedar_phy_bhcurv.0.root 3674 17099 spill.mrnt.cedar_phy_bhcurv.0.root Picked up the bntp's first, see below Following the plan of 2008 06 18 to clear bmnt files out of farcat area, simplified because no mrnt have gone to PNFS/SAM. 
Everything can be done as minfarm@fnpcsrv1 ----------------------------------------------------------- BMNT LIST BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` MFILES=`ls /minos/data/minfarm/farcat | grep mrnt | sort` printf "${BFILES}\n" | wc -w 3674 printf "${MFILES}\n" | wc -w 3674 ----------------------------------------------------------- MOVE MRNT OUT OF THE WAY 13:18 mkdir -p /minos/data/minfarm/FMRNT cd /minos/data/minfarm/farcat for MFILE in ${MFILES} ; do mv ${MFILE} /minos/data/minfarm/FMRNT/${MFILE} done ----------------------------------------------------------- RENAME BMNT TO MRNT cd /minos/data/minfarm/farcat check for conflicts for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` [ -r ${MFILE} ] && ls -l ${MFILE} done 13:24 for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` mv ${BFILE} ${MFILE} done ----------------------------------------------------------- DONE, NOW ROUNDUP ! 3674 25987 spill.mrnt.cedar_phy_bhcurv.0.root Grab all mrnt files at once ./roundup -b 4000 -r cedar_phy_bhcurv far Fri Aug 15 16:33:34 CDT 2008 ######## # FARM # ######## bad_runs.cedar_phy_bhcurv is updated for stray subruns, SRV1> ./roundup -s sntp -r cedar_phy_bhcurv far Wow, for the first time, see size discrepancy, consistently, OK adding F00040403_0000.all.sntp.cedar_phy_bhcurv.0.root 3 NSFIL SSIZ MSIZ DSIZ 3 1482704657 1479415327 1644665 OOPS, concatenated file size discrepancy, 1644665 gt 1500000 OOPS, concatenated file size discrepancy, 1643462 gt 1500000 Subrun 0 is 1.47 GB by itself, this is a wonky run. SRV1> dds /minos/data/minfarm/farcat/*sntp.cedar_phy_bhcurv* -rw-rw-r-- 1 minospro numi 1472851108 Aug 10 07:25 /minos/data/minfarm/farcat/F00040403_0000.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 4837673 Aug 8 19:55 /minos/data/minfarm/farcat/F00040403_0001.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 5015876 Aug 8 19:56 /minos/data/minfarm/farcat/F00040403_0002.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 512174805 Aug 9 07:57 /minos/data/minfarm/farcat/F00040403_0004.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 5096558 Aug 8 19:57 /minos/data/minfarm/farcat/F00040403_0005.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 482758689 Aug 9 07:09 /minos/data/minfarm/farcat/F00040403_0007.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 5350507 Aug 8 19:58 /minos/data/minfarm/farcat/F00040403_0008.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 498439762 Aug 9 07:34 /minos/data/minfarm/farcat/F00040403_0010.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 5224760 Aug 8 19:58 /minos/data/minfarm/farcat/F00040403_0011.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 455139267 Aug 9 06:31 /minos/data/minfarm/farcat/F00040403_0013.all.sntp.cedar_phy_bhcurv.0.root This left a stray file in WRITE -rw-r--r-- 1 minfarm numi 1479417732 Aug 15 09:31 Merged.17767.root SRV1> rm /minos/data/minfarm/WRITE/Merged.17767.root Hacked DLIM from 1500000 to 2000000, reran. All the files show large per-subrun DSIZ values about 1.6 MB/subrun Also ran the .bntp pass : ./roundup -s bntp -r cedar_phy_bhcurv far Iterated, with larger bail limit ( default was 1000 ) ./roundup -s bntp -b 3000 -r cedar_phy_bhcurv far And once more the purge WRITE ./roundup -s bntp -b 4000 -r cedar_phy_bhcurv far Hacked DLIM back. ########## # CONDOR # ########## Date: Fri, 15 Aug 2008 09:02:54 -0500 (CDT) Subject: HelpDesk ticket 119292 has additional info. 
Note To Requester: We captured extra debug output from one of the minos glidein jobs yesterday that held with error 17 and have sent it to the Condor team and the OSG Troubleshooting team. Steve Timm ####### # SAM # ####### Date: Thu, 14 Aug 2008 23:16:57 +0100 From: Nicholas Devenish I am trying to learn how to use sam, and actually registered my username on the system in april; but when I try to create a definition it gives me the message: "Person 'nickd' is not registered in group 'minos'" When I look at the registration at http://www-numi.fnal.gov/cgi-bin/autoRegister.py it looks like I am supposed to be in the group, so I don't know what it is complaining about. ______________________________________________________________________ Date: Tue, 26 Aug 2008 15:39:24 +0000 (GMT) From: Arthur Kreymer Sorry to be slow in responding . Are you still having problems ? It appears that nickd was added to the minos group just after your creation attempts on August 14. Perhaps the dbserver had some old information cached at that time. I successfully created a definition under Person nickd today. Tested by using export SAM_USER_NAME=nickd ============================================================================= 2008 08 14 ============================================================================= ####### # AFS # ####### Working on duplication of e875/sim for parrot, use volume d117. 10 broken links to /utarchive/para/minos/events/old 18 broken links to /afs/fnal.gov/files/data/minos/d12/root_files 5 broken links to /afs/fnal.gov/files/data/minos/d7/hitbits 4 broken links to /afs/fnal.gov/files/data/minos/d1/nuflux/newfiles Size of files in the links that exist : 1 /afs/fnal.gov/files/code/e875/general/minossoft/releases/development/BField/bfld_imap.C 624 /afs/fnal.gov/files/data/minos/d17/gnumi_flux 77 /afs/fnal.gov/files/data/minos/d82/rhatcher/daikon_02.tar.gz 1 /afs/fnal.gov/files/data/minos/d87/gnumi/v19 1 /afs/fnal.gov/files/data/minos/d87/gnumi/v19 1 /afs/fnal.gov/files/data/minos/d87/gnumi/v19 1 /afs/fnal.gov/files/data/minos/d87/gnumi/v19 gnumi_flux has additional links, to d87, largest is v18 at 26 GB. the original d18 is small, links to large d87 Mysql> fs listquota $MD/d17 Volume Name Quota Used %Used Partition nb.minos.d19 8000000 2334722 29% 60% Mysql> fs listquota $MD/d87 Volume Name Quota Used %Used Partition nb.minos.d87 50000000 35085185 70% 58% For present, bailing on this, leave sim as it is in /afs/fnal.gov/files/code/e875/sim Added to mountfile.d119d120.grow : This gets rid of the old labyrinth complaints : -bash-3.00$ /grid/app/minos/parrot/paloon SETTING UP UPS SETTING UP MINOS No default SAM configuration exists at this time. 
MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 setup "test" version of LABYRINTH [ linux , FNALU ] setup NEUGEN3 development explicitly setting up GCC3_4_3 version of GEANT using PYTHIA6 (v6_412) for LUND ########## # CONDOR # ########## MINOS25 > condor_q -hold 30 jobs; 0 idle, 0 running, 30 held MINOS25 > condor_release gfactory ########## # PARROT # ########## Testing releases/ups copies on d119/d120 Recorrected symlinks to be /general/ups rather than /general/ups We plan to remount this all as /general/ups Had done : SLINKF=${LOGD}/slink/prod1 SLINKL=${LOGD}/slink/prod1.log generated SLINKF as below, Corrected general/ups symlinks to general/products This had not been done in the first tests in d141, do not know why we got away with this, as d141 did not have symlink ups -> products SLINKS=`grep ':/afs' ${SLINKF} | grep ":${AFSC}/general/ups"` printf "${SLINKS}\n" | while read SLINK ; do SLIN=`printf "${SLINK}" | cut -f 2 -d :` SLIX=${SLIN/\/e875\/general\/ups/\/e875\/general\/products} SLOU=/`printf "${SLINK}" | cut -f 1 -d :` rm -f ${SLOU} ln -s ${SLIX} ${SLOU} done Now need to reverse this, having lost our copy of prod1 SLINKF=${LOGD}/slink/prodprod SLINKL=${LOGD}/slink/prodprod.log generated SLINKF, hacked to include $UPI, which we need to change Mysql> wc -l ${SLINKF} 29 /minos/scratch/minsoft/afssoft/slink/prodprod SLINKS=`grep ':/afs' ${SLINKF} | grep ":${AFSC}/general/products"` printf "${SLINKS}\n" | while read SLINK ; do SLIN=`printf "${SLINK}" | cut -f 2 -d :` SLIX=${SLIN/\/e875\/general\/products/\/e875\/general\/ups} SLOU=/`printf "${SLINK}" | cut -f 1 -d :` rm -f ${SLOU} ln -s ${SLIX} ${SLOU} done mindata@minos26 $ time make_growfs -k /afs/fnal.gov/files/data/minos/d119 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d119/.growfsdir make_growfs: scanning directory tree for changes... make_growfs: 1075098 files, 5525 links, 149944 dirs, 0 checksums computed real 10m51.369s user 1m16.832s sys 1m43.335s Ran the usual HOWTO.parrot tests on fnpc185, OK ! Ran paloon integrated test, -bash-3.00$ /grid/app/minos/parrot/paloon SETTING UP UPS SETTING UP MINOS No default SAM configuration exists at this time. MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 /tmp/fileHEYSq8: line 620: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory RUNNING LOON Warning in : class timespec already in TClassTable Processing firstlast.C... Spin(1 in 1 out 0 filt.) 1) +RawRecCounts::Ana n=1 ( 1/ 0) t=( 0.00/ 0.00) ... RawRecCounts done OK, ran loon under parrot ####### # SAM # ####### pittam reports several cedar files not declared to SAM, in cedar, sntp_data 10 2006-01 2007 Feb 5 32 2006-06 2007 Feb 5 12 2006-08 2006 Dec 12/13 1 2006-09 2006 Dec 12 Test 1 file MINOS26 > sam locate N00010896_0014.spill.sntp.cedar.0.root Datafile with name 'N00010896_0014.spill.sntp.cedar.0.root' not found. RELEASE=cedar DET=near MONTH=2006-09 for MONTH in 2006-01 2006-06 2006-08 2006-09 ; do ./saddreco ${DET} ${RELEASE} ${MONTH} verify done needed 20, 65, 24, 2 files These should be even numbers, including cosmic/spill N00010163_0015.spill.sntp.cedar.0.root is missing. 
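For the record, the verify pass over several months could be wrapped so each month's output is kept for comparison before anything is declared. A sketch only, reusing the ./saddreco DET RELEASE MONTH verify call and log area shown here; the grep for 'needed' is a guess at the verify output format.

#!/bin/sh
# Sketch: run saddreco in verify mode across months, keep per-month logs.
DET=near
RELEASE=cedar
LOGD=${HOME}/ROUNTMP/LOG/saddreco/${RELEASE}
mkdir -p ${LOGD}
for MONTH in 2006-01 2006-06 2006-08 2006-09 ; do
    VLOG=${LOGD}/${DET}.${MONTH}.verify
    ./saddreco ${DET} ${RELEASE} ${MONTH} verify 2>&1 | tee ${VLOG}
    grep -i needed ${VLOG}    # 'needed N files' wording is an assumption
done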
declare 2006-09, looks OK SLOG=${HOME}/ROUNTMP/LOG/saddreco/${RELEASE}/${DET}.log for MONTH in 2006-01 2006-06 2006-08 2006-09 ; do ./saddreco ${DET} ${RELEASE} ${MONTH} declare done 2>&1 | tee -a ${SLOG} Updated HOWTO.saddreco with improved SLOG and paths Ran full verify scan, one more file was missing MONTH 2005-03 needed 1 N00007101_0001.cosmic.*.cedar.0.root MONTH 2007-01 several obsoletes MONTH 2008-07 several obsoletes obsolete N00014529_0010.spill.cand.cedar.1.root obsolete N00014529_0002.spill.cand.cedar.1.root obsolete N00014529_0003.spill.cand.cedar.1.root obsolete N00014529_0001.spill.cand.cedar.1.root obsolete N00014551_0001.spill.cand.cedar.0.root obsolete N00014529_0004.spill.cand.cedar.1.root obsolete N00014529_0005.spill.cand.cedar.1.root obsolete N00014529_0006.spill.cand.cedar.1.root MONTH=2005-03 ./saddreco ${DET} ${RELEASE} ${MONTH} declare 2>&1 | tee -a ${SLOG} FINISHED Thu Aug 14 16:39:16 2008 Also scanned DET=far, found nothing missing. ####### # SAM # ####### Example SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and MC.VTXREGION 3 and MC.BFIELD 3 and VERSION cedar.phy.bhcurv and RUN_NUMBER >= 7250 and RUN_NUMBER <= 7260 " MINOS26 > sam list files --summaryonly --dim="${SAMDIM}" File Count: 14 Average File Size: 1.50GB Total File Size: 20.96GB Total Event Count: 270400 MINOS26 > sam list files --noSummary --dim="${SAMDIM}" n13037259_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037257_0025_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037257_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037252_0014_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037260_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037254_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037250_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037255_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037253_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037258_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037256_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037252_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037251_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root n13037255_0016_L010185N_D04.sntp.cedar_phy_bhcurv.1.root ============================================================================= 2008 08 13 ============================================================================= ######## # FARM # ######## ---------- rubin > F00038575_0010 is bad_runs, but with pass 1 (filenames were > intentionally changed) Added a line to bad_runs to mark pass 0 as bad, which it was. fnpcsrv1% grep F00038575_0010 bad_runs.cedar_phy_bhcurv >> /tmp/newbad fnpcsrv1% nedit /tmp/newbad changed pass1 to 0 fnpcsrv1% cat /tmp/newbad >> bad_runs.cedar_phy_bhcurv fnpcsrv1% grep F00038575_0010 bad_runs.cedar_phy_bhcurv F00038575_0010.1 2007-08 139 2008-08-09 22:08:13 fcdfcaf1283 F00038575_0010.0 2007-08 139 2008-08-09 22:08:13 fcdfcaf1283 > F00039811 and F00039818 are 2007-10 and were not supposed to be run > through the spill pass. It looks as though a single run in each case > was put through both passes by mistake, probably as part of a previous > cleanup. 
One of us should delete the spill files from farcat We have 1 subrun of each of these, bmnt/bntp/mrnt/sntp rm /minos/data/minfarm/farcat/F00039811* rm /minos/data/minfarm/farcat/F00039818* sam locate F00039811_0000.spill.cand.cedar_phy_bhcurv.0.root ['/pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10,321@vok698'] sam locate F00039818_0015.spill.cand.cedar_phy_bhcurv.0.root ['/pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10,369@vok698'] sam undeclare file F00039811_0000.spill.cand.cedar_phy_bhcurv.0.root sam undeclare file F00039818_0015.spill.cand.cedar_phy_bhcurv.0.root ls /pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10/F00039811_0000.spill.cand.cedar_phy_bhcurv.0.root ls /pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10/F00039818_0015.spill.cand.cedar_phy_bhcurv.0.root rm /pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10/F00039811_0000.spill.cand.cedar_phy_bhcurv.0.root rm /pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2007-10/F00039818_0015.spill.cand.cedar_phy_bhcurv.0.root > F00040124 crosses a month/year boundary and the first part may have > escaped the original runlist. I'm submitting them. > > F00040133, 40403, 40421: No idea why they weren't run. I'm submitting > them. OK, will keep an eye out for them. ---------- All but 40403 and 40421 seem to be there now SRV1> ./roundup -s sntp -r cedar_phy_bhcurv far Wed Aug 13 15:19:45 CDT 2008 ---------- rubin F00040124 and 40133 are complete. It turns out that the other subruns are from runs which, for some other subruns in the first pass, produced my 'Type 90' failures where they run 'forever' producing multi-volume candidate files. The remaining 3 jobs look like they'll do the same. I'll let them run for several hours and kill them if they produce a second candidate volume. I will then make a manual entry in the bad_runs list and let you know. The 3 subruns do not appear in the nightly lists, nor in the suppressed lists. The former is why they were not run in the first pass. They are continuing to chug along, but they are almost certainly junk. ####### # AFS # ####### HOWTO.afssoftprod Continuing to clean up symlinks, interrupted by other work correcting many symlinks from general/ups to general/products Adjusted HOWTO to filter :${UPI} out of the initial SLINKF file. Oops, stepped on prod1 file, lost it. 
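Losing prod1 is the kind of thing a temp-file-and-rename pattern avoids when regenerating these link lists. A sketch, not what was run; the find command is only an illustration, the real SLINKF recipe is in HOWTO.afssoftprod, and the path:target layout here only approximates it.

# Sketch: rebuild a symlink list without clobbering the previous copy
SLINKD=/minos/scratch/minsoft/afssoft/slink
SLINKF=${SLINKD}/prod1
TMPF=${SLINKF}.new.$$
find /afs/fnal.gov/files/data/minos/d119 -type l -printf '%p:%l\n' > ${TMPF} \
 && { [ -r ${SLINKF} ] && cp -p ${SLINKF} ${SLINKF}.bak ; mv ${TMPF} ${SLINKF} ; }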
Mysql> mv /minos/scratch/minsoft/afssoft/slink/prod1 /minos/scratch/minsoft/afssoft/slink/prod2 Let's look at the deadwood, not pointing to /afs Mysql> grep -v :/afs $SLINKF | wc -l 18 Mysql> grep :/afs $SLINKF | wc -l 11 Mysql> grep -v :/afs $SLINKF afs/fnal.gov/files/data/minos/d119/prd/sam/v8_2_0/Linux-2/ups/..tar:/ftp/products/sam/v8_2_0/Linux+2/sam_v8_2_0_Linux+2.ups.tar afs/fnal.gov/files/data/minos/d119/prd/sam_ns_ior/v7_1_0/NULL/ups/..tar:/ftp/products/sam_ns_ior/v7_1_0/NULL/sam_ns_ior_v7_1_0_NULL.ups.tar afs/fnal.gov/files/data/minos/d119/prd/oracle_client/v10_1_0_2_0b/Linux-2/bin/lbuilder:/fnal/ups/prd/oracle_client/v10_1_0_2_0a/Linux-2/nls/lbuilder/lbuilder afs/fnal.gov/files/data/minos/d119/prd/oracle_client/v10_1_0_2_0b/Linux-2/jdk/man/ja:/fnal/ups/prd/oracle_client/v10_1_0_2_0a/Linux-2/jdk/man/ja_JP.eucJP afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_3_2/config_build_root.sh:/afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_3_3/config_build_root.sh:/afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_3_4/config_build_root.sh:/afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_3_4/config_build_root_minimal.sh:/afs/fnal.gov/files/code/e875/general/bin/config_build_root_minimal.sh afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_4_1/config_build_root.sh:/afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_2/v01/lib/mysql/libz.so:/usr/lib/libz.so.1 afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_2/v01/lib/libz.so:/usr/lib/libz.so.1 afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_2/v03/tar_files:/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/tar_files afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_3/v02/lib/libmyodbc.so:/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_3/lib/libmyodbc3.so afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_4/bleeding-edge/lib/libmyodbc.so:/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so afs/fnal.gov/files/data/minos/d119/prd/MINOS_EXTERN/Linux2.4-GCC_3_4/v03/lib/libmyodbc.so:/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/config.guess:/usr/share/libtool/config.guess afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/config.sub:/usr/share/libtool/config.sub afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/ltmain.sh:/usr/share/libtool/ltmain.sh afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/install-sh:/usr/share/automake-1.6/install-sh afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/mkinstalldirs:/usr/share/automake-1.6/mkinstalldirs afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/missing:/usr/share/automake-1.6/missing afs/fnal.gov/files/data/minos/d119/prd/LOG4CPP/src/cvs/config/depcomp:/usr/share/automake-1.6/depcomp afs/fnal.gov/files/data/minos/d119/prd/samgrid_batch_adapter/v7_1_0/NULL/ups/..tar:/ftp/products/samgrid_batch_adapter/v7_1_0/NULL/samgrid_batch_adapter_v7_1_0_NULL.ups.tar 
afs/fnal.gov/files/data/minos/d119/prd/geant/v3_21_14a/Linux-2-6/ups/..tar:/ftp/products/geant/v3_21_14a/Linux+2.6/geant_v3_21_14a_Linux+2.6.ups.tar afs/fnal.gov/files/data/minos/d119/prd/sam_products/v4_30/NULL/ups/..tar:/ftp/products/sam_products/v4_30/NULL/sam_products_v4_30_NULL.ups.tar afs/fnal.gov/files/data/minos/d119/prd/sam_products/v4_31/NULL/ups/..tar:/ftp/products/sam_products/v4_30/NULL/sam_products_v4_30_NULL.ups.tar afs/fnal.gov/files/data/minos/d119/prd/sam_products/v4_32/NULL/ups/..tar:/ftp/products/sam_products/v4_30/NULL/sam_products_v4_30_NULL.ups.tar afs/fnal.gov/files/data/minos/d119/prd/gcc/v3_4_3/Linux+2.6-2.3.4/tar/binutils.tar.gz:/afs/fnal/files/home/room1/mengel/binutils.tar.gz afs/fnal.gov/files/data/minos/d119/prd/gcc/v3_4_3/Linux+2.6-2.3.4/tar/gcc.tar.gz:/afs/fnal/files/home/room1/mengel/gcc-3.4.3.tar.gz Back to work, look at what is needed from afs, all are files except for one directory Mysql> printf "${SLINKS}\n" | cut -f 2 -d : /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root_minimal.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/tar_files /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_3/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so /afs/fnal/files/home/room1/mengel/binutils.tar.gz /afs/fnal/files/home/room1/mengel/gcc-3.4.3.tar.gz For now, let's not clean up, just take a copy as-is Proceeding to releases, Added SLINKF filter against /minos/data/release_data MINOSSOFT soft1 found mostly bin/lib/tmp, plus Mysql> grep -v '/bin$' $SLINKF | grep -v '/lib$' | grep -v '/tmp$' afs/fnal.gov/files/data/minos/d120/packages/DatabaseTables/HEAD:/afs/fnal.gov/files/data/minos/d04/libraries/DatabaseTables/HEAD afs/fnal.gov/files/data/minos/d120/packages/WebDocs/HEAD/doxygen/loon:/afs/fnal.gov/files/code/e875/releases1/doxygen/loon Mysql> du -sm /afs/fnal.gov/files/data/minos/d04/libraries/DatabaseTables/HEAD 53 /afs/fnal.gov/files/data/minos/d04/libraries/DatabaseTables/HEAD Mysql> du -sm /afs/fnal.gov/files/code/e875/releases1/doxygen/loon 415 /afs/fnal.gov/files/code/e875/releases1/doxygen/loon Let's go ahead and copy all this, it should fit. Using only 4.1 GB without bin/lib/tmp doxygen/loon copy is slow, 1/3 MB/second Wed Aug 13 19:15:55 CDT 2008 /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/lib /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib 535 /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/lib close : Connection timed out Rename of /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib/Linux2.4-GCC_3_4/libCluster3D.so.UPD to /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib/Linux2.4-GCC_3_4/libCluster3D.so failed. ... 
getacl : Connection timed out Unable to set mode-bits for /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib/Linux2.4-GCC_3_4-maxopt to 16877 Couldn't set acls for /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib/Linux2.4-GCC_3_4-maxopt Could not read symbolic link /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/lib/Linux2.6-GCC_3_4 read link : Connection timed out Could not read symbolic link /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/lib/Linux2.6-GCC_3_4-maxopt read link : Connection timed out Unable to set mode-bits for /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib to 16877 getacl : Connection timed out du: cannot access `/afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib': Connection timed out Wed Aug 13 19:16:15 CDT 2008 /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/tmp /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/tmp the rest timed out. Mysql> fs listquota /afs/fnal.gov/files/data/minos/d120 Volume Name Quota Used %Used Partition nb.minos.d120 50000000 19849589 40% 56% ... Aug 13 19:15:20 minos-mysql1 kernel: post_create: no inode, dir (dev=afs, ino=1238086069) Aug 13 19:15:46 minos-mysql1 kernel: post_create: no inode, dir (dev=afs, ino=1238086069) Aug 13 19:15:55 minos-mysql1 kernel: post_create: no inode, dir (dev=afs, ino=1238086077) Aug 13 19:15:55 minos-mysql1 kernel: post_create: no inode, dir (dev=afs, ino=1238086077) Aug 13 19:16:13 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.11 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 13 19:16:13 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.11 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 13 19:16:14 minos-mysql1 kernel: afs: failed to store file (110) Aug 13 19:16:14 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 13 19:16:14 minos-mysql1 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Aug 13 19:16:15 minos-mysql1 kernel: afs: Tokens for user of AFS id 1060 for cell fnal.gov have expired Aug 13 19:16:31 minos-mysql1 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 13 19:16:31 minos-mysql1 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 13 19:16:31 minos-mysql1 kernel: afs: file server 131.225.68.11 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Aug 13 19:16:31 minos-mysql1 kernel: afs: file server 131.225.68.11 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Mysql> grep ^OPTIONS= /etc/sysconfig/afs OPTIONS=$LARGE Pick up where we left off, Mysql> printf "${SLINKS}\n" | wc -l 158 Mysql> printf "${SLINKS}\n" | grep -n S08-01-10-R1-27/lib 109:... 
Mysql> SLINKX=`printf "${SLINKS}\n" | tail +109` Renewed expired token Mysql> tokens Removed partial library Mysql> rm -r /afs/fnal.gov/files/data/minos/d120/releases/S08-01-10-R1-27/lib Mysql> grep afs: /var/log/messages | grep -v Tokens | uniq | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' | sort >> put this into the afse.txt file Ran the SLINK procedures, reading from SLINKX printf "${SLINKX}\n" | while read SLINK ; do and the rest per SLINK procedure in HOWTO.afssoftprod Thu Aug 14 09:37:03 CDT 2008 ... Thu Aug 14 10:07:44 CDT 2008 Mysql> fs listquota /afs/fnal.gov/files/data/minos/d120 Volume Name Quota Used %Used Partition nb.minos.d120 50000000 30004513 60% 57% $ mv mountfile.grow mountfile.d199d141.grow $ ln -s mountfile.d119d120.grow mountfile.grow Mysql> wc -l ${SLINKF} 29 /minos/scratch/minsoft/afssoft/slink/prodprod ####### # CVS # ####### Removed blake cvs keys, per request ( a machine was cracked ) Deferred adding new keys pending test of kerberos access. "Yes, kerberos access works - I just committed some code to CVS." MINOSCVS > grep blake cvshlog | tail -1 Thu Aug 14 04:39:41 2008 (blake@(null)) : cvsh -c cvs server [sSk] ######## # FARM # ######## ./roundup -s sntp -r cedar_phy_bhcurv far Wed Aug 13 10:41:31 CDT 2008 ######## # FARM # ######## Added test of NOCAT to looper : #!/bin/sh OPTS="${1}" if [ -z "${OPTS}" ] ; then printf " OOPS, need to specify at least release/stream \n" printf " LIKE ./looper '-r cedar_phy_bhhi mcnear' \n" exit 1 fi printf "./roundup -c ${OPTS}\n" while true ; do [ -r /home/minfarm/ROUNTMP/NOCAT ] || ./roundup -c ${OPTS} sleep 1200 done ============================================================================= 2008 08 12 ============================================================================= ######## # FARM # ######## Monitoring automount of /minos/data on fnpcsrv1 yesterday, Aug 11 09:43:41 fnpcsrv1 automount[15347]: mount(nfs): mounted minos-nas-0.fnal.gov:/minos/data on /minos/data Aug 11 17:43:33 fnpcsrv1 automount[15441]: expired /minos/data This good period of no dismounts corresponds to timm's test script which cd'd to /minos/datat, terminating around 17:43 ######## # FARM # ######## Added asousa, mstrait, nwest to .k5login, pending restore SRV1> du -sk . 4666784 . Tue Aug 12 12:21:17 CDT 2008 9598368 . SRV1> date Tue Aug 12 13:01:48 CDT 2008 SRV1> sdiff -s restore_20080810/minfarm/.k5login .k5login > bseilhan/cron/fnpcsrv1.fnal.gov@FNAL.GOV > durga@FNAL.GOV mishi@FNAL.GOV < mstrait/cron/fnpcsrv1.fnal.gov@FNAL.GOV < mstrait/cron/minos04fnal.gov@FNAL.GOV < > rubin/cron/fnppd.fnal.gov@FNAL.GOV timm@FNAL.GOV < > timm@FNAL.GOV removed bseilhan/cron, durga, rubin/cron/fnppd SRV1> sdiff -s restore_20080810/minfarm/.k5login .k5login mishi@FNAL.GOV < mstrait/cron/fnpcsrv1.fnal.gov@FNAL.GOV < mstrait/cron/minos04fnal.gov@FNAL.GOV < timm@FNAL.GOV < > timm@FNAL.GOV cd restore_20080810/minfarm SRV1> du -sm * | sort -n ... 2 restore 2 rhatcher 3 maint 3 work 4 bin 7 lib 8 monitor 11 loonexe 20 scripts 60 west 67 web 128 FNAL_00030851.dbm.gz 167 lists 546 strait_scratch 1047 track_crash 2735 bckhousetest Time of minfarm directory seems to be Aug 09 Latest bad_runs* or good_runs* seem to be Feb 26 That is because lists moved to /minos/data/minfarm then. Howie suggests cd ~minfarm setenv R restore_20080810/minfarm cp -upr $R/.* . cp -upr $R/* . 
I'd be explicit, and add the 'd' ( I usually use a, which is -dpr ) First, though back it all up : cd /home/minfarm date Tue Aug 12 14:24:49 CDT 2008 time tar cf /minos/data/minfarm/backup20080812.tar . Many files could not be opened, owned by rubin : FILES=' restore_20080810/minfarm/lists/missing.cedar.bck restore_20080810/minfarm/lists/good_runs_mc.cedar.duplicated_output restore_20080810/minfarm/lists/greg_list.bck restore_20080810/minfarm/lists/bad_runs_mc.cedar_phy_bhcurv.bck restore_20080810/minfarm/lists/mmm.d04.cedar_phy_bhcurv.bck restore_20080810/minfarm/lists/bad_runs.cedar_phy_bhcurv.bck restore_20080810/minfarm/lists/bad_runs.cedar.bck restore_20080810/minfarm/loonexe/reco_MC_daikon_far_MRCCOnly_cedar_phy_Corrected.C.bck restore_20080810/minfarm/scripts/deprecated/total_data.trigger.old restore_20080810/minfarm/restore/minfarm/lists/bad_runs.cedar_phy_bhcurv.bck restore/minfarm/lists/bad_runs.cedar_phy_bhcurv.bck ' This agrees with SRV1> find . -user rubin ! -perm -40 SRV1> rm /minos/data/minfarm/backup20080812.tar fnpcsrv1% cd /home/minfarm fnpcsrv1$ for FILE in ${FILES} ; do chmod g+r ${FILE} ; done SRV1> time tar cf /minos/data/minfarm/backup20080812.tar . real 17m34.869s user 0m3.311s sys 1m52.664s SRV1> date Tue Aug 12 14:55:26 CDT 2008 SRV1> du -sh /minos/data/minfarm/backup20080812.tar 7.8G /minos/data/minfarm/backup20080812.tar SRV1> du -sh . 9.2G . SRV1> find . -type f | wc -l 50686 Test file restoration, with loonexe SRV1> du -sk loonexe restore_20080810/minfarm/loonexe/ 2976 loonexe 10848 restore_20080810/minfarm/loonexe/ SRV1> find restore_20080810/minfarm/loonexe -type f | wc -l 248 cp -dupr restore_20080810/minfarm/loonexe . cp: setting permissions for `./loonexe/josh': Permission denied Many files are owned by rubin : SRV1> find restore_20080810 -user rubin | wc -l 3562 SRV1> find -user rubin | wc -l 4555 This is OK, per howie. Everying will end up being owned by minfarm, Go for the gold cp -dupr restore_20080810/minfarm/.* . 
cp: setting permissions for `././condor_log': Permission denied cp: setting permissions for `././condor_submit': Permission denied cp: setting permissions for `././lists/non_current': Permission denied cp: setting permissions for `././loonexe/josh': Permission denied cp: cannot overwrite directory `././.nedit' with non-directory cp: setting permissions for `././badlogs': Permission denied cp: setting permissions for `././monitor/R1_18/logfiles': Permission denied cp: setting permissions for `././monitor/R1_18/psfiles': Permission denied cp: setting permissions for `././monitor/R1_18': Permission denied cp: setting permissions for `././monitor/R1_18_2': Permission denied cp: setting permissions for `././monitor/R1_18_3': Permission denied cp: setting permissions for `././monitor/R1_18_4': Permission denied cp: setting permissions for `././monitor/R1_21': Permission denied cp: setting permissions for `././monitor/R1_23': Permission denied cp: setting permissions for `././monitor/R1_23a': Permission denied cp: setting permissions for `././monitor/S06-05-25-R1-22': Permission denied cp: setting permissions for `././monitor/S06-06-22-R1-22': Permission denied cp: setting permissions for `././monitor/R1_24c': Permission denied cp: setting permissions for `././monitor/R1_24': Permission denied cp: setting permissions for `././monitor/R1_24a': Permission denied cp: setting permissions for `././monitor/cedar': Permission denied cp: setting permissions for `././monitor/R1_24b': Permission denied cp: setting permissions for `././monitor/cedar_phy_bhcurve': Permission denied cp: setting permissions for `././monitor/R1_24calB': Permission denied cp: setting permissions for `././monitor/R1_24cal': Permission denied cp: setting permissions for `././monitor/cedar_phy': Permission denied cp: setting permissions for `././monitor/cedar_phy_safitter': Permission denied cp: setting permissions for `././monitor/cedar_phy_srsafitter': Permission denied cp: setting permissions for `././monitor/srsafitter': Permission denied cp: setting permissions for `././monitor/cedar_phy_mboone': Permission denied cp: setting permissions for `././monitor/cedar_phy_srsafitterbx113': Permission denied cp: setting permissions for `././monitor/cedar_phy_bhcurv': Permission denied cp: preserving times for `././recover': Permission denied cp: setting permissions for `././scripts/caldet': Permission denied cp: setting permissions for `././scripts/deprecated': Permission denied cp: setting permissions for `././scripts/fbs': Permission denied cp: setting permissions for `././scripts/specials': Permission denied cp: setting permissions for `././scripts/old_li': Permission denied cp: setting permissions for `././web/deprecated': Permission denied cp: setting permissions for `././web/indexes': Permission denied cp: setting permissions for `././restore/minfarm/lists': Permission denied cp: setting permissions for `././restore/minfarm': Permission denied cp: setting permissions for `././restore': Permission denied cp: preserving times for `././strait_scratch/itworked': Permission denied cp: preserving times for `././strait_scratch/itworked2': Permission denied cp: preserving times for `././strait_scratch/badtries/10thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/11thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/12thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/13thtry': Permission denied cp: preserving times for 
`././strait_scratch/badtries/14thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/15thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/16thtry': Permission denied cp: preserving times for `././strait_scratch/badtries/eighthtry': Permission denied cp: preserving times for `././strait_scratch/badtries/ninthtry': Permission denied cp: preserving times for `././strait_scratch/badtries': Permission denied cp: preserving times for `././strait_scratch/2005-08-logs': Permission denied cp: preserving times for `././strait_scratch/2005-09-logs': Permission denied cp: will not create hard link `./minfarm' to directory `./.' cp: will not create hard link `./.autosave' to directory `././.autosave' cp: will not create hard link `./.emacs.d' to directory `././.emacs.d' cp: will not create hard link `./.grid' to directory `././.grid' cp: cannot overwrite directory `./.nedit' with non-directory cp: will not create hard link `./.netscape' to directory `././.netscape' cp: will not create hard link `./.srmconfig' to directory `././.srmconfig' cp: will not create hard link `./.ssh' to directory `././.ssh' cp: will not create hard link `./.subversion' to directory `././.subversion' drwxrwxr-x 2 minfarm numi 2048 Aug 12 09:12 .nedit/ rmdir .nedit SRV1> cp -dupr restore_20080810/minfarm/.nedit . for FOO in autosave emacs.d grid netscape srmconfig ssh subversion do echo ${FOO} diff -r restore_20080810/minfarm/.${FOO} .${FOO} done Tue Aug 12 15:25:32 CDT 2008 cp -dupr restore_20080810/minfarm/* . cp: setting permissions for `./badlogs': Permission denied cp: setting permissions for `./condor_log': Permission denied cp: setting permissions for `./condor_submit': Permission denied cp: setting permissions for `./lists/non_current': Permission denied cp: setting permissions for `./loonexe/josh': Permission denied cp: setting permissions for `./monitor/R1_18/logfiles': Permission denied cp: setting permissions for `./monitor/R1_18/psfiles': Permission denied cp: setting permissions for `./monitor/R1_18': Permission denied cp: setting permissions for `./monitor/R1_18_2': Permission denied cp: setting permissions for `./monitor/R1_18_3': Permission denied cp: setting permissions for `./monitor/R1_18_4': Permission denied cp: setting permissions for `./monitor/R1_21': Permission denied cp: setting permissions for `./monitor/R1_23': Permission denied cp: setting permissions for `./monitor/R1_23a': Permission denied cp: setting permissions for `./monitor/S06-05-25-R1-22': Permission denied cp: setting permissions for `./monitor/S06-06-22-R1-22': Permission denied cp: setting permissions for `./monitor/R1_24c': Permission denied cp: setting permissions for `./monitor/R1_24': Permission denied cp: setting permissions for `./monitor/R1_24a': Permission denied cp: setting permissions for `./monitor/cedar': Permission denied cp: setting permissions for `./monitor/R1_24b': Permission denied cp: setting permissions for `./monitor/cedar_phy_bhcurve': Permission denied cp: setting permissions for `./monitor/R1_24calB': Permission denied cp: setting permissions for `./monitor/R1_24cal': Permission denied cp: setting permissions for `./monitor/cedar_phy': Permission denied cp: setting permissions for `./monitor/cedar_phy_safitter': Permission denied cp: setting permissions for `./monitor/cedar_phy_srsafitter': Permission denied cp: setting permissions for `./monitor/srsafitter': Permission denied cp: setting permissions for `./monitor/cedar_phy_mboone': Permission denied cp: 
setting permissions for `./monitor/cedar_phy_srsafitterbx113': Permission denied cp: setting permissions for `./monitor/cedar_phy_bhcurv': Permission denied cp: preserving times for `./recover': Permission denied cp: setting permissions for `./restore/minfarm/lists': Permission denied cp: setting permissions for `./restore/minfarm': Permission denied cp: setting permissions for `./restore': Permission denied cp: setting permissions for `./scripts/caldet': Permission denied cp: setting permissions for `./scripts/deprecated': Permission denied cp: setting permissions for `./scripts/fbs': Permission denied cp: setting permissions for `./scripts/specials': Permission denied cp: setting permissions for `./scripts/old_li': Permission denied cp: preserving times for `./strait_scratch/itworked': Permission denied cp: preserving times for `./strait_scratch/itworked2': Permission denied cp: preserving times for `./strait_scratch/badtries/10thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/11thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/12thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/13thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/14thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/15thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/16thtry': Permission denied cp: preserving times for `./strait_scratch/badtries/eighthtry': Permission denied cp: preserving times for `./strait_scratch/badtries/ninthtry': Permission denied cp: preserving times for `./strait_scratch/badtries': Permission denied cp: preserving times for `./strait_scratch/2005-08-logs': Permission denied cp: preserving times for `./strait_scratch/2005-09-logs': Permission denied cp: setting permissions for `./web/deprecated': Permission denied cp: setting permissions for `./web/indexes': Permission denied Tue Aug 12 15:27:44 CDT 2008 We are good to go now. Making a separate copy of the restore_20080810 files. cd restore_20080810 time tar cf /minos/data/minfarm/restore_20080810.tar . real 3m37.368s user 0m1.392s sys 1m2.883s ######## # FARM # ######## export MYSQL_PWD= mysqladmin -h fnpcsrv1.fnal.gov --port 3307 -u minfarm processlist mysqladmin -u minfarm -S /export/stage/minfarm/mysql.sock1 processlist ============================================================================= 2008 08 11 ============================================================================= ######## # FARM # ######## Most of the /home/minfarm files were deleted , by a runaway script /grid/app/minos/scripts/gather_runs.mc The script was intending to concatenate files in /home/minfarm/lists/BAD and GOOD, But there were no BAD or GOOD directories initially. So the script wandered into /home, and started removing all files thereunder. I have put back a .k5login with rubin and kreymer. I have done crontab crontab.dat in the scripts directory. And in /home/minfarm, ln -s scripts/crontab.dat crontab.dat ########### # ROUNDUP # ########### Fell behind on 11:00 cycle, fardet, due to small runs around 60 KBytes F00041598 through F00041801. ######## # FARM # ######## Date: Sat, 09 Aug 2008 12:32:10 -0500 There have been serious NFS (probably) problems at 19:15-19:45 yesterday and again at ~01:00 today. /minos/data has been affected. I'm not sure if it relates to the missing runs or not. I want to check before resubmitting. 
"Late" FD r3 is complete (except of course for your missing runs) so I'm going to start submitting the early r3 stuff which has only a cosmic pass. Date: Sat, 09 Aug 2008 16:41:16 -0500 From: Howard Rubin There is a single run of pass 1 output in farcat, F00039719. I think I must have run these subruns to complete the run started in the previous month. I don't know if these should actually replace the pass 0 because of better constants or just be deleted. If they should replace, then the first 3 subruns should probably also be reprocessed. What do you think? Note that these are cosmic/all only. Note added in proof: Apparently the first pass stuff must have been deleted because there's no F00039719_0000.all.sntp.cedar_phy_bhcurv.0.root (or _0003) in SAM. I guess that means I *should* run the first 3 subruns so that the run will be complete. I'll change the pass to pass 0 and remove the good_runs lines that are causing them to be designated pass 1. Date: Sat, 09 Aug 2008 22:47:29 -0500 From: Howard Rubin There a set of runs, probably a month's worth, and perhaps more coming with pass 1. I'm going to stop roundup until this is all complete and rename the files to pass 0. I've checked and there don't seem to be any of these declared to SAM. I've done it by stopping the cron job. I'm sure there's a more elegant way, but this should only be for a short time. There are only a couple of hundred more jobs to run. Date: Sat, 09 Aug 2008 23:35:50 -0500 From: Howard Rubin I've restarted the corral cron job. Except for anything you might find in roundup, run 3 is complete. I'll start checking the list from 'late run3' concatenation tomorrow or Monday. --------------------------------- looper picked up F00039719 around Aug 9 23:03 , pass 0 had formerly been pass 1 It picked up and concatenated most of the .1. data last Saturday. 
Sat Aug 9 19:52:35 CDT 2008 Some with pass 2 or 4 Sat Aug 9 21:49:13 CDT 2008 PURGED WRITE/F00039340_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039345_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039348_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039349_0000.all.sntp.cedar_phy_bhcurv.2.root PURGED WRITE/F00039350_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039353_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039356_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039362_0000.all.sntp.cedar_phy_bhcurv.2.root PURGED WRITE/F00039574_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039827_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039830_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039834_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039840_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039840_0008.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039843_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039846_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039849_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039855_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039858_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039869_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039878_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039881_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039884_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00039887_0000.all.sntp.cedar_phy_bhcurv.0.root Sun Aug 10 00:07:38 CDT 2008 PURGED WRITE/F00038559_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039359_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039571_0000.all.sntp.cedar_phy_bhcurv.2.root PURGED WRITE/F00039577_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039580_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039583_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039586_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039589_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039592_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039595_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039603_0000.all.sntp.cedar_phy_bhcurv.2.root PURGED WRITE/F00039607_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039608_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039610_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039615_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039618_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039622_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039625_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039628_0000.all.sntp.cedar_phy_bhcurv.4.root PURGED WRITE/F00039631_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039653_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039676_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039679_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039682_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039685_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039688_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039691_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039694_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039697_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039700_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039704_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039707_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED 
WRITE/F00039710_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039713_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039716_0000.all.sntp.cedar_phy_bhcurv.1.root PURGED WRITE/F00039719_0000.all.sntp.cedar_phy_bhcurv.1.root FONES='' for FON in $FONES ; do sam locate ${FON}cand.cedar_phy_bhcurv.1.root ; done AOK with .1. except Datafile with name 'F00039603_0000.all.cand.cedar_phy_bhcurv.1.root' not found. for FON in $FTWOS ; do sam locate ${FON}cand.cedar_phy_bhcurv.2.root ; done AOK, all 4 files are in SAM The single .4. cand file is in SAM and PNFS FTWOS cands are all in PNFS. FONES cands are all in PNFS, except F00039603_0000. ######## # FARM # ######## ============================================================================= 2008 08 08 ============================================================================= ####### # AFS # ####### HOWTO.afssoftprod Continuing, resolving symlinks and cleaning up the HOWTO per this pass. In first products symlink pass ( prod1 ) first needed to correct many symlinks from general/ups to general/products ######## # FARM # ######## farcat 8 191 all.sntp.cedar.0.root 315 7614 all.sntp.cedar_phy.0.root 634 15573 all.sntp.cedar_phy_bhcurv.0.root 627 4386 spill.bmnt.cedar_phy_bhcurv.0.root 8 65 spill.bntp.cedar.0.root 315 1332 spill.bntp.cedar_phy.0.root 627 4543 spill.bntp.cedar_phy_bhcurv.0.root 627 2786 spill.mrnt.cedar_phy_bhcurv.0.root 8 40 spill.sntp.cedar.0.root 315 888 spill.sntp.cedar_phy.0.root 627 2877 spill.sntp.cedar_phy_bhcurv.0.root mcfmockcat 241 3427 mrnt.cedar_phy_bhcurv.0.root 241 6827 sntp.cedar_phy_bhcurv.0.root mockfar seems complete, force this out : ./looper '-M -r cedar_phy_bhcurv mockfar' & Fri Aug 8 13:21:33 CDT 2008 OK - processing 482 files Fri Aug 8 13:23:59 CDT 2008 WRITING to DCache 482 CPB far is running currently, let's keep up with sntp : ./looper '-s sntp -r cedar_phy_bhcurv far' & SELECT files containing sntp Fri Aug 8 13:28:55 CDT 2008 ZAPPING BAD F00040942_0008.all.sntp.cedar_phy_bhcurv.0.root F00040942_0008.0 2008-06 136 2008-08-01 00:26:31 fcdfcaf1628 ... ZAPPING BAD F00040942_0020.all.sntp.cedar_phy_bhcurv.0.root F00040942_0020.0 2008-06 136 2008-08-01 00:30:56 fcdfcaf1597 ZAPPING BAD F00040942_0020.spill.sntp.cedar_phy_bhcurv.0.root F00040942_0020.0 2008-06 136 2008-08-01 00:30:56 fcdfcaf1597 ... OK - processing 974 files OK - stream all.sntp.cedar_phy_bhcurv OK - 12096 Mbytes in 22 runs ... Date: Fri, 08 Aug 2008 21:14:14 +0000 (UTC) From: Arthur Kreymer To: Minos Batch The additional 241 subruns of cedar_phy_bhcurv daikon_05 mdc have been written to PNFS and /minos/data . ####### # WEB # ####### MIN > cp dhleft.html.20070328 dhleft.html.20080808 [14]+ Done nedit dhleft.html MIN > ln -sf dhleft.html.20080808 dhleft.html Updated dhleft with Cluster group replacing MINOS26, and DATA group showing disk status dhmain - added DATABASE - ganglia/fnpcsrv1 ############ # NOACCESS # ############ VOK330 331.09GB (NOTALLOWED 0731-1124 readonly 0716-1115) CD-LTO3 minos.reco_far_cedar_bcnd.cpio_odc Volume needs to be cloned due to repeated errors ########## # CONDOR # ########## Summary of proxy activities last week : When I added the pilot proxy, I did wait for all the glideins to terminate. condor_q showed no entries for gfactory when I changed the proxy. I was not so clever yesterday. I did remove the pilot Role while jobs were running yesterday. This resulted in many held gfactory processes. Plus a few freshly minted pilots without the pilot Role. 
I stopped the gfactory, restored the pilot Role, did 'condor_release gfactory', waited for the released jobs to run, but some reverted to 'held', iterated several times until all processes terminated. Then I removed the pilot Role from the proxy, released the last few gfactory processes, and waited for them to run and terminate. Then I restarted the gfactory. Thinks look clean since then. ============================================================================= 2008 08 07 ============================================================================= ########## # CONDOR # ########## Errors seen recently by pawloski cat /minos/data/users/pawloski/Nue/PETrimmerTest_Delete/Far_Beam_Standard_MC/log.172615.43 000 (172615.043.000) 08/06 21:44:29 Job submitted from host: <131.225.193.25:64545> ... 007 (172615.043.000) 08/06 23:57:51 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:65459> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/06 23:57:54 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:65459> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/06 23:57:58 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:65459> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/07 00:00:09 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:64284> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/07 00:00:12 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:64284> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/07 00:00:17 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:64284> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/07 00:00:19 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:64284> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 007 (172615.043.000) 08/07 00:00:22 Shadow exception! Can no longer talk to condor_starter <131.225.166.131:64284> 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 001 (172615.043.000) 08/07 00:02:46 Job executing on host: <131.225.166.119:62644> ... 009 (172615.043.000) 08/07 00:02:54 Job was aborted by the user. The system macro SYSTEM_PERIODIC_REMOVE expression '(JobRunCount > 10) || (JobRunCount>=1 && ImageSize>1000000 && JobStatus==1)' evaluated to TRUE ... 
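A quick tally of how many Shadow exceptions each of these jobs hit would show whether 172615.43 was an outlier. A sketch, not something run at the time; it only counts the 'Shadow exception' lines in the user logs quoted above.

# Sketch: count Shadow exceptions per Condor user log
LDIR=/minos/data/users/pawloski/Nue/PETrimmerTest_Delete/Far_Beam_Standard_MC
for ULOG in ${LDIR}/log.172615.* ; do
    N=`grep -c 'Shadow exception' ${ULOG}`
    [ "${N}" -gt 0 ] && printf '%4d %s\n' ${N} ${ULOG}
done | sort -rn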
########## # CONDOR # ########## 15:11 restored condorproxy without pilot role Killed gfactory master process ( only, not condor_gridmanager ) Restored pilot role to condorproxy, so we can release held gfactory jobs Still have fresh gfactories, MINOS25 > condor_q gfactory | grep -v H -- Submitter: minos25.fnal.gov : <131.225.193.25:61622> : minos25.fnal.gov ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 172904.0 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.1 gfactory 8/7 15:18 0+00:10:08 R 0 0.0 glidein_startup.sh 172904.2 gfactory 8/7 15:18 0+00:47:08 R 0 0.0 glidein_startup.sh 172904.3 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.4 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.5 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.6 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.7 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.8 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172904.9 gfactory 8/7 15:18 0+00:00:00 I 0 0.0 glidein_startup.sh 172906.0 gfactory 8/7 15:22 0+00:00:00 I 0 0.0 glidein_startup.sh 172913.0 gfactory 8/7 15:58 0+00:00:00 I 0 0.0 glidein_startup.sh 16:10 Released GLobus error 17 and 43 from yesterday, condor_release 171531 172568 172589 The recent glideins have evaporated. 16:15 JOBS=`condor_q gfactory -hold | grep gfactory | cut -f 1 -d ' '` for JOB in ${JOBS} ; do condor_q ${JOB} | grep gfactory ; done 88 for JOB in ${JOBS} ; do condor_release ${JOB} ; sleep 1 ; done 16:18 87 jobs; 12 idle, 50 running, 25 held 16:46 82 jobs; 12 idle, 45 running, 25 held 18:40: 57 jobs; 0 idle, 32 running, 25 held MINOS25 > condor_release 172666.0 MINOS25 > condor_release 172670.0 MINOS25 > condor_q 172666.0 172666.0 gfactory 8/7 01:59 0+10:21:03 R 0 0.0 glidein_startup.sh MINOS25 > condor_q 172670.0 172670.0 gfactory 8/7 02:20 0+10:20:33 R 0 0.0 glidein_startup.sh JOBS2=`condor_q gfactory -hold | grep gfactory | cut -f 1 -d ' '` for JOB in ${JOBS2} ; do condor_q ${JOB} | grep gfactory ; done 23 for JOB in ${JOBS2} ; do condor_release ${JOB} ; sleep 1 ; done 51 jobs; 0 idle, 35 running, 16 held JOBS3=`condor_q gfactory -hold | grep gfactory | cut -f 1 -d ' '` for JOB in ${JOBS3} ; do condor_q ${JOB} | grep gfactory ; done 23 for JOB in ${JOBS3} ; do condor_release ${JOB} ; sleep 1 ; done two more started running. 20:45 MINOS25 > condor_q gfactory | tail -1 14 jobs; 0 idle, 0 running, 14 held MINOS25 > condor_release gfactory User gfactory's job(s) released. MINOS25 > condor_q gfactory | tail -1 14 jobs; 14 idle, 0 running, 0 held 21:00 12 jobs; 0 idle, 0 running, 12 held MINOS25 > condor_release gfactory These are the processes without /Role=pilot 21:02:30 - removed the Role from the proxy 21:03:20 MINOS25 > condor_release gfactory 21:07 all processes are gone, including gfactory's condor_gridmanager gfactory@minos25: ./start_factory.sh 21:09 rm /minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE 172923.0 gfactory 8/7 21:12 0+00:00:00 I 0 0.0 glidein_startup.sh 21:17 172920.0 kreymer 8/7 21:10 0+00:00:25 R 0 0.0 probe 172923.0 gfactory 8/7 21:12 0+00:03:43 R 0 0.0 glidein_startup.sh RUN FINISHED Thu Aug 7 21:18:25 CDT 2008 ALL CLEAR ! 
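In hindsight the JOBS / JOBS2 / JOBS3 release passes above are one loop. A sketch of how that evening could have been scripted; the 10-pass limit and 5-minute wait are arbitrary choices.

#!/bin/sh
# Sketch: keep releasing held gfactory glideins until none are left,
# or give up after 10 passes.
PASS=0
while [ ${PASS} -lt 10 ] ; do
    HELD=`condor_q gfactory -hold | grep gfactory | cut -f 1 -d ' '`
    [ -z "${HELD}" ] && break
    for JOB in ${HELD} ; do condor_release ${JOB} ; sleep 1 ; done
    sleep 300
    PASS=`expr ${PASS} + 1`
done
condor_q gfactory | tail -1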
######## # FARM # ######## mysql overload on fnpcsrv1 Looked at top, 10 second interval, top - 14:13:06 up 1 day, 50 min, 9 users, load average: 11.99, 10.33, 8.30 PID USER PR NI %CPU TIME+ %MEM VIRT RES SHR S COMMAND 7748 minfarm 16 0 601 3740:44 2.3 437m 367m 3548 S mysqld 7748 minfarm 16 0 565 3741:40 2.3 436m 366m 3548 S mysqld Using 60 seconds of CPU in 10 seconds, 600%, It does not to show up on 'idle suppressed' top displays. Checking the Starting messages in the database log file, find SRV1> grep Starting /farm/minsoft2/Minossoft/dbm-cedar_phy/logs/dbm_checksum.log 2008-08-07 11:35:07 Starting pass 1 on BEAMMONSPILLVLD: 2008-08-07 11:44:08 Starting pass 2 on BEAMMONSPILLVLD: 2008-08-07 11:55:35 Starting pass 3 on BEAMMONSPILLVLD: 2008-08-07 12:07:54 Starting pass 4 on BEAMMONSPILLVLD: 2008-08-07 12:20:59 Starting pass 5 on BEAMMONSPILLVLD: 2008-08-07 12:33:13 Starting pass 6 on BEAMMONSPILLVLD: 2008-08-07 12:46:10 Starting pass 7 on BEAMMONSPILLVLD: 2008-08-07 12:58:17 Starting pass 8 on BEAMMONSPILLVLD: 2008-08-07 13:10:22 Starting pass 9 on BEAMMONSPILLVLD: 2008-08-07 13:22:20 Starting pass 10 on BEAMMONSPILLVLD: 2008-08-07 13:34:26 Starting pass 11 on BEAMMONSPILLVLD: 2008-08-07 13:46:47 Starting pass 12 on BEAMMONSPILLVLD: 2008-08-07 13:59:35 Starting pass 13 on BEAMMONSPILLVLD: 2008-08-07 14:14:02 Starting pass 14 on BEAMMONSPILLVLD: 2008-08-07 14:29:06 Starting pass 15 on BEAMMONSPILLVLD: 2008-08-07 14:44:53 Starting pass 16 on BEAMMONSPILLVLD: 2008-08-07 15:01:31 Starting pass 17 on BEAMMONSPILLVLD: 2008-08-07 15:13:55 Starting pass 18 on BEAMMONSPILLVLD: 2008-08-07 15:26:07 Starting pass 19 on BEAMMONSPILLVLD: 2008-08-07 15:39:07 Starting pass 20 on BEAMMONSPILLVLD: 2008-08-07 15:52:51 Starting pass 21 on BEAMMONSPILLVLD: 2008-08-07 16:03:51 Starting pass 22 on BEAMMONSPILLVLD: 2008-08-07 16:11:17 Starting pass 23 on BEAMMONSPILLVLD: 2008-08-07 16:18:50 Starting pass 24 on BEAMMONSPILLVLD: 2008-08-07 16:26:39 Starting pass 25 on BEAMMONSPILLVLD: 2008-08-07 16:34:18 Starting pass 26 on BEAMMONSPILLVLD: 2008-08-07 16:42:12 Starting pass 27 on BEAMMONSPILLVLD: 2008-08-07 16:47:04 Starting pass 1 on CALADCTOPESVLD: 2008-08-07 16:59:16 Starting pass 2 on CALADCTOPESVLD: 2008-08-07 17:03:46 Starting pass 3 on CALADCTOPESVLD: 2008-08-07 17:08:19 Starting pass 4 on CALADCTOPESVLD: 2008-08-07 17:36:23 Starting pass 5 on CALADCTOPESVLD: 2008-08-07 18:17:34 Starting pass 6 on CALADCTOPESVLD: 2008-08-07 18:57:08 Starting pass 7 on CALADCTOPESVLD: SRV1> grep Starting /farm/minsoft2/Minossoft/dbm-cedar/logs/dbm_checksum.log | cut -f 1-3 -d : 2008-08-07 19:20:22 Starting pass 1 on BEAMMONFILESUMMARYVLD 2008-08-07 19:20:24 Starting pass 1 on BEAMMONSPILLVLD 2008-08-07 19:20:37 Starting pass 1 on BEAMMONSWICPEDSVLD 2008-08-07 19:20:50 Starting pass 1 on CALADCTOPESVLD 2008-08-07 19:30:29 Starting pass 1 on CALADCTOPEVLD 2008-08-07 19:40:10 Starting pass 1 on FABPLNINSTALLVLD 2008-08-07 19:40:11 Starting pass 1 on PHOTONBLUESPECTRUMVLD 2008-08-07 19:40:11 Starting pass 1 on PHOTONELECTRONRANGEVLD 2008-08-07 19:40:20 Starting pass 1 on UGLIDBISCINTPLNSTRUCTVLD 2008-08-07 19:40:22 Starting pass 1 on UGLIDBISCINTPLNVLD Oops, wiped out /farm/minsoft2/Minossoft/dbm-cedar_phy/logs/dbm_checksum.log with one of my commands. 
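The 'Starting pass' timestamps are enough to see what each checksum pass costs. A sketch, run against the dbm-cedar log since the cedar_phy one is gone, assuming GNU date:

# Sketch: per-pass durations from the dbm_checksum Starting lines
CLOG=/farm/minsoft2/Minossoft/dbm-cedar/logs/dbm_checksum.log
grep Starting ${CLOG} | while read DAY TIME REST ; do
    date -d "${DAY} ${TIME}" +%s
done | awk 'NR > 1 { print $1 - prev " seconds" } { prev = $1 }'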
SRV1> wc -l /farm/minsoft2/Minossoft/dbm-cedar/reference_checksums/checksum_Fnal_CALADCTOPES.log 295 /farm/minsoft2/Minossoft/dbm-cedar/reference_checksums/checksum_Fnal_CALADCTOPES.log SRV1> wc -l /farm/minsoft2/Minossoft/dbm-cedar_phy/reference_checksums/checksum_Fnal_CALADCTOPES.log 159200 /farm/minsoft2/Minossoft/dbm-cedar_phy/reference_checksums/checksum_Fnal_CALADCTOPES.log SRV1> wc -l /farm/minsoft2/Minossoft/dbm-cedar/reference_checksums/checksum_Fnal_BEAMMONSPILL.log 10111 /farm/minsoft2/Minossoft/dbm-cedar/reference_checksums/checksum_Fnal_BEAMMONSPILL.log SRV1> wc -l /farm/minsoft2/Minossoft/dbm-cedar_phy/reference_checksums/checksum_Fnal_BEAMMONSPILL.log 9908 /farm/minsoft2/Minossoft/dbm-cedar_phy/reference_checksums/checksum_Fnal_BEAMMONSPILL.log ########## # CONDOR # ########## ID OWNER HELD_SINCE HOLD_REASON 171531.1 gfactory 8/5 13:12 Globus error 43: the job manager failed to 171531.3 gfactory 8/5 13:12 Globus error 17: the job failed when the jo 172568.5 gfactory 8/6 18:39 Globus error 17: the job failed when the jo 172568.6 gfactory 8/6 18:39 Globus error 43: the job manager failed to 172589.6 gfactory 8/6 19:54 Globus error 43: the job manager failed to 172589.7 gfactory 8/6 19:54 Globus error 17: the job failed when the jo ============================================================================= 2008 08 06 ============================================================================= ######## # FARM # ######## SRV1> Broadcast message from root (ttyS0) (Wed Aug 6 13:09:59 2008): The system is going down for reboot NOW! 15:24 - restarted kreymer@fnpcsrv1 ./bluwatch & ####### # AFS # ####### HOWTO.afssoftprod Will use d119 products d120 releases Adjusted HOWTO, saved old as HOWTO.afssoftprod.20080207 As before, a flood of Unable to set group-id messages { time up ${UPI} ${UPO} ; } 2>&1 | tee -a /minos/scratch/minsoft/afssoft/cloneproducts.log Unable to set group-id for /afs/fnal.gov/files/data/minos/d119/.growfschecksum to 1525 Unable to set group-id for /afs/fnal.gov/files/data/minos/d119/.growfsdir to 1525 ... 
Unable to set group-id for /afs/fnal.gov/files/data/minos/d119/catman/cat1/kcommon.1 to 1525 Unable to set group-id for /afs/fnal.gov/files/data/minos/d119/catman/cat1/bison.1 to 1525 real 17m53.617s user 0m2.322s sys 2m37.550s grep -v 'Unable to set' /minos/scratch/minsoft/afssoft/cloneproducts.log Scanned sizes of PLINKS 680 /afs/fnal.gov/files/code/e875/releases/GENIE 91 /afs/fnal.gov/files/code/e875/releases/LOG4CPP 3307 /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN 22630 /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT 841 /afs/fnal.gov/files/code/e875/releases/NEUGEN3 183 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 27 /afs/fnal.gov/files/code/e875/releases/stdhep Tested stdhep first, PLINKS=stdhep Looks OK, proceeded with PLINKS=' GENIE LOG4CPP MINOS_EXTERN MINOS_ROOT NEUGEN3 PYTHIA6 ' OK - copying GENIE Wed Aug 6 12:04:06 CDT 2008 680 /afs/fnal.gov/files/code/e875/releases/GENIE real 2m7.429s user 0m0.226s sys 0m20.850s OK - copying LOG4CPP Wed Aug 6 12:06:16 CDT 2008 91 /afs/fnal.gov/files/code/e875/releases/LOG4CPP real 1m2.998s user 0m0.171s sys 0m8.953s OK - copying MINOS_EXTERN Wed Aug 6 12:07:21 CDT 2008 3307 /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN real 20m4.292s user 0m2.805s sys 3m11.530s OK - copying MINOS_ROOT Wed Aug 6 12:27:53 CDT 2008 22630 /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT real 181m8.958s user 0m24.515s sys 22m48.879s OK - copying NEUGEN3 Wed Aug 6 15:34:45 CDT 2008 841 /afs/fnal.gov/files/code/e875/releases/NEUGEN3 real 2m20.528s user 0m0.349s sys 0m29.542s OK - copying PYTHIA6 Wed Aug 6 15:37:09 CDT 2008 183 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 real 0m44.693s user 0m0.097s sys 0m7.347s Before cleaning up symlinks, copy the other big slug of files, Mysql> AFSC=/afs/fnal.gov/files/code/e875 Mysql> RVOL=/afs/fnal.gov/files/data/minos/d120 # previously d199 Mysql> UPI=${AFSC}/general/minossoft Mysql> { time up ${UPI} ${RVOL} ; } 2>&1 \ > | grep -v 'Unable to set .*-id' \ > | tee /minos/scratch/minsoft/afssoft/cloneminos.log real 101m23.552s user 0m12.607s sys 9m27.300s ######## # FARM # ######## nwest updated the farm data, as needed for CPB processing of Run III. In farcat I see part of F00040942 ( 8 - 17 ) from 01:28 to 01:57. This is part of the 2008-06 part of the run ( 8 - 23 ) The first part is in 2008-05 ( 0 - 7 ) Scanning back, the last Run II cand seems to be F00038559 in 2007-08, dribbling over from 2007-07 sntp is entirely in 2007-07 Howie is running Run II FD CPB full bore now, round 11:15, as well as rest of MDC ######## # FARM # ######## Date: Wed, 06 Aug 2008 09:13:13 -0500 (CDT) Subject: HelpDesk ticket 119761 ___________________________________________ Short Description: Request fnpcsrv1 account, anf work node login for rbpatter, for Minos support Problem Description: Ryan Patterson ( rbpatter@fnal.gov ) of Caltech is joining the Minos support term, particularly working on Parrot and Condor support. Please create an account for him on fnpcsrv1, and give him interactive access to the worker nodes. ___________________________________________ Date: Fri, 08 Aug 2008 10:30:07 -0500 (CDT) Subject: Help Desk Ticket 119761 Has Been Resolved. Solution: Account rbpatter has been created on fnpcsrv1, and the gp grid workers. Steve Timm ___________________________________________________________________ ######### # ADMIN # ######### Default shell for new minos accounts is now /bin/bash, not FNALU shell. 
Ticket 118265 2008 07 07 ============================================================================= 2008 08 05 ============================================================================= ########## # CONDOR # ########## Updated glideafs10min.run per glideme.run, testing UID of jobs running via glidein,, test on fnpc344 UID 4716 condor_starter 7927 /bin/bash /grid/home/minos/... 7927 /bin/bash ./condor_startup.sh 7927 .../condor_master 7927 condor_startd -f 7927 condor_procd 43022 .../condor_starter 43022 condor_procd 43022 /bin/sh /minos/scratch/kreymer/condor/probe/probe 0 sleep 600 -bash-3.00$ id condor uid=4716(condor) gid=3302(condor) groups=3302(condor) #-bash-3.00$ id minos uid=7927(minos) gid=5111(numi) groups=5111(numi) -bash-3.00$ id minosgli uid=43022(minosgli) gid=5111(numi) groups=5111(numi) Ticket 119498 updated, > _________________________________________________________________ > Note To Requester: timm@fnal.gov sent this Notes To Requester: > > Actually you haven't added the role. glideins are still running > without any glidein role as user "minos" as they always have. > > Steve Timm > > > > _________________________________________________________________ The user jobs are running under the minosgli account, but the pilot is apparently remaining under minos. Here is a simplfied execution tree, from 'ps -axflwww' UID 4716 condor_starter 7927 /bin/bash /grid/home/minos/... 7927 /bin/bash ./condor_startup.sh 7927 .../condor_master 7927 condor_startd -f 7927 condor_procd 43022 .../condor_starter 43022 condor_procd 43022 /bin/sh /minos/scratch/kreymer/condor/probe/probe 0 sleep 600 4716 is condor 7927 is minos 43022 is minosgli It is no longer clear to me that we are running under glexec . I do not see anything like a glexec binary in this execution tree. We did upgrade the glideinWMS software back on 14 July. _________________________________________________________________ Date: Tue, 05 Aug 2008 15:48:53 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: With the new glexec, it exits immediately and you do not see the glexec executable keep running through the course of the job. But I assure you that unless it had been running successfully you would never have been able to change uid from minos to minosgli. You've actually implemented it backwards of the way we had intended the glidein role to be used. "gfactory" should have the glidein role in its proxy when it submits the glideins to fnpcfg1. The normal users should not. The glideins should be running as "minosgli" and the user processes they spawn should be running as "minos". Steve Timm _________________________________________________________________ Date: Tue, 05 Aug 2008 15:57:08 -0500 From: Sfiligoi Igor Hi Art. The fact that you do not see glexec is OK... this is as it should be. Regarding the UID: you mentioned in a previous mail that you added a role... Where did you do that? The factory one does not have one: [gfactory@minos25 ~]$ voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. 
Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : /home/gfactory/.grid/kreymer-condor.proxy timeleft : 132:14:19 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov attribute : /fermilab/minos/Role=NULL/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL timeleft : 12:14:18 Did you add a role to your user jobs? Igor _________________________________________________________________ You are both correct, I've gotten my proxies reversed. I have corrected the pilot role from the user proxy to gfactory, around 16:21. We happen to have gotten a fresh batch of pilots with the new proxies, just as I made the change. The accounts are now as you describe, pilots under minosgli, users under minos. Thanks ! _________________________________________________________________ ######### # CONDOR # ########## The email worked, alerting us to a held job : Date: Tue, 05 Aug 2008 13:14:17 -0500 From: Art Kreymer To: minos-admin@fnal.gov, fermigrid-help@fnal.gov, timm@fnal.gov Subject: Minos gfactory job held, details follow Tue Aug 5 13:14:17 CDT 2008 -- Submitter: minos25.fnal.gov : <131.225.193.25:64545> : minos25.fnal.gov ID OWNER HELD_SINCE HOLD_REASON 171531.1 gfactory 8/5 13:12 Globus error 43: the job manager failed to 171531.3 gfactory 8/5 13:12 Globus error 17: the job failed when the jo 2 jobs; 0 idle, 0 running, 2 held ########## # CONDOR # ########## Per pittam/brebel request, set pittam to better priority. MINOS25 > condor_userprio -setfactor pittam@fnal.gov 10. The priority factor of pittam@fnal.gov was set to 10.000000 ########## # CONDOR # ########## spotting users with excessively good priorities HOTS=`condor_userprio -all -allusers \ | grep -v gfactory \ | grep -v kreymer \ | grep -v rhatcher \ | grep ' 1.00 ' \ | cut -f 1 -d @ ` for HOT in ${HOTS} ; do printf "condor_userprio -setfactor ${HOT}@fnal.gov 100.\n" done MINOS25 > condor_userprio -setfactor rbpatter@fnal.gov 100. condor_userprio -all -allusers \ | grep -v gfactory \ | grep -v kreymer \ | grep -v rhatcher \ | grep ' 100.00 ' ############## # MINOS_DATA # ############## Need to return to cleanups, to make space for releases and products, and for user analysis MINOS26 > dds *.index | sort -n -k 5,5 ... 
-rw-r--r-- 1 rubin e875 66383 Oct 22 2007 mc_far.carrot.cedar.index -rw-r--r-- 1 rubin e875 83431 Oct 22 2007 mc_cosmic.bfld201.cedar.index -rw-r--r-- 1 rubin e875 87615 May 9 2007 mc_far.daikon_00.cedar.index -rw-r--r-- 1 rubin e875 118976 Oct 24 2007 mc_near.R1_18_2.index -rw-r--r-- 1 rubin e875 512600 Oct 31 2006 mc_near.carrot_06.cedar.index -rw-r--r-- 1 rubin e875 542620 Nov 1 2006 mc_near.carrot_06.R1_18_2.index -rw-rw-r-- 1 rubin e875 606080 Oct 24 2007 mc_near.daikon_00.cedar.index MINOS26 > less mc_near.R1_18_2.index | cut -f 1 -d / | sort -u recodata35 recodata36 recodata37 recodata38 recodata39 recodata40 recodata42 MINOS26 > less mc_near.carrot_06.R1_18_2.index | cut -f 1 -d / | sort -u recodata08 recodata13 recodata19 recodata21 recodata22 recodata35 recodata36 recodata37 recodata38 recodata39 recodata40 recodata42 recodata43 recodata44 recodata45 recodata46 recodata47 recodata48 recodata49 recodata50 recodata51 recodata53 recodata56 MINOS26 > for NN in 35 36 37 38 39 40 42 ; do ls ../recodata${NN} | cut -f 3 -d . | sort -u ; done R1_18_2 cbdl cnts recodata35 sntp snts R1_18_2 cbdl cnts recodata36 sntp snts R1_18_2 recodata37 R1_18_2 recodata38 sntp snts R1_18_2 recodata39 R1_18_2 recodata40 R1_18_2 recodata42 rubin@fnpcsrv1 cat shrc/kreymer cut/paste cd /afs/fnal.gov/files/data/minos/d10/indexes ./rvm _near.R1_18_2 noop SRV1> ./rvm _near.R1_18_2 This procedure will erase all _near.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Removing mc_near.R1_18_2. Removed 2288 files Removed net 2288 files SRV1> date Tue Aug 5 11:38:37 CDT 2008 ./rvm _near.carrot_06.R1_18_2 noop | less SRV1> ./rvm _near.carrot_06.R1_18_2 ; date many messages failing to remove nonexistent files, rm: cannot remove `../recodata38/n13011000_0000_L010170.sntp.R1_18_2.root': No such file or directory rm: cannot remove `../recodata40/n13011000_0000_L010185.sntp.R1_18_2.root': No such file or directory rm: cannot remove `../recodata42/n13011000_0000_L010200.sntp.R1_18_2.root': No such file or directory ... Updated rvm to print data, using /bin/bash, and to rm -f SRV1> ./rvm _near.carrot_06.R1_18_2 This procedure will erase all _near.carrot_06.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 11:47:22 CDT 2008 Removing mc_near.carrot_06.R1_18_2. Removed 10435 files Removed net 10435 files Tue Aug 5 11:48:33 CDT 2008 Cleaning up the rest of mc R1_18_2 -rw-r--r-- 1 rubin e875 21736 Sep 28 2006 mc_far.R1_18_2.index lrwxr-xr-x 1 rubin e791 24 Jan 27 2006 mc_far.beet.R1_18_2.index -> mc_far.v17.R1_18_2.index -rw-r--r-- 1 rubin e875 1560 Sep 28 2006 mc_far.v17.R1_18_2.index -rw-r--r-- 1 rubin e875 265 Feb 4 2006 mc_far.v17.R1_18_2a.index -rw-r--r-- 1 rubin e875 19292 Mar 17 2006 mc_fmock.carrot.R1_18_2.index -rw-r--r-- 1 rubin e875 1508 Mar 15 2006 mc_fmock.carrot_06.R1_18_2.index lrwxr-xr-x 1 rubin e791 25 Jan 27 2006 mc_near.beet.R1_18_2.index -> mc_near.v17.R1_18_2.index -rw-r--r-- 1 rubin e875 4108 Sep 28 2006 mc_near.v17.R1_18_2.index rm mc_far.beet.R1_18_2.index rm mc_near.beet.R1_18_2.index REL=_far.R1_18_2 REL=_far.v17.R1_18_2 REL=_far.v17.R1_18_2a REL=_fmock.carrot.R1_18_2 REL=_fmock.carrot_06.R1_18_2 REL=_near.v17.R1_18_2 ./rvm ${REL} noop | less SRV1> REL=_far.R1_18_2 SRV1> ./rvm ${REL} This procedure will erase all _far.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 11:59:45 CDT 2008 Removing mc_far.R1_18_2. 
Removed 418 files Removed net 418 files Tue Aug 5 11:59:47 CDT 2008 SRV1> REL=_far.v17.R1_18_2 SRV1> ./rvm ${REL} This procedure will erase all _far.v17.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 11:59:58 CDT 2008 Removing mc_far.v17.R1_18_2. Removed 30 files Removed net 30 files Tue Aug 5 11:59:58 CDT 2008 SRV1> REL=_far.v17.R1_18_2a SRV1> ./rvm ${REL} This procedure will erase all _far.v17.R1_18_2a ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 12:00:06 CDT 2008 Removing mc_far.v17.R1_18_2a. Removed 5 files Removed net 5 files Tue Aug 5 12:00:06 CDT 2008 SRV1> REL=_fmock.carrot.R1_18_2 SRV1> ./rvm ${REL} This procedure will erase all _fmock.carrot.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 12:00:15 CDT 2008 Removing mc_fmock.carrot.R1_18_2. Removed 371 files Removed net 371 files Tue Aug 5 12:00:17 CDT 2008 SRV1> REL=_fmock.carrot_06.R1_18_2 SRV1> ./rvm ${REL} This procedure will erase all _fmock.carrot_06.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 12:00:25 CDT 2008 Removing mc_fmock.carrot_06.R1_18_2. Removed 29 files Removed net 29 files Tue Aug 5 12:00:25 CDT 2008 SRV1> REL=_near.v17.R1_18_2 SRV1> ./rvm ${REL} This procedure will erase all _near.v17.R1_18_2 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Tue Aug 5 12:00:30 CDT 2008 Removing mc_near.v17.R1_18_2. Removed 79 files Removed net 79 files Tue Aug 5 12:00:31 CDT 2008 Identifying the empty disks : cd $MINOS_DATA/d10 fs listquota recodata* | sort -k 3,3 | head -12 Volume Name Quota Used %Used Partition nb.minos.d114 50000000 246 0% 59% nb.minos.d117 50000000 252 0% 60% nb.minos.d116 50000000 278 0% 59% nb.minos.d124 50000000 292 0% 53% nb.minos.d119 50000000 300 0% 60% nb.minos.d120 50000000 638 0% 53% nb.minos.d115 50000000 1405 0% 55% nb.data.minosd10 8000000 8446 0% 59% nb.minos.d198 50000000 59211 0% 53% nb.minos.d123 50000000 310819 1% 54% nb.minos.d125 50000000 2043136 4% 55% Noted that recodata17 points to ../d88, which was long ago given to cc. rm recodata17 Let's backlink these to the recodata links : for RECO in `ls -d recodata*` ; do USED=`fs listquota ${RECO} | grep '% ' | tr -s ' ' | cut -f 3 -d ' '` [ ${USED} -lt 10000 ] && printf " ${RECO} " && fs listquota ${RECO} | grep '% ' done recodata01 nb.data.minosd10 8000000 8446 0% 59% recodata37 nb.minos.d114 50000000 246 0% 59% recodata38 nb.minos.d115 50000000 1405 0% 55% recodata39 nb.minos.d116 50000000 278 0% 59% recodata40 nb.minos.d117 50000000 252 0% 60% recodata42 nb.minos.d119 50000000 300 0% 60% recodata43 nb.minos.d120 50000000 638 0% 53% recodata47 nb.minos.d124 50000000 292 0% 53% Remove the recodata* links presently empty for RD in 37 38 39 40 42 43 47 ; do ls -l recodata${RD} ; done for RD in 37 38 39 40 42 43 47 ; do rm recodata${RD} ; done Clean out the remnant data directories MINOS26 > for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type f ; done ../d115/recodata38/F00033307_0005.all.snts.R1_18_2.1.root ../d115/recodata38/F00033307_0005.spill.sntp.R1_18_2.1.root ../d115/recodata38/N00008029_0007.cosmic.snts.R1_18_2.1.root ../d115/recodata38/N00008029_0007.spill.sntp.R1_18_2.1.root ../d115/recodata38/n13021020_0017_L100200.sntp.R1_18_2.root MINOS26 > grep recodata38 indexes/*.index nothing, so these few files are unindexed strays. 
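For reference, the same check can be made file by file across all the remnant d1NN areas before removing anything. A minimal sketch, run from $MINOS_DATA/d10 as above (this is not an existing script; the filename is used as a plain grep pattern):
for DN in 14 15 16 17 19 20 24 ; do
  find ../d1${DN} -type f | while read FILE ; do
    BASE=`basename ${FILE}`
    # flag any file that no index references
    grep -q "${BASE}" indexes/*.index || printf "stray ${FILE}\n"
  done
done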
MINOS26 > dds ../d115/recodata38/*.root -rw-r--r-- 1 rubin e875 156754 Dec 14 2005 ../d115/recodata38/F00033307_0005.all.snts.R1_18_2.1.root -rw-r--r-- 1 rubin e875 158192 Dec 14 2005 ../d115/recodata38/F00033307_0005.spill.sntp.R1_18_2.1.root -rw-r--r-- 1 rubin e875 159699 Dec 15 2005 ../d115/recodata38/N00008029_0007.cosmic.snts.R1_18_2.1.root -rw-r--r-- 1 rubin e875 217387 Dec 15 2005 ../d115/recodata38/N00008029_0007.spill.sntp.R1_18_2.1.root -rw-r--r-- 1 rubin e875 453595 Dec 15 2005 ../d115/recodata38/n13021020_0017_L100200.sntp.R1_18_2.root MINOS26 > rm ../d115/recodata38/*.root for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type f ; done for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type l ; done for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type l -exec rm {} \; ; done for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type d -name reco\* ; done for DN in 14 15 16 17 19 20 24 ; do find ../d1${DN} -type d -name reco\* -exec rmdir {} \; ; done MINOS26 > for DN in 14 15 16 17 19 20 24 ; do ls -l ../d1${DN} ; done total 0 total 0 total 0 total 0 total 0 total 0 total 0 ============================================================================= 2008 08 04 ============================================================================= ########### # MONTHLY # ########### DATASETS 8/4 PREDATOR 8/4 VAULT 8/3 MYSQL 8/4 ######### # MYSQL # ######### HOWTO.dbarchive.20080804 MILESTONE - no more locking CRL during backups Rework table locking, pairwise, FLUSH TABLES ${TAB}, ${TAB}VLD LOCK TABLES ${TAB}, ${TAB}VLD READ 131 *VLD.MYD 266 *.MYD Non-VLD are : DBUVACHIPPEDS_OLD.MYD DBUVACHIPSPARS_OLD.MYD GLOBALSEQNO.MYD LOCALSEQNO.MYD ########## # CONDOR # ########## Try running with Role=pilot Edited /local/scratch25/grid/kproxy, adding /Role=pilot to -voms fermilab:/fermilab/minos/Role=pilot Just after 13:10, will have to let gfactory processes expire, MINOS25 > condor_q gfactory -- Submitter: minos25.fnal.gov : <131.225.193.25:64545> : minos25.fnal.gov ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 171085.2 gfactory 8/4 13:01 0+00:28:19 R 0 0.0 glidein_startup.sh touch /minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE 17:15 - rm /minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE gfactory jobs submitted at 17:18 MILESTONE - glidein jobs are running under minosgli account ! new proxies continue to get /Role=pilot ########## # CONDOR # ########## Started this around 10:32 CDT, to send email once in case of a future held gfactory job whose condor_q message include 'Globus', per request of Timm to be notified immediately. { while ! { condor_q gfactory -hold | grep -q Globus ; } ; do sleep 500 done { date ; condor_q gfactory -hold ; } | \ mail -s "Minos gfactory job held, details follow" \ minos-admin@fnal.gov,fermigrid-help@fnal.gov,timm@fnal.gov } & ####### # RAL # ####### Verified that kreymer@rl.ac.uk mail still is forwarded to kreymer@fnal.gov ============================================================================= 2008 08 02 ============================================================================= ########## # CONDOR # ########## see 08 01, released last block of 'tranfer' related held gfactory jobs still 31 others, clear them later. pawloski is running again ######## # FARM # ######## CPB near is fully concatenated, so killed looper. 
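For the record, the pairwise FLUSH/LOCK rework noted under the 2008 08 04 MYSQL entry above follows this general pattern. This is a minimal sketch only, not the dbarchive code: the database name, table name, data-file paths and archive area are placeholders, and the copy has to happen while the locking session is still open, here via the mysql client's system command:
DB=offline ; TAB=BEAMMONSPILL ; ARC=/data/archive/${DB}   # hypothetical names and paths
mysql ${DB} <<EOF
FLUSH TABLES ${TAB}, ${TAB}VLD ;
LOCK TABLES ${TAB} READ, ${TAB}VLD READ ;
system cp /var/lib/mysql/${DB}/${TAB}.MYD /var/lib/mysql/${DB}/${TAB}VLD.MYD ${ARC}/
UNLOCK TABLES ;
EOF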
============================================================================= 2008 08 01 ============================================================================= ######## # GRID # ######## Date: Sat, 02 Aug 2008 03:53:28 +0100 From: Jenny Thomas To: minos_authors@fnal.gov Subject: MINOS GRID INFRASTRUCTURE GROUP All, I am very happy to announce that Ryan Patterson has agreed to lead the new MINOS GRID INFRASTRUCTURE GROUP whose goal is to enable a ten fold increase in the MINOS processing capability. This will provide a desparately needed shot in the arm for our computing resources. That was the good stuff. The next thing of course is that Ryan needs volunteers to help him. I would like to point you to Doc-DB 4886 which lays out the tasks which need to be covered. I would like to ask you all to consider whether you might be willing to volunteer. The idea is that this would be a 2-3 month blitz to set up the infrastructue and then it would be routine maintenence after that. Some experience of unix systems is obviously necessary or a willingness to learn it quickly! I would point out that GRID usage is going to become the bread and butter of HEP physics analysis in the future and so this would be extremely good experience for more junior people although senior people are also encouraged to volunteer. Please respond to me in the first instance and ideally before the next collaboration meeting. Thanks, Jenny ########### # SCRATCH # ########### Disk is full, MINOS26 > df -h /minos/scratch Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/scratch 5.0T 5.0T 38G 100% /minos/scratch MINOS26 > du -sm /minos/scratch/* | sort -n du: `/minos/scratch/app/OSG1/vdt/extract': Permission denied du: `/minos/scratch/app/OSG1/vdt/backup': Permission denied du: `/minos/scratch/boehm/Extrapolation/SideBands/PidTweak/BadFiles': Permission denied du: `/minos/scratch/boehm/Extrapolation/SideBands/L250200/SysFiles': Permission denied du: `/minos/scratch/djauty/.TemporaryItems/folders.23559': Permission denied du: `/minos/scratch/pawloski/Nue/HornOn_HornOff/HornOn/macros/NueErrors': Permission denied du: `/minos/scratch/pawloski/Nue/Old/HornOn_HornOff/HornOn/macros/NueErrors': Permission denied du: `/minos/scratch/pawloski/Nue/Old/HornOn_HornOff/HornOn_PreAustinCuts/macros/NueErrors': Permission denied du: `/minos/scratch/pawloski/Old/test/.tmp/.files': Permission denied du: `/minos/scratch/pawloski/.tmp': Permission denied du: `/minos/scratch/rahaman/latex/BcForBeach2006': Permission denied du: `/minos/scratch/rahaman/latex/aps2006': Permission denied du: `/minos/scratch/rahaman/latex/beach2006': Permission denied du: `/minos/scratch/rahaman/latex/bcpaper': Permission denied du: `/minos/scratch/rahaman/latex/cdfnote': Permission denied du: `/minos/scratch/rahaman/latex/ckm06Proc': Permission denied du: `/minos/scratch/rahaman/latex/cv': Permission denied du: `/minos/scratch/rahaman/latex/style': Permission denied du: `/minos/scratch/rahaman/latex/talk': Permission denied du: `/minos/scratch/rahaman/latex/thesis': Permission denied du: `/minos/scratch/rearmstr/.ssh': Permission denied ... 
11255 /minos/scratch/asousa 11532 /minos/scratch/arms 11753 /minos/scratch/grashorn 12002 /minos/scratch/bishai 13144 /minos/scratch/masaki 14802 /minos/scratch/rbpatter 19195 /minos/scratch/sjc 19960 /minos/scratch/tagg 20928 /minos/scratch/annah1 23068 /minos/scratch/kimjj 25386 /minos/scratch/bspeak 25853 /minos/scratch/tinti 26276 /minos/scratch/jyuko 44281 /minos/scratch/med 46128 /minos/scratch/ahimmel 46539 /minos/scratch/vahle 47369 /minos/scratch/ochoa 49958 /minos/scratch/hartnell 51873 /minos/scratch/rodriges 55421 /minos/scratch/koskinen 59403 /minos/scratch/brebel 59769 /minos/scratch/bckhouse 61388 /minos/scratch/deb4 66207 /minos/scratch/evansj 69586 /minos/scratch/niki 72106 /minos/scratch/djauty 82008 /minos/scratch/petyt 84030 /minos/scratch/zarko 90560 /minos/scratch/pittam 92179 /minos/scratch/mishi 140503 /minos/scratch/jjling 176011 /minos/scratch/boehm 276778 /minos/scratch/rustem 342611 /minos/scratch/tjyang 433343 /minos/scratch/rmehdi 457522 /minos/scratch/scavan 474990 /minos/scratch/pawloski 629348 /minos/scratch/rahaman 902190 /minos/scratch/loiacono Date: Fri, 01 Aug 2008 17:33:30 -0500 (CDT) Subject: HelpDesk ticket 119604 ___________________________________________ Short Description: Please move 1 TB of quota from /minos/data to /minos/scratch Problem Description: LSC/CSI : We have recovered a lot of space from /minos/data, with about 5 TB free. We have run out of space in /minos/scratch. So please shift 1 TB of capacity back from /minos/data to /minos/scratch Thanks ! ___________________________________________ forwarded copy of ticket to rayp, romero, inkmann ___________________________________________ Date: Fri, 01 Aug 2008 17:45:32 -0500 From: Andrew Romero /minos/data ... decreased to 27TB /minos/scratch ... increased to 6TB ___________________________________________ ######## # FARM # ######## Date: Fri, 01 Aug 2008 15:34:19 -0500 From: Howard Rubin To: Minos_Batch batch Subject: Current status of Run III The Run III ND spill pass has completed and over 400 FD runs from 2008-06 have run. *ALL* of the FD jobs and 25% of the ND jobs have failed with FPE's, according to Steve's research, probably all in the calibrator. I am shutting down Run III processing until this is completely diagnosed and fixed. It may be necessary to rerun all of the ND as well as, of course the FD. Because of a logic error in my proxy renewal for grid processing, there was a problem early this morning which has caused a substantial backlog of 'apparently incomplete' jobs which have to be cleared out. This may not happen until the next cedar keep-up is due to start, so tentatively I have also shut down keep-up until the backlog clears and I'm sure I've removed the logic flaw. ########## # PARROT # ########## Test parallel parrots, First, try out stale test area on fnpc338 last changed Jul 31 16:42 388 > du -sm /local/stage1/condor/ 150 /local/stage1/condor/ 388 > /grid/app/minos/parrot/paloon 388 > du -sm /local/stage1/condor/ 150 /local/stage1/condor/ oops, potoential shared /local/stage1/kreymer is absent, 388 > du -sm /var/tmp/kreymer/ 336 /var/tmp/kreymer/ changed pallon to create /local/stage1/${LOGNAME} Set up file list, just the data part of the path without /pnfs/minos in a shared area. 
./samlocate "__set__ st-censmall" ./samlocate "__set__ st-censmall" | sort | while read FLINE ; do FILE=`echo ${FLINE} | cut -f 1 -d ' '` FPAT=`echo ${FLINE} | cut -f 2 -d ' '` printf "${FPAT/\/pnfs\/minos\/}/${FILE}\n" done > /minos/scratch/kreymer/condor/parrot/st-censmall.files Set up paloon and loonar to take optional process and file list. /grid/app/minos/parrot/paloon 3 /minos/scratch/kreymer/condor/parrot/st-censmall.files This works ! But we had the wrong files in st-censmall, too big ! Recreated the dataset, and the list, this time ordered by name. Test all the files , N=-1 while [ ${N} -lt 100 ] ; do (( N ++ )) /grid/app/minos/parrot/paloon ${N} /minos/scratch/kreymer/condor/parrot/st-censmall.files done Completed cleanly. Shifted to a less busy node, fnpc299 mkdir parrot cd parrot /grid/app/minos/parrot/paloon 3 /minos/scratch/kreymer/condor/parrot/st-censmall.files > 3.log 2>&1 & EXE=/grid/app/minos/parrot/paloon FILES=/minos/scratch/kreymer/condor/parrot/st-censmall.files for N in 0 1 2 3 4 5 6 7 8 9 ; do ${EXE} ${N} ${FILES} > ${N}.log 2>&1 & done Clear the boards, try fresh with empty cache -bash-3.00$ rm -r /local/stage1/kreymer cd ~/parrot/try2 for N in 0 1 2 3 4 5 6 7 8 9 ; do ${EXE} ${N} ${FILES} > ${N}.log 2>&1 & done -bash-3.00$ du -sm /local/stage1/kreymer/parrot/ 337 /local/stage1/kreymer/parrot/ ########## # CONDOR # ########## FYI , to switch to using account minosgli or minosana, just add the corresponding role to the glidin proxy, pilot or Analysis I'm going to wait till I have purged all the held gfactory jobs, probably next week. ############### # CONDORGLIDE # ############### Added flag to skip : [ -r "/minos/scratch/kreymer/condor/probe/SKIPCONDORGLIDE" ] && exit 0 ########### # ROUNDUP # ########### roundup.20080801 DFARM cleanup : DFARM is was used as flag for file purging, absent for purge of components, in PURGE GRID, set at that time required before purge from WRITE. This assures the purge of components, in case of messy restarts. ( a file gets into WRITE, components not purged ) Should change directory name to PURGED, Should remove the PURGED file as soon as the PURGE WRITE is complete. ######## # FARM # ######## DFARM cleanup SRV1> ls /export/stage/minfarm/ROUNDUP/DFARM | wc -l 124861 SRV1> ls /export/stage/minfarm/ROUNDUP/DFARM | grep cedar | wc -l 122924 SRV1> du -sm . 495 . Safety copy : SRV1> tar cf /minos/data/minfarm/maint/DFARM.tar . SRV1> tar tf /minos/data/minfarm/maint/DFARM.tar | grep root | wc -l 124898 MINOS26 > du -sm /minos/data/minfarm/maint/DFARM.tar 122 /minos/data/minfarm/maint/DFARM.tar Found 38 files in DFARM/tmp, vintage Jun 1 2007. Removed them. SRV1> rm -r /export/stage/minfarm/ROUNDUP/DFARM/tmp Remove an older release SRV1> find -name \*R1_24\* | wc -l 1936 SRV1> find -name \*R1_24\* -exec rm {} \; Remove candidates, easy pickens SRV1> find . -name \*cand\* | wc -l 60366 SRV1> time find . -name \*cand\* -exec rm {} \; real 5m35.302s user 0m19.702s sys 4m51.090s now monte carlo, not presently being concatenated SRV1> find . -name n\* | wc -l 17875 SRV1> find . -name f\* | wc -l 11547 SRV1> time find . -name n\* -exec rm {} \; real 1m36.460s user 0m6.008s sys 1m13.267s SRV1> time find . -name f\* -exec rm {} \; real 0m54.439s user 0m4.046s sys 0m48.083s Blow away D05 mdc files SRV1> time find . -name \*D05\* -exec rm {} \; real 0m0.945s user 0m0.103s sys 0m0.617s And Far files, none being written presently SRV1> find . -type f | wc -l 32822 SRV1> find . -type f -name F\* | wc -l 24472 SRV1> time find . 
-type f -name F\* -exec rm {} \; real 2m6.308s user 0m8.369s sys 1m38.664s SRV1> find . -type f | wc -l 8350 Grab the cedar_phy files, SRV1> find . -type f -name \*\.cedar_phy\.\* | wc -l 3676 SRV1> find . -type f -name \*\.cedar_phy\.\* | cut -f 5 -d . | uniq cedar_phy SRV1> time find . -type f -name \*\.cedar_phy\.\* -exec rm {} \; real 0m20.316s user 0m1.286s sys 0m15.412s Troll for more stuff SRV1> find . -type f | cut -f 5 -d . | sort -u cedar cedar_phy_bhcurv cedar_phy_srsafitter cedar_phy_srsafitterbx113 SRV1> find . -type f -name \*\.cedar_phy_srsafitter\* | wc -l 302 SRV1> time find . -type f -name \*\.cedar_phy_srsafitter\* -exec rm {} \; real 0m1.450s user 0m0.142s sys 0m1.205s Keepup cleaned up, can purge cedar SRV1> find . -type f -name \*\.cedar\.\* | wc -l 1503 SRV1> time find . -type f -name \*\.cedar\.\* -exec rm {} \; Now a final purge of slightly old files, SRV1> find . -type f -mtime +3 | wc -l 2453 ls -ltr -rw-rw-r-- 1 minfarm numi 29 Mar 28 18:46 N00008345_0002.spill.mrnt.cedar_phy_bhcurv.1.root drwxrwxr-x 16 minfarm numi 4096 Jul 24 21:44 ../ -rw-rw-r-- 1 minfarm numi 29 Jul 30 18:54 N00014166_0000.spill.mrnt.cedar_phy_bhcurv.0.root That makes sense, have purged all but CPB, last run last March. SRV1> find . -type f -mtime +3 -exec ls -l {} \; | tail -rw-rw-r-- 1 minfarm numi 29 Mar 28 18:38 ./N00008439_0011.spill.mrnt.cedar_phy_bhcurv.1.root ... SRV1> time find . -type f -mtime +3 -exec rm {} \; real 0m15.380s user 0m0.862s sys 0m12.335s SRV1> ls | wc -l 416 This is very healthy. Finish the purge after the upgrade to roundup, and the shift from DFARM to PURGED ########## # CONDOR # ########## The backlog has cleared, nothing runing in glidein beyond my 10-minute interval tests. The first batch of released held gfactory jobs has cleared. Release a batch of 100. JOBS=`condor_q gfactory -hold | grep transfer | head -100 | cut -f 1 -d ' '` for JOB in ${JOBS} ; do condor_release ${JOB} ; sleep 1 ; done for JOB in ${JOBS} ; do condor_q ${JOB} | grep gfactory ; done Released these around 10:00 All were running by about 10:50, about half already timed out. 13:12 - previous batch is clear, Ran the next batch, 576 jobs; 102 idle, 55 running, 419 held 170282.0 gfactory 7/31 07:05 0+00:00:00 H 0 0.0 glidein_startup.sh ... 170301.9 gfactory 7/31 08:45 0+00:00:00 H 0 0.0 glidein_startup.sh 17:15 - all clear, take another shot ( farm has ramped down ) 23:00 - all clear, another shot 219 jobs; 0 idle, 0 running, 219 held 2 August 192 jobs; 10 idle, 63 running, 119 held 12:40 - all clear, many other glideins running MINOS25 > printf "${JOBS}\n" | wc -l 88 192 jobs; 98 idle, 63 running, 31 held ######### # PROBE # ######### 10:00 Added space check of HOME, df -h ${HOME} du -sh ${HOME} ######## # FARM # ######## Just for completeness, now that other cleanup has been done, looking at state of May 2 vintage cedar_phy far. Forcing an update of ROUNTMP/LOG/cdar_phyfar.pend farcat 315 7614 all.sntp.cedar_phy.0.root 315 1332 spill.bntp.cedar_phy.0.root 315 888 spill.sntp.cedar_phy.0.root ./roundup -r cedar_phy far ============================================================================= 2008 07 31 ============================================================================= ########## # CONDOR # ########## Per sfiligoi advice, clearing out all the old held gfactory sections, by releasing them. Let's do 50 at a time, and start with the 'data transfer error's of today. 
JOBS=`condor_q gfactory -hold | grep transfer | head -50 | cut -f 1 -d ' '`
for JOB in ${JOBS} ; do condor_release ${JOB} ; sleep 1 ; done
for JOB in ${JOBS} ; do condor_q ${JOB} | grep gfactory ; done
Thu Jul 31 15:19:48 CDT 2008 initially all idle ( 170250.0 - 170259.9 )
Thu Jul 31 15:27:30 CDT 2008 most of these are running.
Plan : wait a half hour for them to time out, then do another batch.
Things have changed, pawloski has 1460 jobs queued up.
But his jobs only run on the AFS nodes, 64 of them.
On the other hand, there are 808 farm jobs running, so the glideins are not starting too quickly.
I'll have a look again tomorrow.
####### # WEB # #######
Created an easier-to-find DH home page
MIN > cd /afs/fnal.gov/files/expwww/numi/html/computing/dh
MIN > ln -s dhmain.html index.html
Added mdsum link
MIN > cp dhmain.20080403.html dhmain.20080731.html
MIN > ln -sf dhmain.20080731.html dhmain.html
############ # MCIMPORT # ############
The planned data archival using mcimport is complete.
######## # FARM # ########
> User: minos > > Email: rubin@fnal.gov > > FileSystem: fermigrid-home > > Total disk allocated (GB): 10.0 > > Percent disk used: 100.0%
SRV1> dds /export/blue2_home
drwxr-xr-x 1184 minos numi 1808384 Jul 31 11:17 minos/
drwxr-xr-x 3 minosana numi 2048 Dec 16 2007 minosana/
drwxr-xr-x 4 minosgli numi 12288 Dec 30 2007 minosgli/
drwxr-xr-x 4605 minospro numi 2160640 Jul 31 11:19 minospro/
drwxr-xr-x 3 minsoft numi 2048 Jan 22 2008 minsoft/
SRV1> df -h /export/blue2_home
Filesystem Size Used Avail Use% Mounted on
blue2.fnal.gov:/fermigrid-home 1004G 282G 723G 29% /grid/home
du -sm /grid/home/minos
du: cannot read directory `/grid/home/minos/gram_scratch_azzcRJplkx': Permission denied
du: cannot read directory `/grid/home/minos/gram_scratch_gMpBXjwcYD': Permission denied
du: cannot read directory `/grid/home/minos/gram_scratch_Es7lcr26pj': Permission denied
407 /grid/home/minos
SRV1> ls -l /grid/home/minos | grep drw | wc -l 1221
SRV1> ls -l /grid/home/minos | grep -v gram
total 408192
drwxr-xr-x 2 minos numi 2048 Aug 12 2005 0
-rw-r--r-- 1 minos numi 0 Jan 22 2008 foo
-rw-r--r-- 1 minos numi 12197 May 22 2006 wrapper.sh
-rw------- 1 minos numi 7660 May 22 2006 x509_proxy_in
SRV1> ls -l /grid/home/minos | wc -l 9553
SRV1> ls -l /grid/home/minos | grep gram_scratch | wc -l 1229
CONDOR errors in gfactory jobs, starting
170250.0 gfactory 7/31 04:30 Globus error 10: data transfer to the serve
170250.1 gfactory 7/31 04:30 Globus error 10: data transfer to the serve
ls -ldtr /grid/home/minos/gram_scratch_*
drwx------ 2 minos numi 2048 Jul 12 13:52 /grid/home/minos/gram_scratch_GqyUAmOXyw
drwx------ 2 minos numi 2048 Jul 13 04:08 /grid/home/minos/gram_scratch_BiWr7QCsMH
drwx------ 2 minos numi 2048 Jul 13 04:38 /grid/home/minos/gram_scratch_8rVn4pIePb
drwx------ 2 minos numi 2048 Jul 13 04:38 /grid/home/minos/gram_scratch_LIuonUwfsV
drwx------ 2 minos numi 2048 Jul 13 04:52 /grid/home/minos/gram_scratch_zuAusgD3En ...
drwx------ 2 minos numi 2048 Jul 13 09:56 /grid/home/minos/gram_scratch_8LT5ENcZEr drwx------ 2 minos numi 2048 Jul 13 10:45 /grid/home/minos/gram_scratch_syH6ohX48k drwx------ 2 minos numi 2048 Jul 24 06:20 /grid/home/minos/gram_scratch_pryJw1EdQF drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_WHgYnLpDYr drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_dHjsbrp3jM drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_bNGYJjYDPO There are also lots of stale gram_scratch areas under minospro, mainly from : Jul 1 21:06 thru Jul 2 16:13 Jul 11 10:17 thru Jul 11 22:11 Jul 15 13:27 Jul 15 17:30 Jul 16 05:57 Jul 17 12:14 Jul 18 19:34 Jul 18 19:39 Jul 18 20:59 Jul 23 15:17 Jul 23 15:17 Jul 23 15:17 Jul 24 06:25 Jul 24 23:09 thru Jul 25 16:48 Jul 26 11:35 Jul 27 04:37 Jul 28 15:23 Jul 28 17:24 Jul 28 17:25 Jul 28 17:26 Jul 28 17:26 Jul 28 17:26 Jul 28 17:27 Jul 28 22:32 Jul 28 22:32 Jul 28 22:37 Jul 29 09:40 thru Jul 29 17:17 Jul 30 07:31 Jul 30 07:37 Jul 30 11:30 thru Jul 30 12:15 Jul 30 20:58 thru current, Jul 31 11:51 Counting non-Jul 31 gram_scratch directories: SRV1> ls -ltr /grid/home/minospro | grep -v 'Jul 31' | wc -l 3779 SRV1> ls -ltr /grid/home/minos | grep -v 'Jul 31' | wc -l 7153 Somewhat better now, 13:15 SRV1> ls -ltr /grid/home/minospro | grep -v 'Jul 31' | wc -l 616 SRV1> ls -ltr /grid/home/minos | grep -v 'Jul 31' | wc -l 7075 Later, around 14:23, SRV1> ls -ltr /grid/home/minos | grep -v 'Jul 31' | wc -l 179 I see a few gfactory processes running, starting around 14:03 Rustem has removed his large stdout jobs, around Testing the release of a few gfactory jobs : condor_release 170382 Date: Thu, 31 Jul 2008 11:53:32 -0500 (CDT) Subject: HelpDesk ticket 119498 ___________________________________________ Short Description: /grid/home/minos quota used up - why ? Problem Description: The /grid/home/minos quota of 10 GB seems to have been suddenly used up. Analysis glidein jobs have stalled, as of early this morning. A quick scan of the area does not show evidence of user abuse. This is difficult, as I do not have an interactive minos login with access to /grid/home/minos. With what I can see, there are about 400 MBytes of visible files, plus over 1200 gram_scratch* directories. We have nothing like 1200 jobs running, so something must have gone wrong with the grid software. The time stamps are suspicious : ls -ldtr /grid/home/minos/gram_scratch_* drwx------ 2 minos numi 2048 Jul 12 13:52 /grid/home/minos/gram_scratch_GqyUAmOXyw drwx------ 2 minos numi 2048 Jul 13 04:08 /grid/home/minos/gram_scratch_BiWr7QCsMH drwx------ 2 minos numi 2048 Jul 13 04:38 /grid/home/minos/gram_scratch_8rVn4pIePb drwx------ 2 minos numi 2048 Jul 13 04:38 /grid/home/minos/gram_scratch_LIuonUwfsV drwx------ 2 minos numi 2048 Jul 13 04:52 /grid/home/minos/gram_scratch_zuAusgD3En .. drwx------ 2 minos numi 2048 Jul 13 09:56 /grid/home/minos/gram_scratch_8LT5ENcZEr drwx------ 2 minos numi 2048 Jul 13 10:45 /grid/home/minos/gram_scratch_syH6ohX48k drwx------ 2 minos numi 2048 Jul 24 06:20 /grid/home/minos/gram_scratch_pryJw1EdQF drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_WHgYnLpDYr drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_dHjsbrp3jM drwx------ 2 minos numi 2048 Jul 31 04:20 /grid/home/minos/gram_scratch_bNGYJjYDPO Something seems to have gone wrong around July 12 and 13, then again around 04:20 this morning. 
___________________________________________ This ticket is assigned to TIMM, STEVE of the CD-SF/GF/FGS. ___________________________________________
Date: Thu, 31 Jul 2008 12:16:38 -0500 (CDT)
Note To Requester: The recent move of fermigrid1 to fg1x1 led us to neglect to re-enable our cleanup cron-job. I'll re-enable it shortly and run it manually and it should clean everything up. Thanks for bringing this to our attention. Steve Timm
___________________________________________
Note To Requester: In the process of cleaning out the /grid/home/minos directory we discovered that the bulk of the quota was actually used up not by the glidein jobs going to fnpcfg1 but by a set of jobs that were submitted by Rustem, 20 or more of which produced stdout of 500+MB per job. The quota is only 10GB. If minos leadership wants to not have collisions, you should switch to running the minos glideins as user "minosgli" as we had previously discussed. That will keep the glidein jobs from interfering with the jobs of the unwashed minos users. If you need changes in priority among the various minos users we can do that too. Seeing that these stdouts were only written last night, the cleanup script will not remove them right away. We can intervene for those jobs that were removed, if necessary, or we can temporarily get the /grid/home/minos quota bumped up appropriately. Let us know what you want to do. Steve Timm
___________________________________________
Date: Thu, 31 Jul 2008 19:04:47 +0000 (UTC)
From: Arthur Kreymer
To: rustem@fnal.gov
Cc: minos-admin@fnal.gov, timm@fnal.gov
You have about 44 jobs running or queued on Fermigrid, most submitted around 7/30 20:51. These have exhausted the quota in the minos account, so none of these are likely to be producing useful output. The note from Steve Timm indicates that the jobs are producing about 1/2 GBytes in stdout ( perhaps more, as they are still running ). The grid system cannot handle this much data in stdout. This is not a local Fermilab limitation; it is intrinsic to all existing grid systems that I know of. Until these jobs are cleared out, nobody ( including you ) will be able to run analysis jobs on Fermigrid. Please cancel these jobs, and reorganize them to write data to the usual places ( local disk or /minos/data ). Thanks !
___________________________________________
Date: Thu, 31 Jul 2008 14:43:49 -0500 (CDT)
Solution: Jobs from one minos user were identified and cancelled. The stdout files that were left behind from the cancelled jobs were cleaned up on /grid/home/minos directory. Also minos quota will be boosted to 50GB so there will be less chance of running out of quota again. minos is now using only 68MB of the 10GB quota. Steve Timm
___________________________________________
Date: Mon, 04 Aug 2008 22:36:27 +0000 (UTC)
From: Arthur Kreymer
I have added the pilot role to the proxy used to run Minos glideins. The jobs are now running under the minosgli account. You can return the minos account quota to 10 GB. If you feel inclined to boost quotas to give more margin, please boost the minosgli account. At this point, I think this ticket can be closed. Thanks !
___________________________________________
########### # ENSTORE # ###########
The noaccess list is back with us. Can close ticket 119190. Sent request for same.
######## # FARM # ########
fmock roundup - how to do it ?
Previous runs were like, on 2007 08 29, ./roundup -M -r cedar_phy mockfar, but the mockfar directory does not exist.
As I recall, there was a symlink before we moved /minos/data/minfarm Let's recreate it : ln -s mcfmockcat /minos/data/minfarm/mockfarcat ./roundup -n -W -r cedar_phy_bhcurv mockfar This looks OK, all subruns would be added Separately. Last time we did this, no SAM was available for MC. Let's try one file : ./roundup -s F21930001_0000_L010185N_D05.mrnt -r cedar_phy_bhcurv mockfar The SAM declares fail, messages like Oops, no directories found like /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_05/*_data/L010185N So let's run without SAM, -M, as before mcfmockcat 156 2232 mrnt.cedar_phy_bhcurv.0.root 157 4514 sntp.cedar_phy_bhcurv.0.root ./roundup -M -r cedar_phy_bhcurv mockfar Thu Jul 31 10:37:52 CDT 2008 Thu Jul 31 12:11:12 CDT 2008 SRMCP copies were running just about 19 seconds/file. These are smallish, clustered at 15 MBytes and 27 MBytes ( find /minos/data/minfarm/WRITE -type f -name F\*D05\* -exec du -sm {} \; | cut -f 1 ) > /minos/scratch/kreymer/mcfmock.gpl printf 'plot "/minos/scratch/kreymer/mcfmock.gpl"' | gnuplot -persist Later files, like F21930001_0026_L010185N_D05.sntp.cedar_phy_bhcurv.0.root, copy in about 14 seconds. This is a larger file, 28 MBytes, so the change is not due to size. Did cleanup of WRITE, SRV1> ./roundup -M -W -r cedar_phy_bhcurv mockfar Thu Jul 31 15:25:50 CDT 2008 PURGING WRITE files 313 PURGED 278/313 Thu Jul 31 16:03:40 CDT 2008 PURGING WRITE files 35 PURGED 0/35 Drive LTO3_11 is still busy writing this data, to VOK682 Thu Jul 31 16:33:26 CDT 2008 PURGED 35/35 DONE ! ============================================================================= 2008 07 30 ============================================================================= ######## # FARM # ######## cedar_phy_bhcurv processing has started for Run III. A few near files are showing up, nearcat 17 118 spill.mrnt.cedar_phy_bhcurv.0.root 17 216 spill.sntp.cedar_phy_bhcurv.0.root about 18:30 CDT ./looper '-r cedar_phy_bhcurv near' & ############ # MCIMPORT # ############ TOP=daikon_04/L010000N/near # 10MB, 190-230 MCI3 > echo $RDIRS 700 701 702 703 704 705 706 for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport.20080730 -n -t ${TOP}/${DIR} | grep FILES done 700 278/278 TOTAL FILES 138/138 TOTAL FILES 701 311/311 TOTAL FILES 110/110 TOTAL FILES 702 309/309 TOTAL FILES 111/111 TOTAL FILES 703 310/310 TOTAL FILES 110/110 TOTAL FILES 704 307/307 TOTAL FILES 110/110 TOTAL FILES 705 305/305 TOTAL FILES 109/109 TOTAL FILES 706 182/182 TOTAL FILES 65/65 TOTAL FILES DIR=706 ./mcimport.20080730 -n -t ${TOP}/${DIR} ./mcimport.20080730 -t ${TOP}/${DIR} OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010000N/near/706/mcimport.log Wed Jul 30 15:33:14 CDT 2008 ... n11037060_0000_L010000N_D04-n11037060_0007_L010000N_D04.tar 8 n11037060_0000_L010000N_D04.tar.gz to n11037060_0007_L010000N_D04.tar.gz from 8 files, 1756725643 bytes tar 8 files, 1756733440 bytes (7797) rate 7 MB/sec ... 
ln -sf mcimport.20080730 mcimport # was mcimort.20080729 RDIRS='700 701 702 703 704 705' for DIR in ${RDIRS}; do ./mcimport -t ${TOP}/${DIR} done OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010000N/near/700/mcimport.log Wed Jul 30 18:26:49 CDT 2008 Thu Jul 31 12:49:21 CDT 2008 ######## # FARM # ######## MINOS26 > ./pnfsdirs fmock cedar_phy_bhcurv daikon_05 L010185N write Wed Jul 30 14:59:36 CDT 2008 ls -l /pnfs/minos/mcout_data/cedar_phy_bhcurv/fmock/daikon_05/L010185N/cand_data/000 minospro e875 298500705 Jul 29 14:42 F21930001_0000_L010185N_D05.cand.cedar_phy_bhcurv.0.root minospro e875 308399928 Jul 29 14:42 F21930001_0001_L010185N_D05.cand.cedar_phy_bhcurv.0.root minospro e875 305796428 Jul 29 17:05 F21930001_0002_L010185N_D05.cand.cedar_phy_bhcurv.0.root minospro e875 311444452 Jul 29 17:17 F21930001_0003_L010185N_D05.cand.cedar_phy_bhcurv.0.root ######## # FARM # ######## Date: Tue, 29 Jul 2008 17:04:05 -0500 (CDT) From: Matthew Strait mrnts for these two subruns are now in /minos/data/minfarm/farmtest_strait/mcnearcat/ ---------------------------------------------------- Someone moved these, as well as sntp's to /m/d/mf/mcnearcat yesterday, and the looper script picked them up. Moving the duplicate sntp's out of the way. Need to correct the DUP detection in roundup, it is clearly broken. SRV1> ls -l /minos/data/minfarm/mcnearcat/*bhhi* -rw-rw-r-- 1 minospro numi 65551338 Jul 29 16:39 /minos/data/minfarm/mcnearcat/n13037022_0007_L010185N_D04.sntp.cedar_phy_bhhi.root -rw-rw-r-- 1 minospro numi 66227908 Jul 29 16:44 /minos/data/minfarm/mcnearcat/n13037022_0010_L010185N_D04.sntp.cedar_phy_bhhi.root SRV1> mv /minos/data/minfarm/mcnearcat/*bhhi* /minos/data/minfarm/DUP/ ######## # FARM # ######## Date: Tue, 29 Jul 2008 17:08:53 -0500 From: Howard Rubin This was another split month. The 2 are running now. ------------------------------------------------------------------ SRV1> ./roundup -r cedar_phy_bhcurv far Wed Jul 30 09:38:08 CDT 2008 SRV1> ./roundup -w -r cedar_phy_bhcurv far Wed Jul 30 09:43:00 CDT 2008 PURGING WRITE files 4 SRV1> cat ../ROUNTMP/READ/SAM/F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root F00037962_0001.spill.mrnt.cedar_phy_bhcurv.1.root F00037962_0002.spill.mrnt.cedar_phy_bhcurv.0.root F00037962_0003.spill.mrnt.cedar_phy_bhcurv.0.root ... F00037962_0023.spill.mrnt.cedar_phy_bhcurv.0.root So now we have F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root which is in PNFS and eclared to SAM, with first two subruns blinded. Proposed to just remove all these files, and forget this run for mrnt. Per rubin's approved, did rm /pnfs/minos/reco_far/cedar_phy_bhcurv/mrnt_data/2007-04/F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root sam undeclare file F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root ls /minos/data/minfarm/farcat/F00037962* /minos/data/minfarm/farcat/F00037962_0000.spill.bmnt.cedar_phy_bhcurv.1.root /minos/data/minfarm/farcat/F00037962_0001.spill.bmnt.cedar_phy_bhcurv.1.root rm /minos/data/minfarm/farcat/F00037962* rm /minos/data/reco_far/cedar_phy_bhcurv/mrnt_data/2007-04/F00037962_0000.spill.mrnt.cedar_phy_bhcurv.1.root ============================================================================= 2008 07 29 ============================================================================= ######## # FARM # ######## Date: Tue, 29 Jul 2008 11:33:32 -0500 From: Howard Rubin The farcat cedar_phy_bhcurv cleanup has finished. There is a possible complication. 
I believe that at some point you may have renamed bmnt files to mrnt, getting rid of the original mrnt. The cleanup has produced both mrnt and bmnt, but because the cleanup didn't include all subruns, the newly produced bmnt set isn't complete. If you did rename, then all you have to do is rename the new bmnt to mrnt and run the concatenator. Otherwise I have to repeat the cleanup to produce the missing bmnt. I think it would also avoid confusion if the pass was renamed to 0. ------------------------------------------------------------- Reviewed files for the partial runs, for RUN in F00032654 F00036592 F00037962 F00037965 ; do echo ls -alF /minos/data/minfarm/farcat/${RUN}*spill.mrnt.cedar_phy_bhcurv.* done cd /minos/data/minfarm/farcat mv *spill.mrnt.cedar_phy_bhcurv.1.root /minos/data/minfarm/BAD/ RUNS=`ls *spill.bmnt.cedar_phy_bhcurv.1.root | cut -f 1 -d .` for RUN in ${RUNS} ; do ls ${RUN}.spill.bmnt.cedar_phy_bhcurv.1.root mv ${RUN}.spill.bmnt.cedar_phy_bhcurv.1.root \ ${RUN}.spill.mrnt.cedar_phy_bhcurv.0.root done Not quite all there, PEND - have 22/24 subruns for F00037962_*.spill.mrnt.cedar_phy_bhcurv.0.root 0 07/28 22:56 0 22 MISS 0000 0001 ############ # MCIMPORT # ############ mcimport.20080729 For better speed in -t mode, add a new path to local disk for the concatenated tar files. This had been TAPAT, instead use ${LOCAL} as used by -T ( TAPER ) ############ # MCIMPORT # ############ tarring up (-t) smaller files now TOP=daikon_04/L010185N_charm/near # 10MB, 120-260 MB MCI3 > echo $RDIRS 700 701 702 703 999 700 269/269 TOTAL FILES 701 296/296 TOTAL FILES 702 296/296 TOTAL FILES 703 30/30 TOTAL FILES 999 14/14 TOTAL FILES 14/14 TOTAL FILES Test with just DIR=703, as this code is stale ./mcimport -t ${TOP}/${DIR} OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N_charm/near/703/mcimport.log Tue Jul 29 11:01:50 CDT 2008 OK - writing 3 tarfiles Tue Jul 29 11:17:05 CDT 2008 Disk rate = 5.55 MB/sec. Exit status = 0. OK - purging 3 files ? Tue Jul 29 11:32:56 CDT 2008 PURGED n14037030_0000_L010185N_D04_charm-n14037030_0011_L010185N_D04_charm.tar PURGED n14037030_0012_L010185N_D04_charm-n14037030_0023_L010185N_D04_charm.tar PURGED n14037030_0024_L010185N_D04_charm-n14037030_0029_L010185N_D04_charm.tar Tue Jul 29 11:32:57 CDT 2008 Data rates are pretty lousy for these direct /m/d to enstore transfers, under 6 MB/sec. Upgraded to mcimport.20080729, for speed. Test on another smaller directory DIR=999 ./mcimport.20080729 -t ${TOP}/${DIR} OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N_charm/near/999/mcimport.log Tue Jul 29 14:03:44 CDT 2008 OOPS, ran the old version of the script ( failed to flush editor ) Cleaned out /minos/data/mcimport/TAR/daikon_04/L010185N_charm/near/999 ./mcimport.20080729 -t ${TOP}/${DIR} OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N_charm/near/999/mcimport.log Tue Jul 29 14:18:47 CDT 2008 Tar rates look better, 9 MB/sec, versus former 4. Encp rates are much better, (10.7 MB/S overall) (40.6 MB/S transfer) (38.8 MB/S overall) (39.1 MB/S transfer) (39.5 MB/S overall) (39.8 MB/S transfer) (56.7 MB/S overall) (57.2 MB/S transfer) But the purging code failed, the ecrc files are missing ? 
Corrected location of .ecrc files to /home/mindata/TAPE/ MCI3 > ./mcimport.20080729 -t ${TOP}/${DIR} OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N_charm/near/999/mcimport.log Tue Jul 29 14:57:40 CDT 2008 PURGED n14011011_0000_L010185N_D00_charm-n14011011_0006_L010185N_D00_charm.tar PURGED n14011011_0007_L010185N_D00_charm-n14011012_0002_L010185N_D00_charm.tar PURGED n14039991_0000_L010185N_D04_charm-n14039991_0006_L010185N_D04_charm.tar PURGED n14039991_0007_L010185N_D04_charm-n14039992_0002_L010185N_D04_charm.tar Tue Jul 29 14:57:47 CDT 2008 MCI3 > ln -sf mcimport.20080729 mcimport # was mcimport.20080728 RDIRS='700 701 702' for DIR in ${RDIRS}; do ./mcimport -t ${TOP}/${DIR} done ########## # CONDOR # ########## Date: Tue, 29 Jul 2008 08:45:38 -0500 (CDT) Subject: Help Desk Ticket 115222 Has Been Resolved. ___________________________________________________________________ Solution: Since the new glexec from osg 1.0.0 has been installed on the GP Grid cluster and elsewhere they have been able to see the environment variables correctly. ___________________________________________________________________ ============================================================================= 2008 07 28 ============================================================================= ######## # FARM # ######## Date: Mon, 28 Jul 2008 11:55:23 -0500 From: Howard Rubin The following runs in farcat appear to me to be complete, given what's in that directory plus what's in bad_runs. All are spill.mrnt.cedar_phy_bhcurv. F00031874 F00031939 F00032654 F00032997 F00033538 F00033570 F00035947 F00036563 F00037126 F00037752 F00038266 There are 3 additional runs which are not complete: F00036592 F00037962 F00037965 In the first 2 of these there is only 1/23 subrun *present* while the last has 6/23 missing. I've checked the logs for a couple of these 'missing' subruns and they were apparently written (on or about Dec. 20) along with the other ntuples, which were successfully concatenated. I'm going to rerun these to get the mrnt, but first I would like you to run the concatenator to see if what's complete gets put out. Then after I do the rerun I can go in and delete all but the mrnt, avoiding duplicates showing up. I'll also change the pass to 0 to avoid confusion. 
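Before running the concatenator, the subrun bookkeeping can be spot-checked by hand. A minimal sketch (this is not the roundup logic itself; the run, stream, and subrun count are just the F00037962 example from above):
RUN=F00037962 ; STRM=spill.mrnt.cedar_phy_bhcurv ; NSUB=24
for (( SR=0 ; SR<NSUB ; SR++ )) ; do
  SUB=`printf "%04d" ${SR}`
  # report subruns with no matching file in farcat, regardless of pass number
  ls /minos/data/minfarm/farcat/${RUN}_${SUB}.${STRM}.*.root > /dev/null 2>&1 || printf " MISS ${SUB}"
done ; printf "\n"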
SRV1> ./roundup -n -r cedar_phy_bhcurv far Mon Jul 28 12:22:01 CDT 2008 OK - 444 Mbytes in 14 runs PEND - have 7/8 subruns for F00031874_*.spill.mrnt.cedar_phy_bhcurv.0.root 233 12/07 19:32 0 7 MISS 0002 PEND - have 7/8 subruns for F00031939_*.spill.mrnt.cedar_phy_bhcurv.0.root 233 12/07 20:43 0 7 MISS 0001 PEND - have 5/14 subruns for F00032654_*.spill.mrnt.cedar_phy_bhcurv.0.root 219 12/21 16:42 0 5 MISS 0005 0006 0007 0008 0009 0010 0011 0012 0013 PEND - have 23/24 subruns for F00032997_*.spill.mrnt.cedar_phy_bhcurv.0.root 232 12/08 11:40 0 23 MISS 0019 PEND - have 1/2 subruns for F00033538_*.spill.mrnt.cedar_phy_bhcurv.0.root 232 12/08 18:29 0 1 MISS 0000 PEND - have 7/8 subruns for F00033570_*.spill.mrnt.cedar_phy_bhcurv.0.root 232 12/08 19:09 0 7 MISS 0007 PEND - have 23/24 subruns for F00035947_*.spill.mrnt.cedar_phy_bhcurv.0.root 231 12/09 15:22 0 23 MISS 0015 PEND - have 23/24 subruns for F00036563_*.spill.mrnt.cedar_phy_bhcurv.0.root 231 12/10 01:21 0 23 MISS 0015 PEND - have 1/24 subruns for F00036592_*.spill.mrnt.cedar_phy_bhcurv.0.root 219 12/21 16:44 0 1 MISS 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 0010 0012 0013 0014 0015 0016 0017 0018 0019 0020 0021 0022 0023 PEND - have 23/24 subruns for F00037126_*.spill.mrnt.cedar_phy_bhcurv.0.root 230 12/10 18:22 0 23 MISS 0022 SUPPRESS F00037752_0024.spill.mrnt.cedar_phy_bhcurv.0.root PEND - have 23/24 subruns for F00037752_*.spill.mrnt.cedar_phy_bhcurv.0.root 230 12/11 05:29 0 23 MISS 0012 PEND - have 1/24 subruns for F00037962_*.spill.mrnt.cedar_phy_bhcurv.0.root 228 12/12 15:00 0 1 MISS 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 0010 0011 0012 0013 0014 0015 0016 0017 0018 0019 0020 0021 0022 SUPPRESS F00037965_0024.spill.mrnt.cedar_phy_bhcurv.0.root PEND - have 18/24 subruns for F00037965_*.spill.mrnt.cedar_phy_bhcurv.0.root 228 12/12 14:59 0 18 MISS 0000 0002 0003 0005 0006 0007 PEND - have 23/24 subruns for F00038266_*.spill.mrnt.cedar_phy_bhcurv.0.root 228 12/12 20:52 0 23 MISS 0011 Date: Mon, 28 Jul 2008 13:01:52 -0500 From: Howard Rubin All inserted 'data' lines are from bad_runs.cedar_phy_bhcurv. You appear to not be using this for mrnt. If there's no sntp (or bntp) there's no mrnt. -------------------------------------------------- The roundup script was changed to get bad runs from bad_runs_mrcc.${REL} in April of 2007. It has been that way since then. But I see no *mrcc* files in /minos/data/minfarm/lists . Apparently we have never had an mrcc-only failure ? So I will change roundup to go back to using bad_runs.${REL} for mrnt data, for the present. SRV1> AFSS/roundup.20080728 -r cedar_phy_bhcurv far ########### # ROUNDUP # ########### SRV1> cp AFSS/roundup.20080728 . 
SRV1> ln -sf roundup.20080728 roundup # was roundup.20080722 Dropped bad_runs_mrcc , was specific to original mrcc tests ########## # CONDOR # ########## MINOS25 > condor_q -hold -- Submitter: minos25.fnal.gov : <131.225.193.25:64545> : minos25.fnal.gov ID OWNER HELD_SINCE HOLD_REASON 168302.2 gfactory 7/24 10:21 Globus error 17: the job failed when the jo 168302.4 gfactory 7/24 10:21 Globus error 43: the job manager failed to 168348.1 gfactory 7/24 13:57 Globus error 17: the job failed when the jo 168348.2 gfactory 7/24 13:57 Globus error 43: the job manager failed to 168411.3 gfactory 7/24 20:31 Globus error 43: the job manager failed to 168458.0 gfactory 7/25 01:51 Globus error 17: the job failed when the jo 168458.3 gfactory 7/25 01:51 Globus error 43: the job manager failed to 168652.0 gfactory 7/25 15:18 Globus error 43: the job manager failed to 168652.1 gfactory 7/25 15:18 Globus error 17: the job failed when the jo 168652.3 gfactory 7/25 15:18 Globus error 17: the job failed when the jo 168652.4 gfactory 7/25 15:18 Globus error 17: the job failed when the jo 169073.0 gfactory 7/27 03:52 Globus error 17: the job failed when the jo 169073.2 gfactory 7/27 03:52 Globus error 43: the job manager failed to 169109.0 gfactory 7/27 07:52 Globus error 17: the job failed when the jo 169109.3 gfactory 7/27 07:52 Globus error 43: the job manager failed to 169132.3 gfactory 7/27 10:11 Globus error 43: the job manager failed to 169148.1 gfactory 7/27 11:51 Globus error 17: the job failed when the jo 169148.2 gfactory 7/27 11:51 Globus error 17: the job failed when the jo 169148.3 gfactory 7/27 11:51 Globus error 43: the job manager failed to 169148.4 gfactory 7/27 11:51 Globus error 43: the job manager failed to 169176.1 gfactory 7/27 14:42 Globus error 17: the job failed when the jo 169176.3 gfactory 7/27 14:42 Globus error 43: the job manager failed to 169264.4 gfactory 7/28 00:30 Globus error 43: the job manager failed to 169407.4 gfactory 7/28 08:30 Globus error 43: the job manager failed to ... 
169471.0 gfactory 7/28 14:34 Globus error 43: the job manager failed to
169685.2 gfactory 7/29 07:10 Globus error 43: the job manager failed to
170368.0 gfactory 7/31 14:35 Globus error 43: the job manager failed to
170368.1 gfactory 7/31 14:19 Globus error 43: the job manager failed to
170368.2 gfactory 7/31 14:24 Globus error 43: the job manager failed to
170368.4 gfactory 7/31 14:14 Globus error 43: the job manager failed to
170368.9 gfactory 7/31 14:29 Globus error 43: the job manager failed to
MINOS25 > condor_q -l 169176.1
UserLog = "/home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/log/condor_activity_20080727_gpminos@t20_glexec@minos@my2.log"
GridResource = "gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor"
GlobalJobId = "minos25.fnal.gov#1217187710#169176.1"
EnteredCurrentStatus = 1217187727
HoldReason = "Globus error 17: the job failed when the job manager attempted to run it"
MINOS25 > condor_q -l 169176.3
UserLog = "/home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/log/condor_activity_20080727_gpminos@t20_glexec@minos@my2.log"
GridResource = "gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor"
GlobalJobId = "minos25.fnal.gov#1217187710#169176.3"
EnteredCurrentStatus = 1217187727
HoldReason = "Globus error 43: the job manager failed to stage the executable"
Date: Mon, 28 Jul 2008 11:46:48 -0500 (CDT)
Subject: HelpDesk ticket 119292
___________________________________________
Short Description: Minos glideinWMS pilot jobs seeing low level of grid errors on GPFARM
Problem Description: Recently, we have started seeing a few Minos glideinWMS jobs being held by Condor on our end, with messages like :
MINOS25 > condor_q -l 169176.1
UserLog ="/home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/log/condor_ activity_20080727_gpminos@t20_glexec@minos@my2.log"
EnteredCurrentStatus = 1217187727
HoldReason = "Globus error 17: the job failed when the job manager attempted to run it"
and
MINOS25 > condor_q -l 169176.3
GlobalJobId = "minos25.fnal.gov#1217187710#169176.3"
EnteredCurrentStatus = 1217187727
HoldReason = "Globus error 43: the job manager failed to stage the executable"
So far these are not doing a great deal of harm, our jobs run on other pilots. But they are cluttering up the local queues, and may indicate a grid problem. For more details, see the 2008 07 28 CONDOR entry in http://www-numi.fnal.gov/minwork/computing/dh/worklog.txt
The pilot jobs are submitted in batches of 5, most of which are OK. For example, here are 2 of a batch of 5 in the UserLog, including the .001. process which got held :
000 (169176.000.000) 07/27 14:41:50 Job submitted from host: <131.225.193.25:64545> ..
000 (169176.001.000) 07/27 14:41:50 Job submitted from host: <131.225.193.25:64545>
017 (169176.001.000) 07/27 14:42:06 Job submitted to Globus RM-Contact: fnpcfg1.fnal.gov:2119/jobmanager-condor JM-Contact: https://fnpcfg1.fnal.gov:40013/19561/1217187720/ Can-Restart-JM: 1 ..
027 (169176.001.000) 07/27 14:42:06 Job submitted to grid resource GridResource: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor GridJobId: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor https://fnpcfg1.fnal.gov:40013/19561/1217187720/ ..
017 (169176.000.000) 07/27 14:42:06 Job submitted to Globus RM-Contact: fnpcfg1.fnal.gov:2119/jobmanager-condor JM-Contact: https://fnpcfg1.fnal.gov:40019/19682/1217187720/ Can-Restart-JM: 1 ..
027 (169176.000.000) 07/27 14:42:06 Job submitted to grid resource GridResource: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor GridJobId: gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor https://fnpcfg1.fnal.gov:40019/19682/1217187720/ 012 (169176.001.000) 07/27 14:42:07 Job was held. Globus error 17: the job failed when the job manager attempted to run it Code 2 Subcode 17 005 (169176.000.000) 07/27 15:07:15 Job terminated. (1) Normal termination (return value 0) ___________________________________________ Note To Requester: timm@fnal.gov sent this Notes To Requester: These errors have been seen by a variety of users on fermigridosg1, fnpcosg1, and fnpcfg1. We always had a low level of them but they appear to have increased in frequency since the OSG 1.0 upgrade. We have an open ticket with condor_support on this already and as of this morning we received a debug version of one of the key condor executables which we have deployed, in hopes of figuring out what is causing this error. At the moment it does not seem to be related to any bluearc problems at all. In the case of the minos glidein jobs, do you have TRANSFER_EXECUTABLE set to TRUE or FALSE? Steve Timm ___________________________________________ This seems to be set to True, as has been the case since around April 23 when we moved to glexec. MINOS25 > cat /home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/job.condor # File: job.condor # Universe = grid Grid_Resource = gt2 fnpcfg1.fnal.gov:2119/jobmanager-condor globus_rsl = (condorsubmit=(universe vanilla)(requirements \"(ISMINOSAFS=?=True)\")) Executable = glidein_startup.sh Arguments = -v $ENV(GLIDEIN_VERBOSITY) -cluster $(Cluster) -name t20_glexec -entry gpminos -subcluster $(Process) -schedd $ENV(GLIDEIN_SCHEDD) -factory minos -web http://www-numi.fnal.gov/gfactory/stage/glidein_t20_glexec -sign 3962a6fa3b08256b9424992ea9f4b871028d589f -signentry 1e30f8c685e345d5be3a3c5f5b510515f0ed2d18 -signtype sha1 -descript description.87eflp.cfg -descriptentry description.87eflp.cfg -dir Condor -param_GLIDEIN_Client $ENV(GLIDEIN_CLIENT) $ENV(GLIDEIN_PARAMS) +GlideinFactory = "minos" +GlideinName = "t20_glexec" +GlideinEntryName = "gpminos" +GlideinClient = "$ENV(GLIDEIN_CLIENT)" +GlideinWebBase = "http://www-numi.fnal.gov/gfactory/stage/glidein_t20_glexec" +GlideinLogNr = "$ENV(GLIDEIN_LOGNR)" +GlideinWorkDir = "Condor" Transfer_Executable = True transfer_Input_files = transfer_Output_files = WhenToTransferOutput = ON_EXIT Notification = Never +Owner = undefined Log = entry_gpminos/log/condor_activity_$ENV(GLIDEIN_LOGNR)_$ENV(GLIDEIN_CLIENT).log Output = entry_gpminos/log/job.$(Cluster).$(Process).out Error = entry_gpminos/log/job.$(Cluster).$(Process).err stream_output = False stream_error = False Queue $ENV(GLIDEIN_COUNT) ___________________________________________ Date: Fri, 15 Aug 2008 09:02:54 -0500 (CDT) Note To Requester: We captured extra debug output from one of the minos glidein jobs yesterday that held with error 17 and have sent it to the Condor team and the OSG Troubleshooting team. Steve Timm ___________________________________________ Date: Thu, 02 Oct 2008 13:09:15 -0500 (CDT) Note To Requester: We have recently received a patch to the gahp_server binary of condor which has shown great promise thus far in reducing and eliminating these errors on fnpcsrv1 and fg1x1. We want to run with the patch on fnpcsrv1 and fg1x1 first for another week or so, until minos production solves its current difficulties and is able to ramp back up. 
It would then be possible for you to install the same patch on minos25 and it it should address the difficulties there as well. Steve Timm ___________________________________________ Date: Mon, 06 Oct 2008 10:16:48 -0500 (CDT) Note To Requester: We are in possession of a new debug/patched gahp_server executable. Since this has been installed on fg1x1 and fnpcsrv1 we have not seen any repeats of the globus errors 17 and 43 there. I suggest that it get installed on minos25 as well. Please contact me. Steve Timm ___________________________________________ Date: Tue, 11 Nov 2008 10:24:18 -0600 (CST) Solution: Per E-mail from MINOS they have not seen this error since they upgraded the condor within their glideins to condor 7.1.3. We will close this ticket for now but keep an eye on the larger problem in FermiGrid, which is not going to upgrade to condor 7.1.x for a couple of months yet. Steve Timm ######## # FARM # ######## SRV1> less cedar_phy_bhlomcnear.log Finished last purge Sun Jul 27 07:21:39 CDT 2008 SRV1> less cedar_phy_bhhimcnear.log Finished last purge Sun Jul 27 04:20:58 CDT 2008 PEND - have 28/30 subruns for n13037022_*_L010185N_D04.mrnt.cedar_phy_bhhi.root 15 07/11 18:13 0 28 MISS 0007 0010 Informed rubin via email ############ # MCIMPORT # ############ MCI3 > ln -sf mcimport.20080728 mcimport Put ECRC message inline, reduced encp verbose from 4 to 1 TOP=daikon_04/L150200N/near # 10MB, 480-540 MB Mon Jul 28 08:59:51 CDT 2008 Tue Jul 29 05:30:09 CDT 2008 ########### # ENSTORE # ########### Per email from kordosky, nwest Date: Mon, 28 Jul 2008 09:41:38 -0500 (CDT) Subject: HelpDesk ticket 119275 ___________________________________________ Short Description: Some web pages not available offsite Problem Description: Since the July 24 upgrades, several web pages are not visible to clients outside fnal.gov. Not all web pages are affected. Available pages include http://www-stken.fnal.gov/enstore/enstore_system.html and all links directly under this page, except the three listed here. Blocked pages include these links under the home page : Quota and Usage http://www-stken.fnal.gov/enstore/tape_inventory/VOLUME_QUOTAS Tape Inventory Summary http://www-stken.fnal.gov/cgi-bin/enstore_show_inv_summary_cgi.py Tape Inventory http://www-stken.fnal.gov/cgi-bin/enstore_show_inventory_cgi.py The same problem seems to exist under www-cdfen and www-d0en We did not see this problem in STKEN before the July 24 upgrade. ___________________________________________ Date: Mon, 28 Jul 2008 12:20:28 -0500 (CDT) kschu Note To Requester: I believe this is because the computer security webserver exemptions are tied to old hostnames that are no longer in use. I will request exemptions for the new hostnames, and this will be resolved as soon as possible. Thanks for letting us know. ___________________________________________ Date: Mon, 28 Jul 2008 12:20:29 -0500 (CDT) This ticket has been reassigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA Group. ___________________________________________ Tested this later 28 July, CFL is available again, not the cgi's Tested via ssh -1 cdfsoft@uchicago.edu ; mozilla --local ___________________________________________ Date: Thu, 31 Jul 2008 14:25:45 -0500 (CDT) Note To Requester: Exemption requests have been submitted. Can you please see whether the specified pages can be viewed off-site now? 
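( For reference, an off-site check like this could be scripted rather than done by hand in a browser. This is only a sketch, not the test actually used; it assumes curl is available on the remote host, e.g. the cdf.uchicago.edu login node used earlier. )
for URL in \
  http://www-stken.fnal.gov/enstore/tape_inventory/VOLUME_QUOTAS \
  http://www-stken.fnal.gov/cgi-bin/enstore_show_inv_summary_cgi.py \
  http://www-stken.fnal.gov/cgi-bin/enstore_show_inventory_cgi.py
do
  printf "%s " ${URL}
  curl -s -o /dev/null -w "%{http_code}\n" ${URL}   # 200 = visible, 403 = still blocked
done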
___________________________________________ The first page is available again, > Quota and Usage > http://www-stken.fnal.gov/enstore/tape_inventory/VOLUME_QUOTAS This is the area for which we had an immediate need. ( This area is still blocked for www-cdfen and www-d0en, but this is not a problem for Minos. ) > Tape Inventory Summary > http://www-stken.fnal.gov/cgi-bin/enstore_show_inv_summary_cgi.py > Tape Inventory > http://www-stken.fnal.gov/cgi-bin/enstore_show_inventory_cgi.py The latter two URL's still seem to be unavailable, at least at cdf.uchicago.edu , with messages as follows : Forbidden You don't have permission to access /cgi-bin/enstore_show_inv_summary_cgi.py on this server. Apache Server at www-stken.fnal.gov Port 80 ___________________________________________ Date: Fri, 29 Aug 2008 10:37:47 -0500 (CDT) Solution: I believe this problem to be resolved. Some pages were never intended to be viewed off-site, as part of policy. Main system pages are now available off-site after requesting web server exemptions from CST. This ticket was resolved by MESSER, TIM of the CD-SF/DMS/DSC/SSA group. ___________________________________________ ============================================================================= 2008 07 25 ============================================================================= ######### # ADMIN # ######### Ticket #: 119229 MINOS01 > cmd add_minos_user paschrei ########## # CONDOR # ########## Several loiacono jobs got held, trying to write loiacono.proxy to afs : MINOS25 > condor_history -l 168583.20 cat /minos/scratch/loiacono/condor_minosoft_output/log.168583.20 000 (168583.020.000) 07/25 10:46:58 Job submitted from host: <131.225.193.25:64545> ... 001 (168583.020.000) 07/25 12:07:49 Job executing on host: <131.225.166.130:61062> ... 006 (168583.020.000) 07/25 12:12:57 Image size of job updated: 124608 ... 006 (168583.020.000) 07/25 12:42:57 Image size of job updated: 125824 ... 006 (168583.020.000) 07/25 13:32:57 Image size of job updated: 189636 ... 007 (168583.020.000) 07/25 13:37:03 Shadow exception! Error from starter on vm2@26016@fnpc342.fnal.gov: STARTER at 131.225.166.130 failed to send file(s) to <131.225.193.25:65305>; SHADOW at 131.225.193.25 failed to write to file /afs/fnal.gov/files/home/room3/loiacono/work/minossoft/BeamDataPro/condor/loiacono.proxy: (errno 13) Permission denied 373213 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 012 (168583.020.000) 07/25 13:37:03 Job was held. Error from starter on vm2@26016@fnpc342.fnal.gov: STARTER at 131.225.166.130 failed to send file(s) to <131.225.193.25:65305>; SHADOW at 131.225.193.25 failed to write to file /afs/fnal.gov/files/home/room3/loiacono/work/minossoft/BeamDataPro/condor/loiacono.proxy: (errno 13) Permission denied Code 12 Subcode 13 ... 013 (168583.020.000) 07/25 14:13:12 Job was released. via condor_release (by user loiacono) ... 001 (168583.020.000) 07/25 14:14:04 Job executing on host: <131.225.166.122:65323> ... 006 (168583.020.000) 07/25 14:19:12 Image size of job updated: 190260 ... 005 (168583.020.000) 07/25 14:19:33 Job terminated. (1) Normal termination (return value 0) Usr 0 00:04:45, Sys 0 00:00:01 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:04:45, Sys 0 00:00:01 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 366975 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 740188 - Total Bytes Sent By Job 0 - Total Bytes Received By Job ... 
Strange, there are many loon processes using almost no CPU on fnpc342 Same thing on fnpc343, where the above job ran quickly : bash-3.00$ ps axf | grep loon 8874 pts/0 S+ 0:00 \_ grep loon 24475 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2007,01,11,2007,01,11) 8341 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2006,12,27,2006,12,27) 23010 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2007,01,06,2007,01,06) 5511 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2006,12,21,2006,12,21) 4389 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2006,12,20,2006,12,20) 7771 ? SN 0:01 | \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2006,12,26,2006,12,26) 30748 ? SN 0:01 \_ loon -bq BDProcess/macros/beamNtp_pre080110.C(2006,12,12,2006,12,12) These loon processes are claimed not to be reading from DCache, ######## # FARM # ######## Changed 'looper' script to take command options via parmeter Will do ./looper '-r cedar_phy_bhhi mcnear' & ./looper '-r cedar_phy_bhlo mcnear' & Fired these up round 17:45 ######### # ADMIN # ######### Suggested contacts for surplus equipment tape Gene Oleynik ( head of Data Storage and Dacheing in DMS in SF. CPU Bob Tschirhart ########## # CONDOR # ########## Testing new glideme.run, would like to avoid setting REMOTE_INITIALDIR, MINOS25 > condor_submit probeme.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 168576. MINOS25 > condor_submit glideme.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 168579. Removed REMOTE_INITIALDIR, fully specified logs MINOS25 > condor_submit glideme.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 168608. MINOS25 > grep PWD logs/glide/glide.168608.0.out PWD /local/stage1/condor/execute/dir_3962/glide_Ig4000/tmp/starter-tmp-dir-zolpWp/execute/dir_4929 Removed full spec from logs, probably uses Iwd. MINOS25 > condor_submit glideme.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 168611. MINOS25 > grep PWD logs/glide/glide.168611.0.out PWD /local/stage1/condor/execute/dir_3962/glide_Ig4000/tmp/starter-tmp-dir-rJq7IP/execute/dir_5675 Hacked glideme.run to transfer a file, run probefile. MINOS25 > condor_submit glideme.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 168613. MINOS25 > less logs/glidefile/glide.168613.0.out ########## # FILE # ########## Looking at the file ------------------- THIS IS A TEST FILE. I AM HERE ! ------------------- RUN FINISHED Fri Jul 25 11:43:37 CDT 2008 ########## # CONDOR # ########## MINOS25 > condor_q -hold -- Submitter: minos25.fnal.gov : <131.225.193.25:64545> : minos25.fnal.gov ID OWNER HELD_SINCE HOLD_REASON 168302.2 gfactory 7/24 10:21 Globus error 17: the job failed when the jo 168302.4 gfactory 7/24 10:21 Globus error 43: the job manager failed to 168348.1 gfactory 7/24 13:57 Globus error 17: the job failed when the jo 168348.2 gfactory 7/24 13:57 Globus error 43: the job manager failed to 168411.3 gfactory 7/24 20:31 Globus error 43: the job manager failed to 168458.0 gfactory 7/25 01:51 Globus error 17: the job failed when the jo 168458.3 gfactory 7/25 01:51 Globus error 43: the job manager failed to ########## # PARROT # ########## Found 100 small files for parroting, same place as our reference small file, /pnfs/minos/fardet_data/2005-04/F000310* Sizes range 40 to 54 KB. 
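For the record, the size range could be confirmed with something like this sketch ( not the command run at the time ; assumes plain ls sizes are adequate for these small raw files on the pnfs mount ) :
ls -l /pnfs/minos/fardet_data/2005-04/F000310*.mdaq.root | awk '{ print $5 }' | sort -n | \
  awk 'NR == 1 { MIN = $1 } { MAX = $1 ; N++ }
       END { printf "%d files , %.0f KB to %.0f KB\n", N, MIN/1024, MAX/1024 }'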
Should make a dataset st-100small, Runs 3100 through 3199 SAMDIM=' DATA_TIER raw-far and RUN_NUMBER >= 3100 and RUN_NUMBER <= 3199 ' sam list files --dim="${SAMDIM}" ST100=`sam list files --dim="${SAMDIM}" --nosummary | sort` MINOS26 > for FILE in ${ST100} ; do printf "${FILE} " ; ./dc_stat ${FILE} | grep stkendca ; done F00003100_0000.mdaq.root w-stkendca7a-1 F00003101_0000.mdaq.root w-stkendca9a-3 F00003102_0000.mdaq.root w-stkendca9a-3 ... F00003197_0000.mdaq.root w-stkendca11a-3 F00003198_0000.mdaq.root w-stkendca7a-1 r-stkendca4a-1 F00003199_0000.mdaq.root w-stkendca9a-3 sam create definition \ --definitionName='st-censmall' \ --dimensions="${SAMDIM}" \ --group='minos' DatasetDefinition saved with definitionId = 5201 sam list files --dim="__set__ st-censmall" 2008 08 01 OOPS, this was wrong, file selection should have been SAMDIM=' DATA_TIER raw-far and RUN_NUMBER >= 31000 and RUN_NUMBER <= 31099 ' sam list files --dim="${SAMDIM}" File Count: 100 Average File Size: 48.90KB Total File Size: 4.78MB Total Event Count: 690 sam delete dataset definition --definitionName st-censmall sam create definition \ --definitionName='st-censmall' \ --dimensions="${SAMDIM}" \ --group='minos' DatasetDefinition saved with definitionId = 5215 ############ # MCIMPORT # ############ TOP=daikon_04/L150200N/near # 10MB, 480-540 MB Fri Jul 25 10:20:10 CDT 2008 Fri Jul 25 19:20:19 CDT 2008 ######## # FARM # ######## ./roundup -r cedar_phy_bhlo mcfar Fri Jul 25 09:42:15 CDT 2008 PURGED 832/832 Fri Jul 25 09:45:12 CDT 2008 bhhi mcnear is already done ./roundup -r cedar_phy_bhhi mcnear Fri Jul 25 10:08:11 CDT 2008 PURGED 6/6 Fri Jul 25 13:22:34 CDT 2008 ./roundup -r cedar_phy_bhhi mcnear & Fri Jul 25 13:25:18 CDT 2008 PURGING WRITE files 44 ./roundup -r cedar_phy_bhlo mcnear & Fri Jul 25 13:23:32 CDT 2008 ########### # CONDOR # ########## MINOS25 > condor_q kreymer 166375.0 kreymer 7/17 08:40 0+00:00:00 X 0 0.0 probe MINOS25 > condor_rm 166375.0 Job 166375.0 already marked for removal MINOS25 > condor_rm -force 166375.0 Job 166375.0 removed locally (remote state unknown) MINOS25 > dds -tr logs/glideafs/*out | tail -rw-r--r-- 1 kreymer g020 3844 Jul 25 08:40 logs/glideafs/probe.168524.0.out -rw-r--r-- 1 kreymer g020 3675 Jul 25 08:53 logs/glideafs/probe.168525.0.out -rw-r--r-- 1 kreymer g020 3897 Jul 25 09:00 logs/glideafs/probe.168528.0.out -rw-r--r-- 1 kreymer g020 6317 Jul 25 09:10 logs/glideafs/probe.168543.0.out -rw-r--r-- 1 kreymer g020 6637 Jul 25 09:23 logs/glideafs/probe.168565.0.out Difference is due to number of jobs running on each node. loiacono has just submitted a slug o jobs. ############# # CHECKLIST # ############# FTPLOG 5 Fri Jul 25 02:10:48 CDT 2008 557 3601 Fri Jul 25 03:20:49 CDT 2008 1 5 Fri Jul 25 03:30:54 CDT 2008 557 NOACCESS missing DATA missing ########## # DCACHE # ########## http://www-numi.fnal.gov/computing/dh/ftplog/2008/07/25.txt 5 Fri Jul 25 02:10:48 CDT 2008 557 3601 Fri Jul 25 03:20:49 CDT 2008 1 5 Fri Jul 25 03:30:54 CDT 2008 557 Date: Fri, 25 Jul 2008 09:10:28 -0500 (CDT) Subject: HelpDesk ticket 119196 ___________________________________________ Short Description: FNDCA - glitch in weak ftp access around 2008/07/25 02:20 Problem Description: I test the availability of weak ftp access every 10 minutes, from minos26. This is done by listing a small directory, /pnfs/minos/beam_data/2004-12 . The listing at 02:20 this morning failed after 1 hour. These numbers are the elapsed time, time stamps, and size of the listings. 
5 Fri Jul 25 02:10:48 CDT 2008 557 3601 Fri Jul 25 03:20:49 CDT 2008 1 5 Fri Jul 25 03:30:54 CDT 2008 557 This does not cause me a problem, but may be a symptom of some deeper problem. minos26 also tests connectivity to bluearc every minute, no problem seen there. ___________________________________________ Date: Fri, 25 Jul 2008 16:29:42 -0500 (CDT) Vladimir fixed the problem. ########## # DCACHE # ########## BILLING http://fndca3a.fnal.gov/dcache/billing.html looks very empty http://fndca2a.fnal.gov:8090/dcache/outplot?lvl=0&filename=billing.week.brd.png no response OLD PLOTS http://fndca2a.fnal.gov:8090/dcache/lsplots no response fron fndca2a NEW PLOTS http://fndca2a.fnal.gov:9090/lps/plots/src/plots.lzx cannot establish connection http://fndca3a.fnal.gov/cgi-bin/dcache_files.py At around 08:50, the latest transfers listed are at about 07:32:40 by minospro Date: Fri, 25 Jul 2008 09:02:29 -0500 (CDT) Subject: HelpDesk ticket 119192 ___________________________________________ Short Description: Billing and other web monitoring data is missing from FNDCA since the shutdown Problem Description: dcache-admin : Since yesterday's shutdown, several web pages are missing or abnormal : BILLING http://fndca3a.fnal.gov/dcache/billing.html looks very empty http://fndca2a.fnal.gov:8090/dcache/outplot?lvl=0&filename=billing.week.brd png no response OLD PLOTS http://fndca2a.fnal.gov:8090/dcache/lsplots no response fron fndca2a NEW PLOTS http://fndca2a.fnal.gov:9090/lps/plots/src/plots.lzx cannot establish connection http://fndca3a.fnal.gov/cgi-bin/dcache_files.py At around 08:50, the latest transfers listed are at about 07:32:40 by minospro ___________________________________________ Date: Fri, 25 Jul 2008 16:27:27 -0500 the developer says these are fixed now....please click away... ########### # ENSTORE # ########### http://www-stken.fnal.gov/enstore/tape_inventory/NOACCESS The requested URL /enstore/tape_inventory/NOACCESS was not found on this server. Date: Fri, 25 Jul 2008 09:02:06 -0500 (CDT) Subject: HelpDesk ticket 119190 ___________________________________________ Short Description: NOACCESS list is missing Problem Description: enstore-admin : ince yesterday's SDE upgrade of STKEN, the list of NOACCESS tapes is missing from http://www-stken.fnal.gov/enstore/tape_inventory/NOACCESS The requested URL /enstore/tape_inventory/NOACCESS was not found on this server. ___________________________________________ Date: Fri, 25 Jul 2008 16:27:27 -0500 the developer says these are fixed now....please click away... ___________________________________________ The list actually returned round July 31 ============================================================================= 2008 07 24 ============================================================================= ############ # STARTUP # ############ kreymer@minos26 crontab crontab.dat mindata@minos26 crontab crontab.dat minfarm@fnpcsrv1 ############ # SHUTDOWN # ############ DOWNTIMES Thursday 24 July 06:00 - 06:20 (06:29) BlueArc 06:30 Enstore drain 07:15 - 18:00 (21:17) Enstore and DCache ; FermiGrid reracking ( not down ) 07:30 - 07:45 (07:45) AFS data servers 09:30 - 10:00 (09:55) SAM database MDSUM_LOG The mdsum_log script was still running at 06:00. Restarted it via cron 29891 ? Ss 0:00 /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 29905 ? S 0:00 \_ /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 25478 ? S 0:00 \_ /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log 25479 ? 
S 0:00 \_ du -sm mcimport/TAR 25480 ? S 0:00 \_ tr -d Killed all these pid's. BLUWATCH The STOP file did not get created for bluwatch Reported slow access : fnpcsrv1.txt 11-Jul-2008 19:10 86 minos-sam03.txt 24-Jul-2008 06:18 87 minos01.txt 24-Jul-2008 06:18 87 minos25.txt 19-Jul-2008 17:29 86 minos26.txt 15-Jul-2008 12:20 86 The right way to touch the STOP file would have been echo "/usr/krb5/bin/kcron ; /usr/krb5/bin/aklog ;touch /afs/fnal.gov/files/data/minos/log_data/bluwatch/TEST" | at 07:57 echo " /usr/krb5/bin/kcron /usr/krb5/bin/aklog touch /afs/fnal.gov/files/data/minos/log_data/bluwatch/TEST touch /afs/fnal.gov/files/data/minos/log_data/bluwatch/TEST2 " | at 08 09 AFS was on schedule, 06:30 to 07:45 SAM was on schedule, 09:30 to 09:53 Tested per HOWTO.sam UNI=prd for N in 1 2 3 ; do echo ${N} date ./sam_test_py minos ${UNI} st-onesmall ./sam_test_py minos ${UNI} st-ten ./sam_test_py minos ${UNI} st-cen done ; date DCACHE PNFS logging came up at 18:43 FTP logging started to recover 19:56, looked normal starting at 21:17 From: Ken Schumacher Date: Thu, 24 Jul 2008 18:09:25 -0500 We have encountered a few problems... Date: Thu, 24 Jul 2008 20:15:02 -0500 We have overcome most of our problems... Date: Thu, 24 Jul 2008 21:32:20 -0500 The public dCache services are back on-line. Enstore is ready except Date: Thu, 24 Jul 2008 23:03:46 -0500 We thank you for your patience and we apologize for the extended ... On behalf of the whole team from DMS, Good Night. ########## # PARROT # ########## /grid/app/minos/parrot paloon - script to run loon on a raw data file, under parrot loonar - loon script run by paloon 388 > /grid/app/minos/parrot/paloon ######### # ADMIN # ######### 13:00 Cannot ssh to minos04 or minos12 ssh_exchange_identification: Connection closed by remote host MINOS04 > tail /var/log/messages ... Jul 23 05:43:24 minos04 sshd(pam_unix)[26376]: session opened for user djauty by (uid=0) Jul 24 13:06:30 minos04 login: kreymer preauthenticated login on pts/0 from minos-93198.dhcp MINOS12 > tail /var/log/messages ... Jul 23 07:59:37 minos12 sshd(pam_unix)[22780]: session opened for user jyuko by jyuko(uid=0) Jul 23 17:18:50 minos12 sshd(pam_unix)[9686]: session opened for user kreymer by (uid=0) Jul 23 21:45:11 minos12 sshd: pam_krb5[10080]: authentication fails for 'jyuko' (jyuko@FNAL.GOV): Authentication service cannot retrieve authentication info. (Cannot contact any KDC for requested realm) Jul 24 13:00:15 minos12 kernel: nfs: server stkensrv1 not responding, still trying Jul 24 13:07:34 minos12 login: kreymer preauthenticated login on pts/0 from minos-93198.dhcp 13:10 submitted ticket Date: Thu, 24 Jul 2008 13:12:52 -0500 (CDT) Subject: HelpDesk ticket 119161 ___________________________________________ Short Description: ssh logins fail to minos04 and minos12 Problem Description: run2-sys : I cannot ssh to minos04 or minos12. I can rsh to them, and they look OK on the surface. MIN > date Thu Jul 24 18:10:06 UTC 2008 MIN > ssh -v minos04 OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos04 [131.225.193.4] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 ... ___________________________________________ Date: Thu, 24 Jul 2008 13:18:23 -0500 (CDT) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. 
___________________________________________ Date: Thu, 24 Jul 2008 13:46:08 -0500 (CDT) Resolved: killed off stuck ssh/pam processes and ssh access was restored. ___________________________________________ ___________________________________________ ########### # GNUPLOT # ########### On my desktop and laptop, yum install gnuplot ######## # MAIL # ######## Removed RFC2369 headers from lists for which they are not appropriate, to eliminate the PINE messages minos-users [ Note: This message contains email list management information ] To disable the headers, added to the head of the options list, Misc-Options= NO_RFC2369 Need to get ownership of some other lists minos_sam_admin minosdb-support MINOS-ACCOUNTS ? MINOS-SAM-USERS ? ############ # BLUWATCH # ############ bluwatch.20080724 Added SKIP control, gentler than STOP Added usage comments ln -sf bluwatch.20080724 bluwatch # was bluwatch.20080707 Thu Jul 24 14:05:44 UTC 2008 ########## # CONDOR # ########## Removed ancient stuck probe job, 166375.0 kreymer 7/17 08:40 6+23:31:16 R 0 0.0 probe 166375.0 kreymer 7/17 08:40 6+03:36:35 vm2@12501@fnpc346.fnal.gov MINOS25 > condor_rm 166375.0 Job 166375.0 marked for removal 166375.0 kreymer 7/17 08:40 0+00:00:00 X 0 0.0 probe Let this sit a day, then remove again. Also, let's clean up again the held pilots 167672.1 gfactory 7/22 02:10 0+07:11:56 H 0 0.0 glidein_startup.sh 167693.3 gfactory 7/22 03:09 0+06:46:58 H 0 0.0 glidein_startup.sh MINOS25 > condor_rm 167672.1 Job 167672.1 marked for removal MINOS25 > condor_rm 167693.3 Job 167693.3 marked for removal 167672.1 gfactory 7/22 02:10 0+07:11:56 X 0 0.0 glidein_startup.sh 167693.3 gfactory 7/22 03:09 0+06:46:58 X 0 0.0 glidein_startup.sh A minute later, they were gone. ============================================================================= 2008 07 23 ============================================================================= ############ # SHUTDOWN # ############ Prepared for PNFS/DCache maintenance Jul 24 kreymer@minos26 echo "crontab -r" | at 05:30 job 11 at 2008-07-24 05:30 echo "touch /afs/fnal.gov/files/data/minos/log_data/bluwatch/STOP" | at 05:58 job 14 at 2008-07-24 05:58 mindata@minos26 echo "crontab -r" | at 01:00 job 12 at 2008-07-24 01:00 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 job 15 at 2008-07-24 01:00 ########### # ROUNDUP # ########### Added MISS messages listing pending subruns.
PENDLOG is only written when NOOP is clear SRV1> ln -sf roundup.20080722 roundup # was roundup.20080703 ( did nothing yesterday, due to a typo ) ########## # CONDOR # ########## 165531.3 gfactory 7/15 10:21 0+00:00:00 H 0 0.0 glidein_startup.sh 165861.1 gfactory 7/15 21:03 0+00:00:00 H 0 0.0 glidein_startup.sh 165861.3 gfactory 7/15 21:03 0+00:00:00 H 0 0.0 glidein_startup.sh 166779.1 gfactory 7/18 01:51 0+00:00:00 H 0 0.0 glidein_startup.sh 166779.2 gfactory 7/18 01:51 0+00:00:00 H 0 0.0 glidein_startup.sh 166861.7 gfactory 7/18 11:27 0+00:00:00 H 0 0.0 glidein_startup.sh 167039.1 gfactory 7/19 06:12 0+00:00:00 H 0 0.0 glidein_startup.sh 167039.2 gfactory 7/19 06:12 0+00:00:00 H 0 0.0 glidein_startup.sh 167039.4 gfactory 7/19 06:12 0+00:00:00 H 0 0.0 glidein_startup.sh 167672.1 gfactory 7/22 02:10 0+07:11:56 H 0 0.0 glidein_startup.sh 167693.3 gfactory 7/22 03:09 0+06:46:58 H 0 0.0 glidein_startup.sh 167863.0 gfactory 7/22 14:35 0+00:00:00 H 0 0.0 glidein_startup.sh 167999.0 gfactory 7/23 04:00 0+00:00:00 H 0 0.0 glidein_startup.sh 167999.1 gfactory 7/23 04:00 0+00:00:00 H 0 0.0 glidein_startup.sh MINOS25 > condor_rm 165531.3 MINOS25 > condor_rm 165861.1 MINOS25 > condor_rm 165861.3 JOB=166779.1 ; condor_rm ${JOB} ; sleep 10 ; condor_q ${JOB} JOB=166779.2 ; condor_rm ${JOB} ; sleep 10 ; condor_q ${JOB} for JOB in 166861.7 167039.1 167039.2 167039.4 167672.1 ; do condor_rm ${JOB} ; sleep 10 ; condor_q ${JOB} ; done 167672.1 gfactory 7/22 02:10 0+07:11:56 X 0 0.0 glidein_startup.sh for JOB in 167693.3 167863.0 167999.0 167999.1 ; do condor_rm ${JOB} ; sleep 10 ; condor_q ${JOB} ; done 167693.3 gfactory 7/22 03:09 0+06:46:58 X 0 0.0 glidein_startup.sh So we have two X jobs stuck, since yesterday. Now they are back to H, 167672.1 gfactory 7/22 02:10 0+07:11:56 H 0 0.0 glidein_startup.sh 167693.3 gfactory 7/22 03:09 0+06:46:58 H 0 0.0 glidein_startup.sh ########## # ORACLE # ########## Costs per email Date: Thu, 10 Apr 2008 11:42:44 -0500 From: Maurine Mihalek continuing monthly decision 1 maintenance costs are: MINOSORA1 - $52.13 MINOSORA1-SUN-RAID-ARRAY - $272.27 MINOSORA3 - $43.32 MINOSORA3-SUN-RAID-ARRAY - $82.71 ############ # MCIMPORT # ############ Keep on truckin, per list at 2008 07 17 MCIMPORTARCHIVELIST TOP=daikon_04/L010170N/near RDIRS=`ls /minos/data/mcimport/STAGE/${TOP}` echo $RDIRS ( find /minos/data/mcimport/STAGE/${TOP} -type f -name \*.tar.gz -exec du -sm {} \; | cut -f 1 ) \ > /minos/scratch/mindata/ssize.gpl FLXI04 > printf 'plot "/minos/scratch/mindata/ssize.gpl"' | gnuplot -persist for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport -n -T ${TOP}/${DIR} | grep NFILES \ | grep -v 'NFILES 0' done 700 NFILES 183 for DIR in ${RDIRS}; do ./mcimport -T ${TOP}/${DIR} done OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010170N/near/700/mcimport.log Wed Jul 23 10:52:31 CDT 2008 Wed Jul 23 13:30:51 CDT 2008 Plan is to do the rest in roughly order of total size TOP=daikon_04/L100200N/near # 10MB, 350-410 MB 700 NFILES 278 701 NFILES 31 OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L100200N/near/700/mcimport.log Wed Jul 23 14:49:34 CDT 2008 Wed Jul 23 22:20:15 CDT 2008 TOP=daikon_04/L150200N/near # 10MB, 480-540 MB 700 NFILES 275 701 NFILES 31 OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L150200N/near/700/mcimport.log Fri Jul 25 10:20:10 CDT 2008 Fri Jul 25 19:20:19 CDT 2008 TOP=daikon_04/L010185N_helium/near # 10MB, 350-410 MB 650 NFILES 278 651 NFILES 305 652 NFILES 307 653 NFILES 29 Mon Jul 28 08:59:51 CDT 2008 Tue Jul 29 05:30:09 CDT 2008 
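The dry run and the real pass could be folded together, archiving only directories that still have files to tar. A sketch, not run verbatim here ; TOP is a placeholder, and -t would replace -T for the smaller file sets noted just below :
TOP=daikon_04/L010185N/near                       # placeholder
for DIR in `ls /minos/data/mcimport/STAGE/${TOP}` ; do
  # the dry run reports a line like 'NFILES 183' ; skip directories with nothing left
  NF=`./mcimport -n -T ${TOP}/${DIR} | grep NFILES | awk '{ print $2 }'`
  [ "${NF:-0}" -gt 0 ] && ./mcimport -T ${TOP}/${DIR}
done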
These are a bit too small, should tar them with -t option TOP=daikon_04/L010185N_charm/near # 10MB, 120-260 MB TOP=daikon_04/L010000N/near # 10MB, 190-230 for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport -n -t ${TOP}/${DIR} | grep FILES done for DIR in ${RDIRS}; do ./mcimport -t ${TOP}/${DIR} done ######## # DOWN # ######## Announced downtime schedule to minos-data minos_software_discussion Minos status web page ########## # ORACLE # ########## Date: Wed, 23 Jul 2008 10:17:31 -0500 From: Anil Kumar Checked with Nelly too. MINOS_DEV user on minosdev database can be safely dropped. I am dropping the user now. ######## # FARM # ######## The removal of dogwood0/1 seems to be complete. MINOS26 > du -sm /minos/data/minfarm/farmtest 616494 /minos/data/minfarm/farmtest We now have over 5 TB free ######## # FARM # ######## rounding up cedar_phy_bhhi mcnear cedar_phy_bhlo mcnear cedar_phy_bhhi mcfar cedar_phy_bhlo mcfar Had top get bad_runs lists, cp /minos/data/minfarm/farmtest_strait/lists/bad_runs_mc.cedar_phy_bhlo \ /minos/data/minfarm/lists/bad_runs_mc.cedar_phy_bhlo cp /minos/data/minfarm/farmtest_strait/lists/bad_runs_mc.cedar_phy_bhhi \ /minos/data/minfarm/lists/bad_runs_mc.cedar_phy_bhhi for DET in near far ; do for FIE in hi lo ; do for STR in sntp mrnt ; do NF=`ls /minos/data/minfarm/mc${DET}cat | \ grep L010185N_D04.${STR}.cedar_phy_bh${FIE}.root | wc -l` printf " %6s %6s %6s %6d\n" ${DET} ${FIE} ${STR} ${NF} done ; done ; done near hi sntp 5493 near hi mrnt 5491 near lo sntp 5493 near lo mrnt 5493 far hi sntp 417 far hi mrnt 417 far lo sntp 416 far lo mrnt 416 ./roundup -n -W -s n13037001 -r cedar_phy_bhhi mcnear ./roundup -s n13037001 -r cedar_phy_bhhi mcnear takes a while to pick up the candidates ./roundup -n -W -s f21037001 -r cedar_phy_bhhi mcfar Noted that ALL the mcfar files are subrun 0. ./roundup -s f21037001 -r cedar_phy_bhhi mcfar ./roundup -r cedar_phy_bhhi mcfar cedar_phy_bhhimcfar.log One of the srmcp 's failed, SRMCP 34/832 -streams_num=1 -server_mode=active -protocols=gsiftp file:///f21037018_0000_L010185N_D04.sntp.cedar_phy_bhhi.root /pnfs/minos/mcout_data/cedar_phy_bhhi/far/daikon_04/L010185N/sntp _data/701 [main] ERROR gsi.CertificateRevocationLists - CRL /usr/local/grid/globus/TRUSTED_CA/eebc7717.r0 failed to load. Getting several of these, they seem scary but harmless, should report. ./roundup -r cedar_phy_bhhi mcfar Wed Jul 23 17:24:32 CDT 2008 PURGED 832/832 Wed Jul 23 17:28:28 CDT 2008 Updated to newer roundup, ./roundup -r cedar_phy_bhlo mcfar Wed Jul 23 17:58:18 CDT 2008 Wed Jul 23 22:23:51 CDT 2008 ############ # PNFSDIRS # ############ Added mrnt to far , due to problems seen in bhhi, bhlo ./pnfsdirs far cedar_phy_bhhi daikon_04 L010185N write ./pnfsdirs far cedar_phy_bhlo daikon_04 L010185N write ############# # MDSUM_LOG # ############# Date: Wed, 23 Jul 2008 03:10:06 -0500 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron /usr/krb5/bin/kcron ${HOME}/minos/scripts/mdsum_log /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mdsum_log: line 50: kcron: command not found ... Removed kcron Changed to full path /usr/krb5/bin/aklog Tested this at 09:39, then restored crontab.dat Looks OK. 
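For the record, the cron-safe pattern now in use looks roughly like this sketch ( the schedule is taken from the 03:10 cron mail above, and the AFS path in the last line is just a placeholder ). Cron supplies almost no environment, so the Kerberos/AFS tools are named by full path and the AFS token is refreshed inside the script :
# crontab entry, wrapped in kcron so the script runs with the cron principal
10 03 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/mdsum_log
# near the top of mdsum_log itself
#!/bin/sh
/usr/krb5/bin/aklog                                        # AFS token from the kcron ticket
du -sm /afs/fnal.gov/files/data/minos/log_data             # placeholder AFS access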
============================================================================= 2008 07 22 ============================================================================= ######## # FARM # ######## farmtest - directory is being purged of dogwoodtest0 and 1 files, farmtest_strait - hi/lo will be moved to mcnearcat/mcfarcat for catting ########### # ROUNDUP # ########### Added MISS messages listing pending subruns. PENDLOG is only written when NOOP is clear SRV1> ln -sf roundup.20080722 roundup # was roundup.20080703 ########## # PARROT # ########## Installed the present 'current' release versions into /grid/app/minos/parrot per HOWTO.parrot, REL=current ; ARC='x86_64-linux-2.6' ; DAT='-20080708' REL=current ; ARC='i686-linux-2.6' ; DAT='-20080717' Tested ssh fnpc338 mkdir -p /local/stage1/kreymer/parrot export PRO=/grid/app/minos/parrot REL=current ; ARC='x86_64-linux-2.6' ; DAT='-20080708' export VER=cctools-${REL}${DAT}-${ARC} export PARROT_DIR=${PRO}/${VER} export PATH=${PARROT_DIR}/bin:${PATH} export HTTP_PROXY="http://squid.fnal.gov:3128" PTD=/local/stage1/kreymer/parrot parrot -m ${PARROT_DIR}/mountfile.grow -H -t ${PTD} /bin/bash P> printf "\n" | loon -bq firstlast.C ${DFILE} sh: error while loading shared libraries: libtermcap.so.2: object file has no loadable segments -bash-3.00$ du -sm /local/stage1/kreymer/parrot/ 337 /local/stage1/kreymer/parrot/ Repeated test with fresh login, REL=current ; ARC='x86_64-linux-2.6' ; DAT='-20080619' sh: error while loading shared libraries: libtermcap.so.2: object file has no loadable segments printf "" | loon -bq firstlast.C ${DFILE} This is clean -bash-3.00$ du -sm $PTD 673 /local/stage1/kreymer/parrot Repeated the test, -bash-3.00$ du -sm $PTD 704 /local/stage1/kreymer/parrot ########## # PARROT # ########## How to check which kernel we run ?
uname -m, --machine print the machine hardware name -p, --processor print the processor type -i, --hardware-platform print the hardware platform -i tends to be i386 or x86_64 -m -p tend to be i686 or x86_64 For testing, FNALU nodes are all Intel for NODE in ${UNODES} ; do printf "$NODE "; ssh -ax ${NODE} 'grep name /proc/cpuinfo | head -1' ; done MIN > for NODE in ${UNODES} ; do printf "$NODE "; ssh -ax ${NODE} 'uname -m -p -i' ; done flxi02 x86_64 x86_64 x86_64 flxi03 ssh_exchange_identification: Connection closed by remote host flxi04 i686 i686 i386 flxi05 i686 i686 i386 flxi06 i686 i686 i386 flxi07 x86_64 x86_64 x86_64 flxi09 i686 i686 i386 MIN > ssh flxb31 uname -m -p -i i686 athlon i386 So experimentally, we use 'uname -m' to identify bitosity of kernel ############ # MCIMPORT # ############ MCI3 > ln -sf mcimport.20080716 mcimport # was mcimport.20080630 Check file sizes with du -sm /minos/data/mcimport/STAGE/${TOP}/*/* | sort -n | less TOP=daikon_04/L010200N/near RDIRS=`ls /minos/data/mcimport/STAGE/${TOP}` echo $RDIRS 700 for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport -n -T ${TOP}/${DIR} | grep NFILES \ | grep -v 'NFILES 0' done 700 NFILES 185 for DIR in ${RDIRS}; do ./mcimport -T ${TOP}/${DIR} done OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010200N/near/700/mcimport.log Tue Jul 22 12:26:06 CDT 2008 ############# # MDSUM_LOG # ############# Added this to the kreymer crontab on minos26 ln -sf crontab.minos26.20080722 crontab.dat # was crontab.minos26.20080402 MINOS26 > crontab crontab.dat ============================================================================= 2008 07 21 ============================================================================= ######## # FARM # ######## State of recent near spill data processing ? Pending : PEND - have 22/24 subruns for N00014187_*.spill.sntp.cedar.0.root 63 05/19 03:07 0 22 PEND - have 9/10 subruns for N00014551_*.spill.sntp.cedar.0.root 3 07/18 04:18 0 9 PEND - have 7/8 subruns for N00014562_*.spill.sntp.cedar.0.root 1 07/20 08:11 0 7 PEND - have 3/20 subruns for N00014584_*.spill.sntp.cedar.0.root 0 07/21 03:34 0 3 Far det, per habig query, runs Friday 18 July seem to be F00041400_0015.mdaq.root ... F00041412_0001.mdaq.root ######## # FARM # ######## Date: Thu, 17 Jul 2008 15:05:02 -0500 From: Howard Rubin Run F00040421 should be forced out of farcat. The 'snarl' counts are all over the place, and 4 of the subruns ran forever before I finally killed them. It's not worth pursuing the remainder of the run. --------------------------------------------------------------- SRV1> ./roundup -n -s F00040421 -r cedar far PEND - have 4/6 subruns for F00040421_*.all.sntp.cedar.0.root 132 03/11 11:18 0 4 PEND - have 4/6 subruns for F00040421_*.spill.bntp.cedar.0.root 132 03/11 11:18 0 4 PEND - have 4/6 subruns for F00040421_*.spill.sntp.cedar.0.root 132 03/11 11:18 0 4 ./roundup -f 1 -s F00040421 -r cedar far Mon Jul 21 11:51:30 CDT 2008 Ugh, many many ECRC problems reported since Friday, SRV1> ./roundup -n -r cedar far 2>&1 | grep 'No such file' | tee /tmp/nsf.lis SRV1> wc -l /tmp/nsf.lis 165 /tmp/nsf.lis SRV1> DUPC=`cat /tmp/nsf.lis | cut -f 7 -d / | cut -f 1 -d :` SRV1> for ${DUP} in ${DUPC} ; do sam locate ${DUP} ; done These are all declared to SAM. 
SRV1> printf "$DUPC\n" | cut -f 1 -d _ | sort -u F00041342 Thu Jul 10 13:51:50 CDT 2008 F00041348 F00041351 F00041360 F00041363 F00041366 F00041369 F00041372 F00041375 F00041380 F00041383 F00041388 F00041393 Tue Jul 15 16:02:29 CDT 2008 Moved them all to DUP/farcat SRV1> for FIL in ${DUPC} ; do mv /minos/data/minfarm/WRITE/${FIL} /minos/data/minfarm/DUP/farcat/${FIL} ; done SRV1> date Mon Jul 21 14:57:47 CDT 2008 This is all messed up , BADRUNS F00041348_0003.spill.bntp.cedar.0.root PEND - have 45/23 subruns for F00041348_*.spill.bntp.cedar.0.root 4 07/17 14:35 23 22 PEND - have 20/18 subruns for F00041393_*.spill.bntp.cedar.0.root 4 07/17 13:18 17 3 PEND - have 6/22 subruns for F00041418_*.spill.bntp.cedar.0.root 0 07/20 23:39 0 6 BADRUNS F00041348_0003.spill.sntp.cedar.0.root PEND - have 45/23 subruns for F00041348_*.spill.sntp.cedar.0.root 4 07/17 14:35 23 22 PEND - have 20/18 subruns for F00041393_*.spill.sntp.cedar.0.root 4 07/17 13:18 17 3 PEND - have 6/22 subruns for F00041418_*.spill.sntp.cedar.0.root 0 07/20 23:39 0 6 41348 - spill cand's for subruns 0-2, 5-23 subrun 3 is bad, what about 4 ? These have already been concatenated and written, both bntp and sntp mv /minos/data/minfarm/farcat/F00041348* /minos/data/minfarm/DUP/farcat/ For F00041393, have 0/1/2 subruns in farcat, have 0/1 already concatenated as F00041393_0000.all.sntp.cedar.0.root F00041393_0003.spill.bntp.cedar.0.root Move the extra 0/1 to DUP, force out remaining _0002. mv /minos/data/minfarm/farcat/F00041393_0000* /minos/data/minfarm/DUP/farcat/ mv /minos/data/minfarm/farcat/F00041393_0001* /minos/data/minfarm/DUP/farcat/ SRV1> ./roundup -n -f 1 -s F00041393_0002 -r cedar far Odd, why does the script not detect that we HAVE the other subruns ? They are declared to SAM. Will have to debug this all over again !!! SRV1> ./roundup -f 1 -s F00041393_0002 -r cedar far ############# # MDSUM_LOG # ############# Created mdsum_log for daily /minos/data space usage summary. Removed strays : rmdir /minos/data/BAD rm -r /minos/data/analysis/database these were stray database backups , set log entry 2007 11 26 MINOS26 > find asousa -type f -atime -130 -exec ls -ltu {} \; -rw-r--r-- 1 asousa e875 89327267 Mar 14 06:04 asousa/N00009098_0015.mdaq.root -rw-r--r-- 1 asousa e875 698 Apr 7 11:41 asousa/makeShortSNTP.C mv asousa users/asousa Ran initial pass round 17:20 See http://www-numi.fnal.gov/computing/dh/mdsum/2008/07/21.txt ####### # AFS # ####### Sometime this last week : MINOS01 > ./bluwatch: line 71: /afs/fnal.gov/files/data/minos/log_data/bluwatch/last/minos01.txt: Connection timed out ./bluwatch: line 71: /afs/fnal.gov/files/data/minos/log_data/bluwatch/last/minos01.txt: Connection timed out ############ # MCIMPORT # ############ Free space is down to 700GB. Good thing we archived over 1 TB last week ! 
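A quick way to keep an eye on this ( a sketch ; output not from the log ) :
df -h /minos/data                                                        # free space on the BlueArc area
du -sm /minos/data/mcimport/STAGE/*/* 2>/dev/null | sort -n | tail -20   # biggest staging areas last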
MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/L010185N/near 1109172 /minos/data/mcimport/STAGE/daikon_04/L010185N/near MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/L010185N/near/* 90862 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/700 101257 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/701 101891 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/702 101538 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/703 101930 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/704 99776 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/705 101306 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/706 107014 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/707 103844 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/708 101137 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/709 2186 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/710 2123 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/711 ============================================================================= 2008 07 20 ============================================================================= ######## # DATA # ######## Many condor complaints about a file not removable on minos25. MINOS01 > dds logs/glideafs/.*c -rw-r--r-- 1 kreymer g020 0 Jul 17 08:40 logs/glideafs/.nfs532415d50000014c MINOS01 > rm logs/glideafs/.*c MINOS01 > dds logs/glideafs/.*c MINOS01 > date Sun Jul 20 21:31:30 CDT 2008 ============================================================================= 2008 07 17 ============================================================================= ########## # CONDOR # ########## First draft document from rbpatter on increased computing http://www.hep.caltech.edu/~rbpatter/computing.pdf ############ # MCIMPORT # ############ Proceed with over 2 TB of mainline MC, leaving alone the first 110 runs.
RDIRS=`ls /minos/data/mcimport/STAGE/daikon_04/L010185N/near | grep -v ^70` for DIR in ${RDIRS} ; do printf "${DIR} " du -sm /minos/data/mcimport/STAGE/daikon_04/L010185N/near/${DIR} done 710 2186 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/710 711 2123 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/711 712 1835 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/712 713 1926 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/713 714 2112 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/714 715 1953 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/715 716 1885 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/716 717 1901 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/717 718 2048 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/718 719 98473 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/719 720 10924 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/720 721 1030 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/721 722 1034 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/722 723 1054 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/723 724 1049 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/724 725 1027 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/725 726 1041 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/726 727 1014 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/727 728 1033 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/728 729 1048 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/729 730 1035 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/730 731 1030 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/731 732 1466 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/732 733 1029 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/733 734 1039 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/734 735 7098 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/735 736 5339 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/736 737 1045 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/737 738 1013 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/738 739 983 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/739 740 1005 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/740 741 1042 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/741 742 1049 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/742 743 1026 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/743 744 1023 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/744 745 1023 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/745 746 1057 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/746 747 99778 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/747 748 101765 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/748 749 106939 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/749 750 11259 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/750 751 1097 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/751 752 1043 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/752 753 1095 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/753 754 1129 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/754 755 1130 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/755 756 1145 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/756 757 1146 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/757 758 1454 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/758 759 1143 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/759 760 1127 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/760 761 1130 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/761 762 1104 
/minos/data/mcimport/STAGE/daikon_04/L010185N/near/762 763 1127 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/763 764 1094 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/764 765 1104 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/765 766 1364 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/766 767 1121 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/767 768 1117 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/768 769 1430 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/769 770 1146 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/770 771 1132 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/771 772 1143 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/772 773 101015 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/773 774 101705 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/774 775 101436 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/775 776 99742 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/776 777 99526 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/777 778 99725 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/778 779 99893 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/779 999 12076 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/999 Much of the run-space has been archived. Scan for what remains to do : for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport.20080716 -n -T daikon_04/L010185N/near/${DIR} | grep NFILES \ | grep -v 'NFILES 0' done 719 NFILES 309 720 NFILES 31 747 NFILES 304 748 NFILES 306 749 NFILES 320 750 NFILES 31 773 NFILES 308 774 NFILES 310 775 NFILES 309 776 NFILES 304 777 NFILES 303 778 NFILES 304 779 NFILES 304 999 NFILES 33 RDIRS='719 720 747 748 749 750 773 774 775 776 777 778 779 999' for DIR in ${RDIRS} ; do printf "${DIR} " ./mcimport.20080716 -T daikon_04/L010185N/near/${DIR} done 719 OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N/near/719/mcimport.log Thu Jul 17 08:24:35 CDT 2008 ============================================================================= 2008 07 16 ============================================================================= ######## # FARM # ######## Several cedarnear.log ECRC missing files, DFILES=' N00014520_0000.cosmic.cand.cedar.0.root N00014520_0011.cosmic.cand.cedar.0.root N00014520_0012.cosmic.cand.cedar.0.root N00014520_0015.cosmic.cand.cedar.0.root N00014523_0000.cosmic.cand.cedar.0.root ' These are all duplicates. Why not picked up by the DUP checking code ? For now, renamed to DUP SRV1> cd /minos/data/minfarm/WRITE SRV1> for FILE in ${DFILES} ; do ls -l ${FILE} ; done -rw-rw-r-- 1 minospro numi 111176051 Jul 12 20:09 N00014520_0000.cosmic.cand.cedar.0.root -rw-rw-r-- 1 minospro numi 113477854 Jul 12 19:50 N00014520_0011.cosmic.cand.cedar.0.root -rw-rw-r-- 1 minospro numi 113270695 Jul 12 19:24 N00014520_0012.cosmic.cand.cedar.0.root -rw-rw-r-- 1 minospro numi 113336210 Jul 12 18:59 N00014520_0015.cosmic.cand.cedar.0.root -rw-rw-r-- 1 minospro numi 112857605 Jul 12 19:50 N00014523_0000.cosmic.cand.cedar.0.root SRV1> for FILE in ${DFILES} ; do mv ${FILE} ../DUP/nearcat/${FILE} ; done SRV1> date Wed Jul 16 18:10:35 CDT 2008 ########## # DCACHE # ########## Date: Wed, 16 Jul 2008 14:26:06 -0500 (CDT) From: Michael Zalokar To: kreymer@fnal.gov Cc: moibenko@fnal.gov Subject: missing minos files We recently ran a scan of STKen PNFS. In that scan the following 6 files were flagged as being in pnfs, but not on tape. 
/pnfs/fs/usr/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/Unable /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.all.sntp.cedar.0.root.08dec2006.bad /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.spill.sntp.cedar.0.root.08dec2006.bad /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2006-06/F00035724_0013.all.sntp.cedar.0.root.08dec2006.bad /pnfs/fs/usr/minos/reco_near/R1_18_4/snts_data/2006-10/N00011077_0013.spill.snts.R1_18_4.0.root.08dec2006.ba d /pnfs/fs/usr/minos/reco_near/cedar/cand_data/2006-10/N00011134_0038.spill.cand.cedar.0.root.18dec2006.bad If you still have the originals, feel free to rewrite them. If not, please remove them from PNFS. We apologize for the inconvenience. Mike BFILES=' /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/Unable /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.all.sntp.cedar.0.root.08dec2006.bad /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.spill.sntp.cedar.0.root.08dec2006.bad /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035724_0013.all.sntp.cedar.0.root.08dec2006.bad /pnfs/minos/reco_near/R1_18_4/snts_data/2006-10/N00011077_0013.spill.snts.R1_18_4.0.root.08dec2006.bad /pnfs/minos/reco_near/cedar/cand_data/2006-10/N00011134_0038.spill.cand.cedar.0.root.18dec2006.bad ' for FIL in ${BFILES} ; do ls -l ${FIL} ; done -rw-rw-r-- 1 rubin e875 0 Jun 9 12:06 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/Unable -rw-r--r-- 1 rubin e875 17865772 Nov 10 2006 /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.all.sntp.cedar.0.root.08dec2006.bad -rw-r--r-- 1 rubin e875 805743 Nov 10 2006 /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.spill.sntp.cedar.0.root.08dec2006.bad -rw-r--r-- 1 rubin e875 22792353 Nov 10 2006 /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035724_0013.all.sntp.cedar.0.root.08dec2006.bad -rw-r--r-- 1 rubin e875 7106097 Oct 19 2006 /pnfs/minos/reco_near/R1_18_4/snts_data/2006-10/N00011077_0013.spill.snts.R1_18_4.0.root.08dec2006.bad -rw-r--r-- 1 rubin e875 443821667 Dec 10 2006 /pnfs/minos/reco_near/cedar/cand_data/2006-10/N00011134_0038.spill.cand.cedar.0.root.18dec2006.bad for FIL in ${BFILES} ; do ./dc_stat ${FIL} ; done As rubin, for FIL in ${BFILES} ; do rm ${FIL} ; done ############ # MCIMPORT # ############ Urgently need to tar up some more files, for space on /minos/data. Let's move to the current mcimport, which now supports tar archiving As usual, work on minos-sam03 MCI3 > cp AFSS/mcimport.20080630 mcimport.20080630 MCI3 > ln -sf mcimport.20080630 mcimport # was AFSS/mcimport.20071102 Let's see what there is to chew on. 
$ du -sm /minos/data/mcimport/STAGE/daikon_00/* 158600 /minos/data/mcimport/STAGE/daikon_00/L010185N done 2008/07/16 13332 /minos/data/mcimport/STAGE/daikon_00/L010185N_nue MCIMPORTARCHIVELIST $ du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N -t 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N done 7/23 2235895 /minos/data/mcimport/STAGE/daikon_04/L010185N done 7/20 126550 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 7/29 -t 368713 /minos/data/mcimport/STAGE/daikon_04/L010185N_helium done 7/28 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh done 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N done 7/22 338 min (700) 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N done 7/23 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N done 7/25 27834 /minos/data/mcimport/STAGE/daikon_04/L250200N done $ du -sm /minos/data/mcimport/STAGE/daikon_00/L010185N/near/*/* | sort -n ... 11 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/704/n12017041_0010_L010185N_D00.tar.gz 12 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/142/n12011429_0010_L010185N_D00.tar.gz 317 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/142/n11011424_0003_L010185N_D00.tar.gz 319 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/143/n11011431_0009_L010185N_D00.tar.gz ... 349 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/145/n11011450_0001_L010185N_D00.tar.gz $ du -sm /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/*/* | sort -n ... 67 /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/144/n14111446_0008_L010185N_D00_nue.tar.gz RDIRS=`ls /minos/data/mcimport/STAGE/daikon_00/L010185N/near` MCI3 > echo $RDIRS 141 142 143 144 145 704 for DIR in ${RDIRS}; do ./mcimport.20080716 -n -T daikon_00/L010185N/near/${DIR} done \ | grep NFILES NFILES 99 NFILES 107 NFILES 107 NFILES 109 NFILES 11 NFILES 31 Had to update to mcimport.20080716 to get correct MCIN path, so try one file first : ./mcimport.20080716 -b 1 -T daikon_00/L010185N/near/141 Hung up indefinitely waiting for CMS BACKFILL2 jobs in Enstore, 2237 queued up, and the queue is not getting shorter. Start time: Wed Jul 16 15:02:04 2008 Wed Jul 16 17:31:32 CDT 2008 OK, let er rip on this modest 158 GB of files. for DIR in ${RDIRS}; do ./mcimport.20080716 -T daikon_00/L010185N/near/${DIR} done ########## # CONDOR # ########## loiacono jobs are still not running. MINOS25 > condor_history -l 165974.0 > /tmp/histark MINOS25 > condor_q -l 165515.0 > /tmp/histlau sdiff -s /tmp/histark /tmp/histlau reveals that they are setting JobLeaseDuration = 360000 and not setting X509USERPROXY = /local/scratch25/$ENV(LOGNAME)/grid/$ENV(LOGNAME).proxy Sent mail, Laura confirmed that this resolves the problem. ######## # FARM # ######## Date: Wed, 16 Jul 2008 09:07:53 -0500 (CDT) From: Steven Timm To: fermigrid-announce@fnal.gov Subject: fnpcsrv1 reboot now node fnpcsrv1 was hopelessly confused with NFS errors and has to be rebooted. I am rebooting now. Steve Timm ######## # FARM # ######## Checking last night's roundups : Problems since Sun Jul 13 18:07:47 CDT 2008 cat: /export/stage/minfarm/ROUNDUP/ECRC/N00014520_0000.cosmic.cand.cedar.0.root: No such file or directory etc. 
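A count and list of the affected files can be pulled out of a dry run, along the lines of the extraction in the 2008 07 21 entry above. A sketch, not the command run on this date ; assumes the same ECRC path layout :
./roundup -n -r cedar near 2>&1 | grep 'No such file' > /tmp/necrc.lis
wc -l /tmp/necrc.lis
# the file name is the 7th /-separated field, up to the trailing colon
cut -f 7 -d / /tmp/necrc.lis | cut -f 1 -d : | sort -u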
Did pick up OK adding N00012941_0007.spill.sntp.cedar.0.root 1 OK adding N00013375_0000.spill.sntp.cedar.0.root 30 OK adding N00013434_0000.spill.sntp.cedar.0.root 18 OK adding N00013793_0011.spill.sntp.cedar.0.root 1 BIG - Splitting due to size 2163459957 OK adding N00014184_0000.spill.sntp.cedar.0.root 23 OK adding N00014184_0023.spill.sntp.cedar.0.root 1 Jul 14 19:25 But these are all old, beam came up Sunday around 06:00, in near 14526/14 far 41348/16 There are 5 spill cand's for 14508 and 14520, before beam came back. ls /pnfs/minos/reco_near/cedar/cand_data/2008-07/*spill* No spill data in ls /minos/data/minfarm/neardet Alec will try to reach Matt Strait, Howie is away. Date: Wed, 16 Jul 2008 17:35:28 -0500 (CDT) I have submitted for processing runs 14526_0000 through 145548_0000. Since I don't know what's happening with Howie's jobs and don't want to step on them unncessarily, the output will appear in my directory: /minos/data/minfarm/farmtest_strait/nearcat/ and logs at: /minos/data/minfarm/farmtest_strait/logs/cedar/near/ When Howie comes back, he can decide whether to rerun them with his scripts or copy my output. It's not a whole lot of processing, so it doesn't much matter. -Matt ============================================================================= 2008 07 15 ============================================================================= ######### # ADMIN # ######### Account request for zkrahn , no FNALU account yet. ############ # MCIMPORT # ############ 13:00 roughly MINOS26 > ./pnfsdirs near cedar_phy_bhcurv daikon_05 L010185N write MINOS26 > ./pnfsdirs far cedar_phy_bhcurv daikon_05 L010185N write Oops, this is Mock Data. Need to have a modified /pnfsdirs to handle this. The default script created /pnfs/minos/mcin_data/far/daikon_05/L010185N needed /pnfs/minos/mcin_data/fmock/daikon_05/L010185N Finally updated pnfsdirs, ran this 15:00 2008 07 30 The 49 GB of files were copied, 00:37 through 04:40. That's 3.4 MB/sec. ########## # CONDOR # ########## These processes have been running CPU-bound under gfactory since Sunday 13:00 4 R gfactory 14664 32564 64 76 0 - 2618 - Jul12 ? 1-20:16:27 condor_gridmanager -f -C (Owner=?="gfactory"&&JobUniverse==9) -o gfactory -S /tmp/condor_g_scratch.0xbb57928.32564 4 R gfactory 14670 14664 61 77 0 - 2298 - Jul12 ? 1-18:14:38 /opt/condor/sbin/gahp_server Per sfiligoi, killed these, first checking the time cycle of gfactory : LDIR=/home/gfactory/glideinsubmit/glidein_t20_glexec/entry_gpminos/log tail -1 ${LDIR}/factory_info.20080715.log [2008-07-15T16:44:10-05:00 15536] Sleep 90s [gfactory@minos25 ~]$ ps xf PID TTY STAT TIME COMMAND 14664 ? R 2940:55 condor_gridmanager -f -C (Owner=?="gfactory"&&JobUniverse==9) -o gfactory -S /tmp/condor_g_scratch.0xbb57928.32564 14670 ? R 2804:05 \_ /opt/condor/sbin/gahp_server 17635 pts/10 Ss 0:00 -bash 25203 pts/10 R+ 0:00 \_ ps xf 15533 ? S 71:41 python glideFactory.py 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ 15535 ? S 86:09 \_ /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 15533 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ gpgeneral 15536 ? 
S 91:42 \_ /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 15533 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ gpminos kill 15533 kill 15536 kill -9 15536 kill 14670 that got them both ./start_factory.sh F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 4 S gfactory 17635 17634 0 75 0 - 1407 wait Jul14 pts/10 00:00:00 -bash 0 S gfactory 25254 1 6 77 0 - 6194 - 16:48 pts/10 00:00:13 python glideFactory.py 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ 0 S gfactory 25256 25254 6 76 0 - 8047 - 16:48 pts/10 00:00:14 /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 25254 90 4 /home/gfactory/glideinsubmit/glidein_t 0 S gfactory 25257 25254 6 76 0 - 8258 - 16:48 pts/10 00:00:13 /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 25254 90 4 /home/gfactory/glideinsubmit/glidein_t 4 R gfactory 25478 32564 88 78 0 - 2806 - 16:50 ? 00:01:28 condor_gridmanager -f -C (Owner=?="gfactory"&&JobUniverse==9) -o gfactory -S /tmp/condor_g_scratch.0x1122a558.32564 4 R gfactory 25480 25478 85 85 0 - 2041 - 16:50 ? 00:01:22 /opt/condor/sbin/gahp_server 0 R gfactory 25515 17635 0 77 0 - 1011 - 16:51 pts/10 00:00:00 ps -flu gfactory [gfactory@minos25 ~]$ ps xf PID TTY STAT TIME COMMAND 25478 ? R 2:02 condor_gridmanager -f -C (Owner=?="gfactory"&&JobUniverse==9) -o gfactory -S /tmp/condor_g_scratch.0x1122a558.32564 25480 ? R 1:54 \_ /opt/condor/sbin/gahp_server 17635 pts/10 Ss 0:00 -bash 25524 pts/10 R+ 0:00 \_ ps xf 25254 pts/10 S 0:13 python glideFactory.py 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ 25256 pts/10 S 0:14 \_ /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 25254 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ gpgeneral 25257 pts/10 S 0:13 \_ /usr/bin/python /home/gfactory/glideinWMS/factory/glideFactoryEntry.py 25254 90 4 /home/gfactory/glideinsubmit/glidein_t20_glexec/ gpminos My test glideins are running. But the gridmanager processes are still running 100% cpu limited. ####### # SAM # ####### SAM taking stock meeting, 09:00 to 09:45, Julie Trumbo Anil Kumar Dianne Bonham Arthur Kreymer One result was to identify large test are in Integration which was being backed up. Removed, cutting backups from 1.5 hours to 10 minutes. ============================================================================= 2008 07 14 ============================================================================= ######### # MYSQL # ######### Date: Mon, 14 Jul 2008 17:59:19 +0100 From: Jeff Hartnell To: Nick West , rhatcher , Arthur Kreymer Cc: Nick Devenish Subject: URGENT (database problem) Hi all, Just tried to phone each one of you... Nick D has accidentally removed all the entries from the CALADCTOPESVLD table in offline on minos-db1 (he meant to do it to caltest). He was about to do a large import of new gain numbers so was trying to test it out on caltest. One worry: will this get imminently exported to other sites? Could this table be reprimed? It hasn't been updated since the end of May this year. Cheers, Jeff. ------------------------------------------------------------------ The latest backup was done on July 2, the gzipped file is /data/archive/COPY/20080702/offline/CALADCTOPESVLD.MYD.gz This is only 6 MB, so a restore should go quickly. I'll coordinate with Robert and Nick to see whether we can use this. At present, the existing 0 length database file is being held open by mysqld, so we cannot just swap this under the running mysqld. 
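In outline, the plan is to bring the backup copy up in a scratch database, rebuild the
index from the .frm, and verify the row count, so that nothing is swapped under the
running mysqld (a minimal sketch using the paths quoted above; the actual session
is transcribed below):

# Sketch : restore CALADCTOPESVLD into a scratch database 'recover' .
# The directory under the datadir becomes the 'recover' database .
ARC=/data/archive/COPY/20080702/offline
REC=/data/database/recover
mkdir -p ${REC}
cp ${ARC}/CALADCTOPESVLD.frm    ${REC}/
cp ${ARC}/CALADCTOPESVLD.MYD.gz ${REC}/
gunzip ${REC}/CALADCTOPESVLD.MYD.gz
# USE_FRM rebuilds the missing .MYI index from the table definition
mysql -u root recover -e 'REPAIR TABLE CALADCTOPESVLD QUICK USE_FRM ;'
mysql -u root recover -e 'SELECT COUNT(*) FROM CALADCTOPESVLD ;'
mysqldump -u root recover CALADCTOPESVLD > ${ARC}/CALADCTOPESVLD.dump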
------------------------------------------------------------------ gunzip /data/database/recover/CALADCTOPESVLD.MYD.gz Mysql> mkdir /data/database/recover Mysql> cp /data/database/offline/db.opt /data/database/recover/db.opt Mysql> ARC=/data/archive/COPY/20080702/offline/ Mysql> cp ${ARC}/CALADCTOPESVLD.MYD.gz /data/database/recover/ Mysql> cp ${ARC}/CALADCTOPESVLD.frm /data/database/recover/ Mysql> gunzip /data/database/recover/CALADCTOPESVLD.MYD.gz Mysql> mysqlshow -u root recover Database: recover +----------------+ | Tables | +----------------+ | CALADCTOPESVLD | +----------------+ Mysql> mysql -u root recover mysql> repair no_write_to_binlog table CALADCTOPESVLD quick use_frm ; +------------------------+--------+----------+-----------------------------------------+ | Table | Op | Msg_type | Msg_text | +------------------------+--------+----------+-----------------------------------------+ | recover.CALADCTOPESVLD | repair | warning | Number of rows changed from 0 to 115593 | | recover.CALADCTOPESVLD | repair | status | OK | +------------------------+--------+----------+-----------------------------------------+ 2 rows in set (0.46 sec) Mysql> mysqlshow -u root recover CALADCTOPESVLD Database: recover Table: CALADCTOPESVLD Rows: 115593 +--------------+------------+-----------+------+-----+---------------------+----------------+---------------------------------+---------+ | Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment | +--------------+------------+-----------+------+-----+---------------------+----------------+---------------------------------+---------+ | SEQNO | int(11) | NULL | | PRI | | auto_increment | select,insert,update,references | | | TIMESTART | datetime | NULL | | MUL | 0000-00-00 00:00:00 | | select,insert,update,references | | | TIMEEND | datetime | NULL | | MUL | 0000-00-00 00:00:00 | | select,insert,update,references | | | DETECTORMASK | tinyint(4) | NULL | YES | | | | select,insert,update,references | | | SIMMASK | tinyint(4) | NULL | YES | | | | select,insert,update,references | | | TASK | int(11) | NULL | YES | | | | select,insert,update,references | | | AGGREGATENO | int(11) | NULL | YES | | | | select,insert,update,references | | | CREATIONDATE | datetime | NULL | | | 0000-00-00 00:00:00 | | select,insert,update,references | | | INSERTDATE | datetime | NULL | | | 0000-00-00 00:00:00 | | select,insert,update,references | | +--------------+------------+-----------+------+-----+---------------------+----------------+---------------------------------+---------+ Mysql> mysqldump -u root recover CALADCTOPESVLD > /data/archive/COPY/20080702/offline/CALADCTOPESVLD.dump Second iteration, this should have been done with Mysql> REC=/data/archive/COPY/20080702/recover Mysql> mkdir -p ${REC} Mysql> cp ${ARC}/CALADCTOPESVLD.MYD.gz ${REC}/ Mysql> cp ${ARC}/CALADCTOPESVLD.frm ${REC}/ Mysql> gunzip ${REC}/CALADCTOPESVLD.MYD.gz mysql> drop table CALADCTOPESVLD ; mysql> restore table CALADCTOPESVLD from '/data/archive/COPY/20080702/recover' ; Mysql> mysqldump -u root recover CALADCTOPESVLD > /data/archive/COPY/20080702/offline/CALADCTOPESVLD.dump2 Mysql> diff /data/archive/COPY/20080702/offline/CALADCTOPESVLD.dump /data/archive/COPY/20080702/offline/CALADCTOPESVLD.dump2 ########## # CONDOR # ########## The number of held glideins built up gradually since the Tue 8 July 10 AM, peak over 16K Fri 10 AM, down gradually to 12K 12:30 Monday. 4664 ? 
R 1267:16 condor_gridmanager -f -C (Owner=?="gfactory"&&JobUniverse==9) -o gfactory -S /tmp/condor_g_scratch.0xbb57928.32564 14670 ? R 1216:00 \_ /opt/condor/sbin/gahp_server 17635 pts/10 Ss 0:00 -bash 17731 pts/10 R+ 0:00 \_ ps xf The condor_gridmanager is still running. MINOS25 > condor_q -l 164258.0 -- Submitter: minos25.fnal.gov : <131.225.193.25:64545> : minos25.fnal.gov MyType = "Job" TargetType = "Machine" ClusterId = 164258 QDate = 1215957560 CompletionDate = 0 ... EnteredCurrentStatus = 1215963651 HoldReason = "Globus error 9: the system cancelled the job" HoldReasonCode = 2 HoldReasonSubCode = 9 ReleaseReason = UNDEFINED NumSystemHolds = 1 Managed = "Schedd" ServerTime = 1216050068 MINOS25 > condor_q gfactory -hold ... 11525 jobs; 0 idle, 0 running, 11525 held Mixture of 11330 Globus error 17: the job failed when the jo 3 Globus error 43: the job manager failed to 192 Globus error 9: the system cancelled the jo HJOBS=`grep 'Globus' gfactoryhold.log | cut -f 1 -d ' '` for JOB in ${HJOBS} ; do usleep 100000 ; condor_rm ${JOB} ; done MINOS25 > condor_q gfactory > cqfact.log 164013.1 gfactory 7/12 13:52 0+00:00:00 H 0 0.0 glidein_startup.sh 164021.2 gfactory 7/12 14:54 1+22:20:12 R 0 0.0 glidein_startup.sh 164186.0 gfactory 7/13 04:08 1+08:34:16 R 0 0.0 glidein_startup.sh ... 164248.1 gfactory 7/13 08:42 1+04:27:09 R 0 0.0 glidein_startup.sh 164248.2 gfactory 7/13 08:42 1+04:27:09 R 0 0.0 glidein_startup.sh 164248.3 gfactory 7/13 08:42 1+04:27:09 R 0 0.0 glidein_startup.sh 164258.1 gfactory 7/13 08:59 0+00:00:00 X 0 0.0 glidein_startup.sh 164258.2 gfactory 7/13 08:59 0+00:00:00 X 0 0.0 glidein_startup.sh 164260.0 gfactory 7/13 09:02 0+00:00:00 X 0 0.0 glidein_startup.sh ... 164262.2 gfactory 7/13 09:19 0+00:00:00 X 0 0.0 glidein_startup.sh 164267.0 gfactory 7/13 09:55 0+00:00:00 X 0 0.0 glidein_startup.sh 164278.0 gfactory 7/13 10:45 0+00:00:00 X 0 0.0 glidein_startup.sh 164443.0 gfactory 7/14 09:28 0+00:00:00 I 0 0.0 glidein_startup.sh 164443.1 gfactory 7/14 09:28 0+00:00:00 I 0 0.0 glidein_startup.sh 164445.0 gfactory 7/14 09:35 0+00:00:00 I 0 0.0 glidein_startup.sh 164445.1 gfactory 7/14 09:35 0+00:00:00 I 0 0.0 glidein_startup.sh 164445.2 gfactory 7/14 09:35 0+00:00:00 I 0 0.0 glidein_startup.sh 164447.0 gfactory 7/14 09:49 0+00:00:00 I 0 0.0 glidein_startup.sh 164447.1 gfactory 7/14 09:49 0+00:00:00 I 0 0.0 glidein_startup.sh 164447.2 gfactory 7/14 09:49 0+00:00:00 I 0 0.0 glidein_startup.sh 164453.0 gfactory 7/14 10:22 0+00:00:00 I 0 0.0 glidein_startup.sh 164461.0 gfactory 7/14 11:18 0+00:00:00 I 0 0.0 glidein_startup.sh 49 jobs; 10 idle, 38 running, 1 held MINOS25 > RJOBS=`condor_q -run gfactory | grep gfactory | cut -f 1 -d ' '` MINOS25 > condor_rm 164021.2 Job 164021.2 marked for removal Went into stat 'X' MINOS25 > for JOB in ${IJOBS} ; do sleep 1 ; condor_rm ${JOB} ; done Strange, my glideafs jobs kept running up through 09:00 Sunday : MINOS25 > dds -tr log/glideafs/*.out ... -rw-r--r-- 1 kreymer g020 5991 Jul 13 08:59 logs/glideafs/probe.164249.0.out -rw-r--r-- 1 kreymer g020 0 Jul 13 09:00 logs/glideafs/probe.164259.0.out ... Here is a clue from GridJobId = "gt2 fngp-osg.fnal.gov:2119/jobmanager-condor https://fnpcosg1.fnal.gov:40028/29205/1215888755/" Looking in the gfactory config file glideinWMS/creation/glideinWMS.xml ... 
[ diff of glideinWMS/creation/glideinWMS.xml against the previous version :
  lines 5, 13 and 21 changed ]

cd ~
vi start_factory.sh

Igor
--------------------------------------------------------
=============================================================================
2008 07 11
=============================================================================

#########
# ADMIN #
#########

Date: Fri, 11 Jul 2008 10:13:47 -0500
From: Jason Allen

Attached is a quote specifying the config of the new Minos servers.
The servers will have dual quad core 2.66GHz CPUs, 16GB RAM, mirrored 250GB
system and data disks, and redundant power supplies.  Additionally the
systems are compatible with SLF4.5 and SLF5.1.  These are very nice machines!

We have you down on the list for 3 servers, is that number correct?
Please reserve about 2K in the Minos budget for a rack and PDUs.
As I mentioned yesterday, placement of the new Minos servers is yet to be
determined.

---------------------------------------------------------------------
200 West North Avenue, Lombard, IL 60148.
Tel No.(630) 627-8811; Fax No. (630) 627-8877
www.koicomputer.com
FERMILAB  ATTN: GLENN COOPER/JASON ALLEN
EMAIL: gcooper@fnal.gov/jallen@fnal.gov
Quotation#20080707-02/INTEL (Revised)

Qty  Description                                            Unit Cost  Total Amount
 1   2U Dual Intel Xeon E5430 2.66GHz General Rack Server   $3,300.00  $3,300.00
     Breakdown:
 1   Supermicro SC823T-R500LPPB, 2U Black Rack chassis, 500W Redundant Power
     Supply. 6 x 3.5" Hot-swap SAS/SATA Drive bays, 1 x 5.25" + 1 x Slim CD-ROM
     Drive + 1 x 3.5" Floppy Drive Bays. Cooling: 4 x 80mm 6300RPM Fans.
     2U Slide Rails included.
 1   X7DBE, Intel 5000P (Blackford) Chipset, quad/dual core Intel 64-bit Xeon
     Support, 667/1066/1333MHz FSB. 8 x 240-pin DIMM sockets support up to 32GB
     DDR2 667 ECC Fully Buffered DIMM in dual channel. Onboard 6 x SATA 3.0Gbps
     Ports via ESB2 SATA Controller, ATI ES1000 16MB Graphics, Intel 82563EB
     Dual-port Gigabit Ethernet Controller. Expansion Slots: 2 (x8) & 1 (x4)
     PCI-Express, 2 x 64-bit 133MHz PCI-X, 1 x 64-bit 100MHz PCI-X.
     Extended ATX 12" x 13.05" Form Factor.
 1   AOC-SIMLP-B+ IPMI 2.0 Adapter
 2   Intel Xeon E5430 QC LGA771 2.66GHz 12MB 1333MHz Processor
 8   2GB DDR2-667 ECC Fully Buffered DIMM
 1   8x+ 24x24x24x Internal Black Slim Tray-type DVD/CDRW Combo Drive
 1   3Ware 9650SE-4LPML, 4 Port SATA2 PCIEx4 Multi-Lane RAID Controller
 4   Seagate ST3250310NS 250GB 32MB 7200RPM SATA Enterprise Ver.
HDD 1 Server Labor/3Year Parts and Labor On-site Repair Warranty 1 Test with Scientific Fermi Linux 5.1 TOTAL: $3,300.00 Checked out seagate ST3250310NS http://www.seagate.com/www/en-us/products/servers/barracuda_es/barracuda_es.2/ Interface Capacity Model # SAS 3Gb/s 500GB ST3500620SS ST3500620SS 500000.0 SAS 3Gb/s 750GB ST3750630SS ST3750630SS 750000.0 SAS 3Gb/s 1000GB ST31000640SS ST31000640SS 1000000.0 SATA 3.0Gb/s 250GB ST3250310NS ST3250310NS 250000.0 SATA 3.0Gb/s 500GB ST3500320NS ST3500320NS 500000.0 SATA 3.0Gb/s 750GB ST3750330NS ST3750330NS 750000.0 SATA 3.0Gb/s 1000GB ST31000340NS Pricewatch : ST3250310NS $ 79 to 95 ST3500320NS $ 100 to 120 ST3750330NS $ 148 to 250 ST31000340NS $ 230 to 255 ============================================================================= 2008 07 10 ============================================================================= ######### # MYSQL # ######### reforwarded this to minosdb-support Date: Tue, 20 May 2008 16:52:55 +0000 (UTC) From: Arthur Kreymer To: minosdb-support@fnal.gov Subject: Re: new MINOS hardware (fwd) Here's an extract of recent discussions regarding purchase of a server-class ( 64 bit Intel architecture ) replacement for the present minos-mysql1 production Mysql server. We would get 24x7 class hardware, and ask for 8to17by7 hardware and software support. ---------- Forwarded message ---------- Date: Thu, 15 May 2008 14:38:28 -0500 From: Robert Hatcher To: Joseph Boyd Cc: minos-admin@fnal.gov Subject: Re: new MINOS hardware On May 13, 2008, at 2:06 PM, Joseph Boyd wrote: > I believe everything you want is possible. Please let me know how > much data disk space and scratch space you want on each of the > servers. I think the only server that has any real requirement on disk space is the one that we will use for the MySQL warehouse and even those needs are relatively moderate in this day-and-age. The sizes I'll list are really only minimums, if larger disks are standard order those at your discretion. MySQL replacement: 2 * 230GB /data (mirrored) + 1 * 230GB /local/scratchX ... ####### # NET # ####### Wireless seems to be out on WH12 generally. My laptop, from my office KREYMERFNALGOV.dhcp.fnal.gov, connects to : 131.225.94.231 is connected to w-s-wh11se-g on port radio Last detected on this switch at 2008/07/10/14:07 25 MAC addresses have been seen on port radio of w-s-wh11se-g. Date: Thu, 10 Jul 2008 14:23:30 -0500 (CDT) Subject: HelpDesk ticket 118505 ___________________________________________ Short Description: WH12W wireless non-functional Problem Description: We do not seem to be getting wireless connections on WH12. When I connect to wireless from my WH12 SW office (1260), I am connected at follows : 131.225.94.231 is connected to w-s-wh11se-g on port radio When I scan for available networks, I see fgz, and tuftswireless on demand Unsecured computer-to-computer network The WH12SW wireless access point is at the entrance to my office, so this is not a signal strength problem. ___________________________________________ NB - MRTG shows no nodes connected to w-s-wh12sw-g ___________________________________________ See also tickets 118488 7/10 plunk assigned to andrews 118406 7/09 perdue assigned to andrews ___________________________________________ The CD leave request page shows that Dave Coder is not here this week. Please reassign this to someone who is here. See also two other tickets, assigned to Chuck Andrews 118488 7/10 118406 7/09 which have had no action indicated in Remedy. 
Chuck is not listed on the leave page, but is he around ?
___________________________________________

Spoke to Chuck Andrews around 14:10.
Our access point failed to reboot, he will come reset it.

My own laptop was broadcasting tuftswireless !
(Wireless Network Connection)
Change the order of preferred networks (Preferred Networks)
tuftswireless  Remove  OK
View Wireless Networks

There are several more doing this now on other floors.

The wireless transmitters have been reducing their power to minimize
interference, based on traffic they see.  This is a recent policy change.
When they see rogue access points, which have always been around, they drop
their power, causing access problems.  The network group is working today to
roll this change back until some way can be found to filter out the rogues.
___________________________________________

The w-s-wh12sw-g AP has been rebooted and firmware reloaded.
It is still not picking up any clients.
But several rogue access points have been located and corrected by candrews,
so we should be getting much better service now.
___________________________________________

Date: Fri, 11 Jul 2008 15:52:50 -0500 (CDT)
This ticket has been reassigned to ANDREWS, CHARLES of the CD-LSCS/CNCS/SN
Group.
___________________________________________

Chuck visited the WH12SW area this afternoon, and removed several rogue
access points, including one on my own laptop in WH12W 1260.

He restarted the w-s-wh12sw-n wireless access point, which eventually seems
to have downloaded software, and resumed normal operation, according to its
status lights.

As of 17:28, I see several connections to the access point.
Note that the name ends in -n , not -g .

This ticket can be closed, thanks !
___________________________________________

Date: Mon, 14 Jul 2008 09:35:20 -0500 (CDT)
Solution: Adhoc laptops were causing localized problems - removed adhoc SSID
from WCS reported laptops - OK now.
This ticket was resolved by ANDREWS, CHARLES of the CD-LSCS/CNCS/SN group.
___________________________________________

000d93ee8658 118402 7/8/2008 10:23:14 PM
wireless reception bad in WH11SW since replacement of wireless hub

On 9-July-2008, two laptops (one on WH12 SW & 1 on WH13 SW) were found that
were acting as Adhoc Rogues.  Both laptops were advertising themselves as
Access Points, transmitting unauthorized SSIDs and interfering with the Fermi
Wireless network on WH10, WH11, WH12 and WH13.  Both laptops were returned to
correct operating configurations, and normal wireless operation was restored.

-Chuck- 840-2721

=============================================================================
2008 07 09
=============================================================================

#######
# CVS #
#######

Updated WebDocs/cvs-rep.html to deprecate ssh key pairs,
and document kerberos access.
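For reference, the documented kerberos path amounts to carrying a fresh ticket over
ssh instead of using a key pair (a minimal sketch; the host and repository path below
are placeholders, not the values given on the page):

# Sketch : kerberized CVS checkout , no ssh key pair involved .
# Host and repository path are placeholders -- see WebDocs/cvs-rep.html .
kinit ${USER}@FNAL.GOV
export CVS_RSH=ssh
cvs -d :ext:minoscvs@cvsserver.example.fnal.gov:/cvs/minoscvs co WebDocs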
####### # CVS # ####### Added all Minos Cluster users to minoscvs .k5login ypcat passwd | cut -f 1 -d : | sort | wc -l 211 ypcat passwd | cut -f 1 -d : | sort > minosusers Removed non-Minos entries baisley boyd condor dawson ettab fromm jlkaiser joes kevinh lisa lsfadm mgreaney mindata minfarm minoscvs minsoft products sam samread sfiligoi timm cp .k5login .k5loginnew NUSERS=`cat minosusers` for username in ${NUSERS} ; do if grep -q \^${username}@FNAL.GOV ${HOME}/.k5loginnew then printf "HAVE ${username}\n" else printf "ADD ${username}\n" echo ${username}@FNAL.GOV >> ${HOME}/.k5loginnew fi done HAVE 37 ADD 153 -bash-3.00$ wc -l .k5login* 43 .k5login 14 .k5login.20050527 15 .k5login.20070122 13 .k5login.20070214 42 .k5login.bak 196 .k5loginnew 28 .k5login~ cp ${HOME}/.k5login ${HOME}/.k5login.bak sort -u -o ${HOME}/.k5login ${HOME}/.k5loginnew -bash-3.00$ wc -l .k5login* 196 .k5login 14 .k5login.20050527 15 .k5login.20070122 13 .k5login.20070214 42 .k5login.bak 196 .k5loginnew -bash-3.00$ date Wed Jul 9 17:43:04 CDT 2008 ####### # SRM # ####### MCFILS=`cat /minos/data/minfarm/mcnear/cp_to_dc` CPFILS=`for FIL in ${MCFILS} ; do sam locate ${FIL} ; done 2>&1 | \ grep Datafile | cut -f 2 -d "'"` n13047014_0025_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047014_0027_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047014_0028_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047014_0029_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0000_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0001_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0002_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0014_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0015_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0016_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0021_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047041_0030_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047042_0000_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047042_0002_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047042_0004_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047042_0006_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root SRV1> printf "${CPFILS}\n" | wc -l 18 for FIL in ${CPFILS} ; do ls -l /minos/data/minfarm/mcnear/${FIL} ; done These are all 500 MB files . 
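For scale, 18 files of roughly 528 MB is about 9.5 GB per pass, so the full-pass
wall clocks recorded below work out to roughly 10.9 MB/sec for the GUC pass and
7.6 MB/sec for the srmcp pass (back-of-the-envelope, using the 'real' times below):

echo 'scale=2; 18*528/872'  | bc    # GUC pass   : 9504 MB / 872 s  ~ 10.9 MB/sec
echo 'scale=2; 18*528/1253' | bc    # srmcp pass : 9504 MB / 1253 s ~  7.6 MB/sec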
/usr/local/grid/setup.sh export X509_USER_PROXY=/local/globus/minfarm/.grid/x509up_u1334 setup encp v3_6d -q stken SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL GPATH=gsiftp://stkendca2a.fnal.gov:2811///NULL LPATH=file:////minos/data/minfarm/mcnear EPATH=file:////export/stage/minfarm/CPFILS SRV1> FIL=n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root SRV1> time ecrc /minos/data/minfarm/mcnear/${FIL} CRC 883220849 real 0m7.075s user 0m1.924s sys 0m2.423s SRV1> time srmcp ${LPATH}/${FIL} ${SPATH}/${FIL} real 0m48.549s user 0m9.628s sys 0m9.133s chmod 600 /local/globus/minfarm/.grid/x509up_u1334 SRV1> time globus-url-copy ${LPATH}/${FIL} ${GPATH}/${FIL} real 0m19.137s user 0m1.607s sys 0m6.124s SRV1> time srmcp ${LPATH}/${FIL} ${SPATH}/${FIL} real 0m50.392s user 0m9.757s sys 0m8.851s Let's tune the GUC to be simlar to srmcp as we use it, based on options listed with globus-url-copy -help : GUC='globus-url-copy -binary -create-dest -fast -rst-retries 24 -rst-interval 600 -rst-timeout 10000 -block-size 1M -parallel 1 ' GUCV="${GUC} -vb" SRV1> time ${GUCV} ${LPATH}/${FIL} ${GPATH}/${FIL} SRV1> time ${GUCV} ${LPATH}/${FIL} ${GPATH}/${FIL} Source: file:////minos/data/minfarm/mcnear/ Dest: gsiftp://stkendca2a.fnal.gov:2811///NULL/ n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root 528312697 bytes 16.91 MB/sec avg 16.91 MB/sec inst real 0m31.644s user 0m0.097s sys 0m4.082s SRV1> time ${GUCV} ${LPATH}/${FIL} ${GPATH}/${FIL} Source: file:////minos/data/minfarm/mcnear/ Dest: gsiftp://stkendca2a.fnal.gov:2811///NULL/ n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root 528312697 bytes 15.89 MB/sec avg 15.89 MB/sec inst real 0m33.663s user 0m0.097s sys 0m5.681s SRMC='srmcp -streams_num=1 -server_mode=active -protocols=gsiftp' ecrc times are consistent, real 0m7.075s real 0m6.157s real 0m7.208s real 0m6.089s srmcp elapsed times vary real 0m50.392s real 0m40.475s real 0m45.932s SRMC elapsed times can be real 0m42.429s real 0m42.493s real 0m37.965s real 0m45.403s GUC times have been real 0m19.137s real 0m31.644s real 0m33.663s real 1m10.560s real 0m57.457s real 0m28.931s real 0m28.527s real 0m26.159s real 0m51.112s real 0m45.346s real 0m27.960s 15:05 time { for FIL in ${CPFILS} ; do ecrc /minos/data/minfarm/mcnear/${FIL} ${SRMC} ${LPATH}/${FIL} ${SPATH}/${FIL} done ; } date real 20m52.922s user 3m29.292s sys 3m8.408s Wed Jul 9 15:25:53 CDT 2008 time { for FIL in ${CPFILS} ; do ecrc /minos/data/minfarm/mcnear/${FIL} ${GUC} ${LPATH}/${FIL} ${GPATH}/${FIL} done ; } date real 14m31.935s user 0m37.659s sys 2m17.358s Wed Jul 9 15:40:59 CDT 2008 Repeated with 'globus-url-copy' instead of GUC real 21m32.638s user 1m3.671s sys 2m25.088s Wed Jul 9 16:42:16 CDT 2008 setup dcap klist -f DPATH=dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/NULL MINOS26 > type dccp dccp is hashed (/afs/fnal.gov/ups/dcap/v2_26_f0213/Linux+2.4/bin/dccp) MINOS26 > dccp /minos/data/minfarm/mcnear/${FIL} ${DPATH}/${FIL} 540263111 bytes in 118 seconds (4471.19 KB/sec) MINOS26 > dccp /minos/data/minfarm/mcnear/${FIL} ${DPATH}/${FIL} 540263111 bytes in 14 seconds (37685.76 KB/sec) SRV1> type dccp dccp is hashed (/fnal/ups/prd/dcap/v2_42_f0710/Linux-2-6/bin/dccp) SRV1> dccp /minos/data/minfarm/mcnear/${FIL} ${DPATH}/${FIL} Command failed! Server error message for [1]: "path /pnfs/fnal.gov/usr/minos/NULL/n13047014_0025_L010185N_D04.cand.cedar_phy_bhcurv.1.root not found" (errno 10001). 540263111 bytes in 33 seconds (15987.90 KB/sec) ... 
540263111 bytes in 20 seconds (26380.03 KB/sec) 540263111 bytes in 20 seconds (26380.03 KB/sec) Testing the bulk copy, rates reported as uniformly 29 to 30 MB/sec ( 18 sec ) not sure why the messy command failed messages. SRV1> time { for FIL in ${CPFILS} ; do ecrc /minos/data/minfarm/mcnear/${FIL} dccp /minos/data/minfarm/mcnear/${FIL} ${DPATH}/${FIL} done ; } date 540263111 bytes in 18 seconds (29311.15 KB/sec) 534044330 bytes in 18 seconds (28973.76 KB/sec) 535609734 bytes in 22 seconds (23775.29 KB/sec) 521546727 bytes in 21 seconds (24253.48 KB/sec) 530692815 bytes in 28 seconds (18509.10 KB/sec) 535020334 bytes in 22 seconds (23749.13 KB/sec) 536381262 bytes in 20 seconds (26190.49 KB/sec) 536748102 bytes in 19 seconds (27587.79 KB/sec) 541163157 bytes in 31 seconds (17047.73 KB/sec) 531467411 bytes in 23 seconds (22565.70 KB/sec) 554185242 bytes in 21 seconds (25771.26 KB/sec) 546379858 bytes in 33 seconds (16168.91 KB/sec) 532681782 bytes in 33 seconds (15763.55 KB/sec) 527743225 bytes in 62 seconds (8312.49 KB/sec) 536695576 bytes in 31 seconds (16906.99 KB/sec) 528616952 bytes in 45 seconds (11471.72 KB/sec) 527392960 bytes in 27 seconds (19075.27 KB/sec) 528312697 bytes in 21 seconds (24568.11 KB/sec) real 17m43.132s user 1m7.907s sys 2m20.122s Wed Jul 9 18:33:26 CDT 2008 Running on minos26, where we've seen better rates Noted that the CRC's were taking longer than the copies ! MINOS26> time { for FIL in ${CPFILS} ; do ecrc /minos/data/minfarm/mcnear/${FIL} dccp /minos/data/minfarm/mcnear/${FIL} ${DPATH}/${FIL} done ; } date 540263111 bytes in 30 seconds (17586.69 KB/sec) 534044330 bytes in 10 seconds (52152.77 KB/sec) 535609734 bytes in 16 seconds (32691.02 KB/sec) 521546727 bytes in 15 seconds (33954.87 KB/sec) 530692815 bytes in 24 seconds (21593.95 KB/sec) 535020334 bytes in 48 seconds (10885.02 KB/sec) 536381262 bytes in 31 seconds (16897.09 KB/sec) 536748102 bytes in 16 seconds (32760.50 KB/sec) 541163157 bytes in 32 seconds (16514.99 KB/sec) 531467411 bytes in 17 seconds (30530.07 KB/sec) 554185242 bytes in 15 seconds (36079.77 KB/sec) 546379858 bytes in 12 seconds (44464.51 KB/sec) 532681782 bytes in 11 seconds (47290.64 KB/sec) 527743225 bytes in 18 seconds (28631.90 KB/sec) 536695576 bytes in 43 seconds (12188.76 KB/sec) 528616952 bytes in 13 seconds (39709.81 KB/sec) 527392960 bytes in 14 seconds (36788.01 KB/sec) 528312697 bytes in 30 seconds (17197.68 KB/sec) real 29m55.317s user 1m2.824s sys 0m39.356s MINOS26 > date Wed Jul 9 19:07:55 CDT 2008 Make a local copy on /local/scratch26 mkdir /local/scratch26/kreymer/CPFILS time { for FIL in ${CPFILS} ; do echo ${FIL} cp /minos/data/minfarm/mcnear/${FIL} /local/scratch26/kreymer/CPFILS/${FIL} done ; } date Wed Jul 9 19:07:55 CDT 2008 real 16m4.144s user 0m1.670s sys 0m53.230s Wed Jul 9 19:25:41 CDT 2008 Test dccp from /local/scratch26 In many cases, elapsed dccp time is much longer than reported. Ganglia shows many minute or more long gaps with no I/O. including 19:38 to 19:42 ( 4 minutes ! 
) saved minos26net.20080709.png image MINOS26> time { for FIL in ${CPFILS} ; do ecrc /local/scratch26/kreymer/CPFILS/${FIL} dccp /local/scratch26/kreymer/CPFILS/${FIL} ${DPATH}/${FIL} done ; } date 540263111 bytes in 13 seconds (40584.67 KB/sec) 534044330 bytes in 15 seconds (34768.51 KB/sec) 535609734 bytes in 17 seconds (30768.02 KB/sec) 521546727 bytes in 15 seconds (33954.87 KB/sec) 530692815 bytes in 16 seconds (32390.92 KB/sec) 535020334 bytes in 16 seconds (32655.05 KB/sec) 536381262 bytes in 33 seconds (15873.03 KB/sec) 536748102 bytes in 39 seconds (13440.21 KB/sec) 541163157 bytes in 21 seconds (25165.70 KB/sec) 531467411 bytes in 17 seconds (30530.07 KB/sec) 554185242 bytes in 17 seconds (31835.09 KB/sec) 546379858 bytes in 38 seconds (14041.42 KB/sec) 532681782 bytes in 14 seconds (37156.93 KB/sec) 527743225 bytes in 27 seconds (19087.93 KB/sec) 536695576 bytes in 28 seconds (18718.46 KB/sec) 528616952 bytes in 18 seconds (28679.31 KB/sec) 527392960 bytes in 18 seconds (28612.90 KB/sec) 528312697 bytes in 18 seconds (28662.80 KB/sec) real 24m25.210s user 0m59.146s sys 0m37.979s Wed Jul 9 19:50:30 CDT 2008 The next day, run again on minos26, timing each dccp Strange to see minute delays connecting to the pool queueinfo shows only 8 stores w-stkendca11a-6 1 writePools w-stkendca20a-2 3 ExpDbWritePools w-stkendca20a-3 4 ExpDbWritePools Min/Max delays are 8 83 540263111 bytes in 16 seconds (32975.04 KB/sec) real 0m24.324s 534044330 bytes in 27 seconds (19315.84 KB/sec) real 0m44.544s 535609734 bytes in 23 seconds (22741.58 KB/sec) real 0m48.259s 521546727 bytes in 29 seconds (17562.86 KB/sec) real 0m55.006s 530692815 bytes in 22 seconds (23557.03 KB/sec) real 0m58.104s 535020334 bytes in 30 seconds (17416.03 KB/sec) real 112.799s 536381262 bytes in 20 seconds (26190.49 KB/sec) real 0m29.646s 536748102 bytes in 19 seconds (27587.79 KB/sec) real 0m30.349s 541163157 bytes in 35 seconds (15099.42 KB/sec) real 60.711s 531467411 bytes in 29 seconds (17896.94 KB/sec) real 0m58.351s 554185242 bytes in 32 seconds (16912.39 KB/sec) real 104.013s 546379858 bytes in 35 seconds (15244.97 KB/sec) real 90.253s 532681782 bytes in 30 seconds (17339.90 KB/sec) real 0m45.880s 527743225 bytes in 29 seconds (17771.53 KB/sec) real 62.893s 536695576 bytes in 34 seconds (15415.20 KB/sec) real 92.619s 528616952 bytes in 30 seconds (17207.58 KB/sec) real 70.083s 527392960 bytes in 30 seconds (17167.74 KB/sec) real 0m53.432s 528312697 bytes in 15 seconds (34395.36 KB/sec) real 0m25.090s real 20m27.129s user 0m59.083s sys 0m37.151s Thu Jul 10 15:38:01 CDT 2008 dccp -d 4 /local/scratch26/kreymer/CPFILS/${FIL} ${DPATH}/${FIL} [Thu Jul 10 15:51:43 2008] Going to open file dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/NULL/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root in cache. Connected in 0.00s. Cache open succeeded in 1.66s. 528312697 bytes in 29 seconds (17790.70 KB/sec) MINOS26 > time dccp -d 4 /local/scratch26/kreymer/CPFILS/${FIL} ${DPATH}/${FIL} [Thu Jul 10 15:54:08 2008] Going to open file dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/NULL/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root in cache. Connected in 0.00s. Cache open succeeded in 3.93s. 528312697 bytes in 33 seconds (15634.25 KB/sec) real 4m7.100s user 0m1.649s sys 0m1.201s Weird. ganglia shows a blip of activity at 15:54 to 15:56, then nothing through 16:00, the timestamp on the file in PNFS. 
MINOS26 > time dccp -d999 /local/scratch26/kreymer/CPFILS/${FIL} ${DPATH}/${FIL} extra option: -alloc-size=528312697 Real file name: /local/scratch26/kreymer/CPFILS/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root. Using system native open for /local/scratch26/kreymer/CPFILS/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root. [Thu Jul 10 16:04:35 2008] Going to open file dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/NULL/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root in cache. Allocated message queues 0, used 0 Allocated message queues 1, used 1 Creating a new control connection to fndca1.fnal.gov:24725. Activating IO tunnel. Provider: [/afs/fnal.gov/ups/dcap/v2_26_f0213/Linux+2.4/lib/libgssTunnel.so]. Added IO tunneling plugin /afs/fnal.gov/ups/dcap/v2_26_f0213/Linux+2.4/lib/libgssTunnel.so for fndca1.fnal.gov:24725. Connected in 0.00s. Sending control message: 0 0 client hello 0 0 2 26 -uid=1060 -pid=7649 Server reply: welcome. Connected to fndca1.fnal.gov:24725 Setting hostname to minos26.fnal.gov. Sending control message: 1 0 client open "dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/NULL/n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root" w minos26.fnal.gov 51944 -timeout=-1 -onerror=default -alloc-size=528312697 -uid=1060 Got callback connection from stkendca10a.fnal.gov:51426 for session 1, myID 1. Enabling checksumming on write. Cache open succeeded in 1.80s. [6] Sending IOCMD_WRITE. [6] Expected position: 1048570 @ 0 bytes written. [6] Sending IOCMD_WRITE. [6] Expected position: 2097140 @ 0 bytes written. ... [6] Sending IOCMD_WRITE. [6] Expected position: 528312697 @ 0 bytes written. Using system native close for [3]. [6] unpluging node File checksum is: 1989075314 Sending CLOSE for fd:6 ID:1. ... N.B. long delay here ... Server reply: ok destination [1]. Removing unneeded queue [1] [6] destroing node 528312697 bytes in 17 seconds (30348.85 KB/sec) real 0m58.003s user 0m1.638s sys 0m1.094s The problem is not recycling file names, same delays with unique names. Tried disabling crc and allowing unsafe writes with -c and -u, no gain. v2_26_f0213/Linux+2.4 MINOS26 > setup dcap v2_41_f0610 Make a local copy on srv1 mkdir /export/stage/minfarm/CPFILS time { for FIL in ${CPFILS} ; do echo ${FIL} cp /minos/data/minfarm/mcnear/${FIL} /export/stage/minfarm/CPFILS done ; } date real 4m0.040s user 0m1.675s sys 1m52.409s Wed Jul 9 17:00:18 CDT 2008 Let's try local disk GUC to PNFS time { for FIL in ${CPFILS} ; do ecrc /export/stage/minfarm/CPFILS/${FIL} ${GUC} ${EPATH}/${FIL} ${GPATH}/${FIL} done ; } date real 18m14.040s user 0m37.082s sys 1m19.753s SRV1> date Wed Jul 9 17:24:16 CDT 2008 For reference, check ecrc times : time { for FIL in ${CPFILS} ; do ecrc /export/stage/minfarm/CPFILS/${FIL} done ; } date real 2m38.338s user 0m28.999s sys 0m16.279s Wed Jul 9 17:50:36 CDT 2008 time { for FIL in ${CPFILS} ; do ecrc /minos/data/minfarm/mcnear/${FIL} done ; } date real 2m9.727s user 0m34.299s sys 0m38.907s Wed Jul 9 17:52:58 CDT 2008 ############ # BLUWATCH # ############ blutickle is not preventing expiration of /minos/data on fnpcsrv1, and we keep getting errors. 
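One way to pin down whether the automount really drops, independent of bluwatch's own
access attempts, is to watch the mount table passively (a minimal sketch; the interval,
log file name and grep pattern are assumptions, not an existing script):

# Sketch : log whenever the /minos/data NFS mount is absent from /proc/mounts .
# Reading /proc/mounts does not itself trigger the automounter ;
# the pattern matches the NFS entry, not an autofs trigger .
while true ; do
    grep -q ' /minos/data nfs' /proc/mounts || \
        echo "`date '+%Y-%m-%d %H:%M:%S'` /minos/data not mounted"
    sleep 30
done >> ${HOME}/bluexpire.log &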
Made local 55 second sleep copy of bluwatch in /home/kreymer, started this up around 09:20 ####### # NET # ####### LOW admin Date: Wed, 09 Jul 2008 09:02:40 -0500 (CDT) Subject: HelpDesk ticket 118397 ___________________________________________ Short Description: DNS port radomization security test seems to fail Problem Description: A worldwide coordinated DNS security fix was apparently deployed yesterday. http://www.doxpara.com/ The DNS checker at that site reports that the Fermilab DNS server is vulnerable. This seems odd to me, as we deployed new DNS servers at that time. Here is the test result : Your name server, at 131.225.8.120, appears vulnerable to DNS Cache Poisoning. All requests came from the following source port: 32770Requests seen for 4e7a194afa10.toorrr.com: 131.225.8.120:32770 TXID=22669 131.225.8.120:32770 TXID=24309 131.225.8.120:32770 TXID=38045 131.225.8.120:32770 TXID=19642 131.225.8.120:32770 TXID=59774 ___________________________________________ Date: Wed, 09 Jul 2008 09:15:43 -0500 (CDT) This ticket has been reassigned to TANG, DAVID of the CD-LSCS/CNCS/SN Group. ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 07 08 ============================================================================= ####### # NET # ####### Date: Tue, 08 Jul 2008 15:19:02 -0500 (CDT) Subject: HelpDesk ticket 118378 ___________________________________________ Short Description: DNS lookups failing on fnsrv0 Problem Description: DNS network lookups are failing on DNS server fnsrv0, but not fnsrv1. This is causing a myriad of problems site wide. A recent example : MINOS01 > nslookup 131.225.193.11 fnsrv0 Server: fnsrv0 Address: 131.225.8.120#53 ** server can't find 11.193.225.131.in-addr.arpa: SERVFAIL MINOS01 > nslookup 131.225.193.11 fnsrv1 Server: fnsrv1 Address: 131.225.17.150#53 11.193.225.131.in-addr.arpa name = minos11.fnal.gov. ___________________________________________ Date: Tue, 08 Jul 2008 16:04:50 -0500 (CDT) Note To Requester: tang@fnal.gov sent this Notes To Requester: Resolved. Related to DNS server glitch this afternoon ___________________________________________ Date: Wed, 09 Jul 2008 08:33:03 -0500 (CDT) Solution: Resolved. Related to DNS server glitch this afternoon This ticket was resolved by CODER, DAVE of the CD-LSCS/CNCS/SN group. ___________________________________________ ___________________________________________ ########## # DCACHE # ########## LOW Date: Tue, 08 Jul 2008 13:31:52 -0500 (CDT) Subject: HelpDesk ticket 118354 ___________________________________________ Short Description: FNDCA write pool time settings Problem Description: The time threshold for writing files from FNDCA write pools are intended to be : 4 hours - writePools 24 hours - RawDataWritePools I have good reason to suspect that these have reverted to small values. This probably happened even before the DCache 1.8 upgrade. Please reset the time thresholds to these normal values, to improve overall efficiency and reduce wear on the tapes and robots. This is particularly important for the Raw pools, where Minos writes one file per hour. ___________________________________________ Date: Mon, 08 Sep 2008 14:02:41 -0500 (CDT) Note To Requester: swhicks@fnal.gov sent this Notes To Requester: Arthur, > This ticket is over 60 days old. If this problem still exists, please let us know and a new ticket can be issued. 
If not, I will close this ticket now. Thanks, Stanley ___________________________________________ Date: Mon, 08 Sep 2008 19:45:21 +0000 (GMT) I have had not response to this ticket.. I believe that the problems persists. ___________________________________________ Date: Mon, 22 Sep 2008 19:06:08 +0000 (GMT) The raw data files continue to be written to tape immediately, which is bad : MINOS26 > ls -l /pnfs/minos/fardet_data/2008-09 ... -rw-r--r-- 1 buckley e875 37049096 Sep 22 07:31 F00041967_0010.mdaq.root -rw-r--r-- 1 buckley e875 19926529 Sep 22 08:32 F00041967_0011.mdaq.root -rw-r--r-- 1 buckley e875 43658989 Sep 22 09:33 F00041967_0012.mdaq.root -rw-r--r-- 1 buckley e875 37256313 Sep 22 10:34 F00041967_0013.mdaq.root -rw-r--r-- 1 buckley e875 19872473 Sep 22 11:35 F00041967_0014.mdaq.root -rw-r--r-- 1 buckley e875 43571565 Sep 22 12:36 F00041967_0015.mdaq.root -rw-r--r-- 1 buckley e875 37174411 Sep 22 13:30 F00041967_0016.mdaq.root The latest of these files is already on tape as of 13:31 Thanks to the CD Ops meeting report, I went to Bugzilla, http://www-ccf.fnal.gov/Bugzilla/show_bug.cgi?id=100 I see a reply to me last Friday, but I do not have a record of receiving this email. ___________________________________________ Date: Mon, 22 Sep 2008 14:57:30 -0500 (CDT) I have sent this info to the dcache developers and included it in the bugzilla ticket. Please let us know if you don't hear from somebody in a reasonable (24 hr?) amount of time. Stanley ___________________________________________ Date: Wed, 29 Oct 2008 15:24:15 +0000 (GMT) This ticket was initially logged July 8. Updated 8 Sept. Updated 22 Sept. The problem persists. This is not academic. Our raw data tapes are being mounted once per file, which is extremly bad for the tapes. For example, the current, partially filled far detector data tape has been mounted over 1000 times : MINOS26 > enstore info --vol VO8699 {'blocksize': 131072, 'capacity_bytes': 214748364800L, 'comment': '', 'declared': 1199726325.0, 'eod_cookie': '0000_000000000_0001335', 'external_label': 'VO8699', 'first_access': 1206710452.0, 'last_access': 1225291110.0, 'library': 'CD-9940B', 'media_type': '9940B', 'remaining_bytes': 160116203520L, 'si_time': [1222728047.0, 1130679712.0], 'sum_mounts': 1153, 'sum_rd_access': 65, 'sum_rd_err': 0, 'sum_wr_access': 1335, 'sum_wr_err': 1, 'system_inhibit': ['none', 'none'], 'user_inhibit': ['none', 'none'], 'volume_family': 'minos.fardet_data.cpio_odc', 'wrapper': 'cpio_odc', 'write_protected': 'n'} ___________________________________________ http://www-ccf.fnal.gov/Bugzilla/show_bug.cgi?id=100 ------- Comment #3 From Stanley W. Hicks 2008-10-29 14:36:29 ------- ARTHUR KREYMER wrote again about this problem. He says it is on-going and continues to be an issues (since July 9). I am upping the priority from P5 to P2 due to this being nearly 4 months old now: ___________________________________________ Date: Thu, 30 Oct 2008 15:22:29 -0500 (CDT) From: Dmitry Litvintsev Hi Alex, just not to make impression that we are walking in circles here. Last time Art reported about it - I looked in log files and indeed the files get written to tape almost immediately. Without regard to what is specified in pool setup. We promised to raise the priority of this, but then it slipped through. Dmitry ___________________________________________ Date: Thu, 30 Oct 2008 20:51:13 +0000 (GMT) From: Arthur Kreymer All recent raw data files have been written 'too soon'. I claim that files are on tape based on valid Level 4 metadata. 
The PNFS file time seen by 'ls' changes when the file moves to tape. So the simplest way to see the problem is to look at files in any of our recent data directories. The data directories are under /pnfs/minos/ fardet_data/2008-10 neardet_data/2008-10 beam_data/2008-10 far_dcs_data/2008-10 near_dcs_data/2008-10 For example, /pnfs/minos/far_dcs_data/2008-10/F081029_000012.mdcs.root was written to DCache at 2008-10-29 22:53:26 The time in PNFS is Oct 29 22:55 Similarly, /pnfs/minos/near_dcs_data/2008-10/N081029_000002.mdcs.root was written to DCache at 2008-10-30 00:10:00 The time in PNFS is Oct 30 00:12 As of around 15:20 CDT, the latest far detector data file, /pnfs/minos/fardet_data/2008-10/F00042108_0002.mdaq.root is on tape VO8699, with time stamp Oct 30 14:41 That is consistent with the enstore info for the volume, the last_access field. Since Wed, 29 Oct 2008 15:24:15 +0000 (GMT), our latest Far Detector tape VO8699 has been written to 33 times, and mounted 29 times. Statistics around 15:25 today : MINOS26 > enstore info --vol VO8699 ... 'last_access': 1225395669.0, ... 'sum_mounts': 1182, 'sum_rd_access': 65, 'sum_rd_err': 0, 'sum_wr_access': 1368, 'sum_wr_err': 1, Previous statistics > > MINOS26 > enstore info --vol VO8699 ... > > 'sum_mounts': 1153, > > 'sum_rd_access': 65, > > 'sum_rd_err': 0, > > 'sum_wr_access': 1335, > > 'sum_wr_err': 1, ___________________________________________ Date: Fri, 31 Oct 2008 16:29:51 -0500 Hi Art, we still looking on the issue of minos files written to the tape too often. Dmitri came up with possible solution. We will try to change store queue parameter on the pool but we will defer this change to Monday to observe system behavior. Sorry for inconvenience, Alex. ___________________________________________ Date: Tue, 04 Nov 2008 14:33:03 -0600 (CST) From: Dmitry Litvintsev To: Alex Kulyavtsev Cc: Arthur Kreymer , swhicks@fnal.gov, dcache-admin@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 118354 has additional info. Hi Art, We believe that since yesterday the 24 hour policy has been reinstated and is in fact observed. Let us know if you still find exceptions from this rule. The policy has been set in effect only on pools belonging to RawDataWritePools. ___________________________________________ Yes, this looks OK, no writes since around noon 4 Nov, as of 07:30 5 Nov, in fardet_data ___________________________________________ ######## # FARM # ######## PEND - have 16/24 subruns for N00012636_*.cosmic.sntp.cedar_phy.0.root 5 Right, these subruns 16-23 don't appear in Ben's to-be-processed lists (nor do they appear in the 'suppressed' lists). So this run should be forced out. SRV1> ./roundup -n -f 0 -s N00012636 -r cedar_phy near MISSING N00012636_0016..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0017..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0018..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0019..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0020..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0021..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0022..cosmic.sntp.cedar_phy.0.root MISSING N00012636_0023..cosmic.sntp.cedar_phy.0.root SRV1> ./roundup -f 0 -s N00012636 -r cedar_phy near Tue Jul 8 11:13:15 CDT 2008 SRV1> ln -sf roundup.20080703 roundup # was roundup.20080624 This puts in the 1+ minute delay after concatenation, so that fresh file will be written. 
And abandons SRM_CONFIG, we now use X509_USER_PROXY SRV1> ./roundup -s N00012636 -r cedar_phy near Picked up the other partial run after missing subruns were rerun N00012596 Forced out the remaining run, which spanned months N00012681 SRV1> ./roundup -f 0 -r cedar_phy near ########## # CONDOR # ########## Added rbpatter ( Patterson ) to Minos Analysis and Production groups. ============================================================================= 2008 07 07 ============================================================================= ########## # CONDOR # ########## Date: Mon, 07 Jul 2008 15:16:42 -0500 (CDT) Subject: HelpDesk ticket 118305 ___________________________________________ Short Description: Minos Cluster - condor 7.0.3 preinstallation run2-sys : Please install the following RPM on nodes minos01 thru minos25 . http://fermigrid.fnal.gov/files/condor/condor-7.0.3-linux-x86-rhel3-dynamic-1.i386.rpm This rpm places new files in /opt/condor-7.0.3, and should not interfere with existing operations. I will submit a separate request for configuration files when we are ready. ___________________________________________ Date: Mon, 07 Jul 2008 15:48:35 -0500 (CDT) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 07 Jul 2008 16:04:33 -0500 (CDT) Solution: schmitz@fnal.gov sent this solution: installed new condor rpms ___________________________________________ Date: Mon, 07 Jul 2008 17:32:45 -0500 Putting it back now. Mark ___________________________________________ Condor 7.0.1 has been reinstalled and configured, and test jobs have been run. Thanks to the run2-sys group for quickly correcting this ! We seem to be back up. Running jobs probably were lost. The outage seems to have been from 16:00 to 18:00 CDT. Sorry about this ! Probably an 'rpm --upgrade' was done instead of the intended 'rpm --install' . ___________________________________________ Glidein jobs - last success 16:00, first new job 18:05 N.B. Files in /local/stage1/condor had wrong ownership, daemon instead of condor. How did that happen ? MIN > for NODE in ${NODES} ; do printf "$NODE "; ssh -ax ${NODE} 'du -sm /opt/condor-*' ; done minos01 1 /opt/condor-6.8.6 1 /opt/condor-6.9.5 1 /opt/condor-7.0.1 242 /opt/condor-7.0.3 ... minos25 1 /opt/condor-6.8.6 1 /opt/condor-6.9.5 1 /opt/condor-7.0.1 242 /opt/condor-7.0.3 minos26 237 /opt/condor-6.8.6 MINOS01 > cd /opt MINOS01 > dds total 44 drwxr-xr-x 9 root root 4096 Jul 7 15:54 ./ drwxr-xr-x 32 root root 4096 Dec 4 2007 ../ lrwxrwxrwx 1 root root 12 Apr 16 14:07 condor -> condor-7.0.1/ drwxr-xr-x 6 root root 4096 Jul 7 15:54 condor-6.8.6/ drwxr-xr-x 4 root root 4096 Jul 7 15:54 condor-6.9.5/ drwxr-xr-x 5 root root 4096 Jul 7 15:54 condor-7.0.1/ drwxr-xr-x 13 root root 4096 Jul 7 15:54 condor-7.0.3/ ########## # DCACHE # ########## Date: Mon, 07 Jul 2008 11:26:40 -0500 (CDT) From: HelpDesk ___________________________________________ Short Description: Most FNDCA readPools pools are down Problem Description: dcache-admin : A user reported a failure to open a file in DCache this morning. Only two of the 13 readPools pools seem to be active. 
r-stkendca16a-6 r-stkendca9a-2 The rest are absent from http://fndca.fnal.gov:2288/cellInfo and http://fndca.fnal.gov:2288/queueInfo ___________________________________________ Sent poolstat summary to this ticket MINOS26 > ./poolstat.20080707 verb Mon Jul 7 11:47:27 CDT 2008 DOWN TOT POOL GROUP 14 ExpDbWritePools 4/ 10 FermigridVolPools v-gwdca01-1 v-gwdca01-2 v-stkendca6a-1 v-stkendca6a-2 1/ 12 KTeVReadPools r-stkendca13a-2 15 MinosPrdReadPools 8 RawDataWritePools 11/ 13 readPools r-gwdca01-1 r-gwdca01-2 r-stkendca13a-5 r-stkendca13a-6 r-stkendca14a-5 r-stkendca14a-6 r-stkendca15a-5 r-stkendca15a-6 r-stkendca16a-5 r-stkendca6a-1 r-stkendca6a-2 6/ 16 writePools w-gwdca01-1 w-gwdca01-2 w-stkendca12a-4 w-stkendca12a-6 w-stkendca6a-1 w-stkendca6a-2 ___________________________________________ Two files are stuck in the generate write pools _ PNFS status for /pnfs/minos/reco_near/cedar_phy/cand_data/2007-04/N00012120_0001.spill.cand.cedar_phy.0.root -rw-r--r-- 1 rubin e875 395142882 Jul 3 23:48 N00012120_0001.spill.cand.cedar_phy.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:956eded9;l=395142882; LEVEL 4 PNFS status for /pnfs/minos/reco_near/cedar_phy/cand_data/2007-04/N00012120_0003.cosmic.cand.cedar_phy.0.root -rw-r--r-- 1 rubin e875 112667526 Jul 3 23:50 N00012120_0003.cosmic.cand.cedar_phy.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:5440fe70;l=112667526; LEVEL 4 These are still in /minos/data/minfarm/WRITE/ ___________________________________________ Date: Mon, 07 Jul 2008 12:38:11 -0500 From: Stanley Hicks Hi, Didn't want you to think nobody is reading your messages; I just don't have an answer yet and am looking for help from others here. Thanks for all the input and I or somebody will be getting back with you on this before too long. Stanley ___________________________________________ Date: Mon, 07 Jul 2008 14:39:06 -0500 Art, "missing" pools stem from the configuration inconsistency when pools are described in PoolManager.conf (and shown on page http://fndca3a.fnal.gov:2288/poolInfo/pools/* - I guess your script starts from there) and these pools are not actually connected to the head node. E.g. gwdca01 and stkendca6a are test pools and were used to test dcache v1.8 before moving in production. We are looking to fix configuration. Alex. ___________________________________________ Date: Mon, 07 Jul 2008 16:03:44 -0500 Hi Art, besides few pools from test system described in configuration, there were ten pools which did not start properly after dcache upgrade due to communication error during pool startup. I restarted these ten pools, the list is attached below. Thanks for noticing and reporting the issue. For helpdesk: please close the ticket, we filed separate ticket in dcache support on dcache.org to address the root cause in the dcache code. Alex. ___________________________________________ Date: Mon, 07 Jul 2008 22:12:41 +0000 (UTC) Thanks, I see that the pools are back as advertised. We are still missing two files : /pnfs/minos/reco_near/cedar_phy/cand_data/2007-04/N00012120_0001.spill.cand.cedar_phy.0.root /pnfs/minos/reco_near/cedar_phy/cand_data/2007-04/N00012120_0003.cosmic.cand.cedar_phy.0.root I could easily remove and rewrite these to PNFS. But for present, I will leave them alone for your investigation. ___________________________________________ Date: Tue, 08 Jul 2008 19:03:32 -0500 Hi Art, each file was present in two write pools : precious copy (in now dead pool) and the cached copy on other pool. 
We checked CRC for existing copy and set files to precious thus both files were written to tapes. Please close the ticket. Best regards, Alex, Vladimir. ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ Mon Jul 7 14:13:55 CDT 2008 DOWN TOT POOL GROUP 14 ExpDbWritePools 4/ 10 FermigridVolPools 1/ 12 KTeVReadPools 15 MinosPrdReadPools 8 RawDataWritePools 11/ 13 readPools 6/ 16 writePools Mon Jul 7 15:32:42 CDT 2008 DOWN TOT POOL GROUP 14 ExpDbWritePools 4/ 10 FermigridVolPools 1/ 12 KTeVReadPools 15 MinosPrdReadPools 8 RawDataWritePools 11/ 13 readPools 4/ 16 writePools Mon Jul 7 16:56:30 CDT 2008 DOWN TOT POOL GROUP 14 ExpDbWritePools 4/ 10 FermigridVolPools 12 KTeVReadPools 15 MinosPrdReadPools 8 RawDataWritePools 4/ 13 readPools 4/ 16 writePools ############ # poolstat # ############ poolstat.20080707 - genaralized pool name match, anything with 2 non-consec - MINOS26 > ln -sf poolstat.20080707 poolstat # was poolstat.20070611 ######### # ADMIN # ######### Created jsm62 account ######### # ADMIN # ######### Date: Mon, 07 Jul 2008 10:56:11 -0500 (CDT) Subject: HelpDesk ticket 118265 ___________________________________________ Short Description: Default login shell on Minos Cluster - please change to bash Problem Description: run2-sys : Please change the default login shell for new users on the Minos Cluster from /usr/local/bin/tcsh to /bin/bash This is the preferred shell, in policy and practice. ___________________________________________ Date: Mon, 07 Jul 2008 11:00:53 -0500 (CDT) This ticket has been reassigned to ALLEN, JASON of the CD-SF/FEF Group. ___________________________________________________________________ Date: Tue, 05 Aug 2008 17:00:03 -0500 (CDT) Solution: jonest@fnal.gov sent this solution: > I updated the /root/bin/add_minos_user script. >> I edited the function 'user_info' to use /bin/bash as the default >> shell rather than the user's >> fnalu account shell. > ___________________________________________________________________ ######## # FARM # ######## cedar_phy near stuck , Sun Jul 6 20:17:06 CDT 2008 PURGING WRITE files 442 ... rm: remove write-protected regular file `N00012620_0013.cosmic.cand.cedar_phy.0.root'? -rw-r--r-- 1 minfarm numi 801602717 Jul 6 16:26 /home/minfarm/ROUNTMP/WRITE/N00012620_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 minfarm numi 801602717 Jul 6 16:26 /home/minfarm/ROUNTMP/WRITE/N00012620_0000.spill.sntp.cedar_phy.0.root This is not group writeable, SRV1> ls -l ~/ROUNTMP/WRITE/ | grep minospro | grep 'r--r--' -rw-r--r-- 1 minospro numi 112909866 Jul 3 00:25 N00012620_0013.cosmic.cand.cedar_phy.0.root -rw-r--r-- 1 minospro numi 111808381 Jul 3 00:27 N00012626_0016.cosmic.cand.cedar_phy.0.root -rw-r--r-- 1 minospro numi 82860929 Jul 3 00:28 N00012633_0006.cosmic.cand.cedar_phy.0.root MIN > ssh -l minospro minos26 PRO> cd /minos/data/minfarm/WRITE PRO> for SUB in N00012620_0013 N00012626_0016 N00012633_0006 ; do ls -l ${SUB}.cosmic.cand.cedar_phy.0.root ; done PRO> for SUB in N00012620_0013 N00012626_0016 N00012633_0006 ; do chmod 664 ${SUB}.cosmic.cand.cedar_phy.0.root ; done 22616 pts/1 T 0:00 | \_ rm N00012620_0013.cosmic.cand.cedar_phy.0.root kill 22616 kill -9 22616 This is now a Zombie process Killed off the parent, also needed -9 15813 pts/1 Z 0:01 | \_ [roundup] Killed loopCPn, will wait for DCache to recover before restarting. 
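A pre-check for this class of problem would keep the purge from stalling on an rm
prompt again (a minimal sketch; the WRITE path is the one used above, and the
commented fix-up is the same chmod applied interactively today):

# Sketch : flag files in the WRITE area that are not group writeable .
WRITE=/minos/data/minfarm/WRITE
find ${WRITE} -maxdepth 1 -type f ! -perm -020 -ls
# run as the owning account ( minospro here ) to apply the same fix :
# find ${WRITE} -maxdepth 1 -type f ! -perm -020 -exec chmod 664 {} \;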
############ # BLUWATCH # ############ bluwatch.20080707 Created lasterr directory, shifted files there. Renamed latest to last, for consistency, symlink for transition Renamed LASTERR to LASTBAD for consistency with log message. cd ${MINOS_DATA}/log_data/bluwatch mkdir lastbad mkdir lastslo mv latest last ; ln -s last latest mv *.txt lastbad/ Stopped and restarted minos26, looks OK. ln -sf bluwatch.20080707 bluwatch Restarted minos25, minos01, minos-sam03, fnpcsrv1 cd ${MINOS_DATA}/log_data/bluwatch mv LASTERR LASTBAD Added README file. Started up on fnpcsrv1 at 09:59 Jul 7 12:25:30 fnpcsrv1 automount[14027]: expired /minos/data ... Jul 7 12:53:02 fnpcsrv1 automount[10043]: attempting to mount entry /minos/data Started getting failures, 12:12:14 12:17:14 12:19:14 12:21:15 Started up a tickle script interactively, around 12:10 while true ; do sleep 50 ; ls -d /minos/data/minfarm 2>&1 > /dev/null ; done This disappeared, put it in a script ./blutickle & # around 13:25 ####### # SAM # ####### Date: Sat, 05 Jul 2008 16:24:22 -0500 From: Rashid Mehdiyev Hi Art, do you know what is wrong with this SAM query command below ? I used it this way in Jan,08, but now it does not return me anythinng interesting,,, sam list files --dim="run_type physics% and data_tier mc-far and mc.beam='L010185N' and mc.bfield='1' and mc.flavor='0' and mc.release='daikon_00' and mc.vtxregion='3' and run_number>=0 and run_number<=1111" No files match the given constraints. ------------------------------------------------------------------------- SAMDIM='run_type physics% and data_tier mc-far and mc.beam=L010185N \ and mc.bfield=1 \ and mc.flavor=0 \ and mc.release=daikon_00' MINOS26 > sam list files --summaryonly --dim="${SAMDIM}" File Count: 586 SAMDIM='run_type physics% and data_tier mc-far and mc.beam=L010185N \ and mc.bfield=1 \ and mc.flavor=0 \ and mc.release=daikon_00 \ and mc.vtxregion=3 \ ' MINOS26 > sam locate f21011015_0000_L010185N_D00.reroot.root ['/pnfs/minos/mcin_data/far/daikon_00/L010185N/101,317@voc139'] I see no vtxregion 3 files : MINOS26 > ls /pnfs/minos/mcin_data/far/daikon_00/L010185N/*/f23* ls: /pnfs/minos/mcin_data/far/daikon_00/L010185N/*/f23*: No such file or directory ============================================================================= 2008 07 03 ============================================================================= 17:06:45, adjusted bluwatch sleep on fnpcsrv1 to 58 seconds Also note the last expiration in /var/log/messages was Jul 3 13:28:55 fnpcsrv1 automount[17005]: expired /minos/data Perhaps Steve adjusted something. ########## # DCACHE # ########## Created HOWTO.dcachetest Need to replicate the pre-upgrade tests into scripts, and run these regularly, perhaps monthly. ######## # FARM # ######## Started up aggressive cedar_phy near and far concatenation, SRV1> ./loopCPn & SRV1> ./loopCPf & Will stop these when they have caught up in a day or so, over the weekend. 
loopCPf got caught up, only one run to process, F00037968 The script is unable to move the files to target Bluearc areas, Set STOP flag touch /minos/data/minfarm/STOP.cedar_phynear mv: cannot move `N00012004_0000.cosmic.sntp.cedar_phy.0.root' to `/minos/data/reco_near/cedar_phy/sntp_data/2007-04/N00012004_0000.cosmic.sntp.cedar_phy.0.root': Permission denied ln: `N00012004_0000.cosmic.sntp.cedar_phy.0.root': File exists MINOS26 > ls -ld /minos/data/reco_near/cedar_phy/sntp_data/2007-04 drwxr-xr-x 2 mindata e875 2048 Feb 15 11:07 /minos/data/reco_near/cedar_phy/sntp_data/2007-04 GRRRRRRRRRRRRRRRR Directories under /minos/data/reco_near/cedar_phy/sntp_data 2007-04 through 2008-12 are mode 755, not 775. How did this happen ? Why did these all get created on Feb 15 11:07 ? $ chmod 775 /minos/data/reco_near/cedar_phy/sntp_data/2007-* $ chmod 775 /minos/data/reco_near/cedar_phy/sntp_data/2008-* This is a mess, will have to stop and restart loopCPn, and try to move and repair these symlinks. The previous pass was on cand_data, so no problem there. SRV1> grep 'cannot move' cedar_phynear.log | wc -l 26 MFILES=`grep 'cannot move' cedar_phynear.log | cut -c 18- | cut -f 1 -d "'"` for FILE in ${MFILES} ; do ls -l ${FILE} BLUE=/minos/data/reco_near/cedar_phy/sntp_data/2007-04/${FILE} mv ${FILE} ${BLUE} ln -s ${BLUE} ${FILE} done WHEW - looks OK now. rm /minos/data/minfarm/STOP.cedar_phynear ####### # TWW # ####### Regarding ticket 087003 Date: Mon, 30 Jun 2008 12:59:59 -0500 (CDT) From: Margaret_Greaney I have not heard back from Frank Nagy on this, but from what I see the upgrade of TWW caused new perl modules to be available and kcroninit does work on my attempts on fnalu on linux nodes. ------------------------------------ Still fails for me, same way, FLXI04 > kcroninit Can't locate Net/Domain.pm in @INC (@INC contains: /usr/krb5/lib /opt/TWWfsw/libdb42/lib/perl586 /opt/TWWfsw/imagemagick62/lib/perl586 /opt/TWWfsw/readline50/lib/perl586 /opt/TWWfsw/pe FLXI04 > type perl perl is /opt/TWWfsw/bin/perl FLXI04 > ls -l /opt/TWWfsw/bin/perl lrwxr-xr-x 1 kevinh root 41 May 23 2006 /opt/TWWfsw/bin/perl -> /opt/TWWfsw/perl586/bin/.perl.tww-wrapper FLXI04 > ls -l /opt/TWWfsw/perl586/bin/.perl.tww-wrapper -rwxr-xr-x 1 kevinh root 12363 Apr 11 2006 /opt/TWWfsw/perl586/bin/.perl.tww-wrapper ######### # ADMIN # ######### MINOS01 > cmd add_minos_user jsm62 INVALID: jsm62 does NOT have a valid fnalu account. Informed minos-admin and jsm62. Date: Mon, 07 Jul 2008 16:44:52 +0100 From: Jessica Mitchell I now have a FNALU account, so if you could set up my minos one that would be great! Created account at 10:45 ######### # ADMIN # ######### scan for AFS AUTOMATIC needing reset MIN > for NODE in ${UNODES} ; do printf "$NODE " ssh -ax ${NODE} 'grep ^OPTIONS= /etc/sysconfig/afs' ; done flxi02 OPTIONS=$LARGE flxi03 OPTIONS=$LARGE flxi04 OPTIONS=$MEDIUM flxi05 OPTIONS=$SMALL flxi06 OPTIONS=$MEDIUM flxi07 OPTIONS=$LARGE flxi09 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). You have new mail in /var/spool/mail/kreymer MIN > date Thu Jul 3 14:50:49 UTC 2008 See ticket 117526 ########### # BLUEARC # ########### Date: Thu, 03 Jul 2008 07:53:08 -0500 From: Andy Romero To: site-nas-announce@fnal.gov Subject: BlueArc problems Both RHEA cluster nodes are being rebooted. Should be back in ~15min more info to come CMS, Minos, FermiGrid and Windows are effected. 
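To see which monitored nodes saw this outage, the published bluwatch logs can be scanned for BAD entries. A sketch only, using the node list and URL pattern quoted in the /minos/data ticket later in this log:

for NODE in fnpcsrv1 minos-sam03 minos01 minos25 minos26 ; do
    echo "==== ${NODE}"
    wget -q -O - http://www-numi.fnal.gov/computing/dh/bluwatch/log/${NODE}.txt | grep BAD
done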
Date: Thu, 03 Jul 2008 08:35:04 -0500 From: Andy Romero To: site-nas-announce@fnal.gov Subject: BlueArc NAS service back on-line BlueArc NAS service back on-line ----------------------------------- bluwatch shows errors on minos nodes at 08:01 and 08:02 Also 08:21 and 08:23 on fnpcsrv1 ============================================================================= 2008 07 02 ============================================================================= ######## # FARM # ######## Clear out some cedar_phy candidates, and test new roundup ( no SRM_CONFIG ) SRV1> ./roundup.20080703 -b 2 -s cand -r cedar_phy near MINOS26 > sam locate N00012004_0000.cosmic.cand.cedar_phy.0.root ['/pnfs/minos/reco_near/cedar_phy/cand_data/2007-04,3@dcache'] MINOS26 > sam locate N00012004_0000.spill.cand.cedar_phy.0.root ['/pnfs/minos/reco_near/cedar_phy/cand_data/2007-04,3@dcache'] OK, let's do the rest of cand's on hand Rates are pretty lousy, 2 MB/second for these 2 files. Check this out tomorrow, using NULL, try dccp x509 again, after the DCache 1.8 upgrade SRV1> ./roundup.20080703 -s cand -r cedar_phy near ######## # DATA # ######## Clearing 1.3 TB of space, from removal of bad-field D04 data 2008 04 29 SRV1 > du -sm /minos/data/BAD/D4CLEAN 1372961 /minos/data/BAD/D4CLEAN SRV1> date ; time rm -r /minos/data/BAD/D4CLEAN Wed Jul 2 14:42:57 CDT 2008 real 14m56.042s user 0m0.028s sys 0m0.657s ######## # FARM # ######## Date: Wed, 02 Jul 2008 14:25:59 -0500 (CDT) Subject: HelpDesk ticket 118137 ___________________________________________ Short Description: Please create minospro account on minos26 Problem Description: run2-sys : Recent changes in Grid authentication have changed file ownership of new Minos production analysis files from rubin to minospro. We need a local minospro account in order to manage these files. Please create account minospro, group e875 on node minos26, with a local login area similar to mindata, probably /home/minospro. For initial access, please copy /home/mindata/.k5login to /home/minospro/ ( and change ownership to minospro ) ___________________________________________ Date: Wed, 02 Jul 2008 14:35:33 -0500 (CDT) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 02 Jul 2008 14:50:27 -0500 (CDT) Solution: ettab@fnal.gov sent this solution: Account has been created. ___________________________________________ Date: Wed, 2 Jul 2008 20:39:43 +0000 (UTC) From: Arthur Kreymer To: ettab@fnal.gov Thanks ! I have logged in successfully. Can you change the login shell to /bin/bash ? ___________________________________________ ___________________________________________ ########## # DCACHE # ########## Per blake,ochoa,arms Date: Wed, 02 Jul 2008 10:43:55 -0500 (CDT) Subject: HelpDesk ticket 118111 ___________________________________________ Short Description: Password reset for mindata ftp access Problem Description: dcache-admin : The password for ftp read access by user mindata seems to have changed after the 24 June upgrade to DCache 1.8. Please contact me (x4261) to arrange a reset of the password. ___________________________________________ This ticket is assigned to MESSER, TIM of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Thu, 03 Jul 2008 15:09:58 -0500 From: Timur Perelmutov Could you please try using the weak ftp door again? I think we found and fixed the problem. 
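The verification was a plain ftp session against the weak (unsecured) door. A sketch of the manual check, using the host, port and account from this thread; the password is typed at the prompt, never stored in a script:

$ ftp fndca1.fnal.gov 24126
Name (fndca1.fnal.gov:mindata): mindata
Password:                  ( the usual mindata password )
ftp> ls
ftp> bye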
___________________________________________ Date: Thu, 03 Jul 2008 15:11:40 -0500 From: Timur Perelmutov I spoke with Art on the phone and he confirmed that the door works again. ___________________________________________ Date: Thu, 03 Jul 2008 20:18:32 +0000 (UTC) From: Arthur Kreymer Timur reports the door being repaired around 15:10 today Thu 3 July, I tested an 'ls' command successfully at ftp fndca1.fnal.gov 24126 using account mindata, and the usual password. Thanks, and have a good Holiday weekend ! ___________________________________________ ___________________________________________ ############ # MCIMPORT # ############ MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N 2235895 /minos/data/mcimport/STAGE/daikon_04/L010185N 126550 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 368713 /minos/data/mcimport/STAGE/daikon_04/L010185N_helium 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N 27834 /minos/data/mcimport/STAGE/daikon_04/L250200N MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04 3599213 /minos/data/mcimport/STAGE/daikon_04 ############ # MCIMPORT # ############ 10:58 Restarted crontab, now that we have a working mcimport script again. crontab crontab.dat ####### # DAQ # ####### Checked for missing near dcs files, on dcsdcp-nd, using a password different than for daqdcp-nd, but in an obvious way. No files since June 27 21:26 N080628_000002.mdcs.root The archiver is running, there are no input files. Reported to Run Coordinator habig, he will restart the dcs scripts. ============================================================================= 2008 07 01 ============================================================================= ########### # MONTHLY # ########### DATASETS 7/1 PREDATOR 7/1 VAULT 7/3 MYSQL 7/2 Wed Jul 2 09:50:30 CDT 2008 Wed Jul 2 10:26:03 CDT 2008 ############ # PREDATOR # ############ Oops, restarted predator cronjob, was down since Monday morning, due to DCache outage. ########## # CONDOR # ########## Drafting 7.0.3 installation request ticket Short Description: Minos Cluster - condor 7.0.1 preinstallation run2-sys : Please install the following RPM on nodes minos01 thru minos25 . http://fermigrid.fnal.gov/files/condor/condor-7.0.3-linux-x86-rhel3-dynamic-1.i386.rpm This rpm places new files in /opt/condor-7.0.3, and should not interfere with existing operations. I will submit a separate request for configuration files when we are ready. 
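Once the install request is filled, a quick way to confirm the rpm landed everywhere (a sketch, assuming the ${NODES} cluster list used elsewhere in this log; the rpm is expected to create /opt/condor-7.0.3):

for NODE in ${NODES} ; do
    printf "${NODE} "
    ssh -ax ${NODE} 'ls -d /opt/condor-7.0.3 2>/dev/null || echo MISSING'
done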
___________________________________________ ######## # DATA # ######## mcimport Clearing space in /minos/data, $ du -sm /minos/data/mcimport/OVERLAY/mcin/DUP 77963 /minos/data/mcimport/OVERLAY/mcin/DUP These are mostly from 2007 Dec and 2008 Jan 09, one from Feb 20 17:44 n13037306_0006_L010185N_D04.reroot.root Removed them all ANALYSIS $ du -sm /minos/data/analysis/* 110141 /minos/data/analysis/NuMuBar 62019 /minos/data/analysis/database 466571 /minos/data/analysis/nc 672500 /minos/data/analysis/nonap 1404200 /minos/data/analysis/nue USERS 18 /minos/data/users/bckhouse 506457 /minos/data/users/boehm 1 /minos/data/users/jjling 29 /minos/data/users/kreymer 72289 /minos/data/users/loiacono 1 /minos/data/users/minsoft 142391 /minos/data/users/mishi 104 /minos/data/users/nickd 2289641 /minos/data/users/pawloski 1 /minos/data/users/rhatcher 102630 /minos/data/users/rmehdi 27825 /minos/data/users/rustem CAND strays Nothing in reco_near or reco_far SRV1> ls -l /minos/data/mcout_data/*/*/*/*/cand_data/* /minos/data/mcout_data/daikon_04/L250200N/near/cedar_phy_bhcurv/cand_data/700: total 1131008 -rw-rw-r-- 1 minospro numi 1158151211 Apr 21 20:20 n13037004_0009_L250200N_D04.cand.cedar_phy_bhcurv.1.root /minos/data/mcout_data/daikon_04/L250200N/near/cedar_phy_bhcurv/cand_data/701: total 1129128 -rw-rw-r-- 1 minospro numi 1156226310 Apr 21 20:21 n13037014_0008_L250200N_D04.cand.cedar_phy_bhcurv.1.root Remove these new minospro@minos26 account Oops, the directory is owned by minfarm, so do it from that account. SRV1> rm -r /minos/data/mcout_data/*/*/*/*/cand_data ########## # CONDOR # ########## Scanning /local/stage1 sizes ( symlinked to /local/scratch??/stage1 ) to see whether this could be decoupled from /local/scratch device to prevent full-disk problems. MIN > for NODE in ${NODES} ; do printf "$NODE " ; ssh -ax ${NODE} 'du -sm /local/scratch??/stage1' ; done minos01 26 /local/scratch01/stage1 minos02 32 /local/scratch02/stage1 minos03 81 /local/scratch03/stage1 minos04 77 /local/scratch04/stage1 minos05 76 /local/scratch05/stage1 minos06 68 /local/scratch06/stage1 minos07 34 /local/scratch07/stage1 minos08 65 /local/scratch08/stage1 minos09 63 /local/scratch09/stage1 minos10 76 /local/scratch10/stage1 minos11 1 /local/scratch11/stage1 minos12 67 /local/scratch12/stage1 minos13 1 /local/scratch13/stage1 minos14 78 /local/scratch14/stage1 minos15 80 /local/scratch15/stage1 minos16 64 /local/scratch16/stage1 minos17 67 /local/scratch17/stage1 minos18 70 /local/scratch18/stage1 minos19 61 /local/scratch19/stage1 minos20 70 /local/scratch20/stage1 minos21 79 /local/scratch21/stage1 minos22 64 /local/scratch22/stage1 minos23 60 /local/scratch23/stage1 minos24 75 /local/scratch24/stage1 minos25 458 /local/scratch25/stage1 minos26 1 /local/scratch26/stage1 Perhaps these file could go under /var, say /var/condor MIN > for NODE in ${NODES} ; do printf "$NODE " ; ssh -ax ${NODE} 'df -h /var/tmp | grep /var' ; done minos01 /dev/hda7 13G 285M 12G 3% /var minos02 /dev/hda6 22G 273M 21G 2% /var minos03 /dev/hda6 22G 269M 21G 2% /var minos04 /dev/hda6 22G 373M 21G 2% /var minos05 /dev/hda6 22G 268M 21G 2% /var minos06 /dev/hda6 22G 254M 21G 2% /var minos07 /dev/hda6 22G 260M 21G 2% /var minos08 /dev/hda6 22G 285M 21G 2% /var minos09 /dev/hda6 22G 273M 21G 2% /var minos10 /dev/hda6 22G 269M 21G 2% /var minos11 /dev/hda6 22G 138M 21G 1% /var minos12 /dev/hda6 22G 2.0G 19G 10% /var minos13 /dev/hda6 22G 261M 21G 2% /var minos14 /dev/hda6 22G 263M 21G 2% /var minos15 /dev/hda6 22G 260M 21G 2% /var minos16 /dev/hda6 
22G 261M 21G 2% /var minos17 /dev/hda6 22G 264M 21G 2% /var minos18 /dev/hda6 22G 267M 21G 2% /var minos19 /dev/hda6 22G 262M 21G 2% /var minos20 /dev/hda6 22G 266M 21G 2% /var minos21 /dev/hda6 22G 576M 21G 3% /var minos22 /dev/hda6 22G 265M 21G 2% /var minos23 /dev/hda6 22G 265M 21G 2% /var minos24 /dev/hda6 22G 259M 21G 2% /var minos25 /dev/hda6 22G 268M 21G 2% /var minos26 /dev/hda6 22G 416M 21G 2% /var ######## # DATA # ######## Date: Tue, 01 Jul 2008 11:20:37 -0500 (CDT) Subject: HelpDesk ticket 118050 ___________________________________________ Short Description: /minos/data error report from fnpcsrv1 Problem Description: Since about May 20, we have been running tests of file access to /minos/data, from fnpcsrv1.txt minos-sam03.txt minos01.txt minos25.txt minos26.txt The test script reads a few bytes from a different file every minute. /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/bluwatch We have seen failures only on fnpcsrv1 since 1 June. The failures on fnpcsrv1 continue on occasion. See the lines containing 'BAD' at http://www-numi.fnal.gov/computing/dh/bluwatch/log/fnpcsrv1.txt ___________________________________________ Date: Thu, 03 Jul 2008 15:31:00 -0500 (CDT) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/BLU Group. ___________________________________________ Date: Thu, 03 Jul 2008 15:31:00 -0500 (CDT) Note To Requester: I am assigning this ticket to the bluearc admins. All of those times correspond to times where we also saw failures in other parts of FermiGrid monitoring. The latest one at 7/3 08:23 was due to the reboot of the bluearc system, so maybe that will have fixed the problem. Steve Timm ___________________________________________ Date: Thu, 03 Jul 2008 16:36:25 -0500 (CDT) Note To Requester: I am corelating your file with the BlueArc logs. Some of the times in your log do match up to days on which we had known problems. for example: May 20 ~19:30 June 1 ~05:30 ... throughout the morning July 3 ~08:00 (today) Other times corelate to backups and snapshot management Other times do not corelate to anything. (possible net or host issues ?) Andy ___________________________________________ Date: Thu, 03 Jul 2008 22:11:58 +0000 (UTC) From: Arthur Kreymer <-- # @@@ Enter Update below this line. @@@ # --> Thanks for checking the BlueArc end. The latest failure on fnpcsrv1 is at Thu Jul 3 08:23:18 The following lines from /var/log/messages may be relevant : Jul 3 08:23:18 fnpcsrv1 automount[16061]: expired /minos/data ... Jul 3 08:25:18 fnpcsrv1 automount[10043]: attempting to mount entry /minos/data Jul 3 08:25:18 fnpcsrv1 automount[16384]: mount(nfs): mounted minos-nas-0.fnal.gov:/minos/data on /minos/data Jul 3 08:26:18 fnpcsrv1 automount[16472]: expired /minos/data It seems that the automount mount of /minos/data expired just as an access was being attempted. It seems that the automounter mounts for /minos/data expire in 1 minute, precisely the time that I wait between file checks. This is probably why I have occasional failures in my test scripts, but we have not seen global problems in farm processing. I would suggest a much more gentle automount timeout, like 1 hour. I do see that the /home/data expirations stopped after 13:28:55, so perhaps Steve has already adjusted something. <-- # @@@ Enter Update above this line. @@@ # --> ___________________________________________ ####### # SAM # ####### Date: Fri, 27 Jun 2008 11:05:59 -0500 From: Stephen P. 
White To: Arthur Kreymer Cc: sam-design mailing list Subject: Fix for Minos cached_files data. Art, Dbserver v8_4_5 fixes the cached_files bug that was introduced in v8_4_1. Please upgrade your dbserver to v8_4_5. Also, I've created a database fix for cached_file records that were created by dbservers from v8_4_1 to v8_4_4. After you upgrade the dbserver please run the update statement I've supplied. (Since MINOS has so few cahced_file records this can be done in 1 update statement.) update cached_files set owner_work_grp_id=uncaching_work_grp_id, uncaching_work_grp_id=null where cached_file_id in (select cached_file_id from cached_files where owner_work_grp_id is null and uncaching_work_grp_id is NOT null) Call me if you have questions..... Steve ------------------------------------------------------------ export SAM_ORACLE_CONNECT="samdbs/" bin/rlwrap sqlplus samdbs/@minosdev select cached_file_id from cached_files where owner_work_grp_id is null and uncaching_work_grp_id is NOT null ; This list is empty for dev and int There are 32 rows in prd CACHED_FILE_ID -------------- 596229 596235 596236 596237 596242 596250 596060 596064 596022 596023 596024 CACHED_FILE_ID -------------- 596025 596038 596043 596048 595743 595556 595557 595607 595621 595625 595626 CACHED_FILE_ID -------------- 595481 595482 595483 595491 595492 595493 595499 595402 595403 595405 32 rows selected. update cached_files set owner_work_grp_id=uncaching_work_grp_id, uncaching_work_grp_id=null where cached_file_id in (select cached_file_id from cached_files where owner_work_grp_id is null and uncaching_work_grp_id is NOT null) ; 32 rows updated. SQL> select cached_file_id from cached_files 2 where owner_work_grp_id is null and 3 uncaching_work_grp_id is NOT null ; no rows selected Tue Jul 1 11:44:11 CDT 2008 ####### # SAM # ####### On minos-sam01 and minos-sam02, upd install -j sam_db_srv_pkg v8_4_5 Fired this up, failed. Traceback (most recent call last): File "/home/sam/products/db_server_base_cx/v1_8/NULL/bin/DbListener.py", line 31, in ? import DbCORBAomni File "/home/sam/products/db_server_base_cx/v1_8/NULL/lib/DbCORBAomni.py", line 12, in ? from omniORB import CORBA ImportError: No module named omniORB Killed process: 6016 I needed to do upd install without the -j upd install sam_db_srv_pkg v8_4_5 informational: installed sam_pnfs_srv v8_4_1. informational: installed sam_dimension_server_prototype v8_4_2. informational: installed sam_config v7_1_6. informational: installed sam_idl_pylib v8_4_1. informational: installed omniORB v4_1_2. informational: installed sam_common_pylib v8_4_3. informational: installed sam_server_pylib v8_4_2. informational: installed sam_db_srv v8_4_5. informational: installed python v2_4_5. Warning: For product "sam_config"local node version v7_1_5 does not match distribution node version v7_1_6 Warning: For product "sam_config"local node version v7_1_5 does not match distribution node version v7_1_6 upd install succeeded. This works, running on dev/int Tested with 1/10/100 file projects, looks clean. Installed products on minos-sam01 also, same list. MINOS26 > sam get dbserver connection info Number of connections: 1 11:04 restarted production dbserver UNI=prd for N in 1 2 3 4 5 6 7 8 9 10 ; do echo ${N} date ./sam_test_py minos ${UNI} st-onesmall ./sam_test_py minos ${UNI} st-ten ./sam_test_py minos ${UNI} st-cen done ; date Tue Jul 1 11:06:19 CDT 2008 ... oops, that was done in dev. Repeated in prd Tue Jul 1 11:20:53 CDT 2008 ... 
Tue Jul 1 12:06:59 CDT 2008

=============================================================================
2008 06 30
=============================================================================

Kreymer unavailable since 24 June due to family emergency

Summary of activity during that period :
Condor 7.0.3 is available - minosadmin
stken/fndca downtime Wed - stkusers
pnfs restore Wed 25 Jun 16:41:17 - stkusers
stken coming up 18:50:35 - stkusers
fndca up 19:46:52 - stkusers
no file loss ?
rubin file access failure 08-06-26 10:17:41 - minosdata
117526 AFS - flxi02, 03 have been updated - minosadmin
cvs key request lwhitehead - minoscvs
sjc SLF 5 query - minosadmin
rubin stray dogwood cands - minosdata
Dbserver v8_4_5 needed, and db repair - minossamadmin
minossoft draft minutes - minossoft
saranen drive request LTO - minosdata
mcimport discussion 27 Jun 2008 14:48:52 arms - minossim
Sun, 29 Jun 2008 09:34:29 ftp access down bspeak - minosdata
Sun, 29 Jun 2008 11:35:59 beam data rhatcher/habig - minosdata
Mon, 30 Jun 2008 06:06:34 enstore down berg - stkusers
Mon, 30 Jun 2008 10:09:35 enstore up zalokar - stkusers

#########
# ADMIN #
#########

Date: Thu, 26 Jun 2008 16:28:31 -0400
From: Stephen Coleman
To: Kregg E Arms
Cc: Arthur Kreymer , Robert Hatcher
Subject: MC and Scientific Linux

Hi, With delivery of W&M's new farm mere weeks away, the lattice QCD co-owner
of the cluster and I are trying to hash out some details regarding software.
He's agreed to use Scientific Linux, but wants the most recent version (5.x)
because he's concerned about up-to-date drivers for all of our Infiniband cards.
Can you think of any reason that this would be a bad idea for either Minossoft
or Daikon/Eggplant/Fava/Garlic production? Our current farm is running obsolete
303, and I know the MINOS farm runs SL Fermi 4.4...
Thanks, -Stephen

My answer :
Minos has no publicly available SLF5 systems at present.
Not on FNALU interactive or batch systems, or the Cluster, or FermiGrid.
So we have no way to test compatibility at present.
SLF5 is being primarily deployed on desktops and laptops,
where bleeding edge hardware support is more important.

########
# FARM #
########

$ RFILE=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/mcin_data/far/daikon_03/CosmicLE/120/f20031209_0001_CosmicLE_D03.reroot.root
$ srmls ${RFILE}
236326221 /pnfs/fnal.gov/usr/minos/mcin_data/far/daikon_03/CosmicLE/120/f20031209_0001_CosmicLE_D03.reroot.root
$ srmcp ${RFILE} file:////tmp/testfile.root
Did this with grid and doe proxy

############
# MCIMPORT #
############

$ du -sm /minos/data/mcimport/*/mcin
6674 /minos/data/mcimport/arms/mcin
1 /minos/data/mcimport/boehm/mcin
1 /minos/data/mcimport/hgallag/mcin
391 /minos/data/mcimport/howcroft/mcin
1 /minos/data/mcimport/kordosky/mcin
12 /minos/data/mcimport/kreymer/mcin
1 /minos/data/mcimport/mtavera/mcin
542 /minos/data/mcimport/mualem/mcin
99561 /minos/data/mcimport/OVERLAY/mcin
1 /minos/data/mcimport/rhatcher/mcin
2149 /minos/data/mcimport/sjc/mcin

lrwxrwxrwx 1 mindata e875 8 May 16 14:08 mcimport.20080516 -> mcimport*
$ dds mcimport*
lrwxrwxrwx 1 mindata e875 17 Feb 15 13:54 mcimport -> mcimport.20080211*
...
mcimport.20080630

cp mcimport.20080516 mcimport.20080630
Added these changes .
/minos/scratch/app/OSG1/setup.sh export X509_USER_PROXY=/home/mindata/.grid/kreymer-grid.proxy $ AFSS/mcimport.20080630 -n -b 1 OVERLAY OK - processing 1 directories OVERLAY OOPS - /minos/data/mcimport/STAGE/OVERLAY not a directory daikon_00 daikon_04 URK - need to resolve mcimport.20080211 ( last version used for import ) mcimport.20080303 ( draft tar version ) mcimport.20080311 ( second draft tar , with local copy ) mcimport.20080326 ( last version used for Tarring, with -s ,dismount time ) mcimport.20080516 ( latest revision, sets X509_USER_PROXY ) TYPO, cp mcimport.20080326 mcimport.20080516 Recovered this from backup, MIN > cp -a /afs/fnal.gov/files/backup/home/room1/kreymer/minos/scripts/mcimport.20080516 . $ AFSS/mcimport.20080630 -n -b 1 OVERLAY OOPS - found /minos/data/mcimport/CRON/mcimport.pid OK - stale pid file OK - processing 1 directories OVERLAY Mon Jun 30 17:53:17 CDT 2008 OK - version mcimport.20080516 processing from /minos/data/mcimport/OVERLAY NOOP BAIL at 1 LOGS STAGE, MCINPURGE, MCINWRITE OK - staging 0 files OK - purging 0 MCIN files ? Mon Jun 30 17:53:17 CDT 2008 MCIN processing 63 files Mon Jun 30 17:53:17 CDT 2008 MCIN configuration n1303 _L010185N_D04.reroot.root srmcp -streams_num=1 -server_mode=active -protocols=gsiftp file:///n13037495_0010_L010185N_D04.reroot.root srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_04/L010185N/749/n13037495_0010_L010185N_D04.reroot.root BAIL aftar 1 files ~/saddmc --declare daikon_04 near/daikon_04/L010185N/749 >> /minos/scratch/mindata/log/saddmc/prd-near-daikon_04-L010185N.log 99565 /minos/data/mcimport/OVERLAY/ 1 /minos/data/mcimport/OVERLAY/tar 1 /minos/data/mcimport/OVERLAY/dcache 99561 /minos/data/mcimport/OVERLAY/mcin 1 /minos/data/mcimport/OVERLAY/mcin/dcache Mon Jun 30 17:53:22 CDT 2008 $ AFSS/mcimport.20080630 -b 1 OVERLAY $ less /minos/data/mcimport/OVERLAY/log/mcimport.log $ dds /pnfs/minos/mcin_data/near/daikon_04/L010185N/749/n13037495_0010_L010185N_D04.reroot.root -rw-r--r-- 1 kreymer e875 365410709 Jun 30 17:55 /pnfs/minos/mcin_data/near/daikon_04/L010185N/749/n13037495_0010_L010185N_D04.reroot.root MINOS26 > sam locate n13037495_0010_L010185N_D04.reroot.root ['/pnfs/minos/mcin_data/near/daikon_04/L010185N/749,30@dcache'] $ AFSS/mcimport.20080630 OVERLAY $ cp -a AFSS/mcimport.20080630 . $ ln -sf mcimport.20080630 mcimport # was mcimport.20080211 Let this cook, then restart the crontab.dat entry. ############ # MCIMPORT # ############ We need a grid proxy, to retain kreymer ownership of mcimport files. cd /local/scratch26/kreymer/grid . /minos/scratch/kreymer/VDT/setup.sh MINOS26 > grid-proxy-info -all -file kreymer-grid.proxy subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=2127860370 issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : kreymer-grid.proxy timeleft : 6432:20:40 (268.0 days) Let's test this mindata@minos-sam03 cd /home/mindata/.grid scp kreymer@minos26:/local/scratch26/kreymer/grid/kreymer-grid.proxy . $ export X509_USER_PROXY=/home/mindata/.grid/kreymer-grid.proxy $ srmcp -debug -streams_num=1 -server_mode=active file:////minos/scratch/parrot/F00031300_0000.mdaq.root srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/grid.dat $ dds /pnfs/minos/NULL/grid.dat -rw-r--r-- 1 kreymer e875 41379 Jun 30 14:33 /pnfs/minos/NULL/grid.dat This proxy was already present on minos26. 
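Since mcimport now depends on this proxy, a periodic lifetime check is worth having. A sketch only, using the proxy path above; the 30 day threshold is illustrative:

# sketch - warn when the mcimport grid proxy is close to expiring
PROXY=/home/mindata/.grid/kreymer-grid.proxy
MINDAYS=30
LEFT=`grid-proxy-info -file ${PROXY} -timeleft`    # seconds remaining
if [ ${LEFT} -lt $(( MINDAYS * 24 * 3600 )) ] ; then
    echo "OOPS - proxy ${PROXY} has only $(( LEFT / 86400 )) days left"
fi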
########## # PARROT # ########## Investigating cache area, on fnpc338 try -t Where to store temporary files. (PARROT_TEMP_DIR) export PARROT_TEMP_DIR=/local/stage1/parrot parrot -m ${PARROT_DIR}/mountfile.grow -d remote -t /local/stage1/parrot bash P> ls -l /tmp/par* ls: /tmp/par*: No such file or directory The environment variable by itself does not seem to work. ########## # DCACHE # ########## Reported Wed 16:31 to 19:47 outage to minos-data, minos_software_discussion ########## # DCACHE # ########## Public enstore down 06:06 - probably several hours. Up at 10:09 - reported to minos-data and msd. ########## # DCACHE # ########## No neardet data since Sunday morning But far continued thru current time ( how, with DCache down ? ) F00041292_0017.mdaq.root Mon Jun 30 14:07:39 UTC 2008 Beam files missing Thu/Fri/Sat/Sun Near and Far DCS look roughly OK Cannot investigate further until the public Enstore system comes back up. ============================================================================= 2008 06 24 ============================================================================= ########## # DCACHE # ########## Tests for DCache 1.8 upgrade 24 June SRM srls OK srmcp read OK srmcp write OK srmmkdir OK ( wrong ownership ) 6/23 DCACHE unsecured dccp OK loon OK DAQ write OK 11 files transferred, at 15:00 Transition, need to shift to OSG 1 for mcimport roundup ########## # DCACHE # ########## kreymer@minos26 cd ~/minos/scripts crontab crontab.dat minfarm@fnpcsrv1 cd /home/minfarm/ROUNTMP mv NOCAT NOCAT.ok mindata@minos26 ######## # FARM # ######## SRV1> cp -a AFSS/roundup.20080624 . SRV1> ln -sf roundup.20080624 roundup # was roundup.20080515 ./roundup -n -s 40403 -f 0 -r cedar far This would work, but ... OOPS - POOLS ACTIVE NEED 14 10 11 Adjusted pool limit to 10, now WRITING to DCache 82 subject : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990/CN=2146134877 timeleft : 2217:51:00 SRMCP 1/82 -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00041019_0000.all.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/sntp_data/2008-06 org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 425 Cannot open p ort: java.lang.Exception: Pool manager error: No write pools configured for ]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: C ustom message: Unexpected reply: 425 Cannot open port: java.lang.Exception: Pool manager error: No write pools configured for SRV1> ./roundup -n -w -s 40403 -r cedar far SRMCP 1/15 -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00040403_0001.spill.bntp.cedar.0.root /pnfs/minos/reco_far/cedar/.bntp_data/2008-03 srmcp -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00040403_0001.spill.bntp.cedar.0.root srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/reco_far/cedar/.bntp_data/2008-03/F00040403_0001.spill.bntp.cedar.0.root Try manually, round about 20:40 . 
/usr/local/grid/setup.sh export X509_USER_PROXY=/export/stage/minfarm/.grid/x509up_u1334 cd /minos/data/minfarm/WRITE srmcp -streams_num=1 -server_mode=active -protocols=gsiftp -debug=true \ file:///F00040403_0001.spill.bntp.cedar.0.root \ srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/F00040403_0001.spill.bntp.cedar.0.root srmcp -streams_num=1 -server_mode=active -protocols=gsiftp -debug=true \ file:///F00040403_0001.spill.bntp.cedar.0.root \ srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/reco_far/cedar/.bntp_data/2008-03/F00040403_0001.spill.bntp.cedar.0.root MINOS26 > ls -alF /pnfs/minos/reco_far/cedar/.bntp_data/2008-03/F00040403_0001.spill.bntp.cedar.0.root -rw-r--r-- 1 rubin e875 17425962 Jun 24 20:45 /pnfs/minos/reco_far/cedar/.bntp_data/2008-03/F00040403_0001.spill.bntp.cedar.0.root cp roundup roundupsrm2 SRMQ="-streams_num=1 -server_mode=active -protocols=gsiftp -debug=true" SRV1> ./roundupsrm2 -w -s 40403 -r cedar far D U H My bad, this was the pool shutdown last night. I was looking at the wrong log file. The current copies are running correctly, URK, need to check the support files for F00040403_0001.spill.bntp.cedar.0.root They seem to be there ( READ/SAM, ECRC ) The file is on tape. SRV1> ./roundup -w -s 40403 -r cedar far But the content of the ECRC/F00040403_0001.spill.bntp.cedar.0.root is the file, not the checksum. SRV1> setup encp -q stken SRV1> ecrc WRITE/F00040403_0001.spill.bntp.cedar.0.root CRC 334121273 SRV1> nedit ECRC/F00040403_0001.spill.bntp.cedar.0.root SRV1> ./roundup -w -s 40403 -r cedar far PURGING WRITE files 1 PURGED WRITE/F00040403_0001.spill.bntp.cedar.0.root SRV1> mv NOCAT NOCAT.ok ######## # FARM # ######## Per Howie's note ( minosdata ) SRMLS_PATH=srm://fndca1:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos srmls -debug=true \ "$SRMLS_PATH"/reco_near/cedar_phy/cand_data/2007-04/N00012001_0020.cosmic.cand.cedar_phy.0.root This works. Howie had been using /usr/local/vdt/setup.sh, still the old SRM, not /usr/local/grid/setup.sh, the new one. ########## # DCACHE # ########## DAQ logging seems to have halted last night after 23:09:30 First success after 11:04:35 As of 14:00, there are several unfinished copies pending from around 11:30 $ ps xf PID TTY STAT TIME COMMAND 17665 pts/1 Ss+ 0:00 -bash 13187 pts/0 Ss 0:00 -bash 11344 pts/0 R+ 0:00 \_ ps xf 3632 ? S 33:57 python /home/minos/bin/archiver_krb.py 3736 ? Z 0:00 \_ [kinit] 3578 ? S 3123:49 rotorooter -p9011 From ND RC stopped and started archiver. Files are moving again. Same for FD RC, files are moving again. > ls $DAQ_DATA_TO_ARCHIVE_DIR -l total 0 -rw-r--r-- 1 minos e875 0 Jun 24 11:29 F00041034_0018.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 12:29 F00041034_0019.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 12:36 F00041034_0020.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 13:05 F00041035_0000.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 13:07 F00041036_0000.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 13:07 F00041037_0000.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 14:07 F00041038_0000.mdaq.root ... > ls $DAQ_DATA_TO_ARCHIVE_DIR -l total 0 -rw-r--r-- 1 minos e875 0 Jun 24 14:07 F00041038_0000.mdaq.root -rw-r--r-- 1 minos e875 0 Jun 24 15:07 F00041038_0001.mdaq.root ####### # SAM # ####### Upgraded to v6_0_5_24_srm just before the oracle upgrade. Projects run, but very slowly, several minutes per batch of 5 files. MINOS26 > sam stop project --station=minos --project=sam_test_project_20080624125114 --force Killed after 45. Try again after the downtime. 
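To quantify the per-batch stalls on the next pass, the test output can be timestamped line by line. A sketch, assuming sam_test_py writes one 'Got ...' line per delivered file as in the transcript that follows:

./sam_test_py minos prd st-cen 2>&1 | while read LINE ; do
    echo "`date +%H:%M:%S` ${LINE}"
done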
Reran st-cen job, ran quickly up through file 45, then slowly. Long delay for each batch of 5 files, then they are delivered 1 per second. For example, delay after Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030619_0007.mdaq.root file 45 Than ok for Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030620_0000.mdaq.root file 46 Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030621_0000.mdaq.root file 47 Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030622_0000.mdaq.root file 48 Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030622_0001.mdaq.root file 49 Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00030622_0002.mdaq.root file 50 Delay is about 2 minutes Next pass was at full speed, 118 sec. for 100 files. ######## # DATA # ######## DAQ logging failed starting around midnight. Reran DC18 test , succeeds on test stand. Reran DC18 test using fndca1 host, production system, Cannot open port: java.lang.Exception: Pool manager error: No write pools configured for OK, this is consistent with draining the write queues prior to the upgrades. N.B. test stand web server at http://stkendca3a:2288/ ============================================================================= 2008 06 23 ============================================================================= ######## # DATA # ######## Date: Mon, 23 Jun 2008 16:42:04 -0500 (CDT) Subject: HelpDesk ticket 117640 ___________________________________________ Short Description: Unrequested roles being assigned by GUMS/VOMS , seen by dcache 1.8 Problem Description: We have seen a difference in behaviour between DCache 1.7 and 1.8, seen on the test stand recently. With DCache 1.7, if I do not assign a role to my grid proxy, I get the expected mapping of proxy to file ownership, in my case to the kreymer username. With DCache 1.8, they are seeing a Production role assigned, and the file ownership is accordingly mapped to minospro, something I do not want. But the proxy has no Production role. Is this something that can or should be corrected before the end of tomorrow's public DCache upgrade ? I wonder whether this same issue is contributing to recent confusion regarding Minos farm reconstruction file ownership ? The following is the latest exchange with DCache developers. Date: Mon, 23 Jun 2008 15:56:56 -0500 From: Timur Perelmutov To: Arthur Kreymer Cc: dcache-admin@fnal.gov, minos-data@fnal.gov, rubin@fnal.gov Subject: Re: File ownerships wrong using kreymer cert, dcache 1.8 But this voms-proxy-info output just confirms what I have said, your proxy had vo attributes and in this case the mapping was done by GUMs into the user minospro. If you want your proxy to be mappend into the user kreymer, you need to create the proxy with grid-proxy-init. In this case voms-proxy-info output will contain no " === VO fermilab extension information ===" line and no voms attributes listing. You want you voms proxy mapped to user kreymer even if you have VO attributes in your proxy, you need to resolve this with administrators of GUMS/VOMS servers. These servers are not controlled by dCache project or storage administrators. 
Thanks, Timur Arthur Kreymer wrote: > On Mon, 23 Jun 2008, Timur Perelmutov wrote: > > >> From the gPlazma logs I see: >> >> 06/23 09:01:53,489 VOAuthorizationPlugin: authRequestID 332223121 >> Requesting mapping for User with DN: /DC=org/DC=doegrids/OU=People/CN=Arthur >> E Kreymer 261310 and Role /fermilab/minos/Role=Production/Capability=NULL >> 06/23 09:01:47,889 VOAuthorizationPlugin: authRequestID 793306152 >> Mapping Service URL configuration: https://gums.fnal.gov:8443/gums/services/ >> GUMSAuthorizationServicePort >> 06/23 09:01:48,263 VOAuthorizationPlugin: authRequestID 793306152 VO >> mapping service returned Username: minospro >> > > > I happened to do a voms-proxy-info -all on Friday, > when the directories were being created : > > $ voms-proxy-info -all > WARNING: Unable to verify signature! Server certificate possibly not > installed. > Error: Cannot find certificate of AC issuer for vo fermilab > subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy > issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 > identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 > type : proxy > strength : 512 bits > path : /home/mindata/.grid/kreymer-doe.proxy > timeleft : 6606:10:02 > === VO fermilab extension information === > VO : fermilab > subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 > issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov > attribute : /fermilab/minos/Role=Production/Capability=NULL > attribute : /fermilab/Role=NULL/Capability=NULL > attribute : /fermilab/minos/Role=NULL/Capability=NULL > timeleft : 0:00:00 > > The file and directory at issue are : > > MINOS26 > ls -al /pnfs/minos/NULL/dc18dir > total 41 > drwxrwxr-x 1 42411 e875 512 Jun 23 09:01 . > drwxrwxr-x 1 mindata e875 512 Jun 23 14:40 .. > -rw-r--r-- 1 42411 e875 41379 Jun 23 09:02 F00031300_0000.mdaq.root > -rw-r--r-- 1 42411 e875 41379 Jun 23 15:18 F00031300_0000.mdaq.root2 > > I repeateded the file copy at 15:18, > see the entry just above 2008 06 20 in > > http://www-numi.fnal.gov/minwork/computing/dh/worklog.txt > > ___________________________________________ Date: Thu, 03 Jul 2008 15:10:18 -0500 (CDT) Note To Requester: After the grid users meeting we met together and agreed on a list of problems that need to be resolved. As of today I have communicated that list to the developers and dcache admins. Steve Timm ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ ######## # DATA # ######## Preparing for PNFS/DCache maintenance Jun 24 kreymer@minos26 echo "crontab -r" | at 05:30 mindata@minos26 echo "crontab -r" | at 01:00 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 ####### # DAQ # ####### Testing gsiftp as used by DAQ for archiving, with DCache 1.8 teststand $ ssh -l minos minos-gateway-nd.fnal.gov $ ssh daqdcp-nd Observed 3632 ? S 33:16 python /home/minos/bin/archiver_krb.py $ . shrc/kreymer # get rid of ls alias, create dds for my fun $ cd /home/minos/bin $ cp archiver_krb.py DC18.py We need to test gssftp. Most details come from config file $DAQ_CONFIG_DIR/archiver_near_daq.config Initial environment comes from .profile Now hacking DC18.py to test fndcat. 
Created config/DC18.config CONFIG FILE CHANGES ftp_host_name#T=fndcat.fnal.gov; pnfs_dir#T=NULL; check_freq#I=10; lock_file#T=/var/lock/daq/DC18.pid; err_mail#T=kreymer@fnal.gov,g; toarchive#T=/var/tmp/DC18/toarchive; archived#T=/var/tmp/DC18/archived; data_dir#T=/var/tmp/DC18/data_dir; #toarchive#T=$DAQ_DATA_TO_ARCHIVE_DIR; /daqdata/archiver/data-to-archive #archived#T=$DAQ_DATA_ARCHIVED_DIR; /daqdata/archiver/data-archived #data_dir#T=$DAQ_DATA_DIR; /daqdata Modified bin/DC18.py config just DC18, no other options Remove expansion of toarchive, archived, data_dir strings Changed msg_srv to be ECHO os.chdir scan - just to toarchive and data_dir toarchive from config, data_dir from changed krb_cache to /tmp/krb5c_DC18 $ mkdir /var/tmp/DC18 $ mkdir /var/tmp/DC18/toarchive $ mkdir /var/tmp/DC18/archived $ mkdir /var/tmp/DC18/data_dir for NN in 00 01 02 03 04 05 06 07 08 09 10 11 ; do cp -av /daqdata/N00014362_00${NN}.mdaq.root /var/tmp/DC18/data_dir ; done $ du -sm /var/tmp/DC18/data_dir/ 955 /var/tmp/DC18/data_dir/ for NN in 00 01 02 03 04 05 06 07 08 09 10 11 ; do touch /var/tmp/DC18/toarchive/N00014362_00${NN}.mdaq.root ; done We also need an output directory, MINOS26 > mkdir /pnfs/minos/NULL/2008-06 MINOS26 > chmod 775 /pnfs/minos/NULL/2008-06 MINOS26 > chgrp e875 /pnfs/minos/NULL/2008-06 Tried with wrong group, $ bin/DC18.py /home/minos/kftp/v3_6/NULL/lib/gssftp.py:1: RuntimeWarning: Python C API version mismatch for module gss: This Python has API version 1012, module gss has version 1011. import gss ftp_host_name fndcat.fnal.gov port 24127 username buckley data_dir /var/tmp/DC18/data_dir toarchive /var/tmp/DC18/toarchive archived /var/tmp/DC18/archived pnfs_dir NULL check_freq 10 ticket_cache /home/minos/kt/minoskt lock_file /var/lock/daq/DC18.pid err_mail kreymer@fnal.gov /// MESSAGE /// -l I Will archive files in /pnfs/minos/NULL /// MESSAGE /// -l I Will look for new files every 10 seconds ['N00014362_0000.mdaq.root', 'N00014362_0001.mdaq.root', 'N00014362_0002.mdaq.root', 'N00014362_0003.mdaq.root', 'N00014362_0004.mdaq.root', 'N00014362_0005.mdaq.root', 'N00014362_0006.mdaq.root', 'N00014362_0007.mdaq.root', 'N00014362_0008.mdaq.root', 'N00014362_0009.mdaq.root', 'N00014362_0010.mdaq.root', 'N00014362_0011.mdaq.root'] /// MESSAGE /// -l I Processing file N00014362_0000.mdaq.root /// MESSAGE /// -l I Getting credentials /// MESSAGE /// -l I Got credentials /// MESSAGE /// -l I Trying ftp connect to disk cache /// MESSAGE /// -l I Ftp connect succeeded /// MESSAGE /// -l W File N00014362_0000.mdaq.root failed to transfer, try again: STOR N00014362_0000.mdaq.root: Permission denied /// MESSAGE /// -l I Processing file N00014362_0001.mdaq.root ... Try one file, $ bin/DC18.py ... 
/// MESSAGE /// -l I Will archive files in /pnfs/minos/NULL /// MESSAGE /// -l I Will look for new files every 10 seconds ['N00014362_0000.mdaq.root'] /// MESSAGE /// -l I Processing file N00014362_0000.mdaq.root /// MESSAGE /// -l I Getting credentials /// MESSAGE /// -l I Got credentials /// MESSAGE /// -l I Trying ftp connect to disk cache /// MESSAGE /// -l I Ftp connect succeeded /// MESSAGE /// -l I File N00014362_0000.mdaq.root transferred to /pnfs/minos/NULL/2008-06 Try all 11 files, /// MESSAGE /// -l I Will archive files in /pnfs/minos/NULL /// MESSAGE /// -l I Will look for new files every 10 seconds ['N00014362_0000.mdaq.root', 'N00014362_0001.mdaq.root', 'N00014362_0002.mdaq.root', 'N00014362_0003.mdaq.root', 'N00014362_0004.mdaq.root', 'N00014362_0005.mdaq.root', 'N00014362_0006.mdaq.root', 'N00014362_0007.mdaq.root', 'N00014362_0008.mdaq.root', 'N00014362_0009.mdaq.root', 'N00014362_0010.mdaq.root', 'N00014362_0011.mdaq.root'] /// MESSAGE /// -l I Processing file N00014362_0000.mdaq.root /// MESSAGE /// -l I Getting credentials /// MESSAGE /// -l I Got credentials /// MESSAGE /// -l I Trying ftp connect to disk cache /// MESSAGE /// -l I Ftp connect succeeded /// MESSAGE /// -l W File N00014362_0000.mdaq.root failed to transfer, try again: STOR N00014362_0000.mdaq.root: Permission denied /// MESSAGE /// -l I Processing file N00014362_0001.mdaq.root /// MESSAGE /// -l I Getting credentials /// MESSAGE /// -l I Got credentials /// MESSAGE /// -l I Trying ftp connect to disk cache /// MESSAGE /// -l I Ftp connect succeeded /// MESSAGE /// -l W File N00014362_0001.mdaq.root failed to transfer, try again: STOR N00014362_0001.mdaq.root: Permission denied /// MESSAGE /// -l I Processing file N00014362_0002.mdaq.root ... /// MESSAGE /// -l I File N00014362_0010.mdaq.root transferred to /pnfs/minos/NULL/2008-06 /// MESSAGE /// -l I Processing file N00014362_0011.mdaq.root /// MESSAGE /// -l I Getting credentials /// MESSAGE /// -l I Got credentials /// MESSAGE /// -l I Trying ftp connect to disk cache /// MESSAGE /// -l I Ftp connect succeeded /// MESSAGE /// -l I File N00014362_0011.mdaq.root transferred to /pnfs/minos/NULL/2008-06 MINOS26 > ls -l /pnfs/minos/NULL/2008-06 total 976125 -rw-r--r-- 1 buckley e875 88929278 Jun 23 14:59 N00014362_0000.mdaq.root -rw-r--r-- 1 buckley e875 78320762 Jun 23 15:00 N00014362_0001.mdaq.root -rw-r--r-- 1 buckley e875 88274514 Jun 23 15:01 N00014362_0002.mdaq.root -rw-r--r-- 1 buckley e875 78394261 Jun 23 15:02 N00014362_0003.mdaq.root -rw-r--r-- 1 buckley e875 87837099 Jun 23 15:02 N00014362_0004.mdaq.root -rw-r--r-- 1 buckley e875 78282913 Jun 23 15:02 N00014362_0005.mdaq.root -rw-r--r-- 1 buckley e875 87983068 Jun 23 15:03 N00014362_0006.mdaq.root -rw-r--r-- 1 buckley e875 78285285 Jun 23 15:04 N00014362_0007.mdaq.root -rw-r--r-- 1 buckley e875 88048732 Jun 23 15:04 N00014362_0008.mdaq.root -rw-r--r-- 1 buckley e875 78299797 Jun 23 15:05 N00014362_0009.mdaq.root -rw-r--r-- 1 buckley e875 88505253 Jun 23 15:05 N00014362_0010.mdaq.root -rw-r--r-- 1 buckley e875 78394163 Jun 23 15:05 N00014362_0011.mdaq.root ######## # FARM # ######## Corrected bhhi and bhlo directory permissions ( at the top ) using mismapped kreymer cert, allowing access to 42411 minospro. 
08:51 for BH in ${SLO} ${SHI} ; do for DET in far near ; do srm-set-permissions -type=CHANGE -group=RWX ${BH}/${DET} srm-get-permissions ${BH}/${DET} srm-set-permissions -type=CHANGE -group=RWX ${BH}/${DET}/daikon_04 srm-get-permissions ${BH}/${DET}/daikon_04 srm-set-permissions -type=CHANGE -group=RWX ${BH}/${DET}/daikon_04/L010185N srm-get-permissions ${BH}/${DET}/daikon_04/L010185N done ; done for BH in ${SLO} ${SHI} ; do for DET in far near ; do srm-set-permissions -type=CHANGE -group=RWX ${BH}/${DET}/daikon_04/L010185N/cand_data srm-get-permissions ${BH}/${DET}/daikon_04/L010185N/cand_data done ; done Back on kreymer@minos26, ./pnfsdirs near cedar_phy_bhhi daikon_04 L010185N write Mon Jun 23 09:04:04 CDT 2008 STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_04/L010185N FAMSET mcin_near_daikon_04 :28: RuntimeWarning: Python C API version mismatch for module _locale: This Python has API version 1012, module _locale has version 1011. FAMILY mcin_near_daikon_04 OUTPUT /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N OK - created /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/mrnt_data OK - created /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/sntp_data OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 42411 e875 512 Jun 11 18:34 /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/cand_data chmod: changing permissions of `/pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/cand_data': Operation not permitted OK - have set permissions drwxrwxr-x drwxr-xr-x 1 42411 e875 512 Jun 11 18:34 /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/cand_data FAMSET mcout_cedar_phy_bhhi_near_daikon_04_cand ... ./pnfsdirs near cedar_phy_bhlo daikon_04 L010185N write ./pnfsdirs far cedar_phy_bhlo daikon_04 L010185N write ./pnfsdirs far cedar_phy_bhhi daikon_04 L010185N write ########## # DCACHE # ########## mindata@minos26 SPATH17=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/NULL/dc17dir SPATH18=srm://fndcat.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/NULL/dc18dir srmmkdir ${SPATH17} $ ls -l /pnfs/minos/NULL total 161 -rw-r--r-- 1 kreymer e875 41379 Jun 20 13:44 F00031300_0000.mdaq.root -rw-r--r-- 1 42411 e875 41379 Jun 20 19:10 F00031300_0000.mdaq.root1 -rw-r--r-- 1 42411 e875 41379 Jun 20 14:16 F00031300_0000.mdaq.root2 -rw-r--r-- 1 42411 e875 41379 Jun 20 19:10 F00031300_0000.mdaq.root3 drwxr-xr-x 1 kreymer e875 512 Jun 23 08:27 dc17dir srmmkdir ${SPATH18} $ ls -l /pnfs/minos/NULL SRMClientV2 : put: try # 0 failed with error SRMClientV2 : org.xml.sax.SAXException: Invalid element in org.dcache.srm.v2_2.SrmMkdirRequest - directoryPath SRMClientV2 : put: try again SRMClientV2 : put: try # 1 failed with error SRMClientV2 : org.xml.sax.SAXException: Invalid element in org.dcache.srm.v2_2.SrmMkdirRequest - directoryPath SRMClientV2 : put: try again OK, that's with a client mismatch, let's use the proper client $ srmmkdir ${SPATH18} $ ls -l /pnfs/minos/NULL total 161 -rw-r--r-- 1 kreymer e875 41379 Jun 20 13:44 F00031300_0000.mdaq.root -rw-r--r-- 1 42411 e875 41379 Jun 20 19:10 F00031300_0000.mdaq.root1 -rw-r--r-- 1 42411 e875 41379 Jun 20 14:16 F00031300_0000.mdaq.root2 -rw-r--r-- 1 42411 e875 41379 Jun 20 19:10 F00031300_0000.mdaq.root3 drwxr-xr-x 1 kreymer e875 512 Jun 23 08:27 dc17dir drwxrwxr-x 1 42411 e875 512 Jun 23 08:30 dc18dir Note that the ownership has changed. And permissions. 
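For the report to dcache-admin, the 1.7 versus 1.8 behaviour can be captured side by side. A sketch, assuming the SPATH17 / SPATH18 variables set above are still defined:

for DIR in dc17dir dc18dir ; do ls -ld /pnfs/minos/NULL/${DIR} ; done
srm-get-permissions ${SPATH17}
srm-get-permissions ${SPATH18}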
Reported to dcache-admin at 08:35 Let's check permissions : $ srm-set-permissions -type=CHANGE -group=RX ${SPATH18} $ ls -l /pnfs/minos/NULL total 161 drwxr-xr-x 1 kreymer e875 512 Jun 23 08:27 dc17dir drwxr-xr-x 1 42411 e875 512 Jun 23 08:42 dc18dir For reference, from my session on Friday, $ voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : /home/mindata/.grid/kreymer-doe.proxy timeleft : 6606:10:02 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov attribute : /fermilab/minos/Role=Production/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 0:00:00 Repeating the file copy, around 15:18 $ srmcp -debug -streams_num=1 -server_mode=active file:////minos/scratch/parrot/F00031300_0000.mdaq.root srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/dc18dir/F00031300_0000.mdaq.root2 Storage Resource Manager (SRM) implementation version 2.0.3 Copyright (c) 2002-2008 Fermi National Accelerator Laboratory Specification Version 2.0 by SRM Working Group (http://sdm.lbl.gov/srm-wg) SRM Configuration: debug=true gsissl=true help=false pushmode=false userproxy=true buffer_size=131072 tcp_buffer_size=0 streams_num=1 config_file=/minos/scratch/app/OSG1/srm-client-fermi/etc/config-2.xml glue_mapfile=/minos/scratch/app/OSG1/srm-client-fermi/conf/SRMServerV1.map webservice_path=srm/managerv1 webservice_protocol=https gsiftpclinet=globus-url-copy protocols_list=http,gsiftp save_config_file=null srmcphome=/minos/scratch/app/OSG1/srm-client-fermi urlcopy=sbin/urlcopy.sh x509_user_cert= x509_user_key= x509_user_proxy=/home/mindata/.grid/kreymer-doe.proxy x509_user_trusted_certificates=/minos/scratch/app/OSG1/globus/TRUSTED_CA globus_tcp_port_range=null gss_expected_name=null storagetype=permanent retry_num=20 retry_timeout=10000 wsdl_url=null use_urlcopy_script=false connect_to_wsdl=false delegate=true full_delegation=true server_mode=active srm_protocol_version=1 request_lifetime=86400 access latency=null overwrite mode=null priority=0 from[0]=file:////minos/scratch/parrot/F00031300_0000.mdaq.root to=srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/dc18dir/F00031300_0000.mdaq.root2 Mon Jun 23 15:17:36 CDT 2008: starting SRMPutClient Mon Jun 23 15:17:36 CDT 2008: In SRMClient ExpectedName: host Mon Jun 23 15:17:36 CDT 2008: SRMClient(https,srm/managerv1,true) SRMClientV1 : user credentials are: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 SRMClientV1 : SRMClientV1 calling org.globus.axis.util.Util.registerTransport() SRMClientV1 : connecting to srm at httpg://stkendca3a.fnal.gov:8443/srm/managerv1 Mon Jun 23 15:17:38 CDT 2008: connected to server, obtaining proxy Mon Jun 23 15:17:38 CDT 2008: got proxy of type class org.dcache.srm.client.SRMClientV1 Mon Jun 23 15:17:38 CDT 2008: source file#0 : /minos/scratch/parrot/F00031300_0000.mdaq.root copy_jobs is empty SRMClientV1 : put, sources[0]="srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/dc18dir/F00031300_0000.mdaq.root2" SRMClientV1 : put, 
dests[0]="srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/dc18dir/F00031300_0000.mdaq.root2" SRMClientV1 : put, protocols[0]="gsiftp" SRMClientV1 : put, protocols[1]="dcap" SRMClientV1 : put, protocols[2]="http" SRMClientV1 : put, contacting service httpg://stkendca3a.fnal.gov:8443/srm/managerv1 Mon Jun 23 15:17:40 CDT 2008: srm returned requestId = -2147441805 Mon Jun 23 15:17:40 CDT 2008: sleeping 4 seconds ... Mon Jun 23 15:17:44 CDT 2008: FileRequestStatus with SURL=srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/NULL/dc18dir/F00031300_0000.mdaq.root2 is Ready Mon Jun 23 15:17:44 CDT 2008: received TURL=gsiftp://stkendca6a.fnal.gov:2811///NULL/dc18dir/F00031300_0000.mdaq.root2 copy_jobs is not empty copying CopyJob, source = file:////minos/scratch/parrot/F00031300_0000.mdaq.root destination = gsiftp://stkendca6a.fnal.gov:2811///NULL/dc18dir/F00031300_0000.mdaq.root2 GridftpClient: memory buffer size is set to 131072 GridftpClient: connecting to stkendca6a.fnal.gov on port 2811 GridftpClient: gridFTPClient tcp buffer size is set to 1048576 GridftpClient: gridFTPWrite started, source file is java.io.RandomAccessFile@1eb5666 destination path is //NULL/dc18dir/F00031300_0000.mdaq.root2 GridftpClient: gridFTPWrite started, destination path is //NULL/dc18dir/F00031300_0000.mdaq.root2 GridftpClient: set local data channel authentication mode to None GridftpClient: stream mode transfer GridftpClient: adler32 for file java.io.RandomAccessFile@1eb5666 is bdce1a9a GridftpClient: waiting for completion of transfer GridftpClient: starting a transfer to //NULL/dc18dir/F00031300_0000.mdaq.root2 GridftpClient: DiskDataSink.close() called GridftpClient: gridFTPWrite() wrote 41379bytes GridftpClient: closing client : org.dcache.srm.util.GridftpClient$FnalGridFTPClient@e34726 GridftpClient: closed client execution of CopyJob, source = file:////minos/scratch/parrot/F00031300_0000.mdaq.root destination = gsiftp://stkendca6a.fnal.gov:2811///NULL/dc18dir/F00031300_0000.mdaq.root2 completed setting file request -2147441804 status to Done copy_jobs is empty stopping copier ============================================================================= 2008 06 20 ============================================================================= ####### # SAM # ####### export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.bhlo samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.bhhi done New applicationFamilyId = 255 New applicationFamilyId = 256 New applicationFamilyId = 92 New applicationFamilyId = 93 New applicationFamilyId = 346 New applicationFamilyId = 347 ####### # OSG # ####### minsoft@minos26 mkdir /minos/scratch/app/OSG1 cd /minos/scratch/app/OSG1 11:47 pacman -get OSG:client Do you want to add [http://software.grid.iu.edu/pacman/] to [trusted.caches]? (y or n): y Package [client] found in [OSG]... Package [OSG:vo-client-0.6] found in [OSG]... Package [vo-client-0.6] found in [OSG]... Do you want to add [http://vdt.cs.wisc.edu/vdt_1101_cache] to [trusted.caches]? (y or n): y Package [VDT-Client] found in [http://vdt.cs.wisc.edu/vdt_1101_cache]... ... Downloading [vo-client-0.6-11.tar.gz] from [http://software.grid.iu.edu/pacman/tarballs]... Untarring [vo-client-0.6-11.tar.gz]... Downloading [vdt-common-1-287.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-common/1]... 
Downloading [vdt-core-1.2.1-404.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-core/1.2.1]... Downloading [vdt-environment-1-233.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-environment/1]... Downloading [vdt-core-bin-2.1.x86_rhas_4.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-core-bin/2.1-4]... Downloading [vdt-version-5-281.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-version/5]... Downloading [vdt-version-info-1.10.1-19.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-version-info/1.10.1]... Downloading [vdt-system-profiler-3-267.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-system-profiler/3]... Downloading [vdt-questions-2.0-324.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-questions/2.0]... Downloading [remove-rpaths-1--0.13-2.x86_rhas_4.tar.gz] from [http://vdt.cs.wisc.edu/software//remove-rpaths/1--0.13-2]... Beginning VDT prerequisite checking script vdt-common/vdt-prereq-check... All prerequisite checks are satisfied. Downloading [VDT-Client-1.10.1.tar.gz] from [http://vdt.cs.wisc.edu/software/questions/1.10.1]... VDT 1.10.1 installs a variety of software, each with its own license. In order to continue, you must agree to the licenses. You can view the licenses online at: http://vdt.cs.wisc.edu/licenses/1.10.1 After the installation has completed, you will also be able to view the licenses in the "licenses" directory. Do you agree to the licenses? [y/n] y The VDT typically installs public certificates and signing policy files for the well-known public CA's. This is necessary in order for you to perform GSI authentication with any remote Grid services (that have service/host certificates signed by these CA's). For more information please refer to the VDT documentation: http://vdt.cs.wisc.edu/releases/1.10.1/setup_ca.html Where would you like to install CA files? Choices: l (local) - install into $VDT_LOCATION/globus/share/certificates n (no) - do not install l Downloading [CA-Certificates-1.10.1.tar.gz] from [http://vdt.cs.wisc.edu/software/questions/1.10.1]... ... Package [Configure-Condor] found in [http://vdt.cs.wisc.edu/vdt_1101_cache]... Downloading [configure_condor-1-301.tar.gz] from [http://vdt.cs.wisc.edu/software//configure_condor/1]... ... Downloading [LCG-Infosites-1.10.1.tar.gz] from [http://vdt.cs.wisc.edu/software/questions/1.10.1]... Downloading [lcg-infosites-2.6-2.tar.gz] from [http://vdt.cs.wisc.edu/software//lcg-infosites/2.6-2]... The VDT version 1.10.1 has been installed. The OSG Client package version 1.0.0 has been installed. 12:02 ######## # TEST # ######## Putting a file into TEST and NULL mindata@minos26 $ setup java v1.5.0 $ setup srmcp v1_25_1 $ unset SRM_PATH $ export X509_USER_PROXY=/home/mindata/.grid/kreymer-doe.proxy srmcp -debug \ -streams_num=1 -server_mode=active \ file:////minos/scratch/parrot/F00031300_0000.mdaq.root \ srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/TEST/F00031300_0000.mdaq.root ############ # OSG TEST # ############ $ export X509_USER_PROXY=/home/mindata/.grid/kreymer-doe.proxy $ voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. 
Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : /home/mindata/.grid/kreymer-doe.proxy timeleft : 6674:39:11 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov attribute : /fermilab/minos/Role=Production/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 0:00:00 SPATH2=srm://fndcat.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/beam_data/2004-12 srmls ${SPATH2} OK, got all files SPATH=srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr S1PATH=srm://fndcat.fnal.gov:8443/srm/managerv1?SFN=/pnfs/fnal.gov/usr S2PATH=srm://fndcat.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr IFILE=N00004502_0000.mdaq.root IPATH=minos/neardet_data/2004-11 SFILE=${SPATH}/${IPATH}/${IFILE} S1FILE=${S1PATH}/${IPATH}/${IFILE} srmcp -streams_num=1 -server_mode=active ${SFILE} \ file:////var/tmp/TEST.dat gave up after 10 minutes srmcp -debug \ -streams_num=1 -server_mode=active \ file:////minos/scratch/parrot/F00031300_0000.mdaq.root \ srm://fndcat.fnal.gov:8443/pnfs/fnal.gov/usr/minos/TEST/F00031300_0000.mdaq.root fails, first file protection, then authorization $ T1FILE=srm://fndcat.fnal.gov:8443/srm/managerv1?SFN=/pnfs/fnal.gov/usr/minos/TEST/F00031300_0000.mdaq.root $ srmcp -streams_num=1 -server_mode=active -debug=true ${T1FILE} file:////var/tmp/TEST.dat URK , this works, but not the short form, and not with managerv2. N.B. that's not correct, the real problem was hitting a bad door. WFILE=F00031300_0000.mdaq.root2 WPATH=minos/NULL SW1FILE=${S1PATH}/${WPATH}/${WFILE} $ echo $SW1FILE srm://fndcat.fnal.gov:8443/srm/managerv1?SFN=/pnfs/fnal.gov/usr/minos/NULL/F00031300_0000.mdaq.root2 srmcp -debug \ -streams_num=1 -server_mode=active \ file:////minos/scratch/parrot/F00031300_0000.mdaq.root \ ${SW1FILE} Dmitri has examined the logs. The commands work correctly when they happen to get ftp paths like TURL=gsiftp://stkendca3a.fnal.gov:2812///TEST/F00031300_0000.mdaq.root The commands fail with TURL=gsiftp://gwdca01.fnal.gov:2811///TEST/F00031300_0000.mdaq.root He will examine the door configurations. The configuration has been updated by Timur, reads and writes both work consistently. DCCP TESTS - irrelevant ? kreymer@minos26 setup dcap # kerberized DCPOR=24725 # kerberos (default) WFILE=F00031300_0000.mdaq.root WPATH=minos/NULL WFILE=dcap://fndcat.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${WPATH}/${WFILE} cd /minos/scratch/parrot dccp -d 4 ${IFILE} ${WFILE} [Fri Jun 20 13:24:14 2008] Going to open file dcap://fndcat.fnal.gov:24125/pnfs/fnal.gov/usr/minos/NULL/F00031300_0000.mdaq.root in cache. Connected in 0.00s. Error on control line [4] Failed to create a control line Failed open file in the dCache. Can't open destination file : Server rejected "hello" System error: No such file or directory I am not sure that I was ever authorized, this is not important. 
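For future door checks, a quick consistency loop along the following lines might help. This is a sketch only, not run here; it assumes the T1FILE definition and the proxy exported above, and that the -debug output still prints the 'received TURL=' lines seen in the transcript above.
for N in 1 2 3 4 5 ; do
  srmcp -streams_num=1 -server_mode=active -debug=true \
    ${T1FILE} file:////var/tmp/TEST.${N}.dat 2>&1 | grep 'TURL='
done
ls -l /var/tmp/TEST.*.dat    # all five copies should be present and the same size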
########## # DCACHE # ########## Date: Fri, 20 Jun 2008 10:56:44 -0500 From: Dan Yocum https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/ClientInstallationGuide Date: Fri, 20 Jun 2008 11:26:14 -0500 (CDT) From: Steven Timm To: Dan Yocum , fermigrid-help@fnal.gov Cc: Arthur Kreymer , timur@fnal.gov, minos-data@fnal.gov Subject: Re: yes, OSG v1.0 client software The new version of SRM clients are already available on fnpcsrv1 (and fnpcsrv1 only, right now). They have, in fact, been available in that location since February. Log on to fnpcsrv1 do . /usr/local/grid/setup.sh and the srmcp that will be in your path is /opt/d-cache/srm/bin/srmcp That is the right version. As Dan says, we will be installing this on the worker nodes as of June 24, but it will be in the /usr/local/grid area once we have done that, and the /usr/local/grid/setup.sh will automatically place it in your path on all worker nodes, not just on fnpcsrv1. ####### # AFS # ####### Per kordosky question regarding AFS problems in Minerva, happening on flxi05 but not flxi06 rescanned AFS tuning : MIN > for NODE in ${NODES} ; do printf "$NODE " ; ssh -ax ${NODE} 'grep ^OPTIONS= /etc/sysconfig/afs' ; done minos01 OPTIONS=$LARGE ... minos26 OPTIONS=$LARGE MIN > for NODE in ${UNODES} ; do printf "$NODE " ; ssh -ax ${NODE} 'grep ^OPTIONS= /etc/sysconfig/afs' ; done flxi02 OPTIONS=AUTOMATIC flxi03 OPTIONS=AUTOMATIC flxi04 OPTIONS=$MEDIUM flxi05 OPTIONS=AUTOMATIC flxi06 OPTIONS=$MEDIUM flxi07 OPTIONS=$LARGE flxi09 OPTIONS=AUTOMATIC FNALU batch is all $LARGE ( cannot log into 21, 22 ) ___________________________________________ Date: Fri, 20 Jun 2008 09:26:14 -0500 (CDT) Subject: HelpDesk ticket 117526 ___________________________________________ Short Description: AFS client tuning is needed on FNALU hosts Problem Description: fnalu-admin : Mike Kordosky has been observing inconsistent AFS service on some of the FNALU nodes, including flxi05. I see the following problematic setting in etc/sysconfig/afs : OPTIONS=AUTOMATIC For reliable service, we had to set this in the Minos Cluster to OPTIONS=$LARGE Please set this to at least $MEDIUM, if not $LARGE Here is a summary of present FNALU settings : for NODE in ${UNODES} ; do printf "$NODE " ssh -ax ${NODE} 'grep ^OPTIONS= /etc/sysconfig/afs' ; done flxi02 OPTIONS=AUTOMATIC flxi03 OPTIONS=AUTOMATIC flxi04 OPTIONS=$MEDIUM flxi05 OPTIONS=AUTOMATIC flxi06 OPTIONS=$MEDIUM flxi07 OPTIONS=$LARGE flxi09 OPTIONS=AUTOMATIC ___________________________________________ Date: Fri, 20 Jun 2008 09:53:40 -0500 (CDT) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Fri, 20 Jun 2008 10:01:25 -0500 (CDT) From: Margaret_Greaney Art, could you please ask Mike to provide more details about "inconsistent" AFS service? thanks, Margaret ___________________________________________ Date: Fri, 20 Jun 2008 10:04:51 -0500 (CDT) From: Margaret_Greaney flxi09 is not a node released to users and won't be for some time. ___________________________________________ Date: Fri, 20 Jun 2008 10:23:27 -0500 (CDT) Art, I will plan to work on this afs update next week. It will require a reboot or at least an unmount of afs and remount, so it will take a maintenance day. We have something else scheduled for the early part of the week, but for flxi02 and 03, I will try to make that part of the move for these nodes. Flxi05 may be later on in the week. 
Margaret ___________________________________________ Date: Fri, 20 Jun 2008 10:42:19 -0500 (CDT) This ticket has been reassigned to GREANEY, MARGARET of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ ___________________________________________ ___________________________________________ ########## # PARROT # ########## Tested current 2008-06-19 on fnpc338 with dcache, OK ( with ^D hack ) Tested without -d remote, still OK Tested with squid , checksums do not match for d199- remote checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d local checksum: 8294e8e248aa71fc003cb306d2ca0db5266aeaec disabled squid, local checksum: 8294e8e248aa71fc003cb306d2ca0db5266aeaec export http_proxy=http://squid.fnal.gov:3128 # for curl export HTTP_PROXY=http://squid.fnal.gov:3128 # for parrot curl http://www-numi.fnal.gov:80/computing/d199//.growfschecksum curl is not available wget http://www-numi.fnal.gov:80/computing/d199//.growfschecksum -O - -q 6f63107de1a1e42d3a10b8847ebffea250f0895d - unset http_proxy wget http://www-numi.fnal.gov:80/computing/d199//.growfschecksum -O - -q 8294e8e248aa71fc003cb306d2ca0db5266aeaec - squid is being inconsistent : MINOS26 > wget http://www-numi.fnal.gov:80/computing/d199//.growfschecksum -O CHKS ; cat CHKS --09:42:26-- http://www-numi.fnal.gov/computing/d199//.growfschecksum => `CHKS' Resolving squid.fnal.gov... 131.225.107.161 Connecting to squid.fnal.gov|131.225.107.161|:3128... connected. Proxy request sent, awaiting response... 200 OK Length: 44 [text/plain] 100%[====================================================================================================================================================================================================================================================================>] 44 --.--K/s 09:42:26 (3.23 MB/s) - `CHKS' saved [44/44] 6f63107de1a1e42d3a10b8847ebffea250f0895d - MINOS26 > wget http://www-numi.fnal.gov:80/computing/d199//.growfschecksum -O CHKS ; cat CHKS --09:42:29-- http://www-numi.fnal.gov/computing/d199//.growfschecksum => `CHKS' Resolving squid.fnal.gov... 131.225.107.161 Connecting to squid.fnal.gov|131.225.107.161|:3128... connected. Proxy request sent, awaiting response... 200 OK Length: 44 [text/plain] 100%[====================================================================================================================================================================================================================================================================>] 44 --.--K/s 09:42:29 (3.23 MB/s) - `CHKS' saved [44/44] 8294e8e248aa71fc003cb306d2ca0db5266aeaec - Lets hit this more ... while true ; do sleep 4 ; curl http://www-numi.fnal.gov:80/computing/d199//.growfschecksum ; done no failures in 1/2 hour. while true ; do sleep 4 ; wget -q http://www-numi.fnal.gov:80/computing/d199//.growfschecksum -O -; done As of 11:30, the squid cache seems to be up to date. Have run loon repeatedly using squid, local and dcache, on fnpc338. ============================================================================= 2008 06 19 ============================================================================= ####### # SAM # ####### why is N00004502_0000.mdaq.root not declared to SAM ???? 
DIRS=`ls /pnfs/minos/neardet_data | grep 2004`
for DIR in ${DIRS} ; do
  echo ${DIR}
  find /pnfs/minos/neardet_data/${DIR} -type f | wc -l
  SAMDIM="FULL_PATH /pnfs/minos/neardet_data/${DIR}"
  sam list files --dim="${SAMDIM}" --summary_only
done
2004-07  697  File Count: 696
2004-11 1081  File Count: 0
2006-10  905  File Count: 904
##########
# DCACHE #
##########
Tests to do before the 24 June upgrade :
  read a file using dccp
  loon a file using dcap path
  write a file with srmcp, normal
  write a file with srmcp, production role
  write a file from DAQ, using gsiftp
Should write to a TEST directory and file family, for recycling
Should write to a NULL directory, directed to a NULL mover, for extended tests.
mindata@minos26
mkdir /pnfs/minos/NULL
mkdir /pnfs/minos/TEST
( cd /pnfs/minos/NULL ; enstore pnfs --file_family NULL )
( cd /pnfs/minos/TEST ; enstore pnfs --file_family TEST )
read dccp
MINOS26 > setup dcap -q unsecured
MINOS26 > IFILE=N00004502_0000.mdaq.root
MINOS26 > IPATH=minos/neardet_data/2004-11
MINOS26 > DCPOR=24125
MINOS26 > DFILE=dcap://fndcat.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE}
MINOS26 > cd /local/scratch??/`whoami`
MINOS26 > dccp ${DFILE} TEST.dat    # around 14:34
Unknow replay from server: "0 0 server shutdown "
On Thu, 19 Jun 2008, Timur Perelmutov wrote:
> Could you please let us know what is the status of the tests. We would
> like to make a final decision about the upgrade schedule tomorrow during
> the Storage Weekly meeting at 10:30 AM.
The tests have not yet started.
In order to do the tests, I need to monitor the system,
which I thought would be at http://fndcat.fnal.gov
But I got no response from that address.
So I presumed that the system was down.
I also have not been told which version of VDT to use for srmcp.
Is there a version installed anywhere for public use ?
Somewhere in /grid/app would be fine for me.
I tried to read a file at 13:34, via unsecured dccp,
but am still waiting 20 minutes later.
The file is on tape VO5041.
I see no activity for that tape.
MINOS26 > enstore info --vol=VO5041 | grep last_access 'last_access': 1203539287.0, datesec 1203539287 Wed Feb 20 14:28:07 CST 2008 File details follow : IFILE=N00004502_0000.mdaq.root IPATH=minos/neardet_data/2004-11 DCPOR=24125 DFILE=dcap://fndcat.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} dcap://fndcat.fnal.gov:24125/pnfs/fnal.gov/usr/minos/neardet_data/2004-11/N00004502_0000.mdaq.root setup dcap -q unsecured dccp ${DFILE} TEST.dat ######## # FARM # ######## mstrait email indicates that adjusted field runs have started I presume Detectors - near far Releases - cedar_phy_bhhi cedar_phy_bhlo MC rel - daikon_04 Beam - L010185N ./pnfsdirs near cedar_phy_bhhi daikon_04 L010185N Thu Jun 19 14:14:50 CDT 2008 STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_04/L010185N FAMSET mcin_near_daikon_04 FAMILY mcin_near_daikon_04 OUTPUT /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 42411 e875 512 Jun 10 22:23 /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04 OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 42411 e875 512 Jun 11 18:34 /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 42411 e875 512 Jun 11 18:34 /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/cand_data FAMSET mcout_cedar_phy_bhhi_near_daikon_04_cand FAMILY minos OOPS - need file family mcout_cedar_phy_bhhi_near_daikon_04_cand ls: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/mrnt_data: No such file or directory OOPS, need permissions drwxrwxr-x ls: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/mrnt_data: No such file or directory FAMSET mcout_cedar_phy_bhhi_near_daikon_04_mrnt ./pnfsdirs: line 87: cd: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/mrnt_data: No such file or directory FAMILY OOPS - need file family mcout_cedar_phy_bhhi_near_daikon_04_mrnt ls: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/sntp_data: No such file or directory OOPS, need permissions drwxrwxr-x ls: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/sntp_data: No such file or directory FAMSET mcout_cedar_phy_bhhi_near_daikon_04_sntp ./pnfsdirs: line 87: cd: /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N/sntp_data: No such file or directory FAMILY OOPS - need file family mcout_cedar_phy_bhhi_near_daikon_04_sntp What a mess ! Let's correct permissions/families ./pnfsdirs near cedar_phy_bhhi daikon_04 L010185N write Csnnot correct this, directories owned by minospro 2008 06 20 BLO=/pnfs/minos/mcout_data/cedar_phy_bhlo BHI=/pnfs/minos/mcout_data/cedar_phy_bhhi for BH in ${BLO} ${BHI} ; do for DET in far near ; do ls -ld ${BH}/${DET} ls -ld ${BH}/${DET}/daikon_04 ls -ld ${BH}/${DET}/daikon_04/L010185N ls -l ${BH}/${DET}/daikon_04/L010185N done ; done All these need to have group write. There are stray candidates in /pnfs/minos/mcout_data/cedar_phy_bhlo/near/daikon_04/L010185N /pnfs/minos/mcout_data/cedar_phy_bhhi/near/daikon_04/L010185N Need to do this under mindata@minos26, where we have access to SRM v2 scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/x509up_u1334 /home/mindata/.grid/ export X509_USER_PROXY=/home/mindata/.grid/x509up_u1334 SRMP=srm://fndcat.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos SLO=${SRMP}/mcout_data/cedar_phy_bhlo SHI=${SRMP}/mcout_data/cedar_phy_bhhi for BH in ${SLO} ${SHI} ; do for DET in far near ; do srm-get-permissions ${BH}/${DET} done ; done WOW THIS WORKED !!! 
for BH in ${SLO} ${SHI} ; do for DET in far near ; do srm-set-permissions -type=CHANGE -group=RWX ${BH}/${DET} srm-get-permissions ${BH}/${DET} done ; done This would work, but rubin's proxy is mapping to 1334(rubin) And the files are owned by 42411 ( minospro ) 2008 06 23 - Ran the set permissions script above, using kreymer cert. Because this is presently mis-mapped to minospro, I can get this work done ! ######### # ADMIN # ######### MINOS01 > cmd add_minos_user bain Creating account... /var/yp gmake[1]: Entering directory `/var/yp/minos' gmake[1]: `ypservers' is up to date. gmake[1]: Leaving directory `/var/yp/minos' gmake[1]: Entering directory `/var/yp/minos' Updating passwd.byname... Updating passwd.byuid... Updating netid.byname... gmake[1]: Leaving directory `/var/yp/minos' Adding user to Minos AFS group... Sending mail to subscribe to minos-user mailing list ... Sending email to user... ########## # PARROT # ########## continue tests using current version, with fix for ups on x86_64 Still problems on fcdflnx3, perhaps due to Linux+2.6-2.3.4 flavor of gcc root hangs up after loading libvector.dll Changed the flavor to Linux+2.6 by hacking the .version file, then rebuilt the growfs index, time make_growfs -k /afs/fnal.gov/files/data/minos/d141 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d141/.growfsdir make_growfs: scanning directory tree for changes... make_growfs: 991412 files, 6817 links, 107259 dirs, 0 checksums computed real 9m19.467s user 1m6.978s sys 1m28.444s ls -l /afs/fnal.gov/files/code/e875/general/ups/prd/gcc Oops, had to rename the old mountfiles in /grid/app/minos/parrot ######## # DATA # ######## Date: Thu, 19 Jun 2008 10:30:49 -0500 From: root To: owner-minos-data@listserv.fnal.gov Subject: FermiGrid Thu Jun 19 10:30:49 CDT 2008 Usage limit approaching on fermigrid-app Total disk allocated (GB): 30.0 Percent disk used: 80.0% Date: Thu, 19 Jun 2008 10:30:46 -0500 From: root To: owner-minos-data@listserv.fnal.gov Subject: FermiGrid Thu Jun 19 10:30:46 CDT 2008 Usage limit approaching on fermigrid-data Total disk allocated (GB): 400.0 Percent disk used: 85.4% ============================================================================= 2008 06 18 ============================================================================= ########## # CONDOR # ########## spotting users with excessively good priorities HOTS=`condor_userprio -all -allusers \ | grep -v gfactory \ | grep -v kreymer \ | grep -v rhatcher \ | grep ' 1.00 ' \ | cut -f 1 -d @ ` for HOT in ${HOTS} ; do printf "condor_userprio -setfactor ${HOT}@fnal.gov 100.\n" done condor_userprio -setfactor boehm@fnal.gov 100. condor_userprio -setfactor asousa@fnal.gov 100. ########## # PARROT # ########## Continue 2008 06 14 work ######## # BMNT # ######## Following the plan of 2008 01 17 to clear bmnt files out of farcat area. List the bmnt files and runs Generate mrnt list, verify that they are in PNFS and MD Move the mrnt files aside in PNFS and MD 0) BMNT LIST - kreymer BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` printf "${BFILES}\n" | wc -w 546 mkdir /minos/scratch/kreymer/bmnt2 printf "${BFILES}\n" > /minos/scratch/kreymer/bmnt2/BFILES BFILES runs from F00033808_0000.spill.bmnt.cedar_phy_bhcurv.0.root 2006-03 to F00034618_0003.spill.bmnt.cedar_phy_bhcurv.0.root 2006-03 1) MRNT LIST - kreymer/mindata/rubin/minfarm MRUNS=`printf "${BFILES}\n" | cut -f 1 -d _ | sort -u` printf "${MRUNS}\n" | wc -w 49 F00033808 F00033814 F00033818 ... 
F00034607 F00034610 F00034618 Rough check for _000000 subruns for MRUN in ${MRUNS} ; do sam locate ${MRUN}_0000.spill.mrnt.cedar_phy_bhcurv.0.root done ... all files are found ... Detailed check via SAM for MRUN in ${MRUNS} ; do RUN=`echo ${MRUN} | cut -c 5-` SAMDIM=" DATA_TIER mrnt-far and VERSION cedar.phy.bhcurv and PHYSICAL_DATASTREAM_NAME spill and RUN_NUMBER ${RUN} " sam list files --dim="${SAMDIM}" --nosummary done > /minos/scratch/kreymer/bmnt2/MFILES wc -l /minos/scratch/kreymer/bmnt2/MFILES 49 /minos/scratch/kreymer/bmnt2/MFILES grep -v '_0000' /minos/scratch/kreymer/bmnt2/MFILES .. nothing ... MFILES=`cat /minos/scratch/kreymer/bmnt2/MFILES` printf "${MFILES}\n" | wc -l 49 for MFILE in ${MFILES} ; do MON=`sam locate ${MFILE} | cut -f 7 -d / | cut -f 1 -d ,` printf "reco_far/cedar_phy_bhcurv/mrnt_data/${MON}/${MFILE}\n" \ | tee -a /minos/scratch/kreymer/bmnt2/MFILEPS done MFILEPS=`cat /minos/scratch/kreymer/bmnt2/MFILEPS` for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} ; done for MFILEP in ${MFILEPS} ; do ls -l /minos/data/${MFILEP} ; done ... continuing 2008 06 18 ... for each account, do BFILES=`cat /minos/scratch/kreymer/bmnt2/BFILES` MFILES=`cat /minos/scratch/kreymer/bmnt2/MFILES` MFILEPS=`cat /minos/scratch/kreymer/bmnt2/MFILEPS` 2a) /minos/data - minfarm@fnpcsrv1 MOVE TO BMNT2 IN /minos/data for MFILEP in ${MFILEPS} ; do MFILER=`echo ${MFILEP} | sed s/mrnt_data/BMNT2/g` MFILED=`dirname ${MFILER}` mkdir -p /minos/data/${MFILED} mv /minos/data/${MFILEP} /minos/data/${MFILER} done find /minos/data/reco_far/cedar_phy_bhcurv/BMNT2 -type f | wc -l 49 2b) /pnfs/minos - rubin@fnpcsrv1 cat shrc/kreymer # cut and paste the result to get into bash for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} rm /pnfs/minos/${MFILEP} done 2c) SAM/READ Do this as minfarm@fnpcsrv1 cd /export/stage/minfarm/ROUNDUP mkdir READBMNT2 for MFILE in ${MFILES} ; do ls READ/SAM/${MFILE} mv READ/SAM/${MFILE} READBMNT2/${MFILE} done 2d) SAM kreymer@minos26 for MFILE in ${MFILES} ; do sam undeclare file ${MFILE} done 3) rename bmnt - minfarm@fnpcsrv1 cd /minos/data/minfarm/farcat for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` [ -r ${MFILE} ] && ls -l ${MFILE} done no conflicting files were found for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` mv ${BFILE} ${MFILE} done for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` [ -r ${MFILE} ] && ls -l ${MFILE} done | wc -l 546 Ready to rock n' roll ! cd /home/minfarm/scripts ./roundup -n -r cedar_phy_bhcurv far adding messages look good to my eye, let's count them ./roundup -n -r cedar_phy_bhcurv far | grep adding | wc -l 49 OK, let's do this. ./roundup -r cedar_phy_bhcurv far OK - processing /minos/data/minfarm/farcat version 20080515 Wed Jun 18 11:11:15 CDT 2008 ... HADD rate 1 Mbytes/second Wed Jun 18 11:24:56 CDT 2008 WRITING to DCache 48 Wed Jun 18 11:40:16 CDT 2008 ??? we just added 49, why writing 48 ? probably timing, files need to have been around a little while. 
Yes, missing just the last 3'ish F00034602_0000.spill.mrnt.cedar_phy_bhcurv.0.root 22 F00034610_0000.spill.mrnt.cedar_phy_bhcurv.0.root 4 F00034618_0000.spill.mrnt.cedar_phy_bhcurv.0.root 4 ./roundup -w -r cedar_phy_bhcurv far WRITING to DCache 3 Wed Jun 18 11:58:16 CDT 2008 Waited a bit for files to get onto tape Wed Jun 18 13:21:31 CDT 2008 PURGING WRITE files 3 PURGED WRITE/F00034602_0000.spill.mrnt.cedar_phy_bhcurv.0.root PURGED WRITE/F00034610_0000.spill.mrnt.cedar_phy_bhcurv.0.root PURGED WRITE/F00034618_0000.spill.mrnt.cedar_phy_bhcurv.0.root ########## # DCACHE # ########## Preparing for DCache 1.8 tests, prior to Jun 24 upgrade. Do we have NULL movers set up ? Date: Tue, 17 Jun 2008 09:48:26 -0500 (CDT) From: Dmitry Litvintsev Art, you need to do to change "fndca" -> "fndcat" and make sure you are using these ports: ports: SRM : 8443 kerberized dcap doors : 24725, 24736 kerberized ftp door : 24127 weak ftp door : 24126 gsi dcap : 24128 grid ftp : 2811 grid ftp doors run on all three nodes available fndcat stkendca6a gwdca01 Date: Tue, 17 Jun 2008 09:51:22 -0500 From: Timur Perelmutov Could you let us know what path you are planning to write into? We will configure pools for that path. ============================================================================= 2008 06 17 ============================================================================= ######## # FARM # ######## Date: Mon, 16 Jun 2008 21:56:27 +0100 From: Alexandre Sousa To: Howard Rubin , Matthew Strait , Arthur Kreymer , Robert Hatcher , Nick West Subject: Hypothetical reprocessing with increased CPU resources. ... cost of a full reprocessing of CPB, for subshower nue correction ? ============================================================================= 2008 06 09 through 14 WEEK IN THE WOODS WORKSHIP IN ELY, MN ============================================================================= 2008 06 14 ============================================================================= ########## # PARROT # ########## Checking out use of ups/upd from /afs/fnal.gov/files/code/e875/general/ups/db . /afs/fnal.gov/files/code/e875/general/ups/etc/setups.sh export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup setup_minos() { . $MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } setup_minos -r R1.24.2 ups/upd seem to work OK there, but setup_minos fails due to lack of gcc v3_4_3 MINOS26 > du -sm /afs/fnal.gov/ups/gcc/v3_4_3/* prd/gcc/v3_4_3 159 /afs/fnal.gov/ups/gcc/v3_4_3/IRIX+6.5 119 /afs/fnal.gov/ups/gcc/v3_4_3/Linux+2.4-2.2.4 380 /afs/fnal.gov/ups/gcc/v3_4_3/Linux+2.4-2.3.2 380 /afs/fnal.gov/ups/gcc/v3_4_3/Linux+2.6-2.3.4 132 /afs/fnal.gov/ups/gcc/v3_4_3/SunOS+5.6 165 /afs/fnal.gov/ups/gcc/v3_4_3/SunOS+5.8 MINOS26 > mkdir prd/gcc/v3_4_3 MINOS26 > cp -r /afs/fnal.gov/ups/gcc/v3_4_3/Linux+2.6-2.3.4 prd/gcc/v3_4_3 ... continue on 2008 06 18 MINOS26 > mkdir db/gcc MINOS26 > cp /afs/fnal.gov/ups/db/gcc/v3_4_3.table db/gcc/ MINOS26 > cp /afs/fnal.gov/ups/db/gcc/v3_4_3.version db/gcc/ MINOS26 > nedit db/gcc/v3_4_3.version removed all but LInux_2.6... stanzas MINOS26 > ups list -aK+ gcc "gcc" "v3_4_3" "Linux+2.6-2.3.4" "" "" The setups look good now, and root runs. 
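Before cloning, a quick sanity check that the trimmed gcc product really sets up from the copied flavor directory would be something like this sketch (assumes the setups.sh environment sourced above; not run here):
ups list -aK+ gcc v3_4_3
setup gcc v3_4_3
which gcc          # expect a path under prd/gcc/v3_4_3/Linux+2.6-2.3.4 if the copy is picked up
gcc --version | head -1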
Now clone this to the d141 working copy of products, cd /afs/fnal.gov/files/data/minos/d141 cp -r /afs/fnal.gov/files/code/e875/general/ups/prd/gcc prd/gcc cp -r /afs/fnal.gov/files/code/e875/general/ups/db/gcc db/gcc For easier testing, as mindata, did cd /grid/app/minos/parrot FILES=' HOWTO.parrot firstlast.C F00031300_0000.mdaq.root ' for FILE in ${FILES} ; do curl http://www-numi.fnal.gov/computing/parrot/${FILE} -o ${FILE} done Successfully ran loon, without /usr/local/etc/setups.sh Now repeated testing... strange inconsistencies. setups produces ERROR: ld.so: object '/cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. which dthain states can be ignored ( or use parrot -H ) root runs, produces the usual batch messages, but hangs on input. ERROR: ld.so: object '/cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. ******************************************* * * * W E L C O M E to R O O T * * * * Version 5.12/00f 23 October 2006 * * * * You are welcome to visit our Web site * * http://root.cern.ch * * * ******************************************* Compiled on 25 March 2007 for linux with thread support. CINT/ROOT C/C++ Interpreter version 5.16.13, June 8, 2006 Type ? for help. Commands must be C++ statements. Enclose multiple statements between { }. .exit PID TTY STAT TIME COMMAND 2966 pts/0 Ss 0:00 -bash 3013 pts/0 R+ 0:00 \_ ps xf 923 pts/7 Ss 0:00 -bash 1029 pts/7 S+ 0:09 \_ parrot -m /cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/mountfile.grow -d remote bash 1030 pts/7 T 0:00 \_ bash 2884 pts/7 T 0:00 \_ root -b 2885 pts/7 T 0:00 \_ /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v5-12-00f/bin/root.exe -splash -b 2886 pts/7 T 0:00 \_ sh -c ldd /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v5-12-00f/bin/root.exe Duh, forgot to rebuild the parrot index for d141 after adding gcc. dthain report that he can run root and loon, with proper output. Installed parrot 2_4_3 as mindata@minos26, per HOWTO.parrot Rebuilt indexes $ REL=2_4_3 ; ARC='i686-linux-2.6' ; DAT='' $ VER=cctools-${REL}${DAT}-${ARC} $ export PARROT_DIR=${PRO}/${VER} $ export PATH=${PARROT_DIR}/bin:${PATH} $ dds /afs/fnal.gov/files/data/minos/d141/.g* -rw-r--r-- 1 kreymer g020 44 Feb 7 08:42 /afs/fnal.gov/files/data/minos/d141/.growfschecksum -rw-r--r-- 1 kreymer g020 70827161 Feb 7 08:42 /afs/fnal.gov/files/data/minos/d141/.growfsdir $ mkdir /afs/fnal.gov/files/data/minos/d141/oldparrot $ mv /afs/fnal.gov/files/data/minos/d141/.g* /afs/fnal.gov/files/data/minos/d141/oldparrot $ time make_growfs -k /afs/fnal.gov/files/data/minos/d141 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d141/.growfsdir make_growfs: no directory exists, this might be quite slow... make_growfs: scanning directory tree for changes... make_growfs: 991412 files, 6817 links, 107259 dirs, 0 checksums computed real 9m10.921s user 0m46.863s sys 1m25.696s du -sk /afs/fnal.gov/files/data/minos/d141/.growfsdir 41195 /afs/fnal.gov/files/data/minos/d141/.growfsdir $ mkdir /afs/fnal.gov/files/data/minos/d199/oldparrot $ mv /afs/fnal.gov/files/data/minos/d199/.g* /afs/fnal.gov/files/data/minos/d199/oldparrot $ time make_growfs -k /afs/fnal.gov/files/data/minos/d199 make_growfs: loading existing directory from /afs/fnal.gov/files/data/minos/d199/.growfsdir make_growfs: no directory exists, this might be quite slow... 
make_growfs: scanning directory tree for changes... make_growfs: 727335 files, 16246 links, 59829 dirs, 0 checksums computed real 6m41.867s user 0m33.389s sys 1m5.625s P> ups list -aK+ ERROR: ld.so: object '/cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. GRRR, all products are gone now. Noted that fcdflnx3 is SL 5.0 Let's try SL 4.4, PRO=/minos/scratch/parrot ============================================================================= 2008 06 13 ============================================================================= ########## # CONDOR # ########## 4 email indicate that minos02 /local/scratch1 seems to have filled around Date: Fri, 13 Jun 2008 14:44:48 -0500 Date: Fri, 13 Jun 2008 14:44:57 -0500 Date: Fri, 13 Jun 2008 14:45:09 -0500 Date: Fri, 13 Jun 2008 14:45:22 -0500 The disk looks OK now ( 14:50 ) Ganglia shows 30 GB used up, starting 14:14, ran out around 14:44, freed up around 14:50 ########### # BLUEARC # ########### Date: Fri, 13 Jun 2008 14:51:43 -0500 (CDT) Subject: HelpDesk ticket 117175 ___________________________________________ Short Description: Quota request for BlueArc served /minos/scratch, for loiacono Problem Description: LSC/CSI : Please set an individual storage quota of 1000 GBytes for user loiacono on the BlueArc served /minos/scratch volume. This increases the existing 500 GBytes quota. ___________________________________________ ########## # PARROT # ########## Checking out X86_64 on fcdflnx3 REL=2_4_3 ; ARC='x86_64-linux-2.6' ; DAT='' copied and modified setup.sh which had gotten lost on fcdflnx2 at $HOME, moved to parrot After ls -d ... 53 /tmp/parrot.1060/ catting a file works, P> . /usr/local/etc/setups.sh ERROR: ld.so: object '/cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. P> uname -a Linux fcdflnx3.fnal.gov 2.6.18-53.1.6.el5 #1 SMP Wed Jan 23 11:37:57 EST 2008 x86_64 x86_64 x86_64 GNU/Linux P> cat /etc/redhat-release Scientific Linux SLF release 5.0 (Lederman) model name : Intel(R) Xeon(R) CPU E5335 @ 2.00GHz P> file /cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so /cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/bin/parrot_helper.so: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped This may be harmless After ls 53 /tmp/parrot.1060/ After setup_minos 121 /tmp/parrot.1060/ After root -v - the splash came up, After loon with local file, stuck at vector.dll, 378 /tmp/parrot.1060/ fcdflnx3 > ps xf PID TTY STAT TIME COMMAND 7682 pts/3 Ss 0:00 -bash 7750 pts/3 S+ 0:00 \_ script parrot.log 7751 pts/3 S+ 0:00 \_ script parrot.log 7752 pts/7 Ss 0:00 \_ bash -i 7826 pts/7 S+ 0:15 \_ parrot -m /cdf/home/kreymer/parrot/cctools-2_4_3-x86_64-linux-2.6/mountfile.grow -d remote bash 7827 pts/7 T 0:00 \_ bash 8071 pts/7 T 0:02 \_ loon -bq firstlast.C F00031300_0000.mdaq.root 8073 pts/7 T 0:00 \_ sh -c ldd /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.24.2/bin/Linux2.6-GCC_3_4/loon 3297 pts/5 Ss 0:00 -bash 8076 pts/5 R+ 0:00 \_ ps xf ########## # PARROT # ########## Checked sizes on fcdflnx2 fcdflnx2 > du -sm /tmp/parrot.1060/ 384 /tmp/parrot.1060/ Now a clean start, After ls -d ... 
53 /tmp/parrot.1060/ After setup_minos 121 /tmp/parrot.1060/ After loon 384 /tmp/parrot.1060/ Repeated loon run seems to be OK Repeated parrot seems to be OK no increase in size of /tmp/parrot.1060/ setup and ran root v5_14_00g -q after setup_minos, OK created setup.sh for quicker usage test ran with squid, OK ran with DCache , uses libdcap.so , OK 437 /tmp/parrot.1060/ fcdflnx2 > du -sm /tmp/parrot.1060/* | sort -n ... 11 /tmp/parrot.1060/7a 13 /tmp/parrot.1060/6e 18 /tmp/parrot.1060/8f 52 /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 52 /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199-- 68 /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d141-- ============================================================================= 2008 06 12 ============================================================================= ########## # PARROT # ########## INSTALLATION simplified, for major release install kreymer@KREYMERLAP very slow moving files to VCC/Ely Tried also on fcdflnx2 ( SLF 4.5 32 bit ) cd ${HOME}/parrot FILES=' HOWTO.parrot mountfile.grow firstlast.C F00031300_0000.mdaq.root ' for FILE in ${FILES} ; do curl http://www-numi.fnal.gov/computing/parrot/${FILE} -o ${FILE} done VER=2_4_3 KERN='i686' TARD=cctools-${VER}-${KERN}-linux-2.6 TARP=${TARD}.tar.gz curl http://www.cse.nd.edu/~ccl/software/files/${TARP} -o ${TARP} tar xzvf ${TARP} ln -s ../mountfile.grow ${TARD}/ TESTING VER=cctools-${VER}-${KERN}-linux-2.6 export PARROT_DIR=${HOME}/parrot/${TARD} export PATH=${PARROT_DIR}/bin:${PATH} #export HTTP_PROXY="http://squid.fnal.gov:3128" parrot -m ${PARROT_DIR}/mountfile.grow -d remote bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft PS1='P> ' . /usr/local/etc/setups.sh export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup setup_minos() { . $MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } setup_minos -r R1.24.2 No default SAM configuration exists at this time. MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory type loon DFILE=F00031300_0000.mdaq.root loon -bq firstlast.C ${DFILE} ============================================================================= 2008 06 11 ============================================================================= ######## # GRID # ######## MINOS26 > quota -g e875 Disk quotas for group e875 (gid 5111): Filesystem blocks quota limit grace files quota limit grace blue2:/fermigrid-data 324961152 0 419430400 126348 0 0 blue2:/fermigrid-app 31344704 0 31457280 328480 0 0 MINOS26 > du -sm * 1834 Minossoft du: `VDT/vdt/extract': Permission denied du: `VDT/vdt/backup': Permission denied du: `VDT/vdt/services': Permission denied 287 VDT 1 bin du: `minfarm/Minossoft/EXTERNAL/mysql-5.0.22/sql/share/japanese-sjis': Permission denied 15192 minfarm 488 parrot 4458 products 56 sam 6269 scripts 7209 users MINOS26 > du -sm users/* 1700 users/boehm 2818 users/loiacono 1 users/pawloski 2681 users/rustem 10 users/scavan MINOS26 > chmod -R 755 products MINOS26 > rm -r products ########## # CONDOR # ########## Regarding intermittent kxlist failures/timeouts in kproxy These stopped after the KCA server upgrade The last was on 21 May, upgrade was on 28 May ######## # LSF # ######## Date: Wed, 11 Jun 2008 13:24:29 -0500 From: Joseph Boyd To: Arthur E. Kreymer , Robert W. Hatcher Subject: LSF on Minos Hi Art, What is the state of LSF on minos? 
We got a ticket today that running bjobs (or anything) on minos13 caused and error. That was confirmed. If I kill all the lsf daemons running on that machine though then everything works (it presumably goes and talks to some other server). Looking at all the minos machines, various machines have various things running. Some have nothing. Can you please let me know what the current state is so I can fix minos13 if I've broken it and also so I can document what it's supposed to look like. Is it going away soon? Thanks, joe ---------------------------------------------------------------------- Could not log in, around 14:00 This is working fine, as of 14:47 bjobs, bhosts, bqueues ######## # GRID # ######## Clean out a little space in /grid/app/minos, per email warning 99.7% full ( 30 GB ) ############ # MCIMPORT # ############ ------------------------------------------------------------- Date: Tue, 10 Jun 2008 11:23:51 -0500 (CDT) From: Kregg E Arms To: Arthur Kreymer Cc: Ben Speakman Subject: short runs (fwd) Hi Art, Ben found four of the AtmosNu files I generated for him had problems. I will rerun these (today?) and upload new copies of the reroot files. Can you remove the old ones from pnfs, etc.? /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330002_0004_AtmosNu_D04.reroot.root /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330002_0005_AtmosNu_D04.reroot.root /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330004_0005_AtmosNu_D04.reroot.root /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330004_0008_AtmosNu_D04.reroot.root ------------------------------------------------------------- kreymer@minos26 FILES=' f21330002_0004_AtmosNu_D04.reroot.root f21330002_0005_AtmosNu_D04.reroot.root f21330004_0005_AtmosNu_D04.reroot.root f21330004_0008_AtmosNu_D04.reroot.root ' for FILE in ${FILES} ; do ls -l /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/${FILE} rm /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/${FILE} sam undeclare ${FILE} done -rw-r--r-- 1 kreymer e875 74925261 Apr 6 22:43 /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330002_0004_AtmosNu_D04.reroot.root -rw-r--r-- 1 kreymer e875 15996235 Apr 6 22:44 /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330002_0005_AtmosNu_D04.reroot.root -rw-r--r-- 1 kreymer e875 64018225 Apr 6 23:07 /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330004_0005_AtmosNu_D04.reroot.root -rw-r--r-- 1 kreymer e875 66456480 Apr 6 23:14 /pnfs/minos/mcin_data/far/daikon_04/AtmosNu/000/f21330004_0008_AtmosNu_D04.reroot.root MINOS26 > date Wed Jun 11 07:31:14 CDT 2008 ============================================================================= 2008 06 09 ============================================================================= MINOS01 > setup systools MINOS01 > cmd add_minos_user djalbrec Creating account... /var/yp gmake[1]: Entering directory `/var/yp/minos' gmake[1]: `ypservers' is up to date. gmake[1]: Leaving directory `/var/yp/minos' gmake[1]: Entering directory `/var/yp/minos' Updating passwd.byname... Updating passwd.byuid... Updating netid.byname... gmake[1]: Leaving directory `/var/yp/minos' Adding user to Minos AFS group... Sending mail to subscribe to minos-user mailing list ... Sending email to user... ============================================================================= 2008 06 06 ============================================================================= ####### # LSF # ####### fsun02 is up again. MRTG network rates indicate a spike in before drop Wed Jun 4 11:00 ish And resumed activity Fri Jun 06 Jun 11:00. 
########### # ENSTORE # ########### ( cd /pnfs/minos/reco_near ; enstore pnfs --tags ) ( cd /pnfs/minos/reco_far ; enstore pnfs --tags ) .(tag)(library) = CD-9940B Date: Fri, 06 Jun 2008 21:33:51 -0500 (CDT) Subject: HelpDesk ticket 116813 ___________________________________________ Short Description: Please move future /pnfs/minos/reco_near and reco_far writes to CD-LTO-3 Problem Description: enstore-admin : It seems to be a good time to move the bulk of remaining Minos writes from 9940B tape to LTO-3 tape. Therefore, please do something like the following to direct future writes under these paths toward LTO-3 tape : cd /pnfs/minos/reco_near enstore pnfs --library CD-LTO3 cd /pnfs/minos/reco_far enstore pnfs --library CD-LTO3 ___________________________________________ This ticket is assigned to BERG, DAVID of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Mon, 09 Jun 2008 17:31:47 -0500 (CDT) Solution: berg@fnal.gov sent this solution: Art, I changed the library tags under reco_near and reco_far to CD-LTO3. Everything below those points in the tree inherits the tags, and will now write to LTO3, except R1_18_2, which for some reason has primary tags (dated Nov 16, 2005): 1 CD-9940B minos/reco_far/R1_18_2 1 CD-9940B minos/reco_near/R1_18_2 I don't know if anything will be written under these directories in the future, but they are still set to CD-9940B. I can changes these directories to inherit from the level above like the others, if you like. - David ___________________________________________ Date: Mon, 16 Jun 2008 01:44:21 +0000 (UTC) Thanks ! Nothing should be written to the R1_18_2 directories again. It is fine with us if you change these to inherit. ######## # FARM # ######## Following up on Rubin note of 16 May, regarding undeclare cand's in /minos/data/minfarm/mcnear/cp_to_dc MINOS26 > MCFILS=`cat /minos/data/minfarm/mcnear/cp_to_dc` MINOS26 > for FIL in ${MCFILS} ; do sam locate ${FIL} ; done ... Datafile with name 'n13047014_0025_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047014_0027_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047014_0028_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047014_0029_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0000_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0001_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0002_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0014_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0015_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0016_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0021_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047041_0030_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047042_0000_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047042_0002_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047042_0004_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. Datafile with name 'n13047042_0006_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. 
Datafile with name 'n13047042_0007_L010185N_D04.cand.cedar_phy_bhcurv.1.root' not found. These files were not copied to DCache. I may copy these manually, using them to test srmcp versus gsi_ftp rates. ######## # DATA # ######## Date: Fri, 06 Jun 2008 11:23:27 -0500 (CDT) Subject: HelpDesk ticket 116781 ___________________________________________ Short Description: Quota request for BlueArc served /minos/scratch, for jjling Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user jjling on the BlueArc served /minos/scratch volume. This in an increase from the existing default 100 GBytes quota. ___________________________________________ Date: Fri, 06 Jun 2008 11:33:23 -0500 (CDT) This ticket has been reassigned to HILL, KEVIN of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ ============================================================================= 2008 06 05 ============================================================================= ######### # ADMIN # ######### Will try again to use setup systools cmd add_minos_user djalbrec after this user gets a FNALU account ( for home area ) Updated web link for account, at /afs/fnal.gov/files/expwww/numi/html/minwork/computing/account.html Changed this to a symlink to a dated file. ############## # AFSERRSCAN # ############## Made the MON=${2} selection optional Removed the default being the current month. ########## # PARROT # ########## INSTALLATION mindata@minos26 cd /grid/app/minos/parrot VER=current VERX="-20080604" TARD=cctools-${VER}-x86_64-linux-2.6 TARX=cctools-${VER}${VERX}-x86_64-linux-2.6 TARP=${TARD}.tar.gz TARL=${TARX}.tar.gz curl http://www.cse.nd.edu/~ccl/software/files/${TARP} -o ${TARL} tar xzvf ${TARL} [ -n "${VERX}" ] && mv ${TARD} ${TARX} ln -s ../mountfile2.grow ${TARX}/ cat /grid/app/minos/parrot/mountfile2.grow /afs/fnal.gov/files/code/e875/general/minossoft /grow/www-numi.fnal.gov/computing/d199/ /afs/fnal.gov/files/code/e875/general/ups /grow/www-numi.fnal.gov/computing/d141/ TESTING ssh fnpc194 PVER=cctools-current-20080604-x86_64-linux-2.6 export PARROT_DIR=/grid/app/minos/parrot/${PVER} export PATH=${PARROT_DIR}/bin:${PATH} parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft This is OK ! PVER=cctools-current-20080604-x86_64-linux-2.6 export PARROT_DIR=/grid/app/minos/parrot/${PVER} export PATH=${PARROT_DIR}/bin:${PATH} export HTTP_PROXY="http://squid.fnal.gov:3128" parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft Squid worked ! P> . /usr/local/etc/setups.sh bash: child setpgid (27573 to 27572): Operation not permitted ERROR: ld.so: object '/grid/app/minos/parrot/cctools-current-20080604-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. 
P> setup_minos -r R1.24.2 MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory P> cd /minos/scratch/kreymer/condor/loonb P> DFILE=F00031300_0000.mdaq.root P> loon -bq firstlast.C ${DFILE} Host: squid.fnal.gov 2008/06/05 11:45:45.185560 [27820] parrot: http: HTTP/1.0 200 OK 2008/06/05 11:45:45.185573 [27820] parrot: http: Date: Thu, 05 Jun 2008 16:45:45 GMT 2008/06/05 11:45:45.185583 [27820] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/06/05 11:45:45.185593 [27820] parrot: http: Last-Modified: Sat, 03 Nov 2007 02:37:05 GMT 2008/06/05 11:45:45.185602 [27820] parrot: http: ETag: "4ec3eb58-2721c0-43dfd28a89641" 2008/06/05 11:45:45.185611 [27820] parrot: http: Accept-Ranges: bytes 2008/06/05 11:45:45.185619 [27820] parrot: http: Content-Length: 2564544 2008/06/05 11:45:45.185630 [27820] parrot: http: Content-Type: application/x-msdownload 2008/06/05 11:45:45.185639 [27820] parrot: http: X-Cache: MISS from fg2x3.fnal.gov 2008/06/05 11:45:45.185648 [27820] parrot: http: Via: 1.0 fg2x3.fnal.gov:3128 (squid/2.6.STABLE17) 2008/06/05 11:45:45.185657 [27820] parrot: http: Proxy-Connection: close 2008/06/05 11:45:45.185665 [27820] parrot: http: 2008/06/05 11:45:45.185673 [27820] parrot: grow: open http://www-numi.fnal.gov:80/computing/d141///prd/MINOS_ROOT/Linux2.4-GCC_3_4/v5-12-00f/cint/stl/vector.dll and no further action Around 12:00, 27437 pts/0 Ss 0:00 -bash 27553 pts/0 S+ 0:15 \_ parrot -m /grid/app/minos/parrot/cctools-current-20080604-x86_64-linux-2.6/mountfile2.grow -d remote bash 27554 pts/0 T 0:00 \_ bash 27820 pts/0 T 0:02 \_ loon -bq firstlast.C F00031300_0000.mdaq.root 27897 pts/0 T 0:00 \_ sh -c ldd /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.24.2/bin/Linux2.6-GCC_3_4/loon MIN > curl http://www-numi.fnal.gov:80/computing/d141///prd/MINOS_ROOT/Linux2.4-GCC_3_4/v5-12-00f/cint/stl/vector.dll -o /var/tmp/vector.dll killed the parrot session Trying again without squid P> loon -bq firstlast.C ${DFILE} ERROR: ld.so: object '/grid/app/minos/parrot/cctools-current-20080604-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. Warning in : class timespec already in TClassTable P> loon -bq firstlast.C ${DFILE} ERROR: ld.so: object '/grid/app/minos/parrot/cctools-current-20080604-x86_64-linux-2.6/bin/parrot_helper.so' from LD_PRELOAD cannot be preloaded: ignored. Trying again with a clean cache rm -r /tmp/parrot.1060 Stuck again at the same place, vector.dll For the record, application tests : rm -r /tmp/parrot.1060 PVER= export PARROT_DIR=/grid/app/minos/parrot/${PVER} export PATH=${PARROT_DIR}/bin:${PATH} export HTTP_PROXY="http://squid.fnal.gov:3128" parrot -m ${PARROT_DIR}/mountfile2.grow bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft PS1='P> ' . /usr/local/etc/setups.sh export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup setup_minos() { . 
$MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } setup_minos -r R1.24.2 type loon cd /minos/scratch/kreymer/condor/loonb DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root DFILE=F00031300_0000.mdaq.root loon -bq firstlast.C ${DFILE} ######## # FARM # ######## ./roundup -r cedar_phy_bhcurv mcnear Reported long PEND list to minos_batch Reported long time PEND cedar items to minos_batch ####### # LSF # ####### First reports via email around 04:39 UTC ( 23:39 CDT ) MINOS26 > bqueues batch system daemon not responding ... still trying batch system daemon not responding ... still trying ... fsun02 is down Date: Thu, 05 Jun 2008 10:44:00 -0500 (CDT) Subject: HelpDesk ticket 116724 ___________________________________________ Short Description: fsui02 is off the network, taking FNALU batch down Problem Description: The fsui02 system is off the network. This takes down the FNALU LSF batch system. This seems to have happened as long ago as 23:30 last night. ___________________________________________ Date: Thu, 05 Jun 2008 11:03:43 -0500 (CDT) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Fri, 6 Jun 2008 02:06:38 +0000 (UTC) From: Arthur Kreymer To: minos-users@fnal.gov Correction, it is fsun02 that is down. We still have no status update from the managers. They are aware of the problem. ___________________________________________ Correction, it is fsun02 that is down, I expect you knew that. It is hard to avoid typing fsui02, force of habit. ___________________________________________ Date: Sat, 07 Jun 2008 02:50:56 +0000 (UTC) The LSF queues seem to be active again, and jobs are running. MRTG monitoring shows fsun02 active again Fri 2008 Jun 06 11:00 ___________________________________________ Date: Mon, 17 Nov 2008 12:50:34 -0600 (CST) Solution: fsui02 was decommissioned. This ticket was resolved by BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST group. 
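For the next outage of this sort, a quick LSF health check could be kept handy, along these lines (a sketch only; lsid and bhosts are the standard LSF client commands and assume the usual LSF environment on an FNALU node):
lsid            # cluster name and current master host; fails fast if the master is unreachable
bhosts -w       # per-host batch status; unreach/unavail would point at the broken server
bqueues | head -5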
============================================================================= 2008 06 04 ============================================================================= ########## # PARROT # ########## Resume testing latest version INSTALLATION 2.4.2, which has proxy problems, is still latest point release But there is a more recent x86_64 version Man 29, after last tests 2008 05 23 mindata@minos26 cd /grid/app/minos/parrot VER=current VERX="-20080529" TARD=cctools-${VER}-x86_64-linux-2.6 TARX=cctools-${VER}${VERX}-x86_64-linux-2.6 TARP=${TARD}.tar.gz TARL=${TARX}.tar.gz curl http://www.cse.nd.edu/~ccl/software/files/${TARP} -o ${TARL} tar xzvf ${TARL} [ -n "${VERX}" ] && mv ${TARD} ${TARX} ln -s ../mountfile2.grow ${TARX}/ cat /grid/app/minos/parrot/mountfile2.grow /afs/fnal.gov/files/code/e875/general/minossoft /grow/www-numi.fnal.gov/computing/d199/ /afs/fnal.gov/files/code/e875/general/ups /grow/www-numi.fnal.gov/computing/d141/ TESTING ssh fnpc194 PVER=cctools-current-20080529-x86_64-linux-2.6 export PARROT_DIR=/grid/app/minos/parrot/${PVER} export PATH=${PARROT_DIR}/bin:${PATH} parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft bash-3.00$ ls -d /afs/fnal.gov/files/code/e875/general/minossoft 2008/06/04 16:08:38.639664 [31451] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 2008/06/04 16:08:38.639771 [31451] parrot: grow: fetching checksum: 2008/06/04 16:08:38.639800 [31451] parrot: http: connect www-numi.fnal.gov port 80 2008/06/04 16:08:38.641832 [31451] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfschecksum HTTP/1.0 Host: www-numi.fnal.gov 2008/06/04 16:08:38.650005 [31451] parrot: http: HTTP/1.1 200 OK 2008/06/04 16:08:38.650059 [31451] parrot: http: Date: Wed, 04 Jun 2008 21:08:38 GMT 2008/06/04 16:08:38.650070 [31451] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/06/04 16:08:38.650080 [31451] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:19 GMT 2008/06/04 16:08:38.650090 [31451] parrot: http: ETag: "534d65ce-2c-44592a26655c3" 2008/06/04 16:08:38.650098 [31451] parrot: http: Accept-Ranges: bytes 2008/06/04 16:08:38.650106 [31451] parrot: http: Content-Length: 44 2008/06/04 16:08:38.650115 [31451] parrot: http: Connection: close 2008/06/04 16:08:38.650123 [31451] parrot: http: Content-Type: text/plain 2008/06/04 16:08:38.650130 [31451] parrot: http: 2008/06/04 16:08:38.650142 [31451] parrot: grow: remote checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d 2008/06/04 16:08:38.650187 [31451] parrot: grow: fetching directory: http://www-numi.fnal.gov:80/computing/d199//.growfsdir 2008/06/04 16:08:38.650253 [31451] parrot: http: connect www-numi.fnal.gov port 80 2008/06/04 16:08:38.651219 [31451] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfsdir HTTP/1.0 Host: www-numi.fnal.gov 2008/06/04 16:08:38.654426 [31451] parrot: http: HTTP/1.1 200 OK 2008/06/04 16:08:38.654466 [31451] parrot: http: Date: Wed, 04 Jun 2008 21:08:38 GMT 2008/06/04 16:08:38.654476 [31451] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/06/04 16:08:38.654486 [31451] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:09 GMT 2008/06/04 16:08:38.654496 [31451] parrot: http: ETag: "5350b140-33ac14e-44592a1cdbf44" 2008/06/04 16:08:38.654505 [31451] parrot: http: Accept-Ranges: bytes 2008/06/04 16:08:38.654513 [31451] parrot: http: Content-Length: 54182222 
2008/06/04 16:08:38.654522 [31451] parrot: http: Connection: close 2008/06/04 16:08:38.654529 [31451] parrot: http: Content-Type: text/plain 2008/06/04 16:08:38.654537 [31451] parrot: http: 2008/06/04 16:08:39.836793 [31451] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 2008/06/04 16:08:40.545656 [31451] parrot: grow: local checksum: 2adb169a42c725ccfbe2b2174da7d8d9e46121f4 2008/06/04 16:08:40.545802 [31451] parrot: grow: checksum does not match, reloading... 2008/06/04 16:08:40.545897 [31451] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 2008/06/04 16:08:40.545943 [31451] parrot: grow: fetching checksum: http://www-numi.fnal.gov:80/computing/d199//.growfsdir 2008/06/04 16:08:40.545993 [31451] parrot: http: connect www-numi.fnal.gov port 80 2008/06/04 16:08:40.547439 [31451] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfschecksum HTTP/1.0 Host: www-numi.fnal.gov 2008/06/04 16:08:40.551883 [31451] parrot: http: HTTP/1.1 200 OK 2008/06/04 16:08:40.551994 [31451] parrot: http: Date: Wed, 04 Jun 2008 21:08:40 GMT 2008/06/04 16:08:40.552036 [31451] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/06/04 16:08:40.552076 [31451] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:19 GMT 2008/06/04 16:08:40.552115 [31451] parrot: http: ETag: "534d65ce-2c-44592a26655c3" 2008/06/04 16:08:40.552150 [31451] parrot: http: Accept-Ranges: bytes 2008/06/04 16:08:40.552183 [31451] parrot: http: Content-Length: 44 2008/06/04 16:08:40.552218 [31451] parrot: http: Connection: close 2008/06/04 16:08:40.552251 [31451] parrot: http: Content-Type: text/plain 2008/06/04 16:08:40.552283 [31451] parrot: http: 2008/06/04 16:08:40.552320 [31451] parrot: grow: remote checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d 2008/06/04 16:08:40.552392 [31451] parrot: grow: fetching directory: http://www-numi.fnal.gov:80/computing/d199//.growfsdir 2008/06/04 16:08:40.552488 [31451] parrot: http: connect www-numi.fnal.gov port 80 2008/06/04 16:08:40.553482 [31451] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfsdir HTTP/1.0 Host: www-numi.fnal.gov 2008/06/04 16:08:40.559354 [31451] parrot: http: HTTP/1.1 200 OK 2008/06/04 16:08:40.559467 [31451] parrot: http: Date: Wed, 04 Jun 2008 21:08:39 GMT 2008/06/04 16:08:40.559508 [31451] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/06/04 16:08:40.559548 [31451] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:09 GMT 2008/06/04 16:08:40.559588 [31451] parrot: http: ETag: "5350b140-33ac14e-44592a1cdbf44" 2008/06/04 16:08:40.559626 [31451] parrot: http: Accept-Ranges: bytes 2008/06/04 16:08:40.559661 [31451] parrot: http: Content-Length: 54182222 2008/06/04 16:08:40.559695 [31451] parrot: http: Connection: close 2008/06/04 16:08:40.559728 [31451] parrot: http: Content-Type: text/plain 2008/06/04 16:08:40.559771 [31451] parrot: http: 2008/06/04 16:08:56.323562 [31451] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 2008/06/04 16:08:57.046740 [31451] parrot: grow: local checksum: 2adb169a42c725ccfbe2b2174da7d8d9e46121f4 2008/06/04 16:08:57.046890 [31451] parrot: grow: checksum does not match, reloading... 2008/06/04 16:08:57.046977 [31451] parrot: grow: directory and checksum are inconsistent, retry in 2 seconds ... 
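One way to narrow down whether the stale index is on the server or in a cache would be to checksum the served .growfsdir directly and through squid. A sketch only; it assumes the grow checksum is a plain SHA-1 of .growfsdir, which matches the 40-hex values above but has not been verified here.
URL=http://www-numi.fnal.gov:80/computing/d199
wget -q ${URL}/.growfschecksum -O -                          # published checksum
wget -q ${URL}/.growfsdir -O - | sha1sum                     # direct fetch
http_proxy=http://squid.fnal.gov:3128 \
  wget -q ${URL}/.growfsdir -O - | sha1sum                   # same fetch via squid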
Try this with 2.4.2 ( which will not use Squid ) ssh fnpc194 PVER=cctools-2_4_2-x86_64-linux-2.6 export PARROT_DIR=/grid/app/minos/parrot/${PVER} export PATH=${PARROT_DIR}/bin:${PATH} parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash ls -d /afs/fnal.gov/files/code/e875/general/minossoft ########### # MONTHLY # ########### DATASETS 6/4 PREDATOR 6/4 VAULT 6/3 from cron, overall rate 32 MB/sec MYSQL 6/5 Thu Jun 5 11:03:07 CDT 2008 Thu Jun 5 11:53:27 CDT 2008 ########## # CONDOR # ########## Glideins stopped getting scheduled to run at around 04:30. Probably due to this on fnpcsrv1 : SRV1> condor_q rustem -- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 2008-06-04 08:33:25-05 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1879557.0 rustem 6/4 02:07 0+06:20:17 R 0 0.0 run_study.sh /grid 1879557.2 rustem 6/4 02:07 0+06:20:17 R 0 0.0 run_study.sh /grid ... 1879691.38 rustem 6/4 03:49 0+00:00:00 I 0 0.0 run_study.sh /grid 1879691.39 rustem 6/4 03:49 0+00:00:00 I 0 0.0 run_study.sh /grid 423 jobs; 50 idle, 373 running, 0 held This is using the entire Minos group allocation No more of our jobs will run until these jobs finish. ########## # CONDOR # ########## Corrected default factors for newer users condor_userprio -all mtavera@fnal.gov 0.50 0.50 1.00 0 8.09 6/03/2008 10:40 6/03/2008 20:15 pittam@fnal.gov 0.57 0.57 1.00 0 149.31 5/27/2008 11:20 6/04/2008 00:35 naples@fnal.gov 1.04 1.04 1.00 0 167.95 5/09/2008 15:14 6/03/2008 22:00 jjling@fnal.gov 25.71 25.71 1.00 36 4038.77 5/28/2008 16:30 6/04/2008 08:50 condor_userprio -setfactor mtavera@fnal.gov 100. condor_userprio -setfactor pittam@fnal.gov 100. condor_userprio -setfactor naples@fnal.gov 100. condor_userprio -setfactor jjling@fnal.gov 100. condor_userprio -setfactor pawloski@fnal.gov 100. 
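Since new pool members keep arriving with the default 1.00 factor, the correction can be done in one pass. A minimal sketch of the pattern above; the user list here is only an example:

# set the standard 100. priority factor for a list of pool users (example list)
NEWUSERS='mtavera pittam naples jjling pawloski'
for U in ${NEWUSERS} ; do
  condor_userprio -setfactor ${U}@fnal.gov 100.
done
condor_userprio -all   # verify the new factors took effect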
# no longer needs boost
=============================================================================
2008 06 03
=============================================================================
############
# MCIMPORT #
############
Planning for
rm /home/mindata/STAGE # was /local/scratch26/mindata
ln -s /minos/data/mcimport /home/mindata/STAGE
$ ls -l /local/scratch26/mindata
total 12
drwxr-xr-x 2 mindata e875 4096 Mar 6 16:17 141
drwxr-xr-x 2 mindata e875 4096 Jun 3 08:37 CRON
drwxr-xr-x 14 mindata e875 4096 Mar 3 17:22 MOVED
lrwxrwxrwx 1 mindata e875 28 Nov 18 2007 OVERLAY -> /minos/data/mcimport/OVERLAY
lrwxrwxrwx 1 mindata e875 25 Oct 30 2007 arms -> /minos/data/mcimport/arms
lrwxrwxrwx 1 mindata e875 26 Nov 5 2007 boehm -> /minos/data/mcimport/boehm
lrwxrwxrwx 1 mindata e875 28 Oct 30 2007 buckley -> /minos/data/mcimport/buckley
lrwxrwxrwx 1 mindata e875 26 Oct 30 2007 gmieg -> /minos/data/mcimport/gmieg
lrwxrwxrwx 1 mindata e875 28 Oct 31 2007 hgallag -> /minos/data/mcimport/hgallag
lrwxrwxrwx 1 mindata e875 27 Oct 30 2007 himmel -> /minos/data/mcimport/himmel
lrwxrwxrwx 1 mindata e875 29 Oct 31 2007 howcroft -> /minos/data/mcimport/howcroft
lrwxrwxrwx 1 mindata e875 29 Nov 3 2007 kordosky -> /minos/data/mcimport/kordosky
lrwxrwxrwx 1 mindata e875 28 Oct 31 2007 kreymer -> /minos/data/mcimport/kreymer
lrwxrwxrwx 1 mindata e875 30 Nov 3 2007 mcinwrite -> /minos/data/mcimport/mcinwrite
lrwxrwxrwx 1 mindata e875 28 Feb 25 17:42 mtavera -> /minos/data/mcimport/mtavera
lrwxrwxrwx 1 mindata e875 27 Oct 31 2007 mualem -> /minos/data/mcimport/mualem
lrwxrwxrwx 1 mindata e875 26 Feb 25 18:23 nwest -> /minos/data/mcimport/nwest
lrwxrwxrwx 1 mindata e875 29 Oct 30 2007 rhatcher -> /minos/data/mcimport/rhatcher
lrwxrwxrwx 1 mindata e875 24 Nov 2 2007 sjc -> /minos/data/mcimport/sjc
lrwxrwxrwx 1 mindata e875 27 Oct 30 2007 urheim -> /minos/data/mcimport/urheim
$ ls -l /local/scratch26/mindata/CRON
total 4
-rw-r--r-- 1 mindata e875 6 Mar 6 13:57 mcimport.tar.pid
$ ln -sf /minos/data/mindata /home/mindata/STAGE
$ date
Tue Jun 3 11:59:25 CDT 2008
Cleanup - remove the MOVED and 141 directory files.
There is more that can be archived,
MDS3 > du -sm /home/mindata/STAGE/STAGE
3699254 STAGE
=============================================================================
2008 06 02
=============================================================================
########
# FARM #
########
Tracking down size mismatch in loopCPB1
n13037702_0018_L010185N_D04.cand.cedar_phy_bhcurv.1.root
-rw-r--r-- 1 rubin numi 570061246 May 21 21:04 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/770/n13037702_0018_L010185N_D04.cand.cedar_phy_bhcurv.1.root
This is odd, as the file was moved to WRITE on May 28, long after this file went into PNFS.
This file is declared to SAM already. Seems like a classic duplicate.
For the present,
FIL=n13037702_0018_L010185N_D04.cand.cedar_phy_bhcurv.1.root
mv WRITE/${FIL} DUP/${FIL}
loopCPB1 is now complete
loopCPB gets nowhere, as everything is pending.
########
# FARM #
########
Rustem points out that cedar_phy_bhcurv Far run 37901 is not in PNFS.
This data was processed on Dec 12 2007.
Concatenation has been stalled due to a missing subrun. The missing subrun is 15, whose size is normal in the raw data, see /pnfs/minos/fardet_data/2007-04/F00037901*
Subrun F00037901_0015 produced output during the cedar_phy pass on the data.
I have forced concatenation of F00037901 without subrun 15.
I suggest that someone look at Farm logs to see why subrun 15 is missing.
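A sketch for enumerating which subruns of the run have reco output waiting for concatenation, compared against the raw subruns in PNFS. The farcat path and the stream glob are illustrative assumptions, not the actual roundup bookkeeping:

RUN=F00037901
RAWDIR=/pnfs/minos/fardet_data/2007-04      # raw subruns, as noted above
CATDIR=/minos/data/minfarm/farcat           # assumed concatenation input area
SUBS=`ls ${RAWDIR} | grep "^${RUN}_" | cut -c 11-14 | sort -u`
for SUB in ${SUBS} ; do
  ls ${CATDIR}/${RUN}_${SUB}.*.cedar_phy_bhcurv.0.root > /dev/null 2>&1 \
    || echo "no reco output for subrun ${SUB}"
done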
SRV1> ./roundup -n -s 37901 -r cedar_phy_bhcurv far
...
OK - 568 Mbytes in 1 runs
PEND - have 23/24 subruns for F00037901_*.all.sntp.cedar_phy_bhcurv.0.root 173 12/12 14:58 0 23
OK - stream spill.bntp.cedar_phy_bhcurv
OK - 102 Mbytes in 1 runs
PEND - have 23/24 subruns for F00037901_*.spill.bntp.cedar_phy_bhcurv.0.root 173 12/12 14:59 0 23
OK - stream spill.mrnt.cedar_phy_bhcurv
OK - 65 Mbytes in 1 runs
PEND - have 23/24 subruns for F00037901_*.spill.mrnt.cedar_phy_bhcurv.0.root 173 12/12 14:59 0 23
OK - stream spill.sntp.cedar_phy_bhcurv
OK - 69 Mbytes in 1 runs
PEND - have 23/24 subruns for F00037901_*.spill.sntp.cedar_phy_bhcurv.0.root 173 12/12 14:58 0 23
SRV1> ./roundup -s 37901 -r cedar_phy_bhcurv far
SRV1> ./roundup -f 1 -s 37901 -r cedar_phy_bhcurv far
##########
# cflsum #
##########
Corrected cflsum to use release_data not log_data ( space issues )
MIN > ln -sf cflsum.20080602 cflsum # was 20070702
MINOS26 > ${HOME}/minos/scripts/cflsum > cflsum.`date +%Y%m%d`
###########
# MINOS25 #
###########
System is in desperate trouble.
condor_q no longer responds.
Dozens of processes in 'D' state.
Load average started rising around 09:30, with pawloski submission, but root cause is probably rmehdi loon job using 2.3 GB of memory.
Trying to shut down condor gracefully.
[gfactory@minos25 ~]$ ps xf
PID TTY STAT TIME COMMAND
6226 ? Z 0:02 [condor_gridmana]
9547 pts/26 Ss 0:00 -bash
9595 pts/26 R+ 0:00 \_ ps xf
4314 ? S 68:21 python glideFactory.py 90 4 /home/gfactory/glideinsubmit/glidein_t12_glexec/
4316 ? S 624:33 \_ /usr/bin/python glideFactoryEntry.py 4314 90 4 /home/gfactory/glideinsubmit/glidein_t12_glexec/ gpgeneral
4317 ? S 687:45 \_ /usr/bin/python glideFactoryEntry.py 4314 90 4 /home/gfactory/glideinsubmit/glidein_t12_glexec/ gpminos
15986 ? S 0:00 \_ /bin/bash ./job_submit.sh gpminos gpminos@t12_glexec@minos@my2 10 std -- GLIDEIN_Collector minos25.dot,fnal.dot,gov
15991 ? S 0:00 \_ condor_submit -name minos25.fnal.gov entry_gpminos/job.condor
Killed all gfrontend and gfactory processes
MINOS25 > condor_off -peaceful -all -subsystem startd
Can't connect to master FNAL_858@fnpc344.fnal.gov
Can't connect to master FNAL_31050@fnpc339.fnal.gov
Date: Mon, 02 Jun 2008 12:34:42 -0500 (CDT)
Subject: HelpDesk ticket 116503
___________________________________________
Short Description: User process in D state, please kill this, or reboot
Problem Description: run2-sys :
Around 09:30 this morning, user rmehdi ran a 'loon' process, which has gone into a 'D' state, according to the top display.
The load average has gone up over 100, and many other processes are behaving strangely.
Rashid cannot kill this process,
top - 12:08:50 up 130 days, 23:57, 14 users, load average: 123.08, 123.02, 122.64
Tasks: 311 total, 2 running, 305 sleeping, 0 stopped, 4 zombie
Cpu(s): 25.2% us, 1.5% sy, 0.0% ni, 0.0% id, 73.3% wa, 0.0% hi, 0.0% si
Mem: 4151264k total, 4125572k used, 25692k free, 64764k buffers
Swap: 4192944k total, 208k used, 4192736k free, 972376k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME #C COMMAND
3860 rmehdi 15 0 2457m 2.3g 44m D 0 58.0 22:34 2 loon
There are now dozens of other processes in 'D' state.
I do not see any problems accessing disk on the system, so I am inclined to blame this on the runaway 2.3 GByte user process.
Please let us know whether this is a correct assessment of the situation.
Please use any and all tricks you know of to kill this process.
If we need to reboot later this afternoon, contact minos-admin, and I'll let the users know, and will try to drain the Condor queues first. ___________________________________________ Date: Mon, 02 Jun 2008 13:07:30 -0500 (CDT) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 02 Jun 2008 13:59:44 -0500 The loon process, and many other processes are in D state which cannot be killed. I don't see any problem with the afs or nfs space, not the local disk. So, we'll have to reboot the machine. Please let me know if it is ok to reboot it now. ___________________________________________ At around 14:10, MINOS25 > condor_off -fast minos25 Sent "Kill-All-Daemons-Fast" command to master minos25.fnal.gov ___________________________________________ Date: Mon, 02 Jun 2008 14:18:42 -0500 (CDT) Note To Requester: kreymer@fnal.gov sent this Notes To Requester: I have stopped condor on minos25. Please go ahead with the reboot of minos25 as soon as you can. _________________________________________________________________ Rebooted Restarted startd on minos03 MINOS25 > condor_on minos03 -subsystem startd Removes stale jjling jobs MINOS25 > condor_rm 140901.0 Started and tested minos04 MINOS25 > condor_on minos04 -subsystem startd Started them all up MINOS25 > condor_on -all -subsystem startd [gfrontend@minos25 ~]$ ./start_frontend.sh [gfactory@minos25 ~]$ ./start_factory.sh ######## # DATA # ######## Blue arc failures, fnpcsrv1 Sun Jun 1 05:41:48 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-05 ... Sun Jun 1 07:31:54 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-11 Sun Jun 1 10:03:00 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-12 Sun Jun 1 10:04:00 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-01 minos-sam03 Sun Jun 1 05:26:27 CDT 2008 SLO N00007148_0002.spill.sntp.cedar_phy_bhcurv.0.root 14 head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-04/N00007148_0005.spill.sntp.cedar_phy_bhcurv.0.root' for reading: Stale NFS file handle ... Sun Jun 1 05:43:48 CDT 2008 BAD N00007188_0000.spill.sntp.cedar_phy_bhcurv.0.root 0 head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-04/N00007194_0000.spill.sntp.cedar_phy_bhcurv.0.root' for reading: Stale NFS file handle Sun Jun 1 05:44:48 CDT 2008 BAD N00007194_0000.spill.sntp.cedar_phy_bhcurv.0.root 0 head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-04/N00007194_0005.spill.sntp.cedar_phy_bhcurv.0.root' for reading: No such device or address ... Sun Jun 1 06:24:49 CDT 2008 BAD N00007604_0011.spill.sntp.cedar_phy_bhcurv.0.root 0 head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-04/N00007607_0002.spill.sntp.cedar_phy_bhcurv.0.root' for reading: No such device or address Sun Jun 1 06:25:49 CDT 2008 BAD N00007607_0002.spill.sntp.cedar_phy_bhcurv.0.root 0 Sun Jun 1 07:31:51 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-11 Sun Jun 1 10:03:04 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-12 Sun Jun 1 10:04:04 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-01 minos01 Sun Jun 1 05:37:58 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-05 ... 
Sun Jun 1 07:31:53 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-01 Sun Jun 1 10:03:06 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-02 Sun Jun 1 10:04:07 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 minos25 Fri May 30 02:05:17 CDT 2008 SLO N00010645_0000.spill.sntp.cedar_phy_bhcurv.0.root 14 Sun Jun 1 05:37:58 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-05 ... Sun Jun 1 07:29:53 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-11 Sun Jun 1 10:03:01 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-12 Sun Jun 1 10:04:01 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-01 minos26 Sun Jun 1 05:37:58 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-05 ... Sun Jun 1 07:32:09 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-01 Sun Jun 1 10:03:08 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-02 Sun Jun 1 10:04:08 CDT 2008 BAD /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 Date: Sun, 01 Jun 2008 08:03:20 -0500 From: Andy Romero To: site-nas-announce@fnal.gov Cc: Storage Admins Subject: BlueArc Problem ... Status update Only blue2 and minos-nas-0 customers were effected by this problem .. everyone else ignore Early this morning array Minossata01 partially failed. Of course this caused file system MINOS-r6sata-0 to fail; however, it also caused cms-r5-at-1 and cms-r5-at-2 to fail. Initial sttempts to get these filesystems back online failed ...(system calls were timing out). At this point I decided that the only course of action which had any chance of quickly getting CMS back online was to reboot NAS head RHEA-1. The Reboot completed, all CMS file systems are back online. All other blue2 hosted file systems are also back online. I am now going to contact someone to get the Minossata01 array back online. Then I will re-enable EVS minos-nas-0 Andy ============================================================================= ============================================================================= * * * * * KREYMER IS ON FURLOUGH 26 THROUGH 31 MAY * * * * * Blue Arc failures Sunday Jun 1, as noted above CFL - failed again, path error, adjusted and retried, OK, see above KCA update - interrupted farm processing, Rubin's robot cert needed registration ============================================================================= ============================================================================= 2008 05 25 ============================================================================= ######## # FARM # ######## Leaving loopCPB1 running through the furlough. There are enough PEND partial runs that the normal roundup processing done under the corral script would not be moving any cand files through. 
######## # DATA # ######## One more quick scan for cand/bcnd files MINOS26 > du -sm /minos/data/reco_*/*/cand_data 144 /minos/data/reco_far/cedar_phy/cand_data 331197 /minos/data/reco_far/cedar_phy_bhcurv/cand_data 1066 /minos/data/reco_near/cedar_phy_bhcurv/cand_data MINOS26 > du -sm /minos/data/reco_*/*/.bcnd_data 78964 /minos/data/reco_far/cedar/.bcnd_data 27 /minos/data/reco_far/cedar_phy/.bcnd_data 6054 /minos/data/reco_far/cedar_phy_bhcurv/.bcnd_data FARM03 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 27T 1.8T 94% /minos/data FARM03 > DIRS=`ls -d /minos/data/reco_*/*/cand_data` FARM03 > for DIR in ${DIRS} ; do echo rm -r ${DIR} ; done rm -r /minos/data/reco_far/cedar_phy_bhcurv/cand_data rm -r /minos/data/reco_far/cedar_phy/cand_data rm -r /minos/data/reco_near/cedar_phy_bhcurv/cand_data FARM03 > rm -r /minos/data/reco_far/cedar_phy_bhcurv/cand_data FARM03 > rm -r /minos/data/reco_far/cedar_phy/cand_data FARM03 > rm -r /minos/data/reco_near/cedar_phy_bhcurv/cand_data FARM03 > DIRS=`ls -d /minos/data/reco_*/*/.bcnd_data` FARM03 > for DIR in ${DIRS} ; do echo rm -r ${DIR} ; done rm -r /minos/data/reco_far/cedar/.bcnd_data rm -r /minos/data/reco_far/cedar_phy/.bcnd_data rm -r /minos/data/reco_far/cedar_phy_bhcurv/.bcnd_data FARM03 > for DIR in ${DIRS} ; do echo rm -r ${DIR} rm -r ${DIR} ; done rm -r /minos/data/reco_far/cedar/.bcnd_data About 2.2 TB free in /minos/data now. ######## # DATA # ######## jdejong asks to process sntp's from far cedar, problem is 2000 files per month period. That's correct, 2005-04 through 2007-04 were not concatenated. MINOS26 > for DIR in ${DIRS} ; do printf "${DIR} " ; ls /minos/data/reco_far/cedar/sntp_data/${DIR} | wc -l ; done 2005-04 688 2005-05 741 2005-06 716 2005-07 735 2005-08 742 2005-09 721 2005-10 738 2005-11 720 2005-12 748 2006-01 745 2006-02 616 2006-03 554 2006-06 721 2006-07 702 2006-08 696 2006-09 719 2006-10 746 2006-11 720 2006-12 745 2007-01 738 2007-02 672 2007-03 744 2007-04 715 2007-05 96 2007-11 40 2007-12 66 2008-01 75 2008-02 82 2008-03 84 2008-04 74 2008-05 50 MINOS26 > for DIR in ${DIRS} ; do printf "${DIR} " ; ls /pnfs/minos/reco_far/cedar/sntp_data/${DIR} | wc -l ; done 2005-04 1392 2005-05 1486 2005-06 1432 2005-07 1470 2005-08 1486 2005-09 1445 2005-10 1482 2005-11 1440 2005-12 1497 2006-01 1490 2006-02 1237 2006-03 1151 2006-06 1442 2006-07 1429 2006-08 1402 2006-09 1440 2006-10 1492 2006-11 1443 2006-12 1490 2007-01 92 2007-02 98 2007-03 80 2007-04 74 2007-05 76 2007-11 76 2007-12 66 2008-01 75 2008-02 82 2008-03 84 2008-04 74 2008-05 50 That's truly weird, 2007-01 through 2007-04 are concatenated in PNFS, but not on /minos/data. OHHHH, not so weird after all. Files were copied from afs to nfs, and afs files were not concatenated. 
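The same comparison can be reduced to a single pass that prints only the months whose counts disagree. A sketch, assuming DIRS holds the YYYY-MM month list used in the loops above (its setting was not logged):

NFSD=/minos/data/reco_far/cedar/sntp_data
PNFD=/pnfs/minos/reco_far/cedar/sntp_data
for DIR in ${DIRS} ; do
  NN=`ls ${NFSD}/${DIR} | wc -l`     # files in the NFS copy
  NP=`ls ${PNFD}/${DIR} | wc -l`     # files in PNFS
  [ "${NN}" -ne "${NP}" ] && printf "${DIR} nfs ${NN} pnfs ${NP}\n"
done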
Also, /minos/data seems to have only spill data through 2007-04 ============================================================================= 2008 05 23 ============================================================================= ########## # PARROT # ########## Resume testing latest version INSTALLATION 2.4.2, which has proxy problems, is still latest point release cd /grid/app/minos/parrot VER=current VERX="-20080520" TARD=cctools-${VER}-x86_64-linux-2.6 TARX=cctools-${VER}${VERX}-x86_64-linux-2.6 TARP=${TARD}.tar.gz TARL=${TARX}.tar.gz curl http://www.cse.nd.edu/~ccl/software/files/${TARP} -o ${TARL} tar xzvf ${TARL} [ -n "${VERX}" ] && mv ${TARD} ${TARX} ln -s ../mountfile.grow ${TARX}/ ln -s ../mountfile2.grow ${TARX}/ ln -s ../mountfile.html ${TARX}/ TESTING Checked ganglia at http://rexganglia2.fnal.gov/farms/?c=GP%20Farm&m=&r=hour&s=descending&hc=4 ssh fnpc194 Checksums fail. running on current-20080520 x86_64 fnpc194, remote checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d local checksum: 2adb169a42c725ccfbe2b2174da7d8d9e46121f4 Fails, and retries indefinitely. running 2.4.2 on fngp-osg, 32 bit system, see local checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d running 2.4.2 x96_64 on fnpc194, 1211664873.544326 [14605] parrot: grow: local checksum: c56014e206c26c1ba13e5a321c3155b95689bf4a 1211664873.544453 [14605] parrot: grow: checksum does not match, reloading... 1211664873.548607 [14605] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1211664873.548663 [14605] parrot: grow: fetching checksum: wget --no-cache -q -O /tmp/parrot.1060/grow.checksum.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfschecksum 1211664873.563510 [14605] parrot: grow: remote checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d 1211664873.563624 [14605] parrot: grow: fetching directory: wget --no-cache -q -O /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfsdir 1211664878.222728 [14605] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 1211664878.951315 [14605] parrot: grow: local checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d running on current-20080520 x86_64 fnpc194, another attempt bash-3.00$ ls -d /afs/fnal.gov/files/code/e875/general/minossoft 2008/05/24 16:36:40.927396 [14740] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 2008/05/24 16:36:40.927602 [14740] parrot: grow: fetching checksum: 2008/05/24 16:36:40.927664 [14740] parrot: http: connect www-numi.fnal.gov port 80 2008/05/24 16:36:40.930494 [14740] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfschecksum HTTP/1.0 Host: www-numi.fnal.gov 2008/05/24 16:36:40.938232 [14740] parrot: http: HTTP/1.1 200 OK 2008/05/24 16:36:40.938249 [14740] parrot: http: Date: Sat, 24 May 2008 21:36:40 GMT 2008/05/24 16:36:40.938260 [14740] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.6 2008/05/24 16:36:40.938270 [14740] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:19 GMT 2008/05/24 16:36:40.938279 [14740] parrot: http: ETag: "534d65ce-2c-44592a26655c3" 2008/05/24 16:36:40.938288 [14740] parrot: http: Accept-Ranges: bytes 2008/05/24 16:36:40.938296 [14740] parrot: http: Content-Length: 44 2008/05/24 16:36:40.938307 [14740] parrot: http: Connection: close 2008/05/24 16:36:40.938315 [14740] parrot: http: Content-Type: text/plain 2008/05/24 
16:36:40.938323 [14740] parrot: http: 2008/05/24 16:36:40.938333 [14740] parrot: grow: remote checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d 2008/05/24 16:36:40.938349 [14740] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 2008/05/24 16:36:41.632329 [14740] parrot: grow: local checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d /afs/fnal.gov/files/code/e875/general/minossoft So suddenly things are OK !!! Try again on fnpc195, things fail again forever 2008/05/24 16:38:36.871472 [28006] parrot: grow: local checksum: 2adb169a42c725ccfbe2b2174da7d8d9e46121f4 Can clean it up again by running 2.4.1 once. ============================================================================= 2008 05 23 ============================================================================= Istvan (I.Z.) Danko - P [izdanko@pitt.edu] 412-624-7159 ########## # CONDOR # ########## Submit a 10 minute glidein job, then change the proxy with /local/scratch25/grid/kproxyvnew while job is running. MINOS25 > ln -s /local/scratch25/grid grid condor_submit glideafs10min.run 117890.0 kreymer 5/23 10:59 0+00:00:00 I 0 0.0 probe MINOS25 > grid/kproxyvnew -rw------- 1 kreymer g020 5302 May 23 11:03 kreymer.proxy.2008052311 MINOS25 > dds logs/10min/*117890.0 -rw-r--r-- 1 kreymer g020 0 May 23 10:59 logs/10min/probe.err.117890.0 -rw-r--r-- 1 kreymer g020 247 May 23 11:01 logs/10min/probe.log.117890.0 -rw-r--r-- 1 kreymer g020 0 May 23 10:59 logs/10min/probe.out.117890.0 MINOS25 > cat logs/10min/probe.log.117890.0 000 (117890.000.000) 05/23 10:59:14 Job submitted from host: <131.225.193.25:62903> ... 001 (117890.000.000) 05/23 11:01:26 Job executing on host: <131.225.166.120:61927> ... 006 (117890.000.000) 05/23 11:01:34 Image size of job updated: 6332 ... 005 (117890.000.000) 05/23 11:11:27 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 9974 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 9974 - Total Bytes Sent By Job 0 - Total Bytes Received By Job MINOS25 > less logs/10min/probe.out.117890.0 RUN STARTED Fri May 23 11:01:26 CDT 2008 ... ########## # PROXY # ########## PROXY /local/stage1/condor/execute/dir_10848/glide_J10883/tmp/starter-tmp-dir-xF9smf/execute/dir_12333/kreymer.proxy subject : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer/CN=proxy/CN=proxy issuer : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer/CN=proxy identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer/CN=proxy type : unknown strength : 512 bits path : /local/stage1/condor/execute/dir_10848/glide_J10883/tmp/starter-tmp-dir-xF9smf/execute/dir_12333/kreymer.proxy timeleft : 9:51:59 ###################################################### # CHECK THE GRID ENVIRONMENT IF WE ARE ON THE GRID # ###################################################### OK - we do not seem to be on an OSG host RUN FINISHED Fri May 23 11:11:26 CDT 2008 Tried this again, showing proxy and identity at start and end of job. RUN STARTED Fri May 23 14:27:39 CDT 2008 PROXY /local/stage1/condor/execute/dir_22749/glide_W22784/tmp/starter-tmp-dir-wk19ES/execute/dir_25851/kreymer.proxy identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. 
Kreymer/USERID=kreymer/CN=proxy ... PROXY /local/stage1/condor/execute/dir_22749/glide_W22784/tmp/starter-tmp-dir-wk19ES/execute/dir_25851/kreymer.proxy identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer/CN=proxy RUN FINISHED Fri May 23 14:37:40 CDT 2008 ####### # SAM # ####### Need to add upcoming daikon_05, heck let's throw in daikon_06/07/08 Inspired by 2007 09 23 log entry for UNI in dev int prd ; do for VEG in daikon_05 daikon_06 daikon_07 daikon_08 ; do setup sam -q ${UNI} export SAM_ORACLE_CONNECT samadmin add application family --appFamily=simulation --appName=gminos --appVersion=${VEG} export -n SAM_ORACLE_CONNECT done done New applicationFamilyId = 251 New applicationFamilyId = 252 New applicationFamilyId = 253 New applicationFamilyId = 254 New applicationFamilyId = 88 New applicationFamilyId = 89 New applicationFamilyId = 90 New applicationFamilyId = 91 New applicationFamilyId = 342 New applicationFamilyId = 343 New applicationFamilyId = 344 New applicationFamilyId = 345 MINOS26 > date Fri May 23 09:47:08 CDT 2008 ######## # DATA # ######## No error reported since yesterday's bluearc reboot, per http://www-numi.fnal.gov/computing/dh/bluwatch/ Will resume normal activities. ######## # FARM # ######## loopCPB0 - finished its work on CPB 0 candidates started loopCPB, to get all mrnt and sntp's concatenated ============================================================================= 2008 05 22 ============================================================================= ########## # CONDOR # ########## Testing new KCA form of certificate ############ # BLUWATCH # ############ bluwatch.20080522 kcron was not being run in bluwatch.20080522 Added kcron based on expiration time in the file loop Added LATEST file to show time of latest error Moved full logs to bluwatch/log/* Output goes to *.txt - latest error log/*.txt - full log latest/*.txt - latest status MINOS26 > touch LASTERR -d 'May 21 10:19:02 CDT' ============================================================================= 2008 05 21 ============================================================================= ############ # BLUWATCH # ############ SRV1> echo 'touch /minos/data/minfarm/roundup/STOP' | at 05:00 job 13 at 2008-05-22 05:00 bluwatch.20080521 . add kcron as in other monitoring scripts . cut down verbosity, OK goes to latest 1 line log . add *.latest.txt file with heartbeat . add time limit and report . add STOP file add /grid/data monitoring add lasterr directory/files Started up on all systems, MINOS26 > echo 'touch /afs/fnal.gov/files/data/minos/log_data/bluwatch/STOP' | at 05:00 job 8 at 2008-05-22 05:00 MINOS26 > at -l 8 2008-05-22 05:00 a kreymer MINOS26 > mkdir /afs/fnal.gov/files/data/minos/log_data/bluwatch/lasterr ########## # CONDOR # ########## Testing new KCA per http://security.fnal.gov/pki/newkcafaq.html Got new kx509 image from http://security.fnal.gov/tools/kx509.tar /local/scratch25/grid/kx509 -s winserver.fnal.gov Seems I do need to do kdestroy before testing interactively /local/scratch25/grid/kproxynew Date: Wed, 21 May 2008 18:05:49 -0500 (CDT) Subject: HelpDesk ticket 116054 ___________________________________________ Short Description: Testing new KCA proxies - voms-proxy-init fails Problem Description: I am attempting to get a new style KCA proxy for testing under the fermilab/minos group before the upgrade next week. 
I want to submit a Minos glidein job, then change the original proxy from the old to the new form, and verify that the job can complete correctly. This is what will happen to users' jobs next Wednesday. voms-proxy-init fails, as follows : Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Failed Error: fermilab: Unable to satisfy G/fermilab/minos Request! This is not too surprising, as the VO does not seem to know about the new style DN's. I am on furlough next week. Will the new DN's be registered in time for some advanced testing before next week ? ___________________________________________ Registered the new form certificate at DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer It shows up as 'new' under https://vomrs.fnal.gov:8443/vomrs/vo-fermilab/vomrs?path=/RootNode/MemberAction/MemberDNs&action=execute&do=select ___________________________________________ Date: Thu, 22 May 2008 11:17:32 -0500 (CDT) From: fermilab-vomrs-admin@fnal.gov To: kreymer@fnal.gov Subject: Automated email from fermilab vomrs: You have a request to add a new certificate Dear VO Member, A request to add a new certificate DN: /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer CA: /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA to your certificate list was made by a Member DN: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/UID=kreymer CA: /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA The following reason was provided: Testing new KCA proxy format. Please contact VO administrator if you have any questions. VOMRS fermilab Service ___________________________________________ Date: Thu, 22 May 2008 15:38:37 -0500 (CDT) Note To Requester: Art, can you change the status via the certificates/ set certificate status on your own entry in vomrs? you should be able to do that. If you can't, let me know and I will do it for you. Steve Timm ___________________________________________ Date: Thu, 22 May 2008 16:07:11 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: I have set the certificate status to approved. Should get into VOMS within the hour. Steve ___________________________________________ Date: Wed, 21 May 2008 19:43:55 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: The new DN's will start getting loaded into the Fermilab VO on Monday night May 26. We anticipate, based on the same procedure done on our test machine, that it will take 14-15 hours to load them all. If you want to test before that, you can manually add the new DN to your VOMRS entry via VOMRS. Or voms-proxy-init against fgtest2.fnal.gov where the entry should already be there. Steve Timm ___________________________________________ Date: Thu, 22 May 2008 21:05:36 +0000 (UTC) The new DN has been 'Approved', and I have generated a new style KCA proxy. Thanks, you can close this ticket ! ___________________________________________ Date: Fri, 23 May 2008 20:23:06 +0000 (UTC) Most of the Minos glidein users, and users in general, will be generating one of the new KCA DN's for the first time at the time of the global conversion next Wednesday. 
The new DN's contain the name of the machine on which the proxy was generated, like .../CN=minos25.fnal.gov/CN=cron/CN=Arthur E. Kreymer/CN=UID:kreymer Is this hostname field needed for authorization purposes ? If so, how will you know which hostname to use when doing the automatic load of the new DN's into the Fermilab VO ? ___________________________________________ Date: Fri, 23 May 2008 15:31:46 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: Yes, the hostname field is needed for authorization purposes. No, our auto-add of the new CN=UID:username doesn't account for this. But given this request, we will take any /OU=Robots/CN=cron/CN=.... certs that are in the "minos" group of the fermilab VO and make sure that /OU=Robots/CN=minos25.fnal.gov/CN=cron/CN=User Name/CN=UID:username gets added for all ofthem. The farms production people may also need one added for fnpcsrv1. Steve Timm ___________________________________________ Date: Fri, 23 May 2008 20:37:40 +0000 (UTC) For completeness, here is a list from my whiteboard of users who are probable near term users of Minos glideins, hence need new KCA registrations with /CN=minos25.fnal.gov asousa bckhouse boehm djauty hartnell loiacono mishi nickd pawloski rustem rhatcher ######## # FARM # ######## ./roundup -n -m L010185N -r cedar_phy_bhcurv mcnear ########## # CONDOR # ########## MIN > NODES0='minos11 minos13 minos26' MIN > NODES1='minos01 minos02 minos07' MIN > NODES2='minos03 minos04 minos05 minos06 minos08 minos09 minos10 minos12 minos14 minos15 minos16 minos17 minos18 minos19 minos20 minos21 minos22 minos23 minos24' MIN > REFFIL=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local0.20080512 MIN > for NODE in ${NODES0} ; do printf "${NODE}\n" ssh -ax ${NODE} diff /opt/condor-7.0.1/local/condor_config.local ${REFFIL} ; done MIN > REFFIL=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local1.20080512 MIN > for NODE in ${NODES1} ; do printf "${NODE}\n" ssh -ax ${NODE} diff /opt/condor-7.0.1/local/condor_config.local ${REFFIL} ; done MIN > REFFIL=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local2.20080512 MIN > for NODE in ${NODES2} ; do printf "${NODE}\n" ssh -ax ${NODE} diff /opt/condor-7.0.1/local/condor_config.local ${REFFIL} ; done ssh -ax minos25 diff /opt/condor-7.0.1/local/condor_config.local \ /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local.minos25 The config files all look OK. 
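A sketch folding the three per-group checks above into one loop; the NODES0/1/2 lists and the condor_config.local{0,1,2}.20080512 reference names are as set above:

CDIR=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701
for GRP in 0 1 2 ; do
  eval NODES=\"\${NODES${GRP}}\"      # pick up NODES0, NODES1, NODES2 in turn
  REFFIL=${CDIR}/condor_config.local${GRP}.20080512
  for NODE in ${NODES} ; do
    printf "${NODE}\n"
    ssh -ax ${NODE} diff /opt/condor-7.0.1/local/condor_config.local ${REFFIL}
  done
done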
Now will do the promised reconfigure MINOS25 > condor_reconfig minos01 Sent "Reconfig" command to master minos01.fnal.gov MINOS25 > condor_config_val CONDOR_ADMIN -name minos01 minos-admin@fnal.gov MINOS25 > date Wed May 21 14:33:36 CDT 2008 MINOS25 > condor_reconfig -all Sent "Reconfig" command to master minos10.fnal.gov Sent "Reconfig" command to master minos01.fnal.gov Sent "Reconfig" command to master minos02.fnal.gov Sent "Reconfig" command to master minos20.fnal.gov Sent "Reconfig" command to master minos21.fnal.gov Sent "Reconfig" command to master minos03.fnal.gov Sent "Reconfig" command to master minos04.fnal.gov Sent "Reconfig" command to master minos22.fnal.gov Sent "Reconfig" command to master minos14.fnal.gov Sent "Reconfig" command to master minos05.fnal.gov Sent "Reconfig" command to master minos23.fnal.gov Sent "Reconfig" command to master minos15.fnal.gov Sent "Reconfig" command to master minos06.fnal.gov Sent "Reconfig" command to master minos24.fnal.gov Sent "Reconfig" command to master minos16.fnal.gov Sent "Reconfig" command to master minos25.fnal.gov Sent "Reconfig" command to master minos07.fnal.gov Sent "Reconfig" command to master minos17.fnal.gov Sent "Reconfig" command to master minos08.fnal.gov ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5004:Failed to get authorization from server. Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile) AUTHENTICATE:1004:Failed to authenticate using FS Can't send Reconfig command to master minos26.fnal.gov Sent "Reconfig" command to master minos09.fnal.gov Sent "Reconfig" command to master minos18.fnal.gov Sent "Reconfig" command to master minos19.fnal.gov Sent "Reconfig" command to master FNAL_1621@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_5500@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_8103@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_8211@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_2224@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_5215@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_1913@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_5660@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_2426@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_2808@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_2466@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_1537@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_1467@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_1396@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_8652@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_5594@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_8464@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_1797@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_25110@fnpc341.fnal.gov Sent "Reconfig" command to master FNAL_21093@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_27403@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_10820@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_23605@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_31094@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_30125@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_32550@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_28105@fnpc341.fnal.gov Sent "Reconfig" command to master FNAL_29012@fnpc343.fnal.gov Sent "Reconfig" command to master 
FNAL_11911@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_27081@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_22662@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_26620@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_30442@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_29630@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_11644@fnpc345.fnal.gov Sent "Reconfig" command to master FNAL_30295@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_32408@fnpc344.fnal.gov Sent "Reconfig" command to master FNAL_25528@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_18184@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_15557@fnpc341.fnal.gov Sent "Reconfig" command to master FNAL_32745@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_23589@fnpc340.fnal.gov Sent "Reconfig" command to master FNAL_29392@fnpc343.fnal.gov Sent "Reconfig" command to master FNAL_32709@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_27077@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_29745@fnpc342.fnal.gov Sent "Reconfig" command to master FNAL_27767@fnpc341.fnal.gov Sent "Reconfig" command to master FNAL_29318@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_28882@fnpc343.fnal.gov Sent "Reconfig" command to master FNAL_15897@fnpc341.fnal.gov Sent "Reconfig" command to master FNAL_29175@fnpc339.fnal.gov Sent "Reconfig" command to master FNAL_28477@fnpc346.fnal.gov Sent "Reconfig" command to master FNAL_29596@fnpc339.fnal.gov MINOS25 > condor_config_val CONDOR_ADMIN minos-admin@fnal.gov MINOS25 > condor_config_val CONDOR_ADMIN -name minos11 Can't find address for this master Perhaps you need to query another pool. MINOS25 > condor_config_val CONDOR_ADMIN -name minos12 Can't find address for this master Perhaps you need to query another pool. MINOS25 > condor_config_val CONDOR_ADMIN -name minos13 Can't find address for this master Perhaps you need to query another pool. On minos12, sudo /etc/init.d/condor start ########## # CONDOR # ########## Date: Wed, 21 May 2008 13:46:01 -0500 (CDT) Subject: HelpDesk ticket 116039 ___________________________________________ Short Description: asousa certificate is misregistered in VOMS ? Problem Description: Alex Sousa ( asousa@fnal.gov ) is unable to get a proxy as a member of the fermilab/minos group. His x509 identity is /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/ CN=Alexandre B. Pereira sousa/USERID=asousa His registered certificate as seen in VOMRS is /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alexandre B. Pereira Sousa/UID=asousa Note that sousa is lower case in the X509 string, but is Sousa in the VO. Please resolve this. I am not authorized to add cert's for other people, so I cannot fix this by adding the alternate DN myself. There is the larger question of why we have so many malformed DN's. Please reply to minos-admin and/or asousa ___________________________________________ Date: Wed, 21 May 2008 15:37:39 -0500 (CDT) New Information: The two robot certs /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Alexandre B. Pereira sousa/UID=asousa and /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Alexandre B. Pereira sousa/USERID=asousa have been added to vomrs. The failure appears to be that the robots cert was missing altogether in vomrs, at least we hope that is the case. vomrs and voms are not supposed to be case sensitive, i.e. Alexandre B. Pereira Sousa and Alexandre B. Pereira sousa resolve to the same person. 
I will check again in about 1/2 hour to make sure the change propagated through the rest of the system. The incredible changing names of KCA certs are a problem of long standing which are due to be finally fixed when the new KCA is rolled out on May 28 next week. Historically we find that a name changes when a user renews his or her Fermi ID. Steve Timm ########## # CONDOR # ########## Date: Wed, 21 May 2008 12:05:11 -0500 (CDT) Subject: HelpDesk ticket 116023 ___________________________________________ Short Description: Use bckhouse KCA cert has extra space Problem Description: Christopher Backhouse ( bckhouse@fnal.gov ) cannot activate a kx509 proxy in the fermilab/minos group because his kx509 subject has an extra space after his name : From kx509 Service kx509/certificate issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA subject= /DC=gov/DC=fnal/O=Fermilab/OU=People/ CN=Christopher J. Backhouse /0.9.2342.19200300.100.1.1=bckhouse From voms-proxy-init Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/ CN=Christopher J. Backhouse /USERID=bckhouse Please do what it takes to correct this condition for Chris, and do the same for other users who may have this problem. I know that Rustem Ospanov had the seme problem earlier. Please reply to minos-admin and bckhouse ___________________________________________ Date: Wed, 21 May 2008 15:25:36 -0500 (CDT) Note To Requester: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Christopher J. Backhouse /UID=bckhouse /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Christopher J. Backhouse /USER ID=bckhouse /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Christopher J. Backhouse /UID=bckhouse /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Christopher J. Backhouse /USERID=bckhouse All added to Christopher Backhouse in VOMRS. Will check in 1/2 hour or so to make sure they make it to VOMS. Steve Timm ___________________________________________ ######## # FARM # ######## Tue May 20 20:51:22 CDT 2008 BAD N00008517_0000.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 20:52:22 CDT 2008 BAD N00008523_0000.spill.sntp.cedar_phy_bhcurv.0.root SRV1> ./loopCPB0: line 5: 17604 Killed ./roundup -c -b 100 -w -s "cand.cedar_phy_bhcurv.0" -r cedar_phy_bhcurv mcnear Connection to fnpcsrv1 closed. SRV1> dds -tr -rw-rw-r-- 1 minfarm numi 740890 May 20 20:48 cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.0.log -rw-r--r-- 1 minfarm numi 3215797 May 21 06:09 cedarfar.log -rw-r--r-- 1 minfarm numi 2325596 May 21 06:40 cedarnear.log SRV1> tail cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.0.log SRMCP 70/100 -streams_num=1 -server_mode=active -protocols=gsiftp file:///n13037760_0012_L010185N_D04.cand.cedar_phy_bhcurv.0.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L01018 5N/cand_data/776 Latest mount failure ( except /home/ftp ) is May 20 19:26:41 fnpcsrv1 automount[16603]: failed to mount /minos/data May 20 20:47:16 fnpcsrv1 autofs: automount shutdown failed May 20 20:59:03 fnpcsrv1 automount[9888]: starting automounter version 4.1.3-231, path = /farm, maptype = yp, mapname = auto.farm ============================================================================= 2008 05 20 ============================================================================= ######## # FARM # ######## Will not run loopCPB until the BlueArc problems are resolved. 
mcnearcat 837 472139 cand.cedar_phy_bhcurv.0.root 4974 2871267 cand.cedar_phy_bhcurv.1.root WRITE 1 6 7571.root 811 468125 cand.cedar_phy_bhcurv.0.root 414 238927 cand.cedar_phy_bhcurv.1.root So firing up loopCPB0 again, to clear 1.5 TB . No, cannot do this, see /var/log/messages on fnpcsrv1 : May 20 19:24:54 fnpcsrv1 automount[16429]: failed to mount /home/carneiro May 20 19:25:15 fnpcsrv1 automount[16518]: failed to mount /minos/data May 20 19:25:17 fnpcsrv1 automount[16526]: failed to mount /grid/wnclient May 20 19:25:28 fnpcsrv1 kernel: nfs_statfs: statfs error = 512 There is a defective output file, 7571.root Let's try to get some disk cleared while we fight BlueArc. SRV1> ./loopCPB0 & [3] 17603 SRV1> date Tue May 20 19:37:31 CDT 2008 ######## # DATA # ######## Date: Tue, 20 May 2008 17:04:01 -0500 (CDT) From: HelpDesk Subject: CC: Help Desk Ticket 000000000115925 Has Been Updated. ___________________________________________________________________ New Information: It looks like we are having communications issues with one of the Minos logical drives on the minossata01 array. From our logs: May 19 23:05:16 blue1.fnal.gov 2054 Warning: FCP nexus 1 (host port 4; target port name 0x5000402301FC41F7 address 0x692E00; LUN 10) of SCSI device 88 has failed. May 20 03:18:58 blue1.fnal.gov 2054 Warning: FCP nexus 0 (host port 2; target port name 0x5000402201FC41F7 address 0x692600; LUN 10) of SCSI device 88 has failed. May 20 07:18:05 blue1.fnal.gov 2054 Warning: FCP nexus 1 (host port 4; target port name 0x5000402301FC41F7 address 0x692E00; LUN 10) of SCSI device 88 has failed. May 20 11:26:09 blue1.fnal.gov 2054 Warning: FCP nexus 0 (host port 2; target port name 0x5000402201FC41F7 address 0x692600; LUN 10) of SCSI device 88 has failed. May 20 14:22:02 blue1.fnal.gov 2054 Warning: FCP nexus 1 (host port 4; target port name 0x5000402301FC41F7 address 0x692E00; LUN 10) of SCSI device 88 has failed. Jason/Art, can you check if the array is okay or if there is any indication in the logs that there is a problem with the minos array. The device in question, according to our notes, is Lun 10 on array minossata01. Each time the lun fails, the BlueArc goes into recovery mode and tries to replay the filesystem, attempting to access the luns via any paths it thinks it can use. In the log it looks like we are bouncing between the two ports on the nexsan array. ___________________________________________________________________ Requester Name: GREGORY PAWLOSKI Short Description: BlueArc Server for /minos/scratch/ Down? Problem Description: I wonder if the BlueArc server that provides access to the /minos/data/ and /minos/scratch/ areas on the Minos cluster is down. I cannot access these areas. Greg ___________________________________________________________________ We have just now had another timeout of /minos/data. Here are the relevant bits from the logs at http://www-numi.fnal.gov/computing/dh/bluwatch.html Note the three to four minute delays in monitoring on the nodes where we did not have an outright failure. 
fnpcsrv1 Tue May 20 19:23:59 CDT 2008 OK N00008203_0000.spill.sntp.cedar_phy_bhcurv.0.root head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008206_0000.spill.sntp.cedar_phy_bhcurv.0.root' for reading: No such file or directory Tue May 20 19:25:25 CDT 2008 BAD N00008206_0000.spill.sntp.cedar_phy_bhcurv.0.root head: cannot open `/minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008214_0000.spill.sntp.cedar_phy_bhcurv.0.root' for reading: No such file or directory Tue May 20 19:26:51 CDT 2008 BAD N00008214_0000.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 19:28:06 CDT 2008 OK N00008218_0000.spill.sntp.cedar_phy_bhcurv.0.root minos-sam03 Tue May 20 19:24:07 CDT 2008 OK N00007983_0000.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 19:27:59 CDT 2008 OK N00007988_0000.spill.sntp.cedar_phy_bhcurv.0.root minos01 ue May 20 19:24:27 CDT 2008 OK N00008218_0000.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 19:27:59 CDT 2008 OK N00008221_0000.spill.sntp.cedar_phy_bhcurv.0.root minos25 Tue May 20 19:23:37 CDT 2008 OK N00008200_0007.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 19:27:51 CDT 2008 OK N00008203_0000.spill.sntp.cedar_phy_bhcurv.0.root minos26 Tue May 20 19:24:37 CDT 2008 OK /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-08 Tue May 20 19:28:05 CDT 2008 OK N00008227_0000.spill.sntp.cedar_phy_bhcurv.1.root This all happened at the moment at which I tried to cat a file from BlueArc mounted /home/minfarm/scripts, on fnpcsrv1. Here is the tail of /var/log/messages there . May 20 18:08:52 fnpcsrv1 telnetd[15872]: ttloop: peer died: Invalid or incomplete multibyte or wide character May 20 19:24:54 fnpcsrv1 automount[16429]: failed to mount /home/carneiro May 20 19:25:15 fnpcsrv1 automount[16518]: failed to mount /minos/data May 20 19:25:17 fnpcsrv1 automount[16526]: failed to mount /grid/wnclient May 20 19:25:28 fnpcsrv1 kernel: nfs_statfs: statfs error = 512 May 20 19:26:41 fnpcsrv1 automount[16603]: failed to mount /minos/data This seems odd, /home/carneiro and /grid/wnclient do not seem to exist. Their failure to mount seems tightly correlated with the /minos/data timeout. I find one earlier such incident , involving /minos/scratch : May 19 16:08:44 fnpcsrv1 automount[32100]: failed to mount /home/rubin May 19 16:10:17 fnpcsrv1 automount[32244]: failed to mount /grid/wnclient May 19 16:12:34 fnpcsrv1 kernel: nfs_statfs: statfs error = 6 May 19 16:13:44 fnpcsrv1 automount[809]: >> mount: minos-nas-0.fnal.gov:/minos/scratch failed, reason given by server: Input/output error May 19 16:13:44 fnpcsrv1 automount[809]: mount(nfs): nfs: mount failure minos-nas-0.fnal.gov:/minos/scratch on /minos/scratch May 19 16:13:44 fnpcsrv1 automount[809]: failed to mount /minos/scratch And there are many more failures to mount file systems like /home/carneiro /home/rubin There are problems that extend beyond /minos/data. <-- # @@@ Enter Update above this line. @@@ # --> _____________________________________________________________________ Date: Tue, 20 May 2008 21:04:15 -0500 (CDT) From: Steven Timm I've restarted autofs on fnpcsrv1 in debug mode so if there are future problems which there probably will be, we whave more information. 
___________________________________________________________________ ___________________________________________________________________ _________________________________________________________________ ######## # DATA # ######## Setting up file pings for bluearc, BASE=/minos/data/reco_near/cedar_phy_bhcurv/sntp_data DIRS=`ls -d ${BASE}/*` for DIR in ${DIRS} ; do if [ -d "${DIR}" ] ; then printf "`date` OK ${DIR}\n" else printf "`date` BAD ${DIR}\n" fi FILS=`ls ${DIR}` for FIL in ${FILS} ; do if head -c 8 ${DIR}/${FIL} > /dev/null ; then printf "`date` OK ${FIL}\n" else printf "`date` BAD ${FIL}\n" fi sleep 5 ; done # FILS sleep 5 ; done # DIRS Put this into a bluwatch self logging script, MINOS26 > ./bluwatch & MINOS26 > cat /afs/fnal.gov/files/data/minos/log_data/bluwatch/minos26.txt Tue May 20 14:38:12 CDT 2008 OK /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-03 Tue May 20 14:38:12 CDT 2008 OK /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2005-04 Tue May 20 14:38:13 CDT 2008 OK N00007119_0000.spill.sntp.cedar_phy_bhcurv.0.root Tue May 20 14:39:13 CDT 2008 OK N00007122_0000.spill.sntp.cedar_phy_bhcurv.0.root We have another timeout around 19:24 SRV1> mkdir /export/stage/minfarm/bluwatch SRV1> cp -a /var/log/messages /export/stage/minfarm/bluwatch/messages.2008052119 ######## # DATA # ######## DATA=reco_near/cedar_phy_bhcurv/sntp_data FARM03> ./dc2nfs -d ${DATA} 2>&1 | tee -a /minos/scratch/minfarm/dc2nfs/cpbnear.log /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 982327 free 47/ 51 N00011987_0000.spill.sntp.cedar_phy_bhcurv.0.root 16261802 bytes in 1 seconds (15880.67 KB/sec) FARM03 > dds -tr /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 | tail -rw-r--r-- 1 minfarm e875 844353660 May 20 12:32 N00011981_0011.spill.sntp.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minfarm e875 1835651305 May 20 12:32 N00011984_0000.spill.sntp.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minfarm e875 16261802 May 20 12:32 N00011987_0000.spill.sntp.cedar_phy_bhcurv.0.root drwxrwxr-x 2 minfarm e875 10240 May 20 12:32 ./ -rw-r--r-- 1 minfarm e875 2030928611 May 20 12:35 N00011992_0000.spill.sntp.cedar_phy_bhcurv.0.root FARM03 > dds /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-03/N00011992_0000.spill.sntp.cedar_phy_bhcurv.0.root -rw-r--r-- 1 rubin e875 2030928611 Nov 2 2007 /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-03/N00011992_0000.spill.sntp.cedar_phy_bhcurv.0.root This looks like the classical dccp glitch, after a successful copy. 
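A sketch for confirming that such a copy really did complete before the hung dccp client is killed by hand; the paths follow the N00011992 example above:

FIL=N00011992_0000.spill.sntp.cedar_phy_bhcurv.0.root
MON=2007-03
SSIZ=`stat -c %s /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/${MON}/${FIL}`   # PNFS source
DSIZ=`stat -c %s /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/${MON}/${FIL}`   # NFS destination
echo "pnfs ${SSIZ} nfs ${DSIZ}"
# if the sizes match, the copy is complete and the stalled dccp can be killed manually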
FARM03 > ps xf 3880 pts/3 R+ 73:39 \_ dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-03/N00011992_0000.spill.sntp.cedar_phy_bhcurv.0.root /minos/data FARM03> ./dc2nfs -d ${DATA} 2>&1 | tee -a /minos/scratch/minfarm/dc2nfs/cpbnear.log /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 934685 free 50/ 51 N00011998_0000.spill.sntp.cedar_phy_bhcurv.0.root 1770611397 bytes in 43 seconds (40211.92 KB/sec) NEEDED 2/51 reco_near/cedar_phy_bhcurv/sntp_data/2007-03 33 Mbytes/second STARTED Tue May 20 13:49:21 CDT 2008 FINISHED Tue May 20 13:51:14 CDT 2008 ######## # DATA # ######## BlueArc timeouts and failures continue ----------------------------------------------------------------- Subject: Cron ${HOME}/scripts/farmgsum_log Date: Tue, 20 May 2008 00:15:02 -0500 Summarizing /minos/data/minfarm/*cat du: cannot access `/minos/data/minfarm/mcnearcat/n13037716_0014_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root': No such file or directory du: cannot access `/minos/data/minfarm/mcnearcat/n13037716_0022_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root': No such file or directory ----------------------------------------------------------------- Date: Tue, 20 May 2008 00:46:32 -0500 From: Howard Rubin Subject: [Fwd: HelpDesk ticket 115906] Short Description: /grid/app/minos/ mount lost, srm probably dead Problem Description: This is the output of a cron job run at 23:09 /grid/app/minos/scripts/get_daq_submit: line 28: cd: /minos/data/minfarm/lists: No such device or address /grid/app/minos/scripts/get_daq_submit: line 100: /minos/data/minfarm/lists/current_version: No such device or address There is no FD timestamp file in /minos/data/minfarm/lists/daq_lists. Unable to proceed. Earlier in the evening grid jobs weren't starting with failed authentication, and srm was hung in one process. ----------------------------------------------------------------- Date: Tue, 20 May 2008 01:29:46 -0500 roundup cedar_phy_bhcurv mcnear 20126 stale pidfile on fnpcsrv1 ----------------------------------------------------------------- Date: Tue, 20 May 2008 03:21:45 -0500 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron ${HOME}/minos/scripts/condorglide rm: cannot remove `logs/glideafs/probe.116295.0.err': No such device or address find: logs/glideafs/probe.116295.0.out: No such device or address find: logs/glideafs/probe.116295.0.log: No such device or address ... ----------------------------------------------------------------- Date: Tue, 20 May 2008 05:34:32 -0500 roundup cedar_phy_bhcurv mcnear 3481 stale pidfile on fnpcsrv1 ----------------------------------------------------------------- Date: Tue, 20 May 2008 07:20:55 -0500 From: Cron Daemon /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorglide: line 12: cd: /minos/scratch/kreymer/condor/probe: Not a directory find: logs/glideafs: No such file or directory ----------------------------------------------------------------- Date: Tue, 20 May 2008 09:27:44 -0400 From: Steven Cavanaugh I can't run loon (rather it starts, and as soon as I press a single key, I get a FPE).. I think it can't read all of the libraries. This problem started around 9am EST... 
it was working at 830am EST ============================================================================= 2008 05 19 ============================================================================= ########## # CONDOR # ########## MINOS25 > condor_config_val CONDOR_ADMIN fermigrid-root@fnal.gov cd ~kreymer/minos/scripts/condor701 MINOS25 > diff /opt/condor/local/condor_config.local ~/minos/scripts/condor701/condor_config.local.minos25 24c24 < CONDOR_ADMIN = fermigrid-root@fnal.gov --- > CONDOR_ADMIN = minos-admin@fnal.gov MINOS01 > diff /opt/condor/local/condor_config.local ~/minos/scripts/condor701/condor_config.local 25c25 < CONDOR_ADMIN = fermigrid-root@fnal.gov --- > CONDOR_ADMIN = minos-admin@fnal.gov MINOS25 > diff /opt/condor/etc/condor_config ~/minos/scripts/condor701/condor_config 77c77 < CONDOR_ADMIN = fermigrid-root@fnal.gov --- > CONDOR_ADMIN = minos-admin@fnal.gov condor_config -> condor_config.20080512 condor_config.local -> condor_config.local.20080512 condor_config.local.minos25 -> condor_config.local.minos25.20080512 Date: Mon, 19 May 2008 18:27:03 -0500 (CDT) Subject: HelpDesk ticket 115902 ___________________________________________ Short Description: Update requested to Minos condor_config files. Problem Description: The administrative email from the Minos Condor pool has been going to fermigrid-root@fnal.gov rather than minos-admin. Please propagate new configuration files to minos01 through minos25 as follows . The new scripts are contained under /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701 On all of minos01 through minos25, propagate the new condor_config to /opt/condor-7.0.1/etc/condor_config On minos01 through minos24, propagate condor_config.local to /opt/condor-7.0.1/local/condor_config.local On minos25, propagate condor_config.local.minos25 to /opt/condor-7.0.1/local/condor_config.local After these have been updated, I will make them effective with condor_reconfig -all ___________________________________________ ___________________________________________ Date: Wed, 21 May 2008 09:46:22 -0500 (CDT) Subject: Help Desk Ticket 115902 Has Been Resolved. Solution: The updated configs are now on all the minos machines. ___________________________________________ ########## # CONDOR # ########## http://gratia-fermi.fnal.gov:8880/gratia-reporting/ ######### # ADMIN # ######### Date: Mon, 19 May 2008 12:48:16 -0500 (CDT) Subject: HelpDesk ticket 115867 ___________________________________________ Short Description: Login shells for minfarm and minsoft on minos-sam03 ( cluster ) Problem Description: At your next convenience, please change the login shells for minfarm and minsoft on the Minos Cluster to /bin/bash. ( The accounts are only active on minos-sam03 at present. ) ___________________________________________ Date: Mon, 19 May 2008 12:50:38 -0500 (CDT) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. 
___________________________________________ ___________________________________________ ___________________________________________ ######## # DATA # ######## minfarm @ minos-sam03 ln -s ~kreymer/minos/scripts/dc2nfs.20080118 dc2nfs DATA=reco_near/cedar_phy_bhcurv/sntp_data ./dc2nfs -n -d ${DATA}/2006-07 # need 7/15 ./dc2nfs -d ${DATA}/2006-07 # need 7/15 Added rate printout ./dc2nfs -d ${DATA}/2005-08 # need 1/55 integrated rate with NEEDED ./dc2nfs -d ${DATA}/2005-10 # 33/33 NEEDED 33/33 reco_near/cedar_phy_bhcurv/sntp_data/2005-10 35 Mbytes/second STARTED Mon May 19 12:40:21 CDT 2008 FINISHED Mon May 19 12:52:25 CDT 2008 mkdir -p /minos/scratch/minfarm/dc2nfs ./dc2nfs -d ${DATA}/2005-04 2>&1 | \ tee -a /minos/scratch/minfarm/dc2nfs/cpbnear.log ./dc2nfs -d ${DATA} 2>&1 | \ tee -a /minos/scratch/minfarm/dc2nfs/cpbnear.log ######## # DATA # ######## /minos/data seems to have glitched, seen on fnpcsrv1 and minos-sam03 FARM03> stat /minos/scratch/minfarm/dc2nfs/cpbnear.log File: `/minos/scratch/minfarm/dc2nfs/cpbnear.log' Size: 110557 Blocks: 216 IO Block: 32768 regular file Device: 15h/21d Inode: -944082904 Links: 1 Access: (0644/-rw-r--r--) Uid: (10871/ minfarm) Gid: ( 5111/ e875) Access: 2008-05-19 12:59:56.753000000 -0500 Modify: 2008-05-19 16:43:18.136000000 -0500 Change: 2008-05-19 16:43:18.136000000 -0500 2006-08 28/ 41 /minos/data/reco_near/cedar_phy_bhcurv/sntp_data 1263632 N00010634_0011.spill.sntp.cedar_phy_bhcurv.0.root 200581432 bytes in 4 seconds (48970.08 KB/sec) FARM03> FARM03> ls -ltr /minos/data/reco_near/cedar_phy_bhcurv/sntp_data/2006-08 ... -rw-r--r-- 1 minfarm e875 807030972 May 19 16:07 N00010634_0012.spill.sntp.cedar_phy_bhcurv.0.root -rw-r--r-- 1 minfarm e875 705818624 May 19 16:07 N00010639_0003.spill.sntp.cedar_phy_bhcurv.0.root Cleaned up dc2nfs printout ( full path at top of each directory only ) ./dc2nfs -d ${DATA} 2>&1 | \ tee -a /minos/scratch/minfarm/dc2nfs/cpbnear.log FARM cleanup SRV1> less cedar_phy_bhcurvmcnearsntp.cedar_phy_bhcurv.log SRMCP 4/5 -streams_num=1 -server_mode=active -protocols=gsiftp file:///n13037664_0000_L010185N_D04.sntp.cedar_phy_bhcurv.1.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/ sntp_data/766 srm client error: No such device or address OOPS - SRMCP failed, bailing Mon May 19 16:12:32 CDT 2008 rm: cannot remove `/minos/data/minfarm/roundup/cedar_phy_bhcurvmcnearsntp.cedar_phy_bhcurv.pid': No such device or address Very Very Very odd. On minos-sam03, /var/log/messages, May 19 15:57:59 minos-sam03 kernel: oom-killer: gfp_mask=0xd0 May 19 15:57:59 minos-sam03 kernel: Mem-info: ... May 19 15:58:01 minos-sam03 kernel: Free pages: 14836kB (1664kB HighMem) May 19 15:58:01 minos-sam03 kernel: Active:48391 inactive:966431 dirty:1 writeback:200392 unstable:0 free:3709 slab:14123 mapped:47542 pagetables:577 May 19 15:58:01 minos-sam03 kernel: DMA free:12532kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:4334989 all_unreclaimable? yes May 19 15:58:01 minos-sam03 kernel: protections[]: 0 0 0 May 19 15:58:01 minos-sam03 kernel: Normal free:640kB min:928kB low:1856kB high:2784kB active:236kB inactive:793240kB present:901120kB pages_scanned:1229679 all_unreclaimable? yes May 19 15:58:01 minos-sam03 kernel: protections[]: 0 0 0 May 19 15:58:01 minos-sam03 kernel: HighMem free:1664kB min:512kB low:1024kB high:1536kB active:193328kB inactive:3072484kB present:3801088kB pages_scanned:0 all_unreclaimable? 
no May 19 15:58:01 minos-sam03 kernel: protections[]: 0 0 0 May 19 15:58:01 minos-sam03 kernel: DMA: 3*4kB 3*8kB 3*16kB 3*32kB 3*64kB 3*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 2*4096kB = 12532kB May 19 15:58:01 minos-sam03 kernel: Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 640kB May 19 15:58:01 minos-sam03 kernel: HighMem: 288*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1664kB May 19 15:58:01 minos-sam03 kernel: Swap cache: add 212613, delete 212613, find 80435/92424, race 0+0 May 19 15:58:01 minos-sam03 kernel: 0 bounce buffer pages May 19 15:58:01 minos-sam03 kernel: Free swap: 4192072kB May 19 15:58:01 minos-sam03 kernel: 1179648 pages of RAM May 19 15:58:01 minos-sam03 kernel: 819136 pages of HIGHMEM May 19 15:58:01 minos-sam03 kernel: 141832 reserved pages May 19 15:58:01 minos-sam03 kernel: 213957 pages shared May 19 15:58:01 minos-sam03 kernel: 0 pages swap cached May 19 15:58:01 minos-sam03 kernel: Out of Memory: Killed process 6340 (python). PID USER PR NI VIRT RES SHR S %CPU %MEM TIME #C COMMAND 30691 sam 16 0 54412 11m 3060 S 0 0.3 0:00 0 python And on fnpcsrv1, May 19 15:36:01 fnpcsrv1 automount[17398]: failed to mount /home/scripts May 19 16:08:44 fnpcsrv1 automount[32100]: failed to mount /home/rubin May 19 16:10:17 fnpcsrv1 automount[32244]: failed to mount /grid/wnclient May 19 16:12:34 fnpcsrv1 kernel: nfs_statfs: statfs error = 6 May 19 16:13:44 fnpcsrv1 automount[809]: >> mount: minos-nas-0.fnal.gov:/minos/scratch failed, reason given by server: Input/output error May 19 16:13:44 fnpcsrv1 automount[809]: mount(nfs): nfs: mount failure minos-nas-0.fnal.gov:/minos/scratch on /minos/scratch May 19 16:13:44 fnpcsrv1 automount[809]: failed to mount /minos/scratch SRV1> grep 'failed to mount' /var/log/messages May 19 16:08:44 fnpcsrv1 automount[32100]: failed to mount /home/rubin May 19 16:10:17 fnpcsrv1 automount[32244]: failed to mount /grid/wnclient May 19 16:13:44 fnpcsrv1 automount[809]: failed to mount /minos/scratch SRV1> grep 'failed to mount' /var/log/messages | cut -f 9 -d ' ' | sort -u /grid/wnclient /home/condor_log /home/ftp /home/lists /home/mclogs /home/rubin /home/scripts /minos/scratch And on minos25 Date: Mon, 19 May 2008 16:12:25 -0500 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron ${HOME}/minos/scripts/condorglide /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorglide: line 12: cd: /minos/scratch/kreymer/condor/probe: Not a directory find: logs/glideafs: No such file or directory Date: Mon, 19 May 2008 17:28:52 -0500 (CDT) Subject: HelpDesk ticket 115900 ___________________________________________ Short Description: /minos/data and /minos/scratch interruption around 16:07 CDT Problem Description: LSC/CSI At around 16:07 CDT, the mounts of /minos/data and /minos/scratch seem to have timed out on several nodes, including fnpcsrv1 minos-sam03 Here are some relevant lines from /var/log/messages on fnpcsrv1 : May 19 16:08:44 fnpcsrv1 automount[32100]: failed to mount /home/rubin May 19 16:10:17 fnpcsrv1 automount[32244]: failed to mount /grid/wnclient May 19 16:13:44 fnpcsrv1 automount[809]: failed to mount /minos/scratch My user application failed to write a file to /minos/data at around 16:07. Was there a global BlueArc or network problem around 16:07 ? ___________________________________________ Date: Tue, 20 May 2008 10:02:43 -0500 (CDT) Note To Requester: Is it working now? How long was it down? 
___________________________________________ Date: Tue, 20 May 2008 21:02:43 +0000 (UTC) We continue to see failures of /minos/data, /minos/scratch and /grid/app on several hosts, on at least fnpcsrv1 - farm head node minos25 - Minos Condor submission node minos-sam03 - doing dccp's from FNDCA to /minos/data. Some of the additonal failures are at : Tue, 20 May 2008 00:15:02 - fnpcsrv1 Tue, 20 May 2008 03:21:45 - minos25 Tue, 20 May 2008 07:20:55 - minos25 and sometime around 09:00 I have started a script reading a few bytes of data from a fresh file under /minos/data once per minute, on fnpcsrv1 minos01 minos25 minos26 minos-sam03 The logs are on the web under http://www-numi.fnal.gov/computing/dh/bluwatch.html I plan to check these logs for correlated 'BAD' messages the next time we see a timeout. ___________________________________________ ___________________________________________ ___________________________________________ Date: Thu, 22 May 2008 15:57:50 +0000 (UTC) I have restarted a streamlined 'bluwatch' script, monitoring access to /minos/data files every minute, and running on fnpcsrv1 minos-sam03 minos01 minos25 minos26 The scan results are presented at http://www-numi.fnal.gov/computing/dh/bluwatch This directory has *.txt files with the latest error from each node. There is a LASTERR file whose time stamp is reset when any error is found. Full logs of all errors, as well as starts and stops for each node, are under the 'log' subdirectory Likewise, files with the latest result for each node are under the 'latest' subdirectory. So far this morning, no reported errors. Enjoy ! ___________________________________________ ============================================================================= 2008 05 17 Sat ============================================================================= ####### # SAM # ####### Tried to update the Issue Tracker regarding station problems, SAM-IT/3582 https://plone3.fnal.gov/SAMGrid/tracking/base_view Browse issues New search Search Navigation: Show issue # Site error This site encountered an error trying to fulfill your request. The errors were: Error Type IOError Error Value [Errno 28] No space left on device Request made at 2008/05/17 18:06:33.083 GMT-5 Date: Sat, 17 May 2008 18:13:11 -0500 (CDT) Subject: HelpDesk ticket 115829 ___________________________________________ Short Description: The SAM issue tracker is down Problem Description: On connecting to https://plone3.fnal.gov/SAMGrid/tracking/base_view I see a web page indicating disk problems : Browse issues New search Search Navigation: Show issue # Site error This site encountered an error trying to fulfill your request. The errors were: Error Type IOError Error Value [Errno 28] No space left on device Request made at 2008/05/17 18:06:33.083 GMT-5 ___________________________________________ Date: Mon, 19 May 2008 07:47:21 -0500 (CDT) This ticket has been reassigned to RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Mon, 19 May 2008 08:17:57 -0500 (CDT) This ticket has been reassigned to MENGEL, MARC of the CD-LSCS/CSI/CS/EST Group. 
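NOTE - on the bluwatch monitor described in ticket 115900 above :
the sketch below shows the general shape of a once-per-minute /minos/data
probe. The file names and log layout are made up for illustration ;
this is not the actual bluwatch script, whose results are published at
http://www-numi.fnal.gov/computing/dh/bluwatch

#!/bin/sh
# write then read back a small fresh file under /minos/data once a minute,
# logging OK or BAD with a UTC timestamp ( paths are illustrative only )
NODE=`hostname -s`
LOG=/minos/scratch/bluwatch/${NODE}.log
PROBE=/minos/data/bluwatch/${NODE}.probe
while true ; do
  STAMP=`date -u +%Y-%m-%dT%H:%M:%S`
  if echo ${STAMP} > ${PROBE} && head -c 8 ${PROBE} > /dev/null ; then
    echo "${STAMP} OK"  >> ${LOG}
  else
    echo "${STAMP} BAD" >> ${LOG}
  fi
  sleep 60
done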
___________________________________________ ___________________________________________ ####### # SAM # ####### Prepare a slightly bigger standard test TENFILES=`ls /pnfs/minos/fardet_data/2005-04 | head -10` MINOS26 > printf "${TENFILES}\n" F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root F00030613_0000.mdaq.root F00030613_0001.mdaq.root F00030613_0002.mdaq.root F00030613_0003.mdaq.root F00030613_0004.mdaq.root F00030613_0005.mdaq.root F00030613_0006.mdaq.root { for FILE in `printf "${TENFILES}\n" | head -9` ; do printf "${FILE},"; done ; printf "${TENFILES}\n" | tail -1 ; } > /tmp/STENFILES STENFILES=`cat /tmp/STENFILES` sam list files --dim="FILE_NAME in ${STENFILES}" sam create definition \ --definitionName='st-ten' \ --dimensions="FILE_NAME in ${STENFILES}" \ --group='minos' DatasetDefinition saved with definitionId = 3519 sam describe definition --definitionName='st-ten' sam list files --dim="__set__ st-ten" Also created CENFILES dataset CENFILES=`ls /pnfs/minos/fardet_data/2005-04 | head -100` { for FILE in `printf "${CENFILES}\n" | head -99` ; do printf "${FILE},"; done ; printf "${CENFILES}\n" | tail -1 ; } > /tmp/CENFILES CENFILES=`cat /tmp/CENFILES` sam list files --dim="FILE_NAME in ${CENFILES}" sam create definition \ --definitionName='st-cen' \ --dimensions="FILE_NAME in ${CENFILES}" \ --group='minos' DatasetDefinition saved with definitionId = 3521 sam describe definition --definitionName='st-cen' sam list files --dim="__set__ st-cen" ( did this wrong once, 3520, deleted definition ) MINOS26 > ./sam_test_py minos dev st-onesmall MINOS26 > ./sam_test_py minos dev st-ten about a 30 second delay between files 5 and 6, then 6-10 came at the usual rate 1 per second MINOS26 > ./sam_test_py minos dev st-ten hung up, here's trace, from minos-sam02 05/17/08 16:06:29 minos.SM.ProjectManager 8934: Constraining delivery to the 1 consumption sites , priority : hi 05/17/08 16:06:29 minos.SM.CacheFitter_constrained 8934: Delivery of 1047113 is constrained to (none), i.e., impossible 05/17/08 16:06:29 minos.SM.CacheMan minos 8934: Could not fit files on disk, possibly due to fragmentation MINOS-SAM02 > cd private/station__minos-sam02__station_dev__minos MINOS26 > sam stop project --force --project=sam_test_project_20080517210337 restarted dev station 05/17/08 16:10:44 minos.SM.Repler 16692: No authorized requests 05/17/08 16:11:43 minos.SM.Repler 16692: No authorized requests MINOS26 > ./sam_test_py minos dev st-onesmall good, fast did it 9 more times, looks OK MINOS26 > ./sam_test_py minos dev st-ten OK, fast MINOS26 > ./sam_test_py minos dev st-onesmall OK, fast MINOS26 > ./sam_test_py minos dev st-ten OK, fast MINOS26 > ./sam_test_py minos dev st-ten 05/17/08 16:15:09 minos.SM.CacheFitter_constrained 16692: Delivery of 1047113 is constrained to (none), i.e., impossible 05/17/08 16:15:09 minos.SM.CacheMan minos 16692: Could not fit files on disk, possibly due to fragmentation 05/17/08 16:15:09 minos.SM.Repler 16692: No more deliveries possible MINOS26 > sam stop project --force --project=sam_test_project_20080517211506 Fall back to v6_0_5_23_srm from v6_0_5_24_srm Repeased same tests one ten one x 9 ten ten ten 05/17/08 16:24:52 minos.SM.CacheMan minos 19557: Could not fit files on disk, possibly due to fragmentation MINOS26 > sam stop project --force --project=sam_test_project_20080517212449 Falling back to the old station on minos-sam01 dev per notes ups declare -c sam_cp_config v7_0 ups declare -c sam_station v6_0_1_17 -q GCC-3.1 ups undeclare -c 
sam_gsi_config -q vdt ups declare -c sam_gsi_config v2_2_8 one ten one x 9 ten for N in 2 3 4 5 6 7 8 9 10 ; do echo ${N} ./sam_test_py minos dev st-ten done date; for N in 1 2 3 4 5 6 7 8 9 10 ; do echo ${N} ./sam_test_py minos dev st-onesmall ./sam_test_py minos dev st-ten ./sam_test_py minos dev st-cen done ; date Sat May 17 16:39:22 CDT 2008 ... Sat May 17 17:04:42 CDT 2008 Prepared production station for fallback MINOS26 > export SAM_STATION=minos BADPROJ=`sam dump station --projects \ | grep project \ | cut -f 2 -d ' ' \ | cut -f 1 -d '(' ` gemma3-Cedar-near-all-sntp-2008-3-w2muondrift-2008-05-15-18-30-19.303301000-0500 gemma3-Cedar-near-all-sntp-2008-3-w1muondrift-2008-05-15-18-30-27.506402000-0500 gemma3-Cedar-near-all-sntp-2008-4-w1muondrift-2008-05-15-18-45-19.588175000-0500 gemma3-Cedar-near-all-sntp-2008-4-w2muondrift-2008-05-15-18-45-22.696954000-0500 gemma3-Cedar-near-all-sntp-2008-4-w3muondrift-2008-05-15-18-55-23.363232000-0500 gemma3-Cedar-near-all-sntp-2008-4-w4muondrift-2008-05-15-18-55-28.628555000-0500 gemma-Cedar-far-all-sntp-2008-05-11muondrift-2008-05-16-05-41-33.678666000-0500 gemma-Cedar-near-all-sntp-2008-05-11muondrift-2008-05-16-05-43-24.221559000-0500 gemma-Cedar-far-all-sntp-2008-03-w3muondrift-2008-05-16-12-10-21.330785000-0500 gemma-Cedar-far-all-sntp-2008-03-w1muondrift-2008-05-16-12-10-21.525181000-0500 gemma-Cedar-far-all-sntp-2008-03-w2muondrift-2008-05-16-12-10-25.796463000-0500 gemma-Cedar-far-all-sntp-2008-03-w4muondrift-2008-05-16-12-10-29.350059000-0500 gemma-Cedar-far-all-sntp-2008-04-w2muondrift-2008-05-16-12-10-29.397919000-0500 gemma-Cedar-far-all-sntp-2008-04-w1muondrift-2008-05-16-12-10-29.411734000-0500 gemma-Cedar-far-all-sntp-2008-04-w3muondrift-2008-05-16-12-19-29.526171000-0500 gemma-Cedar-far-all-sntp-2008-04-w4muondrift-2008-05-16-12-19-31.759466000-0500 gemma-Cedar-far-all-sntp-2008-05-05muondrift-2008-05-17-05-40-45.360945000-0500 gemma-Cedar-far-all-sntp-2008-05-07muondrift-2008-05-17-05-41-04.215081000-0500 gemma-Cedar-far-all-sntp-2008-05-09muondrift-2008-05-17-05-41-39.177638000-0500 gemma-Cedar-far-all-sntp-2008-05-11muondrift-2008-05-17-05-41-51.113199000-0500 gemma-Cedar-near-all-sntp-2008-04-19muondrift-2008-05-17-05-42-09.456687000-0500 gemma-Cedar-near-all-sntp-2008-04-18muondrift-2008-05-17-05-42-29.405132000-0500 gemma-Cedar-near-all-sntp-2008-05-04muondrift-2008-05-17-05-43-05.304878000-0500 for PROJ in ${BADPROJ} ; do sleep 2 ; sam stop project --force --project=${PROJ} ; done 16:54 shutdown minos prd for downgrade ups declare -c sam_cp_config v7_0 ups declare -c sam_station v6_0_1_17 -q GCC-3.1 ups undeclare -c sam_gsi_config -q vdt ups declare -c sam_gsi_config v2_2_8 16:55 - started downgraded station Created st-ten and st-cen in production, as above DatasetDefinition saved with definitionId = 5002 DatasetDefinition saved with definitionId = 5004 17:09 Restarted prd station, accidentally started as _dev date; for N in 1 2 3 4 5 6 7 8 9 10 ; do echo ${N} ./sam_test_py minos prd st-onesmall ./sam_test_py minos prd st-ten ./sam_test_py minos prd st-cen done ; date Sat May 17 17:10:56 CDT 2008 ... Sat May 17 17:33:08 CDT 2008 ####### # SAM # ####### Date: Sat, 17 May 2008 10:28:13 +0100 From: Gemma Tinti ????? I see again an intermittent odd behaviour in SAM. It looks like the same problem we had before, some jobs just stay there as they are still waiting for some files to be delivered. 
------------------------------------------------------------ Date: Sat, 17 May 2008 13:17:46 +0000 (UTC) From: Arthur Kreymer I was waiting for a time when the SAM station was idle, to restart it with larger virtual disks, to prevent this problem in the future. Since it is failing again, I have done the restart now, around 08:15 CDT. This should prevent future failures of this sort. ------------------------------------------------------------ Date: Sat, 17 May 2008 13:59:48 +0000 (UTC) From: Arthur Kreymer Restarted station after gemma reported file deliver errors again. Disks are larger now : MINOS26 > sam dump station --disks *** BEGIN DUMP STATION minos version v6_0_5_24_srm running at minos-sam01.fnal.gov 9 days 20 hours 18 minutes 57 seconds, admins: buckley kreymer rhatcher sam Replica selection: prefer (enstore), avoid (empty) There are 0 authorized transfer groups Full delivery unit is enforced; external deliveries are constrained to dcap://minos-01 dcap://minos-02 Excess consumer satisfaction: 0 STATION DISKS: disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 1285777209B/52428800KB = 2.4% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 205357411B/52428800KB = 0.4% free station disk total: 1491134620B/104857600KB = 1.4% free *** END OF STATION DUMP *** MINOS26 > sam dump station --disks *** BEGIN DUMP STATION minos version v6_0_5_24_srm running at minos-sam01.fnal.gov 11 seconds, admins: buckley kreymer rhatcher sam Replica selection: prefer (enstore), avoid (empty) There are 0 authorized transfer groups Full delivery unit is enforced; external deliveries are constrained to dcap://minos-01 dcap://minos-02 Excess consumer satisfaction: 0 STATION DISKS: disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 849026969KB/900200128KB = 94.3% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 847971872KB/900200128KB = 94.2% free station disk total: 1696998842KB/1800400256KB = 94.3% free *** END OF STATION DUMP *** But a test of a 400+ file project failed, MINOS26 > sam_test_py minos ${UNIV} evansj-CC0325-RunI-L010z185-ND-Data OK running station minos dbserver prd dataset evansj-CC0325-RunI-L010z185-ND-Data project sam_test_project_20080517131921 fileCut 0 cid 8154 cpid 29777 job SAMStation.JobCount(jobsAtNode=1, jobsAll=1) Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2005-06/N00008019_ 0002.spill.sntp.cedar_phy_bhcurv.0.root file 1 Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008083_ 0000.spill.sntp.cedar_phy_bhcurv.0.root file 2 ... Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008143_ 0000.spill.sntp.cedar_phy_bhcurv.0.root file 18 RetryHandler.getNextFile(29777L)> initial retriable exception TRANSIENT('CORBA.TRANSIENT(omniORB.TRANSIENT_CallTimedout, CORBA.COMPLETED_MAYBE)') RetryHandler.getNextFile(29777L)> will retry in 1.95 seconds Traceback (most recent call last): File "./sam_test_py", line 162, in ? 
c = 'TRUE' File "sam_common_pylib/SamCommand/BlessedCommandInterfacePlaceHolder.py", line 81, in __call__ File "sam_common_pylib/SamCommand/CommandInterface.py", line 251, in __call__ File "sam_common_pylib/SamCommand/SamCommandInterface.py", line 240, in apiWrapper File "sam_user_pyapi/src/samConsumer.py", line 752, in implementation File "sam_common_pylib/SamCorba/SamServerProxy.py", line 230, in _callRemoteMethod File "sam_common_pylib/SamCorba/SamServerProxyRetryHandler.py", line 266, in handleCall PreviousFileNotReleased: Previous file not released, CPID: 29777 MINOS26 > sam list files --dim='__set__ evansj-CC0325-RunI-L010z185-ND-Data' Files: N00007787_0000.spill.sntp.cedar_phy_bhcurv.0.root N00007799_0002.spill.sntp.cedar_phy_bhcurv.0.root ... File Count: 445 Average File Size: 680.52MB Total File Size: 295.73GB Total Event Count: 338837208 ------------------------------------------------------------ Date: Sat, 17 May 2008 14:01:26 +0000 (UTC) From: Arthur Kreymer To: Gemma Tinti Cc: minos_sam_admin@fnal.gov, minos_software_discussion@fnal.gov Subject: RE: SAM behaviour The Minos SAM station is still having problems, even after the restart. I will work on it this afternoon. Meanwhile, please stand by ( do not run SAM projects ) ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ######## # FARM # ######## Ran loopCPB ( -w mrnt and sntp ) to clean out remaining mrnt/sntp in WRITE Ran loopCPB ( -b 1000 and without -w ) to catch up on mrnt/sntp now that the farm is writing cand's directly ( loopCPB0/1 cannot compete with the direct farm write, and was falling behind. ) ============================================================================= 2008 05 16 ============================================================================= ############### # MINOS_SAM03 # ############### Date: Fri, 16 May 2008 15:49:43 -0500 (CDT) Subject: HelpDesk ticket 115813 ___________________________________________ Short Description: Account request on minos-sam03 Problem Description: Please enable two more accounts on minos-sam03 minfarm - as exists on fnpcsrv1 - for Farm I/O operations /home/minfarm login area. uid=10871(minfarm) gid=5111(e875) .k5login can be copied from mindata@minos26 initially minsoft - as exists on minos-mysql1, for testing Mysql5, and to test migration to the new minos-mysql1 hardware Minos-sam03 may end up being the Farm mysql server. /home/minsoft login area .k5login can be copied from mindata@minos26 initially Please give minos-sam03 root access to kreymer, so that we can fully test the Mysql operation. ___________________________________________ ######## # DATA # ######## PLANNING - need minfarm account on minos-sam03 to keep the I/O load off fnpcsrv1 DATA=reco_near/cedar_phy_bhcurv/sntp_data AFSS/dc2nfs.20080118 -n -d ${DATA}/2006-07 # need 7/15 NDIRS=`ls /pnfs/minos/${DATA}` AFSS/dc2nfs -d ${DATA} 2>&1 | \ tee /minos/scratch/log/dc2nfs/cpbnear.log ############ # MCIMPORT # ############ Per tests on mindata@minos26, it seems that we do not need SRM_CONFIG, and we are impervious to default proxies like /tmp/x509up_u3648 if we do export X509_USER_PROXY=/home/mindata/.grid/kreymer-doe.proxy We had been hosed by : -rw------- 1 mindata e875 5071 May 6 11:14 /tmp/x509up_u3648 $ cp -a AFSS/mcimport.20080516 . 
$ ln -sf mcimport mcimport.20080516 # was mcimport.20080211 ####### # SRM # ####### Rubin reports problems with srm on worker nodes. srmcp is working on fpcserv1 / roundup But my manual test in mindata@minos26 fails MDS3 > srmls ${SPATH2} SRMClientV2 : put: try # 0 failed with error SRMClientV2 : ; nested exception is: org.globus.common.ChainedIOException: Authentication failed [Caused by: Defective credential detected [Caused by: [JGLOBUS-96] Certificate "DC=org,DC=DOEGrids,OU=Certificate Authorities,CN=DOEGrids CA 1" expired]] Renewed the proxy in /local/scratch26/kreymer/grid cd /local/scratch26/kreymer/grid . /minos/scratch/kreymer/VDT/setup.sh voms-proxy-init \ -voms fermilab:/fermilab/minos \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymer-doe.proxy \ -valid 10000:0 Testing minfarm@fnpcsrv1 cd /local/globus/minfarm/.grid scp kreymer@minos26:/local/scratch26/kreymer/grid/kreymer-doe.proxy . export SRM_CONFIG=/export/stage/minfarm/.srmconfig/kreymer.xml export SRM_CONFIG=/local/globus/minfarm/.srmconfig/kreymer.xml SRV1> cp ax /export/stage/minfarm/.srmconfig .srmconfig nedit /local/globus/minfarm/.srmconfig/kreymer.xml use kreymer-doe.proxy from /local/globus/minfarm/.grid Looks good, try rubin, MIN > ssh -l rubin fnpcsrv1 fnpcsrv1% date Fri May 16 14:22:16 CDT 2008 fnpcsrv1% source /usr/local/vdt/setup.csh fnpcsrv1% set SPATH2='srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/cand_data/310/n13023109_0003_L010185N_D00.cand.cedar_phy_brev.root' fnpcsrv1% srmls "${SPATH2}" 542849643 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/cand_data/310/n13023109_0003_L010185N_D00.cand.cedar_phy_brev.root fnpcsrv1% srmls -debug "${SPATH2}" Storage Resource Manager (SRM) CP Client version 1.25 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory SRM Configuration: debug=true gsissl=true help=false pushmode=false userproxy=true buffer_size=131072 tcp_buffer_size=0 streams_num=10 config_file=config.xml glue_mapfile=conf/SRMServerV1.map webservice_path=srm/managerv2 webservice_protocol=https gsiftpclinet=globus-url-copy protocols_list=http,gsiftp save_config_file=null srmcphome=.. urlcopy=sbin/urlcopy.sh x509_user_cert=/home/timur/k5-ca-proxy.pem x509_user_key=/home/timur/k5-ca-proxy.pem x509_user_proxy=/tmp/x509up_u1334 x509_user_trusted_certificates=/usr/local/vdt-1.8.1/globus/TRUSTED_CA globus_tcp_port_range=null gss_expected_name=null storagetype=permanent retry_num=20 retry_timeout=10000 wsdl_url=null use_urlcopy_script=false connect_to_wsdl=false delegate=true full_delegation=true server_mode=passive srm_protocol_version=2 request_lifetime=86400 priority=0 action is ls recursion depth=1 is long listing mode=false surl[0]=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/cand_data/310/n13023109_0003_L010185N_D00.cand.cedar_phy_brev.root Fri May 16 14:25:21 CDT 2008: In SRMClient ExpectedName: host Fri May 16 14:25:21 CDT 2008: SRMClient(https,srm/managerv2,true) SRMClientV2 : user credentials are: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. 
Kreymer/UID=kreymer SRMClientV2 : connecting to srm at httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : srmLs, contacting service httpg://stkendca2a.fnal.gov:8443/srm/managerv2 542849643 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/cand_data/310/n13023109_0003_L010185N_D00.cand.cedar_phy_brev.root fnpcsrv1% srmls -version Storage Resource Manager (SRM) CP Client version 1.25 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory srm client error: No URL(s) specified Try, and fail, with V2 SRM. fnpcsrv1% source /usr/local/grid/setup.csh fnpcsrv1% srmls "${SPATH2}" [main] ERROR client.Call - Exception: org.xml.sax.SAXException: Invalid element in org.dcache.srm.v2_2.TMetaDataPathDetail - surl at org.apache.axis.encoding.ser.BeanDeserializer.onStartChild(BeanDeserializer.java:258) at org.apache.axis.encoding.DeserializationContext.startElement(DeserializationContext.java:1035) at org.apache.axis.message.SAX2EventRecorder.replay(SAX2EventRecorder.java:165) at org.apache.axis.message.MessageElement.publishToHandler(MessageElement.java:1141) at org.apache.axis.message.RPCElement.deserialize(RPCElement.java:236) at org.apache.axis.message.RPCElement.getParams(RPCElement.java:384) at org.apache.axis.client.Call.invoke(Call.java:2467) at org.apache.axis.client.Call.invoke(Call.java:2366) at org.apache.axis.client.Call.invoke(Call.java:1812) at org.dcache.srm.v2_2.SrmSoapBindingStub.srmLs(SrmSoapBindingStub.java:2089) at org.dcache.srm.client.SRMClientV2.srmLs(SRMClientV2.java:575) at gov.fnal.srm.util.SRMLsClientV2.start(SRMLsClientV2.java:136) at gov.fnal.srm.util.SRMDispatcher.work(SRMDispatcher.java:779) at gov.fnal.srm.util.SRMDispatcher.main(SRMDispatcher.java:372) SRMClientV2 : put: try # 0 failed with error SRMClientV2 : ; nested exception is: org.xml.sax.SAXException: Invalid element in org.dcache.srm.v2_2.TMetaDataPathDetail - surl SRMClientV2 : put: try again fnpcsrv1% srmls -version Storage Resource Manager (SRM) CP Client version 2.0 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory srm client error: No URL(s) specified java.lang.IllegalArgumentException: No URL(s) specified at gov.fnal.srm.util.SRMDispatcher.checkURLSUniformity(SRMDispatcher.java:786) at gov.fnal.srm.util.SRMDispatcher.work(SRMDispatcher.java:463) at gov.fnal.srm.util.SRMDispatcher.main(SRMDispatcher.java:372) fnpcsrv1% srmls -version Storage Resource Manager (SRM) CP Client version 1.25 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory srm client error: No URL(s) specified ######## # FARM # ######## Cleanup interrupted copy of FILE=n13037749_0025_L010185N_D04.cand.cedar_phy_bhcurv.0.root SRV1> dds /minos/data/minfarm/WRITE/${FILE} -rw-rw-r-- 1 minospro numi 573160605 May 1 09:59 /minos/data/minfarm/WRITE/n13037749_0025_L010185N_D04.cand.cedar_phy_bhcurv.0.root SRV1> dds /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/774//${FILE} -rw-r--r-- 1 rubin numi 0 May 15 21:05 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/774//n13037749_0025_L010185N_D04.cand.cedar_phy_bhcurv.0.root RUBIN > rm /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/774//${FILE} Things seem to be running OK now. 
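The check behind this sort of cleanup can be scripted ; a minimal sketch,
not part of roundup itself, using the file from the case above :

FILE=n13037749_0025_L010185N_D04.cand.cedar_phy_bhcurv.0.root
WCOPY=/minos/data/minfarm/WRITE/${FILE}
PCOPY=/pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/774/${FILE}
# compare sizes, and only remove the PNFS entry if it is the zero length stub
WSIZ=`stat -c %s ${WCOPY}`
PSIZ=`stat -c %s ${PCOPY}`
echo "WRITE ${WSIZ}  PNFS ${PSIZ}"
[ "${PSIZ}" = "0" ] && rm ${PCOPY}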
######## # FARM # ######## URK - in cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.1.log occasional messages like SRV1> grep cat: cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.1.log cat: /export/stage/minfarm/ROUNDUP/READ/n13037409_0008_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037412_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037426_0009_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037428_0023_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037429_0016_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037430_0011_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory SRV1> grep cat: cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.0.log cat: /export/stage/minfarm/ROUNDUP/READ/n13037476_0001_L010185N_D04.cand.cedar_phy_bhcurv.0.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037737_0004_L010185N_D04.cand.cedar_phy_bhcurv.0.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037743_0003_L010185N_D04.cand.cedar_phy_bhcurv.0.root: No such file or directory cat: /export/stage/minfarm/ROUNDUP/READ/n13037749_0025_L010185N_D04.cand.cedar_phy_bhcurv.0.root: No such file or directory SRV1> grep cat: cedar_phyfarF00037835_0008.log cat: /export/stage/minfarm/ROUNDUP/READ/F00037835_0008.all.cand.cedar_phy.0.root: No such file or directory SRV1> grep cat: ../2008-04/*.log SRV1> grep cat: ../2008-03/*.log ../2008-03/cedar_phy_bhcurvnear.log:cat: /export/stage/minfarm/ROUNDUP/READ/N00011059_0000.spill.mrnt.cedar_phy_bhcurv.0.root: No such file or directory SRV1> grep cat: ../2008-02/*.log SRV1> grep cat: ../2008-01/*.log Specifically, recently, SRMCP 56/100 -streams_num=1 -server_mode=active -protocols=gsiftp file:///n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L01018 5N/cand_data/743 PURGE FARM n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root cat: /export/stage/minfarm/ROUNDUP/READ/n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root: No such file or directory for REA in `cat ${ROUNTMP}/${CAT}READ/${FILE}` ; do ${ECHO} rm -f ${INDIR}/${REA} ; done ls -l /export/stage/minfarm/ROUNDUP/READ/SAM/n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root SRV1> ls -l /export/stage/minfarm/ROUNDUP/READ/SAM/n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 minfarm numi 57 May 14 00:09 /export/stage/minfarm/ROUNDUP/READ/SAM/n13037432_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root OK this makes sense, saddmc is moving the READ file to READ/SAM ============================================================================= 2008 05 15 ============================================================================= ########## # DCACHE # ########## Power out to DCache/Enstore and Oracle around 21:10 tonight. Stopped predator cronjob on minos26, and loopCP0/1 on fnpcsrv1 ########## # CONDOR # ########## Per man page , can use some variables in the .run files : X509USERPROXY = /local/scratch25/$ENV(LOGNAME)/grid/$ENV(LOGNAME).proxy Updated glidefile.run, glide.run This works !! 
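For reference, a minimal submit file fragment using the $ENV() macro.
The executable and log file names are placeholders, not the actual
glide.run contents :

Universe      = vanilla
Executable    = probe.sh
X509USERPROXY = /local/scratch25/$ENV(LOGNAME)/grid/$ENV(LOGNAME).proxy
Log           = probe.log
Output        = probe.out
Error         = probe.err
Queue

$ENV(LOGNAME) is expanded from the submitter's environment at
condor_submit time, so each user picks up their own proxy path.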
########## # CONDOR # ########## HOWTO condor updated with cert registration instructions, per Date: Fri, 09 May 2008 10:00:37 -0500 (CDT) From: HelpDesk Solution: yocum@fnal.gov sent this solution: Hi Art, Yesterday I re-enabled email notification on the fermilab VOMRS server, so I am now in a position to give you the correct method of performing the action you request, yourself. The users can and should add their own Robot certificates to their membership in the fermilab VO per the instructions I sent you last Monday, with the following addition (see 2a, 2b, 2c): 1) Load your KCA certificate (current, not expired!) and visit this URL: https://vomrs.fnal.gov:8443/vomrs/vo-fermilab/vomrs 2) Click on the [+] next to the "Members" 2a) Click on "Change Email Address" 2b) Enter your last name and "Search" 2c) Enter your correct email address and "Submit" 3) Click on the [+] next to the "Certificates" 4) Click on "Add certificate" 5) Enter your last name and "Search" 6) Enter your 'new' DN in the New DN field, and select the Fermi KCA from the pull-down list in the "New CA" list. 7) Enter some text in the "Reason" field and click "Submit" Next, the members representative (You!) will receive an email from VOMRS requesting you to approve the addition of the DN. The email will contain a handy link for you to click on to get to the right page. I should note, that when the new DN format is implemented the users will NOT need to add this DN, we'll do this for them automatically. Cheers, Dan ########## # CONDOR # ########## Removed from all .run files JOBLEASEDURATION = 1000000 ######## # FARM # ######## Request for test job submission of farm jobs, looking into it Last minfarm jobs were 9 April This was an old vanilla test. SRV1> condor_history -l 1780232.0> /tmp/chl Iwd = "/home/minfarm/bckhousetest/test-scripts" Cmd = "/home/minfarm/bckhousetest/test-scripts/reco_far_cosmic_daikon04_base_dogwoodtest0.sh" UserLog = "/home/minfarm/bckhousetest/restructure-test/test-results/dogwoodtest1/reco_far_cosmic_daikon04_base_dogwoodtest0/log.txt" LastRemoteHost = "slot1@fnpc225.fnal.gov" JobUniverse = 5 http://www.cs.wisc.edu/condor/manual/v6.6/2_5Submitting_Job.html JobUniverse : Integer which indicates the job universe, where 1 = Standard, 4 = PVM, 5 = Vanilla, 7 = Scheduler, 8 = MPI, 9 = Globus, and 10 = Java. Testing a clone of probe, copied to\ /minos/data/minfarm/probe Inspired by /minos/data/minfarm/condor_submit SRV1> condor_submit probe.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 1856378. 000 (1856378.000.000) 05/15 15:10:11 Job submitted from host: <131.225.167.44:63082> ... 012 (1856378.000.000) 05/15 15:10:14 Job was held. Failed to initialize GAHP Code 0 Subcode 0 ... 
SRV1> condor_rm 1856378.0 Job 1856378.0 marked for removal ######## # FARM # ######## Made safer copies of the crontab.dat files SRV1> cp -a crontab.dat.20071226 AFSS/crontab.minfarm.20071226 SRV1> cp -a crontab.dat.20070503 AFSS/crontab.minfarm.20070503 SRV1> cp -a crontab.dat.20060829 AFSS/crontab.minfarm.20060829 SRV1> rm crontab.dat.* Added farmgsum_log 15 00 * * * ${HOME}/scripts/farmgsum_log SRV1> cp crontab.dat AFSS/crontab.minfarm.20080515 ============================================================================= 2008 05 14 ============================================================================= ########## # DCACHE # ########## Date: Wed, 14 May 2008 15:45:17 -0500 (CDT) Subject: HelpDesk ticket 115648 ___________________________________________ Short Description: Recent FTP Tranfers - web page is empty Problem Description: The Recent FTP Transfers web page is empty, at http://fndca3a.fnal.gov/cgi-bin/dcache_files.py ___________________________________________ ############ # MCIMPORT # ############ MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N 2030288 /minos/data/mcimport/STAGE/daikon_04/L010185N 126550 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 368713 /minos/data/mcimport/STAGE/daikon_04/L010185N_helium 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N 27834 /minos/data/mcimport/STAGE/daikon_04/L250200N Per rhatcher, will probably next safely archive 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 126550 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 368713 /minos/data/mcimport/STAGE/daikon_04/L010185N_helium ####### # SAM # ####### export SAM_ORACLE_CONNECT="samdbs/" export SAM_STATION=minos setup sam -q dev sam dump station --disks disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 401805604B/52428800KB = 0.7% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 393714515B/52428800KB = 0.7% free Typical cdf-caf disk is disk 719 dcap://cdfcaf-door1:dcap://cdfdca1.fnal.gov:25125/pnfs/fnal.gov/usr, 69464633B/209715200KB = 0% free Typical cdf-cnaf disk is disk 533 cdfsam.cnaf.infn.it:/cdf/data/gpfs01/sam/cache1, 1518181B/2252800000KB = 0% free MINOS26 > samadmin resize station disk \ --mountPoint=dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr \ --size=900200100KB Disk size for dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr changed to 858.50GB samadmin resize station disk \ --mountPoint=dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr \ --size=900200100KB Disk size for dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr changed to 858.50GB Restarted the station, disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 848163716KB/900200128KB = 94.2% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 848155814KB/900200128KB = 94.2% free ./sam_test_py minos dev Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root file 1 Decrementing the job count. 
Stopping the project PRODUCTION setup sam -q prd disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 1417330916B/52428800KB = 2.6% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 575415386B/52428800KB = 1.1% free samadmin resize station disk \ --mountPoint=dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr \ --size=900200100KB Disk size for dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr changed to 858.50GB samadmin resize station disk \ --mountPoint=dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr \ --size=900200100KB Disk size for dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr changed to 858.50GB Waiting for an idle time for station restart gemma has projects running, mostly started around 05:40 ######## # DATA # ######## per 2008 04 29 plan of action : minfarm@fnpcsrv1 DATA=reco_near/cedar_phy_bhcurv/sntp_data NDIRS=`ls /pnfs/minos/${DATA}` AFSS/dc2nfs.20080118 -n -d ${DATA}/2007-04 NEEDED 0/45 reco_near/cedar_phy_bhcurv/sntp_data/2007-04 204G /minos/data/reco_near/cedar_phy_bhcurv/sntp_data SRV1> AFSS/dc2nfs.20080118 -n -d ${DATA} | grep NEEDED NEEDED 0/0 reco_near/cedar_phy_bhcurv/sntp_data/2005-03 NEEDED 67/67 reco_near/cedar_phy_bhcurv/sntp_data/2005-04 NEEDED 98/98 reco_near/cedar_phy_bhcurv/sntp_data/2005-05 NEEDED 65/65 reco_near/cedar_phy_bhcurv/sntp_data/2005-06 NEEDED 54/56 reco_near/cedar_phy_bhcurv/sntp_data/2005-07 NEEDED 1/55 reco_near/cedar_phy_bhcurv/sntp_data/2005-08 NEEDED 55/55 reco_near/cedar_phy_bhcurv/sntp_data/2005-09 NEEDED 33/33 reco_near/cedar_phy_bhcurv/sntp_data/2005-10 NEEDED 46/46 reco_near/cedar_phy_bhcurv/sntp_data/2005-11 NEEDED 42/42 reco_near/cedar_phy_bhcurv/sntp_data/2005-12 NEEDED 52/52 reco_near/cedar_phy_bhcurv/sntp_data/2006-01 NEEDED 50/50 reco_near/cedar_phy_bhcurv/sntp_data/2006-02 NEEDED 0/0 reco_near/cedar_phy_bhcurv/sntp_data/2006-03 NEEDED 0/0 reco_near/cedar_phy_bhcurv/sntp_data/2006-04 NEEDED 0/0 reco_near/cedar_phy_bhcurv/sntp_data/2006-05 NEEDED 73/74 reco_near/cedar_phy_bhcurv/sntp_data/2006-06 NEEDED 7/15 reco_near/cedar_phy_bhcurv/sntp_data/2006-07 NEEDED 41/41 reco_near/cedar_phy_bhcurv/sntp_data/2006-08 NEEDED 36/36 reco_near/cedar_phy_bhcurv/sntp_data/2006-09 NEEDED 49/51 reco_near/cedar_phy_bhcurv/sntp_data/2006-10 NEEDED 36/36 reco_near/cedar_phy_bhcurv/sntp_data/2006-11 NEEDED 39/39 reco_near/cedar_phy_bhcurv/sntp_data/2006-12 NEEDED 51/51 reco_near/cedar_phy_bhcurv/sntp_data/2007-01 NEEDED 52/52 reco_near/cedar_phy_bhcurv/sntp_data/2007-02 NEEDED 50/51 reco_near/cedar_phy_bhcurv/sntp_data/2007-03 NEEDED 0/45 reco_near/cedar_phy_bhcurv/sntp_data/2007-04 NEEDED 0/48 reco_near/cedar_phy_bhcurv/sntp_data/2007-05 NEEDED 0/42 reco_near/cedar_phy_bhcurv/sntp_data/2007-06 NEEDED 0/32 reco_near/cedar_phy_bhcurv/sntp_data/2007-07 SRV1> AFSS/dc2nfs.20080118 -n -d reco_far/cedar_phy_bhcurv/sntp_data | grep NEEDED NEEDED 0/59 reco_far/cedar_phy_bhcurv/sntp_data/2003-07 NEEDED 0/140 reco_far/cedar_phy_bhcurv/sntp_data/2003-08 NEEDED 0/164 reco_far/cedar_phy_bhcurv/sntp_data/2003-09 NEEDED 0/199 reco_far/cedar_phy_bhcurv/sntp_data/2003-10 NEEDED 0/166 reco_far/cedar_phy_bhcurv/sntp_data/2003-11 NEEDED 0/135 reco_far/cedar_phy_bhcurv/sntp_data/2003-12 NEEDED 0/118 reco_far/cedar_phy_bhcurv/sntp_data/2004-01 NEEDED 0/108 reco_far/cedar_phy_bhcurv/sntp_data/2004-02 NEEDED 0/137 reco_far/cedar_phy_bhcurv/sntp_data/2004-03 NEEDED 0/152 reco_far/cedar_phy_bhcurv/sntp_data/2004-04 NEEDED 0/119 reco_far/cedar_phy_bhcurv/sntp_data/2004-05 NEEDED 
0/102 reco_far/cedar_phy_bhcurv/sntp_data/2004-06 NEEDED 0/103 reco_far/cedar_phy_bhcurv/sntp_data/2004-07 NEEDED 0/111 reco_far/cedar_phy_bhcurv/sntp_data/2004-08 NEEDED 0/112 reco_far/cedar_phy_bhcurv/sntp_data/2004-09 NEEDED 0/102 reco_far/cedar_phy_bhcurv/sntp_data/2004-10 NEEDED 0/115 reco_far/cedar_phy_bhcurv/sntp_data/2004-11 NEEDED 0/105 reco_far/cedar_phy_bhcurv/sntp_data/2004-12 NEEDED 0/105 reco_far/cedar_phy_bhcurv/sntp_data/2005-01 NEEDED 0/95 reco_far/cedar_phy_bhcurv/sntp_data/2005-02 NEEDED 0/112 reco_far/cedar_phy_bhcurv/sntp_data/2005-03 NEEDED 0/214 reco_far/cedar_phy_bhcurv/sntp_data/2005-04 NEEDED 0/220 reco_far/cedar_phy_bhcurv/sntp_data/2005-05 NEEDED 0/210 reco_far/cedar_phy_bhcurv/sntp_data/2005-06 NEEDED 0/216 reco_far/cedar_phy_bhcurv/sntp_data/2005-07 NEEDED 0/151 reco_far/cedar_phy_bhcurv/sntp_data/2005-08 NEEDED 0/83 reco_far/cedar_phy_bhcurv/sntp_data/2005-09 NEEDED 0/111 reco_far/cedar_phy_bhcurv/sntp_data/2005-10 NEEDED 0/98 reco_far/cedar_phy_bhcurv/sntp_data/2005-11 NEEDED 0/102 reco_far/cedar_phy_bhcurv/sntp_data/2005-12 NEEDED 0/104 reco_far/cedar_phy_bhcurv/sntp_data/2006-01 NEEDED 0/70 reco_far/cedar_phy_bhcurv/sntp_data/2006-02 NEEDED 0/102 reco_far/cedar_phy_bhcurv/sntp_data/2006-03 NEEDED 0/43 reco_far/cedar_phy_bhcurv/sntp_data/2006-04 NEEDED 0/68 reco_far/cedar_phy_bhcurv/sntp_data/2006-05 NEEDED 0/84 reco_far/cedar_phy_bhcurv/sntp_data/2006-06 NEEDED 0/110 reco_far/cedar_phy_bhcurv/sntp_data/2006-07 NEEDED 0/98 reco_far/cedar_phy_bhcurv/sntp_data/2006-08 NEEDED 0/78 reco_far/cedar_phy_bhcurv/sntp_data/2006-09 NEEDED 0/106 reco_far/cedar_phy_bhcurv/sntp_data/2006-10 NEEDED 0/97 reco_far/cedar_phy_bhcurv/sntp_data/2006-11 NEEDED 0/68 reco_far/cedar_phy_bhcurv/sntp_data/2006-12 NEEDED 0/90 reco_far/cedar_phy_bhcurv/sntp_data/2007-01 NEEDED 0/94 reco_far/cedar_phy_bhcurv/sntp_data/2007-02 NEEDED 0/80 reco_far/cedar_phy_bhcurv/sntp_data/2007-03 NEEDED 0/72 reco_far/cedar_phy_bhcurv/sntp_data/2007-04 NEEDED 0/66 reco_far/cedar_phy_bhcurv/sntp_data/2007-05 NEEDED 0/72 reco_far/cedar_phy_bhcurv/sntp_data/2007-06 NEEDED 0/88 reco_far/cedar_phy_bhcurv/sntp_data/2007-07 803G /minos/data/reco_far/cedar_phy_bhcurv/sntp_data Need over 1 TB free space before doing near catchup, wait till we have 2. AFSS/dc2nfs -d ${DATA} 2>&1 | \ tee /minos/scratch/log/dc2nfs/cpbnear.log ########## # DCACHE # ########## Date: Wed, 14 May 2008 14:36:26 +0000 (UTC) From: Arthur Kreymer To: jdejong@fnal.gov Cc: minos-data@fnal.gov Subject: jdejong jobs overloading raw data DCache pools I have been seeing timeouts for the last couple of days in the jobs which declare raw data files to SAM. This is apparently due to an overload of the DCache pools by jdejong jobs running under LSF. These jobs are collectively holding dozens of raw data files open, which is beyond the capacity of these pools. The net effect is to delay other users of raw data, including most of your own jobs. Please adjust your jobs to take a local copy of the files before processing. Thanks ! To see the overload, look at the w-stkendca9a-3 line at the web page http://fndca.fnal.gov:2288/queueInfo ==================================================================== Date: Wed, 14 May 2008 13:07:13 -0500 (CDT) From: Jeff K deJong Can you indentify for me which are the jobs that are the problem, are they the jobs in the 12hr queue? If it is the 12hr job then each job that is running is holding 1 dcache file. 
I'll modify them shortly so that at the start of each job the file is copied from dcache to the local directory Sorry for the problems ==================================================================== Thanks ! You can see the dcache connections at http://fndca3a.fnal.gov/dcache/DOORS.html You can get a text listing of your connections with curl http://fndca3a.fnal.gov/dcache/DOORS.html 2>&1 | grep Dejong ######## # FARM # ######## CP far all done except : OK - stream all.sntp.cedar_phy OK - 825 Mbytes in 2 runs PEND - have 17/24 subruns for F00037835_*.all.sntp.cedar_phy.0.root 11 05/02 11:36 0 17 SUPPRESS F00037838_0024.all.sntp.cedar_phy.0.root PEND - have 17/24 subruns for F00037838_*.all.sntp.cedar_phy.0.root 11 05/02 12:02 0 17 OK - stream spill.bntp.cedar_phy OK - 144 Mbytes in 2 runs PEND - have 17/24 subruns for F00037835_*.spill.bntp.cedar_phy.0.root 11 05/02 11:36 0 17 SUPPRESS F00037838_0024.spill.bntp.cedar_phy.0.root PEND - have 16/24 subruns for F00037838_*.spill.bntp.cedar_phy.0.root 11 05/02 12:02 0 16 OK - stream spill.sntp.cedar_phy OK - 97 Mbytes in 2 runs PEND - have 17/24 subruns for F00037835_*.spill.sntp.cedar_phy.0.root 11 05/02 11:36 0 17 SUPPRESS F00037838_0024.spill.sntp.cedar_phy.0.root PEND - have 16/24 subruns for F00037838_*.spill.sntp.cedar_phy.0.root 11 05/02 12:02 0 16 Stopped loopCPF Stopped looper, created loopCPB0, loopCPB1 for pass 0 and 1 files, run them in parallel SRV1> cp AFSS/roundup.20080515 . SRV1> ln -sf roundup.20080515 roundup # was roundup.20080514 SRV1> ./roundup -n -b 100 -p -s "cand.cedar_phy_bhcurv.1" -r cedar_phy_bhcurv mcnear PURGED 100/1600 SRV1> ./roundup -n -b 1600 -w -s "cand.cedar_phy_bhcurv.1" -r cedar_phy_bhcurv mcnear It seems that MOST of WRITE needs to be written to PNFS ( all pass 1 )! So let's run loopCPB1 with -w set SRV1> cat loopCPB1 #!/bin/sh while true ; do ./roundup -c -b 100 -w -s "cand.cedar_phy_bhcurv.1" -r cedar_phy_bhcurv mcnear sleep 1200 done ./loopCPB1 & less cedar_phy_bhcurvmcnearcand.cedar_phy_bhcurv.1.log ============================================================================= 2008 05 13 ============================================================================= ########### # ROUNDUP # ########### herber suggests an inline alternate to invoking MAIN : exec > ... 2> ... 
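That is, rather than defining MAIN and invoking it in the background with
its output redirected, the script can redirect its own stdout/stderr near
the top. A minimal sketch, with a placeholder log path rather than the
real roundup log location :

#!/bin/sh
LOG=/tmp/roundup.$$.log       # placeholder path
exec >> ${LOG} 2>&1           # everything below here goes to ${LOG}
echo "started `date`"
# ... the rest of the script runs in the foreground, so $$ stays the
# real process id for the pidfile ...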
########### # ROUNDUP # ########### Subject: Cron ${HOME}/scripts/corral Date: Tue, 13 May 2008 06:05:01 -0500 From: root@fnpcsrv1.fnal.gov (Cron Daemon) To: rubin@fnal.gov PID TTY TIME CMD need to kill header, like ps -p ${prepid} --no-headers ########### # MONTHLY # ########### DATASETS 5/13 PREDATOR 5/13,14,15 had to retry due to DCache overload, OK on 15th VAULT 5/7 from cron, logs are OK MYSQL 5/19 Mon May 19 09:43:13 CDT 2008 Mon May 19 10:14:42 CDT 2008 crontab.dat - changed vault to run on 4th of the month Renamed all scripts/crontab.dat.* to crontab.minos26.* Predator - B080430_080001.mbeam.root Tue May 13 16:50:30 UTC 2008 OOPS - run_dbu is stuck for 207, killing it Try again, watching 25509 pts/0 S+ 0:04 \_ loon -bq /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/firstlast.C dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/beam_data B080430_080001.mbeam.root Tue May 13 20:06:13 UTC 2008 OOPS - run_dbu is stuck for 207, killing it setup dcap -q unsecured IFILE=B080430_080001.mbeam.root IPATH=minos/beam_data/2008-04 DCPOR=24125 # unsecured DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} cd /local/scratch??/`whoami` dccp -d 4 ${DFILE} TEST.dat MINOS26 > dccp -d 4 ${DFILE} TEST.dat [Tue May 13 15:16:11 2008] Going to open file dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/beam_data/2008-04/B080430_080001.mbeam.root in cache. Connected in 0.00s. not making much progress. I see why , under PolRequestQueue, CellName DomainName Active Max Queued movers w-stkendca9a-3 w-stkendca9a-3Domain 10 12 43 Jeff Dejong is reading large numbers of neardet_data files from LSF, Several beam_data files are open from fnpc341 MINOS25 > condor_q -r | grep fnpc341 113258.6 loiacono 5/13 06:06 0+00:26:09 vm2@7829@fnpc341.fnal.gov 113258.13 loiacono 5/13 06:06 0+00:26:11 vm2@14117@fnpc341.fnal.gov 113258.14 loiacono 5/13 06:06 0+00:26:15 vm2@28520@fnpc341.fnal.gov 113258.15 loiacono 5/13 06:06 0+00:26:15 vm2@11484@fnpc341.fnal.gov 113258.16 loiacono 5/13 06:06 0+00:26:07 vm2@23347@fnpc341.fnal.gov 113333.140 pawloski 5/13 10:09 0+01:25:06 vm2@992@fnpc341.fnal.gov bin/sh /minos/scratch/loiacono/gnumi/prod/condor/condor_job_le150_20060603_1.sh 16 000 \_ /bin/csh -f ./run-gnumi fluka05_le150i000_20060603 36 99999999999 \_ /minos/scratch/loiacono/gnumi/src/gnumi/linux/gnumi ########### # CRONTAB # ########### for YMD in 20050817 20050902 20051005 20060504 20060807 20080402 ; do mv crontab.dat.${YMD} crontab.minos26.${YMD} ; done ########## # CONDOR # ########## Date: Tue, 13 May 2008 10:42:32 -0500 From: Sfiligoi Igor ... The Joint EGEE and OSG Workshop on VO Management in Production Grids will be held June 24, 2008 in conjunction with HPDC 2008. ... 
######## # FARM # ######## farmgsum_log - logs farmgsum to the tee looper keeps on running as desired, 29627 5586119 mcnearcat mcnearcat 2537 1453681 cand.cedar_phy_bhcurv.0.root 6058 3497257 cand.cedar_phy_bhcurv.1.root WRITE 1731 1000277 cand.cedar_phy_bhcurv.1.root set up a looper for CP F SRV1> cp looper loopCPF containing ./roundup -c -b 100 -r cedar_phy far Tue May 13 09:15:10 CDT 2008 PURGING WRITE files 100 PURGED WRITE/F00038316_0006.spill.cand.cedar_phy.0.root ============================================================================= 2008 05 12 ============================================================================= ########## # DCACHE # ########## Sent in high priority helpdesk ticket re lack of CPB candidate writes The predator logs show a backlog of STARTED Sat May 10 02:11:26 2008 40 FILES STARTED Sun May 11 02:11:57 2008 1799 FILES STARTED Mon May 12 02:11:26 2008 1435 FILES all clear STARTED Tue May 13 02:28:34 2008 1408 FILES many pending, 30 thru 1407 cc'd this ticket to rubin Date: Mon, 12 May 2008 23:45:42 -0500 (CDT) Subject: HelpDesk ticket 115493 ___________________________________________ Short Description: some Minos files pending in FNDCA write queues for 1 day Problem Description: dcache-admin We have a large amount of data ( hundreds of GBytes ) waiting to get out of the public write pools onto LTO-3 tape. Some of these files have been been waiting nearly a day. I do not see large backlogs in Enstore. There are modest queues ( net 500 ) on the write pools, but I do not think that explains a 1 day delay. A sample delayed file follows : ============================ PNFS status for /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/7 34/n13037340_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root -rw-r--r-- 1 rubin e875 574165580 May 12 02:03 n13037340_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root LEVEL 2 2,0,0,0.0,0.0 :c=1:4c3ee836;h=yes;l=574165580; w-stkendca12a-6 r-stkendca16a-5 LEVEL 4 ============================ ___________________________________________ Date: Tue, 13 May 2008 16:32:19 +0000 (UTC) From: Arthur Kreymer There was a similar large backlog Saturday night, which cleared during the day Sunday. Our files started moving to tape again Monday night, so we are making some progress again. The file listed below is on tape. I will keep an eye on things for a while, to see whether another backlog develops tomorrow. ___________________________________________ Date: Mon, 07 Jul 2008 14:41:38 -0500 (CDT) Previously assigned to: BERG, DAVID This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. ___________________________________________ Date: Mon, 07 Jul 2008 19:52:11 +0000 (UTC) From: Arthur Kreymer This problem did not recur, as nearly as I can tell. This ticket can be closed out. Thanks ! ########## # CONDOR # ########## MINOS25 > condor_config_val CONDOR_ADMIN fermigrid-root@fnal.gov MINOS25 > condor_config_val -config Configuration source: /etc/condor/condor_config Local configuration source: /opt/condor/local/condor_config.local MINOS25 > condor_config_val -set "CONDOR_ADMIN=minos-admin@fnal.gov" Attempt to set configuration "CONDOR_ADMIN=minos-admin@fnal.gov" on master minos25.fnal.gov <131.225.193.25:62694> failed. MINOS25 > condor_config_val -name minos01 -set "CONDOR_ADMIN=minos-admin@fnal.gov" Attempt to set configuration "CONDOR_ADMIN=minos-admin@fnal.gov" on master minos01.fnal.gov <131.225.193.1:61561> failed. 
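The -set attempts above are expected to fail unless the pool explicitly
allows remote configuration. Assuming the standard Condor knobs ( not
checked against this pool's config ), the relevant settings can be
inspected with :

condor_config_val ENABLE_RUNTIME_CONFIG
condor_config_val ENABLE_PERSISTENT_CONFIG

With these at their defaults, plus the SETTABLE_ATTRS_* restrictions, the
config files themselves have to be edited, which is what ticket 115902
( 2008 05 19, above ) requests.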
######### # MINOS # ######### Date: Mon, 12 May 2008 12:14:47 -0500 (CDT) Subject: HelpDesk ticket 115465 ___________________________________________ Short Description: ssh to minos05 fails Problem Description: run2-sys : ssh to minos05 has been failing since at least last Friday : MIN > ssh -v minos05 ; date OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos05 [131.225.193.5] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 debug1: identity file /home/kreymer/.ssh/id_dsa type -1 ssh_exchange_identification: Connection closed by remote host Mon May 12 17:10:47 UTC 2008 ___________________________________________ Date: Mon, 12 May 2008 12:59:04 -0500 sshd has been restarted on minos05. Logins via ssh are working again. ___________________________________________ ######## # GRID # ######## du: `/grid/app/minos/VDT/vdt/extract': Permission denied du: `/grid/app/minos/VDT/vdt/backup': Permission denied du: `/grid/app/minos/VDT/vdt/services': Permission denied MINOS26 > du -sm /grid/app/minos 31249 /grid/app/minos MINOS26 > du -sm /grid/app/minos/* 1834 /grid/app/minos/Minossoft 287 /grid/app/minos/VDT 1 /grid/app/minos/bin du: `/grid/app/minos/minfarm/Minossoft/EXTERNAL/mysql-5.0.22/sql/share/japanese-sjis': Permission denied 17376 /grid/app/minos/minfarm 445 /grid/app/minos/parrot 4458 /grid/app/minos/products 56 /grid/app/minos/sam 10 /grid/app/minos/scripts 6787 /grid/app/minos/users MINOS26 > du -sm /grid/app/minos/minfarm/* ########### # ROUNDUP # ########### roundup.20080513 Corrected pid code, to find single Process ID. more exact script match ########### # ROUNDUP # ########### Suppressed printing from non-NOOP PID code, in order to avoid CRON email to Howie SRV1> cp AFSS/roundup.20080512 . SRV1> ln -sf roundup.20080512 roundup # was roundup.20080509 SRV1> ls -l roundup lrwxrwxrwx 1 minfarm numi 16 May 12 12:01 roundup -> roundup.20080512 Date: Mon, 12 May 2008 12:05:04 -0500 From: minfarm To: minos-data@fnal.gov Subject: roundup cedar far 13815 ? stale pidfile on fnpcsrv1 roundup cedar far 13815 ? stale pidfile on fnpcsrv1 OK, that is due to a 2-line pid file, 13815 26952 ? Ss 0:00 /bin/sh /home/minfarm/scripts/corral 26954 ? S 0:09 \_ /bin/bash /home/minfarm/scripts/roundup -c -r cedar far 24529 ? S 0:00 \_ /bin/sh /usr/local/vdt-1.8.1/srm-v1-client/bin/srmcp -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00040860_00 24533 ? S 0:00 \_ /bin/sh /usr/local/vdt-1.8.1/srm-v1-client/sbin/srm -copy -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00 24535 ? Sl 0:12 \_ java -cp /usr/local/vdt-1.8.1/srm-v1-client/lib/srm_client.jar:/usr/local/vdt-1.8.1/srm-v1-client/lib/srm.jar:/usr/loc 23229 pts/10 Ss 0:08 /bin/bash 15383 pts/10 R+ 0:00 \_ ps xf 23131 ? Sl 7413:02 /fnal/ups/prd/mysql/v5_0_22/Linux-2-4/libexec/mysqld --defaults-file=/export/stage/minfarm/my.cnf --basedir=/fnal/ups/prd/mysql/v5_0_ 7968 pts/2 Ss+ 0:06 /bin/bash 7953 pts/0 S 0:00 ksu minfarm 7955 pts/0 S+ 0:00 \_ -bin/tcsh 4544 ? S 0:00 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t 4520 ? Ss 0:00 /bin/sh /home/minfarm/scripts/corral 4522 ? S 0:01 \_ /bin/bash /home/minfarm/scripts/roundup -c -r cedar far 15378 ? S 0:00 \_ /bin/bash /home/minfarm/scripts/roundup -c -r cedar far 15379 ? 
R 0:00 \_ sampy /export/stage/minfarm/ROUNDUP/SAM/current/bin/sam locate F00040863_0018.mdaq.root 15380 ? S 0:00 \_ grep /pnfs/minos 15381 ? S 0:00 \_ cut -f 5 -d / 15382 ? S 0:00 \_ cut -f 1 -d , 32219 pts/10 S 0:11 /bin/bash ./roundup -w -r cedar_phy far 15097 pts/10 S 0:00 \_ /bin/sh /usr/local/vdt-1.8.1/srm-v1-client/bin/srmcp -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00038349_0008.s 15101 pts/10 S 0:00 \_ /bin/sh /usr/local/vdt-1.8.1/srm-v1-client/sbin/srm -copy -streams_num=1 -server_mode=active -protocols=gsiftp file:///F000383 15103 pts/10 Sl 0:09 \_ java -cp /usr/local/vdt-1.8.1/srm-v1-client/lib/srm_client.jar:/usr/local/vdt-1.8.1/srm-v1-client/lib/srm.jar:/usr/local/v SRV1> cat /minos/data/minfarm/roundup/cedarfar.pid 4522 ? SRV1> ls -l /minos/data/minfarm/roundup/cedarfar.pid -rw-r--r-- 1 minfarm numi 7 May 12 12:05 /minos/data/minfarm/roundup/cedarfar.pid pico /minos/data/minfarm/roundup/cedarfar.pid # 26954 .... cedar far is stuck in an srmcp, /pnfs/fnal.gov/usr/minos/reco_far/cedar/.bcnd_data/2008-05/F00040860_0014.spill.bcnd.cedar.0.root When I killed the second copy, during its PURGE phase, all further output to the log file ceased, and the srmcp got stuck. SRV1> touch /minos/data/minfarm/roundup/STOP.cedar_phyfar SRV1> date Mon May 12 13:21:53 CDT 2008 OOPS - SRMCP failed, bailing Mon May 12 13:23:03 CDT 2008 corral is cranking along, looking at cedar nearm, will proceed to CPfar and CPBmcnear SRV1> ./roundup -r cedar_phy far rm /minos/data/minfarm/roundup/STOP.cedar_phyfar SRV1> ./roundup -r cedar_phy far running, let's poke at the 600 files in WRITE for CPB SRV1> ./roundup -w -b 20 -r cedar_phy_bhcurv mcnear SRV1> ./roundup -w -b 100 -r cedar_phy_bhcurv mcnear SRV1> ./roundup -w -b 200 -r cedar_phy_bhcurv mcnear slight delay, this is also srmcp'ing 200 files, mostly non cand 16:45 mv ../ROUNTMP/NOCAT.ok NOCAT PID is messed up for CPB : SRV1> cat MDMR/cedar_phy_bhcurvmcnear.pid 7538 7587 7588 pts/10 Plan, modify corral to run just C far and near, set up loops running continuous scans in 'c' cron mode CP far -s "nd.cedar" CPB mcnear thinking about doing while true ; do ./roundup -c -b 50 -s "nd.cedar" -r cedar_phy_bhcurv mcnear ; done while true ; do ./roundup -c -b 50 -s "nd.cedar" -r cedar_phy far ; done 22:20 - updated corral There's not enough CP far left to be worth chewing on. Try pushing out CPB mcnear candidates in a fairly tight loop : cat looper #!/bin/sh while true ; do ./roundup -c -b 100 -s "cand.cedar" -r cedar_phy_bhcurv mcnear sleep 1200 done Oops, false start with 'far' detector. Try again at 22:55 And again at 22:56 SRV1> df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 28T 168G 100% /minos/data SRV1> df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 28T 182G 100% /minos/data Checking top file in WRITE, n13037340_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root is in dcache, not on tape yet. 
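( For the record, one way to check by hand whether such a file has reached tape is the PNFS layer-4 lookup. This is a sketch, assuming the generic dCache/Enstore ".(use)(4)(file)" convention rather than any Minos script :
  cd /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/734
  cat ".(use)(4)(n13037340_0017_L010185N_D04.cand.cedar_phy_bhcurv.1.root)"
  # empty output - not yet flushed to tape
  # once on tape, this lists the Enstore volume, location cookie, size and bfid
)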
######## # FARM # ######## Free space down to 388181 Mon May 12 10:40:20 CDT 2008 ./farmgsum | tee FGS/sum.`date +%Y%m%d%H` WRITE 847 113334 all.cand.cedar_phy.0.root 32 12290 all.sntp.cedar_phy.0.root 848 19025 spill.bcnd.cedar_phy.0.root 32 1359 spill.bntp.cedar_phy.0.root 848 12053 spill.cand.cedar_phy.0.root 32 887 spill.sntp.cedar_phy.0.root 509 293914 cand.cedar_phy_bhcurv.1.root 54 5540 mrnt.cedar_phy_bhcurv.1.root 54 19233 sntp.cedar_phy_bhcurv.1.root ./roundup -w -r cedar_phy far Cleaned up messages, and PID code a bit more 85500 Mon May 12 20:40:57 CDT 2008 WRITE 42 5676 all.cand.cedar.0.root 649 86728 all.cand.cedar_phy.0.root 1 582 all.sntp.cedar.0.root 12 5843 all.sntp.cedar_phy.0.root 254 146751 cand.cedar_phy_bhcurv.1.root 136 44698 spill.cand.cedar.0.root ./roundup -b 256 -w -s 'cand.cedar_phy_bhcurv.1' -r cedar_phy_bhcurv mcnear 21:00 SRV1> condor_q rubin 26 jobs; 0 idle, 26 running, 0 held N.B. - why are there pass 0 files from CPB mcnear ? this reprocessing was to be at pass 1, to avoid file name conflicts. ./roundup -b 100 -w -s 'cand.cedar_phy.0' -r cedar_phy far ./roundup -b 200 -w -s 'cand.cedar_phy.0' -r cedar_phy far There are about 230 writes for CP far queues in Enstore more may dribble out as the tapes make progress. SRV1> cat ~/scripts/MDMR/cedarnear.pid 5357 7967 GRRRRR this is silly ! Yes, there are two roundups running, under separate trees. Pure luck that the first matched correctly. Let's go back to setting MYPID=${$} which seems to be correct lately. SRV1> ./roundup.20080514 -w -r cedar near SRV1> cat ~/scripts/MDMR/cedarnear.pid 18163 SRV1> ps xf 18175 pts/10 S 0:00 /bin/bash ./roundup.20080514 -w -r cedar near GRRRRRR !!!!!!!! pico ~/scripts/MDMR/cedarnear.pid SRV1> ./roundup -n -w -r cedar far SRV1> cat ~/scripts/MDMR/cedarfar.pid 7967 pts/10 pts/10 pts/10 GRRRRRRRR !!! This entirely missed the proper 27338 The root cause of all these problems is probably running MAIN in the background. stop doing this, background with ^z if needed. This seems to work as desired, $$ does the right thing. SRV1> cp -a AFSS/roundup.20080514 . SRV1> ln -sf roundup.20080514 roundup # was 13 ============================================================================= 2008 05 09 ============================================================================= ######## # FARM # ######## Test improved pid calc, and SRMCP count printout. 
AFSS/roundup.20080509 -w -b 100 -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 20608 20619 pts/10 S 0:00 /bin/sh AFSS/roundup.20080509 -w -b 100 -r cedar_phy far AFSS/roundup.20080509 -w -b 10 -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 28005 28016 pts/10 S 0:00 /bin/sh AFSS/roundup.20080509 -w -b 10 -r cedar_phy far MYOWNPID is 28005 Try /bin/bash not /bin/sh SRV1> AFSS/roundup.20080509 -w -b 10 -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 29375 29386 pts/10 S 0:00 /bin/bash AFSS/roundup.20080509 -w -b 10 -r cedar_phy far Try adding ps diagnostic, with filtering SRV1> AFSS/roundup.20080509 -w -b 10 -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 31524 31535 pts/10 S 0:00 /bin/bash AFSS/roundup.20080509 -w -b 10 -r cedar_phy far Try without filtering Fri May 9 09:05:33 CDT 2008 PID TTY TIME CMD 23229 pts/10 00:00:08 bash 32660 pts/10 00:00:00 roundup.2008050 32661 pts/10 00:00:00 ps SRV1> AFSS/roundup.20080509 -w -b 10 -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 1707 1709 1710 1713 Cleanup the parsing of ps, with PSO="`ps --no-header`" PSOU=" `printf "${PSO}\n" | grep roundup`" PSOUT=`printf "${PSOU}\n" | tr -s ' ' | cut -f 2 -d ' '` Test shifting the symlink out from under a running process, to see whether we can upgrade on the fly. cp AFSS/roundup.20080509 ./roundup.20080509test # with test message cp AFSS/roundup.20080509 ./roundup.20080509 ln -sf roundup.20080509test roundup.work ./roundup.work -w -b 10 -r cedar_phy far OK, was running the test version ./roundup.work -w -b 10 -r cedar_phy far ln -sf roundup.20080509 roundup.work OK, was running the test version ./roundup.work -w -b 10 -r cedar_phy far < no message > It seems safe enough to shift the symlink on the fly ! 
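( Summarizing where the PID bookkeeping ends up, per the May 14 notes above : record the script's own PID with $$ and treat the file as stale when that process is gone. A minimal sketch, not the actual roundup code ; the pidfile path follows the cedar_phyfar.pid pattern used in these tests.
  PIDFILE=/minos/data/minfarm/roundup/cedar_phyfar.pid
  if [ -r ${PIDFILE} ] ; then
      OLDPID=`head -1 ${PIDFILE}`
      if kill -0 ${OLDPID} 2>/dev/null ; then
          echo "OOPS - roundup already running as ${OLDPID}" ; exit 1
      fi
      echo "stale pidfile ${PIDFILE} ( ${OLDPID} ), removing"
  fi
  echo ${$} > ${PIDFILE}   # ${$} is the script PID when MAIN is not backgrounded
  # ... real work here ...
  rm -f ${PIDFILE}
)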
ln -sf roundup.20080509 roundup date Fri May 9 09:52:27 CDT 2008 ./roundup -w -r cedar_phy far ./farmgsum | tee FGS/sum.`date +%Y%m%d%H` 12:07 both CPB and CP finished a few seconds/minutes to late for cron Started next cycle manually ./roundup -r cedar_phy far ./roundup -r cedar_phy_bhcurv mcnear ########## # CONDOR # ########## glide4hr.run is stuck, 111815.0 boehm 5/8 14:54 0+17:16:02 R 0 9.8 probe Submitted glide1hr 2 3 4 at 08:15 112066.0 boehm 5/9 08:17 0+00:00:57 R 0 0.0 probe 112067.0 boehm 5/9 08:17 0+00:00:00 I 0 0.0 probe 112068.0 boehm 5/9 08:17 0+00:00:00 I 0 0.0 probe Submitted 1 2 3 4 under kreymer account 112071.0 kreymer 5/9 08:20 0+00:00:00 I 0 0.0 probe 112072.0 kreymer 5/9 08:22 0+00:00:21 R 0 0.0 probe 112073.0 kreymer 5/9 08:22 0+00:00:00 R 0 0.0 probe 112074.0 kreymer 5/9 08:22 0+00:00:00 R 0 0.0 probe 112075.0 kreymer 5/9 08:22 0+00:00:00 I 0 0.0 probe MINOS25 > dds logs/glide/probe*hr* -rw-r--r-- 1 kreymer g020 38 May 9 09:22 logs/glide/probe1hr.112072.0.err -rw-r--r-- 1 kreymer g020 696 May 9 09:22 logs/glide/probe1hr.112072.0.log -rw-r--r-- 1 kreymer g020 2227 May 9 09:22 logs/glide/probe1hr.112072.0.out -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe2hr.112073.0.err -rw-r--r-- 1 kreymer g020 247 May 9 08:22 logs/glide/probe2hr.112073.0.log -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe2hr.112073.0.out -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe3hr.112074.0.err -rw-r--r-- 1 kreymer g020 247 May 9 08:22 logs/glide/probe3hr.112074.0.log -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe3hr.112074.0.out -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe4hr.112075.0.err -rw-r--r-- 1 kreymer g020 245 May 9 08:25 logs/glide/probe4hr.112075.0.log -rw-r--r-- 1 kreymer g020 0 May 9 08:22 logs/glide/probe4hr.112075.0.out Created 50 and 70 minute jobs, 112133.0 kreymer 5/9 14:31 0+00:00:00 I 0 0.0 probe 112134.0 kreymer 5/9 14:31 0+00:00:00 I 0 0.0 probe Removed the 3 and 4 hour tests MINOS25 > condor_rm 112074.0 112075.0 Job 112074.0 marked for removal Job 112075.0 marked for removal condor_rm -forcex 112074.0 112075.0 Per sfiligoi, will try leaving an intact proxy for the next tests also killing the 10 minute probes, till this is settled MINOS25 > crontab /tmp/ct25 Fri May 9 14:52:34 CDT 2008 MINOS25 > condor_submit glidelease.run 1 job(s) submitted to cluster 112139. Per sfiligoi, condor_config_val SEC_DEFAULT_SESSION_DURATION 3600 This should be, for a 4 day maximum SEC_DEFAULT_SESSION_DURATION = 345600 Date: Fri, 09 May 2008 15:40:51 -0500 (CDT) Subject: HelpDesk ticket 115385 ___________________________________________ Short Description: Please update Minos Cluster condor.config files Problem Description: run2-sys : Most of our analysis jobs have been failing to return their logs, and are getting hung up on exit, since Monday's upgrade of the Condor glidein system to use glexec. The experts believe that the root cause of this is a parameter in /opt/condor-7.0.1/etc/condor_config The line SEC_DEFAULT_SESSION_DURATION = 3600 needs to be updated to SEC_DEFAULT_SESSION_DURATION = 345600 I have provided a modified condor_config file under /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/ Please update and cfengine this file to all of minos01 through minos25 before the weekend, so that we can resume Analysis batch processing. Thanks ! ___________________________________________ Date: Fri, 09 May 2008 15:51:07 -0500 (CDT) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. 
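( For reference, the requested edit amounts to one line per node ; a sketch of doing it by hand with sed, using the /opt/condor-7.0.1 path named in the ticket - the actual fix below was pushed out with cfengine :
  sed -i.bak \
   -e 's/^SEC_DEFAULT_SESSION_DURATION *=.*/SEC_DEFAULT_SESSION_DURATION = 345600/' \
   /opt/condor-7.0.1/etc/condor_config             # 345600 seconds = 4 days
  condor_reconfig -all
  condor_config_val SEC_DEFAULT_SESSION_DURATION   # expect 345600
)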
___________________________________________ Date: Fri, 09 May 2008 16:13:15 -0500 (CDT) I have made the change in cfengine. The file has been updated. ___________________________________________ ___________________________________________ ___________________________________________ MIN > for NODE in ${NODES} ; do printf "${NODE} "; ssh -ax ${NODE} grep SEC_DEFAULT_SESSION_DURATION /opt/condor/etc/condor_config ; done minos01 SEC_DEFAULT_SESSION_DURATION = 345600 ... condor_reconfig -all MINOS25 > date Fri May 9 17:10:56 CDT 2008 MINOS25 > condor_submit glide2hr.run 1 job(s) submitted to cluster 112146. 112146.0 kreymer 5/9 17:12 0+00:00:00 I 0 0.0 probe MINOS25 > condor_q 111448 | grep boehm | wc -l 15 MINOS25 > condor_q 111448 MINOS25 > condor_q 111448 | grep boehm | wc -l -- Submitter: minos25.fnal.gov : <131.225.193.25:65336> : minos25.fnal.gov ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 111448.0 boehm 5/7 10:01 2+06:41:37 R 0 732.4 condor_job.sh 111448.1 boehm 5/7 10:01 2+06:41:37 R 0 732.4 condor_job.sh 111448.2 boehm 5/7 10:01 2+06:41:34 R 0 732.4 condor_job.sh 111448.3 boehm 5/7 10:01 2+06:41:34 R 0 732.4 condor_job.sh 111448.4 boehm 5/7 10:01 2+06:41:32 R 0 488.3 condor_job.sh 111448.5 boehm 5/7 10:01 2+06:41:34 R 0 732.4 condor_job.sh 111448.6 boehm 5/7 10:01 2+06:38:22 R 0 732.4 condor_job.sh 111448.7 boehm 5/7 10:01 2+06:38:20 R 0 732.4 condor_job.sh 111448.8 boehm 5/7 10:01 2+06:38:20 R 0 732.4 condor_job.sh 111448.9 boehm 5/7 10:01 2+06:38:21 R 0 732.4 condor_job.sh 111448.10 boehm 5/7 10:01 2+06:38:04 R 0 488.3 condor_job.sh 111448.14 boehm 5/7 10:01 2+06:38:02 R 0 488.3 condor_job.sh 111448.84 boehm 5/7 10:01 2+01:33:43 R 0 732.4 condor_job.sh 111448.88 boehm 5/7 10:01 2+01:29:02 R 0 732.4 condor_job.sh 111448.98 boehm 5/7 10:01 2+01:26:17 R 0 732.4 condor_job.sh condor_rm 111448 condor_rm 111448 -forcex MINOS25 > condor_submit probe.run 112148.0 kreymer 5/9 17:17 0+00:00:02 R 0 0.0 kcron less logs/probe/probe.112148.0.out this short job looks OK Killing off a lot more of boehm stuck jobs : 111420 - 1 111422 - 100 111438 - 100 condor_rm 111422 condor_rm 111422 -forcex Fri May 9 17:23:06 CDT 2008 condor_config_val SEC_DEFAULT_SESSION_DURATION Try killing off a few of the stuck gfactories there are a dozen or so idle at present. condor_rm 111423 -forcex 17:45 condor_rm boehm -forcex condor_rm gfactory -forcex This was a bad idea, sfiligoi informs me this creates a huge mess. Standing by. Date: Fri, 09 May 2008 18:22:37 -0500 I have cleaned up the old glideins in the queue using the fork queue. The new glideins should be starting soon. Igor MINOS25 > condor_submit glide.run 112155.0 kreymer 5/9 18:24 0+00:00:01 R 0 0.0 probe MINOS25 > condor_submit glide2hr.run 112156.0 kreymer 5/9 18:25 0+00:00:07 R 0 0.0 probe ============================================================================= 2008 05 08 ============================================================================= ######## # FARM # ######## Corral Enabled cedar_phy far Enabled cedar_phy_bhcurv mcnear SRV1> ./roundup -s charm -r cedar_phy_bhcurv mcnear this picked up the remaining charm files touched roundup to drop pool limit from NPOOLS-1 to NPOOLS-3 allowing a single server to be down. 
( 11 is down today ) SRV1> ./roundup -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 23736 That process is gone, this should be 23747 Edited it manually Ran again, to purge the files now declared to sam SRV1> ./roundup -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 15130 15141 pts/10 S 0:09 /bin/sh ./roundup -r cedar_phy far Try shifting PID setting to MAIN. AFSS/roundup.new -r cedar_phy_bhcurv mcnear SRV1> cat /minos/data/minfarm/roundup/cedar_phy_bhcurvmcnear.pid 25271 25282 pts/10 S 0:00 /bin/sh AFSS/roundup.new -r cedar_phy_bhcurv mcnear Corrected the file order to be the same for PURGE CONCAT WRITE AFSS/roundup.newer -w -r cedar_phy far SRV1> cat /minos/data/minfarm/roundup/cedar_phyfar.pid 21099 21116 pts/10 S 0:00 /bin/sh AFSS/roundup.newer -w -r cedar_phy far Try chaing to /bin/bash for better shell control ? Try setting MYOWNPID=${$} echo ${MYOWNPID} ... ########### # ROUNDUP # ########### Changed default bail limit to 1000, to stay clear of DCache limits SRV1> cp -a AFSS/roundup.20080708 . SRV1> ln -sf roundup.20080508 roundup # was roundup.20080506 ########## # CONDOR # ########## Date: Thu, 8 May 2008 16:37:14 +0000 (UTC) From: Arthur Kreymer To: sfiligoi@fnal.gov Cc: minos-admin@fnal.gov Subject: Re: Analysis glideins ramping up on GPFARM (fwd) Our first large scale users seems to have many hung jobs, stuck as they are trying to finish. It seems odd to me that I see only two active glideins when I log into fnpc340, but condor_q shows 8 boehm jobs running. For example, MINOS25 > condor_q -r | grep fnpc340 111422.34 boehm 5/7 09:25 1+01:51:58 vm2@31266@fnpc340.fnal.gov 111422.40 boehm 5/7 09:25 1+01:48:55 vm2@642@fnpc340.fnal.gov 111422.42 boehm 5/7 09:25 1+01:48:58 vm2@377@fnpc340.fnal.gov 111422.61 boehm 5/7 09:25 1+01:45:54 vm2@2592@fnpc340.fnal.gov 111422.68 boehm 5/7 09:25 1+01:45:50 vm2@3381@fnpc340.fnal.gov 111422.72 boehm 5/7 09:25 1+01:45:54 vm2@3697@fnpc340.fnal.gov 111422.92 boehm 5/7 09:25 1+01:42:51 vm2@6865@fnpc340.fnal.gov 111422.93 boehm 5/7 09:25 1+01:42:51 vm2@6309@fnpc340.fnal.gov MINOS25 > condor_q -l 111422.34 | grep UserLog UserLog = "/minos/data/users/boehm/RyanFiles/distcutoff/log.111422.34" MINOS25 > cat /minos/data/users/boehm/RyanFiles/distcutoff/log.111422.34 000 (111422.034.000) 05/07 09:25:34 Job submitted from host: <131.225.193.25:62172> ... 001 (111422.034.000) 05/07 09:32:47 Job executing on host: <131.225.166.119:61601> ... 006 (111422.034.000) 05/07 09:32:55 Image size of job updated: 132728 ... 006 (111422.034.000) 05/07 10:32:55 Image size of job updated: 447540 ... 006 (111422.034.000) 05/07 11:32:55 Image size of job updated: 582148 ... MINOS25 > ls -l /minos/data/users/boehm/RyanFiles/distcutoff/*.111422.34 -rw-r--r-- 1 boehm e875 0 May 7 09:25 /minos/data/users/boehm/RyanFiles/distcutoff/err.111422.34 -rw-r--r-- 1 boehm e875 397 May 7 11:32 /minos/data/users/boehm/RyanFiles/distcutoff/log.111422.34 -rw-r--r-- 1 boehm e875 0 May 7 09:25 /minos/data/users/boehm/RyanFiles/distcutoff/out.111422.34 fnpc340 $ ps axfwww > /minos/data/users/kreymer/fnpc340.psaxf Date: Thu, 08 May 2008 11:56:54 -0500 (CDT) Subject: HelpDesk ticket 115305 Short Description: recent Minos glidein jobs on GPFARM not terminating properly ? Problem Description: Date: Thu, 08 May 2008 11:52:55 -0500 From: Sfiligoi Igor To: Arthur Kreymer Cc: minos-admin@fnal.gov Subject: Re: Analysis glideins ramping up on GPFARM (fwd) From all you write below, it would seem it is running just fine. 
Also looking on fnpc340, I see 8 starters, so Condor-wise there do not seem to be any obvious problems. Unfortunately, I am not able to read the files in fnpc340:/local/stage1/condor/execute/dir_2051/glide_dx2086/tmp/starter-tmp- dir-SfzaAS so I cannot look at all the details. .... further information, as listed above in this log ... ___________________________________________ Date: Fri, 09 May 2008 13:02:07 -0500 (CDT) Hi Igor--you have root on fnpc211 temporarily. cfengine will wipe you out of the .k5login again pretty soon. ... ___________________________________________ Date: Fri, 09 May 2008 15:02:01 -0500 From: Sfiligoi Igor But talking with Dan B. from the Condor team, we think we may have found the problem: condor_config_val SEC_DEFAULT_SESSION_DURATION 3600 This seems to confuse the glideins a lot, at least when using glexec: after the glidein has been up for 2h, it starts misbehaving. Probably a Condor bug. For now, I would suggest you increase this value to the max expected lifetime of the glidein. This should be OK: SEC_DEFAULT_SESSION_DURATION = 30000 I would suggest you change this for all the Condor daemons in the pool. Cheers, Igor ___________________________________________ Date: Fri, 09 May 2008 15:06:35 -0500 (CDT) From: Steven Timm I would concur--SEC_DEFAULT_SESSION_DURATION=3600 was left over from the GP Grid cluster config and we have lengthened it there too now. ___________________________________________ Subject: HelpDesk ticket 115385 update condor_config, see 2008/04/09 ___________________________________________ This has been changed on the Minos pool, by update the /opt/condor/etc/condor_config file, then issuing condor_reconfig -all I have submitted another 2 hour test job, but it has not started running yet, after 20 minutes. I have also done condor_rm -forcex on over 100 of the stuck jobs running for user boehm. But the probe job is still not running. An equivalent local poll probe job ran right away. I will try removing some of the gfactory jobs, to make room for new ones to run. ___________________________________________ Date: Fri, 09 May 2008 17:46:28 -0500 From: Sfiligoi Igor The glideins are most probably stuck. You should remove them all from the queue. ___________________________________________ Date: Fri, 09 May 2008 22:55:10 +0000 (UTC) From: Arthur Kreymer I've done this. The old pilots are still running on GPFARM. Perhaps Steve will have to remove these on the GPFARM end. New gfactory processes are being created, but are all Idle. rubin also has 350 jobs running on GPFARM, using the minos quota till some of them finish. ___________________________________________ Date: Fri, 09 May 2008 18:00:30 -0500 From: Sfiligoi Igor For the future, do not use condor_rm -forcex , unless you have a very good reason. It creates a huge mess in the system. ___________________________________________ Date: Fri, 09 May 2008 18:22:37 -0500 From: Sfiligoi Igor I have cleaned up the old glideins in the queue using the fork queue. The new glideins should be starting soon. ___________________________________________ Date: Fri, 09 May 2008 23:28:23 +0000 (UTC) From: Arthur Kreymer Thanks ! The old jobs seem to have cleared off of GPFARM, at least on node fnpc344. I submitted a short glide job, which completed. I submitted a 2 hr glide job, will check in 2 hours. 
___________________________________________ Date: Tue, 20 May 2008 09:01:09 -0500 (CDT) Solution: Per discussion with Art Kreymer, he changed the SEC_DEFAULT_SESSION_CONFIGURATION on his glideins after consulting Igor and the condor team. After this was lengthened this problem hasn't recurred. ___________________________________________ ___________________________________________ Sample job, glide.run condor_history 111803 ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD 111803.0 boehm 5/8 13:56 0+00:00:35 C 5/8 14:00 /minos/scratch/ condor_history -l 111803 /minos/scratch/boehm/probe.111803.0.log 14:20 - while investigating, improve josh's factor from 10 to 1 condor_userprio -setfactor boehm@fnal.gov 1. Also cleaned up one of the erroneous entries : MINOS25 > condor_userprio -delete deb4@fnal.gov@fnal.gov The accountant record named deb4@fnal.gov@fnal.gov was deleted condor_submit glide4hr.run Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 111815. 111815.0 boehm 5/8 14:54 0+00:00:00 I 0 0.0 probe ########## # CONDOR # ########## condor_userprio -setfactor rearmstr@fnal.gov 100. ########## # CONDOR # ########## None of my glideafs jobs have run since 10:00 yesterday. ########## # CONDOR # ########## Added random delay to the proxy regeneration by /local/scratch25/grid/kproxy in order to avoid piling up on the KCA's. [ "x$USER" = "x" ] || [ "x$PS1" = "x" ] && sleep $(( ${RANDOM} % 600 )) Tested in kreymer account, at 09:56. 30695 ? Ss 0:00 /usr/krb5/bin/kcron /local/scratch25/grid/kproxy 30701 ? S 0:00 \_ /bin/sh /local/scratch25/grid/kproxy 30702 ? S 0:00 \_ sleep 505 This worked, producing -rw------- 1 kreymer g020 5189 May 8 10:04 kreymer.proxy.2008050810 ######## # FARM # ######## SRV1> grep SRMCP cedar_phy_bhcurvmcnearnd.cedar.log | wc -l 826 SRV1> tail cedar_phyfar.log PURGE FARM F00038005_0011.all.cand.cedar_phy.0.root DCACHE WRITE BACKLOG at 3000 after 3001 files SRV1> dds cedar_phyfar.log -rw-rw-r-- 1 minfarm numi 2207507 May 7 18:57 cedar_phyfar.log OK, that's intrinsic, will never copy more then DCQLIM files per pass Wait for existing iteration to finish, then make roundup.20080507 the default ( with -b support ) ########## # CONDOR # ########## The new 250 limit seems to be effective, MINOS25 > condor_q gfactory -r | wc -l 220 ============================================================================= 2008 05 07 ============================================================================= ######## # FARM # ######## Date: Wed, 07 May 2008 17:47:30 -0500 (CDT) Subject: HelpDesk ticket 115269 ___________________________________________ Short Description: /fnal/ups is not mounted on fnpc213 Problem Description: /fnal/ups has become unmounted on fnpc213, causing user jobs to fail. ___________________________________________ ############ # RELEASES # ############ MINOS26 > cd /afs/fnal.gov/files/code/e875/general/minossoft/releases/S08-02-16-R1-28/Mad/data cvs update ( added back nearly 100 MB of root files. ) ########## # CONDOR # ########## Investigating cause of 300+ glideins Seems we have a different configuration file, probably happened when glexec was enabled. 
drwxrwxr-x 4 gfrontend gfrontend 4096 Nov 27 09:18 myvofrontend1/ drwxrwxr-x 4 gfrontend gfrontend 4096 Feb 6 09:50 myvofrontend2/ [gfrontend@minos25 ~]$ ls -l myvofrontend1/etc total 8 -rw-rw-r-- 1 gfrontend gfrontend 887 Dec 21 09:25 vofrontend.cfg -rw-rw-r-- 1 gfrontend gfrontend 889 Nov 27 09:19 vofrontend.cfg.20071127 [gfrontend@minos25 ~]$ ls -l myvofrontend2/etc total 4 -rw-rw-r-- 1 gfrontend gfrontend 1456 Apr 23 13:39 vofrontend.cfg [gfrontend@minos25 ~]$ diff myvofrontend1/etc/vofrontend.cfg myvofrontend2/etc/vofrontend.cfg 4c4 < frontend_name='fe1' --- > frontend_name='my2' 18c18,25 < match_string='1' --- > # GLIDEIN_Has_MINOSAFS can handle all jobs > # The other only the ones that do not have Require_MINOSAFS set to true > # Also only match glexec glideins with jobs that have a proxy > match_string='(glidein["attrs"]["GLIDEIN_Has_MINOSAFS"] or (not (job.has_key("Require_MINOSAFS") and job["Require_MINOSAFS"]))) and ((not glidein["attrs"].has_key("GLIDEIN_UseGLEXEC")) or (not glidein["attrs"]["GLIDEIN_UseGLEXEC"]) or job.has_key("x509userproxysubject"))' > > > # old > #match_string='glidein["attrs"]["GLIDEIN_Has_MINOSAFS"] or (not (job.has_key("Require_MINOSAFS") and job["Require_MINOSAFS"]))' 21c28 < max_idle_glideins_per_entry=20 --- > max_idle_glideins_per_entry=10 24c31 < max_running_jobs=100 --- > max_running_jobs=1000 32c39 < log_dir='/home/gfrontend/myvofrontend1/log' --- > log_dir='/home/gfrontend/myvofrontend2/log' Changed limit from 1000 to 250 cd myvofrontend2/etc cp -a vofrontend.cfg vofrontend.cfg.20080423 nedit vofrontend.cfg [gfrontend@minos25 etc]$ diff vofrontend.cfg.20080423 vofrontend.cfg 31c31 < max_running_jobs=1000 --- > max_running_jobs=250 kill -9 6931 ./start_frontend.sh [gfrontend@minos25 ~]$ grep Total myvofrontend2/log/frontend_info.20080507.log | tail [2008-05-07T17:31:53-05:00 6931] Total running 339 limit 1000 [2008-05-07T17:33:30-05:00 6931] Total running 349 limit 1000 [2008-05-07T17:35:06-05:00 6931] Total running 349 limit 1000 [2008-05-07T17:36:41-05:00 6931] Total running 349 limit 1000 [2008-05-07T17:38:17-05:00 6931] Total running 349 limit 1000 [2008-05-07T17:38:40-05:00 18528] Total running 349 limit 250 [2008-05-07T17:40:16-05:00 18528] Total running 349 limit 250 [2008-05-07T17:41:52-05:00 18528] Total running 349 limit 250 [2008-05-07T17:43:27-05:00 18528] Total running 345 limit 250 [2008-05-07T17:45:05-05:00 18528] Total running 347 limit 250 ########### # ROUNDUP # ########### Capturing the current roundup, renamed as 20080507, adds -b BAIL count. SRV1> cp AFSS/roundup.20080507 . SRV1> ln -sf roundup.20080506 roundup # was roundup.20080501 ########## # CONDOR # ########## spotted users with excessively good priorities condor_userprio -setfactor idanko@fnal.gov 100. condor_userprio -setfactor djauty@fnal.gov 100. Wed May 7 14:03:26 CDT 2008 condor_userprio --all -allusers rahaman@fnal.gov 0.50 0.50 1.00 0 241.19 5/01/2008 14:30 5/03/2008 04:10 rhatcher@fnal.gov 0.50 0.50 1.00 0 3712.76 3/19/2008 11:30 5/01/2008 16:40 nickd@fnal.gov 0.50 0.50 1.00 0 392.49 4/28/2008 11:29 5/04/2008 18:30 kreymer@fnal.gov 0.54 0.54 1.00 0 1409.22 10/24/2007 09:00 5/07/2008 09:50 condor_userprio -setfactor rahaman@fnal.gov 100. condor_userprio -setfactor nickd@fnal.gov 100. It seems that newly running users come in with a factor of 1. 
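( A hedged sketch of sweeping such users up in one pass instead of one -setfactor at a time ; the awk field position is read off the listing above, where field 4 is the priority factor, and the output format should be re-checked before trusting this. Note it would also catch anyone deliberately left at factor 1.
  condor_userprio -all -allusers | grep '@fnal.gov' | \
  awk '$4 == "1.00" { print $1 }' | \
  while read PUSER ; do
      condor_userprio -setfactor ${PUSER} 100.
  done
)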
######## # FARM # ######## CP far is running, finished concatenation and started writes at 01:24 Wed May 7 09:46:14 CDT 2008 SRV1> du -sm /minos/data/minfarm/WRITE/ 549973 /minos/data/minfarm/WRITE/ Wed May 7 01:24:39 CDT 2008 WRITING to DCache 8798 SRV1> date Wed May 7 15:04:06 CDT 2008 SRV1> grep SRMCP cedar_phyfar.log | wc -l 2342 Started rate tests with CPB mcnear , in parallel AFSS/roundup.20080507 -n -b 20 -s 'nd.cedar' -r cedar_phy_bhcurv mcnear AFSS/roundup.20080507 -b 1000 -s 'nd.cedar' -r cedar_phy_bhcurv mcnear ########## # CONDOR # ########## /local/scratch25/grid/VDT - hacked setups.sh to change path, this seems to have worked fngp-osg > du -sh /usr/local/vdt-1.8.1 3.0G /usr/local/vdt-1.8.1 SRV1> du -sh /usr/local/vdt-1.8.1 1.2G /usr/local/vdt-1.8.1 SRV1> tar cf /tmp/vdt181.tar -C /usr/local/vdt-1.8.1 . SRV1> du -sh /tmp/vdt181.tar 1.2G /tmp/vdt181.tar ########## # CONDOR # ########## OSG variables On fngp-osg, using /usr/local/vdt-1.8.1/monitoring/osg-attributes.conf cp ########## # CONDOR # ########## Added users to fermilab/minos/Analysis role ( not yet used ) djauty boehm hartnell loiacono mishi nickd pawloski rustem ########## # CONDOR # ########## For the record, OSG environment from a non-glexec job : OSG_GRID ^/usr/local/grid^ OSG_DATA ^/grid/data^ OSG_APP ^/grid/app^ OSG_WN_TMP ^/local/stage1^ See https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/StorageParameterOsgWnTmp ########## # CONDOR # ########## Date: Wed, 07 May 2008 08:23:59 -0500 (CDT) Subject: HelpDesk ticket 115229 ___________________________________________ Short Description: Request KCA cert addition for devenish and auty Problem Description: Please add KCA certs, for glexec support, for User Lastname nickd devenish djauty auty I think these would look like /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Nicholas Devenish/UID=nickd /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=David J. Auty/USERID=djauty We will eventually want such certs for all Minos users, as well as certs compatible with the new format coming next week. ___________________________________________ Date: Fri, 09 May 2008 09:58:10 -0500 Resolved Hi Art, Yesterday I re-enabled email notification on the fermilab VOMRS server, so I am now in a position to give you the correct method of performing the action you request, yourself. The users can and should add their own Robot certificates to their membership in the fermilab VO per the instructions I sent you last Monday, with the following addition (see 2a, 2b, 2c): 1) Load your KCA certificate (current, not expired!) and visit this URL: https://vomrs.fnal.gov:8443/vomrs/vo-fermilab/vomrs 2) Click on the [+] next to the "Members" 2a) Click on "Change Email Address" 2b) Enter your last name and "Search" 2c) Enter your correct email address and "Submit" 3) Click on the [+] next to the "Certificates" 4) Click on "Add certificate" 5) Enter your last name and "Search" 6) Enter your 'new' DN in the New DN field, and select the Fermi KCA from the pull-down list in the "New CA" list. 7) Enter some text in the "Reason" field and click "Submit" Next, the members representative (You!) will receive an email from VOMRS requesting you to approve the addition of the DN. The email will contain a handy link for you to click on to get to the right page. I should note, that when the new DN format is implemented the users will NOT need to add this DN, we'll do this for them automatically. 
Cheers, Dan ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 05 06 ============================================================================= ########## # CONDOR # ########## spotted users with excessively good priorities condor_userprio -setfactor idanko@fnal.gov 100. condor_userprio -setfactor djauty@fnal.gov 100. Wed May 7 14:03:26 CDT 2008 ########## # CONDOR # ########## Date: Tue, 06 May 2008 18:30:04 -0500 (CDT) Subject: HelpDesk ticket 115222 ___________________________________________ Short Description: OSG_* environment missing from Minos glidein/glexec jobs Problem Description: Even since we activated glexec execution of Minos glidein jobs on GPFARM yesterday aroung 13:00, the OSG_* environment variables have been undefined on the user jobs. This could be a problem for people needing to do source ${OSG_GRID}/setup.sh ___________________________________________ Date: Tue, 06 May 2008 18:38:56 -0500 (CDT) Note To Requester: chadwick@fnal.gov sent this Notes To Requester: As a workaround, here is how you can manually invoke the script to define the missing environment variables: if [ -z "$VDT_LOCATION" ] then if [ -e "/usr/local/vdt" ] then export VDT_LOCATION='/usr/local/vdt' elif [ -e "/usr/local/grid" ] then export VDT_LOCATION='/usr/local/grid' elif [ -n "$OSG_LOCATION" ] then export VDT_LOCATION="$OSG_LOCATION" fi fi if [ -n "$VDT_LOCATION" ] then source $VDT_LOCATION/setup.sh if [ -e "$VDT_LOCATION/monitoring/osg-attributes.conf" ] then source $VDT_LOCATION/monitoring/osg-attributes.conf fi fi -Keith. ___________________________________________ Date: Tue, 06 May 2008 19:48:14 -0500 (CDT) Note To Requester: timm@fnal.gov sent this Notes To Requester: The stripping of the OSG* environment variables is a feature of glexec and glideWMS. It purposely strips out any environment that was set up before that. I suggest you contact the GlideWMS developer to see how he deals with that. He does have a mechanism. Steve Timm ___________________________________________ Date: Wed, 07 May 2008 08:50:33 -0500 From: Sfiligoi Igor Uhm... this was not expected. I thought Condor preserved the environment even when using gLExec; but now that I think about it, I never tried it out! I'll see what I can do... but it could be a few days. ___________________________________________ Date: Mon, 12 May 2008 13:50:22 -0500 (CDT) Note To Requester: yocum@fnal.gov sent this Notes To Requester: Art, Are you satisfied with the answers Steve and Keith gave? Can we close this ticket? Thanks, Dan ___________________________________________ Date: Mon, 12 May 2008 14:09:30 -0500 (CDT) Note To Requester: A clarification to Keith's earlier E-mail. $VDT_LOCATION/monitoring/osg-attributes.conf is not visible from the worker nodes. /usr/local/grid/setup.sh willl be but that, in itself, does not define the various OSG* variables that you are looking for. It is probably best to modify your glidein startup script to (1) detect the OSG environment variables before it launches the glidein and (2) somehow pass them on to the user job. It may be necessary to send these arguments along with the user job independent of the glidewms. Glexec is configurable. By default it passes only a very few environment variables to the glidein. 
This configuration could be changed but we would need very good reason from the glidewms to prove it couldn't be done any other way before we do that. Steve Timm ___________________________________________ Our users can easily work around the problem with OSG variables on glidein. So we will stand by for a longer term solution. See the following comment from Igor : Date: Mon, 12 May 2008 16:10:34 -0500 From: Sfiligoi Igor Hi Art. This is a Condor bug. I am in contact with Madison, and hope to have a beta to test by the end of the week. Igor ___________________________________________ Date: Thu, 22 May 2008 10:31:18 -0500 (CDT) Note To Requester: Igor Sfiligoi is now testing a pre-release of condor 7.0.2 which is supposed to beat the problem of stripping out all the environment variables. That, accompanied with an upgrade of the cluster to 7.0.2 when it comes out and a change in the glexec configuration file, should eventually resolve this problem. I am marking this ticket pending until we hear from condor that condor 7.0.2 is out. Steve Timm ___________________________________________ ########## # CONDOR # ########## per Mayly request, at around 18:25 MINOS25 > condor_userprio -setfactor pawloski@fnal.gov 10. The priority factor of pawloski@fnal.gov was set to 10.000000 MINOS25 > condor_userprio -setfactor boehm@fnal.gov 10. The priority factor of boehm@fnal.gov was set to 10.000000 ######### # PROBE # ######### probe - changed from ps ax to ps -H --forest resticts display to current process tree ######## # FARM # ######## Overnight, while waiting for dccp -x509 assistance, will proceed to concatenate cedar_phy far data. For write rate tests, can use a STOP file to halt this. ./roundup -r cedar_phy far ######## # FARM # ######## Enabled corral again, now that we have a STOP ability, but just for catchup. Catching up on clearing cand space, now that WRITE is clean. SRV1> du -sm /minos/data/reco_near/cedar/cand_data 601568 /minos/data/reco_near/cedar/cand_data SRV1> du -sm /minos/data/reco_far/cedar/cand_data 292934 /minos/data/reco_far/cedar/cand_data rm -r /minos/data/reco_near/cedar/cand_data rm -r /minos/data/reco_far/cedar/cand_data Space is up to 2.9 TB. ######## # FARM # ######## Prepare for switch to /local/globus from /export chmod 740 /export/stage/minfarm/.grid/backup cp -vax /export/stage/minfarm/.grid \ /local/globus/minfarm/.grid ######## # FARM # ######## Trying dccp ( x509 ) We use a grid proxy instead of a voms proxy because we think voms is unsafe, given the caching of roles by the present FNDCA system. SRV1> setup dcap -q x509 SRV1> export X509_USER_PROXY=/export/stage/minfarm/.grid/x509up_u1334 SRV1> dccp F00037838_0004.spill.bcnd.cedar_phy.0.root dcap://fndca1.fnal.gov:24536/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy/.bcnd_data/2007-04/F00037838_0004.spill.bcnd.cedar_phy.0.root Error ( POLLIN POLLERR POLLHUP) (with data) on control line [6] Failed to create a control line Error ( POLLIN POLLERR POLLHUP) (with data) on control line [8] Failed to create a control line Failed open file in the dCache. 
Can't open destination file : Server rejected "hello" System error: Input/output error SRV1> dccp F00037838_0004.spill.bcnd.cedar_phy.0.root dcap://fndca1.fnal.gov:24525/pnfs/fnal.gov/usr/minos/reco_far/cedar_phy/.bcnd_data/2007-04/F00037838_0004.spill.bcnd.cedar_phy.0.root Error ( POLLIN POLLERR POLLHUP) (with data) on control line [6] Failed to create a control line Error ( POLLIN POLLERR POLLHUP) (with data) on control line [8] Failed to create a control line Failed open file in the dCache. Can't open destination file : Server rejected "hello" System error: Input/output error SRV1> voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. Error: VOMS extension not found! subject : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990/CN=2146134877 issuer : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 identity : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 type : unknown strength : 512 bits path : /export/stage/minfarm/.grid/x509up_u1334 timeleft : 3377:44:33 SRV1> grid-proxy-info -all ERROR: Couldn't find a valid proxy. Use -debug for further information. Date: Tue, 06 May 2008 16:47:05 -0500 (CDT) Subject: HelpDesk ticket 115219 ___________________________________________ Short Description: Cannot write via dcap -q x509 using Howard Rubin proxy ... This ticket is assigned to HARRISON, MICHAEL of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Mon, 07 Jul 2008 14:35:49 -0500 (CDT) This ticket has been reassigned to SCHUMACHER, KEN of the CD-SF/DMS/DSC/SSA Group. ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ ######## # FARM # ######## ./farmgsum | tee FGS/sum.`date +%Y%m%d%H` ########### # ROUNDUP # ########### Capturing the current roundup, renamed as 20080506, making it current. SRV1> cp AFSS/roundup.20080506 . 
SRV1> ln -sf roundup.20080506 roundup # was roundup.20080501 ######## # FARM # ######## Howie stopped new mcnear runs before midnight 690132 Tue May 6 00:30:17 CDT 2008 644756 Tue May 6 01:30:21 CDT 2008 581928 Tue May 6 02:30:24 CDT 2008 575817 Tue May 6 03:30:28 CDT 2008 562353 Tue May 6 04:30:31 CDT 2008 559085 Tue May 6 05:30:35 CDT 2008 557717 Tue May 6 06:30:39 CDT 2008 AFSS/roundup.20080502 -n -w -s helium -r cedar_phy_bhcurv mcnear PURGING WRITE files 981 OOPS - mismatched Enstore and local size/crc SIZE 325788525/325788525 CRC 20242414/20242414 n13037030_0026_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root n13037022_0001_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root WRITING to DCache 981 SRV1> cat ../../ECRC/n13037030_0026_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root 20242414 8 SRV1> ecrc /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/sntp_data/703/n13037030_0026_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root CRC 20242414 SRV1> cat ../../ECRC/n13037022_0001_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root 2612195126 SRV1> ecrc /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/sntp_data/702/n13037022_0001_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root CRC 2356580040 These are likely due to the two scripts running simultaneously echo 20242414 > ../../ECRC/n13037030_0026_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root echo 2356580040 > ../../ECRC/n13037022_0001_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root This corrected the problem, let's purge : AFSS/roundup.20080502 -w -s charm -r cedar_phy_bhcurv mcnear SRV1> du -sm /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/cand_data/ 645847 /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/cand_data/ The helium files were already purged, per OK - processing /minos/data/minfarm/mcnearcat version 20080501 SELECT files containing helium Mon May 5 23:31:57 CDT 2008 SRV1> du -sm /minos/data/mcout_data/daikon_04/L010185N_helium/near/cedar_phy_bhcurv/cand_data/ 617245 /minos/data/mcout_data/daikon_04/L010185N_helium/near/cedar_phy_bhcurv/cand_data/ As minfarm@fnpcsrv1, cd /minos/data/mcout_data/daikon_04 rm -r L010185N_charm/near/cedar_phy_bhcurv/cand_data/ rm -r L010185N_helium/near/cedar_phy_bhcurv/cand_data/ Tue May 6 09:08:12 CDT 2008 Also cleared the cedar WRITE backlog, AFSS/roundup.20080502 -w -r cedar far AFSS/roundup.20080502 -w -r cedar near Updated roundup default to roundup.20080506 == former 20080502 ./roundup -n -s charm -r cedar_phy_bhcurv mcnear OK - stream L010185N_D04_charm.mrnt.cedar_phy_bhcurv OK - 765 Mbytes in 1 runs PEND - have 26/29 subruns for n13037021_*_L010185N_D04_charm.mrnt.cedar_phy_bhcurv.0.root 5 04/30 11:56 0 26 OK - stream L010185N_D04_charm.sntp.cedar_phy_bhcurv OK - 2167 Mbytes in 1 runs PEND - have 26/29 subruns for n13037021_*_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root 5 04/30 11:56 0 26 ./roundup -n -s helium -r cedar_phy_bhcurv mcnear OK - processing 0 files And cleared the CP WRITE files. ./roundup -w -s nd.cedar_phy -r cedar_phy far Now testing rates, using cedar_phy far cand's ./roundup -n -b 5 -s nd.cedar_phy -r cedar_phy far Oops, no bail option ! ./roundup -n -s F00037835 -s nd.cedar_phy -r cedar_phy far AFSS/roundup.new -n -b 10 -s nd.cedar -r cedar_phy far AFSS/roundup.new -b 10 -s nd.cedar -r cedar_phy far WRITE rate 3 Mbytes/second dccp with x509 is failing, cannot test right now. 
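( The ECRC mismatch above was patched by hand with echo ; a minimal sketch of the same check done per file, using one of the files from that session and the relative ../../ECRC path as used above :
  FILE=n13037022_0001_L010185N_D04_charm.sntp.cedar_phy_bhcurv.0.root
  DFILE=/minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/sntp_data/702/${FILE}
  OCRC=`cat ../../ECRC/${FILE}`
  NCRC=`ecrc ${DFILE} | cut -f 2 -d ' '`   # ecrc prints "CRC <value>"
  if [ "${OCRC}" != "${NCRC}" ] ; then
      echo "OOPS - cached ECRC ${OCRC} != ${NCRC} for ${FILE} , refreshing"
      echo ${NCRC} > ../../ECRC/${FILE}
  fi
)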
########## # CONDOR # ########## Boehm jobs are still failing, in spite of good cert : From kproxy /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Joshua A. Boehm/USERID=boehm From VOMRS /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Joshua A. Boehm/UID=boehm Date: Tue, 06 May 2008 15:00:33 -0500 From: Dan Yocum Oops - I cut-n-pasted the wrong DNs to edit. I'll get the USERID ones in a little bit. ============================================================================= 2008 05 05 ########## # CONDOR # ########## VDT access from other accounts fails ! $ voms-proxy-init -noregen -voms fermilab:/fermilab/minos Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses VOMS Server for fermilab not known! MINOS25 > find /minos/scratch/kreymer/VDT -type d -exec ls -ld {} \; | grep -v 'drwxr-xr-x' drwx------ 2 kreymer g020 2048 Jan 10 12:12 /minos/scratch/kreymer/VDT/vdt/extract drwx------ 3 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup drwx------ 5 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt drwx------ 3 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/vdt drwx------ 3 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/vdt/setup drwx------ 2 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/vdt/setup/questions drwx------ 3 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/fetch-crl drwx------ 2 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/fetch-crl/sbin drwx------ 3 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/globus drwx------ 2 kreymer g020 2048 Oct 17 2007 /minos/scratch/kreymer/VDT/vdt/backup/vdt/globus/etc for DIR in extract backup ; do chmod 755 /minos/scratch/kreymer/VDT/vdt/${DIR} ; done for DIR in \ vdt vdt/setup vdt/setup/questions fetch-crl fetch-crl/sbin globus globus/etc do chmod 755 /minos/scratch/kreymer/VDT/vdt/backup/vdt/${DIR} done chmod 755 /minos/scratch/kreymer/VDT/vdt/backup/vdt Still no luck. Solved by copying the file to ${HOME}/.glite/vomses and specifying userconfig ${HOME}/.glite/vomses per advice from timm. My glideins are working now. Cannot create proxy for loiacono with fermilab/minos, but can for fermilab. Needed to add her to the minos group via VOMRS. Also added, around 13:00 hartnell loiacono mishi pawloski 14:12 - loiacono can create the fermilab/minos proxy Date: Tue, 06 May 2008 15:00:33 -0500 From: Dan Yocum Oops - I cut-n-pasted the wrong DNs to edit. I'll get the USERID ones in a little bit. ########## # CONDOR # ########## timm : You have to add /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E Kreymer/USERID=kreymer to the VO. Any KCA certs as called by GLEXEC have the field spelled out as /USERID and have to be entered into the VO that way.
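( A sketch of the vomses workaround noted above. The source path under the scratch VDT install and the -userconf option name are assumptions - the log only says 'specifying userconfig' - so check voms-proxy-init -help before relying on this.
  mkdir -p ${HOME}/.glite
  cp /minos/scratch/kreymer/VDT/glite/etc/vomses ${HOME}/.glite/vomses   # assumed source path
  voms-proxy-init -noregen -voms fermilab:/fermilab/minos \
      -userconf ${HOME}/.glite/vomses
)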
######## # FARM # ######## SRV1> ln -s /minos/data/minfarm/maint/farmgsum FGS SRV1> ./farmgsum | tee FGS/sum.2008050510 Mon May 5 10:57:31 CDT 2008 grep -A 9999 'WRITING to DCache 1021' cedar_phy_bhcurvmcnear.log \ | grep SRMCP | wc -l 1496 ########### # ROUNDUP # ########### Added STOPPER function, bailing out for any of GDM=/minos/data/minfarm STOPFILES=" ${GDM}/roundup/STOP ${GDM}/roundup/STOP.${REL}${DETPAR} ${GDM}/roundup/STOP.${REL}${DETPAR}${SEL} " Testing this, and express writing : There are 2002 files to write, WRITING to DCache 981 WRITING to DCache 1021 AFSS/roundup.20080502 -n -s F00037835_0007 -r cedar_phy far OK adding F00037835_0007.all.cand.cedar_phy.0.root 1 OK adding F00037835_0007.spill.bcnd.cedar_phy.0.root 1 OK adding F00037835_0007.spill.cand.cedar_phy.0.root 1 AFSS/roundup.20080502 -s F00037835_0007 -r cedar_phy far SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00037835_0007.all.cand.cedar_phy.0.root /pnfs/minos/reco_far/cedar_phy/cand_data/2007-04 RequestFileStatus#-2144959352 failed with error:[ at Mon May 05 11:31:39 CDT 2008 state Failed : user has no permission to write into path /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy/cand_data/2007-04 ] Corrected CCOP logic added print of proxy AFSS/roundup.20080502 -w -S -s F00037835_0007.all -r cedar_phy far Looks OK, dds /pnfs/minos/reco_far/cedar_phy/cand_data/2007-04 -rw-r--r-- 1 rubin numi 131802778 May 5 12:06 F00037835_0007.all.cand.cedar_phy.0.root Write the other 2 AFSS/roundup.20080502 -w -S -s F00037835_0007 -r cedar_phy far Looks OK, let's try another subrun AFSS/roundup.20080502 -s F00037835_0008 -r cedar_phy far SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00037835_0008.all.cand.cedar_phy.0.root /pnfs/minos/reco_far/cedar_phy/cand_data/2007-04 CCOP ls: /minos/data/reco_far/cedar_phy/cand_data/2007-04/F00037835_0008.all.cand.cedar_phy.0.root: No such file or directory Oops, need to stop using CCDEST, as cand's are not copied. Instead, use ls -lL and the local file path. AFSS/roundup.20080502 -s F00037835_0008 -r cedar_phy far PURGE FARM F00037835_0008.all.cand.cedar_phy.0.root cat: /export/stage/minfarm/ROUNDUP/READ/F00037835_0008.all.cand.cedar_phy.0.root: No such file or directory that's ok, no harm no foul, this was due to problems on the first pass. AFSS/roundup.20080502 -n -s F00037835*cand -r cedar_phy far Nope, this did not find any files Try a safer selection AFSS/roundup.20080502 -n -s F00037835 -r cedar_phy far PEND - have 17/24 subruns for F00037835_*.all.sntp.cedar_phy.0.root 3 05/02 11:36 0 17 So this looks like good place to work up for global cand processing. AFSS/roundup.20080502 -s F00037835 -r cedar_phy far If this looks OK, put -s nd.cedar into corral. SRV1> AFSS/roundup.20080502 -n -s nd.cedar -r cedar far ########## # CONDOR # ########## my glideafs tests stopped getting the KCA proxy because I continued running the vanilla glideafs.run, failed to switch to glideafsp.run 256 running glideings this morning, most started up at around 03:06 through 04:00 Last glidein to run was at 09:10. Probably due to reported fngp-osg problems. > fgannounce ######## # FARM # ######## Current cands for reco, should remove and avoid. Will take care of this first thing Monday, taking care not to remove files pending in writes to dcache. This should clear 2 TB of space pretty quickly. 
MINOS26 > ls -d /minos/data/mcout_data/*/*/*/*/cand_data /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/cand_data /minos/data/mcout_data/daikon_04/L010185N_helium/near/cedar_phy_bhcurv/cand_data /minos/data/mcout_data/daikon_04/L250200N/near/cedar_phy_bhcurv/cand_data MINOS26 > du -sm /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/cand_data 645847 /minos/data/mcout_data/daikon_04/L010185N_charm/near/cedar_phy_bhcurv/cand_data MINOS26 > du -sm /minos/data/mcout_data/daikon_04/L010185N_helium/near/cedar_phy_bhcurv/cand_data 304494 /minos/data/mcout_data/daikon_04/L010185N_helium/near/cedar_phy_bhcurv/cand_data MINOS26 > du -sm /minos/data/mcout_data/daikon_04/L250200N/near/cedar_phy_bhcurv/cand_data 2208 /minos/data/mcout_data/daikon_04/L250200N/near/cedar_phy_bhcurv/cand_data MINOS26 > du -sm /minos/data/reco_near/cedar/cand_data 601568 /minos/data/reco_near/cedar/cand_data MINOS26 > du -sm /minos/data/reco_far/cedar/cand_data 292934 /minos/data/reco_far/cedar/cand_data -rw-rw-r-- 1 42411 e875 112406206 May 4 03:58 N00014130_0013.cosmic.cand.cedar.0.root -rw-rw-r-- 1 42411 e875 504357789 May 4 03:58 N00014130_0013.spill.cand.cedar.0.root -rw-rw-r-- 1 42411 e875 138087890 May 3 23:36 F00040838_0004.all.cand.cedar.0.root -rw-rw-r-- 1 42411 e875 33487392 May 3 23:36 F00040838_0004.spill.cand.cedar.0.root ============================================================================= 2008 05 02 ######## # FARM # ######## Date: Fri, 02 May 2008 12:14:21 -0500 From: Howard Rubin FYI after a false start (mixed-up directory structure) there is now an rsync backup in place for the following (recursed) directories: /grid/app/minos/scripts /minos/data/minfarm/lists loonexe /minos/data/minfarm/farmtest/lists loonexe The current size on AFS is < 90M Date: Fri, 02 May 2008 12:23:22 -0500 Just for documentation purposes, the script doing the rsync backup is run by cron at 02:00 and 14:00. The (very simple) script is /grid/app/minos/scripts/farm_backup. Right now there is a -v option which I'll remove after I see it run a couple of times via cron. Matt, your script directory is excluded because rsync won't let me copy it. Art, do you know how to get around this? Actually, it's probably because I keep the existing permissions, and it doesn't like me trying to write files to AFS with someone else's ownership. I'll check the man pages for options to override this. ########### # ROUNDUP # ########### roundup.20080502 Reviewing all local files, so we can run on other hosts First manual scan, then iterate for export HOME ROUNTMP=/export/stage/minfarm/ROUNDUP SOCFILE=/export/stage/minfarm/.grid/samdbs_prd PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin SRM_CONF=/export/stage/minfarm/.srmconfig/config.xml mkdir -p /tmp/minfarm/ROUNDUP ROOTREL=`grep " ${REL} " ROOTRELS | cut -f 1 -d ' '` ROOTSYS="/farm/minsoft2/Minossoft/ROOT/rootv5.12.00e" . /home/minfarm/scripts/setup_minossoft_R1_18_4.sh R1.18.4 . /export/osg/grid/setup.sh This should be /usr/local/vdt/setup.sh /export/osg/grid -> /usr/local/vdt ${ROUNTMP}/${CAT}ECRC/${FILE} ${ROUNTMP}/SUPPRESSED DFILES=/tmp/minfarm/ROUNDUP/DFILES.${REL}.${DETPAR}.${$} DFILESL=/tmp/minfarm/ROUNDUP/DFILESL.${REL}.${DETPAR}.${$} SAMDUPS=`${HOME}/scripts/samdup -s "${SEL}" ${INDIR}` SAMSUBS=`${HOME}/scripts/samsub ${INDIR} | grep "${SEL}"` ROUNTMP/DFARM review the usage of this . Why is is here ? It needs purging, 78K files ! 
SAM ${ECHO} scripts/saddreco SLOG=${HOME}/ROUNTMP/LOG/saddreco/${MCREL}/${REL}/${DET}_${CONF}.log entry PENDLOG=${ROUNTMP}/${CAT}LOG/${REL}${DETPAR}.pend mkdir -p ${ROUNTMP}/${CAT}LOG/${YEMON} mkdir -p ${ROUNTMP}/${CAT}HADDLOG/${YEMON} PIFL=${ROUNTMP}/${REL}${DETPAR}.pid ${ROUNTMP}/${CAT}LOG/${YEMON}/${REL}${DETPAR}.log 2>&1 & ... quoted ${SEL} in samdups call ############ # PREDATOR # ############ Last NearDCS file was N080415_000002.mdcs.root Wed Apr 16 10:13:59 UTC 2008 ######## # FARM # ######## GRRRRRRRRRR roundup seems to have been running twice, what happened to the PID check ? SRV1> pwd /export/stage/minfarm/ROUNDUP/LOG/2008-05 SRV1> less cedar_phy_bhcurvmcnear.log OK - processing /minos/data/minfarm/mcnearcat version 20080501 SELECT files containing charm Thu May 1 17:01:35 CDT 2008 ... OK - processing /minos/data/minfarm/mcnearcat version 20080501 SELECT files containing charm Fri May 2 01:34:31 CDT 2008 OK adding n13037026_0002_L010185N_D04_charm.cand.cedar_phy_bhcurv.0.root 1 -rw-rw-r-- 1 minospro numi 767436554 Apr 30 12:05 /minos/data/minfarm/WRITE/n13037026_0002_L010185N_D04_charm.cand.cedar_phy_bhcurv.0.root ls: n13037025_0027_L010185N_D04_charm.cand.cedar_phy_bhcurv.0.root: No such file or directory ... OK - stream L010185N_D04_charm.cand.cedar_phy_bhcurv OK - 112465 Mbytes in 6 runs /home/minfarm/scripts/roundup: line 665: ((: SSIF = : syntax error: operand expected (error token is " ") WRITING to DCache 981 WRITING to DCache 981 OK - creating /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N_charm/cand_data/700 OK - creating /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N_charm/cand_data/700 srm client error: credential remaining lifetime is less then a minute ... WRITE rate 0 Mbytes/second Fri May 2 08:11:54 CDT 2008 SADD less +F /home/minfarm/ROUNTMP/LOG/saddreco/daikon_04/cedar_phy_bhcurv/near_L010185N_charm.log Fri May 2 08:11:54 CDT 2008 SRV1> voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. 
Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Masaki Ishitsuka/USERID=mishi/CN=proxy issuer : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Masaki Ishitsuka/USERID=mishi identity : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Masaki Ishitsuka/USERID=mishi type : proxy strength : 512 bits path : /tmp/x509up_u10871 timeleft : 0:00:00 === VO fermilab extension information === VO : fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Masaki Ishitsuka/USERID=mishi issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov attribute : /fermilab/Role=NULL/Capability=NULL timeleft : 0:00:00 SRV1> voms-proxy-destroy ######### # PROXY # ######### Testing alternate proxy locations, and utilities kx509 kxlist -p mv /tmp/x509up_u10871 /tmp/x509upalt export X509_USER_PROXY=/tmp/x509upalt voms-proxy-info timeleft : 167:26:04 ============================================================================= 2008 05 01 ######## # FARM # ######## Summarizing /minos/data/minfarm/*cat 206 9993 nearcat 1199 7764 farcat 9345 2249798 mcnearcat 0 1 mcfarcat 0 1 mcfmockcat 542 4 WRITE 11292 2267561 TOTAL files, GBytes nearcat 111 3171 cosmic.sntp.cedar.0.root 95 7304 spill.sntp.cedar.0.root farcat 108 4295 all.sntp.cedar.0.root 23 568 all.sntp.cedar_phy_bhcurv.0.root 546 905 spill.bmnt.cedar_phy_bhcurv.0.root 134 993 spill.bntp.cedar.0.root 23 102 spill.bntp.cedar_phy_bhcurv.0.root 208 509 spill.mrnt.cedar_phy_bhcurv.0.root 134 690 spill.sntp.cedar.0.root 23 69 spill.sntp.cedar_phy_bhcurv.0.root mcnearcat 2787 1888751 cand.cedar_phy_bhcurv.0.root 280 161232 cand.cedar_phy_bhcurv.1.root 2787 65568 mrnt.cedar_phy_bhcurv.0.root 314 6080 mrnt.cedar_phy_bhcurv.1.root 37 1480 mrnt.cedar_phy_bhcurv.root 2787 210628 sntp.cedar_phy_bhcurv.0.root 314 21024 sntp.cedar_phy_bhcurv.1.root 37 4296 sntp.cedar_phy_bhcurv.root mcfarcat mcfmockcat WRITE ./roundup -n -s ".cand." -r cedar_phy_bhcurv mcnear ... ZAPPING BAD n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047133_0022_L010185N_D04.0 136 2008-03-23 01:23:45 caf1640 OK - processing 3473 files OK - stream L010185N_D04.cand.cedar_phy_bhcurv OK - 960171 Mbytes in 88 runs OK adding n13037233_0001_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 OK adding n13037233_0004_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 Updated roundup to check pass number AFSS/roundup.20080501 -n -s n13047133_0022_L010185N_D04 -r cedar_phy_bhcurv mcnear OK - 524 Mbytes in 1 runs BADRUNS n13047133_0020_L010185N_D04.cand.cedar_phy_bhcurv.1.root +BADRUNS+ n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root OK adding n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 SRV1> ls /minos/data/minfarm/mcnearcat | grep charm | wc -l 2613 SRV1> ls /minos/data/minfarm/mcnearcat | grep helium | wc -l 2757 SRV1> ls /minos/data/minfarm/mcnearcat | grep -v charm | grep -v helium | wc -l 5394 Started manual concatenation of charm, after ugrades to samdup and roundup ./roundup -s charm -r cedar_phy_bhcurv mcnear ########### # ROUNDUP # ########### roundup.20080501 Added pass number to BAD file check, from field 4 or 5 of MC or Raw data Development : ./roundup -n -s ".cand." -r cedar_phy_bhcurv mcnear ... 
ZAPPING BAD n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root n13047133_0022_L010185N_D04.0 136 2008-03-23 01:23:45 caf1640 OK - processing 3473 files OK - stream L010185N_D04.cand.cedar_phy_bhcurv OK - 960171 Mbytes in 88 runs OK adding n13037233_0001_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 OK adding n13037233_0004_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 AFSS/roundup.20080501 -n -s n13047133_0022_L010185N_D04 -r cedar_phy_bhcurv mcnear OK - 524 Mbytes in 1 runs BADRUNS n13047133_0020_L010185N_D04.cand.cedar_phy_bhcurv.1.root +BADRUNS+ n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root OK adding n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 Updated roundup to check pass number, and to clean up bad_runs select SRV1> cp -a AFSS/roundup.20080501 . SRV1> ln -sf roundup.20080501 roundup # was roundup.20080422 SRV1> AFSS/roundup.20080501 -n -s n13047133_0022_L010185N_D04 -r cedar_phy_bhcurv mcnear OK adding n13047133_0022_L010185N_D04.cand.cedar_phy_bhcurv.1.root 1 PEND - have 1/30 subruns for n13047133_*_L010185N_D04.mrnt.cedar_phy_bhcurv.1.root 1 04/29 19:59 0 1 ########## # SAMDUP # ########## samdup.20080501 Added -s "SEL" file selection, to speed up tests MINOS26 > time ./samdup.20080501 /minos/data/minfarm/nearcat real 0m14.230s user 0m1.392s sys 0m0.264s real 0m4.613s real 0m4.568s MINOS26 > time ./samdup.20080501 /minos/data/minfarm/nearcat -s N00012941_0001 real 0m1.235s MINOS26 > time ./samdup.20080501 /minos/data/minfarm/farcat real 1m1.725s real 0m21.079s real 0m21.105s MINOS26 > time ./samdup.20080501 /minos/data/minfarm/farcat -s F00031874_0000 real 0m1.247s roundup run before the fix, AFSS/roundup.20080501 -n -s charm -r cedar_phy_bhcurv mcnear Thu May 1 16:11:48 CDT 2008 Thu May 1 16:41:57 CDT 2008 After the fix, a quick test, AFSS/roundup.20080501 -n -s n13037017 -r cedar_phy_bhcurv mcnear Thu May 1 16:56:01 CDT 2008 Thu May 1 16:56:58 CDT 2008 SRV1> ln -sf roundup.20080501 roundup # was roundup.20080422 ######## # FARM # ######## Date: Wed, 30 Apr 2008 23:47:45 -0500 From: Howard Rubin Can you suggest a backed-up AFS volume that I can rsynch my bookkeeping files to? Right now the entire data area is 74M and the scripts are 9M. NDDIRS=`ls $MINOS_DATA | grep -v '^d..$' | grep -v '^d...$'` for DIR in $NDDIRS ; do printf "${DIR} " ; fs listquota ${MINOS_DATA}/${DIR} | grep -v Volume | tr -s ' ' | cut -f 2 -d ' ' ; done beam_data 50000000 beam_data1 50000000 beam_data2 5000 crl_data 2000000 1% database_dumps 5000000 0% db_cache 2000000 0% farm_logs 50000000 1% farm_mclogs 50000000 4% log_data 8000000 88% logbook 2000000 1% offline_monitor 8000000 5% release_data 8000000 0% validation 8000000 2% MINOS26 > cp -vax log_data/CFL release_data/CFL MINOS26 > diff -r log_data/CFL release_data/CFL MINOS26 > pwd /afs/fnal.gov/files/home/room1/kreymer/minos MINOS26 > rm CFL MINOS26 > ln -s /afs/fnal.gov/files/data/minos/release_data/CFL CFL # CONDOR # Killed running glideins, for a fresh glidein with my proxy condor_rm 107199 condor_rm 107201 condor_submit glide.run condor_q gfactory kreymer ============================================================================= 2008 04 30 ########## # CONDOR # ########## glidemachine.run - runs on fnpc333, trick to get a new glidein, perhaps with glexec this is still pending - ####### # KCA # ####### http://security.fnal.gov/pki/newkcafaq.html feedback to jklemenc, x3311 cc: minos-admin Q What is a KCA server A You use it to get a Grid certificate based on your Kerberos ticket. 
Q What is not affected A your kerberos ticket ssh or rsh access to accounts dcache access via dcap Q What are possible common uses of KCA certs A Web browser access to DocDB Helpdesk Computing Division personnel information Grid proxies used internally by some Condor systems Q Why is 'been' spelled 'bene' in the answer to FAQ A 1.8 A To confuse the users ? ######## # DATA # ######## jdejong - space for rev field ND cosmic analysis ( 150-230 GB ) in $MINOS_DATA MINOS26 > mkdir /minos/data/analysis/nonap MINOS26 > chmod 775 /minos/data/analysis/nonap MINOS26 > chgrp e875 /minos/data/analysis/nonap ######## # FARM # ######## See notes under 2008/04/29 regarding removal of files. Completed SAM retirement of these files. ########### # ROUNDUP # ########### Investigating issues with cedar_phymcnear.log /home/minfarm/scripts/roundup: line 389: [: missing `]' and HAVE n13037413__L010185N_D04.mrnt.cedar_phy_bhcurv.0.root:30 grep: /minos/data/minfarm/lists/bad_runs_mc.cedar_phy: No such file or directory ============================================================================= 2008 04 29 ######## # FARM # ######## Proceeding with the removal of D04MCNEAR bad field files cd scripts/AFSS/d4clean mkdir /pnfs/minos/BAD/D4CLEAN Need to mangle PNFS path to DATA path /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/700 2 3 4 5 6 7 8 9 10 to /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data/700 7 8 6 5 9 10 MINOS26 > ./filemove L250200N Processing beam L250200N Processed 67 files MINOS26 > ./filemove L010185N Processing beam L010185N PNFS CLEANUP as rubin on fnpcsrv1, move the PNFS files rubin@SRV1> pwd /home/minfarm/scripts/AFSS/d4clean SRV1> time . L250200N.movepnfs real 0m11.595s user 0m0.066s sys 0m0.825s SRV1> date Tue Apr 29 15:36:15 CDT 2008 SRV1> wc -l L010185N.movepnfs 75868 L010185N.movepnfs 08:56 { time . L010185N.movepnfs ; } 2>&1 | tee L1010185N.logpnfs Wed Apr 30 08:58:06 CDT 2008 real 53m16.213s user 0m19.237s sys 4m55.445s DATA CLEANUP as minfarm on fnpcsrv1, movedata . ./L250200N.movedata Oops, kreymer had made some directories, remove them and try again . ./L250200N.movedata 2>&1 | tee L250200.logdata2 date ; { time . L010185N.movedata ; } 2>&1 | tee L1010185N.logdata real 35m44.553s user 0m7.401s sys 1m39.938s SRV1> du -sm /minos/data/BAD/D4CLEAN/mcout_data/daikon_04/* 1364232 /minos/data/BAD/D4CLEAN/mcout_data/daikon_04/L010185N 8729 /minos/data/BAD/D4CLEAN/mcout_data/daikon_04/L250200N SAM disabling : OPW=... setup oracle_client v10_1_0_2_0b sqlplus samdbs/${OPW}@minosprd Commands will look like this UPDATE DATA_FILES SET RETIRED_DATE = SYSDATE WHERE FILE_NAME IN ( 'realfilenames', 'FLINTSTONE,FRED') AND RETIRED_DATE IS NULL ; COMMIT Created retirehead.sql retiretail.sql with the head and tail of these, just needing the quoted file lists. Created filesql script to make ${BEAM}.sqlf file lists ... 
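filesql itself is not reproduced in this log. A minimal sketch of the quoting step, assuming a plain list of file names as input ( ${BEAM}.files is a made-up name, and the comma handling between retirehead.sql and retiretail.sql is glossed over ) :
BEAM=L250200N
# wrap each file name in single quotes with a trailing comma,
# ready to paste into the IN ( ... ) list between the head and tail SQL
sed "s/^.*$/'&',/" ${BEAM}.files > ${BEAM}.sqlf
wc -l ${BEAM}.sqlf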
interruption for full disk, described below 2008 04 30 time ./filesql L010185N Processing beam L010185N Processed 18911 files real 7m36.646s user 0m29.815s sys 3m12.462s MIN > wc -l L010185N.sqlf 18911 L010185N.sqlf cp L250200N.sqlf L250200N.sql nedit L250200N.sql & Testing with one file cp L250200N.sql onefile.sql MINOS26 > sam locate n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root ['/pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L250200N/cand_data/700,193@voa102'] MINOS26 > sam list files --dim='FILE_NAME n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root' Files: n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root File Count: 1 Average File Size: 1.08GB Total File Size: 1.08GB Total Event Count: 800 sqlplus samdbs/${OPW}@minosprd SQL> @onefile.sql 1 row updated. @onefile.sql SQL> @onefile.sql 0 rows updated. MINOS26 > sam locate n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root Datafile with name 'n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root' not found. Try file id: 2138767 MINOS26 > sam list files --dim='FILE_NAME n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root' No files match the given constraints. MINOS26 > sam get metadata --fileId=2138767 ImportedSimulatedFile({ 'fileName' : 'n13037004_0000_L250200N_D04.cand.cedar_phy_bhcurv.root', ... MINOS26 > sqlplus samdbs/${OPW}@minosprd SQL> @L250200N.sql 66 rows updated. For L010185N, 18911 files, need 23 files, N=0 while [ ${N} -lt 24 ] ; do (( K = ( N * 800 ) + 1 )) (( L = K + 799 )) printf "${N} ${K} ${L}\n" cat retirehead.sql > L010185N${N}.sql tail +${K} L010185N.sqlf | head -800 >> L010185N${N}.sql cat retiretail.sql >> L010185N${N}.sql (( N++ )) done These files look reasonable. MINOS26 > date Wed Apr 30 11:02:05 CDT 2008 MINOS26 > ${HOME}/minos/bin/rlwrap sqlplus samdbs/${OPW}@minosprd SQL> @L010185N0.sql 800 rows updated. ... SQL> @L010185N23.sql 511 rows updated. MINOS26 > date Wed Apr 30 11:04:09 CDT 2008 ( 2008 04 29 ) GRRRRRRRRRRRRRRRRRRRRRRRRRRRR Once again ran out of AFS quota du -sm * | sort -n ... 16 OSF1 21 msql 36 IRIX 36 msqldata 57 isajet 134 minos MIN > ls -alF IRIX/webmaker/ total 22921 drwxr-xr-x 8 kreymer kreymer 10240 Jun 3 1996 ./ drwxr-xr-x 8 kreymer kreymer 2048 May 9 1997 ../ lrwxr-xr-x 1 kreymer kreymer 48 May 16 1996 FrameMaker -> /afs/fnal.gov/products/UNIX/frame/v5_1/bin/maker* drwxr-xr-x 3 kreymer kreymer 2048 May 15 1996 doc/ drwxr-xr-x 3 kreymer kreymer 2048 May 15 1996 examples/ drwxr-xr-x 2 kreymer kreymer 2048 May 15 1996 lib/ drwxr-xr-x 5 kreymer kreymer 2048 Jun 4 1996 misc/ drwxr-xr-x 2 kreymer kreymer 2048 May 15 1996 patches/ -rw-r--r-- 1 kreymer kreymer 14242 May 10 1996 readme.txt -rw-r--r-- 1 kreymer kreymer 958 May 10 1996 support.txt drwxr-xr-x 2 kreymer kreymer 2048 Jun 5 1996 ups/ -rwxr-xr-x 1 kreymer kreymer 23429120 Apr 26 1996 webmaker* -rw-r--r-- 1 kreymer kreymer 17 May 16 1996 webmaker-2-unix.reg MINOS26 > mkdir -p /minos/data/users/kreymer/AFS/IRIX MINOS26 > cp -ax IRIX/webmaker /minos/data/users/kreymer/AFS/IRIX/webmaker gzipped a few logs minos/log/top_20050714.log minos/log/saddmc_old/* Started all over generating the L010185N file moves 16:52 MINOS26 > time ./filemove L010185N Processing beam L010185N Processed 18911 files real 12m53.838s user 0m27.966s sys 4m27.664s Putting further notes in-line above, for legibility. 
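The PNFS to /minos/data path mangling noted under 2008 04 29 above can be rendered as a one-liner. This is just an illustration of the field reordering, not the actual filemove code :
PNFSDIR=/pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/700
# with -F/ the fields are $2=pnfs $3=minos $4=mcout_data $5=cedar_phy_bhcurv
#                         $6=near $7=daikon_04 $8=L010185N $9=cand_data $10=700
DATADIR=`echo ${PNFSDIR} | awk -F/ '{ printf "/minos/data/%s/%s/%s/%s/%s/%s/%s\n", $4, $7, $8, $6, $5, $9, $10 }'`
echo ${DATADIR}
# -> /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data/700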
============================================================================= 2008 04 28 Condor - glideafs.run 14:01 - added PROXY statement /local/scratch25/kreymer/grid/kreymer.proxy - links to .2008042815, nonesuch 15:24 - corrected link back to 2008042813 ln -sf kreymer.proxy.2008042813 /local/scratch25/kreymer/grid/kreymer.proxy Scanning logs under -rw-r--r-- 1 kreymer g020 5012 Apr 28 13:30 logs/glideafs/probe.106153.0.out -rw-r--r-- 1 kreymer g020 4487 Apr 28 13:47 logs/glideafs/probe.106154.0.out -rw-r--r-- 1 kreymer g020 4487 Apr 28 13:50 logs/glideafs/probe.106156.0.out -rw-r--r-- 1 kreymer g020 4487 Apr 28 14:00 logs/glideafs/probe.106157.0.out -rw-r--r-- 1 kreymer g020 4621 Apr 28 14:04 logs/glideafs/probe.106158.0.out -rw-r--r-- 1 kreymer g020 4621 Apr 28 14:10 logs/glideafs/probe.106159.0.out -rw-r--r-- 1 kreymer g020 4621 Apr 28 14:20 logs/glideafs/probe.106160.0.out -rw-r--r-- 1 kreymer g020 4621 Apr 28 14:31 logs/glideafs/probe.106161.0.out -rw-r--r-- 1 kreymer g020 4621 Apr 28 14:40 logs/glideafs/probe.106162.0.out -rw-r--r-- 1 kreymer g020 0 Apr 28 14:50 logs/glideafs/probe.106163.0.out -rw-r--r-- 1 kreymer g020 0 Apr 28 15:00 logs/glideafs/probe.106165.0.out -rw-r--r-- 1 kreymer g020 0 Apr 28 15:10 logs/glideafs/probe.106167.0.out -rw-r--r-- 1 kreymer g020 0 Apr 28 15:20 logs/glideafs/probe.106169.0.out -rw-r--r-- 1 kreymer g020 6019 Apr 28 15:58 logs/glideafs/probe.106171.0.out -rw-r--r-- 1 kreymer g020 6008 Apr 28 15:58 logs/glideafs/probe.106176.0.out -rw-r--r-- 1 kreymer g020 6008 Apr 28 15:58 logs/glideafs/probe.106180.0.out -rw-r--r-- 1 kreymer g020 4626 Apr 28 16:00 logs/glideafs/probe.106185.0.out -rw-r--r-- 1 kreymer g020 10595 Apr 25 15:20 logs/glideafs/probe.78296.0.out -rw-r--r-- 1 kreymer g020 9721 Apr 25 15:30 logs/glideafs/probe.78297.0.out -rw-r--r-- 1 kreymer g020 9721 Apr 25 15:40 logs/glideafs/probe.78302.0.out -rw-r--r-- 1 kreymer g020 10863 Apr 25 15:50 logs/glideafs/probe.84485.0.out -rw-r--r-- 1 kreymer g020 10863 Apr 25 16:00 logs/glideafs/probe.91503.0.out -rw-r--r-- 1 kreymer g020 10586 Apr 25 16:10 logs/glideafs/probe.97891.0.out grep identity logs/glideafs/probe.106157.0.out # 13:30 default proxy identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy grep identity logs/glideafs/probe.106158.0.out # 14:31 with KCA proxy ? identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy grep identity logs/glideafs/probe.106171.0.out # 15:58 fixed KCA proxy ? identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy MINOS25 > grep identity logs/glideafs/probe.*.0.out logs/glideafs/probe.102398.0.out:identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy ... logs/glideafs/probe.106157.0.out:identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy logs/glideafs/probe.106158.0.out:identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy ... It is clear that my proxy started getting passed as soon as it was specified. I'm not sure why I got confused earlier yesterday. ####### # SIM # ####### 14:10 Sent email to deb4 asking for confirmation of the file list for D04 MD reprocessing. 
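Re the dangling kreymer.proxy link earlier today, a quick pre-submit check would catch this. Illustrative only, using the same path and tools already in use here :
PROXY=/local/scratch25/kreymer/grid/kreymer.proxy
# a dangling symlink fails the -r test
[ -r ${PROXY} ] || echo "OOPS - ${PROXY} does not resolve"
# report the remaining lifetime of whatever the link points to
X509_USER_PROXY=${PROXY} voms-proxy-info | grep timeleft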
########## # DC2NFS # ########## First check state of sntp's in DCache for DIR in `ls /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data` ; do echo $DIR ; ./stage -d -p 0 reco_near/cedar_phy_bhcurv/sntp_data/${DIR} ; done \ | grep '^200\|Needed' for DIR in `ls /pnfs/minos/reco_far/cedar_phy_bhcurv/sntp_data` ; do echo $DIR ; ./stage -d -p 0 reco_far/cedar_phy_bhcurv/sntp_data/${DIR} ; done \ | grep '^200\|Needed' None of these needed staging. 2008 04 29 plan of action : minfarm@fnpcsrv1 DATA=reco_near/cedar_phy_bhcurv/sntp_data NDIRS=`ls /pnfs/minos/${DATA}` AFSS/dc2nfs.20080118 -n -d ${DATA} AFSS/dc2nfs -d reco_near/cedar_phy_bhcurv/sntp_data 2>&1 | \ tee /minos/scratch/log/dc2nfs/cpbnear.log ########## # AUTOFS # ########## To check mounts, cat /etc/auto.master ypcat auto.master ####### # LSF # ####### ----------------------------------------------------------- Date: Mon, 28 Apr 2008 08:42:41 +0100 From: David John Auty To: minos_software_discussion@fnal.gov Subject: lsf I can't submit to the lsf queue at the moment any idea's? ----------------------------------------------------------- MINOS26 > bjobs batch system daemon not responding ... still trying MINOS26 > date Mon Apr 28 08:41:07 CDT 2008 MINOS26 > bjobs batch system daemon not responding ... still trying No tickets, submitting helpdesk ticket for NODE in flxi02 flxi03 flxi04 flxi05 fsui03 ; do printf "${NODE} "; ssh -ax ${NODE} ". /usr/local/etc/setups.sh ; setup lsf" ; done flxi02 /tmp/filefh7jcc: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh: No such file or directory flxi03 /tmp/filejv2bhJ: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh: No such file or directory flxi04 bash: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh: No such file or directory flxi05 /tmp/fileayQG7D: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh: No such file or directory fsui03 bash: /home/room1/lsf/v6_1/conf/profile.lsf.sh: Permission denied Date: Mon, 28 Apr 2008 09:03:26 -0500 (CDT) Subject: HelpDesk ticket 114816 ___________________________________________ Short Description: FNALU Batch system is not responding Problem Description: fnalu-admin : Since at least about 02:40 this morning, it seems that the FNALU Batch system has been unavailable. Here is a recent test : MINOS26 > date Mon Apr 28 08:41:07 CDT 2008 MINOS26 > bjobs batch system daemon not responding ... still trying batch system daemon not responding ... still trying It seems that the /home/room1/lsf configuration files are missing : for NODE in flxi02 flxi03 flxi04 flxi05 fsui03 ; do printf "${NODE} "; ssh -ax ${NODE} ". /usr/local/etc/setups.sh ; setup lsf" done flxi02 /tmp/filefh7jcc: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh:No such file or directory flxi03 /tmp/filejv2bhJ: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh:No such file or directory flxi04 bash: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh: No suchfile or directory flxi05 /tmp/fileayQG7D: line 21: /home/room1/lsf/v6_1/conf/profile.lsf.sh:No such file or directory fsui03 bash: /home/room1/lsf/v6_1/conf/profile.lsf.sh: Permission denied ___________________________________________ Noticed that /home is automounted from fsun02 fsui03 > ypcat auto.home -rw,hard,intr fsun02:/export/home Updated MINOS status at https://computing.fnal.gov/cdsystemstatus/customersupport/inpform.htm ___________________________________________ Date: Mon, 28 Apr 2008 11:38:00 -0500 (CDT) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. 
___________________________________________ Wayne is on furlough this week. Please reassign this to somebody else. Thanks ! ___________________________________________ Date: Mon, 28 Apr 2008 12:51:29 -0500 (CDT) From: Margaret_Greaney Art, I've been in FCC1 all morning working on fsun02. I will take a look at this helpdesk ticket now. ___________________________________________ Date: Mon, 28 Apr 2008 13:05:49 -0500 (CDT) fsun02 serves the lsf home area and it was up then down twice this morning and now it is down again. It has a bad cpu board. The vendor is ordering a replacement part and we hope to swap it out tomorrow. Also, our console server to that node was down this moring, and we needed to get that up before we could access fsun02. ___________________________________________ Date: Mon, 28 Apr 2008 16:18:31 -0500 (CDT) This ticket has been reassigned to GREANEY, MARGARET of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Tue, 29 Apr 2008 11:37:14 -0500 (CDT) D1 replaced the cpu board on fsun02 and now lsf home area is again being served. Batch nodes were checked and minos users notified. ___________________________________________ ########## # CONDOR # ########## The bspeak job backlog cleared. ============================================================================= 2008 04 25 ########## # CONDOR # ########## bspeak submitted over 20K jobs at about 4/25 15:44 ########## # CONDOR # ########## ANODES='339 340 341 342 343 344 345 346' for NODE in ${ANODES} ; do printf "fnpc${NODE} " ; ssh -ax fnpc${NODE} ls -ld /afs ; done fnpc339 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc340 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc341 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc342 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc343 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc344 drwxrwxrwx 2 root root 4096 Jan 31 10:57 /afs fnpc345 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs fnpc346 drwxrwxrwx 2 stan oss 4096 Jul 12 2007 /afs for NODE in ${ANODES} ; do printf "fnpc${NODE} " ; ssh -ax fnpc${NODE} ls -ld /afs/fnal.gov ; done fnpc339 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc340 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc341 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc342 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc343 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc344 ls: /afs/fnal.gov: No such file or directory fnpc345 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov fnpc346 drwxr-xr-x 3 root root 4096 Dec 10 2004 /afs/fnal.gov for NODE in ${ANODES} ; do printf "fnpc${NODE} " ; ssh -ax fnpc${NODE} rpm -q openafs ; done fnpc339 openafs-1.4.6-58.SL4.x86_64 fnpc340 openafs-1.4.4-46.SL4.x86_64 fnpc341 openafs-1.4.6-58.SL4.x86_64 fnpc342 openafs-1.4.6-58.SL4.x86_64 fnpc343 openafs-1.4.6-58.SL4.x86_64 fnpc344 openafs-1.4.6-58.SL4.x86_64 fnpc345 openafs-1.4.6-58.SL4.x86_64 fnpc346 openafs-1.4.6-58.SL4.x86_64 Fri Apr 25 19:08:55 UTC 2008 Date: Fri, 25 Apr 2008 14:24:09 -0500 (CDT) Subject: HelpDesk ticket 114790 Short Description: fnpc344 AFS mount needed Problem Description: fnpc344 seems to have rebooted around 24 May 16:30 CDT . The AFS file system is not mounted, and is needed by Minos jobs. Please remount AFS. Thanks ! ___________________________________________ This ticket is assigned to TIMM, STEVE of the CD-SF/GF/FGS. ___________________________________________ The reboot was on 24 April, not 24 May. Sorry for the typo. 
Also, note that Steve Timm is on furlough, so this ticket needs to be assigned to someone else. ___________________________________________ Date: Fri, 25 Apr 2008 14:45:39 -0500 (CDT) This ticket has been reassigned to YOCUM, DAN of the CD-SF/GF/FGS Group. ___________________________________________ Date: Mon, 28 Apr 2008 10:33:38 -0500 Resolved Installed correct openafs kernel module and restarted afs service. ___________________________________________ N.B. as a workaround, loiacono and pawloski are selecting && ( Machine != "fnpc344.fnal.gov" ) ============================================================================= 2008 04 24 ####### # SAM # ####### Date: Thu, 24 Apr 2008 09:17:36 -0500 From: Nelly Stanfield To: minosdb-support@fnal.gov Minosdev db is available.?? Oracle's April Quarterly patch completed. ######## # FARM # ######## Making file lists, with associated scripts, in scripts/d4clean MINOS26 > ./filelist L250200N Processing beam L250200N n13037004 n13037004 n13037014 n13037014 Config n1303 Run range 7004 7004 Config n1303 Run range 7014 7014 MINOS26 > wc -l L250200N.locations 67 L250200N.locations MINOS26 > for STR in sntp mrnt cand ; do printf "${STR} " ; grep $STR L250200N.locations | wc -l ; done sntp 6 mrnt 0 cand 61 MINOS26 > ./filelist L010185N Processing beam L010185N n13037140 n13037140 n13037233 n13037233 n13037244 n13037245 n13037250 n13037260 n13037263 n13037470 n13037553 n13037735 n13047013 n13047014 n13047041 n13047100 n13047103 n13047103 n13047106 n13047191 Config n1303 Run range 7140 7140 Config n1303 Run range 7233 7233 Config n1303 Run range 7244 7245 Config n1303 Run range 7250 7260 Config n1303 Run range 7263 7470 Config n1303 Run range 7553 7735 Config n1304 Run range 7013 7014 Config n1304 Run range 7041 7100 Config n1304 Run range 7103 7103 Config n1304 Run range 7106 7191 MINOS26 > wc -l L010185N.locations 18911 L010185N.locations MINOS26 > for STR in sntp mrnt cand ; do printf "${STR} " ; grep $STR L010185N.locations | wc -l ; done sntp 944 mrnt 944 cand 17023 ######## # FARM # ######## SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and FILE_NAME n1303% and RUN_NUMBER >= 7250 and RUN_NUMBER <= 7260 " ./samlocate "${SAMDIM}" Specialize to no-pass and pass 0 PASS='' SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and FILE_NAME n1303%cedar.phy.bhcurv${PASS}.root and RUN_NUMBER >= 7250 and RUN_NUMBER <= 7260 " SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and FILE_NAME n%cedar_phy_bhcurv${PASS}.root and RUN_NUMBER >= 7250 and RUN_NUMBER <= 7260 " Counting pass 1 files already present: SAMDIM=" DATA_TIER cand-near and MC.RELEASE daikon_04 and VERSION cedar.phy.bhcurv and FILE_NAME n%_D04.cand.cedar_phy_bhcurv.1.root " ./samlocate "${SAMDIM}" | sort | wc -l 65 Howie's test reprocessing run produced 66, but one of these was produced with pass 0, n13037251_0028_L010185N_D04.cand.cedar_phy_bhcurv.0.root in /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/725 He will put a copy back into mcnearcat, renamed as .1. , so that I can purge this along with the actual bad data files. ########## # CONDOR # ########## Increased Greg's priority, to allow Laura to glide in ( fair share seems not so fair at present ) condor_userprio -setfactor pawloski@fnal.gov 10000. 
condor_userprio -all Thu Apr 24 11:00:30 CDT 2008 ########## # CONDOR # ########## Date: Wed, 23 Apr 2008 16:42:01 -0500 From: Sfiligoi Igor It seems the changes below were not done on minos25: condor_config_val GLEXEC_STARTER Not defined: GLEXEC_STARTER Without it, we cannot use glexec. ------------------------------------- Date: Thu, 24 Apr 2008 05:20:50 +0000 (UTC) From: Arthur Kreymer To: Sfiligoi Igor Strange, we have the correct GLEXEC content for condor_config.local.master but not condor_config.local These should have been identical ! I am pretty sure that I checked the files at the time of the upgrade. File time stamps are odd, with the incorrect file being the most recently updated . It is as if we had the correct files at 13:40, then something or someone put the old file back 13 minutes later. MINOS25 > ls -l /opt/condor-7.0.1/local total 16 -rw-r--r-- 1 root root 6371 Apr 21 13:53 condor_config.local -rw-r--r-- 1 root root 6590 Apr 21 13:40 condor_config.local.master MINOS25 > ls -l /opt/condor-7.0.1/local.minos25 total 28 -rw-r--r-- 1 root root 6590 Apr 21 13:18 condor_config.local -rw-r--r-- 1 root root 6590 Apr 21 13:19 condor_config.local.master drwxrwxrwt 2 daemon root 4096 Mar 17 16:36 execute drwxr-xr-x 2 daemon root 4096 Mar 17 16:36 log drwxr-xr-x 2 daemon root 4096 Mar 17 16:36 spool Unfortunately the Condor pool is pretty busy at the moment. Is it OK to update local/condor_config.local while the master runs ? If this does no harm, would a restart still be necessary in order to benefit from the changed configuration ? Or do we need to shut down, modify the file, and restart ? ------------------------------------- Date: Thu, 24 Apr 2008 07:11:15 -0500 From: Igor Sfiligoi Yes, all you need is change the files and issue a local condor_reconfig. ------------------------------------- Date: Thu, 24 Apr 2008 09:44:10 -0500 (CDT) Subject: HelpDesk ticket 114713 ___________________________________________ Short Description: Update /opt/condor-7.0.1/local/condor_config.local Problem Description: run2-sys : The content of opt/condor-7.0.1/local/condor_config.local seems to have changed from what was set on Monday at 13:40, going back to a copy of the old file. MINOS25 > ls -l /opt/condor-7.0.1/local total 16 -rw-r--r-- 1 root root 6371 Apr 21 13:53 condor_config.local -rw-r--r-- 1 root root 6590 Apr 21 13:40 condor_config.local.master Please copy the correct content from condor_config.local.master on minos25: cd opt/condor-7.0.1/local cp condor_config.local.master condor_config.local We do not need to pause or restart Condor for this change, so please do this at a time of your convenience. Thanks ! ___________________________________________ Date: Thu, 24 Apr 2008 09:51:50 -0500 (CDT) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ________________________________________________________________ ________________________________________________________________ MINOS25 > condor_reconfig minos25 Sent "Reconfig" command to master minos25.fnal.gov MINOS25 > date Thu Apr 24 10:32:25 CDT 2008 ________________________________________________________________ ________________________________________________________________ Date: Thu, 24 Apr 2008 10:46:38 -0500 My test jobs are running and changing uid to uscms466 as expected. 
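For future reference, the restored settings can be confirmed on minos25 with the same condor_config_val check Igor used, plus a diff of the two local files. A sketch :
diff /opt/condor-7.0.1/local/condor_config.local \
     /opt/condor-7.0.1/local/condor_config.local.master
condor_config_val GLEXEC_STARTER
condor_config_val GLEXEC
# expect GLEXEC_STARTER = True and GLEXEC = /bin/false per condor_config.local.master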
============================================================================= 2008 04 23 ########### # BLUEARC # ########### Attended 10:30 BlueArc Users' Meeting Discussed purchase and upgrade plans for BlueArc servers ######## # FARM # ######## Per email from rubin, with followups, run ranges for reprocessing due to incorrect Bfield seen by grid nodes L250200N n13037004 n13037004 n13037014 n13037014 L010185N n13037140 n13037140 n13037233 n13037233 n13037244 n13037245 n13037250 n13037260 n13037263 n13037470 n13037553 n13037735 n13047013 n13047014 n13047041 n13047100 n13047103 n13047103 n13047106 n13047191 Sample queries use SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L250200N and VERSION cedar.phy.bhcurv and FILE_NAME n1303% and RUN_NUMBER 7004 " SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L010185N and VERSION cedar.phy.bhcurv and FILE_NAME n1303% and RUN_NUMBER >= 7250 and RUN_NUMBER <= 7260 " ./samlocate "${SAMDIM}" ########## # CONDOR # ########## Modified glide.run to set X509USERPROXY = /local/scratch25/kreymer/grid/kreymer.proxy This works ! ########## # CONDOR # ########## Draft version of kproxy, creating user proxy in /local/scratch25/${USERNAME}/grid/${USERNAME}.proxy crontab.minos25 runs this, 07 1-23/2 * * * /usr/krb5/bin/kcron /local/scratch25/grid/kproxy ######### # DOCDB # ######### Registered Phil Adamson for numirw and beamrw groups, pre email request https://minos-docdb.fnal.gov:440/cgi-bin/EmailAdminister Username minos-adm Password ***** ============================================================================= 2008 04 22 ######### # ADMIN # ######### minos04 does not allow ssh logins MIN > ssh -v minos04 OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos04 [131.225.193.4] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 debug1: identity file /home/kreymer/.ssh/id_dsa type -1 ssh_exchange_identification: Connection closed by remote host MIN > date Tue Apr 22 21:30:30 UTC 2008 MIN > rsh minos04 ... MINOS04 > The last sshd messages in /var/log/messages are Apr 22 02:19:19 minos04 sshd(pam_unix)[9569]: session opened for user djauty by djauty(uid=0) Apr 22 02:41:21 minos04 sshd: pam_krb5[10190]: authentication fails for 'djauty' (djauty@FNAL.GOV): Authentication service cannot retrieve authentication info. (Cannot contact any KDC for requested realm) Condor jobs are running, apparently OK. Date: Tue, 22 Apr 2008 16:47:40 -0500 (CDT) Subject: HelpDesk ticket 114615 ___________________________________________ Short Description: ssh logins to minos04 fail Problem Description: run2-sys I cannot ssh to mino04. MIN > ssh -v minos04 OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos04 [131.225.193.4] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 debug1: identity file /home/kreymer/.ssh/id_dsa type -1 ssh_exchange_identification: Connection closed by remote host MIN > date Tue Apr 22 21:30:30 UTC 2008 But kerberized rsh works. 
The last sshd messages in /var/log/messages are : Apr 22 02:19:19 minos04 sshd(pam_unix)[9569]: session opened for user djauty by djauty(uid=0) Apr 22 02:41:21 minos04 sshd: pam_krb5[10190]: authentication fails for 'djauty' (djauty@FNAL.GOV): Authentication service cannot retrieve authentication info. (Cannot contact any KDC for requested realm) ___________________________________________ Date: Wed, 23 Apr 2008 08:22:46 -0500 (CDT) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ___________________________________________ Apr 23 11:34:58 minos04 sshd(pam_unix)[24095]: session opened for user hartnell by hartnell(uid=0) ___________________________________________ ___________________________________________ ############ # SADDRECO # ############ saddreco.new Adding support for pass numbers in MC files, like n13037251_0028_L010185N_D04.cand.cedar_phy_bhcurv.0.root or n13037251_0028_L010185N_D04.cand.cedar_phy_bhcurv.root as opposed to data files like N00013434_0000.spill.sntp.cedar.0.root ####### # SAM # ####### Closed out IT 3538 re station upgrade versus running projects Why was this ticket assigned to user mundim ? Oh well. ######## # FARM # ######## Cleaned up dangling WRITE file pointing to now-deleted /minos/data CC area rm /export/stage/minfarm/ROUNDUP/WRITE/n13037251_0028_L010185N_D04.cand.cedar_phy_bhcurv.0.root ########### # ROUNDUP # ########### roundup.20080422 Uses new CCOP variable to move only non-cand/bcnd files to CC area SRV1> cp AFSS/roundup.20080422 . SRV1> ln -sf roundup.20080422 roundup # was roundup.20080412 SRV1> ${HOME}/scripts/roundup -r cedar_phy_bhcurv mcnear ############ # MCIMPORT # ############ ######## # FARM # ######## checking the disabling of previous passes MINOS26 > SAMDIM='FILE_NAME n13037252_0003_L010185N_D04.cand.cedar_phy_bhcurv%' MINOS26 > sam list files --dim="${SAMDIM}" Files: n13037252_0003_L010185N_D04.cand.cedar_phy_bhcurv.0.root n13037252_0003_L010185N_D04.cand.cedar_phy_bhcurv.1.root File Count: 2 Average File Size: 545.75MB Total File Size: 1.07GB Total Event Count: 1600 This is no good ! ######## # DATA # ######## doing manual full inventory ( should do this regularly ? ) The big problem is nearly 3 TBytes of /minos/data/minfarm/farmtest/mcnearcat MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/L010185N 2027116 /minos/data/mcimport/STAGE/daikon_04/L010185N MINOS26 > du -sm /minos/data/users/* 2 /minos/data/users/bckhouse 17186 /minos/data/users/boehm 1 /minos/data/users/kreymer 10195 /minos/data/users/loiacono 1 /minos/data/users/minsoft 99003 /minos/data/users/pawloski 102630 /minos/data/users/rmehdi 1 /minos/data/users/rustem 12335360 /minos/data/mcout_data 4594802 /minos/data/minfarm 3730545 /minos/data/mcimport 3437784 /minos/data/reco_near 2297204 /minos/data/reco_far 1835532 /minos/data/analysis 268493 /minos/data/mysql 259008 /minos/data/beam_data 229014 /minos/data/users 99901 /minos/data/flux 3437 /minos/data/log_data 1 /minos/data/mindata 1 /minos/data/release_data MINOS26 > du -sm /minos/data/minfarm/* | sort -n ... 
10625 /minos/data/minfarm/nearcat 11919 /minos/data/minfarm/mcfar 23940 /minos/data/minfarm/DUP 163716 /minos/data/minfarm/mcnear 4358441 /minos/data/minfarm/farmtest MINOS26 > du -sm /minos/data/minfarm/farmtest/* | sort -n 45 /minos/data/minfarm/farmtest/logs 72 /minos/data/minfarm/farmtest/mclogs 1865 /minos/data/minfarm/farmtest/neardet 57986 /minos/data/minfarm/farmtest/nearcat_old 1362419 /minos/data/minfarm/farmtest/nearcat 2936050 /minos/data/minfarm/farmtest/mcnearcat ######## # DATA # ######## As minfarm on fnpcsrv1, SRV1> du -sm /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data 8431447 /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data SRV1> ls /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data 701 705 707 709 711 713 715 717 719 724 726 728 730 732 734 736 738 740 742 744 746 755 757 759 761 763 765 767 769 771 773 704 706 708 710 712 714 716 718 723 725 727 729 731 733 735 737 739 741 743 745 747 756 758 760 762 764 766 768 770 772 SRV1> time rm -r /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data real 9m2.200s user 0m0.112s sys 0m3.687s This was done a couple of minutes before Tue Apr 22 11:33:00 CDT 2008 SRV1> du -sm /minos/data/mcout_data 3909216 /minos/data/mcout_data SRV1> df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 21T 20T 792G 97% /minos/data SRV1> date ; df -h /minos/data Tue Apr 22 13:38:18 CDT 2008 Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 21T 20T 1.2T 95% /minos/data Tue Apr 22 14:56:57 CDT 2008 Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 22T 20T 1.5T 94% /minos/data ============================================================================= 2008 04 21 ######## # GRID # ######## fnpc342 - requested AFS restoration at 16:38 Date: Mon, 21 Apr 2008 16:40:03 -0500 (CDT) Subject: HelpDesk ticket 114555 ___________________________________________ Short Description: fnpc342 AFS mount is missing Problem Description: fnpc342 is one of the system which has AFS available, selected by HAVEMINOSAFS as I recall. The AFS mount seems to be missing, causing user jobs to fail. Please remount AFS, and inform minos-admin. Thanks ! _________________________________________________________________ Note To Requester: kreymer@fnal.gov sent this Notes To Requester: Thanks for looking at this ! If AFS cannot be restored quickly to fnpc342, please remove fnpc342 from ISMINOSAFS list This reduces our Grid AFS capacity by only 1/8, and avoid immediate user job failures. $ condor_status slot1@fnpc342 -l | grep ISMINOSAFS ISMINOSAFS = stringlistimember(My.Machine, "fnpc339.fnal.gov, fnpc340.fnal.gov, fnpc341.fnal.gov, fnpc342.fnal.gov, fnpc343.fnal.gov, fnpc344.fnal.gov, fnpc345.fnal.gov, fnpc346.fnal.gov") ___________________________________________________________________ Date: Wed, 23 Apr 2008 15:10:30 -0500 (CDT) Solution: yocum@fnal.gov sent this solution: After some digging around I found the openafs module rpm on the scientificlinux ftp server and installed it. I've informed Troy Dawson that the updated rpm package isn't where it should be in the Fermi Linux directories and he's taking steps to fix this omission. 
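A quick follow-up on fnpc342, along the lines of the fnpc344 checks above, would verify AFS is usable again before closing the ticket ( illustrative, not part of the ticket thread ) :
ssh -ax fnpc342 "ls -ld /afs/fnal.gov ; rpm -q openafs"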
___________________________________________ ########## # CONDOR # ########## 12:51 MINOS25 > condor_off -fast minos25 Sent "Kill-All-Daemons-Fast" command to master minos25.fnal.gov MINOS25 > sudo /etc/init.d/condor stop Shutting down Condor (fast-shutdown mode) Date: Mon, 21 Apr 2008 12:53:48 -0500 (CDT) Subject: HelpDesk ticket 114534 Short Description: minos25 condor upgrade request Problem Description: run2-sys : We are ready to proceed with the Condor 7.0.1 upgrade on minos25. Unlike the Condor upgrades last week, we need a new local configuration file differing from the condor-6.8.6/local/condor_config.local in the addition of GLEXEC_STARTER = True GLEXEC = /bin/false And there appears to be a second copy of local/condor_config.local, named condor_config.local.master . So please do the following ( or equivalent ) on minos25 , then inform minos-admin : cd /opt NEWCONF=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/cond or_config.local.minos25 cp -r ${NEWCONF} condor-7.0.1/local/condor_config.local cp -r ${NEWCONF} condor-7.0.1/local/condor_config.local.master ln -sf condor-7.0.1 condor ___________________________________________ Date: Mon, 21 Apr 2008 13:05:13 -0500 (CDT) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 21 Apr 2008 13:24:18 -0500 (CDT) Note To Requester: ling@fnal.gov sent this Notes To Requester: Done. [root@minos25 ~]# cd /opt [root@minos25 opt]# NEWCONF=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local.minos25 [root@minos25 opt]# cp -r ${NEWCONF} condor-7.0.1/local.minos25/condor_config.local cp: overwrite `condor-7.0.1/local.minos25/condor_config.local'? y [root@minos25 opt]# cp -r ${NEWCONF} condor-7.0.1/local.minos25/condor_config.local.master [root@minos25 opt]# ln -s condor-7.0.1/ condor [root@minos25 opt]# ___________________________________________ Sorry, I should have triple proofread my request. And the ln -s seems not to taken effect, as /opt/condor seems to still point to condor-6.8.6 It should have been : cd /opt NEWCONF=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condor701/condor_config.local.minos25 mkdir condor-7.0.1/local chmod 755 condor-7.0.1/local cp ${NEWCONF} condor-7.0.1/local/condor_config.local cp ${NEWCONF} condor-7.0.1/local/condor_config.local.master ln -sf condor-7.0.1 condor Please try again ! ___________________________________________ I have had to remove the old link first and then make the noew link > So, /bin/rm /opt/condor ; then ln -sf /opt/condor-7.0.1 /opt/condor ___________________________________________ Date: Mon, 21 Apr 2008 19:02:45 +0000 (UTC) ___________________________________________ Thanks ! The condor system is up and running. I am gradually enabling the workers. Igor Sfiligoi is upgrading the glideins for glexec. They will start what that upgrade is done. 13:48 Per sfiligoi, checked master knowledge of the pool with condor_status -master - all nodes are listed. Fired up one node, condor_on minos07 -subsystem startd brebel job started to run Fired up a few more, condor_on minos01 -subsystem startd condor_on minos02 -subsystem startd condor_on minos03 -subsystem startd condor_on minos04 -subsystem startd condor_on minos05 -subsystem startd condor_on minos06 -subsystem startd There are not enough jobs queues locally to use these. Running some probes. They looked good. 
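The probe*.run submit files used here are not reproduced in this log. A minimal stand-in that just reports the execute host might look like this ( illustrative only; the real probes also report the proxy identity, per the logs/glideafs/probe.*.out files above ) :
mkdir -p logs/probe
cat > probe.run << 'EOF'
universe     = vanilla
executable   = /bin/hostname
output       = logs/probe/probe.$(Cluster).$(Process).out
error        = logs/probe/probe.$(Cluster).$(Process).err
log          = logs/probe/probe.$(Cluster).$(Process).log
notification = NEVER
queue 1
EOF
condor_submit probe.run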
condor_on -all -subsystem startd Submitted a 100 process probe MINOS25 > condor_submit probex100.run Submitting job(s).................................................................................................... Logging submit event(s).................................................................................................... 100 job(s) submitted to cluster 68220. glideins are running as of about 14:45 Igor had to remove a log file formerly owned by gfactory, now by condor. We should change the administator email to minos-admin, not fermigrid-root ============================================================================= 2008 04 20 Did peaceful shutdown of Minos Condor, prep for Master upgrade ============================================================================= 2008 04 17 ####### # SAM # ####### Production station upgraded to sam_station v6_0_5_24_srm via sam_products v4_32 Had to forcibly remove stale projects, see 2008 04 17 log entry ########### # MONITOR # ########### Here are pointers to various summaries of Minos batch resource usage, for planning purposes. Minos Cluster - condor I don't yet have CondorView running, which would give per-user plots. You can get a text based summary from condor_userprio -all The overall usage pattern is pretty well seen in the long term Ganglia plots, by looking at the 'nice' ( yellow ) CPU usage. http://rexganglia2.fnal.gov/minos/?m=load_one&r=year&s=descending&c=MINOS+Cluster&h=&sh=1&hc=4 Summary - the cluster is pretty well used, sometimes saturated at the roughly 40 process capacity. ( This shows up as 40% in CPU usage ) Hyperthreading is inflating the quoted capacity. GPFARM Condorview is at http://fnpcsrv1.fnal.gov/condorview/viewdir/index.html or http://fermigrid.fnal.gov/ Condor View in the left frame, under FermiGrid Monitoring, Metrics and Accounting: CondorView monitoring For the past month , under Pool User (Job) Statistics or pick a month of your choice Then under 'User', click on : group_numi.minospro - for farm usage group_numi.minos - for Condor glideins group_numi.rustem - for Rustem's direct submissions glideins have been used by boehm, hartnell, loiacono, pawloski You can also use the Configure box under the plot to further select data. I don't presently know how to select a longer time frame than 1 month. FNALU BATCH ( presently LSF ) I do not know how to get accounting information for this. FNALU batch has been pretty idle recently. Within a few months, this system will move off of LSF, probably to Condor. ####### # SAM # ####### Upgrading the production station per sam_products v4_32 Only one project dating from April, MINOS26 > sam dump project --station=minos --project=ahimmel-Cedar_phy_bhcurvNearSpillDaikon04-20080415-1315 | grep delivered ... 2304346: n13037581_0000_L010185N_D04.sntp.cedar_phy_bhcurv.0.root, size=1199787338K, swapped out, node = dcap://minos-02, delivered on 15 Apr 22:10:20 So will just stop and restart the station. 
changed station station_prd v6_0_1_17 minos --preferred-loc=enstore --excess-satisfaction=0 --pmaster-arg=--consumption-map=\.\*::dcap://minos-01,dcap://minos-02 --constrain-delivery=dcap://minos-01,dcap://minos-02 --route=dcap://minos-01::dcap://minos-01 --route=dcap://minos-02::dcap://minos-02 to station station_prd v6_0_5_24_srm minos --preferred-loc=enstore --excess-satisfaction=0 --pmaster-arg=--consumption-map=\.\*::dcap://minos-01,dcap://minos-02 --constrain-delivery=dcap://minos-01,dcap://minos-02 --route=dcap://minos-01::dcap://minos-01 --route=dcap://minos-02::dcap://minos-02 ups declare -c sam_cp_config v7_1 ups declare -c sam_station v6_0_5_24_srm -q GCC-3.1 ups declare -c sam_gsi_config v2_3_3 -q vdt ups declare -c sam_ns_ior v7_1_0 This failed, this was in the trace file : Non-compliant application error detected: operator->() was used on null pointer or nil object reference smaster: /fnal/ups/prd/orbacus/Linux-2-4/v3_3_4p1GCC-3-1/include/OB/Template.h:557: T* OBObjVar::operator->() const [with T = SAMStation_FileConsumer]: Assertion `(int)(ptr_ != 0)' failed. Falling back : ups declare -c sam_cp_config v7_0 ups declare -c sam_station v6_0_1_17 -q GCC-3.1 ups declare -c sam_gsi_config v2_2_8 Cleaned out messy products/upsdb/ups_config.bad on minos-sam02 ups list -aK+ | grep current | sort > ups01 scp minos-sam02:ups02 . MINOS-SAM01 > sdiff -s ups01 ups02 "oracle_client" "v8_1_7a" "Linux+2" "" "current" | "oracle_client" "v10_2_0_3" "Linux+2" "" "current" "oracle_tnsnames" "v42" "NULL" "" "current" | "oracle_tnsnames" "v45" "NULL" "" "current" > "sam_config" "v7_1_5" "NULL" "dbs_dev2" "current" > "sam_config" "v7_1_5" "NULL" "dbs_prd2" "current" > "sam_config" "v7_1_5" "NULL" "station_int" "current" "sam_cp_config" "v7_0" "NULL" "" "current" | "sam_cp_config" "v7_1" "NULL" "" "current" "sam_station" "v6_0_1_17" "Linux+2.4" "GCC-3.1" "current" | "sam_station" "v6_0_5_23_srm" "Linux+2.4" "GCC-3.1" "current" On minos-sam01 per the prescription from minos-sam02 2007 03 27 ups copy -G "oracle_client v10_2_0_3" oracle_instant_client v10_2_0_3 ups declare oracle_client v10_2_0_3 -f "Linux+2" -q "" -r "oracle_instant_client/v10_2_0_3/Linux+2" -z "/home/sam/products/upsdb" -U "ups" -m "oracle_instant_client.table" ups declare -c oracle_client v10_2_0_3 ln -s libclntsh.so /home/sam/products/oracle_instant_client/v10_2_0_3/Linux+2/libclntsh.so.8.0 Updated oracle_tnsnames on minos-sam01/2 upd install -j oracle_tnsnames v48 ups declare -c oracle_tnsnames v48 Let's try the upgrade again : less private/station__minos-sam01__station_prd__minos/trace 04/17/08 15:53:13 minos.SM.REVIVER 26613: 17 projects found Non-compliant application error detected: operator->() was used on null pointer or nil object reference smaster: /fnal/ups/prd/orbacus/Linux-2-4/v3_3_4p1GCC-3-1/include/OB/Template.h:557: T* OBObjVar::operator->() const [with T = SAMStation_FileConsumer]: Assertion `(int)(ptr_ != 0)' failed. MINOS-SAM01 > ups declare -c sam_gsi_config v2_2_8 -q vdt ERROR: Invalid Specification for Declare - Specification must include a single flavor EH ?????? I can no longer redeclare sam_gsi_config The old one did not have the vdt qualifier! 
A clean fallback : ups declare -c sam_cp_config v7_0 ups declare -c sam_station v6_0_1_17 -q GCC-3.1 ups undeclare -c sam_gsi_config -q vdt ups declare -c sam_gsi_config v2_2_8 2008 04 18 Another product comparison : ups list -aK+ | grep current | sort > ups01a sdiff -s ups01a ups02 "oracle_tnsnames" "v48" "NULL" "" "current" | "oracle_tnsnames" "v45" "NULL" "" "current" > "sam_config" "v7_1_5" "NULL" "dbs_dev2" "current" > "sam_config" "v7_1_5" "NULL" "dbs_prd2" "current" > "sam_config" "v7_1_5" "NULL" "station_int" "current" "sam_cp_config" "v7_0" "NULL" "" "current" | "sam_cp_config" "v7_1" "NULL" "" "current" > "sam_gsi_config" "v2_3_3" "NULL" "vdt" "current" "sam_station" "v6_0_1_17" "Linux+2.4" "GCC-3.1" "current" | "sam_station" "v6_0_5_23_srm" "Linux+2.4" "GCC-3.1" "current" captured ups tailor sam_config -> station_prd on sam01, station_dev on sam02 cat sc01 | tr -s ' ' | cut -f 3 -d ' ' | sort > sc01s cat sc02 | tr -s ' ' | cut -f 3 -d ' ' | sort > sc02s Trying again, with station v6_0_5_23_srm ups declare -c sam_cp_config v7_1 ups declare -c sam_station v6_0_5_23_srm -q GCC-3.1 ups undeclare -c sam_gsi_config ups declare -c sam_gsi_config v2_3_3 -q vdt Failed as before, fell back WITHOUT changing product declarations. Removed all the stale projects : MINOS26 > sam dump station --projects *** BEGIN DUMP STATION minos version v6_0_1_17 running at minos-sam01.fnal.gov 1 minutes 49 seconds, admins: buckley kreymer rhatcher sam Replica selection: prefer (enstore), avoid (empty) There are 207 authorized transfer groups Full delivery unit is enforced; external deliveries are constrained to dcap://minos-01 dcap://minos-02 Excess consumer satisfaction: 0 PROJECT MANAGER: fileReleaseTO = 1 days : maxConsumer Wait time = 1 days, max prefetched files : 5 STATION PROJECTS (0 already ended, 0 prematurely): project hartnell-PTSimFDCosmicMuLowE-20071111-1403(6350) user hartnell.minos started 18 Apr 14:56:24 UNIX pid 1222 contains 1164 total files: 0 given to project, 0 delivery errors, and 1164 still wanted (of these 0 in cache, 0 locked) project hartnell-PTSimFDCosmicMuLowE-20071111-1436(6351) user hartnell.minos started 18 Apr 14:56:24 UNIX pid 1575 contains 1162 total files: 0 given to project, 0 delivery errors, and 1162 still wanted (of these 0 in cache, 0 locked) project rmehdi-PTSimNDL010185N-D00-R0-R1001-20071205-1051(6588) user rmehdi.minos started 18 Apr 14:56:24 UNIX pid 10464 contains 10 total files: 0 given to project, 0 delivery errors, and 10 still wanted (of these 0 in cache, 0 locked) project rmehdi-PTSimFDL010185N-D00-R0-R1011-20071205-1123(6592) user rmehdi.minos started 18 Apr 14:56:24 UNIX pid 10871 contains 9 total files: 0 given to project, 0 delivery errors, and 9 still wanted (of these 0 in cache, 0 locked) project rmehdi-PTSimFDL010185N-D00-R0-R1011-20071205-1127(6593) user rmehdi.minos started 18 Apr 14:56:24 UNIX pid 10876 contains 9 total files: 0 given to project, 0 delivery errors, and 9 still wanted (of these 0 in cache, 0 locked) project rmehdi-PTSimNDL010185N-D00-R0-R1001-20071205-1131(6594) user rmehdi.minos started 18 Apr 14:56:24 UNIX pid 11069 contains 8 total files: 0 given to project, 0 delivery errors, and 8 still wanted (of these 0 in cache, 0 locked) project ahimmel-PreRevBfld2007-20080130-1315(6930) user ahimmel.minos started 18 Apr 14:56:25 UNIX pid 31599 contains 2436 total files: 0 given to project, 0 delivery errors, and 2436 still wanted (of these 0 in cache, 0 locked) project ahimmel-PreRevBfld2007-20080130-1316(6931) user ahimmel.minos started 
18 Apr 14:56:25 UNIX pid 31602 contains 2444 total files: 0 given to project, 0 delivery errors, and 2444 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080221-1704(7454) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 1487 contains 367 total files: 0 given to project, 0 delivery errors, and 367 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080221-1716(7456) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 1742 contains 367 total files: 0 given to project, 0 delivery errors, and 367 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0303(7465) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 12547 contains 367 total files: 0 given to project, 0 delivery errors, and 367 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0616(7469) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 16133 contains 367 total files: 0 given to project, 0 delivery errors, and 367 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0627(7470) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 16148 contains 367 total files: 0 given to project, 0 delivery errors, and 367 still wanted (of these 0 in cache, 0 locked) project rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04_many-20080222-0904(7472) user rodriges.minos started 18 Apr 14:56:26 UNIX pid 19392 contains 87 total files: 0 given to project, 0 delivery errors, and 87 still wanted (of these 0 in cache, 0 locked) project ahimmel-Cedar_phyDaikon00-20080228-1332(7513) user ahimmel.minos started 18 Apr 14:56:26 UNIX pid 5654 contains 1305 total files: 0 given to project, 0 delivery errors, and 1305 still wanted (of these 0 in cache, 0 locked) project ahimmel-Cedar_phy_bhcurvNearSpillDaikon04-20080328-1734(7634) user ahimmel.minos started 18 Apr 14:56:26 UNIX pid 26913 contains 919 total files: 0 given to project, 0 delivery errors, and 919 still wanted (of these 0 in cache, 0 locked) project ahimmel-Cedar_phy_bhcurvNearSpillDaikon04-20080415-1315(7711) user ahimmel.minos started 18 Apr 14:56:26 UNIX pid 18062 contains 709 total files: 0 given to project, 0 delivery errors, and 709 still wanted (of these 0 in cache, 0 locked) SAMPS=' hartnell-PTSimFDCosmicMuLowE-20071111-1436 rmehdi-PTSimNDL010185N-D00-R0-R1001-20071205-1051 rmehdi-PTSimFDL010185N-D00-R0-R1011-20071205-1123 rmehdi-PTSimFDL010185N-D00-R0-R1011-20071205-1127 rmehdi-PTSimNDL010185N-D00-R0-R1001-20071205-1131 ahimmel-PreRevBfld2007-20080130-1315 ahimmel-PreRevBfld2007-20080130-1316 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080221-1704 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080221-1716 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0303 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0616 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04-20080222-0627 rodriges-L250z200iCedar_phy_bhcurvNearSpillDaikon04_many-20080222-0904 ahimmel-Cedar_phyDaikon00-20080228-1332 ahimmel-Cedar_phy_bhcurvNearSpillDaikon04-20080328-1734 ahimmel-Cedar_phy_bhcurvNearSpillDaikon04-20080415-1315 ' SAMP=hartnell-PTSimFDCosmicMuLowE-20071111-1403 sam stop project --project=${SAMP} --force for SAMP in $SAMPS ; do sam stop project --project=${SAMP} --force ; done Reported as Sam IT ########## # CONDOR # ########## minos03 condor/local cloned from 
condor-6.8.6 to condor-7.0.1 about 09:15 MINOS03 > sudo /etc/init.d/condor start Starting up Condor This worked, requested minos 01-02 04-06 08-13 updates. The cluster is quite idle, will do second half today. CONODES='minos01 minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10 minos11 minos12 minos13' for NODE in ${CONODES} ; do printf "${NODE} " ssh -ax ${NODE} "ls -l /opt/condor-7.0.1/local" ; done for NODE in ${CONODES} ; do printf "${NODE} " ssh -ax ${NODE} "grep NUM_CPUS /opt/condor-7.0.1/local/condor_config.local" ; done minos01 NUM_CPUS = 1 minos02 NUM_CPUS = 1 minos03 NUM_CPUS = 2 minos04 NUM_CPUS = 2 minos05 NUM_CPUS = 2 minos06 NUM_CPUS = 2 minos07 NUM_CPUS = 1 minos08 NUM_CPUS = 2 minos09 NUM_CPUS = 2 minos10 NUM_CPUS = 2 minos11 NUM_CPUS = 0 minos12 NUM_CPUS = 2 minos13 NUM_CPUS = 0 for NN in 01 02 04 05 06 08 09 10 ; do printf "${NN} " ssh -ax minos${NN} "sudo /etc/init.d/condor start" ; done 01 Starting up Condor 02 Starting up Condor 04 Starting up Condor 05 Starting up Condor 06 Starting up Condor 08 Starting up Condor 09 Starting up Condor 10 Starting up Condor CONN='14 15 16 17 18 19 20 21 22 23 24' for SYS in ${CONN} ; do condor_off -peaceful minos${SYS} -subsystem startd ; done Sent "Set-Peaceful-Shutdown" command to startd minos14.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos14.fnal.gov ... MINOS25 > condor_status | grep minos2 vm1@minos20.f LINUX INTEL Claimed Retiring 1.000 2026 0+00:00:04 vm2@minos20.f LINUX INTEL Unclaimed Idle 0.000 2026 0+00:06:16 MINOS25 > condor_status | grep minos1 slot1@minos10 LINUX INTEL Unclaimed Idle 0.000 2026 0+00:05:04 slot2@minos10 LINUX INTEL Unclaimed Idle 0.000 2026 0+00:05:05 vm1@minos14.f LINUX INTEL Claimed Retiring 1.000 2026 0+00:00:04 vm2@minos14.f LINUX INTEL Claimed Retiring 1.000 2026 0+00:00:05 vm1@minos16.f LINUX INTEL Claimed Retiring 1.000 2026 0+00:00:04 vm2@minos16.f LINUX INTEL Unclaimed Idle 0.050 2026 0+01:42:04 So waiting for 14, 16 20 to retire At 13:50, two of the four have finished. MINOS25 > condor_q -r scavan | grep vm 66761.0 scavan 4/15 17:57 1+03:19:21 vm1@minos20.fnal.gov 66764.0 scavan 4/15 18:07 1+04:01:50 vm1@minos14.fnal.gov Some of these jobs migrated to the new nodes. Looking in the log : MINOS25 > condor_q -l 66744.0 | grep UserLog UserLog = "/minos/scratch/scavan/CondorTest/tmp/entR.log.66744.0" MINOS25 > less /minos/scratch/scavan/CondorTest/tmp/entR.log.66744.0 000 (66744.000.000) 04/15 16:27:09 Job submitted from host: <131.225.193.25:63984> ... 001 (66744.000.000) 04/16 02:45:50 Job executing on host: <131.225.193.24:64690> ... 006 (66744.000.000) 04/16 02:45:58 Image size of job updated: 97844 ... 022 (66744.000.000) 04/16 07:23:02 Job disconnected, attempting to reconnect Socket between submit and execute hosts closed unexpectedly Trying to reconnect to vm1@minos24.fnal.gov <131.225.193.24:64690> ... 024 (66744.000.000) 04/16 07:23:02 Job reconnection failed Job disconnected too long: JobLeaseDuration (3600 seconds) expired Can not reconnect to vm1@minos24.fnal.gov, rescheduling job ... 001 (66744.000.000) 04/16 07:40:06 Job executing on host: <131.225.193.23:62702> ... 006 (66744.000.000) 04/16 07:40:14 Image size of job updated: 334880 ... 022 (66744.000.000) 04/16 12:07:53 Job disconnected, attempting to reconnect Socket between submit and execute hosts closed unexpectedly Trying to reconnect to vm2@minos23.fnal.gov <131.225.193.23:62702> ... 
Seems to fail every 04:40 or so 02:45 07:23 07:40 12:07 13:10 19:04 19:06 23:23 23:25 03:42 03:45 08:03 08:09 12:25 12:30 on host: <131.225.193.4:62679> There's only 1 job left at 6.8.6 as of 14:00, less /minos/scratch/scavan/CondorTest/tmp/entR.log.66761.0 04:50 09:27 14:50 20:27 20:30 00:57 00:59 05:20 05:25 09:42 09:50 001 (66761.000.000) 04/17 09:50:06 Job executing on host: <131.225.193.20:62378> ... 005 (66761.000.000) 04/17 14:03:16 Job terminated. (1) Normal termination (return value 0) Usr 0 04:03:04, Sys 0 00:09:15 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 04:03:04, Sys 0 00:09:15 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 1703849 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 1703849 - Total Bytes Sent By Job 0 - Total Bytes Received By Job for NN in ${CONN} ; do ssh -ax minos${NN} "sudo /etc/init.d/condor stop" ; done Thu Apr 17 19:24:41 UTC 2008 Date: Thu, 17 Apr 2008 19:29:19 +0000 (UTC) Subject: Re: HelpDesk ticket 114294 has additional info. Sent request for remaining copies and ln's for minos14 though 24 ( not 25 ) Apr 17 15:34 /opt/condor -> condor-7.0.1 for NN in ${CONN} ; do ssh -ax minos${NN} "sudo /etc/init.d/condor start" ; done ########## # CONDOR # ########## Created probenode.run control file, Ran successfully on minos07. Renamed to probemachine.run Renamed logs to logs/machine/ ############ # MCIMPORT # ############ RDIRS='712 713 714 715 716 717 718' for DIR in ${RDIRS}; do ./mcimport.20080326 -n -T -s n1104 daikon_04/L010185N/near/${DIR} done \ | grep NFILES NFILES 0 NFILES 298 NFILES 308 NFILES 309 NFILES 308 NFILES 310 NFILES 309 RDIRS='713 714 715 716 717 718' for DIR in ${RDIRS}; do ./mcimport.20080326 -T -s n1104 daikon_04/L010185N/near/${DIR} done Thu Apr 17 09:50:32 CDT 2008 ============================================================================= 2008 04 16 11188687 /pnfs/minos/stage/daikon_04 ########## # CONDOR # ########## cd scripts/condor686 for NODE in ${NODES} ; do printf "${NODE} " rcp ${NODE}:/opt/condor-6.8.6/local/condor_config.local local.${NODE} ; done ####### # SAM # ####### On minos-sam02, ./init_sam -n minos minos v4_32 All looks clean In production, ./init_sam -n minos minos v4_32 ####### # SAM # ####### export SAM_ORACLE_CONNECT="samdbs/" samadmin purge zombie projects --station=minos --startedBefore=yesterday --test 18 candidate projects found in the database... Determining which projects are still registered in the NameService... The following 17 projects are still registered in the NameService and are not eligible for termination: ####### # SAM # ####### sam_products v4_32 For sam_station v6_0_5_24_srm -q GCC-3.1" and updating In kreymer products environment on minos26, version=v4_32 oversion=v4_31 samprod=sam_products FLVR=NULL cd ${PRODUCTS}/../prd/${samprod} cp -ar ${oversion} ${version} cd ${version}/${FLVR} ups declare ${samprod} ${version} -f ${FLVR} -r ${samprod}/${version}/${FLVR} -m ${samprod}.table nedit ups/${samprod}.table sam_station v6_0_5_24_srm sam_bootstrap v8_1_1 cd ~/minos/scripts ./updadd ${FLVR} ${samprod} ${version} upd list -l ${samprod} ${version} upd modproduct -g "minos" ${samprod} ${version} -f ${FLVR} 09:52 ########## # CONDOR # ########## The condor queues have drained as desired. 
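For the record, the drain can be confirmed with the usual tools before stopping the daemons. A sketch :
# no slots should remain Claimed, and no jobs should still be running
condor_status | grep Claimed
condor_q -run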
CONODES='minos01 minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10 minos11 minos12 minos13' for NODE in ${CONODES} ; do printf "${NODE} " ssh -ax ${NODE} "ps -u condor" ; done condor_master running on all but minos02 and minos12 for NODE in ${CONODES} ; do printf "${NODE} " ssh -ax ${NODE} "sudo /etc/init.d/condor stop" ; done minos01 Shutting down Condor (fast-shutdown mode) minos02 Condor not running minos03 Shutting down Condor (fast-shutdown mode) minos04 Shutting down Condor (fast-shutdown mode) minos05 Shutting down Condor (fast-shutdown mode) minos06 Shutting down Condor (fast-shutdown mode) minos07 Shutting down Condor (fast-shutdown mode) minos08 Shutting down Condor (fast-shutdown mode) minos09 Shutting down Condor (fast-shutdown mode) minos10 Shutting down Condor (fast-shutdown mode) minos11 Shutting down Condor (fast-shutdown mode) minos12 Condor not running minos13 Shutting down Condor (fast-shutdown mode) Date: Wed, 16 Apr 2008 09:29:15 -0500 (CDT) Subject: HelpDesk ticket 114294 ___________________________________________ Short Description: Part 2 of 4 Condor 7.0.1 upgrade for Minos Problem Description: run2-sys : I have drained the virtual machines, and stopped condor on minos01-13 Please, at your next convenience, as root on minos01 through minos13 cd /opt ln -sf condor-7.0.1 condor and inform minos-admin. I will then restart Condor on most of these nodes. We plan to upgrade minos14 through minos24 tomorrow, and the Condor master minos25 next week. Thanks ! ___________________________________________ Date: Wed, 16 Apr 2008 09:36:44 -0500 (CDT) This ticket has been reassigned to JONES, TERRY of the CD-SF/FEF Group. ___________________________________________ Date: Wed, 16 Apr 2008 14:12:14 -0500 (CDT) Solution: jonest@fnal.gov sent this solution: > This task is complete > > minos01= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos02= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos03= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos04= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos05= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos06= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos07= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos08= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos09= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos10= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos11= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos12= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 > minos13= lrwxrwxrwx 1 root root 12 Apr 16 14:07 /opt/condor -> condor-7.0.1 ___________________________________________ As root on minos03 : cd /opt cp -r condor-6.8.6/local condor-7.0.1/local ___________________________________________ 17 April I have drained the queues on the remaining Minos Condor workers. Please update these remaining nodes : On minos14 through minos24 ( but NOT on minos25 ) cd /opt cp -r condor-6.8.6/local condor-7.0.1/local ln -sf condor-7.0.1 condor ___________________________________________ for NODE in ${CONODES} ; do printf "${NODE} " ssh -ax ${NODE} "ls -l /opt/condor " ; done Select nodes which should run, and start them. 
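Before starting the selected nodes, a quick sanity check of the symlink and the new local config would catch the "Can't read config source" errors seen below (a sketch, reusing the ${CONODES} list defined above):

for NODE in ${CONODES} ; do printf "${NODE} "
  ssh -ax ${NODE} "ls -ld /opt/condor ; test -r /opt/condor/local/condor_config.local && echo 'config OK' || echo 'config MISSING'"
done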
CUNODES='minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10' for NODE in ${CUNODES} ; do printf "${NODE} " ssh -ax ${NODE} "sudo /etc/init.d/condor start" ; done minos02 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos03 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos04 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos05 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos06 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos07 Starting up Condor minos08 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos09 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local minos10 Starting up Condor ERROR: Can't read config source /opt/condor/local/condor_config.local ########## # CONDOR # ########## minos01 through minos10 graceful off, minos11 through minos13 are not active, update them anyway. cd scripts/condor701 for NODE in ${NODES} ; do printf "${NODE} " rcp ${NODE}:/opt/condor-7.0.1/local.${NODE}/condor_config.local condor_config.local.${NODE} ; done diff condor_config.local.minos01 condor_config.local.minosNN differences are like CONDOR_HOST = minos01.fnal.gov CONDOR_ADMIN = root@minos01.fnal.gov COLLECTOR_NAME = Personal Condor at minos01.fnal.gov LOCK = /tmp/condor-lock.$(HOSTNAME)0.251031510372744 Except minos11, minos24 which contain additional > > ## Java parameters: > ## If you would like this machine to be able to run Java jobs, > ## then set JAVA to the path of your JVM binary. If you are not > ## interested in Java, there is no harm in leaving this entry > ## empty or incorrect. > > JAVA = /usr/bin/java > > > ## Some JVMs need to be told the maximum amount of heap memory > ## to offer to the process. If your JVM supports this, give > ## the argument here, and Condor will fill in the memory amount. > ## If left blank, your JVM will choose some default value, > ## typically 64 MB. The default (-Xmx) works with the Sun JVM. > > JAVA_MAXHEAP_ARGUMENT = -Xmx > Let's also grab the etc/condor_config's for NODE in ${NODES} ; do printf "${NODE} " rcp ${NODE}:/opt/condor-7.0.1/etc/condor_config condor_config.${NODE} ; done ============================================================================= 2008 04 15 ############ # MCIMPORT # ############ Updated kreymer-doe.proxy, per condorproxy content kreymer@minos26 cd /local/scratch26/kreymer/grid . /minos/scratch/kreymer/VDT/setup.sh HOURS=10000 # 8760 ? voms-proxy-init \ -voms fermilab:/fermilab/minos \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymer-doe.proxy \ -valid 10000:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Creating temporary proxy ................................................... Done Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Done Warning: fg6x1.fnal.gov:15001: The validity of this VOMS AC in your proxy is shortened to 86400 seconds! Creating proxy ....................................................... 
Done Warning: your certificate and proxy will expire Wed Mar 25 14:45:40 2009 which is within the requested lifetime of the proxy voms-proxy-info -all -file kreymer-doe.proxy scp kreymer-doe.proxy mindata@minos26:/home/mindata/.grid/kreymer-doe.proxy Tue Apr 15 12:10:09 CDT 2008 srmcp is failing like Tue Apr 15 12:37:19 CDT 2008: rs.state = Failed rs.error = RequestFileStatus#-2145068121 failed with error:[ at Tue Apr 15 12:37:15 CDT 2008 state Failed : user has no permission to write into path /pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_04/L010185N/747 ] voms-proxy-init \ -voms fermilab:/fermilab/minos/Role=Production \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymer-doe.proxy \ -valid 10000:0 Still no good, tested with ./mcimport -b 1 OVERLAY tail /home/mindata/STAGE/OVERLAY/log/mcimport.log grid-proxy-init \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymer-grid.proxy \ -valid 999999:00 Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Enter GRID pass phrase for this identity: Creating proxy ..................................... Done Warning: your certificate and proxy will expire Wed Mar 25 14:45:40 2009 which is within the requested lifetime of the proxy renamed this file to /home/mindata/.grid/kreymer-grid.proxy ln -s kreymer-doe.proxy kreymer-grid.proxy This works ! Bottom line, the voms proxy cannot write to DCache via SRM. ####### # SAM # ####### On minos-sam02, preparing for station upgrade, sam_station v6_0_5_23_srm -q GCC-3.1" ./init_sam -n minos minos v4_31 could not find sam_par_ret Our init_sam is out of date, MINOS-SAM02 > dds init* -rwxr-xr-x 1 sam 5024 21142 Apr 15 11:13 init_new* -rwxr-xr-x 1 sam 5024 20941 Oct 18 2005 init_sam* MINOS-SAM02 > cp -a init_sam init_sam.20051018 MINOS-SAM02 > cp -a init_new init_sam MINOS-SAM02 > diff init_sam init_new name change from sam_par_ret to sam_test_project MINOS-SAM02 > ./init_sam -n minos minos v4_31 =========================================================== =========================================================== == SAM station installation Tue Apr 15 11:18:05 CDT 2008 == == init_sam Version 2005 02 28 == == on minos-sam02.fnal.gov in /home/sam =========================================================== =========================================================== OK - experiment minos OK - station minos OK - UPD_HOST fnkits.fnal.gov OK - cleaning out local configuration files OK - can create files in /home/sam OK - others can read this directory OK - checking source of distribution OK - we can ftp to fnkits.fnal.gov OK - PRODUCTS_ROOT = /home/sam/products is present OK - getting installation scripts from fnkits.fnal.gov OK - got bootups and config scripts OK - init_sam is up to date OK - already have products setup script OK - setting up ups OK - set up ups OK - sam_products v4_31 specified on command line OK - upd install -j sam_products v4_31 -h fnkits.fnal.gov informational: installed sam_products v4_31. upd install succeeded. 
OK, not really installing, because of -n option Listing existing and needed products below - have it, and it is current ups declare -c - have it, would make it current NEED - would need to install the product orbacus v3_3_4p1 -q GCC-3.1 python v2_1 ups declare -c sam_bootstrap v8_1_0 sam_cp v7_2 NEED sam_cp_config v7_1 sam_dcache_cp v7_1 sam_kerberos_rcp v4_0_11 NEED sam_station v6_0_5_23_srm -q GCC-3.1 setpath v1_11 perl v5_8 sam_gridftp v2_1_2 -q vdt NEED sam_gsi_config v2_3_3 -q vdt sam_gsi_config_util v2_1 -q vdt vdt v1_3_0_1 pacman v2_116 sam v8_2_2 samgrid_batch_adapter v7_0_0 ups declare -c sam_ns_ior v7_1_0 sam_config v7_1_5 Installed the needed bits by hand upd install -j sam_cp_config v7_1 Creating version link in /home/sam/products/upsdb/sam_cp_config/Symlinks for sam_cp_config v7_1. Note: the sam_cp_config template MAY have changed. Please, merge the differences (if any) between your current configuration (/home/sam/products/upsdb/sam_cp_config/Config/sam_cp_config.py) and the new template (/home/sam/products/sam_cp_config/v7_1/NULL/ups/sam_cp_config_template.py) sam_cp_config configuration complete. informational: installed sam_cp_config v7_1. upd install succeeded. upd install -j sam_station v6_0_5_23_srm -q GCC-3.1 informational: installed sam_station v6_0_5_23_srm. upd install succeeded. upd install -j sam_gsi_config v2_3_3 -q vdt ************************************************************************** If you are installing the product for the first time, you should execute the command ups tailor sam_gsi_config v2_3_3 ************************************************************************** informational: installed sam_gsi_config v2_3_3. upd install succeeded. disabled dev station, nedit private/minos-sam02_server_list.txt disabled station, added new station version v6_0_5_23_srm ups update sam_bootstrap ups list -K+ sam_bootstrap "sam_bootstrap" "v8_1_1" "NULL" "" "current" ups list -K+ sam_cp_config "sam_cp_config" "v7_0" "NULL" "" "current" ups list -K+ sam_station -q GCC-3.1 "sam_station" "v6_0_1_17" "Linux+2.4" "GCC-3.1" "current" ups list -K+ sam_gsi_config "sam_gsi_config" "v2_2_8" "NULL" "" "current" ups declare -c sam_bootstrap v8_1_0 ups declare -c sam_cp_config v7_1 ups declare -c sam_station v6_0_5_23_srm -q GCC-3.1 ups declare -c sam_gsi_config v2_3_3 -q vdt ups update sam_bootstrap MINOS-SAM02 > sam dump station --station=minos --all *** BEGIN DUMP STATION minos version v6_0_5_23_srm running at minos-sam02.fnal.gov 40 seconds, admins: buckley kreymer rhatcher sam Replica selection: prefer (enstore), avoid (empty) There are 0 authorized transfer groups Full delivery unit is enforced; external deliveries are constrained to dcap://minos-01 dcap://minos-02 Excess consumer satisfaction: 0 AUTHORIZED GROUPS: group minos: admins: sam , swap policy: LRU, fair share: 1, quotas (cur/max): projects = 0/1000, disk: 104080724KB/10240000MB, locks:0B/0B STATION DISKS: disk 1 dcap://minos-01:dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr, 401805604B/52428800KB = 0.7% free disk 2 dcap://minos-02:dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr, 393714515B/52428800KB = 0.7% free station disk total: 795520119B/104857600KB = 0.7% free REQUESTED FILES: PROJECT MANAGER: fileReleaseTO = 1 days : maxConsumer Wait time = 1 days, max prefetched files : 5 NO PROJECTS ever run FAIR SHARE MAN: Benefit weights: volumes mounted: 0.2, CPU: 0.2, KBytes transferred from MSS: 0.2, KBytes transferred inter-station: 0.2, files consumed: 0.2 *** END OF STATION DUMP *** MINOS26 > ./sam_test_py 
minos dev MINOS26 > ./sam_test_project minos dev OK, move on to the latest station, v6_0_5_24_srm upd install -j sam_station v6_0_5_24_srm -q GCC-3.1 ups update sam_bootstrap # shut down dev station ups declare -c sam_station v6_0_5_23_srm -q GCC-3.1 edited server list ups update sam_bootstrap # shut down dev station Noticed that sam_bootstrap is already newer, put it back ups declare -c sam_bootstrap v8_1_1 OK, move on the preinstall on production station minos-sam01 . shrc/kreymer . setups.sh ./init_sam -n minos minos v4_31n # reported newer script, run again ./init_sam -n minos minos v4_31 =========================================================== =========================================================== == SAM station installation Tue Apr 15 14:12:12 CDT 2008 == == init_sam Version $Revision: 1.52 $ == on minos-sam01.fnal.gov in /home/sam =========================================================== =========================================================== OK - experiment minos OK - station minos OK - UPD_HOST fnkits.fnal.gov OK - cleaning out local configuration files OK - can create files in /home/sam OK - others can read this directory OK - checking source of distribution OK - we can ftp to cdfkits.fnal.gov OK - PRODUCTS_ROOT = /home/sam/products is present OK - getting installation scripts from cdfkits.fnal.gov OK - got bootups and config scripts OK - init_sam is up to date OK - already have products setup script OK - setting up ups OK - set up ups OK - backing up old config files to /home/sam/maint/backup/200804151412 cp: missing destination file Try `cp --help' for more information. cp: missing destination file Try `cp --help' for more information. cp: missing destination file Try `cp --help' for more information. OK - sam_products v4_31 specified on command line OK - upd install -j sam_products v4_31 -h fnkits.fnal.gov Unable to close datastream at /home/sam/products/upd/v4_6/NULL/src/updxfr.pm line 184 error: while attempting to ftp to ftp.fnal.gov: error: can't transfer //.register_test from ftp.fnal.gov to /tmp/upd19929_register_test Notice: Either this node is not registered on ftp.fnal.gov or ftp.fnal.gov is down informational: installed sam_products v4_31. upd install succeeded. OK, not really installing, because of -n option Listing existing and needed products below - have it, and it is current ups declare -c - have it, would make it current NEED - would need to install the product orbacus v3_3_4p1 -q GCC-3.1 python v2_1 ups declare -c sam_bootstrap v8_1_0 sam_cp v7_2 NEED sam_cp_config v7_1 sam_dcache_cp v7_1 sam_kerberos_rcp v4_0_11 NEED sam_station v6_0_5_23_srm -q GCC-3.1 setpath v1_11 perl v5_8 sam_gridftp v2_1_2 -q vdt NEED sam_gsi_config v2_3_3 -q vdt sam_gsi_config_util v2_1 -q vdt vdt v1_3_0_1 pacman v2_116 NEED sam v8_2_2 samgrid_batch_adapter v7_0_0 ups declare -c sam_ns_ior v7_1_0 sam_config v7_1_5 ------------------------------------------- ups list -K+ sam_cp_config "sam_cp_config" "v7_0" "NULL" "" "current" ups list -K+ sam_station -q GCC-3.1 "sam_station" "v6_0_1_17" "Linux+2.4" "GCC-3.1" "current" ups list -K+ sam_gsi_config "sam_gsi_config" "v2_2_8" "NULL" "" "current" ups list -K+ sam "sam" "v7_6_0" "Linux+2" "" "current" setup upd upd install -j sam_cp_config v7_1 Creating version link in /home/sam/products/upsdb/sam_cp_config/Symlinks for sam_cp_config v7_1. Note: the sam_cp_config template MAY have changed. 
Please, merge the differences (if any) between your current configuration (/home/sam/products/upsdb/sam_cp_config/Config/sam_cp_config.py) and the new template (/home/sam/products/sam_cp_config/v7_1/NULL/ups/sam_cp_config_template.py) sam_cp_config configuration complete. informational: installed sam_cp_config v7_1. upd install succeeded. upd install -j sam_station v6_0_5_24_srm -q GCC-3.1 informational: installed sam_station v6_0_5_24_srm. upd install succeeded. upd install -j sam_gsi_config v2_3_3 -q vdt ************************************************************************** If you are installing the product for the first time, you should execute the command ups tailor sam_gsi_config v2_3_3 ************************************************************************** informational: installed sam_gsi_config v2_3_3. upd install succeeded. upd install -j sam v8_2_2 Creating version link in /home/sam/products/upsdb/sam/Symlinks for sam v8_2_2. informational: installed sam v8_2_2. upd install succeeded. ups declare -c sam v8_2_2 Removing current link in /home/sam/products/upsdb/sam/Symlinks for sam v7_6_0. Creating current link in /home/sam/products/upsdb/sam/Symlinks for sam v8_2_2. When ready to upgrade, shut down station and ups declare -c sam_cp_config v7_1 ups declare -c sam_gsi_config v2_3_3 -q vdt ups declare -c sam_station v6_0_5_23_srm -q GCC-3.1 ups declare -c sam_ns_ior v7_1_0 ########## # CONDOR # ########## 08:57 for SYS in minos03 ; do condor_off -peaceful ${SYS} -subsystem startd ; done for SYS in 04 05 06 07 08 09 10 ; do condor_off -peaceful minos${SYS} -subsystem startd ; done ########## # CONDOR # ########## Need to uncomment #CREATE_CORE_FILES = True in all of /opt/condor-7.0.1/etc/condor_config Request this as soon as minos07 is peaceful, do it on minos01 through minos25 Actually, do not request this now, per advice from sfiligoi. This affects core files from condor processes, not user processes. Until we have actual condor crashes, there is no need for this. Igor will be available for the master condor v7_0_1 upgrade with glexec support next week. So we can gradually migrate the workers this week. ============================================================================= 2008 04 14 ######## # DATA # ######## Date: Mon, 14 Apr 2008 15:42:13 -0500 (CDT) Subject: HelpDesk ticket 114191 ___________________________________________ Short Description: Quota request for BlueArc served /minos/scratch, for rahaman Problem Description: LSC/CSI : Please set an individual storage quota of 700 GBytes for user rahaman on the BlueArc served /minos/scratch volume. This in an increase from the existing 500 GBytes quota. ___________________________________________ Date: Mon, 14 Apr 2008 15:49:08 -0500 (CDT) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Solution: joes@fnal.gov sent this solution: Hi Art, rahaman quota has been increased to 700G ############ # MCIMPORT # ############ Noting network glitches on minos-sam03, every 10 minutes, with read data rate 6 MB/sec. The glitches toward 0 in ganlia monitoring seem to last 1 to 2 bins, There seem to be more than 10 bins per 5 minutes, probably 15. So probably 20 seconds samples. With data rates up around 12 MBytes/second, the glitches are at 5 minute intervals. Read rates were about 6 MB/s through 12:20, and about 12 MB/s after 14:40 Today, write rates are 14 to 17 MB/sec, glitches at intervals of roughly 4 minutes. 
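For the record, the ganglia sample-interval estimate above is just 5 minutes divided by the apparent bin count (bc assumed available):

echo "5 * 60 / 15" | bc    # 20 second bins, if ~15 bins per 5 minutes
echo "5 * 60 / 10" | bc    # 30 second bins, if only 10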
########## # CONDOR # ########## Waiting for minos07 graceful off, MINOS25 > condor_q -r | grep minos07 66127.0 scavan 4/13 06:18 0+07:55:58 minos07.fnal.gov Date: Mon, 14 Apr 2008 15:37:17 -0500 (CDT) Subject: HelpDesk ticket 114190 ___________________________________________ Short Description: Initial Condor 7.0.1 upgrade for Minos Problem Description: I have shut down the condor master on node mins07, for our first test of the upgrade to condor 7.0.1. I have already drained the virtual machine, and stopped condor. Please, at your next convenience, as root cd /opt ln -sf condor-7.0.1 condor and inform minos-admin. I will then try to restart condor on this single node. Thanks ! ___________________________________________ Date: Mon, 14 Apr 2008 15:49:10 -0500 (CDT) This ticket has been reassigned to JONES, TERRY of the CD-SF/FEF Group. ___________________________________________ Date: Mon, 14 Apr 2008 16:01:29 -0500 MINOS07 > sudo /etc/init.d/condor start Starting up Condor ERROR "The following configuration macros appear to contain default values that must be changed before Condor will run. These macros are: hostallow_write (found on line 215 of /etc/condor/condor_config) " at line 242 in file condor_config.C ########## # CONDOR # ########## Disabled factproxy in crontab.minos26, obsolete. ########## # CONDOR # ########## Try getting a proxy before admin command : cd /local/scratch25/kreymer/.grid/ scp minos26:/local/scratch26/kreymer/grid/kreymerdoe.pem . scp minos26:/local/scratch26/kreymer/grid/kreymerdoekey.pem . scp minos26:/local/scratch26/kreymer/grid/kreymerdoe.inf . . /grid/app/minos/VDT/setup.sh . /minos/scratch/kreymer/VDT/setup.sh echo kreymerdoe.inf | voms-proxy-init \ -voms fermilab:/fermilab/minos \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -vomslife 1:0 \ -pwstdin Igor suggests setting X509_USER_PROXY Apparently not needed with the defaults as done above. Trying a kx509 proxy, see if I'm authorized kx509 kxlist -p voms-proxy-init \ -noregen \ -voms fermilab:/fermilab/minos/Role=pilot \ -vomslife 1:0 \ -valid 1:0 Nope, not authorized to write to DCache. Repeated with the DOE proxy, this seems to be a harmless test of good authorization. MINOS25 > condor_off -peaceful minos07 -subsystem startd Can't find address for startd minos07.fnal.gov Perhaps you need to query another pool. Sent "Kill-Daemon-Peacefully" command to master minos07.fnal.gov Now shot down condor on minos07 MINOS07 > ps -flu condor F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 5 S condor 3087 1 0 76 0 - 2017 - 2007 ? 00:39:03 /opt/condor/sbin/condor_master ########## # CONDOR # ########## Sent email to sfiligoi and minos_admin, regarding the following. The minos07 vm did not shut down on request. 
Tried again, MINOS07 > condor_off -peaceful minos07 -subsystem startd Sent "Set-Peaceful-Shutdown" command to startd minos07.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos07.fnal.gov MINOS07 > date Mon Apr 14 09:09:52 CDT 2008 Looking in /local/stage1/condor/log MasterLog 4/14 09:09:41 DaemonCore: PERMISSION DENIED to kreymer@fnal.gov from host <131.225.193.7:64699> for command 483 (DAEMON_OFF_PEACEFUL) StartLog 4/14 09:09:41 DC_AUTHENTICATE: received DC_AUTHENTICATE from <131.225.193.7:65345> 4/14 09:09:41 DC_AUTHENTICATE: received following ClassAd: MyType = "(unknown type)" TargetType = "(unknown type)" AuthMethods = "FS,GSI" CryptoMethods = "3DES,BLOWFISH" OutgoingNegotiation = "PREFERRED" Authentication = "OPTIONAL" Encryption = "OPTIONAL" Integrity = "OPTIONAL" Enact = "NO" Subsystem = "TOOL" ServerPid = 29099 SessionDuration = "3600" NewSession = "YES" RemoteVersion = "$CondorVersion: 6.8.6 Sep 13 2007 $" Command = 60016 4/14 09:09:41 DC_AUTHENTICATE: our_policy: MyType = "" TargetType = "" AuthMethods = "FS,GSI" CryptoMethods = "3DES,BLOWFISH" OutgoingNegotiation = "REQUIRED" Authentication = "REQUIRED" Encryption = "OPTIONAL" Integrity = "REQUIRED" Enact = "NO" Subsystem = "STARTD" ParentUniqueID = "minos07:3087:1195500376" ServerPid = 3088 SessionDuration = "3600" 4/14 09:09:41 DC_AUTHENTICATE: the_policy: MyType = "" TargetType = "" Authentication = "YES" Encryption = "NO" Integrity = "YES" AuthMethodsList = "FS,GSI" AuthMethods = "FS" CryptoMethods = "3DES,BLOWFISH" SessionDuration = "3600" Enact = "YES" 4/14 09:09:41 DC_AUTHENTICATE: generating 3DES key for session minos07:3088:1208182181:9838... 4/14 09:09:41 SECMAN: Sending following response ClassAd: MyType = "" TargetType = "" Authentication = "YES" Encryption = "NO" Integrity = "YES" AuthMethodsList = "FS,GSI" AuthMethods = "FS" CryptoMethods = "3DES,BLOWFISH" SessionDuration = "3600" Enact = "YES" 4/14 09:09:41 DC_AUTHENTICATE: generating 3DES key for session minos07:3088:1208182181:9838... 4/14 09:09:41 SECMAN: Sending following response ClassAd: MyType = "" TargetType = "" Authentication = "YES" Encryption = "NO" Integrity = "YES" AuthMethodsList = "FS,GSI" AuthMethods = "FS" CryptoMethods = "3DES,BLOWFISH" SessionDuration = "3600" Enact = "YES" RemoteVersion = "$CondorVersion: 6.8.6 Sep 13 2007 $" 4/14 09:09:41 SECMAN: new session, doing initial authentication. 4/14 09:09:41 DC_AUTHENTICATE: authenticating RIGHT NOW. 4/14 09:09:41 AUTHENTICATE: in authenticate( addr == NULL, methods == 'FS,GSI') 4/14 09:09:41 AUTHENTICATE: can still try these methods: FS,GSI 4/14 09:09:41 HANDSHAKE: in handshake(my_methods = 'FS,GSI') 4/14 09:09:41 HANDSHAKE: handshake() - i am the server 4/14 09:09:41 HANDSHAKE: client sent (methods == 36) 4/14 09:09:41 HANDSHAKE: i picked (method == 4) 4/14 09:09:41 HANDSHAKE: client received (method == 4) 4/14 09:09:41 AUTHENTICATE: will try to use 4 (FS) 4/14 09:09:41 FS: client template is /tmp/FS_XXXXXXXXX 4/14 09:09:41 FS: client filename is /tmp/FS_XXXWlt97z 4/14 09:09:41 AUTHENTICATE_FS: used dir /tmp/FS_XXXWlt97z, status: 1 4/14 09:09:41 AUTHENTICATE: auth_status == 4 (FS) 4/14 09:09:41 Authentication was a Success. 4/14 09:09:41 DC_AUTHENTICATE: mutual authentication to 131.225.193.7 complete. 4/14 09:09:41 DC_AUTHENTICATE: message authenticator enabled with key id minos07:3088:1208182181:9838. 
4/14 09:09:41 DC_AUTHENTICATE: sending session ad: MyType = "" TargetType = "" User = "kreymer@fnal.gov" Sid = "minos07:3088:1208182181:9838" ValidCommands = "5,60007,60011,448,452,457,470,60004,1200,1000,60005,60006,60012,60013,60015,60016" 4/14 09:09:41 DC_AUTHENTICATE: sent session minos07:3088:1208182181:9838 info! 4/14 09:09:41 DC_AUTHENTICATE: added incoming session id minos07:3088:1208182181:9838 to cache for 3600 seconds (return address is unknown). MyType = "" TargetType = "" Authentication = "YES" Encryption = "NO" Integrity = "YES" AuthMethodsList = "FS,GSI" CryptoMethods = "3DES,BLOWFISH" SessionDuration = "3600" Enact = "YES" AuthMethods = "FS" Subsystem = "TOOL" ServerPid = 29099 RemoteVersion = "$CondorVersion: 6.8.6 Sep 13 2007 $" User = "kreymer@fnal.gov" Sid = "minos07:3088:1208182181:9838" ValidCommands = "5,60007,60011,448,452,457,470,60004,1200,1000,60005,60006,60012,60013,60015,60016" 4/14 09:09:41 DC_AUTHENTICATE: setting sock->decode() 4/14 09:09:41 DC_AUTHENTICATE: allowing an empty message for sock. 4/14 09:09:41 DC_AUTHENTICATE: Success. 4/14 09:09:41 IPVERIFY: hoststring: minos07.fnal.gov 4/14 09:09:41 DaemonCore: PERMISSION DENIED to kreymer@fnal.gov from host <131.225.193.7:65345> for command 60016 (DC_SET_PEACEFUL_SHUTDOWN) StarterLog And for the record, on minos25 : MINOS25 > condor_off -peaceful minos07 -subsystem startd ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5004:Failed to get authorization from server. Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile) AUTHENTICATE:1004:Failed to authenticate using FS Can't send Set-Peaceful-Shutdown command to startd minos07.fnal.gov ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5004:Failed to get authorization from server. Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile) AUTHENTICATE:1004:Failed to authenticate using FS Sent "Kill-Daemon-Peacefully" command to master minos07.fnal.gov After doing this again with a proxy, MasterLog 4/14 10:00:19 DaemonCore: Command received via TCP from kreymer@fnal.gov from host <131.225.193.25:62454> 4/14 10:00:19 DaemonCore: received command 483 (DAEMON_OFF_PEACEFUL), calling handler (admin_command_handler) 4/14 10:00:19 Handling daemon-specific command for "STARTD" 4/14 10:00:19 Sent SIGTERM to STARTD (pid 3088) StartLog 4/14 10:00:19 DaemonCore: Command received via TCP from kreymer@fnal.gov from host <131.225.193.25:65144> 4/14 10:00:19 DaemonCore: received command 60016 (DC_SET_PEACEFUL_SHUTDOWN), calling handler (handle_set_peaceful_shutdown()) ############ # MCIMPORT # ############ du -sm /pnfs/minos/stage/daikon_04 9518682 /pnfs/minos/stage/daikon_04 Consistent with VOLUME_QUOTAS summary under enstore, 9316 GB 5.1 TB 31 March 7.6 TB 07 April 9.5 TB 14 April 10.9 TB 22 April , 15 tapes du -sm /minos/data/mcimport/STAGE/daikon_04 4390697 /minos/data/mcimport/STAGE/daikon_04 Using 13 tapes so far. So need 13 * ( 14 / 9.5 ) = 19.1 ( 20 ) tapes total. Have 15 allocated. Requesting 6 more, as we are still producing data. 
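A quick re-derivation of the tape estimate above (bc assumed; the 14 TB projected total is the 9.5 TB already on tape plus the ~4.4 TB still sitting in /minos/data/mcimport/STAGE/daikon_04):

echo "scale=1; 13 * 14 / 9.5" | bc    # 19.1 tapes, round up to ~20-21 while data keeps arriving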
There seem to be 12 available 81 CD-LTO4G1 none: emergency N/A N/A N/A 12 12/12 Date: Mon, 14 Apr 2008 08:49:01 -0500 (CDT) Ticket #: 114105 ___________________________________________ Short Description: Request 6 more LTO-4 tapes for Minos archives Problem Description: We have written Minos archival data to 13 LTO-4 tapes so far, and are continuing to archive data. We expect to need about 21 tapes for the present data set, but have only 15 allocated. Please make an additional 6 tapes available at your next convenience. ___________________________________________ This ticket is assigned to HARRISON, MICHAEL of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Tue, 15 Apr 2008 09:55:41 -0500 (CDT) Solution: mircea@fnal.gov sent this solution: The minos quota has been increased to 21 volumes. -Mike ___________________________________________ ######### # ADMIN # ######### Tried to update system status for minos09, 2008-04-03 06:24 minos09 No Estimate MINOS09 went down around 06:24. Reported to run2-sys. The resolution message is missing. This status can not be set for more than 3 days. Please go back and correct the date and time. ============================================================================= 2008 04 12 Saturday ######## # DATA # ######## RESUMED ALL CRONTABS AND TASKS by 15:12 CDT / 20:12 UTC kreymer@minos01 crontab crontab.minos01 kreymer@minos26 crontab crontab.dat mindata@minos26 crontab crontab.dat minfarm@fnpcsrv1 mv NOCAT.ok NOCAT mindata@minos-sam03 restarted, see below ######## # DATA # ######## Scanning D0 LTO4 tapes for errors, on d0mino01 NN=-1 while [ ${NN} -lt 500 ] ; do usleep 200000 (( NN++ )) VOL=`printf "PSA%3.3d\n" ${NN}` printf "${VOL}\n" enstore info --vol=${VOL} | grep wr_err | grep -v ': 0,' done PSA090 'sum_wr_err': 1, PSA251 'sum_wr_err': 1, PSA252 'sum_wr_err': 1, PSA253 'sum_wr_err': 1, ########### # ROUNDUP # ########### Fixes to roundup LISTS=/minos/data/minfarm/lists global substitution /home/minfarm/lists-> ${LISTS} Corrected logic handling type 1 and 3 errors ( ignore them ) cp -a AFSS/roundup.20080412 . ln -sf roundup.20080412 roundup # was roundup.20080409 ############ # MCIMPORT # ############ Tried to restart, it started up writing to 735, but files belonged in 736. Command line: encp --delayed-dismount 5 --verbose 4 /home/mindata/TAPE/n11037360_0002_L010185N_D04.tar.gz /pnfs/minos/stage/daikon_04/L010185N/near/735/n11037360_0002_L010185N_D04.tar.gz Interrupted. Removed the misplaced PAPER'd file from PNFS $ rm /pnfs/minos/stage/daikon_04/L010185N/near/735/n11037360_0003_L010185N_D04.tar.gz Checking ecrc files $ ls TAPE | wc -l 263 $ ls /minos/data/mcimport/TAR/daikon_04/L010185N/near/736 | wc -l 261 $ rm /pnfs/minos/stage/daikon_04/L010185N/near/735/n11037360_0003_L010185N_D04.tar.gz Start with the correct directory, wherever interrupted FDIRS='736 737 738 739 740 741 742 743 744 745 746 765 766 767 768 769 770 771 772' for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done ######## # DATA # ######## Apparently we've used another 200 GB /minos/data while the network was glitching. 
http://www-numi.fnal.gov/computing/dh/mdfree/2008/04/11.txt 552168 Fri Apr 11 17:56:05 CDT 2008 529399 Fri Apr 11 18:56:08 CDT 2008 483226 Fri Apr 11 19:56:12 CDT 2008 395550 Fri Apr 11 20:56:16 CDT 2008 346141 Fri Apr 11 21:56:19 CDT 2008 341292 Fri Apr 11 22:56:21 CDT 2008 332561 Fri Apr 11 23:56:25 CDT 2008 ########### # NETWORK # ########### Network status page indicates : http://computing.fnal.gov/cdsystemstatus/system/NETWORKS.html 2008-04-11 19:00 Hub Router Work on hub router complete. 2008-04-11 17:15 Hub Router 2 hours Hub Router ACL configuration problem, service should be normal but work continues to complete configuration. VB The AFS monitoring indicates global timeouts 16:38 through 16:57 17:19 through 17:22 ============================================================================= 2008 04 11 ########### # NETWORK # ########### Posted note to http://computing.fnal.gov/cdsystemstatus/system/MINOS.html 2008-04-11 17:00 network No Estimate Major network disruptions at Fermilab. Intermittent connections. Helpdesk interface is down. The network flaked out for a few minutes, 17:05 to 17:08 CDT. I also see data dropouts from ganglia 16:36 through 16:56. ECRC /home/mindata/TAPE/n11037368_0007_L010185N_D04.tar.gz ./mcimport.20080326: line 308: ecrc: command not found COPY n11037368_0008_L010185N_D04.tar.gz ECRC /home/mindata/TAPE/n11037368_0008_L010185N_D04.tar.gz ./mcimport.20080326: line 308: ecrc: command not found COPY n11037368_0009_L010185N_D04.tar.gz ECRC /home/mindata/TAPE/n11037368_0009_L010185N_D04.tar.gz ./mcimport.20080326: line 308: ecrc: command not found I'm not sure how much else is dead. Saved this file as /home/kreymer/minosLOG20080411 on desktop Shutting down all that I can. kreymer@minos01 crontab -r kreymer@minos26 crontab -r mindata@minos26 crontab -r minfarm@fnpcsrv1 mv NOCAT.ok NOCAT mindata@minos-sam03 Interrupted at COPY n11037369_0001_L010185N_D04.tar.gz $ find /minos/data/mcimport/TAR/daikon_04/L010185N/near/736/ -size 0 /minos/data/mcimport/TAR/daikon_04/L010185N/near/736/n11037368_0007_L010185N_D04.ecrc /minos/data/mcimport/TAR/daikon_04/L010185N/near/736/n11037368_0008_L010185N_D04.ecrc /minos/data/mcimport/TAR/daikon_04/L010185N/near/736/n11037368_0009_L010185N_D04.ecrc $ find /minos/data/mcimport/TAR/daikon_04/L010185N/near/736/ -size 0 -exec rm {} \; -print ######## # FARM # ######## Reviewing handling of type 1 and 3 errors in the bad list. These should both be ignored. Instead, they seem to have been selected for ZAP runs. Need to recall distinction between ZAP and other runs. The logic seems to be reversed, Also rubin states we should be using bad_runs files under /minos/data/minfarm/lists DUH. When did this change ? The old files still sit in /home/minfarm/lists SRV1> ls -l /home/minfarm/lists/bad* -tr ... 
-rw-rw-r-- 1 rubin numi 5240 Feb 18 17:19 /home/minfarm/lists/bad_runs.cedar -rw-rw-r-- 1 rubin numi 5537 Feb 18 17:19 /home/minfarm/lists/bad_runs.cedar_phy_bhcurv -rw-rw-r-- 1 rubin numi 6012 Feb 23 15:26 /home/minfarm/lists/bad_runs_mc.cedar_phy -rw-rw-r-- 1 rubin numi 8570 Feb 24 14:26 /home/minfarm/lists/bad_runs_mc.cedar_phy_bhcurv -rw-rw-r-- 1 rubin numi 254357 Feb 25 05:16 /home/minfarm/lists/bad_runs_mrcc_mc.cedar_phy SRV1> ls -l /minos/data/minfarm/lists/bad* -tr -rw-rw-r-- 1 rubin numi 3401 Mar 5 02:53 /minos/data/minfarm/lists/bad_runs_mc.cedar -rw-rw-r-- 1 rubin numi 5797 Mar 25 16:31 /minos/data/minfarm/lists/bad_runs.cedar_phy_bhcurv -rw-rw-r-- 1 minospro numi 0 Mar 27 00:20 /minos/data/minfarm/lists/bad_runs.cedar_phy_mboone -rw-rw-r-- 1 rubin numi 5794 Apr 8 23:50 /minos/data/minfarm/lists/bad_runs.cedar -rw-rw-r-- 1 rubin numi 9301 Apr 11 14:20 /minos/data/minfarm/lists/bad_runs_mc.cedar_phy_bhcurv Reviewing /home/minfarm usage in roundup # SUPDIR - contains *.sup suppressed subrun lists SUPDIR=/home/minfarm/lists/daq_lists/sup /home/minfarm/lists/daq_lists -> /minos/data/minfarm/lists/daq_lists/ . /home/minfarm/scripts/setup_minossoft_R1_18_4.sh R1.18.4 That's OK cat /home/minfarm/lists/daq_lists/sup/*.sup > ${ROUNTMP}/SUPPRESSED That should use SUPDIR NOSPILL=/home/minfarm/lists/no_spill.${REL} These exist in /minos/data/minfarm/lists, but only C, CPB, CP_mboone if [ "${MCDET}" ] ; then BADRUNS=/home/minfarm/lists/bad_runs_mc.${REL} ZAPRUNS=/home/minfarm/lists/zap_runs_mc.${REL} else BADRUNS=/home/minfarm/lists/bad_runs.${REL} ZAPRUNS=/home/minfarm/lists/zap_runs.${REL} fi BADRUNS=/home/minfarm/lists/bad_runs.${REL} [ "${STRP}" == "mrnt" ] && BADRUNS=/home/minfarm/lists/bad_runs_mrcc.${REL} [ "${MCDET}" ] && BADRUNS=/home/minfarm/lists/bad_runs_mc.${REL} [ ! -r "${BADRUNS}" ] && BADRUNS=/dev/null -------------------------- Fixes to roundup LISTS=/minos/data/minfarm/lists global substitution /home/minfarm/lists-> ${LISTS} cp -a roundup.20080410 . ln -sf roundup.20080410 roundup # was ############ # MCIMPORT # ############ Continue with forward, per rhatcher advice, FDIRS=' 735 736 737 738 739 740 741 742 743 744 745 746 765 766 767 768 769 770 771 772 ' for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done Then can continue with reverse, RDIRS='712 713 714 715 716 717 718' ' for DIR in ${RDIRS}; do ./mcimport.20080326 -T -s n1104 daikon_04/L010185N/near/${DIR} done Started this, saw message, and interrupted. WILL ENCP 1 files HAVE n11037259_0017_L010185N_D04.tar.gz OOPS n11037259_0017_L010185N_D04.tar.gz not in PNFS This is the same stray file spotted before going to 765 $ less /minos/data/mcimport/TAR/daikon_04/L010185N/near/725/mcimport.log Yes, processing was interrupted and restarted Thu Apr 3 07:29:32 CDT 2008 when handing this file, and started up with an immediate ECRC. Re-interrupted and started with COPY/ECRC, oops, the ECRC did not happen. So we have a bad CRC for this. This has been getting copied again and again. XSETS=`grep n11037259_0017_L010185N_D04 \ /minos/data/mcimport/TAR/daikon_04/L010185N/near/*/mcimport.log \ | cut -f 9 -d / | uniq` FILE=n11037259_0017_L010185N_D04.tar.gz for SET in ${XSETS} ; do ls -l /pnfs/minos/stage/daikon_04//L010185N/near/${SET}/${FILE} done /pnfs/minos/stage/daikon_04//L010185N/near/751/n11037259_0017_L010185N_D04.tar.gz: No such file or directory The rest have dates 3 through 10 April. 
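Before the cleanup steps below, a spot check that the stored .ecrc really disagrees with a fresh CRC of the staged tarball (a sketch; it reuses the ecrc | cut idiom used elsewhere in this log, and assumes the file is still present under TAPE/):

FILE=n11037259_0017_L010185N_D04.tar.gz    # as identified above
STORED=`cat /minos/data/mcimport/TAR/daikon_04/L010185N/near/725/${FILE/.tar.gz}.ecrc`
FRESH=`ecrc TAPE/${FILE} | cut -f 2 -d ' '`
[ "${STORED}" = "${FRESH}" ] && echo "CRC OK" || echo "CRC mismatch: stored ${STORED} vs fresh ${FRESH}"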
1) remove the bad files from /pnfs/minos/stage for SET in ${XSETS} ; do rm /pnfs/minos/stage/daikon_04//L010185N/near/${SET}/${FILE} done 2) remove the bad ECRC $ cat /minos/data/mcimport/TAR/daikon_04/L010185N/near/725/${FILE/.tar.gz}.ecrc 3051355852 ecrc TAPE/${FILE} | cut -f 2 -d ' ' > \ /minos/data/mcimport/TAR/daikon_04/L010185N/near/725/${FILE/.tar.gz}.ecrc 3) rewrite to pnfs ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/725 NOW RESUME ARCHIVE for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done ============================================================================= 2008 04 10 ########### # KREYMER # ########### Due to a family emergency, I'll very likely be out of town today, Thursday 10 April 2008. I can be reached at cell 630 697 0469, and will try to check in via the network. I will try to be back by Friday morning. ########### # ROUNDUP # ########### new version which ignores errors Type 1 ( per rubin ) These are input I/O errors, which cannot produce output, but which may be hanging around from previous attempts. SRV1> cp -a AFSS/roundup.20080410 . SRV1> ln -sf roundup.20080410 roundup # was roundup.20080409 SRV1> date Wed Apr 9 17:40:45 CDT 2008 Never used this, moved on to 20080412 ############ # MCIMPORT # ############ Why is n11037259_0017_L010185N_D04.tar.gz being copied to 765 ? Command line: encp --delayed-dismount 5 --verbose 4 /home/mindata/TAPE/n11037259_0017_L010185N_D04.tar.gz /pnfs/minos/stage/daikon_04/L010185N/near/765/n11037259_0017_L010185N_D04.tar.gz ########## # CONDOR # ########## MINOS07 > condor_off -peaceful minos07 -subsystem startd Sent "Set-Peaceful-Shutdown" command to startd minos07.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos07.fnal.gov ############ # MCIMPORT # ############ MINOS26 > du -sm /pnfs/minos/stage/daikon_04 8742149 /pnfs/minos/stage/daikon_04 $ du -sm /minos/data/mcimport/STAGE/daikon_04 dds TAPE 5111041 /minos/data/mcimport/STAGE/daikon_04 ============================================================================= 2008 04 09 ########## # CONDOR # ########## MINOS25 > condor_off -peaceful minos07 -subsystem startd ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5004:Failed to get authorization from server. Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile) AUTHENTICATE:1004:Failed to authenticate using FS Can't send Set-Peaceful-Shutdown command to startd minos07.fnal.gov ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5004:Failed to get authorization from server. Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile) AUTHENTICATE:1004:Failed to authenticate using FS Sent "Kill-Daemon-Peacefully" command to master minos07.fnal.gov ######## # FARM # ######## Sent note to rubin,minos-data re CPB mcnear ZAP and stale PEND files. ########### # ROUNDUP # ########### Comparison of roundup.new to roundup... 
First cedar near ./roundup -n -r cedar near 2>&1 | tee /tmp/cnold.log AFSS/roundup.new -n -r cedar near 2>&1 | tee /tmp/cnnew.log diff /tmp/cnold.log /tmp/cnnew.log Then the biggie, CPB mcnear ( filter out the new ECRC messages from purge ) ./roundup -n -r cedar_phy_bhcurv mcnear 2>&1 | tee /tmp/cmold.log AFSS/roundup.new -n -r cedar_phy_bhcurv mcnear 2>&1 | tee /tmp/cmnew.log diff /tmp/cmold.log /tmp/cmnew.log | grep -v ECRC There are many more HAVE messages in the cmold.log. Understandable, we generate one per run, versus one per concatenated file. For cand files, that's a big but moot difference. This is ready for production use. $ mv roundup.new roundup.20080409 SRV1> cp -a AFSS/roundup.20080409 . SRV1> ln -sf roundup.20080409 roundup # was roundup.20080225 SRV1> date Wed Apr 9 17:40:45 CDT 2008 ####### # CVS # ####### per hartnell request, added to NtupleUtils and NuMubar : dja25 David Auty djauty * mtavera Marta Tavera * nickd Nicholas Devenish * rbpatter Ryan B. Patterson * and did ./adduser ######### # ADMIN # ######### Checking existing Minos nodes for 64bit capacity Per http://www.cyberciti.biz/faq/linux-how-to-find-if-processor-is-64-bit-or-not/ cat /proc/cpuinfo | grep flags | grep ' lm ' for NODE in ${NODES} ; do printf "${NODE} " ; ssh -ax ${NODE} "cat /proc/cpuinfo | grep flags | uniq | tr -s ' ' \\\n | grep lm" ; done Have lm for all Cluster and Servers, and flxb flxb31 and above ============================================================================= 2008 04 08 ########### # ROUNDUP # ########### Continuing to adjust roundup.new to use samsub ########## # ORACLE # ########## Date: Tue, 08 Apr 2008 16:34:49 +0000 (UTC) From: Arthur Kreymer To: minosdb-support@fnal.gov Subject: Minos Oracle server purchases for FY 2008 We need to review the status of minosora1/3 and their disks, and make a plan for the purchase of either replacements, or extended service plans, as appropriate. I am not aware of any performance issues requiring upgrades to these systems. The long term Ganglia monitoring shows an average of 1/4 process, with a CPU load of around one percent. http://rexganglia2.fnal.gov/minos/?r=year&c=MINOS+DB&h=minosora1.fnal.gov The disk space presently configured is about 550 GBytes, of which 110 GBytes is used. http://cdcvs.fnal.gov/cgi-bin/fnal-only/cvsweb.cgi/syscollect/minos/minosora1-config.html?rev=1.1.1.1030&con tent-type=text/x-cvsweb-markup&cvsroot=syscollect Issues to be dealt with in the plan : 0) What is the end of warranty coverage for the hosts and disks ? 1) What is the end of service life for these unique Sun/AMD systems ? 2) What is the end of service life for the disks ? 3) What are the costs of replacement versus continued maintenance ? 4) Are we satisfied with the level of service being actually provided, given our experience with last year's 6 month minosora3 repair ? 5) If replacement is an option, what system and/or disks are preferred ? 6) The plan should provide a policy good for the next 3 years, which should cover the end of Minos data taking. Please coordinate this with Robert Hatcher, who is taking on Liz's role as Minos liaison to the Computing Division. Robert is on the minosdb-support mailing list. ######## # FARM # ######## The beam dbu information seems to have returned. 
SRV1> /grid/app/minos/scripts/beam_mon fnpcsrv1 Inquiring of fnpcsrv1 on port 3307 as reader_old:minos_db B080408_000001.mbeam.root from 2008-04-08 00:00:04 to 2008-04-08 07:59:57 6739 spills 107920358 bytes, found: 28793, missed: 0 seconds B080407_160001.mbeam.root from 2008-04-07 16:00:01 to 2008-04-07 23:59:49 10983 spills 183291638 bytes, found: 57581, missed: 15 seconds B080407_080001.mbeam.root from 2008-04-07 08:00:01 to 2008-04-07 15:59:58 12905 spills 220909976 bytes, found: 86378, missed: 18 seconds B080407_000001.mbeam.root from 2008-04-07 00:00:00 to 2008-04-07 07:59:59 12827 spills 208561586 bytes, found: 115177, missed: 20 seconds ============================================================================= 2008 04 07 ######## # FARM # ######## SRV1> /grid/app/minos/scripts/beam_mon minos-db1 Inquiring of minos-db1 on port 3306 as reader_old:minos_db B080408_000001.mbeam.root from 2008-04-08 00:00:04 to 2008-04-08 07:59:57 6739 spills 107920358 bytes, found: 28793, missed: 0 seconds B080407_160001.mbeam.root from 2008-04-07 16:00:01 to 2008-04-07 23:59:49 10983 spills 183291638 bytes, found: 57581, missed: 15 seconds B080407_080001.mbeam.root from 2008-04-07 08:00:01 to 2008-04-07 15:59:58 12905 spills 220909976 bytes, found: 86378, missed: 18 seconds B080407_000001.mbeam.root from 2008-04-07 00:00:00 to 2008-04-07 07:59:59 12827 spills 208561586 bytes, found: 115177, missed: 20 seconds SRV1> /grid/app/minos/scripts/beam_mon fnpcsrv1 Inquiring of fnpcsrv1 on port 3307 as reader_old:minos_db beam_mon returns null -- no updates recently Mon Apr 7 17:19:45 CDT 2008 ########### # ROUNDUP # ########### roundup.new - using samdup, samsub SRV1> ./roundup -n -r cedar near -> /minos/data/minfarm/maint/cnold.log Testing with a partially purges run AFSS/roundup.new -n -s N00013775 -r cedar near AFSS/roundup.new -n -W -S -v -s N00013775 -r cedar nea ############ # MCIMPORT # ############ mindata@minos-sam03 Had to restart the copy to tape, due to my desktop crashing. tail -1 /minos/data/mcimport/TAR/daikon_04/L010185N/near/755/mcimport.log COPY n11037551_0015_L010185N_D04.tar.gz $ dds TAPE | tail -rw-r--r-- 1 mindata e875 340717229 Apr 7 14:17 n11037551_0014_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 232898560 Apr 7 14:18 n11037551_0015_L010185N_D04.tar.gz $ rm TAPE/n11037551_0015_L010185N_D04.tar.gz FDIRS='755 756 757 758 759 760 761 762 763 764 765 ' for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done ########### # DESKTOP # ########### Locked up displaying a small PDF file with xpdf. Displays fine in acroread No access via ssh from the net, unable to switch local console. The attachment was PJO-APS-Survey-4-7-08.PDF Displays fine in acroread ######## # FARM # ######## Latest farm output concatenated : MINOS26 > dds /pnfs/minos/fardet_data/2008-04/F00040732_0000.mdaq.root -rw-r--r-- 1 buckley e875 18646825 Apr 3 11:43 /pnfs/minos/fardet_data/2008-04/F00040732_0000.mdaq.root MINOS26 > dds /pnfs/minos/neardet_data/2008-04/N00013887_0002.mdaq.root -rw-r--r-- 1 buckley e875 77409802 Apr 3 17:33 /pnfs/minos/neardet_data/2008-04/N00013887_0002.mdaq.root ############ # MCIMPORT # ############ Another encp 1.5 hour delay Sunday 17:00 ish less /minos/data/mcimport/TAR/daikon_04/L010185N/near/751/mcimport.log Volume VOJ554 is marked NOACCESS. Error after transferring 0 bytes in 1 files in 5246.348979 sec. Overall rate = 0 MB/sec. Transfer rate = 0 MB/sec. Network rate = 0 MB/sec. Drive rate = 0 MB/sec. Disk rate = 0 MB/sec. Exit status = 1. 
Start time: Sun Apr 6 18:19:59 2008 In summary, for 11 LTO-4 volumes written, 8 have write errors, for a total of 14 write errors. MINOS26 > ./volumes vols MINOS26 > VOLS4=`./volumes stage | grep VOJ` MINOS26 > printf "${VOLS4}\n" VOJ545 VOJ546 VOJ547 VOJ548 VOJ549 VOJ550 VOJ551 VOJ552 VOJ553 VOJ554 VOJ555 MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep wr_err ; done VOJ545 'sum_wr_err': 2, VOJ546 'sum_wr_err': 1, VOJ547 'sum_wr_err': 2, VOJ548 'sum_wr_err': 1, VOJ549 'sum_wr_err': 0, VOJ550 'sum_wr_err': 2, VOJ551 'sum_wr_err': 2, VOJ552 'sum_wr_err': 0, VOJ553 'sum_wr_err': 0, VOJ554 'sum_wr_err': 3, VOJ555 'sum_wr_err': 1, MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep remaining ; done VOJ545 'remaining_bytes': 792764928L, VOJ546 'remaining_bytes': 0L, VOJ547 'remaining_bytes': 0L, VOJ548 'remaining_bytes': 244734464L, VOJ549 'remaining_bytes': 765575680L, VOJ550 'remaining_bytes': 247753728L, VOJ551 'remaining_bytes': 0L, VOJ552 'remaining_bytes': 175473664L, VOJ553 'remaining_bytes': 34666496L, VOJ554 'remaining_bytes': 192283136L, VOJ555 'remaining_bytes': 630473728000L, MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep eod_cookie ; done VOJ545 'eod_cookie': '0000_000000000_0001049', VOJ546 'eod_cookie': '0000_000000000_0001051', VOJ547 'eod_cookie': '0000_000000000_0001683', VOJ548 'eod_cookie': '0000_000000000_0002324', VOJ549 'eod_cookie': '0000_000000000_0000924', VOJ550 'eod_cookie': '0000_000000000_0001079', VOJ551 'eod_cookie': '0000_000000000_0002281', VOJ552 'eod_cookie': '0000_000000000_0002358', VOJ553 'eod_cookie': '0000_000000000_0002246', VOJ554 'eod_cookie': '0000_000000000_0002318', VOJ555 'eod_cookie': '0000_000000000_0000445', MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep sum_mounts ; done VOJ545 'sum_mounts': 9, VOJ546 'sum_mounts': 9, VOJ547 'sum_mounts': 10, VOJ548 'sum_mounts': 9, VOJ549 'sum_mounts': 7, VOJ550 'sum_mounts': 58, VOJ551 'sum_mounts': 10, VOJ552 'sum_mounts': 8, VOJ553 'sum_mounts': 8, VOJ554 'sum_mounts': 12, VOJ555 'sum_mounts': 4, ######## # FARM # ######## Report from Rubin, who cannot attend today's Grid Users' meeting: About the only thing to report is that the reconfiguration of fermigrid1 seems to have almost eliminated the hold problem. There has only been one held run since the evening of March 30, and that run 'auto-released' with no problem. (Auto-release means was released by the cron job.) Right now (Sunday at noon) the db updater on fnpcsrv1 hasn't run for a couple of days. Steve has checked that this is *not* a system problem, and I've turned it over to Alex and Nick. I've tried running the update procedure manually, but it just terminates almost immediately with no error (or any) messages. And my check indicates that there have been no updates done. One can check with /grid/app/minos/scripts beam-mon which will look at fnpcsrv1 and at minos-db1 with the argument 'minos-db1'. ####### # AFS # ####### Ticket 107032 ########### # MONTHLY # ########### DATASETS 4/7 PREDATOR 4/7 VAULT 4/8 via cron MYSQL 4/9 started Wed Apr 9 09:45:44 CDT 2008 after posting notice to CRL, and telling shifter Unlocked 10:18. I failed to purge older BINLOG's last month, I see lots of 1 GB logs through 3 Feb. Did an initial supplemental purge, before the archive of BINLOG mysql -u root offline PURGE MASTER LOGS BEFORE DATE_SUB( NOW( ), INTERVAL 40 DAY); EXIT; Final cp back to COPY took 53 minutes ( 18 GB ) at 5 MB/sec. 
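Sanity check on the quoted copy rate, 18 GB in 53 minutes (bc assumed):

echo "scale=2; 18 * 1024 / (53 * 60)" | bc    # 5.79 MB/sec, consistent with the ~5 MB/sec noted above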
Writes to /M/D were at over 20 MB/sec, per ganglia ################# # VAULT_MONTHLY # ################# mv vault.20060807 vault_monthly simplified MONTH calculation from DAY=`date +%d` let " DOFF = ( DAY + 15) " MONTH=`date +%Y-%m -d "${DOFF} days ago"` to (( DOFF = `date +%d` + 1 )) MONTH=`date +%Y-%m -d "${DOFF} days ago"` Scheduled this for tonight, by activating in crontab.dat, based on 2008-02 times Far - 2 hours Near - 8 hours 11 20 07 * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/vault_monthly ============================================================================= 2008 04 04 ######## # DATA # ######## Checking again, around 17:44, from mindata@minos26 $ ps xf | cut -f 2 -d : | cut -c 4- | grep '^scp\|.tar.gz$' scp -t STAGE/mualem $ time ls -alF /minos/data/mysql/archive/20080303/offline real 2m4.571s real 0m10.478s minos-sam03.fnal.gov real 0m7.622s $ time ls -alF /minos/data/mysql/archive/20080204/offline real 0m33.646s real 0m5.015s real 0m4.920s ########## # SAMSUB # ########## Check this agains the current logs, SRV1> grep PEND cedarnear.log | grep -v ' 0 ' ... PEND - have 1/7 subruns for N00013775_*.spill.sntp.cedar.0.root 22 03/09 23:41 4 5 SRV1> AFSS/samsub /minos/data/minfarm/nearcat | grep -v '0$' N00013775_.spill.sntp.cedar.0.root 4 SRV1> grep PEND cedar_phy_bhcurvmcnear.log | grep -v ' 0 ' PEND - have 17/30 subruns for n13037094_*_L250200N_D04.mrnt.cedar_phy_bhcurv.root 73 01/18 00:40 8 25 PEND - have 3/29 subruns for n13037095_*_L250200N_D04.mrnt.cedar_phy_bhcurv.root 73 01/18 00:21 22 25 PEND - have 1/30 subruns for n13037097_*_L250200N_D04.mrnt.cedar_phy_bhcurv.root 73 01/18 00:38 4 5 PEND - have 17/30 subruns for n13037094_*_L250200N_D04.sntp.cedar_phy_bhcurv.root 73 01/18 00:40 8 25 PEND - have 3/29 subruns for n13037095_*_L250200N_D04.sntp.cedar_phy_bhcurv.root 73 01/18 00:21 22 25 PEND - have 1/30 subruns for n13037097_*_L250200N_D04.sntp.cedar_phy_bhcurv.root 73 01/18 00:38 4 5 SRV1> AFSS/samsub /minos/data/minfarm/mcnearcat | grep -v '0$' n13037094__L250200N_D04.mrnt.cedar_phy_bhcurv.root 8 n13037095__L250200N_D04.mrnt.cedar_phy_bhcurv.root 22 n13037097__L250200N_D04.mrnt.cedar_phy_bhcurv.root 4 n13037094__L250200N_D04.sntp.cedar_phy_bhcurv.root 8 n13037095__L250200N_D04.sntp.cedar_phy_bhcurv.root 22 n13037097__L250200N_D04.sntp.cedar_phy_bhcurv.root 4 ( reorderered for clarity , those listed below are ZAP files ) n13037260__L010185N_D04.mrnt.cedar_phy_bhcurv.0.root 26 n13037270__L010185N_D04.mrnt.cedar_phy_bhcurv.0.root 27 n13037260__L010185N_D04.sntp.cedar_phy_bhcurv.0.root 26 n13037270__L010185N_D04.sntp.cedar_phy_bhcurv.0.root 27 n13037260__L010185N_D04.cand.cedar_phy_bhcurv.0.root 14 n13037270__L010185N_D04.cand.cedar_phy_bhcurv.0.root 27 Bottom line, this is looking pretty good, Next step is to use this in roundup.new, and compare details of some dry runs. Then put it in production. ########## # SAMSUB # ########## This has been built for use with ROUNDUP, see entry under 2008 04 01 SRV1> ls /minos/data/minfarm/mcnearcat | wc -l 2741 SRV1> time AFSS/samsub /minos/data/minfarm/mcnearcat ... 
real 0m5.978s user 0m2.062s sys 0m0.162s SRV1> AFSS/samsub /minos/data/minfarm/mcnearcat | wc -l 114 ############## # AFSERRSCAN # ############## Made the month automaticaly be `date +%b`, can override like ./afserrscan '' Jan ########## # SOUDAN # ########## Checking /var/log/messages on minos-db behind minos-gateway.minos-soudan.org, Many messages like Apr 3 23:31:10 minos-db kernel: afs: Lost contact with file server 131.225.68.65 in cell fnal.gov (all multi-homed ip addresses down for the server) Apr 3 23:31:11 minos-db kernel: afs: failed to store file (110) Apr 3 23:33:53 minos-db kernel: afs: file server 131.225.68.65 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) [root@minos-db ~]# host 131.225.68.65 65.68.225.131.in-addr.arpa domain name pointer fsus-minos01.fnal.gov. Always the same server. Usually at *:31:10 Sometimes *:32:53 Almost every hour Test with grep afs: /var/log/messages | grep -v Tokens | uniq chmod 555 messages* Testing as kreymer, ln -s /afs/fnal.gov/files/home/room1/kreymer AFSK cd AFSK while true ; do date ; wc -l foo ; sleep 20 ; done No interruption around 13:31, but this is the wrong server. ln -s /afs/fnal.gov/files/data/minos MD MDLDF=/afs/fnal.gov/files/data/minos/log_data/R1.16.0.log_data.tar wc -c ${MDLDF} 30720 /afs/fnal.gov/files/data/minos/log_data/R1.16.0.log_data.tar while true ; do date ; wc -c ${MDLDF} ; sleep 20 ; done Fri Apr 4 13:48:53 CDT 2008 ... ============================================================================= 2008 04 03 ######### # DOCDB # ######### Tested administrative ( minos-adm ) access to Minos DocDB by kreymer and rhatcher. This lets us approve new users, etc. ######## # DATA # ######## Observing very different file access times to /minos/data Mysql> ls -alF /minos/data/mysql/archive/20080303/offline minos-sam03 $ time ls -alF /minos/data/mysql/archive/20080303/offline real 0m0.710s user 0m0.017s sys 0m0.015s MINOS26 > time dds /minos/data/mysql/archive/20080303/offline real 1m39.509s user 0m0.022s sys 0m0.081s MINOS26 > time ls -alF /minos/data/mysql/archive/20080303/offline real 1m5.076s user 0m0.019s sys 0m0.077s 30389 ? D 0:00 \_ md5sum n11037465_0029_L010185N_D04.tar.gz 30292 ? Ss 0:00 scp -t STAGE/mtavera 30289 ? Ss 0:00 scp -t STAGE/mtavera 30247 ? Ss 0:00 scp -t STAGE/mtavera 30030 ? Ss 0:01 scp -t STAGE/mualem 29816 ? Ss 0:00 scp -t STAGE/mtavera 29556 ? Ss 0:00 scp -t STAGE/mtavera 29497 ? Ss 0:00 scp -t STAGE/mtavera 29208 ? Ss 0:01 scp -t STAGE/mtavera 30389 ? D 0:00 \_ md5sum n11037465_0029_L010185N_D04.tar.gz 30292 ? Ss 0:00 scp -t STAGE/mtavera 30289 ? Ss 0:00 scp -t STAGE/mtavera 30247 ? Ss 0:00 scp -t STAGE/mtavera 30030 ? Ss 0:01 scp -t STAGE/mualem 29816 ? Ss 0:00 scp -t STAGE/mtavera 29556 ? Ss 0:00 scp -t STAGE/mtavera 29497 ? Ss 0:00 scp -t STAGE/mtavera 29208 ? Ss 0:01 scp -t STAGE/mtavera Performance is back to good again on minos26, Thu Apr 3 13:53:40 CDT 2008 real 0m2.157s user 0m0.018s sys 0m0.050s 31407 ? Ss 0:00 scp -t STAGE/mtavera 31346 ? Ss 0:00 scp -t STAGE/mtavera 30931 ? Ss 0:00 scp -t STAGE/mtavera 30889 ? Ss 0:01 scp -t STAGE/mtavera 30844 ? Ss 0:00 scp -t STAGE/mtavera And edging down, Thu Apr 3 14:01:33 CDT 2008 real 0m11.622s 31407 ? Ss 0:01 scp -t STAGE/mtavera 1263 ? D 0:00 \_ md5sum n11037465_0009_L010185N_D04.tar.gz 1235 ? Ss 0:00 scp -t STAGE/mualem 1005 ? Ss 0:00 scp -t STAGE/mtavera 955 ? Ss 0:00 scp -t STAGE/mtavera 952 ? Ss 0:00 scp -t STAGE/mtavera 886 ? Ss 0:00 scp -t STAGE/mtavera 746 ? Ss 0:00 scp -t STAGE/mtavera 729 ? 
Ss 0:00 scp -t STAGE/mtavera ps xf | cut -f 2 -d : | cut -c 4- | grep '^scp\|.tar.gz$' real 1m34.086s \_ md5sum n11037465_0013_L010185N_D04.tar.gz \_ md5sum n11037465_0015_L010185N_D04.tar.gz \_ md5sum n11037465_0004_L010185N_D04.tar.gz scp -t STAGE/mtavera scp -t STAGE/mtavera Did the ls as mindata@minos26, real 0m0.150s Same short time now for kreymer@minos26, real 0m0.079s But now no scp's or md5sum's are running ! still no activity, somewhat slow access real 0m6.821s real 0m0.032s At an earlier time, during the 1 minute slowdowns, MINOS26 > lsof -N /minos/data COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME ls 29272 kreymer 3r DIR 0,23 67584 1994944521 /minos/data/mysql/archive/20080303/offline (minos-nas-0.fnal.gov:/minos/data) mindata@minos26 $ /usr/sbin/lsof -N /minos/data COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME mcimport 22085 mindata cwd DIR 0,23 116736 3948960453 /minos/data/mcimport/OVERLAY/mcin/dcache (minos-nas-0.fnal.gov:/minos/data) mcimport 22085 mindata 1w REG 0,23 4044209 2031215561 /minos/data/mcimport/OVERLAY/log/mcimport.log (minos-nas-0.fnal.gov:/minos/data) scp 29497 mindata 3w REG 0,23 295895040 583008038 /minos/data/mcimport/mtavera/n11037465_0022_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 29556 mindata 3w REG 0,23 278495232 4192479982 /minos/data/mcimport/mtavera/n11037465_0011_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 29816 mindata 3w REG 0,23 199196672 152377133 /minos/data/mcimport/mtavera/n11037465_0016_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 30030 mindata 3w REG 0,23 319619072 347627699 /minos/data/mcimport/mualem/n11037744_0003_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 30247 mindata 3w REG 0,23 81985536 2015710680 /minos/data/mcimport/mtavera/n11037465_0008_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 30289 mindata 3w REG 0,23 50003968 3017774333 /minos/data/mcimport/mtavera/n11037465_0026_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) scp 30292 mindata 3w REG 0,23 48332800 2514584717 /minos/data/mcimport/mtavera/n11037465_0014_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) bash 30387 mindata cwd DIR 0,23 100352 322878581 /minos/data/mcimport/mtavera (minos-nas-0.fnal.gov:/minos/data) md5sum 30389 mindata cwd DIR 0,23 100352 322878581 /minos/data/mcimport/mtavera (minos-nas-0.fnal.gov:/minos/data) md5sum 30389 mindata 1w REG 0,23 0 504262275 /minos/data/mcimport/mtavera/n11037465_0029_L010185N_D04.tar.gz.md5 (minos-nas-0.fnal.gov:/minos/data) md5sum 30389 mindata 3r REG 0,23 334088614 375232167 /minos/data/mcimport/mtavera/n11037465_0029_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) bash 30455 mindata cwd DIR 0,23 100352 322878581 /minos/data/mcimport/mtavera (minos-nas-0.fnal.gov:/minos/data) md5sum 30457 mindata cwd DIR 0,23 100352 322878581 /minos/data/mcimport/mtavera (minos-nas-0.fnal.gov:/minos/data) md5sum 30457 mindata 1w REG 0,23 0 181349684 /minos/data/mcimport/mtavera/n11037465_0025_L010185N_D04.tar.gz.md5 (minos-nas-0.fnal.gov:/minos/data) md5sum 30457 mindata 3r REG 0,23 339105079 401305239 /minos/data/mcimport/mtavera/n11037465_0025_L010185N_D04.tar.gz (minos-nas-0.fnal.gov:/minos/data) mcimport 30491 mindata cwd DIR 0,23 116736 3948960453 /minos/data/mcimport/OVERLAY/mcin/dcache (minos-nas-0.fnal.gov:/minos/data) ecrc 30492 mindata cwd DIR 0,23 116736 3948960453 /minos/data/mcimport/OVERLAY/mcin/dcache (minos-nas-0.fnal.gov:/minos/data) ecrc 30492 mindata 3r REG 0,23 338891388 2309851038 
/minos/data/mcimport/OVERLAY/mcin/dcache/n13047170_0025_L010185N_D04.reroot.root (minos-nas-0.fnal.gov:/minos/data) cut 30493 mindata cwd DIR 0,23 116736 3948960453 /minos/data/mcimport/OVERLAY/mcin/dcache (minos-nas-0.fnal.gov:/minos/data) ######### # ADMIN # ######### Sent email to minos_software_discussion, asking whether anyone is using SL3 at Fermilab, or from AFS ( can we upgrade minos11 ? ) ######## # DATA # ######## MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N 6184434 /minos/data/mcimport/STAGE/daikon_04/L010185N 6622 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N 27834 /minos/data/mcimport/STAGE/daikon_04/L250200N MINOS26 > du -sm /pnfs/minos/stage/daikon_04/* 2790357 /pnfs/minos/stage/daikon_04/L010185N 3412300 /pnfs/minos/stage/daikon_04/L250200N ############### # CONDORPROXY # ############### Corrected some typo errors, corrected to use /usr/krb5/bin/kinit Tested in cron via /tmp/ctd at 07:02, looks good As an added challenge, this was during the BlueArc maintenancem, which stalled global file operations on the Cluster. ######## # DATA # ######## 2008 04 02 Preparing for the 7 AM 5 minute BlueArc outage minfarm@fnpcsrv1 SRV1> pwd /home/minfarm/ROUNTMP SRM1> mv NOCAT.ok NOCAT mindata@minos26 crontab -r mindata@minos-sam03 Manually interrupt cp phase, if this is running at 07:00 and remove the partial file. COPY n11037259_0017_L010185N_D04.tar.gz $ dds TAPE/n11037259_001* -rw-r--r-- 1 mindata e875 345557859 Apr 3 06:32 TAPE/n11037259_0010_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 340648977 Apr 3 06:32 TAPE/n11037259_0011_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 347040982 Apr 3 06:33 TAPE/n11037259_0012_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 331563261 Apr 3 06:34 TAPE/n11037259_0013_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 338125810 Apr 3 06:34 TAPE/n11037259_0014_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 342672643 Apr 3 06:35 TAPE/n11037259_0015_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 334430343 Apr 3 06:36 TAPE/n11037259_0016_L010185N_D04.tar.gz -rw-r--r-- 1 mindata e875 146292736 Apr 3 06:36 TAPE/n11037259_0017_L010185N_D04.tar.gz rm TAPE/n11037259_0017_L010185N_D04.tar.gz Date: Thu, 03 Apr 2008 07:21:12 -0500 From: Andy Romero To: site-nas-announce@fnal.gov Subject: BlueArc Maintenance complete The upgrade to firmware v5.1.1156.13 is complete. You may resume normal operations. To reverse this, did minfarm@fnpcsrv1 mv NOCAT NOCAT.ok mindata@minos26 crontab crontab.dat mindata@minos-sam03 FDIRS='725 726 727 728 729 730 731 732 733 734 735 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 ' for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done ########### # MINOS09 # ########### Date: Thu, 03 Apr 2008 07:30:55 -0500 (CDT) Subject: HelpDesk ticket 113516 ___________________________________________ Short Description: minos09 down Problem Description: run2-sys : Node minos09 seems to be off the network. 
Ganglia monitoring indicates that it may have been down since about 06:24 : Cluster Report for Thu, 3 Apr 2008 07:23:14 -0500 minos09.fnal.gov load_one: down Last heartbeat 0 days, 1:09:29 ago ___________________________________________ Date: Thu, 03 Apr 2008 08:40:29 -0500 (CDT) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 04 02 ######### # PROBE # ######### Corrected stderr rerouting of time bash keyword per Google advice at http://www.cs.tut.fi/~jarvi/tips/bash.html { time ... ; } 2>&1 ######### # VAULT # ######### mv vault.20060807 vault.monthly # reverse earlier confusion, this is a script which vaults the previous month's data. This is still not being done via cron, quite yet. Maybe next month ! ######## # FARM # ######## Some of our 8 nodes seem to be missing from the farm fnpc339 fnpc340 fnpc341 Based on SRV1> condor_status | grep fnpc39 SRV1> condor_status | grep fnpc34 Strange, fnpc339 is present now... oops bad grep above, needed fnpc339 Date: Wed, 02 Apr 2008 17:53:35 -0500 (CDT) Subject: HelpDesk ticket 113509 ___________________________________________ Short Description: fnpc340 down, fnpc341 not running jobs - FYI Problem Description: Two of the eight Minos/AFS GPFARM worker nodes seem to be not running jobs. fnpc340 seems to be down, not on the network. fncp341 is up, but does not have AFS mounted, and is not running jobs. condor_status returns no information for either node. This is low priority, as the user demand is low at present, and we are continuing to expand non-AFS means of running our jobs. ___________________________________________ Note To Requester: timm@fnal.gov sent this Notes To Requester: Karen Shepelak reported last week that there is a hardware problem with fnpc340 and a service call is in. I restarted afs on fnpc341. Steve _________________________________________________________________ ############### # CONDORPROXY # ############### Added condorproxy to crontab.dat Removed gridappsync, now obsolete due to PARROT 14:35 ln -sf crontab.dat.20080402 crontab.dat # was crontab.dat.20060504 crontab crontab.dat MINOS26 > crontab -l MAILTO=kreymer@fnal.gov 06 1-23/2 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/predator 10 04 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/condorproxy # 11 01 5 * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/vault.monthly ########## # CONDOR # ########## Testing new condorproxy script, amid the startup of the glideafs10min . lrwxrwxrwx 1 gfactory gfactory 29 Apr 2 14:20 kreymer-condor.proxy -> kreymer-condor.proxy.20080408 This seems to have worked smoothly. Added to this to crontab, see above. ########## # CONDOR # ########## Testing access via glidein to our 8 nodes, condor_submit glideafs10min.run 100 sections, cluster 63295 ########## # CONDOR # ########## Get a proxy with my new certificate, this time on minos26 . 
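( Before the renewal steps that follow, a quick lifetime check on the current proxy can confirm the renewal is actually needed ; a sketch, assuming voms-proxy-info from the VDT setup used below, and a hypothetical local proxy path )
. /minos/scratch/kreymer/VDT/setup.sh
PROXY=${HOME}/kreymer-condor.proxy     # hypothetical location of the current proxy file
LEFT=`voms-proxy-info -file ${PROXY} -timeleft 2>/dev/null`
LEFT=${LEFT:-0}
[ ${LEFT} -lt 21600 ] && echo "under 6 hours left (${LEFT}s), renew" || echo "still good for ${LEFT}s"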
/grid/app/minos/VDT/setup.sh openssl pkcs12 -in kreymerdoe.p12 -clcerts -nokeys -out kreymerdoe.pem Enter Import Password: MAC verified OK openssl pkcs12 -in kreymerdoe.p12 -nocerts -out kreymerdoekey.pem Enter Import Password: MAC verified OK Enter PEM pass phrase: Verifying - Enter PEM pass phrase: chmod 600 kreymerdoe* vomses is out of date, As mindata, in /grid/app/minos/VDT/glite/etc scp kreymer@fnpcsrv1:/usr/local/vdt-1.8.1/glite/etc/vomses vomses cp -a vomses vomses.20080107 # based on date of file on fnpcsrv1 Still no luck , get message from vpi, VOMS Server for fermilab not known! Switch to another installation of VDT . /minos/scratch/kreymer/VDT/setup.sh DAYS=20 (( HOURS = DAYS * 24 )) DAPR=`date -d "today + ${DAYS}days" +%Y%m%d` voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife ${HOURS}:0 \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymercondor.proxy.${DAPR} \ -valid ${HOURS}:0 This seems to work, let's try this with an inline password. echo ${PPH} | voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife ${HOURS}:0 \ -cert kreymerdoe.pem \ -key kreymerdoekey.pem \ -out kreymer-condor.proxy.${DAPR} \ -valid ${HOURS}:0 \ -pwstdin Creating proxy ......................................................................... Done Your proxy is valid until Tue Apr 22 10:58:48 2008 scp kreymer-condor.proxy.${DAPR} gfactory@minos25:.grid/kreymer-condor.proxy.${DAPR} ssh gfactory@minos25 \ "cd .grid ; ln -sf kreymer-condor.proxy.${DAPR} kreymer-condor.proxy" Did this at around 11:02 The Idle kreymer glideins immediately started running. ============================================================================= 2008 04 01 ######## # DATA # ######## At about 11:00, data rates on minos-sam03 reading /minos/data dropped from 10 MB/sec to under 5. From 15:00 to 16:00, the rate dropped from 4 to 1/2 MB/sec, and has remained there through 16:30. This slowdown appears to be global. SRV1> du -sk /minos/data/mcimport/STAGE/daikon_04/L010185N/near/720/n12037209_0010_L010185N_D04.tar.gz 10640 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/720/n12037209_0010_L010185N_D04.tar.gz SRV1> time sum /minos/data/mcimport/STAGE/daikon_04/L010185N/near/720/n12037209_0010_L010185N_D04.tar.gz 01559 10639 real 0m19.112s user 0m0.066s sys 0m0.031s Blue2 seems to be OK SRV1> time sum /grid/data/minos/minfarm/OLDBAD/n13014007_0004_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root 20893 23304 real 0m0.623s user 0m0.162s sys 0m0.086s Date: Tue, 01 Apr 2008 17:01:18 -0500 (CDT) Subject: HelpDesk ticket 113446 ___________________________________________ Short Description: /minos/data BlueArc access has slowed down to a crawl today Problem Description: LSC/CSI : /minos/data is mounted from minos-nas-0.fnal.gov:/minos/data At about 11:00 today, data rates on minos-sam03 reading /minos/data dropped from 10 MB/sec to under 5. From 15:00 to 16:00, the rate dropped from 4 to 1/2 MB/sec, and has remained there through 16:30. This slowdown appears to be global. I see the same terrible data rates now from fnpcsrv1. I do not offhand see unreasonable user loads coming from Minos. Are there global BlueArc problems ? I do not see a slowdown for files served by blue2. Are there problems with the Minos data array ? ___________________________________________ Date: Wed, 02 Apr 2008 08:42:46 -0500 (CDT) This ticket has been reassigned to MENGEL, MARC of the CD-LSCS/CSI/CS/EST Group. 
____________________________________________ Date: Wed, 02 Apr 2008 11:12:57 -0500 (CDT) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/BLU Group. ___________________________________________ 12:00 romero adjusted some parameters to allow more requests to the array. We need to wait for another minos-sam03 read cycle to see the effect. Presently, there are 3 md5sums and 7 scp's running on minos26. The 6-8 MB/sec rates seen on sam03 seem kind of normal. __________________________________________ The next reading pass started on minos-sam03, around 13:00 CDT. Rates seem to be around 6 to 8 MBytes per second, and fairly stable. This is probably the expected rate, given the 9 active file access processes to /minos/data on minos26. __________________________________________ Solution: Performance is back to normal levels ___________________________________________________________________ N.B. after 13:00, the 1 minute data dropouts occur every 11 to 12 minutes, not the previous 4 minutes. ########## # CONDOR # ########## Glideins have been queued since 03/31 20:10 The last gfactory started around 3/31 19:31 Found log file for one of the glideins, via condor_q -l 62924.4 /home/gfactory/glideinsubmit/glidein_t11/entry_gpminos/log/condor_activity_20080331_gpminos@t11@minos@my2.log Found message there like 000 (62934.008.000) 03/31 20:14:17 Job submitted from host: <131.225.193.25:63984> ... 020 (62934.002.000) 03/31 20:14:20 Detected Down Globus Resource RM-Contact: fngp-osg.fnal.gov:2119/jobmanager-condor ... 026 (62934.002.000) 03/31 20:14:20 Detected Down Grid Resource GridResource: gt2 fngp-osg.fnal.gov:2119/jobmanager-condor ... Test overall load with 100 5-minute glideins to AFS nodes, ########### # ROUNDUP # ########### Checking timing of samdup, with repeated runs. In case we need to do this for the HAVE scan in roundup. First, about 323 files SRV1> time ./samdup /minos/data/minfarm/nearcat real 0m7.892s user 0m1.685s sys 0m0.231s real 0m6.606s user 0m1.664s sys 0m0.153s Now something hefty, 1699 files SRV1> time ./samdup /minos/data/minfarm/farcat real 0m58.310s user 0m5.066s sys 0m0.335s real 0m30.409s user 0m4.981s sys 0m0.348s real 0m31.177s user 0m5.062s sys 0m0.335s What we need is a count of existing declared subruns in SAM, for each run. It will be easier to draft a new samsub, derived from samdup, counting all subruns declared to SAM for the files in the given directory. The existing roundup stores the counts in shell arrays HAVE${FENU}[10#${FRUN:1}] ---- 2008 04 03 FENU is the filename past run/subrun, '.' changed to '_' with the leading delimiter removed. So any convenient filename with subrun removed would do.
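( A minimal sketch of this key-and-count idea, counting the local files per run rather than the SAM declarations that the real samsub queries ; the sed pattern is an assumption about the _NNNN subrun field in the names above )
CATDIR=/minos/data/minfarm/nearcat        # any of the *cat directories
ls ${CATDIR} \
  | sed 's/_[0-9][0-9][0-9][0-9]\([._]\)/_\1/' \
  | sort | uniq -c \
  | awk '{print $2, $1}'
# output is "key count" lines, the same shape as the samsub listings above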
In the samdup script, we have this as TAI, with the original delimiter For use in roundup, it is simplest to generate RUN_TAI strings, writing these and the subrun count for each unique RUN_TAI Create a RUNTAISET of these RUN_END's Search each RUNTAISET member for parents, ---- 2008 04 04 Implemented this, see note under 2008 04 04 Testing during development with AFSS/samsub /minos/scratch/kreymer/nearcat AFSS/samsub /minos/scratch/kreymer/mcnearcat ============================================================================= 2008 03 31 ########### # ENSTORE # ########### Copies stuck again, as noted in alarms (2008-Mar-31 16:53:03) stkenmvr140a 5292 root E (1) LTO4_40MV MOUNTFAILED max_consecutive_failures (3) reached Date: Mon, 31 Mar 2008 17:53:36 -0500 (CDT) Subject: HelpDesk ticket 113375 ___________________________________________ Short Description: Mover LTO4_40 stuck again Problem Description: An encp copy to LTO-4 tape has been hung up since Mon Mar 31 16:44:53 2008 Apparently due to another triple failure to mount on mover LTO4_40 . Please free this up, and take this mover out of service if appropriate. Start time: Mon Mar 31 16:44:53 2008 User: mindata(3648) Group: e875(5111) Euser: mindata(3648) Egroup: e875(5111) Command line: encp --delayed-dismount 5 --verbose 4 /home/mindata/TAPE/n11037160_0000_L010185N_D04.tar.gz /pnfs/minos/stage/daikon_04/L010185N/near/716/n11037160_0000_L010185N_D04.t ar.gz Version: v3_7 CVS $Revision: 1.866 $ OS: Linux 2.6.9-55.0.2.ELsmp i686 Release: Scientific Linux Fermi LTS release 4.4 (Wilson) Library: CD-LTO4G1 Storage Group: minos File Family: stage FF Wrapper: cpio_odc FF Width: 1 Current working directory: minos-sam03.fnal.gov:/minos/data/mcimport/STAGE/daikon_04/L010185N/near/716 Submitting /pnfs/minos/stage/daikon_04/L010185N/near/716/n11037160_0000_L010185N_D04.t ar.gz write request to LM. elapsed=1.010sec File queued: /home/mindata/TAPE/n11037160_0000_L010185N_D04.tar.gz library: CD-LTO4G1 family: stage bytes: 343266785 elapsed=1.10642313957 Mover called back. elapsed=2.73378705978 Input file /home/mindata/TAPE/n11037160_0000_L010185N_D04.tar.gz opened. elapsed=2.75560522079 Submitting /pnfs/minos/stage/daikon_04/L010185N/near/716/n11037160_0000_L010185N_D04.t ar.gz write request to LM. elapsed=490.330sec Submitting /pnfs/minos/stage/daikon_04/L010185N/near/716/n11037160_0000_L010185N_D04.t ar.gz write request to LM. elapsed=1390.460sec ___________________________________________ Date: Mon, 31 Mar 2008 18:18:30 -0500 (CDT) Solution: berg@fnal.gov sent this solution: Art, The mover is offline, the tape it was writing is full and available. I'll look at the mover in more detail tomorrow. - David __________________________________________________________________ 18:01 - mounting and writing ####### # NAS # ####### Date: Mon, 31 Mar 2008 16:57:11 -0500 From: Andy Romero To: site-nas-announce@fnal.gov Subject: BlueArc Maintenance This Thursday (4/3/2008) 7:00am To address the stability problems we have experienced over the past few weeks, we will be upgrading the firmware on BlueArc cluster node RHEA-1 to version 5.1.1156.13. 
(node RHEA-2 is already at v5.1.1156.13) Users of the following enterprise virtual servers (EVSs) will experience approximately 5min of downtime as we shift/re-balance the workload between the two cluster nodes blue1 blue2 dirserver1 minos-nas-0 ppdserver Users of the following EVSs will not be effected blue3 bluetest fermi-nas-1 mb-nas-0 ########### # ROUNDUP # ########### Working on roundup.new, with o ECRC file removal after purge samdup for duplicates samdup for HAVES purge of READ and SAM/READ files specific MC subdirs in saddreco Review old DUPLICATES code, for each FILE, finds corresponding files under READ from any subrun with special case, exact match for mock data. This produces the HAVE messages for each run already partially catted. Also produces the HAVES count used later, for automatic flushing ################ # SAM_PRODUCTS # ################ Per query from CDF, noted minos usage of sam_products : MINOS26 > upd modproduct -g minos sam_products v4_31 -f NULL notice: Adding flags -O "public" upd modproduct succeeded. ######### # ADMIN # ######### Received email from jklemenc listing web servers not recertified, and to be removed. ( -> minosadmin ) Only one seems to be related to Minos To be removed: +-----------------+----------------------------+-------------------+----------+---------------------+ | IP Address | Hostname | MAC Address |Updated By| Time updated | +-----------------+----------------------------+-------------------+----------+---------------------+ | 198.124.213.7 | nemean.minos-soudan.org | 00:02:B3:98:5E:01 | saranen | 2008-03-31 09:40:58 | Cannot connect to this node. Email to saranen ########## # SAMDUP # ########## Updated to get SAMQ from sam.pingDbServer, so that this can run within roundup on fnpcsrv1 cp -a AFSS/samdup samdup ./samdup /minos/data/minfarm/mcfarcat time ./samdup /minos/data/minfarm/mcnearcat real 1m0.347s user 0m3.040s sys 0m0.221s real 0m16.768s user 0m2.965s sys 0m0.222s Test setting a variable to the list SAMDUPS=`./samdup /minos/scratch/kreymer/mcnearcat` [ -n "${SAMDUPS}" ] && echo HAVE DUPS && printf "${SAMDUPS}\n" HAVE DUPS n13047100_0000_L010185N_D04.sntp.cedar_phy_bhcurv.0.root n13047100_0003_L010185N_D04.sntp.cedar_phy_bhcurv.0.root SAMDUPS=`./samdup /minos/scratch/kreymer/farcat` [ -n "${SAMDUPS}" ] && echo HAVE DUPS && printf "${SAMDUPS}\n" ########## # DCACHE # ########## ticket 113172 - corrupt file in write queue was removed friday, written OK by the standard scripts. ######## # FARM # ######## ./roundup -r cedar mcfar Mon Mar 31 10:10:18 CDT 2008 1038 34331 nearcat 1699 14289 farcat 923 46746 mcnearcat 0 1 mcfarcat 0 1 mcfmockcat 39 2 minfarm/WRITE 3699 95370 TOTAL files, GBytes nearcat 110 3141 cosmic.sntp.cedar.0.root 134 3881 cosmic.sntp.cedar_phy.0.root 84 2312 cosmic.sntp.cedar_phy.1.root 372 6007 spill.mrnt.cedar_phy.0.root 62 789 spill.mrnt.cedar_phy.1.root 213 15735 spill.sntp.cedar.0.root 63 4125 spill.sntp.cedar_phy.1.root farcat 264 7995 all.sntp.cedar.0.root 87 2174 all.sntp.cedar_phy_bhcurv.0.root 546 905 spill.bmnt.cedar_phy_bhcurv.0.root 264 1903 spill.bntp.cedar.0.root 23 102 spill.bntp.cedar_phy_bhcurv.0.root 208 509 spill.mrnt.cedar_phy_bhcurv.0.root 264 1304 spill.sntp.cedar.0.root 43 82 spill.sntp.cedar_phy_bhcurv.0.root mcnearcat 11 7502 cand.cedar_phy_bhcurv.0.root 419 8033 mrnt.cedar_phy_bhcurv.0.root 37 1480 mrnt.cedar_phy_bhcurv.root 419 27699 sntp.cedar_phy_bhcurv.0.root 37 4296 sntp.cedar_phy_bhcurv.root NEARCAT Let's do nearcat first, to defer the bmnt cleanup. 
We previously did CPB, now let's do CP. ./samdup /minos/data/minfarm/nearcat < clean > And as before, check for 0/1 duplicates in nearcat FONES=`cd /minos/data/minfarm/nearcat ; ls *.1.root` printf "${FONES}\n" | wc -w 209 for FILE in ${FONES} ; do ( cd /minos/data/minfarm/nearcat FZER=`echo ${FILE} | sed 's/\.1\./\.0\./g'` [ -r "${FZER}" ] && \ ls -l ${FILE} && ls -l ${FZER} && printf "\n" ) done So it appears we have no internal duplicates Let's see what's what with CP ./roundup -n -r cedar_phy near I see nothing strange in the messages, beyond lots of pending, ./roundup -n -r cedar_phy near | grep PEND PEND - have 18/19 subruns for N00007148_*.cosmic.sntp.cedar_phy.1.root 318 05/17 15:37 0 18 PEND - have 3/13 subruns for N00008357_*.cosmic.sntp.cedar_phy.1.root 318 05/18 11:13 7 10 PEND - have 3/4 subruns for N00008366_*.cosmic.sntp.cedar_phy.0.root 303 06/02 05:10 0 3 PEND - have 2/18 subruns for N00008564_*.cosmic.sntp.cedar_phy.0.root 300 06/04 19:55 0 2 PEND - have 1/24 subruns for N00010195_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:38 0 1 PEND - have 1/18 subruns for N00010236_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:40 0 1 PEND - have 1/24 subruns for N00010265_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:41 0 1 PEND - have 1/24 subruns for N00010268_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:41 0 1 PEND - have 1/22 subruns for N00010271_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:38 0 1 PEND - have 2/24 subruns for N00010277_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:41 0 2 PEND - have 4/24 subruns for N00010283_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:42 0 4 PEND - have 1/24 subruns for N00010286_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:50 0 1 PEND - have 1/17 subruns for N00010329_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:49 0 1 PEND - have 2/24 subruns for N00010338_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:49 0 2 PEND - have 1/24 subruns for N00010341_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:50 0 1 PEND - have 5/24 subruns for N00010347_*.cosmic.sntp.cedar_phy.0.root 170 10/12 13:49 0 5 PEND - have 12/13 subruns for N00010371_*.cosmic.sntp.cedar_phy.0.root 300 06/05 08:51 0 12 PEND - have 21/23 subruns for N00010494_*.cosmic.sntp.cedar_phy.0.root 300 06/04 16:49 0 21 PEND - have 3/24 subruns for N00010660_*.cosmic.sntp.cedar_phy.1.root 318 05/18 09:32 0 3 PEND - have 16/24 subruns for N00010724_*.cosmic.sntp.cedar_phy.1.root 317 05/18 13:37 0 16 PEND - have 44/45 subruns for N00011134_*.cosmic.sntp.cedar_phy.1.root 317 05/18 15:39 0 44 PEND - have 22/24 subruns for N00011437_*.cosmic.sntp.cedar_phy.0.root 170 10/12 17:17 0 22 PEND - have 30/31 subruns for N00011651_*.cosmic.sntp.cedar_phy.0.root 294 06/11 03:18 0 30 PEND - have 23/24 subruns for N00011710_*.cosmic.sntp.cedar_phy.0.root 294 06/11 05:02 0 23 PEND - have 11/12 subruns for N00007127_*.spill.mrnt.cedar_phy.0.root 305 05/30 17:58 0 11 PEND - have 9/10 subruns for N00007148_*.spill.mrnt.cedar_phy.0.root 327 05/08 16:04 0 9 PEND - have 1/3 subruns for N00007176_*.spill.mrnt.cedar_phy.0.root 327 05/08 16:19 0 1 PEND - have 1/2 subruns for N00007188_*.spill.mrnt.cedar_phy.0.root 305 05/30 18:40 0 1 PEND - have 4/6 subruns for N00007194_*.spill.mrnt.cedar_phy.0.root 305 05/30 18:41 0 4 PEND - have 19/30 subruns for N00007197_*.spill.mrnt.cedar_phy.0.root 305 05/30 20:43 0 19 PEND - have 28/30 subruns for N00007929_*.spill.mrnt.cedar_phy.0.root 327 05/09 09:20 0 28 PEND - have 21/22 subruns for N00008029_*.spill.mrnt.cedar_phy.0.root 303 06/01 13:55 0 21 PEND - have 17/18 subruns for 
N00008305_*.spill.mrnt.cedar_phy.0.root 324 05/12 08:35 0 17 PEND - have 23/24 subruns for N00008345_*.spill.mrnt.cedar_phy.0.root 303 06/02 04:54 0 23 PEND - have 1/2 subruns for N00008366_*.spill.mrnt.cedar_phy.0.root 303 06/02 05:10 0 1 PEND - have 23/24 subruns for N00008492_*.spill.mrnt.cedar_phy.1.root 302 06/02 12:49 0 23 PEND - have 14/15 subruns for N00008510_*.spill.mrnt.cedar_phy.1.root 302 06/02 12:33 0 14 PEND - have 2/11 subruns for N00009635_*.spill.mrnt.cedar_phy.1.root 300 06/04 18:05 0 2 PEND - have 9/10 subruns for N00009770_*.spill.mrnt.cedar_phy.0.root 300 06/04 13:32 0 9 PEND - have 12/14 subruns for N00010371_*.spill.mrnt.cedar_phy.0.root 300 06/05 08:52 0 12 PEND - have 18/19 subruns for N00010577_*.spill.mrnt.cedar_phy.0.root 300 06/04 23:24 0 18 PEND - have 22/23 subruns for N00010583_*.spill.mrnt.cedar_phy.0.root 299 06/05 20:53 0 22 PEND - have 23/24 subruns for N00010586_*.spill.mrnt.cedar_phy.0.root 300 06/05 09:16 0 23 PEND - have 15/19 subruns for N00010589_*.spill.mrnt.cedar_phy.0.root 299 06/05 22:41 0 15 PEND - have 22/24 subruns for N00010631_*.spill.mrnt.cedar_phy.0.root 298 06/06 11:46 0 22 PEND - have 6/7 subruns for N00011047_*.spill.mrnt.cedar_phy.0.root 298 06/07 10:53 0 6 PEND - have 26/27 subruns for N00011113_*.spill.mrnt.cedar_phy.0.root 321 05/15 08:48 0 26 PEND - have 39/40 subruns for N00011134_*.spill.mrnt.cedar_phy.0.root 321 05/15 09:21 0 39 PEND - have 22/24 subruns for N00011437_*.spill.mrnt.cedar_phy.0.root 170 10/12 17:18 0 22 PEND - have 23/24 subruns for N00011710_*.spill.mrnt.cedar_phy.0.root 294 06/11 05:02 0 23 PEND - have 11/23 subruns for N00011728_*.spill.mrnt.cedar_phy.1.root 317 05/19 02:28 0 11 PEND - have 4/8 subruns for N00011750_*.spill.mrnt.cedar_phy.1.root 317 05/19 02:47 0 4 PEND - have 8/24 subruns for N00011772_*.spill.mrnt.cedar_phy.1.root 317 05/19 02:34 0 8 PEND - have 1/12 subruns for N00007127_*.spill.sntp.cedar_phy.1.root 159 10/23 12:27 0 1 PEND - have 23/24 subruns for N00008492_*.spill.sntp.cedar_phy.1.root 302 06/02 12:49 0 23 PEND - have 14/15 subruns for N00008510_*.spill.sntp.cedar_phy.1.root 302 06/02 12:33 0 14 PEND - have 2/11 subruns for N00009635_*.spill.sntp.cedar_phy.1.root 300 06/04 18:05 0 2 PEND - have 11/23 subruns for N00011728_*.spill.sntp.cedar_phy.1.root 317 05/19 02:28 0 11 PEND - have 4/8 subruns for N00011750_*.spill.sntp.cedar_phy.1.root 317 05/19 02:47 0 4 PEND - have 8/24 subruns for N00011772_*.spill.sntp.cedar_phy.1.root 317 05/19 02:34 0 8 ./roundup -f 10 -r cedar_phy near Mon Mar 31 11:33:08 CDT 2008 These seem to be on tape, ./roundup -r cedar_phy near This cleared out WRITE, looks OK ########## # CFLSUM # ########## cflsum - updated to include BAD and stage in miscellaneous summary ########### # ENSTORE # ########### Checking status of LTO-4 tapes, and archives to stage Strangely large number of write errors, but the tapes seem to be properly filled. 
MINOS26 > du -sm /pnfs/minos/stage/daikon_04 5127706 /pnfs/minos/stage/daikon_04 MINOS26 > du -sm /pnfs/minos/stage/daikon_00 12006 /pnfs/minos/stage/daikon_00 MINOS26 > ./volumes vols OK , refreshing volume listing in /tmp/vols -rw-r--r-- 1 kreymer g020 215774 Mar 31 09:14 /tmp/vols MINOS26 > ./volumes stage VO9430 VOC445 VOC483 VOC493 VOC630 VOJ545 VOJ546 VOJ547 VOJ548 VOJ549 VOJ550 VOJ551 VOLS4=' VOJ545 VOJ546 VOJ547 VOJ548 VOJ549 VOJ550 VOJ551 ' MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep wr_err ; done VOJ545 'sum_wr_err': 2, VOJ546 'sum_wr_err': 1, VOJ547 'sum_wr_err': 2, VOJ548 'sum_wr_err': 1, VOJ549 'sum_wr_err': 0, VOJ550 'sum_wr_err': 2, VOJ551 'sum_wr_err': 1, MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep remaining ; done VOJ545 'remaining_bytes': 792764928L, VOJ546 'remaining_bytes': 0L, VOJ547 'remaining_bytes': 0L, VOJ548 'remaining_bytes': 244734464L, VOJ549 'remaining_bytes': 765575680L, VOJ550 'remaining_bytes': 247753728L, VOJ551 'remaining_bytes': 132352000000L, MINOS26 > for VOL in $VOLS4 ; do printf "$VOL " ; enstore info --vol=${VOL} | grep eod_cookie ; done VOJ545 'eod_cookie': '0000_000000000_0001049', VOJ546 'eod_cookie': '0000_000000000_0001051', VOJ547 'eod_cookie': '0000_000000000_0001683', VOJ548 'eod_cookie': '0000_000000000_0002324', VOJ549 'eod_cookie': '0000_000000000_0000924', VOJ550 'eod_cookie': '0000_000000000_0001079', VOJ551 'eod_cookie': '0000_000000000_0001946', ============================================================================= 2008 03 30 Sunday ############ # MCIMPORT # ############ Started Forward archive, MDFREE was 728418 Sun Mar 30 08:36:22 CDT 2008 See log details under 2008 03 25 for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done Sun Mar 30 09:02:07 CDT 2008 ============================================================================= 2008 03 28 ######## # FARM # ######## ROUNTMP/ECRC Need to clean this up these files are short, 1 line, contain just the ecrc int We recently are producing more of these, due to cand handling. roundup should be updated to remove them immediately after using them. SRV1> ls ../ROUNTMP/ECRC | wc -l 69965 SRV1> find . -atime +10 | wc -l 63036 SRV1> find . -atime -10 | wc -l 6243 SRV1> du -sm ../ROUNTMP/ECRC 278 ../ROUNTMP/ECRC cd /export/stage/minfarm/ROUNDUP/ECRC SRV1> tar cf ../ECRC.20080328.tar . SRV1> du -sm ../ECRC.20080328.tar 69 ../ECRC.20080328.tar SRV1> tar tf ../ECRC.20080328.tar | wc -l 69989 Check for any particluarly old WRITE files which might need ECRC's ls -lutL /minos/data/minfarm/WRITE ... rubin 61821998 Dec 27 13:29 f20011128_0009_CosmicMu_D02.sntp.cedar.root minfarm 496353221 Dec 27 13:28 f20011128_0000_CosmicMu_D02.sntp.cedar.root minfarm 63 Dec 11 18:27 Merged.1751.root minfarm 20552457 Dec 11 10:57 Merged.21430.root SRV1> rm /minos/data/minfarm/WRITE/Merged.1751.root SRV1> rm /minos/data/minfarm/WRITE/Merged.21430.root The two cedar mcfar files got caught in a DCache write backlog at Thu Dec 27 13:29:47 CST 2007 This was never picked up after the holidays. ./roundup -r cedar mcfar Oops, in tarring these up have modified access times, so cannot do find . -atime +10 -exec echo rm {} \; Check the file count, based on modification time, find . -mtime +10 | wc -l 63730 This seems right. Many were accessed after the cand files were on tape. So update the two oldies we still need, and purge based on mtime. 
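( Side note for a future pass : GNU tar can avoid clobbering the atimes in the first place ; a sketch, assuming --atime-preserve is available in the tar on fnpcsrv1, noting that restoring the atime updates the ctime )
cd /export/stage/minfarm/ROUNDUP/ECRC
tar --atime-preserve -cf ../ECRC.20080328.tar .
# after this, find . -atime +10 would still be meaningful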
touch f20011128_0009_CosmicMu_D02.sntp.cedar.root touch f20011128_0000_CosmicMu_D02.sntp.cedar.root find . -mtime +10 | wc -l 63728 OK, let's purge them, at about 18:34 find . -mtime +10 -exec rm {} \; done at around 18:39 SRV1> ls /export/stage/minfarm/ROUNDUP/ECRC | wc -l 6274 ############ # MCIMPORT # ############ ./pnfsdirs far cedar_phy_bhcurv daikon_04 AtmosNu ./pnfsdirs far cedar_phy_bhcurv daikon_04 AtmosNu write Should be ready to import arms files now, from /minos/data/mcimport/arms/mcin $ mv STAGE/arms/NOIMPORT STAGE/arms/IMPORT $ ./mcimport -n -b 3 arms 14:32 GRRRRRRRRRRR - stuck doing 32156 pts/2 Ss 0:01 -bash 24132 pts/2 S+ 0:00 \_ /bin/sh ./mcimport -n -b 3 arms 26030 pts/2 S+ 0:01 \_ du -sm /home/mindata/STAGE/arms/ This is also stuck locally. And also stuck doing direct du -sm /minos/data/mcimport/arms/mcin It was just going slowly, eventually finished at 14:47 $ ./mcimport -b 1 arms OK, logging activity to /home/mindata/STAGE/arms/log/mcimport.log Files seem to be declared to SAM Will let the scheduled import pick up the rest of these files. ########## # ISAJET # ########## Using isajet-users as a testbed for mailing list changes. The only present subscriber is syoon@fnal.gov, Phil Yoon, Subscribed on 14 Aug 2001 Removed syoon, added kreymer ######### # ADMIN # ######### MINOS-USERS Spam came in from Received: from msgmmp-4.gci.net (msgmmp-4.gci.net [209.165.130.14]) To avoid more spam from offsite, changed SEND to Private via the wizard. ########## # CONDOR # ########## Brian's jobs have finished at high priority. Reset him to normal : condor_userprio -setfactor brebel@fnal.gov 100. condor_userprio -all ########## # CONDOR # ########## HOWTO.condor - changed draft grid environment documents, from $APP - NFS shared, backed except on USCMS T1 $DATA - NFS shared, not backed up $WN_TMP - 50 GB on worker to ${OSG_GRID} - NFS shared, grid support software ${OSG_DATA} - NFS shared, not backed up ${OSG_APP} - NFS shared, backed except on USCMS T1 ${OSG_WN_TMP} - 50 GB on worker ${OSG_SQUID_LOCATION} - which squid to use, if needed https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/StorageParameterOsgWnTmp ########## # CONDOR # ########## regularized file names, like probe.$(Cluster).$(Process).out added JOBLEASEDURATION = 1000000 to glideafs.run glide.run probe.run wms*.run - removed, these are now handled by glide.run MINOS25 > dds *.run -rw-r--r-- 1 kreymer g020 726 Jan 11 15:10 glide150.run -rw-r--r-- 1 kreymer g020 825 Mar 26 12:18 glideafs4hr.run -rw-r--r-- 1 kreymer g020 849 Mar 27 15:31 glideafs70min.run -rw-r--r-- 1 kreymer g020 855 Mar 28 10:37 glideafs.run -rw-r--r-- 1 kreymer g020 731 Mar 27 14:02 glide.run -rw-r--r-- 1 kreymer g020 609 Oct 26 17:23 probe10.run -rw-r--r-- 1 kreymer g020 588 Mar 28 10:38 probe.run -rw-r--r-- 1 kreymer g020 721 Dec 14 14:50 wms1.run -rw-r--r-- 1 kreymer g020 561 Nov 27 20:24 wms2.run -rw-r--r-- 1 kreymer g020 822 Feb 8 15:46 wmsafs.run -rw-r--r-- 1 kreymer g020 719 Dec 12 23:42 wms.run ######## # FARM # ######## Proceed to force out farcat, as there are no DUP's there. But will not force out CPB quite yet, due to bmnt issue. 
farcat 236 7321 all.sntp.cedar.0.root 6 144 all.sntp.cedar_phy.0.root 1 23 all.sntp.cedar_phy.1.root 87 2174 all.sntp.cedar_phy_bhcurv.0.root 546 905 spill.bmnt.cedar_phy_bhcurv.0.root 236 1697 spill.bntp.cedar.0.root 2 2 spill.bntp.cedar_phy.0.root 23 102 spill.bntp.cedar_phy_bhcurv.0.root 208 509 spill.mrnt.cedar_phy_bhcurv.0.root 236 1165 spill.sntp.cedar.0.root 2 1 spill.sntp.cedar_phy.0.root 43 82 spill.sntp.cedar_phy_bhcurv.0.root MINOS-SAM02 > ./samdup /minos/data/minfarm/farcat ./roundup -n -r cedar_phy far Fri Mar 28 08:53:34 CDT 2008 HAVE /export/stage/minfarm/ROUNDUP/READ/SAM/F00030612_0005.spill.bntp.cedar_phy.0.root 3 HAVE /export/stage/minfarm/ROUNDUP/READ/SAM/F00030612_0005.spill.sntp.cedar_phy.0.root 3 OK - processing 11 files OK - stream all.sntp.cedar_phy OK - 167 Mbytes in 5 runs PEND - have 2/8 subruns for F00030612_*.all.sntp.cedar_phy.0.root 268 07/03 14:12 0 2 PEND - have 1/24 subruns for F00034635_*.all.sntp.cedar_phy.1.root 311 05/21 10:10 0 1 PEND - have 2/24 subruns for F00034647_*.all.sntp.cedar_phy.0.root 290 06/11 21:36 0 2 PEND - have 1/7 subruns for F00034675_*.all.sntp.cedar_phy.0.root 290 06/11 21:48 0 1 PEND - have 1/19 subruns for F00034700_*.all.sntp.cedar_phy.0.root 288 06/13 12:09 0 1 OK - stream spill.bntp.cedar_phy OK - 2 Mbytes in 1 runs PEND - have 2/8 subruns for F00030612_*.spill.bntp.cedar_phy.0.root 268 07/03 14:12 3 5 OK - stream spill.sntp.cedar_phy OK - 1 Mbytes in 1 runs PEND - have 2/8 subruns for F00030612_*.spill.sntp.cedar_phy.0.root 268 07/03 14:12 3 5 ./roundup -r cedar_phy far ./roundup -f 1 -r cedar_phy far Proceed to force out nearcat older passes. This may have duplicates in nearcat. ./samdup /minos/data/minfarm/nearcat nearcat 123 3525 cosmic.sntp.cedar.0.root 134 3881 cosmic.sntp.cedar_phy.0.root 84 2312 cosmic.sntp.cedar_phy.1.root 372 6007 spill.mrnt.cedar_phy.0.root 62 789 spill.mrnt.cedar_phy.1.root 212 7318 spill.mrnt.cedar_phy_bhcurv.0.root 282 8478 spill.mrnt.cedar_phy_bhcurv.1.root 202 15164 spill.sntp.cedar.0.root 63 4125 spill.sntp.cedar_phy.1.root 100 6692 spill.sntp.cedar_phy_bhcurv.0.root 43 1511 spill.sntp.cedar_phy_bhcurv.1.root FONES=`cd /minos/data/minfarm/nearcat ; ls *.1.root` printf "${FONES}\n" | wc -w for FILE in ${FONES} ; do ( cd /minos/data/minfarm/nearcat FZER=`echo ${FILE} | sed 's/\.1\./\.0\./g'` [ -r "${FZER}" ] && \ ls -l ${FILE} && ls -l ${FZER} && printf "\n" ) done -rw-rw-r-- 1 rubin numi 38414123 Nov 16 20:01 N00008165_0013.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 38415184 Oct 24 23:15 N00008165_0013.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 9330118 Nov 21 12:42 N00012040_0015.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 9331014 Nov 21 02:14 N00012040_0015.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 17493799 Nov 21 12:42 N00012040_0015.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 17493843 Nov 21 02:14 N00012040_0015.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 224841 Nov 21 11:23 N00012051_0009.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 224815 Nov 21 00:54 N00012051_0009.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 279374 Nov 21 11:23 N00012051_0009.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 279374 Nov 21 00:54 N00012051_0009.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 242964 Nov 21 11:22 N00012051_0022.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 242964 Nov 21 00:54 N00012051_0022.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 
rubin numi 321751 Nov 21 11:22 N00012051_0022.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 321751 Nov 21 00:54 N00012051_0022.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 245992 Nov 21 11:23 N00012051_0023.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 245998 Nov 21 00:54 N00012051_0023.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 327453 Nov 21 11:23 N00012051_0023.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 327453 Nov 21 00:54 N00012051_0023.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 10963512 Nov 21 12:59 N00012054_0000.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 10963274 Nov 21 02:30 N00012054_0000.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 20300550 Nov 21 12:59 N00012054_0000.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 20300544 Nov 21 02:30 N00012054_0000.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 308544 Nov 21 11:24 N00012054_0003.spill.mrnt.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 308533 Nov 21 00:55 N00012054_0003.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 535687 Nov 21 11:24 N00012054_0003.spill.sntp.cedar_phy_bhcurv.1.root -rw-rw-r-- 1 rubin numi 535687 Nov 21 00:55 N00012054_0003.spill.sntp.cedar_phy_bhcurv.0.root sam list files --dim="${SAMDIM}" --nosummary SAMDIM="DATA_TIER mrnt-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 8165" SAMDIM="DATA_TIER cand-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 8165" 13 is pass 1 SAMDIM="DATA_TIER mrnt-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12040" SAMDIM="DATA_TIER sntp-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12040" SAMDIM="DATA_TIER cand-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12040" 15 is pass 1 SAMDIM="DATA_TIER mrnt-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12051" SAMDIM="DATA_TIER sntp-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12051" SAMDIM="DATA_TIER cand-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12051" 9, 22, 23 are pass 1 SAMDIM="DATA_TIER mrnt-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12054" SAMDIM="DATA_TIER sntp-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12054" SAMDIM="DATA_TIER cand-near and VERSION cedar.phy.bhcurv and RUN_NUMBER 12054" 0, 3 are pass 1 Bottom line, declare all the 0 passes to be duplicates. 
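( For reference, the run-by-run checks above could be looped rather than pasted one dimension at a time ; a sketch using the same query form )
for RUN in 8165 12040 12051 12054 ; do
  for TIER in mrnt-near sntp-near cand-near ; do
    SAMDIM="DATA_TIER ${TIER} and VERSION cedar.phy.bhcurv and RUN_NUMBER ${RUN}"
    echo "==== ${TIER} ${RUN}"
    sam list files --dim="${SAMDIM}" --nosummary
  done
done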
for FILE in ${FONES} ; do ( cd /minos/data/minfarm/nearcat FZER=`echo ${FILE} | sed 's/\.1\./\.0\./g'` [ -r "${FZER}" ] && echo ${FZER} ) done > /tmp/FDUPS FDUPS=`cat /tmp/FDUPS` printf "${FDUPS}\n" for FILE in ${FDUPS} ; do ( cd /minos/data/minfarm/nearcat mv ${FILE} ../DUP/nearcat/${FILE} ) done done at 13:41 ./roundup -n -r cedar_phy_bhcurv near PEND - have 2/3 subruns for N00007506_*.spill.mrnt.cedar_phy_bhcurv.0.root 156 10/23 17:20 0 2 PEND - have 19/20 subruns for N00008165_*.spill.mrnt.cedar_phy_bhcurv.0.root 155 10/24 21:37 0 19 PEND - have 14/15 subruns for N00008276_*.spill.mrnt.cedar_phy_bhcurv.1.root 73 01/14 14:48 0 14 PEND - have 16/24 subruns for N00008345_*.spill.mrnt.cedar_phy_bhcurv.1.root 73 01/14 19:33 0 16 PEND - have 3/7 subruns for N00008433_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 01:05 0 3 PEND - have 2/22 subruns for N00008436_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 09:02 0 2 PEND - have 1/12 subruns for N00008439_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 09:58 0 1 PEND - have 4/19 subruns for N00008454_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:51 0 4 PEND - have 2/24 subruns for N00008457_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 10:59 0 2 PEND - have 2/24 subruns for N00008460_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:17 0 2 PEND - have 4/20 subruns for N00008463_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 09:31 0 4 PEND - have 1/24 subruns for N00008469_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:33 0 1 PEND - have 1/2 subruns for N00008472_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 09:20 0 1 PEND - have 1/24 subruns for N00008478_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:13 0 1 PEND - have 6/24 subruns for N00008481_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 10:29 0 6 PEND - have 12/24 subruns for N00008486_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 14:30 0 12 PEND - have 6/24 subruns for N00008489_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/14 06:08 0 6 PEND - have 13/24 subruns for N00008492_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/14 08:40 0 13 PEND - have 4/6 subruns for N00008495_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 20:46 0 4 PEND - have 16/24 subruns for N00008498_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 19:06 0 16 PEND - have 11/17 subruns for N00008501_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 19:51 0 11 PEND - have 10/24 subruns for N00008504_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 19:52 0 10 PEND - have 6/14 subruns for N00008507_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 21:00 0 6 PEND - have 11/15 subruns for N00008510_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 13:46 0 11 PEND - have 1/2 subruns for N00008517_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 10:17 0 1 PEND - have 4/21 subruns for N00008523_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 09:24 0 4 PEND - have 9/22 subruns for N00008526_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:52 0 9 PEND - have 13/24 subruns for N00008529_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 18:35 0 13 PEND - have 4/24 subruns for N00008532_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:23 0 4 PEND - have 1/17 subruns for N00008538_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 19:02 0 1 PEND - have 2/18 subruns for N00008564_*.spill.mrnt.cedar_phy_bhcurv.1.root 136 11/13 11:09 0 2 PEND - have 1/7 subruns for N00008568_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 13:06 0 1 PEND - have 1/13 subruns for N00008672_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 13:59 0 1 PEND - have 1/2 subruns 
for N00008692_*.spill.mrnt.cedar_phy_bhcurv.1.root 135 11/13 13:11 0 1 PEND - have 7/12 subruns for N00008695_*.spill.mrnt.cedar_phy_bhcurv.1.root 132 11/16 15:24 0 7 PEND - have 23/24 subruns for N00009629_*.spill.mrnt.cedar_phy_bhcurv.1.root 129 11/20 01:21 0 23 PEND - have 15/16 subruns for N00009659_*.spill.mrnt.cedar_phy_bhcurv.1.root 129 11/20 07:30 0 15 PEND - have 13/14 subruns for N00009705_*.spill.mrnt.cedar_phy_bhcurv.1.root 129 11/20 04:19 0 13 PEND - have 14/15 subruns for N00009816_*.spill.mrnt.cedar_phy_bhcurv.1.root 128 11/20 16:45 0 14 PEND - have 20/23 subruns for N00009836_*.spill.mrnt.cedar_phy_bhcurv.1.root 128 11/20 17:57 0 20 PEND - have 22/23 subruns for N00011059_*.spill.mrnt.cedar_phy_bhcurv.0.root 151 10/29 04:17 0 22 PEND - have 26/27 subruns for N00011113_*.spill.mrnt.cedar_phy_bhcurv.0.root 151 10/29 12:55 0 26 PEND - have 39/40 subruns for N00011134_*.spill.mrnt.cedar_phy_bhcurv.0.root 150 10/29 19:12 0 39 PEND - have 23/24 subruns for N00012129_*.spill.mrnt.cedar_phy_bhcurv.0.root 127 11/21 21:18 0 23 PEND - have 20/21 subruns for N00012431_*.spill.mrnt.cedar_phy_bhcurv.0.root 125 11/24 02:18 0 20 PEND - have 1/19 subruns for N00008165_*.spill.sntp.cedar_phy_bhcurv.1.root 132 11/16 20:01 0 1 PEND - have 21/22 subruns for N00008251_*.spill.sntp.cedar_phy_bhcurv.1.root 73 01/14 14:00 0 21 PEND - have 14/24 subruns for N00008254_*.spill.sntp.cedar_phy_bhcurv.1.root 74 01/14 12:18 0 14 PEND - have 1/2 subruns for N00008366_*.spill.sntp.cedar_phy_bhcurv.1.root 73 01/14 19:39 0 1 PEND - have 39/40 subruns for N00011134_*.spill.sntp.cedar_phy_bhcurv.0.root 150 10/29 19:12 0 39 All these are a couple of months stale. We also are picking up the internal duplicates just moved out of the way. ./roundup -f 30 -r cedar_phy_bhcurv near ============================================================================= 2008 03 27 ######## # FARM # ######## SRV1> ls /minos/data/minfarm/nearcat/*0.root | wc -l 1143 SRV1> ls /minos/data/minfarm/nearcat/*1.root | wc -l 534 SRV1> ls /minos/data/minfarm/farcat/*0.root | wc -l 1625 SRV1> ls /minos/data/minfarm/farcat/*1.root | wc -l 1 SRV1> ls /minos/data/minfarm/farcat/F00034635_0000* /minos/data/minfarm/farcat/F00034635_0000.all.sntp.cedar_phy.1.root ########## # MDFREE # ########## Started keeping track of /minos/data free disk ${HOME}/minos/scripts/mdfree_log & ######### # ADMIN # ######### Seen rtoner entry at 2008 02 25 She has access, per blake MINOS01 > pts adduser -user rtoner -group minos MINOS01 > pts membership minos | grep toner ######## # FARM # ######## 11:30 SRV1> ./farmgsum Summarizing /grid/data/minos/*cat 1677 57043 nearcat 1626 13483 farcat 2944 493599 mcnearcat 0 1 mcfarcat 0 1 mcfmockcat 1726 332816 minfarm/WRITE 7973 896943 TOTAL files, GBytes ######## # FARM # ######## Ticket 112276 is resolved ( overloaded fnpcsrv1 due to carneiro jobs ) See details below. ============================================================================= 2008 03 26 ######## # FARM # ######## 1996 107037 nearcat 1839 23836 farcat 4725 830248 mcnearcat 513 12063 mcfarcat 0 1 mcfmockcat 91 6504 minfarm/WRITE 9164 979689 TOTAL files, GBytes There are many files with both passes 0 and 1 in these areas. MCFARCAT mcfarcat 513 12646 mrnt.cedar_phy.root All these are subrun 0, Feb 24/25 ssh minos-sam02 minos ./samdup /minos/data/minfarm/mcfarcat ./roundup -n -r cedar_phy mcfar This would have added all files, no complaints Do it ! 
./roundup -r cedar_phy mcfar and the next day, to purge, did again ./roundup -r cedar_phy mcfar The saddreco logs look OK, mcfarcat is clear ! ######## # BMNT # ######## Urkkh. BMNT files have resurfaced again, farcat 546 905 spill.bmnt.cedar_phy_bhcurv.0.root 208 509 spill.mrnt.cedar_phy_bhcurv.0.root See log entries 2008 01 17 These were all produced on Feb 28 and 29. There seem to be no corresponding mrnt's waiting for concatenation. Let's set this aside, till we get the rest of the files concatenated. ########## # CONDOR # ########## Investigating file ownership created under Condor glideins, The probe job shows ID uid=7927(minos) gid=5111(numi) groups=5111(numi) MINOS25 > ypcat passwd | grep minoscvs minoscvs:KERBEROS:7927:5111:E875 Minos:/home/minoscvs:/home/minoscvs/bin/cvsh likewise, here is another minos account : [minos@minos-evd ~]$ id uid=500(minos) gid=5111(e875) groups=100(users),5111(e875),1100545895 context=user_u:system_r:initrc_t ########## # CONDOR # ########## Investigating Glidein job disconnects, logs/errs under /minos/data/users/pawloski/Nue/AttenuationStudy/Far_Beam_GainPlus10 /minos/data/users/pawloski/Nue/AttenuationStudy/Far_Beam_MEUMinus30 Jobs submitted as /minos/scratch/boehm/CondorTest/GregSub/condor_submit_gliden_Gain_10High_FD.sh running this : /minos/scratch/boehm/CondorTest/GregSub/condor_jobs_gliden_Gain_10High_FD.sh These were resubmitted, setting this : JobLeaseDuration = 360000 These all completed properly, but many had to reconnect, success this time ! Scanned all 100 logs for disconnect messges. MINOS25 > grep disconnected /minos/data/users/pawloski/Nue/AttenuationStudy/Far_Beam_GainPlus10/log.* | wc -l 94 MINOS25 > grep disconnected /minos/data/users/pawloski/Nue/AttenuationStudy/Far_Beam_MEUMinus30/log.* | wc -l 80 Most of the disconnecting jobs lasted about 70 minutes. Most of the clean jobs lasted 64 minutes, but a few were over 71 minutes. In all cases the job ran and produced an output file. The disconnect was at the time of job completion, based on file times. PROBE TESTS Created 4 hour version of probe ( 1440 iterations, 10 seconds per, of tiny) condor_submit glideafs.run Started up on fnpc341 around 12:20, files go to logs/4hr Running probenew, allowing tiny or sleep , setting delay in seconds probenew ${SEC} tiny 4200 #( for about 70 minutes ) Running 5 in parallel. condor_submit glideafs70min.run Got lucky, these all started right away, at 17:17 And they bailed out in the classic way, eventually failing entirely MINOS25 > cat logs/70min/probe.log.61486.0 000 (61486.000.000) 03/26 17:17:21 Job submitted from host: <131.225.193.25:63984> ... 001 (61486.000.000) 03/26 17:17:26 Job executing on host: <131.225.166.130:63103> ... 006 (61486.000.000) 03/26 17:17:34 Image size of job updated: 10152 ... 022 (61486.000.000) 03/26 18:27:26 Job disconnected, attempting to reconnect Socket between submit and execute hosts closed unexpectedly Trying to reconnect to vm2@4190@fnpc342.fnal.gov <131.225.166.130:63103> ... 024 (61486.000.000) 03/26 18:27:26 Job reconnection failed Job disconnected too long: JobLeaseDuration (3600 seconds) expired Can not reconnect to vm2@4190@fnpc342.fnal.gov, rescheduling job ... 001 (61486.000.000) 03/26 18:27:53 Job executing on host: <131.225.166.130:63103> AND MORE OF THE SAME, TILL THE END : 001 (61486.000.000) 03/27 05:20:15 Job executing on host: <131.225.166.120:64512> ... 009 (61486.000.000) 03/27 05:20:23 Job was aborted by the user. 
The system macro SYSTEM_PERIODIC_REMOVE expression '(JobRunCount > 10) || (JobRunCount>=1 && ImageSize>1000000 && JobStatus==1)' evaluated to TRUE ... ADDED JobLeaseDuration = 360000 to glideafs70min.run - added Reran at 11:46, cluster 61836 MINOS25 > cat logs/70min/probe.log.61836.0 000 (61836.000.000) 03/27 11:48:02 Job submitted from host: <131.225.193.25:63984> ... 001 (61836.000.000) 03/27 11:48:07 Job executing on host: <131.225.166.131:62119> ... 006 (61836.000.000) 03/27 11:48:15 Image size of job updated: 10152 ... 022 (61836.000.000) 03/27 12:58:11 Job disconnected, attempting to reconnect Socket between submit and execute hosts closed unexpectedly Trying to reconnect to vm2@32007@fnpc344.fnal.gov <131.225.166.131:62119> ... 023 (61836.000.000) 03/27 12:58:11 Job reconnected to vm2@32007@fnpc344.fnal.gov startd address: <131.225.166.131:62119> starter address: <131.225.166.131:62606> Now try a test with a short lease, 5 minutes and a 10 minute execution time. JobLeaseDuration = 300 cluster 61850 at 13:15 All jobs completed normally in 10 minutes. Gave myself a boost with condor_userprio -setfactor kreymer@fnal.gov 1. Let's try using sleep, instead of tiny, in the failing test: cluster 61859 at 13:42 MINOS25 > cat logs/70min/probe.log.61859.0 000 (61859.000.000) 03/27 13:42:10 Job submitted from host: <131.225.193.25:63984> ... 001 (61859.000.000) 03/27 13:44:14 Job executing on host: <131.225.166.118:62566> ... 006 (61859.000.000) 03/27 13:44:22 Image size of job updated: 14956 ... 022 (61859.000.000) 03/27 14:54:15 Job disconnected, attempting to reconnect Socket between submit and execute hosts closed unexpectedly Trying to reconnect to vm2@1277@fnpc339.fnal.gov <131.225.166.118:62566> ... 024 (61859.000.000) 03/27 14:54:15 Job reconnection failed Job disconnected too long: JobLeaseDuration (3600 seconds) expired Can not reconnect to vm2@1277@fnpc339.fnal.gov, rescheduling job Same for sections 0, 1, 2 Good enough, killing these, rather than let them waste VM's. condor_rm 61859 For long term workaround, use JOBLEASEDURATION = 1000000 ########## # DCACHE # ########## MINOS26 > date Wed Mar 26 09:07:08 CDT 2008 MINOS26 > ~kreymer/minos/scripts/dc_stat /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.ro ============================ PNFS status for /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root -rw-r--r-- 1 rubin e875 541154973 Mar 21 12:48 n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:bee4af9c;l=541154973; w-stkendca11a-5 LEVEL 4 ============================ Date: Wed, 26 Mar 2008 09:10:50 -0500 (CDT) Subject: HelpDesk ticket 113172 ___________________________________________ Short Description: Minos file in w-stkendca11a-5 for 5 days still not on tape Problem Description: dcache-admin : The following file has been in the write pool since Friday, and is till not on an tape. Many other similar files have been written since that time. Please investigate and flush this file to tape. 
MINOS26 > date Wed Mar 26 09:07:08 CDT 2008 MINOS26 > ~kreymer/minos/scripts/dc_stat /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/7 12/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root ============================ PNFS status for /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root -rw-r--r-- 1 rubin e875 541154973 Mar 21 12:48 n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:bee4af9c;l=541154973; w-stkendca11a-5 LEVEL 4 ============================ ___________________________________________ This ticket is assigned to SSA Primary of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Wed, 26 Mar 2008 12:10:23 -0500 (CDT) Note To Requester: berg@fnal.gov sent this Notes To Requester: Art, This file is failing in the encp from dcache because of a CRC mismatch. I'm looking into it. - David _________________________________________________________________ Date: Wed, 26 Mar 2008 12:48:49 -0500 (CDT) That CRC doesn't match either what's in dcache, or what's being returned by encp. The path you give below - is that in BlueArc? I'm a little confused by the difference in the order of the path elements between that and pnfs, but it seems to refer to the same file. What is the length of that file? I'll talk with experts about how to proceed. _________________________________________________________________ Date: Fri, 28 Mar 2008 18:20:16 -0500 (CDT) Art, The developer says for you to go ahead and remove the file from pnfs space, then rewrite. We will have to go in later to remove the file from dcache, because cleaner will not automatically remove a precious file when it is removed from pnfs, but the encp will stop. - David _________________________________________________________________ Solution: berg@fnal.gov sent this solution: The file was written to VOH293 on Mar 29, after being rewritten to dcache by the user. Evidently the file itself became corrupted in dcache. The CRC of the rewritten file as recorded by dcache is the same as it was before. Attempts to write it to tape failed because the CRC calculated by encp as the file went to tape was different, so the file content must have changed. ___________________________________________________________________ Observed LTO3_12.mover alive : busy mounting volume VOE096 stkenmvr112a 2008-Mar-26 10:49:09 Odd, this is not showing up yet in LEVEL4 This is strange, most of the files on this tape seem to be removed ! 
Files are under pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/ only two remain, 713/n13047134_0013_L010185N_D04.cand.cedar_phy_bhcurv.0.root 738/n13037389_0023_L010185N_D04.cand.cedar_phy_bhcurv.0.root SRV1> dds /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 minospro numi 541154973 Mar 21 10:57 /minos/data/mcout_data/daikon_04/L010185N/near/cedar_phy_bhcurv/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root ecrc : CRC 2086317979 And the original ECRC record, cat /export/stage/minfarm/ROUNDUP/ECRC/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root 2086317979 So per developer recommendation, as rubin on fnpcsrv1, at about Fri Mar 28 18:31 rm /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/712/n13047122_0029_L010185N_D04.cand.cedar_phy_bhcurv.0.root 2008 03 31 This is on tape, VOH293 0000_000000000_0000571 541154973 ######## # FARM # ######## cedar_phy_mboone This has never been concatenated before, or declared to SAM. SAMDIM="VERSION cedar.phy.mboone" MINOS26 > sam list files --dim="${SAMDIM}" --summary_only File Count: 0 Added this to ROUNTMP/ROOTRELS But we do not actually want to concatenate c_p_m, The volume of data is small, and there is one file per run already. ============================================================================= 2008 03 25 ############ # MCIMPORT # ############ Cleaned up one bad copy from last week, this was at the time of the BlueArc problem. $ dds /home/mindata/TAPE/ total 1508 drwxr-xr-x 2 mindata e875 12288 Mar 17 23:57 ./ drwxr-xr-x 10 mindata e875 4096 Mar 25 16:21 ../ -rw-r--r-- 1 mindata e875 1523712 Mar 17 08:52 n11037135_0020_L250200N_D04.tar.gz $ dds /minos/data/mcimport/STAGE/daikon_04/L250200N/near/713/n11037135_0020_L250200N_D04.tar.gz -rw-r--r-- 1 mindata e875 752680142 Dec 13 20:44 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/713/n11037135_0020_L250200N_D04.tar.gz $ dds /pnfs/minos/stage/daikon_04/L250200N/near/713/n11037135_0020_L250200N_D04.tar.gz -rw-r--r-- 1 mindata e875 1523712 Mar 17 10:35 /pnfs/minos/stage/daikon_04/L250200N/near/713/n11037135_0020_L250200N_D04.tar.gz less /minos/data/mcimport/TAR/daikon_04/L250200N/near/713/mcimport.log this ends with COPY n11037135_0019_L250200N_D04.tar.gz Not too surprising, the BlueArc outage killed the log file. $ rm /pnfs/minos/stage/daikon_04/L250200N/near/713/n11037135_0020_L250200N_D04.tar.gz $ rm /home/mindata/TAPE/n11037135_0020_L250200N_D04.tar.gz Let's redo,catch another 95 GB of files $ du -sm /minos/data/mcimport/STAGE/daikon_04/L250200N/near/* 1 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700 1385 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/701 ... 
96027 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/713 1517 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/714 154 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/715 846 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/717 DIR=713 ./mcimport.20080311 -T daikon_04/L250200N/near/${DIR} Tue Mar 25 17:10:17 CDT 2008 OK - version mcimport.20080303 processing from /minos/data/mcimport/STAGE/daikon_04/L250200N/near/713 LOGS TAPER NFILES 129 ############ # MCIMPORT # ############ Most proceed to archive more pre-overlay files, /minos/data space was down to 1.9 TB free Monday, down to 1.2 today Forward field: 7101-7350 7501-7650 Reverse field: 7001-7120 Forward field files are like n1103* n1203* Reversed field files are like n1104* n1204* FDIRS='710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 ' RDIRS='700 701 702 703 704 705 706 707 708 709 710 711 712 ' Can use -s to select correct configurations, Check min run : 700 - min run is 7001 750 - min run is 7501 710 - min run is 7100 Assuming Robert is overlaying using 7001 to 7099 ( not 7100 ) we can just archive these full directories. This would look like for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1203 daikon_04/L010185N/near/${DIR} done for DIR in ${RDIRS}; do ./mcimport.20080326 -T -s n1104 daikon_04/L010185N/near/${DIR} done for DIR in ${RDIRS}; do ./mcimport.20080326 -T -s n1204 daikon_04/L010185N/near/${DIR} done ls /minos/data/mcimport/STAGE/daikon_04/L010185N/near 700 | wc -l 746 700/n1104* | wc -l 276 700/n1204* | wc -l 97 2008 03 26 Testing with DIR=700 AFSS/mcimport.20080326 -n -T -s n1104 daikon_04/L010185N/near/${DIR} for DIR in ${RDIRS}; do ./mcimport.20080326 -n -T -s n1104 daikon_04/L010185N/near/${DIR} done \ | grep NFILES NFILES 275 NFILES 309 NFILES 305 NFILES 309 NFILES 308 NFILES 308 NFILES 308 NFILES 309 NFILES 310 NFILES 308 NFILES 305 NFILES 305 NFILES 309 $ for DIR in ${RDIRS}; do ./mcimport.20080326 -n -T -s n1204 daikon_04/L010185N/near/${DIR} done | grep NFILES NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 NFILES 0 OK, let's do the reverse detector files, the rocks are more like pebbles, too small to be directly copied. $ for DIR in ${RDIRS}; do ./mcimport.20080326 -T -s n1104 daikon_04/L010185N/near/${DIR}; done OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L010185N/near/700/mcimport.log Wed Mar 26 10:24:33 CDT 2008 OK - version mcimport.20080325 processing from /minos/data/mcimport/STAGE/daikon_04/L010185N/near/700 Copies are running at 10 MB/sec, better than usual ! 
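The scan-then-archive pattern above ( dry run with -n, count NFILES, then run for real ) could be wrapped in one small script. A minimal sketch, assuming mcimport.20080326 keeps the -n / -T / -s options used here ; the skip-on-zero test and the NFILES parsing are illustrative, not part of mcimport :
#!/bin/sh
# Sketch : dry-run each directory, archive only where the selection matches files.
SEL=n1104                                # file selection, as in the n1104 pass above
BASE=daikon_04/L010185N/near
RDIRS='700 701 702 703 704 705 706 707 708 709 710 711 712'
for DIR in ${RDIRS} ; do
  NFIL=`./mcimport.20080326 -n -T -s ${SEL} ${BASE}/${DIR} | grep NFILES | awk '{print $2}'`
  if [ -z "${NFIL}" -o "${NFIL}" = "0" ] ; then
    echo "SKIP    ${DIR}"
    continue
  fi
  echo "ARCHIVE ${DIR} ${NFIL} files"
  ./mcimport.20080326 -T -s ${SEL} ${BASE}/${DIR}
done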
After 700 is complete, scanned sizes, to see what we've gained in clearing out reversed field files, $ du -sm /minos/data/mcimport/STAGE/daikon_04/L010185N/near/70* 90862 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/700 197379 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/701 196882 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/702 197665 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/703 197688 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/704 195568 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/705 197000 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/706 203266 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/707 200288 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/708 197098 /minos/data/mcimport/STAGE/daikon_04/L010185N/near/709 Wed Mar 26 17:08:17 2008 Stuck due to LTO4_40.mover failure, alive : ERROR - ('MOUNTFAILED', 'max_consecutive_failures (3) reached') This cleared at 10:58 Thursday, now running normally, on mover 42. Mover 40 seems to be OK also. MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 28T 699G 98% /minos/data MINOS26 > date Thu Mar 27 11:29:55 CDT 2008 We are in serious trouble, let's hope we can get ahead at last. 2008 03 30 The RDIR copies finished Sat Mar 29 15:53:22 CDT 2008 Starting forward, Oops, ran the first n1103 scan without the -n, interrupted and did rm TAPE/n11037100_0010_L010185N_D04.tar.gz Scanning first, for DIR in ${FDIRS}; do printf "${DIR} " ./mcimport.20080326 -n -T -s n1103 daikon_04/L010185N/near/${DIR} \ | grep NFILES done 710 NFILES 307 711 NFILES 304 712 NFILES 304 713 NFILES 302 714 NFILES 308 715 NFILES 308 716 NFILES 306 717 NFILES 308 718 NFILES 305 719 NFILES 304 720 NFILES 307 721 NFILES 309 722 NFILES 306 723 NFILES 309 724 NFILES 308 725 NFILES 307 726 NFILES 309 727 NFILES 306 728 NFILES 306 729 NFILES 307 730 NFILES 309 731 NFILES 309 732 NFILES 305 733 NFILES 303 734 NFILES 306 735 NFILES 327 750 NFILES 276 751 NFILES 307 752 NFILES 307 753 NFILES 304 754 NFILES 308 755 NFILES 308 756 NFILES 304 757 NFILES 308 758 NFILES 305 759 NFILES 307 760 NFILES 307 761 NFILES 307 762 NFILES 306 763 NFILES 308 764 NFILES 308 765 NFILES 307 for DIR in ${FDIRS}; do printf "${DIR} " ./mcimport.20080326 -n -T -s n1203 daikon_04/L010185N/near/${DIR} \ | grep NFILES done 710 NFILES 0 711 NFILES 0 712 NFILES 0 713 NFILES 0 714 NFILES 0 715 NFILES 0 716 NFILES 0 717 NFILES 0 718 NFILES 0 719 NFILES 0 720 NFILES 0 721 NFILES 0 722 NFILES 0 723 NFILES 0 724 NFILES 0 725 NFILES 0 726 NFILES 0 727 NFILES 0 728 NFILES 0 729 NFILES 0 730 NFILES 0 731 NFILES 0 732 NFILES 0 733 NFILES 0 734 NFILES 0 735 NFILES 0 750 NFILES 0 751 NFILES 0 752 NFILES 0 753 NFILES 0 754 NFILES 0 755 NFILES 0 756 NFILES 0 757 NFILES 0 758 NFILES 0 759 NFILES 0 760 NFILES 0 761 NFILES 0 762 NFILES 0 763 NFILES 0 764 NFILES 0 765 NFILES 0 Let's run for real : for DIR in ${FDIRS}; do ./mcimport.20080326 -T -s n1103 daikon_04/L010185N/near/${DIR} done ########## # CONDOR # ########## Last time we had a BlueArc hangup, released all the gfactory processes that had been held : I think this overshot the running job goal. This time, will condor_rm them. Neither works, they keep getting held. Odd, jobs do seem to have been running sometimes last week. There were failures, seen under /minos/scratch/kreymer/condor, -rw-r--r-- 1 kreymer g020 139 Mar 22 13:40 logs/glideafs/probe.err.57870.0 ... 
-rw-r--r-- 1 kreymer g020 139 Mar 24 15:40 logs/glideafs/probe.err.60041.0 -rw-r--r-- 1 kreymer g020 0 Mar 24 15:50 logs/glideafs/probe.err.60062.0 The error files are normal, and there are normal output files for these jobs. 60062.0 and subsequent jobs are still Idle Igor finds that the proxy has, after all, expired. See messages in /home/gfactory/glideinsubmit/glidein_t11/entry_gpminos/log/*err* /home/gfactory/glideinsubmit/glidein_t11/entry_gpgeneral/log/*err* Per sfiligoie, killed off the old harmless Held gfactory jobs, which had lost track of their state with respect to the GPfarm, with condor_rm -forcex In some cases, had to specify the full job.section And the removal was not immediate, took a few minutes to take effect as seen in condor_q ########### # PHYSICS # ########### http://newsinfo.iu.edu/web/page/normal/7294.html Mufsen/Rebal cosmic results ####### # SAM # ####### Resolved old IT items at https://plone3.fnal.gov/SAMGrid/tracking/pcng_search_form Searched for items submitted by kreymer/ SAM-IT/1979] SAM C++ API compiler warnings (#3/resolve) These warnings are gone in sam_cpp_api v8_4_0_1 -q GCC-3.4.3 Thanks ! SAM-IT/1751: multiple parameter selection The root cause was a bug in dbserver v7_6_1, resolved by upgrading to v8_3_0 on 2007/10/15. See IT 2257 for the dbserver fix. The damaged Minos database was repaired on 2007/10/12 with sqlplus, per instructions from herber. See details in http://www-numi.fnal.gov/minwork/computing/dh/worklog.txt UPDATE DIMENSIONS SET DIM_ALIAS = 'param_values##261' where DIMENSION_NAME = 'MC.BFIELD' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_categories##261' where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_category' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_types##261' where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_type' ; SAM-IT/1642: VERSION_ANALYZED selection Resolved on 2007/05/24 Learned that one should use VERSION, not VERSION_ANALYZED SAM-IT/1226 KITS sam_web_services_client offsite access Resolved by John Inkmann, 2005/10/21 File protections updated. For future reference, filling out webform http://fnkits.fnal.gov/specialprod.html can prevent this problem. ########### # SCRATCH # ########### Date: Tue, 25 Mar 2008 10:24:57 -0500 (CDT) Subject: HelpDesk ticket 113131 ___________________________________________ Short Description: Quota request for rhatcher on BlueArc served /minos/scratch Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user rhatcher on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. ___________________________________________ Date: Tue, 25 Mar 2008 10:34:25 -0500 (CDT) This ticket has been reassigned to HILL, KEVIN of the CD-LSCS/CSI/CS/EST Group. __________________________________________________________________ Date: Tue, 25 Mar 2008 11:20:52 -0500 (CDT) Solution: quota increased ______________________________________________________________ ########### # SCRATCH # ########### Date: Tue, 25 Mar 2008 13:38:15 -0500 (CDT) Subject: HelpDesk ticket 113147 ___________________________________________ Short Description: Quota request for scavan on BlueArc served /minos/scratch Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user scavan on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. 
___________________________________________ Date: Tue, 25 Mar 2008 13:42:39 -0500 (CDT) This ticket has been reassigned to HILL, KEVIN of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ ============================================================================= 2008 03 24 ################## # WEEK IN REVIEW # ################## BlueArc outage Monday 17 March 08:55 to 09:05 tripped up glideins Actually 08:42 to 09:10 ( m> nas ) + noted kreymer DOE cert will be expiring ( m> 2008 ) Parrot HTTP_PROXY corrected in current Condor 7.0.1 predeployed, ticket 112641 Vahle condor jobs being held ? + resolved FNAL Central Unix Web Service Town Meeting - Liz CRL time stamps shifted to UTC ( gysin fixing ) + fixed apparenly by gysin /home/minfarm filled - rubin cleared it + fixed by Howie /grid/data/minos filling ( rustem ) Condor jobs removed accidentally Wed + yep factproxy had a problem Thu, 20 Mar 2008 13:30:22 -0500 Jason query re server replacements SAM updates stopped after F000040476 ( Sun m> minosdata ) + corrected The daikon_00 archives to tape around midnight Monday 17 March. But we lost ground for the week : Tufts farm upgrade questions ######## # DATA # ######## MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 27T 1.6T 95% /minos/data MINOS26 > du -sm /minos/data/mcimport/STAGE/* 171932 /minos/data/mcimport/STAGE/daikon_00 9014169 /minos/data/mcimport/STAGE/daikon_04 MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N 8045715 /minos/data/mcimport/STAGE/daikon_04/L010185N 6622 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N 121349 /minos/data/mcimport/STAGE/daikon_04/L250200N arms suggests removing all under runs 7000. But everything we have is over 7000. ####### # WEB # ####### Wed 3/19 http://www-css.fnal.gov/csi/webdocs/townmeeting/ Nothing of much interest there, beyond the agenda. ########## # CONDOR # ########## ########## # PARROT # ########## Need to grab 'current', and adjust HTTP_PROXY to http://squid.fnal.gov:3128 adding the http:// Can try using the new HA prototype squid at fg3x3. ######### # MYSQL # ######### Long term connections noted by west, | 11338268 | reader | minosaur.maps.susx.ac.uk:38076 | litest | Sleep | 1342001 | | | | 11338270 | reader | minosaur.maps.susx.ac.uk:38077 | temp | Sleep | 1342000 | | | | 11338275 | reader | minosaur.maps.susx.ac.uk:38078 | offline | Query | 1341998 | Writing to net | select * from PLEXPIXELSPOTTOSTRIPEND where SEQNO between 200001001 and 200001004 or SEQNO between 2 | Should kill these ######### # ADMIN # ######### Could not enter the BlueArc status into the System Status page, as it was more than 3 days old. 
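Back to the long-idle reader connections flagged in the MYSQL section above : a minimal sketch of generating the KILL statements, assuming an account with processlist privilege on minos-mysql1 ; the host name and the one-day cutoff here are illustrative, not taken from the log.
# Sketch : emit KILL statements for reader sessions idle more than a day.
# Review the list by hand before piping it back into mysql.
mysql -h minos-mysql1 -N -e 'SHOW PROCESSLIST' | \
  awk '$2 == "reader" && $5 == "Sleep" && $6 > 86400 { print "KILL", $1, ";" }'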
####### # AFS # ####### Created afserrscan script, for cluster scan sorted by date #!/bin/sh EXT=${1} NODES='minos01 minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10 minos11 minos12 minos13 minos14 minos15 minos16 minos17 minos18 minos19 minos20 minos21 minos22 minos23 minos24 minos25 minos26' for NODE in ${NODES} ; do ssh -ax ${NODE} "grep afs: /var/log/messages${EXT} | grep 'Mar ' | grep -v Tokens | uniq" done | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' | sort date Usage : ./afserrscan ./afserrscan '.1' ############ # PREDATOR # ############ N00013828_0006.mdaq.root Thu Mar 20 20:06:39 UTC 2008 killed, ok N00013828_0007.mdaq.root Thu Mar 20 20:08:34 UTC 2008 stuck N00013829_0000.mdaq.root Thu Mar 20 20:11:09 UTC 2008 F00040476_0001.mdaq.root Fri Mar 21 22:12:19 UTC 2008 stuck cd /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/neardet_data/2008-03 grep v00-00 *.sam.py* F00040476_0001.sam.py: applicationFamily=ApplicationFamily('online','rotorooter','v00-00--1'), mv N00013828_0007.sam.py N00013828_0007.sam.pybad mv N00013829_0000.sam.py N00013829_0000.sam.pybad cd /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/fardet_data/2008-03 mv F00040476_0001.sam.py F00040476_0001.sam.pybad ============================================================================= 2008 03 17-23 KREYMER ON FURLOUGH ============================================================================= 2008 03 16 Sunday ######## # FARM # ######## cp AFSS/samdup samdup MINOS-SAM02 > ./samdup /minos/data/minfarm/nearcat | tee /minos/data/minfarm/DUP/nearcat.20080317.lis DET=near for FILE in `cat /minos/data/minfarm/DUP/${DET}cat.20080317.lis` ; do mv /minos/data/minfarm/${DET}cat/${FILE} \ /minos/data/minfarm/DUP/${DET}cat/${FILE} done DET=far ########## # samdup # ########## Cloned this from samlocate, infused with saddreco ./samdup /minos/data/minfarm/mcfarcat Test with recent near examples, which should be dups mkdir /minos/scratch/kreymer/mcnearcat touch /minos/scratch/kreymer/mcnearcat/n13047100_0000_L010185N_D04.sntp.cedar_phy_bhcurv.0.root touch /minos/scratch/kreymer/mcnearcat/n13047100_0003_L010185N_D04.sntp.cedar_phy_bhcurv.0.root mkdir /minos/scratch/kreymer/nearcat touch /minos/scratch/kreymer/nearcat/N00013755_0000.cosmic.sntp.cedar.0.root touch /minos/scratch/kreymer/nearcat/N00013755_0006.cosmic.sntp.cedar.0.root And a couple of pending files which should not be dups mkdir /minos/scratch/kreymer/mcfarcat touch /minos/scratch/kreymer/mcfarcat/f21311483_0000_L010185N_D00.mrnt.cedar_phy.root mkdir /minos/scratch/kreymer/farcat touch /minos/scratch/kreymer/farcat/F00040225_0000.all.sntp.cedar.0.root touch /minos/scratch/kreymer/farcat/F00040225_0012.all.sntp.cedar.0.root ./samdup -y /minos/scratch/kreymer/mcnearcat Test with 513 files time ./samdup /minos/data/minfarm/mcfarcat real 0m53.918s user 0m2.415s sys 0m0.419s repeat real 0m12.383s user 0m2.398s sys 0m0.350s Test with mcnearcat, 2676 files real 3m38.113s user 0m7.619s sys 0m0.577s repeat real 3m25.990s user 0m7.617s sys 0m0.634s This was all in dev Test again on idle minos-sam02, in prd mcfarcat real 0m10.236s user 0m1.984s sys 0m0.246s real 0m9.704s user 0m2.000s sys 0m0.216s mcnearcat real 2m40.157s user 0m6.627s sys 0m0.343s real 0m48.428s user 0m6.653s sys 0m0.348s Continue testing on minos-sam02, a bit quicker. ./samdup /minos/data/minfarm/farcat | wc -l 300 ./samdup /minos/data/minfarm/nearcat | wc -l 597 Added test for sam location of concatenated file, and existence in PNFS. 
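The added concatenated-file test could look roughly like this. A sketch only, assuming the plain 'sam locate' command and its usual ['/pnfs/dir,volume'] output form ; the parsing and the messages are mine, not from samdup :
# Sketch : does the concatenated output have a SAM location, and is it in PNFS ?
FILE=N00013755_0000.cosmic.sntp.cedar.0.root
LOC=`sam locate ${FILE} 2>/dev/null | tr -d "[]'" | cut -f 1 -d ','`
if [ -z "${LOC}" ] ; then
  echo "NOLOC  ${FILE}"            # no SAM location - not safe to call the input a dup
elif [ -r "${LOC}/${FILE}" ] ; then
  echo "INPNFS ${FILE} ${LOC}"     # located, and the file is visible in PNFS
else
  echo "NOPNFS ${FILE} ${LOC}"     # SAM says located, but the file is not in PNFS
fi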
./samdup /minos/data/minfarm/farcat | wc -l 300 ./samdup /minos/data/minfarm/nearcat | wc -l 597 ######## # FARM # ######## SRV1> ./farmgsum Summarizing /grid/data/minos/*cat 2178 69269 nearcat 1815 14211 farcat 2676 473169 mcnearcat 513 12063 mcfarcat 0 1 mcfmockcat 706 556 minfarm/WRITE 7888 569269 TOTAL files, GBytes ############ # MCIMPORT # ############ Noon batch of ENCP's to LTO-4 hung up From /minos/data/mcimport/TAR/daikon_04/L250200N/near/711/mcimport.log WILL ENCP 249 files Start time: Sun Mar 16 11:40:02 2008 User: mindata(3648) Group: e875(5111) Euser: mindata(3648) Egroup: e875(5111) Command line: encp --delayed-dismount 5 --verbose 4 /home/mindata/TAPE/n11037110_0000_L250200N_D04.tar.gz /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz Version: v3_7 CVS $Revision: 1.866 $ OS: Linux 2.6.9-55.0.2.ELsmp i686 Release: Scientific Linux Fermi LTS release 4.4 (Wilson) Library: CD-LTO4G1 Storage Group: minos File Family: stage FF Wrapper: cpio_odc FF Width: 1 Current working directory: minos-sam03.fnal.gov:/minos/data/mcimport/STAGE/daikon_04/L250200N/near/711 Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=0.990sec File queued: /home/mindata/TAPE/n11037110_0000_L250200N_D04.tar.gz library: CD-LTO4G1 family: stage bytes: 755032831 elapsed=1.09949278831 Mover called back. elapsed=1.6713218689 Input file /home/mindata/TAPE/n11037110_0000_L250200N_D04.tar.gz opened. elapsed=1.72233390808 Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=486.270sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=1386.360sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=2286.490sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=3186.580sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=4086.750sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=4986.920sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=5887.060sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=6787.310sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=7687.980sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=8588.060sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=9488.160sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. elapsed=10388.230sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.tar.gz write request to LM. 
elapsed=11288.340sec Web status page shows CD-LTO4G1.library_manager alive : unlocked stkensrv4 2008-Mar-16 15:02:28 Ongoing Transfers 0 Pending Transfers 1 Full Queue Elements Pending write for stage from minos-sam03 by mindata [VOLS_IN_WORK] Date: Sun, 16 Mar 2008 15:22:38 -0500 (CDT) Subject: HelpDesk ticket 112734 ___________________________________________ Short Description: encp from minos-03 to LTO4 tape hanging up since noon sunday. Problem Description: enstore-admin : After running smoothly for several days, our copies from minos-sam03 to /pnfs/minos/stage are hung up. Messages look like Mover called back. elapsed=1.6713218689 Input file /home/mindata/TAPE/n11037110_0000_L250200N_D04.tar.gz opened. elapsed=1.72233390808 Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.t ar.gz write request to LM. elapsed=486.270sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.t ar.gz write request to LM. elapsed=1386.360sec Submitting /pnfs/minos/stage/daikon_04/L250200N/near/711/n11037110_0000_L250200N_D04.t ar.gz write request to LM. elapsed=2286.490sec The Enstore Server Status web page shows CD-LTO4G1.library_manager alive : unlocked stkensrv4 2008-Mar-16 15:02:28 Ongoing Transfers 0 Pending Transfers 1 Full Queue Elements Pending write for stage from minos-sam03 by mindata [VOLS_IN_WORK] See the log file /minos/data/mcimport/TAR/daikon_04/L250200N/near/711/mcimport.log ___________________________________________ Data started up at about 18:44 Updated ticket ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ ============================================================================= 2008 03 14 ############ # MCIMPORT # ############ 21:47 Interrupted during DIR=708, due to impending stkensrv2a Enstore master node down time. rm /home/mindata/TAPE/n11037083_0008_L250200N_D04.tar.gz Will do Sat Mar 15 08:26:21 CDT 2008 DIRS='708 709 710 711 712 713 714 715 717' for DIR in ${DIRS} ; do ./mcimport.20080311 -T daikon_04/L250200N/near/${DIR} done ############ # SADDRECO # ############ minfarm@fnpcsrv1 and prepare per HOWTO.saddreco cd to /afs/fnal.gov/files/home/room1/kreymer/minos/scripts ./saddreco.new -d near -r cedar -p 2007-11 --verify There are 502 files in /pnfs/minos/reco_near/cedar/cand_data/2007-11 Found a smaller month to work with, ./saddreco.new -d near -r cedar -p 2006-05 --verify 18 ./saddreco.new -d near -r cedar -p 2007-10 --verify 37 could not use -v , as this activated verbosity of the sam calls. Changed to y for yack Retained -v, if we want both yackiness and sam verbosity SRV1> ./saddreco.new -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N --verify STARTED Sat Mar 15 02:35:01 2008 This is now picking up the .0.root files containing pass numbers. Let's get caught up on this SLOG=${HOME}/ROUNTMP/LOG/saddreco/daikon_04/cedar_phy_bhcurv/near_L010185N.log ./saddreco.20080315 -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N -P 714 --declare That looks OK, Let's pick up 3 more cand's and one mrnt. ./saddreco.20080315 -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N -P 738 --declare Looks good on the surface, several parents. Now get caught up on all runs , with logging ./saddreco.20080315 \ -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N -P 738 --declare \ 2>&1 | tee -a ${SLOG} Made this the default saddreco on fnpcsrv1 SRV1> cp -a AFSS/saddreco.20080315 . 
SRV1> ln -sf saddreco.20080315 saddreco # was saddreco.20071117 SRV1> date Fri Mar 14 22:20:11 CDT 2008 ######### # FNALU # ######### Programs continue to crash and have strange problems, and nodes get hung up. For node system information, see http://cdcvs.fnal.gov/cgi-bin/fnal-only/cvsweb.cgi/syscollect/fnalu/?cvsroot=syscollect From lsload, and bhosts, MINOS26 > bhosts | cut -f 1 -d ' ' | sort Host MAX cpu mem flxb09 - 2 x 999 510360 kB flxb10 2 2 x 999 449M flxb11 - flxb12 - flxb13 2 449M flxb14 - flxb15 - flxb16 4 2 x 2667 905M 1034584 kB flxb17 4 2 x 2667 965M ditto flxb18 4 2 x 2667 928M flxb19 4 2 x 2667 921M flxb20 4 2 x 2667 962M flxb21 4 2 x 2667 965M flxb22 - flxb23 4 2 x 2667 965M flxb24 4 2 x 2667 963M flxb25 4 2 x 2667 964M flxb26 4 2 x 2667 965M flxb27 4 2 x 2667 964M flxb28 4 2 x 2667 929M flxb29 4 2 x 2667 964M flxb30 4 2 x 2667 597M 1034584 kB flxb31 4 2 x 2194 1902M 2074908 kB flxb32 4 2 x 2193 1899M 2074908 kB flxb33 4 2 x 2193 1644M 2074908 kB flxb34 4 2 x 2195 1637M 2074908 kB flxb35 4 2 x 2393 3570M 4038672 kB flxi04 - flxi06 2 4 x 3600 3753M 4095356 kB Intel for NODE in $BNODES ; do printf "${NODE} " ; ssh -ax ${NODE} 'cat /proc/cpuinfo | grep MHz | uniq' ; done for NODE in $BNODES ; do printf "${NODE} " ; ssh -ax ${NODE} 'cat /proc/cpuinfo | grep MHz | wc -l' ; done I don't trust the lsload memory information, scanned it for NODE in $BNODES ; do printf "${NODE} " ; ssh -ax ${NODE} 'cat /proc/meminfo | grep MemTotal' ; done ___________________________________________ Date: Fri, 14 Mar 2008 12:39:28 -0500 (CDT) Subject: HelpDesk ticket 112694 ___________________________________________ Short Description: Please turn off the 1 GHz FNALU batch nodes Problem Description: dss-est : Please turn off the FNALU batch systems flxb09, flxb10 and flxb13, and retire these systems. They are dual processor 1 GHz systems, with only 1/2 GB of memory. They are a small fraction of the FNALU capacity. Minos batch jobs have been failing repeatedly on these nodes, due to the lack of memory. ___________________________________________ Date: Fri, 14 Mar 2008 13:05:34 -0500 (CDT) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Fri, 14 Mar 2008 13:17:05 -0500 (CDT) Art, I closed the lsf queues for these 3 nodes. I will have to check with Wayne before retiring them.
margaret ___________________________________________ Date: Mon, 27 Oct 2008 12:55:49 -0500 (CDT) Solution: flxb11-30 decommissioned ___________________________________________ ___________________________________________ ####### # AFS # ####### The symlink for afssum is out of date, try the new version MIN > cp afssum.20070828 afssum.test time ./afssum.test MINOS01 > time ./afssum.test real 131m8.298s user 22m32.763s sys 69m18.231s Updated to use afssum.20070828 MIN > ln -sf afssum.20070828 afssum # was afssum.20060614 ########### # ENSTORE # ########### Checking the Enstore log for yesterday, for client versions http://www-stken.fnal.gov/enstore/enstore_logs.html http://www-stken.fnal.gov/enstore/log//LOG-2008-03-13 MIN > grep v3_7 LOG-2008-03-13.htm | wc -l 2915 MIN > grep v3_6 LOG-2008-03-13.htm | wc -l 49461 MIN > grep v3_6g LOG-2008-03-13.htm | wc -l 46332 MIN > grep v3_6c LOG-2008-03-13.htm | wc -l 2296 MIN > grep v3_6d LOG-2008-03-13.htm | wc -l 636 MINOS26 > grep v3_6i LOG-2008-03-13.htm | wc -l 197 MINOS26 > grep 'Version:' LOG-2008-03-13.htm | cut -f 9 -d ' ' | sort -u v3_6c v3_6d v3_6g v3_6i v3_7 MINOS26 > grep 'Version: v3_6c ' LOG-2008-03-13.htm | cut -f 2 -d ' ' | sort -u logjam.fnal.gov southport.fnal.gov MINOS26 > grep 'Version: v3_6d ' LOG-2008-03-13.htm | cut -f 2 -d ' ' | sort -u lynx.fnal.gov minos-om.fnal.gov minos-sam03.fnal.gov MINOS26 > grep 'Version: v3_6g ' LOG-2008-03-13.htm | cut -f 2 -d ' ' | sort -u cmsstor101.fnal.gov cmsstor103.fnal.gov ... cmsstor98.fnal.gov cmsstor99.fnal.gov stkendca10a.fnal.gov stkendca11a.fnal.gov ... stkendca19a.fnal.gov stkendca20a.fnal.gov MINOS26 > grep 'Version: v3_6i ' LOG-2008-03-13.htm | cut -f 2 -d ' ' | sort -u cdfensrv3.fnal.gov stkensrv3.fnal.gov MINOS26 > grep 'Version: v3_7 ' LOG-2008-03-13.htm | cut -f 2 -d ' ' | sort -u d0ensrv3n.fnal.gov des02.fnal.gov des04.fnal.gov minos-sam03.fnal.gov sdssdp30.fnal.gov sdssdp44.fnal.gov sdssdp45.fnal.gov sdssdp48.fnal.gov sdssdp51.fnal.gov sdssdp53.fnal.gov sdssdp55.fnal.gov sdssdp56.fnal.gov sdssdp57.fnal.gov Why is minos-om writing directly with encp ? MINOS26 > grep 'minos-om' LOG-2008-03-13.htm | tr ' ' \\\n | grep /pnfs/minos | less MINOS26 > OMDIRS=`grep 'minos-om' LOG-2008-03-13.htm | tr -d "'" | tr ' ' \\\n | grep /pnfs/minos` MINOS26 > for DIR in ${OMDIRS} ; do dirname ${DIR} ; done | sort -u /pnfs/minos/fardet_logs /pnfs/minos/fardet_logs/msglog /pnfs/minos/fardet_logs/om /pnfs/minos/fardet_logs/om/postscript /pnfs/minos/fardet_logs/om/rootfiles /pnfs/minos/fardet_logs/om/rootfiles/00040000-00049999 /pnfs/minos/fardet_logs/om/summaries /pnfs/minos/fardet_logs/timing /pnfs/minos/neardet_logs /pnfs/minos/neardet_logs/msglog /pnfs/minos/neardet_logs/om /pnfs/minos/neardet_logs/om/rootfiles /pnfs/minos/neardet_logs/om/rootfiles/00010000_00019999 /pnfs/minos/neardet_logs/om/summaries /pnfs/minos/neardet_logs/timing MINOS26 > grep 'minos-om' LOG-2008-03-13.htm | cut -f 1 -d ' ' 04:13:40...04:13:42 04:18:09...04:18:48 05:19:54...05:19:56 05:22:50...05:23:06 05:35:30...05:35:31 05:37:42...05:38:54 05:44:06...05:44:07 05:46:27...05:46:43 Check Wednesday, curl http://www-stken.fnal.gov/enstore/log/LOG-2008-03-12 -o LOG-2008-03-12.htm Nothing, it seems yesterday was a one-shot archival.
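The per-version host survey above can be done in one pass over the same downloaded log. A minimal sketch, assuming the 'Version:' and host fields stay in the field positions used above :
# Sketch : one loop instead of one grep per client version.
LOG=LOG-2008-03-13.htm
for VER in `grep 'Version:' ${LOG} | cut -f 9 -d ' ' | sort -u` ; do
  printf "\n${VER}\n"
  grep "Version: ${VER} " ${LOG} | cut -f 2 -d ' ' | sort -u
done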
for DAY in 13 01 02 03 04 05 07 08 09 10 ; do LOGF=LOG-2008-03-${DAY} printf "${LOGF}\n" curl -s http://www-stken.fnal.gov/enstore/log/${LOGF} -o ${LOGF} du -sm ${LOGF} grep 'minos-om' ${LOGF} | wc -l rm -f ${LOGF} done LOG-2008-03-13 328 LOG-2008-03-13 247 LOG-2008-03-01 924 LOG-2008-03-01 0 LOG-2008-03-02 608 LOG-2008-03-02 0 LOG-2008-03-03 694 LOG-2008-03-03 0 LOG-2008-03-04 668 LOG-2008-03-04 0 LOG-2008-03-05 608 LOG-2008-03-05 0 LOG-2008-03-07 743 LOG-2008-03-07 0 LOG-2008-03-08 559 LOG-2008-03-08 0 LOG-2008-03-09 484 LOG-2008-03-09 0 LOG-2008-03-10 493 LOG-2008-03-10 0 ============================================================================= 2008 03 13 ####### # AFS # ####### summaries missing from /afs/fnal.gov/files/expwww/numi/html/computing/dh/afssum Manually ran /usr/krb5/bin/kcron ${HOME}/minos/scripts/afssum quiet It looked OK, but produced no output ########## # PARROT # ########## Repeatin test before reporting to cctools@listserv.nd.edu 2_4_2 no proxy FNPC144 > export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_2-i686-linux-2.6 FNPC144 > export PATH=${PARROT_DIR}/bin:${PATH} FNPC144 > parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash FNPC144 > PS1='P> ' P> ls -d /afs/fnal.gov/files/code/e875/general/minossoft 1205436334.559503 [19916] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1205436334.559698 [19916] parrot: grow: fetching checksum: wget --no-cache -q -O /tmp/parrot.1060/grow.checksum.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfschecksum 1205436334.826238 [19916] parrot: grow: remote checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d 1205436334.826301 [19916] parrot: grow: fetching directory: wget --no-cache -q -O /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfsdir 1205436338.337115 [19916] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 1205436339.339436 [19916] parrot: grow: local checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d /afs/fnal.gov/files/code/e875/general/minossoft 2_4_2 with proxy FNPC145 > export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_2-i686-linux-2.6 FNPC145 > export PATH=${PARROT_DIR}/bin:${PATH} FNPC145 > export HTTP_PROXY="squid.fnal.gov:3128" FNPC145 > parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash FNPC145 > ls -d /afs/fnal.gov/files/code/e875/general/minossoft 1205436801.996700 [3408] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1205436801.996908 [3408] parrot: grow: fetching checksum: wget --no-cache -q -O /tmp/parrot.1060/grow.checksum.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfschecksum 1205436802.020919 [3408] parrot: grow: remote checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d 1205436802.021135 [3408] parrot: grow: fetching directory: wget --no-cache -q -O /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- http://www-numi.fnal.gov:80/computing/d199//.growfsdir 1205436805.265410 [3408] parrot: grow: checksumming /tmp/parrot.1060/grow.directory.www-numi-fnal-gov-80--computing-d199- 1205436806.258024 [3408] parrot: grow: local checksum: 6f63107de1a1e42d3a10b8847ebffea250f0895d /afs/fnal.gov/files/code/e875/general/minossoft 2_4_0 with proxy FNPC146 > export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 FNPC146 > export PATH=${PARROT_DIR}/bin:${PATH} FNPC146 > export 
HTTP_PROXY="squid.fnal.gov:3128" FNPC146 > parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash FNPC146 > ls -d /afs/fnal.gov/files/code/e875/general/minossoft 1205437116.856289 [7912] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1205437116.856398 [7912] parrot: http: connect squid.fnal.gov port 3128 1205437116.858341 [7912] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfsdir HTTP/1.0 Host: squid.fnal.gov 1205437117.186037 [7912] parrot: http: HTTP/1.0 200 OK 1205437117.186081 [7912] parrot: http: Date: Thu, 21 Feb 2008 22:35:11 GMT 1205437117.186093 [7912] parrot: http: Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8g mod_fastcgi/2.4.2 PHP/5.2.5 1205437117.186104 [7912] parrot: http: Last-Modified: Thu, 07 Feb 2008 14:53:09 GMT 1205437117.186115 [7912] parrot: http: ETag: "5350b140-33ac14e-44592a1cdbf44" 1205437117.186131 [7912] parrot: http: Accept-Ranges: bytes 1205437117.186140 [7912] parrot: http: Content-Length: 54182222 1205437117.186158 [7912] parrot: http: Content-Type: text/plain 1205437117.186168 [7912] parrot: http: X-Cache: HIT from fermigrid4.fnal.gov 1205437117.186178 [7912] parrot: http: Via: 1.0 fermigrid4.fnal.gov:3128 (squid/2.6.STABLE9) 1205437117.186188 [7912] parrot: http: Proxy-Connection: close 1205437117.186197 [7912] parrot: http: 1205437117.186208 [7912] parrot: grow: loading filesystem directory... 1205437126.783397 [7912] parrot: grow: directory checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d 1205437126.783585 [7912] parrot: grow: fetching checksum from wget --no-cache -q -O /tmp/grow.checksum.1060.7909 http://www-numi.fnal.gov:80//computing/d199//.growfschecksum 1205437126.828840 [7912] parrot: grow: actual checksum is 6f63107de1a1e42d3a10b8847ebffea250f0895d /afs/fnal.gov/files/code/e875/general/minossoft Sent mail to cctools@listserv.nd.edu ########## # CONDOR # ########## Sent this to sfiligoi for courtesy review, 13:55 : Date: Thu, 13 Mar 2008 14:46:22 -0500 (CDT) Subject: HelpDesk ticket 112641 ___________________________________________ Short Description: Minos Cluster - condor 7.0.1 preinstallation run2-sys : Please install the following RPM in all the minos01 thru minos25 . http://fermigrid.fnal.gov/files/condor/condor-7.0.1-linux-x86-rhel3-dynamic-1.i386.rpm This rpm places new files in /opt/condor-7.0.1, and should not interfere with existing operations. Please also copy the configuration files into /opt/condor-7.0.1 on each node. HNAME=`hostname -s` cp /opt/condor-6.9.5/etc/condor_config \ /opt/condor-7.0.1/etc/condor_config cp /opt/condor-6.9.5/local.${HNAME}/condor_config.local \ /opt/condor-7.0.1/local.${HNAME}/condor_config.local Background : We want to upgrade the Condor version on the Minos Cluster from 6.9.5 to 7.0.1 the week of 24 March. Installing the rpm and prepositioning the configuration files will let us review the installation ahead of time, and perhaps upgrade a node or two ahead of the general upgrade. ___________________________________________ Date: Thu, 13 Mar 2008 14:53:22 -0500 (CDT) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. ________________________________________________________________ ######## # PNFS # ######## Date: Thu, 13 Mar 2008 11:24:51 -0500 From: Robert Hatcher To: Arthur Kreymer Subject: MINOS: created charm directories for D04 I've created the input/output directories for D04 L010185N_charm using your script that also sets the file families. 
cd ~kreymer/minos/scripts ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010185N_charm write ########## # DCACHE # ########## Some DCache failures yesterday, Date: Thu, 13 Mar 2008 11:58:37 -0400 (EDT) From: Josh Boehm Mar 12 14:10 Error ( POLLIN POLLERR POLLHUP) (with data) on control line [32] Failed to create a control line Failed open file in the dCache. Error ( POLLIN POLLERR POLLHUP) (with data) on control line [32] Failed to create a control line Error in : file dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy/cand_data/2005-10/N00009003_0015.spill.cand.cedar_phy.0.root does not exist The rest I have are identical with different file names. The errors appear to have all occurred between 14:10-14:30 yesterday. ######## # ENCP # ######## Installed v3_7, just in case this helps MINOS26 > upd install -j encp v3_7 This seems to be happy, and connects to stkensrv2 without further config or qualifiers. MINOS26 > ups declare -c encp v3_7 MINOS26 > date Thu Mar 13 09:58:54 CDT 2008 ######## # DATA # ######## Killed mcimport at /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037047_0007_L250200N_D04.tar.gz Set --delayed_dismount 5, to try to improve tape mounting ( observed 30 second HAVEBOUND period, then dismount perhaps the Enstore system is just responding too slowly ) Restarted, with initial LOCALLIM 210000 , reset to 30000 in PAPER. Rats, the tape dismounted after 30 seconds in HAVE-BOUND status ! label mover tot.time status system_inhibit rq. host updated volume family VOJ550 LTO4_48.mover 230 DISMOUNT_WAIT (16 ) (none none) ['minos'] 03-13-08 08:34:39 minos.stage.cpio_odc Into the ECRC/COPY phase of 704 now Interrupted to restore normal LOCALLIM, and remove --delayed-dismount per developers' request ( moot ) $ rm /minos/data/mcimport/TAR/daikon_04/L250200N/near/704/n11037048_0002_L250200N_D04.ecrc Also switched to encp v3_7, just in case this helps. $ cp -a AFSS/mcimport.20080311 . $ ./mcimport.20080311 -T daikon_04/L250200N/near/704 Observed following timings ECRC - 75 to 110 " COPY - 25 " Earlier tests showed COPY - 110 " ECRC - 3 " Interrupted at ECRC n11037048_0018_L250200N_D04.tar.gz Updated to do the COPY before ECRC. Will implicitly get encp v3_7 ( current ). $ cp -a AFSS/mcimport.20080311 . $ ./mcimport.20080311 -T daikon_04/L250200N/near/704 Thu Mar 13 10:33:48 CDT 2008 Rates look good COPY - 55 to 110 " ECRC - 5" net - 60 to 115" WILL ENCP 80 files Start time: Thu Mar 13 11:45:53 2008 LTO4_43.mover 10 files moved so far, as of 11:54, About 30 seconds per file elapsed. timings like Starting /home/mindata/TAPE/n11037047_0025_L250200N_D04.tar.gz transfer. elapsed=4.94632411003 File /home/mindata/TAPE/n11037047_0025_L250200N_D04.tar.gz transfered. elapsed=22.6599259377 Something seems to be slowing down transfers, every 3 minutes. Extra 20 to 30 seconds in transfer. I see nothing directly correlated with this in Ganglia ( except network rates, which is where I saw this first.
) This finished up at Thu Mar 13 12:31:13 CDT 2008 Launched the full set again : $ DIRS='705 706 707 708 709 710 711 712 713 714 715 717' for DIR in ${DIRS} ; do ./mcimport.20080311 -T daikon_04/L250200N/near/${DIR} done MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 25T 3.1T 90% /minos/data MINOS26 > date Thu Mar 13 13:12:59 CDT 2008 Date: Thu, 13 Mar 2008 17:15:32 +0000 (UTC) From: Arthur Kreymer To: Stan Naymola Cc: Jon Bakken , enstore-admin@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 112593 has additional info. On Thu, 13 Mar 2008, Stan Naymola wrote: > We have added back 2 LTO4 drives, so if things work right, there should > be drives available. Thanks ! The next batch of Minos data writes has started, and is now running at full speed, 30 MB/sec net. There are brief ( 30 second ) delays about every 3 minutes, but nothing like the problems we had before. A few things have changed since the last measurement 1) You added 2 drives, thanks ! 2) CMS writes to CCRC08LoadTest are not active, like they were before. 3) The LTO-3 manager was restored to service, after robot repairs. 4) I upgraded our default client from encp v3_6d to encp v3_7 . ============================================================================= 2008 03 12 ########### # SCRATCH # ########### Date: Wed, 12 Mar 2008 14:43:46 -0500 (CDT) Subject: HelpDesk ticket 112578 ___________________________________________ Short Description: Quota request for boehm on BlueArc served /minos/scratch Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user boehm on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. ___________________________________________ Date: Wed, 12 Mar 2008 15:06:12 -0500 (CDT) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Solution: Hi Art, minos-nas-0:/minos/scratch for boehm quota has been increased to 500GB ___________________________________________ ############ # MCIMPORT # ############ Severe overloads from RAL again, Mar 12 04:07:44 minos26 kernel: oom-killer: gfp_mask=0xd0 Mar 12 04:07:48 minos26 kernel: Out of Memory: Killed process 14571 (scp). MINOS26 > ps -u mindata | grep -c scp ; ps -u mindata | grep -c md5sum 4 54 NSCP=`ps -u mindata | grep -c scp` NMD5=`ps -u mindata | grep -c md5sum` (( NMCI = NSCP + NMD5 )) echo ${NMCI} The rate of clearing gets drastically better with under 35 md5sum's. Load average 35 -> 25 in 5 minutes. 11:41 - 9 md5sum's 11:42 - 0 md5sum Load average dropped from 18 to 0 in under 3 minutes ( 11:41 to 11:43 ) Back up to 33 around 13;15, as the next batch arrives. MINOS01 > time md5sum /minos/data/mcimport/mtavera/n12037402_0008_L010185N_D04.tar.gz e4a7c5c80e0fffdbf72ee9224d4d05f1 /minos/data/mcimport/mtavera/n12037402_0008_L010185N_D04.tar.gz real 0m9.532s user 0m0.037s sys 0m0.020s MINOS01 > time md5sum /minos/data/mcimport/mtavera/n12037402_0008_L010185N_D04.tar.gz e4a7c5c80e0fffdbf72ee9224d4d05f1 /minos/data/mcimport/mtavera/n12037402_0008_L010185N_D04.tar.gz real 0m0.037s user 0m0.027s sys 0m0.008s Here are my working guesses about the capacity of minos26. Let's assume that we sustain 10 scp's at a time, and regulate the data flow to avoid overload. 10 scp's, at about 1 MB/sec each is about 1 TByte per day. We really do not need to ingest data at that rate ! md5sum of a 10 MByte file which is still in memory takes 40 milliseconds. 
From a badly overloaded /home/data disk, it takes 10 seconds. So md5sum's should play no role in this, as long as they are done while still in memory. This will be the case if we do not badly overload the system. In short, unless I've slipped a decimal somewhere, we should be just fine, on average, if we stay well away from saturation of the system. Whether the import limit is 5, 10, or 20 should not be critical. Even 5 should sustain .5 TByte per day, more than we need. Note that the /minos/data disk, used by the farm, overlaying, and various users, also plays a role in this. We may need to be careful about how many streams we write at once. The job I'm running to archive older mcimport/STAGE data has slowed from 15 MB/sec to 1 MB/sec during these overload periods, due to slow data delivery from /home/data. From top: Cpu(s): 0.2% us, 0.1% sy, 0.0% ni, 0.0% id, 99.6% wa, 0.0% hi, 0.0% si 14:32 md5sums dropped from 27 to 26 !!!! data had been moving in at about 1 MB/sec 14:43 24 14:45 minos-sam03 data rates back up to 8 MB/sec 14:46 21 47 20 48 20 54 18 55 14 56 12 57 8 58 8 59 6 15:00 0 Summary - looking at the minos26 ganglia plots, we suffered about a 12 hour delay in all major /minos/data activities. ######## # DATA # ######## Date: Wed, 12 Mar 2008 18:23:33 -0500 (CDT) Subject: HelpDesk ticket 112593 ___________________________________________ Short Description: minos.stage writes to LTO-4 stalled - why ? Problem Description: enstore-admin : The good news : The writes to /pnfs/minos/stage/.. ( minos.stage family ) have been running at around 5 MB/sec, as they copy from BlueArc mounted /minos/data. I restructured this to copy to local disk first, then write directly. The encp overall rate to tape is now over 30 MB/sec , as of around Wed Mar 12 17:48:44 CDT 2008. The bad news : The drive LTO4-43 was apparently preempted for CMS load test. There was then a nearly minute delay before my next file got written . 2 of the 11 drives were dead, three are idle, and my encp command was just sitting waiting for a mover. One more file got copied using drive LTO4_44, then I was preempted for another CMS load test. I am on furlough next week. We must get at least a few TBytes of data archived this week. Please do what it takes so that we do not get preempted. Here are some details, fyi: Start time: Wed Mar 12 17:52:26 2008 User: mindata(3648) Group: e875(5111) Euser: mindata(3648) Egroup: e875(5111) Command line: encp --verbose 4 /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037040_0003_L250200N_D04.t ar.gz Version: v3_6d CVS $Revision: 1.829 $ OS: Linux 2.6.9-55.0.2.ELsmp i686 Release: Scientific Linux Fermi LTS release 4.4 (Wilson) Library: CD-LTO4G1 Storage Group: minos File Family: stage FF Wrapper: cpio_odc FF Width: 1 Current working directory: minos-sam03.fnal.gov:/minos/data/mcimport/STAGE/daikon_04/L250200N/near/704 Got error while trying to obtain configuration: ('KEYERROR', "Configuration Server: no such name: 'pnfs_agent'") Submitting /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037040_0003_L250200N_D04.t ar.gz write request to LM. elapsed=0.440sec File queued: /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz library: CD-LTO4G1 family: stage bytes: 758433401 elapsed=0.536798000336 Submitting /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037040_0003_L250200N_D04.t ar.gz write request to LM. elapsed=900.470sec Mover called back. elapsed=904.14408493 Input file /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz opened. 
elapsed=904.165093899 Input file /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz opened. elapsed=904.165093899 Starting /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz transfer. elapsed=1062.3872509 File /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz transfered. elapsed=1079.18411994 Waiting for final mover dialog. elapsed=1079.280sec Received final dialog for minos-sam03.fnal.gov-1205362346-1509-0. elapsed=1087.200sec Verifying /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037040_0003_L250200N_D04.t ar.gz transfer. elapsed=1087.11204982 File status after verification: ('ok', None) elapsed=1087.29182482 Transfer /home/mindata/TAPE/n11037040_0003_L250200N_D04.tar.gz -> /pnfs/minos/stage/daikon_04/L250200N/near/704/n11037040_0003_L250200N_D04.t ar.gz: 758433401 bytes copied to VOJ550 at 29.3 MB/S (43.1 MB/S network) (168 MB/S drive) (43.1 MB/S disk) (3.95 MB/S overall) (29.3 MB/S transfer) drive_id=ULTRIUM-TD4 drive_sn=1310019745 drive_vendor=IBM mover=LTO4_44.mover media_changer=SL8500.media_changer elapsed=1087.31 Completed transferring 758433401 bytes in 1 files in 1087.30618596 sec. Overall rate = 3.95 MB/sec. Transfer rate = 29.3 MB/sec. Network rate = 43.1 MB/sec. Drive rate = 168 MB/sec. Disk rate = 43.1 MB/sec. Exit status = 0. PURGED n11037040_0003_L250200N_D04.tar.gz Start time: Wed Mar 12 18:10:34 2008 ___________________________________________ Date: Wed, 12 Mar 2008 20:03:12 -0500 (CDT) This ticket is assigned to SSA Primary of the CD-SF/DMS/DSC/SSA group. ___________________________________________ Date: Thu, 13 Mar 2008 05:13:13 +0000 (UTC) Oops, an important correction to a typo in this report. Instead of There was then a nearly minute delay before my next file got written . I meant to say There was then a nearly 20 minute delay before my next file got written . The 20 minute delays are continuing. I seem to get 1 to 3 files copied, then am kicked off the drive for 20 minutes. In copying 95 files so far, there have been 18 tape mounts. The progression has been : mover=LTO4_43 mover=LTO4_44 mover=LTO4_46 mover=LTO4_43 mover=LTO4_49 mover=LTO4_46 mover=LTO4_43 mover=LTO4_47 mover=LTO4_45 mover=LTO4_49 mover=LTO4_44 mover=LTO4_46 mover=LTO4_49 mover=LTO4_47 mover=LTO4_51 mover=LTO4_47 mover=LTO4_43 mover=LTO4_46 At 00:10, I have been sitting waiting for a move for 15 minutes. There are 3 IDLE LTO-4 drives. ___________________________________________ Date: Thu, 13 Mar 2008 02:12:03 -0500 (CDT) This ticket is assigned to SSA Primary of the CD-SF/DMS/DSC/SSA group. ___________________________________________ Date: Thu, 13 Mar 2008 13:18:30 +0000 (UTC) Has this ticket been seen by anyone ? It was assigned to SSA Primary at 20:03:12 -0500 (CDT) It was assigned to SSA Primary at 02:12:03 -0500 (CDT) The problems are continuing. Right now, there are three IDLE LTO-4 drives, but my encp commands are still waiting 20 minutes for a tape mount for each file copied. The full log for this job is in file /minos/data/mcimport/TAR/daikon_04/L250200N/near/704/mcimport.log mounted on all FNALU and FermiGrid nodes . ___________________________________________ Date: Thu, 13 Mar 2008 13:47:31 +0000 (UTC) I changed my script to force a 5 minute HAVE BOUND period, encp --delayed-dismount 5 just in case the overloaded Enstore systems need more leeway. But drive LTO4-48 dismounted my tape after 30 seconds in the HAVE_BOUND state. Here is a status line from http://cmsdca.fnal.gov/cgi-bin/enstore_drives.sh label mover tot.time status system_inhibit rq. 
host updated $ VOJ550 LTO4_48.mover 230 DISMOUNT_WAIT (16 ) (none none) ['minos'] 03-13-08 $ There are still several IDLE drives. ___________________________________________ Date: Thu, 13 Mar 2008 08:51:31 -0500 (CDT) From: Jon Bakken _This just ties up tape drives more - I don't think this is a good idea, and it certainly just makes things worse. It's not fair to other experiments. Basically, enstore is broken, and no options you set are going to help it. __________________________________________ Date: Thu, 13 Mar 2008 09:02:54 -0500 From: Stan Naymola As Jon said the system is broken. The developers are trying to fix the LM. Please use the defaults for transfers. We have 2 LTO4's that are offline for a special test. I am canceling that test and will put these online. That should relieve the LTO4 resource issue. Stan ___________________________________________ Date: Thu, 13 Mar 2008 11:28:43 -0500 From: Stan Naymola We have added back 2 LTO4 drives, so if things work right, there should be drives available. __________________________________________ Date: Thu, 13 Mar 2008 17:15:32 +0000 (UTC) From: Arthur Kreymer To: Stan Naymola Cc: Jon Bakken , enstore-admin@fnal.gov, minos-data@fnal.gov Subject: Re: HelpDesk ticket 112593 has additional info. On Thu, 13 Mar 2008, Stan Naymola wrote: > We have added back 2 LTO4 drives, so if things work right, there should > be drives available. Thanks ! The next batch of Minos data writes has started, and is now running at full speed, 30 MB/sec net. There are brief ( 30 second ) delays about every 3 minutes, but nothing like the problems we had before. A few things have changed since the last measurement 1) You added 2 drives, thanks ! 2) CMS writes to CCRC08LoadTest are not active, like they were before. 3) The LTO-3 manager was restored to service, after robot repairs. 4) I upgraded our default client from encp v3_6d to encp v3_7 . __________________________________________ Date: Thu, 13 Mar 2008 14:40:59 -0500 (CDT) From: David Berg __________________________________________ I sent 3 replies to this ticket yesterday evening. Did you not see them? If not, something is broken in the helpdesk auto forwarding. (See attachments.) In my opinion, (4) is the most significant change as it probably fixed the communication deadlock between the encp client and the mover. That was the real cause of your problems. It never had anything to do with not enough available drives, or CMS bumping your tape. I will look into why the --delayed-dismount option didn't work. Jon is wrong - this is exactly a situation it was intended for. It is not a change applied to all movers, which would be wrong, but just this one. You have a job that will be writing to a single tape continuously for many hours; there is no reason it shouldn't stay mounted the whole time. There is a one-time penalty of a few extra mintues when your job finishes, which is insignificant compared to the time avoided in extra mounts, seeks, and dismounts. I'm glad it's working better for you now. VOJ550 is about half full, with 50 mounts. __________________________________________ ######## # DATA # ######## Need to encp from local disk, for speed. Interrupted near end of 704 ECRC phase, 08:30 Removed the empty n11037048_0000_L250200N_D04.ecrc cp -a AFSS/mcimport.20080311 . ./mcimport.20080311 -T daikon_04/L250200N/near/704 Rates look good 10 to 15 MB/sec even in face of overload from minos26. 
( for the copy phase of things ) Oops, need to interrupt this to correct oversight in PURGEFILE, which need to remove both FILE and LOCAL/FILE Will catch this in the ecrc phase early this afternoon. 17:45 - interrupted to correct the import script, removed partial file rm ${LOCAL}/n11037047_0012_L250200N_D04.tar.gz Changed LOCALLIM to 50000, so I can see some purging right now. $ LOCALFREE=`df -m ${LOCAL} | tr -s ' ' | grep '% /' | cut -f 4 -d ' '` ; echo $LOCALFREE 46736 ./mcimport.20080311 -T daikon_04/L250200N/near/704 Wed Mar 12 17:48:44 CDT 2008 OK - version mcimport.20080303 processing from /minos/data/mcimport/STAGE/daikon_04/L250200N/near/704 ... NFILES 305 WILL ENCP 225 files Overall rates look good, like Overall rate = 30.3 MB/sec. Transfer rate = 30.5 MB/sec. Network rate = 43 MB/sec. Drive rate = 176 MB/sec. Disk rate = 43 MB/sec. Exit status = 0. And we've just been preempted from the LTO-4 drive by CMS. Sent helpdesk request, high priority, around 18:15 Note, the CMS writes are to file family CCRC08LoadTest ============================================================================= 2008 03 11 ############ # MCIMPORT # ############ mualem restarted imports from caltech at about 13:41 Data write rates dropped to around 1 MB/sec from 6, gradually starting around 13:45 through 14:30. Check rates on minos26 : cd /local/scratch26/kreymer/DATA BAF=/minos/data/mcimport/STAGE/daikon_04/L250200N/near/710/n11037100_0000_L250200N_D04.tar.gz time dd if=${BAF} of=TEST.dat bs=759175734 1+0 records in 1+0 records out real 3m9.092s user 0m0.001s sys 0m5.075s MINOS26 > time ecrc TEST.dat CRC 2473399715 real 0m4.186s user 0m2.344s sys 0m0.777s minos-sam03 data rates gradually recovered to 6 MB/se by 15:00 Ganglia shows a similar dip this morning 00:00 to 00:30 ######## # FARM # ######## Checking out CPB far logs , par batch meeting. Many duplicates reported, from F00030612_0000.all.sntp.cedar_phy_bhcurv.0.root to F00037885_0000.spill.sntp.cedar_phy_bhcurv.0.root Most of the PENDing files are before run 30,000. Exceptions in all.sntp , F00034650 F00034744 F00035640 F00037901 These are not duplicates, could be forced out if desired, along with the 19 sub-30K runs from F00019953 to F00028451 ============================================================= Date: Tue, 11 Mar 2008 15:44:08 -0500 From: Howard Rubin To: Arthur Kreymer Subject: Re: cedar_phy_bhcurv far concatenation status - FYI Art, I've looked at a couple of runs, and I think the problem is that you're checking against mdaq files before there were suppression lists, which started when there was beam in 2005-03. For the limited sample I checked, the missing output is in runs which are not in Ben's list from which I run. Here are the subruns for the first of your mystery runs: fnpcsrv1% ls -l F00019953* -rw-r--r-- 1 buckley numi 41790452 Oct 6 2003 F00019953_0000.mdaq.root -rw-r--r-- 1 buckley numi 48620509 Oct 6 2003 F00019953_0001.mdaq.root -rw-r--r-- 1 buckley numi 81994333 Oct 6 2003 F00019953_0002.mdaq.root -rw-r--r-- 1 buckley numi 634052 Oct 6 2003 F00019953_0003.mdaq.root It looks like subrun 0003 wasn't in the runlist, /minos/data/minfarm/lists/daq_lists/2003-10.farlist for good reason. I think you should just force them out. Howie =============================================================== ./roundup -f 1 -s F0001 -r cedar_phy_bhcurv far Tue Mar 11 16:32:53 CDT 2008 ./roundup -f 1 -s F0002 -r cedar_phy_bhcurv far Tue Mar 11 16:38:42 CDT 2008 Send email when this is finished. 
This completed at 19:22 ######## # DATA # ######## files are still moving well to LTO-4 from /minos/data/mcimport/..., still around 6 MB/sec. ####### # LOG # ####### Added this log to the computing/dh/dhmain.html web page, as WORK LOG Shifted the mid term tasks to the bottom of the file, for legibility. Remember to go look there once in a while ! ln -sf dhmain.20080311.html dhmain.html # was dhmain.20080131.html Made a new link worklog.txt to replace samlog.txt. Sent email to minos-data, minos-admin, minos_batch ============================================================================= 2008 03 10 ########### # MINOS10 # ########### Date: Mon, 10 Mar 2008 10:03:47 -0500 (CDT) Subject: HelpDesk ticket 112430 ___________________________________________ Short Description: minos10 down since early Sunday afternoon Problem Description: run2-sys : Node minos10 disappeared from Ganglia monitoring early Sunday afternoon. It is still off the network ( no response to ping. ) Please investigate when you get a chance. ___________________________________________ Date: Mon, 10 Mar 2008 10:12:56 -0500 (CDT) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. ________________________________________________________________ Date: Mon, 10 Mar 2008 10:43:52 -0500 (CDT) Solution: schmitz@fnal.gov sent this solution: power-cycled ___________________________________________________________________ ############ # PREDATOR # ############ Problems in ND sam declares, N00013775_0003.mdaq.root Sat Mar 8 23:06:12 UTC 2008 OOPS - run_dbu is stuck for 147, killing it N00013775_0004.mdaq.root Sat Mar 8 23:09:17 UTC 2008 OOPS - run_dbu is stuck for 137, killing it Sun Mar 9 20:11:48 UTC 2008 /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/run_dbu: line 128: 13756 Segmentation fault dbu -bq ${HOME}/minos/scripts/dbu_sampy.C ${FILE} >>${logname} 2>&1 N00013778_0018.sam.py was not generated - check log for error N00013778_0018.log cd /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/neardet_data/2008-03 rm N00013775_0004.sam.py ######## # DATA # ######## Interrupted early in ECRC for 702, to get rid of the spurious OOPS cp -a AFSS/mcimport.20080303 Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 25T 3.5T 88% /minos/data for DIR in ${DIRS} ; do ./mcimport.20080303 -T daikon_04/L250200N/near/${DIR} done Started up overlaying on minos12 About 10 MB/sec ( input files) 13:25 to 13:45. No impact on minos-sam03 data rates (still running ecrc) Tested local copy on minos26, cd /local/scratch26/kreymer/DATA du -sm /minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0001_L250200N_D04.tar.gz 726 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0001_L250200N_D04.tar.gz MINOS26 > time cp -v /minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0001_L250200N_D04.tar.gz TEST.dat `/minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0001_L250200N_D04.tar.gz' -> `TEST.dat' real 1m9.342s user 0m0.144s sys 0m4.051s Rate is 726 MB/110 sec => 10 MB/sec. 
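For future rate checks, a minimal helper that times the copy and computes MB/sec from the measured wall time, instead of doing the division by hand. This is only a sketch; 'ratecp' is a made-up name, and the example path is the same test file used above.

ratecp() {
  # time a single file copy and report MBytes/sec
  SRC=$1 ; DST=$2
  MB=`du -sm ${SRC} | cut -f 1`
  T0=`date +%s`
  cp ${SRC} ${DST} || return 1
  T1=`date +%s`
  SEC=$(( T1 - T0 )) ; [ ${SEC} -eq 0 ] && SEC=1
  echo "${MB} MB in ${SEC} sec => $(( MB / SEC )) MB/sec"
}
ratecp /minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0001_L250200N_D04.tar.gz TEST.dat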
time ecrc TEST.dat CRC 1502044195 real 0m3.277s user 0m2.388s sys 0m0.824s Let's try dd, for kicks on the next file, BAF=/minos/data/mcimport/STAGE/daikon_04/L250200N/near/702/n11037020_0002_L250200N_D04.tar.gz MINOS26 > time dd if=${BAF} of=TEST.dat bs=10M 70+1 records in 70+1 records out real 1m10.369s user 0m0.000s sys 0m4.046s MINOS26 > time ecrc TEST.dat CRC 3187614117 real 0m5.670s user 0m2.312s sys 0m0.865s MINOS26 > for DIR in ${DIRS} ; do du -sm /minos/data/mcimport/STAGE/daikon_04/L250200N/near/${DIR} ; done 1 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700 1385 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/701 225475 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/702 219404 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/703 222473 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/704 219192 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/705 221078 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/706 221048 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/707 221730 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/708 217331 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/709 220377 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/710 223250 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/711 220463 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/712 222180 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/713 222384 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/714 154 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/715 117468 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/717 We have 200 GB free on /home, total size 252 GB Can we clear 30GB more ? $ du -sm /home/* du: `/home/buckley': Permission denied du: `/home/kreymer': Permission denied 478 . du: `/home/lost+found': Permission denied 5691 /home/mindata removed 141 1 /home/room1 du: `/home/sam/products/man': Permission denied du: `/home/sam/products/catman': Permission denied 2954 /home/sam 1 /home/samread ============================================================================= 2008 03 09 ( Sunday ) ######## # DATA # ######## Scanning for files over 500 MBytes. These can be written directly without tarring em up. $ DIRS=`ls /minos/data/mcimport/STAGE/daikon_04/L250200N/near` $ echo $DIRS 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 717 for DIR in ${DIRS} ; do ( printf "${DIR} " ; find /minos/data/mcimport/STAGE/daikon_04/L250200N/near/${DIR} -name \*.gz -size +500000000c -exec du -sm {} \; | wc -l ) done 700 0 701 307 702 309 703 298 704 305 705 297 706 303 707 303 708 304 709 298 710 302 711 306 712 301 713 303 714 305 715 31 717 161 -T option for direct write of large ( > 500 MB ) .tar files. This will clear most of daikon_04/L250200N/near/* TIND=daikon_04/L250200N/near/715 AFSS/mcimport.20080303 -T ${TIND} Sun Mar 9 17:08:48 CDT 2008 OK - version mcimport.20080303 processing from /minos/data/mcimport/STAGE/daikon_04/L250200N/near/715 Sun Mar 9 18:24:28 CDT 2008 OK, logging activity to /minos/data/mcimport/TAR/daikon_04/L250200N/near/715/mcimport.log Purging did not happen this time, ECRCFILE lacked full path in PURGEFILE. Strange, seeing little 1 minute interruptions in data flow every 12 minutes, on the hourly ganglia network plot. Norm data rate seems to average 5 to 6 MB/sec. 
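As a follow-up to the large-file scan above, a sketch that also totals the size per run directory that would qualify for direct write ( over 500 MB, no tarring ). Paths and the size cut are taken from the scan above; this is a check only, nothing is moved.

STAGE=/minos/data/mcimport/STAGE/daikon_04/L250200N/near
for DIR in `ls ${STAGE}` ; do
  # files over 500 MB are candidates for direct write, without tarring
  SIZES=`find ${STAGE}/${DIR} -name \*.gz -size +500000000c -exec du -sm {} \; | cut -f 1`
  NUM=0 ; TOT=0
  for S in ${SIZES} ; do NUM=$(( NUM + 1 )) ; TOT=$(( TOT + S )) ; done
  printf "%s %5d files %8d MB\n" ${DIR} ${NUM} ${TOT}
done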
Corrected and reran, oops ECRC n11037150_0000_L250200N_D04.tar.gz no harm done, corrected 1 more typo $ df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 28T 25T 3.4T 89% /minos/data Now let's run on all daikon_04/L250200N/near cp -a AFSS/mcimport.20080303 . for DIR in ${DIRS} ; do ./mcimport.20080303 -T daikon_04/L250200N/near/${DIR} done Sun Mar 9 18:38:24 CDT 2008 Oops, left in the extra PURGEFILE in the ENCP file loop so lots of harmless OOPSes in the logs. ECRC... Sun Mar 9 22:21:23 2008 grep Overall /minos/data/mcimport/TAR/daikon_04/L250200N/near/701/mcimport.log mostly 6 MB/sec. Ganalia of minos-sam03 shows 15 MB/sec during ECRC phase, 5 MB/sec during encp phase ============================================================================= 2008 03 07 Date: Fri, 07 Mar 2008 10:02:11 -0600 (CST) Art, I have downloaded the condor 7.0.1 RPMS for MINOS. In future, the latest and greatest RPMs for Minos condor will always be stored at http://fermigrid.fnal.gov/files/condor/RPMS/i386/condor-latest.rpm and (if necessary) http://fermigrid.fnal.gov/files/condor/RPMS/x86_64/condor-latest.rpm Steve Timm forwarded to minos-admin I found these files actually at http://fermigrid.fnal.gov/files/condor/ specifically, we use the x86-rhel3 version, http://fermigrid.fnal.gov/files/condor/condor-7.0.1-linux-x86-rhel3-dynamic-1.i386.rpm ============================================================================= 2008 03 06 ######## # DATA # ######## MINOS26 > date Thu Mar 6 09:46:06 CST 2008 MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 25T 25T 601G 98% /minos/data Date: Thu, 06 Mar 2008 13:17:33 -0600 (CST) From: David Berg To: kreymer@fnal.gov, oleynik@fnal.gov Cc: crawdad@fnal.gov, minos-data@fnal.gov, enstore-admin@fnal.gov Subject: Re: Request Minos write access to LTO-4 to clear 10 TB backlog Art, CMS has kindly loaned 15 blank LTO4 tapes to CD for this purpose. I have created a quota of 15 for minos and reassigned 15 tapes to the common blank pool. I changed the library tag under /pnfs/minos/stage to CD-LTO4G1. Write away. ######## # FARM # ######## roundup is falling behind, copying only 750 out of 1351 files since 13:45 top - 09:10:13 up 72 days, 20:00, 19 users, load average: 8.32, 8.12, 8.13 Tasks: 230 total, 9 running, 220 sleeping, 1 stopped, 0 zombie Cpu(s): 93.6% us, 4.0% sy, 0.0% ni, 2.2% id, 0.2% wa, 0.0% hi, 0.0% si Mem: 16629324k total, 8084144k used, 8545180k free, 7324k buffers Swap: 19454704k total, 240k used, 19454464k free, 1602072k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 22010 carneiro 25 0 597m 592m 636 R 96 3.7 1251:44 astra_022608 23435 carneiro 25 0 597m 592m 636 R 96 3.7 1251:22 astra_022608 23437 carneiro 25 0 597m 592m 636 R 93 3.7 1251:13 astra_022608 23432 carneiro 25 0 597m 592m 636 R 91 3.7 1251:13 astra_022608 23430 carneiro 25 0 597m 592m 636 R 91 3.7 1251:23 astra_022608 23440 carneiro 25 0 597m 592m 616 R 90 3.6 1251:17 astra_022608 Date: Thu, 06 Mar 2008 10:04:35 -0600 (CST) Subject: HelpDesk ticket 112276 ___________________________________________ Short Description: fnpcsrv1 overloaded by carneiro astra_022608 jobs Problem Description: The Minos farm I/O operations from fnpcsrv1 are falling seriously behind. A contributing factor is probably the six CPP bound processes running since about noon yesterday. http://fermigrid2.fnal.gov/ganglia/?r=day&c=FermiGrid&h=fnpcsrv1.fnal.gov These seem to be winding down now, as of about 10:00. 
In future, please do not overload this central server in this way.
___________________________________________
Date: Thu, 27 Mar 2008 08:59:17 -0500 (CDT) Subject: Help Desk Ticket 112276 Has Been Resolved.
___________________________________________________________________
Solution: Carneiro has been shown how to run grid universe jobs and has successfully done so. This problem should not happen again. Let us know if it does. Steve Timm
___________________________________________________________________
########## # PURIFY # ##########
Date: Thu, 06 Mar 2008 18:01:59 +0000 (UTC) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Subject: Software Product "Purify" (fwd)
---------- Forwarded message ----------
Date: Thu, 06 Mar 2008 11:44:35 -0600 From: Peter J. Rzeminski II To: linux-users@fnal.gov Subject: Software Product "Purify"
The license for Purify is coming up for renewal soon. The licensing group has requested that we find out if anybody is currently using it. Troy checked the license server and in the past year nobody requested a license for the software. Additionally, nobody I have spoken to seems to know of anybody using the software. If anybody is currently using it and wants the license to be renewed, please speak up now. Otherwise, I will advise them to let the license expire. Thank you.
--
____________________________________________________________
Peter J. Rzeminski II Email: ptr@fnal.gov CD/LSC/CSI/Central Services - Web Team Phone: 630.840.5524 Fermi National Accelerator Laboratory Pager: 630.905.0540
####### # AFS # #######
Global failures,
Mar 6 06:42:37 minos26 kernel: afs: Lost contact with file server 131.225.68.6
...
Mar 6 07:24:57 minos05 kernel: afs: file server 131.225.68.17 is back up
Same problem on minos-mysql1
Helpdesk ticket 112250 - WEB down - 12:58 UTC / 06:58 CST
Date: Thu, 06 Mar 2008 10:09:38 -0600 (CST) Subject: HelpDesk ticket 112277
___________________________________________
Short Description: AFS down 06:42 to 07:25 - status ?
Problem Description: CSI : AFS seems to have been down from about 06:42 to 07:25 this morning, as seen on the Minos Cluster and on fnpcsrv1 ( presumably a server problem ). I see no announcement at http://computing.fnal.gov/cdsystemstatus/system/AFS.html Is the system stable and usable, or should we be shutting down ? Please post a status announcement. Thanks !
___________________________________________
Date: Thu, 06 Mar 2008 10:14:30 -0600 (CST) This ticket has been reassigned to RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST Group.
___________________________________________________________________
Solution: The AFS fileserver processes core dumped this morning when we attempted to add a service key for the FERMI.WIN.FNAL.GOV AD domain. This process was to be non-disruptive as the processes only needed to re-read a keyfile. This was not the case. The service outage was from 06:43 -> 07:19. AFS is currently stable. I will see about getting a message posted to the status page.
___________________________________________________________________
####### # AFS # #######
Date: Wed, 05 Mar 2008 19:15:04 -0600 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron /usr/krb5/bin/kcron ${HOME}/minos/scripts/cfl
kinit: Invalid message type while getting initial credentials
aklog: Couldn't get fnal.gov AFS tickets: aklog: Invalid argument while getting AFS tickets
CFL.new: No such file or directory ? ?
CFL.new: Permission denied ?
rm: cannot remove `CFL.old': Permission denied mv: cannot move `CFL' to `CFL.old': Permission denied mv: cannot stat `CFL.new': No such file or directory kdestroy: No credentials cache file found while destroying cache Ticket cache ^GNOT^G destroyed! ####### # AFS # ####### Date: Thu, 06 Mar 2008 06:43:02 -0600 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron /usr/krb5/bin/kcron ${HOME}/minos/scripts/condorweb sh: /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorweb: Connection timed out ============================================================================= 2008 03 05 ########## # CONDOR # ########## echo "ln -sf kreymer-condor.proxy.20080404 kreymer-condor.proxy/home/gfactory/.grid/kreymer-condor.proxy" \ | at Apr 01 ############ # MCIMPORT # ############ Urgent need to archive older and/or select gaf files to tape. Revive the tar and write sections. -t option for running tar, specifying input path. Switch to direct ENCP in write, this is archival, no DCache. Means we can purge immediately on a successful copy. encp the whole directory in one command. typically 50 to 100 files. Add ecrc files as each tarfile is built For example, TIND=/minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 mcimport -t ${TIND} will produce /minos/data/mcimport/STAGE/TAR/daikon_00/L010185N_nue/near/141 containing files like n14111411_0000_L010185N_D00_nue-n14111414_0000_L010185N_D00_nue.tar n14111411_0000_L010185N_D00_nue-n14111414_0000_L010185N_D00_nue.ecrc n14111411_0000_L010185N_D00_nue-n14111414_0000_L010185N_D00_nue.index WRITE will encp these directly to tape, Files will go to /pnfs/minos/stage/TAR/... Test like TIND=daikon_00/L010185N_nue/near/141 AFSS/mcimport.20080303 -n -t ${TIND} $ du -sm /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 5682 /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 cp -vax /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 \ /local/scratch26/mindata/141 ... too slow, too much mcimport on minos-sam03, time cp -vax /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 \ /home/mindata/141 real 21m17.551s user 0m1.059s sys 0m31.474s $ AFSS/mcimport.20080303 -t ${TIND} OOPS - found /minos/data/mcimport/CRON/mcimport.tar.pid OK - stale pid file Thu Mar 6 16:43:31 CST 2008 ... For config n1411 _L010185N_D00_nue.tar.gz 99 files from n14111411_0000_L010185N_D00_nue.tar.gz to n14111419_0010_L010185N_D00_nue.tar.gz 99/99 TOTAL FILES 5682 /minos/data/mcimport/TAR/daikon_00/L010185N_nue/near/141 Thu Mar 6 17:17:43 CST 2008 Oops, ecrc was done on the wrong file, correcting manually. 
cd /minos/data/mcimport/TAR/daikon_00/L010185N_nue/near/141 FIS=' n14111411_0000_L010185N_D00_nue-n14111413_0006_L010185N_D00_nue n14111413_0007_L010185N_D00_nue-n14111416_0003_L010185N_D00_nue n14111416_0004_L010185N_D00_nue-n14111418_0010_L010185N_D00_nue n14111419_0000_L010185N_D00_nue-n14111419_0010_L010185N_D00_nue ' for FI in ${FIS} ; do echo ${FI} ecrc ${FI}.tar | cut -f 2 -d ' ' > ${FI}.ecrc done Requested R/W mount of /pnfs/minos on minos-sam03 Corrected group, allowing mindata to write chown kreymer.e875 /pnfs/minos/stage MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 25T 25T 460G 99% /minos/data MINOS26 > date Thu Mar 6 17:47:02 CST 2008 Date: Thu, 06 Mar 2008 17:41:37 -0600 (CST) Subject: HelpDesk ticket 112323 ___________________________________________ Short Description: Please mount /pnfs/minos read/write on minos-sun03 ( presently readonly ) Problem Description: run2-sys : We urgently need to move some 10 TBytes of data from /minos/data to PNFS. minos-sun03 is the system at most capable of doing this, but /pnfs/minos is mounted readonly there. Please change this to a read/write mount ASAP. Thanks ! ___________________________________________ Corrected this to minos-sam03 Date: Thu, 06 Mar 2008 17:55:25 -0600 From: Jason Harrington I have remounted /pnfs/minos read/write on minos-sam03. I was trying to get the change into cfengine and ran into an issue with classes containing "-" (it's a syntax error), so I commented the fstab edits for the time being (next update run at 18:00, so needed to keep things working everywhere). ___________________________________________ All done on the MINOS side. The cfengine repairs are an internal issue. This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Fri, 07 Mar 2008 10:00:25 -0600 (CST) Note To Requester: csieh@fnal.gov sent this Notes To Requester: The system name minos-sun03 does not exist in DNS. Is this the correct name of the system you are asking about? _________________________________________________________________ Tape data rates from minos26 are terrible, about 1 MB/sec. $ AFSS/mcimport.20080303 -t ${TIND} Thu Mar 6 17:40:08 CST 2008 AFSS/mcimport.20080303: line 908: [: too many arguments OK - version mcimport.20080303 processing from /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near/141 Completed transferring 5957857280 bytes in 4 files in 3360.73328114 sec. Overall rate = 1.84 MB/sec. Transfer rate = 1.87 MB/sec. Network rate = 1.97 MB/sec. Drive rate = 94 MB/sec. Disk rate = 1.97 MB/sec. Exit status = 0. Corrected a few more flaws, purged 141. Let's try the whole thing on 142 TIND=daikon_00/L010185N_nue/near/141 AFSS/mcimport.20080303 -t ${TIND} Rates are good, about 6 MB/sec tarring, about 6 MB/sec to tape. Oops, typo error making ecrc's, correct this $ cd /minos/data/mcimport/TAR/daikon_00/L010185N_nue/near/142 $ FILES=`ls *.tar` $ for FILE in $FILES ; do echo ${FILE} ; ecrc ${FILE} | cut -f 2 -d ' ' > ${FILE%.tar}.ecrc ; done AFSS/mcimport.20080303 -t ${TIND} Hmmm, log files are not working properly, oh well. 
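Since the .ecrc files have now had to be rebuilt by hand twice, here is a sketch of a single pass that (re)creates them for every .tar in a TAR output directory, skipping any that already exist. It assumes ecrc prints 'CRC <value>' as in the corrections above; 'mkecrc' is a made-up name.

mkecrc() {
  TARDIR=$1
  cd ${TARDIR} || return 1
  for TARF in `ls *.tar` ; do
    ECRCF=${TARF%.tar}.ecrc
    [ -s ${ECRCF} ] && continue    # non-empty checksum file already present
    echo ${TARF}
    ecrc ${TARF} | cut -f 2 -d ' ' > ${ECRCF}
  done
}
mkecrc /minos/data/mcimport/TAR/daikon_00/L010185N_nue/near/142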
Let's grab a bigger bite $ du -sm /minos/data/mcimport/STAGE/daikon_00/L010185N/near/* 33846 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/141 36561 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/142 36670 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/143 37375 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/144 3763 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/145 10388 /minos/data/mcimport/STAGE/daikon_00/L010185N/near/704 NNS=`ls /minos/data/mcimport/STAGE/daikon_00/L010185N/near` $ echo $NNS 141 142 143 144 145 704 date df -h /minos/data for NN in ${NNS} ; do TIND=daikon_00/L010185N/near/${NN} AFSS/mcimport.20080303 -t ${TIND} date df -h /minos/data done date df -h /minos/data This will go more slowly than it should, most all.md5 files are missing Only present in 704 Thu Mar 6 21:23:26 CST 2008 $ df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 25T 25T 440G 99% /minos/data .... this is no good, the md5sums will take much too long. killed this midstream, $ cd /minos/data/mcimport/STAGE/daikon_00/L010185N/near/141 $ cp 2103.md5 all.md5 Let's grab a bigger piece of pie, $ du -sm /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700 TIND=daikon_04/L250200N/near/700 AFSS/mcimport.20080303 -t ${TIND} $ ls /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700 | wc -l 372 $ wc -l /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700/md5 wc: /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700/md5: No such file or directory $ wc -l /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700/all.md5 369 /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700/all.md5 MAIN >> /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700/log/mcimport.log 2>&1 & Thu Mar 6 21:48:50 CST 2008 OK - version mcimport.20080303 processing from /minos/data/mcimport/STAGE/daikon_04/L250200N/near/700 LOGS TAR, WRITE, PURGE IN TAR Thu Mar 6 21:48:50 CST 2008 For config n1103 _L250200N_D04.tar.gz 274 files from n11037001_0000_L250200N_D04.tar.gz to n11037009_0030_L250200N_D04.tar.gz md5sum n11037008_0025_L250200N_D04.tar.gz n11037001_0000_L250200N_D04-n11037001_0001_L250200N_D04.tar 2 n11037001_0000_L250200N_D04.tar.gz to n11037001_0001_L250200N_D04.tar.gz The ecrc file looks OK now, let's see how this looks tomorrow. If lucky, we'll get nearly a 200 GB from this, and nearly a TB from farm concatenation overnight. 
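Before the next batch, a pre-flight sketch to spot run directories that lack an all.md5, since recomputing md5sums is what made the killed pass above too slow. It only reports; the path is the L010185N/near area used above.

STAGE=/minos/data/mcimport/STAGE/daikon_00/L010185N/near
for NN in `ls ${STAGE}` ; do
  if [ -s ${STAGE}/${NN}/all.md5 ] ; then
    echo "OK   ${NN} `wc -l < ${STAGE}/${NN}/all.md5` entries in all.md5"
  else
    echo "OOPS ${NN} missing all.md5 - the md5sum pass will be slow"
  fi
done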
####### # AFS # ####### for NODE in ${NODES} ; do ssh -ax ${NODE} \ 'grep afs: /var/log/messages | grep "Mar " | grep -v Tokens | uniq'; done \ | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' Mar 3 22:12:55 minos05 kernel: afs: Lost contact with volume location server 198.128.3.21 in cell es.net Mar 3 22:13:52 minos05 kernel: afs: Lost contact with volume location server 198.128.3.23 in cell es.net Mar 3 22:14:49 minos05 kernel: afs: Lost contact with volume location server 198.128.3.22 in cell es.net Mar 3 22:16:01 minos05 kernel: afs: Lost contact with volume location server 192.204.203.218 in cell sinenomine.net Mar 4 05:30:07 minos05 kernel: afs: volume location server 198.128.3.22 in cell es.net is back up Mar 4 05:30:07 minos05 kernel: afs: volume location server 198.128.3.23 in cell es.net is back up Mar 5 08:30:07 minos05 kernel: afs: volume location server 198.128.3.21 in cell es.net is back up ######## # FARM # ######## top - 11:40:44 up 71 days, 22:30, 22 users, load average: 6.41, 2.93, 2.21 Tasks: 223 total, 7 running, 215 sleeping, 1 stopped, 0 zombie Cpu(s): 76.1% us, 1.6% sy, 0.0% ni, 16.2% id, 6.0% wa, 0.1% hi, 0.0% si Mem: 16629324k total, 9604968k used, 7024356k free, 5184k buffers Swap: 19454704k total, 240k used, 19454464k free, 3748952k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 22010 carneiro 25 0 597m 490m 576 R 12 3.0 1:15.15 astra_022608 23430 carneiro 25 0 597m 490m 576 R 12 3.0 0:58.07 astra_022608 23432 carneiro 25 0 597m 490m 576 R 12 3.0 0:53.79 astra_022608 23435 carneiro 25 0 597m 490m 576 R 12 3.0 0:47.90 astra_022608 23437 carneiro 25 0 597m 490m 576 R 12 3.0 0:44.19 astra_022608 23440 carneiro 25 0 597m 490m 556 R 12 3.0 0:41.75 astra_022608 ... ####### # SAM # ####### Various reports of SAM project trouble on LSF nodes ( flxb* ) MINOS26 > sam dump project --station=minos --project=evansj-CC0325-RunII-L250z200-ND-Data-20080304-0419 MINOS26 > SAMDIM='project_name evansj-CC0325-RunII-L250z200-ND-Data-20080304-0419' MINOS26 > sam list files --dim="${SAMDIM}" ... 101 files listed ... Justin specified a project having trouble, with this dataset : MINOS26 > sam_test_py minos ${UNIV} evansj-CC0325-RunI-L010z185-ND-Data OK running station minos dbserver prd dataset evansj-CC0325-RunI-L010z185-ND-Data project sam_test_project_20080305195142 fileCut 0 cid 7686 cpid 26266 job SAMStation.JobCount(jobsAtNode=1, jobsAll=1) ... Got dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008200_0006.spill.sntp.cedar_phy_bhcurv.0.root file 440 Decrementing the job count. Stopping the project FLXB19 > time sam_test_py minos ${UNIV} evansj-CC0325-RunI-L010z185-ND-Data FLXB29 > ./sam_cli_py minos prd sam_test_project_20080305200307 RetryHandler.getNextFile(26271L)> initial retriable exception ProjectNotFound('Project 'sam_test_project_20080305200307' on station 'minos' not responding.') RetryHandler.getNextFile(26271L)> will retry in 1.28 seconds Got dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2005-07/N00008076_0000.spill.sntp.cedar_phy_bhcurv.0.root file 297 Decrementing the job count. 
Stopping the project real 7m4.591s Projects mentioned are : evansj-CC0325-RunI-L010z185-ND-Data-20080304-1125 evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830 evansj-CC0325-RunII-L010z185-ND-Data-20080303-1013 Checking out the other datasets mentioned FLXB19 > time sam_test_py minos ${UNIV} evansj-CC0325-RunII-L250z200-ND-Data Got dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2006-07/N00010583_0023.spill.sntp.cedar_phy_bhcurv.0.root file 101 Decrementing the job count. Stopping the project real 2m7.226s FLXB19 > time sam_test_py minos ${UNIV} evansj-CC0325-RunII-L010z185-ND-Data Got dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-03/N00011855_0000.spill.sntp.cedar_phy_bhcurv.0.root file 413 Decrementing the job count. Stopping the project real 7m27.047s What is the state of the projects : MINOS26 > sam dump project --station=minos --project=evansj-CC0325-RunI-L010z185-ND-Data-20080304-1125 MINOS26 > sam dump project --station=minos --project=evansj-CC0325-RunI-L010z185-ND-Data-20080304-1125 \ > /minos/scratch/kreymer/log/samproj/evansj-CC0325-RunI-L010z185-ND-Data-20080304-1125 MINOS26 > sam dump project --station=minos --project=evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830 Project 'evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830' on station 'minos' not responding. MINOS26 > sam dump project --station=minos --project=evansj-CC0325-RunII-L010z185-ND-Data-20080303-1013 TRANSIENT; CORBA.TRANSIENT(omniORB.TRANSIENT_ConnectFailed, CORBA.COMPLETED_NO) Two are dead, and the third seems to be in active use. 26265: evansj(loon:dev)@dcap://minos-01[Loon Analysis Process], busy since 05 Mar 13:46:11, 05 Mar 13:46:11, 1958488 26267: evansj(loon:dev)@dcap://minos-02[Loon Analysis Process], busy since 05 Mar 13:53:51, 05 Mar 13:53:51, 1958498 26268: evansj(loon:dev)@dcap://minos-01[Loon Analysis Process], busy since 05 Mar 13:54:40, 05 Mar 13:54:40, 1958489 Looking in trace for evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830 Find messages like 03/03/08 20:39:42 minos.evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830.PollProcess.Worker 26800: Notifing 03/03/08 20:39:42 minos.evansj-CC0325-RunII-L250z200-ND-Data-20080303-0830.PollProcess.Worker 26800: System exception `COMM_FAILURE' Reason: Connection refused Completed: no Minor code: 1330577418 (connect() failed) 03/05/08 01:24:19 minos.evansj-CC0325-RunI-L010z185-ND-Data-20080304-1125.PollProcess.Worker 18021: System exception `COMM_FAILURE' Reason: Connection refused Completed: no Minor code: 1330577418 (connect() failed) But these are not unique to evansj jobs, perhaps 'good' error messages. ########## # CONDOR # ########## Released all the gfactory processes that had been held : ########## # CONDOR # ########## Created newer proxy for gfactory, SRV1> cd /export/stage/minfarm/.grid DAYS=20 (( HOURS = DAYS * 24 )) DAPR=`date -d "today + ${DAYS}days" +%Y%m%d` voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife ${HOURS}:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy.${DAPR} \ -valid ${HOURS}:0 Your proxy is valid until Tue Mar 25 10:28:01 2008 DAYS=30 [gfactory@minos25 ~]$ cd .grid/ DAPR=20080325 DAPR=20080404 scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy.${DAPR} . DAPR=20080325 ln -sf kreymer-condor.proxy.${DAPR} kreymer-condor.proxy ########## # CONDOR # ########## All data stopped just before Noon on Tuesday. Investigating. The proxy expired. 
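Since the expired proxy silently stopped all data, a sketch of a cron-able expiry check. It reads the proxy end date with plain openssl rather than any VOMS tooling; the 5 day threshold and the messages are assumptions, only the proxy path comes from the notes above.

PROXY=/home/gfactory/.grid/kreymer-condor.proxy
WARNDAYS=5
# notAfter of the first certificate in the proxy file is the proxy expiry
END=`openssl x509 -enddate -noout -in ${PROXY} | cut -f 2 -d '='`
ENDSEC=`date -d "${END}" +%s`
NOWSEC=`date +%s`
DAYSLEFT=$(( ( ENDSEC - NOWSEC ) / 86400 ))
if [ ${DAYSLEFT} -lt ${WARNDAYS} ] ; then
  echo "OOPS - ${PROXY} expires in ${DAYSLEFT} days ( ${END} )"
else
  echo "OK - ${PROXY} good for ${DAYSLEFT} days"
fi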
####### # SAM # ####### Sam declares for FD data stopped working . Other declares seem OK. //////////////////////////////////////////////// STARTED Mon Mar 3 21:08:41 2008 FINISHED Mon Mar 3 21:08:44 2008 Traceback (most recent call last): File "./sadd", line 110, in ? SAMLOC=sam.locate( args = FILER ) File "sam_common_pylib/SamCommand/BlessedCommandInterfacePlaceHolder.py", line 81, in __call__ File "sam_common_pylib/SamCommand/CommandInterface.py", line 251, in __call__ File "sam_common_pylib/SamCommand/SamCommandInterface.py", line 240, in apiWrapper File "sam_user_pyapi/src/samLocate.py", line 75, in implementation File "sam_common_pylib/SamStruct/NameOrId.py", line 51, in __init__ File "sam_common_pylib/SamCorba/SamIdlStructWrapperBase.py", line 405, in __init__ File "sam_common_pylib/SamStruct/NameOrId.py", line 71, in initialize_fromPython_countedArgs ArgumentError: NameOrId: Invalid input arguments Input args = (None,) fardet_data/2008-03 STARTED Mon Mar 3 23:08:02 2008 //////////////////////////////////////////////// Generating .py for /pnfs/minos/fardet_data/2008-03 STARTING Mon Mar 3 23:06:13 UTC 2008 Treating 78 files Scanning 2 files F00040390_0003.mdaq.root Mon Mar 3 23:06:15 UTC 2008 ? F00040390_0004.mdaq.root Mon Mar 3 23:07:14 UTC 2008 ? FINISHED Mon Mar 3 23:07:59 UTC 2008 //////////////////////////////////////////////// MINOS26 > pwd /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/fardet_data/2008-03 MINOS26 > dds F00040390_0004* -rw-r--r-- 1 kreymer g020 0 Mar 3 17:07 F00040390_0004.log -rw-r--r-- 1 kreymer g020 0 Mar 3 17:07 F00040390_0004.sam.py MINOS26 > find . -size 0 ./F00040390_0003.log ./F00040390_0003.sam.py MINOS26 > dds F00040390_0003* -rw-r--r-- 1 kreymer g020 0 Mar 3 17:06 F00040390_0003.log -rw-r--r-- 1 kreymer g020 0 Mar 3 17:07 F00040390_0003.sam.py MINOS26 > rm F00040390_0003* //////////////////////////////////////////////// mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root Are these files being tried now? If not, could they be? ============================================================================= 2008 03 04 ########### # BLUEARC # ########### Date: Tue, 04 Mar 2008 13:44:42 -0600 (CST) Subject: HelpDesk ticket 112164 ___________________________________________ Short Description: /minos/data size adjustment in BlueArc Problem Description: LSC/CSI : We seem to have been even more successful than before in filling up /minos/data, before we have had a chance to archive some of the older files. ( See previous helpdesk ticket 111148 from 14 Feb. ) We have under 1 TB free, out of 25 TB. Please adjust the size of the /minos/data area upward from 25 TB to 28 TB. If necessary take this space from /minos/scratch, which is presently using under 3 TB. We are actively working on archiving nearly 10 TB of files from /minos/data, but it will take a while to revive old scripts, and get these on tape. We should get this cleared up by about 10 March. ___________________________________________ Date: Tue, 04 Mar 2008 13:53:50 -0600 (CST) This ticket has been reassigned to RZEMINSKI, PETER J of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Tue, 04 Mar 2008 13:58:39 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/BLU Group. 
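A sketch of a simple free-space watch for /minos/data, so the next size adjustment can be requested before the area is nearly full. The 2 TB threshold is an arbitrary assumption; the df parsing follows the same idiom used by roundup elsewhere in this log.

AREA=/minos/data
MINFREE=2000000    # MBytes, roughly 2 TB
FREE=`df -m ${AREA} | tail -1 | tr -s ' ' | cut -f 4 -d ' '`
if [ ${FREE} -lt ${MINFREE} ] ; then
  echo "OOPS - ${AREA} down to ${FREE} MB free ( threshold ${MINFREE} )"
else
  echo "OK - ${AREA} has ${FREE} MB free"
fi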
___________________________________________ ########### # BLUEARC # ########### Yesterday, had no earlier than 13:33 MINOS26 > df -h /minos/data Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 25T 24T 1.7T 94% /minos/data MINOS26 > df -h /minos/scratch Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/scratch 9.0T 2.3T 6.7T 26% /minos/scratch MINOS26 > du -sm /minos/data/users/rustem/ 87015 /minos/data/users/rustem/ ########## # DCACHE # ########## Status of helpdesk tickets 111951 kreymer, 112020 rubin From ticket resolution, georges, x4515, ssa-help@fnal.gov Trying a copy , once more MINOS26 > dccp dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root TEST.data http://fndca3a.fnal.gov/dcache/DOORS.html DCap00-stkendca2a-unknown-550430 DCap00-stkendca2a-unknown-550430 minos26.fnal.gov active Mar 04 11:15:11 Mar 04 11:15:11 1060/14515 DCap00-stkendca2a-unknown-550430 Arthur Kreymer ? ? ? ? open minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root ######## # FARM # ######## > Please give me a brief map of the process of moving files from mcnearcat > to dcache/enstore. The files are moved to /minos/data/minfarm/WRITE/ until they are confirmed to be on tape. That should usually happen by the next cron cycle. Right now there are 739 waiting to get on tape. MINOS26 > ls /minos/data/minfarm/WRITE/*cand* | wc -l 739 > Incidentally, grid processing has added a pass number to mc output > files. Does this disturb you? Somewhat, as I am swamped by some other high priority activities clearing some space on /minos/data ... we have 10.5 TBytes of MC imports !!! clearing out duplicates from /minos/data/minfarm requires some major changes to my scripts But if the files are showing up concatenated, I guess things are fine. I do see one concatenated file dated Mar 4 02:48 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758/n13037581_0000_L010185N_D04.sn tp.cedar_phy_bhcurv.0.root It's not in SAM yet, I presume bacause of the backlog. I'll look for it later today. ... /home/minfarm/ROUNTMP/LOG/saddreco/daikon_04/cedar_phy_bhcurv/near_L010185N.log Needed /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758 Added sam tape location /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758 Treating 3 files in /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758 OK - declared n13037583_0000_L010185N_D04.sntp.cedar_phy_bhcurv.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758(voe170.184) OK - declared n13037582_0000_L010185N_D04.sntp.cedar_phy_bhcurv.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758(voe170.185) OK - declared n13037580_0000_L010185N_D04.sntp.cedar_phy_bhcurv.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data/758(voe170.186) This skipped the .0 files. Looking into this, using saddreco.new time ./saddreco.new -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N --verify URK --- looking at the latest files showing up in mcnearcat, they are owned by minospro.numi . 
SRV1> ypcat passwd | grep minospro minospro:x:42411:5111:minos e875 production:/grid/home/minospro:/sbin/nologin ============================================================================= 2008 03 03 ########### # MINOS26 # ########### disk filled up... perhaps due to vault processing. $ df -h . Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 218G 0 100% /local/scratch26 cd /local/scratch26/mindata/MOVED/ OOPS vault error processing near 2008-02 $ rm -r kreymer/ $ rm -r mtavera 17:13 ( 23:13 UTC ) $ df -h . Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 25G 194G 12% /local/scratch26 Date: Mon, 03 Mar 2008 16:37:01 -0600 From: Minos Data To: minos-data@fnal.gov Subject: MCIMPORT DISABLED DUE TO SCRATCH SPACE WARNING, minos26 scratch space 0 under 3000 MBytes .k5login has been restricted to administrators .k5login will be restored within 10 minutes when space goes over 5000 This worked, $ dds .k5* lrwxrwxrwx 1 mindata e875 12 Mar 3 17:17 .k5login -> .k5loginfull Restarted vault rm -r /local/scratch26/kreymer/SHEEP/neardet_data/2008-02 rm /var/tmp/rawcopy/TARWORK/*.root ############ # MCIMPORT # ############ MUST shift something out. 183937 /minos/data/mcimport/STAGE/daikon_00 10535116 /minos/data/mcimport/STAGE/daikon_04 MINOS26 > du -sm /minos/data/mcimport/STAGE/daikon_04/* 424831 /minos/data/mcimport/STAGE/daikon_04/L010000N 55785 /minos/data/mcimport/STAGE/daikon_04/L010170N 6248133 /minos/data/mcimport/STAGE/daikon_04/L010185N 6622 /minos/data/mcimport/STAGE/daikon_04/L010185N_charm 8428 /minos/data/mcimport/STAGE/daikon_04/L010185N_nccoh 65355 /minos/data/mcimport/STAGE/daikon_04/L010200N 123582 /minos/data/mcimport/STAGE/daikon_04/L100200N 162246 /minos/data/mcimport/STAGE/daikon_04/L150200N 3440138 /minos/data/mcimport/STAGE/daikon_04/L250200N The RUN subdirectories of L010185N and L250200N are nearly all 100 to 200 GByte in size. I'll revive and extend the tarring option of mcimport. $ du -sm /pnfs/minos/stage/* 268720 /pnfs/minos/stage/arms 1 /pnfs/minos/stage/buckley 1 /pnfs/minos/stage/gmieg 758471 /pnfs/minos/stage/hgallag 1358789 /pnfs/minos/stage/howcroft 5596976 /pnfs/minos/stage/kordosky 11457 /pnfs/minos/stage/kreymer 202838 /pnfs/minos/stage/mualem 1 /pnfs/minos/stage/rhatcher 1 /pnfs/minos/stage/sjc 1 /pnfs/minos/stage/urheim Date: Mon, 03 Mar 2008 15:50:59 -0600 (CST) Subject: HelpDesk ticket 112120 ___________________________________________ Short Description: STKEN - request change of /pnfs/minos/stage library to LTO-3 Problem Description: We need to write about 10 TBytes of new data to /pnfs/minos/stage. We need to start these new write this week. Please change the library from CD-9940B to CD-LTO3, so that we do not exhaust the supply of 9940-B tapes. ( 10 TB is about 50 9940-B tapes, versus 120 on hand. ) Thanks ! ___________________________________________ Date: Tue, 04 Mar 2008 08:03:06 -0600 (CST) This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. ########## # CONDOR # ########## Rustem--when you are looking at the slots that are allocated for MINOS you need to add up all users in the MINOS group which are running at the time. For instance, right now there are 300 slots being used by the sum of all three users minos, minospro, and rustem, and it is the sum of these three that matters. Having said that, there was a time this morning at around 9AM when less than 300 total minos jobs were running but it also looked like none were waiting at that time. 
For more information on what use is being made of slots, you can do condor_userprio -all, and look at the entry for group_numi, that will tell you how many slots the minos group is using at one time. There is a standing request from Art Kreymer to boost up the quota of total slots for MINOS which we should get to later this week. Steve Timm > Ticket #: 112087 > Priority: Medium > System Name: > > ____________________________________________ > Requester Information > ____________________________________________ > Name: RUSTEM OSPANOV > Phone: 6460 > E-Mail Address: RUSTEM@FNAL.GOV > _____________________________________________ > Ticket Details > _____________________________________________ > Problem Category: Grid > Type: Fermilab Sup Ctr > Item: FermiGrid > Urgency: Medium > Short Description: FermiGrid/general purpose farms quota allocation > > Problem Description: Hello, > > I am running jobs under minos group account on general purpose farms. > This account has 300 slots allocated for the jobs but I observe that > often less than 300 jobs are running even when there are idle nodes: > > http://home.fnal.gov/~rustem/tmp/condor_q_2008_03_03.txt > http://home.fnal.gov/~rustem/tmp/condor_status_2008_03_03.txt > > There seems to be something with counting of jobs toward the minos > allocation that I do not understand because we should be able to run > 300 jobs at any given time. > > Thank you, > Rustem SRV1> condor_userprio -all | grep numi Last Priority Update: 3/3 13:39 Effective Real Priority Res Accumulated Usage Last User Name Priority Priority Factor Used Usage (hrs) Start Time Usage Time ------------------------------ --------- -------- ------------ ---- ----------- ---------------- ---------------- group_numi.rustem@fnal.gov 11.31 11.31 1.00 141 25711.36 9/22/2007 11:22 3/03/2008 13:40 group_numi.minos@fnal.gov 77.09 77.09 1.00 51 22886.28 11/27/2007 16:52 3/03/2008 13:40 group_numi.minospro@fnal.gov 87.87 87.87 1.00 108 8923.31 12/18/2007 13:51 3/03/2008 13:40 group_numi 177.72 177.72 1.00 300 1040569.32 4/19/2006 19:49 3/03/2008 13:40 <-- # @ Enter Update below this line. @ # --> Steve, thanks for looking into this. Rustem, for timeline information regarding loads, you can use CondorView, Follow the links under http://fermigrid.fnal.gov/ -> Left Frame - FermiGrid Monitoring, Metrics and Accounting: [Metrics and Service Monitors ] -> FermiGrid - Production Clusters / fngp-osg.fnal.gov Condor View -> Pool User (Job) Statistics [week] http://fnpcsrv1.fnal.gov/condorview/viewdir/UserWeek.html Use the 'configure' box to the lower left of the graphics display to select items of interest, like group_numi.minospro@fnal.gov jobsRunning group_numi.rustem@fnal.gov jugsRunning group_numi.rustem@fnal.gov jobsIdle You had a few idle jobs as you ramped up from 21:30 to 22:00 last night You have more idle jobs now, as you are competing with production. <-- # @ Enter Update above this line. @ # --> > > http://webserver.infn.it/cdf/docs/cafcondorOperations.pdf User condor_userprio to change the priority factory for USER in tinti sjc kreymer masaki scavan boehm pawloski rmehdi loiacono ; do condor_userprio -setfactor ${USER}@fnal.gov 100. ; done Got more complete list with USERS=`condor_userprio -allusers | grep ' 0' | cut -f 1 -d ' ' | grep -v gfactory` for USER in ${USERS} ; do condor_userprio -setfactor ${USER} 100. 
; done ########### # MONTHLY # ########### DATASETS 3/3 PREDATOR 3/3 VAULT 3/3 12:00 - restarted near due to full /local/scratch26 MYSQL 3/ 13:30 - MYSQL waiting for heavy usage by tagg@minos11 to abate. This has restarted, but is only reader acces Mysql> du -sm . 58366 . Mon Mar 3 13:30:54 CST 2008 DCS_HV.MYD real 15m26.458s PULSERGAIN.MYD real 11m55.229s the rest real 50m4.492s Mon Mar 3 14:48:34 CST 2008 ######### # MISER # ######### The new Miser uses cert's for access https://appora.fnal.gov/pls/cert/miscomp.miser.html Upgrade described at http://computing.fnal.gov/news/misermar08.html ######## # GRID # ######## voms-proxy-init - are we at required >= 1.6.16.10 SRV1> voms-proxy-init -version voms-proxy-init Version: 1.7.20 Compiled: Jul 3 2007 13:49:34 MINOS26 > cd /grid/app/minos/VDT MINOS26 > . ./setup.sh MINOS26 > voms-proxy-init -version voms-proxy-init Version: 1.7.20 Compiled: Jul 3 2007 13:49:34 ####### # WEB # ####### server validations due by Mar 31 ? follow up to email ######## # FARM # ######## All processing has moved to the Grid universe. ######## # FARM # ######## howie - SAM for Grid jobs ? will try /grid/app/minos/sam clone of /export/stage/minfarm/ROUNDUP/SAM as minfarm@fnpcsrv1 : cp -ax /export/stage/minfarm/ROUNDUP/SAM /grid/app/minos/sam ============================================================================= 2008 02 29 ####### # DOG # ####### Maisey Kreymer is coming home this morning around 10:00 On vacation. Woof. ######## # DATA # ######## Date: Thu, 28 Feb 2008 17:16:38 -0800 From: J. Pedro Ochoa I was successfully getting them using your script "ftpfiles" but then when?it got to n11011026_0000_L010185N_D00.sntp.cedar_phy.root it seems to have gotten stuck (by "stuck" I mean it's been on that file for several hours, and the size does not increase). 15:10 MINOS26 > ./dc_stat /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root ============================ PNFS status for /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root -rw-r--r-- 1 rubin e875 583586556 Nov 28 20:19 n11011026_0000_L010185N_D00.sntp.cedar_phy.root LEVEL 2 2,0,0,0.0,0.0 :c=1:6329d874;h=yes;l=583586556; LEVEL 4 VO9663 0000_000000000_0000008 583586556 mcout_cedar_phy_near_daikon_00_sntp /pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root 000F00000000000006BDD8D8 CDMS119630277100000 stkenmvr16a:/dev/rmt/tps0d0n:479000017059 2252920947 ============================ MINOS26 > dccp -P dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root http://www-stken.fnal.gov/enstore/tape_inventory/VO9663 shows this file at VO9663 CDMS119630277100000 583586556 0000_000000000_0000008 no /pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root MINOS26 > cd /local/scratch26/kreymer/DATA MINOS26 > dccp dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root TEST.data Submitted helpdesk ticket to MSS / dcache-stken Note that Software/MSS is no longer available. 
Date: Fri, 29 Feb 2008 09:34:49 -0600 (CST) Subject: HelpDesk ticket 111951 ___________________________________________ Short Description: Minos file fails to stage Problem Description: dcache-admin : Please reply to minos-data. The following file fails to stage to DCache. We have a user trying to access it : ./dc_stat /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n110 11026_0000_L010185N_D00.sntp.cedar_phy.root ============================ PNFS status for /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data/102/n110 11026_0000_L010185N_D00.sntp.cedar_phy.root -rw-r--r-- 1 rubin e875 583586556 Nov 28 20:19 n11011026_0000_L010185N_D00.sntp.cedar_phy.root LEVEL 2 LEVEL 2 2,0,0,0.0,0.0 :c=1:6329d874;h=yes;l=583586556; LEVEL 4 VO9663 0000_000000000_0000008 583586556 mcout_cedar_phy_near_daikon_00_sntp /pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_ data/102/n11011026_0000_L010185N_D00.sntp.cedar_phy.root 000F00000000000006BDD8D8 CDMS119630277100000 stkenmvr16a:/dev/rmt/tps0d0n:479000017059 2252920947 ============================ ___________________________________________ This ticket is assigned to NAYMOLA, STAN of the CD-SF/DMS/DSC/SSA. ___________________________________________ Date: Mon, 03 Mar 2008 14:37:50 -0600 (CST) This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. ___________________________________________ Date: Wed, 05 Mar 2008 13:42:19 -0600 From: Vladimir Podstavkov The last three files are on LT03 tape, and there is a backlog of 2000 transfers to/from LTOs. Nine drives are used by CMS, two others - by database backups. I would say - the problem requires some action on the enstore side. There is nothing we can do on dcache side. This ticket can be closed. ___________________________________________ Date: Mon, 03 Mar 2008 15:47:41 +0900 From: Howard Rubin Ticket #: 112020 ___________________________________________ Short Description: Unable to read some files Problem Description: Here is a list of files I cannot read from dcache using dccp and srmcp: ... ============================================================================= 2008 02 28 ############ # SADDRECO # ############ Interactive testing of sets ./saddreco.new -d near -r cedar -p 2007-11 --verify sampy import ... os.chdir('/pnfs/minos/reco_near/cedar/sntp_data/2007-11') ... stuff from candfiles ... ######## # MAIL # ######## N.B. kerio mail server / versus outlook ####### # AFS # ####### Predator bailed at 01:06 UTC ( 19:06 CST ), /tmp/filedxZ2eC: line 614: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: Connection timed out Updated ticket : Date: Thu, 28 Feb 2008 19:33:33 +0000 (UTC) Subject: Re: HelpDesk ticket 107032 has additional info. Per Ray's question about pinpointing which areas are being accessed. Generally, the /var/log/messages do not tell me which path is being accessed. Just the IP address of the server which has timed out. I have put a summary of our various scans on the web at http://www-numi.fnal.gov/computing/afs.txt I will add specific file information when available. 
The most recent timeout is interesting, as it involved two hosts, and one of my scripts failed at exactly this time, giving me the path to a file being accessed : Feb 27 05:28:11 minos22 kernel: afs: Lost contact with file server 131.225.68.7 Feb 27 05:30:10 minos22 kernel: afs: file server 131.225.68.7 is back up Feb 27 19:06:35 minos26 kernel: afs: Lost contact with file server 131.225.68.6 /tmp/filedxZ2eC: line 614:/afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: Connection timed out Feb 27 19:07:45 minos26 kernel: afs: file server 131.225.68.6 is back up ####### # SAM # ####### 10 sec limit moveCachedFiles.py -c /var/tmp/kreymer/minosdev.cfg <2008/02/28 08:56:31> Total processing time in seconds: 11 <2008/02/28 08:56:31> Number of records in cached_files. Starting: 553940 Ending: 543977 <2008/02/28 08:56:31> Difference between start and end: 9963 600 sec limit <2008/02/28 09:06:25> Total cached_file records copied: 543116 deleted: 0 <2008/02/28 09:06:25> Total cache_file_project_usages records copied: 578100 deleted: 0 <2008/02/28 09:06:25> Total processing time in seconds: 483 <2008/02/28 09:06:25> Number of records in cached_files. Starting: 543977 Ending: 861 <2008/02/28 09:06:25> Difference between start and end: 543116 ============================================================================= 2008 02 27 ######### # ADMIN # ######### Testing direct page web access, at http://csdserver.fnal.gov/arsys/forms/csdserver/DirectContact/SSAView/?mode=create Date: Wed, 27 Feb 2008 11:35:43 -0600 (CST) Subject: HelpDesk ticket 111830 ___________________________________________ Short Description: WEB MSS PH 630-840-4261 FN ARTHUR This is a test. This is only a test. Problem Description: This is a test. This is only a test. ___________________________________________ ####### # SAM # ####### Date: Wed, 27 Feb 2008 09:07:13 -0600 From: Stephen P. White Subject: Cached_files kreymer@minos-sam01 mkdir /minos/scratch/kreymer/sam cd /minos/scratch/kreymer/sam Getting moveCachedFiles.py from http://cdcvs.fnal.gov/cgi-bin/fnal-only/cvsweb.cgi/sam_maintenance_tools/General/PurgeCachedFiles/ CVSROOT=cvsuser@cdcvs.fnal.gov:/cvs/cd # set the repository unset CVS_RSH # remove ssh used by Minos cvs -d ${CVSROOT co sam_maintenance_tools/General/PurgeCachedFiles cd sam_maintenance_tools/General/PurgeCachedFiles The config file needs to be on local disk, and protected, as it contains database passwords mkdir /var/tmp/kreymer cp example_moveCachedFiles.cfg /var/tmp/kreymer/minosdev.cfg chmod 700 /var/tmp/kreymer/minosdev.cfg nedit /var/tmp/kreymer/minosdev.cfg unset SETUP_UPS UPS_DIR . ~sam/setups.sh setup sam_python v2_4_4 setup cx_Oracle v4_3_3_py2_4_4 moveCachedFiles.py -h [MCF] max_seconds = 3600 # The maximum number of seconds this appliction is to run. log_screen = false # Prints log data to the screen if True. log_path = path # Prints log data to a file at this directory. Set to "" # for no log file. database = dbname # Oracle database to connect to username = uname # Account to use for connection password = paswd # Password to use to obtain connection Added -n NOOP option for a preview run moveCachedFiles.py -n -c /var/tmp/kreymer/minosdev.cfg database connection times out OPW=... setup oracle_client v10_1_0_3_0 sqlplus samdbs/${OPW}@minosdev That failed also, try a newer oracle_tnsnames than 42, setup oracle_tnsnames v46 Now sqlplus works NOOP true <2008/02/27 11:58:37> Max run time is set to 30 seconds. 
<2008/02/27 11:58:37> Establishing database connection as samdbs/xxxx@minosdev <2008/02/27 11:58:38> Obtaining number of records in cached_files.... <2008/02/27 11:58:38> Number of records in cached_files: 429983 <2008/02/27 11:58:40> Selected 409 cached_file_id records in 0 seconds. <2008/02/27 11:58:40> for minDate: 10/23/2005 maxDate: 10/24/2005 (min/max endTime seconds: 1) <2008/02/27 11:58:40> Error: - local variable 'copyCnt' referenced before assignment <2008/02/27 11:58:40> Total cached_file records copied: 0 deleted: 0 <2008/02/27 11:58:40> Total cache_file_project_usages records copied: 0 deleted: 0 <2008/02/27 11:58:40> Total processing time in seconds: 1 <2008/02/27 11:58:40> Number of records in cached_files. Starting: 429983 Ending: 429983 <2008/02/27 11:58:40> Difference between start and end: 0 Checked connections with cd ~/minos/oracle export TOPDB_CONN=monitor/... ./topdb minosdev This was useful for diagnostics, reverted to the original script for production running. moveCachedFiles.py -c /var/tmp/kreymer/minosdev.cfg Repeated a few times, with 100 second time limit. Did the same for int, took only 2 passes ( 190K files ) Will do production tomorrow. Ran test project in production, success ! DB BEFORE AFTER TIME dev 429973 1160 306 sec int 191907 3060 150 sec prd 553079 861 493 sec ####### # AFS # ####### for NODE in ${NODES} ; do ssh -ax ${NODE} 'grep afs: /var/log/messages | grep "Feb " | grep -v Tokens | uniq'; done | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' Feb 24 11:50:43 minos05 kernel: afs: Lost contact with file server 131.225.68.65 Feb 24 11:53:33 minos05 kernel: afs: file server 131.225.68.65 is back up Feb 25 12:11:05 minos09 kernel: afs: Waiting for busy volume 1685714815 Feb 25 19:21:49 minos19 kernel: afs: Lost contact with file server 131.225.68.49 Feb 25 19:21:50 minos19 kernel: afs: failed to store file Feb 25 19:21:51 minos19 kernel: afs: failed to store file Feb 25 19:22:08 minos19 kernel: afs: file server 131.225.68.49 is back up Feb 25 19:21:08 minos22 kernel: afs: Lost contact with file server 131.225.68.49 Feb 25 19:21:09 minos22 kernel: afs: failed to store file Feb 25 19:21:59 minos22 kernel: afs: file server 131.225.68.49 is back up Feb 27 05:28:11 minos22 kernel: afs: Lost contact with file server 131.225.68.7 Feb 27 05:30:10 minos22 kernel: afs: file server 131.225.68.7 is back up Feb 27 19:06:35 minos08 kernel: afs: Lost contact with file server 131.225.68.6 Feb 27 19:06:50 minos08 kernel: afs: file server 131.225.68.6 is back up Feb 27 19:06:35 minos26 kernel: afs: Lost contact with file server 131.225.68.6 /tmp/filedxZ2eC: line 614: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: Connection timed out Feb 27 19:07:45 minos26 kernel: afs: file server 131.225.68.6 is back up ============================================================================= 2008 02 26 ######## # GRID # ######## Grid school this morning and afternoon. 
FermiGrid 201 - Scripting and running grid jobs [ Steve Timm ] Four Minos people ( kreymer, loiacoli, rustem, urish ) Two ( kreymer, urish ) at FermiGrid 202 - Grid Storage Access ============================================================================= 2008 02 25 ######## # GRID # ######## Date: Mon, 25 Feb 2008 18:10:03 -0600 (CST) Subject: HelpDesk ticket 111723 ___________________________________________ Short Description: Minos users needing login to worker nodes Problem Description: As requested, here is a list of Minos users who should have interactive login access to FermiGrid worker nodes : asousa bckhouse kreymer mstrait nwest rhatcher rubin scavan ___________________________________________ ######### # ADMIN # ######### Per Helpdesk/NGOP program, the support levels available are : 24by7 ( commonly called 24x7 ) 8to00by7 8to17by7 6to22by7 ( obsolete ) 830to1630by7 ( obsolete ) 8to17by5 ( commonly called 8x5, incorrectly, should be 9x5 ) zero ######### # ADMIN # ######### minos08 - still no rsh or telnet access no sar since Dec 12 telnetd is running to jonest <-- # @@@ Enter Update below this line. @@@ # --> The system has been behaving normally since Friday. So there is no immedate need for further action. Some residual issues remain unexplained : 1) I cannot log in via rsh or telnet 2) SAR is not running. 3) There is a telnetd process running, unlike rest of the cluster. <-- # @@@ Enter Update above this line. @@@ # --> Oops, as of 17:20 or so, no longer can ssh to minos08 ####### # AFS # ####### Date: Mon, 25 Feb 2008 14:31:16 -0600 (CST) Subject: Your ticket 107032 has been reassigned to PASETES, RAY ########### # WEATHER # ########### Date: Mon, 25 Feb 2008 13:45:58 -0600 From: Fermilab Today To: allhands@fnal.gov Subject: All Hands - weather alert To: All Hands From: Bruce Chrisman, Chief Operating Officer Winter storm A severe winter storm is predicted to move across the Chicago area this evening and into Tuesday. Please be careful as you walk to your car and drive home. Despite the weather, Fermilab will very likely remain open. If Fermilab closes, a note will be posted on the home page: http://www.fnal.gov, and recorded on the Fermilab inclement weather hotline: (630) 840-5995. The National Weather Service Web site, http://www.noaa.gov, is the best source for the latest weather information. And remember, whatever the weather, please drive carefully and walk with caution. ########## # CONDOR # ########## Note that glidein jobs run under account ID uid=7927(minos) gid=5111(numi) groups=5111(numi) This account seems not to exist anywhere else. So if directories are created in grid jobs, and are not group writeable, they will be hard to deal with. Perhaps we need a 7927 account somewhere, like minos01 and minos26. ######### # ADMIN # ######### Account for Ruth Toner Submitted via http://computing.fnal.gov/cd/forms/requirements_offsite_new.html Process outlined in http://computing.fnal.gov/cd/forms/requirements.html Approval described at http://computing.fnal.gov/cd/forms/offsite_instructions.html Approval should come from some in the minos-approved mail list. sent to listserv : review minos-approved There is no MINOS-APPROVED list on this server. Try sending a "LIST" command Per helpdesk printout, these are wojcicki,plunk,ayres,buckley,rameika N.B. - 2008 02 28 /afs/fnal/files/home/room1/rtoner dates from 2005. N.B. 
- 2008 03 27 - she is able to log in now, I have done MINOS01 > pts adduser -user rtoner -group minos MINOS01 > pts membership minos | grep toner rtoner ####### # AFS # ####### for NODE in ${NODES} ; do ssh -ax ${NODE} 'grep afs: /var/log/messages | grep "Feb " | grep -v Tokens | uniq'; done | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' Feb 24 11:50:43 minos05 kernel: afs: Lost contact with file server 131.225.68.65 Feb 24 11:53:33 minos05 kernel: afs: file server 131.225.68.65 is back up ############ # MCIMPORT # ############ minos26 disk has filled quickly Data is coming into mindata from mtavera cd /local/scratch26/mindata/ ls mtavera/ | wc -l 450 10:44 cp -vax mtavera /minos/data/mcimport/mtavera time diff -r mtavera /minos/data/mcimport/mtavera oops, several l*.log and n*.log files have come in. Waited for this to drain. Last change around 14:39, as of 17:13. rsync -n -r mtavera/ /minos/data/mcimport/mtavera --perms --times --size-only -v time rsync -r mtavera/ /minos/data/mcimport/mtavera --perms --times --size-only -v sent 18262419838 bytes received 4060 bytes 10651749.14 bytes/sec total size is 135283318592 speedup is 7.41 real 28m34.211s user 3m19.541s sys 2m7.979s MINOS26 > du -sm /minos/data/mcimport/mtavera 129020 /minos/data/mcimport/mtavera mv mtavera MOVED/mtavera ln -s /minos/data/mcimport/mtavera /local/scratch26/mindata/mtavera 17:44 time diff -r MOVED/mtavera /minos/data/mcimport/mtavera real 391m32.443s user 3m56.464s sys 5m24.418s finished around 01:00 rsync -n -r MOVED/mtavera/ /minos/data/mcimport/mtavera --perms --times --size-only -v building file list ... done sent 62097 bytes received 20 bytes 1656.45 bytes/sec total size is 135283318592 speedup is 2177879.14 cd /minos/data/mcimport/mtavera mv NOIMPORT MCIMPORT ########### # ROUNDUP # ########### roundup.20080225 - changed limit test on ROUNTMP to ignore data, which now resides in /minos/data cp -a AFSS/roundup.20080225 . ln -sf roundup.20080225 roundup ########### # ROUNDUP # ########### stalled due to full disk space ? LOGS/2008-02/cedar_phy_bhcurvmcnear.log OK - stream L010185N_D04.cand.cedar_phy_bhcurv OK - 1298792 Mbytes in 79 runs OOPS - Stream size 1298792 too big for free space 381587 - 10000 This is (unnecessarily) monitoring ROUNTMP=/export/stage/minfarm/ROUNDUP df -m $ROUNTMP | tail -1 | tr -s ' ' | cut -f 4 -d ' ' But the problem is the 1.3 TByte reported size. This went from Wed Feb 20 13:57:18 CST 2008 OK - stream L010185N_D04.cand.cedar_phy_bhcurv OK - 170122 Mbytes in 11 runs to Mon Feb 25 02:21:20 CST 2008 OK - processing 7973 files OK - stream L010185N_D04.cand.cedar_phy_bhcurv OK - 1298792 Mbytes in 79 runs farmgsum shows mcnearcat 2271 1298792 cand.cedar_phy_bhcurv.root SRV1> ls -ltr /minos/data/minfarm/mcnearcat | wc -l 10446 SRV1> ls -ltr /minos/data/minfarm/mcnearcat | grep mrnt | wc -l 5323 SRV1> ls -ltr /minos/data/minfarm/mcnearcat | grep cand | wc -l 2271 cand's are almost all c_p_b, 500 MBytes in size. 
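For reference, a minimal sketch of that limit test, assuming illustrative names for everything except ROUNTMP (this is not the script's actual code):

ROUNTMP=/export/stage/minfarm/ROUNDUP
FREEMB=`df -m ${ROUNTMP} | tail -1 | tr -s ' ' | cut -f 4 -d ' '`   # free MBytes on the staging disk
MARGIN=10000                                                        # headroom held back, in MBytes
STREAMMB=1298792                                                    # size of the pending stream, in MBytes
if [ ${STREAMMB} -gt $(( FREEMB - MARGIN )) ] ; then
    echo "OOPS - Stream size ${STREAMMB} too big for free space ${FREEMB} - ${MARGIN}"
fi

The guard itself is sound; the real trouble is the 1.3 TByte stream size, and as noted in the roundup.20080225 entry above, the check against ROUNTMP is no longer needed now that the data resides in /minos/data.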
============================================================================= 2008 02 24 ############ # SADDRECO # ############ Speed up file scanning, test with minfarm@fnpcsrv1 prepare per HOWTO.saddreco, cd to /afs/...scripts ./saddreco.new -d near -r cedar -p 2007-11 --verify DETECT near RELEASE cedar MONTH 2007-11 BAIL 999999 STARTED Sun Feb 24 22:25:18 2008 saddreco 2007117 Declaring to SAM prd near cedar 2007-11 verify Needed /pnfs/minos/reco_near/cedar/cand_data/2007-11 Treating 291 files in /pnfs/minos/reco_near/cedar/cand_data/2007-11 Needed /pnfs/minos/reco_near/cedar/sntp_data/2007-11 Treating 25 files in /pnfs/minos/reco_near/cedar/sntp_data/2007-11 STARTED Sun Feb 24 22:25:18 2008 FINISHED Sun Feb 24 22:25:44 2008 SAMDIM="FULL_PATH /pnfs/minos/reco_near/cedar/cand_data/2007-11" time sam list files --summaryOnly --dim="${SAMDIM}" File Count: 502 Average File Size: 182.21MB Total File Size: 89.33GB Total Event Count: 50718886 real 0m1.596s user 0m1.114s sys 0m0.136s SAMDIM=" VERSION cedar and DATA_TIER cand-near and PHYSICAL_DATASTREAM_NAME spill and FULL_PATH /pnfs/minos/reco_near/cedar/cand_data/2007-11 " File Count: 211 Average File Size: 289.66MB Total File Size: 59.69GB Total Event Count: 21504842 real 0m1.476s user 0m1.074s sys 0m0.145s Initial conclusions We get a 20 fold speedup ( 26" -> 1.5" ) by doing sam list rather than sam locate. It does not speed things up to select the version and data stream. Use examples of code from samlocate or saddcache. .... 2008 02 26 ... testing saddreco.new 2.2 sec including sam.translateconstraints 5.8 sec including stc and candfiles Longer future test will be ./saddreco.new -m daikon_04 -d near -r cedar_phy_bhcurv -p L010185N \ --verify ######## # GRID # ######## Date: Sun, 24 Feb 2008 14:48:45 -0600 (CST) Subject: HelpDesk ticket 111653 ___________________________________________ Short Description: loiacono account on fnpcsrv1 and fngp-osg Problem Description: Please create a loiacono account for this Minos user on fnpcsrv1 and fngp-osg, so that she can test direct grid submission of select Minos analysis jobs. We have been trying to run these jobs via glideinWMS from minos25, on ISMINOSAFS nodes, but have been suffering from timeouts. We need to find out whether the fault lies with the glidein mechanism, or with the jobs themselves. ___________________________________________ ######## # FARM # ######## Sun Feb 24 14:58:29 CST 2008 mv NOCAT NOCAT.ok ============================================================================= 2008 02 22 ########### # MINOS08 # ########### Date: Fri, 22 Feb 2008 15:34:40 -0600 (CST) Subject: HelpDesk ticket 111626 ___________________________________________ Short Description: minos08 logins failing, system in distress Problem Description: run2-sys : We can no longer log into minos08 via ssh. A connection is made, but never progresses to an interactive session. Ganglia monitoring at http://rexganglia2.fnal.gov/minos/?c=MINOS%20Cluster&h=minos08.fnal.gov&m=& r=day&s=descending&hc=4 shows a heavy wait-I/O load starting soon after 14:00 . If you can get to the console, please check for system level problems. ___________________________________________ Date: Fri, 22 Feb 2008 15:39:11 -0600 (CST) This ticket has been reassigned to JONES, TERRY of the CD-SF/FEF Group. ___________________________________________ Date: Fri, 22 Feb 2008 15:54:52 -0600 (CST) Note To Requester: jonest@fnal.gov sent this Notes To Requester: I have been able to use ssh to login. 
> "load average: 3.43, 3.44, 4.25" ___________________________________________ Note To Requester: jonest@fnal.gov sent this Notes To Requester: > the files in /var/log/sa are from December. _________________________________________________________________ MINOS08 > sudo /etc/init.d/condor start Starting up Condor MINOS08 > date Fri Feb 22 16:01:37 CST 2008 <-- # @@@ Enter Update below this line. @@@ # --> Kerberized rsh connections to minos08 are immediately rejected. They work for the rest of the Minos Cluster . MIN > rsh -N -X minos01 pwd /afs/fnal.gov/files/home/room1/kreymer MIN > rsh -N -X minos08 pwd minos08.fnal.gov: Connection refused trying normal rsh (/usr/bin/rsh) WARNING: NO ENCRYPTION! rsh: invalid option -- N usage: rsh [-nd] [-l login] host [command] <-- # @@@ Enter Update above this line. @@@ # --> rsh -N -X minos08 'pwd' minos08.fnal.gov: Connection refused <-- # @@@ Enter Update below this line. @@@ # --> Thanks, the system seems to have recovered. There are still odd problems. I guess we can wait till next week to investigate. The SAR data seems to stop after 12 February MINOS08 > ls /var/log/sa sa05 sa06 sa07 sa08 sa09 sa10 sa11 sa12 sa13 sar04 sar05 sar06 sar07 sar08 sar09 sar10 sar11 sar12 condor was not running ( I have restarted it. ) I still cannot log in via kerberized rsh. <-- # @@@ Enter Update above this line. @@@ # --> 2008 02 25 ########## # PARROT # ########## Running tests on fnpc132, 32 bit kernel, Intel 2 GB memory, Intel(R) Xeon(TM) CPU 3.06GHz For the first time I see all the root .so files being opened. VERS proxy dcache 242 n n CPU bound, killed after 10 minutes 240 n n OK, quick ( 12 sec ) 240 n y NO, cpu bound, killed after 1 minute, same parrot cur n y OK, quick cur n y NO, CPU bound on rerun, killed after 1 minute cur n n OK, quick Mysteries why can loon not be run twice why does parrot sometimes hang up compute-bound why does -d remote list detailed files, on fnpc132, Linux fnpc132.fnal.gov 2.6.9-42.0.10.ELsmp #1 SMP Tue Feb 27 08:38:56 CST 2007 i686 i686 i386 GNU/Linux not fngp-osg Linux fngp-osg.fnal.gov 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 26 11:21:10 CDT 2007 i686 i686 i386 GNU/Linux ########## # PARROT # ########## cat > /minos/scratch/kreymer/log/parrot/242prdc.0221 top - 10:59:52 up 200 days, 1:45, 1 user, load average: 15.80, 16.11, 19.51 Tasks: 581 total, 18 running, 538 sleeping, 24 stopped, 1 zombie Cpu(s): 71.6% us, 27.6% sy, 0.7% ni, 0.0% id, 0.0% wa, 0.1% hi, 0.0% si Mem: 6229924k total, 6061752k used, 168172k free, 112288k buffers Swap: 17583132k total, 77904k used, 17505228k free, 3110160k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4157 kreymer 25 0 532m 425m 233m R 27 7.0 15243:54 parrot 29530 kreymer 25 0 526m 516m 302m R 25 8.5 245:40.30 parrot 8712 kreymer 25 0 669m 640m 304m R 23 10.5 250:15.06 parrot 20557 kreymer 25 0 669m 586m 229m R 22 9.6 252:20.28 parrot 4355 kreymer 17 0 2528 1396 860 R 0 0.0 0:00.37 top 4159 kreymer 16 0 7076 1592 1244 T 0 0.0 0:00.12 bash 8713 kreymer 16 0 6752 1520 1244 T 0 0.0 0:00.05 bash 14149 kreymer 16 0 141m 82m 40m T 0 1.4 0:03.39 129f324e 15289 kreymer 16 0 5276 1208 992 T 0 0.0 0:00.01 sh 15292 kreymer 16 0 5276 564 348 T 0 0.0 0:00.00 sh 15293 kreymer 15 0 4608 452 376 T 0 0.0 0:00.00 cat 15294 kreymer 15 0 107m 1944 1212 T 0 0.0 0:00.01 129f324e 18961 kreymer 16 0 141m 82m 40m T 0 1.4 0:03.36 129f324e 20153 kreymer 16 0 5276 1208 992 T 0 0.0 0:00.01 sh 20158 kreymer 17 0 5276 564 348 T 0 0.0 0:00.00 sh 20159 kreymer 15 0 4608 452 376 T 0 0.0 0:00.00 cat 20160 kreymer 15 0 
107m 1944 1212 T 0 0.0 0:00.00 129f324e 20558 kreymer 16 0 7400 1596 1244 T 0 0.0 0:00.08 bash 22637 kreymer 15 0 141m 77m 35m T 0 1.3 0:03.36 129f324e 22984 kreymer 16 0 5276 1208 992 T 0 0.0 0:00.01 sh 22987 kreymer 17 0 5276 564 348 T 0 0.0 0:00.00 sh 22988 kreymer 15 0 4608 452 376 T 0 0.0 0:00.00 cat 22989 kreymer 15 0 107m 1608 876 T 0 0.0 0:00.02 129f324e 23904 kreymer 21 0 141m 66m 24m T 0 1.1 0:03.19 129f324e 23928 kreymer 19 0 5276 1208 992 T 0 0.0 0:00.01 sh 23932 kreymer 20 0 5276 564 348 T 0 0.0 0:00.00 sh 23933 kreymer 15 0 4608 452 376 T 0 0.0 0:00.00 cat 23934 kreymer 17 0 107m 1464 732 T 0 0.0 0:00.01 129f324e 29532 kreymer 16 0 6344 1596 1240 T 0 0.0 0:00.09 bash killed off the parrots, load dropped to under 2. ============================================================================= 2008 02 21 ########## # PARROT # ########## Attempting to reestablish parrot tests, and use HTTP_PROXY="squid.fnal.gov:3128" -current- and DCache hanging up : P> loon -bq firstlast.C ${DFILE} Warning in : class timespec already in TClassTable Tested VERS proxy dcache 240 y y Failed open cur y y OK 242 ############ # MCIMPORT # ############ n13037306_0006_L010185N_D04.reroot.root (No such device or address) /pnfs/fnal.gov/usr/minos/mcin_data/near/daikon_04/L010185N/730/n13037306_0006_L010185N_D04.reroot.root ls -l /pnfs/minos/mcin_data/near/daikon_04/L010185N/730/n13037306_0006_L010185N_D04.reroot.root -rw-r--r-- 1 kreymer e875 0 Feb 21 11:29 /pnfs/minos/mcin_data/near/daikon_04/L010185N/730/n13037306_0006_L010185N_D04.reroot.root rm /pnfs/minos/mcin_data/near/daikon_04/L010185N/730/n13037306_0006_L010185N_D04.reroot.root The timing, 11:29, matches the glitch on Condor submission. ######## # DATA # ######## Restarted crontab.dat for kreymer@minos26 mindata@minos26 Holding of on farm concatenation for a little while. ######## # DATA # ######## Date: Thu, 21 Feb 2008 11:30:01 -0600 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron ${HOME}/minos/scripts/condorglide /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorglide: line 12: cd: /minos/scratch/kreymer/condor/probe: Not a directory find: logs/glideafs: No such file or directory subsequent jobs look OK. MIN > for NODE in $NODES ; do printf "${NODE} " ; ssh -ax ${NODE} 'du -sm /minos/scratch/kreymer/condor/probe' ; done minos01 17 /minos/scratch/kreymer/condor/probe minos02 17 /minos/scratch/kreymer/condor/probe ... minos25 17 /minos/scratch/kreymer/condor/probe minos26 17 /minos/scratch/kreymer/condor/probe ######### # DOCDB # ######### Date: Thu, 21 Feb 2008 09:43:03 -0600 (CST) Subject: HelpDesk ticket 111541 Short Description: DodDB is offline Problem Description: The DocDB system seems to be offline. This affects both the Minos DocDB area and the public areas . The problem may have started late last night, before midnight. http://cd-docdb.fnal.gov/cgi-bin/DocumentDatabase/ Forbidden You don't have permission to access /cgi-bin/DocumentDatabase/ on this server. Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request. Apache/2.0.46 (Scientific Linux) Server at cd-docdb.fnal.gov Port 80 ___________________________________________ Date: Thu, 21 Feb 2008 10:03:39 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. ___________________________________________________________________ Date: Thu, 21 Feb 2008 11:47:22 -0600 (CST) Solution: NAS (the file server that supplies files to docdb server) had outage. It is back online. 
Docdb is back online. Try your access to docdb server now. ___________________________________________________________________ ####### # SAM # ####### 09:30 - restarted prd dbserver v8_4_3 False start earlier, had installed all secondary products, but not upd install -j sam_db_srv_pkg v8_4_3 # FARM # roundup -c -r cedar_phy_bhcurv mcnear still running, since Wed Feb 20 16:54:22 CST 2008 WRITING to DCache 329 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/730 /home/minfarm/scripts/roundup: line 801: dccp: command not found SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp \ file:///n13037299_0002_L010185N_D04.cand.cedar_phy_bhcurv.root \ /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/cand_data/729 Interestingly, these copies are continuing successfully, right through the BlueArc outages and PNFS maintenance. ============================================================================= 2008 02 20 ########## # PARROT # ########## HOWTO.parrot added INSTALL section. Tested with VER=2_4_2 Tested using current, runs loon OK Tested setting setenv HTTP_PROXY "squid.fnal.gov:8080" setenv HTTP_PROXY "squid.fnal.gov:80" setenv HTTP_PROXY "squid.fnal.gov:" This seems to have no effect, set before or after running parrot Connected to older version, export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 This tries the squid, and fails for port 8080, 1203552033.342214 [8799] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1203552033.342287 [8799] parrot: http: connect squid.fnal.gov port 8080 1203552033.345079 [8799] parrot: http: GET http://www-numi.fnal.gov:80/computing/d199//.growfsdir HTTP/1.0 Host: squid.fnal.gov 1203552033.346009 [8799] parrot: http: HTTP/1.1 302 Found 1203552033.346029 [8799] parrot: http: Date: Thu, 21 Feb 2008 00:00:33 GMT 1203552033.346038 [8799] parrot: http: Server: Apache/2.2.3 (Unix) mod_ssl/2.2.3 OpenSSL/0.9.7d mod_python/3.2.6 Python/2.3.5 mod_jk/1.2.18 PHP/4.4.2 ... 1203552033.349700 [8799] parrot: http: error: server gave 302 redirect from https://squid.fnal.gov:8443/computing/d199//.growfsdir back to the same url! 
for ports 80, and null, FNGP-OSG > parrot -m ${PARROT_DIR}/mountfile2.grow -d remote bash FNGP-OSG > ls -d /afs/fnal.gov/files/code/e875/general/minossoft 1203551898.085693 [4300] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199/ 1203551898.085911 [4300] parrot: http: connect squid.fnal.gov port 80 1203551898.087505 [4300] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing/d199 1203551898.087625 [4300] parrot: http: connect squid.fnal.gov port 80 1203551898.088621 [4300] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80/computing 1203551898.088716 [4300] parrot: http: connect squid.fnal.gov port 80 1203551898.089746 [4300] parrot: grow: searching for filesystem at http://www-numi.fnal.gov:80 1203551898.089842 [4300] parrot: http: connect squid.fnal.gov port 80 ls: /afs/fnal.gov/files/code/e875/general/minossoft: No such file or directory ######## # DATA # ######## Preparing for PNFS/DCache maintenance 21 Feb kreymer@minos26 echo "crontab -r" | at 05:30 Feb 21 mindata@minos26 echo "crontab -r" | at 01:00 Feb 21 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 Feb 21 ============================================================================= 2008 02 19 ####### # DAQ # ####### Noted that there have been no Near DCS files archived since Feb 4, when the minos-mysql1 database server was restarted. MINOS26 > ls -l /pnfs/minos/near_dcs_data/2008-02 ... -rw-r--r-- 1 buckley e875 477434 Feb 3 19:29 N080203_000002.mdcs.root -rw-r--r-- 1 buckley e875 357868 Feb 4 17:52 N080204_000002.mdcs.root less 2008-02-04.daq.log RCS E Mon 4-02-2008 12:22:03 rcServer 8523 131.225.192.134 5859 31631 run 13591 Socket error on dcsdcp-nd.fnal.gov 9089: Read EOF: Success RCS E Mon 4-02-2008 12:22:08 rcServer 8523 131.225.192.134 5860 31632 run 13591 Connect to dcsdcp-nd.fnal.gov:9089 failed: Connection refused RCS N Mon 4-02-2008 12:22:19 rcServer 8523 131.225.192.134 5861 31633 run 13591 Connected to node DCS(100) on dcsdcp-nd.fnal.gov 9089 RCS N Mon 4-02-2008 12:22:19 rcServer 8523 131.225.192.134 5862 31634 run 13591 Binding socket to DCS(100) N.B. - these showed up in the morning Predator run 2008 02 20 ######## # FARM # ######## > The nue group is interested in running over the HE ND data sample. However, > it looks like only half the RunII HE ND data has been processed with > cedar_phy_bhcurv. Looking at pnfs, only half of the HE ND sntp files for > 2006-06, 2006-07, 2006-08 are there. Would it be possible to get these > missing sntp files? 
MINOS26 > find /pnfs/minos/mcin_data/near/daikon_04/L250200N -type f | wc -l 4638 MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L250200N/cand_data -type f | wc -l 4515 MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L250200N/sntp_data -type f | wc -l 387 SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_04 and MC.BEAM L250200N and VERSION cedar.phy.bhcurv " sam list files --dim="${SAMDIM}" File Count: 387 Average File Size: 1.25GB Total File Size: 484.43GB Total Event Count: 3582400 SFILES=`sam list files --dim="${SAMDIM}" --nosummary` printf "${SFILES}\n" | wc -l 387 for FILE in ${SFILES} ; do sam get metadata --file=${FILE} \ | grep parents \ | tr "'" \\\n \ | grep root done | wc -l 4478 OOPS, this was irrelevant, the question was about ND data, not MC, 805 raw data files in /pnfs/minos/neardet_data/2006-07 117 candidates in /pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2006-07 632 candidates in /pnfs/minos/reco_near/cedar_phy/cand_data/2006-07 Greg sent a list of missing files . QFILES=`cat ../qfiles` for FILE in ${QFILES} ; do FRU=`echo ${FILE} | cut -f 1 -d _` FTA=`echo ${FILE} | cut -f 2- -d .` SUB=`echo ${FILE} | cut -f 2 -d _ | cut -f 1 -d .` SAMDIM="FILE_NAME ${FRU}_%.${FTA} and PARENT_BY_NAME ${FRU}_${SUB}.mdaq.root" printf "\n${FILE}\n" sam list files --nosummary --dim="${SAMDIM}" done ... all these are concatenated, except N00010583_0000.spill.sntp.cedar_phy_bhcurv.0.root N00010583_0010.spill.sntp.cedar_phy_bhcurv.0.root N00010583_0012.spill.sntp.cedar_phy_bhcurv.0.root N00010583_0014.spill.sntp.cedar_phy_bhcurv.0.root N00010583_0023.spill.sntp.cedar_phy_bhcurv.0.root ls /minos/data/minfarm/nearcat/N00010583*.spill.sntp.cedar_phy_bhcurv.0.root ####### # DAQ # ####### Habig requested access to minossrv-nd I do not seem to have access ( minos, root, kreymer ) MIN > ssh minossrv-nd Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). 
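A quick non-interactive probe of that sort of access (a sketch only; the host list is illustrative, and BatchMode simply keeps ssh from prompting):

for H in minossrv-nd dcsdcp-nd.fnal.gov ; do
  printf "${H} "
  ssh -ax -o BatchMode=yes -o ConnectTimeout=10 ${H} hostname 2>&1 | tail -1
done
# prints the remote hostname on success, otherwise the Permission denied or timeout message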
############ # PREDATOR # ############ latest Near DCS file was N080204_000002.mdcs.root Tue Feb 5 11:13:43 UTC 2008 ####### # SAM # ####### received version information for sam_products v4_31 sam_station v6_0_5_23_srm -q GCC-3.1" saved in transit to minos/sam_products.table.v4_31 Referred to http://cdfkits.fnal.gov/CdfCode/source/Distribution/SAM/HOWTO.kits Or better yet, ~/minos/HOWTO.products version=v4_31 oversion=v4_30 samprod=sam_products FLVR=NULL Have to bootstrap, have not done this before for Minos upd install -j sam_products ${oversion} cd ${PRODUCTS}/../prd/${samprod} cp -ar ${oversion} ${version} cd ${version}/${FLVR} ups declare ${samprod} ${version} -f ${FLVR} -r ${samprod}/${version}/${FLVR} -m ${samprod}.table # then edit ups/${samprod}.table as necessary # # You may for example go to a station running the new versions, # and run ./init_sam -n cdf ${oversion} # Then use ups list -K+ to see what's in use for changed products # this could be scripted cd ~/minos/scripts ./updadd ${FLVR} ${samprod} ${version} OK - adding sam_products v4_31 NULL -q OK - reporting space used 12 /afs/fnal.gov/files/code/e875/general/ups/prd/sam_products/v4_31/NULL OK - testing file permissions OK - no file permission problems OK - tar command is gtar 4 /var/tmp/sam_products.tar.gz OK - upd addproduct notice: Adding flags -O "public" error output of move_ups_dir: Authenticated kreymer@FNAL.GOV Account updadmin: authorization for kreymer@FNAL.GOV for execution of /usr/krb5/k5arc/scripts/upd successful Changing uid to updadmin (100) ####### # SAM # ####### Upgraded development dbserver to allow cleanup of old file history upd install -j sam_db_srv_pkg v8_4_3 # was v8_3_0, in server list upd install -j sam v8_2_2 # was v7_5_1 current ups declare -c sam v8_2_2 Startup failed, before I installed all the required products : MINOS-SAM02 > cat dbs__minos-sam02__dbs_dev/trace.15464 warning: Python C API version mismatch for module struct: This Python has API version 1012, module struct has version 1010. warning: Python C API version mismatch for module strop: This Python has API version 1012, module strop has version 1010. warning: Python C API version mismatch for module time: This Python has API version 1012, module time has version 1010. Traceback (most recent call last): File "/home/sam/products/db_server_base/v3_3_17/NULL/bin/DbListener.py", line 30, in ? import Monitor File "/home/sam/products/db_server_base/v3_3_17/NULL/lib/Monitor.py", line 73, in ? import EventPoster File "/home/sam/products/db_server_base/v3_3_17/NULL/lib/EventPoster.py", line 11, in ? import ConfigMgr File "/home/sam/products/db_server_base/v3_3_17/NULL/lib/ConfigMgr.py", line 49, in ? import Parameter File "/home/sam/products/db_server_base/v3_3_17/NULL/lib/Parameter.py", line 8, in ? import DbLog File "/home/sam/products/db_server_base/v3_3_17/NULL/lib/DbLog.py", line 4, in ? import os, os.path, re, exceptions File "/home/sam/products/python/v2_1/Linux+2.4/lib/python2.1/re.py", line 28, in ? from sre import * File "/home/sam/products/python/v2_1/Linux+2.4/lib/python2.1/sre.py", line 17, in ? import sre_compile File "/home/sam/products/python/v2_1/Linux+2.4/lib/python2.1/sre_compile.py", line 15, in ? 
assert _sre.MAGIC == MAGIC, "SRE module mismatch" AssertionError: SRE module mismatch Killed process: 15464 Pick up missing products, per ups list -l sam_db_srv_pkg v8_4_3 db_server_base_cx v1_8 -j') cx_Oracle v4_3_3_py2_4_4 -j') oracle_instant_client v10_2_0_3 -j') oracle_tnsnames v46 -j') sam_python v2_4_4 -j') sam_db_srv v8_4_3 -j') -q "${UPS_REQ_QUALIFIERS}" sam_server_pylib v8_4_1 -j') -q "${UPS_REQ_QUALIFIERS}" sam_common_pylib v8_4_2 -j') omniORB v4_1_1 -q GCC-3.4.3-PYTHON-2.4 -j') sam_idl_pylib v8_4 -j') HTMLgen v2_1 -j') -q "${UPS_REQ_QUALIFIERS}" sam_config -j') sam_ns_ior -j') -q "${UPS_REQ_QUALIFIERS}" sam_dimension_server_prototype v8_4_0 -j') -q "${UPS_REQ_QUALIFIERS}" sam_pnfs_srv v8_4_0 -j') encp v3_6g -j') ups list -aK+ db_server_base_cx ups list -aK+ cx_Oracle ups list -aK+ oracle_instant_client ups list -aK+ oracle_tnsnames ups list -aK+ sam_python ups list -aK+ sam_db_srv ups list -aK+ sam_server_pylib ups list -aK+ sam_common_pylib ups list -aK+ omniORB -q GCC-3.4.3-PYTHON-2.4 ups list -aK+ sam_idl_pylib ups list -aK+ HTMLgen ups list -aK+ sam_config ups list -aK+ sam_ns_ior ups list -aK+ sam_dimension_server_prototype ups list -aK+ sam_pnfs_srv ups list -aK+ encp Installed the needed products on minos-sam03 upd install -j db_server_base_cx v1_8 # was v1_4 upd install -j cx_Oracle v4_3_3_py2_4_4 upd install -j sam_db_srv v8_4_3 upd install -j sam_server_pylib v8_4_1 upd install -j sam_common_pylib v8_4_2 upd install -j omniORB v4_1_1 -q GCC-3.4.3-PYTHON-2.4 upd install -j sam_idl_pylib v8_4 upd install -j sam_dimension_server_prototype v8_4_0 upd install -j sam_pnfs_srv v8_4_0 Dev dbserver is restarted, passes its tests. int dbserver is restarted, passes its tests. Installed the needed products on minos-sam03 Created private/minos-sam01_server_list.txt.20080219 sam v8_1_6 is current at CDF sam v8_4_0 is current at D0 sam v8_2_2 is current in upd ... previously ... sam_db_srv_pkg v8_3_0 ( was sam_db_srv v7_6_1 ) sam_bootstrap v8_1_0 ( was v6_1_2, required for use of sam_db_srv_pkg ) sam_config v7_1_5 ( was v4_2_34 ) sam v8_2_0 ( was v7_6_5, on clients ) ============================================================================= 2008 02 18 ####### # AFS # ####### for NODE in ${NODES} ; do ssh -ax ${NODE} 'grep afs: /var/log/messages | grep -v Tokens | uniq'; done \ | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' done Feb 17 05:20:01 minos05 kernel: afs: Lost contact with file server 131.225.68.6 Feb 17 05:20:32 minos05 kernel: afs: file server 131.225.68.6 is back up Dec 12 17:08:58 minos08 kernel: afs: Lost contact with file server 131.225.68.7 Dec 12 17:09:09 minos08 kernel: afs: file server 131.225.68.7 is back up Feb 17 05:23:27 minos09 kernel: afs: Lost contact with file server 131.225.68.6 Feb 17 05:23:28 minos09 kernel: afs: Lost contact with file server 131.225.68.6 Feb 17 05:24:45 minos09 kernel: afs: file server 131.225.68.6 is back up ########## # CONDOR # ########## Date: Mon, 18 Feb 2008 12:04:59 -0600 From: Laura Loiacono As you know I am running on the new batch nodes. I had been running alot of jobs since yesterday and it appears that there is a problem with the connection between the node the job is running on and the node it was submitted from. This causes some of the jobs to take twice as long as the should. See log file messages below... 000 (35477.000.000) 02/18 05:43:23 Job submitted from host: <131.225.193.25:63984> ... 001 (35477.000.000) 02/18 06:12:03 Job executing on host: <131.225.166.118:62340> ... 
006 (35477.000.000) 02/18 06:12:11 Image size of job updated: 160400 ... 022 (35477.000.000) 02/18 07:22:58 Job disconnected, attempting to reconnect ??? Socket between submit and execute hosts closed unexpectedly ??? Trying to reconnect to vm2@18764@fnpc339.fnal.gov <131.225.166.118:62340> ... 024 (35477.000.000) 02/18 07:22:58 Job reconnection failed ??? Job disconnected too long: JobLeaseDuration (3600 seconds) expired ??? Can not reconnect to vm2@18764@fnpc339.fnal.gov, rescheduling job ... 001 (35477.000.000) 02/18 07:25:06 Job executing on host: <131.225.166.131:65494> ... 022 (35477.000.000) 02/18 08:36:56 Job disconnected, attempting to reconnect ??? Socket between submit and execute hosts closed unexpectedly ??? Trying to reconnect to vm2@8274@fnpc344.fnal.gov <131.225.166.131:65494> ... 024 (35477.000.000) 02/18 08:36:56 Job reconnection failed ??? Job disconnected too long: JobLeaseDuration (3600 seconds) expired ??? Can not reconnect to vm2@8274@fnpc344.fnal.gov, rescheduling job ... Checking that all nodes are involved. $ cd /minos/scratch/loiacono/condor $ grep 'Trying to reconnect' log.* | cut -f 3 -d @ | cut -f 1 -d . | sort -u fnpc339 fnpc340 fnpc341 fnpc342 fnpc343 fnpc344 fnpc345 fnpc346 ============================================================================= 2008 02 15 ######## # GRID # ######## Date: Fri, 15 Feb 2008 21:30:25 -0600 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron /usr/krb5/bin/kcron ${HOME}/minos/scripts/factproxy verify_recvd_packet: generated hash did not compare /usr/krb5/bin/kxlist: Matching credential not found while finding the credentials containing the private key and certificate in the credentials cache ####### # SAM # ####### sam_cpp_api with sam locate support , bug corrected. upd install -j sam_cpp_api v8_4_0_1 -q GCC-3.4.3 ############ # MCIMPORT # ############ corrected md5sum , needed to cd ${MCSTA} to find the recently moved FILE ln -sf mcimport.20080211 mcimport # was mcimport.20071120 ########## # CONDOR # ########## Updated glidein proxy, see 2008 02 13 entry ####### # WEB # ####### In dhmain.html, changed CDSystemStatus to cdsystemstatus This was not needed earlier this week. ######## # FARM # ######## The review below is a usefull warmup for the broader cleanup of *cat areas . Drivers many duplicates being reported ( based on READ SAM/READ files ) are these legit ? too many READ SAM/READ files need to purge these, base dup calculations on SAM stale READ SAM/READ files some may be hanging around after the concatenated files wer ######## # FARM # ######## Date: Thu, 14 Feb 2008 15:39:09 -0600 From: Howard Rubin There are a few files, all from F00040092_0005 2007-12 cedar in /minos/data/minfarm/fardet. These may have been from a repeated run. There are candidates on pnfs (cand and bcnd) but I can't track the ntuples all.sntp, spill.sntp, and spill.bntp because whatever might have been there has been cleared from farcat. Can you see if they've been concatenated? The same is true for N00012048_0017 2007-04 cedar_phy_bhcurv and, from 2007-12 cedar, N0001390_0009, 10, 11, 12, and 13. Again, all candidates have correspondences on pnfs, and I can't track the cosmic or spill ntuples. 
MINOS26 > dds /minos/data/minfarm/neardet total 1996088 drwxrwxr-x 2 rubin e875 6144 Jan 29 10:22 ./ drwxrwxr-x 31 10871 e875 4096 Feb 14 13:10 ../ -rw-rw-r-- 1 rubin e875 79565829 Nov 21 12:35 N00012048_0017.spill.cand.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 8601534 Nov 21 12:35 N00012048_0017.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 15696718 Nov 21 12:35 N00012048_0017.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 111781571 Dec 22 07:55 N00013190_0009.cosmic.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 29508780 Dec 22 07:55 N00013190_0009.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 243623537 Dec 22 07:57 N00013190_0009.spill.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 48111340 Dec 22 07:57 N00013190_0009.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 111378136 Dec 22 07:48 N00013190_0010.cosmic.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 29327780 Dec 22 07:48 N00013190_0010.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 224232407 Dec 22 07:49 N00013190_0010.spill.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 44408037 Dec 22 07:49 N00013190_0010.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 110519969 Dec 22 07:50 N00013190_0011.cosmic.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 29113746 Dec 22 07:50 N00013190_0011.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 249715245 Dec 22 07:51 N00013190_0011.spill.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 49711909 Dec 22 07:51 N00013190_0011.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 112480567 Dec 22 07:43 N00013190_0012.cosmic.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 29439690 Dec 22 07:44 N00013190_0012.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 220021097 Dec 22 07:46 N00013190_0012.spill.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 43400685 Dec 22 07:46 N00013190_0012.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 111656926 Dec 22 07:51 N00013190_0013.cosmic.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 29300465 Dec 22 07:51 N00013190_0013.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 112343659 Jan 3 14:54 N00013190_0017.cosmic.cand.cedar.0.root MINOS26 > dds /minos/data/minfarm/fardet total 463468 drwxrwxr-x 2 rubin e875 73728 Feb 14 16:55 ./ drwxrwxr-x 31 10871 e875 4096 Feb 14 13:10 ../ -rw-rw-r-- 1 rubin e875 138049103 Dec 10 19:51 F00037144_0012.all.cand.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 25245746 Dec 10 19:51 F00037144_0012.all.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 31015742 Dec 10 19:51 F00037144_0012.spill.bcnd.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 4678962 Dec 10 19:51 F00037144_0012.spill.bntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 20469882 Dec 10 19:51 F00037144_0012.spill.cand.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 2934022 Dec 10 19:51 F00037144_0012.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 3156006 Dec 10 19:51 F00037144_0012.spill.sntp.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin e875 132515788 Dec 22 06:28 F00040092_0005.all.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 23832557 Dec 22 06:28 F00040092_0005.all.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 48796440 Dec 22 06:32 F00040092_0005.spill.bcnd.cedar.0.root -rw-rw-r-- 1 rubin e875 7709121 Dec 22 06:32 F00040092_0005.spill.bntp.cedar.0.root -rw-rw-r-- 1 rubin e875 31031046 Dec 22 06:30 F00040092_0005.spill.cand.cedar.0.root -rw-rw-r-- 1 rubin e875 5039787 Dec 22 06:30 F00040092_0005.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 504 Feb 12 11:11 c_list -rw-rw-r-- 1 rubin e875 1920 Feb 11 19:57 copy_list for FILE in `ls /minos/data/minfarm/neardet | grep cand` ; do sam 
locate ${FILE} ; done ['/pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2007-04,391@vo9747'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,86@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,84@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,81@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,85@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,87@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,88@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,78@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,80@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,83@vob319'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-12,91@vob319'] for FILE in `ls /minos/data/minfarm/fardet | grep cand` ; do sam locate ${FILE} ; done ['/pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2006-12,1465@vob825'] ['/pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2006-12,1471@vob825'] ['/pnfs/minos/reco_far/cedar/cand_data/2007-12,712@vo5628'] ['/pnfs/minos/reco_far/cedar/cand_data/2007-12,748@vo5628'] Now testing non-cand files which may be concatenated . Try one manually. FILE=N00012048_0017.spill.mrnt.cedar_phy_bhcurv.0.root FRU=`echo ${FILE} | cut -f 1 -d _` FTA=`echo ${FILE} | cut -f 2- -d .` SUB=`echo ${FILE} | cut -f 2 -d _ | cut -f 1 -d .` SAMDIM="FILE_NAME ${FRU}_%.${FTA} and PARENT_BY_NAME ${FRU}_${SUB}.mdaq.root" for FILE in `ls /minos/data/minfarm/neardet | grep -v cand` ; do FRU=`echo ${FILE} | cut -f 1 -d _` FTA=`echo ${FILE} | cut -f 2- -d .` SUB=`echo ${FILE} | cut -f 2 -d _ | cut -f 1 -d .` SAMDIM="FILE_NAME ${FRU}_%.${FTA} and PARENT_BY_NAME ${FRU}_${SUB}.mdaq.root" printf "\n${FILE}\n" sam list files --nosummary --dim="${SAMDIM}" done N00012048_0017.spill.mrnt.cedar_phy_bhcurv.0.root N00012048_0000.spill.mrnt.cedar_phy_bhcurv.0.root N00012048_0017.spill.sntp.cedar_phy_bhcurv.0.root N00012048_0000.spill.sntp.cedar_phy_bhcurv.0.root N00013190_0009.cosmic.sntp.cedar.0.root N00013190_0000.cosmic.sntp.cedar.0.root N00013190_0009.spill.sntp.cedar.0.root N00013190_0000.spill.sntp.cedar.0.root N00013190_0010.cosmic.sntp.cedar.0.root N00013190_0000.cosmic.sntp.cedar.0.root N00013190_0010.spill.sntp.cedar.0.root N00013190_0000.spill.sntp.cedar.0.root N00013190_0011.cosmic.sntp.cedar.0.root N00013190_0000.cosmic.sntp.cedar.0.root N00013190_0011.spill.sntp.cedar.0.root N00013190_0000.spill.sntp.cedar.0.root N00013190_0012.cosmic.sntp.cedar.0.root N00013190_0000.cosmic.sntp.cedar.0.root N00013190_0012.spill.sntp.cedar.0.root N00013190_0000.spill.sntp.cedar.0.root N00013190_0013.cosmic.sntp.cedar.0.root N00013190_0000.cosmic.sntp.cedar.0.root MINOS26 > for FILE in `ls /minos/data/minfarm/fardet | grep -v cand` ; do F00037144_0012.all.sntp.cedar_phy_bhcurv.0.root F00037144_0000.all.sntp.cedar_phy_bhcurv.0.root F00037144_0012.spill.bcnd.cedar_phy_bhcurv.0.root F00037144_0012.spill.bcnd.cedar_phy_bhcurv.0.root F00037144_0012.spill.bntp.cedar_phy_bhcurv.0.root F00037144_0000.spill.bntp.cedar_phy_bhcurv.0.root F00037144_0012.spill.mrnt.cedar_phy_bhcurv.0.root F00037144_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00037144_0012.spill.sntp.cedar_phy_bhcurv.0.root F00037144_0000.spill.sntp.cedar_phy_bhcurv.0.root F00040092_0005.all.sntp.cedar.0.root F00040092_0000.all.sntp.cedar.0.root F00040092_0005.spill.bcnd.cedar.0.root F00040092_0005.spill.bcnd.cedar.0.root F00040092_0005.spill.bntp.cedar.0.root F00040092_0000.spill.bntp.cedar.0.root F00040092_0005.spill.sntp.cedar.0.root F00040092_0000.spill.sntp.cedar.0.root c_list copy_list 
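The same parentage query can be wrapped to flag only those leftovers with no concatenated counterpart in SAM; a sketch along those lines (my own, not run here):

for FILE in `ls /minos/data/minfarm/fardet | grep -v cand | grep root` ; do
  FRU=`echo ${FILE} | cut -f 1 -d _`
  FTA=`echo ${FILE} | cut -f 2- -d .`
  SUB=`echo ${FILE} | cut -f 2 -d _ | cut -f 1 -d .`
  SAMDIM="FILE_NAME ${FRU}_%.${FTA} and PARENT_BY_NAME ${FRU}_${SUB}.mdaq.root"
  NCAT=`sam list files --nosummary --dim="${SAMDIM}" | wc -l`
  if [ ${NCAT} -eq 0 ] ; then printf "NOT CONCATENATED ${FILE}\n" ; fi
done
# note - streams that are declared without concatenation ( the bcnd files above ) match themselves,
# so they show up as found here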
============================================================================= 2008 02 14 ################ # IMAGEMAGICK # ################ MINOS11 > rpm -qf /usr/bin/convert ImageMagick-5.5.6-24 Per pawloski/tjyun, requested ImageMagick on minos cluster. Requested installation Date: Thu, 14 Feb 2008 15:30:44 -0600 (CST) Subject: HelpDesk ticket 111178 ___________________________________________ Short Description: ImageMagick on Minos Cluster Problem Description: run2-sys : Please install ImageMagick on the Minos Cluster. minos01 through minos26 This is available via yum. ImageMagick seems to have been dropped back when we upgraded to SLF 4. It is installed on minos11, which stayed at SLF 3. This can be done at your convenience. ___________________________________________ Date: Thu, 14 Feb 2008 15:42:50 -0600 (CST) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 14 Feb 2008 17:30:41 -0600 (CST) ___________________________________________________________________ Solution: ling@fnal.gov sent this solution: ImageMagik installed on these nodes. cc'd pawloski,tjyang,minos-admin ####### # SAM # ####### Need to pick up missing cedar_phy files declarations. There is no SLOG directory for cedar_phy, perhaps we never cleaned up. cut and source shrc/kreymer cd /export/stage/minfarm/ROUNDUP PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9010 SOCFILE=/export/stage/minfarm/.grid/samdbs_prd export SAM_ORACLE_CONNECT=`cat ${SOCFILE}` DET=near RELEASE=cedar_phy SLOG=${HOME}/ROUNTMP/LOG/saddreco/${RELEASE}/${DET}.log mkdir -p ${HOME}/ROUNTMP/LOG/saddreco/${RELEASE} MONTH=2007-02 SRV1> ./saddreco ${DET} ${RELEASE} ${MONTH} verify | grep -v verified DETECT near RELEASE cedar_phy MONTH 2007-02 BAIL 999999 STARTED Thu Feb 14 17:49:00 2008 saddreco 2007117 Declaring to SAM prd near cedar_phy 2007-02 verify Needed /pnfs/minos/reco_near/cedar_phy/cand_data/2007-02 Treating 667 files in /pnfs/minos/reco_near/cedar_phy/cand_data/2007-02 Needed 234 files, Rate was 2.584 Needed /pnfs/minos/reco_near/cedar_phy/mrnt_data/2007-02 Treating 47 files in /pnfs/minos/reco_near/cedar_phy/mrnt_data/2007-02 obsolete N00011798_0000.spill.mrnt.cedar_phy.0.root Needed 11 files, Rate was 1.066 Needed /pnfs/minos/reco_near/cedar_phy/sntp_data/2007-02 Treating 50 files in /pnfs/minos/reco_near/cedar_phy/sntp_data/2007-02 obsolete N00011798_0000.spill.sntp.cedar_phy.0.root Needed 23 files, Rate was 1.220 STARTED Thu Feb 14 17:49:00 2008 FINISHED Thu Feb 14 17:51:00 2008 ./saddreco -d ${DET} -r ${RELEASE} -p ${MONTH} --declare 2>&1 \ | tee -a ${SLOG} STARTED Thu Feb 14 18:05:21 2008 FINISHED Thu Feb 14 18:07:42 2008 for MONTH in `(cd /pnfs/minos/reco_${DET}/${RELEASE}/sntp_data ; ls -d 20??-??)` ; do ./saddreco -d ${DET} -r ${RELEASE} -p ${MONTH} --declare 2>&1 \ | tee -a ${SLOG} done STARTED Thu Feb 14 19:56:24 2008 FINISHED Thu Feb 14 21:39:40 2008 Picked up files in MONTH 2005-03 MONTH 2005-04 MONTH 2005-09 ( obsolete only ) MONTH 2006-06 MONTH 2006-12 MONTH 2007-01 MONTH 2007-02 MONTH 2007-03 MONTH 2007-04 DET=far SLOG=${HOME}/ROUNTMP/LOG/saddreco/${RELEASE}/${DET}.log Verify scan : for MONTH in `(cd /pnfs/minos/reco_${DET}/${RELEASE}/sntp_data ; ls -d 20??-??)` ; do ./saddreco -d ${DET} -r ${RELEASE} -p ${MONTH} --verify done 2005-03 cand 2006-04 cand Declare the few missing files for MONTH in 2005-03 2006-04 ; do ./saddreco -d 
${DET} -r ${RELEASE} -p ${MONTH} --declare 2>&1 \ | tee -a ${SLOG} done ######## # DATA # ######## Date: Thu, 14 Feb 2008 11:25:18 -0600 (CST) Subject: HelpDesk ticket 111148 ___________________________________________ Short Description: minos group quota adjustment in BlueArc Problem Description: LSC/CSI : We have been too successful using our /minos/data disks ! We seem to have used the entire 18 TBytes of group quota. MINOS26 > quota -s -v -g e875 .. minos-nas-0.fnal.gov:/minos/scratch 2165G 0 10240G 459k 0 0 minos-nas-0.fnal.gov:/minos/data 16384G* 0 16384G 554k 0 0 To give us some breathing room, while we move some files off to tape, please shift 5 TBytes of e875 group quota from /minos/scratch to /minos/data. ___________________________________________ Date: Thu, 14 Feb 2008 12:22:16 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. ___________________________________________ Date: Thu, 14 Feb 2008 13:46:26 -0600 (CST) This ticket has been reassigned to ROMERO, ANDY of the CD-LSCS/CSI/CS/WST Group. ___________________________________________ Converstation with Andy, the volume quota is now 25 TB. There was no need to reduce scratch quota. Still a mystery why quota is reporting 16384G* That is a suspicious number 2^14 . ########## # DC2NFS # ########## AFSS/dc2nfs -n -d reco_near/cedar_phy/sntp_data/2007-02 $ du -sm /minos/data/reco_near/* 17 /minos/data/reco_near/R1_18 270547 /minos/data/reco_near/R1_18_2 8813 /minos/data/reco_near/R1_18_3 142294 /minos/data/reco_near/R1_18_4 22221 /minos/data/reco_near/R1_21 10001 /minos/data/reco_near/R1_23 9336 /minos/data/reco_near/R1_23a 9958 /minos/data/reco_near/R1_24 10089 /minos/data/reco_near/R1_24a 11734 /minos/data/reco_near/R1_24b 41283 /minos/data/reco_near/R1_24c 24122 /minos/data/reco_near/R1_24cal 1 /minos/data/reco_near/S06-05-25-R1-22 10002 /minos/data/reco_near/S06-06-22-R1-22 895381 /minos/data/reco_near/cedar 221619 /minos/data/reco_near/cedar_phy 341345 /minos/data/reco_near/cedar_phy_bhcurv $ du -sm /minos/data/reco_near 2028757 /minos/data/reco_near $ du -sm /minos/data/reco_far 1343793 /minos/data/reco_far MINOS26 > du -sm /minos/data/* 1273917 /minos/data/analysis 1706 /minos/data/asousa 259008 /minos/data/beam_data 2 /minos/data/d10 87905 /minos/data/flux 0 /minos/data/foo 3437 /minos/data/log_data 1 /minos/data/maint 9099614 /minos/data/mcimport 2362905 /minos/data/mcout_data 1 /minos/data/mindata 406473 /minos/data/minfarm 405888 /minos/data/mysql 1343793 /minos/data/reco_far 2028757 /minos/data/reco_near 87995 /minos/data/users for DIR in `ls /pnfs/minos/reco_near/cedar_phy/sntp_data` ; do echo $DIR ; ./stage -d -p 0 reco_near/cedar_phy/sntp_data/${DIR} ; done 992G /minos/data/reco_near/cedar_phy/sntp_data STARTED Thu Feb 14 14:33:00 CST 2008 FINISHED Fri Feb 15 11:07:15 CST 2008 cat > /tmp/dc2nfs.2008 ########### # BLUEARC # ########### 05:00 to 08:00 scheduled BlueArc firmware upgrade. mindata@minos26 echo "crontab -r" | at 01:00 Feb 14 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 01:00 Feb 14 Restarted corral and mcimport around 15:50 N.B. subscribed to site-nas-announce, Date: Thu, 14 Feb 2008 07:20:03 -0600 Reply-To: Ray Pasetes Sender: Site-Wide NAS Service Announcements and Status From: Ray Pasetes Subject: NAS Servers have been upgraded Comments: To: site-nas-announce@fnal.gov Content-type: text/plain; format=flowed; charset=ISO-8859-15 The Central NAS Servers have been upgraded and service has been resumed. 
If you notice any problems, please open a helpdesk ticket -- ============================================================================= 2008 02 13 ####### # SAM # ####### For dogwood, per HOWTO.saddreco, sam get registered application families | grep loon export SAM_ORACLE_CONNECT="samdbs/" for DWR in dogwoodtest0 dogwood0 dogwood1 ; do for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon \ --appVersion=${DWR} done done ########## # CONDOR ########## condor queue time limit http://www.uscms.org/SoftwareComputing/UserComputing/BatchSystem.html look at queue from any node : condor_q -submitter gfactory see a summary condor_status -submitter see the queues known to minos25 condor_q -name minos25 http://www.astro.northwestern.edu/AstCCwiki/index.php?title=Typhoon(Cluster)_Documentation condor_q -goodputs ########## # CONDOR # ########## Created newer proxy for gfactory, SRV1> cd /export/stage/minfarm/.grid DAYS=10 (( HOURS = DAYS * 24 )) DAPR=`date -d "today + ${DAYS}days" +%Y%m%d` voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife ${HOURS}:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy.${DAPR} \ -valid ${HOURS}:0 Your proxy is valid until Sun Feb 17 16:59:24 2008 voms-proxy-info -file kreymer-condor.proxy.${DAPR} [gfactory@minos25 ~]$ cd .grid/ DAPR=20080223 scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy.${DAPR} . DAPR=20080304 scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy.${DAPR} . cp -a kreymer-condor.proxy.${DAPR} kreymer-condor.proxy [gfactory@minos25 .grid]$ date Fri Feb 15 13:38:13 CST 2008 ######### # FNALU # ######### fsus02 - down for hardware repairs, holds LSF processing. Date: Wed, 13 Feb 2008 09:06:54 -0600 (CST) the Sun tech finished the hardware task on fsun02, reprogramming the nvram, and fsun02 is back. The lsf batch nodes were reopened, and minos jobs are running on them. margaret Margaret Greaney Telephone: 630-840-4623 Fermilab E-mail: mgreaney@fnal.gov ######### # MYSQL # ######### products > . etc/setups.sh export PRODUCTS=/local/ups/db setup upd upd install -j mysql v5_0_51 ============================================================================= 2008 02 12 ######### # MYSQL # ######### Making space on minos-sam03 for mysql tests. 
kreymer > cd /home/kreymer/MYSQL MINOS-SAM03 > du -sm * 9352 20060421 14021 20070207 14186 20070305 14790 20070403 16492 20070705 16939 20070815 17519 20070919 17542 20071002 17579 20071107 62437 BINLOG 8956 /minos/data/mysql/archive/20060418 9343 /minos/data/mysql/archive/20060421 17753 /minos/data/mysql/archive/20071218 17885 /minos/data/mysql/archive/20080116 18028 /minos/data/mysql/archive/20080204 Removed the binlogs, these live in /minos/data/mysql/BINLOG now rm -r BINLOG Checked the files already copied to /m/d/m/a/20060421 MMON=20060421 MDMA=/minos/data/mysql/archive diff -r ${MMON} ${MDMA}/${MMON} rm -r ${MMON} for MMON in 20070207 20070305 20070403 20070705 20070815 20070919 20071002 20071107 ; do printf "${MMON} " ; date ; time cp -ax ${MMON} ${MDMA}/${MMON} ; done 20070207 Tue Feb 12 19:14:12 CST 2008 real 6m29.750s user 0m0.306s sys 0m42.422s 20070305 Tue Feb 12 19:20:41 CST 2008 real 6m34.734s user 0m0.291s sys 0m43.041s 20070403 Tue Feb 12 19:27:16 CST 2008 real 7m21.917s user 0m0.284s sys 0m44.296s 20070705 Tue Feb 12 19:34:38 CST 2008 real 7m35.256s user 0m0.329s sys 0m48.148s 20070815 Tue Feb 12 19:42:14 CST 2008 real 8m57.493s user 0m0.348s sys 0m50.528s 20070919 Tue Feb 12 19:51:11 CST 2008 real 8m30.084s user 0m0.386s sys 0m54.253s 20071002 Tue Feb 12 19:59:41 CST 2008 real 8m49.770s user 0m0.369s sys 0m53.047s 20071107 Tue Feb 12 20:08:31 CST 2008 real 11m26.015s user 0m0.365s sys 0m55.130s for MMON in 20070207 20070305 20070403 20070705 20070815 20070919 20071002 20071107 ; do printf "${MMON} " ; date ; time diff -r ${MMON} ${MDMA}/${MMON} ; done 20070207 Tue Feb 12 22:44:35 CST 2008 real 7m15.662s user 0m24.176s sys 0m30.761s 20070305 Tue Feb 12 22:51:51 CST 2008 real 7m33.303s user 0m23.318s sys 0m30.847s 20070403 Tue Feb 12 22:59:24 CST 2008 real 10m12.483s user 0m25.223s sys 0m33.597s 20070705 Tue Feb 12 23:09:37 CST 2008 real 8m32.953s user 0m28.791s sys 0m37.324s 20070815 Tue Feb 12 23:18:10 CST 2008 real 9m33.451s user 0m29.187s sys 0m38.286s 20070919 Tue Feb 12 23:27:44 CST 2008 real 11m4.116s user 0m30.180s sys 0m39.555s 20071002 Tue Feb 12 23:38:48 CST 2008 real 8m56.781s user 0m30.327s sys 0m40.164s 20071107 Tue Feb 12 23:47:45 CST 2008 real 10m14.862s user 0m30.470s sys 0m40.086s rm -r 20070207 20070305 for MMON in 20070403 20070705 20070815 20070919 20071002 20071107 ; do printf "${MMON} " ; date ; rm -r ${MMON} ; done 20070403 Wed Feb 13 08:56:08 CST 2008 20070705 Wed Feb 13 08:56:32 CST 2008 20070815 Wed Feb 13 08:56:59 CST 2008 20070919 Wed Feb 13 08:57:25 CST 2008 20071002 Wed Feb 13 08:58:01 CST 2008 20071107 Wed Feb 13 08:58:24 CST 2008 df -h . Filesystem Size Used Avail Use% Mounted on /dev/hdb2 225G 8.2G 205G 4% /home ######### # MYSQL # ######### MINOS26 > upd install -j mysql v5_0_51 informational: installed mysql v5_0_51. upd install succeeded. 
MINOS26 > setup mysql v5_0_51 cat: /afs/fnal.gov/files/code/e875/general/ups/db/mysql/config/minos26.fnal.gov.: No such file or directory cat: /mysql.socket: No such file or directory cat: /mysql.port: No such file or directory cat: /mysql.user: No such file or directory Setup:mysql datadir = Setup:port=; socket= MINOS26 > type mysql mysql is /afs/fnal.gov/files/code/e875/general/ups/prd/mysql/v5_0_51/Linux-2-6/bin/mysql MINOS26 > mysql -V mysql Ver 14.12 Distrib 5.0.51a, for redhat-linux-gnu (i686) using EditLine wrapper ######### # FNALU # ######### bsub -R "hname!=flxi06" hostname bsub -R "hname!=flxi06 & hname!=flxb10 & hname!=flxb11 & hname!=flxb35" hostname ########### # BLUEARC # ########### Date: Tue, 12 Feb 2008 13:53:33 -0600 (CST) Subject: HelpDesk ticket 111030 ___________________________________________ Short Description: Quota request for BlueArc served /minos/scratch, for loiacono Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user loiacono on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. ___________________________________________ Date: Tue, 12 Feb 2008 14:41:03 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. ####### # AFS # ####### My previous scans for AFS timeouts were flawed, searching for 'Jan '. There continue to be AFS timeouts this month. for NODE in ${NODES} ; do ssh -ax ${NODE} 'grep afs: /var/log/messages | grep -v Tokens | uniq'; done \ | cut -f 1 -d '(' | sed 's/ in cell fnal.gov//g' done Feb 1 13:55:48 minos04 kernel: afs: Lost contact with file server 131.225.68.19 Feb 1 13:57:02 minos04 kernel: afs: file server 131.225.68.19 is back up Feb 1 07:39:00 minos10 kernel: afs: Lost contact with file server 131.225.68.65 Feb 1 07:39:05 minos10 kernel: afs: Lost contact with file server 131.225.68.7 Feb 1 07:40:26 minos10 kernel: afs: file server 131.225.68.65 is back up Feb 1 07:40:26 minos10 kernel: afs: file server 131.225.68.7 is back up Feb 4 17:56:06 minos04 kernel: afs: Lost contact with file server 131.225.68.17 Feb 4 17:56:23 minos04 kernel: afs: file server 131.225.68.17 is back up Feb 5 05:38:05 minos18 kernel: afs: Lost contact with file server 131.225.68.49 Feb 5 05:39:30 minos18 kernel: afs: file server 131.225.68.49 is back up Feb 6 06:08:50 minos04 kernel: afs: Lost contact with file server 131.225.68.49 Feb 6 06:10:35 minos04 kernel: afs: file server 131.225.68.49 is back up Feb 7 10:42:02 minos19 kernel: libafs: module license 'http://www.openafs.org/dl/license10.html' taints kernel. Feb 7 10:42:02 minos19 kernel: libafs: no version for "sys_close" found: kernel tainted. 
Feb 7 10:42:02 minos19 kernel: libafs: Ignoring new-style parameters in presence of obsolete ones Feb 7 11:58:57 minos19 kernel: afs: Lost contact with file server 192.168.67.1 Feb 7 15:51:35 minos05 kernel: afs: Lost contact with file server 131.225.68.17 Feb 7 15:52:01 minos05 kernel: afs: file server 131.225.68.17 is back up Feb 8 11:47:59 minos05 kernel: afs: Lost contact with file server 131.225.68.7 Feb 8 11:48:39 minos05 kernel: afs: file server 131.225.68.7 is back up Feb 9 09:09:23 minos06 kernel: afs: Lost contact with file server 131.225.68.7 Feb 9 09:12:32 minos06 kernel: afs: file server 131.225.68.7 is back up Feb 10 16:50:08 minos16 kernel: afs: Lost contact with file server 131.225.68.17 Feb 10 16:51:58 minos16 kernel: afs: file server 131.225.68.17 is back up Feb 11 11:15:35 minos11 kernel: afs: Lost contact with file server 192.168.67.1 Feb 11 12:55:03 minos05 kernel: afs: Lost contact with file server 131.225.68.49 Feb 11 12:55:21 minos05 kernel: afs: Lost contact with file server 131.225.68.7 Feb 11 12:55:41 minos05 kernel: afs: file server 131.225.68.49 is back up Feb 11 12:55:41 minos05 kernel: afs: file server 131.225.68.7 is back up Feb 12 06:07:06 minos23 kernel: afs: Lost contact with file server 131.225.68.17 Feb 12 06:09:49 minos23 kernel: afs: file server 131.225.68.17 is back up Times : 14 76 17 85 105 26 40 209 110 20 160 14 17 20 26 40 76 85 105 110 160 209 Servers 131.225.68.7 131.225.68.17 131.225.68.49 131.225.68.65 ########## # DCACHE # ########## Write queue jumped sharply to 20000 yesterday afternoon/evening. And again from 17K to 25K around 02:00 MINOS26 > ./dcache/datasets w Run Tue Feb 12 08:58:40 CST 2008 Data from 12-Feb-2008 06:33 Pool group writePools FILES GBYTES FAMILY 31846 551 cdms.cdms That's about 12 MBytes/file. Date: Mon, 11 Feb 2008 21:11:46 -0600 (CST) Ticket #: 110975 ___________________________________________ Short Description: stken overload Problem Description: Since about 3:30 today a user has dumped over 16000 files into dcache. According to its agreement with MSS, MINOS throttled back its running when the write pools exceeded 2500 files. We will therefore be down until the write pools clear. Please communicate with the user to avoid this sort of situation in the future. Howie Rubin ___________________________________________ <-- # @@@ Enter Update below this line. @@@ # --> This is apparently due to addition of over 30,000 files by CDMS . These file average a little over 10 MBytes in size. The peak queued stores has gone as high as 24,000. http://fndca.fnal.gov/dcache/queue/allpools.jpg Note that in the past, this level of backlog has led to global DCache system failures. Here's a summary of the pool content : ... If nothing is done, Minos data handling will remain shut down for about a week. These CDMS files need to be removed. As before, we'll be glad to provide advice and planning support, and even share our scripts. <-- # @@@ Enter Update above this line. @@@ # --> I looked at one of the CDMS tapes, http://www-stken.fnal.gov/enstore/tape_inventory/VOC132 About half the files are 20 byte files , like /pnfs/fs/usr/cdms/Raw_Soudan_data_sync/180108_0736/180108_0736_F0309.gz.status Fermi National Accelerator Laboratory D.A. Bauer, M.B. Crisler, D. Holmgren, E. Ramberg, J. 
Yoo 2745 Queues : 15:00 - 20136 15:20 - 18572 rapid clearing, lately 16:00 - 16026 23:00 - 14857 and back to a slow clearing, about 200/hour 08:40 - 11844 roughly 3000/10 hours Berg reports that the Tsunami is passing quickly, will not recur, via email from djholm. They are unrepentant about the 10's of thousands of 20 byte files. I find this utterly irresponsible and unacceptable. RESOLVED Wed 02/13 CDMS files were removed by DCache developers A backlog of about 2K cand files has cleared to tape. ============================================================================= 2008 02 11 ########### # MINOS11 # ########### Date: Mon, 11 Feb 2008 09:39:56 -0600 (CST) Subject: HelpDesk ticket 110904 Short Description: minos11 is down Problem Description: run2-sys : Node minos11 ( our only remsining SLF 3 system ) seems to have gone off the network at about 10:00 Sunday 10 Feb, according to the Ganglia plots. I get no response to ping. Please investigate. Thanks ! ___________________________________________ Date: Mon, 11 Feb 2008 09:48:58 -0600 (CST) This ticket has been reassigned to HO, LING of the CD-SF/FEF Group. ___________________________________________________________________ Date: Mon, 11 Feb 2008 10:10:02 -0600 (CST) Solution: ling@fnal.gov sent this solution: System rebooted. ___________________________________________________________________ Feb 10 04:02:03 minos11 syslogd 1.4.1: restart. Feb 10 05:19:56 minos11 sshd(pam_unix)[17841]: session opened for user rhatcher by (uid=0) Feb 10 05:19:57 minos11 sshd(pam_unix)[17970]: session opened for user rhatcher by (uid=0) Feb 10 10:40:53 minos11 sshd(pam_unix)[23565]: session opened for user rhatcher by rhatcher(uid=0) Feb 11 11:02:36 minos11 syslogd 1.4.1: restart. ============================================================================= 2008 02 08 ########## # PARROT # ########## mindata : cd /grid/app/minos/parrot curl http://www.cse.nd.edu/~ccl/software/files/cctools-current-i686-linux-2.6.tar.gz \ -o cctools-current-i686-linux-2.6.tar.gz tar xzvf cctools-current-i686-linux-2.6.tar.gz Now testing this, export PARROT_DIR=/grid/app/minos/parrot/cctools-current-i686-linux-2.6 export PATH=${PARROT_DIR}/bin:${PATH} parrot -m ${PARROT_DIR}/mountfile2.grow bash PS1='P> ' P> . /usr/local/etc/setups.sh P> export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup P> setup_minos() { . $MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } P> setup dcap -q unsecured P> type dcap bash: type: dcap: not found P> which dccp /fnal/ups/prd/dcap/v2_38_f0512/Linux-2-4/bin/dccp P> type dccp dccp is /fnal/ups/prd/dcap/v2_38_f0512/Linux-2-4/bin/dccp P> DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F P> DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root P> dccp ${DFILE} /var/tmp/TEST.root 41379 bytes in 0 seconds P> hostname fngp-osg.fnal.gov P> setup_minos -r R1.24.2 No default SAM configuration exists at this time. MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory P> type loon loon is /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.24.2/bin/Linux2.6-GCC_3_4/loon cd /minos/scratch/kreymer/condor/loonb P> loon -bq firstlast.C ${DFILE} Warning in : class timespec already in TClassTable loon [0] Processing firstlast.C... Spin(1 in 1 out 0 filt.) 
1) +RawRecCounts::Ana n=1 ( 1/ 0) t=( 0.01/ 0.00) RawRecCounts Report: F00031300_0000.mdaq.root root version: v04-02-00 VldContexts: First: { Far| Data|2005-04-25 16:36:07.621514000Z} First Snarl: {Unknow|Unknow|1970-01-01 00:00:00.000000000Z} # -1 Last: { Far| Data|2005-04-25 16:36:07.621514000Z} Last Snarl: {Unknow|Unknow|1970-01-01 00:00:00.000000000Z} # -1 in 11 records of 0 record sets RawConfigFilesBlock 7 RawDaqHeaderBlock 11 RawRunCommentBlock 1 RawRunConfigBlock 1 RawRunEndBlock 1 RawRunStartBlock 1 RawRecCounts done ============================================================================= 2008 02 07 ####### # SAM # ####### sam_cpp_api with sam locate support ! upd install -j sam_cpp_api v8_4_0 -q GCC-3.4.3 ########## # PARROT # ########## Testing the d141/d199 clones of products/minossoft cd /afs/fnal.gov/files/expwww/numi/html/computing ln -s /afs/fnal.gov/files/data/minos/d141 d141 ln -s /afs/fnal.gov/files/data/minos/d199 d199 MINOS26 > fs listquota /afs/fnal.gov/files/data/minos/d141 Volume Name Quota Used %Used Partition nb.minos.d141 50000000 27714790 55% 56% MINOS26 > fs listquota /afs/fnal.gov/files/data/minos/d199 Volume Name Quota Used %Used Partition nb.minos.d199 50000000 28146482 56% 77% MINOS26 > find /afs/fnal.gov/files/data/minos/d141 -type f | wc -l 963571 time make_growfs -k /afs/fnal.gov/files/data/minos/d141 real 9m52.730s user 0m48.330s sys 3m14.385s du -sk /afs/fnal.gov/files/data/minos/d141/.growfsdir 69168 /afs/fnal.gov/files/data/minos/d141/.growfsdir time make_growfs -k /afs/fnal.gov/files/data/minos/d199 real 8m31.961s user 0m36.871s sys 2m27.319s FNGP-OSG > export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 FNGP-OSG > export PATH=${PARROT_DIR}/bin:${PATH} FNGP-OSG > parrot -m ${PARROT_DIR}/mountfile2.grow bash FNGP-OSG > . /usr/local/etc/setups.sh FNGP-OSG > export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup FNGP-OSG > setup_minos() { . $MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } FNGP-OSG > setup_minos No default SAM configuration exists at this time. MINOSSOFT release "development" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=trunk EXTERN=v03 CONFIG=v01 bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory cd /minos/scratch/kreymer/condor/loonb DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root loon -bq firstlast.C ${DFILE} FNGP-OSG > setup_minos -r R1.24.2 No default SAM configuration exists at this time. MINOSSOFT release "R1.24.2" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=v5-12-00f EXTERN=v03 CONFIG=v01 bash: child setpgid (13580 to 13579): Operation not permitted bash: child setpgid (13581 to 13579): Operation not permitted bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory FNGP-OSG > type loon loon is /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.24.2/bin/Linux2.6-GCC_3_4/loon FNGP-OSG > type dcap bash: type: dcap: not found FNGP-OSG > type dccp dccp is /afs/fnal.gov/files/code/e875/general/ups/prd/dcap/v2_36_f0506/Linux-2-4/bin/dccp FNGP-OSG > loon -bq firstlast.C ${DFILE} Warning in : class timespec already in TClassTable P> dccp ${DFILE} F00031300_0000.mdaq.root getControlMessage: poll fail. Failed to create a control line Failed open file in the dCache. 
Can't open source file : Server rejected "hello" System error: Input/output error P> loon -bq firstlast.C F00031300_0000.mdaq.root Warning in : class timespec already in TClassTable loon [0] Processing firstlast.C... Spin(1 in 1 out 0 filt.) 1) +RawRecCounts::Ana n=1 ( 1/ 0) t=( 0.00/ 0.00) RawRecCounts Report: F00031300_0000.mdaq.root root version: v04-02-00 VldContexts: First: { Far| Data|2005-04-25 16:36:07.621514000Z} First Snarl: {Unknow|Unknow|1970-01-01 00:00:00.000000000Z} # -1 Last: { Far| Data|2005-04-25 16:36:07.621514000Z} Last Snarl: {Unknow|Unknow|1970-01-01 00:00:00.000000000Z} # -1 in 11 records of 0 record sets RawConfigFilesBlock 7 RawDaqHeaderBlock 11 RawRunCommentBlock 1 RawRunConfigBlock 1 RawRunEndBlock 1 RawRunStartBlock 1 RawRecCounts done ============================================================================= 2008 02 06 ############# # MINOSSOFT # ############# Second pass on minossoft symlinks finds nothing to do. Second pass on products symlinks finds /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/startup/ups_startup /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/startup/ups_startup /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/shutdown/ups_shutdown /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/shutdown/ups_shutdown /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root_minimal.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/tar_files /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_3/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so Corrected .fnal.gov to a local symlink, both in my copy, and in the original cd /afs/fnal.gov/files/data/minos/d141/db/.upsfiles/startup ln -sf ups_startup ups_startup.csh ln -sf ups_startup ups_startup.sh cd /afs/fnal.gov/files/data/minos/d141/db/.upsfiles/startup ln -sf ups_shutdown ups_shutdown.csh ln -sf ups_shutdown ups_shutdown.sh cd /afs/fnal.gov/files/code/e875/general/ups/db/.upsfiles/startup cd /afs/fnal.gov/files/code/e875/general/ups/db/.upsfiles/shutdown Removed useless vdt versions in ups which have explicit links to a products path. ups undeclare -Y vdt v1_1_14_13 ups undeclare -Y vdt v1_6_1_0 ups undeclare -Y vdt v1_8_1_1 chmod -R 755 /afs/fnal.gov/files/data/minos/d141/prd/vdt rm -r /afs/fnal.gov/files/data/minos/d141/prd/vdt rm -r /afs/fnal.gov/files/data/minos/d141/db/vdt printf "${SLINKS}\n" | cut -f 2 -d : /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root_minimal.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/tar_files /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_3/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so This seems like something that should be cleaned up but I'll just grab them for now . 
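One way to 'grab them' is to replace each symlink in the clone whose target lies outside the copied tree with a real copy of that target. A minimal sketch, assuming the clone is d141 and that only plain-file targets need copying; the variable names and loop are illustrative, not the actual procedure used here:

UPO=/afs/fnal.gov/files/data/minos/d141
find ${UPO} -type l | while read LNK ; do
    TGT=`readlink -f "${LNK}"`                 # resolve to an absolute target path
    case "${TGT}" in
        ${UPO}/*) ;;                           # target already inside the clone, leave the link
        *) [ -f "${TGT}" ] && rm "${LNK}" && cp -p "${TGT}" "${LNK}" ;;   # external file, copy it in
    esac
done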
There are 215 MBytes of tar files.

grep ':/' ${SLINKF} | grep -v "${UPI}" | cut -f 2 -d :
/ftp/products/sam/v8_2_0/Linux+2/sam_v8_2_0_Linux+2.ups.tar
/ftp/products/sam_ns_ior/v7_1_0/NULL/sam_ns_ior_v7_1_0_NULL.ups.tar
/fnal/ups/prd/oracle_client/v10_1_0_2_0a/Linux-2/nls/lbuilder/lbuilder
/fnal/ups/prd/oracle_client/v10_1_0_2_0a/Linux-2/jdk/man/ja_JP.eucJP
/afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh
/usr/share/libtool/config.guess
/usr/share/libtool/config.sub
/usr/share/libtool/ltmain.sh
/usr/share/automake-1.6/install-sh
/usr/share/automake-1.6/mkinstalldirs
/usr/share/automake-1.6/missing
/usr/share/automake-1.6/depcomp
/usr/lib/libz.so.1
/usr/lib/libz.so.1
/ftp/products/samgrid_batch_adapter/v7_1_0/NULL/samgrid_batch_adapter_v7_1_0_NULL.ups.tar
/ftp/products/geant/v3_21_14a/Linux+2.6/geant_v3_21_14a_Linux+2.6.ups.tar

Just 1 file to actually copy. Done

#########
# FNALU #
#########

Date: Wed, 06 Feb 2008 14:08:52 -0600 (CST)
Subject: HelpDesk ticket 087003 has additional info.
_________________________________________________________________
Ticket #: 087003
_________________________________________________________________
Note To Requester: mgreaney@fnal.gov sent this Notes To Requester:
Art, the TWW software has a perl that is in your path when you login,
probably. This is the default because the TWW took the place of a lot of
the ups products. If you use the perl that is on the system or the ups
perl, then kcroninit works. try doing setup perl before you run the
kcroninit. Let me know if that works, thank you, Margaret
_________________________________________________________________
This is still a problem, as the TWW perl is still in people's paths.
perl is a standard part of Linux installations.
I think we should remove the TWW perl, and put this problem behind us.

Unfortunately, now that I look closer, perl is the tip of the iceberg.
On flxi03, there are 756 files in /opt/TWWfsw/bin, dating from May/Jun 2006.
531 already exist in /usr/bin.

for BIN in `ls /opt/TWWfsw/bin` ; do
  [ -r "/usr/bin/${BIN}" ] && ls /usr/bin/${BIN} ; done | wc -l
531

Four are in /usr/X11R6/bin
/usr/X11R6/bin/cxpm
/usr/X11R6/bin/nc
/usr/X11R6/bin/nedit
/usr/X11R6/bin/sxpm

Another eight exist in /bin :
/bin/bash
/bin/gettext
/bin/gtar
/bin/gunzip
/bin/gzip
/bin/mktemp
/bin/tcsh
/bin/zcat

Perhaps we should not put 2-year-old versions of these in people's paths.
So at minimum we should remove perl from TWW, along with the packages that
duplicate /bin/*. Perhaps we should consider removing all of TWW !
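For completeness, a sketch of the duplicate scan extended to /bin and /usr/X11R6/bin as well as /usr/bin, using the same loop as above; the counts it prints would need to be regenerated, they are not taken from the original scan:

for BIN in `ls /opt/TWWfsw/bin` ; do
    for DIR in /usr/bin /bin /usr/X11R6/bin ; do
        [ -r "${DIR}/${BIN}" ] && echo "${DIR}"        # this TWW name shadows a stock binary here
    done
done | sort | uniq -c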
_________________________________________________________________ FLXI03 > for BIN in `ls /opt/TWWfsw/bin` ; do [ -r "/usr/bin/${BIN}" ] || ls -l /opt/TWWfsw/bin/${BIN} ; done | cut -f 2 -d \> | cut -f -4 -d / | sort -u /opt/TWWfsw/aspell06 /opt/TWWfsw/bash30 /opt/TWWfsw/ddd33 /opt/TWWfsw/diffutils28 /opt/TWWfsw/emacs21 /opt/TWWfsw/expect54 /opt/TWWfsw/fcpackage22 /opt/TWWfsw/ghostscript70r /opt/TWWfsw/gpatch25 /opt/TWWfsw/groff119 /opt/TWWfsw/gzip13 /opt/TWWfsw/imagemagick62 /opt/TWWfsw/ispell32 /opt/TWWfsw/liblcms11 /opt/TWWfsw/libttf21 /opt/TWWfsw/libungif /opt/TWWfsw/libwmf02 /opt/TWWfsw/lsof47 /opt/TWWfsw/lynx28 /opt/TWWfsw/m4 /opt/TWWfsw/metamail27 /opt/TWWfsw/mktemp15 /opt/TWWfsw/mutt14 /opt/TWWfsw/mysql4112 /opt/TWWfsw/ncurses54 /opt/TWWfsw/nedit55 /opt/TWWfsw/netpbm92 /opt/TWWfsw/perl586 /opt/TWWfsw/pine46 /opt/TWWfsw/pkgutils15 /opt/TWWfsw/plotutils24 /opt/TWWfsw/python242 /opt/TWWfsw/python242p /opt/TWWfsw/tar11 /opt/TWWfsw/tcsh61 /opt/TWWfsw/texinfo48 /opt/TWWfsw/tk84p /opt/TWWfsw/xemacs214 /opt/TWWfsw/xpm _________________________________________________________________ Date: Mon, 30 Jun 2008 12:59:59 -0500 (CDT) From: Margaret_Greaney I have not heard back from Frank Nagy on this, but from what I see the upgrade of TWW caused new perl modules to be available and kcroninit does work on my attempts on fnalu on linux nodes. _________________________________________________________________ Still fails for me, same way, FLXI04 > kcroninit Can't locate Net/Domain.pm in @INC (@INC contains: /usr/krb5/lib /opt/TWWfsw/libdb42/lib/perl586 /opt/TWWfsw/imagemagick62/lib/perl586 /opt/TWWfsw/readline50/lib/perl586 /opt/TWWfsw/pe FLXI04 > type perl perl is /opt/TWWfsw/bin/perl FLXI04 > ls -l /opt/TWWfsw/bin/perl lrwxr-xr-x 1 kevinh root 41 May 23 2006 /opt/TWWfsw/bin/perl -> /opt/TWWfsw/perl586/bin/.perl.tww-wrapper FLXI04 > ls -l /opt/TWWfsw/perl586/bin/.perl.tww-wrapper -rwxr-xr-x 1 kevinh root 12363 Apr 11 2006 /opt/TWWfsw/perl586/bin/.perl.tww-wrapper _________________________________________________________________ Date: Mon, 27 Oct 2008 13:02:33 -0500 (CDT) Solution: kcron may not work on flxi06 which has a 5.1 install. It works on the rest of the fnalu cluster. _________________________________________________________________ Testing 27 Oct, inconsistent results, with or without PATH=${PATH/\/opt\/TWWfsw\/bin://} FLXI03 > kcron kinit: Preauthentication failed while getting initial credentials FLXI03 > PATH=${PATH/\/opt\/TWWfsw\/bin://} FLXI03 > PATH=${PATH/\/opt\/TWWfsw\/bin://} FLXI03 > kcron FLXI03 > kcron kinit: Preauthentication failed while getting initial credentials 17:23 - kcron continues to be intermittent, with and without TWW in the path, at least on flxi04. time for N in 1 2 3 4 5 6 7 8 9 0 ; do printf "${N} " ; kcron ; sleep 1 ; done 17:31 - OK,, this has run OK on flxi02/3/4. So perhaps some flakiness in the network or KDC earlier. _________________________________________________________________ Date: Mon, 27 Oct 2008 22:35:22 +0000 (GMT) kcron is working for me now in flxi02/3/4 There are intermittent failures, like FLXI03 > kcron kinit: Preauthentication failed while getting initial credentials but these do not seem to have anything to do with TWW. Please go ahead and close this ticket. Thanks ! 
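A slightly more quantitative version of the intermittency test, assuming kcron exits non-zero when the underlying kinit fails (that exit behavior is an assumption, not something verified here):

FAIL=0
for N in `seq 1 20` ; do
    kcron 2>/dev/null || FAIL=`expr ${FAIL} + 1`     # count the Preauthentication failures
    sleep 1
done
echo "kcron failed ${FAIL} of 20 tries on `hostname -s`"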
_________________________________________________________________ ============================================================================= 2008 02 05 ############# # MINOSSOFT # ############# See HOWTO.afssoftprod pts examine buckley:minosrecodata pts setfields buckley:minosrecodata -access SOMar [root@minos-mysql1 ~]# time up ${UPI} ${D199} Unable to set owner-id for /afs/fnal.gov/files/data/minos/d199/setup/CVS/Root to 1019 ... Unable to set owner-id for /afs/fnal.gov/files/data/minos/d199 to 1019 Unable to set group-id for /afs/fnal.gov/files/data/minos/d199 to 5111 real 114m24.882s user 0m19.311s sys 11m38.279s Ran the slinky copies, with -head 1 -head 3 then everything. See /minos/scratch/kreymer/slinky/minossoft.log Rates are about 1 MB/sec ( probably due to failing chown's ) ############### # FNALU BATCH # ############### for node in flxb11 flxb13 flxb21 flxb22 flxb24 minos setup_minos -r R1.24.2 cd /minos/scratch/kreymer/condor/loonb DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/fardet_data/2005-04/F00031300_0000.mdaq.root loon -bq firstlast.C ${DFILE} flxb11 OK flxb13 OK flxb16 flxb21 Connection to flxb21 closed. flxb22 Connection to flxb22 closed. flxb24 ============================================================================= 2008 02 04 ############### # FNALU BATCH # ############### Jobs which are NOT exiting instantly without logs or output : cat /minos/scratch/rahaman/releases/minos/Mad/macros/*.log | grep executed | cut -f 2 -d '<' | cut -f 1 -d '>' | sort -u flxb17.fnal.gov flxb18.fnal.gov flxb19.fnal.gov flxb20.fnal.gov flxb21.fnal.gov flxb22.fnal.gov flxb23.fnal.gov flxb24.fnal.gov flxb25.fnal.gov flxb26.fnal.gov flxb27.fnal.gov flxb28.fnal.gov flxb30.fnal.gov flxb31.fnal.gov flxi06.fnal.gov Scanning nodes that are up and taking batch jobs, for N in 10 11 13 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 ; do bsub -R flxb${N} ls -ld /minos/data /minos/scratch ; done 10 11 13 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. x Summary, it seems /minos/data and scratch are missing on flxb32 bsub -R "hname!=flxb32" ls -ld /minos/data /minos/scratch ___________________________________________ Date: Mon, 04 Feb 2008 18:46:40 -0600 (CST) Subject: HelpDesk ticket 110624 Short Description: flxb32 is missing the /minos/data and /minos/scratch mounts Problem Description: fnalu-admin : Several Minos LSF batch jobs are failing, due to the lack of /minos/scratch and /minos/data mounts on flxb32. This tends to quickly suck jobs out of the execution queue. Please hold host flxb32 in LSF until the mounts are established. 
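A sketch of how the mount scan above could be automated; ssh probes are used here instead of bsub probe jobs purely for illustration, with the same node list:

for N in 10 11 13 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 ; do
    ssh -ax flxb${N} 'ls -d /minos/data /minos/scratch > /dev/null 2>&1' \
        || echo "flxb${N} is missing a /minos mount"
done
# then submit around the bad host(s), e.g.
# bsub -R "hname!=flxb32" ls -ld /minos/data /minos/scratch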
___________________________________________ ######## # DATA # ######## Per query from grashorn, re 'missing' subruns in cedar, RUNSUBS=' 24183:0001 24183:0002 24183:0003 24183:0004 24186:0001 24186:0002 24186:0003 24955:0002 25020:0001 25020:0002 25021:0000 25021:0001 25021:0002 25022:0001 25022:0002 25023:0001 25023:0002 25024:0001 25024:0002 25025:0001 25025:0002 26836:0000 27666:0001 27666:0002 27666:0003 27669:0001 27669:0002 27669:0003 27669:0004 27669:0005 27669:0006 27669:0007 28273:0001 28466:0004 28636:0007 29127:0004 29217:0004 29232:0006 29233:0006 29233:0007 30107:0003 30108:0001 30108:0002 30108:0003 34220:0001 34220:0002 34220:0003 34220:0004 34220:0005 34224:0001 34224:0002 35640:0009 35724:0013 35727:0005 36869:0009 37230:0017 37676:0020 37676:0022 ' for RUNSUB in ${RUNSUBS} ; do printf "${RUNSUB} " RUN=`printf "${RUNSUB}" | cut -f 1 -d :` SUB=`printf "${RUNSUB}" | cut -f 2 -d :` SAMDIM=" VERSION cedar \ and DATA_TIER sntp-far \ and PARENT_BY_NAME F000${RUN}_${SUB}.mdaq.root \ " #echo ${SAMDIM} sam list files --nosummary --dim="${SAMDIM}" done ########### # MONTHLY # ########### DATASETS 2/4 PREDATOR 2/4 VAULT 2/4 MYSQL 2/4 ######### # MYSQL # ######### > Sometime in the past hour (it's now Mon Feb 4 06:48:29 CST 2008) we > lost connectivity to minos-db1.fnal.gov:- mysql server down on minos-mysql1 I see a problem in /var/log/messages.1 : Feb 3 02:15:02 minos-mysql1 kernel: hdc: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Feb 3 02:15:02 minos-mysql1 kernel: hdc: drive_cmd: error=0x04Aborted Command Feb 3 02:15:02 minos-mysql1 kernel: hdc: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Feb 3 02:15:02 minos-mysql1 kernel: hdc: drive_cmd: error=0x04Aborted Command Feb 3 02:15:02 minos-mysql1 kernel: cdrom: open failed. Feb 3 02:15:02 minos-mysql1 kernel: cdrom: open failed. But that's just the CD rom, so this should be harmless. ( But why was anything accessing this at 2 AM ? ) Uh-Oh, the usual archive area Mysql> ls -l /data/archive/COPY total 0 -rw-r--r-- 1 root root 0 Jan 23 14:15 empty_file_4_tibs What's going on ? Where are the backup files ? I see that /etc/resolv.conf lists only one nameserver, search fnal.gov nameserver 131.225.8.120 less /data/database/minos-mysql1.fnal.gov.err 080204 6:27:00 [Note] /local/ups/prd/mysql/v4_1_11/Linux-2/libexec/mysqld: Shutdown complete But the usual message about mysqld ended is missing, like : 070821 15:34:09 mysqld ended Backing up mysql database, need for monthly anyway cd ${DBHOME}/offline dds -tr ... -rw-rw---- 1 minsoft e875 31744 Feb 4 06:26 BEAMMONFILESUMMARY.MYI -rw-rw---- 1 minsoft e875 4096 Feb 4 06:26 BEAMMONCUTSVLD.MYI -rw-rw---- 1 minsoft e875 2048 Feb 4 06:26 BEAMMONCUTS.MYI Mysql> script -a ${DBCOPY}/offline/offline.log Script started, file is /minos/data/mysql/archive/20080204/offline/offline.log [minsoft@minos-mysql1 offline]$ time cp -av --target-directory=/minos/data/mysql/archive/20080204/offline *.frm [minsoft@minos-mysql1 offline]$ time cp -av --target-directory=/minos/data/mysql/archive/20080204/offline *.MYD real 30m47.148s user 0m0.990s sys 2m16.440s [root@minos-mysql1 ~]# /etc/init.d/mysql start Starting MySQL................................... 
[FAILED] 080204 10:41:49 mysqld started 080204 10:41:49 [ERROR] /local/ups/prd/mysql/v4_1_11/Linux-2/libexec/mysqld: unknown variable '--binlog-do-db=crl_v1' 080204 10:41:49 mysqld ended Changed /data/database/my.cnf to binlog-do-db = crl_v1 binlog-do-db = offline [root@minos-mysql1 ~]# /etc/init.d/mysql start Starting MySQL [ OK ] 080204 10:44:11 mysqld started /local/ups/prd/mysql/v4_1_11/Linux-2/libexec/mysqld: ready for connections. Version: '4.1.11-log' socket: '/data/database/mysql.sock' port: 3306 Source distribution Checked for recent temp entries , there are non since restart ( 372 ) There are plenty before, ( 371 ) Mysql> mysqlbinlog -s -d offline ${DBBINS}/minos.000372 | less Mysql> mysqlbinlog -s -d crl_v1 ${DBBINS}/minos.000372 | less Mysql> mysqlbinlog -s -d temp ${DBBINS}/minos.000372 | less ============================================================================= 2008 02 01 ############ # PRODUCTS # ############ Preparing for global up of products etc. Options : -m < check that we have no mount points before copying > AFSC=/afs/fnal.gov/files/code/e875 UPI=${AFSC}/general/products UPO=/afs/fnal.gov/files/data/minos/d141 time up ${UPI} ${UPO} Unable to set owner-id for /afs/fnal.gov/files/data/minos/d141/db/neugen3/CVS/Entries to 5922 Unable to set group-id for /afs/fnal.gov/files/data/minos/d141/db/neugen3/CVS/Entries to 5111 ... Unable to set group-id for /afs/fnal.gov/files/data/minos/d141/prd/pacman/v3_20/NULL/src/pythonCheck.pyc to 5111 Unable to set group-id for /afs/fnal.gov/files/data/minos/d141/prd/pacman/v3_20/NULL/src/Alias.pyc to 5111 real 16m55.340s user 0m1.469s sys 2m9.036s MIN > fs listquota /afs/fnal.gov/files/code/e875/general/products Volume Name Quota Used %Used Partition c.e875.d1 8000000 3266914 41% 59% MIN > fs listquota /afs/fnal.gov/files/data/minos/d141 Volume Name Quota Used %Used Partition nb.minos.d141 50000000 3327363 7% 53% PLINKS=' GENIE LOG4CPP MINOS_EXTERN MINOS_ROOT NEUGEN3 PYTHIA6 stdhep ' D141=/afs/fnal.gov/files/data/minos/d141 for PLINK in ${PLINKS} ; do printf " OK - copying ${PLINK} " date du -sm ${AFSC}/releases/${PLINK} UPX=${D141}/prd/${PLINK} UPI=${AFSC}/releases/${PLINK} UPO=${D141}/prd/${PLINK} [ -L "${UPX}" ] && ls -l ${UPX} && rm ${UPX} time up ${UPI} ${UPO} done 2>&1 | tee /tmp/plinkup.log MINOS26 > grep -v 'Unable to set ' /tmp/plinkup.log OK - copying GENIE Fri Feb 1 17:00:34 CST 2008 600 /afs/fnal.gov/files/code/e875/releases/GENIE real 1m36.736s user 0m0.203s sys 0m18.265s OK - copying LOG4CPP Fri Feb 1 17:02:12 CST 2008 91 /afs/fnal.gov/files/code/e875/releases/LOG4CPP real 1m16.498s user 0m0.220s sys 0m9.826s OK - copying MINOS_EXTERN Fri Feb 1 17:03:30 CST 2008 2048 /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN real 17m30.968s user 0m3.110s sys 2m20.089s OK - copying MINOS_ROOT Fri Feb 1 17:21:27 CST 2008 20537 /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT lrwxr-xr-x 1 kreymer g020 49 Feb 1 15:00 /afs/fnal.gov/files/data/minos/d141/prd/MINOS_ROOT -> /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT real 194m50.950s user 0m23.238s sys 19m52.623s OK - copying NEUGEN3 Fri Feb 1 20:41:50 CST 2008 841 /afs/fnal.gov/files/code/e875/releases/NEUGEN3 lrwxr-xr-x 1 kreymer g020 46 Feb 1 15:00 /afs/fnal.gov/files/data/minos/d141/prd/NEUGEN3 -> /afs/fnal.gov/files/code/e875/releases/NEUGEN3 real 3m0.463s user 0m0.413s sys 0m28.601s OK - copying PYTHIA6 Fri Feb 1 20:44:55 CST 2008 183 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 lrwxr-xr-x 1 kreymer g020 46 Feb 1 15:00 /afs/fnal.gov/files/data/minos/d141/prd/PYTHIA6 
-> /afs/fnal.gov/files/code/e875/releases/PYTHIA6 real 0m51.456s user 0m0.116s sys 0m7.127s OK - copying stdhep Fri Feb 1 20:45:47 CST 2008 27 /afs/fnal.gov/files/code/e875/releases/stdhep lrwxr-xr-x 1 kreymer g020 45 Feb 1 15:00 /afs/fnal.gov/files/data/minos/d141/prd/stdhep -> /afs/fnal.gov/files/code/e875/releases/stdhep real 0m10.084s user 0m0.019s sys 0m1.218s ############## # MINOS_DATA # ############## CDIRS=`(cd /afs/fnal.gov/files/data/minos/d10/indexes ; ls *_near.cedar.index)` MINOS26 > ls -l ${CDIRS} -rw-r--r-- 1 rubin e875 4500 Sep 26 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-04_near.cedar.index -rw-r--r-- 1 rubin e875 32200 Oct 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-05_near.cedar.index -rw-r--r-- 1 rubin e875 31000 Feb 5 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2005-06_near.cedar.index -rw-r--r-- 1 rubin e875 33350 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-07_near.cedar.index -rw-r--r-- 1 rubin e875 35000 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-08_near.cedar.index -rw-r--r-- 1 rubin e875 35200 Mar 23 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2005-09_near.cedar.index -rw-r--r-- 1 rubin e875 19700 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-10_near.cedar.index -rw-r--r-- 1 rubin e875 32800 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-11_near.cedar.index -rw-r--r-- 1 rubin e875 24150 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2005-12_near.cedar.index -rw-r--r-- 1 rubin e875 33550 Feb 5 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2006-01_near.cedar.index -rw-r--r-- 1 rubin e875 25950 Sep 28 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-02_near.cedar.index -rw-r--r-- 1 rubin e875 0 Nov 26 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-05_near.cedar.index -rw-r--r-- 1 rubin e875 33400 Dec 26 16:19 /afs/fnal.gov/files/data/minos/d10/indexes/2006-06_near.cedar.index -rw-r--r-- 1 rubin e875 5900 Dec 26 16:19 /afs/fnal.gov/files/data/minos/d10/indexes/2006-07_near.cedar.index -rw-r--r-- 1 rubin e875 14450 Dec 13 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-08_near.cedar.index -rw-r--r-- 1 rubin e875 20450 Dec 11 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-09_near.cedar.index -rw-r--r-- 1 rubin e875 30850 Dec 11 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-10_near.cedar.index -rw-r--r-- 1 rubin e875 26850 Dec 4 2006 /afs/fnal.gov/files/data/minos/d10/indexes/2006-11_near.cedar.index -rw-r--r-- 1 rubin e875 35900 Mar 23 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2006-12_near.cedar.index -rw-rw-r-- 1 rubin e875 30000 Oct 24 14:22 /afs/fnal.gov/files/data/minos/d10/indexes/2007-01_near.cedar.index -rw-r--r-- 1 rubin e875 31650 Mar 2 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2007-02_near.cedar.index -rw-r--r-- 1 rubin e875 36250 Apr 2 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2007-03_near.cedar.index -rw-r--r-- 1 rubin e875 35000 May 2 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2007-04_near.cedar.index -rw-r--r-- 1 rubin e875 2900 May 5 2007 /afs/fnal.gov/files/data/minos/d10/indexes/2007-05_near.cedar.index mindata : cd ~kreymer/minos/scripts for CDIR in ${CDIRS} ; do ./afs2nfs -i ${CDIR} -n ; done Complete except for STREAM sntp to /minos/data/reco_near/cedar/sntp_data/2007-01 1569 cp recodata76/N00011446_0018.spill.sntp.cedar.0.root /minos/data/reco_near/cedar/sntp_data/2007-01/N00011446_0018.spill.sntp.cedar.0.root cp recodata77/N00011648_0003.spill.sntp.cedar.0.root 
/minos/data/reco_near/cedar/sntp_data/2007-01/N00011648_0003.spill.sntp.cedar.0.root $ dds /afs/fnal/files/data/minos/d10/recodata76/N00011446_0018.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 73341213 Jan 2 2007 /afs/fnal/files/data/minos/d10/recodata76/N00011446_0018.spill.sntp.cedar.0.root $ dds /afs/fnal/files/data/minos/d10/recodata77/N00011648_0003.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin e875 2181427 Jan 26 2007 /afs/fnal/files/data/minos/d10/recodata77/N00011648_0003.spill.sntp.cedar.0.root ./afs2nfs -i 2007-01_near.cedar.index This is picking up much more than 2 files, the whole month was missing, all 600 files 600/ 600 recodata77/N00011648_0003.spill.sntp.cedar.0.root STREAM sntp rate 17865 38G /minos/data/reco_near/cedar/sntp_data/2007-01 STARTED Fri Feb 1 10:45:10 CST 2008 FINISHED Fri Feb 1 11:22:03 CST 2008 CFDIRS=`(cd /afs/fnal.gov/files/data/minos/d10/indexes ; ls *_far.cedar.index)` for CFDIR in ${CDIRS} ; do ./afs2nfs -i ${CFDIR} -n ; done all present and accounted for CHECKING FOR EMPTY LARGE DATA VOLUMES, NONE AT PRESENT cd /afs/fnal.gov/files/data/minos DIRS=`ls -d d?? d???` for DIR in ${DIRS} ; do printf "${DIR} " ; fs listquota ${DIR} | grep nb ; done > /tmp/mdd cat /tmp/mdd | sort -rk 5 Removed cedar for DIR in ${DIRS} ; do printf "${DIR} " ; fs listquota ${DIR} | grep nb ; done > /tmp/mdd2 cat /tmp/mdd2 | sort -rk 5 d199 nb.minos.d199 50000000 172 0% 79% d198 nb.minos.d198 50000000 59211 0% 79% dbm nb.data.minosd11 4000000 551 0% 76% d86 nb.minos.d86 50000000 6 0% 72% d58 nb.minos.d58 8000000 8 0% 72% d10 nb.data.minosd10 8000000 8439 0% 72% d245 nb.minos.d245 50000000 29 0% 65% d243 nb.minos.d243 50000000 34 0% 61% d141 nb.minos.d141 50000000 242 0% 52% for RDIR in d86 d141 d198 d199 d243 d245 ; do find ${RDIR} -type d ; done for RDIR in d141 d198 d199 ; do ls -R ${RDIR} ; done for RDIR in d141 d198 d199 ; do find ${RDIR} -type f ; done d198/recodata72/c10000845_0005.sntp.cedar.root grep c10000845_0005.sntp.cedar.root d10/indexes/*.index d10/indexes/mc_cosmic.bfld201.cedar.index:recodata72/c10000845_0005.sntp.cedar.root Bottom line, d141 and d199 are clear. rm -r d141/reco* rm -r d199/reco* rm d199/indexes rm d141/indexes rm d10/recodata52 rm d10/recodata73 ############## # MINOS_DATA # ############## REMOVING CEDAR NTUPLES FROM AFS rubin@fnpcsrv1 : cat shrc/kreymer cut/paste cd /afs/fnal.gov/files/data/minos/d10/indexes ls -l *ar.cedar.index Ran a test pass, counting files ./rv cedar noop | grep -v rm Removed net 28698 files ./rv cedar noop > /var/tmp/rv.cedar.log 14:06 ./rv cedar 2>&1 | tee /var/tmp/rvdo.cedar.log This procedure will erase all cedar ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Removing 2004-08_far.cedar.index Removed 0 files Removing 2004-09_far.cedar.index Removed 0 files ... 
Removing 2007-05_near.cedar.index Removed 58 files Removed net 28698 files grep ^rm /var/tmp/rv.cedar.log | cut -f 2 -d / | sort -u recodata23 recodata26 recodata34 recodata35 recodata36 recodata37 recodata38 recodata39 recodata40 recodata42 recodata43 recodata44 recodata45 recodata46 recodata47 recodata51 recodata52 recodata53 recodata54 recodata56 recodata57 recodata58 recodata59 recodata60 recodata61 recodata62 recodata63 recodata64 recodata65 recodata66 recodata67 recodata68 recodata69 recodata70 recodata71 recodata72 recodata73 recodata74 recodata75 recodata76 recodata77 recodata78 recodata79 recodata80 recodata81 recodata82 recodata83 recodata84 recodata85 recodata86 recodata88 recodata89 recodata90 recodata91 recodata92 recodata93 recodata94 recodata95 recodata96 recodata97 recodata98 ######### # FNALU # ######### for NODE in flxb16 $INODES ; do printf "${NODE} " ; ssh -ax ${NODE} 'df -h /minos/data | grep nas' ; done flxb16 minos-nas-0.fnal.gov:/minos/data flxb10 minos-nas-0.fnal.gov:/minos/data flxb11 minos-nas-0.fnal.gov:/minos/data flxb13 minos-nas-0.fnal.gov:/minos/data flxb17 minos-nas-0.fnal.gov:/minos/data flxb23 Could not chdir to home directory /afs/fnal.gov/files/home/room1/kreymer: No such file or directory minos-nas-0.fnal.gov:/minos/data flxb24 minos-nas-0.fnal.gov:/minos/data flxb25 minos-nas-0.fnal.gov:/minos/data flxb30 minos-nas-0.fnal.gov:/minos/data flxb31 minos-nas-0.fnal.gov:/minos/data flxb32 minos-nas-0.fnal.gov:/minos/data flxb33 minos-nas-0.fnal.gov:/minos/data flxb34 minos-nas-0.fnal.gov:/minos/data flxb35 minos-nas-0.fnal.gov:/minos/data This recovered by 11:00 ( 17:00 UTC ) ######## # FARM # ######## IMANODES='fnpc239 fnpc240 fnpc241 fnpc242 fnpc243 fnpc244 fnpc245 fnpc246' for NODE in ${IMANODES} ; do printf "${NODE} " ssh -ax minos25 'grep ^OPTIONS /etc/sysconfig/afs ; ps -flu root | grep vice' done ============================================================================= 2008 01 31 ########## # CONDOR # ########## Drafting minoscondorsupport.txt ( home of desktop ) emailed to berman, timm, sfiligoi, jallen ######### # FNALU # ######### flxb16 was upgraded to SLF 4 around 15:00 today, lacks /minos/data and /minos/scratch. Reported to mgreaney, logged under ticket 110383 ######## # GRID # ######## for ID in 339 340 341 342 343 344 345 346 ; do printf "fnpc${ID} " ssh -ax fnpc${ID} ls -ld /afs/fnal/files/code/e875 ; done ############## # MINOS_DATA # ############## From CFL 30991 1807 reco_near/cedar/.*nt._data/ 83803 1298 reco_far/cedar/.*nt._data/ PNFS COUNTS grep /pnfs/minos/reco_near/cedar/sntp_data CFL/CFL | wc -l 22856 grep /pnfs/minos/reco_far/cedar/sntp_data CFL/CFL | wc -l 39703 grep /pnfs/minos/reco_far/cedar/.bntp_data CFL/CFL | wc -l 13990 /minos/data COUNTS find /minos/data/reco_near/cedar/sntp_data -type f | wc -l 11874 find /minos/data/reco_far/cedar/sntp_data -type f | wc -l 16655 find /minos/data/reco_far/cedar/.bntp_data -type f | wc -l 88 index COUNTS wc -l /afs/fnal.gov/files/data/minos/d10/indexes/*_near.cedar.index 12220 wc -l /afs/fnal.gov/files/data/minos/d10/indexes/*_far.cedar.index 16478 Let's purge the near cedar first ################ # AFS SYMLINKS # ################ N.B. - there is an 'up' command for cloning AFS file trees, which preserves ACL's. 
up ####### # NET # ####### Date: Thu, 31 Jan 2008 10:17:25 -0600 (CST) Subject: HelpDesk ticket 110393 Short Description: MRTG plots not available for r-s-fcc2-server Problem Description: I cannot view MRTG plots for nodes on r-s-fcc2-server, Things seem to be OK for nodes like minorora1 (s-s-fcc1-server) For example, at http://fndcg0.fnal.gov/~netadmin/NodeLocator/mrtg-search.cgi?hname=minos01. fnal.gov I see Search Results for minos01.fnal.gov 131.225.193.1 is connected to r-s-fcc2-server on port Gi8/26 (minos01) Last detected on this switch at 2008/01/31/09:41 1 node is connected to port Gi8/26 of r-s-fcc2-server. Looking Glass Error: Unknown area name for Device ___________________________________________ Date: Thu, 31 Jan 2008 10:22:18 -0600 (CST) This ticket has been reassigned to WOHLT, DARRYL of the CD-LSCS/CNCS/SN Group. ___________________________________________ Date: Thu, 31 Jan 2008 10:29:10 -0600 (CST) Note To Requester: darryl@fnal.gov sent this Notes To Requester: Art, it looks like you're using the old NodeLocator. Please change your bookmark to http://www-dcn.fnal.gov/~netadmin/m-s-nodelocator/NodeLocator/search.html and let me know if this fixes the problem. Darryl ___________________________________________ Yes, corrected this in /afs/fnal.gov/files/expwww/numi/html/computing/dh dhmain.html.20080131 ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n"; ssh -ax ${NODE} 'grep afs: /var/log/messages | grep "Jan " | grep -v Tokens | uniq'; done minos03 Jan 27 12:56:37 minos03 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 27 12:56:42 minos03 kernel: afs: file server 131.225.68.49 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos11 Jan 30 14:25:45 minos11 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 30 14:25:46 minos11 kernel: afs: failed to store file (110) Jan 30 14:26:43 minos11 kernel: afs: failed to store file (110) Jan 30 14:27:58 minos11 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos21 Jan 31 16:27:46 minos21 kernel: afs: Lost contact with file server 131.225.68.65 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 31 16:30:23 minos21 kernel: afs: file server 131.225.68.65 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos23 Jan 31 16:40:30 minos23 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 31 16:40:32 minos23 kernel: afs: Lost contact with file server 131.225.68.11 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 31 16:43:38 minos23 kernel: afs: file server 131.225.68.49 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Jan 31 16:43:38 minos23 kernel: afs: file server 131.225.68.11 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos25 Jan 29 12:07:01 minos25 kernel: libafs: Ignoring new-style parameters in presence of obsolete ones Jan 29 12:19:35 minos25 kernel: afs: Lost contact with file server 192.168.67.1 in cell fnal.gov (multi-homed address; other same-host interfaces maybe up) ######### # FNALU # ######### Date: Thu, 31 Jan 2008 08:54:57 -0600 (CST) Subject: 
HelpDesk ticket 110383 ___________________________________________ Ticket #: 110383 ___________________________________________ Short Description: FNALU batch node flxb17 shows no recent activity Problem Description: fnalu-admin : FNALU batch Node flxb17 seems to be stuck. The last active job was started over 4 days ago. $ bjobs -u all -r JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME 122890 pawlosk RUN minos minos02.fna flxb17.fnal *>& log193 Jan 27 01:51 122897 pawlosk RUN minos minos02.fna flxb17.fnal *>& log200 Jan 27 01:51 .. bjobs -l 122890 .. Sun Jan 27 02:00:26: Resource usage collected. The CPU time used is 217 seconds. MEM: 437 Mbytes; SWAP: 559 Mbytes; NTHREAD: 3 PGID: 29728; PIDs: 29728 29748 29749 ___________________________________________ Date: Thu, 31 Jan 2008 09:04:26 -0600 (CST) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. ___________________________________________ Date: Thu, 31 Jan 2008 09:11:28 -0600 (CST) Note To Requester: mgreaney@fnal.gov sent this Notes To Requester: Art, I will reset it it today and let you know. Margaret ___________________________________________ Date: Thu, 31 Jan 2008 09:11:29 -0600 (CST) Note To Requester: mgreaney@fnal.gov sent this Notes To Requester: I can't access this from the console and will reset it later today. thanks, margaret ___________________________________________ ################ # AFS SYMLINKS # ################ Summary of symlink crosslinks code/e875/general/MINOS_EXTERNAL none code/e875/general/ROOT /code/e875/releases2/ROOT data/minos/d04/libraries code/e875/general/minossoft code/e875/releases/SRT_BINLIBTMP code/e875/releases1 code/e875/releases2 data/minos/d04/libraries/DatabaseTables/HEAD code/e875/general/products code/e875/general/ups code/e875/releases/GENIE code/e875/releases/LOG4CPP code/e875/releases/MINOS_EXTERN code/e875/releases/MINOS_ROOT code/e875/releases/NEUGEN3 code/e875/releases/PYTHIA6 code/e875/releases/stdhep code/e875/releases code/e875/general/MINOS_EXTERNAL code/e875/general/ROOT code/e875/general/bin code/e875/general/ups/prd/MINOS_EXTERN code/e875/general/ups/prd/PYTHIA6 code/e875/releases1 none code/e875/releases2 code/e875/general/ROOT/config_build_root.sh code/e875/general/ups/prd/MINOS_ROOT code/e875/sim miscellaneous data/minos/d04/libraries none ============================================================================= 2008 01 30 ########## # CONDOR # ########## glideinWMS 1.1 available ########## # CONDOR # ########## per http://www.cs.wisc.edu/condor/ Stable series: Condor Version 7.0.0 released January 22nd, 2008 Development series: Condor Version 7.1.0 is coming soon Previous Stable series: Condor Version 6.8.8 released December 20th, 2007 ############ # PREDATOR # ############ 10:05 Corrupt .sam.py due to timeout of genpy for F00040244_0003 ( and timeout of _0004 ) /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/fardet_data/2008-01 MINOS26 > mv F00040244_0003.sam.py F00040244_0003.sam.py.BAD MINOS26 > mv F00040244_0004.sam.py F00040244_0004.sam.py.BAD These were picked up cleanly on the next cycle. 
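A sketch of the same quarantine applied to a whole month at once; the assumption that a corrupt .sam.py can be detected by a python compile check is mine, and has not been tested against predator's own validation:

cd /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/fardet_data/2008-01
for PY in *.sam.py ; do
    python -c "compile(open('${PY}').read(),'${PY}','exec')" > /dev/null 2>&1 \
        || mv ${PY} ${PY}.BAD      # quarantine, so the next cycle regenerates it
done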
########## # PARROT # ########## Tracking down symlinks in afs can be messy, because of equivalent /afs/fnal.gov /afs/.fnal.gov /afs/fnal There are just a few of these SIM ( /afs/fnal/ ) /afs/fnal.gov/files/code/e875/sim/ /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOGEANT.ddl /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOMETRY.ddl /afs/fnal/files/code/e875/sim/hermes_db/include/partap.inc MINOS26 > printf "${SLINKS}\n" | grep /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEO lrwxr-xr-x 1 para 1507 63 Apr 19 1996 /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/ddl/GEOMETRY.ddl -> /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOMETRY.ddl lrwxr-xr-x 1 para 1507 63 Apr 19 1996 /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/ddl/GEOGEANT.ddl -> /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOGEANT.ddl MINOS26 > find /afs/fnal.gov/files/code/e875/sim/hermes -name GEOMETRY.ddl /afs/fnal.gov/files/code/e875/sim/hermes/ddl/GEOMETRY.ddl /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/ddl/GEOMETRY.ddl /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/hermes/ddl/GEOMETRY.ddl MINOS26 > diff /afs/fnal.gov/files/code/e875/sim/hermes/ddl/GEOMETRY.ddl /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/hermes/ddl/GEOMETRY.ddl Corrected these broken symlinks cd /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/ddl ln -sf ../hermes/ddl/GEOGEANT.ddl GEOGEANT.ddl ln -sf ../hermes/ddl/GEOMETRY.ddl GEOMETRY.ddl MINOS26 > printf "${SLINKS}\n" | grep /afs/fnal/files/code/e875/sim/hermes_db/include/partap.inc lrwxr-xr-x 1 para 1507 58 Apr 19 1996 /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/hermes/src/partap.inc -> /afs/fnal/files/code/e875/sim/hermes_db/include/partap.inc MINOS26 > find /afs/fnal/files/code/e875/sim/hermes -type f -name partap.inc /afs/fnal/files/code/e875/sim/hermes/hermes_db/include/partap.inc cd /afs/fnal.gov/files/code/e875/sim/hermes/hermes_db/hermes/src ln -sf ../../include/partap.inc partap.inc PRODUCTS /afs/fnal.gov/files/code/e875/general/products /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/shutdown/ups_shutdown /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/startup/ups_startup ################ # AFS SYMLINKS # ################ OK, we really need a map of AFS directories, vols and symlinks needed for parrot. First, a symlink map. 
SLINKS=`find ${BASE} -type l -exec ls -l {} \;` printf "${SLINKS}\n" \ | cut -f 2 -d '>' \ | tr -d '[:blank:]' \ | grep ^/afs \ | grep -v ${BASE} \ | sort -u BASE=/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL 7/8 G code.e875.general ( none ) BASE=/afs/fnal.gov/files/code/e875/general/ROOT 7/8 G code.e875.general /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_2/v3_05_05 /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-00-02 /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-00-03A /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-00-04 /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-00-08 /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-01-02 /afs/fnal.gov/files/code/e875/releases2/ROOT/IRIX6.5-GCC_3_3/v4-02-00 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3-05-07 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3-05-07-opt /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3-10-01 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3-10-01-opt /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-02 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-03A /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-04/ /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-08 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-08d /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-08e /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-00-08f /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-01-02/ /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-01-04 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-02-00 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_3/v4-02-00-opt /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/bleeding-edge /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v4-02-00 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v4-02-00-opt /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v4-04-02 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v4-04-02-opt /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v5-08-00 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_4/v5-08-00-opt /afs/fnal.gov/files/data/minos/d04/libraries/IRIX6.5-GCC_3_2/v3_05_03 /afs/fnal.gov/files/data/minos/d04/libraries/IRIX6.5-GCC_3_2/v3_05_04 /afs/fnal.gov/files/data/minos/d04/libraries/Linux2.4-GCC_3_2/v3_05_00 /afs/fnal.gov/files/data/minos/d04/libraries/Linux2.4-GCC_3_2/v3_05_04 BASE=/afs/fnal.gov/files/code/e875/general/minossoft 7/8 G code.e875.general /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.18.4/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.18.4/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.18.4/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.22/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.22/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.22/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.23/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.23/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.23/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.0/bin 
/afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.0/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.0/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.1/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.1/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.1/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.2/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.2/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.2/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.3/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.3/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.3/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.28/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.28/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.28/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-02-16-R1-21/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-02-16-R1-21/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-02-16-R1-21/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-09-29-R1-24/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-09-29-R1-24/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-09-29-R1-24/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-10-12-R1-24/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-10-12-R1-24/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-10-12-R1-24/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-11-10-R1-24/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-11-10-R1-24/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S06-11-10-R1-24/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-02-23-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-02-23-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-02-23-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-09-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-09-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-09-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-24-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-24-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-03-24-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-04-06-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-04-06-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-04-06-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-06-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-06-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-06-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-17-R1-25/bin 
/afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-17-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-05-17-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-04-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-04-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-04-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-20-R1-25/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-20-R1-25/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-06-20-R1-25/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-13-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-13-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-13-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-27-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-27-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-07-27-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-06-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-06-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-06-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-20-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-20-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-09-20-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-10-22-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-10-22-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-10-22-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-11-10-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-11-10-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-11-10-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-12-22-R1-26/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-12-22-R1-26/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S07-12-22-R1-26/tmp /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/bin /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/lib /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/S08-01-10-R1-27/tmp /afs/fnal.gov/files/code/e875/releases1/R0.18.0/bin /afs/fnal.gov/files/code/e875/releases1/R0.18.0/lib /afs/fnal.gov/files/code/e875/releases1/R0.18.0/tmp /afs/fnal.gov/files/code/e875/releases1/R0.20.0/bin /afs/fnal.gov/files/code/e875/releases1/R0.20.0/lib /afs/fnal.gov/files/code/e875/releases1/R0.20.0/tmp /afs/fnal.gov/files/code/e875/releases1/R0.21/bin /afs/fnal.gov/files/code/e875/releases1/R0.21/lib /afs/fnal.gov/files/code/e875/releases1/R0.21/tmp /afs/fnal.gov/files/code/e875/releases1/R0.22/bin /afs/fnal.gov/files/code/e875/releases1/R0.22/lib /afs/fnal.gov/files/code/e875/releases1/R0.22/tmp /afs/fnal.gov/files/code/e875/releases1/R0.8.0/bin /afs/fnal.gov/files/code/e875/releases1/R0.8.0/lib /afs/fnal.gov/files/code/e875/releases1/R0.8.0/tmp /afs/fnal.gov/files/code/e875/releases1/R1.10/bin /afs/fnal.gov/files/code/e875/releases1/R1.10/lib /afs/fnal.gov/files/code/e875/releases1/R1.10/tmp /afs/fnal.gov/files/code/e875/releases1/R1.11/bin /afs/fnal.gov/files/code/e875/releases1/R1.11/lib /afs/fnal.gov/files/code/e875/releases1/R1.11/tmp /afs/fnal.gov/files/code/e875/releases1/R1.17/bin /afs/fnal.gov/files/code/e875/releases1/R1.17/lib /afs/fnal.gov/files/code/e875/releases1/R1.17/tmp 
/afs/fnal.gov/files/code/e875/releases1/R1.18.1/bin /afs/fnal.gov/files/code/e875/releases1/R1.18.1/lib /afs/fnal.gov/files/code/e875/releases1/R1.18.1/tmp /afs/fnal.gov/files/code/e875/releases1/R1.18.2/bin /afs/fnal.gov/files/code/e875/releases1/R1.18.2/lib /afs/fnal.gov/files/code/e875/releases1/R1.18.2/tmp /afs/fnal.gov/files/code/e875/releases1/R1.18/bin /afs/fnal.gov/files/code/e875/releases1/R1.18/lib /afs/fnal.gov/files/code/e875/releases1/R1.18/tmp /afs/fnal.gov/files/code/e875/releases1/R1.2/bin /afs/fnal.gov/files/code/e875/releases1/R1.2/lib /afs/fnal.gov/files/code/e875/releases1/R1.2/tmp /afs/fnal.gov/files/code/e875/releases1/R1.20/bin /afs/fnal.gov/files/code/e875/releases1/R1.20/lib /afs/fnal.gov/files/code/e875/releases1/R1.20/tmp /afs/fnal.gov/files/code/e875/releases1/R1.21/bin /afs/fnal.gov/files/code/e875/releases1/R1.21/lib /afs/fnal.gov/files/code/e875/releases1/R1.21/tmp /afs/fnal.gov/files/code/e875/releases1/R1.3/bin /afs/fnal.gov/files/code/e875/releases1/R1.3/lib /afs/fnal.gov/files/code/e875/releases1/R1.3/tmp /afs/fnal.gov/files/code/e875/releases1/development/bin /afs/fnal.gov/files/code/e875/releases1/development/lib /afs/fnal.gov/files/code/e875/releases1/development/tmp /afs/fnal.gov/files/code/e875/releases1/doxygen/loon /afs/fnal.gov/files/code/e875/releases2/R1.0/bin /afs/fnal.gov/files/code/e875/releases2/R1.0/lib /afs/fnal.gov/files/code/e875/releases2/R1.0/tmp /afs/fnal.gov/files/code/e875/releases2/R1.12/bin /afs/fnal.gov/files/code/e875/releases2/R1.12/lib /afs/fnal.gov/files/code/e875/releases2/R1.12/tmp /afs/fnal.gov/files/code/e875/releases2/R1.13/bin /afs/fnal.gov/files/code/e875/releases2/R1.13/lib /afs/fnal.gov/files/code/e875/releases2/R1.13/tmp /afs/fnal.gov/files/code/e875/releases2/R1.14/bin /afs/fnal.gov/files/code/e875/releases2/R1.14/lib /afs/fnal.gov/files/code/e875/releases2/R1.14/tmp /afs/fnal.gov/files/code/e875/releases2/R1.15/bin /afs/fnal.gov/files/code/e875/releases2/R1.15/lib /afs/fnal.gov/files/code/e875/releases2/R1.15/tmp /afs/fnal.gov/files/code/e875/releases2/R1.16/bin /afs/fnal.gov/files/code/e875/releases2/R1.16/lib /afs/fnal.gov/files/code/e875/releases2/R1.16/tmp /afs/fnal.gov/files/code/e875/releases2/R1.5/bin /afs/fnal.gov/files/code/e875/releases2/R1.5/lib /afs/fnal.gov/files/code/e875/releases2/R1.5/tmp /afs/fnal.gov/files/code/e875/releases2/R1.6/bin /afs/fnal.gov/files/code/e875/releases2/R1.6/lib /afs/fnal.gov/files/code/e875/releases2/R1.6/tmp /afs/fnal.gov/files/code/e875/releases2/R1.7/bin /afs/fnal.gov/files/code/e875/releases2/R1.7/lib /afs/fnal.gov/files/code/e875/releases2/R1.7/tmp /afs/fnal.gov/files/code/e875/releases2/R1.8/bin /afs/fnal.gov/files/code/e875/releases2/R1.8/lib /afs/fnal.gov/files/code/e875/releases2/R1.8/tmp /afs/fnal.gov/files/code/e875/releases2/R1.9/bin /afs/fnal.gov/files/code/e875/releases2/R1.9/lib /afs/fnal.gov/files/code/e875/releases2/R1.9/tmp /afs/fnal.gov/files/data/minos/d04/libraries/DatabaseTables/HEAD BASE=/afs/fnal.gov/files/code/e875/general/products 8G c.e875.d1 /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/shutdown/ups_shutdown /afs/.fnal.gov//files/code/e875/general/ups/db/.upsfiles/startup/ups_startup /afs/fnal.gov/files/code/e875/general/ups/db/sam_batch_adapter/v0_9_9_5 /afs/fnal.gov/files/code/e875/general/ups/db/sam_config/v4_2_34 /afs/fnal.gov/files/code/e875/general/ups/db/sam_config/v4_2_34/config.env /afs/fnal.gov/files/code/e875/general/ups/prd/misweb/v2_23_5/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam/v7_6_3/Linux-2 
/afs/fnal.gov/files/code/e875/general/ups/prd/sam/v7_6_5/Linux-2 /afs/fnal.gov/files/code/e875/general/ups/prd/sam/v7_7_1/Linux-2 /afs/fnal.gov/files/code/e875/general/ups/prd/sam/v8_1_3/Linux-2 /afs/fnal.gov/files/code/e875/general/ups/prd/sam/v8_2_0/Linux-2 /afs/fnal.gov/files/code/e875/general/ups/prd/sam_batch_adapter/v0_9_9_5/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_bootstrap/v4_4_1/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_config/v4_2_28/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_config/v4_2_34/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_cpp_api/v7_2_1/Linux-2-4-GCC-3-4-3 /afs/fnal.gov/files/code/e875/general/ups/prd/sam_cpp_api/v7_4_2/Linux-2-4-GCC-3-4-3 /afs/fnal.gov/files/code/e875/general/ups/prd/sam_cpp_api/v7_4_3/Linux-2-4-GCC-3-4-3 /afs/fnal.gov/files/code/e875/general/ups/prd/sam_ns_ior/v7_0_0/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_ns_ior/v7_1_0/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_web_services/v0_9_8/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_web_services/v0_9_9/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/sam_web_services_client/v0_9_2/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/samgrid_batch_adapter/v7_1_0/NULL /afs/fnal.gov/files/code/e875/general/ups/prd/vdt/v1_1_14_13/Linux /afs/fnal.gov/files/code/e875/general/ups/prd/vdt/v1_6_1_0/Linux /afs/fnal.gov/files/code/e875/general/ups/prd/vdt/v1_8_1_1/Linux /afs/fnal.gov/files/code/e875/releases/GENIE /afs/fnal.gov/files/code/e875/releases/LOG4CPP /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT /afs/fnal.gov/files/code/e875/releases/NEUGEN3 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 /afs/fnal.gov/files/code/e875/releases/stdhep BASE=/afs/fnal.gov/files/code/e875/releases 40/50G nb.minos.d133 /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_3/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/Linux2.4-GCC_3_4/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/tar_files /afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root.sh /afs/fnal.gov/files/code/e875/general/bin/config_build_root_minimal.sh /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_EXTERN/Linux2.4-GCC_3_2/v03/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_EXTERN/Linux2.4-GCC_4_1/v04/lib/libmyodbc3.so /afs/fnal.gov/files/code/e875/general/ups/prd/PYTHIA6/Linux2.4-GCC_3_4/v6_406/inc /afs/fnal.gov/files/code/e875/general/ups/prd/PYTHIA6/Linux2.4-GCC_3_4/v6_409/inc BASE=/afs/fnal.gov/files/code/e875/releases1 6/8 G (none) BASE=/afs/fnal.gov/files/code/e875/releases2 6/8 G /afs/fnal.gov/files/code/e875/general/ROOT/config_build_root.sh /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-02-00 /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-02-00-opt /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-04-02b /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-04-02b-opt /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-04-02f /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v4-04-02f-opt BASE=/afs/fnal.gov/files/code/e875/sim 7/8 G code.e875.sim /afs/fnal.gov/files/code/e875/general/minossoft/releases/development/BField/bfld_imap.C /afs/fnal.gov/files/data/minos/d1/nuflux/newfiles/nuhist_far.rz 
/afs/fnal.gov/files/data/minos/d1/nuflux/newfiles/nuhist_near_v2.rz /afs/fnal.gov/files/data/minos/d12/root_files/AAA_README.TXT /afs/fnal.gov/files/data/minos/d12/root_files/AAA_README.TXT.~1~ /afs/fnal.gov/files/data/minos/d12/root_files/AAA_README.TXT.~2~ /afs/fnal.gov/files/data/minos/d12/root_files/emu_tauCC.root /afs/fnal.gov/files/data/minos/d12/root_files/far_e_hitbits.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu-tau_v5.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu_801.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu_811.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu_899.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu_hitbits.root /afs/fnal.gov/files/data/minos/d12/root_files/far_mu_reco.root /afs/fnal.gov/files/data/minos/d12/root_files/far_muon_hitbits.root /afs/fnal.gov/files/data/minos/d12/root_files/far_muon_noelos.root /afs/fnal.gov/files/data/minos/d12/root_files/far_muon_noscat.root /afs/fnal.gov/files/data/minos/d12/root_files/far_nc_hitbits.root /afs/fnal.gov/files/data/minos/d12/root_files/far_tau_hitbits.root /afs/fnal.gov/files/data/minos/d12/root_files/far_tau_reco.root /afs/fnal.gov/files/data/minos/d12/root_files/overlay_ph2me.root /afs/fnal.gov/files/data/minos/d17/gnumi_flux /afs/fnal.gov/files/data/minos/d7/hitbits/far_e_hitbits.fz_gaf /afs/fnal.gov/files/data/minos/d7/hitbits/far_mu_hitbits.fz_gaf /afs/fnal.gov/files/data/minos/d7/hitbits/far_muon_hitbits.fz_gaf /afs/fnal.gov/files/data/minos/d7/hitbits/far_nc_hitbits.fz_gaf /afs/fnal.gov/files/data/minos/d7/hitbits/far_tau_hitbits.fz_gaf /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOGEANT.ddl /afs/fnal/files/code/e875/sim/hermes_db/hermes/ddl/GEOMETRY.ddl /afs/fnal/files/code/e875/sim/hermes_db/include/partap.inc BASE=/afs/fnal.gov/files/data/minos/d04/libraries 8/8 GB nb.data.minosd4 (none) MINOS26 > du -sm /afs/fnal.gov/files/code/e875/releases/* 600 /afs/fnal.gov/files/code/e875/releases/GENIE 91 /afs/fnal.gov/files/code/e875/releases/LOG4CPP 2048 /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN 20538 /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT 841 /afs/fnal.gov/files/code/e875/releases/NEUGEN3 183 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 13432 /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP 1687 /afs/fnal.gov/files/code/e875/releases/base_release_build 27 /afs/fnal.gov/files/code/e875/releases/stdhep ============================================================================= 2008 01 29 ########## # PARROT # ########## Trying again to build -f indexes for the export directories and verifying with a direct sha1sum AFSDIR=/afs/fnal.gov/files/data/minos/d04/libraries AFSDIR=/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL real 0m29.130s mv ${AFSDIR}/.growfschecksum ${AFSDIR}/.growfschecksumNL mv ${AFSDIR}/.growfsdir ${AFSDIR}/.growfsdirNL time make_growfs -k -f ${AFSDIR} ls -l /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL fails as before, but the .growfschecksum file contains the correct checksum. 
directory checksum is 8721a99dea7b07e57189f611263f24b5929528e8 That's nonsense, MINOS26 > curl http://www-numi.fnal.gov:80//computing/MINOS_EXTERNAL//.growfsdir -o /tmp/growdir % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 3991k 100 3991k 0 0 46.4M 0 --:--:-- --:--:-- --:--:-- 52.6M MINOS26 > sha1sum /tmp/growdir e59ac0b8ab17d5f94c2fd165012d2fc5192998b6 /tmp/growdir MINOS26 > sha1sum ${AFSDIR}/.growfsdir e59ac0b8ab17d5f94c2fd165012d2fc5192998b6 /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL/.growfsdir Trying again with -d all 1201646406.226614 [11797] parrot: http: GET http://www-numi.fnal.gov:80/computing/MINOS_EXTERNAL//.growfsdir HTTP/1.1 Host: www-numi.fnal.gov Cache-Control: max-age=0 1201646406.236623 [11797] parrot: http: HTTP/1.1 200 OK 1201646406.236650 [11797] parrot: http: Date: Tue, 29 Jan 2008 22:40:05 GMT 1201646406.236659 [11797] parrot: http: Server: Apache/1.3.37 (Unix) PHP/5.2.4 mod_layout/2.1 mod_fastcgi/2.4.2 mod_ssl/2.8.25 OpenSSL/0.9.8d 1201646406.236668 [11797] parrot: http: Last-Modified: Tue, 29 Jan 2008 22:29:49 GMT 1201646406.236676 [11797] parrot: http: ETag: "3c68a40e-3e5f9a-479fa8dd" 1201646406.236682 [11797] parrot: http: Accept-Ranges: bytes 1201646406.236688 [11797] parrot: http: Content-Length: 4087706 1201646406.236697 [11797] parrot: http: Content-Type: text/plain 1201646406.236709 [11797] parrot: http: 1201646406.236716 [11797] parrot: grow: loading filesystem directory... 1201646406.501638 [11797] parrot: tcp: disconnected from 131.225.70.20:80 1201646406.501759 [11797] parrot: grow: directory checksum is 8721a99dea7b07e57189f611263f24b5929528e8 Summary - the downloaded directory seems to have a bad checksum, although the original is fine. ######### # FNALU # ######### An update of CPU batch power, since last week's upgrades to flxb10/11/13/17/23/24/25 ( 10-13 are 1 GHz, counting as 1/3 core x 2 = 2/3 core, net 2 cores) HOSTS bsub -R Cores FLXB10-30 SL3 "linux24" 20 FLXB10-30 SL4 "linux24" 10 FLXB31-34 SL4 "linux26" 10 ssh to flxb17 hangs up ssh to flxb30 lacks AFS token ######## # GRID # ######## Trying to get 60 hour proxy, per chadwick advice MINOS25 > date Tue Jan 29 15:03:12 CST 2008 MINOS25 > d=kreymer/cron/minos25.fnal.gov MINOS25 > kcron -f TMXbT5oIyGkEaix7kSZvZg MINOS25 > kinit ${d} -k -t /var/adm/krb5/`kcron -f` MINOS25 > klist -f Ticket cache: /tmp/krb5cc_1060_YH8644 Default principal: kreymer/cron/minos25.fnal.gov@FNAL.GOV Valid starting Expires Service principal 01/29/08 15:04:38 01/30/08 01:04:38 krbtgt/FNAL.GOV@FNAL.GOV Flags: FIA 01/29/08 15:04:39 01/30/08 01:04:38 afs@FNAL.GOV Flags: FA MINOS25 > voms-proxy-init -noregen -voms fermilab:/fermilab -valid 60:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Done Warning: fg6x1.fnal.gov:15001: The validity of this VOMS AC in your proxy is shortened to 86400 seconds! Creating proxy .................................. Done Warning: your certificate and proxy will expire Tue Jan 29 23:30:06 2008 which is within the requested lifetime of the proxy MINOS25 > voms-proxy-info -all WARNING: Unable to verify signature! Server certificate possibly not installed. 
Error: Cannot find certificate of AC issuer for vo fermilab
subject : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy/CN=proxy
issuer : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy
identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy
type : unknown
strength : 512 bits
path : /tmp/x509up_u1060
timeleft : 8:22:37
=== VO fermilab extension information ===
VO : fermilab
subject : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer
issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov
attribute : /fermilab/Role=NULL/Capability=NULL
attribute : /fermilab/minos/Role=NULL/Capability=NULL
timeleft : 23:59:30
=============================================================================
2008 01 28
#######
# AFS #
#######
Scanning for AFS configurations :
for NODE in ${NODES} ; do printf "${NODE} "; ssh ${NODE} 'grep OPTIONS= /etc/sysconfig/afs'; done
minos01 OPTIONS=$LARGE
minos02 OPTIONS=$LARGE
...
minos25 OPTIONS=AUTOMATIC
minos26 OPTIONS=$LARGE
Checked dates and content,
for NODE in ${NODES} ; do printf "${NODE} "; ssh ${NODE} 'diff /etc/sysconfig/afs /minos/scratch/kreymer/sysafs'; done
< OPTIONS=AUTOMATIC
---
> OPTIONS=$LARGE
for NODE in ${NODES} ; do printf "${NODE} "; ssh ${NODE} 'ls -l /etc/sysconfig/afs'; done
minos01 -rw-r--r-- 1 root root 4724 Aug 20 13:03 /etc/sysconfig/afs
...
minos06 -rw-r--r-- 1 root root 4724 Aug 20 13:04 /etc/sysconfig/afs
...
minos11 -rw-r--r-- 1 root root 1922 Aug 21 15:59 /etc/sysconfig/afs
minos12 -rw-r--r-- 1 root root 4724 Aug 20 13:04 /etc/sysconfig/afs
...
minos25 -rw-r--r-- 1 root root 4727 Oct 19 10:52 /etc/sysconfig/afs
minos26 -rw-r--r-- 1 root root 4724 Aug 20 13:04 /etc/sysconfig/afs
##########
# CONDOR #
##########
Date: Mon, 28 Jan 2008 14:10:02 -0600
From: Cron Daemon
To: kreymer@fnal.gov
Subject: Cron ${HOME}/minos/scripts/condorglide
/bin/sh: /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorglide: No such file or directory
__________________________________________________
/var/log/messages is full of lines like
Jan 28 10:52:41 minos25 kernel: afs_NewVCache: warning none freed, using 300 of 300
Jan 28 10:52:41 minos25 kernel: afs_NewVCache - none freed
grep afs_NewVCache: /var/log/messages | wc -l
32355
grep -v afs_NewVCache /var/log/messages
Rates can be tens per second, or little gaps.
Skipping gaps under 5 minutes,
grep afs_NewVCache: /var/log/messages.2 | uniq | less
Date: Mon, 28 Jan 2008 14:49:29 -0600 (CST)
Subject: HelpDesk ticket 110193
___________________________________________
Short Description: AFS messages on minos25
Problem Description: run2-sys
I failed to access a file on AFS from minos25 today, as follows, at 14:10:02
/bin/sh: /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorglide: No such file or directory
Again at 16:12:04
In /var/log/messages, I see 32355 sets of messages starting at Jan 28 10:52:41 and continuing through 14:18:54 :
Jan 28 10:52:41 minos25 kernel: afs_NewVCache: warning none freed, using 300 of 300
Jan 28 10:52:41 minos25 kernel: afs_NewVCache - none freed
Similar problems are seen in earlier messages.N files, for N in 2,3,4
The messages can come as slowly as every minute or so, or at tens per second.
Here is a summary of lines from existing messages files, where I have omitted any messages coming at intervals under 5 minutes, so that we can see the time periods of interest.
I do not see any such messages on the other nodes minos01 through minos24. messages.4 Jan 1 03:05:59 Jan 1 03:07:33 Jan 1 03:12:42 Jan 1 03:12:44 Jan 2 11:54:58 Jan 2 11:56:38 messages.3 Jan 8 13:33:08 Jan 8 13:39:57 Jan 11 11:43:21 Jan 11 11:52:57 messages.2 Jan 17 12:44:22 Jan 17 12:52:22 Jan 17 13:06:34 Jan 17 13:06:44 messages Jan 28 10:52:41 Jan 28 11:01:13 Jan 28 13:40:47 Jan 28 13:43:18 Jan 28 14:06:02 Jan 28 14:18:54 Jan 28 15:49:33 Jan 28 15:54:46 Jan 28 16:01:01 Jan 28 16:01:04 Jan 28 16:11:53 Jan 28 16:29:28 Jan 28 17:03:23 Jan 28 17:08:39 __________________________________________ Date: Mon, 28 Jan 2008 15:10:59 -0600 (CST) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 29 Jan 2008 12:11:04 -0600 (CST) Solution: ettab@fnal.gov sent this solution: I am not sure how the /etc/sysconfig/afs file got modified. I have made the requested change and restarted afsd. ___________________________________________ N.B. I see no afs_NewVCache messages past the 12:07 restart of AFS with OPTIONS=$LARGE Updated MINOS status page ########### # MINOS03 # ########### Date: Mon, 28 Jan 2008 14:52:00 -0600 (CST) Subject: HelpDesk ticket 110194 ___________________________________________ Short Description: Cannot ssh to minos03 Problem Description: run2-sys : I cannot log into minos03 via ssh, but can reach the rest of the Minos Cluster. I can log in with kerberized rsh. MIN > date Mon Jan 28 20:45:54 UTC 2008 MIN > ssh -v minos03 OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos03 [131.225.193.3] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 debug1: identity file /home/kreymer/.ssh/id_dsa type -1 ssh_exchange_identification: Connection closed by remote host ___________________________________________ Date: Mon, 28 Jan 2008 15:11:08 -0600 (CST) This ticket has been reassigned to BURNS, ETTA of the CD-SF/FEF Group. ___________________________________________ Date: Tue, 29 Jan 2008 09:06:57 -0600 (CST) Solution: ettab@fnal.gov sent this solution: The sshd daemon was restarted. ============================================================================= 2008 01 25 ####### # SAM # ####### MC examples for vahle SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_00 and VERSION cedar.phy and RUN_NUMBER 1024 " SAMDIM=" DATA_TIER sntp-near and MC.RELEASE daikon_00 and MC.BEAM L010185N and VERSION cedar.phy and RUN_NUMBER in 1024,1025,1026 " ########## # PARROT # ########## upd install -j cern 2004 upd install -j oracle_client v10_1_0_2_0b time make_growfs -k ${AFSB}/general/products real 0m44.483s setup_minos ERROR: Found no match for product 'oracle_tnsnames' ERROR: Action parsing failed on "unsetuprequired(oracle_tnsnames)" explicitly setting up GCC3_4_3 version of GEANT INFORMATIONAL: Product 'geant' (with qualifiers 'GCC3_4_3'), has no current chain (or may not exist) upd install -j oracle_tnsnames ups declare -c oracle_tnsnames v48 -f NULL upd install -j geant v3_21_14a -f Linux+2.6 ups declare -c geant v3_21_14a -f Linux+2.6 FNGP-OSG > setup_minos No default SAM configuration exists at this time. 
MINOSSOFT release "development" SRT_SUBDIR=Linux2.6-GCC_3_4 ROOT=trunk EXTERN=v03 CONFIG=v01 bash: child setpgid (9107 to 9106): Operation not permitted setup "test" version of LABYRINTH [ linux , FNALU ] setup NEUGEN3 development explicitly setting up GCC3_4_3 version of GEANT using PYTHIA6 (v6_412) for LUND bash: child setpgid (9566 to 9565): Operation not permitted bash: child setpgid (9743 to 9742): Operation not permitted bash: child setpgid (10201 to 10200): Operation not permitted Still lacking loon and root. /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT -> /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT Symlinks are not being followed Try this with symlinks, getting into a loop under /afs/fnal.gov/files/code/e875/general/products/prd/MINOS_ROOT/Linux2.4-GCC_3_2/v3-05-05/v3_05_05 ls -l /afs/fnal.gov/files/code/e875/general/products/prd/MINOS_ROOT/Linux2.4-GCC_3_2/v3-05-05/v3_05_05 lrwxr-xr-x 1 rhatcher e875 68 Feb 14 2006 /afs/fnal.gov/files/code/e875/general/products/prd/MINOS_ROOT/Linux2.4-GCC_3_2/v3-05-05/v3_05_05 -> /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05 rm /afs/fnal.gov/files/code/e875/general/products/prd/MINOS_ROOT/Linux2.4-GCC_3_2/v3-05-05/v3_05_05 time make_growfs -k -f ${AFSB}/general/products real 10m37.602s user 0m51.874s sys 4m4.926s We need to serve general/ROOT and general/MINOS_EXTERNAL These have links to d04/libraries time make_growfs -k -f ${AFSB}/general/ROOT Oops, needed cleanup/removal of one loop /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 links to /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05 but /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 links back to /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05 rm /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 rm /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 MINOS26 > dds /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 lrwxr-xr-x 1 rhatcher e875 68 Jul 17 2003 /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 -> /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/ MINOS26 > dds /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 lrwxr-xr-x 1 rhatcher e875 68 Jul 17 2003 /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 -> /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/ MINOS26 > rm /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 MINOS26 > rm /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 rm: cannot remove `/afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05': No such file or directory MINOS26 > dds /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 ls: /afs/fnal.gov/files/code/e875/releases2/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05: No such file or directory MINOS26 > dds /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05 ls: /afs/fnal.gov/files/code/e875/general/ROOT/Linux2.4-GCC_3_2/v3_05_05/v3_05_05: No such file or directory time make_growfs -k -f ${AFSB}/general/ROOT real 2m27.806s time make_growfs -k -f ${AFSB}/general/MINOS_EXTERNAL real 0m34.504s ln -s /afs/fnal.gov/files/code/e875/general/ROOT ROOT ln -s /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL MINOS_EXTERNAL ln -s 
/afs/fnal.gov/files/data/minos/d04/libraries d04libs FNGP-OSG > setup_minos 1201288429.716438 [21203] parrot: fatal : directory and checksum still inconsistent after 60 seconds 1201288429.716711 [21203] parrot: notice: received signal 15 (Terminated), killing all my children... 1201288429.716887 [21203] parrot: notice: sending myself 15 (Terminated), goodbye! Oops, problem with products directory, try again without -f That builds quickly, but now symlinks are not being followed Try a rebuild again with symlinks : real 11m2.270s ls -l /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT 1201290364.388026 [9257] parrot: fatal : directory and checksum still inconsistent after 60 seconds ls -l /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/ GRRRRRR - I seem to be hitting a fundamental problem, after a very promising start. Symlinks do not seem to be followed when I do not use -f, and make_growfs -f seems to produce .growfsdir files with incorrect checksums. Let's try a smaller directory, near the base of the chain, time make_growfs -k -f /afs/fnal.gov/files/data/minos/d04/libraries FNGP-OSG > ls -l /afs/fnal.gov/files/data/minos/d04/libraries total 6 drwxr-xr-x 3 kreymer numi 2048 Mar 12 2003 DatabaseTables drwxr-xr-x 4 kreymer numi 2048 May 22 2003 IRIX6.5-GCC_3_2 drwxr-xr-x 4 kreymer numi 2048 May 22 2003 Linux2.4-GCC_3_2 Testing symlinks across served directories, they do not seem to work : ln -s /afs/fnal.gov/files/code/e875/general/products products ls -l /afs/fnal.gov/files/code/e875/sim/products/etc ls: /afs/fnal.gov/files/code/e875/sim/products/etc: No such file or directory PLAN - we need to merge these multivolume trees Look for links like : find /afs/fnal.gov/files/code/e875/general/products -type l -exec ls -l {} \; | cut -f 2 -d '>' ============================================================================= 2008 01 24 ########## # PARROT # ########## HOWTO.parrot - created export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 export PATH=${PARROT_DIR}/bin:${PATH} parrot -m ${PARROT_DIR}/mountfile.grow -d remote bash export MINOS_SETUP_DIR=/afs/fnal.gov/files/code/e875/general/minossoft/setup setup_minos() { . $MINOS_SETUP_DIR/setup_minossoft_FNALU.sh $* } setup_minos 1201210251.123664 [14892] parrot: grow: open http://www-numi.fnal.gov:80/computing/minossoft///setup/setup_minossoft_FNALU_parser.sh 1201210252.401424 [15100] parrot: notice: cannot execute the program /usr/sbin/userhelper because it is setuid. execl() error, errno=13 FNGP-OSG > ls -alF /usr/sbin/userhelper -rws--x--x 1 kreymer numi 31560 May 4 2007 /usr/sbin/userhelper* Hmmm, let's be more cautious , check some products Still some trouble, seem to have lost the definition of 'setup' FNGP-OSG > ups list -aK+ root WARNING: '/afs/fnal.gov/files/code/e875/general/ups/db' is not a directory OK, need to add the ups->products symlink to our exports ln -s /afs/fnal.gov/files/code/e875/general/products ups bash: /afs/fnal.gov/files/code/e875/sim/labyrinth/setup_labyrinth.sh: No such file or directory ln -s /afs/fnal.gov/files/code/e875/sim/labyrinth labyrinth MINOS26 > echo $AFSB /afs/fnal.gov/files/code/e875 pts adduser -user kreymer -group buckley:minsoft time make_growfs -k ${AFSB}/sim/labyrinth GRRRRRRR - cannot write into labyrinth, in spite of membership. 
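One likely explanation, and a quick way to check it (a sketch using standard OpenAFS client commands; LABDIR is just shorthand for the labyrinth path above, and the touch filename is a scratch name): a freshly granted pts group membership is generally not honored until fresh tokens are obtained, and the fileserver can cache the old membership for a while, so an aklog after the pts adduser is worth trying before giving up on writing here.
LABDIR=/afs/fnal.gov/files/code/e875/sim/labyrinth
fs listacl ${LABDIR}      # which users/groups actually hold rlidwk(a) on this directory
tokens                    # do we hold current afs@fnal.gov tokens ?
pts membership kreymer    # does the new buckley:minsoft membership show up ?
aklog                     # refresh tokens so the new membership can take effect
touch ${LABDIR}/.writetest && rm ${LABDIR}/.writetest   # minimal write test, scratch file name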
Back up 1 level, export sim as a whole ln -s /afs/fnal.gov/files/code/e875/sim sim time make_growfs -k ${AFSB}/sim real 0m23.568s Now, on setup_minos, ERROR: Action parsing failed on "SetupRequired(cern 2004)" GEANTINCS not in any location ... expect trouble loon is not happy. we need to stop using /afs/fnal.gov/ups But let's see if anything runs : FNGP-OSG > sam locate foo Datafile with name 'foo' not found. FNGP-OSG > sam locate F00031300_0000.mdaq.root ['/pnfs/minos/fardet_data/2005-04,1515@vo4245'] FNGP-OSG > sam get metadata --file=F00031300_0000.mdaq.root ImportedDetectorFile({ 'fileName' : 'F00031300_0000.mdaq.root', Excellent ! ########### # MINOS02 # ########### Cannot log into minos02 via ssh ( pawloski ) Ganglia shows it dead 04:30 through 08:30, and 12:00 to 12:50 today. for NODE in ${NODES} ; do printf "${NODE} " ; ssh -a ${NODE} pwd ; done minos01 /afs/fnal.gov/files/home/room1/kreymer minos02 ssh_exchange_identification: Connection closed by remote host minos03 /afs/fnal.gov/files/home/room1/kreymer ... NO ssh logins since last night. tail of /var/log/messages : Jan 23 22:25:30 minos02 sshd(pam_unix)[19570]: session opened for user grashorn by (uid=0) Jan 23 22:35:06 minos02 sshd(pam_unix)[19604]: session opened for user bspeak by (uid=0) Jan 24 05:51:24 minos02 xfs[4602]: re-reading config file Jan 24 05:51:24 minos02 xfs: xfs -USR1 succeeded Jan 24 05:51:24 minos02 xfs[4602]: ignoring font path element /usr/X11R6/lib/X11/fonts/100dpi:unscaled (unreadable) Jan 24 14:08:53 minos02 login: kreymer preauthenticated login on pts/3 from minos-93198.dhcp ___________________________________________ Date: Thu, 24 Jan 2008 14:27:43 -0600 (CST) Subject: HelpDesk ticket 110032 ___________________________________________ Short Description: Cannot log into minos02 via ssh Problem Description: run2-sys : We cannot log into minos02 via ssh. I am able to login via rsh. The other Minos Cluster nodes are unaffected. Here is an example : MIN > date Thu Jan 24 20:19:27 UTC 2008 MIN > ssh -v minos02 OpenSSH_3.9p1 NCSA_GSSAPI_20040818 KRB5, OpenSSL 0.9.7a Feb 19 2003 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to minos02 [131.225.193.2] port 22. debug1: Connection established. debug1: identity file /home/kreymer/.ssh/identity type -1 debug1: identity file /home/kreymer/.ssh/id_rsa type 1 debug1: identity file /home/kreymer/.ssh/id_dsa type -1 ssh_exchange_identification: Connection closed by remote host ___________________________________________ Date: Thu, 24 Jan 2008 14:37:54 -0600 (CST) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. ___________________________________________ Date: Thu, 24 Jan 2008 15:12:03 -0600 (CST) Solution: jonest@fnal.gov sent this solution: > I was able to restart sshd. It should be fine now. ___________________________________________ ============================================================================= 2008 01 23 ######## # FARM # ######## Purging old files / duplicates ./roundup -n -D -r cedar_phy_bhcurv far All the duplicates are pass 0 Contrary to the content of the READ and READ/SAM files, many of these files are NOT declared to SAM, and not in PNFS. Let's chew on this case, and develop the proper tools for cases where READ or READ/SAM are stale. 
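A sketch of one such tool, classifying each suspected duplicate by whether SAM and PNFS actually know about it. DUPFILES is a hypothetical input list of candidate file names (one per line), and the locate-output parsing follows the same cut idiom used elsewhere in this log; the real READ and SAM/READ bookkeeping is deliberately not consulted, since the point is to cross-check it.
DUPFILES=/tmp/cpbf.names          # hypothetical list, one file name per line
for FILE in `cat ${DUPFILES}` ; do
  SLOC=`sam locate ${FILE} 2>&1`  # grep the output rather than trust the exit status
  case "${SLOC}" in
    *"not found"*)
      printf "NOSAM        ${FILE}\n" ;;
    *)
      # sam locate prints something like ['/pnfs/minos/<path>,<tape info>']
      PDIR=`echo "${SLOC}" | cut -f 2 -d "'" | cut -f 1 -d ,`
      if [ -r "${PDIR}/${FILE}" ] ; then
        printf "SAM   PNFS   ${FILE}\n"
      else
        printf "SAM  NOPNFS  ${FILE}\n"
      fi ;;
  esac
done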
./roundup -n -D -r cedar_phy_bhcurv far | uniq > /tmp/cpbf.dup SRV1> grep DUPE /tmp/cpbf.dup | wc -l 321 ######### # FNALU # ######### Date: Wed, 23 Jan 2008 11:01:43 -0600 (CST) From: Margaret_Greaney To: minos-users@fnal.gov Cc: dss-est@fnal.gov Subject: status of batch node updates To all, The operating system updates on the fnalu batch nodes are progressing. Yesterday flxb10,11,13 were updated. This morning flxb17,23-25 were updated. If you are having any problems with these nodes please report them. The remainder will be updated as schedule permits. Date: Wed, 23 Jan 2008 18:31:58 +0000 (UTC) From: Arthur Kreymer To: Margaret_Greaney Cc: minos-users@fnal.gov, dss-est@fnal.gov Subject: Re: status of batch node updates On Wed, 23 Jan 2008, Margaret_Greaney wrote: ... > The operating system updates on the fnalu batch nodes are progressing. > Yesterday flxb10,11,13 were updated. This morning flxb17,23-25 were > updated. Simple LSF jobs seem to be failing for me on these nodes. I have tried all of 10, 11, 13, 17, 23, 24, 25 The same trivial job runs OK on unupgraded nodes like flxb16 and flxb26. For example MINOS26 > bsub -R flxb10 echo HELLO Job <120645> is submitted to default queue <30min>. MINOS26 > bjobs -l 120645 Job <120645>, User , Project , Status , Queue <30min>, Command Wed Jan 23 12:27:18: Submitted from host , CWD , Requested Resources ; Wed Jan 23 12:27:24: Started on ; Wed Jan 23 12:27:24: Exited with exit code 255. The CPU time used is 0.0 second s. SCHEDULING PARAMETERS: r15s r1m r15m ut pg io ls it tmp swp mem loadSched - 0.5 0.7 - - - - - - - - loadStop - 7.0 6.0 - - - - - - - - ----------------------------------------- Date: Thu, 24 Jan 2008 13:07:33 -0600 (CST) these nodes were missing a directory that does not come with the openafs rpm but was still on the slf3 nodes. Now that /usr/afsws/bin is in place and something called sbatchd restarted on nodes, they process jobs. ----------------------------------------- bsub -R flxb25 echo HELLO 10 OK 11 OK 13 OK 17 OK 23 OK 24 OK 25 OK These nodes allow interactive login now. Unavailable are 16 18 19 20 21 22 26 27 28 29 ########### # MINOS25 # ########### Date: Wed, 23 Jan 2008 10:34:20 -0600 (CST) Subject: HelpDesk ticket 109940 Short Description: minos25 in distress - may need reboot Problem Description: run2-sys : The minos25 system seems to be in distress. Yesterday, around 14:50, memory usage shot up, and almost all the CPU is in Wait CPU state, according to Ganglia. Condor jobs are not being started, and the condor_q command is failing. The immediate cause may be a set of five 'loon' jobs running under brebel. These are each using a large amount of memory, 1/3 to 3/4 GB each. Brian is unable to kill these, which are all in 'D' state according to top. run2-sys : Please kill these brebel processes if you can. If that does not work, and you cannot restore normal operation, please reboot minos25. ------------------------------------------------ Date: Wed, 23 Jan 2008 10:52:53 -0600 (CST) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. ------------------------------------------------ Date: Wed, 23 Jan 2008 11:23:00 -0600 (CST) Subject: Help Desk Ticket 109940 Has Been Resolved. ------------------------------------------------ Condor recovered, and several old jobs are still running. My glidein tests cleared quickly. 
condor_q kreymer -- Submitter: minos25.fnal.gov : <131.225.193.25:63984> : minos25.fnal.gov ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 27469.0 kreymer 1/22 18:50 0+00:00:00 I 0 9.8 probe 0 0 here are 27471.0 kreymer 1/22 20:00 0+00:00:00 I 0 9.8 probe 0 0 here are 27472.0 kreymer 1/22 20:10 0+00:00:00 I 0 9.8 probe 0 0 here are 27473.0 kreymer 1/22 20:20 0+00:00:00 I 0 9.8 probe 0 0 here are 27474.0 kreymer 1/22 20:30 0+00:00:00 I 0 9.8 probe 0 0 here are 27475.0 kreymer 1/22 20:40 0+00:00:00 I 0 9.8 probe 0 0 here are 27476.0 kreymer 1/22 20:50 0+00:00:00 I 0 9.8 probe 0 0 here are 27477.0 kreymer 1/22 21:00 0+00:00:00 I 0 9.8 probe 0 0 here are 27478.0 kreymer 1/22 21:10 0+00:00:00 I 0 9.8 probe 0 0 here are 27479.0 kreymer 1/22 21:20 0+00:00:00 I 0 9.8 probe 0 0 here are 27480.0 kreymer 1/22 21:30 0+00:00:00 I 0 9.8 probe 0 0 here are 27481.0 kreymer 1/22 21:40 0+00:00:00 I 0 9.8 probe 0 0 here are 27482.0 kreymer 1/22 21:50 0+00:00:00 I 0 9.8 probe 0 0 here are 27483.0 kreymer 1/22 22:00 0+00:00:00 I 0 9.8 probe 0 0 here are 27484.0 kreymer 1/22 22:10 0+00:00:00 I 0 9.8 probe 0 0 here are 27485.0 kreymer 1/22 22:20 0+00:00:00 I 0 9.8 probe 0 0 here are 27486.0 kreymer 1/22 22:30 0+00:00:00 I 0 9.8 probe 0 0 here are 27487.0 kreymer 1/22 22:40 0+00:00:00 I 0 9.8 probe 0 0 here are 27488.0 kreymer 1/22 22:50 0+00:00:00 I 0 9.8 probe 0 0 here are 27489.0 kreymer 1/22 23:00 0+00:00:00 I 0 9.8 probe 0 0 here are 27490.0 kreymer 1/22 23:10 0+00:00:00 I 0 9.8 probe 0 0 here are 27491.0 kreymer 1/22 23:20 0+00:00:00 I 0 9.8 probe 0 0 here are 27492.0 kreymer 1/22 23:30 0+00:00:00 I 0 9.8 probe 0 0 here are 27493.0 kreymer 1/22 23:40 0+00:00:00 I 0 9.8 probe 0 0 here are 27494.0 kreymer 1/22 23:50 0+00:00:00 I 0 9.8 probe 0 0 here are 27502.0 kreymer 1/23 11:20 0+00:00:00 I 0 9.8 probe 0 0 here are 27504.0 kreymer 1/23 11:30 0+00:00:00 I 0 9.8 probe 0 0 here are 27505.0 kreymer 1/23 11:40 0+00:00:00 I 0 9.8 probe 0 0 here are 27506.0 kreymer 1/23 11:50 0+00:00:00 I 0 9.8 probe 0 0 here are ########## # CONDOR # ########## No kreymer jobs submitted between 16:50 and 18:50 yesterday, then job 27469 at 18:50 MINOS25 > condor_q -- Failed to fetch ads from: <131.225.193.25:64973> : minos25.fnal.gov Ganglia monitoring for the cluster shows a sharp drop around 17:00 on 22 Jan, a little blip up around 18:50, then continued low process count. On minos26, memory used ramped up starting at 16:30, reached 2.3 GB around 16:50. Wait CPU spiked to 100% around 14:50. I see no swap being used. 
top - 10:08:43 up 96 days, 11 min, 8 users, load average: 17.10, 16.97, 16.44 Tasks: 261 total, 1 running, 259 sleeping, 0 stopped, 1 zombie Cpu(s): 0.1% us, 0.2% sy, 0.0% ni, 11.7% id, 87.9% wa, 0.0% hi, 0.0% si Mem: 4151264k total, 4099084k used, 52180k free, 111472k buffers Swap: 4192944k total, 208k used, 4192736k free, 1223368k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME #C COMMAND 4031 brebel 18 0 1051m 791m 40m D 0 19.5 25:25 1 loon 4055 brebel 18 0 1050m 788m 40m D 0 19.5 24:46 0 loon 4525 brebel 18 0 545m 400m 40m D 0 9.9 20:41 1 loon 4734 brebel 18 0 443m 325m 40m D 0 8.0 15:24 3 loon 4723 brebel 18 0 442m 325m 40m D 0 8.0 15:24 0 loon 5023 condor 16 0 32364 27m 3452 S 0 0.7 56:39 2 condor_negotiat 5022 condor 15 0 23592 17m 3480 S 0 0.4 115:11 0 condor_collecto 593 gfactory 16 0 46484 17m 2816 S 0 0.4 247:39 3 python 32587 gfronten 16 0 15288 11m 1560 S 0 0.3 29:47 2 python 2804 ntp 16 0 5472 5472 3524 S 0 0.1 0:11 2 ntpd ########## # PARROT # ########## MINOS26 > make_growfs -h Use: /grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4/bin/make_growfs [options] Where options are: -f Follow symbolic links. -v Give verbose messages. -K Create checksums for files. (default enabled) -k Disable checksums for files. -h Show this help file. 15:03 UTC time make_growfs -f -k ${AFSB}/releases1 real 0m36.613s user 0m4.067s sys 0m18.874s parrot -m ${PARROT_DIR}/mountfile.grow -d remote bash ls /afs/fnal.gov/files/code/e875/releases1 ... directory checksum is dc880108e9c28b1ab8e411629ed503fbf790924d actual checksum is a3edbef5da19cb66c6604b4c27dea1b1c669dd96 1201101290.221921 [2200] parrot: grow: loading filesystem directory... 1201101290.520844 [2200] parrot: grow: directory checksum is dc880108e9c28b1ab8e411629ed503fbf790924d 1201101290.521023 [2200] parrot: grow: fetching checksum from wget --no-cache -q -O /tmp/grow.checksum.1060.30772 http://www-numi.fnal.gov:80//computing/releases1//.growfschecksum 1201101290.573307 [2200] parrot: grow: actual checksum is a3edbef5da19cb66c6604b4c27dea1b1c669dd96 1201101290.590914 [2200] parrot: fatal : directory and checksum still inconsistent after 60 seconds 1201101290.591300 [2200] parrot: notice: received signal 15 (Terminated), killing all my children... 1201101290.591576 [2200] parrot: notice: sending myself 15 (Terminated), goodbye! Terminated Trying this without following symlinks make_growfs -k ${AFSB}/releases1 OK, can see this directory how, checksum matches time make_growfs -k ${AFSB}/releases2 real 1m5.835s user 0m5.574s sys 0m18.773s time make_growfs -k ${AFSB}/releases real 8m48.406s user 0m44.452s sys 2m31.912s time make_growfs -k ${AFSB}/general/products real 0m34.637s user 0m2.962s sys 0m10.926s Along the way, needed to create symlinks : minossoft -> /afs/fnal.gov/files/code/e875/general/minossoft/ products -> /afs/fnal.gov/files/code/e875/general/products/ releases -> /afs/fnal.gov/files/code/e875/releases/ releases1 -> /afs/fnal.gov/files/code/e875/releases1/ releases2 -> /afs/fnal.gov/files/code/e875/releases2/ MIN > du -sm */.growfsdir 1 dh/.growfsdir 67 minossoft/.growfsdir 5 products/.growfsdir 67 releases/.growfsdir 5 releases1/.growfsdir 9 releases2/.growfsdir ============================================================================= 2008 01 22 ######## # FARM # ######## Purging old files / duplicates SRV1> cp -a AFSS/roundup.20080110 . 
SRV1> ln -sf roundup.20080110 roundup ./roundup -D -r cedar_phy_bhcurv mcfar ./roundup -f 5 -r cedar_phy_bhcurv mcfar cleared out files hanging around since 10/26 ./roundup -f 10 -r cedar_phy_bhcurv mcnear cleared out files from Dec ########## # PARROT # ########## Checking sizes and permissions of the interesting paths GB Path under /afs/fnal.gov/files/code/e875/ 3 general/products minos rlidwka 7 general/minossoft 40 releases 6 releases1 6 releases2 rhatcher:minsoft rlidwk buckley:minsoft rlidwka MINOS26 > pts membership rhatcher:minsoft Members of rhatcher:minsoft (id: -1397) are: buckley kreymer rhatcher AFSB=/afs/fnal.gov/files/code/e875 MIN > find /afs/fnal.gov/files/code/e875/general/minossoft -type d | wc -l 55286 21:06 UTC time make_growfs ${AFSB}/general/minossoft ... entering dir /afs/fnal.gov/files/code/e875/general/minossoft/srt entering dir /afs/fnal.gov/files/code/e875/general/minossoft/temp real 396m3.524s user 24m46.764s sys 170m36.388s FNGP-OSG > export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 FNGP-OSG > export PATH=${PARROT_DIR}/bin:${PATH} FNGP-OSG > parrot -m ${PARROT_DIR}/mountfile.html -d remote bash FNGP-OSG > time ls -l /afs/fnal.gov/files/code/e875/general/minossoft/packages ... 1201100190.657741 [11726] parrot: grow: loading filesystem directory... ... real 0m35.371s user 0m0.000s sys 0m0.001s MINOS26 > ls -alF /afs/fnal.gov/files/code/e875/general/minossoft/.grow* -rw-r--r-- 1 kreymer g020 44 Jan 22 21:41 /afs/fnal.gov/files/code/e875/general/minossoft/.growfschecksum -rw-r--r-- 1 kreymer g020 69866912 Jan 22 21:41 /afs/fnal.gov/files/code/e875/general/minossoft/.growfsdir MINOS26 > du -sm /afs/fnal.gov/files/code/e875/general/minossoft/.growfsdir 67 /afs/fnal.gov/files/code/e875/general/minossoft/.growfsdir ######### # MYSQL # ######### http://dev.mysql.com/doc/refman/5.0/en/binary-log.html mysqld options that control binlog : --log-bin[=base_name] --binlog-do-db=db_name --binlog-ignore-db=db_name these affect only access via the default USE database mysqlbinlog will display the logs the binary log format is different in MySQL 5.0 from previous versions of MySQL, due to enhancements in replication. I have added to /data/database/my.cnf --binlog-do-db = crl_v1 --binlog-do-db = offline See LOG.mysql ######### # MYSQL # ######### 17:47 UTC Pushing recent BINLOGS to /minos/data/mysql/BINLOGS, so we can clear space on mysql1 local disk time rsync -r ${DBBINS}/ ${DBCOPB} --perms --times --size-only -v building file list ... 
done ./ minos.000357 minos.000358 minos.index sent 2110642300 bytes received 80 bytes 21212486.23 bytes/sec total size is 140733284935 speedup is 66.68 real 1m38.983s user 0m24.538s sys 0m14.512s mysql -u root offline PURGE MASTER LOGS BEFORE DATE_SUB( NOW( ), INTERVAL 3 DAY); Query OK, 0 rows affected (2 min 22.27 sec) PURGE MASTER LOGS BEFORE DATE_SUB( NOW( ), INTERVAL 2 DAY); Query OK, 0 rows affected (1 min 27.75 sec) EXIT; Mysql> dds /data/archive/BINLOG/ total 9410228 drwxr-xr-x 2 minsoft e875 4096 Jan 22 11:52 ./ drwxr-xr-x 5 minsoft e875 4096 Jan 15 14:57 ../ -rw-rw---- 1 minsoft e875 1073744708 Jan 21 08:57 minos.000350 -rw-rw---- 1 minsoft e875 1073747386 Jan 21 08:58 minos.000351 -rw-rw---- 1 minsoft e875 1073743411 Jan 21 08:59 minos.000352 -rw-rw---- 1 minsoft e875 1073749504 Jan 21 09:00 minos.000353 -rw-rw---- 1 minsoft e875 1073743433 Jan 21 09:02 minos.000354 -rw-rw---- 1 minsoft e875 1073746201 Jan 21 09:03 minos.000355 -rw-rw---- 1 minsoft e875 1073743213 Jan 21 09:09 minos.000356 -rw-rw---- 1 minsoft e875 1073743727 Jan 21 21:40 minos.000357 -rw-rw---- 1 minsoft e875 1036638377 Jan 22 11:48 minos.000358 -rw-rw---- 1 minsoft e875 306 Jan 22 11:52 minos.index Mysql> df -h /data Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 87G 132G 40% /data ============================================================================= 2008 01 21 MLKJ Holiday ####### # AFS # ####### MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Jan " | grep -v Tokens | uniq'; done minos03 Jan 21 10:24:54 minos03 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 21 10:26:19 minos03 kernel: afs: file server 131.225.68.49 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Jan 21 18:06:28 minos03 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 21 18:08:43 minos03 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos19 Jan 20 12:30:42 minos19 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 20 12:32:06 minos19 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos21 Jan 21 08:57:32 minos21 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 21 08:59:19 minos21 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos26 Jan 20 15:31:24 minos26 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 20 15:32:35 minos26 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) ########### # MINOS01 # ########### The backlogged, apparently failed email from minos01 seems to be making its way out, and is being delivered. ######### # MYSQL # ######### 16:20 Pushing recent BINLOGS to /minos/data/mysql/BINLOGS, so we can clear space on mysql1 local disk time rsync -r ${DBBINS}/ ${DBCOPB} --perms --times --size-only -v minos.000240 ... 
minos.000356 minos.000357 minos.index sent 126405334115 bytes received 2400 bytes 21332433.81 bytes/sec total size is 139384494903 speedup is 1.10 real 98m45.365s user 23m25.373s sys 12m41.155s mysql -u root offline PURGE MASTER LOGS BEFORE DATE_SUB( NOW( ), INTERVAL 5 DAY); EXIT; ####### # NET # ####### Date: Mon, 21 Jan 2008 15:59:22 -0600 (CST) Subject: HelpDesk ticket 109827 Short Description: fnsrv1 not providing name service Problem Description: fnsrv1 ( 131.225.17.150 ) seems not to be providing name services. MIN > nslookup 131.225.17.150 fnsrv0.fnal.gov Server: fnsrv0.fnal.gov Address: 131.225.8.120#53 150.17.225.131.in-addr.arpa name = fnsrv1.fnal.gov. MIN > nslookup 131.225.17.150 fnsrv1.fnal.gov ;; connection timed out; no servers could be reached This caused service failures on nodes such as minos01, which were incorrectly configured to use only nameserver fnsrv1, since sometime Saturday 19 Jan 2008. Oops, never mind. While I was submitting this ticket, service seems to have been restored. MIN > nslookup 131.225.17.150 fnsrv1.fnal.gov Server: fnsrv1.fnal.gov Address: 131.225.17.150#53 150.17.225.131.in-addr.arpa name = fnsrv1.fnal.gov. MIN > date Mon Jan 21 21:56:32 UTC 2008 And the failing services on minos01 are working again. So consider this an informational report. Strange that one of the primary nameservers would be down so long without NGOP generating a helpdesk ticket. ------------------------------------------- Date: Tue, 22 Jan 2008 09:52:29 -0600 (CST) Note To Requester: tang@fnal.gov sent this Notes To Requester: > Resolved. Rebooted server. ########### # MINOS01 # ########### Justin Evans reports lack of CVS commit email to MSD, since Friday 18 Jan 23:43. Date: Mon, 21 Jan 2008 09:01:42 -0600 (CST) Subject: HelpDesk ticket 109825 Short Description: minos01 cannot send email Problem Description: run2-sys : Outgoing email from minos01.fnal.gov seems to have stopped working sometime after Friday 18 Jan 23:43 . For example, the following produces no received mail : echo TESTDIRECTMAIL | /bin/mail -s "DIRECTMAILTEST" kreymer@fnal.gov The same command sends mail from other nodes such as minos02 and minos26. Per bv (viren) suggestion, MINOS01 > echo TEST | /bin/mail -v -s TEST kreymer@fnal.gov /afs/fnal.gov/files/home/room1/kreymer/outbox: Permission denied fnal.gov: Name server timeout kreymer@fnal.gov... Transient parse error -- message queued for future delivery kreymer@fnal.gov... queued Indeed, the nameserver capability is hosed, MINOS01 > host www.fnal.gov ;; connection timed out; no servers could be reached MINOS01 > cat /etc/resolv.conf search fnal.gov nameserver 131.225.17.150 MINOS02 > cat /etc/resolv.conf search fnal.gov nameserver 131.225.8.120 nameserver 131.225.227.254 nameserver 131.225.17.150 MIN > cat /etc/resolv.conf ; generated by /sbin/dhclient-script search fnal.gov dhcp.fnal.gov nameserver 131.225.17.150 nameserver 131.225.8.120 The problems seems to be with 131.225.17.150, fnsrv1.fnal.gov MIN > nslookup 131.225.17.150 Server: 131.225.8.120 Address: 131.225.8.120#53 150.17.225.131.in-addr.arpa name = fnsrv1.fnal.gov. 
MIN > nslookup 131.225.17.150 fnsrv1.fnal.gov ;; connection timed out; no servers could be reached MINOS26 > cat > /minos/scratch/kreymer/resolv.conf search fnal.gov nameserver 131.225.8.120 nameserver 131.225.227.254 nameserver 131.225.17.150 MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh -ax ${NODE} 'diff /etc/resolv.conf /minos/scratch/kreymer/resolv.conf'; done minos01 aklog: Couldn't get fnal.gov AFS tickets: aklog: Cannot resolve network address for KDC in requested realm while getting AFS tickets 1a2,3 > nameserver 131.225.8.120 > nameserver 131.225.227.254 minos02 minos03 ... minos11 1d0 < ; generated by /sbin/dhclient-script ... minos24 2a3,4 > nameserver 131.225.227.254 > nameserver 131.225.17.150 MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh -ax ${NODE} 'diff -q /etc/resolv.conf /minos/scratch/kreymer/resolv.conf || cat /etc/resolv.conf'; done minos01 aklog: Couldn't get fnal.gov AFS tickets: aklog: Cannot resolve network address for KDC in requested realm while getting AFS tickets Files /etc/resolv.conf and /minos/scratch/kreymer/resolv.conf differ search fnal.gov nameserver 131.225.17.150 ... minos11 Files /etc/resolv.conf and /minos/scratch/kreymer/resolv.conf differ ; generated by /sbin/dhclient-script search fnal.gov nameserver 131.225.8.120 nameserver 131.225.227.254 nameserver 131.225.17.150 ... minos24 Files /etc/resolv.conf and /minos/scratch/kreymer/resolv.conf differ search fnal.gov nameserver 131.225.8.120 To : HelpDesk Cc : minos-admin@fnal.gov Attchmnt: Subject : Re: HelpDesk ticket 109825 ----- Message Text ----- On Mon, 21 Jan 2008, HelpDesk wrote: <-- # @@@ Enter Update below this line. @@@ # --> Thanks to bv ( Brett Viren ) for suggesting use of mail -v for diagnostics. MINOS01 > echo TEST | /bin/mail -v -s TEST kreymer@fnal.gov /afs/fnal.gov/files/home/room1/kreymer/outbox: Permission denied fnal.gov: Name server timeout ... The nameserver fnsrv1 = 131.225.17.150 is not working, and minos01 is configured to use only this server. minos01, minos11 and minos24 all have nonstandard /etc/resolv.conf This is probably also causing the problem on mino01 with AFS tokens, reported in helpdesk ticket 109813 . Date: Tue, 22 Jan 2008 09:40:22 -0600 (CST) This ticket has been reassigned to GRAHAM, SETH of the CD-SF/FEF Group. Date: Tue, 22 Jan 2008 10:21:01 -0600 (CST) Solution: sether@fnal.gov sent this solution: I just sent a test mail from minos01 and it was sent properly. Whatever was wrong appears to have cleared up. If you're still seeing problems, let me know. N.B. 
/etc/resolv.conf is updated now ####### # CVS # ####### Noted many changed to CVSROOT configuration files, not committed to CVS -r--r--r-- 1 minoscvs e875 1026 Jan 11 11:28 verifymsg -r--r--r-- 1 minoscvs e875 879 Jan 11 11:28 taginfo -r--r--r-- 1 minoscvs e875 649 Jan 11 11:28 rcsinfo -r--r--r-- 1 minoscvs e875 266 Jan 11 11:28 numisoft.list -r--r--r-- 1 minoscvs e875 564 Jan 11 11:28 notify -r--r--r-- 1 minoscvs e875 109 Jan 11 11:28 neugen3.list -r--r--r-- 1 minoscvs e875 12977 Jan 11 11:28 modules -r--r--r-- 1 minoscvs e875 58 Jan 11 11:28 minospub.list -r--r--r-- 1 minoscvs e875 1717 Jan 11 11:28 loginfo -r--r--r-- 1 minoscvs e875 82 Jan 11 11:28 labyrinth.list -r--r--r-- 1 minoscvs e875 796 Jan 11 11:28 framework.list -r--r--r-- 1 minoscvs e875 1025 Jan 11 11:28 editinfo -r--r--r-- 1 minoscvs e875 753 Jan 11 11:28 cvswrappers -r-xr-xr-x 1 minoscvs e875 695 Jan 11 11:28 cvs.log -r--r--r-- 1 minoscvs e875 364 Jan 11 11:28 config -r--r--r-- 1 minoscvs e875 803 Jan 11 11:28 commitinfo -r--r--r-- 1 minoscvs e875 585 Jan 11 11:28 checkoutlist -r-xr-xr-x 1 minoscvs e875 101985 Jan 11 11:28 check_access,v -r-xr-xr-x 1 minoscvs e875 16251 Jan 11 11:28 check_access Odd, I seem to be fingered here , in cvshlog : Fri Jan 11 11:27:38 2008 (kreymer@(null)) : cvsh -c cvs server [sk] Fri Jan 11 11:28:33 2008 (kreymer@(null)) : cvsh -c cvs server [sk] grep -v 'cvsh \-' cvshlog Sat Dec 1 22:33:35 2007 (rhatcher@(null)) : -cvsh [sk] Tue Dec 4 16:10:26 2007 (rhatcher@(null)) : -cvsh [sSk] Mon Jan 7 17:20:51 2008 (kreymer@(null)) : -cvsh [sk] Thu Jan 10 13:18:50 2008 (kreymer@(null)) : -cvsh [sk] Mon Jan 21 07:48:38 2008 (kreymer@(null)) : -cvsh [sk] Mon Jan 21 07:49:19 2008 (kreymer@(null)) : -cvsh [sk] Mon Jan 21 08:23:47 2008 (kreymer@(null)) : -cvsh [sk] Mon Jan 21 09:04:56 2008 (kreymer@(null)) : -cvsh [sk] ============================================================================= 2008 01 19 Date: Sat, 19 Jan 2008 18:48:05 -0600 (CST) Subject: HelpDesk ticket 109813 Short Description: minos01 cannot issue afs tokens Problem Description: run2-sys : Starting sometime today, Saturday 19 January 2008, it seems that node minos01 can no longer issue AFS tokens via aklog. There is no such problem on the other minos02 through minos26 nodes, or on the FNALU interactive nodes. The klog command, using a password, works OK on minos01. For example, MINOS01 > date Sat Jan 19 18:36:22 CST 2008 MINOS01 > klist -f Ticket cache: /tmp/krb5cc_1060_jq7966 Default principal: kreymer@FNAL.GOV Valid starting Expires Service principal 01/19/08 18:23:17 01/20/08 04:17:56 krbtgt/FNAL.GOV@FNAL.GOV renew until 01/26/08 18:15:24, Flags: FfRA MINOS01 > aklog -d Authenticating to cell fnal.gov (server fsus01.fnal.gov). We've deduced that we need to authenticate to realm FNAL.GOV. Getting tickets: afs/@FNAL.GOV Kerberos error code returned by get_cred: -1765328165 aklog: Couldn't get fnal.gov AFS tickets: aklog: unknown RPC error (-1765328165) while getting AFS tickets MINOS01 > klog Password: MINOS01 > tokens Tokens held by the Cache Manager: User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Jan 25 19:19] --End of list-- ----------------------------------------- Date: Tue, 22 Jan 2008 10:30:42 -0600 (CST) Solution: sether@fnal.gov sent this solution: This appears to be working now. We noticed that /etc/resolv.conf only had one name server listed. Having problems with both afs and sendmail could have been caused by the name server being down. I added two more fnal dns servers to the file to prevent this happening again. 
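For future reference, a quick check of every nameserver a node is configured to use (a sketch; www.fnal.gov is just a convenient name to resolve, and host -W sets a short timeout). This could be wrapped in the usual for NODE in ${NODES} ssh loop to survey the whole cluster and catch single-nameserver configurations before they bite again.
for NS in `grep '^nameserver' /etc/resolv.conf | awk '{print $2}'` ; do
  printf "${NS} "
  if host -W 3 www.fnal.gov ${NS} > /dev/null 2>&1 ; then
    printf "OK\n"
  else
    printf "FAIL\n"
  fi
done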
============================================================================= 2008 01 18 ########## # PARROT # ########## mindata : mkdir /grid/app/minos/parrot cd /grid/app/minos/parrot curl http://www.cse.nd.edu/~ccl/software/files/cctools-2_4_0-i686-linux-2.4.tar.gz \ -o cctools-2_4_0-i686-linux-2.4.tar.gz tar xzvf cctools-2_4_0-i686-linux-2.4.tar.gz export PARROT_DIR=/grid/app/minos/parrot/cctools-2_4_0-i686-linux-2.4 export PATH=${PARROT_DIR}/bin:${PATH} USAGE Before setting up grow indexes, parrot -M \ /afs/fnal.gov/files/code/e875/releases=/http/www-numi.fnal.gov/computing/releases \ bash cat /afs/fnal.gov/files/code/e875/releases/GENIE/Linux2.4-GCC_3_4/checkout.sh or parrot -m ${PARROT_DIR}/mountfile.html /bin/bash cat /afs/fnal.gov/files/code/e875/releases/GENIE/Linux2.4-GCC_3_4/checkout.sh or parrot -m ${PARROT_DIR}/mountfile.grow [ -d all ] [ -d remote ] mountfile.glow : /afs/fnal.gov/files/code/e875/general/minossoft /grow/www-numi.fnal.gov/computing/minossoft /afs/fnal.gov/files/code/e875/general/products /grow/www-numi.fnal.gov/computing/products /afs/fnal.gov/files/code/e875/releases /grow/www-numi.fnal.gov/computing/releases /afs/fnal.gov/files/code/e875/releases1 /grow/www-numi.fnal.gov/computing/releases1 /afs/fnal.gov/files/code/e875/releases2 /grow/www-numi.fnal.gov/computing/releases2 mountile.html : /afs/fnal.gov/files/code/e875/general/minossoft /html/www-numi.fnal.gov/computing/minossoft /afs/fnal.gov/files/code/e875/general/products /html/www-numi.fnal.gov/computing/products /afs/fnal.gov/files/code/e875/releases /html/www-numi.fnal.gov/computing/releases /afs/fnal.gov/files/code/e875/releases1 /html/www-numi.fnal.gov/computing/releases1 /afs/fnal.gov/files/code/e875/releases2 /html/www-numi.fnal.gov/computing/releases2 PROXY export HTTP_PROXY squid.fnal.gov GROW make_growfs seems to properly walk the tree, but not follow symlinks, making .growfschecksum .growfsdir GRRRRRRRR the mountfiles just are not working, for anything but anonftp, on fngp-osg Let's try GROW on a smaller bit of the web : MINOS26 > make_growfs /afs/fnal.gov/files/expwww/numi/html/computing/dh entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/db entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/badfiles entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/protons entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/pnfslog entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/pnfslog/2005 entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/pnfslog/2005/11 entering dir /afs/fnal.gov/files/expwww/numi/html/computing/dh/pnfslog/2005/12 ... FNGP-OSG > parrot -M /dh=/grow/www-numi.fnal.gov/computing/dh/ -d remote bash Both of these work ####### # AFS # ####### Per a Ray Pasetes phone conversation earlier today : All Minos AFS file systems are now on AFS servers with upgraded software and firmware. We have seen no AFS timeouts on the Minos Cluster since 15 January ! ######## # FARM # ######## Need to clean up after repeated /m/d failures, esp. 
Dec 13, in 2007-12/cedarfar.log /pnfs/minos/reco_far/cedar/sntp_data/2007-12 F00040057_0000.all.sntp.cedar.0.root F00040057_0000.spill.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/.bntp_data/2007-12 F00040057_0000.spill.bntp.cedar.0.root AFSS/dc2nfs -n -d reco_far/cedar/sntp_data/2007-12 SRV1> chmod 775 /minos/data/reco_far/cedar/sntp_data/* $ AFSS/dc2nfs.20080118 -d reco_far/cedar/sntp_data/2007-12 22/ 66 /minos/data/reco_far/cedar/sntp_data 2384 F00040054_0000.spill.sntp.cedar.0.root dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar/sntp_data/2007-12/F00040057_0000.all.sntp.cedar.0.root /minos/data/reco_far/cedar/sntp_data/2007-12/F00040057_0000.all.sntp.cedar.0.root 23/ 66 /minos/data/reco_far/cedar/sntp_data 2384 F00040057_0000.all.sntp.cedar.0.root dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar/sntp_data/2007-12/F00040057_0000.spill.sntp.cedar.0.root /minos/data/reco_far/cedar/sntp_data/2007-12/F00040057_0000.spill.sntp.cedar.0.root 66/ 66 /minos/data/reco_far/cedar/sntp_data 2384 F00040124_0000.spill.sntp.cedar.0.root $ AFSS/dc2nfs.20080118 -n -d reco_far/cedar/.bntp_data/2007-12 11/ 33 /minos/data/reco_far/cedar/.bntp_data 2383 F00040054_0000.spill.bntp.cedar.0.root dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar/.bntp_data/2007-12/F00040057_0000.spill.bntp.cedar.0.root /minos/data/reco_far/cedar/.bntp_data/2007-12/F00040057_0000.spill.bntp.cedar.0.root 33/ 33 /minos/data/reco_far/cedar/.bntp_data 2383 F00040124_0000.spill.bntp.cedar.0.root Oops, should I have just moved the concatenated file from farcat ? Yup, checking and removing the extra copies, on fnpcsrv1 cd /minos/data/minfarm/WRITE MDP=/minos/data/reco_far/cedar/sntp_data/2007-12 FILE=F00040057_0000.all.sntp.cedar.0.root ls -l ${FILE} ${MDP}/${FILE} diff ${FILE} ${MDP}/${FILE} rm ${MDP}/${FILE} mv ${FILE} ${MDP}/${FILE} ln -s ${MDP}/${FILE} ${FILE} FILE=F00040057_0000.spill.sntp.cedar.0.root ls -l ${FILE} ${MDP}/${FILE} diff ${FILE} ${MDP}/${FILE} rm ${MDP}/${FILE} mv ${FILE} ${MDP}/${FILE} ln -s ${MDP}/${FILE} ${FILE} MDP=/minos/data/reco_far/cedar/.bntp_data/2007-12 FILE=F00040057_0000.spill.bntp.cedar.0.root ls -l ${FILE} ${MDP}/${FILE} mv ${FILE} ${MDP}/${FILE} ln -s ${MDP}/${FILE} ${FILE} ############## # MINOS_DATA # ############## cedar sntp/bntp total is 30991 from CFLSUM, /M/D near sntp 11837 /M/D far sntp wc -l /afs/fnal.gov/files/data/minos/d10/indexes/*_near.cedar.index 12220 wc -l /afs/fnal.gov/files/data/minos/d10/indexes/*_far.cedar.index 16478 ####### # WEB # ####### Updated dhmain.html from dhmain.20071019.html to dhmain.20080118.html pointing the the numi08 new elog ######## # GRID # ######## Date: Fri, 18 Jan 2008 09:33:54 -0600 (CST) From: Steven Timm To: fermigrid-users@fnal.gov Subject: GP Grid job evictions. There were a large number of jobs from many different users that were evicted from the farms last night. We are currently investigating why this happened. Once we get it fixed the jobs will restart with no further action requrired on your part. ########## # CONDOR # ########## At midnight, condor glideins started queuing up midnight, cleared at 07:00 Condorview looks pretty normal ############# # CHECKLIST # ############# minos-sam01 - ganglia stops around 23:30 yesterday GANGLIA for minos cluster shows hosts dropping off starting around midnight. Nothing is really wrong with the hosts. Same behaviour for Minos Cluster and Minos Server nodes, but Minos DB looks OK. 
########### # GANGLIA # ########### Date: Fri, 18 Jan 2008 08:35:27 -0600 (CST) Subject: HelpDesk ticket 109750 Short Description: Ganglia monitoring for Minos Cluster and Minos Server nodes is broken Problem Description: run2-sys : Starting around 23:30 yesterday 17 Jan 2008, the Ganglia monitoring at http://rexganglia2.fnal.gov/minos started gradually losing contact with the monitored nodes. Ganglia now claims that all the Minos Cluster and most Server nodes are down. The monitored nodes are in fact up, and seem to be acting normally. Ganglia monitoring of the MINOS DB nodes minosora1 and minosora3 is normal, as well as monitoring of minos-mysql1 in the Minos Server group. ------------------------------------------- Date: Fri, 18 Jan 2008 08:42:23 -0600 (CST) This ticket has been reassigned to JONES, TERRY of the CD-SF/FEF Group. ------------------------------------------- Date: Fri, 18 Jan 2008 09:19:53 -0600 (CST) Note To Requester: jonest@fnal.gov sent this Notes To Requester: The d0om server has lost its network. this could be the problem. > Or, just a distractionj ------------------------------------------- Date: Fri, 18 Jan 2008 13:25:18 -0600 (CST) Solution: jonest@fnal.gov sent this solution: > Ganglia data is collecting again. ============================================================================= 2008 01 17 ########## # CONDOR # ########## Email from chadwick ( -> minosadmin ) The limit is 2.5 days (60 hours) - here is how to get a robot proxy with this lifetime (thanks to Matt Crawford for his assistance). -Keith. [chadwick@fermigrid0 ~]$ p=chadwick/cron/fermigrid0.fnal.gov [chadwick@fermigrid0 ~]$ kinit $p -k -t /var/adm/krb5/`kcron -f` [chadwick@fermigrid0 ~]$ klist Ticket cache: /tmp/krb5cc_1021_P31493 Default principal: chadwick/cron/fermigrid0.fnal.gov@FNAL.GOV Valid starting Expires Service principal 01/17/08 16:39:26 01/18/08 02:39:26 krbtgt/FNAL.GOV@FNAL.GOV renew until 01/20/08 04:39:26 [chadwick@fermigrid0 ~]$ voms-proxy-init -noregen -voms fermilab:/fermilab -userconf $HOME/vomses/vomses.voms -valid 60:0 ####### # POT # ####### Updated current and historic links to POT plots ####### # WEB # ####### Around 14:00, stopped getting service from www-numi.fnal.gov Can ping the host : MINOS26 > ping -c 3 expwww17.fnal.gov PING expwww17.fnal.gov (131.225.70.20) 56(84) bytes of data. 64 bytes from expwww17.fnal.gov (131.225.70.20): icmp_seq=0 ttl=125 time=0.442 ms 64 bytes from expwww17.fnal.gov (131.225.70.20): icmp_seq=1 ttl=125 time=0.487 ms 64 bytes from expwww17.fnal.gov (131.225.70.20): icmp_seq=2 ttl=125 time=0.460 ms Helpdesk tickets 109719, 109720 issued by NGOP around 20:03 UTC Several other pages are down, including COUPP , Miniboone, SDSS 15:02 - COUPP and SDSS are back 15:05 - NUMI is back The problem was network routing, per ticket 109719 ####### # WEB # ####### Received request from lauram to test Apache 2 servers during beta period ending Jan 23. Forwarded the email to minos-admin, Liz forwarded it to Cat. 
In /etc/hosts or /C:\WINDOWS\system32\drivers\etc\hosts do something like 131.225.70.203 webstats.fnal.gov 131.225.70.203 computing.fnal.gov 131.225.70.203 cdorg.fnal.gov 131.225.70.203 www-numi.fnal.gov 131.225.70.202 www.fnal.gov ############ # RELEASES # ############ R1.28 tagging has begun, with ROOT 5.18/00 ########## # CONDOR # ########## GPfarm condor upgrade to 6.9.5 and OSG 0.8.0 scheduled 09:00 Completed around 12:00 For present, need to do condor_q -direct schedd pending a fix in the local configuration ######## # BMNT # ######## Plan to rename and remove bmnt and mrnt files for farcat 2915 11365 spill.bmnt.cedar_phy_bhcurv.0.root Per Howie, some of these runs have been done with mrnt/bmnt, and many with just the normal mrnt files. So I need a list of bmnt files, need to remove just those particular mrnt's. BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` PLAN : 0) Get file name list for bmnt 1) Get corresponding mrnt file list 2) Clean up the mrnt files a) move from /M/D/RF/CPB/mrnt_data to /M/D/RF/CPB/BMNT b) Remove the files from PNFS c) Set aside farm bookkeeping files READ, SAM/READ d) Undeclare from SAM 3) Rename the bmnt files to mrnt ------------------------- EXECUTION : 0) BMNT LIST - kreymer BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` printf "${BFILES}\n" | wc -w 2915 printf "${BFILES}\n" > /minos/scratch/kreymer/bmnt/BFILES BFILES=`cat /minos/scratch/kreymer/bmnt2/BFILES` BFILES runs from F00032481_0000.spill.bmnt.cedar_phy_bhcurv.0.root 2005-08 to F00038559_0023.spill.bmnt.cedar_phy_bhcurv.0.root 2007-07 1) MRNT LIST - kreymer/mindata/rubin/minfarm MRUNS=`printf "${BFILES}\n" | cut -f 1 -d _ | sort -u` printf "${MRUNS}\n" | wc -w 179 F00032481 F00032484 ... F00038556 F00038559 Rough check for _000000 subruns for MRUN in ${MRUNS} ; do sam locate ${MRUN}_0000.spill.mrnt.cedar_phy_bhcurv.0.root done Datafile with name 'F00038266_0000.spill.mrnt.cedar_phy_bhcurv.0.root' not found. Detailed check via SAM for MRUN in ${MRUNS} ; do RUN=`echo ${MRUN} | cut -c 5-` SAMDIM=" DATA_TIER mrnt-far and VERSION cedar.phy.bhcurv and PHYSICAL_DATASTREAM_NAME spill and RUN_NUMBER ${RUN} " sam list files --dim="${SAMDIM}" --nosummary done > /minos/scratch/kreymer/bmnt/MFILES wc -l /minos/scratch/kreymer/bmnt/MFILES 179 /tmp/MFILES grep -v '_0000' /minos/scratch/kreymer/bmnt/MFILES F00032481_0007.spill.mrnt.cedar_phy_bhcurv.0.root This makes sense I think. 
We picked up one non subrun 0 file, and there is one run missing, F00038266 MFILES=`cat /minos/scratch/kreymer/bmnt/MFILES` printf "${MFILES}\n" | wc -l 179 for MFILE in ${MFILES} ; do MON=`sam locate ${MFILE} | cut -f 7 -d / | cut -f 1 -d ,` printf "reco_far/cedar_phy_bhcurv/mrnt_data/${MON}/${MFILE}\n" \ | tee -a /minos/scratch/kreymer/bmnt/MFILEPS done MFILEPS=`cat /minos/scratch/kreymer/bmnt/MFILEPS` for MFILEP in ${MFILEPS} ; do ls -l /pnfs/minos/${MFILEP} ; done for MFILEP in ${MFILEPS} ; do ls -l /minos/data/${MFILEP} ; done for each account, did BFILES=`cat /minos/scratch/kreymer/bmnt/BFILES` MFILES=`cat /minos/scratch/kreymer/bmnt/MFILES` MFILEPS=`cat /minos/scratch/kreymer/bmnt/MFILEPS` 2a) /minos/data - minfarm@fnpcsrv1 for MFILEP in ${MFILEPS} ; do MFILER=`echo ${MFILEP} | sed s/mrnt_data/BMNT/g` MFILED=`dirname ${MFILER}` mkdir -p /minos/data/${MFILED} mv /minos/data/${MFILEP} /minos/data/${MFILER} done find /minos/data/reco_far/cedar_phy_bhcurv/BMNT -type f 179 2b) /pnfs/minos - rubin for MFILEP in ${MFILEPS} ; do #ls -l /pnfs/minos/${MFILEP} rm /pnfs/minos/${MFILEP} done 2c) READ, SAM/READ Do this as minfarm@fnpcsrv1 cd /export/stage/minfarm/ROUNDUP mkdir READBMNT for MFILE in ${MFILES} ; do #ls READ/SAM/${MFILE} mv READ/SAM/${MFILE} READBMNT/${MFILE} done 2d) SAM kreymer@minos26 for MFILE in ${MFILES} ; do sam undeclare file ${MFILE} done 3) rename bmnt - minfarm cd /minos/data/minfarm/farcat for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` [ -r ${MFILE} ] && ls -l ${MFILE} done One run, 38266, is missing subrun 11 in both mrnt and bmnt. -rw-rw-r-- 1 rubin numi 2959459 Dec 12 20:52 F00038266_0000.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2920808 Dec 12 20:49 F00038266_0001.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2822482 Dec 12 20:51 F00038266_0002.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2979061 Dec 12 20:52 F00038266_0003.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2977375 Dec 12 20:53 F00038266_0004.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2953857 Dec 12 20:53 F00038266_0005.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2846984 Dec 12 20:53 F00038266_0006.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2930329 Dec 12 20:53 F00038266_0007.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2909285 Dec 12 20:54 F00038266_0008.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 3030490 Dec 12 20:51 F00038266_0009.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2132720 Dec 12 20:50 F00038266_0010.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2326530 Dec 12 20:54 F00038266_0012.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2539648 Dec 12 20:53 F00038266_0013.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2943127 Dec 12 20:55 F00038266_0014.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2993096 Dec 12 20:55 F00038266_0015.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2720666 Dec 12 20:50 F00038266_0016.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2980667 Dec 12 20:56 F00038266_0017.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2799555 Dec 12 20:52 F00038266_0018.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 1961562 Dec 12 20:54 F00038266_0019.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2887540 Dec 12 20:55 F00038266_0020.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2966054 Dec 12 20:54 
F00038266_0021.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2945561 Dec 12 20:53 F00038266_0022.spill.mrnt.cedar_phy_bhcurv.0.root -rw-rw-r-- 1 rubin numi 2943774 Dec 12 21:04 F00038266_0023.spill.mrnt.cedar_phy_bhcurv.0.root Moved these files out of the way, mkdir /minos/data/minfarm/DUP/BMNT mv F00038266*mrnt*root /minos/data/minfarm/DUP/BMNT/ 14:36 for BFILE in ${BFILES} ; do MFILE=`echo ${BFILE} | sed s/bmnt/mrnt/g` mv ${BFILE} ${MFILE} done Later, 175 far mrnt's were added to PNFS and MD by roundup. ============================================================================= 2008 01 16 ####### # AFS # ####### MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Jan " | grep -v Tokens | grep Lost | grep 131.225 | uniq'; done minos03 Jan 15 20:10:43 minos03 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) minos17 Jan 14 11:19:24 minos17 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) minos23 Jan 13 20:37:53 minos23 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) ######## # BMNT # ######## Plan to rename and remove bmnt and mrnt files for farcat 2915 11365 spill.bmnt.cedar_phy_bhcurv.0.root SAMDIM=" DATA_TIER mrnt-far and VERSION cedar.phy.bhcurv and PHYSICAL_DATASTREAM_NAME spill " MINOS26 > sam list files --dim="${SAMDIM}" Files: F00031305_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00030618_0000.spill.mrnt.cedar_phy_bhcurv.0.root ... F00032638_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00032644_0000.spill.mrnt.cedar_phy_bhcurv.0.root File Count: 1371 Average File Size: 22.49MB Total File Size: 30.11GB Total Event Count: 282038696 This does not quite add up, as the mrnt's are concatenated, and there should be more bmnt's waiting for concatenation. sam list files --dim="${SAMDIM}" --nosummary | sort F00030612_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00030613_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00030614_0000.spill.mrnt.cedar_phy_bhcurv.0.root ... 
F00038553_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00038556_0000.spill.mrnt.cedar_phy_bhcurv.0.root F00038559_0000.spill.mrnt.cedar_phy_bhcurv.0.root MINOS26 > ls /minos/data/reco_far/cedar_phy_bhcurv/mrnt_data/ | sort 2005-03 2005-04 2005-05 2005-06 2005-07 2005-08 2005-09 2005-10 2005-11 2005-12 2006-01 2006-02 2006-05 2006-06 2006-07 2006-08 2006-09 2006-10 2006-11 2006-12 2007-01 2007-02 2007-03 2007-04 2007-05 2007-06 2007-07 2007-08 2007-09 check counts in /M/D/RF/CPB/ MINOS26 > find /minos/data/reco_far/cedar_phy_bhcurv/mrnt_data -type f | wc -l 1371 MINOS26 > find /minos/data/reco_far/cedar_phy_bhcurv/sntp_data -type f | wc -l 2767 MINOS26 > find /minos/data/reco_far/cedar_phy_bhcurv/.bntp_data -type f | wc -l 1400 Need to : 0) Get file name list for mrnt and bmtn 1) Set aside pnfs and MD mrnt files 2) Set aside farm bookkeeping files READ, SAM/READ 3) Undeclare the files from SAM 4) Rename the bmnt files to mrnt 0) SAMDIM=" DATA_TIER mrnt-far and VERSION cedar.phy.bhcurv and PHYSICAL_DATASTREAM_NAME spill " MFILES=`sam list files --dim="${SAMDIM}" --nosummary | sort` BFILES=`ls /minos/data/minfarm/farcat | grep bmnt | sort` MINOS26 > printf "${MFILES}\n" | wc -w 1371 MINOS26 > printf "${BFILES}\n" | wc -w 2915 MFILES runs from F00030612_0000.spill.mrnt.cedar_phy_bhcurv.0.root 2005-03 to F00038559_0000.spill.mrnt.cedar_phy_bhcurv.0.root 2007-07 BFILES runs from F00032481_0000.spill.bmnt.cedar_phy_bhcurv.0.root 2005-08 to F00038559_0023.spill.bmnt.cedar_phy_bhcurv.0.root 2007-07 1) mv /minos/data/reco_far/cedar_phy_bhcurv/mrnt_data \ /minos/data/reco_far/cedar_phy_bhcurv/mrnt_data_removed mv /pnfs/minos/reco_far/cedar_phy_bhcurv/mrnt_data \ /pnfs/minos/reco_far/cedar_phy_bhcurv/mrnt_data_removed 2) ... 3) ... ####### # AFS # ####### ############ # PREDATOR # ############ Near and Far genpy failed, at 11:06 and 11:10 . HOWTO.dccp - test succeeds, full speed. These files got picked up on the next cycle, at 13:06-ish ######### # MYSQL # ######### Monthly backups, per HOWTO.dbarchive.20080115 Shifted montly backups to archive subdirectory for MON in 20060418 20060421 20071218 ; do mv /minos/data/mysql/${MON} /minos/data/mysql/archive/${MON} ; done Started main copy around 11:00, informed the control room. Mysql> du -sm . 57955 . `DCS_HV.MYD' -> `/minos/data/mysql/archive/20080116/offline/DCS_HV.MYD' real 6m36.177s 13457 /minos/data/mysql/archive/20080116/offline/DCS_HV.MYD `PULSERGAIN.MYD' -> `/minos/data/mysql/archive/20080116/offline/PULSERGAIN.MYD' real 8m32.173s 14363 /minos/data/mysql/archive/20080116/offline/PULSERGAIN.MYD real 24m55.808s Net 40', had been 70' PULSERDRIFTPINVLD.MYD 21' md5sum was 31 53' gzip was 99 ( 41 CPU ) COPY TO DBCOPL real 9m44.978s user 0m3.115s sys 1m55.999s Copy binlog - these binlogs are unreasonably large ! Mysql> du -sm /data/archive/BINLOG/ 43581 /data/archive/BINLOG/ Jan 3 - 9 GB Jan 4 - 20 GB Jan 14 - 7 GB Jan 15 - 4 GB rsync : real 25m37.174s user 8m32.292s sys 4m32.641s ( corrected BINLOG copy to /minos/data/mysql/BINLOG from archive/BINLOG, rm -r BINLOG mv archive/BINLOG BINLOG ######### # MYSQL # ######### ln -sf HOWTO.dbarchive.20080115 HOWTO.dbarchive # was HOWTO.dbarchive.20070705 ########## # CONDOR # ########## The 7:50 glide jobs are still queued up, no glideins running. gfactory and gfrontend jobs are running. There are 300 rubin jobs running, 275 idle. That has probably pushed us aside for awhile. The backup cleared and the jobs ran round 12:00 . 
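The checksum and compression of PULSERDRIFTPINVLD.MYD in the MYSQL monthly backup above are logged only as timings. A minimal sketch of that step, assuming the archive layout used elsewhere in this entry; the actual commands in HOWTO.dbarchive.20080115 may differ.
ARCH=/minos/data/mysql/archive/20080116/offline
time md5sum ${ARCH}/PULSERDRIFTPINVLD.MYD > ${ARCH}/PULSERDRIFTPINVLD.MYD.md5   # record the checksum before compressing
time gzip ${ARCH}/PULSERDRIFTPINVLD.MYD                                         # leaves PULSERDRIFTPINVLD.MYD.gz in place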
######## # FARM # ######## Also removed the extra files from /minos/data/reco_far/cedar_phy_bhcurv/*_data/2007-08 /minos/data/reco_far/cedar_phy_bhcurv/*_data/2007-09 rm /minos/data/reco_far/cedar_phy_bhcurv/*_data/2007-09/*.root rm /minos/data/reco_far/cedar_phy_bhcurv/.*_data/2007-09/*.root rm /minos/data/reco_far/cedar_phy_bhcurv/*_data/2007-08/*.root rm /minos/data/reco_far/cedar_phy_bhcurv/.*_data/2007-08/*.root ########## # DCACHE # ########## The e907/mipp geant write backlog cleared out last night. ######### # MYSQL # ######### News storied report that Sun is purchasing Mysql. ============================================================================= 2008 01 15 ######## # FARM # ######## Date: Tue, 15 Jan 2008 15:16:43 -0600 From: Howard Rubin To: Art Kreymer Cc: Alex Sousa Subject: Files removed from enstore/dcache and /minos/data for 2007-08 and -09 Art, Here are the run numbers for which all SAM entries must be undeclared. I have deleted the files, including the hidden files, from /pnfs/minos/reco_far/cedar_phy_bhcurv/*_data/2007-0[89] *except* for candidates for F00038559 which complete the last run of 2007-07. Note that there are no ntuples for this run in 2007-08 because they were properly concatenated onto the 2007-07 runs. Howie F00038562 F00038565 F00038568 F00038571 F00038572 F00038575 F00038580 F00038585 F00038588 F00038591 F00038594 F00038597 F00038600 F00038603 F00038822 F00038825 F00038828 F00038869 F00038891 F00038893 F00038897 F00038902 F00038914 F00038918 F00038928 F00038869 F00038891 F00038893 F00038897 F00038902 F00038914 F00038918 F00038928 F00039044 F00039047 F00039050 F00039070 F00039281 F00039284 F00039306 F00039309 F00039312 F00039316 F00039334 F00039337 fnpcsrv1$ cat rm.2007-09 F00039337 F00039340 F00039345 F00039348 F00039349 F00039350 F00039353 F00039356 F00039359 F00039362 F00039571 F00039574 F00039577 F00039580 F00039583 F00039586 F00039589 F00039592 F00039595 F00039603 F00039607 F00039608 F00039610 F00039615 F00039618 F00039622 F00039625 F00039628 F00039631 F00039653 F00039676 F00039679 F00039682 F00039685 F00039688 F00039691 F00039694 F00039697 F00039700 F00039704 F00039707 F00039710 F00039713 F00039716 F00039719 STRM=cand SAMDIM=" DATA_TIER ${STRM}-far and VERSION cedar.phy.bhcurv and RUN_NUMBER > 38560 " sam list files --dim="${SAMDIM}" --nosummary | cut -f 1 -d _ | sort -u > /tmp/samcandlist ./samundeclare -n "${SAMDIM}" | wc -l 2077 MINOS26 > ./samundeclare "${SAMDIM}" Found 2075 files undeclared F00038562_0004.spill.cand.cedar_phy_bhcurv.0.root undeclared F00038562_0011.spill.cand.cedar_phy_bhcurv.0.root undeclared F00038562_0013.all.cand.cedar_phy_bhcurv.0.root ... undeclared F00039716_0003.spill.cand.cedar_phy_bhcurv.0.root undeclared F00039716_0021.spill.cand.cedar_phy_bhcurv.0.root undeclared F00039719_0002.all.cand.cedar_phy_bhcurv.0.root MINOS26 > sam list files --dim="${SAMDIM}" No files match the given constraints. STRM=bcnd File Count: 816 MINOS26 > ./samundeclare "${SAMDIM}" Found 816 files undeclared F00038562_0001.spill.bcnd.cedar_phy_bhcurv.0.root undeclared F00038565_0021.spill.bcnd.cedar_phy_bhcurv.0.root undeclared F00038568_0009.spill.bcnd.cedar_phy_bhcurv.0.root ... 
undeclared F00039713_0018.spill.bcnd.cedar_phy_bhcurv.0.root undeclared F00039713_0015.spill.bcnd.cedar_phy_bhcurv.0.root undeclared F00039716_0016.spill.bcnd.cedar_phy_bhcurv.0.root sam list files --dim="${SAMDIM}" Picking up sntp, bntp, mrnt files SAMDIM=" VERSION cedar.phy.bhcurv and RUN_NUMBER > 38560 " sam list files --dim="${SAMDIM}" File Count: 288 Average File Size: 121.48MB Total File Size: 34.17GB Total Event Count: 85685481 MINOS26 > ./samundeclare "${SAMDIM}" Found 288 files undeclared F00038562_0000.spill.bntp.cedar_phy_bhcurv.0.root undeclared F00038568_0000.all.sntp.cedar_phy_bhcurv.0.root undeclared F00038588_0000.spill.sntp.cedar_phy_bhcurv.0.root ... undeclared F00039710_0000.all.sntp.cedar_phy_bhcurv.0.root undeclared F00039713_0000.spill.bntp.cedar_phy_bhcurv.0.root undeclared F00039719_0000.spill.bntp.cedar_phy_bhcurv.0.root MINOS26 > sam list files --dim="${SAMDIM}" No files match the given constraints. Sent mail to minos_batch regarding this. ######### # MYSQL # ######### Preparing HOWTO.dbarchive.20080115 writing to /minos/data/mysql/archive... instead of /data/archive/... Cleaned up directory, cd /data/archive rm locate rmdir CP rmdir DUMP mv COPY/20071218 20071218 rmdir COPY ######### # MYSQL # ######### Heavy load on minos-mysql, varying, blocks of mininum x5, max x25 load Queries from flxb*, like select max(TIMEEND) from DCS_MAG_FARVLD where TIMEEND ... bjobs -u all -r bjobs -u all -r | cut -f 3 -d ' ' | sort -u jdejong jyuko pawlosk scavan sjc Mysql> mysqladmin processlist -u root | cut -f 6 -d ' ' | cut -f 1 -d ':' | sort -u flxb10.fnal.gov flxb11.fnal.gov flxb13.fnal.gov flxb18.fnal.gov flxb19.fnal.gov flxb20.fnal.gov flxb22.fnal.gov flxb24.fnal.gov flxb25.fnal.gov flxb26.fnal.gov flxb27.fnal.gov flxb30.fnal.gov flxb31.fnal.gov flxb32.fnal.gov flxb33.fnal.gov flxb34.fnal.gov flxi06.fnal.gov minos02.fnal.gov minos03.fnal.gov minos04.fnal.gov minos06.fnal.gov minos07.fnal.gov minos09.fnal.gov minos10.fnal.gov minos14.fnal.gov minos15.fnal.gov minos18.fnal.gov minos20.fnal.gov minos21.fnal.gov minos22.fnal.gov minos23.fnal.gov minos24.fnal.gov minos26.fnal.gov But none of the DCS_MAG_FARVLD queries are coming from minos* jobs. 
mysqladmin processlist -u root | grep DCS_MAG_FARVLD | cut -f 6 -d ' ' | cut -f 1 -d ':' | sort flxb10.fnal.gov flxb10.fnal.gov flxb11.fnal.gov flxb11.fnal.gov flxb13.fnal.gov flxb13.fnal.gov flxb18.fnal.gov flxb18.fnal.gov flxb19.fnal.gov flxb20.fnal.gov flxb22.fnal.gov flxb22.fnal.gov flxb24.fnal.gov flxb24.fnal.gov flxb26.fnal.gov flxb27.fnal.gov flxb27.fnal.gov flxb30.fnal.gov flxb31.fnal.gov flxb31.fnal.gov flxb32.fnal.gov flxb33.fnal.gov flxb34.fnal.gov flxb34.fnal.gov flxi06.fnal.gov flxi06.fnal.gov ########## # CONDOR ########## Created newer proxy for gfactory, SRV1> cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 800:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy \ -valid 800:0 Your proxy is valid until Sun Feb 17 16:59:24 2008 [gfactory@minos25 ~]$ cd .grid/ scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy \ kreymer-condor.proxy.20080217 cp -a kreymer-condor.proxy.20080217 kreymer-condor.proxy ============================================================================= 2008 01 14 ####### # WEB # ####### Date: Mon, 14 Jan 2008 13:25:49 -0600 From: John Inkmann To: Liz Buckley-Geer , kreymer@fnal.gov Cc: "webteam@fnal.gov" Subject: www-numi website Suspect files shared to system:anyuser The files listed below showed up on our scans for suspect files. Currently, it is being shared (to anyone) If the file is not needed, it can simply be deleted. Otherwise, issue the following commands, in the follow /usr/afsws/bin/fs sa -dir -acl lauram:expwwwmachine rl /usr/afsws/bin/fs sa -dir -acl system:anyuser none Either approach will prevent the file from showing up on next week's scan, after which, we usually make the Thanks for your attention in this matter. - see /home/kreymer/password.txt Date: Mon, 14 Jan 2008 13:40:04 -0600 (CST) From: Liz Buckley-Geer These are not password files. They are all copies of the same ROOT header file and it's associated dependency file. This file does not contain any passwords. It just happen to have the string Passwd in the name. They are visible by design. Liz ########## # VOMSES # ########## SRV1> find /export/stage/minfarm -name vomses find: /export/stage/minfarm/.grid/backup: Permission denied /export/stage/minfarm/homegrid/vdt-1.3.10/voms/etc/vomses ######### # FNALU # ######### Date: Mon, 14 Jan 2008 10:07:12 -0600 (CST) Subject: HelpDesk ticket 109510 Short Description: Mount /minos/data and /minos/scratch on flxi07 Problem Description: fnalu-admin : Please mount /minos/data and /minos/scratch on flxi05 and flxi07. flxi05 is needed to test 32 bit SLF 4.4 operations. flxi07 is needed so that we can transfer large files to AFS, and test 64bit kernel operations. The /minos areas are already mounted on flxi04 and flxi06 (SLF 3) -------------------------------------------- Date: Mon, 14 Jan 2008 10:20:51 -0600 (CST) This ticket has been reassigned to BAISLEY, WAYNE of the CD-LSCS/CSI/DSS/EST Group. -------------------------------------------- Date: Mon, 14 Jan 2008 10:32:11 -0600 (CST) Note To Requester: mgreaney@fnal.gov sent this Notes To Requester: Art, these mounts are done. -------------------------------------------- ########## # DCACHE # ########## Write pool backlog remains high, 2727 Queues, 145 Active as of 09:50. e907 ( Mipps ) total data usage is now 9940 6 TB 9940B 21 TB LTO-3 8 TB with several TB still queued in the writePool buffer. 
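Back to the DCS_MAG_FARVLD load check under 2008 01 15 above: the processlist pipeline there lists the client hosts one line per query. Adding uniq -c to the same pipeline (the only change here) gives a per-host query count, which makes the heavy clients easier to spot.
mysqladmin processlist -u root | grep DCS_MAG_FARVLD | cut -f 6 -d ' ' | cut -f 1 -d ':' | sort | uniq -c | sort -rn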
####### # AFS # ####### Testing large file access MIN > for NODE in ${UNODES} flxi07 ; do printf "${NODE} "; ssh ${NODE} 'ls -d /minos/data'; done flxi02 ls: /minos/data: No such file or directory flxi03 ls: /minos/data: No such file or directory flxi04 /minos/data flxi05 ls: /minos/data: No such file or directory flxi06 /minos/data flxi07 ls: /minos/data: No such file or directory MIN > for NODE in ${UNODES} flxi07 ; do printf "${NODE} "; ssh ${NODE} 'cat /etc/redhat-release'; done flxi02 Scientific Linux Fermi LTS release 4.4 (Wilson) flxi03 Scientific Linux Fermi LTS release 4.4 (Wilson) flxi04 Scientific Linux release 3.0.5 (Fermi) flxi05 Scientific Linux Fermi LTS release 4.5 (Wilson) flxi06 Scientific Linux release 3.0.5 (Fermi) flxi07 Scientific Linux Fermi LTS release 4.4 (Wilson) MIN > for NODE in ${UNODES} flxi07 ; do printf "${NODE} "; ssh ${NODE} 'cat /proc/cpuinfo | grep address | uniq'; done flxi02 address sizes : 36 bits physical, 48 bits virtual flxi03 address sizes : 36 bits physical, 48 bits virtual flxi04 flxi05 flxi06 flxi07 address sizes : 36 bits physical, 48 bits virtual MIN > for NODE in ${UNODES} flxi07 ; do printf "${NODE}\n"; ssh ${NODE} 'uname -a'; done flxi02 Linux flxi02.fnal.gov 2.6.9-55.0.9.ELsmp #1 SMP Fri Sep 28 09:24:48 CDT 2007 x86_64 x86_64 x86_64 GNU/Linux flxi03 Linux flxi03.fnal.gov 2.6.9-55.0.9.ELsmp #1 SMP Fri Sep 28 09:24:48 CDT 2007 x86_64 x86_64 x86_64 GNU/Linux flxi04 Linux flxi04.fnal.gov 2.4.21-32.0.1.ELsmp #1 SMP Wed May 25 15:42:26 CDT 2005 i686 i686 i386 GNU/Linux flxi05 Linux flxi05.fnal.gov 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 26 11:21:10 CDT 2007 i686 i686 i386 GNU/Linux flxi06 Linux flxi06.fnal.gov 2.4.21-47.ELsmp #1 SMP Thu Jul 20 09:54:04 CDT 2006 i686 i686 i386 GNU/Linux flxi07 Linux flxi07.fnal.gov 2.6.9-55.0.9.ELsmp #1 SMP Fri Sep 28 09:24:48 CDT 2007 x86_64 x86_64 x86_64 GNU/Linux FLXI02 > time sum /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz sum: /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz: Input/output error real 3m27.187s user 0m7.127s sys 0m4.183s FLXI03 > time sum /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz sum: /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz: Input/output error real 3m29.324s user 0m7.240s sys 0m4.183s FLXI07 > time sum /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz 51137 4755327 real 1m23.968s user 0m11.257s sys 0m15.685s ============================================================================= 2008 01 11 ######## # FARM # ######## Need to clean up after repeated /m/d failures, esp. Dec 13, in 2007-12/cedarfar.log /pnfs/minos/reco_far/cedar/sntp_data/2007-12 F00040057_0000.all.sntp.cedar.0.root F00040057_0000.spill.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/.bntp_data/2007-12 F00040057_0000.spill.bntp.cedar.0.root ########## # DCACHE # ########## Date: Fri, 11 Jan 2008 15:28:21 -0600 (CST) Subject: HelpDesk ticket 109468 Short Description: fndca read pools miserly with movers ? On Fri, 11 Jan 2008, J. Pedro Ochoa wrote: > Everything was working OK but since ~10.00 AM this morning it seems to just be stuck. > I am trying to get file > > reco_near/cedar_phy_bhcurv/sntp_data/2006-02/N00009767_0000.spill.sntp.cedar_phy_bhcurv.0.root That file is in DCache, in pool r-stkendca18a-6 . But that pool presently has 4 queued read requests, which is strange, as it should allow up to 50 reads, and only 2 are active. I'm reporting this to the experts. 
------------------------------------------ Date: Fri, 11 Jan 2008 15:33:26 -0600 (CST) This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. ------------------------------------------ ########## # CONDOR # ########## New gfactory processes disappear without running. That's because condorweb was removing the stage subdirectory, which exists only in AFS ( created by create_glidein ) Changed script to push only the monitor piece, per sfiligoi : Corrected around 14:25 CST ( 20:25 UTC ) LOCWG=/home/gfactory/web/monitor/ AFSWG=/afs/fnal.gov/files/expwww/numi/html/gfactory/monitor Note the trailing / on LOGWG, required by rsync /afs/fnal.gov/files/expwww/numi/html/gfactory/stage/glidein_t6 recreated at 20:32 UTC Previously, 23834.0 boehm 1/11 13:12 0+00:15:50 R 0 175.8 RunTemp01-11-08_13 23835.0 boehm 1/11 13:12 0+00:15:50 R 0 175.8 RunTemp01-11-08_13 23836.0 boehm 1/11 13:12 0+00:15:50 R 0 107.4 RunTemp01-11-08_13 23837.0 boehm 1/11 13:12 0+00:15:50 R 0 175.8 RunTemp01-11-08_13 23838.0 boehm 1/11 13:12 0+00:15:32 R 0 166.0 RunTemp01-11-08_13 ... Now 23834.0 boehm 1/11 13:12 0+01:20:07 I 0 175.8 RunTemp01-11-08_13 23835.0 boehm 1/11 13:12 0+01:20:03 I 0 175.8 RunTemp01-11-08_13 23836.0 boehm 1/11 13:12 0+01:20:07 I 0 107.4 RunTemp01-11-08_13 23837.0 boehm 1/11 13:12 0+01:20:08 I 0 175.8 RunTemp01-11-08_13 23838.0 boehm 1/11 13:12 0+01:19:51 I 0 166.0 RunTemp01-11-08_13 23839.0 boehm 1/11 13:12 0+00:00:00 I 0 9.8 RunTemp01-11-08_13 23839.0 boehm 1/11 13:12 0+00:00:00 I 0 9.8 RunTemp01-11-08_13 ... Jobs have ramped up, 14:57 cq gfactory 104 jobs; 0 idle, 104 running, 0 held 111 jobs; 7 idle, 104 running, 0 held 111 jobs; 0 idle, 111 running, 0 held ####### # AFS # ####### Nick reports 2 GB file size limit in AFS. /afs/fnal.gov/files/data/minos/d210/database_dumps/ Mysql> pwd /data/archive/COPY/20071218/offline 4649 MBytes in DCS_HV.MYD.gz Mysql> time cp DCS_HV.MYD.gz /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz cp: writing `/afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz': File too large real 3m6.080s user 0m0.565s sys 0m47.171s produced a 2 GB file http://www.openafs.org/pipermail/openafs-info/2006-March/021852.html Build it from source and use --enable-largefile-fileserver https://lists.openafs.org/pipermail/openafs-info/2002-September/005812.html 65K file per directory limit, translates to file name length limit Test on fsui03 fsui03 > cd /var/tmp fsui03 > scp -c blowfish minsoft@minos-mysql1:/data/archive/COPY/20071218/offline/DCS_HV.MYD.gz DCS_HV.MYD.gz Oops, the root partition did not have enough space ( 2.3 GB free ) Copy to /tmp instead fsui03 > time scp -c blowfish minsoft@minos-mysql1:/data/archive/COPY/20071218/offline/DCS_HV.MYD.gz /tmp/DCS_HV.MYD.gz real 14m8.760s user 0m4.180s sys 1m56.970s fsui03 > time cp /tmp/DCS_HV.MYD.gz /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz cp: /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz: File too large real 17m58.690s user 0m0.150s sys 4m22.220s produced no output file after the failure. 
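Worth noting before the next attempt: the copies are truncating right at the 2 GB ( 2^31 = 2147483648 byte ) boundary. A quick pre-check on the source file, an addition here and not from the log, flags at-risk copies before spending the transfer time.
FILE=/tmp/DCS_HV.MYD.gz
SIZE=`stat -c %s ${FILE}`                # size in bytes
[ ${SIZE} -ge 2147483648 ] && echo "OOPS - ${FILE} is 2 GB or larger, the AFS copy may truncate or fail"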
TRY AGAIN ON FLXI07, A 64BIT system FLXI07 > time scp -c blowfish minsoft@minos-mysql1:/data/archive/COPY/20071218/offline/DCS_HV.MYD.gz /tmp/DCS_HV.MYD.gz real 2m56.531s user 1m5.632s sys 0m49.417s FLXI07 > time cp /tmp/DCS_HV.MYD.gz /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz real 8m11.278s user 0m0.855s sys 1m6.878s ( a few minutes hangup after the copy seems complete, before exiting from cp ) Try this also on flxi05 FLXI05 > rpm -q openafs openafs-1.4.4-46.SL4 FLXI05 > time scp -c blowfish minsoft@minos-mysql1:/data/archive/COPY/20071218/offline/DCS_HV.MYD.gz /usr/scratch/sect1/DCS_HV.MYD.gz real 4m57.852s user 0m57.639s sys 0m46.515s FLXI05 > time cp /usr/scratch/sect1/DCS_HV.MYD.gz /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz2 cp: writing `/afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz2': File too large real 2m8.493s user 0m0.459s sys 0m37.696s FLXI05 > dds /afs/fnal.gov/files/data/minos/d88 total 6850453 drwxrwxrwx 3 root root 6144 Jan 11 17:32 ./ drwxr-xr-x 3 lisa g150 10240 Nov 1 15:28 ../ -rw-r--r-- 1 kreymer g020 4869454341 Jan 11 13:50 DCS_HV.MYD.gz -rw-r--r-- 1 kreymer g020 2145390592 Jan 11 17:34 DCS_HV.MYD.gz2 drwxr-xr-x 3 shanahan ktev 2048 Jan 2 11:04 ndphys/ FLXI05 > time cp /afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz /usr/scratch/sect1/DCS_HV.MYD.gzbig cp: reading `/afs/fnal.gov/files/data/minos/d88/DCS_HV.MYD.gz': Input/output error real 2m22.158s user 0m0.318s sys 0m14.797s FLXI07 > rpm -q openafs openafs-1.4.4-46.SL4.x86_64 Mysql> rpm -q openafs openafs-1.4.4-46.SL4 For reference, copy this test file to /minos/data/mysql/DCS_HV.MYD.gz Rate from mysql1 was about real 2m56.960s user 0m0.096s sys 0m13.086s 4870 mbytes/ 177 sec = 28. MB/sec Try, for reference, the uncompressed database table, as will do for backups : Mysql> pwd /data/database/offline dds DCS_HV.MYD 14110664704 Mysql> time cp DCS_HV.MYD /minos/data/mysql/DCS_HV.MYD real 6m51.467s user 0m0.266s sys 0m38.744s Rate 14111. MB/411 sec = 34 MB/sec ########## # DCACHE # ########## Date: Fri, 11 Jan 2008 10:22:50 -0600 (CST) Subject: HelpDesk ticket 109440 dcache-admin : Over the last few days, the DCache write queue has grown gradually to a peak of 5000, and seems to be still climbing. See http://fndca.fnal.gov/dcache/queue/allpools.jpg As before, this is shutting down Minos farm and data import activity. Also, the writePool group seems to be very close to capacity and in danger of filling. See pools w-stkendca[10,11,12]a-[4,5,6] at http://fndca3a.fnal.gov:2288/usageInfo The major write activity again seems to be to the e907 'geant' family, over 2 TBytes in the last few days. See http://www-stken.fnal.gov/enstore/burn-rate/CD-LTO3.e907.jpg This time the problem is not the large number of small files, but the total volume. Please take action to clear this backlog. ---------------------------------------------- Date: Fri, 11 Jan 2008 10:30:17 -0600 (CST) This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. ---------------------------------------------- On 2 January, the write pools contained 1/4 TB of e907 data files. http://www-numi.fnal.gov/computing/dh/datasets/2008/01/current.w.20080102 Today it has nearly 6 TBytes of e907 data files. 
http://www-numi.fnal.gov/computing/dh/datasets/2008/01/current.w.20080111 ============================================================================= 2008 01 10 ######## # FARM # ######## Cleaning up, first the easy stuff Four duplicates to move out of the way : ./roundup -D -r cedar_phy_oldbhcurv mcnear fails with messages mv n13035098_0004_L010185N_D03.sntp.cedar_phy_oldbhcurv.root ../minfarm/DUP/n13035098_0004_L010185N_D03.sntp.cedar_phy_oldbhcurv.root mv: cannot move `n13035098_0004_L010185N_D03.sntp.cedar_phy_oldbhcurv.root' to `../minfarm/DUP/n13035098_0004_L010185N_D03.sntp.cedar_phy_oldbhcurv.root': No such file or directory AFSS/roundup.20080110 -D -r cedar_phy_oldbhcurv mcnear That removed them cleanly. ########## # CONDOR # ########## Corrected the configuration of the gfactory to write to local web, with assistance from sfiligoi. Edited glideinWMS/creation/glideinWMS.xml This is used in the creation of a new glidein_t* configuration. Changed glidein factory_name and monitor base_dir [gfactory@minos25 creation]$ diff glideinWMS.xml glideinWMS.xml.save 1,2c1,2 < < --- > > ./create_glidein glideinWMS.xml cd ./start_factory.sh Note that all the older _tN with N less that 6 are now obsolete. Monitoring plots are at links like http://www-numi.fnal.gov/gfactory/monitor/glidein_t6/total/ ########## # CONDOR # ########## ln -s ../../../../../home/room1/kreymer/minos/HOWTO.condor HOWTO.condor This puts the HOWTO at http://www-numi.fnal.gov/computing/HOWTO.condor ########### # BLUEARC # ########### Date: Thu, 10 Jan 2008 13:31:14 -0600 (CST) Subject: HelpDesk ticket 109383 Short Description: Quota request for BlueArc served /minos/scratch, for tjyang Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user tjyang on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. ---------------------------------------------------- Date: Thu, 10 Jan 2008 13:36:07 -0600 (CST) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. ---------------------------------------------------- Date: Thu, 10 Jan 2008 14:11:04 -0600 (CST) Solution: Hi Art, minos-nas-0:/scratch/tjyang quota has been increased to 500GB ---------------------------------------------------- ########## # NEXSAN # ########## After firmware update, need to mindata@minos26 crontab crontab.dat minfarm@fnpcsrv1 mv /home/minfarm/ROUNTMP/NOCAT /home/minfarm/ROUNTMP/NOCAT.ok ########## # CONDOR # ########## Testing glideins after outage, seems OK now. Factories submitted at 08:39, started running at about 08:43 ######## # GRID # ######## diff /usr/local/vdt-1.8.1/glite/etc/vomses /minos/scratch/kreymer/VDT/glite/etc/vomses Hacking my copy, changed fermigrid2.fnal.gov to voms.fnal.gov changed CN=host/voms.fnal.gov to CN=http/voms.fnal.gov Still some residual changes. How is this file to be kept up to date ? Where is the DOE CA information kept ? scp fnpcsrv1:/usr/local/vdt-1.8.1/glite/etc/vomses vomses MINOS26 > vdt-version --show You have installed a subset of VDT version 1.8.1a: CA Certificates v31 (includes IGTF 1.17 CAs) Fetch CRL 2.6.2 GPT 3.2 MINOS26 > pacman -update CA-Certificates Update of [/minos/scratch/kreymer/VDT:http://vdt.cs.wisc.edu/vdt_181_cache:VDT-Common] found... Update of [/minos/scratch/kreymer/VDT:http://vdt.cs.wisc.edu/vdt_181_cache:VDT-Environment] found... Update of [/minos/scratch/kreymer/VDT:http://vdt.cs.wisc.edu/vdt_181_cache:VDT-Version-Info] found... 
Update of [/minos/scratch/kreymer/VDT:http://vdt.cs.wisc.edu/vdt_181_cache:CA-Certificates-Base] found... Update of [/minos/scratch/kreymer/VDT:http://vdt.cs.wisc.edu/vdt_181_cache:Licenses] found... Updating [VDT-Environment]... Downloading [vdt-environment-1-193.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-environment/1]... Updating [VDT-Version-Info]... Downloading [vdt-version-info-1.8.1-26.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-version-info/1.8.1]... Updating [VDT-Common]... Downloading [vdt-common-1-228.tar.gz] from [http://vdt.cs.wisc.edu/software//vdt-common/1]... Updating [CA-Certificates-Base]... Downloading [certificates-33-1.tar.gz] from [http://vdt.cs.wisc.edu/software/certificates/33]... Installing package [CA-Certificates-Base]. Downloading [certificates-install-4-256.tar.gz] from [http://vdt.cs.wisc.edu/software//certificates-install/4]... Updating [Licenses]... Downloading [licenses-1.8.1-12.tar.gz] from [http://vdt.cs.wisc.edu/software//licenses/1.8.1]... This did not help, as I am not using this vdt for srmcp on minos26. cd /home/minfarm/.grid mv certificates certificates.20070206 scp -r minfarm@fnpcsrv1:/local/ups/grid/globus/share/certificates certificates This fixed the srmtest problem under mindata. $ dds /minos/scratch/kreymer/VDT/globus/share/certificates-33-1 | wc -l 394 SRV1> dds /usr/local/vdt-1.8.1/globus/share/certificates-33-1 | wc -l 471 Need to track down and update or remove /export/stage/minfarm/homegrid/vdt-1.3.10 /grid/app/minos/VDT . ./setup.sh pacman -update CA-Certificates certificates-32-1 updated to certificates-33-1 ######## # GRID # ######## Helpdesk submission ( apparently not submitted ) Grid / Fermilab Sup Ctr High Kreymer DOE cert claims to have expired, is not expired Starting as early as 02:20 this morning, the Kreymer DOE grid certificate seems to be expired. This seen by SRM, and by the web browser certificate test at https://security.fnal.gov/cgi-bin/doetest/displaycert.cgi But the certificate is not due to expire until April 2008. And I can generate a grid proxy with the cert : SRV1> cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 800:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ ######## # GRID # ######## srmcp is failing now, claims cert is expired : SRMClientV1 : org.globus.common.ChainedIOException: Authentication failed [Caused by: Defective credential detected [Caused by: [JGLOBUS-96] Certificate "DC=org,DC=DOEGrids,OU=Certificate Authorities,CN=DOEGrids CA 1" expired]] SRV1> cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 800:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-test.proxy \ -valid 800:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Creating temporary proxy .............................................. Done Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Done Warning: fg6x1.fnal.gov:15001: The validity of this VOMS AC in your proxy is shortened to 86400 seconds! Creating proxy ................................ 
Done Your proxy is valid until Tue Feb 12 16:23:36 2008 ============================================================================= 2008 01 09 ############### # CONDORGLIDE # ############### Run this script via cron ( crontab.minos25) to keep the factory alive Logs go to condor/log/glide/ Had to add these, to get path to condor commands : source /usr/local/etc/setups.sh setup shrc source /etc/bashrc ########### # BLUEARC # ########### Date: Wed, 09 Jan 2008 15:43:44 -0600 (CST) Subject: HelpDesk ticket 109326 Short Description: Quota request for BlueArc served /minos/scratch, for rmehdi Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user rmehdi on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. --------------------------- Date: Wed, 09 Jan 2008 15:58:05 -0600 (CST) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. --------------------------- Date: Wed, 09 Jan 2008 16:22:56 -0600 (CST) Solution: joes@fnal.gov sent this solution: increased volume quota to 500GB (was 100GB) for minos-nas-0:/minos/scratch/rmehdi This ticket was resolved by SYU, JOSEPH of the CD-LSCS/CSI/CS/EST group. --------------------------- ########## # CONDOR # ########## [gfactory@minos25 ~]$ find . -type f -exec grep -q /afs/fnal.gov/files/expwww/numi {} \; -print ./glideinWMS/install/glideinWMS_install ./glideinWMS/creation/glideinWMS.xml.20071217 ./glideinWMS/creation/glideinWMS.xml ./glideinsubmit/glidein_t3/glideinWMS.xml ./glideinsubmit/glidein_t5/glideinWMS.xml ./glideinsubmit/glidein_t1/glideinWMS.xml ./glideinsubmit/glidein_t4/glideinWMS.xml ./glideinsubmit/glidein_t2/glideinWMS.xml ./.bash_history XMLS=' glideinWMS/creation/glideinWMS.xml glideinsubmit/glidein_t3/glideinWMS.xml glideinsubmit/glidein_t5/glideinWMS.xml glideinsubmit/glidein_t1/glideinWMS.xml glideinsubmit/glidein_t4/glideinWMS.xml glideinsubmit/glidein_t2/glideinWMS.xml ' for XML in ${XMLS} ; do cp -a ${XML} ${XML}.save ; done for XML in ${XMLS} ; do nedit ${XML} & ; done replace /afs/fnal.gov/files/expwww/numi/html/gfactory with /home/gfactory/web for XML in ${XMLS} ; do sdiff -s ${XML} ${XML}.save ; done 15:22 ./start_factory.sh 15:23 gfactory processes 23364.0 1 2 3 4 are idle 15:26 gfactory processes have been running about 1 minute. 15:28 gfactory processes usee 3:25 seconds, kreymer glidein is running MINOS25 > crontab ${HOME}/minos/scripts/crontab.minos Oops, had to add this to get aklog to work in condorweb : PATH="/usr/krb5/bin:${PATH}" Date: Wed, 9 Jan 2008 22:26:08 +0000 (UTC) From: Arthur Kreymer To: Burt Holzman Cc: Igor Sfiligoi Subject: Re: Follow-up on WMS/grid discussions ... I have modified these files : glideinWMS/creation/glideinWMS.xml glideinsubmit/glidein_t3/glideinWMS.xml glideinsubmit/glidein_t5/glideinWMS.xml glideinsubmit/glidein_t1/glideinWMS.xml glideinsubmit/glidein_t4/glideinWMS.xml glideinsubmit/glidein_t2/glideinWMS.xml after first creating *.save versions I changed /afs/fnal.gov/files/expwww/numi/html/gfactory to /home/gfactory/web I have started a one per minute cron job rsync'ing the afs web area to the home web directory. /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/condorweb This may not be ideal, but should keep the gfactory process happy for now. Can the one minute update interval be increased ? ####### # NET # ####### Slow and failing network connections reported on Minos systems, late morning. MIN > ssh minos01 Last login: Wed Jan 9 12:02:44 2008 from 131.225.56.147 ... 
aklog: Couldn't get fnal.gov AFS tickets: aklog: Cannot resolve network address for KDC in requested realm while getting AFS tickets 12:05 - OK again ? r-s-fcc2-server ######### # ADMIN # ######### Helpdesk ticket 095815 Updated minos-sam02 status, system is up. Successful ! ######## # FARM # ######## Date: Tue, 08 Jan 2008 17:08:52 -0600 (CST) From: Steven Timm To: fermigrid-announce@fnal.gov Subject: new worker nodes, CDF and GP Grid clusters 46 of the 48 new worker nodes have been deployed on the General Purpose Grid cluster today. We expect the last two nodes to be ready tomorrow morning. Thanks to Rennie Scott and Jason Allen for fast work. The high 8 nodes fnpc339-fnpc346, MINOS will have priority on. We are still working on the details of getting AFS implemented on those nodes as they requested, hopefully within 7-10 days. These 48 new worker nodes will all be available for production for a while. Eventually some of them may be redeployed for integration and testing projects. As reported in operations meeting yesterday, 155 new worker nodes were also deployed as CDF Grid cluster 3 and they are available for use by FermiGrid. these are fcdfcaf1502-1656. CDF will have priority on all of these nodes. Non-cdf users should only access them via the fermigrid1 site job gateway, not by direct submission. ============================================================================= 2008 01 08 ########## # NEXSAN # ########## Scheduled cron shutdowns for NEXSAN upgrades Thur 10 Jan 08:00 mindata@minos26 echo "crontab -r" | at 02:00 Jan 10 minfarm@fnpcsrv1 echo "mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT" | at 02:00 Jan 10 ########## # CONDOR # ########## http://www-numi.fnal.gov/gfactory/stage/glidein_t5/condor_config file:/home/gfactory/web/stage/glidein_t5/condor_config condorweb - script does an rsync from /home/gfactory/web/ to /afs/fnal.gov/files/expwww/numi/html/gfactory ############ # MCIMPORT # ############ Removed a bad imported file from sjc, FILE=n11047018_0002_L010185N_D04.tar.gz would have moved to MCSTA, ${MDSTAGE}/${MCREL}_${MCRN}/${CONF}/${DET}/${RUN} MCSTA=/minos/data/mcimport/STAGE/daikon_04/L010185N/near/701 mkdir /minos/data/mcimport/sjc/BAD mv ${MCSTA}/${FILE} /minos/data/mcimport/sjc/BAD/${FILE} ########### # ENSTORE # ########### We are very short of 9940-B tapes in the library, Inventory this morning was 108 tapes See also http://www-stken.fnal.gov/enstore/burn-rate/CD-9940B.jpg Recent rates are under Enstore Plots Bytes Written per Storage Group Plots http://www-stken.fnal.gov/enstore/burn-rate/plot_enstore_system.html Month Week TapesBlank ALL_9940B 241 63 294 CD-9940B 159 31 108 lqcd 36 9 miniboone 4 0 minos 68 16 sdss 18 2 ########### # ENSTORE # ########### Date: Tue, 08 Jan 2008 14:42:12 -0600 (CST) HelpDesk ticket 109251 Short Description: Please move future /pnfs/minos/mcin_data and mcout_data writes to CD-LTO-3 Problem Description: enstore-admin : The STKEN 9940-B tape inventory is running critically low. Minos is a major user ( 68 the past month, 16 last week ) Most of our current use is to paths /pnfs/minos/mcout_data and /pnfs/minos/mcin_data Therefore, please do something like the following to direct future writes under these paths toward LTO-3 tape : cd /pnfs/minos/mcin_data enstore pnfs --library CD-LTO3 cd /pnfs/minos/mcout_data enstore pnfs --library CD-LTO3 -------------------------------- Date: Tue, 08 Jan 2008 15:01:58 -0600 (CST) This ticket has been reassigned to SZMUKSTA, GEORGE of the CD-SF/DMS/DSC/SSA Group. 
-------------------------------- Date: Tue, 08 Jan 2008 18:32:13 -0600 (CST) Solution: berg@fnal.gov sent this solution: The library tags have been changed to CD-LTO3. Thanks, Art! ########## # CONDOR # ########## The gfactory stoppages are due to losing a token for writing to /afs/fnal.gov/files/expwww/numi/html/gfactory Set up rsync via cron job on minos25, kreymer account for present Test timing, AFSWG=/afs/fnal.gov/files/expwww/numi/html/gfactory LOCWG=/home/gfactory/web TESWG=/afs/fnal.gov/files/expwww/numi/html/test time cp -ax ${AFSWG} ${LOCWG} real 0m35.957s user 0m0.063s sys 0m1.526s diff -r ${AFSWG} ${LOCWG} real 0m8.566s user 0m0.237s sys 0m0.953s mkdir ${TESWG} time rsync -r ${LOCWG} ${TESWG} --perms --times --links --size-only --delete -v sent 66808099 bytes received 65640 bytes 3110406.47 bytes/sec total size is 66530899 speedup is 0.99 real 0m20.566s user 0m0.955s sys 0m3.520s Repeat again at about 12:13 [gfactory@minos25 ~]$ time rsync -r ${LOCWG} ${TESWG} --perms --times --links --size-only --delete -v building file list ... done sent 129656 bytes received 20 bytes 13650.11 bytes/sec total size is 66530899 speedup is 513.05 real 0m9.075s user 0m0.052s sys 0m3.327s real 0m9.625s user 0m0.057s sys 0m3.383s Around 13:20 real 0m8.478s Added slash after ${LOCWG} to put the the output files directly in ${TESWG} rm -r ${TESWG}/web time rsync -r ${LOCWG}/ ${TESWG} --perms --times --links --size-only --delete -v real 0m19.969s Tried this from kreymer on minos25, could not access /home/gfactory, mode 700 chmod 755 /home/gfactory time rsync -r ${LOCWG}/ ${TESWG} --perms --times --links --size-only --delete -v real 0m8.535s ######## # MAIL # ######## Removed RFC2369 headers from minos-cdops for which they are not appropriate, to eliminate the PINE messages [ Note: This message contains email list management information ] To disable the headers, added to the head of the options list, Misc-Options= NO_RFC2369 ########## # NEXSAN # ########## Date: Tue, 08 Jan 2008 07:53:39 -0600 From: Etta Burns To: minos-admin@fnal.gov Cc: Jason Allen , dbell@fnal.gov Subject: Request For Satabeast Downtime The new NexSAN firmware (Gn60) is available for installation. Would it be possible to have a 20 minute downtime on Thursday morning, beginning at 8:00, to upgrade the firmware and reboot the satabeast? Etta B -- Etta Burns Fermi National Accelerator Laboratory ettab@fnal.gov P.O. Box 500 (630) 840-8300 Batavia, IL 60510 Announced this to minos_software_discussion minos_batch minos-data minos-users ########## # CONDOR # ########## Date: Tue, 08 Jan 2008 12:28:01 +0100 From: Igor Sfiligoi To: Arthur Kreymer Cc: Burt Holzman Subject: Re: Follow-up on WMS/grid discussions Hi Art. It is again the AFS problem :( The monitoring pages are on AFS, so when the AFS token expires, the factory cannot anymore update the monitoring info and it exits. I don't have an easy answer for this problem :( -------------------------- Date: Tue, 08 Jan 2008 10:01:42 -0600 From: Burt Holzman Keep the monitoring pages local and have a kcron job push it to AFS every minute? -------------------------- I have tested this. There are about 66 MBytes of files in the present web area. I created a copy of the AFS web area AFSWG=/afs/fnal.gov/files/expwww/numi/html/gfactory in LOCWG=/home/gfactory/web I tested the speed of rsync writing to TESWG=/afs/fnal.gov/files/expwww/numi/html/test time rsync -r ${LOCWG} ${TESWG} --perms --times --links --size-only --delete -v real 0m20.566s Subsequent updates take about 8 seconds elapsed time. 
Igor : How often would such updates need to be performed ? Should they be synchromized with any other process ? I see from the log the gfrontend runs about every 90 seconds. ######## # FARM # ######## cedar_phy_bhcurv mcnear processing cleared out the WRITE backlog, as of 8 this morning. Remaining issues farcat cedar_phy_bhcurv bmnt files nearcat cedar_phy_bhcurv 0 and 1 spill file duplicates mcnearcat cedar_phy_oldbhcurv a few files mcnearcat cedar_phy_bhcurv over 250 files pending ============================================================================= 2008 01 07 ############### # MINOS-USERS # ############### Liz added kreymer as an owner, added Giles Barr barr@FNAL.GOV ######## # FARM # ######## Cleanup and catchup WRITE - Dec 22 c_p_b mcnear stuck due to missing /minos/data/mcout_data/daikon_04/L010000N/near/cedar_phy_bhcurv/sntp_data/701/n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root This is present in pnfs, -rw-r--r-- 1 rubin numi 519562958 Dec 22 22:04 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010000N/sntp_data/701/n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root The rename after the srmcp seems to have failed on Dec 22, probably due to a NexSAN flakeout. SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/L010000N/sntp_data/701 mkdir: cannot create directory `/minos/data/mcout_data': No such device or address OOPS - cannot create CC area /minos/data/mcout_data/daikon_04/L010000N/near/cedar_phy_bhcurv/sntp_data/701 This is still there in WRITE, but not moved and symlinked. SRV1> dds /minos/data/minfarm/WRITE/n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root -rw-r--r-- 1 minfarm numi 519562958 Dec 22 19:50 /minos/data/minfarm/WRITE/n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root FILE=n13037014_0018_L010000N_D04.sntp.cedar_phy_bhcurv.root GDW=/minos/data/minfarm/WRITE CCDEST=/minos/data/mcout_data/daikon_04/L010000N/near/cedar_phy_bhcurv/sntp_data/701 ls -l ${CCDEST} mv ${GDW}/${FILE} ${CCDEST}/${FILE} ln -s ${CCDEST}/${FILE} ${GDW}/${FILE} ############ # PNFSDIRS # ############ Added test that encp is set up, so that we have an enstore command ######### # ADMIN # ######### Helpdesk ticket 095815 Tried entering a Minos status at https://computing.fnal.gov/cdsystemstatus/customersupport/inpform.htm MINOS AFS Minor No Estimate AFS timeouts continue at a low rate on the Cluster. We are awaiting a schedule for server software upgrades. On hitting ENTER, got an attempt to open enter_status.pl This seems to be an empty file. Tried this again, got Internal Server Error The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, helpdesk@fnal.gov and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log. Reported this to trb@fnal.gov Date: Tue, 08 Jan 2008 17:13:52 -0600 (CST) Subject: Help Desk Ticket 109168 Has Been Resolved. Solution: Someone, in an attempt to make the .htaccess file for that area more readable, broke it. It's better now. 
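The pnfsdirs change above ( a test that encp is set up ) is described but not quoted. A minimal sketch of that sort of guard, assuming the intent is to stop early when the enstore command is missing; the actual test added to pnfsdirs may differ.
if ! type enstore > /dev/null 2>&1 ; then
    echo "OOPS - enstore command not found, setup encp first" >&2
    exit 1
fi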
########## # CONDOR # ########## To : Burt Holzman Cc : sfiligoi@fnal.gov Attchmnt: Subject : Re: Follow-up on WMS/grid discussions ----- Message Text ----- On Mon, 7 Jan 2008, Burt Holzman wrote: > A few months ago we had a discussion of running your software via WMS on > the grid. Has there been any progress? Do you need any help in getting > started? This has been set up for initial testing, with major assistance from Igor S. Glideins seem to be working most of the time. We have had problems with the gfactory process disappearing. We are still using DOE Grid Certs for glidein, instead of the preferred KCA based certs. The KCA certs seem to be too short lived. I still need to learn to configure and monitor Condor. In particular, we'd like to control the running of jobs with differing time limits ( 30 min, 4 hour, 1 day, 4 day ), and eventually run both Farm and User Analysis jobs via WMS. We are not allowed to use WMS in production until we have upgraded to Condor 6.9 with glExec support, both on the Minos Cluster and GPFARM. ######### # ADMIN # ######### Date: Mon, 07 Jan 2008 10:57:06 -0600 (CST) HelpDesk ticket 109145 run2-sys : Please set minos25 /var/log/messages* protections to 644, as on the rest of the Minos Cluster, to allow monitoring of the ongoing AFS timeouts. chmod 644 /var/log/messages* Date: Mon, 07 Jan 2008 11:06:35 -0600 (CST) This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. Date: Mon, 07 Jan 2008 11:33:49 -0600 (CST) Solution: schmitz@fnal.gov sent this solution: protections changed as per Art's request. ####### # AFS # ####### pts adduser -user mgoodman -group wadmnumi:numiweb pts: Permission denied ; unable to remove user belias from group wadmnumi:numiweb ####### # AFS # ####### Timeouts continue, at a low rate.
MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages.1 | grep "Jan " | grep -v Tokens | grep Lost | grep 131.225 | uniq'; done minos05 Jan 4 11:15:19 minos05 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Jan 4 14:18:22 minos05 kernel: afs: Lost contact with file server 131.225.68.11 in cell fnal.gov (all multi-homed ip addresses down for the server) minos09 Jan 3 13:28:58 minos09 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) ============================================================================= ============================================================================= 2008 01 03 ######## # DATA # ######## non-root files, per email; mcgowan /data/minos/root_data/reco_far/R1_18_4/.bntp_data/2006-07 F00035859_0008.spill.bntp.R1_18_4.0.root F00035947_0014.spill.bntp.R1_18_4.0.root F00036019_0019.spill.bntp.R1_18_4.0.root DCPOR=24136 DPAT=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/reco_far/R1_18_4/.bntp_data/2006-07 DFILE=${DPAT}/${FIL} setup_minos -r R1.18.4 hadd -f /local/scratch26/kreymer/DATA/Merged.root ${DFILE} ${DFILE} ############ # PNFSDIRS # ############ Per rhatcher/arms ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010185N_nccoh write ============================================================================= 2008 01 02 ########### # MONTHLY # ########### DATASETS 1/2 PREDATOR 1/2 VAULT 1/2 MYSQL 1/ Vault - encp - Got error while trying to obtain configuration: ('KEYERROR', "Configuration Server: no such name: 'pnfs_agent'") ~/minos/log/rawcopy/${DET}/encp.2007-12.log This seems to be normal since 2006-12 ####### # AFS # ####### MINOS26 > pts removeuser -user belias -group wadmnumi:numiweb pts: Permission denied ; unable to remove user belias from group wadmnumi:numiweb Asked liz to add me to wadmnumi:wadmnumigr Done, and done ============================================================================= 2008 01 01 Vacation notes : minosadmin Subject: Your ticket 095815 has been reassigned to BOZONELOS, TOM Solution: Added the following KX509 DN's to the .htaccess file: |kreymer\ |buckley\ |rhatcher\ |urish\ Please go to the following URL to make manual updates, be sure to pick the correct system (MINOS) from the dropdown list: https://computing.fnal.gov/cdsystemstatus/customersupport/inpform.htm rameika - ssh login problems minosadmin lusers - java updates boehm - minosadmin - glidein guidance rarmstr - minoscvs - access windows - 2008 - windows domain exp Jan 14 habig - minosshift - looking for net downtime rubin - minosbatch - cannot use new proxy to copy with srm found gfactory processes missing again, restarted ============================================================================= 2007 12 28 Kreymer on vaction until 2 January 2008 Happy New Year ! 
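The hadd check under 2008 01 03 above is shown for a single ${FIL}. A sketch of the same check over all three files reported by mcgowan, with names and paths exactly as listed there; the loop is the only addition.
# after setup_minos -r R1.18.4, as above
DCPOR=24136
DPAT=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/reco_far/R1_18_4/.bntp_data/2006-07
for FIL in F00035859_0008.spill.bntp.R1_18_4.0.root \
           F00035947_0014.spill.bntp.R1_18_4.0.root \
           F00036019_0019.spill.bntp.R1_18_4.0.root ; do
    DFILE=${DPAT}/${FIL}
    hadd -f /local/scratch26/kreymer/DATA/Merged.root ${DFILE} ${DFILE}   # hadd fails if the input is not a readable ROOT file
done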
============================================================================= 2007 12 27 ########## # CONDOR # ########## Restarted missing gfactory processes Verified correct running of probe, with wms.run script ########## # CONDOR # ########## Temporary safety copy of examples, cd /local/scratch26/kreymer cp -vax /minos/scratch/kreymer/condor condor ########## # DCACHE # ########## HelpDesk ticket 108857 Short Description: FNDCA overloaded with e907 writes to DCache pools Problem Description: dcache-admin : At about 02:00 this morning, a flood of over 12,000 writes started to FNDCA write pools, pushing the Pool Request Queue for Stores over 8000. The writes continued through 06:00. The Minos MC import and farm concatenation processes have throttled down as designed, trying to keep the queue under the recommended 2-3K limit. The backlog is clearing at roughly 10 seconds/file, apparently all going to a single LTO-3 tape, file family 'geant'. These seem to be e907 files, all under 10 MBytes in size. This is very inefficient for LTO-3 tape ( or even 9940 ). It may take a day or more for this backlog to clear out, and for Minos processing to resume, assuming that no more files of this sort are written, and assuming that there are no global DCache service failures, as have happened before when backlogs got this large. Please contact E907 to understand the source of the problem, and if at all possible, remove this backlog. Date: Thu, 27 Dec 2007 14:30:58 -0600 (CST) This ticket has been reassigned to HARRISON, MICHAEL of the CD-SF/DMS/DSC/SSA Group. Date: Thu, 27 Dec 2007 14:48:46 -0600 (CST) mircea@fnal.gov sent this Notes To Requester: Hi Arthur, I've started looking into it, will let you know when I have an update. -Mike Harrison Date: Thu, 27 Dec 2007 15:04:27 -0600 From: Holger Meyer To: Mike Harrison , kreymer@fnal.gov, dcache-admin@fnal.gov Subject: Re: Fw: HelpDesk ticket 108857 has been assigned to you HARRISON, MICHAEL Mike, Art, dcache-admins, I started these file transfers. They are small stdhep files that I will run through Monte Carlo simulation and reconstruction jobs on the grid. I plan to copy more files soon, so we should find a less disruptive way for me to do so. I assumed that my script initiating one copy at a time would not cause a problem. I want these files backed up on tape (even if that is inefficient for these small files). The Monte Carlo output and reconstructed files will all be much larger. If there is a way to throttle the transfer of these files to tape to a rate that allows MINOS to continue its work, please do so. Best regards, Holger 2997 ########## # DCACHE # ########## Write queues are up to 8000. All dumped in starting 02:00, peak at 06:00 roughly I see no DCache writes in STKEN. 
Restore list is out of date, Oct 11 Login list, I set mostly access to files like minos/reco_near/cedar_phy/cand_data/2005-11/N00009104_0003.spill.cand.cedar_phy.0.root minos/mcin_data/near/daikon_00/L010185N/101/n11011012_0006_L010185N_D00.reroot.root DCache New Plots, write transfers, confirms 1500 to 2000 transfers/hour, 03:00 through 08:00, again at 11:00 Billing shows 12373 transfers from Clients, versus normal 1500ish http://fndca3a.fnal.gov/dcache/billing.html grep w-stken billing.lis | grep e907 | wc -l 12240 PRQ Stores are still at 7186 14:00 - 7169 14:01 - 8161 14:32 - 7102 16:15 - 6954 ######## # FARM # ######## Clearing out deadwood 13:30 ./roundup -f 1 -r cedar mcfar 1 run, vintage Jun 30 ########## # PARROT # ########## ---------- Forwarded message ---------- Date: Thu, 27 Dec 2007 17:54:12 +0000 (UTC) From: Arthur Kreymer To: webteam@fnal.gov, rayp@fnal.gov, minos-data@fnal.gov Subject: Parrot tests - FYI Per my conversation with Laura Mengel earlier today, this is just an FYI heads-up note to let relevant people know that we are looking into the possible use of Parrot as a means of accessing Minos software releases from FermiGrid worker nodes. At present, this is just a technology investigation. But if the tests are successful, this scheme might be rapidly deployed, as it has been used in production by CDF, and requires relatively little new infrastructure. For initial single client testing, I have linked the Minos release and product areas under the existing web pages. For larger scale testing and deployment, we would plan to use squids to take the load off central servers, and would make a formal deployment and support plan. References : http://www.cse.nd.edu/~ccl/software/parrot/ http://www.cse.nd.edu/~dthain/papers/cdf-parrot-chep06.pdf ######## # WEB # ######## User list from WUSERS=`{ pts membership wadmnumi:wadmnumigr ; pts membership wadmnumi:numiweb;} | sort | uniq | grep -v wadmnumi` echo $WUSERS admarino alberto arms asousa avva ayres belias boehm brebel bseilhan bspeak buckley cbs cjames costas cwhite dave_b dharris efalk gfp gmieg grossman habig hgallag hylen jkn jmusser kreymer lang lauram mdier med messier mgoodman michael msanchez murgia niki nwest para petyt pjl plunk rameika rgran rhatcher rubin shanahan tagg thomsonm thosieck urheim webera wehmann MINOS26 > WMAIL= MINOS26 > for WUSER in ${WUSERS} ; do WMAIL=${WMAIL}${WUSER}, ; done MINOS26 > echo $WMAIL admarino,alberto,arms,asousa,avva,ayres,belias,boehm,brebel,bseilhan,bspeak,buckley,cbs,cjames,costas,cwhite,dave_b,dharris,efalk,gfp,gmieg,grossman,habig,hgallag,hylen,jkn,jmusser,kreymer,lang,lauram,mdier,med,messier,mgoodman,michael,msanchez,murgia,niki,nwest,para,petyt,pjl,plunk,rameika,rgran,rhatcher,rubin,shanahan,tagg,thomsonm,thosieck,urheim,webera,wehmann, Removed michael, invalid. Removed belias, email could not go to RAL. admarino,alberto,arms,asousa,avva,ayres,boehm,brebel,bseilhan,bspeak,buckley,cbs,cjames,costas,cwhite,dave_b,dharris,efalk,gfp,gmieg,grossman,habig,hgallag,hylen,jkn,jmusser,kreymer,lang,lauram,mdier,med,messier,mgoodman,msanchez,murgia,niki,nwest,para,petyt,pjl,plunk,rameika,rgran,rhatcher,rubin,shanahan,tagg,thomsonm,thosieck,urheim,webera,wehmann, Sent mail to minos-data , cc: the full list : Yesterday between 16:54 and 16:59 CST, the www-numi web server lost access to its AFS files. Apparently this ACL entry was removed : system:anyuser rl Service was restored this morning when Dave Bell restored the ACL entry. 
Laura Mengel has determined the time of change from the server logs. We also see that the /afs/fnal.gov/files/expwww/numi directory changed at Dec 26 16:57. Because we see no new files there, we guess that something was removed. This mail is being sent to the full list of people with access. Did somebody remove a file and/or change the ACL yesterday ? -------------------------------- We have late-breaking news from the web team that one of their scripts was very likely guilty of removing the ACL. They will try to ensure that it does not get loose from its sandbox in the future. ######## # WEB # ######## Date: Thu, 27 Dec 2007 11:01:21 -0600 From: Laura Mengel To: kreymer@fnal.gov Cc: lauram@fnal.gov Subject: user cert access restriction on web servers http://www.fnal.gov/docs/products/apache/SSLNotes.html under "Setting up to allow Kerberos/kx509 authentication" under "and then you can use, in .htaccess files, entries" This is the general URL with web help http://www-css.fnal.gov/csi/webdocs/webmaster_info.html -- Laura ####### # AFS # ####### More AFS timeouts : minos17 Dec 26 21:09:15 minos17 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) minos21 Dec 26 22:46:11 minos21 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) minos26 Dec 25 02:06:19 minos26 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) ######## # WEB # ######## Numi web sites down, Helpdesk Ticket 108833 Urish - Urgent 12/27/2007 12:26:50 AM The Minos experiment relies on the central web server for accessing some operational information. The www-numi.fnal.gov web site will not allow access. The URL responds with "Forbidden - You don't have permissions to access / on this server." I attempted to access the AFS space directory and was unable to mount the directory /afs/fnal.gov/files/expwww/numi where the files for this site are stored. BELL, DAVE , x4482, csi-est@fnal.gov / medium Modified trb 12/27/2007 2:45:45 PM Audit Log : 12/27/2007 2:41:44 PM jereboze The Assigned To Group was changed from Help Desk to CD-LSCS/CSI/CS/EST. The Assigned To Individual was changed from HelpDesk to BELL, DAVE. The Assigned To E-mail Address was changed from helpdesk@fnal.gov to csi-est@fnal.gov. 12/27/2007 3:34:47 PM resolution The www-numi.fnal.gov web site is now accessible. I added the rights for system:anyuser.
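Since a single missing system:anyuser entry took the whole site offline, a periodic check of that ACL is cheap insurance; the fs la listing just below shows the restored state such a check would test for. A minimal sketch only, assuming a valid AFS token and reusing the minos-data@fnal.gov address from the mails above; it only warns, it does not try to repair the ACL.

# Sketch: warn if the web area no longer grants system:anyuser rl.
# Assumes a valid AFS token; the mail address is reused from this log.
WEBDIR=/afs/fnal.gov/files/expwww/numi
if ! fs listacl ${WEBDIR} | grep -q 'system:anyuser rl' ; then
    echo "`date` : system:anyuser rl missing on ${WEBDIR}" | \
        mail -s "numi web ACL check" minos-data@fnal.gov
fi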
# fs la /afs/fnal.gov/files/expwww/numi Access list for /afs/fnal.gov/files/expwww/numi is Normal rights: lauram:expwwwread rl wadmnumi:numiweb rlidwka wadmnumi:wadmnumigr rlidwka lauram:expwwwadm rlidwka system:administrators rlidwka system:anyuser rl MINOS26 > ls -altr total 126 -rw-r--r-- 1 1866 oss 274 Mar 20 1997 expwww.dat -rw-r--r-- 1 7979 oss 8014 May 12 1997 README.numi_webserver drwxr-xr-x 2 7979 oss 2048 Feb 4 1998 CERN_conf drwxr-xr-x 2 7979 oss 2048 Feb 4 1998 wwwstat-1.0 lrwxr-xr-x 1 bin root 5 Feb 10 1998 hyper-news -> babar drwxr-xr-x 2 7979 oss 2048 Apr 21 1999 NCSA_conf drwxr-xr-x 2 1866 1530 2048 Oct 7 1999 admin_pre1_3_9 drwxr-xr-x 2 1866 1530 2048 Mar 6 2001 admin_standalone drwxrwxrwx 6 7979 1530 2048 Oct 4 2001 numinotes drwxr-xr-x 4 1866 1530 2048 May 28 2003 conf_standalone drwxr-xr-x 2 1222 g020 2048 Jul 30 2003 admin -rw-r--r-- 1 10599 e875 8014 Jul 30 2003 README_standalone -rw-r--r-- 1 10599 e875 9314 Sep 5 2003 README drwxr-xr-x 3 para oss 2048 Apr 29 2004 babar drwxrwxrwx 2 root root 8192 Jul 16 2004 file_upload -rwxr-xr-x 1 buckley e875 63 Jul 16 2004 cleanup_query_files drwxr-xr-x 2 1222 g020 2048 Mar 14 2005 conf -rw-r--r-- 1 10599 e875 837 May 26 2006 README.switch drwxr-xr-x 3 boehm e875 2048 Mar 16 2007 youngminos drwxr-xr-x 2 1222 g020 2048 Apr 6 2007 auth drwxr-xr-x 5 7979 oss 8192 Sep 4 17:52 cgi-bin drwxr-xr-x 38 7979 oss 4096 Nov 27 14:27 html drwxr-xr-x 12 9999 root 6144 Dec 21 11:15 .. drwxrwxrwx 2 root root 45056 Dec 25 14:11 query_files drwxrwxrwx 15 7979 root 2048 Dec 26 16:57 . ============================================================================= 2007 12 26 ########### # BLUEARC # ########### minos_software_discussion To: minos_software_discussion@fnal.gov Cc: minos_sim@fnal.gov, minos_batch@fnal.gov, minos-data@fnal.gov,minos-admin Here is a summary of a conversation with Ray Pasetes, who leads the group maintaining our BlueArc file service. This weekend's timeouts were again due to communication problems between the NexSAN SataBeast array and the Fiber Channel fabric. This is not a unique problem; other customers are also suffering. There is a new set of firmware from NexSAN which should correct this. But it has just been received, and is not yet field tested. Therefore, it will be best for us to minimize use of these areas until this firmware is well tested, probably after the Austin meeting. Major users are : GPFarm analysis pioneers ( Rustem and Josh ) The Farm - has finished essentially all the reprocessing MC import - has a few days of files to import. ########### # BLUEARC # ########### Per Pasetes conversation, about 15:45 this afternoon. Two files were lost from /minos/scratch , cleaned up by fsck : 2007-12-25 05:29:37 File System: Deleting corrupted file: near_L010z185i_mc.uDST_strip.root 2007-12-25 05:29:37 File System: Deleting corrupted file: near_L010z185i_mc.uDST_strip.root The problem was again FC resets on the fabric to the NexSAN array. The fsck took 18.5 hours to run ( with many resets slowing things down ) CSI is testing new firmware from NexSAN, version M K - crashed heads L - production, heads OK, but has bus timeouts that we are seeing M - crashes heads N - beta firmware , but claims to correct the timeout problems. Recommendation : Minos - minimize use of /minos/data till N is tested out ( 2008 ) Minos - prepare a client test suite to provide a realistic load. AFS issues - had tested and were ready to deploy new software last week, then started seeing failures again.
These turn out to be due to a hardware failure. When the hardware is replaced, and tests are repeated, we may be OK. Meanwhile, the minos software is being shifted to the minos-2 AFS server, which has not seen any timeouts recently. ############## # MINOS_DATA # ############## for DIR in ${DIRS} ; do fs listquota d${DIR} | grep nb ; done > /tmp/mdd sort -n -k 4 /tmp/mdd nb.minos.d88 50000000 6 0% 64% nb.minos.d141 50000000 4590792 9% 60% nb.minos.d118 50000000 5292952 11% 60% nb.minos.d119 50000000 6013856 12% 74% d141/recodata52 - F/N cedar d118/recodata41 - N cedar only 41 files d119/recodata42 - F/N cedar , n R1_18_2 shifting d118 to d105/recodata34 nb.minos.d105 50000000 41376549 83% 64% FILES=`ls d118/recodata41 | grep root` date for FILE in ${FILES} ; do cp -av d118/recodata41/${FILE} d105/recodata34/${FILE} done date Wed Dec 26 15:27:28 CST 2007 Wed Dec 26 15:37:10 CST 2007 for FILE in ${FILES} ; do echo ${FILE} diff d118/recodata41/${FILE} d105/recodata34/${FILE} done for FILE in ${FILES} ; do grep ${FILE} d10/indexes/*.cedar.index done | cut -f 1 -d : | sort | uniq d10/indexes/2006-06_near.cedar.index d10/indexes/2006-07_near.cedar.index 16:20 - Changed recodata41 to recodata34 in these indexes. for FILE in ${FILES} ; do echo ${FILE} ; echo rm d118/recodata41/${FILE} done cd d118 rmdir recodata41 rm recodata* rm indexes Now back to minos26, give this to nonap fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d118 \ -acl minos:nonap rlidwka fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d118 \ -acl minos:admin rlidwka fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d118 \ -acl buckley:minosrecodata none rm d10/recodata41 ####### # AFS # ####### Created d10/analysis directories to keep track of assigned volumes DIR=d186 fs listacl ../../../${DIR} ln -s ../../../${DIR} ${DIR} rustem d186 d203 d221 d260 ana_ntuples ( buckley:ana_ntuples, should be minos:nc ) d271 d272 beam d188 d239 d266 d268 d269 d270 cc d86 nc d138 d147 d169 d187 d204 d211 d228 d229 nd d88 nonap d240 d261 d262 d263 d264 d265 d240 needs minos:nonap added, members adjusted nubar d227 nue d241 d242 d243 d244 d245 reco d267 ######## # FARM # ######## scripts/crontab.dat updated ( previously May 8 2007 ) changed comment ENSTORE to pnfs . saved as crontab.dat.20071226 ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Dec " | grep -v Tokens | grep Lost | grep 131.225 | uniq'; done minos08 Dec 12 17:08:58 minos08 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) /var/log/messages.1 minos02 Dec 21 00:15:19 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 21 08:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Bottom line, no timeouts since Dec 21 update after 13:42 to /home/minfarm/scripts/web_status But the /minos/scratch area has been down since Saturday, and there was very little activity over the Christmas holiday. 
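The per-node grep above gets repeated by hand for each month of logs; a variant that also counts events per file server makes the problem servers stand out at a glance. A sketch in the same style, assuming the same ${NODES} list and working kerberized ssh used in the scans above.

# Sketch: count AFS 'Lost contact' events per file server across the cluster.
# Same assumptions as the scans above: ${NODES} is set, ssh to each node works.
for NODE in ${NODES} ; do
    ssh ${NODE} 'grep "afs: Lost contact with file server" /var/log/messages /var/log/messages.1 2>/dev/null'
done | grep -o '131\.225\.[0-9]*\.[0-9]*' | sort | uniq -c | sort -rn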
########### # BLUEARC # ########### BlueArc /minos/data and scratch areas went offline, Last files in /minos/data/minfarm/nearcat were Saturday evening : Dec 22 14:36 N00008224_0019.spill.sntp.cedar_phy_bhcurv.1.root Dec 22 18:47 F00039704_0004.spill.mrnt.cedar_phy_bhcurv.0.root Dec 22 20:57 n13037065_0011_L010000N_D04.mrnt.cedar_phy_bhcurv.root /minos/* announced online and fsck'd at Dec 26, 2007 7:32 AM Help Desk Ticket 108777 ============================================================================= 2007 12 21 ####### # AFS # ####### NEWGROUP=nd pts creategroup -name kreymer:${NEWGROUP} group kreymer:nd has id -2708 pts setfields kreymer:${NEWGROUP} -access SOMar for GUSER in buckley kreymer shanahan kordosky rgran ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts chown kreymer:${NEWGROUP} minos:admin pts membership minos:${NEWGROUP} ####### # AFS # ####### fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d88 \ -acl minos:cc none fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d88 \ -acl buckley:minosrecodata none fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d88 \ -acl minos:nd rlidwka buckley:minosrecodata ############## # MINOS_DATA # ############## Let's clear the R1_2* reco data. rubin@fnpcsrv1 : cd /afs/fnal.gov/files/data/minos/d10/indexes INDS=`ls *R1_23*.index` ./rv R1_23 noop | grep -v rm Removing 2005-11_far.R1_23.index Removed 661 files touch -t 200101010000 2005-11_far.R1_23.index Removing 2005-11_near.R1_23.index Removed 827 files touch -t 200101010000 2005-11_near.R1_23.index mostly 51,52,53 SRV1> ./rv R1_23 This procedure will erase all R1_23 ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Removing 2005-11_far.R1_23.index Removed 661 files Removing 2005-11_near.R1_23.index Removed 827 files R1_23a most 53 SRV1> ./rv R1_23a This procedure will erase all R1_23a ntuples and rewrite the index files! It will not prompt again -- do you want to proceed ? y Removing 2005-11_far.R1_23a.index Removed 660 files Removing 2005-11_near.R1_23a.index Removed 817 files ./rv 'S06-05-25-R1-22' S06-05-25-R1-22 Removing 2005-11_far.S06-05-25-R1-22.index Removed 661 files Removing 2005-11_near.S06-05-25-R1-22.index Removed 662 files S06-06-22-R1-22 mostly 53 Removing 2005-11_far.S06-06-22-R1-22.index Removed 661 files Removing 2005-11_near.S06-06-22-R1-22.index Removed 827 files R1_24a mostly 54 Removing 2005-11_far.R1_24a.index Removed 720 files Removing 2005-11_near.R1_24a.index Removed 886 files R1_24b 55, 56 Removing 2005-11_far.R1_24b.index Removed 720 files Removing 2005-11_near.R1_24b.index Removed 1058 files R1_24c 56 57 58 Removing 2005-11_far.R1_24c.index Removed 720 files Removing 2005-11_near.R1_24c.index Removed 1368 files R1_24cal 96 97 98 Removing 2005-11_far.R1_24cal.index Removed 692 files Removing 2005-11_near.R1_24cal.index Removed 1304 files Let's get more ambitious, removing R1_18_2 Removed net 15220 files SRV1> cat 2*R1_18_2.index | wc -l 15220 SRV1> find /minos/data/reco_near/R1_18_2/ -type f | wc -l 6501 SRV1> find /minos/data/reco_far/R1_18_2/ -type f | wc -l 8719 = 15224 R1_18_4 Removed net 6280 files SRV1> find /minos/data/reco_near/R1_18_4/ -type f | wc -l 1752 SRV1> find /minos/data/reco_far/R1_18_4/ -type f | wc -l 3946 = 5698 SRV1> wc -l *_far.R1_18_4.index 3946 total That's complete. 
SRV1> wc -l *_near.R1_18_4.index 0 2006-03_near.R1_18_4.index 0 2006-05_near.R1_18_4.index 669 2006-06_near.R1_18_4.index 669 118 2006-07_near.R1_18_4.index 118 5 2006-08_near.R1_18_4.index 5 395 2006-09_near.R1_18_4.index 273 * 612 2006-10_near.R1_18_4.index 152 * 535 2006-11_near.R1_18_4.index 535 2334 total Note that indexes 2006-09 and 10 were repaired after the copies. Rerun a catchup copy ./afs2nfs -i 2006-10_near.R1_18_4.index STREAM sntp to /minos/data/reco_near/R1_18_4/sntp_data/2006-10 4618 612/ 612 recodata15/N00011137_0004.spill.sntp.R1_18_4.0.root STREAM sntp rate 15250 34G /minos/data/reco_near/R1_18_4/sntp_data/2006-10 STARTED Fri Dec 21 18:42:00 CST 2007 FINISHED Fri Dec 21 19:12:11 CST 2007 $ ./afs2nfs -i 2006-09_near.R1_18_4.index STREAM sntp to /minos/data/reco_near/R1_18_4/sntp_data/2006-09 4574 395/ 395 recodata15/N00010903_0003.spill.sntp.R1_18_4.0.root STREAM sntp rate 15573 19G /minos/data/reco_near/R1_18_4/sntp_data/2006-09 STARTED Fri Dec 21 19:12:41 CST 2007 FINISHED Fri Dec 21 19:17:55 CST 2007 find /minos/data/reco_near/R1_18_4/ -type f | wc -l 2334 ./rv 'R1_18_2' Removed net 15220 files ./rv 'R1_18_4' Removed net 6280 files ./rv 'R1.16a' Removed net 612 files from recodata17 / d88 for DIR in ${DIRS} ; do echo ${DIR} ; find d${DIR} -name \*R1_17\*; done | less R1_17* files are only in d88, not indexed find d88/recodata17 -name \*R1_17\* | wc -l 3155 find d88/recodata17 -name \*R1_17\* -exec rm {} \; rm d88/recodat17/*R1_17* ./rv 'R1_18' Removed net 2174 files ./rv 'R1_21' Removed net 359 files ./rv 'R1_24' Removed net 883 files ./rv 'R1_18_2a Removed net 720 files ./rvm _far.carrot.R1_18_2 ####### # AFS # ####### fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d86 \ -acl minos:admin rlidwka fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d86 \ -acl minos:cc rlidwka ########## # PARROT # ########## Created web links for Minos releases and products These links seem to work OK, following subsequent links cleanly. cd /afs/fnal.gov/files/expwww/numi/html/computing ln -s /afs/fnal.gov/files/code/e875/releases releases ln -s /afs/fnal.gov/files/code/e875/general/minossoft minossoft ln -s /afs/fnal.gov/files/code/e875/general/products products ln -s /afs/fnal.gov/files/code/e875/general/ ########## # DCACHE # ########## FAM=reco_near_cedar_phy_bhcurv_sntp ./volumes ${FAM} VO3914 VO4613 VO7018 VO9572 VOC321 VOC494 VOC501 VOLS=` ./volumes ${FAM}` for VOL in ${VOLS} ; do ./stage -d -p 0 ${VOL} | grep -v pnfs | tr -d . ; done VO3914 Needed 27/186 VO4613 Needed 11/77 VO7018 Needed 0/240 VO9572 Needed 70/135 VOC321 Needed 0/44 VOC494 Needed 0/42 VOC501 Needed 39/438 The file count seems much lower than far. Still, let's pull 'em. 
for VOL in ${VOLS} ; do ./stage -w ${VOL} ; done ####### # AFS # ####### regarding HelpDesk ticket 107323 Recent timeouts : minos02 Dec 20 01:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 20 03:15:14 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 20 07:15:13 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 20 09:15:17 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 20 11:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 20 12:15:15 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 21 00:15:19 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 21 08:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) minos09 Dec 19 19:16:52 minos09 kernel: afs: Lost contact with file server 131.225.68.65 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 19 19:16:56 minos09 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) MINOS02 > printf 'sleep 5 ; top -b -n 1 -i > /tmp/top1015.log\n' | at 10:15 job 3 at 2007-12-21 10:15 HOURS='00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23' for HH in ${HOURS} ; do echo ${HH} ; done mkdir /minos/scratch/kreymer/afsscan Found connections like Dec 21 12:15:13 minos02 sshd(pam_unix)[17054]: session opened for user rubin by (uid=0) Here is a more complete comparison of this with reuben connections : RUBIN SSHD AFS TIMEOUT DELAY(sec) Dec 20 01:15:13 01:15:16 3 Dec 20 03:15:12 03:15:14 2 Dec 20 07:15:12 07:15:13 1 Dec 20 09:15:15 09:15:17 2 Dec 20 11:15:14 11:15:16 2 Dec 20 12:15:13 12:15:15 2 Dec 21 00:15:17 00:15:19 2 Dec 21 08:15:14 08:15:16 2 These come from his cron job on fnpcsrv1, 15 00,01,02,03,04,05,06,07,08,09,10,11,12,14,16,18,20,22,23 * * * /usr/krb5/bin/kcron /home/minfarm/scripts/web_status This script does , among other things, kcron EXEC_DIR=/home/minfarm WEB_DIR=$EXEC_DIR/web cd $WEB_DIR webuser=rubin webhost=minos02.fnal.gov webarea=/afs/fnal.gov/files/expwww/numi/html/minwork/computing/batch_monitor scp farm_status.html ${webuser}@${webhost}:$webarea SRV1> time scp farm_status.html ${webuser}@${webhost}:$webarea farm_status.html 100% 37KB 37.4KB/s 00:00 real 0m5.209s user 0m0.035s sys 0m0.056s for N in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ; do > date ; echo ${N} ; scp -q farm_status.html ${webuser}@${webhost}:$webarea ; sleep 60 ; done Fri Dec 21 13:22:42 CST 2007 ... Fri Dec 21 13:42:36 CST 2007 The last scp copy was as follows, from /var/log/messages.1 Dec 21 13:42:37 minos02 sshd(pam_unix)[21162]: session opened for user rubin by (uid=0) ############# # DBARCHIVE # ############# MINOS-SAM03 > du -sm 20060421 9352 20060421 time scp -vr 20060421 minsoft@minos-mysql1:/minos/data/mysql/20060421 # GRID # CDF Grid workshop 7/8 Jan ( 2 hours each ) filed email in cdfcaf ######## # GRID # ######## So, where are the vomses files ? 
You can get a clue from voms-proxy-init -debug Let's have a look at /grid/app/minos/VDT /minos/scratch/kreymer/VDT ######## # GRID # ######## Date: Mon, 17 Dec 2007 15:52:52 -0600 From: Dan Yocum To: fermigrid-announce@fnal.gov Subject: upcoming change to VOMS server at Fermilab My apologies for missing the fermigrid-announce mailing list. On Tuesday, Dec. 18, 2007 FermiGrid will be moving the Virtual Organization Management Servers (VOMS) for the following VOs to the host voms.fnal.gov: fermilab dzero sdss des gadu nanohub ilc lqcd i2u2 osg Currently, these voms servers are on fermigrid2.fnal.gov and this server will remain in service for several months to alleviate the pain of migration. However, some users may experience problems when attempting to create voms proxy certificates (i.e., voms-proxy-init) for the above VOs. Generally, these users have a mismatch of hostname and host certificate name in their vomses file. The solution is to tell these users to edit their vomses files and make these 2 changes: 1) change all 'host/fermigrid2.fnal.gov' to 'http/voms.fnal.gov' 2) change all instances of the name fermigrid2.fnal.gov to voms.fnal.gov If you have any further questions, feel free to send questions to fermigrid-help@fnal.gov. Thanks, Dan -- Dan Yocum Fermilab 630.840.6509 yocum@fnal.gov, http://fermigrid.fnal.gov Fermilab. Just zeros and ones. ============================================================================= 2007 12 20 s-s-wh8w-7 ######## # FARM # ######## Howie has stopped the copy of minfarm/web/indexes to AFS. The AFS copy is now the primary version ######### # STAGE # ######### file families are set for CPB ntuples, scanning : FAM=reco_far_cedar_phy_bhcurv_sntp ./volumes vols ./volumes ${FAM} VOLS=` ./volumes ${FAM}` for VOL in ${VOLS} ; do ./stage -d -p 0 ${VOL} | grep -v pnfs ; done Staging files from tape VO9677 ................ Needed 545/1492 Staging files from tape VOC674 Needed 0/902 Staging files from tape VOC691 ...................... Needed 0/220 The files do need to move to the new pools : ./stage -d -p 0 -g MinosPrdReadPools VOC691 | grep -v pnfs Sample file is /pnfs/fnal.gov/usr/minos/reco_far/cedar_phy_bhcurv/sntp_data/2007-05/F00038163_0000.all.sntp.cedar_phy_bhcurv.0.root ./stage -b 3 -g MinosPrdReadPools VOC691 r-stkendca18a-5 - this is in the correct pool, good, was not there prestage w-stkendca12a-4 FAM=reco_far_cedar_phy_bhcurv_sntp 17:47 for VOL in ${VOLS} ; do ./stage -w -g MinosPrdReadPools ${VOL} ; done ########## # DCACHE # ########## Date: Thu, 20 Dec 2007 15:14:24 +0000 (UTC) From: Arthur Kreymer To: dcache-admin@fnal.gov Subject: dcap version for security scan immunity Which version of dcap should be used to avoid the known problems with recent Fermilab security scans ? Minos has been using the 'current' version, v2_32_f0408, which I know is too old. How should dcap be set up, or what environment variable should be set, for scan immunity ? What are the operational impacts of running in this mode ? ============================================================================= 2007 12 19 ########## # DCACHE # ########## Date: Wed, 19 Dec 2007 08:44:18 -0600 (CST) Subject: HelpDesk ticket 108563 Short Description: Minos writes pending 3 days Problem Description: dcache-admin There is a set of Minos data files which were written to FNDCA 3 days ago. These are still not on tape, as reported at http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt ( I see that there are problems again with the 9* pools. 
) Please report that status of recovery of these files to minos-data. Date: Wed, 19 Dec 2007 08:57:42 -0600 (CST) This ticket has been reassigned to BERG, DAVID of the CD-SF/DMS/DSC/SSA Group. Thu Dec 20 08:31:44 CST 2007 - waiting ######## # FARM # ######## In cedar_phymcnear.log, see message OOPS - POOLS ACTIVE NEED 12 10 11 but writing continued... I see that the 9* pools are down ############ # HELPDESK # ############ Arthur Kreymer wrote: > I cannot use the usual Web page to submit helpdesk requests > I see, at https://computing.fnal.gov/cgi-bin/remedy/Helpdesk.pl > as of about 22:50 on 2007 Dec 18, This error was inadvertently caused by scheduled maintenance and only temporarily affected the web interface to reporting helpdesk issues. Sorry for the inconvenience. ============================================================================= 2007 12 18 ############## # MINOS/DATA # ############## near cedar_phy_bhcurv preparation PNFS=/pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data YEMOS=`cd ${PNFS} ; find . -type d -maxdepth 1 -exec basename {} \; | grep -v '\.' | sort` $ for MO in $YEMOS ; do AFSS/stage -d -p 0 reco_near/cedar_phy_bhcurv/sntp_data/$MO ; done | grep Needed . Needed 8/67 Needed 22/98 ... Needed 6/65 ...... Needed 0/54 . Needed 0/1 ..... Needed 2/55 .... Needed 0/33 Needed 3/46 ... Needed 2/42 . Needed 3/52 OK, need to get these directed to Minos read pools, and staged, then can do the bin/dc2nfs -d reco_near/cedar_phy_bhcurv/sntp_data 2>&1 | \ tee /minos/scratch/log/dc2nfs/cpbnear.log ########### # MONTHLY # ########### completed dbarchive 15:15 DCS_HV.MYD 35m rm -r /data/archive/COPY/20071107 # possibly speed up the copies ? PULSERGAIN.MYD 63m the rest 74m Tue Dec 18 20:47:21 CST 2007 md5sum real 30m44.050s 46G gzip real 98m52.524s 18G Repeated copies, with fresh ticket Ran out of space on minos-sam03 time cp -vax ${DBCOPY} /minos/data/mysql/${DAY} real 16m43.108s time diff -r ${DBCOPY} /minos/data/mysql/${DAY} real 33m19.808s time rsync -r \ real 8m7.283s ran out of space again MINOS-SAM03 > time scp -vr 20060418 minsoft@minos-mysql1:/minos/data/mysql/20060418 real 7m25.788s user 3m0.609s sys 1m4.927s time rsync -r ${DBBINS} /minos/data/mysql/BINLOG --perms --times --size-only -v interrupted during 143, resumed time rsync -r ${DBBINS} /minos/data/mysql --perms --times --size-only -v 40m20.398 The problem is that BINLOGS contains 54 GB of recent changes ! [minsoft@minos-mysql1 ~]$ ls -alF /data/archive/BINLOG | grep 'Dec 8' | wc -l 15 [minsoft@minos-mysql1 ~]$ ls -alF /data/archive/BINLOG | grep 'Dec 9' | wc -l 24 [minsoft@minos-mysql1 ~]$ ls -alF /data/archive/BINLOG | grep 'Dec 14' | wc -l 9 ########### # BLUEARC # ########### Date: Tue, 18 Dec 2007 19:33:16 -0600 (CST) Subject: Help Desk Ticket 108225 Has Been Resolved. Solution: I forgot to close this ticket. I called Art on day of incident .... we had experienced a head/crash and failover at the day/time of the ticket Andy ( romero@fnal.gov x4733 ) Problem Description: LSC/CSI : Today at around 11:30, till around 11:35 ( roughly ) the NFS mounts of the BlueArc served /minos/data and /minos/scratch timed out many or all of the Minos Cluster nodes. The mounts seem to have recovered. ... ########### # BLUEARC # ########### Date: Tue, 18 Dec 2007 15:35:38 -0600 (CST) HelpDesk ticket 108542 Short Description: Quota request for BlueArc served /minos/scratch, for rahaman Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user rahaman on the BlueArc served /minos/scratch volume. 
This overrides the existing default 100 GBytes quota. Date: Tue, 18 Dec 2007 15:44:25 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. Date: Tue, 18 Dec 2007 16:04:09 -0600 (CST) Solution: Quota adjusted. This ticket was resolved by INKMANN, JOHN of the CD-LSCS/CSI/CS/EST group. ########## # ANNUAL # ########## Created new data directories for next year, per procedure at the bottom of LOG per Rubin reminder. ########## # DCACHE # ########## Rustem reports slow access to /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_04/L010185N/sntp_data 1376 files Ticket #: 107808 The restores took about 12 hours, as expected. RUNS=`ls /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_04/L010185N/sntp_data` for RUN in $RUNS ; do ./stage -w mcout_data/cedar_phy_bhcurv/far/daikon_04/L010185N/sntp_data/${RUN} ; done /minos/data/mcout_data/daikon_04/L010185N/far/cedar_phy_bhcurv/sntp_data ########## # DCACHE # ########## Requesting additional file families for MinosPrdReadPools DPAT=reco_near/cedar_phy_bhcurv/sntp_data DPAT=mcout_data/cedar_phy_bhcurv/near/daikon_04/L010185N/sntp_data DPAT=mcout_data/cedar_phy_bhcurv/far/daikon_04/L010185N/sntp_data ( cd /pnfs/minos/${DPAT} ; enstore pnfs --tags ) | grep '^.(tag)(file_family)' | cut -f 2 -d = reco_far_cedar_phy_bhcurv_sntp reco_near_cedar_phy_bhcurv_sntp mcout_cedar_phy_bhcurv_far_daikon_04_sntp mcout_cedar_phy_bhcurv_near_daikon_04_sntp Date: Tue, 18 Dec 2007 23:02:45 +0000 (UTC) Subject: HelpDesk ticket 108564 I cannot use the usual web page to submit this, so please enter this manually tomorrow : Please direct to Software / MSS / DCache , ( dcache-admin ) Low priority Please add these file families to the MinosPrdReadPools selection list, described at http://fndca3a.fnal.gov:2288/poolInfo/ugroups/MinosPrdSelGrp minos.reco_far_cedar_phy_bhcurv_mrnt minos.reco_far_cedar_phy_bhcurv_sntp minos.reco_near_cedar_phy_bhcurv_mrnt minos.reco_near_cedar_phy_bhcurv_sntp minos.mcout_cedar_phy_bhcurv_far_daikon_04_mrnt minos.mcout_cedar_phy_bhcurv_far_daikon_04_sntp minos.mcout_cedar_phy_bhcurv_near_daikon_04_mrnt minos.mcout_cedar_phy_bhcurv_near_daikon_04_sntp Date: Wed, 19 Dec 2007 08:55:28 -0600 (CST) This ticket is assigned to BERG, DAVID of the CD-SF/DMS/DSC/SSA. ####### # AFS # ####### Per CD OPS meeting, AFS timeout cause may have been located. 
Latest Minos timeouts : for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Dec " | grep -v Tokens | grep Lost | uniq'; done minos02 Dec 16 07:16:10 minos02 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 16 11:15:14 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 16 16:15:15 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 17 07:15:18 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) minos08 Dec 12 17:08:58 minos08 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) minos09 Dec 16 07:16:10 minos09 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) ########## # CONDOR # ########## Now testing the run limits in condor, with 100 josh processes running ( and the minos farms stuck due to cert problems ) condor_submit glide150.run It seems that all 150 ran, pretty quickly, with no increase in glideins. MINOS25 > grep HOSTNAME logs/glide150/24032*.out | cut -f 2 -d : | sort -u HOSTNAME fnpc300.fnal.gov HOSTNAME fnpc309.fnal.gov HOSTNAME fnpc323.fnal.gov HOSTNAME fnpc335.fnal.gov Let's try this again, with CPU cranked up to 3 minutes Still only 111 jobs, 111 running cq kreymer ... 24034.* 150 jobs; 139 idle, 11 running, 0 held MINOS25 > grep HOSTNAME logs/glide150/24034*.out | cut -f 2 -d : | sort -u HOSTNAME fnpc300.fnal.gov HOSTNAME fnpc309.fnal.gov HOSTNAME fnpc323.fnal.gov HOSTNAME fnpc335.fnal.gov ########## # CONDOR # ########## Glidein management 0.1 : Glideins are controlled by two accounts on minos25 : gfactory gfrontend To start up the system, run these scripts in the home areas respectively start_factory.sh start_frontend.sh To stop these, just kill the python scripts respectively python glideFactory.py 90 4 /home/gfactory/glideinsubmit/glidein_t5/ python glideinFrontend.py 90 4 /home/gfrontend/myvofrontend1/etc/vofrontend.cfg The primary configuration files are respectively glideinWMS/creation/glideinWMS.xml myvofrontend1/etc/vofrontend.cfg Stopped the pythons, adjusted the max execution limit in vofrontend.cfg, restarted. Submitted wms.run at 14:19 Ran with 10K limit, per myvofrontend1/log/frontend_info.20071218.log Killed process with -9, restarted, now running like Total running 0 limit 50 ============================================================================= 2007 12 17 ########### # BLUEARC # ########### Date: Mon, 17 Dec 2007 14:29:37 -0600 (CST) HelpDesk ticket 108471 Short Description: Export /minos/data and /minos/scratch to *.fnal.gov Problem Description: LSC/CSI : Please export to *.FNAL.GOV , readonly and rootsquashed, the BlueArc served /minos/data and /minos/scratch Motivation - To make these available on Minos laptops and desktops. Security - These are already mounted on the GPFARM Open Science Enclave, making the data readable even to non-Fermilab users. The readonly/rootsquash export should prevent inappropriate writes. Load - These are already mounted on hundreds of GPFARM nodes. The laptops and desktops are a small increment. Timing - Whenever convenient. 
Compatibility - This request presumes that existing exports and mounts remain functional without modification. We still write to some of these ares from the Minos Cluster, and from the GPFARM. Date: Mon, 17 Dec 2007 15:57:20 -0600 (CST) This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. Date: Mon, 17 Dec 2007 16:13:24 -0600 (CST) Solution: Added read-only to following nfs mounts: 131.225.*.* (read_only,root_squash) /minos/data 131.225.*.* (read_only,root_squash) /minos/scratch ######## # FARM # ######## # 2007 12 17 - reenabled cedar_phy mcnear, for recent mrnt processing ############ # PREDATOR # ############ Has been disabled since Thursday 14 Dec, forgot to start after Oracle patches. MINOS26 > ./predator 2007-12 STARTED Mon Dec 17 15:10:30 UTC 2007 FINISHED Mon Dec 17 20:02:49 UTC 2007 ########## # CONDOR # ########## factproxy - cleaned up to copy to ${PFIL}.new, then rename to ${PFIL} this should minimize exposure to transition problems. ########## # CONDOR # ########## Igor notes that ~gfactory is in AFS. Not good, contains proxies. Dec 17 13:10 gfactory - this seems to be corrected. Created new proxy, SRV1> cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 800:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy \ -valid 800:0 ... Your proxy is valid until Sat Jan 19 21:14:24 2008 scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy \ .grid/kreymer-condor.proxy.20080119 The kreymer-pilot.proxy is also refreshed after 13:10. ============================================================================= 2007 12 14 ########## # CONDOR # ########## administrative access, try this with an active proxy, MINOS25 > cd /local/scratch25/kreymer/.grid/ MINOS25 > scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-doekey.pem . MINOS25 > scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-doe.pem . voms-proxy-init \ -voms fermilab:/fermilab/minos \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem MINOS25 > condor_off -peaceful minos01 -subsystem startd Sent "Set-Peaceful-Shutdown" command to startd minos01.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos01.fnal.gov MINOS25 > condor_q ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 80.0 jdejong 11/2 13:39 0+00:00:01 H 0 0.0 loon /minos/scratc 1789.0 hartnell 11/11 11:36 0+00:00:01 H 0 9.8 tiny ... MINOS25 > condor_rm 80.0 Job 80.0 marked for removal MINOS25 > condor_rm 1789.0 Job 1789.0 marked for removal voms-proxy-destroy MINOS25 > condor_rm 15835.3 Job 15835.3 marked for removal This was a job stuck due to the bluearc timeout on 11 Dec. So I still seem to be a superuser for managing jobs MINOS25 > condor_off -peaceful minos02 -subsystem startd ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5003:Failed to authenticate. Globus is reporting error (851968:40). There is probably a problem with your credentials. (Did you run grid-proxy-init?) AUTHENTICATE:1004:Failed to authenticate using FS Can't send Set-Peaceful-Shutdown command to startd minos02.fnal.gov ERROR AUTHENTICATE:1003:Failed to authenticate with any method AUTHENTICATE:1004:Failed to authenticate using GSI GSI:5003:Failed to authenticate. Globus is reporting error (851968:80). There is probably a problem with your credentials. (Did you run grid-proxy-init?) 
AUTHENTICATE:1004:Failed to authenticate using FS Sent "Kill-Daemon-Peacefully" command to master minos02.fnal.gov ########## # CONDOR # ########## SRV1> minfarm, cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 800:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy \ -valid 800:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Creating temporary proxy .................................................... Done Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Done Warning: fermigrid2.fnal.gov:15001: validity shortened to 86400 seconds! Creating proxy ................................................ Done Your proxy is valid until Wed Jan 16 19:41:03 2008 [gfactory@minos25 ~]$ scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy \ .grid/kreymer-condor2.proxy ########## # DCACHE # ########## Request to recycle VO2114 | CD-9940B | 0000_000000000_0000637 | minos | reco_near_cedar_phy_bhcurv_cand VO3170 | CD-9940B | 0000_000000000_0000444 | minos | reco_near_cedar_phy_bhcurv_cand VO3899 | CD-9940B | 0000_000000000_0000124 | minos | stage_kordosky VO4319 | CD-9940B | 0000_000000000_0000129 | minos | stage_kordosky VO4616 | CD-9940B | 0000_000000000_0000050 | minos | stage_kordosky VO7080 | CD-9940B | 0000_000000000_0000302 | minos | reco_mc_cosmic_cedar VO9164 | CD-9940B | 0000_000000000_0000128 | minos | stage_kordosky VOA280 | CD-9940B | 0000_000000000_0000128 | minos | stage_kordosky VOC347 | CD-9940B | 0000_000000000_0000579 | minos | mcin_near_daikon_04 VOC588 | CD-9940B | 0000_000000000_0000128 | minos | stage_kordosky for VOL in VO2114 VO3170 VO3899 VO4319 VO4616 VO7080 VO9164 VOA280 VOC347 VOC588 ; do echo VOLUME ${VOL} enstore info --list=${VOL} ; done > /minos/scratch/kreymer/recycle20071214.lis ########## # CONDOR # ########## Need to register Kreymer KCA DN in Minos CAF, for glidein usage Date: Fri, 14 Dec 2007 11:12:07 -0600 (CST) HelpDesk ticket 108403 Problem Description: run2-sys : We are moving to a more secure environment, with shorter lived certificates for the Minos Condor Analysis Facility. Please add the following DN to this file on minos25 : /etc/grid-security/condor-grid-mapfile "/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer" gfactory2 ( This is one line, including gfactory2 . This may have been split by the Helpdesk entry form and/or email. ) We would like to get this done today of possible. Date: Fri, 14 Dec 2007 11:34:06 -0600 (CST) Subject: Your ticket 108403 has been reassigned to ALLEN, JASON Added kreymer/cron/minos25.fnal.gov@FNAL.GOV to gfactory@minos25:.k5login The helpdesk ticket request was wrong, a blunder in my part, should have been for "/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer" gfactory2 Reported this to Jason, this was corrected soon : MINOS25 > ls -l /etc/grid-security/condor-grid-mapfile -rw-r--r-- 1 root root 1797 Dec 14 16:19 /etc/grid-security/condor-grid-mapfile ######## # FARM # ######## This is a mess ! The catchup run seems to have failed to set it's pid, and we had two running in parallel, making a scrambled hash of the log file. 
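A guard along the following lines keeps a second copy of a script from starting when the first is still running. This is only a sketch of the general pidfile idea, not the roundup script's actual mechanism, and the lock file path is hypothetical.

# Sketch: refuse to start if another instance is already running.
# LOCK path is hypothetical; not how roundup itself records its pid.
LOCK=/var/tmp/roundup.pid
if [ -f ${LOCK} ] && kill -0 `cat ${LOCK}` 2>/dev/null ; then
    echo "already running as pid `cat ${LOCK}`, exiting"
    exit 1
fi
echo $$ > ${LOCK}
trap "rm -f ${LOCK}" EXIT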
PURGED WRITE/F00032678_0000.all.sntp.cedar_phy_bhcurv.0.root PURGED WRITE/F00031989_0000.all.sntp.cedar_phy_bhcurv.0.root do_ypcall: clnt_call: RPC: Timed out Traceback (most recent call last): File "/export/stage/minfarm/ROUNDUP/SAM/current/bin/sam", line 4, in ? sys.exit(Sam.main(sys.argv)) File "sam_user_pyapi/bin/Sam.py", line 6368, in main File "sam_common_pylib/SamCommand/CommandInterfaceSuite.py", line 120, in dispatch File "sam_common_pylib/SamCommand/CommandInterfaceSuite.py", line 118, in dispatchCommand File "sam_common_pylib/SamCommand/CommandInterface.py", line 61, in mainDispatch File "sam_common_pylib/SamCommand/BlessedCommandInterfacePlaceHolder.py", line 38, in dispatch File "sam_common_pylib/SamCommand/SamCommandInterface.py", line 208, in cliDispatch File "sam_common_pylib/SamCommand/CommandInterface.py", line 331, in cliDispatch File "sam_common_pylib/SamCommand/CommandInterface.py", line 344, in _baseClass_cliDispatch File "sam_user_pyapi/src/samLocate.py", line 75, in implementation File "sam_common_pylib/SamCorba/SamServerProxy.py", line 257, in _callRemoteMethod File "sam_common_pylib/SamCorba/SamServerProxyRetryHandler.py", line 266, in handleCall KeyError: 'getpwuid(): uid not found: 10871' Log modified at 06:45:03 Note the do_ypcall, this is an NIS problem ! Net effect, we still have a lot to do, will let the scripts run : SRV1> ls WRITE | grep ^F | wc -l 4026 Present summary : SRV1> find . -name "f*" -type l | wc -l 701 SRV1> find . -name "f*" -type f | wc -l 7 mcfar written around 10:00, into saddreco Saturday 15:48 CST , still copying cedar_phy_bhcurvfar started around 19:00 yesterday SRV1> find WRITE/ -name "F*" -type f | wc -l 1475 SRV1> find WRITE/ -name "F*" -type l | wc -l 2431 ============================================================================= 2007 12 13 ####### # AFS # ####### Per tagg, for AFSUSER in cherdack rearmstr rodriges ; do pts adduser -user ${AFSUSER} -group minos:cc done pts membership minos:cc buckley kreymer urheim tagg cherdack rearmstr rodriges I cannot find a Fermilab ID for Tony. ######## # FARM # ######## 4614 files in WRITE/ are too much for ls. roundup.20071213 PURGE and WRITE sections got too many files for the ls command to swallow changed to 'find' In testing, the CC mv/ln were not disabled by ${ECHO} need to shift some back : SRV1> find . -type l -name "F*" ./F00031721_0000.spill.mrnt.cedar_phy_bhcurv.0.root ./F00031721_0000.spill.bntp.cedar_phy_bhcurv.0.root ./F00031721_0000.all.sntp.cedar_phy_bhcurv.0.root ./F00031721_0000.spill.sntp.cedar_phy_bhcurv.0.root ./F00040057_0000.all.sntp.cedar.0.root ./F00040057_0000.spill.bntp.cedar.0.root ./F00040057_0000.spill.sntp.cedar.0.root FILES=`find . -type l -name "F*" | cut -f 2 -d /` for FILE in ${FILES} ; do FLIN=`ls -l ${FILE} | cut -f 2 -d '>'` FDIR=`dirname ${FLIN}` mv ${FDIR}/${FILE} ${FILE} ls -l ${FILE} done OK, that's clean. Put the new roundup in production, and run catchup SRV1> cp -a AFSS/roundup.20071213 . SRV1> ln -sf roundup.20071213 roundup SRV1> roundup -r cedar_phy_bhcurv far Thu Dec 13 22:15:32 CST 2007 PURGING WRITE files 4497 ######## # GRID # ######## Date: Thu, 13 Dec 2007 11:51:10 -0600 Subject: Re: ALLEN, JASON HelpDesk ticket 108321 Has Been Updated. Resolved Mounted /grid/data and grip/app as requested. I have verified this with a scan of all nodes. 
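The same kind of scan is easy to repeat from the Minos side whenever the mount list changes. A sketch only, reusing the ${NODES}/ssh convention from the AFS scans above; it reports whether each area is mounted, not whether it is writable.

# Sketch: confirm /grid/data and /grid/app are mounted on every cluster node.
# Assumes ${NODES} is set and kerberized ssh works, as in the AFS scans.
for NODE in ${NODES} ; do
    printf "%-10s" ${NODE}
    ssh ${NODE} 'for M in /grid/data /grid/app ; do
        mount | grep -q " on ${M} type " && printf " %s ok " ${M} || printf " %s MISSING " ${M}
    done ; echo'
done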
########### # MONTHLY # ########### DATASETS 12/13 PREDATOR 12/13 VAULT 12/13 MYSQL 12/ Date: Thu, 13 Dec 2007 11:36:47 -0600 (CST) HelpDesk ticket 108345 AFS corruption of /afs/fnal.gov/files/home/room1/kreymer/minos/log/rawcopy/far/encp.2007-11.log Date: Thu, 13 Dec 2007 12:06:06 -0600 (CST) Subject: Your ticket 108345 has been reassigned to HILL, KEVIN Checked other files in rawcopy/far and near with 'file', one other data file, far/2007-08.log, with one @ byte, rerun, because the disk filled at that time. Checked all the topdb and pnfslogs, these all look intact. 2007 12 18 - 15:15 - dbarchive DCS_HV.MYD 35m PULSERGAIN.MYD rm -r /data/archive/COPY/20071107 ####### # SAM # ####### Tested dev universe after Tuesday's upgrades, using new TEST section of HOWTO.sam. ####### # SAM # ####### Downtime scheduled 10:00 for Oracle/System patches of production minosora1 Date: Thu, 13 Dec 2007 10:32:35 -0600 oracle database patch and host server reboot are complete.? minosprd database is available for use. Date: Thu, 13 Dec 2007 17:09:36 +0000 (UTC) Thanks ! The Minos production dbserver and station rode throught the maintenance, and are still functioning normally. ============================================================================= 2007 12 12 ####### # AFS # ####### NEWGROUP=cc pts creategroup -name kreymer:${NEWGROUP} group kreymer:cc has id -2502 for GUSER in buckley kreymer tagg urheim ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts membership kreymer:${NEWGROUP} buckley kreymer urheim tagg pts examine kreymer:${NEWGROUP} Name: kreymer:cc, id: -2502, owner: kreymer, creator: kreymer, membership: 4, flags: SOMar, group quota: 0. pts chown kreymer:${NEWGROUP} minos:admin Now assign this to d88 fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d88 \ -acl minos:admin rlidwka fs setacl \ -dir /afs/.fnal.gov/files/data/minos/d88 \ -acl minos:cc rlidwka ########## # CONDOR # ########## Date: Wed, 12 Dec 2007 17:08:39 -0600 (CST) HelpDesk ticket 108310 Trying to set up a frequent KCA based proxy for use by the factory Script factproxy kx509 kxlist -p voms-proxy-init \ -noregen \ -voms fermilab:/fermilab/minos/Role=pilot \ -vomslife 12:0 \ -valid 12:0 \ -out /local/scratch25/kreymer/kreymer-pilot.proxy This works OK for my normal ticket, MINOS25 > voms-proxy-info -all -file /local/scratch25/kreymer/kreymer-pilot.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy/CN=proxy issuer : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy identity : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy type : unknown strength : 512 bits path : /local/scratch25/kreymer/kreymer-pilot.proxy timeleft : 11:57:32 === VO fermilab extension information === VO : fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. 
Kreymer/USERID=kreymer issuer : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov attribute : /fermilab/minos/Role=pilot/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 23:59:18 But fails for the kcron ticket, even for the minimal vpi form MINOS25 > kcron MINOS25 > klist -f Ticket cache: /tmp/krb5cc_1060_c22360 Default principal: kreymer/cron/minos25.fnal.gov@FNAL.GOV MINOS25 > kx509 MINOS25 > kxlist -p Service kx509/certificate issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA subject= /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/0.9.2342.19200300.100.1.1=kreymer serial=CF0A30 hash=f6c1da48 MINOS25 > voms-proxy-init -noregen -voms fermilab:/fermilab -debug Detected Globus version: 22 Unspecified proxy version, settling on Globus version: 2 Number of bits in key :512 Using configuration file /minos/scratch/kreymer/VDT/glite/etc/vomses Using configuration file /home/condor/execute/dir_11128/userdir/glite/etc/vomses Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Files being used: CA certificate file: none Trusted certificates directory : /minos/scratch/kreymer/VDT/globus/TRUSTED_CA Proxy certificate file : /tmp/x509up_u1060 User certificate file: /tmp/x509up_u1060 User key file: /tmp/x509up_u1060 Output to /tmp/x509up_u1060 Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer Using configuration file /minos/scratch/kreymer/VDT/glite/etc/vomses Using configuration file /home/condor/execute/dir_11128/userdir/glite/etc/vomses Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Failed Error: fermilab: User unknown to this VO. None of the contacted servers for fermilab were capable of returning a valid AC for the user. I suspect that this kcron cert is not known to the VO Trying to self register, page [-] Members . Re-sign Grid and VO AUPs [-] Certificates . Add Certificate The search form is missing. You are logged in as /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 /DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1 and You are logged in as /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/UID=kreymer /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA 2007 12 14 - Chadwick added the CN=cron certificate, with pilot role. I can now create a proxy : MINOS25 > voms-proxy-init \ -noregen \ -voms fermilab:/fermilab/minos/Role=pilot \ -vomslife 12:0 \ -valid 12:0 \ -out /local/scratch25/kreymer/kreymer-pilot.proxy Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Done Creating proxy ........................................................ Done Your proxy is valid until Fri Dec 14 21:36:53 2007 MINOS25 > voms-proxy-info -all -file /local/scratch25/kreymer/kreymer-pilot.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. 
Kreymer/USERID=kreymer/CN=proxy issuer : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer identity : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer type : proxy strength : 512 bits path : /local/scratch25/kreymer/kreymer-pilot.proxy timeleft : 11:59:33 === VO fermilab extension information === VO : fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer issuer : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov attribute : /fermilab/minos/Role=pilot/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 11:59:32 But this was for a normal KCA cert which somehow crept in, not the cron MINOS25 > voms-proxy-info -all -file ${PDIR}/${PFIL} WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer/CN=proxy issuer : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer identity : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer type : proxy strength : 512 bits path : /local/scratch25/kreymer/.grid/kreymer-pilot.proxy timeleft : 8:35:42 === VO fermilab extension information === VO : fermilab subject : /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Arthur E. Kreymer/USERID=kreymer issuer : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov attribute : /fermilab/minos/Role=pilot/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 10:35:47 ########## # OFFICE # ########## Copied white/black boards to file:///minos/scratch/kreymer/ File: dscn0550.jpg 1123 KB 12/12/2007 03:06:41 PM File: dscn0552.jpg 1052 KB 12/12/2007 03:06:44 PM File: dscn0553.jpg 1038 KB 12/12/2007 03:06:46 PM ########## # CONDOR # ########## Date: Wed, 12 Dec 2007 15:04:21 -0600 (CST) HelpDesk ticket 108299 Short Description: Minos Cluster - condor 6.9.5 preinsatllation Problem Description: run2-sys : Please install the following RPM in all the minos01 through minos26 systems. http://fermigrid.fnal.gov/files/condor/condor-6.9.5-linux-x86-rhel3-dynamic-1.i386.rpm This rpm places new files in /opt/condor-6.9.5, and should not interfere with existing operations. The rpm is about 95 MB, and unwinds into about 250 MBytes. The actual upgrade is still being planned, and will consist roughly of Condor shutdown/swap config files/Condor start. Background : We will need to upgrade the Condor version on the Minos Cluster from 6.8.6 to 6.9.5 sometime soon, by next week, in order to be compatible with the new Condor being deployed next week on the GPfarm It is probably best to do this soon, before the holidays. Date: Wed, 12 Dec 2007 15:20:20 -0600 (CST) This ticket has been reassigned to ALLEN, JASON of the CD-SF/FEF Group. 
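Once run2-sys reports back, the pre-installation is easy to verify without reopening the ticket. A sketch under the assumption that the package registers under a name starting with condor- (guessed from the RPM file name in the ticket) and that ssh to minos01 through minos26 works as in the other scans.

# Sketch: report any installed condor RPMs on each cluster node.
# Package name pattern is assumed from the RPM file name above.
for N in `seq -w 1 26` ; do
    NODE=minos${N}
    printf "%-10s " ${NODE}
    ssh -o ConnectTimeout=10 ${NODE} 'rpm -qa | grep ^condor- || echo no-condor-rpm'
done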
########## # CONDOR # ########## From 15 Oct plan from Timm, we should be able to pre-install http://fermigrid.fnal.gov/files/condor/condor-6.9.5-linux-x86-rhel3-dynamic-1.i386.rpm We presently have http://fermigrid.fnal.gov/files/condor/condor-6.8.6-linux-x86-rhel3-dynamic-1.i386.rpm The suggested Minos config files are in http://fermigrid.fnal.gov/files/condor/minos/ ###### # CD # ###### FYI - IT/Comp Prof jobs renaming, http://wdrs.fnal.gov/job_descript/info_tech/IT_Job_Description_Review.ppt ######## # FARM # ######## Investigating mia sntp's for N00008218 N00008221 N00008224 N00008227 N00008230 N00008233 N00008238 ########### # BLUEARC # ########### Date: Wed, 12 Dec 2007 09:16:38 -0600 (CST) Etta Burns and Dave Bell worked on the array this morning. We believe the Minos area is now stable. ettab - 8300 dbell - 4482 Date: Wed, 12 Dec 2007 16:44:57 +0000 (UTC) From: Arthur Kreymer To: minos-admin@fnal.gov, csi-mgmt@fnal.gov Cc: rayp@fnal.gov, ettab@fnal.gov, dbell@fnal.gov Subject: Re: HelpDesk ticket 108251 - resolved, executive summary This is an executive summary of the resolution of HelpDesk ticket 108251, mount failures of the BlueArc served /minos/data, /minos/scratch, based on a conversation with Dave Bell. If there are no corrections from the experts, I will forward this to the Minos collaboration in general. Observations : 1) No disk or array failures were observed internal to the array. 2) There is a level of soft errors in the Fiber Channel fabric which is consistent with the presently normal operation of similar arrays at Fermilab. These do not seem to be obviously correlated with our mount timeouts. 3) There were communication failures to the array's FC ports which seem identical to those seen previously on similar CMS servers. Those problems were resolved about a month ago in CMS by changing Nexsan controller settings from 'active/active' to 'active/passive' This change from a/a to a/p was applied to the Minos server this morning. Actions : Minos should resume normal use of the data and scratch areas, expecting to see no further timeouts. ============================================================================= 2007 12 11 ########### # BLUEARC # ########### Date: Tue, 11 Dec 2007 19:48:52 -0600 (CST) HelpDesk ticket 108251 LSC/CSI : As of about 19:40, the /minos/data and /minos/scratch NFS mounts have timed out on the Minos Cluster and on fnpcsrv1. This shuts down all Minos farm processing, and most analysis. ############## # MINOS_DATA # ############## cd $MINOS_DATA/d10 DIRS=`ls -ld recodata* | grep lrwx | cut -f 2 -d / | cut -c 2- | sort -n` cd .. 
for DIR in ${DIRS} ; do fs listquota d${DIR} | grep nb ; done All are 50000 except d11 d21 d22 d46-49 d71-776 Look at a block of not so full directories nb.minos.d81 50000000 31401634 63% 78% nb.minos.d86 50000000 32841842 66% 85% nb.minos.d88 50000000 34690590 69% 78% nb.minos.d89 50000000 29209809 58% 78% nb.minos.d90 50000000 33385630 67% 77% nb.minos.d91 50000000 44757332 90% 77% d86 had an old kreymer file cp -a d86/kreymer/F00034242_0013.mdaq.root /minos/scratch/kreymer/ and files like F00036724_0008.spill.sntp.R1_18_4.0.root vintage Oct 2006 N00011059_0001.spill.sntp.R1_18_4.0.root vintage Oct 2006 a20000180_0001.cnts.R1.14.root vintage May/Jun 2005 c10000659_0001.cnts.R1.14.root vintage May/Jun 2005 MINOS26 > grep c10000695_0009.cnts.R1.14.root d10/indexes/*.index d10/indexes/mc_far.R1.14.index:recodata16/c10000695_0009.cnts.R1.14.root I'm removing all from mc_far.R1.14.index Doing this on fnpcsrv1 as rubin cut/paste from shrc/kreyemr cd /afs/fnal.gov/files/data/minos/d10/indexes cp remove_vsn.mc /rvm hacked this to preview, looks OK, files will come out of recodata16/17/18/19 cd ../../d86/recodata16 SRV1> rm recodata16 SRV1> ls | wc -l 287 SRV1> for FILE in ${FILES} ; do grep ${FILE} ../../d10/indexes/*R1_18_4*.index; done | cut -f 1 -d : | cut -f 5 -d / | sort -u 2006-10_far.R1_18_4.index 2006-10_near.R1_18_4.index 2006-11_far.R1_18_4.index 2006-11_near.R1_18_4.index INDS=`` SRV1> for IND in ${INDS} ; do grep recodata16 ../../d10/indexes/${IND} ; done | wc -l 287 Now shift these to recodata17 on d88 See that we have space fs listquota ../../d88 Volume Name Quota Used %Used Partition nb.minos.d88 50000000 15788915 32% 74% Move em for FILE in ${FILES} ; do cp -va ${FILE} ../../d88/recodata17/${FILE} done Diff em for FILE in ${FILES} ; do echo ${FILE} diff ${FILE} ../../d88/recodata17/${FILE} done Reindex em for IND in ${INDS} ; do nedit ../../d10/indexes/${IND} ; done changed recodata16 to recodata17 for IND in ${INDS} ; do scp ../../d10/indexes/${IND} fnpcsrv1:~minfarm/web/indexes/${IND} done Clear em for FILE in ${FILES} ; do rm ${FILE} done Tue Dec 11 20:15:30 CST 2007 ############ # NOACCESS # ############ VOC083 181.57GB (NOACCESS 1210-2159 none 0731-0922) CD-9940B minos.mcin_near_daikon_04.cpio_odc ########### # BLUEARC # ########### HelpDesk ticket 108225 LSC/CSI : Today at around 11;30, till around 11:35 ( roughly ) the NFS mounts of the BlueArc served /minos/data and /minos/scratch timed out many or all of the Minos Cluster nodes. Is there a known problem ? For reference, here are some sample mounts : MINOS26 > grep blue /etc/fstab blue2:/fermigrid-data /grid/data nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-app /grid/app nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 Date: Tue, 11 Dec 2007 12:13:05 -0600 (CST) This ticket has been reassigned to HILL, KEVIN of the CD-LSCS/CSI/CS/EST Group. Date: Tue, 11 Dec 2007 13:43:04 -0600 (CST) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. ######## # FARM # ######## Why are write queues so long, and why did they shoot up to over 5000 ? 
w-stkendca10a-4
w-stkendca11a-5
w-stkendca9a-5
w-stkendca9a-6

FAMS=`cat /tmp/pool9a6 | cut -f 6,6 -d ' ' | sort -u | grep 'si=' \
 | cut -f 2 -d '{' | cut -f 1 -d '}' | grep -v unknown`

for FAM in ${FAMS} ; do
  printf "${FAM} " ; grep "{${FAM}}" /tmp/pool9a6 | wc -l
done

for POOL in 10a-4 11a-5 9a-5 9a-6 ; do
  curl -o /tmp/pool${POOL} http://fndca3a.fnal.gov/dcache/files/w-stkendca${POOL}.files
done

for POOL in 10a-4 11a-5 9a-5 9a-6 ; do
  printf "\n${POOL}\n"
  FAMS=`cat /tmp/pool${POOL} | cut -f 6,6 -d ' ' | sort -u | grep 'si=' \
   | cut -f 2 -d '{' | cut -f 1 -d '}' | grep -v unknown | grep -v minos`
  for FAM in ${FAMS} ; do
    printf "${FAM} " ; grep "{${FAM}}" /tmp/pool${POOL} | wc -l
  done
done

=============================================================================
2007 12 10

##########
# DCACHE #
##########

848 files dated 2007-12-08 are listed at
http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt
Some are on tape now, some are not, like
/pnfs/minos/reco_far/cedar_phy_bhcurv/cand_data/2006-01/F00033671_0020.all.cand.cedar_phy_bhcurv.0.root

Took a snapshot :
curl -s -o /var/tmp/kreymer/minos.txt http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt

#######
# CFL #
#######

'Newline appended' - comes from the 'ed' step.

GRRRRRRRRRR - out of quota in minos.log_data
-rw-r--r-- 1 kreymer 1525 260832953 Dec 10 17:10 CFL.new
-rw-r--r-- 1 kreymer 1525 176661307 Dec  8 19:16 CFL.old

MINOS26 > fs listquota
Volume Name                    Quota       Used %Used          Partition
minos.log_data               8000000    7813906   98%<<              26% <

wc -l CFL.new
1488395 CFL.new

MINOS26 > dds CFL.new
-rw-r--r-- 1 kreymer 1525 279971484 Dec 10 17:59 CFL.new

no more message 'Newline appended'
wc matches

########
# GRID #
########

According to Steve Timm, the attribute to select Minos AFS nodes in GPFARM
will be ISMINOSAFS
This is boolean, true or false.

##########
# DC2NFS #
##########

$ AFSS/dc2nfs -d beam_data 2>&1 | tee -a /tmp/dc2nfs.beam_data.log
...
STARTED
FINISHED

#########
# ADMIN #
#########

HelpDesk ticket 108182

LSC/CSI :
Please set an individual storage quota of 500 GBytes for user pawloski
on the BlueArc served /minos/scratch volume.
This overrides the existing default 100 GBytes quota.

Date: Tue, 11 Dec 2007 09:50:29 -0600 (CST)
Solution: quota increased
This ticket was resolved by HILL, KEVIN of the CD-LSCS/CSI/CS/EST group.

########
# GRID #
########

/grid/app and data mount on minos02-minos25
Scanned - these are presently mounted only on 01 and 26.

HelpDesk ticket 108170

run2sys :
Please mount /grid/data and /grid/app on Minos Cluster nodes
minos02 through minos25.
/grid/data should be read/write .
/grid/app should be readonly .
These are already mounted r/w on minos01 and 26, and should remain so.

###########
# MEETING #
###########

sent travel request to Rachel Rauchmiller 4514

#########
# ADMIN #
#########

MINOS26 > setup systools
MINOS26 > cmd add_minos_user boyd
cmd: Unable to determine your group name, gid = 1525

MINOS01 > setup systools
MINOS01 > cmd add_minos_user boyd
You are not authorized to run this command!
MINOS01 > date
Tue Dec 11 09:13:32 CST 2007

#######
# AFS #
#######

For the following scans, piped the output through uniq.
Should do this directly in the 'for' statement next time.
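Next time, fold the uniq into the loop itself. An untested sketch,
assuming NODES is the usual minos01 - minos26 list :

for NODE in ${NODES} ; do
  printf "${NODE}\n"
  ssh ${NODE} 'grep afs: /var/log/messages | grep "Dec " | grep -v Tokens' | uniq
done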
messages MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Dec " | grep -v Tokens'; done minos02 Dec 10 05:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 10 05:15:54 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) messages.1 MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages.1 | grep "Dec " | grep -v Tokens'; done minos01 Dec 6 19:16:34 minos01 kernel: afs: Waiting for busy volume 1685736052 (minos.log_data) in cell fnal.gov minos02 Dec 2 06:15:17 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 2 06:18:03 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 2 08:15:12 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 2 08:16:13 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 4 04:15:46 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 04:21:18 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 4 09:15:41 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 09:16:28 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 4 18:16:10 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 18:20:51 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 4 22:15:33 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 22:17:20 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 5 05:18:06 minos02 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 05:21:07 minos02 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 5 10:15:38 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 10:16:33 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 5 12:15:53 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 12:17:58 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 5 16:15:22 minos02 kernel: afs: Lost contact with file server 131.225.68.19 
in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 16:17:26 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 5 22:15:22 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 22:18:34 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 6 16:15:20 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 6 16:17:48 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 7 07:15:19 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 7 07:17:25 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 7 18:15:25 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 7 18:18:38 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos04 Dec 5 05:19:45 minos04 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 05:22:38 minos04 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 8 15:28:52 minos04 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 8 15:30:39 minos04 kernel: afs: file server 131.225.68.49 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos05 Dec 5 05:20:16 minos05 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 05:23:01 minos05 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos06 Dec 5 05:21:15 minos06 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 5 05:23:56 minos06 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos08 Dec 4 05:57:39 minos08 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 06:00:57 minos08 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 8 16:11:15 minos08 kernel: afs: Lost contact with file server 131.225.68.49 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 8 16:11:16 minos08 kernel: afs: failed to store file (110) Dec 8 16:11:47 minos08 kernel: afs: file server 131.225.68.49 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos11 Dec 3 13:25:52 minos11 kernel: afs: failed to store file (over quota) Dec 8 22:28:59 minos11 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all 
multi-homed ip addresses down for the server) Dec 8 22:30:35 minos11 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos12 Dec 4 05:57:33 minos12 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 05:58:08 minos12 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos15 Dec 3 04:46:46 minos15 kernel: afs: Lost contact with volume location server 131.225.68.4 in cell fnal.gov Dec 3 04:49:41 minos15 kernel: afs: volume location server 131.225.68.4 in cell fnal.gov is back up Dec 7 09:36:02 minos15 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 7 09:38:12 minos15 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos16 Dec 4 05:58:00 minos16 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 05:58:37 minos16 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos23 Dec 7 09:36:02 minos23 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 7 09:37:16 minos23 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos24 Dec 4 05:57:40 minos24 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 4 06:00:56 minos24 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Dec 6 11:32:51 minos24 kernel: afs: Lost contact with file server 131.225.68.65 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 6 11:34:35 minos24 kernel: afs: file server 131.225.68.65 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos26 Dec 6 08:06:07 minos26 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Dec 6 08:06:58 minos26 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) ############ # MCIMPORT # ############ 11:06 cycle complains of full disk Found /tmp full -rw-r--r-- 1 mindata e875 934518784 Dec 10 10:45 junk.file Removed this file $ df -h minos-nas-0.fnal.gov:/minos/data 12T 8.6T 2.6T 78% /minos/data minos-nas-0.fnal.gov:/minos/scratch 3.3T 783G 2.6T 24% /minos/scratch ############# # CHECKLIST # ############# queued stores peaking over 6000, averaging 3500 since 12/8 activity ramped up 12/2, gap 12/6 through 12/7 staging sharp spikes over 2000 on 12/3 noon and 12/8 evening Enstore servers - see backlog writing reco_far_cedar_phy_bhcurv_cand ################## # VACATION NOTES # ################## sam-design Thursday ( Nov ? ) 2 SRM-s, problems, srm restarted stkendca9a raid array repaired, back in service ? d0ora2 crashes 11/26 and 12/3 ( back disk/bios ? ) sam-users - d0ora2 down again 12/8 fnoaa - will be retired, what is this ? 
Helpdesk assessment report 14 Dec minos-admin jpfitz set up tools to let us add new users setup systools cmd add_minos_user bspeak having trouble on some minos?? systems with kcron/kcroninit minos_software_discussion FNALU meeting Dec 19 - hartnell/young minos xTravel - make arrangements xminosdb - x farm db reconnects ? x open file limit increased to 4K minos-data CCPID needs space in afs 10-20 GB recycle request 8 Dec from berg kschu - 4 Dec metadata xAFS x 3 Dec - update requested by rayp x sent new list of errors BATCH x need massive undeclare of MC D04 - or not x deferred, just mrcc were missing 12/7 - is corral using the mysql data disk ? SHIFT - T962 DAQ access SIM kordosky teragrid suggestion 12/5 x corrupt/duplicate files 12/7 x corrupt at origin xCFL x Wed 5 Dec email received 'Newline appended' x log_data partition had filled up xCONDOR x Vahle - cannot read file in d195 12/7 x. liz fixed it ######### # FNALU # ######### Date: Mon, 10 Dec 2007 09:18:11 -0600 From: mgreaney@fnal.gov To: kreymer@fnal.gov Subject: FNALU General meeting, December 19, WH1W 1:30pm To all, There will be a general meeting for experimenters and users using the FNALU cluster on December 19, in WH 1 West from 1:30-3:00pm. The purpose of the meeting is get input from experimenters and users on what resources are needed and to identify experiments using FNALU. Also the status or changes to support for FNALU will be discussed. If you are not able to attend this meeting, please send an email response to dss-est@fnal.gov with details of your project. Please include these details: 1. Name of experiment and scope of the project 2. CPU needs for the duration of the project 3. Disk space needs 4. Applications (licensed) needed 5. Whether or not you are using a local filesystem mounted on fnalu 6. Whether or not you use LSF Thank you, DSS Group ============================================================================= ############ # VACATION # ############ on vacation Dec 3-7 ============================================================================= 2007 11 30 ########## # CONDOR # ########## Could not add fermilab/minos group role of pilot. Helpdesk ticket ############# # CHECKLIST # ############# Enstore ball is red - probably LTO3 failure No ND data since 01:00 UTC. 
Beam returned around 07:00 UTC Tabs to preserve in office swap - http://www.cs.wisc.edu/condor/tutorials/barcelona-2006/ http://www.lnf.infn.it/computing/afs/doc/adm/adm02.htm http://www.lnf.infn.it/computing/afs/doc/adm/adm02.htm http://glite.web.cern.ch/glite/packages/R3.0/deployment/glite-WMS/glite-WMS.asp ============================================================================= 2007 11 29 ########## # OFFICE # ########## Kreymer/Ayres moved from 1270/1265 to 1260 Plunkett moved from 1260 to 1270 Networks moving tomorrow 07:00 ######## # FARM # ######## ./volumes vols ./volumes mcout_cedar_phy_near_daikon_00_cand CVOLS=` ./volumes mcout_cedar_phy_near_daikon_00_cand` MINOS26 > ./stage -d -p 0 VOC472 Needed 119/356 for VOL in ${CVOLS} ; do ./stage -w -g readPools ${VOL} done 2>&1 | tee -a /tmp/stagecand.log enstore info --list=VOC472 ####### # AFS # ####### MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 28" | grep -v Tokens'; done minos02 Nov 28 04:15:20 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 04:15:56 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Nov 28 11:15:18 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 11:17:03 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Nov 28 14:15:25 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 14:17:25 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Nov 28 20:15:27 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 20:17:05 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos19 Nov 28 04:51:25 minos19 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 04:51:42 minos19 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) ============================================================================= 2007 11 28 ############## # BLACKBOARD # ############## Prior to kreymer office move from 1270 to 1250 Here's notes from the blackboard ( Not worth photographing into DocDB ) HOSTS bsub -R Cores FLXB11-30 SL3 "linux24" 30 FLXB31-34 SL4 "linux26" 10 condor SL4 35 RECO PATH DISCUSSION RECO DET / REL / STR / MO MCOUT REL / DET / MC / CONF / STR / RUN MC in /minos/data and in future ? MCR / CONF / DET / REL / STR / RUN TODO LIST .forward ? 
crontab vs /usr/bin/aklog minos workgroup has shepelak in the .k5login of root ####### # CRL # ####### MINOS26 > fs listacl /afs/fnal.gov/files/data/minos/crl_data/WWWdirectory/crlwforms Access list for /afs/fnal.gov/files/data/minos/crl_data/WWWdirectory/crlwforms is Normal rights: kschu:crlweb2 rlidwk bgreen:minoscrladmin rlidwka bgreen:minoscrl rlidwk spanacek:crladmin rlidwka system:administrators rlidwka system:anyuser rl buckley rlidwka bgreen rlidwka habig needs access someone should fs setacl \ -dir /afs/.fnal.gov/files/data/minos/crl_data/WWWdirectory/crlwforms \ -acl habig rlidwka MINOS26 > pts membership bgreen:minoscrladmin buckley bgreen avva saranen MINOS26 > pts membership spanacek:crladmin dave_b spanacek stephen bgreen mccusker 131.225.110.8 131.225.110.61 ########## # CONDOR # ########## Date: Wed, 28 Nov 2007 13:38:06 -0600 From: Sfiligoi Igor To: jason@fnal.gov Cc: kreymer@fnal.gov, minos-admin@fnal.gov Subject: Change to allow Art to manage the MINOS Condor pool Hi Jason. In order for Art to administer the MINOS Condor pool, the following line "/DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310" kreymer should be put into the /etc/grid-security/condor-grid-mapfile of all the Condor worker nodes (minos25 would need it, too, but since we at the moment use the same DN also for the glideins, it should not change from what it is now). This change does not need any Condor reconfig to be effective... as soon as it is in, Art will be allowed to issue administrative commands. Thanks, Igor & Art Date: Wed, 28 Nov 2007 15:34:42 -0600 From: Jason Harrington This has been done. ########### # NETWORK # ########### ... work in progress ... KREYMERNOTE a.k.a. KREYMERLAPFNAL.dhcp.fnal.gov 131.225.56.160 MAC's 00-0E-35-A2-22-59 00-01-4A-04-65-23 The IP assigned to 131.225.56.160 shifted to the wireless and got the new name KREYMERFNALGOV-1024593-dp.dhcp.fnal.gov It has not worked stably on the wired network for several weeks. Earlier today, the wire was offline and wireless had SRV1> host 131.225.94.156 156.94.225.131.in-addr.arpa domain name pointer G-Bs-Computer.dhcp.fnal.gov. ########### # ROUNDUP # ########### SRV1> ./farmgsum nearcat 4370 201652 mcnearcat mcnearcat 1 19 mrnt.cedar_phy_oldbhcurv.root 4021 192594 mrnt.cedar_phy.root 3 197 sntp.cedar_phy_oldbhcurv.root 345 18624 sntp.cedar_phy.root OK, need to proceed with mcnearcat today to clear backlog, to test MC sam declares and moved to /m/d/... AFSS/roundup.20071126 -n -r cedar_phy_oldbhcurv mcnear all 4 are duplicates AFSS/roundup.20071126 -n -r cedar_phy mcnear 2>&1 | tee /tmp/cpmc.log OK adding n13011004_0000_L010185N_D00.mrnt.cedar_phy.root 11 OK adding n13011006_0000_L010185N_D00.mrnt.cedar_phy.root 1 OK adding n13011007_0000_L010185N_D00.mrnt.cedar_phy.root 1 OK adding n13011008_0000_L010185N_D00.mrnt.cedar_phy.root 1 OK adding n13011009_0000_L010185N_D00.mrnt.cedar_phy.root 1 OK adding n13011009_0009_L010185N_D00.mrnt.cedar_phy.root 2 OK adding n13011027_0000_L010185N_D00.mrnt.cedar_phy.root 11 OK adding n13011043_0000_L010185N_D00.mrnt.cedar_phy.root 11 OK adding n13011046_0000_L010185N_D00.mrnt.cedar_phy.root 11 OK adding n13011053_0000_L010185N_D00.mrnt.cedar_phy.root 11 ... 
Monitor with less LOG/2007-11/cedar_phymcnear.log Run 1 file AFSS/roundup.20071126 -r cedar_phy -s n13011006 mcnear mkdir: cannot create directory `/minos/data/mcout_data/daikon_00': Permission denied OOPS - cannot create CC area /minos/data/mcout_data/daikon_00/L010185N/near/cedar_phy/mrnt_data/100 as mindata, cd /minos/data chmod 775 mcout_data AFSS/roundup.20071126 -r cedar_phy -s n13011007 mcnear Corrected roundup to allow MC saddreco AFSS/roundup.20071126 -r cedar_phy -s n13011008 mcnear The scope of saddreco is overly broad, repeating all run ranges. Live with it for now. In future, try to narrow down. MCREL=daikon_00 DET=near REL=cedar_phy CONF=L010185N/cand_data/141 ./saddreco -m ${MCREL} -d ${DET} -r ${REL} -p ${CONF} -b 1 --verify Changed the SLOG directory and name to ${HOME}/ROUNTMP/LOG/saddreco/${MCREL}/${REL}/${DET}_${CONF}.log AFSS/roundup.20071126 -r cedar_phy -s n13011004 mcnear oops, mkdir -p of log directory was misplaced, try again AFSS/roundup.20071126 -r cedar_phy -s n13011009 mcnear The SADD phase is taking about 30 minutes for L010185N, due to the large number of cand files in about 200 RUN ranges Let's rip on the catchup now ! MIN > mv roundup.20071126 roundup.20071128 SRV1> cp -a AFSS/roundup.20071128 . SRV1> ln -sf roundup.20071128 roundup # was roundup.20071125 ./roundup -r cedar_phy mcnear Wed Nov 28 18:12:44 CST 2007 ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 27" | grep -v Tokens'; done minos02 Nov 27 11:16:05 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 27 11:18:08 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov Nov 27 14:15:22 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 27 14:19:23 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 28" | grep -v Tokens'; done minos02 Nov 28 04:15:20 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 04:15:56 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos19 Nov 28 04:51:25 minos19 kernel: afs: Lost contact with file server 131.225.68.6 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 28 04:51:42 minos19 kernel: afs: file server 131.225.68.6 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) ########## # CONDOR # ########## Updated wms.run to have +RunOnGrid=True Requirements = ((Arch=?="X86_64") || (Arch=?="INTEL")) && (GLIDEIN_Site=!=UNDEFINED) 7367.0 kreymer 11/28 08:34 0+00:00:00 I 0 9.8 probe Submitted 10 process wms, on grid Ran on fnpc206 and 264 Cranked up to 100 processes, cleaned up probe printout csub wms.run 8227.99 kreymer 11/28 11:02 0+00:00:00 I 0 9.8 probe 20 99 here a ########### # ROUNDUP # ########### Did one more catchup on cedar ./roundup -r cedar far Wed Nov 28 08:15:31 CST 2007 Wed Nov 28 08:33:43 CST 2007 ./roundup -r cedar near Wed Nov 28 08:35:45 CST 2007 Wed Nov 28 09:16:18 CST 2007 SRV1> ./farmgsum nearcat 4370 201652 mcnearcat mcnearcat 1 19 mrnt.cedar_phy_oldbhcurv.root 4021 192594 mrnt.cedar_phy.root 3 197 sntp.cedar_phy_oldbhcurv.root 345 18624 sntp.cedar_phy.root OK, need to proceed with mcnearcat today to clear backlog, to test MC sam declares and moved to /m/d/... 
AFSS/roundup.20071126 -n -r cedar_phy_oldbhcurv mcnear all 4 are duplicates AFSS/roundup.20071126 -n -r cedar_phy mcnear 2>&1 | tee /tmp/cpmc.log ============================================================================= 2007 11 27 ########## # CONDOR # ########## Testing glidein factory alias csub='condor_submit $*' alias cq='condor_q $*' cd /minos/scratch/kreymer/condor/probe wms.run RunOnGrid=True wms2.run +RunOnGrid=True skip kcron condor_queue 7206.0 kreymer 11/27 20:23 0+00:00:00 I 0 9.8 kcron /minos/scrat 7207.0 kreymer 11/27 20:25 0+00:00:00 I 0 9.8 probe 7208.0 gfactory 11/27 20:26 0+00:00:00 I 0 9.8 glidein_startup.sh 7208.1 gfactory 11/27 20:26 0+00:00:00 I 0 9.8 glidein_startup.sh 7208.2 gfactory 11/27 20:26 0+00:00:00 I 0 9.8 glidein_startup.sh 7208.3 gfactory 11/27 20:26 0+00:00:56 R 0 9.8 glidein_startup.sh 7208.4 gfactory 11/27 20:26 0+00:00:00 I 0 9.8 glidein_startup.sh 20:32 cq 7206.0 kreymer 11/27 20:23 0+00:00:00 I 0 9.8 kcron /minos/scrat 7207.0 kreymer 11/27 20:25 0+00:00:00 I 0 9.8 probe 7208.0 gfactory 11/27 20:26 0+00:01:06 R 0 9.8 glidein_startup.sh 7208.1 gfactory 11/27 20:26 0+00:02:06 R 0 9.8 glidein_startup.sh 7208.2 gfactory 11/27 20:26 0+00:02:06 R 0 9.8 glidein_startup.sh 7208.3 gfactory 11/27 20:26 0+00:04:06 R 0 9.8 glidein_startup.sh 7208.4 gfactory 11/27 20:26 0+00:02:06 R 0 9.8 glidein_startup.sh 7209.0 gfactory 11/27 20:29 0+00:01:06 R 0 9.8 glidein_startup.sh MINOS25 > condor_status | grep fnp vm1@9790@fnpc LINUX X86_64 Owner Idle 1.000 39 0+00:05:30 vm2@9790@fnpc LINUX X86_64 Owner Idle 1.960 3891 0+00:05:31 vm1@20628@fnp LINUX X86_64 Owner Idle 1.000 39 0+00:00:08 vm2@20628@fnp LINUX X86_64 Owner Idle 2.190 3891 0+00:00:09 vm1@10273@fnp LINUX X86_64 Owner Idle 1.000 39 0+00:00:11 vm2@10273@fnp LINUX X86_64 Owner Idle 2.450 3891 0+00:00:12 vm1@17343@fnp LINUX X86_64 Owner Idle 1.000 39 0+00:00:11 vm2@17343@fnp LINUX X86_64 Owner Idle 1.990 3891 0+00:00:12 vm1@8821@fnpc LINUX X86_64 Owner Idle 1.000 39 0+00:00:12 vm2@8821@fnpc LINUX X86_64 Owner Idle 2.600 3891 0+00:00:13 vm1@11264@fnp LINUX X86_64 Owner Idle 1.000 39 0+00:00:07 vm2@11264@fnp LINUX X86_64 Owner Idle 1.760 3891 0+00:00:08 condor_q -l 7207.0 ... Arguments = "" RunOnGrid = TRUE GlobalJobId = "minos25.fnal.gov#1196216712#7207.0" ProcId = 0 AutoClusterId = 10 AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,vm2_RemoteUser,User,GLIDEIN_Is_Monitor,RunO$ WantMatchDiagnostics = TRUE LastRejMatchReason = "PREEMPTION_REQUIREMENTS == False" LastRejMatchTime = 1196217579 ServerTime = 1196217767 Possibly a problem with preemption ? 
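For the record, a minimal sketch of what these grid-bound submit files amount to.
The executable / output / log names below are placeholders, not the actual
wms.run / wms2.run contents - only the +RunOnGrid and Requirements lines are
the real point, copied from the wms.run notes above :

# placeholder submit-file sketch
universe     = vanilla
executable   = probe
output       = probe.$(Cluster).$(Process).out
error        = probe.$(Cluster).$(Process).err
log          = probe.log
+RunOnGrid   = True
Requirements = ((Arch=?="X86_64") || (Arch=?="INTEL")) && (GLIDEIN_Site=!=UNDEFINED)
queue 10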
########## # CONDOR # ########## Shutting down further jobs on minos12 : Inspired by condor_off -peaceful -all -startd MINOS12 > condor_status minos12 Name OpSys Arch State Activity LoadAv Mem ActvtyTime vm1@minos12.f LINUX INTEL Claimed Busy 1.910 2026 0+01:33:45 vm2@minos12.f LINUX INTEL Claimed Busy 2.360 2026 0+01:33:47 Total Owner Claimed Unclaimed Matched Preempting Backfill INTEL/LINUX 2 0 2 0 0 0 0 Total 2 0 2 0 0 0 0 MINOS12 > condor_off -peaceful minos12 -subsystem startd Sent "Set-Peaceful-Shutdown" command to startd minos12.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos12.fnal.gov 14:38 MINOS01 > condor_off -peaceful minos01 -subsystem startd Sent "Set-Peaceful-Shutdown" command to startd minos01.fnal.gov Sent "Kill-Daemon-Peacefully" command to master minos01.fnal.gov No obvious effect, but nothing running yet, so MINOS01 > sudo /etc/init.d/condor stop Shutting down Condor (fast-shutdown mode) MINOS01 > sudo /etc/init.d/condor start Shutting down Condor (fast-shutdown mode) Still not running jobs, so stopping startd had the desired effect. Oops, spoke too soon, a job started running. ########## # CONDOR # ########## cd /export/stage/minfarm/.grid voms-proxy-init \ -voms fermilab:/fermilab/minos \ -vomslife 500:0 \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-condor.proxy \ -valid 500:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Creating temporary proxy ...................................................................... Done Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Done Warning: fermigrid2.fnal.gov:15001: validity shortened to 86400 seconds! Creating proxy ........................................................................... Done Your proxy is valid until Tue Dec 18 08:12:08 2007 SRV1> voms-proxy-info -all -file kreymer-condor.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : kreymer-condor.proxy timeleft : 499:56:36 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov attribute : /fermilab/minos/Role=NULL/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL timeleft : 23:56:35 [gfactory@minos25 ~]$ scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-condor.proxy .grid/ Date: Tue, 27 Nov 2007 12:36:10 -0600 From: Jason Harrington Done. Now needs writeable web page mkdir /afs/fnal.gov/files/expwww/numi/html/gfactory fs setacl -dir gfactory -acl sfiligoi rlidwka Date: Tue, 27 Nov 2007 17:24:11 -0600 From: Sfiligoi Igor To: Arthur Kreymer Subject: glideinWMS up and running Hi Art. The glideinWMS is up and running on the MINOS pool. All you need to do to get jobs there is add +RunOnGrid=True to your condor submit file. Well, maybe you want also add (Arch=?="X86_64") || (Arch=?="INTEL") to be able to run on 64-bit machines (most of the GPfarms). 
P.S.: gLExec is not in use right now, as I wanted something simple. Once we put that one in, too, a few more lines will be needed. Cheers, Igor ####### # DAQ # ####### minos-gateway-nd - found disabled account ( !! in /etc/shadow ) hartnell:!!:13059:0:99999:7::: cmetelko:!!:13494:0:99999:7::: koskinen:!!:13552:0:99999:7::: Tested with temporary .k5login in hartnell. ####### # AFS # ####### Looks clean so far today, further timeout last night. for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 27" | grep -v Tokens'; done ######### # MYSQL # ######### minos-mysql1 has load average around 30, since 16:30 yesterday. These are DCS_MAG_FARVLD queries, from minos* nodes. ============================================================================= 2007 11 26 ########### # MINOS12 # ########### Cleaned out /local/scratch12/kreymer files : -rw-r--r-- 1 kreymer 1525 1887436800 Aug 22 2005 offline.aa -rw-r--r-- 1 kreymer 1525 1887436800 Aug 22 2005 offline.ab ... -rw-r--r-- 1 kreymer 1525 1887436800 Aug 22 2005 offline.at Requested removal of root-owned /l/s12/database files MINOS12 > cd /local/scratch12/database/offline MINOS12 > ls -l total 63568964 -rw-rw---- 1 root root 47137798818 Aug 9 2005 PULSERDRIFT.MYD -rw-rw---- 1 root root 17631876096 Aug 10 2005 PULSERDRIFT.MYI -rw-rw---- 1 root root 130156696 Aug 9 2005 PULSERDRIFTPIN.MYD -rw-rw---- 1 root root 49219584 Aug 9 2005 PULSERDRIFTPIN.MYI -rw-rw---- 1 root root 9046 Jul 8 2005 PULSERDRIFTPIN.frm -rw-rw---- 1 root root 69760831 Aug 9 2005 PULSERDRIFTPINVLD.MYD -rw-rw---- 1 root root 12163072 Aug 9 2005 PULSERDRIFTPINVLD.MYI -rw-rw---- 1 root root 8828 Oct 14 2004 PULSERDRIFTPINVLD.frm HelpDesk ticket 107510 Short Description: Please remove files from minos12:/local/scratch12/database/... Problem Description: run2-sys : Please remove the directory and all the files under minos12: /local/scratch12/database These are old database backups from 2005, and are no longer needed. ( I do not know offhand how they got to be owned by root. ) This ticket has been reassigned to SCOTT, RENNIE of the CD-SF/FEF Group. Date: Mon, 26 Nov 2007 13:25:48 -0600 (CST) Note To Requester: boyd@fnal.gov sent this Notes To Requester: Art, I changed them to be owned by you. You can delete them if you'd like. joe ..................... Date: Tue, 27 Nov 2007 08:37:49 -0600 (CST) From: Arthur Kreymer Thanks ! I have removed the files. ........................ Copied the files to /minos/data/analysis/database, sum * removed the originals Actually, did a final checksum, from /minos/data on minos02 ( sustained about 20 MBytes/sec ) minos02: 25650 46033007 PULSERDRIFT.MYD 04031 17218629 PULSERDRIFT.MYI 10027 127107 PULSERDRIFTPIN.MYD 38932 48066 PULSERDRIFTPIN.MYI 27312 9 PULSERDRIFTPIN.frm 18962 68126 PULSERDRIFTPINVLD.MYD 06033 11878 PULSERDRIFTPINVLD.MYI 08483 9 PULSERDRIFTPINVLD.frm MINOS02 > date Tue Nov 27 09:26:17 CST 2007 minos12: 25650 46033007 PULSERDRIFT.MYD 04031 17218629 PULSERDRIFT.MYI 10027 127107 PULSERDRIFTPIN.MYD 38932 48066 PULSERDRIFTPIN.MYI 27312 9 PULSERDRIFTPIN.frm 18962 68126 PULSERDRIFTPINVLD.MYD 06033 11878 PULSERDRIFTPINVLD.MYI 08483 9 PULSERDRIFTPINVLD.frm MINOS12 > date Tue Nov 27 10:19:05 CST 2007 10:48 : rm -r database ####### # AFS # ####### fsus02 seems to have been stable since the Saturday 24 Nov upgrades. 
But I have seen timeouts since then on the Minos cluster for fsus05 131.225.68.17 fsus07 131.225.68.6 fsus08 131.225.68.19 for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 25" | grep -v Tokens'; done clean for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 26" | grep -v Tokens'; done minos02 Nov 26 07:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 26 07:17:14 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov minos04 Nov 26 03:56:54 minos04 kernel: afs: Lost contact with file server 131.225.68.6 Nov 26 03:59:05 minos04 kernel: afs: file server 131.225.68.6 in cell fnal.gov i Nov 26 08:17:23 minos04 kernel: afs: Lost contact with file server 131.225.68.17 Nov 26 08:23:44 minos04 kernel: afs: file server 131.225.68.17 in cell fnal.gov minos09 Nov 26 08:17:50 minos09 kernel: afs: Lost contact with file server 131.225.68.17 Nov 26 08:19:00 minos09 kernel: afs: file server 131.225.68.17 in cell fnal.gov minos14 Nov 26 08:18:02 minos14 kernel: afs: Lost contact with file server 131.225.68.17 Nov 26 08:19:35 minos14 kernel: afs: file server 131.225.68.17 in cell fnal.gov minos17 Nov 26 08:18:14 minos17 kernel: afs: Lost contact with file server 131.225.68.17 Nov 26 08:19:23 minos17 kernel: afs: file server 131.225.68.17 in cell fnal.gov MIN > host 131.225.68.6 6.68.225.131.in-addr.arpa domain name pointer fsus07.fnal.gov. MIN > host 131.225.68.17 17.68.225.131.in-addr.arpa domain name pointer fsus05.fnal.gov. MIN > host 131.225.68.19 19.68.225.131.in-addr.arpa domain name pointer fsus08.fnal.gov. Sent this information as update to Helpdesk ticket 107032 ########### # ROUNDUP # ########### Corrected PURGE code to use ${CCDEST}/${FILE} size, not GDW/FILE Added one more test file : AFSS/roundup.20071115 -r cedar_phy_bhcurv -s N00009300 near less +F /home/minfarm/ROUNTMP/LOG/saddreco/cedar_phy_bhcurv/near.log Tue Nov 27 01:09:41 CST 2007 OK - stream spill.mrnt.cedar_phy_bhcurv OK - 166050 Mbytes in 327 runs ... OK - stream spill.sntp.cedar_phy_bhcurv OK - 177300 Mbytes in 162 runs ... Tue Nov 27 15:02:58 CST 2007 WRITING to DCache 453 ... looks OK through about 16:53 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00009873_0000.spill.mrnt.cedar_phy_bhcurv.1.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2006-02 PURGE FARM N00009873_0000.spill.mrnt.cedar_phy_bhcurv.1.root SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00012001_0000.spill.mrnt.cedar_phy_bhcurv.0.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2007-03 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00012001_0000.spill.sntp.cedar_phy_bhcurv.0.root /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-03 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00012004_0000.spill.mrnt.cedar_phy_bhcurv.0.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2007-04 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00012004_0000.spill.sntp.cedar_phy_bhcurv.0.root /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-04 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00012007_0000.spill.mrnt.cedar_phy_bhcurv.0.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2007-04 Odd, lots of files not being purged. 
Some of these duplicate files written back on 3 Nov, according to LOG/2007-11/cedar_phy_bhcurvnear.log ########### # ROUNDUP # ########### The above looks good, let's get cedar back into keepup : MINOS26 > mv roundup.20071115 roundup.20071125 SRV1> cp -a AFSS/roundup.20071125 . SRV1> ln -sf roundup.20071125 roundup # was roundup.20070809 ./roundup -n -r cedar far OK - 3882 Mbytes in 9 runs ./roundup -n -r cedar near OK - 4239 Mbytes in 13 runs ./roundup -r cedar far Mon Nov 26 11:23:07 CST 2007 Mon Nov 26 11:56:25 CST 2007 ./roundup -r cedar near Mon Nov 26 11:59:50 CST 2007 ####### # AFS # ####### Date: Mon, 26 Nov 2007 10:30:02 -0600 (CST) HelpDesk ticket 107484 Short Description: Can't see AFS backup area Problem Description: ( I still can not see /afs/fnal.gov/files/backup/home/room1/kreymer. But that is a different problem, which should be tracked separately. This ticket is assigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST. Date: Mon, 26 Nov 2007 12:03:41 -0600 (CST) Solution: joes@fnal.gov sent this solution: Remounted afs backup directory ============================================================================= 2007 11 25 Sunday ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs: /var/log/messages | grep "Nov 25" | grep -v Tokens'; done minos01 minos02 ... minos25 grep: /var/log/messages: Permission denied minos26 Nov 25 19:14:42 minos26 kernel: afs: Waiting for busy volume 1685441905 (expwww.numi.fnalminos) in cell fnal.gov ########### # ROUNDUP # ########### Trying to get data/mc keepup restarted, with mc saddreco, before next week, so can concentrate on condor analysis glideins and related issues. Probed cedar_phy_bhcurv friday, > /tmp/cpbn.log Here are some to chew on. OK adding N00009280_0014.spill.mrnt.cedar_phy_bhcurv.1.root 6 OK adding N00009283_0000.spill.mrnt.cedar_phy_bhcurv.1.root 1 OK adding N00009300_0000.spill.mrnt.cedar_phy_bhcurv.1.root 24 OK adding N00009303_0000.spill.mrnt.cedar_phy_bhcurv.1.root 22 OK adding N00009306_0000.spill.mrnt.cedar_phy_bhcurv.1.root 1 OK adding N00009309_0000.spill.mrnt.cedar_phy_bhcurv.1.root 1 OK adding N00009322_0000.spill.mrnt.cedar_phy_bhcurv.1.root 24 OK adding N00009325_0000.spill.mrnt.cedar_phy_bhcurv.1.root 24 OK adding N00009328_0000.spill.mrnt.cedar_phy_bhcurv.1.root 1 Corrected to check for dup catted file directly in ${GDW}/${SFINI} rather than using the symlink on ROUNDUP/WRITE AFSS/roundup.20071115 -n -r cedar_phy_bhcurv -s N00009283 near adjusted move to CCDEST, try one for real : AFSS/roundup.20071115 -r cedar_phy_bhcurv -s N00009283 near ................ SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00009283_0000.spill.mrnt.cedar_phy_bhcurv.1.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11 PURGE FARM N00009283_0000.spill.mrnt.cedar_phy_bhcurv.1.root Sun Nov 25 19:32:03 CST 2007 SADD less +F /home/minfarm/ROUNTMP/LOG/2005-11/declare_near_cedar_phy_bhcurv.log Sun Nov 25 19:32:04 CST 2007 ........... Oops, needed to make /minos/data/reco* directories group writeable, to that minfarm can write. Oops, the saddreco MC/Data clauses were reversed, and data logs were still going to LOG/SAMMON. 
as mindata, cd /minos/data find reco* -type d -exec ls -ld {} \; find reco* -type d -exec chmod 775 {} \; Trying again, with a single file AFSS/roundup.20071115 -n -r cedar_phy_bhcurv -s N00009306 near AFSS/roundup.20071115 -r cedar_phy_bhcurv -s N00009306 near less LOG/2007-11/cedar_phy_bhcurvnear.log per this log, less +F /home/minfarm/ROUNTMP/LOG/saddreco/cedar_phy_bhcurv/near.log declared 1 file ! ls -l /minos/data/reco_near/cedar_phy_bhcurv/mrnt_data/2005-12 Try one more file, this time 6 concatenated : AFSS/roundup.20071115 -r cedar_phy_bhcurv -s N00009280 near OK adding N00009280_0014.spill.mrnt.cedar_phy_bhcurv.1.root 6 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00009280_0014.spill.mrnt.cedar_phy_bhcurv.1.root /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11 less +F /home/minfarm/ROUNTMP/LOG/saddreco/cedar_phy_bhcurv/near.log Correct earlier misplacement of our first test case : cd ROUNTMP/WRITE FILE=N00009283_0000.spill.mrnt.cedar_phy_bhcurv.1.root mv ${FILE} /minos/data/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11/${FILE} ln -s /minos/data/reco_near/cedar_phy_bhcurv/mrnt_data/2005-11/${FILE} ${FILE} ########### # BLUEARC # ########### Subject: HelpDesk ticket 107457 Short Description: Quota request for rustem on BlueArc - /minos/scratch Problem Description: LSC/CSI : Please set an individual storage quota of 500 GBytes for user rustem, on the BlueArc served /minos/scratch volume. This overrides the existing default 100 GBytes quota. Date: Mon, 26 Nov 2007 08:32:14 -0600 (CST) This ticket has been reassigned to SYU, JOSEPH of the CD-LSCS/CSI/CS/EST Group. Date: Mon, 26 Nov 2007 08:42:10 -0600 (CST) Quota for user rustem has been increased to 500GB. This ticket was resolved by PASETES, RAY of the CD-LSCS/CSI/CS/EST group. ============================================================================= 2007 11 24 Saturday ####### # AFS # ####### Maintenance started right at 06:00 My home area looks OK. Updated LOG with content of LOG1120 Restarted cron jobs kreymer@minos26 mindata@minos26 ####### # AFS # ####### Date: Sat, 24 Nov 2007 10:50:57 -0600 From: Ray Pasetes To: pc-manager@fnal.gov, unix-managers@fnal.gov, linux-users@fnal.gov, macusers@fnal.gov, ppdhelpdesk@fnal.gov, James C Hammer , John J. Konc , Michael J. Kuc , Michael J. Woods , Thomas W. Ackenhusen , snolan@fnal.gov, bd-net-patch@fnal.gov, Jud Parker , csi-mgmt@fnal.gov, CSG , Desktop & Server Support - Enterprise , HelpDesk , Arthur Kreymer , Liz Buckley-Geer , Steven Timm Subject: Re: Status: AFS Outage 11/24 -- Need additional hour [ The following text is in the "ISO-8859-15" character set. ] [ Your display is set for the "ISO-8859-1" character set. ] [ Some special characters may be displayed incorrectly. ] The AFS servers have been upgraded. Please check your systems to make sure they are communicating with the servers. In some cases, AFS clients may need to reboot to properly flush their cache. 
-Ray -- ============================================== Ray Pasetes Email: rayp@fnal.gov CD/LSC/CSI/CS Phone: 630-840-5250 Fermilab, Batavia, IL Fax : 630-840-6345 ============================================== ============================================================================= 2007 11 21 ########### # ROUNDUP # ########### Still need to restart roundup Priorities : Cleanly handle the new /minos/data structure Declare MC to SAM CC sntp to /minos/data SRV1> AFSS/roundup.20071115 -n -r cedar_phy_bhcurv near 2>&1 | tee /tmp/cpbn.los SRV1> AFSS/roundup.20071115 -n -r cedar_phy_bhcurv -s N00009283 near #################### # AFS LOG RECOVERY # #################### < recovered > This is a copy of LOG.recovered, restored this afternoon. The usual LOG file went to 0 length, due to an AFS glitch today. Restored per Helpdesk ticket 107415, apparently to -rw-r--r-- 1 7695 bin 1844447 Nov 21 15:08 LOG.restored ( That's Ray Pasetes ) Trying to reconstruct notes from earlier today : ############ # MCIMPORT # ############ Corrected paths and links to mcin, for those still having /far/mcin Accounts having empty far/mcin for DIR in boehm hgallag kordosky ; do echo ${DIR} cd ${DIR} ls -R far rmdir far/mcin/dcache rmdir far/mcin ln -s ../mcin far/mcin cd .. done For accounts with empty mcin, files in far/mcin, mkdir -p howcroft/mcin/dcache for DIR in howcroft kreymer mualem ; do printf "\n\n${DIR}\n" cd ${DIR} du -sk far ls -lR mcin cd .. rmdir mcin/dcache rmdir mcin mv far/mcin mcin ln -s ../mcin far/mcin du -sk mcin done ############ # MCIMPORT # ############ Started cron keepup : $ cat crontab.dat MAILTO=minos-data@fnal.gov 37 0-22/4 * * * ${HOME}/mcimport -c ALL crontab crontab.dat And did top off run, 50 files before 12:37 cron pass ./mcimport -b 50 OVERLAY ############ # MCIMPORT # ############ Did catchup on the mcimport run without sam enabled, DET=near MCREL=daikon_00 CONF=L010185N_nue DET=near MCREL=daikon_04 CONF=L010185N DET=far MCREL=daikon_04 CONF=L010185N SADDIR=${DET}/${MCREL}/${CONF}/* echo $SADDIR ~/saddmc --verify -n 1 ${MCREL} ${SADDIR} ~/saddmc --declare ${MCREL} ${SADDIR} >> /minos/scratch/mindata/log/saddmc/prd-${DET}-${MCREL}-${CONF}.log 2>&1 ######## # FARM # ######## Copy of recent ntuples for nearline analysis cd /minos/data/minfarm/nearcat MDDIR=/minos/data/reco_near/cedar/sntp_data/2007startup mkdir /minos/data/reco_near/cedar/sntp_data/2007startup for FILE in N*.spill.*.cedar.* ; do echo ${FILE} ; cp ${FILE} ${MDDIR}/${FILE} ; done ============================================================================= 2007 11 20 ####### # AFS # ####### Subject : Re: HelpDesk ticket 107323 ----- Message Text ----- <-- # @@@ Enter Update below this line. @@@ # --> As per discussions today at all levels, please close out this ticket, and withdraw the request for the global firewalling of AFS. Minos will take a couple of actions until the Saturday upgrades : 1) We will try to decouple our Control Room beam data logging from AFS. 2) We will keep the Shift personnel ( x3368 ) informed as to how to report AFS outages via the call center, if such reports are needed. We will follow up with Ray to clarify details like how long to wait before calling the call center, and how to know whether a problem is being worked on. <-- # @@@ Enter Update above this line. 
@@@ # --> ######### # ADMIN # ######### per brebel email MINOS25 > kcroninit ************************************************************************* * * * This system is not properly configured to initialize * * authenticating cron jobs in a secure fashion. * * * * Please contact your sysadmin regarding the ownership and/or * * permissions on the /var/adm/krb5 directory. * * * ************************************************************************* MIN > for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} 'ls -ld /var/adm/krb5 | grep -v "^drwx--s--x"'; done minos25 drwxr-xr-x 2 root root 4096 Oct 19 11:12 /var/adm/krb5 HelpDesk ticket 107348 Short Description: Cannot kcroninit on minos25 due permissions for /var/adm/krb5 Problem Description: run2-sys : We cannot use 'kcroninit' on minos25, apparently due to incorrect permissions on /var/adm/krb5 This directory is drwx--s--x on systems where kcroninit works, and on minos25 is drwxr-xr-x Please investigate and correct this. ############ # MCIMPORT # ############ Testing saddmc.20071114 with dcache location support. Ease to test, need any fresh Previously failed on file /pnfs/minos/mcin_data/near/daikon_04/L010185N/700 n13037002_0009_L010185N_D04.reroot.root MIN > mv saddmc.20071114 saddmc.20071120 $ cp -a AFSS/saddmc.20071120 . $ ln -sf AFSS/saddmc.20071120 saddmc AFSS/mcimport.20071118 -f 10 -m OVERLAY Again, copies to tape started almost immediately, not with a 4 hour delay 14:45 cp -a AFSS/mcimport.20071120 mcimport.20071120 ln -sf mcimport.20071120 mcimport # was mcimport.20071109 In STAGE/arms, rmdir mcin/dcache rmdir mcin ln -s far/mcin mcin ( should go the other way, but imports are active ) $ ./mcimport -b 3 -f 10 -m arms N.B. Perhaps we should try kx509 dccp , X509_CERT_DIR with cert. But how does it know the cert name ? Defer this for now, let's get rolling . Rate has been about 30 GBytes/hour. STAGE/arms/far/mcin has 300 GB. So run a manual mcimport, then start the cron tomorrow morning. 15:56 ./mcimport ALL rm: remove write-protected regular file `n13037002_0011_L010185N_D04.reroot.root'? n rm: remove write-protected regular file `n13037002_0012_L010185N_D04.reroot.root'? ??? what is this ??? why is it going to the terminal ? The 644 protected files are owned by rhatcher. Look at old purged files, n11011020_0001_L010185N_D00.reroot.root They are gone. Needed to hack mcimport.20071120 to do rm -f ${FILE} Restarted around 16:18 ./mcimport -c ALL ####### # AFS # ####### Scanned recent logs, see failures for fsus02 131.225.68.7 fsus03 131.225.68.4 fsus08 131.225.68.19 Sent reply to rayp : > I've identified the following minos volumes on fsus02. I'm going to > move them to fsus07 for now and see if we can isolate minos from the > issues affecting fsus02. These are the areas. Please let me know if > there are more. Thanks for shifting these, this may help with the web page stability. fsus07 is not immune to these problems. We saw failures of fsus07 on Nov 13. We saw failures yesterday for fsus02, fsus03 and fsus08 I do not think that we can avoid this problem by switching servers. 
Nevertheless, to answer your direct question : The AFS volumes read by the Control Room are the release and product areas, /afs/fnal.gov/files/code/e875/general/minossoft/ /afs/fnal.gov/files/code/e875/general/products/ /afs/fnal.gov/files/code/e875/general/minossoft/packages /afs/fnal.gov/files/code/e875/releases /afs/fnal.gov/files/code/e875/releases1 /afs/fnal.gov/files/code/e875/releases2 There are a lot of symbolic links, so it is hard to know whether these suffice. There remains the problem that there are many users heavily engaged in the analysis of post-shutdown data, which is critical to establishing the running condition for the experiment. They remain sensitive to fsus02. Rayp sent list of servers /afs/fnal.gov/files/code/e875/general/minossoft/ fsus05.fnal.gov /vicepb RW /afs/fnal.gov/files/code/e875/general/products/ fsus07.fnal.gov /vicepb RW /afs/fnal.gov/files/code/e875/general/minossoft/packages fsus05.fnal.gov /vicepb RW /afs/fnal.gov/files/code/e875/releases fsus-minos01.fnal.gov /viceph RW /afs/fnal.gov/files/code/e875/releases1 fsus08.fnal.gov /vicepd RW /afs/fnal.gov/files/code/e875/releases2 fsus09.fnal.gov /vicepb RW ============================================================================= 2007 11 19 ####### # AFS # ####### fsus02 timing out again : Nov 19 17:25:54 minos26 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 19 17:59:18 minos26 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) HelpDesk ticket 107323 Date: Mon, 19 Nov 2007 18:18:46 -0600 (CST) From: Arthur Kreymer To: helpdesk@fnal.gov, schmidt@fnal.gov, rayp@fnal.gov, inkmann@fnal.gov Cc: habig@fnal.gov, plunk@fnal.gov, wojcicki@fnal.gov, buckley@fnal.gov, rhatcher@fnal.gov, urish@fnal.gov Subject: Request AFS restriction to FNAL hosts At about 17:23 this afternoon, AFS server fsus02 again became unavailable, taking with it many of the lab's Web servers, the Minos Control Room Logbok, etc. This is seriously hurting detector operations for Minos, and is massively disruptive to our ability to analyze data. It is my understanding that the direct cause of this instability is the interaction of NAT clients with our AFS servers, something that will be corrected Saturday morning. But I do not think we can afford to run under the present conditions through Thanksgiving. Until the Saturday upgrades are completed, I request that we limit AFS clients to the fnal.gov and minos-soudan.org subnets, via firewalls or other technical means. This is a drastic action, but should eliminate NAT clients, and give us stable operation through the holiday. ####### # LSF # ####### for NODE in ${NODES} ; do printf "${NODE}\n"; ssh ${NODE} '. /usr/local/etc/setups.sh ; setup lsf ; bjobs' done minos13 Failed in an LSF library call: Slave LIM configuration is not ready yet ######## # FARM # ######## Files were concatenated this morning, OK adding F00039965_0000.all.sntp.cedar.0.root 11 NSFIL SSIZ MSIZ DSIZ 11 262594663 261592421 100224 -rw-r--r-- 1 minfarm numi 261592421 Nov 19 00:05 OK adding F00039968_0000.all.sntp.cedar.0.root 2 NSFIL SSIZ MSIZ DSIZ 2 45970263 45758780 211483 -rw-r--r-- 1 minfarm numi 45758780 Nov 19 00:06 PEND - have 17/23 subruns for F00039971_*.all.sntp.cedar.0.root 0 11/18 23:40 0 17 PEND - have 17/24 subruns for F00039971_* I see no cedar files for near detector. Informed minos_batch . 
Howie is not seeing recent beam database info : minfarm on fnpcsrv1% scripts/beam_mon minos-db1 Inquiring of minos-db1 on port 3306 as reader_old:minos_db beam_mon returns null -- no updates recently ####### # CAF # ####### Normally 35 running 08:00 minos25 :condor_off -peaceful -all -startd condor_q | grep running ? 29 running 09:00 18 running, 12:00 8 running 14:30 216 jobs; 214 idle, 0 running, 2 held 14:36 216 jobs; 175 idle, 39 running, 2 held I see a jump in load average around 14:33 Hurray ! ########### # ROUNDUP # ########### ############ # MCIMPORT # ############ AFSS/mcimport.20071118 -b 1 -m OVERLAY Looks OK, see /pnfs/minos/mcin_data/near/daikon_00/L010185N/102/n11011020_0000_L010185N_D00.reroot.root AFSS/mcimport.20071118 -m OVERLAY Mon Nov 19 08:04:41 CST 2007 Mon Nov 19 11:27:00 CST 2007 DET=near MCREL=daikon_00 CONF=L010185N for RUN in 102 103 104 ; do SADDIR=${DET}/${MCREL}/${CONF}/${RUN} #~/saddmc --verify -n 1 ${MCREL} ${SADDIR} ~/saddmc --declare ${MCREL} ${SADDIR} \ >> /minos/scratch/mindata/log/saddmc/prd-${DET}-${MCREL}-${CONF}.log 2>&1 done looks good ############ # MCIMPORT # ############ Let's check out the sam declares once again : 18:52 AFSS/mcimport.20071118 -b 1 -f 10 -m OVERLAY Dropped \n from MCINDS, to get clean path, Now find we need to handle files which are not yet on tape Update to saddmc is needed, similar to saddreco. saddmc 20070924 processing mcin_data STARTED Tue Nov 20 01:13:22 2007 Declaring to SAM v8_2_0 prd daikon_04 declare 999999 Scanning /pnfs/minos/mcin_data/near/daikon_04/L010185N ['700'] Needed /pnfs/minos/mcin_data/near/daikon_04/L010185N/700 Treating 37 files in /pnfs/minos/mcin_data/near/daikon_04/L010185N/700 OOPS - short Enstore data at Tue Nov 20 01:13:28 2007 ENLIN [] ENFILE n13037002_0009_L010185N_D04.reroot.root WARNING WARNING WARNIGN - these files are going to tape much too fast, there should be a 4 hour delay, writes are immediate. ####### # AFS # ####### Sent this to minos_all, minos_software_discussion, CRL Date: Mon, 19 Nov 2007 08:12:33 -0600 From: Ray Pasetes To: CSG , Desktop & Server Support - Enterprise , HelpDesk , Kristen J. Webb , Liz Buckley-Geer , Arthur E Kreymer , Steven Timm Subject: AFS outage Saturday, 11/24 6A-10A [ The following text is in the "ISO-8859-15" character set. ] [ Your display is set for the "ISO-8859-1" character set. ] [ Some special characters may be displayed incorrectly. ] On Saturday, 11/24, from 6A-10A, the AFS service will be out for an emergency upgrade. It has been determined that the current release of code, OpenAFS v1.4.4 can have issues with clients that are behind a NAT. These issues can indirectly cause a resource problem on the fileservers which could have resulted in the outages last week and the "Connection timed out" issues we have been seeing as of late. Please let any other interested parties know about this outage. -- ============================================== Ray Pasetes Email: rayp@fnal.gov CD/LSC/CSI/CS Phone: 630-840-5250 Fermilab, Batavia, IL Fax : 630-840-6345 ============================================== bv will contact rhatcher, see whether we can decouple hartnell may assist ============================================================================= 2007 11 18 Sun ############ # MCIMPORT # ############ mcimport.20071118 - cleaned up and tested, $ AFSS/mcimport.20071118 -b 1 -m OVERLAY Oops, /pnfs/minos/mcin_data/near/daikon_00/L010185N/102 et.al. are owned by rhatcher, but not group writeable. 
as rhatcher cd /pnfs/minos/mcin_data/near find daikon_* -type d -user rhatcher -exec ls -ld {} \; 288 directories find daikon_* -type d -user rhatcher -exec chmod 775 {} \; does not work on systems where I can be rhatcher Sent mail to rhatcher. Did the same for kreymer files : d /pnfs/minos/mcin_data/near find daikon_* -type d -user kreymer -exec ls -ld {} \; 173 directories find daikon_* -type d -user kreymer -exec chmod 775 {} \; ############ # MCIMPORT # ############ Manually imported some kordosky files which had piled up. AFSS/mcimport.20071118 -b 3 -m kordosky AFSS/mcimport.20071118 -m kordosky Manually imported some arms files which had piled up. AFSS/mcimport.20071118 -m arms ############ # SADDRECO # ############ saddreco mc far completed early this morning ============================================================================= 2007 11 17 Sat ############ # SADDRECO # ############ saddreco.20071117 - added RECODIRS.sort() to get a more readable log as we do the full mcout declares 12:15 SRV1> cp -a AFSS/saddreco.20071117 . ; ln -sf saddreco.20071117 saddreco # was saddreco.20070913 ############ # SADDRECO # ############ NDDS=`ls -d /pnfs/minos/mcout_data/*/near/daikon*` FDDS=`ls -d /pnfs/minos/mcout_data/*/far/daikon*` printf "${NDDS}\n" MINOS26 > printf "${NDDS}\n" /pnfs/minos/mcout_data/R1_24cal/near/daikon_00 /pnfs/minos/mcout_data/R1_24calB/near/daikon_00 /pnfs/minos/mcout_data/cedar/near/daikon_00 /pnfs/minos/mcout_data/cedar/near/daikon_01 /pnfs/minos/mcout_data/cedar_bx113/near/daikon_00 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00 /pnfs/minos/mcout_data/cedar_phy/near/daikon_03 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03 /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04 /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00 /pnfs/minos/mcout_data/cedar_phy_oldbhcurv/near/daikon_03 /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00 /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00 MINOS26 > printf "${FDDS}\n" /pnfs/minos/mcout_data/R1_24spill/far/daikon_02 /pnfs/minos/mcout_data/cedar/far/daikon_00 /pnfs/minos/mcout_data/cedar/far/daikon_01 /pnfs/minos/mcout_data/cedar/far/daikon_02 /pnfs/minos/mcout_data/cedar_phy/far/daikon_00 /pnfs/minos/mcout_data/cedar_phy/far/daikon_02 /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_03 /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_04 /pnfs/minos/mcout_data/cedar_phy_safitter/far/daikon_02 /pnfs/minos/mcout_data/cedar_phy_srsafitter/far/daikon_02 PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9010 export SAM_ORACLE_CONNECT='samdbs/...' 
for DIR in ${FDDS} ; do CONFS=`ls ${DIR}` echo ${DIR} ${CONFS} done /pnfs/minos/mcout_data/cedar/far/daikon_00 L010185N L100200N L250200N /pnfs/minos/mcout_data/cedar/far/daikon_01 L010185N /pnfs/minos/mcout_data/cedar/far/daikon_02 CosmicLE CosmicMu /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_03 CosmicLE CosmicMu /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_04 L010185 L010185N L250200 L250200N /pnfs/minos/mcout_data/cedar_phy/far/daikon_00 L010185N L100200N L250200N /pnfs/minos/mcout_data/cedar_phy/far/daikon_02 CosmicLE CosmicMu /pnfs/minos/mcout_data/cedar_phy_safitter/far/daikon_02 CosmicLE CosmicMu /pnfs/minos/mcout_data/cedar_phy_srsafitter/far/daikon_02 CosmicLE CosmicMu /pnfs/minos/mcout_data/R1_24spill/far/daikon_02 CosmicMu for DIR in ${NDDS} ; do CONFS=`ls ${DIR}` echo ${DIR} ${CONFS} done /pnfs/minos/mcout_data/cedar_bx113/near/daikon_00 L010185N_bfldx113 /pnfs/minos/mcout_data/cedar/near/daikon_00 L010000N L010170N L010185N L010185N_bfldx113 L010185N_charm L010185N_lowi L010185N_medi L010185N_nccoh L010200N L100200N L150200N L250200N L250200N_nccoh /pnfs/minos/mcout_data/cedar/near/daikon_01 L010185N /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03 CosmicLE CosmicMu L010185N /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04 CosmicLE L010000N L010170N L010185N L010200N L100200N L150200N L250200N /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00 L010185N /pnfs/minos/mcout_data/cedar_phy/near/daikon_00 L010000N L010170N L010185N L010200N L100200N L150200N L250200N /pnfs/minos/mcout_data/cedar_phy/near/daikon_03 CosmicMu L010185N /pnfs/minos/mcout_data/cedar_phy_oldbhcurv/near/daikon_03 L010185N /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00 L010185N L010185N_bfldx113 /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00 L010185N L010185N_bfldx113 /pnfs/minos/mcout_data/R1_24calB/near/daikon_00 L010185N /pnfs/minos/mcout_data/R1_24cal/near/daikon_00 L010185N L010185N_24cal Test one, ./saddreco -m daikon_00 -d far -r cedar_phy -p L250200N -b 1 --verify AFSS/saddreco.20071117 -m daikon_02 -d far -r cedar -p CosmicLE -b 1 --verify for DIR in ${FDDS} ; do CONFS=`ls ${DIR}` echo ${DIR} ${CONFS} for CONF in ${CONFS} ; do echo ${DIR}/${CONF} REL=`echo ${DIR} | cut -f 5 -d '/'` DET=`echo ${DIR} | cut -f 6 -d '/'` MCREL=`echo ${DIR} | cut -f 7 -d '/'` ls ${DIR}/${CONF} LOGDIR=${HOME}/ROUNTMP/LOG/saddreco/${MCREL} mkdir -p ${LOGDIR} # date ; ./saddreco -m ${MCREL} -d ${DET} -r ${REL} -p ${CONF} -b 1 --verify # date ; ./saddreco -m ${MCREL} -d ${DET} -r ${REL} -p ${CONF} -b 1 \ date ; ./saddreco -m ${MCREL} -d ${DET} -r ${REL} -p ${CONF} \ --declare 2>&1 | tee -a ${LOGDIR}/${REL}_${CONF}_${DET}.log done done Tested this first with one moderate configuration, DIR=/pnfs/minos/mcout_data/cedar/far/daikon_00 CONF=L100200N Ran twice, with corrected LOGDIR Ran full fardet, 1 event Needed r1.24spill for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=r1.24spill done New applicationFamilyId = 228 New applicationFamilyId = 82 New applicationFamilyId = 282 DIR=/pnfs/minos/mcout_data/R1_24spill/far/daikon_02 Tested this, OK now. 
Added missing release for near, r1.24calb for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=r1.24calb done 12:55 Ran bail-1 test for NDDS, Sat Nov 17 13:41:45 CST 2007 grep -v declared ${HOME}/ROUNTMP/LOG/saddreco/daikon*/*.log | less Added _${DET} to the log file name Found OOPS problem with cedar.phy.oldbhcurv for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.oldbhcurv done New applicationFamilyId = 230 New applicationFamilyId = 84 New applicationFamilyId = 302 Launched the full fardet processing Sat Nov 17 13:51:38 CST 2007 Sat Nov 17 16:10:16 CST 2007 No OOPS in grep -v declared ${HOME}/ROUNTMP/LOG/saddreco/daikon*/*far.log | less Launched the full neardet processing for DIR in ${NDDS} ; do ... same as for far, see above ... Sat Nov 17 16:12:55 CST 2007 Sun Nov 18 02:46:41 CST 2007 No OOPS in grep -v declared ${HOME}/ROUNTMP/LOG/saddreco/daikon*/*near.log | less ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs /var/log/messages | grep -v Tokens | grep "Nov 16"' ; done minos02 Nov 16 11:15:15 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 16 11:20:17 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Nov 16 18:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 16 18:17:07 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) minos06 Nov 16 15:04:12 minos06 kernel: afs: Lost contact with volume location server 131.225.68.4 in cell fnal.gov Nov 16 15:06:25 minos06 kernel: afs: volume location server 131.225.68.4 in cell fnal.gov is back up minos16 Nov 16 20:19:49 minos16 kernel: afs: Lost contact with file server 131.225.68.17 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 16 20:22:11 minos16 kernel: afs: file server 131.225.68.17 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs /var/log/messages | grep -v Tokens | grep "Nov 17"' ; done minos02 Nov 17 07:15:12 minos02 kernel: afs: Lost contact with file server 131.225.68.19 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 17 07:16:35 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) Updated HelpDesk ticket 107032 Lost access to fsus02, as follows <-- # @@@ Enter Update below this line. @@@ # --> We lost access to fsus02 this evening. This removed access to several things, most critically the Minos Control Room Log Book, at http://www-minoscrl2.fnal.gov/minos/Index.jsp and the helpdesk web page. 
Some of the /var/log/messages messages were like : minos01 Nov 17 20:37:05 minos01 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi$ Nov 17 21:08:39 minos01 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed addr$ minos02 Nov 17 20:36:46 minos02 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi$ Nov 17 21:05:56 minos02 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed addr$ <-- # @@@ Enter Update above this line. @@@ # --> On minos-beamdata, Nov 17 20:33:14 minos-beamdata kernel: afs: Lost contact with file server 192.168.67.1 in cell fnal.gov (multi-homed address; other same-host interfaces maybe up) That is a weird address, not 131.225 Fermilab ============================================================================= 2007 11 16 ####### # CAF # ####### Date: Fri, 16 Nov 2007 16:03:59 -0600 (CST) From: Arthur Kreymer To: minos-admin@fnal.gov Cc: sfiligoi@fnal.gov Subject: Proposed schedule for security enhancements on the Minos Condor Analysis Facility. Proposed schedule for security enhancements on the Minos Condor Analysis Facility. ( I think this is what we have already been talking about, but it is good to have a summary . ) Monday morning ( 8:00 or so, the earlier the better ) root starts draining the Minos Condor worker nodes ( commands to be provided by Igor ) From minos25, issue condor_off -peaceful -all -startd Monday afternoon ( 13:00 or so ) root stops condor ( commands to be provided by Igor ) From minos25, issue condor_off -fast -all -startd followed by condor_off -all -master root needs to have installed host certificates and /etc/grid-security/certificates root pushes out the new configuration files with cfengine root starts condor ( commands to be provided by Igor ) This should be the usual /etc/init.d/condor start on all the affected nodes. sfiligoi and kreymer test proper operation of Minos Condor, processing of user jobs will resume by Monday evening. Date: Fri, 16 Nov 2007 17:35:53 -0600 (CST) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Cc: minos-admin@fnal.gov Subject: Minos Condor Analysis Facility upgrade Monday We plan to upgrade the Minos Condor configuration files Monday 19 Nov, in preparation for providing a gateway to larger Grid resources. We plan to drain the worker nodes during the morning, and perform the configuration upgrade in the afternoon, restoring service by the evening. The queues should be retained. Any jobs process which we kill around noon will be rerun, probably transparently the the users ( aside from the delay ). Thanks for your patience ! ####### # CAF # ####### Igor has pointed out to me that we should also modify a line in minos25:/etc/condor/condor_config , as follows change QUEUE_SUPER_USERS = root, condor to QUEUE_SUPER_USERS = root, condor, buckley, kreymer, rhatcher, sfiligoi, timm This change can be pushed with the rest of the security changes. I'll send a separate email with a proposed schedule. ####### # CAF # ####### HelpDesk ticket 107197 Short Description: Condor stop/start sudo access on Minos Cluster Problem Description: run2-sys Please set up sudo access to invoke /etc/init.d/codor on the Minos Cluster, for the following users : buckley, kreymer, rhatcher, sfiligoi, timm This would let the experiment management stop and start Condor as needed. This will be especially useful during the present commissioning phase. 
Date: Fri, 16 Nov 2007 16:45:05 -0600 (CST) Solution: schmitz@fnal.gov sent this solution: added user list to sudoers giving condor start/stop access MINOS25 > sudo -l User kreymer may run the following commands on this host: (root) NOPASSWD: /etc/init.d/codor MINOS01 > ps axf | grep condor 7305 pts/3 S+ 0:00 | \_ grep condor 29713 ? Ss 8:33 /opt/condor/sbin/condor_master 29714 ? Ss 16:42 \_ condor_startd -f ####### # LSF # ####### Date: Fri, 16 Nov 2007 14:42:06 -0600 (CST) From: Margaret_Greaney To: minos-users@fnal.gov, minos-admin@fnal.gov Cc: dss-est@fnal.gov Subject: flxi06 hardware problems we will be taking down flxi06 momentarily to try to fix a bad system disk. Date: Fri, 16 Nov 2007 15:02:04 -0600 (CST) Due to a problem we need to reschedule the hardware repair of flxi06 for Monday, November 19. ############ # PREDATOR # ############ Is up to date. Glitch processing N071007_000002.mdcs.root Fri Nov 16 11:11:29 UTC 2007 Wait for next cycle. ####### # AFS # ####### for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs /var/log/messages | grep -v Tokens | grep "Nov 15"' ; done minos02 Nov 15 12:15:20 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 15 12:16:31 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov Nov 15 18:15:16 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 15 18:15:28 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov Nov 15 22:15:14 minos02 kernel: afs: Lost contact with file server 131.225.68.19 Nov 15 22:17:54 minos02 kernel: afs: file server 131.225.68.19 in cell fnal.gov minos03 Nov 15 18:26:20 minos03 kernel: afs: Lost contact with file server 131.225.68.49 Nov 15 18:28:38 minos03 kernel: afs: file server 131.225.68.49 in cell fnal.gov minos04 Nov 15 13:46:05 minos04 kernel: afs: Lost contact with file server 131.225.68.49 Nov 15 13:47:20 minos04 kernel: afs: file server 131.225.68.49 in cell fnal.gov minos12 Nov 15 13:21:18 minos12 kernel: afs: Lost contact with file server 131.225.68.49 Nov 15 13:21:18 minos12 kernel: afs: failed to store file (110) Nov 15 13:23:00 minos12 kernel: afs: file server 131.225.68.49 in cell fnal.gov for NODE in ${NODES} ; do printf "${NODE}\n" ssh ${NODE} 'grep afs /var/log/messages | grep -v Tokens | grep "Nov 16"' ; done 09:38 - sent this information as followup to ticket 107032 ============================================================================= 2007 11 15 ############ # PREDATOR # ############ Ran by hand to catchup since yesterday's AFS problem. 
16:27 ./predator 2007-11 16:50 predator is still running, but the .pid file will save us neardet SAM data is generated cleanly, so we're OK crontab crontab.dat ######## # FARM # ######## Completing transition to /minos/data/minfarm Pick up strays 26 cand files showed up in farcat on Nov 8, pending in WRITE, like F00039916_0014.spill.cand.cedar.0.root mv /grid/data/minos/minfarm/WRITE/*cand* /grid/data/minos/minfarm/DUP/ FILES=`ls /grid/data/minos/minfarm/DUP` for FILE in ${FILES} ; do sam locate ${FILE} ; done all of these files are in SAM for DIR in BAD DUP N7760 SAFE WRITE ; do du -sm /grid/data/minos/minfarm/${DIR} cp -vax /grid/data/minos/minfarm/${DIR} /minos/data/minfarm/${DIR} du -sm /minos/data/minfarm/${DIR} diff -r /grid/data/minos/minfarm/${DIR} /minos/data/minfarm/${DIR} mv /grid/data/minos/minfarm/${DIR} /grid/data/minos/minfarm/OLD${DIR} ln -s /minos/data/minfarm/${DIR} /grid/data/minos/minfarm/${DIR} done 107 /grid/data/minos/minfarm/BAD 107 /minos/data/minfarm/BAD 2981 /grid/data/minos/minfarm/DUP 2981 /minos/data/minfarm/DUP 1888 /grid/data/minos/minfarm/N7760 1888 /minos/data/minfarm/N7760 1 /grid/data/minos/minfarm/SAFE 1 /minos/data/minfarm/SAFE 4684 /grid/data/minos/minfarm/WRITE 4683 /minos/data/minfarm/WRITE cd /export/stage/minfarm/ROUNDUP SRV1> ls -l | grep grid lrwxrwxrwx 1 minfarm numi 28 May 19 12:14 DUP -> /grid/data/minos/minfarm/DUP lrwxrwxrwx 1 minfarm numi 29 May 14 2007 GDS -> /grid/data/minos/minfarm/SAFE lrwxrwxrwx 1 minfarm numi 30 May 11 2007 WRITE -> /grid/data/minos/minfarm/WRITE for DIR in DUP WRITE ; do rm ${DIR} ; ln -sf /minos/data/minfarm/${DIR} ${DIR} ; done rm GDS ; ln -s /minos/data/minfarm/SAFE GDS ---------------- ls N00012*cedar_phy_bhcurv*0.root | wc -l These date Nov 2 throug Nov 5 FILES=`ls N00012*cedar_phy_bhcurv*0.root` SRV1> for FILE in ${FILES} ; do sam locate ${FILE} ; done SRV1> printf "${FILES}\n" | wc -l 330 FILES=`ls N00012*cedar_phy_bhcurv*0.root | grep -v 00012001` for FILE in ${FILES} ; do mv ${FILE} ../REMOVED/${FILE} ; done cd .. mv REMOVED ../READREM mv This file lists all 24 subruns ( 0-2 in 2007-03, 3-24 in 2007-04 ) READ/SAM/N00012001_0000.spill.mrnt.cedar_phy_bhcurv.0.root MINOS26 > sam undeclare file N00012001_0000.spill.mrnt.cedar_phy_bhcurv.0.root MINOS26 > sam undeclare file N00012001_0000.spill.sntp.cedar_phy_bhcurv.0.root MINOS26 > sam undeclare file N00012001_0000.spill.cand.cedar_phy_bhcurv.0.root MINOS26 > sam undeclare file N00012001_0001.spill.cand.cedar_phy_bhcurv.0.root MINOS26 > sam undeclare file N00012001_0002.spill.cand.cedar_phy_bhcurv.0.root SRV1> mv READ/SAM/N00012001_0000.spill.sntp.cedar_phy_bhcurv.0.root READREM/ SRV1> mv READ/SAM/N00012001_0000.spill.mrnt.cedar_phy_bhcurv.0.root READREM/ Rubin : "Since we were removing the old files, I decided to treat this as if they never existed at all, and thus this is pass 0. " ####### # AFS # ####### No further AFS messages in syslog, except discarded tokens ############ # MCIMPORT # ############ Declares mcin to sam Bails for boring users having no files in top, mcin, mcin/dcache $ cp -a AFSS/mcimport.20071109 . $ ln -sf mcimport.20071109 mcimport # was mcimport.20071022 08:31 crontab crontab.dat ############ # MCIMPORT # ############ For immedate processing of recent arms files, will also import them the old fashioned way . 
FILES=`find /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near -type f -name \*.gz` for FILE in ${FILES} ; do cp -a ${FILE} STAGE/arms/ ; done ./mcimport.20071022 -F arms Thu Nov 15 09:27:10 CST 2007 SRMCPed n14111411_0000_L010185N_D00_nue-n14111450_0000_L010185N_D00_nue.tar Need to rerun this afternoon, to clear dcache directory. N.B. why Sorting 18517 logs in /local/scratch26/mindata/arms/far/mcin/log ? 13:07 - holding off while more of these show up, disabling arms mcimport mv STAGE/arms/MCIMPORT STAGE/arms/NOIMPORT OOPS, these are all to be removed, and some more files generated. Will leave ARMS in NOIMPORT state through the weekend, until the 40 runs x 10 subruns are uploaded and tarred with old mcimport. $ cd STAGE/arms $ rm index/n14111411_0000_L010185N_D00_nue-n14111450_0000_L010185N_D00_nue.index $ rm -r /minos/data/mcimport/STAGE/daikon_00/L010185N_nue/near MINOS26 > rm /pnfs/minos/stage/arms/n14111411_0000_L010185N_D00_nue-n14111450_0000_L010185N_D00_nue.tar ============================================================================= 2007 11 14 ######## # FARM # ######## Redeclaring cedar_phy_bhcurv which failed due to lack of MONS=`ls */decl*cedar_phy_bhcurv.log | cut -f 1 -d '/' cat */decl*cedar_phy_bhcurv.log > /tmp/cpblog grep -v declared /tmp/cpblog | less Lots of failures Sep 11, for MON in 2005-12 2006-01 2006-02 ; do ./roundup -c -m "${MON}" -r cedar_phy_bhcurv near done Need cleanup from previous errors: 2005-12 OOPS, need location for N00009544_0009.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009303_0009.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009530_0019.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009331_0003.spill.cand.cedar_phy_bhcurv.0.root 2006-01 OOPS, need location for N00009647_0006.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009583_0000.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009589_0001.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009586_0024.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009589_0000.spill.cand.cedar_phy_bhcurv.0.root 2006-02 OOPS, need location for N00009839_0007.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009755_0019.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009732_0022.spill.cand.cedar_phy_bhcurv.0.root OOPS, need location for N00009732_0008.spill.cand.cedar_phy_bhcurv.0.root PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9010 export SAM_ORACLE_CONNECT='samdbs/...' DET=near REL=cedar_phy_bhcurv SAMMON=2005-12 AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} addloc SAMMON=2006-01 AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} addloc SAMMON=2006-02 AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} addloc ####### # AFS # ####### -------------------------------------------------------------------------- HelpDesk ticket 107048 Short Description: AFS server(s) down again Problem Description: All the Minos Cluster nodes have lost contact with AFS servers again. Here is a typical message : Nov 14 10:39:13 minos26 kernel: afs: Lost contact with file server 131.225.68.47 in cell fnal.gov (all multi-homed ip addresses down for the server) This is an urgent problem, we cannot access the Minos software without this server. 
-------------------------------------------------------------------------- 11:04 This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. 11:20 I see network traffic, the server seems up again. This affects products and releases. Instead of hanging, we see : $ ls /afs/fnal.gov/files/code/e875/products ls: /afs/fnal.gov/files/code/e875/products: No such file or directory MRTG data flow stopped around 10:40, see http://www-dcn.fnal.gov/~netadmin/m-s-fcc-mrtg/cgi/mrtg-rrd.fcgi/r-s-fcc2-server/r-s-fcc2-server_gi1_24.html Scanned again, around 17:20 for NODE in ${NODES} ; do printf "${NODE}\n" ; ssh ${NODE} 'grep afs /var/log/messages | grep "Nov 14"' ; done 131.225.68.47 fsus06 all nodes, 10:40 - 11:22 131.225.68.19 fsus08 minos02 10:15:12 - 10:16:12 14:15:11 - 14:17:39 Many messages like Nov 14 16:15:12 minos02 kernel: afs: Tokens for user of AFS id 1334 for cell fnal.gov are discarded (rxkad error=19270407) 131.225.68.4 fsus03 minos18 15:36:37 - 15:38:04 ########## # SADDMC # ########## Now working with the mindata account, on minos26 Checking that we are up to date, before doing output Nothing was needed, per the following scans. DET=near VEGS='daikon_00 daikon_01 daikon_03 daikon_04' for VEG in ${VEGS} ; do for DIR in `ls /pnfs/minos/mcin_data/${DET}/${VEG} | sort` ; do echo ${VEG} ${DIR} ./saddmc --verify -n 1 ${VEG} ${DET}/${VEG}/${DIR}/* done ; done 2>&1 | tee -a /tmp/saddmc.near AFS went down after daikon_03/spill_cedarphyMRE Repeated the scan with VEGS='daikon_03' Clean, aside from MRE files which should not be declared DET=far VEGS='daikon_00 daikon_01 daikon_02 daikon_03' for VEG in ${VEGS} ; do for DIR in `ls /pnfs/minos/mcin_data/${DET}/${VEG} | sort` ; do echo ${VEG} ${DIR} ./saddmc --verify -n 1 ${VEG} ${DET}/${VEG}/${DIR}/* done ; done 2>&1 | tee -a /tmp/saddmc.far Need daikon_03 CosmicMu 127 to 132 VEG=daikon_03 DIR=CosmicMu ./saddmc --verify ${VEG} ${DET}/${VEG}/${DIR}/* ./saddmc --declare ${VEG} ${DET}/${VEG}/${DIR}/* 2>&1 \ | tee -a /minos/scratch/mindata/log/saddmc/prd-${DET}-${VEG}-${DIR}.log Done by 15:16 ########## # SADDMC # ########## Shifted logs to /minos/scratch/mindata/log/saddmc mkdir -p /minos/scratch/mindata/log cp -vax /minos/scratch/kreymer/log/saddmc \ /minos/scratch/mindata/log/saddmc From now on, will be doing saddmc from the mindata account. 
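A quick verification sketch for the saddmc log copy above, reusing the du / diff pattern applied to the /grid/data moves elsewhere in this log ; only the two paths named above are assumed :

OLDLOG=/minos/scratch/kreymer/log/saddmc
NEWLOG=/minos/scratch/mindata/log/saddmc
du -sm ${OLDLOG} ${NEWLOG}              # sizes should agree
diff -r ${OLDLOG} ${NEWLOG} && echo COPY OK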
######## # FARM # ######## Preparing for FARM redeclares Continuing from 2007 11 12, but will use run number rather than month Checking run numbers April 2007, March has through 12001 April has 12002 through 12135 for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and PHYSICAL_DATASTREAM_NAME spill and VERSION cedar.phy.bhcurv and RUN_NUMBER > 12001 " for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and PHYSICAL_DATASTREAM_NAME spill and VERSION cedar.phy.bhcurv and RUN_NUMBER > 12001 " ./samlocate "${SAMDIM}" | sort -k 2,2 done for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and PHYSICAL_DATASTREAM_NAME spill and VERSION cedar.phy.bhcurv and RUN_NUMBER > 12001 " ./samlocate "${SAMDIM}" | wc -l done 1783 123 116 date for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and PHYSICAL_DATASTREAM_NAME spill and VERSION cedar.phy.bhcurv and RUN_NUMBER > 12001 " ./samundeclare "${SAMDIM}" done date Wed Nov 14 08:55:46 CST 2007 ####### # AFS # ####### from minos26 /var/log/messages Nov 13 16:57:38 minos26 kernel: afs: Lost contact with file server 131.225.68.7 in cell fnal.gov (all multi-homed ip addresses down for the server) Nov 13 17:29:08 minos26 kernel: afs: file server 131.225.68.7 in cell fnal.gov is back up (multi-homed address; other same-host interfaces may still be down) for NODE in ${NODES} ; do printf "${NODE}\n" ; ssh ${NODE} 'grep afs /var/log/messages' ; done > /tmp/afsmsg HelpDesk ticket 107032 08:36 This ticket has been reassigned to INKMANN, JOHN of the CD-LSCS/CSI/CS/EST Group. Short Description: AFS time outs continuing - status reqeust ? Problem Description: What is the status of AFS ? Scanning the Minos Cluster /var/log/messages logs, I see several AFS timeouts before and after yesterday's 16:00 to 18:00 outage. These occurred as recently as 05:15 this morning. I can find no detailed information at http://computing.fnal.gov/cdsystemstatus/system/AFS.html or in the helpdesk tickets, or in FNALU login messages. Here are the timeout details from the Minos Cluster : 131.225.68.7 fsus02 minos01 through minos26 ( except the nodes listed below ) Nov 13 16:57:38 - 17:29:08 minos21, minos22 - no time out minos04 Nov 11 06:05:03 - 06:05:19 Nov 13 16:57:39 - 17:29:08 minos05 Nov 12 22:29:59 - 22:32:20 Nov 13 16:57:37 - 17:28:56 131.225.68.4 fsus03 minos03 Nov 12 18:52:25 - 18:55:56 minos04 Nov 13 19:31:54 - 19:33:32 minos08 Nov 12 18:52:11 - 18:55:39 minos15 Nov 13 16:22:15 - 16:25:14 minos18 Nov 12 18:52:07 - 18:54:59 minos20 Nov 12 18:16:56 - 18:17:53 Nov 13 16:21:53 - 16:25:15 131.225.68.6 fsus07 minos22 Nov 13 16:30:48 - 16:32:29 131.225.68.19 fsus08 minos02 - Nov 13 20:15:13 - 20:17:00 Nov 13 23:15:15 - 23:17:07 Nov 14 01:15:12 - 01:15:25 Nov 14 03:15:15 - 03:16:50 Nov 14 05:15:16 - 05:18:07 minos18 Nov 11 19:16:29 - 19:17:06 131.225.68.49 fsus09 minos04 Nov 13 19:31:54 - 19:33:32 minos15 Nov 13 16:22:15 - 16:25:14 minos20 Nov 12 18:16:56 - 18:17:53 Nov 13 16:21:53 - 16:25:15 ----------------------------------------------- MRTG shows fsus02 16:00 - 17:45 gap , heavy traffic ( 3 MB/sec ) for hours preceding fsus03 17:20 - 17:45 gap fsus07 15 minute gaps 18:05 through 19:15 fsus08 low traffic 18:15 - 20:30, spike at 19:10 fsus09 18:05 - 19:10 mostly gap, big spike at 19:10, 20:00 - 20:30 gap several gaps 04:00 - 06:00 ============================================================================= 2007 11 13 ####### # AFS # ####### Lost contact with AFS around 14:00. 
Parts of the system are back, around 15:00 Other parts still time out From minos26 and my desktop, /afs/fnal.gov/files/home/room1/kreymer is OK on minos26, bad on fnpcsrv1 /afs/fnal.gov/files/home/room1/kreymer/minos times out 17:00 - running process count climbing again on Minos Cluster ganglia ####### # VDT # ####### Try installation into /grid/app/minos/VDT setup pacman v3_20 mkdir -p /grid/app/minos/VDT cd /grid/app/minos/VDT pacman -get VDT:VOMS-Client Do you want to add [http://vdt.cs.wisc.edu/vdt_cache/] to [trusted.caches]? (y or n): y Package [VOMS-Client] found in [VDT]... Do you want to add [http://vdt.cs.wisc.edu/vdt_181_cache] to [trusted.caches]? (y or n): y ... Do you agree to the licenses? [y/n] y ... Where would you like to install CA files? Choices: l (local) - install into $VDT_LOCATION/globus/share/certificates n (no) - do not install l mkdir -p /grid/app/minos/VDT/glite/etc chmod 755 /grid/app/minos/VDT/glite/etc cp -a /minos/scratch/kreymer/VDT/glite/etc/vomses glite/etc/vomses See /grid/app/minos/VDT/vdt/etc/package_data/VDT-Version-Info.filelist But I see no versions there, just lists of files. cd /grid/app/minos/VDT . setup.sh klist -f kx509 kxlist -p voms-proxy-init -noregen -voms fermilab:/fermilab -valid 168:0 /minos/scratch/kreymer/VDT seems to be 1.8.1-19 based on vdt-install.log Doing this under /minos/scratch/kreymer/VDT gets MINOS26 > voms-proxy-init -noregen -voms fermilab:/fermilab -valid 168:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E Kreymer/USERID=kreymer Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Failed Error: fermilab: User unknown to this VO. None of the contacted servers for fermilab were capable of returning a valid AC for the user. /grid/app/minos/VDT seem to be 1.8.1-21 Doing this under /grid/app/minos/VDT gets VOMS Server for fermilab not known! 
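A diagnostic sketch for the vomses confusion above, comparing the two VDT installs ; it uses only the paths and setup.sh shown above, and the env filter is a guess at which variables matter :

for VDT in /minos/scratch/kreymer/VDT /grid/app/minos/VDT ; do
  echo ${VDT}
  ls -l ${VDT}/glite/etc/vomses
  ( cd ${VDT} ; . setup.sh ; env | grep -i -e vdt -e voms | sort )
done
diff /minos/scratch/kreymer/VDT/glite/etc/vomses /grid/app/minos/VDT/glite/etc/vomses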
######## # FARM # ######## for AREA in neardet fardet farcat nearcat AREA=farcat AREA=nearcat DO THE FOLLOWING THINGS date ls -ltr /grid/data/minos/${AREA} | tail -1 ls /grid/data/minos/${AREA} | wc -l du -sm /grid/data/minos/${AREA} time cp -vax /grid/data/minos/${AREA} /minos/data/minfarm/${AREA} ls /minos/data/minfarm/${AREA} | wc -l du -sm /minos/data/minfarm/${AREA} diff -r /grid/data/minos/${AREA} /minos/data/minfarm/${AREA} mv /grid/data/minos/${AREA} /grid/data/minos/OLD${AREA} ln -s /minos/data/minfarm/${AREA} /grid/data/minos/${AREA} date DID THE ABOVE AREA=neardet 2 594 /grid/data/minos/neardet real 0m26.515s 2 594 /minos/data/minfarm/neardet Tue Nov 13 13:58:01 CST 2007 AREA=fardet R > date Tue Nov 13 13:59:19 CST 2007 0 1 /grid/data/minos/fardet real 0m0.010s 0 1 /minos/data/minfarm/fardet Tue Nov 13 13:59:48 CST 2007 AREA=farcat Tue Nov 13 14:01:23 CST 2007 -rw-rw-r-- 1 rubin numi 7138755 Nov 12 23:38 F00039950_0001.spill.bntp.cedar.0.root 172 1972 /grid/data/minos/farcat real 1m46.929s 172 1969 /minos/data/minfarm/farcat Tue Nov 13 14:07:00 CST 2007 ########## # DCACHE # ########## Problems with stuck open transfers on stkendca19a all uid = 13234 FILS=' reco_far/cedar/sntp_data/2004-09/F00027184_0002.all.sntp.cedar.0.root reco_far/cedar/sntp_data/2005-04/F00030628_0007.all.sntp.cedar.0.root reco_far/cedar/sntp_data/2004-10/F00027603_0004.all.sntp.cedar.0.root reco_near/cedar/sntp_data/2005-12/N00009530_0020.cosmic.sntp.cedar.0.root ' for FIL in ${FILS} ; do root -b -q dcap://fndca1.fnal.gov:${DCPORT}/pnfs/fnal.gov/usr/minos/${FIL} done for FIL in ${FILS} ; do dccp dcap://fndca1.fnal.gov:${DCPORT}/pnfs/fnal.gov/usr/minos/${FIL} \ /local/scratch26/kreymer/COPY/ done rm /local/scratch26/kreymer/COPY/*cedar* ============================================================================= 2007 11 12 ######## # FARM # ######## Preparing for FARM redeclares for MON in 05 06 07 ; do for STRM in cand sntp mrnt ; do SAMDIM=" DATA_TIER ${STRM}-near and PHYSICAL_DATASTREAM_NAME spill and VERSION cedar.phy.bhcurv and FULL_PATH /pnfs/minos/reco_near/cedar_phy_bhcurv/${STRM}_data/2007-${MON} " ./samundeclare -n "${SAMDIM}" done ; done < see 2007 11 14 > ############ # MCIMPORT # ############ Added 0 length check to MCINWRITE ####### # LSF # ####### Old brebel ticket 106750 status ? Cannot submit from minos01 - minos13 11/12/2007 12:53:38 PM ticket closed, "minos nodes not showing up in lsf configuration; minos exp now using condor instead of lsf." Well, that is true for the execution nodes, but not for submission. ####### # LSF # ####### for NODE in ${NODES} ; do printf "${NODE} " ; ssh ${NODE} 'source /usr/local/etc/setups.sh ; setup lsf ; bjobs' ; done minos01 Request from non-LSF host rejected ... minos13 Request from non-LSF host rejected minos14 No unfinished job found minos15 No unfinished job found ... 18:12 HelpDesk ticket 106966 run2-sys : On the Minos Cluster, we have lost access to LSF from nodes minos01 through minos13, Commands like 'bsub' result in Request from non-LSF host rejected We have normal access from hosts minos14 through minos26. Please restore access from minos02 to minos13, as this is causing confusion, and is a substantial inconvenience to the users. Thanks ! 13 Nov 2007 08:30:30 This ticket has been reassigned to SCHMITZ, MARK of the CD-SF/FEF Group. 
13 Nov 2007 08:55:20 schmitz restarted LSF on minos01 - still no good MINOS01 > bjobs Failed in an LSF library call: Slave LIM configuration is not ready yet ######### # FNALU # ######### FNALU batch jobs failed for pawloski over the weekend. His account is absent on FNALU. There are 17 missing accounts. MINOS01 > ypcat passwd | grep '/afs/fnal' | cut -f 1 -d : | sort > /tmp/users FLXI05 > scp minos01:/tmp/users /tmp/users FLXI05 > for user in `cat /tmp/users` ; do grep -q ^${user} /tmp/pwd || echo ${user} ; done bckhouse blake idanko kimjj llhsu mbt mstrait mtavera pawloski pittam rahaman rearmstr rmehdi rodriges scavan tinti whitehd FLXI05 > for user in `cat /tmp/users` ; do grep -q ^${user} /tmp/pwd || echo ${user} ; done > /tmp/MISS 13:24 HelpDesk ticket 106946 forwarded to all these users 13:42 - assigned to mgreaney 16:10 - Except for mtb (who is expired in nas), the accounts were added back. mbt:KERBEROS:13574:5111:Meagan Thompson:/afs/fnal/files/home/room2/mbt:/usr/local/bin/tcsh FLXI05 > for user in `cat /tmp/MISS` ; do grep -q ^${user} /tmp/pwd2 || echo ${user} ; done mbt 17:00 Notified users via email ######## # DESK # ######## Restarting after planned Sunday power outage ############# # CHECKLIST # ############# Mysql1 has been saturated since early morning last Saturday, load averages up to 18. Probably the pawloski jobs running since then Queries like select min(TIMEEND) from DCS_MAG_FARVLD where TIMEEND > '2007-03-15 04:15:08' and ... ####### # MRE # ####### find d* -name N\*MRE\* > /tmp/MRE.lis Found the one 0 length file remaining in AFS ( othere had been deleted ) So the 4 problem-files were 0 length in AFS, have removed them in PNFS. ============================================================================= 2007 11 10 Undeclaring April 2007 files SAMDIM="FULL_PATH /pnfs/minos/reco_near/cedar_phy_bhcurv/sntp_data/2007-04" SFILES=`sam list files --dim="${SAMDIM}" --nosummary` printf "${SFILES}\n" printf "${SFILES}\n" | wc -w 45 06:44 for FILE in ${SFILES} ; do sam locate ${FILE} ; done for FILE in ${SFILES} ; do echo ${FILE} ; sam undeclare file ${FILE} ; done SAMDIM="FULL_PATH /pnfs/minos/reco_near/cedar_phy_bhcurv/mrnt_data/2007-04" 44 06:45 SAMDIM="FULL_PATH /pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2007-04" 702 06:51 ============================================================================= 2007 11 09 ######## # GRID # ######## HelpDesk ticket 106879 Short Description: Minos Cluster - need grid host certificates for use with Condor Problem Description: run2-sys : We are preparing to improve the configuration of Condor on the Minos Cluster, by installing the Glidein WMS system already being used by CMS. This should act as our gateway to GPFARM and other Fermigrid resources. Igor Sfiligoi will be assisting with the configuration of this. His first advice is that we need to obtain Grid Host certificates for the existing systems, to improve the internal security. I suspect that this is something that run2-sys has done before for similar grid installations. If so, please make these available. If not, we will need to get advice from the people doing this in CMS. We would like to proceed with this project early in the week of Nov 12. ( next week ) Date: Wed, 14 Nov 2007 12:16:37 -0600 (CST) Eventually I got the grid-cert-request done for all 26 servers, and submittied them to the website. 
Request numbers are: 29489 29497 29498 29500 29501 29502 29503 29504 29505 29506 29507 29508 29509 29510 29511 29512 29513 29514 29515 29516 29517 29518 29519 29520 29521 29522 Just waiting for return mail to get the URL and install the certificates. ############ # MCIMPORT # ############ boehm files still pending ####### # LSF # ####### REQUESTED MINOS CLUSTER LSF SHUTDOWN Date: Fri, 9 Nov 2007 14:02:28 -0600 (CST) From: Arthur Kreymer To: minos_software_discussion@fnal.gov Cc: minos-admin@fnal.gov, mgreaney@fnal.gov Subject: Re: LSF problem on Minos Cluster Having heard no objection from Minos, and per a discussion with Joe Boyd, we are officially asking that LSF job slots be removed from the Minos Cluster nodes ( minos14 through minos26 ). We will still be submitting jobs to the traditional LSF FNALU batch system, but they should not run on the Minos Cluster nodes. Let me repeat our thanks to the Computing Division people who set this up for us earlier this year ! The scheme allowed us to keep doing physics through this phase of our transition to Condor and Grid computing. On Fri, 9 Nov 2007, Arthur Kreymer wrote: > Date: Fri, 9 Nov 2007 09:52:24 -0600 (CST) > From: Arthur Kreymer > To: minos_software_discussion@fnal.gov > Cc: minos-admin@fnal.gov > Subject: LSF problem on Minos Cluster > > > There were global LSF problems a couple of day ago, > which were corrected for the FNALU batch system, > but which are still lingering on the Minos Cluster. > > Some of our initial volunteers ( Brian, Greg, Josh ) > are making successful use of Condor on the Cluster, > and this is our planned direction. > > So I intend to ask the LSF managers to abandon attempts > to revive these Minos Cluster LSF execution slots ( hosts minosNN ). > > The other existing FNALU batch slots are unaffected. > > Please let me know if this would be a problem for anyone. > ( This is somewhat moot, these are still broken in LSF. ) ============================================================================= 2007 11 08 ####### # NET # ####### HelpDesk ticket 106799 Short Description: MRTG timezone error ? Problem Description: When attempting to veiw the MRTG traffic plots for recently rebooted fnpcsrv1, via a host search at http://fndcg0.fnal.gov/~netadmin/NodeLocator/search.html I get the following message at http://fndcg0.fnal.gov/~netadmin/NodeLocator/mrtg-search.cgi?hname=fnpcsrv1 131.225.167.44 is connected to s-f-grid-fcc1 on port Gi0/1 Last detected on this switch at 2007/11/08/10:49 But the local time was only 10:06 Somebody's clock is off by an hour. ( It would be better if all data were logged and presented in UTC. ) 09 Nov 2007 16:03 reassigned to CLIFFORD, ALDEN of the CD-LSCS/CNCS/SN Group 10 Nov 2007 19:01 reassigned to wohlt ########## # CONDOR # ########## Created /minos/scratch/kreymer/condor/loont, which runs loon on a small data file, under tcsh Had to set the PATH environment variable, cloned from path. Created /minos/scratch/kreymer/condor/loonb running under bash No path fiddling was needed. alias csub='condor_submit $*' alias cq='condor_q $*' ######## # FARM # ######## Ticket 106771 - 04:19 fnpcsrv1:Host is unpingable for a least 10 minutes by NGOP Noticed by asousa, email to backhouse,rubin,timm,kreymer Date: Thu, 08 Nov 2007 10:44:18 -0600 (CST) From: Steven Timm Hi Alexandre, I got the fnpcsrv1 back up about an hour ago. It had crashed with kernel panic at 04:05 local time. 
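For the CONDOR loonb / loont note above ( 2007 11 08 ) : a minimal sketch of a submit description file for such a wrapper ; the file names and argument are made up for illustration, not the actual test job :

# loonb.sub - hypothetical Condor submit description
universe   = vanilla
executable = /minos/scratch/kreymer/condor/loonb
arguments  = small_test_file.root
output     = loonb.$(Cluster).$(Process).out
error      = loonb.$(Cluster).$(Process).err
log        = loonb.$(Cluster).log
queue 1

Submit with the csub alias above, csub loonb.sub ; watch it with cq.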
############ # MCIMPORT # ############ boehm files, 332/434 files are now on tape, 102 remain stuck in DCache These are in pools like w-stkendca10a-5 which are still offline. 10:29 We did find some problems on the stkendca10a node which have now been corrected. The dCache pools on stkendca10a are up and available again. Please let us know if the writes to tape are moving along properly again. Ken S. -- SSA Group The pools came back at about 09:56 I see data moving out of 10a-5 ( cand data ) to tape MRE data is on the way to tape. Still 4 left, in 11a-4 and 11a-6, at 13:30. 09 Nov 2007 14:57 podstvkv - I am working on it. List of files is /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE/957/ n00009573_0000_spill_D03_cedarphyMRE.reroot.root /pnfs/minos/mcin_data/near/daikon_03/spill_cedarphyMRE/969/ n00009696_0019_spill_D03_cedarphyMRE.reroot.root n00009696_0020_spill_D03_cedarphyMRE.reroot.root n00009696_0021_spill_D03_cedarphyMRE.reroot.root ============================================================================= 2007 11 07 ######### # ADMIN # ######### For interactive limits, see http://www-cdf.fnal.gov/offline/runii/ILP/ILPUG/ilpug-4.html User processes have their priority changed to 19 (lowest priority), if using 50% of a CPU or more for 10 minutes, or more. The user is not notified in the event of a renice. User processes are killed if using 10% of a CPU or more for 30 minutes or more. The user is notified via email when a process is killed. ######## # GRID # ######## ticket 105784 pending since 10/18, requests /minos/data and scratch on GPFARM and fnpcsrv1 DONE and tested !!! ###################### # AFS / LSF problems # ###################### 106736 - 09:00 - fsun02 unpinged for 10 minutes 106750 - 13:30 brebel - cannot connect to lsf server no information in the ticket arms informs me of license problems, ####### # LSF # ####### Observe many brebel jobs, submitted around 16:05 ####### # AFS # ####### for NODE in $NODES ; do echo ${NODE} ; ssh ${NODE} 'grep "afs: failed" /var/log/messages' ; done or for NODE in $NODES ; do echo ${NODE} ; ssh ${NODE} 'grep \(110\) /var/log/messages' ; done minos05 Nov 6 12:50:00 minos05 kernel: afs: failed to store file (110) Nov 6 12:50:37 minos05 kernel: afs: failed to store file (110) Nov 7 14:15:01 minos05 kernel: afs: failed to store file (110) minos17 Nov 6 22:29:22 minos17 kernel: afs: failed to store file (110) Nov 6 22:30:26 minos17 kernel: afs: failed to store file (110) Nov 7 15:59:23 minos17 kernel: afs: failed to store file (110) Nov 7 15:59:24 minos17 kernel: afs: failed to store file (110) minos26 Nov 6 12:50:01 minos26 kernel: afs: failed to store file (110) Nov 6 22:30:03 minos26 kernel: afs: failed to store file (110) Nov 7 01:00:05 minos26 kernel: afs: failed to store file (110) Nov 7 04:44:57 minos26 kernel: afs: failed to store file (110) Nov 7 14:15:02 minos26 kernel: afs: failed to store file (110) for NODE in $NODES ; do echo ${NODE} ; ssh ${NODE} 'grep \(110\) /var/log/messages.1' ; done minos04 Oct 30 15:29:53 minos04 kernel: afs: failed to store file (110) minos14 Nov 2 15:19:22 minos14 kernel: afs: failed to store file (110) Nov 2 15:19:31 minos14 kernel: afs: failed to store file (110) minos17 Oct 30 21:04:17 minos17 kernel: afs: failed to store file (110) Nov 1 10:54:42 minos17 kernel: afs: failed to store file (110) for NODE in $NODES ; do echo ${NODE} ; ssh ${NODE} 'grep \(110\) /var/log/messages.2' ; done minos26 Oct 27 02:09:50 minos26 kernel: afs: failed to store file (110) Oct 27 02:09:51 minos26 kernel: 
afs: failed to store file (110) ####### # SAM # ####### IT 3128 - sam_products for sam_station v6_0_5_22_srm ############ # MCIMPORT # ############ boehm files, total 1546 PURGED 1112 PENDING 434 There are 291 queued writes 106 for w-stkendca10a-4, most of the rest for -5 and -6 Reported to dcache-admin twice. Issued helpdesk ticket 106748 14:49 - investigating, contacting developers re stkendca10a pools >> We still have four files waiting to be written from >> w-stkendca11a-4 >> and w-stkendca11a-6 This was resolved, closed out ticket 13 Nov The system disk filled due to some large lqcd files not being handled by an older encp configuration. The configuration was updated and disk space cleared. ########## # CONDOR # ########## jboehm is running on our Condor pool, keeping a queue of about 50. The load average on the cluster jumped up around 13:00 today ! MIN > for NODE in $NODES ; do printf "${NODE} " ; ssh ${NODE} 'du -sm /local/scratch*/boehm' ; done minos02 1 /local/scratch02/boehm minos03 3296 /local/scratch03/boehm minos04 5 /local/scratch04/boehm minos06 7 /local/scratch06/boehm ============================================================================= 2007 11 06 ######## # GRID # ######## /minos/data and scratch mounts on GPFARM - ticket 106721 07:19 - exports were added ############ # MCIMPORT # ############ mcimport.20071102 - continuing to restructure for new scheme ${INPAT}/mcin for reroot files, was ${INPAT}/near/mcin Extended autodest for two-part MC configurations, as done in roundup Updated .grid/kreymer-doe.proxy Created STAGE/CRON to hold the pid $ ./mcimport boehm OOPS - found /home/mindata/STAGE/boehm/log/mcimport.pid OK - stale pid file OK, logging activity to /home/mindata/STAGE/boehm/log/mcimport.log Tue Nov 6 14:52:17 CST 2007 OK - processing from /home/mindata/STAGE/boehm version mcimport.20071102 LOGS PURGE, TAR, WRITE, MCINPURGE, MCINWRITE ... 177624 /home/mindata/STAGE/boehm/ 1 /home/mindata/STAGE/boehm/tar 1 /home/mindata/STAGE/boehm/dcache 177624 /home/mindata/STAGE/boehm/mcin 177623 /home/mindata/STAGE/boehm/mcin/dcache Wed Nov 7 02:15:09 CST 2007 $ ./mcimport boehm OK - purging 1546 MCIN files ? Wed Nov 7 07:33:46 CST 2007 ############ # PNFSDIRS # ############ Added support for a release MCIN which disables output ( this is for archives of some special files from boehm . Also useful for testing. ) ./pnfsdirs near MCIN daikon_03 spill_cedarphyMRE ./pnfsdirs near MCIN daikon_03 spill_cedarphyMRE write ######### # MYSQL # ######### Overloaded with brebel cron jobs since Nov 5 09:45 He will restart with newer code with efficient database access. ============================================================================= 2007 11 05 ############ # MCIMPORT # ############ boehm reroot files : Suggested names like n13011432_0000_L010185N_D03_D00cedarMRE.reroot.root But the initial files are like N00009146_0008_D03_spillcedar_phyMRE.reroot.root These started as cedar_phy mrcc, had MRE run with D03, so should be named like n00009146_0008_spill_D03_cedarphyMRE.reroot.root New file name is n${FILE:1:13}_spill_D03_cedarphyMRE.reroot.root 1547 for FILE in N*_D03_spillcedar_phyMRE.reroot.root ; do echo ${FILE} ; done | wc -l 1546 Stray list.txt file.
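A quick sanity check of the substring arithmetic above, using the example file name already quoted :

FILE=N00009146_0008_D03_spillcedar_phyMRE.reroot.root
echo ${FILE:1:13}                                      # 00009146_0008
echo n${FILE:1:13}_spill_D03_cedarphyMRE.reroot.root   # n00009146_0008_spill_D03_cedarphyMRE.reroot.root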
for FILE in N*_D03_spillcedar_phyMRE.reroot.root ; do echo mv ${FILE} n${FILE:1:13}_spill_D03_cedarphyMRE.reroot.root ; done for FILE in N*_D03_spillcedar_phyMRE.reroot.root ; do mv ${FILE} n${FILE:1:13}_spill_D03_cedarphyMRE.reroot.root ; done 17;236 - ready to move these, with revised mcimport script handle mcin top path do not concatenate see 2007 11 06 ####### # DAQ # ####### to habig : From my logs, based on what I did for a similar shutdown last time , here is a set of commands that would shut down Sunday morning, feel free to adjust . shutting down minos-evd last ( it is an NFS server ) shutting down minos-beamdata after acnet ( it exports to acnet ) ssh -ax -l root minos-rc 'echo "shutdown -h now" | at 05:30 Nov 11' ssh -ax -l root minos-om 'echo "shutdown -h now" | at 05:32 Nov 11' ssh -ax -l root minos-acnet 'echo "shutdown -h now" | at 05:34 Nov 11' ssh -ax -l root minos-beamdata 'echo "shutdown -h now" | at 05:36 Nov 11' ssh -ax -l root minos-evd 'echo "shutdown -h now" | at 05:38 Nov 11' Check the at status with 'at -l ' ########### # MONTHLY # ########### DATASETS 11/5 PREDATOR 11/5 VAULT 11/5 MYSQL 11/7 waited for brebel monthly processing on FNALU ./stage -g RawDataWritePools -d -p 0 fardet_data/2007-10 Needed 403/817 STARTED Mon Nov 5 09:52:15 CST 2007 FINISHED Mon Nov 5 10:48:14 CST 2007 db archives mysql> system time cp -av --target-directory=/data/archive/COPY/20071107/offline DCS_HV.MYD ; real 16m30.116s mysql> system time cp -av --target-directory=/data/archive/COPY/20071107/offline PULSERGAIN.MYD ; real 18m25.131s mysql> system time cp -av --target-directory=/data/archive/COPY/20071107/offline `cat /tmp/offiles` ; real 41m29.246s [minsoft@minos-mysql1 offline]$ time md5sum * >> ../offline.md5sum real 19m41.403s [minsoft@minos-mysql1 offline]$ time gzip -1 *.MYD real 62m10.466s Mysql> time scp -r -c blowfish -qv ${DBCOPY} ${REPATH} real 13m21.059s Mysql> time rsync -r \ real 0m15.821s Wed Nov 7 11:55:36 CST 2007 ######## # GRID # ######## Dear VO Member, Your status with the VO has been changed from Approved to Suspended due to the following reason: Suspended on 200711050500. Please contact VO administrator if you have any questions. VOMRS fermilab Service There are 6885 Fermilab KCA cert's. Of these, 3867 are suspended. Of these, 2130 were suspended this morning. grep 'Suspended on' vomrs.txt | sort -k 5,5 -n | tr -s ' ' | cut -f 5 -d ' ' | sort -u 200708061248 ... 
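A variant of the pipeline above that tallies suspensions per day rather than listing the unique stamps ; it assumes the same vomrs.txt dump and the YYYYMMDDhhmm stamp format shown :

grep 'Suspended on' vomrs.txt | tr -s ' ' | cut -f 5 -d ' ' | cut -c 1-8 | sort | uniq -c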
######## # FARM # ######## mcnearcat 2739 53543 mrnt.cedar_phy_oldbhcurv.root 8 63 sntp.cedar_phy_bhcurv.root 3166 180491 sntp.cedar_phy_oldbhcurv.root 186 12280 sntp.cedar_phy.root corral - added cedar_phy_oldbhcurv MFILES=`find /grid/data/minos/mcnearcat -name \*oldbhcurv\* -exec basename {} \;` printf "${MFILES}\n" | wc -l 5662 Need to clear some working space, based on farmgsum nearcat 109 8110 spill.sntp.cedar_phy.0.root WFILES=`find /grid/data/minos/nearcat -name \*spill.sntp.cedar_phy.0.root -exec basename {} \;` printf "${WFILES}\n" | wc -l 109 for FILE in ${WFILES} ; do cp /grid/data/minos/nearcat/${FILE} /export/stage/minfarm/test/${FILE} ; done for FILE in ${WFILES} ; do echo ${FILE} diff /grid/data/minos/nearcat/${FILE} /export/stage/minfarm/test/${FILE} ; done for FILE in ${WFILES} ; do echo ${FILE} touch -r /grid/data/minos/nearcat/${FILE} /export/stage/minfarm/test/${FILE} ; done for FILE in ${WFILES} ; do echo ${FILE} rm /grid/data/minos/nearcat/${FILE}; done in ROUNTMP mv NOCAT.bck NOCAT in scripts ./roundup -c -M -r cedar_phy_oldbhcurv mcnear ######## # FARM # ######## Removed ROUNTMP/WRITE.old left over from migration to /grid/data 2007 05 11 ============================================================================= 2007 11 03 Sat ############ # MCIMPORT # ############ Moved kordosky directory, see entry at 2007 10 30 Sat Nov 3 09:59:59 CDT 2007 Sat Nov 3 10:48:13 CDT 2007 File copy rate for small log files seems to be about 200 files/second, much better than last Thursday when DMA was off. Size of log files : MINOS26 > MCD=/local/scratch26/mindata/kordosky/log MINOS26 > find ${MCD} -type f | wc -l ; du -sm ${MCD} 58784 5710 /local/scratch26/mindata/kordosky/log Sent out an all-clear message daikon_04 cleanup : I have completed the cleanup of the old daikon_04 beam MC. The /pnfs/minos/stage/*D04* files are purged. The appropriate index, md5, and log files have been cleaned up. mcimport : All of the mcimport/STAGE areas have been moved to /minos/data/mcimport including the kordosky area. Everyone should be able to resume production. processing : Given the much larger capacity of the STAGE/ cache area, and its ability to handle large numbers of small files, Robert and I have decided to simplify the overly processing. We will overlay directly from the /minos/data/mcimport// directories, without first creating tarfiles and indexes. We may or may not still create tarfiles later, for archival purposes, but this is no longer in the processing pipeline. This has no impact on people producing the MC files. Files are copied to the same place, with the same integrity checks. They will remain there a bit longer than before. startup : It is the weekend, so please exercise caution and restraint. If things break, they may need to wait till Monday. Note that minos26 had severe problem starting last Wednesday. These were probably resolved by the reboot on Friday, combined with our new processing model reducing the load on local disk. ########### # MONITOR # ########### Restarted monitoring per HOWTO.monitor ( except beam ) ######### # FNALU # ######### FNALU batch jobs failed for pawloski over the weekend. 
His account is absent on FNALU MINOS01 > ypcat passwd | grep '/afs/fnal' | cut -f 1 -d : | sort > /tmp/users FLXI05 > scp minos01:/tmp/users /tmp/users FLXI05 > for user in `cat /tmp/users` ; do grep -q ^${user} /tmp/pwd || echo MISSING ${user} ; done | cut -f 1 -d ':' bckhouse blake idanko kimjj llhsu mbt mstrait mtavera pawloski pittam rahaman rearmstr rmehdi rodriges scavan tinti whitehd ============================================================================= 2007 11 02 ############ # MCIMPORT # ############ daikon_04 cleanup Logs to remove are all in the run number range 7000 - 7200. cd kordosky/log find . -type f -name L??????_\*_7???_\*.log | wc -l 4949 All seem to be newer then Oct 19, 14 days ago $ find . -type f -name L??????_\*_7???_\*.log -mtime +13 -exec ls -l {} \; | wc -l 335 $ find . -type f -name L??????_\*_7???_\*.log -mtime +14 -exec ls -l {} \; | wc -l 0 $ mv L*.log badd04/ $ mv n*.log badd04/ $ find . -type f -name L??????_\*_7???_\*.log -exec echo mv {} baddo4/ \; $ find L* -type f -name L??????_\*_7???_\*.log -exec mv {} badd04/ \; $ find L* -type f -name n\*_D04.log | wc -l 4901 $ find L* -type f -name n\*_D04.log -exec mv {} badd04/ \; $ find badd04 -type f -name n\*_D04.log | wc -l 4949 $ tar czf badd04.tgz -C badd04 . $ tar tzf badd04.tgz | wc -l 9899 That is a correct count, includes . $ rm -r badd04/ Purge the tar.gz incoming $ rm *D04.tar.gz $ rm tar/n* Clean the mf5.all file cd kordosky/md5 $ wc -l all.md5 31106 all.md5 $ grep D04.tar.gz all.md5 | wc -l 4969 $ grep -v D04.tar.gz all.md5 > all.md5new $ mv all.md5 all.md5.badd04 $ mv all.md5new all.md5 Clean the indexes $ ls *D04.index | wc -l 756 $ mkdir badd04 $ mv *D04.index badd04/ $ cat badd04/*.index | wc -l 4838 $ tar czvf ../badd04.index.tgz -C badd04 . $ tar tzf ../badd04.index.tgz | wc -l 757 PNFS MINOS26 > cd /pnfs/minos/stage/kordosky/ MINOS26 > ls | wc -l 4387 MINOS26 > ls *D04.tar | wc -l 754 MINOS26 > find . -name n\*_D04.tar | wc -l 754 MINOS26 > find . -name n\*_D04.tar -mtime +13 | wc -l 22 MINOS26 > find . -name n\*_D04.tar -mtime +14 | wc -l 0 MINOS26 > find . -name n\*_D04.tar -exec rm {} \; ########## # CONDOR # ########## and zwaska To: brebel@fnal.gov, habig@fnal.gov, jdejong@fnal.gov, pawloski@fnal.gov, petyt@fnal.gov, rustem@fnal.gov, tinti@fnal.gov Cc: minos-admin@fnal.gov, minos_batch@fnal.gov Subject: Condor queues available on Minos Cluster This note is going out to our identified Analysis Batch 'power users'. Last week, we successfully installed a condor pool on the Minos Cluster. Greg has been doing some preliminary tests, and I have done some stress tests to determine that the system can handle thousands of jobs without rolling over. Documention is rough, and I have very little Condor experience. Nevertheless, the system may already be useful for running jobs. Please have a look at an early draft document, ~kreymer/minos/HOWTO.condor a.k.a. http://home.fnal.gov/~kreymer/minos/HOWTO.condor and give things a try. Enjoy ! ############ # MCIMPORT # ############ Cleanup - a lot of daikon_04 was declared to SAM, on 2007 10 25 This was all near CosmicLE, from sjc, directly imported. 
Also, cleaned up after the oom Killer, which zapped kordosky's mcimport at 09:48 n11037118_0018_L010185N_D04-n11037118_0022_L010185N_D04.tar 5 n11037118_0018_L010185N_D04.tar.gz to n11037118_0022_L010185N_D04.tar.gz rm tar/n11037118_0018_L010185N_D04-n11037118_0022_L010185N_D04.tar rm /var/tmp/mindata/MCTAR/kordosky/*.gz ############ # MCIMPORT # ############ Created overlay directory for overlaid reroot files. We will write them to PNFS from mcimport, like any other reroots. ############ # MCIMPORT # ############ Rearranged sjc/far/mcin per new arrangements, sharing near and far files in /mcin. mv far/mcin mcin ln -s ../mcin far/mcin ########### # MINOS26 # ########### Have been seeing oom killer messages in /var/log/messages Oct 31 03:52:01 minos26 kernel: oom-killer: gfp_mask=0xd0 Oct 31 03:58:28 minos26 kernel: oom-killer: gfp_mask=0xd0 Oct 31 05:29:12 minos26 kernel: oom-killer: gfp_mask=0xd0 ... Nov 1 21:37:43 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 1 22:55:03 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 03:30:32 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 03:37:43 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 04:19:11 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 04:28:05 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 04:28:06 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 04:28:08 minos26 kernel: oom-killer: gfp_mask=0xd0 Nov 2 04:28:11 minos26 kernel: oom-killer: gfp_mask=0xd0 MINOS26 > grep Killed /var/log/messages | grep -v sleep | grep -v scp | grep -v _log Oct 31 07:22:56 minos26 kernel: Out of Memory: Killed process 17228 (bash). Oct 31 07:22:57 minos26 kernel: Out of Memory: Killed process 4664 (bash). Oct 31 12:49:32 minos26 kernel: Out of Memory: Killed process 32117 (sendmail). Oct 31 12:49:33 minos26 kernel: Out of Memory: Killed process 6445 (bash). Oct 31 12:49:34 minos26 kernel: Out of Memory: Killed process 946 (bash). Oct 31 12:49:40 minos26 kernel: Out of Memory: Killed process 19156 (mcimport). Oct 31 15:53:02 minos26 kernel: Out of Memory: Killed process 10295 (sendmail). Oct 31 15:53:14 minos26 kernel: Out of Memory: Killed process 19867 (bash). Oct 31 20:55:33 minos26 kernel: Out of Memory: Killed process 7926 (sh). Ticket 106517 09:42 jpfitz crontab -r for kreymer and mindata From /var/log/messages Nov 2 10:22:59 minos26 exiting on signal 15 Nov 2 12:33:18 minos26 syslogd 1.4.1: restart. Digging into the history, $ grep -v 'session opened for user' /var/log/messages | less Oct 31 01:51:11 minos26 kernel: hdb: dma_timer_expiry: dma status == 0x61 Oct 31 01:51:21 minos26 kernel: hdb: DMA timeout error Oct 31 01:51:21 minos26 kernel: hdb: dma timeout error: status=0xd0 { Busy } Oct 31 01:51:21 minos26 kernel: Oct 31 01:51:21 minos26 kernel: ide: failed opcode was: unknown Oct 31 01:51:21 minos26 kernel: hda: DMA disabled Oct 31 01:51:21 minos26 kernel: hdb: DMA disabled Oct 31 01:51:21 minos26 kernel: ide0: reset: success Oct 31 03:52:01 minos26 kernel: oom-killer: gfp_mask=0xd0 ... 
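For a quick picture of how often the oom-killer fired, the same /var/log/messages can be summarized per hour ; a sketch only, assuming the stock syslog timestamp format shown above :
grep 'oom-killer:' /var/log/messages | \
  awk '{ print $1, $2, substr($3,1,2)":00" }' | sort | uniq -c
# each output line is: count  Month Day HH:00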
The ganglia plots indicate a load average around 12 around 02:00 ============================================================================= 2007 11 01 ########## # CONDOR # ########## Testing large scale submit with a touch script ./touch n M touches file n in subdirectory M Submitted 1000 processes, rate of running is about 4/second, using full Minos Cluster minos25 Load average running 1000 was about 1 3000 was about 1.4 3000 rate with 100 second sample, is about 2 /second at 200 2.4/second at 900 3.2/second at 1500 3.9/second at 2000 3.9/second at 2500 17:57 10000 rate with 100 second sleep is 0.7/second at 200 0.7/second at 800 0.7/second at 1200 0.8/second at 1700 0.8/second at 2100 0.7/second at 2600 0.9/second at 3100 1.0/second at 3700 1.1/second at 4300 1.2/second at 4900 1.4/second at 5700 Times from 17:58 to 20:18 MINOS25 > condor_q -run -- Failed to fetch ads from: <131.225.193.25:62586> : minos25.fnal.gov ############ # MCIMPORT # ############ Looking at cleanup of incorrect D04 processing. $ ls /pnfs/minos/stage/kordosky/*D04* | wc -l 730 $ ls index/*D04.index | wc -l 730 Plan : Wait for incoming files to abate ( tomorrow ) Remove all the pnfs and index files Logs - let them rot ? Move kordosky to bluearc Removed all kordosky/DUP files Removed almost all kordosky/BAD files for FILE in kordosky/BAD/*.gz ; do echo ${FILE} ; gunzip -t ${FILE} ; done All but two are actually bad kordosky/BAD/n11011003_0001_L010185N_D01.tar.gz kordosky/BAD/n12011003_0001_L010185N_D01.tar.gz Removed them all. MINOS26 > du -sm /pnfs/minos/stage/* 268720 /pnfs/minos/stage/arms 1 /pnfs/minos/stage/buckley 1 /pnfs/minos/stage/gmieg 758471 /pnfs/minos/stage/hgallag 1358789 /pnfs/minos/stage/howcroft 6736664 /pnfs/minos/stage/kordosky 11457 /pnfs/minos/stage/kreymer 202838 /pnfs/minos/stage/mualem 1 /pnfs/minos/stage/rhatcher 1 /pnfs/minos/stage/sjc 1 /pnfs/minos/stage/urheim 9336936 /pnfs/minos/stage Plan to reorganize this into /minos/data/mcimport Hierarchy for long term storage should look like MC release Config Detector Run/subrun breakout For input to overlays, MCR/CONF/DET is adequate Checking for dup's among recent import is simple No tarring of the files, as all is on Bluearc/NFS Data can then be archived after the fact, in large files, without paranoid CRC checksum tests, and in very large tars. We could put this to LTO3 or LTO4 tape. ####### # AFS # ####### Per brebel, requested volumes /afs/fnal.gov/files/data/minos/d271 /afs/fnal.gov/files/data/minos/d272 for nc work system:administrators rlidwka minos:admin rlidwka system:anyuser rl buckley:ana_ntuples rlidwka Ticket 106484 assigned to mengel 15:31 done ######## # GRID # ######## Bluearc maintenance 06:00 this morning seems to have induced an NFS stale file handle problem on minos01/26 and fnpcsrv1, and probably elsewhere. Noted in fermigrid-announce by timm. 14:09 - sent ticket to run2-s6s ticket 106477 Cleared at 14:40 by jpfitz ######## # FARM # ######## /grid/data glitch requires removal of some cand's - are these declared to SAM ?
RUNSUBS=' N00011852_0000 N00011861_0017 N00011861_0019 N00011861_0021 N00011878_0014 N00011878_0017 N00011896_0003 N00011896_0008 N00011908_0006 N00011908_0022 N00011911_0010 N00011911_0014 N00011911_0021 N00011911_0023 N00011914_0014 N00011917_0001 N00011920_0017 N00011923_0000 N00011923_0017 ' for RUNSUB in ${RUNSUBS} ; do sam locate ${RUNSUB}.spill.cand.cedar_phy_bhcurv.0.root done for RUNSUB in ${RUNSUBS} ; do echo ${RUNSUB} ( cd /pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2007-03 ; cat ".(use)(4)(${RUNSUB}.spill.cand.cedar_phy_bhcurv.0.root)" | head -1 ) done ============================================================================= 2007 10 31 ############ # MCIMPORT # ############ Output is falling behind, due to long kordosky tarring, started 22:49, getting about 7 MB/sec slowed to 1 MB/sec at n11037106_0022_L010185N_D04-n11037106_0028_L010185N_D04.tar ran till 07:23 ####### # BOO # ####### ============================================================================= 2007 10 30 ########### # ENSTORE # ########### Big ( over 600 ) queues, delaying farm running due to lack of mcin_data. CMS data challenge is underway, I also see lqcd activity. Only 2 minos reads are pending, from VOB372 The file needed is /pnfs/minos/mcin_data/near/daikon_03/L010185N/500/n13035001_0000_L010185N_D03.reroot.root on VO3403 ############ # MCIMPORT # ############ Second attempt to import bad file, around 17:45 Oct 29 gunzip: n11037088_0014_L010185N_D04.tar.gz: unexpected end of file ####### # AFS # ####### Mounting subdirectories : http://osdir.com/ml/file-systems.openafs.general/2003-03/msg00092.html fs exportafs nfs -submounts on Freelance mode AFS, readonly (windows only) http://ezine.daemonnews.org/200605/afs.html mentions freelance mode, with no home cell, no tokens ( circa 2002, oops Windows only ) too bad, this would have been useful in the OSE grid Translator : http://www.nabble.com/Bug-405982:-cannot-stop-all-afsd-process-when-start-with--rmtsys-t2935598.html No translator : http://www.openafs.org/pipermail/openafs-devel/2001-May/006056.html Usage at INFN http://www.lnf.infn.it/computing/afs/doc/adm/adm02.htm ####### # AFS # ####### For reference, for grid computing, we would want to have Releases /afs/fnal.gov/files/code/e875/general/minossoft/ Products /afs/fnal.gov/files/code/e875/general/products/ symlinked to ups/ But under releases, there are 3 symlinks outside /afs/fnal.gov/files/code/e875/general/minossoft/packages for bin, lib and tmp, like /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP/R1.24.3/lib/ Scanned all releases, found links like /afs/fnal.gov/files/code/e875/releases /afs/fnal.gov/files/code/e875/releases1 /afs/fnal.gov/files/code/e875/releases2 releases is a 50 GB disk, links are all to SRT_BINLIBTMP releases1/2 are 8 GB. ls -al /afs/fnal.gov/files/code/e875/general/minossoft/releases/R1.24.3 | \ grep /afs/fnal | \ grep -v /afs/fnal.gov/files/code/e875/general/minossoft/packages MSP=/afs/fnal.gov/files/code/e875/general/minossoft cd /afs/fnal.gov/files/code/e875/general/minossoft/releases There are many relative symlinks, up 1 level, from include/ to ../ find R1.24.3 -type l -exec ls -l {} \; | grep ' \.\./' But I see no links up 2 levels find R1.24.3 -type l -exec ls -l {} \; | grep ' \.\./\.\.' 
There are many symlinks to ${MSP}/packages find R1.24.3 -type l -exec ls -l {} \; | grep -v ' \.\./' There are 3 symlinks to bin/lib/tmp with an explicit AFS path find R1.24.3 -type l -exec ls -l {} \; | grep -v ' \.\./' | grep -v ${MSP} Get symlink paths in a searchable form find . -type l -exec ls -l {} \; | grep -v ' \.\./' | grep -v ${MSP} | cut -f 2 -d '>' | tee -a /tmp/minrel There are only 2 stray symlinks find . -type l -exec ls -l {} \; | grep -v ' \.\./' | grep -v ${MSP} | grep -v /afs/fnal.gov lrwxr-xr-x 1 rhatcher e875 23 Sep 27 19:56 ./S07-09-20-R1-26/G3PTSim/LinkDef.h -> TGeant3/geant3LinkDef.h lrwxr-xr-x 1 rhatcher e875 23 Jul 14 00:00 ./S07-07-26-R1-26/Linux2.6-GCC_3_4-maxopt -> Linux2.4-GCC_3_4-maxopt MINOS26 > du -sm /afs/fnal.gov/files/code/e875/releases/* 551 /afs/fnal.gov/files/code/e875/releases/GENIE 54 /afs/fnal.gov/files/code/e875/releases/LOG4CPP 2048 /afs/fnal.gov/files/code/e875/releases/MINOS_EXTERN 22299 /afs/fnal.gov/files/code/e875/releases/MINOS_ROOT 840 /afs/fnal.gov/files/code/e875/releases/NEUGEN3 183 /afs/fnal.gov/files/code/e875/releases/PYTHIA6 16280 /afs/fnal.gov/files/code/e875/releases/SRT_BINLIBTMP 1425 /afs/fnal.gov/files/code/e875/releases/base_release_build 27 /afs/fnal.gov/files/code/e875/releases/stdhep ############ # MCIMPORT # ############ Kregg is concerned again with minos26 capacity, Looking at Ganglia, I see incoming rates as high as 40 GB/3 hours or 4 MBytes/second. Load average runs about 4 , spikes to 6 during this influx. But the average incoming rate is around 1 MB/second, per ganglia plots. ############ # MCIMPORT # ############ Per discussion with arms, will shift all users but kordosky over to /minos/data/mcimport/... $ du -sm * 1 CRON 5805 arms 1 buckley 1 gmieg 468 hgallag 1 himmel 2054 howcroft 26365 kordosky 3621 kreymer 1 mcinwrite 4928 mualem 1 nohup.out 1 rhatcher 23093 sjc 1 urheim Small users are buckley gmieg himmel rhatcher urheim There are no symlinks, per find . -type l cd /local/scratch26/mindata/ mkdir MOVED MCUSER=buckley [ -r "/minos/data/mcimport/${MCUSER}" ] && echo OOPS DUPLICATE in data [ -r "MOVED/${MCUSER}" ] && echo OOPS DUPLICATE MOVED [ -r ${MCUSER}/MCIMPORT ] && echo TOBLUE && mv ${MCUSER}/MCIMPORT ${MCUSER}/TOBLUE du -sm ${MCUSER} find ${MCUSER} -type f | wc -l date time \ cp -ax ${MCUSER} /minos/data/mcimport/${MCUSER} find /minos/data/mcimport/${MCUSER} -type f | wc -l du -sk ${MCUSER} /minos/data/mcimport/${MCUSER} time \ diff -r ${MCUSER} /minos/data/mcimport/${MCUSER} mv ${MCUSER} MOVED/${MCUSER} ln -s /minos/data/mcimport/${MCUSER} ${MCUSER} [ -r ${MCUSER}/TOBLUE ] && echo MCIMPORT && mv ${MCUSER}/TOBLUE ${MCUSER}/MCIMPORT date 14:18 - did the other small guys MCUSER=gmieg MCUSER=himmel MCUSER=rhatcher MCUSER=urheim 14:23 - did the inactive guys MCUSER=arms ... ( copy took hours, due to small files ? 
) interrupted the diff -r after real 48m53.509s user 0m5.524s sys 0m17.290s moved anyway, then ran the diff : real 54m8.381s user 0m13.886s sys 0m53.854s 2007 10 31 - continuing MCUSER=hgallag real 1m23.301s real 0m21.857s MCUSER=howcroft 21138 2102772 howcroft 2062388 /minos/data/mcimport/howcroft real 13m22.700s sys 0m16.462s real 7m15.946s sys 0m13.733s MCUSER=kreymer 49 real 12m2.772s 3707768 kreymer 3703972 /minos/data/mcimport/kreymer real 14m25.219s MCUSER=mualem Wed Oct 31 11:02:47 CDT 2007 6571 real 19m4.375s 5045996 mualem 5029340 /minos/data/mcimport/mualem real 21m33.584s 2007 11 01 MCUSER=sjc Fri Nov 2 13:52:49 CDT 2007 16592 real 3m10.135s 2237676 sjc 2201240 /minos/data/mcimport/sjc real 1m22.925s 2007 11 03 MCUSER=kordosky TOBLUE 6756 kordosky 62434 Sat Nov 3 09:59:59 CDT 2007 real 23m26.866s 6917552 kordosky 6803072 /minos/data/mcimport/kordosky real 23m54.821s Sat Nov 3 10:48:13 CDT 2007 MCUSER=mcinwrite 1 mcinwrite 2 Sat Nov 3 10:51:12 CDT 2007 real 0m0.042s 2 16 mcinwrite 16 /minos/data/mcimport/mcinwrite real 0m0.003s Sat Nov 3 10:51:18 CDT 2007 ============================================================================= 2007 10 29 ####### # SAM # ####### sam_bootstrap v8_1_1 current on minos-sam01 and minos-sam02 In a pinch, can fall back by using the older v8_1_0 directly ups update sam_bootstrap v8_1_0 Version v8_1-1 has improved retries in case of station/dbserver restarts, backs off rate of retries to lower limit of once per hour. ############ # INDEXNFS # ############ ./indexnfs reco_near/cedar_phy/sntp_data/2005-04 RDIRS=`cd /minos/data ; find reco_near -type d -name 2???-??` for DIR in ${RDIRS} ; do ./indexnfs ${DIR} ; done ########### # BLUEARC # ########### From CD ops meeting : 11/1: 6-6:15am Site NAS Server (BlueArc) will be down for a major firmware upgrade. ########## # DCACHE # ########## Removed stray directories under May 22 rubin /pnfs/minos/mcout_data/cedar_phy/bfld201_lowE fnpcsrv1% rmdir bfld201_lowE/sntp_data/ fnpcsrv1% rmdir bfld201_lowE/cand_data/ fnpcsrv1% rmdir bfld201_lowE ############ # MCIMPORT # ############ Strange, a bad input file kordosky/n11037054_0014_L010185N_D04.tar.gz was correctly detected and moved to BAD, but it seems to have remained in the FILES list ! Will have to re-test this code somehow. Impact is just a somewhat messy printout. ######## # GRID # ######## 104371 - marked resolved, waiting for my reply to something ? The accounts are present. Need to follow up on cleanup of old users ? ( mail filed in minos-admin ) Michael Kordosky Brandon Seilhan Durga Rajaram Howard Rubin Thomas Brennan Mark Messier (He is also on MIPP) Steven Cavanaugh Deborah Harris Valeri Garkusha Sergei STriganov Hugh Gallagher Adam Para Mayly Sanchez Tingjun Yang George Irwin Byron Lundberg Robert Bernstein John Urheim Alexandre Sousa Regina Rameika Carol Ward Liz Buckley Joshua Boehm 105638 - waiting for information from me ? Mount of /minos/scratch and data on FNALU int and batch I think questions were answered on 16 Oct. Mounts are in place on FNALU batch and some interactive ######## # GRID # ######## New ticket by rubin, 106232 Spontaneous Condor restarts continue on the Farm. Note - farm is running Condor 6.8.5, which has problems. The problem is believed solved in the Sep 13 Condor 6.8.6 which we run on the Minos Cluster ######## # GRID # ######## ticket 105784 pending since 10/18, requests /minos/data and scratch on GPFARM and fnpcsrv1 See fs exportafs - translator ? 
See Administration Guide, Appendix A, under http://www.openafs.org/doc/index.htm http://www.openafs.org/pages/doc/AdminGuide/auagd022.htm#HDRWQ595 ########### # ENSTORE # ########### Finished review of tapes listed 2007 10 09 for recycling . Approved all but NULL31 ( not a tape ) copy to georges@fnal.gov who sent a recent reminder listing 19 of these ============================================================================= 2007 10 26 ####### # LSF # ####### Checking tokens, they are cloned from submission process : for NODE in $BNODES ; do bsub -R ${NODE} "tokens" ; done ######## # GRID # ######## Mail to minos-admin , chadwick, timm, berman MINOS-doc-3776, version 1 Based on our successful experience so far running Condor on the Minos Cluster, here is a more detailed plan for the new nodes being installed in GCC. Please let us know if there are any adjustments needed to this plan. Condor on new Minos computing ( 8 x Dell PE 6850 ) Driver : Make these nodes available for Minos Analysis batch computing. Issues : To provide compatibility with the existing Minos Cluster Condor system, we should install the following in addition to the default SLF 4 OS , on the eight dedicated Minos nodes : Condor installation to match the Minos Cluster, using the minos25 schedd. Configurations should probably be for 12 VM's per host ( 30% oversubscription ) load average limit of 20 ( generous ) no preemption no suspension AFS Accounts via NIS from minos01/02 , to match the Minos Cluster Allow interactive logins, so that people can do 'kcroninit' Timeline : Install the above within a week after initial acceptance burnin. ----------------------------------------------------- Timm asked about selection of the particular nodes Berman asked about Condor expert support Installation of condor Support level Updated the document https://minos-docdb.fnal.gov:440/cgi-bin/RetrieveFile?docid=3776&version=2&filename=minosanalysis.txt ####### # CVS # ####### Note contact information for cdcvs migration, sforrest Stanley Forrester ( UC Davis, now a contractor ) x4417 , was formerly x8473 ########## # SADDMC # ########## Started FARDET mcin declares ( see below ) ####### # SAM # ####### per nwest having problems with samLocate remotely, ran test : 08:53 NIT=0 while [ ${NIT} -lt 301 ] ; do (( NIT++ )) printf "${NIT} " sleep 1 samLocate --file=F00018000_0000.mdaq.root \ --wsdl=http://www-numi.fnal.gov/sam_web_services/wsdl/DataFileService.wsdl.xml done ran cleanly ########## # CONDOR # ########## Try to control chatter, with when_to_transfer_output = ON_EXIT This works ! Created tinywr.run for this test Creating probe and probe.run tests OK not too thorough yet. Note, that once a job is held, then released, it may not run until an other job is submitted. Rediscovered that to use kcron, this must be the EXECUTEABLE. Adjusted probe.run appropriately. 
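For reference, a minimal submit-file sketch along the lines of the probe.run test, assuming stock Condor submit syntax ; the file name, argument path and output names here are illustrative, not the actual probe.run contents :
# illustrative only - kcron is the executable, the real script is its argument
cat > probestyle.run << 'EOF'
universe                = vanilla
executable              = /usr/krb5/bin/kcron
arguments               = $ENV(HOME)/minos/scripts/probe
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = probe.$(Cluster).$(Process).out
error                   = probe.$(Cluster).$(Process).err
log                     = probe.log
queue
EOF
condor_submit probestyle.run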
########## # CONDOR # ########## Testing independence of tokens Submitted probe contining embedded 20*tiny, to minos10 Submitted another 2 minutes later Token expiration times are 2 minutes apart, both at the start and the end of the jobs ==> no interference ============================================================================= 2007 10 25 ########## # SADDMC # ########## Shift the logs into /minos/scratch/kreymer cd /afs/fnal.gov/files/home/room1/kreymer/minos/log mkdir -p /minos/scratch/kreymer/log cp -vax saddmc /minos/scratch/kreymer/log/saddmc mv saddmc saddmc_old ln -s /minos/scratch/kreymer/log/saddmc saddmc ########## # SADDMC # ########## export SAM_ORACLE_CONNECT="samdbs/..." DET=near VEG=daikon_00 DIR=L010000N ./saddmc --verify -n 1 ${VEG} ${DET}/${VEG}/${DIR}/100 ./saddmc --verify -n 1 ${VEG} ${DET}/${VEG}/${DIR}/* ./saddmc --declare -n 1 ${VEG} ${DET}/${VEG}/${DIR}/100 sam get metadata --file=n13011007_0007_L010000N_D00.reroot.root sam locate n13011007_0007_L010000N_D00.reroot.root N E A R DET=near MINOS26 > ls /pnfs/minos/mcin_data/${DET} | grep dai daikon_00 daikon_01 daikon_03 daikon_04 DET=near VEGS='daikon_00 daikon_01 daikon_03 daikon_04' for VEG in ${VEGS} ; do for DIR in `ls /pnfs/minos/mcin_data/${DET}/${VEG} | sort` ; do echo ${VEG} ${DIR} #./saddmc --verify -n 1 ${VEG} ${DET}/${VEG}/${DIR}/* ./saddmc --declare ${VEG} ${DET}/${VEG}/${DIR}/* done 2>&1 | tee -a /minos/scratch/kreymer/log/saddmc/prd-${DET}-${VEG}-${DIR}.log done STARTED Thu Oct 25 21:37:11 2007 FINISHED Fri Oct 26 01:33:33 2007 grep -v declared ../log/saddmc/prd*.log | grep -v Needed | grep -v Treating | less grep -v declared ../log/saddmc/prd-${DET}*.log | grep -v "Needed\|Treating\|Declaring\|Scanning\|MODE" | less F A R DET=far MINOS26 > ls /pnfs/minos/mcin_data/${DET} | grep dai daikon_00 daikon_01 daikon_02 daikon_03 daikon_04 DET=far VEGS='daikon_00 daikon_01 daikon_02 daikon_03 daikon_04' for ... done as above STARTED Fri Oct 26 13:29:15 2007 FINISHED Fri Oct 26 14:04:10 2007 ########## # DC2NFS # ########## Dated version dc2nfs.20071025 - takes single -d argument for path BEAM DATA ( anticipating needs of beam group soon ) $ AFSS/dc2nfs -d beam_data 2>&1 | tee -a /tmp/dc2nfs.beam_data.log STARTING Thu Oct 25 12:03:59 CDT 2007 Running dc2nfs for DATA beam_data Processing 37 months ... STARTED Thu Oct 25 12:03:59 CDT 2007 FINISHED Thu Oct 25 17:00:56 CDT 2007 ####### # NFS # ####### http://osdir.com/ml/linux.nfs/2004-05/msg00108.html http://oss.sgi.com/projects/xfs/ 13:10 email to minos-admin regarding /minos filesystem sizes The NFS mounts of /minos/data and /minos/scratch seem to be working fine, and quota shows roughly the expected quotas. But df shows a device size much smaller than the expected size of about 20 to 10 TBytes for data and scratch MINOS26 > df -h /minos/* Filesystem Size Used Avail Use% Mounted on minos-nas-0.fnal.gov:/minos/data 3.1T 2.3T 846G 73% /minos/data minos-nas-0.fnal.gov:/minos/scratch 851G 5.7G 846G 1% /minos/scratch Is it possible that somehow we have made NFS V2 client mounts ? The fstab entries contain nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 The man page for nfs mentions a nfsvers=3 option, not vers=3. 
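One way to check what the client actually negotiated, rather than what fstab asked for ( a sketch, assuming the standard Linux nfs-utils tools on the client ) :
# the kernel's view of the live mount options, including the NFS version
grep minos /proc/mounts
# nfsstat -m reports the same per-mount flags in a more readable form
nfsstat -m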
============================================================================= 2007 10 24 ########## # DC2NFS # ########## BEAM DATA ( anticipating needs of beam group soon ) BMOS=`cd /pnfs/minos/beam_data ; ls` for MO in ${BMOS} ; do ./stage -d -p 0 beam_data/${MO} ; done interrupted, many files missing ./volumes vols ./volumes beam_data VO4933 VO7427 VO8433 VO8538 VO8976 VO9835 VOB445 VOB557 VOC009 BVOLS=` ./volumes beam_data` for VOL in ${BVOLS} ; do ./stage -d -p 0 ${VOL} ; done | tee /tmp/beamvols MINOS26 > grep "Staging files\|Needed" /tmp/beamvols | tr -d '.' Staging files from tape VO4933 Needed 659/814 Staging files from tape VO7427 Needed 17/46 Staging files from tape VO8433 Needed 0/455 Staging files from tape VO8538 Needed 3/444 Staging files from tape VO8976 Needed 1/79 Staging files from tape VO9835 Needed 0/21 Staging files from tape VOB445 Staging files from tape VOB557 Needed 0/707 Staging files from tape VOC009 Needed 0/45 Let's restore the missing files for VOL in ${BVOLS} ; do ./stage -w ${VOL} ; done | tee /tmp/beamstage For reference, setting the scale, du -sm /pnfs/minos/beam_data 259002 /pnfs/minos/beam_data ########## # CONDOR # ########## Steve Timm corrected a typo in the condor_config files, Mark Schumitz restarted all the daemons. Jobs are running ! tiny.run - single job tiny.run3 - 3 jobs tiny.run50 - 50 jobs ######## # PNFS # ######## Corrected directories for new Monte Carlo ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010000N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010170N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010185N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010200N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L100200N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L150200N write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L250200N write ./pnfsdirs far cedar_phy_bhcurv daikon_04 L010185N write ./pnfsdirs far cedar_phy_bhcurv daikon_04 L250200N write removed bad mcin for DIR in L010000 L010170 L010185 L010200 L100200 L150200 L250200 ; do ls -l /pnfs/minos/mcin_data/near/daikon_04/${DIR} ; done for DIR in L010000 L010170 L010185 L010200 L100200 L150200 L250200 ; do rmdir /pnfs/minos/mcin_data/near/daikon_04/${DIR} ; done for DIR in L010185 L250200 ; do ls -l /pnfs/minos/mcin_data/far/daikon_04/${DIR} ; done for DIR in L010185 L250200 ; do rmdir /pnfs/minos/mcin_data/far/daikon_04/${DIR} ; done removed bad mcout for DIR in L010000 L010170 L010185 L010200 L100200 L150200 L250200 ; do ls -lr /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/${DIR} ; done for DIR in L010000 L010170 L010185 L010200 L100200 L150200 L250200 ; do rm -r /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_04/${DIR} ; done ####### # AFS # ####### Changed acl's for 5 volumes for nue analysis group, created 2007 05 23 d241 d242 d243 d244 d245 minos:admin rlidwka boehm rlidwka msanchez rlidwka Created minos:nue group NEWGROUP=nue pts creategroup -name kreymer:${NEWGROUP} group kreymer:nue has id -2487 NEWUSERS='boehm msanchez' for GUSER in ${NEWUSERS} ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts setfields kreymer:${NEWGROUP} -access SOMar pts membership kreymer:${NEWGROUP} pts examine kreymer:${NEWGROUP} Name: kreymer:nue, id: -2487, owner: kreymer, creator: kreymer, membership: 2, flags: SOMar, group quota: 0. 
pts chown kreymer:${NEWGROUP} minos pts examine minos:${NEWGROUP} pts membership minos:${NEWGROUP} Added the rest of the buckley:nue group NEWUSERS='annah1 buckley cbs cherdack howcroft ochoa pawloski tjyang scavan vahle' for GUSER in ${NEWUSERS} ; do pts adduser -user ${GUSER} -group minos:${NEWGROUP} ; done Added this acl cd $MINOS_DATA for DIR in d241 d242 d243 d244 d245 ; do fs listacl ${DIR} ; done for DIR in d241 d242 d243 d244 d245 ; do fs setacl -dir ${DIR} -acl minos:nue rlidwka ; done ########### # afs2nfs # ########### created afs2nfschk to check input file existence and non-0 size asf2nfschk -i 2006-10_near.R1_18_4.index cd /afs/fnal.gov/files/data/minos/d10/indexes for INDEX in *.index ; do ~/minos/scripts/afs2nfschk -i ${INDEX} ; done 2>&1 | tee afs2nfschk.log Created afs2nfschk.sum with a summary of just damaged indexes 2005-07_far.R1.16a.index 2005-11_near.R1_24b.index 2006-09_near.R1_18_4.index 2006-10_near.R1_18_4.index 2007-01_near.cedar.index BAD_mc_far.daikon_00.cedar.index mc_far.carrot.R1_24.index mc_near.R1_18_2.index mc_near.daikon_00.cedar.index for INDEX in ${BIND} ; do ~/minos/scripts/afs2nfschk -i ${INDEX} ; done | grep index 15/ 15 2005-07_far.R1.16a.index 1/ 339 2005-11_near.R1_24b.index 14/ 409 2006-09_near.R1_18_4.index 17/ 629 2006-10_near.R1_18_4.index 1/ 601 2007-01_near.cedar.index 781/ 816 BAD_mc_far.daikon_00.cedar.index 1/ 41 mc_far.carrot.R1_24.index 1/ 2289 mc_near.R1_18_2.index 1/ 10923 mc_near.daikon_00.cedar.index mkdir BAD for INDEX in ${BIND} ; do cp -a ${INDEX} BAD/${INDEX} ; done rm 2005-07_far.R1.16a.index rm BAD_mc_far.daikon_00.cedar.index Edited the remaining files to remove missing files. for INDEX in ${BIND} ; do ( nedit ${INDEX} & ) ; done for INDEX in ${BIND} ; do ~/minos/scripts/afs2nfschk -i ${INDEX} ; done Looks OK now ########### # afs2nfs # ########### Corrected .bntp_data directory name mv /minos/data/reco_far/cedar_phy/bntp_data /minos/data/reco_far/cedar_phy/.bntp_data ############ # MCIMPORT # ############ Assisting boehm move of nue pseudo MC files to PNFS. These started as daikon_00 files reco'd with cedar, then muon removed and electrons simulated replacing the mu. I have suggested names like n13011432_0000_L010185N_D03_D00cedarMRE.reroot.root ============================================================================= 2007 10 23 ####### # AFS # ####### See entry on 2007 10 04 Repeated scan for anyuser, just one problem-user Repeated scan for authuser, See /home/kreymer/afsscan.log Got a response from Chadwick before I sent the note to nightwatch. Must be some tachyons round here. ############ # MCIMPORT # ############ Keepin' up, 135 GB minimum space last night. Messages still a bit messy fron CRON pid, OOPS - found /local/scratch26/mindata/CRON/mcimport.pid PID TTY TIME CMD 19913 ? 00:00:00 mcimport 08:30 Cleaned up these messages, hacked into mcimport.20071022 ########## # DCACHE # ########## Email from Sue Kasahara, DCache/Root read rates with root HEAD are as good as old xrootd, aside from a 24 second real time delay, reading concatenated sntp files. 
Using tree->SetCacheSize(50000000) and/or TTreeCache::SetLearnEntries(1) ########### # afs2nfs # ########### Reviewing afs2nfs.log Leaving concatenated <\m> statuslines intact, makes it easier to find diagnostic messages via nedit 2006-10_near.R1_18_4.index recodata15/N00011001_0004.spill.sntp.R1_18_4.0.root 2005-11_near.R1_24b.index 187/ 339 recodata55/N00009146_0005.spill.sntp.R1_24b.0.root 2007-01_near.cedar.index 1/ 601 recodata77F00037242_0003.spill.sntp.cedar.0.root OOPS - copied all files to each stream target directory, for 2005-04_far.cedar_phy.index /minos/data/reco_far/cedar_phy/bntp_data/2005-04 2005-05_far.cedar_phy.index 2005-06_far.cedar_phy.index 2005-07_far.cedar_phy.index 2005-08_far.cedar_phy.index 2005-09_far.cedar_phy.index 2005-10_far.cedar_phy.index 2005-11_far.cedar_phy.index 2005-12_far.cedar_phy.index 2006-01_far.cedar_phy.index 2006-02_far.cedar_phy.index 2006-03_far.cedar_phy.index 2006-06_far.cedar_phy.index 2006-07_far.cedar_phy.index 2006-08_far.cedar_phy.index 2006-09_far.cedar_phy.index 2006-10_far.cedar_phy.index 2006-11_far.cedar_phy.index 2006-12_far.cedar_phy.index 2007-01_far.cedar_phy.index 2007-02_far.cedar_phy.index 2007-03_far.cedar_phy.index MONS=' 2005-05 2005-06 2005-07 2005-08 2005-09 2005-10 2005-11 2005-12 2006-01 2006-02 2006-03 2006-06 2006-07 2006-08 2006-09 2006-10 2006-11 2006-12 2007-01 2007-02 2007-03' for MON in ${MONS} ; do echo ${MON} ls /minos/data/reco_far/cedar_phy/bntp_data/${MON}/*.root | wc -l ls /minos/data/reco_far/cedar_phy/bntp_data/${MON}/*.sntp.*.root | wc -l ls /minos/data/reco_far/cedar_phy/sntp_data/${MON}/*.bntp.*.root | wc -l done for MON in ${MONS} ; do echo ${MON} rm /minos/data/reco_far/cedar_phy/bntp_data/${MON}/*.sntp.*.root rm /minos/data/reco_far/cedar_phy/sntp_data/${MON}/*.bntp.*.root done for MON in ${MONS} ; do echo ${MON} ls /minos/data/reco_far/cedar_phy/bntp_data/${MON}/*.bntp.*.root | wc -l ls /minos/data/reco_far/cedar_phy/sntp_data/${MON}/*.sntp.*.root | wc -l done Now clean up the R1_18 overwriting for DIR in cbdl cnts sntp ; do ls /minos/data/reco_far/R1_18/cbdl_data/2005-04 | wc -l ; done rm /minos/data/reco_far/R1_18/cbdl_data/2005-04/*.sntp.*.root rm /minos/data/reco_far/R1_18/cbdl_data/2005-04/*.cnts.*.root rm /minos/data/reco_far/R1_18/cnts_data/2005-04/*.cbdl.*.root rm /minos/data/reco_far/R1_18/cnts_data/2005-04/*.sntp.*.root rm /minos/data/reco_far/R1_18/sntp_data/2005-04/*.cbdl.*.root rm /minos/data/reco_far/R1_18/sntp_data/2005-04/*.cnts.*.root ============================================================================= 2007 10 22 ######## # GRID # ######## 105638 /minos is still mounted on FNALU on flxi04 and 6 we need at least flxi07, for IA64 testing ######## # GRID # ######## 104371 account request for fnpcsrv1 - awaiting information from me ? Art, we will take this request into consideration. Final approval or denial will be based on the details of how the Open Science Enclave security plan is implemented. I suggest you visit the meeting at 3 PM today. 
########### # afs2nfs # ########### Ran till about 10:00 Sunday 21 Oct log file ran out of quota ( d10 ) Captured most of it from the screen, saved in /tmp/afs2nfs.log First clear some space on recodata01 mc_cosmic.bfld201.cedar.index mc_far.R1.14.index mc_far.carrot.cedar.index Copying 137 c* files from 01 to 113 ( 7.8 GB ) grep recodata01 indexes/*.index | wc -l 137 14:03 - 14:23 for FILE in recodata01/* ; do cp -a ${FILE} recodata113/ done for FILE in recodata01/* ; do echo ${FILE} ; diff ${FILE} recodata113/ done nedit mc_cosmic.bfld201.cedar.index mc_far.R1.14.index mc_far.carrot.cedar.index changed recodata01 to recodata113 grep recodata113 mc_cosmic.bfld201.cedar.index mc_far.R1.14.index mc_far.carrot.cedar.index | wc -l 137 ############ # MCIMPORT # ############ RAL claimed we were not keeping up, I see no evidence of that Looked at ganglia plots, see minos26free.20071022.png Thursday there was a nice clean run, reducing free space from 230 to 70 GB in about 14 hours, 150 GB/14 hours or 2.9 MBytes/second. Concatenation writes to DCache at 6 to 10 MB/second. Concatenation writes local tars at a similar rate. So we should just keep up. 18:47 CDT Fri 19 Oct 'Not keeping up' free disk down to 38 GB 19:34 300 running jobs, holding rest 20:08 cronjob changed from 6 hours to 4 hour interval 09:00 100 GB free 13:00 180 GB free 19:00 noted that jobs had been held. Issues raised in email : Why delay for second pass/clearing of files ? Nick requests turorial on copy to /pnfs/minos/fermigrid/volatile Why not use FermiGrid SE for volatile storage, to clear minos26 What if all farms run at once. mcimport.20071022 - exits quietly if CRON job is still running. 17:45 ln -sf mcimport.20071022 mcimport # was mcimport.20070912 Updated crontab.dat to run every 2 hours, saved as scripts/crontab.mcimport.20071011 ============================================================================= 2007 10 19 ########### # afs2nfs # ########### $ ./afs2nfs -i 2005-11_far.cedar.index STARTING Fri Oct 19 11:06:06 CDT 2007 Running dc2nfs for INDEX 2005-11_far.cedar.index MO 2005-11 DET far REL cedar FILES = 720 STREAMS = sntp 720/ 720 /minos/data/reco_far/cedar/sntp_data/2005-11 16384 recodata59/F00033256_0005.spill.sntp.cedar.0.root 1.2G /minos/data/reco_far/cedar/sntp_data/2005-11 STARTED Fri Oct 19 11:06:06 CDT 2007 FINISHED Fri Oct 19 11:08:05 CDT 2007 Oops, cannot always get release from file name or index name, due to embedded dots in older releases. Allow this on the command line via -r OK, let's look at the big picture. Presently, in /minos/data have only reco_far R1.16 and R1_16a. find . -name 200\*index -size +0 -exec basename {} \; | grep R1.16 2005-03_far.R1.16a.index 2005-07_far.R1.16a.index find . -name mc\*index -size +0 -exec basename {} \; | grep 'R1\.' mc_far.R1.14.index OK, here are the reco dir's without a . find . -name 20\*index -size +0 -exec basename {} \; | grep -v 'R1\.' 
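The 'exits quietly if CRON job is still running' behaviour is presumably a pid file guard ; a minimal sketch of that pattern, reusing the CRON/mcimport.pid path already noted in this log, everything else here illustrative :
PIDFILE=/local/scratch26/mindata/CRON/mcimport.pid
if [ -r "${PIDFILE}" ] && kill -0 "$(cat ${PIDFILE})" 2>/dev/null ; then
    exit 0                      # earlier instance still running, leave quietly
fi
echo $$ > "${PIDFILE}"          # record this instance
# ... import work goes here ...
rm -f "${PIDFILE}"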
RCDIRS=`find /afs/fnal.gov/files/data/minos/d10/indexes -name 20\*index -size +0 -exec basename {} \; | grep -v 'R1\.'` Testing rate calculation ./afs2nfs -i 2005-11_near.cedar.index observed rates are around 16 MBytes/second 41G /minos/data/reco_near/cedar/sntp_data/2005-11 STARTED Fri Oct 19 14:34:51 CDT 2007 FINISHED Fri Oct 19 15:20:39 CDT 2007 Oops, forgot to include the rate calculation, let's try something shorter $ ./afs2nfs -i 2006-01_near.cedar_phy.index 15/ 15 /minos/data/reco_near/cedar_phy/sntp_data/2006-01 16274 recodata109/N00009714_0000.spill.sntp.cedar_phy.0.root STREAM sntp rate 19151 9.4G /minos/data/reco_near/cedar_phy/sntp_data/2006-01 STARTED Fri Oct 19 15:30:19 CDT 2007 FINISHED Fri Oct 19 15:38:50 CDT 2007 Adjusted format, and changed to use of dd instead of cp $ ./afs2nfs -i 2006-02_near.cedar_phy.index ... STREAM sntp rate 17366 9.1G /minos/data/reco_near/cedar_phy/sntp_data/2006-02 STARTED Fri Oct 19 15:39:51 CDT 2007 FINISHED Fri Oct 19 15:48:59 CDT 2007 Now trying a blocksize equal to the file size $ ./afs2nfs -i 2006-07_near.cedar_phy.index ... STREAM sntp rate 16344 3.2G /minos/data/reco_near/cedar_phy/sntp_data/2006-07 STARTED Fri Oct 19 17:23:50 CDT 2007 FINISHED Fri Oct 19 17:27:09 CDT 2007 Cleaned up the formatting $ ./afs2nfs -i 2006-08_near.cedar_phy.index ... STREAM sntp rate 13822 $ ./afs2nfs -i 2006-09_near.cedar_phy.index STREAM sntp rate 20154 Adjust format a bit more $ ./afs2nfs -i 2006-10_near.cedar_phy.index STREAM sntp rate 19693 8.7G /minos/data/reco_near/cedar_phy/sntp_data/2006-10 And a bit more $ ./afs2nfs -i 2006-11_near.cedar_phy.index STREAM sntp rate 19723 8.1G /minos/data/reco_near/cedar_phy/sntp_data/2006-11 Let's rock and roll ! tokens { for INDEX in ${RCDIRS} ; do ./afs2nfs -i ${INDEX} ; done } 2>&1 \ | tee -a /afs/fnal.gov/files/data/minos/d10/indexes/afs2nfs.log Oops, format is not quite clean, interrupted ( to test interruptions ) rm /minos/data/reco_far/cedar/sntp_data/2006-01/F00033499_0018.spill.sntp.cedar.0.root $ ./afs2nfs -i 2006-12_near.cedar_phy.index STREAM sntp rate 18629 11G /minos/data/reco_near/cedar_phy/sntp_data/2006-12 And once again a full run, at 18:23 Note - Have reverted to cp, seems to work well with the AFS source. ########### # MINOS25 # ########### SLF 4.4 upgrade started around 10:25, ganglia up around 10:57 13:10 - schmitz is trying to get condor configured. 
13:50 - timm has root access ####### # UPS # ####### for NODE in $BNODES ; do bsub -R ${NODE} "ls -l /usr/local/etc/setups.sh" ; done flxb10 local flxb11 local flxb13 local flxb16 local flxb17 local flxb18 local flxb19 local flxb20 local flxb21 local flxb22 local flxb23 local flxb24 local flxb25 local flxb26 local flxb27 local flxb28 local flxb29 local flxb30 local flxb31 fnal flxb32 fnal flxb33 fnal flxb34 fnal flxb35 fnal Summary, per 2007 09 18 survey : /local/ups seems preferred, Minos Cluster at SL4 flxi02 flxi06 flxb at SL 3 /fnal/ups seems to have crept in recently minos11 post reinstall flxi04/5/7 flxb at SL 4 ############ # mcimport # ############ hacked ganglia/minos26 into DH web page ln -sf dhmain.20071019.html dhmain.html # was dhmain.20070501.html ============================================================================= 2007 10 18 ########### # afs2nfs # ########### Moves existing files from $MINOS_DATA/d10/* to /minos/data, based on indexes/*.index files find /afs/fnal.gov/files/data/minos/d10/indexes/ -name \*index -exec ls -l {} \; | wc -l 297 -size 0 99 -size +0 198 -size +0 -name mc\* 43 -size +0 -name 20\* 154 This is quite a mess, there are more than just sntp files here, sntp cnts bntp snts cbdl File names for mc are not at all uniform, like mc_atmos.bfld201.R1_18_2.index mc_far .R1_18_2.index mc_far .carrot .R1_18_2.index mc_far .v17 .R1_18_2.index Let's proceed with non-mc data first, will have to get target path file-by-file, parsing from names like F00037789_0000.spill.bntp.cedar_phy.0.root ######### # BATCH # ######### Only 10 of the 40 cores in FNALU batch systems were upgraded to SLF 4. I announced my intent to ask for the upgrade of the rest, to minos_software_discussion ######## # CRON # ######## Global scan of crontabs, triggered by find on minos25 15 09 * * * ${HOME}/minos/scripts/prehour > /tmp/prehour.log This was just testing the hour selection in predator, did not do anything but print. for NODE in $NODES ; do printf "${NODE} " ; ssh ${NODE} 'crontab -l' ; done minos01 MAILTO=kreymer@fnal.gov 15 19 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/cfl 01 23 * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/afsfree quiet 05 23 * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/afssum quiet minos06 02 17 * * * date >> /var/tmp/FOO minos26 MAILTO=kreymer@fnal.gov 06 1-23/2 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/predator 10 05 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/gridappsync Removed the minos06 crontab ########### # MINOS25 # ########### The system was upgraded to SLF 4 long before the rest of the cluster. Joe Boyd sees that is running SELinux, and has other configuration problems. Will reinstall it from scratch tomorrow : Email to minos-admin at 15:30: Per our discussion, please schedule the reinstall of the OS on minos25 to match the other Minos Cluster SL 4 systems. Let's do this as soon as possible, so that we can continue with the Condor work. Let me know a specific time, and I'll announce it to Minos. We have already discussed the usual local file issues kcron - mine was the only one, can drop it mail - only lsfadmin had email, can be dropped crontab - mine was the only one, I have removed it. ganglia - please restart after the upgrade lsf - can omit this, as we are moving to Condor. Condor will need to be reinstalled and reconfigured to match the existing configuration, after the upgrade. 
####### # SAM # ####### SAMDIM="PARENT_BY_NAME F00030612_0005.mdaq.root" sam list files --dim="${SAMDIM}" Files: F00030612_0005.all.snts.R1.14.root F00030612_0005.all.snts.R1_18.0.root ... F00030612_0005.all.cand.R1_18.0.root F00030612_0005.all.cnts.R1_18.0.root File Count: 32 Average File Size: 19.72MB Total File Size: 630.92MB Total Event Count: 663136 ######## # GRID # ######## Ticket 105784 Please have the Bluearc served /minos/scratch and /minos/data volumes mounted on the FNAL_GPFARM nodes ( including fnpcsrv1 etc. ) These are already mounted on the Minos Cluster and Server nodes, and all FNALU Batch nodes. /minos/scratch will allow analysis users to use existing test releases and files. /minos/data will be evaluated for possible use by Farm processing, and provides access to analysis ntuples. ####### # VDT # ####### Per Timm, need to remove the trailing "32" from vomses, and change -voms fermilab: to -voms fermilab:/fermilab Did not yet remove the 32, but fixing the -voms argument gets a proxy. ########### # SCRATCH # ########### 07:45 Solution: ettab@fnal.gov sent this solution: User directories have been created under /minos/scratch ######## # GRID # ######## MINOS25 > condor_submit tiny.run Timm finds that startd's are trying and failing to connect ######## # GRID # ######## For normal useage, see extended introductory user tutorial at http://www.cs.wisc.edu/condor/tutorials/barcelona-2006/ ########## # ORACLE # ########## minosora1 upgrade started at 07:24. Contact resumed at 08:44 project failed at 09:50, lost connection. OK at 09:10 10:05 - notified by mmihalek ============================================================================= 2007 10 17 ########### # SCRATCH # ########### Requested directory creation 14:15 Ticket 105745 USERS=`ypcat passwd | cut -f 1 -d ':' | sort` echo $USERS for USER in $USERS ; do printf "${USER}\n" ; finger ${USER}@fnal.fnal.gov | grep failed ; done condor lsfadm mindata minoscvs mssg products sam samread vanconan # Create all the directories for SUSER in ${USERS} ; do ( su ${SUSER} ; mkdir -p /minos/scratch/${SUSER} ) done # Remove a few stray users who do not need scratch for SUSER in condor lsfadm mssg products samread ; do ( su ${SUSER} ; rmdir /minos/scratch/${SUSER} ) done DONE - 2007 10 18 07:45 see above ########## # CONDOR # ########## schmitz has updated the configs, and started condor globally MINOS25 > ps -flu condor F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 5 S condor 26687 1 0 76 0 - 2100 - 11:11 ? 00:00:00 /opt/condor/sbin/condor_master 4 S condor 26688 26687 0 76 0 - 2252 - 11:11 ? 00:00:00 condor_collector -f 4 S condor 26689 26687 0 76 0 - 2072 - 11:11 ? 00:00:00 condor_negotiator -f 4 S condor 26690 26687 0 76 0 - 2130 - 11:11 ? 00:00:00 condor_schedd -f This is what Timm says we should expend on the schedd system. condor_q - runs and reports no jobs ########## # ORACLE # ########## mmihalek restarted gmond on minosora1/3, had not been logging data since Oct 11. This restored the data flow. 
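Referring back to the CONDOR check above, a compact way to confirm that all four schedd-node daemons are up ( a sketch ; assumes pgrep from procps is available ) :
for D in condor_master condor_collector condor_negotiator condor_schedd ; do
  pgrep -u condor -f ${D} > /dev/null && echo ${D} OK || echo ${D} MISSING
done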
######## # GRID # ######## FNALU batch /minos mounts are complete and tested, see 2007 10 16 ####### # VDT # ####### vdt 1.6.1 is being used on fnpcsrv1 This exists in UPD upd install -j vdt v1_6_1_0 upd install -j pacman v3_19 ups declare -c pacman v3_19 -f NULL ups tailor vdt v1_6_1_0 2>&1 | tee -a /tmp/vdtinstall.log FAILED - needs perl > 5.8.0 ups undeclare -Y vdt v1_6_1_0 upd install -j vdt v1_6_1_0 unsetup perl ups tailor vdt v1_6_1_0 2>&1 | tee -a /tmp/vdtinstall.log 11:10 - 14:12 Stuck after 3 hours at : Installing Condor Globus EDG-Make-Gridmap MyProxy VOMS (on some systems this may take more than 30 min) Installing package [CA-Certificates-Base]. MINOS26 > ps xfwww ... 4517 pts/6 S+ 0:00 \_ ups tailor vdt v1_6_1_0 4519 pts/6 S+ 0:00 | \_ sh -c . /tmp/file5krZVp 4542 pts/6 S+ 0:00 | \_ /bin/sh /afs/fnal.gov/files/code/e875/general/ups/prd/vdt/v1_6_1_0/Linux/ups/install.sh 6099 pts/6 S+ 0:15 | \_ python /afs/fnal.gov/files/code/e875/general/ups/prd/pacman/v3_19/NULL/bin/pacman -install http://vdt.cs.wisc.edu/vdt_161_cache:Condor http://vdt.cs.wisc.edu/vdt_161_cache:Globus http://vdt.cs.wisc.edu/vdt_161_cache:EDG-Make-Gridmap http://vdt.cs.wisc.edu/vdt_161_cache:MyProxy http://vdt.cs.wisc.edu/vdt_161_cache:VOMS 4518 pts/6 S+ 0:00 \_ tee -a /tmp/vdtinstall.log Try again, this time the current vdt v1_8_1_0 ( vdt.cs.wisc.edu ) upd install -j vdt v1_8_1_1 date ups tailor vdt v1_8_1_1 2>&1 | tee -a /tmp/vdt1811.log date GGGGGRRRRRRRRRRRRRR cannot setup pacman, needs upd install -j python v2_4_2_sam unsetup perl setup pacman date ups tailor vdt v1_8_1_1 2>&1 | tee -a /tmp/vdt1811.log 14:43 - 14:54 pacman version [3.19] must be >= [3.20]. upd install -j pacman v3_20 ups undeclare -Y vdt v1_8_1_1 upd install -j vdt v1_8_1_1 date ups tailor vdt v1_8_1_1 2>&1 | tee -a /tmp/vdt1811a.log date 14:57 - STILL FAILS - garaozli advises me that these kits probably do no work.. Did direct installation into /minos/scratch/kreymer/VDT mkdir -p /minos/scratch/kreymer/VDT cd /minos/scratch/kreymer/VDT pacman -get VDT:VOMS-Client ... Choices: l (local) - install into $VDT_LOCATION/globus/share/certificates n (no) - do not install l . setup.sh echo $VDT_LOCATION /minos/scratch/kreymer/VDT voms-proxy-init -noregen -voms fermilab: -valid 168:0 --debug Detected Globus version: 22 Unspecified proxy version, settling on Globus version: 2 Number of bits in key :512 Using configuration file /minos/scratch/kreymer/VDT/glite/etc/vomses Using configuration file /home/condor/execute/dir_11128/userdir/glite/etc/vomses Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Copied similar file from fnpcsrv1 to glite/etc/vomses, from /usr/local/vdt-1.6.1/glite/etc/vomses MINOS25 > voms-proxy-init -noregen -voms fermilab: -valid 168:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Failed Error: fermilab: Unable to satisfy Request! None of the contacted servers for fermilab were capable of returning a valid AC for the user. Note that fnpcsrv1 always complains about lack of $prefix/etc/vomses Tried export prefix=/minos/scratch/kreymer/VDT/glite no change Here is a fresh test : MINOS25 > cd /minos/scratch/kreymer/VDT MINOS25 > . 
setup.sh MINOS25 > klist -f Ticket cache: /tmp/krb5cc_1060_Tf3886 Default principal: kreymer@FNAL.GOV Valid starting Expires Service principal 10/17/07 17:24:12 10/18/07 19:15:46 krbtgt/FNAL.GOV@FNAL.GOV renew until 10/24/07 17:15:46, Flags: FfRA 10/17/07 17:24:13 10/18/07 19:15:46 afs@FNAL.GOV renew until 10/24/07 17:15:46, Flags: FfRA MINOS25 > kx509 MINOS25 > kxlist -p Service kx509/certificate issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA subject= /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/0.9.2342.19200300.100.1.1=kreymer serial=70F791 hash=3fb2f7c8 MINOS25 > voms-proxy-init -noregen -voms fermilab: -valid 168:0 -debug Detected Globus version: 22 Unspecified proxy version, settling on Globus version: 2 Number of bits in key :512 Using configuration file /minos/scratch/kreymer/VDT/glite/etc/vomses Using configuration file /home/condor/execute/dir_11128/userdir/glite/etc/vomses Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Files being used: CA certificate file: none Trusted certificates directory : /minos/scratch/kreymer/VDT/globus/TRUSTED_CA Proxy certificate file : /tmp/x509up_u1060 User certificate file: /tmp/x509up_u1060 User key file: /tmp/x509up_u1060 Output to /tmp/x509up_u1060 Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Arthur E. Kreymer/USERID=kreymer Using configuration file /minos/scratch/kreymer/VDT/glite/etc/vomses Using configuration file /home/condor/execute/dir_11128/userdir/glite/etc/vomses Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Failed Error: fermilab: Unable to satisfy Request! None of the contacted servers for fermilab were capable of returning a valid AC for the user. MINOS25 > date Wed Oct 17 17:25:10 CDT 2007 Try installation into /grid/app/minos/products/VDT setup pacman v3_20 mkdir -p /grid/app/minos/products/VDT cd /grid/app/minos/products/VDT pacman -get VDT:VOMS-Client mkdir -p /grid/app/minos/products/VDT/glite/etc chmod 755 /grid/app/minos/products/VDT/glite/etc . setup.sh voms-proxy-init -noregen -voms fermilab: -valid 168:0 Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses VOMS Server for fermilab not known! cp -a /minos/scratch/kreymer/VDT/glite/etc/vomses glite/etc/vomses This works the same as the /minos/scratch copy on minos25, but on fnpcsrv1, returns VOMS Server for fermilab not known! Weird ! Note, there is documentation for the VOMS client installation at http://vdt.cs.wisc.edu/VOMS-documentation.html ######## # GRID # ######## Around 09:00, processes touching /minos/data or /minos/scratch are getting hung up. can ping servers ping -c 1 -w 2 minos-nas-0 09:23 /minos is OK again ! ######## # FARM # ######## fnpcsrv1 cannot perform simple commands like uptime cd /bin/ls /tmp MRTG data query brings up a message : 131.225.167.44 is connected to s-f-grid-fcc1 on port Gi0/1 Last detected on this switch at 2007/10/17/09:11 1 node is connected to port Gi0/1 of s-f-grid-fcc1. Looking Glass Error: Unknown area name for Device s-f-grid-fcc1 No plots available for any nodes on this switch. ( other switches are OK ) N.B. - all bluearc seems to have been affected, including CMS ============================================================================= 2007 10 16 ####### # X11 # ####### Ran 2 clean scans of Minos Cluster nodes, running gimp. No hangups. 
On minos26, saw several messages like executable not found: '/usr/lib/gimp/2.0/plug-ins/gap_frontends' ######## # GRID # ######## Helpdesk ticket 105638 fnalu-admin : Please mount the BlueArc served /minos/scatch and /minos/data areas on all FNALU interactive and batch systems. /minos/scratch should be writable by users. /minos/data should be exported and readonly on FNALU at present. Thanks ! 16:00 - mounted on all batch, and some interactive systems MINOS26 > BNODES='flxb10 flxb11 flxb13 flxb16 flxb17 flxb18 flxb19 flxb20 flxb21 flxb22 flxb23 flxb24 flxb25 flxb26 flxb27 flxb28 flxb29 flxb30 flxb31 flxb32 flxb33 flxb34 flxb35' for NODE in $BNODES ; do bsub -R ${NODE} "hostname > /minos/scratch/kreymer/BNODES/${NODE}" ; done Failed due to directory missing on flxb13 flxb27 bsub -R flxb13 'grep minos /etc/fstab' for NODE in $BNODES ; do bsub -R ${NODE} "hostname > /minos/data/mindata/kreymer/BNODES/${NODE}" ; done >>> 2007 10 17 Mounts have been updated on flxb13/27/35 for NODE in flxb13 flxb27 ; do AOK ! ######## # GRID # ######## ######## # FARM # ######## Tracking down reason for N00010639_0009.spill.sntp.cedar_phy.0.root pending sam list files --nosum --dim="data_tier raw-near and run_number 10639" | cut -f1 -d '.' | sort | wc -l 24 There are 20 subruns already written, plus 1 pending (_0009 ) But there are only 20 subruns expected 24 subruns in raw data 3 nospills ( 0/1/2 ) 1 badruns not present in nearcat ( 16 ) 20 subruns expected This throws off the simple minded logic of the script. The problem is that subrun 16 was written, but is still listed in the badruns list. ============================================================================= 2007 10 15 ######## # GRID # ######## Submitted timm's Condor plan to minos-admin, Ticket 105607 ########### # ROUNDUP # ########### Added SOCFILE for oracle admin connection cp -a AFSS/roundup.20070809 . ln -sf roundup.20070809 roundup # was roundup.20070803 ########## # DCACHE # ########## schubert cannot access /pnfs/minos/reco_near/R1_18_2/sntp_data/2005-05/N00007815_0000.spill.sntp.R1_18_2.0.root looks OK to me, bases on metadata. IFILE=N00007815_0000.spill.sntp.R1_18_2.0.root IPATH=minos/reco_near/R1_18_2/sntp_data/2005-05 DCPOR=24136 # unsecured DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} cd /local/scratch??/`whoami` dccp ${DFILE} TEST.dat # do the copy Still stuck after 10 minutes at 13:34, here is the login list : DCap01-stkendca2a-unknown-57776 DCap01-stkendca2a-unknown-57776 minos26.fnal.gov active Oct 15 13:24:54 Oct 15 13:24:54 1060/15611 DCap01-stkendca2a-unknown-57776 Arthur Kreymer ? ? ? ? open minos/reco_near/R1_18_2/sntp_data/2005-05/N00007815_0000.spill.sntp.R1_18_2.0.root My guess is that they have reconfigured pools, and this files needs restore. But the restore queue page dates from 13 Oct, at http://fndca3a.fnal.gov/dcache/RC.html I still see no Enstore activity on the DCache data plots today. See Enstore notes below 15:22 - there are idle drives, shubert's test file is restored to r-stkendca15a-6 ######## # FARM # ######## Tracking down N00010639_0016.spill.sntp.cedar_phy.0.root per rubin request /2007-06/cedar_phynear.log Files were written on Thu Jun 7 13:32:59 CDT 2007 wrote files 0, 10 17 BADRUNS N00010639_0009.cosmic.sntp.cedar_phy.0.root BADRUNS N00010639_0016.cosmic.sntp.cedar_phy.0.root File was written this morning to /pnfs/minos/reco_near/cedar_phy/sntp_data/2006-08 saddreco did not run due to lack of SAM_ORACLE_CONNECT Created new roundup, tried again. 
./roundup -w -r cedar_phy near ####### # SAM # ####### 11:00 Upgraded production dbserver allows parameters to be added cleanly allows query on CHILD_BY_NAME sam_db_srv_pkg v8_3_0 ( was sam_db_srv v7_6_1 ) sam_bootstrap v8_1_0 ( was v6_1_2, required for use of sam_db_srv_pkg ) sam_config v7_1_5 ( was v4_2_34 ) sam v8_2_0 ( was v7_6_5, on clients ) Updated sam on AFS ups declare -c sam v8_2_0 # was v7_6_5 The queries for listing parents now work, where they returned extra results before : FILE=F00030612_0005.spill.bntp.cedar_phy.0.root SAMDIM=" DATA_TIER raw-far \ and FULL_PATH like /pnfs/minos/fardet_data/2005-04 \ and FILE_NAME like F0003061% \ and CHILD_BY_NAME ${FILE} \ " SAMDIM="CHILD_BY_NAME ${FILE}" sam list files --dim="${SAMDIM}" --nosummary | sort F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root MINOS26 > sam get metadata --file=${FILE} \ | grep parents \ | tr "'" \\\n \ | grep root \ | sort F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root ####### # SAM # ####### Added the new MC parameters in production as previously done in dev and int. samadmin add param suite --param-file=MCPARAMS.py Param Category 'mc': ... paramType 'bfield': registered as type 'string' (new dimension 'mc.bfield') ... paramType 'volume': (no change) ... paramType 'beam': (no change) ... paramType 'split': (no change) ... paramType 'vtxregion': registered as type 'string' (new dimension 'mc.vtxregion') ... paramType 'release': (no change) ... paramType 'flavor': (no change) The parameters have good indexes, as verified with sqlplus. ########### # ENSTORE # ########### Ticket 105574 Sometime this morning the www-stken web pages stopped responding. I see no enstore transfers to FNDCA since 06:00, http://fndca2a.fnal.gov:8090/dcache/outplot?lvl=0&filename=billing-2007.10. daily.brd.png&day=15&fmt=lin The CMS tape activity monitor shows active LTO3 drives, but most 9940B drives have been stuck several hours : http://cmsdca.fnal.gov/cgi-bin/enstore_drives.sh 10:03 Some IP addresses were changed this morning. The enstore monitoring web pages for all three enstore systems are not accessible. We are in the process of identifying and correcting the problem. More when it is known. George Szmuksta SSA ( pages look OK to me A.K. ) 12:45 The web pages have been fixed. As far as tape activity we are experiencing a media changer queue full which is delaying mounts and dismounts. We are looking at it. George szmuksta 13:55 - still no tape activity. 14:19 - seeing tape activity mostly writes, 168 queue elements 15:22 - there are idle drive, shubert's test file is restored to r-stkendca15a-6 ####### # SIM # ####### Request from arms for near/daikon_04/CosmicLE near/daikon_04/L010000 near/daikon_04/L010170 near/daikon_04/L010185 near/daikon_04/L010200 near/daikon_04/L100200 near/daikon_04/L150200 near/daikon_04/L250200 far/daikon_04/L010185 far/daikon_04/L250200 Waiting to see whether these are all cedar_phy_bhcurv . 
11:30 - confirmed ./pnfsdirs near cedar_phy_bhcurv daikon_04 CosmicLE write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010000 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010170 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010185 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010200 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L100200 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L150200 write ./pnfsdirs near cedar_phy_bhcurv daikon_04 L250200 write ./pnfsdirs far cedar_phy_bhcurv daikon_04 L010185 write ./pnfsdirs far cedar_phy_bhcurv daikon_04 L250200 write ####### # SAM # ####### Performed database repairs in dev as described 2007 10 12, on advice from Herber this morning. setup oracle_client ../bin/rlwrap sqlplus samdbs/...@minosdev ============================================================================= 2007 10 12 ######## # FARM # ######## Rubin is doing cedar_phy near cleanup. Requests status of N00011772 .1.root files Existing catted files are N00011772_0000.spill.sntp.cedar_phy.0.root N00011772_0000.spill.mrnt.cedar_phy.0.root sam get metadata --file=${FILE} \ | grep parents \ | tr "'" \\\n \ | grep root \ | sort The original .0. files seem complete. I don't know why there was reprocessing of 8 of the subruns. ####### # SAM # ####### Testing definition creation, failing for ahimmel ( not in group minos ) sam create definition --definitionName='kreymer-test' \ --dimensions='FILE_NAME = F00031300_0000.mdaq.root' \ --group='minos' sam describe definition --definitionName='kreymer-test' sam delete definition --definitionName='kreymer-test' ####### # SAM # ####### dbs v8_3_0 work, see log_data/LOG/sam03.log Upgrade to sam_config v7_1_5 ups declare -c sam_config v7_1_5 Cannot start integration dbserver v8_3_0 lacking compat-libstdc++-33-3.2.3-47.3 Requested this, ticket 105533 assigned to Jason Done ! 
MINOS26 > upd install -j sam v8_2_0 ####### # SAM # ####### Assess damage to the parameters setup oracle_client ../bin/rlwrap sqlplus samdbs/...@minosdev select dimension_name,dim_alias from dimensions where dimension_name like 'MC.%' ; select dimension_name,dim_alias from dimension_addons where dimension_name like 'MC.%' ; The dev declarations are definitely mangled, containing bad DIM_ALIAS fields for MC.BFIELD and MC.VTXREGION , like param_types##1 param_categories##1 Adding new parameters to int using sam v8_2_0 and dbs v8_3_0 setup sam -q int v8_2_0 export SAM_ORACLE_CONNECT="samdbs/" samadmin add param suite --param-file=MCPARAMS.py Looked with sqlplus, the param values are unique ( 261 and 252 ) Plan to correct these problems on Monday SET PAGESIZE 1000 SET LINESIZE 100 SET NEWPAGE NONE SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSIONS where DIMENSION_NAME = 'MC.BFIELD' ; SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSION_ADDONS where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_category' ; SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSION_ADDONS where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_type' ; SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSIONS where DIMENSION_NAME = 'MC.VTXREGION' ; SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSION_ADDONS where DIMENSION_NAME = 'MC.VTXREGION' and DIM_COLUMN = 'param_category' ; SELECT DIMENSION_NAME,DIM_ALIAS from DIMENSION_ADDONS where DIMENSION_NAME = 'MC.VTXREGION' and DIM_COLUMN = 'param_type' ; UPDATE DIMENSIONS SET DIM_ALIAS = 'param_values##261' where DIMENSION_NAME = 'MC.BFIELD' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_categories##261' where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_category' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_types##261' where DIMENSION_NAME = 'MC.BFIELD' and DIM_COLUMN = 'param_type' ; UPDATE DIMENSIONS SET DIM_ALIAS = 'param_values##262' where DIMENSION_NAME = 'MC.VTXREGION' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_categories##262' where DIMENSION_NAME = 'MC.VTXREGION' and DIM_COLUMN = 'param_category' ; UPDATE DIMENSION_ADDONS SET DIM_ALIAS = 'param_types##262' where DIMENSION_NAME = 'MC.VTXREGION' and DIM_COLUMN = 'param_type' ; ######## # GRID # ######## Submitted Minos Cluster grid plan via email to timm, chadwick, berman, minos-admin ########## # DC2NFS # ########## AFSS/dc2nfs -d far -r R1.16 -s sntp STARTING Fri Oct 12 14:16:02 CDT 2007 Running dc2nfs for DET far REL R1.16 STR sntp Processing 5 months STARTED Fri Oct 12 14:16:02 CDT 2007 FINISHED Fri Oct 12 15:42:39 CDT 2007 ============================================================================= 2007 10 11 ####### # SAM # ####### Resuming dbs v8_3_0 work, see log_data/LOG/sam03.log products are installed, ready to bite the bullet and upgrade sam_config ?
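Before flipping the current chain, it would be prudent to record what is declared now so the change can be backed out. A sketch, assuming the usual ups commands :
ups list -aK+ sam_config > /tmp/sam_config.before 2>&1   # note what is current today
ups declare -c sam_config v7_1_5
# rollback : re-declare whatever /tmp/sam_config.before shows as current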
######### # STAGE # ######### stage.20071012 Added printout of ${NCHECK}/${NFILES} in WAITER Added -b bailout option, for testing Added VERSION and printout thereof Added STARTED / FINISHED time summary 11:15 ln -sf stage.20071012 stage # was stage.20061012 ########## # DC2NFS # ########## $ AFSS/dc2nfs -d far -r R1.16a -s sntp STARTING Thu Oct 11 08:53:06 CDT 2007 Running dc2nfs for DET far REL R1.16a STR sntp Processing 1 months cranked along at about 5 files/second R1_15 far reco need months 2005-0* 3 5 6 7 8 ./stage reco_far/R1.16/sntp_data/2005-03 for MON in 5 6 7 8 ; do ./stage -w reco_far/R1.16/sntp_data/2005-0${MON} ; done ######### # ADMIN # ######### Discussed 8 nodes GP CPU deployment with Jason Allen ( Boyd is on vacation) We will need AFS mounted on these, unlike rest of GP_GRID nodes Probably no special network/location requirements. ============================================================================= 2007 10 10 ############## # parameters # ############## Multiple parameter selections are not working for mc.vtxregion. Same problem as before, herber has corrected the database content previously via direct SQL. It was necessary to samadmin flush dbserver cache The problem was the non-unique numbers in DIMENSIONS.DIM_ALIAS and DIMENSION_ADDONS.DIM_ALIAS To see this, use the database browser, select development SAM parameters mc., and note the value param_values##1 ########## # DC2NFS # ########## Cloned from dc2afs Test on far R1.16a sntp_data then R1.16 ./stage -d -p 0 reco_far/R1.16a/sntp_data/2005-03 Needed 460/460 These are all in the minos file family, so forget tape optimization, just prestage them as is. ./stage reco_far/R1.16a/sntp_data/2005-03 ######### # ADMIN # ######### bspeak asks that grashorn login shell be bash on Minos Cluster I submitted helpdesk ticket 105387 for this and FNALU. Done for Minos Cluster 14:50 ############ # MCIMPORT # ############ mcimport.20070912 tune up/debug ( improved diskfull handling ) AFSS/mcimport.20070913 -n kreymer Corrected quotations around print statements 11:43 $ cp -a AFSS/mcimport.20070912 . $ ln -sf mcimport.20070912 mcimport # was mcimport.20070910 ############ # MCIMPORT # ############ mualem is importing lots of CosmicLE_D03.reroot.root to mualem/ rather than mualem/far/mcin/ Renamed these, and informed mualem and minos_sim for FILE in *root ; do echo ${FILE} ; mv ${FILE} far/mcin/${FILE} ; done Created working directories ./pnfsdirs far cedar_phy_bhcurv daikon_03 CosmicLE write chmod 775 /pnfs/minos/mcin_data/far/daikon_03 chgrp e875 /pnfs/minos/mcin_data/far/daikon_03 chmod 775 /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_03 chgrp e875 /pnfs/minos/mcout_data/cedar_phy_bhcurv/far/daikon_03 Ran mcimport -w mualem around 18:40 CDT, clearing space quickly. Hacked crontab to run next cycle at 20:37, when this has cleared the nearly 800 files. 
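For the next stray-file cleanup of this kind, a slightly safer form of the move loop, refusing to clobber anything already sitting in the target - a sketch only :
for FILE in *root ; do
  if [ -e far/mcin/${FILE} ] ; then
    echo "SKIP ${FILE} - already present in far/mcin"
  else
    echo ${FILE} ; mv ${FILE} far/mcin/${FILE}
  fi
done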
######### # NEDIT # ######### requested installation on minos-sam02 to match rest of Cluster/Server Ticket 105361 ############ # pnfsdirs # ############ Updated to set group and permission of basei/baseo on the level above basein/baseou ./pnfsdirs far cedar_phy_bhcurv daikon_03 CosmicMu write ./pnfsdirs near cedar_phy_bhcurv daikon_03 CosmicLE write ============================================================================= 2007 10 09 ########### # ENSTORE # ########### Recycle request for tapes as follows : Minos currently has 33 tapes with no active files: Checked volumes with enstore info --list=${VOL} VO8166 | none | 9940B | far_dcs_data OK unknown path VO2067 | none | 9940 | log_data_caldet VO5689 | none | 9940 | log_data_R1_14 VO6504 | none | 9940B | log_data_R1_14 VO8500 | none | 9940B | log_data_R1_18 VO8514 | none | 9940B | log_data_R1_18 VO8547 | none | 9940B | log_data_R1_18_2 VO8564 | none | 9940B | log_data_R1_18_2 OK Moved to $MINOS_DATA/log_data/... VO4432 | full | 9940B | mcout_far_daikon_02_cand VO6615 | full | 9940B | mcout_far_daikon_02_cand VOC485 | full | 9940B | mcout_far_daikon_02_cand VOC488 | full | 9940B | mcout_far_daikon_02_cand OK reprocessed NULL31 | none | null | neardet_data NO moibenko files, but this is NULL MOVER data, not a tape VO4435 | full | 9940B | reco_mc_near_cedar VO4460 | readonly | 9940B | reco_mc_near_cedar VO4461 | full | 9940B | reco_mc_near_cedar VO4465 | full | 9940B | reco_mc_near_cedar VO4475 | full | 9940B | reco_mc_near_cedar VO4553 | full | 9940B | reco_mc_near_cedar VO4554 | full | 9940B | reco_mc_near_cedar VO4613 | full | 9940B | reco_mc_near_cedar VO4716 | full | 9940B | reco_mc_near_cedar OK no paths VOB870 | full | 9940B | reco_mc_near_cedar_cand VOB931 | full | 9940B | reco_mc_near_cedar_cand VOC465 | full | 9940B | reco_mc_near_cedar_cand OK no paths VO9913 | full | 9940B | reco_mc_cosmic_cedar VOB691 | full | 9940B | reco_mc_cosmic_cedar VOC043 | full | 9940B | reco_mc_cosmic_cedar VOC151 | full | 9940B | reco_mc_cosmic_cedar VO7080 | readonly | 9940B | reco_mc_cosmic_cedar OK bfld201_lowE some deleted/reprocessed, some not VO8437 | full | 9940B | reco_near_R1_18_4 VOB644 | full | 9940B | reco_near_R1_18_4 OK no path VOB414 | none | 9940B | reco_near_S06-05-25-R1-22 OK test processing runs, obsolete ####### # SRM # ####### SRM is down around 07:30 due to fndca2a failure/replacement. Estimate 3 to 4 hours. Tried pinging servers with telnet fndca1.fnal.gov 8443 telnet stkendca2a.fnal.gov 8443 But both of these succeed in connecting to the port ( exit code of 1 after a normal quit/exit/close ) even though the SRM service is down. 14:50 Network monitoring indicates that fndca2a went down around 03:15, and came up around 10:00 . But the SRM service is still down. Is there a revised time estimate for restoring SRM ? 
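Since a raw telnet to 8443 connects even with SRM dead, a service-level probe is more telling. A sketch, reusing the short-retry srmls options and the beam_data surl from the 10/01 entry below :
SPATH2="srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/beam_data/2004-12"
if srmls -retry_timeout=10 -retry_num=0 "${SPATH2}" > /dev/null 2>&1 ; then
  echo "SRM is answering"
else
  echo "SRM is not answering"
fi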
15:05 - Timur requested database and SRM restart 16:40 - still no estimate ####### # SAM # ####### Testing sam_db_srv_pkg on minos-sam02 ( int ) upd list -l sam_db_srv_pkg v8_3_0 See log_data/LOG/sam02.log ============================================================================= 2007 10 08 ############# # CHECKLIST # ############# Ticket 105076 Minos Server Ganglia still missing - jpfitz send reminder 13:40 fixed by jonest 15:26 Ticket 105113 - jyuko group and scratch access 2007 10 10 - assigned to shepelak reassigned to terry jones completed 13:21 Sam shifter Marcia Begalli sent message re IT 1146 no way to check existence of sam tape location note gltail.rb at www.fudgie.org real time monitoring tools ######## # GRID # ######## Grid user meeting discussed overall tactics, Made a short but failed attempt to see why proxies with minos/production could not write to DCache. Created srmwtest for write tests need to add controls for PNFS/volatile, normal/roled proxy, file size, ############ # SRMWATCH # ############ New monitoring pages under DCache http://fndca2a.fnal.gov:8090/srmwatch/ ============================================================================= 2007 10 05 vacation ============================================================================= 2007 10 04 ####### # SRM # ####### Down since Wednesday sometime. about 16:55, helpdesk ticket 105117 ####### # SRM # ####### Report failure to write using production role in grid proxy, to fermigrid-users ######## # DATA # ######## Added jyuko to minos:beam, per request. pts membership minos:beam pts adduser -user jyuko -group minos:beam Created /minos/scratch/jyuko /minos/scratch/kreymer ########### # GANGLIA # ########### Reported minos25 and sam/mysql1 ganglia monitoring missing since Wed glitch. Ticket 105076 jpfitz restored and reconfigured, we've lost Minos Server links. ####### # AFS # ####### Scanned AFS for system:anyuser protections of home directories system:anyuser includes everyone in the world who can gain access to your cell. system:authuser includes everyone who is currently authenticated in your cell AFSH=/afs/fnal.gov/files/home ROOM=room1 AUTH=anyuser for DIR in ${AFSH}/${ROOM}/* ; do printf "\n${DIR} " ; fs listacl ${DIR} | grep system:anyuser | grep -v 'system:anyuser rl$' ; done This revealed enough exceptions to be worth summarizing for DIR in ${AFSH}/${ROOM}/* ; do ACL=`fs listacl ${DIR} 2> /dev/null | grep system:${AUTH} | grep -v "system:${AUTH} rl$"` [ -n "${ACL}" ] && printf "${DIR} ${ACL}\n" done Redirected stderr to /dev/null, as cannot access all directories I do have a valid token when running this scan. Not listing certain security problems here, but reporting to fnalu-admin and nightwatch. ADIR= At csf.rl.ac.uk, ( cd ${ADIR} ; echo HELLO > HELLO ; ls -l HELLO ; cat HELLO ; rm HELLO ) ls -l ${ADIR}/HELLO ; cat ${ADIR}/HELLO The world can indeed write to these areas. 
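The per-room results presumably got captured into the /tmp/home*.any files read back in the next step ; a sketch of how that capture loop might have looked ( the room list and file naming are guesses ) :
AFSH=/afs/fnal.gov/files/home
AUTH=anyuser
for ROOM in room1 room2 room3 ; do
  for DIR in ${AFSH}/${ROOM}/* ; do
    ACL=`fs listacl ${DIR} 2> /dev/null | grep system:${AUTH} | grep -v "system:${AUTH} rl$"`
    [ -n "${ACL}" ] && printf "${DIR} ${ACL}\n"
  done > /tmp/home${ROOM#room}.${AUTH:0:3}
done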
$ cat /tmp/home*.any | grep rlidwka There are 11 rlidwka users, and one rlw cat /tmp/home*.any | grep 'rl.*w Passed the list to Joe Klemencic jklemenc ######## # GRID # ######## /minos/data and /minos/scratch are in /etc/fstab on Cluster and Servers ============================================================================= 2007 10 03 ############ # SADDRECO # ############ Retesting saddreco.20070913 after adjustments for regular data PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.dev:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9000 export SAM_ORACLE_CONNECT='samdbs/pass' RELS=cedar MCRL=daikon_00 MODS=/pnfs/minos/mcout_data/${RELS}/near/${MCRL} DIRS=L100200N AFSS/saddreco.20070913 near ${RELS} ${DIRS} verify -m ${MCRL} -b 1 -v corrected code for copy of params AFSS/saddreco.20070913 near ${RELS} ${DIRS} verify -m ${MCRL} This verified cleanly on cand/mrnt/sntp ; 100-105 Now re-retest for reco data, as below AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} verify -b 1 -v -s F00039716_0005 AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} verify -b 1 AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} verify OOPS, need location for F00039586_0005.all.cand.cedar.0.root AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} addloc OK - add location F00039586_0005.all.cand.cedar.0.root /pnfs/minos/reco_far/cedar/cand_data/2007-09(vo2363.1246) Ran single file declaration AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} declare -b 1 \ 2>&1 | tee -a ${HOME}/ROUNTMP/LOG/${SAMMON}/declare_${DET}_${REL}.log Ran the rest AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} declare \ 2>&1 | tee -a ${HOME}/ROUNTMP/LOG/${SAMMON}/declare_${DET}_${REL}.log SRV1> ln -sf saddreco.20070913 saddreco # was saddreco.20070507 Restored SAM as appropriate, to corral ########## # SADDMC # ########## Asking permission at Monday's MC meeting to proceed with MC declares Metadata now includes Params({ 'mc' : CaseInsensitiveDictionary({ 'beam' : DataType('string'), # from directory, like L010185N_bfldx113 'bfield' : DataType('string'), # field 5 in releases daikon and later 'flavor' : DataType('string'), # field 4 'release' : DataType('string'), # from directory, like daikon_00 'split' : DataType('string'), # field 5 in releases carrot and earlier 'volume' : DataType('string'), # changed vtxregion 'vtxregion' : DataType('string'), # field 3 [itgt] })}) Event counts and first/last event numbers in SAM metadata are faked, as we are not reading the mcin files to get those numbers ( and I lack the code to do so. 
) ######### # ADMIN # ######### Apparent reboot of Minos Cluster and Server nodes BNODES='flxb10 flxb11 flxb13 flxb16 flxb17 flxb18 flxb19 flxb20 flxb21 flxb22 flxb23 flxb24 flxb25 flxb26 flxb27 flxb28 flxb29 flxb30 flxb31 flxb32 flxb33 flxb34 flxb35' INODES='flxb10 flxb24 flxb30 flxb31 flxb32 flxb33 flxb34 flxb35' UNODES="flxi02 flxi03 flxi04 flxi05 flxi06" SNODES="minos-mysql1 minos-sam01 minos-sam02 minos-sam03" NODES="minos01 minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10 minos11 minos12 minos13 minos14 minos15 minos16 minos17 minos18 minos19 minos20 minos21 minos22 minos23 minos24 minos25 minos26" for NODE in ${NODES} ; do printf "${NODE} " ; ssh ${NODE} uptime ; done BNODES - stayed up UNODES - stayed up SNODES - all rebooted at 7:02 NODES - minos22 through 26 rebooted at 7:02 Lost /minos/* mounts, these are not yet in fstab roundup was cleanly finished by 06:13 mcimport was cleanly finished at 06:35 minos-sam01 - ups start sam_bootstrap ./sam_test_py minos - OK ######## # GRID # ######## Outline of file system mount permissions, per Chadwick whiteboard DISK HOME GCI OSE GCE OSE | ------------------------- Computing : GCE | RWX | RWX | NO | RWX | | | | | | OSE | NO ?| RWX | NO | RWX | -------------------------- R-- RW- ============================================================================= 2007 10 02 ######## # GRID # ######## /minos/data and /minos/scratch mounted on Cluster and Servers ####### # CVS # ####### Backed up previous passwd file MINOSCVS > cd /cvs/minoscvs/rep1/CVSROOT/ MINOSCVS > mv passwd passwd.20010918 ; cp -a passwd.20010918 passwd ; ls -l pass* Created new passwd file with no password MINOSCVS > cp passwd passwd.20071002 MINOSCVS > nedit passwd.20071002 Deployed and test passwordless pserver, with fallback, about 12:00 cp -a passwd.20071002 passwd Tested MINOS26 > cvs -d $loc checkout BubbleSpeak ; rm -r BubbleSpeak test cvs checkout: warning: failed to open /afs/fnal.gov/files/home/room1/kreymer/.cvspass for reading: No such file or directory cvs checkout: Updating BubbleSpeak and tested from csf.rl.ac.edu ########### # MONTHLY # ########### DATASETS 10/2 PREDATOR 10/2 VAULT 10/2 Note these went to LTO-3 library this time MYSQL 10/2 ############ # SADDRECO # ############ Final test of saddreco.20070913 for predator Log into fnpcsrv1 /export/stage/minfarm/.grid/ cd scripts cp -a AFSS/saddreco.20070913 . PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9010 export SAM_ORACLE_CONNECT=`cat /export/stage/minfarm/.grid/samdbs_prd` DET=far REL=cedar SAMMON='2007-09' AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} verify -b 1 -v -s F00039716_0005 copymeta problem hacked copy of saddreco.20070507 to print MYMETA, for comparison saddreco.old ${DET} ${REL} ${SAMMON} verify 1 > /tmp/log.old AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} verify -b 1 -v -s F00039716_0005 > /tmp/log.20070913 ... did not do the following yet, want to re-test MC first ... 
Ran single file declaration AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} declare -b 1 \ 2>&1 | tee -a ${HOME}/ROUNTMP/LOG/${SAMMON}/declare_${DET}_${REL}.log Ran the rest AFSS/saddreco.20070913 ${DET} ${REL} ${SAMMON} declare \ 2>&1 | tee -a ${HOME}/ROUNTMP/LOG/${SAMMON}/declare_${DET}_${REL}.log ============================================================================= 2007 10 01 ############ # SADDRECO # ############ Moving latest saddreco.20070913 to production use in roundup. Creates tape storage locations as needed PLAN : disable SAM declares for a while, via corral saddreco --verify saddreco --declare integrate ####### # CVS # ####### Testing pserver password removal. mkdir -p /tmp/kreymer cd /tmp/kreymer loc=":pserver:anonymous@minoscvs.fnal.gov:/cvs/minoscvs/rep1" cvs -d $loc checkout BubbleSpeak cvs checkout: warning: failed to open /afs/fnal.gov/files/home/room1/kreymer/.cvspass for reading: No such file or directory cvs checkout: authorization failed: server minoscvs.fnal.gov rejected access to /cvs/minoscvs/rep1 for user anonymous cvs checkout: used empty password; try "cvs login" with a real password changed .cvspass ( saving old as .cvspass.20050420 ) in minoscvs@minoscvs, this had not effect. Probably need a pserver restart. The old password is still working :pserver:anonymous@minos01.fnal.gov:/cvs/minoscvs/rep1 A+.(=0BB& was :pserver:anonymous@minos01.fnal.gov:/cvs/minoscvs/rep1 Ay=0=h cat /cvs/minoscvs/rep1/CVSROOT/passwd anonymous:y/6MJprbDjVZ.:minoscvs In accordance with CDFCVS > cat run2/CVSROOT/passwd anonymous::cdfcvs Note that CDF has a passwd,v file ####### # SRM # ####### Manual test with short retry timeout and num, srmls --debug=true -retry_timeout=10 -retry_num=1 ${SPATH2} Stuck, try -retry_num=0, still stuck SRMClientV2 : srmLs, contacting service httpg://stkendca2a.fnal.gov:8443/srm/managerv2 ... eventually, times out with status 1 SRV1> date ; srmls --debug=true -retry_timeout=1000 -retry_num=1 ${SPATH2} ; date Mon Oct 1 09:57:06 CDT 2007 Storage Resource Manager (SRM) CP Client version 1.25 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory SRM Configuration: debug=true gsissl=true ... surl[0]=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/beam_data/2004-12 Mon Oct 01 09:57:06 CDT 2007: In SRMClient ExpectedName: host Mon Oct 01 09:57:06 CDT 2007: SRMClient(https,srm/managerv2,true) SRMClientV2 : user credentials are: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 SRMClientV2 : connecting to srm at httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : srmLs, contacting service httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : put: try # 0 failed with error SRMClientV2 : ; nested exception is: java.net.SocketTimeoutException: Read timed out SRMClientV2 : put: try again SRMClientV2 : sleeping for 1000 milliseconds before retrying SRMClientV2 : put: try # 1 failed with error SRMClientV2 : ; nested exception is: java.net.SocketTimeoutException: Read timed out Exception in thread "main" AxisFault faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException faultSubcode: faultString: java.net.SocketTimeoutException: Read timed out faultActor: faultNode: faultDetail: {http://xml.apache.org/axis/}stackTrace:java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) ... 
at gov.fnal.srm.util.SRMDispatcher.work(SRMDispatcher.java:721) at gov.fnal.srm.util.SRMDispatcher.main(SRMDispatcher.java:342) {http://xml.apache.org/axis/}hostname:fnpcsrv1.fnal.gov java.net.SocketTimeoutException: Read timed out at org.apache.axis.AxisFault.makeFault(AxisFault.java:101) at org.apache.axis.transport.http.HTTPSender.invoke(HTTPSender.java:154) ... at gov.fnal.srm.util.SRMDispatcher.main(SRMDispatcher.java:342) Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) ... at org.apache.axis.transport.http.HTTPSender.readHeadersFromSocket(HTTPSender.java:583) at org.apache.axis.transport.http.HTTPSender.invoke(HTTPSender.java:143) ... 14 more Mon Oct 1 10:17:11 CDT 2007 That's 20 minutes 11:47 podstvkv investigating 12:00 srm seems to be back. Just in time to catch the noon cron for mindata(mcimport) and minfarm(corral) Copies are working in corral. ============================================================================= 2007 09 29 ####### # SRM # ####### SRM offline, errors like these in LOG/2007-09/cedarfar.log Sun Sep 30 00:06:36 CDT 2007 WRITING to DCache 1 SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///F00039713_0000.all.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/sntp_data/2007-09 SRMClientV1 : put: try # 0 failed with error SRMClientV1 : java.net.SocketTimeoutException: Read timed out srm copy of at least one file failed or not completed Command failed! Server error message for [1]: "can't get pnfsId (not a pnfsfile)" (errno 666). Failed open file in the dCache. dc_stage fail : "can't get pnfsId (not a pnfsfile)" System error: Input/output error ============================================================================= 2007 09 28 ####### # CVS # ####### copied adduser from cdfcvs server, tested on wojcicki ####### # DBB # ####### From RAL, port 8080 is OK, but RL > curl dbb.fnal.gov:80 curl: (7) socket error: 110 ######## # MAIL # ######## for UUSER in bishai kafka wojcicki ; do finger ${UUSER}@fnal.fnal.gov | grep '@' ; done bishai@fsui02.fnal.gov kafka@fnalu.fnal.gov wojcicki@fnalu.fnal.gov Requested wojcicki SGWEG@SLAC.Stanford.EDU via helpdesk email cc: wojcicki done around 16:08 bishai is trying to connect to imap - done at about 15:45 ============================================================================= 2007 09 27 ######## # MAIL # ######## for UUSER in alberto bishai escobar kafka para wojcicki ; do finger ${UUSER}@fnal.fnal.gov | grep '@' ; done alberto@fnalu.fnal.gov bishai@fsui02.fnal.gov djensen@imapserver1.fnal.gov escobar@fsui02.fnal.gov kafka@fnalu.fnal.gov para@fsui02.fnal.gov wojcicki@fnalu.fnal.gov ######## # FARM # ######## /grid/data/minos filled up quota quota -s -v -g e875 2> /dev/null | grep -A 1 'fermigrid\-data' | grep -v fermi 400G* 0 400G 8139 0 0 SRV1> ./farmgsum Summarizing /grid/data/minos/*cat 1598 53160 nearcat 21 250 farcat 632 27411 mcnearcat 9 535 mcfarcat 0 1 mcfmockcat 742 278670 minfarm/WRITE 3002 360027 TOTAL files, GBytes ... srmcp fails showing : Last good copy was 13:04 on 26 Sep. 
srm client error: credential remaining lifetime is less then a minute SRM_CONF=/export/stage/minfarm/.srmconfig/config.xml /export/stage/minfarm/.grid/x509up_u1334 /export/stage/minfarm/.srmconfig/config.xml SRV1> grid-proxy-info -all -file /export/stage/minfarm/.grid/x509up_u1334 subject : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990/CN=687673363 issuer : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 identity : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : /export/stage/minfarm/.grid/x509up_u1334 timeleft : 0:00:00 SRV1> cd /export/stage/minfarm/.grid SRV1> mv x509up_u1334 x509up_u1334.20061220 SRV1> cp /home/rubin/.grid/x509up_u1334 x509up_u1334 SRV1> chmod 700 x509up_u1334 Cleared off 8.6 GB from DUP cd /grid/data/minfarm cp -vax DUP /export/stage/minfarm/DUP SRV1> du -sm /export/stage/minfarm/DUP 8986 /export/stage/minfarm/DUP SRV1> diff -r DUP /export/stage/minfarm/DUP SRV1> rm DUP/*.root rm: remove write-protected regular file `DUP/c10000607_0003.cand.cedar.root'? y SRV1> quota -s -v -g numi 2> /dev/null | grep -A 1 'fermigrid\-data' | grep -v fermi 395G 0 400G 8193 0 0 SRV1> ./roundup -r cedar_phy mcnear SRV1> ./roundup -r cedar_phy_bhcurv mcnear these two ran in parallel. c_p was doing srmcp while c_p_b was doing hadd 19:00 392G used ./roundup -w -r cedar_phy_bhcurv mcnear 365G Not so outstanding, but cedar_phy mcnear is still writing. Will start a purge of that in an hour or so. GRRRRRRRRR - writes for daikon_03 cedar_phy mcnear failed. No such directory. ./pnfsdirs near cedar_phy daikon_03 L010185N write Group is e875, protections 775, OK 22:01 ./roundup -w -r cedar_phy mcnear 233G ./roundup -w -r cedar_phy_bhcurv mcnear 216G ./roundup -w -r cedar_phy mcnear 200G Midnight corral should clear the remaining 120 GB, which are on the way to tape already. Suggested that Howie restart the farm, around 22:30. ============================================================================= 2007 09 26 set minos:beam for early volumes in for VOL in d188 d239 d266 d268 d269 d270 ; do fs listacl $MINOS_DATA/${VOL} ; done ######### # BATCH # ######### New GRID nodes are ( Req PO 577128 ) D0 L3 30 KEK 17 CDF 138 D0 205 GPFarm 47 Minos 8 MiniBoone 8 Dell PowerEdge 1950, 2 x Quad-Core Intel Xeon 2.66 Ghz, 16GB RAM,500GB SATAu HD, 3 year NBD Warranty, fully integrated, burned-in, tested and installed. 22 Compute Servers per Rack with balance in last Rack. Survey of CLUBS node speeds Normalize to 3 GHz minos26, tiny rating 1033 . flxb tiny 10 416 11 414 416 13 419 419 17 1055 18 1063 19 1067 20 1063 21 1066 22 1064 23 1071 25 1060 26 1068 28 1062 30 1066 31 900 897 899 32 901 901 901 33 896 901 901 34 900 898 35 987 989 985 988 Check parallel capacity Can log into 10 24 30-35 cd Linux/tiny for N in 1 2 3 4 ; do ( time ./tiny & ) ; done flxb10 2 real 0m18.109s user 0m18.100s 4 real 0m36.142s user 0m18.050s flxb24 2 4 flxb31 2 real 0m7.570s user 0m7.559s 4 real 0m15.153s user 0m7.536s flxb35 2 real 0m6.895s user 0m6.892s 4 real 0m13.838s user 0m6.919s For Minos nodes, base time is 7 seconds. Is hyperthreading helping ? 
minos01 real 0m11.720s user 0m11.715s minos26 real 0m11.606s user 0m11.535s Summary : All CLUBS/FLXB nodes act as 2 core See MHz summary 2007 09 11 AGE NODES GHZ/core cores GHz Ancient 10/11/13 1.5 6 9 Old 16-30 3 30 100 Mid 31-35 2.7 8 22 New 35 3 2 6 CLUBS 137 GHz Cluster 150 GHz ( 50 * 3 ) ( LSF 75 ( 25 * 3 ) ) New 170 GHz ( 64 * 2.66 ) ########### # STORAGE # ########### In 284 AFS disk volumes, we have DIRS=`ls` SIZES=`for DIR in ${DIRS} ; do fs listquota ${DIR} | grep 0000 | tr -s ' ' | cut -f 2 -d ' '; done` SIZEU=`for DIR in ${DIRS} ; do fs listquota ${DIR} | grep 0000 | tr -s ' ' | cut -f 3 -d ' '; done` printf "${SIZEU}\n" | ./count 10457000000 9068980997 We use 9 of 10.5 TBytes of capacity Draft document is going into Minos Doc 3601 ============================================================================= 2007 09 25 ######## # FARM # ######## Rubin reports big backlogs writing. I think it's all us : MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03/CosmicMu/cand_data -type f | wc -l 11560 ############ # SADDRECO # ############ Need to declare daikon_01 and daikon_03 reco in dev, for final tests. near cedar 0 1 cedar_phy 0 3 cedar_phy_bhcurv 3 4 far cedar 0 1 2 cedar_phy 0 2 cedar_phy_bhcurv - none For now, let's do near cedar_phy daikon_03 Log into fnpcsrv1 cd scripts tokens AFSK=/afs/fnal.gov/files/home/room1/kreymer/minos/log/saddreco PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.dev:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9000 export SAM_ORACLE_CONNECT='samdbs/pass' RELS=cedar_phy MCRL=daikon_03 MODS=/pnfs/minos/mcout_data/${RELS}/near/${MCRL} DIRS=`ls $MODS` echo $DIRS CosmicMu tested with AFSS/saddreco.20070913 near ${RELS} ${DIR} verify -m ${MCRL} -b 1 -s sntp_data/205 -v Ran single file declaration for DIR in ${DIRS} ; do AFSS/saddreco.20070913 near ${RELS} ${DIR} declare -m ${MCRL} -b 1 \ 2>&1 | tee -a /tmp/saddreco.${MCRL}.${DIR}.declare.log done Ran the rest for DIR in ${DIRS} ; do AFSS/saddreco.20070913 near ${RELS} ${DIR} declare -m ${MCRL} \ 2>&1 | tee -a /tmp/saddreco.${MCRL}.${DIR}.declare.log done STARTED Tue Sep 25 23:47:49 2007 FINISHED Wed Sep 26 00:05:04 2007 looks clean in the log, as follows : grep -v declared /tmp/saddreco.${MCRL}.${DIR}.declare.log | less ============================================================================= 2007 09 24 ########## # SADDMC # ########## saddmc.20070924 ln -sf saddmc.20070924 saddmc # was saddmc.20070608 export SAM_ORACLE_CONNECT="samdbs/..." ./saddmc --declare -n 1 ${VEG} near/${VEG}/${DIR}/504 sam get metadata --file=n13035044_0008_L010185N_D03.reroot.root for VEG in daikon_01 daikon_03 ; do for DIR in `ls /pnfs/minos/mcin_data/near/${VEG}` ; do echo ${VEG} ${DIR} #./saddmc -v --verify -n 1 ${VEG} near/${VEG}/${DIR}/* ./saddmc --declare ${VEG} near/${VEG}/${DIR}/* done ; done 2>&1 | tee -a ${HOME}/minos/log/saddmc/${VEG}.log ######## # DISK # ######## Per conversation with Ling 8018, The array is rebuilt with 2 hot spare disks. He was told by Jason Allen not to do the hot-spare disk tests on our array. ( 13:53) He will check again with Jason, and get back to me. Spoke to Jason, authorized delaying as needed to do the 1/2/3 disk rebuild time and performance tests. It might not be available by Thursday . I authorized this anyway. We really need to know that our particular hot spare disks work. 
######## # FARM # ######## Need to remove READ files for /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03/CosmicMu files which Howie is removing and reprocessing today. Duplicates were detected in the noon run, will have to clean up. Per discussion with Howie , remove ^n1004...cedar_phy_bhcur which should all be CosmicMu, with reversed field. SRV1> ls | grep ^n1004 | grep cedar_phy_bhcurv | wc -l 228 BADRS=`ls | grep ^n1004 | grep cedar_phy_bhcurv` printf "${BADRS}\n" | wc -l 228 for FILE in ${BADRS} ; do mv ${FILE} ../BADREAD/${FILE} ; done Now cleaning up an aborted start from this morning, MINOS26 > ls /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03/CosmicMu/cand_data/* | grep ^n1004 | wc -l 972 MINOS26 > find \ /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03/CosmicMu/cand_data/ \ -type f -name n1004\* | wc -l 972 FILES=`find \ /pnfs/minos/mcout_data/cedar_phy_bhcurv/near/daikon_03/CosmicMu/cand_data/ \ -type f -name n1004\*` for FILE in ${FILES} ; do usleep 200000 FPA=`dirname ${FILE}` FNA=`basename ${FILE}` ( cd ${FPA} ; L4=`cat ".(use)(4)(${FNA})"` if [ -n "${L4}" ] ; then VOL=`printf ${L4}\n" | head -1"` printf "${VOL} ${FNA}\n" else printf "pend ${FNA}\n" fi ; ) done ####### # SAM # ####### ./sam_test_py minos prd zeval-far-cand-physicsm-spill-r1_16 This tests a 96 file project Note that the Universe qualifier is not working, you keep what you have before running s_t_p Per hartnell request, here's a hack to count up progress : export SAM_STATION=minos export SAM_PROJECT=yourprojectname if PROJDUMP=`sam dump project -s --retryMaxCount=2` ; then NEED=`printf "${PROJDUMP}\n" | grep 'unbuffered yet' | wc -l` HAVE=`printf "${PROJDUMP}\n" | grep 'delivered on' | wc -l` (( TOT = NEED + HAVE )) printf " Delivered ${HAVE}/${TOT} Need ${NEED}\n" else printf " The project is unavailable ( unstarted or completed )\n" fi ============================================================================= 2007 09 23 Sunday ####### # SAM # ####### MCPARMS.py - edited to add 'bfield' : 'string' , replaces split for post-carrot MC 'vtxregion' : 'string' , replaces volume for all, per discussions setup sam -q dev export SAM_ORACLE_CONNECT="samdbs/" samadmin add param suite --param-file=MCPARAMS.py Param Category 'mc': ... paramType 'bfield': registered as type 'string' (new dimension 'mc.bfield') ... paramType 'volume': (no change) ... paramType 'beam': (no change) ... paramType 'split': (no change) ... paramType 'vtxregion': registered as type 'string' (new dimension 'mc.vtxregion') ... paramType 'release': (no change) ... 
paramType 'flavor': (no change) MINOS26 > sam get registered parameters Params({ 'mc' : CaseInsensitiveDictionary({ 'beam' : DataType('string'), 'bfield' : DataType('string'), 'flavor' : DataType('string'), 'release' : DataType('string'), 'split' : DataType('string'), 'volume' : DataType('string'), 'vtxregion' : DataType('string'), })}) ########## # SADDMC # ########## for VEG in daikon_01 daikon_02 daikon_03 daikon_04; do for UNI in dev int prd ; do setup sam -q ${UNI} export SAM_ORACLE_CONNECT samadmin add application family --appFamily=simulation --appName=gminos --appVersion=${VEG} export -n SAM_ORACLE_CONNECT done done ########## # SADDMC # ########## saddmc.20070924 Removed all RECOREL support, now that saddreco works for MC Removed enupdate, no longer used Changed volume to vtxregion New FILECH4 variable for split vs bfield Set to split for recorel[0] < d ./saddmc.20070924 --verify -n 1 daikon_01 near/daikon_01/L010185N/140 -v ./saddmc.20070924 --verify daikon_01 near/daikon_01/L010185N/140 ============================================================================= 2007 09 21 ####### # LSF # ####### Rustem has submitted about 12K jobs to the 4hr queue, like JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME 425482 rustem RUN 4hr flxi06.fnal flxb10.fnal *1006_0007 Sep 21 17:41 bjobs -u rustem | wc -l ~18:15 11410 18:37 11376 18:39 11367 13:00 954 ############ # MCIMPORT # ############ XFILE=/pnfs/minos/mcin_data/near/daikon_03/CosmicMu/218/n10032185_0006_CosmicMu_D03.reroot.root as reported 9/19, too short, apparently damaged. mv ${XFILE} /pnfs/minos/BAD/BAD_n10032185_0006_CosmicMu_D03.reroot.root ######## # GRID # ######## Authorized certs in minos group ( not minossoft or production ) https://voms.fnal.gov:8443/vomrs/vo-fermilab/vomrs brebel - rebel - FNAL only habig jdejong - dejong petyt rhatcher existed rustem tinti And other accounts that exist on fnpcsrv1 backhouse kordosky messier cavanaugh -added FNAL,had DOE gallagher sanchez Tingjun Yang George Irwin John Urheim Alexandre Sousa Joshua Boehm ######## # GRID # ######## HOWTO.fermigrid updated for vdt setup kx509 voms-proxy-init Tested kerberos based grid cert per chadwick email 2007 07 18 fgusers ######### # ADMIN # ######### 10:00 restarted cronjobs and NOCAT ####### # LSF # ####### Continued tests of tcsh scripts, success with ( unset PRODUCTS SETUPS_DIR SETUP_UPS INFO_DIR UPS_DIR SETUP_SHRC SETUP_INFO SETUP_LOGIN ; bsub -R "linux26" -q minos test_sub.scr ) based on a scan of all SETUP_ * environment variable at submission motivated by shrc messages when I unset PRODUCTS SETUPS_DIR SETUP_UPS UPS_DIR seem to need UPS_DIR ( unset UPS_DIR SETUPS_DIR SETUP_UPS SETUP_SHRC ; bsub -R "linux26" -q minos test_sub.scr ) ( unset UPS_DIR SETUPS_DIR SETUP_UPS SETUP_SHRC ; bsub -R "linux26" -q minos test_lsf_csh ) ####### # LSF # ####### 08:58 31 32 34 are updated 33 35 active, status set to closed 10:30 35 is updated, awaiting job on 33 10:52 updates are complete for NODE in flxb10 flxb24 flxb30 flxb31 flxb32 flxb33 flxb34 flxb35 ; do printf "\n${NODE} `date`\n"; ssh -a ${NODE} "grep OPTION /etc/sysconfig/afs" ; done ########### # BLUEARC # ########### Per our discussions verbally yesterday, here is what I understand of our deployment plan : Ling will be doing the following in preparation : 1) Add two hot-spare disks ( 40 disk in use, to hot spares ) 2) Verify and measure the raid rebuild time required for failover for 1 disk failure 2 disk failure 3 disk failure ( this rebuild will fail, see what it look like ) 3) Make 
the full array available to BlueArc, with quotas as specified below. The split between /minos/scratch and /minos/data is to be dynamic, and handled via quotas. 4) Mount /minos/scratch and /minos/data on the Minos Cluster and servers as specified below. Arthur will specify the list of client nodes, and the initial directory structures. 1) Clients - please mount /minos/scratch and /minos/data on all Minos Cluster and server nodes. minos01 through minos26 minos-mysql1 minos-sam01 minos-sam02 minos-sam03 2) /minos/data - roughly 20 TBytes or 2/3 of the disk capacity Let's start with /minos/data/mindata owned by mindata, group e875, group writeable. 3) /minos/scratch - roughly 10 TBytes Directories for each of the 186 minos users on the Minos Cluster ypcat passwd | cut -f 1 -d : | sort Each user gets 100 GB default quota. This is an oversubscription, but most of these are not active. This should be enough for initial testing. Please provide a means ( sudo ? ) for kreymer, buckley, rhatcher and urish to adjust quotas in /minos/scratch. After testing and discussion, we will probably move all the users' files from existing nodes' scratch areas such as /local/scratch01/ to /minos/scratch//minos01 directories. ============================================================================= 2007 09 20 ####### # AFS # LSF ####### Rustem reports problems on flxb31-35 similar to those during the Cluster upgrade loon: error while loading shared libraries: libEG.so: cannot open shared object file: No such file or directory Checking Cluster and FNALU and batch nodes : MIN > for NODE in $UNODES ; do printf "\n${NODE} `date`\n"; ssh -a ${NODE} "grep OPTIONS /etc/sysconfig/afs" ; done on all but flxi02, OPTIONS=$MEDIUM flxi07 has OPTIONS=AUTOMATIC minos* has OPTIONS=$LARGE for NODE in $BNODES ; do printf "\n${NODE} `date`\n"; ssh -a flxb${NODE} "grep OPTIONS /etc/sysconfig/afs" ; done 10 24 30 31 OPTIONS=$MEDIUM 31 32 33 34 35 OPTIONS=AUTOMATIC ######### # ADMIN # ######### Preparing for all-day shutdown later today predator MINOS26 > echo 'crontab -r' | at 03:30 mcimport M26 > echo 'crontab -r' | at 03:30 job 21 at 2007-05-24 03:30 corral SRV1> echo 'mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT' \ | at 03:30 ######## # GRID # ######## Please create accounts on fnpcsrv1 and fngp-osg for the following users, so that they can start submitting jobs to Fermigrid : brebel NC habig Run Coord jdejong Calib petyt rhatcher Admin rustem Reco tinti Reco The fnpcsrv1 accounts are needed for access to AFS. The fngp-osg accounts are needed for testing without AFS. ============================================================================= 2007 09 19 ####### # AFS # ####### Per koskinen, requested volumes afs/fnal.gov/files/data/minos/d268 afs/fnal.gov/files/data/minos/d269 afs/fnal.gov/files/data/minos/d270 cloned from d266 for beam systematics work system:administrators rlidwka minos:admin rlidwka minos:beam rlidwka minos rl ######### # MYSQL # ######### Finally doing monthly backups, now that brebel load has dropped Local copy rates are still miserable, about 4 MBytes/second with cp -av ... Will slug it through, then try dd if= of= bs=10M Times for big copies were DCS_HV.MYD real 41m59.161s PULSERGAIN.MYD real 19m9.964s the rest real 67m43.457s md5sum real 22m43.163s gzip real 79m54.981s oops, minos-sam03 kreymer account moved to AFS. needed to adjust REPATH to /home/kreymer/...
scp: real 20m8.355s BINLOGS real 3m37.674s ####### # SSH # ####### curl http://www-numi.fnal.gov/computing/dh/sshkrb5.tgz -o sshkrb5.tgz The original sl3 shared libraries were not correctly named. My tests on csf.rl.ac.uk were falling back to system libraries. Needed various symlinks for kinit/klist : ln -s libkrb5.so.3 libkrb5.so ln -s libcrypto.so.4 libcrypto.so ln -s libcom_err.so.3 libcom_err.so With these symlinks, all of kinit/klist/ssh/scp are fairly clean, using only the same 3 glibc libraries : RL > ldd ./klist | grep -v FOO libc.so.6 => /lib/tls/libc.so.6 (0x00111000) libdl.so.2 => /lib/libdl.so.2 (0x0035d000) /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x00f8b000) Cleaned up sshlib under SL3 by removing unused ld-linux.so.2 tmp/libdl.so.2 Added kkinit and kklist scripts, adjusted the aliases in setupssh.[c]sh changed kinit/klist aliases to kkinit/kklist for consistency ############ # MCIMPORT # ############ /pnfs/minos/mcin_data/near/daikon_03/CosmicMu/218/n10032185_0006_CosmicMu_D03.reroot.root is reported by howie to be unreadable too short ( 31645696 vs usual 45000000 size ) ########## # DCACHE # ########## /pnfs/minos/fardet_data/2007-09/F00039679_0019.mdaq.root stuck in genpy since 16:06:13 UTC 2007 ============================================================================= 2007 09 18 ####### # SSH # ####### For SL4, I have just put together http://www-numi.fnal.gov/computing/dh/sshkrb5_sl4.txt For SL3, I have renamed the previous files http://www-numi.fnal.gov/computing/dh/sshkrb5_sl3.txt SL4 requires different shared libraries, to avoid getting message You don't exist, go away! ####### # UPS # ####### Per reports from users failing to run LSF jobs, surveying usage of /fnal/ups versus /local/ups We should have installed upsupdbootstrap-local In all cases there is a symlink from /usr/local/etc/setups.* /local/ups minos01-minos10 minos12-minos26 minos25 is a copy, not symlink flxi02 flxi06 flxb10 flxb24 flxb30 /fnal/ups minos11 flxi04 flxi05 flxi07 flxb31-flxb35 /afs/fnal.gov/ups flxi03 Cannot log into flxb11 flxb13 flxb16-flxb23 flxb25-flxb29 Correction, scanning Cluster for /u/l/e/setups links, 11 -> /fnal/ups rest -> /local/ups 25 is not a symlink, but a direct copy of the files. ####### # LSF # ####### lhsu is having trouble in batch. Jobs that try to run #!/bin/tcsh exit cp ~llhsu/scripts/batch/test_sub.scr . bsub -R "linux26" -q minos test_sub.scr Exited for me, but with this message : /local/ups/prd/ups/v4_7_2/Linux-2/bin/ups: Command not found. But I can run a similar test job in bash, bsub -R "linux26" -q minos test_lsf ######## # GRID # ######## kreymer@minos26 : mkdir /grid/data/minos/users chmod 775 /grid/data/minos/users MINOS26 > mkdir /grid/data/minos/users/boehm MINOS26 > mkdir /grid/data/minos/users/brebel MINOS26 > mkdir /grid/data/minos/users/habig MINOS26 > mkdir /grid/data/minos/users/jdejong MINOS26 > mkdir /grid/data/minos/users/kreymer MINOS26 > mkdir /grid/data/minos/users/petyt MINOS26 > mkdir /grid/data/minos/users/rustem MINOS26 > mkdir /grid/data/minos/users/scavan MINOS26 > mkdir /grid/data/minos/users/tinti chmod 775 /grid/data/minos/users/* boehm brebel habig jdejong kreymer petyt rustem scavan tinti ============================================================================= 2007 09 17 ####### # SSH # ####### Testing portable access at RAL At Fermilab, in computing/dh, did tar cvzf sshkrb5.tar -C /afs/fnal.gov/files/home/room3/hartnell/programs/sshkrb5 . 
Then at csf.rl.ac.uk mkdir -p ${HOME}/programs/sshkrb5 cd ${HOME}/programs/sshkrb5 curl http://www-numi.fnal.gov/computing/dh/sshkrb5.tgz -o sshkrb5.tgz tar xzvf sshkrb5.tgz . setupkssh.sh kinit kreymer@FNAL.GOV /usr/kerberos/bin/klist -f kssh -l kreymer minos26.fnal.gov pwd Updated sshkrb5.tgz on web server to include setupkssh.sh adjusted for bash and to make alias kkinit instead of kinit, using /usr/kerberos/bin/kinit Some data from sjc attempts, Sep 17 14:14:39 minos26 sshd[21042]: error: PAM: Authentication failure for sjc from nova.physics.wm.edu Sep 17 14:14:39 minos26 sshd[21042]: Connection closed by ::ffff:128.239.52.85 grep -v 'session opened for user' /var/log/messages | less Sep 17 14:14:39 minos26 sshd: pam_krb5[21043]: authentication fails for 'sjc' (sjc@FNAL.GOV): Authentication service cannot retrieve authentication info. (Cannot contact any KDC for requested realm) ######## # FARM # ######## Further permission problems under /pnfs/minos/mcout_data/cedar_phy_bhcurv Needed chmod 775 cedar_phy_bhcurv MINOS26 > stat near File: `near' Size: 512 Blocks: 1 IO Block: 512 directory Device: 14h/20d Inode: 254702744 Links: 1 Access: (0755/drwxr-xr-x) Uid: ( 1060/ kreymer) Gid: ( 5111/ e875) Access: 2007-09-14 15:42:01.000000000 -0500 Modify: 2007-09-14 15:42:01.000000000 -0500 Change: 2007-08-30 12:58:40.000000000 -0500 chmod 775 cedar_phy_bhcurv/near cd near MINOS26 > stat daikon_04 File: `daikon_04' Size: 512 Blocks: 1 IO Block: 512 directory Device: 14h/20d Inode: 255895248 Links: 1 Access: (0755/drwxr-xr-x) Uid: ( 1060/ kreymer) Gid: ( 5111/ e875) Access: 2007-09-14 15:42:01.000000000 -0500 Modify: 2007-09-14 15:42:01.000000000 -0500 Change: 2007-09-14 15:42:01.000000000 -0500 The dakkon_04/L010185N and .../*_data directories have proper ownership and groups set. ============================================================================= 2007 09 15 Sat ####### # DAQ # ####### Added habig root access to minos-beamdata # also had to add myself, getting access via new password minos-rc minos-evd minos-acnet Had previously done minos-om ######## # FARM # ######## Did chgrp -R e875 /pnfs/minos/mcout_data/cedar_phy_bhcurv per rubin request. ============================================================================= 2007 09 14 ############# # CHECKLIST # ############# VO8597 is available again, went NOACCESS to be copied on 2007 09 12 ######## # SADD # ######## Corrected names of older versions for MD in 0418 0420 0503 0513 0516 0520 0624 0707 0711 ; do mv sadd.${MD} sadd.2005${MD} ; done ######## # FARM # ######## ./pnfsdirs near cedar_phy_bhcurv daikon_04 L010185N write ############ # SADDRECO # ############ Corrected name of file, to reflect current date. mv saddreco.20070707 saddreco.20070913 ######### # MYSQL # ######### Still have a heavy load from Tuesday's brebel jobs, MINOS26 > bjobs -u brebel | grep flxb | wc -l 19 MINOS26 > bjobs -u brebel -l 405371 Job <405371>, User , Project , Status , Queue <1day>, Com mand Tue Sep 11 16:39:55: Submitted from host , CWD , Requested Resources ; Tue Sep 11 17:34:02: Started on , Execution Home , Execution CWD ; Fri Sep 14 09:24:16: Resource usage collected. The CPU time used is 5978 seconds. MEM: 248 Mbytes; SWAP: 399 Mbytes; NTHREAD: 5 PGID: 6520; PIDs: 6520 6544 6545 7215 7116 ... These are 1 day jobs, CPU limit normalized to flxi06 (CPUF 1390.00) Most nodes have CPUF 1200.00 ######### # CDOPS # ######### Requested mailing list, to archive my summaries. 
============================================================================= 2007 09 13 ########## # CONDOR # ########## 09:00 Steve Timm found a typo in a config file. Nodes are now registering. All workers need a reconfigure and restart. ####### # AFS # ####### Created minos:reco group NEWGROUP=reco pts creategroup -name kreymer:${NEWGROUP} group kreymer:reco has id -2481 NEWUSERS='boehm masaki jmusser naples rustem sjc sujeewa tinti' for GUSER in ${NEWUSERS} ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts setfields kreymer:${NEWGROUP} -access SOMar pts membership kreymer:${NEWGROUP} pts examine kreymer:${NEWGROUP} Name: kreymer:nonap, id: -1941, owner: kreymer, creator: kreymer, membership: 5, flags: SOMar, group quota: 0. pts chown kreymer:${NEWGROUP} minos:admin pts examine minos:${NEWGROUP} pts membership minos:${NEWGROUP} ####### # AFS # ####### may need to go back and pts chown minos:GROUP minos:admin ( change ownership ) ####### # AFS # ####### Requesting volume for rustem's reco studies using the new group, per HOWTO.afs Size 50000 not backed up Volume /afs/fnal.gov/files/data/minos/d267 ACL's system:administrators rlidwka system:anyuser rl minos:admin rlidwka minos rl minos:reco rlidwka ####### # AFS # ####### DVOLS=`ls -d d??? | sort` for VOL in $DVOLS ; do echo $VOL; fs listacl ${MINOS_DATA}/${VOL} | grep -v system | grep rlidw; done 2>&1 | less ######## # DATA # ######## Checking reported segfaults reading /pnfs/minos/reco_near/cedar_phy/sntp_data/2005-11/N00009259_0000.spill.sntp.cedar_phy.0.root 2131512181 May 15 Pretty close to SLIM=2147483647 # 2^32 - 1 DFILE=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy/sntp_data/2005-11/N00009259_0000.spill.sntp.cedar_phy.0.root DFILE1=dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy/sntp_data/2005-11/N00009259_0022.spill.sntp.cedar_phy.0.root setup_minos -r R1.24.2 loon -bq firstlast.C ${DFILE} does not crash, but does not find anything ( designed to read raw data ) in /local/scratch26/kreymer/DATA, tried direct hadd from dcache, too slow MINOS26 > dccp $DFILE1 . 201063216 bytes in 4 seconds (49087.70 KB/sec) MINOS26 > dccp $DFILE . 2131512181 bytes in 53 seconds (39274.62 KB/sec) MINOS26 > hadd testhadd.root N00009259_0000.spill.sntp.cedar_phy.0.root N00009259_0022.spill.sntp.cedar_phy.0.root Target file: testhadd.root MINOS26 > ls -l N00009259* testhadd.root -rw-r--r-- 1 kreymer 1525 2131512181 Sep 13 14:20 N00009259_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 kreymer 1525 201063216 Sep 13 14:18 N00009259_0022.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 kreymer 1525 2332536940 Sep 13 14:22 testhadd.root Tried this also on minos11, which might have 32 bit limit size is 2332535450 ############ # SADDRECO # ############ 08:55 start a fat declaration ! 
AFSS/saddreco.20070707 near cedar_phy L010185N declare -m daikon_00 2>&1 | tee /var/tmp/saddreco.declare.log STARTED Thu Sep 13 13:57:40 2007 FINISHED Thu Sep 13 14:53:01 2007 grep -v declared /var/tmp/saddreco.declare.log Now for the rest : MODS=/pnfs/minos/mcout_data/cedar_phy/near/daikon_00 DIRS=`ls $MODS` MINOS26 > echo $DIRS L010000N L010170N L010185N L010200N L100200N L150200N L250200N MINOS26 > for DIR in ${DIRS} ; do echo ${DIR} ; ls -R ${MOD}/${DIR} | wc -l ; done L010000N 2003 L010170N 165 L010185N 8275 L010200N 163 L100200N 704 L150200N 290 L250200N 1356 11:11 for DIR in L010000N L010170N L010200N L100200N L150200N L250200N ; do AFSS/saddreco.20070707 near cedar_phy ${DIR} declare -m daikon_00 \ 2>&1 | tee /var/tmp/saddreco.${DIR}.declare.log done SAMDIM=' RUN_TYPE physics% and MC.BEAM L010185N and DATA_TIER sntp-near and VERSION cedar.phy ' OOPS, the parameters are not being written to SAM. needed to add this to copymeta, as was done in saddmc Need to remove all these files ! sam undeclare n13011010_0000_L010200N_D00.sntp.cedar_phy.root reran tests, looked at the metadata. Readjusted fileType to importedSimulated in enupdate Declared this one file, after removal , SAMDIM=" RUN_TYPE physics% and VERSION cedar.phy and FULL_PATH like /pnfs/minos/mcout_data/cedar_phy/near/daikon_00% " MINOS26 > sam list files --dim="${SAMDIM}" --summaryonly File Count: 12217 Average File Size: 590.79MB Total File Size: 6.88TB Total Event Count: 8785200 MINOS26 > ./samlocate "${SAMDIM}" | wc -l 12217 real 4m22.920s user 0m31.905s sys 0m1.878s MINOS26 > ./samundeclare -b 1 "${SAMDIM}" -v MINOS26 > ./samundeclare "${SAMDIM}" -b 10 MINOS26 > ./samundeclare "${SAMDIM}" -b 100 real 0m27.710s MINOS26 > ./samundeclare "${SAMDIM}" -b 1 BAIL after 1 Found 12104 files undeclared n13011034_0010_L250200N_D00.cand.cedar_phy.root MINOS26 > ./samundeclare "${SAMDIM}" -b 1 BAIL after 1 Found 12103 files undeclared n13011034_0004_L250200N_D00.cand.cedar_phy.root This looks pretty good, let's go for it all ! 
MINOS26 > date ; time ./samundeclare "${SAMDIM}" Thu Sep 13 18:30:24 CDT 2007 real 27m23.749s user 0m29.169s sys 0m1.742s MINOS26 > ./samlocate "${SAMDIM}" OK, now we can redeclare everything MODS=/pnfs/minos/mcout_data/cedar_phy/near/daikon_00 DIRS=`ls $MODS` DIR=L010000N AFSS/saddreco.20070707 near cedar_phy ${DIR} verify -m daikon_00 -b 1 22:20 for DIR in ${DIRS} ; do AFSS/saddreco.20070707 near cedar_phy ${DIR} declare -m daikon_00 \ 2>&1 | tee /var/tmp/saddreco.${DIR}.declare.log done for DIR in ${DIRS} ; do grep -v declared /var/tmp/saddreco.${DIR}.declare.log ; done | less Declaring to SAM dev near cedar_phy L010000N declare STARTED Fri Sep 14 03:20:18 2007 FINISHED Fri Sep 14 03:36:26 2007 Declaring to SAM dev near cedar_phy L010170N declare STARTED Fri Sep 14 03:36:28 2007 FINISHED Fri Sep 14 03:37:38 2007 Declaring to SAM dev near cedar_phy L010185N declare STARTED Fri Sep 14 03:37:40 2007 FINISHED Fri Sep 14 04:48:10 2007 Declaring to SAM dev near cedar_phy L010200N declare STARTED Fri Sep 14 04:48:12 2007 FINISHED Fri Sep 14 04:49:23 2007 Declaring to SAM dev near cedar_phy L100200N declare STARTED Fri Sep 14 04:49:25 2007 FINISHED Fri Sep 14 04:54:50 2007 Declaring to SAM dev near cedar_phy L150200N declare STARTED Fri Sep 14 04:54:52 2007 FINISHED Fri Sep 14 04:56:59 2007 Declaring to SAM dev near cedar_phy L250200N declare STARTED Fri Sep 14 04:57:01 2007 FINISHED Fri Sep 14 05:08:07 2007 looks good, informed hartnell and m_s_d sam list files --summaryOnly \ --dim="RUN_TYPE 'physics%' \ and MC.RELEASE 'daikon_00' \ and VERSION 'cedar.phy'" File Count: 12217 Average File Size: 590.79MB Total File Size: 6.88TB Total Event Count: 8785200 sam list files --summaryOnly \ --dim="RUN_TYPE physics% \ and MC.BEAM='L010185N' \ and VERSION='cedar.phy'" File Count: 7851 Average File Size: 553.83MB Total File Size: 4.15TB Total Event Count: 5641200 MINOS26 > find /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N -type f | wc -l 7851 sam list files --summaryOnly \ --dim="RUN_TYPE physics% \ and MC.BEAM='L010185N' \ and DATA_TIER sntp-near \ and VERSION='cedar.phy'" File Count: 796 Average File Size: 555.35MB Total File Size: 431.70GB Total Event Count: 2819200 ============================================================================= 2007 09 12 ############ # PREDATOR # ############ Removed damaged ( full disk ) F00039589_0000.sam.py under /afs/fnal.gov/files/home/room1/kreymer/minos/GDAT/fardet_data/2007-09 13:09 CDT rm F00039589_0000.sam.py last good FD declare was F00039586_0022.mdaq.root at 16:08:56 2007 UTC STARTING Mon Sep 10 18:06:17 UTC 2007 Treating 472 files Scanning 4 files F00039586_0023.mdaq.root Mon Sep 10 18:06:28 UTC 2007 F00039587_0000.mdaq.root Mon Sep 10 18:07:54 UTC 2007 F00039588_0000.mdaq.root Mon Sep 10 18:08:29 UTC 2007 F00039589_0000.mdaq.root Mon Sep 10 18:09:13 UTC 2007 ? FINISHED Mon Sep 10 18:09:58 UTC 2007 Try manually MINOS26 > cds MINOS26 > HOSTNA=`hostname -s | cut -c 1-5` MINOS26 > HOSTNU=`hostname -s | cut -c 6-` MINOS26 > LOGPAT=/local/scratch${HOSTNU}/kreymer/log MINOS26 > setup sam -q prd MINOS26 > DET=fardet_data MINOS26 > MONTH=2007-09 ./sadd ${DET}/${MONTH} declare 2>&1 | tee -a ${LOGPAT}/samadd/${DET}/${MONTH}.log failed, backed off and did a verify, that looks OK, after actually deleting the damaged F00039589_0000.sam.py Will let the next predator cycle clean up at 15:06 CDT. That worked OK ! 
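A follow-up thought : after any disk-full incident, a quick scan for empty or syntactically broken .sam.py files would catch this sooner. A sketch ; the compile test is just one plausible validity check :
GDAT=/afs/fnal.gov/files/home/room1/kreymer/minos/GDAT
find ${GDAT}/fardet_data/2007-09 -name '*.sam.py' -size 0
for PY in ${GDAT}/fardet_data/2007-09/*.sam.py ; do
  python -c "compile(open('${PY}').read(),'${PY}','exec')" > /dev/null 2>&1 \
    || echo "BAD ${PY}"
done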
############ # SADDRECO # ############ Resuming work Cleaned up gnu_getops handling May want to test on /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010200N/sntp_data/100 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010200N/sntp_data/100 In case it is needed, the full lowest directory list is find /pnfs/minos/mcout_data/cedar_phy/near/daikon_00 -type d -name \?\?\? Testing with PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.dev:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9000 Try like AFSS/saddreco.20070707 near cedar_phy L010200N verify 3 -m daikon_00 AFSS/saddreco.20070707 near cedar_phy 2007-03 verify 3 recodirs looks OK, after a tweak. For MC OK - getopt OK - set MCRELEASE OK - RECODIR modified OK - .mdaq.root modified OK - bypass pass/obsolete calculation for MC OK - veto os.rename of index file for SAMQ not prd OK - sam add tape location check SAM_ORACLE_CONNECT add the location and report OK - correct fake last event number for MC Run validation of a fat directory AFSS/saddreco.20070707 near cedar_phy L010185N verify -m daikon_00 2>&1 | tee /var/tmp/saddreco.verify.log STARTED Thu Sep 13 05:23:49 2007 FINISHED Thu Sep 13 06:12:14 2007 grep -v verified /var/tmp/saddreco.verify.log Looks clean ! ############# # CHECKLIST # ############# VO8597 0.15GB (NOTALLOWED 0911-0917 full 0117-1142) CD-9940B minos.reco_near_R1_18_2.cpio_odc Being copied to new media 091107 mysql1 load to about 20 ramp up 17:00 to 18:00 yesterday ######### # MYSQL # ######### Load average went to about 20 yesterday, ramped up 14:30 to 18:00 backups of mysql will have to wait till the load comes down. show full processlist shows queries like select min(TIMESTART) from DCS_MAG_FARVLD where TIMESTART > '2006-09-19 04:37:20' and DetectorMask & 2 and SimMask & 1 and CREATIONDATE >= '2006-09-19 04:18:33' and Task = 0 select max(TIMESTART) from DCS_MAG_FARVLD where TIMESTART < '2007-02-19 08:15:44' and DetectorMask & 2 and SimMask & 1 and CREATIONDATE >= '2007-02-19 08:32:51' and Task = 0 select * from DCS_MAG_FARVLD where TimeStart <= '2005-05-23 23:51:42' and TimeEnd > '2005-05-23 23:15:42' and DetectorMask & 2 and SimMask & 1 and Task = 0 order by CREATIONDATE desc No single command seems to take more than 15 seconds. The first two commands seems to be sending data, the latter seems to be sorting informed brebel, rhatcher rhatcher has found the root of the problem, may be able to implement an improvement if not an optimal solution. Will let the jobs run to completion, as they have done before. ============================================================================= 2007 09 11 ############### # GRIDAPPSYNC # ############### Cloned a new script to rsync /grid/app/minos/products from afs Added it to crontab.dat, running at 05:10 daily ######## # GRID # ######## Need write access to /grid/data and app HelpDesk ticket 103963 done 13:53 ######## # GRID # ######## rsync products per 2007 08 02 example after getting mounts corrected real 0m48.178s user 0m0.823s sys 0m6.144s ####### # LSF # ####### Tested submitting to SL3 vs SL4 nodes MINOS26 > bsub -R "linux24" pwd MINOS26 > bsub -R "linux26" pwd Tested cross kernel submission from minos11 bsub -R "linux26" ". /usr/local/etc/setups.sh ; setup encp ; type encp" from minos26 bsub -R "linux24" ". /usr/local/etc/setups.sh ; setup encp ; type encp" Look OK to me. 
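The same cross-kernel check, written as a loop so it can be rerun after future kernel updates - a sketch :
for RES in linux24 linux26 ; do
  bsub -R "${RES}" ". /usr/local/etc/setups.sh ; setup encp ; type encp"
done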
BNODES='10 11 13 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35' for NODE in $BNODES ; do printf "${NODE} " ; ssh -a -x -n flxb${NODE} grep MHz /proc/cpuinfo | head -1 | grep MHz ; done 10 cpu MHz : 999.564 11 13 16 17 18 19 20 21 22 23 24 cpu MHz : 2666.815 25 26 27 28 29 30 cpu MHz : 2666.831 31 cpu MHz : 2194.864 32 cpu MHz : 2194.740 33 cpu MHz : 2194.891 34 cpu MHz : 2194.950 35 cpu MHz : 2394.301 ####### # AFS # ####### request from brebel for pittam access to NC disks NCVOLS='138 147 169 187 204 211 228 229' for VOL in $NCVOLS ; do echo $VOL; fs listacl ${MINOS_DATA}/d${VOL} | grep -v system | grep rlidw; done | less 138 buckley:ana_ntuples rlidwka 147 buckley rlidwka brebel rlidwka 169 buckley:ana_ntuples rlidwka kreymer rlidwka 187 buckley rlidwka brebel rlidwka 204 buckley rlidwka brebel rlidwka 211 buckley rlidwka brebel rlidwka 228 minos:admin rlidwka brebel rlidwka 229 minos:admin rlidwka brebel rlidwka pts membership buckley:ana_ntuples Members of buckley:ana_ntuples (id: -1536) are: buckley brebel cd $MINOS_DATA fs setacl -dir d228 -acl buckley:ana_ntuples rlidwka fs setacl -dir d229 -acl buckley:ana_ntuples rlidwka fs setacl -dir d169 -acl minos:admin rlidwka pts adduser -user pittam -group buckley:ana_ntuples ============================================================================= 2007 09 10 ########### # MONTHLY # ########### DATASETS 9/10 PREDATOR 9/10 SADDRECO 9/10 # the last time for this, no longer needed VAULT 9/11 MYSQL 9/20 MINOS26 > ./dcache/datasets g ./dcache/datasets: line 122: [: too many arguments Removed SADDRECO step, this is handled by roundup unless cand files are produced without any sntp This should never happen. Vault failed the first time through, ran out of disk $ du -sm /pnfs/minos/vault/neardet/2007-05 89614 /pnfs/minos/vault/neardet/2007-05 $ du -sm /pnfs/minos/vault/neardet/2007-06 91205 /pnfs/minos/vault/neardet/2007-06 $ du -sm /pnfs/minos/vault/neardet/2007-07 85421 /pnfs/minos/vault/neardet/2007-07 $ du -sm /pnfs/minos/neardet_data/2007-08 28618 /pnfs/minos/neardet_data/2007-08 $ du -sm /pnfs/minos/vault/fardet/2007-05 25446 /pnfs/minos/vault/fardet/2007-05 $ du -sm /pnfs/minos/vault/fardet/2007-06 28914 /pnfs/minos/vault/fardet/2007-06 $ du -sm /pnfs/minos/vault/fardet/2007-07 28262 /pnfs/minos/vault/fardet/2007-07 $ du -sm /pnfs/minos/fardet_data/2007-08 124641 /pnfs/minos/fardet_data/2007-08 waited for mcimport to catch up ############ # MCIMPORT # ############ Adding automatic move of non-tar files to BAD, contining processing Typical time to gunzip -t is du -sm n11035090_0027_L010185N_D03.tar.gz 319 n11035090_0027_L010185N_D03.tar.gz real 0m21.086s user 0m9.072s sys 0m0.644s real 0m9.292s user 0m8.983s sys 0m0.305s So will leave this test where it is, as the files are about to be tarred. 14:09 AFSS/mcimport.20070910 kordosky Crashed due to script error ( corrected now ) after finding corrupt file ( this file also needed local md5sum ) n11035090_0010_L010185N_D03.tar.gz Removed the good but unchecked tarfile mv tar/n11035090_0002_L010185N_D03-n11035090_0027_L010185N_D03.tar BAD/n11035090_0002_L010185N_D03-n11035090_0027_L010185N_D03.tar mv BAD/n11035090_0002_L010185N_D03-n11035090_0027_L010185N_D03.tar DUP/ 17:25 AFSS/mcimport.20070910 -f 100 kordosky $ cp AFSS/mcimport.20070910 . 
$ ln -sf mcimport.20070910 mcimport was mcimport.20070711 finished at 17:50, restarted cronjob ####### # AFS # ####### Need to add zarko to buckley:beamdata group, and/or add minos:beam to d239 and d188 MINOS26 > pts adduser -user zarko -group buckley:minosbeam MINOS26 > pts membership buckley:minosbeam MINOS26 > fs setacl -dir d239 -acl minos:beam rlidwka fs: You don't have the required access rights on 'd239' Need a global addition of minos:admin by buckley ============================================================================= kreymer vacation Sep 1-9 Sent Minos cluster description to timm, for condor planning Set up corral for cedar_phy_bhcurv minosora3 memory problems are gone REC M.C. timesheet signed and submitted for Sep left mesage with Joe Boyd regarding BlueArc disk, and condor planning ============================================================================= 2007 08 31 ########### # ROUNDUP # ########### Added full cedar_phy_bhcurv stanza to corral. Ran corral manually at 08:40, to test before I leave on vacation this PM. /home/minfarm/ROUNTMP/ROOTRELS added cedar_phy_bhcurv Ran corral manually at 09:17 Oops, needed to add locations could use the new version of saddreco ! export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do ./samtapeloc /pnfs/minos/reco_near/cedar_phy_bhcurv ${REL} ; done MINOS26 > sam get metadata --file=N00008579_0004.spill.cand.cedar_phy_bhcurv.0.root MINOS26 > sam add location --file=N00008579_0004.spill.cand.cedar_phy_bhcurv.0.root --loc='/pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2005-09(dcache.31)' MINOS26 > sam locate N00008579_0004.spill.cand.cedar_phy_bhcurv.0.root ['/pnfs/minos/reco_near/cedar_phy_bhcurv/cand_data/2005-09,31@dcache'] OK, all 32 concatenated files are declared, along with cands. ####### # AFS # ####### Tested mengel's suggestion for /usr/bin/aklog problem : 0-59 * * * * /usr/krb5/bin/kcron "/usr/krb5/bin/aklog ; ${HOME}/minos/scripts/crontestark" ####### # AFS # ####### MINOS26 > pts adduser -user zarko -group minos:beam MINOS26 > pts membership minos:beam ############# # MINOSORA3 # ############# --------------------------------------------- Date: Fri, 31 Aug 2007 09:41:55 -0500 From: Maurine Mihalek so far, so good. no warning messages since dimm's were replaced tuesday afternoon. --------------------------------------------- The previous rate was a few errors per hour. I declare victory ! ============================================================================= 2007 08 30 ######### # ADMIN # ######### MIN > for NODE in $NODES ; do printf "\n${NODE} `date`\n"; ssh -ax ${NODE} "echo HELLO" ; done minos01 Thu Aug 30 14:17:26 CDT 2007 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). minos02 Thu Aug 30 14:17:27 CDT 2007 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). ... minos20 Thu Aug 30 14:18:39 CDT 2007 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). The rest were OK Ran another pass minos02 Thu Aug 30 14:25:39 CDT 2007 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). minos20 Thu Aug 30 14:26:47 CDT 2007 Permission denied (external-keyx,gssapi-with-mic,gssapi,keyboard-interactive). 
in cvshlog, found 38 instances of Thu Aug 30 14:00:07 2007 (west@163.1.244.104) : cvsh -c cvs server [sS] In minos01 /var/log/messages, About 38 messages like Aug 30 13:32:18 minos01 sshd(pam_unix)[31107]: session opened for user root by root(uid=0) node 163.1.244.104 is pplxgenng.physics.ox.ac.uk Then there is a veritable flood of about 128 repititions ... Aug 30 14:02:49 minos01 sshd[32618]: rexec line 45: Deprecated option RhostsAuthentication Aug 30 14:02:50 minos01 sshd[32618]: Invalid user minoscvs from 163.1.244.104 Aug 30 14:02:50 minos01 sshd[32618]: input_userauth_request: invalid user minoscvs Aug 30 14:02:50 minos01 sshd[32618]: Failed none for invalid user minoscvs from 163.1.244.104 port 54925ssh2 Aug 30 14:02:50 minos01 sshd[32618]: Failed publickey for invalid user minoscvs from 163.1.244.104 port54925 ssh2 Aug 30 14:02:50 minos01 sshd[32618]: Failed keyboard-interactive for invalid user minoscvs from163.1.244.104 port 54925 ssh2 Aug 30 14:02:50 minos01 sshd[32618]: Connection closed by 163.1.244.104 Looks like it cleared up around 14:51. I see nothing remarkable on minos02 /var/log/messages, but see minos20 : Aug 30 13:36:41 minos02 ypbind: ypbind shutdown succeeded Aug 30 13:36:41 minos02 ypbind: ypbind startup succeeded Aug 30 13:36:42 minos02 ypbind: bound to NIS server minos02.fnal.gov Aug 30 13:39:45 minos02 ypserv: ypserv shutdown succeeded Aug 30 13:39:45 minos02 ypserv[13584]: WARNING: no securenets file found! Aug 30 13:39:45 minos02 ypserv[13584]: Support for SLP (line 20) is not compiled in. Aug 30 13:39:45 minos02 ypserv[13584]: Support for SLP (line 22) is not compiled in. Aug 30 13:39:45 minos02 ypserv: ypserv startup succeeded Aug 30 13:46:00 minos02 ypbind: ypbind shutdown succeeded Aug 30 13:46:00 minos02 ypbind: ypbind startup succeeded Aug 30 13:46:01 minos02 ypbind: bound to NIS server minos02.fnal.gov Aug 30 13:55:24 minos02 sshd(pam_unix)[13791]: session opened for user root by root(uid=0) Aug 30 14:25:44 minos02 sshd(pam_unix)[13963]: session opened for user root by root(uid=0) Aug 30 14:27:11 minos02 rpc.ypxfrd[14035]: WARNING: no securenets file found! Aug 30 14:27:11 minos02 rpc.ypxfrd[14035]: Support for SLP (line 20) is not compiled in. Aug 30 14:27:11 minos02 rpc.ypxfrd[14035]: Support for SLP (line 22) is not compiled in. Aug 30 14:27:11 minos02 ypxfrd: rpc.ypxfrd startup succeeded Aug 30 14:28:11 minos02 sshd(pam_unix)[14071]: session opened for user kreymer by kreymer(uid=0) Aug 30 14:28:35 minos02 sshd(pam_unix)[14396]: session opened for user kreymer by (uid=0) Aug 30 14:32:36 minos02 ypserv: ypserv shutdown succeeded Aug 30 14:32:36 minos02 ypserv[14440]: WARNING: no securenets file found! Aug 30 14:32:36 minos02 ypserv[14440]: Support for SLP (line 20) is not compiled in. Aug 30 14:32:36 minos02 ypserv[14440]: Support for SLP (line 22) is not compiled in. Aug 30 14:32:36 minos02 ypserv: ypserv startup succeeded Aug 30 14:32:45 minos02 ypbind: ypbind shutdown succeeded Aug 30 14:32:45 minos02 ypbind: ypbind startup succeeded Aug 30 14:32:45 minos02 ypbind: bound to NIS server minos01.fnal.gov I don't see this on other nodes. Reply from Jason Harrington - minos02 was misconfigured, referred to itself. This was corrected by 14:30 . ( My guess - NIS load shifted to minos02, then got stuck. 
) ############ # MCIMPORT # ############ DUP cleanup $ du -sm */DUP 1 arms/DUP 45250 hgallag/DUP 29 howcroft/DUP 1553 kordosky/DUP 75 kreymer/DUP 1 sjc/DUP for DIR in `ls -d */DUP` ; do printf "$DIR " find ${DIR} -type f -ctime +10 -exec echo {} \; | wc -l done arms/DUP 0 hgallag/DUP 172 howcroft/DUP 3 kordosky/DUP 6 kreymer/DUP 6 sjc/DUP 0 for DIR in `ls -d */DUP` ; do printf "$DIR " find ${DIR} -type f -ctime +10 -exec rm {} \; done ############ # MCIMPORT # ############ From cron job email /home/mindata/mcimport.20070203: line 368: srmcp: command not found From kordosky/log/mcimport.log Thu Aug 30 07:54:33 CDT 2007 exeAccess failed for java OOPS, found the cronjob, per crontab.dat, running mcimport.20070203 That is a truly ancient and dysfunctional version. This was due to the restoration of the mindata account from an old copy. Corrected 37 0,6,12,18 * * * ${HOME}/mcimport.20070203 -c ALL to 37 0,6,12,18 * * * ${HOME}/mcimport -c ALL While we're at it, updated /home/mindata/.srmconfig/kreymer.xml to reference kreymer-voms.proxy copied thusly $ pwd /home/mindata/.grid $ scp minfarm@fnpcsrv1:/export/stage/minfarm/.grid/kreymer-voms.proxy kreymer-voms.proxy tested with .srmtest Thu Aug 30 10:42:46 CDT 2007: rs.state = Failed rs.error = RequestFileStatus#-2146260890 failed with error:[ at Thu Aug 30 10:42:42 CDT 2007 state Failed : user has no permission to write into path /pnfs/fnal.gov/usr/minos/stage/kordosky ] Reverted to kreymer-doe.proxy First file copy to /pnfs/minos/stage/kordosky/n11035001_0000_L010185N_D03-n11035001_0004_L010185N_D03.tar at 11:02 started ok, then slowed down linearly over 5 mintues to near 0 rate, took 10 minutes for 1.7 GB. Next file was close to 1 minute. ####### # SAM # ####### export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.bhcurv done ######### # KCRON # ######### /usr/bin/aklog does not work with kcron tickets FLXI06 > klist -f Ticket cache: /tmp/krb5cc_1060_z11732 Default principal: kreymer/cron/flxi06.fnal.gov@FNAL.GOV Valid starting Expires Service principal 08/30/07 10:22:17 08/30/07 20:22:17 krbtgt/FNAL.GOV@FNAL.GOV Flags: FIA 08/30/07 10:22:18 08/30/07 20:22:17 afs@FNAL.GOV Flags: FA FLXI06 > /usr/bin/aklog aklog: Couldn't get fnal.gov AFS tickets: aklog: Improper format of Nov 11ion database entry while getting AFS tickets GRRRRRRR Doing vanilla test of cron on flxi0* with crontab crontest.dat Some nodes fail, with kinit: Client not found in Kerberos database while getting initial credentials flxi02 - ok flxi03 - ok flxi04 - ok flxi05 - kinit fails flxi06 - kinit fails flxi07 - kinit fails and kcron fails interactively ============================================================================= 2007 08 29 ########### # ROUNDUP # ########### ./roundup -M -r cedar_phy mockfar ########### # ROUNDUP # ########### Updated to user kreymer-voms.proxy /export/stage/minfarm/.srmconfig/kreymer.xml Also commented out the .key and .pem file references, not needed. 
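A sketch for probing kcron on each flxi node in one pass, instead of waiting for the crontest.dat entries to fire. Assumes kcron accepts a command string (as in the crontab entries elsewhere in this log) and that kcroninit has already been run wherever it can be.

for N in 02 03 04 05 06 07 ; do
  # klist -s just returns success/failure for the ticket kcron obtains
  printf "flxi${N} " ; ssh -ax -n flxi${N} '/usr/krb5/bin/kcron klist -s && echo kcron OK || echo kcron FAILED'
done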
######## # CRON # ######## Cron works OK on minos11 and minos12 with 0-59 * * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/crontestark MINOS12 > rpm -qa | grep cron anacron-2.3-32 vixie-cron-4.1-47.EL4 crontabs-1.10-7 MINOS01 > rpm -qa | grep cron anacron-2.3-32 vixie-cron-4.1-47.EL4 crontabs-1.10-7 Thiings started working today, on minos01, with MAILTO='kreymer@fnal.gov' 0-59 * * * 0,1,2,3,4,5,6 /usr/krb5/bin/kcron ${HOME}/minos/scripts/crontestark MAILTO='kreymer@fnal.gov' 0-59 * * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/crontestark quiet Restored crontab.minos01 to use of 1,3,5 ######### # FNALU # ######### Requested CLUBS upgrade to SLF 4.4 of fnalu-admin. Spoke to Wayne Baisley today, after Unix Users meeting. He will consider retiring/retaining some flxb nodes in an SL3 queue, upgrading the rest. Told him of our current problem having upgraded to SL4 interactive, and the upcoming Collaboration meeting in Sep. ============================================================================= 2007 08 28 ####### # CVS # ####### For tjyang, added to .k5login Verified that updates got correct username in the log In earlier tests, verified that username is logged without having ssh1 identity ####### # AFS # ####### cfl, afssum and afsfree cron jobs on minos01 have been silent. Missing tokens due to /usr/bin/aklog, fix by setting PATH="/usr/krb5/bin:${PATH}" and running aklog in the scripts. crontab ignores list of day of week 0-59 * * * 2-6/2 works ok 0-59 * * * 0,1,2,3,4,5,6 is ignored tested using /tmp/ct1 ########### # GANGLIA # ########### Minos Server is back online, along with Minos Cluster and Minos Oracle. ############## # CRYPTOCARD # ############## All Minos Cluster nodes now have cryptocard access. rennie restarted all sshd servers. Minos01 sshd was hosed, would not restart, and sshd.cvs crashed. Restarted both around 11:30 . We have cryptocard and cvs access again. ######### # GENPY # ######### [minos@minos-offline2 root_files]$ ls -l /data/root_files/F00039050_0008.mdaq.root -rw-r--r-- 1 minos e875 17413425 Aug 27 00:05 /data/root_files/F00039050_0008.mdaq.root [minos@minos-offline2 root_files]$ md5sum F00039050_0008.mdaq.root b3e0edf63239755dfa55a23d486b2049 F00039050_0008.mdaq.root MINOS26 > scp -c blowfish minos@minos-offline2.minos-soudan.org:/data/root_files/F00039050_0008.mdaq.root F00039050_0008m.mdaq.root MINOS26 > ecrc F00039050_0008m.mdaq.root CRC 1093528200 There is no level4 PNFS info yet ! 
MINOS26 > DCPOR=24125 # unsecured MINOS26 > IFILE=F00039050_0008.mdaq.root MINOS26 > IPATH=minos/fardet_data/2007-08 MINOS26 > DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} MINOS26 > dccp ${DFILE} ${IFILE} 17413425 bytes in 0 seconds MINOS26 > md5sum F0003* 20da2c880577cb4cad059ac68438975f F00039050_0008.mdaq.root b3e0edf63239755dfa55a23d486b2049 F00039050_0008m.mdaq.root MINOS26 > ~/minos/scripts/run_dbu F00039050_0008.mdaq.root /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/run_dbu: line 128: 24008 Segmentation fault dbu -bq ${HOME}/minos/scripts/dbu_sampy.C ${FILE} >>${logname} 2>&1 F00039050_0008.sam.py was not generated - check log for error F00039050_0008.log MINOS26 > ~/minos/scripts/run_dbu F00039050_0008m.mdaq.root Moving the bad file out of the way, and saving the good one : MINOS26 > pwd /local/scratch26/kreymer/DATA asked enstore-admin to do enmv /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root \ /pnfs/minos/BAD/F00039050_0008.BAD.mdaq.root no can do, file is still pending for write MINOS26 > ./dc_stat /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root ============================ PNFS status for /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root -rw-r--r-- 1 buckley e875 17413425 Aug 27 00:09 F00039050_0008.mdaq.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:05f4ea89;l=17413425; w-stkendca9a-3 LEVEL 4 ============================ MINOS26 > MINOS26 > ./dc_stat /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root ============================ PNFS status for /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root -rw-r--r-- 1 buckley e875 17413425 Aug 27 00:09 F00039050_0008.mdaq.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:05f4ea89;l=17413425; w-stkendca9a-3 LEVEL 4 ============================ rm /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root cd /local/scratch26/kreymer/DATA setup dcap DCPOR=24725 # kerberos DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} dccp F00039050_0008m.mdaq.root ${DFILE} 17413425 bytes in 1 seconds (17005.30 KB/sec) chmod 664 /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root ============================================================================= 2007 08 27 ############## # CRYPTOCARD # ############## Working now only on flxi02, 4, 7 Not 3, 5, 6 On 3 and 5, there are multiple sshd's with PPID 1. On 2, 4, 7 there is only 1. MIN > ssh flxi02 'ps -flu root | grep sshd' 5 S root 2798 1 0 75 0 - 774 - May31 ? 00:04:54 /usr/sbin/sshd MIN > ssh flxi04 'ps -flu root | grep sshd' 5 S root 3712 1 0 75 0 - 786 - May31 ? 00:01:13 /usr/sbin/sshd MIN > ssh flxi07 'ps -flu root | grep sshd' 5 S root 19307 1 0 76 0 - 5754 - Aug24 ? 00:00:00 /usr/sbin/sshd FLXI03 > ps -flu root | grep sshd 5 S root 25491 1 0 85 0 - 1043 - Jul30 ? 00:01:57 /usr/sbin/sshd 5 S root 25857 1 0 85 0 - 1043 - Jul30 ? 00:01:51 /usr/sbin/sshd 5 S root 26109 1 0 85 0 - 1043 - Jul30 ? 00:01:58 /usr/sbin/sshd 5 S root 901 1 0 75 0 - 879 - Aug23 ? 00:00:00 /usr/sbin/sshd 5 S root 18823 1 0 75 0 - 798 - Aug24 ? 00:00:00 /usr/sbin/sshd 5 S root 7406 1 0 75 0 - 988 - Aug24 ? 00:00:05 /usr/sbin/sshd MIN > ssh flxi05 'ps -flu root | grep sshd' /usr/X11R6/bin/xauth: timeout in locking authority file /afs/fnal.gov/files/home/room1/kreymer/.Xauthority 5 S root 13089 1 0 75 0 - 931 - Aug10 ? 00:02:02 /usr/sbin/sshd 5 S root 2913 1 0 75 0 - 815 - Aug23 ? 00:00:13 /usr/sbin/sshd 5 S root 3403 1 0 75 0 - 815 - Aug23 ? 00:00:00 /usr/sbin/sshd 5 S root 24057 1 0 75 0 - 986 - Aug24 ? 
00:00:01 /usr/sbin/sshd MIN > ssh flxi06 'ps -flu root | grep sshd' 5 S root 28508 1 0 75 0 - 986 - Aug24 ? 00:00:02 /usr/sbin/sshd MIN > ssh flxi07 'ps -flu root | grep sshd' 5 S root 19307 1 0 76 0 - 5754 - Aug24 ? 00:00:00 /usr/sbin/sshd Can see /var/log/secure on minos-mysql1. Buckley failed login produces Aug 27 14:08:38 minos-mysql1 sshd[5168]: Failed gssapi-with-mic for buckley from ::ffff:131.225.193.6 port 37596 ssh2 Aug 27 14:08:38 minos-mysql1 sshd[5168]: Failed keyboard-interactive for buckley from ::ffff:131.225.193.6 port 37596 ssh2 Aug 27 14:08:38 minos-mysql1 sshd[5168]: Connection closed by ::ffff:131.225.193.6 ######### # GENPY # ######### 2007-08-27 00:09:29 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/fardet_data/2007-08/F00039050_0008.mdaq.root daqdcp.minos-soudan.org 18 17413425 0 OK dbu fails on /pnfs/minos/fardet_data/2007-08/F00039050_0008.mdaq.root R__unzip: error in inflate (zlib) Error in : fNbytes = 13452, fKeylen = 84, fObjlen = 41620, noutot = 0, nout=0, nin=13368, nbuf=41620 Test per HOWTO.genpy MINOS26 > cd ${HOME}/minos/test TIER=mdaq IFILE=F00039050_0008 DATADIR=fardet_data/2007-08 rm -f ${IFILE}.log rm -f ${IFILE}.sam.py rm -f ${IFILE}.sam.pyc minos setup_minos -r R1.22 dbu fails as before setup_minos -r R1.26 dbu fails as before, but with successful error code and producing output Let's see if we have any more old failures : find * -name \*.log -exec grep -H unzip {} \; and on minos06, cd /local/scratch06/kreymer/genpy/fardet_data find ????-?? -name \*.log -exec grep -H unzip {} \; Last previous fardet unzip error was in 2004-10 ########### # ROUNDUP # ########### corral - removed corralsrs from corral, these files are done. ########### # ROUNDUP # ########### 100 2116 mcfmockcat ./roundup -M -r cedar_phy mockfar Finished at about 10:12 About 12 seconds per 20 MBytes file writing with srmcp Need to rerun this afternoon, to flush WRITE ########### # MINOS11 # ########### Trying to clean up SLF 3 afs login, via yum install openafs-krb5 MIN > rpm -ql openafs-krb5 /usr/bin/aklog /usr/sbin/asetkey This works, and gives a long lived token. ######## # PNFS # ######## /pnfs/minos seems mounted ro on minos01 and 26, not rw. Requested that this be corrected. fixed on minos01 around 11:10, minos26 around 11:30 ######## # FARM # ######## SUMMARY OF THE 8 JOB PROCESSING OF SCALED FIELD STUDIES FORMERLY ON THE KREYMER BLACKBOARD N cedar_phy_ Det Request 1 srsafitter f CosmicLE_D02 CosmicMC_D02 2 srsafitter F 6 months cosmic 3 srsafitterbx113 n 550 files D00 L010185N 4 srsafitter n 541 files D00 L010185N_bfldx113 5 srsafitterbx113 n 541 files D00 L010185N_bfldx113 6 srsafitter N 3 mo 2005 spill 7 srsafitterbx113 N 3 mo 2005 spill 8 srsafitter n 550 files D00 L010185N N boundaries 2005-08 8433_0002 2005-09 8433_0003 10 11 9280_0018 F boundaries 2005-11 33077_0002 2006-02 33805_0006 2007-01 37162_0006 2007-02 37709_0000 CHART near far mcnear mcfar srs 6/8 2/ /4 1/ srsbx113 7/ 3/5 ============================================================================= 2007 08 25 Saturday ########### # ROUNDUP # ########### Getting flooded by processing of near cedar_phy daikon_00 L010185N Typically 50 GB/6 hours. 
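Follow-up sketch for the CRYPTOCARD entry above: count how many root sshd daemons with PPID 1 each flxi node is running, since the nodes with multiple parents were the broken ones. The node list is just the ones checked above.

for N in 02 03 04 05 06 07 ; do
  # PPID is field 5 of ps -fl output; no awk action means print, then count
  printf "flxi${N} " ; ssh -ax -n flxi${N} "ps -flu root | grep sshd | awk '\$5 == 1' | wc -l"
done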
All going into the minos file family Try correcting as rubin@fnpcsrv1 with ./pnfsdirs near cedar_phy daikon_00 L010185N Sat Aug 25 09:33:40 CDT 2007 STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_00/L010185N FAMSET mcin_near_daikon_00 FAMILY mcin_near_daikon_00 OUTPUT /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N chgrp: invalid group name `e875' OK - created /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/mrnt_data OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 rubin numi 512 Aug 25 09:33 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N chgrp: invalid group name `e875' OK - have set permissions drwxrwxr-x drwxrwxr-x 1 rubin numi 512 Aug 25 09:33 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 rubin numi 512 Aug 25 09:21 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/cand_data chgrp: invalid group name `e875' OK - have set permissions drwxrwxr-x drwxrwxr-x 1 rubin numi 512 Aug 25 09:21 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/cand_data FAMSET mcout_cedar_phy_near_daikon_00_cand FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_00_cand OK - setting family to mcout_cedar_phy_near_daikon_00_cand FAMSET mcout_cedar_phy_near_daikon_00_mrnt FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_00_mrnt OK - setting family to mcout_cedar_phy_near_daikon_00_mrnt OOPS, need permissions drwxrwxr-x drwxr-xr-x 1 rubin numi 512 Aug 25 09:25 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data chgrp: invalid group name `e875' OK - have set permissions drwxrwxr-x drwxrwxr-x 1 rubin numi 512 Aug 25 09:25 /pnfs/minos/mcout_data/cedar_phy/near/daikon_00/L010185N/sntp_data FAMSET mcout_cedar_phy_near_daikon_00_sntp FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_00_sntp OK - setting family to mcout_cedar_phy_near_daikon_00_sntp Ran this as rubin@fnpcsrv1 setup encp -q stken ./pnfsdirs near cedar_phy daikon_00 L010185N write ============================================================================= 2007 08 24 ########### # MINOS11 # ########### cryptocard - yum install zz_sshd_pam service sshd restart tried this on minos06, with reload, no cryptocard access Had lots of trouble getting access, turns out AklogCmd is no longer tried restartof ssh, still no good zz_sshd_aklog-3.9-5 was needed, to remove the obsolete AklogCmd entry from /etc/ssh/sshd_config ########### # MINOS11 # ########### AFS - the system was booted around noon, without DHCP. Robert has done 3 clean root builds. There have been no further afs timeout in /var/log/messages, aside from a single pair of timeouts to a private network. Aug 24 11:57:49 minos11 kernel: afs: Lost contact with file server 192.168.67.1 in cell fnal.gov (multi-homed address; other same-host interfaces maybe up) Success ! ########### # ROUNDUP # ########### Forcing out srsafitter remnants that existed tested with -n -W ${HOME}/scripts/roundup -f 2 -r cedar_phy_srsafitter near ${HOME}/scripts/roundup -f 2 -M -r cedar_phy_srsafitter mcfar ${HOME}/scripts/roundup -f 2 -M -r cedar_phy_srsafitterbx113 mcnear ####### # DAQ # ####### minos-gateway-nd - sshd was not running per Peter and Alec in pit x5875 Urish got console control, removed AklogCmd from /etc/sshd/sshd_config rebooted, we're good. ########### # UPGRADE # ########### Short AFS tokens for bash users were due to typo in a config file pushed manually to all nodes. This is corrected by Rennie Scott, lifetimes look good. 
Remaining issues cryptocard support no .Xauthorization access under SL3 ( minos11 ) Ganglia CLUBS to SLF 4 ============================================================================= 2007 08 23 ########### # MINOS11 # ########### Went off the network sometime after 10:00 sar -n DEV | grep -v 'lo' shows 0 tx packets/data at 11:20 through 12:40 But no errors in the EDEV report And no errors in the MRTG web page information 16:22 copied a file twice via the net, to test rates, Rates look just fine, about 20 MBytes/second MINOS11 > time rcp minos26:/local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root . real 0m3.675s MINOS11 > time rcp minos26:/local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root TEST.dat real 0m3.258s MINOS10 > time rcp minos26:/local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root . real 0m3.961s MINOS10 > time rcp minos26:/local/scratch26/kreymer/DATA/N00009870_0002.mdaq.root TEST.dat real 0m3.035s But 2 or 4 of Robert's rebuilds continue to fail with file timeouts. Trying regular file manipulations : time tar cf /local/scratch11/kreymer/DATA/home1.tar . real 1m23.948s time tar cf /local/scratch11/kreymer/DATA/home2.tar . real 0m57.161s for N in 3 4 5 6 7 8 9 10 ; do time tar cf /local/scratch11/kreymer/DATA/home${N}.tar . ; done real 0m58.113s real 0m51.739s real 0m54.350s real 0m49.661s real 0m51.646s real 0m54.644s real 0m53.916s real 0m57.419s Try something that hits new directories more in AFS, writing and deleting . date time cp -ax ~kreymer ${MINOS_DATA}/release_data/TEST/kreymer date du -sm ~kreymer ${MINOS_DATA}/release_data/TEST/kreymer date diff -r ~kreymer ${MINOS_DATA}/release_data/TEST/kreymer date du -sm ~kreymer ${MINOS_DATA}/release_data/TEST/kreymer date rm -r ${MINOS_DATA}/release_data/TEST/kreymer date Sort of messy, some file were write protected, and many stray symlinks ########### # UPGRADE # ########### S.A.G. is back, per Liz Hi Art, so it is working again. It seemed to be a victim of the local ups area vanishing during the upgrade. I have pointed the PRODUCTS variable at the MINOS products area in afs. ============================================================================= 2007 08 22 ########### # UPGRADE # ########### Library symbolic links like ln -s ../../usr/X11R6/lib/libGL.so.1.2 /usr/lib/libGL.so were part of xorg-x11-devel , installed late Tuesday morning. ########### # UPGRADE # ########### Residual issues, roughly highest priority first DONE - LSF batch servers - not running yet on minos15 16 17 18 20 21 23 24 25 minos11 issues AFS stability - consult an expert ? stray aklog message at login TiBS reinstallation Ganglia monitoring - needed on minos01->26 and minos-sam01/02/03 LANG-en_US.UTF-8 causes a different order for the output of ls Can we remove this, or add LC_COLLATE=C ? cryptocard access - will come soon with new ssh version and zz_sshd_pam token lifetimes are 1 day on login ( kerberos ticket expiration time) rather than 1 week ( based on ticket renewability time) This may be addressed by the new sshd version coming tomorrow. It is as if we were running AklogCmd /usr/bin/aklog instead of AklogCmd /usr/krb5/bin/aklog A few users still cannot login via ssh perhaps a client issue ? SamAtAGlance is not running under buckley@minos-sam01 buckley files may need to be copied from /scratch/sam01/buckley to /home/buckley kcroninit crontab Cannot read /etc/ssh/sshd_config and /var/log/messages Can this be enabled ? 
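For the LC_COLLATE item in the list above, a quick demonstration of the ls ordering difference, using a throw-away test directory; the expected orderings are for a typical glibc locale.

mkdir /tmp/coltest ; cd /tmp/coltest ; touch A b C d
LC_COLLATE=en_US.UTF-8 ls    # A b C d  - dictionary order, case folded
LC_COLLATE=C ls              # A C b d  - plain byte order
cd / ; rm -r /tmp/coltest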
####### # LSF # ####### LSF batch jobs are running only on 14, 19 , 22 Scanning with ps -fu root | grep lsf root 4902 1 0 Aug20 ? 00:01:01 /afs/fnal.gov/ups/lsf/v6_1/i386_linux24/etc/lim root 4905 1 0 Aug20 ? 00:00:00 /afs/fnal.gov/ups/lsf/v6_1/i386_linux24/etc/res root 4908 1 0 Aug20 ? 00:00:01 /afs/fnal.gov/ups/lsf/v6_1/i386_linux24/etc/sbatchd root 5635 4902 0 Aug20 ? 00:00:08 /afs/fnal.gov/ups/lsf/v6_1/i386_linux24/etc/pim This correlates with the bhosts lists. Requested startup. NS='x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x' for N in ${NS} ; do bsub -q minos \ 'HOST=`hostname --short | cut -c -5` ; [ "${HOST}" = "minos" ] && { hostname ; sleep 120 ; }' done All but 25 are running jobs. Try to probe just that node again, for N in ${NS} ; do bsub -q minos \ 'HOST=`hostname --short` ; [ "${HOST}" = "minos25" ] && { hostname ; sleep 120 ; }' done ######### # GENPY # ######### genpy.20070822 ln -sf genpy.20070822 genpy # was genpy.20061103 Need to increase timeout for large files Typical recent timings GDAT/fardet_data/2007-08 Size du -sm /pnfs/minos/fardet_data/2007-08/${RUN}.mdaq.root Time grep '2007/08' ${RUN}.log TZ=UTC stat ${RUN}.log RUN=F00038842_0000 1183 / 599 RUN=F00038846_0000 546 / 340 RUN=F00038869_0000 55 / 36 RUN=F00038871_0000 177 / 93 RUN=F00038872_0000 964 / 610 RUN=F00038889_0000 7 / 22 RUN=F00038891_0001 18 / 32 A safe limit would seem to be 60 sec + (Size in MB) Tested this per HOWTO.genpy ########### # SRMTEST # ########### srmtest.20070822 automatically runs for both minfarm and mindata ============================================================================= 2007 08 21 ########### # STARTUP # ########### Need afs restart on minos19 ( two afsd ) lsf client OK on minos02/3/4 bhosts works bsub works ####### # AFS # on minos11 ####### CACHESIZE= OPTIONS=$LARGE CACHESIZE had been 100000 rebooted around 16:26 w thosieck pts/0 tigris.hep.utexa 3:36pm 42:38 0.68s 0.44s ssh minos05 rhatcher pts/1 rhatcher03.dhcp. 3:55pm 9:06 0.26s 0.26s -bash root pts/2 hyperion.dhcp.fn 3:55pm 12:52 0.06s 0.06s -bash kreymer pts/3 minos-93198.dhcp 4:19pm 0.00s 0.12s 0.02s w jyuko pts/5 argut.hep.utexas 1:59pm 8:27 1.85s 0.37s -tcsh blake pts/6 pcgj.hep.phy.cam 11:30am 4:45m 0.53s 0.46s ssh minos10 blake pts/10 pcgj.hep.phy.cam 11:31am 4:46m 1.24s 1.18s ssh minos12 ####### # AFS # on minos-mysql1 ####### Getting message at login ( as once got on minos11 ) df: `afs': No such file or directory Nick has been having AFS trouble, I presume on minos-mysql1 minos-mysql1 is running openafs 1.4.4, so should have an /etc/sysconfig/afs file like those on the SL 4.4 cluster. But its file is identical to that on flxi04. 
This is odd because the file specifies CACHEDIR=/usr/vice/cache CACHEINFO=/usr/vice/etc/cacheinfo But the active cache files are in [root@minos-mysql1 ~]# ls -l /var/cache/openafs/ total 200 -rw------- 1 root root 137516 Aug 21 11:35 CacheItems -rw------- 1 root root 20 Aug 16 12:26 CellItems drwx------ 2 root root 32768 Aug 16 12:24 D0 drwx------ 2 root root 20480 Aug 16 12:24 D1 -rw------- 1 root root 2288 Aug 21 11:04 VolumeItems The /usr/vice/cache files are old [root@minos-mysql1 ~]# ls -l /usr/vice/cache total 632 -rw------- 1 root root 440016 Aug 16 07:04 CacheItems -rw------- 1 root root 20 Dec 20 2004 CellItems drwx------ 2 root root 32768 Nov 29 2004 D0 drwx------ 2 root root 36864 Nov 29 2004 D1 drwx------ 2 root root 36864 Nov 29 2004 D2 drwx------ 2 root root 36864 Nov 29 2004 D3 drwx------ 2 root root 32768 Nov 29 2004 D4 -rw------- 1 root root 16952 Aug 3 00:23 VolumeItems [root@minos-mysql1 ~]# ls -l /usr/vice/etc total 180 -rw------- 1 root root 0 Nov 29 2004 AFSLog -rw-r--r-- 1 root root 364 Feb 26 2005 CellAlias -rw-r--r-- 1 root root 31919 Jun 7 13:31 CellServDB -rw-r--r-- 1 root root 157 Mar 22 09:32 SuidCells -rw-r--r-- 1 root root 9 Aug 16 12:12 ThisCell -rw-r--r-- 1 root root 9 Mar 9 2004 ThisCell.FNAL -rwxr-xr-x 1 root root 121788 Jun 7 13:31 afsd -rw-r--r-- 1 root root 30 Jun 7 13:31 cacheinfo -rwxr-xr-x 1 root root 425 Feb 26 2005 killafs Reported to run2-sys ( rennie ) Discussed with him and Joe Boyd. Per advice of mengel, we have put in the standard SLF 4.4 /etc/sysconfig/afs file, and rebooted ( reboots are recommended. ) I stopped the mysql database just before the reboot. Connection details are in LOG.mysql df is now happy with the afs partition afsd shows the ============================================================================= 2007 08 20 ########### # STARTUP # ########### Requested restore of /home/mindata from /local/scratch26/kreymer/homemindata-sam02.tar if necessary ############ # MCIMPORT # ############ rennie restored /home/mindata files from minos-sam02 ( Vintage March ) $ cp AFSS/mcimport.20070711 mcimport.20070711 $ ln -s mcimport.20070711 mcimport $ rmdir STAGE/ $ ln -s /local/scratch26/mindata STAGE ####### # AFS # ####### The afssum scripts failed ( no access ) on Friday PM Seem to be OK this morning. Correcting /etc/sysconfig/afs OPTIONS=AUTOMATIC to OPTIONS=$LARGE Restarted most HOWTO.monitor tasks on minos26 Minos11 config had to be restored from flxi04, rebooted around ########### # SRMTEST # ########### lost from /home/mindata, recopied from minfarm@fnpcsrv1, and put in scripts/ ############## # MINOS-DATA # ############## Removed tzanakos@PHYS.UOA.GR from minos-data list, due to repeated quota problems in gr. ######## # TIME # ######## for NODE in $NODES ; do printf "\n${NODE} `date`\n"; ssh ${NODE} "printf \"${NODE} \" ; ntpstat | grep correct" done All nodes are like minos01 time correct to within 11 ms or minos15 time correct to within 10 ms ####### # X11 # ####### Still need XFree86-devel, per boehm ######## # KRB5 # ######## rhatcher lacks /usr/krb5/bin in path ============================================================================= 2007 08 18 Sat ########### # STARTUP # ########### tokens are appearing now on all nodes 1 day expiration on most 8 day expiration on minos11, minos25 Funny messages on minos11 -bash: aklog: command not found can't exec /local/ups/prd/perl/v5_006_1a/Linux-2/bin/perl:: No such file or directory Terminal type is xterm There are no available articles. 
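A sketch for surveying the 1-day versus 8-day token lifetimes noted above across the whole cluster; assumes the usual $NODES list used elsewhere in this log and that tokens is on the default PATH.

for NODE in $NODES ; do
  printf "${NODE} " ; ssh -ax -n ${NODE} 'tokens | grep afs@'
done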
# PREDATOR # 17:25 UTC Started manual ./predator, now that DCache is delivering files again # note that kcronit is needed globally, sent mail to m_s_d kinit: No such file or directory while getting initial credentials I needed to run kcroninit on 2 5-9 11-25 kcroninit fails on minos11, can't exec /local/ups/prd/perl/v5_006_1a/Linux-2/bin/perl:: No such file or directory MINOS13 > kcron kinit: Client not found in Kerberos database while getting initial credentials and 14, 15, 16 Note that kcrondestroy fails : MINOS05 > kcrondestroy KCRONINIT_DIR is not defined ... we are quitting. BEGIN failed--compilation aborted at /usr/krb5/bin/kcrondestroy line 35. ############ # MCIMPORT # ############ The mindata account /home/mindata area is missing from minos26 ############ # PREDATOR # ############ Fardet data is showing up with root version 5.16.0 . Starting with data from 2007 08 F00038604_0000.mdaq.root export SAM_ORACLE_CONNECT="samdbs/password" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=online --appName=rotorooter --appVersion=v05-16-00 done New applicationFamilyId = 187 New applicationFamilyId = 76 New applicationFamilyId = 222 ============================================================================= 2007 08 17 ########### # STARTUP # ########### OK - minos24 down - has been rebooted OK - cvs anonymous asousa cvs [update aborted]: end of file from server (consult above messages if any) gfp - mine OK - pnfs - missing on minos02 OK - /grid/data on minos01 OK - minos11 - host id up, but funny message about can't exec /local/ups/prd/perl/v5_006_1a/Linux-2/bin/perl:: No such file or directory emacs is no longer a link to xemacs type xemacs to run xemacs tokens jdejong Thur midnight ? missing from all but minos25 interactive login minos26 - host id lsf files are present on all nodes, but cannot use from minos02 minos03 minos04 MINOS02 > bhosts Failed in an LSF library call: Slave LIM configuration is not ready yet ssh logins - miscellaneous ? ---------------- ####### # CVS # ####### cd /tmp export loc=":pserver:anonymous@minoscvs.fnal.gov:/cvs/minoscvs/rep1" cvs -d $loc checkout BubbleSpeak Fails from csf.rl.ac.uk 08:59 Renie restored the /etc/hosts.allow, contining cvs: ALL This restores access, tested at ral. ####### # LSF # ####### Checking all nodes for lsf availability for NODE in $NODES ; do printf "${NODE} `date`\n" ssh ${NODE} '. /usr/local/etc/setups.sh ; setup lsf ; bhosts minos25' ; done Fails on minos02 minos03 minos04 Try running a little job on each : for NODE in $NODES ; do printf "${NODE} `date`\n" ssh ${NODE} '. /usr/local/etc/setups.sh ; setup lsf ; bhosts minos25' ; done ####### # AFS # ####### 10:05 Rennie left phone message 13:15 Rennie Scott setting up test fix on minos21 14:49 Rennie updated minos22 as a test, looks OK 17:20 - pushed out everywhere bad minos01-10 bad minos12-20 bad minos23-24 OK minos11 minos21 minos22 minos25 minos26 ########### # MINOS26 # ########### MIN > ssh minos26 1208: Disconnecting: Protocol error: didn't expect packet type 34 This is a problem only from my desktop ? 
MIN > rpm -qf /usr/bin/ssh openssh-clients-3.5p1f11-1rh7x should be openssh-clients-3.5p1f12-1SL3 Into minos26 via minos01 Needed mount of /local/scratch26 /grid/data /grid/app Started processes per HOWTO.monitor Ran sam_test_py minos sam_test_py minos dev Tested loon per HOWTO.genpy Declared and undeclared and located per HOWTO.sam Started crontab at 17:37 Next iteration at 19:06 ./predator ####### # SSH # ####### mcgowan using crypto-card ( problem with account ? ) annah using crypto-card from local host at UCL to minos09 OK to flxi0* ssh(6258) Permission denied [annah@localhost ~]$ ssh -Y OpenSSH_4.3p2-4.cern-hpn, OpenSSL 0.9.7a Feb 19 2003 skips directly to keyboard-interactive zarko using crypto-card dnieper to minos07/08, tried 1-20 Linux 2.4 OpenSSH_3.6.1p2, SSH protocols 1.5/2.0, OpenSSL 0x0090701f skips directly to keyboard-interactive kreymer@csf.rl.ac.uk /usr/kerberos/bin/kinit kreymer@FNAL.GOV ssh -2 -v minos01.fnal.gov fails ssh -2 -v flxi04.fnal.gov fails ssh -1 -v flxi04.fnal.gov OK user fermilab principal lcgui0359.gridpp.rl.ac.uk to minos01 can connect to flxi02 OpenSSH_3.6.1p2-CERN20030917, SSH protocols 1.5/2.0, OpenSSL 0x0090701f minos01 debug1: Doing challenge response authentication. debug1: No challenge. Permission denied. flxi02 debug1: Trying Kerberos v5 authentication. debug3: Trying to reverse map address 131.225.68.42. debug1: Kerberos v5 authentication accepted. debug1: Kerberos v5 TGT forwarding failed: KDC can't fulfill requested option debug1: Requesting pty. debug3: tty_make_modes: ospeed 38400 Present off-site connections include ######## # TIME # ######## for NODE in $NODES ; do printf "\n${NODE} `date`\n"; ssh ${NODE} "printf \"${NODE} \" ; date" ; done ============================================================================= 2007 08 16 ############ # SADDRECO # ############ Now that all of daikon_00 is declared to mcin in dev, let's proceed with mcout, via saddreco.20070707... ############ # SHUTDOWN # ############ root@minos-mysql1 # /etc/init.d/mysql stop Shutting down MySQL. [ OK ] # /etc/init.d/mysql start Starting MySQL/bin/bash: /root/.bashrc: Permission denied # /etc/init.d/mysql stop Shutting down MySQL. [ OK ] [ OK ] sam@minos-sam03/2/1 ups stop sam_bootstrap no shrc/kreymer on minos-sam01 ? dangling processes on minos-sam01, killed UID PID PPID C STIME TTY TIME CMD sam 12507 1 0 May31 ? 00:00:00 /bin/sh /home/sam/products/sam_bootstrap/v6_1_2/NULL/bin/run.sh start logger log_prd v4_2_0 --stdout=no --info=/dev/nu sam 12671 12507 0 May31 ? 00:07:36 SamLogServer --port=40583 --host-alias=minos-sam01.fnal.gov --log=/home/sam/private/logger__minos-sam01__log_prd/log - 26317 ? S 0:06 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/minos26free_log 9020 ? S 0:00 sleep 3600 14909 ? S 0:43 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/oracle/topdb_log minosdev 16344 ? S 0:00 sleep 600 14908 ? S 0:41 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/oracle/topdb_log minosprd 16458 ? S 0:00 sleep 600 14783 ? S 1:08 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/pnfs_log 16541 ? S 0:00 sleep 300 14781 ? S 0:34 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/ftp_log 16361 ? 
S 0:00 sleep 600 ########### # STARTUP # ########### minos01 pserver needed to be started, missing script, Then export CVSROOT=:pserver:minoscvs@minoscvs.fnal.gov:/cvs/minoscvs/rep1 cvs checkout -P Candidate cvs checkout: authorization failed: server minoscvs.fnal.gov rejected access to /cvs/minoscvs/rep1 for user minoscvs cvs checkout: used empty password; try "cvs login" with a real password cvs checkout failed. started up sam servers on minos-sam01, looks OK started up sam servers on minos-sam03 started up sam servers on minos-sam02, after minosora3 firmware fix Pending issues : pserver password ? cvcspserver offsite access ? Some users cannot log in , perhaps ? 1 with cryptocard minos11 login problem minos24 down minos26 login problem 18:23 started mysqld /home/room1/lsf empty on all but minos01 ============================================================================= 2007 08 15 ######### # ADMIN # ######### Preparing for all-day shutdown tomorrow predator MINOS26 > echo 'crontab -r' | at 03:30 mcimport M26 > echo 'crontab -r' | at 03:30 job 21 at 2007-05-24 03:30 corral SRV1> echo 'mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT' \ | at 03:30 ######### # MYSQL # ######### See LOG.mysql Making space on disk, copying PULSERDRIFT.DAT to samread@minos-sam02 ######### # ADMIN # ######### Survey local mail stash on Minos Cluster for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'du -sm /var/spool/mail/*' done > AFSK/minos/maint/vsmail.20070815 Highlights, filtering out lsfadm minos01 Wed Aug 15 14:04:45 CDT 2007 1 /var/spool/mail/arms 1 /var/spool/mail/brebel 2 /var/spool/mail/buckley 1 /var/spool/mail/howcroft 1 /var/spool/mail/jyuko 1 /var/spool/mail/kreymer 1 /var/spool/mail/mcgo0109 225 /var/spool/mail/niki 1 /var/spool/mail/saranen minos02 Wed Aug 15 14:04:47 CDT 2007 1 /var/spool/mail/arms 204 /var/spool/mail/rubin minos03 Wed Aug 15 14:04:49 CDT 2007 1 /var/spool/mail/ahimmel minos04 Wed Aug 15 14:04:50 CDT 2007 1 /var/spool/mail/admarino 1 /var/spool/mail/bspeak 0 /var/spool/mail/kreymer 1 /var/spool/mail/shepelak minos05 Wed Aug 15 14:04:51 CDT 2007 1 /var/spool/mail/arms minos06 Wed Aug 15 14:04:59 CDT 2007 1 /var/spool/mail/kreymer minos07 Wed Aug 15 14:05:01 CDT 2007 1 /var/spool/mail/zarko minos08 Wed Aug 15 14:05:07 CDT 2007 1 /var/spool/mail/admarino 2 /var/spool/mail/jdejong minos09 Wed Aug 15 14:05:10 CDT 2007 minos10 Wed Aug 15 14:05:15 CDT 2007 minos11 Wed Aug 15 14:05:18 CDT 2007 1 /var/spool/mail/arms 38 /var/spool/mail/jdejong 23 /var/spool/mail/rhatcher minos12 Wed Aug 15 14:05:20 CDT 2007 1 /var/spool/mail/arms 1 /var/spool/mail/boehm minos13 Wed Aug 15 14:05:23 CDT 2007 1 /var/spool/mail/boehm 1 /var/spool/mail/rhatcher minos14 Wed Aug 15 14:05:27 CDT 2007 minos15 Wed Aug 15 14:05:30 CDT 2007 minos16 Wed Aug 15 14:05:33 CDT 2007 minos23 Wed Aug 15 14:05:50 CDT 2007 1 /var/spool/mail/arms minos24 Wed Aug 15 14:05:52 CDT 2007 minos25 Wed Aug 15 14:05:54 CDT 2007 1 /var/spool/mail/root minos26 Wed Aug 15 14:05:57 CDT 2007 3 /var/spool/mail/buckley 0 /var/spool/mail/kreymer 1 /var/spool/mail/mindata Let me clean up my own stuff : for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'ls -l /var/spool/mail/kreymer' ; done minos01 Wed Aug 15 14:08:30 CDT 2007 -rw------- 1 kreymer mail 5220 May 25 2006 /var/spool/mail/kreymer Four predator pid warnings, all May 24/25 2006. 
Removed minos04 Wed Aug 15 14:08:34 CDT 2007 -rw------- 1 kreymer mail 0 Jun 14 2006 /var/spool/mail/kreymer minos06 Wed Aug 15 14:08:36 CDT 2007 -rw------- 1 kreymer mail 391083 Apr 21 2006 /var/spool/mail/kreymer kinit messages from predator and afssum from Apr 5 through Apr 21 2006 Removed with pine minos26 Wed Aug 15 14:09:07 CDT 2007 -rw------- 1 kreymer mail 0 Mar 28 17:09 /var/spool/mail/kreymer The nontrivial files are minos01 Wed Aug 15 14:04:45 CDT 2007 225 /var/spool/mail/niki - copied by niki minos02 Wed Aug 15 14:04:47 CDT 2007 204 /var/spool/mail/rubin - removed by rubin minos11 Wed Aug 15 14:05:18 CDT 2007 38 /var/spool/mail/jdejong - removed by jdejong 23 /var/spool/mail/rhatcher Sent email to these users, suggesting an extra copy before the upgrades. ########## # SADDMC # ########## Declared another file in development ./saddmc.20070815 --declare -n 1 daikon_00 near/daikon_00/L010000N/129 Beam has been taken as REFILE[15:22] needs to change to get things like L010185N_bfldx113 , from files like n13014009_0006_L010185N_D00_bfldx113.reroot.root That's easy to get from the directory L010185N_bfldx113, a mess to get from the file. ls /pnfs/minos/mcin_data/near/daikon_00/L010185N_bfldx113/400 ls /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00/L010185N_bfldx113/sntp_data/400 Cut up to first '.' If < 5 _ fields, take third [2] If 5 _ fields, take 3_5 [2]+'_'+[4] ./saddmc.20070815 --verify daikon_00 near/daikon_00/L010000N/129 -n 1 -v ./saddmc.20070815 --verify daikon_00 near/daikon_00/L010185N_bfldx113/400 -n 1 -v find /pnfs/minos/mcin_data/near/daikon_00 -type d | wc -l 221 find /pnfs/minos/mcin_data/near/daikon_00 -type f -name \*.reroot.root | wc -l 17520 Checking rate ./saddmc.20070815 --verify daikon_00 near/daikon_00/L010185N_bfldx113/400 ... Needed 99 files, Rate was 3.472 STARTED Wed Aug 15 19:38:26 2007 FINISHED Wed Aug 15 19:38:56 2007 D00DIRS=`find /pnfs/minos/mcin_data/near/daikon_00 -type d | cut -f 5- -d /` for D00DIR in ${D00DIRS} ; do echo $D00DIR ; done 19:48:54 UTC to 21:25:07 UTC for D00DIR in ${D00DIRS} ; do ./saddmc.20070815 --verify daikon_00 ${D00DIR} ; done \ > /var/tmp/saddvard00.log 2>&1 & MINOS26 > grep Rate /var/tmp/saddvard00.log | wc -l 184 mv /var/tmp/saddvard00.log ../log/saddmc/D00.ver.log ./saddmc.20070815 --declare daikon_00 near/daikon_00/L010185N_lowi/140 \ >> ${HOME}/minos/log/saddmc/D00.log 2>&1 & 21:31:07 UTC to 23:52:04 UTC for D00DIR in ${D00DIRS} ; do ./saddmc.20070815 --declare daikon_00 ${D00DIR} ; done \ >> ${HOME}/minos/log/saddmc/D00.log 2>&1 & ############# # MINOSORA3 # ############# Firmware upgrades scheduled for tomorrow around 14:30. ######### # MYSQL # ######### ============================================================================= 2007 08 14 ########## # SADDMC # ########## saddmc.20070815 Dropped mcout_data from path. Restored --addloc qualifier Added samAdmin.addPnfsTapeLocation Added printout of SAM version Resumed testing, picking a small directory, ./saddmc.20070815 -v --verify daikon_00 mcin_data/near/daikon_00/L010000N/129 ./saddmc.20070815 --declare -n 1 daikon_00 /pnfs/minos/mcin_data/near/daikon_00/L010000N/129 -v OK - declared n13011290_0005_L010000N_D00.reroot.root /pnfs/minos/mcin_data/near/daikon_00/L010000N/129(voc553.456) OOPS , addLocation error in n13011290_0005_L010000N_D00.reroot.root CLASS SamException.SamExceptions.DataStorageLocationNotFound INSTANCE Location with name '/pnfs/minos/mcin_data/near/daikon_00/L010000N/129' not found. 
STARTED Tue Aug 14 21:22:37 2007 FINISHED Tue Aug 14 21:22:39 2007 Corrected this with --addloc, which worked using addPnfsTapeLocation ######### # ADMIN # ######### Doing an X11/gimp scan, found messages like this from minos24 : executable not found: '/usr/lib/gimp/2.0/plug-ins/print' Created working directory on minos24 mkdir -p /var/tmp/kreymer/.gimp-2.0/ ######## # FARM # ######## Howie ran 105 more files for jobs 3 and 8, leaving a partial run n13011055 Flushed this : ./roundup -M -f 0 -s n13011055 -r cedar_phy_srsafitter mcnear ./roundup -M -f 0 -s n13011055 -r cedar_phy_srsafitterbx113 mcnear ############ # SADDRECO # ############ SRV1> ls READ | grep '^n' | wc -l 2889 SRV1> ls READ | grep '^f' | wc -l 1238 SRV1> ls READ | grep '^F2' | wc -l 498 Adding missing locations found in the cleanup PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=corbaname::minos-sam01.fnal.gov:9010 ./saddreco near cedar_phy_srsafitterbx113 2005-08 addloc OK - add location N00008433_0002.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-08(voc678.57) OK - add location N00008433_0002.spill.sntp.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/sntp_data/2005-08(vo4349.129) ./saddreco near cedar_phy_srsafitterbx113 2005-09 addloc OK - add location N00008451_0001.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(voc678.2) OK - add location N00008451_0000.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(voc678.3) OK - add location N00008675_0001.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(vo2180.284) OK - add location N00008454_0014.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(voc678.4) OK - add location N00008669_0018.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(vo2180.168) OK - add location N00008669_0023.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(vo2180.266) OK - add location N00008436_0006.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(voc678.1) OK - add location N00008612_0013.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-09(vo2180.245) Added 8 locations ./saddreco near cedar_phy_srsafitterbx113 2005-10 addloc OK - add location N00008695_0007.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.321) OK - add location N00008988_0008.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.164) OK - add location N00008972_0008.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.166) OK - add location N00008905_0010.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.302) OK - add location N00008905_0011.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.249) OK - add location 
N00009000_0017.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.163) OK - add location N00008920_0017.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-10(vo2180.247) Added 7 locations ./saddreco near cedar_phy_srsafitterbx113 2005-11 addloc OK - add location N00009219_0015.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.4) OK - add location N00009098_0009.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.467) OK - add location N00009059_0009.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.282) OK - add location N00009238_0015.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.6) OK - add location N00009059_0005.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.285) OK - add location N00009059_0003.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.327) OK - add location N00009238_0010.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.2) OK - add location N00009059_0010.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.301) OK - add location N00009059_0019.spill.cand.cedar_phy_srsafitterbx113.0.root /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data/2005-11(vo2180.294) Added 9 locations SRV1> ls READ | grep '^n' | wc -l 2889 SRV1> ls READ | grep '^f' | wc -l 1238 SRV1> ls READ | grep '^F2' | wc -l 498 Let's clear out the MDC files, for cleanliness, as we have no present plans to put these in SAM mkdir READ/MDC mv READ/F2* READ/MDC/ ============================================================================= 2007 08 13 ######## # FARM # ######## One srsafitter mcnear file pending n13014038_0008_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitter.root ZAPRUNS n13014038_0008_L010185N_D00_bfldx113 1 2007-07-30 11:28:55 fnpc262 n13014038_0008_L010185N_D00_bfldx113 1 2007-07-30 11:46:14 fnpc210 n13011015_0007_L010185N_D00 1 2007-08-09 16:00:23 fnpc232 howie is back from vacation, going through large backlog of email ######## # FARM # ######## rhatcher corrected defective row in cedar database connecting by mysql --user=writer --password=###### --host=fnpcsrv1.fnal.gov --port=3307 cedar mysql> select * from SPILLTIMENDVLD where SEQNO=700003590; +-----------+---------------------+---------------------+--------------+---------+------+-------------+---------------------+---------------------+ | SEQNO | TIMESTART | TIMEEND | DETECTORMASK | SIMMASK | TASK | AGGREGATENO | CREATIONDATE | INSERTDATE | +-----------+---------------------+---------------------+--------------+---------+------+-------------+---------------------+---------------------+ | 700003590 | 2007-08-05 18:00:00 | 2007-08-05 18:00:00 | 1 | 1 | 3 | -1 | 2007-08-05 18:00:00 | 2007-08-08 02:30:15 | +-----------+---------------------+---------------------+--------------+---------+------+-------------+---------------------+---------------------+ 1 row in set (0.00 sec) mysql> delete from SPILLTIMENDVLD where SEQNO=700003590; Query OK, 1 row affected (0.11 sec) ######### # MYSQL # 
######### ~rhatcher/public_html/MySQLRefCard.ps ############ # SADDRECO # ############ Preparing for MC declares, worried about READ and SAM/READ sizes SRV1> ls READ/SAM | wc -l 12644 SRV1> ls READ | wc -l 4887 Most of this is MC, will be moved to READ/SAM, but this should not immediately break anything SRV1> ls READ | grep ^n | wc -l 2864 SRV1> ls READ | grep ^f | wc -l 1238 And for cleanup, SRV1> ls READ | grep ^N | wc -l 243 SRV1> ls READ | grep ^F | wc -l 541 Catching up, ls READ | grep ^N | grep \.cedar\\. roundup -m '2007-04' -r cedar near 57 files ... 2007 08 14 roundup -m '2005-10' -r cedar near roundup -m '2005-11' -r cedar near roundup -m '2006-12' -r cedar near roundup -m '2007-01' -r cedar near roundup -m '2007-05' -r cedar near roundup -m '2007-01' -r cedar_phy near OOPS, need location for N00011455_0023.spill.cand.cedar_phy.0.root sam add location --file=N00011455_0023.spill.cand.cedar_phy.0.root \ --loc='/pnfs/minos/reco_near/cedar_phy/cand_data/2007-01(voc503.742)' rm READ/N00011669_0000.cosmic.sntp.cedar_phy.0.root.bck this was an editor backup file, containing .cedar. parents verified single parent, and sam metadata, then cleaned up one old stray mv READ/N00008463_0019.spill.sntp.cedar.0.root READ/SAM/N00008463_0019.spill.sntp.cedar.0.root FILE=N00012135_0013.cosmic.cand.cedar.0.root mv READ/${FILE} READ/SAM/${FILE} FILE=N00012135_0021.cosmic.cand.cedar.0.root mv READ/${FILE} READ/SAM/${FILE} And more cleanup in FAR, SRV1> ls READ | grep ^F | grep -v \.cedar_phy_srsa | wc -l 541 ls READ | grep ^F0 | grep \.cedar\\. roundup -m '2005-04' -r cedar far roundup -m '2005-05' -r cedar far roundup -m '2005-07' -r cedar far roundup -m '2007-01' -r cedar far ls READ | grep ^F0 | grep \.cedar_phy\\. roundup -m '2006-07' -r cedar_phy far verified parents, and moved to SAM : for FILE in F00035862_0000.spill.bntp.cedar_phy.0.root F00035862_0000.spill.sntp.cedar_phy.0.root F00035868_0000.spill.bntp.cedar_phy.0.root F00035868_0000.spill.sntp.cedar_phy.0.root ; do mv READ/${FILE} READ/SAM/${FILE} ; done Lots of srsafitterbx113 locations messed up, export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do ./samtapeloc /pnfs/minos/reco_near/cedar_phy_srsafitterbx113 ${REL} ; done Corrected locations ============================================================================= 2007 08 10 ############# # CHECKLIST # ############# queues plots still stale in dcache All the 9* pools are offline, but they are not in any groups at present ######### # BATCH # ######### for N in ${NS} ; do bsub -q minos \ '[ `hostname` = "minos24.fnal.gov" ] && { hostname ; sleep 120 ; }' done JIDI=371274 to JIDF=371313 JID=${JIDI} ; while [ ${JID} -le ${JIDF} ] ; do bjobs -l ${JID} | grep "CPU\|Started" (( JID = JID + 1 )) ; done 371314 371353 Jim Fromm suggests lsf wrapper looks like #! /bin/sh $LSB_TRAPSIGS $LSB_RCP1 $LSB_RCP2 $LSB_RCP3 # LSBATCH: User input /afs/fnal.gov/files/home/room1/brebel/gen_iuntuple_cron_sam ExitStat=$? wait # LSBATCH: End user input true exit `expr $? "|" $ExitStat` bsub -q minos 'echo LSB_TRAPSIGS ; echo $LSB_TRAPSIGS' 371354 LSB_TRAPSIGS trap # 15 10 12 2 1 bsub -q minos 'echo $LSB_RCP1 ; echo $LSB_RCP1' bsub -q minos 'echo $LSB_RCP2 ; echo $LSB_RCP2' bsub -q minos 'echo $LSB_RCP3 ; echo $LSB_RCP3' All of these are null Final check : bsub -q minos 'echo LSF STUFF ; echo $LSB_TRAPSIGS; $LSB_RCP1; $LSB_RCP2 ; $LSB_RCP3 ; echo LSF STUFF' LSF STUFF trap # 15 10 12 2 1 LSF STUFF So the effective script is #! /bin/sh trap # 15 10 12 2 1 # LSBATCH: User input pwd ExitStat=$? 
wait # LSBATCH: End user input true exit `expr $? "|" $ExitStat` As a test, ran a few slow jobs on other nodes, bsub -q minos '[ `hostname` != "minos24.fnal.gov" ] && { hostname ; sleep 120 ; }' MINOS26 > cat ~/.lsbatch/1186759470.371387 #! /bin/sh $LSB_TRAPSIGS $LSB_RCP1 $LSB_RCP2 $LSB_RCP3 # LSBATCH: User input [ `hostname` != "minos24.fnal.gov" ] && { hostname ; sleep 120 ; } ExitStat=$? wait # LSBATCH: End user input true exit `expr $? "|" $ExitStat` Note from laszlo at 14:00, try again, a 40 run shot one job did run OK on minos24 Ran 2 more passes, things are OK The problem was the lack of files in /usr/afsws ( all are symlinks to /usr/sbin ) Now looking at the difference in RPM's between minos24 and minos25 : On minos25, found set LC_COLLATE=C undid this. rpm -qa | sort > minos24.rpmlis rpm -qa | sort > minos24.rpmlis env | sort > minos24.env env | sort > minos24.env minos24 seems to be missing a2ps firefox ghostscript gimp gv java nedit screen seamonkey-nspr seamonkey-nss tetex-xdvi thunderbird vim-common vim-enhanced apel-xemacs xemacs xemacs-common xemacs-el xemacs-info xemacs-sumo xml-common xorg-x11-tools zz_a2ps_stdout zz_emacs_link zz_ntp_configure zz_tex_tweaks I'm sure we should have gv, ghostview, ============================================================================= 2007 08 09 ######### # ADMIN # ######### Created HOWTO.upgrade for draft upgrade plan ######### # BATCH # ######### Testing minos24, included in minos14-24 for N in 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 ; do bsub -q minos "sleep 300 ; hostname" ; done 370875 to 370904 Ran on 14 xx 15 x 16 17 x 18 xx 19 xx 20 x 21 x 22 xx 23 xx 24 25 only 14/30 came back NS='' N=0 while [ ${N} -lt 40 ] ; do (( N = N + 1 )) ; NS="${NS} ${N}" ; done for N in ${NS} ; do bsub -q minos "sleep 120 ; hostname" ; done 370934 to 370973 14 xx 15 x 16 xx 17 x 18 xxx 19 xx 20 x 21 x 22 xx 23 xx 24 25 only 17/40 came back Ran load on minos24/25 up to 4 with 4 instances each of ( while true ; do true ; done ) & for N in ${NS} ; do bsub -q minos "sleep 120 ; hostname" ; done 370974 thru 371013 JID=370974 ; while [ ${JID} -le 371013 ] ; do bjobs -l ${JID} | grep "CPU\|Started" (( JID = JID + 1 )) ; done OK, everything ran Now look back, JID=370934 ; while [ ${JID} -le 370973 ] ; do bjobs -l ${JID} | grep "CPU\|Started" (( JID = JID + 1 )) ; done No minos25 jobs seen, all the minos24 jobs failed. Now take the load off 24, try again for N in ${NS} ; do bsub -q minos "sleep 120 ; hostname" ; done 371018 to 371057 JID=371018 ; while [ ${JID} -le 371057 ] ; do bjobs -l ${JID} | grep "CPU\|Started" (( JID = JID + 1 )) ; done Removed test load from minos24. ########### # ROUNDUP # ########### roundup.20070809 purge READ/SAM area Added ... 
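The JID polling above is typed out by hand for each block of test jobs. A minimal sketch of a reusable version, assuming the job IDs are contiguous and that bjobs -l reports the execution host on its 'Started' line (as in the greps above); the final tally is just grep -o on the minosNN host names:

# Sketch only - summarize which minos nodes a contiguous block of LSF jobs ran on
JIDI=371018 ; JIDF=371057        # first and last job ID, as in the tests above
JID=${JIDI}
while [ ${JID} -le ${JIDF} ] ; do
    bjobs -l ${JID} | grep "Started"
    (( JID = JID + 1 ))
done | grep -o "minos[0-9]*" | sort | uniq -c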
####### # AFS # ####### Per loiacono, requested volume afs/fnal.gov/files/data/minos/d266 cloned from d239, d188 per lloiaco request for beam systematics work system:administrators rlidwka minos:admin rlidwka minos:beam rlidwka minos rl ####### # AFS # ####### Created minos:beam group NEWGROUP=beam pts creategroup -name kreymer:${NEWGROUP} group kreymer:beam has id -2192 pts membership buckley:minosbeam | sort admarino arms buckley dharris koskinen loiacono messier morfin rhatcher sjc szleper wehmann yumiceva BUSERS=' admarino arms buckley dharris koskinen kreymer loiacono messier morfin rhatcher sjc szleper wehmann yumiceva' for GUSER in ${BUSERS} ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts setfields kreymer:${NEWGROUP} -access SOMar pts membership kreymer:${NEWGROUP} pts examine kreymer:${NEWGROUP} Name: kreymer:nonap, id: -1941, owner: kreymer, creator: kreymer, membership: 5, flags: SOMar, group quota: 0. pts chown kreymer:${NEWGROUP} minos pts examine minos:${NEWGROUP} pts membership minos:${NEWGROUP} ######### # MYSQL # ######### User tinti has been hitting mysql hard, many connections to temp, in batch jobs running on flxb*, since about 10:00 Tuesday Aug 7 ( based on Ganglia, and today's mysqladmin processlist ) ######## # FARM # ######## Matching up new job 8 ( job 3 with srsafitter ) Input seems to be /pnfs/minos/mcin_data/near/daikon_00/L010185N DIR IN OUT3 100 98 98 101 168 168 102 109 109 103 109 109 104 110 66 Job 3 output is under /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00/L010185N ============================================================================= 2007 08 08 ######## # FARM # ######## Flushing month endpoints for srsa processing NEAR 2005-08 N00008433_0002 2005-11 N00009280_0018 FAR 2005-11 F00033077_0002 2006-02 F00033805_0006 2007-01 F00037162_0006 2007-02 F00037709_0000 ./roundup -n -M -W -f 1 -s "N00008433_\|N00009280_" -r cedar_phy_srsafitter near Missing N00009280_0011..spill.sntp.cedar_phy_srsafitter.0.root This was NOSPILL in cedar_phy_srsafitterbx113 So write out the first run : ./roundup -n -M -W -f 1 -s "N00008433_" -r cedar_phy_srsafitter near ./roundup -f 1 -s "N00008433_" -r cedar_phy_srsafitter near ./roundup -n -M -W -f 1 -s "F00033077_\|F00033805_" -r cedar_phy_srsafitter far ./roundup -f 1 -s "F00033077_\|F00033805_" -r cedar_phy_srsafitter far ./roundup -n -M -W -f 1 -r cedar_phy_srsafitterbx113 near ./roundup -f 1 -r cedar_phy_srsafitterbx113 near this clears out cedar_phy_srsafitterbx113 near ./roundup -n -M -W -f 1 -r cedar_phy_srsafitter far ./roundup -f 1 -r cedar_phy_srsafitter far this clears out far Now go back and force out cedar_phy_srsafitter N00009280, given that subrun 11 is NOSPILL in bx113. ./roundup -n -M -W -f 1 -s "N00009280_" -r cedar_phy_srsafitter near ./roundup -f 1 -s "N00009280_" -r cedar_phy_srsafitter near roundup has 149 of 299 subruns in partial runs, missing 150. 
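Every flush above is the same two-step: a -n -M -W dry run, then the identical roundup command for real. A small sketch of that pattern as a shell function; flushrun is a hypothetical name, the flags are passed through unchanged with roundup's own meanings, and it assumes roundup exits nonzero when the dry run finds trouble (not verified here):

# Sketch only - dry run roundup, then run it for real with the same arguments
flushrun () {
    ./roundup -n -M -W "$@" || { echo "OOPS - dry run failed, not flushing" ; return 1 ; }
    ./roundup "$@"
}
# example, matching one of the flushes above
flushrun -f 1 -s "N00008433_" -r cedar_phy_srsafitter near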
MINOS26 > find /pnfs/minos/reco_near/cedar_phy_srsafitter/cand_data -type f | wc -l
1550
MINOS26 > find /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/cand_data -type f | wc -l
1749

1749/1550 = 1.128

The sntp ratio is larger,

MINOS26 > find /pnfs/minos/reco_near/cedar_phy_srsafitter/sntp_data -type f | wc -l
125
MINOS26 > find /pnfs/minos/reco_near/cedar_phy_srsafitterbx113/sntp_data -type f | wc -l
133

133/125 = 1.064

Rustem's POT ratio is 4.64/3.63 = 1.28

Now looking at jobs 4 and 5 ,

MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00/L010185N_bfldx113/cand_data -type f | wc -l
541
MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00/L010185N_bfldx113/cand_data -type f | wc -l
534
MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00/L010185N_bfldx113/sntp_data -type f | wc -l
59
MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00/L010185N_bfldx113/sntp_data -type f | wc -l
55

MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitter/near/daikon_00/L010185N_bfldx113/cand_data -type f > /tmp/srbxc.lis
MINOS26 > find /pnfs/minos/mcout_data/cedar_phy_srsafitterbx113/near/daikon_00/L010185N_bfldx113/cand_data -type f > /tmp/srbxbxc.lis
MINOS26 > for FI in `cat /tmp/srbxc.lis` ; do basename ${FI} ; done | cut -f 1 -d . | sort > /tmp/srbxc.fil
MINOS26 > for FI in `cat /tmp/srbxbxc.lis` ; do basename ${FI} ; done | cut -f 1 -d . | sort > /tmp/srbxbxc.fil
MINOS26 > diff /tmp/srbxbxc.fil /tmp/srbxc.fil
n13014010_0007_L010185N_D00_bfldx113
n13014013_0001_L010185N_D00_bfldx113
n13014032_0006_L010185N_D00_bfldx113
n13014034_0000_L010185N_D00_bfldx113
n13014034_0001_L010185N_D00_bfldx113
n13014048_0003_L010185N_D00_bfldx113
n13014050_0003_L010185N_D00_bfldx113

#######
# DAQ #
#######

F00038583_0000.mdaq.root Tue Aug 7 18:10:35 UTC 2007
had been timing out ( 10 minutes )
This was successfully declared
F00038583_0000.mdaq.root Wed Aug 8 10:09:17 UTC 2007
Most runs simply hung up, at
Processing /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/dbu_sampy.C...
then one got through
Open mysql:odbc://minos-db1.fnal.gov/offline_dev?option=1;
then success.

##########
# DCACHE #
##########

Near dcs got stuck Tuesday,
N070731_000001.mdcs.root Tue Aug 7 10:08:00 UTC 2007
and all other recent dcs files timing out after 600 seconds

FILE=N070731_000001.mdcs.root
DPATH=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/near_dcs_data/2007-08
DFILE=${DPATH}/${FILE}
loon -bq ${HOME}/minos/scripts/firstlastreroot.C ${DFILE}

This hangs, as does dccp.

12:21 - podstvkv reports that it is working
14:10 - I concur, it is working now.
14:11 - ./predator 2007-08 ( to declare near dcs data )
predator is happy, cleared the near dcs backlog

=============================================================================
2007 08 07

#######
# DAQ #
#######

F00038583_0000.mdaq.root Tue Aug 7 18:10:35 UTC 2007
times out in DBU after 10 minutes
size is relatively small, 7 MBytes.
6951697 Aug 7 11:30 F00038583_0000.mdaq.root

###############
# SAMRELOCATE #
###############

Picking up near, similar to far, just a lot more files.

FAR OUTPUT REVIEW

MINOS26 > SAMDIM="DATA_TIER mc-near"
MINOS26 > sam list files --dim="${SAMDIM}" --count
10454 files match the given constraints.
MINOS26 > sam list files --dim="${SAMDIM}" --nosummary | cut -c 16- | sort -u L010000.reroot.root L010170.reroot.root L010185.reroot.root L010200.reroot.root L100200.reroot.root L250200.reroot.root MINOS26 > ls /pnfs/minos/mcout_data/R1_18_2/near cand_data mrnt_data sntp_data snts_data STREAMS=`ls /pnfs/minos/mcout_data/R1_18_2/near` for STREAM in ${STREAMS} ; do printf "${STREAM} " ls /pnfs/minos/mcout_data/R1_18_2/near/${STREAM} | wc -l done cand_data 10518 mrnt_data 1596 sntp_data 11691 snts_data 10485 It seems these files remain in their original unhealthy paths. for STREAM in ${STREAMS} ; do ./samrelocate -n mcout_data/R1_18_2/near/${STREAM} ; done MINOS26 > for STREAM in ${STREAMS} ; do ./samrelocate -n mcout_data/R1_18_2/near/${STREAM} ; done NOOP STARTED Tue Aug 7 21:48:26 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/near/cand_data 10518 FILES 10401 OK locations 0 fixed locations 117 files undeclared 10518 / 10518 STARTED Tue Aug 7 21:48:26 2007 FINISHED Tue Aug 7 21:51:51 2007 NOOP STARTED Tue Aug 7 21:51:51 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data 1596 FILES 0 OK locations 0 fixed locations 1596 files undeclared 1596 / 1596 STARTED Tue Aug 7 21:51:51 2007 FINISHED Tue Aug 7 21:52:24 2007 NOOP STARTED Tue Aug 7 21:52:24 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/near/sntp_data 11691 FILES 10401 OK locations 0 fixed locations 1290 files undeclared 11691 / 11691 STARTED Tue Aug 7 21:52:24 2007 FINISHED Tue Aug 7 21:56:13 2007 NOOP STARTED Tue Aug 7 21:56:14 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/near/snts_data 10485 FILES 10401 OK locations 0 fixed locations 84 files undeclared 10485 / 10485 STARTED Tue Aug 7 21:56:14 2007 FINISHED Tue Aug 7 21:59:35 2007 MINOS26 > SAMDIM="MC.RELEASE carrot_06" MINOS26 > sam list files --dim="${SAMDIM}" --count 41657 files match the given constraints. TIERS="cand-near mrnt-near sntp-near snts-near mc-near" for TIER in $TIERS ; do printf "${TIER} " SAMDIM="DATA_TIER ${TIER}" sam list files --dim="${SAMDIM}" --count done cand-near 106024 files match the given constraints. mrnt-near 915 files match the given constraints. sntp-near 74283 files match the given constraints. snts-near 57045 files match the given constraints. mc-near 10454 files match the given constraints. for TIER in $TIERS ; do printf "${TIER} " SAMDIM="DATA_TIER ${TIER} and MC.RELEASE carrot_06" sam list files --dim="${SAMDIM}" --count done cand-near 10401 files match the given constraints. mrnt-near No files match the given constraints. sntp-near 10401 files match the given constraints. snts-near 10401 files match the given constraints. mc-near 10454 files match the given constraints. sum is 41657, matches all carrot_06 files. ######### # ADMIN # ######### Discussing TiBS backups of minos-sam02 with run2-sys See http://computing.fnal.gov/site-backups/ TiBS cost is 1000/system minimum ( under 250 GB ) $5K + $4K/TB over 1.5 at large scale. ######### # ADMIN # ######### minos23 upgrade to SLF 4 brebel jobs running, MINOS26 > bjobs -u brebel | grep minos24 368175 brebel UNKWN minos flxi04.fnal minos24.fna *_cron_sam Aug 7 09:45 368182 brebel UNKWN minos flxi04.fnal minos24.fna *_cron_sam Aug 7 09:45 These seem to have slowed down, cpu usage dropped sharply around 9:45, when these started. copied mcgowan files to minos23 , as he uses this system time tar cvf /tmp/mcgowan.tar . 
real 5m40.668s user 0m0.000s sys 0m0.000s MINOS24 > time scp -c blowfish /tmp/mcgowan.tar minos23:/tmp/mcgowan.tar real 5m46.597s user 0m1.530s sys 0m36.970s MINOS23 > tar xvf /tmp/mcgowan.tar He has moved this to his own directory I have removed my copy. ============================================================================= 2007 08 06 ########### # ENSTORE # ########### Found several stray files in /pnfs/minos MINOS26 > ls -l /pnfs/minos | grep 42411 -rw-r--r-- 1 42411 e875 598580 Jul 19 12:45 aaa2 -rw-r--r-- 1 42411 e875 598580 Jul 19 12:50 neha10 -rw-r--r-- 1 42411 e875 56 Jul 19 12:50 test11 -rw-r--r-- 1 42411 e875 56 Jul 17 13:52 test5 -rw-r--r-- 1 42411 e875 56 Jul 18 13:41 test7 -rw-r--r-- 1 42411 e875 56 Jul 18 14:55 test90 -rw-r--r-- 1 42411 e875 5 Jul 18 12:01 try.txt Neha Sharma gave a talk on FermiGrid Matchmaking at the 26 March Grid Users meeting. UID 42411 belongs to user minospro on fnpcsrv1. But that user has no .k5login. SRV1> id minospro uid=42411(minospro) gid=5111(numi) groups=5111(numi) That number matches the e875 group. SRV1> dds ~minospro/gramsave total 96 drwxr-xr-x 2 root root 2048 Aug 5 04:26 ./ drwxr-xr-x 4 minospro numi 2048 Jun 17 04:24 ../ -rw-r--r-- 1 root root 3902 Jun 17 04:24 minospro.7.tar.gz 2007 08 07 from neha : These files were created by me when I was testing gPlazma setup on Fermi dCache. I had to try transfers under multiple VOs and MINOS was one of them. I should have cleaned these up...anyways since I no longer need them, you can delete them. FILES='aaa2 neha10 test11 test5 test7 test90 try.txt' for FILE in ${FILES} ; do ls -l /pnfs/minos/${FILE} ; done for FILE in ${FILES} ; do rm /pnfs/minos/${FILE} ; done ######### # POWER # ######### 07:00 power out, all CR and office nodes shut down 07:30 power is up, Urish is bringing up CR fnpcsrv1 was down before 06:30, for its move from fcc2 to fcc1 09:30 Urish has updated and rebooted all consoles, after fixing problems with drivers and xorg.conf ############# # CHECKLIST # ############# Someone dumped in a peak over 3000 stores, from Sat 4 Aug 18:00 through Sun 5 Aug 06:00 Stage plot has not updated since Jul 31 Sunday, Predator found many files not having SAM tape locations Mostly like 149 F00033455_0004.all.cand.cedar_phy_srsafitter.0.root 4 267 N00009626_0022.cosmic.sntp.cedar_phy.0.root 3 1640 F00037703_0000.all.sntp.cedar_phy_srsafitter.0.root 4 These were all picked up in this morning's saddcache run. 
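Checking which files still lack SAM tape locations, as Predator did above, can also be done directly with sam locate. A minimal sketch, assuming one file name per line in a hypothetical /tmp/files.lis, the usual setup sam environment, and that a tape location shows up as a /pnfs path in the locate output (as it does elsewhere in this log); what sam locate prints for a file with no location at all is not verified here, so treat this only as a quick filter:

# Sketch: report files that do not yet show a /pnfs tape location in SAM
# /tmp/files.lis is a hypothetical list, one file name per line
while read FILE ; do
    sam locate ${FILE} 2>/dev/null | grep -q "/pnfs/minos" || echo "NO TAPE LOCATION ${FILE}"
done < /tmp/files.lis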
#############
# SADDCACHE #
#############

ln -sf saddcache.20070806 saddcache # was 20060802
Added -n NOOP option cutting off ENCP access
Changed -n to -b (bail) per standard usage elsewhere

########
# FARM #
########

We seem to be producing many Merged.*.root files of about the same size,

MINOS26 > ls -l /grid/data/minos/minfarm/WRITE/Mer*.root
-rw-r--r-- 1 10871 e875 674179014 Aug 4 06:59 /grid/data/minos/minfarm/WRITE/Merged.10548.root
-rw-r--r-- 1 10871 e875 674178488 Aug 4 19:52 /grid/data/minos/minfarm/WRITE/Merged.20597.root
-rw-r--r-- 1 10871 e875 674177416 Aug 2 18:09 /grid/data/minos/minfarm/WRITE/Merged.21292.root
-rw-r--r-- 1 10871 e875 674177204 Aug 5 12:20 /grid/data/minos/minfarm/WRITE/Merged.2396.root
-rw-r--r-- 1 10871 e875 674177186 Aug 5 18:20 /grid/data/minos/minfarm/WRITE/Merged.24558.root
-rw-r--r-- 1 10871 e875 674177475 Aug 6 06:28 /grid/data/minos/minfarm/WRITE/Merged.31723.root
-rw-r--r-- 1 10871 e875 674176426 Aug 4 13:51 /grid/data/minos/minfarm/WRITE/Merged.3321.root
-rw-r--r-- 1 10871 e875 674176973 Aug 3 22:41 /grid/data/minos/minfarm/WRITE/Merged.3825.root
-rw-r--r-- 1 10871 e875 674177091 Aug 5 06:53 /grid/data/minos/minfarm/WRITE/Merged.384.root
-rw-r--r-- 1 10871 e875 674176117 Aug 4 14:18 /grid/data/minos/minfarm/WRITE/Merged.4235.root
-rw-r--r-- 1 10871 e875 674176900 Aug 5 01:10 /grid/data/minos/minfarm/WRITE/Merged.5583.root
-rw-r--r-- 1 10871 e875 674176756 Aug 3 19:39 /grid/data/minos/minfarm/WRITE/Merged.8649.root
-rw-r--r-- 1 10871 e875 674176464 Aug 6 00:22 /grid/data/minos/minfarm/WRITE/Merged.9810.root

All within 3 KBytes ( sort -n -k 5,5 )
674176117 ... 674179014

Perhaps hadd is failing ?
Look into this when fnpcsrv1 comes back up.
Another of these at 12:23.

Found this in cedar_phy_srsafitterbx113near.log

OK adding n13014007_0000_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root 11
NSFIL SSIZ MSIZ DSIZ 11 688851515 604729958 8412155
OOPS, concatenated file size discrepancy, 8412155 gt 1500000
OK adding n13014008_0000_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root 11
NSFIL SSIZ MSIZ DSIZ 11 707703829 674177823 3352600
OOPS, concatenated file size discrepancy, 3352600 gt 1500000

Has been a problem since Jul 31

Looking in HADDLOG/2007-08/cedar_phy_srsafitterbx113mcnear.log
n13014007_0004_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root is truncated at 23863296 bytes: should be 67577636
n13014007_0005_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root is truncated at 57376768 bytes: should be 67304921
n13014008_0009_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root is truncated at 30310400 bytes: should be 68090967

FILES='
n13014007_0004_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root
n13014007_0005_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root
n13014008_0009_L010185N_D00_bfldx113.sntp.cedar_phy_srsafitterbx113.root
'
for FILE in ${FILES} ; do ls -l /grid/data/minos/mcnearcat/${FILE} ; done

These files were written during the quota problems on July 28.
They are defective. Moving them out of the way.
mkdir /grid/data/minos/minfarm/BAD for FILE in ${FILES} ; do mv /grid/data/minos/mcnearcat/${FILE} /grid/data/minos/minfarm/BAD/${FILE} ; done rm /grid/data/minos/minfarm/WRITE/Merged.*.root ============================================================================= 2007 08 04 Sat ####### # DAQ # ####### Preparing for Monday power outage ssh -ax -l root minos-beamdata 'echo "shutdown -h now" | at 06:30 Aug 06' ssh -ax -l root minos-rc 'echo "shutdown -h now" | at 06:32 Aug 06' ssh -ax -l root minos-evd 'echo "shutdown -h now" | at 06:34 Aug 06' ssh -ax -l root minos-acnet 'echo "shutdown -h now" | at 06:36 Aug 06' ssh -ax -l root minos-om 'echo "shutdown -h now" | at 06:38 Aug 06' ######## # FARM # ######## concatenation is not keeping ahead very well, with old Merged files sitting in WRITE, and many files sitting there over a day : SRV1> dds -tr /grid/data/minos/minfarm/WRITE/Mer* -rw-r--r-- 1 minfarm numi 674177416 Aug 2 18:09 /grid/data/minos/minfarm/WRITE/Merged.21292.root -rw-r--r-- 1 minfarm numi 674176756 Aug 3 19:39 /grid/data/minos/minfarm/WRITE/Merged.8649.root -rw-r--r-- 1 minfarm numi 674176973 Aug 3 22:41 /grid/data/minos/minfarm/WRITE/Merged.3825.root -rw-r--r-- 1 minfarm numi 674179014 Aug 4 06:59 /grid/data/minos/minfarm/WRITE/Merged.10548.root SRV1> dds -tr /grid/data/minos/minfarm/WRITE | grep minfarm Aug 3 12:07 F00038519_0006.all.sntp.cedar.0.root Aug 3 12:09 F00038519_0006.spill.bntp.cedar.0.root Aug 3 12:09 F00038522_0000.spill.bntp.cedar.0.root Aug 3 12:09 F00038519_0006.spill.sntp.cedar.0.root Aug 3 12:26 n13023036_0002_L010185N_D00.sntp.cedar.root Aug 3 12:30 n13023039_0002_L010185N_D00.sntp.cedar.root ... Aug 3 13:09 N00008336_0002.cosmic.sntp.cedar_phy.1.root Aug 3 13:09 N00008336_0011.cosmic.sntp.cedar_phy.1.root Aug 3 13:11 N00009607_0006.cosmic.sntp.cedar_phy.0.root ... MINOS26 > ./dc_stat /pnfs/minos/reco_far/cedar/sntp_data/2007-07/F00038519_0006.all.sntp.cedar.0.root LEVEL 2 2,0,0,0.0,0.0 :c=1:b56e7d1d;h=yes;l=436286249; w-stkendca12a-6 r-stkendca18a-4 MINOS26 > ./dc_stat /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/303/n13023036_0002_L010185N_D00.sntp.cedar.root LEVEL 2 2,0,0,0.0,0.0 :c=1:f715f3f6;h=yes;l=127520458; w-stkendca12a-6 r-stkendca18a-4 MINOS26 > ./dc_stat /pnfs/minos/reco_near/cedar_phy/sntp_data/2005-08/N00008336_0002.cosmic.sntp.cedar_phy.1.root LEVEL 2 2,0,0,0.0,0.0 :c=1:28865ffc;h=yes;l=172820100; w-stkendca12a-6 r-stkendca18a-4 Reported this to dcache-admin round 12:30, via email Ticket 102064 podstvkv restarted 5 pools around 19:07 ============================================================================= 2007 08 03 ########### # MONTHLY # ########### HOWTO.monthly - created from tail of this log. ############### # SAMRELOCATE # ############### Finally, running in earnest, will do dev,int,prd Review old LOG entry around 2006 07 16 ./saddmc --declare carrot_08 mcin_data/far/carrot/L010185 ./saddmc --declare R1_18_2 mcout_data/R1_18_2/far ./saddmc --declare carrot_06 mcin_data/near/carrot_06/L010200 ./saddmc --declare R1_18_2 mcout_data/R1_18_2/near partial, followed by ./saddmc --declare -s L010200 R1_18_2 mcout_data/R1_18_2/near \ 2>&1 | tee -a ../log/saddmc/mcout-R1_18_2-near-prd.log STARTING WITH FAR, DO NEAR LATER INPUT REVIEW MINOS26 > SAMDIM="DATA_TIER mc-far" MINOS26 > sam list files --dim="${SAMDIM}" --count 471 files match the given constraints. 
MINOS26 > sam list files --dim="${SAMDIM}" --nosummary | cut -c 16- | sort -u L010185.reroot.root L100200N_D00.reroot.root ./samrelocate -v -n -b 3 mcin_data/far/carrot/L010185 MINOS26 > SAMDIM="DATA_TIER mc-near" MINOS26 > sam list files --dim="${SAMDIM}" --count 10454 files match the given constraints. MINOS26 > sam list files --dim="${SAMDIM}" --nosummary | cut -c 16- | sort -u L010000.reroot.root L010170.reroot.root L010185.reroot.root L010200.reroot.root L100200.reroot.root L250200.reroot.root Checking input areas MINOS26 > ./samrelocate -n mcin_data/far/carrot/L010185 NOOP STARTED Fri Aug 3 20:11:38 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcin_data/far/carrot/L010185 NFILES 1341 441 OK locations 0 fixed locations 900 files undeclared 1341 / 1341 STARTED Fri Aug 3 20:11:38 2007 FINISHED Fri Aug 3 20:12:05 2007 for DIR in `ls /pnfs/minos/mcin_data/near/carrot_06` ; do ls /pnfs/minos/mcin_data/near/carrot_06/${DIR} ; done Need to clean out files in /pnfs/minos/mcin_data/near/carrot_06/L100200/BAD Still, we can check locations : for DIR in `ls /pnfs/minos/mcin_data/near/carrot_06` ; do ./samrelocate -n mcin_data/near/carrot_06/${DIR} ; done All are OK... double checked these in production, still ok OUTPUT REVIEW ls /pnfs/minos/mcout_data/R1_18_2/far/carrot L010185 L010185_RSCT0 L010185_RSCT2 L010185_tau_test L100200 L250200 BASE=mcout_data/R1_18_2/far/carrot/L010185 for DIR in `ls /pnfs/minos/${BASE}` ; do ./samrelocate -n ${BASE}/${DIR} ; done MINOS26 > ./samrelocate -q ${BASE}/sntp_data QUIET STARTED Fri Aug 3 21:26:46 2007 Declaring to SAM dev 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data 1969 FILES 0 OK locations 441 fixed locations 1528 files undeclared 1969 / 1969 STARTED Fri Aug 3 21:26:46 2007 FINISHED Fri Aug 3 21:27:41 2007 MINOS26 > setup sam -q int MINOS26 > ./samrelocate -q ${BASE}/sntp_data QUIET STARTED Fri Aug 3 21:45:29 2007 Declaring to SAM int 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data 1969 FILES OOPS , addLocation error for f21001047_0000_L010185.sntp.R1_18_2.root CLASS SamException.SamExceptions.DataStorageLocationNotFound INSTANCE Location with name '/pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data' not found. 
STARTED Fri Aug 3 21:45:29 2007 FINISHED Fri Aug 3 21:45:31 2007 MINOS26 > ./samtapeloc /pnfs/minos/mcout_data/R1_18_2 int MINOS26 > ./samtapeloc /pnfs/minos/mcout_data/R1_18_2 prd MINOS26 > sam add location --file=f21001047_0000_L010185.sntp.R1_18_2.root --loc='/pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data(vo7033,427)' MINOS26 > ./samrelocate -q ${BASE}/sntp_data QUIET STARTED Fri Aug 3 21:57:36 2007 Declaring to SAM int 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data 1969 FILES 1 OK locations 440 fixed locations 1528 files undeclared 1969 / 1969 STARTED Fri Aug 3 21:57:36 2007 FINISHED Fri Aug 3 21:58:36 2007 MINOS26 > setup sam -q prd MINOS26 > ./samrelocate -q ${BASE}/sntp_data QUIET STARTED Fri Aug 3 22:01:28 2007 Declaring to SAM prd 999999 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/sntp_data 1969 FILES 0 OK locations 441 fixed locations 1528 files undeclared 1969 / 1969 STARTED Fri Aug 3 22:01:28 2007 FINISHED Fri Aug 3 22:02:39 2007 MINOS26 > setup sam -q dev MINOS26 > ./samrelocate -q ${BASE}/snts_data 1968 FILES 0 OK locations 441 fixed locations 1527 files undeclared 1968 / 1968 ########### # MONTHLY # ########### DATASETS 8/3 PREDATOR 8/3 SADDRECO 8/3 VAULT 8/3 MYSQL 8/ ######## # FARM # ######## Cleanup of duplicates/healed runs Added missing location grep N00012596_0002.spill.cand.cedar.0.root ../CFL/CFL sam add location --file=N00012596_0002.spill.cand.cedar.0.root --loc='/pnfs/minos/reco_near/cedar/cand_data/2007-07(voc583.451)' LOG/2007-08/cedarnear.log ./roundup -s "N00012463\|N00012620" -r cedar near updated roundup to complete all partial runs from now on. Now clean out duplicates : YEMO=`date +%Y-%m` cd LOG/${YEMO} for LOG in `ls` ; do less ${LOG} ; done FILES=' mcfarcat/f20011014_0009_CosmicMu_D02.sntp.cedar.root mcnearcat/n13012001_0006_L010185N_D00.sntp.cedar.root nearcat/N00009635_0007.cosmic.sntp.cedar_phy.1.root nearcat/N00008486_0000.spill.sntp.cedar_phy_srsafitter.0.root ' MINOS26 > for FILE in ${FILES} ; do ls -l /grid/data/minos/${FILE} ; done -rw-rw-r-- 1 1334 e875 61303092 Jul 2 19:15 /grid/data/minos/mcfarcat/f20011014_0009_CosmicMu_D02.sntp.cedar.root -rw-rw-r-- 1 1334 e875 67187759 Jul 21 01:20 /grid/data/minos/mcnearcat/n13012001_0006_L010185N_D00.sntp.cedar.root -rw-rw-r-- 1 1334 e875 30460492 Jun 4 18:05 /grid/data/minos/nearcat/N00009635_0007.cosmic.sntp.cedar_phy.1.root -rw-rw-r-- 1 1334 e875 63165604 Aug 1 15:45 /grid/data/minos/nearcat/N00008486_0000.spill.sntp.cedar_phy_srsafitter.0.root MINOS26 > for FILE in ${FILES} ; do mv /grid/data/minos/${FILE} /grid/data/minos/minfarm/DUP/ ; done ########### # ROUNDUP # ########### roundup.20070803 - concatenate when have+new files count is correct, to complete runs which have been partially forced cp AFSS/roundup.20070803 . 
ln -sf roundup.20070803 roundup # was roundup.20070802 ######## # FARM # ######## GDM usage is 301/400, manualy purge most file in WRITE, 13:15 ./roundup -w -M -r cedar_phy_srsafitter near usage down to 258/400 ############ # PREDATOR # ############ Cron had been off since 1 Aug, so did catchup ./predator 2007-08 ============================================================================= 2007 08 02 ############# # SAMLOCATE # ############# example sent to brebel ./samlocate "${SAMDIM}" | while read FILEPATH do FILE=`echo ${FILEPATH} | cut -f 1 -d ' '` FPAT=`echo ${FILEPATH} | cut -f 2 -d ' '` echo FILE/PATH ${FILE} ${FPAT} done ####### # SAM # ####### dev/int oracle security patches ( July ) scheduled 09:00 Up and running around 10:00 dev station and dbserver did not need to be restarted. ######## # FARM # ######## Need cleanup of sam declares... /home/minfarm/ROUNTMP/LOG/2007-07/declare_near_cedar.log N00012596_0002.spill.cand.cedar.0.root LOG/2007-08/cedarmcfar.log DUPE f20011014_0009_CosmicMu_D02.sntp.cedar.root LOG/2007-08/cedarmcnear.log DUPE n13012001_0006_L010185N_D00.sntp.cedar.root LOG/2007-08/cedar_phynear.log many duplicates LOG/2007-08/cedar_phyfar.log several pending runs, long term LOG/2007-08/cedar_phymcfar.log PEND - should flush LOG/2007-08/cedar_phymcnear.log clean LOG/2007-08/cedar_phy_brevmcnear.log clean ######## # FARM # ######## Working on cleanup of special processing : LOG/2007-08/cedar_phy_srsafitternear.log DUPE N00008486_0000.spill.sntp.cedar_phy_srsafitter.0.root LOG/2007-08/cedar_phy_srsafitterfar.log LOG/2007-08/cedar_phy_srsafittermcnear.log LOG/2007-08/cedar_phy_srsafittermcfar.log LOG/2007-08/cedar_phy_srsafitterbx113near.log LOG/2007-08/cedar_phy_srsafitterbx113mcnear.log ######## # MAIL # ######## Mail is stuck on fnpcsrv1, minos26, minos-98193.dhtp Mail -s roundup found duplicates in cedar_phy_srsafitter near on fnpcsrv1 minos-data@fnal.gov killing fnpcsrv1 Mail command manually. Mail does not get stuck when given content in the mail body. echo "TESTING" | Mail -s "hello there" kreymer@fnal.gov ########### # ROUNDUP # ########### roundup.20070802 - added content to body of duplicates email, to avoid the email hangs seen today. cp AFSS/roundup.20070802 . ln -sf roundup.20070802 roundup # was roundup.20070730 ######## # GRID # ######## hadd is running quite slowly, no CPU usage to speak of. 
Read data rates are over 10 MB/second Write rates are good, over 10 MB/second SRV1> time dd if=/dev/zero of=TEST.DAT bs=100M count=1 1+0 records in 1+0 records out real 0m8.871s user 0m0.000s sys 0m1.282s reading/writing is slow, about 1 MB/second SRV1> time cat /grid/data/minos/nearcat/N00009047* > ./TEST.dat du -sm TEST.dat real 4m25.043s user 0m0.045s sys 0m5.634s SRV1> du -sm TEST.dat 383 TEST.dat ######## # GRID # ######## tokens AFSPROD=/afs/fnal.gov/files/code/e875/general/products/ GRIPROD=/grid/app/minos/products time rsync -r ${AFSPROD} ${GRIPROD} --perms --times --links --size-only --delete -v -n db/minos_offline/S07-07-27-R1-26.release.log db/minos_offline/S07-07-27-R1-26.table db/minos_offline/S07-07-27-R1-26.version prd/python/v2_4_sam/Linux-2-4/lib/python2.4/re.pyc prd/python/v2_4_sam/Linux-2-4/lib/python2.4/sre.pyc prd/python/v2_4_sam/Linux-2-4/lib/python2.4/sre_compile.pyc prd/python/v2_4_sam/Linux-2-4/lib/python2.4/sre_constants.pyc wrote 1012360 bytes read 48 bytes 5547.44 bytes/sec total size is 2226133693 speedup is 2198.85 real 3m2.370s user 0m1.870s sys 0m5.740s time rsync -r ${AFSPROD} ${GRIPROD} --perms --times --links --size-only --delete -v MINOS26 > du -sm /afs/fnal.gov/files/code/e875/general/minossoft/* 5534 /afs/fnal.gov/files/code/e875/general/minossoft/packages 435 /afs/fnal.gov/files/code/e875/general/minossoft/releases 16 /afs/fnal.gov/files/code/e875/general/minossoft/setup 1 /afs/fnal.gov/files/code/e875/general/minossoft/srt 1 /afs/fnal.gov/files/code/e875/general/minossoft/temp ============================================================================= 2007 08 01 ############# # CHECKLIST # ############# minos-sam01 load average near 2, CPU at 1/4 ( 1 CPU ) 15:30 - 03:30 , 04:30 to present network active, but 10 KBytes/second 17 processes like PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 22031 sam 15 0 290M 290M 5568 S 4.3 7.2 0:27 1 python /home/sam/products/db_server_base/v3_3_17/NULL/bin/DbListener.py -c=dbs_prd saMINOS26 > sam get dbserver connection info Connection: minfarm@fnpcsrv1.fnal.gov:saddreco_v7_7_1(1008531) Servant creation time: 01-Aug-2007 07:50:00 (CDT) Last method invoked: __init__ (01-Aug-2007 07:50:00 (CDT)) Last method completed in 0.00542998313904 seconds Servant status message: initializing Connection: brebel@flxb31.fnal.gov:sam_v7_6_5(1009387) Servant creation time: 01-Aug-2007 08:02:01 (CDT) Last method invoked: Nov 11eDimensions_v2 (01-Aug-2007 08:02:01 (CDT)) Last method still running. Servant status message: invoking the SQL query for infixString = RUN_TYPE physics% and VERSION ... Connection: brebel@flxb22.fnal.gov:sam_v7_6_5(1009411) Servant creation time: 01-Aug-2007 08:02:22 (CDT) Last method invoked: getReplicaLocationList (01-Aug-2007 08:02:22 (CDT)) Last method completed in 2.46368288994 seconds Servant status message: Marshalling complete in less than 1 second (len: 1) ... and 7 more getReplicaLocationList instances moments later, 12 similar new current connections. MINOS26 > bjobs -u brebel | grep RUN | wc -l 55 MINOS26 > bjobs -u brebel | wc -l 342 MINOS26 > bjobs -u brebel | grep 'Jul 31 09:45' | wc -l 17 MINOS26 > bjobs -u brebel | grep 'Jul 31 05:' | wc -l 13 MINOS26 > bjobs -u brebel | grep 'Aug 1 05:' | wc -l 310 The sam locates are being done by /afs/fnal.gov/files/home/room1/brebel/gen_iuntuple_cron_sam Each job looks up 1 month of files ( cedar, cedar_phy ), about 22 months each. So this activity results in about a 6 hour delay for each job. 
Watching for dropoff, still heavy with 6 or 7 brebel connections MINOS26 > sam get dbserver connection info | grep brebel See below, saddreco is stuck with excessive timeouts sam_test_py also failed similarly, due to brebel load doing sam locate. sam_test_project is slow, but OK Works after the load drops off at 10:30 SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-far \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_far/cedar_phy/sntp_data/2007-03 " SFILES=`sam list files --dim="${SAMDIM}" --nosummary` printf "${SFILES}\n" | wc -l 48 date ; time { for FILE in ${SFILES} ; do sam locate ${FILE} ; done } N time CPU 1 1' 24" 2 2' 47" 3 1m47 4 2m27 4 2m29 25% u 1% s 5 3m11 25% u 4% s 5 3m19 25% u 5% s sam_test_py normally 7s, retries, OK at end of pass 4 2m38 25% u 3% s stp did got cid after 10 retries, then stuck again 3 2m01 25% u 1% s stp got cid first try, 1 retry, then OK second try needed 1 retry for cid, 1 for cpid 2 1m39 24% u .5% s stp 1/5 passes had retry for cid time 16s 1 1m28 17% u 0 s stp 7/7 passes ok, time 9s 0 stp time 6 s ######## # FARM # ######## Stuck in sampy scripts/saddreco far cedar 2007-07 declare tail /home/minfarm/ROUNTMP/LOG/2007-07/declare_far_cedar.log Needed /pnfs/minos/reco_far/cedar/cand_data/2007-07 Treating 725 files in /pnfs/minos/reco_far/cedar/cand_data/2007-07 RetryHandler.getMetadataRequirementDescriptor()> initial retriable exception TRANSIENT('CORBA.TRANSIENT(omniORB.TRANSIENT_CallTimedout, CORBA.COMPLETED_MAYBE)') RetryHandler.getMetadataRequirementDescriptor()> will retry in 600.00 seconds ... killed corral, then saddreco before the 09:00 shutdown of fnpcsrv1 ######### # ADMIN # ######### minos02 ganglia monitoring is down Last heartbeat 12 days, 20:45:16 ago reported as ticket 101878 Corrected, ganglia looks good, thanks, jonest ######## # FARM # ######## fnpcsrv1 up too late for noon roundup cron Clear out WRITE ./roundup -w -r cedar far declared 2007-07 ./roundup -w -r cedar near declared 2007-07 created corralsrs for the special runs .corralsrs ############# # SAMLOCATE # ############# SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-far \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_far/cedar_phy/sntp_data/2007-03 " ./samlocate "${SAMDIM}" | wc -l 48 time ./samlocate "${SAMDIM}" ( 48 files ) real 0m5.390s SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-far \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_far/cedar_phy/sntp_data/2006-01 " ./samlocate "${SAMDIM}" | wc -l 53 time ./samlocate "${SAMDIM}" SAMDIM=" RUN_TYPE physics% \ and VERSION cedar \ and DATA_TIER sntp-far \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_far/cedar/sntp_data/2006-01 " ./samlocate "${SAMDIM}" | wc -l 745 time ./samlocate "${SAMDIM}" real 0m40.879s ============================================================================= 2007 07 31 ######### # ADMIN # ######### minos-om is back in service, around 13:00 No breakin beyond JIRA accounts. ######## # FARM # ######## Cleared enough backlog to resume running, continuing to clear more backlog. See yesterday's entry. ####### # SAM # ####### Does minos-sam02 need to be backed up ? I think not, an occasional snapshot done privately should be OK. MINOS-SAM02 > du -sm * ... 146 private ... 3212 products 799 products.20051018 2665 products.20060413 ... I have removed the old products areas. 
Moved oracle_client aside to Xoracle_client, everything works OK, So we are really using oracle_instant_client , deleting old clients Moved it back drwxr-xr-x 3 sam 5024 4096 Aug 1 2005 v10_1_0_3_0 drwxr-xr-x 3 sam 5024 4096 Jun 19 2006 v10_2_0_1 drwxr-xr-x 3 sam 5024 4096 Mar 25 2005 v8_1_7a ups undeclare -Y oracle_client v8_1_7a ups undeclare -Y oracle_client v10_1_0_3_0 ups undeclare -Y oracle_client v10_2_0_1 MINOS-SAM02 > du -sm sam 298 sam MINOS-SAM02 > ups list -aK+ sam "sam" "v6_0" "NULL" "" "" "sam" "v7_0_1" "Linux+2.4" "" "" "sam" "v7_0_2c" "Linux+2" "" "" "sam" "v7_0_2" "Linux+2" "" "" "sam" "v7_1_2" "Linux+2" "" "" "sam" "v7_1_10" "Linux+2" "" "" "sam" "v7_2_2" "Linux+2" "" "" "sam" "v7_2_6" "Linux+2" "" "" "sam" "v7_1_9" "Linux+2" "" "" "sam" "v7_3_0" "Linux+2" "" "" "sam" "v7_3_1" "Linux+2" "" "" "sam" "v7_3_4" "Linux+2" "" "" "sam" "v7_4_0a_py24" "Linux+2" "" "" "sam" "v7_4_2" "Linux+2" "" "" "sam" "v7_5_1" "Linux+2" "" "current" "sam" "v8_1_3" "Linux+2" "" "" ups undeclare -Y sam v6_0 for SREL in v7_0_1 v7_0_2c v7_0_2 v7_1_2 v7_1_10 v7_1_9 v7_2_2 v7_2_6 ; do ups undeclare -Y sam ${SREL} ; done for SREL in v7_3_0 v7_3_1 v7_3_4 v7_4_0a_py24 v7_4_2 ; do ups undeclare -Y sam ${SREL} ; done MINOS-SAM02 > du -sm sam 22 sam 269 sam_station ups undeclare -Y sam_station v6_0_1_12 -q GCC-3.1 138 sam_station ============================================================================= 2007 07 30 ########### # ROUNDUP # ########### Added 0 length check for merged file Corrected message regarding ROOTRELS cp AFSS/roundup.20070730 . ln -sf roundup.20070730 roundup # was roundup.20070707 ############# # CHECKLIST # ############# DCache data plots stop Saturday morning around 08:00 http://fndca2a.fnal.gov:8090/dcache/outplot?lvl=0&filename=billing-2007.07.daily.brd.png&day=28&fmt=lin Queue plots last update July 24 00:31:25 http://fndca.fnal.gov/dcache/queue/allpools.jpg minos26 disks are full ( 20 GB free ) ####### # DAQ # ####### MINOS26 > ls -l /pnfs/minos/reco_far/cedar/sntp_data/2007-07/F00038544_0000.spill.sntp.cedar.0.root -rw-r--r-- 1 1334 e875 0 Jul 28 06:07 /pnfs/minos/reco_far/cedar/sntp_data/2007-07/F00038544_0000.spill.sntp.cedar.0.root Cannot investigate till OM comes back online ( has DAQ log archive ) /grid/data/minos/minfarm/WRITE is also 0 length. The file difference check was at 1.5 MBytes, this file was just under. So this slipped through. Removing the damaged 0 length output file from WRITE rm /grid/data/minos/minfarm/WRITE/F00038544_0000.spill.sntp.cedar.0.root rm /pnfs/minos/reco_far/cedar/sntp_data/2007-07/F00038544_0000.spill.sntp.cedar.0.root 2007 08 03 13:26 sam undeclare file F00038544_0000.spill.sntp.cedar.0.root ######### # ADMIN # ######### Report of extra JIRA accounts by saranen, discovered by habig. Shut down system, reported to helpdesk around 10:15. urish is on vacation. 
Note : to enable the firewall, As root setup firewall Typed space in 'enable' option OK iptables -I INPUT 1 -s -j ACCEPT # allow iptables -D INPUT -s -j ACCEPT # delete iptables -L INPUT # list 14:58 - lilstrom finished, waiting for permission to resume usage 21:30 - added my desktop for access, could not add catbox.dhcp ( no address ) for tarupp ( did verify that he is sysadmin, in miscomp ) restored my desktop access For habag will do iptables -I INPUT 1 -s neutrino.d.umn.edu -j ACCEPT aka 131.212.37.31 iptables -D INPUT -s owl.fnal.gov -j ACCEPT iptables -D INPUT -s 131.225.82.83 -j ACCEPT # catbox Alec needs web access from home, iptables -I INPUT 1 -s 71-83-38-121.dhcp.dlth.mn.charter.com -j ACCEPT For Nessus scan , iptables -I INPUT 1 -s shamus.fnal.gov -j ACCEPT 12:25 - disabled firewall, ran scanmenow ( clean ) 12:53 - scheduled newquick scan of MINOS-OM port 8080 node MINOS-OM was required, not minos-om.fnal.gov 13:13 returned to service ######### # ADMIN # ######### Giving habig ( future run coordinator ) access to CR systems beamdata Removed merina ( albert ) from minos-gateway-nd ########## # DCACHE # ########## Looking at logs from roundup, SRM has been down since Noon Sat 28 July SRMClientV1 : put: try # 0 failed with error SRMClientV1 : java.net.SocketException: Connection reset srmtest fails as before. 12:15 Submitted High priority ticket 101747 15:30 - srm is up Clearing space on minos26, with ./mcimport -w kordosky Reenabled crontab.dat, now that SRM is working. Only 50 GB free on disk, but we should recover OK. ######## # GRID # ######## /grid/data and /grid/app are unmounted on minos01 Asked to have them mounted, ticket 101717 ########### # ROUNDUP # ########### ROOTRELS - Added cedar_phy_srsafitter cedar_phy_srsafitterbx113 ####### # SAM # ####### for REL in dev int prd ; do ./reloc -s REL cedar_phy_srsafitter ; done many locations declared for REL in dev int prd ; do ./reloc -s ${REL} cedar_phy_srsafitterbx113 ; done nothing found for REL in dev int prd ; do ./reloc -s ${REL} cedar_phy_mboone ; done several near locations were declared, none far export SAM_ORACLE_CONNECT="samdbs/" for REL in dev int prd ; do setup sam -q ${REL} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.srsafitter samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.safitterbx113 samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.mboone done the last 2/3 were needed ######## # FARM # ######## SRV1> quota -s -v -g numi ... Disk quotas for group numi (gid 5111): Filesystem blocks quota limit grace files quota limit grace blue2:/gpfarm-home 114G 0 500G 1189k 0 0 blue2:/gpfarm-stage 629G 0 1639G 3444k 0 0 blue2.fnal.gov:/fermigrid-data 368G 0 400G 12232 0 0 Need to grab no more than 30 GB first chunk. farmgsum > /tmp/farmgsum.20070730 ./roundup -n -W -M -r cedar_phy_srsafitter far 2>&1 | tee -a /tmp/cpsrsf.lis 15:45 ./roundup -r cedar_phy_srsafitter far repeated to flush WRITE files, Quota was around 380/400 a few files into the file purge. At end, 362/400 . This is not enough to do the next block of 80 GB, Will have to take one smaller chunk. 
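The headroom arithmetic above ( 362/400, is the next 80 GB block too big ) is done by eye from the quota output. A minimal sketch of the same check, assuming the single-line quota -s format shown above with used and limit in the 2nd and 4th columns, in GB:

# Sketch: check fermigrid-data headroom before grabbing the next chunk
# NEED is the size of the next block in GB
NEED=80
USED=`quota -s -v -g numi | grep fermigrid-data | awk '{print $2}' | tr -d 'G'`
LIMIT=`quota -s -v -g numi | grep fermigrid-data | awk '{print $4}' | tr -d 'G'`
FREE=$(( LIMIT - USED ))
echo "fermigrid-data ${USED}G of ${LIMIT}G used, ${FREE}G free, need ${NEED}G"
[ ${FREE} -ge ${NEED} ] || echo "OOPS - not enough room, take a smaller chunk"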
Had planned to then do the next sets : ./roundup -M -c -r cedar_phy_srsafitter mcnear ./roundup -M -c -r cedar_phy_srsafitterbx113 mcnear Cleared out a couple of stray files, previously forced out ./roundup -M -r cedar_phy_brev mcnear Now try to take a bite out of mcnearcat SRV1> ./roundup -n -M -W -s n13011 -r cedar_phy_srsafitterbx113 mcnear OK - processing /grid/data/minos/mcnearcat version 20070730 Mon Jul 30 22:40:52 CDT 2007 OK - processing 442 files OK - stream L010185N_D00.sntp.cedar_phy_srsafitterbx113 OK - 29836 Mbytes in 45 runs ... SRV1> quota -s -v -g numi | grep -2 fermigrid-data 363G 0 400G 11801 0 0 OK, will use 30 of 37 GB free space. Growth seems to have stopped recently, let's give it a try. ./roundup -M -s n13011 -r cedar_phy_srsafitterbx113 mcnear 2007 07 31 ran again at 07:56, usage dropped fro 363 to 336 . Grabbing the rest, 08:00 ./roundup -M -r cedar_phy_srsafitterbx113 mcnear done at 9:40, 10:46 grab another 36 GB, still have 336/400 used ./roundup -M -r cedar_phy_srsafitter mcnear 14:30 ./roundup -M -w -r cedar_phy_srsafitterbx113 mcnear usage 336 -> 313 Now get the last large chunk of data, 147 GB 15:00 ./roundup -M -r cedar_phy_srsafitter mcfar 16:30 clearing out WRITE files ./roundup -M -w -r cedar_phy_srsafitter mcnear usage 313 -> 279 ./roundup -M -w -r cedar_phy_brev mcnear one file ./roundup -M -w -r cedar far pick up 2 stuck files from weekend F00038544_0000.spill.bntp.cedar.0.root F00038544_0000.all.sntp.cedar.0.root SRV1> for NUM in 9398 7402 10170 16327 ; do ls -l Merged.${NUM}.root ; done -rw-r--r-- 1 minfarm numi 0 Jul 28 18:05 Merged.9398.root -rw-r--r-- 1 minfarm numi 98304 Jul 29 00:08 Merged.7402.root -rw-r--r-- 1 minfarm numi 0 Jul 29 06:05 Merged.10170.root -rw-r--r-- 1 minfarm numi 0 Jul 29 12:05 Merged.16327.root SRV1> for NUM in 9398 7402 10170 16327 ; do rm Merged.${NUM}.root ; done 2007 08 01 purged WRITE, 317 files ./roundup -M -r cedar_phy_srsafitter mcfar usage 305G -> 173 ============================================================================= 2007 07 27 ####### # OSG # ####### OSG users meeting, day 2 ####### # SAM # ####### minos-sam02 is upgraded to SLF 4.4 ups start sam_boostrap Looks good MINOS26 > setup sam -q dev MINOS26 > sam ping dbserver MINOS26 > sam get dbserver info MINOS26 > sam get dbserver connection info MINOS26 > sam locate foo MINOS26 > sam_test_py minos MINOS26 > sam get metadata --file=f21001001_0000_L010185.cand.R1_18_2.root MINOS26 > sam locate f21001001_0000_L010185.cand.R1_18_2.root ============================================================================= 2007 07 26 ####### # OSG # ####### OSG users meeting ####### # SAM # ####### On minos-sam02, as mindata, samread, sam, in samread, rm SAM03.tar ( 5 GB ) tar czvf /tmp/sam02-mindata-20070726.tgz . tar czvf /tmp/sam02-samread-20070726.tgz . tar czvf /tmp/sam02-sam-20070726.tgz . On minos-sam03, cd ARCH/sam02 scp -c blowfish minos-sam02:/tmp/sam02-mindata-20070726.tgz . scp -c blowfish minos-sam02:/tmp/sam02-samread-20070726.tgz . scp -c blowfish minos-sam02:/tmp/sam02-sam-20070726.tgz . ============================================================================= 2007 07 25 ########### # ROUNDUP # ########### On Wed, 25 Jul 2007, Mayly Sanchez wrote: ... > The newer daikon cosmics ND are not being concatenated, these are > needed for the calibration group. 
I have added this to the corral script run via cron, and have started an early run of the concatenation script, around 15:00 ./roundup -M -r cedar_phy mcnear ######## # GRID # ######## tokens AFSPROD=/afs/fnal.gov/files/code/e875/general/products/ GRIPROD=/grid/app/minos/products time rsync -r ${AFSPROD} ${GRIPROD} --perms --times --links --size-only --delete -v -n real 3m18.267s user 0m1.670s sys 0m4.400s time rsync -r ${AFSPROD} ${GRIPROD} --perms --times --links --size-only --delete -v real 0m52.233s user 0m1.770s sys 0m4.580s ############### # SAMRELOCATE # ############### Continuing to work on relocation, per 14 July kordosky email re f21001001_0000_L010185.cand.R1_18_2.root in /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data declared /pnfs/minos/mcout_data/R1_18_2/far/cand_data MINOS26 > for DIR in `ls /pnfs/minos/mcin_data/near/carrot_06` ; do printf "${DIR}" ; ls /pnfs/minos/mcin_data/near/carrot_06/${DIR} | wc -l ; done L010000 1456 L010170 198 L010185 7483 L010200 198 L100200 729 L250200 391 ./samrelocate -v -n -b 3 mcin_data/near/carrot_06/L010200 MCODIR=mcout_data/R1_18_2/far/carrot/L010185/cand_data ./samrelocate -v -n -b 20 ${MCODIR} ... f21101070_0000_L010185.cand.R1_18_2.root '/pnfs/minos/mcout_data/R1_18_2/far/cand_data,717@vo7033' f22001094_0000_L010185.cand.R1_18_2.root '/pnfs/minos/mcout_data/R1_18_2/far/cand_data,1174@vo7033' Had to go to the source for SamReplicaLocation to discover how to parse out the components: SLOC.getLocationType() SLOC.getFullPath() SLOC.getActualDetails() Can pick up 1 test file with MINOS26 > ./samrelocate -n ${MCODIR} MCODIR=mcout_data/R1_18_2/far/carrot/L010185/cand_data -b 13 NOOP BAIL after 13 STARTED Wed Jul 25 15:43:20 2007 Declaring to SAM dev 13 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data NFILES 1969 f21101070_0000_L010185.cand.R1_18_2.root was /pnfs/minos/mcout_data/R1_18_2/far/cand_data 1 fixed locations 12 files undeclared 13 / 1969 sam.eraseFileLocation( filename = , replicalocation = 'string' ) f21101070_0000_L010185.cand.R1_18_2.root was /pnfs/minos/mcout_data/R1_18_2/far/cand_data f21101070_0000_L010185.cand.R1_18_2.root now /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data(vo7033.717) MINOS26 > ./samrelocate ${MCODIR} MCODIR=mcout_data/R1_18_2/far/carrot/L010185/cand_data -b 13 -v BAIL after 13 VERBOSE DATADIR /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data STARTED Wed Jul 25 16:26:57 2007 Declaring to SAM dev 13 Scanning /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data NFILES 1969 FILES f21101070_0000_L010185.cand.R1_18_2.root '/pnfs/minos/mcout_data/R1_18_2/far/cand_data,717@vo7033' SLOC '/pnfs/minos/mcout_data/R1_18_2/far/cand_data,717@vo7033' tape /pnfs/minos/mcout_data/R1_18_2/far/cand_data MssLocationDetails({ 'mssInstance' : 'Fermilab', 'mssName' : 'enstore', 'offset' : 717L, 'volumeLabel' : 'vo7033', }) f21101070_0000_L010185.cand.R1_18_2.root was /pnfs/minos/mcout_data/R1_18_2/far/cand_data f21101070_0000_L010185.cand.R1_18_2.root now /pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data(vo7033.717) OOPS , addLocation error for f21101070_0000_L010185.cand.R1_18_2.root CLASS SamException.SamExceptions.DataStorageLocationNotFound INSTANCE Location with name '/pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data' not found. STARTED Wed Jul 25 16:26:57 2007 FINISHED Wed Jul 25 16:26:59 2007 OOPS, needed to add the storage locations. 
./samtapeloc /pnfs/minos/mcout_data/R1_18_2 dev IFILE=f21101070_0000_L010185.cand.R1_18_2.root SAMLOC=/pnfs/minos/mcout_data/R1_18_2/far/carrot/L010185/cand_data(vo7033.717) sam add location --file=${IFILE} --loc=${SAMLOC} OK, we're good to go ! Ran 13, 14, 100 unlimited in development 1 OK locations 0 fixed locations 12 files undeclared 13 / 1969 1 OK locations 1 fixed locations 12 files undeclared 14 / 1969 2 OK locations 0 fixed locations 12 files undeclared 14 / 1969 2 OK locations 20 fixed locations 78 files undeclared 100 / 1969 22 OK locations 419 fixed locations 1528 files undeclared 1969 / 1969 441 OK locations 0 fixed locations 1528 files undeclared 1969 / 1969 ============================================================================= 2007 07 24 ########## # DCACHE # ########## srm server is down, same as sunday, SRMClientV2 : srmLs, contacting service httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : put: try # 0 failed with error SRMClientV2 : ; nested exception is: java.net.SocketTimeoutException: Read timed out Reported to dcache-admin ticket 101455 at 13:33 Note that fndca2a also crashed this morning sometime, and was replaced with a new system. Wonder if this is somehow related to the srm server problems ? Should not be, as this is just a monitoring system. mindata@minos26 crontab -r minfarm@fnpcsrv1 mv NOCAT.ok NOCAT on 15:30 berg is looking into this 16:31 srmtest working again on fnpcsrv1 16:33 podstvkv reports that the srm server has been restarted. no reason yet, Vladimir is just back from vacation Restored crontab and NOCAT ########### # CLUSTER # ########### Following up on upgrade plans, with timl, > being used? > clisp > f2c > fort77 > maxima > maxima-exec_clisp > maxima-xmaxima > mgdiff MIN > for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'stat /usr/bin/maxima | grep "Access: 2"' ; done minos01 Tue Jul 24 11:36:36 CDT 2007 Access: 2007-07-24 06:07:54.000000000 -0500 minos02 Tue Jul 24 11:36:38 CDT 2007 Access: 2006-04-07 10:03:34.000000000 -0500 minos03 Tue Jul 24 11:36:40 CDT 2007 Access: 2002-10-29 16:21:24.000000000 -0600 clistp : similar, plus minos25 Tue Jul 24 11:39:51 CDT 2007 Access: 2007-07-23 11:15:56.000000000 -0500 MIN > for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'stat /usr/bin/f2c | grep "Access: 2"' ; done minos01 Tue Jul 24 11:41:08 CDT 2007 Access: 2007-07-24 06:07:54.000000000 -0500 minos02 Tue Jul 24 11:41:10 CDT 2007 Access: 2006-04-07 10:03:34.000000000 -0500 minos03 Tue Jul 24 11:41:11 CDT 2007 Access: 2000-07-24 15:51:38.000000000 -0500 most like this ... minos25 Tue Jul 24 11:41:56 CDT 2007 Access: 2007-07-23 11:16:51.000000000 -0500 Conclusion : none of these are being used. ############### # SAMRELOCATE # ############### clone samrelocate from saddmc Usage : samrelocate Action: does a sam locate of each file in the given directory, and corrects the location as necessary. ./samrelocate -v -n mcin_data/near/carrot_06/L010185 ############# # CHECKLIST # ############# o near and far dcs missing since Saturday. o Since July 4, OOPS - no tape location in F00038324_0005.sam.py cd GDAT/fardet_data/2007-07 Files look normal on the surface, lacking tape location Set them aside, should clear in the 9:06 predator cycle. 
MINOS26 > mv F00038324_0005.sam.py F00038324_0005.sam.py.bad MINOS26 > mv F00038324_0005.log F00038324_0005.log.bad The next iteration shows : F00038324_0005.mdaq.root Tue Jul 24 14:13:15 UTC 2007 OOPS - run_dbu is stuck for 600, killing it F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 0 S 1060 10483 10472 0 85 0 - 538 wait4 ? 00:00:00 run_dbu F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 0 S 1060 10499 10483 0 85 0 - 28664 schedu ? 00:00:02 dbu kill 10499 Tue Jul 24 14:23:25 UTC 2007 /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/run_dbu: line 128: 10499 Segmentation fault dbu -bq ${HOME}/minos/scripts/dbu_sampy.C ${FILE} >>${logname} 2>&1 F00038324_0005.sam.py was not generated - check log for error F00038324_0005.log then STARTING Tue Jul 24 16:18:35 UTC 2007 Treating 741 files Scanning 3 files F00038324_0005.mdaq.root Tue Jul 24 16:18:55 UTC 2007 Looks OK for now ! ============================================================================= 2007 07 23 ####### # SAM # ####### Got a slew of farm processing requests from scavan 2 cedar_phy_srsafitter 6 month FD cosmic data 3 cedar_phy_srsafitterbx113 near/daikon_00/L010185N 4 cedar_phy_srsafitter near/daikon_00/L010185N_bfldx113 5 cedar_phy_srsafitterbx113 near/daikon_00/L010185N_bfldx113 6 cedar_phy_srsafitter near data 3 months 2005 spill data 7 cedar_phy_srsafitterbx113 near data 3 months 2005 spill data ./pnfsdirs near cedar_phy_srsafitterbx113 daikon_00 L010185N write ./pnfsdirs near cedar_phy_srsafitter daikon_00 L010185N write Preparing for cedar_phy_srsafitter export SAM_ORACLE_CONNECT="samdbs/" REL=cedar.phy.srsafitter setup sam -q dev samadmin add application family --appFamily=reco --appName=loon --appVersion=${REL} setup sam -q int setup sam -q prd < the following is pending creation of the directories > REL=cedar_phy_srsafitter ./reloc -d -s dev cedar_phy_srsafitter # debug test ./reloc -s dev cedar_phy_safitter ./reloc -s int cedar_phy_safitter ./reloc -s prd cedar_phy_safitter and doing the same for REL=cedar.phy.srsafitterbx113 REL=cedar_phy_srsafitterbx113 ####### # AFS # ####### Requested three new data volumes for NONAP /afs/fnal.gov/files/data/minos/d263 /afs/fnal.gov/files/data/minos/d264 /afs/fnal.gov/files/data/minos/d265 ACL's like system:administrators rlidwka system:anyuser rl minos rl minos:admin rlidwka minos:nonap rlidwka ########## # DCACHE # ########## Tried fermigrid/volatile : SRV1> srmls --debug=true srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos Storage Resource Manager (SRM) CP Client version 1.25 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory SRM Configuration: debug=true ... surl[0]=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos Mon Jul 23 08:21:41 CDT 2007: In SRMClient ExpectedName: host Mon Jul 23 08:21:41 CDT 2007: SRMClient(https,srm/managerv2,true) SRMClientV2 : user credentials are: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 SRMClientV2 : connecting to srm at httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : srmLs, contacting service httpg://stkendca2a.fnal.gov:8443/srm/managerv2 SRMClientV2 : put: try # 0 failed with error ########## # DCACHE # ########## Need to make new grid proxy having the production role, using voms-proxy-init. 
09:10 Stopped mindata@minos26 crontab, and set NOCAT on minfarm@fnpcsrv1 Reviewing notes on proxy-init in HOWTO.srm cd /export/stage/minfarm/.grid voms-proxy-init --help voms-proxy-init \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-voms.proxy \ -valid 8760:0 SRV1> voms-proxy-init \ > -cert kreymer-doe.pem \ > -key kreymer-doekey.pem \ > -out kreymer-voms.proxy \ > -valid 8760:0 Cannot find file or dir: $prefix/etc/vomses Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Enter GRID pass phrase: Creating proxy ................................................. Done Warning: your certificate and proxy will expire Tue Apr 15 11:22:43 2008 which is within the requested lifetime of the proxy OK, now let's try to find some use for fermilab:/fermilab/minos/Role=Production No help from voms-proxy-init, try google voms-proxy-init role http://www.atlasgrid.bnl.gov/GUMS/Presentations/vo-privilege.ppt This gives a clue . Better yet, here is an overall guide : http://www.fnal.gov/docs/products/voprivilege/documents/transition-to-privilege.html Also, trying to set prefix to pick up a vomses file, based on 'locate vomses' SRV1> locate vomses /usr/local/vdt-1.6.1/monitoring/vomses /usr/local/vdt-1.6.1/glite/etc/vomses /opt/glite/etc/vomses prefix=/usr/local/vdt-1.6.1/glite voms-proxy-init \ -voms fermilab:/fermilab/minos/Role=Production \ -cert kreymer-doe.pem \ -key kreymer-doekey.pem \ -out kreymer-voms.proxy \ -valid 8760:0 SRV1> voms-proxy-init -voms fermilab:/fermilab/minos/Role=Production -cert kreymer-doe.pem -key kreymer-doekey.pem -out kreymer-voms.proxy -valid 8760:0 Cannot find file or dir: $prefix/etc/vomses Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Enter GRID pass phrase: Creating temporary proxy ........................................................... Done Contacting fermigrid2.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov] "fermilab" Done Warning: fermigrid2.fnal.gov:15001: validity shortened to 86400 seconds! Creating proxy ........................................... Done Warning: your certificate and proxy will expire Tue Apr 15 11:22:43 2008 which is within the requested lifetime of the proxy SRV1> voms-proxy-info -file kreymer-voms.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : kreymer-voms.proxy timeleft : 6409:43:17 SRV1> voms-proxy-info -all -file kreymer-voms.proxy WARNING: Unable to verify signature! Server certificate possibly not installed. 
Error: Cannot find certificate of AC issuer for vo fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=proxy issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : proxy strength : 512 bits path : kreymer-voms.proxy timeleft : 6408:53:46 === VO fermilab extension information === VO : fermilab subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 issuer : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov attribute : /fermilab/minos/Role=Production/Capability=NULL attribute : /fermilab/Role=NULL/Capability=NULL attribute : /fermilab/minos/Role=NULL/Capability=NULL timeleft : 23:53:29 Great, but this extension expires in a day, and we still have no access to DCache via SRM. 2007 08 29 Note that the Fermi servers ignore the extension expiration, Usage is production is approved by Yocum and Chadwick 11:28 - srm servers were running in duplicate, restarted. ./srmtest works OK now on both minfarm@fnpcsrv1 and mindata@minos26 ============================================================================= 2007 07 22 sunday ############## # DCACHE DAQ # ############## Timur has removed the 9a write pool, we should be OK for data archiving /home/minos/bin/archiver_krb.py [minos@daqdcp bin]$ mv archiver_krb.py archiver_krb.20051103.py ; cp archiver_krb.20051103.py archiver_krb.py Needed to get some kind of useable editor there, lacking x-11 [minos@minos-gateway ~/.ssh]$ scp /usr/bin/pico daqdcp:/home/minos/bin/pico [minos@daqdcp minos]$ cd bin [minos@daqdcp bin]$ pwd /home/minos/bin [minos@daqdcp bin]$ pico archiver_krb.py # It got there successfully # Check file size for confirmation # Wait for ten minutes to make sure that the size info is # known to pnfs # time.sleep(600) # reduced time to 6 seconds, writes to PNFS are now daily 2007 07 22 kreymer time.sleep(6) Started far archiver, MINOS26 > ls -ltr /pnfs/minos/fardet_data/2007-07 | tail -rw-r--r-- 1 buckley e875 18113194 Jul 20 13:01 F00038519_0000.mdaq.root -rw-r--r-- 1 buckley e875 58471373 Jul 20 14:02 F00038519_0001.mdaq.root -rw-r--r-- 1 buckley e875 17846369 Jul 20 15:02 F00038519_0002.mdaq.root -rw-r--r-- 1 buckley e875 18032433 Jul 20 16:03 F00038519_0003.mdaq.root -rw-r--r-- 1 buckley e875 58357421 Jul 20 17:04 F00038519_0004.mdaq.root -rw-r--r-- 1 buckley e875 17871600 Jul 20 18:05 F00038519_0005.mdaq.root -rw-r--r-- 1 buckley e875 17892471 Jul 22 15:29 F00038519_0006.mdaq.root -rw-r--r-- 1 buckley e875 58425513 Jul 22 15:30 F00038519_0007.mdaq.root -rw-r--r-- 1 buckley e875 17976883 Jul 22 15:31 F00038519_0008.mdaq.root -rw-r--r-- 1 buckley e875 0 Jul 22 15:31 F00038519_0009.mdaq.root The non-delayed writes seem to be OK Do the same to near : [minos@daqdcp-nd bin]$ mv archiver_krb.py archiver_krb.20051103.py ; cp archiver_krb.20051103.py archiver_krb.py [minos@minos-gateway-nd ~]$ scp /usr/bin/pico daqdcp-nd:/home/minos/bin/pico MINOS26 > ls -ltr /pnfs/minos/neardet_data/2007-07 | tail -rw-r--r-- 1 buckley e875 86724783 Jul 21 05:16 N00012636_0011.mdaq.root -rw-r--r-- 1 buckley e875 86971937 Jul 21 06:17 N00012636_0012.mdaq.root -rw-r--r-- 1 buckley e875 86965023 Jul 21 07:18 N00012636_0013.mdaq.root -rw-r--r-- 1 buckley e875 87279328 Jul 21 08:14 N00012636_0014.mdaq.root -rw-r--r-- 1 buckley e875 87075793 Jul 21 09:15 N00012636_0015.mdaq.root -rw-r--r-- 1 buckley e875 87341372 Jul 22 15:38 N00012636_0016.mdaq.root -rw-r--r-- 1 buckley e875 87419819 Jul 22 15:38 N00012636_0017.mdaq.root -rw-r--r-- 1 
buckley e875 87476859 Jul 22 15:39 N00012636_0018.mdaq.root -rw-r--r-- 1 buckley e875 87978090 Jul 22 15:39 N00012636_0019.mdaq.root -rw-r--r-- 1 buckley e875 87600671 Jul 22 15:39 N00012636_0020.mdaq.root The backlog is clearing quickly. Started with 47 far, 30 near Near cleared at 15:49 Far cleared at 16:02 ########## # DCACHE # ########## Ticket 101327 by rubin Farm srmcp failed starting 15:08, also other srm's from fnpcsrv1 and minos26. ============================================================================= 2007 07 21 ############## # DCACHE DAQ # ############## Far archiver got stuck after : QOL I Fri 20-07-2007 19:02:48 archiver 17667 198.124.213.171 1 104566 run 38519 Processing file F00038519_0006.mdaq.root QOL I Fri 20-07-2007 19:02:48 archiver 17667 198.124.213.171 1 104567 run 38519 Getting credentials QOL I Fri 20-07-2007 19:02:48 archiver 17667 198.124.213.171 1 104568 run 38519 Got credentials QOL I Fri 20-07-2007 19:02:48 archiver 17667 198.124.213.171 1 104569 run 38519 Trying ftp connect to disk cache QOL I Fri 20-07-2007 19:02:49 archiver 17667 198.124.213.171 1 104570 run 38519 Ftp connect succeeded No further archiver messages PID is OK, process is present. Near archiver got stuck after 2007-07-21 09:13:18 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/neardet_data/2007-07/N00012636_0015.mdaq.root QOL I Sat 21-07-2007 10:14:29 archiver 7887 131.225.192.132 1 266411 run 12636 Processing file N00012636_0016.mdaq.root QOL I Sat 21-07-2007 10:14:29 archiver 7887 131.225.192.132 1 266412 run 12636 Getting credentials QOL I Sat 21-07-2007 10:14:32 archiver 7887 131.225.192.132 1 266413 run 12636 Got credentials QOL I Sat 21-07-2007 10:14:32 archiver 7887 131.225.192.132 1 266414 run 12636 Trying ftp connect to disk cache QOL I Sat 21-07-2007 10:14:32 archiver 7887 131.225.192.132 1 266415 run 12636 Ftp connect succeeded Stopped both archivers till the DCache problem is corrected. ########## # DCACHE # ########## Submitted ticket The Minos Far Detector data archiver got stuck around Fri 20-07-2007 19:02:49 The Minos Near Detector data archiver got stuck around Sat 21-07-2007 10:14:32 after connecting to the ftp server. I have halted the near and far detector archivers. Messages from http://fndca3a.fnal.gov/cgi-bin/dcache_files.py look like 425 Cannot open port: java.lang.Exception: Pool error: Pool is disabled We had a similar problem with the 9a pools in the writePool group Friday. I'm guessing that in this case, w-stkendca9a-3 may be at fault. As we are only taking cosmic ray data, I have not called the call-center to have anyone paged after hours. 
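A cron watchdog could catch this failure mode sooner. A minimal sketch, assuming a
30 minute idle threshold and an archiver log path on the DAQ node (both the path and
the mail target are guesses, not taken from the DAQ configuration) :

# sketch - complain if the archiver log has gone quiet for 30 minutes
ALOG=/var/log/daq/archiver.log          # assumed location of the archiver log
IDLE=$(( `date +%s` - `stat -c %Y ${ALOG}` ))
[ ${IDLE} -gt 1800 ] && \
  echo "archiver log idle for ${IDLE} seconds" \
  | mail -s "archiver stuck ?" minos-data@fnal.gov   # list assumed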
============================================================================= 2007 07 20 ######## # FARM # ######## Searching for missing mc output files reported by arms, FILES=" f21011047_0000_L010185N_D00.sntp.cedar.root f21011048_0000_L010185N_D00.sntp.cedar.root f21011064_0000_L010185N_D00.sntp.cedar.root f21011067_0000_L010185N_D00.sntp.cedar.root f21011073_0000_L010185N_D00.sntp.cedar.root f21011077_0000_L010185N_D00.sntp.cedar.root f21011078_0000_L010185N_D00.sntp.cedar.root f21011100_0000_L010185N_D00.sntp.cedar.root " for FILE in ${FILES} ; do grep ${FILE} /afs/fnal.gov/files/home/room1/kreymer/minos/CFL/CFL ; done minos reco_mc_far_cedar VOC177 0000_000000000_0000327 CDMS117296018000000 49339753 2011788958 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/104/f21011047_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000328 CDMS117296018800000 48749561 2076869019 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/104/f21011048_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000355 CDMS117301801500001 52392083 2911298779 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/106/f21011064_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000356 CDMS117301802200000 49670542 3764044698 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/106/f21011067_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000357 CDMS117301863600000 53344993 3465332460 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/107/f21011073_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000401 CDMS117303152100000 52629644 1210362470 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/107/f21011077_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VOC177 0000_000000000_0000402 CDMS117303153100001 54786893 2012291602 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/107/f21011078_0000_L010185N_D00.sntp.cedar.root minos reco_mc_far_cedar VO4049 0000_000000000_0000086 CDMS117321258900000 62377891 4045578954 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/110/f21011100_0000_L010185N_D00.sntp.cedar.root ######## # FARM # ######## Per rubin request, moved three bad mcin files out of the way cd /pnfs/minos/mcin_data/far/daikon_02/CosmicMu mv 106/f20011068_0004_CosmicMu_D02.reroot.root /pnfs/minos/BAD/ mv 107/f20011072_0003_CosmicMu_D02.reroot.root /pnfs/minos/BAD/ mv 108/f20011088_0005_CosmicMu_D02.reroot.root /pnfs/minos/BAD/ ########## # DCACHE # ########## rhatcher reports errors writing via dccp kerberized since 21:43 yesterday, Command failed! Server error message for [1]: "Pool is disabled" (errno 104). Failed open file in the dCache. Can't open destination file : "Pool is disabled" System error: Input/output error Similar problems for farm cand writing In service status page, w-stkendca10a-4 120 msec rest are 44 to 85 msec I also see three Cells offline, GFTP-stkendca15a GFTP-stkendca7a GFTP-stkendca8a This may be unrelated, as they would not affect Robert's kerberized dccp. 9:50 - 10-4 ping down to 77 msec, Looking for the working nodes, with dcs() { ./dc_stat /pnfs/minos/${1} | grep w-stkendca ; } dcs stage/kordosky/n11012016_0007_L010185N_D00-n11012017_0002_L010185N_D00.tar Around 09:45 thing seem to have recovered. Timur removed the '9a' pools from the configuration. 
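A possible extension of the dcs() helper above, to flag any file whose pool list
includes the troublesome 9a pools. A sketch only; it assumes dc_stat prints one
w-stkendca pool per matching line, as in the examples above :

dcsbad() {   # sketch - report files with a copy on a 9a write pool
  for FILE in "$@" ; do
    ./dc_stat /pnfs/minos/${FILE} | grep w-stkendca | grep -q 9a \
      && echo "ON 9A POOL ${FILE}"
  done
}
dcsbad stage/kordosky/n11012016_0007_L010185N_D00-n11012017_0002_L010185N_D00.tar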
=============================================================================
2007 07 19
#######
# DAQ #
#######
10:15 stopped archivers, due to DCache downtime
11:10 started near archiver, file moved OK
11:14 started far archiver, file moved OK
Copies stalled till just after 18:00. Not sure why.
##########
# DCACHE #
##########
Enstore/dcache downtime seems to have started around 07:00, per ftplog
I saw no pnfs interruption in pnfslog.
DCache servers seem to have come back online from 10:13 to 10:26
FTP access is back.
dcap access kerberized and unsecured are back.
10:50 srmls fails :
$ srmls ${SPATH2}
SRMClientV2 : put: try # 0 failed with error
SRMClientV2 : ; nested exception is: java.net.ConnectException: Connection refused
SRMClientV2 : put: try again
ftplog saw one failure at 11:00, OK at 11:10
11:20 - OOPS, get email from kschu that maintenance continues.
Will leave DAQ archivers logging for now, as it should recover cleanly,
and seems to be running properly at present, clearing the backlog.
email bounced from minos-data due to signature attachment.
Re-enabled attachments for the present.
19:58 -
kreymer@minos26 crontab crontab.dat
mindata@minos26 crontab crontab.dat
minfarm@fnpcsrv1 mv NOCAT NOCAT.ok
http://fndca3a.fnal.gov/cgi-bin/dcache_files.py shows a new category,
globus-mapping:(1334.5111)
this seems to include
  read .mdaq.root from farm
  read reroot.root from farm
  write .cand from farm
Raw data is showing up in PNFS since about 18:00, don't know why,
no trace in the ftp log above
logs indicate archiver restart around 18:00 near and far.
PID is wrong in near, will correct :
20:18 echo 7887 >> /var/lock/daq/archiver.pid
20:19 daqmon is happy
########
# GRID #
########
10:28 timm :
The condor upgrade to condor 6.8.5 on the head nodes of the GP Grid cluster is complete.
=============================================================================
2007 07 18
########
# ROOT #
########
From MSD minutes by nwest :
> From Tigran:
> ReadBuffers() with vector read is implemented.
> Today we have released new dcache version 1.7.0-39 with this
> functionality.
> I have tried with root version 5.14, 5.15 and 5.16. It's amazing! On
> some
> applications I got up ti 12 times performance increase!
##########
# DCACHE #
##########
Preparing for all-day shutdown tomorrow
predator
MINOS26 > echo 'crontab -r' | at 03:30
mcimport
M26 > echo 'crontab -r' | at 03:30
job 21 at 2007-05-24 03:30
corral
SRV1> echo 'mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT' \
 | at 03:30
#######
# DBU #
#######
Testing R1.15 vs R1.22 timing on older files, per rhatcher query
( want to remove R1.15 )
MINOS26 > cd ${HOME}/minos/test
MINOS26 > TIER=mdaq
MINOS26 > setup_minos -r R1.22
MINOS26 > IFILE=F00034242_0013
MINOS26 > DATADIR=fardet_data/2006-03
MINOS26 > ../scripts/run_dbu dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/${DATADIR}/${IFILE}.${TIER}.root
Ran for 15 minutes, produced output eventually.
Watchdog would have killed this.
Log file was present up to this point when stalled :
...
Snarls 1996954 (19122455) NonSnarls 119669 (108808) [MISMATCH]
TermCode 1 Errors 0 TimeFrames 59834 Dropped 0 Consistency 0x1
MINOS26 > setup_minos R1.15
MINOS26 > time ../scripts/run_dbu dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/${DATADIR}/${IFILE}.${TIER}.root
real    12m49.637s
user    10m28.330s
sys     0m8.220s
=============================================================================
2007 07 17
#########
# EMAIL #
#########
For minos-data and minos-sam-users,
was Sizelim= 409600
set Sizelim= 10000
    Attachments= no
##########
# XROOTD #
##########
gmieg adjusted minimum retention from 10 hours to 3 minutes.
Space is purged when disk usage hits 80%, purging down to 60%.
This is visible in ganglia.
Users of /local/scratch07 ( du -sm )
 4550 avva
    1 brebel
  340 bspeak
  185 daikon
 9143 deb4
 2305 dharma
 1545 ebarnes
 3804 giurgiu
76556 gmieg
    1 hartnell
 2280 hjkang
  349 howcroft
 8677 jdejong
    5 kreymer
    1 kschu
   85 li
  938 loiacono
  348 mdier
  530 mskim
11130 niki
  211 panos
11231 petyt
  136 pjl
    1 rhatcher
  263 rustem
  851 tagg
 4129 thosieck
   80 yumiceva
Strange... MRTG reports up to 6 MBytes/second sustained from 04:00 to 16:30
Ganglia concurs.
But the CPU load on minos07 spikes to 14 from just after 12:00, to 16:00
Rustem admits to running 20 clients at once,
only about 7 of which seem to be visible to LSF.
=============================================================================
2007 07 16
##########
# XROOTD #
##########
Rustem reports hundreds of files not readable with xrootd, such as
stout_near_2005_05:Error in : open attempt failed on root://minos07.fnal.gov//stage/N00007751_0003.spill.sntp.cedar_phy.0.root
...
stout_near_2007_03:Error in : open attempt failed on root://minos07.fnal.gov//stage/N00011998_0000.spill.sntp.cedar_phy.0.root
copied from email to maint/xrootdbad.txt
gmieg replied that the problem was a minimum 10 hour retention policy
on the xrootd cache. He has changed this to 3 minutes minimum.
#######
# LSF #   minos cluster batch
#######
Per EAG Ops report, note existence of xlsbatch, an X-11 batch queue viewer.
##########
# SADDMC #
##########
Wow, rediscovered that all of carrot_06 and carrot_08 mcin and mcout
had been declared to production back a year ago, 2006 06 15/16
The files have since been renamed to subdirectories,
file locations need to be adjusted.
=============================================================================
2007 07 13
#######
# LSF #   minos cluster batch
#######
Ticket 98153
Nodes minos14 through 18 started working properly Friday afternoon.
Tested with bsub -q minos "echo `date` `hostnam` >> ${HOME}/lsf/lsf.log ; sleep 120 ; hostname" ############ # MCIMPORT # ############ New files to arrive from sjc, n100BRRRR_SSSS_CosmicMu_D03.reroot.root MINOS26 > ./pnfsdirs near cedar_phy daikon_03 CosmicMu write Fri Jul 13 08:28:40 CDT 2007 STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_03/CosmicMu OK - created /pnfs/minos/mcin_data/near/daikon_03/CosmicMu FAMSET mcin_near_daikon_03 FAMILY mcin_near OOPS - need file family mcin_near_daikon_03 OK - setting family to mcin_near_daikon_03 OUTPUT /pnfs/minos/mcout_data/cedar_phy/near/daikon_03/CosmicMu OK - created /pnfs/minos/mcout_data/cedar_phy/near/daikon_03/CosmicMu OK - created /pnfs/minos/mcout_data/cedar_phy/near/daikon_03/CosmicMu/cand_data OK - created /pnfs/minos/mcout_data/cedar_phy/near/daikon_03/CosmicMu/mrnt_data OK - created /pnfs/minos/mcout_data/cedar_phy/near/daikon_03/CosmicMu/sntp_data FAMSET mcout_cedar_phy_near_daikon_03_cand FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_03_cand OK - setting family to mcout_cedar_phy_near_daikon_03_cand FAMSET mcout_cedar_phy_near_daikon_03_mrnt FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_03_mrnt OK - setting family to mcout_cedar_phy_near_daikon_03_mrnt FAMSET mcout_cedar_phy_near_daikon_03_sntp FAMILY minos OOPS - need file family mcout_cedar_phy_near_daikon_03_sntp OK - setting family to mcout_cedar_phy_near_daikon_03_sntp ######## # FARM # ######## The previous bad pass of /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/sntp_data/310 were not removed from PNFS, roundup is tripping over them. As rubin on fnpcsrv1, rm /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N/sntp_data/310/*.root ============================================================================= 2007 07 12 ######## # GRID # ######## Ganglia is back for FermiGrid ( not CLUBS, Farm ) http://fermigrid2.fnal.gov:801/ganglia/?m=load_one&r=day&s=descending&c=FermiGrid&h=&sh=1&hc=4 ####### # SAM # ####### Trying again later, post 2007 06 08, when predator is idle. ./genpy -l " -r R1.15 " fardet_data/2006-03 HOSTNA=`hostname -s | cut -c 1-5` HOSTNU=`hostname -s | cut -c 6-` LOGPAT=/local/scratch${HOSTNU}/kreymer/log DET=fardet_data MONTH=2006-03 ./sadd ${DET}/${MONTH} declare 2>&1 | tee -a ${LOGPAT}/samadd/${DET}/${MONTH}.log fardet_data/2006-03 STARTED Thu Jul 12 13:50:27 2007 Treating 1414 files OK - declared F00034242_0013.mdaq.root ... 
OK - declared F00034635_0000.mdaq.root Needed to add 16 files STARTED Thu Jul 12 13:50:27 2007 FINISHED Thu Jul 12 13:51:08 2007 ########## # SADDMC # development declares are working now ########## setup sam -q dev MID=mcin_data/far/daikon_00/L100200N ./saddmc -m declare daikon_00 ${MID}/101 MODE declare Processing mcin_data STARTED Thu Jul 12 13:37:04 2007 Declaring to SAM dev daikon_00 declare 999999 Scanning /pnfs/minos/mcin_data/far/daikon_00/L100200N ['101'] Needed /pnfs/minos/mcin_data/far/daikon_00/L100200N/101 Treating 3 files in /pnfs/minos/mcin_data/far/daikon_00/L100200N/101 OK - declared f21411010_0000_L100200N_D00.reroot.root /pnfs/minos/mcin_data/far/daikon_00/L100200N/101(voc328.181) OK - declared f21311010_0000_L100200N_D00.reroot.root /pnfs/minos/mcin_data/far/daikon_00/L100200N/101(voc328.189) OK - declared f21011010_0000_L100200N_D00.reroot.root /pnfs/minos/mcin_data/far/daikon_00/L100200N/101(voc328.194) Needed 3 files, Rate was 1.614 STARTED Thu Jul 12 13:37:04 2007 FINISHED Thu Jul 12 13:37:07 2007 MODE declare Processing mcin_data STARTED Thu Jul 12 13:38:20 2007 Declaring to SAM dev daikon_00 declare 999999 Scanning /pnfs/minos/mcin_data/far/daikon_00/L100200N ['100'] Needed /pnfs/minos/mcin_data/far/daikon_00/L100200N/100 Treating 27 files in /pnfs/minos/mcin_data/far/daikon_00/L100200N/100 ... Needed 27 files, Rate was 2.528 STARTED Thu Jul 12 13:38:20 2007 FINISHED Thu Jul 12 13:38:32 2007 ####### # SAM # ####### Test adding RAL file locations, in development MINOS26 > setup sam -q dev MINOS26 > IFIL=F00028812_0000 MINOS26 > IFILE=${IFIL}.mdaq.root MINOS26 > sam locate $IFILE ['/pnfs/minos/fardet_data/2005-01,1898@vo4919'] MINOS26 > sam add location --file=${IFILE} --loc='ral:/some/where/over/the/rain' Location with name 'ral:/some/where/over/the/rain' not found. MINOS26 > samadmin add pnfs tape location --fullpath='ral:/some/where/over/the/rain' New locationId = 3812 MINOS26 > sam add location --file=${IFILE} --loc='ral:/some/where/over/the/rain' MINOS26 > sam locate $IFILE ['/pnfs/minos/fardet_data/2005-01,1898@vo4919' 'ral:/some/where/over/the/rain,unknown volume'] MINOS26 > sam erase file location --file=${IFILE} --loc='ral:/some/where/over/the/rain' MINOS26 > sam locate $IFILE ['/pnfs/minos/fardet_data/2005-01,1898@vo4919'] ####### # SAM # ####### Tested startup of a minos station in development. server_list.txt configuration is same as production, except name of dbserver is station_dev station started, stager did not. STATION=minos # prd ./sam_test_py ${STATION} Ran successfully We seem not to need a stager. ============================================================================= 2007 07 11 ######### # MYSQL # ######### akumar refreshed development from production this killed the minos-test-station station, not in production ######### # BATCH # ######### minos queue is set up, feeding minos14-18 plus the old minos19-24 bsub -q minos "sleep 60 ; hostname" bsub -q minos "sleep 120 ; hostname" 337545 ... 337564 Try more, and a longer delay for N in 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 ; do bsub -q minos "sleep 300 ; hostname" ; done 337565 - 337594 ############ # MCIMPORT # ############ mcimport.20070711 - expanded DUP message slightly, to reference mcimport.log and DUP and to send full DUP list in email Also, needed to use xargs to grep dup in indexes, due to number of kordosky indexes cp -a AFSS/mcimport.20070711 . 
ln -sf mcimport.20070711 mcimport ######## # FARM # ######## Writing recent R1_24spill MC to PNFS This is a mix of carrot and daikon, just a few files. They go to mcout_data/R1_24spill/far/carrot/L010185/sntp_data mcout_data/R1_24spill/far/daikon_02/CosmicMu/sntp_data/100 I need to write the carrots manually Will do them both, as there are only 5 CosmicMu_D02 files. mkdir /pnfs/minos/mcout_data/R1_24spill/far/daikon_02/CosmicMu/sntp_data/100 chmod 775 /pnfs/minos/mcout_data/R1_24spill/far/daikon_02/CosmicMu/sntp_data/100 setup dcap # kerberized DCPOR=24736 cd /grid/data/minos/mcfarcat # first the carrots FILES=`ls f22*` printf "${FILES}\n" RSPA=minos/mcout_data/R1_24spill/far/carrot/L010185/sntp_data DOUT=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log for FILE in ${FILES} ; do DFIL=${DOUT}/${FILE} PFIL=${POUT}/${FILE} if [ ! -r ${PFIL} ] ; then echo "NEED" ${FILE} dccp ${FILE} ${DFIL} dccp -P ${DFIL} fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log # then the daikons FILES=`ls f20*spill*` printf "${FILES}\n" RSPA=minos/mcout_data/R1_24spill/far/daikon_02/CosmicMu/sntp_data/100 DOUT=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log for FILE in ${FILES} ; do DFIL=${DOUT}/${FILE} PFIL=${POUT}/${FILE} if [ ! -r ${PFIL} ] ; then echo "NEED" ${FILE} dccp ${FILE} ${DFIL} dccp -P ${DFIL} fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spill.log Wed Jul 11 09:11:14 CDT 2007 Will need to wait a few hours, then purge like printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spillpurge.log for FILE in ${FILES} ; do PFIL=${POUT}/${FILE} if [ -r "${PFIL}" ] ; then PINFO=`(cd ${POUT} ; cat ".(use)(4)(${FILE})" | tr '\n' '\t')` ECRC=`printf "${PINFO}" | cut -f 11` if [ -n "${ECRC}" ] ; then LCRC=`ecrc ${FILE} | tr -s ' ' | cut -f 2 -d ' '` echo " ${FILE}" ${LCRC} ${ECRC} [ ${LCRC} = ${ECRC} ] && echo rm ${FILE} && rm ${FILE} fi ; fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spillpurge.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/R1_24spillpurge.log FILES=`ls f22*` RSPA=minos/mcout_data/R1_24spill/far/carrot/L010185/sntp_data DOUT=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} Wed Jul 11 13:23:12 CDT 2007 ######## # FARM # ######## Discussed many PEND issues with Howie, after today's batch meeting. Action items for me : o Remove /grid/data/minos/farcat/*safitter* - DONE - these are obsolete o Force cedar files pre-dating the 2007 concatenation - DONE Far < 37162 Near < 11449 o Force files where we HAVE the missing subruns ( last digit in PEND message ) ########### # ROUNDUP # ########### roundup.20070710 Using ROUNTMP/ROOTRELS to get list of release using a given root cp AFSS/roundup.20070710 . ln -sf roundup.20070710 roundup # wasroundup.20070707 ######## # FARM # ######## Pre-concatenation cedar : N E A R ./roundup -S -s N00008 -r cedar near N00008433_0000.spill.mrnt.cedar.0.root /pnfs/minos/reco_near/cedar/mrnt_data/2005-08(vo2139.100) INSTANCE Location with name '/pnfs/minos/reco_near/cedar/mrnt_data/2005-08' not found. 
./samtapeloc /pnfs/minos/reco_near/cedar/mrnt_data dev /pnfs/minos/reco_near/cedar/mrnt_data /pnfs/minos/reco_near/cedar/mrnt_data/2005-11 /pnfs/minos/reco_near/cedar/mrnt_data/2005-08 /pnfs/minos/reco_near/cedar/mrnt_data/2005-10 /pnfs/minos/reco_near/cedar/mrnt_data/2005-09 ./samtapeloc /pnfs/minos/reco_near/cedar/mrnt_data int ./samtapeloc /pnfs/minos/reco_near/cedar/mrnt_data prd ./roundup -m 2005-08 -r cedar near IFILE=N00008433_0000.spill.mrnt.cedar.0.root SAMLOC="/pnfs/minos/reco_near/cedar/mrnt_data/2005-08(vo2139.100)" sam add location --file=${IFILE} --loc=${SAMLOC} ./roundup -m 2005-08 -r cedar near OK, picked up this tray mrnt from April 13. F A R ./roundup -S -s F0001 -r cedar far ./roundup -S -s F0002 -r cedar far ./roundup -S -s "F00031\|F00034\|F00035\|F00036" -r cedar far ######## # FARM # ######## HAVE cleanup for cedar ./roundup -f 1 -s "F00038283_\|F00038304_" -r cedar far ./roundup -f 1 -s "N00012425_\|N00012428_\|N00012434_" -r cedar near ============================================================================= 2007 07 10 ####### # SAM # ####### Tested using herber 8.2.0 dbserver, OK MINOS26 > export SAM_NAMING_SERVICE_IOR=IOR:000000000000002a49444c3a6f6f632e636f6d2f436f734e616d696e672f4f424e616d696e67436f6e746578743a312e3000000000000001000000000000002c000100000000001064306f7261312e666e616c2e676f7600232800000000000c4e616d655365727669636500 MINOS26 > export SAM_DB_SERVER_NAME=herber.dev:SAMDbServer MINOS26 > sam list files --dim="${SAMDIM}" | sort | head F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root We need to upgrade. ######## # FARM # ######## New processing of R1_24spill announced, for future use in SAM, need to create release r1.24spill ######## # FARM # ######## Prior to rerunning brev, find . -name \*brev.root ./READ/n13023101_0000_L010185N_D00.sntp.cedar_phy_brev.root ./READ/n13023101_0007_L010185N_D00.sntp.cedar_phy_brev.root ./READ/n13023102_0000_L010185N_D00.sntp.cedar_phy_brev.root ./READ/n13023103_0000_L010185N_D00.sntp.cedar_phy_brev.root ./READ/n13023104_0000_L010185N_D00.sntp.cedar_phy_brev.root ./READ/n13023104_0005_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023101_0000_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023101_0007_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023102_0000_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023103_0000_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023104_0000_L010185N_D00.sntp.cedar_phy_brev.root ./ECRC/n13023104_0005_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023101_0000_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023101_0007_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023102_0000_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023103_0000_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023104_0000_L010185N_D00.sntp.cedar_phy_brev.root ./DFARM/n13023104_0005_L010185N_D00.sntp.cedar_phy_brev.root find . -name \*brev.root -exec rm {} \; ####### # SAM # IT 2843 ####### In the Minos production database, when selecting files using CHILD_BY_NAME, extra file names are returned. 
For example, $ FILE=F00030612_0005.spill.bntp.cedar_phy.0.root $ SAMDIM=" DATA_TIER raw-far \ and FILE_NAME like F0003061% \ and CHILD_BY_NAME ${FILE} \ " $ sam list files --dim="${SAMDIM}" --nosummary | sort F00030610_0000.mdaq.root F00030611_0000.mdaq.root F00030612_0000.mdaq.root F00030612_0001.mdaq.root F00030612_0002.mdaq.root F00030612_0003.mdaq.root F00030612_0004.mdaq.root F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root F00030613_0000.mdaq.root F00030613_0001.mdaq.root ... This list should be F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root ########## # SADDMC # ########## Verified that mc.release is already being set, sam get metadata --file=f21311005_0000_L100200N_D00.reroot.root ... 'mc' : CaseInsensitiveDictionary({ 'beam' : 'L100200', 'flavor' : '3', 'release' : 'daikon_00', 'split' : '1', 'volume' : '1', ######### # VAULT # ######### Moving the vault copies to LTO3 cd /pnfs/minos/vault enstore pnfs --tags | grep '^.(tag)(library)' .(tag)(library) = CD-9940B enstore pnfs --library CD-LTO3 [Errno 13] Permission denied: '/pnfs/minos/vault/.(tag)(library)' Sent email to enstore-admin asking them to set the libraries : cd /pnfs/minos/vault enstore pnfs --library CD-LTO3 ============================================================================= 2007 07 09 ####### # AFS # ####### Received a new backed up volume /afs/fnal.gov/files/data/minos/release_data thanks to kevinh ####### # SAM # ####### export SAM_ORACLE_CONNECT for UNIV in dev int prd ; do setup sam -q ${UNIV} samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.brev done New applicationFamilyId = 261 New applicationFamilyId = 70 New applicationFamilyId = 162 for UNIV in dev int prd ; do ./samtapeloc /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00 ${UNIV} done export -n SAM_ORACLE_CONNECT ########### # ROUNDUP # ########### roundup.20070707 Added cedar_phy_brev Changed IOR to short form corbaname::minos-sam01.fnal.gov:9010 cp AFSS/roundup.20070707 . 
ln -sf roundup.20070707 roundup ############ # FNPCSRV1 # ############ Sorted .k5login into .k5login.20070709 Copied original to .k5login.20070116 Copied sorted version to .k5login, tested from kreymer account ============================================================================= 2007 07 08 Sat ########### # MONTHLY # ########### Updated IOR string to friendlier form Was export SAM_NAMING_SERVICE_IOR=IOR:010000002a00000049444c3a6f6f632e636f6d2f436f734e616d696e672f4f424e616d696e67436f6e746578743a312e30000000010000000000000030000000010100b7150000006d696e6f732d73616d30312e666e616c2e676f7600d132230c0000004e616d655365727669636500 Now export SAM_NAMING_SERVICE_IOR="corbaname::minos-sam01.fnal.gov:9010" ############ # SADDRECO # ############ Adding mc reco support on fnpcsrv1 ln -s /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/saddreco.20070707 srmc Normal form is saddreco ${DET} ${REL} ${MON} For MC, ####### # AFS # ####### Requested a new backed up volume /afs/fnal.gov/files/data/minos/release_data ACL's like system:administrators rlidwka system:anyuser rl minos rl minos:admin rlidwka Saturday, July 7, 2007 at 09:50:31 ######## # ROOT # ######## Sue Kasahara benchmarks : root v5.10/00 with file->UseCache(): 3.9 sec cpu, 6 sec real root v5.10/00 without file->UseCache(): 3.5 sec cpu, 163 sec real root v5.12/00 with (or w/o) file->UseCache(): 5.3 sec cpu, 162 sec real root HEAD with tree->SetCacheSize(50000000);: 4.1 sec cpu, 29 sec real ============================================================================= 2007 07 07 ######### # ADMIN # ######### reviewed status of requisition disk CD103354 195772 our $30,500 Project CD Operations Task MINOS-DISK-EQ 50.01.06.04.05.02 PO 576183 25 July cpu CD103358 195773 our $26,000 total $1,432,600 oddone signed 7/10 http://www-bss2.fnal.gov/reqquery/ Activity code 4676 https://appora.fnal.gov/pls/cert/miscomp.miser.bli_html?report_only=y&fiscal_year=2007&focus_this_identifier=4676 MINOS-COMP-OP 50.01.06.04.01.01 MINOS-CPU-EQ 50.01.06.04.06.02 MINOS-DISK-EQ 50.01.06.04.05.02 MINOS-OFF-SOFT-OP 50.01.06.04.02.01 MINOS-SCI-RESRCH-OP 50.01.06.04.06.03 TAPES-MINOS-OP 50.01.10.13 REX-DEPT-INFRA-OP 50.03.05.05.01 ######## # ROOT # ######## pcanal has patched the head and v5.14 branches of root to restore dcache speed. t->SetCacheSize(50000000); ########## # SADDMC # ########## Lost some modifications to HOWTO.saddmc, perhaps saddmc.20070608 due to hangup of my desktop system. Strange, I don't see any .bck or ~ backups of these nedited files. PLAN - declare a small slug of recent mcin with saddmc declare the corresponding mcout, probably using an adapted saddreco (rather than saddmc). then strip the saddreco functions out of saddmc Identify a working data set. 
5905 /pnfs/minos/mcin_data/far/adamo 31714 /pnfs/minos/mcin_data/far/avocado 0 /pnfs/minos/mcin_data/far/beet symlink 297068 /pnfs/minos/mcin_data/far/carrot 1 /pnfs/minos/mcin_data/far/carrot_06_ral empty 456687 /pnfs/minos/mcin_data/far/daikon_00 1 /pnfs/minos/mcin_data/far/daikon_01 empty 757234 /pnfs/minos/mcin_data/far/daikon_02 35475 /pnfs/minos/mcin_data/far/v17 for DIR in adamo avocado carrot daikon_00 daikon_02 v17 ; do printf "${DIR} " ; find /pnfs/minos/mcin_data/far/${DIR} -type f | wc -l done adamo 20 avocado 119 carrot 1900 daikon_00 1846 daikon_02 2543 v17 120 Consistent with CFL summary 7156 1734 mcin_data/far/ 1001 53 mcin_data/fmock/ 34950 8872 mcin_data/near/ 0 0 mcin_data/near_pHE/ 37 1 mcin_data/near_pME/ 246 60 mcin_data/nmock/ 43451 10746 mcin_data Now look for a little subset of far MINOS26 > du -sm /pnfs/minos/mcin_data/far/daikon_00/* 419480 /pnfs/minos/mcin_data/far/daikon_00/L010185N 12893 /pnfs/minos/mcin_data/far/daikon_00/L100200N 24315 /pnfs/minos/mcin_data/far/daikon_00/L250200N MINOS26 > du -sm /pnfs/minos/mcin_data/far/daikon_00/L100200N/* 11601 /pnfs/minos/mcin_data/far/daikon_00/L100200N/100 1292 /pnfs/minos/mcin_data/far/daikon_00/L100200N/101 Looks good, only 30 files, two run directories and 3 in the 101 directory, excellent for verbose testing MID=/pnfs/minos/mcin_data/far/daikon_00/L100200N oops, need to remove /p/m prefix : MID=mcin_data/far/daikon_00/L100200N ./saddmc -m verify -v daikon_00 ${MID}/101 Metadata look OK to my eyeball. ./saddmc -m declare daikon_00 ${MID}/101 ./saddmc -m declare daikon_00 ${MID}/100 OOPS , declare error in f21411010_0000_L100200N_D00.reroot.root CLASS SamException.SamExceptions.DbSQLException INSTANCE INTERNAL ERROR IN DbOracleMessage.convertUniqueConstraint This happened back on 2006 06 16 Try this in integration setp sam -q int ./saddmc -m declare daikon_00 ${MID}/101 OOPS , addLocation error in f21411010_0000_L100200N_D00.reroot.root CLASS SamException.SamExceptions.DataStorageLocationNotFound INSTANCE Location with name '/pnfs/minos/mcin_data/far/daikon_00/L100200N/101' not found. Added locations as per below, ./saddmc -m addloc daikon_00 ${MID}/101 ./saddmc -m declare daikon_00 ${MID}/101 Needed 27 files, Rate was 2.468 STARTED Fri Jul 6 20:56:28 2007 FINISHED Fri Jul 6 20:56:40 2007 SAMDIM=" DATA_TIER mc-far \ " sam list files --dim="${SAMDIM}" --nosummary sam list files --dim="${SAMDIM}" --count 471 files match the given constraints. SAMDIM=" DATA_TIER mc-far \ and VERSION daikon_00 \ " sam list files --dim="${SAMDIM}" --count 30 files match the given constraints. 
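A quick cross-check that each of the newly declared files also has a location,
along the lines of the sam commands used above (a sketch only; SAMDIM as just defined,
and it assumes sam locate prints the pnfs path when a location exists) :

# sketch - list any declared daikon_00 mc-far file lacking a PNFS location
SAMDIM=" DATA_TIER mc-far and VERSION daikon_00 "
for FILE in `sam list files --dim="${SAMDIM}" --nosummary` ; do
  sam locate ${FILE} | grep -q pnfs || echo "NO LOCATION ${FILE}"
done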
########## # SADDMC # ########## Need mcin storage locations Created samtapeloc ./samtapeloc /pnfs/minos/mcin_data/far/daikon_00/L100200N/101 int ./samtapeloc /pnfs/minos/mcin_data/far/daikon_00 dev ./samtapeloc /pnfs/minos/mcin_data/far/daikon_00 int ./samtapeloc /pnfs/minos/mcin_data/far/daikon_00 prd ########## # SADDMC # ########## Vegetables needed to be registered MINOS26 > sam get registered application families | grep gminos ApplicationFamily('simulation', 'gminos', 'carrot_06') ApplicationFamily('simulation', 'gminos', 'carrot') ApplicationFamily('simulation', 'gminos', 'carrot_08') VEG=daikon_00 for UNI in dev int prd ; do setup sam -q ${UNI} export SAM_ORACLE_CONNECT samadmin add application family --appFamily=simulation --appName=gminos --appVersion=${VEG} export -n SAM_ORACLE_CONNECT done New applicationFamilyId = 257 New applicationFamilyId = 66 New applicationFamilyId = 142 VEG=daikon_02 New applicationFamilyId = 258 New applicationFamilyId = 67 New applicationFamilyId = 143 VEG=avocado New applicationFamilyId = 259 New applicationFamilyId = 68 New applicationFamilyId = 144 VEG=beet New applicationFamilyId = 260 New applicationFamilyId = 69 New applicationFamilyId = 145 ########## # MYSQL # ########## Continuing defragmentation tests Strange, write rates to /data are a few MBytes/second, Mysql> time dd if=/dev/zero of=/data/archive/CP/CALADCTOPE.MYD bs=2025107940 count=1 1+0 records in 1+0 records out real 4m49.504s user 0m0.000s sys 0m18.970s [root@minos-mysql1 root]# ${FRAG} /data/archive/CP/CALADCTOPE.MYD /data/archive/CP/CALADCTOPE.MYD: 212969 extents found, perfection would be 16 extents 1+0 records in 1+0 records out real 0m35.672s user 0m0.010s sys 0m10.740s [root@minos-mysql1 root]# ${FRAG} /var/tmp/CALADCTOPE.MYD /var/tmp/CALADCTOPE.MYD: 361 extents found, perfection would be 16 extents Mysql> time cp -a CALADCTOPE.MYD /var/tmp/CALADCTOPE.MYD real 5m20.852s user 0m0.230s sys 0m20.220s [root@minos-mysql1 root]# ${FRAG} /var/tmp/CALADCTOPE.MYD /var/tmp/CALADCTOPE.MYD: 546 extents found, perfection would be 17 extents PLAN - can probably clear 23 GB of space from retired .MYI files, which could be restored with RESTORE TABLE [root@minos-mysql1 root]# ${FRAG} /data/database/retired/PULSERDRIFT.MYD /data/database/retired/PULSERDRIFT.MYD: 1954494 extents found, perfection would be 560 extents [root@minos-mysql1 root]# ${FRAG} /data/database/retired/PULSERDRIFT.MYI /data/database/retired/PULSERDRIFT.MYI: 641899 extents found, perfection would be 212 extents ######## # GRID # ######## # # # warning - these paths are incorrect # # # # # # see 2007 07 06 ### AFSPROD=/afs/fnal.gov/files/code/e875/general/products/db/ ### GRIPROD=/grid/app/minos/products time rsync -r \ ${AFSPROD} ${GRIPROD} \ --perms --times --links --size-only --delete -v OK, this moved upd to products. Not setting -a rlptgo because do not want group, owner propogated AFSPROD=/afs/fnal.gov/files/code/e875/general/products/ GRIPROD=/grid/app/minos/products mkdir ${GRIPROD} wrote 2206955532 bytes read 728980 bytes 3822830.32 bytes/sec total size is 2203906163 speedup is 1.00 real 9m37.029s user 0m22.770s sys 0m56.100s MINOS26 > time rsync -r ${AFSPROD} ${GRIPROD} --perms --times --links --size-only --delete -v building file list ... 
done prd/encp/v3_6d/Linux-2-4-2-3-2/volume_import/ prd/encp/v3_6d/Linux-2-6/volume_import/ wrote 1010993 bytes read 20 bytes 18550.70 bytes/sec total size is 2203906163 speedup is 2179.90 real 0m53.946s user 0m1.650s sys 0m5.710s rm -r /grid/data/minos/products ######## # FARM # ######## ./pnfsdirs near cedar_phy_brev daikon_00 L010185N STREAMS cand mrnt sntp INPUT /pnfs/minos/mcin_data/near/daikon_00/L010185N FAMSET mcin_near_daikon_00 FAMILY mcin_near_daikon OOPS - need file family mcin_near_daikon_00 OK - setting family to mcin_near_daikon_00 OUTPUT /pnfs/minos/mcout_data/cedar_phy_brev/near/daikon_00/L010185N FAMSET mcout_cedar_phy_brev_near_daikon_00_cand FAMILY reco_mc_near_cedar_phy_brev OOPS - need file family mcout_cedar_phy_brev_near_daikon_00_cand OK - setting family to mcout_cedar_phy_brev_near_daikon_00_cand FAMSET mcout_cedar_phy_brev_near_daikon_00_mrnt FAMILY reco_mc_near_cedar_phy_brev OOPS - need file family mcout_cedar_phy_brev_near_daikon_00_mrnt OK - setting family to mcout_cedar_phy_brev_near_daikon_00_mrnt FAMSET mcout_cedar_phy_brev_near_daikon_00_sntp FAMILY reco_mc_near_cedar_phy_brev OOPS - need file family mcout_cedar_phy_brev_near_daikon_00_sntp OK - setting family to mcout_cedar_phy_brev_near_daikon_00_sntp MINOS26 > date Fri Jul 6 17:22:39 CDT 2007 ============================================================================= 2007 07 05 ########## # MYSQL # ########## Planning to break up the lock/copy of the largest tables, to reduce database locking time of crl Had to rsync the BINLOGS up front, to gain working space ( they had gotten up to 10 GB ) wrote 9984606905 bytes read 3236 bytes 9969655.66 bytes/sec total size is 11211535336 speedup is 1.12 real 16m40.950s user 1m33.970s sys 0m37.150s Started archives at about 13:00 Thu Jul 5 13:03:51 CDT 2007 Copying DCS_HV ran at about 14-19 MB/sec through 6 GBytes, then slowed down to 4, then back up to 13, 10 Sizes at 100 sec intervals : Mysql> du -sm /data/archive/COPY/20070705/offline/DCS_HV.MYD 1009 2903 4427 6214 6560 7059 8417 9459 10499 11811 13044 18m6.745s Thu Jul 5 13:22:17 CDT 2007 Watching PULSERGAIN.MYD while true ; do du -sm /data/archive/COPY/20070705/offline/PULSERGAIN.MYD | cut -f 1 ; sleep 100 ; done 971 2012 2243 2475 3703 5135 6270 7038 7422 8464 8879 10021 11270 12584 13159 13366 real 25m20.905s user 0m1.980s sys 2m1.070s Thu Jul 5 13:51:54 CDT 2007 while true ; do du -sm /data/archive/COPY/20070705/offline | cut -f 1 ; sleep 100 ; done 27070 27545 27962 28366 28746 29035 29256 BEAMMONSPILL.MYD 29486 29824 30397 31122 BEAMMONSPILL.MYD done 32030 32365 32678 32988 33929 34911 35297 35752 36599 37619 37961 38326 38617 39000 39447 39884 40247 40611 40952 41376 41929 42369 42821 real 60m27.476s N.B. some of the table sizes : 361 /data/database/offline/DCS_ENV_NEAR.MYD 399 /data/database/offline/CALPMTDRIFT.MYD 609 /data/database/offline/DCS_MAG_FAR.MYD 632 /data/database/offline/DBUVACHIPSPARS_OLD.MYD 762 /data/database/offline/DBUVACHIPSPARS.MYD 911 /data/database/offline/DBUVACHIPPEDS_OLD.MYD 1040 /data/database/offline/DBUVACHIPPEDS.MYD 1260 /data/database/offline/SPILLSERVERMON.MYD 1525 /data/database/offline/CALADCTOPE.MYD 1781 /data/database/offline/PULSERTIMEDRIFT.MYD 2225 /data/database/offline/CALADCTOPES.MYD 3097 /data/database/offline/BEAMMONSPILL.MYD 13366 /data/database/offline/PULSERGAIN.MYD 13464 /data/database/offline/DCS_HV.MYD Mysql> df -h /data Filesystem Size Used Avail Use% Mounted on /dev/hdb1 230G 211G 7.0G 97% /data 44G . 
[minsoft@minos-mysql1 offline]$ time gzip -1 *.MYD real 76m41.851s user 38m45.640s sys 3m42.710s [minsoft@minos-mysql1 offline]$ du -sh . 17G . [kreymer@minos-sam03 MYSQL]$ du -sm /home/kreymer/MYSQL/* 8965 /home/kreymer/MYSQL/20060418 9352 /home/kreymer/MYSQL/20060421 14021 /home/kreymer/MYSQL/20070207 14186 /home/kreymer/MYSQL/20070305 14790 /home/kreymer/MYSQL/20070403 16492 /home/kreymer/MYSQL/20070705 45792 /home/kreymer/MYSQL/BINLOG ########## # MYSQL # ########## http://www.mysql.com/doc/en/InnoDB_File_Defragmenting.html ALTER TABLE CpuHistory TYPE=INNODB; ALTER TABLE CpuHistory TYPE=MYISAM; Have a look at SHOW TABLE STATUS Found caltest database table CALADCTOPE size 2 GB, last updated 2007-03-24 As root, FRAG=/home/minsoft/maint/filefrag ${FRAG} /data/database/caltest/CALADCTOPE.MYD 396575 extents found, perfection would be 17 extents ${FRAG} /data/database/offline/DCS_HV.MYD /data/database/offline/DCS_HV.MYD: 276037 extents found, perfection would be 106 extents ########## # DCACHE # ########## old Ticket 100349 1 of 12 writePools are offline MINOS26 > ./poolstat verb Thu Jul 5 09:57:17 CDT 2007 DOWN TOT POOL GROUP 14 ExpDbWritePools 6 FermigridVolPools 12 KTeVReadPools 15 MinosPrdReadPools 8 RawDataWritePools 9 readPools 1/ 12 writePools w-stkendca11a-4 10:00 sent update to ticket, no activity under ticket 15:55 Solution: berg@fnal.gov sent this solution: All of the write pools are currently online. We will keep watching them and restart as needed. The developers are aware of the problem and are testing a patch. In the meantime, they have increased the size of java heap memory for the pools that have a history of this problem, though it may take additional restarts for the change to take effect. ######## # GRID # ######## # # # warning - these paths are incorrect # # # # # # see 2007 07 06 AFSPROD=/afs/fnal.gov/files/code/e875/general/products GRIPROD=/grid/app/minos/products date time rsync -r \ ${AFSPROD} ${GRIPROD} \ --perms --times --size-only --delete -v wrote 2206935025 bytes read 728980 bytes 3829425.85 bytes/sec total size is 2203906163 speedup is 1.00 real 9m35.625s user 0m22.850s sys 0m58.770s MINOS26 > date Thu Jul 5 15:30:13 CDT 2007 MINOS26 > du -sm /grid/data/minos/products/ 3336 /grid/data/minos/products Try a second pass, for timing. MINOS26 > date Thu Jul 5 16:24:27 CDT 2007 MINOS26 > time rsync -r \ ${AFSPROD} ${GRIPROD} \ --perms --times --size-only --delete -v building file list ... done skipping non-regular file "products/db/.upsfiles/shutdown/ups_shutdown.csh" skipping non-regular file "products/db/.upsfiles/shutdown/ups_shutdown.sh" ... wrote 990490 bytes read 20 bytes 18174.50 bytes/sec total size is 2203906163 speedup is 2225.02 real 0m54.621s user 0m1.520s sys 0m5.010s Oops, the output directory is not what I wanted, change to GRIPROD=/grid/app/minos/products GRRRRRRRRRRRRRRRRR - The products are full of symlinks, especially the VDT stuff. This wreaks havoc with many utilities like rsync. 
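Given the symlink trouble, a sanity check on the copy might look like this
(a sketch, using the AFSPROD/GRIPROD paths set above; not run) :

# sketch - compare symlink counts in source and destination
for DIR in ${AFSPROD} ${GRIPROD} ; do
  printf "%s symlinks " ${DIR} ; find ${DIR} -type l | wc -l
done
# sketch - list any dangling links left in the copy
find ${GRIPROD} -type l ! -exec test -e {} \; -print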
============================================================================= 2007 07 04 HOLIDAY ============================================================================= 2007 07 03 ########## # DCACHE # ########## Ticket 100349 5 of 12 writePools are offline ########### # MONTHLY # ########### MINOS26 > aklog MINOS26 > tokens DATASETS 7/2 PREDATOR 7/2 SADDRECO 7/2 VAULT 7/2 ok MYSQL 7/5 did crl 7/3, will do rest later this week ########### # MONTHLY # ########### CFL update the web listing cd ${HOME}/minos/CFL $HOME/minos/scripts/cfl $HOME/minos/scripts/cflsum | tee cflsum.`date +%Y%m%d` ln -sf cflsum.`date +%Y%m%d` CFLSUM Updated datasets to write the pool group name, and to include 'q' Removed the following, as it does nothing without file activity ROUNDUP VMON=`date -d '27 days ago' +%Y-%m` ./roundup -m "${VMON}" -r cedar far ./roundup -m "${VMON}" -r cedar near inserted SADDRECO on fnpcsrv1 VMON=`date -d '27 days ago' +%Y-%m` REL=cedar PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=IOR:010000002a00000049444c3a6f6f632e636f6d2f436f734e616d696e672f4f424e616d696e67436f6e746578743a312e30000000010000000000000030000000010100b7150000006d696e6f732d73616d30312e666e616c2e676f7600d132230c0000004e616d655365727669636500 cd ~/scripts for DET in near far ; do ./saddreco ${DET} ${REL} ${VMON} declare \ 2>&1 | tee ${HOME}/ROUNTMP/LOG/${SAMMON}/declare_${DET}_${REL}.log done ########### # CLUSTER # ########### shepelak removed KDE components yesterday. No adverse affects so far. Did two gimp scans, all OK. ######## # FARM # ######## Following up on pending cedar_phy far runs Processing completed overnight for spill data. There are still a few all stream subruns missing : PEND - have 24/12 subruns for F00034632_*.all.sntp.cedar_phy.1.root 42 05/21 10:07 0 PEND - have 1/23 subruns for F00034635_*.all.sntp.cedar_phy.1.root 42 05/21 10:10 0 PEND - have 2/24 subruns for F00034647_*.all.sntp.cedar_phy.0.root 21 06/11 21:36 0 PEND - have 1/7 subruns for F00034675_*.all.sntp.cedar_phy.0.root 21 06/11 21:48 0 PEND - have 1/19 subruns for F00034700_*.all.sntp.cedar_phy.0.root 19 06/13 12:09 0 ########## # DC2AFS # ########## echo >> ../TRACE ./dc2afs -d far -r cedar_phy -s .bntp | tee -a ../TRACE 2>&1 echo >> ../TRACE 2007-02 63/ 63 recodata113 8552842 F00037709_0000.spill.bntp.cedar_phy.0.root 34973878 bytes in 1 seconds (34154.18 KB/sec) 2007-03 48/ 48 recodata113 8552842 F00037832_0000.spill.bntp.cedar_phy.0.root 34973878 bytes in 1 seconds (34154.18 KB/sec) ######## # FARM # ######## Following up on pending cedar mcfar runs SRV1> cat LOG/cedarmcfar.pend PEND - have 7/10 subruns for f20011007_*_CosmicLE_D02.sntp.cedar.root 3 06/29 14:39 0 PEND - have 9/10 subruns for f20011128_*_CosmicMu_D02.sntp.cedar.root 2 06/30 16:45 0 ######## # FARM # ######## Niki's subrun list, from email, put in /tmp/subs SRV1> scp kreymer@minos-93198.dhcp.fnal.gov:/tmp/subs /tmp/subs cd ~/lists LINES=`cat /tmp/subs` printf "${LINES}\n" | wc -l 87 cat /home/minfarm/lists/daq_lists/sup/*.sup >>/tmp/SUP cat /tmp/subs | while read LINE ; do printf "\n${LINE}\n" SRUN=`echo ${LINE} | cut -f 1 -d ' '` printf " BADRUNS `grep ${SRUN} ~/lists/bad_runs.cedar_phy`\n" printf " NOSPILL `grep ${SRUN} ~/lists/no_spill.cedar_phy`\n" printf " SUPPRES `grep ${SRUN} /tmp/SUP`\n" done Everything is in bad_runs, no spill, or suppressed, except F00033713_0017 NOT LAST SUBRUN SIZE NORMAL This was flagged as a bad run by the farms on 5/15 This subrun was rerun 
on 5/21, but no spill output files resulted. F00037351_* PHYSICS TEST processing was not requested F00037691_* !! missing run in pnfs completed today F00037706_* !! missing run in pnfs completed today As of about 14:00, F00033713_0017 finished on the farm, and was rounded up with ./roundup -s F00033713 -f 0 -r cedar_phy far And copied to AFS with echo >> ../TRACE ./dc2afs -d far -r cedar_phy -s .bntp | tee -a ../TRACE 2>&1 echo >> ../TRACE ============================================================================= 2007 07 02 ####### # SAM # ####### Per petyt request, here is a sample SAM query listing all the parents of a given file : FILE=F00030612_0005.spill.bntp.cedar_phy.0.root SAMDIM=" DATA_TIER raw-far \ and FULL_PATH like /pnfs/minos/fardet_data/2005-04 \ and FILE_NAME like F0003061% \ and CHILD_BY_NAME ${FILE} \ " sam list files --dim="${SAMDIM}" --nosummary | sort F00030612_0005.mdaq.root F00030612_0006.mdaq.root F00030612_0007.mdaq.root F00030613_0000.mdaq.root F00030613_0001.mdaq.root ... sam get metadata --file=${FILE} \ | grep parents \ | tr "'" \\\n \ | grep root \ | sort ######## # MAIL # ######## for UUSER in alberto bishai djensen escobar kafka para wojcicki ; do finger ${UUSER}@fnal.fnal.gov | grep '@' ; done alberto@fnalu.fnal.gov \alberto@fsui02.fnal.gov bishai@fsui02.fnal.gov \bishai@fsui02.fnal.gov djensen@fsui02.fnal.gov \djensen@fsui02.fnal.gov escobar@fsui02.fnal.gov escobar@ifi.unicamp.br kafka@fnalu.fnal.gov #\kafka@tuhepf.phy.tufts.edu para@fsui02.fnal.gov \para@fsui02.fnal.gov , adpara@yahoo.com wojcicki@fnalu.fnal.gov SGWEG@SLAC.Stanford.EDU ############ # PURCHASE # ############ CD103354 PO 195772 approved CD103358 PO 195773 approved ######## # FARM # ######## Per berg, note 0 size cand file, -rw-r--r-- 1 1334 5111 0 Jun 30 04:58 f20011014_0009_CosmicMu_D02.cand.cedar.root Removed it. /pnfs/minos/mcout_data/cedar/far/daikon_02/CosmicMu/cand_data/101/f20011014_0009_CosmicMu_D02.cand.cedar.root ####### # CFL # ####### cflsum.20070702 sets MINOS_DATA to /afs/fnal.gov/files/data/minos ######## # FARM # ######## Following up on pending cedar_phy far runs PEND - have 3/8 subruns for F00030612_*.spill.bntp.cedar_phy.0.root 53 05/10 01:31 0 PEND - have 13/24 subruns for F00035724_*.spill.bntp.cedar_phy.0.root 54 05/09 04:14 0 PEND - have 23/24 subruns for F00037691_*.spill.bntp.cedar_phy.0.root 44 05/19 09:26 0 PEND - have 8/9 subruns for F00037706_*.spill.bntp.cedar_phy.0.root 44 05/19 09:02 0 flush F00030612, which spans the April 1 2005 startup cutoff ./roundup -n -f 1 -s F00030612 -r cedar_phy far Howie is rerunning the rest. At 14:36, F00037691 and F00037706 are ready. ./roundup -r cedar_phy far N.B. 
completed overnight ######## # FARM # ######## Following up on pending cedar mcfar runs SRUNS=' f20011007_0001 f20011007_0002 f20011007_0003 f20011007_0004 f20011030_0008 f20011048_0008 f20011048_0009 f20011056_0003 f20011061_0004 f20011066_0009 f20011080_0006 f20011080_0007 f20011080_0008 f20011080_0009 f20011086_0004 f20011086_0005 f20011086_0006 f20011095_0009 f20011098_0004 f20011098_0005 f20011098_0006 f20011098_0007 f20011099_0006 f20011099_0007 f20011099_0008 f20011099_0009 f20011100_0001 f20011104_0009 f20011105_0000 f20011105_0001 f20011121_0006 f20011128_0008 f20011132_0003 f20011132_0004 f20011132_0005 f20011132_0006 f20011143_0009 f20011144_0000 f20011144_0001 f20011145_0002 f20011145_0004 f20011145_0005 ' for SUB in ${SRUNS} ; do RUNG=${SUB:5:3} ; echo ${SUB} ${RUNG} ; ls -l /pnfs/minos/mcin_data/far/daikon_02/CosmicMu/${RUNG}/${SUB}* ; done for SUB in ${SRUNS} ; do ls -l /grid/data/minos/mcfarcat/${SUB}* ; done Note that f20011080 0,2 are missing in mcin. ######## # FARM # ######## Quota out on minfarm account /home/minfarm SRV1> du -sm * | sort -n ... 53 18801dump_table 122 22892dump_table 123 condor_submit 128 FNAL_00030851.dbm.gz 163 lists 215 condor_log 510 west 1166 test 1726 grid 2209 17271dump_table 3333 scavantest grid files were created 20 Dec, 194 days ago, about 25K files. Last access was Mar 24 101 days ago, find . -mtime -195 | less find . -atime -101 | less cd cp -vax grid /export/stage/minfarm/homegrid diff -r grid /export/stage/minfarm/homegrid SRV1> du -sk ~/grid ../homegrid 1766720 /home/minfarm/grid 1177144 ../homegrid mv grid gridx ln -s /export/stage/minfarm/homegrid grid ln: creating symbolic link `grid' to `/export/stage/minfarm/homegrid': Disk quota exceeded rm gridx/pacman-latest.tar.gz rm -r gridx ============================================================================= 2007 06 29 ######## # GRID # ######## Looking at quota, via group, quota -s -v -g numi We have about 30 GB of app space, should be plenty for products. Need to cp -ax /afs/fnal.gov/files/code/e875/general/products \ /grid/app/minos/products cp -ax /afs/fnal.gov/files/code/e875/general/minossoft \ /grid/app/minos/minossoft MINOS26 > fs listquota /afs/fnal.gov/files/code/e875/general/products Volume Name Quota Used %Used Partition c.e875.d1 8000000 2125440 27% 66% MINOS26 > fs listquota /afs/fnal.gov/files/code/e875/general/minossoft Volume Name Quota Used %Used Partition code.e875.general 8000000 6312687 79% 66% MINOS26 > du -sm /afs/fnal.gov/files/code/e875/general/products 2076 /afs/fnal.gov/files/code/e875/general/products ########### # ROUNDUP # ########### New files showing up in cedar near and far, need to round them up. I am REALLY getting tired of new things happening on Fridays, requring manual intervention through the weekend. /pnfs/minos/mcout_data/cedar/far/daikon_02/CosmicLE /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_bfldx113 ./pnfsdirs far cedar daikon_02 CosmicLE ./pnfsdirs far cedar daikon_02 CosmicLE write ./pnfsdirs near cedar daikon_00 L010185N_bfldx113 ./pnfsdirs near cedar daikon_00 L010185N_bfldx113 write Now need to activate cedar mcnear and mcfar in corral. Did this at 18:25 after the current roundup finished. ########## # DCACHE # ########## Timur Perelumtov is going to CERN to meet with Rene Brun and Patrick Fuhrman next week. They will work on our Root/DCache I/O problem. 
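For the corral activation of cedar mcnear and mcfar noted above, the added lines
would presumably follow the pattern of the existing cedar_phy mcfar entry
(a sketch; the actual corral edit is not recorded here) :

# sketch - corral entries for cedar mcnear and mcfar
[ ${BADS} -le 1 ] && ${HOME}/scripts/roundup -c -r cedar mcnear || (( BADS++ ))
[ ${BADS} -le 1 ] && ${HOME}/scripts/roundup -c -r cedar mcfar  || (( BADS++ ))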
######## # DCAP # ######## ups copy dcap v2_38_f0512 -q unsecured -G "dcap v2_38_f0512 -q unsecured" upd install -j dcap v2_41_f0610 ups copy dcap v2_41_f0610 -q unsecured -G "dcap v2_41_f0610 -q unsecured" ####### # CFL # ####### Updated for daily running via cron on minos01 Silent curl No printout Create CFL.YYMM01 and cflsum.YYMM01 on first day of month Argument names working directory, testing in /var/tmp/kreymer Existing montly CFL.200* file headers show times from 04:42 to 09:45 Let's keep as far as possible from these times as possible, Added to crontab.minos01 15 19 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/cfl And corrected the afssum times, which had minutes/hours reversed 01 23 * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/afsfree quiet 05 23 * * 1,3,5 /usr/krb5/bin/kcron ${HOME}/minos/scripts/afssum quiet timing daily real 1m12.730s user 0m22.320s sys 0m19.460s timing with cflsum real 6m17.389s user 1m49.970s sys 1m16.750s real 7m59.211s user 1m46.460s sys 1m17.150s Putting this into production, cd /afs/fnal.gov/files/data/minos/log_data/ rm CFL ; cp CFL.20070608 CFL cds time ./cfl real 3m19.566s user 0m27.450s sys 0m29.700s ============================================================================= 2007 06 28 ########### # MINOS12 # ########### Ganglia monitoring Thu, 28 Jun 2007 13:19:37 -0500 Last heartbeat 21 days, 21:46:55 ago MRTG networking is close to flatline at 600 bits/second for about 21.5 days. Helpdesk ticket 100083 Forwarded to minos-admin, run2-sys ============================================================================= 2007 06 27 ######## # MAIL # ######## Sent email to the 7 Minos users receiving mail on fsui02/fnalu warning of the 1 Oct shutdown of this service. ######### # ADMIN # ######### Our req's, in CD as of 26 June CD103354 for the satabeast CD103358 for the nodes. Reference FAGAN,DAVID requisition CD101973 Lab 193587 PO 575035 1U Intel Dual Xeon Quad Core E5335 2Ghz Computer Server. Under http://www-css.fnal.gov/els/useful_links/ https://fncdug1.fnal.gov/miser/req-query.html http://www-bss2.fnal.gov/reqquery/ ============================================================================= 2007 06 26 ########## # DC2AFS # ########## Need to get the cedar_phy .bntp files in to AFS, for the box opening Friday. MINOS26 > du -sm /pnfs/minos/reco_far/cedar_phy/.bntp_data 52448 /pnfs/minos/reco_far/cedar_phy/.bntp_data MINOS26 > fs listquota /afs/fnal.gov/files/data/minos/d10/recodata112 Volume Name Quota Used %Used Partition nb.minos.d258 50000000 35683004 71% 45% MINOS26 > fs listquota /afs/fnal.gov/files/data/minos/d10/recodata113 Volume Name Quota Used %Used Partition nb.minos.d259 50000000 188 0% 45% for YEMO in `ls /pnfs/minos/reco_far/cedar_phy/.bntp_data` ; do ./stage -d -p 0 reco_far/cedar_phy/.bntp_data/${YEMO} ; done All on disk through 2006-09, then most are off disk through 2007-03. for YEMO in `ls /pnfs/minos/reco_far/cedar_phy/.bntp_data` ; do ./stage -w reco_far/cedar_phy/.bntp_data/${YEMO} ; done echo >> ../TRACE date >> ../TRACE ./dc2afs -d far -r cedar_phy -s .bntp | tee -a ../TRACE 2>&1 date >> ../TRACE echo >> ../TRACE STARTING Tue Jun 26 16:30:30 CDT 2007 FINISHED Tue Jun 26 18:05:52 CDT 2007 ####### # AFS # ####### Existing index file sizes based on cd $MINOS_DATA/d10/indexes du -sk * | sort -n ... 
44 BAD_mc_far.daikon_00.cedar.index 65 mc_far.carrot.cedar.index 82 mc_cosmic.bfld201.cedar.index 84 mc_far.R1.14.index 86 mc_far.daikon_00.cedar.index 94 mc_far.carrot.R1_18_2.index 99 2005-04_far.R1_18.index 117 mc_near.R1_18_2.index 501 mc_near.carrot_06.cedar.index 509 mc_near.daikon_00.cedar.index.save 530 mc_near.carrot_06.R1_18_2.index 592 mc_near.daikon_00.cedar.index and many monthly files for R1_18_2 and R1_18_4 du -sk 20*R1_18*.index | cut -f 1 > /tmp/sumin MINOS26 > cat /tmp/sumin | ~/minos/scripts/count Enter numbers to be added : Got 48 /tmp/FOO numbers 1285 ########## # SADDMC # ########## latest examples in this log, and HOWTO, are out of date . checked out some old test data from last year MINOS26 > sam locate n13011068_0000_L010200.reroot.root ['/pnfs/minos/mcin_data/near/carrot_06/L010200,861@vo8034'] MINOS26 > sam get metadata --file=n13011068_0000_L010200.reroot.root did successful verify in the new format : ./saddmc.20070608 -n 2 -m verify carrot_06 mcin_data/near/carrot_06/L010185 ########## # DCACHE # ########## Installed newer dcap upd install -j dcap v2_38_f0512 Could not install the newest one, because of a symlink in the UPD server Helpdesk ticket 099982 upd install -j dcap v2_41_f0610 ftp> pwd 257 "/products/dcap" is current directory. ftp> ls /ftp/products/dcap/v2_41_f0610/Linux+2.6/dcap_v2_41_f0610_Linux+2.6.ups.tar 200 PORT command successful. 150 Opening ASCII mode data connection for directory listing. -rw-rw---- 1 100 3531 3584 Oct 23 2006 /ftp/products/dcap/v2_41_f0610/Linux+2.6/dcap_v2_41_f0610_Linux+2.6.ups.tar ============================================================================= 2007 06 25 ####### # DAQ # ####### F00038278_0000.mdaq.root failed to transfer twice 12:52:36 - 13:27:51 tranferred one other file, then 13:42:56 - 14:14:23 - successful this time Two recent files are very large : ssh -l kreymer minos-gateway.minos-soudan.org ssh -l minos daqdcp cd /daqdata du -sm * | sort -n ... 156 F00038280_0000.mdaq_1.root 176 F00038260_0000.mdaq.root 725 F00038276_0000.mdaq.root 1319 F00038278_0000.mdaq.root 1816 F00038280_0000.mdaq.root Basicly, these large files generated while beam is down today just take a while to copy. ######## # MAIL # Ticket 099858 ######## /var/mail filled up this morning on fsui02 ( a.k.a. fnalu ) Moved niki's email to minos01 /var/spool/mail/niki ######## # MAIL # ######## MUSERS=`ypcat passwd | cut -f 1 -d :` MINOS01 > for MUSE in ${MUSERS} ; do finger ${MUSE}@fnal | grep '@' | grep -v imapserv ; done | grep fnalu alberto@fnalu.fnal.gov michael@fnalu.fnal.gov kafka@fnalu.fnal.gov wojcicki@fnalu.fnal.gov MINOS01 > for MUSE in ${MUSERS} ; do finger ${MUSE}@fnal | grep '@' | grep -v imapserv ; done | grep fsui para@fsui02.fnal.gov djensen@fsui02.fnal.gov escobar@fsui02.fnal.gov bishai@fsui02.fnal.gov for UUSER in alberto bishai djensen escobar kafka para wojcicki ; do du -sk /var/mail/${UUSER} ; done ============================================================================= 2007 06 23 Sat ########### # ROUNDUP # ########### Previous files in mcout_data/cedar_phy/far/daikon_02/CosmicMu and CosmicLE have been removed. These were previously rounded up -S on June 13 (WTW) and WRITE files purged on June 21 ( back home ) Concatenation was OK , but I did not do it due to missing subruns. Listed files in READ index : find READ -name \*Cosmic\* -exec ls -l {} \; | less Set them aside to allow fresh concatenation. mkdir DUP DUP/D02Cosmic FILES=`find . 
-name \*Cosmic\* -exec basename {} \;`
for FILE in ${FILES} ; do mv READ/${FILE} DUP/D02Cosmic/${FILE} ; done

Write them out, and put this into corral also

corral
[ ${BADS} -le 1 ] && ${HOME}/scripts/roundup -c -M -r cedar_phy mcfar || (( BADS++ )) # no SAM yet

ran ./roundup -M -r cedar_phy mcfar

=============================================================================
2007 06 22

#############
# MINOSORA3 #
#############

Memory problems continue, with new motherboard.
June 26 1 PM will switch to single size of dimms.

#########
# FNPPD #
#########

NOTE: please copy off any important files from fnppd. It is unknown
how much longer fnppd will stay on-line as of 6/20/2007.

FNPPD > uptime
9:03am up 1 day, 21:47, 1 user, load average: 0.03, 0.07, 0.01

du -sm /prj/e875
67344416        /prj/e875

Files all are owned by rhbob ( Bob Bernstein )

############
# MINOSCVS #
############

.admin - removed west ( no such Fermi principal )
.k5login added asousa brebel kasahara llhsu

=============================================================================
2007 06 21

###########
# ROUNDUP #
###########

Clearing older WRITE files

>>>> 156 from June 7
./roundup -w -M -r cedar_phy mcfar
Thu Jun 21 10:58:09 CDT 2007
PURGING WRITE files 156
Thu Jun 21 10:58:39 CDT 2007

>>>> 3 from June 8
./roundup -w -M -r cedar_phy_safitter far
Thu Jun 21 11:00:48 CDT 2007
PURGING WRITE files 3

>>>> 1 from Jun 11 06:47
2007 06 11 N00011669_0000.cosmic.sntp.cedar_phy.0.root has a mismatched checksum.
This is a holdover from the 2007 06 11 crash of fnpcsrv1, which I had
supposedly repaired before leaving for WTW the next day.

SRMCP -streams_num=1 -server_mode=active -protocols=gsiftp file:///N00011669_0000.cosmic.sntp.cedar_phy.0.root /pnfs/minos/reco_near/cedar_phy/sntp_data/2007-01
PURGE FARM N00011669_0000.cosmic.sntp.cedar_phy.0.root
Wed Jun 13 00:16:29 CDT 2007
/export/stage/minfarm/ROUNDUP/ECRC/N00011669_0000.cosmic.sntp.cedar_phy.0.root

Understandable, let's generate that manually

ROUNTMP=/export/stage/minfarm/ROUNDUP
GDW=/grid/data/minos/minfarm/WRITE
SFINI=N00011669_0000.cosmic.sntp.cedar_phy.0.root
ecrc ${GDW}/${SFINI} | cut -f 2 -d ' ' > ${ROUNTMP}/${CAT}ECRC/${SFINI}

That should do it.
Indeed the file cleared out with the Noon cycle of roundup.

##########
# DCACHE #
##########

Write pool files seem to have been flushed to tape, ticket 099603
Times are 05:47 through 06:29
Verified they have tape locations

cd ${IPATH}
for FILE in ${FILES} ; do cat ".(use)(4)(${FILE})" ; done
...
f20011035_0006_CosmicMu_D02.reroot.root 0000_000000000_0000680 Checking files in http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011090_0000_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011093_0009_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0004_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011091_0007_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011091_0009_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011030_0005_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011094_0002_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011094_0001_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011094_0003_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011031_0000_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0007_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011030_0000_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011030_0006_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0009_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0002_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011095_0002_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0006_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/102/f20011029_0008_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011093_0005_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011031_0002_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011031_0001_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/103/f20011031_0004_CosmicMu_D02.cand.cedar_phy.root /pnfs/minos/mcout_data/cedar_phy/far/daikon_02/CosmicMu/cand_data/109/f20011091_0000_CosmicMu_D02.cand.cedar_phy.root Times are 05:51 through 06:44 for FLAP in ${FLAPS} ; do echo ${FLAP} IPATH=`dirname ${FLAP}` ; IFILE=`basename ${FLAP}` ( cd ${IPATH} ; cat ".(use)(4)(${IFILE})" | grep '0000_0000' ) done All are on tape. This was due to a tape being NOACCESS. Odd, I did not see such an indication when I looked at the list yesterday. ########## # DCACHE # ########## Per Liz, DCache read rates were helped in early 05 by calling UseCache Perhaps the default arguments need to be tuned. Perhaps something has broken in root. ######### # STAGE # ######### Lots of reco_far_cedar and reco_near_cedar files still being staged. 
/pnfs/minos/reco_far/cedar/sntp_data/2005-10/F00032814_0023.spill.sntp.cedar.0.root is in r-stkendca14a-6 That's a readPools pool. So our cedar ntuples are not going where intended MINOS26 > ( cd /pnfs/minos/reco_far/cedar/sntp_data/2005-10 ; enstore pnfs --tags ) .(tag)(library) = CD-9940B .(tag)(file_family) = reco_far_cedar_sntp VOLS=`./volumes reco_far_cedar` echo $VOLS VO4093 VO4094 VO7415 VO7907 VO8334 VO8363 VO9661 echo >> ../TRACE date >> ../TRACE for VOL in ${VOLS} ; do ./stage -w -s sntp_data ${VOL} done 2>&1 | tee -a ../TRACE STARTING Thu Jun 21 12:10:38 CDT 2007 FINISHED Fri Jun 22 04:49:15 CDT 2007 date >> ../TRACE echo >> ../TRACE ####### # CPU # ####### Looking for AMD vs Intel benchmarks. http://www.cpubenchmark.net/index.php looks great, but charts don't load ============================================================================= 2007 06 20 ############# # MINOSORA3 # ############# Motherboard replaced, to address memory problems ####### # CVS # ####### Removed stray accidental directory from minoscvs cd /cvs/minoscvs/rep1/minossoft/NCUtils/Extrapolation rmdir MCEvent.h This worked as desired. ############ # NOACCESS # ############ Why is a 9940 raw data tape NOACCESS ? All these files are on disk. VO5182 0.39GB (NOACCESS 0619-1525 full 0611-1920) 9940 minos.fardet_data.cpio_odc ########## # DCACHE # ########## Many files are pending writes in mcimport, say sjc /pnfs/minos/mcin_data/far/daikon_02/CosmicMu/103/ f20011033_0000_CosmicMu_D02.reroot.root f20011033_0001_CosmicMu_D02.reroot.root f20011033_0002_CosmicMu_D02.reroot.root f20011033_0003_CosmicMu_D02.reroot.root f20011033_0004_CosmicMu_D02.reroot.root f20011033_0005_CosmicMu_D02.reroot.root f20011033_0006_CosmicMu_D02.reroot.root f20011033_0007_CosmicMu_D02.reroot.root f20011033_0008_CosmicMu_D02.reroot.root f20011033_0009_CosmicMu_D02.reroot.root f20011034_0000_CosmicMu_D02.reroot.root f20011034_0001_CosmicMu_D02.reroot.root f20011034_0002_CosmicMu_D02.reroot.root f20011034_0003_CosmicMu_D02.reroot.root f20011034_0004_CosmicMu_D02.reroot.root f20011034_0005_CosmicMu_D02.reroot.root f20011034_0006_CosmicMu_D02.reroot.root f20011034_0007_CosmicMu_D02.reroot.root f20011034_0008_CosmicMu_D02.reroot.root f20011034_0009_CosmicMu_D02.reroot.root f20011035_0000_CosmicMu_D02.reroot.root f20011035_0001_CosmicMu_D02.reroot.root f20011035_0002_CosmicMu_D02.reroot.root f20011035_0003_CosmicMu_D02.reroot.root f20011035_0004_CosmicMu_D02.reroot.root f20011035_0005_CosmicMu_D02.reroot.root f20011035_0006_CosmicMu_D02.reroot.root ./dc_stat /pnfs/minos/mcin_data/far/daikon_02/CosmicMu/103/f20011033_0000_CosmicMu_D02.reroot.root ============================ PNFS status for /pnfs/minos/mcin_data/far/daikon_02/CosmicMu/103/f20011033_0000_CosmicMu_D02.reroot.root -rw-r--r-- 1 kreymer e875 364301179 Jun 18 20:19 f20011033_0000_CosmicMu_D02.reroot.root LEVEL 2 2,0,0,0.0,0.0 :c=1:45d30373;h=yes;l=364301179; r-stkendca13a-6 w-stkendca10a-6 LEVEL 4 ============================ Sent this as helpdesk ticket 17:55 099603 ============================================================================= 2007 06 19 ######### # STAGE # ######### Jeff Dejong is hitting near R1_18_2 ntuples pretty hard. Set the file family properly for sntp_data, then restore by volume, even though these will go to the general read pools presently. 
From CFL summary : FILES GBYTES PATH 28790 541 reco_near/R1_18_2/.*nt._data/ ( cd /pnfs/minos/reco_near/R1_18_2/sntp_data ; enstore pnfs --tags ) .(tag)(file_family) = reco_near_R1_18_2 ( cd /pnfs/minos/reco_near/R1_18_2/sntp_data ; enstore pnfs --file_family reco_far_R1_18_2_sntp ) ./volumes vols VOLS=`./volumes reco_near_R1_18_2` printf "${VOLS}\n" | wc -l 22 echo >> ../TRACE date >> ../TRACE for VOL in ${VOLS} ; do ./stage -w -s sntp_data ${VOL} done 2>&1 | tee -a ../TRACE date >> ../TRACE echo >> ../TRACE Wed Jun 20 00:23:59 CDT 2007 FINISHED Thu Jun 21 11:30:47 CDT 2007 ####### # WTW # ####### Notes from the meeting Nick West using LCG contained in EGEE ( EGEE sort of like OSG ) Looking at GANGA for user interface ( command or gui ) N.B. can Nick use SAM for local cache locations ? gmieg - rootd server being used... stuck raw data file reported/resolved ? ########## # DCACHE # ########## Checking Rustem's slow reading report for dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy/sntp_data/2005-05 Claim is 30 minutes to read 10K snarls, versus 10. Trying the biggest file in that month N00007861_0000.spill.sntp.cedar_phy.0.root '1.58GB dccp speed : MINOS26 > IFILE=N00007861_0000.spill.sntp.cedar_phy.0.root MINOS26 > IPATH=minos/reco_near/cedar_phy/sntp_data/2005-05 MINOS26 > DCPOR=24125 # unsecured MINOS26 > DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} MINOS26 > ( cd /pnfs/${IPATH} ; cat ".(use)(2)(${IFILE})" ) 2,0,0,0.0,0.0 :h=yes;c=1:818e364b;l=1699476400; w-stkendca11a-2 r-stkendca19a-6 r-stkendca11a-2 MINOS26 > cd /local/scratch??/`whoami` MINOS26 > time dccp ${DFILE} TEST.dat # do the copy 1699476400 bytes in 53 seconds (31314.06 KB/sec) real 0m53.368s user 0m0.110s sys 0m11.380s Looks good to me. MINOS26 > setup_minos -r R1.24.0 MINOS26 > time hadd mTEST.dat TEST.dat TEST.dat real 5m21.285s user 0m22.430s sys 0m34.530s MINOS26 > time hadd mdTEST.dat TEST.dat "${DFILE}" ... TEST.dat tree:NtpSt entries=1234567890 dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/minos/reco_near/cedar_phy/sntp_data/2005-05/N00007861_0000.spill.sntp.cedar_phy.0.root tree:NtpSt entries=1234567890 ... real 174m23.321s user 0m53.150s sys 1m0.770s this ran very fast through the local file, very slow from dcache. about 0.2 MBytes/second, not CPU limited on the client. Let's try something shorter, and more relevant, the old style ntuple concatenation with a couple of shorter files. 
IFILE=N00007731_0000.spill.sntp.cedar_phy.0.root DFILE1=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} time dccp ${DFILE1} TEST1.root 5342644 bytes in 0 seconds real 0m0.593s user 0m0.000s sys 0m0.040s 'fileSize' : SamSize('5.10MB'), 'lastEvent' : 18121L IFILE=N00007733_0000.spill.sntp.cedar_phy.0.root DFILE2=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} time dccp ${DFILE2} TEST2.root 1890986 bytes in 0 seconds real 0m0.924s user 0m0.000s sys 0m0.040s 'fileSize' : SamSize('1.80MB') 'lastEvent' : 99785L, time hadd TESThloc.root TEST1.root TEST2.root real 0m11.814s user 0m3.030s sys 0m0.350s time hadd TESThdca.root ${DFILE1} ${DFILE2} real 2m17.372s user 0m3.760s sys 0m0.870s MINOS26 > ln -s ~kreymer/minos/scripts/Merger.C Merger.C setup_minos -r R1.24.0 time loon -bq CATTLE/Merger.c TEST1.root TEST2.root real 0m14.316s user 0m11.530s sys 0m0.530s Try again with correct Merger.C, local files, -r R1.24.2 setup_minos -r R1.24.2 time loon -bq Merger.C TEST1.root TEST2.root real 0m12.306s user 0m10.970s sys 0m0.470s 7 MB/ 12.3 sec => .57 Mb/sec time loon -bq Merger.C ${DFILE1} ${DFILE2} real 1m51.101s user 0m10.780s sys 0m0.720s 7 MB/ 111 sec => 63 KB/sec ########## # SADDMC # ########## HOWTO.saddmc Need to match to other then L* when finding directories in mcin ########## # DCACHE # ########## RawDataWritePools write interval needs reset to 24 hours, based on recent file times it seems to be 4 hours now. Ticket 099493 Problem Description: Recently, it seems that pools in the FNDCA RawDataWritePools group have been writing to tape frequently, perhaps on a 4 hour timer. Please reset the timers to the desired 24 hours. For background: The timer for these pools was set to 24 hours early in 2006. Here is an extract from a 25 May email from kennedy : " The general write pools now require the first of any of these three conditions to be met before encp's run: 1) 4 calendar hours have passed 2) 25 GB in file family have accumulated 3) 100 files in file family have accumulated This is distinct from the raw data pools which wait for 24 hours. " ####### # AFS # ####### Requested two new data volumes d261 d262 ACL's like system:administrators rlidwka system:anyuser rl minos rl minos:admin rlidwka minos:nonap rlidwka Tuesday, June 19, 2007 at 12:16:13 Created minos:nonap group NEWGROUP=nonap MINOS26 > pts creategroup -name kreymer:${NEWGROUP} group kreymer:nonap has id -1941 for GUSER in buckley kreymer barr habig jdejong ; do pts adduser -user ${GUSER} -group kreymer:${NEWGROUP} ; done pts setfields kreymer:${NEWGROUP} -access SOMar MINOS26 > pts membership kreymer:${NEWGROUP} Members of kreymer:nonap (id: -1941) are: buckley kreymer habig barr jdejong MINOS26 > pts examine kreymer:${NEWGROUP} Name: kreymer:nonap, id: -1941, owner: kreymer, creator: kreymer, membership: 5, flags: SOMar, group quota: 0. pts chown kreymer:${NEWGROUP} minos ============================================================================= 2007 06 18 ######## # FLUB # ######## Stuck nodes restarted, per Helpdesk ticket 099372 Out of memory in loon on flxb34, other 2 just stuck. Note that we have no ganglia monitoring. ============================================================================= 2007 06 16 Updated HOWTO.monitor beam_log minos26free_log ######## # FLUB # FNALU BATCH ######## jdejong reports stuck jobs #320407 and 320521 These are on flxb33 and flxb32. 
Cannot log into these

Network activity on flxb33 has been very low since Wed 2007 Jun 13 12:00
flxb32 has been very low since Wed 2007 Jun 13 18:00

Other nodes are affected, not just your jobs
flxb34 has been very low since Thu 2007 Jun 14 18:00

Getting node list
HOSTS=`bjobs -u all | tr -s ' ' | grep RUN | cut -f 6 -d ' ' | cut -f 1 -d . | sort -u`

The only stuck nodes are flxb32 flxb33 flxb34

Helpdesk ticket 099372

=============================================================================
2007 06 13

FARM quota is likely low in /grid/data, purging 57GB in WRITE

./roundup -w -r cedar_phy near
just in time, got to SAM phase before normal nightly run

ran ./roundup -w -M -r cedar_phy mockfar
down to 327 GB in /g/d/m now

Better now, pushing daikon_02 cedar_phy far to PNFS without concatenation,
too many bad runs for my taste.

=============================================================================
2007 06 12

DRIVING TO WEEK IN THE WOODS.

=============================================================================
2007 06 11

########
# FARM #
########

fnpcsrv1 crashed this morning in the middle of concatenating,
see LOG/2007-06/cedar_phynear.log

SUPPRESS N00011669_0024.cosmic.sntp.cedar_phy.0.root
OK adding N00011669_0000.cosmic.sntp.cedar_phy.0.root 24

hadd finished, and the mv to WRITE happened

SRV1> stat /grid/data/minos/minfarm/WRITE/N00011669_0000.cosmic.sntp.cedar_phy.0.root
  File: `/grid/data/minos/minfarm/WRITE/N00011669_0000.cosmic.sntp.cedar_phy.0.root'
  Size: 708547742       Blocks: 1383936    IO Block: 32768  regular file
Device: 1eh/30d Inode: -776408581  Links: 1
Access: (0644/-rw-r--r--)  Uid: (10871/ minfarm)   Gid: ( 5111/    numi)
Access: 2007-06-11 06:46:26.832000000 -0500
Modify: 2007-06-11 06:47:51.174000000 -0500
Change: 2007-06-11 06:51:29.530000000 -0500

The cleanup of GDM/nearcat happened
SRV1> ls -l /grid/data/minos/nearcat/N00011669*
ls: /grid/data/minos/nearcat/N00011669*: No such file or directory

The building of READ/ did not happen :
SRV1> find READ -name N00011669\*
READ/SAM/N00011669_0000.cosmic.sntp.cedar.0.root

That's the old cedar cosmic file, already declared to SAM.
Hacked that old file, changing cedar to cedar_phy :
We should be good to go !
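After a crash like this the question is always the same : for a given run,
what is left in nearcat, in WRITE, in READ/SAM, and in ECRC ?
A sketch of a one-shot status check, not an existing script ; it follows the
path conventions above, and assumes it is run from wherever the READ index
actually lives.

    RUN=N00011669
    GDM=/grid/data/minos
    ROUNTMP=/export/stage/minfarm/ROUNDUP
    echo "nearcat  :" ; ls ${GDM}/nearcat/${RUN}*       2> /dev/null
    echo "WRITE    :" ; ls ${GDM}/minfarm/WRITE/${RUN}* 2> /dev/null
    echo "READ/SAM :" ; find READ -name "${RUN}*"       2> /dev/null
    echo "ECRC     :" ; ls ${ROUNTMP}/ECRC/${RUN}*      2> /dev/null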
############ # POOLSTAT # verbose option ############ poolstat.20070611 add any option, and get a list of down pools added column labels ########## # DCACHE # ########## Mike Harrison started working on pools, we seem to be going down hill Mon Jun 11 12:13:01 CDT 2007 DOWN TOT GROUP 1/ 14 ExpDbWritePools w-stkendca9a-1 5 FermigridVolPools 15 KTeVReadPools 13 MinosPrdReadPools 1/ 7 RawDataWritePools w-stkendca9a-3 8 readPools 10/ 14 writePools w-stkendca10a-2 w-stkendca10a-5 w-stkendca10a-6 w-stkendca11a-2 w-stkendca11a-4 w-stkendca11a-6 w-stkendca9a-2 w-stkendca9a-4 w-stkendca9a-5 w-stkendca9a-6 MINOS26 > ./poolstat v Mon Jun 11 13:16:25 CDT 2007 DOWN TOT POOL GROUP 14 ExpDbWritePools 5 FermigridVolPools 15 KTeVReadPools 13 MinosPrdReadPools 1/ 7 RawDataWritePools w-stkendca11a-3 8 readPools 1/ 14 writePools w-stkendca11a-1 Mon Jun 11 15:52:12 CDT 2007 DOWN TOT POOL GROUP 14 ExpDbWritePools 5 FermigridVolPools 15 KTeVReadPools 13 MinosPrdReadPools 7 RawDataWritePools 7 readPools 14 writePools Authorize close of ticket 098946 ####### # DAQ # ####### DAQ ftp transfers failed : http://fndca3a.fnal.gov/cgi-bin/dcache_files.py 2007-06-11 12:45:51 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/neardet_data/2007-06/N00012361_0022.mdaq.root daqdcp-nd.fnal.gov 1 0 0 ERROR 425 Cannot open port: java.lang.Exception: Illegal Object received : dmg.cells.nucleus.NoRouteToCellException 2007-06-11 12:35:53 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/fardet_data/2007-06/F00038215_0019.mdaq.root daqdcp.minos-soudan.org 1 0 0 ERROR 425 Cannot open port: java.lang.Exception: Illegal Object received : dmg.cells.nucleus.NoRouteToCellException corrected archiver.pid [minos@daqdcp-nd minos]$ cat /var/lock/daq/archiver.pid 18823 Same as I set manually back on June 1. But the archiver is 25907 emacs /var/lock/daq/archiver.pid This makes daqmon happy Copies are still stuck srmcp works OK outbound. ============================================================================= 2007 06 10 crontab - reenabled around 19:48 kreymer@minos26 mindata@minos26 NOCAT under minfarm@fnpcsrv1 ########## # DCACHE # ########## cleanup - removed 2007 05 29 vintage /grid/data/minos/minfarm/SAFE files, after verifying that all are on tape ( cat .(use)(4) ... ) ########## # DCACHE # ########## FILES2=`sam list files --dim="TAPE_LABEL dcache" --nosummary | grep -v mdaq ` printf "${FILES2}\n" | wc -l 71 Very odd, this is exactly the number of files pending back in 2007 05 29 stuck in pools w-stkendca11a-4 w-stkendca11a-6 ============================================================================= 2007 06 09 ####### # SAM # ####### ./genpy -l " -r R1.15 " fardet_data/2006-03 ( this never happened , repeated 2007 07 12 ) ########## # DCACHE # ########## Cleared the empty bad candidate that's been hanging around ls -l /pnfs/minos/reco_far/cedar_phy/.bcnd_data/2007-02/F00037384_0006.spill.bcnd.cedar_phy.0.root -rw-r--r-- 1 1334 e875 0 May 19 05:01 /pnfs/minos/reco_far/cedar_phy/.bcnd_data/2007-02/F00037384_0006.spill.bcnd.cedar_phy.0.root As rubin on fnpcsrv1, at 02:56 UTC 10 Jun cd /pnfs/minos/reco_far/cedar_phy/.bcnd_data/2007-02 rm F00037384_0006.spill.bcnd.cedar_phy.0.root sam undeclare file F00037384_0006.spill.bcnd.cedar_phy.0.root ########## # DCACHE # ########## Over 70 files are in write pools over 24 hours, not on tape. 
Such as: MINOS26 > dc_stat /pnfs/minos/reco_near/cedar_phy/cand_data/2006-12/N00011376_0005.spill.cand.cedar_phy.0.root ============================ PNFS status for /pnfs/minos/reco_near/cedar_phy/cand_data/2006-12/N00011376_0005.spill.cand.cedar_phy.0.root -rw-r--r-- 1 1334 e875 153288546 Jun 8 11:48 N00011376_0005.spill.cand.cedar_phy.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:bfd0d1fb;l=153288546; w-stkendca11a-2 LEVEL 4 ============================ But other recent mcimported files are on tape, such as /pnfs/fnal.gov/usr/minos/mcin_data/far/daikon_02/CosmicMu/101/f20011010_0009_CosmicMu_D02.reroot.root Let's see where they are : FILES=`sam list files --dim="TAPE_LABEL dcache" --nosummary` 77 files 71 of these are non raw data for FILE in ${FILES} ; do PLOC=`sam locate ${FILE} | tr "'" \\\n | grep ^/pnfs | cut -f 1 -d ,` POLE=`cd ${PLOC} ; cat ".(use)(2)(${FILE})" | grep w-stkendca` printf "%70s %s\n" ${FILE} ${POLE} done Every non-raw-data file is in w-stkendca11a-2 Most are cand's, there are few sntp's. Tacked the list onto TRACE DUH, run the handle poolstat script : MINOS26 > ./poolstat Sat Jun 9 23:32:31 CDT 2007 14 ExpDbWritePools 5 FermigridVolPools 15 KTeVReadPools 13 MinosPrdReadPools 7 RawDataWritePools 8 readPools 5/ 14 writePools Dead pools in the writePools group are marked + w-stkendca10a-2 + w-stkendca10a-4 w-stkendca10a-5 + w-stkendca10a-6 w-stkendca11a-1 w-stkendca11a-2 + w-stkendca11a-4 + w-stkendca11a-5 w-stkendca11a-6 + w-stkendca12a-4 w-stkendca9a-2 w-stkendca9a-4 w-stkendca9a-5 w-stkendca9a-6 ############# # CHECKLIST # ############# Cannot contact fndca for queue and stage pages http://fndca.fnal.gov/dcache/queue/allpools.jpg http://fndca.fnal.gov/dcache/logins/stage.jpg also cannot reach http://fndca.fnal.gov/dcache/files/ And under http://fndca.fnal.gov:2288/cellInfo PinManager OFFLINE Checking more links on the DCache page http://fndca.fnal.gov/ Cannot reach these : Recent FTP Transfers http://fndca3a.fnal.gov/cgi-bin/dcache_files.py Active Transfers http://fndca3a.fnal.gov/dcache/transfers.html Billing http://fndca3a.fnal.gov/dcache/billing.html File Lifetime Plots http://fndca3a.fnal.gov/dcache/dc_lifetime_plots.html Pool Directory Listings http://fndca3a.fnal.gov/dcache/files/ Queue Plots http://fndca3a.fnal.gov/dcache/dc_queue_plots.html Sum http://fndca3a.fnal.gov/dcache/queue/allpools.jpg Login Plots http://fndca3a.fnal.gov/dcache/dc_login_plots.html Will report this via helpdesk and to dcache-admin, and call the helpdesk tomorrow if it is not better. Ticket 98946 Closer inspection shows the time stamps of /pnfs/minos/neardet_data/2007-06 are clustered around 4 hour intervals. THE RawDataWritePools POOLS ARE WRITING EVERY 4 HOURS, NOT EVERY 24 THIS IS BAD FOR THE TAPES. 
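( For reference, the clustering above can be confirmed by bucketing the raw
file timestamps by hour ; a sketch, assuming GNU ls with --time-style : )

    ls -l --time-style='+%m-%d %H' /pnfs/minos/neardet_data/2007-06 \
        | grep mdaq.root | awk '{print $6, $7}' | sort | uniq -c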
DEFER THIS TO MONDAY, WE HAVE GREATER PROBLEMS Have a look at /pnfs/minos/reco_near/cedar/2007-06 times, Times are clustered starting at Jun 4 21:29 Jun 4 23:31 Jun 5 00:30 Jun 5 04:29 Jun 5 10:45 -> 12:25 Jun 6 01:36 Jun 8 03:18 -> 09:48 Jun 8 13:47 ============================================================================= 2007 06 08 ########## # SADDMC # ########## setup sam -q dev cds for SAS in `ls saddmc.*` ; do EXT=`echo ${SAS} | cut -f 2 -d .` mv saddmc.${EXT} saddmc.2006${EXT} done cp saddmc.20060612 saddmc.20070608 REVIEW parameters - OK data tiers storage locations application family # # # mcout data tiers sam get registered data tiers | sort setup sam -q dev SAM_ORACLE_CONNECT=samdbs/password export SAM_ORACLE_CONNECT samadmin add datatier --name=mcout-near --description="mcout_data - near" samadmin add datatier --name=mcout-far --description="mcout_data - far" export -n SAM_ORACLE_CONNECT did this for dev/int/prd # # # STORAGE LOCATIONS ####### # AFS # ####### Added Greg Pawloski to the minos group pts membership minos | sort pts adduser -user jyuko -group minos pts adduser -user pawloski -group minos pts: User or group doesn't exist ; unable to add user pawloski to group minos waiting for Greg's AFS account to be created established by 13 Junel ####### # AFS # ####### minos:sysadmin To keep track of the sysadmins with access to the minos group pts creategroup -name kreymer:sysadmin pts adduser -user kreymer -group kreymer:sysadmin pts membership kreymer:sysadmin pts examine kreymer:sysadmin Name: kreymer:admin, id: -1919, owner: kreymer, creator: kreymer, membership: 1, flags: S-M--, group quota: 0. pts setfields kreymer:sysadmin -access SOMar pts examine kreymer:sysadmin Name: kreymer:admin, id: -1919, owner: kreymer, creator: kreymer, membership: 1, flags: SOMar, group quota: 0. pts chown kreymer:sysadmin minos pts membership minos:sysadmin for US in boyd ettab jason jonest ling schmitz shepelak timl ; do pts adduser -user ${US} -group minos:sysadmin ; done ########### # MONTHLY # ########### MINOS26 > aklog MINOS26 > tokens Tokens held by the Cache Manager: User's (AFS ID 1060) tokens for afs@fnal.gov [Expires Jun 15 18:32] CFL 6/8 DATASETS 6/8 PREDATOR 6/8 SADDRECO 6/8 ROUNDUP 6/8 VAULT 6/8 nearly at the end of near tarring, aklog: Couldn't get fnal.gov AFS tickets: aklog: Invalid argument while getting AFS tickets but the vaulting looks OK MYSQL deferred, need to defrag ######## # FARM # ######## Clean up write cache, check status ./roundup -M -r cedar_phy_safitter far PEND - have 18/24 subruns for F00037060_*.all.sntp.cedar_phy_safitter.0.root 8 05/30 17:42 0 2006-11 files are missing, flush the 2006-12 parts PEND - have 22/24 subruns for F00037996_*.all.sntp.cedar_phy_safitter.0.root 7 06/01 02:07 0 2007-05 Miss 11,12 PEND - have 24/17 subruns for F00038182_*.all.sntp.cedar_phy_safitter.0.root 6 06/01 14:04 0 2007-05 needed Predator to declare raw PEND - have 5/24 subruns for F00038185_*.all.sntp.cedar_phy_safitter.0.root 6 06/01 14:43 0 2007-05 needed Predator to declare raw And data runs into 2007-06, flush Did ./roundup -r cedar_phy_safitter far # picked up 38182 after sam declares ./roundup -f 1 -s F00038185 -r cedar_phy_safitter far ./roundup -f 1 -s F00037060 -r cedar_phy_safitter far ####### # SAM # cedar_phy declared ? ####### per vahle query 3 June, are cedar_phy files declared. 
cd /pnfs/minos/reco_far for MON in `ls sntp_data` ; do echo $MON FILES=`ls sntp_data/${MON}` for FILE in $FILES ; do sam locate ${FILE} done ; done MONS=`ls /pnfs/minos/reco_far/cedar_phy/sntp_data` DET=far for MON in $MONS ; do ./saddreco.20070507 ${DET} cedar_phy ${MON} list ; done Needed /pnfs/minos/reco_far/cedar_phy/cand_data/2006-03 need raw F00034632_0023.mdaq.root need raw F00034632_0022.mdaq.root need raw F00034632_0020.mdaq.root need raw F00034632_0019.mdaq.root need raw F00034635_0000.mdaq.root need raw F00034632_0012.mdaq.root need raw F00034632_0013.mdaq.root need raw F00034632_0016.mdaq.root need raw F00034632_0017.mdaq.root need raw F00034632_0014.mdaq.root need raw F00034632_0015.mdaq.root need raw F00034632_0018.mdaq.root That's 12 files missing. DET=near some obsoletes in 2005-09 ####### # SAM # ####### STRAY RAW FILES FROM FAR 2006-03 MINOS26 > ls /pnfs/minos/fardet_data/2006-02 | wc -l 750 MINOS26 > ls /pnfs/minos/fardet_data/2006-03 | wc -l 1414 MINOS26 > ls /pnfs/minos/fardet_data/2006-04 | wc -l 936 MINOS26 > SAMDIM="DATA_TIER raw-far and FULL_PATH like /pnfs/minos/fardet_data/2006-02 " MINOS26 > sam list files --dim="${SAMDIM}" --count 750 files match the given constraints. MINOS26 > SAMDIM="DATA_TIER raw-far and FULL_PATH like /pnfs/minos/fardet_data/2006-03" MINOS26 > sam list files --dim="${SAMDIM}" --count 1398 files match the given constraints. MINOS26 > SAMDIM="DATA_TIER raw-far and FULL_PATH like /pnfs/minos/fardet_data/2006-04" MINOS26 > sam list files --dim="${SAMDIM}" --count 936 files match the given constraints. We seem to need 16 files . ./genpy -d -l " -r R1.22 " fardet_data/2006-03 This seems to be listing everything. OK, this was the time at which we moved from minos06 to minos26. MINOS26 > pwd /local/scratch26/kreymer/genpy/fardet_data MINOS26 > scp -c blowfish -r minos06:/local/scratch06/kreymer/genpy/fardet_data/2006-03 2006-03 Looks better now, see this omitting the dbu commands : MINOS26 > ./genpy -d -l " -r R1.22 " fardet_data/2006-03 OK JUST TESTING Generating .py for /pnfs/minos/fardet_data/2006-03 STARTING Fri Jun 8 18:50:29 CDT 2007 Treating 1414 files Scanning 16 files F00034242_0013.mdaq.root Fri Jun 8 18:51:17 CDT 2007 F00034632_0012.mdaq.root Fri Jun 8 18:51:27 CDT 2007 F00034632_0013.mdaq.root Fri Jun 8 18:51:31 CDT 2007 F00034632_0014.mdaq.root Fri Jun 8 18:51:34 CDT 2007 F00034632_0015.mdaq.root Fri Jun 8 18:51:37 CDT 2007 F00034632_0016.mdaq.root Fri Jun 8 18:51:41 CDT 2007 F00034632_0017.mdaq.root Fri Jun 8 18:51:44 CDT 2007 F00034632_0018.mdaq.root Fri Jun 8 18:51:48 CDT 2007 F00034632_0019.mdaq.root Fri Jun 8 18:51:51 CDT 2007 F00034632_0020.mdaq.root Fri Jun 8 18:51:54 CDT 2007 F00034632_0021.mdaq.root Fri Jun 8 18:51:58 CDT 2007 F00034632_0022.mdaq.root Fri Jun 8 18:52:01 CDT 2007 F00034632_0023.mdaq.root Fri Jun 8 18:52:05 CDT 2007 F00034633_0000.mdaq.root Fri Jun 8 18:52:08 CDT 2007 F00034634_0000.mdaq.root Fri Jun 8 18:52:12 CDT 2007 F00034635_0000.mdaq.root Fri Jun 8 18:52:15 CDT 2007 Let's run it for real MINOS26 > ./genpy -l " -r R1.22 " fardet_data/2006-03 Oops, that should be R1.15 for such old data. Killed, removed generated file ( timed out after 10 minutes ) /local/scratch26/kreymer/genpy/fardet_data/2006-03/F00034632_0012* Try again later, when predator is idle. ./genpy -l " -r R1.15 " fardet_data/2006-03 N.B. on 2007 07 18, copied one of these files to /afs/fnal.gov/files/data/minos/d86/kreymer/F00034242_0013.mdaq.root for further testing with R1.22 N.B. 
on 2007 12 11 copied this to /minos/scratch/kreymer/F00034242_0013.mdaq.root ============================================================================= 2007 06 07 ########### # MINOS26 # ########### Per request to have cjames group corrected on minos26, 09:42 Joe Boyd re-enabled NIS (ypbind) 12:55 Tim Laszlo (timl@fnal.gov) disabled NIS 17:04 As of about 21:04 UTC, Joe moved minos26 to use NIS (YP) for account, with the same short list of authorized users in the local /etc/passwd file, but now taking detailed information from NIS : +shepelak +kreymer +buckley +rhatcher +cjames +mindata This required adding mindata to the global NIS passwd file. Logins for mindata and kreymer are working. Login shells are taken from the NIS passwd file. ( There was about a minute of lost access for mindata around 21:02. That's small compared the interruption this morning. ) ######## # GRID # ticket 98820 ######## Requested export to, and mount of /grid/data and /grid/app on minos01 readonly ######### # ADMIN # ######### Rubin has reviewed and drafted revised Computing section of MOU, word document sent to me in email, Liz original is MINOS-CD-MOU-Oct-06-v3.doc Rubin section is MINOSFermiGrid.doc ============================================================================= 2007 06 06 ############ # PNFSDIRS # ############ Still need to set perms and group on each level of created directory. I guess that means not doing mkdir -p Need + pnfsdirs near cedar daikon_00 L250200N # already exist pnfsdirs near cedar daikon_00 L010185N_bfldx113 write # at 16:42 + pnfsdirs near cedar daikon_00 L010185N # already existed pnfsdirs near cedar daikon_00 L250200N_nccoh write # at 16:45 Oops, failed to set group to e875 MINOS26 > DIRS=' /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_bfldx113 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_bfldx113/cand_data /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_bfldx113/mrnt_data /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_bfldx113/sntp_data /pnfs/minos/mcin_data/near/daikon_00/L250200N_nccoh /pnfs/minos/mcout_data/cedar/near/daikon_00/L250200N_nccoh /pnfs/minos/mcout_data/cedar/near/daikon_00/L250200N_nccoh/cand_data /pnfs/minos/mcout_data/cedar/near/daikon_00/L250200N_nccoh/mrnt_data /pnfs/minos/mcout_data/cedar/near/daikon_00/L250200N_nccoh/sntp_data ' for DIR in ${DIRS} ; do chgrp e875 ${DIR} ; done for DIR in ${DIRS} ; do ls -ld ${DIR} ; done pnfsdirs far cedar_phy daikon_02 CosmicLE write # 17:06 Had to manually fix MINOS26 > chgrp e875 /pnfs/minos/mcin_data/far/daikon_02 MINOS26 > chmod 775 /pnfs/minos/mcin_data/far/daikon_02 MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy/far/daikon_02 MINOS26 > chmod 775 /pnfs/minos/mcout_data/cedar_phy/far/daikon_02 MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy/far MINOS26 > chmod 775 /pnfs/minos/mcout_data/cedar_phy/far pnfsdirs far cedar_phy_safitter daikon_02 CosmicLE write # MINOS26 > chgrp e875 /pnfs/minos/mcin_data/far/daikon_02 MINOS26 > chgrp e875 /pnfs/minos/mcin_data/far MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy_safitter/far/daikon_02/CosmicLE MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy_safitter/far/daikon_02 MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy_safitter/far MINOS26 > chgrp e875 /pnfs/minos/mcout_data/cedar_phy_safitter MINOS26 > chmod 775 /pnfs/minos/mcin_data/far/daikon_02 MINOS26 > chmod 775 /pnfs/minos/mcin_data/far chmod: changing permissions of `/pnfs/minos/mcin_data/far': Operation not permitted MINOS26 > chmod 775 
/pnfs/minos/mcout_data/cedar_phy_safitter/far/daikon_02 MINOS26 > chmod 775 /pnfs/minos/mcout_data/cedar_phy_safitter/far MINOS26 > chmod 775 /pnfs/minos/mcout_data/cedar_phy_safitter And at the last minute, pnfsdirs far cedar_phy daikon_02 CosmicMu write # 22:22 pnfsdirs far cedar_phy_safitter daikon_02 CosmicMu write # 22:24 As the rest of the tree above CosmicMu is already in place, no need to touch up the higher-up permissions and groups. Two files were already being moved into /pnfs/minos/mcin_data/far/daikon_02/CosmicMu/100 -rw-r--r-- 1 kreymer e875 364696325 Jun 6 22:17 f20011001_0000_CosmicMu_D02.reroot.root -rw-r--r-- 1 kreymer e875 355457581 Jun 6 22:18 f20011001_0001_CosmicMu_D02.reroot.root Changed the family on the fly, should be OK. ####### # SAM # ####### SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and FILE_NAME like N00011434% " or SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and RUN_NUMBER 11434 " MINOS26 > sam list files --dim="${SAMDIM}" --nosummary N00011434_0021.spill.sntp.cedar_phy.0.root N00011434_0000.spill.sntp.cedar_phy.0.root SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and RUN_NUMBER 11434 \ and PARENT_BY_NAME N00011434_0000.mdaq.root \ " MINOS26 > sam list files --dim="${SAMDIM}" --nosummary N00011434_0000.spill.sntp.cedar_phy.0.root Getting a clean list of subruns : sam get metadata --file=N00011434_0000.spill.sntp.cedar_phy.0.root \ | grep parents \ | tr "'" \\\n \ | grep root \ | sort ####### # AFS # ####### Added Daniel Cherdack to the minos group pts membership minos | sort pts adduser -user cherdack -group minos ######## # FARM # ######## Copied test bad short sntp from safitter, temporarily, cp /grid/data/minos/farcat/F00038182_0022.all.sntp.cedar_phy_safitter.0.root /afs/fnal.gov/files/data/minos/d10/recodata113/ ============================================================================= 2007 06 05 ############# # MINOSORA3 # ############# Maureen reboots with mce=off in kernel, per RH advice mce=off disable machine check http://lkml.org/lkml/2003/8/5/126 mce=off turns off MCE reporting for fatal MCE exceptions (however your box may still crash when something really bad happens) ######## # FARM # ######## Investigating /grid/data/backlog SRV1> ./farmgsum Summarizing /grid/data/minos/*cat 1926 56573 nearcat 4921 6046 farcat 0 1 mcnearcat 0 1 mcfarcat 0 1 mcfmockcat 427 26379 minfarm/WRITE 7274 89001 TOTAL files, GBytes nearcat 74 2099 cosmic.sntp.cedar.0.root 372 5679 cosmic.sntp.cedar_phy.0.root 123 3176 cosmic.sntp.cedar_phy.1.root 926 22419 spill.mrnt.cedar_phy.0.root 62 789 spill.mrnt.cedar_phy.1.root 71 4354 spill.sntp.cedar.0.root 236 16644 spill.sntp.cedar_phy.0.root 62 4125 spill.sntp.cedar_phy.1.root farcat 53 1270 all.sntp.cedar.0.root 34 800 all.sntp.cedar_phy.0.root 118 2896 all.sntp.cedar_phy.1.root 4354 295 all.sntp.cedar_phy_safitter.0.root 53 241 spill.bntp.cedar.0.root 128 329 spill.bntp.cedar_phy.0.root 53 158 spill.sntp.cedar.0.root 128 207 spill.sntp.cedar_phy.0.root mcnearcat mcfarcat mcfmockcat minfarm/WRITE 2 1427 cosmic.sntp.cedar.0.root 6 3034 cosmic.sntp.cedar_phy.0.root 1 509 cosmic.sntp.cedar_phy.1.root 399 3694 sntp.cedar_phy.root 6 2263 spill.mrnt.cedar_phy.0.root 1 330 spill.mrnt.cedar_phy.1.root 2 3197 spill.sntp.cedar.0.root 9 11430 spill.sntp.cedar_phy.0.root 1 1764 spill.sntp.cedar_phy.1.root ####### # SAM # 
####### Preparing for cedar_phy_safitter export SAM_ORACLE_CONNECT="samdbs/" setup sam -q dev samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar_phy_safitter setup sam -q int setup sam -q prd OOPS, repeated above with samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy.safitter ./reloc -d -s dev cedar_phy_safitter # debug test ./reloc -s dev cedar_phy_safitter ./reloc -s int cedar_phy_safitter ./reloc -s prd cedar_phy_safitter ########### # ROUNDUP # ########### roundup.20070605 - added cedar_phy_safitter AFSS/roundup.20070605 -n -r cedar_phy_safitter far cp AFSS/roundup.20070605 . ln -sf roundup.20070605 roundup ######## # FARM # ######## ./roundup -r cedar_phy_safitter far Tue Jun 5 18:18:38 CDT 2007 killed it before it concatenated anything, regular roundups had kicked in, did not want to run 2 at once. Will run again later tonight. ./roundup -r cedar_phy_safitter far Tue Jun 5 21:56:47 CDT 2007 Needed /pnfs/minos/reco_far/cedar_phy_safitter/cand_data/2007-05 OK - skipping 12 files not yet in SAM Need to repeat sam declare for first 2 months, ./roundup -m 2006-12 -r cedar_phy_safitter far ./roundup -m 2007-01 -r cedar_phy_safitter far STARTED Wed Jun 6 05:06:25 2007 FINISHED Wed Jun 6 05:13:13 2007 due to lack of the cedar.phy.safitter application 2007-02 and later were OK ####### # SAM # ####### checking recent sam counts for brebel REL=cedar RELD=`echo ${REL} | tr . _` for DET in near far ; do for MON in 2007-04 2007-05 2007-06 ; do for STR in cand sntp ; do SAMDIM=" RUN_TYPE physics% \ and VERSION ${REL} \ and DATA_TIER ${STR}-${DET} \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_${DET}/${RELD}/${STR}_data/${MON} \ " printf " ${REL} ${DET} ${MON} ${STR} " ; \ sam list files --dim="${SAMDIM}" --count done ; done ; done ============================================================================= 2007 06 04 ######## # FARM # ######## vahle missing files ? 36 runs listed by howie, not in SAM, PNFS, or AFS ? 00007801 00007899 00008043 00008192 00008564 00008707 00008746 00008826 00008850 00008878 00008925 00008975 00009402 00009441 00009476 00009502 00009582 00009635 00009732 00009892 00010155 00010319 00010383 00010449 00010474 00010510 00010552 00010660 00010678 00010700 00010724 00010749 00011134 00011155 00011218 00011666 SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ " sam list files --dim="${SAMDIM}" File Count: 648 Average File Size: 553.60MB Total File Size: 350.32GB Total Event Count: 428335040 for RUN in ${RUNS} ; do echo $RUN SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and FILE_NAME like N${RUN}% " sam list files --nosummary --dim="${SAMDIM}" | sort done # tested the above with RUN=00008917 Then did full run ... 00009582 N00009582_0000.spill.sntp.cedar_phy.0.root 00009635 N00009635_0000.spill.sntp.cedar_phy.0.root ... 
Opened to all streams and tiers, found spill cand for N00008564 gap 11 N00009582 ok N00009635 ok N00009732 0/1/2/3 N00011134 gap 14-16 MINOS26 > sam locate N00009582_0000.spill.sntp.cedar_phy.0.root ['/pnfs/minos/reco_near/cedar_phy/sntp_data/2005-12,286@vob549'] MINOS26 > ls -l /pnfs/minos/reco_near/cedar_phy/sntp_data/2005-12/N00009582_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 kreymer e875 449697580 May 16 22:16 /pnfs/minos/reco_near/cedar_phy/sntp_data/2005-12/N00009582_0000.spill.sntp.cedar_phy.0.root MINOS26 > sam locate N00009635_0000.spill.cand.cedar_phy.0.root ['/pnfs/minos/reco_near/cedar_phy/cand_data/2006-01,215@vo9531'] MINOS26 > ls -l /pnfs/minos/reco_near/cedar_phy/cand_data/2006-01/N00009635_0000.spill.cand.cedar_phy.0.root -rw-r--r-- 1 1334 e875 414760027 May 16 22:54 /pnfs/minos/reco_near/cedar_phy/cand_data/2006-01/N00009635_0000.spill.cand.cedar_phy.0.root for RUN in ${RUNS} ; do ls /pnfs/minos/reco_near/cedar_phy/cand_data/*/N${RUN}* done something for all runs but N00010678 N00010700 ####### # AFS # ####### Preparing AFS request for a volume for rustem, for his analysis ( cd $MINOS_DATA ; ls -d d??? | sort | tail -3 ) d257 d258 d259 Clone acl from rustem's existing volumes, adjusted from buckley to minos:admin d186 d203 d221 Summary : ask for 50000 MB /afs/fnal.gov/files/data/minos/d260 Not backed up minos rl system:administrators rlidwka system:anyuser rl minos:admin rlidwka rustem rlidwka Sent request about 22:50 ######## # GRID # ######## Approved Rubin's Minos Production role https://voms.fnal.gov:8443/vomrs/vo-fermilab/vomrs + Members . Manage Groupsand Group Roles This will be needed when GPlazma in installed on June 21 in Dcache ============================================================================= 2007 06 01 ####### # DAQ # ####### Filled empty archiver.pid [minos@daqdcp-nd minos]$ ps xf | grep archiver | grep -v grep 18823 ? S 0:20 python /home/minos/bin/archiver_krb.py [minos@daqdcp-nd minos]$ ls -l /var/lock/daq/archiver.pid -rw-r--r-- 1 minos e875 0 May 31 21:20 /var/lock/daq/archiver.pid [minos@daqdcp-nd minos]$ printf "18823\n" >> /var/lock/daq/archiver.pid [minos@daqdcp-nd minos]$ cat /var/lock/daq/archiver.pid 18823 ########### # ROUNDUP # ########### roundup.20070529 - handles DUPS, purging if you set -D cp AFSS/roundup.20070529 . ln -sf roundup.20070529 roundup ./roundup -n -r cedar_phy far Too messy, created roundup.20070601 which lists HAVE count on the PEND line. 
Then select out the ones that are ready now, a few at a time AFSS/roundup.20070601 -f 2 -s "F00036655\|F00036662\|F00036680\|F00036718\|F00036770\|F00036773\| F00036777\|F00036780" -r cedar_phy far AFSS/roundup.20070601 -f 10 -s "F00032737\|F00032788\|F00032791\|F00030642\|F00031163\|F00031201\|F00031202\|F00031203\|F00031280\|F00031286" -r cedar_phy far AFSS/roundup.20070601 -f 18 -s "F00031292\|F00031295\|F00031302\|F00031330\|F00031338\|F00031343\|F00031344\|F00031348\|F00031353\|F00031378" -r cedar_phy far AFSS/roundup.20070601 -f 10 -s "F00031379\|F00031380\|F00031388\|F00031389\|F00031392\|F00031393\|F00031397" -r cedar_phy far hacked roundup to purge file that were written while PNFS was down, by looking in READ/SAM AFSS/roundup.20070601 -n -w -r cedar_phy far Oops, the -n pass writes DFARM files SRV1> find ROUNTMP/DFARM -cmin -40 -type f -exec mv {} ROUNTMP/DFARM/tmp/ \; Try again, -n, with disabled DFARM writing for NOOP OK,now really purgint AFSS/roundup.20070601 -w -M -r cedar_phy far Get a new list and clean up the last strays AFSS/roundup.20070601 -n -W -M -r cedar_phy far | tee /tmp/cpf AFSS/roundup.20070601 -f 10 -s "F00036777\|F00036777\|F00037252\|F00037697\|F00037700\|F00037703\|F00037709\|F00037761\|F00037776" -r cedar_phy far cp AFSS/roundup.20070601 . ln -sf roundup.20070601 roundup Oops, more changes for efficiency of filtering out DUP's. AFSS/roundup.20070601 -n -r cedar_phy near | tee /tmp/cpn Fri Jun 1 15:18:21 CDT 2007 cp AFSS/roundup.20070601 . ln -sf roundup.20070601 roundup N00007760_0011 is suppressed, but is present in output. For the moment,set these aside mv /grid/data/minos/nearcat/N00007760* /grid/data/minos/minfarm/N7760/ Remove duplicates ./roundup -W -M -D -s N00009805 -r cedar_phy near Create current PEND list for far, ./roundup -W -M -r cedar_phy far Now to a regular catchup for ./roundup -r cedar near Fri Jun 1 15:51:58 CDT 2007 ./roundup -r cedar far There is still manual clearing of PENDs from cedar_phy near to do, but we've got to clear the backlog first. Will let the next corral run clear the easy backlog in cedar_phy near Moved NOCAT to NOCAT.okm 17:30 ============================================================================= 2007 05 31 ########## # DCACHE # cedar_phy_safitter ########## ###### # CD # ###### Shutting down most servers for power outage minos-sam01 minos-sam02 minos-sam03 crontab -r on kreymer@minos26 mindata@minos26 minfarm@fnpcsrv1 Created pingall, pingstat scripts, We seem to have up : minos01 minos-mysql1 minos-25 AFS 16:10 AFS seems to have gone offline, processes are stuck 16:18 Mail is down, cannot contact imap3 server cannot ping imap1/2/3 16:25 CD system status server is unpingable from my desktop 17:30 AFS and fnpcsrv1 seem to be back up 21:13 - summary of CD status items E-mail listserv down with disk error, no estimate MSS systems coming back 17:15 - are they back ? Restarting SAM servers : minos-sam01 . setups.sh ; ups start sam_bootstrap ./sam_test_py minos minos-sam02 . setups.sh ; ups start sam_bootstrap minos-sam03 . setups.sh ; ups start sam_bootstrap Restarted monitors, per new HOWTO.monitor Checklist : DCache plots - stale Enstore servers - locked Summary - restarting mindata@minos26 ./srmtest srmls looks OK srmcp stuck for at least a minute dccp is also stuck Large numbers of DCache pools are offline. Enstore is still inactive. Only one of the seven RawDataWritePool Pools is online. Only 4 of the 12 general write queues are active. The above got stuck because pools with the test files are offline. 
Recent data can be copied from write pools. IFILE=N00012297_0023.mdaq.root DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} crl seems to be OK Sent this list of DCache pools down to dcache-admin MIN > ./poolstat Thu May 31 23:19:07 CDT 2007 4/ 14 ExpDbWritePools 6 FermigridVolPools 7/ 15 KTeVReadPools 4/ 13 MinosPrdReadPools 6/ 7 RawDataWritePools 4/ 8 readPools 9/ 13 writePools Fri Jun 1 07:32:12 CDT 2007 3/ 14 ExpDbWritePools 6 FermigridVolPools 3/ 15 KTeVReadPools 4/ 13 MinosPrdReadPools 7 RawDataWritePools 2/ 8 readPools 13 writePools ######## # FARM # ######## 17:40 Cleared the duplicates quickly with AFSS/roundup.20070529 -M -W -D -s N00009653 -r cedar_phy near AFSS/roundup.20070529 -M -W -D -s N00009689 -r cedar_phy near AFSS/roundup.20070529 -M -W -D -s N00009714 -r cedar_phy near AFSS/roundup.20070529 -M -W -D -s F00032654 -r cedar_phy far AFSS/roundup.20070529 -M -W -D -s F00035859 -r cedar_phy far ####### # CRL # ####### The CRL is having problems. That is odd, as the main page is up at http://www-minoscrl2.fnal.gov/minos/Index.jsp And the minos-mysql database is up. But there is no response when we click on "All Categories" at http://www-minoscrl2.fnal.gov/minos/Log.jsp?viewTopic=All The mysql database shows recent connections from crlweb2.fnal.gov Sent this report to the helpdesk, tried to call Suzanne Gysin at 8334 CR reports that CRL has been working... but not for me at present 21:38 UTC ############ # PNFSDIRS # ############ Creating pnfsdirs script to create and check permissions on various PNFS directories : reco_near/... reco_far/... mcin_data/... mcout_data/... ============================================================================= 2007 05 30 ######## # FARM # cosmic copies ######## Trying srmcp due to stuck kerberos doors, setup dcap -q unsecured SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr SOUT=${SPATH}/${RSPA} printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${DET}${REL}${STR}.log for FILE in ${FILES} ; do SFIL=${SOUT}/${FILE} DFIL=${DOUT}/${FILE} PFIL=${POUT}/${FILE} if [ ! -r ${PFIL} ] ; then echo "NEED" ${FILE} srmcp -streams_num=1 -server_mode=active file:///${FILE} ${SFIL} dccp -P ${DFIL} fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/${DET}${REL}${STR}.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${DET}${REL}${STR}.log Wed May 30 08:33:44 CDT 2007 Wed May 30 12:06:27 CDT 2007 This is running at about 12 seconds per 64 MByte file, this will take hours for the nearly 1000 files we have to copy. The dccp copies were reporting rates of over 30 MB/second. Copied 980 files in 93 minutes = 5580 sec => 5.7 Sec/file started purge of grid files cedar sntp near Wed May 30 15:24:08 CDT 2007 Wed May 30 15:41:39 CDT 2007 cedar sntp far Wed May 30 17:36:38 CDT 2007 Wed May 30 17:39:29 CDT 2007 cedar_phy sntp far Wed May 30 18:01:23 CDT 2007 Wed May 30 18:04:49 CDT 2007 cedar_phy sntp near Wed May 30 18:05:38 CDT 2007 ######## # FARM # cleanup ######## Testing new roundup which lists possible and real duplicates HAVE - existing runs DUPE - actual duplicate subruns CEDAR FAR AFSS/roundup.20070529 -f 1 -r cedar far This picked up 3 runs whose subruns had formerly been bad, are now good. 
CEDAR NEAR N00012145 has 17 subruns, rest are MIA, from 27 days ago missing 17-23, informed howie 12179 has good subruns from 10 days ago, so clean it up: AFSS/roundup.20070529 -f 8 -s N00012179 -r cedar near CEDAR MCNEAR no files pending CEDAR_PHY FAR dozens of old pending runs, this could be a challenge AFSS/roundup.20070529 -n -r cedar_phy far DUPE F00032654_0000.spill.bntp.cedar_phy.0.root DUPE F00032654_0000.spill.sntp.cedar_phy.0.root subruns 0-4 DUPE F00035859_0014.spill.bntp.cedar_phy.0.root DUPE F00035859_0014.spill.sntp.cedar_phy.0.root subruns 14, 16-23 CEDAR_PHY MEAR DUPE N00009653_0006.spill.mrnt.cedar_phy.0.root DUPE N00009653_0006.spill.sntp.cedar_phy.0.root 6, 19-22 DUPE N00009689_0000.spill.mrnt.cedar_phy.0.root DUPE N00009689_0000.spill.sntp.cedar_phy.0.root 0, 2, 3, 5, 7, 8 DUPE N00009714_0003.spill.mrnt.cedar_phy.0.root DUPE N00009714_0003.spill.sntp.cedar_phy.0.root 3-5, 16 ######## # FARM # mock ######## ./roundup -M -r cedar_phy mockfar Wed May 30 15:18:01 CDT 2007 Wed May 30 16:42:32 CDT 2007 Could not write to L250200N/000 corrected protections, restarted Wed May 30 17:40:53 CDT 2007 Wed May 30 18:04:25 CDT 2007 Oops, errors were writing to /pnfs/minos/mcout_data/cedar_phy/fmock/daikon_00/L010185N/sntp_data/000 Odd, this was owned by kreymer,created 22 May. Needed to rerun to pick up first 22 files ####### # LSF # minos cluster batch ####### Ticket 98153 ____________________________________________________________________ Presently, the minos cluster nodes minos19 through minos25 are set up to run LSF job for the minos queue. If licenses permit, we would like to expand this to most of the minos cluster in the short term ( by next week ) Please let us know if license are available, and we can discuss the specific list of nodes. I know we want to exclude minos01 minos02 minos11 minos26 and possible one or two others. __________________________________________________________________ ============================================================================= 2007 05 29 ########## # DCACHE # ########## There are 71 reco_far, reco_near, and mcout_data files at http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt waiting to be written. These are cand and .bcnd files written Friday from 2007-05-25 05:56:57 to 2007-05-25 15:10:09 Reported to helpdesk as follows , high priority :98029 _________________________________________________________________________ There are 71 Minos farm output files in the DCache write pools, but which are not yet on tape. These were written before the Friday 25 May PNFS probelems, from 2007-05-25 05:56:57 to 2007-05-25 15:10:09 The file list is reported at http://www-stken.fnal.gov/enstore/dcache_monitor/minos.txt When I try to copy one of these files with dccp, I get these messages : MINOS26 > dccp -d 4 ${DPATH}/${FILE} TEST.dat [Tue May 29 11:24:38 2007] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/cosmic/near/cand_data/c10010115_0003.cand.cedar_phy.root in cache. Connected in 0.00s. Command failed! Server error message for [1]: "905" (errno 905). Failed open file in the dCache. 
Can't open source file : "905" System error: Input/output error MINOS26 > echo ${DPATH}/${FILE} dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/mcout_data/cedar_phy/cosmic/near/cand_data/c10010115_0003.cand.cedar_phy.root MINOS26 > date Tue May 29 11:44:16 CDT 2007 _________________________________________________________________ Investigating for FILE in `cat /tmp/stales` ; do echo ${FILE} ; PAT=/pnfs/minos/`dirname ${FILE}` ; FIL=`basename ${FILE}` ( cd ${PAT} ; cat ".(use)(2)(${FIL})" | grep stken ) ; done w-stkendca11a-4 w-stkendca11a-6 w-stkendca11a-6 is missing from http://fndca.fnal.gov:2288/queueInfo 13:20 - w-stkendca11a-6 is back online. Apparently, 11 of the 13 pools were offline this morning, according to developers. prestaging the data for a safety copy : for FILE in `cat /tmp/stales` ; do echo ${FILE} ; dccp -P ${DPATH}/${FILE} ; sleep 10 ; done Most files are in read queues now, except : mcout_data/cedar_phy/cosmic/near/cand_data/c10010127_0004.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010128_0004.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010172_0003.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010171_0000.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010171_0004.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010186_0000.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010193_0001.cand.cedar_phy.root mcout_data/cedar_phy/cosmic/near/cand_data/c10010199_0004.cand.cedar_phy.root reco_far/cedar/.bcnd_data/2007-05/F00037989_0022.spill.bcnd.cedar.0.root reco_far/cedar/cand_data/2007-05/F00037993_0004.spill.cand.cedar.0.root reco_far/cedar/.bcnd_data/2007-05/F00037993_0004.spill.bcnd.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0009.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0009.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0007.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0010.cosmic.cand.cedar.0.root 13:55 most of these files are on tape now, Remaining files are all reco_near/cedar/cand_data Will check again in a couple of hours. 
16:66 Still need to get these on tape : reco_near/cedar/cand_data/2007-05/N00012182_0002.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0002.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0001.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0001.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0009.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0009.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0007.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0010.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0007.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0010.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0011.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012231_0011.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0009.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0009.spill.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0008.cosmic.cand.cedar.0.root reco_near/cedar/cand_data/2007-05/N00012182_0008.spill.cand.cedar.0.root for FILE in ${FILES} ; do echo ${FILE} ; PAT=/pnfs/minos/`dirname ${FILE}` ; FIL=`basename ${FILE}` ( cd ${PAT} ; cat ".(use)(4)(${FIL})" ) ; done N.B. removed 2007 06 10, all are on VO4763 ########### # ROUNDUP # ########### roundup.20070529 adding duplicate test, to allow more aggressive force for present, this will be based on READ and SAM/READ files ######## # FARM # ######## cosmic sntp files are in /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE/sntp_data but only cosmic with runs under 10000 are there. cosmic/bfld201_lowE is for far det files, runs under 10000 cosmoc/near is for near det files, runs over 10000 Created HOWTO.cosmicmc REL=cedar STR=sntp DET=near SRV1> printf "${FILES}\n" | wc -w 980 Tue May 29 17:23:32 CDT 2007 Tue May 29 18:56:40 CDT 2007 DET=far MINOS26 > printf "${FILES}\n" | wc -w 198 copied cedar and cedar_phy far Tried running cedar_phy near in the background, lost kerberos ticket when I logged out which hosed the kerberos door. Tried this again with the other door, stuffed the second door. ============================================================================= 2007 05 26 ####### # DAQ # ####### 11:15 - directory problem is resolved by remounting /pnfs on a server Restarted nd archive around 14:40 removed stray empty /var/lock/daq/archive.pid from minos@daqdcp FD daq had restarted around 12:06 ########## # DCACHE # ########## DCache remains healthy aside from SRM, so restarting predator. Special run to catch up : MINOS26 > echo "12 16 * * * /usr/krb5/bin/kcron ${HOME}/minos/scripts/predator" | crontab Then restored normal cronjob on minos26. Gave berg, podstvkv access to mindata@minos26, created srmtest script there, which does srmls and srmcp Added -debug-true This revealed an explicit trial of httpd,dccp,gsiftp Override this on command with -protocols=gsiftp and get successful copy Hypothesis, we have always been trying httpd,dccp,gsiftp, and dccp only recently woke from the dead on the server end. Something is broken in our .xml config files or their handling. ########### # ROUNDUP # ########### roundup.20070526 - sets -protocols=gsiftp cp AFSS/roundup.20070526 . ln -sf roundup.20070526 roundup Moved NOCAT to NOCAT.ok ############ # MCIMPORT # ############ cp AFSS/mcimport.20070526 . 
ln -sf mcimport.20070526 mcimport ######## # FARM # ######## Spotted an empty cedar_phy bcnd file, written last Saturday MINOS26 > dds /pnfs/minos/reco_far/cedar_phy/.bcnd_data/2007-02/F00037384_0006.spill.bcnd.cedar_phy.0.root -rw-r--r-- 1 1334 e875 0 May 19 05:01 /pnfs/minos/reco_far/cedar_phy/.bcnd_data/2007-02/F00037384_0006.spill.bcnd.cedar_phy.0.root The blinded cand is OK ============================================================================= 2007 05 25 ####### # DAQ # ####### singles runs this morning, during beam downtime. We built to a backlog of about 88 runs. They changed the runs from about 20 to 400 seconds, finishing the running before noon. Backlog of about 83 at 13:00 Will let this clear natuarally, given that we face a holiday weekend. ( Monday is Memorial Day ) ########## # DCACHE # ########## nwest has been having trouble with ftp from RAL and OX this last week. ########## # DC2AFS # ########## Data rates from minos26 leveled at 12MB/sec after 1 AM, had been about 8 to 10 MB/sec before that. according to Ganglia. dc2afs -n -d near -r cedar_phy -s sntp STARTING Fri May 25 09:02:49 CDT 2007 Running dc2afs for DET near REL cedar_phy STR sntp Processing 36 months ... FINISHED Fri May 25 14:23:17 CDT 2007 ####### # LSF # ####### kschu reminded me : we used up the available licenses setting up our minos queue perhaps older slow FNALU nodes could retire ? he has unique knowledge on configuring the minos queue, no NFS shared LSF config directory, files must be rsync'd by admins ########### # ENSTORE # ########### __________________________________________ Ticket #: 97982 ___________________________________________ Short Description: The /pnfs/minos file system has disappeared Problem Description: Sometime after Fri May 25 17:43:54 CDT 2007 and before Fri May 25 17:53:56 CDT 2007 the /pnfs/minos files seem to have disappeared. Likewise, /pnfs/cdf is gone. I cannot list files directly via our PNFS mounts ( i.e. on fnpcsrv1 ) I cannot list files via ftp or srmls. I see that the Enstore library managers are paused. But the ball is not red. ____________________________________________________________________ I disabled cron at kreymer@minos26 mindata@minos26 minfarm@fnpcsrv1 ( mv NOCAT.ok NOCAT ) Note also dbu failures for beam and DCS files this morning in predator. 19:50 - berg announces system up 1 hour, checking Minos writes 22:40 - found bad listing for some directories in normal ftp, and srm beam_data fardet_data/2007-05 neardet_data mcout_data/R0.8.0 SRV1> SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fardet_data/2007-05 SRV1> srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fardet_data/2007-05 SRV1> SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/neardet_data/2007-05 SRV1> srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/neardet_data/2007-05 SRV1> SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/neardet_data SRV1> srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/neardet_data Spoke to Berg around midnight, he will call experts (Vladimir) to restart ftp servers. Note that I can copy a file successfully from one of these lost directories via dccp : /pnfs/minos/beam_data/2004-12/B041201_195652.mbeam.root 00:39 - still waiting word, FTP down. 
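For the record, the dccp sanity check mentioned above was along these lines ( a sketch only - the exact command was not logged, and the choice of the kerberized dcap door on port 24736 is an assumption based on other entries in this log ) :
# sketch - copy one raw file out of a 'lost' directory to show dcap still serves it
FILE=beam_data/2004-12/B041201_195652.mbeam.root
dccp dcap://fndca1.fnal.gov:24736/pnfs/fnal.gov/usr/minos/${FILE} /tmp/`basename ${FILE}`
ls -l /tmp/`basename ${FILE}`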
zzzzzzzz 13:09 srmls is working, but srmcp fails : SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr IFILE=N00004502_0000.mdaq.root IPATH=minos/neardet_data/2004-11 SFILE=${SPATH}/${IPATH}/${IFILE} srmcp -streams_num=1 -server_mode=active \ $SFILE file:///TEST.dat Dcap Version version-1-2-38 Jan 4 2006 10:11:51 Allocated message queues 0, used 0 Allocated message queues 1, used 1 Creating a new control connection to stkendca2a.fnal.gov:24725. Activating IO tunnel. Provider: [/fnal/ups/prd/dcap/v2_38_f0512/Linux-2-4/lib/libgssTunnel.so]. Added IO tunneling plugin /fnal/ups/prd/dcap/v2_38_f0512/Linux-2-4/lib/libgssTunnel.so for stkendca2a.fnal.gov:24725. Sending control message: 0 0 client hello 0 0 2 38 -uid=10871 -pid=31746 -gid=5111 errrrr, this is using dcap, not GridFTP. Trying another file, in a safe path : ============================================================================= 2007 05 24 ########## # DCACHE # ########## PNFS/FTP went down on schedule 07:00 Report from howie, ticket 97825 of dcache unavailable srmcp failed 2007-05-23 22:30:47 Up at about 10:35, but ftp is still down Ticket 97867 12:06 ftp fixed by litvinse ######### # STAGE # ######### cedar restores finished last night : STARTING Thu May 17 15:59:31 CDT 2007 FINISHED Wed May 23 18:38:06 CDT 2007 ####### # SAM # ####### 08:25:50 DB patches have been deployed successfully. Minosprd is available for use. sam locate foo ./sam_test_py minos http://www-numi.fnal.gov/computing/findrun_sam.html selected recent raw files ######## # FARM # ######## Undeclaring files processed with wrong field, MINOS26 > sam list files --nosummary --dim='FILE_NAME like N00012252%cand%root' N00012252_0000.cosmic.cand.cedar.0.root N00012252_0000.spill.cand.cedar.0.root N00012252_0001.cosmic.cand.cedar.0.root N00012252_0001.spill.cand.cedar.0.root N00012252_0002.cosmic.cand.cedar.0.root N00012252_0002.spill.cand.cedar.0.root N00012252_0003.cosmic.cand.cedar.0.root N00012252_0003.spill.cand.cedar.0.root N00012252_0004.cosmic.cand.cedar.0.root N00012252_0004.spill.cand.cedar.0.root FILES=`sam list files --nosummary --dim='FILE_NAME like N00012252%cand%root'` for FILE in ${FILES} ; do echo ${FILE} ; sam undeclare ${FILE} ; done ####### # SAM # ####### Preparing sample query for Nikki, using http://www-numi.fnal.gov/computing/findrun_sam.html and grabbing the dimension from dbs sam Nov 11e constraints --summaryOnly --dim="data_tier sntp-near and VERSION_ANALYZED like r1.18.4" sam list files --dim="run_type physics% and file_name like N000%.spill.sntp.R1_18_4.0.root and start_time <= to_date('2006-09-30','yyyy-mm-dd') and end_time >= to_date('2006-09-24','yyyy-mm-dd')" SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and start_time >= to_date('2006-11-01','yyyy-mm-dd') \ and end_time <= to_date('2006-11-30','yyyy-mm-dd') " GRRRRRRRRRR have been fighting for years with and FAMILY_ANALYZED reco \ and APPL_NAME_ANALYZED loon \ and VERSION_ANALYZED cedar \ Nikki discovered that VERSION works fine sam get registered dimensions - lists VERSION sam get dimension info - seems consistent with this usage Try cedar_phy ####### # SAM # ####### Example for brebel SAMDIM=" RUN_TYPE physics% \ and VERSION cedar.phy \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME spill \ and FULL_PATH like /pnfs/minos/reco_near/cedar_phy/sntp_data/2006-11 \ " MINOS26 > sam list files --dim="${SAMDIM}" --nosummary N00011295_0000.spill.sntp.cedar_phy.0.root N00011295_0002.spill.sntp.cedar_phy.0.root 
N00011277_0000.spill.sntp.cedar_phy.0.root N00011200_0000.spill.sntp.cedar_phy.0.root N00011176_0000.spill.sntp.cedar_phy.0.root N00011259_0000.spill.sntp.cedar_phy.0.root FILES=`sam list files --dim="${SAMDIM}" --nosummary` for FILE in ${FILES} ; do PNFS=`sam locate ${FILE} | tr "'" \\\n | grep ^/pnfs/minos | cut -f 1 -d ,` printf "dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/${PNFS/\/pnfs\/}/${FILE}\n" done ######## # FARM # ######## used this to get list of recent ND files for reprocessing ( did not yet know to use VERSION ) SAMDIM="\ FAMILY_ANALYZED reco \ and APPL_NAME_ANALYZED loon \ and VERSION_ANALYZED cedar \ and RUN_NUMBER >= 12191 \ and RUN_NUMBER <= 12210 \ and FULL_PATH like /pnfs/minos/reco_near/cedar_phy/2007-05 \ " sam list files --dim="${SAMDIM}" --nosummary FILES=`sam list files --dim="${SAMDIM}" --nosummary | sort` 15:12 for FILE in ${FILES} ; do sam undeclare ${FILE} ; done ########## # DC2AFS # ########## Hacked it to use recodata??? dc2afs -n -d far -r cedar_phy -s sntp Started this for real around 21:30 2007-03 87/ 87 recodata108 47523102 F00037832_0000.spill.sntp.cedar_phy.0.root 66925650 bytes in 3 seconds (21785.69 KB/sec) FINISHED Fri May 25 08:07:48 CDT 2007 A couple of false starts, getting the overprinting clean and suppressing dccp output Need to toss in a SPACER call up front to clean up skip messages ============================================================================= 2007 05 23 ########## # DCACHE # ########## Prepare for DCache outage predator MINOS26 > echo 'crontab -r' | at 03:30 mcimport M26 > echo 'crontab -r' | at 03:30 job 21 at 2007-05-24 03:30 corral SRV1> echo 'mv /home/minfarm/ROUNTMP/NOCAT.ok /home/minfarm/ROUNTMP/NOCAT' \ | at 03:30 ####### # AFS # ####### Added Gemma Tinti to the minos group pts membership minos | sort pts adduser -user tinti -group minos ######## # FARM # ######## corral - Re-enabled mcnear now that roundup supports D01 ####### # AFS # ####### Requested 5 volumes for nue analysis group, per mayly, access to boehm d241 d242 d243 d244 d245 minos:admin rlidwka boehm rlidwka msanchez rlidwka ####### # AFS # ####### Planning to pull all cedar_phy sntp to AFS du -sh /pnfs/minos/reco_near/cedar_phy/sntp_data 217G /pnfs/minos/reco_near/cedar_phy/sntp_data du -sh /pnfs/minos/reco_far/cedar_phy/sntp_data 370G /pnfs/minos/reco_far/cedar_phy/sntp_data for completeness, we don't need these in AFS : du -sh /pnfs/minos/reco_near/cedar_phy/mrnt_data 41G /pnfs/minos/reco_near/cedar_phy/mrnt_data du -sh /pnfs/minos/reco_far/cedar_phy/.bntp_data 51G /pnfs/minos/reco_far/cedar_phy/.bntp_data cd $MINOS_DATA/d10/indexes wc -l *.index | sort -n A short one, 5 files, is 2006-08_near.R1_18_4.index A typical sntp at 519 is 2006-02_near.cedar.index Three mc_near are 10K+ AFSLD=/afs/fnal.gov/files/expwww/numi/html/computing/dh/afssum for INDEX in `ls *.index` ; do (( SUM = 0 )) for FILE in `cat ${INDEX}` ; do SIZ=`ls -l ../${FILE} | tr -s ' ' | cut -f 5 -d ' '` (( SUM += SIZ )) done SUM=`echo "${SUM} / 1000000000" | bc` printf "%5d %s\n" ${SUM} ${INDEX} done 2>&1 | tee ${AFSLD}/indexsum.20070523 sort -n ${AFSLD}/indexsum.20070523 ... 
50 mc_far.R1.14.index 57 2006-06_near.cedar.index 59 mc_near.R1_18_2.index 72 mc_far.daikon_00.cedar.index 120 mc_cosmic.bfld201.cedar.index 276 mc_near.carrot_06.R1_18_2.index 299 mc_near.carrot_06.cedar.index 842 mc_near.daikon_00.cedar.index cd $MINOS_DATA/d10 (( QUOT = 0 )) (( USED = 0 )) for DIR in `ls -d recodata*` ; do LSQ=`fs listquota ${DIR} | grep -v Quota | tr -s ' '` QUO=`echo ${LSQ} | cut -f 2 -d ' '` USE=`echo ${LSQ} | cut -f 3 -d ' '` (( QUOT += QUO )) (( USED += USE )) echo ${QUO} ${USE} done (( FREE = QUOT - USED )) (( QUOT /= 1000000 )) (( USED /= 1000000 )) (( FREE /= 1000000 )) printf "QUOTA ${QUOT}\nUSED ${USED}\nFREE ${FREE}\n" QUOTA 4362 USED 4221 FREE 140 We have 14 8 GB volumes d10 d11 d21 d22 d46 d47 d48 d49 d71 d72 d73 d74 d75 d76 Have submitted a request for 14 move 50 GB volumes. We'll see whether we have the space available. Volumes created 14:00, by inkmann, thanks !! ########## # DC2AFS # ########## dc2afs - script to move a release's ntuples from PNFS to AFS. taking bits from mcimport.20070509 dc2afs -n -d far -r cedar_phy -s sntp Checked out directories, had to cd d252 mv recodata105 recodata106 ============================================================================= 2007 05 22 ####### # SAM # ####### Test extreme sam query times, locally and on CDF database FILESM=`ls /pnfs/minos/fardet_data/2007-02` printf "${FILESM}\n" | wc -w 1004 FILES=`printf "${FILESM}\n" | head -10` FIRST=`printf "${FILES}\n" | head -1` FREST=`printf "${FILES}\n" | tail +2` FRESC=`for FI in ${FREST} ; do printf ", ${FI}" ; done` SAMDIM="( FILE_NAME ${FIRST} ${FRESC} )" time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } 10 real 0m1.703s user 0m0.920s sys 0m0.170s In dbs log, 13:38:09 SqlBuilderImpl.buildSqlQuery < 1 second 13:38:09 DbCore < 1 second Now try 1000 1000 real 1m15.183s user 0m1.320s sys 0m0.280s 13:43:42 SqlBuilderImpl.buildSqlQuery 13:44:50 rpn.infix2dims> rpnList = 13:44:50 rpn.infix2dims> returning dims = 13:44:50 DbCore ... 13:44:50 DbCore 13:44:55 DbFunctions.query::ALARM(2)> exec = 3.556872 secs 13:44:55 DbCore(servantId=98780).query[connId=5]> 1000 rows found CDF fcdflnx4: find /pnfs/cdfen/filesets/GJ/GJ00/ -type f | wc -l FILES=`find /pnfs/cdfen/filesets/GJ/GJ00/ -type f | head -1000 \ | cut -f 9 -d /` 1000 real 1m10.558s user 0m1.230s sys 0m0.210s ######## # FARM # ######## Inventory of roundup/corral, now that the DCache backlog is clear cedarfar PEND - have 23/24 subruns for F00038021_*.all.sntp.cedar*.root 2 05/19 23:37 most got processed Monday morning, 1 short wait a while cedarnear PEND - have 17/24 subruns for N00012145_*.cosmic.sntp.cedar*.root 18 05/03 07:38 missing cand 0017-0023 PEND - have 23/24 subruns for N00012197_*.cosmic.sntp.cedar*.root 7 05/14 15:44 missing cand for _0002 PEND - have 1/18 subruns for N00012200_*.cosmic.sntp.cedar*.root 7 05/14 19:02 all 18 written May 14 18:19 SRV1> dds /grid/data/minos/nearcat/N00012200* -rw-rw-r-- 1 rubin numi 29524001 May 14 19:02 /grid/data/minos/nearcat/N00012200_0004.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin numi 68729489 May 14 19:03 /grid/data/minos/nearcat/N00012200_0004.spill.sntp.cedar.0.root these are duplicates, mv /grid/data/minos/nearcat/N00012200* /grid/data/minos/minfarm/DUP/ PEND - have 20/24 subruns for N00012231_*.cosmic.sntp.cedar*.root 3 05/19 02:37 missing cand 09-11,14 cedarmcnear OK cedar_phyfar PEND 54 different runs, from back to 5/11, let things drain a bit cedar_phynear PEND 14 runs, back to 5/8 ,caught up. 
Digging into cedar_phynear ######## # FARM # ######## The above looks close enough, let's try mockfar ./roundup -W -M -n -r cedar_phy mockfar would add 99 files ./roundup -W -M -r cedar_phy mockfar added 99 to WRITE ./roundup -W -M -r cedar_phy mockfar Tue May 22 11:27:55 CDT 2007 running about 15 sec/file we will be out of the way before the noon corral cycle. ######## # FARM # ######## Correcting file families for daikon_01 files written, FILES=' n13011403_0000_L010185N_D01.sntp.cedar.root n13011406_0000_L010185N_D01.sntp.cedar.root n13011407_0000_L010185N_D01.sntp.cedar.root ' ( cd /pnfs/minos/mcout_data/cedar/near/daikon_01/L010185N/cand_data ; \ enstore pnfs --file_family reco_mc_near_cedar_cand ) ( cd /pnfs/minos/mcout_data/cedar/near/daikon_01/L010185N/mrnt_data ; \ enstore pnfs --file_family reco_mc_near_cedar_mrnt ) ( cd /pnfs/minos/mcout_data/cedar/near/daikon_01/L010185N/sntp_data ; \ enstore pnfs --file_family reco_mc_near_cedar_sntp ) mkdir /pnfs/minos/mcout_data/cedar/near/daikon_01/L010185N/sntp_data/140 DAI1=/pnfs/minos/mcout_data/cedar/near/daikon_01/L010185N/sntp_data/140/ DAI0=/pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/140/ for FILE in ${FILES} ; do mv ${DAI0}/${FILE} ${DAI1}/${FILE} ; done 12:10 ########### # ROUNDUP # ########### roundup.20070522 Updated DETI to use F0, N0 for far, near, to avoid conflict with mock Need to set MCREL from file name, to handle daikon_00 and daikon_01 Better yet, check and bail on any other than D??=daikon for now. Tested on 1 file AFSS/roundup.20070522 -W -M -s n13011401 -r cedar mcnear AFSS/roundup.20070522 -w -s n13011401 -r cedar mcnear Looks good. Ran the rest, not all of which have all subruns in mcin. cp AFSS/roundup.20070522 . ln -sf roundup.20070522 roundup ./roundup -r cedar mcnear Tue May 22 15:17:59 CDT 2007 ######## # FARM # ######## cosmic sntp files are in /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE/sntp_data but only cosmic with runs under 10000 are there. cosmic/bfld201_lowE is for far det files, runs under 10000 cosmoc/near is for near det files, runs over 10000 cd /pnfs/minos/mcout_data/cedar/cosmic ls near/cand_data | wc -l 995 ls bfld201_lowE/sntp_data/c1001* | wc -l 995 I need to move these 995 files now : FILES=`ls bfld201_lowE/sntp_data/c1001* | cut -f 3 -d /` for FILE in ${FILES} ; do mv bfld201_lowE/sntp_data/${FILE} near/sntp_data/${FILE} usleep 200000 ; done Done around 15:55 ============================================================================= 2007 05 21 ######## # FARM # ######## DCCP limited this morning, due to continued cand backlog. cedar_phyfar has lots of pending runs from around 5/11 cedar_phynear - lots pending, nothing added since Sat. /grid/data is up to 180 GB, mostly in WRITE, throttled ######### # STAGE # ######### Still cranking on cedar far, doing far 2005-04, down the home stretch. Most files seem to be needed for this patch of data. ########### # ROUNDUP # ########### roundup.20070518 Putting this into production cp AFSS/roundup.20070518 . 
ln -sf roundup.20070518 roundup ./roundup -r cedar far ./roundup -r cedar near restored 0 length sntp from Friday, below disabled cronjob, while running these manually ./roundup -r cedar_phy far hit a backlog after writing most files queue has backed off a bit, try again ./roundup -r cedar_phy far ######## # FARM # ######## Removed 0 length sntp file written Friday 05:01 MINOS26 > dds /pnfs/minos/reco_far/cedar_phy/sntp_data/2006-11/F00037028_0000.all.sntp.cedar_phy.0.root -rw-r--r-- 1 kreymer e875 0 May 19 05:01 /pnfs/minos/reco_far/cedar_phy/sntp_data/2006-11/F00037028_0000.all.sntp.cedar_phy.0.root srmcp retried 3 times, them failed due to existing file. MINOS26 > rm /pnfs/minos/reco_far/cedar_phy/sntp_data/2006-11/F00037028_0000.all.sntp.cedar_phy.0.root Moved this file back from DUP SRV1> mv DUP/F00037028_0000.all.sntp.cedar_phy.0.root WRITE/F00037028_0000.all.sntp.cedar_phy.0.root ######## # FARM # ######## Duplicate runs with version 1, from Rubin : FILES=' N00011609_0004.spill.cand.cedar_phy.1.root N00011609_0006.spill.cand.cedar_phy.1.root N00011609_0007.spill.cand.cedar_phy.1.root N00011609_0012.spill.cand.cedar_phy.1.root N00011609_0013.spill.cand.cedar_phy.1.root N00011609_0014.spill.cand.cedar_phy.1.root N00011609_0017.spill.cand.cedar_phy.1.root N00011609_0018.spill.cand.cedar_phy.1.root N00011640_0000.spill.cand.cedar_phy.1.root N00011687_0003.spill.cand.cedar_phy.1.root N00011687_0010.spill.cand.cedar_phy.1.root N00011687_0011.spill.cand.cedar_phy.1.root N00011687_0012.spill.cand.cedar_phy.1.root N00011687_0014.spill.cand.cedar_phy.1.root N00011687_0015.spill.cand.cedar_phy.1.root N00011687_0016.spill.cand.cedar_phy.1.root N00011687_0019.spill.cand.cedar_phy.1.root N00011687_0020.spill.cand.cedar_phy.1.root N00011687_0021.spill.cand.cedar_phy.1.root N00011687_0023.spill.cand.cedar_phy.1.root N00011707_0000.spill.cand.cedar_phy.1.root N00011707_0001.spill.cand.cedar_phy.1.root N00011707_0002.spill.cand.cedar_phy.1.root N00011707_0004.spill.cand.cedar_phy.1.root N00011707_0006.spill.cand.cedar_phy.1.root N00011707_0007.spill.cand.cedar_phy.1.root N00011707_0008.spill.cand.cedar_phy.1.root N00011707_0011.spill.cand.cedar_phy.1.root N00011707_0013.spill.cand.cedar_phy.1.root N00011728_0001.spill.cand.cedar_phy.1.root N00011728_0002.spill.cand.cedar_phy.1.root N00011728_0003.spill.cand.cedar_phy.1.root N00011728_0007.spill.cand.cedar_phy.1.root ' for FILE in ${FILES} ; do printf "${FILE}\n" ; ls -1 /grid/data/minos/nearcat/${FILE:0:20}* ; done There were sntp and mrnt files for all these. Move them to DUP/pass1 intending to remove them entirely. mkdir /grid/data/minos/minfarm/DUP/pass1/ for FILE in ${FILES} ; do printf "${FILE}\n" mv /grid/data/minos/nearcat/${FILE:0:20}* /grid/data/minos/minfarm/DUP/pass1/ done Check SAM locations for FILE in ${FILES} ; do sam locate ${FILE} ; done all have locations except Datafile with name 'N00011687_0003.spill.cand.cedar_phy.1.root' not found. for FILE in ${FILES} ; do FOLD=${FILE:0:36}0.root ; sam locate ${FOLD} ; done all have locations unknown volume except N00011687_0003.spill.cand.cedar_phy.0.root Make a shortened list FILSAM=${FILES/N00011687_0003.spill.cand.cedar_phy.1.root} for FILE in ${FILSAM} ; do sam undeclare file ${FILE} ; done for FILE in ${FILSAM} ; do FOLD=${FILE:0:36}0.root ; sam undeclare ${FOLD} ; done SRV1> ./roundup -m 2007-01 -r cedar_phy near SRV1> ./roundup -m 2007-02 -r cedar_phy near We're clean now ! 
for RUN in N00011609 N00011640 N00011687 N00011707 N00011728 ; do sam locate ${RUN}_0000.spill.mrnt.cedar_phy.0.root ; done for RUN in N00011609 N00011640 N00011687 N00011707 N00011728 ; do sam locate ${RUN}_0000.spill.sntp.cedar_phy.0.root ; done for RUN in N00011609 N00011640 N00011687 N00011707 N00011728 ; do sam get metadata --file=${RUN}_0000.spill.mrnt.cedar_phy.0.root | grep parent done for RUN in N00011609 N00011640 N00011687 N00011707 N00011728 ; do sam get metadata --file=${RUN}_0000.spill.sntp.cedar_phy.0.root | grep parent done ============================================================================= 2007 05 20 ######## # FARM # ######## Writing cosmic MC files to PNFS. Note that cosmic MC file naming is entirely different from centrally produced files, and in many cases conflicts. See Swallowing my pride, and at great risk of duplicting file names previously used, I will simply move them as requested to PNFS. From /grid/data/minos/mccosmic /grid/data/minos/mccosmiccat To /pnfs/minos/mcout_data/cedar/cosmic/ SRV1> du -sm /grid/data/minos/mccosmic* 25848 /grid/data/minos/mccosmic 69217 /grid/data/minos/mccosmiccat setup dcap # kerberized DCPOR=24736 RELE=cedar COSMIC CAND cd /grid/data/minos/mccosmic STRM=cand FILES=`ls -1 *${STRM}*${RELE}\.root` Need to pick up stray cands from mccosmic SRV1> ls /grid/data/minos/mccosmic | wc -l 46 SRV1> ls /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE/cand_data | wc -l 175 RSPA=minos/mcout_data/cedar/cosmic/bfld201_lowE/${STRM}_data DOUT=/dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}purge.log for FILE in ${FILES} ; do PFIL=${POUT}/${FILE} if [ -r "${PFIL}" ] ; then PINFO=`(cd ${POUT} ; cat ".(use)(4)(${FILE})" | tr '\n' '\t')` ECRC=`printf "${PINFO}" | cut -f 11` if [ -n "${ECRC}" ] ; then LCRC=`ecrc ${FILE} | tr -s ' ' | cut -f 2 -d ' '` echo " ${FILE}" ${LCRC} ${ECRC} [ ${LCRC} = ${ECRC} ] && echo rm ${FILE} && rm ${FILE} fi ; fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}purge.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}purge.log Test run revealed a duplicate, c10000607_0003.cand.cedar.root mv /grid/data/minos/mccosmic/c10000607_0003.cand.cedar.root \ /grid/data/minos/minfarm/DUP/ Ran the purge for real, Sun May 20 16:35:45 CDT 2007 Now can write the rest : FILES=`ls -1 *${STRM}*${RELE}\.root` FILES=`ls -1 *${STRM}*${RELE}\.root` printf "${FILES}\n" | wc -w 45 printf "\n`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log for FILE in ${FILES} ; do DFIL=${DOUT}/${FILE} PFIL=${POUT}/${FILE} if [ ! -r ${PFIL} ] ; then echo "NEED" ${FILE} dccp ${FILE} ${DFIL} dccp -P ${DFIL} fi done 2>&1 | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log printf "`date`\n" | tee -a ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log Sun May 20 16:38:30 CDT 2007 Sun May 20 16:56:22 CDT 2007 cd /grid/data/minos/mccosmiccat STRM=sntp RSPA=minos/mcout_data/cedar/cosmic/bfld201_lowE/${STRM}_data DOUT=/dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} FILES=`ls -1 *${STRM}*${RELE}\.root` printf "${FILES}\n" | wc -w 1194 Cut/paste the dccp commands above : Sun May 20 17:02:59 CDT 2007 Sun May 20 19:18:55 CDT 2007 PURGED CAND FILES ( Forgot to un-comment the initial purge described above, so this re-purges some older files . 
) Sun May 20 19:47:20 CDT 2007 Files were initially not group writeable, ran second pass to pick up c10000607_0000.cand.cedar.root Sun May 20 20:04:25 CDT 2007 Sun May 20 20:04:29 CDT 2007 Moved to mccosmiccat, purged 766 files already on tape : Sun May 20 21:05:25 CDT 2007 Sun May 20 21:20:18 CDT 2007 ============================================================================= 2007 05 19 ######## # FARM # ######## Moved another duplicate to DUP SRV1> mv WRITE/F00037028_0000.all.sntp.cedar_phy.0.root DUP/ ######## # FARM # ######## Shifted local ROUNDUP/DUP files to /grid/data/minos/minfarm/DUP FILES=`ls DUP` Checked for conflicts, there were none SRV1> for FILE in $FILES ; do ls /grid/data/minos/minfarm/${FILE} ; done Copied files SRV1> for FILE in $FILES ; do cp -a DUP/${FILE} /grid/data/minos/minfarm/${FILE} ; done Checked files SRV1> for FILE in $FILES ; do diff DUP/${FILE} /grid/data/minos/minfarm/${FILE} ; done Purged files SRV1> for FILE in $FILES ; do rm DUP/${FILE} ; done Oops, shifted files from minfarm to minfarm/DUP SRV1> for FILE in $FILES ; do ls -l /grid/data/minos/minfarm/${FILE} /grid/data/minos/minfarm/DUP/${FILE} ; done SRV1> for FILE in $FILES ; do mv /grid/data/minos/minfarm/${FILE} /grid/data/minos/minfarm/DUP/${FILE} ; done Relinked DUP SRV1> rmdir DUP ; ln -s /grid/data/minos/minfarm/DUP DUP ########### # ROUNDUP # ########### Certifying roundup.20070518 for general use allowing MDC files to be handled. Previously, these would have been confused with FD files. cedar near and far look OK, comparing to .20070510 ( default ) Write cedar_phy near and far to /tmp for comparison, as these logs are longer. SRV1> AFSS/roundup.20070510 -n -r cedar_phy far 2>&1 | tee /tmp/cpf10 SRV1> AFSS/roundup.20070518 -n -r cedar_phy far 2>&1 | tee /tmp/cpf18 SRV1> diff /tmp/cpf10 /tmp/cpf18 Had to hack to correct AUTODEST for fmock, as had done for MCIN path ########## # DCACHE # ########## Queued stores went up over 3500 midday on 18 May, with nearly 1500 queued restores. These restores should not have been farm activity, as raw data is all on disk. Will have to look at billing files to see the root cause. Almost all the reads now are from flxi04 by jurgen. Why hundreds of reads queued up ? Why to flxi04 ? This is selex data. ============================================================================= 2007 05 18 ######## # FARM # ######## GRRRRRRRRRRRRRRRRRRR ONCE AGAIN, GOING INTO A WEEKEND, A M A J O R CHANGE TO THE MODEL Apparently we will be receiving a substantial number of duplicated files. This has already started to show up as duplicates in WRITE. I am supposed to ignore them, drop them on the floor. This requires some means of detecting them. Not so easy, since things are concatenated. In principle, a grep of the READ and SAM/READ indexes may find the originals. SAM could help, but not for MC files. For the moment, I have moved the duplicated concatenated files to DUP. ######## # FARM # ######## Rubin : Another fact, although I haven't seen anything about it from the systems people, is that dcache was having problems, I think both reading and writing, for several periods yesterday afternoon and evening. If you detect that bad files are bunched, (for example 22:15 - 22:45) that's probably the source.
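The READ / SAM/READ duplicate check described above could be as simple as this ( a sketch only, run from the minfarm area that holds the READ and READ/SAM indexes ; nearcat is just one example, farcat and mcnearcat would be scanned the same way ) :
# sketch - flag incoming subrun files already recorded in a READ or READ/SAM index
for FILE in `ls /grid/data/minos/nearcat` ; do
  grep -q ${FILE} READ/${FILE:0:10}* READ/SAM/${FILE:0:10}* 2>/dev/null \
    && echo DUPE ${FILE}
done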
============================================================================= 2007 05 17 ######### # STAGE # ######### STRM=mrnt date >> ../TRACE for DIR in `ls /pnfs/minos/reco_near/cedar_phy/${STRM}_data` ; do ./stage -w -g MinosPrdReadPools reco_near/cedar_phy/${STRM}_data/${DIR} done 2>&1 | tee -a ../TRACE date >> ../TRACE Thu May 17 08:19:09 CDT 2007 Thu May 17 08:41:30 CDT 2007 date >> ../TRACE REL=cedar_phy DET=far for STRM in `ls -a /pnfs/minos/reco_${DET}/${REL} | grep "bntp\|sntp\|mrnt"` ; do for DIR in `ls /pnfs/minos/reco_${DET}/${REL}/${STRM}` ; do ./stage -w -g MinosPrdReadPools reco_${DET}/${REL}/${STRM}/${DIR} done ; done 2>&1 | tee -a ../TRACE date >> ../TRACE Thu May 17 08:46:43 CDT 2007 Thu May 17 13:34:45 CDT 2007 Now for the old stuff, may take days, using a spare window... printf "\n\n\nSTARTING `date`\n" >> ../TRACE REL=cedar for DET in near far ; do for STRM in `ls -a /pnfs/minos/reco_${DET}/${REL} | grep "bntp\|sntp\|mrnt"` ; do for DIR in `ls /pnfs/minos/reco_${DET}/${REL}/${STRM}` ; do ./stage -w -g MinosPrdReadPools reco_${DET}/${REL}/${STRM}/${DIR} done ; done ; done 2>&1 | tee -a ../TRACE printf "FINISHED `date`\n" >> ../TRACE STARTING Thu May 17 15:59:31 CDT 2007 FINISHED Wed May 23 18:38:06 CDT 2007 ############ # MCIMPORT # ############ Forced recent cosmic MC to disk, only 4 G/10 threshold present now. M26 > ./mcimport -f 60 howcroft Thu May 17 11:47:04 CDT 2007 ============================================================================= 2007 05 16 ############ # MCIMPORT # ############ Mock data has started to arrive from howcroft. Check a slug of these manually. In howcroft, mv NOIMPORT noIMPORT ./mcimport -n -F howcroft Paths look OK to me In howcroft, mv noIMPORT MCIMPORT ./mcimport -F howcroft Failed, proxy has expired Back in fnpcsrv1:/home/minfarm/.grid SRV1> grid-proxy-info -f kreymer-doe.proxy subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=768538851 issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : kreymer-doe.proxy timeleft : 0:00:00 grid-proxy-init -cert kreymer-doe.pem -key kreymer-doekey.pem ( used my usual long many-word pass phrase ) ERROR: Your certificate has expired: Tue May 8 10:08:22 2007 OK, copy my new cert from desktop, where I use it for web browsing Per 2006 10 28 log entry scp kreymer-doe.p12 minfarm@fnpcsrv1:.grid/kreymer-doe.p12 SRV1> openssl pkcs12 -in kreymer-doe.p12 -clcerts -nokeys -out kreymer-doe.pem Enter Import Password: MAC verified OK SRV1> openssl pkcs12 -in kreymer-doe.p12 -nocerts -out kreymer-doekey.pem Enter Import Password: MAC verified OK Enter PEM pass phrase: Verifying - Enter PEM pass phrase: chmod 600 kreymer-doe*.pem Get a grid proxy SRV1> grid-proxy-init -cert kreymer-doe.pem -key kreymer-doekey.pem -out kreymer-doe.proxy -valid 999999:00 Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 Enter GRID pass phrase for this identity: Creating proxy ...................................................... 
Done Warning: your certificate and proxy will expire Tue Apr 15 11:22:43 2008 which is within the requested lifetime of the proxy SRV1> grid-proxy-info -f kreymer-doe.proxy subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=1467756922 issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : kreymer-doe.proxy timeleft : 8040:34:54 (335.0 days) Now copy this back to minos26:mindata SRV1> scp kreymer-doe.proxy mindata@minos26:.grid/ ./mcimport -F howcroft RequestFileStatus#-2146774239 failed with error:[ at Wed May 16 10:52:38 CDT 2007 state Failed : user has no permission to write into path /pnfs/fnal.gov/usr/minos/mcin_data/fmock/daikon_00/L010185N/000 $ dds /pnfs/minos/mcin_data/fmock/daikon_00/L010185N/000 total 1 drwxr-xr-x 1 rhatcher e875 512 May 9 14:13 ./ drwxr-xr-x 1 rhatcher e875 512 May 9 14:13 ../ As rubin SRV1> mv daikon_00 daikon_00rh SRV1> mkdir daikon_00 SRV1> chmod 775 daikon_00 As kreymer MINOS26 > mkdir /pnfs/minos/mcin_data/fmock/daikon_00/L010185N MINOS26 > chmod 775 /pnfs/minos/mcin_data/fmock/daikon_00/L010185N MINOS26 > mkdir /pnfs/minos/mcin_data/fmock/daikon_00/L250200N MINOS26 > chmod 775 /pnfs/minos/mcin_data/fmock/daikon_00/L250200N This is running, as of 10:59 Finished 11:25, after a 5 minute delay early on Minos26 data rates plateaued around 6 Mbytes/second, 11:10 to 11:25 ########### # ROUNDUP # ########### roundup.20070516 Added dccp -P prestage of all files. Had been omitted prior to Apr 27 deployment of MinosPrdReadPools ######### # STAGE # ######### ./stage -d -p 0 -g MinosPrdReadPools reco_near/cedar_phy/sntp_data/2007-02 Needed 9/ 9 ./stage -w -g MinosPrdReadPools reco_near/cedar_phy/sntp_data/2007-02 Needed 9/ 9 FINISHED Wed May 16 16:08:28 CDT 2007 ./stage -d -p 0 -g MinosPrdReadPools reco_near/cedar_phy/sntp_data/2007-02 . 
Needed 0/ 9 FINISHED Wed May 16 16:09:15 CDT 2007 for DIR in `ls /pnfs/minos/reco_near/cedar_phy/sntp_data` ; do ./stage -w -g MinosPrdReadPools reco_near/cedar_phy/sntp_data/${DIR} done see TRACE STARTING Wed May 16 16:11:51 CDT 2007 STARTING Wed May 16 16:38:54 CDT 2007 ============================================================================= 2007 05 15 ####### # SAM # ####### Test rapid listing of files declared to sam, for use in saddreco etc, MINOS26 > FILESM=`ls /pnfs/minos/fardet_data/2007-02` MINOS26 > printf "${FILESM}\n" | wc -w 1004 FILES=`printf "${FILESM}\n" | head -10` FIRST=`printf "${FILES}\n" | head -1` FREST=`printf "${FILES}\n" | tail +2` FRESC=`for FI in ${FREST} ; do printf ", ${FI}" ; done` SAMDIM="( FILE_NAME ${FIRST} ${FRESC} )" time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } real 0m1.457s user 0m0.880s sys 0m0.160s for NFI in 1 4 16 64 256 ; do FILES=`printf "${FILESM}\n" | head -${NFI}` FIRST=`printf "${FILES}\n" | head -1` FREST=`printf "${FILES}\n" | tail +2` FRESC=`for FI in ${FREST} ; do printf ", ${FI}" ; done` SAMDIM="( FILE_NAME ${FIRST} ${FRESC} )" time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } done for NFI in 1 4 16 64 256 999; do 1 1 1 real 0m1.615s real 0m1.502s real 0m2.678s user 0m0.870s user 0m0.900s user 0m0.940s sys 0m0.290s sys 0m0.220s sys 0m0.200s 4 4 4 real 0m1.632s real 0m1.716s real 0m1.503s user 0m0.890s user 0m0.860s user 0m0.930s sys 0m0.350s sys 0m0.300s sys 0m0.210s 16 16 16 real 0m1.649s real 0m1.737s real 0m1.531s user 0m0.930s user 0m0.920s user 0m0.880s sys 0m0.290s sys 0m0.260s sys 0m0.250s 64 64 64 real 0m2.005s real 0m2.160s real 0m1.868s user 0m0.950s user 0m0.950s user 0m0.870s sys 0m0.290s sys 0m0.340s sys 0m0.230s 256 256 256 real 0m6.978s real 0m6.269s real 0m6.402s user 0m1.090s user 0m1.030s user 0m1.050s sys 0m0.510s sys 0m0.250s sys 0m0.190s 999 999 real 1m17.383s real 1m11.690s user 0m1.510s user 0m1.350s sys 0m0.140s sys 0m0.330s ( filter out user, sys, from now on ) for NFI in 100 200 300 400 500 600 700 800 900 999 ; do 100 100 real 0m2.434s real 0m5.847s user 0m0.980s user 0m0.930s sys 0m0.240s sys 0m0.150s 200 200 real 0m4.723s real 0m4.622s user 0m0.990s user 0m1.180s sys 0m0.340s sys 0m0.220s 300 300 real 0m8.268s real 0m8.102s user 0m1.040s user 0m1.050s sys 0m0.200s sys 0m0.240s 400 400 real 0m13.307s real 0m12.989s user 0m1.110s user 0m1.260s sys 0m0.210s sys 0m0.190s 500 500 real 0m25.629s real 0m19.199s user 0m1.170s user 0m1.150s sys 0m0.220s sys 0m0.260s 600 600 real 0m54.243s real 0m27.445s user 0m1.200s user 0m1.140s sys 0m0.370s sys 0m0.260s 700 700 real 0m39.220s real 0m36.608s user 0m1.360s user 0m1.330s sys 0m0.260s sys 0m0.220s 800 800 real 0m49.463s real 0m49.048s user 0m1.370s user 0m1.350s sys 0m0.420s sys 0m0.590s 900 900 real 1m6.517s real 0m59.775s user 0m1.430s user 0m1.670s sys 0m0.300s sys 0m0.280s 999 999 real 1m35.493s real 1m12.315s user 0m0.550s user 0m1.660s sys 0m0.260s sys 0m0.150s MINOS26 > FILES=`printf "${FILESM}\n" | head -1002 MINOS26 > time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } real 1m12.128s user 0m0.890s sys 0m0.130s MINOS26 > time { sam list files --nosummary --dim="${SAMDIM}" ; } ...real 1m13.084s user 0m1.450s sys 0m0.220s FILES=`ls /pnfs/minos/fardet_data/2007-02 ; ls /pnfs/minos/fardet_data/2007-03` echo $FILES | wc -w 1841 ... 
time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } ORA-01795: maximum number of expressions in a list is 1000 0 real 3m56.338s user 0m0.830s sys 0m0.160s MINOS26 > time { sam list files --nosummary --dim="${SAMDIM}" | wc -l ; } ORA-01795: maximum number of expressions in a list is 1000 0 real 3m54.704s user 0m0.920s sys 0m0.320s Looked at dbserver dbg file, all the time is spent in SqlBuilderImpl.buildSqlQuery for NFI in 100 200 300 400 500 600 700 800 900 999 ; do sleep 30 100 1 second in SqlBuilderImpl.buildSqlQuery real 0m3.943s user 0m0.930s sys 0m0.230s 200 2 real 0m4.666s user 0m1.050s sys 0m0.300s 300 7 real 0m8.257s user 0m1.010s sys 0m0.180s 400 11 real 0m13.072s user 0m1.100s sys 0m0.280s 500 17 real 0m19.515s user 0m1.190s sys 0m0.170s 600 25 real 0m27.969s user 0m1.260s sys 0m0.180s 700 34 real 0m37.339s user 0m1.360s sys 0m0.410s 800 44 real 0m47.711s user 0m1.360s sys 0m0.250s 900 56 real 0m59.522s user 0m1.520s sys 0m0.220s 999 69 seconds real 1m12.526s user 0m1.650s sys 0m0.250s This is the same old production Minos sam_db_srv v7_6_1 ######### # VAULT # ######### vault.20070515 Time ordered list of files before encp, for rational order. mv vault_prev vault.20060807 # N.B.- moved this to vault.monthly 2008 04 02 mv vault vault.20070109 ln -s vault.20070515 vault rawsum.20070515 Time ordered list of files before encp, for rational order. mv rawsum.0329a rawsum.20060329 cp rawsum rawsum.20070515 mv rawsum rawsum.20060331 ln -s rawsum.20070515 rawsum ######## # FARM # ######## Complexity calculation for recent running : 2 Detectors ( near/far ) 2 Streams ( spill/cosmic , spill/all ) 4 Types ( data , MC , Mock , cambridge ) 4 Releases ( cedar, R1_24cal, R1_24calB , cedar_phy ) 2 Teams/scripts ( Howie, Art ) ? Calibs ( alpha beta gamma ... final ) 2 Samples ( 1/6 , 5/6 for near data pass ) = 256 ============================================================================= 2007 05 14 ####### # AFS # ####### Requested a new volume for NONAP group MINOS26 > ls -1d $MINOS_DATA/d??? # see what is in use /afs/fnal.gov/files/data/minos/d240 system:administrators rlidwka minos:admin rlidwka habig rlidwka minos rl Expanded minos:admin to match buckley:admin, adding habig pts membership minos:admin for GUSER in boehm dharris messier shanahan ; do pts adduser -user ${GUSER} -group minos:admin ; done pts adduser -user habig -group minos:admin ; done ########### # ENSTORE # ########### Checking that the VO4209 files are all on tape : FILES=`enstore info --list=VO4209 | grep mrnt_data | tr -s ' ' | cut -f 6 -d ' ' | cut -f 1-2,5- -d /` for FILE in ${FILES} ; do FP=`echo ${FILE} | cut -f -7 -d /` # ; echo ${FP} FI=`echo ${FILE} | cut -f 8 -d /` ; printf "\n${FI}\n" ( cd ${FP} ; cat ".(use)(4)(${FI})" | head -2 ) sleep 1 done These are all on VOB506, so can remove the safety copies : 17:18 MINOS26 > rm -r /grid/data/minos/mindata/VO4209 MINOS26 > rm -r /local/scratch26/kreymer/VO4209 ######## # FARM # ######## condor problems, ticket 97158 ######## # FARM # ######## Removed the 'safe' copies written when commissioning roundup.20070510 /grid/data/minos/minfarm/SAFE F00037989* N00012179* mcnearcat ROUNDUP/SAFE ( ${DET}_cedar_phy ) Checking for existence of each file in a READ or READ/SAM file SRV1> for FILE in `ls SAFE/far_cedar_phy` ; do printf "${FILE}" ; grep -q ${FILE} READ/SAM/${FILE:0:10}* ; echo " $?" 
; done 2>&1 | grep -v " 0" SRV1> ls SAFE/near_cedar_phy | wc -l 501 SRV1> for FILE in `ls SAFE/near_cedar_phy` ; do printf "${FILE}" ; grep -q ${FILE} READ/SAM/${FILE:0:10}* ; echo " $?" ; done 2>&1 | grep -v " 0" | wc -l 138 Ah, many of these are pending DET=near (( PEND = 0 )) ; (( OUTS = 0)) ; (( NFIL = 0 )) (( NFIS = `ls SAFE/${DET}_cedar_phy | wc -l` )) for FILE in `ls SAFE/${DET}_cedar_phy` ; do (( NFIL++ )) if [ -r "/grid/data/minos/${DET}cat/${FILE}" ] ; then (( PEND++)) ; else (( OUTS++)) printf "${FILE}" grep -q ${FILE} READ/SAM/${FILE:0:10}* ; echo " STAT=${?}" ; fi [ ${NFIL} -eq ${NFIS} ] && printf " NFIL ${NFIL} 0\n PEND ${PEND} 0\n OUTS ${OUTS} 0\n" done 2>&1 | grep -v "STAT=0" NFIL 501 0 PEND 136 0 OUTS 365 0 DET=far NFIL 3695 0 PEND 340 0 OUTS 3355 0 ln -s /grid/data/minos/minfarm/SAFE GDS DET=far (( PEND = 0 )) ; (( OUTS = 0)) ; (( NFIL = 0 )) (( NFIS = `ls GDS/F00037989* | wc -l` )) for FILE in `ls GDS/F00037989*| cut -f 2 -d /` ; do NFIL 60 0 PEND 0 0 OUTS 60 0 DET=near (( PEND = 0 )) ; (( OUTS = 0)) ; (( NFIL = 0 )) (( NFIS = `ls GDS/N00012179* | wc -l` )) for FILE in `ls GDS/N00012179*| cut -f 2 -d /` ; do NFIL 46 0 PEND 2 0 OUTS 44 0 Pending : N00012179_0018.cosmic.sntp.cedar.0.root N00012179_0018.spill.sntp.cedar.0.root for FILE in N00012179_0018.cosmic.sntp.cedar.0.root \ N00012179_0018.spill.sntp.cedar.0.root ; do ls -l GDS/${FILE} /grid/data/minos/nearcat/${FILE} diff GDS/${FILE} /grid/data/minos/nearcat/${FILE} ; done -rw-rw-r-- 1 minfarm numi 29916893 May 10 17:45 GDS/N00012179_0018.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 rubin numi 29916893 May 10 17:45 /grid/data/minos/nearcat/N00012179_0018.cosmic.sntp.cedar.0.root -rw-rw-r-- 1 minfarm numi 79883531 May 10 17:45 GDS/N00012179_0018.spill.sntp.cedar.0.root -rw-rw-r-- 1 rubin numi 79883531 May 10 17:45 /grid/data/minos/nearcat/N00012179_0018.spill.sntp.cedar.0.root These were in badruns through Fri May 11 15:39:18 They are still there, error type 1. DET=mcnear (( PEND = 0 )) ; (( OUTS = 0)) ; (( NFIL = 0 )) (( NFIS = `ls GDS/mcnearcat | wc -l` )) for FILE in `ls GDS/mcnearcat | cut -f 2 -d /` ; do grep -q ${FILE} READ/${FILE:0:10}* ; echo " STAT=${?}" ; fi NFIL 184 0 PEND 1 0 OUTS 183 0 Pending : n13011765_0002_L010185N_D00.sntp.cedar.root This seems to be a duplicate ! Its behaviour in HADDLOG/2007-05/cedarmcnear.log is unremarkable I have moved it to DUP FILE=n13011765_0002_L010185N_D00.sntp.cedar.root ls -l /grid/data/minos/mcnearcat/${FILE} /grid/data/minos/minfarm/DUP/${FILE} mv /grid/data/minos/mcnearcat/${FILE} /grid/data/minos/minfarm/DUP/${FILE} for FILE in n13011765_0002_L010185N_D00.sntp.cedar.root ; do ls -l GDS/mcnearcat/${FILE} /grid/data/minos/mcnearcat/${FILE} diff GDS/mcnearcat/${FILE} /grid/data/minos/mcnearcat/${FILE} ; done -rw-rw-r-- 1 minfarm numi 67927529 May 11 03:23 GDS/mcnearcat/n13011765_0002_L010185N_D00.sntp.cedar.root -rw-rw-r-- 1 rubin numi 67927529 May 11 03:23 /grid/data/minos/mcnearcat/n13011765_0002_L010185N_D00.sntp.cedar.root I see no need to keep any of these safety copy areas. 
17:00 SRV1> rm /grid/data/minos/minfarm/SAFE/N* SRV1> rm /grid/data/minos/minfarm/SAFE/F* SRV1> rm -r /grid/data/minos/minfarm/SAFE/mcnearcat SRV1> rmdir /grid/data/minos/minfarm/SAFE/farphy SRV1> rm -r SAFE SRV1> rmdir /grid/data/minos/minfarm/SAFE # FARM # ============================================================================= 2007 05 11 ######## # FARM # ######## MOVING ROUNDUP/WRITE TO /grid/data/minos/minfarm/WRITE PLAN : 0) Copy the stray duplicates from WRITE to /grid/data/minos/DUPS 1) Around 11:00, most files in WRITE should be on tape Purge them with a -w -M pass 2) Copy the remaining files to /grid/data/minos/minfarm/WRITE 3) change the existing write directory to a symlink Execution : 0) done 08:39 FILES=' N00012135_0013.cosmic.cand.cedar.0.root N00012135_0021.cosmic.cand.cedar.0.root n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root ' for FILE in ${FILES} ; do cp -a ${FILE} /grid/data/minos/DUP/ diff ${FILE} /grid/data/minos/DUP/ done for FILE in ${FILES} ; do rm ${FILE} ; done Shifted DUP under minos/minfarm mv /grid/data/minos/DUP /grid/data/minos/minfarm/DUP 1) ${HOME}/scripts/roundup -c -M -w -r cedar mcnear # done 11:00 ${HOME}/scripts/roundup -c -M -w -r cedar_phy near # done 11:51 ${HOME}/scripts/roundup -c -M -w -r cedar_phy far # done 13:2 only 51/56 of cedar_phy far are on tape at 13:20 only 41 MB, set set aside and copy 2) 14:05 cp -vax /export/stage/minfarm/ROUNDUP/WRITE \ /grid/data/minos/minfarm/WRITE mv /export/stage/minfarm/ROUNDUP/WRITE \ /export/stage/minfarm/ROUNDUP/WRITEold ln -s /grid/data/minos/minfarm/WRITE \ /export/stage/minfarm/ROUNDUP/WRITE IMPACT The existing roundup script accesses the WRITE area entirely by doing 'cd' . The old WRITE becoming a symlink will have no impact. The script does a 'mv' of Merged.root to WRITE, should work OK roundup.20070510 and later will go direct to /grid/data/minos/minfarm/WRITE TESTING Purged the slow WRITE files AFSS/roundup.20070510 -c -M -w -r cedar_phy far Ran -n test pass of all types AFSS/roundup.20070510 -n -r cedar far AFSS/roundup.20070510 -n -r cedar near AFSS/roundup.20070510 -n -M -r cedar mcnear AFSS/roundup.20070510 -n -r cedar_phy near AFSS/roundup.20070510 -n -r cedar_phy far Just 1 run in far, can test cleanly ? 
mkdir /grid/data/minos/minfarm/SAFE cp -va /grid/data/minos/farcat/F00037989* /grid/data/minos/minfarm/SAFE AFSS/roundup.20070510 -W -r cedar far Fri May 11 15:17:54 CDT 2007 Wrote output to WRITE, READ files look OK ECRC files look OK AFSS/roundup.20070510 -w -r cedar far Fri May 11 15:21:28 CDT 2007 SAM declares seem valid, SAM/READ files are there Cleaned up the dups in nearcat, checked first with diff rm -f /grid/data/minos/nearcat/N00012135_0013.cosmic.cand.cedar.0.root rm -f /grid/data/minos/nearcat/N00012135_0021.cosmic.cand.cedar.0.root cp -va /grid/data/minos/nearcat/N00012179* /grid/data/minos/minfarm/SAFE AFSS/roundup.20070510 -c -r cedar near Fri May 11 15:39:17 CDT 2007 Looking good, SAM declared worked for all 4 files cp -vax /grid/data/minos/mcnearcat /grid/data/minos/minfarm/SAFE/mcnearcat AFSS/roundup.20070510 -c -M -r cedar mcnear Fri May 11 16:30:01 CDT 2007 Cleaned up the dup in nearcat, checked first with diff rm -f /grid/data/minos/mcnearcat/n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root Make local copies for safety, too many files for double network xfer File list is too long for simple copy mkdir SAFE/near_cedar_phy # under ROUNDUP FILES=`find /grid/data/minos/nearcat -name \*cedar_phy\* | cut -f 6 -d /` time for FILE in ${FILES} ; do cp -va /grid/data/minos/nearcat/${FILE} SAFE/near_cedar_phy/${FILE} done real 7m50.735s user 0m4.052s sys 3m44.384s Corrected roundup to use Merged.${}.root not Merged.root for thread safety AFSS/roundup.20070510 -c -s N00008011 -r cedar_phy near looks clean, Merged.446.root grew as expected AFSS/roundup.20070510 -c -r cedar_phy near Fri May 11 17:45:14 CDT 2007 Fri May 11 18:06:07 CDT 2007 mkdir SAFE/far_cedar_phy # under ROUNDUP FILES=`find /grid/data/minos/farcat -name \*cedar_phy\* | cut -f 6 -d /` time for FILE in ${FILES} ; do cp -va /grid/data/minos/farcat/${FILE} SAFE/far_cedar_phy/${FILE} done real 7m3.847s user 0m5.719s sys 2m10.599s This is looking good, released cron while doing final catchup cp AFSS/roundup.20070510 . ln -sf roundup.20070510 roundup mv NOCAT NOCAT.old ./roundup.20070510 -c -r cedar_phy far Fri May 11 18:51:29 CDT 2007 ########### # ENSTORE # ########### Some tape mounts have been queues for over a half hour, delaying farm output We are producing 17 data streams on the farm So that's roughly 17 tape mounts/6 hours, 3 per hour. 3 cedar_phy near 4 cedar_phy far 3 cedar near 4 cedar far 3 cedar mcnear That's our intent. Is reality different ? Enstore Drives Fri May 11 11:39:28 CDT 2007 label mover tot.time status system_inhibit rq. 
host updated volume family VO4357 9940B26.mover 2005 DISMOUNT_WAIT (579 ) (none none) southport 05-11-07 11:39:28 miniboone.OpenRootTree.cpio_odc VOB796 9940B24.mover 31 MOUNT_WAIT (4 ) (none none) stkendca11a 05-11-07 11:39:03 minos.reco_far_cedar_phy_sntp.cpio_odc VOD544 9940B33.mover 2373 SETUP (0 ) (none none) southport 05-11-07 10:59:57 miniboone.TankData.cpio_odc VO2146 9940B34.mover 2402 DISMOUNT_WAIT (159 ) (none full) stkendca13a 05-11-07 11:39:09 astro.astro.cpio_odc VO4078 9940B22.mover 2333 MOUNT_WAIT (2305 ) (none none) stkendca18a 05-11-07 11:39:03 exp-db.daily-d0-offline.cpio_odc VO7256 9940B21.mover 572 MOUNT_WAIT (539 ) (none full) flxi04 05-11-07 11:38:57 selex.selex.cpio_odc VOC316 9940B40.mover 505 MOUNT_WAIT (494 ) (none none) minos01 05-11-07 11:39:22 minos.cedar_antp.cpio_odc VO7708 9940B15.mover 2301 ACTIVE-READ (0 ) (none full) flxi04 05-11-07 11:39:28 selex.selex.cpio_odc VOB506 9940B36.mover 2120 MOUNT_WAIT (2099 ) (none none) stkendca9a 05-11-07 11:39:08 minos.reco_near_cedar_phy_mrnt.cpio_odc VOB135 9940B35.mover 1476 ACTIVE-WRITE (4 ) (none none) stkendca11a 05-11-07 11:39:08 minos.reco_near_cedar_phy_cand.cpio_odc VOC295 9940B25.mover 1189 ACTIVE-WRITE (4 ) (none none) stkendca9a 05-11-07 11:39:11 minos.reco_mc_near_cedar_cand.cpio_odc VOB549 9940B41.mover 388 SEEK (21 ) (none none) stkendca11a 05-11-07 11:39:11 minos.reco_near_cedar_phy_sntp.cpio_odc VO5147 9940B20.mover 14572 MOUNT_WAIT (14540) (none full) stkendca13a 05-11-07 11:38:57 lqcd.lqcd.cpio_odc VO6615 9940B16.mover 2482 ACTIVE-WRITE (35 ) (none none) stkendca10a 05-11-07 11:39:07 exp-db.daily-d0-offline.cpio_odc ########### # ENSTORE # ########### Per email from georges , volume VO4209 has been lost due to a drive error. These are all still in DCache. FILES=`enstore info --list=VO4209 | grep mrnt_data | tr -s ' ' | cut -f 6 -d ' ' | cut -f 1-2,5- -d /` for FILE in ${FILES} ; do FP=`echo ${FILE} | cut -f -7 -d /` # ; echo ${FP} FI=`echo ${FILE} | cut -f 8 -d /` ; echo ${FI} ( cd ${FP} ; cat ".(use)(2)(${FI})" ) | grep stken sleep 1 done They all seem to be in the write pools Slip these into /local/scratch26/kreymer/VO4209 mkdir /local/scratch26/kreymer/VO4209 for FILE in ${FILES} ; do FP=`echo ${FILE} | cut -f 3- -d /` ; echo ${FP} FI=`echo ${FILE} | cut -f 8 -d /` # ; echo ${FI} dccp dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/${FP} \ /local/scratch26/kreymer/VO4209/${FI} done Let's put them in their months for FILE in ${FILES} ; do FM=`echo ${FILE} | cut -f 7 -d /` ; echo ${FM} FI=`echo ${FILE} | cut -f 8 -d /` # ; echo ${FI} mkdir -p /local/scratch26/kreymer/VO4209/${FM} mv /local/scratch26/kreymer/VO4209/${FI} \ /local/scratch26/kreymer/VO4209/${FM}/${FI} done And make another copy in /grid/data /grid/data/minos/mindata chmod 775 /grid/data/minos/mindata cp -vax /local/scratch26/kreymer/VO4209 /grid/data/minos/mindata/VO4209 ########## # DCACHE # ########## Five files have been in write queues over 6 hours, not yet queued for output -rw-r--r-- 1 minfarm numi 10183045 May 11 06:56 F00031422_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 minfarm numi 5332142 May 11 06:56 F00031426_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 minfarm numi 6992773 May 11 06:56 F00031428_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 minfarm numi 10819041 May 11 06:56 F00031431_0000.spill.sntp.cedar_phy.0.root -rw-r--r-- 1 minfarm numi 8864273 May 11 06:56 F00031433_0000.spill.sntp.cedar_phy.0.root All are in w-stkendca11a-1 All are under /pnfs/minos/reco_far/cedar_phy/sntp_data/2005-05 The problem is the 94 
queued stores in w-stkendca11a-1 I'm probably stuck behind a slug of cand/bcnd writes Indeed the 'drives' web page http://cmsdca.fnal.gov/cgi-bin/enstore_drives.sh shows w-stkendca11a-1 being pretty active writing to tape. ============================================================================ 2007 05 10 ########### # ROUNDUP # ########### roundup.20070510 for mock data challenge data ? Added test for valid INDIR in /grid/data For reference, scanned old releases, MINOS26 > ls -aF /pnfs/minos/mcout_data/*/fmock /pnfs/minos/mcout_data/R1.12/fmock: ./ ../ .snrl_data/ .trth_data/ cand_data/ sntp_data/ snts_data/ /pnfs/minos/mcout_data/R1.6.1/fmock: ./ ../ cand_data/ snrl_data/ sntp_data/ snts_data/ trth_data/ /pnfs/minos/mcout_data/R1.7/fmock: ./ ../ .snrl_data/ .trth_data/ cand_data/ sntp_data/ snts_data/ /pnfs/minos/mcout_data/R1.9/fmock: ./ ../ .snrl_data/ .trth_data/ cand_data/ sntp_data/ snts_data/ /pnfs/minos/mcout_data/R1_18_2/fmock: ./ ../ cand_data/ sntp_data/ snts_data/ /pnfs/minos/mcout_data/cedar/fmock: ./ ../ carrot/ Per rhatcher, the .trth and .snrl are historic Will should produce the usual sntp, .bntp and .bcnd files I hope for this we could skip cand ######## # FARM # ######## Keepup : reco_far/cedar reco_near/cedar mcout_data/cedar/near/daikon_00 Reprocessing : /reco_far/cedar_phy /reco_near/cedar_phy Data : reco_near/R1_24cal reco_far/R1_24cal Monte Carlo : mcout_data/cedar/cosmic/bfld201_lowE_R1_24cal mcout_data/cedar/cosmic/bfld201_lowE_R1_24calB the above are the special Cambridge cosmid runs, Copied as-is, with no concatenation No far/near or release in the path. mcout_data/R1_24cal/near/daikon_00/... mcout_data/R1_24calB/near/daikon_00/... mcout_data/cedar_phy/fmock/... still to come, this will not be concatenated mcin files are arriving momentarily ######## # FARM # ######## Reported duplicates per 2007 05 07 N00012135_0013.cosmic.cand.cedar.0.root N00012135_0021.cosmic.cand.cedar.0.root n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root This was due to cand reprocessing, lost DCache file. Should remove this duplicate sntp ######## # FARM # ######## Purged WRITE files from special passes ./roundup -w -r R1_24calB mcnear ./roundup -w -r R1_24cal near ./roundup -w -r R1_24cal far Forced several files which were once in bad_runs_mc.cedar on 4/5 May, OK now. ./roundup -f 5 -r cedar mcnear ============================================================================= 2007 05 09 ############ # MCIMPORT # ############ mcimport.20070709 Added support for MCIN mock data files N* and F* 17:29 cp -a AFSS/mcimport.20070509 . ln -sf mcimport.20070509 mcimport ########### # ROUNDUP # ########### roundup.20070509 Control saddreco months with SAMMONS variable, sorted/unique months from all calls to AUTODEST Added R1_24calB release Looks ok in a dry run with AFSS/roundup.20070509 -n -r cedar_phy near Putting this into production cp AFSS/roundup.20070509 . ln -sf roundup.20070509 roundup Oops, corrected typo leaving space after \ running saddreco, caused saddreco output to go the wrong place. Hacked logs with text editor. There are messages like File "scripts/saddreco", line 79, in ? ValueError: invalid literal for int(): These are due to \ being taken as an argument to saddreco. Should have been harmless. Corrected roundup.20070509 to continue correctly, and to issue sample commands for less'ing the saddreco LOGS and to create directories for saddreco LOGS 17:52 cp AFSS/roundup.20070509 . 
ln -sf roundup.20070509 roundup ######### # MYSQL # ######### The heavy load on minos-mysql1 continues since 4 AM yesterday. I see a dozen or so connections to the temp database, and a dozen or so logins in progress at all times, from flxb* nodes. ######## # FARM # ######## SRV1> time md5sum c10000605_0003.cand.R1_24cal.root 6424da9475ba0239642ac6b13b99a757 c10000605_0003.cand.R1_24cal.root real 0m6.832s user 0m1.121s sys 0m2.051s SRV1> time md5sum c10000605_0003.cand.R1_24cal.root 6424da9475ba0239642ac6b13b99a757 c10000605_0003.cand.R1_24cal.root real 0m1.167s user 0m0.861s sys 0m0.307s /grid/data rates are great again. when running below, seeing 200 MBit/sec on MRTG plot of eth0 cd /grid/data/minos/mcfarcat RELE=R1_24cal STRM=cand FILES=`ls -1 *${STRM}*${RELE}\.root` RSPA=minos/mcout_data/cedar/cosmic/bfld201_lowE_${RELE}/${STRM}_data POUT=/pnfs/${RSPA} 08:32 removed STRM=sntp 08:48 RELE=R1_24calB 08:49 STRM=cand 08:51 09:07 done ######## # FARM # ######## As soon as the 18:05 cycle is done, need to do : ./roundup -n -W -r R1_24calB mcnear Wed May 9 21:37:32 CDT 2007 We only had a 15 minute gap this afternoon, due to cedar_phy catchup. ============================================================================= 2007 05 08 ########## # CORRAL # ########## Added veto on the existence of ${HOME}/ROUNTUP/NOCAT file. crontab.dat schedules corral for 05 00,06,12,18 ########### # ROUNDUP # ########### Oops, forgot to put roundup.20070507 with pid protection into production "No harm, no foul." cp AFSS/roundup.20070507 . ln -sf roundup.20070507 roundup ####### # SAM # ####### Test rapid listing of files declared to sam, for use in saddreco etc, sam list files --dim='(FILE_NAME F00037871_0004.mdaq.root, F00037871_0008.mdaq.root, F00037871_0013.mdaq.root )' Files: F00037871_0004.mdaq.root F00037871_0008.mdaq.root F00037871_0013.mdaq.root File Count: 3 Average File Size: 31.78MB Total File Size: 95.35MB Total Event Count: 37651 SAMDIM='(FILE_NAME F00037871_0004.mdaq.root, F00037871_0008.mdaq.root, F00037871_0013.mdaq.root )' sam list files --nosummary --dim="${SAMDIM}" F00037871_0004.mdaq.root F00037871_0008.mdaq.root F00037871_0013.mdaq.root FILES=`ls /pnfs/minos/fardet_data/2007-05` FIRST=`printf "${FILES}\n" | head -1` FREST=`printf "${FILES}\n" | tail +2` FRESC=`for FI in ${FREST} ; do printf ", ${FI}" ; done` SAMDIM="( FILE_NAME ${FIRST} ${FRESC} )" sam list files --nosummary --dim="${SAMDIM}" | wc -l 200 printf "${FILES}\n" | wc -w 204 Good, this seems to work, and fairly quickly real 0m4.402s user 0m1.010s sys 0m0.120s Try agin for 2004, real 0m50.976s user 0m1.230s sys 0m0.190s real 0m48.758s user 0m1.300s sys 0m0.110s 818 files ####### # SAM # ####### export SAM_ORACLE_CONNECT ./reloc -s dev cedar_phy ./reloc -s int cedar_phy ./reloc -s prd cedar_phy export -n SAM_ORACLE_CONNECT ######### # MYSQL # ######### minos-mysql1 has had a load average of about 6 to 8 since about 04:00. 
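For reference, the kind of check behind these mysql load notes, as a sketch only - the 'reader' account is a placeholder, any account with PROCESS privilege would do, and this is not necessarily what was actually run :
ssh minos-mysql1 uptime
# count client threads coming in from the farm worker nodes
mysql -h minos-mysql1 -u reader -p -e 'SHOW PROCESSLIST' | grep -c flxb
# and how many of those are sitting in the temp database
mysql -h minos-mysql1 -u reader -p -e 'SHOW PROCESSLIST' | grep flxb | grep -c temp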
######## # FARM # ######## R1_24cal forced output, verified these subruns were previously skipped N00009235_0001 N00009241_0010 N00009256_0002 N00009256_0008 N00009256_0009 N00009259_0013 N00009162_ mrnt missing 16,17 N00009226_ mrnt missing 21 N00009143_ mrnt missing 18 ./roundup -f 1 -M - r R1_24cal near Tue May 8 13:22:35 CDT 2007 purge Tue May 8 13:22:51 CDT 2007 cat Tue May 8 13:26:13 CDT 2007 write Tue May 8 13:32:21 CDT 2007 done ./roundup -m 2005-11 -r R1_24cal near # and do SAM declares STARTED Tue May 8 18:35:45 2007 FINISHED Tue May 8 18:37:52 2007 FAR 247 files in WRITE to be purged, clear them first ./roundup -w -r R1_24cal far F00028201_ missing 00,01 which are in 2004-11 not 2004-12 so force this ./roundup -f 1 -M -r R1_24cal far ########## # DCACHE # ########## Existing http://fndca3a.fnal.gov:2288/poolInfo/ugroups/MinosPrdSelGrp minos.reco_far_cedar_bntp@enstore minos.reco_far_cedar_mrnt@enstore minos.reco_far_cedar_sntp@enstore minos.reco_mc_far_cedar_mrnt@enstore minos.reco_mc_far_cedar_sntp@enstore minos.reco_mc_near_cedar_mrnt@enstore minos.reco_mc_near_cedar_sntp@enstore minos.reco_near_cedar_mrnt@enstore minos.reco_near_cedar_sntp@enstore Need to add minos.reco_far_cedar_phy_bntp@enstore minos.reco_far_cedar_phy_mrnt@enstore minos.reco_far_cedar_phy_sntp@enstore minos.reco_mc_far_cedar_phy_mrnt@enstore minos.reco_mc_far_cedar_phy_sntp@enstore minos.reco_mc_near_cedar_phy_mrnt@enstore minos.reco_mc_near_cedar_phy_sntp@enstore minos.reco_near_cedar_phy_mrnt@enstore minos.reco_near_cedar_phy_sntp@enstore And set file families for each stream, as on 2006 09 04 cd /pnfs/minos/reco_far/cedar_phy for DIR in .bcnd .bntp cand mrnt sntp ; do (cd ${DIR}_data ; enstore pnfs --tags | grep 'family)' ) ; done These are not properly qualified for DIR in .bcnd .bntp cand mrnt sntp ; do ( cd ${DIR}_data DIRT=`echo ${DIR} | tr -d '.'` enstore pnfs --file_family reco_far_cedar_phy_${DIRT} ) done for DIR in .bcnd .bntp cand mrnt sntp ; do (cd ${DIR}_data/2007-03 ; enstore pnfs --tags | grep 'family)' ) ; done # this was correctly inherited Now do NEAR cd /pnfs/minos/reco_near/cedar_phy for DIR in cand mrnt sntp ; do (cd ${DIR}_data ; enstore pnfs --tags | grep 'family)' ) ; done for DIR in cand mrnt sntp ; do ( cd ${DIR}_data DIRT=`echo ${DIR} | tr -d '.'` enstore pnfs --file_family reco_near_cedar_phy_${DIRT} ) done for DIR in cand mrnt sntp ; do (cd ${DIR}_data/2007-03 ; enstore pnfs --tags | grep 'family)' ) ; done Oops, somehow set /pnfs/minos/reco_near/cedar_phy to snts. 
Moot, but correct this ####### # AFS # ####### Requested volume d239 cloned from d188 per lloiaco request for beam systematics work ######## # FARM # ######## Remove last 2 0 length files from last Thursday diskful event SRV1> rm /grid/data/minos/nearcat/N00009235_0001.cosmic.cand.R1_24cal.0.root SRV1> rm /grid/data/minos/nearcat/N00009256_0008.cosmic.cand.R1_24cal.0.root ######## # FARM # ######## Cambridge Cosmic file cleanup setup encp -q stken RELE=R1_24cal STRM=cand STRM=sntp cd /grid/data/minos/mcfarcat FILES=`ls -1 *${STRM}*${RELE}\.root` RSPA=minos/mcout_data/cedar/cosmic/bfld201_lowE_${RELE}/${STRM}_data POUT=/pnfs/${RSPA} for FILE in ${FILES} ; do PFIL=${POUT}/${FILE} PINFO=`(cd ${POUT} ; cat ".(use)(4)(${FILE})" | tr '\n' '\t')` LCRC=`ecrc ${FILE} | tr -s ' ' | cut -f 2 -d ' '` ECRC=`printf "${PINFO}" | cut -f 11` echo " ${FILE}" ${LCRC} ${ECRC} [ ${LCRC} = ${ECRC} ] && echo rm ${FILE} && rm ${FILE} done 2>&1 | tee /tmp/purge${RELE}${STRM}.log THis is running dog slow SRV1> time md5sum c10000605_0000.cand.R1_24cal.root 9fb3226f8fde606d0f7d5d10887b7671 c10000605_0000.cand.R1_24cal.root real 3m14.168s user 0m1.829s sys 0m1.691s 551M, so about 2 MBytes/sec real 10m26.894s user 0m1.840s sys 0m0.790s Tue May 8 16:28:01 CDT 2007 Speed seems to be back to normal, Tue May 8 23:54:53 CDT 2007 ########### # ROUNDUP # ########### roundup.2070508 supporting cedar_phy cp AFSS/roundup.20070508 . ln -sf roundup.20070508 roundup ######## # FARM # ######## Writing output for cedar_phy !!!!!!!!!! ./roundup -c -M -r cedar_phy far ; ./roundup -c -M -r cedar_phy near Tue May 8 16:45:40 CDT 2007 far Then declared to sam, ( failed first time, had to ( dev/int/prd ) samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar.phy ./roundup -m '2005-04' -r cedar_phy near STARTED Wed May 9 04:31:40 2007 FINISHED Wed May 9 04:31:54 2007 ./roundup -m '2005-05' -r cedar_phy near STARTED Wed May 9 04:32:14 2007 FINISHED Wed May 9 04:33:50 2007 ./roundup -m '2005-04' -r cedar_phy far STARTED Wed May 9 04:35:34 2007 FINISHED Wed May 9 04:37:20 2007 ./roundup -m '2005-05' -r cedar_phy far STARTED Wed May 9 04:37:39 2007 FINISHED Wed May 9 04:39:12 2007 ============================================================================= 2007 05 07 ########## # CORRAL # ########## Run various roundups in cron on fnpcsrv1, to keep the crontab file short Will run all current roundup's , one at a time If one is already running, move on to the next. If one stream is very slow, we'll end up running 2 roundups This should be OK. But allow only one such. Check error, count, bail . Can this be done simply ? ########### # ROUNDUP # ########### roundup.20070507 Adding PID interlocking stealing code from mcimport ####### # SAM # ####### Preparing for cedar_phy export SAM_ORACLE_CONNECT="samdbs/" setup sam -q dev samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar_phy New applicationFamilyId = 251 setup sam -q int New applicationFamilyId = 60 setup sam -q prd New applicationFamilyId = 62 reco directories do not yet exist for cedar_phy Same for R1_24cal ./reloc -d -s dev R1_24cal ./reloc -s dev R1_24cal ./reloc -s int R1_24cal ./reloc -s prd R1_24cal Testing R1_24cal ./roundup -m 2005-11 -r R1_24cal near STARTED Mon May 7 18:21:29 2007 ... 
Treating 666 files in /pnfs/minos/reco_near/R1_24cal/cand_data/2005-11 Oops, appVersion should have been r1.25cal Working OK this time STARTED Mon May 7 18:28:45 2007 Needed /pnfs/minos/reco_near/R1_24cal/cand_data/2005-11 Treating 666 files in /pnfs/minos/reco_near/R1_24cal/cand_data/2005-11 ... Treating 38 files in /pnfs/minos/reco_near/R1_24cal/mrnt_data/2005-11 OOPS - tier known, mrnt ... ############ # SADDRECO # ############ saddreco.20070507 - added mrnt to TIERS SRV1> cp -a AFSS/saddreco.20070507 . SRV1> ln -sf saddreco.20070507 saddreco Ran R1_24cal again ./roundup -m 2005-11 -r R1_24cal near Followed up with the rest of R1_24cal ./roundup -m 2005-11 -r R1_24cal far STARTED Mon May 7 18:57:36 2007 FINISHED Mon May 7 19:10:30 2007 ./roundup -m 2004-12 -r R1_24cal far STARTED Mon May 7 19:10:55 2007 FINISHED Mon May 7 19:16:24 2007 Note that there are no .bcnd for 2004-12 ########### # ENSTORE # ########### nwest reports a duplicate file in COMPLETE_FILE_LIST_minos ls -l /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/100/f21011002_0000_L010185N_D00.sntp.cedar.root -rw-r--r-- 1 3475 e875 65629350 Mar 10 08:07 /pnfs//minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/100/f21011002_0000_L010185N_D00.sntp.cedar.root enstore info --list VOC177 | grep f21011002_0000_L010185N_D00.sntp.cedar.root VOC177 CDMS117247822600000 62959276 0000_000000000_0000266 active /pnfs/fnal.gov/usr/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/100/f21011002_0000_L010185N_D00.sntp.cedar.root enstore info --list VOB971 | grep f21011002_0000_L010185N_D00.sntp.cedar.root VOB971 CDMS117353562400000 65629350 0000_000000000_0000022 active /pnfs/fnal.gov/usr/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/100/f21011002_0000_L010185N_D00.sntp.cedar.root cd $MINOS_DATA/d10/indexes grep f21011002_0000_L010185N_D00.sntp.cedar.root *.index BAD_mc_far.daikon_00.cedar.index:recodata83/f21011002_0000_L010185N_D00.sntp.cedar.root mc_far.daikon_00.cedar.index:recodata89/f21011002_0000_L010185N_D00.sntp.cedar.root MINOS26 > dds recodata89/f21011002_0000_L010185N_D00.sntp.cedar.root -rw-rw-r-- 1 3475 e875 65629350 Mar 10 04:28 recodata89/f21011002_0000_L010185N_D00.sntp.cedar.root ######## # FARM # ######## Writing Cambridge Cosmic MC files from /grid/data/minos/mcfarcat 316 files, a mix of R1_24calB and cedar MINOS26 > ls /grid/data/minos/mcfarcat | wc -l 316 MINOS26 > ls /grid/data/minos/mcfarcat | grep cand | wc -l 158 MINOS26 > ls /grid/data/minos/mcfarcat | grep sntp | wc -l 158 Oops a mixture of calB and cedar File names are like c10000601_0000.cand.cedar.root c10000601_0000.cand.R1_24calB.root Examine existing directories for output ls /pnfs/minos/mcout_data/cedar/cosmic -1 bfld201 bfld201_lowE bfld201_lowE_R1.24.0 bfld201_rock bfld201_vlowE bfldoff bfldrev neutron The cedar names match files already in /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1.24.0/cand_data /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1.24.0/sntp_data Rather than change the names of the old file, I have made new directories, as rubin mkdir -p /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_cedarmay/cand_data mkdir -p /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_cedarmay/sntp_data Then per discussion with howie, have shifted to rmdir /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_cedarmay/cand_data rmdir /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_cedarmay/sntp_data mkdir -p /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1_24cal/cand_data mkdir -p 
/pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1_24cal/sntp_data and make space for some existig R1_24calB files mkdir -p /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1_24calB/cand_data mkdir -p /pnfs/minos/mcout_data/cedar/cosmic/bfld201_lowE_R1_24calB/sntp_data Renamed all the cedar grid files : cd /grid/data/minos/mcfarcat FILES=`ls -1 *cedar.root` for FILE in ${FILES} ; do sleep 1 echo ${FILE} ${FILE:0:19}.R1_24cal.root mv ${FILE} ${FILE:0:19}.R1_24cal.root done Did this at 17:24 Now write these to PNFS, from minfarm setup dcap # kerberized DCPOR=24736 cd /grid/data/minos/mcfarcat RELE=R1_24cal STRM=cand STRM=sntp FILES=`ls -1 *${STRM}*${RELE}\.root` RSPA=minos/mcout_data/cedar/cosmic/bfld201_lowE_${RELE}/${STRM}_data DOUT=/dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${RSPA} POUT=/pnfs/${RSPA} date | tee ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log for FILE in ${FILES} ; do DFIL=${DOUT}/${FILE} PFIL=${POUT}/${FILE} [ ! -r ${PFIL} ] && echo "NEED" ${FILE} && dccp ${FILE} ${DFIL} done 2>&1 | tee ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log date | tee ~/ROUNTMP/LOG/cosmic/${RELE}${STRM}.log Test with FILES=c10000601_0000.sntp.R1_24cal.root OK, ran with sntp, R1_24cal, OK ran cand OK Mon May 7 18:16:43 CDT 2007 Mon May 7 18:42:32 CDT 2007 Now copy R1_24calB ( in grid/data/minos/mcfarcat ) sum *sntp*R1_24calB.root RELE=R1_24calB for STRM in sntp cand ; do ... do Oops, forgot to mkdir the directories. Corrected, restarted : Mon May 7 18:46:12 CDT 2007 ######## # FARM # ######## DUPLICATES Duplicate cand near file In LOG/2007-05/cedarnear.log continued problem, wrong ECRC for N00012135_0013.cosmic.cand.cedar.0.root N00012135_0021.cosmic.cand.cedar.0.root /pnfs/minos/reco_near/cedar/cand_data/2007-05/N00012135_0013.cosmic.cand.cedar.0.root /pnfs/minos/reco_near/cedar/cand_data/2007-05/N00012135_0021.cosmic.cand.cedar.0.root The files were first written on May 4 -rw-r--r-- 1 minfarm numi 606314496 May 4 17:41 /export/stage/minfarm/ROUNDUP/WRITE/N00012135_0013.cosmic.cand.cedar.0.root -rw-r--r-- 1 minfarm numi 231964672 May 4 17:42 /export/stage/minfarm/ROUNDUP/WRITE/N00012135_0021.cosmic.cand.cedar.0.root Purged from WRITE on Sat 08:00 Then picked up again from /grid/data Saturday -rw-r--r-- 1 minfarm numi 749183111 May 5 13:30 /export/stage/minfarm/ROUNDUP/WRITE/N00012135_0013.cosmic.cand.cedar.0.root -rw-r--r-- 1 minfarm numi 744304636 May 5 13:30 /export/stage/minfarm/ROUNDUP/WRITE/N00012135_0021.cosmic.cand.cedar.0.root A second srmcp was not attempted. Duplicate sntp mcnear file In LOG/2007-05/cedarmcnear.log continued problem, wrong ECRC for n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/sntp_data/144/n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root Recent attempt -rw-r--r-- 1 minfarm numi 177252703 May 5 15:44 /export/stage/minfarm/ROUNDUP/WRITE/n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root Originally written April 19 MINOS26 > dds /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/sntp_data/144/n13011446_00*_L010185N_D00_nccoh.sntp.cedar.root -rw-r--r-- 1 1334 e875 1939588608 Apr 19 22:01 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/sntp_data/144/n13011446_0000_L010185N_D00_nccoh.sntp.cedar.root ########## # RUSTLE # ########## Removed /grid/data/minos/farcat_safe, from early tests. See notes added under the 2007 04 10 log entry hereunder. 
SRV1> rm /grid/data/minos/farcat_safe/* SRV1> rmdir /grid/data/minos/farcat_safe ============================================================================= 2007 05 06 sunday ######### # STAGE # ######### MINOS26 > NVOLS=`./volumes neardet_data` MINOS26 > echo $NVOLS VO2307 VO3863 VO4531 VO4918 VO5041 VO5042 VO6784 VO7026 VO7175 VO7421 VO7774 VO7896 VO7939 VO8098 VO8187 VO8332 VO8537 VO8556 VO8721 VO8741 VO8791 VO8842 VO8949 VO9752 VO9834 VOC065 VOC443 MINOS26 > echo $NVOLS | wc -w 27 MINOS26 > for VOL in ${NVOLS} ; do ./stage -w ${VOL} ; done | tee ../log/stage/neardet_data.20070506 FINISHED Mon May 7 00:23:27 CDT 2007 Files restored : Staging files from tape VO4918 Needed 308/ 1074 Staging files from tape VO5041 Needed 1636/ 3235 Staging files from tape VO5042 Needed 1310/ 2608 Staging files from tape VO8556 Needed 74/ 937 Staging files from tape VO9752 Needed 58/ 2120 ######## # FARM # ######## ./roundup -M -r R1_24cal near ============================================================================= 2007 05 05 saturday On friday, did ./roundup -r R1_24cal mcnear Fri May 4 20:01:23 CDT 2007 cat Fri May 4 21:19:53 CDT 2007 write OK - creating /pnfs/minos/mcout_data/R1_24cal/near/daikon_00/L010185N_24cal/cand_data/100 no such directory ( N.B. this DID create the full directory ) Immediate cause - stale copy of roundup.20070504 on fnpcsrv1, renamed now to roundup.20070504x Checking inventory and status of misplaced files : SRV1> ls WRITE/n* | wc -l 233 Removed N and F files from WRITE, for clarity ./roundup -w -r cedar near ./roundup -w -r cedar far ./roundup -w -r R1_24cal near SRV1> ls WRITE | wc -l 233 Verified that all these files are in Enstore SRV1> ./roundup.20070504x -n -w -r R1_24cal mcnear | grep 'rm ' | wc -l 233 Made a list for future reference ls WRITE > ~/maint/nR1_24cal.20070505 As Rubin, move the misplaced files back where they belong RUB > find L010185N -type d L010185N L010185N/cand_data L010185N/mrnt_data L010185N/sntp_data RUB > find L010185N_24cal -type d L010185N_24cal L010185N_24cal/cand_data L010185N_24cal/cand_data/100 L010185N_24cal/cand_data/101 L010185N_24cal/cand_data/102 L010185N_24cal/sntp_data L010185N_24cal/sntp_data/100 L010185N_24cal/sntp_data/101 L010185N_24cal/sntp_data/102 The RUN directories do not exist where they belong under L010185N, so they can be moved cleanly from the wrong path L010185N_24cal And we have verified, above, that nothing is pending for write. for STR in cand_data sntp_data ; do for RUN in 100 101 102 ; do mv -v L010185N_24cal/${STR}/${RUN} L010185N/${STR}/${RUN} done ; done RUB > find L010185N -type d L010185N L010185N/cand_data L010185N/cand_data/100 L010185N/cand_data/101 L010185N/cand_data/102 L010185N/mrnt_data L010185N/sntp_data L010185N/sntp_data/100 L010185N/sntp_data/101 L010185N/sntp_data/102 RUB > find L010185N_24cal/ -type d L010185N_24cal/ L010185N_24cal/cand_data L010185N_24cal/sntp_data Now we can purge the WRITE files, using standard roundup ./roundup -w -r R1_24cal mcnear Sat May 5 09:50:05 CDT 2007 Sat May 5 09:57:06 CDT 2007 Scanned for recent files in nearcat/farcat from R1_24cal, they are still currently flowing from far, the last one was around 04:00 from near. There is nothing in mcnearcat. 
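That scan was done by eye ; a one-pass version, as a sketch ( the 12 hour window and the -type f are arbitrary choices, not what was actually typed ) :
for CAT in nearcat farcat mcnearcat mcfarcat ; do
  echo ${CAT}
  find /grid/data/minos/${CAT} -type f -name "*R1_24cal*" -mmin -720 -exec ls -l {} \;
done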
Need a utility like farmgsum to scan all *cat directories, nearcat farcat mcnearcat mcfarcat listing for each stream and directory stream / number / size / last time ./roundup -c -r cedar far ; ./roundup -c -r cedar near Sat May 5 11:29:39 CDT 2007 GRRRRRRRRRRRRRRRRRRRRRRR Stuck once again, something has changed again less LOG/2007-05/cedarfar.log rm: remove write-protected regular file `/grid/data/minos/farcat/F00037968_0000.all.sntp.cedar.0.root'? Odd, the directory is group writeable, but not the files. Check an older directory : SRV1> ls -alF /grid/data/minos/mcfarcat/ total 3637984 drwxrwxr-x 2 rubin numi 2048 May 2 21:13 ./ drwxrwxr-x 20 rubin numi 2048 May 3 09:10 ../ -rw-r--r-- 1 rubin numi 573833281 May 5 01:08 c10000601_0000.cand.cedar.root -rw-r--r-- 1 rubin numi 576581544 May 5 01:10 c10000601_0001.cand.cedar.root -rw-r--r-- 1 rubin numi 575020786 May 5 01:29 c10000601_0002.cand.cedar.root -rw-r--r-- 1 rubin numi 283649445 May 5 00:52 c10000601_0003.cand.cedar.root -rw-r--r-- 1 rubin numi 568409565 May 5 01:11 c10000602_0000.cand.cedar.root -rw-r--r-- 1 rubin numi 576844434 May 5 01:29 c10000602_0001.cand.cedar.root -rw-r--r-- 1 rubin numi 570762049 May 5 01:13 c10000602_0002.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:21 c10000602_0003.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:21 c10000602_0003.sntp.cedar.root Try removing a useless 0 length file : SRV1> type rm rm is /bin/rm SRV1> rm /grid/data/minos/mcfarcat/c10000602_0003.sntp.cedar.root rm: remove write-protected regular empty file `/grid/data/minos/mcfarcat/c10000602_0003.sntp.cedar.root'? y This removed the file Try using -f for a clean removal rm -f /grid/data/minos/mcfarcat/c10000602_0003.cand.cedar.root OK, created roundup.20070505 once again, to do rm -r , and cloned to fnpcsrc1 cp AFSS/roundup.20070505 . 
ln -sf roundup.20070505 roundup ./roundup -r cedar far Sat May 5 13:06:19 CDT 2007 Sat May 5 13:29:20 CDT 2007 And remove the input file that should have been removed : SRV1> dds /pnfs/minos/reco_far/cedar/sntp_data/2007-05/F00037968_0000.all.sntp.cedar.0.root -rw-r--r-- 1 rubin numi 433161527 May 5 11:36 /pnfs/minos/reco_far/cedar/sntp_data/2007-05/F00037968_0000.all.sntp.cedar.0.root SRV1> dds ../ROUNTMP/WRITE/F00037968_0000.all.sntp.cedar.0.root -rw-r--r-- 1 minfarm numi 433161527 May 5 11:32 ../ROUNTMP/WRITE/F00037968_0000.all.sntp.cedar.0.root SRV1> dds /grid/data/minos/farcat/F00037968_0000.all.sntp.cedar.0.root -rw-r--r-- 1 rubin numi 23722211 May 2 23:51 /grid/data/minos/farcat/F00037968_0000.all.sntp.cedar.0.root SRV1> rm -f /grid/data/minos/farcat/F00037968_0000.all.sntp.cedar.0.root for FILE in `cat ../ROUNTMP/READ/F00037968_0000.all.sntp.cedar.0.root` ; do ls -l /grid/data/minos/farcat/${FILE} ; done for FILE in `cat ../ROUNTMP/READ/F00037968_0000.all.sntp.cedar.0.root` ; do rm -f /grid/data/minos/farcat/${FILE} ; done Now back to our regularly scheduled program ./roundup -r cedar near Sat May 5 13:30:03 CDT 2007 cat Sat May 5 13:35:36 CDT 2007 write Sat May 5 13:39:14 CDT 2007 ./roundup -r cedar mcnear Sat May 5 13:49:50 CDT 2007 cat several files hanging round since April 30, should do an -f 4 run Sat May 5 14:15:53 CDT 2007 write Sat May 5 14:46:03 CDT 2007 ./roundup -f 4 -r cedar mcnear Sat May 5 15:38:28 CDT 2007 cat OK - processing 80 files Sat May 5 15:44:40 CDT 2007 write Sat May 5 15:50:59 CDT 2007 done MINOS26 > ./farmgsum Summarizing /grid/data/minos/*cat 229 5580 nearcat 2831 36420 farcat 57 18359 mcnearcat 24 5936 mcfarcat 3141 66295 TOTAL files, GBytes nearcat 2 0 cosmic.cand.R1_24cal.0.root 2 1493 cosmic.cand.cedar.0.root 8 170 cosmic.sntp.R1_24cal.0.root 17 507 cosmic.sntp.cedar.0.root 175 1958 spill.mrnt.R1_24cal.0.root 8 585 spill.sntp.R1_24cal.0.root 17 1129 spill.sntp.cedar.0.root farcat 1426 33936 all.sntp.R1_24cal.0.root 7 169 all.sntp.cedar.0.root 692 2949 spill.bntp.R1_24cal.0.root 7 28 spill.bntp.cedar.0.root 692 1038 spill.sntp.R1_24cal.0.root 7 19 spill.sntp.cedar.0.root mcnearcat 28 17182 cand.cedar.root 29 2065 sntp.cedar.root mcfarcat 12 5712 cand.cedar.root 12 510 sntp.cedar.root ./roundup -n -s F0003317 -r R1_24cal far OK adding F00033174_0000.spill.sntp.R1_24cal.0.root 1 ./roundup: line 507: ((: SSIF = : syntax error: operand expected (error token is " ") ./roundup: line 507: ((: SSIF = : syntax error: operand expected (error token is " ") OK adding F00033178_0000.spill.sntp.R1_24cal.0.root 24 ./roundup -f 4 -s F0003317 -r cedar mcnear OOps, accident when trying additional test of the above. nothing was done, no files ./roundup -n -s F0003317 -r R1_24cal far ./roundup -n -s spill.sntp -r R1_24cal far both are clean I may have typed something at the terminal during the original test ./roundup -n -r R1_24cal far clean this time, go for it, without SAM ! 
./roundup -M -r R1_24cal far Sat May 5 18:59:58 CDT 2007 ============================================================================= 2007 05 04 ######## # FARM # ######## More rate tests, trying direct /grid/data to enstore : cd /grid/data/minos/DUP IFILE=n13011446_0000_L010185N_D00_nccoh.cand.cedar.root export SRM_CONFIG=/export/stage/minfarm/.srmconfig/config.xml SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm SFILE2=${SPATH2}/${IFILE} SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm SFILE=${SPATH}/${IFILE} srmkdir ${SPATH2} srmls ${SPATH2} srmls ${SFILE2} time srmcp file:///${IFILE} ${SFILE} real 1m3.708s user 0m27.422s sys 0m43.969s Rate is 25 MBytes/sec srmls ${SFILE2} 1602194515 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm/n13011446_0000_L010185N_D00_nccoh.cand.cedar.root srmrm ${SFILE2} time srmcp file:///${IFILE} ${SFILE} real 0m49.820s user 0m26.619s sys 0m31.868s Rate is 32 MB/sec cd /grid/data/minos/nearcat FILES=`ls N00012135*cand*` SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm/TEST SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm/TEST srmmkdir ${SPATH2} At 15:00+ for IFILE in ${FILES} ; do echo ${IFILE} time srmcp file:///${IFILE} ${SPATH}/${IFILE} done N00012135_0001.cosmic.cand.cedar.0.root real 0m41.664s user 0m20.561s sys 0m20.849s N00012135_0001.spill.cand.cedar.0.root real 0m24.862s user 0m13.962s sys 0m8.898s ... real 0m16.846s user 0m11.435s sys 0m3.375s N00012135_0022.spill.cand.cedar.0.root real 0m23.094s user 0m14.327s sys 0m9.214s SRV1> TIMS=`grep 'real' /tmp/testcp | cut -c 12-16` SRV1> echo $TIMS 41.664 24.862 46.818 22.677 44.512 23.150 49.938 25.862 39.586 30.845 18.375 27.728 48.427 23.360 18.204 29.931 38.207 50.222 31.942 15.594 26.017 39.594 25.393 38.542 27.424 21.473 16.846 23.094 41.664 24.862 46.818 22.677 44.512 23.150 49.938 25.862 39.586 30.845 18.375 27.728 48.427 23.360 18.204 29.931 38.207 50.222 31.942 15.594 26.017 39.594 25.393 38.542 27.424 21.473 16.846 23.094 41.664 24.862 46.818 22.677 44.512 23.150 49.938 25.862 39.586 30.845 18.375 27.728 48.427 23.360 18.204 29.931 38.207 50.222 31.942 15.594 26.017 39.594 25.393 38.542 27.424 21.473 16.846 23.094 SRV1> for TIM in $TIMS ; do printf "${TIM} + " >> /tmp/times ; done SRV1> printf "0\n" >> /tmp/times SRV1> cat /tmp/times | bc 2610.861 for FILE in ${FILES} ; do SI=`ls -l ${FILE} | tr -s ' ' | cut -f 5 -d ' '` ; printf "${SI} + " ; done > /tmp/sizes printf "0\n" >> /tmp/sizes cat /tmp/sizes | bc 12879467728 Rate is 5 MB/sec ??? Looks lousy to me not consistent with 2x8.5 MB/sec MRTG peaked sharply around 85 mbit/sec on eth0 ######## # FARM # ######## 152G 80G WRITE SRV1> du -sm /grid/data/minos/*cat 9727 /grid/data/minos/farcat 1 /grid/data/minos/mccat 3553 /grid/data/minos/mcfarcat 221977 /grid/data/minos/mcnearcat 37539 /grid/data/minos/nearcat ./roundup -w -r cedar mcnear Oops, typos in roundup.20070503 setting SRMQ. Fixed a couple of times, back on track. 
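Note for later - the echoed TIMS list above repeats the same 28 values three times, probably because the transcript in /tmp/testcp was appended to more than once, so the 2610 second total and the 5 MB/sec figure look pessimistic by about a factor of 3 ; nearer 15 MB/sec, which is consistent with the 2x8.5 MB/sec expectation. A single-pass version of the arithmetic, as a sketch ( run from /grid/data/minos/nearcat with FILES and /tmp/testcp as above ) :
TSEC=`grep real /tmp/testcp | cut -c 12-16 | awk '{ s += $1 } END { print s }'`
SIZB=`ls -l ${FILES} | awk '{ s += $5 } END { print s }'`
echo "scale=1; ${SIZB} / ${TSEC} / 1000000" | bc    # MBytes/sec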
./roundup -w -r cedar mcnear Fri May 4 08:21:53 CDT 2007 Fri May 4 10:42:51 CDT 2007 MRTG shows sustained 40 Mbit/sec on eth0 sar -n DEV shows sustained 4 MBytes/sec on each eth0 and eth1 bond0 does not show correct sum At aroung 09:20, tried to prime the pump getting a few files into memory cache with time md5sum ../ROUNTMP/WRITE/n13011679* real 9m45.719s user 0m23.272s sys 0m16.046s 1 sntp, 11 cand, net about 7 GB. du -sm 580 ../ROUNTMP/WRITE/n13011679_0000_L010185N_D00.cand.cedar.root 702 ../ROUNTMP/WRITE/n13011679_0000_L010185N_D00.sntp.cedar.root This made a spike to 60 Bit/sec in eth0 ( 5 min ave ) in mrtg Purged other files from WRITE, to check size AFSS/roundup.20070504 -w -r R1_24cal near The net rate for this run was 73 GBytes 140 minutes, 8400 seconds 8.9 MBytes/sec Somewhat better than the old 7, not great. ./roundup -r cedar far Fri May 4 12:12:15 CDT 2007 Fri May 4 12:19:39 CDT 2007 Looks OK, then PURGE FARM F00037968_0011.spill.cand.cedar.0.root Datafile with name 'F00037968_0012.mdaq.root' not found. SRMCP -streams_num=1 -server_mode=active file:///F00037968_0012.all.cand.cedar.0.root /pnfs/minos/reco_far/cedar/cand_data/ PURGE FARM F00037968_0012.all.cand.cedar.0.root Datafile with name 'F00037968_0012.mdaq.root' not found. SRMCP -streams_num=1 -server_mode=active file:///F00037968_0012.spill.bcnd.cedar.0.root /pnfs/minos/reco_far/cedar/.bcnd_data/ PURGE FARM F00037968_0012.spill.bcnd.cedar.0.root ?????? What happened to the month ? Predator had not been run, the raw files were not in SAM. The script proceeded to write the files without a month. Ouch !!!! As howie, remove these strays and rewrite, first checking : SRV1> cd /pnfs/minos/reco_far/cedar SRV1> ls -a *_data/F* cand_data/F00037968_0012.all.cand.cedar.0.root cand_data/F00037971_0000.all.cand.cedar.0.root cand_data/F00037968_0012.spill.cand.cedar.0.root cand_data/F00037971_0000.spill.cand.cedar.0.root cand_data/F00037968_0013.all.cand.cedar.0.root cand_data/F00037971_0001.all.cand.cedar.0.root cand_data/F00037968_0013.spill.cand.cedar.0.root cand_data/F00037971_0001.spill.cand.cedar.0.root cand_data/F00037968_0014.all.cand.cedar.0.root cand_data/F00037971_0002.all.cand.cedar.0.root cand_data/F00037968_0014.spill.cand.cedar.0.root cand_data/F00037971_0002.spill.cand.cedar.0.root cand_data/F00037968_0015.all.cand.cedar.0.root cand_data/F00037971_0003.all.cand.cedar.0.root cand_data/F00037968_0015.spill.cand.cedar.0.root cand_data/F00037971_0003.spill.cand.cedar.0.root cand_data/F00037968_0016.all.cand.cedar.0.root cand_data/F00037971_0004.all.cand.cedar.0.root cand_data/F00037968_0016.spill.cand.cedar.0.root cand_data/F00037971_0004.spill.cand.cedar.0.root cand_data/F00037968_0017.all.cand.cedar.0.root cand_data/F00037971_0005.all.cand.cedar.0.root cand_data/F00037968_0017.spill.cand.cedar.0.root cand_data/F00037971_0005.spill.cand.cedar.0.root cand_data/F00037968_0018.all.cand.cedar.0.root cand_data/F00037971_0006.all.cand.cedar.0.root cand_data/F00037968_0018.spill.cand.cedar.0.root cand_data/F00037971_0006.spill.cand.cedar.0.root SRV1> ls -a .*_data/F* .bcnd_data/F00037968_0012.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0000.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0013.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0001.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0014.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0002.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0015.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0003.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0016.spill.bcnd.cedar.0.root 
.bcnd_data/F00037971_0004.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0017.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0005.spill.bcnd.cedar.0.root .bcnd_data/F00037968_0018.spill.bcnd.cedar.0.root .bcnd_data/F00037971_0006.spill.bcnd.cedar.0.root SRV1> FILES=`ls -a *_data/F* ; ls -a .*_data/F*` SRV1> echo $FILES | wc -w 42 SRV1> for FP in ${FILES} ; do FI=`echo ${FP} | cut -f 2 -d /` ; ls -l /export/stage/minfarm/ROUNDUP/WRITE/${FI} ; done SRV1> for FP in ${FILES} ; do rm ${FP} ; done Tested the updated roundup which should abort. SRV1> AFSS/roundup.20070504 -n -w -r cedar far OK - processing /grid/data/minos/farcat version 20070504 Fri May 4 14:23:14 CDT 2007 PURGING WRITE files 147 Datafile with name 'F00037968_0012.mdaq.root' not found. OOPS - raw data not in SAM OK, let's get predator up to date and resume cp AFSS/roundup.20070504 . ln -sf roundup.20070504 roundup SRV1> du -sm /grid/data/minos/*cat 4839 /grid/data/minos/farcat 1 /grid/data/minos/mccat 3553 /grid/data/minos/mcfarcat 155542 /grid/data/minos/mcnearcat 39258 /grid/data/minos/nearcat VMON=2007-05 ./predator ${VMON} MINOS26 > crontab crontab.dat ( later, at 18:41 ) ./roundup -w -r cedar far Fri May 4 17:20:04 CDT 2007 Fri May 4 17:36:49 CDT 2007 ./roundup -r cedar near Fri May 4 17:37:45 CDT 2007 catting Fri May 4 18:00:24 CDT 2007 writing MRTG shows excellent rates, 60 to 80 mbit/s (x2) in spite of 60 GB of files to write ( >> 16 GB local memory ) Fri May 4 19:17:05 CDT 2007 Next... ./roundup -r R1_24cal near Fri May 4 19:49:00 CDT 2007 Fri May 4 19:50:17 CDT 2007 Fri May 4 19:53:41 CDT 2007 Running short of space, purge write ./roundup -w -r cedar mcnear Fri May 4 19:55:56 CDT 2007 Fri May 4 20:00:04 CDT 2007 196G free ./roundup -r R1_24cal mcnear Fri May 4 20:01:23 CDT 2007 cat Fri May 4 21:19:53 CDT 2007 write OK - creating /pnfs/minos/mcout_data/R1_24cal/near/daikon_00/L010185N_24cal/cand_data/100 no such directory ( N.B. this DID create the full directory ) Fixing this Sat morning 5/5 ########### # ROUNDUP # ########### roundup.20070504 Need to handle MCCONF calculation for R1_24cal Added file count to printout of WRITING to DCache AFSS/roundup.20070504 -n -W -s n13011001 -r R1_24cal mcnear .../L010185N_24cal/... STREAM=L250200N_D00.mrnt.cedar STREAM=L010185N_D00_nccoh.sntp.cedar STREAM=L010185N_D00.cand.R1_24cal doing MCPHYS=`echo ${STREAM} | cut -f 3 -d '_' | cut -f 1 -d .` MCCONF=`echo ${STREAM} | cut -f 1 -d '_'`${MCPHYS:+_${MCPHYS}} MCPHYS is getting activated when it shouldn't be Switch to cut on . field first, then third _ MCPHYS=`echo ${STREAM} | cut -f 1 -d . | cut -f 3 -d '_'` Moved to production around 12:08 cp AFSS/roundup.20070504 . ln -sf roundup.20070504 roundup Modified again to abort on files which are not in SAM ######## # FARM # ######## jaboehm (Josh) reports corrupt file at d174/MRCC/TEMP/n13011065_0005_L010185N_D0.mrnt.cedar.root That's actually n13011065_0005_L010185N_D00.mrnt.cedar.root written at 12:11 srmcp'd after 14:46 There were problem with the minos_reco_far familied around then, but not the mc families. The new pools were being deployed. MINOS26 > dirs=`ls -d d*` MINOS26 > for DIR in $dirs ; do echo ${DIR} ; find ${DIR} -name TEMP ; done d170/TEMP Are the raw ntuples there ? 
MINOS26 > for DIR in $dirs ; do echo ${DIR} ; find ${DIR} -name n13011065_0005_L010185N_D00.mrnt.cedar.root ; done MINOS26 > dds /afs/fnal.gov/files/data/minos/d170/TEMP/n13011065_0005_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 11634 e875 291140914 May 4 09:01 /afs/fnal.gov/files/data/minos/d170/TEMP/n13011065_0005_L010185N_D00.mrnt.cedar.root MINOS26 > dds /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/106/n13011065_0005_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 291140914 Apr 28 16:37 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/106/n13011065_0005_L010185N_D00.mrnt.cedar.root ecrc /afs/fnal.gov/files/data/minos/d170/TEMP/n13011065_0005_L010185N_D00.mrnt.cedar.root CRC 1580338695 MINOS26 > ./dc_stat /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/106/n13011065_0005_L010185N_D00.mrnt.cedar.root ... 1580338695 Check the files the usual way with root, using hadd MINOS26 > cd /local/scratch26/kreymer/ MINOS26 > hadd Merged.root /afs/fnal.gov/files/data/minos/d170/TEMP/n13011065_0000_L010185N_D00.mrnt.cedar.root /afs/fnal.gov/files/data/minos/d170/TEMP/n13011065_0005_L010185N_D00.mrnt.cedar.root MINOS26 > du -sm Merged.root 464 Merged.root Input sizes 194911909 291140914 Output 486044846 versus 486052823 ( difference 7977 ) Tested Merged.root as Josh suggested root Merged.root NtpSt->Show(0) ######### # STAGE # ######### ./volumes vols FVOLS=`./volumes fardet_data` MINOS26 > echo $FVOLS VO2064 VO2212 VO2220 VO2225 VO3646 VO3909 VO4136 VO4245 VO4309 VO4335 VO4639 VO4640 VO4919 VO5046 VO5054 VO5182 VO5672 VO5869 VO5871 VO5881 VO6809 VO6876 VO7999 VO8536 VO8555 VO8722 VO8917 VO8968 VO9488 VO9830 VOB499 for VOL in ${FVOLS} ; do ./stage -w ${VOL} ; done This is already picking up a few stray files, like 2003-10/F00020634_0000.mdaq.root Also finding a few files listed on tape, but not in PNFS, and without delflag set to yes on tape VO2212 This is stuff from 2004-11, probably not relevant like F00010926_0000.mdaq.root ============================================================================= 2007 05 03 ########### # ROUNDUP # ########### Added filtering of filename initial ${DETI} cp AFSS/roundup.20070503 . ln -sf roundup.20070503 roundup ######## # FARM # ######## ./roundup -w -r R1_24cal near Thu May 3 07:29:17 CDT 2007 234 GB free Date: Thu, 03 May 2007 09:25:21 -0500 (CDT) From: David Berg These are srm services, and they show as offline until they are used. Please go ahead and use them. Tested srmls and srmcp per HOWTO.srm at 09:27, OK Web page shows services running http://fndca.fnal.gov:2288/cellInfo Get back to work, clear rest of R1_24cal ./roundup -n -M -W -r R1_24cal near reveals several bad_runs files which are present with non-0 content. 
+BADRUNS+ N00009241_0010.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009244_0000.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009256_0002.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009256_0009.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009259_0013.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009265_0001.cosmic.cand.R1_24cal.0.root +BADRUNS+ N00009241_0010.cosmic.sntp.R1_24cal.0.root +BADRUNS+ N00009256_0002.cosmic.sntp.R1_24cal.0.root +BADRUNS+ N00009259_0013.cosmic.sntp.R1_24cal.0.root +BADRUNS+ N00009265_0001.cosmic.sntp.R1_24cal.0.root +BADRUNS+ N00009241_0010.spill.cand.R1_24cal.0.root +BADRUNS+ N00009256_0002.spill.cand.R1_24cal.0.root +BADRUNS+ N00009259_0013.spill.cand.R1_24cal.0.root I need to add an up front filter, which is a big pain, because there are two separate bad_run files, one for normal and one for mrnt files. Ignore the mrcc files for now, there are no type 1 or 3 errors there bad_runs_camb.cedar bad_runs.cedar bad_runs_mc.cedar bad_runs_mrcc.cedar bad_runs_mrcc_mc.cedar What are these camb files ? This is TOOOOO MUUUUUCH ! grep ' *[1,3] *....-..-.. *' ~/lists/bad_runs_mc.cedar put this list into zap_files Hacked up roundup.20070503 skipping zap_files ( bad_files errors 1 or 3 ) SRV1> grep ' *[1,3] *....-..-.. *' ~/lists/bad_runs.R1_24cal N00009256_0009.0 2005-11 47487 3 2007-05-02 19:52:12 fnpc104 N00009235_0001.0 2005-11 47643 3 2007-05-02 19:55:07 fnpc111 N00009265_0001.0 2005-11 46531 3 2007-05-02 20:09:18 fnpc228 N00009244_0000.0 2005-11 41122 3 2007-05-02 20:09:38 fnpc117 N00009256_0008.0 2005-11 47913 3 2007-05-02 20:13:24 fnpc91 N00009241_0010.0 2005-11 47591 3 2007-05-02 20:17:52 fnpc37 N00009256_0002.0 2005-11 47818 3 2007-05-02 20:27:39 fnpc30 N00009259_0013.0 2005-11 47132 3 2007-05-02 20:34:37 fnpc171 AFSS/roundup.20070503 -n -M -W -r R1_24cal near > /tmp/rRcalz AFSS/roundup.20070502 -n -M -W -r R1_24cal near > /tmp/rRcal Looks good, has zapped the type 1 and 3 bad runs. cp AFSS/roundup.20070503 . ln -sf roundup.20070503 roundup 232 GB free 100674 /grid/data/minos/nearcat ./roundup -M -r R1_24cal near Thu May 3 15:35:45 CDT 2007 Thu May 3 16:41:11 CDT 2007 Thu May 3 20:51:54 CDT 2007 163G free ./roundup -w -r R1_24cal mcnear Thu May 3 21:21:16 CDT 2007 Thu May 3 21:32:30 CDT 2007 224G free Updated primary roundup.20070503 to add qualifiers srmcp -streams_num=1 -server_mode=active 207712 /grid/data/minos/mcnearcat SRV1> ls /grid/data/minos/mcnearcat/*R1_* | wc -l 410 SRV1> ls /grid/data/minos/mcnearcat | wc -l 745 Let's rip : AFSS/roundup.20070503 -r R1_24cal mcnear GRRRRRRRRRRRRRRRRRRRR Did a test run, and we have problem with BOTH R1_24cal and cedar R1_24cal contains underscores, which fouls up mcin path calculation need to debug/extend/add more special cases to script cedar has many +BAD_RUN+ diagnostics, in spite of new filtering of codes 1 and 3 n13011680_0003 Corrected BADRUNS/ZAPRUNS in roundup.20070503, recloned to fnpcsrv1 ./roundup -r cedar mcnear Thu May 3 21:51:02 CDT 2007 Thu May 3 22:37:02 CDT 2007 srmcp$: command not found typo in roundup.20070503 , corrected, repropogated ######## # FARM # ######## What to do with mcnearcat/c* files, which do not follow naming conventions ? 
Need to rename per minos_sim conventions, ######## # FARM # ######## Pending 0 length files : SRV1> find /grid/data/minos -size 0 /grid/data/minos/mcfarcat/c10000602_0003.cand.cedar.root /grid/data/minos/mcfarcat/c10000602_0003.sntp.cedar.root /grid/data/minos/nearcat/N00009235_0001.cosmic.cand.R1_24cal.0.root /grid/data/minos/nearcat/N00009256_0008.cosmic.cand.R1_24cal.0.root /grid/data/minos/mcnearcat/n13011685_0002_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011684_0009_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011684_0001_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011684_0000_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011683_0006_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011682_0010_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011683_0001_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011683_0009_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011683_0002_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011683_0008_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011684_0002_L010185N_D00.cand.cedar.root /grid/data/minos/mcnearcat/n13011684_0008_L010185N_D00.cand.cedar.root SRV1> find /grid/data/minos -size 0 -exec ls -l {} \; -rw-r--r-- 1 rubin numi 0 May 2 20:21 /grid/data/minos/mcfarcat/c10000602_0003.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:21 /grid/data/minos/mcfarcat/c10000602_0003.sntp.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 19:55 /grid/data/minos/nearcat/N00009235_0001.cosmic.cand.R1_24cal.0.root -rw-r--r-- 1 rubin numi 0 May 2 20:13 /grid/data/minos/nearcat/N00009256_0008.cosmic.cand.R1_24cal.0.root -rw-r--r-- 1 rubin numi 0 May 2 20:24 /grid/data/minos/mcnearcat/n13011685_0002_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:25 /grid/data/minos/mcnearcat/n13011684_0009_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:21 /grid/data/minos/mcnearcat/n13011684_0001_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:21 /grid/data/minos/mcnearcat/n13011684_0000_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:19 /grid/data/minos/mcnearcat/n13011683_0006_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:19 /grid/data/minos/mcnearcat/n13011682_0010_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:20 /grid/data/minos/mcnearcat/n13011683_0001_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:24 /grid/data/minos/mcnearcat/n13011683_0009_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:18 /grid/data/minos/mcnearcat/n13011683_0002_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:18 /grid/data/minos/mcnearcat/n13011683_0008_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:18 /grid/data/minos/mcnearcat/n13011684_0002_L010185N_D00.cand.cedar.root -rw-r--r-- 1 rubin numi 0 May 2 20:18 /grid/data/minos/mcnearcat/n13011684_0008_L010185N_D00.cand.cedar.root ######## # FARM # ######## Waiting for DCache GsiFTP door and CopyManager Take this as a chance to test I/O rates SRV1> du -sm REDO/WRITE 3014 REDO/WRITE SRV1> du -sb REDO/WRITE 3155841680 REDO/WRITE local to /grid/data/minos/TEST 08:51 SRV1> time cp -r REDO/WRITE /grid/data/minos/TEST real 3m22.844s user 0m0.108s sys 0m50.829s Rate 3156./263 = 12 MB/sec. /grid/data/minos/TEST to local 08:58 SRV1> time cp -r /grid/data/minos/TEST REDO/TESTREAD real 1m12.285s user 0m0.572s sys 0m25.092s Rate 3156./72 = 44 MB/sec. 
Repeat local to TEST2 09:00 SRV1> time cp -r REDO/TESTREAD /grid/data/minos/TEST2 real 1m49.432s user 0m0.077s sys 0m18.470s Rate 3156./109 = 29 MB/sec. Repeat local to TEST3 09:03 SRV1> time cp -r REDO/TESTREAD /grid/data/minos/TEST3 real 1m23.789s user 0m0.092s sys 0m47.065s Rate 3156./83 = 38 MB/sec. NOW SRMCP/DCCP test, to fermigrid/volatile cd LONG ( 2+ GByte file ) # size is 2283574599 IFILE=N00010819_0000.spill.sntp.R1_18_4.0.root SRMCP and list/remove export SRM_CONFIG=/export/stage/minfarm/.srmconfig/config.xml SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm SFILE2=${SPATH2}/${IFILE} SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm SFILE=${SPATH}/${IFILE} srmkdir ${SPATH2} srmls ${SPATH2} srmls ${SFILE2} time srmcp file:///${IFILE} ${SFILE} real 4m11.374s user 0m31.137s sys 0m46.010s srmrm ${SFILE2} time srmcp file:///${IFILE} ${SFILE} real 1m25.390s user 0m30.004s sys 0m41.447s time srmcp file:///${IFILE} ${SFILE} real 1m14.435s user 0m30.811s sys 0m44.405s time srmcp -streams_num=1 -server_mode=active file:///${IFILE} ${SFILE} real 1m1.463s user 0m19.758s sys 0m36.044s time srmcp -streams_num=1 -server_mode=active file:///${IFILE} ${SFILE} real 0m59.955s user 0m20.694s sys 0m35.184s time md5sum ../REDO/WRITE/* real 5m55.882s user 0m9.317s sys 0m7.823s Rate is 3155841680 / 356. = 8.9 MB/s time srmcp -streams_num=1 -server_mode=active file:///${IFILE} ${SFILE} real 1m0.487s user 0m20.063s sys 0m34.877s time srmcp -streams_num=1 -server_mode=active file:///${IFILE} ${SFILE} real 1m13.355s user 0m20.007s sys 0m34.414s time md5sum ../REDO/WRITE/* real 0m12.335s user 0m8.669s sys 0m3.659s setup dcap -q x509 DCPOR=24525 IPATH=minos/fermigrid/volatile/kreymer DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/fermigrid/volatile/minfarm/${IFILE} time dccp ${IFILE} ${DFILE} ########## # DCACHE # ########## Dcache advertized as up at 07:09 Replied to dcache-admin at 07:39 Except apparently the following DCache services, as show at http://fndca.fnal.gov:2288/cellInfo CopyManager OFFLINE RemoteGsiftpTransferManager OFFLINE RemoteHttpTransferManager OFFLINE SRM-stkendca2a OFFLINE Logged helpdesk ticket 96639 at 08:18 Berg claims services are up, srmcp/srmls do work for me. But fardet logging is still down. It resumed at 11:00, with an archiver restart. ############ # MCIMPORT # ############ Bounced off kordosky/tar/n12011778_0004_L010185N_D00-n12011795_0011_L010185N_D00.tar Should recover automatically at noon. ============================================================================= 2007 05 02 ########### # ROUNDUP # ########### cp AFSS/roundup.20070502 . ln -sf roundup.20070502 roundup ######## # FARM # ######## 166G free, clear some nd/fd space before running R1_24cal ./roundup -w -r cedar far ./roundup -w -r cedar hear # oops, typo ./roundup -w -r cedar near ./roundup -w -r cedar mcnear 225G free Check a couple of subrun sfrom R1_24cal ./roundup -s N00009095 -r R1_24cal near Looks good, go with ./roundup -r R1_24cal near Wed May 2 15:12:46 CDT 2007 Looking a bit at sar's log of network rates, sar -n DEV | grep bond0 > /tmp/sarn ( copy this to minos01 where we have working gnuplot ) setup gnuplot plot '/tmp/sarn' using 1:6 Typical write rate is 1, peak is 7. This ain't Gigabit, people ! 
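The same numbers can be had without copying the file to minos01 for gnuplot ; a sketch, where $6 is simply the column that was plotted above ( units as sar reports them ) :
awk '{ n++ ; s += $6 ; if ($6 > p) p = $6 }
     END { printf "samples %d  avg %.1f  peak %.1f\n", n, s/n, p }' /tmp/sarn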
92 GB free ./roundup -w -r cedar mcnear Wed May 2 23:27:50 CDT 2007 ./roundup -w -r R1_24cal near Wed May 2 23:33:20 CDT 2007 OOPS - the previous roundup run was still running, so the log is a bit mixed up. Should do no harm. aborted cleanly on a 0 size file 150 GB free ============================================================================= 2007 05 01 ########### # MONTHLY # ########### CFL 5/1 DATASETS 5/1 PREDATOR 5/1 SADDRECO 5/1 via roundup -m '2007-04' -r cedar far and near VAULT 5/15 MYSQL 5/... ########### # ROUNDUP # ########### Status of first intergrated roundup with mcnear and SAM : Good, but note that being 1 May, saddreco needs a catchup pass. roundup.20070502 Manual catchup for 2007-04 via AFSS/roundup.20070502 -m '2007-04' -r cedar near AFSS/roundup.20070502 -m '2007-04' -r cedar far Moved purge of files ahead of concatenation, for best space usage. Pre-purged with 135 G free About 95 GB in *cat, so will prepurge ( I will be in class all day tomorrow morning, so will not deploy the new roundup.20070502 in production yet. ) 102 GB in WRITE AFSS/roundup.20070502 -w -r cedar mcnear AFSS/roundup.20070502 -w -r cedar near AFSS/roundup.20070502 -w -r cedar far 235 G free ############ # DATASETS # ############ Need to change naming of summary files Was g - FermigridVolPools m - RawDataWritePools r - readPools w - writePools Want m for MinosPrdReadPools : Change m files to q for RawDataWritePools g - FermigridVolPools m - MinosPrdReadPools q - RawDataWritePools r - readPools w - writePools cd /afs/fnal.gov/files/expwww/numi/html/computing/dh/datasets MIN > FILES=`find . -name current.m.2\* | cut -f 2- -d /` MIN > printf "${FILES}\n" 2006/03/current.m.20060331 2006/04/current.m.20060401 2006/04/current.m.20060404 2006/04/current.m.20060406 2006/04/current.m.20060412 2006/04/current.m.20060413 2006/04/current.m.20060421 2006/04/current.m.20060426 2006/04/current.m.20060427 2006/09/current.m.20060918 2006/09/current.m.20060920 2006/09/current.m.20060925 2006/10/current.m.20061023 2007/02/current.m.20070226 2007/02/current.m.20070228 2007/03/current.m.20070302 2007/03/current.m.20070319 2007/04/current.m.20070402 for FILE in ${FILES} ; do echo mv ${FILE} ${FILE:0:16}q${FILE:17} ; done for FILE in ${FILES} ; do mv ${FILE} ${FILE:0:16}q${FILE:17} ; done Reexamine 7a-1/2, 8a-1/2 write pools, each 885760 MBytes, net 3.5 TB. ( 3.7 Decimal ) Consistent with Minos usage. Current raw data is about 4 TBytes. ####### # X11 # ####### for NODE in $UNODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'echo gimp;gimp;echo done' ; done minos02 Tue May 1 14:07:48 CDT 2007 minos04 Tue May 1 14:08:55 CDT 2007 minos05 Tue May 1 14:09:07 CDT 2007 minos23 Tue May 1 14:11:53 CDT 2007 for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh minos${NODE} 'mkdir -p /var/tmp/kreymer/.gimp-1.2' ; done lockups are repeatable. Reported to minos-admin Note that the initial symptom was an emacs session, stuck with this message in the lower left of window : Loading edt... 
Emacs version is xemacs-21.4.13-8.ent.1 ============================================================================= 2007 04 30 ######## # FARM # ######## 17 GB free ./roundup -w -r cedar mcnear Mon Apr 30 07:41:45 CDT 2007 Mon Apr 30 07:46:43 CDT 2007 127 GB free 103522 /grid/data/minos/mcnearcat Corrected crontab to crontab.dat from crontab.nofntp AFSS/roundup.20070501 -r cedar far Mon Apr 30 08:05:35 CDT 2007 Mon Apr 30 08:05:43 CDT 2007 ./roundup -r cedar mcnear Mon Apr 30 08:09:50 CDT 2007 OK - processing 2042 files Mon Apr 30 14:54:41 CDT 2007 Now wait till about 20:00 to purge WRITE files Saved mrtr traffic plot on desktop in fnpcsrv1-20070430.png 46 GB free ./roundup -w -r cedar mcnear Mon Apr 30 21:40:20 CDT 2007 Mon Apr 30 21:46:17 CDT 2007 143 GB free ######## # GRID # ######## Note that /grid/app is mounted on minos26 ######## # FARM # ######## Cleaning up duplicate subruns for n13011001_0000_L010185N_D00.mrnt.cedar.root n13011059_0000_L010185N_D00.mrnt.cedar.root Each was concatenated with 11 subruns. PA=/pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data for RUN in n13011001 n13011059 ; do for SUB in 01 02 03 04 05 06 07 08 09 10 ; do F=${RUN}_00${SUB}_L010185N_D00.mrnt.cedar.root [ -r ${PA}/${F:5:3}/${F} ] && ls -l ${PA}/${F:5:3}/${F} done ; done -rw-r--r-- 1 1334 e875 48903889 Mar 1 02:44 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0001_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 49539338 Mar 1 02:04 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0002_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 48888805 Mar 1 01:55 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0003_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 48978682 Mar 1 01:09 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0004_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 48989510 Mar 1 02:04 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0005_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 1334 e875 49178857 Mar 1 01:53 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0006_L010185N_D00.mrnt.cedar.root for RUN in n13011001 ; do for SUB in 01 02 03 04 05 06 ; do F=${RUN}_00${SUB}_L010185N_D00.mrnt.cedar.root mv ${PA}/${F:5:3}/${F} /pnfs/minos/BAD/DUP_${F} done ; done Did the above around 11:34 ######## # FARM # ######## Rubin email states that we should ignore any bad_runs lines having error 1. I think this is moot, as I keep existing files. I might force out a run missing a temporarily 'bad' subrun. Not sure how to find the error number, bad_files formats vary : bad_runs_mc.cedar n13025685_0000_L010185 carrot_06 L010185 139 2006-09-28 00:28:13 fnpc31 f21011011_0000_L010185N_D00 136 2007-02-25 20:39:11 fnpc229 bad_runs.cedar F00033570_0007.0 2006-01 92028 136 2006-08-28 10:55:05 fnpc59 Perhaps the error code is the last blank separated field before the year : grep ' *1 *....-..-.. *' ######## # FARM # ######## Added mcnearcat to crontab.dat : 00 08 * * * ${HOME}/scripts/roundup -c -r cedar far ; ${HOME}/scripts/roundup -c -r cedar near ; ${HOME}/scripts/roundup -c -r cedar mcnear ############ # SADDRECO # ############ Testing SAM declares in saddreco.20070501, looking to deploy tomorrow. Move it to production cp AFSS/roundup.20070501 . ln -sf roundup.20070501 roundup AFSS/roundup.20070501 -m -r cedar far 15:00 Oops, forgot to say 'declare' on saddreco commandline, corrected. 
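A sketch of pulling that field out explicitly, which should cope with both bad_runs formats shown above since only the full date field matches the pattern ( untested against the real lists ) :
awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^....-..-..$/) { print $(i-1), $0 ; break } }' \
    ~/lists/bad_runs_mc.cedar ~/lists/bad_runs.cedar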
Corrected, tested, looks OK in LOG/2007-04/declare_far_cedar.log ######## # FARM # ######## Stirred the pot regarding mcout_data/cedar/cosmic and atmos, which start with letters a and c, and do not indicate daikon heritage. Suggested usage of the beam configuration string, and reversion to n, f prefix. And shift to usual mcout_data/cedar/[ne,f]ar/daikon_00// ============================================================================= 2007 04 29 sunday ######## # FARM # ######## ./roundup -w -r cedar mcnear Sun Apr 29 07:41:18 CDT 2007 122 GB free Far writes stuck on sntp, clear the cand's ./roundup -w -s cand -r cedar far Sun Apr 29 07:49:43 CDT 2007 125 GB free crontab crontab.nofntp # adds -s cand to the far roundup DCache write pool has been reconfigures, try 1 file test ./roundup -w -s F00037950_0000.all -r cedar far look good, catch up : ./roundup -w -r cedar far 125 GB free Grab some more mrnt, up to 200 GB now ./roundup -s n130112 -r cedar mcnear # 635 files 32 GB SRV1> ls /grid/data/minos/mcnearcat | grep "n13011[2,3,4]" | wc -l 2689 ./roundup -n -W -s 'n13011[2,3,4]' -r cedar mcnear OK - processing 2689 files OOPS - Stream size 131286 too big for free space 127834 - 10000 SRV1> ./roundup -n -W -s 'n13011[2,3]' -r cedar mcnear OK - processing 1615 files OK - stream L010185N_D00.mrnt.cedar OK - 78869 Mbytes in 172 runs OK, let's do that ./roundup -s 'n13011[2,3]' -r cedar mcnear 54 GB free only 149/ of 180 WRITE files seem to be in enstore, close enough for next batch. roundup -w -r cedar mcnear Sun Apr 29 22:44:14 CDT 2007 112 GB free ./roundup -n -W -s 'n13011[4,5]' -r cedar mcnear OOPS - Stream size 104611 too big for free space 114510 - 10000 roundup -w -r cedar near roundup -w -r cedar far 113 GB free ./roundup -s 'n13011[4,5]' -r cedar mcnear Sun Apr 29 22:59:49 CDT 2007 OK - processing 2143 files OK - stream L010185N_D00.mrnt.cedar OK - 104611 Mbytes in 199 runs ... Writing at about 01:45 ... Mon Apr 30 05:48:36 CDT 2007 Rate is about 104611 MB/ 14480 Sec = 7 MBytes/sec MRTG was reporting sustained 30 MBits/second, not consistent with 7 MBy/sec ============================================================================= 2007 04 28 saturday ######## # FARM # ######## 68 GB free ./roundup -w -r cedar far 82 GB free ./roundup -w -r cedar near 127 GB free ./roundup -s sntp -f 2 -r cedar mcnear ( expect 25 GB ) ./roundup -s cand -r cedar mcnear ( expect 50 GB ) 61 GB free wait 4 hours ./roundup -w -r cedar mcnear 127 GB free ./roundup -s n130110 -r cedar mcnear ( expect 50 GB, 1036 mrnt files ) 64 GB free ./roundup -w -r cedar near 75 GB free mcnear writes are stuck, -rw-r--r-- 1 minfarm numi 536433461 Apr 28 11:27 n13011001_0000_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 rubin numi 48784095 Mar 1 00:51 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/ cd WRITE/ FS=`ls n*mrnt.cedar.root` PA=/pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data for F in ${FS} ; do [ -r ${PA}/${F:5:3}/${F} ] && ls -l ${PA}/${F:5:3}/${F} done -rw-r--r-- 1 rubin numi 48784095 Mar 1 00:51 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/100/n13011001_0000_L010185N_D00.mrnt.cedar.root -rw-r--r-- 1 rubin numi 49142158 Mar 1 02:49 /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data/105/n13011059_0000_L010185N_D00.mrnt.cedar.root Other subruns also exist from March 1 processing. 
Dodge around this for now by moving the two offenders out of the way As rubin PA=/pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/mrnt_data for F in n13011001_0000_L010185N_D00.mrnt.cedar.root \ n13011059_0000_L010185N_D00.mrnt.cedar.root do mv ${PA}/${F:5:3}/${F} /pnfs/minos/BAD/DUP_${F} done Finding all the mcnear duplicates pending: cd /grid/data/minos/mcnearcat FS=`ls n*mrnt.cedar.root` Now back as minfarm ./roundup -w -r cedar mcnear Sat Apr 28 14:45:59 CDT 2007 75 GB free Now purge them, ./roundup -w -r cedar mcnear Sat Apr 28 20:32:06 CDT 2007 122 GB free ./roundup -s n130111 -r cedar mcnear # expect 47 GB, 961 mrnt files Sat Apr 28 20:48:34 CDT 2007 ############ # SADDRECO # ############ REL=cedar MON=2007-04 for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done ########## # DCACHE # ########## Far det reco files failed in a messy way to go to DCache, the root cause in the messages seems to be : Pool manager error: No write pools configured for reco_near_sntp and reco_far_cand are fine. ########## # DCACHE # ########## sent request to dcache-admin The MinosPrdSelGrp selection group presently contains minos.reco_far_cedar_bntp@enstore minos.reco_far_cedar_mrnt@enstore minos.reco_far_cedar_sntp@enstore After we have corrected the present problem writing to these families, please extend the MinosPrdSelGrp to include minos.reco_near_cedar_bntp@enstore minos.reco_near_cedar_mrnt@enstore minos.reco_near_cedar_sntp@enstore ########### # ROUNDUP # ########### roundup.20070501 Added -m -M options to enable/disable saddreco calls for near, far only ============================================================================= 2007 04 27 ########## # DCACHE # ########## Conversation with podstvkv, New pools have not been effective, because wild carding of file families does not work as a general expression. Will send explicit list of families, he will test with cedar.
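The explicit list can be spelled out from the detector / tier pattern used in the 28 April request above ; a sketch that just regenerates those six ntuple families ( the full set to send still needs thought ) :
for DET in far near ; do for TIER in bntp mrnt sntp ; do
  echo minos.reco_${DET}_cedar_${TIER}@enstore
done ; done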
Note 13 Feb note regarding read families y Note 9 Feb note regarding need for 5+ TB of Minos DAQ capacity ########## # DCACHE # ########## New pools seem to be present, since 25 April 14:00, ExpDbWritePools 8.9 TB w-stkendca17a-1 680960 w-stkendca17a-2 680960 w-stkendca17a-3 680960 w-stkendca18a-1 680960 w-stkendca18a-2 680960 w-stkendca18a-3 680960 w-stkendca19a-1 680960 w-stkendca19a-2 680960 w-stkendca19a-3 680960 w-stkendca20a-1 921600 w-stkendca20a-2 921600 w-stkendca20a-3 921600 FermigridVolPools v-stkendca16a-1 v-stkendca16a-2 v-stkendca16a-3 v-stkendca16a-4 v-stkendca16a-5 v-stkendca16a-6 KTeVReadPools r-stkendca12a-1 r-stkendca12a-2 r-stkendca12a-3 r-stkendca12a-4 r-stkendca13a-1 r-stkendca13a-2 r-stkendca13a-3 r-stkendca13a-4 r-stkendca14a-1 r-stkendca14a-2 r-stkendca14a-3 r-stkendca14a-4 r-stkendca15a-1 r-stkendca15a-2 r-stkendca15a-3 r-stkendca15a-4 MinosPrdReadPools 10.2 TB r-stkendca17a-4 680960 r-stkendca17a-5 906240 r-stkendca17a-6 906240 r-stkendca18a-4 680960 r-stkendca18a-5 906240 r-stkendca18a-6 906240 r-stkendca19a-4 680960 r-stkendca19a-5 906240 r-stkendca19a-6 906240 r-stkendca20a-4 906240 r-stkendca20a-5 906240 r-stkendca20a-6 906240 RawDataWritePools 3.5 TB w-stkendca7a-1 885760 w-stkendca7a-2 885760 w-stkendca8a-1 885760 w-stkendca8a-2 885760 readPools r-stkendca12a-5 r-stkendca12a-6 r-stkendca13a-5 r-stkendca13a-6 r-stkendca14a-5 r-stkendca14a-6 r-stkendca15a-5 r-stkendca15a-6 r-stkendca17a-4 r-stkendca17a-5 r-stkendca17a-6 r-stkendca20a-4 r-stkendca20a-5 r-stkendca20a-6 writePools w-stkendca10a-1 788480 w-stkendca10a-2 788480 w-stkendca10a-3 788480 w-stkendca10a-4 788480 w-stkendca10a-5 788480 w-stkendca10a-6 788480 w-stkendca11a-1 788480 w-stkendca11a-2 788480 w-stkendca11a-3 788480 w-stkendca11a-4 788480 w-stkendca11a-5 788480 w-stkendca11a-6 788480 w-stkendca17a-1 680960 w-stkendca17a-2 680960 w-stkendca17a-3 680960 w-stkendca20a-1 680960 w-stkendca20a-2 680960 w-stkendca20a-3 680960 w-stkendca9a-1 675840 w-stkendca9a-2 675840 w-stkendca9a-3 675840 w-stkendca9a-4 675840 w-stkendca9a-5 675840 w-stkendca9a-6 675840 ######## # FARM # ######## Failed to restart roundup concatenation in crontab. But /export/stage is full ! Cleanup : SRV1> df -h /export/stage Filesystem Size Used Avail Use% Mounted on /dev/sdb3 477G 451G 1.9G 100% /export/stage The 451 GB is mostly not minos : SRV1> du -sm /export/stage/minfarm du: `/export/stage/minfarm/.grid/backup': Permission denied 125245 /export/stage/minfarm But we can help for a while : SRV1> du -sm WRITE 94534 WRITE SRV1> ./roundup -w -r cedar near SRV1> du -sm WRITE 85525 WRITE SRV1> ./roundup -w -r cedar far SRV1> du -sm WRITE 80661 WRITE SRV1> ./roundup -w -r cedar mcnear SRV1> du -sm WRITE 1 WRITE Now to catch up SRV1> du -sm /grid/data/minos/*cat 14686 /grid/data/minos/farcat 1 /grid/data/minos/mccat 1 /grid/data/minos/mcfarcat 172917 /grid/data/minos/mcnearcat 45850 /grid/data/minos/nearcat SRV1> ./roundup -r cedar far SRV1> du -sm WRITE 14612 WRITE Writing rate is misearable ! Concatenation in 15 minutes, 11:13 thru 11:28 srmcp's in 98 minutes, 11:28 thru 13:10, per ls -ltr /pnfs/minos/reco_far/cedar/cand_data/2007-04 mgtr for fnpcsrv1 show sustained 120 Mbits/second but equal input and output rates. 
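Cross-checking that rate is just arithmetic on the file sizes and the wall clock ( a sketch ; it assumes the 2007-04 cand directory holds only this pass's output, otherwise the listing needs a date filter ) :
    # Sketch - effective srmcp rate for the 11:28 - 13:10 window, from file sizes
    DIR=/pnfs/minos/reco_far/cedar/cand_data/2007-04
    T0=`date -d '11:28' +%s`
    T1=`date -d '13:10' +%s`
    ls -l ${DIR} | awk -v sec=`expr ${T1} - ${T0}` \
        '{ byt += $5 } END { printf "%d MB in %d sec = %.1f MB/sec\n", byt/1048576, sec, byt/1048576/sec }'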
SRV1> ./roundup -r cedar far SRV1> du -sm WRITE 60362 WRITE mrtg shows mostly 20 mbit/second data rate, all 'in' ( to net ) very different than for far detector 13:48 thru Will have to get all this on tape, then roundup -w to purge, then split up the mcnear files : SRV1> ls /grid/data/minos/mcnearcat | grep n130110 | wc -l 1036 SRV1> ls /grid/data/minos/mcnearcat | grep n130111 | wc -l 961 ######## # FARM # ######## SRV1> DET=far SRV1> ./saddreco ${DET} ${REL} ${MON} declare \ > >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 ######## # FARM # ######## timm mentions 60 GB quota ? sum of /farm/stage01_minos 1 /farm/stage02_minos 4091 /farm/stage03_minos 1 /farm/minsoft 3347 /farm/minsoft2 52347 /farm/minsoft2/Minossoft 52330 32878 /farm/minsoft2/Minossoft/dbm Per rubin note, this is mostly ancient releases and root versions ============================================================================= 2007 04 26 kreymer on vacation ============================================================================= 2007 04 25 kreymer on vacation ============================================================================= 2007 04 24 ########### # ROUNDUP # ########### roundup.20070424 Corrected defects with cand/bcnd handling bcnd - set SOLO bcnd/cand - disable PEND ############ # MCIMPORT # ############ Arms reports originally truncated files under /pnfs/minos/mcin_data/far/daikon_00/L010185N f21011047_0000_L010185N_D00 f21011048_0000_L010185N_D00 f21011064_0000_L010185N_D00 f21011067_0000_L010185N_D00 f21011073_0000_L010185N_D00 f21011077_0000_L010185N_D00 f21011078_0000_L010185N_D00 f21011100_0000_L010185N_D00 f21311177_0000_L010185N_D00 f21311178_0000_L010185N_D00 for FI in $FIS ; do dds ${FPA}/${FI:5:3}/${FI}.reroot.root ; done -rw-r--r-- 1 kreymer e875 226206750 Mar 1 17:12 /pnfs/minos/mcin_data/far/daikon_00/L010185N/104/f21011047_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 227337073 Mar 1 17:13 /pnfs/minos/mcin_data/far/daikon_00/L010185N/104/f21011048_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 241601846 Mar 2 18:26 /pnfs/minos/mcin_data/far/daikon_00/L010185N/106/f21011064_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 227683157 Mar 2 18:21 /pnfs/minos/mcin_data/far/daikon_00/L010185N/106/f21011067_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 243768829 Mar 2 18:12 /pnfs/minos/mcin_data/far/daikon_00/L010185N/107/f21011073_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 246309331 Mar 2 18:22 /pnfs/minos/mcin_data/far/daikon_00/L010185N/107/f21011077_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 250816517 Mar 2 18:52 /pnfs/minos/mcin_data/far/daikon_00/L010185N/107/f21011078_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 288965539 Mar 6 11:28 /pnfs/minos/mcin_data/far/daikon_00/L010185N/110/f21011100_0000_L010185N_D00.reroot.root ls: /pnfs/minos/mcin_data/far/daikon_00/L010185N/117/f21311177_0000_L010185N_D00.reroot.root: No such file or directory ls: /pnfs/minos/mcin_data/far/daikon_00/L010185N/117/f21311178_0000_L010185N_D00.reroot.root: No such file or directory for FI in $FIS ; do mv ${FPA}/${FI:5:3}/${FI}.reroot.root /pnfs/minos/BAD/BAD_${FI}.reroot.root done mv: cannot stat `/pnfs/minos/mcin_data/far/daikon_00/L010185N/117/f21311177_0000_L010185N_D00.reroot.root': No such file or directory mv: cannot stat `/pnfs/minos/mcin_data/far/daikon_00/L010185N/117/f21311178_0000_L010185N_D00.reroot.root': No such file or directory ########## # DCACHE # ########## For tjyang calibration work, post shutdown cedar 
ntuples needed in DCache, ./stage -s sntp_data/2006 VOB733 Needed 62/ 177 FINISHED Tue Apr 24 15:14:06 CDT 2007 ./stage -s sntp_data/2006 VOB894 ./stage -s sntp_data/2006 VOB357 ./stage -s sntp_data/2006 VO5072 These tapes were already mounted, might as well get all the files. ============================================================================= 2007 04 23 ########## # DCACHE # ########## Minos production read pool group should come online tomorrow. Discussed overall scale with Vlad, may need more pools . First goal is stability of config, then adjust scale. ============================================================================= 2007 04 20 ######## # GRID # ######## Added /fermilab/minos Production to both my cert's ( grid, fermilab ) Will post the procedure to fermigrid-users Steve Timm noted that those who need to write to DCache under both /pnfs/fnal.gov/usr/ and /pnfs/fnal.gov/usr/fermigrid/volatile/ should have a Production role assigned under the Fermilab VO. Here is a specific procedure, obtained with some guidance from Dan Yucum. You *must* use the VOMRS interface: https://voms.fnal.gov:8443/vomrs/vo-fermilab/vomrs The update will be immediate to VOMS, but could take up to 6 hours to migrate to GUMS. Expand the left menu bar as : [-] fermilab Registration Home [-] Members . Re-sign Grid and VO AUPs [+] Certificates . Edit Personal Info . Change Email Address . Change Representative . Change Expiration Date . Set Authorization Status . Manage Groups & Group Roles Click on . Manage Groups & Group Roles You will get a search form. Find the cert's of interest ( I used my last name to select mine ) checking the Member DN and Roles boxes before the search. The report has columns including Group role , Status and Select. Under Select, check the Production box for the appropriate groups(s). Then click the [submit] box at the bottom left corner of the report. You will go to a new web page which announces : "You have successfully assigned member(s) to group/role!" A subsequent search of these certs will show the Production role with Status 'Approved' The owner of each cert will also get a confirming email. ############ # SADDRECO # ############ saddreco.20070420 Added ping of dbserver, like command line sam ping dbserver ----retryMaxCount=1 --retryJitter=0 SAMQ - get from ping , not SETUP_SAM_CONFIG ########### # ROUNDUP # ########### roundup.20070420 Added call to saddreco ########## # DCACHE # ########## Investigating reported bad file in dcache write pool SRV1> pwd /export/stage/minfarm/ROUNDUP/DUP SRV1> IFILE=n13011446_0000_L010185N_D00_nccoh.cand.cedar.root SRV1> IPATH=minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/cand_data/144 SRV1> DCPOR=24125 # unsecured SRV1> DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} SRV1> dccp ${DFILE} . 
1602194515 bytes in 61 seconds (25649.89 KB/sec) MINOS26 > ./dc_stat /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/cand_data/144/n13011446_0000_L010185N_D00_nccoh.cand.cedar.root ============================ PNFS status for /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/cand_data/144/n13011446_0000_L010185N_D00_nccoh.cand.cedar.root -rw-r--r-- 1 1334 e875 1602194515 Apr 13 12:18 n13011446_0000_L010185N_D00_nccoh.cand.cedar.root LEVEL 2 2,0,0,0.0,0.0 :c=1:262b1661;h=yes;l=1602194515; r-stkendca20a-6 w-stkendca10a-1 Looking in pool lists: 9526 n13011446_0000_L010185N_D00_nccoh.cand.cedar.root 000F00000000000005312848 1602194515 si={minos.reco_mc_near_cedar} 799 n13011446_0000_L010185N_D00_nccoh.cand.cedar.root 000F00000000000005312848 <-P---s----L(0)[0]> 1602194515 si={minos.reco_mc_near_cedar} This is the same file reported in the flood of emails Monday 16 April. I have removed it via rubin@fnpcsrv1. ####### # CVS # ####### minoscvs@cdcvs account was created 17 April Accesses by baisley and penny They are using an old cvsh v1_9, CDF needs to have v1_11_1 at least, Minos is using 1.9.1 ============================================================================= 2007 04 19 ########## # DCACHE # ########## DCache access was lost at 06:06 ( as seen by ND data logging ) Apr 19 05:54 N00012077_0028.mdaq.root DCache was shut down at 06:30 for planned but unannounced maintenance of PNFS. Helpdesk ticket 95868 11:06 - PNFS seems back per http://www-numi.fnal.gov/computing/dh/pnfslog/NOW.txt 13:38 - email that Dcache being started 13:49 - some but not all DCache services coming back 13:51 - beam data logging started to succeed, followed by fd ( none yet from ND ) 13:55 - have all but CopyManager RemoteGsiftpTransferManager RemoteHttpTransferManager SRM-stkendca2a 14:02 - Enstore/Dcache announced as being up 14:10 - The above 4 services were restarted ( 14:09:56 ) ND archiver stuck : QOL I Thu 19-04-2007 05:54:45 archiver 6372 131.225.192.132 1 112033 run 12077 Processing file N00012077_0028.mdaq.root QOL I Thu 19-04-2007 05:54:45 archiver 6372 131.225.192.132 1 112034 run 12077 Getting credentials QOL I Thu 19-04-2007 05:54:47 archiver 6372 131.225.192.132 1 112035 run 12077 Got credentials QOL I Thu 19-04-2007 05:54:47 archiver 6372 131.225.192.132 1 112036 run 12077 Trying ftp connect to disk cache QOL I Thu 19-04-2007 05:54:47 archiver 6372 131.225.192.132 1 112037 run 12077 Ftp connect succeeded 14:40 - nd archiver restarted filesize matched N00012077_0028.mdaq.root ls -l --> 143776495 Apr 19 14:40 N00012077_0029.mdaq.root ######## # FARM # ######## 15:02 nothing to do for near, SRV1> ./roundup -r cedar far 15:42 SRV1> ./roundup -r cedar mcnear Oops, there are lots of partial runs, being written anyway with gaps. Will have to come back later and force the gap fillers out. 
OK adding n13011433_0000_L010185N_D00_nccoh.sntp.cedar.root 3 OOPS - SUBRUN gap 4 to 6 OK adding n13011436_0000_L010185N_D00_nccoh.sntp.cedar.root 4 OOPS - SUBRUN gap 4 to 5 OK adding n13011437_0000_L010185N_D00_nccoh.sntp.cedar.root 4 OOPS - SUBRUN gap 9 to 9 OK adding n13011437_0006_L010185N_D00_nccoh.sntp.cedar.root 3 OK adding n13011437_0010_L010185N_D00_nccoh.sntp.cedar.root 1 OK adding n13011438_0001_L010185N_D00_nccoh.sntp.cedar.root 10 OOPS - SUBRUN gap 6 to 6 OK adding n13011439_0000_L010185N_D00_nccoh.sntp.cedar.root 6 OOPS - SUBRUN gap 8 to 9 OK adding n13011439_0007_L010185N_D00_nccoh.sntp.cedar.root 1 OOPS - SUBRUN gap 5 to 5 OK adding n13011440_0000_L010185N_D00_nccoh.sntp.cedar.root 5 OOPS - SUBRUN gap 9 to 9 OK adding n13011440_0006_L010185N_D00_nccoh.sntp.cedar.root 3 OOPS - SUBRUN gap 9 to 9 OK adding n13011456_0000_L010185N_D00_nccoh.sntp.cedar.root 9 OOPS - SUBRUN gap 6 to 6 OK adding n13011458_0000_L010185N_D00_nccoh.sntp.cedar.root 6 OK - stream L010185N_D00.sntp.cedar OK - 26037 Mbytes in 40 runs OOPS - SUBRUN gap 9 to 9 OK adding n13011624_0000_L010185N_D00.sntp.cedar.root 9 OOPS - SUBRUN gap 8 to 8 OK adding n13011631_0001_L010185N_D00.sntp.cedar.root 7 OOPS - SUBRUN gap 2 to 2 OK adding n13011647_0000_L010185N_D00.sntp.cedar.root 2 OOPS - SUBRUN gap 9 to 9 OK adding n13011650_0000_L010185N_D00.sntp.cedar.root 9 ############ # SADDRECO # ############ for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done ============================================================================= 2007 04 18 ########## # STATUS # ########## Requested access to CD System Status Minos web page, for kreymer buckley rhatcher urish Ticket 95815 2007 12 26 - assigned to Richard Thies ####### # AFS # ####### loiacono - requests AFS disk space for beam ntuples Reviewing farm usage : for DIR in $DIRS ; do echo ${DIR} fs listacl ${DIR} | grep -q minosrecodata && fs listquota ${DIR} done | grep 50000000 86 volumes ( 50 GB ) ==> 4.3 TBytes Later, repeated, have 90 volumes, 4.5 TBytes Looking at existing AFS ntuples cd d10/indexes wc -l *.index | sort -n ... 770 2006-10_far.R1_18_4.index 816 BAD_mc_far.daikon_00.cedar.index 1332 mc_far.carrot.cedar.index 1594 mc_far.daikon_00.cedar.index 1844 mc_far.carrot.R1_18_2.index 1984 mc_cosmic.bfld201.cedar.index 2024 mc_far.R1.14.index 2064 2005-04_far.R1_18.index 2289 mc_near.R1_18_2.index 9218 mc_near.daikon_00.cedar.index 10252 mc_near.carrot_06.cedar.index 10435 mc_near.carrot_06.R1_18_2.index 105697 total wc -l *.R1_18_2.index | sort -n 748 2005-12_far.R1_18_2.index 1844 mc_far.carrot.R1_18_2.index 2289 mc_near.R1_18_2.index 10435 mc_near.carrot_06.R1_18_2.index 30824 total ######## # FARM # ######## Scheduled FARM maintenance ( fnpcsrv1/2 ) started around 09:00 Announced to fermigrid-announce ( but not in advance ) No details of downtime duration or work to be done. Nothing on CD Status Page Up at 13:02 N.B. /fnal/ups is now bluearc served, not from fnpcsrv1 ########### # BLUEARC # ########### Spoke to Ray Pasetes x 5250 about long term possibilities no per-disk or volume license charges Hitachi at about $2K/TB managed by then, or external SATAbeasts etc. 
http://computing.fnal.gov/nasan/bluearc.html ########## # DCACHE # ########## Removed and rewrite old damaged file from 13 March : SRV1> ls -l /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112/f21011125_0000_L010185N_D00.sntp.cedar.root -rw-rw-r-- 1 bseilhan numi 0 Mar 13 15:14 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112/f21011125_0000_L010185N_D00.sntp.cedar.root SRV1> rm /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112/f21011125_0000_L010185N_D00.sntp.cedar.root SRV1> setup dcap MINOS26 > grep f21011125_0000_L010185N_D00.sntp.cedar.root ~/minos/CFL/CFL minos reco_mc_far_cedar VO4049 0000_000000000_0000239 CDMS117334037200000 61387201 1783755534 /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112/f21011125_0000_L010185N_D00.sntp.cedar.root EH ???? too late, this is not consistent with PNFS listing Copy the AFS copy anyway SRV1> AFSP=/afs/fnal.gov/files/data/minos/d10/recodata90 SRV1> ls -l ${AFSP}/${FILE} -rw-rw-r-- 1 bseilhan numi 64260919 Mar 10 09:44 /afs/fnal.gov/files/data/minos/d10/recodata90/f21011125_0000_L010185N_D00.sntp.cedar SRV1> DFSP=dcap://fndca1.fnal.gov:24725/pnfs/fnal.gov/usr/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112 SRV1> dccp ${AFSP}/${FILE} ${DFSP}/${FILE} 64260919 bytes in 3 seconds (20918.27 KB/sec) ######## # FARM # ######## srmcp has vanished SRV1> less /export/osg/grid/setup.sh SRV1> echo $PATH | tr : \\\n | grep srm /export/osg/grid/srmclient/bin /export/osg/grid/srmclient/sbin /export/osg/grid/srmclient/bin /export/osg/grid/srmclient/sbin But /export/osg/grid/srmclient is not present ! Submitted helpdesk ticket 95864 Forwarded note to minos-data, minos_batch, fermigrid-users SRV1> crontab crontab.noround Here is the ticket content : //////////////////////////////////////////////////////////////// Short Description: srmcp and other commands are missing Problem Description: Since today's maintenance, we seem to be missing the srmcp command et.al. We set up osg as always with source /export/osg/oldgrid/setup.csh The path to srmcp is being set : minfarm on fnpcsrv1% echo $PATH | tr : \\\n | grep srm /export/osg/grid/srmclient/bin /export/osg/grid/srmclient/sbin /export/osg/grid/srmclient/bin /export/osg/grid/srmclient/sbin But ths srmclient directory is missing. The old osg tree was moved today , changed to a symlink : minfarm on fnpcsrv1% ls -l /export/osg total 8 lrwxrwxrwx 1 root root 14 Apr 18 17:16 grid -> /usr/local/vdt/ drwxr-xr-x 31 root root 4096 Jan 17 11:15 oldgrid/ drwxr-xr-x 2 root root 4096 Oct 12 2005 scratch/ srmcp is still there under oldgrid Was the OSG software deliberately changed ? There was no announcement to this effect. //////////////////////////////////////////////////////////////// ########### # ROUNDUP # ########### ln -sf roundup.20070417 roundup this was trying to do the right thing for near data ============================================================================= 2007 04 17 ######## # FARM # ######## re-listed DFARM /minos/*, only cores has remaining files ROUNDUP - did catchup, to pick up several runs stuck since Friday, which has not yet emerged from the farm by the 08:00 cron run 11:09 SRV1> ./roundup -r cedar near ########### # BLUEARC # ########### Check with Ray Pasetes on specs/plans for Blue Arc - 5250 Should we use this as shared work space for Minos Cluster ? Should this supplement AFS/DCache ? Cost ? ########### # ROUNDUP # ########### Starting MC tests, with files presently in GDM/mcnearcat SRV1> ./roundup -n -r cedar mcnear ... 
./roundup -n -W -s n13011451_ -r cedar mcnear OK adding n13011451_0000_L010185N_D00_nccoh.sntp.cedar.root 11 Tue Apr 17 14:10:31 CDT 2007 ./roundup -n -w -s n13011451_ -r cedar mcnear srmcp file:///n13011451_0000_L010185N_D00_nccoh.sntp.cedar.root \ srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/145/n13011451_0000_L010185N_D00_nccoh.sntp.cedar.root OOPS, need to adjust path to include potential beam config suffix like ncc0h. Path should be mcout_data/cedar/near/daikon_00/L010185N_nccoh roundup only worked accidentally, as there are also n13011451_0000_L010185N_D00 run/subruns without the _nccoh. UGH... We have the same run/subrun being used twice, each time with different physics. Not so smart, but it is being done. Roundup needs to append to beam, if non-null, _ and cut -f 5 -d '_' up to . Modified roundup.20070417 accordingly Setting file families for directories : howie : SRV1> DIRS=`ls` SRV1> for DIR in $DIRS ; do ( cd ${DIR}/sntp_data ; enstore pnfs --tags | grep 'ly) =' ) ; done /pnfs/minos/mcout_data/cedar/near/daikon_00/L010000N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_charm/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L150200N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010170N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_lowi/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010200N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L250200N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N_medi/sntp_data .(tag)(file_family) = reco_mc_near_cedar /pnfs/minos/mcout_data/cedar/near/daikon_00/L100200N/sntp_data .(tag)(file_family) = reco_mc_near_cedar SRV1> for DIR in $DIRS ; do ( cd ${DIR}/${STR}_data ; pwd ; enstore pnfs --file_family reco_mc_near_cedar_${STR} ) ; done Same for STR=cand , STR=mrnt Do not bother with lower level RUN subdir's, good enough to pick up new ones SRV1> find . -type d -exec ls -ld {} \; SRV1> find . -type d -exec chmod 775 {} \; chmod: changing permissions of `./L010185N/cand_data/161': Operation not permitted chmod: changing permissions of `./L010185N/cand_data/162': Operation not permitted These were owned by bseilhan, perm's were OK already ============================================================================= 2007 04 16 ########## # DCACHE # ########## Email every 2 minutes or so regarding cand file error in DCache . 
This is being sent to non-minos addressees : Date: Mon, 16 Apr 2007 11:15:18 -0500 From: Enstore To: cdfdh_oper@fnal.gov, cmst1@fnal.gov, jen_a@fnal.gov, stoughto@fnal.gov} Subject: Alarm raised Mon Apr 16 11:15:18 CDT 2007 From: alarm_server ['1176740118.76', 1176740118.7641211, 'stkendca10a.fnal.gov', 13190, 'root', 'C (1)', 'ENCP', 'CRC DCACHE MISMATCH', None, None, None, {'r_a': (('131.225.13.94', 33575), 1L, '131.225.13.94-33575-1176739751.637307-13190'), 'text': {'status': 'CRC dcache mismatch: 640357985 (0x262B1661L) != 3635691144 (0xD8B43E88L)', 'outfile': '/pnfs/fnal.gov/usr/minos/mcout_data/cedar/near/daikon_00/L010185N_nccoh/cand_data/144/n13011446_0000_L010185N_D00_nccoh.cand.cedar.root', 'infile': '/diska/write-pool-1/data/000F00000000000005312848'}}] Rubin will remove the file. There seems to be no sntp_data file for this subrun. ######## # FARM # ######## The DFarm array failed last Friday 13 April Even /tmp seems to be locked agains writing on fnpcsrv1. No roundup has run since friday Warning : farm mc output is going to /grid/data/mcnearcat and /grid/data/nearcat Trying to log into fngp-osg as minfarm, getting stuck. ####### # DAQ # ####### Checked for updated kerberos for NODE in acnet beamdata om evd rc ; do ssh -l minos minos-${NODE} rpm -q krb5-workstation-fermi gateway-nd done krb5-workstation-fermi-1.8d-1.LTS4 krb5-workstation-fermi-1.8d-1.LTS4 krb5-workstation-fermi-1.8d-1.LTS4 krb5-workstation-fermi-1.8d-1.LTS4 krb5-workstation-fermi-1.8d-1.LTS4 krb5-workstation-fermi-1.8d-1.LTS4 ============================================================================= 2007 04 13 ############## # DBARCHIVES # ############## Scanned exp-db database backups with new script ./dbarchives /home/kreymer/COMPLETE_FILE_LISTING_exp-db.20070410 /weekly/.\*April 134 263 263155 weekly/cdf-offline/cdfofpr2/2007/04-April/08 373 393 393351 weekly/cdf-online/cdfonprd/2007/04-April/08 227 475 475634 weekly/d0-offline/d0oflump/2007/04-April/08 404 771 771552 weekly/d0-offline/d0ofprd1/2007/04-April/08 23 45 45459 weekly/d0-online/d0onlprd/2007/04-April/08 29 3 3736 weekly/minos-offline/minosprd/2007/04-April/08 FILES GB MB SET Noted drop, but not so much, in size of D0 after 26 March drop of event tables MIN > ./dbarchives /home/kreymer/COMPLETE_FILE_LISTING_exp-db.20070410 /weekly/d0-offline/d0ofprd ... 542 1038 1038083 weekly/d0-offline/d0ofprd1/2007/03-March/25 404 771 771552 weekly/d0-offline/d0ofprd1/2007/04-April/08 ########## # DCACHE # ########## 18/19 pools are starting tests in FNDCAT today, need to burn in for 1 week. Plan deployment 23 April. 
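For reference, the grouping the dbarchives scan above does amounts to something like this ( a sketch only ; the field positions assumed for size and path in the COMPLETE_FILE_LISTING are a guess, not checked against the real format, and the real script also reports MB ) :
    # Sketch of a dbarchives-style summary - file count and total GB per backup set
    # ASSUMPTION : size in field 5, full path in the last field of each listing line
    LIST=/home/kreymer/COMPLETE_FILE_LISTING_exp-db.20070410
    PAT='/weekly/.*April'
    grep "${PAT}" ${LIST} | awk '
        { siz = $5 ; set = $NF
          sub("/[^/]*$", "", set)               # strip the file name, keep the set directory
          nf[set]++ ; by[set] += siz }
        END { for (s in nf) printf "%6d %6d  %s\n", nf[s], by[s]/1e9, s }' | sort -k 3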
############ # MCIMPORT # ############ stagesum - notes on how to start plotting MC port data with gnuplot ( simple size vs date plot for now ) ============================================================================= 2007 04 12 ######## # FARM # ######## The final 24 nearcat files are copied to /grid/data/minos/nearcat ( first attempt failed with dfarm timeout ) SRV1> DIRS='cores farcat fardet li mc mccat mcfarcat mcnearcat mctest nearcat neardet test' SRV1> for DIR in ${DIRS} ; do printf "${DIR} " ; dfarm ls /minos/${DIR} | wc -l ; done cores 39 farcat 0 fardet 0 li 18 mc 0 mccat 0 mcfarcat 0 mcnearcat 0 mctest 19 nearcat 0 neardet 0 test 8 Ran roundup at 09:50 to pick up nearcat files PEND review PEND - have 1/19 subruns for N00012007_*.spill.sntp.cedar*.root 5 04/07 04:35 This subrun was in the BADRUN list Wed Apr 4 08:00:02 CDT 2007 ./roundup -S -s N00012007_0007 -f 0 -r cedar near Now some stale mrnt's PEND - have 5/7 subruns for N00008433_*.spill.mrnt.cedar*.root 5 04/06 17:44 Missing _0000 and _0001 PEND - have 18/19 subruns for N00008454_*.spill.mrnt.cedar*.root 5 04/06 16:43 Missing _0010 PEND - have 17/18 subruns for N00008612_*.spill.mrnt.cedar*.root 3 04/09 04:00 Missing _0010 PEND - have 4/12 subruns for N00008695_*.spill.mrnt.cedar*.root 3 04/09 07:55 Missing _0004 through _0011 ########### # ROUNDUP # ########### roundup.20070412 - corrected FILES list to select per type, SEL, REL this had been misplaced in .20070411 shift to 'find' corrected to check bad_runs_mrcc.${REL} for mrnt This cleared up all the stale mrnt files except N00008433 ############ # SADDRECO # ############ saddreco.20070412 Dropped parents from normal printout ########## # DCACHE # ########## Send email to dcache-admin re Minos Read pool deployment ============================================================================= 2007 04 11 ######## # FARM # ######## Howie reports 12 production and 21 mc jobs running which will write DFARM production are all N00012029 Moved roundup.20070411 to production, including 10 minute age requirement Ran on near, far round 10:00 Re-enabled cron job Ran saddreco 12:45 4 of the 12 nearcat files are in dfarm 13:56 8 files are there 17:35 14 of 12 files... let's let this keep running, we are waiting for 12 subruns, 24 files. ####### # AFS # ####### per buckley, requested 250 GG ( 5 x 50 ) additional user space acl's cloned from d01 ls -d $MINOS_DATA/d* | cut -c 33- | sort -n ... 227 228 229 request /afs/fnal.gov/files/data/minos/d230 through d234 ticket 95475 ####### # SAM # ####### Verified unknown volume location for 2 obsolete ND reco files N00010912_0003.cosmic.sntp.R1_18_4.0.root N00010912_0003.spill.sntp.R1_18_4.0.root ####### # AFS # ####### User volumes $MINOS_DATA/d230-d234 are created, by joes. ============================================================================= 2007 04 10 ######## # FARM # ######## rubin is ready to write to /grid/data There are some older mrnt runs I'd like to flush directly from DFARM, but that does not affect the switch to /grid/data for current running. The DFARM farcat area is up to date, containing only current files which I will copy to /grid/data/farcat. The DFARM nearcat area has some older runs with single missing mrnt subruns, which I presume are missing because of the problems reading the cand files. 
PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 39 03/01 17:20:46 PEND - have 23/24 subruns for N00009168_*.spill.mrnt.cedar*.root 11 03/29 17:07:07 PEND - have 22/24 subruns for N00009235_*.spill.mrnt.cedar*.root 10 03/30 09:19:40 We are missing : N00009165_0002.spill.mrnt.cedar.0.root N00009168_0018.spill.mrnt.cedar.0.root N00009235_0019.spill.mrnt.cedar.0.root N00009235_0022.spill.mrnt.cedar.0.root These were listed in the rubin 1 April email to minos_batch N00009162_0013.0 2005-11 47438 139 2007-03-01 14:12:27 fnpc161 N00009165_0002.0 2005-11 46767 134 2007-03-02 16:53:50 fnpc196 N00009168_0018.0 2005-11 47380 139 2007-03-29 17:14:24 fnpc143 N00009235_0019.0 2005-11 47502 134 2007-03-30 06:58:04 fnpc196 N00009235_0022.0 2005-11 47592 132 2007-03-30 07:56:51 fnpc197 So I am forcing them out. Updated roundup to recognize mrnt entries, and added them to no_spill.cedar : SRV1> cp AFSS/roundup.20070410 . SRV1> ln -sf roundup.20070410 roundup N00009165_0002.spill.mrnt.cedar.0.root 2007-04 N00009168_0018.spill.mrnt.cedar.0.root 2007-04 N00009235_0019.spill.mrnt.cedar.0.root 2007-04 N00009235_0022.spill.mrnt.cedar.0.root 2007-04 Ran one more roundup : SRV1> ./roundup -n -r cedar near Done, the oldest files in farcat now are from 04/06 ########### # ROUNDUP # ########### roundup.20070411 - take input to /grid/data/minos rustle - copy existing files from DFARM to /grid/data/minos ########## # RUSTLE # ########## SRV1> AFSS/rustle far moved 21 files, looks good. Still, make a safe copy : SRV1> cp -r /grid/data/minos/farcat /grid/data/minos/farcat_safe SRV1> diff -r /grid/data/minos/farcat /grid/data/minos/farcat_safe upcated rustle to touch the files with their DFARM date SRV1> AFSS/rustle near N.B> 2007 05 07 - removed all files from /grid/data/minos/farcat_safe these were subruns 0 through 6 of F00037871_0000, and are long since concatenated into F00037871_0000.all.sntp.cedar.0.root F00037871_0000.spill.bntp.cedar.0.root F00037871_0000.spill.sntp.cedar.0.root ============================================================================= 2007 04 09 ######## # FARM # ######## Doing cleanup after file removal of files with 0 copies in dfarm frwr- 0 rubin 29668085 04/06 16:48:28 N00008442_0000.spill.mrnt.cedar.0.root frwr- 0 rubin 29892938 04/06 16:53:18 N00008442_0001.spill.mrnt.cedar.0.root many extra status messages, due to leftover stuff in dfarm Moving ahead to roundup.20070409 ( was 20070401 ) with clean file selection, and NOSPILL suppression of mrnt based on sntp ######## # FARM # ######## Rubin reports that 4 files should be undeclared to sam, new files exist in dfarm /minos/neardet I've noted PNFS sizes and dates PNFS DFARM N00012007_0007.cosmic.cand.cedar.0.root 113745471 Apr 4 03:50 113749699 04/07 04:35:34 N00012013_0017.cosmic.cand.cedar.0.root 114483673 Apr 6 15:16 114481305 04/07 05:11:46 N00012010_0000.cosmic.cand.cedar.0.root 113633751 Apr 4 10:26 113634922 04/07 08:24:11 N00012010_0000.spill.cand.cedar.0.root 507054947 Apr 4 15:20 507043027 04/07 08:24:32 These are in /pnfs/minos/reco_near/cedar/cand_data/2007-04 I have undeclared these : MINOS26 > FILES='N00012007_0007.cosmic.cand.cedar.0.root N00012010_0000.cosmic.cand.cedar.0.root N00012010_0000.spill.cand.cedar.0.root' MINOS26 > for FILE in $FILES ; do sam locate ${FILE} ; done ['/pnfs/minos/reco_near/cedar/cand_data/2007-04,44@voc165'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-04,79@voc165'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-04,82@voc165'] MINOS26 > for FILE in $FILES ; do sam undeclare file 
${FILE} ; done ############ # MCIMPORT # ############ Per kordosky email, need to purge mistakenly imported files -rw-r--r-- 1 mindata e875 384K Apr 5 07:59 n11011003_0001_L010185N_D01.tar.gz -rw-r--r-- 1 mindata e875 146K Apr 5 08:20 n12011003_0001_L010185N_D01.tar.gz FILS='n11011003_0001_L010185N_D01.tar.gz n12011003_0001_L010185N_D01.tar.gz' grep ${FIL} kordosky/index/*.index Robert Hatcher moved the original to BAD. I have removed the relevant lines from kordosky/md5/all.md5 $ for FIL in ${FILS} ; do grep ${FIL} kordosky/md5/all.md5 ; done 0b23060c0ae02d21e465a4db8df39300 n11011003_0001_L010185N_D01.tar.gz f25ab8f958425c0b9ec4fa2e2820dc1f n12011003_0001_L010185N_D01.tar.gz $ cat all.md5 | grep -v n11011003_0001_L010185N_D01.tar.gz | grep -v n12011003_0001_L010185N_D01.tar.gz > all2.md5 $ diff all2.md5 all.md5 6550a6551,6552 > 0b23060c0ae02d21e465a4db8df39300 n11011003_0001_L010185N_D01.tar.gz > f25ab8f958425c0b9ec4fa2e2820dc1f n12011003_0001_L010185N_D01.tar.gz $ mv all2.md5 all.md5 ============================================================================= 2007 04 06 V A C A T I O N ============================================================================= 2007 04 05 ####### # CVS # ####### Met with rs, boyd, mengel, + , regarding possible plans to move CDF and/or Minos CVS to central CVS server, with BlueArc connected disks. Will give mengel access to servers ( cdf, zoom, minos ) for prototyping Did so for zoom, minos ########## # ORACLE # ########## Sun tech could not work on minosora3 ; does not have cables for the Opteron system. Ordering cables. ============================================================================= 2007 04 04 ####### # NET # ####### netdown announced downtime on Thur Apr 19, 06:00 to 06:45, s-s-fcc2-server3 Interesting hosts : crlweb2 docdb indico listserv mailgw1/2 numiserver fermi-helpdesk cdops linux1 crlweb mailgw imap1/2/3 ####### # CVS # ####### Total space used is 4.3 GBytes 3 GBytes - Contrib/raufer ( most of it in NikiSys/Attic, can be purged ) .6 GB - DatabaseTables .16 GB - Contrib/RecoCheck *.root files There are many more .root binary files checked into CVS, not to mention *.pdf , *.ps binary documents. Only one obvious MS doc, /WebDocs/reconstruction/MINOS Standard Reconstruction Package.doc MINOSCVS > find .
-name \*\ \*,v\* -exec ls -l {} \; 641769 Feb 23 02:20 ./WebDocs/reconstruction/MINOS Standard Reconstruction Package.doc,v 2467 Feb 23 02:20 ./WebDocs/reconstruction/standard reconstruction software.w2w,v 1496874 Mar 24 15:49 ./DetSim/doc/Simulation Presentation June 2003 collab.sxi,v 24898 Mar 24 15:49 ./EventDisplay/doc/snapshot13 .png,v 25611 Mar 24 15:49 ./EventDisplay/doc/snapshot14 .png,v 3624 Aug 1 2006 ./HWDB/images/left copy.png,v MINOSCVS > dds Attic/ total 3043384 drwxrwxr-x 2 minoscvs e875 4096 Jul 5 2006 ./ drwxrwxr-x 8 minoscvs e875 4096 Apr 4 09:50 ../ -r--r--r-- 1 minoscvs e875 628199184 Jul 5 2006 New_Systematics_0027.tar.gz,v -r--r--r-- 1 minoscvs e875 24635037 Jun 29 2006 PAN_le_mcfar18_2_SKZP_0.root,v -r--r--r-- 1 minoscvs e875 24378672 Jun 29 2006 PAN_le_mcfar18_2_modbyrs3_0.root,v -r--r--r-- 1 minoscvs e875 2796150 Jun 29 2006 Results_Macros_BeamMatrix.tar.gz,v -r--r--r-- 1 minoscvs e875 603125924 Jun 29 2006 Syst_rel_ndfits.tar.gz,v -r--r--r-- 1 minoscvs e875 601748800 Jun 29 2006 Syst_rel_skzp15p.tar.gz,v -r--r--r-- 1 minoscvs e875 608663238 Jun 30 2006 Syst_rel_v1.tar.gz,v -r--r--r-- 1 minoscvs e875 610860683 Jun 30 2006 Syst_rel_v2.tar.gz,v -r--r--r-- 1 minoscvs e875 531113 Jun 29 2006 far_spec_dataTR.root,v -r--r--r-- 1 minoscvs e875 542477 Jun 29 2006 far_spec_dataskzpTR.root,v -r--r--r-- 1 minoscvs e875 2943029 Jun 29 2006 results_comp_nskzp.tar.gz,v -r--r--r-- 1 minoscvs e875 3962436 Jun 29 2006 results_comp_skzp.tar.gz,v -r--r--r-- 1 minoscvs e875 916096 Jun 29 2006 results_comp_skzp15p.tar.gz,v MINOSCVS > pwd /cvs/minoscvs/rep1/minossoft/Contrib/raufer/NikiSys/Attic MINOSCVS > tar cf /local/scratch01/minoscvs/raufer-NikiSys-Attic.tar . MINOSCVS > du -sm . 2973 . MINOSCVS > du -sm /local/scratch01/minoscvs/raufer-NikiSys-Attic.tar 2972 /local/scratch01/minoscvs/raufer-NikiSys-Attic.tar MINOSCVS > tar tvf /local/scratch01/minoscvs/raufer-NikiSys-Attic.tar drwxrwxr-x minoscvs/e875 0 2006-07-05 15:22:27 ./ -r--r--r-- minoscvs/e875 24635037 2006-06-29 09:47:05 ./PAN_le_mcfar18_2_SKZP_0.root,v -r--r--r-- minoscvs/e875 24378672 2006-06-29 09:47:07 ./PAN_le_mcfar18_2_modbyrs3_0.root,v -r--r--r-- minoscvs/e875 2796150 2006-06-29 09:47:08 ./Results_Macros_BeamMatrix.tar.gz,v -r--r--r-- minoscvs/e875 2943029 2006-06-29 09:47:09 ./results_comp_nskzp.tar.gz,v -r--r--r-- minoscvs/e875 3962436 2006-06-29 09:47:11 ./results_comp_skzp.tar.gz,v -r--r--r-- minoscvs/e875 916096 2006-06-29 09:47:11 ./results_comp_skzp15p.tar.gz,v -r--r--r-- minoscvs/e875 531113 2006-06-29 12:02:24 ./far_spec_dataTR.root,v -r--r--r-- minoscvs/e875 542477 2006-06-29 12:02:24 ./far_spec_dataskzpTR.root,v -r--r--r-- minoscvs/e875 603125924 2006-06-29 12:31:45 ./Syst_rel_ndfits.tar.gz,v -r--r--r-- minoscvs/e875 601748800 2006-06-29 12:34:30 ./Syst_rel_skzp15p.tar.gz,v -r--r--r-- minoscvs/e875 608663238 2006-06-30 03:20:10 ./Syst_rel_v1.tar.gz,v -r--r--r-- minoscvs/e875 610860683 2006-06-30 03:24:51 ./Syst_rel_v2.tar.gz,v -r--r--r-- minoscvs/e875 628199184 2006-07-05 15:22:24 ./New_Systematics_0027.tar.gz,v MINOSCVS > cd .. MINOSCVS > rm -r Attic/ MINOSCVS > cd ../../.. MINOSCVS > du -sm . 1285 . 
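For next time, the Attic archive-then-prune above as one guarded sequence ( the size comparison is an extra safety step of mine, not part of what was actually run ) :
    # Sketch - tar the Attic to scratch, sanity-check the sizes, only then remove
    SRC=/cvs/minoscvs/rep1/minossoft/Contrib/raufer/NikiSys/Attic
    TAR=/local/scratch01/minoscvs/raufer-NikiSys-Attic.tar
    ( cd ${SRC} && tar cf ${TAR} . ) || exit 1
    SKB=`du -sk ${SRC} | cut -f 1`
    TKB=`du -sk ${TAR} | cut -f 1`
    DIF=`expr ${SKB} - ${TKB}` ; [ ${DIF} -lt 0 ] && DIF=`expr 0 - ${DIF}`
    if [ ${DIF} -lt 2048 ] && tar tf ${TAR} > /dev/null ; then
        rm -r ${SRC}
    else
        echo "OOPS - not removing ${SRC} : ${SKB} KB on disk vs ${TKB} KB in the tar"
    fi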
MINOSCVS > pwd /cvs/minoscvs/rep1/minossoft ####### # SAM # ####### Duplicate subruns in R1_18_2, per email from asousa Needed to obsolete spill/cosmic, cand/sntp/ntps for these for PASS in 0 1; do for SRUN in N00007821_0019 N00007821_0022 N00007751_0022 N00007759_0008 ; do for SPIL in spill cosmic ; do for STRM in cand sntp snts ; do #echo ${SRUN}.${SPIL}.${STRM}.R1_18_2.${PASS}.root >> /tmp/obsfiles sam locate ${SRUN}.${SPIL}.${STRM}.R1_18_2.${PASS}.root done ; done ; done ; done ./dropfiles /tmp/obsfiles ============================================================================= 2007 04 03 ######### # MYSQL # ######### per HOWTO.dbarchive Mysql> rm -r /data/archive/COPY/20070305 /data has 49 GB free offline real 68m54.075s 100m44.507s md5 real 21m35.787s 32m36.246s 40G gzip real 55m50.531s 80m46.291s 15G scp real 9m36.975s 14m3.862s BINLOGS real 2m59.620s 8m23.896s 8.8 GB free at minimum All tables in all databases were locked during the offline file copies. This is surprising ( even buggy ) behaviour. This is unfortunately documented in http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html Perhaps reinvestigate mysqlhotcopy, which failed a couple of years ago. Mysql> locate mysqlhotcopy /local/ups/prd/mysql/v4_1_11/Linux-2/bin/mysqlhotcopy Not sure this does a proper global table lock per database. Hard to interpret the python. Modified HOWTO.dbarchive to lock and flush tables by name. But am hitting an apparent command length limit in our 4.1.11 server http://bugs.mysql.com/bug.php?id=10119 ############ # MCIMPORT # ############ Some sjc far/mcin files have been imported, apparently correctly, to /pnfs/minos/mcin_data/far/daikon_00/L100200N et. al. ########## # DCACHE # ########## Per Alexander Podovs..., the new DCache pools are being configured into the test stand today. Should deploy to production early next week. Will follow up with Timur et.al. 
then ####### # SAM # ####### Duplicate subruns in R1_18_2, per email from asousa N00007821_0019.spill.sntp.R1_18_2.[0,1].root N00007821_0022.spill.sntp.R1_18_2.[0,1].root N00007751_0022.spill.sntp.R1_18_2.[0,1].root N00007759_0008.spill.sntp.R1_18_2.[0,1].root Need to obsolete spill/cosmic, cand/sntp/ntps for these for PASS in 0 1; do for SRUN in N00007821_0019 N00007821_0022 N00007751_0022 N00007759_0008 ; do for SPIL in spill cosmic ; do for STRM in cand sntp snts ; do #echo ${SRUN}.${SPIL}.${STRM}.R1_18_2.${PASS}.root >> /tmp/obsfiles sam locate ${SRUN}.${SPIL}.${STRM}.R1_18_2.${PASS}.root done ; done ; done ; done ============================================================================= 2007 04 02 ########### # MONTHLY # ########### CFL 4/2 DATASETS 4/2 PREDATOR 4/2 SADDRECO 4/2 VAULT 4/2 OK MYSQL 4/3 OK All tables in all databases locked during offline copy ####### # AFS # ####### On online systems should do fs getcellstatus -cell fnal.gov fs setcell -cell fnal.gov -nosuid fs getcellstatus -cell fnal.gov ####### # DAQ # ####### Investigating clone of /afs/.../ products to /data/minsoft ( hosted on minos-evd, rcp'd to other CR systems ) MINOS26 > du -sm minossoft 3875 minossoft ############ # SADDRECO # ############ REL=cedar MON=2007-03 Do a global verification for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} verify done for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done ############ # MCIMPORT # ############ mcimport.20070402 Adding support to /far/mcin for other than daikon_00 Autodest updated, add more layers of mkdir MINOS26 > cd /pnfs/minos/mcin_data/far MINOS26 > mkdir daikon_01 MINOS26 > cd daikon_01 MINOS26 > enstore pnfs --file_family "mcin_far_daikon_01" Into production before the 18:00 run, no files to process yet. ########## # ORACLE # ########## minosora3 has cpu warnings, needs diagnostics run Scheduled for Thur 5 April ============================================================================= 2007 03 30 ######### # DFARM # ######### Did catchup of far and near Strays : PEND - have 1/24 subruns for F00037230_*.all.sntp.cedar*.root 29 02/28 13:10:33 PEND - have 1/24 subruns for F00037230_*.spill.bntp.cedar*.root 29 02/28 13:11:06 OK duplicates, copied to ROUNTP/DUP PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 28 03/01 17:20:46 PEND - have 23/24 subruns for N00009168_*.spill.mrnt.cedar*.root 0 03/29 17:07:07 PEND crashed in mrnt, investigate NOSPILL N00011347_0017.spill.sntp.cedar.0.root PEND - have 2/17 subruns for N00011347_*.spill.sntp.cedar*.root 7 03/22 23:33:40 PEND - have 1/24 subruns for N00011568_*.spill.sntp.cedar*.root 7 03/22 23:49:47 OK recovered subruns, forced to output Details as follows : FAR - These are duplicates of _0017 produced as a side effect of replacing F00037230_0017.all.sntp.cedar.0.root I have set them aside in /home/minfarm/ROUNTMP/DUP SRV1> dfarm ls /minos/farcat/F00037230* frwrw 1 rubin 23001056 02/28 13:10:33 /minos/farcat/F00037230_0017.all.sntp.cedar.0.root frwrw 1 rubin 3185344 02/28 13:11:06 /minos/farcat/F00037230_0017.spill.bntp.cedar.0.root SRV1> dfarm get /minos/farcat/F00037230_0017.all.sntp.cedar.0.root . SRV1> dfarm get /minos/farcat/F00037230_0017.spill.bntp.cedar.0.root . 
SRV1> date -d '02/28 13:10:33' +%Y%m%d%H%M.%S 200702281310.33 SRV1> touch -t 200702281310.33 F00037230_0017.all.sntp.cedar.0.root SRV1> date -d '02/28 13:11:06' +%Y%m%d%H%M.%S 200702281311.06 SRV1> touch -t 200702281311.06 F00037230_0017.spill.bntp.cedar.0.root SRV1> dfarm rm /minos/farcat/F00037230* NEAR - -N00009165_0002 - 2005-11 crashed 3 times in mrnt processing, waiting -N00009168_0018 - 2005-11 missing, probably like 9165_0002 +N00011347_0004 - 2006-12 +N00011347_0014 - 2006-12 03/22 23:33:40 /minos/nearcat/N00011347_0004.spill.sntp.cedar.0.root 03/22 23:22:12 /minos/nearcat/N00011347_0014.spill.sntp.cedar.0.root Reprocessed, flush SOLO ./roundup -s N00011347 -S -f 0 -r cedar near +N00011568_0008 - 2007-01 Recovered missing subrun, flush ./roundup -s N00011568 -S -f 0 -r cedar near ############ # SADDRECO # ############ PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=IOR:010000002a00000049444c3a6f6f632e636f6d2f436f734e616d696e672f4f424e616d696e67436f6e746578743a312e30000000010000000000000030000000010100b7150000006d696e6f732d73616d30312e666e616c2e676f7600d132230c0000004e616d655365727669636500 export SETUP_SAM_CONFIG='sam_config v4_2_34 -f NULL -z /afs/fnal.gov/files/code/e875/general/ups/db -q prd -r /afs/fnal.gov/files/code/e875/general/ups/prd/sam_config/v4_2_34/NULL -m sam_config_prd.table -M /afs/fnal.gov/files/code/e875/general/ups/db/sam_config/v4_2_34' REL=cedar MON=2007-03 Do a global verification for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} verify done for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done Sent note to admarino regarding /numi_target/mars/real_target, about 158 files, 30 GB, Mars input and .hbook files from 11/21 (year ? ) admarino advises these should be in AFS. timm has removed them ####### # DAQ # ####### AFS usage in control room : Using /usr/sbin/lsof /afs minos-beamdata rotorooter python beam_data_files_monitor.py /home/minos/share/start_bd_files_monitor - explicitly sets up the AFS products minos-evd loon /data/minsoft/mcr/ControlRoomSoftware/bin/mcrrun -> /data/minsoft/mcr/ControlRoomSoftware/mcrrc - sets up AFS products minos-om HistoDisplayMain loon /data/minsoft/mcr/ControlRoomSoftware/bin/mcrrun -> /data/minsoft/mcr/ControlRoomSoftware/mcrrc - sets up AFS products minos-rc rcGui /data/minsoft/mcr/ControlRoomSoftware/bin/mcrrun -> /data/minsoft/mcr/ControlRoomSoftware/mcrrc - sets up AFS products minos- ######## # FARM # ######## Re-enabled roundup in crontab crontab crontab.dat ########### # ROUNDUP # ########### roundup.20070401 - /grid/data version Adding + character to message for NOSPILL, BADRUN, SUPPRESSED files which exist. Writing such files to output. Plan to rustle DFARM files into /grid/data, then reread from there. Reading from /grid/data will require completeness test, in case files are actively being written - aged 1000 seconds perhaps. ============================================================================= 2007 03 29 ####### # AFS # ####### Outage was 06:04 through 06:25 ############ # PRODUCTS # ############ 05:11 cd /afs/fnal.gov/files/code/e875/general mv ups OLDups ln -s products ups MINOS26 > time diff -r OLDups products real 7m57.830s user 0m6.680s sys 0m41.050s Tested minossoft setup, sam, dbu AOK. 
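The ups -> products switch above, written with an explicit rollback branch in case the diff had shown differences ( a sketch ; the rollback was not needed and was not run ) :
    # Sketch - switch the AFS ups area to a symlink, keeping a clean way back
    cd /afs/fnal.gov/files/code/e875/general || exit 1
    mv ups OLDups
    ln -s products ups
    if diff -r OLDups products > /tmp/ups-products.diff 2>&1 ; then
        echo "OK - OLDups and products are identical, symlink stands"
    else
        echo "OOPS - trees differ, rolling back ; see /tmp/ups-products.diff"
        rm ups              # removes only the symlink
        mv OLDups ups
    fi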
########### # ENSTORE # ########### VOC167, with many stage/kordosky files, has been NOACCESS since yesterday Date: Wed, 28 Mar 2007 14:57:11 -0500 From: George Szmuksta Yes, There were a number of transfers today that were successful with this tape. The last one at 12:39 today had an error of "READ_VOL1_WRONG". Implying the internal volser of the tape is wrong. I am trying to check the volser now with an automated tool. So I am waiting my turn for a drive. MINOS26 > ./stage -d -p 0 VOC167 > /var/tmp/kreymer/VOC167.stage Needed 60/ 128 MINOS26 > ./stage -d -p 0 -v VOC167 > /var/tmp/kreymer/VOC167.stage check only writePools group MINOS26 > ./stage -d -p 0 -g readPools -v VOC167 > /var/tmp/kreymer/VOC167.stage Needed 68/ 128 Touched all the files presently on caches, 68/128. 16:05 - the volume is available again. ./stage VOC167 staged cleanly to disk ####### # LSF # ####### Ticket 94831 Short Description: LSF license failure this morning Problem Description: At aroung 10:00 this morning, from several hosts minos01, minos26, flxi04 I was unable to access LSF, due to a lack of a license. For example FLXI04 > setup lsf FLXI04 > bjobs Host does not have a software license This seems to have cleared up as of about 10:05. I have also seen slowness in email, is there a general network problem ? This could affect access to the license servers. Tested around 11:30, bqueues is now working on flxi02 through 6 minos01 through 26 for NODE in $NODES ; do printf "${NODE} " ssh ${NODE} '. /afs/fnal.gov/ups/etc/setups.sh ; setup lsf ; bqueues' ; done ####### # AFS # ####### ticket 94858 brebel requests more NC AFS space, 2 x 50GB Template ? d147 d187 d204 d211 Requested d228 and d229, acls' similar to the above, minos rl system:administrators rlidwka system:anyuser rl minos:admin rlidwka brebel rlidwka Making a minos:admin group modeled on buckley:admin pts creategroup -name kreymer:admin group kreymer:admin has id -1919 pts adduser -user kreymer -group kreymer:admin pts membership kreymer:admin pts examine kreymer:admin pts examine kreymer:admin Name: kreymer:admin, id: -1919, owner: kreymer, creator: kreymer, membership: 1, flags: S-M--, group quota: 0. pts setfields kreymer:admin -access SOMar pts examine kreymer:admin Name: kreymer:admin, id: -1919, owner: kreymer, creator: kreymer, membership: 1, flags: SOMar, group quota: 0. pts chown kreymer:admin minos for GUSER in buckley gmieg rhatcher urish ; do pts adduser -user ${GUSER} -group minos:admin ; done pts adduser -user urheim -group minos:admin ============================================================================= 2007 03 28 ####### # AFS # ####### Schedule shutdown during PNFS outage MINOS26 > echo 'crontab -r' | at 03:30 job 19 at 2007-03-29 03:30 ( the minos01 crontab does not have entries for Thursday ) DCache will go down at 05:00 ######### # DFARM # ######### far - did catchup, only stray is 3001056 02/28 13:10:33 F00037230_0017.all.sntp.cedar.0.root 3185344 02/28 13:11:06 F00037230_0017.spill.bntp.cedar.0.root near - Failed last night, OK adding N00011971_0000.cosmic.sntp.cedar.0.root 24 Transfer initiation timeout OOPS - failed to dfarm get /minos/nearcat/N00011971_0006.cosmic.sntp.cedar.0.root BAILING Tue Mar 27 22:42:36 CDT 2007 Same feilure this morning. cd /export/stage/minfarm/ROUNDUP_TEST/ DFN=`dfarm ls /minos/nearcat | tr -s ' ' | cut -f 7 -d ' '` for DF in ${DFN} ; do date ; dfarm get /minos/nearcat/${DF} ${DF} ; ls ${DF} ; rm ${DF} ; done ... 
Wed Mar 28 09:50:32 CDT 2007 Transfer initiation timeout ls: N00011971_0006.cosmic.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011971_0006.cosmic.sntp.cedar.0.root': No such file or directory Wed Mar 28 09:55:32 CDT 2007 ... Wed Mar 28 09:57:22 CDT 2007 Transfer initiation timeout ls: N00011977_0002.cosmic.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011977_0002.cosmic.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:02:22 CDT 2007 Transfer initiation timeout ls: N00011977_0002.spill.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011977_0002.spill.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:07:22 CDT 2007 Transfer initiation timeout ls: N00011981_0000.cosmic.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011981_0000.cosmic.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:12:22 CDT 2007 Transfer initiation timeout ls: N00011981_0000.spill.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011981_0000.spill.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:17:23 CDT 2007 Transfer initiation timeout ls: N00011981_0001.cosmic.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011981_0001.cosmic.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:22:23 CDT 2007 Transfer initiation timeout ls: N00011981_0001.spill.sntp.cedar.0.root: No such file or directory rm: cannot remove `N00011981_0001.spill.sntp.cedar.0.root': No such file or directory Wed Mar 28 10:27:23 CDT 2007 SRV1> dfarm ls /minos/nearcat | grep ' 0 ' frwr- 0 rubin 29235685 03/27 16:48:54 N00011971_0006.cosmic.sntp.cedar.0.root frwr- 0 rubin 4788055 03/27 15:17:16 N00011977_0002.cosmic.sntp.cedar.0.root frwr- 0 rubin 9743078 03/27 15:27:25 N00011977_0002.spill.sntp.cedar.0.root frwr- 0 rubin 29466218 03/27 16:42:23 N00011981_0000.cosmic.sntp.cedar.0.root frwr- 0 rubin 71763200 03/27 16:47:46 N00011981_0000.spill.sntp.cedar.0.root frwr- 0 rubin 29660049 03/27 15:02:26 N00011981_0001.cosmic.sntp.cedar.0.root frwr- 0 rubin 32033773 03/27 15:07:42 N00011981_0001.spill.sntp.cedar.0.root Writing what I can : SRV1> ./roundup -w -r cedar near And keeping below the bad files, SRV1> ./roundup -s N0001196 -r cedar near The above seven files are being reprocessed. ############ # PRODUCTS # ############ /afs/fnal.gov/files/code/e875/general MINOS26 > date ; time diff -r ups products Wed Mar 28 08:32:15 CDT 2007 Only in products/prd/fnorb/v1_1b_8/Linux-2-4: Fnorb real 7m50.582s user 0m6.560s sys 0m40.830s MINOS26 > rm products/prd/fnorb/v1_1b_8/Linux-2-4/Fnorb ============================================================================= 2007 03 27 ######## # FARM # ######## timm : massive errors on fnpcsrv1 external disk arrays, down for reboot at 08:30 20:25 dfarm is healthy and announced to users was allowed to run for Howie earlier, while still rebuilding. ############ # PRODUCTS # ############ Back in 2006 07 21, many old SAM products were disabled. Time to finally remove them : cd /afs/fnal.gov/files/code/e875/general/ups for DIR in db prd ; do for PRD in corba_common corba_util sam_idl_cpplib sam_lib sam_mis_cpplib sam_client_cpplib ; do du -sm ${DIR}/DISABLED${PRD} ; done ; done MINOS26 > fs listquota . Volume Name Quota Used %Used Partition code.e875.general 8000000 7749808 97%<< 49% < fs listquota . 
Volume Name Quota Used %Used Partition code.e875.general 8000000 7176814 90% 49% Also clear out empty sam product directories MINOS26 > SAMS=`ls prd/sam` MINOS26 > for SAM in $SAMS ; do [ -z `ls prd/sam/${SAM}` ] && du -sk prd/sam/${SAM} ; done MINOS26 > for SAM in $SAMS ; do [ -z `ls prd/sam/${SAM}` ] && rmdir prd/sam/${SAM} ; done Requested new disk, backed up, ticket 94695 /afs/fnal.gov/files/code/e875/general/products Created around 16:00 cd /afs/fnal.gov/files/code/e875/general cp -vax ups/catman products for DIR in db etc man prd ; do echo $DIR ; cp -ax ups/${DIR} products/ ; done Done by around 16:20 MIN > fs listquota /afs/fnal.gov/files/code/e875/general/products Volume Name Quota Used %Used Partition c.e875.d1 8000000 2111911 26% 52% MINOS26 > date ; time diff -r ups products Tue Mar 27 16:27:44 CDT 2007 diff: ups/prd/fnorb/v1_1b_8/Linux-2-4/Fnorb: recursive directory loop diff: ups/prd/java/v1.5.0/Linux-2/ups/..tar: No such file or directory diff: products/prd/java/v1.5.0/Linux-2/ups/..tar: No such file or directory diff: ups/prd/misweb/v2_23_5/NULL/www/tmp: No such file or directory diff: products/prd/misweb/v2_23_5/NULL/www/tmp: No such file or directory diff: ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2: Too many levels of symbolic links diff: products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2: Too many levels of symbolic links diff: ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2: Too many levels of symbolic links diff: products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2: Too many levels of symbolic links real 8m6.765s user 0m6.800s sys 0m41.280s Cleaning these out, for future sanity : First verify they should go : DUDS=' ups/prd/fnorb/v1_1b_8/Linux-2-4/Fnorb ups/prd/java/v1.5.0/Linux-2/ups/..tar products/prd/java/v1.5.0/Linux-2/ups/..tar ups/prd/misweb/v2_23_5/NULL/www/tmp products/prd/misweb/v2_23_5/NULL/www/tmp ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2 products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2 ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2 products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2 ' for DUD in ${DUDS} ; do ls -l ${DUD} ; done MINOS26 > for DUD in ${DUDS} ; do ls -l ${DUD} ; done lrwxr-xr-x 1 buckley e875 1 May 27 2004 ups/prd/fnorb/v1_1b_8/Linux-2-4/Fnorb -> . 
lrwxr-xr-x 1 kreymer g020 61 Feb 20 08:17 ups/prd/java/v1.5.0/Linux-2/ups/..tar -> /ftp/products/java/v1.5.0/Linux+2/java_v1.5.0_Linux+2.ups.tar lrwxr-xr-x 1 kreymer g020 61 Mar 27 16:16 products/prd/java/v1.5.0/Linux-2/ups/..tar -> /ftp/products/java/v1.5.0/Linux+2/java_v1.5.0_Linux+2.ups.tar lrwxr-xr-x 1 buckley e875 32 May 28 2004 ups/prd/misweb/v2_23_5/NULL/www/tmp -> /fnal/ups/db/misweb/Symlinks/tmp lrwxr-xr-x 1 kreymer g020 32 Mar 27 16:12 products/prd/misweb/v2_23_5/NULL/www/tmp -> /fnal/ups/db/misweb/Symlinks/tmp lrwxr-xr-x 1 buckley e875 11 May 27 2004 ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2 -> plat-linux2 lrwxr-xr-x 1 kreymer g020 11 Mar 27 16:08 products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat/plat-linux2 -> plat-linux2 lrwxr-xr-x 1 buckley e875 11 May 27 2004 ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2 -> plat-linux2 lrwxr-xr-x 1 kreymer g020 11 Mar 27 16:08 products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2 -> plat-linux2 for DUD in ${DUDS} ; do rm ${DUD} ; done rm: cannot lstat `ups/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2': No such file or directory rm: cannot lstat `products/prd/python/v2_1/Linux-2-4/lib/python2.1/plat-linux2/plat-linux2': No such file or directory ############ # MCIMPORT # ############ Per arms email, removing kordosky runs n14* 1001-1010 charm files These files were all contained in two big tarfiles, under /pnfs/minos/stage/kordosky . n14011003_0000_L010185N_D00_charm-n14011009_0006_L010185N_D00_charm.tar n14011009_0007_L010185N_D00_charm-n14011010_0010_L010185N_D00_charm.tar $ mv n14011003_0000_L010185N_D00_charm-n14011009_0006_L010185N_D00_charm.index ../BAD/ $ mv n14011009_0007_L010185N_D00_charm-n14011010_0010_L010185N_D00_charm.index ../BAD/ MINOS26 > cd /pnfs/minos/stage/kordosky MINOS26 > rm n14011003_0000_L010185N_D00_charm-n14011009_0006_L010185N_D00_charm.tar MINOS26 > rm n14011009_0007_L010185N_D00_charm-n14011010_0010_L010185N_D00_charm.tar ########### # ROUNDUP # ########### per discussion, should pass SUPPRESS'd files, as with NO_SPILL. ###### # DB # ###### mmihalek and jason have set up minosora1 and minosora3 Ganglia monitoring, http://rexganglia2.fnal.gov/minos/?c=MINOS%20DB Updated links in /afs/fnal.gov/files/expwww/numi/html/computing/dh dhmain.html dhleft.html ####### # SAM # ####### dev/int dbs stuck, CPU bound even with sqlplus problem with oracle_client v10_2_0_1 others are ok v8_1_7a v10_1_0_3_0 Oracle client - hangs up CPU bound on minos-sam02 because system has been up too long, known problem in 10.2 Oracle Client MINOS-SAM02 > upd install -j oracle_instant_client v10_2_0_3 informational: installed oracle_instant_client v10_2_0_3. upd install succeeded. MINOS-SAM02 > ups copy -G "oracle_client v10_2_0_3" oracle_instant_client v10_2_0_3 MINOS-SAM02 > ups declare oracle_client v10_2_0_3 -f "Linux+2" -q "" -r "oracle_instant_client/v10_2_0_3/Linux+2" -z "/home/sam/products/upsdb" -U "ups" -m "oracle_instant_client.table" MINOS-SAM02 > ups declare -c oracle_client v10_2_0_3 trace : missing libclntsh.so.8.0 MINOS-SAM02 > ln -s libclntsh.so /home/sam/products/oracle_client/v10_1_0_3_0/Linux+2/lib/libclntsh.so.8.0 oops, wrong product, MINOS-SAM02 > ln -s libclntsh.so /home/sam/products/oracle_instant_client/v10_2_0_3/Linux+2/libclntsh.so.8.0 DBservers are restarted, seem to be running in dev/int ############ # SADDRECO # ############ rm'd saddreco ( symlink ) in scripts, this runs on fnpcsrv1 now. 
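Leftover from the oracle_client fix above : the first libclntsh.so.8.0 link landed in the wrong product area, and could be tidied up along these lines ( a sketch ; this cleanup is not recorded as having been run ) :
    # Sketch - remove the mistaken link in v10_1_0_3_0 and confirm the intended one
    BAD=/home/sam/products/oracle_client/v10_1_0_3_0/Linux+2/lib/libclntsh.so.8.0
    NEW=/home/sam/products/oracle_instant_client/v10_2_0_3/Linux+2/libclntsh.so.8.0
    [ -L ${BAD} ] && rm ${BAD}
    if [ -e ${NEW} ] ; then
        ls -l ${NEW}        # -e follows the link, so this also proves the target exists
    else
        echo "OOPS - ${NEW} missing or dangling"
    fi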
######### # DFARM # ######### Cleaning up FD files already concatenated and written, had been retained due to stale ROUNDTMP/DFARM files. F00037221 F00037233 F00037776 Verified that these were re-processed without clearing existing ROUNTMP/DCACHE files, so dfarm files were not purged. Concatenated files are written to DCache and purged from WRITE. So removed these from DFARM manually. dfarm rm /minos/farcat/F00037221* dfarm rm /minos/farcat/F00037233* dfarm rm /minos/farcat/F00037776* First, get data logged with ./roundup -r cedar far This leaves few stray files 37230 - all.sntp 24 subruns added 01 18 spill.bntp 24 subruns added 01 30 spill.sntp 24 subruns added 03 01 with dup all.sntp, spill.bntp _0017 dated 02/28 solution : check with howie, then remove 37801_23 3 files dated 03/23, this was incorrectly suppressed. solution : force this subrun 23 out. ./roundup -s F00037801 -r cedar far ============================================================================= 2007 03 26 ########### # ROUNDUP # ########### Deal with existing bntp files listed in no_spill because of empty sntp NOSPILL F00032719_0000.spill.bntp.cedar.0.root PEND - have 1/1 subruns for F00032719_*.spill.bntp.cedar*.root 1 03/21 23:10:28 NOSPILL F00033011_0000.spill.bntp.cedar.0.root NOSPILL F00033011_0001.spill.bntp.cedar.0.root PEND - have 1/0 subruns for F00033011_*.spill.bntp.cedar*.root 1 03/21 23:10:10 AFSS/roundup.20070326 -n -W -s F00032719 -r cedar far F00032719_*.spill.bntp.cedar*.root raw 0/1 dfarm 0 no_spill 0 where is 1 ? F00033011_*.spill.bntp.cedar*.root raw 0/1 dfarm 0 no_spill 0/1 Working on (il)logic of event selection, it seems to help to use a group command, { ... ; } This lets me use the same NOSPILL terms for printing and selecting, negated for the latter. You must have white space surrounding the { and } characters Pending DFARM files and issues : NEAR PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 24 03/01 17:20:46 NOSPILL N00011347_0017.spill.sntp.cedar.0.root PEND - have 2/17 subruns for N00011347_*.spill.sntp.cedar*.root 3 03/22 23:33:40 PEND - have 1/24 subruns for N00011568_*.spill.sntp.cedar*.root 3 03/22 23:49:47 FAR SUPPRESS F00037233_0024.all.sntp.cedar.0.root These messages still need cleanup SUPPRESS F00037801_0023.all.sntp.cedar.0.root PEND - have 1/23 subruns for F00037801_*.all.sntp.cedar*.root 2 03/23 23:36:32 ... SUPPRESS F00037804_0017.all.sntp.cedar.0.root SUPPRESS F00037804_0018.all.sntp.cedar.0.root PEND - have 24/5 subruns for F00037804_*.all.sntp.cedar*.root 2 03/23 23:37:34 PEND - have 1/2 subruns for F00032719_*.spill.bntp.cedar*.root 4 03/21 23:10:28 NOSPILL F00033011_0001.spill.bntp.cedar.0.root OK adding F00033011_0000.spill.bntp.cedar.0.root 1 SUPPRESS F00037801_0023.spill.bntp.cedar.0.root PEND - have 1/23 subruns for F00037801_*.spill.bntp.cedar*.root 2 03/23 23:37:05 ... SUPPRESS F00037804_0018.spill.bntp.cedar.0.root PEND - have 24/5 subruns for F00037804_*.spill.bntp.cedar*.root 2 03/23 23:38:05 SUPPRESS F00037801_0023.spill.sntp.cedar.0.root PEND - have 1/23 subruns for F00037801_*.spill.sntp.cedar*.root 2 03/23 23:36:47 ... SUPPRESS F00037804_0018.spill.sntp.cedar.0.root PEND - have 24/5 subruns for F00037804_*.spill.sntp.cedar*.root 2 03/23 23:37:49 ============================================================================= 2007 03 23 ############ # SADDRECO # ############ C. Test and read the READ/${FILE} parent list Done D. Done. 
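Aside on the group-command note above (2007 03 26 ROUNDUP) - a minimal illustration, with a
hypothetical file and the no_spill.cedar list standing in for the real roundup logic. The point
is that the same test serves for printing and, negated, for selecting, and that the braces need
surrounding whitespace and a trailing ';' :

FILE=F00033011_0000.spill.bntp.cedar.0.root

# report the rejected case
grep -q "${FILE}" no_spill.cedar &&
    { echo "NOSPILL ${FILE}" ; }

# same test, negated, to select the file
grep -q "${FILE}" no_spill.cedar ||
    { echo "OK adding ${FILE}" ; SELECTED="${SELECTED} ${FILE}" ; }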
Skipped first READ parent use final time, event range increment event count append parents Operations : MV to /SAM what to do with the READ lists ? NO move these to minos26 ? YES move when used/declared ? DUH if run on fnpcsrv1, how to monitor activity ? Move used READ/ files to READ/SAM/ Log to ${HOME}/ROUNTP/LOG/${MONTH}/declare_${DET}_${REL}.log SRV1> cp -a AFSS/saddreco.20070322 saddreco REL=cedar MON=2007-01 Do a global verification for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} verify \ >> ${HOME}/ROUNTMP/LOG/${MON}/verify_${DET}_${REL}.log 2>&1 done Declare one event per stream for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare 1 \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done Some .bntp locations are missing export SAM_ORACLE_CONNECT="samdbs/pass_word@minosprd" ./reloc -y 2007 -s dev cedar Hmmmm, better deal with mrnt_data before we declare them. Can clean out unused snts, .bnts directories. MINOS26 > samadmin add datatier --name=mrnt-far --description="Muon removed ntuple - far" New dataTierId = 136 MINOS26 > samadmin add datatier --name=mrnt-near --description="Muon removed ntuple - near" New dataTierId = 137 Fix the unlocated file : MINOS26 > IFILE=F00037185_0000.spill.bntp.cedar.0.root MINOS26 > ITAPE=vob719.10600 MINOS26 > SAMLOC="${IPATH}(${ITAPE})" MINOS26 > sam add location --file=${IFILE} --loc=${SAMLOC} Try single file addition again Looks good, SRV1> FILES='N00011452_0013.spill.cand.cedar.0.root N00011598_0000.cosmic.sntp.cedar.0.root N00011452_0013.cosmic.cand.cedar.0.root N00011598_0000.spill.sntp.cedar.0.root F00037198_0000.spill.cand.cedar.0.root F00037185_0000.spill.sntp.cedar.0.root F00037185_0000.spill.bntp.cedar.0.root F00037198_0000.all.cand.cedar.0.root F00037185_0000.all.sntp.cedar.0.root F00037204_0000.spill.bntp.cedar.0.root F00037246_0011.spill.bcnd.cedar.0.root' SRV1> for FILE in $FILES ; do sam locate ${FILE} ; done ['/pnfs/minos/reco_near/cedar/cand_data/2007-01,89@vo9947'] ['/pnfs/minos/reco_near/cedar/sntp_data/2007-01,2375@vo5072'] ['/pnfs/minos/reco_near/cedar/cand_data/2007-01,108@vo9947'] ['/pnfs/minos/reco_near/cedar/sntp_data/2007-01,2376@vo5072'] ['/pnfs/minos/reco_far/cedar/cand_data/2007-01,616@vo7416'] ['/pnfs/minos/reco_far/cedar/sntp_data/2007-01,12981@vob357'] ['/pnfs/minos/reco_far/cedar/.bntp_data/2007-01,10600@vob719'] ['/pnfs/minos/reco_far/cedar/cand_data/2007-01,609@vo7416'] ['/pnfs/minos/reco_far/cedar/sntp_data/2007-01,12987@vob357'] ['/pnfs/minos/reco_far/cedar/.bntp_data/2007-01,10596@vob719'] ['/pnfs/minos/reco_far/cedar/.bcnd_data/2007-01,2890@vob735'] SRV1> for FILE in $FILES ; do sam get metadata --file=${FILE} ; done Looks good to my eye. The parents seem to be listed in a random order, but they are in order in the Database Browser. 
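Follow-on sketch to the spot checks above (hypothetical, not an installed script) - loop the
same sam locate call over the FILES list and flag anything that comes back without a /pnfs
location, assuming the output format shown above:

for FILE in ${FILES} ; do
    LOC=`sam locate ${FILE} 2>&1`
    case "${LOC}" in
        *pnfs*) echo "OK    ${FILE}" ;;
        *)      echo "NOLOC ${FILE} : ${LOC}" ;;
    esac
done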
Reviewed verify pass , SRV1> grep -v verified LOG/2007-01/verify_far_cedar.log | grep -v PARENT STARTED Fri Mar 23 17:06:23 2007 saddreco 20070323 Declaring to SAM prd far cedar 2007-01 verify Needed /pnfs/minos/reco_far/cedar/cand_data/2007-01 Treating 738 files in /pnfs/minos/reco_far/cedar/cand_data/2007-01 Needed 1476 files, Rate was 4.727 Needed /pnfs/minos/reco_far/cedar/sntp_data/2007-01 Treating 46 files in /pnfs/minos/reco_far/cedar/sntp_data/2007-01 Needed 92 files, Rate was 1.351 Needed /pnfs/minos/reco_far/cedar/.bntp_data/2007-01 Treating 46 files in /pnfs/minos/reco_far/cedar/.bntp_data/2007-01 Needed 46 files, Rate was 1.289 Needed /pnfs/minos/reco_far/cedar/.bcnd_data/2007-01 Treating 738 files in /pnfs/minos/reco_far/cedar/.bcnd_data/2007-01 Needed 738 files, Rate was 4.333 STARTED Fri Mar 23 17:06:23 2007 FINISHED Fri Mar 23 17:16:10 2007 SRV1> grep -v verified LOG/2007-01/verify_near_cedar.log | grep -v PARENT STARTED Fri Mar 23 17:00:08 2007 saddreco 20070323 Declaring to SAM prd near cedar 2007-01 verify Needed /pnfs/minos/reco_near/cedar/cand_data/2007-01 Treating 682 files in /pnfs/minos/reco_near/cedar/cand_data/2007-01 obsolete N00011516_0022.cosmic.cand.cedar.0.root obsolete N00011516_0022.spill.cand.cedar.0.root obsolete N00011516_0021.cosmic.cand.cedar.0.root obsolete N00011516_0021.spill.cand.cedar.0.root obsolete N00011516_0020.cosmic.cand.cedar.0.root obsolete N00011516_0020.spill.cand.cedar.0.root obsolete N00011516_0016.spill.cand.cedar.0.root obsolete N00011516_0016.cosmic.cand.cedar.0.root obsolete N00011516_0017.spill.cand.cedar.0.root obsolete N00011516_0017.cosmic.cand.cedar.0.root obsolete N00011516_0015.cosmic.cand.cedar.0.root obsolete N00011516_0015.spill.cand.cedar.0.root obsolete N00011516_0018.spill.cand.cedar.0.root obsolete N00011516_0018.cosmic.cand.cedar.0.root obsolete N00011516_0019.spill.cand.cedar.0.root obsolete N00011516_0019.cosmic.cand.cedar.0.root Needed 1258 files, Rate was 4.183 Needed /pnfs/minos/reco_near/cedar/sntp_data/2007-01 Treating 58 files in /pnfs/minos/reco_near/cedar/sntp_data/2007-01 Needed 103 files, Rate was 1.427 STARTED Fri Mar 23 17:00:08 2007 FINISHED Fri Mar 23 17:06:22 2007 Indeed the obsoletes are supplanted by .1 files, OK fine. Take a breath, run them all : for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done Looks OK, MON=2007-02 for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done MON=2007-03 for DET in near far ; do ./saddreco ${DET} ${REL} ${MON} declare \ >> ${HOME}/ROUNTMP/LOG/${MON}/declare_${DET}_${REL}.log 2>&1 done Summary of changes : < Added VERSION variable, and print thereof < enupdate has all concatenation support < READFIL has list of input files < READLIN variable has content, skipping first file < PARENT is .mdaq.root < For each file, < bump eventCount < replace lastevent and endTime < append parents < rename READFIL to SAM subdirectory for MODE declare ######## # FARM # ######## Clearing out N00011577* for clean reprocessing, filling the holes after the 22 January problems got too messy. 
as minfarm SRV1> dfarm rm /minos/nearcat/N00011577* SRV1> rm WRITE/N00011577* SRV1> rm READ/N00011577* SRV1> rm ECRC/N00011577* SRV1> rm DFARM/N00011577* as howie SRV1> cd /pnfs/minos/reco_near/cedar/ SRV1> rm *_data/2007-01/N00011577* Final cleanup, howie's reprocessing is done today : SRV1> cp -a AFSS/roundup.20070320 roundup.20070320 SRV1> ln -sf roundup.20070320 roundup SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far Looking at the far log, still ugly : 37230 - single subrun, force this ? On march 1, spill.sntp was complete, all and bntp had 1/24 37233 SUPPRESS message during processing of F00037786 due to stale ROUNTMP/DFARM control file time stamps move them aside, ./roundup -w -r cedar far 37786 is current OK - processing /minos/farcat Fri Mar 23 18:17:42 CDT 2007 OK - processing 286 files OK - stream all.sntp.cedar OK - 2302 Mbytes in 5 runs PEND - have 1/24 subruns for F00037230_*.all.sntp.cedar*.root 23 02/28 13:10:33 SUPPRESS F00037233_0024.all.sntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.all.sntp.cedar*.root 5 03/17 23:46:05 OK - stream spill.bntp.cedar OK - 416 Mbytes in 7 runs NOSPILL F00032719_0000.spill.bntp.cedar.0.root PEND - have 1/1 subruns for F00032719_*.spill.bntp.cedar*.root 1 03/21 23:10:28 NOSPILL F00033011_0000.spill.bntp.cedar.0.root NOSPILL F00033011_0001.spill.bntp.cedar.0.root PEND - have 1/0 subruns for F00033011_*.spill.bntp.cedar*.root 1 03/21 23:10:10 PEND - have 1/24 subruns for F00037230_*.spill.bntp.cedar*.root 23 02/28 13:11:06 SUPPRESS F00037233_0024.spill.bntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.spill.bntp.cedar*.root 5 03/17 23:46:35 OK - stream spill.sntp.cedar OK - 280 Mbytes in 4 runs SUPPRESS F00037233_0024.spill.sntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.spill.sntp.cedar*.root 5 03/17 23:46:19 Fri Mar 23 18:18:18 CDT 2007 PEND - have 1/24 subruns for F00037230_*.all.sntp.cedar*.root 23 02/28 13:10:33 SUPPRESS F00037233_0024.all.sntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.all.sntp.cedar*.root 5 03/17 23:46:05 NOSPILL F00032719_0000.spill.bntp.cedar.0.root PEND - have 1/1 subruns for F00032719_*.spill.bntp.cedar*.root 1 03/21 23:10:28 NOSPILL F00033011_0000.spill.bntp.cedar.0.root NOSPILL F00033011_0001.spill.bntp.cedar.0.root PEND - have 1/0 subruns for F00033011_*.spill.bntp.cedar*.root 1 03/21 23:10:10 PEND - have 1/24 subruns for F00037230_*.spill.bntp.cedar*.root 23 02/28 13:11:06 SUPPRESS F00037233_0024.spill.bntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.spill.bntp.cedar*.root 5 03/17 23:46:35 OK - stream spill.sntp.cedar OK - 280 Mbytes in 4 runs SUPPRESS F00037233_0024.spill.sntp.cedar.0.root PEND - have 22/24 subruns for F00037786_*.spill.sntp.cedar*.root 5 03/17 23:46:19 Further cleanup of old strays, set aside the old N00008460_0002.spill.sntp.cedar.0.root to make room for a new one, matching the recent cand from this subrun. SRV1> sam undeclare file N00008460_0002.spill.sntp.cedar.0.root SRV1> cd /pnfs/minos/reco_near/cedar/sntp_data/2005-09 SRV1> mv N00008460_0002.spill.sntp.cedar.0.root /pnfs/minos/BAD/BAD_N00008460_0002.spill.sntp.cedar.0.root SRV1> ./roundup -S -f 0 -s N00008460_ -r cedar near ============================================================================= 2007 03 22 ############ # SADDRECO # ############ Cleaned up version numbers for old versions for FIL in `ls saddreco.0*` ; do DT=${FIL:9} ; mv ${FIL} saddreco.2005${DT} ; done saddreco.20070322 - adding concatenated file support See notes from 2006 10 24 A. 
Pick a short victim run from 2007-01, test run this through saddreco B. Review fields to modify C. Test and read the READ/${FILE} parent list D. Iterate over parent metadata to get the numbers. A. F00037292_*.all.sntp.cedar.0.root has 3 subruns, in 2007-01 B. hacked candfiles to select the desired file Should add getops to saddreco, so can do -n, -s Usage : saddreco far cedar 2007-01 [mode] [bail] PATH=${PATH}:/export/stage/minfarm/ROUNDUP/SAM/current/bin export SAM_DB_SERVER_NAME=SAMDbServer.prd:SAMDbServer export SAM_NAMING_SERVICE_IOR=IOR:010000002a00000049444c3a6f6f632e636f6d2f436f734e616d696e672f4f424e616d696e67436f6e746578743a312e30000000010000000000000030000000010100b7150000006d696e6f732d73616d30312e666e616c2e676f7600d132230c0000004e616d655365727669636500 export SETUP_SAM_CONFIG='sam_config v4_2_34 -f NULL -z /afs/fnal.gov/files/code/e875/general/ups/db -q prd -r /afs/fnal.gov/files/code/e875/general/ups/prd/sam_config/v4_2_34/NULL -m sam_config_prd.table -M /afs/fnal.gov/files/code/e875/general/ups/db/sam_config/v4_2_34' AFSS/saddreco.20070322 far cedar 2007-01 verify 1 Set READDIR MINOS26 > sam get metadata --file="F00037126_0020.all.sntp.cedar.0.root" 'eventCount' : 13956L, 'firstEvent' : 204013L, 'lastEvent' : 214368L, 'startTime' : SamTime(1166619877.0), 'endTime' : SamTime(1166623477.0), 'parents' : NameOrIdList(['F00037126_0020.mdaq.root']), 'runDescriptorList' : RunDescriptorList([RunDescriptor(runType='physics', runNumber=37126)]), }) ########### # GANGLIA # ########### Requested Ganglia host group for minosora1, ticket 94504 For reference, on fnpca, minosora1 and minos Cluster addresses were like http://fnpca.fnal.gov/ganglia/?r=day&c=MINOS Servers&h=minosora1.fnal.gov http://fnpca.fnal.gov/ganglia/?m=load_one&r=day&c=MINOS+Cluster&h=minos-mysql1.fnal.gov ########### # ROUNDUP # ########### fnpcsrv1 - updated copy of roundup.20070319 ( mc support ) to include corrected comment PEND FAR today : F00032138 - 2005-07 F00036066 - 2006-07 F00037221 - 2007-01 F00037230 - 2007-01 Added -S solo option to force direct output for older subruns Forced the pre-2007 far files : SRV1> AFSS/roundup.20070320 -S -f 0 -s F00032138 -r cedar far SRV1> AFSS/roundup.20070320 -S -f 0 -s F00036066 -r cedar far Howie has rerun, last night, F00032719_0000 2005-09 F00033011_0000 2005-10 PEND NEAR today SRV1> AFSS/roundup.20070320 -n -W -r cedar near > /tmp/nearpend PEND - have 18/24 subruns for N00011577_*.cosmic.sntp.cedar*.root 62 01/18 15:12:27 PEND - have 6/24 subruns for N00011956_*.cosmic.sntp.cedar*.root 1 03/21 00:36:51 PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 20 03/01 17:20:46 PEND - have 1/20 subruns for N00008463_*.spill.sntp.cedar*.root 0 03/21 18:12:45 PEND - have 6/24 subruns for N00011956_*.spill.sntp.cedar*.root 1 03/21 00:37:08 N00008463 - 2005-09 N00009165 - 2005-11 N00011595 - 2007-01 N00011956 - 2007-03 Adding : N00011565 onward - 2007-01 Clear the pre-2007 files : AFSS/roundup.20070320 -S -f 0 -s N00008463 -r cedar near This conflicted with an existing N00008463_0019.spill.sntp.cedar.0.root removed this file from WRITE, READ, ECRC Holding off pending investigation, did not do : AFSS/roundup.20070320 -f 0 -s N00009165 -r cedar near mrnt files which need concatenation holding off for validation, per rubin advice Bottom line, going ahead with the near pend cleanup : AFSS/roundup.20070320 -r cedar near Rubin requests putting the newer N00008463_0019.spill.sntp.cedar.0.root into enstore. 
cd /pnfs/minos/reco_near/cedar/sntp_data/2005-09 mv N00008463_0019.spill.sntp.cedar.0.root /pnfs/minos/BAD/BAD_N00008463_0019.spill.sntp.cedar.0.root ============================================================================= 2007 03 21 ###### # DB # ###### Root Cause meeting scheduled 10 AM FCC1-small meeting room for the Wed 14 March minorora1 outage (loose network cable) Report by jtrumbo emailed to minosdb-support ############ # MCIMPORT # ############ corrupt file f21011195_0000_L010185N_D00.reroot.root ? MINOS26 > ls -l /pnfs/minos/mcin_data/far/daikon_00/L010185N/119/f21011195_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 283698210 Mar 21 00:00 /pnfs/minos/mcin_data/far/daikon_00/L010185N/119/f21011195_0000_L010185N_D00.reroot.root mv mcin_data/far/daikon_00/L010185N/119/f21011195_0000_L010185N_D00.reroot.root ../BAD/ mv BAD/f21011195_0000_L010185N_D00.reroot.root BAD/BAD_f21011195_0000_L010185N_D00.reroot.root verified BAD permissions 775 per rubin request, later changed group to e875 per rubin request ########### # ROUNDUP # 20070320 ########### Checking old mc files in /minos/mccat, like frwrw 3 rubin 136984820 09/29 10:40:32 f21001006_0000_L100200.sntp.cedar.root frwrw 3 rubin 139335356 09/29 10:46:13 f21001025_0000_L100200.sntp.cedar.root ... frwrw 3 rubin 29103428 09/29 14:25:51 n13021079_0000_L010170.sntp.cedar.root frwrw 3 rubin 29468053 09/29 15:12:15 n13021080_0000_L010170.sntp.cedar.root From last survey 2006 09 29 dfarm ls /minos/mccat | tr -s ' ' | cut -f 4 -d ' ' > /tmp/mccatsz wc -l /tmp/mccatsz 288 /tmp/mccatsz echo \(`cat /tmp/mccatsz | tr \\\n +` 0 \) / 1000000000 | bc 9 dfarm ls /minos/mccat | tr -s ' ' | cut -f 7 -d ' ' > /tmp/mccatlis About right, average 30 MB each. for FILE in `cat /tmp/mccatlis | grep ^f` ; do CONF=`echo ${FILE} | cut -f 3 -d '_' | cut -f 1 -d .` ls -l /pnfs/minos/mcout_data/cedar/far/carrot/${CONF}/sntp_data/${FILE} done for FILE in `cat /tmp/mccatlis | grep ^n` ; do CONF=`echo ${FILE} | cut -f 3 -d '_' | cut -f 1 -d .` ls -l /pnfs/minos/mcout_data/cedar/near/carrot_06/${CONF}/sntp_data/${FILE} done All files present and accounted for, removing these old files dfarm rm /minos/mccat/f* dfarm rm /minos/mccat/n* Cleared out working files from testing, dfarm rm /minos/mcnearcat/n* dfarm rm /minos/mcfarcat/f* Shifted the nd mc test files out of the way in /grid/data mv /grid/data/minos/mcnearcat \ /grid/data/minos/mcnearcattest Further cleanup of N00011935 N00011938 These were concatenated, partially purged from DFARM due to permissions on Sat 2007 Mar 17. These were purged from WRITE on Sunday, Howie will fix, I will purge from DFARM asap. dfarm rm /minos/nearcat/N00011935* dfarm rm /minos/nearcat/N00011938* Done 14:55 Howie is reprocessing some runs round Jan 22. Need to remove from SAM : N00008463_0019.spill.cand.cedar.0.root MINOS26 > sam locate N00008463_0019.spill.cand.cedar.0.root ['/pnfs/minos/reco_near/cedar/cand_data/2005-09,725@vob428'] MINOS26 > sam undeclare file N00008463_0019.spill.cand.cedar.0.root ######### # FARMS # ######### Grid server maintenance today, started 08:42. Will follow up to see whether this interrupted daily 08:00 roundup Cron jobs finished at 08:17, well ahead of the shutdown. OSG software moved from /export/osg/grid to /usr/local/vdt with compatibility symlinks. Will adjust roundup.20070320 Due to reprocessing, have disabled roundup in crontab tomorrow : crontab crontab.noround ####### # AFS # ####### Security advisory re allowing suid in AFS, don't do it ! 
http://openafs.org/security/OPENAFS-SA-2007-001.txt For those who are unable to upgrade, setuid status can always be disabled by running, as the super user on any client: fs setcell -cell fnal.gov -nosuid MIN > cat /usr/vice/etc/ThisCell fnal.gov MIN > fs getcellstatus -cell fnal.gov Cell fnal.gov status: setuid allowed MIN > fs setcell -cell fnal.gov -nosuid MIN > fs getcellstatus -cell fnal.gov Cell fnal.gov status: no setuid allowed See /usr/vice/etc/afs.rc ? or /etc/sysconfig/afs on our systems ============================================================================= 2007 03 20 ########### # ROUNDUP # ########### Added support for Monte Carlo SRV1> cp -a AFSS/roundup.20070319 . SRV1> ln -sf roundup.20070319 roundup ( hacked this to update VERSION at about 13:15 ) roundup.20070320 - adding no_spill.txt scan handling absence of no_spill and bad_runs more cleanly Running a preview in near, removing NOSPILL subruns, we would have the following pending runs for spill files PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 18 03/01 17:20:46 PEND - have 18/17 subruns for N00011577_*.spill.sntp.cedar*.root 61 01/18 15:12:57 PEND - have 4/ 6 subruns for N00011580_*.spill.sntp.cedar*.root 56 01/22 20:38:35 PEND - have 6/ 7 subruns for N00011589_*.spill.sntp.cedar*.root 56 01/22 16:32:32 PEND - have 23/24 subruns for N00011935_*.spill.sntp.cedar*.root 4 03/16 10:51:34 PEND - have 1/11 subruns for N00011938_*.spill.sntp.cedar*.root 4 03/16 10:54:27 PEND - have 6/25 subruns for N00011953_*.spill.sntp.cedar*.root 0 03/20 04:06:57 Note that N00011577 claims to have 18/17 runs, N00011577_0008.spill.sntp.cedar.0.root is in no_spill.cedar but exists in /minos/nearcat/ SRV1> AFSS/roundup.20070320 -s N00011577 -n -W -r cedar near ############ # MCIMPORT # ############ rubin reports two corrupt files f21311496_0000_L010185N_D00.reroot.root This file was imported by howcroft on Sat Mar 17 . The size looks pretty normal, no obvious truncation. MINOS26 > grep f21311496_0000_L010185N_D00 CFL minos mcin_far_daikon VO4125 0000_000000000_0000461 CDMS117412486400000 130644338 148713862 /pnfs/minos/mcin_data/far/daikon_00/L010185N/149/f21311496_0000_L010185N_D00.reroot.root MINOS26 > ls -l /pnfs/minos//mcin_data/far/daikon_00/L010185N/149/f21311496_0000_L010185N_D00.reroot.root -rw-r--r-- 1 kreymer e875 130644338 Mar 17 04:47 /pnfs/minos//mcin_data/far/daikon_00/L010185N/149/f21311496_0000_L010185N_D00.reroot.root f21011135_0000_L010185N_D00.reroot.root Partial transfer saved in sjc/far/mcin/BAD on 9 March, and removed from PNFS at that time. ####### # NET # ####### Thursday - 3/29 - 6:00 AM - 45 minutes - Operating system upgrade for s-s-fcc2-server switch- This affects Minos Cluster Minos Servers ( SAM , Mysql1 ) Minos AFS Enstore acsminos01 rip9 stkendca3a stkendca7a stkendca8a stkendm1a stkendm2a stkendm3a stkendm4a stkensrv4 stkensrv5 stkensrv6 stkensrv7 stkensrv8 stkensrv0n many movers Asked lamore,netmanager whether this is definitely scheduled. Outage announced at about 13:25 to netdown@fnal.gov . ######## # FARM # ######## Scheduled down Wed 21 March 08:30 ######## # FARM # ######## fnpcsrv1 issued this message to sessions logged in : Message from syslogd@fnpcsrv1 at Tue Mar 20 14:06:46 2007 ... fnpcsrv1 kernel: journal commit I/O error The system has gone off the network, as of about 14:30. NGOP paged timm at 14:30 . 
SCSI error
Rebooted 14:50

=============================================================================
2007 03 19

#######
# DAQ #
#######

urish upgraded kernels on CR systems, starting with minos-beamdata,
during a brief FD outage for a fuse replacement.
BNL had to intervene to restart beam logging processes.

##########
# DCACHE #
##########

Sent query to dcache-admin regarding Minos Read pools in FNDCA,
due around mid-March.

#########
# ADMIN #
#########

Reported multiple /pnfs/minos /etc/fstab entries, ticket 94266
Email to minos-admin

###########
# ROUNDUP #
###########

roundup.20070319 has final tweaks for mc, dropping roundup.20070314
Seems to work, path looks good, tested with -C
Plan : deploy it tomorrow
Next : use no_spill.${REL} to filter files with no spill output,
       after Howie has done some more tests of existing data.
Then : handle input from /grid/data instead of DFARM,
       or all 3 : DFARM, /grid/data, DCache
But first - get SAM declares going.

=============================================================================
2007 03 17

#######
# WEB #
#######

Per rhatcher 11 Nov 2006 email suggestion,
html/minwork/computing/enstore.html.20070317
Changed email from buckley to minos-data
Noted write access from only minos01
    Since ticket 73674, 2006 Feb 08
    Checked this actively, it's true
Changed to user path /pnfs/minos/users/

#########
# ADMIN #
#########

/pnfs/minos has multiple /etc/fstab entries on
minos03 minos08 minos09 minos11-12 minos14-16 minos18-24

=============================================================================
2007 03 16

#######
# DOE #
#######

Prepared minos/plan/doesum0703.txt for Gina's DOE review presentation.
Security visit coming next week
Put all media into cabinets

###########
# ROUNDUP #
###########

The old mccat mcfardcat files will not do, they have carrot names

NMCF=`ls /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/123 | grep n13011230`
DCCP_PATH=dcap://fndca1:24136/pnfs/fnal.gov/usr/minos/

for FILE in ${NMCF} ; do
    dccp ${DCCP_PATH}/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/123/${FILE} \
        /grid/data/minos/mcnearcat/${FILE}
done

66820161 bytes in 1 seconds (65254.06 KB/sec)
68172273 bytes in 1 seconds (66574.49 KB/sec)
67004068 bytes in 1 seconds (65433.66 KB/sec)
67009239 bytes in 1 seconds (65438.71 KB/sec)
65355460 bytes in 1 seconds (63823.69 KB/sec)
67156459 bytes in 2 seconds (32791.24 KB/sec)
67347107 bytes in 3 seconds (21922.89 KB/sec)
66109571 bytes in 1 seconds (64560.13 KB/sec)
68086532 bytes in 3 seconds (22163.58 KB/sec)
66813257 bytes in 1 seconds (65247.32 KB/sec)
67456146 bytes in 2 seconds (32937.57 KB/sec)

for FILE in ${NMCF} ; do chmod 775 /grid/data/minos/mcnearcat/${FILE} ; done

for FILE in ${NMCF} ; do
    dfarm put -n 1 -v /grid/data/minos/mcnearcat/${FILE} /minos/mcnearcat/${FILE}
done

#############
# minosora1 #
#############

Before Wed 14 Mar network problems, network activity showed a short spike
around 07:00 and an hour at 60 MBit/sec around 18:00
Wednesday the large peak was around 22:00, at 130 MBit/sec
Thursday there was no large peak
Friday the small peak ( if that ) was around noon, 2 MBit/sec

Note that minosora3 is spending nearly 2 hours/day at 6 MBytes/second.
Why ? This is dev/int, does not need backups.
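Back to the /grid/data staging above - a size check that could be run before the dfarm put
(hypothetical, was not run at the time), comparing each /grid/data copy against its PNFS
source with plain ls -l, no checksums:

SDIR=/pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data/123
for FILE in ${NMCF} ; do
    SSZ=`ls -l ${SDIR}/${FILE} | tr -s ' ' | cut -f 5 -d ' '`
    DSZ=`ls -l /grid/data/minos/mcnearcat/${FILE} | tr -s ' ' | cut -f 5 -d ' '`
    [ "${SSZ}" = "${DSZ}" ] || echo "SIZE MISMATCH ${FILE} ${SSZ} ${DSZ}"
done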
============================================================================= 2007 03 15 ####### # AFS # ####### Per howcroft request, asked for 50GB not backup disk at /afs/fnal.gov/files/data/minos/d227 for Requested for "myself, Jeff, Pedro, Alex, Justi, David J, rustem" myself = howcroft Jeff jkn Jeffery Nelson jdejong Jeffrey Dejong hartnell Jeffrey Hartnell *** per email address Pedro ochoa Juan Pedro Ochoa Alex asousa Alexandre_Sousa *** a guess ( wrong ) ahimmel Alexander Himmel *** set this once disk exists Justi evansj Justin Evans David J djaffe David Jaffe Rustem rustem Rustem Ospanov ACL ( inspired by d221 ) minos rl system:administrators rlidwka system:anyuser rl buckley:kreymer rlidwka howcroft:asousa:djaffe:evansj:hartnell:ochoa rlidwka Helpdesk ticket 94115 by inkmann Oops, format is wrong needed buckley rlidwka kreymer rlidwka howcroft rlidwka ahimmel rlidwka djaffe rlidwka * * * no such user evansj rlidwka hartnell rlidwka ochoa rlidwka Removed stray entry : fs setacl -dir /afs/.fnal.gov/files/data/minos/d227 -acl buckley:minosrecodata none # minos.nubar group # Hi Art, I would create a new AFS group for these guys to use. If you make it owned by buckley:admin then anyone on that group can add users. Liz Documents at http://www.openafs.org/pages/doc/UserGuide/auusg008.htm#HDRWQ60 pts creategroup -name kreymer:nubar group kreymer:nubar has id -1917 for GUSER in buckley kreymer howcroft ahimmel evansj hartnell ochoa ; do pts adduser -user ${GUSER} -group kreymer:nubar ; done pts membership kreymer:nubar pts examine kreymer:nubar pts setfields kreymer:nubar -access SOMar pts chown kreymer:nubar minos pts adduser -user asousa -group minos:nubar pts removeuser -user asousa -group minos:nubar OK, now add this to the directory : fs setacl -dir /afs/.fnal.gov/files/data/minos/d227 -acl minos:nubar rlidwka fs setacl -dir /afs/.fnal.gov/files/data/minos/d227 -acl buckley:admin rlidwka for GUSER in howcroft ahimmel evansj hartnell ochoa ; do fs setacl -dir /afs/.fnal.gov/files/data/minos/d227 -acl ${GUSER} none ; done ########### # ROUNDUP # ########### editing roundup.20070314 for mc support ######## # FARM # ######## 14:00 rubin, bseilhan met with timm to discuss Minos farm/grid compatibility Issued identified : 1) AFS access - Steve could provide this, but rubin prefers to drop AFS as soon as we can count on DCache for ntuples. Logs can be handled somehow, they are small. 2) timm would prefer jobs to run in group account like minos, rather than rubin, bseilhan. This should be OK, as grid jobs have valid cert's giving output file access. 3) Can certain job steps be forced to fnpcsrv1 ? Steve says yes. 
4) DFARM retirement - switch to /grid/data and/or fermigrid/volative Dcache rubin will proceed now with /grid/data tests 5) Software distribution, presently via /home/minfarm, use /grid/app 6) timm strongly prefers no interactive logins to workers That's a problem for monitoring, tools are lacking ( top, ps ) ============================================================================= 2007 03 14 ########### # ROUNDUP # ########### roundup.20070314 - adding monte carlo Now logging to HADDLOG/${YEMON} Aligned PURGED DFARM message with SRMCP message FILES=`dfarm ls /minos/mccat | tr -s ' ' | cut -f 7 -d ' '` Shifted 7 old files to /minos/mcfarcat dfarm mkdir /minos/mcnearcat dfarm chmod rwrw /minos/mcnearcat dfarm mkdir /minos/mcfarcat dfarm chmod rwrw /minos/mcfarcat FILES=`dfarm ls /minos/mccat | grep f2 | tr -s ' ' | cut -f 7 -d ' '` cd /export/stage/minfarm/ROUNDUP_TEST for FILE in ${FILES} ; do echo ${FILE} dfarm get /minos/mccat/${FILE} ${FILE} dfarm put -n 1 -v ${FILE} /minos/mcfarcat/${FILE} rm ${FILE} ; done ########## # DCACHE # ########## 7 empty files reported by dcache-admin one was a command error leaving an empty 'ls' file, oops, my bad. 6 are output of dakion_00 processing, informed Howie and Brandon /pnfs/minos/mcout_data/cedar/far/daikon_00/L010185N/sntp_data/112/f21011125_0000_L010185N_D00.sntp.cedar.root /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/160/n13011601_0001_L010185N_D00.cand.cedar.root /pnfs/minos/reco_far/cedar/.bntp_data/2007-03/ls /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/160/n13011601_0006_L010185N_D00.cand.cedar.root /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/160/n13011602_0003_L010185N_D00.cand.cedar.root /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/160/n13011602_0001_L010185N_D00.cand.cedar.root /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/160/n13011601_0000_L010185N_D00.cand.cedar.root ############# # minosora1 # ############# 13:30 mmihalek reports power cord work completed on minosora3 14:05 normal contact with minosora1 recorded at http://www-numi.fnal.gov/computing/database/oracle/topdb/minosprd/2007/03/14/14.txt 14:15 no access to minosora1, off the network 14:52 Helpdesk ticket 94055 issued 15:01 mmihalek verifies no outage was scheduled for minosora1 15:03 ticket 94055 assigned to jtrumbo 15:20 mmihalek verifies that minosora1 is up and running at local console 15:36 131.225.107.24 is connected to s-s-fcc1-server on port 7/37 (minosora1) Pinged neighboring nodes at 7/35 uscmsdb01 7/36 uscmsdb03 7/38 appora 7/39 appora-dev I have searched for minosora1 in the Tissue data base of blocked nodes, it does not seem to be blocked. 14:39 jtrumbo asked that ticket be assigned to networking 14:41 jtrumbo calls lamore directly, no ticket to networking yet 15:50 ticket assigned to networking, vbravov 16:04 ret called, will check that we get a response from networking ( no response to page yet ) 16:15 minosora1 is back, restarted dbserver 16:16 duplicate ticket 94069 issued by Remedy software, in response to kreymer email - should not have been issued, ret will investigate 16:23 received additional history from Orlando via mmihalek, contractors were working in that area. 
16:41 ticket 94055 assigned to Jack Schmidt 16:51 detailed reply from orlando ticket 94055 resolved by Jack Schmidt ( strange time stamp on email, 13:52 ) ============================================================================= 2007 03 13 ############ # PREDATOR # ############ Predator cronjob has not been triggering beam or dcs activity since the DST shift. Strange, this is keyed on the local HOUR being 5 or 23 The cron jobs were running on the even hours, see below. ######## # CRON # ######## cron on minos26 is still running in DST. crond probably needs a restarted crond restarted before 16:37 by Tim Laszlo ########### # ROUNDUP # ########### roundup.20070313 - corrected SOLO test setting DELT, wrote separate files for F00037755 N00011914 I will put these back into dfarm, for concatenation ( after setting aside for safety ) Not because we cannot stand a few small files, but because is seriously inflates the entries in the reco directories. Checked tape status, everything is on tape, do we can delete. MINOS26 > cd /pnfs/minos/reco_far/cedar/sntp_data/2007-03 MINOS26 > for FIL in `ls F00037755*` ; do printf "${FIL} " ; cat ".(use)(4)(${FIL})" | head -2 ; done F00037755_0000.all.sntp.cedar.0.root VOD520 0000_000000000_0000172 ... MINOS26 > ls F00037755* | wc -l 48 MINOS26 > cd /pnfs/minos/reco_far/cedar/.bntp_data/2007-03 MINOS26 > for FIL in `ls F00037755*` ; do printf "${FIL} " ; cat ".(use)(4)(${FIL})" | head -2 ; done F00037755_0000.spill.bntp.cedar.0.root VOB719 0000_000000000_0010701 ... MINOS26 > ls F00037755* | wc -l 24 MINOS26 > cd /pnfs/minos/reco_near/cedar/sntp_data/2007-03 MINOS26 > for FIL in `ls N00011914*` ; do printf "${FIL} " ; cat ".(use)(4)(${FIL})" | head -2 ; done N00011914_0000.cosmic.sntp.cedar.0.root VO2116 0000_000000000_0000039 ... MINOS26 > ls N00011914* | wc -l 48 Set aside ROUNDUP summary files DFARM - time stamps ECRC - check sums READ - input list WRITE - files -> dfarm for FILE in `ls WRITE | grep N\*root` ; do mv DFARM/${FILE} REDO/DFARM/${FILE} mv ECRC/${FILE} REDO/ECRC/${FILE} mv READ/${FILE} REDO/READ/${FILE} done Oops, this selected them all, OK, done in 1 pass. Should have been grep N.*root for FILE in `ls WRITE | grep root` ; do echo mv WRITE/${FILE} REDO/WRITE/${FILE} ; done for FILE in `ls REDO/WRITE | grep N.*root` ; do dfarm put -n 1 REDO/WRITE/${FILE} /minos/nearcat ; done for FILE in `ls REDO/WRITE | grep F.*root` ; do dfarm put -n 1 REDO/WRITE/${FILE} /minos/farcat ; done All set, now remove the PNFS copies to make room using rubin account on srv1 SRV1 > cd /pnfs/minos/reco_far/cedar/.bntp_data/2007-03 SRV1 > for FIL in `ls F00037755*` ; do ls -l ${FIL} ; done SRV1 > for FIL in `ls F00037755*` ; do rm -v ${FIL} ; done SRV1 > cd /pnfs/minos/reco_far/cedar/sntp_data/2007-03 SRV1 > for FIL in `ls F00037755*` ; do ls -l ${FIL} ; done SRV1 > for FIL in `ls F00037755*` ; do rm -v ${FIL} ; done SRV1 > cd /pnfs/minos/reco_near/cedar/sntp_data/2007-03 SRV1 > for FIL in `ls N00011914*` ; do ls -l ${FIL} ; done SRV1 > for FIL in `ls N00011914*` ; do rm -v ${FIL} ; done ============================================================================= 2007 03 12 ####### # DST # ####### Problems in DAQ in beam logging, Big Button, DCS Processes had to be restarted to pick up DST support Proposed moving all Control room and DAQ system to localtime GMT, during the summer shutdown. Received favorably by Cat and Rob, needs discussion. 
Process : Adjust crontab entries ########### # ENSTORE # ########### Informed enstore-admin of bad dakion renames of 1632 files, 479424 MB cd /pnfs/minos/mcout_data/cedar/far mv daikon_00 bad_daikon_00 L010185N/cand_data/143/f21411431_0000_L010185N_D00.cand.cedar.root ########### # ROUNDUP # ########### Activated handling of cand files ( if any show up ) Informed Howie and Brandon SRV1> cp -a AFSS/roundup.20070309 . SRV1> ln -sf roundup.20070309 roundup ########### # ENSTORE # ########### Stan Naymola reports 600 rewrites tried for /pnfs/fnal.gov/usr/minos/mcout_data/cedar/far/daikon_00/L010185N/cand_data/143/f21411431_0000_L010185N_D Check bad_daikon_00 for pending files : cd /pnfs/fnal.gov/usr/minos/mcout_data/cedar/far/bad_daikon_00 FILES=`find . -type f | cut -f 2 -d /` for FILE in ${FILES} ; do DIR=`dirname ${FILE}` ; FIL=`basename ${FILE}` TL=`( cd ${DIR} ; cat ".(use)(4)(${FIL})" ) | head -2` printf "${FILE} " ; echo ${TL} usleep 300000 ; done | tee /tmp/badvols L010185N/cand_data/143/f21411431_0000_L010185N_D00.cand.cedar.root MINOS26 > ./dc_stat /pnfs/minos/mcout_data/cedar/far/bad_daikon_00/L010185N/cand_data/143/f21411431_0000_L010185N_D00.cand.cedar.root ============================ PNFS status for /pnfs/minos/mcout_data/cedar/far/bad_daikon_00/L010185N/cand_data/143/f21411431_0000_L010185N_D00.cand.cedar.root -rw-r--r-- 1 1334 e875 615062441 Mar 9 09:52 f21411431_0000_L010185N_D00.cand.cedar.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:992b6eb3;l=615062441; w-stkendca17a-1 LEVEL 4 ============================ Fix this by mv'ing the files to make DCache/Enstore happy : PAB=/pnfs/minos/mcout_data/cedar/far/bad_daikon_00 PAG=/pnfs/minos/mcout_data/cedar/far/daikon_00 FILE=L010185N/cand_data/143/f21411431_0000_L010185N_D00.cand.cedar.root mv ${PAG}/${FILE} ${PAG}/${FILE}.good mv ${PAB}/${FILE} ${PAG}/${FILE} < wait for the file to be written > mv ${PAG}/${FILE} ${PAB}/${FILE} mv ${PAG}/${FILE}.good ${PAG}/${FILE} Details, first had to get access to bseilhan account hr> chmod 775 ${PAB}/L010185N/cand_data/143 bs> chmod 775 ${PAG}/L010185N/cand_data/143 At 13:58, I did the initial file moves, under the bseilhan account, to allow tape writing to proceed, At 14:00 I saw the tape write underway, on VOC105, using 9940B24.mover. At 14:07 I moved the files back to their original locations. The bad_daikon_00 copy is tape VOC105 file 283 The daikon_00 copy is tape VOC105 file 133 All is well ! ============================================================================= 2007 03 09 ######## # JAVA # ######## Per discussion on lusers, do we need a java upgrade for dates ? Is there a test case ? SUMMARY : no upgrade needed, we have a test script herber's email suggest we need, for each series 1.3.1_18 1.4.2_13 5.0_u9 http://java.sun.com/developer/technicalArticles/Intl/USDST_Faq.html#jdkversion But the given link suggests 5.0_u6 in section 7 Control room systems run jre-1.5.0_07-fcs Google search for java test daylight savings reveals a test program http://ablogofideas.net/blog/2007/02/19/test-your-java-for-new-daylight-saving-time-changes/ based on a javascript test http://www.mkville.com/blog/index.cfm/2007/2/15/Quick-test-for-Daylight-Saving-Time-updates There is also an applet browser test, http://ablogofideas.net/blog/2007/02/24/test-your-browsers-jre-for-daylight-saving-time-changes/ had to hack ” to " ″ to " & to & … to . 
Had to rename source to DSTCheck.java This is in minos/scripts/DSTCheck.java javac DSTCheck.java minos@minos-beamdata tmp]$ java DSTCheck Hello, you are running Sun Microsystems Inc. JVM version: 1.4.2_12 OLD Daylight Saving Time (DST) dates: Apr 1 - Oct 28 NEW DST dates: Mar 11 - Nov 4 Now (2007-03-09 09:05:34 CST) DST offset: 0 hours 2007-03-12 01:00:00 CDT DST Offset: 1 hours 2007-04-02 01:00:00 CDT DST Offset: 1 hours 2007-10-27 01:00:00 CDT DST Offset: 1 hours 2007-11-03 01:00:00 CDT DST Offset: 1 hours ............... . Your JVM is OK with the new DST changes . ............... Sent summary to linux-users, put copy in http://~kreymer/DSTCheck.java OK SLF 4.2 / j2sdk-1.4.2_12-fcs SLF 3.0.5 / j2sdk-1.4.2_12-fcs BAD java v1.5.0 in kits ########### # ENSTORE # ########### No response to our 6 March request to enmv misplaced files. The requested enmv commands are in ~kreymer/minos/maint/daikonmove.txt I am proceeding with a normal mv , or would do so if I could become rubin, the file owner. DDIR=/pnfs/minos/mcout_data/cedar/near/daikon_00 DIRS='L010200N L010185N' for DIR in ${DIRS} ; do FILES=`ls ${DDIR}/${DIR} | grep .root` for FILE in ${FILES} ; do RUN=`echo ${FILE} | cut -c 6-8` mv ${DDIR}/${DIR}/${FILE} ${DDIR}/${DIR}/cand_data/${RUN}/${FILE} usleep 300000 done ; done Did this around 11:00 to 11:30, from rubin@fnpcsrv1 ########### # ROUNDUP # ########### roundup.20070309 Added SOLO , set for cand streams, which pops DELT to 2000, so that files will not be concatenated. Test with a few recent cand files, DPAT=dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/reco_far/cedar/cand_data/2007-03 FILES=' F00037737_0000.spill.cand.cedar.0.root F00037737_0001.spill.cand.cedar.0.root F00037740_0000.spill.cand.cedar.0.root F00037740_0001.spill.cand.cedar.0.root F00037740_0002.spill.cand.cedar.0.root F00037740_0003.spill.cand.cedar.0.root ' cd /export/stage/minfarm/ROUNDUP_TEST/CAND for FILE in $FILES ; do dccp ${DPAT}/${FILE} . ; done for FILE in $FILES ; do dfarm put -n 1 ${FILE} /minos/farcat/${FILE} ; done dfarm ls /minos/farcat/*cand* cleanup after testing for FILE in $FILES ; do dfarm rm /minos/farcat/${FILE} ; done Informed minos_batch, let's deploy this next week. ######### # BATCH # ######### Need to clear all /pnfs/minos/mcout_data/cedar/far/daikon_00 MINOS26 > find . -type f | wc -l 1632 MINOS26 > pwd /pnfs/minos/mcout_data/cedar/far/daikon_00 MINOS26 > du -sm . 479424 . SRV1> cd /pnfs/minos/mcout_data/cedar/far cat "daikon_00/L010185N/.(tag)(file_family)"; reco_mc_far_cedar setup encp v3_6d -q stken mv daikon_00 bad_daikon_00 mkdir daikon_00 cd daikon_00 enstore pnfs --file_family reco_mc_far_cedar mkdir L010185N mkdir L010185N/cand_data mkdir L010185N/sntp_data mkdir L250200N mkdir L250200N/cand_data mkdir L250200N/sntp_data ( cd L010185N/cand_data ; enstore pnfs --file_family reco_mc_far_cedar_cand ) ( cd L250200N/cand_data ; enstore pnfs --file_family reco_mc_far_cedar_cand ) ( cd L010185N/sntp_data ; enstore pnfs --file_family reco_mc_far_cedar_sntp ) ( cd L250200N/sntp_data ; enstore pnfs --file_family reco_mc_far_cedar_sntp ) ########### # GANGLIA # ########### The ganglia monitor unblocking was authorized a week ago, finally got unblocked this afternoon ( lost message somewhere . ) Authorized nodes are listed on the registration web page. 
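One more check worth noting for the daikon_00 rebuild above (hypothetical, not run at the
time) - confirm each recreated *_data directory picked up the intended file family, via the
same .(tag)(file_family) trick, before farm writes resume:

cd /pnfs/minos/mcout_data/cedar/far/daikon_00
for CONF in L010185N L250200N ; do
    for STREAM in cand sntp ; do
        FAM=`( cd ${CONF}/${STREAM}_data ; cat ".(tag)(file_family)" )`
        [ "${FAM}" = "reco_mc_far_cedar_${STREAM}" ] ||
            echo "BAD family ${FAM} in ${CONF}/${STREAM}_data"
    done
done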
Had to reload KCS cert after regeneration with kx509 kxlist -p openssl pkcs12 -export -passout pass:"" -in /tmp/x509up_u1060 -out /tmp/kreymer.p12 -name Fermilab ########## # INDICO # ########## Got registered so I can create minutes and categories under Experiments -> Minos Created Core Software, and a Mar 08 meeting draft ============================================================================= 2007 03 08 ########### # ROUNDUP # ########### roundup.20070308 - has bad_run.${REL} filtering no additional files are being picked up by this at present. Still, made this current, will run tomorrow SRV1> ln -sf roundup.20070308 roundup # was 20070302 ########## # INDICO # ########## HOWTO.indico ######## # GRID # ######## HOWTO.fermigrid ########### # GANGLIA # ########### Sent the following sample .ssh/config file to msd : # You can use an ssh tunnel to see ganglia or chan13 offsite # First, put this file in ~/.ssh/config # Then, in one terminal window: $ ssh gate # this must connect via kerberos credentials or crypto card # Then browse locally to: http://localhost:20000/minos # or: http://localhost:20013/notifyservlet/www Host gate HostName flxi06.fnal.gov LocalForward 20000 rexganglia2.fnal.gov:80 LocalForward 20013 www-bd.fnal.gov:80 ============================================================================= 2007 03 07 ############ # MCIMPORT # ############ mcimport.20070306 Added duplicate detection 11:33 cp -a AFSS/mcimport.20070306 . ln -sf mcimport.20070306 mcimport ####### # CVS # ####### write access failed at 05:03, when yum updates ran these did not restart the sshd.cvs server shepelak did this at 10:46, AOK Helpdesk ticket 93608 As a side benefit, minos-admin email is now forwarded to run2-sys Educated run2-sys about minos-admin ( reached buckley/rhatcher/kreymer/urish) ####### # NET # ####### Network was down in some fashion from about 13:20 to 13:45. Informed control room, msd. N.B. - according to SSA Primary report, An FCC hub router rebooted at 13:20 ########### # ROUNDUP # ########### roundup.200703 - working on bad_run removal ============================================================================= 2007 03 06 ########### # ENSTORE # ########### Rubin reports several misplaced cand files in mc_out : MINOS26 > DDIR=/pnfs/minos/mcout_data/cedar/near/daikon_00 MINOS26 > DIRS=`ls ${DDIR}` MINOS26 > for DIR in ${DIRS} ; do echo ${DIR} ; find ${DDIR}/${DIR} -type f -maxdepth 1 | wc -l ; done L010000N 0 L010170N 0 L010185N 3760 L010200N 11 L100200N 0 L150200N 0 L250200N 0 Request to enstore-admin : Several recent Minos farm files have been placed in the wrong directories. Please move these with enmv, so that the internal Enstore metadata is corrected. The misplaced files are /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/*.root ( 3760 files ) /pnfs/minos/mcout_data/cedar/near/daikon_00/L010200N/*.root ( 11 files ) Please do the equivalent to : DDIR=/pnfs/minos/mcout_data/cedar/near/daikon_00 DIRS='L010200N L010185N' for DIR in ${DIRS} ; do FILES=`ls ${DDIR}/${DIR} | grep .root` for FILE in ${FILES} ; do RUN=`echo ${FILE} | cut -c 6-8` enmv ${DDIR}/${DIR}/${FILE} ${DDIR}/${DIR}/cand_data/${RUN}/${FILE} done ; done An explicit set of enmv commands may be found in ~kreymer/minos/maint/daikonmove.txt L010185N needs 45 RUN directories 106 through 160 The directories are present, but empty ######### # BATCH # ######### rustem reports two duplicated files, with no clue as to where dups are. 
n13011450_0007_L010185N_D00.sntp.cedar.root n13011451_0005_L010185N_D00.sntp.cedar.root cd /pnfs/minos/mcout_data/cedar/near/daikon_00 CONFS=`ls` for CONF in ${CONFS} ; do find ${CONF}/sntp_data -name n13011450_0007_L010185N_D00.sntp.cedar.root ; done L010185N/sntp_data/145/n13011450_0007_L010185N_D00.sntp.cedar.root for CONF in ${CONFS} ; do find ${CONF}/sntp_data -name n13011451_0005_L010185N_D00.sntp.cedar.root ; done L010185N/sntp_data/145/n13011451_0005_L010185N_D00.sntp.cedar.root DUH, the duplicates are in AFS, not PNFS. Howie will clean this up. ############ # MCIMPORT # ############ kordosky file copy failed, seems to be retrying indefinitely : Time User Type Oper File Node dT 1 File Size dT 2 Status Details 2007-03-6 15:21:01 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) None 0 NOT_FINISHED 0 NOT_FINISHED 2007-03-6 15:19:09 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:17:25 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:15:52 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 1 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:14:30 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 1 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:13:19 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:12:16 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 1 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:11:24 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: 
CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:10:42 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:10:10 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) 2007-03-6 15:08:49 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0010_L010185N_D00-n11011620_0003_L010185N_D00.tar minos26.fnal.gov 60 0 0 ERROR 426 Transfer aborted, closing connection :PANIC : Unexpected message arrived class dmg.cells.nucleus.NoRouteToCellException 2007-03-6 15:06:54 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011619_0004_L010185N_D00-n11011619_0008_L010185N_D00.tar minos26.fnal.gov 50 1711288320 0 OK 2007-03-6 15:04:57 kreymer(1060.5111) gsiftp write /pnfs/fnal.gov/usr/minos/stage/kordosky/n11011618_0010_L010185N_D00-n11011619_0003_L010185N_D00.tar minos26.fnal.gov 52 1734010880 0 OK The problem is not so much that this failed, but that it did not report an error to the client, and kept retrying. ############ # MCIMPORT # ############ mcimport.20070306 Adding duplicate detection ============================================================================= 2007 03 05 ####### # NET # ####### Many helpdesk tickets today re networking problems. Cannot browse to sites like www.irs.gov, lwn.net . Ticket 93479 for example, seems to be the parent. These are assigned to vyto (Vyto Grigaliunas). Sometime, the following note showed up in Notes to Requester: Hello, The CD-HelpDesk has just been advised that a workaround has been implemented to allow access to the affected off-site web addresses. We're asking that you re-try the web-sites letting us know if the problem persists. 
Thank you, CD-HelpDesk ######## # MRCC # ######## Over the weekend, copied all MRCC files to DCache/Enstore, and verified with MRCCIN=/afs/fnal.gov/files/data/minos/d170/MRCC/sntp/ ./mrccarch -n ${MRCCIN}/MC/Near-L010185 mcout_data/R1_18_2/near/mrnt_data for MON in 2005-11 2006-01 ./mrccarch -n ${MRCCIN}/Data/Near/${MON} reco_near/R1_18_2/mrnt_data/${MON} ########### # MONTHLY # ########### MYSQL per HOWTO.dbarchive offline real 68m54.075s md5 real 21m35.787s gzip real 55m50.531s scp real 9m36.975s BINLOGS real 2m59.620s ############ # MCIMPORT # ############ Found 3 duplicates in howcroft ( since Friday ) M26 > FILES=`ls *.gz` M26 > for FILE in ${FILES} ; do grep ${FILE} index/*.index ; done index/n12011178_0011_L010185N_D00-n12011314_0006_L010185N_D00.index:n12011193_0011_L010185N_D00.tar.gz index/n12011178_0011_L010185N_D00-n12011314_0006_L010185N_D00.index:n12011197_0011_L010185N_D00.tar.gz index/n12011178_0011_L010185N_D00-n12011314_0006_L010185N_D00.index:n12011201_0011_L010185N_D00.tar.gz M26 > dds n12011197_0011_L010185N_D00.tar.gz -rw-r--r-- 1 mindata e875 9767022 Mar 2 15:33 n12011197_0011_L010185N_D00.tar.gz M26 > dds n12011201_0011_L010185N_D00.tar.gz -rw-r--r-- 1 mindata e875 10206615 Mar 2 15:33 n12011201_0011_L010185N_D00.tar.gz $ mv n12011193_0011_L010185N_D00.tar.gz DUP/ $ mv n12011197_0011_L010185N_D00.tar.gz DUP/ $ mv n12011201_0011_L010185N_D00.tar.gz DUP/ Removed former tarfile, which keep beeing concatenated to, getting over 7 GB in size : PNFS status for /pnfs/minos/stage/howcroft/n12011193_0011_L010185N_D00-n12011400_0012_L010185N_D00.tar -rw-r--r-- 1 kreymer e875 1 Mar 5 12:42 n12011193_0011_L010185N_D00-n12011400_0012_L010185N_D00.tar $ dds tar total 8228592 drwxr-xr-x 2 mindata e875 8192 Mar 5 12:37 ./ drwxr-xr-x 9 mindata e875 53248 Mar 5 12:39 ../ -rw-r--r-- 1 mindata e875 7391375360 Mar 5 10:30 n12011193_0011_L010185N_D00-n12011400_0012_L010185N_D00.tar -rw-r--r-- 1 mindata e875 1026396160 Mar 5 12:38 n12011311_0011_L010185N_D00-n12011400_0012_L010185N_D00.tar Reran ./mcimport -w howcroft, as previous run bailed on the fat tarfile. ############ # MCIMPORT # ############ mcimport.20070305 - detect duplicates via index search, put em in DUP check for existing output tarfile ########### # GANGLIA # ########### Saturday, found ssh tunnel prescription at http://souptonuts.sourceforge.net/sshtips.htm 1) Create local .ssh/config with Host gate HostName 131.225.193.1 LocalForward 20000 131.225.217.201:80 # User kreymer # User needed only if there is a username mismatch 2) ssh gate 3) browse to http://localhost:20000/minos This config also works : Host gate HostName flxi06.fnal.gov LocalForward 20000 rexganglia2.fnal.gov:80 ######## # GRID # ######## checking adler32 checksum of copied files : from log below, in /local/scratch/kreymer, MINOS26 > srmls -l ${SPATH2}/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root - Checksum value: 1cf17836 MINOS26 > time adler32 N00010819_0000.spill.sntp.R1_18_4.0.root 1cf17836 real 0m38.951s user 0m5.150s sys 0m5.590s That's good. But here's the bad. Using adler32 requires that SRM_PATH be set. MINOS26 > unset SRM_PATH MINOS26 > time adler32 N00010819_0000.spill.sntp.R1_18_4.0.root SRM_PATH is not set But srmcp etc require that it NOT be set. GRRRRRRRRRRRRR ============================================================================= 2007 03 02 ########### # MONTHLY # ########### CFL DATASETS PREDATOR VAULT Leave for next week MYSQL N.B. 
(done Monday) ############ # MCIMPORT # ############ Date: Thu, 01 Mar 2007 18:40:32 -0600 From: Cron Daemon To: minos-data@fnal.gov Subject: Cron ${HOME}/mcimport -c ALL du: `/local/scratch26/mindata/kordosky/n14011012_0008_L010185N_D00_charm.tar.gz.md5': No such file ordirectory Odd, this was picked up later $ dds kordosky/index/n14011012_0003_L010185N_D00_charm-n14011012_0009_L010185N_D00_charm.index -rw-r--r-- 1 mindata e875 287 Mar 2 06:56 kordosky/index/n14011012_0003_L010185N_D00_charm-n14011012_0009_L010185N_D00_charm.index ######## # GRID # ######## Test a >2 GB copy, this works ! MINOS26 > time srmcp -streams_num=1 -server_mode=active file:///N00010819_0000.spill.sntp.R1_18_4.0.root ${SPATH}/kreymer/N00010819_0000.spill.sntp.R1_18_4.0.root real 1m30.052s user 0m19.140s sys 0m14.430s But the directory listing is hosed : Listing is OK for the single file: MINOS26 > srmls ${SPATH2}/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root 2283574599 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root MINOS26 > srmls -l ${SPATH2}/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root 2283574599 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root type:PERMANENT - Checksum value: 1cf17836 - Checksum type: adler32 UserPermission: uid=1060 PermissionsRW GroupPermission: gid=5111 PermissionsRW WorldPermission: R created at:2007/03/02 09:54:39 modified at:2007/03/02 09:54:39 - Original SURL: srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root - Status: null - Type: FILE Get length int : MINOS26 > srmls ${SPATH2}/kreymer//N00010819_0000.spill.sntp.R1_18_4.0.root | head -1 | tr -s ' ' | cut -f 2 -d ' ' 2283574599 But the directory listing is hosed : ########### # ROUNDUP # ########### roundup.20070302 - corrected typo which failed to set SDEST2 srmcp was failing SRV1> ./roundup -w -r cedar near SRV1> ./roundup -w -r cedar far ######## # MRCC # ######## Drop the splitting MC into RUNS, 1598 files is tolerable Running 14:35, on minos-sam02 mrccarch -n MC/Near-L010185 mcout_data/R1_18_2/near/mrnt_data NETWORK - strange, as reported by Ganglia and mrtg, network input and output rates are identical, and about 1.5 MBytes/second DUUUUH, of course ! The input files are in AFS. 
Pending : for MON in 2005-11 2006-01 mrccarch -n Data/Near/{$MON} reco_near/R1_18_2/mrnt_data/${MON} ########### # ROUNDUP # ########### PEND - have 21/24 subruns for N00009104_*.spill.mrnt.cedar*.root 1 02/28 14:14:19 PEND - have 23/24 subruns for N00009143_*.spill.mrnt.cedar*.root 1 02/28 15:16:27 PEND - have 23/24 subruns for N00009146_*.spill.mrnt.cedar*.root 1 02/28 13:40:43 PEND - have 23/24 subruns for N00009162_*.spill.mrnt.cedar*.root 0 03/01 15:48:59 PEND - have 23/24 subruns for N00009165_*.spill.mrnt.cedar*.root 0 03/01 17:20:46 SRV1> grep N00009104 /home/minfarm/lists/bad_runs_mrcc.cedar N00009104_0017.0 2005-11 45832 1 2007-02-28 12:52:00 fnpc230 N00009104_0016.0 2005-11 46210 1 2007-02-28 12:52:08 fnpc230 N00009104_0015.0 2005-11 46073 1 2007-02-28 12:53:22 fnpc201 SRV1> grep N00009143 /home/minfarm/lists/bad_runs_mrcc.cedar N00009143_0004.0 2005-11 45947 1 2007-02-28 12:57:42 fnpc229 SRV1> grep N00009146 /home/minfarm/lists/bad_runs_mrcc.cedar N00009146_0000.0 2005-11 45443 1 2007-02-28 13:29:38 fnpc230 SRV1> grep N00009162 /home/minfarm/lists/bad_runs_mrcc.cedar N00009162_0013.0 2005-11 47438 139 2007-03-01 14:12:27 fnpc161 SRV1> grep N00009165 /home/minfarm/lists/bad_runs_mrcc.cedar So flush 9104, 9143, 9146, 9162 SRV1> ./roundup -f 0 -s 9104.*mrnt -W -r cedar near SRV1> ./roundup -f 0 -s 9143.*mrnt -W -r cedar near SRV1> ./roundup -f 0 -s 9146.*mrnt -W -r cedar near SRV1> ./roundup -f 0 -s 9162.*mrnt -W -r cedar near ########### # GANGLIA # ########### Thanks to sether (Seth Graham) who has moved the minos systems to rexganglia2.fnal.gov/minos/?cMinos Cluster rexganglia2.fnal.gov/minos/?cMinos Server ============================================================================= 2007 03 01 ####### # NET # ####### The 06:00 failover of ESNET to an alternate link seems to have failed. No network traffic from Hirise to the Border Router 06:00 to 06:20, load had been 20 Mbit/sec / ESNET load had been 300 Mbit, failover advertised as 600. Helpdesk ticket 93337, assigned to Andrew Raider # AFS # maintenance done ? helpdesk 93322, assigned to Mengel Reply: work was done from 06:00 to 06:05 ######## # GRID # ######## Try again to access volatile grid dcache, now that we have a /pnfs/minos/farmigrid/volatile/ link. MINOS26 > setup srmcp v1_25_1 unset SRM_PATH setup java v1.5.0 export SRM_CONFIG=/local/scratch26/kreymer/.srmconfig/kreymer.xml SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/fermigrid/volatile srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/kordosky srmcp -debug=true file:///Merged.root ${SPATH2}/kreymer/Merged.root srmls ${SPATH2}/kreymer 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/kreymer 27009843 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fermigrid/volatile/kreymer/Merged.root THIS WORKS !!!! ############ # MCIMPORT # ############ At round 12:00, lots of kordosky files flooded in. At round 13:00, the minos26 load average went up close to 38. $ ps xf | grep md5sum | wc -l 71 It seems to easing off round 13:45, load average at 32, some free CPU. 
$ ps xf | grep md5sum | wc -l 63 ######## # MRCC # ######## Will copy *mrnt* files from these for MON in 2005-11 2006-01 Data/Near/{$MON} -> /pnfs/minos/reco_near/R1_18_2/mrnt_data/${MON} MC/Near-L010185 -> /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data/${RUN} MAKE THE TARGET DIRECTORIES AND SET FAMILIES mkdir /pnfs/minos/reco_near/R1_18_2/mrnt_data chmod 775 /pnfs/minos/reco_near/R1_18_2/mrnt_data ( cd /pnfs/minos/reco_near/R1_18_2/mrnt_data ; \ enstore pnfs --file_family reco_near_R1_18_2_mrnt ) mkdir /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data chmod 775 /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data ( cd /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data; \ enstore pnfs --file_family reco_near_R1_18_2_mrnt ) Created scripts/mrccarch to do the copies MRCCIN=/afs/fnal.gov/files/data/minos/d170/MRCC/sntp/ ./mrccarch -n ${MRCCIN}/MC/Near-L010185 mcout_data/R1_18_2/near/mrnt_data for MON in 2005-11 2006-01 ./mrccarch -n ${MRCCIN}/Data/Near/${MON} reco_near/R1_18_2/mrnt_data/${MON} Sets RUN automatically for latter input path Checks checksum for existing files ? or -q quality ? ============================================================================= 2007 02 28 ############ # SADDRECO # ############ DECLARED 2003 2004 2005-1/2/3 cedar ./saddreco far cedar 2003-07 list 3 DET=far HOSTNU=`hostname -s | cut -c 6-` LOGPAT=/local/scratch${HOSTNU}/kreymer/log FARM=cedar YEAR=2003 ; MONS='07 08 09 10 11 12' YEAR=2004 ; MONS='01 02 03 04 05 06 07 08 09 10 11 12' YEAR=2005 ; MONS='01 02 03' Needed 679 files, Rate was 2.494 Needed /pnfs/minos/reco_far/cedar/.bntp_data/2005-03 Treating 0 files in /pnfs/minos/reco_far/cedar/.bntp_data/2005-03 Needed /pnfs/minos/reco_far/cedar/.bcnd_data/2005-03 Treating 0 files in /pnfs/minos/reco_far/cedar/.bcnd_data/2005-03 Needed /pnfs/minos/reco_far/cedar/.bnts_data/2005-03 Treating 0 files in /pnfs/minos/reco_far/cedar/.bnts_data/2005-03 STARTED Wed Feb 28 06:34:53 2007 FINISHED Wed Feb 28 06:49:07 2007 omniORB: Assertion failed. This indicates a bug in the application using omniORB, or maybe in omniORB itself. file: ../../../../../src/lib/omniORB/orbcore/SocketCollection.cc line: 682 info: pd_refcount > 0 for MON in ${MONS} ; do ./saddreco ${DET} ${FARM} ${YEAR}-${MON} declare 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done grep -v declare /local/scratch26/kreymer/log/saddreco/declare_far_cedar.log | less ############ # DATASETS # ############ datasets.20070228 Corrected m to RawDataWritePools Added g DPOOLG FermigridVolPools had problems with this yesterday, before pools lists existed. OK today ln -s datasets.20070228 datasets ########### # ROUNDUP # ########### roundup.20070228 - added mrnt\. 
to file selection for DFILESL SRV1> ln -sf roundup.20070228 roundup # was roundup.20070224 Created working directories mkdir /pnfs/minos/reco_near/cedar/mrnt_data ( cd /pnfs/minos/reco_near/cedar/mrnt_data ; \ enstore pnfs --file_family reco_near_cedar_mrnt ) mkdir /pnfs/minos/reco_far/cedar/mrnt_data ( cd /pnfs/minos/reco_far/cedar/mrnt_data ; \ enstore pnfs --file_family reco_far_cedar_mrnt ) SRV1> ./roundup -s mrnt -r cedar near OOPS, forgot to set ownership to rubin MINOS01 > chown rubin /pnfs/minos/reco_near/cedar/mrnt_data MINOS01 > chown rubin /pnfs/minos/reco_far/cedar/mrnt_data SRV1> ./roundup -w -r cedar near And recreated far directory with correct protection mkdir /pnfs/minos/reco_far/cedar/mrnt_data chmod 775 /pnfs/minos/reco_far/cedar/mrnt_data ( cd /pnfs/minos/reco_far/cedar/mrnt_data ; \ enstore pnfs --file_family reco_far_cedar_mrnt ) MINOS01 > chown rubin /pnfs/minos/reco_far/cedar/mrnt_data N.B. - rubin normally makes the directories ahead of time. the 'mkdir' in roundup is useless, change to srmmkdir ############ # MCIMPORT # ############ One bad file identified near end of January, removed from index by rhatcher " Just for the record I modified: n11011172_0002_L010185N_D00-n11011172_0010_L010185N_D00.index and removed the line: n11011172_0002_L010185N_D00.tar.gz leaving only the line: n11011172_0010_L010185N_D00.tar.gz This removes the duplicate and *corrupted* copy of subrun 0002 from the index files. The working version comes from: n11011172_0002_L010185N_D00-n11011172_0006_L010185N_D00.tar " ============================================================================= 2007 02 27 ######## # MRCC # ######## Per 13 Feb request, plan to write all files from ${MINOS_DATA}/d170 thru d174 to /pnfs/reco_* or /pnfs/mcout_data/* for DAT in d170 d171 d172 d173 d174 Files are under ${MINOS_DATA}/${DAT}/MRCC/sntp/Data ${MINOS_DATA}/${DAT}/MRCC/sntp/MC/Near-L010185/*.mrnt.R1.18.2.root All files seem to be individually symlinked back to d170 MRCC/MRCCDRIVES has links to d170-d175,d193-d195 No mrnt files are on the d19x drives MC MINOS26 > ls MRCC/sntp/MC/Near-L010185/*.root | wc -l 1596 MINOS26 > find d170 -type f -name \n*mrnt\*root | wc -l 848 MINOS26 > find d172 -type f -name \n*mrnt\*root | wc -l 748 SUM 1596 Data MINOS26 > ls d171/MRCC/sntp/Data/Near/2005-11/N*mrnt*root | wc -l 656 MINOS26 > ls d171/MRCC/sntp/Data/Near/2006-01/N*mrnt*root | wc -l 669 SUM 1325 MINOS26 > for DAT in d170 d171 d172 d173 d174 ; do printf ${DAT} ; find ${DAT} -type f -name N\*mrnt\*root | wc -l ; done d170 0 d171 575 d172 81 d173 474 d174 195 SUM 1325 Where do they go ? MRCC/sntp/Data/Near/{$MON} -> /pnfs/minos/reco_near/R1_18_2/mrnt_data/${MON} MRCC/sntp/MC/ -> /pnfs/minos/mcout_data/R1_18_2/near/mrnt_data/${RUN} ============================================================================= 2007 02 26 ########### # ROUNDUP # ########### Review of PEND subruns for roundup. 
FAR PEND - have 22/24 subruns for F00037676_*.all.sntp.cedar*.root 6 02/19 23:41:20 PEND - have 4/17 subruns for F00037697_*.all.sntp.cedar*.root 0 02/25 23:43:40 PEND - have 22/24 subruns for F00037676_*.spill.bntp.cedar*.root 6 02/19 23:41:54 PEND - have 4/17 subruns for F00037697_*.spill.bntp.cedar*.root 0 02/25 23:44:15 PEND - have 19/24 subruns for F00037221_*.spill.sntp.cedar*.root 45 01/11 23:53:32 PEND - have 23/24 subruns for F00037230_*.spill.sntp.cedar*.root 42 01/15 07:58:09 PEND - have 18/24 subruns for F00037233_*.spill.sntp.cedar*.root 40 01/16 12:25:21 PEND - have 22/24 subruns for F00037676_*.spill.sntp.cedar*.root 6 02/19 23:41:36 PEND - have 4/17 subruns for F00037697_*.spill.sntp.cedar*.root 0 02/25 23:43:57 for RUN in 37676 37697 37676 37697 37221 37230 37233 37676 37697 ; do grep ${RUN} /home/minfarm/lists/bad_runs.cedar ; done ./roundup -f 0 -s F00037676 -r cedar -n far ./roundup -f 0 -s F00037676 -r cedar far for RUN in 37221 37230 37233 37697 ; do grep ${RUN} /home/minfarm/lists/runs_done.cedar ; done NEAR PEND - have 18/24 subruns for N00011577_*.cosmic.sntp.cedar*.root 38 01/18 15:12:27 PEND - have 5/13 subruns for N00011580_*.cosmic.sntp.cedar*.root 34 01/22 20:37:13 PEND - have 4/24 subruns for N00011586_*.cosmic.sntp.cedar*.root 34 01/22 21:32:57 PEND - have 6/24 subruns for N00011589_*.cosmic.sntp.cedar*.root 34 01/22 16:31:44 PEND - have 16/24 subruns for N00011592_*.cosmic.sntp.cedar*.root 34 01/22 22:51:26 PEND - have 18/24 subruns for N00011595_*.cosmic.sntp.cedar*.root 34 01/22 15:00:05 PEND - have 30/31 subruns for N00011651_*.cosmic.sntp.cedar*.root 27 01/29 10:51:09 PEND - have 21/24 subruns for N00011824_*.cosmic.sntp.cedar*.root 2 02/24 01:05:03 PEND - have 20/24 subruns for N00011827_*.cosmic.sntp.cedar*.root 1 02/25 01:49:57 PEND - have 2/16 subruns for N00011830_*.cosmic.sntp.cedar*.root 0 02/26 03:11:21 PEND - have 21/24 subruns for N00011565_*.spill.sntp.cedar*.root 41 01/15 11:29:49 PEND - have 23/24 subruns for N00011568_*.spill.sntp.cedar*.root 41 01/15 14:16:12 PEND - have 18/24 subruns for N00011577_*.spill.sntp.cedar*.root 38 01/18 15:12:57 PEND - have 4/13 subruns for N00011580_*.spill.sntp.cedar*.root 34 01/22 20:38:35 PEND - have 6/24 subruns for N00011586_*.spill.sntp.cedar*.root 34 01/22 21:34:11 PEND - have 6/24 subruns for N00011589_*.spill.sntp.cedar*.root 34 01/22 16:32:32 PEND - have 16/24 subruns for N00011592_*.spill.sntp.cedar*.root 34 01/22 22:52:22 PEND - have 17/24 subruns for N00011595_*.spill.sntp.cedar*.root 34 01/22 15:00:31 PEND - have 5/ 6 subruns for N00011621_*.spill.sntp.cedar*.root 30 01/26 21:56:10 PEND - have 11/12 subruns for N00011643_*.spill.sntp.cedar*.root 29 01/27 23:58:32 PEND - have 4/24 subruns for N00011648_*.spill.sntp.cedar*.root 30 01/27 00:09:36 PEND - have 3/ 5 subruns for N00011701_*.spill.sntp.cedar*.root 20 02/06 01:51:05 PEND - have 23/24 subruns for N00011728_*.spill.sntp.cedar*.root 15 02/11 02:58:48 PEND - have 21/24 subruns for N00011734_*.spill.sntp.cedar*.root 13 02/13 03:27:54 PEND - have 21/24 subruns for N00011804_*.spill.sntp.cedar*.root 5 02/20 23:45:43 PEND - have 21/23 subruns for N00011819_*.spill.sntp.cedar*.root 3 02/23 05:14:33 PEND - have 21/24 subruns for N00011824_*.spill.sntp.cedar*.root 2 02/24 01:06:23 PEND - have 20/24 subruns for N00011827_*.spill.sntp.cedar*.root 1 02/25 01:50:18 PEND - have 2/16 subruns for N00011830_*.spill.sntp.cedar*.root Three 0 02/26 03:11:52 Cosmic 11577 11580 11586 11589 11592 11595 11651 Spill 11565 11568 11577 11580 11586 11589 11592 11595 
11621 11643 11648 11701 11728 11734 ./roundup -f 20 -s N00011651_ -r cedar -n near ./roundup -f 20 -s N00011651_ -r cedar near ############## # MCOUT_DATA # ############## 5 files need to be moved to the Run subdirectories : /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/ n13011458_0009_L010185N_D00.cand.cedar.root n13011453_0007_L010185N_D00.cand.cedar.root n13011455_0002_L010185N_D00.cand.cedar.root n13011456_0000_L010185N_D00.cand.cedar.root n13011457_0010_L010185N_D00.cand.cedar.root ########### # GANGLIA # ########### Note patrol information available offsite, http://d0ora2.fnal.gov/Patrol/sys-config-info/ http://d0ora2.fnal.gov/Patrol/sys-config-info/d0ora2-config.html Also, quite a lot of Ganglia pages: http://cmssrv02.fnal.gov/ganglia/ http://d0om.fnal.gov/d0admin/ganglia/ http://d0online3.fnal.gov/ganglia/ There may be a simpler way to limit kernel data, just do not feed valid data to the ganglia server ! ######### # FNALU # ######### Asked with flxi07 (x86_64) will be available for interactive login email to fnalu-admin ############ # MCIMPORT # ############ First sjc files have been archived ( 11 ), on Sunday. Alphabetized .k5loginfull, .k5loginmin corrected gallag to hgallag in .k5loginmin ########### # ROUNDUP # ########### Running cleanly in cron since Sunday 2007 Feb 25 SRV1> dfarm usage rubin Used: 121267 + Reserved: 0 / Quota: 500000 (MB) Need to examine/flush many pending runs, mostly near. ############ # HELPDESK # ############ Scanned database for kreymer tickets outstanding, 2 in stage assigned 90783 Assigned 1/9/2007 Please install encp v3_6d in AFS ... 87003 Assigned 10/16/2006 kcroninit fails on flxi04, flxi05 and flxi06 ########### # GANGLIA # ########### Note patrol information available offsite, http://d0ora2.fnal.gov/Patrol/sys-config-info/ for example, http://d0ora2.fnal.gov/Patrol/sys-config-info/d0ora2-config.html ============================================================================= 2007 02 23 ########## # DCACHE # ########## Need to delete old files, owned by rubin. Try this from flxi04, where I have tested latest srmcp v1_25_1 Updated .grid with howie's certs. Updated .srmconfig/config.xml per local file locations FLXI04 > scp minfarm@fnpcsrv1:.grid/user* .grid/ usercert.pem 100% |**********************************************************************************************| 1533 00:00 userkey.pem 100% |**********************************************************************************************| 1131 00:00 FLXI04 > scp minfarm@fnpcsrv1:.grid/x5* .grid/ FLXI04 > for SRMP in `head -1 FLOSS`; do srmls ${SRMP} ; done 559681707 srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037621_0000.all.sntp.cedar.0.root Try some other functions : FLXI04 > SRMTD=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/reco_far/cedar/sntp_data/TEST FLXI04 > srmmkdir ${SRMTD} looked OK, did nothing... OK now I see it. Need to use the extended srm path for v2 functons : srm://fndca1.fnal.gov:8443/pnfs/ becomes srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs Edited FLOSS file, reran FLXI04 > for SRMP in `cat FLOSS`; do srmrm ${SRMP} ; done Running clean. check for files : SRV1> for FILL in `cat /tmp/FLOGS` ; do ls -l ${FILL} ; done Rewrote the files SRV1> ./roundup -c -w -r cedar far ; ./roundup -c -w -r cedar near Note to admins : There were 42 Minos farm output files listed, plus 4 not in the original list. I removed all 46 files from dcache ( using srmrm ), at around 11:30 this morning. 
I have rewritten all 46 files to DCache, as of about 12:20 . ( We do not remove these files from our write buffer till they are on tape. ) They should normally be moving to tape in about 4 hours. ####### # DAQ # ####### Informed minos-data and gfp The web page claims that the far_dcs_archiver is running, and that it last logged F070214_000008.mdcs.root Feb 14 20:53 In fact it last logged F070215_000010.mdcs.root Feb 16 19:10 It probably needs a shutdown/restart . ########### # ROUNDUP # ########### Informed minos_batch Today I have placed the regular concatenation of ntuples for farm output into the crontab of minfarm@fnpcsrv1. It is scheduled to run at 08:00 daily. See file ~minfarm/scripts/crontab.dat This was ready to go a week ago, but I've been busy with other data handling issues since then, for obvious reasons. ######## # GRID # ######## Try again to access volatile grid dcache, now that we have a tested srmcp v1_25_1 on central systems. MINOS26 > setup srmcp v1_25_1 unset SRM_PATH setup java v1.5.0 export SRM_CONFIG=/local/scratch26/kreymer/.srmconfig/kreymer.xml SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos MINOS26 > srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/kordosky MINOS26 > srmmkdir ${SPATH2}/kreymer MINOS26 > srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/mcimport 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/kordosky 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos/kreymer Cannot write, MINOS26 > srmcp -debug=true file:///Merged.root ${SPATH}/kreymer/Merged.root ... Exception: user's path ///pnfs/fnal.gov/usr/fermigrid/volatile/minos/kreymer/Merged.root is not subpath of the user's root ... Reported to dcache-admin ============================================================================= 2007 02 22 ######## # MRCC # ######## Per 13 Feb request, plan to write all files from $MINOS_DATA}/d170 thru d174 to /pnfs/reco_* or /pnfs/mcout_data/* ############ # MCIMPORT # ############ 08:08 M26 > ln -sf .k5loginfull .k5login 08:30 - crontab.dat updated to NOT run mcimport.20070203 Reran manual catchup 08:45 ./mcimport ALL 13:46 - crontab crontab.dat - so the above change is effective ! ####### # AFS # ####### tjyang reported $MINOS_DATA/d167 not accessible, since Monday AFS outage. MINOS26 > ls /afs/fnal.gov/files/data/minos/d167 ls: /afs/fnal.gov/files/data/minos/d167: No such file or directory Note that this directory does show in the next higher level directory: MINOS26 > ls -l /afs/fnal.gov/files/data/minos ls: /afs/fnal.gov/files/data/minos/d167: No such file or directory total 788 drwxrwxrwx 14 root root 2048 Jan 4 2006 beam_data drwxrwxrwx 18 root root 2048 Jan 4 2006 beam_data1 drwxrwxrwx 2 root root 2048 Jan 11 2006 beam_data2 .. Helpdesk ticket 92974 issued around 10:56. Resolved around 13:26. Files are available again. ########## # DCACHE # ########## DCache admins report files lost in DCache write pools Monday 19 Feb during the PNFS outage. 
/pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037621_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037621_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037624_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037624_0000.spill.bntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037624_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037628_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037628_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037629_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037629_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037633_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037633_0000.spill.bntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037633_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037636_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037636_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037639_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037639_0000.spill.bntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037639_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037642_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037642_0000.spill.bntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037642_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037645_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037645_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037648_0000.all.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037648_0000.spill.bntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037648_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011742_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011742_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011745_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011745_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011750_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011750_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011755_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011755_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011758_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011758_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011761_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011761_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011764_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011764_0000.spill.sntp.cedar.0.root 
/pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011769_0000.cosmic.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_near/cedar/sntp_data/2007-02/N00011769_0000.spill.sntp.cedar.0.root /pnfs/fs/usr/minos/reco_far/cedar/.bntp_data/2007-02/F00037621_0000.spill.bntp.cedar.0.root Four more files are sitting in WRITE : F00037628_0000.spill.bntp.cedar.0.root F00037629_0000.spill.bntp.cedar.0.root F00037636_0000.spill.bntp.cedar.0.root F00037645_0000.spill.bntp.cedar.0.root Reported to dcache-admin, awaiting guidance. Got OK to remove files from PNFS, and rewrite. Created /tmp/FLOGS on fnpcsrv1, containing the above 42+4 files. Checked they are in WRITE with for FILE in `ls WRITE/` ; do grep -q ${FILE} /tmp/FLOGS || ls WRITE/${FILE} ; done for FILL in `cat /tmp/FLOGS` ; do FIL=`echo $FILL | cut -f 8 -d /` SL=`ls WRITE/${FIL} -l | tr -s ' ' | cut -f 5 -d ' '` SD=`ls -l ${FILL} | tr -s ' ' | cut -f 5 -d ' '` printf "${SL}\n${SD}\n" [ ${SL} -ne ${SD} ] && echo OOPS done for FILL in `cat /tmp/FLOGS` ; do ls -l ${FILL} ; done Oops, files are owned by rubin. So in ROUNTMP/FLOSS, created list of files with srm paths, like srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/minos/reco_far/cedar/sntp_data/2007-02/F00037621_0000.all.sntp.cedar.0.root for FILL in `cat FLOSS` ; do srmls ${FILL} ; done SRV1> date ; for FILL in `cat FLOSS` ; do srmrm ${FILL} ; done Thu Feb 22 19:38:11 CST 2007 date Return code: SRM_FAILURE Explanation: java.lang.NullPointerException MINOS26 > setup srmcp v1_25_1 MINOS26 > unset SRM_PATH MINOS26 > export SRM_CONFIG=/home/mindata/.srmconfig/config.xml MINOS26 > setup java v1.5.0 ..... down the drain, nothing is working..... Will have to start fresh tomorrow, try to get rubin certificate on minos26 where we have the current srm, for which srmrm might work. ============================================================================= 2007 02 21 ############ # MCIMPORT # ############ mcimport.20070220 log sorting, for howcroft, the GLOGFS directories are L010185N_n1101 3346 L010185N_n1201 3405 the JLOGFS directories are L010185_near 3346 L010185_rock 3405 M26 > ./mcimport.20070220 -w -f 144 howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log M26 > ln -s mcimport.20070220 mcimport M26 > ./mcimport -w kordosky OK, logging activity to /local/scratch26/mindata/kordosky/log/mcimport.log mcimport.20070222 Links .k5login to .k5loginmin when disk space is low. on minos-sam02 and 03, ln -sf .k5loginfull .k5login Do this on minos26 as soon as we have an idle patch This is so that mcimport can do ln -sf .k5loginmin .k5login when the disk is full.
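For the record, the disk-space guard could look something like the sketch below. This is not the actual mcimport.20070222 code; the 90% threshold and the scratch path are assumptions, and only the ln -sf of .k5loginmin / .k5loginfull comes from the note above.
# sketch only - swap .k5login based on free space in the mindata scratch area
SCRATCH=/local/scratch26/mindata            # assumed mount point
PCT=`df -P ${SCRATCH} | tail -1 | tr -s ' ' | cut -f 5 -d ' ' | tr -d '%'`
if [ ${PCT} -ge 90 ] ; then
    ln -sf .k5loginmin  ${HOME}/.k5login    # shut out new imports when nearly full
else
    ln -sf .k5loginfull ${HOME}/.k5login    # restore normal access
fi
The symlink flip is the whole mechanism; the real script presumably also logs the switch.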
LOG corruption, arms reports corrupt log files, n12011205_0001_L010185N_D00.log n12011205_0002_L010185N_D00.log n11011205_0001_L010185N_D00.log n11011205_0002_L010185N_D00.log Scanning all howcroft logs, for DIR in `ls -d howcroft/log/L*` ; do echo $DIR ; for FIL in `ls ${DIR}` ; do wc -w ${DIR}/${FIL} ; done ; done | grep ' 0 ' 0 howcroft/log/L010185N_n1101/n11011205_0001_L010185N_D00.log 0 howcroft/log/L010185N_n1101/n11011205_0002_L010185N_D00.log 0 howcroft/log/L010185N_n1201/n12011205_0001_L010185N_D00.log 0 howcroft/log/L010185N_n1201/n12011205_0002_L010185N_D00.log 0 howcroft/log/L010185_near/L010185_near_1205_1.log 0 howcroft/log/L010185_near/L010185_near_1205_2.log 0 howcroft/log/L010185_rock/L010185_rock_1205_1.log 0 howcroft/log/L010185_rock/L010185_rock_1205_2.log Jan 24 04:59 howcroft/log/L010185N_n1101/n11011205_0001_L010185N_D00.log Jan 24 05:09 howcroft/log/L010185N_n1101/n11011205_0002_L010185N_D00.log Jan 24 05:10 howcroft/log/L010185N_n1201/n12011205_0001_L010185N_D00.log Jan 24 05:10 howcroft/log/L010185N_n1201/n12011205_0002_L010185N_D00.log Jan 24 04:59 howcroft/log/L010185_near/L010185_near_1205_1.log Jan 24 05:09 howcroft/log/L010185_near/L010185_near_1205_2.log Jan 24 05:09 howcroft/log/L010185_rock/L010185_rock_1205_1.log Jan 24 05:10 howcroft/log/L010185_rock/L010185_rock_1205_2.log kordosky logs are clean. ============================================================================= 2007 02 20 ####### # SRM # ####### per timur, installed java v1.5.0 This works, setup java v1.5.0 can use srmls from srmcp v1_25 Still need to unset SRM_PATH ########## # DCACHE # ########## Kennedy found that 20a was holding files, this has been released, files going to tape now. Checking also raw data, MINOS26 > cd /pnfs/minos/neardet_data/2007-02 MINOS26 > for FIL in `ls` ; do printf "${FIL} " ; head -1 ".(use)(4)(${FIL})" ; sleep 1 ; done N00011672_0002.mdaq.root VO2307 N00011672_0003.mdaq.root VO2307 ... N00011798_0001.mdaq.root N00011799_0000.mdaq.root N00011800_0000.mdaq.root N00011801_0000.mdaq.root N00011802_0000.mdaq.root N00011803_0000.mdaq.root N00011804_0000.mdaq.root N00011804_0001.mdaq.root N00011804_0002.mdaq.root N00011804_0003.mdaq.root N00011804_0004.mdaq.root N00011804_0005.mdaq.root N00011804_0006.mdaq.root N00011804_0007.mdaq.root N00011804_0008.mdaq.root N00011804_0009.mdaq.root N00011804_0010.mdaq.root MINOS26 > cd /pnfs/minos/fardet_data/2007-02 MINOS26 > for FIL in `ls` ; do printf "${FIL} " ; head -1 ".(use)(4)(${FIL})" ; sleep 1 ; done ... F00037654_0017.mdaq.root F00037665_0000.mdaq.root F00037670_0000.mdaq.root F00037671_0000.mdaq.root F00037676_0001.mdaq.root F00037676_0002.mdaq.root F00037676_0003.mdaq.root F00037676_0004.mdaq.root F00037676_0005.mdaq.root F00037676_0006.mdaq.root F00037676_0007.mdaq.root F00037676_0008.mdaq.root F00037676_0009.mdaq.root F00037676_0010.mdaq.root F00037676_0011.mdaq.root F00037676_0012.mdaq.root F00037676_0013.mdaq.root F00037676_0014.mdaq.root F00037676_0015.mdaq.root F00037676_0016.mdaq.root F00037676_0017.mdaq.root F00037676_0018.mdaq.root These are less than 24 hours old. 
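A variant of the loop above that reports only the raw files with no tape label yet (a sketch; it assumes the layer-4 metadata is simply empty until Enstore has written the file, as the blank entries in the listing suggest):
# sketch - list raw files in a month directory that are not yet on tape
cd /pnfs/minos/fardet_data/2007-02
for FIL in `ls` ; do
    VOL=`head -1 ".(use)(4)(${FIL})"`
    [ -z "${VOL}" ] && echo "NOT ON TAPE ${FIL}"
    sleep 1
done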
############ # MCIMPORT # ############ MCIN M26 > ./mcimport.20070216 -w howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log FARM OUTPUT Checking farm output locations, many not in proper RUN subdirectories SRV1> FILS=`ls /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/sntp_data | grep '^n'` SRV1> printf "${FILS}\n" | head n11011010_0002_L010185N_D00.sntp.cedar.root n13011014_0008_L010185N_D00.sntp.cedar.root n13011014_0010_L010185N_D00.sntp.cedar.root n13011015_0000_L010185N_D00.sntp.cedar.root n13011015_0003_L010185N_D00.sntp.cedar.root n13011015_0005_L010185N_D00.sntp.cedar.root n13011015_0006_L010185N_D00.sntp.cedar.root n13011029_0003_L010185N_D00.sntp.cedar.root n13011033_0010_L010185N_D00.sntp.cedar.root n13011034_0002_L010185N_D00.sntp.cedar.root SRV1> printf "${FILS}\n" | wc -l 119 for FIL in ${FILS} ; do RUN=`echo ${FIL} | cut -c 6-8` ; FIN=`echo ${FIL} | cut -f 1 -d . ; ls /pnfs/minos/mcin_data/near/daikon_00/L010185N/${RUN}/${FIN}.reroot.root ; done SRV1> find /pnfs/minos/mcout_data/cedar/near/daikon_00 -name n\* -maxdepth 3 | wc -l 386 SRV1> find /pnfs/minos/mcout_data/cedar/near/daikon_00 -name n\* -maxdepth 3 | cut -f 7,8 -d '/' | sort -u daikon_00/L010185N daikon_00/L010200N daikon_00/L100200N daikon_00/L150200N daikon_00/L250200N FILES=`find /pnfs/minos/mcout_data/cedar/near/daikon_00 -name n\* -maxdepth 3` FIRST=`printf ${FILES} | head` Check that the files are not on the same tape, via VP1/VP2 Check that the checksums are the same, EC1/EC2 for FILE in ${FILES} ; do for FILE in ${FIRST} ; do PAT=`echo ${FILE} | cut -f -9 -d /` FIL=`echo ${FILE} | cut -f 10 -d /` RUN=`echo ${FIL} | cut -c 6-8` if [ -r ${PAT}/${RUN}/${FIL} ] ; then MD1=`( cd ${PAT} ; cat ".(use)(4)(${FIL})" )` MD2=`( cd ${PAT}/${RUN} ; cat ".(use)(4)(${FIL})" )` VP1=`printf "${MD1}\n" | head -2` VP2=`printf "${MD2}\n" | head -2` LN1=`printf "${MD1}\n" | tail +3 | head -1` LN2=`printf "${MD2}\n" | tail +3 | head -1` CS1=`printf "${MD1}\n" | tail -1` CS2=`printf "${MD2}\n" | tail -1` # printf "${MD1}\n" echo ${VP1} ${LN1} ${CS1} echo ${VP2} ${LN2} ${CS2} [ "${VP1}" == "${VP2}" ] && printf "${PAT}/${FIL}\n OOPS, same volume \n" [ "${LN1}" != "${LN2}" ] && printf "${PAT}/${FIL}\n OOPS, wrong length \n" [ "${CS1}" != "${CS2}" ] && printf "${PAT}/${FIL}\n OOPS, wrong checksum \n" else printf " Missing ${PAT}/${RUN}/${FIL} \n" fi done 2>&1 | tee /tmp/runscan.log Missing /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/n13011458_0009_L010185N_D00.cand.cedar.root Missing /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/n13011453_0007_L010185N_D00.cand.cedar.root Missing /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/n13011455_0002_L010185N_D00.cand.cedar.root Missing /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/n13011456_0000_L010185N_D00.cand.cedar.root Missing /pnfs/minos/mcout_data/cedar/near/daikon_00/L010185N/cand_data/145/n13011457_0010_L010185N_D00.cand.cedar.root SRV1> grep -B 1 length /tmp/runscan.log | grep pnfs/minos | sort /pnfs/minos/mcout_data/cedar/near/daikon_00/ L010185N/cand_data/n11011010_0002_L010185N_D00.cand.cedar.root L010185N/cand_data/n13011029_0003_L010185N_D00.cand.cedar.root L010185N/cand_data/n13011049_0002_L010185N_D00.cand.cedar.root L010185N/sntp_data/n11011010_0002_L010185N_D00.sntp.cedar.root L010185N/sntp_data/n13011029_0003_L010185N_D00.sntp.cedar.root L010185N/sntp_data/n13011049_0002_L010185N_D00.sntp.cedar.root 
L010200N/cand_data/n13011009_0005_L010200N_D00.cand.cedar.root L010200N/sntp_data/n13011009_0005_L010200N_D00.sntp.cedar.root L100200N/cand_data/n13011015_0008_L100200N_D00.cand.cedar.root L100200N/cand_data/n13011045_0001_L100200N_D00.cand.cedar.root L100200N/sntp_data/n13011015_0008_L100200N_D00.sntp.cedar.root L100200N/sntp_data/n13011045_0001_L100200N_D00.sntp.cedar.root L250200N/cand_data/n13011004_0007_L250200N_D00.cand.cedar.root L250200N/cand_data/n13011012_0008_L250200N_D00.cand.cedar.root L250200N/cand_data/n13011015_0006_L250200N_D00.cand.cedar.root L250200N/sntp_data/n13011004_0007_L250200N_D00.sntp.cedar.root L250200N/sntp_data/n13011012_0008_L250200N_D00.sntp.cedar.root L250200N/sntp_data/n13011015_0006_L250200N_D00.sntp.cedar.root ============================================================================= 2007 02 19 ########### # ROUNDUP # ########### roundup.20070219 Added -c (cron/current) to run in foreground, as in mcimport Tested : SRV1> ./roundup.20070219 -c -r cedar far ; ./roundup.20070219 -c -r cedar near This worked as expected, move to this in production SRV1> ln -sf roundup.20070219 roundup ########### # GANGLIA # ########### Requested split from Minos to Minos Cluster and Minos Servers Requested move from rexganglia2/farms to rexganglia2/minos ####### # AFS # ####### AFS timeouts starting around 08:50 09:17 - crontab -r on kreymer,minodata@minos26 A PDU serving many of the servers has failed, no estimate (09:12) 10:00 call from helpdesk, service is back. ########### # ENSTORE # ########### After about 12:52 ( PNFS sampler ) or 12:58 ( raw data logging ) PNFS went offline Helpdesk ticket 92758 14:21 assigned to HARRISON, MICHAEL of the CD-SF/DMS/DSC/SSA Group 14:55 assigned to TIMM, STEVE of the CD-SF/GF/FGS Group. 15:09 assigned to HARRISON, MICHAEL of the CD-SF/DMS/DSC/SSA Group howcroft mcimport was running, failed on n11011329_0008_L010185N_D00-n11011330_0004_L010185N_D00.tar n11011330_0005_L010185N_D00-n11011330_0009_L010185N_D00.tar should pick them up next iteration Estimate is service back by 15:00 Data logging resumed at about 14:47 ####### # SSH # ####### Created new id_rsa (.pub) on desktop, with ssh-keygen -t rsa for use in connecting to csf.rl.ac.uk for grid data tests ####### # SRM # ####### Testing bootleg installation of srm v1.24 under mindata@minos26 Can create directory with srmmkdir, no special config. M26 > pwd /local/scratch26/kreymer/SRM M26 > export SRM_CONFIG=/home/mindata/.srmconfig/kreymer.xml M26 > SPATH3=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcin_data/far/daikon_00/L010185N/161 M26 > srmclient/bin/srmls ${SPATH3} srm client error: srm ls responce path details array is null! M26 > srmclient/bin/srmmkdir ${SPATH3} M26 > srmclient/bin/srmls ${SPATH3} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/mcin_data/far/daikon_00/L010185N/161 Now try a test of flxi04, for public testing : FLXI04 > mkdir /var/tmp/kreymer FLXI04 > cd /var/tmp/kreymer FLXI04 > scp -r mindata@minos26:/home/mindata/.srmconfig .srmconfig FLXI04 > scp -r mindata@minos26:/home/mindata/.grid .grid FLXI04 > scp -r minos26:/local/scratch26/kreymer/SRM SRM FLXI04 > cp -vax /var/tmp/kreymer /usr/scratch/kreymer FLXI04 > nedit .srmconfig/kreymer.xml changed /home/mindata/.grid to /usr/scratch/kreymer/.grid FLXI04 > . 
/afs/fnal.gov/ups/etc/setups.sh FLXI04 > export PRODUCTS=/afs/fnal.gov/files/code/e875/general/ups/db FLXI04 > setup srmcp v1_25_1 FLXI04 > export SRM_CONFIG=/usr/scratch/kreymer/.srmconfig/kreymer.xml FLXI04 > SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos The usual failures with java traceback. Now try the copy via /local/scratch26/kreymer/SRM/srmclient from https://srm.fnal.gov/twiki/pub/SrmProject/SrmcpClient/srmcp_v1_24_NULL.tar FLXI04 > SRM/srmclient/bin/srmls ${SPATH2} 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/neardet_data 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/hpss 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos/fardet_logs ############ # MCIMPORT # ############ REROOTS - touch em up No new directories, so just run the script blindly M26 > cp -a AFSS/mcimport.20070216 . M26 > ./mcimport.20070216 howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log ============================================================================= 2007 02 16 ########## # DCACHE # ########## Have not restarted cron jobs for kreymer or mindata Came up around 15:30 yesterday, no announcement. Many problems, data archiving is stuck. Summary since the shutdown : 2007-02-16 08:06:49 N00011758_0001.mdaq.root OK 2007-02-16 07:59:40 F00037642_0001.mdaq.root OK 2007-02-16 07:58:39 F00037642_0001.mdaq.root NOT_FINISHED 2007-02-15 21:54:46 N070215_000006.mdcs.root OK 2007-02-15 19:08:20 F070215_000010.mdcs.root OK 2007-02-15 15:49:40 F00037642_0000.mdaq.root OK 2007-02-15 15:48:21 N00011758_0000.mdaq.root OK Geoff Pearce restarted archiver at 08:06, archiver noted N00011758_0000 complete, now stuck on _0001 far archiver seems to be running twice ------------------------------------------------------------------- Issued helpdesk ticket 92654 ------------------------------------------------------------------- Short Description: FNDCA DCache system is failing Problem Description: Since the upgrades yesterday, we have observed at least the following : Minos raw data logging fails, FTP's do not complete. dccp -P commands hang up, should complete immediately dccp copy commands move data, but time out after 5 minutes Details have been reported to dcache-admin via email All Minos data handling via DCache is down, specifically raw data logging Monte Carlo import Farm processing Analysis ------------------------------------------------------------------- ------------------------------------------------------------------- 16:15 - restarted FD archivers, after DCache restart ND is busy with an access. normal dccp and dccp -P commands seem to be OK again. 16:17 - successful F00037642_0004.mdaq.root, need to wait 10' for next 16:28 - successful F00037642_0005.mdaq.root 16:38 - successful F00037642_0006.mdaq.root ... ############ # MCIMPORT # ############ Cleared off some tarfiles preparing for resumed imports : Kordosky had nothing queued for writing, just a bit to purge.
M26 > ./mcimport -w kordosky 18:33 M26 > ./AFSS/mcimport.20070216 howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log mkdir misspelled midir, reran reran forgot to change MIPNFS to PNFS, so size check failed on f21011501_0000_L010185N_D00.reroot.root moved to howcroft/far/mcin/dcache reran OK until a new directory was created, owned by mindata not kreymer need to do srm_mkdir M26> rmdir /pnfs/minos/mcin_data/far/daikon_00/L010185N/151 This will take some work, using srmmkdir for the first time. Early tests failed with srm v1_21 Make these by hand for these files : MINOS26 > mkdir /pnfs/minos/mcin_data/far/daikon_00/L010185N/151 for 151 through 160 reran, drat, ran it twice and the pid interlock failed !!!! copying both these, file sizes match OK f21011511_0000 moved manually to dcache f21011510_0000 reran, OK so far ! Still need to implement working srmmkdir. ############ # ARCHIVER # ############ Tracking down archiver, minos@minos-beamdata crontab runs /home/minos/bin/archiverstatus.sh This references /home/minos/bin/init/archiver script which, for starting, runs /home/minos/bin/archiver_krb.py 1> /data/logs/archiver.log 2>&1 & which does the actual ftp with import gssftp gssftp gssftp.py seems to be vintage May 9 2006 /home/minos/kftp/v3_6/NULL/lib/gssftp.py Note empty pid file, /var/lock/beam/archiver.pid init/archived gets pid from ps -f --cols 132 -u minos | grep archiver_ | grep -v grep | awk '{print $2}' right after starting archiver_krb.py, with no delay. This seems corrected as of 18:05, on a restart of the server Perhaps the server crashed instantly on restart, per buckley. ############ # PREDATOR # ############ 18:07 ./predator 2007-02 Ran cleanly, doing something in all streams 18:58 crontab crontab.dat ============================================================================= 2007 02 15 ########## # DCACHE # ########## down at 06:00 for maintenance ####### # AFS # ####### HOWTO.afs - created with AFS management guidance MINOS26 > fs listcells | tee /tmp/listcells MINOS26 > wc -l /tmp/listcells 159 /tmp/listcells MINOS26 > grep fnal /tmp/listcells Cell fnal.gov on hosts fsus03.fnal.gov fsus01.fnal.gov fsus04.fnal.gov.
MINOS26 > for DIR in `ls` ; do fs whereis ${DIR} ; done | cut -f 6 -d ' ' | sort -u fsus-minos01.fnal.gov fsus02.fnal.gov fsus05.fnal.gov fsus06.fnal.gov fsus07.fnal.gov fsus08.fnal.gov MINOS26 > DIRS=`ls` MINOS26 > WHERES=`for DIR in ${DIRS} ; do fs whereis ${DIR} ; done` MINOS26 > printf "${WHERES}\n" | grep minos01 | wc -l 211 MINOS26 > printf "${WHERES}\n" | grep fsus02 | wc -l 6 MINOS26 > printf "${WHERES}\n" | grep fsus05 | wc -l 3 MINOS26 > printf "${WHERES}\n" | grep fsus06 | wc -l 1 MINOS26 > printf "${WHERES}\n" | grep fsus07 | wc -l 1 MINOS26 > printf "${WHERES}\n" | grep fsus08 | wc -l 3 MINOS26 > printf "${WHERES}\n" | grep fsus02 File d08 is on host fsus02.fnal.gov File d35 is on host fsus02.fnal.gov File d50 is on host fsus02.fnal.gov File d59 is on host fsus02.fnal.gov File d63 is on host fsus02.fnal.gov File validation is on host fsus02.fnal.gov MINOS26 > printf "${WHERES}\n" | grep fsus05 File crl_data is on host fsus05.fnal.gov File logbook is on hosts fsus05.fnal.gov fsus08.fnal.gov File offline_monitor is on host fsus05.fnal.gov MINOS26 > printf "${WHERES}\n" | grep fsus06 File beam_docs is on host fsus06.fnal.gov MINOS26 > printf "${WHERES}\n" | grep fsus07 File log_data is on host fsus07.fnal.gov MINOS26 > printf "${WHERES}\n" | grep fsus08 File d31 is on host fsus08.fnal.gov File d67 is on host fsus08.fnal.gov File logbook is on hosts fsus05.fnal.gov fsus08.fnal.gov ============================================================================= 2007 02 14 ########## # DCACHE # ########## Schedule shutdown during PNFS outage MINOS26 > echo 'crontab -r' | at 05:30 job 13 at 2007-02-15 05:30 M26 > echo 'crontab -r' | at 20:00 job 14 at 2007-02-14 20:00 ########### # ROUNDUP # ########### ############ # MINOSCVS # ############ Created .k5login backups, cleaned up removed west, minoscvs ######### # MYSQL # ######### Speakman is having trouble connectiong, from a particular offsite host. Probably a firewall problem. ERROR 2003 (HY000): Can't connect to MySQL server on 'minos-db1.fnal.gov' (113) ERROR 2003 (HY000): Can't connect to MySQL server on 'minos-mysql1.fnal.gov' (113) Suggested telnetting to the port : MINOS26 > telnet minos-mysql1 3306 Trying 131.225.193.13... Connected to minos-mysql1.fnal.gov (131.225.193.13). Escape character is '^]'. 8 4.1.11-log%? uC.YY44H,M,1R*yG6vrYG exit #08S01Bad handshakeConnection closed by foreign host. Confirmed, this port is being blocked by his firewall. My preference is to say OK, they have got what they want. I prefer not to override security policies of remote sites. ########### # GANGLIA # ########### ########### # BEAMLOG # ########### Updated to remove HTML content ( internal ) and suppress entries for NCYCLE or NBEAM 0, we had previously duplicated the previous entry ! 
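The suppression just described amounts to a guard of roughly this shape (a sketch, not the actual beam_log code; NCYCLE and NBEAM are taken from the note above, everything else is assumed):
# sketch - inside the per-record loop, skip a bad beam record
# rather than repeating the previous entry
case "${NCYCLE}${NBEAM}" in
    ''|*[!0-9]*) continue ;;             # empty or non-numeric (e.g. NaN)
esac
[ "${NCYCLE}" -eq 0 -o "${NBEAM}" -eq 0 ] && continue
The non-numeric case would also cover the '[: NaN' complaints noted below.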
Getting some messages like /afs/...beam_log: line 61: [: NaN time dd if=/dev/zero of=thous bs=1M count=1000 1000+0 records in 1000+0 records out real 0m12.699s user 0m0.000s sys 0m5.600s AKS3 > ls -l thous -rw-r--r-- 1 kreymer 1525 1048576000 Feb 13 11:34 thous AKS3 > time dd if=/dev/zero of=single bs=1048576000 count=1 1+0 records in 1+0 records out real 0m13.814s user 0m0.000s sys 0m5.380s AKS3 > ls -l total 2460012 -rw-r--r-- 1 kreymer 1525 419430400 Feb 6 12:00 TEST -rw-r--r-- 1 kreymer 1525 1048576000 Feb 13 11:35 single -rw-r--r-- 1 kreymer 1525 1048576000 Feb 13 11:34 thous AKS3 > time dd if=/dev/zero of=funny bs=1048576123 count=1 1+0 records in 1+0 records out real 0m13.911s user 0m0.000s sys 0m5.220s AKS3 > ls -l funny -rw-r--r-- 1 kreymer 1525 1048576123 Feb 13 12:00 funny ########## # DCACHE # ########## Pool allocation adjustments for expanded pools Thursday Taking an inventory of tags on various sntp directories MINOS26 > for DIR in `ls -d reco_near/*/sntp_data` ; do printf "${DIR} `cat ${DIR}/'.(tag)(file_family)'`\n" ; done reco_near/R1.11/sntp_data sntp_near_R1_11 reco_near/R1.12/sntp_data sntp_near_R1_12 reco_near/R1.14/sntp_data sntp_near_R1_14 reco_near/R1.16/sntp_data reco_near_R1_16 reco_near/R1/sntp_data sntp_near_R1 reco_near/R1_17/sntp_data reco_near_R1_17 reco_near/R1_18/sntp_data sntp_near_R1_18_0 reco_near/R1_18_2/sntp_data reco_near_R1_18_2 reco_near/R1_18_2_temp/sntp_data reco_near_R1_18 reco_near/R1_18_3/sntp_data reco_near_R1_18_3 reco_near/R1_18_4/sntp_data reco_near_R1_18_4 reco_near/R1_21/sntp_data reco_near_R1_21 reco_near/R1_23/sntp_data reco_near_R1_23 reco_near/R1_23a/sntp_data reco_near_R1_23a reco_near/R1_24/sntp_data reco_near_R1_24 reco_near/R1_24a/sntp_data reco_near_R1_24a reco_near/R1_24b/sntp_data reco_near_R1_24b reco_near/R1_24c/sntp_data reco_near_R1_24c reco_near/S06-05-25-R1-22/sntp_data reco_near_S06-05-25-R1-22 reco_near/S06-06-22-R1-22/sntp_data reco_near_S06-06-22-R1-22 reco_near/cedar/sntp_data reco_near_cedar_sntp MINOS26 > for DIR in `ls -d reco_far/*/sntp_data` ; do printf "${DIR} `cat ${DIR}/'.(tag)(file_family)'`\n" ; done reco_far/R1.11/sntp_data sntp_data_R1_11 reco_far/R1.12/sntp_data sntp_data_R1_12 reco_far/R1.14/sntp_data sntp_data_R1_14 reco_far/R1.16/sntp_data reco_far_R1_16 reco_far/R1.16a/sntp_data sntp_near_R1_16a reco_far/R1_17/sntp_data reco_far_R1_17 reco_far/R1_17a.0/sntp_data reco_far_R1_17 reco_far/R1_18/sntp_data reco_far_R1_18 reco_far/R1_18_2/sntp_data reco_far_R1_18_2 reco_far/R1_18_2_temp/sntp_data minos reco_far/R1_18_2a/sntp_data reco_far_R1_18_2a reco_far/R1_18_4/sntp_data reco_far_R1_18_4 reco_far/R1_21/sntp_data reco_far_R1_21 reco_far/R1_23/sntp_data reco_far_R1_23 reco_far/R1_23a/sntp_data reco_far_R1_23a reco_far/R1_24/sntp_data reco_far_R1_24 reco_far/R1_24a/sntp_data reco_far_R1_24a reco_far/R1_24b/sntp_data reco_far_R1_24b reco_far/R1_24c/sntp_data reco_far_R1_24c reco_far/S06-05-25-R1-22/sntp_data reco_far_S06-05-25-R1-22 reco_far/S06-06-22-R1-22/sntp_data reco_far_S06-06-22-R1-22 reco_far/cedar/sntp_data reco_far_cedar_sntp MINOS26 > for DIR in `ls -d reco_far/*/.bntp_data` ; do printf "${DIR} `cat ${DIR}/'.(tag)(file_family)'`\n" ; done reco_far/R1_18/.bntp_data reco_far_R1_18 reco_far/R1_18_2/.bntp_data reco_far_R1_18_2 reco_far/R1_18_2_temp/.bntp_data minos reco_far/R1_18_2a/.bntp_data reco_far_R1_18_2a reco_far/R1_18_4/.bntp_data reco_far_R1_18_4 reco_far/R1_23/.bntp_data reco_far_R1_23 reco_far/R1_23a/.bntp_data reco_far_R1_23a reco_far/R1_24/.bntp_data reco_far_R1_24 
reco_far/R1_24a/.bntp_data reco_far_R1_24a reco_far/R1_24b/.bntp_data reco_far_R1_24b reco_far/R1_24c/.bntp_data reco_far_R1_24c reco_far/S06-05-25-R1-22/.bntp_data reco_far_S06-05-25-R1-22 reco_far/S06-06-22-R1-22/.bntp_data reco_far_S06-06-22-R1-22 reco_far/cedar/.bntp_data reco_far_cedar_bntp Sent request to dcache-admin, kennedy, ( omitting the not-yet-existing *_mrnt files, thanks Robert ! ) ########### # GANGLIA # ########### email ============================================================================= 2007 02 12 ########### # ROUNDUP # ########### roundup.20070212 setup encp v3_6d versus c ( since 2007 Jan 22 ) use /export/state/minfarm/.srmconfig and .grid, for local non-NFS copy ran far with old roundup, near with new, looks OK ######### # VOMRS # ######### Plunging into the brave new VO registration world, for Grid access. Started at fermigrid.fnal.gov, User Guide. Directed to register at https://voms.fnal.gov:8443/vomrs/vo-fermilab/vomrs Steps are ( in language clear as mud to me ) Visitor anyone Registration (Phase I) magic identity check ? Candidate Registration (Phase II) agreeing to OSG Usage Rules Applicant Approval by Administrator Member Phase I - mail kreymer@fnal.gov rep Steven Timm rights Full First Arthur Last Kreymer Phone 630 840 4261 Immediately get message You have successfully submitted phase I of fermilab VO registration! You now have candidate status in the fermilab VO. You will receive an email providing further instructions about second phase of registration. Click on the Registration (Phase I) link to update the left hand menu. Got the email immediately, clicked the email link. Great, the /fermilab/minos group is obvious. But what about the roles ? GratiaFermilabAdmin GratiaGlobalAdmin minossoft Production root VOMS-Query and why is minossoft a role for ALL groups ? Got confirmation web page, You have successfully submitted phase II of fermilab VO registration! You will receive an email notice from the VOMRS fermilab Service indicating that you've been approved (or denied) as a VO member. This could take up to a few days; it depends on how soon your representative completes this task. You now have applicant status in the fermilab VO, and as such can access more screens. Click on the fermilab VO Registration Home link in the left hand menu to update the menu. Applicant to fermilab VO may: * Change your groups and group roles selection * Browse groups * Browse institutions and sites * Browse required personal information * Browse CAs recognized by fermilab VO * Browse your own personal information * Re-sign usage rules * Browse your own authorization status * Browse required personal information * Browse CAs recognized by fermilab VO * Unsubscribe and resubscribe to personal event notification Got an immediate email, " you have been assigned " But that is neither 'approved' nor 'denied'. N.B. Phase I - must select a representative ( of what, for what ? ) Labels have magic pop up boxes, unreadable black on dark blue Why is Peter Shanahan listed ? Why is fermigrid2 listed ? Why are Steven Timm and Steven C. Timm listed ? I selected Steven Timm Note that Authorization Status that I can browse is listed as 'new', which is not any of the states described above. Found this on 2007 02 15, select 'Roles', called Phases on the welcome page. My Role is still listed as Applicant UGH, on a reconnect to https://voms.fnal.gov:8443/vomrs/vo-fermilab/vomrs I get a popup box message : You have attempted to establish a connection to "voms.fnal.gov".
However, the security certificate presented belongs to "http/fermigrid2.fnal.gov". It is possible, though unlikely, that someone may be trying to intercept your communication with this web site. If you suspect that the certificate shown does not belong to "voms.fnal.gov", please cancel the connection and notify the site administrator. Your status with the VO has been changed from New to Approved due to the following reason: Approved. Please contact VO administrator if you have any questions. Status, So now we have 'Phases', 'Roles' and 'Status' to describe the same thing. Ugh. ============================================================================= 2007 02 09 ########## # BREBEL # ########## Finished archiving minos22:/local/scratch22/brebel/R1.14 to /pnfs/minos/users/brebel/R1.14/* ntupleSt - 149 GB monthly directories MINOS22 > cd ntupleSt MINOS22 > DIRS=`ls` MINOS22 > for DIR in ${DIRS} ; do du -sh ${DIR} ; done 14G 2005-05 13G 2005-06 14G 2005-07 14G 2005-08 14G 2005-09 13G 2005-10 13G 2005-11 13G 2005-12 13G 2006-01 11G 2006-02 11G 2006-03 9.6G 2006-04 setup ecrc DCPOR=24736 setup dcap klist -f ##################################### for DIR in ${DIRS} ; do date TAF=/tmp/br/${DIR}.tar echo make ${TAF} time tar cf ${TAF} -C ${DIR} . echo diff ${TAF} time tar df ${TAF} -C ${DIR} . du -sm ${TAF} ls -l ${TAF} ECRC=`ecrc ${TAF}` printf "`echo $ECRC | cut -f 2 -d ' '` ${DIR}\n" date DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/users/brebel/R1.14/${DIR}.tar echo copy ${TAF} time dccp ${TAF} ${DFILE} ( cd /pnfs/minos/users/brebel/R1.14 ; cat ".(use)(2)(${DIR}.tar)" ) rm -f ${TAF} done 2>&1 | tee -a /tmp/br.log ##################################### Did this for DIRS=2005-05 DIRS='2005-06 2005-07 2005-08 2005-09 2005-10 2005-11 2005-12 2006-01 2006-02 2006-03 2006-04' Wait for enstore to get ECRC. BRLIS=$MINOS_DATA/log_data/users/brebel/R1.14 cp /tmp/br.log ${BRLIS}/ Create ecrc listings nedit ${BRLIS}/ecrc.lis # Create lines with ecrcdir MINOS26 > mkdir ${BRLIS} -p for DIR in ${DIRS} ; do ls -l ${DIR} > ${BRLIS}/${DIR}.lis ; done cd .. ; DIR=ntupleStUp_v3.5 ls -l ${DIR} > ${BRLIS}/${DIR}.lis Next, check CRC's. TARS=`ls /pnfs/minos/users/brebel/R1.14 | cut -f 1 -d .` TARS=${TARS}.5 # adds missing .5 to last tarfile name. for TAR in ${TARS} ; do OCRC=`grep ${TAR} ${BRLIS}/ecrc.lis | cut -f 1 -d ' '` echo ${OCRC} ${TAR} ECRC=`(cd /pnfs/minos/users/brebel/R1.14 ; cat ".(use)(4)(${TAR}.tar)" | tail -1)` echo ${ECRC} ${TAR} [ "${ECRC}" != "${OCRC}" ] && printf " OOPS, mismatch \n" done 2007 02 10 all these are now on tape, the above tests succeed ####### # CFL # ####### Corrected raw data to include all months cdm ; cd CFL $HOME/minos/scripts/cflsum.20070209 | tee cflsum.20070206 ln -sf cflsum.20070206 CFLSUM cds ln -sf cflsum.20070206 cflsum ############ # MCIMPORT # ############ Note heavy load from kordosky, as many as 10 scp's and nearly 20 md5sum's running all at once. 
SAR: 02:50:01 PM all 0.20 0.00 0.17 0.21 99.42 03:00:01 PM all 0.61 0.00 0.64 0.65 98.09 03:10:01 PM all 9.37 0.00 2.08 1.18 87.37 03:20:01 PM all 11.50 0.00 3.67 14.72 70.10 03:30:01 PM all 6.15 0.00 4.70 13.01 76.14 03:40:02 PM all 8.50 0.00 7.40 46.63 37.47 03:50:01 PM all 5.95 0.00 5.16 87.77 1.13 04:00:02 PM all 3.22 0.00 2.91 93.75 0.12 04:10:02 PM all 3.12 0.00 2.61 94.22 0.05 04:20:01 PM all 1.95 0.00 2.11 95.84 0.09 04:30:02 PM all 0.73 0.00 1.79 97.39 0.08 04:40:01 PM all 1.27 0.00 2.34 96.39 0.00 04:50:02 PM all 1.12 0.00 2.12 96.69 0.07 05:00:01 PM all 1.10 0.00 2.38 96.48 0.04 05:10:02 PM all 1.62 0.00 2.22 96.17 0.00 05:20:02 PM all 16.46 0.00 3.55 79.97 0.02 05:20:02 PM CPU %user %nice %system %iowait %idle 05:30:01 PM all 2.65 0.00 2.13 95.20 0.02 05:40:01 PM all 2.81 0.00 2.64 94.55 0.01 05:50:01 PM all 2.65 0.00 2.34 94.83 0.19 06:00:01 PM all 5.67 0.00 7.51 86.39 0.43 06:10:01 PM all 1.81 0.00 2.43 52.03 43.73 06:20:02 PM all 3.03 0.00 1.65 29.64 65.68 Average: all 4.21 0.00 2.80 40.62 52.37 ============================================================================= 2007 02 08 ############ # MCIMPORT # ############ running slowly again, up to 14 kordosky scp's running at once. Fortunately my local broadband connection is slow at uploads ( 50 KB/sec ), so it is easy to create a slow source of data. ( Easy, that is, if I drive home to do the test. ) I copied 6 files at once, with scp -c blowfish , each of them 10 MBytes. ( This is not an unusual situation, I saw 14 kordosky scp's running today. ) The files were named frag0 through frag5. Then I pre-created 6 more files with for N in a b c d e f ; do dd if=/dev/zero of=frag${N} bs=1M count=10 ; done Then ran 6 copies again, to fraga through fragf. The files are 2560 blocks long ( 4096 byte blocks ) filefrag reports as follow : R > FRAG=/home/minsoft/maint/filefrag R > for N in 0 1 2 3 4 5 ; do ${FRAG} frag${N} ; done frag0: 1420 extents found, perfection would be 1 extent frag1: 1601 extents found, perfection would be 1 extent frag2: 1694 extents found, perfection would be 1 extent frag3: 1384 extents found, perfection would be 1 extent frag4: 1480 extents found, perfection would be 1 extent frag5: 1346 extents found, perfection would be 1 extent R > for N in a b c d e f ; do ${FRAG} frag${N} ; done fraga: 1 extent found fragb: 1 extent found fragc: 1 extent found fragd: 1 extent found frage: 1 extent found fragf: 2 extents found, perfection would be 1 extent ########### # GANGLIA # ########### fnpca seems to be down, existing trouble ticket 92258 ============================================================================= 2007 02 07 ############ # DATABASE # ############ See LOG.mysql ####### # X11 # ####### Gimp scan, stuck on minos15, minos15 Wed Feb 7 13:11:54 CST 2007 Globally, the swap directory was missing, /var/tmp/kreymer/.gimp-2.0 MIN > for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'ls -l /var/tmp | grep drw' ; done minos01 Wed Feb 7 13:16:30 CST 2007 drwx------ 4 minoscvs e875 4096 Jan 26 17:59 cvs-serv22533 minos13 Wed Feb 7 13:16:38 CST 2007 drwxr-xr-x 4 kreymer 1525 4096 Jan 27 00:25 kreymer minos26 Wed Feb 7 13:16:50 CST 2007 drwxr-xr-x 5 mindata e875 4096 Feb 5 12:41 mindata drwxr-xr-x 3 kreymer g020 4096 Jan 9 11:53 rawcopy MIN > for NODE in $NODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'mkdir -p /var/tmp/kreymer/.gimp-1.2 | grep drw' ; done MINOS26 > mv /var/tmp/kreymer/.gimp-1.2 /var/tmp/kreymer/.gimp-2.0 Rescanned, stuck on 11 and 15 minos11 Wed Feb 7 13:28:58 CST 2007 minos15 
Wed Feb 7 13:29:39 CST 2007 ############ # MCIMPORT # ############ keepup - ran very fast last night for howcroft, 6 to 9 MB/sec from 06:48 to 07:33 during which time files continued to be imported, about 10 minutes each size time 348051424 Feb 7 06:10 n11011272_0007_L010185N_D00.tar.gz 348915700 Feb 7 06:21 n11011272_0008_L010185N_D00.tar.gz 11 344127618 Feb 7 06:31 n11011272_0009_L010185N_D00.tar.gz 10 351102396 Feb 7 06:42 n11011273_0000_L010185N_D00.tar.gz 11 354129779 Feb 7 06:53 n11011273_0001_L010185N_D00.tar.gz 11 345883174 Feb 7 07:14 n11011273_0002_L010185N_D00.tar.gz 19 345407102 Feb 7 07:24 n11011273_0003_L010185N_D00.tar.gz 10 350804073 Feb 7 07:35 n11011273_0004_L010185N_D00.tar.gz 11 346207685 Feb 7 07:45 n11011273_0005_L010185N_D00.tar.gz 11 ### DUPLICATE RUNNING ### Reindexing kordosky duplicate running per 03 Feb email L010185, 1071-1090 cd kordosky/index FIND THE FILES for N in 7 8 9 ; do grep "10${N}._.*L010185" *.index ; done for N in 7 8 9 ; do grep "10${N}._.*L010185" *.index ; done | wc -l 432 FIND THE TARS for N in 7 8 9 ; do grep "10${N}._.*L010185" *.index ; done | \ cut -f 1 -d : | sort -u M26> for N in 7 8 9 ; do grep "10${N}._.*L010185" *.index ; done | \ cut -f 1 -d : | sort -u | wc -l 48 DUPTS=`for N in 7 8 9 ; do grep "10${N}._.*L010185" *.index ; done | \ cut -f 1 -d : | sort -u` for DUPT in $DUPTS ; do cat ${DUPT} ; done | wc -l 432 GOOD ! These tarfiles contain only the duplicated runs. So the index files can just be moved out of the way. mkdir ../index.dup1071 for DUPT in $DUPTS ; do mv ${DUPT} ../index.dup1071/ ; done ============================================================================= 2007 02 06 ####### # CFL # ####### aklog cdm ; cd CFL $HOME/minos/scripts/cfl 1110230 CFL $HOME/minos/scripts/cflsum | tee cflsum.20070206 ln -sf cflsum.20070206 CFLSUM ############ # MCIMPORT # ############ test of file preallocation , kreymer@minos-sam03 SS3 > cd MCIMPORT SS3 > dd if=/dev/zero of=TEST bs=1M count=1000 1000+0 records in 1000+0 records out SS3 > du -sb TEST 1049604096 TEST SS3 > rm TEST SS3 > time dd if=/dev/zero of=TEST bs=1M count=1000 1000+0 records in 1000+0 records out real 0m12.710s user 0m0.000s sys 0m5.090s AKS3 > du -sm /var/tmp/* 3193 /var/tmp/DCS 329 /var/tmp/FIX 1 /var/tmp/rc_host_0 AKS3 > time cp /var/tmp/FIX TEST real 0m8.033s user 0m0.060s sys 0m2.960s M26> TIF=/var/tmp/mindata/TMP/n11011219_0003_L010185N_D00.tar.gz M26> time dd if=$TIF of=TIF bs=1M real 0m10.066s user 0m0.000s sys 0m3.340s M26> rm TIF M26> time dd if=$TIF of=TIF bs=1M 338+1 records in 338+1 records out real 0m4.353s user 0m0.000s sys 0m2.170s M26> time dd if=/dev/zero of=TIF bs=1M count=400 400+0 records in 400+0 records out real 0m5.346s user 0m0.000s sys 0m2.030s M26> stat TIF File: `TIF' Size: 419430400 Blocks: 820008 IO Block: 4096 Regular File Device: 341h/833d Inode: 17301506 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 3648/ mindata) Gid: ( 5111/ e875) Access: 2007-02-06 12:07:54.000000000 -0600 Modify: 2007-02-06 12:07:59.000000000 -0600 Change: 2007-02-06 12:07:59.000000000 -0600 M26> time dd if=$TIF of=TIF bs=1M 338+1 records in 338+1 records out real 0m4.868s user 0m0.000s sys 0m2.480s M26> stat TIF File: `TIF' Size: 354953788 Blocks: 693960 IO Block: 4096 Regular File Device: 341h/833d Inode: 17301506 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 3648/ mindata) Gid: ( 5111/ e875) Access: 2007-02-06 12:07:54.000000000 -0600 Modify: 2007-02-06 12:08:39.000000000 -0600 Change: 2007-02-06 12:08:39.000000000 -0600 ############ # MCIMPORT # ############ 
mcimport_init - initializes mindata account, as on minos-sam02/3 ########### # ROUNDUP # ########### Did catchup. SRMCP errors , retried OK for /pnfs/minos/reco_far/cedar/sntp_data/2007-02/F00037375_0000.spill.sntp.cedar.0.root at 12:12. Succeeded at 12:13 in spite of message org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception messa ge: Custom message: Unexpected reply: 425 Cannot open port: java.lang.Exception: Pool manager error: No write pool available for ]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 425 Cannot open port: java.lang.Exception: Pool manager error: No write pool available for at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:195) at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109) at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1456) at org.globus.ftp.GridFTPClient.extendedPut(GridFTPClient.java:508) at org.globus.ftp.GridFTPClient.extendedPut(GridFTPClient.java:474) at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:846) at java.lang.Thread.run(Thread.java:534) try again sleeping for 10000 before retrying That's reasonable, all the 9a pools were restarted at 12:12 w-stkendca9a-1 w-stkendca9a-1Domain 0 10 69 msec 02/06 12:12:29 production-1-7-0(1.130.2.2) ============================================================================= 2007 02 05 ############ # PREDATOR # ############ VMON=2007-01 ./predator ${VMON} ../HOWTO.predator ${VMON} ######### # VAULT # ######### for DET in far near; do ./vault ${DET} ${VMON} ; done Ran correctly ######## # DATA # ######## Runs to be reprocessed in cedar, from rubin mail : N00008009_0000 N00009582_0005 N00009586_0000 N00009586_0001 N00009586_0002 N00009586_0004 N00009586_0005 N00009586_0006 N00009586_0008 N00009586_0009 N00009586_0010 N00010163_0001 N00010163_0003 N00010163_0004 N00010163_0005 N00010163_0006 N00010163_0007 N00010163_0008 N00010163_0009 N00010163_0010 N00010163_0011 N00010163_0012 N00010163_0015 N00010184_0002 N00010184_0003 N00010184_0004 N00010184_0005 N00010184_0007 N00010184_0008 N00010184_0009 N00010184_0010 N00010184_0011 N00010184_0012 N00010184_0013 N00010184_0014 N00010184_0015 N00010184_0016 Are they in SAM ? First, what is in reco_near ? Only two of these files ! for RUN in ${RUNS} ; do for RUN in N00010184_0016 ; do # one bad run for RUN in N00010218_0020 ; do # one good run printf "${RUN}\n" ; MON=`sam locate ${RUN}.mdaq.root | cut -f 5 -d '/' | cut -f 1 -d ,` ls /pnfs/minos/reco_near/cedar/*/${MON}/${RUN}* done ... N00010163_0015 /pnfs/minos/reco_near/cedar/cand_data/2006-06/N00010163_0015.cosmic.cand.cedar.0.root /pnfs/minos/reco_near/cedar/sntp_data/2006-06/N00010163_0015.cosmic.sntp.cedar.0.root ... MINOS26 > sam locate N00010163_0015.cosmic.cand.cedar.0.root ['/pnfs/minos/reco_near/cedar/cand_data/2006-06,1614@vob884'] MINOS26 > sam locate N00010163_0015.cosmic.sntp.cedar.0.root ['/pnfs/minos/reco_near/cedar/sntp_data/2006-06,3090@vob657'] for RUN in ${RUNS} ; do for TYP in cosmic spill ; do for STR in cand sntp ; do sam locate ${RUN}.${TYP}.${STR}.cedar.0.root ; done ; done ; done ... Datafile with name 'N00010163_0012.spill.sntp.cedar.0.root' not found. ['/pnfs/minos/reco_near/cedar/cand_data/2006-06,1614@vob884'] ['/pnfs/minos/reco_near/cedar/sntp_data/2006-06,3090@vob657'] Datafile with name 'N00010163_0015.spill.cand.cedar.0.root' not found. ... 
Another check for RUN in ${RUNS} ; do sam list files --dim="file_name like ${RUN}%cedar%.root" ; done ... Files: N00010163_0015.cosmic.cand.cedar.0.root N00010163_0015.cosmic.sntp.cedar.0.root OK, undeclare these two : sam undeclare file N00010163_0015.cosmic.cand.cedar.0.root sam undeclare file N00010163_0015.cosmic.sntp.cedar.0.root ############ # MCIMPORT # ############ kordosky/ n11011519_0006_L010185N_D00.tar.gz is reported corrupt, M26> ls -alF n11011519_0006_L010185N_D00.tar.gz -rw-r--r-- 1 mindata e875 349098958 Feb 5 10:12 n11011519_0006_L010185N_D00.tar.gz M26> md5sum n11011519_0006_L010185N_D00.tar.gz 67de1f44e0c8820ed5ad53975978e834 n11011519_0006_L010185N_D00.tar.gz M26> grep n11011519_0006_L010185N_D00.tar.gz md5/all.md5 5e42eae486d2b49271affc29241e165d n11011519_0006_L010185N_D00.tar.gz mv n11011519_0006_L010185N_D00.tar.gz BAD/ Perhaps we can avoid such problems, and help fragmentation, by first copying to /var/tmp/mindata/MCIN/* ? M26> for DIR in `ls /local/scratch26/mindata` ; do mkdir /var/tmp/mindata/MCIN/${DIR} ; done Time some copies ( 333 MByte file ) M26> FIX=n11011504_0002_L010185N_D00.tar.gz M26> dds $FIX -rw-r--r-- 1 mindata e875 348302809 Feb 5 10:04 n11011504_0002_L010185N_D00.tar.gz M26> time dd if=$FIX of=/var/tmp/mindata/MCIN/FIX 680278+1 records in 680278+1 records out real 2m6.155s user 0m0.370s sys 0m7.540s ( retry later, vault copies are running now, same disks ) Try some tests on minos-sam03 S03> FIX=n11011503_0002_L010185N_D00.tar.gz S03> time scp -c blowfish mindata@minos26:/local/scratch26/mindata/kordosky/${FIX} FIX real 1m37.945s user 0m0.040s sys 0m2.210s >S03 md5sum FIX 32c8146409dff0f5318c63f3b4e05810 FIX M26> md5sum n11011503_0002_L010185N_D00.tar.gz 32c8146409dff0f5318c63f3b4e05810 n11011503_0002_L010185N_D00.tar.gz S03> ls -l FIX -rw-r--r-- 1 samread 5024 343676718 Feb 5 15:35 FIX S03> time gunzip -t FIX real 0m9.682s user 0m9.490s sys 0m0.190s S03> time cp FIX /var/tmp/FIX real 0m3.146s user 0m0.030s sys 0m1.950s S03> time dd if=FIX of=/var/tmp/FIX 671243+1 records in 671243+1 records out real 0m11.299s user 0m0.410s sys 0m6.840s S03> time dd if=FIX of=/var/tmp/FIX bs=1M 327+1 records in 327+1 records out real 0m4.172s user 0m0.000s sys 0m2.100s Move a big file to /var/tmp, to flush memory. SS3 > time dd if=DCS_HV.MYD.gz of=/var/tmp/DCS bs=1M 3188+1 records in 3188+1 records out real 1m51.970s user 0m0.010s sys 0m26.070s S03> time dd if=FIX of=/var/tmp/FIX bs=1M real 0m9.743s user 0m0.000s sys 0m3.040s ########## # BREBEL # ########## Jan 31 request to backup minos22:/local/scratch22/brebel/R1.14 I suggest to /pnfs/minos/users/brebel/R1.14/* ntupleSt - 149 GB monthly directories MINOS22 > for DIR in `ls ntupleSt` ; do du -sh ntupleSt/${DIR} ; done 14G ntupleSt/2005-05 13G ntupleSt/2005-06 14G ntupleSt/2005-07 14G ntupleSt/2005-08 14G ntupleSt/2005-09 13G ntupleSt/2005-10 13G ntupleSt/2005-11 13G ntupleSt/2005-12 13G ntupleSt/2006-01 11G ntupleSt/2006-02 11G ntupleSt/2006-03 9.6G ntupleSt/2006-04 ntupleStUp_v3.5 - 4.9 GB Simplest solution : Make tarfiles of the whole directories. Make a listing, safe in afs. Very little free space... so tar to /tmp one at a time, record ecrc, then dccp to write pool. Can use tar -d to check content Let's try one : cd /local/scratch22/brebel/R1.14 DIR=ntupleStUp_v3.5 MINOS22 > time tar cf /tmp/br/${DIR}.tar -C ${DIR} . real 3m21.508s user 0m0.500s sys 0m38.390s MINOS22 > time tar df /tmp/br/${DIR}.tar -C ${DIR} . 
real 2m32.246s user 0m9.250s sys 0m23.880s MINOS22 > du -sm /tmp/br/${DIR}.tar 5008 /tmp/br/ntupleStUp_v3.5.tar MINOS22 > ls -l /tmp/br/${DIR}.tar -rw-r--r-- 1 kreymer 1525 5245091840 Feb 5 18:15 /tmp/br/ntupleStUp_v3.5.tar MINOS26 > mkdir /pnfs/minos/users/brebel MINOS26 > chmod 775 /pnfs/minos/users/brebel MINOS26 > cd /pnfs/minos/users/brebel MINOS26 > enstore pnfs --file_family minos_users_brebel MINOS26 > mkdir R1.14 MINOS22 > DCPOR=24736 MINOS22 > DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/users/brebel/${DIR}.tar MINOS22 > setup dcap MINOS22 > setup encp MINOS22 > time ecrc /tmp/br/${DIR}.tar CRC 503464758 real 1m51.515s user 0m15.480s sys 0m14.350s MINOS22 > time dccp /tmp/br/${DIR}.tar ${DFILE} real 5m45.927s user 0m0.000s sys 0m0.010s < interrupted > < readjusted path from minos/brebel to minos/users/brebel > < readusted family from minos_brebel to minos_users_brebel > MINOS22 > time dccp /tmp/br/${DIR}.tar ${DFILE} < observed 22 MB/sec on ganglia during copy > 5245091840 bytes in 235 seconds (21796.43 KB/sec) real 3m57.383s user 0m15.900s sys 0m20.340s As expected , ls shows a file size of 1 in PNFS. MINOS26 > cat '.(use)(2)(ntupleStUp_v3.5.tar)' 2,0,0,0.0,0.0 :h=yes;c=1:308e4337;l=5245091840; w-stkendca9a-2 Need to wait for enstore to get ECRC. MINOS26 > ./dc_stat /pnfs/minos/users/brebel/R1.14/ntupleStUp_v3.5.tar PNFS status for /pnfs/minos/users/brebel/R1.14/ntupleStUp_v3.5.tar -rw-r--r-- 1 kreymer e875 1 Feb 5 23:01 ntupleStUp_v3.5.tar LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:308e4337;l=5245091840; w-stkendca9a-2 LEVEL 4 VOC416 0000_000000000_0000001 5245091840 minos_users_brebel /pnfs/fnal.gov/usr/minos/users/brebel/R1.14/ntupleStUp_v3.5.tar 000F00000000000004E7CA50 CDMS117073809300000 stkenmvr35a:/dev/rmt/tps2d0n:479000010076 503464758 Size and ecrc match, looks good ! MINOS22 > rm /tmp/br/${DIR}.tar ============================================================================= 2007 02 04 ############ # MCIMPORT # ############ kordosky pass took over 6 hours howcroft input idle at present. min free disk was srmcp was trying, no output from 07:00 to 07:05 cron pid detection is working ( 06:37 ) Maybe should go back to tarring to /var/tmp, with a copy back to scratch ? Might reduce fragmentation, and be faster overall ? $ time dd if=STAGE/kordosky/tar/n11011014_0001_L150200N_D00-n11011014_0009_L150200N_D00.tar \ of=/var/tmp/mindata/TMP/n11011014_0001_L150200N_D00-n11011014_0009_L150200N_D00.tar ( spot checked at about 1.2 MBytes/sec ) ( cancelled at 836 MBytes ) 1709544+0 records in 1709544+0 records out real 9m37.147s user 0m1.230s sys 0m19.330s Now copy a 330 MB file from /var/tmp to /tmp ( same disk ) $ time dd if=/var/tmp/mindata/TMP/n11011001_0000_L010185N_D00.tar.gz of=/tmp/mindata/TEST.dat 672549+1 records in 672549+1 records out real 0m34.817s user 0m0.560s sys 0m8.950s ( speed is about 10 MBytes/second ) ============================================================================= 2007 02 03 ############ # MCIMPORT # ############ mcimport.20070203 added missing writing and clearing of CRON/mcimport.pid miserable rates, .3 MB/sec for kordosky when 6 kordosky and 1 howcroft scp's kordosky rate is .6 MB/sec no output at all for srmcp. later, round 10:00, single kordosky scp runs at .3 mB/sec, howcroft at .6 later, round 10:23, no kordosky scp's at all, howcroft at .6 MB/sec srmcp's vary from 3 to 20 MB/sec, with up to 2-5 minutes idle between files Killed cronjob, restart with this new mcimport this afternoon, after the present run finished up. 
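The pid file writing and clearing keeps needing fixes, so here is a minimal sketch of the interlock a cron wrapper needs; this is not the mcimport code, and the path is illustrative.

#!/bin/sh
# Sketch of a cron-safe pid interlock for a long-running import pass.
PIDFILE=/local/scratch26/mindata/CRON/mcimport.pid

if [ -s "${PIDFILE}" ] && kill -0 `cat ${PIDFILE}` 2>/dev/null ; then
    echo "OOPS - `cat ${PIDFILE}` still running, quitting"
    exit 1
fi
echo $$ > ${PIDFILE}                 # take the lock
trap "rm -f ${PIDFILE}" EXIT         # clear it even on error or signal

# ... the real import pass would run here ...

A mkdir-based lock ( mkdir fails if the directory already exists ) would close the small race window between the check and the write.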
============================================================================= 2007 02 02 ############ # MCIMPORT # ############ crontab crontab.dat around 09:00 mcimport.20070201 - commented out MAILTO, FREETIME test lines Shifted crontab run times to 37 3,9,15,21 * * * ${HOME}/mcimport -c ALL 17:45 - hacked crontab to allow catchup, 37 1,6,12,20 * * * ${HOME}/mcimport -c ALL M26 > echo 'crontab crontab.dat' | at 02:30 Going very slowly through kordosy files, under 1 MB/sec ( 3 x 500MB per tar ) Rehacked round midnight to 37 2,7,12,18 * * * ${HOME}/mcimport -c ALL ####### # SAM # ####### kordosky reports samwebservices problem with samTranslateDimensions \ --dim="run_type physics% and file_name like N000%.spill.sntp.R1_18_4.0.root and start_time <= to_date('2006-09-30','yyyy-mm-dd') and end_time >= to_date('2006-09-24','yyyy-mm-dd')" \ --wsdl="http://www-numi.fnal.gov/sam_web_services/wsdl/DimensionsService.wsdl.xml" Usage: samTranslateDimensions --dim= [--verbose] or samTranslateDimensions --query= [--verbose] ... Local query is OK, MINOS26 > sam list files --dim="run_type physics% and file_name like N000%.spill.sntp.R1_18_4.0.root and start_time <= to_date('2006-09-30','yyyy-mm-dd') and end_time >= to_date('2006-09-24','yyyy-mm-dd')" Files: N00010861_0020.spill.sntp.R1_18_4.0.root ... File Count: 116 Average File Size: 39.62MB Total File Size: 4.49GB Total Event Count: 11759539 Testing sws in my clean PRODUCTS window, MINOS26 > setup sam_web_services_client ERROR: Found no match for product 'python' ERROR: Action parsing failed on "unsetuprequired(python v2_2_3_sam)" WARNING: Unsetup of sam_web_services_client failed, continuing with setup ERROR: Found no match for product 'python' ERROR: Found no match for product 'python' MINOS26 > setup python v2_4_sam MINOS26 > samTranslateDimensions \ --wsdl="http://www-numi.fnal.gov/sam_web_services/wsdl/DimensionsService.wsdl.xml" \ --dim="file_name like N00008698_000%.cosmic.sntp.R1_18.0.root" Dimension string: file_name like N00008698_000%.cosmic.sntp.R1_18.0.root Dataset file list: : ['N00008698_0000.cosmic.sntp.R1_18.0.root', 'N00008698_0001.cosmic.sntp.R1_18.0.root', 'N00008698_0002.cosmic.sntp.R1_18.0.root', 'N00008698_0003.cosmic.sntp.R1_18.0.root', 'N00008698_0004.cosmic.sntp.R1_18.0.root', 'N00008698_0005.cosmic.sntp.R1_18.0.root', 'N00008698_0006.cosmic.sntp.R1_18.0.root', 'N00008698_0007.cosmic.sntp.R1_18.0.root', 'N00008698_0008.cosmic.sntp.R1_18.0.root', 'N00008698_0009.cosmic.sntp.R1_18.0.root'] Dataset size: 768939259.0 bytes So the generic test query works Nick's query works for me : MINOS26 > samTranslateDimensions --wsdl="http://www-numi.fnal.gov/sam_web_services/wsdl/DimensionsService.wsdl.xml" --dim="run_type physics% and file_name like N000%.spill.sntp.R1_18_4.0.root and start_time <= to_date('2006-09-30','yyyy-mm-dd') and end_time >= to_date('2006-09-24','yyyy-mm-dd')" Dataset file list: :['N00010861_0020.spill.sntp.R1_18_4.0.root', ... , 'N00010893_0012.spill.sntp.R1_18_4.0.root'] I suggested trying to ping www-numi.fnal.gov and minos-sam03.fnal.gov, and samLocate --file=foo Scanning logs on minos-sam03, note that trace is itegrated, and that there are daily logs, both filled with minute by minute messages about Checking on opened file streams. MINOS-SAM03 > grep -v 'Checking on' wsLog__02_02_07 NB per Liz, they are running a hacked client, which has a built-in wsdl, you cannot set --wsdl on command line. HOWTO.samwebservices - updated to reflect cleaner usage, more functions and to note minos-sam03 server. 
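The checks suggested to kordosky, gathered in one place as a sketch; the host names are the ones above, and the samLocate file name is just an example of a file known to SAM.

# Is it the network, the web server, or the dimension parser ?
for HOST in www-numi.fnal.gov minos-sam03.fnal.gov ; do
    ping -c 3 ${HOST} > /dev/null 2>&1 && echo "${HOST} ok" || echo "${HOST} UNREACHABLE"
done
# A trivial web services call, independent of dimension syntax :
samLocate --file=F00037343_0003.mdaq.root
# Server side, skip the minute-by-minute stream checks when reading the log :
grep -v 'Checking on' wsLog__02_02_07 | tail -20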
Added nwest and kordosky to samread .k5login on minos-sam03, created WEBLOGS link to make logs easy to find. ########### # ROUNDUP # ########### touched up around 17:00 ########## # DCACHE # ########## Reviewing file families for DCache pool planning MINOS26 > cd ../reco_far MINOS26 > for DIR in `ls` ; do printf "${DIR} " ; ( cd ${DIR}/sntp_data ; enstore pnfs --tags | grep 'ily) ' ) ; done R1.11 .(tag)(file_family) = sntp_data_R1_11 R1.12 .(tag)(file_family) = sntp_data_R1_12 R1.14 .(tag)(file_family) = sntp_data_R1_14 R1.14_201 -bash: cd: R1.14_201/sntp_data: No such file or directory .(tag)(file_family) = minos R1.16 .(tag)(file_family) = reco_far_R1_16 R1.16a .(tag)(file_family) = sntp_near_R1_16a R1_17 .(tag)(file_family) = reco_far_R1_17 R1_17a.0 .(tag)(file_family) = reco_far_R1_17 R1_18 .(tag)(file_family) = reco_far_R1_18 R1_18_2 .(tag)(file_family) = reco_far_R1_18_2 R1_18_2_temp .(tag)(file_family) = minos R1_18_2a .(tag)(file_family) = reco_far_R1_18_2a R1_18_4 .(tag)(file_family) = reco_far_R1_18_4 R1_21 .(tag)(file_family) = reco_far_R1_21 R1_23 .(tag)(file_family) = reco_far_R1_23 R1_23a .(tag)(file_family) = reco_far_R1_23a R1_24 .(tag)(file_family) = reco_far_R1_24 R1_24a .(tag)(file_family) = reco_far_R1_24a R1_24b .(tag)(file_family) = reco_far_R1_24b R1_24c .(tag)(file_family) = reco_far_R1_24c S06-05-25-R1-22 .(tag)(file_family) = reco_far_S06-05-25-R1-22 S06-06-22-R1-22 .(tag)(file_family) = reco_far_S06-06-22-R1-22 cedar .(tag)(file_family) = reco_far_cedar_sntp MINOS26 > for DIR in `ls` ; do printf "${DIR} " ; ( cd ${DIR}/.bntp_data ; enstore pnfs --tags | grep 'ily) ' ) ; done R1_18 .(tag)(file_family) = reco_far_R1_18 R1_18_2 .(tag)(file_family) = reco_far_R1_18_2 R1_18_2_temp .(tag)(file_family) = minos R1_18_2a .(tag)(file_family) = reco_far_R1_18_2a R1_18_4 .(tag)(file_family) = reco_far_R1_18_4 R1_21 -bash: cd: R1_21/.bntp_data: No such file or directory .(tag)(file_family) = minos R1_23 .(tag)(file_family) = reco_far_R1_23 R1_23a .(tag)(file_family) = reco_far_R1_23a R1_24 .(tag)(file_family) = reco_far_R1_24 R1_24a .(tag)(file_family) = reco_far_R1_24a R1_24b .(tag)(file_family) = reco_far_R1_24b R1_24c .(tag)(file_family) = reco_far_R1_24c S06-05-25-R1-22 .(tag)(file_family) = reco_far_S06-05-25-R1-22 S06-06-22-R1-22 .(tag)(file_family) = reco_far_S06-06-22-R1-22 cedar .(tag)(file_family) = reco_far_cedar_bntp Hmmmm, only 3 very old releases, and cedar, have sntp or bntp tags. Sent email back to kennedy, minos_data, dcache-admin with outline. ============================================================================= 2007 02 01 ####### # DAQ # ####### file-exist errors started again from fd data logging, since 2007-01-31 18:41:12 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/fardet_data/2007-02/F00037343_0003.mdaq.root daqdcp.minos-soudan.org 0 0 0 ERROR 553 /pnfs/fnal.gov/usr/minos/fardet_data/2007-02/F00037343_0003.mdaq.root: Cannot create file: CacheException(rc=2;msg=Pnfs error : File exists) cleanly written at 18:25:18 ( 00:25 UTC Feb 1 ) previous clean subrun was 0002, at 17:29:37 ( 23:29 UTC ) in /pnfs/minos/fardet_data/2007-01/ continues through current subrun, 2007-02/F00037343_0018.mdaq.root at 10:16:58 cleanly written at 09:27:10 Here are the timestamps of existing files in cache : MINOS26 > dds /pnfs/minos/fardet_data/2007-01/ ... 
-rw-r--r-- 1 buckley e875 16956401 Jan 31 11:17 F00037330_0017.mdaq.root -rw-r--r-- 1 buckley e875 30157757 Jan 31 11:17 F00037330_0018.mdaq.root -rw-r--r-- 1 buckley e875 44126200 Jan 31 11:17 F00037330_0019.mdaq.root -rw-r--r-- 1 buckley e875 17007894 Jan 31 11:20 F00037330_0020.mdaq.root -rw-r--r-- 1 buckley e875 30224860 Jan 31 12:20 F00037330_0021.mdaq.root -rw-r--r-- 1 buckley e875 41813754 Jan 31 13:11 F00037330_0022.mdaq.root -rw-r--r-- 1 buckley e875 18219275 Jan 31 13:27 F00037331_0000.mdaq.root -rw-r--r-- 1 buckley e875 6912098 Jan 31 13:37 F00037332_0000.mdaq.root -rw-r--r-- 1 buckley e875 6912366 Jan 31 13:48 F00037333_0000.mdaq.root -rw-r--r-- 1 buckley e875 17815950 Jan 31 13:58 F00037334_0000.mdaq.root -rw-r--r-- 1 buckley e875 956721 Jan 31 14:08 F00037335_0000.mdaq.root -rw-r--r-- 1 buckley e875 18184587 Jan 31 14:24 F00037336_0000.mdaq.root -rw-r--r-- 1 buckley e875 6906838 Jan 31 14:34 F00037337_0000.mdaq.root -rw-r--r-- 1 buckley e875 18271837 Jan 31 14:45 F00037338_0000.mdaq.root -rw-r--r-- 1 buckley e875 6910722 Jan 31 14:55 F00037339_0000.mdaq.root -rw-r--r-- 1 buckley e875 18242005 Jan 31 15:06 F00037340_0000.mdaq.root -rw-r--r-- 1 buckley e875 6909233 Jan 31 15:21 F00037341_0000.mdaq.root -rw-r--r-- 1 buckley e875 18240463 Jan 31 15:32 F00037342_0000.mdaq.root -rw-r--r-- 1 buckley e875 37010458 Jan 31 15:47 F00037343_0000.mdaq.root -rw-r--r-- 1 buckley e875 37442321 Jan 31 16:28 F00037343_0001.mdaq.root -rw-r--r-- 1 buckley e875 17120975 Jan 31 17:29 F00037343_0002.mdaq.root MINOS26 > ls -alF /pnfs/minos/fardet_data/2007-02/ total 518959 drwxrwxr-x 1 kreymer e875 512 Feb 1 10:28 ./ drwxrwxr-x 1 buckley e875 512 Dec 14 11:50 ../ -rw-r--r-- 1 buckley e875 36935441 Jan 31 18:25 F00037343_0003.mdaq.root -rw-r--r-- 1 buckley e875 37282857 Jan 31 19:26 F00037343_0004.mdaq.root -rw-r--r-- 1 buckley e875 17129575 Jan 31 20:28 F00037343_0005.mdaq.root -rw-r--r-- 1 buckley e875 37047418 Jan 31 21:25 F00037343_0006.mdaq.root -rw-r--r-- 1 buckley e875 37405153 Jan 31 22:28 F00037343_0007.mdaq.root -rw-r--r-- 1 buckley e875 16996342 Jan 31 23:26 F00037343_0008.mdaq.root -rw-r--r-- 1 buckley e875 36888338 Feb 1 00:30 F00037343_0009.mdaq.root -rw-r--r-- 1 buckley e875 37113715 Feb 1 01:31 F00037343_0010.mdaq.root -rw-r--r-- 1 buckley e875 17062111 Feb 1 02:30 F00037343_0011.mdaq.root -rw-r--r-- 1 buckley e875 36863240 Feb 1 03:29 F00037343_0012.mdaq.root -rw-r--r-- 1 buckley e875 37372085 Feb 1 07:53 F00037343_0013.mdaq.root -rw-r--r-- 1 buckley e875 16979392 Feb 1 08:04 F00037343_0014.mdaq.root -rw-r--r-- 1 buckley e875 37208848 Feb 1 08:14 F00037343_0015.mdaq.root -rw-r--r-- 1 buckley e875 37341783 Feb 1 08:25 F00037343_0016.mdaq.root -rw-r--r-- 1 buckley e875 17165763 Feb 1 08:42 F00037343_0017.mdaq.root -rw-r--r-- 1 buckley e875 37324693 Feb 1 09:27 F00037343_0018.mdaq.root -rw-r--r-- 1 buckley e875 37300769 Feb 1 10:28 F00037343_0019.mdaq.root Note that the directory is based on UTC. So this problem is correlated with the directory we're writing to. Files were written to Enstore around 11:28 , based on times from MINOS26 > ls -alF /pnfs/minos/fardet_data/2007-02/ N.B. ftp client is returning SIZE=None, and failure code. As with kennedy, kftp shows size OK MINOS26 > ../bin/rlwrap ftp fndca1.fnal.gov 24127 Connected to stkendca2a.fnal.gov. 220 Kerberos FTP Door ready 334 ADAT must follow GSSAPI accepted as authentication type GSSAPI authentication succeeded Name (fndca1.fnal.gov:kreymer): 200 User kreymer logged in Remote system type is UNIX. 
Using binary mode to transfer files. ftp> cd fardet_data/2007-02 250 CWD command succcessful. New CWD is ftp> size F00037343_0003.mdaq.root 213 36935441 Per kennedy, restarted archiver around 13:30, no further problems. Probably due to data having been written to tape. This may have happened in previous months. Experts will investigate. ############ # MCIMPORT # ############ DUPLICATES ? M26> cat kordosky/index/*.index > /var/tmp/mindata/TMP/kordosky.index M26> cat /var/tmp/mindata/TMP/kordosky.index | wc -l 4212 M26> cat /var/tmp/mindata/TMP/kordosky.index | sort -u | wc -l 4212 M26> cat howcroft/index/*.index > /var/tmp/mindata/TMP/howcroft.index M26> cat /var/tmp/mindata/TMP/howcroft.index | wc -l 4394 M26> cat /var/tmp/mindata/TMP/howcroft.index | sort -u | wc -l 4326 M26> sort /var/tmp/mindata/TMP/howcroft.index > /tmp/ksor M26> sort -u /var/tmp/mindata/TMP/howcroft.index > /tmp/ksou M26> sdiff -s /tmp/ksor /tmp/ksou > /tmp/ksod M26> nedit /tmp/ksod M26> cat /tmp/ksod | wc -l 68 M26> for FIL in `cat /tmp/ksod` ; do grep ${FIL} howcroft/index/*.index ; done | wc -l 136 So, each duplicate exists in two index files. M26> for FIL in `cat /tmp/ksod` ; do grep ${FIL} howcroft/index/*.index ; done | cut -f 1 -d ':' | sort -u | wc -l 21 So, 21 tarfiles contribute to this problem n11011027_0000_L010185N_D00-n11011031_0000_L010185N_D00.tar n11011028_0000_L010185N_D00-n11011033_0000_L010185N_D00.tar n11011032_0000_L010185N_D00-n11011036_0000_L010185N_D00.tar n11011172_0002_L010185N_D00-n11011172_0006_L010185N_D00.tar n11011172_0002_L010185N_D00-n11011172_0010_L010185N_D00.tar n11011217_0000_L010185N_D00-n11011221_0000_L010185N_D00.tar n11011218_0004_L010185N_D00-n11011219_0001_L010185N_D00.tar n11011219_0007_L010185N_D00-n11011220_0000_L010185N_D00.tar n11011221_0000_L010185N_D00-n11011221_0004_L010185N_D00.tar n11011221_0010_L010185N_D00-n11011222_0003_L010185N_D00.tar n11011222_0000_L010185N_D00-n11011226_0000_L010185N_D00.tar n11011222_0009_L010185N_D00-n11011223_0002_L010185N_D00.tar n11011223_0008_L010185N_D00-n11011224_0001_L010185N_D00.tar n11011224_0007_L010185N_D00-n11011225_0000_L010185N_D00.tar n11011226_0000_L010185N_D00-n11011226_0004_L010185N_D00.tar n11011227_0000_L010185N_D00-n11011227_0004_L010185N_D00.tar n11011227_0000_L010185N_D00-n11011231_0000_L010185N_D00.tar n11011227_0010_L010185N_D00-n11011228_0003_L010185N_D00.tar n11011228_0009_L010185N_D00-n11011229_0002_L010185N_D00.tar n12011005_0010_L010185N_D00-n12011209_0001_L010185N_D00.tar n12011005_0010_L010185N_D00-n12011222_0003_L010185N_D00.tar crontab.dat updated to contain 37 3,9,15,21 * * * ${HOME}/mcimport -c ALL Hold off, start running this tomorrow. 
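The duplicate hunt above could be wrapped up for reuse; a sketch only, not part of mcimport, with the user area name as the single argument.

# Sketch : report file names that appear in more than one index file for a user area.
dupscan () {
    AREA=$1                                        # e.g. howcroft or kordosky
    cat ${AREA}/index/*.index | sort | uniq -d > /tmp/${AREA}.dup
    echo "`wc -l < /tmp/${AREA}.dup` duplicated names in ${AREA}"
    # and the index files ( hence tarfiles ) that contain them
    for FIL in `cat /tmp/${AREA}.dup` ; do
        grep -l ${FIL} ${AREA}/index/*.index
    done | sort -u
}
# dupscan howcroft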
Run manually this afternoon and evening, mcimport.20070201 Moved print of VERSION to MAIN, enabled full time TRIGTIME, TRIGSIZE trigger concatenation in generic running NOIMPORT disables running Added ALL users, using MCIMPORT to control activity Added pid check outside user loop for ALL and CRON MINOS26 > ln -sf mcimport.20070201 mcimport # was mcimport.20070130 M26 > cp /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mcimport.20070201 mcimport 16:32 ./mcimport ALL OOPS, needed to hack this to set CRON when directory ALL is specified, so that mcimport runs serially 16:40 ./mcimport ALL ============================================================================= 2007 01 31 ############ # MCIMPORT # ############ $ du -sm * 69799 howcroft 20365 kordosky HOWCROFT CLEANUP Moved badmd5.log to maint/howbad/ $ BADT=`cat ~/maint/howbad/badmd5.log | cut -f 1 -d ':' | cut -f 1 -d . | sort -u` 12 tarfiles, not so bad. Per rhatcher discussion, will modify the index files, leaving the tars alone. $ for BAD in ${BADT} ; do echo ${BAD} ; cp -a index/${BAD}.index ~/maint/howbad/ ; done $ for BAD in ${BADT} ; do for FIL in ${BADF} ] ; do grep ${FIL} index/${BAD}.index ; done ; nedit index/${BAD}.index ; done $ for BAD in ${BADT} ; do echo $BAD ; for FIL in ${BADF} ] ; do grep ${FIL} index/${BAD}.index ; done ; nedit index/${BAD}.index ; done Logged and annotated to maint/howbad/fix.log Sent this email to minos-data, arms, howcroft, kordosky, about 12:10 There are 34 mangled files, residing in 12 tarfiles. Per rhatcher's suggestion, I have edited the howcroft/*.index files for those tarfiles to remove the mangled file names. This makes the mangled files invisible, nearly as good as rebuilding the tars, and Robert can proceed. Notes on this are under mindata/maint/howbad/ Enjoy ! Note that 4 of the tarfiles are now moot, so have deleted them from /pnfs/minos/stage/howcroft n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar n11011007_0000_L010185N_D00-n11011011_0000_L010185N_D00.tar n11011012_0000_L010185N_D00-n11011016_0000_L010185N_D00.tar n11011017_0000_L010185N_D00-n11011021_0000_L010185N_D00.tar This is to avoid conflicts on re-import.
Just in time, as these show up in current mcimport, from howcroft/log/mcimport.log n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar 5 n11011007_0000_L010185N_D00-n11011011_0000_L010185N_D00.tar 5 n11011012_0000_L010185N_D00-n11011016_0000_L010185N_D00.tar 5 n11011017_0000_L010185N_D00-n11011021_0000_L010185N_D00.tar 5 MINOS26 > cd /pnfs/minos/stage/howcroft MINOS26 > for FILE in $FILES ; do ls -l ${FILE} ; done -rw-r--r-- 1 kreymer e875 1740851200 Jan 25 03:16 n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar -rw-r--r-- 1 kreymer e875 1757184000 Jan 25 03:18 n11011007_0000_L010185N_D00-n11011011_0000_L010185N_D00.tar -rw-rw-r-- 1 kreymer e875 1744199680 Jan 25 03:21 n11011012_0000_L010185N_D00-n11011016_0000_L010185N_D00.tar -rw-rw-r-- 1 kreymer e875 1764659200 Jan 25 03:25 n11011017_0000_L010185N_D00-n11011021_0000_L010185N_D00.tar MINOS26 > for FILE in $FILES ; do rm ${FILE} ; done ########### # ROUNDUP # ########### touched up around 08:30 ============================================================================= 2007 01 30 ############ # MCIMPORT # ############ Relaunched mcimport on kordosky with hack, select last not first all.md5 match, there may be duplicate entries as for n11011006_0005_L010000N_D00.tar.gz mcimport.20070127 -n - do continue to set pid, but do not do ecrc This is too much change since 1/27, rename it to 1/30 MINOS26 > mv mcimport.20070127 mcimport.20070130 $ ln -sf /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/mcimport.20070130 mci MINOS26 > ln -sf mcimport.20070130 mcimport # was mcimport.20070126 Found mcimport local copy pointing to /afs/.../mcimport, OOPS, corrected this back to pure local copy around 08:51, hope this did not pull run from under running mcimport kordosky ( looks OK ) Update notes from mcimport.20070130 Added test for valid .gz file, gunzip -t this required tar -r , 1 file at a time Added test for free disk space in TAR Added test for existence of INPAT directory Take final all.md5sum match, not first, to handle duplicates Added sort of file ALLFILES Added rate report for TAR Added PURGE ahead of TAR Added log message for PURGE files not in PNFS Added MINAGE variable to set minimum file age, changed from 10 to 30 Changed CLASS variable name to CONFIG Do not do ecrc in PURGE when NOOP is set For next version : + Added ALL users, using MCIMPORT to control activity + NOIMPORT, TRIGTIME, TRIGSIZE trigger concatenation in generic running BAD MD5 from howcroft, 34 files FILES=`cat ~kreymer/minos/maint/badmd5.txt cd STAGE/howcroft/index/ for FILE in $FILES ; do grep $FILE *.index ; done | wc -l 36 for FILE in $FILES ; do grep $FILE *.index ; done > ../log/badmd5.log Duplicates are n11011038_0005_L010185N_D00-n12011010_0002_L010185N_D00.index:n11011038_0007_L010185N_D00.tar.gz n11011038_0007_L010185N_D00-n11011039_0002_L010185N_D00.index:n11011038_0007_L010185N_D00.tar.gz n11011137_0010_L010185N_D00-n11011138_0001_L010185N_D00.index:n11011138_0001_L010185N_D00.tar.gz n11011138_0001_L010185N_D00-n11011138_0005_L010185N_D00.index:n11011138_0001_L010185N_D00.tar.gz To remove them : rebuild the tarfiles ? use --remove_files ? no , just rebuild. And remove these entries from the index files. ########### # ROUNDUP # ########### bntp files have been missing this year, recovery being debated in minos_batch I'd prefer leaving existing files alone and rerunning to produce just bntp. Howie is proceeding with this plan ( 1b ) this afternoon. 
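The per-file checks that went into mcimport.20070130 boil down to a few lines; this sketch is not the script itself, and the md5 directory layout is the one used above.

# Sketch : validate one incoming tarball before it is added to a tar with tar -r.
checkfile () {
    FILE=$1                              # e.g. n11011519_0006_L010185N_D00.tar.gz
    gunzip -t ${FILE} || { echo "OOPS - ${FILE} not a valid gzip file" ; return 1 ; }
    # take the LAST all.md5 entry, since re-imported files appear twice
    WANT=`grep ${FILE} md5/all.md5 | tail -1 | cut -f 1 -d ' '`
    HAVE=`md5sum ${FILE} | cut -f 1 -d ' '`
    [ "${HAVE}" = "${WANT}" ] || { echo "OOPS - md5 mismatch for ${FILE}" ; return 1 ; }
}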
============================================================================= 2007 01 29 ############ # PREDATOR # ############ Note that far_dcs_data finally showed up Sat 2007 01 27 SRV1> dfarm usage rubin Used: 63759 + Reserved: 0 / Quota: 1000000 (MB) SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far SRV1> dfarm usage rubin ############ # MCIMPORT # ############ Discussed things at 11:30 non-meeting ( off week ) arms kreymer kordosky howcroft rhatcher Howcroft access errors ( no access, permission denied ) have probably been due to local removal of his /tmp/* ticket He will modify copy scripts to detect ( klist -s ) and correct. People have not been using -c blowfish with scp, will do so. rhatcher found zero-file n11011001_0000_L010185N_D00.tar.gz [howcroft@positron02 L010185_near_1001_0]$ du n11011001_0000_L010185N_D00.tar.gz 336612 n11011001_0000_L010185N_D00.tar.gz [howcroft@positron02 L010185_near_1001_0]$ md5sum n11011001_0000_L010185N_D00.tar.gz e8ba468e14a44870337470722fb98111 n11011001_0000_L010185N_D00.tar.gz Locally, $ grep n11011001_0000_L010185N_D00 *.index n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.index:n11011001_0000_L010185N_D00.tar.gz $ setup dcap v2_36_f0506 -q unsecured $ pwd /var/tmp/mindata/TMP $ FIN=n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar $ dccp dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/minos/stage/howcroft/${FIN} . 1740851200 bytes in 41 seconds (41464.63 KB/sec) $ FIL=n11011001_0000_L010185N_D00.tar.gz $ od $FIL 0000000 000000 000000 000000 000000 000000 000000 000000 000000 * 2441445140 000000 000000 000000 2441445145 $ time wc $FIL 0 0 344345189 n11011001_0000_L010185N_D00.tar.gz real 0m3.665s user 0m3.460s sys 0m0.190s $ ls -l $FIL -rw-r--r-- 1 mindata e875 344345189 Jan 24 04:26 n11011001_0000_L010185N_D00.tar.gz $ du -b $FIL 344690688 n11011001_0000_L010185N_D00.tar.gz $ time gunzip -t $FIL gunzip: n11011001_0000_L010185N_D00.tar.gz: not in gzip format real 0m0.002s user 0m0.010s sys 0m0.000s $ echo $? 1 $ for FILE in `ls /local/scratch26/mindata/howcroft/*.tar.gz` ; do wc ${FIL} ; done ... 1077744 6822523 354953788 /local/scratch26/mindata/howcroft/n11011219_0003_L010185N_D00.tar.gz 419746 2654670 138010624 /local/scratch26/mindata/howcroft/n11011219_0004_L010185N_D00.tar.gz 1034217 6558738 341289917 /local/scratch26/mindata/howcroft/n11011219_0010_L010185N_D00.tar.gz 211736 1345673 70098944 /local/scratch26/mindata/howcroft/n11011229_0001_L010185N_D00.tar.gz 31312 197469 10150947 /local/scratch26/mindata/howcroft/n12011209_0002_L010185N_D00.tar.gz SUMMARY :::::: Unlike earlier problem with kordosky files copied when the disk was full, du is not providing a diagnostic for these sparse files. wc -w would seem to give a good robust test. gunzip -t could be even better. 
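Putting that summary to work is a one-liner; a sketch over the howcroft area, using the same gunzip -t test timed below.

# gunzip -t catches both the all-NUL files ( "not in gzip format", fails immediately )
# and genuinely corrupt ones ( costs a full decompress, roughly 10 s per file ) :
for FILE in /local/scratch26/mindata/howcroft/*.tar.gz ; do
    gunzip -t ${FILE} 2>/dev/null || echo "BAD ${FILE}"
done
# wc -w is an alternative of similar cost ; 0 words means nothing but NUL bytes :
# [ `wc -w < ${FILE}` -eq 0 ] && echo "ALL NUL ${FILE}"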
Time these for a valid file : $ dd if=/local/scratch26/mindata/howcroft/n11011219_0003_L010185N_D00.tar.gz of=n11011219_0003_L010185N_D00.tar.gz 693269+1 records in 693269+1 records out $ time wc -w n11011219_0003_L010185N_D00.tar.gz 6822523 n11011219_0003_L010185N_D00.tar.gz real 0m9.069s user 0m6.500s sys 0m0.390s $ time gunzip -t n11011219_0003_L010185N_D00.tar.gz real 0m10.207s user 0m10.040s sys 0m0.150s ########## # DCACHE # ########## kennedy reported corruption of /pnfs/fnal.gov/usr/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root PNFSid = 000F00000000000004D6B328 in w-stkendca10a-3 cd /export/stage/minfarm/ROUNDUP_TEST/TEST IFILE=N00011648_0015.cosmic.cand.cedar.0.root IPATH=minos/reco_near/cedar/cand_data/2007-01 DCPOR=24136 DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} dccp ${DFILE} . 113837134 bytes in 3 seconds (37056.36 KB/sec) SRV1> ls -l /pnfs/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root -rw-r--r-- 1 rubin numi 113837134 Jan 29 10:51 /pnfs/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root MINOS26 > ./dc_stat /pnfs/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root ============================ PNFS status for /pnfs/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root -rw-r--r-- 1 1334 e875 113837134 Jan 29 10:51 N00011648_0015.cosmic.cand.cedar.0.root LEVEL 2 2,0,0,0.0,0.0 :c=1:7d20ff59;h=yes;l=113837134; w-stkendca10a-3 r-stkendca14a-3 LEVEL 4 ============================ I have removed the file, per kennedy request. about 17:40 rm /pnfs/minos/reco_near/cedar/cand_data/2007-01/N00011648_0015.cosmic.cand.cedar.0.root ============================================================================= 2007 01 27 ############## # MCTARCHECK # ############## 00:25 MINOS13 > cd /local/scratch13/kreymer MINOS13 > ~/minos/scripts/mctarcheck howcroft > howcroft.log 2>&1 & OOPS, accidentally ran this with kordosky briefly, lost kordosky.log Copied kordosky files to afs, cp -ax kordosky \ /afs/fnal.gov/files/data/minos/log_data/mcimport/kordosky/mccheck howcroft finished around 15:29. cp -ax howcroft \ /afs/fnal.gov/files/data/minos/log_data/mcimport/howcroft/mccheck ============================================================================= 2007 01 26 ########## # DCACHE # ########## SRM doors down, expired host tickets, helpdesk ticket by rubin 91593 Some certs updated by Berg at 8 PM last night, not sufficient. 11:10 srm servers restarted by kennedy, cleared caches, OK now To run srmls on minos26, need to $ cd /local/scratch26/kreymer/SRM $ srmclient/bin/srmls ${SPATH2} ############ # MCIMPORT # ############ mcimport.20070126 - group by configuation (all but run/subrun) Changed the config of some test files, in all three relevant fields cd /local/scratch26/mindata/kreymer cp -a TMP/* . 
mv n12011054_0003_L250200N_D00.tar.gz n12021054_0003_L250200N_D00.tar.gz mv n12011054_0004_L250200N_D00.tar.gz n12021054_0004_L250200N_D00.tar.gz mv n12011054_0005_L250200N_D00.tar.gz n12011054_0005_L100200N_D00.tar.gz mv n12011054_0006_L250200N_D00.tar.gz n12011054_0006_L100200N_D00.tar.gz mv n12011054_0007_L250200N_D00.tar.gz n12011054_0007_L250200N_D01.tar.gz mv n12011054_0008_L250200N_D00.tar.gz n12011054_0008_L250200N_D01.tar.gz ( made a little pop script to do this for testing ) MINOS26 > ln -sf mcimport.20070126 mcimport # was 20070125 $ cp -a afsmcimport mcimport # was 20070118 At about 14:00, let's get back to work ./mcimport kordosky kordosky scan is nearly done on minos13. While they're on disk, lets touch up MINOS26 > ./stage -T stage/kordosky staging howcroft next ########### # ROUNDUP # ########### SRV1> dfarm usage rubin Used: 56656 + Reserved: 0 / Quota: 1000000 (MB) SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far SRV1> dfarm usage rubin Used: 47838 + Reserved: 0 / Quota: 1000000 (MB) ############ # PREDATOR # ############ Clearing empty .py files since 24 Jan, restarting .sam.py files under 1 KB MINOS26 > for DIR in `ls GDAT/` ; do find GDAT/${DIR}/2007-01 -type f -name \*.sam.py -mtime -3 -size -1 -exec ls -l {} \; ; done -rw-r--r-- 1 kreymer g020 0 Jan 24 05:11 GDAT/beam_data/2007-01/B070123_080001.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:12 GDAT/beam_data/2007-01/B070123_160001.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:12 GDAT/beam_data/2007-01/B070124_000001.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:13 GDAT/far_dcs_data/2007-01/F070101_170032.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:13 GDAT/far_dcs_data/2007-01/F070102_000021.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:13 GDAT/far_dcs_data/2007-01/F070103_000000.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:10 GDAT/fardet_data/2007-01/F00037265_0014.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:11 GDAT/fardet_data/2007-01/F00037265_0015.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 07:10 GDAT/fardet_data/2007-01/F00037265_0016.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 07:11 GDAT/fardet_data/2007-01/F00037265_0017.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 09:10 GDAT/fardet_data/2007-01/F00037265_0018.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 09:10 GDAT/fardet_data/2007-01/F00037265_0019.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:12 GDAT/near_dcs_data/2007-01/N070123_000002.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:07 GDAT/neardet_data/2007-01/N00011615_0012.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 05:09 GDAT/neardet_data/2007-01/N00011615_0013.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 07:08 GDAT/neardet_data/2007-01/N00011615_0014.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 07:09 GDAT/neardet_data/2007-01/N00011615_0015.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 09:08 GDAT/neardet_data/2007-01/N00011615_0016.sam.py -rw-r--r-- 1 kreymer g020 0 Jan 24 09:09 GDAT/neardet_data/2007-01/N00011615_0017.sam.py MINOS26 > for DIR in `ls GDAT/` ; do find GDAT/${DIR}/2007-01 -type f -name \*.sam.py -mtime -3 -size -1 -exec rm {} \; ; done ============================================================================= 2007 01 25 ########### # NETWORK # ########### Maintenance scheduled 06:00 to 06:30. Restarted afsd on desktop /etc/init.d/afsd restart did not help ( did this too early, AFS server was unstable.) OK as of 08:00. Lost a couple of windows (minos26, LOG, nedit(LOG) ) AFS has been unstable, seems to be up around 08:00 Web servers have been down, seem to be up now. but not the fndca3a pages. 
08:12 - web server is down again (www-numi.fnal.gov) DAQ Archiver started succeeding about 08:12 ####### # SAM # ####### Nelly installed quarterly patches on minosprd. Ran test projects on minos26, looks good. Had to restart dbserver. ############ # MCIMPORT # ############ Updated mcimport to correctly set and lock the pid file and to use md5sum to much more quickly check tarfile content. A number of kordosky files are empty ( du -sb ), but show a size in dir. Cross check against mike's list in maint/no_space_L010000.txt maint/no_space_L250200.txt for FILES=`cat ~kreymer/minos/maint/no_space_L010000.txt | cut -f 1 -d .` and FILES=`cat ~kreymer/minos/maint/no_space_L250200.txt | cut -f 1 -d .` for FIL in ${FILES} ; do du -sb kordosky/${FIL}.tar.gz ; done for FIL in ${FILES} ; do rm kordosky/${FIL}.tar.gz ; done for FIL in ${FILES} ; do rm kordosky/log/${FIL}.log ; done L010000 files were all 0 bytes length Some L250 files were not, but will delete them all per kordosky $ for FIL in ${FILES} ; do du -sb kordosky/${FIL}.tar.gz ; done 667193344 kordosky/n11011049_0010_L250200N_D00.tar.gz 660463616 kordosky/n11011050_0000_L250200N_D00.tar.gz 191254528 kordosky/n11011050_0010_L250200N_D00.tar.gz 687894528 kordosky/n11011050_0001_L250200N_D00.tar.gz 416866304 kordosky/n11011050_0002_L250200N_D00.tar.gz 614211584 kordosky/n11011050_0003_L250200N_D00.tar.gz 373862400 kordosky/n11011050_0004_L250200N_D00.tar.gz 605798400 kordosky/n11011050_0005_L250200N_D00.tar.gz 691228672 kordosky/n11011050_0006_L250200N_D00.tar.gz 401604608 kordosky/n11011050_0007_L250200N_D00.tar.gz 173281280 kordosky/n11011050_0009_L250200N_D00.tar.gz 0 kordosky/n11011051_0010_L250200N_D00.tar.gz 64204800 kordosky/n11011051_0002_L250200N_D00.tar.gz 169431040 kordosky/n11011051_0004_L250200N_D00.tar.gz 0 kordosky/n11011051_0005_L250200N_D00.tar.gz 0 kordosky/n11011051_0007_L250200N_D00.tar.gz 0 kordosky/n11011051_0009_L250200N_D00.tar.gz 0 kordosky/n11011052_0000_L250200N_D00.tar.gz 43642880 kordosky/n11011052_0010_L250200N_D00.tar.gz 12505088 kordosky/n11011052_0002_L250200N_D00.tar.gz 0 kordosky/n11011052_0003_L250200N_D00.tar.gz 0 kordosky/n11011052_0004_L250200N_D00.tar.gz 191016960 kordosky/n11011052_0005_L250200N_D00.tar.gz 0 kordosky/n11011052_0006_L250200N_D00.tar.gz 0 kordosky/n11011052_0007_L250200N_D00.tar.gz 74678272 kordosky/n11011053_0000_L250200N_D00.tar.gz 560844800 kordosky/n11011053_0010_L250200N_D00.tar.gz 0 kordosky/n11011053_0001_L250200N_D00.tar.gz 127692800 kordosky/n11011053_0002_L250200N_D00.tar.gz 0 kordosky/n11011053_0003_L250200N_D00.tar.gz 0 kordosky/n11011053_0005_L250200N_D00.tar.gz 0 kordosky/n11011053_0006_L250200N_D00.tar.gz 0 kordosky/n11011053_0008_L250200N_D00.tar.gz 353742848 kordosky/n11011054_0000_L250200N_D00.tar.gz 0 kordosky/n11011054_0010_L250200N_D00.tar.gz 0 kordosky/n11011054_0001_L250200N_D00.tar.gz 452763648 kordosky/n11011054_0002_L250200N_D00.tar.gz 0 kordosky/n11011054_0003_L250200N_D00.tar.gz 97021952 kordosky/n11011054_0004_L250200N_D00.tar.gz 0 kordosky/n11011054_0005_L250200N_D00.tar.gz 0 kordosky/n11011054_0006_L250200N_D00.tar.gz 90382336 kordosky/n11011054_0007_L250200N_D00.tar.gz 0 kordosky/n11011054_0008_L250200N_D00.tar.gz 0 kordosky/n11011054_0009_L250200N_D00.tar.gz 0 kordosky/n11011055_0000_L250200N_D00.tar.gz 0 kordosky/n11011055_0010_L250200N_D00.tar.gz 0 kordosky/n11011055_0001_L250200N_D00.tar.gz 190640128 kordosky/n11011055_0002_L250200N_D00.tar.gz 0 kordosky/n11011055_0003_L250200N_D00.tar.gz 0 kordosky/n11011055_0005_L250200N_D00.tar.gz 0 
kordosky/n11011055_0006_L250200N_D00.tar.gz 0 kordosky/n11011055_0007_L250200N_D00.tar.gz 0 kordosky/n11011055_0008_L250200N_D00.tar.gz 0 kordosky/n11011055_0009_L250200N_D00.tar.gz I found and purged one more such empty file not in your list, $ du -sb kordosky/* | sort -nr 0 kordosky/n11011006_0004_L010000N_D00.tar.gz $ ls -l kordosky/n11011006_0004_L010000N_D00.tar.gz -rw-r--r-- 1 mindata e875 0 Jan 24 09:55 kordosky/n11011006_0004_L010000N_D00.tar.gz Now run a more full scale test, for rates $ time cp -a kordosky/n11011008* kreymer/ real 11m52.958s user 0m0.210s sys 0m17.710s That's 2000 mb/799 sec, 220 kordosky/n11011001_0007_L010000N_D00.tar.gz $ time md5sum STAGE/kordosky/n11011001_0007_L010000N_D00.tar.gz 4cdd99fd0a6eb208b03da906b800afbd STAGE/kordosky/n11011001_0007_L010000N_D00.tar.gz real 3m3.608s user 0m0.730s sys 0m0.660s Ugh, that's about 1 MBytes/second... miserable !!! I did have the copy from kordosky to kreymer, plus an scp import running by howcroft, at present. Try this is on a similar file with the cp running : $ time md5sum STAGE/kordosky/n11011010_0010_L010000N_D00.tar.gz f779b46d4be31c710518c7cb5a1ab210 STAGE/kordosky/n11011010_0010_L010000N_D00.tar.gz real 0m43.791s user 0m0.690s sys 0m0.580s That's better, 5 MB/sec, but still a tenth what I expect from modern disks. MD5SUM Testing remote example : FILE=n11011008_0000_L010000N_D00.tar.gz RUSE=kreymer ssh mindata@minos26.fnal.gov "cd STAGE/${RUSE} ; md5sum ${FILE} \ > md5/${FILE}.md5 ; \ cat md5/${FILE}.md5 >> md5/all.md5 ; \ cat md5/${FILE}.md5 ; \ rm md5/${FILE}.md5 " mccheck script, to dump sizes/checksums of existing tars local tests, rates : MINOS13 > dccp ${DPATH}/${FUSE}/${TAR} . 1619066880 bytes in 43 seconds (36770.23 KB/sec) MINOS13 > time tar xf ${TAR} -C /var/tmp/kreymer/${FUSE} real 0m38.792s user 0m0.150s sys 0m11.170s MINOS13 > du -sm n11011001_0000_L100200N_D00-n11011001_0001_L100200N_D00.tar 1546 n11011001_0000_L100200N_D00-n11011001_0001_L100200N_D00.tar MINOS13 > ~/minos/scripts/mctarcheck kordosky 2>&1 | tee kordosky.log INFORMATIONAL: Product 'dcap' (with qualifiers 'unsecured'), has no current chain (or may not exist) Thu Jan 25 17:52:34 CST 2007 n11011001_0000_L100200N_D00-n11011001_0001_L100200N_D00.tar ============================================================================= 2007 01 24 ############ # MCIMPORT # ############ /local/stage filled around 03:00, due to mcimport flood, about 10 to 20 GB/hour from kordosky Disabled further input, via .k5loginmin ( omits project principals ) will restore from .k5loginfull $ du -sm STAGE/*/dcache 15267 STAGE/howcroft/dcache 20121 STAGE/kordosky/dcache 1 STAGE/kreymer/dcache $ du -sm STAGE/*/tar 20806 STAGE/howcroft/tar 34010 STAGE/kordosky/tar 1 STAGE/kreymer/tar Total of 90 GB of tarred files pending to tape. Shifted some kordosky files to /var/tmp/mindata, to breathe : $ du -sm n11011044* 748 n11011044_0005_L250200N_D00.tar.gz 724 n11011044_0006_L250200N_D00.tar.gz 733 n11011044_0008_L250200N_D00.tar.gz 746 n11011044_0009_L250200N_D00.tar.gz 745 n11011044_0010_L250200N_D00.tar.gz $ mkdir /var/tmp/mindata/TMP $ cp -a n11011044*/var/tmp/mindata/TMP/ $ for FIL in n11011044* ; do echo $FIL ; diff ${FIL} /var/tmp/mindata/TMP/${FIL} ; done n11011044_0005_L250200N_D00.tar.gz n11011044_0006_L250200N_D00.tar.gz n11011044_0008_L250200N_D00.tar.gz n11011044_0009_L250200N_D00.tar.gz n11011044_0010_L250200N_D00.tar.gz $ rm STAGE/kordosky/n11011044_0005_L250200N_D00.tar.gz $ #rm n11011044* rhatcher also cleared off 10 GB of space. 
so I have restored this file to kordosky cp -a /var/tmp/mindata/TMP/n11011044_0005_L250200N_D00.tar.gz . diff /var/tmp/mindata/TMP/n11011044_0005_L250200N_D00.tar.gz . $ ./mcimport -w kordosky OOPS - found /local/scratch26/mindata/kordosky/log/mcimport.pid PID TTY TIME CMD OK - stale pid file OK, logging activity to /local/scratch26/mindata/kordosky/log/mcimport.log SRMCP phase ran at about 5' per file, system 200% iowait PURGE phase ran at about 30" per file, system 130% iowait There is one 0 length tarfile : 0 Jan 24 10:56 n11011001_0001_L010000N_D00-n11011044_0005_L250200N_D00.tar $ dds dcache/n11011001_0001_L010000N_D00-n11011044_0005_L250200N_D00.tar -rw-r--r-- 1 mindata e875 0 Jan 24 03:44 dcache/n11011001_0001_L010000N_D00-n11011044_0005_L250200N_D00.tar removed it MINOS26 > rm /pnfs/minos/stage/kordosky/n11011001_0001_L010000N_D00-n11011044_0005_L250200N_D00.tar Similar problem in howcroft ( in tar, none in pnfs ): $ dds howcroft/tar/n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar -rw-r--r-- 1 mindata e875 0 Jan 24 07:37 howcroft/tar/n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar $ rm howcroft/tar/n11011001_0000_L010185N_D00-n11011006_0000_L010185N_D00.tar Launched same for howcroft, and enabled inflow : cp .k5loginfull .k5login 13:03 $ ./mcimport -w howcroft OOPS - found /local/scratch26/mindata/howcroft/log/mcimport.pid PID TTY TIME CMD OK - stale pid file OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log MCIN_DATA Have waited long enough for enstore-admin to run enmv. Have rerun ./mcinfix and reported this to enstore-admin Now finish purging dcache files : $ ./mcimport -w kordosky OK, logging activity to /local/scratch26/mindata/kordosky/log/mcimport.log $ ./mcimport -w howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log MCIMPORT - bad news, pid interlocking is just not working within cron jobs. Will take quite some time to debug, I am mystified. Lots more print statements, I guess. Let's tar up the 38 GB howcroft files, as a start. $ du -sm STAGE/* ... 37818 STAGE/howcroft 110232 STAGE/kordosky ... Wed Jan 24 18:45:53 CST 2007 $ ${HOME}/mcimport howcroft OK, logging activity to /local/scratch26/mindata/howcroft/log/mcimport.log Further problem, logs indicate we run at about 15 GB/hour just making tarfiles. That's pretty lousy, I guess too many copies going on. ============================================================================= 2007 01 23 ############ # MCIMPORT # ############ MINOS26 > for DIR in L010170 L010185 L010000 L010200 L100200 L150200 L250200 ; do rmdir /pnfs/minos/mcin_data/near/daikon_00/${DIR} ; done ########### # ROUNDUP # ########### SRV1> dfarm usage rubin Used: 55910 + Reserved: 0 / Quota: 1000000 (MB) SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far SRV1> dfarm usage rubin Used: 50692 + Reserved: 0 / Quota: 1000000 (MB) ####### # AFS # ####### The dh web area has filled up. MIN > fs listquota /afs/fnal.gov/files/expwww/numi Volume Name Quota Used %Used Partition room.numi 2000000 2000003 100%<< 87% MIN > du -sm * | sort -n du: cannot change to directory `minwork/daqlogs': Permission denied 1 Contacts 1 DocDBSite 1 Gallery ... 2 xstyles 3 collab 4 mtg 7 public 9 PublicInfo 13 offline_software 14 doe_may_04_review 24 MinosAEM 29 doe_feb_05_review 59 Minos 95 DataQuality 113 sam 164 numi_pics 198 computing 268 talks 299 numwork 310 workgrps 889 internal 959 minwork Repeat the study done on 2006 05 26 Note the fs lsquota no longer shows mount points.
So must explicitly sniff out mount points : WEB=/afs/fnal.gov/files/expwww/numi WEBS=`find ${WEB} -type d ` printf "${WEBS}\n" | wc -l 6130 for DIR in ${WEBS} ; do fs lsmount ${DIR} ; done | grep 'is a mount' | tee /tmp/afsmounts MIN > cat /tmp/afsmounts '/afs/fnal.gov/files/expwww/numi' is a mount point for volume '#room.numi' '/afs/fnal.gov/files/expwww/numi/html/talks' is a mount point for volume '#expwww.numi.talks' '/afs/fnal.gov/files/expwww/numi/html/fnal_minos' is a mount point for volume '#expwww.numi.fnalminos' '/afs/fnal.gov/files/expwww/numi/html/minwork' is a mount point for volume '#expwww.numi.minwork' '/afs/fnal.gov/files/expwww/numi/html/numwork' is a mount point for volume '#expwww.numi.numwork' '/afs/fnal.gov/files/expwww/numi/query_files' is a mount point for volume '#nb.w.numi.d1' '/afs/fnal.gov/files/expwww/numi/numinotes' is a mount point for volume '#room.numi.1' '/afs/fnal.gov/files/expwww/numi/numinotes/public/ps' is a mount point for volume '#w.numi.d2' '/afs/fnal.gov/files/expwww/numi/numinotes/restricted/html' is a mount point for volume '#w.numi.d1' '/afs/fnal.gov/files/expwww/numi/file_upload' is a mount point for volume '#expwww.numi.fileupload' for DIR in `cat /tmp/afsmounts | cut -f 2 -d "'"` ; do fs listquota ${DIR} ; done Let's be selfish, and take the whole fnal_minos partition for computing, copying the files, comparing, then renaming the original to computing_retired_20070123 MIN > cp -ax computing fnal_minos MINOS26 > du -sm /afs/fnal.gov/files/expwww/numi/html/fnal_minos/computing 198 /afs/fnal.gov/files/expwww/numi/html/fnal_minos/computing MINOS25 > time diff -r computing fnal_minos/computing MIN > mv computing computing_retired_20070123 ; ln -s fnal_minos/computing computing out of quota, had to clean up. meanwhile, script had recreated computing confusing things when I did MIN > ln -s fnal_minos/computing computing Cleaned up, tried again, MIN > mv computing computingx ; ln -s ../fnal_minos/computing computing that was the wrong path, once again cleanly : MIN > mv computing computingy ; ln -s fnal_minos/computing computing That looks good. 
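For the record, the copy-verify-swap order that finally worked, as a sketch to reuse the next time a web volume fills; assumed to run from the directory that holds computing and fnal_minos.

# Move a directory onto a roomier AFS volume, then leave a symlink behind.
cp -ax computing fnal_minos/                  # copy into the mounted volume
diff -r computing fnal_minos/computing        # verify before touching the original
mv computing computing_retired_20070123       # retire, do not remove, the original
ln -s fnal_minos/computing computing          # link is relative to this directory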
Now pick up a few stray bits of content, from an earlier diff -r MIN > FIL=computing/dh/beamlog/2007/01/23.txt MIN > nedit $FIL < 190 Tue Jan 23 14:43:12 CST 2007 < 190 Tue Jan 23 14:44:13 CST 2007 < 186 Tue Jan 23 14:45:15 CST 2007 < 188 Tue Jan 23 14:46:12 CST 2007 < 187 Tue Jan 23 14:47:13 CST 2007 < 188 Tue Jan 23 14:48:12 CST 2007 < 188 Tue Jan 23 14:49:12 CST 2007 MIN > FIL=computing/dh/ftplog/2007/01/23.txt MIN > nedit $FIL < 2 Tue Jan 23 14:48:21 CST 2007 557 MIN > FIL=computing/dh/pnfslog/2007/01/23.txt MIN > nedit $FIL < 3 Tue Jan 23 14:42:37 CST 2007 < 1 Tue Jan 23 14:47:38 CST 2007 Picked up more bits from computingx MIN > find computingx -type f computingx/dh/pnfslog/2007/01/23.txt computingx/dh/beamlog/2007/01/23.txt computingx/database/oracle/topdb/minosprd/2007/01/23/14.txt MIN > cat computingx/dh/pnfslog/2007/01/23.txt 1 Tue Jan 23 14:52:39 CST 2007 MIN > cat computingx/dh/beamlog/2007/01/23.txt 187 Tue Jan 23 14:53:11 CST 2007 187 Tue Jan 23 14:54:11 CST 2007 187 Tue Jan 23 14:55:12 CST 2007 MIN > cat computingx/database/oracle/topdb/minosprd/2007/01/23/14.txt Tue Jan 23 14:55:20 CST 2007 All user connections to minosprd Access account User name Logon Client machine Program STATUS Time -------------------- -------------------- ---------- ------------------------------ ------------------------------ -------- ---------- MONITOR kreymer 14:55:16 minos26.fnal.gov sqlplus@minos26.fnal.gov (TNS ACTIVE 0 DBSNMP oracle 07:39:34 minosora1.fnal.gov emagent@minosora1.fnal.gov (T ACTIVE 2 DBSNMP oracle 07:39:32 minosora1.fnal.gov emagent@minosora1.fnal.gov (T INACTIVE 26 Elapsed: 00:00:00.01 COUNT(*) ---------- 3 Elapsed: 00:00:00.00 Database server cpu used for sessions terminating within past minute: User name Total cpu Sessions cpu/session -------------------- --------- -------- ----------- oracle .1 3 .02 Elapsed: 00:00:00.91 Edited this into current computing. Copied database file 14.txt topdb stopped around 10:54. ============================================================================= 2007 01 22 ############ # MCIMPORT # ############ Oops, massive write errors in DCache due to my directory renames. Writes to /pnfs/minos/mcin_data/near/daikon_00/L010185N/ started around 17:00 Friday 19 Jan. Directories were renamed around 19:40. Need to unfix damage done by the mcinfix script, created mcunfix script, ran it after doing manual $ mkdir /pnfs/minos/mcin_data/near/daikon_00/L010185N/10 $ chmod 775 /pnfs/minos/mcin_data/near/daikon_00/L010185N $ chmod 775 /pnfs/minos/mcin_data/near/daikon_00/L010185N/10 $ ./mcunfix 2>&1 | tee mcunfix.log OK - 45 files in 100 Mon Jan 22 14:32:38 CST 2007 OK - 60 files in 101 Mon Jan 22 14:33:29 CST 2007 Mon Jan 22 14:34:36 CST 2007 ################ # CONTROL ROOM # ################ The free space is about 500 MB, with files currently being written to /home/minos/BD/devel/BeamData/java/NuMIMon mkdir /acnet/minos/NuMIMon $ cd /home/minos/BD/devel/BeamData/java/NuMIMon $ ls xml*.dat | wc -l 29 $ FILES=`ls xml*.dat | head -28` for FILE in ${FILES} ; do cp ${FILE} /acnet/minos/NuMIMon/${FILE} done for FILE in ${FILES} ; do if diff --brief ${FILE} /acnet/minos/NuMIMon/${FILE} ; then echo rm ${FILE} echo ln -s /acnet/minos/NuMIMon/${FILE} ${FILE} else printf " OOPS, copy error for ${FILE} \n" fi done ########### # ROUNDUP # ########### SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far NO go, the script is not executeable. 
Modified around 14:16 by howie, to change from /usr/local/etc/setups.sh to /fnal/ups/etc/setups.sh chmod 775 round* chmod 775 dfarmsum chmod 775 remove_duplicates GRRRRRRRRRRRRRRRRRRRRRRR encp v3_5c no longer sets up ( it work on Friday ). envp v3_6d seems OK, but was just installed after 14:00 this afternoon. afs products are suddenly in the path. dfarm no longer works ( just hangs ) SRV1> setup dfarm SRV1> type dfarm dfarm is /fnal/ups/prd/dfarm/v3_1a/Linux/bin/dfarm SRV1> ups list -aK+ dfarm "dfarm" "v3_1a" "Linux" "" "current" SRV1> dfarm usage rubin Traceback (most recent call last): File "/fnal/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 1236, in ? usg, res, qta = c.getUsage(args[0]) File "/fnal/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 303, in getUsage self.connect() File "/fnal/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 232, in connect ans = self.DStr.sendAndRecv('HELLO %s' % self.Username) File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 386, in sendAndRecv return self.recv(tmo = tmo) File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 379, in recv while not self.readMore(maxmsg, tmo): File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 278, in readMore r,w,e = select.select([fd],[],[],tmo) KeyboardInterrupt MIN > ssh -l minfarm fnpc146 minfarm on fnpc146% source /fnal/ups/etc/setups.csh minfarm on fnpc146% setup dfarm minfarm on fnpc146% date ; dfarm usage rubin ; date Mon Jan 22 18:24:59 CST 2007 Traceback (most recent call last): File "/local/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 1236, in ? usg, res, qta = c.getUsage(args[0]) File "/local/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 303, in getUsage self.connect() File "/local/ups/prd/dfarm/v3_1a/Linux/lib/dfarm_api.py", line 232, in connect ans = self.DStr.sendAndRecv('HELLO %s' % self.Username) File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 386, in sendAndRecv return self.recv(tmo = tmo) File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 379, in recv while not self.readMore(maxmsg, tmo): File "/local/ups/prd/fcslib/v2_1a/NULL/lib/SockStream.py", line 278, in readMore r,w,e = select.select([fd],[],[],tmo) KeyboardInterrupt Mon Jan 22 18:27:02 CST 2007 ============================================================================= 2007 01 21 ########### # DESKTOP # ########### Found desktop system powered down. Restarted cleanly ( unclean shutdown ) from /var/log/messages : Jan 20 09:37:20 minos-93198 sshd[21899]: Failed none for illegal user scanner from 131.225.110.131 port 52730 ssh2 Jan 20 09:37:20 minos-93198 sshd[21899]: Connection closed by 131.225.110.131 Jan 20 09:51:14 minos-93198 telnetd[21900]: ttloop: peer died: Invalid or incomplete multibyte or wide character Jan 20 10:28:16 minos-93198 kernel: e1000: eth0: e1000_watchdog: NIC Link is Down Jan 21 14:34:48 minos-93198 syslogd 1.4.1: restart. Jan 21 14:34:48 minos-93198 syslog: syslogd startup succeeded N.B. - this powerdown was due to electrical maintenance last weekend. ############ # MCIMPORT # ############ per kordosky email 20 Jan 03:37:02, pid check is not effective ! Why does exit 1 not exit ? Because pid is invoked as a funtion ? 
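On the "why does exit 1 not exit" question : exit inside a shell function does terminate the calling script when the function is called directly, but not when the function runs in a subshell (a pipeline stage, command substitution, or background job). A small demo of the shell behaviour only, not the mcimport code; pid_check here is a made-up stand-in :

    #!/bin/sh
    pid_check() {
        echo "OOPS - previous copy still running, quitting"
        exit 1
    }
    pid_check | tee /dev/null    # pipeline : the function runs in a subshell,
                                 # so exit 1 ends only the subshell
    echo "this line still runs"
    pid_check                    # called directly : exit 1 stops the script here
    echo "this line is never reached"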
( No harm done in this case, previous process was just purging ) But note tar failures Fri Jan 5 03:37:02 CST 2007 /var/tmp/mindata/MCTAR/kordosky/n11011418_0004_L010185N_D00.tar.gz Mon Jan 8 19:37:01 CST 2007 OOPS - tar file corrupt /var/tmp/mindata/MCTAR/kordosky/n11011453_0006_L010185N_D00-n12011455_0006_L010185N_D00.tar Tue Jan 9 11:37:02 CST 2007 OOPS - tar file corrupt /var/tmp/mindata/MCTAR/kordosky/n11011426_0000_L010185N_D00-n11011430_0000_L010185N_D00.tar Wed Jan 10 03:37:01 CST 2007 OOPS - tar file corrupt /var/tmp/mindata/MCTAR/kordosky/n11011459_0006_L010185N_D00-n12011459_0006_L010185N_D00.tar Wed Jan 10 19:37:01 CST 2007 OOPS - tar file corrupt /var/tmp/mindata/MCTAR/kordosky/n11011001_0000_L100200N_D00-n11011001_0002_L100200N_D00.tar ============================================================================= 2007 01 19 ############ # MCIMPORT # ############ 10:35 : $ cp afsmcimport mcimport This will copy data from the write to read pool. MINOS26 > ./stage -d -p 0 stage/kordosky Needed 120/ 373 FINISHED Fri Jan 19 10:39:02 CST 2007 MINOS26 > ./stage -d -p 0 stage/howcroft ........ Needed 119/ 341 FINISHED Fri Jan 19 10:40:16 CST 2007 MINOS26 > ./stage stage/kordosky MINOS26 > ./stage stage/howcroft MINOS26 > date Fri Jan 19 10:42:43 CST 2007 This should get older files on disk, newer ones should start there. ########### # ROUNDUP # ########### fnpcsrv1 There have been periods this morning, from about 10:20 through 10:45, when simple commands have hung up for minutes ls ps top cat /proc/meminfo ./roundup -n -r cedar near ./roundup -r cedar near There were network problems earlier today, per timm ( fermigrid-help ) Do not use farm-admin in future. Successfully ran ( after 11;40 ) SRV1> ./roundup -r cedar near SRV1> ./roundup -r cedar far ########### # FIREFOX # ########### firefox crashed on my desktop going to the network speed test page /FIREFOX/run-mozilla.sh: line 424: 15208 Segmentation fault "$prog" ${1+"$@"} restarted ############# # mcinwrite # ############# cp mcimport mcinwrite will move files from the given source path to the proper release in mcin_data, Example : ./mcinwrite -f -r daikon_00 $ MCI=/afs/fnal.gov/files/data/minos/d185/daikon_00/fnal/ $ ./mcinwrite -n -v -s n11011001_0001_L010185N_D00 -r daikon_00 ${MCI} Write one selected file $ ./mcinwrite -s n11011001_0001_L010185N_D00 -r daikon_00 ${MCI} 16:55 $ ./mcinwrite -r daikon_00 ${MCI} ( need to clean up PID clearing and exit from write ) ============================================================================= 2007 01 18 ############ # MCIMPORT # ############ mcimport.20070118 - added dccp -P to move file into read pools sooner ln -sf mcimport.20070118 mcimport # was 20070104 tested in kreymer cp howcroft/n*.tar.gz kreymer/ rm kreymer/n11011165_0002_L010185N_D00.tar.gz # drop a transit file ~kreymer/minos/scripts/mcimport.20070118 -W kreymer afsmcimport -w kreymer Seems to have worked : $ ~kreymer/minos/scripts/dc_stat /pnfs/minos/stage/kreymer/n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar ============================ PNFS status for /pnfs/minos/stage/kreymer/n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar -rw-r--r-- 1 kreymer e875 1744189440 Jan 18 17:03 n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:94e26d79;l=1744189440; w-stkendca10a-3 LEVEL 4 ============================ then a couple of minutes later : $ ~kreymer/minos/scripts/dc_stat /pnfs/minos/stage/kreymer/n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar 
============================ PNFS status for /pnfs/minos/stage/kreymer/n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar -rw-r--r-- 1 kreymer e875 1744189440 Jan 18 17:03 n11011164_0006_L010185N_D00-n11011165_0010_L010185N_D00.tar LEVEL 2 2,0,0,0.0,0.0 :c=1:94e26d79;h=yes;l=1744189440; r-stkendca15a-5 w-stkendca10a-3 LEVEL 4 ============================ TOMORROW - should ( while crons are idle ) cp afsmcimport mcimport ########### # ROUNDUP # ########### roundup.20060118 - suppressed 0 length files, using string ' rubin 0 ' ln -sf roundup.20070118 roundup # was roundup.2070117 MINOS26 > scp minfarm@fnpcsrv1:scripts/roundup.20070118 . 09:55 running full fardet catchup SRV1> ./roundup -r cedar far srmcp is failing, like java.io.IOException: credential remaining lifetime is less then a minute ... srm client error: java.io.IOException: credential remaining lifetime is less then a minute ( The error message is misspelled, should be 'less than a minute' ) SRV1> grid-proxy-info -f /home/minfarm/.grid/x509up_u1334 subject : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990/CN=687673363 issuer : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 identity : /DC=org/DC=doegrids/OU=People/CN=Howard Rubin 496990 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : /home/minfarm/.grid/x509up_u1334 timeleft : 6024:51:56 (251.0 days) SRV1> grid-proxy-info -f /home/minfarm/.grid/kreymer-doe.proxy subject : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310/CN=768538851 issuer : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 identity : /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310 type : Proxy draft (pre-RFC) compliant impersonation proxy strength : 512 bits path : /home/minfarm/.grid/kreymer-doe.proxy timeleft : 2638:01:18 (109.9 days) But this works if I explicitly give the proxy specified in the config.xml or kreymer.xml SRV1> srmcp -streams_num=1 -server_mode=active --x509_user_proxy=/home/minfarm/.grid/x509up_u1334 $SFILE file:///TEST.dat SRV1> ls -l TEST.dat -rw-rw-r-- 1 minfarm numi 15761813 Jan 18 11:28 TEST.dat SRV1> srmcp -streams_num=1 -server_mode=active --x509_user_proxy=/home/minfarm/.grid/kreymer-doe.proxy $SFILE file:///TEST.dat SRV1> dds TEST.dat -rw-rw-r-- 1 minfarm numi 15761813 Jan 18 11:17 TEST.dat What has changed since yesterday ? What changed is the the /tmp proxy expired. SRV1> rm /tmp/x509up_u10871 Now the default proxy is absent, the config file works as intended. Even an empty file in the default path causes failure : org.globus.gsi.GlobusCredentialException: [JGLOBUS-11] No certificates loaded CONFIG FILE UPGRADE ? see kreymer2.xml vs kreymer.xml Should we move from /fnal/ups paths to /export/osg ? Reran roundup near and far, OK ! HOWTO.dccp - removed srm information to HOWTO.srm ============================================================================= ########### # ROUNDUP # ########### roundup.20060117 - changed required by fnpcsrv1 upgrades changed setup encp from v3_5a to v3_6c -q stken srmcp v1.25 seems to working ok, tested with srmls and srmcp per HOWTO.dccp ln -sf roundup.20070117 roundup # was roundup.2070116 Howie is having trouble with srmls under tcsh, OK under bash. The problem was the need for a ? character in the srm path, which needed to be escaped like \? in tcsh. SRV1> ./roundup -s F00037213_ -r cedar far Looks OK to me ( made safety copies if input files in ROUNDUP_TEST/TEST Will run full catchup tomorrow. 
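Given that the real culprit was a stale default proxy in /tmp silently overriding the one named in the config file, a lifetime guard before each srmcp would save the next round of confusion. A sketch only; the one-hour threshold is arbitrary, and SFILE is assumed to be set as in the tests above :

    PROXY=/home/minfarm/.grid/x509up_u1334
    # insist on at least an hour of proxy lifetime before trusting it
    if LEFT=`grid-proxy-info -f ${PROXY} -timeleft` && [ "${LEFT}" -gt 3600 ] ; then
        srmcp -streams_num=1 -server_mode=active \
            --x509_user_proxy=${PROXY} ${SFILE} file:///TEST.dat
    else
        echo "OOPS - proxy ${PROXY} has ${LEFT:-0} seconds left, not running srmcp"
    fi
    rm -f /tmp/x509up_u`id -u`    # optional : clear any stale default proxy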
============================================================================= 2007 01 16 ########### # ROUNDUP # ########### Purging STRAYS and ODDS, per plan STRAYS ( 115) and ODDS (8) match the total count (123) Verify the files are still there in DFARM, then remove them. for DET in N F ; do if [ "${DET}" = "N" ] ; then det=near ; else det=far ; fi for FIL in `cat SDFARM${DET}` ; do dfarm ls /minos/${det}cat/${FIL} done ; done for DET in N F ; do if [ "${DET}" = "N" ] ; then det=near ; else det=far ; fi for FIL in `cat SDFARM${DET}` ; do dfarm rm /minos/${det}cat/${FIL} done ; done All files in DFARM are now recent (2007). Two files have 0 length : frwrw 2 rubin 0 01/15 07:54:42 F00037233_0002.all.sntp.cedar.0.root frwrw 2 rubin 0 01/15 08:29:30 F00037230_0019.all.sntp.cedar.0.root (N.B. these files were removed, rewritten around 01/16 14:13:08 ) Preview the catchup run : SRV1> ./roundup -n -r cedar near OK - processing 684 files OK - stream cosmic.sntp.cedar OK - 9949 Mbytes in 29 runs SUPPRESS N00011446_0024.cosmic.sntp.cedar.0.root PEND - have 7/24 subruns for N00011446_*.cosmic.sntp.cedar*.root 14 01/02 10:20:22 PEND - have 9/13 subruns for N00011481_*.cosmic.sntp.cedar*.root 11 01/05 04:47:37 PEND - have 11/13 subruns for N00011488_*.cosmic.sntp.cedar*.root 11 01/05 05:48:22 PEND - have 32/24 subruns for N00011516_*.cosmic.sntp.cedar*.root 8 01/08 05:21:31 PEND - have 15/17 subruns for N00011542_*.cosmic.sntp.cedar*.root 0 01/15 14:16:24 PEND - have 3/4 subruns for N00011552_*.cosmic.sntp.cedar*.root 0 01/15 14:18:43 PEND - have 14/24 subruns for N00011565_*.cosmic.sntp.cedar*.root 1 01/15 11:29:27 PEND - have 19/24 subruns for N00011568_*.cosmic.sntp.cedar*.root 0 01/15 14:15:40 PEND - have 14/25 subruns for N00011574_*.cosmic.sntp.cedar*.root 0 01/16 06:21:19 ... 
PEND - have 7/24 subruns for N00011446_*.spill.sntp.cedar*.root 14 01/02 10:20:52 PEND - have 9/13 subruns for N00011481_*.spill.sntp.cedar*.root 11 01/05 04:48:09 PEND - have 11/13 subruns for N00011488_*.spill.sntp.cedar*.root 11 01/05 05:48:53 PEND - have 32/24 subruns for N00011516_*.spill.sntp.cedar*.root 8 01/08 05:22:04 PEND - have 15/17 subruns for N00011542_*.spill.sntp.cedar*.root 0 01/15 14:17:56 PEND - have 3/4 subruns for N00011552_*.spill.sntp.cedar*.root 0 01/15 14:19:35 PEND - have 11/24 subruns for N00011565_*.spill.sntp.cedar*.root 1 01/15 11:29:49 PEND - have 18/24 subruns for N00011568_*.spill.sntp.cedar*.root 0 01/15 14:16:12 PEND - have 14/25 subruns for N00011574_*.spill.sntp.cedar*.root 0 01/16 06:21:40 SRV1> ./roundup -n -r cedar far 2>&1 | tee /tmp/far.pre.log SRV1> grep PEND /tmp/far.pre.log PEND - have 13/19 subruns for F00037162_*.all.sntp.cedar*.root 14 01/01 23:47:22 PEND - have 23/24 subruns for F00037221_*.all.sntp.cedar*.root 4 01/11 23:53:18 PEND - have 22/24 subruns for F00037230_*.all.sntp.cedar*.root 1 01/15 07:57:52 PEND - have 22/24 subruns for F00037233_*.all.sntp.cedar*.root 1 01/15 07:54:42 PEND - have 1/18 subruns for F00037239_*.all.sntp.cedar*.root 0 01/16 00:17:22 PEND - have 13/19 subruns for F00037162_*.spill.sntp.cedar*.root 14 01/01 23:47:35 PEND - have 18/24 subruns for F00037221_*.spill.sntp.cedar*.root 4 01/11 23:53:32 PEND - have 20/24 subruns for F00037230_*.spill.sntp.cedar*.root 1 01/15 07:58:09 PEND - have 15/24 subruns for F00037233_*.spill.sntp.cedar*.root 1 01/15 08:14:03 PEND - have 1/18 subruns for F00037239_*.spill.sntp.cedar*.root 0 01/16 00:17:35 The intial near and far PENDS are due to files already written in 2006-12 : SRV1> ls -l /pnfs/minos/reco_near/cedar/sntp_data/2006-12/N00011446*cosmic* | wc -l 17 SRV1> ls -l /pnfs/minos/reco_near/cedar/sntp_data/2006-12/N00011446*spill* | wc -l 17 SRV1> ls -l /pnfs/minos/reco_far/cedar/sntp_data/2006-12/F00037162_*all* | wc -l 6 SRV1> ls -l /pnfs/minos/reco_far/cedar/sntp_data/2006-12/F00037162_*spill* | wc -l 6 N00011516 was partially reprocessed, subruns 15-22. roundup.20070116 - makes YEMON subdirectories of LOG and HADDLOG ln -sf roundup.20070116 roundup # was roundup.2070110 Round up the initial runs for near and far : ./roundup -n -s N00011446_ -f 0 -r cedar near ./roundup -n -s F00037162_ -f 0 -r cedar far round 16:25 : # # # first output # # # ./roundup -s N00011446_ -f 0 -r cedar near ./roundup -s F00037162_ -f 0 -r cedar far ~minfarm/lists/bad_runs.cedar has list of problem runs. 
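The PEND bookkeeping above comes down to counting which subruns of a run are actually present before concatenating. A rough sketch for a single run and stream; RUN and STRM are the first example from the list above, and the hand-set EXPECT is an assumption, since roundup derives the expected count from its own records :

    DET=near
    RUN=N00011446
    STRM=cosmic.sntp.cedar
    EXPECT=24                    # assumed here, not queried
    HAVE=`dfarm ls /minos/${DET}cat/${RUN}_*.${STRM}.*.root 2>/dev/null | wc -l`
    if [ "${HAVE}" -ge "${EXPECT}" ] ; then
        echo "OK   - have ${HAVE}/${EXPECT} subruns for ${RUN}"
    else
        echo "PEND - have ${HAVE}/${EXPECT} subruns for ${RUN}_*.${STRM}*.root"
    fi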
Recently, N00011468_0000.0 2007-01 106 2 2007-01-03 23:22:41 fnpc166 N00011481_0008.0 2007-01 99445 136 2007-01-04 23:37:16 fnpc192 N00011481_0010.0 2007-01 100083 136 2007-01-04 23:55:07 fnpc174 N00011481_0005.0 2007-01 99799 136 2007-01-05 00:01:10 fnpc188 N00011481_0006.0 2007-01 99189 136 2007-01-05 00:01:59 fnpc146 This leaves a mystery N00011488 missing subruns 10,11 Meanwhile, clean up and write the good data : ./roundup -s N00011481_ -f 0 -r cedar near ./roundup -r cedar near ./roundup -r cedar far Per rubin, removing unwanted duplicates from N00011516 SRV1> dfarm ls /minos/nearcat/*.1.* frwrw 2 rubin 30475986 01/10 16:07:36 /minos/nearcat/N00011516_0015.cosmic.sntp.cedar.1.root frwrw 2 rubin 77598268 01/10 16:08:31 /minos/nearcat/N00011516_0015.spill.sntp.cedar.1.root frwrw 2 rubin 30682257 01/10 16:23:31 /minos/nearcat/N00011516_0016.cosmic.sntp.cedar.1.root frwrw 2 rubin 77519165 01/10 16:24:18 /minos/nearcat/N00011516_0016.spill.sntp.cedar.1.root frwrw 2 rubin 30168133 01/10 18:02:18 /minos/nearcat/N00011516_0017.cosmic.sntp.cedar.1.root frwrw 2 rubin 86727707 01/10 18:03:22 /minos/nearcat/N00011516_0017.spill.sntp.cedar.1.root frwrw 2 rubin 30197596 01/10 16:08:16 /minos/nearcat/N00011516_0018.cosmic.sntp.cedar.1.root frwrw 2 rubin 77902299 01/10 16:09:37 /minos/nearcat/N00011516_0018.spill.sntp.cedar.1.root frwrw 2 rubin 30242143 01/10 18:25:16 /minos/nearcat/N00011516_0019.cosmic.sntp.cedar.1.root frwrw 2 rubin 86283961 01/10 18:26:36 /minos/nearcat/N00011516_0019.spill.sntp.cedar.1.root frwrw 2 rubin 30415410 01/10 16:17:21 /minos/nearcat/N00011516_0020.cosmic.sntp.cedar.1.root frwrw 2 rubin 78316537 01/10 16:18:04 /minos/nearcat/N00011516_0020.spill.sntp.cedar.1.root frwrw 2 rubin 30336800 01/10 14:55:28 /minos/nearcat/N00011516_0021.cosmic.sntp.cedar.1.root frwrw 2 rubin 78352443 01/10 14:56:04 /minos/nearcat/N00011516_0021.spill.sntp.cedar.1.root frwrw 2 rubin 30105619 01/10 13:40:02 /minos/nearcat/N00011516_0022.cosmic.sntp.cedar.1.root frwrw 2 rubin 59368513 01/10 13:40:35 /minos/nearcat/N00011516_0022.spill.sntp.cedar.1.root SRV1> dfarm rm /minos/nearcat/N00011516_*.1.* ############ # predator # ############ See 2006 12 11 note, we did apparently get up to date on saddreco on 12/12. Need to top this off. Easist thing is to restore saddreco to predator ( done just now ) and run VMON=2006-12 ./predator ${VMON} Then re-disable saddreco in predator, this needs to be done by roundup in future. ============================================================================= 2007 01 12 ############ # MCIMPORT # ############ Adding sjc area to mindata@fnal.gov for Stephen Coleman, per arms. .k5login - added sjc@FNAL.GOV mkdir STAGE/sjc mkdir $MINOS_DATA/log_data/mcimport/sjc for USER in sjc do mkdir -m 775 /pnfs/minos/stage/${USER} ; done do ( cd /pnfs/minos/stage/${USER} ; enstore pnfs --file_family stage_${USER} ) ; done do ( cd /pnfs/minos/stage/${USER} ; enstore pnfs --tags | grep 'file_family) =' ) ; done ########### # ROUNDUP # ########### Ran further tests of size-mismatched files SRV1> cat ~/maint/oops.1.files N00010230_0004.spill.sntp.cedar.0.root N00010274_0018.spill.sntp.cedar.0.root N00010485_0001.cosmic.sntp.cedar.0.root N00011408_0006.spill.sntp.cedar.0.root N00011408_0021.cosmic.sntp.cedar.0.root N00011408_0021.spill.sntp.cedar.0.root F00037025_0005.all.sntp.cedar.0.root F00037147_0013.all.sntp.cedar.0.root Four of these exist in AFS, with sizes and dates matching DFARM, all after the PFNS times. 
N00010230_0004.spill.sntp.cedar.0.root N00010274_0018.spill.sntp.cedar.0.root N00011408_0006.spill.sntp.cedar.0.root N00011408_0021.spill.sntp.cedar.0.root Go ahead and purge the files which do match in size : SRV1> ./remove_duplicates 2>&1 | tee ../maint/remdup.log Fri Jan 12 15:04:19 CST 2007 Fri Jan 12 15:04:28 CST 2007 PURGE N00010077_0000.cosmic.sntp.cedar.0.root PURGE N00010077_0001.cosmic.sntp.cedar.0.root ... SRV1> dfarm usage rubin Used: 97525 + Reserved: 0 / Quota: 1000000 (MB) shifted /tmp/PNFS to $HOME/maint/PNFS, adjsted remove_duplicates STRAYS Now get a copy of all the DFARM file missing from PNFS, put it into /export/stage/minfarm/STRAYS Made a shorter list of DFARM files cd /home/minfarm/maint dfarm ls /minos/nearcat > SDFARMNF dfarm ls /minos/farcat > SDFARMFF Edited this with nedit to exclude 01/* files from 2007. cat SDFARMNF | tr -s ' ' | cut -f 7 -d ' ' > SDFARMN cat SDFARMFF | tr -s ' ' | cut -f 7 -d ' ' > SDFARMF cd /home/minfarm/maint mkdir /export/stage/minfarm/STRAYS List them : for DET in N F ; do for FIL in `cat SDFARM${DET}` ; do grep -q ${FIL} PNFS || grep ${FIL} SDFARM${DET}F done ; done Clone them : for DET in N F ; do if [ "${DET}" = "N" ] ; then det=near ; else det=far ; fi for FIL in `cat SDFARM${DET}` ; do grep -q ${FIL} PNFS || dfarm get /minos/${det}cat/${FIL} /export/stage/minfarm/STRAYS/${FIL} done ; done SRV1> ls -l /export/stage/minfarm/STRAYS total 1209476 -rw-rw-r-- 1 minfarm numi 275564 Jan 12 17:58 F00028071_0001.all.sntp.cedar.0.root -rw-rw-r-- 1 minfarm numi 452419 Jan 12 17:58 F00033105_0010.spill.sntp.cedar.0.root ... SRV1> ls /export/stage/minfarm/STRAYS | wc -l 115 Check sizes for DET in N F ; do for FIL in `cat SDFARM${DET}` ; do if grep -q ${FIL} PNFS ; then true ; else DSIZ=`grep ${FIL} SDFARM${DET}F | tr -s ' ' | cut -f 4 -d ' '` PSIZ=`ls -l /export/stage/minfarm/STRAYS/${FIL} | tr -s ' ' | cut -f 5 -d ' '` echo ${FIL} ${DSIZ} ${PSIZ} [ ${DSIZ} != ${PSIZ} ] && echo OOPS fi done ; done Now put a copy of odd length files into export/stage/minfarm/ODDS mkdir /export/stage/minfarm/ODDS ODDN=' N00010230_0004.spill.sntp.cedar.0.root N00010274_0018.spill.sntp.cedar.0.root N00010485_0001.cosmic.sntp.cedar.0.root N00011408_0006.spill.sntp.cedar.0.root N00011408_0021.cosmic.sntp.cedar.0.root N00011408_0021.spill.sntp.cedar.0.root ' ODDF=' F00037025_0005.all.sntp.cedar.0.root F00037147_0013.all.sntp.cedar.0.root ' for FIL in ${ODDN} ; do echo $FIL dfarm get /minos/nearcat/${FIL} /export/stage/minfarm/ODDS/${FIL} done for FIL in ${ODDF} ; do echo $FIL dfarm get /minos/farcat/${FIL} /export/stage/minfarm/ODDS/${FIL} done ============================================================================= 2007 01 11 SRV1> dfarm usage rubin Used: 415346 + Reserved: 0 / Quota: 1000000 (MB) While waiting for resolution, prepare to purge the majority of files in PNFS. 
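That purge amounts to : a dfarm file may be dropped only if a file of the same name is already in PNFS and the sizes agree. A rough sketch of the check, not the remove_duplicates script itself; it uses the /tmp list names built below, looks only at the near sntp_data area, and the dfarm rm is left commented :

    DET=near
    for FIL in `cat /tmp/DFARMN` ; do
        grep -q "^${FIL}$" /tmp/PNFS || continue     # not on tape yet, keep it
        DSIZ=`grep " ${FIL}$" /tmp/DFARMNF | tr -s ' ' | cut -f 4 -d ' '`
        PSIZ=`ls -l /pnfs/minos/reco_${DET}/cedar/sntp_data/*/${FIL} 2>/dev/null \
              | tr -s ' ' | cut -f 5 -d ' '`
        if [ -n "${PSIZ}" ] && [ "${DSIZ}" = "${PSIZ}" ] ; then
            echo "PURGE ${FIL}"
            # dfarm rm /minos/${DET}cat/${FIL}       # enable after reviewing the list
        else
            echo "OOPS - size mismatch for ${FIL} : dfarm ${DSIZ} pnfs ${PSIZ}"
        fi
    done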
Check more carefully , need a little remove_duplicates script Put dfarm list in /tmp/DFARM[N|F} [F] as before check existince in /tmp/PNFS, check file sizes SRV1> ./remove_duplicates 2>&1 | tee /tmp/reduppre.log ( fixed problem with bntp -> .bntp ), reran for just far SRV1> ./remove_duplicates 2>&1 | tee -a /tmp/reduppre.log Note several near detector file size mismatches ============================================================================= 2007 01 10 VMON=2006-12 ./predator ${VMON} ########### # ROUNDUP # ########### Explore the 2006/7 boundary, purge dfarm of all files already in PNFS roundup.20070110 - moved /DFARM/ to /${CAT}/DFARM likewise for LOG and HADDLOG and WRITE ln -sf roundup.20070110 roundup SRV1> mv DFARM CAT/DFARM SRV1> mkdir DFARM SRV1> mv LOG CAT/LOG SRV1> mkdir LOG SRV1> mv HADDLOG CAT/HADDLOG SRV1> mkdir HADDLOG SRV1> mkdir CAT/WRITE make tmp/PNFS list of recent reco files DET=near for DIR in /pnfs/minos/reco_${DET}/cedar/sntp_data ; do ls ${DIR}/2006-11 >> /tmp/PNFS ls ${DIR}/2006-12 >> /tmp/PNFS ls ${DIR}/2006-05 >> /tmp/PNFS ls ${DIR}/2006-06 >> /tmp/PNFS ls ${DIR}/2006-07 >> /tmp/PNFS ls ${DIR}/2006-08 >> /tmp/PNFS ls ${DIR}/2006-09 >> /tmp/PNFS ls ${DIR}/2006-10 >> /tmp/PNFS done SRV1> wc -l /tmp/PNFS 2696 /tmp/PNFS DET=far for DIR in /pnfs/minos/reco_${DET}/cedar/sntp_data /pnfs/minos/reco_${DET}/cedar/.bntp_data ; do ls ${DIR}/2006-11 >> /tmp/PNFS ls ${DIR}/2006-12 >> /tmp/PNFS ls ${DIR}/2006-05 >> /tmp/PNFS ls ${DIR}/2006-06 >> /tmp/PNFS ls ${DIR}/2006-07 >> /tmp/PNFS ls ${DIR}/2006-08 >> /tmp/PNFS ls ${DIR}/2006-09 >> /tmp/PNFS ls ${DIR}/2006-10 >> /tmp/PNFS done SRV1> wc -l /tmp/PNFS 7091 /tmp/PNFS SRV1> dfarm ls /minos/nearcat > /tmp/DFARMNF SRV1> dfarm ls /minos/farcat > /tmp/DFARMFF SRV1> dfarm ls /minos/nearcat | tr -s ' ' | cut -f 7 -d ' ' > /tmp/DFARMN SRV1> dfarm ls /minos/farcat | tr -s ' ' | cut -f 7 -d ' ' > /tmp/DFARMF SRV1> wc -l /tmp/DFARMN 2692 /tmp/DFARMN SRV1> wc -l /tmp/DFARMF 3404 /tmp/DFARMF SRV1> for FIL in `cat /tmp/DFARMF` ; do grep -q ${FIL} /tmp/PNFS || grep ${FIL} /tmp/DFARMFF ; done SRV1> for FIL in `cat /tmp/DFARMN` ; do grep -q ${FIL} /tmp/PNFS || grep ${FIL} /tmp/DFARMNF ; done frwrw 2 rubin 37297557 11/10 15:39:27 N00010163_0001.spill.sntp.cedar.0.root frwrw 2 rubin 35117005 11/10 15:28:44 N00010163_0003.spill.sntp.cedar.0.root frwrw 2 rubin 44780044 11/10 16:28:54 N00010163_0004.spill.sntp.cedar.0.root frwrw 2 rubin 47876194 11/10 17:06:10 N00010163_0005.spill.sntp.cedar.0.root frwrw 2 rubin 46815316 11/10 16:44:38 N00010163_0006.spill.sntp.cedar.0.root frwrw 2 rubin 45944255 11/10 18:37:33 N00010163_0007.spill.sntp.cedar.0.root frwrw 2 rubin 33988356 11/10 17:24:19 N00010163_0008.spill.sntp.cedar.0.root frwrw 2 rubin 41349361 11/10 17:52:56 N00010163_0009.spill.sntp.cedar.0.root frwrw 2 rubin 34032079 11/10 16:38:20 N00010163_0010.spill.sntp.cedar.0.root frwrw 2 rubin 28499385 11/10 15:39:04 N00010163_0011.spill.sntp.cedar.0.root frwrw 2 rubin 32534482 11/10 15:42:31 N00010163_0012.spill.sntp.cedar.0.root frwrw 2 rubin 69803776 12/08 07:10:33 N00011347_0014.spill.sntp.cedar.0.root ... rest are from 2007 10163 is from 2006-06, these spill and cosmic sntp files are missing in PNFS Informed rubin, waiting for resolution . 
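The per-file grep loops above rescan the whole PNFS list once per dfarm file; with both sides reduced to sorted bare file names the stray check collapses to a single comm. A sketch using the /tmp list names already in play; STRAYS.list is a new, arbitrary name :

    sort -u /tmp/PNFS > /tmp/PNFS.sorted
    sort -u /tmp/DFARMN /tmp/DFARMF > /tmp/DFARM.sorted
    # names present in dfarm but absent from PNFS, i.e. the strays
    comm -23 /tmp/DFARM.sorted /tmp/PNFS.sorted > /tmp/STRAYS.list
    wc -l /tmp/STRAYS.list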
============================================================================= 2007 01 09 ######### # VAULT # ######### vault - changed encp from v3_5a to v3_6d per current per HOWTO.vault VMON=2006-12 for DET in far near; do ./vault ${DET} ${VMON} ; done Completed cleanly ############ # beam_log # ############ The script was stuck since http://www-numi.fnal.gov/computing/dh/beamlog/2007/01/08.txt 296 Mon Jan 8 11:26:29 CST 2007 ps xf : 25430 ? S 25:01 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/beam_log 15077 ? S 0:00 /bin/sh /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/beam_log 15078 ? S 0:02 \_ curl -s http://www-bd.fnal.gov/notifyservlet/www 15079 ? S 0:00 \_ grep SC time MINOS26 > ps -f -p 15079 UID PID PPID C STIME TTY TIME CMD kreymer 15079 15077 0 Jan08 ? 00:00:00 grep SC time Killed the grep and curl, running now since http://www-numi.fnal.gov/computing/dh/beamlog/2007/01/09.txt 181 Tue Jan 9 11:18:21 CST 2007 ########### # ROUNDUP # ########### grep -e PEND -e CST LOG/R1_18_4near.log > /tmp/pendrn grep 'root 09/' /tmp/pendrn > /tmp/pendrn09 grep 'root 10/' /tmp/pendrn > /tmp/pendrn10 grep 'root 11/' /tmp/pendrn > /tmp/pendrn11 nedit /tmp/pendrm # select latest batch of PEND, file prior to December FILES=`cat /tmp/pendrn09 | cut -f 8 -d ' '` for FILE in $FILES ; do echo $FILE ; dfarm ls /minos/${DET}cat/${FILE} ; done for FILE in $FILES ; do echo $FILE ; dfarm rm /minos/${DET}cat/${FILE} ; done This was a moot exercise, cleanup had already occurred. Checked out far with ./roundup -C -r R1_18_4 -W -n far roughly consistent with log, just a few runs, so let's clean up Format of log changed, need slightly different selection for 09 10 11 grep -e PEND -e CST LOG/R1_18_4far.log > /tmp/pendrn nedit /tmp/pendrn # cut out all but latest PENDs grep ' 09/' /tmp/pendrn > /tmp/pendrn09 grep ' 10/' /tmp/pendrn > /tmp/pendrn10 grep ' 11/' /tmp/pendrn > /tmp/pendrn11 FILES=`cat /tmp/pendrn09 | cut -f 8 -d ' '` FILES=`cat /tmp/pendrn10 | cut -f 8 -d ' '` FILES=`cat /tmp/pendrn11 | cut -f 8 -d ' '` for FILE in $FILES ; do echo $FILE ; dfarm ls /minos/${DET}cat/${FILE} ; done for FILE in $FILES ; do echo $FILE ; dfarm rm /minos/${DET}cat/${FILE} ; done SRV1> dfarm usage rubin Used: 480608 + Reserved: 0 / Quota: 1000000 (MB) Now clean out some cedar stuff : ./roundup -C -r cedar -W -n near 2>&1 | tee /tmp/pendcn ./roundup -C -r cedar -W -n far 2>&1 | tee /tmp/pendcf grep 'root.*09/' /tmp/pendcn > /tmp/pendcn09 grep 'root.*10/' /tmp/pendcn > /tmp/pendcn10 grep 'root.*11/' /tmp/pendcn > /tmp/pendcn11 (10 is empty) DET=near FILES=`cat /tmp/pendcn09 | cut -f 8 -d ' '` FILES=`cat /tmp/pendcn11 | cut -f 8 -d ' '` grep 'root.*09/' /tmp/pendcf > /tmp/pendcf09 grep 'root.*10/' /tmp/pendcf > /tmp/pendcf10 grep 'root.*11/' /tmp/pendcf > /tmp/pendcf11 (10 is empty) DET=far FILES=`cat /tmp/pendcf09 | cut -f 8 -d ' '` FILES=`cat /tmp/pendcf11 | cut -f 8 -d ' '` Still a lot of stuff there, SRV1> dfarm usage rubin Used: 461875 + Reserved: 0 / Quota: 1000000 (MB) Let's just purge all of R1_18_4 There are many obsolete short ntuple files there : SRV1> dfarm ls /minos/nearcat/*R1_18_4* | wc -l 2564 SRV1> dfarm ls /minos/nearcat/*snts**R1_18_4* | wc -l 2545 SRV1> dfarm ls /minos/farcat/*R1_18_4* | wc -l 4545 SRV1> dfarm ls /minos/farcat/*nts**R1_18_4* | wc -l 4476 SRV1> dfarm usage rubin Used: 402857 + Reserved: 0 / Quota: 1000000 (MB) Everything is cedar now, look at the 2006 vs 2007 breakdown SRV1> dfarm ls /minos/farcat/ | grep ' 01/' | wc -l 372 SRV1> dfarm ls /minos/farcat/*cedar* | grep -v 
' 01/' | wc -l 2984 SRV1> dfarm ls /minos/farcat/*cedar* | wc -l 3356 SRV1> dfarm ls /minos/nearcat/ | grep ' 01/' | wc -l 416 SRV1> dfarm ls /minos/nearcat/*cedar* | grep -v ' 01/' | wc -l 2283 SRV1> dfarm ls /minos/nearcat/*cedar* | wc -l 2699 One quick check of short ntuples, SRV1> dfarm ls /minos/farcat/*nts* SRV1> dfarm ls /minos/nearcat/*nts* frwrw 2 rubin 512714 11/10 10:20:48 /minos/nearcat/N00010072_0000.cosmic.snts.cedar.0.root frwrw 2 rubin 11483882 11/10 10:43:40 /minos/nearcat/N00010077_0000.cosmic.snts.cedar.0.root ... SRV1> dfarm ls /minos/nearcat/*nts* | wc -l 19 SRV1> dfarm rm /minos/nearcat/*nts* Let's check out the cutover boundary SRV1> dfarm ls /minos/farcat/ | grep ' 12/' | wc -l 2881 SRV1> dfarm ls /minos/farcat/ | grep ' 12/3' | wc -l 96 ============================================================================= 2007 01 08 ############ # MCIMPORT # ############ ############# # checklist # ############# minosora1 ganglia monitoring shows no data, but system is up. Reported to minos-dbsupport Minos-servers category is missing entirely. Ganglia plots are back as of about 15:00 ########### # ROUNDUP # ########### SRV1> ./roundstat Mon Jan 8 15:22:31 CST 2007 OK - 5219 files , 181 GBytes in near 2655 files , 156 GBytes in near cedar 2564 files , 25 GBytes in near R1_18_4 OK - 8821 files , 55 GBytes in far 4039 files , 47 GBytes in far cedar 4782 files , 7 GBytes in far R1_18_4 OK - WRITE OK - 2 files , 0 GBytes in near 2 files , 0 GBytes in near cedar OK - 0 files , 0 GBytes in far OK - READ OK - 0 files , 0 GBytes in near OK - 0 files , 0 GBytes in far SRV1> dfarm usage rubin Used: 509062 + Reserved: 0 / Quota: 1000000 (MB) Moved to latest version, supporting -C : ln -sf roundup.20061220 roundup # was roundup.20061215 Cleaned up the two cedar near WRITE files left from 20 Dec ./roundup -C -r cedar -w near ============================================================================= 2006 12 29 ############ # MCIMPORT # ############ Testing fermigrid cache, on the side : SRV1> srmclient/bin/srmls ${SPATH2}/usr/fermigrid/volatile/minos 512 srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos SPATH2=srm://fndca1.fnal.gov:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/fermigrid/volatile/minos srmclient/bin/srmmkdir ${SPATH2}/mcimport Now try writing to this from mindata on minos26 . 
/usr/local/etc/setups.sh setup upd export PRODUCTS=/afs/fnal.gov/files/code/e875/general/ups/db setup vdt setup srmcp v1_21 export SRM_CONFIG=/home/mindata/.srmconfig/kreymer.xml cd /local/scratch26/mindata/kordosky IPATH=fermigrid/volatile/minos/kordosky IFILE=n11011401_0001_L010185N_D00.tar.gz SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr SFILE=${SPATH}/${IPATH}/${IFILE} $ srmcp -streams_num=1 -server_mode=active file:///${IFILE} ${SFILE} Fri Dec 29 10:17:50 CST 2006: rs.state = Failed rs.error = RequestFileStatus#-2147274830 failed with error:[ at Fri Dec 29 10:17:46 CST 2006 state Failed : can not obtain turl for file:org.dcache.srm.SRMException: user's path ///pnfs/fnal.gov/usr/fermigrid/volatile/minos/kordosky/n11011401_0001_L010185N_D00.tar.gz is not subpath of the user's root] GRRRR, no good for now, will just go ahead with mcimport development without a safety copy of files in fermigrid/volatile/minos Added command qualifiers to mcimport, tested srmcp on kreymer files Grabbed 10 more kordosky files for full test in kreymer, with fresh login cd /local/scratch26/mindata FILES=`ls /local/scratch26/mindata/kordosky/ | head -40 | tail -10` for FILE in ${FILES} ; do cp -v kordosky/${FILE} kreymer/ ; done ./afsmcimport kreymer Seems to be ok, need to still test the purging code ( wait for data on tape) Will lauch full compression of howcroft, while waiting for little kreymer files to be archived. Tarred up /local/scratch/kreymer/ARCHIVE into /tmp/ARCHIVE.tar, sccp -c blowfish to minos25:/local/scratch25/kreymer/ARCHIVE.tar checked with md5sum. ============================================================================= 2006 12 28 ############ # MCIMPORT # ############ Tested srmcp successfully, readig a single file In minos products area, upd install -j srmcp v1_25_1 -f NULL upd install -j vdt v1_1_14_13 upd install -j pacman v2_116_1 ups declare -c pacman v2_116_1 -f NULL ups tailor vdt v1_1_14_13 > /tmp/vdtinstall.log ups declare -c vdt v1_1_14_13 ups declare -c srmcp v1_25_1 SRV1> java -version java version "1.4.2_10" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_10-b03) Java HotSpot(TM) Client VM (build 1.4.2_10-b03, mixed mode) MINOS26 > java -version java version "1.4.2_12" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_12-b03) Java HotSpot(TM) Client VM (build 1.4.2_12-b03, mixed mode) Tried various varsions, same result, then finally srmcp v1_21 works ! Had to update kreymer.xml for local product paths to srmcp, and had to clone a local copy of CA certificates, scp -r minfarm@fnpcsrv1:/local/ups/grid/globus/share/certificates certificates This now works , at least copying files to disk per HOWTO.dccp. Now try going to dcache from local disk. cd /local/scratch26/mindata/kreymer/tar export SRM_CONFIG=/home/mindata/.srmconfig/kreymer.xml IPATH=minos/stage/kreymer IFILE=n11011401_0001_L010185N_D00-n11011401_0005_L010185N_D00.tar SFILE=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/${IPATH}/${IFILE} srmcp -streams_num=1 -server_mode=active \ file:///${IFILE} ${SFILE} ============================================================================= 2006 12 23 ############ # MCIMPORT # ############ cd $MINOS_DATA/log_data mkdir mcimport cd mcimport fs setacl . minos rlidwka cd .. fs setacl . 
minos rlidwka mkdir howcroft mkdir kordosky Tested this on 19 kordosky files cloned to kreymer Tarring to kreymer/tar/ looks good, now need to add archiving ============================================================================= 2006 12 22 ############### # minos26free # ############### Created script to report free space on /local/scratch26 hourly, at http://www-numi.fnal.gov/computing/dh/minos26free/NOW.txt ( for daily reports ) http://www-numi.fnal.gov/computing/dh/minos26free/FREE.txt ( use in scripts ) Updated web at /afs/fnal.gov/files/expwww/numi/html/computing/dh/dhmain.200601222.html ln -sf dhmain.20061222.html dhmain.html # was dhmain.20060918.html ############ # mcimport # ############ This will to be cloned from vault ( tars/vaults raw data) rawcopy ( actually does the tarring ) roundup ( writes concatenated ntuples ) Files are under /local/scratch26/mindata// There are data files like and log files in */log/ Data files should be tarred and feathxxxxarchived, and logs should ber rsync'd to AFS. /afs/fnal.gov/files/data/minos/log_data/mcimport/ Let's get the tar going first, to avoid a space crunch. ============================================================================= 2006 12 20 ####### # SRM # ####### SRV1> time srmcp -debug=false -streams_num=1 -server_mode=active -protocols=gsiftp $SFILE file:///TEST.dat real 0m11.100s user 0m3.550s sys 0m0.250s Compare this to the roughly .8 elapsed, .3 CPU cost of globus-url-copy X11 - clean scan today ########### # ROUNDUP # ########### SRV1> ./roundstat Wed Dec 20 09:48:47 CST 2006 OK - 4009 files , 120 GBytes in near 1445 files , 95 GBytes in near cedar 2564 files , 25 GBytes in near R1_18_4 OK - 7585 files , 38 GBytes in far 2803 files , 30 GBytes in far cedar 4782 files , 7 GBytes in far R1_18_4 OK - WRITE OK - 2 files , 2 GBytes in near 2 files , 2 GBytes in near R1_18_4 OK - 48 files , 9 GBytes in far 48 files , 9 GBytes in far R1_18_4 OK - READ OK - 701 files , 0 GBytes in near 517 files , 0 GBytes in near cedar 184 files , 0 GBytes in near R1_18_4 OK - 2181 files , 0 GBytes in far 1726 files , 0 GBytes in far cedar 455 files , 0 GBytes in far R1_18_4 Removed duplicated rerounded file from Far, rm WRITE/F00036196* Removed old second half of large file dating from Nov 9 rm /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0030.spill.sntp.R1_18_4.0.root Cleared the write area : ./roundup -C -r R1_18_4 -w near ./roundup -C -r R1_18_4 -w far Rerun R1_18_4 to get list of PEND's for removal ./roundup -C -r R1_18_4 -W near PEND - have 7/8 subruns for N00011295_*.spill.sntp.R1_18_4*.root 19 12/01 07:59:01 PEND - have 10/12 subruns for N00011301_*.spill.sntp.R1_18_4*.root 19 12/01 08:31:34 SUPPRESS N00011315_0024.spill.sntp.R1_18_4.0.root PEND - have 2/24 subruns for N00011315_*.spill.sntp.R1_18_4*.root 18 12/01 16:38:28 Try a roundup ./roundup -C -r R1_18_4 -W -R near fails, the missing subruns are really missing. ./roundup -C -r R1_18_4 -W far But first, clean up protections for DET in near far ; do FILES=` dfarm ls /minos/${DET}cat | tr -s ' ' | grep '\- ' | cut -f 7 -d ' '` for FILE in $FILES ; do dfarm ls /minos/${DET}cat/${FILE} ; done for FILE in $FILES ; do dfarm chmod rwrw /minos/${DET}cat/${FILE} ; done done Howie has done this. 
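For the record, the minos26free reporter described above (2006 12 22) need not be more than a few lines. A sketch only, not the actual script; the web directory and the NOW.txt / FREE.txt names are the ones quoted above, the date format and cron line are illustrative :

    #!/bin/sh
    # hourly free-space report for /local/scratch26
    WEBDIR=/afs/fnal.gov/files/expwww/numi/html/computing/dh/minos26free
    FREE=`df -m /local/scratch26 | tail -1 | tr -s ' ' | cut -f 4 -d ' '`
    echo "`date -u` ${FREE} MB free" >> ${WEBDIR}/NOW.txt    # readable history
    echo "${FREE}" > ${WEBDIR}/FREE.txt                      # bare number for scripts
    # crontab entry, on the hour :
    # 0 * * * * ${HOME}/minos/scripts/minos26free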
######### # SRMCP # ######### Testing roundup.20061220 using Howie's cert to srmcp get some fodder SRV1> ./roundup.20061220 -C -r cedar -W -s N00011356 near ============================================================================= 2006 12 19 ####### # SRM # ####### Trying a fresh VDT install on fnpcsrv1, per http://fermigrid.fnal.gov/user-guide-new.html cd ~/grid wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-latest.tar.g tar xzf pacman-latest.tar.gz LATEST=3.19 export PATH='pwd'/pacman-${LATEST}:$PATH cd pacman-${LATEST} source setup.sh Pacman requires at least Python version 2.2. Your Python version, 2.1, is too old for Pacman. Installing Python 2.4.1 locally... Downloading Python 2.4.1... download successful. Unzipping... unzip successful. Untarring... untar successful. Configuring... configure successful. Making Python 2.4.1... make successful.make install successful. Python 2.4.1 has been built locally. Ready to use Pacman. export VDT_LOCATION=${HOME}/grid/vdt-1.3.10 mkdir $VDT_LOCATION cd $VDT_LOCATION Looked for latest in http://software.grid.iu.edu/pacman/ client-0.4.1-2.pacman 19-Sep-2006 20:44 868 pacman -get OSG:client-0.4.1-2 SRV1> pacman -get OSG:client-0.4.1-2 warning: Python C API version mismatch for module struct: This Python has API version 1012, module struct has version 1010. warning: Python C API version mismatch for module bsddb: This Python has API version 1012, module bsddb has version 1010. warning: Python C API version mismatch for module gdbm: This Python has API version 1012, module gdbm has version 1010. warning: Python C API version mismatch for module dbm: This Python has API version 1012, module dbm has version 1010. warning: Python C API version mismatch for module strop: This Python has API version 1012, module strop has version 1010. warning: Python C API version mismatch for module time: This Python has API version 1012, module time has version 1010. Traceback (most recent call last): File "/home/minfarm/grid/pacman-3.19/bin/pacman", line 18, in ? import Pacman File "/home/minfarm/grid/pacman-3.19/src/Pacman.py", line 83, in ? import lock File "/home/minfarm/grid/pacman-3.19/src/lock.py", line 4, in ? from Base import * File "/home/minfarm/grid/pacman-3.19/src/Base.py", line 4, in ? import sys,os,string,commands,copy,time,popen2,cPickle,pwd,grp,socket,anydbm,shutil ImportError: /local/ups/prd/python/v2_1/Linux-2-4/lib/python2.1/lib-dynload/cPickle.so: undefined symbol: PyUnicode_DecodeRawUnicodeEscape Tried a direct copy : SRV1> globus-url-copy gsiftp://stkendca2a.fnal.gov:2811///neardet_data/2004-11/N00004502_0000.mdaq.root file:////export/stage/minfarm/ROUNDUP_TEST/TEST/TEST.dat Now test a bit : GSIF=gsiftp://stkendca2a.fnal.gov:2811///neardet_data/2004-11/N00004502_0000.mdaq.root LOPE=file:////export/stage/minfarm/ROUNDUP_TEST/TEST Trying various sizes SRV1> N=4 SRV1> globus-url-copy -dbg -p ${N} ${GSIF} ${LOPE}/TEST.dat 2>&1 | wc -l 3145 I get the following size log files : N Lines 1 3143 2 3145 4 3149 8 3157 16 3173 time globus-url-copy -p ${N} ${GSIF} ${LOPE}/TEST.dat N=1 real 0m0.828s user 0m0.050s <--- this was a fluke, repeated copies are around .1 sys 0m0.230s N=8 real 0m0.823s user 0m0.110s sys 0m0.210s ########## # DCACHE # ########## Cleaned up the damaged file from SAM, sam undeclare file N00011134_0038.spill.cand.cedar.0.root This file was damaged in DCache, producing thousands of files on 9 tapes, discovered back on 2006 12 14. Tapes are released for consolidation via migration. 
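The stream-count test above (2006 12 19) is easy to run as a loop so all the timings land in one log; same GSIF and LOPE as above, the log file name is arbitrary :

    GSIF=gsiftp://stkendca2a.fnal.gov:2811///neardet_data/2004-11/N00004502_0000.mdaq.root
    LOPE=file:////export/stage/minfarm/ROUNDUP_TEST/TEST
    for N in 1 2 4 8 16 ; do
        echo "parallel streams ${N}"
        ( time globus-url-copy -p ${N} ${GSIF} ${LOPE}/TEST.dat ) 2>&1 | grep real
    done | tee /tmp/gucspeed.log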
########## # IMPORT # ########## export FTP_PASSIVE ; FTP_PASSIVE=1 FTP_HOST=minos26.fnal.gov # Test connection ftpout=`printf "user mindata\nquit\n" | ftp -n ${FTP_HOST}` if [ "${ftpout}" = "GSSAPI authentication succeeded" ] then echo " OK - we can ftp to ${FTP_HOST}" else echo " " echo " OOPS - ftp output = ${ftpout}" echo " " echo " OOPS - we cannot access ftp at ${FTP_HOST} ," fi Copy file printf "user mindata\n \ cd STAGE/kreymer \n \ put ${FILE} ${FILE} \n \ quit\n" \ | ftp -n ${FTP_HOST} ============================================================================= 2006 12 18 ########## # DCACHE # ########## Another rubin dccp -P got stuck CPU-bound for over 2 days on fnpcserv1. Killed it. Farms were stuck, due to srmcp failing. Trying it on srv1, per revised HOWTO.dccp. dccp works ============================================================================= 2006 12 15 ########### # ROUNDUP # ########### SRV1> ./roundstat Fri Dec 15 09:25:50 CST 2006 OK - 5379 files , 188 GBytes in near 2096 files , 126 GBytes in near cedar 3283 files , 61 GBytes in near R1_18_4 OK - 8479 files , 47 GBytes in far 2803 files , 30 GBytes in far cedar 5676 files , 16 GBytes in far R1_18_4 OK - WRITE OK - 2 files , 2 GBytes in near 2 files , 0 GBytes in near R1_18_4 OK - 0 files , 0 GBytes in far OK - READ OK - 701 files , 0 GBytes in near 517 files , 0 GBytes in near cedar 184 files , 0 GBytes in near R1_18_4 OK - 2136 files , 0 GBytes in far 1726 files , 0 GBytes in far cedar 410 files , 0 GBytes in far R1_18_4 grep -e PEND -e CST LOG/R1_18_4near.log > /tmp/pendrn nedit /tmp/pendrm # select latest batch of PEND, file prior to December FILES=`cat /tmp/pendrn | cut -f 8 -d ' '` SRV1> for FILE in $FILES ; do dfarm rm minos/nearcat/${FILE} ; done Error deleting /minos/nearcat/N00008460_0002.cosmic.sntp.cedar.0.root: PERM Permission denied Error deleting /minos/nearcat/N00008463_0019.spill.sntp.cedar.0.root: PERM Permission denied PERM Permission denied Moving on to R1_18_4 far, find quite a few Nov files ready to round up, SRV1> ./roundup -r R1_18_4 far SRV1> ./roundup -C -r R1_18_4 -s F00036196_ -R far Oops, that was a mistake... the Rustling worked, as did the concatenation, but the file was the first one concatenated, using loon. So it is already in PNFS, with a slightly different size. ############### # LARGE FILES # ############### Renamed the long file, ( cd /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09 ; \ mv N00010819_0000.spill.sntp.R1_18_4.0.root N00010819_0000.spill.sntp.R1_18_4.99.root ) ./dc_stat /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.99.root ============================================================================= 2006 12 14 ######## # DATA # ######## created data directories for 2007, as indicate above in ANNUAL section ######### # STAGE # ######### mkdir -m 775 /pnfs/minos/stage ( cd /pnfs/minos/stage ; enstore pnfs --file_family stage ) ( cd /pnfs/minos/stage ; enstore pnfs --tags ) for USER in arms buckley gallag gmieg howcroft kordosky kreymer rhatcher urheim do mkdir -m 775 /pnfs/minos/stage/${USER} ; done do ( cd /pnfs/minos/stage/${USER} ; enstore pnfs --file_family stage_${USER} ) ; done do ( cd /pnfs/minos/stage/${USER} ; enstore pnfs --tags | grep 'file_family) =' ) ; done ######## # ENCP # ######## Need to upgrade to v3_6d due to security scans coming by 18 Dec. But which one ? Linux+2.6 or Linx+2.4-2.3.2 ? -q dcache or normal ? 
MINOS26 > upd install -j encp v3_6d MINOS26 > upd install -j encp v3_6d -q dcache got word from zalokar that the dcache version is for pool nodes MINOS26 > ups undeclare -Y encp v3_6d -q dcache MINOS26 > upd install -j encp v3_6d -f Linux+2.6 shell-init: could not get current directory: getcwd: cannot access parent directories: No such file or directory OOPS, I was sitting in the removed encp -q dcache removed and started over with -f Linux+2.6 MINOS26 > ups undeclare -Y encp v3_6d -f Linux+2.6 MINOS26 > upd install -j encp v3_6d -f Linux+2.6 Edited v3_6d.table to use stkensrv2 by default MINOS26 > ups list -aK+ encp "encp" "v3_3" "Linux+2.4" "" "" "encp" "v3_4" "Linux+2.4-2.3.2" "" "" "encp" "v3_5a" "Linux+2.4-2.3.2" "" "current" "encp" "v3_6d" "Linux+2.4-2.3.2" "" "" "encp" "v3_6d" "Linux+2.6" "" "" MINOS26 > ups declare -c encp v3_6d WARNING: Unless you know what you are doing, use a qualifier in your ups declare command! MINOS26 > ups declare -c encp v3_6d -f Linux+2.6 WARNING: Unless you know what you are doing, use a qualifier in your ups declare command! MINOS26 > ups list -aK+ encp "encp" "v3_3" "Linux+2.4" "" "" "encp" "v3_4" "Linux+2.4-2.3.2" "" "" "encp" "v3_5a" "Linux+2.4-2.3.2" "" "" "encp" "v3_6d" "Linux+2.4-2.3.2" "" "current" "encp" "v3_6d" "Linux+2.6" "" "current" ============================================================================= 2006 12 13 ####### # SAM # ####### per akumar : Date: Wed, 13 Dec 2006 10:47:44 -0600 v6_3 version of SAM schema has been deployed successfully on minosprd. This version added a column called retired_date on data_files and build the index on file_name and retired_date. N.B. - this allows v8 dbservers to be deployed MINOS26 > sam ping dbserver The server 'SAMDbServer.prd:SAMDbServer' is alive. MINOS26 > sam locate foo RetryHandler.getReplicaLocationList('foo')> will retry in 18.33 seconds Datafile with name 'foo' not found. MINOS26 > ./sam_test_py minos OK MINOS26 > sam get metadata --file=F00031300_0000.mdaq.root OK MINOS26> ~/minos/HOWTO.predator OK SRV1> ./dfarmsum ########### # ROUNDUP # ########### Wed Dec 13 11:30:57 CST 2006 OK - 5304 files , 184 GBytes in near 2024 files , 122 GBytes in near cedar 3280 files , 61 GBytes in near R1_18_4 OK - 8479 files , 47 GBytes in far 2803 files , 30 GBytes in far cedar 5676 files , 16 GBytes in far R1_18_4 ============================================================================= 2006 12 12 ############ # saddreco # ############ DECLARE 2006 cedar reco thru November grep -v declare /local/scratch26/kreymer/log/saddreco/declare_near_cedar.log | less grep -v declare /local/scratch26/kreymer/log/saddreco/declare_far_cedar.log | less HOSTNU=`hostname -s | cut -c 6-` LOGPAT=/local/scratch${HOSTNU}/kreymer/log FARM=cedar MONS='01 02 03 04 05 06 07 08 09 10 11' for DET in near far ; do for MON in ${MONS} ; do ./saddreco ${DET} ${FARM} 2006-${MON} declare 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done ; done ########### # ROUNDUP # ########### SRV1> dfarm usage rubin Used: 523415 + Reserved: 0 / Quota: 1000000 (MB) SRV1> ./dfarmsum Tue Dec 12 08:26:54 CST 2006 OK - 5119 files , 177 GBytes in near 1832 files , 115 GBytes in near cedar 3287 files , 61 GBytes in near R1_18_4 OK - 8335 files , 45 GBytes in far 2659 files , 29 GBytes in far cedar 5676 files , 16 GBytes in far R1_18_4 Looking for CST in LOG/cedarnear.log, OK - processing /minos/nearcat Mon Dec 11 13:09:16 CST 2006 Tue Dec 12 00:05:00 CST 2006 SRV1> df -h . 
Filesystem Size Used Avail Use% Mounted on /dev/sdb3 485G 180G 306G 37% /export/stage Run again, to purge WRITE area primarily ./roundup.20061208 -r cedar near OK - processing /minos/nearcat Tue Dec 12 08:41:50 CST 2006 SRV1> du -sm WRITE 23813 WRITE ./roundup.20061208 -r R1_18_4 -w near SRV1> du -sm WRITE 3171 WRITE SRV1> ./dfarmsum Tue Dec 12 15:00:01 CST 2006 OK - 5121 files , 177 GBytes in near 1834 files , 115 GBytes in near cedar 3287 files , 61 GBytes in near R1_18_4 OK - 8407 files , 46 GBytes in far 2731 files , 29 GBytes in far cedar 5676 files , 16 GBytes in far R1_18_4 MINOS26 > ls -R /pnfs/minos/reco_far/R1_18_4/CAT | wc -l 297 MINOS26 > ls -R /pnfs/minos/reco_near/R1_18_4/CAT | wc -l 200 MINOS26 > ls -R /pnfs/minos/reco_far/cedar/CAT | wc -l 1365 MINOS26 > ls -R /pnfs/minos/reco_near/cedar/CAT | wc -l 557 Moved the long file out of the way, for cleanup. SRV1> mv WRITE/N00010819_0000.spill.sntp.R1_18_4.0.root LONG/ ############### # LARGE FILES # ############### Cleaning up, added test for WRITE vs PNFS file size found stray file from Sep 26 OOPS - Size mismatch , BAILING -rw-r--r-- 1 minfarm numi 2283574599 Sep 25 17:22 N00010819_0000.spill.sntp.R1_18_4.0.root -rw-r--r-- 1 1060 numi 1 Sep 26 07:39 /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.0.root Removed the bad file, requeued : rm /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.0.root ./roundup.20061208 -r R1_18_4 -w near Failed again OK !!!!! This is a file which is too big. The hadd and dccp were happy, it is the ls via pnfs which is unhappy. MINOS26 > ./dc_stat /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.0.root ============================ PNFS status for /pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.0.root -rw-r--r-- 1 kreymer e875 1 Dec 12 12:01 N00010819_0000.spill.sntp.R1_18_4.0.root LEVEL 2 2,0,0,0.0,0.0 :h=yes;c=1:1cf17836;l=2283574599; w-stkendca11a-1 LEVEL 4 ============================ So the level-2 information is good. MINOS26 > unset DCACHE_IO_TUNNEL MINOS26 > cd /local/scratch??/`whoami` MINOS26 > IFILE=N00010819_0000.spill.sntp.R1_18_4.0.root MINOS26 > IPATH=minos/reco_near/R1_18_4/CAT/sntp_data/2006-09 MINOS26 > DCPOR=24125 # unsecured MINOS26 > DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE} MINOS26 > dccp ${DFILE} ${IFILE} 2283574599 bytes in 49 seconds (45511.29 KB/sec) MINOS26 > loon -bq ~/minos/scripts/firstlast.C ${IFILE} Not too informative, does not crash, but no counts. MINOS26 > loon -bq ~/minos/scripts/Merger.C ${IFILE} ... Floating point exception MINOS26 > dds /local/scratch26/kreymer/*root -rw-r--r-- 1 kreymer g020 323946379 Dec 12 16:59 /local/scratch26/kreymer/Merged.root -rw-r--r-- 1 kreymer g020 842860369 Dec 12 15:30 /local/scratch26/kreymer/N00010819_0000.cosmic.sntp.R1_18_4.0.root MINOS26 > mv Merged.root Merged.cosmic.root MINOS26 > IFILE=N00010819_0000.spill.sntp.R1_18_4.0.root MINOS26 > loon -bq ~/minos/scripts/Merger.C ${IFILE} very quickly, 27009843 in Merged.root Floating point exception SRV1> mv WRITE/N00010819_0000.spill.sntp.R1_18_4.0.root LONG/ ============================================================================= 2006 12 11 ############ # saddreco # ############ saddreco was failing due to Application with family 'reco', applName 'loon', version 'cedar' not found. 
MINOS26 > export SAM_ORACLE_CONNECT="samdbs/@minosprd" MINOS26 > samadmin add application family --appFamily=reco --appName=loon --appVersion=cedar same for dev, int Need to do global declaration for all of cedar before resuming keepup in predator. crontab -r HOSTNU=`hostname -s | cut -c 6-` LOGPAT=/local/scratch${HOSTNU}/kreymer/log FARM=cedar MON=2005-04 for DET in near far ; do ./saddreco ${DET} ${FARM} ${MON} verify 5 ; done for DET in near far ; do ./saddreco ${DET} ${FARM} ${MON} declare 5 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done INSTANCE Location with name '/pnfs/minos/reco_near/cedar/cand_data/2005-04' not found. ./reloc cedar for DET in near far ; do ./saddreco ${DET} ${FARM} ${MON} declare 5 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done ... OOPS, need location for N00007354_0015.cosmic.cand.cedar.0.root DET=near ./saddreco near cedar 2005-04 addloc | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log ... OK - add location N00007354_0015.cosmic.cand.cedar.0.root /pnfs/minos/reco_near/cedar/cand_data/2005-04(vob884.1037) Do all of 2005 MONS='04 05 06 07 08 09 10 11 12' for DET in near far ; do for MON in ${MONS} ; do ./saddreco ${DET} ${FARM} 2005-${MON} declare 5 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done ; done for DET in near far ; do for MON in ${MONS} ; do ./saddreco ${DET} ${FARM} 2005-${MON} declare 2>&1 | tee -a ${LOGPAT}/saddreco/declare_${DET}_${FARM}.log done ; done ######### # reloc # ######### Updated to use sam from afs, not local copy, so this can be run on minos26 and elsewhere MINOS26 > cp reloc.1201 reloc.20061211 MINOS26 > ln -sf reloc.20061211 reloc MINOS26 > ./reloc -s dev cedar Declaring locations to SAM for cedar ... MINOS26 > ./reloc -s int cedar ########### # ROUNDUP # ########### SRV1> ./dfarmsum Mon Dec 11 09:44:58 CST 2006 OK - 8437 files , 326 GBytes in near 5150 files , 264 GBytes in near cedar 3287 files , 61 GBytes in near R1_18_4 OK - 8335 files , 45 GBytes in far 2659 files , 29 GBytes in far cedar 5676 files , 16 GBytes in far R1_18_4 roundup.20061208 - enforces 1.5 MByte per subrun file size match requirement ./roundup.20061208 -r cedar -n near ./roundup.20061208 -r cedar near ########## # DCACHE # ########## Note false alarm regarding write pool corruption of neardet_data/2004-08/N00003307_0037.mdaq.root Email in http://listserv.fnal.gov/scripts/wa.exe?A2=ind0612&L=dcache-admin&T=0&X=710F7C3E310242DD53&Y=baisley%40fnal.gov&P=7034 Mentioned in the 8 Dec developer's plone log. 
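Related to the oversized N00010819 file noted above (12 Dec), where the PNFS directory entry shows a 1-byte size but dc_stat level 2 still carries l=2283574599 : the real length can be pulled straight from the layer-2 companion file. A sketch, assuming the usual PNFS '.(use)(2)(file)' convention :

    FILE=/pnfs/minos/reco_near/R1_18_4/CAT/sntp_data/2006-09/N00010819_0000.spill.sntp.R1_18_4.0.root
    DIR=`dirname ${FILE}`
    FIL=`basename ${FILE}`
    # the l= field of layer 2 holds the true length, even when ls shows 1 byte
    LEN=`cat "${DIR}/.(use)(2)(${FIL})" | tr ';' '\n' | grep '^l=' | cut -c 3-`
    echo "${FIL} : ${LEN} bytes per PNFS layer 2"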
============================================================================= 2006 12 09 ########## # DCACHE # ########## Removed the Oct 19 bad file, lost in DCache maintenance MINOS26 > sam locate N00011077_0013.spill.snts.R1_18_4.0.root ['/pnfs/minos/reco_near/R1_18_4/snts_data/2006-10,21@dcache'] MINOS26 > sam undeclare file N00011077_0013.spill.snts.R1_18_4.0.root MINOS26 > ls -l /pnfs/minos/reco_near/R1_18_4/snts_data/2006-10/N00011077_0013.spill.snts.R1_18_4.0.root ls: /pnfs/minos/reco_near/R1_18_4/snts_data/2006-10/N00011077_0013.spill.snts.R1_18_4.0.root: No such file or directory Removed the file written July 1, corrupted July 22, never on tape MINOS26 > sam locate N00010368_0008.spill.cand.R1_18_4.0.root ['/pnfs/minos/reco_near/R1_18_4/cand_data/2006-06,3@dcache'] MINOS26 > sam undeclare file N00010368_0008.spill.cand.R1_18_4.0.root MINOS26 > ls /pnfs/minos/reco_near/R1_18_4/cand_data/2006-06/N00010368_0008.spill.cand.R1_18_4.0.root ls: /pnfs/minos/reco_near/R1_18_4/cand_data/2006-06/N00010368_0008.spill.cand.R1_18_4.0.root: No such file or directory Both of these files have been reported regularly in the saddcache summary scripts. Three cedar far sntp files are reported lost in DCache, not on tape. /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035724_0013.all.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.all.sntp.cedar.0.root /pnfs/minos/reco_far/cedar/sntp_data/2006-06/F00035727_0005.spill.sntp.cedar.0.root These exist in the CAT stream. MINOS26 > dds /pnfs/minos/reco_far/cedar/CAT/sntp_data/2006-06/F00035727* -rw-r--r-- 1 kreymer e875 140341160 Dec 4 14:43 /pnfs/minos/reco_far/cedar/CAT/sntp_data/2006-06/F00035727_0000.all.sntp.cedar.0.root -rw-r--r-- 1 kreymer e875 5791142 Dec 4 19:24 /pnfs/minos/reco_far/cedar/CAT/sntp_data/2006-06/F00035727_0000.spill.sntp.cedar.0.root MINOS26 > dds /pnfs/minos/reco_far/cedar/CAT/sntp_data/2006-05/F00035724* -rw-r--r-- 1 kreymer e875 587664400 Dec 4 14:43 /pnfs/minos/reco_far/cedar/CAT/sntp_data/2006-05/F00035724_0000.all.sntp.cedar.0.root And in AFS, for the spill stream recodata26/F00035727_0005.spill.sntp.cedar.0.root ============================================================================= 2006 12 08 ############ # predator # ############ 2006-11 is caught up, veiwed with ./HOWTO.predator 2006-11 crontab crontab.dat at about 08:17 ######### # VAULT # ######### per HOWTO.vault VMON=2006-11 for DET in far near; do ./vault ${DET} ${VMON} ; done Start Fri Dec 8 10:56:14 CST 2006 Finish Fri Dec 8 18:21:28 CST 2006 ########### # ROUNDUP # ########### Testing file size checking, with these: OK adding F00037000_0000.spill.bntp.R1_18_4.0.root 24 OK adding F00037003_0000.spill.bntp.R1_18_4.0.root 7 OK adding F00037006_0000.spill.bntp.R1_18_4.0.root 9 OK adding F00037010_0000.spill.bntp.R1_18_4.0.root 1 OK adding F00037013_0000.spill.bntp.R1_18_4.0.root 24 OK adding F00037016_0000.spill.bntp.R1_18_4.0.root 28 OK adding F00037019_0000.spill.bntp.R1_18_4.0.root 4 like ./roundup.20061208 -r R1_18_4 -s F00037019 -W far OK adding F00037003_0000.all.sntp.R1_18_4.0.root 7 NSFIL SSIZ MSIZ DSIZ 7 134962138 134334043 104682 OK adding F00037003_0000.spill.bntp.R1_18_4.0.root 7 NSFIL SSIZ MSIZ DSIZ 7 16511218 15811337 116646 OK adding F00037003_0000.spill.sntp.R1_18_4.0.root 7 NSFIL SSIZ MSIZ DSIZ 7 8199703 7531334 111394 OK adding F00037006_0000.all.sntp.R1_18_4.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 170847188 170142123 88133 OK adding F00037006_0000.spill.bntp.R1_18_4.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 21240611 20338483 112766 
OK adding F00037006_0000.spill.sntp.R1_18_4.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 10606868 9692753 114264 OK adding F00037019_0000.all.sntp.R1_18_4.0.root 4 NSFIL SSIZ MSIZ DSIZ 4 76506078 75924341 193912 OK adding F00037019_0000.spill.bntp.R1_18_4.0.root 4 NSFIL SSIZ MSIZ DSIZ 4 24714683 24232176 160835 OK adding F00037019_0000.spill.sntp.R1_18_4.0.root 4 NSFIL SSIZ MSIZ DSIZ 4 11188674 10783956 134906 OK adding F00037000_0000.all.sntp.R1_18_4.0.root 24 NSFIL SSIZ MSIZ DSIZ 24 486227081 484040886 95051 OK adding F00037000_0000.spill.bntp.R1_18_4.0.root 24 NSFIL SSIZ MSIZ DSIZ 24 62424368 59987997 105929 OK adding F00037000_0000.spill.sntp.R1_18_4.0.root 24 NSFIL SSIZ MSIZ DSIZ 24 30507880 28106032 104428 Test this for near, ./roundup.20061208 -r cedar -n near OK adding N00010589_0000.cosmic.sntp.cedar.0.root 19 OK adding N00010592_0000.cosmic.sntp.cedar.0.root 24 OK adding N00010733_0000.cosmic.sntp.cedar.0.root 3 OK adding N00010755_0000.cosmic.sntp.cedar.0.root 2 OK adding N00010772_0000.cosmic.sntp.cedar.0.root 4 OK adding N00010789_0000.cosmic.sntp.cedar.0.root 3 OK adding N00010794_0000.cosmic.sntp.cedar.0.root 5 OK adding N00010801_0000.cosmic.sntp.cedar.0.root 13 OK adding N00010822_0000.cosmic.sntp.cedar.0.root 14 OK adding N00010847_0000.cosmic.sntp.cedar.0.root 3 OK adding N00010855_0000.cosmic.sntp.cedar.0.root 14 OK adding N00010864_0000.cosmic.sntp.cedar.0.root 9 OK adding N00011271_0000.cosmic.sntp.cedar.0.root 12 OK adding N00010755_0000.cosmic.sntp.cedar.0.root 2 NSFIL SSIZ MSIZ DSIZ 2 53056255 52786823 269432 OK adding N00010733_0000.cosmic.sntp.cedar.0.root 3 NSFIL SSIZ MSIZ DSIZ 3 82046883 81600948 222967 OK adding N00010789_0000.cosmic.sntp.cedar.0.root 3 NSFIL SSIZ MSIZ DSIZ 3 72441275 71982343 229466 OK adding N00010847_0000.cosmic.sntp.cedar.0.root 3 NSFIL SSIZ MSIZ DSIZ 3 67450420 67054480 197970 OK adding N00010847_0000.spill.sntp.cedar.0.root 3 NSFIL SSIZ MSIZ DSIZ 3 155247257 154631823 307717 OK adding N00010772_0000.cosmic.sntp.cedar.0.root 4 NSFIL SSIZ MSIZ DSIZ 4 106806743 106075425 243772 OK adding N00010794_0000.cosmic.sntp.cedar.0.root 5 NSFIL SSIZ MSIZ DSIZ 5 134388261 133632186 189018 OK adding N00010794_0000.spill.sntp.cedar.0.root 5 NSFIL SSIZ MSIZ DSIZ 5 254050313 253037704 253152 OK adding N00010864_0000.cosmic.sntp.cedar.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 245878476 244307392 196385 OK adding N00010864_0000.spill.sntp.cedar.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 317547287 316161340 173243 OK adding N00011271_0000.cosmic.sntp.cedar.0.root 12 NSFIL SSIZ MSIZ DSIZ 12 348282322 346803428 134444 OK adding N00011271_0000.spill.sntp.cedar.0.root 12 NSFIL SSIZ MSIZ DSIZ 12 892507773 889431218 279686 OK adding N00010822_0000.cosmic.sntp.cedar.0.root 14 NSFIL SSIZ MSIZ DSIZ 14 409630441 407233416 184386 OK adding N00010822_0000.spill.sntp.cedar.0.root 14 NSFIL SSIZ MSIZ DSIZ 14 1010879902 1007664431 247343 OK adding N00010589_0000.cosmic.sntp.cedar.0.root 19 NSFIL SSIZ MSIZ DSIZ 19 542678607 539639362 168846 OK - 3700 Mbytes in 1 runs BIG - Splitting due to size 2269580593 OK adding N00010589_0000.spill.sntp.cedar.0.root 9 NSFIL SSIZ MSIZ DSIZ 9 2029993078 2026219894 471648 OK adding N00010589_0009.spill.sntp.cedar.0.root 10 NSFIL SSIZ MSIZ DSIZ 10 1670337440 1667239859 344175 OK adding N00010592_0000.cosmic.sntp.cedar.0.root 24 NSFIL SSIZ MSIZ DSIZ 24 721184487 717231223 171881 Max observed DSIZ ( difference/(secs-1) ) is under 500 KB. Should run with DSIZ limit of 2 MBytes,to be really generous. 
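To spell out the check these numbers motivate : the concatenated file should fall short of the summed input sizes by no more than about 2 MBytes per subrun boundary. A sketch of the arithmetic for one case, using the F00037000 all.sntp numbers above; reading the 'secs-1' in the DSIZ definition as (number of subrun files - 1) is an assumption :

    NSFIL=24                      # input subrun files
    SSIZ=486227081                # summed input size
    MSIZ=484040886                # size of the hadd output
    LIMIT=2097152                 # proposed 2 MByte per-subrun limit
    DSIZ=$(( ( SSIZ - MSIZ ) / ( NSFIL - 1 ) ))
    echo "NSFIL ${NSFIL} SSIZ ${SSIZ} MSIZ ${MSIZ} DSIZ ${DSIZ}"
    if [ ${DSIZ} -gt ${LIMIT} ] ; then
        echo "OOPS - losing ${DSIZ} bytes per subrun, over the ${LIMIT} limit"
    else
        echo "OK - DSIZ ${DSIZ} is within the ${LIMIT} limit"
    fi

On the F00037000 numbers this gives DSIZ 95051, matching the log line above.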
=============================================================================
2006 12 07
###########
# ROUNDUP #
###########
roundup.20061202
Added FREETMP calculation
Added DCFLIM file limit based on DCache write queue length
Added POOLMIN DCache write pool limit, at most 1 pool may be inactive.
SRV1> ./dfarmsum
Thu Dec 7 19:04:41 CST 2006
OK - 6573 files , 232 GBytes in near
2568 files , 144 GBytes in near cedar
4005 files , 88 GBytes in near R1_18_4
OK - 8176 files , 44 GBytes in far
2368 files , 26 GBytes in far cedar
5808 files , 17 GBytes in far R1_18_4
SRV1> ./roundup.20061202 -r R1_18_4 near
############
# predator #
############
11:32 Catch up with ./predator 2006-11
This has lots to process, following 20 Nov, so crontab -r
#######
# X11 #
#######
scanned gimp, some hangups
minos06 Thu Dec 7 11:08:34 CST 2006
minos18 Thu Dec 7 11:11:31 CST 2006
###########
# NETWORK #
###########
Ganglia of minos-mysql1 suggests outage was 06:37 thru 06:52
Email to net suggests partial outages 06:32 through 08:00
Network was not really up at 07:05,
ssh to minos26 from off site hung with no response. Succeeded on second try.
MINOS26 > minos
-bash: /afs/fnal.gov/files/code/e875/general/minossoft/setup/setup_minossoft_FNALU.sh: Connection timed out
MINOS26 > ls /afs/fnal.gov/files/code/e875/general/minossoft/setup/setup_minossoft_FNALU.sh
/afs/fnal.gov/files/code/e875/general/minossoft/setup/setup_minossoft_FNALU.sh
MINOS26 > minos
leaving crontab disabled for minosora1 Oracle upgrade to 10.2.0.2
#######
# SAM #
#######
Production minosora1 upgraded to 10.2.0.2, complete about 09:43
Restarted dbserver at 10:32.
Tested station, OK
MINOS26 > sam ping dbserver
The server 'SAMDbServer.prd:SAMDbServer' is alive.
MINOS26 > sam locate foo
RetryHandler.getReplicaLocationList('foo')> will retry in 18.33 seconds
Datafile with name 'foo' not found.
MINOS26 > ./sam_test_py minos
MINOS26 > sam get metadata --file=F00031300_0000.mdaq.root
=============================================================================
2006 12 06
#########
# mysql #
#########
Need to correct grants for reader, reader_old, writer etc.
Recently broken for dbu by a wildcard change by nwest,
but then we should not have been writing using reader_old.
###########
# ROUNDUP #
###########
SRV1> dfarm usage rubin
Used: 532463 + Reserved: 0 / Quota: 1000000 (MB)
SRV1> df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdb3 485G 145G 340G 30% /export/stage
Purge written files
./roundup -r cedar -w far
./roundup -r cedar -w near
SRV1> df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdb3 485G 20G 465G 5% /export/stage
=============================================================================
2006 12 05
###########
# ROUNDUP #
###########
SRV1> date
Tue Dec 5 08:27:40 CST 2006
SRV1> dfarm usage rubin
Used: 720934 + Reserved: 0 / Quota: 1000000 (MB)
SRV1> ./dfarmsum
Tue Dec 5 08:28:13 CST 2006
OK - 8298 files , 305 GBytes in near
4293 files , 217 GBytes in near cedar
4005 files , 88 GBytes in near R1_18_4
OK - 7957 files , 41 GBytes in far
2149 files , 23 GBytes in far cedar
5808 files , 17 GBytes in far R1_18_4
08:49
SRV1> kinit -R
./roundup -r cedar near
###########
# ENSTORE #
###########
ENSTORE - vet empty tapes for recycling per berg request 21 Nov 2006
These are all reco_near_R1_18_4.cpio_odc tapes, 9940B
VOLS='VO7416 VOB684 VOB685 VOB688 VOB691 VOB693 VOB695 VOB699 VOB701 VOB714 VOB724 VOB729 VOB732 VOB738 VOB739'
for VOL in ${VOLS} ; do enstore info --list="${VOL}" | less ; done
None of these files have names, all are deleted.
for VOL in ${VOLS} ; do enstore info --gvol="${VOL}" | less ; done
last_access ranges from 21 through 31 July 2006
=============================================================================
2006 12 04
###########
# ROUNDUP #
###########
roundup aborted on a dfarm read error, not clearing the disks.
From the log,
OK adding F00033174_0000.spill.bntp.cedar.0.root 1
Transfer initiation timeout
OOPS - failed to dfarm get /minos/farcat/F00033174_0000.spill.bntp.cedar.0.root
BAILING Sat Dec 2 21:23:40 CST 2006
Last previous timestamp was about 21:17.
99% dfarm capacity as of this morning.
SRV1> ./roundup -r cedar -w far
SRV1> ./dfarmsum
Mon Dec 4 09:04:24 CST 2006
OK - 8133 files , 294 GBytes in near
4128 files , 206 GBytes in near cedar
4005 files , 88 GBytes in near R1_18_4
OK - 22215 files , 197 GBytes in far
16407 files , 179 GBytes in far cedar
5808 files , 17 GBytes in far R1_18_4
SRV1> dfarm usage rubin
Used: 991707 + Reserved: 0 / Quota: 1000000 (MB)
( after about 4.5 GB had been recovered)
So I think we did not quite hit 100% . But we came very very close.
Or maybe we did hit 100%, as dfarm stores 2 copies of each file.
Checking status of that file :
SRV1> dfarm get /minos/farcat/F00033174_0000.spill.bntp.cedar.0.root TEST.DAT
Transfer initiation timeout
SRV1> time dfarm get /minos/farcat/F00033174_0000.spill.bntp.cedar.0.root TEST.DAT
Transfer initiation timeout
real 5m0.141s
user 0m0.110s
sys 0m0.070s
This dfarm cp has been failing since 27 Sep.
Only recently did this cause a bailout from the roundup script.
I have removed the offending file from dfarm,
as dfarm is about to be retired, and there is no point pursuing this.
dfarm rm /minos/farcat/F00033174_0000.spill.bntp.cedar.0.root
The cleanup helped somewhat, as of 12:30
SRV1> dfarm usage rubin
Used: 745531 + Reserved: 0 / Quota: 1000000 (MB)
14:00 - files are moving to tape pretty well
Running another pass on cedar/far, to finish it up.
There is plenty of space in the ROUNTMP area, 341 GB free.
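Before kicking off that cedar/far pass, a quota check along the following lines could flag another near-100% dfarm fill earlier. This is only a sketch, assuming the 'Used: ... / Quota: ... (MB)' line printed by dfarm usage above; the 90 percent threshold and the warning text are invented here, not part of any existing script.

# Sketch only - warn when dfarm usage for rubin nears quota.
# Expects output of the form: Used: 745531 + Reserved: 0 / Quota: 1000000 (MB)
USAGE=`dfarm usage rubin`
USED=`echo ${USAGE} | awk '{print $2}'`
QUOTA=`echo ${USAGE} | awk '{print $8}'`
PCT=$(( 100 * USED / QUOTA ))
echo "dfarm rubin at ${PCT}% of quota"
if [ ${PCT} -ge 90 ] ; then
    echo "WARNING - dfarm nearly full, purge written files with roundup -w"
fi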
./roundup -r cedar far
=============================================================================
2006 12 02
###########
# ROUNDUP #
###########
DFARM is getting full, 75/79/85 % on Thu/Fri/Sat
created dfarmsum on fnpcsrv1 :
SRV1> ./dfarmsum
Sat Dec 2 12:40:19 CST 2006
OK - 6974 files , 238 GBytes in near
OK - 22163 files , 196 GBytes in far
SRV1> ./dfarmsum
Sat Dec 2 12:58:36 CST 2006
OK - 6974 files , 238 GBytes in near
2969 files , 150 GBytes in near cedar
4005 files , 88 GBytes in near R1_18_4
OK - 22163 files , 196 GBytes in far
16355 files , 179 GBytes in far cedar
5808 files , 17 GBytes in far R1_18_4
Make space for the weekend :
SRV1> ./roundup -r cedar far
###########
# CRONTAB #
###########
crontab crontab.dat
Apparently off since 15 Nov ( kreymer on vacation since 23 Nov )
=============================================================================
2006 11 21
########
# GRID #
########
SRV1> pwd
/export/stage/minfarm/ROUNDUP_TEST/TEST
SRV1> grid-proxy-init -cert kreymer-doe.pem -key kreymer-doekey.pem
Your identity: /DC=org/DC=doegrids/OU=People/CN=Arthur E Kreymer 261310
Enter GRID pass phrase for this identity:
Creating proxy .................................. Done
Your proxy is valid until: Tue Nov 21 22:51:32 2006
SRV1> setup dcap -q x509
IFILE=N00004502_0000.mdaq.root
IPATH=minos/neardet_data/2004-11
DCPOR=24525
DFILE=dcap://fndca1.fnal.gov:${DCPOR}/pnfs/fnal.gov/usr/${IPATH}/${IFILE}
SFILE=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/${IPATH}/${IFILE}
SPATH=srm://fndca1.fnal.gov:8443/pnfs/fnal.gov/usr/${IPATH}
srmcp $SFILE file:////export/stage/minfarm/ROUNDUP_TEST/TEST/TEST.dat
srmcp $SFILE file:///TEST.dat
N.B. 2006 12 12 should instead use
srmcp -streams_num=1 -server_mode=active $SFILE file:///TEST.dat
srmls ${SPATH} - fails ?
SRV1> dccp dcap://fndca1.fnal.gov:24536/pnfs/fnal.gov/usr/minos/sim_root/far/camb_cosmic/bfld201/cosmic_mu_r651.root TEST.dat
731987214 bytes in 33 seconds (21661.55 KB/sec)
Per timur, trying a newer version of srm client, v1_24 versus v1.23 on fnpcsrv1
MINOS26 > pwd
/local/scratch26/kreymer/SRM
MINOS26 > curl https://srm.fnal.gov/twiki/pub/SrmProject/SrmcpClient/srmcp_v1_24_NULL.tar -o srm.tar -k
MINOS26 > tar xfv srm.tar
########
# nscd #
########
verified that nscd is off on all minos cluster nodes
/sbin/chkconfig --list nscd
#######
# SRM #
#######
=============================================================================
2006 11 17
##########
# DCACHE #
##########
Security scans were causing x509 door failures, disabled for now.
=============================================================================
2006 11 16
#######
# SAM #
#######
12:25 - restarted prd dbserver ( sam locate foo hung up )
after the production database and OS patches today ( done by 10:00 )
MINOS26 > ./sam_test_py minos prd
##########
# DCACHE #
##########
#########
# USERS #
#########
Made directory for user files in PNFS
cd /pnfs/minos
mkdir users
chmod 775 users
Perhaps should change this back to 755. In fact, did so on 17 Nov.
=============================================================================
2006 11 15
Registered DOE grid certificate for access to CD forms ( vacation request, etc )
X11 - clean scan
############
# SHUTDOWN #
############
Need to shut down servers around 05:30, to match the Enstore/DCache shutdown
MINOS-SAM01 > echo '.
./samstop > samstop.log 2>&1' | at 05:30
job 7 at 2006-06-07 03:00
MINOS26 > echo 'crontab -r' | at 05:30
=============================================================================
2006 11 14
#######
# X11 #
#######
minos19 Tue Nov 14 14:43:52 CST 2006
#########
# VAULT #
#########
per HOWTO.vault
VMON=2006-10
for DET in far near; do ./vault ${DET} ${VMON} ; done
Start 15:37:02 CST 2006
Finish 23:05
=============================================================================
2006 11 13
#######
# X11 #
#######
minos19 Mon Nov 13 14:06:49 CST 2006
=============================================================================
2006 11 10
X11 scan - clean for both acroread and gimp
15:20 CHECKLIST - write queues gradually up over 1000,
starting around 01:00 yesterday
perhaps p929 NOVA
=============================================================================
2006 11 09
X11 scan - clean for both acroread and gimp
###########
# SLF 4.4 #
###########
Ran genpy per HOWTO.genpy on minos25, looks OK
###########
# ROUNDUP #
###########
Remove Suppressed subruns from the RAWS list, for cleaner keepup
Testing RUNN=00011116
Corrected typo in AUTODEST ( had been using stale STRP )
Added up-front removal of SUPPRESSED subruns from RAWS list,
for cleaner keepup running.
SRV1> ln -sf roundup.20061109 roundup # was roundup.20061018
Catchup !!!
SRV1> ./roundup -r R1_18_4 near
SRV1> ./roundup -r R1_18_4 far
=============================================================================
2006 11 08
#######
# X11 #
#######
Around 09:00, gimp scan stuck on 06 07 17.
NODES="minos01 minos02 minos03 minos04 minos05 minos06 minos07 minos08 minos09 minos10 minos11 minos12 minos13 minos14 minos15 minos16 minos17 minos18 minos19 minos20 minos21 minos22 minos23 minos24 minos25 minos26"
CNODES="fcdflnx4 fcdflnx5 fcdflnx6 fcdflnx7 fcdflnx8 fcdflnx9"
UNODES="flxi02 flxi03 flxi04 flxi05 flxi06"
for NODE in $UNODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'mkdir -p /var/tmp/kreymer/.gimp-1.2' ; done
for NODE in $UNODES ; do printf "${NODE} `date`\n"; ssh ${NODE} 'echo gimp;gimp;echo done' ; done
Scans clean on cdf and FNALU nodes
###########
# ROUNDUP #
###########
Issue 2366
I cannot select minos files using the lum_min and lum_max dimensions.
MINOS26 > sam list files --dim=" DATA_TIER = raw-far and LUM_MIN = 288120000 "
ORA-00904: "DATA_FILES"."LUM_MIN": invalid identifier
Looking at the LUM_MIN dimension in the database browser,
it seems that this dimension is in the data_files table,
rather than the data_files_lumblocks table as in CDF and D0.
Is there a problem with our schema or database initialization ?
See
http://dbb2.fnal.gov:8520/cdfr2/databases?smdim=LUM_MIN&dimorder=%2Bdim&skip=&limit=125&rc=n&email=&type=sam-dim&do=r&nsrc=cdfofpr2&fsrc=cdfofpr2&gsrc=cdfofpr2&rc=n&dcbk=FILECATALOG
http://dbb.fnal.gov:8520/minos/databases?smdim=LUM_MIN&dimorder=%2Bdim&skip=&limit=125&rc=n&email=&type=sam-dim&do=r&nsrc=&fsrc=minosprd&gsrc=minosprd&rc=n&dcbk=FILECATALOG
Improve reporting of missing subruns
Scan for suppressed subruns, report as such
SRV1> cat /home/minfarm/lists/daq_lists/sup/*.sup | grep -v Failed | wc -l
298
#########
# BATCH #
#########
minos queue is set up, feeding minos19-24
bsub -q minos "cat /proc/cpuinfo"
=============================================================================
2006 11 07
#######
# X11 #
#######
At around 13:50, a scan with gimp revealed problems on
minos05 - several logged in, system idle, 188076k free memory
minos07 - several logged in, system busy running 1 loon process, 2689772k free
minos08 - root logged in, system idle, 3050428k free
minos14 - nobody logged in, system idle, 98968k free
minos23 - two logged in, system idle, 775776k free
###########
# ROUNDUP #
###########
setup sam -q dev
IFIL=F00028812_0001
sam verify metadata --descriptionFile=${IFIL}.sam.py
sam declare file ${IFIL}.sam.py
sam get metadata --file=$IFIL.mdaq.root
sam add location --fileName=F00028812_0001.mdaq.root --loc=/pnfs/minos/fardet_data/2005-01
MINOS26 > sam list files --dim=" DATA_TIER = raw-far and run_number = 28812 "
Files:
F00028812_0001.mdaq.root
F00028812_0000.mdaq.root
MINOS26 > sam list files --dim=" DATA_TIER = raw-far and LUM_MIN = 288120000 "
ORA-00904: "DATA_FILES"."LUM_MIN": invalid identifier
fcdflnx5 > sam list files --dim="data_tier = raw and LUM_MIN = 8877899795 and LUM_MAX = 8877899795"
Files:
br02112a.0013phys
fcdflnx5 > sam list files --dim="data_tier = raw and LUM_MIN = 8877899795"
Files:
br02112a.0013phys
Need to file a SAM issue for this: why can I not select on LUM_MIN ?
TO CLEANUP WILL DO
sam erase file location --fileName=F00028812_0001.mdaq.root --loc="/pnfs/minos/fardet_data/2005-01(v01234.5)"
sam undeclare file $IFIL.mdaq.root
=============================================================================
2006 11 06
##########
# DCACHE #
##########
kennedy noted door 0 stuck, restarted Sunday 11/05 around 17:53
##########
# K5PALL #
##########
Created k5pall script on desktop, to push keys to all ssh sessions,
using ${HOME}/k5push. Presumes connections are made via one of
ssh host
ssh -l user host
###########
# ROUNDUP #
###########
Added MISSING printout for missing subruns
Reviewing content :
./roundup.20061101 -r R1_18_4 -f 10 -W -n near 2>&1 | tee /tmp/missing
MISS=`grep MISSING /tmp/missing | tr -s ' ' | cut -f 3 -d ' '`
for MIS in $MISS ; do sam locate $MIS ; done
159 files total
Some of the missing subruns are
N00011099_0001.cosmic.sntp.R1_18_4.0.root
N00010816_0000.spill.sntp.R1_18_4.0.root
N00010840_0001.spill.sntp.R1_18_4.0.root
N00010852_0010.spill.sntp.R1_18_4.0.root
N00010867_0018.spill.sntp.R1_18_4.0.root
...
None of these are in SAM.
Trouble in run N00010912, due to reprocessing ( 0 and 1 versions )
This fouls up the gap calculations.
Detecting this by test for DELT=0 and veto for manual cleanup,
Now test fardet, more aggressive purging ( 2 days )
./roundup.20061101 -r R1_18_4 -f 2 -W -n far 2>&1 | tee /tmp/missingf
MISSF=`grep MISSING /tmp/missingf | tr -s ' ' | cut -f 3 -d ' '`
for MIS in $MISSF ; do sam locate $MIS ; done
Several runs missing many subruns, from 1 or small numbers through 24/25.
And many are in SAM.
Probably due to my too aggressive purging
Looking again with 10 day flush, most of these are F0003619, plus
F00036753_0010.spill.bntp.R1_18_4.0.root Oct 24
F00036551_0009.spill.sntp.R1_18_4.0.root Sep 7
SRV1> dfarm ls /minos/farcat/F00036753*.bntp.*
...
frwrw 2 rubin 6636701 10/24 11:43:24 /minos/farcat/F00036753_0009.spill.bntp.R1_18_4.0.root
frwrw 2 rubin 8303375 10/24 00:44:58 /minos/farcat/F00036753_0011.spill.bntp.R1_18_4.0.root
SRV1> dfarm ls /minos/farcat/F00036551_*.spill.sntp.R1_18_4.0.root
...
frwrw 2 rubin 1164444 09/07 07:29:53 /minos/farcat/F00036551_0008.spill.sntp.R1_18_4.0.root
frwrw 2 rubin 988439 09/07 04:52:04 /minos/farcat/F00036551_0010.spill.sntp.R1_18_4.0.root
#######
# X11 #
#######
gimp scan, minos08/14 sticking
minos14 is quite idle, just a couple of idle interactive logins
=============================================================================
2006 11 03
#########
# STAGE #
#########
stage.20061012 hacked to correct restore queue feedback ( changed 8/10 to 5/7 )
staging ran amok, as the restore/queue feedback was misaligned;
it was reading store, not restore, quantities.
This produced a peak restore queue of about 900 last night, before midnight.
The stage queue went over 1000, at http://fndca3a.fnal.gov/dcache/logins//stage.jpg.
The scripts seem to have gotten stuck around 03:27
per /local/scratch26/kreymer/log/stage/VOB057.20061103.log
Door 1 logins peaked at about 28, around 3 am
Door 0 logins remained under 10 till around 3 am
Dcache services show door 0 offline
Restart staging with VOB057 :
REVOLS="VOB057 VOB428 VOB441 VOB612 VOB641 VOB727 VOB884"
for VOL in ${REVOLS} ; do ./stage -w -s 'spill.cand' ${VOL} ; done
##########
# DCACHE #
##########
Verified that door 0 is down, with dccp from port 24136
Adding -o 10 ( 10 second open timeout ) did not help, the dccp remained inactive.
dccp from 24136 succeeded.
#########
# GENPY #
#########
Switched from port 24125 to 24136, made this a variable in genpy
ln -sf genpy.20061103 genpy # was genpy.20060714
Killed some dbu processes for neardet_data, to get predator unstuck
at about 09:58
###########
# ROUNDUP #
###########
Added CHART section to end of roundup.20061101
this explains critical variables
did some cleanup
Test writing runs with trailing missing subrun,
./roundup.20061101 -r R1_18_4 -n near
./roundup.20061101 -r R1_18_4 -s N00010777 -n near
Concatenate a run with one subrun missing at end :
./roundup.20061101 -r R1_18_4 -s N00010777_ -f 24 -W near
works !
Implemented the -f
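For the next time a door goes quiet ( 2006 11 03 DCACHE above ), a small probe like the one below could confirm which dcap ports answer before pointing genpy at one. This is only a sketch: it assumes dccp returns a nonzero exit status when the copy fails, it reuses the -o 10 open timeout tried above, and the test file is just the raw file from the 2006 11 07 SAM tests, picked here as a convenient placeholder.

# Sketch only - probe the dcap ports mentioned above ( 24125, 24136 ).
# PFILE is a placeholder; any small file known to PNFS should do.
PFILE=/pnfs/fnal.gov/usr/minos/fardet_data/2005-01/F00028812_0001.mdaq.root
for DPOR in 24125 24136 ; do
    printf "port %s " "${DPOR}"
    if dccp -o 10 dcap://fndca1.fnal.gov:${DPOR}${PFILE} /tmp/doortest.dat > /dev/null 2>&1 ; then
        echo "OK"
    else
        echo "no response"
    fi
    rm -f /tmp/doortest.dat
done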