2004 CDF E-Log -- Owl shift. Fri Feb 27, 2004
SciCo: Guram Chlachidze
DAQ Ace: Alison Lister
Monitoring Ace: Catalin Ciobanu
CO: Oleg Poukhov
Operations Manager: J.J.Schmidt


Start of Shift Notes:  

Taking cosmics. Tevatron in stacking.

Fri Feb 27 00:18:02 TOF heartbeat. SMACS crashed. We restarted it. - Catalin
Fri Feb 27 01:12:21
b0clc00 had a "U" error in the VxWorks monitor window. 
Shepherding the crate fixed this.
 - Alison/Catalin
Fri Feb 27 04:20:00 Still waiting for beam, so we are taking cosmics - Guram
Fri Feb 27 04:48:43
Pinky (B2 W7 L4) stable throughout the shift
 - Catalin
Fri Feb 27 05:11:35 Run 179416 ACTIVE: Attention !!!. CER_SVXMON_HALT_RECOVER_RUN_ERROR !!! Stuck Cellid I/B0/W0/L0/C4-7 . auto-HRR solved the problem - Alison x2080
Fri Feb 27 05:27:19 Run 179416 Recover transition state: FrontEnd Crate Error Condition from: VRB_SVX_02 RXPT error - Alison x2080
Fri Feb 27 05:32:09
b0svx02 (VRB_SVX_02 RXPT error) 
VxWorks login: vxworks 
Password: 

b0svx02-> i 

  NAME        ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY 
---------- ------------ -------- --- ---------- -------- -------- ------- ----- 
tExcTask   excTask       7bfe708   0 PEND         29a164  7bfe630       0     0 
tNetTask   netTask       7bdd088   1 PEND         280360  7bdcfd8       0     0 
tPortmapd  portmapd      7bd8308   1 PEND         280360  7bd8198      16     0 
tRlogOutTasrlogOutTask   781f238   2 PEND         280360  781f0b0       0     0 
tRlogInTaskrlogInTask    781dca8   2 PEND         280360  781da80       0     0 
tShell     shell         7aa11f8   5 READY        283d24  7aa01c0  1c0001     0 
tRlogind   rlogind       7bd9898  15 PEND         280360  7bd94d8       0     0 
tAioIoTask1aioIoTask     7bee678  50 PEND         2807d4  7bee5d0       0     0 
tAioIoTask0aioIoTask     7be7470  50 PEND         2807d4  7be73c8       0     0 
tAioWait   aioWaitTask   7bf5880  51 PEND         280360  7bf5728       0     0 
t1         VISIONserver  7bbf1f0 100 PEND         280360  7bbf0c0       0     0 
t2         ROBINserver   7bba1c8 100 PEND         280360  7bba088       0     0 
t3         httpDaemon    7bb51a0 100 PEND         280360  7bb4f88       0     0 
Messenger  FER_messenge  7a70a30 200 SUSPEND      282aa0  7a70770  3d0004     0 
Readout    FER_readOutV  777e030 201 READY        287110  777df38  3d0002     0 
rtlm_main  rtlm_main     7bb0178 220 PEND         280360  7bb0038       0     0 
Mon_III    FER_monitorI  7b5ceb0 220 READY        2832d8  7b5ce10  1c0001     0 
rtlm_sessiortlm_session  7b783c8 225 PEND         280360  7b781e8       0     0 
tLogTask   logTask       7bfbd90 250 READY        29a164  7bfbcc8       0     0 
value = 0 = 0x0 
b0svx02-> tt Messenger 
2889e8 vxTaskEntry    +60 : FER_messenger () 
7b300b4 FER_messenger  +2cc: 7b0eb28 () 
7b0eb28 FER_smartInitMS+1db0: 7b2e628 () 
7b2e628 FER_errorSender+22ec: _MLencodedMessageCreate () 
7b1b304 _MLencodedMessageCreate+88 : free () 
25a37c free           +1c : memPartFree () 
259fbc memPartFree    +144: taskSuspend () 
value = 0 = 0x0 
b0svx02->
 - Catalin/Alison
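
For picking out the suspended task from an `i` listing like the one above, a minimal Python sketch follows. It is not part of the FER or VxWorks tooling; the helper names `parse_task_table` and `suspended` are hypothetical, and the sample is a subset of the rows pasted above. It assumes the last seven columns are always whitespace-separated, which lets it cope with rows where NAME and ENTRY run together.

```python
# Hypothetical helpers for reading a pasted VxWorks `i` task listing.
# Not part of the FER or VxWorks tooling.

def parse_task_table(text):
    """Return (name_and_entry, status, errno) for each task row."""
    tasks = []
    for line in text.splitlines():
        # The last seven columns (TID PRI STATUS PC SP ERRNO DELAY) are
        # whitespace-separated, so split from the right. NAME and ENTRY can
        # run together when the task name fills its column.
        fields = line.rsplit(None, 7)
        if len(fields) != 8 or not fields[2].isdigit():
            continue  # header, separator, prompt, or "value = ..." line
        head, _tid, _pri, status, _pc, _sp, errno, _delay = fields
        tasks.append((head, status, errno))
    return tasks


def suspended(tasks):
    """Names of tasks whose STATUS column reads SUSPEND."""
    return [head.split()[0] for head, status, _ in tasks if status == "SUSPEND"]


SAMPLE = """\
  NAME        ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
tNetTask   netTask       7bdd088   1 PEND         280360  7bdcfd8       0     0
tRlogOutTasrlogOutTask   781f238   2 PEND         280360  781f0b0       0     0
Messenger  FER_messenge  7a70a30 200 SUSPEND      282aa0  7a70770  3d0004     0
Readout    FER_readOutV  777e030 201 READY        287110  777df38  3d0002     0
value = 0 = 0x0
"""

if __name__ == "__main__":
    print(suspended(parse_task_table(SAMPLE)))  # prints ['Messenger']
```

On the sample rows this flags only Messenger, the task whose stack trace (`tt Messenger`) ends in taskSuspend ().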
-- Fri Feb 27 05:44:14 comment by...W.Badgett --  
The stack trace above points to a recently fixed
merlin package bug. Note that the main DAQ crates
all boot off of the bug-fixed version, while the b0svx##
crates boot from an old (very old?) version in
Steve Nahn's private disk area. Therefore, I'm guessing
the bug is still present in Steve Nahn's version.

I advise Steve Nahn to recompile and/or relink his
private version of the fer package. It would also be
good for these crates to boot from a public area, but
that doesn't seem likely to happen.

-- Fri Feb 27 09:17:30 comment by...SCN --  Here is the record of reboots of b0svx02 for the last 6 months or so:
Wed Nov 26 10:05:10
Wed Nov 26 10:17:48
Thu Nov 27 08:34:10
Thu Nov 27 08:36:46
Thu Nov 27 08:49:36
Thu Nov 27 09:08:32
Thu Nov 27 09:12:01
Sun Nov 30 04:31:35
Thu Jan 08 06:10:24
Thu Jan 08 06:13:01
Thu Jan 08 06:46:35
Mon Jan 12 15:38:33
Wed Jan 21 20:09:32
Sat Jan 24 15:34:14
Sat Jan 24 15:37:46
Wed Feb 25 08:50:36
Wed Feb 25 08:52:40
Fri Feb 27 05:33:27
Fri Feb 27 05:39:11
Fri Feb 27 07:46:45
In addition, if you browse the Silicon log you find that until Feb 27, all of the reboots were done on purpose by humans, either to troubleshoot the FFO resonance detector hardware, add some L1A counting, swap a VRB, etc. None of them until this morning was due to a crash in the software, and you can see that between Jan 24 and Feb 25 there were no reboots at all.

This raises the question: what changed today? (The point of keeping a private version is that it doesn't get changed underneath you, unlike the public "frozen" version, which has changed 3 times in the last 20 days or so.)
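
The quiet periods in the reboot record above can be checked with a short Python sketch. The timestamps are copied from the list; the year is not in the log, so the code assumes Nov/Dec entries are 2003 and Jan/Feb entries are 2004. The names `parse_stamp` and `gaps` are hypothetical helpers, not existing tools.

```python
from datetime import datetime

# Reboot timestamps for b0svx02, copied from the record above.
REBOOTS = """\
Wed Nov 26 10:05:10
Wed Nov 26 10:17:48
Thu Nov 27 08:34:10
Thu Nov 27 08:36:46
Thu Nov 27 08:49:36
Thu Nov 27 09:08:32
Thu Nov 27 09:12:01
Sun Nov 30 04:31:35
Thu Jan 08 06:10:24
Thu Jan 08 06:13:01
Thu Jan 08 06:46:35
Mon Jan 12 15:38:33
Wed Jan 21 20:09:32
Sat Jan 24 15:34:14
Sat Jan 24 15:37:46
Wed Feb 25 08:50:36
Wed Feb 25 08:52:40
Fri Feb 27 05:33:27
Fri Feb 27 05:39:11
Fri Feb 27 07:46:45
""".splitlines()


def parse_stamp(stamp):
    # Assumption: the record omits the year. Nov/Dec are taken as 2003 and
    # Jan/Feb as 2004, consistent with this shift's date (Fri Feb 27, 2004).
    month = stamp.split()[1]
    year = 2003 if month in ("Nov", "Dec") else 2004
    return datetime.strptime(f"{stamp} {year}", "%a %b %d %H:%M:%S %Y")


def gaps(stamps):
    """Return (previous reboot, next reboot, time between) for each pair."""
    times = [parse_stamp(s) for s in stamps]
    return [(a, b, b - a) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    longest_first = sorted(gaps(REBOOTS), key=lambda g: g[2], reverse=True)
    for start, end, delta in longest_first[:3]:
        print(f"{start:%b %d %H:%M} -> {end:%b %d %H:%M}: {delta.days} days")
```

Under those year assumptions, the Jan 24 to Feb 25 stretch comes out at about 31 days without a reboot, and the Nov 30 to Jan 8 stretch is longer still, at about 39 days.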


Fri Feb 27 05:37:50 Run 179416 Terminated at 2004.02.27 05:37:35 - RunControl
Fri Feb 27 05:38:29 Run 179416 TERMINATE: Ended run as problem with b0svx02 - Alison x2080
Fri Feb 27 05:46:25 Run 179417 Activated at 2004.02.27 05:45:53 - RunControl
Fri Feb 27 05:46:45 Run 179417 ACTIVATE: Restarted the run - Alison x2080
Fri Feb 27 05:47:51
Working on crate b0imu00 and b0imu01 to diagnose their  
problems running with TDC DSP version V45. 

 - W.Badgett
Fri Feb 27 07:45:09 Run 179417 Terminated at 2004.02.27 07:45:02 - RunControl
Fri Feb 27 07:45:22 Run 179417 TERMINATE: End run to put IMU crates back in - Alison x2080
Fri Feb 27 07:51:33 Run 179435 Activated at 2004.02.27 07:51:18 - RunControl
Fri Feb 27 07:51:48 Run 179435 ACTIVATE: Restarted run with IMU back in partition - Alison x2080
Fri Feb 27 07:52:34 Run 179435 Terminated at 2004.02.27 07:52:27 - RunControl
Fri Feb 27 07:52:45 Run 179435 TERMINATE: IMU01 had DTO problems - Alison x2080
Fri Feb 27 07:54:35
Doing some parasitic testing of b0imu00 and
b0imu01 in the main partition while they
run DSP version V45.
 - W.Badgett
Fri Feb 27 07:56:08
Run Number | Data Type | Physics Table | Begin Time | End Time | Live Time | L1 Accepts | L2 Accepts | L3 Accepts | Live Lumi, nb-1 | GR | SC | RC
Totals 07:55:03 ::
 - End of Shift Report
Fri Feb 27 07:56:08
b0svx02 again had problems, this time during the HRR recover transition.
We software-rebooted the crate; this time it seemed to work!
 - alison
Fri Feb 27 07:56:45 Run 179436 Activated at 2004.02.27 07:56:25 - RunControl
Fri Feb 27 07:56:46 Run 179436 ACTIVATE: Restarted without IMU01 - Alison x2080
Fri Feb 27 07:57:48 Shift Summary:
-  No beam all shift following a quench during the day
   shift. Tevatron people working to fix consequences of the bad
   Pirani gauge.
-  restarted run due to b0svx02 error 
-  Bill is working on crate b0imu00 and b0imu01 to diagnose their   
   problems running with TDC DSP version V45.  


End of Shift Numbers
CDF Run II

Runs                   
Delivered Luminosity   0  
Acquired Luminosity    0  
Efficiency             100

 - Guram
Fri Feb 27 08:08:25
Tried several different readout settings, and imu00  
seemed to like its default settings:  TdcReadoutMode =  
LocalAggressive; DmaChain=false;  SpyMode=false 

In this case, there was the lowest rate
of header word errors.
 - W.Badgett :: (run 179436)
Fri Feb 27 08:09:12 Run 179436 Terminated at 2004.02.27 08:08:44 - RunControl
Fri Feb 27 08:09:13 Run 179436 TERMINATE: End run to put imu01 back in the partition. - Alison x2080
Fri Feb 27 08:13:23 Run 179437 Activated at 2004.02.27 08:13:07 - RunControl
Fri Feb 27 08:23:16 Run 179437 ACTIVE: HRRed CER_SVXMON_HALT_RECOVER_RUN_ERROR: Stuck Cellid S/B1/W5/L4/C7-13 .  - Vadim x2080
Fri Feb 27 08:25:51 Run 179437 Terminated at 2004.02.27 08:25:33 - RunControl
Fri Feb 27 08:25:52 Run 179437 TERMINATE: stop the run to change l3proxy version - Vadim x2080
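
The RunControl entries above give activation and termination times, so run lengths for this shift can be tallied with a small Python sketch. The regular expression and the helper name `run_durations` are assumptions made for illustration, not an existing E-Log tool; runs activated before this shift (such as 179416) have no activation line here and are skipped.

```python
import re
from datetime import datetime

# RunControl lines copied from this shift's log.
LOG_LINES = [
    "Fri Feb 27 05:37:50 Run 179416 Terminated at 2004.02.27 05:37:35 - RunControl",
    "Fri Feb 27 05:46:25 Run 179417 Activated at 2004.02.27 05:45:53 - RunControl",
    "Fri Feb 27 07:45:09 Run 179417 Terminated at 2004.02.27 07:45:02 - RunControl",
    "Fri Feb 27 07:51:33 Run 179435 Activated at 2004.02.27 07:51:18 - RunControl",
    "Fri Feb 27 07:52:34 Run 179435 Terminated at 2004.02.27 07:52:27 - RunControl",
    "Fri Feb 27 07:56:45 Run 179436 Activated at 2004.02.27 07:56:25 - RunControl",
    "Fri Feb 27 08:09:12 Run 179436 Terminated at 2004.02.27 08:08:44 - RunControl",
    "Fri Feb 27 08:13:23 Run 179437 Activated at 2004.02.27 08:13:07 - RunControl",
    "Fri Feb 27 08:25:51 Run 179437 Terminated at 2004.02.27 08:25:33 - RunControl",
]

# Assumed line shape: "Run <number> Activated|Terminated at YYYY.MM.DD HH:MM:SS".
PATTERN = re.compile(
    r"Run (\d+) (Activated|Terminated) at (\d{4}\.\d\d\.\d\d \d\d:\d\d:\d\d)"
)


def run_durations(lines):
    """Map run number to (termination time - activation time)."""
    started, durations = {}, {}
    for line in lines:
        match = PATTERN.search(line)
        if not match:
            continue
        run, event, stamp = match.groups()
        when = datetime.strptime(stamp, "%Y.%m.%d %H:%M:%S")
        if event == "Activated":
            started[run] = when
        elif run in started:  # ignore runs with no activation in this log
            durations[run] = when - started[run]
    return durations


if __name__ == "__main__":
    for run, length in run_durations(LOG_LINES).items():
        print(run, length)
```

On these lines, run 179417 is the only long run of the shift (just under two hours); the remaining runs last between about one and thirteen minutes.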