Symptom: Stuck rsh to localhost on SGI.
Why this happened: Output disks are scattered across the SGI farm and the IBM farm. To check the space left on an output disk, rsh was used whether the disk was local or remote. On SGI, an rsh to the local host produced a hung rsh process.
Solution: Use rsh for remote disks only and check local disks directly (a sketch of such a check appears after this group of entries).

Symptom: Bad runs (solenoid off, many detectors not fully on, etc.) that cause the production executable to fail.
Why this happened: The DAQ has no way of knowing that a run is bad, and the online shift-takers did not tag these runs, so they appear as normal runs in the raw tape database.
Solution: Manually skip those runs.

Symptom: Tape media error.
Why this happened: Dirty media or a dirty drive.
Solution: Skip the tape after many retries with different drives.

Symptom: Cannot write an output file to an IBM file system although there is enough space left.
Why this happened: The IBM file system does not allow new files to be copied to it once it is more than 96% full.
Solution: Mark such disks as full once they reach 96% and copy to other disks (see the same sketch below).

Symptom: Merged PAD files did not contain the full run. #1
Why this happened:
- The DAQ takes data and writes them to disk.
- The online stager copies a tape's worth of files to tape and produces the raw_1b.tape database file (*** not all files of the same run are copied ***).
- Production proceeds for the files on tape.
- PAD merging relies on raw_1b.tape to decide how many files there are for a run.
==> If there is, say, a one-week accelerator shutdown, the last run is not copied to tape, so PAD merging decides there are no new files for that run and goes ahead and merges the files it has.
Solution: Instead of using raw_1b.tape, use the online DAQ history file to determine the number of files for each run.

Symptom: Merged PAD files did not contain the full run. #2
Why this happened: The online data logger crashed, so the DAQ history file is missing.
Solution: Manual. This is very hard for the production team to catch on its own.

Symptom: Raw data are not processed.
Why this happened: The online data logger crashed in the middle of writing a tape, so it wrote no entries into raw_1b.tape and production is not aware that such a tape exists.
Solution: Manually add the tape into raw_1b.tape and manually process it. This is again hard for the production team to catch on its own.

Symptom: Stream B data are processed with out-of-date constants. #1
Why this happened: The IBM farm was relocated in February 1995. Somehow the script that updates the database constants was not put back into the appropriate crontab. This was not caught until the collaboration added the K* gamma trigger in April.
Solution: Re-process all data produced on the IBM farm during Feb 12 - Apr 17. This includes everything: reconstruction, splitting, merging, staging, and updating the bookkeeping. This was the biggest disaster of Run 1.

Symptom: Stream B data are processed with out-of-date constants. #2 (Caught by errors reported by the TRGSEL module.)
Why this happened: The cron job on the distribution machine was not running as a result of a power outage.
Solution: Manually fetch the constants and resume the cron job.

Symptom: Stream A split PADs have wrong triggers.
Why this happened: Related to the out-of-date-constants problem on the IBM farm, because Stream A PAD concatenation is done on IBM with the Stream B constants.
Solution: Re-concatenate.

Symptom: Tape label mismatch.
Why this happened: The person who labels the tapes accidentally swapped the labels of 90 tapes one day.
Solution: Ask people to re-label them.
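The two disk-space entries above (use rsh only for remote hosts, and treat an IBM file system above 96% usage as full) amount to a small check in the production scripts. The following is a minimal Python sketch of that logic, not the original script; the host handling, the df output parsing, and the helper names are assumptions made for illustration.

    import os
    import socket
    import subprocess

    FULL_FRACTION = 0.96   # IBM file systems refuse new files above roughly 96% usage

    def disk_usage_fraction(host, path):
        """Return the used fraction of a disk, using rsh only when it is remote."""
        if host in ("localhost", socket.gethostname()):
            # Local disk: query it directly; rsh to the local host hangs on SGI.
            st = os.statvfs(path)
            return 1.0 - float(st.f_bavail) / float(st.f_blocks)
        # Remote disk: rsh + df, then parse the capacity column (e.g. "87%").
        out = subprocess.run(["rsh", host, "df", "-k", path],
                             capture_output=True, text=True, check=True).stdout
        capacity = out.splitlines()[-1].split()[-2]
        return int(capacity.rstrip("%")) / 100.0

    def pick_output_disk(disks):
        """Return the first (host, path) that is still below the 96% limit."""
        for host, path in disks:
            if disk_usage_fraction(host, path) < FULL_FRACTION:
                return host, path
        raise RuntimeError("all output disks are effectively full")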
Symptom: Can't read raw data for re-processing.
Why this happened: Tape media error (it was read too many times?).
Solution: Use the DST with the production banks removed. Some time was spent determining whether drop_banks.uic works appropriately for this purpose. Be careful with the production history bank (it is filled twice).

Symptom: Some production tapes are missing from the vault.
Why this happened: The person in charge of tape management took them out of the FCC vault because they were damaged, but the production team was not informed, so the tapes are still in the production database.
Solution: Some of the tapes have library or trailer copies, so we can recover them. However, for the inclusive streams there are no trailer copies, and we did not have time to reproduce all of these tapes.

Symptom: B59857AM was processed but somehow all of the output DST was lost.
Why this happened: Unknown. Maybe human error?
Solution: Re-process. Since the PADs are okay, make sure that either there are no output PAD files or the output PAD files are invisible to the PAD merging program.

Symptom: Can't connect to b0dstr.fnal.gov to fetch new database constants.
Why this happened: xcdf15 (an X terminal?) somehow intercepted the traffic to b0dstr. Maybe a duplicate IP?
Solution: Turning xcdf15 off solved it.

Symptom: Can't write the bookkeeping file.
Why this happened: NFS problem.
Solution: Ask the computer system manager to check/reboot/clean up the NFS server.

Symptom: Wrong entries in raw_1c.tape. B74606: only AM-AQ is present; AA-AL is missing. B74817: AD-AL has run_type *TAG with the PHYSICS_630_7_L3TAG_ALP_UP table, while AA-AC has run_type *NO_TRIG_TABL and no trigger table.
Why this happened: Problems in the data logger?
Solution: Manually add them into raw_1c.tape for processing.

Symptom: DST/PAD tape damage.
Why this happened: Media error, or a bad drive munches the tape when reading it.
Solution: Case by case.

Symptom: Unknown trigger in the data.
Why this happened: The trigger table database was not updated correctly online.
Solution:

Symptom: B58956AR.RAW was listed as 0 records, but 2547 events showed up in production. As a result, the run was merged without this file.
Why this happened: This is the last file of the run. It seems the data logger crashed and failed to write the record for this file.
Solution: Re-merge.

Symptom: Staging couldn't start and the whole production job hung.
Why this happened: There were no tapes left for copying.
Solution: Ask the people in charge to initialize more tapes.

Symptom: Output disk full.
Why this happened: The stager has to copy the reconstructed data of a run to tape together. Each run contains many subfiles, and if one of the subfiles was not processed successfully, the output data for that run cannot be copied to tape and stays on disk.
Solution: Give those subfiles the highest priority in production (a sketch of such a priority rule follows this group of entries).

Symptom: Cannot update the stager bookkeeping files.
Why this happened: The disk where the bookkeeping files live was full. Each staging job produces log files named after the tapes, and these log files can eat a lot of disk space.
Solution: Compress the older log files.

Symptom: Unsuccessful data writing to tape.
Why this happened: Dirty tape drive.
Solution: Clean the tape drive or change to another drive.

Symptom: Bookkeeping files were incorrectly modified when editing.
Why this happened: The bookkeeping files became so large that one could not safely use an editor to change their contents.
Solution: Use standard Unix commands like 'more', 'less', 'cat', 'fgrep', etc. to inspect and modify the bookkeeping files instead of an editor.
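Regarding the "Output disk full" entry above: since the stager copies a run to tape only when every subfile of that run has been processed, a single unprocessed subfile pins the whole run's output on disk. The Python sketch below is one hedged way such a priority rule could be expressed; the data structures and names are invented for illustration and do not reflect the actual production bookkeeping.

    def order_work_queue(pending, processed_by_run, files_per_run):
        """Order pending subfiles so that the ones which complete a nearly
        finished run come first; finishing such a run lets the stager copy
        it to tape and free its output disk space.

        pending          : list of (run, subfile) still to be reconstructed
        processed_by_run : dict mapping run -> subfiles already processed
        files_per_run    : dict mapping run -> total subfiles in the run
        """
        def missing(item):
            run, _ = item
            return files_per_run[run] - processed_by_run.get(run, 0)

        return sorted(pending, key=missing)

    # Example: run 59857 needs only one more subfile, so it jumps ahead.
    queue = order_work_queue(
        pending=[(60001, "AA"), (59857, "AM")],
        processed_by_run={59857: 9, 60001: 2},
        files_per_run={59857: 10, 60001: 12},
    )
    # queue == [(59857, "AM"), (60001, "AA")]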
Symptom: Failed to update the stager bookkeeping files without giving a warning.
Why this happened: There was a power outage while a staging job was trying to update the bookkeeping files.
Solution: Manually update the bookkeeping files.

Symptom: Some data were ready to copy but staging never started.
Why this happened: When a stager job copies data to tape, it first changes the dates of those data files to the future. Since our script only copies data whose dates are older than the current time, files being copied are not picked up by the next staging job. The data files are deleted if copying succeeds, or their dates are reset to the current time with the "touch" command if copying fails. Sometimes a staging job failed to "touch" those data files, for special reasons such as a power outage or a full disk, and the future-dated files were then never picked up again.
Solution: Touch the data files manually (a sketch of this timestamp scheme follows this group of entries).

Adjustment: CDF software version upgrade, but production should stay with the original version.
Why this happened: The IBM flavor of the production executable was built on fncka, whose software changes along with CDF software upgrades.
Solution: Build production executables against a special version, s_production, which has frozen CDF software but may contain newer code for farm use (e.g., interfaces to CPS).

Adjustment: SVX' alignment constants changed.
Why this happened: The SVX' group obtained better alignment constants.
Solution: Change the production talk-to to use the old constants for consistency.

Adjustment: We got more nodes on the IBM farm.
Why this happened: Farm node re-allocation.
Solution: Modify the CPS-related configuration files.

Adjustment: CPS upgraded to v2_9b.
Why this happened: Well...
Solution: The production script was pinned to a specific (older) version. The CPS experts think it is okay to use the new version, so the production script is modified accordingly.

Adjustment: Collaboration request to add a new trigger to an existing stream.
Why this happened: Improvement of the physics triggers.
Solution: Modify the splitting UIC accordingly.

Adjustment: Collaboration request to add new output banks.
Why this happened: There are new sub-detectors (small-angle detectors).
Solution: Modify the output UIC of reconstruction and PAD concatenation accordingly.

Adjustment: Change the whole set of triggers.
Why this happened: Case 1: the accelerator makes a 630 GeV run. Case 2: a broken CTC wire (CTC/2 run).
Solution: Make a new set of output/split UIC for reconstruction and PAD merging accordingly. Note that there is a set of 1800 GeV runs taken with the 630 trigger tables as a test.

Adjustment: Some runs were processed with the wrong run constants.
Why this happened: The database was not updated.
Solution: Remove those runs from the stager bookkeeping files and reprocess them.

Adjustment: New triggers in the stream.
Why this happened: Study more physics (K* gamma, Roman pot, central dijets, ...).
Solution: Modify the staging script.

Adjustment: Stream C processing.
Why this happened: The B group finally decided to process these events.
Solution: Modify the staging script to produce a Stream C bookkeeping file.
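The future-date scheme described in the "data were ready to copy but staging never started" entry is essentially a lock implemented with file timestamps. Below is a minimal Python sketch of that protocol as described above; the one-day offset and the copy_to_tape callback are assumptions, and the manual fix corresponds to running the final touch step by hand.

    import os
    import time

    ONE_DAY = 24 * 3600

    def eligible_for_staging(path, now=None):
        """Only files whose modification time is older than 'now' are picked
        up, so a file dated into the future is invisible to other staging jobs."""
        now = time.time() if now is None else now
        return os.path.getmtime(path) < now

    def stage_file(path, copy_to_tape):
        """Copy one data file to tape using the future-date locking convention."""
        # 1. Date the file into the future so concurrent staging jobs skip it.
        future = time.time() + ONE_DAY
        os.utime(path, (future, future))
        try:
            copy_to_tape(path)     # illustrative tape-copy callback
        except Exception:
            # Copy failed: "touch" the file back to the current time so the
            # next staging job sees it again.  If this step itself is lost
            # (power outage, full disk), the file stays future-dated and must
            # be touched manually.
            os.utime(path, None)
            raise
        else:
            # Copy succeeded: the data file is removed from the output disk.
            os.remove(path)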