Symptom: Stuck rsh to localhost on SGI.
Why this happened: Output disks are scattered across the SGI farm and the IBM farm. To check the space left on an output disk, rsh was used whether the disk was local or remote. On SGI, an rsh to the local host produced a hung rsh process.
Solution: Use rsh for remote disks only and check local disks directly (a sketch of such a check appears after this group of entries).

Symptom: Bad runs (solenoid off, many detectors not fully on, etc.) that cause the production executable to fail.
Why this happened: The DAQ has no way of knowing that a run is bad, and the online shift-takers did not tag these runs, so they appear as normal runs in the raw tape database.
Solution: Manually skip those runs.

Symptom: Tape media error.
Why this happened: Dirty media or a dirty drive.
Solution: Skip the tape after many retries with different drives.

Symptom: Cannot write an output file to an IBM file system although there is enough space left.
Why this happened: The IBM file system does not allow new files to be copied to it once it is more than 96% full.
Solution: Mark such disks as full once they reach 96% and copy to other disks (see the same sketch below).

Symptom: Merged PAD files did not contain the full run. #1
Why this happened:
- The DAQ takes data and writes them to disk.
- The online stager copies a tape's worth of files to tape and produces the raw_1b.tape database file (*** not all files of the same run are copied ***).
- Production proceeds for the files on tape.
- PAD merging relies on raw_1b.tape to decide how many files there are for a run.
==> If there is, say, a one-week accelerator shutdown, the last run is not copied to tape, so PAD merging decides there are no new files for that run and goes ahead and merges the files it has.
Solution: Instead of using raw_1b.tape, use the online DAQ history file to determine the number of files for each run.

Symptom: Merged PAD files did not contain the full run. #2
Why this happened: The online data logger crashed, so the DAQ history file is missing.
Solution: Manual. This is very hard for the production team to catch on its own.

Symptom: Raw data are not processed.
Why this happened: The online data logger crashed in the middle of writing a tape, so it wrote no entries into raw_1b.tape and production is not aware that such a tape exists.
Solution: Manually add the tape into raw_1b.tape and manually process it. This is again hard for the production team to catch on its own.

Symptom: Stream B data are processed with out-of-date constants. #1
Why this happened: The IBM farm was relocated in February 1995. Somehow the script that updates the database constants was not put back into the appropriate crontab. This was not caught until the collaboration added the K* gamma trigger in April.
Solution: Re-process all data produced on the IBM farm during Feb 12 - Apr 17. This includes everything: reconstruction, splitting, merging, staging, and updating the bookkeeping. This was the biggest disaster of Run 1.

Symptom: Stream B data are processed with out-of-date constants. #2 (Caught by errors reported by the TRGSEL module.)
Why this happened: The cron job on the distribution machine was not running as a result of a power outage.
Solution: Manually fetch the constants and resume the cron job.

Symptom: Stream A split PADs have wrong triggers.
Why this happened: Related to the out-of-date-constants problem on the IBM farm, because Stream A PAD concatenation is done on IBM with the Stream B constants.
Solution: Re-concatenate.

Symptom: Tape label mismatch.
Why this happened: The person who labels the tapes accidentally swapped the labels of 90 tapes one day.
Solution: Ask people to re-label them.
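The two disk-space entries above (use rsh only for remote hosts, and treat an IBM file system above 96% usage as full) amount to a small check in the production scripts. The following is a minimal Python sketch of that logic, not the original script; the host handling, the df output parsing, and the helper names are assumptions made for illustration.

    import os
    import socket
    import subprocess

    FULL_FRACTION = 0.96   # IBM file systems refuse new files above roughly 96% usage

    def disk_usage_fraction(host, path):
        """Return the used fraction of a disk, using rsh only when it is remote."""
        if host in ("localhost", socket.gethostname()):
            # Local disk: query it directly; rsh to the local host hangs on SGI.
            st = os.statvfs(path)
            return 1.0 - float(st.f_bavail) / float(st.f_blocks)
        # Remote disk: rsh + df, then parse the capacity column (e.g. "87%").
        out = subprocess.run(["rsh", host, "df", "-k", path],
                             capture_output=True, text=True, check=True).stdout
        capacity = out.splitlines()[-1].split()[-2]
        return int(capacity.rstrip("%")) / 100.0

    def pick_output_disk(disks):
        """Return the first (host, path) that is still below the 96% limit."""
        for host, path in disks:
            if disk_usage_fraction(host, path) < FULL_FRACTION:
                return host, path
        raise RuntimeError("all output disks are effectively full")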
Symptom: Can't read raw data for re-processing.
Why this happened: Tape media error (it was read too many times?).
Solution: Use the DST with the production banks removed. Some time was spent determining whether drop_banks.uic works appropriately for this purpose. Be careful with the production history bank (it is filled twice).

Symptom: Some production tapes are missing from the vault.
Why this happened: The person in charge of tape management took them out of the FCC vault because they were damaged, but the production team was not informed, so the tapes are still in the production database.
Solution: Some of the tapes have library or trailer copies, so we can recover them. However, for the inclusive streams there are no trailer copies, and we did not have time to reproduce all of these tapes.

Symptom: B59857AM was processed but somehow all of the output DST was lost.
Why this happened: Unknown. Maybe human error?
Solution: Re-process. Since the PADs are okay, make sure that either there are no output PAD files or the output PAD files are invisible to the PAD merging program.

Symptom: Can't connect to b0dstr.fnal.gov to fetch new database constants.
Why this happened: xcdf15 (an X terminal?) somehow intercepted the traffic to b0dstr. Maybe a duplicate IP?
Solution: Turning xcdf15 off solved it.

Symptom: Can't write the bookkeeping file.
Why this happened: NFS problem.
Solution: Ask the computer system manager to check/reboot/clean up the NFS server.

Symptom: Wrong entries in raw_1c.tape. B74606: only AM-AQ is present; AA-AL is missing. B74817: AD-AL has run_type *TAG with the PHYSICS_630_7_L3TAG_ALP_UP table, while AA-AC has run_type *NO_TRIG_TABL and no trigger table.
Why this happened: Problems in the data logger?
Solution: Manually add them into raw_1c.tape for processing.

Symptom: DST/PAD tape damage.
Why this happened: Media error, or a bad drive munches the tape when reading it.
Solution: Case by case.

Symptom: Unknown trigger in the data.
Why this happened: The trigger table database was not updated correctly online.
Solution:

Symptom: B58956AR.RAW was listed as 0 records, but 2547 events showed up in production. As a result, the run was merged without this file.
Why this happened: This is the last file of the run. It seems the data logger crashed and failed to write the record for this file.
Solution: Re-merge.

Symptom: Staging couldn't start and the whole production job hung.
Why this happened: There were no tapes left for copying.
Solution: Ask the people in charge to initialize more tapes.

Symptom: Output disk full.
Why this happened: The stager has to copy the reconstructed data of a run to tape together. Each run contains many subfiles, and if one of the subfiles was not processed successfully, the output data for that run cannot be copied to tape and stays on disk.
Solution: Give those subfiles the highest priority in production (a sketch of such a priority rule follows this group of entries).

Symptom: Cannot update the stager bookkeeping files.
Why this happened: The disk where the bookkeeping files live was full. Each staging job produces log files named after the tapes, and these log files can eat a lot of disk space.
Solution: Compress the older log files.

Symptom: Unsuccessful data writing to tape.
Why this happened: Dirty tape drive.
Solution: Clean the tape drive or change to another drive.

Symptom: Bookkeeping files were incorrectly modified when editing.
Why this happened: The bookkeeping files became so large that one could not safely use an editor to change their contents.
Solution: Use standard Unix commands like 'more', 'less', 'cat', 'fgrep', etc. to inspect and modify the bookkeeping files instead of an editor.
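Regarding the "Output disk full" entry above: since the stager copies a run to tape only when every subfile of that run has been processed, a single unprocessed subfile pins the whole run's output on disk. The Python sketch below is one hedged way such a priority rule could be expressed; the data structures and names are invented for illustration and do not reflect the actual production bookkeeping.

    def order_work_queue(pending, processed_by_run, files_per_run):
        """Order pending subfiles so that the ones which complete a nearly
        finished run come first; finishing such a run lets the stager copy
        it to tape and free its output disk space.

        pending          : list of (run, subfile) still to be reconstructed
        processed_by_run : dict mapping run -> subfiles already processed
        files_per_run    : dict mapping run -> total subfiles in the run
        """
        def missing(item):
            run, _ = item
            return files_per_run[run] - processed_by_run.get(run, 0)

        return sorted(pending, key=missing)

    # Example: run 59857 needs only one more subfile, so it jumps ahead.
    queue = order_work_queue(
        pending=[(60001, "AA"), (59857, "AM")],
        processed_by_run={59857: 9, 60001: 2},
        files_per_run={59857: 10, 60001: 12},
    )
    # queue == [(59857, "AM"), (60001, "AA")]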
Symptom: Failed to update the stager bookkeeping files without giving a warning.
Why this happened: There was a power outage while a staging job was trying to update the bookkeeping files.
Solution: Manually update the bookkeeping files.

Symptom: Some data were ready to copy but staging never started.
Why this happened: When a stager job copies data to tape, it first changes the dates of those data files to the future. Since our script only copies data whose dates are older than the current time, files being copied are not picked up by the next staging job. The data files are deleted if copying succeeds, or their dates are reset to the current time with the "touch" command if copying fails. Sometimes a staging job failed to "touch" those data files, for special reasons such as a power outage or a full disk, and the future-dated files were then never picked up again.
Solution: Touch the data files manually (a sketch of this timestamp scheme follows this group of entries).

Adjustment: CDF software version upgrade, but production should stay with the original version.
Why this happened: The IBM flavor of the production executable was built on fncka, whose software changes along with CDF software upgrades.
Solution: Build production executables against a special version, s_production, which has frozen CDF software but may contain newer code for farm use (e.g., interfaces to CPS).

Adjustment: SVX' alignment constants changed.
Why this happened: The SVX' group obtained better alignment constants.
Solution: Change the production talk-to to use the old constants for consistency.

Adjustment: We got more nodes on the IBM farm.
Why this happened: Farm node re-allocation.
Solution: Modify the CPS-related configuration files.

Adjustment: CPS upgraded to v2_9b.
Why this happened: Well...
Solution: The production script was pinned to a specific (older) version. The CPS experts think it is okay to use the new version, so the production script is modified accordingly.

Adjustment: Collaboration request to add a new trigger to an existing stream.
Why this happened: Improvement of the physics triggers.
Solution: Modify the splitting UIC accordingly.

Adjustment: Collaboration request to add new output banks.
Why this happened: There are new sub-detectors (small-angle detectors).
Solution: Modify the output UIC of reconstruction and PAD concatenation accordingly.

Adjustment: Change the whole set of triggers.
Why this happened: Case 1: the accelerator makes a 630 GeV run. Case 2: a broken CTC wire (CTC/2 run).
Solution: Make a new set of output/split UIC for reconstruction and PAD merging accordingly. Note that there is a set of 1800 GeV runs taken with the 630 trigger tables as a test.

Adjustment: Some runs were processed with the wrong run constants.
Why this happened: The database was not updated.
Solution: Remove those runs from the stager bookkeeping files and reprocess them.

Adjustment: New triggers in the stream.
Why this happened: Study more physics (K* gamma, Roman pot, central dijets, ...).
Solution: Modify the staging script.

Adjustment: Stream C processing.
Why this happened: The B group finally decided to process these events.
Solution: Modify the staging script to produce a Stream C bookkeeping file.
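The future-date scheme described in the "data were ready to copy but staging never started" entry is essentially a lock implemented with file timestamps. Below is a minimal Python sketch of that protocol as described above; the one-day offset and the copy_to_tape callback are assumptions, and the manual fix corresponds to running the final touch step by hand.

    import os
    import time

    ONE_DAY = 24 * 3600

    def eligible_for_staging(path, now=None):
        """Only files whose modification time is older than 'now' are picked
        up, so a file dated into the future is invisible to other staging jobs."""
        now = time.time() if now is None else now
        return os.path.getmtime(path) < now

    def stage_file(path, copy_to_tape):
        """Copy one data file to tape using the future-date locking convention."""
        # 1. Date the file into the future so concurrent staging jobs skip it.
        future = time.time() + ONE_DAY
        os.utime(path, (future, future))
        try:
            copy_to_tape(path)     # illustrative tape-copy callback
        except Exception:
            # Copy failed: "touch" the file back to the current time so the
            # next staging job sees it again.  If this step itself is lost
            # (power outage, full disk), the file stays future-dated and must
            # be touched manually.
            os.utime(path, None)
            raise
        else:
            # Copy succeeded: the data file is removed from the output disk.
            os.remove(path)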