DRAFT   DRAFT   DRAFT

$Date: 1997/07/30 19:01:37 $
Last editor: $Author: bakken $
$Revision: 1.21 $

        D0 RECONSTRUCTION INPUT PIPELINE SYSTEM REQUIREMENTS

        M. Diesburg    S. Fuess    L. Lueking    J. Bakken
        K. Genser    D. Petravick    R. Rechenmacher    K. Ruthsmandorfer

This note enumerates system requirements on the Reconstruction Input
Pipeline (RIP) for the Run II D0 experiment. Overall, D0 simultaneously
requires an average I/O capacity of 32 MB/s (16 MB/s in and 16 MB/s out)
from the archiving system. Peak rates are approximately double the
average rates.

1.0 Definitions
===============

1.1  The RIP Project controls the movement of data from local D0 disks
     to the archiving system in Feynman Computer Center (FCC), where it
     is recorded and cached for later use. The data must be archived
     and cataloged in the appropriate databases, in such a way that it
     can enter the subsequent stages of data processing.

1.2  The archiving system is defined to
       (1) start with the salient network hardware at D0,
       (2) include all network connectivity,
       (3) include all relevant mass storage equipment at FCC, and
       (4) end with the salient hardware of the data output requester.

1.3  The archiving system is part of D0's DAQ and Reconstruction Farm
     and must guarantee the services specified later in this document.

1.4  Conceptually, the archiving system looks like:

     Input Data From D0          __________________       Output Data
                                |                  |
     Raw Data ==== 13 MB/s ===> |                  |
                                |       FCC        |===> 13 MB/s == To Recon Farms
     ExpressLine =  2 MB/s ===> | Archiving System |
                                |                  |===>  3 MB/s == To Other uses
     Other =======  1 MB/s ===> |                  |
                                 ------------------

1.5  It is TBD whether the archiving system is the final repository for
     D0's data, or whether it is a circular buffer for data awaiting
     reconstruction. If it is a circular buffer, its size (TBD) could
     potentially be very large, since there is no guarantee from D0
     that reconstruction will keep up with data taking.
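The rates in the diagram in 1.4 can be tallied as a quick consistency
check against the 32 MB/s figure quoted in the introduction. The
following is a minimal Python sketch only. The names are illustrative,
and modelling "approximately double" as a factor of exactly 2 is an
assumption, not a requirement stated in this note.

```python
# Stream rates in MB/s, copied from the diagram in 1.4.
INPUT_RATES = {"raw": 13, "expressline": 2, "other": 1}
OUTPUT_RATES = {"recon_farms": 13, "other_uses": 3}

# Assumption: "approximately double" is modelled as exactly 2x.
PEAK_FACTOR = 2


def io_budget(inputs, outputs, peak_factor=PEAK_FACTOR):
    """Sum the input and output stream rates and derive the peak figure."""
    total_in = sum(inputs.values())
    total_out = sum(outputs.values())
    average = total_in + total_out
    return {
        "in": total_in,
        "out": total_out,
        "average": average,
        "peak": average * peak_factor,
    }


budget = io_budget(INPUT_RATES, OUTPUT_RATES)
print(budget)
```

With the rates above, the input and output sides each sum to 16 MB/s,
giving the 32 MB/s average.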
2.0 Raw Data from D0 to the Archiving System
============================================

2.1  The archiving system must guarantee an average input raw data
     rate, from all sources and streams, of at least 13 MB/s.

     Note: The D0 TDR specifies a trigger rate of 53 Hz and an event
           size of 250 KB/event, yielding an average raw data
           accumulation rate of 13 MB/s.

2.2  The archiving system can assume that the raw data are packed into
     "Partition files" of about 1 GB in size.

2.3  The archiving system must be prepared to accept a total of 150 TB
     of raw data files each year.

2.4  The archiving system can count on D0 to provide a 1 hour staging
     disk buffer (glitch buffer) to cover some of its own transient
     unavailability.

     Note: D0 plans on implementing a single 100 GB glitch buffer for
           each of its host computers. This would provide about a
           2.25 hour glitch space, but with overheads, D0 local file
           demands, etc., it is more likely to be a 1 hour buffer
           space.

     Comment from K. Genser: I would prefer an 8 hour glitch buffer and
           a 0.05% system availability requirement rather than a
           1 hour buffer and 1% availability. 1-2 hours is too small.

2.5  The archiving system must sustain the average rate under all these
     conditions:

     (1) When the data originates from 1 D0 host computer sending
         1 stream of partition files.

         Note: This is the preferred D0 solution, although it is not
               an absolute requirement.

         Comment from K. Genser: 1 stream is achievable, but is it
               really practical?

     (2) When the data originates from up to 3 D0 host computers, each
         sending 1 stream of partition files.

     (3) When all the data originates from 1-3 D0 host computers
         sending up to 32 streams of partition files.

         Comment from K. Genser: 32 streams should not be considered.

3.0 "Expressline" Analysis Files from D0 to the Archiving System
================================================================

3.1  The archiving system must guarantee an average input Expressline
     data rate of at least 2 MB/s.
     Note: This rate is limited by the available D0 CPU, but 2 MB/s is
           a reasonable expectation.

3.2  The archiving system can assume that the Expressline data are
     packed into "Partition files" of about 1.3 GB in size.

3.3  The archiving system must be prepared to accept a total of 21 TB
     of Expressline data files each year.

3.4  The archiving system must assume that the Expressline data will be
     sent to the archiving system over 1 stream.

4.0 "Other" Data Files from D0 to the Archiving System
======================================================

4.1  The archiving system must be prepared to accept hundreds of small
     files that D0 will send to it during each fill.

     Note: The exact list/number/size is TBD.

     Note: This "other" data tends to be produced between runs, which
           may be as far apart as 4 hours.

     Comments from K. Genser: Do you really need to write all the small
           files to HPSS? What is "small"? What is typical? How many
           small files are there? In what order will the small files
           be read back?

4.2  The archiving system must guarantee an average input "Other" data
     rate of at least 1 MB/s.

4.3  The archiving system must be prepared to accept a total of 11 TB
     of "Other" data files each year.

4.4  The archiving system must assume that the "Other" data will be
     sent to the archiving system over TBD streams.

5.0 General/Common Input Data Properties
========================================

5.1  The archiving system must be prepared to accept independent or
     concurrent stream(s) of input files from any D0 host computers.

     Note: These multiple streams imply that several files on the
           archiving system side are simultaneously open.

5.2  The archiving system must assume the average rates will be
     constant over the fill unless otherwise noted.

     Note: D0 will adjust its trigger to maintain a nearly constant
           data rate (in MB/s) for the duration of the beam store.

5.3  The archiving system must measure average rates using overall
     clock time. All overheads must be included in the calculations.
     Note: Some of the overheads in the current HPSS architecture have
           been measured and are very large. See Table 2.

5.4  The archiving system must be able to empty an "almost full" glitch
     buffer within 4 times its capacity in hours, without impeding D0
     from taking data at full rates.

     Note: This means an 8 hour glitch buffer will be emptied within
           32 hours.

     Note: This emptying of the glitch buffer also sets demands on
           average and sustained peak rates, as described in the next
           item.

5.5  The archiving system must have a peak sink rate higher than the
     average rate to allow for
       (1) overhead delays in creating a file in the archiving system,
       (2) overhead delays on the local system in effectively
           overlapping file reads and writes to the archiving system,
       (3) pumping a full glitch buffer empty, and
       (4) a practical software implementation.

     Note: Assuming a file size of 1024 MB, a 13 MB/s average rate
           requires one file every 79 s. If

             T = overhead time to create a file (s/file),
             L = local system overhead delays (%),
             G = glitch buffer size (h),
             E = time in which the glitch buffer must be emptied (h),
             P = practical implementation factor (%),
             R = nominal rate (MB/s),
             F = file size (MB/file),

           then the glitch buffer holds

             S = G * 3600 * R   (MB)

           and contains N = S/F files. The peak sink rate, in MB/s,
           into the archiving system is then

             { F / (F/R - T)  +  S / (E*3600 - T*N) } * P * L.

           For R = 13, E = 4*G, F = 1 GB, and P*L = 150%:

             Parameters    S (GB)    Peak Rate (MB/s)
             ==========    ======    ===========================
             T=21, G=8      366      { 17.7 + 3.5 } * 1.5 = 32
             T=21, G=1       46      { 17.7 + 3.5 } * 1.5 = 32
             T=0,  G=8      366      { 13   + 3.3 } * 1.5 = 24
             T=0,  G=1       46      { 13   + 3.3 } * 1.5 = 24

5.6  The archiving system can assume that the maximum size of any file
     will be less than 2 GB.

5.7  The archiving system must not assume that any files will be
     compressible.
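The peak sink rate note in 5.5 can be checked numerically. The Python
sketch below is illustrative only: the function name and argument names
are assumptions, the combined factor P*L is passed as a single number
(1.5 for 150%), and 1 GB is taken as 1024 MB as in the note.

```python
def peak_sink_rate(R, F, T, G, E, PL):
    """Evaluate the peak sink rate expression from the note in 5.5.

    R  nominal rate (MB/s)          F  file size (MB/file)
    T  file create overhead (s)     G  glitch buffer size (h)
    E  time to empty buffer (h)     PL combined P*L factor (1.5 = 150%)
    Returns (peak rate in MB/s, glitch buffer size S in MB).
    """
    S = G * 3600 * R                 # data held by the glitch buffer, MB
    N = S / F                        # number of files in the buffer
    steady = F / (F / R - T)         # keep up with live data despite T
    drain = S / (E * 3600 - T * N)   # empty the buffer within E hours
    return (steady + drain) * PL, S


# The four parameter sets tabulated in 5.5 (E = 4*G throughout).
for T, G in [(21, 8), (21, 1), (0, 8), (0, 1)]:
    rate, S = peak_sink_rate(R=13, F=1024, T=T, G=G, E=4 * G, PL=1.5)
    print(f"T={T:2d}, G={G}: S = {S / 1024:.0f} GB, peak = {rate:.0f} MB/s")
```

Rounded to whole numbers, this reproduces the S and peak rate columns
of the table in 5.5.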
5.8  The archiving system can assume the maximum number of files in the
     archiving system at any one time is TBD files.

5.9  The files in the archiving system must be named.

     Note: The D0 file names tend to be of the form:

                    (stream)(run)(partition)(format)(version)
           # chars:    3     6       3         3        2      = 17

5.10 The files in the archiving system cannot be appended to.

6.0 Availability
================

6.1  The archiving system can schedule downtime only after negotiations
     with D0.

6.2  The data taking portion of the archiving system can schedule
     downtime only during accelerator downtime.

6.3  The archiving system will be declared available when it does not
     impede D0's ability to take data, reconstruct data or otherwise
     conduct its business.

6.4  The data taking portion of the archiving system has to be 99%
     available when D0 is taking collider data.

     Note: 99% is the historical D0 number.

     Comment from K. Genser: 99% availability roughly means a maximum
           of 7 hours downtime per month. In the case of a big
           failure, it will take a minimum of 1/2 day to get a vendor
           in to look at the problem. Do not know what to do about
           this - redundant robot?

     Comment from K. Genser: I would prefer an 8 hour glitch buffer and
           a 0.05% system availability requirement rather than a
           1 hour buffer and a 1% availability. 1-2 hours is too
           small.

6.5  The data taking portions of the archiving system must be at least
     90% available when D0 is not taking collider data.

     Note: D0 will be accumulating cosmic ray and calibration data or
           performing other tests during accelerator downtime. 90% is
           the historical D0 number.

6.6  The data taking portions of the archiving system will be
     considered unavailable when the local D0 glitch buffering has
     filled.

6.7  The data taking portions of the archiving system will be
     considered available when there is space in the local D0 glitch
     buffers but the archiving system is temporarily down.
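The availability figures in 6.4 and 6.5 and the glitch buffer figures
in 2.4 reduce to simple arithmetic. The Python sketch below is
illustrative only; the function names, the 30-day month, and the
1 GB = 1024 MB convention are assumptions.

```python
def allowed_downtime_hours(availability, period_hours):
    """Downtime budget, in hours, for a fractional availability."""
    return (1.0 - availability) * period_hours


def buffer_hold_hours(buffer_mb, rate_mb_per_s):
    """Hours of data a glitch buffer can absorb at a constant input rate."""
    return buffer_mb / rate_mb_per_s / 3600.0


MONTH_HOURS = 30 * 24  # assumption: a 30-day month

collider = allowed_downtime_hours(0.99, MONTH_HOURS)      # 6.4
non_collider = allowed_downtime_hours(0.90, MONTH_HOURS)  # 6.5
hold = buffer_hold_hours(100 * 1024, 13)                  # 100 GB buffer from 2.4

print(f"99% available: {collider:.1f} h/month of downtime")
print(f"90% available: {non_collider:.1f} h/month of downtime")
print(f"100 GB at 13 MB/s: {hold:.2f} h of buffering")
```

The 99% case gives about 7 hours per month, as in the comment on 6.4.
The 100 GB buffer comes out at a little over 2 hours before overheads,
close to the "about 2.25 hour" figure quoted in 2.4.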
6.8  The reconstruction portion of the archiving system should plan on
     supplying Partition files to the Reconstruction Farms almost
     constantly, 24 hours x 7 days/week. No significant downtime is
     planned.

7.0 Data Loss
=============

7.1  The obvious intent is for the archiving system to not lose any
     data.

7.2  Data loss bookkeeping starts once the data enters the salient
     networking or communications equipment.

7.3  The archiving system must provide a success/failure status code on
     each file transfer.

7.4  The archiving system can request that D0 resend data it has not
     received correctly, provided it has sent a failure status code.

7.5  The archiving system cannot lose more than 0.1% of the data files
     during the transfer of files from the archiver's disk buffer to
     the final media.

7.6  The archiving system cannot lose more than 0.5% of the data files
     each year.

     Comment: What does it mean if an archiver disk is lost?

7.7  The archiving system must ensure any data loss occurs randomly.
     Data loss cannot be localized to a single stream or group of
     files.

     Comment: What does it mean if a complete tape is lost?

7.8  The archiving system can assume that D0 will provide a local
     backup of any small data sets it deems critical to its own
     operations.

7.9  The archiving system may not assume that D0 will back up any raw
     or Expressline files.

8.0 Control Requirements
========================

8.1  The archiving system must be capable of orderly shutdown and
     startup. This is defined to be:
       (1) The ability to disallow further requests to read or write
           files.
       (2) Shutdown, subject to a configurable timeout, once no more
           files are being read or written.
       (3) Provide a general on/off switch rather than have flow
           control governed solely by attempt-success/failure.
       (4) Allow for an end-of-fill flush of partially filled files
           from D0.

8.2  The archiving system must be able to ignore/delete partially sent
     files in the case of a disorderly shutdown such as a dropped
     network connection.
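One way to picture the orderly shutdown of 8.1 together with the
partial-file handling of 8.2 is a small gate object that tracks
in-flight transfers. The Python sketch below is purely illustrative:
the class TransferGate and all of its methods are hypothetical names,
not part of any existing D0 or HPSS interface, and it is
single-threaded for simplicity.

```python
import time


class TransferGate:
    """Hypothetical on/off switch that tracks in-flight file transfers."""

    def __init__(self):
        self.accepting = True    # the general on/off switch of 8.1 (3)
        self.in_flight = set()   # names of files currently being moved

    def begin(self, name):
        # 8.1 (1): refuse new read/write requests once switched off.
        if not self.accepting:
            return False
        self.in_flight.add(name)
        return True

    def end(self, name):
        # Called when a transfer completes, successfully or not.
        self.in_flight.discard(name)

    def shutdown(self, timeout_s, poll_s=0.01):
        """8.1 (2): stop accepting, then wait up to timeout_s to drain.

        Returns the transfers still incomplete at the deadline; per 8.2
        those partially sent files would be ignored or deleted.
        """
        self.accepting = False
        deadline = time.monotonic() + timeout_s
        while self.in_flight and time.monotonic() < deadline:
            time.sleep(poll_s)
        abandoned = set(self.in_flight)
        self.in_flight.clear()
        return abandoned


gate = TransferGate()
gate.begin("file_a")
gate.begin("file_b")
gate.end("file_a")                       # file_a completes normally
abandoned = gate.shutdown(timeout_s=0.05)
refused = gate.begin("file_c")           # arrives after the switch is off
print(abandoned, refused)
```

In this usage, file_b never completes, so it is reported as abandoned
when the timeout expires, and the later request for file_c is refused.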
8.3  The archiving system must provide a dynamic method of restricting
     access, reading and/or writing, to a node, group of users or an
     individual user.

8.4  The archiving system must provide a file permission system to
     hinder accidental file loss.

8.5  The archiving system does not have to provide any undelete
     features.

8.6  The archiving system must be able to manage its resources
     effectively. That is, it must control users asking for arbitrary
     files and miscellaneous services so they do not impact the
     guaranteed rates it must provide.

8.7  The archiving system must provide activity reporting information,
     approximately once per minute, on any individual and named file,
     that is sufficient to understand:
       (1) the flow of the file through the internals of the archiving
           system,
       (2) whether the file is on the final storage media, and
       (3) its general status.

     Note: Activity reporting does not extend to the level of saying
           file X is P% written to tape; it just has to say
           "on tape x" or "on disk".

9.0 Data from the Archiving System to the First-pass Reconstruction
===================================================================

9.1  The archiving system must be prepared to furnish the
     Reconstruction Farms with specified Partition files approximately
     one week after the data is taken.

     Note: The Reconstruction Farms are dedicated and general purpose
           "peaking capacity" farms located in FCC.

9.2  The archiving system can count on D0 to provide a list specifying
     which Partition files the Reconstruction Farms will need, 1 day
     before they are used.

9.3  The archiving system can assume that, within a run, D0 will
     request Partition files in the same order they were written to the
     archiving system.

     Note: In general, D0 will request Partition files from the
           archiving system in roughly the order they were sent into
           the archiving system. Please note that 9.3 talks only about
           Partition files within a run.
9.4  The archiving system must guarantee an average output data rate to
     the Reconstruction Farms of at least 13 MB/s.

9.5  The archiving system must have a peak source rate of at least TBD
     MB/s. This number is roughly double the average rate.

     Note: Same rationale as in 5.5.

10.0 Data from the Archiving System to "Other" Services
=======================================================

10.1 The "other" D0 services requiring access to the raw data are its
     calibration and alignment activities. These tasks need many, but
     not all, of the raw Partition files.

10.2 The archiving system must guarantee an average output rate of at
     least 3 MB/s for these "other" services.

10.3 The archiving system must be prepared to send Partition files to
     the D0 offline analysis system as requested by authorized users.

     Note: D0 does not require raw data to be sent back to the counting
           room.

10.4 These "other" services can fully use the resources normally used
     by the Data Acquisition and Reconstruction when they would
     otherwise be unused.

11.0 Data from the Archiving System to Re-reconstruction
========================================================

11.1 TBD. Potentially another 13 MB/s out of the archiving system.

11.2 Re-reconstruction is an issue only if the archiving system ends up
     as the permanent repository of the raw data.

12.0 Still to be pondered....
=============================

12.1 Whether the Analysis Archive is distinct from this archiving
     system is TBD.

12.2 Where reconstruction products go is TBD.

12.3 Requirements on the reading of Expressline data are TBD.

12.4 Requirements on the Reconstruction Farms are TBD (can be lifted
     from the ___________ document...)
Appendices
==========

---------------------------------------------------------------------------
Simultaneous Data Flow Requirements

Data Type                         Input/   Rate     Size    Sources/
                                  Output   (MB/s)   (GB)    Destinations
================================  ======   ======   =====   ================
Raw data in partition files       IN       13       1       1-3 Sources
Expressline reconstruction        IN       2        1.3     1 Source
Other input data                  IN       1        Small   TBD Sources
Raw data to reconstruction farms  OUT      13       1       TBD Destinations
Other, includes user accesses     OUT      3        TBD     TBD Destinations

Reconstructed Data                IN       15       TBD     same robot?
Analysis Fetches                  OUT      TBD      TBD
Re-reconstruction                 OUT      TBD      TBD
Re-reconstruction                 IN       TBD      TBD     same robot?

Table 1
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Measured Per-File Overheads in the HPSS Archiving System

Operation              Average    Max      Min
                       (s)        (s)      (s)
===================    =======    =====    =====
Create                 21.5       32.6     16.5
Read                   29.8       48.4     21.7
Delete (total time)     8.3       11.3      6.3

Measured on July 8, 1997 using FMSS v1_2e

Table 2
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Current Physical Network Connectivity Between D0 and FCC

From demar@fnal.gov Fri Aug 1 10:21:53 1997

There are a total of 12 pair of multimode fibers out to D0, 6 pair that
stop off at the '0' service bldgs (B0, C0, ...) and 6 pair that are a
continuous run back to A3/A0. Three pair are in use out to D0, and one
pair is in use out to C0. No single mode at all...

Table 3
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Current HPSS Capabilities

An approximately 51 MB/s SINGLE transfer read rate was observed in one
HPSS site (not FNAL) for a 6 way tape stripe. The system should scale
with the hardware used; there is no maximum number.
The write rate was 32 MB/s for the same configuration. Here it scaled
"slower". Aggregate rate should be much easier to achieve than this
single rate.

The tests were done using 2F30, 8 43P's and 8 3950 tape drives in one
3494 library, with one tape drive per 43P PowerPC machine.

Mass Storage Group should acquire a j50 machine in early August which
will be used as the HPSS name server.

Mass Storage Group should acquire 3 F50 machines on a September-October
time scale which will be used as the HPSS movers.
---------------------------------------------------------------------------

---------------------------------------------------------------------------
DCD General Comments

1. Guaranteeing rates means that you have to be on your own network.
   The complicated allocation schemes people talk about are not ready
   and do not work.

2. The rates can easily be handled on a single mode fiber. Multimode
   won't work. [Several pairs of multimode fibers might work with
   additional software effort.]

3. D0 does not have a single mode fiber (CDF does) and will need to
   install a single mode fiber to achieve the necessary bandwidth to
   FCC.
   a. This needs to be approved at the director's level (twice) and
      then by DOE. This has been in the plan for some time, but has not
      been implemented, approved or funded yet.
   b. Keith is writing the TDR to request this.

4. The cost of just laying the fiber (no network equipment):
   a. FCC to D0 clockwise is the cheapest solution = $141K
   b. Keith's preferred solution (redundancy+safety) is FCC to D0
      counterclockwise = $237K

5. After approval, it takes 4 months to install fiber.

6. Lasers will have to be used on each end of the fiber - this
   represents a safety issue that will have to be addressed.
   Unfortunately, OSHA does not have any rules governing this yet and
   any new rules could impact the cost.

7. The "commodity" network solution appears to be Gigabit Ethernet.
---------------------------------------------------------------------------

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Jon A. Bakken    bakken@fnal.gov    (630) 840-4790