100 BULLET ANALYSIS

One of the outcomes of the April 98 HPSS workshop at Fermilab was an attempt by IBM to understand what Fermilab needed out of a mass storage system and how, or whether, HPSS could meet the requirements. Numerous meetings took place in which IBM people (Otis Graf + Harry Hulen), CD people (primarily Don P), and CDF/D0 people tried to specify what the Run II requirements are. The report from these meetings is "Functional Requirements for the Fermilab Run II Mass Storage System" and is available at http://www5.clearlake.ibm.com:6001/Fermilab/FermiReq5.html. The work is by all the people who participated, but the actual writing was done by Otis Graf at IBM; because of this, certain HPSS biases and non-Fermilab requirements remain.

A list of "shalls/musts" was extracted from the document - this resulted in the "100 Bullet List". The goal was to see whether the report could be distilled into a list that could be prioritized. The numbers in parentheses below correspond to the order in which each "shall" was found in the original report. Since 100 items is a large number to consider individually, and to match other attempts at synthesizing the critical requirements for a mass storage system, the "100 Bullet List" was then reordered into 10 broad categories plus an "Unclassified" group, and this is what you will find below.

  1. Retrieve and Specify Tape Contents
  2. Eject to Shelf
  3. Read Only
  4. Import
  5. Open Format
  6. Resource Management
  7. Availability
  8. Fault Tolerance
  9. Administration and Monitoring
  10. Data Migration
  11. Unclassified

There are 2 clear IBM/HPSS biases that should be noted in the list; others probably exist as well:

  • Requiring files to not span tapes would be desirable for our applications, but VERY difficult for HPSS to meet.
  • GUI interfaces/displays should be low priority; CLI interfaces are the main interface and are needed at startup.

This section attempts to address these same requirements from Enstore's perspective. In the difficulty column, "None" means that the requirement is either inherent to Enstore's design or already implemented in the prototype, and requires little or no extra work.

    Category and Requirement Enstore Remarks Difficulty
    Retrieve and Specify Tape Contents
    (2) Data files shall be located on tape in such a way as to optimize efficient access to the files. Enstore puts files on tape in the order the files were submitted using a file family and file family width scheme. The experiments can control file placement on tape within these limits. None
    (10) The MSS shall allow data files to be placed on tape media in such a way that the files can be efficiently accessed and retrieved from the storage system. Same as (2)
    (13) The MSS shall allow MSS clients to identify groups of files that are to be stored on a common set of physical tape volumes. The MSS shall guarantee that physical tape volumes only contain files identified to be part of the same grouping. Same as (2)
    (14) The MSS shall ensure that tape media is efficiently utilized. Files should be appended to tapes in such a way that all the media is used and tape is written to physical end of media. Enstore expects ftt to provide correct information from the media devices and will use this information to fill tapes. It will not attempt to span files across tapes and as such will never completely fill tapes - there will always be less space left on the tape than the nominal filesize. Easy
    (15) Files shall span tapes (if necessary) in order to fill an allocated tape (physical volume) in a storage class before starting a new tape. Enstore does not plan to span files across tapes. Not planned
    (16) Each tape physical volume shall only contain files belonging to a single file family. File families provide this feature. None
    (17) There shall be up to 1000 (TBD) different file families. There are no limits, within reason - one can make more file families than there will be tapes in Run II. None
    (18) A user with appropriate authorization shall be able to create a new file family. PNFS allows this None
    (19) A file family shall belong to one or more user groups. PNFS has a standard Unix file protection/access scheme. None
    (21) A file family shall have the property of "width". The MSS shall, given sufficient physical resources, be able to concurrently mount the number of tapes equal to the width of the file family, and shall be able to write files to those tapes in parallel. File families have the feature of width. None
    (22) The width of a file family shall not exceed the number of physical tape drives in the storage system. There is no current automatic tool that guarantees this in Enstore, but it is straightforward to develop. Enstore will use the drives allocated to a family. Easy
    (23) The user or application shall be able to specify the family name or ID at the time the file is created. Pnfs and Encp allow this. None
    (26) It will be useful to have namespace objects that are mapped to physical tape volumes. This is a straightforward development. Easy
    (31) Tools shall be available so that users can determine the mapping of files to tapes. These tools shall generate the following reports and lists: List of all files on a tape (or set of tapes) and location of files on the tape; Indication of whether file is internal or external to tape library. Oldest and Youngest file on a tape. These tools can be developed. Medium
    (38) Tools shall be available so as to identify tape volumes by file, file family, creation/change dates of files, and date of last tape mount. These tools shall allow users to identify tapes that are candidates for removal from the tape library. Users shall be able to generate reports from MSS metadata, as follows: List all of the tapes belonging to a File Family and date of most recent mount of the tape. List all of the tapes within a File Family such that each file is older than a specified date. Location of a file (or set of files): tape identifier and whether internal or external to the tape library. List of tape volumes, sorted by date of the most recent mount of the tape. Indicate whether tape is internal or external to library. Same development as (31)
    (39) The report shall indicate MSS behavior when an access attempt is made of a file on the tape: (1) request operator mount, or (2) return warning message to user. Same development as (31)
    (59) Users shall be able to use the report and list generation tools described in Section 2. Nothing in the design prevents users from having these capabilities. None
    (85) An application or user shall be able to determine the family name and ID of a file through the FTP and API interfaces. Pnfs allows this - specific APIs will need to be developed. Easy
    (86) A user can create reports on file, family, and volume information such as: The family name, COS and volume of a file. All of the files in a file family, All of the tape volumes in a file family. Same development as (31)
    (87) A user shall be able to create a file family by using a utility program. Fully allowed by pnfs. None
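To illustrate how requirements (14), (15), (16) and (21) above interact, here is a minimal sketch of placing files on tape by file family without spanning tapes. It is not Enstore code: the `Tape` and `FileFamily` classes, the `place` method, the label scheme, and the capacity numbers are all hypothetical, chosen only to show a width limit combined with a no-spanning rule.

```python
from dataclasses import dataclass, field


@dataclass
class Tape:
    label: str
    capacity: int                      # arbitrary units (hypothetical)
    used: int = 0
    files: list = field(default_factory=list)

    def remaining(self) -> int:
        return self.capacity - self.used


@dataclass
class FileFamily:
    name: str
    width: int                         # max tapes open for writing at once
    tape_capacity: int
    open_tapes: list = field(default_factory=list)
    full_tapes: list = field(default_factory=list)
    _next_label: int = 0

    def _new_tape(self) -> Tape:
        self._next_label += 1
        tape = Tape(f"{self.name}-{self._next_label:04d}", self.tape_capacity)
        self.open_tapes.append(tape)
        return tape

    def place(self, filename: str, size: int) -> str:
        """Append a file to one tape of this family; never span tapes."""
        if size > self.tape_capacity:
            raise ValueError("file larger than a tape; spanning is not supported")
        # Prefer an open tape that still has room for the whole file.
        for tape in self.open_tapes:
            if tape.remaining() >= size:
                break
        else:
            # No open tape fits. At the width limit, retire the fullest tape.
            if len(self.open_tapes) >= self.width:
                fullest = min(self.open_tapes, key=Tape.remaining)
                self.open_tapes.remove(fullest)
                self.full_tapes.append(fullest)
            tape = self._new_tape()
        tape.files.append(filename)
        tape.used += size
        return tape.label


family = FileFamily("raw", width=1, tape_capacity=100)
labels = [family.place(name, size)
          for name, size in [("f1", 60), ("f2", 60), ("f3", 30)]]
```

In this sketch the second file does not fit in what is left of the first tape, so that tape is retired with unused space smaller than the file that was refused. This is the trade-off noted in the remarks on (14).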
    Eject to Shelf
    (3) Data files shall be located on tape in such a way as to allow tapes to be removed from the robotic library and placed in a vault or distributed to another organization. Experiments can specify the format and placement of data. Within these bounds, this requirement is met. None
    (32) The utilities for ejecting and returning tapes from/to the MSS tape library (or libraries) shall be capable of automatically running in the background to the core MSS servers as a batch job, and shall not interfere with the normal operation of the MSS and tape robotic arm. Development of tools is new but straightforward work. Modifications to the media changers will have to be made. Medium
    (33) Ejected tapes could also remain in the tape library, but be assigned to another storage software. Volume clerk allows this. None
    (34) It shall be possible for a user (member of the group(s) that own tape volumes) to specify a list of volumes that the robotic library will eject from the library. OK None
    (35) When selecting a tape to be ejected, a user shall be able to specify the type of behavior that is to occur if a file on the tape is subsequently requested for access. The user shall be able to specify one of the two types of behavior: (a) When a file access is attempted (open for read or write), a message is displayed to the MSS operator that requests that the affected tape be retrieved from the vault and mounted in a tape drive or placed back into the library. (b) The file access fails, and an appropriate error message is returned to the user indicating that the file is not currently available. Straightforward development. This is a simple change in the volume clerk tables to switch library manager for ejected volumes. Easy
    (36) If a file spans tape volumes and at least one of the volumes has been ejected from the library, the MSS shall exhibit one of the behaviors described above. That behavior shall be the same as if ALL the volumes on which the file resides were ejected from the library. Files do not span tapes. Not planned
    (37) It shall be possible for the user to change the file access behavior of an ejected tape after the tape has been removed from the tape library. A tool and user interface shall be available to allow the user to change that part of the MSS metadata that controls the policy of attempting to read a file from an ejected tape. Straightforward development Easy
    (40) An MSS utility program shall allow tapes to be marked for ejection. Straightforward development Easy
    (41) The utility shall also interact with the robot to cause the tapes to be ejected and will mark the MSS file/tape metadata to show that the tapes are not in the robotic library. Same as (40)
    (42) This utility shall run in the background to the MSS core servers, and shall not interfere with the normal operation of the MSS. Same as (40)
    (43) The utility shall take the following options: Maximum number of tape volumes to eject. Same as (40)
    (44) When tapes are ejected from the MSS library, an associated set of intermediate metadata shall also (optionally) be exported as a flat file. Straightforward development Easy
    (45) This metadata file shall provide tape file information such that a utility program could read the files and copy them to a standard Unix file system. File wrappering information provides this. None
    (46) The tape eject utility shall have the option to leave the tape in the library so that it is available to another software application that could read the MSS tape. Volume clerk allows this. None
    (47) The MSS shall be able to exchange (total of eject/export and re-insert/import) up to 200 tapes per day through the robot(s) I/O tray. Really depends on the hardware - Enstore will not get in the way.
    (48) There shall be two scenarios whereby volumes are re-inserted into the tape library. Tape is manually inserted by the operator in response to an open read/write request. Batches of tapes can be re-inserted into the robotic tape library by stacking the cartridges (or cassettes) into the robot I/O tray. Enstore could be used to run operator driven tape drives. Operator interaction would have to be developed. Library manager could interact with OCS, for example. Easy
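The two access behaviors for ejected tapes in (35), and the ability to change them afterwards in (37), can be sketched as a small policy lookup. This is only an illustration under stated assumptions: the `volumes` table, the `EjectPolicy` names, and the functions below are hypothetical and do not reflect the actual volume clerk tables or library manager interfaces.

```python
from enum import Enum


class EjectPolicy(Enum):
    OPERATOR_MOUNT = "operator_mount"  # ask the operator to bring the tape back
    FAIL = "fail"                      # fail the request with an error


class VolumeUnavailable(Exception):
    """Raised when a file lives on an ejected tape whose policy is FAIL."""


# Hypothetical stand-in for the volume metadata: label -> record.
volumes = {
    "VOL001": {"in_library": True, "policy": None},
    "VOL002": {"in_library": True, "policy": None},
}
operator_messages = []  # stand-in for the operator console


def eject(label, policy):
    """Mark a volume as ejected and record the chosen access behavior."""
    volumes[label] = {"in_library": False, "policy": policy}


def set_policy(label, policy):
    """Change the access behavior of an already ejected volume, as in (37)."""
    if volumes[label]["in_library"]:
        raise ValueError(f"{label} is still in the library")
    volumes[label]["policy"] = policy


def request_access(label):
    """Decide what happens when a file on this volume is opened."""
    record = volumes[label]
    if record["in_library"]:
        return "mount"
    if record["policy"] is EjectPolicy.OPERATOR_MOUNT:
        operator_messages.append(f"retrieve {label} from the vault and mount it")
        return "wait_for_operator"
    raise VolumeUnavailable(f"{label} is not currently available")
```

Keeping the behavior as a per-volume field means it can be changed after ejection without touching the tape itself.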
    Read Only
    (27) It shall be possible to optionally mark a tape volume as "Read Only". The Read Only status of a tape shall be maintained in the MSS metadata. Volume clerk provides this capability. Moreover, enstore has a system and a user read-only flag that can be controlled independently. None
    (28) Tapes shall be marked as Read Only under one of the following conditions: All of the files on the tape are older than a specified period of time. A straightforward utility would have to be written to do this. Easy
    (29) Tools shall be available so that tapes can be marked and unmarked as Read Only. Same as (27)
    (30) If a user or application attempts to open for writing a file that is marked as "Read Only", a suitable error message shall be returned to the user by the MSS interface program. Already provided by volume clerk. None
    (53) Tapes that have been ejected (but not exported) for the purpose of reading outside of the MSS should be physically marked as "Read Only" because the MSS metadata still accounts for these tapes and it is assumed that they will eventually be placed back into the MSS library. Already provided by volume clerk. None
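The utility mentioned under (28) could look roughly like the following sketch, which selects tapes whose files are all older than a given age. The function name, the shape of the `tape_files` mapping, and the sample dates are assumptions made for illustration; a real tool would read this information from the MSS metadata.

```python
from datetime import datetime, timedelta


def read_only_candidates(tape_files, now, max_age):
    """Return sorted labels of tapes whose files are all older than max_age.

    tape_files maps a tape label to the list of write times of its files.
    Empty tapes are skipped, since they are still available for writing.
    """
    cutoff = now - max_age
    return sorted(
        label
        for label, write_times in tape_files.items()
        if write_times and all(t < cutoff for t in write_times)
    )


# Hypothetical metadata for three tapes.
now = datetime(1998, 10, 1)
tape_files = {
    "VOL001": [datetime(1998, 5, 1), datetime(1998, 5, 3)],
    "VOL002": [datetime(1998, 5, 1), datetime(1998, 9, 20)],
    "VOL003": [],
}
candidates = read_only_candidates(tape_files, now, timedelta(days=90))
```

With a 90 day limit only `VOL001` qualifies here, because `VOL002` still holds a recent file and `VOL003` is empty.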
    Import
    (55) A utility shall be provided that can write files to tapes and can create a set of intermediate metadata, such that the tapes and metadata can be imported into the Fermilab MSS. External tapes written in an acceptable format should be able to be imported. This is new work beyond what has been done. Medium
    (56) This utility shall be executable external to the MSS, on other processor(s), at other sites or institutions. Part of (55)
    (57) A utility shall be provided that will allow tapes (and associated metadata) to be imported into the MSS. Part of (55)
    (58) An export utility will be provided that will allow tapes to be exported from the MSS. Straightforward development. Easy
    Open Format
    (49) An "open" tape and metadata format shall be provided such that software (external of the MSS software and infrastructure) can read files from ejected tapes, and can write files to tapes that will be imported into the Fermilab MSS. Within current implementation - but more "open" formats will have to be acceptable. Easy
    (50) The open formats shall allow other programs, utilities, and other mass storage systems to read and write tapes that are compatible with the Fermilab MSS. True by definition. None
    (51) A utility shall be provided that can read files from tapes that have been ejected or exported from the MSS robotic tape library system. OK. It is expected that the experiments already can read their own data formats. None
    (52) This utility program shall be able to run outside of and independent of the MSS servers and infrastructure, and would make use of a set of metadata associated with the tapes. Same as (49)
    (54) All tapes written by the Fermilab MSS shall be in the native format (for example, in HPSS format). All tapes will be written in an agreed upon format, probably not HPSS. None
    Resource Management
    (1) The storage equipment will be shared by the experiment groups and the MSS shall provide a priority system so that each group has access to its quota of resources. Provided already by the dynamic movers. None
    (4) Tape drives will be in short supply, and thus tape drive activity shall be optimized. OK, movers try to write/read as much as possible for given tape. Easy
    (5) The number of tape mounts shall be minimized for typical file access patterns. Library manager provides this. None
    (7) Robotic tape libraries (especially the robot arm and storage slots) will be in short supply, and their use shall be optimized. Not clear to me how Enstore can optimize this. Dropped from HPSS requirements? ???
    (11) The MSS shall provide the capability to configure storage resources (drives and media) into different groupings, such that files requiring different storage or administrative characteristics (e.g., high-volume data rates, low latency to access data, group X's tape volumes) can be directed to an appropriate set of storage resources. Dynamic movers provide this. Needs to be tuned and automated. Medium
    (24) The MSS shall provide the capability for files to be written directly to a tape media. The MSS shall also provide the capability for files that were initially written directly to tape to either be read directly from tape or first staged to disk before being read (to perform speed matching and/or file buffering). It shall be possible to assign ownership of a COS to a user group. Movers provide direct tape access - no disk buffer/cache implemented yet. Medium
    (25) It shall also be possible to configure the MSS such that a COS will have ownership of storage resources (tape volumes) and devices (tape drives). Read Only Access for Groups of Tapes (and Files). Movers provide "ownership" of devices. PNFS provides ownership of tape. None
    (60) The Fermilab MSS shall provide an application programming interface (API) so that client software programs can access the storage system servers. If this means clients can attach directly to drives, it is not in the current prototype, but it is an extension that was in the original plan. Lots of details need to be understood for this to work properly. Hard
    (61) This API shall execute on the client processor, shall provide C++ language interfaces, and shall not require MSS infrastructure software (DCE, Encina, SFS, SAMMI) on the client processor. User APIs in the various languages will have to be written Easy
    (62) The client API shall be available in source and binary for the following OS platforms: SGI IRIX, IBM AIX, Digital Unix, Intel Linux, Windows NT, Sun Solaris. Not all platforms have been checked - underlying dependent layers must also work on them. Medium
    (64) For some categories of data, files must be copied to tape as soon as possible. Therefore, it must be possible to allocate a minimum number of tape drives to the process of writing the privileged categories of files to tape. Dynamic mover configuration provides this. None
    (65) When possible, the MSS shall optimize the use of multiple robotic libraries and arms so as to maximize the capacity of the system to perform tape cartridge movement. It shall be possible to queue up requests to move tape volumes. Library manager provides this - a better optimization can be done. Easy
    (66) The MSS software shall have the capability to limit file system requests such that the available resources will not be exhausted. Part of current design - needs review. Easy
    (67) When a request is made to read or write a file to the MSS, the system shall return an estimated time to complete the request. I think this is very hard to do for all but the simplest cases. Tape and drive optimization affect this dramatically. Very Hard
    (68) The MSS shall allow the establishment of quotas for storage resources. These limits shall be established for a single user, user group, and file family. Quota management requires design and implementation. Medium
    (69) Quotas shall include: Number of objects in the namespace (files, filesets, directories, etc.) Same as (68)
    (70) The gatekeeper function shall filter requests to the MSS and throttle use. The gatekeeper shall provide the capability for user level scheduling of storage requests. Library manager can handle this, in principle. Not clear to me how filter gets developed. Throttling takes place elsewhere in servers. Hard
    (71) All user access methods (NFS, FTP, client API, DFS) shall go through the gatekeeper, which shall provide the following system throttling functions: Temporarily suspend access to the MSS in order to allow maintenance activities and/or implementation of additional resources (such as mover nodes, tape drives, etc.) Throttling is provided in encp and pnfs. No need for an extra gatekeeper. None
    (84) The process of loading, labeling, and allocating the blank tape volumes shall be done in the background to the core MSS software and shall not interfere with the other MSS operations. Code needs to be written to allow this to work efficiently. Medium
    Availability
    (6) There will be unattended operations for at least eight hours per day, seven days per week. Goal, unproven as of yet. Substantial error analysis and testing required. Medium+long
    (8) Recording to tape of raw experiment data shall continue for several weeks (up to a month) at a time. This data recording shall be uninterrupted by lower priority file I/O activity or system management tasks (such as metadata backups, software upgrades, etc.) Movers can be pre-allocated. Software should not be limiting factor. None
    (73) The MSS shall be able to support 30 days of continuous data file acquisition and accumulation to tapes in the robotic tape libraries. The 30 day (minimum) time interval shall not be interrupted by maintenance activities or system outages (defined as unable to continue to record data files). Goal, unproven as of yet. Substantial error analysis and testing required. Same work as (6)
    (82) Data Acquisition. The MSS shall be able to receive files of raw experiment data at the peak sustained rate of 35 MB/sec. The data acquisition rate of 35 MB/sec (3 TB per day) shall be sustained for a period of 30 (TBD) days. Goal, dictated by the hardware - software should not be in the way.
    (83) The MSS shall be able to receive a total of 5 TB of data per day, equivalent to 100 tapes. The MSS shall be able to add 700 blank tape volumes per week. Goal, dictated by the hardware - software should not be in the way.
    (95) The MSS system shall also provide the capability to issue alarms under the following conditions: The MSS system falls below a TBD threshold capacity or performance. Certain TBD critical work cannot be completed. Planned work - straightforward development. Medium
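The rates quoted in (82) and (83) can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB), which the source does not state; the function and variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_TB = 1_000_000   # decimal units (assumption)
GB_PER_TB = 1_000


def tb_per_day(rate_mb_per_s):
    """Daily volume in TB for a rate sustained over 24 hours."""
    return rate_mb_per_s * SECONDS_PER_DAY / MB_PER_TB


def mb_per_s(tb_in_one_day):
    """Average rate in MB/sec needed to move this many TB in one day."""
    return tb_in_one_day * MB_PER_TB / SECONDS_PER_DAY


daily_volume = tb_per_day(35)          # (82): 35 MB/sec sustained
implied_tape_gb = 5 * GB_PER_TB / 100  # (83): 5 TB per day on 100 tapes
average_rate = mb_per_s(5)             # (83): rate needed for 5 TB in a day
tapes_per_day = 700 / 7                # (83): 700 blank tapes per week

print(f"{daily_volume:.3f} TB/day, {implied_tape_gb:.0f} GB/tape, "
      f"{average_rate:.1f} MB/sec, {tapes_per_day:.0f} tapes/day")
```

Under these assumptions, 35 MB/sec gives about 3.02 TB per day, consistent with the "3 TB per day" in (82). The 5 TB per day in (83) implies 50 GB per tape and an average of about 57.9 MB/sec, and 700 tapes per week matches the 100 tapes per day.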
    Fault Tolerance
    (72) The failure of a single component shall not cause the system to become unavailable for the high priority data movement tasks. Single points of failure (file clerk, library manager...) need to be robust. Multiple movers need failure/restart procedures. Hard
    (74) The system shall have a fail-over capability such that if a component fails, file access services shall continue at reduced capacity. Enstore can accommodate failed hardware to some extent - making this robust is a significant effort. Hard
    (75) In case a failure or error condition occurs with a fail-over component, a message shall be logged to a report file, but file services shall continue without operator intervention. Straightforward (once (72) and (74) are done) Easy
    (76) When a read/write error occurs with a tape drive, the system shall retry TBD times and shall log error counts. Straightforward development Easy
    (77) If the tape drive still generates errors after the maximum number of retries, the MSS shall take the tape drive off-line and retry the file or tape operation (such as stage, migrate, or repack) with another tape drive. Straightforward development Easy
    (78) Processor nodes that are used for data movement (to/from MSS clients and between storage hierarchies) and to control storage devices (tape drives and disks) shall have the capability to be automatically taken off-line without bringing down other MSS services. Dynamic movers provide this capability - Not automated today. Easy
    (79) When a data movement processor fails, file access services shall continue at reduced capacity, a message shall be logged to a report file, and file services shall continue without operator intervention. Really same work as (78)
    (80) It shall be possible to bring the repaired processor back on-line without disruption to the other MSS services. Same as (78)
    (81) When a disk drive fails, the MSS shall take it off line and mark the drive down. Same as (78), except Enstore doesn't currently manage disks. Easy
    (97) File Transfer Errors To/From MSS Client - Determine if error is due to network, client processor machine, corrupted file, etc. Work needs to be done to make this robust. Medium
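The retry and fail-over behavior described in (76) and (77) can be sketched as follows. This is an illustration, not Enstore's mover logic: `DriveError`, `run_with_failover`, the drive names, and the default of 3 retries (the requirement leaves the count TBD) are all assumptions.

```python
class DriveError(Exception):
    """Hypothetical read/write error reported by a tape drive."""


def run_with_failover(drives, operation, max_retries=3):
    """Try an operation on each drive in turn, retrying before giving up.

    Returns (result, error_counts, offline), where offline lists the
    drives that were taken out of service after exhausting their retries.
    """
    error_counts = {drive: 0 for drive in drives}
    offline = []
    for drive in drives:
        for _ in range(1 + max_retries):   # first attempt plus retries
            try:
                result = operation(drive)
            except DriveError:
                error_counts[drive] += 1   # log the error count, as in (76)
            else:
                return result, error_counts, offline
        offline.append(drive)              # take the drive off-line, as in (77)
    raise DriveError("operation failed on every available drive")


def flaky_read(drive):
    """Simulated operation: drive0 always fails, other drives succeed."""
    if drive == "drive0":
        raise DriveError("simulated media error")
    return f"read ok on {drive}"


result, error_counts, offline = run_with_failover(["drive0", "drive1"], flaky_read)
```

In this run `drive0` accumulates four errors (one attempt plus three retries), is marked off-line, and the operation then succeeds on `drive1`.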
    Administration and Monitoring
    (20) A graphical user interface (GUI) shall be available to give users the ability to list, add, and manage files in file families. Enstore provides CLI interfaces that GUIs could be built on. Lengthy development needs to be done for GUI work. Medium
    (88) Trash Basket - When files are deleted, they shall go into a trash basket. New work. Straightforward implementation for DESY protocol. Metadata purged only when tapes are squeezed. Some work to restore files from the trash basket. Medium
    (89) There shall be a TBD limit to the number of files that can exist in the trash basket for a user or group. Enstore doesn't have a limit. None
    (90) All administrative and system management functions shall be accessible through a GUI and through Unix shell commands (such as Korn shell). Enstore provides CLI interfaces that GUIs could be built on. Lengthy development needs to be done for GUI work. Medium
    (91) Unless otherwise noted, these functions shall operate in the background to the MSS servers, and shall not interfere with the normal operation of the Fermilab MSS. The complete list of functions is TBD. OK, unless otherwise noted. None
    (92) The MSS system monitor shall provide GUI displays of health and status of all components in the system. CLI Health monitor needs further development. GUI means lengthy development time. We have a rudimentary web display now. Medium
    (93) When error or failure conditions occur, the system monitor shall provide visual notification through GUI status indicators. CLI Health monitor needs further development. GUI means lengthy development time. Learning to interface with data center is part of the task and has a learning curve. Medium
    (94) Some critical messages shall also automatically be sent to operators by e-mail. Logger could be extended to do this. Easy
    (96) The MSS Health and Status system shall have a command line (CLI) capability such that a shell script (or software product) can receive the alarms and repair the alarm condition. Same as (92)
    (98) Number of read/write errors for each disk drive. This info is all available; utilities for keeping and presenting it need to be developed. Easy
    (99) Tape drive use. This info is all available; utilities for keeping and presenting it need to be developed. Easy
    (100) Number of read/write errors for each tape drive. This info is all available; utilities for keeping and presenting it need to be developed. Easy
    Data Migration (HSM)
    (12) The MSS shall provide the capability to build storage hierarchies containing a list of storage resource classes together with associated policies that shall be used to control data movement between those resources (e.g., a storage hierarchy containing disk at the top level of the hierarchy with tape at the second, lowest level; migration, purge and stage policies that control when file data is moved between levels and removed from the higher level). Enstore does not attempt to manage all levels of the storage hierarchies. Enstore can be used as a front end to products that do attempt to manage disks. A disk caching/buffer extension to Enstore is planned. Enstore imagines 4 levels of hierarchy: user disk, buffer/cache disk, library tape and shelf tape. Enstore only seeks to manage the last two. Very difficult
    Unclassified
    (9) Evaluation of tape drives and associated media is still ongoing. Therefore, the required tape cartridges (or cassettes), drives, and robotic libraries have not yet been determined. These components will be specified in a later version of the MSS requirements. Enstore is designed to be a layered product. It uses an ftt layer to access tapes and determine error counts and media capacity. New and unusual drives will require implementation in ftt and possibly changes in the mover. Medium
    (63) The MSS shall be scalable so as to provide the anticipated performance required by the Fermilab Run II experiments. Enstore is based on distributed movers and as such is very scalable in data moving. Bottlenecks will happen due to hardware limitations before the software hinders performance. Database contention is possible - this can be addressed by splitting them into several servers, but needs to be studied in detail. Medium