Guide and Ingest Plan Template for Putting Space Physics Data on NDADS

1996 February 27

Kent Hills, Dieter Bilitza, Bobby Candey, Emily Greene

[Aug 7 96 Added "Highlights from Experience", minor edits]
[Feb 27: Add CD 8.3 -> unique filename mapping questions. ]
[ Edits to reflect current improvements to process.]
[May 23: Add NSSDC ID to form and to the summary of the NDADS ingest process.]
[ Add "Brief comments about the data" to the form.]
[May 4: System Datatype names are currently limited to 20 characters.]
[May 1: Additions to Footnote 3 about on-the-fly processing options.]
[April 25: Add note on number of files needed for *new* project.]

TABLE OF CONTENTS
-----------------

   Introduction
   Highlights from experience
   Summary of the NDADS Ingest Process
   Footnotes to the Ingest Plan
   Ingest Plan Outline (copy this section to a separate file and fill out)

INTRODUCTION
------------

This is a guide for acquisition scientists to use when entering space
physics data into the NDADS nearline system. For space physics data, the
organization of data granules and the logic of the retrieval process may be
quite different from the astrophysical cases. The ingest plan should be
viewed as a living document of the process and should be updated to reflect
any changes made to the ingest.

HIGHLIGHTS FROM EXPERIENCE
--------------------------

   o Use EIDs which correspond to the start time. YYDDD_YYDDD format EIDs
     are strongly discouraged due to difficulties in standardization.
   o Be very certain to state specifically whether the CC bit is to be set
     on the file or not.
   o File names are limited to 32.32 characters (32 before the period and
     32 after).
   o The extensive use of superkeys is discouraged. They currently slow
     down retrieval significantly.
   o Use DOCUMENT as an EID with the same datatype as the data which it is
     documenting. If a document covers several datatypes, the document can
     have its own system datatype, but all user datatypes should include
     both the data and the documentation datatypes.
   o Look at sample ingest plans.

SUMMARY OF THE NDADS INGEST PROCESS
-----------------------------------

Depending on the particulars of each dataset, the tasks involved in
ingesting the data onto NDADS may be performed partly by the Acquisition
Scientist (AS) and Acquisition Programmer (AP) and partly by the Ingest
Programmer (IP) and Staff. All of the steps leading to data ingest are
outlined here. For those steps that can be performed by different NSSDC
staff members, each case will be dealt with as appropriate for that
situation.

 1. Acquisition scientist (AS) identifies the desired data set (including
    the NSSDC ID), determines what processing options are needed (for the
    NDADS output for the requester), determines the output organization
    (sizes, filenames, ...), determines what validation is needed, etc.,
    then collects the required information and presents it to the
    acquisition programmer for discussion. An outline of the ingest plan is
    given below with comments.

 2. Acquisition programmer (AP) reviews the information to see that all is
    in order. Acquisition scientist and programmer discuss the ingest
    process and naming schemes.

 3. The acquisition scientist presents the ingest plan to the JSPAG and
    managers for review.

 4. The data, software and documentation are brought online (by the AS, AP
    or ingest personnel).

 5. The number and size of the files are checked and other minimally
    required validation is performed (by the AS, AP or ingest personnel, as
    determined in the ingest plan).

 6. Other verification/validation runs are made and any required
    preliminary processing steps are run (such as making CDF, making time
    averages, etc.) (by the AS, AP or ingest personnel).

 7. Ingest programmer inserts datatype and EID information into the NDADS
    database and initiates actual data ingest onto the NDADS jukeboxes. IP
    verifies ingest and informs AS and AP.

 8. Acquisition programmer verifies ingest and adds dataset to SPyCAT.

 9.
    Acquisition scientist verifies the results via NDADS ARMS and SPyCAT.

10. The acquisition scientist updates the general and project-specific
    holdings files (and sends them to the ingest group) and updates the
    NMD, RSIRS, AIMS listings of NDADS availability. The updated general
    and project-specific holdings files should be submitted to the
    acquisition programmer.

FOOTNOTES TO THE INGEST PLAN
----------------------------

Footnote 1: Datatypes
---------------------

The System Datatypes are the ones used internally when ingesting the data,
while User Datatypes and EIDs are the items used by the requesters. The
User Datatypes can be the same as the System Datatypes and can also point
to groups of system ones. Multiple datatypes can be defined to encompass
overlapping sets of files if the system datatypes are defined as
specifically as possible.

For instance, VEFI has DC and AC files, possibly in both CDF and VMS binary
forms. So we could define system datatypes of VEFI_AC_VMSBIN,
VEFI_DC_VMSBIN, VEFI_AC_CDF, and VEFI_DC_CDF when we ingest the files
originally. We can then define user datatypes of those plus VEFI_VMSBIN (to
get both DC and AC files), VEFI_AC (to return the most preferred format, or
the format specified by the user EID, as in CDF81302), and VEFI (to return
the most common form (DC CDFs, say) or a form selected by a user EID such
as AC_CDF_81302).

Use of DOCUMENT or SOFTWARE as Datatypes is strongly discouraged; make
these EIDs instead. Note the use of underscores to delimit the logical
parts of a datatype name.

The System Datatype names are currently limited to 31 characters; the User
Datatype can use up to 31 characters.

Footnote 2: EID naming schemes
------------------------------

The EID (for Entry ID) is used to indicate uniquely each of the smallest
segments of data that will be retrieved as a unit. For maximum flexibility,
define very specific EIDs for each file at ingest.
DOCUMENT, SOFTWARE and INVENTORY should be standard user EIDs that point to
multiple description, software and inventory list files. The most common
user EID would be just a time, say YYDDD or YYMMDD, that points to files
starting at that time. EIDs should be generated to enable data retrieval by
any organizational scheme that is likely to be of use to a requester.
Although it might be a time, it can be an event, location, direction, etc.
The holdings file will generally just give the format for the construction
of the EIDs, rather than giving the entire list of available EIDs.

The Super-EIDs or superkeys combine multiple EIDs or provide an alternative
naming scheme or synonyms. Note that superkeys currently are defined for
ALL datatypes at once. Be aware that a large number of superkeys
significantly slows down data retrieval. All datatypes should have a
standard superkey of DOCUMENTS being equated to the EID DOCUMENT. See
Footnote 6 for additional discussion of EIDs relating to specialized kinds
of documentation.

Footnote 3: On-the-fly Processing Options
-----------------------------------------

A limited amount of processing can be performed when the files are
retrieved from the jukebox, perhaps converting to ASCII, CDF, or GIF files
from the original ingested files. On-the-fly processing should only be
considered if there will be relatively few requests compared to the dataset
size, and the total number of processed/output files is large, the total
volume of output files is large, or there are a number of file formats. If
requests for these output formats will be common and CPU intensive, the
dataset should be converted in batch mode and reingested instead. The
cutoff point for considering on-the-fly processing will change with
increasing storage and processing capacity and file compression.
Maintaining the conversion code is much more troublesome than doing the
conversion for all the files once.
An additional drawback is that the conversion option is currently available
only in ARMS requests and not through Quasar FTP.

On-the-fly processing is requested in the ARMS system by adding a
Dataformat key to the Subject line. The Acquisition Scientist must provide
the code to do the processing. Make sure the holdings file is clear on the
process.

NOTE that the current implementation of on-the-fly processing is added by
hand, and that there are no current working examples.

Footnote 4: Implied Carriage Control flag (CC Bit)
--------------------------------------------------

Setting the Implied Carriage Control flag (CC Bit) correctly is very
important for files that are being transferred between VMS and non-VMS
(such as UNIX) machines. More accurately, it is not a question of VMS vs
non-VMS systems, but rather of file systems that keep record structure
information vs those that don't. VMS systems contain (generally invisible)
information about the record structure of a file, while UNIX systems do not
hold any record structure information at the file system level. Thus, when
sending data via FTP to a non-VMS system, it might be necessary in the
transfer process to insert end-of-record markers (in whatever style the
target system recognizes, such as CR, LF or CR-LF). This insertion will
only be done during the FTP process (involving both the FTP client and the
target system) if the so-called "implied CC bit" is set. Often the carriage
control is explicit (Fortran CC, etc.) and not implied, in which case there
isn't a problem. The best way to confirm that the bit is correct is to
transfer a sample file from VMS to UNIX and see if it compares correctly
byte-for-byte or see if it works properly with the software.

Usually, the CC bit should be set if you have ASCII data in either
fixed-length or variable-length records (the Unix user can read until a CR
is found). The resulting file length may be longer than the original, due
to the presence of these CR bytes.
There are a few exceptions to this rule. For instance, PostScript files may
be all ASCII but it is unnecessary and error-prone to change end-of-record
characters (and version 2 PS also allows binary enclosures); these should
be treated as Binary files and the CC bit should not be set.

The CC bit should not be set if you have either fixed-length or
variable-length binary records. In addition to confusing end-of-line
characters with valid binary data, the file size and record size will
change. For fixed-length record files, the dataset format description
document or software must tell the Unix user the record length. For
variable-length record files, the Unix user reads the control byte(s) that
are part of the record to find out how long the record is. Adding extra
end-of-line characters will throw off the record and file lengths and they
may be mistaken for data.

The following are some file types that should be treated as Binary and not
have the CC bit set on VMS systems: CDF, HDF, FITS, GIF, JPEG, TIFF, PNG,
PS, EXE.

The following are some file types that generally should be treated as
Text/ASCII and should have the CC bit set on VMS systems: TXT, DOC (unless
Word or WordPerfect type files), README, ASC, TEXT, FOR, F, C, COM (for VMS
but binary for PCs), LOG, LIS, LIST. It is important to check these files
to see if they are truly files you can type to the screen and read as
ASCII.

Footnote 5: Output filenames
----------------------------

The user-requested files will be staged to a [.PROJECT] subdirectory for
the requester to copy by FTP or DECnet copy. It is important to distinguish
the files from various instruments with a clear naming convention. In
addition, if time is at the beginning of the filenames, files of the same
time for several instruments will show up in the directory together and
will have the maximum uniqueness in the PC 8.3 format.
If the filenames are not too long, the project can also be added so the
users will not confuse various projects in their own directory space. The
JSPAG recommends using filenames of the form:

   time_instrument_project_dataform_version.fileformat

The time could be in begin and end time form or just begin time (YYDDD,
YYDDDHHMMSS, YYMMDD, YYMMDDHHMMSS). Instrument and project should be kept
short. Dataform might indicate average (6S or 5M) versus high resolution
data or AC versus DC. The version is recommended since many datasets are
reprocessed at some point; use a format of Vnn where nn=01, 02, etc. in
order. The file format should be .CDF, .FOR, .C, .PS, .GIF and such as
appropriate, with document text files as .TXT, ASCII data files as .ASC and
binary data files as .DAT. It is very important to distinguish text files
from binary ones when the user is transferring them via FTP or wants to
examine them.

The NDADS filename length limit is 32 characters before the period and 32
after, with only one period. Try not to use odd characters that may be
invalid on other platforms. No mixed case, i.e., filenames should be the
same for upper and lower case letters.

Use of the JSPAG file naming convention is highly recommended:

   JSPAG file naming convention:

   EID_INSTRUMENT_SPACECRAFT_DATAFORMAT_VNN.EXTENSION

   where:

   EID is the start time of the file AND the NDADS EID. The preferred
      format is YYDDD, but may be altered if necessary. For example, days
      with multiple files should be YYDDDHH. YYMMDD is also acceptable. It
      is preferred that the stop time NOT be included.
   INSTRUMENT is the instrument name, and should be abbreviated if there is
      a generally accepted abbreviation.
   SPACECRAFT is the spacecraft name, and preferably also the NDADS
      PROJECT. Well known abbreviations are acceptable.
   DATAFORMAT gives the time resolution (10MS, 1HR) and/or any analysis
      type (PEAK, AVG).
   VNN is the version number of the dataset (V01).
   EXTENSION is the data format of the file (ASCII, VMSBIN).

Footnote 6: Documentation
-------------------------

Documentation can consist of one or many files (text files, image files,
sample data in ASCII). Each should have a unique EID and also be available
via a superkey of DOCUMENT. Documents that pertain to all the experiments
on a spacecraft might be retrieved by an EID like SC_DOC that is defined
for each datatype.

Each documentation file should be entered into TRF and keyed with the NSSDC
dataset ID, and further keyed as documentation to accompany software. Then
a copy should be submitted to the controlled digital document library.
Ideally, when there are corrections or additions to the document, the
changes would be made in the master copy and NDADS could retrieve that copy
for staging when requested.

Some kinds of information (data coverage inventories, command histories,
spin vector changes, mode changes, etc.) may be too lengthy to be practical
to include in the normal Documentation. If there is an inventory file
(already provided, or generated directly from the data later as a
value-added feature), it should be identified by an EID of INVENTORY, or if
there are several such files (perhaps one for each of several years), then
each file should have its own EID and INVENTORY would be a super-key that
calls up all of these inventory files. In case the same inventory file
applies to all experiments on the spacecraft, then only one file is needed.
It would be assigned a SYSTEM Datatype, and then the INVENTORY EID for each
of the USER Datatypes for this mission would also point to this file.

If the INVENTORY files follow a reasonably standardized format (none
defined yet), then a future user interface program might, for example, be
able to interact with the user and check the contents of the inventory file
to answer the user's questions, without requiring the user to peruse the
file manually.

CMD_HISTORY is suggested as another standard EID to be used as appropriate.
Acquisition scientists should suggest and define others as needed for best
help to users of their data.

Footnote 7: Software
--------------------

Software is handled analogously to documents, with the use of the superkey
SOFTWARE to retrieve all software-related files. As above, it is
recommended NOT to use SOFTWARE as a datatype, but rather only as an
all-inclusive EID for software for a particular datatype. Individual pieces
of software can also have their own unique EIDs. Software files should
include sufficient user guides and other documentation.

Footnote 8: Bill of Lading (BOL) Recommended Format
---------------------------------------------------

We have developed a standard BOL format for Space Physics. The BOL file
program is currently run by the acquisition programmer and its output is
delivered to the ingest programmer. To use this method, the EIDs must be
start time only. This method is highly recommended.

Footnote 9: Validation requirements
-----------------------------------

Validation and verification have many varied meanings in different
contexts. We should be certain that the files that we ingest are indeed the
files that we expected (i.e., we should verify that the tape that we
expected was actually the tape that was mounted on the drive). Validation
will usually include checking filenames and sizes, verifying that we got
everything the Project sent, and optionally checking time order and maybe
even a more complete verification of the data. The software, file formats,
correct file types, and such should also be checked. Some
negotiation/discussion may be required with a programmer (either Science
Group programmer or NDADS ingest programmer) to decide who will do which
steps of validation and verification. Directions must include the action to
take if data files don't meet the requirements.

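As an illustration only, the basic checks named in Footnote 9 (filenames,
sizes, completeness of the delivery, and time order) could be sketched in
Python as follows. This is not an NSSDC tool: the function names, the report
labels, and any sample filenames are hypothetical. The 32.32 filename limit
is taken from Footnote 5, and the time-order check assumes filenames that
begin with a YYDDD start-time EID followed by an underscore.

```python
import re

MAX_PART = 32  # NDADS filename limit: 32 characters on each side of the period


def name_problems(name):
    """Return a list of filename problems (an empty list means acceptable)."""
    if name.count(".") != 1:
        return ["needs exactly one period"]
    stem, ext = name.split(".")
    problems = []
    if not 1 <= len(stem) <= MAX_PART:
        problems.append("part before the period must be 1-32 characters")
    if not 1 <= len(ext) <= MAX_PART:
        problems.append("part after the period must be 1-32 characters")
    return problems


def start_key(name):
    """Return (YY, DDD) from a leading YYDDD start-time EID, or None if absent."""
    match = re.match(r"(\d{2})(\d{3})_", name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def validate_delivery(expected, received):
    """Compare two {filename: size_in_bytes} mappings; return report lines."""
    report = []
    for name in sorted(expected):
        if name not in received:
            report.append(f"MISSING     {name}")
        elif received[name] != expected[name]:
            report.append(
                f"SIZE        {name}: expected {expected[name]}, got {received[name]}"
            )
    for name in sorted(received):
        if name not in expected:
            report.append(f"UNEXPECTED  {name}")
        for problem in name_problems(name):
            report.append(f"NAME        {name}: {problem}")
    return report


def out_of_time_order(names):
    """Return names whose YYDDD start time precedes the previous dated name."""
    late = []
    previous = None
    for name in names:
        key = start_key(name)
        if key is None:
            continue  # undated files (documents, software) are skipped
        if previous is not None and key < previous:
            late.append(name)
        previous = key
    return late
```

An empty report from validate_delivery means the delivery matched the
expected list. The ingest plan still has to state what action to take for
each kind of report line, as required above.
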
Footnote 10: Holdings Files
---------------------------

There are two levels of "holdings" information: one is a general NDADS
holdings file that should contain a brief paragraph or so to advertise each
project and to point to (possibly multiple) related spacecraft holdings
files; the other is the project level (spacecraft level) holdings file.
Both should be brief summary "advertisements" of what is available. Data
formats, lengthy descriptions, and other lengthy discussions, including
most dataset-specific text, should be put into one or more documentation
files, not in the holdings files. The size and content of the paragraph for
the general NDADS holdings file will vary according to the situation for
each project. For ISEE, for example, it indicates (among other things) that
if the user wants "ISEE 1", then "ISEE1" (no space) should be requested.
Don't forget to say how to retrieve the documentation and software.

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NDADS INGEST PLAN OUTLINE (Fill this out)          Version February 24, 1996.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

This is intended as the starting point for ingest work, to be followed up
by discussion with the NDADS programmer and maybe a science programmer.
Make additions as needed, to help document the ingest operation for later
reference.

--------------------------------------------------------------------------------

Date:
Acquisition Scientist:
Acquisition Programmer:

PROJECT (e.g., DE, Hawkeye):

TOTAL NUMBER OF FILES (*NEW* projects only):
   Estimate for all data sets in this project, for ingest in near future.

NSSDC ID(s) FOR THIS DATA:

BRIEF COMMENTS ABOUT THE DATA:

SYSTEM DATATYPEs (Footnote 1) (maximum 31 characters):
   Consider how you will describe the datatype/EID structure in the
   Holdings file concisely, so that it will be clear to the user.

USER DATATYPEs (can be same as System ones) (maximum 31 characters):

NDADS Entry Identification Codes (EIDs) (Footnote 2):

Alternate EIDs (superkeys):
   DOCUMENTS = DOCUMENT highly recommended

On-the-fly Processing Options (Footnote 3):

Delivery Medium (tapes, FTP, backup/tar, etc.):

Delivery site/method/platform:

Delivery Schedule (Specify date when data will be ready for NDADS ingest):
   Delivery of test data:
   Delivery of production data (with begin and end dates):
   Data flow rate:
   Average and range of file sizes:
   Total number of files and volume expected:
   Expected anomalies in Schedule and special priority requirements:

Data Formats:
   Input Formats (e.g., native, ASCII, CDF, HDF, FITS, VMS variable,
      VMS index, Unix streaming):
   Compression protocol (if any):
   Specify where the detailed input format description can be found:
   Output Formats (e.g., ASCII, binary, CDF; see Footnote 4 about CC flag;
      the output format may also imply Value-Added activities, below):
   VMS File Attributes (if any; try not to) (See Footnote 4):
   Input filename formats:
   Output filename formats (Footnote 5) (e.g., 81302_DE2_VEFI.ASC):
   For CDs specify mapping of 8.3 names to unique filenames:

Documentation Files (Footnote 6):

Software Files (include platform it runs on) (Footnote 7):

Ingest Method (e.g., automated, semi-automated, or BOL file):

Preprocessing (e.g., split files if file size is more than 10-20 MB or so):

Staging area required (size and location, maybe XFILES):
   This is negotiable, but should be filled in with actual values when
   determined.

BOL (Bill Of Lading) for automated ingest (Footnote 8):
   Consider ingesting this as a record of the ingest process.

Validation requirements (Footnote 9) (who will do each?):
   To ensure that data are as expected (correct tape mounted, etc.):
   To check quality of data, time-order, etc.:

Output procedure (for each user request):
   (e.g., place files in project directory in NDADS anonymous area and set
   CC bit if necessary; run conversion and bundling options)
   If you think it matters, specify Fetcher (storage jukebox): either DLT
   (UNIX) or Signet (VMS) and why?

Proprietary Status (i.e., Public or Proprietary (release date)):

Holdings Files (pointers to file locations) (Footnote 10)
   Text for General NDADS holdings file (general project advertisement and
      how to get Project holdings file):
   Text for Project holdings file (very short description of spacecraft
      and instruments and data formats with emphasis on how to retrieve
      the data (EID format, etc.)):

AIM/RSIRS/NMD information (reminder here of Acquisition Scientist function):
   For new data sets: NSSDC ID, Provider & other information, Brief
      description
   For existing data sets: Update, particularly for NDADS availability

Summary of Value-Added activities (brief summary, not details):
   To be done prior to or during ingest:
   Planned for a later step, after ingest into NDADS:

Other Issues or potential problems: