Guide and Ingest Plan Template for Putting Space Physics Data on NDADS

1996 February 27 Kent Hills, Dieter Bilitza, Bobby Candey, Emily Greene

[Aug 7 96 Added "Highlights from Experience", minor edits]
[Feb 27: Add CD 8.3 -> unique filename mapping questions. ]
[        Edits to reflect current improvements to process.]
[May 23: Add NSSDC ID to form and to the summary of the NDADS ingest process.]
[        Add "Brief comments about the data" to the form.]
[May 4: System Datatype names are currently limited to 20 characters.]
[May 1: Additions to Footnote 3 about on-the-fly processing options.]
[April 25: Add note on number of files needed for *new* project.]

TABLE OF CONTENTS
-----------------
	Introduction
	Highlights from experience
	Summary of the NDADS Ingest Process
	Footnotes to the Ingest Plan
	Ingest Plan Outline (copy this section to a separate file and fill out)


INTRODUCTION
------------
This is a guide to acquisition scientists for use in entering space 
physics data into the NDADS nearline system.  For space physics data, 
the organization of data granules and the logic of the retrieval 
process may be quite different from the astrophysical cases.  The 
ingest plan should be viewed as a living document of the process and 
should be updated to reflect any changes made to the ingest.

HIGHLIGHTS FROM EXPERIENCE
--------------------------
o Use EIDs which correspond to the start time.  YYDDD_YYDDD format EIDs 
are strongly discouraged due to difficulties in standardization.
o Be very certain to specifically state whether the CC bit is to be set 
on the file or not.
o File names are limited to 32.32 characters.
o The extensive use of superkeys is discouraged.  They currently slow 
down retrieval significantly.
o Use DOCUMENT as an EID with the same datatype as the data which it is 
documenting.  If a document covers several datatypes, the document can 
have its own system datatype, but all user datatypes should include 
both the data and the documentation datatypes.
o Look at sample ingest plans.

SUMMARY OF THE NDADS INGEST PROCESS
-----------------------------------
Depending on the particulars of each dataset, the tasks involved in 
ingesting the data onto NDADS may be divided among the Acquisition 
Scientist (AS), the Acquisition Programmer (AP), and the Ingest 
Programmer (IP) and staff.  All of the steps leading to 
data ingest will be outlined here.  For those steps that can be 
performed by different NSSDC staff members, each case will be dealt 
with as appropriate for that situation.

1. Acquisition scientist (AS) identifies the desired data set (including the 
NSSDC ID), determines what processing options are needed (for the NDADS output
for the requester), determines the output organization (sizes, filenames, ...),
determines what validation is needed, etc., then collects the required
information and presents it to the acquisition programmer for discussion.  
An outline of the ingest plan is given below with comments.

2.  Acquisition programmer (AP) reviews information to see that all is 
in order.  Acquisition scientist and programmer discuss the ingest 
process and naming schemes.

3. The acquisition scientist presents the ingest plan to the JSPAG and managers 
for review.

4.  The data, software and documentation are brought online (either by 
the AS, AP or ingest personnel).

5. The number and size of the files are checked and other minimally required 
validation is performed (either by the AS, AP or ingest personnel as 
determined in the ingest plan).

6. Other verification/validation runs are made and any required preliminary
processing steps are run (such as making CDF, making time averages, etc.)
(either by the AS, AP or ingest personnel).

7. Ingest programmer inserts datatype and EID information into the NDADS 
database and initiates actual data ingest onto the NDADS jukeboxes.  
IP verifies ingest and informs AS and AP.

8.  Acquisition programmer verifies ingest and adds dataset to SPyCAT.

9.  Acquisition scientist verifies the results via NDADS ARMS and SPyCAT.

10. The acquisition scientist updates the general and project-specific holdings
files (and sends to ingest group) and updates NMD, RSIRS, AIMS listings of
NDADS availability.  The updated general and project-specific holdings 
files should be submitted to the acquisition programmer.


FOOTNOTES TO THE INGEST PLAN
----------------------------

Footnote 1: Datatypes
---------------------
The System Datatypes are the ones used internally when ingesting the data, 
while user Datatypes and EIDs are the items used by the requestors.  
The User Datatypes can be the same as the System Datatypes and can 
also point to groups of system ones.  Multiple datatypes can be 
defined to encompass overlapping sets of files if the system datatypes 
are defined as specifically as possible.  For instance, VEFI has DC and 
AC files, possibly in both CDF and VMS binary forms.  So we could define 
system datatypes of VEFI_AC_VMSBIN, VEFI_DC_VMSBIN, VEFI_AC_CDF, and 
VEFI_DC_CDF when we ingest the files originally.  We can then define 
user datatypes of those plus VEFI_VMSBIN (to get both DC and AC 
files), VEFI_AC (to return the most preferred format or to use the 
user EIDs to specify, as in CDF81302), and VEFI (to return most common 
(DC CDFs, say) or by user EID of AC_CDF_81302).  Use of DOCUMENT 
or SOFTWARE as Datatypes is strongly discouraged; 
make these EIDs.  Note the use of underscores to delimit the 
logical parts of a datatype name.  The System Datatype names are 
currently limited to 31 characters; the User Datatype 
can use up to 31 characters.  
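
As an illustration only, the VEFI example above can be written as a small
lookup table.  The sketch below is in Python and is not the NDADS database
schema: the dictionary layout, the function names, and the choice to expand
VEFI_AC to both AC formats are assumptions made for this example.  The
31-character limit is the one quoted above.

```python
# Hypothetical sketch of user datatypes pointing to groups of system datatypes.
# Names and structure are illustrative; NDADS keeps this in its own database.

MAX_DATATYPE_LEN = 31  # limit quoted in Footnote 1

SYSTEM_DATATYPES = [
    "VEFI_AC_VMSBIN",
    "VEFI_DC_VMSBIN",
    "VEFI_AC_CDF",
    "VEFI_DC_CDF",
]

# Every system datatype is also offered as a user datatype of the same name.
USER_DATATYPES = {name: [name] for name in SYSTEM_DATATYPES}
# Broader user datatypes point to groups of system datatypes.
USER_DATATYPES["VEFI_VMSBIN"] = ["VEFI_AC_VMSBIN", "VEFI_DC_VMSBIN"]
USER_DATATYPES["VEFI_AC"] = ["VEFI_AC_CDF", "VEFI_AC_VMSBIN"]  # assumed grouping


def too_long(names, limit=MAX_DATATYPE_LEN):
    """Return the datatype names that exceed the length limit."""
    return [name for name in names if len(name) > limit]


def resolve(user_datatype):
    """Return the system datatypes behind one user datatype."""
    return USER_DATATYPES[user_datatype]


if __name__ == "__main__":
    print(resolve("VEFI_VMSBIN"))
    print(too_long(USER_DATATYPES))
```

Because the system datatypes are as specific as possible, new user datatypes
can be added later by listing a different group, without reingesting files.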

Footnote 2: EID naming schemes
------------------------------
The EID (for Entry ID) is used to indicate uniquely each of the smallest
segments of data that will be retrieved as a unit.  For maximum flexibility,
define very specific EIDs for each file at ingest.  DOCUMENT, SOFTWARE 
and INVENTORY should be standard user EIDs that point to 
multiple description, software and inventory list files.  The most 
common user EID would be just time, say YYDDD or YYMMDD, that points 
to files starting at that time.  
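
A time-based EID of this kind can be derived directly from the start date of
a file.  The following is a minimal Python sketch; the function names are
hypothetical and are not part of any NDADS software.

```python
from datetime import date


def eid_yyddd(start):
    """Two-digit year followed by the three-digit day of year (YYDDD)."""
    return start.strftime("%y%j")


def eid_yymmdd(start):
    """Two-digit year, month and day of month (YYMMDD)."""
    return start.strftime("%y%m%d")


if __name__ == "__main__":
    start = date(1981, 10, 29)  # day 302 of 1981
    print(eid_yyddd(start))     # 81302
    print(eid_yymmdd(start))    # 811029
```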

EIDs should be generated to enable data retrieval by any organizational scheme
that is likely to be of use to a requester.  Although it might be a time, it
can be an event, location, direction, etc. The holdings file will generally
just give the format for the construction of the EIDs, rather than giving the
entire list of available EIDs.  The Super-EIDs or superkeys combine 
multiple EIDs or provide an alternative naming scheme or synonyms.  
Note that superkeys currently are defined for ALL datatypes at once.  
Be aware that a large number of superkeys significantly slows down data 
retrieval.  All datatypes should have a standard superkey of DOCUMENTS 
equated to the EID DOCUMENT.
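
To make the idea concrete, a superkey can be thought of as an entry in a
single table, shared by all datatypes, that expands to one or more EIDs.  The
Python sketch below is an assumption about how such a table might look, not
the NDADS implementation; the superkey NOV81_START and the function name are
made up for the example.

```python
# Hypothetical sketch: one superkey table shared by all datatypes.
SUPERKEYS = {
    "DOCUMENTS": ["DOCUMENT"],          # standard synonym recommended above
    "NOV81_START": ["81305", "81306"],  # made-up superkey combining two EIDs
}


def expand(requested_eid, superkeys=SUPERKEYS):
    """Return the EIDs a request stands for; plain EIDs map to themselves."""
    return list(superkeys.get(requested_eid, [requested_eid]))


if __name__ == "__main__":
    print(expand("DOCUMENTS"))    # ['DOCUMENT']
    print(expand("NOV81_START"))  # ['81305', '81306']
    print(expand("81302"))        # ['81302']
```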

See Footnote 6 for additional discussion of EIDs relating to specialized kinds 
of documentation.


Footnote 3: On-the-fly Processing Options
-----------------------------------------
A limited amount of processing can be performed when the files are retrieved
from the jukebox, perhaps converting to ASCII, CDF, or GIF files from the
original ingested files. On-the-fly processing should only be considered if 
there will be relatively few requests compared to the dataset size, 
and the total number of processed/output files is large, the total volume of
output files is large, or there are a number of file formats. If requests for 
these output formats will be common and CPU intensive, the dataset should be
converted in batch mode and reingested instead. The cutoff point for
considering on-the-fly processing will change with increasing storage and
processing capacity and file compression. Maintaining the conversion code is 
much more troublesome than doing the conversion for all the files once. An 
additional drawback is that the conversion option is only available currently 
in ARMS requests and not through Quasar FTP. On-the-fly processing is requested
in the ARMS system by adding a Dataformat key to the Subject line. The
Acquisition Scientist must provide the code to do the processing. Make sure the
holdings file is clear on the process.  NOTE that the current 
implementation of on-the-fly processing is added by hand, and that there are no 
current working examples.


Footnote 4: Implied Carriage Control flag (CC Bit)
--------------------------------------------------
Setting the Implied Carriage Control flag (CC Bit) correctly is very important
for files that are being transferred between VMS and non-VMS (such as UNIX) 
machines.  More accurately, it is not a question of VMS vs non-VMS systems, but 
rather of file systems that keep record structure information vs those that
don't.  VMS systems contain (generally invisible) information about the
record structure of a file, while UNIX systems do not hold any record structure
information at the file system level.  Thus, when sending data via FTP to a
non-VMS system, it might be necessary in the transfer process to insert
end-of-record markers (in whatever style the target system recognizes, such as
CR, LF or CR-LF).  This insertion will only be done during the FTP process
(involving both the FTP client and the target system) if the so-called "implied
CC bit" is set.  Often the carriage control is explicit (Fortran CC,
etc.) and not implied, in which case there isn't a problem. 

The best way to confirm that the bit is correct is to transfer a sample file
from VMS to UNIX and see if it compares correctly byte-for-byte or see if it works
properly with the software. 

Usually, the CC bit should be set if you have ASCII data in either fixed-length
or variable-length records (the Unix user can read until a CR is found). The
resulting file length may be longer than the original, due to the presence of
these CR bytes.  There are a few exceptions to this rule. For instance,
PostScript files may be all ASCII but it is unnecessary and error-prone to change
end-of-record characters (and version 2 PS also allows binary enclosures);
these should be treated as Binary files and the CC bit should not be set. 

The CC bit should not be set if you have either fixed-length or variable-length
binary records. In addition to confusing end-of-line characters with valid
binary data, the file size and record size will change. For fixed-length record
files, the dataset format description document or software must tell the Unix
user the record length. For variable-length record files, the Unix user reads
the control byte(s) that are part of the record to find out how long the record
is. Adding extra end-of-line characters will throw off the record and file
lengths and they may be mistaken for data. 

The following are some file types that should be treated as Binary and not have 
the CC bit set on VMS systems: CDF, HDF, FITS, GIF, JPEG, TIFF, PNG, PS, EXE.

The following are some file types that generally should be treated as Text/
ASCII and should have the CC bit set on VMS systems: TXT, DOC (unless Word or
WordPerfect type files), README, ASC, TEXT, FOR, F, C, COM (for VMS but binary
for PCs), LOG, LIS, LIST. It is important to check these files to see if they
are truly files you can type to the screen and read as ASCII. 
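
The two lists above can be turned into a first-pass check when filling out an
ingest plan.  The Python sketch below is only a helper under stated
assumptions: it looks at the file type alone, the function names are
hypothetical, and its answer is a suggestion.  DOC and COM files in particular
still need to be inspected by hand, as noted above.

```python
# File types taken from the two lists in this footnote.
BINARY_TYPES = {"CDF", "HDF", "FITS", "GIF", "JPEG", "TIFF", "PNG", "PS", "EXE"}
TEXT_TYPES = {"TXT", "DOC", "README", "ASC", "TEXT", "FOR", "F", "C", "COM",
              "LOG", "LIS", "LIST"}


def suggest_cc_bit(filename):
    """Return True (set CC bit), False (do not set), or None (unknown type)."""
    # Use the part after the last period; a name with no period (README)
    # is looked up as a whole.
    file_type = filename.rsplit(".", 1)[-1].upper()
    if file_type in BINARY_TYPES:
        return False
    if file_type in TEXT_TYPES:
        return True
    return None  # not in either list: decide by inspecting the file


def looks_like_text(data):
    """Rough check that bytes are printable ASCII plus tab, LF, FF and CR."""
    allowed = set(range(32, 127)) | {9, 10, 12, 13}
    return all(byte in allowed for byte in data)


if __name__ == "__main__":
    print(suggest_cc_bit("81302_DE2_VEFI.ASC"))  # True
    print(suggest_cc_bit("81302_DE2_VEFI.CDF"))  # False
    print(looks_like_text(b"time  value\r\n"))   # True
```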


Footnote 5: Output filenames
----------------------------
The user-requested files will be staged to a [.PROJECT] subdirectory for the
requester to copy by FTP or DECnet copy. It is important to distinguish the 
files from various instruments with a clear naming convention. In addition, if
time is at the beginning of the filenames, files of the same time for several
instruments will show up in the directory together and will have the maximum
uniqueness in the PC 8.3 format. If the filenames are not too long, the project
can also be added so the users will not confuse various projects in their own
directory space. The JSPAG recommends using filenames of the form: 
time_instrument_project_dataform_version.fileformat. The time could be in begin 
and end time form or just begin time (YYDDD, YYDDDHHMMSS, YYMMDD, 
YYMMDDHHMMSS). Instrument and project should be kept short. Dataform might 
indicate average (6S or 5M) versus high resolution data or AC versus DC. The
version is recommended since many datasets are reprocessed at some point; use a
format of Vnn where nn=01, 02, etc. in order. The file format should be .CDF,
.FOR, .C, .PS, .GIF and such as appropriate, with document text files as .TXT,
ASCII data files as .ASC and binary data files as .DAT. It is very important to
distinguish text files from binary ones when the user is transferring them via
FTP or wants to examine them. 

The NDADS file name limit is 32 characters before the period and 32 after,
with only one period. Try not to use odd characters that may be invalid on
other platforms. Do not use mixed case, i.e., file names should be the same
for upper and lower case letters.  Use of the JSPAG filenaming convention is
highly recommended:

JSPAG file naming convention:

EID_INSTRUMENT_SPACECRAFT_DATAFORMAT_VNN.EXTENSION

where:

EID is the start time of the file AND the NDADS EID.  The preferred format 
is YYDDD, but may be altered if necessary.  For example, days with multiple 
files should be YYDDDHH. YYMMDD is also acceptable.  It is preferred that 
the stop time NOT be included.

INSTRUMENT is the instrument name, and should be abbreviated if there is a 
generally accepted abbreviation.

SPACECRAFT is the spacecraft name, and preferably also the NDADS PROJECT.  
Well known abbreviations are acceptable.

DATAFORMAT gives the time resolution (10MS, 1HR) and/or any analysis type 
(PEAK, AVG).

VNN is the version number of the dataset (V01).

EXTENSION is the data format of the file (ASCII, VMSBIN).
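
The convention and the 32.32 limit can be checked mechanically before ingest.
The Python sketch below assembles a name from its parts and lists any
problems found.  It is an illustration under assumptions: the function names
are hypothetical, forcing upper case is one way to honor the "no mixed case"
rule, and the set of "safe" characters (A-Z, 0-9, underscore) is a
conservative guess rather than an NDADS rule.

```python
import re

MAX_PART_LEN = 32  # characters allowed before and after the period
SAFE_NAME = re.compile(r"[A-Z0-9_]+\.[A-Z0-9_]+")  # assumed safe characters


def jspag_filename(eid, instrument, spacecraft, dataformat, version, extension):
    """Build EID_INSTRUMENT_SPACECRAFT_DATAFORMAT_VNN.EXTENSION in upper case."""
    stem = "_".join([eid, instrument, spacecraft, dataformat, "V%02d" % version])
    return (stem + "." + extension).upper()


def filename_problems(name):
    """Return a list of problems with a file name; an empty list means none."""
    if name.count(".") != 1:
        return ["name must contain exactly one period"]
    problems = []
    stem, extension = name.split(".")
    if len(stem) > MAX_PART_LEN:
        problems.append("more than 32 characters before the period")
    if len(extension) > MAX_PART_LEN:
        problems.append("more than 32 characters after the period")
    if not SAFE_NAME.fullmatch(name):
        problems.append("characters other than A-Z, 0-9 and underscore")
    return problems


if __name__ == "__main__":
    name = jspag_filename("81302", "VEFI", "DE2", "DC", 1, "CDF")
    print(name)                     # 81302_VEFI_DE2_DC_V01.CDF
    print(filename_problems(name))  # []
```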


Footnote 6: Documentation
-------------------------
Documentation can consist of one or many files (text files, image files, sample 
data in ASCII). Each should have a unique EID and also be available via a 
superkey of DOCUMENT. Documents that pertain to all the experiments on a
spacecraft might be retrieved by an EID like SC_DOC that is defined for each
datatype. Each documentation file should be entered into TRF and keyed with the
NSSDC dataset ID, and further keyed as documentation to accompany software. 
Then a copy should be submitted to the controlled digital document library. 
Ideally, when there are corrections or additions to the document, the changes
would be made in the master copy and NDADS could retrieve that copy for staging
when requested. 

Some kinds of information (data coverage inventories, command histories, spin
vector changes, mode changes, etc.) may be too lengthy to be practical to
include in the normal Documentation. If there is an inventory file (already
provided, or generated directly from the data later as a value-added feature),
it should be identified by an EID of INVENTORY, or if there are several such
files (perhaps one for each of several years), then each file should have its
own EID and INVENTORY would be a super-key that calls up all of these inventory
files. In case the same inventory file applies to all experiments on the
spacecraft, then only one file is needed. It would be assigned a SYSTEM
Datatype, and then the INVENTORY EID for each of the USER Datatypes for this
mission would also point to this file. If the INVENTORY files follow a
reasonably standardized format (none defined yet), then a future user interface
program might, for example, be able to interact with the user and check the
contents of the inventory file to answer the user's questions, without requiring
the user to peruse the file manually. 

CMD_HISTORY is suggested as another standard EID to be used as appropriate.
Acquisition scientists should suggest and define others as needed for best help
to users of their data. 


Footnote 7: Software
--------------------
Software is handled analogously to documents, with the use of the superkey 
SOFTWARE to retrieve all software-related files. As above, it is
recommended NOT to use SOFTWARE as a datatype, but rather only as an
all-inclusive EID for software for a particular datatype.  Individual pieces of
software can also have their own unique EIDs. Software files should include 
sufficient user guides and other documentation.


Footnote 8: Bill of Lading (BOL) Recommended Format
---------------------------------------------------
We have developed a standard BOL format for Space Physics.  The BOL file 
program is currently run by the acquisition programmer and its output is 
delivered to the ingest programmer.  To use this method, the EIDs must be 
start time only.  Use of this method is highly recommended.

Footnote 9: Validation requirements
-----------------------------------
Validation and verification have many varied meanings in different contexts.
We should be certain that the files that we ingest are indeed the files that we 
expected (i.e., we should verify that the tape that we expected was actually 
the tape that was mounted on the drive).  We will usually include checking
filenames and sizes, verifying that we got all the Project sent, and optionally
checking time order and maybe even a more complete verification of data. The
software, file formats, correct file types, and such should also be checked.
Some negotiation/discussion may be required with a programmer (either Science 
Group programmer or NDADS ingest programmer) to decide who will do which steps 
of validation and verification.  Directions must include the action to take if
data files don't meet the requirements. 
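
The minimal checks described above (filenames, sizes, completeness, and
optionally time order) are easy to script.  The Python sketch below is one
possible approach, not an existing NDADS tool: the function names are
hypothetical, and the expected list of names and sizes is assumed to come
from whatever listing the Project supplied with the delivery.

```python
import os


def list_sizes(directory):
    """Map each regular file in a directory to its size in bytes."""
    with os.scandir(directory) as entries:
        return {entry.name: entry.stat().st_size
                for entry in entries if entry.is_file()}


def compare_with_expected(found, expected):
    """Compare {filename: size} dicts; return (missing, unexpected, wrong_size)."""
    missing = sorted(set(expected) - set(found))
    unexpected = sorted(set(found) - set(expected))
    wrong_size = sorted(name for name in expected
                        if name in found and found[name] != expected[name])
    return missing, unexpected, wrong_size


def out_of_order(start_times):
    """Return indexes of entries that start earlier than the entry before."""
    return [i for i in range(1, len(start_times))
            if start_times[i] < start_times[i - 1]]


if __name__ == "__main__":
    found = {"A.DAT": 10, "B.DAT": 20, "X.DAT": 5}
    expected = {"A.DAT": 10, "B.DAT": 25, "C.DAT": 30}
    print(compare_with_expected(found, expected))
    print(out_of_order(["81302", "81303", "81301", "81304"]))
```

In practice, found would come from list_sizes() run on the staging directory.
Any non-empty result should trigger the action that the ingest plan specifies
for data files that do not meet the requirements.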


Footnote 10: Holdings Files
---------------------------
There are two levels of "holdings" information: one is a general NDADS holdings
file that should contain a brief paragraph or so to advertise each project and
to point to (possibly multiple) related spacecraft holdings files; the other is
the project level (spacecraft level) holdings file.  Both should be brief
summary "advertisements" of what is available.  Data formats, lengthy
descriptions, and other lengthy discussions, including most dataset-specific
text, should be put into one or more documentation files, not in the holdings
files.  The size and content of the paragraph for the general NDADS holdings
file will vary according to the situation for each project.  For ISEE, for
example, it indicates (among other things) that if the user wants "ISEE 1",
then "ISEE1" (no space) should be requested.  Don't forget to say how to
retrieve the documentation and software. 


--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NDADS INGEST PLAN OUTLINE (Fill this out)               
Version February 24, 1996. 
-------------------------------------------------------------------------------- 
-------------------------------------------------------------------------------- 
This is intended as the starting point for ingest work, to be followed 
up by discussion with the NDADS programmer and maybe a science 
programmer.  Make additions as needed, to help document the ingest 
operation for later reference.  
--------------------------------------------------------------------------------

Date: 

Acquisition Scientist:

Acquisition Programmer:

PROJECT (e.g., DE, Hawkeye): 

TOTAL NUMBER OF FILES (*NEW* projects only):
    Estimate for all data sets in this project, for ingest in near future.

NSSDC ID(s) FOR THIS DATA:

BRIEF COMMENTS ABOUT THE DATA:

SYSTEM DATATYPEs (Footnote 1) (maximum 31 characters): 
   Consider how you will describe the datatype/EID structure in the Holdings 
   file concisely and succinctly, so that it will be clear to the user.

USER DATATYPEs (can be same as System ones) (maximum 31 characters):

NDADS Entry Identification Codes (EIDs) (Footnote 2):

Alternate EIDs (superkeys): DOCUMENTS = DOCUMENT highly recommended

On-the-fly Processing Options (Footnote 3):

Delivery Medium (tapes, FTP, backup/tar, etc.): 

Delivery site/method/platform: 

Delivery Schedule (Specify date when data will be ready for NDADS ingest): 
   Delivery of test data: 
   Delivery of production data (with begin and end dates): 
     Data flow rate: 
     Average and range of file sizes: 
     Total number of files and volume expected: 
   Expected anomalies in Schedule and special priority requirements: 

Data Formats: 
   Input Formats (e.g., native, ASCII, CDF, HDF, FITS, VMS variable, VMS index,
     Unix streaming): 
   Compression protocol (if any): 
   Specify where the detailed input format description can be found:

   Output Formats (e.g., ASCII, binary, CDF; see Footnote 4 about CC flag; The
     output format may also imply Value-Added activities, below): 
   VMS File Attributes (if any; try not to) (See Footnote 4): 

Input filename formats:

Output filename formats (Footnote 5) (e.g., 81302_DE2_VEFI.ASC): 

For CDs specify mapping of 8.3 names to unique filenames:

Documentation Files (Footnote 6): 

Software Files (include platform it runs on) (Footnote 7): 

Ingest Method (e.g., automated, semi-automated, or BOL file):

Preprocessing (e.g., split files if file size is more than 10-20 MB or so): 

Staging area required (size and location, maybe XFILES): 
   This is negotiable; but should be filled with actual values when determined.

BOL (Bill Of Lading) for automated ingest (Footnote 8): 
   Consider ingesting this as a record of the ingest process.

Validation requirements (Footnote 9) (who will do each?): 
   To ensure that data are as expected (correct tape mounted, etc.):
   To check quality of data, time-order, etc.:

Output procedure (for each user request): 
   (e.g., place files in project directory in NDADS anonymous area and 
   set CC bit if necessary; run conversion and bundling options)

If you think it matters, specify Fetcher (storage jukebox): either DLT (unix) or
Signet (VMS) and why? 

Proprietary Status (i.e., Public or Proprietary (release date)): 

Holdings Files (pointers to file locations) (Footnote 10) 
   Text for General NDADS holdings file (general project advertisement and how
     to get Project holdings file): 
   Text for Project holdings file (very short description of spacecraft and
     instruments and data formats with emphasis on how to retrieve the data
     (EID format, etc.)): 

AIM/RSIRS/NMD information (reminder here of Acquisition Scientist function): 
   For new data sets: NSSDC ID, Provider & other information, Brief description
   For existing data sets: Update, particularly for NDADS availability 

Summary of Value-Added activities (brief summary, not details):  
   To be done prior to or during ingest:
   Planned for a later step, after ingest into NDADS:

Other Issues or potential problems: