FILENAMES AND DELIVERY DIRECTORIES
As outlined in Section C.10, the contractor shall assign a digital-image filename to each image captured as part of the initial image-capture process, and deliver these files to the Library in a certain arrangement of directories and subdirectories, following the specifications outlined below.
4.1
Filename/Directory Structure 1: Unnumbered Documents in Folder Structure
Generally speaking, this structure applies to certain manuscript collections, e.g., the Booker T. Washington Papers. The documents in these collections have been placed in separate file folders within certain logical elements: series and subseries. Each folder, series, and subseries represents a unit that coheres intellectually. In addition, the folders are stored in containers (boxes), in sequence. The following table illustrates this form of organization:
Collection
Series 1
Subseries A
Folders 1, 2, 3, etc.
Documents 1, 2, 3, etc.
Subseries B
Folders 1, 2, 3, etc.
Documents 1, 2, 3, etc.
Series 2
Subseries A
Folders 1, 2, 3, etc.
Documents 1, 2, 3, etc.
Note:
The folders are placed in containers (boxes), until each container is filled. Often, these containers are numbered. Thus, a researcher may identify materials within a collection logically, i.e., by series, subseries, and folder; or physically, by container number and folder name or number.
Each collection's organization, including a list of series, containers, and folders, is found in the collection's printed finding aid.
The Library's digital presentation of these collections will be organized in the logical structure, i.e., by series and folder. Researchers will access the collections by means of an online finding aid (based on the existing printed finding aid) to be produced by the Library. Although the containers do not represent logical units, their numbers may be employed in directory naming in order to retain the sequence of folders in the collection. The Library will provide the contractor with a copy of the printed finding aid (or an equivalent list) at the time of scanning. If containers and folders have not been previously numbered, the Library will assign numbers by marking the finding aid or list. The marked-up finding aid or list will indicate the identifier for each series and folder. For collections in this category, the contractor shall deliver the images in a combination of directories and subdirectories. The highest level delivery directories will represent collection series, with lower-level directories representing folders. The individual files will reproduce the document pages within the folders.
Within the folders, the images receive sequential numbers. Folders generally contain a few hundred pages. Since they never contain more than 10,000 leaves (and thus will not exceed 9,999 pages to be imaged), the Library requires the use of a four-digit number (including leading zeros) for page-image naming. The contractor shall assign filenames sequentially within the folder, i.e., 0001, 0002, 0003, 0004, etc.
4.1.1
Filename/Directory Structure 1, Continued: Recognizing and Marking New Documents
The special contractor requirement associated with folder-based manuscript collections is the recognition of "new documents." (See also Section C.4.3.) Documents in folders tend to be letters, reports, and other written or printed items. In order to aid future researchers, each image that represents the start of a new document shall be marked by adding the letter d to the filename in the last position before the filename extension, e.g., 0001d.jpg.
Recognizing new documents means observing that the next image represents the start of a letter (which may be indicated by letterhead, date, salutation, etc.) or the first page of a report (which may be indicated by a title, author's name, page 1, etc.). Miscellaneous pieces of paper (for example, scribbled notes, 3x5 slips, or groups of small items) should also be treated as new documents. As noted above, the Library understands that these judgments are sometimes difficult and requires only 80 percent accuracy in the identification of new documents.
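The folder-level naming rule can be sketched in a few lines of Python. This is only an illustration: the function name folder_image_filename and its parameters are hypothetical, and the decision about which images start a new document is assumed to be supplied by the operator.

```python
def folder_image_filename(sequence: int, starts_document: bool = False,
                          ext: str = "jpg") -> str:
    """Return a folder-level image filename such as 0001d.jpg."""
    if not 1 <= sequence <= 9999:
        raise ValueError("sequence must be between 1 and 9999")
    # The letter d goes in the last position before the extension.
    marker = "d" if starts_document else ""
    return f"{sequence:04d}{marker}.{ext}"


# Hypothetical folder of ten images in which images 1 and 6 start new documents.
names = [folder_image_filename(n, n in {1, 6}) for n in range(1, 11)]
```

The four-digit zero padding keeps the files in scanning order under a plain alphabetical sort.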
4.1.2
Filename/Directory Structure 1, Continued: Example of Unnumbered Document/File Folder Collection
The table that follows offers an example of a directory and filename structure for a folder-based manuscript collection.
Finding Aid information | Identifier for directory provided by LC | Name assigned to directory by contractor | Identifier for folder provided by LC | Name assigned to folder subdirectory by contractor | Image filenames assigned by contractor |
Series: Correspondence | gpcor | gpcor001 (may be more directories if large series) | . | . | . |
Container: 816 Folder (no. 23): "Letters, January-March 1876" | . | . | 81623 | 81623 | . |
Image number 1, start of first document, feature recognized by contractor | . | . | . | . | 0001d.jpg |
Remaining pages of first document; image nos. 2-5 | . | . | . | . | 0002.jpg 0003.jpg 0004.jpg 0005.jpg |
Image number 6, first page of new document, feature recognized by contractor | . | . | . | . | 0006d.jpg |
Remaining pages of second document; image numbers 7-10 | . | . | . | . | 0007.jpg 0008.jpg 0009.jpg 0010.jpg |
4.2
Filename/Directory Structure 2: Bibliographic Record/Print-Page Number Structure
This structure will generally be used for monographs. The Library will supply a list or a simple database for the group of monographs to be scanned. The key elements on this list or database will be:
Collection name
Example: Western Travel in Rare Books
Collection identifier
Example: wtrb
Book author and/or title
Example: White, Michael Claringbud. California all the way back to 1828.
LCCN: the Library of Congress catalog card number (a unique number)
Example: ca4-14356 (stored in computer as 04014356)
Book identifier
Example: 014356
Text conversion yes/no
Example: No
Before scanning, the contractor shall verify the identity of the monograph by comparing it to the identifying target presented by the Library. The target will provide the identifier to be used for the monograph. The target is to be scanned to permit the verification of the item during the Library's quality review process.
All of the images of book pages shall be delivered to the Library in a directory assigned the name of the book identifier, 014356 in the first example above.
The individual image filenames shall be assigned as outlined in the following two sections.
At the time that a task for a particular printed matter collection is assigned, the Library will provide written instructions and a copy of the finding aid or equivalent (in print and/or in machine-readable form).
Specific instructions about scanning blank pages (pages with no marks of any kind) will be developed for each collection. In general, the rule will be to omit blank pages from the image set and numbering structure.
4.2.1
Filename/Directory Structure 2A -- Filenames for Book Page Images Where Printed Page Numbers are Tracked
This pattern is used when no SGML-encoded, machine-readable text is required.
Overall name pattern: cccppppf
NOTE: ccc means image control number, pppp means print page number, and f means feature.
ccc Image control number
The first three digits are used to assign a set of sequential numbers to all of the images for the book. The ccc for the image of the target is 000. The first actual image from the book is assigned control number 001.
pppp Printed page number
Digits four through seven carry the actual printed page number for the page reproduced. The page number is to be represented with leading zeros.
The contractor must determine this number by examining the page itself.
If the printed number is arabic, then it is simply keyed in, with leading zeros.
If the number is roman, the lead digit (first of this set of four; fourth digit in the overall filename) shall be r, and the remaining three digits in this set (digits five, six, and seven in the overall filename) will represent the arabic translation of the roman number.
If there is no printed page number, then 0000 shall be keyed.
f Feature
Digit eight indicates that the page or pages contain a special feature. The contractor must recognize the feature by examining the image itself. The abbreviations for the features to be indicated are as follows:
g Title Page (if the work has more than one title page, indicate the main title page if that can be easily determined; if not, indicate the first)
n Table of Contents (if more than one page, indicate all pages)
l List of Illustrations (if more than one page, indicate all pages)
p Repeating page image
x Index (if more than one page, indicate all pages)
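A minimal Python sketch of how a cccppppf name might be assembled follows. The function names are hypothetical. The specification does not say what occupies the eighth position when a page has no feature, so this sketch assumes the position is simply omitted; that assumption should be confirmed with the Library.

```python
from typing import Optional

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
FEATURES = {"g", "n", "l", "p", "x"}  # feature codes listed for books


def roman_to_int(numeral: str) -> int:
    """Convert a roman numeral such as 'xiv' to its arabic value."""
    try:
        values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    except KeyError as exc:
        raise ValueError(f"not a roman numeral: {numeral!r}") from exc
    total = 0
    for i, value in enumerate(values):
        # A smaller value written before a larger one is subtracted (iv = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def page_field(printed: Optional[str]) -> str:
    """Return the four-character pppp field for a printed page number."""
    if not printed:
        return "0000"  # no printed page number on the page
    if printed.isdigit():
        number = int(printed)
        if number > 9999:
            raise ValueError("arabic page number exceeds four digits")
        return f"{number:04d}"
    number = roman_to_int(printed)
    if number > 999:
        raise ValueError("roman page number exceeds three digits")
    return f"r{number:03d}"


def book_image_name(control: int, printed: Optional[str] = None,
                    feature: str = "") -> str:
    """Assemble a cccppppf name; an empty feature is omitted (assumption)."""
    if not 0 <= control <= 999:
        raise ValueError("control number must fit in three digits")
    if feature and feature not in FEATURES:
        raise ValueError(f"unknown feature code: {feature!r}")
    return f"{control:03d}{page_field(printed)}{feature}"
```

For instance, book_image_name(5, "iv") describes the fifth image of a book whose printed page number is the roman numeral iv, and book_image_name(0) describes the target image.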
4.2.2
Filename/Directory Structure 2B -- Filenames for Book Page Images When Printed Page Numbers and Features are not Tracked
Generally speaking, this approach will be used when text conversion is planned. The converted texts will include SGML markup that will indicate the relationship between image control numbers and printed page numbers thus making it unnecessary to capture this information in the filenames.
Overall name pattern: cccc
cccc Image control number
The four digits are used to assign a set of serial numbers to all of the images for the book. The cccc for the image of the target is 0000. The first actual page image from the book is assigned 0001.
4.3
Filename/Directory Structure 3: Serial Structure
This structure will be used for serials (e.g., magazines). The Library will supply a list or a simple database for the serials to be scanned. The key elements on this list or database will be:
Collection name
Example: Magazines for Children
Collection abbreviation
Example: mcgc
Serial title
Example: Wee Winkle Monthly
LCCN: the Library of Congress catalog card number (a unique number)
Example: 07-53986
ISSN: the International Standard Serial Number (a unique number)
Example: 45670923
Serial identifier
Example: 45670923
Issue enumeration
Example: January - December, 1918
Issue identifiers
Example: 191801, 191802, 191803, etc.
Cumulative index
Example: For 1918
Cumulative index identifier
Example: 1918in
Before scanning, the contractor shall verify the identity of the serial by comparing it to the identifying target presented by the Library. The target will provide the identifier to be used for the serial. The target is to be scanned to permit the verification of the item during the Library's quality review process.
All of the subdirectories containing the images for each serial shall be delivered to the Library in a directory assigned the name of the serial identifier, 45670923 in the example above.
New subdirectories shall be created for each issue, e.g., 191801, 191802, 191803; for collation records, e.g., 1918cl; and for cumulative indexes, e.g., 1918in.
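The resulting directory layout for one year of a serial might be sketched as follows. The helper name serial_subdirectories is hypothetical, and the sketch assumes a monthly serial whose issue identifiers take the form of year plus two-digit month; in practice the issue identifiers come from the list supplied by the Library.

```python
from pathlib import PurePosixPath


def serial_subdirectories(serial_id: str, year: int, months=range(1, 13),
                          collation: bool = True,
                          cumulative_index: bool = True) -> list:
    """List the delivery subdirectories for one year of a monthly serial."""
    root = PurePosixPath(serial_id)
    # One subdirectory per issue, e.g., 191801 for January 1918.
    dirs = [root / f"{year}{month:02d}" for month in months]
    if collation:
        dirs.append(root / f"{year}cl")  # collation records
    if cumulative_index:
        dirs.append(root / f"{year}in")  # cumulative index
    return [str(d) for d in dirs]


dirs = serial_subdirectories("45670923", 1918)
```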
At the time that a task for a particular serial collection is assigned, the Library will provide written instructions and a copy of the finding aid (in print and/or in machine-readable form).
4.3.1
Filename/Directory Structure 3A -- Filenames for Serial Page Images When Printed Page Numbers are Tracked
This pattern is used when no SGML-encoded, machine-readable text is required.
The individual image filenames for actual serial pages shall be assigned as follows. Note that a separate requirement for the collation-record and cumulative index images is stated below.
Overall name pattern: cccppppf
ccc Image control number
The first three digits are used to assign a set of serial numbers to all of the images for the issue of the serial. The first actual image is assigned 001.
pppp Printed page number
Digits four through seven carry the actual printed page number for the page reproduced. The page number is to be represented with leading zeros. The contractor must determine this number by examining the image itself.
If the printed number is arabic, then it is simply keyed in, with leading zeros.
If the number is roman, the lead digit (first of this set of four; fourth digit in the overall filename) shall be r, and the remaining three digits in this set (digits five, six, and seven in the overall filename) will represent the arabic translation of the roman number.
If there is no printed page number, then 0000 shall be keyed.
f Feature
Digit eight indicates that the page or pages contain a special feature. The contractor must recognize the feature by examining the image itself. The abbreviations for the features to be indicated are as follows:
c Cover (if the work has more than one cover, indicate the main cover if that can be easily determined; if not, indicate the first)
n Table of Contents (if more than one page, indicate all pages)
l List of Illustrations, if any
p Repeating page image
x Index, if any
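For quality review, a delivered name can be checked against the cccppppf pattern and split back into its fields. The sketch below is one possible approach, not a Library tool: the names PATTERN, SerialImageName, and parse_serial_image_name are hypothetical, and it assumes the feature position is absent when a page carries no feature.

```python
import re
from typing import NamedTuple, Optional

# ccc = three digits; pppp = four digits, or r plus three digits for roman
# numerals; f = optional feature letter from the serial feature list.
PATTERN = re.compile(
    r"^(?P<control>\d{3})(?P<page>r\d{3}|\d{4})(?P<feature>[cnlpx])?$"
)


class SerialImageName(NamedTuple):
    control: int
    page: Optional[int]      # None when no printed page number (0000)
    roman: bool              # True when the printed number was roman
    feature: Optional[str]   # None when no feature letter is present


def parse_serial_image_name(stem: str) -> SerialImageName:
    """Split a filename stem (no extension) into its cccppppf fields."""
    match = PATTERN.match(stem)
    if match is None:
        raise ValueError(f"not a cccppppf name: {stem!r}")
    page_text = match.group("page")
    roman = page_text.startswith("r")
    number = int(page_text[1:]) if roman else int(page_text)
    page = None if (not roman and number == 0) else number
    return SerialImageName(int(match.group("control")), page, roman,
                           match.group("feature"))
```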
4.3.2
Filename/Directory Structure 3B -- Filenames for Serial Page Images When Printed Page Numbers are not Tracked
This pattern is used when SGML-encoded, machine-readable text is required.
The individual image filenames for actual serial pages shall be assigned as follows.
Note that a separate requirement for the collation-record and cumulative index images is stated below.
Overall name pattern: cccf
ccc Image control number
The first three digits are used to assign a set of serial numbers to all of the images for the issue of the serial. The first actual image is assigned 001.
4.3.3
Filename/Directory Structure 3C -- Filenames for Collation Records and/or Cumulative Indexes for Serials
Overall name pattern: cccpppp
ccc Image control number
The first three digits are used to assign a set of serial numbers to all of the images in the index or collation. The first image is assigned 001.
pppp Printed page number
Digits four through seven carry the actual printed page number for the page reproduced. The page number is to be represented with leading zeros. The contractor must determine this number by examining the image itself.
If the printed number is arabic, then it is simply keyed in, with leading zeros.
If the number is roman, the lead digit (first of this set of four; fourth digit in the overall filename) shall be r, and the remaining three digits in this set (digits five, six, and seven in the overall filename) will represent the arabic translation of the roman number.
If there is no printed page number, then 0000 shall be keyed.
4.4
Filename/Directory Structure 4: Copyright Registration and Technical Document Number Structure
The copyright-registration-number/technical-document structure applies to two classes of material. First and foremost, it will be used for collections deposited at the Library in years past, as part of the copyright registration process and often left uncataloged. Some of these are printed matter, e.g., the nineteenth century sheet music collections, while others are manuscripts (including typescripts), e.g., the collection of unpublished early twentieth century plays. Second, it will be used for separate, short items like technical reports. These are typically offset-printed reports, many prepared for such agencies as the Department of Defense, running about 20-30 pages each.
Every document in the copyright collections received a registration number when the collections were copyrighted. The number is generally stamped on the cover or title page, often with a rubber stamp or written into a blank portion of a rubber stamp. In a few cases, the first part of the number is rendered in roman numerals and the latter part in arabic, e.g., the rubber stamp indicates registration number xxc 14, for registration number 8014. Technical reports also tend to have a unique number assigned by the agency that prepared them.
The contractor shall assign directory names based upon the registration or report number. Depending upon the collection, this number may reach five or more digits, e.g., 56872. The Library will provide a list of identifiers based on this number. One example is the Library's sheet music collection. For this collection, the identifier for the directory will consist of the copyright registration number, with added leading zeros sufficient to create a five-digit expression, prefixed with the collection abbreviation, e.g., sm for the sheet music collection. Thus the directory for the sheet music item registered under the number 8692 shall be named sm08692.
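The sheet music rule can be expressed as a short Python sketch. The function name registration_directory is hypothetical, and the default width of five digits reflects the sheet music example only; other collections may call for a different width.

```python
def registration_directory(prefix: str, number: int, width: int = 5) -> str:
    """Build a directory name such as sm08692 from a registration number."""
    if number < 0:
        raise ValueError("registration number must not be negative")
    # zfill pads with leading zeros but never truncates longer numbers.
    digits = str(number).zfill(width)
    return f"{prefix.lower()}{digits}"
```

Calling registration_directory("SM", 8692) yields sm08692, matching the example in the text.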
At the time that a task for a particular collection is assigned, the Library will provide written instructions and a copy of the finding aid (in print and/or in machine-readable form).
The individual image filenames for actual page images shall be assigned as follows.
Copyright/tech report structure page-image name pattern: cccppppf
ccc Image control number
The first three digits are used to assign a set of serial numbers to all of the images for the item. The first actual image is assigned 001.
If the contractor encounters missing or unscannable film frames, the relevant control number shall be left unassigned to permit future capture and insertion of the image in the set. A note of this discovery shall also be made in the scanning log. If repeating film images are identified and scanned (contractor's option), the control number shall increment in the usual way and (as noted below) the repeat noted as a feature.
pppp Printed page number
Digits four through seven carry the actual printed page number for the page reproduced. The page number is to be represented with leading zeros. The contractor must determine this number by examining the image itself.
If the printed number is arabic, then it is simply keyed in, with leading zeros.
If the number is roman, the lead digit (first of this set of four; fourth digit in the overall filename) shall be r, and the remaining three digits in this set (digits five, six, and seven in the overall filename) will represent the arabic translation of the roman number.
If there is no printed page number, then 0000 shall be keyed.
f Feature
c Cover (if the work has more than one cover, indicate the main cover if that can be easily determined; if not, indicate the first)
n Table of Contents (if more than one page, indicate all pages)
l List of Illustrations, if any
p Repeating page image
x Index, if any
4.5
Filenames for Multiple Versions of the Same Page
This filenaming structure is used when two images result from the capture of a page that carries both text and a printed halftone, in order to achieve legibility of both the text and the halftone. The image that is optimized for the legibility of the text shall have a filename consistent with the structure of the rest of the document. The image that is optimized for the legibility of the halftone shall have the same name as the text image but shall carry a feature designation of p, which indicates a repeating page.
4.6
Filenames for Derivative Images
Derivative bitonal images of grayscale and color images shall retain the same name as the original. These images will be distinguished by their extensions. (Grayscale and color images will have a .jpg extension, while the bitonal images will have a .tif extension.)
4.7
Filenames for Images of Segments of Pages
The Library will offer sets of foldout and large page segments to end-users in a presentation that suggests a grid or matrix. The assembly of the grid or matrix of images will depend upon the system of filenames assigned to the images.
The Library foresees that segmented images will be encountered in (1) books for which the texts will be converted and (2) unnumbered documents in folders. The first four filename positions shall be the control page number of the image and shall be followed by a three-character feature identifier which will indicate the position of that segment in a grid that represents the whole item.
Whole page image: ccc
Segment images: cccfxy
ccc Image Control Number
f Feature identifier for segment images - s
x Horizontal grid coordinate - Alpha, beginning with a for the first row
y Vertical grid coordinate - Numeric, beginning with 1 for the first column
Here are two examples:
Example 1: The 154th page scanned is a large map that must be segmented into twelve parts.
Filename of whole image of item (if any): 154.tif
Filenames of segment images with feature designations for each segment of the twelve-segment map:
154sa1.tif | 154sa2.tif | 154sa3.tif |
154sb1.tif | 154sb2.tif | 154sb3.tif |
154sc1.tif | 154sc2.tif | 154sc3.tif |
154sd1.tif | 154sd2.tif | 154sd3.tif |
Example 2: The 79th page scanned in a manuscript folder is a drawing that must be scanned as two segments (top half and bottom half). No whole page image is scanned.
Filenames of segment images with feature designations for each of the two segments:
079sa1.tif | 079sb1.tif |
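The two examples above can be reproduced with a small Python sketch that generates the grid of segment names row by row. The function name segment_filenames is hypothetical. The sketch uses a three-digit control number to match the examples, and it limits the grid to 26 rows and 9 columns on the assumption that the feature identifier must stay at three characters.

```python
from string import ascii_lowercase


def segment_filenames(control: int, rows: int, cols: int,
                      ext: str = "tif", width: int = 3) -> list:
    """Return a rows-by-cols grid of segment filenames such as 154sa1.tif."""
    if not 1 <= rows <= 26:
        raise ValueError("rows must be between 1 and 26 (letters a-z)")
    if not 1 <= cols <= 9:
        raise ValueError("cols must be between 1 and 9 (single digit)")
    return [
        # s marks a segment; the letter gives the row, the digit the column.
        [f"{control:0{width}d}s{ascii_lowercase[r]}{c}.{ext}"
         for c in range(1, cols + 1)]
        for r in range(rows)
    ]


map_grid = segment_filenames(154, rows=4, cols=3)
drawing_grid = segment_filenames(79, rows=2, cols=1)
```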
4.8
Filenames for Images with Two Print Pages
This filenaming structure is used only when folded sheets smaller than 8 1/2 x 11 capture two whole pages in one image. When print page numbers and features are not tracked, the filename would be cccc, a four (4) digit control page number. When print page numbers and features are tracked, the filename would be cccppppf, where ccc is the three (3) digit control page number, pppp is the print page number on the left page, and f is the most significant feature on either page.
4.9
Filenames for Resolution Targets
The filenames for the resolution targets will indicate that the file represents an image of the target, the resolution of the image, and the image type. For all filenames of resolution targets, the first two characters shall be tg; the third character shall represent the resolution (2, 3, or 4 for resolutions of 200, 300, and 400 dpi, respectively); and the fourth and fifth characters shall represent the image type: bt for bitonal, bh for bitonal with halftone correction, gr for grayscale, and co for color. The filename extensions shall be the same as those appropriate for each individual image type: tif or jpg. For example, the filename for an image of a resolution target scanned as a 300 dpi bitonal image with halftone correction shall be tg3bh.tif.
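A short Python sketch of the target-naming rule follows. The dictionary keys and the function name target_filename are hypothetical labels chosen for illustration. The mapping of image types to extensions is inferred from Section 4.6 (jpg for grayscale and color, tif for bitonal) and should be treated as an assumption.

```python
RESOLUTION_CODES = {200: "2", 300: "3", 400: "4"}
IMAGE_TYPE_CODES = {
    "bitonal": "bt",
    "bitonal_halftone": "bh",
    "grayscale": "gr",
    "color": "co",
}
# Assumed from Section 4.6: bitonal images are tif; grayscale and color are jpg.
EXTENSIONS = {"bt": "tif", "bh": "tif", "gr": "jpg", "co": "jpg"}


def target_filename(dpi: int, image_type: str) -> str:
    """Return a resolution-target filename such as tg3bh.tif."""
    if dpi not in RESOLUTION_CODES:
        raise ValueError(f"unsupported resolution: {dpi}")
    if image_type not in IMAGE_TYPE_CODES:
        raise ValueError(f"unknown image type: {image_type!r}")
    code = IMAGE_TYPE_CODES[image_type]
    return f"tg{RESOLUTION_CODES[dpi]}{code}.{EXTENSIONS[code]}"
```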
4.10
Filenames for SGML-Encoded, Machine-readable Texts and Associated Files
The filename for the SGML-encoded, machine-readable text will be the item identifier followed by the extension .sgm. The filename for the Page Information Group file will be the item identifier followed by the extension .pgi. The filename for the Reference file will be the item identifier followed by the extension .ref. The filename for the Omission Report file will be the item identifier followed by the extension .omi. The filename for the Entity file will be the item identifier followed by the extension .ent. This applies to all of the preceding naming schemes. For example:
Collection name
Example: Western Travel in Rare Books
Collection identifier
Example: wtrb
Book author and/or title
Example: White, Michael Claringbud. California all the way back to 1828.
LCCN: the Library of Congress catalog card number (a unique number)
Example: ca4-14356 (stored in computer as 04014356)
Book identifier
Example: 014356
SGML text filename
Example: 014356.sgm
Page Information Group file filename
Example: 014356.pgi
Reference file filename
Example: 014356.ref
Omission Report file filename
Example: 014356.omi
Entity file filename
Example: 014356.ent
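The same mapping from item identifier to associated filenames can be sketched in Python. The dictionary keys and the function name associated_filenames are hypothetical labels; only the five extensions come from the text above.

```python
ASSOCIATED_EXTENSIONS = {
    "sgml_text": "sgm",
    "page_information_group": "pgi",
    "reference": "ref",
    "omission_report": "omi",
    "entity": "ent",
}


def associated_filenames(item_id: str) -> dict:
    """Map each associated file type to its filename for one item."""
    return {kind: f"{item_id}.{ext}"
            for kind, ext in ASSOCIATED_EXTENSIONS.items()}


files = associated_filenames("014356")
```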