ATTACHMENT 4

FILENAMES AND DELIVERY DIRECTORIES

As outlined in Section C.10, the contractor shall assign a digital-image filename to each image captured as part of the initial image-capture process, and deliver these files to the Library in a certain arrangement of directories and subdirectories, following the specifications outlined below.

4.1

Filename/Directory Structure 1: Unnumbered Documents in Folder Structure

Generally speaking, this structure applies to certain manuscript collections, e.g., the Booker T. Washington Papers. The documents in these collections have been placed in separate file folders within certain logical elements: series and subseries. Each folder, series, and subseries represents a unit that coheres intellectually. In addition, the folders are stored in containers (boxes), in sequence. The following table illustrates this form of organization:

Collection

Series 1

Series 2

Note:

The folders are placed in containers (boxes), until each container is filled. Often, these containers are numbered. Thus, a researcher may identify materials within a collection logically, i.e., by series, subseries, and folder; or physically, by container number and folder name or number.

Each collection's organization, including a list of series, containers, and folders, is found in the collection's printed finding aid.

The Library's digital presentation of these collections will be organized in the logical structure, i.e., by series and folder. Researchers will access the collections by means of an online finding aid (based on the existing printed finding aid) to be produced by the Library. Although the containers do not represent logical units, their numbers may be employed in directory naming in order to retain the sequence of folders in the collection. The Library will provide the contractor with a copy of the printed finding aid (or an equivalent list) at the time of scanning. If containers and folders have not been previously numbered, the Library will assign numbers by marking the finding aid or list. The marked-up finding aid or list will indicate the identifier for each series and folder. For collections in this category, the contractor shall deliver the images in a combination of directories and subdirectories. The highest-level delivery directories will represent collection series, with lower-level directories representing folders. The individual files will reproduce the document pages within the folders.

Within the folders, the images receive sequential numbers. Folders generally contain a few hundred pages. Since they never contain more than 10,000 leaves (and thus will not exceed 9,999 pages to be imaged), the Library requires the use of a four-digit number (including leading zeros) for page-image naming. The contractor shall assign filenames sequentially within the folder, i.e., 0001, 0002, 0003, 0004, etc.
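The four-digit numbering rule can be illustrated with a short sketch. The function name page_number is hypothetical and is not part of this specification; it only shows the zero-padding and the 9,999-page ceiling stated above.

```python
def page_number(sequence: int) -> str:
    """Return the four-digit, zero-padded image number for a page in a folder.

    Hypothetical helper: the specification defines only the numbering rule,
    not any particular function.
    """
    if not 1 <= sequence <= 9999:
        raise ValueError("folder page sequence must be between 1 and 9999")
    return f"{sequence:04d}"


print([page_number(n) for n in range(1, 5)])
# prints ['0001', '0002', '0003', '0004']
```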

4.1.1

Filename/Directory Structure 1, Continued: Recognizing and Marking New Documents

The special contractor requirement associated with folder-based manuscript collections is the recognition of "new documents." (See also Section C.4.3.) Documents in folders tend to be letters, reports, and other written or printed items. In order to aid future researchers, each image that represents the start of a new document shall be marked by adding the letter d to the filename in the last position before the filename extension, e.g., 0001d.jpg.

Recognizing new documents means observing that the next image represents the start of a letter (which may be indicated by letterhead, date, salutation, etc.) or the first page of a report (which may be indicated by a title, author's name, page 1, etc.). Miscellaneous pieces of paper (for example, scribbled notes, 3x5 slips, or groups of small items) should also be treated as new documents. As noted above, the Library understands that these judgments are sometimes difficult and requires only 80 percent accuracy in the identification of new documents.
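As an illustration only, the marking rule might be applied as follows once an operator has decided which images begin a new document. The function name name_folder_images and the list-of-flags input are assumptions made for this sketch; the specification does not prescribe how the judgment is recorded. The sample flags reproduce the ten-image sequence shown in the example in Section 4.1.2.

```python
def name_folder_images(new_document_flags, extension="jpg"):
    """Assign sequential filenames within one folder.

    new_document_flags: one boolean per image, in scan order; True means the
    image is the first page of a new document (hypothetical input format).
    """
    names = []
    for sequence, starts_document in enumerate(new_document_flags, start=1):
        marker = "d" if starts_document else ""  # 'd' goes last, before the extension
        names.append(f"{sequence:04d}{marker}.{extension}")
    return names


# Images 1 and 6 start new documents; the rest are continuation pages.
flags = [True, False, False, False, False, True, False, False, False, False]
print(name_folder_images(flags))
```

Under these flags the first and sixth names carry the d marker (0001d.jpg and 0006d.jpg) and the remaining names are plain four-digit numbers.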

4.1.2

Filename/Directory Structure 1, Continued: Example of Unnumbered Document/File Folder Collection

The table that follows offers an example of a directory and filename structure for a folder-based manuscript collection.

Finding aid information | Identifier for directory provided by LC | Name assigned to directory by contractor | Identifier for folder provided by LC | Name assigned to folder subdirectory by contractor | Image filenames assigned by contractor
Series: Correspondence | gpcor | gpcor001 (may be more directories if large series) | . | . | .
Container: 816; Folder (no. 23): "Letters, January-March 1876" | . | . | 81623 | 81623 | .
Image number 1, start of first document, feature recognized by contractor | . | . | . | . | 0001d.jpg
Remaining pages of first document; image nos. 2-5 | . | . | . | . | 0002.jpg 0003.jpg 0004.jpg 0005.jpg
Image number 6, first page of new document, feature recognized by contractor | . | . | . | . | 0006d.jpg
Remaining pages of second document; image nos. 7-10 | . | . | . | . | 0007.jpg 0008.jpg 0009.jpg 0010.jpg

4.2

Filename/Directory Structure 2: Bibliographic Record/Print-Page Number Structure

This structure will generally be used for monographs. The Library will supply a list or a simple database for the group of monographs to be scanned. The key elements on this list or database will be:

Collection name
Example: Western Travel in Rare Books

Collection identifier
Example: wtrb

Book author and/or title
Example: White, Michael Claringbud. California all the way back to 1828.

LCCN: the Library of Congress catalog card number (a unique number)
Example: ca4-14356 (stored in computer as 04014356)

Book identifier
Example: 014356

Text conversion yes/no
Example: No

Before scanning, the contractor shall verify the identity of the monograph by comparing it to the identifying target presented by the Library. The target will provide the identifier to be used for the monograph. The target is to be scanned to permit the verification of the item during the Library's quality review process.

All of the images of book pages shall be delivered to the Library in a directory assigned the name of the book identifier, 014356 in the first example above.

The individual image filenames shall be assigned as outlined in the following two sections.

At the time that a task for a particular printed matter collection is assigned, the Library will provide written instructions and a copy of the finding aid or equivalent (in print and/or in machine-readable form).

Specific instructions about scanning blank pages (pages with no marks of any kind) will be developed for each collection. In general, the rule will be to omit blank pages from the image set and numbering structure.

4.2.1

Filename/Directory Structure 2A -- Filenames for Book Page Images Where Printed Page Numbers are Tracked

This pattern is used when no SGML-encoded, machine-readable text is required.

4.2.2

Filename/Directory Structure 2B -- Filenames for Book Page Images When Printed Page Numbers and Features are not Tracked

Generally speaking, this approach will be used when text conversion is planned. The converted texts will include SGML markup that will indicate the relationship between image control numbers and printed page numbers, thus making it unnecessary to capture this information in the filenames.

4.3

Filename/Directory Structure 3: Serial Structure

This structure will be used for serials (e.g., magazines). The Library will supply a list or a simple database for the serials to be scanned. The key elements on this list or database will be:

Collection name
Example: Magazines for Children

Collection abbreviation
Example: mcgc

Serial title
Example: Wee Winkle Monthly

LCCN: the Library of Congress catalog card number (a unique number)
Example: 07-53986

ISSN: the International Standard Serial Number (a unique number)
Example: 45670923

Serial identifier
Example: 45670923

Issue enumeration
Example: January - December, 1918

Issue identifiers
Example: 191801, 191802, 191803, etc.

Cumulative index
Example: For 1918

Cumulative index identifier
Example: 1918in

Before scanning, the contractor shall verify the identity of the serial by comparing it to the identifying target presented by the Library. The target will provide the identifier to be used for the serial. The target is to be scanned to permit the verification of the item during the Library's quality review process.

All of the subdirectories containing the images for each serial shall be delivered to the Library in a directory assigned the name of the serial identifier, 45670923 in the example above.

New subdirectories shall be created for each issue, e.g., 191801, 191802, 191803; for collation records, e.g., 1918cl; and for cumulative indexes, e.g., 1918in.
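The subdirectory names for one year of a serial can be sketched as below. This is a minimal illustration that assumes monthly issues identified by year plus two-digit month, as in the example above; the function name serial_subdirectories is hypothetical, and the actual issue identifiers come from the list supplied by the Library.

```python
def serial_subdirectories(year: int, months=range(1, 13)) -> list[str]:
    """List the subdirectory names for one year of a monthly serial.

    Assumes issue identifiers of the form YYYYMM, plus one collation-record
    directory (YYYYcl) and one cumulative-index directory (YYYYin).
    """
    issues = [f"{year:04d}{month:02d}" for month in months]
    return issues + [f"{year:04d}cl", f"{year:04d}in"]


for name in serial_subdirectories(1918):
    # Each name would sit beneath the serial-identifier directory, e.g. 45670923.
    print(f"45670923/{name}")
```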

At the time that a task for a particular serial collection is assigned, the Library will provide written instructions and a copy of the finding aid (in print and/or in machine-readable form).

4.3.1

Filename/Directory Structure 3A -- Filenames for Serial Page Images When Printed Page Numbers are Tracked

This pattern is used when no SGML-encoded, machine-readable text is required.

The individual image filenames for actual serial pages shall be assigned as follows. Note that a separate requirement for the collation-record and cumulative index images is stated below.

4.3.2

Filename/Directory Structure 3B -- Filenames for Serial Page Images When Printed Page Numbers are not Tracked

This pattern is used when SGML-encoded, machine-readable text is required.

The individual image filenames for actual serial pages shall be assigned as follows.

Note that a separate requirement for the collation-record and cumulative index images is stated below.

4.3.3

Filename/Directory Structure 3C -- Filenames for Collation Records and/or Cumulative Indexes for Serials

4.4

Filename/Directory Structure 4: Copyright Registration and Technical Document Number Structure

The copyright-registration-number/technical-document structure applies to two classes of material. First and foremost, it will be used for collections deposited at the Library in years past, as part of the copyright registration process and often left uncataloged. Some of these are printed matter, e.g., the nineteenth century sheet music collections, while others are manuscripts (including typescripts), e.g., the collection of unpublished early twentieth century plays. Second, it will be used for separate, short items like technical reports. These are typically offset-printed reports, many prepared for such agencies as the Department of Defense, running about 20-30 pages each.

Every document in the copyright collections received a registration number when the collections were copyrighted. The number is generally stamped on the cover or title page, often with a rubber stamp or written into a blank portion of a rubber stamp. In a few cases, the first part of the number is rendered in roman numerals and the latter part in arabic numerals, e.g., the rubber stamp indicates registration number xxc 14, for registration number 8014. Technical reports also tend to have a unique number assigned by the agency that prepared them.

The contractor shall assign directory names based upon the registration or report number. Depending upon the collection, this number may reach five or more digits, e.g., 56872. The Library will provide a list of identifiers based on this number. One example is the Library's sheet music collection. For this collection, the identifier for the directory will consist of the copyright registration number, with added leading zeros sufficient to create a five-digit expression, prefixed with the collection abbreviation, e.g., sm for the sheet music collection. Thus the directory for the sheet music item registered under the number 8692 shall be named sm08692.
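The sheet music rule can be expressed as a short sketch. The function name registration_directory and its width parameter are assumptions for illustration; the identifiers actually used are those on the list provided by the Library.

```python
def registration_directory(collection_abbreviation: str,
                           registration_number: int,
                           width: int = 5) -> str:
    """Build a directory name from a collection abbreviation and a number.

    The number is left-padded with zeros to `width` digits; numbers that are
    already longer are left unchanged (hypothetical handling).
    """
    digits = f"{registration_number:0{width}d}"
    return f"{collection_abbreviation.lower()}{digits}"


print(registration_directory("SM", 8692))
# prints sm08692
```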

At the time that a task for a particular collection is assigned, the Library will provide written instructions and a copy of the finding aid (in print and/or in machine-readable form).

The individual image filenames for actual page images shall be assigned as follows.

Copyright/tech report structure page-image name pattern: cccppppf

4.5

Filenames for Multiple Versions of the Same Page

This filenaming structure is used when two images result from the capture of a text page that contains a printed halftone, in order to achieve legibility of both the text and the halftone. The image optimized for the legibility of the text shall have a filename consistent with the structure of the rest of the document. The image optimized for the legibility of the halftone shall have the same name as the text image but shall carry a feature designation of p, which indicates a repeating page.

4.6

Filenames for Derivative Images

Derivative bitonal images of grayscale and color images shall retain the same name as the original. These images will be distinguished by their extensions. (Grayscale and color images will have a .jpg extension, while the bitonal images will have a .tif extension.)
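A minimal sketch of the derivative naming rule follows. The helper name bitonal_derivative_name is hypothetical; it simply keeps the base name and swaps the extension.

```python
from pathlib import PurePosixPath


def bitonal_derivative_name(original_name: str) -> str:
    """Return the bitonal derivative's filename for a grayscale or color image.

    The base name is unchanged; only the extension differs (.jpg -> .tif).
    """
    return str(PurePosixPath(original_name).with_suffix(".tif"))


print(bitonal_derivative_name("0006d.jpg"))
# prints 0006d.tif
```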

4.7

Filenames for Images of Segments of Pages

The Library will offer sets of foldout and large page segments to end-users in a presentation that suggests a grid or matrix. The assembly of the grid or matrix of images will depend upon the system of filenames assigned to the images.

The Library foresees that segmented images will be encountered in (1) books for which the texts will be converted and (2) unnumbered documents in folders. The first four filename positions shall be the control page number of the image and shall be followed by a three-character feature identifier which will indicate the position of that segment in a grid that represents the whole item.

Here are two examples:

Example 1: The 154th page scanned is a large map that must be segmented into twelve parts.

154sa1.tif 154sa2.tif 154sa3.tif
154sb1.tif 154sb2.tif 154sb3.tif
154sc1.tif 154sc2.tif 154sc3.tif
154sd1.tif 154sd2.tif 154sd3.tif

Example 2: The 79th page scanned in a manuscript folder is a drawing that must be scanned as two segments (top half and bottom half). No whole page image is scanned.

Filenames of segment images with feature designations for each of the two segments:

079sa1.tif 079sb1.tif
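The grid naming in the two examples can be generated with a sketch like the one below. Several points are assumptions inferred from the examples rather than stated rules: the letter is taken to index the row from top to bottom, the digit to index the column from left to right, and the function name segment_filenames is hypothetical. The examples print a three-digit control number while the prose mentions four positions, so the width is left as a parameter.

```python
import string


def segment_filenames(control_number: int, rows: int, columns: int,
                      width: int = 3, extension: str = "tif"):
    """Return a row-by-row grid of segment filenames.

    Feature identifier = 's' + row letter (a, b, c, ...) + column digit (1-9).
    Row/column interpretation and default width are inferred from the examples.
    """
    if not 1 <= rows <= 26 or not 1 <= columns <= 9:
        raise ValueError("grid must be at most 26 rows by 9 columns")
    grid = []
    for row in range(rows):
        letter = string.ascii_lowercase[row]
        grid.append([
            f"{control_number:0{width}d}s{letter}{column}.{extension}"
            for column in range(1, columns + 1)
        ])
    return grid


# Example 1: page 154, a map in twelve parts (four rows of three).
for row in segment_filenames(154, rows=4, columns=3):
    print(" ".join(row))

# Example 2: page 79, top half and bottom half.
print(segment_filenames(79, rows=2, columns=1))
```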

4.8

This filenaming structure is used only when a folded sheet smaller than 8 1/2 x 11 is captured as two whole pages in one image. When print page numbers and features are not tracked, the filename would be cccc for a four (4) digit control page number. When print page numbers and features are tracked, the filename would be cccppppf, where ccc is the three (3) digit control page number, pppp is the print page number on the left page, and f is the most significant feature on either page.
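The two patterns can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, print page numbers are assumed to be zero-padded to four digits, and the feature character x in the usage line is a placeholder, since the feature codes are not listed in this section.

```python
def untracked_page_filename(control_page: int) -> str:
    """Pattern cccc: four-digit control page number only."""
    if not 1 <= control_page <= 9999:
        raise ValueError("control page number must fit in four digits")
    return f"{control_page:04d}"


def tracked_page_filename(control_page: int, left_print_page: int, feature: str) -> str:
    """Pattern cccppppf: control page, print page of the left page, feature.

    Zero-padding of pppp is an assumption; `feature` must be one character.
    """
    if not 1 <= control_page <= 999:
        raise ValueError("control page number must fit in three digits")
    if not 0 <= left_print_page <= 9999:
        raise ValueError("print page number must fit in four digits")
    if len(feature) != 1:
        raise ValueError("feature designation must be a single character")
    return f"{control_page:03d}{left_print_page:04d}{feature}"


print(untracked_page_filename(12))          # 0012
print(tracked_page_filename(12, 10, "x"))   # 0120010x ('x' is a placeholder feature)
```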

4.9

Filenames for Resolution Targets

The filenames for the resolution targets will indicate that the file represents an image of the target, the resolution of the image, and the image type. For all filenames of resolution targets, the first two characters shall be tg; the third character shall represent the resolution (2, 3, or 4 for resolutions of 200, 300, and 400 dpi, respectively); and the fourth and fifth characters shall represent the image type: bt for bitonal, bh for bitonal with halftone correction, gr for grayscale, and co for color. The filename extensions shall be the same as those appropriate for each individual image type: tif or jpg. For example, the filename for an image of a resolution target scanned as a 300 dpi bitonal image with halftone correction shall be tg3bh.tif.
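The target naming rule can be sketched as a lookup. The function name target_filename is hypothetical, and the extension mapping (tif for the bitonal types, jpg for grayscale and color) follows the extensions described in Section 4.6.

```python
RESOLUTION_CODES = {200: "2", 300: "3", 400: "4"}

# Image type code -> filename extension appropriate for that image type.
IMAGE_TYPE_EXTENSIONS = {
    "bt": "tif",  # bitonal
    "bh": "tif",  # bitonal with halftone correction
    "gr": "jpg",  # grayscale
    "co": "jpg",  # color
}


def target_filename(dpi: int, image_type: str) -> str:
    """Build a resolution-target filename: 'tg' + resolution code + type code."""
    if dpi not in RESOLUTION_CODES:
        raise ValueError("resolution must be 200, 300, or 400 dpi")
    if image_type not in IMAGE_TYPE_EXTENSIONS:
        raise ValueError("image type must be one of bt, bh, gr, co")
    return f"tg{RESOLUTION_CODES[dpi]}{image_type}.{IMAGE_TYPE_EXTENSIONS[image_type]}"


print(target_filename(300, "bh"))
# prints tg3bh.tif
```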

4.10

Filenames for SGML-Encoded, Machine-readable Texts and Associated Files

The filename for the SGML-encoded, machine-readable text will be the item identifier followed by the extension sgm. The filename for the Page Information Group file will be the item identifier followed by the extension pgi. The filename for the Reference file will be the item identifier followed by the extension ref. The filename for the Omission Report file will be the item identifier followed by the extension omi. The filename for the Entity file will be the item identifier followed by the extension ent. This applies to all of the preceding naming schemes. For example:

Collection name
Example: Western Travel in Rare Books

Collection identifier
Example: wtrb

Book author and/or title
Example: White, Michael Claringbud. California all the way back to 1828.

LCCN: the Library of Congress catalog card number (a unique number)
Example: ca4-14356 (stored in computer as 04014356)

Book identifier
Example: 014356

SGML text filename
Example: 014356.sgm

Page Information Group file filename
Example: 014356.pgi

Reference file filename
Example: 014356.ref

Omission Report file filename
Example: 014356.omi

Entity file filename
Example: 014356.ent
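The five associated filenames follow mechanically from the item identifier, as this sketch shows. The function name associated_filenames and the dictionary labels are hypothetical conveniences; only the extensions come from the text above.

```python
ASSOCIATED_EXTENSIONS = {
    "SGML text": "sgm",
    "Page Information Group file": "pgi",
    "Reference file": "ref",
    "Omission Report file": "omi",
    "Entity file": "ent",
}


def associated_filenames(item_identifier: str) -> dict[str, str]:
    """Map each associated file type to its filename for one item."""
    return {
        label: f"{item_identifier}.{extension}"
        for label, extension in ASSOCIATED_EXTENSIONS.items()
    }


for label, filename in associated_filenames("014356").items():
    print(f"{label}: {filename}")
```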
