Library of Congress and Ameritech Competition (1996-1999): Resources for Applicants -- Digital Formats for Content Reproductions [1996-08-20]

The National Digital Library Program:
Archived Documentation

The Library of Congress / Ameritech National Digital Library Competition (1996-1999)
Competition
Home Technical Information for the 1998/99 Competition > Resources for Applicants > Digital Formats for Content Reproductions

Digital Formats for Content Reproductions

Carl Fleischhauer
Technical Coordinator, National Digital Library Program, Library of Congress
August 20, 1996

I. Introduction

The trio of documents presents a snapshot of the Library of Congress digital conversion activity as of August 1996. The ideas and approaches outlined represent the collection-digitization effort of the American Memory pilot program (1990-1994) and the operational National Digital Library Program (1995-1996) that has followed the pilot. The institution recognizes that many avenues remain unexplored and that new technology will lead to changing practices.

Most of the formats listed below are in use for World Wide Web access to American Memory collections released or in production in 1996; a few are planned for use in the near future or are alternatives that other institutions have employed. The Library's selections represent an attempt to balance quality of reproduction, convenient accessibility for the general public over the World Wide Web, likely longevity of format (using standard formats where possible and proprietary formats only where widely deployed), and production cost. For digital reproductions of original items, the greatest stability and public accessibility obtain for images that reproduce manuscript documents, printed matter, and pictorial materials, and for searchable texts, including those that employ Standard Generalized Markup Language. Formats that reproduce the time-based content of sound recordings and moving-image collections, however, have a less-certain future.

II. Pictorial Materials

For pictorial collections, the Library produces three image types:

Thumbnail
A small image presented with the bibliographic record, to allow users to judge whether they wish to take the time to retrieve a higher quality image.

Reference
The "fetchable" higher quality image. In current projects, only one reference image is provided; future collections may offer two (or more) at varying levels of resolution.

Archive
An uncompressed (or, in the future, lossless-compressed) image free of the artifacts resulting from lossy compression, provided to users for reproduction or held for future reprocessing as compression standards change. Not provided at this time; may be provided to users as a downloadable file in the future.

Alternative format
Several organizations have used the Kodak PhotoCD (Image Pac) format in their imaging projects. Originally associated only with CD-ROM disks, this multi-resolution format may now be written to other storage media. The Library has not had extensive experience with PhotoCD/Image Pac. Archives wishing to produce collections that are interoperable with those at the Library of Congress and who plan to use PhotoCD technology should either determine how direct access to those images may be provided to WWW clients or plan to reprocess the Image Pac images to produce GIF and JFIF/JPEG images for WWW access in association with the American Memory site.

III. Textual Materials Reproduced as Searchable Text and Images

Searchable transcriptions can be a tremendous aid to a researcher seeking instances of particular words or phrases in a textual work. Transcribed text, especially when encoded with markup language, can also facilitate the researcher's navigation of a longer document. The cost of providing perfect or near-perfect transcriptions is very high, however, and, for many researchers, proper understanding of a document may depend upon seeing a facsimile (and in some cases, the original). For these reasons, the Library has experimented with the presentation of manuscript and printed matter items as a coordinated set of page-images and searchable text. In some American Memory pilot-period collections, separate images of tables and illustrations were provided in addition to, or in lieu of, page-image sets.

The Library encodes its documents using Standard Generalized Markup Language (SGML), as described below. Since the Library always places SGML texts online together with bibliographic records or a finding aid, the headers within the documents contain minimal bibliographic information. For a more detailed description of the Library's approach to text-reproduction using SGML, see American Memory DTD for Historical Documents.

Full-function SGML viewers for the WWW are not available free or as shareware at this time. For this reason, the Library derives an HTML version of the text from the SGML version and places both online.

The page images included in the text reproduction sets employ the formats for tonal and bitonal images described below.

Searchable text
The Library's transcription requirement for contractors is 99.95 percent accuracy compared to the original (in future contracts, 99.995 percent will be required for some items). The texts are encoded with SGML, using the American Memory document type definition (DTD), which conforms to the international guidelines for humanities texts, the Text Encoding Initiative (TEI). Entity values in the document consist of the filename, without extension, for each page image, illustration, or table. The entity file lists entity declarations (which include entity values) and their corresponding filenames with extensions.
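
To give a sense of what the two accuracy thresholds mean in practice, the following sketch converts each percentage into an error allowance. It assumes that accuracy is counted per character and uses a hypothetical page of 2,000 characters; neither assumption comes from the Library's contracts, and the function names are illustrative only.

```python
from fractions import Fraction

def allowed_errors(char_count, accuracy_percent):
    """Maximum number of erroneous characters permitted at the given accuracy.

    accuracy_percent is passed as a string so the arithmetic stays exact.
    """
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return char_count * error_rate

def chars_per_error(accuracy_percent):
    """Average number of characters transcribed per permitted error."""
    return 1 / (1 - Fraction(accuracy_percent) / 100)

# Hypothetical page length (an assumption, not a Library figure).
PAGE_CHARS = 2000

current = allowed_errors(PAGE_CHARS, "99.95")    # 1 error per page
future = allowed_errors(PAGE_CHARS, "99.995")    # 1/10 error per page

print(current, future)
print(chars_per_error("99.95"), chars_per_error("99.995"))
```

Under these assumptions, the current requirement allows one error on such a page, while the stricter requirement allows one error in every ten pages.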

Alternate formats:
Texts marked up with HTML or Text Encoding Initiative-conformant SGML DTDs other than that developed by American Memory. Archives that wish to produce collections that are interoperable with those at the Library of Congress and that plan to use an alternate approach should (a) make available the associated DTD, style sheets, and navigators for use by the general public and (b) provide sufficient title and identifier information within the document header to facilitate integration into the American Memory interface.

IV. Textual Materials Reproduced as Images

The following discussion of text page-images applies to images associated with searchable texts (see preceding section) and image-only presentations of manuscript or printed documents.

The Library has been experimenting with tonal (color and grayscale) reproduction of manuscript and older printed documents, partly after noting shortcomings in some bitonal (black and white) images produced during the American Memory pilot. Original items with a mixture of lighter and darker markings are often more successfully reproduced in a tonal rather than in a bitonal image. Typography or line art, however, is often successfully reproduced in a bitonal image. Bitonal images usually provide better printed output. Thus, some collections may warrant the production of both types of images. Multiple versions of an image of a page may also be needed to provide a WWW browser-based means for paging through documents (see section V.).

Tonal images of manuscript and printed documents.
At this writing, the Library's only online example of tonal document reproduction is the small collection of Walt Whitman notebooks. A demonstration project to refine a tonal-image approach to manuscripts is underway with a portion of the Federal Theater Project collection. In this latter project, the best-quality (or "archival") version of the image is tonal and the access ("reference") image is bitonal.

Tonal image types:
Reference
In the current Whitman presentation, this is the image "fetched" from the page menu. See Section II.D on browser-based paging for an alternate approach that requires GIF (Graphic Interchange Format) files.

Archive (uncompressed)
An uncompressed (or, in the future, lossless-compressed) image free of the artifacts resulting from lossy compression, provided to users for reproduction or held for future reprocessing as compression standards change. May be provided to users as a downloadable file in the future.

Archive (compressed)
The project to demonstrate approaches for manuscript digitization is testing the production of a compressed tonal image for archiving (or highest quality display) together with a bitonal image for reference access. A steering committee argued that legibility was the highest goal and that modest compression artifacts could be tolerated for the sake of smaller file sizes.

Alternate formats:
PDF (Portable Document Format from Adobe Corporation). The Library has not had extensive experience with this format; archives wishing to produce collections that are interoperable with those at the Library of Congress and who plan to use PDF must be capable of helping to guide their implementation. (See also Section II.D on browser-based paging.)

Bitonal images of manuscript and printed documents. The use of the lossless CCITT (FAX) compression for bitonal images may mean that one image may serve both reference and archiving needs. For some items, however, higher resolution may be desired for an archival copy. Projects patterned on the book-reformatting work pioneered at the Cornell University Library may fall into this category.

Alternate formats:
PDF (Portable Document Format from Adobe Corporation), the new JBIG standard, or other proprietary but widely used and widely supported formats. The Library has not had extensive experience with these formats; archives wishing to produce collections that are interoperable with those at the Library of Congress and who plan to use them must be capable of helping to guide their implementation. (See also Section II.D on browser-based paging.)
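
The reason one CCITT-compressed bitonal image can serve both reference and archiving needs is that the compression is lossless: the decoded pixels are identical to the scanned ones. The sketch below illustrates only the run-length idea that underlies one-dimensional CCITT coding; the actual standards add fixed code tables and two-dimensional coding, which are omitted here. Function names are hypothetical.

```python
from itertools import groupby

def rle_encode(row):
    """Encode a scanline of 0 (white) / 1 (black) pixels as run lengths.

    By convention the first run is white, so a line that starts with
    black gets a leading white run of length zero.
    """
    runs = [len(list(group)) for _, group in groupby(row)]
    if row and row[0] == 1:
        runs.insert(0, 0)
    return runs

def rle_decode(runs):
    """Rebuild the scanline from alternating white/black run lengths."""
    row = []
    color = 0  # runs start with white
    for length in runs:
        row.extend([color] * length)
        color = 1 - color
    return row

# A 40-pixel scanline: mostly white with two short black strokes.
scanline = [0] * 10 + [1] * 3 + [0] * 20 + [1] * 2 + [0] * 5
runs = rle_encode(scanline)

print(runs)
print(rle_decode(runs) == scanline)
```

Five run lengths stand in for forty pixels, and decoding restores every pixel exactly, which is the property that matters for archival use.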

The special problem of printed halftone illustrations. Printed halftones present special problems in reproduction because of interference between the spatial frequency of the halftone dot pattern and the spatial frequency applied by scanning and/or output devices. The interference "waves" caused by the intersection of the two frequencies manifest themselves as moiré patterns that degrade the image. There are a number of treatments that can mitigate or correct this degradation but not all are practical in a production-line environment. Possible treatments include the following:

Descreening and rescreening
This approach removes the halftone dots and converts the image to grayscale, then rescreens it to produce a new halftone. Xerox has incorporated this approach in some of its advanced scanning devices and it has also been employed in the Cornell University Library's book-reformatting projects. In the implementations known to the Library, the process seems to depend upon "four-square" capture of the source items. This requires the placement of flat sheets of paper on a scanner's glass surface which requires that books be disbound. Furthermore, if a page containing both text block and illustration is captured, the system (or operator) must zone the page and capture text and illustration separately. The Library of Congress has been digitizing books for access and not for preservation. Since the volumes have not been disbound, the Library has not had the opportunity to employ this technique.

Capture at high enough resolution to reproduce the halftone dots
This approach requires capture resolutions at one or more multiples of the original halftone screen. Thus, for books with high-quality illustrations, the capture might be at 600 dpi or higher. In order to reproduce the scanned image without loss, the screen display or printer must also offer high resolution. In order to produce reduced-resolution (smaller) images for access, a post process consisting of descreening/rescreening, converting to grayscale, or dot-randomization would have to be applied. The Library has not availed itself of this technique.

Grayscale reproduction
For many illustrations, this approach offers a reasonable onscreen rendering at moderate resolutions. Since printed output from a typical laser printer requires that grayscale images be halftones, paper copies produced from these grayscale images may suffer from moiré patterns. If a page containing both text block and illustration is captured in grayscale at moderate levels of resolution (e.g., 200-400 dpi), the grayscale treatment that benefits the illustration may injure the clarity of the typography. Thus, one may wish to zone the page and capture illustration and text separately or capture two versions of the page image. The Library has not availed itself of this technique.

Randomization of scanner "dot pattern"
This process reproduces printed halftones as bitonal images to which a special diffuse dithering treatment is applied at scan time (or in post-processing a grayscale image). This reduces but does not eliminate moiré patterns. The effect on typography is not as severe as the effect produced by grayscale capture, although it adds speckles to white areas surrounding the type. The Library has employed this approach in a number of collections, capturing images at 300 dpi. The resulting images print on a laser printer with good results but do not rescale well for screen display. In order to provide a screen display of the whole illustration for which no rescaling is required, the Library has also created thumbnail versions of the larger image. These are not incorporated in the current American Memory WWW presentation at this time.
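
The scanner's diffuse dithering treatment is not specified in detail here. As a rough illustration of the post-processing variant (grayscale in, bitonal out), the sketch below uses Floyd-Steinberg error diffusion, a well-known technique substituted for illustration; it should not be read as the algorithm used by the Library's scanner. The function name and the test patch are hypothetical.

```python
def floyd_steinberg(gray):
    """Convert rows of 0-255 gray values (0 = black) to rows of 0 or 255.

    Each pixel is thresholded, and the rounding error is spread to
    neighbors not yet visited, so local average tone is roughly kept
    while the dot placement is irregular rather than a regular screen.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    work = [[float(v) for v in row] for row in gray]
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 255 if old >= 128 else 0
            out[y][x] = new
            err = old - new
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out

# A flat mid-gray patch becomes a scattered mix of black and white dots.
patch = [[128] * 32 for _ in range(32)]
dithered = floyd_steinberg(patch)
mean = sum(map(sum, dithered)) / (32 * 32)
print(round(mean))
```

Because the result is a fine pattern of isolated dots, it prints well at its native resolution but, as noted above, does not rescale gracefully for screen display.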

Image types for printed halftones:
Fetchable Reference
The Library's randomized-dot-pattern images have been captured on a Xerox K5200 scanner. When the diffuse dithering treatment is applied, this scanner's software creates files in the PCX format (a format associated with ZSoft's PC PaintBrush software).

Bitonal thumbnail image (not offered on WWW at this time)
The images described here are bitonal; the Library is considering creating grayscale thumbnails for printed halftones in the future.

V. Images for Use in Browser-based Paging Sets

The companion document Digital Historical Collections: Types, Elements, and Construction outlines some of the devices that may be used to provide users with the many images (or other digital objects) that are linked to a single reference, including browser-based paging. If browser-based paging is adopted, images in the GIF format must be produced. The Library's plans call for the production of GIF images with reduced resolution (when compared to the source image) and altered tonality (bitonal made into grayscale). For example, in the case of the 150 dpi, 256-shade grayscale images of the Walt Whitman notebook pages offered online, the GIF images might be at 75 dpi and offer only 16 shades of gray. In the case of a 300 dpi bitonal image of a page of printed matter, the GIF image might be from 75-100 dpi and in 16 or 32 shades of gray, produced with a process that shades the bitonal image to gray at the same time that it is rescaled.
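
A process that shades a bitonal image to gray while rescaling it can be sketched as block averaging: each block of source pixels becomes one output pixel whose gray level reflects the share of white pixels in the block. The sketch below is one plausible approach under that assumption, not the Library's production process; the function name and sample image are hypothetical. A factor of 4 corresponds to going from 300 dpi to 75 dpi.

```python
def shade_and_rescale(bitonal, factor, shades):
    """Reduce a bitonal image by `factor` and express it in `shades` grays.

    bitonal: rows of 0 (white) / 1 (black) pixels.
    Returns rows of gray levels from 0 (black) to shades - 1 (white).
    Pixels that do not fill a complete block at the right or bottom
    edge are dropped.
    """
    height = len(bitonal) // factor
    width = len(bitonal[0]) // factor if bitonal else 0
    area = factor * factor
    out = []
    for by in range(height):
        row = []
        for bx in range(width):
            black = sum(
                bitonal[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            white = area - black
            # Integer rounding of white / area * (shades - 1).
            row.append((2 * white * (shades - 1) + area) // (2 * area))
        out.append(row)
    return out

# Hypothetical 8 x 8 source: a solid black block, two white blocks,
# and one block that is one-quarter black.
sample = (
    [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(4)]
    + [[1, 1, 1, 1, 0, 0, 0, 0]]
    + [[0] * 8 for _ in range(3)]
)

print(shade_and_rescale(sample, 4, 16))
```

With a factor of 4, each block holds 16 source pixels, which is why roughly 16 shades of gray are a natural fit for a 300 dpi to 75 dpi reduction.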

VI. Maps

The Library's Geography and Map Division is developing an approach for digitizing map collections, with the advice of the division's Center for Geographic Information. For most of the maps selected for the Library's program, the preliminary finding is that good legibility will be afforded in tonal images at a spatial resolution of 300 dpi. Archival copies will be stored without compression (or with lossless compression). There are many challenges associated with Internet transmission, display, and printing of very large images and the Library has not formulated plans for the presentation of maps in the WWW environment.
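
The scale of the "very large images" problem can be estimated with simple arithmetic. The sketch below computes the uncompressed size of a tonal image at 300 dpi; the 24 x 36 inch sheet size and the bit depths are assumptions chosen for illustration, not figures from the Geography and Map Division.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size in bytes of an uncompressed raster image, ignoring header overhead."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Hypothetical 24 x 36 inch map sheet scanned at 300 dpi.
color_bytes = uncompressed_bytes(24, 36, 300, 24)  # 24-bit color
gray_bytes = uncompressed_bytes(24, 36, 300, 8)    # 8-bit grayscale

print(color_bytes, gray_bytes)
```

Under these assumptions a single color sheet occupies roughly 233 million bytes uncompressed, which helps explain why transmission, display, and printing remain open questions.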

VII. Recorded Sound

The large files required to reproduce audio have launched the computer multimedia industry on a constant search for new and better compression and playback schemes. For this reason, the digital audio formats suitable for the WWW are less stable than those for text and pictorial images; thus the audio files produced today will become obsolete more quickly. When materials are remastered, the moderate fidelity of current audio formats will mean that the source material or an intermediate format, e.g., DAT (digital audio tapes), must be used (again) to create files in the new format.

The audio selections included in American Memory collections at this time are "downloadable," meaning that the browser must copy a file to the local computer before it can begin to play them. Since the files are large (the four-minute recordings in the Nation's Forum collection run about 2 megabytes each), this is space- and time-consuming. In order to address this problem, the Library is preparing "streaming" files that will begin to play as they are transmitted through the network. Although more convenient, these files are of lower fidelity than the downloadable examples.
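
The trade-off between downloadable and streaming files follows from the figures above. The sketch below takes the stated size (about 2 megabytes for four minutes) and estimates the transfer time and the data rate a link would need to play the same file in real time. The 28,800 bit/s modem speed is an assumption, protocol overhead is ignored, and a megabyte is taken as 1,000,000 bytes.

```python
def transfer_seconds(file_bytes, link_bits_per_s):
    """Time to copy a file over a link, ignoring protocol overhead."""
    return file_bytes * 8 / link_bits_per_s

def realtime_bits_per_s(file_bytes, duration_s):
    """Link speed needed to deliver the file as fast as it plays."""
    return file_bytes * 8 / duration_s

FILE_BYTES = 2_000_000      # "about 2 megabytes" (taken as 10**6 bytes each)
DURATION_S = 4 * 60         # four-minute recording
MODEM_BITS_PER_S = 28_800   # assumed dial-up modem speed

download = transfer_seconds(FILE_BYTES, MODEM_BITS_PER_S)
needed = realtime_bits_per_s(FILE_BYTES, DURATION_S)

print(f"download takes about {download / 60:.1f} minutes")
print(f"real-time playback needs about {needed:.0f} bit/s")
```

On these assumptions the download takes more than twice as long as the recording plays, and real-time delivery would need more than twice the modem's capacity, so a streaming version has to carry less data per second and therefore offers lower fidelity.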

Downloadable files (online today)
The Library plans to replace these files with the downloaded type described below by early 1997.

VIII. Moving-image Materials

The large files required to reproduce motion pictures and video have launched the computer multimedia industry on a constant search for new and better compression and playback schemes. For this reason, the moving-image formats suitable for the WWW are less stable than those for text and pictorial images; thus the moving-image files produced today will become obsolete more quickly. When materials are remastered, the moderate quality of current moving-image formats will mean that the source material or an intermediate format, e.g., videotape, must be used (again) to create files in the new format.

Files (online today)
The Library plans to supplement these files with the types described below by late 1996.

IX. File Headers

The Library plans to add data to the file headers for all of its reproductions over time. For now, a preliminary implementation exists for the four types listed below. Header content will almost certainly be a part of, or interplay with, the administrative and structural metadata associated with the repository described in Digital Historical Collections: Types, Elements, and Construction. The development and implementation of headers will keep pace with the Library's overall design process for metadata.

TIFF image files
The most fully developed image header scheme applies to the archival version of pictorial image files (as distinct from text pages). The Library has been using TIFF version 5.0 but expects to begin using version 6.0 very soon. It is worth noting that the Library's use of TIFF formats and headers has not always gone smoothly, perhaps the inevitable result of using an industry rather than a true standard. This accounts for some of the uncertainties imbedded in the description that follows. The Library has used the following TIFF tags. Contractors have been asked to provide typical or expected data for most tags; exceptions to the norm are noted in the comments column.

Regarding the fields that have pixel counts or other dimensional information, it is worth noting that most of the Library's pictorial collections have been digitized from negatives, copy negatives, or copy prints; for most items, the actual dimensions of original prints or artists' works (as displayed) are neither known nor easily incorporated in scan-time data.
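
To show where such header data lives in a file, the following sketch reads tag values from the first image file directory (IFD) of a TIFF file using only the standard library. It handles just the ASCII, SHORT, LONG, and RATIONAL field types. The sample file it builds is a header-only illustration with hypothetical values (for example, an ImageWidth of 3000), not a complete or valid image, and the helper names are hypothetical.

```python
import struct

# Tag numbers and names for the tags the Library reports using.
TAG_NAMES = {
    254: "NewSubfileType", 256: "ImageWidth", 257: "ImageLength",
    258: "BitsPerSample", 259: "Compression", 262: "PhotometricInterpretation",
    269: "DocumentName", 273: "StripOffsets", 277: "SamplesPerPixel",
    278: "RowsPerStrip", 279: "StripByteCounts", 282: "XResolution",
    283: "YResolution", 296: "ResolutionUnit", 306: "DateTime", 315: "Artist",
}
# Bytes per value for TIFF field types 2 (ASCII), 3 (SHORT), 4 (LONG), 5 (RATIONAL).
TYPE_SIZES = {2: 1, 3: 2, 4: 4, 5: 8}

def read_tiff_tags(data):
    """Return {tag name: value} for the first IFD of a TIFF byte string."""
    order = {b"II": "<", b"MM": ">"}[data[:2]]
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for i in range(entry_count):
        pos = ifd_offset + 2 + 12 * i
        tag, ftype, count = struct.unpack_from(order + "HHI", data, pos)
        if ftype not in TYPE_SIZES:
            continue  # field type not handled in this sketch
        size = TYPE_SIZES[ftype] * count
        # Values of four bytes or fewer are stored in the entry itself;
        # longer values are stored elsewhere and the entry holds an offset.
        if size <= 4:
            start = pos + 8
        else:
            (start,) = struct.unpack_from(order + "I", data, pos + 8)
        raw = data[start:start + size]
        if ftype == 2:
            value = raw.rstrip(b"\0").decode("ascii")
        elif ftype == 5:
            nums = struct.unpack(order + "II" * count, raw)
            pairs = list(zip(nums[0::2], nums[1::2]))
            value = pairs[0] if count == 1 else pairs
        else:
            nums = struct.unpack(order + ("H" if ftype == 3 else "I") * count, raw)
            value = nums[0] if count == 1 else list(nums)
        tags[TAG_NAMES.get(tag, tag)] = value
    return tags

def build_sample():
    """Assemble a little-endian, header-only TIFF with four tags (illustration only)."""
    fields = [
        (256, 4, 1, struct.pack("<I", 3000)),          # hypothetical width
        (269, 2, None, b"yan/1a12345u.tif\0"),
        (282, 5, 1, struct.pack("<II", 300, 1)),       # 300/1 dots per inch
        (315, 2, None, b"Library of Congress\0"),
    ]
    ifd_offset = 8
    data_offset = ifd_offset + 2 + 12 * len(fields) + 4
    ifd = struct.pack("<H", len(fields))
    extra = b""
    for tag, ftype, count, payload in fields:
        count = len(payload) if count is None else count
        if len(payload) <= 4:
            value_field = payload.ljust(4, b"\0")
        else:
            value_field = struct.pack("<I", data_offset + len(extra))
            extra += payload + b"\0" * (len(payload) % 2)  # keep offsets even
        ifd += struct.pack("<HHI", tag, ftype, count) + value_field
    ifd += struct.pack("<I", 0)  # no further IFDs
    return struct.pack("<2sHI", b"II", 42, ifd_offset) + ifd + extra

tags = read_tiff_tags(build_sample())
print(tags)
```

Resolution is stored as a numerator and denominator pair, so 300 dots per inch appears as (300, 1), to be read together with the ResolutionUnit tag.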

SGML text files
The American Memory DTD for historical texts includes a simplified Text Encoding Initiative (TEI) header. The header is an integral part of the SGML-encoded document. Since the Library accompanies its marked-up texts with bibliographic records or finding aids, the header contains only a handful of MARC field equivalents: title statement and statement of responsibility (MARC field 245), copyright registration number (MARC field 017), and the Library of Congress catalog card number (LCCN), when one exists.

WAVE audio files
The Library plans to use the following Resource Interchange File Format (RIFF) INFO list chunk data with its WAVE files:
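
As a sketch of how a RIFF INFO list chunk is laid out, the example below packs and then re-reads a small set of text fields. The chunk identifiers INAM (name), IART (artist), and ICRD (creation date) are standard RIFF INFO identifiers, but choosing them here is an assumption, as are the sample values and the function names; this is not a statement of which fields the Library uses.

```python
import struct

def riff_chunk(chunk_id, payload):
    """Pack one RIFF chunk: 4-byte ID, little-endian size, data, pad byte if odd."""
    pad = b"\0" if len(payload) % 2 else b""
    return chunk_id + struct.pack("<I", len(payload)) + payload + pad

def build_info_list(fields):
    """Build a LIST chunk of type INFO from {4-letter ID: text}."""
    subchunks = b"".join(
        riff_chunk(key.encode("ascii"), text.encode("ascii") + b"\0")
        for key, text in fields.items()
    )
    return riff_chunk(b"LIST", b"INFO" + subchunks)

def parse_info_list(chunk):
    """Read a LIST/INFO chunk back into {4-letter ID: text}."""
    if chunk[:4] != b"LIST" or chunk[8:12] != b"INFO":
        raise ValueError("not a LIST/INFO chunk")
    (size,) = struct.unpack_from("<I", chunk, 4)
    end = 8 + size
    pos = 12
    fields = {}
    while pos < end:
        key = chunk[pos:pos + 4].decode("ascii")
        (length,) = struct.unpack_from("<I", chunk, pos + 4)
        text = chunk[pos + 8:pos + 8 + length].rstrip(b"\0").decode("ascii")
        fields[key] = text
        pos += 8 + length + (length % 2)  # skip the pad byte after odd sizes
    return fields

# Hypothetical values for illustration only.
sample_fields = {
    "INAM": "Sample recording title",
    "IART": "Library of Congress",
    "ICRD": "1996-08-20",
}
info_chunk = build_info_list(sample_fields)
print(parse_info_list(info_chunk))
```

In a WAVE file, a chunk like this sits inside the outer RIFF container alongside the format and data chunks, so players that do not understand it can skip it by its declared size.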


TIFF tags used by the Library (see "TIFF image files" above):

Description	Tag	Comments
NewSubfile Type	254
ImageWidth	256	Actual pixel count
ImageLength	257	Actual pixel count
BitsPerSample	258
Compression	259
Photometric Interpretation	262
Document Name	269	This data is usually the collection identifier and filename, e.g., from the Yanker poster collection: yan/1a12345u.tif. An alternate would be to use the file identifier rather than the name (i.e., exclude the "u" for "uncompressed" and the extension): yan/1a12345
Strip Offsets	273
Samples Per Pixel	277
Rows Per Strip	278
Strip Byte Counts	279
XResolution	282	dots per inch (see the note on dimensional information above)
YResolution	283	dots per inch (see the note on dimensional information above)
Resolution Unit	296	2 (inch)
Date Time	306	date and time scanned
Artist	315	Library of Congress
