The Library of Congress / Ameritech National Digital Library Competition (1996-1999)
Competition Home |
Technical Information for the 1998/99 Competition > Resources for Applicants > Digital Formats for Content Reproductions |
This document and two others are intended to take the place of the 1994 document titled Elements of Digital Historical Collections. The new documents have been prepared, in part, to offer guidance to applicants in the Library of Congress/Ameritech competition. The other two documents are Digital Historical Collections: Types, Elements, and Construction and Access Aids and Interoperability.
The trio of documents presents a snapshot of the Library of Congress digital conversion activity as of August 1996. The ideas and approaches outlined represent the collection-digitization effort of the American Memory pilot program (1990-1994) and the operational National Digital Library Program (1995-1996) that has followed the pilot. The institution recognizes that many avenues remain unexplored and that new technology will lead to changing practices.
Most of the formats listed below are in use for World Wide Web access to American Memory collections released or in production in 1996; a few are planned for use in the near future or are alternatives that other institutions have employed. The Library's selections represent an attempt to balance quality of reproduction, convenient accessibility for the general public over the World Wide Web, likely longevity of format (using standard formats where possible and proprietary formats only where widely deployed), and production cost. For digital reproductions of original items, the greatest stability and public accessibility obtains for images that reproduce manuscript documents, printed matter, and pictorial materials, and for searchable texts, including those that employ Standard Generalized Markup Language (SGML). Formats that reproduce the time-based content of sound recordings and moving-image collections, however, have a less-certain future.
For pictorial collections, the Library produces three image types:
Thumbnail
A small image presented with the bibliographic record, to allow users
to judge whether they wish to take the time to retrieve a higher quality
image.
Tonal depth | Format | Compression | Spatial resolution |
8 bits per pixel | GIF | Native to GIF | Circa 200x200 pixels |
Reference
The "fetchable" higher quality image. In current projects,
only one reference image is provided; future collections may offer two
(or more) at varying levels of resolution.
Tonal depth | Format | Compression | Spatial resolution |
Grayscale: 8 bits per pixel; color: 24 bits per pixel | JFIF (JPEG File Interchange Format) | JPEG (generally about 10:1 compression) | Moderate class ranges from about 500x400 to about 1000x700 pixels; higher resolution class (future) will range from 2000x1400 to 4000x3000; both moderate and higher resolution will be offered to users |
Archive
An uncompressed (or, in the future, lossless-compressed) image free
of the artifacts resulting from lossy compression, provided to users for
reproduction or held for future reprocessing as compression standards change.
Not provided at this time; may be provided to users as a downloadable file
in the future.
Tonal depth | Format | Compression | Spatial resolution |
Grayscale: 8 bits per pixel; color: 24 bits per pixel | TIFF (Tagged Image File Format) | Uncompressed | Moderate class ranges from about 500x400 to about 1000x700 pixels; higher resolution class (LC examples coming in future) will range from 2000x1400 to 4000x3000; only the highest resolution will be archived |
Alternative format
Several organizations have used the Kodak PhotoCD (Image Pac) format
in their imaging projects. Originally associated only with CD-ROM disks,
this multi-resolution format may now be written to other storage media.
The Library has not had extensive experience with PhotoCD/Image Pac. Archives
wishing to produce collections that are interoperable with those at the
Library of Congress and who plan to use PhotoCD technology should either
determine how direct access to those images may be provided to WWW clients
or plan to reprocess the Image Pac images to produce GIF and JFIF/JPEG
images for WWW access in association with the American Memory site.
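The tonal depths and pixel dimensions given above for reference and archive images imply rough file sizes. The following is a minimal Python sketch, assuming the upper bound of each resolution class, 24-bit color, and the "about 10:1" JPEG ratio taken at face value. The function names are illustrative only, and real JPEG sizes vary with image content.

```python
def raw_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Size of the uncompressed pixel data, ignoring file headers."""
    return width * height * bits_per_pixel // 8


def estimated_jpeg_bytes(width: int, height: int, bits_per_pixel: int,
                         ratio: float = 10.0) -> int:
    """Rough compressed size, assuming a fixed compression ratio."""
    return round(raw_bytes(width, height, bits_per_pixel) / ratio)


# Upper bounds of the two resolution classes, 24-bit color (assumed).
for label, (w, h) in {"moderate": (1000, 700), "higher": (4000, 3000)}.items():
    archive = raw_bytes(w, h, 24)
    reference = estimated_jpeg_bytes(w, h, 24)
    print(f"{label}: archive ~{archive:,} bytes, reference ~{reference:,} bytes")
```

Under these assumptions an uncompressed archive image in the higher resolution class is roughly seventeen times the size of one in the moderate class, which is one reason the choice of what to archive and what to serve matters.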
Searchable transcriptions can be a tremendous aid to a researcher seeking instances of particular words or phrases in a textual work. Transcribed text, especially when encoded with markup language, can also facilitate the researcher's navigation of a longer document. The cost of providing perfect or near-perfect transcriptions is very high, however, and, for many researchers, proper understanding of a document may depend upon seeing a facsimile (and in some cases, the original). For these reasons, the Library has experimented with the presentation of manuscript and printed matter items as a coordinated set of page-images and searchable text. In some American Memory pilot-period collections, separate images of tables and illustrations were provided in addition to, or in lieu of, page-image sets.
The Library encodes its documents using Standard Generalized Markup Language (SGML), as described below. Since the Library always places SGML texts online together with bibliographic records or a finding aid, the headers within the documents contain minimal bibliographic information. For a more detailed description of the Library's approach to text-reproduction using SGML, see American Memory DTD for Historical Documents.
Full-function SGML viewers for the WWW are not available free or as shareware at this time. For this reason, the Library derives an HTML version of the text from the SGML version and places both online.
The page images included in the text reproduction sets employ the formats
for tonal and bitonal images described below.
Searchable text
The Library's transcription requirement for contractors is 99.95 percent
accuracy compared to the original (in future contracts, 99.995 percent will be
required for some items). The texts are encoded with SGML, using the American
Memory document type definition (DTD), which conforms to the international
guidelines for humanities texts, the Text Encoding Initiative (TEI). Entity
values in the document consist of the filename, without extension, for
each page image, illustration, or table. The entity file lists entity
declarations (which include entity values) and their corresponding filenames
with extensions.
SGML Document Type Definition | Character sets | Associated file | Filename extensions |
American Memory DTD | ISO 646; the IBM extended-character sets are represented by the publicly declared entity reference sets in ISO 8879. | Entity file | sgm (for the text), ent (for the entity file) |
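The accuracy figures quoted for searchable text can be turned into an error budget. This is a minimal sketch, assuming accuracy is counted per character; the document does not say how it is measured, and the function name is illustrative. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def allowed_errors(char_count: int, accuracy_percent: str) -> Fraction:
    """Maximum erroneous characters permitted at the given accuracy.

    `accuracy_percent` is passed as a string (e.g. "99.95") so that it is
    converted to an exact fraction rather than a binary float.
    """
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return char_count * error_rate


for accuracy in ("99.95", "99.995"):
    budget = allowed_errors(10_000, accuracy)
    print(f"{accuracy} percent: {float(budget)} errors per 10,000 characters")
```

On this per-character assumption, 99.95 percent allows 5 errors in 10,000 characters, and 99.995 percent allows one error in 20,000.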
Alternate formats:
Texts marked up with HTML or Text Encoding Initiative-conformant SGML
DTDs other than that developed by American Memory. Archives wishing to
produce collections that are interoperable with those at the Library of
Congress and who plan to use an alternate approach should (a) make available
the associated DTD, style sheets, and navigators for use by the general
public and (b) provide sufficient title and identifier information within
the document header to facilitate integration into the American Memory
interface.
The following discussion of text page-images applies to images associated with searchable texts (see preceding section) and image-only presentations of manuscript or printed documents.
The Library has been experimenting with tonal (color and grayscale)
reproduction of manuscript and older printed documents, partly after noting
shortcomings in some bitonal (black and white) images produced during the
American Memory pilot. Original items with a mixture of lighter and darker
markings are often more successfully reproduced in a tonal rather than
in a bitonal image. Typography or line art, however, is often successfully
reproduced in a bitonal image. Bitonal images usually provide better printed
output. Thus, some collections may warrant the production of both types
of images. Multiple versions of an image of a page may also be needed to
provide a WWW browser-based means for paging through documents (see section
V.).
Tonal images of manuscript and printed documents.
At this writing, the Library's only online example of tonal document reproduction
is the small collection of Walt Whitman notebooks.
Tonal image types:
Reference
In the current Whitman presentation, this is the image "fetched"
from the page menu. See Section II.D on browser-based paging for an alternate
approach that requires GIF (Graphic Interchange Format) files.
Tonal depth | Format | Compression | Spatial resolution |
Grayscale: 8 bits per pixel; color: 24 bits per pixel | JFIF | JPEG | For the Whitman notebooks: 150 dpi |
Archive (uncompressed)
An uncompressed (or, in the future, lossless-compressed) image free
of the artifacts resulting from lossy compression, provided to users for
reproduction or held for future reprocessing as compression standards change.
May be provided to users as a downloadable file in the future.
Tonal depth | Format | Compression | Spatial resolution |
Grayscale: 8 bits per pixel; color: 24 bits per pixel | TIFF | Uncompressed | For the Whitman notebooks: 300 dpi |
Archive (compressed)
The project to demonstrate approaches for manuscript digitization is
testing the production of a compressed tonal image for archiving (or highest
quality display) together with a bitonal image for reference access. A
steering committee argued that legibility was the highest goal and that
modest compression artifacts could be tolerated for the sake of smaller
file sizes.
Tonal depth | Format | Compression | Spatial resolution |
Grayscale: 8 bits per pixel; color: 24 bits per pixel | JFIF | JPEG | 300 dpi |
Alternate formats:
PDF (Portable Document Format from Adobe Corporation). The Library
has not had extensive experience with this format; archives wishing to
produce collections that are interoperable with those at the Library of
Congress and who plan to use PDF must be capable of helping to guide their
implementation. (See also Section II.D on browser-based paging.)
Bitonal images of manuscript and printed documents. The use of
the lossless CCITT (FAX) compression for bitonal images may mean that one
image may serve both reference and archiving needs. For some items, however,
higher resolution may be desired for an archival copy. Projects patterned
on the book-reformatting work pioneered at the Cornell University Library
may fall into this category.
Image types:
Fetchable Reference
Tonal depth | Format | Compression | Spatial resolution |
Black and white, 1 bit per pixel | TIFF | CCITT Group 4 | LC examples: 300 dpi; potential range 150-300 dpi |
Archive Higher resolution version if needed.
Tonal depth | Format | Compression | Spatial resolution |
Black and white, 1 bit per pixel | TIFF | CCITT Group 4 | No LC examples; potential range 300-1200 dpi |
Alternate formats:
PDF (Portable Document Format from Adobe Corporation), the new JBIG
standard, or other proprietary but widely used and widely supported formats.
The Library has not had extensive experience with these formats; archives
wishing to produce collections that are interoperable with those at the
Library of Congress and who plan to use them must be capable of helping
to guide their implementation. (See also Section II.D on browser-based
paging.)
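The spatial resolutions quoted above for tonal and bitonal page images become pixel dimensions and raw data sizes once a page size is fixed. The sketch below assumes an 8.5 x 11 inch page, which is an illustrative choice and not a figure from this document, and it ignores file headers and any row padding.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_page_bytes(width_in: float, height_in: float, dpi: int,
                   bits_per_pixel: int) -> int:
    """Uncompressed pixel data size, rounded up to whole bytes."""
    w, h = page_pixels(width_in, height_in, dpi)
    return (w * h * bits_per_pixel + 7) // 8


PAGE = (8.5, 11)  # inches; assumed page size for illustration only
for dpi in (150, 300, 600, 1200):
    w, h = page_pixels(*PAGE, dpi)
    bitonal = raw_page_bytes(*PAGE, dpi, 1)
    gray = raw_page_bytes(*PAGE, dpi, 8)
    print(f"{dpi} dpi: {w}x{h} pixels, bitonal {bitonal:,} B, grayscale {gray:,} B")
```

Because the pixel count grows with the square of the resolution, doubling the dpi quadruples the uncompressed size; an 8-bit grayscale page is eight times the size of the bitonal page at the same resolution.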
The special problem of printed halftone illustrations. Printed
halftones present special problems in reproduction because of interference
between the spatial frequency of the halftone dot pattern and the spatial
frequency applied by scanning and/or output devices. The interference "waves"
caused by the intersection of the two frequencies manifest themselves as
moiré patterns that degrade the image. There are a number of treatments
that can mitigate or correct this degradation but not all are practical
in a production-line environment. Possible treatments include the following:
Descreening and rescreening
This approach removes the halftone dots and converts the image to grayscale,
then rescreens it to produce a new halftone. Xerox has incorporated this
approach in some of its advanced scanning devices and it has also been
employed in the Cornell University Library's book-reformatting projects.
In the implementations known to the Library, the process seems to depend
upon "four-square" capture of the source items. This requires
the placement of flat sheets of paper on a scanner's glass surface which
requires that books be disbound. Furthermore, if a page containing both
text block and illustration is captured, the system (or operator) must
zone the page and capture text and illustration separately. The Library
of Congress has been digitizing books for access and not for preservation.
Since the volumes have not been disbound, the Library has not had the opportunity
to employ this technique.
Capture at high enough resolution to reproduce the halftone dots
This approach requires capture resolutions at one or more multiples
of the original halftone screen. Thus, for books with high-quality illustrations,
the capture might be at 600 dpi or higher. In order to reproduce the scanned
image without loss, the screen display or printer must also offer high
resolution. In order to produce reduced-resolution (smaller) images for
access, a post process consisting of descreening/rescreening, converting
to grayscale, or dot-randomization would have to be applied. The Library
has not availed itself of this technique.
Grayscale reproduction
For many illustrations, this approach offers a reasonable onscreen
rendering at moderate resolutions. Since printed output from a typical
laser printer requires that grayscale images be halftones, paper copies
produced from these grayscale images may suffer from moiré patterns. If
a page containing both text block and illustration is captured in grayscale
at moderate levels of resolution (e.g., 200-400 dpi), the grayscale treatment
that benefits the illustration may injure the clarity of the typography.
Thus, one may wish to zone the page and capture illustration and text separately
or capture two versions of the page image. The Library has not availed
itself of this technique.
Randomization of scanner "dot pattern"
This process reproduces printed halftones as bitonal images to which
a special diffuse dithering treatment is applied at scan time (or in post-processing
a grayscale image). This reduces but does not eliminate moiré patterns.
The effect on typography is not as severe as the effect produced by grayscale
capture, although it adds speckles to white areas surrounding the type.
The Library has employed this approach in a number of collections, capturing
images at 300 dpi. The resulting images print on a laser printer with good
results but do not rescale well for screen display. In order to provide
a screen display of the whole illustration for which no rescaling is required,
the Library has also created thumbnail versions of the larger image. These
are not incorporated in the current American Memory WWW presentation at
this time.
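The last treatment above, converting a grayscale capture to a bitonal image with a diffuse dither, can be illustrated with a generic error-diffusion routine. The sketch below uses the well-known Floyd-Steinberg weights as a stand-in; the scanner vendor's own treatment is not described in this document and may differ. Images are plain lists of rows, and the function name is illustrative.

```python
def error_diffusion_dither(gray):
    """Convert rows of 0-255 gray values to rows of 0/1 pixels (1 = white)."""
    height, width = len(gray), len(gray[0])
    work = [[float(v) for v in row] for row in gray]  # working copy
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 255.0 if old >= 128 else 0.0   # threshold to black or white
            out[y][x] = 1 if new else 0
            err = old - new                      # quantization error
            # Floyd-Steinberg weights (in sixteenths): right 7, below-left 3,
            # below 5, below-right 1. Error falling off the edges is dropped.
            for dx, dy, weight in ((1, 0, 7), (-1, 1, 3), (0, 1, 5), (1, 1, 1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and ny < height:
                    work[ny][nx] += err * weight / 16
    return out


# A uniform mid-gray patch becomes a scattered mix of black and white pixels.
patch = error_diffusion_dither([[128] * 16 for _ in range(16)])
print(sum(map(sum, patch)), "white pixels out of", 16 * 16)
```

Spreading each pixel's rounding error onto its neighbors keeps the average tone close to the original while breaking up the regular patterns that interact with a halftone screen.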
Image types for printed halftones:
Fetchable Reference
The Library's randomized-dot-pattern images have been captured on a
Xerox K5200 scanner. When the diffuse dithering treatment is applied, this
scanner's software creates files in the PCX format (a format associated
with ZSoft's PC PaintBrush software).
Tonal depth | Treatment | Format | Compression | Spatial resolution |
Black and white, 1 bit per pixel | Xerox diffuse dithering | PCX (some have been converted to TIFF) | Native to PCX or none | LC examples 300 dpi |
Archive: None created by the Library for halftone illustrations
Bitonal thumbnail image (not offered on WWW at this time)
The images described here are bitonal; the Library is considering creating
grayscale thumbnails for printed halftones in the future.
Tonal depth | Treatment | Format | Compression | Spatial resolution |
Black and white, 1 bit per pixel | Xerox diffuse dithering | PCX (some have been converted to TIFF) | Native to PCX or none | Within window of about 500x400 |
The companion document Digital Historical Collections: Types, Elements, and Construction outlines some of the devices that may be used to provide users with the many images (or other digital objects) that are linked to a single reference, including browser-based paging. If browser-based paging is adopted, images in the GIF format must be produced. The Library's plans call for the production of GIF images with reduced resolution (when compared to the source image) and altered tonality (bitonal made into grayscale). For example, in the case of the 150 dpi, 256-shade grayscale images of the Walt Whitman notebook pages offered online, the GIF images might be at 75 dpi and offer only 16 shades of gray. In the case of a 300 dpi bitonal image of a page of printed matter, the GIF image might be from 75-100 dpi and in 16 or 32 shades of gray, produced with a process that shades the bitonal image to gray at the same time that it is rescaled.
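The process described above, shading a bitonal image to gray while rescaling it, can be sketched as block averaging: each block of source pixels becomes one output pixel whose gray level reflects the share of white pixels in the block. This is a minimal illustration under that assumption, not the Library's production method; the function name and the 1 = white convention are choices made for the example.

```python
def shade_and_rescale(bitonal, factor, shades=16):
    """Average factor x factor blocks of 0/1 pixels into `shades` gray levels.

    `bitonal` is a list of rows with 1 = white and 0 = black. Output levels
    run from 0 (black) to shades - 1 (white). Partial blocks at the right
    and bottom edges are discarded.
    """
    n = factor * factor
    out = []
    for by in range(len(bitonal) // factor):
        row = []
        for bx in range(len(bitonal[0]) // factor):
            white = sum(
                bitonal[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            # Scale the white count (0..n) to 0..shades-1, rounding half up.
            row.append((2 * white * (shades - 1) + n) // (2 * n))
        out.append(row)
    return out


# 300 dpi -> 75 dpi corresponds to factor 4; two blocks side by side.
source = [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(4)]
print(shade_and_rescale(source, 4))  # one white block, one black block
```

A factor of 4 takes a 300 dpi bitonal page to 75 dpi, and a 4x4 block has 17 possible white counts, which maps comfortably onto 16 shades of gray.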
The Library's Geography and Map Division is developing an approach for digitizing map collections, with the advice of the division's Center for Geographic Information. For most of the maps selected for the Library's program, the preliminary finding is that good legibility will be afforded in tonal images at a spatial resolution of 300 dpi. Archival copies will be stored without compression (or with lossless compression). There are many challenges associated with Internet transmission, display, and printing of very large images and the Library has not formulated plans for the presentation of maps in the WWW environment.
The large files required to reproduce audio have launched the computer multimedia industry on a constant search for new and better compression and playback schemes. For this reason, the digital audio formats suitable for the WWW are less stable than those for text and pictorial images; thus the audio files produced today will become obsolete more quickly. When materials are remastered, the moderate fidelity of current audio formats will mean that the source material or an intermediate format, e.g., DAT (digital audio tapes), must be used (again) to create files in the new format.
The audio selections included in American Memory collections at this time are "downloadable," meaning that the browser must copy a file to the local computer before it can begin to play it. Since the files are large (the four-minute recordings in the Nation's Forum collection run about 2 megabytes each), this is space- and time-consuming. In order to address this problem, the Library is preparing "streaming" files that will begin to play as they are transmitted through the network. Although more convenient, these files are of lower fidelity than the downloadable examples.
Audio file types:
Downloadable files (online today)
The Library plans to replace these files with the downloadable type described
below by early 1997.
Attributes | Format (file type) |
11.025 kHz sample rate, 8 bit word, mono | AU (Sun Microsystems format developed for UNIX systems) |
Downloadable files (planned for early 1997)
Attributes | Format (file type) |
22.05 kHz sample rate, 16 bit word, mono | WAVE (Microsoft format) |
Streaming files (planned for early 1997)
Encoding | Format (file type) |
For 14.4 modems | RA (RealAudio format from Progressive Networks) |
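The audio attributes listed above determine file size directly. The sketch below estimates sample data size for a four-minute recording, ignoring file headers; it also gives an upper bound for a stream that fills a 14.4 kilobit-per-second line, taking that nominal modem speed as an assumption. The function names are illustrative.

```python
def pcm_bytes(sample_rate_hz: int, bits_per_sample: int, channels: int,
              seconds: int) -> int:
    """Size of the raw sample data, one word per sample per channel."""
    return sample_rate_hz * seconds * (bits_per_sample // 8) * channels


def max_stream_bytes(bits_per_second: int, seconds: int) -> int:
    """Most data a stream could carry if it used the whole line rate."""
    return bits_per_second * seconds // 8


FOUR_MINUTES = 4 * 60
print("11.025 kHz, 8 bit, mono: ", pcm_bytes(11_025, 8, 1, FOUR_MINUTES))
print("22.05 kHz, 16 bit, mono: ", pcm_bytes(22_050, 16, 1, FOUR_MINUTES))
print("14.4 kbit/s line, at most:", max_stream_bytes(14_400, FOUR_MINUTES))
```

The first figure, about 2.6 million bytes, is in the same range as the four-minute files mentioned above. The planned 22.05 kHz, 16-bit files are four times larger, and a stream sized for a 14.4 modem can carry only a small fraction of either, which is consistent with its lower fidelity.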
The large files required to reproduce motion pictures and video have launched the computer multimedia industry on a constant search for new and better compression and playback schemes. For this reason, the moving-image formats suitable for the WWW are less stable than those for text and pictorial images; thus the moving-image files produced today will become obsolete more quickly. When materials are remastered, the moderate quality of current moving-image formats will mean that the source material or an intermediate format, e.g., videotape, must be used (again) to create files in the new format.
Moving-image file types:
Files (online today)
The Library plans to supplement these files with the types described
below by late 1996.
Encoding | Image size | Frame rate | Data rate | Format |
Indeo 3.2 (Intel) | 320x240 pixels | 15 fps | ca. 1 megabit/second | AVI (Audio Video Interleaved from Microsoft) |
Moderate resolution files (planned for late 1996)
Image size | Frame rate | Data rate | Compression | Format |
320x240 pixels | 30 fps | ca. 1.2 megabits/second | MPEG-1 | mpg |
Low resolution files (planned for late 1996)
Image size | Color depth | Data rate | Format |
160x120 pixels | 24 bits/pixel | ca. 100 kilobytes/second | QuickTime (Apple Computer format) |
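The data rates listed above can be compared on a common footing. The sketch below converts each to bytes per minute and estimates how much compression a rate implies relative to uncompressed frames. It assumes decimal prefixes (1 megabit = 1,000,000 bits; 1 kilobyte = 1,000 bytes) and, for the 320x240 files, 24 bits per pixel, which the tables state only for the low resolution file.

```python
def bytes_per_minute(bits_per_second: int) -> int:
    """Data volume for one minute of playback at a constant bit rate."""
    return bits_per_second * 60 // 8


def raw_bits_per_second(width: int, height: int, bits_per_pixel: int,
                        fps: int) -> int:
    """Bit rate of the same frames with no compression at all."""
    return width * height * bits_per_pixel * fps


rates = {
    "AVI, ca. 1 megabit/s": 1_000_000,
    "MPEG-1, ca. 1.2 megabits/s": 1_200_000,
    "QuickTime, ca. 100 kilobytes/s": 100_000 * 8,
}
for label, bps in rates.items():
    print(f"{label}: {bytes_per_minute(bps):,} bytes per minute")

# Implied compression for the MPEG-1 file, assuming 24-bit frames.
ratio = raw_bits_per_second(320, 240, 24, 30) / 1_200_000
print(f"MPEG-1 implied compression: about {ratio:.0f}:1")
```

Under these assumptions every format costs several megabytes per minute of footage, far more than the audio files, even though the implied compression is already much heavier than the 10:1 used for still images.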
The Library plans to add data to the file headers for all of its reproductions over time. For now, a preliminary implementation exists for the four types listed below. Header content will almost certainly be a part of, or interplay with, the administrative and structural metadata associated with the repository described in Digital Historical Collections: Types, Elements, and Construction. The development and implementation of headers will keep pace with the Library's overall design process for metadata.
TIFF image files
The most fully developed image header scheme applies to the archival
version of pictorial image files (as distinct from text pages). The Library
has been using TIFF version 5.0 but expects to begin using version 6.0
very soon. It is worth noting that the Library's use of TIFF formats and
headers has not always gone smoothly, perhaps the inevitable result of
using an industry standard rather than a true standard. This accounts for some of
the uncertainties embedded in the description that follows. The Library
has used the following TIFF tags. Contractors have been asked to provide
typical or expected data for most tags; exceptions to the
norm are noted in the comments column.
Regarding the fields that have pixel counts or other dimensional information, it is worth noting that most of the Library's pictorial collections have been digitized from negatives, copy negatives, or copy prints; for most items, the actual dimensions of original prints or artists' works (as displayed) are neither known nor easily incorporated in scan-time data.
Description | Tag | Comments |
NewSubfile Type | 254 | |
ImageWidth | 256 | Actual pixel count |
ImageLength | 257 | Actual pixel count |
BitsPerSample | 258 | |
Compression | 259 | |
Photometric Interpretation | 262 | |
Document Name | 269 | This data is usually the collection identifier and filename, e.g., from the Yanker poster collection: yan/1a12345u.tif. An alternate would be to use the file identifier rather than the name (i.e., exclude the "u" for "uncompressed" and the extension): yan/1a12345 |
Strip Offsets | 273 | |
Samples Per Pixel | 277 | |
Rows Per Strip | 278 | |
Strip Byte Counts | 279 | |
Xresolution | 282 | (see following note) |
YResolution | 283 | (see following note) |
Resolution Unit | 296 | (see following note) |
Date Time | 306 | date and time scanned |
Artist | 315 | Library of Congress |
NOTE: At least two options exist for tags 282, 283, and 296.
Option 1 (has been used for full size uncompressed images)
Xresolution | YResolution | ResolutionUnit |
282 | 283 | 296 |
actual pixel count | actual pixel count | 1 (no unit specified) |
Option 2 (has been used for thumbnail images)
Xresolution | YResolution | ResolutionUnit |
282 | 283 | 296 |
dots per inch | dots per inch | 2 (inch) |
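The tags listed above are stored in a TIFF file's image file directory (IFD), a table of 12-byte entries that follows an 8-byte file header. The sketch below is a minimal reader written with Python's standard `struct` module, not the Library's tooling. Only the tag numbers come from the table; the function names and demo values are assumptions, and the reader handles just the first IFD and five common field types.

```python
import struct

# Tag numbers from the table above; names are used for display only.
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 269: "DocumentName",
             282: "XResolution", 283: "YResolution", 296: "ResolutionUnit",
             306: "DateTime", 315: "Artist"}
FIELD_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8}   # BYTE, ASCII, SHORT, LONG, RATIONAL
INT_CODES = {1: "B", 3: "H", 4: "L"}           # struct codes for integer types


def read_first_ifd(data: bytes) -> dict:
    """Return {tag number: value} for the first image file directory."""
    order = {b"II": "<", b"MM": ">"}[data[:2]]          # byte-order mark
    magic, ifd_offset = struct.unpack(order + "HL", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for i in range(entry_count):
        pos = ifd_offset + 2 + 12 * i
        tag, ftype, count = struct.unpack_from(order + "HHL", data, pos)
        if ftype not in FIELD_SIZES:
            continue                                    # type not decoded here
        # Values of 4 bytes or fewer sit inside the entry; others are pointed to.
        if FIELD_SIZES[ftype] * count <= 4:
            value_pos = pos + 8
        else:
            (value_pos,) = struct.unpack_from(order + "L", data, pos + 8)
        if ftype == 2:                                  # NUL-terminated ASCII
            raw = data[value_pos:value_pos + count]
            tags[tag] = raw.rstrip(b"\0").decode("ascii")
        elif ftype == 5:                                # numerator, denominator
            nums = struct.unpack_from(order + "LL" * count, data, value_pos)
            pairs = list(zip(nums[0::2], nums[1::2]))
            tags[tag] = pairs[0] if count == 1 else pairs
        else:
            nums = struct.unpack_from(order + INT_CODES[ftype] * count,
                                      data, value_pos)
            tags[tag] = nums[0] if count == 1 else list(nums)
    return tags


def demo_tiff() -> bytes:
    """Build a tiny little-endian directory with made-up values (no pixels)."""
    entries = [
        struct.pack("<HHLHH", 256, 3, 1, 1000, 0),   # ImageWidth, SHORT
        struct.pack("<HHLHH", 257, 3, 1, 700, 0),    # ImageLength, SHORT
        struct.pack("<HHLL", 282, 5, 1, 74),         # XResolution at offset 74
        struct.pack("<HHLHH", 296, 3, 1, 2, 0),      # ResolutionUnit, 2 = inch
        struct.pack("<HHLL", 315, 2, 20, 82),        # Artist at offset 82
    ]
    ifd = struct.pack("<H", len(entries)) + b"".join(entries) + struct.pack("<L", 0)
    values = struct.pack("<LL", 300, 1) + b"Library of Congress\0"
    return struct.pack("<2sHL", b"II", 42, 8) + ifd + values


for tag, value in read_first_ifd(demo_tiff()).items():
    print(TAG_NAMES.get(tag, tag), value)
```

A reader like this makes the difference between the two options in the note visible: the same tags 282 and 283 may hold either pixel counts or dots per inch, and only tag 296 indicates which interpretation applies.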
SGML text files
The American Memory
DTD for historical texts includes a simplified Text Encoding Initiative
(TEI) header. The header is an integral part of the SGML-encoded document.
Since the Library accompanies its marked-up texts with bibliographic records
or finding aids, the header contains only a handful of MARC field equivalents:
title statement and statement of responsibility (MARC field 245), copyright
registration number (MARC field 017), and the Library of Congress catalog
card number (LCCN), when one exists.
WAVE audio files
The Library plans to use the following Resource Interchange File Format
(RIFF) INFO list chunk data with its WAVE files:
INAM (name/title) | ICRD (creation date) | IARL (archival location) | ICOP (copyright) |
identifier for the item | date digitized by vendor as YYMMDD | Library of Congress, identifier for collection or project | "see collection restriction statement" |
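The four INFO fields above are stored in a RIFF LIST chunk of type INFO, in which each field is a sub-chunk holding a NUL-terminated string. The sketch below builds such a chunk with Python's standard `struct` module. The item and collection identifiers and the date are hypothetical placeholders, and the function names are illustrative; placing the chunk inside a complete WAVE file is not shown.

```python
import struct


def riff_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """One RIFF chunk: 4-byte id, little-endian 32-bit size, data, even padding."""
    pad = b"\0" if len(payload) % 2 else b""   # pad byte is not counted in size
    return chunk_id + struct.pack("<I", len(payload)) + payload + pad


def info_list(fields: dict) -> bytes:
    """Build a LIST chunk of type INFO from {chunk id: text} pairs."""
    body = b"INFO"
    for chunk_id, text in fields.items():
        body += riff_chunk(chunk_id.encode("ascii"), text.encode("ascii") + b"\0")
    return riff_chunk(b"LIST", body)


example = info_list({
    "INAM": "item0001",                               # hypothetical item identifier
    "ICRD": "960815",                                 # hypothetical date, YYMMDD
    "IARL": "Library of Congress, demo-collection",   # hypothetical collection id
    "ICOP": "see collection restriction statement",
})
print(len(example), "bytes:", example[:12])
```

Because the fields travel inside the audio file itself, the identifying information stays with the recording even when the file is separated from its bibliographic record.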
RealAudio files
The Library plans to use the RealAudio header:
Title | Author | Copyright |
identifier for the item; date digitized by vendor as YYMMDD | Library of Congress, identifier for collection or project | "see collection restriction statement" |
The Library of Congress
>> American Memory
Content updated: 2002-11-18