Federal Depository Library Program
Digitizing Collections of Government Documents: Options, Processes, and Costs

Cathy Nelson Hartman
Denton, TX

Introduction

Good afternoon. My name is Cathy Hartman, and I am the electronic resources coordinator for government information at the University of North Texas (UNT). My journey down the path to digitization began in 1997 at the ALA Midwinter Meeting in Washington, DC. Duncan Aldrich, then working as an expert consultant in GPO's electronic transition effort, mentioned in an update at GODORT's Federal Documents Task Force meeting that GPO needed depository library partners to assume responsibility for providing access to electronic documents. Depository libraries have provided access to documents in tangible media for more than a hundred years, so providing access to documents in electronic media seemed to be just another method of fulfilling our responsibilities as a depository library. I volunteered to become a partner.

In October 1997, the University of North Texas Libraries entered into an agreement with the U.S. Government Printing Office to provide permanent public access to the electronic records of the Advisory Commission on Intergovernmental Relations (ACIR). As the only site for access to the ACIR electronic records, we frequently received requests for historical publications of the agency from researchers, government administrators, students, and others who found the UNT Libraries' electronic collection by searching the Internet.

We proposed to enhance the ACIR electronic collection by making the most important serial titles published by the agency available as electronic documents accessible via the Internet, so in the spring of 1998, I wrote a grant proposal for a pilot project to begin the digitization process. In May of 1998, AMIGOS Bibliographic Council notified me that they were funding the project, and then the real work began.

Today, I will be talking about questions you should ask yourself before beginning a digitization project.
For each question, options will be discussed and our decisions and processes for the ACIR project will be presented.

Question 1: What Collection Will Be Digitized?

For most of you this question will not be difficult. Being from Texas, I count the comedian Jeff Foxworthy, of the "you might be a redneck if" jokes, as a personal favorite. (Most Texans can connect to "redneck" jokes.) Following his style and my experience with documents librarians, I can confidently say, "If you find you have an uncontrollable urge to digitize everything in your collection and make it all 'permanently publicly available,' you just might be a documents librarian." Your problem will probably be focusing on only one collection as the most likely candidate for digitization. Consider such issues as:
Items considered for digitization should also be examined for copyright. This will not be an issue for most collections of government documents. However, if it is an issue, be certain that your copyright information is very current. Copyright law changes constantly, especially with respect to digitizing documents and making them available on the Internet.

Question 2: How Will Funding Be Obtained for the Project?

This question is more difficult to answer than the first. Digitizing any collection requires personnel time (frequently the most expensive element), training, hardware, software, possibly funds for outsourcing parts or all of the project, and a variety of small expenses. Since funding was not available in my library for a pilot project, I decided to request grant funding. It seemed a reasonable thing to do at the time. However, if you are writing your first grant proposal, here are a few tips that I learned the hard way.

First, be prepared for the significant effort involved in writing the grant proposal and then later writing the grant report.

Second, be certain that you check with the grants office on your campus or in the city government before sending out a grant proposal. This will save you many problems over the life of the project. They may add hidden budget costs for various items, such as benefits for project staff. Such costs can affect the amount of money you think you have to spend on the project. They may also ask penetrating questions such as, "If your plan to complete the project is not successful, what is your secondary plan to fulfill the requirements of the grant?" Also, the record keeping must be carefully done so that expenditures are clearly documented.

Question 3: Depending on the Level of Funding Acquired, How Will We Balance Level of Access to the Digitized Documents with Cost of Digitization?
If you found major funding and costs are not an issue, you may want to provide the very best access to thousands of pages of documents by scanning, OCR-ing, marking them up in HTML, and verifying every word of text for accuracy. We considered this option. However, the costs stated in the project report of the AMIGOS-funded study at Oklahoma State University, A Digital Challenge: Bringing Kappler's Indian Affairs: Laws and Treaties to the World Wide Web [1], clearly showed that we could not afford the expensive, time-intensive efforts required to create HTML files from the scanned text, even though we believed that with current technology, HTML files would offer the best Web access.

I stated in my grant proposal that our digitization project would be accomplished by outsourcing high-speed, quantity scanning of approximately 4,200 pages of text. Our goal was to develop a process that balanced level of access with the cost of digitizing and making the data available on the Internet. The objectives included:
Selecting a Vendor

Part of our strategy for controlling costs of scanning included hiring a vendor with the appropriate high-speed scanning equipment to scan the documents. Several vendors were contacted and asked to supply samples of their work. Two vendors agreed to do so and were shipped an issue of an ACIR periodical and a volume of Significant Features of Fiscal Federalism for the test.

One vendor supplied us with TIFF files and offered a very low price of 22 cents per page for black-and-white scanning. It took several weeks to receive the test scans and several more weeks to retrieve the loaned documents that they scanned. The contact person lacked knowledge about the scanning process and could answer few of my questions about the files.

The other vendor, the Electronic Resource Library Project Lab based at Amarillo College, test scanned the documents and supplied us with TIFF files and with PDF image-plus-text files. The test files and our documents were returned quickly. Their bid was 22 cents per page for TIFF files or 26 cents per page for PDF image-plus-text files. Color scans were offered for document cover pages for an additional 4 cents per page. The director of the lab, Dr. Karen Ruddy, was knowledgeable and prompt with answers to our questions about the files.

The PDF image-plus-text files provided both good quality image files that could be viewed in the free Acrobat Reader and searchable text files. The PDF files were created using the Adobe Capture software, which added the benefit of Optical Character Recognition (OCR) to create searchable text. The image file was displayed, but the text file existed and could be searched or copied and pasted. Much of the scanned text was readable by the OCR software. Pages containing simple text with plain fonts were translated more successfully by the OCR software than pages with non-text material or unusual fonts.
Since the PDF image-plus-text files would allow the additional access of searchable text, they deserved thorough investigation. We viewed the test files, searched them, copied and pasted from the text, and checked printability. The results exceeded our expectations, so we decided that the PDF image-plus-text files were our best option. When our experiments showed that the in-house personnel and computer time needed to convert TIFF files to PDF image-plus-text files were significant, we decided that the extra 4 cents per page charged by the vendor to provide PDF image-plus-text files would be well worth the cost. The PDF image-plus-text files seemed to be our most cost-effective method of balancing issues of access and costs. Other important influences on our decision to go with the PDF image-plus-text files included:
The vendor selected was the Amarillo College Electronic Resource Library Lab. The Lab had previously received grants from the Federal Government to purchase high-speed scanning equipment to digitize documents related to plutonium research. The Lab had also worked with the Department of Energy's "Energy InfoBridge" project, scanning many thousands of pages. Dr. Walter Warnick, director of the Energy Resource Library, highly recommended the Lab. Dr. Ruddy, the Lab director, was interested in outside contracts to keep the Lab personnel and equipment busy. The ACIR scanning project would serve as a pilot project for the Lab to determine if bringing in outside work would be economically feasible.

Question 4: Are All Documents Needed for Digitizing Part of Our Collection, and If Not, How Will They Be Obtained?

This question is particularly relevant if the items are out of print or will be damaged in the scanning process. My grant proposal stated that approximately 4,200 pages of the most important serial publications of the Advisory Commission on Intergovernmental Relations would be digitized. The ACIR collection at the University of North Texas Libraries was assessed to determine if all issues of our selected serial titles were available in the collection. We estimated that the 1990 through 1995 volumes of Significant Features of Fiscal Federalism and volumes 10 through 20 of Intergovernmental Perspective would approximate the 4,200 pages.

Since high-speed quantity scanning makes use of an automatic paper feeder, any item sent to the Lab would have its binding shaved. The decision was made that retaining a paper copy of each item scanned would be important, so Offers Lists published by the Federal Depository Library Program were monitored frequently to attempt to collect duplicate copies of as many of the publications as possible. When a duplicate could not be located, the publication would be re-bound after scanning.
Duplicates of many of the items were collected when one depository library gave up its depository status and offered all of its collection to other depositories. A few other items were collected at random.

One copy of all volumes of the Significant Features of Fiscal Federalism, except for the 1993 volumes, was in the UNT collection in paper format. The 1993 volumes were in microfiche, with the microfiche obviously created from a copy of the original publication. Even though fiche scanners exist, the quality of the fiche copy must be high for the scanned file to be acceptable. When an inquiry sent to the Texas Library Association Government Documents Round Table listserv showed that all depository libraries had received the 1993 volumes in microfiche, other groups were contacted. The issues were eventually located in the collection of a professor of public administration on the UNT campus.

Only a few issues of the periodical Intergovernmental Perspective, volumes 10 through 12, were missing and were happily supplied by depository librarians at Texas Christian University and the Texas State Library and Archives Commission. As we expanded our scanning back to volume one, other issues were supplied by depository libraries across the country when a request was posted to GOVDOC-L.

Issues or volumes of the titles that were borrowed from individuals or other libraries could not be sent to the Lab to have bindings shaved, so it was determined that these publications would be scanned on an available flat-bed scanner in the UNT Libraries.

In July 1998, the first 2,164 pages were shipped to the Lab. An additional 1,436 pages were shipped in August for a total of 3,600 pages. In October, when the UNT Libraries offered additional funding for the project, we shipped an additional 1,872 pages to the Lab. TEXPRESS, the courier system connecting many colleges and universities in Texas, allowed us to ship all documents at no charge to the project.
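For anyone budgeting a similar project, the shipment counts above can be combined with the Lab's per-page bids quoted earlier (22 cents for TIFF, 26 cents for PDF image-plus-text) to get a rough scanning estimate. The sketch below is only an illustration: it assumes every shipped page was billed at the flat quoted rate, and it ignores color cover pages and the borrowed items scanned in-house. The function names are made up for this example, and the figures it produces are estimates, not the project's actual invoices.

```python
# Rough scanning-cost estimate from the page counts and bids quoted in the talk.
# Assumption: a flat per-page rate applies to every shipped page.

# Pages shipped to the Lab, by month (figures from the talk).
SHIPMENTS = {"July": 2164, "August": 1436, "October": 1872}

# Per-page bids in cents (integers avoid floating-point rounding).
TIFF_CENTS_PER_PAGE = 22
PDF_CENTS_PER_PAGE = 26


def total_pages(shipments):
    """Sum the page counts of all shipments."""
    return sum(shipments.values())


def scanning_cost_cents(pages, cents_per_page):
    """Cost in cents of scanning `pages` pages at a flat per-page rate."""
    return pages * cents_per_page


def format_dollars(cents):
    """Render an integer number of cents as a dollar string."""
    return f"${cents // 100:,}.{cents % 100:02d}"


if __name__ == "__main__":
    pages = total_pages(SHIPMENTS)
    pdf_cost = scanning_cost_cents(pages, PDF_CENTS_PER_PAGE)
    premium = scanning_cost_cents(pages, PDF_CENTS_PER_PAGE - TIFF_CENTS_PER_PAGE)
    print(pages)                     # 5472
    print(format_dollars(pdf_cost))  # $1,422.72
    print(format_dollars(premium))   # $218.88 extra for searchable PDF
```

Under these assumptions, the searchable-text premium over plain TIFF scanning for all three shipments comes to a little over two hundred dollars, which is the kind of comparison to set against the in-house conversion time.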
Question 5: How Will Skilled Personnel Be Found and Training Provided for All Project Participants?

For those of us in academic libraries, students provide a wonderful resource for project personnel. For institutions with library and information science programs, especially skilled graduate students may be found. Our grant proposal included funding to hire a project assistant, so faculty and students who had expressed an interest in the project were notified that we were accepting applications. We were interested in hiring a student who could begin work on the first of August and continue into the fall semester. Interviews were conducted and an extremely well qualified graduate student from the School of Library and Information Sciences was hired.

Training for you as the project manager and for other personnel can be an expensive part of the project. I enrolled in an intensive, three-day class to learn to use the Adobe Acrobat software required to alter and enhance the PDF files. The $450 cost for the class is not an unusual fee and is another cost to include in your grant request. I then instructed the project assistant in the basics of using the software.

The project assistant and I developed a process to create links within the documents, bookmarks, and other enhancements. Since borrowed items would be scanned on-site, a process for scanning was also created, and the project assistant wrote a procedures manual outlining the process for others to use.

As the scanned files were completed by the Lab and sent to us, we enhanced the files by adding bookmarks for the contents of each title, links from the contents pages, and links from indexes when an index was included in the volume. Every page was also checked for readability and printing quality. The procedures manual was edited as needed throughout this process.

Question 6: Do We Have the Technical Skills, Or Access to Qualified Staff, To Solve the Technical Issues of a Digitization Project?
In any digitization project, technical decisions must be made. If you are not fluent in the language of servers, file types, and technical problem solving, be certain that knowledgeable staff are available for consultation. As we prepared volumes for loading to the Web server, several technical issues required solutions.
File Size

The scanned documents ranged in pagination from approximately 30 pages to over 300 pages. File sizes ranged from 1.8 megabytes (MB) to 20 MB. Downloading such large files over the Web can take considerable time, especially if access is via a modem. When saving the enhanced files, we were careful to use the Acrobat Exchange software's "optimize" function, which helped reduce the size of the files. This, however, did not make the files small enough to have an acceptable download time.

We examined the option of making each page or a few pages into separate files, then creating some type of navigation system to allow the user to move on to the next file (next page of the document). We visited two sites that use this method. Even though it did reduce download time, we felt it was cumbersome for the user, and it would increase our costs significantly by requiring additional time to prepare our files for the Web.

Searching for other options, we discovered a possible solution called "byteserving" in a mailing list archive called the "PDF Archive." [2] It involved setting up the files correctly and having Web server software that supports the "Byte Range Retrieval Extension to HTTP" protocol. This server software has the capability to "serve" to the user only one page at a time of a PDF file. This method requires the user to change only one setting in the Acrobat Reader Preferences to disallow "background downloading." The user can then move through the document using the Acrobat Reader's functions or the links and bookmarks we created.

Since the UNT Libraries' Web server already had one of the software packages that supports byteserving, we tested this option and decided it would be the best option for us. On our Web interface page, we asked the user to link to another page to find out about "Faster Downloads." [3] There we explained byteserving and how "preferences" in the Acrobat Reader could be easily altered for faster downloading of the files.
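For readers who want to see what byteserving amounts to on the server side, here is a minimal sketch in Python. It is not the software we ran at UNT. It only illustrates the general idea behind the byte range extension, which was later standardized in HTTP/1.1 as range requests: the client sends a `Range: bytes=first-last` header, and the server answers with status 206 and just that slice of the file. The function name `serve_byte_range` is made up for this example, and the sketch handles only a single range per request.

```python
import re
from typing import Dict, Optional, Tuple

# (status code, response headers, body)
Response = Tuple[int, Dict[str, str], bytes]

# Matches "bytes=first-last", "bytes=first-" and "bytes=-suffix_length".
_RANGE_RE = re.compile(r"bytes=(\d*)-(\d*)")


def serve_byte_range(data: bytes, range_header: Optional[str]) -> Response:
    """Answer a request for `data`, honoring a single-range Range header."""
    total = len(data)
    full: Response = (
        200,
        {"Accept-Ranges": "bytes", "Content-Length": str(total)},
        data,
    )
    if range_header is None:
        return full  # ordinary request: send the whole file

    match = _RANGE_RE.fullmatch(range_header.strip())
    if match is None:
        return full  # unsupported or malformed header: ignore it
    first, last = match.groups()
    if first == "" and last == "":
        return full  # "bytes=-" carries no usable range

    if first == "":
        # Suffix form: the final `last` bytes of the file.
        start = max(total - int(last), 0)
        end = total - 1
    else:
        start = int(first)
        if last != "" and int(last) < start:
            return full  # last before first is invalid: ignore the header
        end = min(int(last), total - 1) if last != "" else total - 1

    if start >= total or end < start:
        # Nothing in the file falls inside the requested range.
        return 416, {"Content-Range": f"bytes */{total}"}, b""

    body = data[start:end + 1]  # HTTP ranges are inclusive at both ends
    headers = {
        "Accept-Ranges": "bytes",
        "Content-Range": f"bytes {start}-{end}/{total}",
        "Content-Length": str(len(body)),
    }
    return 206, headers, body


if __name__ == "__main__":
    status, headers, body = serve_byte_range(b"0123456789", "bytes=2-5")
    print(status, headers["Content-Range"], body)  # 206 bytes 2-5/10 b'2345'
```

A PDF viewer that supports this can ask for only the bytes that make up the page being displayed, which is why a 20 MB volume need not be downloaded in full before the first page appears.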
Searching PDF Files

From the beginning of the project, making the ACIR Web site searchable was an important part of maximizing access to the digitized collection. We quickly learned that many of the well-known search engines would not index and search PDF files. We spent a considerable amount of time viewing and reading about our options. The project assistant created a table outlining our most promising options. Infoseek's Ultraseek Server, Microsoft Index Server, and Verity Search were our best options.

Infoseek was reasonably priced, had automatic re-indexing, supported sophisticated search queries and responses, and was Y2K compliant.

Microsoft Index Server was free with our Windows NT 4.0 server software and had automatic re-indexing. However, it did not rank search results or detect duplication, and it often included HTML characters when creating summaries. The Microsoft Index Server did offer a PDF filter that could be installed so that PDF files would be indexed.

The Verity search engine provided a special filter to search over 200 file formats and used Meta tags to control summaries, so responses to a query were controlled by the metadata entered for each PDF file. Our investigation also revealed that the Netscape Compass Server used the Verity search engine software. Since the University used the Netscape Compass Server without cost, it was our best option. It required the addition of a PDF filter for indexing PDF documents.

However, Netscape recently announced that educational institutions would no longer have free access to the Netscape Compass Server software. It is unclear at this time how this will affect installation of the software. There will undoubtedly be problems to solve as we activate it or our second choice, the Microsoft Index Server, and create the appropriate CGI scripts. Search engines examined but rejected for various reasons included:
Metadata

Metadata are used to describe an information resource. Whatever the file type used in a digitization project, Meta tags are important for accurate retrieval of documents. The Acrobat Exchange software allows entry of four Meta tags for each PDF file created. The Meta tags are very important because this is the information used to build the index list when searching PDF files. Without Meta tags, the index list often contains the URL as the title of the document and the first few words of readable text in the document as the description. Such a list may not be an accurate description of the document, and if the OCR software was unable to read the first few words, the information may even be unreadable. For this reason, the decision was made to include metadata for every PDF document.

Much of the data entry for the Meta tags is awaiting the activation of the search engine. Until we see how the selected search engine builds the indexes and displays the index lists, we cannot know what information to enter on each line of the Meta tags.

All accompanying HTML pages were created with title, keyword, and description Meta tags. Our search of the literature found articles outlining research that showed HTML documents with title, keyword, and description Meta tags were ranked higher on index lists built by some Web search engines. Also, most Web search engines use the title and description Meta tags to build the index list. When the Meta tags are not present, the title displayed is often either the URL of the page or "No Title," and the first few words of the document become the description. [4]

Integrating Digitized Files Into a Web Site

Whatever file type is chosen for a digitization project, Web access must be provided in an organized and varied manner. As librarians, our skills as organizers of information certainly assist with this part of the project.
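To make the Meta tag behavior described in the Metadata section concrete, here is a small Python sketch of how an indexer might build an index-list entry for an HTML page. It is an illustration under stated assumptions, not the logic of any particular search engine: the function name `index_entry`, the ten-word description fallback, and the example.org URLs are all made up for this example. It uses the title and description when they are present and falls back to the URL and the first few words of text when they are not.

```python
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collect the <title>, named <meta> tags, and visible words of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.words = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") and attributes.get("content"):
            self.meta[attributes["name"].lower()] = attributes["content"]
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth:
            self.words.extend(data.split())


def index_entry(url, html, description_words=10):
    """Build the title and description a simple indexer might display."""
    collector = _MetaCollector()
    collector.feed(html)
    collector.close()
    # Prefer the page's own title; fall back to the URL when none exists.
    title = collector.title.strip() or collector.meta.get("title", "").strip() or url
    # Prefer the description Meta tag; fall back to the first few words.
    description = (
        collector.meta.get("description", "").strip()
        or " ".join(collector.words[:description_words])
    )
    return {"title": title, "description": description}


if __name__ == "__main__":
    tagged = (
        "<html><head><title>ACIR Publications</title>"
        '<meta name="description" content="Full-text serials of the ACIR.">'
        "</head><body><p>Welcome to the site.</p></body></html>"
    )
    untagged = "<html><body><p>Signif1cant Featur3s of Fiscal Federalism</p></body></html>"
    print(index_entry("http://example.org/tagged.html", tagged))
    print(index_entry("http://example.org/untagged.html", untagged))
```

The second page shows the problem: with no tags, the entry is titled with the bare URL, and any OCR or typing errors in the opening words land directly in the description the searcher sees.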
Support pages that may be required include pages for browsing by topic or by title, bibliographies, help pages, or pages with historical or related information. Specialists or experts may be consulted for input for this part of the project.

Realizing that the hyperlink properties of HTML documents could assist us with offering multiple access points to the full-text PDF files, we examined the overall design of the ACIR Web site. The site already contained the electronic files of the ACIR as they appeared when the ACIR closed in 1996, the files to which the UNT Libraries had agreed to provide permanent public access. This part of the Web site could not be altered from the way the files appeared when the agency closed.

To enhance the original files, we added a brief history of the ACIR and a bibliography of the publications of the ACIR. Relevant dates and citations for laws that created or affected the ACIR were collected for the history of the agency, and the bibliography of ACIR publications was compiled and added. Additional Web pages created to provide access to the PDF documents and to provide technical information about the site included:
A service of the Superintendent of Documents, U.S. Government Printing Office. Questions or comments: asklps@gpo.gov.
Last updated: October 30, 2002
Page Name: http://www.access.gpo.gov/su_docs/fdlp/pubs/proceedings/99pro15.html