
Last updated: June 18, 2008



Access to Document Images over the Internet


Walker, F. L.
Thoma, G. R.
Proceedings of the Ninth Integrated
Online Library Systems Meeting
New York.
May 11-12, 1994
pp 185-197.

Keywords: Electronic document delivery, document imaging, DocView.


Abstract

DocView, the prototype system described here, gives an end user access to document literature through the Internet. DocView consists of Microsoft Windows-based client software at the user's machine and remotely located Unix-based document image servers. The user may retrieve the bitmapped document images from any one of several remote servers, preview the pages on the screen, manipulate the image (zoom, scroll, pan), cut and paste portions of pages of interest, electronically "bookmark" desired pages, and print only the pages needed. The user can also use DocView to receive images from Ariel stations.

1. INTRODUCTION AND BACKGROUND

Electronic access to bibliographic and full-text databases has been routine for many years, but the electronic retrieval of complete documents, in particular journal articles, is rare even today. The National Library of Medicine's R&D division, the Lister Hill Center, has undertaken several pilot projects to address this need. One project[1], called SAIL (System for Automated Interlibrary Loan), automates delivery of documents requested by users by using a preselected store of digital images of journals. SAIL consists of a complex of PC-based workstations that retrieves document requests from the Library's mainframe computer, parses the requests into requester and document information, uses the document information to retrieve images from an optical disk archive, and uses the requester information (e.g., fax number, user address) to automatically fax the images to the requester. For users who want the articles by mail, the page images are automatically printed, along with the cover of the journal issue and a printout of the request, and mailed. The total effect is to minimize human intervention at the Library in the delivery of the articles. Another project, called the Electronic Document Delivery System, does not rely on indirect access through the Library's interlibrary loan system, but offers direct access to an image database[2].

The subject of this paper is another R&D system, DocView, which delivers document images over the Internet. DocView is expected to be a faster, cheaper, more reliable and more convenient method of document delivery than either fax or mail, since the Internet offers higher speed, higher image resolution and lower transmission cost. These advantages promise to become even more pronounced as the backbone speed of the Internet, currently at T3 (about 45 Mbps), moves up gradually to OC-3, OC-12 and eventually to Gigabits/second speeds. This paper covers the operational and technical characteristics of DocView.

2. OVERALL SYSTEM DESCRIPTION

DocView consists of Microsoft Windows-based client software at the user's machine and a remotely located Unix server providing access to journal articles stored as bitmapped document images written on magnetic or optical disks. DocView provides a user two ways to access document literature via the Internet, as shown in Figure 1. In one method, the client software enables a user to retrieve the articles from any one of several remote servers, display and preview the bitmapped images, manipulate the images (zoom, scroll, pan), cut and paste portions of pages of interest, electronically "bookmark" desired pages, and print only the pages needed.

Figure 1. DocView Document Access Methods. Diagram representation of the Internet, DocView workstations and servers.

The second access method allows a DocView user to receive documents sent over the Internet from remotely located Ariel workstations[3]. Ariel is a software package developed by Research Libraries Group for a workstation comprising a PC, a scanner and a printer. Many libraries are beginning to use Ariel for document transmission in a "fax-like" manner, but via the Internet. The DocView client software is designed to receive Ariel-compatible documents and allow the same document image manipulation.

Whichever method is used, the client workstation may be physically separated from DocView servers or Ariel workstations by thousands of miles. The first method, client/server, allows the user to browse through a list of available documents such as journal articles, to select one and receive it immediately. The second method is less direct in that a user may contact a library, ask for a specific article, then have it sent directly to his computer. While the first method promises rapid delivery of pre-scanned documents from an online document collection, the second method provides documents on demand, albeit more slowly. The second method does not require documents to be preselected and stored. DocView is designed to allow a user to gain access to documents through both means simultaneously.

2.1 Server Design

The prototype server software for DocView runs on Unix machines. Three platforms are currently being used: a Sun SPARCServer 690 located in-house, a Sun SPARCStation 10 at the University of Arizona, and a Convex supercomputer on the NIH campus. The software, written in C, runs unmodified on all of these computers. However, DocView could be designed to use any computer and operating system for the Internet document server. For instance, the server could be built on a Windows NT machine, or it could run on a VAX. The only requirement is that the server have access to the Internet.

The DocView server computers have access to the document images, which have been previously scanned at 300 dpi, compressed using CCITT Group 4, and stored as TIFF images. The size of a typical compressed image is 100 kilobytes, and the length of a typical article is ten pages, resulting in about 1 Megabyte per article.
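The storage figures above can be checked with a few lines of arithmetic. The sketch below assumes a US Letter page (8.5 by 11 inches) scanned at one bit per pixel and takes a kilobyte as 1,000 bytes; neither the page size nor the bit depth is stated in the text, and the function names are illustrative only.

```python
def raw_page_bytes(dpi, width_in=8.5, height_in=11.0):
    """Uncompressed size of a 1-bit-per-pixel page raster, in bytes.

    The page dimensions are an assumption (US Letter), not a figure
    taken from the paper.
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels // 8  # eight bitonal pixels per byte


def article_bytes(pages=10, compressed_page_bytes=100_000):
    """Article size from the per-page and per-article figures in the text."""
    return pages * compressed_page_bytes


raw = raw_page_bytes(300)
print(raw)              # raw raster size of one page, in bytes
print(raw / 100_000)    # compression ratio implied by a 100 kilobyte page
print(article_bytes())  # bytes per ten-page article
```

Under these assumptions a raw 300 dpi page is a little over one megabyte, so the quoted 100 kilobytes per page corresponds to a compression ratio of roughly 10:1.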

Internet communication is done using Berkeley sockets. This technique uses TCP/IP connection-based stream sockets that provide a bidirectional, reliable, sequenced and unduplicated flow of data between a server and a client over the Internet. An application-level protocol was designed for client/server communication. A client may send six types of queries to the server, which may return five types of responses. The types of query/response packets sent between server and client are:

  • 0: Login [Client Query]
  • 1: Acknowledgment [Server Response]
  • 2: Do you have these images? [Client Query]
  • 3: I have the images. [Server Response]
  • 4: Negative [Server Response]
  • 5: Send the image. [Client Query]
  • 6: Here is the requested image. [Server Response]
  • 7: I have accepted all images. [Client Query]
  • 8: Logout [Client Query]
  • 9: What documents do you have? [Client Query]
  • 10: These are the documents I have [Server Response]
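The packet numbering above maps naturally onto an enumeration. The following is a minimal sketch, not the prototype's C source; the member names are invented here for illustration, and only the numeric values and the query/response split come from the list.

```python
from enum import IntEnum


class PacketType(IntEnum):
    """Packet type numbers from the list above; names are illustrative."""
    LOGIN = 0
    ACKNOWLEDGMENT = 1
    HAVE_IMAGES_QUERY = 2
    HAVE_IMAGES = 3
    NEGATIVE = 4
    SEND_IMAGE = 5
    IMAGE = 6
    ACCEPTED_ALL = 7
    LOGOUT = 8
    LIST_DOCUMENTS = 9
    DOCUMENT_LIST = 10


# Packets that originate at the client; everything else is a server response.
CLIENT_QUERIES = frozenset({
    PacketType.LOGIN,
    PacketType.HAVE_IMAGES_QUERY,
    PacketType.SEND_IMAGE,
    PacketType.ACCEPTED_ALL,
    PacketType.LOGOUT,
    PacketType.LIST_DOCUMENTS,
})
SERVER_RESPONSES = frozenset(PacketType) - CLIENT_QUERIES


def is_client_query(value):
    """True if the numeric packet type is one a client may send."""
    return PacketType(value) in CLIENT_QUERIES
```

Counting the two sets reproduces the six query types and five response types stated in the text.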

A typical session begins with the DocView client sending the server a login request (packet type 0). The server follows with an acknowledgment (packet type 1). Then the client asks the server for a list of documents (packet type 9). The server responds with a list of document identifiers (packet type 10), including the number of pages contained in each document. The client software allows the user to browse through the list of documents available on the server. To get a citation for a specific document, the client software sends the server a request packet for page 0, in which there is a text citation to the document. The server responds by sending the citation using packet type 6. The client software allows the user to receive only the first page (to preview), or the entire document. The client requests an image using the packet type 5 query, and the server sends the image with a packet type 6 response. The client may request only one image at a time to facilitate multiplexing image requests from multiple clients. For a ten page document, the client sends ten separate image requests to the server. Each request specifies the document page number to be sent to the client. The session ends with a logout from the client (packet type 8).
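The session just described can be written out as an ordered trace of packets. The sketch below is illustrative: the tuple layout and the document identifier are hypothetical, and it assumes the page 0 citation request uses packet type 5, which the text implies (the reply is type 6) but does not state outright.

```python
def session_trace(doc_id, page_count):
    """Return the (sender, packet_type, detail) steps of one typical session."""
    steps = [
        ("client", 0, "login"),
        ("server", 1, "acknowledgment"),
        ("client", 9, "what documents do you have?"),
        ("server", 10, "document identifiers with page counts"),
        # Assumption: the citation (page 0) is requested like any other page.
        ("client", 5, f"{doc_id} page 0"),
        ("server", 6, f"{doc_id} citation text"),
    ]
    # One request per page: the client never asks for more than one image
    # at a time, which lets the server interleave requests from many clients.
    for page in range(1, page_count + 1):
        steps.append(("client", 5, f"{doc_id} page {page}"))
        steps.append(("server", 6, f"{doc_id} page {page} image"))
    steps.append(("client", 8, "logout"))
    return steps


for sender, ptype, detail in session_trace("doc-001", 2):
    print(f"{sender:6s} type {ptype:2d}  {detail}")
```

For a ten page document the trace contains ten separate image requests in addition to the citation request.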

Additional packet types include a negative response (packet type 4) for any query the server cannot handle. A packet type 2 is useful if the client software wants to find out if the server has a specific document. The server will respond affirmatively using a packet type 3 (that includes the citation and number of images in the document) if it has the document, or with a negative response if it does not have the images. Packet type 7 is used for informing the server that the client has accepted the images sent to it; this could be used to compute financial payment for the document if that were part of the system. Future expansion of the client/server protocol could include provisions for the exchange of cost information and credit card numbers to allow the DocView user to pay for the requested document.
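A server-side dispatcher for these queries might look like the sketch below. The in-memory catalog, its single entry, and the payload shapes are hypothetical stand-ins for the prototype's image store; image transfer (types 5 and 6), acceptance (type 7) and logout (type 8) are deliberately left out, so in this sketch they fall through to the negative response.

```python
# Hypothetical catalog: document id -> citation text and page count.
CATALOG = {
    "doc-001": {"citation": "Example citation text", "pages": 10},
}


def handle_query(packet_type, doc_id=None, catalog=CATALOG):
    """Return (response_type, payload) for one client query."""
    if packet_type == 0:  # Login
        return 1, None    # Acknowledgment
    if packet_type == 2:  # Do you have these images?
        entry = catalog.get(doc_id)
        if entry is None:
            return 4, None  # Negative
        # Affirmative reply carries the citation and the number of images.
        return 3, (entry["citation"], entry["pages"])
    if packet_type == 9:  # What documents do you have?
        listing = [(d, e["pages"]) for d, e in sorted(catalog.items())]
        return 10, listing
    # Any query this sketch cannot handle gets the negative response.
    return 4, None


print(handle_query(2, "doc-001"))
print(handle_query(2, "no-such-document"))
```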

2.2 Client Design

The prototype DocView client software is an application that runs under Microsoft Windows version 3.1. The recommended computer platform has a speed of 33 MHz or higher, a minimum of 8 Megabytes of memory, and an Internet connection. The software must include Windows Sockets[4], supplied by the manufacturer of the TCP/IP protocol stack used in the computer. Windows Sockets provides a common interface for the design of a sockets application. It allows an application to run unmodified across all Windows platforms having Windows Sockets. Most major manufacturers of TCP/IP stacks for personal computers now supply a Windows Sockets dynamic link library for Windows Sockets applications[5].

DocView's client software allows the user to receive document images either actively or passively. In the active mode, the user selects a DocView server on the Internet, connects to it, chooses a document and downloads it. As shown in Figure 2, the dialog box allows the user to select a DocView server. Once the server is selected, the DocView client connects to the server automatically. A citation to the first document stored at the server is displayed in the dialog box. By using the First Doc, Last Doc, Prev Doc and Next Doc buttons, the user may browse the documents available at the server. The user may download either the first page of any document to preview it, or the entire document. When transmission completes, DocView notifies the user through an optional audible signal.

Figure 2. Screen shot of the DocView server interface.

In the passive mode, an operator of a remote Ariel workstation scans a document and sends it to the DocView client. This would normally be in response to a user request made over email, telephone, fax or other electronic means. Reception of an Ariel-compatible document occurs in the background, and DocView notifies the user when it arrives.

Whether a DocView user receives documents from a DocView server or from an Ariel workstation, the method for viewing and manipulating the images is the same. As shown in Figure 3, documents are displayed in separate windows on the screen. The windows may be cascaded, tiled or minimized using the Windows Multiple Document Interface. The user interface contains menus along the top of the screen and a toolbox of buttons representing the most commonly used functions in the menu. Each window may be maximized so that the page image may be easily read on the screen. The minimum recommended screen resolution is VGA, or 640x480. Higher resolutions allow more of the page to be displayed at a readable level. It is possible to zoom in on the image or to shrink it. Scroll bars are available for panning and scrolling zoomed images. The user may also rotate images in 90 degree steps, a useful function for viewing pages printed in landscape rather than portrait mode.

Figure 3. Multiple documents cascaded on screen.

The DocView user interface provides functions for easily browsing through the electronic document. It has a Next Page button, Previous Page button, and a Page Jump Button. The Page Jump function allows the user to move to any arbitrary page. DocView also contains an electronic bookmark function as shown in Figure 4. Electronic bookmarks do for the electronic document what real bookmarks do for paper documents: they keep track of important sections and allow easy movement from one important section to another. DocView allows any number of pages in any number of documents to be marked. When marked, a page appears to have the upper right corner bent down. The Page Jump function also allows movement between marked pages by allowing the user to move to the First Marked Page, Last Marked Page, Previous Marked Page, or Next Marked Page.
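The marked-page navigation described above amounts to keeping a sorted set of page numbers per document. The sketch below is one way to model it; the class and method names are hypothetical and say nothing about how the Windows client actually stores its bookmarks.

```python
import bisect


class Bookmarks:
    """Marked pages of one document, kept in ascending order."""

    def __init__(self):
        self._marked = []

    def toggle(self, page):
        """Mark the page, or unmark it if it is already marked."""
        i = bisect.bisect_left(self._marked, page)
        if i < len(self._marked) and self._marked[i] == page:
            del self._marked[i]
        else:
            self._marked.insert(i, page)

    def first(self):
        return self._marked[0] if self._marked else None

    def last(self):
        return self._marked[-1] if self._marked else None

    def next_after(self, page):
        """Next marked page strictly after the given page, or None."""
        i = bisect.bisect_right(self._marked, page)
        return self._marked[i] if i < len(self._marked) else None

    def previous_before(self, page):
        """Previous marked page strictly before the given page, or None."""
        i = bisect.bisect_left(self._marked, page)
        return self._marked[i - 1] if i > 0 else None


marks = Bookmarks()
for page in (3, 7, 5):
    marks.toggle(page)
print(marks.first(), marks.last(), marks.next_after(5), marks.previous_before(5))
```

Returning None at either end of the list leaves it to the caller to decide whether navigation should stop or wrap around; the text does not say which behavior DocView uses.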

Figure 4. Screen shot of the DocView electronic bookmark.

DocView contains a versatile print function that allows the user to print pages either in the current document or all documents (Figure 5). This function offers a number of options, including printing the cover page, currently displayed page, marked pages, all pages or a sequence of pages. The number of copies may vary from 1 to 10, and each page may be shrunk up to 75% in size.
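The print options reduce to choosing a list of page numbers and validating the copy count. The sketch below is illustrative only: the option strings and function names are invented, and treating the cover page as page 0 is an assumption rather than something the text states. The shrink option is omitted.

```python
def pages_to_print(option, page_count, current=1, marked=(), first=1, last=None):
    """Return the page numbers selected by one print option."""
    if option == "cover":
        return [0]  # assumption: the cover page is stored as page 0
    if option == "current":
        return [current]
    if option == "marked":
        return sorted(p for p in set(marked) if 1 <= p <= page_count)
    if option == "all":
        return list(range(1, page_count + 1))
    if option == "range":
        last = page_count if last is None else last
        if not 1 <= first <= last <= page_count:
            raise ValueError("page range outside the document")
        return list(range(first, last + 1))
    raise ValueError(f"unknown print option: {option!r}")


def validate_copies(copies):
    """The text allows between 1 and 10 copies."""
    if not 1 <= copies <= 10:
        raise ValueError("copies must be between 1 and 10")
    return copies


print(pages_to_print("marked", 10, marked=[7, 3, 3]))
print(pages_to_print("range", 10, first=2, last=4))
```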

Figure 5. Screen shot of the DocView print functions.

Among other features of the DocView software is a copy function that permits the user to select part of an image and copy it to the Windows clipboard. This allows the user to create new documents from the ones received through the Internet. The portion of the image copied may be automatically scaled larger or smaller, or kept at screen resolution. In all DocView functions the user may obtain context sensitive help on any topic by pressing the help button associated with that function. DocView offers a complete help facility on any aspect of operating the user interface.

The DocView client software is structured as two modules running over Windows Sockets (Figure 6). The DOCVIEW.EXE module communicates with each DocView server using a TCP/IP socket. The ARIEL.EXE module is designed specifically for compatibility with remote Ariel machines using a User Datagram Protocol (UDP) socket. Ariel version 1.12 uses a modified form of TFTP for document transmission. This method of communication is not as fast as the TCP/IP communication with DocView servers. However, in preliminary testing both methods have been shown to be equally reliable. When an Ariel document has arrived at the DocView client computer, the ARIEL.EXE module informs DOCVIEW.EXE using Dynamic Data Exchange. DOCVIEW.EXE then notifies the user that a new Ariel document has arrived and is ready for viewing.

Figure 6. Diagram of the DocView client software architecture. From bottom to top: driver for communications board, TCP/IP stack, Windows sockets DLL. The top level shows the dynamic data exchange between the lower layers and the DocView.exe and Ariel.exe.

3. EVALUATION CONSIDERATIONS

The basic evaluation goals are to investigate system performance, image quality, cost and user satisfaction with the features provided.

All of the end user DocView functions are provided by the client software. Whether all of these are useful and desirable remains to be determined. A user questionnaire will elicit comments on the utility of these functions and on whether other functions are desirable, e.g., OCR of received images to create text data, and whether such text would be used to conduct full-text searches or to append to the user's word-processed documents.

An investigation of performance focuses on speed, image quality and bottlenecks in the system. Delivery speed is a function of the total time taken from request to display. This interval begins when the user invokes the display application program and comprises the time to access the images on optical disk controlled by a remote server on the Internet, the time to transfer the images from optical disk to server memory, the time to traverse the Internet, and the time to move the images from client memory to the display. Delays in each of these stages will be measured and analyzed. Bottlenecks in the network due, for example, to source quench (an indication that a router somewhere on the Internet is receiving data faster than it can handle, requiring the originating computer to slow down transmission) will be recorded, and their effect on performance will be analyzed.
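The stage-by-stage timing analysis can be sketched as a sum over the four delivery stages with the slowest stage flagged as the bottleneck. The stage keys and function names below are hypothetical. The final line computes an idealized lower bound on transfer time from two figures given earlier in the paper (a 1 Megabyte article and a 45 Mbps T3 backbone); it ignores protocol overhead and slower access links and is not a measurement.

```python
# The four stages between the user's request and the displayed page.
STAGES = (
    "optical_disk_access",
    "disk_to_server_memory",
    "internet_transfer",
    "client_memory_to_display",
)


def total_delivery_time(timings):
    """Sum the per-stage times (in seconds) for one page or document."""
    return sum(timings[stage] for stage in STAGES)


def bottleneck(timings):
    """Name of the stage that contributes the most delay."""
    return max(STAGES, key=lambda stage: timings[stage])


def ideal_transfer_seconds(size_bytes, link_bits_per_second):
    """Lower bound on transfer time: payload bits divided by link rate."""
    return size_bytes * 8 / link_bits_per_second


# Idealized time for a 1 Megabyte article at the T3 backbone rate.
print(ideal_transfer_seconds(1_000_000, 45_000_000))
```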

Image quality is mainly a function of scan density, which also determines image size and therefore delivery time. Documents consisting of a mixture of text-only pages, text with illustrations, and dithered gray images, scanned at 300 to 400 dpi, will be offered for access. User satisfaction in terms of the tradeoff between image quality and delivery speed will be studied for these different modes.

Among the questions to be addressed using DocView as a testbed are the following:

  • Does direct access to an electronic document store and online retrieval of documents over the Internet result in time and cost savings? By how much?
  • What are the technical specifications of the hardware and software for an affordable direct document access system?
  • What functions/features of the user interface contribute to easy document usage, speed of task completion, and overall user satisfaction?
  • What strategy should be used to select the part of the collection to be stored electronically and the part that should be left to on-demand Ariel transmission?
  • What strategy should be used to scan documents for electronic storage at a resolution of 300 dpi vs. 400 dpi?
  • Are some parts of the electronic collection better stored on magnetic disk rather than in optical disk jukeboxes?
  • What are the network problems on LANs and the Internet?

4. FUTURE DESIGN

Another design goal is to provide a document request function to empower the end user to electronically send a request to document suppliers having Ariel-type workstations. A third goal is to enable a user to send documents via DocView in addition to receiving them. A fourth goal is to add character recognition to convert the received bitmapped images to text. A fifth goal is to extend the DocView client software to other computer platforms.

On the server side, other server technologies will be investigated as possible replacements for the prototype DocView server. These include Gopher, World Wide Web (WWW), and Wide Area Information Servers (WAIS)[6]. DocView could also be modified for use with other commonly available software packages such as Mosaic[7] to provide a document viewing capability. In this case, Mosaic would provide the connection to the remote server, whether Gopher, WWW or WAIS, while DocView would provide the viewing capability. Finally, other methods of document transmission will be investigated for possible integration into DocView, such as the Multipurpose Internet Mail Extensions (MIME), which will allow images to be sent via Internet email. MIME could prove to be a viable alternative to systems such as Ariel, since it is not tied to a specific hardware platform; any computer could provide the source of images.

5. REFERENCES

1. System for Automated Interlibrary Loan: System and Operations Description. Internal Technical Report of the Communications Engineering Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine. November 1992.

2. Thoma GR, Walker FL. Essential Functions in an electronic document delivery system. In: Broering NC, ed. High-Performance Medical Libraries, Advances in Information Management for the Virtual Era. Meckler, Westport CT, 1993; pp.77-88.

3. Bharadwaj R. The Ariel project. Proc. ASIS, vol. 28 (1991); p.339.

4. Socket Reference for Windows, PC/TCP Development Kit for DOS, FTP Software, Inc., November 1992.

5. Hall M. A Guide to Windows Sockets, JSB Corporation, June 1, 1993.

6. Krol E. The Whole Internet User's Guide & Catalog, Sebastopol, CA: O'Reilly & Associates, Inc. 1992; pp 189-241.

7. NCSA Mosaic for Microsoft Windows, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign.

 
