Architecture

To get the most out of PDFBox, it is necessary to understand how a PDF document is organized, because PDFBox was designed around the concepts laid out in the ISO-32000 (PDF) specification.

Quick Introduction to the PDF format

A PDF file is made up of a sequence of bytes. These bytes, grouped into tokens, make up the basic objects upon which higher level objects and structures are built [see ISO-32000 7.3].

PDFBox makes these basic objects available in the *org.apache.pdfbox.cos* package (The COS Model).

The organization of these objects, and how they are read and written, is defined by the file structure of the PDF [see ISO-32000 7.5]. In addition, a file can be encrypted to protect the document's content [see ISO-32000 7.6].

PDFBox handles the reading in the *org.apache.pdfbox.pdfparser* package. Writing of PDF files is handled in the *org.apache.pdfbox.pdfwriter* package.

Within the file structure, basic objects are used to create a document structure, building higher-level objects such as pages, bookmarks, and annotations [see ISO-32000 7.7].

PDFBox makes these higher-level objects available through the *org.apache.pdfbox.pdmodel* package (The PD Model).

In addition there is a COS representation available for the PD model if there is a need to inspect the underlying structure or to handle special cases where the higher level PD model doesn't provide the functionality needed.

It's always the COS model which is represented in the PDF file.

The COS Model

As outlined above the basic PDF objects are represented in PDFBox in the org.apache.pdfbox.cos package.

| PDF Type | Description | Example | PDFBox class | ISO 32000 |
|---|---|---|---|---|
| Boolean | Standard True/False values | true | org.apache.pdfbox.cos.COSBoolean | 7.3.2 |
| Number | Integer and floating point numbers | 1 2.3 | org.apache.pdfbox.cos.COSInteger, org.apache.pdfbox.cos.COSFloat | 7.3.3 |
| String | A sequence of characters | (This is a string) | org.apache.pdfbox.cos.COSString | 7.3.4 |
| Name | A predefined value in a PDF document, typically used as a key in a dictionary | /Type | org.apache.pdfbox.cos.COSName | 7.3.5 |
| Array | Arrays are one-dimensional lists of objects accessed by a numeric index. Within an array each basic object is permitted as an entry. | [549 3.14 false (Ralph) /SomeName] | org.apache.pdfbox.cos.COSArray | 7.3.6 |
| Dictionary | A map of name value pairs | << /Type /XObject /Name (Name) /Size 1 >> | org.apache.pdfbox.cos.COSDictionary | 7.3.7 |
| Stream | A stream of data, typically compressed. This is used for page contents, images and embedded font streams. | 12 0 obj << /Type /XObject >> stream 030004040404040404 endstream | org.apache.pdfbox.cos.COSStream | 7.3.8 |
| Object | A wrapper to any of the other objects, this can be used to reference an object multiple times. An object is referenced by using two numbers, an object number and a generation number. Initially the generation number will be zero unless the object got replaced later in the stream. | 12 0 obj << /Type /XObject >> endobj | org.apache.pdfbox.cos.COSObject | |
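
To make the object number / generation number pair more concrete, here is a minimal sketch in plain Java that does not use PDFBox at all. The names IndirectReferenceSketch, ObjectKey, parseReference and resolve are hypothetical and exist only for this illustration; the sketch assumes a reference is written as two numbers followed by the keyword R, for example "12 0 R".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration only; none of these classes belong to PDFBox.
class IndirectReferenceSketch {

    /** Identifies an object by its object number and generation number. */
    static final class ObjectKey {
        final long number;
        final int generation;

        ObjectKey(long number, int generation) {
            this.number = number;
            this.generation = generation;
        }

        /** Parses a reference such as "12 0 R" (object number, generation number, keyword R). */
        static ObjectKey parseReference(String text) {
            String[] parts = text.trim().split("\\s+");
            if (parts.length != 3 || !parts[2].equals("R")) {
                throw new IllegalArgumentException("Not an indirect reference: " + text);
            }
            return new ObjectKey(Long.parseLong(parts[0]), Integer.parseInt(parts[1]));
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof ObjectKey)) {
                return false;
            }
            ObjectKey key = (ObjectKey) other;
            return number == key.number && generation == key.generation;
        }

        @Override
        public int hashCode() {
            return Objects.hash(number, generation);
        }
    }

    /** Builds a one-entry object pool and looks the given reference up in it. */
    static String resolve(String reference) {
        Map<ObjectKey, String> pool = new HashMap<>();
        // "12 0 obj ... endobj" defines object number 12 with generation number 0.
        pool.put(new ObjectKey(12, 0), "<< /Type /XObject >>");
        return pool.get(ObjectKey.parseReference(reference));
    }

    public static void main(String[] args) {
        System.out.println(resolve("12 0 R"));
    }
}
```

Because both numbers form the key, a reference with a different generation number (for example "12 1 R") does not resolve to the same object in this sketch.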

A page in a PDF document is represented by a COSDictionary. The entries available for a page are listed in the PDF Reference; an example page looks like this:

<<
    /Type /Page
    /MediaBox [0 0 612 915]
    /Contents 56 0 R
>>

The information within the dictionary can be accessed using the COS model:

COSDictionary page = ...;
COSArray mediaBox = (COSArray)page.getDictionaryObject( "MediaBox" );
// MediaBox is [llx lly urx ury]; index 2 (urx) is the width here because llx is 0.
System.out.println( "Width:" + mediaBox.get( 2 ) );

As this small example shows, the COS model provides a low-level API for accessing information within the PDF. Using the COS model successfully requires a good knowledge of the PDF specification.
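
One detail that this kind of low-level access leaves to the caller is the meaning of the four MediaBox numbers. The following plain-Java sketch (no PDFBox involved; MediaBoxSketch, width and height are hypothetical names) assumes the usual ordering of lower-left x, lower-left y, upper-right x, upper-right y, and derives the page size from it:

```java
// Hypothetical illustration only; this class does not belong to PDFBox.
class MediaBoxSketch {

    /** Width of a rectangle given as [llx lly urx ury]. */
    static float width(float[] box) {
        return box[2] - box[0];
    }

    /** Height of a rectangle given as [llx lly urx ury]. */
    static float height(float[] box) {
        return box[3] - box[1];
    }

    public static void main(String[] args) {
        // The four numbers from the example page dictionary above.
        float[] mediaBox = {0f, 0f, 612f, 915f};
        System.out.println("Width:" + width(mediaBox));
        System.out.println("Height:" + height(mediaBox));
    }
}
```

When the lower-left corner is not at the origin, reading a single array entry no longer gives the width or height, which is why the subtraction is spelled out here.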

The PD Model

The COS Model allows access to all aspects of a PDF document. This type of programming is tedious and error-prone, though, because the user must know the names of all the parameters and no helper methods are available. The PD Model was created to help alleviate this problem. Each type of object (page, font, image) has a set of defined attributes that can be present in its dictionary. A PD Model class is available for each of these, so that strongly typed methods are available to access the attributes.

The same code from above to get the page width can be rewritten to use PD Model classes.

PDPage page = ...;
PDRectangle mediaBox = page.getMediaBox();
System.out.println( "Width:" + mediaBox.getWidth() );

PD Model objects sit on top of the COS model. Typically, a PD Model class stores only a COS object, and all setter/getter methods read or modify data stored in that COS object. For example, when you call PDPage.getLastModified(), the method looks up the key "LastModified" in the COSDictionary; if it is found, the value is converted to a java.util.Calendar. When PDPage.setLastModified( Calendar ) is called, the Calendar is converted to a string and stored in the COSDictionary.
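
The wrapper idea can be sketched in plain Java without PDFBox. Everything below is hypothetical and only mimics the pattern: WrapperSketch, ToyPage and storeSample are made-up names, a Map<String, String> stands in for the COSDictionary, and the date layout is a simplified assumption rather than PDFBox's actual conversion.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

// Hypothetical illustration only; none of these classes belong to PDFBox.
class WrapperSketch {

    /** Typed wrapper that keeps no state of its own besides the low-level dictionary. */
    static final class ToyPage {
        private final Map<String, String> dictionary;

        ToyPage(Map<String, String> dictionary) {
            this.dictionary = dictionary;
        }

        /** Looks up "LastModified" and converts the stored string to a Calendar, or returns null. */
        Calendar getLastModified() {
            String raw = dictionary.get("LastModified");
            if (raw == null) {
                return null;
            }
            try {
                Calendar result = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.ROOT);
                result.setTime(newFormat().parse(raw));
                return result;
            } catch (ParseException e) {
                throw new IllegalStateException("Unreadable date: " + raw, e);
            }
        }

        /** Converts the Calendar to a string and writes it into the dictionary. */
        void setLastModified(Calendar value) {
            dictionary.put("LastModified", newFormat().format(value.getTime()));
        }

        // Simplified layout, always UTC; real PDF date strings may also carry a time zone offset.
        private static SimpleDateFormat newFormat() {
            SimpleDateFormat format = new SimpleDateFormat("'D:'yyyyMMddHHmmss", Locale.ROOT);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            return format;
        }
    }

    /** Stores a fixed UTC timestamp through the wrapper and returns the raw dictionary value. */
    static String storeSample(Map<String, String> dictionary) {
        Calendar when = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.ROOT);
        when.clear();
        when.set(2024, Calendar.JANUARY, 2, 3, 4, 5);
        new ToyPage(dictionary).setLastModified(when);
        return dictionary.get("LastModified");
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        System.out.println(storeSample(dictionary));
        System.out.println(new ToyPage(dictionary).getLastModified().getTime());
    }
}
```

The point of the sketch is that the wrapper holds no copy of the value: a second ToyPage created over the same map sees what the first one wrote, because the dictionary remains the single source of truth.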