1.1.3 File structure

Felix Schütt edited this page Jul 6, 2017 · 2 revisions

The "Hello World" PDF example shows clearly how a PDF file is structured. We have a header, the actual document body, the cross-reference table and the trailer.

The different parts (objects) of a PDF are written into the file one after another, separated by line breaks. The header, the cross-reference table and the trailer keywords are line-based, so other whitespace must not be used as a delimiter there.

Header

The header is located at the start of the file and always consists of %PDF- followed by the PDF version number.

%PDF-1.4
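
As a small illustration of the header rule, the following sketch reads the version number back out of the first bytes of a file. The function name parse_pdf_version is hypothetical and not part of any library; it only assumes the %PDF-major.minor form shown above.

```python
import re


def parse_pdf_version(data: bytes) -> tuple:
    """Return (major, minor) from a header such as b'%PDF-1.4'.

    Hypothetical helper: it only checks the very start of the data.
    """
    match = re.match(rb"%PDF-(\d+)\.(\d+)", data)
    if match is None:
        raise ValueError("data does not start with a %PDF- header")
    return int(match.group(1)), int(match.group(2))


print(parse_pdf_version(b"%PDF-1.4\n"))  # (1, 4)
```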

Body

The document body contains the actual information that makes up the document. All indirect objects (the ones that begin with obj and end with endobj) have to be written into the file one by one. Objects have to be numbered, starting from 1, without gaps, but they don't necessarily have to be written in the order of their object number.
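
A minimal sketch of writing the body could look like this. The helper write_body and the two placeholder dictionaries are assumptions made for illustration; the sketch also assumes generation 0 for every object. It checks that the numbering has no gaps, writes the objects in whatever order they are given, and remembers the byte position at which each one starts.

```python
def write_body(header: bytes, objects: dict) -> tuple:
    """Append every object as 'N 0 obj ... endobj' after the header.

    `objects` maps object number -> already serialized content (bytes).
    Returns the file bytes so far and a dict of object number -> start offset.
    """
    numbers = sorted(objects)
    if numbers != list(range(1, len(numbers) + 1)):
        raise ValueError("object numbers must run from 1 upwards without gaps")

    out = bytearray(header)
    offsets = {}
    # Dict order is the file order; it may differ from the numeric order.
    for number, content in objects.items():
        offsets[number] = len(out)  # position of the first digit of the number
        out += b"%d 0 obj\n" % number + content + b"\nendobj\n"
    return bytes(out), offsets


# Placeholder objects, written out of numeric order on purpose.
data, offsets = write_body(
    b"%PDF-1.4\n",
    {
        2: b"<< /Type /Catalog /Pages 1 0 R >>",
        1: b"<< /Type /Pages /Kids [] /Count 0 >>",
    },
)
print(offsets)
```

The recorded positions count bytes from the start of the file, so the first object written here starts right after the 9-byte header line.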

Cross-reference table

This table contains the positions of the indirect objects within the file. The position is defined as the number of bytes between the start of the file and the first byte of the object (a reference to byte 0 points to the start of the file). The first byte of an object is the first digit of the object number in front of the obj keyword (for example, the digit 7 is the first byte of an object that starts with 7 0 obj). The cross-reference table does not follow the established PDF syntax, however. The rules for the xref table are as follows:

First, we write a single line containing the keyword xref

Next, we write a line with two numbers. The first number is the object ID of the first object in the xref table. For a newly created PDF, this will always be 0 (even if you start with the object ID 1), but you can append PDFs and chain multiple xref tables together, in which case this number might not be 0. Then we write (separated by whitespace) the number of entries in the xref table, which is exactly the number of indirect objects in our file + 1 (the first entry is special, hence the + 1).

The lines that follow are separated by line breaks. Each line consists of a 10-digit number, a space, a 5-digit number, another space and a single letter for the entry type. The numbers have to be padded with 0 to the left. The letter f stands for free objects (deleted objects) and n for normal objects.

If we use UNIX or Apple line breaks, we have to add a space to the end of each line, because every xref entry has to be exactly 20 bytes long, line break included. Without this space (remember "\n" or "\r" vs. "\r\n"), the entry would be only 19 bytes long.

The first entry in the xref table is the so-called "null object". It is only of interest if you edit files, but it has to be present even in newly created files. The null object's first number is always 0, the second number is 65535 and the letter is f. This means that no deleted objects are present.

The entries for the indirect objects follow, in the order of their object number (this is why you can't have gaps in your object numbering). The first number is the position of the object in the file, the second number is the generation ID, and the letter n marks a regular object.

Example:

xref
0 7
0000000000 65535 f 
0000000009 00000 n 
0000000050 00000 n 
0000000102 00000 n 
0000000268 00000 n 
0000000374 00000 n 
0000000443 00000 n 
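
The rules above can be sketched in a few lines of code. The names xref_entry and xref_table are hypothetical, and the sketch assumes a newly created file (first object ID 0, generation 0 for every regular object). The offsets are taken from the example table, so the output should reproduce it byte for byte.

```python
def xref_entry(offset: int, generation: int, in_use: bool, eol: bytes = b"\n") -> bytes:
    """Format one cross-reference entry that is exactly 20 bytes long."""
    if eol not in (b"\n", b"\r", b"\r\n"):
        raise ValueError("eol must be LF, CR or CRLF")
    kind = b"n" if in_use else b"f"
    # One-byte line breaks need a trailing space to reach 20 bytes.
    padding = b" " if len(eol) == 1 else b""
    entry = b"%010d %05d " % (offset, generation) + kind + padding + eol
    if len(entry) != 20:
        raise ValueError("offset or generation does not fit the fixed-width fields")
    return entry


def xref_table(offsets: dict, eol: bytes = b"\n") -> bytes:
    """Build a table for objects 1..N; `offsets` maps object number -> byte offset."""
    count = len(offsets) + 1  # + 1 for the null object entry
    parts = [b"xref" + eol, b"0 %d" % count + eol]
    parts.append(xref_entry(0, 65535, in_use=False, eol=eol))  # null object
    for number in range(1, count):
        parts.append(xref_entry(offsets[number], 0, in_use=True, eol=eol))
    return b"".join(parts)


table = xref_table({1: 9, 2: 50, 3: 102, 4: 268, 5: 374, 6: 443})
print(table.decode("ascii"), end="")
```

Looking the offsets up by object number (instead of iterating the dict) keeps the entries in numeric order, and it fails with a KeyError if the numbering has a gap.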

Trailer

Finally, we write the trailer, which luckily adheres to the regular PDF syntax. It starts with a single line containing the keyword trailer, followed by a dictionary with a few required keys. At a minimum you have to provide /Size, which refers to the number of entries in your cross-reference table (number of indirect objects + 1), and /Root, which tells the reader which object is the "Catalog" of the file. Usually you'll also encounter /Info, which references an object with metadata (author, date, etc.). (Note: /Info is not required for a normal PDF, but it must be present in any PDF/X- or PDF/A-conforming document. The reason is compatibility with older PDF readers; for newer documents, you'll also have to set XMP metadata.)

After the dictionary, you'll see a single line with the keyword startxref, followed by a single line with the position of the xref table. The position of the xref table is defined as the offset from the start of the file to the x of xref.
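
Going the other way, a reader can use this number to find the table. The following is a rough sketch under simplifying assumptions: the function name find_xref_offset is hypothetical, the whole file is held in memory, and the last occurrence of startxref in the file is taken to be the real keyword.

```python
def find_xref_offset(data: bytes) -> int:
    """Return the byte offset stored after the last 'startxref' keyword."""
    keyword = b"startxref"
    marker = data.rfind(keyword)
    if marker == -1:
        raise ValueError("no startxref keyword found")
    tokens = data[marker + len(keyword):].split()
    if not tokens or not tokens[0].isdigit():
        raise ValueError("startxref is not followed by a number")
    offset = int(tokens[0])
    # The offset has to point at the x of the xref keyword.
    if data[offset:offset + 4] != b"xref":
        raise ValueError("offset does not point at an xref table")
    return offset


# Not a complete PDF, just enough bytes to exercise the lookup.
sample = (
    b"%PDF-1.4\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\n"
    b"startxref\n9\n%%EOF\n"
)
print(find_xref_offset(sample))  # 9
```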

The end of the file is marked by a line with the contents %%EOF. Everything after this text should be ignored. This convention eliminates problems with file markers, unnecessary newlines at the end of the file, etc.

Example:

trailer
<< /Size 7
   /Info 1 0 R
   /Root 2 0 R
>>
startxref
534
%%EOF
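
For illustration, the trailer above could be generated like this. The function build_trailer and its parameters are hypothetical names, and the layout (one key per line, three spaces of indentation) simply mirrors the example rather than anything the format requires.

```python
from typing import Optional


def build_trailer(size: int, root: int, xref_offset: int,
                  info: Optional[int] = None, eol: bytes = b"\n") -> bytes:
    """Serialize the trailer dictionary, startxref and the %%EOF marker.

    `root` and `info` are object numbers; generation 0 is assumed.
    """
    entries = [b"/Size %d" % size]
    if info is not None:
        entries.append(b"/Info %d 0 R" % info)
    entries.append(b"/Root %d 0 R" % root)
    dictionary = b"<< " + (eol + b"   ").join(entries) + eol + b">>"
    lines = [b"trailer", dictionary, b"startxref", b"%d" % xref_offset, b"%%EOF"]
    return eol.join(lines) + eol


print(build_trailer(size=7, root=2, xref_offset=534, info=1).decode("ascii"), end="")
```

Here size is the number of xref entries (6 indirect objects + 1) and xref_offset is the position of the x of xref, both taken from the example.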

Next up: Implementation limits