PdfDocument
Namespace - UglyToad.PdfPig
The PdfDocument class provides all root functionality for consuming document content.
To obtain a PdfDocument instance, call the static PdfDocument.Open method. There are 3 overloads for opening a document:
PdfDocument Open(byte[] fileBytes, ParsingOptions options = null);
This opens a document from an array of bytes representing a PDF document.
PdfDocument Open(string filePath, ParsingOptions options = null);
This opens a document from the filesystem at the provided path. This loads the entire file into memory at once. To avoid that, use the third overload:
PdfDocument Open(Stream stream, ParsingOptions options = null);
This opens a document from a stream of any kind, such as a MemoryStream or FileStream. Note that if the stream is not buffered (e.g. a network stream) this will be much slower. One workaround is to wrap the stream in a BufferedStream, a framework class that adds buffering automatically.
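The buffering workaround can be sketched as follows. This is a minimal illustration rather than library code: the class StreamOpenExample and the method CountPages are hypothetical names, and the 64 KB buffer size is an arbitrary assumption.

```csharp
using System.IO;
using UglyToad.PdfPig;

public static class StreamOpenExample
{
    // Hypothetical helper: wraps a possibly unbuffered stream before opening it.
    public static int CountPages(Stream source)
    {
        // BufferedStream is a framework class; 64 KB is an arbitrary size (assumption).
        // Disposing the BufferedStream also disposes the underlying source stream.
        using (var buffered = new BufferedStream(source, 64 * 1024))
        using (PdfDocument document = PdfDocument.Open(buffered))
        {
            return document.NumberOfPages;
        }
    }
}
```

A caller would pass any readable stream, for example StreamOpenExample.CountPages(someStream). The wrapper takes ownership of the stream, so the caller should not use it afterwards.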
Any call to Open should be wrapped in a using statement since PdfDocument implements IDisposable:
using (PdfDocument document = PdfDocument.Open(@"C:\docs\test.pdf"))
{
}
Parsing options control aspects of how the document is opened and allow the consumer to provide their own logger. The defaults should be sufficient, except where the document is password protected, in which case the password must be provided in the ParsingOptions.Password property.
UseLenientParsing controls how strictly the library interprets the PDF specification and how much error recovery it attempts where the document format is invalid or corrupt. The default is to attempt lenient parsing, but a stricter parsing mode can be enabled by passing the static ParsingOptions.LenientParsingOff instance.
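The two options described above can be combined in a single ParsingOptions instance. The following is a sketch under stated assumptions: ParsingOptionsExample, BuildOptions and OpenWith are hypothetical names introduced for illustration, and only the Password and UseLenientParsing properties mentioned in this section are used.

```csharp
using UglyToad.PdfPig;

public static class ParsingOptionsExample
{
    // Hypothetical helper: builds options from a password and a leniency flag.
    // Pass an empty string when the document has no password.
    public static ParsingOptions BuildOptions(string password, bool lenient)
    {
        return new ParsingOptions
        {
            Password = password,
            UseLenientParsing = lenient
        };
    }

    // The caller owns the returned document and should dispose it.
    public static PdfDocument OpenWith(string filePath, string password, bool lenient)
    {
        return PdfDocument.Open(filePath, BuildOptions(password, lenient));
    }
}
```

When only strict parsing is needed and there is no password, passing ParsingOptions.LenientParsingOff directly is simpler than building a new instance.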
Once a PdfDocument has been obtained by calling Open the main use case is to inspect the pages that the document contains.
Firstly the total number of pages in the document is provided by:
int numberOfPages = document.NumberOfPages;
Individual pages may then be opened using GetPage. This takes a 1-indexed page number as an argument:
using UglyToad.PdfPig.Content;
// ...
Page page1 = document.GetPage(1);
Page page2 = document.GetPage(2);
// etc.
Calling GetPage(i) with a value of i <= 0 is invalid.
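Because page numbers start at 1, a loop over every page runs from 1 to NumberOfPages inclusive. The sketch below illustrates this; PageLoopExample and ReadAllPageText are hypothetical names, and the Page.Text property it reads belongs to the library's Page class but is not described in this section.

```csharp
using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public static class PageLoopExample
{
    // Hypothetical helper: collects the text of each page in page order.
    public static List<string> ReadAllPageText(PdfDocument document)
    {
        var texts = new List<string>(document.NumberOfPages);

        // Start at 1, not 0: GetPage takes a 1-indexed page number.
        for (int i = 1; i <= document.NumberOfPages; i++)
        {
            Page page = document.GetPage(i);
            texts.Add(page.Text);
        }

        return texts;
    }
}
```

The returned list is 0-indexed, so the text of page n is at index n - 1.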
You can also enumerate all pages in a document in order using:
using System.Collections.Generic;
using UglyToad.PdfPig.Content;
// ...
IEnumerable<Page> pages = document.GetPages();
A PDF document can include general information about the document at the top level in the XML format defined by the Extensible Metadata Platform (XMP).
If this optional XML data is present it may be obtained using the TryGetXmpMetadata method:
using System.Xml.Linq;
using UglyToad.PdfPig.Content;
// ...
if (document.TryGetXmpMetadata(out XmpMetadata metadata))
{
    // GetXDocument returns the XMP data as a System.Xml.Linq.XDocument.
    XDocument xmpDocument = metadata.GetXDocument();
}
else
{
    // No XMP metadata was present.
}
In addition to XMP metadata, which allows for an extensible range of metadata, a PDF document may optionally contain an information dictionary. This defines a range of fields such as author, title, etc.
This can be accessed through the Information property:
using UglyToad.PdfPig.Content;
// ...
DocumentInformation information = document.Information;
string title = information.Title;
string author = information.Author;
// etc.
Since all fields on the information dictionary are optional they can be null and should be checked prior to access, e.g.:
DocumentInformation information = document.Information;
if (information.Author != null)
{
    string upperAuthor = information.Author.ToUpper();
}
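An alternative to explicit null checks is the null-coalescing operator, which substitutes a fallback value for a missing field. This is only a sketch: InformationExample, Describe and the placeholder strings are hypothetical and not part of the library.

```csharp
using UglyToad.PdfPig.Content;

public static class InformationExample
{
    // Hypothetical helper: formats title and author, tolerating missing fields.
    public static string Describe(DocumentInformation information)
    {
        // Each field may be null, so fall back to a placeholder (assumed wording).
        string title = information.Title ?? "(untitled)";
        string author = information.Author ?? "(unknown author)";

        return title + " by " + author;
    }
}
```

It would be called as InformationExample.Describe(document.Information).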
There are multiple versions of the PDF specification following the numbering 1.1, 1.2, 1.3, etc. The version number of the current document can be retrieved with the Version property:
decimal version = document.Version;
Documents can be encrypted using a number of different algorithms defined by the PDF specification. The IsEncrypted flag indicates whether a document is encrypted.
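The Version and IsEncrypted properties can be read together to produce a quick summary of an opened document. The sketch below assumes nothing beyond the properties already described; DocumentSummaryExample and Summarize are hypothetical names.

```csharp
using UglyToad.PdfPig;

public static class DocumentSummaryExample
{
    // Hypothetical helper: one-line summary of version, encryption and page count.
    public static string Summarize(PdfDocument document)
    {
        // The version is formatted with the current culture, so the decimal
        // separator in the result may vary between machines.
        return "PDF " + document.Version
            + ", encrypted: " + document.IsEncrypted
            + ", pages: " + document.NumberOfPages;
    }
}
```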
The Structure property of a document provides access to the underlying PDF tokens that are used to construct the document.
This is for advanced users and relies on a familiarity with the PDF specification to use.
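As a rough illustration of what token-level access looks like, the sketch below lists the keys of the document catalog dictionary. It rests on assumptions not stated in this section: that Structure exposes a Catalog whose CatalogDictionary has a Data dictionary keyed by name, as in recent versions of the library. StructureExample and CatalogKeys are hypothetical names.

```csharp
using System.Collections.Generic;
using UglyToad.PdfPig;

public static class StructureExample
{
    // Hypothetical helper: returns the key names of the catalog dictionary.
    // Assumes Structure.Catalog.CatalogDictionary.Data exists in the version in use.
    public static List<string> CatalogKeys(PdfDocument document)
    {
        var keys = new List<string>();

        foreach (var entry in document.Structure.Catalog.CatalogDictionary.Data)
        {
            keys.Add(entry.Key);
        }

        return keys;
    }
}
```

Interpreting the values behind these keys requires the PDF specification, which defines the meaning of each catalog entry.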