Home
This wiki contains more detail on various aspects of the public API and the PDF document format.
- Extracts the position and size of letters from any PDF document. This enables access to the text and words in a PDF document.
- Allows the user to retrieve images from the PDF document.
- Allows the user to read PDF annotations, PDF forms, embedded documents and hyperlinks from a PDF.
- Provides access to metadata in the document.
- Exposes the internal structure of the PDF document.
- Creates PDF documents containing text and path operations.
- Reads content from encrypted files when the password is provided.
- Document Layout Analysis: PdfPig also includes tools for document layout analysis, such as the Recursive XY Cut, Document Spectrum and Nearest Neighbour algorithms, among others. It also supports exporting page contents to the ALTO, PageXML and hOCR formats. See Document Layout Analysis.
- Tables are not directly supported, but you can use Tabula Sharp or Camelot Sharp. As of 2023, Tabula Sharp is the more complete port (source).
For some use cases, PdfPig provides an alternative to commercial libraries such as SpirePDF and to copyleft libraries such as iText 7 (AGPL).
The library does not support converting HTML or other document formats to PDF. For HTML to PDF, wkhtmltopdf is a good-quality solution. PdfPig also does not currently support generating images from PDF pages; if you need this functionality, see whether docnet meets your requirements.
PdfPig aims to provide two main areas of functionality:
- Extracting PDF content.
- Creating PDFs.
The simplest usage of the library for extracting content involves opening a document and extracting the position and text of all words across all pages:
using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (PdfDocument document = PdfDocument.Open(@"C:\path\to\file.pdf"))
{
    foreach (Page page in document.GetPages())
    {
        // Each Word carries its text and its bounding box on the page.
        IEnumerable<Word> words = page.GetWords();
    }
}
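To make "position and text" concrete, the following is a minimal sketch that prints each word with its bounding box. It relies on `Word.Text` and `Word.BoundingBox` and on the `Left`, `Bottom`, `Width` and `Height` members of the rectangle; the class name `WordDump`, the helper `Describe` and the output format are illustrative choices, not part of PdfPig.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public static class WordDump
{
    // Hypothetical helper: formats a word as "text @ (left, bottom) width x height".
    // Coordinates are in PDF user-space units with the origin at the bottom left.
    public static string Describe(Word word)
    {
        var box = word.BoundingBox;
        return $"{word.Text} @ ({box.Left:F1}, {box.Bottom:F1}) {box.Width:F1} x {box.Height:F1}";
    }

    public static void Main(string[] args)
    {
        // Placeholder path from the example above; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : @"C:\path\to\file.pdf";

        using (PdfDocument document = PdfDocument.Open(path))
        {
            foreach (Page page in document.GetPages())
            {
                Console.WriteLine($"Page {page.Number}");
                foreach (Word word in page.GetWords())
                {
                    Console.WriteLine(Describe(word));
                }
            }
        }
    }
}
```

Keeping the formatting in a separate method makes it easy to swap the console output for a CSV row or a log entry.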
Pages can also be accessed individually with an index starting at 1. You can also access the positions and sizes of the individual letters on a page:
using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (PdfDocument document = PdfDocument.Open(@"C:\path\to\file.pdf"))
{
    // Page numbers start at 1, not 0.
    Page page = document.GetPage(1);
    IReadOnlyList<Letter> letters = page.Letters;
}
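As a rough illustration of what a `Letter` exposes, the sketch below prints the value, glyph rectangle, point size and font name of every letter on the first page. The class name `LetterDump` and the helper `Describe` are hypothetical names introduced for this example only.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public static class LetterDump
{
    // Hypothetical helper: summarises one letter's text, position and size.
    public static string Describe(Letter letter)
    {
        var glyph = letter.GlyphRectangle;
        return $"'{letter.Value}' at ({glyph.Left:F1}, {glyph.Bottom:F1}), " +
               $"{glyph.Width:F1} x {glyph.Height:F1}, {letter.PointSize:F1}pt, font {letter.FontName}";
    }

    public static void Main(string[] args)
    {
        // Placeholder path from the example above; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : @"C:\path\to\file.pdf";

        using (PdfDocument document = PdfDocument.Open(path))
        {
            Page page = document.GetPage(1); // pages are 1-indexed
            foreach (Letter letter in page.Letters)
            {
                Console.WriteLine(Describe(letter));
            }
        }
    }
}
```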
To create a document, use PdfDocumentBuilder with one of the Standard 14 fonts, which are defined in the PDF specification and so need no font file:
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Fonts.Standard14Fonts;
using UglyToad.PdfPig.Writer;

PdfDocumentBuilder builder = new PdfDocumentBuilder();
PdfPageBuilder page = builder.AddPage(PageSize.A4);
PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
page.AddText("Hello World!", 12, new PdfPoint(25, 520), font);
byte[] b = builder.Build();
The resulting bytes are a valid PDF document and can be saved to the file system, served from a web server, etc.
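For instance, saving to the file system could look like the sketch below, which builds the same "Hello World!" document, writes it with `File.WriteAllBytes` and reopens it as a sanity check. The class name `SaveExample`, the helper `BuildHelloWorld` and the temp-directory output path are assumptions made for illustration.

```csharp
using System;
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Fonts.Standard14Fonts;
using UglyToad.PdfPig.Writer;

public static class SaveExample
{
    // Builds the one-page "Hello World!" document from the example above.
    public static byte[] BuildHelloWorld()
    {
        PdfDocumentBuilder builder = new PdfDocumentBuilder();
        PdfPageBuilder page = builder.AddPage(PageSize.A4);
        PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
        page.AddText("Hello World!", 12, new PdfPoint(25, 520), font);
        return builder.Build();
    }

    public static void Main()
    {
        byte[] bytes = BuildHelloWorld();

        // Hypothetical output location; any writable path works.
        string path = Path.Combine(Path.GetTempPath(), "hello-world.pdf");
        File.WriteAllBytes(path, bytes);

        // Reopen the saved file to confirm that it parses.
        using (PdfDocument document = PdfDocument.Open(path))
        {
            Console.WriteLine($"Saved {bytes.Length} bytes, {document.NumberOfPages} page(s), to {path}");
        }
    }
}
```

The same byte array can be returned from a web endpoint with the `application/pdf` content type instead of being written to disk.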
You can also use the document builder to visualise what PdfPig extracted when reading a document, for example by drawing a rectangle around each word on a copy of the page:
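using UglyToad.PdfPig;
using UglyToad.PdfPig.Writer;

// Example values; point these at your own document and page (pages are 1-indexed).
string pdf = @"C:\path\to\file.pdf";
int pageNumber = 1;

using (var document = PdfDocument.Open(pdf))
{
    var builder = new PdfDocumentBuilder();
    // Copy the source page into the new document so the shapes are drawn over it.
    var pageBuilder = builder.AddPage(document, pageNumber);
    pageBuilder.SetStrokeColor(255, 0, 0); // red outlines
    var page = document.GetPage(pageNumber);
    foreach (var word in page.GetWords())
    {
        var box = word.BoundingBox;
        // The casts match a decimal-based DrawRectangle; drop them if your PdfPig version takes double.
        pageBuilder.DrawRectangle(box.BottomLeft, (decimal)box.Width, (decimal)box.Height);
    }
    // ToImage() and Display() are not defined in this snippet and appear to be notebook helpers;
    // elsewhere, call builder.Build() and save the resulting bytes instead.
    builder.ToImage().Display();
}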
View this gist that goes through some basic beginner examples: https://gist.github.com/cordasfilip/c6d2510b358323dc2f71c843460cbcdf
More details on the API can be found here.
Additional automated documentation from doc-comments can be found on DotNetApis.
Release notes as well as downloadable packages can be found on the releases page https://github.com/UglyToad/PdfPig/releases.