🤖 pdf-bot

Easily create a microservice for generating PDFs using headless Chrome.

pdf-bot is installed on a server and receives URLs to turn into PDFs through its API or CLI. It manages a queue of PDF jobs. Once a PDF job has run, it notifies you using a webhook so you can fetch the PDF. pdf-bot supports storing PDFs on S3 out of the box. Failed PDF generations and webhook pings are retried according to a configurable decaying schedule.

pdf-bot uses html-pdf-chrome under the hood and supports all the settings that it supports. Major thanks to @westy92 for making this possible.

How does it work?

Imagine you have an app that creates invoices. You want to save those invoices as PDF. You install pdf-bot on a server as an API. Your app server sends the URL of the invoice to the pdf-bot server. A cronjob on the pdf-bot server keeps checking for new jobs, generates a PDF using headless Chrome and sends the location back to the application server using a webhook.

Prerequisites

Node.js v6 or later

Installation

$ npm install -g pdf-bot
$ pdf-bot install

Make sure the node path is in your $PATH

pdf-bot install will prompt for some basic configurations and then create a storage folder where your database and pdf files will be saved.

Configuration

pdf-bot comes packaged with sensible defaults. At the very minimum you must have a config file, with a storagePath given, in the folder from which you are executing pdf-bot. In practice you probably want to use the pdf-bot install command to generate a configuration file and then define a shell alias such as alias pdf-bot="pdf-bot -c /home/pdf-bot.config.js"

pdf-bot.config.js

var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'crazy-secret'
  },
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Timer(1000) // 1 sec timeout
  },
  storagePath: 'storage'
}

$ pdf-bot -c ./pdf-bot.config.js push https://esbenp.github.io

See a full list of the available configuration options.

Usage guide

Structure and concept

pdf-bot is meant to be a microservice that runs a server to generate PDFs for you. That usually means you will send requests from your application server to the PDF server, asking for a URL to be rendered as a PDF. pdf-bot will manage a queue and retry failed generations. Once a PDF has been generated successfully, a path to it is sent back to your application server.

Let us check out the flow for an app that generates PDF invoices.

1. (App server): An invoice is created ----> Send URL to invoice to pdf-bot server
2. (pdf-bot server): Put the URL in the queue
3. (pdf-bot server): PDF is generated using headless Chrome
4. (pdf-bot server): (if failed try again using 1 min, 3 min, 10 min, 30 min, 60 min delay)
5. (pdf-bot server): Upload PDF to storage (e.g. Amazon S3)
6. (pdf-bot server): Send S3 location of PDF back to the app server
7. (App server): Receive S3 location of PDF -> Check signature sum matches for security
8. (App server): Handle PDF however you see fit (move it, download it, save it etc.)

You can send meta data to the pdf-bot server that will be sent back to the application. This can help you identify what PDF you are receiving.
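
As a rough sketch of how the application side might use that meta data, the snippet below routes an incoming webhook payload by a type field. The payload shape (the pushed object arriving under a `meta` key), the `handlers` table and the `describePdf` helper are all assumptions made for illustration, not pdf-bot's documented format.

```javascript
// Hypothetical handlers keyed by the `type` field we chose to send as meta data.
var handlers = {
  invoice: function (meta) { return 'invoice #' + meta.id },
  report: function (meta) { return 'report #' + meta.id }
}

// Assumes the webhook payload carries the pushed meta data under `meta`.
function describePdf (payload) {
  var meta = (payload && payload.meta) || {}
  var known = Object.prototype.hasOwnProperty.call(handlers, meta.type)
  return known ? handlers[meta.type](meta) : 'unknown PDF'
}

console.log(describePdf({ meta: { type: 'invoice', id: 1 } })) // invoice #1
```

Sending a small discriminator such as type alongside your own record id keeps the webhook handler free of any lookup by URL.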

Setup

On your pdf-bot server start by creating a config file pdf-bot.config.js. You can see an example file here

pdf-bot.config.js

var createS3Config = require('pdf-bot/src/storage/s3')

module.exports = {
  api: {
    port: 3000,
    token: 'api-token'
  },
  storage: {
    's3': createS3Config({
      bucket: '',
      accessKeyId: '',
      region: '',
      secretAccessKey: ''
    })
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

As a minimum, you should configure an access token for your API. This is used to authenticate jobs sent to your pdf-bot server. You also need to add a webhook configuration to have PDF notifications sent back to your application server. You should add a secret, which is used to generate a signature that lets you check that the request has not been tampered with during transfer.

Start your API using

pdf-bot -c ./pdf-bot.config.js api

This will start an express server that listens for new jobs on port 3000.

Setting up Chrome

pdf-bot uses html-pdf-chrome, which in turn uses chrome-launcher to launch Chrome. You should check out those two resources on how to properly set up Chrome. However, with chrome-launcher Chrome should be started automatically. Otherwise, html-pdf-chrome has a small guide on how to keep it running as a process using pm2.

You can install Chrome on Ubuntu using

sudo apt-get update && sudo apt-get install chromium-browser

If you are testing things on macOS or similar, chrome-launcher should be able to find and automatically start Chrome for you.

Setting up the receiving API

In the examples folder there is a small example on how the application API could look. Basically, you just have to define an endpoint that will receive the webhook and check that the signature matches.

api.post('/hook', function (req, res) {
  // Signature header sent by pdf-bot (prefix set by webhook.headerNamespace)
  var signature = req.get('X-PDF-Signature')

  // Recompute the HMAC-SHA1; the key must equal webhook.secret in your config
  var bodyCrypted = require('crypto')
    .createHmac('sha1', '1234')
    .update(JSON.stringify(req.body))
    .digest('hex')

  if (bodyCrypted !== signature) {
    res.status(401).send()
    return
  }

  console.log('PDF webhook received', JSON.stringify(req.body))

  res.status(204).send()
})
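
The signature check in the handler above can also be pulled into a small helper that compares in constant time. This is a minimal sketch, assuming the signature is the hex HMAC-SHA1 of the JSON-serialized body exactly as in the handler; `signBody`, `isValidSignature` and the sample payload are hypothetical names introduced here.

```javascript
var crypto = require('crypto')

// Computes the hex HMAC-SHA1 of the JSON-serialized body, as in the handler above.
function signBody (body, secret) {
  return crypto
    .createHmac('sha1', secret)
    .update(JSON.stringify(body))
    .digest('hex')
}

// Compares signatures in constant time; returns false on any length mismatch.
function isValidSignature (body, signature, secret) {
  var expected = Buffer.from(signBody(body, secret), 'utf8')
  var received = Buffer.from(String(signature || ''), 'utf8')
  if (expected.length !== received.length) {
    return false
  }
  return crypto.timingSafeEqual(expected, received)
}

var body = { id: 'job-1', meta: { id: 1 } } // hypothetical payload
var secret = '1234'
console.log(isValidSignature(body, signBody(body, secret), secret)) // true
console.log(isValidSignature(body, 'tampered', secret)) // false
```

crypto.timingSafeEqual throws when the two buffers differ in length, which is why the length is checked first.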

Setup production environment

Follow the guide under production/ to see how to set up pdf-bot using pm2 and nginx.

Setup crontab

Set up your crontab to continuously look for jobs that have not yet been completed and to retry failed webhook pings.

* * * * * node $(npm bin -g)/pdf-bot -c ./pdf-bot.config.js shift:all >> /var/log/pdfbot.log 2>&1
* * * * * node $(npm bin -g)/pdf-bot -c ./pdf-bot.config.js ping:retry-failed >> /var/log/pdfbot.log 2>&1

Quick example using the CLI

Let us assume I want to generate a PDF for https://esbenp.github.io. I can add the job using the pdf-bot CLI.

$ pdf-bot -c ./pdf-bot.config.js push https://esbenp.github.io --meta '{"id":1}'

Next, if my crontab is not set up to run it automatically, I can run it using the shift:all command

$ pdf-bot -c ./pdf-bot.config.js shift:all

This will run all unfinished jobs in the queue. To run only the next job in the queue, use the shift command instead.

How can I generate PDFs for sites that use JavaScript?

This is a common issue with PDF generation. Luckily, html-pdf-chrome has a really awesome API for dealing with JavaScript. You can specify a timeout in milliseconds, or wait for elements or custom events. To add a wait, simply configure the generator key in your configuration. Below are a few examples.

Wait for 5 seconds

var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'api-token'
  },
  // html-pdf-chrome options
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Timer(5000), // waits for 5 sec
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

Wait for event

var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'api-token'
  },
  // html-pdf-chrome options
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Event(
      'myEvent', // name of the event to listen for
      '#myElement', // optional DOM element CSS selector to listen on, defaults to body
      5000 // optional timeout (milliseconds)
    )
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

In your JavaScript, trigger the event when rendering is complete

document.getElementById('myElement').dispatchEvent(new CustomEvent('myEvent'));

Wait for variable

var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'api-token'
  },
  // html-pdf-chrome options
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Variable(
      'myVarName', // optional, name of the variable to wait for.  Defaults to 'htmlPdfDone'
      5000 // optional, timeout (milliseconds)
    )
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

In your JavaScript, set the variable when rendering is complete

window.myVarName = true;

You can find more completion triggers in html-pdf-chrome's documentation

API

Below are the endpoints exposed by pdf-bot's REST API

Push URL to queue: POST /

key	type	required	description
url	string	yes	The URL to generate a PDF from
meta	object	no	Optional meta data object to send back to the webhook URL

Example

curl -X POST -H 'Authorization: Bearer api-token' -H 'Content-Type: application/json' http://pdf-bot.com/ -d '
  {
    "url":"https://esbenp.github.io",
    "meta":{
      "type":"invoice",
      "id":1
    }
  }'
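
The same request can be sent from a Node.js application server. The following is a minimal sketch using only the built-in http module; `buildPushRequest` and `pushJob` are hypothetical helper names, it assumes a plain-HTTP pdf-bot server, and it relies on the global URL class (Node.js 10 or later).

```javascript
var http = require('http')

// Builds the request options and JSON body for POST / on a pdf-bot server.
function buildPushRequest (serverUrl, token, pdfUrl, meta) {
  var target = new URL(serverUrl)
  // JSON.stringify drops `meta` when it is undefined, since it is optional.
  var body = JSON.stringify({ url: pdfUrl, meta: meta })
  return {
    options: {
      method: 'POST',
      hostname: target.hostname,
      port: Number(target.port) || 80,
      path: '/',
      headers: {
        'Authorization': 'Bearer ' + token,
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(body)
      }
    },
    body: body
  }
}

// Sends the job; the callback receives (error, statusCode).
function pushJob (serverUrl, token, pdfUrl, meta, callback) {
  var request = buildPushRequest(serverUrl, token, pdfUrl, meta)
  var req = http.request(request.options, function (res) {
    res.resume() // discard the response body
    callback(null, res.statusCode)
  })
  req.on('error', callback)
  req.end(request.body)
}
```

Calling pushJob('http://pdf-bot.com/', 'api-token', 'https://esbenp.github.io', { type: 'invoice', id: 1 }, callback) would then mirror the curl example.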

Database

LowDB (file-database) (default)

If you have low concurrency (you run a job every now and then) you can use the default database driver, which uses LowDB.

var LowDB = require('pdf-bot/src/db/lowdb')

module.exports = {
  api: {
    token: 'api-token'
  },
  db: LowDB({
    lowDbOptions: {},
    path: '' // defaults to $storagePath/db/db.json
  }),
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

PostgreSQL

var pgsql = require('pdf-bot/src/db/pgsql')

module.exports = {
  api: {
    token: 'api-token'
  },
  db: pgsql({
    database: 'pdfbot',
    username: 'pdfbot',
    password: 'pdfbot',
    port: 5432
  }),
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

Optionally, you can specify a database url by specifying a connectionString.

To install the necessary database tables, run db:migrate. You can also destroy the database by running db:destroy.
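
For the connectionString option mentioned above, a database URL in the usual PostgreSQL form can be assembled from the individual settings. This is only a sketch: `buildConnectionString` is a hypothetical helper, the localhost default is an assumption, and it is assumed rather than confirmed that pdf-bot hands the string to the database driver unchanged.

```javascript
// Assembles a PostgreSQL connection URL; user and password are percent-encoded.
function buildConnectionString (opts) {
  return 'postgres://' +
    encodeURIComponent(opts.username) + ':' +
    encodeURIComponent(opts.password) + '@' +
    (opts.host || 'localhost') + ':' + (opts.port || 5432) + '/' +
    opts.database
}

var connectionString = buildConnectionString({
  database: 'pdfbot',
  username: 'pdfbot',
  password: 'pdfbot'
})

console.log(connectionString) // postgres://pdfbot:pdfbot@localhost:5432/pdfbot

// In pdf-bot.config.js this value would then be passed as
//   db: pgsql({ connectionString: connectionString })
```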

Storage

Currently pdf-bot comes bundled with built-in support for storing PDFs on Amazon S3.

Feel free to contribute a PR if you want to see other storage plugins in pdf-bot!

Amazon S3

To install S3 storage, add a key to the storage configuration. Note that you can add as many different locations as you want by giving them different keys.

var createS3Config = require('pdf-bot/src/storage/s3')

module.exports = {
  api: {
    token: 'api-token'
  },
  storage: {
    'my_s3': createS3Config({
      bucket: '[YOUR BUCKET NAME]',
      accessKeyId: '[YOUR ACCESS KEY ID]',
      region: '[YOUR REGION]',
      secretAccessKey: '[YOUR SECRET ACCESS KEY]'
    })
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}

Multipart Post

To install Multipart Post storage, add a key to the storage configuration. Note that you can add as many different locations as you want by giving them different keys.

var multipartPostConfig = require('pdf-bot/src/storage/multipartpost')

module.exports = {
  api: {
    token: 'api-token'
  },
  storage: {
    'multipart': multipartPostConfig({
      url: '[URL TO POST TO]'
    })
  }
}

Options

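var htmlPdf = require('html-pdf-chrome')
var LowDB = require('pdf-bot/src/db/lowdb')
var createS3Config = require('pdf-bot/src/storage/s3')

var decaySchedule = [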
  1000 * 60, // 1 minute
  1000 * 60 * 3, // 3 minutes
  1000 * 60 * 10, // 10 minutes
  1000 * 60 * 30, // 30 minutes
  1000 * 60 * 60 // 1 hour
];

module.exports = {
  // The settings of the API
  api: {
    // The port your express.js instance listens on for requests. (default: 3000)
    port: 3000,
    // Spawn command when a job has been pushed to the API
    postPushCommand: ['/home/user/.npm-global/bin/pdf-bot', ['-c', './pdf-bot.config.js', 'shift:all']],
    // The token used to validate requests to your API. Not required, but 100% recommended.
    token: 'api-token'
  },
  db: LowDB(), // see other drivers under Database
  // html-pdf-chrome
  generator: {
    // Triggers that specify when the PDF should be generated
    completionTrigger: new htmlPdf.CompletionTrigger.Timer(1000), // waits for 1 sec
    // The port to listen for Chrome (default: 9222)
    port: 9222
  },
  queue: {
    // How frequently should pdf-bot retry failed generations?
    // (default: 1 min, 3 min, 10 min, 30 min, 60 min)
    generationRetryStrategy: function(job, retries) {
      return decaySchedule[retries - 1] ? decaySchedule[retries - 1] : 0
    },
    // How many times should pdf-bot try to generate a PDF?
    // (default: 5)
    generationMaxTries: 5,
    // How many generations to run at the same time when using shift:all
    parallelism: 4,
    // How frequently should pdf-bot retry failed webhook pings?
    // (default: 1 min, 3 min, 10 min, 30 min, 60 min)
    webhookRetryStrategy: function(job, retries) {
      return decaySchedule[retries - 1] ? decaySchedule[retries - 1] : 0
    },
    // How many times should pdf-bot try to ping a webhook?
    // (default: 5)
    webhookMaxTries: 5
  },
  storage: {
    's3': createS3Config({
      bucket: '',
      accessKeyId: '',
      region: '',
      secretAccessKey: ''
    })
  },
  webhook: {
    // The prefix to add to all pdf-bot headers on the webhook response.
    // I.e. X-PDF-Transaction and X-PDF-Signature. (default: X-PDF-)
    headerNamespace: 'X-PDF-',
    // Extra request options to add to the Webhook ping.
    requestOptions: {

    },
    // The secret used to generate the hmac-sha1 signature hash.
    // !Not required, but should definitely be included!
    secret: '1234',
    // The endpoint to send PDF messages to.
    url: 'http://localhost:3000/webhooks/pdf'
  }
}
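
To see what the retry strategies in the options above return, the snippet below evaluates the same decay schedule for successive retry counts, and adds a capped exponential alternative. It assumes retries starts at 1 for the first retry (inferred from the retries - 1 index); `exponentialStrategy` is a hypothetical replacement, and how pdf-bot treats a returned delay of 0 is not covered here.

```javascript
var decaySchedule = [
  1000 * 60, // 1 minute
  1000 * 60 * 3, // 3 minutes
  1000 * 60 * 10, // 10 minutes
  1000 * 60 * 30, // 30 minutes
  1000 * 60 * 60 // 1 hour
]

// Same logic as generationRetryStrategy / webhookRetryStrategy in the options.
function retryStrategy (job, retries) {
  return decaySchedule[retries - 1] ? decaySchedule[retries - 1] : 0
}

// Hypothetical alternative: double the delay each time, capped at one hour.
function exponentialStrategy (job, retries) {
  var delay = 1000 * 60 * Math.pow(2, retries - 1)
  return Math.min(delay, 1000 * 60 * 60)
}

// Delay in minutes for retry numbers 1 through 6.
var delays = [1, 2, 3, 4, 5, 6].map(function (retries) {
  return retryStrategy({}, retries) / 60000
})
console.log(delays) // [ 1, 3, 10, 30, 60, 0 ]
```

Either function can be assigned to generationRetryStrategy or webhookRetryStrategy, since both take (job, retries) and return a delay in milliseconds.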

CLI

pdf-bot comes with a full CLI included! Use -c to pass a configuration to pdf-bot. You can also use --help to get a list of all commands. An example is given below.

$ pdf-bot.js --config ./examples/pdf-bot.config.js --help


  Usage: pdf-bot [options] [command]


  Options:

    -V, --version        output the version number
    -c, --config <path>  Path to configuration file
    -h, --help           output usage information


  Commands:

    api                   Start the API
    db:migrate
    db:destroy
    install
    generate [jobID]      Generate PDF for job
    jobs [options]        List all completed jobs
    ping [jobID]          Attempt to ping webhook for job
    ping:retry-failed
    pings [jobId]         List pings for a job
    purge [options]       Will remove all completed jobs
    push [options] [url]  Push new job to the queue
    shift                 Run the next job in the queue
    shift:all             Run all unfinished jobs in the queue

Debug mode

pdf-bot uses debug for debug messages. You can turn on debugging by setting the environment variable DEBUG=pdf:* like so

DEBUG=pdf:* pdf-bot jobs

Tests

$ npm run test

Issues

Please report issues to the issue tracker

License

The MIT License (MIT). Please see License File for more information.
