Saving a Dynamic Web Page as PDF with Puppeteer

This article is a brief tutorial on how to save a web page as a PDF with Node.js. We will use Puppeteer, which drives a headless Chrome browser, to load the web page on a Node server and convert it to PDF. Big thanks to the Chrome DevTools team for maintaining this excellent tool!

So what's a headless browser? Simply put, it's a browser with no display. That may initially sound strange, but headless browsers are great for automating PDF rendering or building a robot to search the web.

A Brief Note on Trial and Error

While looking for a good way to print HTML to PDF, I ran into many dead ends. To save you from the same, here's what did not work for me, and why:

PDFMake - A great JavaScript library that generates PDFs from an object array. I did not want to rewrite my entire HTML as an object array, so instead I spent many hours trying to convert the HTML to canvas, but could not get past error messages about the canvas size.
jsPDF - Another great JavaScript library that generates PDFs. This one was able to convert HTML to canvas without errors. Hooray! But wait... a canvas is essentially an image on a PDF page. This option does not handle multi-page PDFs or page breaks very well, nor does it preserve the text data very well. It's basically a picture printed on one PDF page. Even simply getting the size and margins right can be a painful struggle.

Enter the Headless Browser

Several years ago I went down the path of using PhantomJS as a headless browser to print PDFs, and I remember ultimately not being successful. Hence, I was thrilled to learn about Puppeteer from the Chrome team. However, even on this path I still found myself in trial and error mode, with only a handful of tutorials available to help along the way. I hope this post can save you some time!

Prerequisites

Here's what I'm running:

Ubuntu 16.04
Node
NPM
ExpressJs
AngularJS

You may need to install the following Debian dependencies:

sudo apt-get install gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget

Installing Puppeteer

Special Note: Puppeteer downloads the headless browser into its package folder during installation. The download is relatively large, and exceeds GitHub's file size limit. If you don't ignore your node_modules folder, you will likely have trouble pushing. You can either look into GitHub's Large File Storage, or just ignore the Puppeteer package by opening your .gitignore file and adding: node_modules/puppeteer/

Puppeteer is available from npm. Install it with the following:

npm i puppeteer

Step 1: Server Side Setup with ExpressJs

Here is a basic server side setup. This code opens the headless browser, navigates to an example web page, renders it as a PDF, and sends the result back to the client. See the commented code below:

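// Initialize the module
const puppeteer = require('puppeteer');

// A POST route that opens the headless browser.
// `app` is your existing Express application.
app.post('/printPdf', async function (req, res, next) {
  let browser;
  try {
    // Launch Chromium without its sandbox (often needed when running as root
    // or in a container)
    browser = await puppeteer.launch({
      args: ['--no-sandbox', '--disable-setuid-sandbox']
    });

    // Open a new page with the headless browser
    const page = await browser.newPage();

    // Route the headless browser to the webpage for printing
    await page.goto('http://www.example.com'); // add your url

    // Print the page as pdf
    const buffer = await page.pdf({
      printBackground: true,
      format: 'Letter',
      preferCSSPageSize: true // option names are case sensitive: lowercase "p"
    });

    // Send the pdf. Buffer.from() accepts either a Buffer or a Uint8Array.
    res.type('application/pdf');
    res.send(Buffer.from(buffer));
  } catch (err) {
    // Pass navigation or rendering failures to the Express error handler
    next(err);
  } finally {
    // Close the headless browser, even if something above failed
    if (browser) {
      await browser.close();
    }
  }
});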
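
A route like the one above needs to close the browser even when navigation or rendering fails, otherwise each failed request can leave a Chromium process running. Here is a minimal sketch of one way to factor that out. The helper name withPage and its launch parameter are my own inventions for illustration, not part of Puppeteer; the only assumption is that the launched object has newPage() and close(), as Puppeteer's browser does.

```javascript
// Hypothetical helper (not a Puppeteer API).
// `launch` is any function that returns a promise for a browser-like object
// with newPage() and close(), for example: () => puppeteer.launch({ ... })
// `work` receives the new page and returns (a promise for) the result.
async function withPage(launch, work) {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    return await work(page);
  } finally {
    // Runs whether work() resolved or threw, so the browser is always closed
    await browser.close();
  }
}
```

Inside the route you could then call withPage(() => puppeteer.launch({ args: [...] }), async (page) => { await page.goto(url); return page.pdf({ format: 'Letter' }); }) and send the returned buffer, wrapping the call in try/catch so errors still reach next(err).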

Step 2: The Client Setup with AngularJs

The client setup will look something like this (an example AngularJs controller function):

// save as pdf
$scope.savePdf = function () {

  // $http.post(url, data, config): the request options belong in the third argument
  var response = $http.post('printPdf', null, {
    responseType: 'arraybuffer',
    headers: {
      'Accept': 'application/pdf'
    }
  });

  response.then(function (success) {

    var fileName = 'Example.pdf';
    var blob = new Blob([success.data], {
      type: 'application/octet-stream'
    });

    if (window.navigator.msSaveOrOpenBlob) {
      // For IE
      window.navigator.msSaveOrOpenBlob(blob, fileName);
      return;
    }

    var url = window.URL.createObjectURL(blob);
    var a = angular.element('<a/>');
    a.css('display', 'none');
    a.attr({
      href: url,
      target: '_blank',
      download: fileName
    });

    // For Firefox: the link has to be attached to the document before click()
    // For Chrome: attaching it first is harmless
    angular.element(document.body).append(a);
    a[0].click();
    a.remove();

    // Release the object URL once the download has been handed to the browser
    window.setTimeout(function () {
      window.URL.revokeObjectURL(url);
    }, 0);
  }, function (err) {
    // Surface the failure instead of silently ignoring it
    console.error('Could not download the PDF', err);
  });
};

Above, we make the POST request for the PDF file and use a Blob to download it. Each browser handles this in its own way (see the commented code above).
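
If you are not using AngularJs, the same download logic can be sketched in plain JavaScript. This is only a rough sketch under my own assumptions: the function name saveBlob is made up, and window is passed in as the win parameter purely so the function can be tried with a stub outside a browser. The calls it makes (msSaveOrOpenBlob, URL.createObjectURL, the download attribute) are the same ones used in the controller above.

```javascript
// Hypothetical helper. `win` is the browser's window object.
// Returns the name of the strategy it used, which is handy when debugging.
function saveBlob(win, blob, fileName) {
  // Legacy IE/Edge: use the proprietary save API when it is available
  if (win.navigator && typeof win.navigator.msSaveOrOpenBlob === 'function') {
    win.navigator.msSaveOrOpenBlob(blob, fileName);
    return 'msSaveOrOpenBlob';
  }

  // Everyone else: click a temporary link that points at an object URL
  const url = win.URL.createObjectURL(blob);
  const a = win.document.createElement('a');
  a.href = url;
  a.download = fileName;
  a.style.display = 'none';

  // Older Firefox versions ignore click() on links that are not in the DOM
  win.document.body.appendChild(a);
  a.click();
  win.document.body.removeChild(a);

  // Release the object URL on a later tick, after the download has started
  win.setTimeout(function () {
    win.URL.revokeObjectURL(url);
  }, 0);
  return 'anchor';
}
```

In a page you would call it as saveBlob(window, new Blob([arrayBuffer]), 'Example.pdf').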

You can trigger this with a button in your HTML that calls the savePdf function (e.g. <button ng-click="savePdf()">Save Me!</button> ).

Step 3: Making This Dynamic

Now let's say you have an application that requires a dynamic front end. First, we'll need to modify our router with a few extra lines of code.

In the below code, after we open our headless browser and route to our web page url using page.goto, we'll add a function to be called within the headless browser from our server, using page.evaluate.

In the example below, we will call a client function, window.exampleFunction, and pass whatever data we want into the headless browser through the data variable.

// Initialize the module
const puppeteer = require('puppeteer');

// A POST route that opens the headless browser.
// `app` is your existing Express application.
app.post('/printPdf', async function (req, res, next) {

  let data = 'Whatever data you want to inject to the client';

  let browser;
  try {
    browser = await puppeteer.launch({
      args: ['--no-sandbox', '--disable-setuid-sandbox']
    });

    // Open a new page with the headless browser
    const page = await browser.newPage();

    // Route the headless browser to the webpage for printing
    await page.goto('http://www.example.com'); // add your url

    // ADD THIS FUNCTION TO CALL A CLIENT FUNCTION IN THE HEADLESS BROWSER
    // The callback runs inside the page; `data` is serialized and passed in
    await page.evaluate((data) => {
      return Promise.resolve(window.exampleFunction(data));
    }, data);

    // Print the page as pdf
    const buffer = await page.pdf({
      printBackground: true,
      format: 'Letter',
      preferCSSPageSize: true // option names are case sensitive: lowercase "p"
    });

    // Send the pdf. Buffer.from() accepts either a Buffer or a Uint8Array.
    res.type('application/pdf');
    res.send(Buffer.from(buffer));
  } catch (err) {
    // Pass navigation or rendering failures to the Express error handler
    next(err);
  } finally {
    // Close the headless browser, even if something above failed
    if (browser) {
      await browser.close();
    }
  }
});

Finally, we must also add our function on the client, which will be called by the headless browser. This would be added to your front end framework. Here's how we would update our AngularJs controller:

// Add a window function to your controller, which can inject data and functionality when loaded by the headless browser
window.exampleFunction = function(data) {
    $scope.example = data;
    // Run a digest so the view picks up the new data before the PDF is rendered
    $scope.$apply();
};

The above window.exampleFunction is essentially being called in the router, with the page.evaluate method. You can inject whatever data you like, and add functionality as needed before rendering the PDF. In my case, I simply needed to inject data to the scope before rendering.
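
One thing to watch for: page.evaluate will fail if window.exampleFunction has not been defined yet, which can happen when the controller has not finished loading by the time page.goto resolves. Below is a minimal sketch of waiting for the function first, using Puppeteer's page.waitForFunction. The helper name callClientFunction and the 10 second timeout are my own assumptions, so adjust them to your app.

```javascript
// Hypothetical helper. `page` is a Puppeteer Page.
// `fnName` is the name of a function the client registers on window.
// `data` must be serializable, because it is sent into the page.
async function callClientFunction(page, fnName, data) {
  // Poll inside the page until window[fnName] exists (rejects on timeout)
  await page.waitForFunction(
    (name) => typeof window[name] === 'function',
    { timeout: 10000 },
    fnName
  );

  // Call the client function; if it returns a promise, evaluate() awaits it
  return page.evaluate(
    (name, payload) => window[name](payload),
    fnName,
    data
  );
}
```

In the route, this would take the place of the page.evaluate call: await callClientFunction(page, 'exampleFunction', data);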

Some Final Nuances

Displaying Images

I found that rendering images can be challenging, especially with dynamic urls. Cross origin requests may also present problems.

I needed to display images by ID from AWS S3, and found I could no longer simply point my image element to the S3 url. I also could not add any parameters to the url in the image element. The image simply would not show.

My final, messy workaround was to point the image tag at my Express router, with a simplified route: <img src="example-image.image"> (for whatever reason, I couldn't get this to work without an extension, so I made up .image).

Next, I added the image ID as an Angular router parameter on the page (e.g. http://www.example.com/ID12345)

I then set up Express to pull the ID from the Referer header of the image request, and send the image back to the client. Here's how that looks:

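// The request package streams the S3 response (install with: npm i request)
const request = require('request');

// Send the image to client
app.get('/example-image.image', function (req, res) {

  // Get the image ID from the last path segment of the Referer header
  let referer = req.headers.referer || '';
  let imageId = referer.substring(referer.lastIndexOf('/') + 1);
  if (!imageId) {
    // No Referer header, or no ID at the end of it
    return res.sendStatus(400);
  }

  // The S3 url
  let url = 'https://s3-us-west-1.amazonaws.com/MYBUCKET/' + imageId;
  request(url).pipe(res);
});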
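
Pulling the ID out of the Referer header with string slicing is fragile: the header can be missing, or it can carry a query string after the ID. As a rough sketch, the extraction could be moved into a small function that parses the URL properly. The function name imageIdFromReferer and the allowed character set are my own assumptions; loosen the pattern if your IDs contain other characters.

```javascript
const { URL } = require('url');

// Hypothetical helper: returns the last path segment of the Referer URL,
// or null when the header is missing, malformed, or has no usable segment.
function imageIdFromReferer(referer) {
  if (!referer) {
    return null;
  }

  let pathname;
  try {
    // The URL parser separates the path from any query string or hash
    pathname = new URL(referer).pathname;
  } catch (err) {
    return null; // not an absolute URL
  }

  const segments = pathname.split('/').filter(Boolean);
  const last = segments[segments.length - 1];

  // Accept only letters, digits, "_" and "-" so the value is safe to append to the S3 url
  return last && /^[A-Za-z0-9_-]+$/.test(last) ? last : null;
}
```

The route would then call imageIdFromReferer(req.headers.referer) and respond with a 400 status when it returns null.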

I'm sure there's a better way to do this. Feel free to add your solution to the comments of this post.

Displaying Margins

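PDF page margins can also be tricky with Puppeteer. In our initial server route, we added the option preferCSSPageSize: true (the option name starts with a lowercase p), which gives any @page size declared in our CSS priority over the format option, so we can control the print size from our CSS file.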

I could never get the margins to work right, but found success in just sizing the actual page a little smaller than Letter size. See the CSS below:

@media print {
  @page { 
    size:8in 10in; 
    margin: 0  
  }
}
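
As an alternative to the CSS approach, page.pdf also accepts a margin option with top, right, bottom, and left values. I have not verified how this behaves in the setup described in this post, so treat the following as a sketch only. The helper name pdfOptionsWithMargin and the half inch value are made up for illustration.

```javascript
// Hypothetical helper: builds an options object for page.pdf()
// with the same margin on all four sides (e.g. '0.5in' or '10mm').
function pdfOptionsWithMargin(margin) {
  return {
    format: 'Letter',
    printBackground: true,
    // With this set to false, the paper size comes from `format`
    // rather than from any @page size declared in the CSS
    preferCSSPageSize: false,
    margin: { top: margin, right: margin, bottom: margin, left: margin }
  };
}
```

In the route, this would be used as: const buffer = await page.pdf(pdfOptionsWithMargin('0.5in'));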

Finally, if there are elements that you don't want to render, simply create a CSS class, and add it to those elements:

@media print {    
  .no-pdf {
    display: none !important;
  }
}

Hope that helps! If you have any issues or feedback, feel free to comment below.