Generating PDFs From Webpages With A Large Number Of Graphs

When we generate survey reports for clients, we produce online reports with interactive graphs and also PDFs that clients can print out and share. Converting our online reports, each of which can have several hundred graphs spread across multiple pages, to PDFs has been an interesting challenge.

To generate PDFs from webpages, we use PDFKit, a handy gem that converts webpages to PDF using wkhtmltopdf. Every time we finish generating a batch of online reports, we spawn background jobs that use PDFKit to hit web endpoints, fetch a printable version of each online report, and convert the retrieved HTML to PDF. However, each online report has graphs rendered in JavaScript using Highcharts. These graphs can take some time to load, and PDFKit may convert the HTML to PDF before all graphs have been rendered, resulting in PDFs with some blank graphs. As a quick fix, our first approach was to use PDFKit’s javascript_delay parameter, which delays PDF generation to give the client time to finish rendering graphs.
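A minimal sketch of the fixed-delay approach, assuming the pdfkit gem is installed and that PDFKit forwards javascript_delay to wkhtmltopdf's `--javascript-delay` flag, which takes milliseconds. The method names and the constant are hypothetical, chosen for illustration rather than taken from our codebase:

```ruby
# Assumed fixed delay in milliseconds (40 seconds).
FIXED_DELAY_MS = 40_000

# Builds the PDFKit options hash for a fixed JavaScript delay.
def fixed_delay_options(delay_ms = FIXED_DELAY_MS)
  { javascript_delay: delay_ms }
end

# Fetches the page at `url`, waits the fixed delay, and writes the PDF.
# The gem is required lazily so this file loads even without pdfkit.
def render_with_fixed_delay(url, output_path)
  require 'pdfkit'
  PDFKit.new(url, fixed_delay_options).to_file(output_path)
end
```

Every report pays the full delay with this setup, whether it has five graphs or five hundred.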

However, this approach had a limitation: for some of our large reports with 200+ graphs, even a delay as large as 40 seconds still left blank graphs in the PDF. (This isn’t a problem in the online version of our reports, since not all graphs need to be rendered when the page loads: most are rendered in response to some user action.) Moreover, reports that didn’t need the full 40 seconds to load their graphs still waited that long, making the process unnecessarily slow.

In the next version, we replaced the arbitrary timeout with PDFKit’s window_status option.

Each online report registers all graphs that need to be rendered during initialization, and each graph is marked as rendered when it completes. When all graphs have rendered, the page sets its window status to "ready", which is the cue PDFKit’s window_status option waits for before generating the PDF. Moreover, if there are still unrendered graphs after a defined timeout (in our case 40 seconds), we log an error and manually set the window status to "ready" so the PDF is generated anyway and the job does not stall forever. Logging the error ensures that we can find the partially complete PDF and investigate.
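On the Ruby side, this might look like the sketch below. It assumes the pdfkit gem is installed and that PDFKit forwards window_status to wkhtmltopdf's `--window-status` flag, which waits until the page's window.status equals the given string. The method names and the constant are hypothetical:

```ruby
# The status string the page is expected to set once all graphs have rendered.
READY_STATUS = 'ready'

# Builds the PDFKit options hash that makes wkhtmltopdf wait for the page's signal.
def window_status_options(status = READY_STATUS)
  { window_status: status }
end

# Renders the PDF only after the page reports it is ready.
# The gem is required lazily so this file loads even without pdfkit.
def render_when_ready(url, output_path)
  require 'pdfkit'
  PDFKit.new(url, window_status_options).to_file(output_path)
end
```

Unlike the fixed delay, small reports finish as soon as their graphs are done, and large ones get as long as they need.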

While this approach worked in most cases, we sometimes ended up with reports that had a JavaScript error on the page. In these cases, not all graphs finished rendering, and the JavaScript that forcibly sets the window status after 40 seconds never executed. The background job generating the PDF would then hang forever, wasting resources and driving up costs.

To solve this issue, our first thought was to wrap the PDFKit.to_file call, which invokes the wkhtmltopdf binary, in a timeout using Ruby’s built-in timeout.rb module. However, a bit of research indicated that timeout.rb is unsafe and can cause platform instability. Instead, we wrote a bash script that wraps the wkhtmltopdf binary in a timeout using Linux’s timeout utility, and pointed PDFKit at the bash script instead of the wkhtmltopdf executable.

PDFKit’s initializer:
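A sketch of what such an initializer could look like, assuming a Rails app with the pdfkit gem loaded. The file location and the wrapper script path (bin/wkhtmltopdf_with_timeout) are hypothetical:

```ruby
# config/initializers/pdfkit.rb (assumed location)
PDFKit.configure do |config|
  # Point PDFKit at the timeout wrapper script instead of the wkhtmltopdf binary.
  # The script path is a placeholder; use wherever the wrapper actually lives.
  config.wkhtmltopdf = Rails.root.join('bin', 'wkhtmltopdf_with_timeout').to_s
end
```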

Bash script wrapping wkhtmltopdf in a timeout:

Finally, code in the background job:
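A minimal sketch of such a job, under a few assumptions: the wrapper script exits with a non-zero status when the timeout fires (GNU timeout exits with 124), and PDFKit raises an error when the command it invokes fails. The class name, the injectable renderer, and the log message are hypothetical and exist mainly to keep the example self-contained:

```ruby
require 'logger'

# Hypothetical background job that renders one report to PDF.
class ReportPdfJob
  # `renderer` is any callable taking (url, path); it defaults to PDFKit.
  def initialize(renderer: nil, logger: Logger.new($stderr))
    @renderer = renderer || method(:render_with_pdfkit)
    @logger = logger
  end

  # Returns true on success, false if rendering failed or timed out.
  def perform(report_url, output_path)
    @renderer.call(report_url, output_path)
    true
  rescue StandardError => e
    # A timeout in the wrapper surfaces here as a failed command.
    @logger.error("PDF generation failed for #{report_url}: #{e.message}")
    false
  end

  private

  # Assumes PDFKit is configured to invoke the timeout wrapper script.
  def render_with_pdfkit(url, path)
    require 'pdfkit'
    PDFKit.new(url, window_status: 'ready').to_file(path)
  end
end
```

Because the timeout is enforced by the operating system on the wkhtmltopdf process itself, the Ruby worker only has to handle an ordinary exception.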

With this method, we log an error when a PDF takes longer than 10 minutes to generate, and our workers no longer stall forever if there’s a JavaScript error on the page.

While we use Highcharts, this approach works for any page whose graphs are rendered using a JavaScript charting library.

Interested in playing with PDFKit and printable reports? We’re hiring!
