The PDF Toolkit

Creating and reading PDF files in Linux is easy, but manipulating existing PDF files is a little trickier. Countless applications enable you to fiddle with PDFs, but it’s hard to find a single application that does everything. The PDF Toolkit (pdftk) claims to be that all-in-one solution. It’s the closest thing to Adobe Acrobat that we’ve found for Linux.

Developer Sid Steward describes pdftk as the PDF equivalent of an “electronic staple remover, hole punch, binder, secret decoder ring, and X-ray glasses.” That’s a lot of functionality for a 4MB application, but the software delivers. Pdftk can join and split PDFs; pull single pages from a file; encrypt and decrypt PDF files; add, update, and export a PDF’s metadata; export bookmarks to a text file; add attachments to a PDF or remove them; fix a damaged PDF; and fill out PDF forms. In short, there’s very little pdftk can’t do when it comes to working with PDFs.

You can download pdftk as source or as a Debian or RPM package, FreeBSD port, or Gentoo Ebuild. Binaries are available for Windows and Mac OS X too.

Pdftk does have a GUI, but it is most powerful when used as a command line tool (and can be used in your scripts too). The syntax can be complicated, especially for complex actions such as removing specific pages from a PDF file, but it’s worth learning a few commands.

Joining files

Pdftk’s ability to join two or more PDF files is on par with such specialized applications as pdfmeld and joinPDF. The command syntax is simple:
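As a minimal sketch of a join, here is the command wrapped in a small shell function so it can be reused in scripts. The function name join_pdfs and the file names are our own placeholders, not part of pdftk.

```shell
# Sketch: join two PDFs into a single file.
# join_pdfs and all file names are illustrative, not part of pdftk.
join_pdfs() {
    # $1, $2: input PDFs, in the order they should appear
    # $3:     the combined output file
    pdftk "$1" "$2" cat output "$3"
}

# Example call:
#   join_pdfs chapter1.pdf chapter2.pdf book.pdf
```

The input files are left untouched; pdftk writes the result to the file named after output.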

cat is short for concatenate — that is, link together, for those of us who speak plain English — and output tells pdftk to write the combined PDFs to a new file.

Pdftk doesn’t retain bookmarks, but it does keep hyperlinks to both destinations within the PDF and to external files or Web sites. Where some other applications point to the wrong destinations for hyperlinks, the links in PDFs combined using pdftk managed to hit each link target perfectly.

Splitting files

Splitting PDF files with pdftk was an interesting experience. The burst option breaks a PDF into multiple files — one file for each page:
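A burst might be sketched as follows. Here we assume you want the one-page files collected in a separate directory, so the sketch passes burst a printf-style output pattern. The function name burst_pdf and the directory are hypothetical, and a plain pdftk in.pdf burst with no output section writes the files to the current directory instead.

```shell
# Sketch: split a PDF into one file per page.
# burst_pdf and the paths are illustrative, not part of pdftk.
burst_pdf() {
    # $1: input PDF
    # $2: directory that receives the one-page files (created if missing)
    mkdir -p "$2" && pdftk "$1" burst output "$2/pg_%04d.pdf"
}

# Example call:
#   burst_pdf manual.pdf manual-pages
```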

We don’t see the use of doing that, and with larger documents you wind up with a lot of files with names corresponding to their page numbers, like pg_0001 and pg_0013 — not very intuitive.

On the other hand, we found pdftk’s ability to remove specific pages from a PDF file to be useful. For example, to remove pages 10 to 25 from a PDF file, you’d type the following command:
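One way to sketch this: rather than deleting pages, you tell cat which page ranges to keep, and everything else is left out of the output. The function name and file names below are placeholders of ours.

```shell
# Sketch: drop pages 10 to 25 by keeping everything around them.
# drop_pages_10_to_25 and the file names are illustrative, not part of pdftk.
drop_pages_10_to_25() {
    # $1: input PDF, $2: output PDF
    # Keep pages 1-9, then page 26 through the last page ("end").
    pdftk "$1" cat 1-9 26-end output "$2"
}

# Example call:
#   drop_pages_10_to_25 myDocument.pdf trimmed.pdf
```

The same pattern extracts a single article: list only the range you want, for example cat 10-25.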

We have used this syntax extensively to trim pages from samples that we give students, and to extract articles from back issues of digital magazines. The resulting files are small, and the PDFs retain excellent resolution.

Infrequently used options

Pdftk has a number of options that you might use infrequently, but that are very useful when you need them — such as update_info and user_pw.

When you create a PDF, it might contain no or incomplete metadata — that is, information describing the PDF. Metadata can come in handy when you or your users need to organize or index a set of PDF files. Using pdftk and a text file, you can change or add metadata to the PDF:
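A metadata update might look like the sketch below. The function name update_metadata and the PDF file names are assumptions for illustration, while data.txt is the text file holding the new metadata.

```shell
# Sketch: apply metadata from a text file to a PDF.
# update_metadata and the PDF file names are illustrative, not part of pdftk.
update_metadata() {
    # $1: input PDF
    # $2: text file holding the metadata to apply
    # $3: output PDF with the updated metadata
    pdftk "$1" update_info "$2" output "$3"
}

# Example call:
#   update_metadata report.pdf data.txt report-tagged.pdf
```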

In this usage, the contents of the file data.txt consist of an InfoKey and InfoValue pair, like this:

InfoKey: Keywords
InfoValue: DocBook,writing,documentation,background

You can change only the following metadata items with pdftk: title, author, subject, producer, and keywords.

If you’re working with PDFs that contain sensitive information and want to make sure that only certain people can view them, you can require a password to open a PDF by applying one with the user_pw option:
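A sketch of password protection follows, assuming you would rather type the password interactively than put it on the command line. The PROMPT keyword tells pdftk to ask for it. The function name protect_pdf and the file names are placeholders.

```shell
# Sketch: write a copy of a PDF that needs a password to open.
# protect_pdf and the file names are illustrative, not part of pdftk.
protect_pdf() {
    # $1: input PDF, $2: protected output PDF
    # PROMPT makes pdftk ask for the password at run time,
    # which keeps it out of the shell history.
    pdftk "$1" output "$2" user_pw PROMPT
}

# Example call:
#   protect_pdf payroll.pdf payroll-protected.pdf
```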

You will be prompted for a password of up to 32 characters. When someone tries to open the PDF, they will be asked to enter a password.

If you use pdftk regularly, or if you’re comfortable writing scripts to encapsulate the commands that you use, then you should have no problems working from the command line like this. Try it!