Adobe Portable Document Format (PDF) documents can be created with
blank spaces that can be filled in by the user.
The US Internal Revenue Service is a
popular source of forms using this feature. It's very convenient if you
want to use the Adobe Reader and fill out the forms interactively. It's a
bit less convenient if you want to do calculations in a spreadsheet and then
transfer the numbers to the form. One of the reasons I own a computer is
to let it do grunt work like copying numbers from one place to another,
so I decided to figure out how to fill out PDF forms automatically.
The tools exist, but it takes a while to figure out how to use them,
so I wrote this tutorial to capture what I've found out.
Here's the process flow I use. Starting from the right, the pdftk
program takes specially formatted values (the fdf file), merges them with
an existing pdf form, and writes a new pdf file, which is the filled-out form.
On the left is a program I wrote, fdf_gen, which handles the
process of getting my data into the right format. It's fairly specific
to my problem, but may be useful as a starting point for someone else.
The command that does the merge is

    pdftk f1040.pdf fill_form f1040_kat.fdf output f1040_kat.pdf

where f1040_kat.fdf defines the contents of each field, and f1040_kat.pdf is the new pdf with the values inserted. The fdf file contains specially formatted, PostScript-like text (PDF object syntax), and looks like this:

    %FDF-1.2
    1 0 obj<</FDF<< /Fields[
    <</T(f1_04(0))/V(Katherine(Kat))>>
    <</T(f1_05\(0\))/V(Astrofic)>>
    <</T(c1_03(0))/V(a)>>
    % And lots more lines like these.
    ] >> >>
    endobj
    trailer
    <</Root 1 0 R>>
    %%EOF

The top 2 lines and bottom 5 lines are standard headers and trailers, which should never need to be changed. On each line in between, the characters in parentheses after the T are the name of a field in the form, and the characters in parentheses after the V are the value to be written into that field. For example, f1_04(0) is the "First Name" field.
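The layout above is regular enough to generate from a table of names and values. Here is a minimal Python sketch of that idea. It is not the fdf_gen program described later; the function names fdf_escape and build_fdf are made up for this illustration, and it assumes plain ASCII values. It backslash-escapes parentheses and backslashes, as the f1_05\(0\) line in the example does.

```python
def fdf_escape(text):
    """Backslash-escape the characters that are special inside a PDF string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def build_fdf(fields):
    """Return FDF text for a mapping of field name -> value (both strings)."""
    # Standard two-line header, copied from the example file.
    lines = ["%FDF-1.2", "1 0 obj<</FDF<< /Fields["]
    for name, value in fields.items():
        lines.append("<</T(%s)/V(%s)>>" % (fdf_escape(name), fdf_escape(value)))
    # Standard five-line trailer, copied from the example file.
    lines += ["] >> >>", "endobj", "trailer", "<</Root 1 0 R>>", "%%EOF"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values, reusing the field names from the example.
    print(build_fdf({"f1_04(0)": "Katherine(Kat)", "f1_05(0)": "Astrofic"}), end="")
```

Saving the printed text as f1040_kat.fdf would give a file that can be passed to the pdftk fill_form command shown earlier.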
To find the names of the fields in a form, use

    pdftk f1040.pdf dump_data_fields >f1040_fields.txt

The output contains an entry like this for each field:

    FieldType: Text
    FieldName: f1_01(0)

Most fields are of type "Text"; we'll talk about FieldType "Button" next. The FieldName is just a character string that labels the field. Unfortunately, there's no relationship between these names and the line numbers or anything else on the form, so the only good way to figure out what's what is to stuff a dummy value into the field and see where it shows up on the form.
A Button field looks like this:

    FieldType: Button
    FieldName: c1_03(0)
    FieldFlags: 0
    FieldJustification: Left
    FieldStateOption: Off
    FieldStateOption: Yes
    FieldStateOption: a
    FieldStateOption: b
    FieldStateOption: c
    FieldStateOption: d

The FieldStateOption lines define the allowed values for the checkboxes. Most just have options Off (no boxes checked) or Yes (check the box). In this case, there are 5 possible choices. Naturally, the option values have absolutely no relationship to anything actually printed on the form, so we have to try the values until we get the one we want.
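The dummy-value trick for Text fields can be automated by reading the field dump and using each field's own name as its value. The following Python sketch shows one way to do that. The function names are hypothetical, and it assumes that each field's entry starts with a FieldType line and consists of "Key: Value" lines, as in the samples above.

```python
def parse_field_dump(text):
    """Parse 'Key: Value' lines into one dict per field.

    Assumption: every field's entry starts with a 'FieldType:' line.
    Repeated FieldStateOption lines are collected into a list.
    """
    fields = []
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # skip blank lines and any separator lines without a colon
        key, value = key.strip(), value.strip()
        if key == "FieldType":
            fields.append({"FieldStateOption": []})
        if not fields:
            continue  # ignore anything before the first entry
        if key == "FieldStateOption":
            fields[-1]["FieldStateOption"].append(value)
        else:
            fields[-1][key] = value
    return fields


def dummy_values(fields):
    """Map each Text field to its own name, so the name appears on the form."""
    return {f["FieldName"]: f["FieldName"] for f in fields
            if f.get("FieldType") == "Text" and "FieldName" in f}


if __name__ == "__main__":
    sample = ("FieldType: Text\nFieldName: f1_01(0)\n"
              "FieldType: Button\nFieldName: c1_03(0)\n"
              "FieldStateOption: Off\nFieldStateOption: Yes\n")
    for field in parse_field_dump(sample):
        print(field["FieldName"], field["FieldType"], field["FieldStateOption"])
```

Writing the result of dummy_values into an fdf file and merging it with the form labels every text box with its field name. Button fields are left out on purpose, since their values still have to be tried one at a time from the FieldStateOption list.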
These notes describe what I've seen on IRS forms; others may have other quirks.
I wrote a program, fdf_gen.c, to implement part of the process of creating
an fdf file. It works on some simple
test cases, but hasn't had any extensive validation. In other words, if you're
going to use it for something critical like real tax forms, you really need
to double-check the output to make sure it's doing what you want it to do.
In this case, I generate the fdf file using the command

    fdf_gen f1040.flds kat.in kat.fdf

f1040.flds just assigns a content type and a more descriptive name to
each value to be entered, and kat.in contains the input values.
Typical entries in f1040.flds are

    string LblLastName f1_05(0)
    string3 LblSSN f1_06(0) f1_07(0) f1_08(0)
    dollar_cents L7 f1_44(0) f1_45(0)

where the first item is the type of data, the next item is my descriptive name, and the
rest of the line contains the field or fields the value will be written into.
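To make the entry layout concrete, here is a small Python sketch that splits one such line into its three parts. It only illustrates the format and is not taken from fdf_gen.c. FieldSpec and parse_flds_line are hypothetical names, and it assumes the items are separated by whitespace, as in the entries shown.

```python
from typing import List, NamedTuple


class FieldSpec(NamedTuple):
    data_type: str      # e.g. "string", "string3", "dollar_cents"
    name: str           # the descriptive name, e.g. "LblSSN"
    targets: List[str]  # one or more pdf field names


def parse_flds_line(line):
    """Split 'type name field [field ...]' into a FieldSpec."""
    parts = line.split()
    if len(parts) < 3:
        raise ValueError("expected: type name field [field ...]: %r" % line)
    return FieldSpec(parts[0], parts[1], parts[2:])


if __name__ == "__main__":
    spec = parse_flds_line("string3 LblSSN f1_06(0) f1_07(0) f1_08(0)")
    print(spec.data_type, spec.name, spec.targets)
```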
kat.in just contains descriptive names and values.
My data types are:
string - Value is just a character string.
number - Synonym for string.
button - Synonym for string.
string3 - String is broken into multiple sections, and each section goes into a different field in the form. The first character of the value is the section break character.
dollar_cents - A numeric value placed into 2 fields. The dollar value goes into the first field, and the cents value goes into the second.
dollar_cents_paren - Like dollar_cents, except that negative values are in parentheses. E.g., -123.21 generates (123 and 21), in separate fields.
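The conversions in this list are simple enough to sketch in a few lines of Python. This is an illustration of the rules as described, not the code in fdf_gen.c. The function names are hypothetical, and two details are my assumptions: amounts are rounded to whole cents, and a negative plain dollar_cents value keeps a minus sign on the dollar part.

```python
from decimal import Decimal, ROUND_HALF_UP


def split_sections(value):
    """string3: the first character is the break character for the rest."""
    if not value:
        return []
    sep, body = value[0], value[1:]
    return body.split(sep)


def dollar_cents(value, paren=False):
    """Return (dollars, cents) strings for a numeric string such as '123.21'.

    With paren=True, a negative amount is wrapped in parentheses that are
    split across the two fields (assumed reading of dollar_cents_paren).
    """
    amount = Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    total_cents = int(abs(amount) * 100)
    dollars, cents = divmod(total_cents, 100)
    dollars_text, cents_text = str(dollars), "%02d" % cents
    if amount < 0:
        if paren:
            return "(" + dollars_text, cents_text + ")"
        return "-" + dollars_text, cents_text  # assumption: sign stays on dollars
    return dollars_text, cents_text


if __name__ == "__main__":
    print(split_sections("-123-45-6789"))
    print(dollar_cents("1234.5"))
    print(dollar_cents("-123.21", paren=True))
```

Using Decimal rather than floating point keeps the cents exact, which matters when the numbers end up on a tax form.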
This program was originally published in 2008. As of March 2012, Greg Lawson is also working on
this code as part of an open source tax project. You may want to check his repository for
more recent updates. See the links below.
The pdftk program has a website at www.accesspdf.com. It has the program, mailing lists, and links to purchase a book, PDF Hacks. I haven't purchased the book, but the program is great, so I assume the book will be too.