Developers
-
GitHub
- code
- issue tracker
- milestones
-
API
Because OpenTaxForms does not store user tax information on the server, the API is read-only. It provides a complete accounting of form fields: data type, size and position on the page, and role in field groupings such as dollars-and-cents fields, fields on the same line, fields in the same table, fields on the same page, and fields involved in the same formula.
API docs are forthcoming. In the meantime, see the examples in the TestOtfApi class in test/test_opentaxforms.py.
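Until the API docs exist, here is a rough sketch of the kind of record the field description above implies, and of how a client might pair dollar and cent fields. The class name `FieldRecord`, its attribute names, and the `data_type` values are all hypothetical; they are not the actual API schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FieldRecord:
    """Hypothetical shape of one form field (not the real API schema)."""
    field_id: str                  # XFA field id
    data_type: str                 # assumed values: "dollars", "cents", "text"
    page: int                      # page the field appears on
    x: float                       # position on the page
    y: float
    width: float                   # size on the page
    height: float
    line: Optional[str] = None     # form line the field belongs to, if any
    table: Optional[str] = None    # table the field belongs to, if any


def pair_dollars_and_cents(records: List[FieldRecord]) -> Dict[str, Dict[str, str]]:
    """Group dollar and cent fields that share the same form line."""
    by_line: Dict[str, Dict[str, str]] = {}
    for record in records:
        if record.data_type in ("dollars", "cents") and record.line is not None:
            by_line.setdefault(record.line, {})[record.data_type] = record.field_id
    # Keep only lines that have both halves of the amount.
    return {line: pair for line, pair in by_line.items() if len(pair) == 2}
```

A client could use a grouping like this to render the two boxes of an amount as a single input.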
-
Build status
-
Test status
-
License
-
How it works
Most IRS tax forms embed all of their fillable field information in the XML Forms Architecture (XFA) format. The OpenTaxForms Python script extracts the XFA from each PDF form and parses out:
- relationships among fields (such as dollar and cent fields; fields on the same line; columns and rows of a table).
- math formulas, including which fields are computed vs user-entered (such as "Subtract line 37 from line 35. If line 37 is greater than line 35, enter -0-").
- references to other forms
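As an illustration of the formula-parsing step, here is a minimal sketch that handles only the one sentence pattern quoted above. The function names and the dictionary layout are assumptions made for this example; the actual OpenTaxForms parser may work quite differently and covers far more instruction wordings.

```python
import re

# Matches instructions of the form "Subtract line A from line B".
SUBTRACT = re.compile(r"Subtract line (\w+) from line (\w+)", re.IGNORECASE)
# Matches the clause "If line A is greater than line B, enter -0-".
FLOOR_AT_ZERO = re.compile(
    r"If line (\w+) is greater than line (\w+), enter -0-", re.IGNORECASE
)


def parse_instruction(text):
    """Turn a subtraction instruction into a small formula dict, or None."""
    match = SUBTRACT.search(text)
    if match is None:
        return None
    subtrahend, minuend = match.group(1), match.group(2)
    floor = FLOOR_AT_ZERO.search(text)
    # The floor clause only applies if it names the same two lines.
    floor_at_zero = bool(
        floor and floor.group(1) == subtrahend and floor.group(2) == minuend
    )
    return {
        "op": "-",
        "minuend": minuend,
        "subtrahend": subtrahend,
        "floor_at_zero": floor_at_zero,
    }


def evaluate(formula, values):
    """Compute the formula from user-entered line values keyed by line name."""
    result = values[formula["minuend"]] - values[formula["subtrahend"]]
    if formula["floor_at_zero"] and result < 0:
        return 0
    return result
```

The lines named in the formula are the user-entered inputs, and the line that carries the instruction is the computed one.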
All this information is stored in a PostgreSQL database and served through a RESTful API. For each tax form page, an HTML form (with JavaScript to express the formulas) is generated and overlaid on an SVG rendering of the original PDF.
The JavaScript saves all user inputs to local/web storage in the browser via basil.js. When the page is loaded, those values are retrieved. Values are keyed by tax year, form number (e.g. 1040), and XFA field id (and soon taxpayer name, now that I do my kids' taxes too).
Testers annotate the page image with boxes and comments via annotorious.js.
A few (24 as of August 2016) of the 900+ IRS forms don't have embedded XFA (such as ...). Eventually those forms may be updated to contain XFA, but until then, the best automated approach is probably OCR (optical character recognition). OCR may be a less foolproof approach in general, but it matters especially for state (NJ, NY, etc.) forms, which generally are not XFA-based.
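The keying of saved values by tax year, form number, and XFA field id can be sketched as follows. This is written in Python for illustration only; the real code runs in the browser through basil.js, and the colon-separated key format, the `FormStore` class, and the sample field id are assumptions.

```python
def storage_key(tax_year, form_number, field_id):
    """Build a storage key from tax year, form number, and XFA field id.

    The colon-separated layout is an assumption made for this sketch.
    """
    return f"{tax_year}:{form_number}:{field_id}"


class FormStore:
    """In-memory stand-in for the browser's local/web storage."""

    def __init__(self):
        self._values = {}

    def save(self, tax_year, form_number, field_id, value):
        self._values[storage_key(tax_year, form_number, field_id)] = value

    def load(self, tax_year, form_number, field_id, default=None):
        return self._values.get(
            storage_key(tax_year, form_number, field_id), default
        )


store = FormStore()
store.save(2016, "1040", "f1_01", "1234")   # "f1_01" is a made-up field id
restored = store.load(2016, "1040", "f1_01")
```

Including the tax year and form number in the key keeps a value entered on one form from leaking into a field with the same id on another form or year.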
For more technical details, see my talk, especially starting from slide 11.