OpenForm
Blog

Why Forms Break

6 min read

The world runs on forms.

Insurance runs on ACORD forms. Lending runs on MISMO data, disclosures, and prescribed packets. Tax and payroll run on IRS forms. Immigration runs on government forms. Real estate has association forms. Every regulated workflow eventually arrives at the same requirement: participants need a common way to specify what information is being collected and what transaction, filing, certification, or authorization is taking place.

Forms are the original API. They are the shared contract between organizations that do not share a codebase.

PDF Won The Distribution Layer

For decades, PDF has been the default format for that contract. Not because PDF is a good data model. Because it preserves visual integrity.

That mattered. In many workflows, layout is not cosmetic. The exact wording, spacing, boxes, signature areas, and ordering of fields carry legal and operational meaning. A prescribed form needs to look like the prescribed form. PDF solved that problem well enough, so industries standardized around it.

Then the internet and e-signature workflows arrived. SaaS platforms had to collect the same information digitally, but the prescribed artifact on the other side was still usually a PDF. The path of least resistance was obvious: keep using PDF as the durable external document, and build software around it. The prescribing bodies had little reason to redesign their standards around machine-readable schemas. They already had a format that preserved the page.

That is why PDF stayed in the middle of the stack.

If we were designing this system today from scratch, we would not choose PDF as the center. PDF preserved the page. It never modeled the transaction.

PDF Is The Wrong Source Of Truth

The problem is not that PDF is useless. The problem is that teams keep treating it as the thing that defines the form, when it is really just one rendering of the form.

Once PDF becomes the source of truth, a long list of structural problems shows up:

  • The field model is too primitive. PDF fields are mostly text boxes, checkboxes, radios, and lists. Everything tends to collapse to strings. There is no natural representation for arrays, nested objects, dynamic sections, or structured conditional flows. Even basic data gets combined awkwardly, with fields like First name and middle initial or Employer name and address.

  • Field identity is unstable. Forms often contain duplicate field ids. Internal names are frequently cryptic, like sectionA[0].item[1].field[1], and carry no semantic meaning. There is usually no stable canonical id that survives a redesign or a new template version.

  • The PDF does not describe what the fields mean. A label on a page is not a schema. There is no reliable place to declare type information, validation rules, units, formats, business logic, or required-if conditions. Hidden fields and branching logic often exist only in instructions or application code.

  • The file is tied to layout, not structure. Coordinates matter more than semantics. Small layout changes break mappings. Extraction often depends on visual parsing rather than declared meaning. Tab order, grouping, and field relationships are inconsistent enough that even navigation is unreliable.

  • The form is not actually self-contained. Instructions, worksheets, required attachments, lookup tables, and external dependencies are often part of the real workflow. A human can read those and do the work. The PDF itself does not encode that context in a usable way.

  • Validation is weak or external. Native constraints are minimal. Embedded JavaScript exists, but it is viewer-dependent, often disabled, and not something you want to trust with business logic. Cross-field rules, type checks, and conditional validation usually live outside the document.

  • Versioning is poor. A new PDF version is effectively a new form. There is no real diff, no migration path, and no structured representation of what changed. Teams find out something broke when a filing is rejected or an ops team notices output no longer lines up.

  • Interoperability is brittle. Adobe, Foxit, PDF.js, iText, PDFBox, and other tools interpret forms differently. AcroForm and XFA remain a compatibility problem. Many industries still depend on XFA forms that do not work in browsers and require specialized tooling.

  • Operational quality is hard to maintain. PDFs are binary artifacts. They are hard to diff in Git, hard to review, awkward to test, and painful to QA at the field level. Accessibility is inconsistent. Metadata is awkward. Every organization ends up building custom adapters around the same underlying format limitations.
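The gap between PDF's flat field model and the structure these problems call for can be made concrete. Below is a minimal sketch of what a schema-first field definition might look like, with a deterministic validator beside it. The schema shape, field names, and rule syntax are illustrative assumptions, not an existing standard:

```python
# A sketch of a schema-first field model: stable canonical ids, real types,
# and declarative required-if rules, none of which a PDF AcroForm can express.
# The schema shape here is an illustrative assumption, not a real standard.

FORM = {
    "id": "w9",                      # stable canonical id, survives redesigns
    "version": "2024.1",
    "fields": [
        {"id": "legal_name", "type": "string", "required": True},
        {"id": "tax_class",  "type": "enum",
         "values": ["individual", "c_corp", "s_corp", "llc"]},
        {"id": "llc_class",  "type": "enum", "values": ["C", "S", "P"],
         # conditional requirement declared as data, not viewer JavaScript
         "required_if": {"field": "tax_class", "equals": "llc"}},
    ],
}

def validate(form: dict, data: dict) -> list[str]:
    """Deterministic validation against the declared schema."""
    errors = []
    for f in form["fields"]:
        required = f.get("required", False)
        cond = f.get("required_if")
        if cond and data.get(cond["field"]) == cond["equals"]:
            required = True
        value = data.get(f["id"])
        if required and value is None:
            errors.append(f"{f['id']}: missing")
        elif value is not None and f["type"] == "enum" and value not in f["values"]:
            errors.append(f"{f['id']}: not one of {f['values']}")
    return errors
```

Because the rules live in the definition rather than in viewer scripts or application code, the same validation runs identically everywhere the form is handled, and the conditional logic is inspectable instead of implicit.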

None of these problems is surprising in hindsight. PDF was built to preserve presentation. Teams are asking it to act like schema, validator, transport, storage, and execution layer at the same time.

Humans Can Compensate. Agents Cannot.

For years, this was tolerable because people made the system work.

Ops teams fixed bad mappings. Compliance teams reviewed output before submission. Engineers kept spreadsheets of field definitions and edge cases. Support staff corrected data when a form failed in ways the software did not understand. Humans acted as the missing runtime.

That is why so many document-heavy systems appear stable from the outside. The software is not carrying the load alone.

AI changes that because agents need explicit structure. An agent asked to fill out a W-9 needs to know what fields exist, which are required, which values are valid, which rules depend on other fields, and whether the document is complete before it is rendered or submitted. A PDF cannot answer those questions directly. It exposes a visual surface, not a machine-readable contract.

That is why template filling, vision-based filling, browser automation, and schema-extraction pipelines all end up hitting the same wall. They are different ways of reverse-engineering structure from a rendering artifact.

Agents need schema, not screenshots.

Reverse The Stack

The fix is not to make PDF do more. The fix is to stop making PDF carry responsibilities it was never designed to carry.

The source of truth should be a structured definition of the document: schema, typed fields, validation rules, conditional logic, stable identity, versioning, and rendering logic. The PDF should become one output layer of that definition, not the place where the meaning of the form lives.

Once you reverse the stack, the rest gets simpler. The UI can derive from the same definition. Validation can run against it deterministically. Agents can inspect it directly. Renderers can produce PDFs, web forms, or structured data from the same source. A document update becomes a change to one artifact instead of an audit across templates, scripts, mappings, and runbooks.
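The reversed stack can be sketched concretely. Assuming a hypothetical structured definition (the shape and field names below are illustrative), one artifact can drive the web form, the mapping onto a legacy PDF template's cryptic internal names, and the completeness check an agent runs before submitting:

```python
import re

# One structured definition; every surface derives from it.
# The schema shape, field names, and pdf_field values are illustrative assumptions.

DEFINITION = {
    "id": "ach_authorization",
    "version": "3",
    "fields": [
        {"id": "account_holder", "type": "string", "label": "Account holder",
         "required": True, "pdf_field": "topmostSubform[0].name[0]"},
        {"id": "routing_number", "type": "string", "label": "Routing number",
         "required": True, "pattern": r"^\d{9}$",
         "pdf_field": "topmostSubform[0].routing[0]"},
    ],
}

def to_web_inputs(defn: dict) -> list[dict]:
    """Derive the web form's inputs from the same definition."""
    return [{"name": f["id"], "label": f["label"],
             "required": f.get("required", False)} for f in defn["fields"]]

def to_pdf_field_map(defn: dict, data: dict) -> dict:
    """Map canonical ids onto the legacy PDF template's internal names."""
    return {f["pdf_field"]: data.get(f["id"], "") for f in defn["fields"]}

def missing(defn: dict, data: dict) -> list[str]:
    """What an agent asks before rendering or submitting: is this complete?"""
    out = []
    for f in defn["fields"]:
        value = data.get(f["id"])
        if f.get("required") and not value:
            out.append(f["id"])
        elif value and "pattern" in f and not re.fullmatch(f["pattern"], value):
            out.append(f["id"])
    return out
```

The design point is that the PDF renderer becomes one consumer among several: a new form version changes `DEFINITION`, and the web inputs, PDF mapping, and agent-facing checks all move together instead of drifting apart.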

That is the real reason forms break today. The system is treating the file as the form when it should be treating the form as infrastructure.

If your team keeps rebuilding W-9s, 1099s, ACH authorizations, insurance forms, or real estate packets across PDFs, scripts, and mapping tables, the problem is not just the template. The problem is that the form still has no machine-readable home.

If this sounds familiar in your own stack, we would like to hear about it. Get in touch if you want to compare notes on where PDF is still acting as the system of record and what a schema-first document layer would look like in your workflow.