Your AI agent can reason, plan, and write code. Ask it to fill out a W-9 and it falls apart. Not because the model is incapable, but because the document is.
But the agent is not the first system to fail at this. It is the latest in a line of approaches, each more sophisticated than the last, each running into the same wall. The structure of the document was never declared. Different systems attack that problem from different angles. None of them removes the underlying constraint.
Manual: humans as the integration layer
The original approach requires almost no software. A partially filled PDF is emailed to the end user. The user completes it, emails it back. Someone on the other side opens it, reviews it manually, finds errors, and sends it back again. The cycle repeats until the form is correct.
This works because a human is acting as the runtime. The person reading the document understands the instructions, notices when a field is missing, catches formatting mistakes, and knows when something looks wrong even if the software does not. Validation happens in a human head, not in the system.
It is slow, expensive, and hard to scale. Every error is a round trip. Every revision depends on people noticing what changed. But for a long time this was how many document workflows actually ran, and in a surprising number of organizations it still is.
Progressive intake flows: schema-first, application-bound
In one sense, progressive intake flows solve the problem the right way.
Instead of asking a user to work directly against a PDF, the team models the document outside the PDF. The user sees a logical flow rather than a page layout. Fields appear in a sensible order. Conditional sections show and hide based on prior answers. Validation happens as data is entered. By the time the workflow reaches the end, the system has a structured representation of the document and can render the final PDF through a template engine.
Architecturally, that is the right direction. Start with structure. Use the PDF as an output.
The problem is where that structure lives. In most implementations, the document model is trapped inside product code. The field definitions live across frontend components, backend validators, mapping tables, and template rendering logic. Conditional behavior is implemented as application logic. Validation rules are embedded in forms code. PDF field mappings are maintained separately. If the underlying W-9 changes, engineers have to update the UI flow, the validation rules, and the rendering layer together.
That makes the system expensive to maintain and difficult to reuse. The schema exists, but it does not exist as a portable document artifact. It exists as behavior inside a product. Agents cannot query it directly. Other systems cannot consume it cleanly. The logic is machine-executable, but not exposed as machine-readable document infrastructure.
AI agents: fill the template directly
The newest approach skips the explicit schema and goes back to the template itself. An AI agent reads the PDF the way a person would. It uses tools to extract whatever structure the file exposes, relies on vision to interpret the rest, and tries to place values into the right fields based on layout, labels, and instructions.
Watching a capable model do this is impressive. It can inspect the page, reason about nearby text, follow instructions, and make a decent guess about what belongs where. For one-off fills of unfamiliar forms, that is often useful enough to feel like progress.
For production, it is still fragile. Fields get skipped. Values land in the wrong place. Formatting breaks. The same form and data can produce different output across runs. Every fill requires a full inference pass over the document, which means latency, cost, and nondeterminism. Most importantly, the agent is still working from the PDF as the source of truth. It is inferring structure from presentation, not filling against a declared contract.
Schema extraction: infer the model from the PDF
Systems like Reducto take a more structured approach. Instead of asking a model to rediscover the form on every run, they analyze a blank document once and generate a machine-readable schema from it. That extracted schema can then be reused for later fills.
This is a meaningful step forward. It tries to recover a structured model from the PDF so software has something better than raw layout to work with. That can make repeated filling faster and more consistent.
But the model is still inferred rather than declared. It is downstream of the template, not the source of truth. The schema is also incomplete: it can tell you where fields are, but it does not fully express validation, conditional logic, cross-field rules, or the actual semantics of the document.
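A sketch of what an extracted schema tends to look like makes the gap concrete. The field names, coordinates, and shape below are invented for illustration, not any vendor's actual output format:

```python
# Hypothetical extracted schema: positions and types are recovered from the
# blank PDF, but validation, conditional logic, and semantics are absent.
extracted_schema = {
    "source": "w9-blank.pdf",
    "fields": [
        {"name": "f1_01", "page": 1, "type": "text",
         "rect": [36.0, 702.0, 310.0, 714.0], "label_guess": "Name"},
        {"name": "c1_1", "page": 1, "type": "checkbox",
         "rect": [36.0, 640.0, 46.0, 650.0], "label_guess": "Individual"},
    ],
}

# Nothing here says a TIN must be nine digits, or that checking one
# classification box excludes the others. That knowledge never left the PDF.
missing = [k for k in ("validation", "conditional_logic") if k not in extracted_schema]
```

You can fill fields with this, but you cannot know whether the filled document is correct.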
And when the underlying form changes, the extracted schema has to be regenerated and revalidated. The reverse-engineering step has become more efficient, but it is still reverse engineering.
What these approaches are actually telling us
These approaches are not all making the same architectural move.
Manual review and agent-driven filling are PDF-first. The document exists as a rendered artifact, and the runtime is forced to work from that artifact.
Schema extraction is also PDF-first, but it inserts an inference layer in the middle. It tries to recover structure from the rendered file so software has something better to operate on.
Progressive intake flows are different. They are schema-first in spirit. They start with structure and render the PDF afterward. That is why they usually produce the best user experience and the most reliable validation. Their weakness is that the structure is buried inside an application rather than represented as a reusable document artifact.
That distinction matters because it reveals the real requirement. The stack needs an explicit document layer: typed fields, semantic meaning, validation rules, conditional logic, stable identity, versioning, and rendering instructions in one place. The PDF still matters, but as one output of that layer, not as the thing every system has to decode.
Once that layer exists, the rest gets simpler. A web form can derive from it. An API can validate against it. An agent can inspect it directly instead of guessing from screenshots. A PDF can be rendered from it without becoming the system of record.
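What such a layer might look like can be sketched as a single declared artifact plus one consumer of it. Every field name, rule, and PDF mapping below is an assumption for illustration; the point is that identity, version, types, validation, conditional logic, and rendering instructions live in one place, and that a form, an API, or an agent can all run the same check against it.

```python
import re

# A declared document model: the source of truth, with the PDF as one output.
W9_MODEL = {
    "id": "irs-w9",
    "version": "2024-03",
    "fields": {
        "name": {"type": "string", "required": True},
        "tax_classification": {
            "type": "enum",
            "values": ["individual", "c_corp", "s_corp", "partnership", "other"],
        },
        "other_description": {
            "type": "string",
            "visible_when": {"tax_classification": "other"},
        },
        "tin": {"type": "string", "pattern": r"^\d{9}$"},
    },
    # Rendering instructions: how the model maps onto a PDF template.
    "render": {"pdf_template": "w9.pdf", "field_map": {"name": "f1_01", "tin": "f1_11"}},
}

def check(model: dict, data: dict) -> list[str]:
    """Validate data against the declared model. Any consumer can call this."""
    errors = []
    for name, spec in model["fields"].items():
        value = data.get(name)
        if spec.get("required") and not value:
            errors.append(f"{name}: required")
        if value and spec.get("pattern") and not re.fullmatch(spec["pattern"], value):
            errors.append(f"{name}: invalid format")
        if value and spec.get("values") and value not in spec["values"]:
            errors.append(f"{name}: not an allowed value")
    return errors
```

An agent handed this model does not need to guess where a TIN goes or how it must be formatted. It can read the contract.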
Agents need schema, not screenshots.
If your team is experimenting with W-9s or other prescribed forms and running into these same constraints, we would like to hear about it. Talk to us if you want to compare notes on where current approaches hold up, where they fail, and what a better document layer should provide.