Sajiv Francis
11 min read

I Built a Document AI System in 2018. Here's What I Got Right and What I Missed


A 2026 retrospective on document AI from before RAG had a name, and on how much of the system around the model shifted once foundation models arrived.

In 2018, we built a document-ingestion pipeline with NLP parsing, restructured output, and an adaptive-learning UI as part of a subscription product. We were not aware of the transformer architecture that propels today's foundation models; BERT and the first GPT had only just shipped that year. The product, Optey, was built for people who would not take the time to read a 500- or 1000-page book. The plan was to use NLTK and Python's broader NLP ecosystem to compress those inputs into a 10-page summary that read like a story. The concept was sound; the technology of the time was the limit.

Fast forward to 2026, and here is how that build compares to the same system as you’d architect it today.

In 2026 terms, that was a RAG-based adaptive-learning SaaS, built before RAG had a name. The stack was modest: Python and Django on the backend, NLTK doing the parsing, a SQL store for the structured records, a JavaScript front-end on top. The pipeline ran in one direction: documents in, parsed once, persisted as structured records, UI reads from the records. We never finished it into a product, but eight years later, the architecture I sketched is approximately the same architecture I would still draw today.

The point of this essay is not the project. It is the narrow question of what the model layer of a document AI system can output now that it could not output in 2018, and what that change does to the system around it. Briefly: what we got right architecturally, what we missed strategically, and then the part I actually came here to write — how the AI layer changed.

What we got right

The architectural pattern (ingest unstructured → structure → personalize) and the cognitive-science framing (ILM, the Interactive Learning Model).

The shape of the system holds up. Documents went in once, got parsed once, got persisted as structured records, and the UI read from the records at request time. Every modern RAG architecture has the same shape: a one-time ingestion stage that produces an indexed representation, then a separate retrieval and generation stage at query time. The reason this matters is mostly economic. Ingestion is expensive; queries are cheap. Any system that doesn’t separate them eats its compute budget. We got this right, whether by accident or not. Django’s request lifecycle made it the obvious thing to do, and it kept being the right thing to do as the model layer changed underneath it.
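Here is that shape as a minimal sketch. Everything inside the two functions is a trivial stand-in; the point is the boundary between the expensive path and the cheap one.

```python
# A minimal sketch of the ingest-once / read-many shape. The parsing and
# lookup here are trivial stand-ins; the boundary between the two
# functions is what matters.
STORE: dict[str, list[str]] = {}  # doc_id -> structured records

def ingest(doc_id: str, document: str) -> None:
    """Expensive path: runs once per document."""
    records = [p.strip() for p in document.split("\n\n") if p.strip()]
    STORE[doc_id] = records  # persisted; never re-parsed

def answer(doc_id: str, query: str) -> list[str]:
    """Cheap path: runs on every request, reads only the stored records."""
    return [r for r in STORE[doc_id] if query.lower() in r.lower()]
```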

The cognitive-science framing was the better call — the one I’m more sure about in retrospect. The Interactive Learning Model gives a definition of “good output” that does not collapse into “fluent text”. Useful output for a learner has three dimensions: what they think, what they decide to do, and how they feel. Designing the output schema around that triad forced the system to produce more than just summaries; it had to produce prompts. Reflective questions tied to specific sections. Decision points where the user committed to a next action. In 2018 that was scaffolding around mediocre NLTK output, an attempt to make the surface feel less mechanical. In 2026 the same schema is what you would put in a system prompt with structured-output enforcement. The pedagogical theory was doing the work that schema-constrained generation does today — just by hand and in a database table.
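Sketched as a schema, the triad looks something like this. The field names are illustrative, not our 2018 column names; in 2026 you would hand the same shape to a model as a structured-output schema.

```python
from dataclasses import dataclass

@dataclass
class SectionOutput:
    # One section's output, shaped around the ILM triad rather than
    # around "a summary". Field names are illustrative.
    summary: str                     # what the learner should think
    reflective_questions: list[str]  # prompts tied to this section
    decision_point: str              # the next action the user commits to
    affect_check: str                # how the material is landing emotionally
```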

The third thing we got right was treating the output as a recurring product rather than a one-time artifact. Subscription billing forced retention thinking, which forced workflow thinking, which is where the durable value of a document AI system actually lives. The model produces text. The product is what surrounds the text: how the user moves through it, returns to it, and builds on it. In 2018, this looked like a UX consideration. In 2026, it looks like the only part of the stack that isn’t commoditized.

What we missed

The scale of foundation models, embedding-based retrieval, and the fact that the “AI” piece would commoditize while the workflow and UX layer would matter most.

The misses are more interesting than the hits, because all three turn out to be about the same thing: what the model layer could actually do.

NLTK is a tokenizer, a POS tagger, a chunker. It identifies that “the cat sat on the mat” has a noun phrase and a prepositional phrase. It does not understand cats, mats, or sitting. The summarization we had it doing was extractive: pick the highest-information sentences using TF-IDF, surface them in order. That is compression, not understanding. The leap from extractive to abstractive required the transformer architecture, which arrived in late 2017 in “Attention Is All You Need” and didn’t reach a usable scale until GPT-3 in 2020. We were building on a substrate that was about to be obsoleted. Hardly anyone outside a small circle of researchers knew it.
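For concreteness, here is roughly what that extractive approach looks like: score sentences by TF-IDF mass, keep the top few in document order. This is a sketch of the technique, not our 2018 code.

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt", quiet=True)  # sentence-tokenizer data ("punkt_tab" on newer NLTK)

def extractive_summary(text: str, k: int = 5) -> str:
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= k:
        return text
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1  # total TF-IDF mass per sentence
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:k])
    return " ".join(sentences[i] for i in top)  # compression, not understanding
```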

Retrieval was the bigger miss. We built a parser, not a retriever. The whole system assumed there was one canonical structured form of each document, produced at ingest time, and the UI surfaced different views of that one form. The 2026 version of the same system would not work that way. It would store the document as a vector index — chunks of text mapped into a high-dimensional space where semantic similarity is computable — and produce a tailored output every time the user asked something different about the same source. The structure is no longer the answer. The structure is the search space. The answer gets generated at query time, against the part of the search space the question pointed at. That distinction sounds small, but it is the entire shift from 2018 document AI to 2026 document AI.
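A minimal version of that index, assuming sentence-transformers as the embedding model; any embedding API slots into the same two functions.

```python
# One-time ingestion builds the search space; every query searches it.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one common choice

def build_index(chunks: list[str]) -> np.ndarray:
    """Ingestion: map chunks into the vector space, once."""
    return model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 4) -> list[str]:
    """Query time: cosine similarity picks the part of the search space
    the question points at; generation then runs against these chunks."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q
    return [chunks[i] for i in np.argsort(-scores)[:k]]
```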

The third miss took years to understand, and it is the strategic one. We assumed the rule-based NLTK techniques — parsing and summarization of text — were the moat. If we could build a better summarizer than the open-source baseline, we would have something defensible. What happened instead is that those capabilities, rebuilt on transformers and foundation models, commoditized faster than anything else in the stack. Summarization is an API call. Embedding is an API call. Cross-document synthesis is an API call. Even fairly hard reasoning over long documents is an API call, priced per token, billed at fractions of a cent. The model layer collapsed from “the hardest part to build” to “the easiest part to call.” Everything we thought was the differentiator is now a few lines of Python. The defensible parts of the system in 2026 are nothing we would have flagged as defensible in 2018: the curation of the source corpus, the trust UI around hallucinations, the integrations, the workflow that turns a model output into a learning loop. The model is the substrate, not the product.
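That claim is literal. A sketch using one provider's SDK; the model name is a placeholder, and any hosted model works the same way.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; substitute your provider's model
        messages=[
            {"role": "system",
             "content": "Summarize the document in ten pages or fewer, written as a narrative."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content
```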

How the AI layer changed

This is the part I came here to write. Not the AI revolution — the narrow, specific question: if you sat down to build the same kind of document AI system today, with long-document ingestion, structured output, and an adaptive-learning UI, what would the model layer give you that NLTK didn’t? What is the surface that has actually moved?

Output shape is no longer fixed at ingest time

This is the deepest change. In 2018 the output was determined when the document was parsed. We produced a fixed set of structured records — sections, summaries, key terms, question banks — and rendered different views of that fixed set. If a user wanted to ask something the schema didn’t anticipate, the system had no answer; it would fall back to keyword search at best.

Modern document AI works on the inverse principle: the ingestion stage produces an index, not an answer. Output is generated per query, against the relevant chunk of the index, with the model deciding what shape the answer should take. The same source yields a one-paragraph summary, a multi-step explanation, a comparison table, a worked example, a tutorial, or a flashcard set, depending on what the user asks for. The schema moved from compile time to runtime. That is a much bigger change than it sounds.
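A sketch of schema-at-runtime, reusing the retrieve() function from the index sketch above; generate() stands in for any chat-model call.

```python
def answer(query: str, chunks: list[str], index, generate) -> str:
    """Same index every time; the output shape is decided per query."""
    context = "\n\n".join(retrieve(query, chunks, index))
    prompt = (
        "Answer using only the context below. Choose the output shape the "
        "request implies: a paragraph, a table, numbered steps, or flashcards.\n\n"
        f"Context:\n{context}\n\nRequest: {query}"
    )
    return generate(prompt)
```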

Extractive versus abstractive collapsed

Extractive summarization — picking the most informative sentences from the source — was the safe choice in 2018, because the alternative produced a salad of words. Abstractive summarization, where the system writes its own sentences, required generative models we did not have at the time.

By 2022 the tradeoff had inverted. Abstractive output was higher quality than extractive on most dimensions that mattered: coherence, adaptation to the reader’s level, ability to compress redundant material, ability to surface implications the source didn’t state directly. The cost of this shift is hallucination — the model invents content that wasn’t in the source. Mitigating it required grounding: forcing the model to cite source spans, rejecting outputs whose claims don’t appear in retrieved context, running consistency checks across multiple model passes. The grounding loop is now standard infrastructure for most commercial AI systems. It wasn’t a category in 2018.
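The crudest possible version of that loop looks like this. generate() and extract_claims() are stand-ins for a model call and a claim parser, and real systems replace the substring check with span matching or an entailment model; the shape of the loop is the point.

```python
def grounded(claims: list[str], context: str) -> bool:
    """Reject the output if any cited claim has no support in the context.
    Substring matching is the crudest possible verifier; it still catches
    whole-cloth inventions."""
    return all(claim.lower() in context.lower() for claim in claims)

def generate_grounded(query, context, generate, extract_claims, retries: int = 2):
    for _ in range(retries + 1):
        answer = generate(query, context)
        if grounded(extract_claims(answer), context):
            return answer
    return "I can't answer that from the provided sources."  # abstain
```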

Multi-document synthesis became a primitive

A 2018 system that wanted to compare two textbooks on the same topic had to build that logic explicitly. Entity alignment. Claim extraction. Contradiction detection. Each one its own research project, and the combined pipeline was brittle. A 2026 system passes both texts to the model with a query like “where do these disagree, and on what evidence?” and gets back a coherent answer with citations. The work that used to take a team-year of NLP engineering is a prompt.
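The prompt, roughly, assuming one provider's SDK and a placeholder model name; real systems layer citation checks on top of this single call.

```python
from openai import OpenAI

client = OpenAI()

def compare(doc_a: str, doc_b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{
            "role": "user",
            "content": (
                "Where do these two texts disagree, and on what evidence? "
                "Cite the passage each claim comes from.\n\n"
                f"--- TEXT A ---\n{doc_a}\n\n--- TEXT B ---\n{doc_b}"
            ),
        }],
    )
    return resp.choices[0].message.content
```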

This generalizes well past summarization. Most operations that required dedicated NLP machinery in 2018 — entity resolution, relation extraction, sentiment, intent classification, topic modeling, even fairly delicate things like rhetorical structure analysis — are now zero-shot capabilities of a general-purpose model, often without prompt engineering at all. The specialist NLP toolkit that dominated this domain for two decades has been largely absorbed into the foundation model. That absorption was not predicted and was not gradual.
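The absorbed toolkit, reduced to prompt templates; generate() again stands in for any chat-model call.

```python
# Zero-shot versions of the old specialist pipelines, as templates.
TASKS = {
    "entities":  "List the named entities in this text and their types:\n{t}",
    "sentiment": "Classify the sentiment as positive, negative, or neutral:\n{t}",
    "intent":    "State the writer's intent in one sentence:\n{t}",
    "topics":    "List the three main topics of this text:\n{t}",
}

def nlp(task: str, text: str, generate) -> str:
    return generate(TASKS[task].format(t=text))
```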

Multimodal input is no longer a separate pipeline

In 2018, document AI at Optey meant text. Diagrams, charts, equations, tables embedded as images, scanned pages, handwritten margin notes: these all required separate OCR, layout analysis, and figure-understanding pipelines. Most projects skipped them; the engineering cost was too high relative to the ROI, especially given that OCR was unreliable and figure understanding barely worked.

Modern multimodal models ingest a PDF page as an image and answer questions about its contents directly — diagrams, formulas, the relationship between a chart’s y-axis and its caption, all of it. This unification matters beyond convenience. It removes a class of failure mode where the most semantically important content in a document — the figure that shows the relationship the text is describing — was invisible to the system because it lived in pixels rather than tokens. A 2018 document AI system was illiterate in half of what it ingested. A 2026 system isn’t.
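A sketch of the unified path, assuming the page has already been rendered to PNG; one provider's vision-capable chat API is shown, with the model name a placeholder.

```python
import base64
from openai import OpenAI

client = OpenAI()

def ask_about_page(page_png: bytes, question: str) -> str:
    """The page goes in as pixels, the question as text, one call answers both."""
    b64 = base64.b64encode(page_png).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```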

Output is no longer one-shot

A 2018 pipeline ran end-to-end and emitted a result. If the result was wrong, the user re-ran it with different inputs. A 2026 system can refine its own output: ask clarifying questions when a query is ambiguous, retrieve additional context when its first answer was thin, run a calculation when the question has a numeric component, decide that it doesn’t have enough information and say so rather than guess. The boundary between “output” and “session” has dissolved. What the user receives is increasingly the product of a multi-step interaction inside the system, rather than a single forward pass. Whether you call this agentic, tool-using, or just well-engineered, the effect is the same: the system gets to think before it answers, and that thinking is observable enough to debug.
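A minimal sketch of that session loop; decide() stands in for a model call that picks the next action, and retrieve_more() for another pass over the index.

```python
def run_session(query: str, retrieve_more, decide, max_steps: int = 4) -> str:
    """The 'output' is the product of a loop, not a single forward pass."""
    context: list[str] = []
    for _ in range(max_steps):
        action, payload = decide(query, context)    # e.g. ("answer", text)
        if action == "answer":
            return payload
        if action == "retrieve":
            context.extend(retrieve_more(payload))  # payload = refined query
        elif action == "clarify":
            return payload                          # surface the question to the user
        elif action == "abstain":
            return "Not enough information in the sources to answer."
    return "Stopped after max steps without a confident answer."
```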

What hasn’t changed

What hasn’t changed is more interesting than what has. The need for a structured intermediate representation didn’t go away. The need for a workflow layer that turns model output into something a user can act on didn’t go away. The difficulty of getting a corpus clean and well-scoped enough for the model to be useful didn’t go away — if anything, it got worse, because better models are more sensitive to corpus quality and will confidently hallucinate plausibly shaped wrong answers from bad input. A 2018 system with a noisy corpus produced obviously broken output, which got the corpus fixed. A 2026 system with the same noisy corpus produces fluent, coherent, wrong output, which gets shipped.

The AI models got dramatically better, and the engineering around the model got dramatically more important. A naive 2018 architect — which I was — might have predicted that better models would simplify the system, since the model would handle more of the work. The opposite happened. Modern document AI systems are larger than 2018 ones, with more components, more guardrails, more retrieval indices, more evaluation pipelines, more agentic loops. The model is one node in a graph that has gotten more complex, not less.

That is the part I would explain to the 2018 version of myself, if I could: the AI model improving by three orders of magnitude does not collapse the system around it; it expands the system. Mediocre models force simple architectures because nothing else works. Capable models force complex architectures because they make ambitious products possible, and ambitious products have surface area.
