The local OCR that scored best, and let the chatbot show the diagrams


In the last post I built a strict exact-match test for the OCR behind a 1994 Yamaha XV250 Virago manual chatbot and scored four local pipelines against 100 hand-verified values. The live corpus (Docling running EasyOCR) scored 61 percent; the best of the four was Docling + RapidOCR at 85 percent, and it got there doing genuine OCR on the page pixels, with nothing leaving the machine and no per-page API bill.

Re-OCRing with RapidOCR did one more thing I had not planned for: it handed me every figure in the manual as a separate image, which turned into a feature. The chatbot now shows you the exploded-view diagrams inline with the answer.

Live: virago.edestudio.us. Ask it how to adjust the valves.


Same test as the last post: 100 hand-verified values, the same strict matcher (digits and unit together, present or not, no partial credit), across four local pipelines.

| Pipeline | Exact-match fidelity |
|---|---|
| Docling + EasyOCR (live) | 61% |
| Tesseract PDF | 83% |
| Docling + Tesseract layer | 82% |
| Docling + RapidOCR | 85% |

RapidOCR fixed the failure class that was hurting most: the character-level damage. XV250U came back correct, not XV2SOU. The range tilde in 0.6 ~ 0.7 mm survived. Model codes, capacities and clearances — the dense maintenance tables EasyOCR scrambled — came back clean.
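For readers who skipped the last post, the strict matcher can be sketched roughly like this. It is a minimal illustration rather than the harness's actual code: the names `normalize`, `exact_match` and `fidelity` are assumptions, and so is the choice to guard the match with word boundaries so that a short value cannot score by sitting inside a longer one.

```python
import re


def normalize(text: str) -> str:
    """Collapse every run of whitespace to a single space."""
    return re.sub(r"\s+", " ", text).strip()


def exact_match(expected: str, ocr_text: str) -> bool:
    """True only if the whole value (digits and unit together) is present.

    The lookarounds stop "8 Nm" from matching inside "58 Nm".
    """
    pattern = r"(?<![\w.])" + re.escape(normalize(expected)) + r"(?!\w)"
    return re.search(pattern, normalize(ocr_text)) is not None


def fidelity(expected_values: list[str], ocr_text: str) -> float:
    """Fraction of expected values found verbatim: present or not, no partial credit."""
    hits = sum(exact_match(value, ocr_text) for value in expected_values)
    return hits / len(expected_values)


# The two failure examples from the text: a damaged model code fails,
# a surviving range tilde passes.
print(exact_match("XV250U", "Model: XV2SOU"))
print(exact_match("0.6 ~ 0.7 mm", "Valve clearance 0.6 ~  0.7 mm (cold)"))
```

Under this kind of matcher a single wrong character costs the whole value, which is why character-level damage was the most expensive failure class.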

The one class it only half-fixes is torque. The usable unit of a torque spec is the triple (58 Nm, 5.8 m-kg, 42 ft-lb) tied to its bolt, and Docling’s table model crams those columns together regardless of which OCR engine feeds it. RapidOCR keeps five of the ten triples intact, against almost none for the Tesseract routes. That is better, but not solved, because the remaining failure is layout, not reading.
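One way to decide whether a triple survived intact is to require all three units in order, with only separators between them. The pattern below is an assumption about how such a check could look, not the scoring code from the test, and the "crammed" string is a made-up illustration of columns run together.

```python
import re

NUM = r"(\d+(?:\.\d+)?)"

# Nm, then m-kg, then ft-lb, separated only by spaces, commas or an
# opening parenthesis. Anything else in between means the triple is broken.
TRIPLE = re.compile(
    NUM + r"\s*Nm[\s,(]+" + NUM + r"\s*m-kg[\s,]+" + NUM + r"\s*ft-lb"
)


def find_triples(text: str) -> list[tuple[float, ...]]:
    """Return every intact (Nm, m-kg, ft-lb) triple found in the text."""
    return [tuple(float(x) for x in m.groups()) for m in TRIPLE.finditer(text)]


print(find_triples("35 Nm (3.5 m-kg, 25 ft-lb)"))          # one intact triple
print(find_triples("35 30 Nm 3.5 3.0 m-kg 25 22 ft-lb"))   # crammed columns
```

The second call finds nothing even though every digit was read correctly, which is the sense in which the remaining failure is layout rather than reading.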

The cost of the whole run was electricity. The pipeline is Docling with RapidOCR (backend="torch", so it uses the GPU if there is one), all local — no document leaves the machine, which also means the same pipeline works for a privacy-sensitive corpus or with no internet at all. That is the trade I was happy to make: a few points of fidelity below a hosted API, in exchange for a free, open-source pipeline I fully control and can point at the next manual without touching a billing page.


The harness wraps it so an extract is one command. Docling does the layout and table reconstruction; RapidOCR reads the pixels; --images also pulls every embedded figure out to disk:

# OCR the PDF with Docling + RapidOCR, and extract figures too
dx extract yamaha-xv250 --backend docling-rapidocr --images

# rewrite the per-page markdown into a RAG-ready corpus: a
# "Source: <manual> | <chapter> | manual page N" header per page,
# and every image link rewritten to an absolute figure URL
dx rag yamaha-xv250 --backend docling-rapidocr

The extract writes one Markdown file per page under corpora/docling-rapidocr/page_NNN.md, with the figures for that page in images/page_NNN/ and the Markdown already linking them by name. The --images flag is the whole reason the next half of this post exists.
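What the rag step does to each page can be sketched in a few lines. This is a guess at the shape of the transformation, not the harness's source: `to_rag_page` and `IMG_LINK` are hypothetical names, the manual label passed in is a placeholder, and the base URL is taken from the figure links on the live site.

```python
import re

# Base URL of the hosted figures, as it appears in the live site's image links.
BASE_URL = "https://virago.edestudio.us/img"

# A Markdown image whose target is a relative "images/..." path.
IMG_LINK = re.compile(r"!\[([^\]]*)\]\(images/([^)\s]+)\)")


def to_rag_page(markdown: str, manual: str, chapter: str, page: int) -> str:
    """Prepend the Source header and make every figure link absolute."""
    header = f"Source: {manual} | {chapter} | manual page {page}"
    body = IMG_LINK.sub(
        lambda m: f"![{m.group(1)}]({BASE_URL}/{m.group(2)})", markdown
    )
    return f"{header}\n\n{body}"


page = "- Fuel hoses\n![Figure](images/page_180/img-411.jpeg)"
print(to_rag_page(page, "Yamaha XV250", "Chapter 5: Carburetion", 180))
```

Doing the rewrite before indexing means the model only ever sees a link it can copy verbatim.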


The manual is full of exploded-view diagrams, wiring schematics and measurement illustrations. They are the half of a service manual that text can never replace. Docling + RapidOCR pulled 997 images across the 291 pages, numbered per page, with the Markdown linking each one inline next to the step it belongs to.

The figures are not spread evenly. A handful of pages (the index, plain spec tables, blank verso pages) carry none, while the procedure-heavy chapters carry three, four, even nine to a page. The mean is 3.4 images per page (997 / 291).

These are the diagrams that the text simply cannot carry. An exploded view of the front brake, for instance, is where the torque figures from the last section actually live: the 35 Nm (3.5 m-kg, 25 ft-lb) and 30 Nm (3.0 m-kg, 22 ft-lb) triples are printed as callouts pointing at the exact bolts they apply to, alongside “New” tags for parts that must be replaced and the brake fluid type.

That single image carries more usable assembly information than a page of prose, and it is exactly the kind of thing the chatbot can now surface next to a “how do I rebuild the front brake” answer. It is also a reminder of why torque fidelity in the text matters less on its own than it looks: the authoritative copy of that torque figure is the callout in the diagram, and the diagram now travels with the answer. The manual’s other illustration types extract just as cleanly, from a labelled engine cutaway to the full wiring schematic, the densest single page in the book.

So the corpus the chatbot reads now contains lines like this, sitting between the numbered removal steps:

2. Disconnect:
- Fuel hoses ①
- Pulser hose ②
![Figure (manual p.180)](https://virago.edestudio.us/img/page_180/img-411.jpeg)

If a retrieved chunk carries that line, the model can repeat it, and the figure renders in the answer. The job became: host the images, get the model to include the relevant ones, and render them safely.

The figures (about 9 MB of JPEGs) went into a dedicated Cloudflare R2 bucket, served by a small Pages Function at /img/[[path]] that sets the content type by extension and caches hard. A separate bucket, not the one holding the PDF, so the AI Search indexer never tries to treat a JPEG as a document. Same-origin URLs (virago.edestudio.us/img/...) mean no CORS and no second domain. The dx rag step is what rewrites each page’s relative images/... link to its absolute bucket URL, so the corpus that gets indexed already points at the hosted figure.

The system prompt had to be explicit, because the model’s default reflex is to claim it cannot display images. The rule that worked:

When a figure in the retrieved context directly supports the answer, you MUST include it inline using the exact image link from that context. Never say you cannot display images. Only use links that appear verbatim in the excerpts; never invent, alter, or guess a URL.

“Never say you cannot display images” is doing real work there. Without it the model hedged.

The frontend is a dependency-free Markdown renderer, deliberately small and built to be safe to assign to innerHTML. Adding images meant a few careful touches:

  • Images restricted to https://. No javascript: or data: URIs can reach an src. The image rule also has to run before the link rule, or ![alt](url) gets half-eaten and leaves a stray !.
  • A streaming flash fix. The original code re-rendered the whole message on every streamed token, which tore down and rebuilt the <img> dozens of times a second, so the figure flickered and jumped. The fix hides figures while streaming and draws them once at the end.
  • A broken-image safety net. The model occasionally mangles a URL it is supposed to copy verbatim. I watched it turn img-410.jpeg into img.jpeg, which 404s and shows a broken-image box. A capture-phase error handler now removes any figure that fails to load, so a mangled link just vanishes instead of showing a broken icon.
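The first bullet's two constraints (https-only sources, image rule before link rule) can be shown in a short sketch. It is written in Python to illustrate the logic only, since the real renderer is browser code that is not reproduced here, and `render_inline` and the two patterns are hypothetical names.

```python
import html
import re

# Only https:// targets are recognised, so javascript: and data: URIs
# never match and stay as inert, escaped text.
IMAGE = re.compile(r"!\[([^\]]*)\]\((https://[^\s)]+)\)")
LINK = re.compile(r"\[([^\]]+)\]\((https://[^\s)]+)\)")


def render_inline(text: str) -> str:
    """Escape everything first, then turn safe image and link syntax into tags."""
    safe = html.escape(text, quote=True)
    # Images must be handled before links: LINK also matches the "[alt](url)"
    # tail of an image and would leave the leading "!" stranded.
    safe = IMAGE.sub(r'<img src="\2" alt="\1">', safe)
    safe = LINK.sub(r'<a href="\2">\1</a>', safe)
    return safe


print(render_inline("![Figure](https://virago.edestudio.us/img/page_180/img-411.jpeg)"))
print(render_inline("![x](javascript:alert(1))"))
```

Escaping before substitution is what makes the output safe to assign to innerHTML: the only tags that can appear are the ones these two rules emit.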

Re-OCRing also let me re-cut the corpus. Each page is now its own indexed unit with a header that names the chapter and the manual page, so the model cites both:

follow these steps from the Carburetion chapter, page 180

That header is exactly what the dx rag step writes (Source: <manual> | <chapter> | manual page N), derived from the chapter map in the document’s metadata. The reader has the full PDF open in a slide-over panel, paginated 1 to 291, so “page 180” is a place they can actually go. The “Sources” chip under each answer collapses to clean chapter names (“Chapter 5: Carburetion”), deduped, because the page filenames are mapped back to chapters in the client.
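The mapping from page filenames back to deduplicated chapter names could look something like the sketch below. The chapter map here is a placeholder: only the "Chapter 5: Carburetion" label and the fact that page 180 falls inside it come from the site, while the page boundaries, the other entries and the function names are assumptions. The real mapping runs in the client; Python is used here only to show the logic.

```python
import re
from bisect import bisect_right

# Hypothetical chapter map: (first manual page, chapter name).
# Boundaries and placeholder entries are invented for illustration.
CHAPTER_STARTS = [
    (1, "Front matter (placeholder)"),
    (170, "Chapter 5: Carburetion"),
    (195, "Chapter 6 (placeholder)"),
]


def chapter_for(filename: str) -> str:
    """Map a page filename such as page_180.md to its chapter name."""
    page = int(re.search(r"page_(\d+)", filename).group(1))
    starts = [start for start, _ in CHAPTER_STARTS]
    return CHAPTER_STARTS[bisect_right(starts, page) - 1][1]


def source_chips(filenames: list[str]) -> list[str]:
    """Chapter names in first-seen order, with duplicates removed."""
    return list(dict.fromkeys(chapter_for(name) for name in filenames))


print(source_chips(["page_180.md", "page_181.md", "page_200.md", "page_179.md"]))
```

Using `dict.fromkeys` keeps the chips in the order the sources were retrieved, so the most relevant chapter stays first.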


One thing the model does badly: it restarts numbered steps at 1. A four-step procedure came out as “1. Turn… 1. Disconnect… 1. Remove… 1. Remove…” even though the source markdown was correctly numbered 1 to 4. Prompt nudges help but do not fully fix it.

So the renderer renumbers procedures itself. It keeps a running counter, numbers ordered-list items continuously down a section regardless of what number the model emitted, and resets only at a heading (a new section is a new procedure). The model can collapse every step to “1.” and the reader still sees 1, 2, 3, 4. The lesson I keep relearning: when a model is unreliable at something deterministic, do the deterministic part in code.
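The renumbering rule is small enough to sketch in full. This version is a Python illustration of the logic rather than the renderer's own code; `renumber` is a hypothetical name, and it assumes flat lists and `#`-style headings, ignoring nested lists.

```python
import re

ORDERED_ITEM = re.compile(r"^(\s*)\d+\.\s+(.*)$")
HEADING = re.compile(r"^#{1,6}\s")


def renumber(markdown: str) -> str:
    """Number ordered-list items continuously, resetting only at a heading."""
    counter = 0
    out = []
    for line in markdown.splitlines():
        if HEADING.match(line):
            counter = 0  # a new section is a new procedure
            out.append(line)
            continue
        item = ORDERED_ITEM.match(line)
        if item:
            counter += 1  # ignore whatever number the model emitted
            out.append(f"{item.group(1)}{counter}. {item.group(2)}")
        else:
            out.append(line)
    return "\n".join(out)


print(renumber("1. Turn\n1. Disconnect\n1. Remove\n1. Remove"))
```

Because the counter survives non-list lines, a figure or a note between two steps does not restart the sequence.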


Going local is not a free win; it is a different set of compromises from a hosted OCR API.

  • 85 percent is not 100. A hosted OCR API scored higher on the same test. I traded those points for a pipeline that costs nothing per page, sends no document off the machine, and is open source end to end. For these manuals — owner-grade workshop references, with the full PDF one tap away in the panel — that trade is the right way round.
  • Torque is only half-recovered, and superscript units stay fragile. Both are layout/glyph problems Docling’s table model owns, not the OCR engine, so they need a layout-aware pass or hand-correction, not another engine swap. Until then the system prompt flags safety-critical values and points the reader at the scan.
  • The model is still the weak link at the edges. It mangles the occasional image URL and the step numbers, which is why both have a deterministic safety net in front of them rather than a trusting prompt.
  • Page-level chunks pull more, smaller results. Better for getting a figure retrieved alongside its step, but a single question can surface ten pages across a couple of chapters. The right lever there is reranking, a cross-encoder reordering the candidates by true relevance, not a blunt similarity threshold.

The headline is the easy part: a local, open-source OCR pipeline that scored best of the four I tested, and the manual’s diagrams now appear in the answers. The interesting part is everything between a clean OCR pass and a chatbot you would actually trust to hand a mechanic a torque figure and the picture of where the bolt goes.

Go break it: virago.edestudio.us.
