Earlier this year I wrote about building a document analyzer using Ollama and Llama2 running on my NAS at home. It worked. But I was already running the rest of the project on Cloudflare Workers, and having the AI piece live on a home server felt increasingly out of place. If the NAS was slow, the tool was slow. If it was off, the tool was off.
The obvious move was to go full Cloudflare. This post is about what that looks like now and, specifically, what happens behind the scenes when you press Analyze.
You can try it here: Document Analyzer
What changed at the top level
The original setup was:
- Ollama running on my NAS, exposed via Cloudflare Tunnel
- A single Cloudflare Worker handling requests
- Three prompts: summarize, key points, sentiment
The new version:
- Workers AI running @cf/mistralai/mistral-small-3.1-24b-instruct - no home server involved
- Cloudflare KV for caching results
- D1 for storing analysis history
- Analytics Engine for metrics
- AI Gateway in front of every model call
- 17 prompts, validated server-side
- Full TypeScript throughout
Same idea, different foundation.
What happens when you press Analyze
The front end sends a POST to /api/analyze with two fields: the document text and the chosen prompt. Everything interesting happens in the Worker from there.
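As a rough illustration, the browser side of that request could be built like this. The field names `text` and `prompt` are assumptions made for the sketch; the real endpoint may name them differently.

```typescript
// Hypothetical request shape: the field names `text` and `prompt` are assumptions.
interface AnalyzeRequest {
  text: string; // the document to analyze
  prompt: string; // one of the prompts the server allows
}

// Build the options object passed to fetch() for a POST to /api/analyze.
function buildAnalyzeInit(body: AnalyzeRequest): {
  method: string;
  headers: Record<string, string>;
  body: string;
} {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}

// Usage in the browser (not executed here):
// const res = await fetch("/api/analyze", buildAnalyzeInit({ text, prompt }));
```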
Step 1: Prompt validation
The prompt is checked against a server-side allow-list. The prompts are stored as a JSON string in a Worker environment variable and parsed at request time. If the prompt string does not match one of the known prompts exactly, the request is rejected with a 400. This is a simple guard against someone crafting a POST with an arbitrary instruction.
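A minimal sketch of that check, assuming the environment variable holds a JSON array of prompt strings. The actual shape of the stored JSON is not shown in this post, so treat the parsing as illustrative.

```typescript
// Assumption: the variable contains a JSON array of strings, for example
// '["Summarize this document", "List the key points"]'.
function parseAllowedPrompts(promptsJson: string): Set<string> {
  const parsed: unknown = JSON.parse(promptsJson);
  if (!Array.isArray(parsed) || !parsed.every((p) => typeof p === "string")) {
    throw new Error("Prompt allow-list must be a JSON array of strings");
  }
  return new Set(parsed as string[]);
}

// Exact-match check. When this returns false, the handler would respond with a 400.
function isAllowedPrompt(promptsJson: string, prompt: unknown): boolean {
  return typeof prompt === "string" && parseAllowedPrompts(promptsJson).has(prompt);
}
```

Because the comparison is an exact string match, a prompt that differs only in case or whitespace is rejected as well.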
Step 2: Cache lookup
A SHA-256 hash is computed from the combination of the document text and the prompt. That hash becomes the KV cache key. If the hash is already in KV, the cached result is returned immediately as a Server-Sent Events stream. The response is effectively instant and the model is never called.
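The lookup can be sketched as follows. The `KvReader` interface is a cut-down stand-in for the real KV binding, and the event format (`data: {"response": ...}` followed by `data: [DONE]`) is an assumption chosen to mirror the shape of a Workers AI stream.

```typescript
// Minimal slice of a KV binding: only the method this sketch needs.
interface KvReader {
  get(key: string): Promise<string | null>;
}

// Wrap a complete cached result as a short SSE body.
// Assumption: the front end expects `data: {"response": "..."}` events
// followed by a `data: [DONE]` terminator.
function toSseBody(result: string): string {
  return `data: ${JSON.stringify({ response: result })}\n\ndata: [DONE]\n\n`;
}

// Return an SSE Response on a cache hit, or null on a miss.
async function serveFromCache(kv: KvReader, cacheKey: string): Promise<Response | null> {
  const cached = await kv.get(cacheKey);
  if (cached === null) return null; // miss: the caller goes on to call the model
  return new Response(toSseBody(cached), {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}
```

Serving the hit in the same SSE shape as a live model response means the front end needs only one code path for reading results.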
Step 3: AI call
On a cache miss, the Worker calls Workers AI via AI Gateway. The model is mistral-small-3.1-24b-instruct, which has a 128k token context window. That headroom matters for longer documents: the smaller Llama 3.1 8B models top out at 32k tokens, which is not enough once you factor in the system preamble, the document, and the expected output. The request includes a system preamble that constrains the model to the document content only: no external assumptions, no padding, and the shortest answer that addresses the prompt.
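A sketch of what that call might look like. The `AiRunner` interface is a simplified stand-in for the Workers AI binding, the preamble wording is a paraphrase rather than the real text, and both the gateway id and the way the prompt and document are combined into the user message are assumptions.

```typescript
// Minimal slice of the Workers AI binding; the real type is broader.
interface AiRunner {
  run(
    model: string,
    inputs: { messages: { role: "system" | "user"; content: string }[]; stream: boolean },
    options?: { gateway?: { id: string } },
  ): Promise<unknown>;
}

const MODEL = "@cf/mistralai/mistral-small-3.1-24b-instruct";

// Hypothetical wording: a paraphrase of the constraints described above.
const SYSTEM_PREAMBLE =
  "Answer using only the content of the document provided. " +
  "Make no external assumptions, add no padding, and give the shortest answer that addresses the prompt.";

// Assumption: the prompt and the document are sent together as one user message.
function buildMessages(documentText: string, prompt: string) {
  return [
    { role: "system" as const, content: SYSTEM_PREAMBLE },
    { role: "user" as const, content: `${prompt}\n\n${documentText}` },
  ];
}

// Call the model with streaming enabled, routed through AI Gateway.
// `gatewayId` is a placeholder for whatever the gateway is actually called.
function callModel(ai: AiRunner, gatewayId: string, documentText: string, prompt: string): Promise<unknown> {
  return ai.run(
    MODEL,
    { messages: buildMessages(documentText, prompt), stream: true },
    { gateway: { id: gatewayId } },
  );
}
```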
The response comes back as a stream of SSE chunks. The Worker reads each chunk, parses the response field, and builds up the full result.
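One way to do that accumulation, sketched under the assumption that each event is a line of the form `data: {"response":"..."}` and that the stream ends with `data: [DONE]`. The class name is made up for this example. It buffers partial lines because a network chunk can end in the middle of an event.

```typescript
// Accumulates `response` fields from SSE text that may arrive split across chunks.
class SseCollector {
  private buffer = "";
  private result = "";

  // Feed one decoded chunk of text (in a Worker, the output of a streaming TextDecoder).
  push(chunk: string): void {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep any incomplete trailing line for the next chunk
    for (const line of lines) this.consume(line);
  }

  // Process a final line that arrived without a trailing newline, then return the full text.
  finish(): string {
    if (this.buffer) this.consume(this.buffer);
    this.buffer = "";
    return this.result;
  }

  private consume(line: string): void {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) return;
    const payload = trimmed.slice("data:".length).trim();
    if (payload === "" || payload === "[DONE]") return;
    try {
      const parsed = JSON.parse(payload) as { response?: unknown };
      if (typeof parsed.response === "string") this.result += parsed.response;
    } catch {
      // Skip malformed events rather than failing the whole stream.
    }
  }
}
```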
Step 4: Side effects via waitUntil
Once all chunks are collected, three things happen in the background using ctx.waitUntil:
- The full result is written to KV with a 24-hour TTL
- A row is inserted into D1 with the prompt, a hash of the document, the first 200 characters of the document, and the first 500 characters of the result
- A data point is written to Analytics Engine
The response is already being streamed back to the browser while these run. The user does not wait for any of them.
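The three background writes might look roughly like this. The interfaces are cut-down stand-ins for the real bindings, and the binding names (`CACHE`, `DB`, `METRICS`), the table name `analyses`, and every column name other than `doc_hash` are hypothetical.

```typescript
// Minimal slices of the bindings used here; the real Workers types are broader.
interface KvWriter {
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}
interface SqlStatement {
  bind(...values: unknown[]): SqlStatement;
  run(): Promise<unknown>;
}
interface SqlDatabase {
  prepare(sql: string): SqlStatement;
}
interface MetricsWriter {
  writeDataPoint(point: { blobs?: string[]; doubles?: number[] }): void;
}
interface BackgroundCtx {
  waitUntil(promise: Promise<unknown>): void;
}
interface SideEffectEnv {
  CACHE: KvWriter;
  DB: SqlDatabase;
  METRICS: MetricsWriter;
}

const DAY_SECONDS = 24 * 60 * 60; // 24-hour TTL

interface Analysis {
  hash: string; // cache key, also stored as doc_hash
  prompt: string;
  promptLabel: string;
  documentText: string;
  result: string;
  elapsedMs: number;
}

function recordSideEffects(env: SideEffectEnv, ctx: BackgroundCtx, a: Analysis): void {
  // 1. Cache the full result for 24 hours.
  ctx.waitUntil(env.CACHE.put(a.hash, a.result, { expirationTtl: DAY_SECONDS }));

  // 2. Store a history row with truncated snippets only.
  ctx.waitUntil(
    env.DB.prepare(
      "INSERT INTO analyses (prompt, doc_hash, doc_snippet, result_snippet) VALUES (?, ?, ?, ?)",
    )
      .bind(a.prompt, a.hash, a.documentText.slice(0, 200), a.result.slice(0, 500))
      .run(),
  );

  // 3. Record a metric. writeDataPoint returns nothing to await, so it is called directly.
  env.METRICS.writeDataPoint({
    blobs: [a.promptLabel],
    doubles: [0, a.elapsedMs, a.documentText.length], // 0 = cache miss
  });
}
```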
Why hash document and prompt together
The cache key is sha256(documentText + prompt). Not just the document. Not just the prompt.
The same document analyzed with “Summary” and “Sentiment” should produce different cached results. Hashing both together means each unique combination gets its own cache entry. The 24-hour TTL is a reasonable balance - documents do not change, but I did not want stale entries accumulating indefinitely.
The other benefit: the hash is what goes into D1 as doc_hash. The full document is never stored in the database, only the hash and a short snippet for reference.
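A minimal sketch of the key derivation using the Web Crypto API that Workers expose. The function name is made up for this example, and hex encoding of the digest is an assumption.

```typescript
// Hex-encoded SHA-256 of the document text concatenated with the prompt.
async function cacheKeyFor(documentText: string, prompt: string): Promise<string> {
  const bytes = new TextEncoder().encode(documentText + prompt);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  // Convert each byte to two lowercase hex characters.
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}
```

The same document paired with two different prompts yields two different 64-character keys, so a cached summary can never be served in place of a sentiment result.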
What Analytics Engine actually tells you
Each request writes a data point with:
- Blob: the prompt label (e.g. “Summary”, “Key points”)
- Doubles: cache hit as 0 or 1, response time in milliseconds, document length in characters
This lets me query things like: which prompts are used most, what percentage of requests are cache hits, and whether certain prompts are slower than others. Nothing complex, but enough to understand how the tool is actually being used without guessing.
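To make the layout concrete, here is a sketch of the data point and one query that could sit on top of it. In the Analytics Engine SQL API, blobs and doubles surface as positional columns (`blob1`, `double1`, `double2`, and so on), and `_sample_interval` weights each row for sampling. The dataset name `analyzer_metrics` is hypothetical.

```typescript
// One blob (prompt label) and three doubles, in a fixed order.
interface AnalyzerDataPoint {
  blobs: [string];
  doubles: [number, number, number]; // [cache hit 0/1, response time ms, document length]
}

function buildDataPoint(
  promptLabel: string,
  cacheHit: boolean,
  responseMs: number,
  documentChars: number,
): AnalyzerDataPoint {
  return { blobs: [promptLabel], doubles: [cacheHit ? 1 : 0, responseMs, documentChars] };
}

// Usage, cache hit rate, and average latency per prompt.
// `analyzer_metrics` is a hypothetical dataset name.
const STATS_BY_PROMPT_SQL = `
SELECT
  blob1 AS prompt,
  SUM(_sample_interval) AS requests,
  SUM(double1 * _sample_interval) / SUM(_sample_interval) AS cache_hit_rate,
  SUM(double2 * _sample_interval) / SUM(_sample_interval) AS avg_response_ms
FROM analyzer_metrics
GROUP BY blob1
ORDER BY requests DESC
`;
```

Storing the cache hit as a 0 or 1 double is what makes the hit rate a simple weighted average.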
Same proxy pattern on Astra Docs Chat
When I built Astra Docs Chat, I applied the same rule: the browser never calls the upstream AI service directly. A Cloudflare Pages Function at /api/astra-chat proxies to a private Langflow instance instead. See Proxying Langflow from Cloudflare Pages Functions for that write-up.
Try it
Open jamieede.com/analyzer, paste in any document, pick a prompt, and see what comes back.
The version described here is live. If you are curious about a specific part of the implementation, or have a prompt type you think is missing, I would like to hear about it. Reach out on LinkedIn.