Chunking and embedding technical documentation for RAG

“271 pages” is not “271 vectors.” Split settings and embedding model choice determine whether Astra Docs Chat retrieves the right paragraph when you ask about PCU groups, hybrid search, or collection APIs.

Context: Building Astra Docs Chat · Langflow ingest flow · Batch ingest

Try retrieval in production: Astra Docs Chat

Shape of the corpus

Each file in my local pages/ export is trafilatura-extracted markdown from the Astra DB Serverless docs:

- `#` / `##` heading hierarchy
- Fenced code blocks (API examples, cURL, driver snippets)
- Occasional tables and bullet lists
- Minimal nav boilerplate (nav stripped at extract time)

There is no per-file YAML front matter in v1. Page identity is the filename slug.

Chunking should respect headings and keep code blocks intact where possible. Splitting mid-fence produces retrieval hits that start with json or half a curl flag: useless context for the LLM.
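
A minimal sketch of what respecting headings and fences could look like as a preprocessor. This is not what the v1 flow does; `split_by_heading` is a hypothetical helper, and it assumes headings are plain `#` / `##` lines and that fences are lines starting with three backticks.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly


def split_by_heading(markdown: str) -> list[str]:
    """Split markdown at '#' / '##' headings, never inside a fenced code block."""
    sections: list[list[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every opening or closing fence line
        # A '#' line inside a fence is a shell comment or similar, not a heading.
        is_heading = not in_fence and re.match(r"#{1,2} ", line) is not None
        if is_heading and sections[-1]:
            sections.append([])
        sections[-1].append(line)
    return ["\n".join(section) for section in sections if section]
```

Sections produced this way can still exceed the chunk size, so a real preprocessor would likely need a second, size-based pass on long sections.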

SplitText in Langflow

The ingest flow uses Langflow’s SplitText component between File and Embeddings. On the DataStax RAG template (cloned into SplitText-ingest), defaults are:

Setting	Value
Chunk size	1000 characters
Chunk overlap	200 characters
Separator	newline (`\n`)
Splitter	CharacterTextSplitter

File-ingest → SplitText-ingest → OpenAI Embeddings → AstraDB-ingest
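
To make the defaults concrete, here is a minimal sketch of greedy newline-based splitting with overlap. It only approximates what a character splitter does with these settings; it is not Langflow's or LangChain's actual code, and `split_text` is a hypothetical name.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200, sep: str = "\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks: list[str] = []
    current: list[str] = []
    for piece in text.split(sep):
        candidate = sep.join(current + [piece])
        if current and len(candidate) > chunk_size:
            chunks.append(sep.join(current))
            # Carry trailing pieces into the next chunk, up to the overlap budget.
            carried: list[str] = []
            for prev in reversed(current):
                if len(sep.join([prev] + carried)) > overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(piece)
    if current:
        chunks.append(sep.join(current))
    return chunks
```

Note that nothing here looks at headings or fences: the boundary falls wherever the running character count reaches the limit.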

What each knob does

Setting	Trade-off
Chunk size	Larger → more context per hit; smaller → more precise retrieval
Chunk overlap	Reduces cuts across section boundaries; increases vector count and ingest cost
Separators	Newline splits are coarse; `\n##` or paragraph breaks can respect doc structure better

There is no universal optimum. Evaluate on real questions from the starter prompts on /astra-chat.
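
The cost side of overlap can be estimated with simple arithmetic. The sketch below assumes idealized fixed-size windows, where each new chunk advances by chunk size minus overlap; real separator-based splitting will deviate somewhat. `estimate_chunks` is a hypothetical helper.

```python
import math


def estimate_chunks(text_len: int, chunk_size: int = 1000, overlap: int = 200) -> int:
    """Estimate the number of chunks for a text of text_len characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if text_len <= 0:
        return 0
    if text_len <= chunk_size:
        return 1
    # The first chunk covers chunk_size characters; each later one adds (chunk_size - overlap).
    return math.ceil((text_len - overlap) / (chunk_size - overlap))


print(estimate_chunks(10_000))             # 13 with the default 200-character overlap
print(estimate_chunks(10_000, overlap=0))  # 10 with no overlap
```

Under these assumptions, a 10,000-character page yields about 30% more vectors with a 200-character overlap than with none.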

Code blocks and API reference pages

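Dense parameter lists (REST query params, driver method tables) often span fewer than 1000 characters but logically belong together. Because chunk boundaries fall wherever the running character count reaches the limit, such a list can still be cut in two. Overlap helps when a question matches text near a chunk boundary.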

If retrieval consistently returns truncated API tables, try:

- Smaller chunk size with higher overlap on reference sections only (requires a split strategy per doc type, which is not in v1)
- Hybrid search for exact symbol names (Astra vector store post)

Embedding model: text-embedding-3-small

Ingest and query both use OpenAI text-embedding-3-small:

- Good cost/quality trade-off for English technical prose
- Supported in Langflow’s OpenAI Embeddings component on both ingest and chat paths
- Must be the same model at ingest time and query time for similarity scores to be meaningful

The chat LLM is DeepSeek. It is normal for the chat model and the embedding model to be chosen independently, so swapping the chat model does not change the embeddings.

DeepSeek embeddings were explicitly out of scope for v1 (the design spec deferred them).

Changing the embedding model after ingest requires a full re-embed of the corpus (re-ingest post).
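
Why the model must match can be shown with the similarity computation itself. The sketch below is illustrative only: Astra computes similarity server-side, and the `Embedded` wrapper and its model check are hypothetical, added here to make the mismatch fail loudly instead of returning a meaningless score.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Embedded:
    model: str                 # name of the embedding model that produced the vector
    vector: tuple[float, ...]


def cosine_similarity(a: Embedded, b: Embedded) -> float:
    """Cosine similarity, refusing to compare vectors from different models."""
    if a.model != b.model:
        raise ValueError(f"model mismatch: {a.model!r} vs {b.model!r}")
    if len(a.vector) != len(b.vector):
        raise ValueError("dimension mismatch")
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm = math.sqrt(sum(x * x for x in a.vector)) * math.sqrt(sum(y * y for y in b.vector))
    if norm == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm
```

Two models can share a dimension count and still place the same text in unrelated regions of the space, which is why the check is on the model name and not only on vector length.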

Pages vs chunks vs hits

Term	Meaning
Page	One markdown file ≈ one docs URL
Chunk	One SplitText output segment → one vector in Astra
Hit	Top-k chunks returned for a question (default 4 in chat flow)

A long administration page may produce dozens of vectors. Retrieval quality is measured at hit level, not page count.

The batch script logs “Adding N documents to the Vector Store” per file: N is chunk count for that page, not 1.

After a full ingest, the gap shows up in Astra DB directly. The datastax_astra_docs collection held 6174 records from 271 markdown pages when I spot-checked in Data Explorer: same corpus, many more vectors than files.

Astra DB Data Explorer showing the datastax_astra_docs collection: 6174 records, 1536 dimensions, cosine similarity, with content and filename metadata per chunk

Each row is one chunk: content is the text SplitText produced, $vector is the embedding, and metadata.filename ties the chunk back to the source markdown file. The 1536 dimensions match OpenAI text-embedding-3-small at default width.
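
As a quick sanity check on those numbers, the arithmetic below works out the average chunk count per page and the total number of vector components stored. It uses only the figures from the spot check above.

```python
records = 6174      # chunks in datastax_astra_docs at the time of the spot check
pages = 271         # markdown files in the corpus
dimensions = 1536   # vector size shown in Data Explorer

avg_chunks_per_page = records / pages
total_components = records * dimensions

print(f"{avg_chunks_per_page:.1f} chunks per page on average")  # 22.8
print(f"{total_components:,} vector components in total")       # 9,483,264
```

The average hides a wide spread: short pages produce a handful of chunks, while long administration pages produce dozens.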

Quick eval method

1. Pick 10 questions with known answers in specific doc sections
2. Run retrieval-only in Langflow Playground (inspect retrieved text before generation)
3. Change chunk size / overlap; re-ingest a subset with --limit 3 (batch post)
4. Repeat step 2
5. Note where code blocks truncate or headings split awkwardly

Log failures: they often cluster on API reference pages with dense parameter lists and multi-language code tabs (extracted as sequential fences).
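
One cheap way to flag truncated code blocks is to count fence lines per chunk: an odd count means a fence opens or closes outside the chunk. This is a rough heuristic under the assumption that fences are lines starting with three backticks; both function names are hypothetical.

```python
FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly


def has_unbalanced_fence(chunk: str) -> bool:
    """True when the chunk contains an odd number of fence lines."""
    fence_lines = sum(1 for line in chunk.splitlines() if line.lstrip().startswith(FENCE))
    return fence_lines % 2 == 1


def truncation_rate(chunks: list[str]) -> float:
    """Fraction of chunks that cut through a fenced code block."""
    if not chunks:
        return 0.0
    return sum(has_unbalanced_fence(chunk) for chunk in chunks) / len(chunks)
```

It misses chunks that fall entirely inside a long fence (zero fence lines), so treat the rate as a lower bound when comparing split settings.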

Example eval questions:

Question	Good hit contains
What are PCU groups?	Billing / plan administration wording
How do I create a collection?	Quickstart or collections API steps
Explain hybrid search	Vector + keyword search explanation
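
The table above maps naturally onto a tiny harness. This is a sketch, not code from the repo: the keywords are illustrative guesses at what a good hit contains, and `fake_retrieve` is a stand-in with canned placeholder text that you would replace with a real retrieval call.

```python
from typing import Callable

# (question, keywords a good hit should contain); keywords are illustrative guesses
EvalCase = tuple[str, list[str]]

CASES: list[EvalCase] = [
    ("What are PCU groups?", ["billing", "plan"]),
    ("How do I create a collection?", ["create", "collection"]),
    ("Explain hybrid search", ["keyword", "vector"]),
]


def hit_rate(cases: list[EvalCase], retrieve: Callable[[str, int], list[str]], k: int = 4) -> float:
    """Fraction of questions where at least one top-k chunk contains a keyword."""
    if not cases:
        return 0.0
    passed = 0
    for question, keywords in cases:
        hits_text = " ".join(retrieve(question, k)).lower()
        if any(keyword.lower() in hits_text for keyword in keywords):
            passed += 1
    return passed / len(cases)


def fake_retrieve(question: str, k: int) -> list[str]:
    """Stand-in retriever returning canned placeholder chunks."""
    canned = {
        "What are PCU groups?": ["placeholder chunk mentioning billing"],
        "Explain hybrid search": ["placeholder chunk about drivers"],
    }
    return canned.get(question, [])[:k]


if __name__ == "__main__":
    print(f"hit rate: {hit_rate(CASES, fake_retrieve):.2f}")
```

Keyword matching is crude, but it is enough to compare two split settings on the same question set before and after a re-ingest.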

Metadata and refresh

- Source links in answers: chunk metadata (source_url, heading_path) would need ingest-time work; not in v1
- Refresh: changing split settings mid-life without a full rebuild mixes incompatible chunk granularities in one collection

Plan split experiments on a throwaway Astra collection before re-ingesting datastax_astra_docs.

Future improvements

- Metadata per chunk at split time
- Hybrid search in Astra for symbol-heavy queries
- Heading-aware splitting (custom component or preprocessor)
- Eval harness stored in repo (question → expected doc slug)

Next in the series

- Using Astra DB as the vector store: where chunks land and how search retrieves them
- Batch-ingesting markdown through Langflow: operational loop that writes chunks to Astra
- Re-ingesting when upstream docs change: when split or embedding settings change

Series index: Building Astra Docs Chat

Open Astra Docs Chat and ask a code-heavy question: if the answer’s sample looks truncated, chunk boundaries are a likely culprit.