“271 pages” is not “271 vectors.” Split settings and embedding model choice determine whether Astra Docs Chat retrieves the right paragraph when you ask about PCU groups, hybrid search, or collection APIs.
Context: Building Astra Docs Chat · Langflow ingest flow · Batch ingest
Try retrieval in production: Astra Docs Chat
Shape of the corpus ¶
Each file in my local pages/ export is trafilatura-extracted markdown from Astra DB Serverless docs:
- #/## heading hierarchy
- Fenced code blocks (API examples, cURL, driver snippets)
- Occasional tables and bullet lists
- Minimal nav boilerplate (nav stripped at extract time)
There is no per-file YAML front matter in v1. Page identity is the filename slug.
Chunking should respect headings and keep code blocks intact where possible. Splitting mid-fence produces retrieval hits that start with json or half a curl flag: useless context for the LLM.
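A quick way to spot mid-fence splits is to count fence marker lines per chunk: an odd count means the chunk opens or closes a code block without its partner. This is a minimal sketch; `has_unbalanced_fence` is a hypothetical helper, not part of Langflow, and it assumes fences are written with three backticks.

```python
FENCE = "`" * 3  # the markdown code-fence marker, built this way to keep the example readable

def has_unbalanced_fence(chunk: str) -> bool:
    """Return True when a chunk contains an odd number of fence lines,
    i.e. a code block was cut open by the splitter."""
    fence_lines = [
        line for line in chunk.splitlines()
        if line.lstrip().startswith(FENCE)
    ]
    return len(fence_lines) % 2 == 1

# Illustrative input: a small page with one JSON block, then the same page cut mid-fence.
whole = "Intro\n" + FENCE + "json\n{\"a\": 1}\n" + FENCE + "\n"
cut = whole[: whole.index("{")]

print(has_unbalanced_fence(whole), has_unbalanced_fence(cut))
```

Running this over every chunk of a page gives a rough count of how many hits would start or end inside a code block.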
SplitText in Langflow ¶
The ingest flow uses Langflow’s SplitText component between File and Embeddings. On the DataStax RAG template (cloned into SplitText-ingest), defaults are:
| Setting | Value |
|---|---|
| Chunk size | 1000 characters |
| Chunk overlap | 200 characters |
| Separator | newline (\n) |
| Splitter | CharacterTextSplitter |
File-ingest → SplitText-ingest → OpenAI Embeddings → AstraDB-ingest
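To get a feel for what these defaults do to a page, the splitting can be approximated in plain Python. This is a simplified sketch of separator-based splitting with overlap, not Langflow's or LangChain's actual implementation; `split_text` and the synthetic page below are illustrative assumptions.

```python
def split_text(text, chunk_size=1000, chunk_overlap=200, separator="\n"):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters, carrying up to chunk_overlap characters of
    trailing pieces into the next chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until the remainder fits the overlap budget
            # and still leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

# Synthetic page: 100 lines of 49 characters each.
page = "\n".join(f"line {i:03d} " + "x" * 40 for i in range(100))
chunks = split_text(page)
print(len(chunks), max(len(c) for c in chunks))
```

A single piece longer than chunk_size is emitted as an oversized chunk in this sketch; with a newline separator that mostly affects very long lines such as minified JSON.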
What each knob does ¶
| Setting | Trade-off |
|---|---|
| Chunk size | Larger → more context per hit; smaller → more precise retrieval |
| Chunk overlap | Reduces cuts across section boundaries; increases vector count and ingest cost |
| Separators | Newline splits are coarse; \n## or paragraph breaks can respect doc structure better |
There is no universal optimum. Evaluate on real questions from the starter prompts on /astra-chat.
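The cost side of overlap can be estimated before ingesting anything. The sketch below assumes a plain sliding window over characters, which only approximates separator-based splitting; `estimate_chunks` is a hypothetical helper for back-of-envelope sizing.

```python
import math

def estimate_chunks(page_chars: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> int:
    """Approximate chunk count for one page under a sliding window:
    each new chunk advances by (chunk_size - chunk_overlap) characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    return max(1, math.ceil((page_chars - chunk_overlap) / stride))

# A 10,000-character page with and without overlap.
with_overlap = estimate_chunks(10_000, 1000, 200)
without_overlap = estimate_chunks(10_000, 1000, 0)
print(with_overlap, without_overlap)
```

The difference between the two numbers is the extra embedding calls and stored vectors that overlap buys for that page.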
Code blocks and API reference pages ¶
Dense parameter lists (REST query params, driver method tables) often span fewer than 1000 characters but logically belong together. Overlap helps when a question matches text near a chunk boundary.
If retrieval consistently returns truncated API tables, try:
- Smaller chunk size with higher overlap on reference sections only (requires split strategy per doc type, not in v1)
- Hybrid search for exact symbol names (see the Astra vector store post)
Embedding model: text-embedding-3-small ¶
Ingest and query both use OpenAI text-embedding-3-small:
- Good cost/quality for English technical prose
- Supported in Langflow’s OpenAI Embeddings component on both ingest and chat paths
- Must match at query time and ingest time for meaningful similarity scores
The chat LLM is DeepSeek; choosing the embedding model independently of the chat model is normal. Do not swap the chat model and assume embeddings follow.
DeepSeek embeddings were explicitly out of scope for v1 (design spec deferred).
Changing the embedding model after ingest requires a full re-embed of the corpus (see the re-ingest post).
Pages vs chunks vs hits ¶
| Term | Meaning |
|---|---|
| Page | One markdown file ≈ one docs URL |
| Chunk | One SplitText output segment → one vector in Astra |
| Hit | Top-k chunks returned for a question (default 4 in chat flow) |
A long administration page may produce dozens of vectors. Retrieval quality is measured at hit level, not page count.
The batch script logs “Adding N documents to the Vector Store” per file: N is chunk count for that page, not 1.
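Summing those log lines gives the total chunk count for a run. The sketch below assumes the log is available as text and that the message wording matches the quote above; the sample lines and counts are invented for illustration, not real output.

```python
import re

# Matches the per-file message quoted above and captures N.
LOG_PATTERN = re.compile(r"Adding (\d+) documents to the Vector Store")

def total_chunks(log_text: str) -> int:
    """Sum the per-file chunk counts reported in an ingest log."""
    return sum(int(n) for n in LOG_PATTERN.findall(log_text))

# Illustrative log excerpt (made-up counts).
sample_log = "\n".join([
    "Adding 12 documents to the Vector Store",
    "some unrelated log line",
    "Adding 30 documents to the Vector Store",
])
print(total_chunks(sample_log))
```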
After a full ingest, the gap shows up in Astra DB directly. The datastax_astra_docs collection held 6174 records from 271 markdown pages when I spot-checked in Data Explorer: same corpus, many more vectors than files.
Each row is one chunk: content is the text SplitText produced, $vector is the embedding, and metadata.filename ties the chunk back to the source markdown file. The 1536 dimensions match OpenAI text-embedding-3-small at default width.
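The same numbers can be sanity-checked in a few lines. This is a sketch against plain dictionaries shaped like the rows described above; `check_row` is a hypothetical helper and the example row is fabricated for illustration, not fetched from Astra.

```python
EXPECTED_DIM = 1536  # text-embedding-3-small at default width

def check_row(row: dict) -> list[str]:
    """Return a list of problems found in one chunk row (empty list means OK)."""
    problems = []
    if not row.get("content"):
        problems.append("missing content")
    if len(row.get("$vector") or []) != EXPECTED_DIM:
        problems.append("unexpected vector dimension")
    if not (row.get("metadata") or {}).get("filename"):
        problems.append("missing metadata.filename")
    return problems

# Average chunks per page from the spot check above.
records, pages = 6174, 271
print(f"{records / pages:.1f} chunks per page on average")

# Fabricated row with the expected shape.
example_row = {
    "content": "Some chunk text",
    "$vector": [0.0] * EXPECTED_DIM,
    "metadata": {"filename": "example-page.md"},
}
print(check_row(example_row))
```

A dimension mismatch here is usually the first visible sign that ingest and query paths are not using the same embedding model.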
Quick eval method ¶
1. Pick 10 questions with known answers in specific doc sections
2. Run retrieval-only in Langflow Playground (inspect retrieved text before generation)
3. Change chunk size / overlap; re-ingest a subset with --limit 3 (batch post)
4. Repeat step 2
5. Note where code blocks truncate or headings split awkwardly
Log failures: they often cluster on API reference pages with dense parameter lists and multi-language code tabs (extracted as sequential fences).
Example eval questions:
| Question | Good hit contains |
|---|---|
| What are PCU groups? | Billing / plan administration wording |
| How do I create a collection? | Quickstart or collections API steps |
| Explain hybrid search | Vector + keyword search explanation |
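A table like this can be turned into a small retrieval check. The sketch below assumes a `retrieve(question, k)` callable that returns the text of the top-k chunks; that callable, the `EvalCase` type, and the expected terms are all hypothetical stand-ins, and the stub retriever returns canned text rather than querying Astra.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    question: str
    expected_terms: list[str]  # a hit containing any of these counts as a pass

def hit_rate(cases: list[EvalCase],
             retrieve: Callable[[str, int], list[str]],
             k: int = 4) -> float:
    """Fraction of questions whose top-k hits contain an expected term."""
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        text = "\n".join(retrieve(case.question, k)).lower()
        if any(term.lower() in text for term in case.expected_terms):
            passed += 1
    return passed / len(cases)

# Stub retriever with canned hits, standing in for the real flow.
def fake_retrieve(question: str, k: int) -> list[str]:
    canned = {
        "What are PCU groups?": ["PCU groups appear under billing and plan administration."],
        "Explain hybrid search": ["Unrelated text about regions."],
    }
    return canned.get(question, [])[:k]

cases = [
    EvalCase("What are PCU groups?", ["billing"]),
    EvalCase("Explain hybrid search", ["keyword"]),
]
print(hit_rate(cases, fake_retrieve))
```

Substring matching is crude, but it is enough to see whether a change in chunk size or overlap moves the number in the right direction.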
Metadata and refresh ¶
- Source links in answers: chunk metadata (source_url, heading_path) would need ingest-time work; not in v1
- Refresh: changing split settings mid-life without full rebuild mixes incompatible chunk granularities in one collection
Plan split experiments on a throwaway Astra collection before re-ingesting datastax_astra_docs.
Future improvements ¶
- Metadata per chunk at split time
- Hybrid search in Astra for symbol-heavy queries
- Heading-aware splitting (custom component or preprocessor)
- Eval harness stored in repo (question → expected doc slug)
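As a starting point for the heading-aware item, here is a minimal preprocessor sketch that cuts a page at # and ## headings while ignoring lines inside fenced code blocks. It is not implemented in v1; `split_by_heading` and the sample page are assumptions for illustration, and sections longer than the chunk size would still need a second-stage split.

```python
FENCE = "`" * 3  # markdown code-fence marker

def split_by_heading(markdown: str) -> list[str]:
    """Split markdown into sections that each start at a # or ## heading.
    Lines inside fenced code blocks never start a new section."""
    sections = []
    current = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        is_heading = not in_fence and line.startswith(("# ", "## "))
        if is_heading and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

# Sample page: the shell comment inside the fence must not be treated as a heading.
sample_page = "\n".join([
    "# Title",
    "intro",
    "## Create",
    FENCE + "bash",
    "# not a heading",
    "curl ...",
    FENCE,
    "## Next",
    "text",
])
sections = split_by_heading(sample_page)
print(len(sections))
```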
Next in the series ¶
- Using Astra DB as the vector store: where chunks land and how search retrieves them
- Batch-ingesting markdown through Langflow: operational loop that writes chunks to Astra
- Re-ingesting when upstream docs change: when split or embedding settings change
Series index: Building Astra Docs Chat
Open Astra Docs Chat and ask a code-heavy question: if the answer’s sample looks truncated, chunk boundaries are a likely culprit.