Batch-ingesting hundreds of markdown files through Langflow

Astra Docs Chat only works if the documentation already lives in a vector store. The parent post summarised that as a one-time batch job: export markdown, run an ingest flow per file, resume on failure. This post is the Langflow API pattern behind that summary.

The crawl and batch scripts live on my machine only (not published as a repo). What follows is enough to build your own ingest loop; filenames and flags match what I run locally.

If you have not read the overview yet, start here: Building Astra Docs Chat.

The edge proxy that serves chat is covered in Proxying Langflow from Cloudflare Pages Functions.

Try the live chat: Astra Docs Chat

The problem

Langflow’s ingest path is designed around a File component: upload a document, split it, embed chunks, write vectors to Astra DB. That works beautifully in the UI for one file at a time.

Astra Docs Chat needed 271 markdown pages from the public Astra DB Serverless documentation. Clicking through the Langflow UI 271 times was not a plan. The batch script wraps Langflow’s HTTP API so each file is upload → run ingest → record state, with resume after Ctrl+C or network blips.

pages/*.md
    → POST /api/v2/files/
    → POST /api/v1/run/datastax-astra-ingest?stream=false
         (tweak: File-ingest path = uploaded file)
    → Astra DB collection datastax_astra_docs

Chat uses a different flow endpoint (datastax-astra-chat). Ingest and chat are separate graphs on the same Langflow instance. That split keeps the public surface minimal: visitors never trigger file upload or embedding. See Langflow RAG over Astra DB: ingest and chat flows for what happens inside those graphs.

Corpus source

The markdown files sit in a local pages/ folder on my machine. I produced them with trafilatura against the public docs site: one .md file per page, stripped of nav and footer chrome. You can use any export pipeline; the Langflow steps below only care that you have pages/*.md.

Filename convention maps URL paths to slugs:

administration/subscription-plans.html
  → pages/administration_subscription-plans.md
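
A minimal sketch of that mapping, assuming the only transformations are dropping the .html suffix and joining path segments with underscores. The helper name url_path_to_filename is hypothetical, not taken from the local script:

```python
from pathlib import PurePosixPath


def url_path_to_filename(url_path: str, out_dir: str = "pages") -> str:
    """Map a docs URL path to a flat markdown filename.

    Assumption: segments are joined with "_" and ".html" becomes ".md".
    """
    path = PurePosixPath(url_path.strip("/"))
    # Drop the extension, then flatten the directory separators.
    slug = "_".join(path.with_suffix("").parts)
    return f"{out_dir}/{slug}.md"


print(url_path_to_filename("administration/subscription-plans.html"))
```

Flattening keeps every page in one directory, so a plain pages/*.md glob finds the whole corpus.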

Pages are plain markdown (headings, lists, fenced code, tables). There is no YAML front matter with source_url in v1; page identity is implicit in the filename. Chunk metadata at ingest time would be a separate improvement; see chunking technical docs for RAG.

271 files at ingest time. File count can drift if you re-crawl; the batch script processes whatever is in pages/*.md.

Crawling and extracting markdown is a separate local step I run before ingest. This post focuses on the Langflow loop only.

Environment and configuration

The script reads credentials from the environment and exits immediately if the API key is missing:

Variable	Default / role
`LANGFLOW_URL`	Private Langflow base URL (no trailing slash)
`LANGFLOW_API_KEY`	Required
`LANGFLOW_FLOW_ENDPOINT`	`datastax-astra-ingest`
`LANGFLOW_FILE_COMPONENT`	`File-ingest` (tweak target)
`LANGFLOW_ASTRA_COMPONENT`	`AstraDB-ingest` (debug output component)

Hard-coded defaults in my copy of the script point at a private Langflow instance; override with env vars on your machine.
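
One way to read that configuration, sketched under assumptions: load_config and the dictionary keys are hypothetical names, and the URL default is a placeholder rather than a real instance.

```python
import os
import sys
from typing import Mapping


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read Langflow settings from the environment, applying the defaults above."""
    api_key = env.get("LANGFLOW_API_KEY")
    if not api_key:
        # Fail fast: nothing else works without the key.
        sys.exit("LANGFLOW_API_KEY is not set")
    return {
        # Placeholder default (assumption); point this at your own instance.
        "url": env.get("LANGFLOW_URL", "https://your-langflow.example").rstrip("/"),
        "api_key": api_key,
        "flow_endpoint": env.get("LANGFLOW_FLOW_ENDPOINT", "datastax-astra-ingest"),
        "file_component": env.get("LANGFLOW_FILE_COMPONENT", "File-ingest"),
        "astra_component": env.get("LANGFLOW_ASTRA_COMPONENT", "AstraDB-ingest"),
    }
```

Stripping the trailing slash in one place means later URL joins can always write f"{url}/api/...".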

Per-file loop

For each *.md file in pages/, in sorted order:

Skip if already ingested: ingest_state.json records success by file path
Skip if too small: files under 100 bytes are not worth embedding
Upload: multipart POST to /api/v2/files/ with x-api-key
Run ingest: POST to /api/v1/run/{endpoint}?stream=false with a tweak on the File component
Verify response text: look for Langflow’s “Adding N documents to the Vector Store” message
Save state: atomic write to ingest_state.json after every file so Ctrl+C is safe
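
The two skip checks and the state write could look roughly like this. The names skip_reason and save_state are hypothetical, and writing a temp file then renaming it is one common way to get an atomic write, assumed here rather than confirmed from the script:

```python
import json
import os
import tempfile
from pathlib import Path

MIN_BYTES = 100  # files smaller than this are not worth embedding


def skip_reason(path: Path, state: dict):
    """Return why a file should be skipped, or None if it should be ingested."""
    if state.get(str(path), {}).get("status") == "ingested":
        return "already ingested"
    if path.stat().st_size < MIN_BYTES:
        return "too small"
    return None


def save_state(state: dict, state_path: Path) -> None:
    """Write state atomically: temp file in the same directory, then rename."""
    fd, tmp_name = tempfile.mkstemp(dir=state_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(state, tmp, indent=2, sort_keys=True)
    # os.replace is atomic when source and target share a filesystem.
    os.replace(tmp_name, state_path)
```

Because the rename either happens or does not, an interrupt mid-write leaves the previous ingest_state.json intact instead of a truncated file.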

Upload

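import requests

session = requests.Session()

# file_path is a pathlib.Path for the current page; keep the handle open
# for the duration of the request so requests can stream the body.
with file_path.open("rb") as fh:
    files = {"file": (file_path.name, fh, "text/markdown")}
    resp = session.post(
        f"{LANGFLOW_URL}/api/v2/files/",
        headers={"x-api-key": LANGFLOW_API_KEY},
        files=files,
        timeout=60,
    )
resp.raise_for_status()
body = resp.json()  # parse once, then accept either key for the stored path
uploaded = body.get("path") or body.get("file_path")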

Run ingest with tweaks

The ingest flow expects the File component’s path to point at the uploaded file. Langflow’s tweak API sets that per run without editing the saved graph:

{
  "output_type": "debug",
  "output_component": "AstraDB-ingest",
  "tweaks": {
    "File-ingest": { "path": ["7b90824f-.../administration_audit-log.md"] }
  }
}
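
Building that body per file is a small function. In this sketch build_ingest_payload and run_url are hypothetical helper names; the component ids and endpoint shape come from the configuration and diagram above:

```python
def build_ingest_payload(
    uploaded_path: str,
    file_component: str = "File-ingest",
    output_component: str = "AstraDB-ingest",
) -> dict:
    """Request body that points the File component at one uploaded file."""
    return {
        "output_type": "debug",
        "output_component": output_component,
        # The tweak overrides the component field for this run only.
        "tweaks": {file_component: {"path": [uploaded_path]}},
    }


def run_url(base_url: str, endpoint: str) -> str:
    """URL of the non-streaming run endpoint for a named flow."""
    return f"{base_url}/api/v1/run/{endpoint}?stream=false"
```

The dict goes in the JSON body of the POST, sent with the same x-api-key header as the upload.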

Success detection is string-based on the response body:

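OK: the body contains both "Adding " and "documents to the Vector Store" (the two halves of "Adding N documents to the Vector Store")
Fail: the body contains "No documents to add to the Vector Store" or "error":true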

That is brittle but practical: you get a clear pass/fail per file in CI logs without parsing Langflow’s full debug JSON.
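
A sketch of that check as a function. classify_response is a hypothetical name, and also tolerating a space after the colon in the error marker is an assumption that goes beyond the strings listed above:

```python
from typing import Tuple

FAIL_MARKERS = (
    "No documents to add to the Vector Store",
    '"error":true',
    '"error": true',  # assumption: tolerate pretty-printed JSON as well
)


def classify_response(body: str) -> Tuple[bool, str]:
    """Return (ok, reason) based on substrings of the raw response text."""
    # Check failure markers first so an error never counts as success.
    for marker in FAIL_MARKERS:
        if marker in body:
            return False, marker
    if "Adding " in body and "documents to the Vector Store" in body:
        return True, "documents added"
    return False, "no success message in response"
```

Treating a missing success message as a failure is the conservative default: an unexpected response shape shows up in the failure log instead of passing silently.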

Retries and failure log

Each file gets one retry after a 2-second pause. Persistent failures append to ingest_failed.log (path<TAB>reason) and mark the file "failed" in state.

Re-run with --retry-failed to attempt failed entries again without re-processing successes.
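
The retry policy can be sketched as a wrapper around a single attempt. Here ingest_with_retry and the ingest_once callback (returning an (ok, reason) pair) are hypothetical, and the state entries are simplified to a status field:

```python
import time
from pathlib import Path


def ingest_with_retry(path, ingest_once, state, failed_log=Path("ingest_failed.log"),
                      pause=2.0, sleep=time.sleep):
    """Try a file twice; record the outcome in state and the failure log."""
    reason = "unknown"
    for attempt in range(2):  # first try plus one retry
        if attempt:
            sleep(pause)
        try:
            ok, reason = ingest_once(path)
        except Exception as exc:  # network errors count as a failed attempt
            ok, reason = False, f"{type(exc).__name__}: {exc}"
        if ok:
            state[str(path)] = {"status": "ingested"}
            return True
    # Both attempts failed: append path<TAB>reason and mark the file failed.
    with failed_log.open("a", encoding="utf-8") as log:
        log.write(f"{path}\t{reason}\n")
    state[str(path)] = {"status": "failed"}
    return False
```

Passing sleep in as a parameter keeps the pause out of the way when exercising the function without a live Langflow instance.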

CLI usage

From the directory where you keep pages/*.md and your batch script:

export LANGFLOW_API_KEY=...
export LANGFLOW_URL=https://your-langflow.example

# Dry run on three files
python ingest_langflow.py --limit 3

# Full corpus
python ingest_langflow.py

# Retry only failed
python ingest_langflow.py --retry-failed

# Optional: create/update ingest-only flow via Langflow API first
python ingest_langflow.py --ensure-flow --limit 3

Examples use the filename I gave my local script; yours can differ. The --ensure-flow flag runs a small setup helper that clones the File → SplitText → Embeddings → AstraDB chain from the main RAG flow into a dedicated ingest endpoint. Handy on a fresh Langflow instance where only the chat flow exists.
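
If you are writing your own script, the three flags map onto a short argparse setup. This is a sketch: the help strings and the parse_args wrapper are illustrative, not copied from the local script.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in the examples above."""
    parser = argparse.ArgumentParser(
        description="Batch-ingest pages/*.md through Langflow"
    )
    parser.add_argument("--limit", type=int, default=None,
                        help="process at most N files")
    parser.add_argument("--retry-failed", action="store_true",
                        help="re-attempt files marked failed in the state file")
    parser.add_argument("--ensure-flow", action="store_true",
                        help="create or update the ingest flow before running")
    return parser.parse_args(argv)
```

argparse turns the dashes into underscores, so the flags arrive as args.retry_failed and args.ensure_flow.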

Throughput and timing

The script processes files sequentially, one ingest run per file, and waits for each run to finish (300-second timeout per run). Embedding API calls dominate the wall-clock time.

That is acceptable for a one-time load on a personal reference tool. It is not a continuous pipeline. When upstream docs change, see re-ingesting when docs change.

Pre-flight checklist

Before the full run:

Astra collection exists: datastax_astra_docs (empty preferred on first run; appends otherwise)
Langflow global variables hold OpenAI and Astra credentials: not hard-coded in component fields
Spot-check with --limit 3: then open Langflow Playground and ask something only those pages would answer
Confirm ingest endpoint matches your flow’s published name (datastax-astra-ingest)

After the full run, spot-check five questions in Playground that map to known doc sections (PCU groups, collection creation, hybrid search).

State files

File	Purpose
`ingest_state.json`	Per-file status: `ingested`, `failed`, `skipped`
`ingest_failed.log`	Append-only log of failures for manual inspection
`.ingest_flow_id`	Written by `--ensure-flow`: tracks the ingest flow id

Example state entry:

{
  "pages/administration_audit-log.md": {
    "status": "ingested",
    "uploaded_path": "7b90824f-.../administration_audit-log (17).md"
  }
}

To force re-ingest of a single file, remove its entry from ingest_state.json and re-run. The script does not hash content on re-run: changed markdown without a state reset will be skipped.
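
Removing an entry by hand works; a small helper does the same thing with less risk of breaking the JSON. In this sketch forget is a hypothetical name, and it assumes the state file is a flat object keyed by file path, as in the example entry:

```python
import json
from pathlib import Path


def forget(page: str, state_path: Path = Path("ingest_state.json")) -> bool:
    """Drop one page from the state file so the next run ingests it again."""
    state = json.loads(state_path.read_text(encoding="utf-8"))
    removed = state.pop(page, None) is not None
    if removed:
        state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return removed
```

For example, forget("pages/administration_audit-log.md") followed by a normal run re-uploads and re-ingests just that page.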

What this script deliberately does not do

Crawl or extract docs: a separate local crawl step (trafilatura against docs.datastax.com), not covered here
Serve chat: that is the Pages Function + chat flow (proxy post)
Dedupe by content hash on re-run: skipped files are keyed by path and ingested status only

Chunk sizing and embedding choices affect retrieval quality; see chunking and embedding technical docs for RAG.

Next in the series

Proxying Langflow from Cloudflare Pages Functions: the edge layer this ingest pipeline feeds (trilogy part 1)
Langflow RAG over Astra DB: ingest and chat flows: what happens inside the graphs this script triggers
Chunking and embedding technical docs for RAG: split settings that determine chunk count per file
Building a streaming chat UI in Hugo: the browser side once vectors exist (trilogy part 3)

Series index: Building Astra Docs Chat

Open Astra Docs Chat and ask something you would otherwise search the docs for.