This is the full guide I used to build a domain-specific AI chatbot that answers questions as me — drawing only from my own engineering notes, papers, and writing. No OpenAI key needed. No server to manage. Everything runs on Cloudflare's free tier.
Follow along and you'll have your own by the end. The only prerequisite is a Cloudflare account and Node.js installed.
A chatbot that only answers questions about your knowledge base. If someone asks something outside your domain, it says so politely. Under the hood, every query goes through a 7-step pipeline before the LLM ever sees it — making it significantly more accurate than a basic RAG implementation.
Each query passes through 7 stages. The key insight: we embed a hypothetical answer (HyDE) rather than the raw question, then merge vector and keyword results before a cross-encoder re-scores them for true relevance.
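In sketch form, with placeholder helper names (the real implementations come in Step 10):

// The seven stages on every query — helper names here are illustrative
const [hydeAnswer, ftsMatches] = await Promise.all([
  generateHyDE(question, history), // 1. hypothetical answer (history-aware)
  queryFTS(question, history),     // 2. keyword search, run in parallel
]);
const queryVector = await embed(hydeAnswer);        // 3. embed the HyDE answer
const vectorHits = await vectorSearch(queryVector); // 4. semantic retrieval
const merged = rrfFusion(vectorHits, ftsMatches);   // 5. reciprocal rank fusion
const top = await rerank(question, merged);         // 6. cross-encoder re-scoring
const answer = await generateAnswer(question, top, history); // 7. final LLM call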
Create a Cloudflare account at cloudflare.com if you don't have one. Then install Node.js (v18+) from nodejs.org.
Install Wrangler — Cloudflare's CLI — then log in:
npm install -g wrangler
wrangler login
A browser window will open asking you to allow Wrangler access. Click Allow. The terminal will confirm: Successfully logged in.
mkdir my-rag-bot
cd my-rag-bot
Inside the project folder, initialize a basic Worker. We'll replace the generated files with our full implementation later.
wrangler init --yes
This creates wrangler.jsonc and src/index.ts. Both get replaced in the steps below.
This is your vector database — it stores 768-dimensional embeddings of your knowledge base so the system can retrieve semantically relevant chunks at query time. The dimension count must match your embedding model: bge-base-en-v1.5, used throughout this guide, outputs 768-dimensional vectors.
npx wrangler vectorize create my-knowledge-index --dimensions=768 --metric=cosine
If Wrangler asks "Would you like Wrangler to add it to your wrangler.jsonc?" — press n. You'll add it manually in Step 6.
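For orientation, each record the ingestion script (Step 10) will upsert into this index has the following shape (values truncated, text invented for illustration):

// One Vectorize record: an id, 768 floats, and the chunk text as metadata
{
  id: 'chunk-0',
  values: [0.021, -0.084 /* ...768 floats total */],
  metadata: { text: 'This chunk covers HMPE mooring lines... The breaking load is 450 kN.' }
}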
This SQLite database handles keyword (FTS5) search — it runs in parallel with vector search and catches exact term matches that embedding similarity sometimes misses: technical acronyms, model numbers, specific names.
npx wrangler d1 create chunks-fts
The command prints a database UUID. Copy it — you'll paste it into wrangler.jsonc in the next step.
Replace the entire contents of wrangler.jsonc with the configuration below. Replace YOUR_D1_DATABASE_ID with the UUID you copied in Step 5.
{
"$schema": "node_modules/wrangler/config-schema.json",
"name": "my-rag-bot",
"main": "src/index.ts",
"compatibility_date": "2026-05-15",
"compatibility_flags": [
"nodejs_compat",
"global_fetch_strictly_public"
],
"observability": { "enabled": true },
"upload_source_maps": true,
"ai": { "binding": "AI" },
"vectorize": [
{
"binding": "VECTORIZE",
"index_name": "my-knowledge-index"
}
],
"d1_databases": [
{
"binding": "DB",
"database_name": "chunks-fts",
"database_id": "YOUR_D1_DATABASE_ID"
}
]
}
This is the content your chatbot draws from. It can be your CV, papers, articles, technical notes — anything you want it to know about.
Save your content as plain text in a file called knowledge.txt in the project root (for PDFs, use pdftotext or simple copy-paste). The chunker splits on blank lines between paragraphs — so structure your content with clear paragraph breaks between topics.
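For example, a well-structured knowledge.txt looks like this (content invented for illustration):

The FloatMast TLP study evaluated HMPE mooring lines under extreme storm loading. Testing showed a breaking load of 450 kN per line.

The CFD work focused on hull hydrodynamics, estimating drag and wave loads on the floating platform across a range of sea states.

Each paragraph becomes roughly one chunk, so keep each one focused on a single topic.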
Go to dash.cloudflare.com → your Account ID appears in the right sidebar on the home page.
Create an API token at dash.cloudflare.com/profile/api-tokens (name it something like rag-bot-ingest). It needs permission to run Workers AI models and to write to your Vectorize index. Copy the API token immediately after creation — you will not be able to see it again after leaving the page.
Create ingest.mjs in your project root. Replace YOUR_ACCOUNT_ID and YOUR_API_TOKEN with your values from Step 8.
This script does four things:
1. Splits knowledge.txt into 800-character chunks with 120-character overlap
2. Asks a fast LLM to describe what each chunk covers and where it fits in the document, then prepends that description to the chunk (contextual retrieval)
3. Embeds each contextualized chunk and uploads the vectors to Vectorize in batches
4. Writes a chunks.sql file for loading into D1
Why the contextualization step matters: Raw chunk: "The breaking load is 450 kN." — meaningless without context. After contextualization: "This chunk covers the breaking load of HMPE mooring lines in the FloatMast TLP study. The breaking load is 450 kN." — retrieves correctly.
// ingest.mjs
import { readFileSync, writeFileSync } from 'fs';
// ── Config ────────────────────────────────────────────────────────────────────
const ACCOUNT_ID = 'YOUR_ACCOUNT_ID';
const API_TOKEN = 'YOUR_API_TOKEN';
const INDEX_NAME = 'my-knowledge-index';
const EMBED_MODEL = '@cf/baai/bge-base-en-v1.5';
const CONTEXT_LLM = '@cf/meta/llama-3.1-8b-instruct';
const TEXT_FILE = './knowledge.txt';
const TARGET_CHUNK = 800; // characters per chunk (~150 tokens)
const OVERLAP_CHARS = 120; // overlap between chunks (~15%)
const DOC_SUMMARY_CHARS = 3000; // characters used as context anchor for the LLM
// ── Chunking ──────────────────────────────────────────────────────────────────
function chunkText(text) {
const paragraphs = text.split(/\n\s*\n/).filter(p => p.trim().length > 0);
const chunks = [];
for (const para of paragraphs) {
const clean = para.replace(/\s+/g, ' ').trim();
if (clean.length < 80) continue;
if (clean.length <= TARGET_CHUNK) {
chunks.push(clean);
continue;
}
// Long paragraph: split by sentences, carry overlap into next chunk
const sentences = clean.match(/[^.!?]+[.!?]+/g) ?? [clean];
let current = '';
for (const s of sentences) {
if (current.length + s.length > TARGET_CHUNK) {
if (current) chunks.push(current.trim());
const words = current.split(' ');
const overlap = words.slice(-Math.round(OVERLAP_CHARS / 5)).join(' ');
current = (overlap + ' ' + s.trim()).trim();
} else {
current += (current ? ' ' : '') + s.trim();
}
}
if (current.trim().length > 80) chunks.push(current.trim());
}
return chunks;
}
// ── Contextual retrieval ──────────────────────────────────────────────────────
async function addContext(docSummary, chunk) {
const prompt =
`<document_summary>\n${docSummary}\n</document_summary>\n\n` +
`<chunk>\n${chunk}\n</chunk>\n\n` +
`In 1-2 sentences, describe what concept this chunk covers and where it ` +
`fits in the document. Reply with ONLY the description, no preamble.`;
try {
const res = await fetch(
`https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/${CONTEXT_LLM}`,
{
method: 'POST',
headers: { Authorization: `Bearer ${API_TOKEN}`, 'Content-Type': 'application/json' },
body: JSON.stringify({ messages: [{ role: 'user', content: prompt }], max_tokens: 80, temperature: 0.0 }),
}
);
const json = await res.json();
const ctx = json?.result?.response?.trim();
return ctx ? `${ctx}\n\n${chunk}` : chunk;
} catch {
return chunk; // fall back to raw chunk if LLM call fails
}
}
// ── Embedding ─────────────────────────────────────────────────────────────────
async function getEmbedding(text) {
const res = await fetch(
`https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/${EMBED_MODEL}`,
{
method: 'POST',
headers: { Authorization: `Bearer ${API_TOKEN}`, 'Content-Type': 'application/json' },
body: JSON.stringify({ text: [text] }),
}
);
const json = await res.json();
if (!json.success) throw new Error(`Embedding failed: ${JSON.stringify(json.errors)}`);
return json.result.data[0];
}
// ── Vectorize upload ──────────────────────────────────────────────────────────
async function upsertVectors(vectors) {
const res = await fetch(
`https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/vectorize/v2/indexes/${INDEX_NAME}/upsert`,
{
method: 'POST',
headers: { Authorization: `Bearer ${API_TOKEN}`, 'Content-Type': 'application/json' },
body: JSON.stringify({ vectors }),
}
);
const json = await res.json();
if (!json.success) throw new Error(`Vectorize upsert failed: ${JSON.stringify(json.errors)}`);
}
// ── D1 SQL builder ────────────────────────────────────────────────────────────
const sqlStatements = [];
function escapeSql(str) { return str.replace(/'/g, "''"); }
function d1AddStatement(sql) { sqlStatements.push(sql); }
function d1AddInsert(id, text) {
sqlStatements.push(`INSERT INTO chunks(id, text) VALUES ('${escapeSql(id)}', '${escapeSql(text)}');`);
}
function writeSqlFile() { writeFileSync('./chunks.sql', sqlStatements.join('\n'), 'utf-8'); }
// ── Main ──────────────────────────────────────────────────────────────────────
async function main() {
console.log('Reading knowledge file...');
const text = readFileSync(TEXT_FILE, 'utf-8');
console.log(` ${text.length} characters loaded`);
const docSummary = text.slice(0, DOC_SUMMARY_CHARS);
const rawChunks = chunkText(text);
console.log(` Split into ${rawChunks.length} chunks`);
d1AddStatement('DROP TABLE IF EXISTS chunks;');
d1AddStatement('CREATE VIRTUAL TABLE chunks USING fts5(id UNINDEXED, text);');
console.log('\nContextualizing + embedding + uploading...');
const vectorBatch = [];
for (let i = 0; i < rawChunks.length; i++) {
process.stdout.write(`\r [${i + 1}/${rawChunks.length}] processing...`);
const id = `chunk-${i}`;
const contextualized = await addContext(docSummary, rawChunks[i]);
const embedding = await getEmbedding(contextualized);
vectorBatch.push({ id, values: embedding, metadata: { text: contextualized } });
d1AddInsert(id, contextualized);
if (vectorBatch.length >= 20 || i === rawChunks.length - 1) {
process.stdout.write(' - uploading batch...');
await upsertVectors([...vectorBatch]);
vectorBatch.length = 0;
}
}
writeSqlFile();
console.log(`\n\nDone! ${rawChunks.length} chunks uploaded to Vectorize.`);
console.log(`\nNext: npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql`);
}
main().catch(err => { console.error('\nError:', err.message); process.exit(1); });
node ingest.mjs
Expected output:
Reading knowledge file...
62475 characters loaded
Split into 111 chunks
Contextualizing + embedding + uploading...
[111/111] processing... - uploading batch...
Done! 111 chunks uploaded to Vectorize.
Next: npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql
This takes several minutes. Every chunk is processed by the context LLM before embedding — that's the contextual retrieval step at work. It only needs to run once (or when you update your knowledge base).
This creates the FTS5 full-text search table in your D1 database and inserts all chunks. The Worker queries this alongside Vectorize on every request.
npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql
You'll see: Successfully executed 113 commands
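To spot-check the keyword index, run a query directly against the table (substitute any term that actually appears in your content):
npx wrangler d1 execute chunks-fts --remote --command "SELECT id FROM chunks WHERE chunks MATCH 'mooring' ORDER BY rank LIMIT 3"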
Replace everything in src/index.ts with the code below. On every query the Worker runs: HyDE + FTS in parallel → embed → vector search → RRF fusion → reranking → Llama 3.3 70B answer, with full conversation history injected at each step.
Key upgrades over a basic chatbot: both HyDE and FTS are history-aware — follow-up questions like "and hydraulics?" automatically inherit topic context from the previous exchange. The system prompt handles greetings naturally and resolves references to "the author" to your name. Scope decisions are driven by retrieved context only, never by topic name.
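For reference, this is the request shape the Worker accepts; the history array is optional, and the values here are illustrative:

{
  "question": "and hydraulics?",
  "history": [
    { "role": "user", "content": "What CFD work have you done?" },
    { "role": "assistant", "content": "My CFD work covers hull hydrodynamics..." }
  ]
}

With that history attached, both HyDE and FTS interpret the follow-up in the context of the earlier CFD question.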
export interface Env {
AI: any;
VECTORIZE: any;
DB: any;
}
interface HistoryMessage {
role: 'user' | 'assistant';
content: string;
}
const EMBED_MODEL = '@cf/baai/bge-base-en-v1.5';
const LLM_FAST = '@cf/meta/llama-3.1-8b-instruct';
const LLM_MAIN = '@cf/meta/llama-3.3-70b-instruct-fp8-fast';
const RERANKER = '@cf/baai/bge-reranker-base';
function sanitizeFtsQuery(q: string): string {
return q
.replace(/[^a-zA-Z0-9\s]/g, ' ')
.split(/\s+/)
.filter(t => t.length > 2)
.join(' ');
}
// HyDE: generate a hypothetical answer using recent history for context.
// Short follow-ups like "and hydraulics?" use history to know the topic is CFD.
// Embedding the hypothetical answer instead of the raw question bridges the
// vocabulary gap between short queries and longer technical document chunks.
async function generateHyDE(env: Env, question: string, history: HistoryMessage[]): Promise<string> {
try {
const r = await env.AI.run(LLM_FAST, {
messages: [
{
role: 'system',
content: 'You are an offshore and marine engineering expert. Write a 2-3 sentence technical answer using specific domain terminology.',
},
...history.slice(-4),
{ role: 'user', content: question },
],
max_tokens: 150,
temperature: 0.1,
});
return (r as any).response?.trim() || question;
} catch {
return question;
}
}
// FTS keyword search via D1 SQLite FTS5.
// Expands the query with the previous user message so short follow-ups
// like "only Naval or hydraulics also?" inherit the parent topic from history.
async function queryFTS(
env: Env,
question: string,
history: HistoryMessage[]
): Promise<Array<{ id: string; text: string }>> {
try {
const lastUserMsg = history.filter(h => h.role === 'user').slice(-1)[0]?.content ?? '';
const expanded = lastUserMsg ? `${lastUserMsg} ${question}` : question;
const q = sanitizeFtsQuery(expanded);
if (!q) return [];
const result = await env.DB.prepare(
'SELECT id, text FROM chunks WHERE chunks MATCH ?1 ORDER BY rank LIMIT 15'
)
.bind(q)
.all<{ id: string; text: string }>();
return result.results ?? [];
} catch {
return [];
}
}
// Reciprocal Rank Fusion: merges two ranked lists without needing calibrated scores.
// score(d) = sum of 1/(60 + rank) across both lists.
function rrfFusion(
vectorMatches: Array<{ id: string; text: string }>,
ftsMatches: Array<{ id: string; text: string }>,
k = 60
): Array<{ id: string; text: string; score: number }> {
const map = new Map<string, { text: string; score: number }>();
for (let i = 0; i < vectorMatches.length; i++) {
const { id, text } = vectorMatches[i];
const e = map.get(id) ?? { text, score: 0 };
e.score += 1 / (k + i + 1);
map.set(id, e);
}
for (let i = 0; i < ftsMatches.length; i++) {
const { id, text } = ftsMatches[i];
const e = map.get(id) ?? { text, score: 0 };
e.score += 1 / (k + i + 1);
map.set(id, e);
}
return Array.from(map.entries())
.map(([id, v]) => ({ id, text: v.text, score: v.score }))
.sort((a, b) => b.score - a.score);
}
// Cross-encoder reranker: reads the question AND each chunk together.
// Much more accurate than cosine similarity. Retrieve 15, rerank top 10, keep 5.
async function rerank(
env: Env,
question: string,
candidates: Array<{ id: string; text: string; score: number }>
): Promise<Array<{ id: string; text: string }>> {
if (candidates.length === 0) return [];
try {
const result = await env.AI.run(RERANKER, {
query: question,
contexts: candidates.map(c => ({ text: c.text })),
});
const scores = (result as any).data as Array<{ score: number }>;
return candidates
.map((c, i) => ({ ...c, rerankScore: scores[i]?.score ?? 0 }))
.sort((a, b) => b.rerankScore - a.rerankScore)
.slice(0, 5)
.map(({ id, text }) => ({ id, text }));
} catch {
return candidates.slice(0, 5).map(({ id, text }) => ({ id, text }));
}
}
// Scope decisions are driven by CONTEXT only — never by topic name.
const SYSTEM_PROMPT = `You are Dimitrios Tsakalomatis — engineer, researcher, and the person whose knowledge base you draw from. Your background spans offshore structures, marine engineering, hydraulics, CFD, Python for engineering, AI/ML, and academic research.
IDENTITY RULES:
- When the CONTEXT refers to "the author", "the researcher", or "the engineer", that person is you — Dimitrios Tsakalomatis.
- For greetings and casual conversation ("hi", "how are you"), respond naturally and warmly. You do not need CONTEXT for this.
- For personal questions not in the CONTEXT, respond graciously that you prefer to keep personal life separate from professional work.
ANSWER RULES:
1. If the CONTEXT contains information relevant to the question, use it to answer — regardless of domain. If it is in the CONTEXT, answer it.
2. Never copy text fragments verbatim. Always compose a complete, natural response in your own words.
3. Use the CONVERSATION HISTORY to understand follow-up questions.
4. Only if the CONTEXT does not contain relevant information, respond EXACTLY: "That's not something I have documented — a general assistant will serve you better for that."
5. Never guess or use outside knowledge for technical facts.
6. Answer concisely, technically, and directly.`;
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const url = new URL(request.url);
if (request.method === 'OPTIONS') {
return new Response(null, {
headers: {
'Access-Control-Allow-Origin': '*',
'Access-Control-Allow-Methods': 'POST, OPTIONS',
'Access-Control-Allow-Headers': 'Content-Type',
},
});
}
if (url.pathname === '/query' && request.method === 'POST') {
const body = (await request.json()) as { question: string; history?: HistoryMessage[] };
const { question } = body;
const history: HistoryMessage[] = body.history ?? [];
if (!question) {
return new Response(JSON.stringify({ answer: 'Please provide a question.' }), { status: 400, headers: { 'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*' } });
}
// HyDE + FTS in parallel, both history-aware
const [hydeAnswer, ftsMatches] = await Promise.all([
generateHyDE(env, question, history),
queryFTS(env, question, history),
]);
// Embed the HyDE answer (not the raw question)
const embeddingResponse = await env.AI.run(EMBED_MODEL, { text: [hydeAnswer] });
const queryVector = (embeddingResponse as any).data[0];
// Vector search — wider net so the reranker has room to work
const similar = await env.VECTORIZE.query(queryVector, { topK: 15, returnMetadata: true });
const vectorMatches = (similar.matches as any[])
.filter(m => m.metadata?.text)
.map(m => ({ id: m.id as string, text: m.metadata.text as string }));
// RRF fusion
const merged = rrfFusion(vectorMatches, ftsMatches);
// Rerank — top 10 in, top 5 out
const top = merged.length > 0 ? await rerank(env, question, merged.slice(0, 10)) : [];
// Always call the LLM — the system prompt handles greetings naturally
const context = top.map(c => c.text).join('\n\n---\n\n');
const userPrompt = context
? `CONTEXT:\n${context}\n\nQUESTION:\n${question}\n\nAnswer:`
: `QUESTION:\n${question}\n\nAnswer:`;
// Generate answer with conversation history injected
const answerResponse = await env.AI.run(LLM_MAIN, {
messages: [
{ role: 'system', content: SYSTEM_PROMPT },
...history.slice(-6), // last 3 exchanges
{ role: 'user', content: userPrompt },
],
max_tokens: 1000,
temperature: 0.0,
});
const answer = ((answerResponse as any).response ?? '').trim() || "I couldn't generate an answer.";
return new Response(JSON.stringify({ answer }), {
headers: { 'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*' },
});
}
return new Response('Not found', { status: 404 });
},
};
npx wrangler deploy
Wrangler compiles index.ts and deploys to Cloudflare's global edge. You'll get a live URL like https://my-rag-bot.YOUR-SUBDOMAIN.workers.dev. Note that the Worker only answers POST requests on /query (opening the URL in a browser returns "Not found"), so test it from the terminal.
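The question below is illustrative; ask something your knowledge base actually covers:

curl -s -X POST https://my-rag-bot.YOUR-SUBDOMAIN.workers.dev/query \
  -H 'Content-Type: application/json' \
  -d '{"question": "What is the breaking load of the mooring lines?"}'

You should get back {"answer": "..."} composed only from your own content.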
When you add new content to knowledge.txt, re-run only Steps 10 and 11. No need to redeploy the Worker — it queries Vectorize and D1 dynamically on every request.
node ingest.mjs
npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql
Every upgrade applied, compared to a typical beginner implementation:
- Contextual retrieval: every chunk is prefixed with LLM-generated context before embedding, so isolated facts remain retrievable
- HyDE: the query embedding comes from a hypothetical answer rather than the raw question
- Hybrid search: Vectorize (semantic) and D1 FTS5 (keyword) retrieval run in parallel
- RRF fusion: the two ranked lists are merged without needing calibrated scores
- Cross-encoder reranking: top candidates are re-scored against the actual question
- History-aware retrieval: HyDE and FTS both inherit topic context from the conversation
All on Cloudflare Workers AI — no external API keys needed.