
Building a Domain-Specific RAG Chatbot on Cloudflare's Free Tier

Ask Dimitris — Project [01] · Complete Implementation Guide

Overview

This is the full guide I used to build a domain-specific AI chatbot that answers questions as me — drawing only from my own engineering notes, papers, and writing. No OpenAI key needed. No server to manage. Everything runs on Cloudflare's free tier.

Follow along and you'll have your own by the end. The only prerequisites are a Cloudflare account and Node.js installed.

What You'll Build

A chatbot that only answers questions about your knowledge base. If someone asks something outside your domain, it says so politely. Under the hood, every query goes through a 7-step pipeline before the LLM ever sees it — making it significantly more accurate than a basic RAG implementation.

The Stack

  • Cloudflare Workers — serverless compute, runs your chatbot at the edge
  • Cloudflare Workers AI — embeddings, LLMs, and cross-encoder reranker
  • Cloudflare Vectorize — vector database for semantic similarity search
  • Cloudflare D1 — SQLite at the edge for full-text keyword search

Pipeline Architecture

Each query passes through 7 stages. The key insight: we embed a hypothetical answer (HyDE) rather than the raw question, then merge vector and keyword results before a cross-encoder re-scores them for true relevance.

Query Pipeline — 7 Steps

The user query passes through:

  1. HyDE: hypothetical answer
  2. Embed: bge-base-en-v1.5
  3. Vector Search: Vectorize, top 15
  4. FTS Search: D1 SQLite, parallel
  5. RRF Fusion: merge ranked lists
  6. Reranker: bge-reranker-base
  7. Llama 3.3 70B: final answer
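
To make stage 5 concrete: RRF scores each chunk as score(d) = Σ 1/(k + rank(d)), summed over the lists it appears in, with k = 60 (the constant used by rrfFusion in Step 12). A chunk ranked 1st by vector search and 2nd by keyword search scores 1/61 + 1/62 ≈ 0.033, beating a chunk that tops only one list (1/61 ≈ 0.016), so agreement between the two retrievers wins without any score calibration between them.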
Step 01 · Cloudflare Environment Setup

Create a Cloudflare account at cloudflare.com if you don't have one. Then install Node.js (v18+) from nodejs.org.

Install Wrangler — Cloudflare's CLI — then log in:

bash
npm install -g wrangler
wrangler login
Note

A browser window will open asking you to allow Wrangler access. Click Allow. The terminal will confirm: Successfully logged in.
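
An optional sanity check before continuing (the version floor here is the v18+ requirement stated above):

bash
node --version       # expect v18 or later
npx wrangler --version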

Step 02 · Create Project Folder
bash
mkdir my-rag-bot
cd my-rag-bot
Step 03 · Initialise Worker Project

Inside the project folder, initialise a basic Worker. We'll replace the generated files with our full implementation later.

bash
wrangler init --yes

This creates wrangler.jsonc and src/index.ts. Both get replaced in the steps below.

Step 04 · Create Vectorize Index

This is your vector database — it stores 768-dimensional embeddings of your knowledge base so the system can retrieve semantically relevant chunks at query time.

bash
npx wrangler vectorize create my-knowledge-index --dimensions=768 --metric=cosine
Note

If Wrangler asks "Would you like Wrangler to add it to your wrangler.jsonc?" — press n. You'll add it manually in Step 6.
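
If you want to confirm the index was created before moving on, Wrangler can list your Vectorize indexes:

bash
npx wrangler vectorize list
# expect my-knowledge-index with 768 dimensions, cosine metric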

Step 05 · Create D1 Database

This SQLite database handles keyword (FTS5) search — it runs in parallel with vector search and catches exact term matches that embedding similarity sometimes misses: technical acronyms, model numbers, specific names.

bash
npx wrangler d1 create chunks-fts
Action

The command prints a uuid. Copy it — you'll paste it into wrangler.jsonc in the next step.
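
If you lose the uuid before pasting it, you can print it again:

bash
npx wrangler d1 list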

Step 06 · Configure wrangler.jsonc

Replace the entire contents of wrangler.jsonc with the configuration below. Replace YOUR_D1_DATABASE_ID with the uuid you copied in Step 5.

json — wrangler.jsonc
{
    "$schema": "node_modules/wrangler/config-schema.json",
    "name": "my-rag-bot",
    "main": "src/index.ts",
    "compatibility_date": "2026-05-15",
    "compatibility_flags": [
        "nodejs_compat",
        "global_fetch_strictly_public"
    ],
    "observability": { "enabled": true },
    "upload_source_maps": true,
    "ai": { "binding": "AI" },
    "vectorize": [
        {
            "binding": "VECTORIZE",
            "index_name": "my-knowledge-index"
        }
    ],
    "d1_databases": [
        {
            "binding": "DB",
            "database_name": "chunks-fts",
            "database_id": "YOUR_D1_DATABASE_ID"
        }
    ]
}
Step 07 · Gather Your Knowledge Base

Content format

This is the content your chatbot draws from. It can be your CV, papers, articles, technical notes — anything you want it to know about.

  • Convert all content to plain text (.txt)
  • Word documents: File → Save As → Plain Text
  • PDFs: use pdftotext or copy-paste
  • Combine everything into a single file: knowledge.txt
  • Place it in your project root folder

The chunker splits on blank lines between paragraphs — so structure your content with clear paragraph breaks between topics.
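
A concrete conversion flow might look like this (the input file names are placeholders for your own content):

bash
pdftotext paper.pdf paper.txt        # PDF to plain text
cat cv.txt paper.txt notes.txt > knowledge.txt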

Step 08 · Get Cloudflare Credentials

Account ID

Go to dash.cloudflare.com → your Account ID appears in the right sidebar on the home page.

API Token

  1. Go to dash.cloudflare.com/profile/api-tokens
  2. Click Create Token → give it a name (e.g. rag-bot-ingest)
  3. Under Permissions, add: Workers AI → Edit, Vectorize → Edit, Account Settings → Read
  4. Leave Account Resources as All accounts, TTL empty
  5. Click Continue → Create Token
Warning

Copy the API token immediately after creation — you will not be able to see it again after leaving the page.
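
You can sanity-check the token before wiring it into the ingestion script, via Cloudflare's standard token-verify endpoint (a quick test, not one of the original steps):

bash
curl "https://api.cloudflare.com/client/v4/user/tokens/verify" \
  -H "Authorization: Bearer YOUR_API_TOKEN"
# a valid token returns "status": "active" and "success": true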

Step 09 · Create the Ingestion Script

Create ingest.mjs in your project root. Replace YOUR_ACCOUNT_ID and YOUR_API_TOKEN with your values from Step 8.

This script does four things:

  1. Splits knowledge.txt into 800-character chunks with 120-character overlap
  2. Calls a small LLM to prepend a context description to every chunk (contextual retrieval)
  3. Embeds each contextualized chunk and uploads it to Vectorize
  4. Writes a chunks.sql file for loading into D1
Why contextual retrieval?

Raw chunk: "The breaking load is 450 kN." — meaningless without context. After contextualization: "This chunk covers the breaking load of HMPE mooring lines in the FloatMast TLP study. The breaking load is 450 kN." — retrieves correctly.

javascript — ingest.mjs
// ingest.mjs
import { readFileSync, writeFileSync } from 'fs';

// ── Config ────────────────────────────────────────────────────────────────────
const ACCOUNT_ID  = 'YOUR_ACCOUNT_ID';
const API_TOKEN   = 'YOUR_API_TOKEN';
const INDEX_NAME  = 'my-knowledge-index';
const EMBED_MODEL = '@cf/baai/bge-base-en-v1.5';
const CONTEXT_LLM = '@cf/meta/llama-3.1-8b-instruct';
const TEXT_FILE   = './knowledge.txt';

const TARGET_CHUNK      = 800;   // characters per chunk (~150 tokens)
const OVERLAP_CHARS     = 120;   // overlap between chunks (~15%)
const DOC_SUMMARY_CHARS = 3000;  // characters used as context anchor for the LLM

// ── Chunking ──────────────────────────────────────────────────────────────────
function chunkText(text) {
  const paragraphs = text.split(/\n\s*\n/).filter(p => p.trim().length > 0);
  const chunks = [];

  for (const para of paragraphs) {
    const clean = para.replace(/\s+/g, ' ').trim();
    if (clean.length < 80) continue;

    if (clean.length <= TARGET_CHUNK) {
      chunks.push(clean);
      continue;
    }

    // Long paragraph: split by sentences, carry overlap into next chunk
    const sentences = clean.match(/[^.!?]+[.!?]+/g) ?? [clean];
    let current = '';
    for (const s of sentences) {
      if (current.length + s.length > TARGET_CHUNK) {
        if (current) chunks.push(current.trim());
        const words = current.split(' ');
        const overlap = words.slice(-Math.round(OVERLAP_CHARS / 5)).join(' ');
        current = (overlap + ' ' + s.trim()).trim();
      } else {
        current += (current ? ' ' : '') + s.trim();
      }
    }
    if (current.trim().length > 80) chunks.push(current.trim());
  }

  return chunks;
}

// ── Contextual retrieval ──────────────────────────────────────────────────────
async function addContext(docSummary, chunk) {
  const prompt =
    `<document_summary>\n${docSummary}\n</document_summary>\n\n` +
    `<chunk>\n${chunk}\n</chunk>\n\n` +
    `In 1-2 sentences, describe what concept this chunk covers and where it ` +
    `fits in the document. Reply with ONLY the description, no preamble.`;
  try {
    const res = await fetch(
      `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/${CONTEXT_LLM}`,
      {
        method: 'POST',
        headers: { Authorization: `Bearer ${API_TOKEN}`, 'Content-Type': 'application/json' },
        body: JSON.stringify({ messages: [{ role: 'user', content: prompt }], max_tokens: 80, temperature: 0.0 }),
      }
    );
    const json = await res.json();
    const ctx = json?.result?.response?.trim();
    return ctx ? `${ctx}\n\n${chunk}` : chunk;
  } catch {
    return chunk; // fall back to raw chunk if LLM call fails
  }
}

// ── Embedding ─────────────────────────────────────────────────────────────────
async function getEmbedding(text) {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/${EMBED_MODEL}`,
    {
      method: 'POST',
      headers: { Authorization: `Bearer ${API_TOKEN}`, 'Content-Type': 'application/json' },
      body: JSON.stringify({ text: [text] }),
    }
  );
  const json = await res.json();
  if (!json.success) throw new Error(`Embedding failed: ${JSON.stringify(json.errors)}`);
  return json.result.data[0];
}

// ── Vectorize upload ──────────────────────────────────────────────────────────
async function upsertVectors(vectors) {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/vectorize/v2/indexes/${INDEX_NAME}/upsert`,
    {
      method: 'POST',
      headers: { Authorization: `Bearer ${API_TOKEN}`, 'Content-Type': 'application/json' },
      body: JSON.stringify({ vectors }),
    }
  );
  const json = await res.json();
  if (!json.success) throw new Error(`Vectorize upsert failed: ${JSON.stringify(json.errors)}`);
}

// ── D1 SQL builder ────────────────────────────────────────────────────────────
const sqlStatements = [];
function escapeSql(str) { return str.replace(/'/g, "''"); }
function d1AddStatement(sql) { sqlStatements.push(sql); }
function d1AddInsert(id, text) {
  sqlStatements.push(`INSERT INTO chunks(id, text) VALUES ('${escapeSql(id)}', '${escapeSql(text)}');`);
}
function writeSqlFile() { writeFileSync('./chunks.sql', sqlStatements.join('\n'), 'utf-8'); }

// ── Main ──────────────────────────────────────────────────────────────────────
async function main() {
  console.log('Reading knowledge file...');
  const text = readFileSync(TEXT_FILE, 'utf-8');
  console.log(`  ${text.length} characters loaded`);

  const docSummary = text.slice(0, DOC_SUMMARY_CHARS);
  const rawChunks  = chunkText(text);
  console.log(`  Split into ${rawChunks.length} chunks`);

  d1AddStatement('DROP TABLE IF EXISTS chunks;');
  d1AddStatement('CREATE VIRTUAL TABLE chunks USING fts5(id UNINDEXED, text);');

  console.log('\nContextualizing + embedding + uploading...');
  const vectorBatch = [];

  for (let i = 0; i < rawChunks.length; i++) {
    process.stdout.write(`\r  [${i + 1}/${rawChunks.length}] processing...`);
    const id              = `chunk-${i}`;
    const contextualized  = await addContext(docSummary, rawChunks[i]);
    const embedding       = await getEmbedding(contextualized);
    vectorBatch.push({ id, values: embedding, metadata: { text: contextualized } });
    d1AddInsert(id, contextualized);

    if (vectorBatch.length >= 20 || i === rawChunks.length - 1) {
      process.stdout.write(' - uploading batch...');
      await upsertVectors([...vectorBatch]);
      vectorBatch.length = 0;
    }
  }

  writeSqlFile();
  console.log(`\n\nDone! ${rawChunks.length} chunks uploaded to Vectorize.`);
  console.log(`\nNext: npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql`);
}

main().catch(err => { console.error('\nError:', err.message); process.exit(1); });
Step 10 · Run the Ingestion Script
bash
node ingest.mjs

Expected output:

output
Reading knowledge file...
  62475 characters loaded
  Split into 111 chunks

Contextualizing + embedding + uploading...
  [111/111] processing... - uploading batch...

Done! 111 chunks uploaded to Vectorize.

Next: npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql
Note

This takes several minutes. Every chunk is processed by the context LLM before embedding — that's the contextual retrieval step at work. It only needs to run once (or when you update your knowledge base).

Step 11 · Load Keyword Search Data into D1

This creates the FTS5 full-text search table in your D1 database and inserts all chunks. The Worker queries this alongside Vectorize on every request.

bash
npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql

You'll see: Successfully executed 113 commands
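
To double-check the load, count the rows; the number should match the chunk count from Step 10 (111 in my run, yours will differ):

bash
npx wrangler d1 execute chunks-fts --remote --command "SELECT count(*) AS n FROM chunks;"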

Step 12 · Create the Worker

Replace everything in src/index.ts with the code below. On every query the Worker runs: HyDE + FTS in parallel → embed → vector search → RRF fusion → reranking → Llama 3.3 70B answer, with recent conversation history injected into the HyDE, FTS, and answer stages.

Key upgrades over a basic chatbot: both HyDE and FTS are history-aware — follow-up questions like "and hydraulics?" automatically inherit topic context from the previous exchange. The system prompt handles greetings naturally and resolves references to "the author" to your name. Scope decisions are driven by retrieved context only, never by topic name.

typescript — src/index.ts
export interface Env {
  AI: any;
  VECTORIZE: any;
  DB: any;
}

interface HistoryMessage {
  role: 'user' | 'assistant';
  content: string;
}

const EMBED_MODEL = '@cf/baai/bge-base-en-v1.5';
const LLM_FAST    = '@cf/meta/llama-3.1-8b-instruct';
const LLM_MAIN    = '@cf/meta/llama-3.3-70b-instruct-fp8-fast';
const RERANKER    = '@cf/baai/bge-reranker-base';

function sanitizeFtsQuery(q: string): string {
  return q
    .replace(/[^a-zA-Z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(t => t.length > 2)
    .join(' ');
}

// HyDE: generate a hypothetical answer using recent history for context.
// Short follow-ups like "and hydraulics?" use history to know the topic is CFD.
// Embedding the hypothetical answer instead of the raw question bridges the
// vocabulary gap between short queries and longer technical document chunks.
async function generateHyDE(env: Env, question: string, history: HistoryMessage[]): Promise<string> {
  try {
    const r = await env.AI.run(LLM_FAST, {
      messages: [
        {
          role: 'system',
          content: 'You are an offshore and marine engineering expert. Write a 2-3 sentence technical answer using specific domain terminology.',
        },
        ...history.slice(-4),
        { role: 'user', content: question },
      ],
      max_tokens: 150,
      temperature: 0.1,
    });
    return (r as any).response?.trim() || question;
  } catch {
    return question;
  }
}

// FTS keyword search via D1 SQLite FTS5.
// Expands the query with the previous user message so short follow-ups
// like "only Naval or hydraulics also?" inherit the parent topic from history.
async function queryFTS(
  env: Env,
  question: string,
  history: HistoryMessage[]
): Promise<Array<{ id: string; text: string }>> {
  try {
    const lastUserMsg = history.filter(h => h.role === 'user').slice(-1)[0]?.content ?? '';
    const expanded = lastUserMsg ? `${lastUserMsg} ${question}` : question;
    const q = sanitizeFtsQuery(expanded);
    if (!q) return [];
    const result = await env.DB.prepare(
      'SELECT id, text FROM chunks WHERE chunks MATCH ?1 ORDER BY rank LIMIT 15'
    )
      .bind(q)
      .all<{ id: string; text: string }>();
    return result.results ?? [];
  } catch {
    return [];
  }
}

// Reciprocal Rank Fusion: merges two ranked lists without needing calibrated scores.
// score(d) = sum of 1/(60 + rank) across both lists.
function rrfFusion(
  vectorMatches: Array<{ id: string; text: string }>,
  ftsMatches: Array<{ id: string; text: string }>,
  k = 60
): Array<{ id: string; text: string; score: number }> {
  const map = new Map<string, { text: string; score: number }>();
  for (let i = 0; i < vectorMatches.length; i++) {
    const { id, text } = vectorMatches[i];
    const e = map.get(id) ?? { text, score: 0 };
    e.score += 1 / (k + i + 1);
    map.set(id, e);
  }
  for (let i = 0; i < ftsMatches.length; i++) {
    const { id, text } = ftsMatches[i];
    const e = map.get(id) ?? { text, score: 0 };
    e.score += 1 / (k + i + 1);
    map.set(id, e);
  }
  return Array.from(map.entries())
    .map(([id, v]) => ({ id, text: v.text, score: v.score }))
    .sort((a, b) => b.score - a.score);
}

// Cross-encoder reranker: reads the question AND each chunk together.
// Much more accurate than cosine similarity. Retrieve 15, rerank top 10, keep 5.
async function rerank(
  env: Env,
  question: string,
  candidates: Array<{ id: string; text: string; score: number }>
): Promise<Array<{ id: string; text: string }>> {
  if (candidates.length === 0) return [];
  try {
    const result = await env.AI.run(RERANKER, {
      query: question,
      contexts: candidates.map(c => ({ text: c.text })),
    });
    const scores = (result as any).data as Array<{ score: number }>;
    return candidates
      .map((c, i) => ({ ...c, rerankScore: scores[i]?.score ?? 0 }))
      .sort((a, b) => b.rerankScore - a.rerankScore)
      .slice(0, 5)
      .map(({ id, text }) => ({ id, text }));
  } catch {
    return candidates.slice(0, 5).map(({ id, text }) => ({ id, text }));
  }
}

// Scope decisions are driven by CONTEXT only — never by topic name.
const SYSTEM_PROMPT = `You are Dimitrios Tsakalomatis — engineer, researcher, and the person whose knowledge base you draw from. Your background spans offshore structures, marine engineering, hydraulics, CFD, Python for engineering, AI/ML, and academic research.

IDENTITY RULES:
- When the CONTEXT refers to "the author", "the researcher", or "the engineer", that person is you — Dimitrios Tsakalomatis.
- For greetings and casual conversation ("hi", "how are you"), respond naturally and warmly. You do not need CONTEXT for this.
- For personal questions not in the CONTEXT, respond graciously that you prefer to keep personal life separate from professional work.

ANSWER RULES:
1. If the CONTEXT contains information relevant to the question, use it to answer — regardless of domain. If it is in the CONTEXT, answer it.
2. Never copy text fragments verbatim. Always compose a complete, natural response in your own words.
3. Use the CONVERSATION HISTORY to understand follow-up questions.
4. Only if the CONTEXT does not contain relevant information, respond EXACTLY: "That's not something I have documented — a general assistant will serve you better for that."
5. Never guess or use outside knowledge for technical facts.
6. Answer concisely, technically, and directly.`;

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    if (request.method === 'OPTIONS') {
      return new Response(null, {
        headers: {
          'Access-Control-Allow-Origin': '*',
          'Access-Control-Allow-Methods': 'POST, OPTIONS',
          'Access-Control-Allow-Headers': 'Content-Type',
        },
      });
    }

    if (url.pathname === '/query' && request.method === 'POST') {
      const body = (await request.json()) as { question: string; history?: HistoryMessage[] };
      const { question } = body;
      const history: HistoryMessage[] = body.history ?? [];

      if (!question) {
        return new Response(JSON.stringify({ answer: 'Please provide a question.' }), {
          status: 400,
          headers: { 'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*' },
        });
      }

      // HyDE + FTS in parallel, both history-aware
      const [hydeAnswer, ftsMatches] = await Promise.all([
        generateHyDE(env, question, history),
        queryFTS(env, question, history),
      ]);

      // Embed the HyDE answer (not the raw question)
      const embeddingResponse = await env.AI.run(EMBED_MODEL, { text: [hydeAnswer] });
      const queryVector = (embeddingResponse as any).data[0];

      // Vector search — wider net so the reranker has room to work
      const similar = await env.VECTORIZE.query(queryVector, { topK: 15, returnMetadata: true });
      const vectorMatches = (similar.matches as any[])
        .filter(m => m.metadata?.text)
        .map(m => ({ id: m.id as string, text: m.metadata.text as string }));

      // RRF fusion
      const merged = rrfFusion(vectorMatches, ftsMatches);

      // Rerank — top 10 in, top 5 out
      const top = merged.length > 0 ? await rerank(env, question, merged.slice(0, 10)) : [];

      // Always call the LLM — the system prompt handles greetings naturally
      const context = top.map(c => c.text).join('\n\n---\n\n');
      const userPrompt = context
        ? `CONTEXT:\n${context}\n\nQUESTION:\n${question}\n\nAnswer:`
        : `QUESTION:\n${question}\n\nAnswer:`;

      // Generate answer with conversation history injected
      const answerResponse = await env.AI.run(LLM_MAIN, {
        messages: [
          { role: 'system', content: SYSTEM_PROMPT },
          ...history.slice(-6),           // last 3 exchanges
          { role: 'user', content: userPrompt },
        ],
        max_tokens: 1000,
        temperature: 0.0,
      });

      const answer = ((answerResponse as any).response ?? '').trim() || "I couldn't generate an answer.";

      return new Response(JSON.stringify({ answer }), {
        headers: { 'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*' },
      });
    }

    return new Response('Not found', { status: 404 });
  },
};
Step 13 · Deploy
bash
npx wrangler deploy

Wrangler compiles index.ts and deploys to Cloudflare's global edge. You'll get a live URL like https://my-rag-bot.YOUR-SUBDOMAIN.workers.dev. The Worker only answers POST requests on /query (opening the URL in a browser returns "Not found"), so test it as shown below.
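
Test from the terminal, substituting the URL Wrangler printed for you:

bash
curl -X POST https://my-rag-bot.YOUR-SUBDOMAIN.workers.dev/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What topics do you cover?", "history": []}'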

Updating Your Knowledge Base

When you add new content to knowledge.txt, re-run only Steps 10 and 11. No need to redeploy the Worker — it queries Vectorize and D1 dynamically on every request. One caveat, implied by the ingest script: vectors are upserted under sequential chunk ids, so if the updated file yields fewer chunks than the previous run, stale higher-numbered vectors remain in Vectorize. Delete and recreate the index if your knowledge base shrinks substantially.

bash
node ingest.mjs
npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql

What Changed vs. a Basic RAG Chatbot

Every upgrade applied, compared to a typical beginner implementation:

Component | Basic Version | This Implementation
LLM | Llama 2 7B INT8 (2023) | Llama 3.3 70B FP8 (2024)
Max tokens | 300 — answers get cut off | 1000
Query embedding | Raw user question | HyDE hypothetical answer
Retrieval | Vector only, top 5 | Vector + FTS, top 15 each
Result merging | None | Reciprocal Rank Fusion
Post-retrieval | Nothing | Cross-encoder reranking
Chunk size | 300–500 chars | 800 chars + 120 overlap
Chunk content | Raw text | Contextual prefix + text
Conversation memory | None — every query isolated | Last 3 exchanges sent
Follow-up queries | Break — no topic context | History-aware HyDE + FTS
Greetings | Scope rejection error | Natural warm response
"The author" in docs | Returned verbatim | Resolved to your name
Fragment answers | Common | Always full sentences
Scope decisions | Pre-filtered by topic name | Driven by CONTEXT only

Models Used

All on Cloudflare Workers AI — no external API keys needed.

Model ID | Role
@cf/baai/bge-base-en-v1.5 | Embedding model, 768 dimensions
@cf/meta/llama-3.1-8b-instruct | Fast LLM for HyDE and context generation
@cf/meta/llama-3.3-70b-instruct-fp8-fast | Main answer LLM
@cf/baai/bge-reranker-base | Cross-encoder reranker