
Building a Domain-Specific RAG Chatbot on Cloudflare's Free Tier

Ask Dimitris — Project [01] · Complete Implementation Guide

Overview

This is the full guide I used to build a domain-specific AI chatbot that answers questions as me — drawing only from my own engineering notes, papers, and writing. No OpenAI key needed. No server to manage. Everything runs on Cloudflare's free tier.

Follow along and you'll have your own by the end. The only prerequisites are a Cloudflare account and Node.js installed.

What You'll Build

A chatbot that only answers questions about your knowledge base. If someone asks something outside your domain, it says so politely. Under the hood, every query goes through a 7-step pipeline before the LLM ever sees it — making it significantly more accurate than a basic RAG implementation.

The Stack

  • Cloudflare Workers — serverless compute, runs your chatbot at the edge
  • Cloudflare Workers AI — embeddings, LLMs, and cross-encoder reranker
  • Cloudflare Vectorize — vector database for semantic similarity search
  • Cloudflare D1 — SQLite at the edge for full-text keyword search

Pipeline Architecture

Each query passes through 7 stages. The key insight: we embed a hypothetical answer (HyDE) rather than the raw question, then merge vector and keyword results before a cross-encoder re-scores them for true relevance.

Query Pipeline — 7 Steps

The user query passes through:

  1. HyDE: hypothetical answer
  2. Embed: bge-base-en-v1.5
  3. Vector Search: Vectorize, top 15
  4. FTS Search: D1 SQLite, parallel
  5. RRF Fusion: merge ranked lists
  6. Reranker: bge-reranker-base
  7. Llama 3.3 70B: final answer
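
To make stage 5 concrete: RRF scores each chunk as score(d) = Σ 1/(k + rank(d)), summed over the lists it appears in, with k = 60 (the constant used by rrfFusion in Step 12). A chunk ranked 1st by vector search and 2nd by keyword search scores 1/61 + 1/62 ≈ 0.033, beating a chunk that tops only one list (1/61 ≈ 0.016), so agreement between the two retrievers wins without any score calibration between them.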
Step 01 · Cloudflare Environment Setup

Create a Cloudflare account at cloudflare.com if you don't have one. Then install Node.js (v18+) from nodejs.org.

Install Wrangler — Cloudflare's CLI — then log in:

bash
npm install -g wrangler
wrangler login
Note

A browser window will open asking you to allow Wrangler access. Click Allow. The terminal will confirm: Successfully logged in.
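
An optional sanity check before continuing (the version floor here is the v18+ requirement stated above):

bash
node --version       # expect v18 or later
npx wrangler --version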

Step 02 · Create Project Folder
bash
mkdir my-rag-bot
cd my-rag-bot
Step 03 · Initialise Worker Project

Inside the project folder, initialise a basic Worker. We'll replace the generated files with our full implementation later.

bash
wrangler init --yes

This creates wrangler.jsonc and src/index.ts. Both get replaced in the steps below.

Step 04 · Create Vectorize Index

This is your vector database — it stores 768-dimensional embeddings of your knowledge base so the system can retrieve semantically relevant chunks at query time.

bash
npx wrangler vectorize create my-knowledge-index --dimensions=768 --metric=cosine
Note

If Wrangler asks "Would you like Wrangler to add it to your wrangler.jsonc?" — press n. You'll add it manually in Step 6.
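
If you want to confirm the index was created before moving on, Wrangler can list your Vectorize indexes:

bash
npx wrangler vectorize list
# expect my-knowledge-index with 768 dimensions, cosine metric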

Step 05 · Create D1 Database

This SQLite database handles keyword (FTS5) search — it runs in parallel with vector search and catches exact term matches that embedding similarity sometimes misses: technical acronyms, model numbers, specific names.

bash
npx wrangler d1 create chunks-fts
Action

The command prints a uuid. Copy it — you'll paste it into wrangler.jsonc in the next step.
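
If you lose the uuid before pasting it, you can print it again:

bash
npx wrangler d1 list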

Step 06 · Configure wrangler.jsonc

Replace the entire contents of wrangler.jsonc with the configuration below. Replace YOUR_D1_DATABASE_ID with the uuid you copied in Step 5.

json — wrangler.jsonc
{
    "$schema": "node_modules/wrangler/config-schema.json",
    "name": "my-rag-bot",
    "main": "src/index.ts",
    "compatibility_date": "2026-05-15",
    "compatibility_flags": [
        "nodejs_compat",
        "global_fetch_strictly_public"
    ],
    "observability": { "enabled": true },
    "upload_source_maps": true,
    "ai": { "binding": "AI" },
    "vectorize": [
        {
            "binding": "VECTORIZE",
            "index_name": "my-knowledge-index"
        }
    ],
    "d1_databases": [
        {
            "binding": "DB",
            "database_name": "chunks-fts",
            "database_id": "YOUR_D1_DATABASE_ID"
        }
    ]
}
Step 07 · Gather Your Knowledge Base

Content format

This is the content your chatbot draws from. It can be your CV, papers, articles, technical notes — anything you want it to know about.

  • Convert all content to plain text (.txt)
  • Word documents: File → Save As → Plain Text
  • PDFs: use pdftotext or copy-paste
  • Combine everything into a single file: knowledge.txt
  • Place it in your project root folder

The chunker splits on blank lines between paragraphs — so structure your content with clear paragraph breaks between topics.
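
A concrete conversion flow might look like this (the input file names are placeholders for your own content):

bash
pdftotext paper.pdf paper.txt        # PDF to plain text
cat cv.txt paper.txt notes.txt > knowledge.txt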

Step 08 · Get Cloudflare Credentials

Account ID

Go to dash.cloudflare.com → your Account ID appears in the right sidebar on the home page.

API Token

  1. Go to dash.cloudflare.com/profile/api-tokens
  2. Click Create Token → give it a name (e.g. rag-bot-ingest)
  3. Under Permissions, add: Workers AI → Edit, Vectorize → Edit, Account Settings → Read
  4. Leave Account Resources as All accounts, TTL empty
  5. Click Continue → Create Token
Warning

Copy the API token immediately after creation — you will not be able to see it again after leaving the page.
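
You can sanity-check the token before wiring it into the ingestion script, via Cloudflare's standard token-verify endpoint (a quick test, not one of the original steps):

bash
curl "https://api.cloudflare.com/client/v4/user/tokens/verify" \
  -H "Authorization: Bearer YOUR_API_TOKEN"
# a valid token returns "status": "active" and "success": true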

Step 09 · Create the Ingestion Script

Create ingest.mjs in your project root. Replace YOUR_ACCOUNT_ID and YOUR_API_TOKEN with your values from Step 8.

This script does four things:

  1. Splits knowledge.txt into 800-character chunks with 120-character overlap
  2. Calls a small LLM to prepend a context description to every chunk (contextual retrieval)
  3. Embeds each contextualized chunk and uploads it to Vectorize
  4. Writes a chunks.sql file for loading into D1
Why contextual retrieval?

Raw chunk: "The breaking load is 450 kN." — meaningless without context. After contextualization: "This chunk covers the breaking load of HMPE mooring lines in the FloatMast TLP study. The breaking load is 450 kN." — retrieves correctly.

javascript — ingest.mjs
// ingest.mjs
import { readFileSync, writeFileSync } from 'fs';

// ── Config ────────────────────────────────────────────────────────────────────
const ACCOUNT_ID  = 'YOUR_ACCOUNT_ID';
const API_TOKEN   = 'YOUR_API_TOKEN';
const INDEX_NAME  = 'my-knowledge-index';
const EMBED_MODEL = '@cf/baai/bge-base-en-v1.5';
const CONTEXT_LLM = '@cf/meta/llama-3.1-8b-instruct';
const TEXT_FILE   = './knowledge.txt';

const TARGET_CHUNK      = 800;   // characters per chunk (~150 tokens)
const OVERLAP_CHARS     = 120;   // overlap between chunks (~15%)
const DOC_SUMMARY_CHARS = 3000;  // characters used as context anchor for the LLM

// ── Chunking ──────────────────────────────────────────────────────────────────
function chunkText(text) {
  const paragraphs = text.split(/\n\s*\n/).filter(p => p.trim().length > 0);
  const chunks = [];

  for (const para of paragraphs) {
    const clean = para.replace(/\s+/g, ' ').trim();
    if (clean.length < 80) continue;

    if (clean.length <= TARGET_CHUNK) {
      chunks.push(clean);
      continue;
    }

    // Long paragraph: split by sentences, carry overlap into next chunk
    const sentences = clean.match(/[^.!?]+[.!?]+/g) ?? [clean];
    let current = '';
    for (const s of sentences) {
      if (current.length + s.length > TARGET_CHUNK) {
        if (current) chunks.push(current.trim());
        const words = current.split(' ');
        const overlap = words.slice(-Math.round(OVERLAP_CHARS / 5)).join(' ');
        current = (overlap + ' ' + s.trim()).trim();
      } else {
        current += (current ? ' ' : '') + s.trim();
      }
    }
    if (current.trim().length > 80) chunks.push(current.trim());
  }

  return chunks;
}

// ── Contextual retrieval ──────────────────────────────────────────────────────
async function addContext(docSummary, chunk) {
  const prompt =
    `<document_summary>\n${docSummary}\n</document_summary>\n\n` +
    `<chunk>\n${chunk}\n</chunk>\n\n` +
    `In 1-2 sentences, describe what concept this chunk covers and where it ` +
    `fits in the document. Reply with ONLY the description, no preamble.`;
  try {
    const res = await fetch(
      `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/${CONTEXT_LLM}`,
      {
        method: 'POST',
        headers: { Authorization: `Bearer ${API_TOKEN}`, 'Content-Type': 'application/json' },
        body: JSON.stringify({ messages: [{ role: 'user', content: prompt }], max_tokens: 80, temperature: 0.0 }),
      }
    );
    const json = await res.json();
    const ctx = json?.result?.response?.trim();
    return ctx ? `${ctx}\n\n${chunk}` : chunk;
  } catch {
    return chunk; // fall back to raw chunk if LLM call fails
  }
}

// ── Embedding ─────────────────────────────────────────────────────────────────
async function getEmbedding(text) {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/${EMBED_MODEL}`,
    {
      method: 'POST',
      headers: { Authorization: `Bearer ${API_TOKEN}`, 'Content-Type': 'application/json' },
      body: JSON.stringify({ text: [text] }),
    }
  );
  const json = await res.json();
  if (!json.success) throw new Error(`Embedding failed: ${JSON.stringify(json.errors)}`);
  return json.result.data[0];
}

// ── Vectorize upload ──────────────────────────────────────────────────────────
async function upsertVectors(vectors) {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/vectorize/v2/indexes/${INDEX_NAME}/upsert`,
    {
      method: 'POST',
      headers: { Authorization: `Bearer ${API_TOKEN}`, 'Content-Type': 'application/json' },
      body: JSON.stringify({ vectors }),
    }
  );
  const json = await res.json();
  if (!json.success) throw new Error(`Vectorize upsert failed: ${JSON.stringify(json.errors)}`);
}

// ── D1 SQL builder ────────────────────────────────────────────────────────────
const sqlStatements = [];
function escapeSql(str) { return str.replace(/'/g, "''"); }
function d1AddStatement(sql) { sqlStatements.push(sql); }
function d1AddInsert(id, text) {
  sqlStatements.push(`INSERT INTO chunks(id, text) VALUES ('${escapeSql(id)}', '${escapeSql(text)}');`);
}
function writeSqlFile() { writeFileSync('./chunks.sql', sqlStatements.join('\n'), 'utf-8'); }

// ── Main ──────────────────────────────────────────────────────────────────────
async function main() {
  console.log('Reading knowledge file...');
  const text = readFileSync(TEXT_FILE, 'utf-8');
  console.log(`  ${text.length} characters loaded`);

  const docSummary = text.slice(0, DOC_SUMMARY_CHARS);
  const rawChunks  = chunkText(text);
  console.log(`  Split into ${rawChunks.length} chunks`);

  d1AddStatement('DROP TABLE IF EXISTS chunks;');
  d1AddStatement('CREATE VIRTUAL TABLE chunks USING fts5(id UNINDEXED, text);');

  console.log('\nContextualizing + embedding + uploading...');
  const vectorBatch = [];

  for (let i = 0; i < rawChunks.length; i++) {
    process.stdout.write(`\r  [${i + 1}/${rawChunks.length}] processing...`);
    const id              = `chunk-${i}`;
    const contextualized  = await addContext(docSummary, rawChunks[i]);
    const embedding       = await getEmbedding(contextualized);
    vectorBatch.push({ id, values: embedding, metadata: { text: contextualized } });
    d1AddInsert(id, contextualized);

    if (vectorBatch.length >= 20 || i === rawChunks.length - 1) {
      process.stdout.write(' - uploading batch...');
      await upsertVectors([...vectorBatch]);
      vectorBatch.length = 0;
    }
  }

  writeSqlFile();
  console.log(`\n\nDone! ${rawChunks.length} chunks uploaded to Vectorize.`);
  console.log(`\nNext: npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql`);
}

main().catch(err => { console.error('\nError:', err.message); process.exit(1); });
Step 10 · Run the Ingestion Script
bash
node ingest.mjs

Expected output:

output
Reading knowledge file...
  62475 characters loaded
  Split into 111 chunks

Contextualizing + embedding + uploading...
  [111/111] processing... - uploading batch...

Done! 111 chunks uploaded to Vectorize.

Next: npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql
Note

This takes several minutes. Every chunk is processed by the context LLM before embedding — that's the contextual retrieval step at work. It only needs to run once (or when you update your knowledge base).

Step 11 · Load Keyword Search Data into D1

This creates the FTS5 full-text search table in your D1 database and inserts all chunks. The Worker queries this alongside Vectorize on every request.

bash
npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql

You'll see: Successfully executed 113 commands
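
To double-check the load, count the rows; the number should match the chunk count from Step 10 (111 in my run, yours will differ):

bash
npx wrangler d1 execute chunks-fts --remote --command "SELECT count(*) AS n FROM chunks;"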

Step 12 · Create the Worker

Replace everything in src/index.ts with the code below. On every query the Worker runs: HyDE + FTS in parallel → embed → vector search → RRF fusion → reranking → Llama 3.3 70B answer, with recent conversation history injected into the HyDE, FTS, and answer stages.

Key upgrades over a basic chatbot: both HyDE and FTS are history-aware — follow-up questions like "and hydraulics?" automatically inherit topic context from the previous exchange. The system prompt handles greetings naturally and resolves references to "the author" to your name. Scope decisions are driven by retrieved context only, never by topic name.

typescript — src/index.ts
export interface Env {
  AI: any;
  VECTORIZE: any;
  DB: any;
}

interface HistoryMessage {
  role: 'user' | 'assistant';
  content: string;
}

const EMBED_MODEL = '@cf/baai/bge-base-en-v1.5';
const LLM_FAST    = '@cf/meta/llama-3.1-8b-instruct';
const LLM_MAIN    = '@cf/meta/llama-3.3-70b-instruct-fp8-fast';
const RERANKER    = '@cf/baai/bge-reranker-base';

function sanitizeFtsQuery(q: string): string {
  return q
    .replace(/[^a-zA-Z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(t => t.length > 2)
    .join(' ');
}

// HyDE: generate a hypothetical answer using recent history for context.
// Short follow-ups like "and hydraulics?" use history to know the topic is CFD.
// Embedding the hypothetical answer instead of the raw question bridges the
// vocabulary gap between short queries and longer technical document chunks.
async function generateHyDE(env: Env, question: string, history: HistoryMessage[]): Promise<string> {
  try {
    const r = await env.AI.run(LLM_FAST, {
      messages: [
        {
          role: 'system',
          content: 'You are an offshore and marine engineering expert. Write a 2-3 sentence technical answer using specific domain terminology.',
        },
        ...history.slice(-4),
        { role: 'user', content: question },
      ],
      max_tokens: 150,
      temperature: 0.1,
    });
    return (r as any).response?.trim() || question;
  } catch {
    return question;
  }
}

// FTS keyword search via D1 SQLite FTS5.
// Expands the query with the previous user message so short follow-ups
// like "only Naval or hydraulics also?" inherit the parent topic from history.
async function queryFTS(
  env: Env,
  question: string,
  history: HistoryMessage[]
): Promise<Array<{ id: string; text: string }>> {
  try {
    const lastUserMsg = history.filter(h => h.role === 'user').slice(-1)[0]?.content ?? '';
    const expanded = lastUserMsg ? `${lastUserMsg} ${question}` : question;
    const q = sanitizeFtsQuery(expanded);
    if (!q) return [];
    const result = await env.DB.prepare(
      'SELECT id, text FROM chunks WHERE chunks MATCH ?1 ORDER BY rank LIMIT 15'
    )
      .bind(q)
      .all<{ id: string; text: string }>();
    return result.results ?? [];
  } catch {
    return [];
  }
}

// Reciprocal Rank Fusion: merges two ranked lists without needing calibrated scores.
// score(d) = sum of 1/(60 + rank) across both lists.
function rrfFusion(
  vectorMatches: Array<{ id: string; text: string }>,
  ftsMatches: Array<{ id: string; text: string }>,
  k = 60
): Array<{ id: string; text: string; score: number }> {
  const map = new Map<string, { text: string; score: number }>();
  for (let i = 0; i < vectorMatches.length; i++) {
    const { id, text } = vectorMatches[i];
    const e = map.get(id) ?? { text, score: 0 };
    e.score += 1 / (k + i + 1);
    map.set(id, e);
  }
  for (let i = 0; i < ftsMatches.length; i++) {
    const { id, text } = ftsMatches[i];
    const e = map.get(id) ?? { text, score: 0 };
    e.score += 1 / (k + i + 1);
    map.set(id, e);
  }
  return Array.from(map.entries())
    .map(([id, v]) => ({ id, text: v.text, score: v.score }))
    .sort((a, b) => b.score - a.score);
}

// Cross-encoder reranker: reads the question AND each chunk together.
// Much more accurate than cosine similarity. Retrieve 15, rerank top 10, keep 5.
async function rerank(
  env: Env,
  question: string,
  candidates: Array<{ id: string; text: string; score: number }>
): Promise<Array<{ id: string; text: string }>> {
  if (candidates.length === 0) return [];
  try {
    const result = await env.AI.run(RERANKER, {
      query: question,
      contexts: candidates.map(c => ({ text: c.text })),
    });
    const scores = (result as any).data as Array<{ score: number }>;
    return candidates
      .map((c, i) => ({ ...c, rerankScore: scores[i]?.score ?? 0 }))
      .sort((a, b) => b.rerankScore - a.rerankScore)
      .slice(0, 5)
      .map(({ id, text }) => ({ id, text }));
  } catch {
    return candidates.slice(0, 5).map(({ id, text }) => ({ id, text }));
  }
}

// Scope decisions are driven by CONTEXT only — never by topic name.
const SYSTEM_PROMPT = `You are Dimitrios Tsakalomatis — engineer, researcher, and the person whose knowledge base you draw from. Your background spans offshore structures, marine engineering, hydraulics, CFD, Python for engineering, AI/ML, and academic research.

IDENTITY RULES:
- When the CONTEXT refers to "the author", "the researcher", or "the engineer", that person is you — Dimitrios Tsakalomatis.
- For greetings and casual conversation ("hi", "how are you"), respond naturally and warmly. You do not need CONTEXT for this.
- For personal questions not in the CONTEXT, respond graciously that you prefer to keep personal life separate from professional work.

ANSWER RULES:
1. If the CONTEXT contains information relevant to the question, use it to answer — regardless of domain. If it is in the CONTEXT, answer it.
2. Never copy text fragments verbatim. Always compose a complete, natural response in your own words.
3. Use the CONVERSATION HISTORY to understand follow-up questions.
4. Only if the CONTEXT does not contain relevant information, respond EXACTLY: "That's not something I have documented — a general assistant will serve you better for that."
5. Never guess or use outside knowledge for technical facts.
6. Answer concisely, technically, and directly.`;

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    if (request.method === 'OPTIONS') {
      return new Response(null, {
        headers: {
          'Access-Control-Allow-Origin': '*',
          'Access-Control-Allow-Methods': 'POST, OPTIONS',
          'Access-Control-Allow-Headers': 'Content-Type',
        },
      });
    }

    if (url.pathname === '/query' && request.method === 'POST') {
      const body = (await request.json()) as { question: string; history?: HistoryMessage[] };
      const { question } = body;
      const history: HistoryMessage[] = body.history ?? [];

      if (!question) {
        return new Response(JSON.stringify({ answer: 'Please provide a question.' }), {
          status: 400,
          headers: { 'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*' },
        });
      }

      // HyDE + FTS in parallel, both history-aware
      const [hydeAnswer, ftsMatches] = await Promise.all([
        generateHyDE(env, question, history),
        queryFTS(env, question, history),
      ]);

      // Embed the HyDE answer (not the raw question)
      const embeddingResponse = await env.AI.run(EMBED_MODEL, { text: [hydeAnswer] });
      const queryVector = (embeddingResponse as any).data[0];

      // Vector search — wider net so the reranker has room to work
      const similar = await env.VECTORIZE.query(queryVector, { topK: 15, returnMetadata: true });
      const vectorMatches = (similar.matches as any[])
        .filter(m => m.metadata?.text)
        .map(m => ({ id: m.id as string, text: m.metadata.text as string }));

      // RRF fusion
      const merged = rrfFusion(vectorMatches, ftsMatches);

      // Rerank — top 10 in, top 5 out
      const top = merged.length > 0 ? await rerank(env, question, merged.slice(0, 10)) : [];

      // Always call the LLM — the system prompt handles greetings naturally
      const context = top.map(c => c.text).join('\n\n---\n\n');
      const userPrompt = context
        ? `CONTEXT:\n${context}\n\nQUESTION:\n${question}\n\nAnswer:`
        : `QUESTION:\n${question}\n\nAnswer:`;

      // Generate answer with conversation history injected
      const answerResponse = await env.AI.run(LLM_MAIN, {
        messages: [
          { role: 'system', content: SYSTEM_PROMPT },
          ...history.slice(-6),           // last 3 exchanges
          { role: 'user', content: userPrompt },
        ],
        max_tokens: 1000,
        temperature: 0.0,
      });

      const answer = ((answerResponse as any).response ?? '').trim() || "I couldn't generate an answer.";

      return new Response(JSON.stringify({ answer }), {
        headers: { 'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*' },
      });
    }

    return new Response('Not found', { status: 404 });
  },
};
Step 13 · Deploy
bash
npx wrangler deploy

Wrangler compiles index.ts and deploys to Cloudflare's global edge. You'll get a live URL like https://my-rag-bot.YOUR-SUBDOMAIN.workers.dev. The Worker only answers POST requests on /query (opening the URL in a browser returns "Not found"), so test it as shown below.
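
Test from the terminal, substituting the URL Wrangler printed for you:

bash
curl -X POST https://my-rag-bot.YOUR-SUBDOMAIN.workers.dev/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What topics do you cover?", "history": []}'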

Updating Your Knowledge Base

When you add new content to knowledge.txt, re-run only Steps 10 and 11. No need to redeploy the Worker — it queries Vectorize and D1 dynamically on every request. One caveat, implied by the ingest script: vectors are upserted under sequential chunk ids, so if the updated file yields fewer chunks than the previous run, stale higher-numbered vectors remain in Vectorize. Delete and recreate the index if your knowledge base shrinks substantially.

bash
node ingest.mjs
npx wrangler d1 execute chunks-fts --remote --file=./chunks.sql

What Changed vs. a Basic RAG Chatbot

Every upgrade applied, compared to a typical beginner implementation:

Component | Basic Version | This Implementation
LLM | Llama 2 7B INT8 (2023) | Llama 3.3 70B FP8 (2024)
Max tokens | 300 — answers get cut off | 1000
Query embedding | Raw user question | HyDE hypothetical answer
Retrieval | Vector only, top 5 | Vector + FTS, top 15 each
Result merging | None | Reciprocal Rank Fusion
Post-retrieval | Nothing | Cross-encoder reranking
Chunk size | 300–500 chars | 800 chars + 120 overlap
Chunk content | Raw text | Contextual prefix + text
Conversation memory | None — every query isolated | Last 3 exchanges sent
Follow-up queries | Break — no topic context | History-aware HyDE + FTS
Greetings | Scope rejection error | Natural warm response
"The author" in docs | Returned verbatim | Resolved to your name
Fragment answers | Common | Always full sentences
Scope decisions | Pre-filtered by topic name | Driven by CONTEXT only

Models Used

All on Cloudflare Workers AI — no external API keys needed.

Model ID | Role
@cf/baai/bge-base-en-v1.5 | Embedding model, 768 dimensions
@cf/meta/llama-3.1-8b-instruct | Fast LLM for HyDE and context generation
@cf/meta/llama-3.3-70b-instruct-fp8-fast | Main answer LLM
@cf/baai/bge-reranker-base | Cross-encoder reranker