Build a RAG Pipeline

Build a Retrieval-Augmented Generation (RAG) system that lets users ask questions about your own documents. The pipeline embeds your docs, stores the vectors, finds relevant chunks at query time, and passes them to a chat model for grounded answers.

What you’ll build

A production RAG pipeline that:

Chunks and embeds your documents using Runcrate’s embedding models
Stores vectors in any vector database (Postgres pgvector, Pinecone, Weaviate, or in-memory)
Retrieves relevant chunks for each user query
Generates accurate, grounded answers using Runcrate’s chat models

Architecture

User Query
    ↓
Embed query (Runcrate embedding model)
    ↓
Vector similarity search (your vector DB)
    ↓
Top-K relevant chunks
    ↓
Prompt = system instructions + chunks + user query
    ↓
Chat completion (Runcrate chat model)
    ↓
Grounded answer
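
The prompt-assembly step in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not part of any SDK: the `build_messages` helper and the `SYSTEM_INSTRUCTIONS` constant are hypothetical names, and the instruction wording is borrowed from the full example later on this page.

```python
from typing import Sequence

# Instruction text reused from the full example on this page.
SYSTEM_INSTRUCTIONS = (
    "Answer the user's question using ONLY the following context. "
    "If the context doesn't contain the answer, say so."
)


def build_messages(chunks: Sequence[str], question: str) -> list[dict[str, str]]:
    """Hypothetical helper: system instructions + retrieved chunks, then the user query."""
    # Separate chunks with blank lines so the model can tell them apart.
    context = "\n\n".join(chunks)
    return [
        {"role": "system", "content": f"{SYSTEM_INSTRUCTIONS}\n\n{context}"},
        {"role": "user", "content": question},
    ]


messages = build_messages(
    ["Persistent volumes cost $0.03/GB/month..."],
    "How much does storage cost?",
)
```

Keeping the retrieved chunks in the system message and the question in the user message mirrors the layout used by both full examples.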

Full example (Vercel AI SDK + pgvector)

1. Embed and store documents

import { runcrate } from '@runcrate/ai';
import { embedMany } from 'ai';
import { sql } from '@vercel/postgres';

const docs = [
  { id: '1', title: 'Pricing', content: 'GPU instances start at $0.35/hr for an RTX 4090...' },
  { id: '2', title: 'Storage', content: 'Persistent volumes cost $0.03/GB/month...' },
  { id: '3', title: 'Auto-recharge', content: 'Set a credit threshold and recharge amount...' },
];

// Embed all documents
const { embeddings } = await embedMany({
  model: runcrate.embeddingModel('BAAI/bge-large-en-v1.5'),
  values: docs.map(d => `${d.title}: ${d.content}`),
});

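// Store in pgvector. Assumes the pgvector extension is enabled and a `documents` table
// already exists with a vector(1024) `embedding` column (bge-large-en-v1.5 outputs 1024 dimensions).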
for (let i = 0; i < docs.length; i++) {
  await sql`
    INSERT INTO documents (id, title, content, embedding)
    VALUES (${docs[i].id}, ${docs[i].title}, ${docs[i].content}, ${JSON.stringify(embeddings[i])})
  `;
}

2. Query at runtime

import { runcrate } from '@runcrate/ai';
import { embed, generateText } from 'ai';
import { sql } from '@vercel/postgres';

async function askDocs(question: string) {
  // Embed the question
  const { embedding } = await embed({
    model: runcrate.embeddingModel('BAAI/bge-large-en-v1.5'),
    value: question,
  });

  // Find similar documents
  const { rows } = await sql`
    SELECT title, content, 1 - (embedding <=> ${JSON.stringify(embedding)}) AS similarity
    FROM documents
    ORDER BY embedding <=> ${JSON.stringify(embedding)}
    LIMIT 5
  `;

  // Generate answer with context
  const context = rows.map(r => `[${r.title}]: ${r.content}`).join('\n\n');

  const { text } = await generateText({
    model: runcrate('deepseek-ai/DeepSeek-V3'),
    messages: [
      {
        role: 'system',
        content: `Answer the user's question using ONLY the following context. If the context doesn't contain the answer, say so.\n\n${context}`,
      },
      { role: 'user', content: question },
    ],
  });

  return { answer: text, sources: rows.map(r => r.title) };
}

const result = await askDocs('How much does storage cost?');
console.log(result.answer);
console.log('Sources:', result.sources);
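
In the query above, pgvector's `<=>` operator returns cosine distance, so `1 - (embedding <=> q)` is cosine similarity, and ordering by ascending distance puts the most similar rows first. The NumPy sketch below illustrates that relationship with made-up 2-D vectors (toy values, not real embeddings). The `cosine_distance` helper is a hypothetical stand-in for what the database computes.

```python
import numpy as np


def cosine_distance(a, b) -> float:
    """Cosine distance = 1 - cosine similarity (the quantity `<=>` returns)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# ORDER BY embedding <=> query: smallest distance first.
ranked = sorted(rows, key=lambda name: cosine_distance(rows[name], query))

# 1 - distance recovers the similarity column selected in the SQL.
similarities = {name: 1.0 - cosine_distance(vec, query) for name, vec in rows.items()}
```

Here `ranked` lists the row pointing in the same direction as the query first and the opposite row last, and the similarities run from 1 down to -1.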

Full example (Python SDK + in-memory)

A minimal RAG pipeline in Python that ranks documents by cosine similarity in memory. For small document sets, no vector database is needed:

from runcrate import Runcrate
import numpy as np

client = Runcrate(api_key="rc_live_...")

# Your documents
docs = [
    "GPU instances are billed hourly. RTX 4090 starts at $0.35/hr, A100 at $1.20/hr, H100 at $2.50/hr.",
    "Storage volumes cost $0.03/GB/month, charged weekly. Volumes persist across instance termination.",
    "Auto-recharge tops up credits automatically when your balance drops below a threshold you set.",
    "API keys are scoped to a workspace. The full key is shown only once at creation.",
    "The Models API supports chat, image, video, TTS, and ASR across 140+ open-source models.",
]

# Embed all documents
doc_embeddings = []
for doc in docs:
    resp = client.models.embed(model="BAAI/bge-large-en-v1.5", input=doc)
    doc_embeddings.append(resp.data[0].embedding)

doc_embeddings = np.array(doc_embeddings)

def ask(question: str, top_k: int = 3) -> str:
    # Embed the question
    q_resp = client.models.embed(model="BAAI/bge-large-en-v1.5", input=question)
    q_vec = np.array(q_resp.data[0].embedding)

    # Cosine similarity
    sims = doc_embeddings @ q_vec / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q_vec)
    )
    top_indices = np.argsort(sims)[-top_k:][::-1]

    context = "\n\n".join(docs[i] for i in top_indices)

    # Generate answer
    response = client.models.chat_completion(
        model="deepseek-ai/DeepSeek-V3",
        messages=[
            {"role": "system", "content": f"Answer using ONLY this context. If the context doesn't contain the answer, say so.\n\n{context}"},
            {"role": "user", "content": question},
        ],
    )

    return response.choices[0].message.content

print(ask("How much does an A100 cost per hour?"))
print(ask("What happens when my credits run out?"))
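
The `ask` function above recomputes every document norm on each call. One possible refinement, sketched here with toy vectors so it runs without an API key, is to normalize the document matrix once so that each query needs only a dot product. The helper names `normalize_rows` and `top_k_indices` are hypothetical and not part of the Runcrate SDK.

```python
import numpy as np


def normalize_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms


def top_k_indices(doc_unit: np.ndarray, q_vec: np.ndarray, top_k: int = 3) -> list[int]:
    """Return the indices of the top_k most similar rows, best match first."""
    q_unit = q_vec / np.linalg.norm(q_vec)
    sims = doc_unit @ q_unit
    # argsort is ascending, so reverse it before taking the first top_k entries.
    return [int(i) for i in np.argsort(sims)[::-1][:top_k]]


# Toy 2-D "embeddings"; real embeddings would come from the embedding model.
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
doc_unit = normalize_rows(toy_docs)  # done once, at indexing time

best = top_k_indices(doc_unit, np.array([2.0, 0.1]), top_k=2)
```

With these toy values, the query points almost along the first axis, so row 0 ranks first and the diagonal row 2 second.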

Production tips

Chunking matters most. Split documents at semantic boundaries (paragraph breaks, headers), not fixed character counts. Aim for 200–500 tokens per chunk.
Hybrid search (vector similarity combined with BM25 keyword scoring) often gives the largest retrieval-quality gain over pure vector search.
Reranking with a cross-encoder after initial retrieval is usually a high-ROI addition: retrieve the top 50 candidates, rerank them down to the top 5, and send only those to the LLM.
Include metadata (title, source URL, date) in each chunk so the model can cite sources.
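
The chunking and metadata tips can be sketched as follows. This is a minimal illustration under stated assumptions: token counts are approximated by whitespace-separated words (a real tokenizer will give different numbers), and the `Chunk` class, `chunk_document` function, and example URL are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str       # metadata carried with every chunk so answers can cite it
    source_url: str


def approx_tokens(text: str) -> int:
    # Assumption: roughly one token per word. Swap in a real tokenizer for accuracy.
    return len(text.split())


def chunk_document(text: str, title: str, source_url: str, max_tokens: int = 500) -> list[Chunk]:
    """Split at paragraph breaks, packing whole paragraphs into chunks of up to max_tokens."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[Chunk] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        n = approx_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(Chunk("\n\n".join(current), title, source_url))
            current, current_tokens = [], 0
        # A single paragraph longer than max_tokens is kept whole as its own chunk.
        current.append(para)
        current_tokens += n
    if current:
        chunks.append(Chunk("\n\n".join(current), title, source_url))
    return chunks


doc = "Intro paragraph one two three.\n\nSecond paragraph four five.\n\nThird paragraph six."
chunks = chunk_document(doc, title="Pricing", source_url="https://example.com/pricing", max_tokens=8)
```

With the small `max_tokens=8` budget used here, the first paragraph becomes one chunk and the second and third are packed together into another. Embedding `chunk.text` while storing `title` and `source_url` alongside the vector keeps the citation data available at query time.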
