# Build a RAG Pipeline

> Retrieval-Augmented Generation using Runcrate embeddings and chat models — search your own docs with AI.

Build a Retrieval-Augmented Generation (RAG) system that lets users ask questions about your own documents. The pipeline embeds your docs, stores the vectors, finds relevant chunks at query time, and passes them to a chat model for grounded answers.

## What you'll build

A working RAG pipeline that:

1. Chunks and embeds your documents using Runcrate's embedding models
2. Stores vectors in any vector database (Postgres pgvector, Pinecone, Weaviate, or in-memory)
3. Retrieves relevant chunks for each user query
4. Generates accurate, grounded answers using Runcrate's chat models
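
The examples on this page embed short documents whole, so the chunking part of step 1 never appears in code. A minimal sketch of paragraph-based chunking follows. The `Chunk` type, the `chunk_document` helper, and the `max_words` budget are illustrative assumptions, not part of the Runcrate SDK, and word count is used only as a rough stand-in for tokens.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record: text plus metadata the model can cite."""
    text: str
    title: str
    source_url: str
    index: int


def chunk_document(text, title, source_url, max_words=300):
    """Split on blank lines, then pack whole paragraphs into chunks.

    A chunk is closed as soon as adding the next paragraph would exceed
    max_words. A single paragraph longer than max_words is kept whole
    rather than cut mid-sentence.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n_words = len(para.split())
        if current and count + n_words > max_words:
            chunks.append(Chunk("\n\n".join(current), title, source_url, len(chunks)))
            current, count = [], 0
        current.append(para)
        count += n_words
    if current:
        chunks.append(Chunk("\n\n".join(current), title, source_url, len(chunks)))
    return chunks
```

Each `Chunk.text` would then be embedded in place of the whole document, with `title` and `source_url` stored next to the vector.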

***

## Architecture

```
User Query
    ↓
Embed query (Runcrate embedding model)
    ↓
Vector similarity search (your vector DB)
    ↓
Top-K relevant chunks
    ↓
Prompt = system instructions + chunks + user query
    ↓
Chat completion (Runcrate chat model)
    ↓
Grounded answer
```
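
The prompt-assembly step in the diagram can be sketched as a small pure function. `build_messages` and the `title`/`content` dictionary shape are illustrative assumptions; the system prompt wording mirrors the TypeScript example on this page.

```python
def build_messages(question, chunks, max_chunks=5):
    """Assemble chat messages from retrieved chunks and the user question.

    chunks: list of dicts with 'title' and 'content' keys (assumed shape),
    ordered best match first. Only the first max_chunks are included.
    """
    context = "\n\n".join(
        f"[{c['title']}]: {c['content']}" for c in chunks[:max_chunks]
    )
    system = (
        "Answer the user's question using ONLY the following context. "
        "If the context doesn't contain the answer, say so.\n\n" + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

Keeping this step separate from retrieval and generation makes it easy to inspect exactly what the chat model sees when an answer looks wrong.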

***

## Full example (Vercel AI SDK + pgvector)

### 1. Embed and store documents

```typescript theme={"theme":"github-dark"}
import { runcrate } from '@runcrate/ai';
import { embedMany } from 'ai';
import { sql } from '@vercel/postgres';

const docs = [
  { id: '1', title: 'Pricing', content: 'GPU instances start at $0.35/hr for an RTX 4090...' },
  { id: '2', title: 'Storage', content: 'Persistent volumes cost $0.03/GB/month...' },
  { id: '3', title: 'Auto-recharge', content: 'Set a credit threshold and recharge amount...' },
];

// Embed all documents
const { embeddings } = await embedMany({
  model: runcrate.embeddingModel('BAAI/bge-large-en-v1.5'),
  values: docs.map(d => `${d.title}: ${d.content}`),
});

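// Create the table once. This schema is an assumption (adjust it to your own);
// the vector size matches bge-large-en-v1.5, which outputs 1024 dimensions.
await sql`CREATE EXTENSION IF NOT EXISTS vector`;
await sql`
  CREATE TABLE IF NOT EXISTS documents (
    id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1024)
  )
`;

// Store in pgvector: a JSON array string such as "[0.1,0.2]" is valid vector input
for (let i = 0; i < docs.length; i++) {
  await sql`
    INSERT INTO documents (id, title, content, embedding)
    VALUES (${docs[i].id}, ${docs[i].title}, ${docs[i].content}, ${JSON.stringify(embeddings[i])})
  `;
}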
```

### 2. Query at runtime

```typescript theme={"theme":"github-dark"}
import { runcrate } from '@runcrate/ai';
import { embed, generateText } from 'ai';
import { sql } from '@vercel/postgres';

async function askDocs(question: string) {
  // Embed the question
  const { embedding } = await embed({
    model: runcrate.embeddingModel('BAAI/bge-large-en-v1.5'),
    value: question,
  });

  // Find similar documents
  const { rows } = await sql`
    SELECT title, content, 1 - (embedding <=> ${JSON.stringify(embedding)}) AS similarity
    FROM documents
    ORDER BY embedding <=> ${JSON.stringify(embedding)}
    LIMIT 5
  `;

  // Generate answer with context
  const context = rows.map(r => `[${r.title}]: ${r.content}`).join('\n\n');

  const { text } = await generateText({
    model: runcrate('deepseek-ai/DeepSeek-V3'),
    messages: [
      {
        role: 'system',
        content: `Answer the user's question using ONLY the following context. If the context doesn't contain the answer, say so.\n\n${context}`,
      },
      { role: 'user', content: question },
    ],
  });

  return { answer: text, sources: rows.map(r => r.title) };
}

const result = await askDocs('How much does storage cost?');
console.log(result.answer);
console.log('Sources:', result.sources);
```
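
In the query above, pgvector's `<=>` operator returns cosine distance, which is why the `SELECT` converts it with `1 - distance` and the `ORDER BY` sorts ascending. The relationship can be checked with a few lines of NumPy. The vectors and names below are toy values for illustration only.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance as pgvector's <=> defines it: 1 - cosine similarity."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],   # distance 0, similarity 1
    "orthogonal": [0.0, 3.0],       # distance 1, similarity 0
    "opposite": [-1.0, 0.0],        # distance 2, similarity -1
}

# Ascending distance puts the most similar row first, like ORDER BY ... <=> ...
ranked = sorted(rows, key=lambda name: cosine_distance(query, rows[name]))
```

Vector length does not affect the result, so `[2.0, 0.0]` is as close to the query as `[1.0, 0.0]` would be.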

***

## Full example (Python SDK + in-memory)

A minimal RAG pipeline using cosine similarity in Python — no vector database needed for small doc sets:

```python theme={"theme":"github-dark"}
from runcrate import Runcrate
import numpy as np

client = Runcrate(api_key="rc_live_...")

# Your documents
docs = [
    "GPU instances are billed hourly. RTX 4090 starts at $0.35/hr, A100 at $1.20/hr, H100 at $2.50/hr.",
    "Storage volumes cost $0.03/GB/month, charged weekly. Volumes persist across instance termination.",
    "Auto-recharge tops up credits automatically when your balance drops below a threshold you set.",
    "API keys are scoped to a workspace. The full key is shown only once at creation.",
    "The Models API supports chat, image, video, TTS, and ASR across 140+ open-source models.",
]

# Embed all documents
doc_embeddings = []
for doc in docs:
    resp = client.models.embed(model="BAAI/bge-large-en-v1.5", input=doc)
    doc_embeddings.append(resp.data[0].embedding)

doc_embeddings = np.array(doc_embeddings)

def ask(question: str, top_k: int = 3) -> str:
    # Embed the question
    q_resp = client.models.embed(model="BAAI/bge-large-en-v1.5", input=question)
    q_vec = np.array(q_resp.data[0].embedding)

    # Cosine similarity
    sims = doc_embeddings @ q_vec / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q_vec)
    )
    top_indices = np.argsort(sims)[-top_k:][::-1]

    context = "\n\n".join(docs[i] for i in top_indices)

    # Generate answer
    response = client.models.chat_completion(
        model="deepseek-ai/DeepSeek-V3",
        messages=[
            {"role": "system", "content": f"Answer using ONLY this context. If the context doesn't contain the answer, say so.\n\n{context}"},
            {"role": "user", "content": question},
        ],
    )

    return response.choices[0].message.content

print(ask("How much does an A100 cost per hour?"))
print(ask("What happens when my credits run out?"))
```
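
The retrieval step above always returns `top_k` chunks, even when none of them is relevant to the question. One possible refinement, sketched here under assumptions, is to normalize the document vectors once and drop matches below a similarity cutoff. `normalize_rows`, `retrieve`, and the `min_similarity` default of 0.3 are hypothetical; a useful cutoff depends on the embedding model and your data.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def retrieve(doc_matrix_unit, q_vec, top_k=3, min_similarity=0.3):
    """Return (index, similarity) pairs, best first, above the cutoff.

    doc_matrix_unit must already be row-normalized (see normalize_rows).
    """
    q = np.asarray(q_vec, dtype=float)
    q = q / max(np.linalg.norm(q), 1e-12)
    sims = doc_matrix_unit @ q
    order = np.argsort(sims)[::-1][:top_k]
    return [(int(i), float(sims[i])) for i in order if sims[i] >= min_similarity]
```

If `retrieve` returns an empty list, the caller can skip the chat completion and reply that the documents do not cover the question.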

***

## Production tips

* **Chunk at semantic boundaries.** Split documents at paragraph breaks and headers rather than at fixed character counts, and aim for 200–500 tokens per chunk. Chunk quality limits everything downstream, so tune it first.
* **Add hybrid search.** Combining vector search with keyword search (BM25) typically improves recall over pure vector search, especially for exact terms such as product names and error codes.
* **Rerank the candidates.** Retrieve a wide candidate set (for example, the top 50), rescore it with a cross-encoder, and send only the top 5 to the LLM. This usually improves precision for little extra code.
* **Include metadata** (title, source URL, date) in each chunk so the model can cite sources.
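
The hybrid-search and reranking tips can be combined into one retrieval stage. The sketch below merges a vector ranking and a keyword ranking with reciprocal rank fusion (RRF), one common fusion method, then rescores the merged candidates. All names are illustrative, `k=60` is the constant commonly used with RRF, and `score_fn` is a placeholder for whatever cross-encoder or reranking model you plug in.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs (best first) into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, so
    documents ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def retrieve_then_rerank(query, vector_ranking, keyword_ranking, score_fn,
                         candidates=50, final_k=5):
    """Fuse two rankings, keep the top candidates, rerank, return final_k IDs.

    score_fn(query, doc_id) -> float is a placeholder for a cross-encoder
    (higher means more relevant); it is not a Runcrate API.
    """
    fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])[:candidates]
    reranked = sorted(fused, key=lambda doc_id: score_fn(query, doc_id), reverse=True)
    return reranked[:final_k]
```

RRF uses only rank positions, so it needs no score calibration between the vector and keyword retrievers.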
