Build a Retrieval-Augmented Generation (RAG) system that lets users ask questions about your own documents. The pipeline embeds your docs, stores the vectors, finds relevant chunks at query time, and passes them to a chat model for grounded answers.
What you’ll build
A production RAG pipeline that:
- Chunks and embeds your documents using Runcrate’s embedding models
- Stores vectors in a vector database (Postgres pgvector, Pinecone, Weaviate) or in memory
- Retrieves relevant chunks for each user query
- Generates accurate, grounded answers using Runcrate’s chat models
Architecture
User Query
↓
Embed query (Runcrate embedding model)
↓
Vector similarity search (your vector DB)
↓
Top-K relevant chunks
↓
Prompt = system instructions + chunks + user query
↓
Chat completion (Runcrate chat model)
↓
Grounded answer
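The flow above can be sketched end to end in a few lines of plain Python. This is a minimal illustration, not the Runcrate API: `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts letters), and `top_k_chunks` and `build_messages` are illustrative names. The point is the shape of the steps: embed the query, rank chunks by cosine similarity, keep the top K, and assemble the prompt.

```python
import math


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a 26-dim letter-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query, chunks, k=2):
    """Rank chunks by similarity to the query and keep the k best."""
    q_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


def build_messages(query, chunks):
    """Prompt = system instructions + retrieved chunks + user query."""
    context = "\n\n".join(chunks)
    return [
        {"role": "system", "content": f"Answer using ONLY this context:\n\n{context}"},
        {"role": "user", "content": query},
    ]


chunks = ["storage pricing", "zzzz", "qqq"]
messages = build_messages("storage pricing", top_k_chunks("storage pricing", chunks, k=1))
```

In a real pipeline the embedding call and the similarity search are replaced by the embedding model and the vector database, as in the full examples that follow.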
Full example (Vercel AI SDK + pgvector)
1. Embed and store documents
import { runcrate } from '@runcrate/ai';
import { embedMany } from 'ai';
import { sql } from '@vercel/postgres';

const docs = [
  { id: '1', title: 'Pricing', content: 'GPU instances start at $0.35/hr for an RTX 4090...' },
  { id: '2', title: 'Storage', content: 'Persistent volumes cost $0.03/GB/month...' },
  { id: '3', title: 'Auto-recharge', content: 'Set a credit threshold and recharge amount...' },
];

// Embed all documents in one batched call
const { embeddings } = await embedMany({
  model: runcrate.embeddingModel('BAAI/bge-large-en-v1.5'),
  values: docs.map(d => `${d.title}: ${d.content}`),
});

// Store in pgvector. Assumes a `documents` table already exists with a
// pgvector `embedding` column (bge-large-en-v1.5 produces 1024-dimensional vectors).
for (let i = 0; i < docs.length; i++) {
  await sql`
    INSERT INTO documents (id, title, content, embedding)
    VALUES (${docs[i].id}, ${docs[i].title}, ${docs[i].content}, ${JSON.stringify(embeddings[i])}::vector)
  `;
}
2. Query at runtime
import { runcrate } from '@runcrate/ai';
import { embed, generateText } from 'ai';
import { sql } from '@vercel/postgres';

async function askDocs(question: string) {
  // Embed the question with the same model used for the documents
  const { embedding } = await embed({
    model: runcrate.embeddingModel('BAAI/bge-large-en-v1.5'),
    value: question,
  });

  // Serialize once; pgvector accepts the '[x,y,...]' text form when cast to vector
  const queryVector = JSON.stringify(embedding);

  // Find similar documents (<=> is pgvector's cosine distance, so 1 - distance = similarity)
  const { rows } = await sql`
    SELECT title, content, 1 - (embedding <=> ${queryVector}::vector) AS similarity
    FROM documents
    ORDER BY embedding <=> ${queryVector}::vector
    LIMIT 5
  `;

  // Generate answer with context
  const context = rows.map(r => `[${r.title}]: ${r.content}`).join('\n\n');
  const { text } = await generateText({
    model: runcrate('deepseek-ai/DeepSeek-V3'),
    messages: [
      {
        role: 'system',
        content: `Answer the user's question using ONLY the following context. If the context doesn't contain the answer, say so.\n\n${context}`,
      },
      { role: 'user', content: question },
    ],
  });

  return { answer: text, sources: rows.map(r => r.title) };
}

const result = await askDocs('How much does storage cost?');
console.log(result.answer);
console.log('Sources:', result.sources);
Full example (Python SDK + in-memory)
A minimal RAG pipeline using cosine similarity in Python. For small document sets, no vector database is needed:
from runcrate import Runcrate
import numpy as np

client = Runcrate(api_key="rc_live_...")

# Your documents
docs = [
    "GPU instances are billed hourly. RTX 4090 starts at $0.35/hr, A100 at $1.20/hr, H100 at $2.50/hr.",
    "Storage volumes cost $0.03/GB/month, charged weekly. Volumes persist across instance termination.",
    "Auto-recharge tops up credits automatically when your balance drops below a threshold you set.",
    "API keys are scoped to a workspace. The full key is shown only once at creation.",
    "The Models API supports chat, image, video, TTS, and ASR across 140+ open-source models.",
]

# Embed all documents
doc_embeddings = []
for doc in docs:
    resp = client.models.embed(model="BAAI/bge-large-en-v1.5", input=doc)
    doc_embeddings.append(resp.data[0].embedding)
doc_embeddings = np.array(doc_embeddings)


def ask(question: str, top_k: int = 3) -> str:
    # Embed the question
    q_resp = client.models.embed(model="BAAI/bge-large-en-v1.5", input=question)
    q_vec = np.array(q_resp.data[0].embedding)

    # Cosine similarity between the question and every document
    sims = doc_embeddings @ q_vec / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q_vec)
    )
    # Indices of the top_k most similar documents, best first
    top_indices = np.argsort(sims)[-top_k:][::-1]
    context = "\n\n".join(docs[i] for i in top_indices)

    # Generate answer
    response = client.models.chat_completion(
        model="deepseek-ai/DeepSeek-V3",
        messages=[
            {"role": "system", "content": f"Answer using ONLY this context:\n\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


print(ask("How much does an A100 cost per hour?"))
print(ask("What happens when my credits run out?"))
Production tips
- Chunking has a large effect on retrieval quality. Split documents at semantic boundaries (paragraph breaks, headers) rather than at fixed character counts. Aim for 200–500 tokens per chunk.
- Hybrid search (vector similarity combined with BM25 keyword matching) often gives a noticeable quality improvement over pure vector search.
- Reranking with a cross-encoder after initial retrieval is usually a high-return step: retrieve the top 50, rerank to the top 5, and send those to the LLM.
- Include metadata (title, source URL, date) in each chunk so the model can cite sources.
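Two of the tips above can be sketched in plain Python. This is a rough illustration under stated assumptions, not a Runcrate API: `chunk_by_paragraph` and `reciprocal_rank_fusion` are hypothetical helper names, word count stands in for a real token count, and reciprocal rank fusion (RRF) is just one common way to merge a vector ranking with a keyword ranking.

```python
import re


def chunk_by_paragraph(text, title, max_words=300):
    """Group whole paragraphs into chunks of at most max_words words.

    Word count is a rough proxy for tokens (assumption). Each chunk carries
    the document title as metadata so answers can cite their source. A single
    paragraph longer than max_words is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget
        if current and count + n > max_words:
            chunks.append({"title": title, "text": "\n\n".join(current)})
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append({"title": title, "text": "\n\n".join(current)})
    return chunks


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids; each id scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example: split a short document, then fuse a vector ranking with a keyword ranking
chunks = chunk_by_paragraph("one two three\n\nfour five\n\nsix seven eight nine", "Demo", max_words=5)
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
```

The fused list can then be truncated to the top K, or handed to a reranker, before the chunks are placed in the prompt.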