Run Ollama on a dedicated cloud GPU instead of your local machine. You get faster inference, room for larger models, and a shared endpoint for your team.

1. Deploy and install

runcrate instances create --name ollama --gpu RTX4090
runcrate instances status ollama

runcrate ssh ollama -- "curl -fsSL https://ollama.com/install.sh | sh"

2. Start the server

runcrate ssh ollama -- "OLLAMA_HOST=0.0.0.0 nohup ollama serve > /root/ollama.log 2>&1 &"
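Because the server is started in the background, it may not be accepting connections the moment the command returns. A minimal sketch of a readiness check follows, assuming the server answers a plain GET on its root URL with HTTP 200 once it is up. The helper name `wait_for_ollama` and its parameters are illustrative, not part of any CLI or library.

```python
import time
import urllib.error
import urllib.request


def wait_for_ollama(base_url, timeout_s=60.0, interval_s=2.0, fetch=None):
    """Poll the server root until it returns HTTP 200 or the timeout expires.

    `fetch` is a hypothetical hook that takes a URL and returns a status code;
    it defaults to a real HTTP GET and exists mainly to make testing easy.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status

    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if fetch(base_url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # connection refused or timed out: server is not up yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_ollama("http://<INSTANCE_IP>:11434")` before pulling models or sending requests avoids racing the server startup.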

3. Pull models

runcrate ssh ollama -- "ollama pull llama3.1:8b"
runcrate ssh ollama -- "ollama pull qwen2.5:7b"
runcrate ssh ollama -- "ollama list"
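The same model list is also available over HTTP, which is handy when you want to check the instance from a script instead of over SSH. This sketch assumes Ollama's `/api/tags` endpoint, which returns a JSON object with a `models` array whose entries carry `name` and `size` (in bytes). The function names are illustrative.

```python
import json
import urllib.request


def summarize_models(tags_payload):
    """Return (name, size in GB) pairs from a parsed /api/tags response."""
    return [
        (m["name"], round(m["size"] / 1e9, 1))
        for m in tags_payload.get("models", [])
    ]


def list_remote_models(base_url):
    """Fetch /api/tags from the server and summarize it."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return summarize_models(json.load(resp))
```

For example, `list_remote_models("http://<INSTANCE_IP>:11434")` should list both models pulled above once the downloads finish.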

4. Test the API

runcrate instances info ollama

curl http://<INSTANCE_IP>:11434/api/chat \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "What is Ollama?"}],
    "stream": false
  }'
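The curl call above can be mirrored in Python with only the standard library. This is a sketch under the assumption that the non-streaming response is a single JSON object with the reply under `message.content`. `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for Ollama's native /api/chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout_s=120):
    """Send one chat turn and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)["message"]["content"]
```

Usage: `print(chat("http://<INSTANCE_IP>:11434", "llama3.1:8b", "What is Ollama?"))`. The generous timeout allows for the model being loaded on the first request.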

5. Use the OpenAI-compatible endpoint

from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain LoRA in one paragraph."}],
)
print(response.choices[0].message.content)
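For interactive use you may prefer to print tokens as they arrive. The sketch below assumes the `client` object created above and uses the OpenAI SDK's `stream=True` option, where each chunk carries an incremental `delta.content`. The generator name `stream_reply` is illustrative.

```python
def stream_reply(client, model, prompt):
    """Yield text fragments from a streamed chat completion.

    `client` is expected to be an OpenAI SDK client pointed at the
    instance's /v1 endpoint, as constructed in the snippet above.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

Usage: `for piece in stream_reply(client, "llama3.1:8b", "Explain LoRA in one paragraph."): print(piece, end="", flush=True)`.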

6. Larger models on A100

For 70B+ models, use an A100 80 GB:

runcrate instances create --name ollama-big --gpu A100
runcrate ssh ollama-big -- "curl -fsSL https://ollama.com/install.sh | sh"
runcrate ssh ollama-big -- "OLLAMA_HOST=0.0.0.0 nohup ollama serve > /root/ollama.log 2>&1 &"
runcrate ssh ollama-big -- "ollama pull llama3.1:70b"
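To see why a 70B model calls for the larger card, a rough estimate of weight memory is parameters times bits per weight divided by 8. The sketch below is back-of-the-envelope arithmetic only: the 20% headroom for KV cache and runtime buffers is an assumption, and real quantized files tend to be somewhat larger than the nominal 4-bit figure.

```python
def estimate_weight_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, vram_gb, headroom=0.2):
    """True if weights plus a fractional headroom fit in vram_gb.

    `headroom` is an assumed allowance for KV cache and runtime buffers,
    not a measured value.
    """
    needed = estimate_weight_gb(params_billions, bits_per_weight) * (1 + headroom)
    return needed <= vram_gb
```

Under these assumptions, `fits(70, 4, 24)` is false for a 24 GB RTX 4090, `fits(70, 4, 80)` is true for an A100 80 GB, and `fits(70, 16, 80)` is false, so a 16-bit 70B model would not fit on a single A100.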

Monitoring

runcrate ssh ollama -- nvidia-smi
runcrate ssh ollama -- "tail -20 /root/ollama.log"
runcrate ssh ollama -- "ollama ps"
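For scripted monitoring, nvidia-smi can emit machine-readable CSV. The sketch below assumes that `runcrate ssh <name> -- <command>` writes only the remote command's output to stdout, which the source does not confirm. The helper names are illustrative.

```python
import subprocess

# nvidia-smi query that prints one "used, total, util" line per GPU.
QUERY = (
    "nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu "
    "--format=csv,noheader,nounits"
)


def parse_gpu_csv(text):
    """Parse lines like '5632, 24564, 37' into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total, util = (int(field) for field in line.split(","))
        rows.append({"used_mib": used, "total_mib": total, "util_pct": util})
    return rows


def gpu_stats(instance):
    """Run the query on the instance over SSH and parse the result."""
    result = subprocess.run(
        ["runcrate", "ssh", instance, "--", QUERY],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Usage: `gpu_stats("ollama")` returns a list with one entry per GPU, which you can log or compare against a memory threshold.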

Tips

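  • Default model tags in the Ollama library are usually 4-bit quantized (Q4). For higher quality, pull a higher-precision tag where the model's tag list offers one (for example, llama3.1:8b-instruct-fp16) if VRAM allows.
  • The first request to a model is slower because Ollama loads it into GPU memory on demand. By default it also unloads a model after about five minutes of inactivity, so the delay can recur.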
  • For production workloads with high concurrency, use vLLM instead.
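One way to soften the slow first request is to preload the model and ask the server to keep it resident. This sketch assumes Ollama's documented behavior that a `/api/generate` request with a model name and no prompt loads the model, and that `keep_alive` accepts a duration string. The helper names and the 30-minute value are illustrative.

```python
import json
import urllib.request


def build_preload_request(base_url, model, keep_alive="30m"):
    """Build a request that loads `model` without generating any text."""
    body = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(base_url, model, keep_alive="30m", timeout_s=300):
    """Send the preload request and return the HTTP status code."""
    req = build_preload_request(base_url, model, keep_alive)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        resp.read()  # drain the short JSON acknowledgement
        return resp.status
```

Usage: `preload("http://<INSTANCE_IP>:11434", "llama3.1:8b")`, followed by `ollama ps` on the instance to confirm the model is loaded.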

Cleanup

runcrate instances delete ollama
runcrate instances delete ollama-big