# Run Ollama on a Cloud GPU

> Deploy Ollama on an RTX 4090, pull models, and serve an OpenAI-compatible API from a cloud GPU.

Run Ollama on a dedicated cloud GPU instead of your local machine. You get faster inference, room for larger models, and a single endpoint your whole team can share.

## 1. Deploy and install

```bash theme={"theme":"github-dark"}
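# Create an RTX 4090 instance and check its status
runcrate instances create --name ollama --gpu RTX4090
runcrate instances status ollama

# Install Ollama on the instance with the official install script
runcrate ssh ollama -- "curl -fsSL https://ollama.com/install.sh | sh"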
```

## 2. Start the server

```bash theme={"theme":"github-dark"}
# Listen on all interfaces (0.0.0.0), run in the background, and log to /root/ollama.log
runcrate ssh ollama -- "OLLAMA_HOST=0.0.0.0 nohup ollama serve > /root/ollama.log 2>&1 &"
```
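The server takes a moment to start listening, so it can help to poll it before sending real requests. The sketch below assumes Ollama's root URL answers with HTTP 200 once the server is up; `http_probe` and `wait_until_ready` are illustrative names, not part of Ollama or the Runcrate CLI.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 10, delay: float = 2.0) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example (replace <INSTANCE_IP> with your instance's address):
# wait_until_ready(lambda: http_probe("http://<INSTANCE_IP>:11434/"))
```

Passing the probe in as a callable keeps the retry loop independent of the network, so the same loop can wrap any other check.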

## 3. Pull models

```bash theme={"theme":"github-dark"}
runcrate ssh ollama -- "ollama pull llama3.1:8b"
runcrate ssh ollama -- "ollama pull qwen2.5:7b"
runcrate ssh ollama -- "ollama list"
```
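`ollama list` has an HTTP counterpart, `GET /api/tags`, which returns the installed models as JSON. As a minimal sketch, the helper below checks such a response for models that are still missing. The function name and the sample body are illustrative, and the sample is trimmed to the single `name` field used here.

```python
import json


def missing_models(tags_json: str, wanted: list[str]) -> list[str]:
    """Return the entries of `wanted` that are absent from an /api/tags response body."""
    installed = {m["name"] for m in json.loads(tags_json).get("models", [])}
    return [name for name in wanted if name not in installed]


# Illustrative response body, not captured from a real server.
sample = '{"models": [{"name": "llama3.1:8b"}, {"name": "qwen2.5:7b"}]}'
print(missing_models(sample, ["llama3.1:8b", "qwen2.5:7b", "llama3.1:70b"]))
# prints ['llama3.1:70b']
```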

## 4. Test the API

```bash theme={"theme":"github-dark"}
# Show instance details; substitute the instance's IP address for <INSTANCE_IP> below
runcrate instances info ollama

curl http://<INSTANCE_IP>:11434/api/chat \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "What is Ollama?"}],
    "stream": false
  }'
```
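The request above sets `"stream": false`, so the reply arrives as one JSON object. With streaming enabled, `/api/chat` instead sends newline-delimited JSON chunks, each carrying a fragment in `message.content`, and the last chunk has `"done": true`. The sketch below reassembles such a stream; `join_stream` is an illustrative helper and the sample chunks are made up.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate message.content across newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # final chunk reached
    return "".join(parts)


# Illustrative chunks, not captured from a real server.
sample = [
    '{"message": {"role": "assistant", "content": "Ollama runs "}, "done": false}',
    '{"message": {"role": "assistant", "content": "LLMs locally."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(join_stream(sample))
# prints Ollama runs LLMs locally.
```

In real use, `lines` would be the decoded lines of the HTTP response body, read as they arrive.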

## 5. Use the OpenAI-compatible endpoint

```python theme={"theme":"github-dark"}
from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain LoRA in one paragraph."}],
)
print(response.choices[0].message.content)
```

## 6. Larger models on A100

For 70B+ models, use an A100 80 GB:

```bash theme={"theme":"github-dark"}
runcrate instances create --name ollama-big --gpu A100
runcrate ssh ollama-big -- "curl -fsSL https://ollama.com/install.sh | sh"
runcrate ssh ollama-big -- "OLLAMA_HOST=0.0.0.0 nohup ollama serve > /root/ollama.log 2>&1 &"
runcrate ssh ollama-big -- "ollama pull llama3.1:70b"
```
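A rough calculation shows why 70B models call for the larger card. The sketch below estimates the size of the weights alone as parameters times bits per weight. It ignores the KV cache and runtime overhead, and real 4-bit formats spend slightly more than 4 bits per weight, so treat the numbers as lower bounds. For comparison, an RTX 4090 has 24 GB of VRAM.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


for name, params in [("8B", 8), ("70B", 70)]:
    for label, bits in [("4-bit", 4), ("fp16", 16)]:
        print(f"{name} {label}: ~{weight_gb(params, bits):.0f} GB")
# prints:
# 8B 4-bit: ~4 GB
# 8B fp16: ~16 GB
# 70B 4-bit: ~35 GB
# 70B fp16: ~140 GB
```

By this estimate a 4-bit 70B model (~35 GB) overflows a 24 GB card but fits in 80 GB, while the fp16 version (~140 GB) would not fit on a single A100 80 GB either.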

## Monitoring

```bash theme={"theme":"github-dark"}
runcrate ssh ollama -- nvidia-smi
runcrate ssh ollama -- "tail -20 /root/ollama.log"
runcrate ssh ollama -- "ollama ps"
```
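For scripted monitoring, `nvidia-smi` can print machine-readable rows, for example with `nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits`. As a minimal sketch, assuming that exact field order, the helper below turns one row into numbers. `parse_gpu_line` is an illustrative name and the sample row is made up.

```python
def parse_gpu_line(line: str) -> dict:
    """Parse one 'used, total, util' CSV row into numbers (MiB, MiB, percent)."""
    used, total, util = (float(field) for field in line.split(","))
    return {
        "used_mib": used,
        "total_mib": total,
        "util_pct": util,
        "mem_pct": round(100 * used / total, 1),  # share of VRAM in use
    }


# Illustrative row, not captured from a real instance.
print(parse_gpu_line("6141, 24564, 37"))
```

A multi-GPU instance prints one row per GPU, so the function would be applied to each line of the output.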

## Tips

* Default tags in the Ollama library are typically 4-bit quantized (Q4). For higher quality, pull an `fp16` tag if the model offers one and VRAM allows.
* The first request after pulling a model is slower, because Ollama loads the model into GPU memory on demand.
* For production workloads with high concurrency, use [vLLM](/examples/vllm-inference-server) instead.
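One way to avoid the slow first request is to load the model ahead of time. The sketch below builds a request body for `/api/chat` with an empty `messages` list, which asks Ollama to load the model without generating text, and a `keep_alive` duration that controls how long the model stays in memory afterwards (Ollama's default is five minutes). `preload_body` and the `"30m"` value are illustrative choices.

```python
import json


def preload_body(model: str, keep_alive: str = "30m") -> bytes:
    """JSON body for /api/chat that loads `model` without generating text."""
    # An empty messages list requests a load only; keep_alive sets how long
    # the model stays in memory after the request.
    return json.dumps({"model": model, "messages": [], "keep_alive": keep_alive}).encode()


print(preload_body("llama3.1:8b").decode())
# prints {"model": "llama3.1:8b", "messages": [], "keep_alive": "30m"}
```

POST the body to `http://<INSTANCE_IP>:11434/api/chat` once after starting the server, and `ollama ps` should then list the model as loaded.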

## Cleanup

```bash theme={"theme":"github-dark"}
runcrate instances delete ollama
runcrate instances delete ollama-big
```
