# Deploy Qwen on a Cloud GPU

> Self-host Qwen3 or Qwen2.5 models with vLLM on a dedicated GPU. Covers model sizes from 7B to 235B.

Run Qwen models on your own GPU with vLLM. Qwen3 adds a hybrid thinking mode: the model can reason step by step before answering or answer directly, and you switch between the two with `/think` and `/no_think` in the prompt.

## GPU requirements

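| Model                 | GPU              | VRAM needed                   | Approx. cost |
| --------------------- | ---------------- | ----------------------------- | ------------ |
| Qwen2.5-7B-Instruct   | RTX 4090 (24 GB) | \~14 GB                       | \~\$0.35/hr  |
| Qwen3-32B             | A100 80 GB       | \~64 GB                       | \~\$1.60/hr  |
| Qwen2.5-72B-Instruct  | A100 80 GB       | \~72 GB (8-bit weights)       | \~\$1.60/hr  |
| Qwen3-235B-A22B (MoE) | 2x H100 80 GB    | \~60 GB each (4-bit weights)  | \~\$5.00/hr  |

The VRAM figures match the size of the model weights alone; the KV cache needs memory on top. The 7B and 32B rows correspond to 16-bit weights (about 2 bytes per parameter). At 16-bit, the 72B model needs roughly 144 GB and the 235B model roughly 470 GB, so the last two rows only hold for quantized checkpoints.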
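
The weight sizes in the table follow from parameter count times bytes per parameter. The helper below is a rough sketch of that arithmetic; `estimate_weights_gb` is a hypothetical name, and the result ignores KV cache, activations, and framework overhead, so treat it as a lower bound rather than a sizing guarantee.

```shell
# Weight-only VRAM estimate: parameters (in billions) x bytes per parameter = GB.
# Does not include KV cache or runtime overhead.
estimate_weights_gb() {
  params_b="$1"        # parameter count in billions, e.g. 32
  bytes_per_param="$2" # 2 for FP16/BF16, 1 for 8-bit, 0.5 for 4-bit
  awk -v p="$params_b" -v b="$bytes_per_param" 'BEGIN { printf "%.1f\n", p * b }'
}

estimate_weights_gb 7 2      # 14.0
estimate_weights_gb 32 2     # 64.0
estimate_weights_gb 72 1     # 72.0
estimate_weights_gb 235 0.5  # 117.5, split across two GPUs
```

Leaving several gigabytes of headroom per GPU for the KV cache is usually necessary, especially at longer `--max-model-len` values.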

***

## Deploy Qwen2.5 7B (RTX 4090)

```bash theme={"theme":"github-dark"}
# Create the instance and check its status
runcrate instances create --name qwen-7b --gpu RTX4090
runcrate instances status qwen-7b

# Install vLLM on the instance
runcrate ssh qwen-7b -- "pip install vllm"

# Start the OpenAI-compatible server in the background; logs go to /root/vllm.log
runcrate ssh qwen-7b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 8192 \
  --port 8000 --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
```

## Deploy Qwen3 32B (A100)

```bash theme={"theme":"github-dark"}
# Create the instance and install vLLM
runcrate instances create --name qwen3-32b --gpu A100
runcrate ssh qwen3-32b -- "pip install vllm"

# Start the OpenAI-compatible server in the background; logs go to /root/vllm.log
runcrate ssh qwen3-32b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-32B \
  --max-model-len 8192 \
  --port 8000 --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
```

## Deploy Qwen3 235B MoE (2x H100)

```bash theme={"theme":"github-dark"}
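# Create a two-GPU instance and install vLLM
runcrate instances create --name qwen3-235b --gpu H100 --gpu-count 2
runcrate ssh qwen3-235b -- "pip install vllm"

# The BF16 weights of Qwen/Qwen3-235B-A22B are roughly 470 GB and do not fit in 2x 80 GB.
# A 4-bit checkpoint is assumed here; confirm it is available before launching.
# --tensor-parallel-size 2 shards the model across both GPUs.
runcrate ssh qwen3-235b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --trust-remote-code \
  --port 8000 --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"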
```

## Test the endpoint

```bash theme={"theme":"github-dark"}
# Show the instance details; substitute its IP address for <INSTANCE_IP> below
runcrate instances info qwen3-32b

curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "messages": [{"role": "user", "content": "Compare transformer and mamba architectures."}],
    "max_tokens": 512
  }'
```
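
The server is started in the background, and loading the weights can take several minutes, so a request sent too early may be refused. A minimal sketch of a readiness check follows. It assumes vLLM's OpenAI-compatible server, which answers `GET /v1/models` once the model is loaded; `wait_for_vllm` and its default timeouts are illustrative choices, not part of the Runcrate CLI.

```shell
# Poll the server until it answers, or give up after a timeout.
wait_for_vllm() {
  base_url="$1"          # e.g. http://<INSTANCE_IP>:8000
  timeout_s="${2:-600}"  # total sleep time budget, in seconds
  interval_s="${3:-5}"   # pause between attempts, in seconds
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each attempt
    if curl -sf --max-time 5 "$base_url/v1/models" > /dev/null 2>&1; then
      echo "ready after ${elapsed}s"
      return 0
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "not ready after ${timeout_s}s" >&2
  return 1
}
```

For example, `wait_for_vllm "http://<INSTANCE_IP>:8000" 900` waits up to about 15 minutes. If it times out, `/root/vllm.log` on the instance is the first place to look.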

## Enable thinking mode

Qwen3 reasons step by step by default. Add `/no_think` to the system prompt or a user message for direct answers, or `/think` to request reasoning explicitly:

```bash theme={"theme":"github-dark"}
curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "messages": [
      {"role": "system", "content": "/think"},
      {"role": "user", "content": "What is 27 * 43?"}
    ],
    "max_tokens": 1024
  }'
```
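
Besides the in-prompt switches, thinking can also be toggled per request through the chat template. The sketch below builds such a request body. It assumes the server is vLLM, which accepts a `chat_template_kwargs` field in the request, and that the Qwen3 chat template reads `enable_thinking`; `qwen3_payload` is a hypothetical helper, and it only escapes backslashes and double quotes in the prompt.

```shell
# Build a chat-completions request body with Qwen3 thinking switched on or off.
# Limitation: newlines and control characters in the prompt are not escaped.
qwen3_payload() {
  model="$1"            # e.g. Qwen/Qwen3-32B
  prompt="$2"           # user message text
  thinking="${3:-true}" # "true" or "false"
  # Escape backslashes first, then double quotes, so the prompt is valid inside a JSON string
  escaped=$(printf '%s' "$prompt" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 1024, "chat_template_kwargs": {"enable_thinking": %s}}\n' \
    "$model" "$escaped" "$thinking"
}

qwen3_payload "Qwen/Qwen3-32B" "What is 27 * 43?" false
```

Piping the output into `curl http://<INSTANCE_IP>:8000/v1/chat/completions -H "Content-Type: application/json" -d @-` sends it to the server. With thinking enabled, expect the reasoning to consume part of `max_tokens` before the final answer appears.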

## Cleanup

```bash theme={"theme":"github-dark"}
runcrate instances delete qwen-7b
runcrate instances delete qwen3-32b
runcrate instances delete qwen3-235b
```
