Documentation Index
Fetch the complete documentation index at: https://runcrate.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Run Qwen models on your own GPU. Qwen3 introduces a hybrid thinking mode — the model can reason step-by-step or answer directly, controlled via system prompt.
GPU requirements
| Model | GPU | VRAM needed | Approx. cost |
|---|
| Qwen2.5-7B-Instruct | RTX 4090 (24 GB) | ~14 GB | ~$0.35/hr |
| Qwen3-32B | A100 80 GB | ~64 GB | ~$1.60/hr |
| Qwen2.5-72B-Instruct | A100 80 GB | ~72 GB | ~$1.60/hr |
| Qwen3-235B-A22B (MoE) | 2x H100 80 GB | ~60 GB each | ~$5.00/hr |
Deploy Qwen2.5 7B (RTX 4090)
runcrate instances create --name qwen-7b --gpu RTX4090
runcrate instances status qwen-7b
runcrate ssh qwen-7b -- "pip install vllm"
runcrate ssh qwen-7b -- "nohup python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--max-model-len 8192 \
--port 8000 --host 0.0.0.0 \
> /root/vllm.log 2>&1 &"
Deploy Qwen3 32B (A100)
runcrate instances create --name qwen3-32b --gpu A100
runcrate ssh qwen3-32b -- "pip install vllm"
runcrate ssh qwen3-32b -- "nohup python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-32B \
--max-model-len 8192 \
--port 8000 --host 0.0.0.0 \
> /root/vllm.log 2>&1 &"
Deploy Qwen3 235B MoE (2x H100)
runcrate instances create --name qwen3-235b --gpu H100 --gpu-count 2
runcrate ssh qwen3-235b -- "pip install vllm"
runcrate ssh qwen3-235b -- "nohup python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-235B-A22B \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--trust-remote-code \
--port 8000 --host 0.0.0.0 \
> /root/vllm.log 2>&1 &"
Test the endpoint
runcrate instances info qwen3-32b
curl http://<INSTANCE_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"messages": [{"role": "user", "content": "Compare transformer and mamba architectures."}],
"max_tokens": 512
}'
Enable thinking mode
Add /think to the system prompt for step-by-step reasoning, or /no_think for direct answers:
curl http://<INSTANCE_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"messages": [
{"role": "system", "content": "/think"},
{"role": "user", "content": "What is 27 * 43?"}
],
"max_tokens": 1024
}'
Cleanup
runcrate instances delete qwen-7b
runcrate instances delete qwen3-32b
runcrate instances delete qwen3-235b