Download any model from HuggingFace Hub to a cloud GPU and serve it behind an OpenAI-compatible API.
1. Deploy an instance
runcrate instances create --name hf-serve --gpu A100 --template ubuntu-devbox
runcrate instances status hf-serve
2. Install dependencies
runcrate ssh hf-serve -- "pip install 'huggingface_hub[cli]' vllm"
3. Authenticate (for gated models)
Some models (for example Llama, Gemma, and Mistral) are gated and require a HuggingFace access token:
runcrate ssh hf-serve -- "huggingface-cli login --token hf_YOUR_TOKEN"
4. Download a model
Download to /workspace/ so the model persists if you attach a volume:
runcrate ssh hf-serve -- "huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
--local-dir /workspace/models/llama-8b"
runcrate ssh hf-serve -- "du -sh /workspace/models/*"
5. Serve with vLLM
runcrate ssh hf-serve -- "nohup python -m vllm.entrypoints.openai.api_server \
--model /workspace/models/llama-8b \
--max-model-len 8192 \
--port 8000 --host 0.0.0.0 \
> /root/vllm.log 2>&1 &"
Test it:
runcrate instances info hf-serve
curl http://<INSTANCE_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/workspace/models/llama-8b",
"messages": [{"role": "user", "content": "Hello from HuggingFace."}],
"max_tokens": 128
}'
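The server starts in the background and needs time to load the model, so a request sent right after launch may be refused. A minimal sketch of a readiness poll that could run on your local machine before the test request: `wait_for_http` is a hypothetical helper (not part of runcrate or vLLM), and it assumes `curl` is installed and that the server answers on the `/v1/models` route of the OpenAI-compatible API.

```shell
# Poll a URL until it answers with a successful HTTP status,
# or give up after a fixed number of attempts.
# wait_for_http is a hypothetical helper, not a runcrate or vLLM command.
wait_for_http() {
  url=$1
  attempts=${2:-60}   # how many times to try
  delay=${3:-5}       # seconds to sleep between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP errors as failure, -s: quiet,
    # --max-time: do not hang on an unresponsive socket
    if curl -fs --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (replace <INSTANCE_IP> with the address of your instance):
#   wait_for_http "http://<INSTANCE_IP>:8000/v1/models" 60 5 && echo "server is up"
```

If the poll times out, the log file written by the serve command (/root/vllm.log) is the first place to look.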
Persist models with a volume
Avoid re-downloading large models by using a persistent volume:
runcrate volumes create --name hf-models --size 200
runcrate instances create --name hf-serve --gpu A100 --template ubuntu-devbox --storage hf-models
Models at /workspace/models/ persist across instance restarts.
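With a volume attached, a restarted instance can reuse the files already under /workspace/models/. A rough sketch of a guard that could run on the instance before downloading again: `model_is_cached` is a hypothetical helper, and its check (a `config.json` plus at least one `*.safetensors` file) is an assumed heuristic, not an integrity check.

```shell
# Return 0 if DIR looks like a finished download: it holds a config.json
# and at least one *.safetensors file. Heuristic only; no checksums.
model_is_cached() {
  dir=$1
  [ -f "$dir/config.json" ] || return 1
  for f in "$dir"/*.safetensors; do
    # If the glob matched nothing, $f is the literal pattern and -f fails.
    [ -f "$f" ] && return 0
  done
  return 1
}
```

On the instance, this could be chained with the download command, for example `model_is_cached /workspace/models/llama-8b || huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir /workspace/models/llama-8b`.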
Download specific file types
runcrate ssh hf-serve -- "huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
--local-dir /workspace/models/llama-8b \
--include '*.safetensors' '*.json' 'tokenizer*'"
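The `--include` patterns are glob patterns matched against each file path in the repository. As a rough illustration of which files the three patterns above would select, the sketch below mimics that matching with a `case` statement. `would_include` is a hypothetical helper, shell `case` matching is only an approximation of the CLI's filter, and the file names are examples rather than a listing of the actual repository.

```shell
# Return 0 if NAME matches one of the three --include patterns used above.
# would_include is a hypothetical helper for illustration only.
would_include() {
  case $1 in
    *.safetensors|*.json|tokenizer*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example file names (assumed, not read from the repository).
for name in model-00001-of-00004.safetensors config.json tokenizer.model \
            pytorch_model.bin README.md; do
  if would_include "$name"; then
    echo "keep  $name"
  else
    echo "skip  $name"
  fi
done
```

Filtering this way skips duplicate weight formats and documentation files, which can shorten the download noticeably for large models.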
Tips
- For gated models, create a token with read scope at huggingface.co/settings/tokens.
- Cloud download speeds are typically 1-5 GB/min, much faster than most home connections.
- A 70B FP16 model is ~140 GB. An FP8 version is ~70 GB. Check VRAM before downloading.
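The size figures in the tips follow from simple arithmetic: parameter count times bytes per parameter (2 for FP16, 1 for FP8). A small sketch that reproduces them and turns the 1-5 GB/min range into a download-time estimate: `weights_gb` and `download_minutes` are hypothetical helper names, and the result covers weights only, so the KV cache and activations need additional VRAM on top.

```shell
# Approximate weight size in GB: parameters (in billions) x bytes per parameter.
weights_gb() {
  echo $(( $1 * $2 ))
}

# Whole minutes needed to download SIZE_GB at RATE_GB_PER_MIN, rounded up.
download_minutes() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

weights_gb 70 2          # 70B at FP16 (2 bytes/param): prints 140
weights_gb 70 1          # 70B at FP8  (1 byte/param):  prints 70
download_minutes 140 5   # fast end of the quoted range: prints 28
download_minutes 140 1   # slow end of the quoted range: prints 140
```

Comparing the first number with the total VRAM of the instance gives a quick lower bound on whether the model can be served at all.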
Cleanup
runcrate instances delete hf-serve