Download any model from the Hugging Face Hub to a cloud GPU and serve it behind an OpenAI-compatible API.

1. Deploy an instance

runcrate instances create --name hf-serve --gpu A100 --template ubuntu-devbox
runcrate instances status hf-serve

2. Install tools

runcrate ssh hf-serve -- "pip install 'huggingface_hub[cli]' vllm"
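
Before downloading anything, it can be worth confirming that the GPU is visible and that both tools installed cleanly. The following is a minimal sketch, assuming the instance image ships a working NVIDIA driver; the check_tools helper name is hypothetical and not part of the runcrate CLI.

```shell
# Hypothetical helper: confirm the GPU, the Hugging Face CLI, and vLLM are usable.
# Takes the instance name as its only argument.
check_tools() {
  instance="$1"
  # nvidia-smi prints the GPU name and total memory if the driver is working.
  runcrate ssh "$instance" -- "nvidia-smi --query-gpu=name,memory.total --format=csv,noheader" || return 1
  # huggingface-cli is only on PATH if the CLI was installed.
  runcrate ssh "$instance" -- "huggingface-cli --help > /dev/null" || return 1
  # Importing vllm and printing its version catches a broken install early.
  runcrate ssh "$instance" -- "python -c 'import vllm; print(vllm.__version__)'" || return 1
}

# Usage: check_tools hf-serve
```

The helper stops at the first failing check, so a non-zero exit status points at the earliest missing piece.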

3. Authenticate (for gated models)

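Some models (for example Llama, Gemma, and Mistral releases) are gated and require a Hugging Face access token. Request access on the model's Hugging Face page first, then log in with your token: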
runcrate ssh hf-serve -- "huggingface-cli login --token hf_YOUR_TOKEN"

4. Download a model

Download to /workspace/ so the model persists if you attach a volume:
runcrate ssh hf-serve -- "huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
  --local-dir /workspace/models/llama-8b"
runcrate ssh hf-serve -- "du -sh /workspace/models/*"
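
A size listing alone does not show whether the download is complete. As a rough sanity check, the sketch below verifies that a model directory contains a config.json and at least one .safetensors file. The check_model_dir helper is hypothetical and is meant to run in a shell on the instance itself; it does not validate checksums.

```shell
# Hypothetical helper, meant to run on the instance: check that a downloaded
# model directory has a config file and at least one weight file.
check_model_dir() {
  dir="$1"
  [ -f "$dir/config.json" ] || { echo "missing $dir/config.json" >&2; return 1; }
  # Expand the glob into the positional parameters. If nothing matches, the
  # pattern stays literal, so test whether the first entry really exists.
  set -- "$dir"/*.safetensors
  [ -e "$1" ] || { echo "no .safetensors files in $dir" >&2; return 1; }
  echo "ok: $# weight file(s) in $dir"
}

# Usage (on the instance): check_model_dir /workspace/models/llama-8b
```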

5. Serve with vLLM

runcrate ssh hf-serve -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model /workspace/models/llama-8b \
  --max-model-len 8192 \
  --port 8000 --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
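
Because the server is started in the background, the command above returns before the model has finished loading. One way to wait for it is to poll the OpenAI-compatible /v1/models endpoint until it answers. This is a sketch only: the wait_for_vllm helper and the WAIT_SECONDS variable are hypothetical names, and the retry counts are arbitrary defaults.

```shell
# Hypothetical helper: poll the vLLM server until it answers, or give up.
# Arguments: host, port, and the maximum number of attempts (default 60).
wait_for_vllm() {
  host="$1"; port="$2"; tries="${3:-60}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s silences progress output; -f makes curl exit non-zero on HTTP errors.
    if curl -sf "http://$host:$port/v1/models" > /dev/null; then
      echo "server ready after $i retries"
      return 0
    fi
    i=$((i + 1))
    # Pause between attempts; WAIT_SECONDS is an assumed override, default 5.
    sleep "${WAIT_SECONDS:-5}"
  done
  echo "server not ready after $tries attempts; check /root/vllm.log" >&2
  return 1
}

# Usage: wait_for_vllm <INSTANCE_IP> 8000
```

If the helper gives up, the startup log is the first place to look: runcrate ssh hf-serve -- "tail -n 50 /root/vllm.log".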
Test it. Look up the instance's IP address, then substitute it for <INSTANCE_IP> in the request below:
runcrate instances info hf-serve

curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/workspace/models/llama-8b",
    "messages": [{"role": "user", "content": "Hello from HuggingFace."}],
    "max_tokens": 128
  }'
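
The response is a JSON object in the OpenAI chat completion format, with the generated text under choices[0].message.content. To print only that text, the response can be piped through a small filter. The sketch below assumes python3 is available on the machine running curl; the reply_text helper name is hypothetical.

```shell
# Hypothetical helper: read a chat completion response on stdin and print
# only the assistant's reply text.
reply_text() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage: append "| reply_text" to the curl command above, and add -s to curl
# so that progress output does not mix with the reply.
```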

Persist models with a volume

To avoid re-downloading large models, create a persistent volume and attach it when you create the instance:
runcrate volumes create --name hf-models --size 200
runcrate instances create --name hf-serve --gpu A100 --template ubuntu-devbox --storage hf-models
With the volume attached, models under /workspace/models/ persist across instance restarts.

Download specific file types

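Pass --include with one or more glob patterns to download only the files you need, such as the safetensors weights, configs, and tokenizer files:
runcrate ssh hf-serve -- "huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \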
  --local-dir /workspace/models/llama-8b \
  --include '*.safetensors' '*.json' 'tokenizer*'"

Tips

  • For gated models, create a token at huggingface.co/settings/tokens with read scope.
  • Cloud download speeds are typically 1-5 GB/min, usually much faster than a home connection.
  • A 70B model in FP16 needs about 140 GB for its weights alone (2 bytes per parameter); an FP8 version needs about 70 GB. Check available VRAM before downloading.
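
The size figures above follow from simple arithmetic: parameter count times bytes per parameter. As a rough sketch, the hypothetical weights_gb helper below reproduces them. It estimates weights only and ignores the KV cache and activation memory, so actual VRAM use is higher.

```shell
# Rough weight-size estimate: parameters (in billions) x bytes per parameter
# gives gigabytes, since 1e9 parameters x 1 byte = 1 GB.
# Hypothetical helper; covers weights only, not KV cache or activations.
weights_gb() {
  params_billions="$1"
  bytes_per_param="$2"   # 2 for FP16/BF16, 1 for FP8
  echo $((params_billions * bytes_per_param))
}

weights_gb 70 2   # 70B at FP16, prints 140
weights_gb 70 1   # 70B at FP8, prints 70
weights_gb 8 2    # 8B at FP16, prints 16
```

Compare the result with the memory reported on the instance by nvidia-smi --query-gpu=memory.total --format=csv,noheader.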

Cleanup

runcrate instances delete hf-serve