Download and Serve HuggingFace Models

Download a model from the HuggingFace Hub to a cloud GPU instance and serve it with vLLM behind an OpenAI-compatible API.

1. Deploy an instance

runcrate instances create --name hf-serve --gpu A100 --template ubuntu-devbox
runcrate instances status hf-serve

2. Install tools

runcrate ssh hf-serve -- "pip install 'huggingface_hub[cli]' vllm"

3. Authenticate (for gated models)

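Gated models (for example Llama, Gemma, and some Mistral releases) require a HuggingFace access token, and you must first accept the model's license on its HuggingFace page: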

runcrate ssh hf-serve -- "huggingface-cli login --token hf_YOUR_TOKEN"

4. Download a model

Download to /workspace/ so the model persists if you attach a volume:

runcrate ssh hf-serve -- "huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
  --local-dir /workspace/models/llama-8b"

runcrate ssh hf-serve -- "du -sh /workspace/models/*"

5. Serve with vLLM

runcrate ssh hf-serve -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model /workspace/models/llama-8b \
  --max-model-len 8192 \
  --port 8000 --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
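
vLLM can take a few minutes to load the weights before it accepts requests. The following is a minimal Python sketch for waiting until the server is ready. It assumes the OpenAI-compatible server answers GET /v1/models with HTTP 200 once the model is loaded. The helper names http_ok and wait_until_ready and the retry values are illustrative, not part of runcrate or vLLM.

```python
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, timeouts and refused connections are all OSError subclasses.
        return False


def wait_until_ready(url, probe=http_ok, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll `url` until `probe` succeeds. Return True if ready, False if we gave up."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next try
    return False
```

A call such as wait_until_ready("http://<INSTANCE_IP>:8000/v1/models") would poll for up to about five minutes with these defaults. If it returns False, /root/vllm.log on the instance is the first place to look.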

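Test it. Look up the instance IP, then send a chat completion request, substituting the IP for <INSTANCE_IP>: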

runcrate instances info hf-serve

curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/workspace/models/llama-8b",
    "messages": [{"role": "user", "content": "Hello from HuggingFace."}],
    "max_tokens": 128
  }'
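
The same request can be sent from Python using only the standard library. This is a sketch under the assumption that the server from step 5 is reachable on port 8000. The helper names build_chat_request and chat are illustrative, and <INSTANCE_IP> remains a placeholder.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        # The served model name; by default this is the value passed to --model.
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, max_tokens=128, timeout=60.0):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, chat("http://<INSTANCE_IP>:8000", "/workspace/models/llama-8b", "Hello from HuggingFace.") should return the generated reply as a string. Because the endpoint follows the OpenAI API shape, an OpenAI-compatible client library pointed at the same base URL is another option.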

Persist models with a volume

To avoid re-downloading large models, create a persistent volume and attach it when you create the instance:

runcrate volumes create --name hf-models --size 200
runcrate instances create --name hf-serve --gpu A100 --template ubuntu-devbox --storage hf-models

With the volume attached, models under /workspace/models/ persist across instance restarts.

Download specific file types

runcrate ssh hf-serve -- "huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
  --local-dir /workspace/models/llama-8b \
  --include '*.safetensors' '*.json' 'tokenizer*'"
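
To preview which files a set of --include patterns would keep, the matching can be approximated locally. This is a sketch that assumes the patterns behave like the shell-style globs implemented by Python's fnmatch module. The helper name select_files and the file list are made-up examples, not the actual contents of the repository.

```python
from fnmatch import fnmatch


def select_files(filenames, include):
    """Keep the names that match at least one glob pattern in `include`."""
    return [name for name in filenames if any(fnmatch(name, p) for p in include)]


# Hypothetical repository listing, for illustration only.
example_files = [
    "README.md",
    "config.json",
    "model-00001-of-00004.safetensors",
    "tokenizer.model",
    "tokenizer_config.json",
    "original/consolidated.00.pth",
]

kept = select_files(example_files, ["*.safetensors", "*.json", "tokenizer*"])
print(kept)
```

With this listing, the README and the .pth checkpoint are skipped, which is the point of the filter: vLLM only needs the safetensors weights, the JSON configs, and the tokenizer files.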

Tips

For gated models, create a token at huggingface.co/settings/tokens with read scope.
Cloud download speeds are typically 1-5 GB/min, much faster than most home connections.
A 70B FP16 model is ~140 GB (2 bytes per parameter). An FP8 version is ~70 GB. Before downloading, check that the GPU has enough VRAM for the weights plus the KV cache.
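
The size figures above follow from parameter count times bytes per parameter. Here is a rough sketch of that arithmetic, together with a download-time estimate based on the 1-5 GB/min range. The function names are illustrative, and the estimate covers weights only, so the VRAM actually needed for serving is higher.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weights_gb(num_params, dtype):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def download_minutes(size_gb, gb_per_min):
    """Approximate download time in minutes at a steady transfer rate."""
    return size_gb / gb_per_min


size = weights_gb(70e9, "fp16")
print(size)                         # 140.0
print(weights_gb(70e9, "fp8"))      # 70.0
print(download_minutes(size, 5.0))  # 28.0 (best case at 5 GB/min)
print(download_minutes(size, 1.0))  # 140.0 (worst case at 1 GB/min)
```

By the same arithmetic, the 8B model used in this guide is about 16 GB in FP16.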

Cleanup

runcrate instances delete hf-serve