# Download and Serve HuggingFace Models

> Deploy a GPU instance, download models from HuggingFace Hub, and serve them with vLLM.

Download any model from HuggingFace Hub to a cloud GPU and serve it behind an OpenAI-compatible API.

## 1. Deploy an instance

```bash theme={"theme":"github-dark"}
runcrate instances create --name hf-serve --gpu A100 --template ubuntu-devbox
runcrate instances status hf-serve
```

## 2. Install tools

```bash theme={"theme":"github-dark"}
runcrate ssh hf-serve -- "pip install 'huggingface_hub[cli]' vllm"
```

## 3. Authenticate (for gated models)

Gated models (for example Llama, Gemma, and some Mistral releases) require you to accept the license on the model page and authenticate with a HuggingFace access token:

```bash theme={"theme":"github-dark"}
runcrate ssh hf-serve -- "huggingface-cli login --token hf_YOUR_TOKEN"
```

## 4. Download a model

Download to `/workspace/` so the model persists if you attach a volume:

```bash theme={"theme":"github-dark"}
runcrate ssh hf-serve -- "huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
  --local-dir /workspace/models/llama-8b"
```

Confirm the download and check its size on disk:

```bash theme={"theme":"github-dark"}
runcrate ssh hf-serve -- "du -sh /workspace/models/*"
```

## 5. Serve with vLLM

```bash theme={"theme":"github-dark"}
runcrate ssh hf-serve -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model /workspace/models/llama-8b \
  --max-model-len 8192 \
  --port 8000 --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
```
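
vLLM needs some time to load the weights before it accepts requests, so the first `curl` can fail if it is sent too early. A minimal sketch of a readiness check, assuming the server exposes the OpenAI-style `/v1/models` endpoint on the port chosen above; `wait_for_vllm` is a hypothetical helper name, not a `runcrate` or vLLM command, and the retry count and delay are arbitrary defaults:

```bash
# wait_for_vllm HOST [PORT] [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls the model-list endpoint until it answers.
wait_for_vllm() {
  local host="$1" port="${2:-8000}" attempts="${3:-60}" delay="${4:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s hides progress output.
    if curl -sf "http://${host}:${port}/v1/models" > /dev/null; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "server did not become ready" >&2
  return 1
}
```

Call it as `wait_for_vllm <INSTANCE_IP> 8000` before sending requests. If it times out, `/root/vllm.log` on the instance is the first place to look.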

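Find the instance's IP address in the `runcrate instances info` output, then send a test request. By default, the `model` field must match the path passed to `--model`: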

```bash theme={"theme":"github-dark"}
runcrate instances info hf-serve

curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/workspace/models/llama-8b",
    "messages": [{"role": "user", "content": "Hello from HuggingFace."}],
    "max_tokens": 128
  }'
```
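
The response is a JSON object in the OpenAI chat-completions shape, with the generated text under `choices[0].message.content`. A small sketch for pulling out just the text, assuming `python3` is available wherever you run `curl`; `extract_reply` is a hypothetical helper name:

```bash
# Hypothetical helper: reads a chat-completions JSON response on stdin
# and prints the first choice's message text.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}
```

Pipe the `curl` command above into it (adding `-s` to silence progress output) to print only the model's reply.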

## Persist models with a volume

Avoid re-downloading large models by using a persistent volume:

```bash theme={"theme":"github-dark"}
runcrate volumes create --name hf-models --size 200
runcrate instances create --name hf-serve --gpu A100 --template ubuntu-devbox --storage hf-models
```

Models at `/workspace/models/` persist across instance restarts.
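
With a volume attached, a fresh instance may already hold the model, so the download step can be skipped. A rough sketch of such a check, meant to run on the instance itself; `model_present` is a hypothetical helper, and treating "a `config.json` plus at least one `.safetensors` file" as a complete download is an assumption, not a guarantee that every shard is there:

```bash
# Hypothetical helper: succeeds when DIR holds a config.json and at least
# one .safetensors weight file.
model_present() {
  local dir="$1" f
  [ -f "$dir/config.json" ] || return 1
  for f in "$dir"/*.safetensors; do
    # If the glob matched nothing, $f is the literal pattern and -f fails.
    [ -f "$f" ] && return 0
  done
  return 1
}
```

In an SSH session on the instance, `model_present /workspace/models/llama-8b || huggingface-cli download ...` would then download only when the directory looks empty or incomplete.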

## Download specific file types

Use `--include` with glob patterns to fetch only the files you need:

```bash theme={"theme":"github-dark"}
runcrate ssh hf-serve -- "huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
  --local-dir /workspace/models/llama-8b \
  --include '*.safetensors' '*.json' 'tokenizer*'"
```
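
To get a feel for which files a set of `--include` patterns would keep, the matching can be previewed locally. This is only an approximation: it uses the shell's own glob matching in a `case` statement rather than the Hub client's matcher. `would_include` is a hypothetical helper, and the file names in the usage note are illustrative:

```bash
# would_include FILENAME PATTERN...
# Hypothetical helper: succeeds if FILENAME matches any of the glob patterns.
would_include() {
  local name="$1" pattern
  shift
  for pattern in "$@"; do
    # Leaving $pattern unquoted makes `case` treat it as a glob.
    case "$name" in
      $pattern) return 0 ;;
    esac
  done
  return 1
}
```

For example, `would_include tokenizer.json '*.safetensors' '*.json' 'tokenizer*'` succeeds, while `would_include README.md '*.safetensors' '*.json' 'tokenizer*'` fails, so a README would be skipped by the patterns above.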

## Tips

* For gated models, create a token at `huggingface.co/settings/tokens` with `read` scope.
* Cloud download speeds are typically 1-5 GB/min, much faster than most home connections.
* A 70B FP16 model is \~140 GB (2 bytes per parameter). An FP8 version is \~70 GB (1 byte per parameter). Check that the weights fit in VRAM before downloading.
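
The size figures above come from multiplying the parameter count by the bytes per parameter. A back-of-the-envelope sketch of that arithmetic; `weights_gb` is a hypothetical helper, and the result covers the weights only, so the KV cache and activations need VRAM on top of it:

```bash
# weights_gb PARAMS_IN_BILLIONS BYTES_PER_PARAM
# Hypothetical helper: approximate weight size in GB (weights only).
weights_gb() {
  local params_billion="$1" bytes_per_param="$2"
  echo $(( params_billion * bytes_per_param ))
}

weights_gb 70 2   # FP16, 2 bytes per parameter: prints 140
weights_gb 70 1   # FP8, 1 byte per parameter: prints 70
weights_gb 8 2    # an 8B model in FP16: prints 16
```

By this estimate, the 8B model used in this guide needs roughly 16 GB for its FP16 weights, while a 70B FP16 model exceeds the memory of a single A100 (40 GB or 80 GB).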

## Cleanup

```bash theme={"theme":"github-dark"}
runcrate instances delete hf-serve
```
