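Train a domain-specific LLM on your own dataset using LoRA (Low-Rank Adaptation). On a dataset of about 1K samples, a single A100 can fine-tune a 7B model in under an hour, or a 70B model in a few hours with QLoRA on the 80GB variant. QLoRA also brings 7B fine-tuning within reach of an RTX 4090 (24GB VRAM).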
What you’ll build
A fine-tuned model adapter that specializes an open-source LLM for your use case — customer support, medical QA, code generation, legal analysis, or anything else. The adapter merges back into the base model and can be served with vLLM.
GPU sizing
| Model Size | Method | GPU | VRAM Needed | Time (1K samples) |
|---|---|---|---|---|
| 7B–8B | QLoRA (4-bit) | RTX 4090 | ~12 GB | ~30 min |
| 7B–8B | LoRA (FP16) | A100 40GB | ~30 GB | ~20 min |
| 13B | QLoRA (4-bit) | RTX 4090 | ~18 GB | ~45 min |
| 70B | QLoRA (4-bit) | A100 80GB | ~48 GB | ~3 hrs |
| 70B | LoRA (FP16) | 2x H100 | ~140 GB | ~2 hrs |
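As a rough cross-check on the VRAM column, the memory taken by the weights alone is the parameter count times the bytes per parameter. The sketch below uses a hypothetical helper (`weight_memory_gb` is not part of any library) and deliberately ignores activations, LoRA gradients, optimizer state, and quantization overhead, which is why the table's figures are generally higher than what it prints.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


# Weights-only estimates for the model sizes and precisions in the table
for size, bits in [(7, 4), (7, 16), (70, 4), (70, 16)]:
    print(f"{size}B @ {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB of weights")
```

Treat these numbers as a lower bound: batch size, sequence length, and the chosen LoRA rank all add to the real footprint.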
LoRA rank selection
| Rank | Use Case |
|---|---|
| 8 | Formatting, tone, and style changes |
| 16–32 | Moderate domain shift (e.g., medical terminology) |
| 64 | Substantial knowledge injection |
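To get a feel for what a rank costs, note that LoRA adds two small matrices to each adapted weight, so a layer mapping `d_in` to `d_out` gains `r * (d_in + d_out)` trainable parameters. The sketch below counts them for the four attention projections. The layer shapes and layer count are assumptions based on the published Llama-3.1-8B configuration, and `lora_param_count` is an illustrative helper, not a PEFT function.

```python
def lora_param_count(rank: int, shapes: list[tuple[int, int]], num_layers: int) -> int:
    """Trainable parameters added by LoRA: r * (d_in + d_out) per adapted weight."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# Assumed (d_in, d_out) shapes for q_proj, k_proj, v_proj, o_proj in Llama-3.1-8B:
# hidden size 4096, with grouped-query attention giving 1024-wide k/v projections.
LLAMA_8B_ATTN = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]

for r in (8, 16, 64):
    count = lora_param_count(r, LLAMA_8B_ATTN, num_layers=32)
    print(f"rank {r}: {count:,} trainable parameters")
```

The count scales linearly with rank, so moving from rank 8 to rank 64 multiplies the adapter size by eight.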
Step-by-step (CLI)
1. Prepare your dataset
Create a JSONL file with your training data:
{"messages": [{"role": "user", "content": "What's your refund policy?"}, {"role": "assistant", "content": "We offer full refunds within 30 days of purchase. After 30 days, we provide store credit."}]}
{"messages": [{"role": "user", "content": "How do I track my order?"}, {"role": "assistant", "content": "Go to Orders in your account dashboard and click on the order number. You'll see real-time tracking."}]}
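Malformed records tend to surface only once training has started, so it may be worth validating the file locally first. The following is a minimal sketch that checks the `messages` structure shown above. The accepted role names and the "last message comes from the assistant" rule are assumptions about what a chat fine-tuning dataset usually needs, not requirements stated by any library.

```python
import json

# Assumed set of acceptable roles for chat-style training data
VALID_ROLES = ("system", "user", "assistant")


def validate_jsonl_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL record (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i} has empty content")
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


def validate_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; blank lines are skipped."""
    report = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if line.strip():
                problems = validate_jsonl_line(line)
                if problems:
                    report[lineno] = problems
    return report
```

Calling `validate_file("train_data.jsonl")` returns an empty dict when every record passes.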
2. Deploy a GPU and upload your data
runcrate instances create --name finetune --gpu A100
# Upload dataset and training script
runcrate cp ./train_data.jsonl finetune:/root/
runcrate cp ./finetune.py finetune:/root/
3. Install dependencies
runcrate ssh finetune -- "pip install torch transformers datasets peft trl accelerate bitsandbytes"
4. Training script (finetune.py)
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
model_id = "meta-llama/Llama-3.1-8B-Instruct"
# QLoRA: load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA config — rank 16 for domain adaptation
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Load dataset
dataset = load_dataset("json", data_files="/root/train_data.jsonl", split="train")
# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="/root/output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
    ),
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model("/root/output/final")
tokenizer.save_pretrained("/root/output/final")
print("Training complete.")
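The batch settings in the script combine into an effective batch size of `per_device_train_batch_size * gradient_accumulation_steps`, which in turn determines how many optimizer steps (and log lines) to expect. The sketch below works through that arithmetic for a 1K-sample dataset. `estimate_steps` is a hypothetical helper, and the step count is approximate because trainers differ in how they round the final partial batch.

```python
import math


def estimate_steps(num_samples: int, per_device_batch: int, grad_accum: int,
                   epochs: int, num_gpus: int = 1) -> tuple[int, int]:
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Values taken from the training script, with an assumed 1,000-sample dataset
effective, total = estimate_steps(1000, per_device_batch=4, grad_accum=4, epochs=3)
print(f"effective batch size: {effective}, approx. optimizer steps: {total}")
```

With `logging_steps=10`, a run of this size would produce a loss entry roughly every ten of those steps.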
5. Run training
runcrate ssh finetune -- "cd /root && mkdir -p output && python -u finetune.py 2>&1 | tee output/training.log"
6. Monitor training
# Check progress
runcrate ssh finetune -- "tail -20 /root/output/training.log"
# Watch GPU utilization
runcrate ssh finetune -- "nvidia-smi"
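For repeated checks, `nvidia-smi` can emit machine-readable output via `--query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The sketch below parses that CSV into per-GPU dictionaries. The `parse_gpu_stats` helper and the sample values are illustrative, not captured from a real run.

```python
def parse_gpu_stats(csv_output: str) -> list[dict]:
    """Parse 'util, mem_used, mem_total' CSV rows (one row per GPU)."""
    stats = []
    for line in csv_output.strip().splitlines():
        util, used, total = (float(field) for field in line.split(","))
        stats.append({
            "util_pct": util,
            "mem_used_mib": used,
            "mem_total_mib": total,
            "mem_pct": round(100 * used / total, 1),
        })
    return stats


# Made-up sample in the format produced by the query flags above
sample = "87, 21504, 40960\n"
for gpu_index, gpu in enumerate(parse_gpu_stats(sample)):
    print(f"GPU {gpu_index}: {gpu['util_pct']:.0f}% busy, {gpu['mem_pct']}% memory used")
```

Low utilization alongside high memory use often points at a data-loading bottleneck rather than a GPU limit.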
7. Download the adapter and clean up
# Download the LoRA adapter
runcrate cp -r finetune:/root/output/final/ ./my-adapter/
# Tear down
runcrate instances delete finetune
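Since deleting the instance discards the remote copy, it can be worth confirming the download is complete before running the tear-down command. The sketch below looks for the files PEFT normally writes when saving an adapter. The file names are assumptions based on PEFT's usual output, and `missing_adapter_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_adapter_files(adapter_dir: str) -> list[str]:
    """List expected adapter files that are absent from adapter_dir."""
    root = Path(adapter_dir)
    missing = []
    if not (root / "adapter_config.json").is_file():
        missing.append("adapter_config.json")
    # Weights are usually saved as safetensors; older setups may write a .bin file
    weight_names = ("adapter_model.safetensors", "adapter_model.bin")
    if not any((root / name).is_file() for name in weight_names):
        missing.append("adapter_model.safetensors (or adapter_model.bin)")
    return missing


if __name__ == "__main__":
    problems = missing_adapter_files("./my-adapter")
    print("adapter looks complete" if not problems else f"missing: {problems}")
```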
8. Merge and serve
After downloading, merge the adapter into the base model locally or on another instance:
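import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_id = "meta-llama/Llama-3.1-8B-Instruct"
# Load the base model in bf16 to match the training compute dtype and keep memory use down
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_model, "./my-adapter")
# Fold the LoRA weights into the base weights and drop the adapter wrappers
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
# Save the tokenizer alongside the weights so the directory can be served directly
AutoTokenizer.from_pretrained(base_id).save_pretrained("./merged-model")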
Then serve with vLLM (see Deploy a vLLM Inference Server):
runcrate ssh server -- "python -m vllm.entrypoints.openai.api_server --model /root/merged-model --port 8000 --host 0.0.0.0"
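Once the server is up, it can be queried through vLLM's OpenAI-compatible chat completions route. The sketch below builds and sends such a request using only the standard library. The helper names are hypothetical, `<instance-ip>` is a placeholder, and the model name assumes vLLM's default of using the `--model` path as the served name.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, user_message: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server; replace the placeholder host):
# print(ask("http://<instance-ip>:8000", "/root/merged-model", "What's your refund policy?"))
```

Comparing replies against a few held-out prompts from the training domain is a quick way to confirm the merge took effect.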
Using the Python SDK
from runcrate import Runcrate
import time
client = Runcrate(api_key="rc_live_...")
# Deploy with all dependencies pre-installed
instance = client.instances.create(
    name="finetune",
    gpu_type="A100",
    gpu_count=1,
    startup_commands=[
        "pip install torch transformers datasets peft trl accelerate bitsandbytes",
    ],
)
# Wait for deployment
while True:
    status = client.instances.get_status(instance.id)
    if status.status == "deployed":
        break
    time.sleep(10)
print(f"Ready — SSH: root@{status.ip}")
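The polling loop above runs indefinitely if the instance never reaches the `deployed` state. One way to bound it is a generic wait helper with a deadline, sketched below. `wait_until` is a hypothetical utility, not part of the Runcrate SDK, and the 15-minute default timeout is an arbitrary assumption.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float = 900,
               interval_s: float = 10) -> bool:
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage with the client and instance from the snippet above:
# ready = wait_until(lambda: client.instances.get_status(instance.id).status == "deployed")
# if not ready:
#     raise TimeoutError("instance did not reach 'deployed' in time")
```

Using `time.monotonic()` keeps the deadline immune to system clock adjustments during the wait.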
Using MCP (via Claude Code / Cursor)
“Spin up an A100 called ‘finetune’. Install torch, transformers, peft, trl, accelerate, and bitsandbytes. Then show me a training script for QLoRA fine-tuning Llama 3.1 8B.”
The agent deploys the instance, installs packages via ssh_execute, and generates the training script for you.