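Train a domain-specific LLM on your own dataset using LoRA (Low-Rank Adaptation). For a 1K-sample dataset, a single A100 can fine-tune a 7B–8B model in well under an hour and, with QLoRA on the 80GB card, a 70B model in a few hours. QLoRA also brings 7B–8B fine-tuning within reach of an RTX 4090 (24 GB VRAM).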

What you’ll build

A fine-tuned model adapter that specializes an open-source LLM for your use case — customer support, medical QA, code generation, legal analysis, or anything else. The adapter merges back into the base model and can be served with vLLM.

GPU sizing

| Model Size | Method | GPU | VRAM Needed | Time (1K samples) |
|---|---|---|---|---|
| 7B–8B | QLoRA (4-bit) | RTX 4090 | ~12 GB | ~30 min |
| 7B–8B | LoRA (FP16) | A100 40GB | ~30 GB | ~20 min |
| 13B | QLoRA (4-bit) | RTX 4090 | ~18 GB | ~45 min |
| 70B | QLoRA (4-bit) | A100 80GB | ~48 GB | ~3 hrs |
| 70B | LoRA (FP16) | 2x H100 | ~140 GB | ~2 hrs |
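
As a rough cross-check on the table, the memory taken by the weights alone is the parameter count times the bytes per parameter. The sketch below assumes decimal gigabytes and deliberately ignores activations, optimizer state for the adapter, and quantization overhead, which is why the table's figures sit above the raw weight size. `weight_memory_gb` is a hypothetical helper written for this page, not part of any library.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in decimal GB."""
    bytes_per_param = bits_per_param / 8
    # (params_billion * 1e9 params) * bytes_per_param / (1e9 bytes per GB)
    return params_billion * bytes_per_param


if __name__ == "__main__":
    for size, bits in [(8, 4), (8, 16), (70, 4), (70, 16)]:
        print(f"{size}B @ {bits}-bit: ~{weight_memory_gb(size, bits):.0f} GB of weights")
```

The 70B FP16 case comes out at 140 GB of weights, matching the last row of the table, so treat that row as a tight fit rather than a comfortable one.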

LoRA rank selection

| Rank | Use Case |
|---|---|
| 8 | Formatting, tone, and style changes |
| 16–32 | Moderate domain shift (e.g., medical terminology) |
| 64 | Substantial knowledge injection |
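
To get a feel for what each rank costs, it may help to count the parameters LoRA adds: for an adapted weight matrix with `in` inputs and `out` outputs, rank `r` adds `r * (in + out)` trainable values. The sketch below assumes the Llama 3.1 8B attention shapes (32 layers, hidden size 4096, 1024-wide key/value projections) and adapters on the four attention projections only; check these shapes against your model's config before relying on the numbers.

```python
# Assumed Llama-3.1-8B shapes: 32 layers, hidden size 4096, and
# grouped-query attention with a 1024-wide key/value projection.
NUM_LAYERS = 32
PROJ_SHAPES = {  # module name -> (in_features, out_features)
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}


def lora_trainable_params(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for each adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJ_SHAPES.values())
    return NUM_LAYERS * per_layer


if __name__ == "__main__":
    for r in (8, 16, 32, 64):
        print(f"rank {r}: {lora_trainable_params(r):,} trainable parameters")
```

Under these assumptions the adapter size scales linearly with rank, so going from rank 8 to rank 64 multiplies the trainable parameters by eight.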

Step-by-step (CLI)

1. Prepare your dataset

Create a JSONL file with your training data:
{"messages": [{"role": "user", "content": "What's your refund policy?"}, {"role": "assistant", "content": "We offer full refunds within 30 days of purchase. After 30 days, we provide store credit."}]}
{"messages": [{"role": "user", "content": "How do I track my order?"}, {"role": "assistant", "content": "Go to Orders in your account dashboard and click on the order number. You'll see real-time tracking."}]}
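
Malformed lines tend to surface only once training has started, so a quick local check before uploading can save a GPU session. The following is a minimal sketch that assumes the `messages` layout shown above and the roles `system`, `user`, and `assistant`; `validate_chat_jsonl` is a hypothetical helper, and the checks are illustrative rather than exhaustive.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples; empty means nothing was flagged."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        for message in messages:
            if not isinstance(message, dict):
                problems.append((number, "message is not an object"))
            elif message.get("role") not in ALLOWED_ROLES:
                problems.append((number, f"unexpected role: {message.get('role')!r}"))
            elif not isinstance(message.get("content"), str):
                problems.append((number, "content is not a string"))
    return problems


if __name__ == "__main__":
    sample = [
        '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}',
        '{"messages": []}',
    ]
    for number, problem in validate_chat_jsonl(sample):
        print(f"line {number}: {problem}")
```

Because the function only iterates over lines, an open file handle for `train_data.jsonl` can be passed in place of the sample list.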

2. Deploy a GPU and upload your data

runcrate instances create --name finetune --gpu A100

# Upload the dataset and the training script (finetune.py, listed in step 4)
runcrate cp ./train_data.jsonl finetune:/root/
runcrate cp ./finetune.py finetune:/root/

3. Install dependencies

runcrate ssh finetune -- "pip install torch transformers datasets peft trl accelerate bitsandbytes"

4. Training script (finetune.py)

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# QLoRA: load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA config — rank 16 for domain adaptation
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load dataset
dataset = load_dataset("json", data_files="/root/train_data.jsonl", split="train")

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="/root/output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
    ),
    processing_class=tokenizer,
)

trainer.train()
trainer.save_model("/root/output/final")
tokenizer.save_pretrained("/root/output/final")
print("Training complete.")
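
The batch settings in the script interact: each optimizer step sees `per_device_train_batch_size * gradient_accumulation_steps` samples per GPU. The sketch below works through that arithmetic for a 1K-sample dataset. It is an approximation under the stated assumptions (one GPU, one training sequence per sample), and the trainer's own step count may differ slightly depending on how it rounds the last partial batch.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective_batch_size, approximate_optimizer_steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


if __name__ == "__main__":
    batch, steps = training_schedule(1000, per_device_batch=4, grad_accum=4, epochs=3)
    print(f"effective batch size: {batch}, roughly {steps} optimizer steps")
```

If memory is tight, halving `per_device_train_batch_size` and doubling `gradient_accumulation_steps` keeps the effective batch size the same.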

5. Run training

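# Mirror the output into a log file so progress can be checked from another terminal
runcrate ssh finetune -- "cd /root && mkdir -p output && python finetune.py 2>&1 | tee output/training.log"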

6. Monitor training

# Check progress
runcrate ssh finetune -- "tail -20 /root/output/training.log"

# Watch GPU utilization
runcrate ssh finetune -- "nvidia-smi"
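
For repeated checks, `nvidia-smi` can emit machine-readable output with `--query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The sketch below parses that output into dictionaries. The sample line is made up for illustration, `parse_gpu_stats` is a hypothetical helper, and the parser assumes every field is a plain integer, so it would need extra handling on GPUs that report a field as not available.

```python
def parse_gpu_stats(csv_text):
    """Parse lines of 'utilization.gpu, memory.used, memory.total' (no header, no units)."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


if __name__ == "__main__":
    sample = "97, 21540, 81920\n"  # made-up values for illustration only
    for index, gpu in enumerate(parse_gpu_stats(sample)):
        share = 100 * gpu["mem_used_mib"] / gpu["mem_total_mib"]
        print(f"GPU {index}: {gpu['util_pct']}% busy, {share:.0f}% of memory in use")
```

Sustained low utilization during training usually points at a data-loading or batch-size bottleneck rather than at the GPU itself.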

7. Download the adapter and clean up

# Download the LoRA adapter
runcrate cp -r finetune:/root/output/final/ ./my-adapter/

# Tear down
runcrate instances delete finetune

8. Merge and serve

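After downloading, merge the adapter into the base model locally or on another instance:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B-Instruct"

# Load the original (unquantized) base model, then attach the downloaded adapter
base_model = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base_model, "./my-adapter")

# Fold the LoRA weights into the base weights and drop the adapter wrappers
merged = model.merge_and_unload()

# Save the merged weights and the tokenizer side by side so the directory is self-contained
merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("./merged-model")

Then copy ./merged-model to /root/merged-model on an inference instance (named server in this example) and serve it with vLLM (see Deploy a vLLM Inference Server):

runcrate ssh server -- "python -m vllm.entrypoints.openai.api_server --model /root/merged-model --port 8000 --host 0.0.0.0"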
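
Once the server is up, it can be queried through its OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the server is reachable on port 8000 and registers the model under the path passed to `--model`; the helper names and the base URL in the usage note are placeholders, not Runcrate or vLLM APIs.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, max_tokens=256):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, model, user_message):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

A call would look like `ask("http://<instance-ip>:8000", "/root/merged-model", "What's your refund policy?")`, with the instance's address substituted in.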

Using the Python SDK

from runcrate import Runcrate
import time

client = Runcrate(api_key="rc_live_...")

# Deploy with all dependencies pre-installed
instance = client.instances.create(
    name="finetune",
    gpu_type="A100",
    gpu_count=1,
    startup_commands=[
        "pip install torch transformers datasets peft trl accelerate bitsandbytes",
    ],
)

# Wait for deployment
while True:
    status = client.instances.get_status(instance.id)
    if status.status == "deployed":
        break
    time.sleep(10)

print(f"Ready — SSH: root@{status.ip}")
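
The polling loop above runs until the instance reports `deployed`, so a failed deployment would leave it waiting indefinitely. One way to bound the wait is a small generic helper like the sketch below. `wait_until` is hypothetical and knows nothing about the SDK; it only calls whatever check function it is given, and the default timeout is an arbitrary choice.

```python
import time


def wait_until(check, timeout_s=900, interval_s=10, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a truthy value or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        sleep(interval_s)


# Usage with the client and instance from the example above:
#
#   def deployed():
#       status = client.instances.get_status(instance.id)
#       return status if status.status == "deployed" else None
#
#   status = wait_until(deployed)
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test without real delays.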

Using MCP (via Claude Code / Cursor)

“Spin up an A100 called ‘finetune’. Install torch, transformers, peft, trl, accelerate, and bitsandbytes. Then show me a training script for QLoRA fine-tuning Llama 3.1 8B.”
The agent deploys the instance, installs packages via ssh_execute, and generates the training script for you.