Fine-Tune an LLM with LoRA

Train a domain-specific LLM on your own dataset using LoRA (Low-Rank Adaptation). With QLoRA (LoRA on a 4-bit quantized base model), a single A100 80GB can fine-tune models from 7B up to 70B in a few hours, and 7B fine-tuning fits on an RTX 4090 (24 GB VRAM).

What you’ll build

A fine-tuned model adapter that specializes an open-source LLM for your use case — customer support, medical QA, code generation, legal analysis, or anything else. The adapter merges back into the base model and can be served with vLLM.

GPU sizing

Model Size	Method	GPU	VRAM Needed	Time (1K samples)
7B–8B	QLoRA (4-bit)	RTX 4090	~12 GB	~30 min
7B–8B	LoRA (FP16)	A100 40GB	~30 GB	~20 min
13B	QLoRA (4-bit)	RTX 4090	~18 GB	~45 min
70B	QLoRA (4-bit)	A100 80GB	~48 GB	~3 hrs
70B	LoRA (FP16)	2x H100	~140 GB	~2 hrs
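
As a rough cross-check on the VRAM column, the memory taken by the model weights alone is the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement: the helper name `weight_memory_gb` is made up for illustration, and actual usage also includes activations, adapter gradients, optimizer state, and framework overhead, which is why the table's figures sit at or above these numbers.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# Rows loosely mirror the sizing table; parameter counts are rounded assumptions.
for label, params_billion, bits in [
    ("8B, 4-bit (QLoRA)", 8, 4),
    ("8B, FP16 (LoRA)", 8, 16),
    ("70B, 4-bit (QLoRA)", 70, 4),
    ("70B, FP16 (LoRA)", 70, 16),
]:
    print(f"{label}: ~{weight_memory_gb(params_billion, bits):.1f} GB of weights")
```

The gap between the weight-only estimate and the table grows with batch size and sequence length, so treat the table as the planning number and this calculation as a lower bound.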

LoRA rank selection

Rank	Use Case
8	Formatting, tone, and style changes
16–32	Moderate domain shift (e.g., medical terminology)
64	Substantial knowledge injection
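
To make the rank trade-off concrete, the sketch below counts the trainable parameters LoRA adds at each rank. For every adapted linear layer with shape (d_in, d_out), LoRA trains two small matrices with rank × d_in and d_out × rank entries. The layer shapes and layer count are assumptions chosen to resemble an 8B Llama-style model with grouped-query attention; check your model's config before relying on the exact totals.

```python
def lora_param_count(rank: int, layer_shapes: list, num_layers: int) -> int:
    """Trainable LoRA parameters: rank * (d_in + d_out) per adapted layer."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_block * num_layers


# ASSUMED (d_in, d_out) shapes for q_proj, k_proj, v_proj, o_proj:
# hidden size 4096, with k/v projecting down to 1024.
ASSUMED_ATTENTION_SHAPES = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]
ASSUMED_NUM_LAYERS = 32  # assumed number of transformer blocks

for rank in (8, 16, 32, 64):
    count = lora_param_count(rank, ASSUMED_ATTENTION_SHAPES, ASSUMED_NUM_LAYERS)
    print(f"rank {rank}: {count:,} trainable parameters")
```

Adapter size scales linearly with rank, so moving from rank 8 to rank 64 trains eight times as many parameters while still touching only a small fraction of the base model.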

Step-by-step (CLI)

1. Prepare your dataset

Create a JSONL file with your training data:

{"messages": [{"role": "user", "content": "What's your refund policy?"}, {"role": "assistant", "content": "We offer full refunds within 30 days of purchase. After 30 days, we provide store credit."}]}
{"messages": [{"role": "user", "content": "How do I track my order?"}, {"role": "assistant", "content": "Go to Orders in your account dashboard and click on the order number. You'll see real-time tracking."}]}
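
A malformed line in the JSONL file usually surfaces only once training starts on the GPU, so it can be worth validating the file locally first. The following is a minimal sketch, not part of any library: the function names and the set of allowed roles are assumptions based on the example records above.

```python
import json

# Assumed set of roles; the examples above only use "user" and "assistant".
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    # Each training example should end with the reply the model is meant to learn.
    if not problems and messages[-1]["role"] != "assistant":
        problems.append("last message should come from the assistant")
    return problems


def validate_file(path: str) -> int:
    """Print the problems found on each line and return the number of bad records."""
    bad = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            problems = validate_record(line)
            if problems:
                bad += 1
                print(f"line {lineno}: {'; '.join(problems)}")
    return bad
```

Calling `validate_file("train_data.jsonl")` before uploading returns 0 when every record passes these checks.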

2. Deploy a GPU and upload your data

runcrate instances create --name finetune --gpu A100

# Upload dataset and training script (finetune.py is shown in step 4)
runcrate cp ./train_data.jsonl finetune:/root/
runcrate cp ./finetune.py finetune:/root/

3. Install dependencies

runcrate ssh finetune -- "pip install torch transformers datasets peft trl accelerate bitsandbytes"

4. Training script (`finetune.py`)

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# QLoRA: load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Reuse EOS as the padding token
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the 4-bit model for training (PEFT helper; enables gradient checkpointing by default)
model = prepare_model_for_kbit_training(model)

# LoRA config: rank 16 for domain adaptation, applied to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load dataset
dataset = load_dataset("json", data_files="/root/train_data.jsonl", split="train")

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="/root/output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
    ),
    processing_class=tokenizer,
)

trainer.train()
trainer.save_model("/root/output/final")
tokenizer.save_pretrained("/root/output/final")
print("Training complete.")
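
The batch settings in the script determine how many optimizer steps the run takes, which in turn tells you how many log lines to expect with `logging_steps=10`. The sketch below works through that arithmetic for an assumed dataset of 1,000 samples on one GPU (matching the "1K samples" column in the sizing table). The helper name is hypothetical, and the trainer may round the last partial batch differently, so treat the step counts as approximate.

```python
import math


def training_schedule(num_samples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_gpus=1):
    """Approximate optimizer-step counts for a run (rounding partial batches up)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
    }


# Values taken from the SFTConfig above; num_samples=1000 is an assumption.
schedule = training_schedule(
    num_samples=1000,
    per_device_batch_size=4,
    grad_accum_steps=4,
    num_epochs=3,
)
print(schedule)
```

With these settings each optimizer step sees 16 examples, so raising `gradient_accumulation_steps` is a way to keep the effective batch size constant when you have to lower `per_device_train_batch_size` to fit in VRAM.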

5. Run training

runcrate ssh finetune -- "cd /root && mkdir -p output && python finetune.py 2>&1 | tee output/training.log"

6. Monitor training

# Check progress
runcrate ssh finetune -- "tail -20 /root/output/training.log"

# Watch GPU utilization
runcrate ssh finetune -- "nvidia-smi"
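
For a machine-readable view of GPU load, `nvidia-smi` can emit CSV through its `--query-gpu` and `--format` options. The sketch below, meant to run on the instance itself, parses that output into dictionaries. The function names are illustrative, and the parser assumes every field is a plain integer, which may not hold on hardware that reports `[N/A]` for some values.

```python
import subprocess

# Query utilization (%) and memory (MiB) as bare CSV rows, one per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Parse rows like '87, 11853, 24564' into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({
            "util_percent": util,
            "memory_used_mib": used,
            "memory_total_mib": total,
        })
    return stats


def read_gpu_stats() -> list:
    """Run nvidia-smi once and return the parsed stats (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Low utilization alongside high memory use often points to a data-loading bottleneck rather than a GPU that is too small.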

7. Download the adapter and clean up

# Download the LoRA adapter
runcrate cp -r finetune:/root/output/final/ ./my-adapter/

# Tear down
runcrate instances delete finetune

8. Merge and serve

After downloading, merge the adapter into the base model locally or on another instance:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B-Instruct"

# Load the base model unquantized in bf16 so the adapter merges into regular weights
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_model, "./my-adapter")
merged = model.merge_and_unload()

# Save the merged weights and the tokenizer into the same directory for serving
merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("./merged-model")

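Then copy the merged model to a serving instance (the command below assumes one named server, with the model at /root/merged-model) and serve it with vLLM (see Deploy a vLLM Inference Server):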

runcrate ssh server -- "python -m vllm.entrypoints.openai.api_server --model /root/merged-model --port 8000 --host 0.0.0.0"
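
Once the server is running, it accepts OpenAI-style chat requests at `/v1/chat/completions`. The sketch below builds and sends such a request using only the standard library. The function names are illustrative, the base URL is a placeholder you would replace with your instance's address, and the model name assumes vLLM's default of serving the model under the path passed to `--model`.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, user_message: str) -> str:
    """Send one chat message and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `ask("http://<instance-ip>:8000", "/root/merged-model", "What's your refund policy?")` should return an answer in the style of your training data if the fine-tune took effect.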

Using the Python SDK

from runcrate import Runcrate
import time

client = Runcrate(api_key="rc_live_...")

# Deploy with all dependencies pre-installed
instance = client.instances.create(
    name="finetune",
    gpu_type="A100",
    gpu_count=1,
    startup_commands=[
        "pip install torch transformers datasets peft trl accelerate bitsandbytes",
    ],
)

# Wait for deployment
while True:
    status = client.instances.get_status(instance.id)
    if status.status == "deployed":
        break
    time.sleep(10)

print(f"Ready — SSH: root@{status.ip}")
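
The polling loop above runs forever if the instance never reaches the deployed state. One way to guard against that is a small generic helper with a deadline, sketched below. `wait_until` is not part of the SDK; it is a hypothetical utility that wraps any check function, and the default timeout and interval are arbitrary choices.

```python
import time


def wait_until(check, timeout_s: float = 900.0, interval_s: float = 10.0):
    """Call check() until it returns a truthy value; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(interval_s)


# Usage with the client and instance from the snippet above:
#
# def deployed():
#     status = client.instances.get_status(instance.id)
#     return status if status.status == "deployed" else None
#
# status = wait_until(deployed)
# print(f"Ready, SSH: root@{status.ip}")
```

Returning the status object from the check lets the caller use it directly once the wait finishes, without a second API call.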

Using MCP (via Claude Code / Cursor)

“Spin up an A100 called ‘finetune’. Install torch, transformers, datasets, peft, trl, accelerate, and bitsandbytes. Then show me a training script for QLoRA fine-tuning Llama 3.1 8B.”

The agent deploys the instance, installs packages via ssh_execute, and generates the training script for you.
