brandonbeiler/InternVL3_5-8B-FP8-Dynamic

image text to textenzhsafetensorsinternvl_chatfp8quantizationdynamicvision-languagemit

1

198.7K

🔥 InternVL3_5-8B-FP8-Dynamic 🔥

This is a fp8 dynamic (w8a8) version of OpenGVLab/InternVL3_5-8B, optimized for high-performance inference with vLLM. The model utilizes fp8 dynamic (w8a8) for optimal performance and deployment.

Just Run It (vLLM serve)

You can serve the model using vLLM's OpenAI-compatible API server.

vllm serve brandonbeiler/InternVL3_5-8B-FP8-Dynamic \
    --quantization compressed-tensors \
    --served-model-name internvl3_5-8b \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 # Adjust based on your GPU setup

Notes

32k max context length
reasoning parser ready to go, requires system prompt to run in thinking mode
still investigating tool calling

🚀 Key Features

FP8 Dynamic Quantization: No calibration required, ready to use immediately
Vision-Language Optimized: Specialized quantization recipe that preserves visual understanding
vLLM Ready: Seamless integration with vLLM for production deployment
Memory Efficient: ~50% memory reduction compared to FP16 original
Performance Boost: Significant faster inference on H100/L40S GPUs

📊 Model Details

Original Model: OpenGVLab/InternVL3_5-8B
Source Model: OpenGVLab/InternVL3_5-8B
Quantized Model: InternVL3_5-8B-FP8-Dynamic
Quantization Method: FP8 Dynamic (W8A8)
Quantization Library: LLM Compressor v0.7.1
Quantized by: brandonbeiler

🏗️ Technical Specifications

Hardware Requirements

Inference: ? VRAM (+ VRAM for context)
Supported GPUs: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism)
GPU Architecture: Latest NVIDIA GPUs (Ada Lovelace, Hopper and later) and latest AMD GPUs. Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell)

Quantization Details

Weights: FP8 E4M3 with dynamic per-tensor scales
Activations: FP8 E4M3 with dynamic per-tensor scales
Preserved Components: Vision tower, embeddings, mlp1

🔬 Package Versions

This model was created using:

llmcompressor==0.7.1
compressed-tensors==0.10.2
transformers==4.55.0
torch==2.7.1
vllm==0.10.1.1

Quantized with ❤️ using LLM Compressor for the open-source community

Deploy Model on Runcrate

Run this model on powerful GPU infrastructure. Deploy in 60 seconds.

Pay per second

H100, A100, RTX GPUs

Instant deployment

DEPLOY IN 60 SECONDS

Run InternVL3_5-8B-FP8-Dynamic on Runcrate

Deploy on H100, A100, or RTX GPUs. Pay only for what you use. No setup required.