GadflyII/Qwen3-Coder-Next-NVFP4

Note: If you have a multi-GPU SM120 Blackwell system (RTX 50 / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (pending PR into upstream).

https://github.com/Gadflyii/vllm/tree/main

Qwen3-Coder-Next-NVFP4

NVFP4 quantized version of Qwen/Qwen3-Coder-Next (80B-A3B).

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Coder-Next |
| Architecture | Qwen3NextForCausalLM (hybrid DeltaNet + attention + MoE) |
| Parameters | 80B total, 3B activated per token |
| Experts | 512 total, 10 activated + 1 shared |
| Layers | 48 |
| Context Length | 262,144 tokens (256K) |
| Quantization | NVFP4 (FP4 weights + FP4 activations) |
| Size | ~45 GB (down from ~149 GB BF16, a ~70% reduction) |
| Format | compressed-tensors |
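The routing figures above (10 of 512 routed experts plus one always-on shared expert per MoE layer) can be illustrated with a minimal top-k gating sketch. This is a toy for intuition only, not Qwen's actual implementation; the function and variable names (`top_k_experts`, `gate_logits`) are made up for the example.

```python
import math

NUM_EXPERTS = 512   # routed experts per MoE layer
TOP_K = 10          # routed experts activated per token
# (the shared expert is active for every token in addition to these)

def top_k_experts(gate_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Dummy per-token gate output over 512 experts
gate_logits = [(i * 37 % 97) / 10.0 for i in range(NUM_EXPERTS)]
selected = top_k_experts(gate_logits)
assert len(selected) == TOP_K                          # only 10 experts fire
assert abs(sum(w for _, w in selected) - 1.0) < 1e-9  # weights sum to 1
```

Because only 10 of 512 routed experts run per token, roughly 3B of the 80B parameters are active on any given forward pass.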

Quantization Details

Quantized using llmcompressor 0.9.0.1.

```python
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 2048
DATASET = "HuggingFaceH4/ultrachat_200k"  # train_sft split
moe_calibrate_all_experts = True

# Layers kept in BF16
ignore = [
    "lm_head",
    "re:.*mlp.gate$",               # MoE router gates
    "re:.*mlp.shared_expert_gate$", # Shared expert gates
    "re:.*linear_attn.*",           # DeltaNet linear attention
]
```

Benchmark Results

MMLU-Pro

| Model | Accuracy | Delta |
|---|---|---|
| BF16 | 52.90% | – |
| NVFP4 | 51.27% | −1.63% |

Context Length Testing

Successfully tested up to 128K tokens with an FP8 KV cache; longer contexts were not tested due to insufficient VRAM.

Usage with vLLM

Requires vLLM 0.16.0+ (for NVFP4 support) and Transformers 5.0.0+.

```shell
# vLLM serving
vllm serve GadflyII/Qwen3-Coder-Next-NVFP4 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --kv-cache-dtype fp8
```
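Once serving, vLLM exposes an OpenAI-compatible API (port 8000 by default). A minimal client sketch using only the standard library; the payload follows the standard chat-completions format, and `build_chat_request`/`chat` are helper names made up for this example:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM's default OpenAI-compatible endpoint
MODEL = "GadflyII/Qwen3-Coder-Next-NVFP4"

def build_chat_request(prompt, max_tokens=256):
    """Assemble an OpenAI-style chat-completions payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt):
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Write a Python function that reverses a linked list."))
```

Any OpenAI-compatible client (e.g. the `openai` SDK pointed at `BASE_URL`) works the same way.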

License

Apache 2.0 (same as base model)
