RedHatAI/Mistral-7B-Instruct-v0.3-GPTQ-4bit


Model Card for Mistral-7B-Instruct-v0.3 quantized to 4-bit weights

  • Weight-only quantization of Mistral-7B-Instruct-v0.3 via GPTQ to 4 bits with group_size=128 (see the sketch after this list)
  • GPTQ optimized for 99.75% accuracy recovery relative to the unquantized model
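
For reference, this style of 4-bit, group_size=128 GPTQ export can be produced with the AutoGPTQ library. The sketch below is illustrative only: the one-sentence calibration example is a placeholder, not the calibration set actually used for this checkpoint.

```python
# Sketch: weight-only 4-bit GPTQ quantization with AutoGPTQ.
# The calibration data here is a placeholder, not this model's actual set.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # group size stated above
)

# AutoGPTQ consumes tokenized calibration examples.
examples = [tokenizer("vLLM serves large language models with high throughput.")]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized("Mistral-7B-Instruct-v0.3-GPTQ-4bit")
```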

Open LLM Leaderboard evaluation scores

| Benchmark           | Mistral-7B-Instruct-v0.3 | Mistral-7B-Instruct-v0.3-GPTQ-4bit (this model) |
|---------------------|--------------------------|--------------------------------------------------|
| arc-c (25-shot)     | 63.48                    | 63.40                                            |
| mmlu (5-shot)       | 61.13                    | 60.89                                            |
| hellaswag (10-shot) | 84.49                    | 84.04                                            |
| winogrande (5-shot) | 79.16                    | 79.08                                            |
| gsm8k (5-shot)      | 43.37                    | 45.41                                            |
| truthfulqa (0-shot) | 59.65                    | 57.48                                            |
| Average accuracy    | 65.21                    | 65.05                                            |
| Recovery            | 100%                     | 99.75%                                           |
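
Scores like these can be approximately reproduced with EleutherAI's lm-evaluation-harness. The sketch below uses its Python API; the task names (e.g. arc_challenge, truthfulqa_mc2) and few-shot counts are inferred from the table, and the exact Open LLM Leaderboard harness settings may differ.

```python
# Sketch: re-running the benchmarks above with lm-evaluation-harness
# (pip install lm-eval). Task names mirror the table; leaderboard
# settings may differ from these assumptions.
import lm_eval

TASKS = {
    "arc_challenge": 25,
    "mmlu": 5,
    "hellaswag": 10,
    "winogrande": 5,
    "gsm8k": 5,
    "truthfulqa_mc2": 0,
}

for task, shots in TASKS.items():
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit",
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, out["results"][task])
```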

vLLM Inference Performance

This model is ready for optimized inference using the Marlin mixed-precision kernels in vLLM: https://github.com/vllm-project/vllm

Simply start this model as an inference server with:

```bash
python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit
```
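
The server exposes an OpenAI-compatible API, by default at http://localhost:8000/v1. A minimal client sketch follows, assuming that default address; the prompt is illustrative.

```python
# Sketch: querying the vLLM OpenAI-compatible server started above.
# Assumes the default address http://localhost:8000/v1; vLLM ignores
# the API key, so any placeholder value works.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit",
    messages=[{"role": "user", "content": "Explain GPTQ quantization in one sentence."}],
)
print(response.choices[0].message.content)
```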
