This is a W4A16 quantized version of nomic-ai/nomic-embed-code, runnable with vLLM.
Quantized using AWQ (Activation-aware Weight Quantization) with llm-compressor!
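For serving with vLLM, the snippet below is a minimal sketch. It assumes a recent vLLM release with embedding-model support and that the checkpoint is available locally or on the Hub under the path shown; depending on your vLLM version, the pooling mode may be selected with `task="embed"` or `task="embedding"`.

```python
from vllm import LLM

# Path is illustrative; point it at wherever this checkpoint lives.
llm = LLM(
    model="nomic-embed-code-W4A16-AWQ",
    task="embed",
    trust_remote_code=True,
)

# vLLM's pooling API returns one EmbeddingRequestOutput per prompt.
outputs = llm.embed(["def binary_search(arr, x): ...", "sort a list of integers"])
for out in outputs:
    print(len(out.outputs.embedding))  # embedding dimension
```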
import torch
from transformers import AutoModel, AutoTokenizer

# Load the quantized model and tokenizer
model = AutoModel.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True,
)
model.eval()

# Generate embeddings with attention-mask-aware mean pooling,
# so padding tokens do not skew the average
texts = ["Hello world", "Example text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
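Continuing from the snippet above, embeddings used for retrieval are typically L2-normalized so that a dot product gives cosine similarity:

```python
import torch.nn.functional as F

# L2-normalize, then a matrix product gives pairwise cosine similarities
normalized = F.normalize(embeddings, p=2, dim=1)
print(normalized @ normalized.T)
```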
AWQ (Activation-aware Weight Quantization) is a one-shot weight quantization method that:
- identifies the most salient weight channels by looking at activation magnitudes on a small calibration set, rather than at the weights themselves;
- protects those channels with per-channel scaling instead of keeping them in higher precision, so the whole weight matrix stays uniformly quantized;
- requires no retraining or backpropagation, only a single calibration pass;
- here yields a W4A16 scheme: 4-bit weights with 16-bit activations.
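As a rough illustration of the activation-aware idea (not the actual llm-compressor implementation), the toy PyTorch sketch below scales salient weight columns by their average activation magnitude before 4-bit rounding, then folds the inverse scale into the input side; the shapes and the scaling heuristic are made up for demonstration only.

```python
import torch

torch.manual_seed(0)

# Toy linear layer y = x @ W.T where a few input channels carry large activations.
x = torch.randn(256, 64)
x[:, :4] *= 20.0                       # pretend these 4 channels are "salient"
W = torch.randn(128, 64)

def quantize_w4(w):
    """Naive symmetric 4-bit round-to-nearest, per output channel."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    return (w / scale).round().clamp(-8, 7) * scale

# Plain round-to-nearest ignores which input channels matter.
err_rtn = (x @ quantize_w4(W).T - x @ W.T).pow(2).mean()

# AWQ-style: scale weight columns by activation magnitude before quantizing,
# then divide the input by the same scales so the product is mathematically unchanged.
s = x.abs().mean(dim=0).sqrt()
err_awq = ((x / s) @ quantize_w4(W * s).T - x @ W.T).pow(2).mean()

print(f"RTN MSE: {err_rtn:.3f}   activation-aware MSE: {err_awq:.3f}")
```

Because the salient columns get more of the 4-bit range and their quantization error is divided back down by the same scale, the activation-aware variant typically shows a noticeably lower output error than plain round-to-nearest.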
This quantized model is based on nomic-ai/nomic-embed-code.
If you use this model, please cite the original model and llm-compressor:
@software{llmcompressor,
title = {LLM Compressor},
author = {Neural Magic},
url = {https://github.com/vllm-project/llm-compressor},
year = {2024}
}