This is a W4A16 quantized version of nomic-ai/nomic-embed-code, runnable with vLLM.
Quantized using AWQ (Activation-aware Weight Quantization) with llm-compressor!
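For serving with vLLM, the snippet below is a minimal sketch. It assumes a recent vLLM release with embedding-model support and that the checkpoint is available locally or on the Hub under the path shown; depending on your vLLM version, the pooling mode may be selected with `task="embed"` or `task="embedding"`.

```python
from vllm import LLM

# Path is illustrative; point it at wherever this checkpoint lives.
llm = LLM(
    model="nomic-embed-code-W4A16-AWQ",
    task="embed",
    trust_remote_code=True,
)

# vLLM's pooling API returns one EmbeddingRequestOutput per prompt.
outputs = llm.embed(["def binary_search(arr, x): ...", "sort a list of integers"])
for out in outputs:
    print(len(out.outputs.embedding))  # embedding dimension
```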
import torch
from transformers import AutoModel, AutoTokenizer

# Load the quantized model and tokenizer
model = AutoModel.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True,
)
model.eval()

# Generate embeddings with attention-mask-aware mean pooling,
# so padding tokens do not skew the average
texts = ["Hello world", "Example text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
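Continuing from the snippet above, embeddings used for retrieval are typically L2-normalized so that a dot product gives cosine similarity:

```python
import torch.nn.functional as F

# L2-normalize, then a matrix product gives pairwise cosine similarities
normalized = F.normalize(embeddings, p=2, dim=1)
print(normalized @ normalized.T)
```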
AWQ (Activation-aware Weight Quantization) is a one-shot weight quantization method that:
- identifies the most salient weight channels by looking at activation magnitudes on a small calibration set, rather than at the weights themselves;
- protects those channels with per-channel scaling instead of keeping them in higher precision, so the whole weight matrix stays uniformly quantized;
- requires no retraining or backpropagation, only a single calibration pass;
- here yields a W4A16 scheme: 4-bit weights with 16-bit activations.
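As a rough illustration of the activation-aware idea (not the actual llm-compressor implementation), the toy PyTorch sketch below scales salient weight columns by their average activation magnitude before 4-bit rounding, then folds the inverse scale into the input side; the shapes and the scaling heuristic are made up for demonstration only.

```python
import torch

torch.manual_seed(0)

# Toy linear layer y = x @ W.T where a few input channels carry large activations.
x = torch.randn(256, 64)
x[:, :4] *= 20.0                       # pretend these 4 channels are "salient"
W = torch.randn(128, 64)

def quantize_w4(w):
    """Naive symmetric 4-bit round-to-nearest, per output channel."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    return (w / scale).round().clamp(-8, 7) * scale

# Plain round-to-nearest ignores which input channels matter.
err_rtn = (x @ quantize_w4(W).T - x @ W.T).pow(2).mean()

# AWQ-style: scale weight columns by activation magnitude before quantizing,
# then divide the input by the same scales so the product is mathematically unchanged.
s = x.abs().mean(dim=0).sqrt()
err_awq = ((x / s) @ quantize_w4(W * s).T - x @ W.T).pow(2).mean()

print(f"RTN MSE: {err_rtn:.3f}   activation-aware MSE: {err_awq:.3f}")
```

Because the salient columns get more of the 4-bit range and their quantization error is divided back down by the same scale, the activation-aware variant typically shows a noticeably lower output error than plain round-to-nearest.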
This quantized model is based on nomic-ai/nomic-embed-code.
If you use this model, please cite the original model and llm-compressor:
@software{llmcompressor,
title = {LLM Compressor},
author = {Neural Magic},
url = {https://github.com/vllm-project/llm-compressor},
year = {2024}
}