CodeRankEmbed

CodeRankEmbed is a 137M-parameter bi-encoder supporting an 8,192-token context length for code retrieval. It significantly outperforms open-source and proprietary code embedding models across a range of code retrieval tasks.

Check out our blog post and paper for more details!

Combine CodeRankEmbed with our re-ranker CodeRankLLM for even higher quality code retrieval.

Performance Benchmarks

| Name | Parameters | CSN (MRR) | CoIR (NDCG@10) |
|------|------------|-----------|----------------|
| CodeRankEmbed | 137M | 77.9 | 60.1 |
| Arctic-Embed-M-Long | 137M | 53.4 | 43.0 |
| CodeSage-Small | 130M | 64.9 | 54.4 |
| CodeSage-Base | 356M | 68.7 | 57.5 |
| CodeSage-Large | 1.3B | 71.2 | 59.4 |
| Jina-Code-v2 | 161M | 67.2 | 58.4 |
| CodeT5+ | 110M | 74.2 | 45.9 |
| OpenAI-Ada-002 | 110M | 71.3 | 45.6 |
| Voyage-Code-002 | Unknown | 68.5 | 56.3 |

We release the scripts to evaluate our model's performance here.

Usage

Important: the query prompt must include the following task instruction prefix: "Represent this query for searching relevant code"

```python
from sentence_transformers import SentenceTransformer

# trust_remote_code is required to load the custom nomic_bert architecture
model = SentenceTransformer("nomic-ai/CodeRankEmbed", trust_remote_code=True)

# Queries must carry the task instruction prefix; code snippets are embedded as-is
queries = ['Represent this query for searching relevant code: Calculate the n-th factorial']
codes = ['def fact(n):\n if n < 0:\n  raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']

query_embeddings = model.encode(queries)
print(query_embeddings)
code_embeddings = model.encode(codes)
print(code_embeddings)
```
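
Once both sets of embeddings are computed, code snippets can be ranked by cosine similarity to the query. A minimal sketch continuing from the snippet above (the `similarity` helper assumes a recent sentence-transformers release; on older versions, `sentence_transformers.util.cos_sim` is the equivalent):

```python
# Score every code snippet against every query (cosine similarity by default)
scores = model.similarity(query_embeddings, code_embeddings)  # shape: (n_queries, n_codes)
best = scores.argmax(dim=-1)                                  # index of top-ranked snippet per query
print(scores)
print(f"Top match for query 0: codes[{best[0]}]")
```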

Training

We use a bi-encoder architecture for CodeRankEmbed, with weights shared between the text and code encoders. The retriever is contrastively fine-tuned with the InfoNCE loss on CoRNStack, a high-quality dataset of 21 million examples that we curated. The encoder is initialized from Arctic-Embed-M-Long, a 137M-parameter text encoder supporting an extended context length of 8,192 tokens.
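
As a rough illustration of the objective (a generic sketch, not the released training code; in-batch negatives and the temperature value are assumptions here), InfoNCE treats each query's paired code as the positive and every other code in the batch as a negative:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, code_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of query_emb is paired with row i of code_emb."""
    q = F.normalize(query_emb, dim=-1)                  # unit-normalize so dot products are cosines
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)
```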

Citation

If you find the model, dataset, or training code useful, please cite our work:

```bibtex
@misc{suresh2025cornstackhighqualitycontrastivedata,
      title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
      author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
      year={2025},
      eprint={2412.01007},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.01007},
}
```