We're crowdsourcing the data labeling that embedding models need to scale.
Run a container on your machine, label query-document pairs while you sleep,
and help build open-source models that top the leaderboard.
LLMs improve with more data. Embedding models don't, because scaling their training data introduces false negatives: relevant documents incorrectly treated as negatives. These poison the contrastive learning signal until, past a point, more data actively hurts performance.
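The mechanism is visible directly in a contrastive loss such as InfoNCE. A minimal sketch (the similarity values and temperature are illustrative, not from any real training run): when one in-batch "negative" is actually relevant, the loss spikes and the gradient pushes a relevant document away from the query.

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.05):
    """InfoNCE loss for one query: -log softmax of the positive's
    similarity over the positive plus the negatives."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - sim_pos / tau

# True negatives: low similarity to the query -> near-zero loss.
clean = info_nce(0.9, [0.10, 0.05, 0.0])

# One "negative" is actually relevant (a false negative): its high
# similarity inflates the loss and generates a gradient that pushes
# the relevant document apart from the query.
poisoned = info_nce(0.9, [0.85, 0.05, 0.0])

print(clean, poisoned)
```

Filtering the pair out of the negative set removes that spurious gradient entirely.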
The fix is straightforward: use a language model to verify every (query, document) pair and filter the false negatives. But at the scale needed (hundreds of millions of pairs), no single lab can afford it.
The community can.
The labeling task is simple: “Does this document answer this query?” A small model running on consumer hardware handles this reliably. Leave the container running overnight. Wake up having contributed to the best embedding model ever built.
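A sketch of what one labeling step looks like, assuming a binary YES/NO prompt; the container's actual prompt and model call may differ:

```python
def build_prompt(query, document):
    """Hypothetical verification prompt; the container's real
    prompt template may differ."""
    return (
        "Does this document answer this query? Reply YES or NO.\n\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
        "Answer:"
    )

def parse_label(model_output):
    """Map the model's free-text reply to a binary relevance label."""
    return 1 if model_output.strip().upper().startswith("YES") else 0

prompt = build_prompt("capital of France?", "Paris is the capital of France.")
print(parse_label(" Yes, it directly answers the query."))  # 1
print(parse_label("No, the document is off-topic."))        # 0
```

Binary verification like this is far easier than generation, which is why a small quantized model handles it reliably.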
| Hardware | Model | Throughput | Pairs overnight (8 hr) |
|---|---|---|---|
| RTX 4090 (24GB) | Qwen2.5-7B Q8 | ~120 pairs/min | ~57,600 pairs |
| RTX 3090 (24GB) | Qwen2.5-7B Q8 | ~85 pairs/min | ~40,800 pairs |
| RTX 3060 (12GB) | Qwen2.5-3B Q8 | ~60 pairs/min | ~28,800 pairs |
| M2/M3 MacBook (16GB) | Qwen2.5-3B Q4 | ~20 pairs/min | ~9,600 pairs |
| CPU only (16GB RAM) | Qwen2.5-1.5B Q4 | ~8 pairs/min | ~3,840 pairs |
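The overnight totals above are just throughput times minutes, a quick spot-check:

```python
def overnight_pairs(pairs_per_min, hours=8):
    """Pairs labeled in one overnight session."""
    return pairs_per_min * hours * 60

print(overnight_pairs(120))  # 57600 (RTX 4090 row)
print(overnight_pairs(8))    # 3840  (CPU-only row)
```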
The container probes your hardware and picks the best model. No config needed.
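As an illustration, the selection can be as simple as mapping detected memory to the tiers in the table above; the container's real probe logic may differ, and the thresholds here are assumed from the table:

```python
def pick_model(vram_gb=None):
    """Map detected GPU memory (or its absence) to a model tier.
    Thresholds mirror the hardware table; illustrative only."""
    if vram_gb is not None and vram_gb >= 24:
        return "Qwen2.5-7B Q8"
    if vram_gb is not None and vram_gb >= 12:
        return "Qwen2.5-3B Q8"
    if vram_gb is not None:
        return "Qwen2.5-3B Q4"
    return "Qwen2.5-1.5B Q4"  # CPU-only fallback

print(pick_model(vram_gb=24))  # Qwen2.5-7B Q8
print(pick_model())            # Qwen2.5-1.5B Q4
```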
The model runs locally. Only pair ID + label are uploaded.
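What an upload might look like, assuming a JSON payload; the field names are illustrative, and the point is what's absent: no query text, no document text.

```python
import json

def make_submission(pair_id, label):
    """Build the upload payload. Only the pair ID and the binary
    label leave the machine; the texts stay local."""
    return json.dumps({"pair_id": pair_id, "label": label})

print(make_submission("p-001", 1))  # {"pair_id": "p-001", "label": 1}
```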
Every pair is labeled by 2+ contributors. Honeypot pairs with known answers catch bad actors.
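A minimal sketch of the quality-control logic, with assumed thresholds; the production checks may be more involved:

```python
from collections import Counter

def consensus(labels, min_votes=2):
    """Accept a pair's label only once 2+ contributors agree by
    strict majority; otherwise route it for more votes."""
    if len(labels) < min_votes:
        return None
    (top, n), = Counter(labels).most_common(1)
    if n < min_votes or n * 2 <= len(labels):  # no strict majority yet
        return None
    return top

def passes_honeypots(answers, truth, min_accuracy=0.9):
    """Flag contributors whose labels on known-answer (honeypot)
    pairs fall below a threshold (0.9 is an assumed value)."""
    correct = sum(a == t for a, t in zip(answers, truth))
    return correct / len(truth) >= min_accuracy

print(consensus([1]))     # None: needs a second vote
print(consensus([1, 1]))  # 1
print(consensus([1, 0]))  # None: disagreement, escalate
```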
Public ledger. Weekly retraining. MTEB scores published live.
We accept donations of API credits, GPU compute time, and cloud resources. Every donated resource is accounted for publicly, and we publish exactly how it's used. No waste. If you give us compute, we label pairs.