Open Source · Open Data · Open Weights

The best embedding model in the world, built by the community.

We're crowdsourcing the data labeling that embedding models need to scale.
Run a container on your machine, label query-document pairs while you sleep,
and help build open-source models that top the leaderboard.

Start Contributing
# start contributing in one command
$ docker pull embedkombinat/annotator
$ docker run -it --gpus all embedkombinat/annotator
 
Detected: NVIDIA RTX 3090 (24GB VRAM)
Selected model: Qwen2.5-7B-Instruct (Q8)
Fetched batch: 500 pairs from task queue
Labeling... 0/500 (0.0%)

Labeling Progress

5,831,204 / 1,000,000,000 pairs labeled
42 active contributors · ~318K pairs/day

The Problem

Here's why embedding models stop scaling.

LLMs improve with more data. Embedding models don't, because scaling their training data introduces false negatives: documents incorrectly treated as irrelevant. This poisons the contrastive learning signal and eventually makes more data hurt performance.
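The poisoning mechanism shows up directly in the contrastive (InfoNCE) loss. A minimal sketch in plain Python, with illustrative similarity scores and temperature (not the project's actual training setup):

```python
import math

def info_nce_loss(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE: reward the positive for scoring above the negatives.
    A 'negative' that is secretly relevant gets pushed away anyway,
    which is exactly the false-negative poisoning described above."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]

# Clean batch: the positive clearly outscores every negative.
clean = info_nce_loss(0.90, [0.10, 0.20, 0.15])

# Poisoned batch: one mined "negative" is actually relevant (0.88),
# so the loss stays large no matter how good the embeddings are.
poisoned = info_nce_loss(0.90, [0.10, 0.88, 0.15])
```

With more unverified data, poisoned batches become more common, and the gradient increasingly pushes genuinely relevant documents apart.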

The fix is straightforward: use a language model to verify every (query, document) pair and filter the false negatives. But at the scale needed - hundreds of millions of pairs - no single lab can afford it.
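A sketch of that filtering step. The `judge` callable stands in for the local LLM's yes/no verdict; `toy_judge` is a hypothetical keyword heuristic used only to make the example runnable:

```python
def filter_false_negatives(query, candidates, judge):
    """Split mined 'negatives' into true negatives (safe to train on)
    and false negatives (the judge says they answer the query)."""
    true_negs, false_negs = [], []
    for doc in candidates:
        (false_negs if judge(query, doc) else true_negs).append(doc)
    return true_negs, false_negs

# Stand-in for the real LLM judge (assumption: the actual judge is a
# local model answering "does this document answer this query?").
def toy_judge(query, doc):
    return query.split()[-1] in doc  # crude keyword overlap

negs, dropped = filter_false_negatives(
    "capital of France",
    ["Berlin is in Germany.", "Paris is the capital of France."],
    toy_judge,
)
```

Here the second "negative" actually answers the query, so it is dropped before it can poison training.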

The community can.

Model size: LLMs vs embedding models · parameter count (log scale) · 2018–2025
[Chart: LLMs grew from GPT-1 (117M) through GPT-3 (175B) to GPT-5 (3.5T); embedding models went from ELMo, S-BERT, DPR, and SimCSE to Ada-002, BGE-large, NV-Embed (7.9B), and Qwen3-8B (8B). The size gap is now 437×.]

Every Contribution Counts

Your GPU labels pairs while you sleep.

The labeling task is simple: “Does this document answer this query?” A small model running on consumer hardware handles this reliably. Leave the container running overnight. Wake up having contributed to the best embedding model ever built.
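What a single labeling call might look like. The prompt shape and the YES/NO protocol are assumptions for illustration; the actual annotator container may format things differently:

```python
def build_label_prompt(query, document):
    """Assumed prompt shape for the local judge model."""
    return (
        "Does the document answer the query? Reply YES or NO.\n\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
        "Answer:"
    )

def parse_label(raw):
    """Map raw model output to a binary label; unparseable -> None,
    so the pair can be re-queued rather than mislabeled."""
    token = raw.strip().upper().split()[0] if raw.strip() else ""
    if token.startswith("YES"):
        return True
    if token.startswith("NO"):
        return False
    return None
```

Keeping the task binary is what lets a small quantized model handle it reliably on consumer hardware.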

Hardware               Model            Speed            Overnight (8hr)
RTX 4090 (24GB)        Qwen2.5-7B Q8    ~120 pairs/min   ~57,600 pairs
RTX 3090 (24GB)        Qwen2.5-7B Q8    ~85 pairs/min    ~40,800 pairs
RTX 3060 (12GB)        Qwen2.5-3B Q8    ~60 pairs/min    ~28,800 pairs
M2/M3 MacBook (16GB)   Qwen2.5-3B Q4    ~20 pairs/min    ~9,600 pairs
CPU only (16GB RAM)    Qwen2.5-1.5B Q4  ~8 pairs/min     ~3,840 pairs
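The overnight column is just throughput times wall-clock minutes. A quick check of the arithmetic:

```python
def overnight_pairs(pairs_per_min, hours=8):
    """pairs/min x 60 min/hr x hours of idle time."""
    return pairs_per_min * 60 * hours

# RTX 4090 row: 120 pairs/min over 8 hours -> 57,600 pairs
rtx4090 = overnight_pairs(120)
# CPU-only row: 8 pairs/min over 8 hours -> 3,840 pairs
cpu_only = overnight_pairs(8)
```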

Auto-detection

The container probes your hardware and picks the best model. No config needed.
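A sketch of a selection policy consistent with the hardware table above. The thresholds are inferred from the table, not the container's actual detection code, and Apple unified memory would likely need its own path:

```python
def pick_model(vram_gb=None):
    """Map detected VRAM to a judge model (assumed thresholds).
    vram_gb=None means no GPU was found: fall back to CPU."""
    if vram_gb is None:
        return "Qwen2.5-1.5B Q4"   # CPU only
    if vram_gb >= 24:
        return "Qwen2.5-7B Q8"     # RTX 3090 / 4090 class
    if vram_gb >= 12:
        return "Qwen2.5-3B Q8"     # RTX 3060 class
    return "Qwen2.5-3B Q4"         # smaller GPUs
```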

🔒

Nothing leaves your machine

The model runs locally. Only the pair ID and the label are uploaded.
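A sketch of what such a minimal upload payload could look like; the field names are hypothetical, but the point stands: no query or document text ever leaves the machine.

```python
import json

def make_upload_payload(pair_id, label, contributor_id):
    """Assumed wire format: identifiers and the verdict only."""
    return json.dumps({
        "pair_id": pair_id,
        "label": bool(label),
        "contributor": contributor_id,
    })
```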

🔬

Quality control

Every pair labeled by 2+ contributors. Honeypots catch bad actors.
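A sketch of how multi-contributor consensus and honeypot screening might work; the real aggregation policy is the project's, and this is only an illustration:

```python
from collections import Counter

def aggregate(labels):
    """Consensus over 2+ contributor labels for one pair.
    Returns the majority label, or None on a tie (needs another vote)."""
    top = Counter(labels).most_common(2)
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None
    return top[0][0]

def passes_honeypots(answers, truth, min_accuracy=0.9):
    """Flag contributors who miss known-answer (honeypot) pairs."""
    correct = sum(a == t for a, t in zip(answers, truth))
    return correct / len(truth) >= min_accuracy
```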

📈

Transparent progress

Public ledger. Weekly retraining. MTEB scores published live.

Get Started on GitHub

For Companies

Sponsor the infrastructure.

We accept donations of API credits, GPU compute time, and cloud resources. Every donated resource is accounted for publicly, and we publish exactly how it's used. No waste. If you give us compute, we label pairs.

hello@embedkombinat.org