Embed Kombinat is an open-source project to produce the highest-quality text embedding training data ever assembled — and use it to train state-of-the-art embedding models that are fully open: open weights, open data, open code, open process.
We are building three things:

- A crowdsourced labeling system (kombinat + annotator) that distributes relevance labeling across hundreds of volunteer GPUs running small language models locally.
- A cleaned, verified dataset of hundreds of millions of (query, document, relevance) triples — the largest open dataset of its kind, released incrementally on HuggingFace.
- Embedding models trained on this data at base (~140M) and large (~330M) parameter scales, targeting state-of-the-art performance on MTEB retrieval benchmarks.

The dataset and the infrastructure are the primary contributions. The trained models are the proof that the data works, but anyone can take the dataset and train their own models however they want.
There are trillions upon trillions of query-document pairs on the internet. This is, in principle, the richest training signal imaginable for embedding models. In practice, almost none of it is usable.
The reason is a data labeling problem hiding inside every contrastive learning dataset. Contrastive training starts with known positive pairs — a query and a document that answers it. To learn useful embeddings, the model also needs negatives: documents that don’t answer the query. The standard approach is to treat everything else in the batch as a negative. As you scale up the dataset, you add more documents to the pool of implicit negatives.
The problem: some of those documents are relevant. A query like “was Ronald Reagan a democrat?” will have one labeled positive document. But in a corpus of millions, there are inevitably other documents that also answer this question. They get silently treated as negatives. The model learns to push them away from the query — the exact opposite of what it should do.
This is called the false negative problem in contrastive learning. As datasets grow, the density of these mislabeled pairs increases. The training signal gets noisier. At some point, adding more data makes the model worse. This is why embedding model performance has plateaued at around 10M training pairs while LLMs continue to improve with more data.
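A toy numpy sketch (illustrative only, not the project's training code) makes the mechanism concrete. Doc 2 in this batch is nearly identical to doc 0, so it also answers query 0 — but InfoNCE with in-batch negatives treats it as a negative for query 0 anyway:

```python
import numpy as np

def in_batch_infonce(query_embs, doc_embs, temperature=0.05):
    """InfoNCE with in-batch negatives: row i's positive is doc i;
    every other document in the batch is treated as a negative."""
    sims = query_embs @ doc_embs.T / temperature      # (B, B) similarity matrix
    sims = sims - sims.max(axis=1, keepdims=True)     # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                 # positives sit on the diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = q.copy()                                  # doc i matches query i
d[2] = d[0] + 0.01 * rng.normal(size=8)       # doc 2 also answers query 0: a false negative
d /= np.linalg.norm(d, axis=1, keepdims=True)

loss = in_batch_infonce(q, d)
```

Because doc 2 is almost a duplicate of doc 0, the softmax for query 0 splits probability mass between them, and gradient descent pushes doc 2 away from query 0 — even though it is relevant.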
This isn’t speculation. The CDE paper (Morris & Rush, 2024; ICLR 2025) demonstrated that even crude embedding-based filtering of false negatives within training batches produced nearly a 10% improvement in retrieval performance. The ANCE paper (Xiong et al., 2020) showed that iteratively re-mining hard negatives after each training cycle consistently improved dense retrieval quality. Research on MS MARCO has found that over 70% of top-retrieved passages are actually false negatives in the original annotations.
The pattern is clear: the data exists, the labels are broken, and fixing them unlocks scaling.
It has — just not in the open. OpenAI, Google, Anthropic, and Cursor all train internal embedding models. When they publish papers, the methodology sections make it clear they invest heavily in data cleaning and negative verification. But the cleaned datasets are never released. The resulting models are served behind APIs.
There is no technical barrier to producing this data. The fix is dead simple: for each (query, document) pair, ask a language model “does this document answer this query?” The big labs do this at scale with their own infrastructure. They just don’t open-source the results.
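A minimal sketch of what that check looks like in code. The prompt wording and the binary YES/NO label scheme here are illustrative assumptions, not the project's final template:

```python
# Illustrative prompt template; the real pipeline's wording may differ.
PROMPT_TEMPLATE = """You are judging search relevance.

Query: {query}

Document: {document}

Does the document answer the query? Reply with exactly one word: YES or NO."""

def build_prompt(query: str, document: str) -> str:
    return PROMPT_TEMPLATE.format(query=query, document=document)

def parse_label(model_output: str) -> int:
    """Map the model's reply to a binary relevance label (1 = relevant)."""
    first_word = model_output.strip().split()[0].upper().rstrip(".,!")
    if first_word == "YES":
        return 1
    if first_word == "NO":
        return 0
    raise ValueError(f"Unparseable label: {model_output!r}")
```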
We will.
Relevance labeling is embarrassingly parallel. Each (query, document) pair is independent — no shared state, no sequential dependencies, no gradient synchronization. A contributor’s GPU processes pairs in isolation. Only the labels flow back. This is the ideal shape for distributed computation.
The task itself is simple enough for small language models (7B parameters and below) to handle reliably at the binary or graded relevance level. This means the compute bar for contribution is low: anyone with a laptop GPU or even a CPU can participate. The task doesn’t require a frontier model, a cloud API, or any money.
The analogy is Folding@home or BOINC — distributed computation where individually small contributions aggregate into something no single participant could achieve alone. Except instead of donating CPU cycles for protein folding, contributors donate LLM inference for relevance labeling.
The full pipeline runs in a repeating cycle: mine candidates, label them, train, evaluate, repeat with better embeddings.
We start with nomic-ai/nomic-embed-unsupervised-data, a publicly available dataset of ~235 million weakly-supervised text pairs across 29 domain splits (Reddit, Wikipedia, Amazon Reviews, StackExchange, academic papers, etc.). Each row is a positive (query, document) pair.
This dataset was chosen because it is fully open, large, diverse across domains, and already used to train several open embedding models (Nomic Embed v1 and v2). Starting from the same source data means our improvements are directly attributable to label quality, not data selection.
For each query in a split, we retrieve the top-K most similar documents using two complementary methods:
- Dense retrieval: all-MiniLM-L6-v2 embeds all documents and queries, indexed with FAISS IVFFlat.

The two ranked lists are fused using Reciprocal Rank Fusion (RRF, k=60). Documents that rank highly in both methods are prioritized — these are the hardest negatives, the ones most likely to confuse an embedding model and most valuable to verify.
After fusion, we filter out the known positive document and take the top-5,000 candidates per query. Each candidate pair receives a deterministic UUID based on uuid5(query_text + doc_id + source_dataset), making the entire pipeline idempotent.
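Both pieces are short enough to sketch directly. RRF and the deterministic pair IDs might look like this (the uuid5 namespace is an assumption; the source only specifies which fields are hashed):

```python
import uuid

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank_d),
    ranks starting at 1. Documents high in several lists score highest."""
    scores = {}
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical namespace; the pipeline hashes query_text + doc_id + source_dataset.
PAIR_NAMESPACE = uuid.NAMESPACE_URL

def pair_id(query_text, doc_id, source_dataset):
    """Deterministic UUID, so re-running the miner never duplicates pairs."""
    return uuid.uuid5(PAIR_NAMESPACE, query_text + doc_id + source_dataset)

fused = rrf_fuse([["d1", "d2", "d3"], ["d1", "d3", "d4"]])
```

Because "d1" ranks first in both lists, it tops the fused ranking; that is exactly the "hard negative" profile the pipeline prioritizes for verification.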
The annotation system has two components:
kombinat — a FastAPI coordination server backed by PostgreSQL. It maintains the task queue of unlabeled pairs, assigns batches to contributors, receives and validates labels, tracks contributor reputation, and serves public progress statistics. The work queue uses SELECT FOR UPDATE SKIP LOCKED for concurrent batch assignment — no Redis, no external message broker.
annotator — a CLI tool (or Docker container) that contributors run on their own hardware. It authenticates via GitHub OAuth, claims a batch of pairs from kombinat, runs a local language model to label each pair, and streams results back in chunks of ~50 pairs. The model runs entirely on the contributor’s machine. No API keys are transmitted. No data leaves the machine except pair IDs and labels.
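The claim query implied by SELECT FOR UPDATE SKIP LOCKED can be sketched as a single statement: concurrent workers each lock a disjoint batch of pending pairs without blocking one another. Table and column names here are illustrative assumptions, not kombinat's actual schema:

```python
# Hypothetical schema: a "pairs" table with status/priority columns.
# SKIP LOCKED makes concurrently claimed rows invisible to other workers,
# so no external queue or broker is needed.
CLAIM_BATCH_SQL = """
UPDATE pairs
SET status = 'assigned', assigned_to = %(contributor_id)s, assigned_at = now()
WHERE id IN (
    SELECT id FROM pairs
    WHERE status = 'pending'
    ORDER BY priority DESC
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""
```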
The annotator auto-detects available hardware and selects the appropriate inference backend and model:
| Hardware | Backend | Model |
|---|---|---|
| NVIDIA GPU (≥18GB VRAM) | vLLM | Qwen2.5-7B-Instruct |
| NVIDIA GPU (≥8GB VRAM) | vLLM | Qwen2.5-7B-Instruct-AWQ |
| NVIDIA GPU (≥4GB VRAM) | vLLM | Qwen2.5-3B-Instruct-AWQ |
| Apple Silicon (≥6GB) | MLX | Qwen2.5-7B-Instruct (4-bit) |
| Apple Silicon (≥4GB) | MLX | Qwen2.5-3B-Instruct (4-bit) |
| CPU only | llama.cpp | Qwen2.5-3B-Instruct (Q4_K_M) |
| CPU only (low RAM) | llama.cpp | Qwen2.5-1.5B-Instruct (Q4_K_M) |
Only models that pass a quality threshold (Cohen’s κ ≥ 0.8 against frontier model labels on a held-out test set) are included in the registry.
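For reference, Cohen's κ compares observed agreement between two annotators to the agreement expected by chance. A minimal implementation for binary labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary annotators:
    kappa = (p_observed - p_chance) / (1 - p_chance)."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1 = sum(labels_a) / n      # annotator A's rate of label 1
    p_b1 = sum(labels_b) / n      # annotator B's rate of label 1
    p_chance = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_observed - p_chance) / (1 - p_chance)
```

A κ of 1.0 means perfect agreement; 0 means no better than chance. The ≥0.8 bar is a conventional threshold for "almost perfect" agreement.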
Trust in crowdsourced labels comes from multiple overlapping mechanisms:
Redundancy: Every pair is labeled by at least 2 independent contributors. Agreement promotes the pair to verified. Disagreement triggers a third annotation or escalation to a higher-quality model.
Honeypot pairs: 5–10% of each batch consists of pairs with known ground truth labels (from human-annotated datasets like MS MARCO qrels and Natural Questions). These are injected transparently — the contributor doesn’t know which pairs are honeypots. If a contributor’s accuracy on honeypots drops below 90%, their entire batch is quarantined and their reputation score is adjusted.
Contributor reputation: Every contributor has a reputation score based on honeypot accuracy, agreement rate with other contributors, and label distribution consistency. Low-reputation contributors receive honeypot-heavy batches until their reliability is established.
Model and quantization tracking: Every annotation records which model and quantization level produced it. This enables post-hoc analysis of per-model agreement rates and bias patterns.
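The redundancy rule above can be sketched as a small state machine. The majority-of-three resolution and the status names are illustrative assumptions; the source also allows escalation to a higher-quality model instead of a third annotator:

```python
def consolidate(labels):
    """Promote a pair based on its independent labels (0/1).
    Two agreeing labels verify the pair; a disagreement requests a third."""
    if len(labels) < 2:
        return ("pending", None)
    if len(labels) == 2:
        if labels[0] == labels[1]:
            return ("verified", labels[0])
        return ("needs_third", None)          # or escalate to a stronger model
    majority = 1 if sum(labels[:3]) >= 2 else 0
    return ("verified", majority)
```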
We will use exact softmax with coordinate ascent, as proposed by Morris (2022, 2026). This is not the standard approach.
Standard contrastive training uses in-batch negatives as a sampled softmax — a Monte Carlo approximation of the true loss. This approximation introduces noise that worsens as batch size grows relative to corpus size. Exact softmax computes the loss against the entire corpus (or a very large subset), not just the in-batch negatives. Morris demonstrated this approach dramatically outperforms NCE loss in settings where perfect labels are available — which is exactly what our annotation pipeline produces.
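The difference is in the softmax denominator. A numpy sketch (the coordinate-ascent optimization itself is beyond this toy; this only shows sampled vs. exact normalization):

```python
import numpy as np

def softmax_nll(q, pos, negs, temperature=0.05):
    """-log p(pos | q), softmax over the positive plus the given negatives.
    negs = the in-batch documents gives the usual sampled approximation;
    negs = the entire corpus gives the exact softmax."""
    sims = np.concatenate(([q @ pos], negs @ q)) / temperature
    sims = sims - sims.max()              # numerical stability
    return -(sims[0] - np.log(np.exp(sims).sum()))

rng = np.random.default_rng(1)
corpus = rng.normal(size=(1000, 16))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
pos = corpus[0]
q = pos + 0.1 * rng.normal(size=16)
q /= np.linalg.norm(q)

loss_sampled = softmax_nll(q, pos, corpus[1:32])   # 31 in-batch negatives
loss_exact = softmax_nll(q, pos, corpus[1:])       # every other document
```

The exact loss is never smaller than the sampled one, because the batch's denominator omits most of the corpus; that gap is the Monte Carlo bias the exact formulation removes.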
Training involves an iterative cycle:
Hard negative re-mining is not a one-time preprocessing step — it is a recurring phase in the training cycle. Each iteration produces a better model, which in turn reveals harder negatives, which in turn enables further improvement.
Primary benchmark: MTEB (Massive Text Embedding Benchmark), specifically the retrieval category. We target state-of-the-art performance in two model size classes: base (~140M parameters) and large (~330M parameters).
Secondary benchmarks: BEIR (zero-shot retrieval across diverse domains), and eventually MIRACL (multilingual retrieval) when we expand beyond English.
We will publish MTEB scores after every training cycle, not just at release. The trajectory matters as much as the final number — it demonstrates whether the iterative re-mining approach is producing consistent gains.
There is no internal version that differs from the public version.
A HuggingFace dataset of (query, document, relevance) triples, each carrying its relevance label and verified status.

Three open-source repositories:
Open embedding models released on HuggingFace at two scales: base (~140M parameters) and large (~330M parameters).
Each release includes full model weights (not just the final checkpoint — intermediate checkpoints at each training cycle), the exact training configuration, MTEB and BEIR evaluation results, and a training report documenting what worked, what didn’t, and what we’d do differently.
We will publish technical reports at each major milestone.
Pull the Docker container or install via pip. Your GPU labels pairs while you sleep.
```shell
# NVIDIA GPU
$ docker pull ghcr.io/embedkomb/annotator:latest
$ docker run -it --gpus all ghcr.io/embedkomb/annotator:latest

# Apple Silicon
$ pip install "annotator[mlx]"
$ annotator

# CPU only
$ pip install "annotator[cpu]"
$ annotator
```
The repositories are on GitHub. We use red-green TDD. PRDs are written before implementation. Issues are labeled with good-first-issue for newcomers.
If you represent a company that could donate API credits, GPU time, or cloud resources — reach out at hello@embedkombinat.org. Every donated resource is accounted for publicly, and we publish exactly how it’s used. No waste. If you give us compute, we label pairs.
The more contributors we have, the faster the dataset grows, the better the model gets. Share this with anyone who cares about open-source AI infrastructure.