Open Source · Open Data · April 2026

Why we’re building the world’s best embedding model.

Embed Kombinat is an open-source project to produce the highest-quality text embedding training data ever assembled — and use it to train state-of-the-art embedding models that are fully open: open weights, open data, open code, open process.

Annotation Infrastructure

A crowdsourced labeling system (kombinat + annotator) that distributes relevance labeling across volunteer GPUs running small language models locally.

Retrieval Dataset

A cleaned, verified dataset of hundreds of millions of (query, document, relevance) triples — the largest open dataset of its kind, released incrementally on HuggingFace.

Open Embedding Models

Embedding models trained on this data at base (~140M) and large (~330M) parameter scales, targeting state-of-the-art performance on MTEB retrieval benchmarks.


01 — What we’re building

Three things.

We are building three things:

  1. A crowdsourced annotation infrastructure (kombinat + annotator) that distributes relevance labeling across hundreds of volunteer GPUs running small language models locally.
  2. A cleaned, verified retrieval dataset of hundreds of millions of (query, document, relevance) triples — the largest open dataset of its kind.
  3. Open embedding models trained on this data at the base (~140M parameter) and large (~330M parameter) scales, targeting state-of-the-art performance on MTEB retrieval benchmarks.

The dataset and the infrastructure are the primary contributions. The trained models are the proof that the data works — but anyone can take the dataset and train their own models however they want.


02 — Why this matters

Embedding models can’t scale. Here’s why — and how to fix it.

The false negative problem

There are trillions upon trillions of query-document pairs on the internet. This is, in principle, the richest training signal imaginable for embedding models. In practice, almost none of it is usable.

The reason is a data labeling problem hiding inside every contrastive learning dataset. Contrastive training starts with known positive pairs — a query and a document that answers it. To learn useful embeddings, the model also needs negatives: documents that don’t answer the query. The standard approach is to treat everything else in the batch as a negative. As you scale up the dataset, you add more documents to the pool of implicit negatives.

The problem: some of those documents are relevant. A query like “was Ronald Reagan a democrat?” will have one labeled positive document. But in a corpus of millions, there are inevitably other documents that also answer this question. They get silently treated as negatives. The model learns to push them away from the query — the exact opposite of what it should do.

This is called the false negative problem in contrastive learning. As datasets grow, the density of these mislabeled pairs increases. The training signal gets noisier. At some point, adding more data makes the model worse. This is why embedding model performance has plateaued at around 10M training pairs while LLMs continue to improve with more data.
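The arithmetic behind that plateau is easy to sketch. In the snippet below, the relevant fraction `p` is an illustrative guess, not a measured figure:

```python
# Back-of-envelope: if a fraction p of the corpus answers a typical query,
# each implicit in-batch negative is a false negative with probability ~p,
# so the expected count of mislabeled negatives grows linearly with batch
# size. The value of p here is illustrative only.

def expected_false_negatives(batch_size: int, relevant_fraction: float) -> float:
    """Expected mislabeled pairs among a batch's implicit negatives."""
    return (batch_size - 1) * relevant_fraction

small = expected_false_negatives(256, 1e-4)     # ≈ 0.026 per batch
large = expected_false_negatives(65_536, 1e-4)  # ≈ 6.6 per batch
```

At small batch sizes the noise is negligible; at the large batch sizes contrastive training favors, every batch contains several pairs pushing the model in the wrong direction.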

The evidence

This isn’t speculation. The CDE paper (Morris & Rush, 2024; ICLR 2025) demonstrated that even crude embedding-based filtering of false negatives within training batches produced nearly a 10% improvement in retrieval performance. The ANCE paper (Xiong et al., 2020) showed that iteratively re-mining hard negatives after each training cycle consistently improved dense retrieval quality. Research on MS MARCO has found that over 70% of top-retrieved passages are actually false negatives in the original annotations.

The pattern is clear: the data exists, the labels are broken, and fixing them unlocks scaling.

Why hasn’t this been done?

It has — just not in the open. OpenAI, Google, Anthropic, and Cursor all train internal embedding models. When they publish papers, the methodology sections make it clear they invest heavily in data cleaning and negative verification. But the cleaned datasets are never released. The resulting models are served behind APIs.

There is no technical barrier to producing this data. The fix is dead simple: for each (query, document) pair, ask a language model “does this document answer this query?” The big labs do this at scale with their own infrastructure. They just don’t open-source the results.

We will.

Why crowdsourcing works here

Relevance labeling is embarrassingly parallel. Each (query, document) pair is independent — no shared state, no sequential dependencies, no gradient synchronization. A contributor’s GPU processes pairs in isolation. Only the labels flow back. This is the ideal shape for distributed computation.

The task itself is simple enough for small language models (7B parameters and below) to handle reliably at the binary or graded relevance level. This means the compute bar for contribution is low: anyone with a laptop GPU or even a CPU can participate. The task doesn’t require a frontier model, a cloud API, or any money.
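As a concrete sketch, here is what a single labeling call could look like. The prompt wording and the yes/no parsing are our illustration, not the project's finalized prompt:

```python
# Sketch of one relevance-labeling call for a small local LLM.
# Prompt wording and label parsing are illustrative assumptions.

def build_prompt(query: str, document: str) -> str:
    """Frame relevance labeling as a binary question the model answers in one word."""
    return (
        "Does the following document answer the query?\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
        "Answer with exactly one word: yes or no."
    )

def parse_label(completion: str) -> int:
    """Map the model's free-text answer to a binary relevance label."""
    answer = completion.strip().lower()
    return 1 if answer.startswith("yes") else 0
```

Because each call is stateless, any inference backend that turns a prompt into a completion can slot in behind these two functions.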

The analogy is Folding@home or BOINC — distributed computation where individually small contributions aggregate into something no single participant could achieve alone. Except instead of donating CPU cycles for protein folding, contributors donate LLM inference for relevance labeling.


03 — Technical approach

How the pipeline works.

Data pipeline overview

The full pipeline runs in a repeating cycle: mine candidates, label them, train, evaluate, repeat with better embeddings.

Data Pipeline §3.1 — iterative annotation + training cycle

  Source Dataset — nomic-embed-unsupervised-data · 235M pairs · 29 domains
    ↓ MINE CANDIDATES
  Hard Negative Mining — BM25 + dense retrieval (all-MiniLM-L6-v2) · RRF fusion (k=60) · top-5,000 candidates per query · deterministic UUID per pair
    ↓ ENQUEUE PAIRS
  Task Queue (kombinat) — FastAPI + PostgreSQL · SELECT FOR UPDATE SKIP LOCKED
    ↓ CLAIM BATCHES
  Distributed Annotation — contributors · local LLMs · vLLM / MLX / llama.cpp · labels stream back · no data leaves contributor machine
    ↓ CONSENSUS + HONEYPOTS
  Verified Dataset — 2+ labels per pair · κ ≥ 0.8 · released on HuggingFace
    ↓ TRAIN
  Model Training — exact softmax + coordinate ascent, not in-batch NCE · base (~140M params) · large (~330M params)
    ↓ EVALUATE
  Evaluation + Re-mine — MTEB / BEIR benchmarks · publish scores after every cycle · re-embed corpus · re-mine hard negatives with updated model
    ↓ ITERATE

Source data

We start with nomic-ai/nomic-embed-unsupervised-data, a publicly available dataset of ~235 million weakly-supervised text pairs across 29 domain splits (Reddit, Wikipedia, Amazon Reviews, StackExchange, academic papers, etc.). Each row is a positive (query, document) pair.

This dataset was chosen because it is fully open, large, diverse across domains, and already used to train several open embedding models (Nomic Embed v1 and v2). Starting from the same source data means our improvements are directly attributable to label quality, not data selection.

Hard negative candidate mining

For each query in a split, we retrieve the top-K most similar documents using two complementary methods:

  1. BM25 lexical retrieval over the split's document collection.
  2. Dense retrieval with an off-the-shelf embedding model (all-MiniLM-L6-v2).

The two ranked lists are fused using Reciprocal Rank Fusion (RRF, k=60). Documents that rank highly in both methods are prioritized — these are the hardest negatives, the ones most likely to confuse an embedding model and most valuable to verify.

After fusion, we filter out the known positive document and take the top-5,000 candidates per query. Each candidate pair receives a deterministic UUID based on uuid5(query_text + doc_id + source_dataset), making the entire pipeline idempotent.
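The fusion and ID steps above can be sketched in a few lines. The choice of uuid5 namespace here is an assumption for illustration:

```python
import uuid

RRF_K = 60  # constant from the pipeline description

def rrf_fuse(bm25_ranking, dense_ranking, k=RRF_K):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in (bm25_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents ranked well by BOTH methods accumulate the highest scores.
    return sorted(scores, key=scores.get, reverse=True)

def pair_uuid(query_text, doc_id, source_dataset):
    """Deterministic pair ID, so re-running the miner is idempotent."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, query_text + doc_id + source_dataset))
```

Because `pair_uuid` depends only on its inputs, re-mining the same split never enqueues duplicate work.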

Distributed annotation

The annotation system has two components:

kombinat — a FastAPI coordination server backed by PostgreSQL. It maintains the task queue of unlabeled pairs, assigns batches to contributors, receives and validates labels, tracks contributor reputation, and serves public progress statistics. The work queue uses SELECT FOR UPDATE SKIP LOCKED for concurrent batch assignment — no Redis, no external message broker.
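A minimal sketch of the claim query such a queue might run. Table and column names (`annotation_tasks`, `status`, `claimed_by`) are illustrative, not kombinat's actual schema:

```python
# SKIP LOCKED lets many contributors claim batches concurrently: each
# worker's subquery simply skips rows another transaction has locked,
# so no two contributors ever receive the same pending pair.

CLAIM_BATCH_SQL = """
UPDATE annotation_tasks
SET status = 'claimed', claimed_by = %(contributor_id)s, claimed_at = now()
WHERE id IN (
    SELECT id FROM annotation_tasks
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""
```

This pattern keeps coordination entirely inside PostgreSQL, which is why no Redis or external broker is needed.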

annotator — a CLI tool (or Docker container) that contributors run on their own hardware. It authenticates via GitHub OAuth, claims a batch of pairs from kombinat, runs a local language model to label each pair, and streams results back in chunks of ~50 pairs. The model runs entirely on the contributor’s machine. No API keys are transmitted. No data leaves the machine except pair IDs and labels.

The annotator auto-detects available hardware and selects the appropriate inference backend and model:

Hardware                   Backend     Model
NVIDIA GPU (≥18GB VRAM)    vLLM        Qwen2.5-7B-Instruct
NVIDIA GPU (≥8GB VRAM)     vLLM        Qwen2.5-7B-Instruct-AWQ
NVIDIA GPU (≥4GB VRAM)     vLLM        Qwen2.5-3B-Instruct-AWQ
Apple Silicon (≥6GB)       MLX         Qwen2.5-7B-Instruct (4-bit)
Apple Silicon (≥4GB)       MLX         Qwen2.5-3B-Instruct (4-bit)
CPU only                   llama.cpp   Qwen2.5-3B-Instruct (Q4_K_M)
CPU only (low RAM)         llama.cpp   Qwen2.5-1.5B-Instruct (Q4_K_M)

Only models that pass a quality threshold (Cohen’s κ ≥ 0.8 against frontier model labels on a held-out test set) are included in the registry.
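Cohen's κ for the binary case is straightforward to compute. This sketch shows how a candidate model's labels might be screened against frontier-model reference labels; the screening function is our illustration of the registry check:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary label sequences: agreement beyond chance."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal rate of labeling "1".
    pa = sum(labels_a) / n
    pb = sum(labels_b) / n
    p_chance = pa * pb + (1 - pa) * (1 - pb)
    return (p_observed - p_chance) / (1 - p_chance)

def passes_registry_threshold(candidate_labels, reference_labels, threshold=0.8):
    """Admit a model only if it agrees with frontier labels at kappa >= 0.8."""
    return cohens_kappa(candidate_labels, reference_labels) >= threshold
```

Unlike raw accuracy, κ discounts the agreement a model would get by always predicting the majority label, which matters when relevance labels are imbalanced.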

Quality control

Trust in crowdsourced labels comes from multiple overlapping mechanisms:

Redundancy: Every pair is labeled by at least 2 independent contributors. Agreement promotes the pair to verified. Disagreement triggers a third annotation or escalation to a higher-quality model.

Honeypot pairs: 5–10% of each batch consists of pairs with known ground truth labels (from human-annotated datasets like MS MARCO qrels and Natural Questions). These are injected transparently — the contributor doesn’t know which pairs are honeypots. If a contributor’s accuracy on honeypots drops below 90%, their entire batch is quarantined and their reputation score is adjusted.
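A minimal sketch of the honeypot check, with hypothetical pair IDs and function names:

```python
# Server-side check after a batch is submitted: score the contributor only
# on the hidden honeypot subset, and quarantine the whole batch if accuracy
# falls below the 90% threshold described above.

def honeypot_accuracy(batch_labels: dict, honeypot_truth: dict) -> float:
    """Accuracy on the subset of submitted pairs that are honeypots."""
    hits = sum(batch_labels[pid] == truth for pid, truth in honeypot_truth.items())
    return hits / len(honeypot_truth)

def should_quarantine(batch_labels: dict, honeypot_truth: dict, threshold=0.90) -> bool:
    """Quarantine the entire batch, not just the failed honeypots."""
    return honeypot_accuracy(batch_labels, honeypot_truth) < threshold
```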

Contributor reputation: Every contributor has a reputation score based on honeypot accuracy, agreement rate with other contributors, and label distribution consistency. Low-reputation contributors receive honeypot-heavy batches until their reliability is established.

Model and quantization tracking: Every annotation records which model and quantization level produced it. This enables post-hoc analysis of per-model agreement rates and bias patterns.

Training methodology

We will use exact softmax with coordinate ascent, as proposed by Morris (2022, 2026). This is not the standard approach.

Standard contrastive training uses in-batch negatives as a sampled softmax — a Monte Carlo approximation of the true loss. This approximation introduces noise that worsens as batch size grows relative to corpus size. Exact softmax computes the loss against the entire corpus (or a very large subset), not just the in-batch negatives. Morris demonstrated this approach dramatically outperforms NCE loss in settings where perfect labels are available — which is exactly what our annotation pipeline produces.
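A toy numeric contrast makes the approximation gap visible. This sketch uses random unit vectors and dot-product logits; it illustrates the sampled-vs-exact softmax denominator, not the coordinate-ascent training code itself:

```python
import numpy as np

def softmax_loss(query_vec, pos_vec, negative_vecs, temperature=0.05):
    """Cross-entropy of the positive under a softmax over positive + negatives."""
    candidates = np.vstack([pos_vec] + list(negative_vecs))
    logits = candidates @ query_vec / temperature
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]  # positive is index 0

rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 32))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = corpus[0] + 0.1 * rng.standard_normal(32)  # query near its positive doc
query /= np.linalg.norm(query)

exact_loss = softmax_loss(query, corpus[0], corpus[1:])      # whole corpus as negatives
sampled_loss = softmax_loss(query, corpus[0], corpus[1:32])  # in-batch subset only
```

The in-batch denominator is always smaller, so the sampled loss systematically understates the true loss; the optimizer chases a target that drifts further from the exact objective as the corpus-to-batch ratio grows.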

Training involves an iterative cycle:

  1. Train an initial model with standard contrastive loss on the cleaned data
  2. Re-embed the entire corpus with the updated model
  3. Re-mine hard negatives (the embedding space has shifted, so the “hard” boundary has moved)
  4. Re-label the new candidates (the annotation pipeline runs again)
  5. Train with exact softmax + coordinate ascent on the expanded, re-verified dataset
  6. Repeat from step 2

Hard negative re-mining is not a one-time preprocessing step — it is a recurring phase in the training cycle. Each iteration produces a better model, which in turn reveals harder negatives, which in turn enables further improvement.
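The cycle above can be sketched as a driver loop in which each phase is a pluggable callable. All function names here are illustrative, not the project's actual training API:

```python
def training_cycle(train, embed_corpus, mine_negatives, label_pairs, n_iterations=3):
    """Skeleton of the iterate loop; each phase is injected as a callable."""
    model = train(dataset=None)                   # step 1: initial contrastive training
    checkpoints = []
    for _ in range(n_iterations):
        embeddings = embed_corpus(model)          # step 2: re-embed with updated model
        candidates = mine_negatives(embeddings)   # step 3: the "hard" boundary has moved
        dataset = label_pairs(candidates)         # step 4: annotation pipeline re-runs
        model = train(dataset=dataset)            # step 5: exact softmax training
        checkpoints.append(model)                 # step 6: repeat with the new model
    return model, checkpoints
```

Structuring it this way makes the point in the text concrete: mining and labeling sit inside the loop body, not in a one-time preprocessing stage before it.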

Evaluation

Primary benchmark: MTEB (Massive Text Embedding Benchmark), specifically the retrieval category. We target state-of-the-art performance in two model size classes: base (~140M parameters) and large (~330M parameters).

Secondary benchmarks: BEIR (zero-shot retrieval across diverse domains), and eventually MIRACL (multilingual retrieval) when we expand beyond English.

We will publish MTEB scores after every training cycle, not just at release. The trajectory matters as much as the final number — it demonstrates whether the iterative re-mining approach is producing consistent gains.


04 — Deliverables

Everything we produce is public, immediately.

There is no internal version that differs from the public version.

The dataset

A HuggingFace dataset of verified (query, document, relevance) triples, each labeled by at least two independent contributors, released incrementally as domain splits are verified, with per-annotation metadata recording the model and quantization level that produced each label.

The annotation infrastructure

Three open-source repositories: kombinat (the FastAPI + PostgreSQL coordination server), annotator (the contributor CLI and Docker client), and the model training and evaluation code.

Trained models

Open embedding models released on HuggingFace at two scales: base (~140M parameters) and large (~330M parameters).

Each release includes full model weights (not just the final checkpoint — intermediate checkpoints at each training cycle), the exact training configuration, MTEB and BEIR evaluation results, and a training report documenting what worked, what didn’t, and what we’d do differently.

Technical reports

We will publish technical reports at each major milestone.


05 — What we are not

Let’s be clear about scope.


10 — Get involved

Your GPU can contribute tonight.

Run the annotator

Pull the Docker container or install via pip. Your GPU labels pairs while you sleep.

# NVIDIA GPU
$ docker pull ghcr.io/embedkomb/annotator:latest
$ docker run -it --gpus all ghcr.io/embedkomb/annotator:latest

# Apple Silicon
$ pip install "annotator[mlx]"
$ annotator

# CPU only
$ pip install "annotator[cpu]"
$ annotator

Contribute code

The repositories are on GitHub. We use red-green TDD. PRDs are written before implementation. Issues are labeled with good-first-issue for newcomers.

Sponsor compute

If you represent a company that could donate API credits, GPU time, or cloud resources — reach out at hello@embedkombinat.org. Every donated resource is accounted for publicly, and we publish exactly how it’s used. No waste. If you give us compute, we label pairs.

Spread the word

The more contributors we have, the faster the dataset grows, the better the model gets. Share this with anyone who cares about open-source AI infrastructure.
