Embed Kombinat is an open-source project to produce the highest-quality text embedding training data ever assembled — and use it to train state-of-the-art embedding models that are fully open: open weights, open data, open code, open process.
We are building three things:

- A crowdsourced labeling system (kombinat + annotator) that distributes relevance labeling across hundreds of volunteer GPUs running small language models locally.
- A cleaned, verified dataset of hundreds of millions of (query, document, relevance) triples — the largest open dataset of its kind, released incrementally on HuggingFace.
- Embedding models trained on this data at base (~140M) and large (~330M) parameter scales, targeting state-of-the-art performance on MTEB retrieval benchmarks.

The dataset and the infrastructure are the primary contributions. The trained models are the proof that the data works, but anyone can take the dataset and train their own models however they want.
There are trillions upon trillions of query-document pairs on the internet. This is, in principle, the richest training signal imaginable for embedding models. In practice, almost none of it is usable.
The reason is a data labeling problem hiding inside every contrastive learning dataset. Contrastive training starts with known positive pairs — a query and a document that answers it. To learn useful embeddings, the model also needs negatives: documents that don’t answer the query. The standard approach is to treat everything else in the batch as a negative. As you scale up the dataset, you add more documents to the pool of implicit negatives.
The problem: some of those documents are relevant. A query like “was Ronald Reagan a democrat?” will have one labeled positive document. But in a corpus of millions, there are inevitably other documents that also answer this question. They get silently treated as negatives. The model learns to push them away from the query — the exact opposite of what it should do.
This is called the false negative problem in contrastive learning. As datasets grow, the density of these mislabeled pairs increases. The training signal gets noisier. At some point, adding more data makes the model worse. This is why embedding model performance has plateaued at around 10M training pairs while LLMs continue to improve with more data.
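A toy numpy sketch (illustrative only, not the project's training code) makes the mechanism concrete. Doc 2 in this batch is nearly identical to doc 0, so it also answers query 0 — but InfoNCE with in-batch negatives treats it as a negative for query 0 anyway:

```python
import numpy as np

def in_batch_infonce(query_embs, doc_embs, temperature=0.05):
    """InfoNCE with in-batch negatives: row i's positive is doc i;
    every other document in the batch is treated as a negative."""
    sims = query_embs @ doc_embs.T / temperature      # (B, B) similarity matrix
    sims = sims - sims.max(axis=1, keepdims=True)     # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                 # positives sit on the diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = q.copy()                                  # doc i matches query i
d[2] = d[0] + 0.01 * rng.normal(size=8)       # doc 2 also answers query 0: a false negative
d /= np.linalg.norm(d, axis=1, keepdims=True)

loss = in_batch_infonce(q, d)
```

Because doc 2 is almost a duplicate of doc 0, the softmax for query 0 splits probability mass between them, and gradient descent pushes doc 2 away from query 0 — even though it is relevant.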
This isn’t speculation. The CDE paper (Morris & Rush, 2024; ICLR 2025) demonstrated that even crude embedding-based filtering of false negatives within training batches produced nearly a 10% improvement in retrieval performance. The ANCE paper (Xiong et al., 2020) showed that iteratively re-mining hard negatives after each training cycle consistently improved dense retrieval quality. Research on MS MARCO has found that over 70% of top-retrieved passages are actually false negatives in the original annotations.
The pattern is clear: the data exists, the labels are broken, and fixing them unlocks scaling.
It has — just not in the open. OpenAI, Google, Anthropic, and Cursor all train internal embedding models. When they publish papers, the methodology sections make it clear they invest heavily in data cleaning and negative verification. But the cleaned datasets are never released. The resulting models are served behind APIs.
There is no technical barrier to producing this data. The fix is dead simple: for each (query, document) pair, ask a language model “does this document answer this query?” The big labs do this at scale with their own infrastructure. They just don’t open-source the results.
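A minimal sketch of what that check looks like in code. The prompt wording and the binary YES/NO label scheme here are illustrative assumptions, not the project's final template:

```python
# Illustrative prompt template; the real pipeline's wording may differ.
PROMPT_TEMPLATE = """You are judging search relevance.

Query: {query}

Document: {document}

Does the document answer the query? Reply with exactly one word: YES or NO."""

def build_prompt(query: str, document: str) -> str:
    return PROMPT_TEMPLATE.format(query=query, document=document)

def parse_label(model_output: str) -> int:
    """Map the model's reply to a binary relevance label (1 = relevant)."""
    first_word = model_output.strip().split()[0].upper().rstrip(".,!")
    if first_word == "YES":
        return 1
    if first_word == "NO":
        return 0
    raise ValueError(f"Unparseable label: {model_output!r}")
```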
We will.
Relevance labeling is embarrassingly parallel. Each (query, document) pair is independent — no shared state, no sequential dependencies, no gradient synchronization. A contributor’s GPU processes pairs in isolation. Only the labels flow back. This is the ideal shape for distributed computation.
The task itself is simple enough for small language models (7B parameters and below) to handle reliably at the binary or graded relevance level. This means the compute bar for contribution is low: anyone with a laptop GPU or even a CPU can participate. The task doesn’t require a frontier model, a cloud API, or any money.
The analogy is Folding@home or BOINC — distributed computation where individually small contributions aggregate into something no single participant could achieve alone. Except instead of donating CPU cycles for protein folding, contributors donate LLM inference for relevance labeling.
The full pipeline runs in a repeating cycle: mine candidates, label them, train, evaluate, repeat with better embeddings.
We start with nomic-ai/nomic-embed-unsupervised-data, a publicly available dataset of ~235 million weakly-supervised text pairs across 29 domain splits (Reddit, Wikipedia, Amazon Reviews, StackExchange, academic papers, etc.). Each row is a positive (query, document) pair.
This dataset was chosen because it is fully open, large, diverse across domains, and already used to train several open embedding models (Nomic Embed v1 and v2). Starting from the same source data means our improvements are directly attributable to label quality, not data selection.
For each query in a split, we retrieve the top-K most similar documents using two complementary methods:
- Dense retrieval: all-MiniLM-L6-v2 embeds all documents and queries, indexed with FAISS IVFFlat.

The two ranked lists are fused using Reciprocal Rank Fusion (RRF, k=60). Documents that rank highly in both methods are prioritized — these are the hardest negatives, the ones most likely to confuse an embedding model and most valuable to verify.
After fusion, we filter out the known positive document and take the top-5,000 candidates per query. Each candidate pair receives a deterministic UUID based on uuid5(query_text + doc_id + source_dataset), making the entire pipeline idempotent.
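Both pieces are short enough to sketch directly. RRF and the deterministic pair IDs might look like this (the uuid5 namespace is an assumption; the source only specifies which fields are hashed):

```python
import uuid

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank_d),
    ranks starting at 1. Documents high in several lists score highest."""
    scores = {}
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical namespace; the pipeline hashes query_text + doc_id + source_dataset.
PAIR_NAMESPACE = uuid.NAMESPACE_URL

def pair_id(query_text, doc_id, source_dataset):
    """Deterministic UUID, so re-running the miner never duplicates pairs."""
    return uuid.uuid5(PAIR_NAMESPACE, query_text + doc_id + source_dataset)

fused = rrf_fuse([["d1", "d2", "d3"], ["d1", "d3", "d4"]])
```

Because "d1" ranks first in both lists, it tops the fused ranking; that is exactly the "hard negative" profile the pipeline prioritizes for verification.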
The annotation system has two components:
kombinat — a FastAPI coordination server backed by PostgreSQL. It maintains the task queue of unlabeled pairs, assigns batches to contributors, receives and validates labels, tracks contributor reputation, and serves public progress statistics. The work queue uses SELECT FOR UPDATE SKIP LOCKED for concurrent batch assignment — no Redis, no external message broker.
annotator — a CLI tool (or Docker container) that contributors run on their own hardware. It authenticates via GitHub OAuth, claims a batch of pairs from kombinat, runs a local language model to label each pair, and streams results back in chunks of ~50 pairs. The model runs entirely on the contributor’s machine. No API keys are transmitted. No data leaves the machine except pair IDs and labels.
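The claim query implied by SELECT FOR UPDATE SKIP LOCKED can be sketched as a single statement: concurrent workers each lock a disjoint batch of pending pairs without blocking one another. Table and column names here are illustrative assumptions, not kombinat's actual schema:

```python
# Hypothetical schema: a "pairs" table with status/priority columns.
# SKIP LOCKED makes concurrently claimed rows invisible to other workers,
# so no external queue or broker is needed.
CLAIM_BATCH_SQL = """
UPDATE pairs
SET status = 'assigned', assigned_to = %(contributor_id)s, assigned_at = now()
WHERE id IN (
    SELECT id FROM pairs
    WHERE status = 'pending'
    ORDER BY priority DESC
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""
```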
The annotator auto-detects available hardware and selects the appropriate inference backend and model:
| Hardware | Backend | Model |
|---|---|---|
| NVIDIA GPU (≥18GB VRAM) | vLLM | Qwen2.5-7B-Instruct |
| NVIDIA GPU (≥8GB VRAM) | vLLM | Qwen2.5-7B-Instruct-AWQ |
| NVIDIA GPU (≥4GB VRAM) | vLLM | Qwen2.5-3B-Instruct-AWQ |
| Apple Silicon (≥6GB) | MLX | Qwen2.5-7B-Instruct (4-bit) |
| Apple Silicon (≥4GB) | MLX | Qwen2.5-3B-Instruct (4-bit) |
| CPU only | llama.cpp | Qwen2.5-3B-Instruct (Q4_K_M) |
| CPU only (low RAM) | llama.cpp | Qwen2.5-1.5B-Instruct (Q4_K_M) |
Only models that pass a quality threshold (Cohen’s κ ≥ 0.8 against frontier model labels on a held-out test set) are included in the registry.
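For reference, Cohen's κ compares observed agreement between two annotators to the agreement expected by chance. A minimal implementation for binary labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary annotators:
    kappa = (p_observed - p_chance) / (1 - p_chance)."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1 = sum(labels_a) / n      # annotator A's rate of label 1
    p_b1 = sum(labels_b) / n      # annotator B's rate of label 1
    p_chance = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_observed - p_chance) / (1 - p_chance)
```

A κ of 1.0 means perfect agreement; 0 means no better than chance. The ≥0.8 bar is a conventional threshold for "almost perfect" agreement.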
Trust in crowdsourced labels comes from multiple overlapping mechanisms:
Redundancy: Every pair is labeled by at least 2 independent contributors. Agreement promotes the pair to verified. Disagreement triggers a third annotation or escalation to a higher-quality model.
Honeypot pairs: 5–10% of each batch consists of pairs with known ground truth labels (from human-annotated datasets like MS MARCO qrels and Natural Questions). These are injected transparently — the contributor doesn’t know which pairs are honeypots. If a contributor’s accuracy on honeypots drops below 90%, their entire batch is quarantined and their reputation score is adjusted.
Contributor reputation: Every contributor has a reputation score based on honeypot accuracy, agreement rate with other contributors, and label distribution consistency. Low-reputation contributors receive honeypot-heavy batches until their reliability is established.
Model and quantization tracking: Every annotation records which model and quantization level produced it. This enables post-hoc analysis of per-model agreement rates and bias patterns.
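The redundancy rule above can be sketched as a small state machine. The majority-of-three resolution and the status names are illustrative assumptions; the source also allows escalation to a higher-quality model instead of a third annotator:

```python
def consolidate(labels):
    """Promote a pair based on its independent labels (0/1).
    Two agreeing labels verify the pair; a disagreement requests a third."""
    if len(labels) < 2:
        return ("pending", None)
    if len(labels) == 2:
        if labels[0] == labels[1]:
            return ("verified", labels[0])
        return ("needs_third", None)          # or escalate to a stronger model
    majority = 1 if sum(labels[:3]) >= 2 else 0
    return ("verified", majority)
```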
We will use exact softmax with coordinate ascent, as proposed by Morris (2022, 2026). This is not the standard approach.
Standard contrastive training uses in-batch negatives as a sampled softmax — a Monte Carlo approximation of the true loss. This approximation introduces noise that worsens as batch size grows relative to corpus size. Exact softmax computes the loss against the entire corpus (or a very large subset), not just the in-batch negatives. Morris demonstrated this approach dramatically outperforms NCE loss in settings where perfect labels are available — which is exactly what our annotation pipeline produces.
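The difference is in the softmax denominator. A numpy sketch (the coordinate-ascent optimization itself is beyond this toy; this only shows sampled vs. exact normalization):

```python
import numpy as np

def softmax_nll(q, pos, negs, temperature=0.05):
    """-log p(pos | q), softmax over the positive plus the given negatives.
    negs = the in-batch documents gives the usual sampled approximation;
    negs = the entire corpus gives the exact softmax."""
    sims = np.concatenate(([q @ pos], negs @ q)) / temperature
    sims = sims - sims.max()              # numerical stability
    return -(sims[0] - np.log(np.exp(sims).sum()))

rng = np.random.default_rng(1)
corpus = rng.normal(size=(1000, 16))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
pos = corpus[0]
q = pos + 0.1 * rng.normal(size=16)
q /= np.linalg.norm(q)

loss_sampled = softmax_nll(q, pos, corpus[1:32])   # 31 in-batch negatives
loss_exact = softmax_nll(q, pos, corpus[1:])       # every other document
```

The exact loss is never smaller than the sampled one, because the batch's denominator omits most of the corpus; that gap is the Monte Carlo bias the exact formulation removes.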
Training involves an iterative cycle:
Hard negative re-mining is not a one-time preprocessing step — it is a recurring phase in the training cycle. Each iteration produces a better model, which in turn reveals harder negatives, which in turn enables further improvement.
Primary benchmark: MTEB (Massive Text Embedding Benchmark), specifically the retrieval category. We target state-of-the-art performance in two model size classes: base (~140M parameters) and large (~330M parameters).
Secondary benchmarks: BEIR (zero-shot retrieval across diverse domains), and eventually MIRACL (multilingual retrieval) when we expand beyond English.
We will publish MTEB scores after every training cycle, not just at release. The trajectory matters as much as the final number — it demonstrates whether the iterative re-mining approach is producing consistent gains.
There is no internal version that differs from the public version.
A HuggingFace dataset of (query, document, relevance) triples, each carrying its relevance label and verified status.

Three open-source repositories:
Open embedding models released on HuggingFace at two scales: base (~140M parameters) and large (~330M parameters).
Each release includes full model weights (not just the final checkpoint — intermediate checkpoints at each training cycle), the exact training configuration, MTEB and BEIR evaluation results, and a training report documenting what worked, what didn’t, and what we’d do differently.
We will publish technical reports at each major milestone.
Pull the Docker container or install via pip. Your GPU labels pairs while you sleep.
```shell
# NVIDIA GPU
$ docker pull ghcr.io/embedkomb/annotator:latest
$ docker run -it --gpus all ghcr.io/embedkomb/annotator:latest

# Apple Silicon
$ pip install "annotator[mlx]"
$ annotator

# CPU only
$ pip install "annotator[cpu]"
$ annotator
```
The repositories are on GitHub. We use red-green TDD. PRDs are written before implementation. Issues are labeled with good-first-issue for newcomers.
If you represent a company that could donate API credits, GPU time, or cloud resources — reach out at hello@embedkombinat.org. Every donated resource is accounted for publicly, and we publish exactly how it’s used. No waste. If you give us compute, we label pairs.
The more contributors we have, the faster the dataset grows, the better the model gets. Share this with anyone who cares about open-source AI infrastructure.