ettin-encoder-17m-pretrain-50m

A retrieval encoder pretrained from jhu-clsp/ettin-encoder-17m on a 50M-pair balanced split of the DenseOn embeddings pre-training dataset.

Training data: this model was contrastively pretrained on capemox/denseon-pretrain-50m-balanced, a 50,000,000-pair sample drawn from the DenseOn corpus lightonai/embeddings-pre-training-curated (665M curated query–document pairs across 34 sources). Pairs are sampled per source with temperature weighting (T=2), followed by iterative equal-redistribution capping, so that no single source dominates the mix.
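The sampling scheme can be sketched roughly as follows. This is one plausible reading, not the script used to build the dataset. It assumes that temperature weighting gives each source a share proportional to size^(1/T), and that "iterative equal-redistribution capping" clips any quota that exceeds a source's available pairs and splits the surplus equally among the sources that still have room, repeating until nothing overflows. The helper name `allocate_quota` and the toy source sizes are hypothetical.

```python
def allocate_quota(sizes, budget, temperature=2.0):
    """Split `budget` samples across sources (hypothetical helper).

    Assumed interpretation: each source's share is proportional to
    size ** (1 / temperature); quotas above a source's size are capped
    and the surplus is shared equally among sources with room left.
    """
    weights = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(weights.values())
    quota = {name: budget * w / total for name, w in weights.items()}

    while True:
        # Cap every source that was asked for more pairs than it has.
        surplus = 0.0
        for name in quota:
            if quota[name] > sizes[name]:
                surplus += quota[name] - sizes[name]
                quota[name] = float(sizes[name])
        open_sources = [n for n in quota if quota[n] < sizes[n]]
        if surplus <= 1e-9 or not open_sources:
            break
        # Redistribute the surplus equally, then re-check the caps.
        share = surplus / len(open_sources)
        for name in open_sources:
            quota[name] += share
    return quota


# Toy source sizes for illustration only (not the real DenseOn sources).
sizes = {"large": 1_000_000, "medium": 10_000, "small": 100}
quota = allocate_quota(sizes, budget=50_000)
for name, q in quota.items():
    print(f"{name}: {q:.0f} of {sizes[name]} pairs")
```

With T=2 the largest toy source is 10,000 times bigger than the smallest but starts with only 100 times its share, and the cap keeps the smallest source from being oversampled.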

This is a Stage 1 (pretraining-only) checkpoint: it has not been fine-tuned on supervised retrieval data. Use it as a starting point for Stage 2 fine-tuning, or directly as a zero-shot retrieval encoder.

Training recipe


Setting	Value
Base model	`jhu-clsp/ettin-encoder-17m` (ModernBERT, ~17M params)
Training data	`capemox/denseon-pretrain-50m-balanced` (50M pairs, DenseOn split)
Loss	`MultipleNegativesRankingLoss` (full, in-batch negatives)
Batch size	1024 (1023 in-batch negatives/anchor)
Per-source batching	each batch drawn from one source dataset (DenseOn recipe)
Learning rate	3e-5, linear decay, 5% warmup
Epochs	1 (~48,827 steps)
Precision	bf16 + tf32, SDPA, torch.compile
Hardware	1× A100-80GB, ~5h 26m
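To make the loss row concrete, here is a minimal pure-Python sketch of an in-batch negatives objective of the kind `MultipleNegativesRankingLoss` computes: for each query, the paired document is the positive and every other document in the batch is a negative, scored with cross-entropy over scaled cosine similarities. The `scale=20.0` value and the use of cosine similarity follow the sentence-transformers defaults and are assumptions here, since the card does not state them. The function names and toy vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(queries, docs, scale=20.0):
    """Mean cross-entropy where docs[i] is the positive for queries[i]
    and all other docs in the batch act as negatives.
    `scale` is an assumed value, not taken from the model card."""
    losses = []
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, d) for d in docs]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


# Toy 2-d embeddings: a batch of 3 pairs gives 2 negatives per query.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
docs_aligned = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
docs_shuffled = [docs_aligned[1], docs_aligned[2], docs_aligned[0]]

loss_aligned = in_batch_negatives_loss(queries, docs_aligned)
loss_shuffled = in_batch_negatives_loss(queries, docs_shuffled)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is low when each query is closest to its own document and high when the pairing is shuffled. At the batch size of 1024 listed above, the same computation contrasts each positive against 1023 negatives.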

Evaluation (Stage 1 checkpoint, NDCG@10)

BEIR subset (zero-shot, no fine-tuning):

Dataset	NDCG@10
ArguAna	0.4482
FiQA2018	0.2548
NFCorpus	0.2605
SCIDOCS	0.1629
SciFact	0.6276
TRECCOVID	0.5044
Mean	0.3764
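For readers unfamiliar with the metric, the following sketch shows how NDCG@10 can be computed for a single query, and checks that the Mean row is the unweighted average of the six scores above. It assumes linear gain (the relevance label is used directly) and a log2 rank discount. The evaluation harness used for this card may differ in details, and the toy ranking is made up for illustration.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of an ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Toy query: two relevant documents, retrieved at ranks 2 and 4.
ndcg = ndcg_at_k(ranked_relevances=[0, 1, 0, 1], all_relevances=[1, 1])
print(f"toy NDCG@10: {ndcg:.4f}")

# Per-dataset scores copied from the table above.
scores = {
    "ArguAna": 0.4482,
    "FiQA2018": 0.2548,
    "NFCorpus": 0.2605,
    "SCIDOCS": 0.1629,
    "SciFact": 0.6276,
    "TRECCOVID": 0.5044,
}
mean = sum(scores.values()) / len(scores)
print(f"mean NDCG@10: {mean:.4f}")
```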

NanoBEIR (13-dataset aggregate NDCG@10, measured at the end of training): 0.5074

Downstream value

When this checkpoint is fine-tuned on MS MARCO hard negatives (tomaarsen/msmarco-Qwen3-Reranker-0.6B), it reaches a mean BEIR NDCG@10 of 0.3061, versus 0.2264 for the same fine-tune starting from the raw base model. That is a relative gain of about 35% ((0.3061 − 0.2264) / 0.2264 ≈ 0.35) from this pretraining step.

Usage

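```python
from sentence_transformers import SentenceTransformer

# Load the Stage 1 checkpoint from the Hugging Face Hub.
model = SentenceTransformer("capemox/ettin-encoder-17m-pretrain-50m")

queries = ["What is the capital of France?"]
docs = ["Paris is the capital and largest city of France."]

# L2-normalize the embeddings so the dot product below is cosine similarity.
q = model.encode(queries, normalize_embeddings=True)
d = model.encode(docs, normalize_embeddings=True)
print(q @ d.T)  # similarity matrix of shape (num_queries, num_docs)
```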
