ettin-encoder-17m-pretrain-50m

A retrieval encoder pretrained from jhu-clsp/ettin-encoder-17m on a 50M-pair balanced split of the DenseOn embeddings pre-training dataset.

Training data: this model was contrastively pretrained on capemox/denseon-pretrain-50m-balanced, a 50,000,000-pair sample drawn from the DenseOn corpus lightonai/embeddings-pre-training-curated (665M curated query–document pairs across 34 sources). Pairs are sampled with temperature weighting (T=2), followed by iterative equal-redistribution capping, so that no single source dominates the mix.
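To make the sampling idea concrete, here is a minimal sketch of temperature weighting with capping. The exact redistribution rule used to build the dataset is not spelled out in this card, so the rule below is an assumption: each source's target share is proportional to size^(1/T), no source may contribute more pairs than it has, and the overflow from capped sources is split equally among sources that still have room. The helper name `allocate_budget` and the source sizes are hypothetical.

```python
def allocate_budget(sizes, budget, temperature=2.0):
    """Split `budget` pairs across sources (hypothetical helper, assumed rule).

    Target share is proportional to size ** (1 / temperature). A source can
    never contribute more pairs than it holds; overflow from capped sources is
    shared equally among the sources that still have room, and this repeats
    until nothing overflows. Allocations are left as floats (no rounding).
    """
    if budget > sum(sizes.values()):
        raise ValueError("budget exceeds the total number of available pairs")

    weights = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total_weight = sum(weights.values())
    alloc = {name: budget * w / total_weight for name, w in weights.items()}

    while True:
        overflow = sum(max(0.0, alloc[s] - sizes[s]) for s in sizes)
        if overflow <= 1e-9:
            break
        # Cap every source at its real size.
        for s in sizes:
            alloc[s] = min(alloc[s], sizes[s])
        # Hand the excess out equally to sources that are not yet full.
        open_sources = [s for s in sizes if alloc[s] < sizes[s]]
        if not open_sources:
            break
        share = overflow / len(open_sources)
        for s in open_sources:
            alloc[s] += share
    return alloc


# Made-up source sizes, only to show the effect of T=2 and capping.
example_sizes = {"tiny": 100, "medium": 10_000, "huge": 1_000_000}
example_alloc = allocate_budget(example_sizes, budget=30_000)
for name, pairs in example_alloc.items():
    print(f"{name}: {pairs:.1f} pairs")
```

With T=2 the largest source still receives the most pairs, but far less than its raw share of the corpus, while the smallest source is used in full.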

This is a Stage 1 (pretraining-only) checkpoint — it has not been fine-tuned on supervised retrieval data. Use it as a strong starting point for Stage 2 fine-tuning, or as a zero-shot retrieval encoder.

Training recipe

Base model: jhu-clsp/ettin-encoder-17m (ModernBERT, ~17M params)
Training data: capemox/denseon-pretrain-50m-balanced (50M pairs, DenseOn split)
Loss: MultipleNegativesRankingLoss (full, in-batch negatives)
Batch size: 1024 (1023 in-batch negatives per anchor)
Per-source batching: each batch drawn from one source dataset (DenseOn recipe)
Learning rate: 3e-5, linear decay, 5% warmup
Epochs: 1 (~48,827 steps)
Precision: bf16 + tf32, SDPA, torch.compile
Hardware: 1× A100-80GB, ~5h 26m
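For readers unfamiliar with in-batch negatives, the following is a small, dependency-free sketch of the objective, not the training code used for this checkpoint. It assumes L2-normalized vectors and a similarity scale of 20.0, which mirrors the default in sentence-transformers' MultipleNegativesRankingLoss; the scale actually used here is not stated. The function name `in_batch_negatives_loss` is hypothetical.

```python
import math


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where anchor i's correct match is positive i.

    `anchors` and `positives` are equal-length lists of L2-normalized vectors,
    so each dot product is a cosine similarity. Every other positive in the
    batch serves as a negative for anchor i (1023 of them at batch size 1024).
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [
            scale * sum(x * y for x, y in zip(anchor, positive))
            for positive in positives
        ]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


# Toy batch of two pairs: correct pairing gives a near-zero loss,
# swapping the positives gives a large one.
a = [[1.0, 0.0], [0.0, 1.0]]
p = [[1.0, 0.0], [0.0, 1.0]]
print(in_batch_negatives_loss(a, p))
print(in_batch_negatives_loss(a, p[::-1]))
```

Because the negatives come from the same batch, per-source batching means every negative for an anchor is drawn from the same source dataset as its positive.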

Evaluation (Stage 1 checkpoint, NDCG@10)

BEIR subset (zero-shot, no fine-tuning):

Dataset NDCG@10
ArguAna 0.4482
FiQA2018 0.2548
NFCorpus 0.2605
SCIDOCS 0.1629
SciFact 0.6276
TRECCOVID 0.5044
Mean 0.3764
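As a reminder of what the metric measures, here is a minimal per-query NDCG@10 sketch. It is not the evaluation harness behind the numbers above; it assumes the common linear-gain form with a log2 rank discount, and the function name `ndcg_at_k` is hypothetical. The last lines recompute the mean from the six per-dataset scores in the table.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """NDCG@k for a single query.

    ranked_doc_ids: document ids in the order the model returned them.
    relevance: dict of doc id -> graded relevance (missing ids count as 0).
    """
    def dcg(gains):
        # enumerate starts at 0, so rank position r (1-based) uses log2(r + 1).
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))

    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Macro-average of the per-dataset scores reported in the table.
beir_scores = {
    "ArguAna": 0.4482,
    "FiQA2018": 0.2548,
    "NFCorpus": 0.2605,
    "SCIDOCS": 0.1629,
    "SciFact": 0.6276,
    "TRECCOVID": 0.5044,
}
print(f"{sum(beir_scores.values()) / len(beir_scores):.4f}")  # 0.3764
```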

NanoBEIR (13-dataset aggregate, end of training): 0.5074

Downstream value

When this checkpoint is fine-tuned on MS MARCO hard negatives (tomaarsen/msmarco-Qwen3-Reranker-0.6B), it reaches 0.3061 mean BEIR NDCG@10, versus 0.2264 for the same fine-tune starting from the raw base model. That is a relative gain of about 35% from this pretraining step.

Usage

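from sentence_transformers import SentenceTransformer

# Load the Stage 1 checkpoint from the Hugging Face Hub.
model = SentenceTransformer("capemox/ettin-encoder-17m-pretrain-50m")

queries = ["What is the capital of France?"]
docs = ["Paris is the capital and largest city of France."]

# With L2-normalized embeddings, the dot product equals cosine similarity.
q = model.encode(queries, normalize_embeddings=True)
d = model.encode(docs, normalize_embeddings=True)
print(q @ d.T)  # similarity matrix of shape (len(queries), len(docs))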
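For retrieval over more than one document, each row of the similarity matrix can be turned into a ranked list. The helper below is a hypothetical convenience function, not part of sentence-transformers; it accepts any indexable row of scores, so it works on a row of `q @ d.T` as well as on a plain list.

```python
def top_k(scores_row, docs, k=3):
    """Return the k highest-scoring (score, document) pairs for one query."""
    order = sorted(
        range(len(docs)),
        key=lambda j: float(scores_row[j]),
        reverse=True,
    )
    return [(float(scores_row[j]), docs[j]) for j in order[:k]]


# Stand-in scores; with the model loaded, pass (q @ d.T)[0] instead.
example_docs = ["doc about Paris", "doc about Berlin", "doc about Rome"]
example_scores = [0.82, 0.31, 0.44]
for score, doc in top_k(example_scores, example_docs, k=2):
    print(f"{score:.2f}  {doc}")
```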