Artifact Harness

A versioned program runtime where every prompt, program, workflow, and model component is an auditable, executable, evaluable artifact.

The Artifact Harness treats agent-produced outputs as first-class versioned programs. It draws on DSPy's compiled programs, TextGrad's text-gradient optimization, RLM context procedures, and W3C PROV provenance tracking. The core invariant is auditability: every artifact can be inspected through its execution traces, and every claimed improvement is backed by measured scores.

The Idea

Every artifact contains six components:

Component	Purpose	Analogy
Interface	Typed I/O contract	DSPy `Signature`
Implementation	Executable logic	DSPy `Module.forward()`
Evaluator	Quality gate	DSPy `metric(example, prediction, trace)`
Traces	Execution history	DSPy `dspy.settings.trace`
Tests	Correctness constraints	DSPy `Assert` / `Suggest`
Provenance	Version graph	W3C PROV `wasDerivedFrom`
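
To make the six components concrete, here is a minimal sketch of how they might sit together in one immutable record. `ArtifactSketch` and its field names are hypothetical stand-ins for illustration; they are not the harness's actual `Artifact` schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ArtifactSketch:
    """Hypothetical stand-in for the six-component artifact (not the real schema)."""
    name: str
    interface: str                                 # Interface: typed I/O contract, e.g. "text -> label"
    implementation: Callable[[dict], dict]         # Implementation: executable logic
    evaluator: Callable[[dict, dict, Any], float]  # Evaluator: (inputs, outputs, trace) -> score
    traces: tuple = ()                             # Traces: execution history
    tests: tuple = ()                              # Tests: correctness constraints
    parent_id: Optional[str] = None                # Provenance: version this one was derived from


def classify(inputs: dict) -> dict:
    return {"label": "positive" if "good" in inputs["text"].lower() else "negative"}


sketch = ArtifactSketch(
    name="classifier",
    interface="text -> label",
    implementation=classify,
    evaluator=lambda inp, out, tr: 1.0 if out["label"] == inp.get("expected") else 0.0,
)

example = {"text": "This is good", "expected": "positive"}
outputs = sketch.implementation(example)
score = sketch.evaluator(example, outputs, None)
```

A root version has no parent, so `parent_id` defaults to `None`; derived versions would carry the identifier of the version they came from.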

The agent operates on artifacts through five primitive operations:

EXECUTE  → Run with full trace capture
CRITIQUE → Evaluate on examples, produce structured feedback
MUTATE   → Create new version with changes (immutability preserved)
COMPOSE  → Chain artifacts into pipelines
PROMOTE  → Move to production status after passing gate evaluation
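
The PROMOTE step can be pictured as a plain predicate over evaluation statistics. The sketch below assumes the four criteria named under Design Principles (minimum score, minimum trace count, test pass rate, improvement over the currently promoted version). `GateSketch`, `passes_gate`, and the threshold values are hypothetical and are not the real `PromotionGate` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateSketch:
    """Hypothetical gate thresholds; the values are illustrative only."""
    min_score: float = 0.8
    min_traces: int = 5
    min_test_pass_rate: float = 1.0
    min_improvement: float = 0.01


def passes_gate(gate, mean_score, n_traces, test_pass_rate, promoted_score=None):
    """Return (passed, reasons); reasons lists every criterion that failed."""
    reasons = []
    if mean_score < gate.min_score:
        reasons.append("score below minimum")
    if n_traces < gate.min_traces:
        reasons.append("not enough traces")
    if test_pass_rate < gate.min_test_pass_rate:
        reasons.append("test pass rate below minimum")
    # Only compare against a promoted version if one exists.
    if promoted_score is not None and mean_score - promoted_score < gate.min_improvement:
        reasons.append("no improvement over promoted version")
    return (not reasons, reasons)


gate = GateSketch()
first_promotion = passes_gate(gate, mean_score=0.90, n_traces=10, test_pass_rate=1.0)
too_few_traces = passes_gate(gate, mean_score=0.90, n_traces=2, test_pass_rate=1.0)
no_gain = passes_gate(gate, mean_score=0.90, n_traces=10, test_pass_rate=1.0,
                      promoted_score=0.90)
```

Returning the failure reasons alongside the boolean keeps the decision auditable, which fits the harness's emphasis on explaining why a version was or was not promoted.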

All representations are supported: prompts, DSPy programs, Python functions, DAG workflows, RLM-style context procedures, and fine-tuned model references. The invariant is not the representation but the auditability.

Quick Start

from artifact_harness import ArtifactHarness, EvalDataset, PromotionGate, TestCase, ConstraintLevel

harness = ArtifactHarness("my_harness")

# Create an artifact from any callable
def classify(text: str) -> dict:
    return {"label": "positive" if "good" in text.lower() else "negative"}

artifact = harness.create_code_artifact(
    name="classifier",
    fn=classify,
    interface="text -> label",          # DSPy-style shorthand
    evaluator=lambda inp, out, tr: 1.0 if out["label"] == inp.get("expected") else 0.0,
    tests=[
        TestCase(
            name="valid_label",
            predicate=lambda out: out["label"] in ("positive", "negative"),
            level=ConstraintLevel.ASSERT,
        ),
    ],
)

# Execute with trace capture
result = harness.execute(artifact, {"text": "This is good"})
# result.outputs = {"label": "positive"}
# result.trace   → full ExecutionTrace with steps, timing, metadata
# result.test_results → [("valid_label", True, "OK")]

# Evaluate on a dataset
dataset = EvalDataset("test", [
    {"text": "good product", "expected": "positive"},
    {"text": "bad product", "expected": "negative"},
])
eval_run = harness.evaluate(artifact, dataset)
# eval_run.mean_score, eval_run.pass_rate, eval_run.traces
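
For intuition about what an execution with trace capture involves, the following stdlib-only sketch wraps a callable, records inputs, outputs, and latency, and then applies test predicates. `TraceSketch`, `Level`, and `execute_with_trace` are hypothetical names; the real `harness.execute` may record considerably more (per-step I/O, metadata).

```python
import time
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    ASSERT = "assert"    # hard constraint: a failure invalidates the run
    SUGGEST = "suggest"  # soft constraint: a failure is only recorded


@dataclass
class TraceSketch:
    """Hypothetical, simplified trace record."""
    inputs: dict
    outputs: dict
    latency_s: float
    test_results: list = field(default_factory=list)
    ok: bool = True


def execute_with_trace(fn, inputs, tests=()):
    start = time.perf_counter()
    outputs = fn(**inputs)
    trace = TraceSketch(inputs, outputs, time.perf_counter() - start)
    for name, predicate, level in tests:
        passed = bool(predicate(outputs))
        trace.test_results.append((name, passed, "OK" if passed else "FAILED"))
        if not passed and level is Level.ASSERT:
            trace.ok = False  # only hard constraints flip the run to failed
    return trace


def classify(text: str) -> dict:
    return {"label": "positive" if "good" in text.lower() else "negative"}


trace = execute_with_trace(
    classify,
    {"text": "This is good"},
    tests=[("valid_label", lambda out: out["label"] in ("positive", "negative"), Level.ASSERT)],
)
```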

The Optimization Loop

The harness drives measurable improvement through a critique→mutate→evaluate cycle:

from artifact_harness import OptimizationConfig

# Define how to mutate based on critique feedback
def mutator(artifact, critique_result):
    if critique_result["overall_score"] < 0.8:
        return {
            "instructions": "Handle edge cases better",
            "demonstrations": [{"text": "example", "label": "positive"}],
        }
    return {"instructions": "Fine-tuned instructions"}

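# train_dataset and eval_dataset are assumed to be EvalDataset instances,
# built the same way as `dataset` in Quick Start
result = harness.optimize(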
    artifact,
    train_data=train_dataset,
    eval_data=eval_dataset,
    mutator_fn=mutator,
    config=OptimizationConfig(
        max_iterations=10,
        target_score=0.95,
        improvement_patience=3,
    ),
)
# result.initial_score → result.final_score
# result.history → per-iteration trace
# result.promoted → whether best version passed the gate
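
The control flow behind such a loop can be sketched without the harness. The version below is a simplified illustration that collapses the critique to a single scalar score and stops on `target_score`, `max_iterations`, or `improvement_patience` consecutive non-improving trials. `optimize_sketch`, the toy keyword classifier, and the data are all hypothetical; the real `harness.optimize` may behave differently.

```python
def optimize_sketch(score_fn, mutate_fn, candidate,
                    max_iterations=10, target_score=0.95, improvement_patience=3):
    """Keep the best-scoring candidate; stop on target, patience, or iteration cap."""
    best, best_score = candidate, score_fn(candidate)
    history = [best_score]
    stale = 0  # consecutive trials that failed to improve on the best score
    for _ in range(max_iterations):
        if best_score >= target_score or stale >= improvement_patience:
            break
        trial = mutate_fn(best, best_score)  # a new candidate; `best` is left untouched
        trial_score = score_fn(trial)
        history.append(trial_score)
        if trial_score > best_score:
            best, best_score, stale = trial, trial_score, 0
        else:
            stale += 1
    return best, best_score, history


# Toy problem: the "artifact" is a tuple of keywords that signal a positive label.
DATA = [("good product", "positive"), ("great value", "positive"),
        ("love it", "positive"), ("bad product", "negative")]
POOL = ["great", "love"]


def score(keywords):
    hits = 0
    for text, expected in DATA:
        label = "positive" if any(k in text for k in keywords) else "negative"
        hits += label == expected
    return hits / len(DATA)


def mutate(keywords, current_score):
    extra = next((w for w in POOL if w not in keywords), None)
    return keywords if extra is None else keywords + (extra,)


best, best_score, history = optimize_sketch(score, mutate, ("good",))
```

Because each trial is a new tuple rather than an in-place edit, earlier candidates remain available, mirroring the immutability the harness requires of mutations.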

Composition

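# Compose artifacts into pipelines; classifier and explainer are assumed to be
# artifacts created earlier (e.g. with harness.create_code_artifact)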
pipeline = harness.compose(
    classifier,      # text → label
    explainer,       # label → explanation
    name="classify_and_explain",
)

result = harness.execute(pipeline, {"text": "Great product!"})
# Runs classifier, feeds output to explainer
# Full trace captures both steps
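
One way to picture composition is as strict chaining, where each step receives only the previous step's output and every hop is appended to a shared trace. This is a sketch under that assumption; `compose_sketch` and the two toy steps are hypothetical, and the real `harness.compose` may merge or map fields differently.

```python
def compose_sketch(*steps):
    """Chain (name, fn) pairs; each fn takes a dict and returns a dict."""
    def pipeline(inputs):
        trace, data = [], dict(inputs)
        for name, fn in steps:
            out = fn(data)
            # Record a copy of what went in and what came out of this step.
            trace.append({"step": name, "inputs": dict(data), "outputs": dict(out)})
            data = out  # strict chaining: next step sees only this output
        return data, trace
    return pipeline


def toy_classifier(d):
    positive = any(word in d["text"].lower() for word in ("good", "great"))
    return {"label": "positive" if positive else "negative"}


def toy_explainer(d):
    return {"explanation": f"Classified as {d['label']}"}


pipeline = compose_sketch(("classifier", toy_classifier), ("explainer", toy_explainer))
result, trace = pipeline({"text": "Great product!"})
```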

Bootstrap Demonstrations (DSPy-style)

# Run artifact on examples, keep successful traces as demos
bootstrapped = harness.bootstrap(
    artifact,
    dataset=examples,
    max_demos=8,
    min_score=0.8,  # Quality gate for demo inclusion
)
# bootstrapped.demonstrations → filtered successful traces
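
The filtering idea behind bootstrapping can be sketched in a few lines: run the callable on each example, keep the input/output pair only if its score clears `min_score`, and stop once `max_demos` pairs have been collected. `bootstrap_sketch` and the toy data are hypothetical and only illustrate the selection rule, not the real `harness.bootstrap`.

```python
def bootstrap_sketch(fn, score_fn, examples, max_demos=8, min_score=0.8):
    """Collect up to max_demos (inputs, outputs) pairs whose score clears min_score."""
    demos = []
    for example in examples:
        outputs = fn(example)
        if score_fn(example, outputs) >= min_score:
            demos.append({"inputs": example, "outputs": outputs})
        if len(demos) == max_demos:
            break
    return demos


def classify(example):
    return {"label": "positive" if "good" in example["text"].lower() else "negative"}


def exact_match(example, outputs):
    return 1.0 if outputs["label"] == example["expected"] else 0.0


examples = [
    {"text": "good product", "expected": "positive"},
    {"text": "great value", "expected": "positive"},   # misclassified, so filtered out
    {"text": "bad product", "expected": "negative"},
    {"text": "good enough", "expected": "positive"},   # never reached: cap hit first
]
demos = bootstrap_sketch(classify, exact_match, examples, max_demos=2)
```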

Provenance & Audit

Every mutation, composition, and promotion is tracked:

# Version lineage (root → current)
lineage = harness.lineage(artifact)
for v in lineage:
    print(f"v{v.version} [{v.content_hash()[:8]}] score={v.latest_score}")

# Provenance graph
graph = harness.provenance_graph(artifact.artifact_id)
# graph["nodes"] → artifact summaries
# graph["edges"] → derivation links with type, agent, activity

# Full audit log
log = harness.audit_log()
# Every create, execute, critique, mutate, compose, promote event

# Trace report
report = harness.trace_report(artifact)
# Score stats, latency stats, last N traces
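
Conceptually, a lineage query is a walk along parent links from the current version back to the root, and the provenance graph is the set of those links expressed as edges. The sketch below assumes a single parent per version stored in a plain dict; `lineage_sketch`, `edges_sketch`, and the version identifiers are hypothetical, and composed artifacts would need more than one parent.

```python
def lineage_sketch(versions, artifact_id):
    """Follow parent links to the root, then return the chain root-first."""
    chain, current = [], artifact_id
    while current is not None:
        chain.append(current)
        current = versions[current]["parent"]
    chain.reverse()
    return chain


def edges_sketch(versions):
    """One W3C PROV-style derivation edge per non-root version."""
    return [
        {"child": vid, "parent": v["parent"], "type": "wasDerivedFrom"}
        for vid, v in versions.items()
        if v["parent"] is not None
    ]


# Hypothetical store: three versions of one artifact.
versions = {
    "v1": {"parent": None, "score": 0.50},
    "v2": {"parent": "v1", "score": 0.75},
    "v3": {"parent": "v2", "score": 1.00},
}
lineage = lineage_sketch(versions, "v3")
edges = edges_sketch(versions)
```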

Architecture

artifact_harness/
├── core/
│   ├── schema.py      # Artifact, Interface, Trace, Test, Provenance types
│   └── store.py       # VersionStore: registry + DAG traversal
├── operations/
│   ├── ops.py         # execute, compose, critique, mutate, promote
│   └── pipeline.py    # EvalDataset, evaluate, optimize, bootstrap
├── harness/
│   └── agent.py       # ArtifactHarness: top-level orchestrator
├── tests/
│   └── test_all.py    # 99 tests covering all operations
└── demo.py            # Full lifecycle demo

Design Principles

Immutability: Artifacts are never modified in place. Every mutation creates a new version linked via provenance. The original is always preserved.
Trace everything: Every execution captures a full ExecutionTrace with per-step I/O, timing, and metadata. Traces are the basis for quality assessment, bootstrapping, and audit.
Gated promotion: Artifacts can only reach "promoted" status by passing a configurable PromotionGate (min score, min traces, test pass rate, improvement over current promoted).
Representation-agnostic: The artifact schema doesn't care if the implementation is a prompt template, a Python function, a DSPy program, or a fine-tuned model reference. The contract is the Interface; the proof is the traces.
Content-addressed versioning: Each version has a deterministic content hash based on implementation, instructions, demonstrations, and parameters. Same content → same hash.
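
The immutability and content-addressing principles above can be illustrated together. This sketch assumes SHA-256 over a canonical JSON encoding of the four hashed fields and a frozen dataclass for versions; `VersionSketch` and `mutate_sketch` are hypothetical, and the harness's actual hash function and schema are not specified here.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class VersionSketch:
    """Hypothetical immutable version record."""
    implementation: str
    instructions: str = ""
    demonstrations: tuple = ()
    parameters: tuple = ()             # (key, value) pairs
    version: int = 1
    parent_hash: Optional[str] = None  # provenance link to the parent version

    def content_hash(self) -> str:
        # Only content fields are hashed; version and parent_hash are excluded,
        # so identical content always yields an identical hash.
        payload = json.dumps(
            {
                "implementation": self.implementation,
                "instructions": self.instructions,
                "demonstrations": list(self.demonstrations),
                "parameters": sorted(self.parameters),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def mutate_sketch(v: VersionSketch, **changes) -> VersionSketch:
    """Return a new version; the original object is never modified."""
    return replace(v, **changes, version=v.version + 1, parent_hash=v.content_hash())


v1 = VersionSketch("def f(): ...", instructions="Classify sentiment")
v2 = mutate_sketch(v1, instructions="Handle edge cases better")
same_content = VersionSketch("def f(): ...", instructions="Classify sentiment", version=5)
```

Here `v1` is unchanged after the mutation, `v2` points back to it through `parent_hash`, and `same_content` hashes identically to `v1` despite its different version number.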

Key Types

Interface          # Typed I/O contract: FieldDescriptor tuples
FieldDescriptor    # name, dtype, description, required, default
Artifact           # The six-component artifact
ExecutionTrace     # Full trace: steps, I/O, timing, score
TraceStep          # One predictor call within a trace
TestCase           # Assert (hard) or Suggest (soft) constraint
ProvenanceEdge     # Derivation link: parent → child + type + agent
QualityRecord      # Evaluation result: metric, score, passed_gate
EvalDataset        # Examples + optional gold outputs
EvalRun            # Evaluation results: per-example scores + stats
PromotionGate      # Configurable promotion criteria
OptimizationConfig # Loop parameters: max_iters, target, patience

Representations

Type	Implementation	Use Case
`code`	Python callable	Functions, algorithms
`prompt`	Template string	LLM prompt templates
`dspy_program`	DSPy Module	Compiled LM programs
`workflow`	Step DAG dict	Multi-step pipelines
`rlm_procedure`	RLM context proc	Long-context reasoning
`finetuned`	Model reference	Fine-tuned checkpoints
`composite`	Artifact pipeline	Composed artifacts

Running Tests

python -m artifact_harness.tests.test_all
# 99/99 passed

Running the Demo

python -m artifact_harness.demo
# Full lifecycle: create → execute → evaluate → critique → mutate →
#   optimize → bootstrap → compose → promote → audit

Citation

This system synthesizes ideas from:

DSPy (Khattab et al., 2023): Signature/Module/Teleprompter pattern, trace-bootstrapped demonstrations, compiled programs
TextGrad (Yuksekgonul et al., 2024): Text-gradient critique→mutate loop
MIPRO (Opsahl-Ong et al., 2024): Joint instruction + demonstration optimization
RLM (2025): Context procedures with symbolic handles and recursive self-calls
AFlow (Zhang et al., 2024): MCTS over code-represented workflows
AWM (Wang et al., 2024): Online workflow induction from execution traces
Git-Theta (Kandpal et al., 2023): Parameter-level version control
W3C PROV (2013): Entity/Activity/Agent provenance model
