Artifact Harness

A versioned program runtime where every prompt, program, workflow, and model component is an auditable, executable, evaluable artifact.

The Artifact Harness treats agent-produced outputs as first-class versioned programs. It draws on DSPy's compiled programs, TextGrad's text-gradient optimization, RLM context procedures, and W3C PROV provenance tracking. The core invariant is auditability through execution traces and measurable improvement.

The Idea

Every artifact contains six components:

Component       Purpose                  Analogy
Interface       Typed I/O contract       DSPy Signature
Implementation  Executable logic         DSPy Module.forward()
Evaluator       Quality gate             DSPy metric(example, prediction, trace)
Traces          Execution history        DSPy dspy.settings.traces
Tests           Correctness constraints  DSPy Assert / Suggest
Provenance      Version graph            W3C PROV wasDerivedFrom
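
As a rough illustration of how the six components could sit together in one immutable record, here is a minimal sketch using a frozen dataclass. ArtifactSketch and mutate_sketch are hypothetical names invented for this example; they are not the harness's own Artifact type or API.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class ArtifactSketch:
    """Hypothetical stand-in for a six-component artifact (not the real type)."""
    interface: str                               # typed I/O contract, e.g. "text -> label"
    implementation: Callable[[dict], dict]       # executable logic
    evaluator: Optional[Callable[[dict, dict], float]] = None  # quality gate
    traces: tuple = ()                           # execution history
    tests: tuple = ()                            # correctness constraints
    parent: Optional["ArtifactSketch"] = None    # provenance: the version this one derives from
    version: int = 1


def mutate_sketch(artifact: ArtifactSketch, **changes) -> ArtifactSketch:
    """Return a new version linked to its parent; the original is left untouched."""
    return replace(artifact, parent=artifact, version=artifact.version + 1, **changes)


v1 = ArtifactSketch(interface="text -> label",
                    implementation=lambda inp: {"label": "positive"})
v2 = mutate_sketch(v1, implementation=lambda inp: {"label": "negative"})

# Frozen dataclasses reject in-place edits, which is what keeps old versions intact.
in_place_edit_blocked = False
try:
    v1.version = 99
except FrozenInstanceError:
    in_place_edit_blocked = True
```

The frozen dataclass is only one way to get immutability; the point is that a change always yields a new object whose parent field points back at the version it came from.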

The agent operates on artifacts through five primitive operations:

EXECUTE  → Run with full trace capture
CRITIQUE → Evaluate on examples, produce structured feedback
MUTATE   → Create new version with changes (immutability preserved)
COMPOSE  → Chain artifacts into pipelines
PROMOTE  → Gate evaluation → production status

Supported representations are prompts, DSPy programs, Python functions, DAG workflows, RLM-style context procedures, and fine-tuned model references. The invariant is the auditability, not the representation.

Quick Start

from artifact_harness import ArtifactHarness, EvalDataset, PromotionGate, TestCase, ConstraintLevel

harness = ArtifactHarness("my_harness")

# Create an artifact from any callable
def classify(text: str) -> dict:
    return {"label": "positive" if "good" in text.lower() else "negative"}

artifact = harness.create_code_artifact(
    name="classifier",
    fn=classify,
    interface="text -> label",          # DSPy-style shorthand
    evaluator=lambda inp, out, tr: 1.0 if out["label"] == inp.get("expected") else 0.0,
    tests=[
        TestCase(
            name="valid_label",
            predicate=lambda out: out["label"] in ("positive", "negative"),
            level=ConstraintLevel.ASSERT,
        ),
    ],
)

# Execute with trace capture
result = harness.execute(artifact, {"text": "This is good"})
# result.outputs = {"label": "positive"}
# result.trace   → full ExecutionTrace with steps, timing, metadata
# result.test_results → [("valid_label", True, "OK")]

# Evaluate on a dataset
dataset = EvalDataset("test", [
    {"text": "good product", "expected": "positive"},
    {"text": "bad product", "expected": "negative"},
])
eval_run = harness.evaluate(artifact, dataset)
# eval_run.mean_score, eval_run.pass_rate, eval_run.traces
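
To make the summary statistics concrete, the following sketch shows one plausible way a mean score and a pass rate could be derived from per-example scores. It does not use the harness: evaluate_sketch, exact_match, and pass_threshold are hypothetical names, the evaluator here takes two arguments rather than three, and treating "pass" as "score at or above a threshold" is an assumption.

```python
from statistics import mean


def evaluate_sketch(fn, evaluator, examples, pass_threshold=1.0):
    """Score fn on every example and summarise the results."""
    scores = []
    for example in examples:
        output = fn(example["text"])
        scores.append(evaluator(example, output))
    return {
        "scores": scores,
        "mean_score": mean(scores),
        # Assumption: an example "passes" when its score reaches the threshold.
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
    }


def classify(text):
    return {"label": "positive" if "good" in text.lower() else "negative"}


def exact_match(example, output):
    return 1.0 if output["label"] == example.get("expected") else 0.0


examples = [
    {"text": "good product", "expected": "positive"},
    {"text": "bad product", "expected": "negative"},
    {"text": "great product", "expected": "positive"},  # the keyword rule misses this one
]
run = evaluate_sketch(classify, exact_match, examples)
print(run["scores"])  # [1.0, 1.0, 0.0]
```

The third example is included on purpose: a dataset where the artifact scores perfectly gives the optimization loop nothing to improve.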

The Optimization Loop

The harness drives measurable improvement through a critique→mutate→evaluate cycle:

from artifact_harness import OptimizationConfig

# Define how to mutate based on critique feedback
def mutator(artifact, critique_result):
    if critique_result["overall_score"] < 0.8:
        return {
            "instructions": "Handle edge cases better",
            "demonstrations": [{"text": "example", "label": "positive"}],
        }
    return {"instructions": "Fine-tuned instructions"}

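# train_dataset and eval_dataset are assumed to be EvalDataset instances,
# built the same way as `dataset` in the Quick Start
result = harness.optimize(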
    artifact,
    train_data=train_dataset,
    eval_data=eval_dataset,
    mutator_fn=mutator,
    config=OptimizationConfig(
        max_iterations=10,
        target_score=0.95,
        improvement_patience=3,
    ),
)
# result.initial_score → result.final_score
# result.history → per-iteration trace
# result.promoted → whether best version passed the gate
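
The three config parameters suggest a loop that stops on an iteration cap, a target score, or a run of non-improving iterations. The sketch below is one way such a loop could work, under those assumptions; optimize_sketch, score, and add_keyword are hypothetical, and the candidate is a plain keyword list rather than an artifact.

```python
def optimize_sketch(candidate, score_fn, mutate_fn,
                    max_iterations=10, target_score=0.95, improvement_patience=3):
    """Keep the best-scoring candidate; stop on target, patience, or iteration cap."""
    best, best_score = candidate, score_fn(candidate)
    history = [best_score]
    stale = 0  # consecutive iterations without improvement
    for _ in range(max_iterations):
        if best_score >= target_score or stale >= improvement_patience:
            break
        challenger = mutate_fn(best, best_score)
        challenger_score = score_fn(challenger)
        history.append(challenger_score)
        if challenger_score > best_score:
            best, best_score, stale = challenger, challenger_score, 0
        else:
            stale += 1
    return best, best_score, history


# Toy problem: the "artifact" is a list of keywords that signal a positive label.
DATA = [("good product", "positive"), ("great value", "positive"),
        ("nice fit", "positive"), ("bad product", "negative")]
VOCAB = ["good", "great", "nice"]


def score(keywords):
    hits = 0
    for text, expected in DATA:
        label = "positive" if any(k in text for k in keywords) else "negative"
        hits += label == expected
    return hits / len(DATA)


def add_keyword(keywords, current_score):
    # Mutation returns a new list; the previous version is not modified.
    missing = [w for w in VOCAB if w not in keywords]
    return keywords + missing[:1]


best, best_score, history = optimize_sketch(["good"], score, add_keyword)
print(history)  # [0.5, 0.75, 1.0]
```

Only strict improvements replace the current best here, so a mutation that scores the same counts against patience. Whether the harness makes the same choice is not stated in this README.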

Composition

# Compose artifacts into pipelines
# (classifier and explainer are artifacts created beforehand)
pipeline = harness.compose(
    classifier,      # text → label
    explainer,       # label → explanation
    name="classify_and_explain",
)

result = harness.execute(pipeline, {"text": "Great product!"})
# Runs classifier, feeds output to explainer
# Full trace captures both steps
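
Conceptually, composition of this kind is function chaining with a trace entry per step. The sketch below shows that idea with plain functions, assuming each step's output dict becomes the next step's input; compose_sketch is a hypothetical helper, not harness.compose.

```python
def compose_sketch(*steps):
    """Chain dict -> dict callables, recording each step's inputs and outputs."""
    def pipeline(inputs):
        trace = []
        data = dict(inputs)
        for step in steps:
            outputs = step(data)
            trace.append({"step": step.__name__, "inputs": data, "outputs": outputs})
            data = outputs  # assumption: the next step sees only this step's outputs
        return data, trace
    return pipeline


def classifier(inputs):
    text = inputs["text"].lower()
    return {"label": "positive" if "great" in text else "negative"}


def explainer(inputs):
    return {"explanation": f"The text was labelled {inputs['label']}."}


classify_and_explain = compose_sketch(classifier, explainer)
outputs, trace = classify_and_explain({"text": "Great product!"})
print(outputs)  # {'explanation': 'The text was labelled positive.'}
```

A pipeline like this only works when the output fields of one step cover the input fields of the next, which is where a typed Interface earns its keep.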

Bootstrap Demonstrations (DSPy-style)

# Run artifact on examples, keep successful traces as demos
bootstrapped = harness.bootstrap(
    artifact,
    dataset=examples,
    max_demos=8,
    min_score=0.8,  # Quality gate for demo inclusion
)
# bootstrapped.demonstrations → filtered successful traces
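
The filtering step can be pictured as follows: run the artifact over the examples, keep the input/output pairs whose score clears min_score, and stop once max_demos are collected. This is a sketch under those assumptions with hypothetical names (bootstrap_sketch, exact_match); the real bootstrap may select or order demonstrations differently.

```python
def bootstrap_sketch(fn, evaluator, examples, max_demos=8, min_score=0.8):
    """Collect up to max_demos input/output pairs whose score clears min_score."""
    demos = []
    for example in examples:
        output = fn(example["text"])
        if evaluator(example, output) >= min_score:
            demos.append({"inputs": {"text": example["text"]}, "outputs": output})
        if len(demos) == max_demos:
            break
    return demos


def classify(text):
    return {"label": "positive" if "good" in text.lower() else "negative"}


def exact_match(example, output):
    return 1.0 if output["label"] == example["expected"] else 0.0


examples = [
    {"text": "good product", "expected": "positive"},
    {"text": "great product", "expected": "positive"},  # misclassified, so filtered out
    {"text": "bad product", "expected": "negative"},
    {"text": "good value", "expected": "positive"},     # never reached: cap already hit
]
demos = bootstrap_sketch(classify, exact_match, examples, max_demos=2)
print([d["inputs"]["text"] for d in demos])  # ['good product', 'bad product']
```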

Provenance & Audit

Every mutation, composition, and promotion is tracked:

# Version lineage (root → current)
lineage = harness.lineage(artifact)
for v in lineage:
    print(f"v{v.version} [{v.content_hash()[:8]}] score={v.latest_score}")

# Provenance graph
graph = harness.provenance_graph(artifact.artifact_id)
# graph["nodes"] → artifact summaries
# graph["edges"] → derivation links with type, agent, activity

# Full audit log
log = harness.audit_log()
# Every create, execute, critique, mutate, compose, promote event

# Trace report
report = harness.trace_report(artifact)
# Score stats, latency stats, last N traces
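
A root-to-current lineage can be recovered by following parent links back from the current version and reversing the result. The sketch below illustrates that traversal on a hypothetical VersionSketch record; how the harness stores and walks its version DAG is not shown in this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VersionSketch:
    """Hypothetical version node: a number, a parent link, and how it was derived."""
    version: int
    parent: Optional["VersionSketch"] = None
    derivation: str = "create"


def lineage_sketch(current):
    """Walk parent links back to the root, then return root -> current order."""
    chain = []
    node = current
    while node is not None:
        chain.append(node)
        node = node.parent
    return list(reversed(chain))


v1 = VersionSketch(1)
v2 = VersionSketch(2, parent=v1, derivation="mutate")
v3 = VersionSketch(3, parent=v2, derivation="mutate")

versions = [v.version for v in lineage_sketch(v3)]
# Derivation edges as (parent, child, type), loosely in the spirit of wasDerivedFrom.
edges = [(v.parent.version, v.version, v.derivation)
         for v in lineage_sketch(v3) if v.parent is not None]
print(versions)  # [1, 2, 3]
```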

Architecture

artifact_harness/
├── core/
│   ├── schema.py      # Artifact, Interface, Trace, Test, Provenance types
│   └── store.py       # VersionStore: registry + DAG traversal
├── operations/
│   ├── ops.py         # execute, compose, critique, mutate, promote
│   └── pipeline.py    # EvalDataset, evaluate, optimize, bootstrap
├── harness/
│   └── agent.py       # ArtifactHarness: top-level orchestrator
├── tests/
│   └── test_all.py    # 99 tests covering all operations
└── demo.py            # Full lifecycle demo

Design Principles

  1. Immutability: Artifacts are never modified in place. Every mutation creates a new version linked via provenance. The original is always preserved.

  2. Trace everything: Every execution captures a full ExecutionTrace with per-step I/O, timing, and metadata. Traces are the basis for quality assessment, bootstrapping, and audit.

  3. Gated promotion: Artifacts can only reach "promoted" status by passing a configurable PromotionGate (minimum score, minimum number of traces, test pass rate, improvement over the currently promoted version).

  4. Representation-agnostic: The artifact schema doesn't care if the implementation is a prompt template, a Python function, a DSPy program, or a fine-tuned model reference. The contract is the Interface; the proof is the traces.

  5. Content-addressed versioning: Each version has a deterministic content hash based on implementation, instructions, demonstrations, and parameters. Same content → same hash.
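
Principles 3 and 5 are easy to picture in a few lines. The sketch below hashes a canonical JSON encoding of the four listed fields with SHA-256, and checks a candidate against the four gate criteria. Both functions are hypothetical: the choice of SHA-256, the JSON canonicalisation, treating the implementation as source text, and the default thresholds are all assumptions rather than the harness's actual behaviour.

```python
import hashlib
import json


def content_hash_sketch(implementation, instructions, demonstrations, parameters):
    """Deterministic hash: canonical JSON (sorted keys) fed to SHA-256."""
    payload = json.dumps(
        {
            "implementation": implementation,  # assumption: source text, not a callable
            "instructions": instructions,
            "demonstrations": demonstrations,
            "parameters": parameters,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def passes_gate_sketch(score, n_traces, test_pass_rate, promoted_score=None,
                       min_score=0.8, min_traces=5, min_test_pass_rate=1.0):
    """Check the four criteria; thresholds here are illustrative defaults."""
    if score < min_score or n_traces < min_traces:
        return False
    if test_pass_rate < min_test_pass_rate:
        return False
    # With no promoted version yet, there is nothing to improve on.
    return promoted_score is None or score > promoted_score


h1 = content_hash_sketch("def f(x): return x", "Be concise",
                         [{"text": "a", "label": "positive"}],
                         {"temperature": 0.0, "top_p": 1.0})
# Same content with dict keys in a different order gives the same hash.
h2 = content_hash_sketch("def f(x): return x", "Be concise",
                         [{"label": "positive", "text": "a"}],
                         {"top_p": 1.0, "temperature": 0.0})
h3 = content_hash_sketch("def f(x): return x", "Be thorough",
                         [{"text": "a", "label": "positive"}],
                         {"temperature": 0.0, "top_p": 1.0})
print(h1 == h2, h1 == h3)  # True False
```

Sorting keys before hashing is what makes the hash independent of dict ordering; without it, two logically identical versions could receive different hashes.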

Key Types

Interface          # Typed I/O contract: FieldDescriptor tuples
FieldDescriptor    # name, dtype, description, required, default
Artifact           # The six-component artifact
ExecutionTrace     # Full trace: steps, I/O, timing, score
TraceStep          # One predictor call within a trace
TestCase           # Assert (hard) or Suggest (soft) constraint
ProvenanceEdge     # Derivation link: parent → child + type + agent
QualityRecord      # Evaluation result: metric, score, passed_gate
EvalDataset        # Examples + optional gold outputs
EvalRun            # Evaluation results: per-example scores + stats
PromotionGate      # Configurable promotion criteria
OptimizationConfig # Loop parameters: max_iterations, target_score, improvement_patience

Representations

Type           Implementation         Use Case
code           Python callable        Functions, algorithms
prompt         Template string        LLM prompt templates
dspy_program   DSPy Module            Compiled LM programs
workflow       Step DAG dict          Multi-step pipelines
rlm_procedure  RLM context procedure  Long-context reasoning
finetuned      Model reference        Fine-tuned checkpoints
composite      Artifact pipeline      Composed artifacts

Running Tests

python -m artifact_harness.tests.test_all
# 99/99 passed

Running the Demo

python -m artifact_harness.demo
# Full lifecycle: create → execute → evaluate → critique → mutate →
#   optimize → bootstrap → compose → promote → audit

Citation

This system synthesizes ideas from:

  • DSPy (Khattab et al., 2023): Signature/Module/Teleprompter pattern, trace-bootstrapped demonstrations, compiled programs
  • TextGrad (Yuksekgonul et al., 2024): Text-gradient critique→mutate loop
  • MIPRO (Opsahl-Ong et al., 2024): Joint instruction + demonstration optimization
  • RLM (2025): Context procedures with symbolic handles and recursive self-calls
  • AFlow (Zhang et al., 2024): MCTS over code-represented workflows
  • AWM (Wang et al., 2024): Online workflow induction from execution traces
  • Git-Theta (Kandpal et al., 2023): Parameter-level version control
  • W3C PROV (2013): Entity/Activity/Agent provenance model