Artifact Harness
A versioned program runtime where every prompt, program, workflow, and model component is an auditable, executable, evaluable artifact.
The Artifact Harness treats agent-produced outputs as first-class versioned programs. It is inspired by DSPy's compiled programs, TextGrad's text-gradient optimization, RLM context procedures, and W3C PROV provenance tracking. The core invariant is auditability through execution traces and measurable improvement.
The Idea
Every artifact contains six components:
| Component | Purpose | Analogy |
|---|---|---|
| Interface | Typed I/O contract | DSPy Signature |
| Implementation | Executable logic | DSPy Module.forward() |
| Evaluator | Quality gate | DSPy metric(example, prediction, trace) |
| Traces | Execution history | DSPy dspy.settings.traces |
| Tests | Correctness constraints | DSPy Assert / Suggest |
| Provenance | Version graph | W3C PROV wasDerivedFrom |
The agent operates on artifacts through five primitive operations:
- EXECUTE: Run with full trace capture
- CRITIQUE: Evaluate on examples, produce structured feedback
- MUTATE: Create new version with changes (immutability preserved)
- COMPOSE: Chain artifacts into pipelines
- PROMOTE: Gate evaluation → production status
All representations are supported: prompts, DSPy programs, Python functions, DAG workflows, RLM-style context procedures, and fine-tuned model references. The invariant is not the representation but the auditability.
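The PROMOTE operation can be pictured as a pure check over an artifact's recorded evidence. The sketch below is illustrative only: `GateSketch`, `passes_gate`, their field names, and the default thresholds are hypothetical and are not the harness's `PromotionGate` API. The criteria mirror the ones this README lists under Design Principles (minimum score, minimum trace count, test pass rate, improvement over the currently promoted version).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GateSketch:
    """Hypothetical promotion criteria (not the library's PromotionGate)."""
    min_score: float = 0.8
    min_traces: int = 5
    min_test_pass_rate: float = 1.0
    min_improvement: float = 0.0


def passes_gate(
    gate: GateSketch,
    trace_scores: Sequence[float],
    test_pass_rate: float,
    promoted_score: Optional[float] = None,
) -> bool:
    """Return True only if every criterion holds."""
    if not trace_scores or len(trace_scores) < gate.min_traces:
        return False  # not enough recorded executions to judge
    mean_score = sum(trace_scores) / len(trace_scores)
    if mean_score < gate.min_score:
        return False
    if test_pass_rate < gate.min_test_pass_rate:
        return False
    # Compare against the currently promoted version, if there is one.
    if promoted_score is not None and mean_score - promoted_score < gate.min_improvement:
        return False
    return True
```

Keeping the gate a side-effect-free function of recorded scores means the promotion decision itself can be replayed from the audit data.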
Quick Start
from artifact_harness import ArtifactHarness, EvalDataset, PromotionGate, TestCase, ConstraintLevel
harness = ArtifactHarness("my_harness")
# Create an artifact from any callable
def classify(text: str) -> dict:
    return {"label": "positive" if "good" in text.lower() else "negative"}
artifact = harness.create_code_artifact(
    name="classifier",
    fn=classify,
    interface="text -> label",  # DSPy-style shorthand
    evaluator=lambda inp, out, tr: 1.0 if out["label"] == inp.get("expected") else 0.0,
    tests=[
        TestCase(
            name="valid_label",
            predicate=lambda out: out["label"] in ("positive", "negative"),
            level=ConstraintLevel.ASSERT,
        ),
    ],
)
# Execute with trace capture
result = harness.execute(artifact, {"text": "This is good"})
# result.outputs = {"label": "positive"}
# result.trace → full ExecutionTrace with steps, timing, metadata
# result.test_results → [("valid_label", True, "OK")]
# Evaluate on a dataset
dataset = EvalDataset("test", [
    {"text": "good product", "expected": "positive"},
    {"text": "bad product", "expected": "negative"},
])
eval_run = harness.evaluate(artifact, dataset)
# eval_run.mean_score, eval_run.pass_rate, eval_run.traces
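To make the two summary statistics concrete, here is a minimal sketch of how a mean score and a pass rate could be computed over a dataset. `evaluate_sketch`, `input_keys`, and `pass_threshold` are hypothetical names introduced for illustration; the harness's own `evaluate` may define passing differently. The evaluator takes `(example, output, trace)`, matching the shape of the lambda in the Quick Start.

```python
from statistics import mean
from typing import Any, Callable, Dict, List


def evaluate_sketch(
    fn: Callable[..., Dict[str, Any]],
    evaluator: Callable[[Dict[str, Any], Dict[str, Any], Any], float],
    examples: List[Dict[str, Any]],
    input_keys: List[str],
    pass_threshold: float = 1.0,  # assumed cutoff for counting an example as passed
) -> Dict[str, float]:
    scores = []
    for example in examples:
        # Only the declared inputs reach the function; gold fields stay hidden.
        inputs = {key: example[key] for key in input_keys}
        output = fn(**inputs)
        scores.append(evaluator(example, output, None))  # no trace in this sketch
    return {
        "mean_score": mean(scores),
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
    }


def classify(text: str) -> dict:
    return {"label": "positive" if "good" in text.lower() else "negative"}


examples = [
    {"text": "good product", "expected": "positive"},
    {"text": "bad product", "expected": "negative"},
    {"text": "not good at all", "expected": "negative"},  # keyword rule gets this wrong
]
stats = evaluate_sketch(
    classify,
    lambda ex, out, tr: 1.0 if out["label"] == ex.get("expected") else 0.0,
    examples,
    input_keys=["text"],
)
```

The third example shows why dataset-level evaluation matters: the keyword rule passes a single happy-path execution but misses negation, and that only shows up as a lower aggregate score.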
The Optimization Loop
The harness drives measurable improvement through a critique→mutate→evaluate cycle:
from artifact_harness import OptimizationConfig
# Define how to mutate based on critique feedback
def mutator(artifact, critique_result):
    if critique_result["overall_score"] < 0.8:
        return {
            "instructions": "Handle edge cases better",
            "demonstrations": [{"text": "example", "label": "positive"}],
        }
    return {"instructions": "Fine-tuned instructions"}
result = harness.optimize(
    artifact,
    train_data=train_dataset,
    eval_data=eval_dataset,
    mutator_fn=mutator,
    config=OptimizationConfig(
        max_iterations=10,
        target_score=0.95,
        improvement_patience=3,
    ),
)
# result.initial_score → result.final_score
# result.history → per-iteration trace
# result.promoted → whether best version passed the gate
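The control flow implied by `max_iterations`, `target_score`, and `improvement_patience` can be sketched independently of the harness. `optimize_sketch` below is a hypothetical stand-in, not the library's `optimize`: it assumes the loop stops on reaching the target, on exhausting the iteration budget, or after `improvement_patience` consecutive proposals that fail to beat the best score.

```python
from typing import Callable, List, Tuple, TypeVar

C = TypeVar("C")  # whatever is being optimized (an artifact, a prompt, ...)


def optimize_sketch(
    candidate: C,
    score_fn: Callable[[C], float],
    mutate_fn: Callable[[C, float], C],
    max_iterations: int = 10,
    target_score: float = 0.95,
    improvement_patience: int = 3,
) -> Tuple[C, float, List[float]]:
    best, best_score = candidate, score_fn(candidate)
    history = [best_score]
    stale = 0  # consecutive proposals without improvement
    for _ in range(max_iterations):
        if best_score >= target_score or stale >= improvement_patience:
            break
        proposal = mutate_fn(best, best_score)  # returns a new candidate
        score = score_fn(proposal)
        history.append(score)
        if score > best_score:
            best, best_score, stale = proposal, score, 0
        else:
            stale += 1
    return best, best_score, history


# Toy problem: the "artifact" is a threshold, scored by closeness to 0.7.
best, best_score, history = optimize_sketch(
    0.0,
    score_fn=lambda t: 1.0 - abs(0.7 - t),
    mutate_fn=lambda t, score: round(t + 0.25, 2),
    target_score=0.9,
)

# A mutator that never helps is cut off by the patience limit.
_, _, flat_history = optimize_sketch(
    1.0, score_fn=lambda t: 0.5, mutate_fn=lambda t, score: t
)
```

Mutating from the best candidate seen so far, not from the most recent proposal, is one design choice among several; it keeps a bad mutation from derailing later iterations.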
Composition
# Compose artifacts into pipelines
pipeline = harness.compose(
    classifier,  # text → label
    explainer,  # label → explanation
    name="classify_and_explain",
)
result = harness.execute(pipeline, {"text": "Great product!"})
# Runs classifier, feeds output to explainer
# Full trace captures both steps
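A minimal sketch of sequential composition with per-step tracing, assuming each step is a plain function from a dict to a dict. `compose_sketch` and the trace record layout are hypothetical. Merging each step's outputs into a shared state is also an assumption; the harness may instead pass only the previous step's outputs forward.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Step = Callable[[State], State]


def compose_sketch(*steps: Step) -> Callable[[State], Tuple[State, List[Dict[str, Any]]]]:
    def run(inputs: State) -> Tuple[State, List[Dict[str, Any]]]:
        trace: List[Dict[str, Any]] = []
        state = dict(inputs)  # copy so the caller's dict is never mutated
        for step in steps:
            outputs = step(state)
            trace.append(
                {"step": step.__name__, "inputs": dict(state), "outputs": outputs}
            )
            state = {**state, **outputs}  # later steps see earlier outputs
        return state, trace

    return run


def classifier(state: State) -> State:
    return {"label": "positive" if "great" in state["text"].lower() else "negative"}


def explainer(state: State) -> State:
    return {"explanation": f"Labelled {state['label']} by keyword match."}


pipeline = compose_sketch(classifier, explainer)
final, trace = pipeline({"text": "Great product!"})
```

Each trace entry records what the step saw and what it produced, which is the information needed to attribute a bad final output to a specific step.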
Bootstrap Demonstrations (DSPy-style)
# Run artifact on examples, keep successful traces as demos
bootstrapped = harness.bootstrap(
    artifact,
    dataset=examples,
    max_demos=8,
    min_score=0.8,  # Quality gate for demo inclusion
)
# bootstrapped.demonstrations → filtered successful traces
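The filtering step can be sketched as follows. `bootstrap_sketch` is a hypothetical helper, not the harness's `bootstrap`; it assumes demonstrations are ranked by score and capped at `max_demos`, and the demo record layout is invented for illustration.

```python
from typing import Any, Callable, Dict, List

Example = Dict[str, Any]


def bootstrap_sketch(
    fn: Callable[[Example], Dict[str, Any]],
    evaluator: Callable[[Example, Dict[str, Any]], float],
    examples: List[Example],
    max_demos: int = 8,
    min_score: float = 0.8,
) -> List[Dict[str, Any]]:
    demos = []
    for example in examples:
        output = fn(example)
        score = evaluator(example, output)
        if score >= min_score:  # quality gate for demo inclusion
            demos.append({"inputs": example, "outputs": output, "score": score})
    # Stable sort: equally scored demos keep their dataset order.
    demos.sort(key=lambda d: d["score"], reverse=True)
    return demos[:max_demos]


def classify(example: Example) -> Dict[str, Any]:
    return {"label": "positive" if "good" in example["text"].lower() else "negative"}


def exact_match(example: Example, output: Dict[str, Any]) -> float:
    return 1.0 if output["label"] == example["expected"] else 0.0


examples = [
    {"text": "good product", "expected": "positive"},
    {"text": "not good", "expected": "negative"},  # misclassified, so filtered out
    {"text": "bad product", "expected": "negative"},
]
all_demos = bootstrap_sketch(classify, exact_match, examples)
top_demo = bootstrap_sketch(classify, exact_match, examples, max_demos=1)
```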
Provenance & Audit
Every mutation, composition, and promotion is tracked:
# Version lineage (root → current)
lineage = harness.lineage(artifact)
for v in lineage:
    print(f"v{v.version} [{v.content_hash()[:8]}] score={v.latest_score}")
# Provenance graph
graph = harness.provenance_graph(artifact.artifact_id)
# graph["nodes"] → artifact summaries
# graph["edges"] → derivation links with type, agent, activity
# Full audit log
log = harness.audit_log()
# Every create, execute, critique, mutate, compose, promote event
# Trace report
report = harness.trace_report(artifact)
# Score stats, latency stats, last N traces
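Root-to-current lineage amounts to walking derivation links backwards and reversing the result. The sketch below assumes each version has at most one parent and represents the graph as a plain child-to-parent mapping; `lineage_sketch` and that mapping are hypothetical, and the harness's store may hold richer edges.

```python
from typing import Dict, List, Optional


def lineage_sketch(parents: Dict[str, Optional[str]], artifact_id: str) -> List[str]:
    """Return ids from the root version down to `artifact_id`."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = artifact_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in provenance at {current!r}")
        seen.add(current)
        chain.append(current)
        current = parents.get(current)  # None once the root is reached
    chain.reverse()
    return chain


# v2 and v2b are sibling mutations of v1; v3 derives from v2.
parents = {"v1": None, "v2": "v1", "v2b": "v1", "v3": "v2"}
main_line = lineage_sketch(parents, "v3")
branch = lineage_sketch(parents, "v2b")
```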
Architecture
artifact_harness/
├── core/
│   ├── schema.py      # Artifact, Interface, Trace, Test, Provenance types
│   └── store.py       # VersionStore: registry + DAG traversal
├── operations/
│   ├── ops.py         # execute, compose, critique, mutate, promote
│   └── pipeline.py    # EvalDataset, evaluate, optimize, bootstrap
├── harness/
│   └── agent.py       # ArtifactHarness: top-level orchestrator
├── tests/
│   └── test_all.py    # 99 tests covering all operations
└── demo.py            # Full lifecycle demo
Design Principles
- Immutability: Artifacts are never modified in place. Every mutation creates a new version linked via provenance. The original is always preserved.
- Trace everything: Every execution captures a full `ExecutionTrace` with per-step I/O, timing, and metadata. Traces are the basis for quality assessment, bootstrapping, and audit.
- Gated promotion: Artifacts can only reach "promoted" status by passing a configurable `PromotionGate` (min score, min traces, test pass rate, improvement over current promoted).
- Representation-agnostic: The artifact schema doesn't care if the implementation is a prompt template, a Python function, a DSPy program, or a fine-tuned model reference. The contract is the Interface; the proof is the traces.
- Content-addressed versioning: Each version has a deterministic content hash based on implementation, instructions, demonstrations, and parameters. Same content → same hash.
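Immutability and content addressing fit together naturally, as the following sketch suggests. It is not the harness's `Artifact` class: `VersionSketch`, its field layout, the choice of SHA-256, and canonical JSON as the serialization are all assumptions. The hash covers the four fields named above and deliberately excludes the version number and parent link.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class VersionSketch:
    implementation: str              # e.g. source text or a prompt template
    instructions: str = ""
    demonstrations: tuple = ()       # tuple of JSON-serializable items
    parameters: tuple = ()           # tuple of (name, value) pairs
    version: int = 1
    parent_hash: Optional[str] = None  # provenance link, not part of the hash

    def content_hash(self) -> str:
        payload = json.dumps(
            {
                "implementation": self.implementation,
                "instructions": self.instructions,
                "demonstrations": list(self.demonstrations),
                "parameters": sorted(self.parameters),
            },
            sort_keys=True,  # canonical key order keeps the hash deterministic
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def mutate(self, **changes) -> "VersionSketch":
        """Return a new version linked to this one; `self` is left untouched."""
        return replace(
            self,
            version=self.version + 1,
            parent_hash=self.content_hash(),
            **changes,
        )


v1 = VersionSketch(implementation="def f(x): return x", instructions="Be brief")
v2 = v1.mutate(instructions="Handle edge cases")
same = VersionSketch(
    implementation="def f(x): return x", instructions="Be brief", version=7
)
```

Because `same` differs from `v1` only in its version number, the two share a content hash, which is what lets a store detect that a mutation produced nothing new.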
Key Types
Interface # Typed I/O contract: FieldDescriptor tuples
FieldDescriptor # name, dtype, description, required, default
Artifact # The six-component artifact
ExecutionTrace # Full trace: steps, I/O, timing, score
TraceStep # One predictor call within a trace
TestCase # Assert (hard) or Suggest (soft) constraint
ProvenanceEdge # Derivation link: parent → child + type + agent
QualityRecord # Evaluation result: metric, score, passed_gate
EvalDataset # Examples + optional gold outputs
EvalRun # Evaluation results: per-example scores + stats
PromotionGate # Configurable promotion criteria
OptimizationConfig # Loop parameters: max_iters, target, patience
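As an illustration of how a shorthand such as `"text -> label"` might expand into field descriptors, here is a small parser. `FieldSketch` copies the five attributes listed for `FieldDescriptor` above, but the class, `parse_shorthand`, the `"str"` default dtype, and the optional `name: dtype` annotation are assumptions, not the harness's actual parsing rules.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class FieldSketch:
    name: str
    dtype: str = "str"      # assumed default when no type is given
    description: str = ""
    required: bool = True
    default: Any = None


def parse_shorthand(spec: str) -> Tuple[List[FieldSketch], List[FieldSketch]]:
    """Split 'a, b: int -> c' into (input fields, output fields)."""
    if spec.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {spec!r}")
    left, right = spec.split("->")

    def fields(side: str) -> List[FieldSketch]:
        parsed = []
        for part in side.split(","):
            part = part.strip()
            if not part:
                continue
            name, _, dtype = part.partition(":")
            parsed.append(FieldSketch(name=name.strip(), dtype=dtype.strip() or "str"))
        return parsed

    inputs, outputs = fields(left), fields(right)
    if not inputs or not outputs:
        raise ValueError(f"both sides of '->' need at least one field: {spec!r}")
    return inputs, outputs


inputs, outputs = parse_shorthand("text -> label")
typed_inputs, typed_outputs = parse_shorthand("question, k: int -> answer")
```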
Representations
| Type | Implementation | Use Case |
|---|---|---|
| `code` | Python callable | Functions, algorithms |
| `prompt` | Template string | LLM prompt templates |
| `dspy_program` | DSPy Module | Compiled LM programs |
| `workflow` | Step DAG dict | Multi-step pipelines |
| `rlm_procedure` | RLM context proc | Long-context reasoning |
| `finetuned` | Model reference | Fine-tuned checkpoints |
| `composite` | Artifact pipeline | Composed artifacts |
Running Tests
python -m artifact_harness.tests.test_all
# 99/99 passed
Running the Demo
python -m artifact_harness.demo
# Full lifecycle: create → execute → evaluate → critique → mutate →
# optimize → bootstrap → compose → promote → audit
Citation
This system synthesizes ideas from:
- DSPy (Khattab et al., 2023): Signature/Module/Teleprompter pattern, trace-bootstrapped demonstrations, compiled programs
- TextGrad (Yuksekgonul et al., 2024): Text-gradient critique→mutate loop
- MIPRO (Opsahl-Ong et al., 2024): Joint instruction + demonstration optimization
- RLM (2025): Context procedures with symbolic handles and recursive self-calls
- AFlow (Zhang et al., 2024): MCTS over code-represented workflows
- AWM (Wang et al., 2024): Online workflow induction from execution traces
- Git-Theta (Kandpal et al., 2023): Parameter-level version control
- W3C PROV (2013): Entity/Activity/Agent provenance model