Engine Subsystems

Stargraph's engine is the runtime that executes a validated Graph to a terminal RunSummary. This page is the orienting map: each subsystem owns one responsibility and exposes a single Python entry point.

Subsystem map

| Module | Responsibility | Public surface |
| --- | --- | --- |
| stargraph.graph | Static, hashable, IR-validated graph definition + async run handle | Graph, GraphRun, RunState, structural_hash, runtime_hash |
| stargraph.checkpoint | Per-step persistence contract + drivers | Checkpoint, Checkpointer, RunSummary |
| stargraph.runtime | Loop, event bus, mirror lifecycle, reducer registry | (driven internally by GraphRun.start()) |
| stargraph.nodes | Built-in node implementations | NodeBase, MLNode, DSPyNode, SubgraphNode |
| stargraph.replay | Cassettes, comparison, counterfactual fork | ToolCallCassette, CounterfactualMutation, derived_graph_hash |
| stargraph.fathom | CLIPS-backed governance / rule firing | FathomAdapter |
| stargraph.errors | Force-loud error hierarchy (FR-6) | StargraphError and subclasses |

Construction is synchronous and side-effect free; execution is async and single-use per GraphRun. The (graph_hash, runtime_hash) pair is pinned at Graph.__init__ and travels with every Checkpoint so a resume can refuse a mismatch loudly (FR-20).

Graph / GraphRun

Graph is the sync construction half of the Temporal-style split. It validates the IR, compiles the state_schema into a Pydantic BaseModel subclass, and pins the JCS structural hash plus the runtime hash:

from stargraph.graph import Graph
from stargraph.ir._models import IRDocument

ir = IRDocument(ir_version="1.0.0", id="run:hello", nodes=[])
graph = Graph(ir)

graph.graph_hash    # 64-char hex sha256 of the JCS-canonical IR
graph.runtime_hash  # sha256(python_version + stargraph_version)
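To make the two hashes concrete, here is a minimal standalone sketch of how such values could be derived. It is not Stargraph's implementation: `canonical_json`, `sketch_structural_hash`, and `sketch_runtime_hash` are illustrative names, the canonicalization only approximates JCS (RFC 8785) for documents with ASCII keys and no floating-point numbers, and the exact way the two version strings are joined is an assumption.

```python
import hashlib
import json
import platform


def canonical_json(document: dict) -> bytes:
    """Approximate JCS: sorted keys, no insignificant whitespace, UTF-8 output.

    Full JCS also pins number and string serialization and sorts keys by
    UTF-16 code units; for ASCII keys and no floats this gives the same bytes.
    """
    text = json.dumps(
        document, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def sketch_structural_hash(document: dict) -> str:
    """sha256 over the canonical bytes, as a 64-character hex string."""
    return hashlib.sha256(canonical_json(document)).hexdigest()


def sketch_runtime_hash(python_version: str, package_version: str) -> str:
    """sha256 over the concatenated version strings (join format assumed)."""
    joined = (python_version + package_version).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


# Same content, different key order: canonicalization makes the hashes agree.
ir_a = {"ir_version": "1.0.0", "id": "run:hello", "nodes": []}
ir_b = {"nodes": [], "id": "run:hello", "ir_version": "1.0.0"}

hash_a = sketch_structural_hash(ir_a)
hash_b = sketch_structural_hash(ir_b)

# "0.0.0" is a placeholder, not a real Stargraph version.
rt_hash = sketch_runtime_hash(platform.python_version(), "0.0.0")
```

Because the hash is taken over canonical bytes, reordering keys in the IR leaves it unchanged, while any change to a node, id, or version produces a different value.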

GraphRun is the async execution half — single-use, one run_id per handle:

from stargraph.checkpoint.sqlite import SQLiteCheckpointer

# The awaits below must run inside a coroutine (for example under asyncio.run).
checkpointer = SQLiteCheckpointer("checkpoints.db")
await checkpointer.bootstrap()

run = await graph.start(checkpointer=checkpointer)
summary = await run.start()        # drive to terminal state
async for event in run.stream():   # observe transitions
    ...

Lifecycle states (pending|running|paused|awaiting-input|done|cancelled|error|failed) are exposed on run.state for inspection but transitioned only by the run loop.

The awaiting-input state is reached when a node raises InterruptAction. The loop emits WaitingForInputEvent, transitions state, and returns — it does not poll for a transition back to running. Resume is cold-restart-only in v1: stop the process, call GraphRun.respond(...) (which flips state and asserts the response as a stargraph.evidence Fathom fact), then restart with GraphRun.resume(checkpoint). Warm in-process resume is on the post-1.0 roadmap; see v1 limits for the boundary list.
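The "emit, transition, return" behaviour described above can be sketched with a toy run loop. This is a standalone illustration under stated assumptions, not the engine's code: `SketchRun` and the two node coroutines are hypothetical, and `InterruptAction` here is a local stand-in class rather than the Stargraph type.

```python
import asyncio
from dataclasses import dataclass, field


class InterruptAction(Exception):
    """Local stand-in: raised by a node that needs external input."""

    def __init__(self, prompt: str) -> None:
        super().__init__(prompt)
        self.prompt = prompt


@dataclass
class SketchRun:
    """Toy run handle: walks a list of async nodes in order."""

    nodes: list
    state: str = "pending"
    events: list = field(default_factory=list)
    next_index: int = 0

    async def start(self) -> str:
        self.state = "running"
        while self.next_index < len(self.nodes):
            node = self.nodes[self.next_index]
            try:
                await node()
            except InterruptAction as interrupt:
                # Emit the event, park the run, and return. There is no
                # polling loop waiting for the state to flip back.
                self.events.append(("waiting-for-input", interrupt.prompt))
                self.state = "awaiting-input"
                return self.state
            self.next_index += 1
        self.state = "done"
        return self.state


async def load_features() -> None:
    await asyncio.sleep(0)  # pretend to do some work


async def ask_reviewer() -> None:
    raise InterruptAction("approve the risk score?")


run = SketchRun(nodes=[load_features, ask_reviewer])
final_state = asyncio.run(run.start())
```

After `start()` returns, the toy run sits in `awaiting-input` with `next_index` pointing at the interrupting node, which is the information a cold restart would need to pick up from a persisted checkpoint.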

MLNode

MLNode (stargraph.nodes.MLNode) wraps the loaders in stargraph.ml.loaders so a sklearn / xgboost / onnx model can be dropped into a graph as a normal node. Construction is eager-validated — the runtime is checked, the pickle gate fires for sklearn when allow_unsafe_pickle=False, and the ONNX session warms via the shared cache:

from stargraph.nodes.ml import MLNode

# ONNX is the recommended runtime — no pickle, no untrusted deserialization.
node = MLNode(
    model_id="risk-classifier",
    version="1.4.0",
    runtime="onnx",
    file_uri="file:///opt/models/risk-1.4.0.onnx",
    expected_sha256="2c1f...e0ba",  # pin the model bytes
    input_field="features",
    output_field="risk_score",
)
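Pinning the model bytes with expected_sha256 amounts to hashing the file and refusing to load on a mismatch. A minimal sketch of that check, assuming a local file path; `verify_sha256` is a hypothetical helper, not part of stargraph.ml.loaders, and the "model" is just a few placeholder bytes written to a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_sha256(path: Path, expected_sha256: str, chunk_size: int = 1 << 20) -> None:
    """Stream the file through sha256 and raise if the digest differs."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large model files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(f"sha256 mismatch: expected {expected_sha256}, got {actual}")


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.onnx"
    model_path.write_bytes(b"not a real model")  # placeholder bytes
    pinned = hashlib.sha256(b"not a real model").hexdigest()

    verify_sha256(model_path, pinned)  # matching pin: returns silently

    try:
        verify_sha256(model_path, "0" * 64)  # wrong pin: must fail loudly
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```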

Sklearn requires allow_unsafe_pickle=True (default-deny per FR-30 antipattern guard #4); xgboost loads from JSON/UBJ and ignores the pickle gate. Inference is offloaded to a worker thread via asyncio.to_thread, so the event loop is never blocked by a sync predict call.
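The offloading pattern can be illustrated without any ML dependency. In this sketch, `blocking_predict` is a hypothetical stand-in for a synchronous predict call; a heartbeat task keeps ticking while the prediction runs, which shows the event loop staying responsive. It requires Python 3.9 or later for `asyncio.to_thread`.

```python
import asyncio
import threading
import time


def blocking_predict(features: list) -> tuple:
    """Stand-in for a synchronous model.predict call that takes a while."""
    time.sleep(0.05)
    return sum(features) / len(features), threading.get_ident()


async def main() -> tuple:
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the event loop gets to run while predict is busy.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    beat = asyncio.create_task(heartbeat())
    # The sync call runs in a worker thread; the loop keeps scheduling tasks.
    score, worker_thread = await asyncio.to_thread(blocking_predict, [0.2, 0.4, 0.6])
    beat.cancel()
    ran_off_loop_thread = worker_thread != threading.get_ident()
    return score, ran_off_loop_thread, ticks


score, ran_off_loop_thread, ticks = asyncio.run(main())
```

Calling `blocking_predict` directly inside the coroutine would instead freeze every other task for the duration of the call, so the heartbeat would not advance until it finished.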

Force-loud errors

The engine never silently coerces on schema mismatch, missing checkpoint, or counterfactual hash collision. All structured failures inherit from StargraphError and carry typed context fields you can assert against, for example CheckpointError(reason="graph-hash-mismatch", expected_hash=..., actual_hash=...).

Catch the narrow subclass; the context fields are part of the public contract.
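As an illustration of catching the narrow subclass and reading its context fields, the sketch below defines local stand-ins for the two exception classes. The field names follow the example above; the constructor signature, the `check_resume` helper, and the choice of which hash counts as "expected" are assumptions made for this sketch, not the library's API.

```python
class StargraphError(Exception):
    """Local stand-in for the base of the error hierarchy."""


class CheckpointError(StargraphError):
    """Local stand-in carrying typed context fields (signature assumed)."""

    def __init__(self, reason: str, *, expected_hash: str, actual_hash: str) -> None:
        super().__init__(f"{reason}: expected {expected_hash}, got {actual_hash}")
        self.reason = reason
        self.expected_hash = expected_hash
        self.actual_hash = actual_hash


def check_resume(graph_hash: str, checkpoint_graph_hash: str) -> None:
    """Hypothetical guard: refuse to resume a checkpoint from another graph."""
    if graph_hash != checkpoint_graph_hash:
        # Assumption: "expected" is the hash recorded in the checkpoint.
        raise CheckpointError(
            "graph-hash-mismatch",
            expected_hash=checkpoint_graph_hash,
            actual_hash=graph_hash,
        )


caught = None
try:
    check_resume(graph_hash="a" * 64, checkpoint_graph_hash="b" * 64)
except CheckpointError as err:
    # Narrow catch: branch on the structured fields, not on the message text.
    caught = err
```

Branching on `caught.reason` rather than parsing the message keeps handlers stable if the human-readable wording changes.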