// Long-form bodies for the writing entries. Keyed by ENTRY id.
// Block kinds: "p" (paragraph), "h2" (section heading), "quote" (blockquote), "code" (preformatted).
// Inline markup inside p, h2, and quote: `inline code` and **bold**.

const ARTICLE_BODIES = {
  "memory-hierarchies-for-long-running-agents": [
    { kind: "p", text: "In an 80-turn agent session, the model's memory is more important than its reasoning. A strong LLM with bad memory will repeat tested mistakes, plan from contradicted hypotheses, and burn budget recovering context it already had. Most agent frameworks treat this as a compression problem: when context gets too big, summarize. Naive summarization is where most failure modes live. It promotes speculation to fact, drops the provenance of claims, and lets the next turn work from a corrupted version of what actually happened. While building an interactive agent for ARC-AGI-3, I needed something better, so I built a memory framework around three primitives that compose. This is how it works, and how it generalizes to any agent that has to remember more than fits in context." },

    { kind: "h2", text: "The problem with summarization-based memory" },
    { kind: "p", text: "When a context window fills up, you have to drop something. The standard move is to call an LLM to summarize the older turns and put the summary at the top of the next prompt. This works fine for chatbot transcripts. It breaks badly for agents." },
    { kind: "p", text: "The break is not the model's fault. It is the summarizer's. A summarizer is a model running on text and asked to produce shorter text. To do that, it has to interpret. It conflates sources. Coach notes ('maybe try the diagonal control') sit next to deterministic events ('ACTION6@39,17 produced local_change, level_delta=0') and come out the other side equally weighted. Hypotheses get promoted to facts. Speculation becomes assertion. Provenance evaporates." },
    { kind: "p", text: "The downstream effect is worse than just losing detail. The next turn reads the summary as if it were ground truth, and now the agent is planning from a story the summarizer made up. I caught a clean case of this in one run: the raw action ledger recorded `tracked_object_delta.moved_count=19` for an action; the compactor dropped that field; the next prompt told the model 'no objects moved'; the model abandoned a correct control hypothesis and pursued a wrong one for thirty turns. The model was not wrong. The memory was." },
    { kind: "p", text: "Three structural properties have to hold for an agent's compacted memory to survive long sessions. Sources have to be addressable and ordered, so the model knows which information to trust. Hypotheses have to carry their epistemic status, so a plausible-but-untested claim does not get treated like a verified one. And facts have to be anchored in durable, citable records, so compaction cannot lose them by paraphrase." },
    { kind: "p", text: "The three primitives below are the ones I built to enforce those properties. None of them is novel in isolation. The combination is." },

    { kind: "h2", text: "1. Source authority hierarchy" },
    { kind: "p", text: "Every piece of information in the agent's context has an authority level. The compactor enforces this when it builds the next prompt:" },
    { kind: "code", text: `authoritative:    event_log refs, deterministic state snapshots, action effects
supporting:       tracked deltas, bridge artifacts, frame helpers
advisory:         visual review, coach notes, REPL comments
unverified:       free narrative, transient speculation` },
    { kind: "p", text: "The rules are simple. Authoritative sources are reproduced verbatim in the compacted prompt. Supporting sources are reproduced with their reference but may be summarized. Advisory sources are prefixed with `coach_advice_not_fact:` or `visual_hypothesis_not_fact:` so the model cannot mistake them for evidence. Unverified narrative is dropped unless it carries explicit uncertainty markers." },
    { kind: "p", text: "The critical rule is **no promotion**. The compactor is forbidden, in its system prompt, from rewriting an advisory source into a supporting one, or a supporting source into an authoritative one. A coach note can never become a verified mechanic during compaction. A visual hypothesis can never become an event-anchored fact. If a claim needs to be promoted, the agent has to do it explicitly in a future turn, with new evidence." },
    { kind: "p", text: "This sounds trivial. It is not. Most summarizers promote silently because the prompt they are given is 'make this shorter and clearer', and clarity means converting 'the coach suggested the band shifts' into 'the band shifts'. The authority hierarchy disables that move." },
    { kind: "p", text: "An entry in the compacted prompt looks like this:" },
    { kind: "code", text: `{
  "authority_order": [
    "event refs / action_effects / bridge_accounting",
    "current frame helpers and bridge artifacts",
    "world_model claims only when verified/supported",
    "visual review and coach notes as advisory"
  ],
  "authoritative_action_facts": {
    "tested_controls": [
      {
        "action_key": "ACTION6@23,41",
        "attempts": 4,
        "status": "nonterminal_local_change",
        "refs": ["event_log:35", "event_log:38", "event_log:39"]
      }
    ]
  },
  "coach_notes_advisory": [
    "coach_advice_not_fact: try ACTION7 on the upper region"
  ]
}` },
    { kind: "p", text: "The reader of this prompt cannot accidentally treat the coach note like an action fact. They live in different fields with different prefixes, and the system prompt has told the model how to read each one." },
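    { kind: "p", text: "As a minimal sketch of how the rendering rules and the no-promotion rule could be enforced in code rather than only in the compactor's system prompt, assume each context item carries an authority level. The names `Authority`, `render_item`, and `assert_no_promotion` are hypothetical, and the list of uncertainty markers is an illustrative guess, not the harness's actual API:" },
    { kind: "code", text: `from enum import IntEnum


class Authority(IntEnum):
    UNVERIFIED = 0
    ADVISORY = 1
    SUPPORTING = 2
    AUTHORITATIVE = 3


# Assumed markers; the real list of uncertainty markers is not specified.
UNCERTAINTY_MARKERS = ("maybe", "possibly", "unconfirmed")


def render_item(text, authority, ref=None, source="coach"):
    """Return the line to place in the compacted prompt, or None to drop it."""
    if authority == Authority.AUTHORITATIVE:
        return text  # reproduced verbatim
    if authority == Authority.SUPPORTING:
        return f"{text} [{ref}]"  # may be summarized, but keeps its reference
    if authority == Authority.ADVISORY:
        prefix = "visual_hypothesis_not_fact" if source == "visual" else "coach_advice_not_fact"
        return f"{prefix}: {text}"
    # Unverified narrative survives only with an explicit uncertainty marker.
    lowered = text.lower()
    return text if any(marker in lowered for marker in UNCERTAINTY_MARKERS) else None


def assert_no_promotion(before, after):
    """Raise if compaction raised the authority level of any item id."""
    for item_id, level in after.items():
        if item_id in before and level > before[item_id]:
            raise ValueError(f"promotion during compaction: {item_id}")` },
    { kind: "p", text: "With this shape, a coach note rendered at `Authority.ADVISORY` always comes out prefixed, and a compaction pass that returns the same item at a higher level fails loudly instead of promoting it silently." },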

    { kind: "h2", text: "2. Hypothesis status ledger with safe_to_plan_from" },
    { kind: "p", text: "Once sources are ordered, claims need their own ontology. Hypotheses are not facts. Plausible hypotheses are not safe planning ground. The agent needs a way to track the difference structurally." },
    { kind: "p", text: "Every claim the agent generates is a typed object:" },
    { kind: "code", text: `{
  "id": "wm_1_24",
  "claim_type": "mechanic",
  "status": "supported",
  "safe_to_plan_from": true,
  "checker": "event_effect_repeatability",
  "claim": "ACTION6@48,26 shifts the middle band cyclically by one slot",
  "evidence_refs": ["event_log:32", "event_log:38"],
  "prediction": {"next_slots": [2, 1, 9, 9, 1, 10, 15, 2, 10, 2]}
}` },
    { kind: "p", text: "Status moves through a small typed lattice:" },
    { kind: "code", text: `pending  ->  supported  ->  verified
                |             |
             falsified  <- contradicted` },
    { kind: "p", text: "A claim starts `pending` when the agent first proposes it. It becomes `supported` when observed evidence matches the prediction at least once. It becomes `verified` only after independent repeated tests. It moves to `falsified` if a single counter-example breaks the prediction, or `contradicted` if it conflicts with a verified peer." },
    { kind: "p", text: "The field that does the most work is `safe_to_plan_from`. A supported claim is not automatically safe to plan from. It might be supported because of one lucky observation. Planning over a claim that has only a single-event basis is how agents get stuck in loops where they execute a fragile theory and then call the resulting failure 'the world being inconsistent'. By gating planning on `safe_to_plan_from`, the agent is forced to either ground the claim further before planning, or branch to a different hypothesis." },
    { kind: "p", text: "The status is not metadata. It changes execution. Before allowing a costly real action to be repeated, the action review gate checks whether the proposal cites a claim with `safe_to_plan_from=true`. If not, the agent has to either upgrade the claim with new evidence or pick a different route. This is the structural answer to 'why is my agent stuck repeating the same broken strategy?'." },
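    { kind: "p", text: "A minimal sketch of that lattice and gate, under stated assumptions: the transition table is one reading of the status diagram, the two-reference threshold for `safe_to_plan_from` is an illustrative choice, and `Claim`, `transition`, and `review_gate` are hypothetical names:" },
    { kind: "code", text: `from dataclasses import dataclass, field

# One reading of the status diagram; any live claim may also be falsified.
ALLOWED = {
    "pending": {"supported", "falsified"},
    "supported": {"verified", "falsified", "contradicted"},
    "verified": {"contradicted", "falsified"},
    "contradicted": {"falsified"},
    "falsified": set(),
}


@dataclass
class Claim:
    id: str
    claim: str
    status: str = "pending"
    evidence_refs: list = field(default_factory=list)

    @property
    def safe_to_plan_from(self):
        # Assumed rule: supported or verified, with at least two distinct event refs.
        grounded = len(set(self.evidence_refs)) >= 2
        return self.status in {"supported", "verified"} and grounded


def transition(claim, new_status, evidence_ref=None):
    """Move a claim along the lattice, rejecting skipped or reversed steps."""
    if new_status not in ALLOWED[claim.status]:
        raise ValueError(f"illegal transition {claim.status} -> {new_status}")
    if evidence_ref is not None:
        claim.evidence_refs.append(evidence_ref)
    claim.status = new_status


def review_gate(claim):
    """Decide whether a costly repeated action may cite this claim."""
    return "allow" if claim.safe_to_plan_from else "needs_more_evidence"` },
    { kind: "p", text: "In this sketch a claim backed by a single event is `supported` but still fails the gate, which is the distinction the field exists to make." },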

    { kind: "h2", text: "3. Event-anchored memory" },
    { kind: "p", text: "Authority and status only matter if facts survive compaction. The third primitive is the one that makes this possible." },
    { kind: "p", text: "Instead of storing facts as text inside the agent's conversation, I store them in external structured artifacts and reference them by stable pointers. The conversation carries references; the artifacts carry the data." },
    { kind: "p", text: "An action that produces a transition does not become a paragraph. It becomes an entry:" },
    { kind: "code", text: `event_log:52: {
  "action_key": "ACTION6@39,17",
  "before_frame_ref": "frames/turn_47_pre.json",
  "after_frame_ref": "frames/turn_47_post.json",
  "effect": "local_change",
  "level_delta": 0,
  "changed_cell_count": 89,
  "tracked_object_delta": {
    "moved_count": 19,
    "rotated_count": 0
  }
}` },
    { kind: "p", text: "In the compacted prompt, the agent sees `event_log:52` as a reference, plus a one-line summary if useful. If the agent needs the full event, it can ask for the artifact, which is immutable and indexed." },
    { kind: "p", text: "This decouples three things that are usually conflated. Memory durability lives in the artifact store. Transport (what gets sent to the model this turn) lives in the prompt budget. Citation lives in the reference scheme. Each of them can evolve independently." },
    { kind: "p", text: "The narrative version, 'the band shifted left', drifts when summarized. The reference, `event_log:52`, does not. A claim citing `event_log:52` can always be checked. A claim citing 'the band shifted left' cannot." },
    { kind: "p", text: "The discipline this enables is striking. Hypotheses must cite event refs to be promotable. Compaction can drop inline summaries, but never the event refs themselves. The agent's planning has to point at evidence by ID, not by paraphrase. After a few turns, the agent's reasoning starts to look more like a database query than a narrative, and that is the right shape for long sessions." },
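    { kind: "p", text: "To make the reference scheme concrete, here is a minimal sketch of an append-only store that hands out stable refs of the form `event_log:N`. The `EventLog` class, its methods, and the zero-based numbering are hypothetical, and a real store would persist to disk rather than hold events in memory:" },
    { kind: "code", text: `import copy


class EventLog:
    """Append-only store; entries are never edited after being written."""

    def __init__(self, prefix="event_log"):
        self._prefix = prefix
        self._events = []

    def append(self, event):
        """Store a private copy of the event and return its stable ref."""
        self._events.append(copy.deepcopy(event))
        return f"{self._prefix}:{len(self._events) - 1}"

    def resolve(self, ref):
        """Return a copy of the full event behind a ref such as event_log:0."""
        prefix, _, index = ref.partition(":")
        if prefix != self._prefix or not index.isdecimal():
            raise KeyError(f"unknown ref: {ref}")
        position = int(index)
        if position >= len(self._events):
            raise KeyError(f"unknown ref: {ref}")
        return copy.deepcopy(self._events[position])


def unresolved_refs(claim, log):
    """List evidence refs on a claim that do not resolve to a stored event."""
    missing = []
    for ref in claim.get("evidence_refs", []):
        try:
            log.resolve(ref)
        except KeyError:
            missing.append(ref)
    return missing` },
    { kind: "p", text: "Because `resolve` returns a copy, nothing downstream can edit the stored event, and `unresolved_refs` gives the compactor a cheap way to reject a claim whose citations point at nothing." },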

    { kind: "h2", text: "How they compose: the decision state" },
    { kind: "p", text: "The three primitives compose into a single object the compactor produces every turn: the decision state." },
    { kind: "p", text: "The decision state is not a summary of the conversation. It is a deterministic reconstruction of the agent's durable ledgers, materialized as a JSON manifest. The conversation is just transport. The ledgers are the truth." },
    { kind: "p", text: "A condensed decision state looks like this:" },
    { kind: "code", text: `{
  "phase": "VALIDATE_POLICY",
  "authority_order": [...],
  "authoritative_action_facts": {...},
  "verified_hypotheses": [
    {"id": "wm_1_24", "claim_type": "mechanic", "safe_to_plan_from": true, ...}
  ],
  "pending_hypotheses": [
    {"id": "wm_1_31", "claim_type": "goal", "safe_to_plan_from": false, ...}
  ],
  "coach_notes_advisory": [
    "coach_advice_not_fact: try the right side"
  ],
  "available_artifacts": [
    "bridge_context/action_effect_map.json",
    "bridge_context/recent_events.json"
  ]
}` },
    { kind: "p", text: "This is what the model sees, after all the noise has been filtered through the three primitives. No drift, no promoted speculation, no orphaned claims. Just a typed view of what the agent has actually established, what it is still investigating, and what advice is on the table without authority." },
    { kind: "p", text: "Compaction stops being a compression problem. It becomes a **specification of trust**: the manifest tells the model what it is allowed to plan from, what it should treat as hypothesis, and what is just commentary." },

    { kind: "h2", text: "Where it actually helps" },
    { kind: "p", text: "This framework was built for an ARC-AGI-3 agent, but nothing about it is puzzle-specific. The three primitives apply to any agent that runs long sessions, observes a real environment, and has to remember more than fits in a prompt window. Here are five scenarios where I have either seen these patterns help or expect them to." },
    { kind: "p", text: "**Web automation agents.** A browser agent that navigates a site for a hundred or more actions accumulates state: what elements exist, which XPaths are stable, which forms have been filled, which submissions failed. Authority hierarchy separates 'this element was observed in the current DOM snapshot' (authoritative) from 'this XPath worked reliably across sessions' (supporting) from 'the page probably has a confirmation step' (advisory). Hypothesis status prevents the agent from planning a multi-step flow based on a single observed transition. Event-anchored refs let the agent cite `interaction_event:127: clicked //button[@signin], coords [523,401]` instead of re-describing the click in each subsequent turn." },
    { kind: "p", text: "**Customer support agents.** A support agent handling a long case juggles confirmed issues, customer-reported context, and speculative diagnoses. Authority separates system logs or billing records (authoritative) from confirmed customer statements (supporting) from differential hypotheses (advisory). Hypothesis status prevents the agent from promising a resolution path based on one signal. Event-anchored refs (`case_event:456: customer_stated_payment_failed`) let the agent and its handoff partners cite ground truth across long sessions, including audit replay." },
    { kind: "p", text: "**Code debugging assistants.** A debugging session can run for hours. Authority separates test output (authoritative) from stack trace inference (supporting) from heuristic suspicion (advisory). Hypothesis status lets the agent track 'this might be a race condition' as `pending`, move it to `supported` once two independent reproductions back it, and mark it `verified` only after the fix removes related failures. Event-anchored refs let the agent cite `test_failure:89: KeyError 'token' at auth.py:42` across an entire session without re-pasting the stack trace." },
    { kind: "p", text: "**Medical AI triage.** Triage and diagnosis are exactly the scenario where source authority matters most. Imaging and lab results are authoritative. Patient-reported symptoms are supporting. Differential diagnoses are advisory. Hypothesis status prevents an early hypothesis from being treated as a working diagnosis before the supporting tests have come back. Event-anchored refs let the system cite `lab_result:234: CBC, ANC=420` rather than paraphrasing across consultation notes." },
    { kind: "p", text: "**RAG systems with audit requirements.** Retrieval-augmented systems often need to defend why a particular answer was produced. Authority separates retrieved passages with citations (authoritative) from paraphrased summaries (supporting) from query reformulations (advisory). Hypothesis status flags relevance claims as supported only after user feedback or click-through. Event-anchored refs replace inline citation strings with stable pointers into the retrieval log, so audits can replay the exact sources that informed each answer." },
    { kind: "p", text: "In all of these, the framework is doing the same job: keeping the model honest about which information has authority, which claims are safe to plan from, and which facts are durable. The agent does not have to be smart enough to figure out trust on its own. The memory layer decides what it is allowed to trust." },

    { kind: "h2", text: "Specification, not compression" },
    { kind: "p", text: "I went into this thinking compaction was a compression problem. It is not. Compression is incidental. The actual problem is specification: telling the model what it is allowed to trust, what it should still question, and what is just commentary." },
    { kind: "p", text: "A long-running agent with naive summarization is an agent that hallucinates its own past. The three primitives, source authority, hypothesis status, and event-anchored references, are not a compression scheme. They are a contract between the agent and its memory, written down explicitly enough that compaction cannot violate it without raising an error." },
    { kind: "p", text: "If you are building an agent that has to remember more than fits in context, you are going to have to make these choices. You can make them by accident, in which case your agent will drift. Or you can make them on purpose." },
    { kind: "p", text: "**Memory in agent systems is not compression. It is specification of what your model is allowed to trust.**" },
  ],

  "typed-route-ledger-world-modeling": [
    { kind: "p", text: "For two months I tried to make a strong LLM solve a game it had never seen. What worked was not a smarter prompt. What worked was building the scaffolding around the model: a system that decides when it observes, when it acts, when it has to stop and prove what it just saw. This is a lab report on that scaffolding." },

    { kind: "h2", text: "Summary" },
    { kind: "p", text: "During my ARC-AGI-3 experiments, I observed that the difference between a strong model 'thinking' and an agent that solves interactive tasks rarely lives in the prompt alone. The most consistent gains came from treating the model as a component inside a harness: a system that controls when the model observes, when it forms hypotheses, when it commits real actions, when it must stop repeating a strategy, and how it turns observed effects into operational memory." },
    { kind: "p", text: "I call this approach, for now, Typed Route-Ledger Governed Execution for Test-Time World-Modeling. The central idea is simple: instead of letting an LLM drive the session freely, the harness maintains a typed state machine, records routes and attempts in ledgers, demands evidence before repeated actions, and lets the model build a local world model during the test." },
    { kind: "p", text: "That world model is not a neural model of the environment trained ahead of time. It is an operational test-time world model: a set of claims, transitions, control maps, predictions, and falsifications, built while interacting with the game. It exists to answer practical questions. What does this action change? Under what conditions does this transition repeat? Which next test reduces uncertainty the most? Is this sequence grounded in evidence or in a guess?" },

    { kind: "h2", text: "Motivation" },
    { kind: "p", text: "ARC-AGI-3 is the third edition of François Chollet's benchmark for abstract reasoning, focused this time on interactive agents. It shifts the problem from 'produce a response' to 'discover an environment': the agent receives an interactive visual game with no explicit rules and has to figure out mechanics, objectives, and action sequences. This makes the 'just ask the model to solve it' approach fragile. The model tends to mistake local change for terminal progress. It repeats actions it has already tested. It invents visual objectives without evidence. It loses context after many attempts. It treats a plausible hypothesis as fact. It fails to distinguish 'nothing changed' from 'something changed but the aggregate metric did not notice'. And it turns a verbal sequence into a plan, with no transition proof." },
    { kind: "p", text: "The harness emerged in response to those failure modes. The question stopped being 'how do we make the LLM think better?' and became:" },
    { kind: "quote", text: "what structure around the LLM makes it discover rules in a way that is more verifiable, less impulsive, and more reusable?" },

    { kind: "h2", text: "The thesis" },
    { kind: "p", text: "The thesis of this work is that interactive agents need a more rigid separation between observation, hypothesis, evidence, policy, execution, falsification, and operational memory." },
    { kind: "p", text: "In purely agentic systems, these layers often blur into free text. The model sees something, infers a rule, executes an action, interprets the result, and decides the next step within the same narrative flow. That is flexible, but it also creates loops and self-deception." },
    { kind: "p", text: "My approach inverts the relationship: the LLM still reasons, writes code, and proposes tests, but the harness governs the cycle. It decides which routes are valid, which evidence counts, when a branch has been invalidated, and when a real action needs additional proof." },
    { kind: "p", text: "**The harness is the lever, not the model.**" },

    { kind: "h2", text: "Main components" },
    { kind: "p", text: "Before describing each component separately, it is worth showing the technical loop the architecture tries to enforce. Instead of a free agent calling tools in arbitrary order, the ideal cycle looks closer to this:" },
    { kind: "code", text: `snapshot = pipeline_status()
route = snapshot["route_request"]

source = route_source(snapshot, route)
if route_ledger.is_invalidated(source):
    route = deterministic_fallback(snapshot, source)

result = execute_typed_route(route)
route_ledger.record(source=source, route=route, result=result)

if result.failed_nonrepairable:
    route_ledger.invalidate(source)` },
    { kind: "p", text: "That pseudocode captures the central difference: the model is not the final authority on the next step. It can propose, analyze, and write code, but the route comes from a typed state and stays on record. When a decision source fails, the system does not depend on the prompt 'reminding' the model not to try again; the source is invalidated." },

    { kind: "h2", text: "1. Typed pipeline" },
    { kind: "p", text: "The main flow is organized into phases like exploration, verification, policy validation, terminal rollout, and solving. Each phase has allowed tools and expected contracts." },
    { kind: "p", text: "In practice, the state that matters is not a loose narrative like 'I think I should explore more'. It is an object with a phase, allowed tools, and an executable route:" },
    { kind: "code", text: `{
  "phase": "VALIDATE_POLICY",
  "allowed_next_tools": ["terminal_bridge_workbench"],
  "route_request": {
    "tool": "terminal_bridge_workbench",
    "reason": "typed live-action budget reached; map controls before more actions"
  }
}` },
    { kind: "p", text: "This prevents the agent from using a solver tool while it is still in the observation phase, or running a policy without enough evidence. The practical effect is to reduce improvisation at moments where improvisation usually costs actions." },
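    { kind: "p", text: "A minimal sketch of that gate, assuming a static table from phase to allowed tools. Only `VALIDATE_POLICY` and `terminal_bridge_workbench` come from the state object shown in this section; the other phase and tool names, `RouteRejected`, and `check_route` are hypothetical:" },
    { kind: "code", text: `# Assumed phase table; only VALIDATE_POLICY -> terminal_bridge_workbench is from the text.
ALLOWED_TOOLS = {
    "EXPLORE": {"observe_frame", "run_actor_probe"},
    "VALIDATE_POLICY": {"terminal_bridge_workbench"},
    "SOLVE": {"execute_policy"},
}


class RouteRejected(Exception):
    """Raised when a route asks for a tool the current phase does not allow."""


def check_route(state):
    """Validate a typed state object before its route is executed."""
    phase = state["phase"]
    tool = state["route_request"]["tool"]
    allowed = ALLOWED_TOOLS.get(phase, set())
    if tool not in allowed:
        raise RouteRejected(f"{tool} is not allowed in phase {phase}")
    return tool` },
    { kind: "p", text: "The check runs in the harness, outside the model, so a route that names a solver tool during exploration is rejected no matter how the model justifies it." },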

    { kind: "h2", text: "2. Route ledger" },
    { kind: "p", text: "Each route the harness proposes has an origin: a pending hypothesis, an exploration request, a policy attempt, an active strategy, or a need for the bridge." },
    { kind: "p", text: "When a route fails in a non-repairable way, its origin can be invalidated. This blocks a common pattern in LLM agents: the system already knows a route does not work, but the model reissues the same route under different wording." },
    { kind: "p", text: "This ledger is one of the most important pieces of the architecture. It turns 'do not try this again' into structured state rather than advice in a prompt." },
    { kind: "p", text: "A conceptual route ledger entry looks something like this:" },
    { kind: "code", text: `{
  "source": "pending_exploration_request:L1:target_probe_3",
  "route": "run_actor_probe",
  "result": "FAILED_NONREPAIRABLE",
  "invalidates_source": true,
  "reason": "same action family already tested without terminal evidence"
}` },
    { kind: "p", text: "The point is subtle: the ledger is not just logging history. It changes the topology of future decisions. An invalidated route stops being a valid option, even if the model comes back and describes it in different words." },
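    { kind: "p", text: "A minimal in-memory sketch of such a ledger, matching the `record`, `invalidate`, and `is_invalidated` calls in the loop shown earlier. The class body, the `reason` parameter, and `valid_sources` are assumptions for illustration, not the harness's implementation:" },
    { kind: "code", text: `class RouteLedger:
    """Records route attempts and tracks which decision sources are dead."""

    def __init__(self):
        self.entries = []
        self._invalidated = {}

    def record(self, source, route, result):
        """Append one attempt; history is kept even for failed routes."""
        self.entries.append({"source": source, "route": route, "result": result})

    def invalidate(self, source, reason=""):
        """Mark a source as no longer eligible to produce routes."""
        self._invalidated.setdefault(source, reason)

    def is_invalidated(self, source):
        return source in self._invalidated

    def valid_sources(self, candidates):
        """Filter candidate sources by identity, not by how they are worded."""
        return [s for s in candidates if not self.is_invalidated(s)]` },
    { kind: "p", text: "Because the check keys on the source id rather than on the model's description of the route, rewording the proposal does not bring it back." },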

    { kind: "h2", text: "3. Terminal bridge REPL" },
    { kind: "p", text: "When the typed pipeline reaches the limit of what it can decide via short probes, it can open a REPL workbench. The REPL is not a free solver in the traditional sense. It receives tools to inspect the frame, compare states, read artifacts, build control maps, simulate transitions, and propose one action at a time." },
    { kind: "p", text: "The bridge's role is to allow richer local reasoning without abandoning the harness's governance. In multi-step games, that workspace was essential: the model needed to 'think while playing', but without turning the real game into a cheap simulator." },
    { kind: "p", text: "Inside the bridge, the model works more like a scientist with a lab notebook than like a chatbot. A good REPL turn tends to look like this:" },
    { kind: "code", text: `# read-only first: inspect current structured state
effects = action_effects()
latest = transition_evidence("event_log:40")

# create or update an explicit model claim
world_model_add_claim(
    claim_type="mechanic",
    claim="ACTION6@39,17 shifts the top band by one slot",
    evidence_refs=["event_log:40"],
    prediction={"next_top_slots": [8, 15, 10, 9, 15]},
)

# only then propose one live action with a falsifiable expectation
submit_action(
    "ACTION6",
    x=39,
    y=17,
    why="test one predicted top-band shift",
    expected_signal="top band shifts one slot; no repeated blind pumping",
    stop_if="prediction mismatch or level advance",
)` },
    { kind: "p", text: "This format does not guarantee the model is right. But it forces the action to carry a prediction and a stopping condition." },

    { kind: "h2", text: "4. World model ledger" },
    { kind: "p", text: "The bridge records claims about the world: an action shifts a band; a control rotates a cycle; two coordinates are inverses; a visual target may indicate a terminal condition; a predicted sequence should change specific slots." },
    { kind: "p", text: "Each claim can be pending, supported, verified, falsified, or contradicted. The point is not to build a perfect ontology of the game. The point is to give the model an operational memory that distinguishes 'I think' from 'I tested'." },
    { kind: "p", text: "A compact claim looks like:" },
    { kind: "code", text: `{
  "id": "wm_1_24",
  "claim_type": "mechanic",
  "status": "supported",
  "safe_to_plan_from": true,
  "checker": "event_effect_repeatability",
  "claim": "ACTION6@48,26 shifts the middle band cyclically by one slot",
  "evidence_refs": ["event_log:32"],
  "prediction": {
    "next_slots": [2, 1, 9, 9, 1, 10, 15, 2, 10, 2]
  }
}` },
    { kind: "p", text: "The most important field is `safe_to_plan_from`. It separates a hypothesis that can serve as planning ground from a hypothesis that is still only a possibility. This matters because LLMs tend to plan over plausible ideas before those ideas are verified." },

    { kind: "h2", text: "5. Transition evidence" },
    { kind: "p", text: "Every real action produces before/after evidence: which cells changed, which regions were affected, which rows or columns had salient transitions, whether a level advance occurred, and whether the change was purely local." },
    { kind: "p", text: "This substantially reduced the dependence on loose visual interpretation. Instead of 'it looks like it moved right', the agent can consult a specific event and compare the transition." },
    { kind: "p", text: "A reduced transition evidence record looks like:" },
    { kind: "code", text: `{
  "event_ref": "event_log:40",
  "action_key": "ACTION6@39,17",
  "effect": "local_change",
  "level_delta": 0,
  "changed_cell_count": 89,
  "top_changed_rows": [
    {"axis": "row", "index": 17, "changed_cells": 10},
    {"axis": "row", "index": 18, "changed_cells": 10}
  ],
  "interpretation_boundary": "local transition evidence; not terminal proof"
}` },
    { kind: "p", text: "That last field became an important doctrine: local change is not proof of terminal progress. The agent can use the event to build mechanics, but not to declare it is closer to the goal without additional evidence." },

    { kind: "h2", text: "6. Action review" },
    { kind: "p", text: "Before spending a real action, especially on repetitions, the bridge can pass through a reviewer. That reviewer asks: has this action been tried? Was the previous effect terminal or only local? Is there a verifiable prediction? Is there a world model claim that supports the proposal? Is the model trying to repeat an action because it has evidence, or because it is stuck?" },
    { kind: "p", text: "In some runs, this mechanism blocked 'shift until it resolves' loops and forced the REPL to produce slot-level proof before continuing." },
    { kind: "p", text: "A typical review, when the model wanted to repeat an already-tried action, looked roughly like this:" },
    { kind: "code", text: `{
  "verdict": "run_offline_check_first",
  "risk": "medium",
  "repetition_risk": "high",
  "action_key": "ACTION6@39,17",
  "feedback": "analyze event_log:40 before repeating; provide slot-level proof"
}` },
    { kind: "p", text: "This changes the reviewer's role. It is not a second solver. It is an epistemic brake: when an action looks like a repetition, it forces the REPL back to the artifacts to produce a smaller proof before spending another action." },
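    { kind: "p", text: "The reviewer's decision can be sketched as a small pure function over the ledgers. This is a simplified illustration: `review_action` and the `allow` verdict are hypothetical, and only `run_offline_check_first` and the field names are taken from the review example in this section:" },
    { kind: "code", text: `def review_action(action_key, tested_controls, cited_claim=None):
    """Return a review verdict for a proposed live action."""
    prior = next((t for t in tested_controls if t["action_key"] == action_key), None)
    if prior is None:
        # Never tried: nothing to repeat, so the action may proceed.
        return {"verdict": "allow", "repetition_risk": "low", "action_key": action_key}
    if cited_claim is not None and cited_claim.get("safe_to_plan_from"):
        # Repeating with a grounded claim is a test, not blind pumping.
        return {"verdict": "allow", "repetition_risk": "medium", "action_key": action_key}
    refs = ", ".join(prior.get("refs", []))
    return {
        "verdict": "run_offline_check_first",
        "repetition_risk": "high",
        "action_key": action_key,
        "feedback": f"analyze {refs} before repeating; provide slot-level proof",
    }` },
    { kind: "p", text: "The function spends no actions and proposes none of its own, which keeps the reviewer in the role of a brake rather than a second solver." },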

    { kind: "h2", text: "7. Decision-state compaction" },
    { kind: "p", text: "Long interactive sessions break context. The solution was not just summarizing text. The bridge started maintaining structured artifacts: actions, transitions, world model, latest feedback, visual reviews, canonical state, and decision state." },
    { kind: "p", text: "Subsequent context is rebuilt from those artifacts. The textual summary becomes support; the ledger becomes the source of truth." },
    { kind: "p", text: "The decision state that enters the compact prompt is deliberately small. It does not try to carry the whole conversation; it carries an index of what matters:" },
    { kind: "code", text: `{
  "authority_order": [
    "event refs / action_effects / bridge_accounting",
    "current frame helpers and bridge artifacts",
    "world_model claims only when verified/supported",
    "visual review and coach notes as advisory"
  ],
  "authoritative_action_facts": {
    "tested_controls": [
      {
        "action_key": "ACTION6@23,41",
        "attempts": 4,
        "status": "nonterminal_local_change",
        "refs": ["event_log:35", "event_log:38", "event_log:39"]
      }
    ]
  }
}` },
    { kind: "p", text: "This authority hierarchy was an important lesson. Without it, the model tends to treat summary, coach note, visual hypothesis, and actual evidence as if they all had the same weight." },

    { kind: "h2", text: "How the architecture behaves within a level" },
    { kind: "p", text: "A typical level does not start with 'solve'. It starts with uncertainty reduction. The harness first collects observations and small probes. If the probes show that local controls exist but produce no terminal progress, the pipeline opens the bridge. The bridge then runs through three jobs: map controls, turn controls into verifiable transitions, and turn transitions into a small policy." },
    { kind: "p", text: "The most common anti-pattern is skipping from the first job to the third. For example: 'this arrow moves the band, so I will press it until it aligns'. The architecture tries to prevent that jump by demanding an intermediate representation:" },
    { kind: "code", text: `control_map = {
    "ACTION6@14,26": {"band": "middle", "delta": "left_1"},
    "ACTION6@48,26": {"band": "middle", "delta": "right_1"},
    "ACTION6@14,35": {"band": "bottom", "delta": "left_1"},
    "ACTION6@48,35": {"band": "bottom", "delta": "right_1"},
}

plan = simulate(control_map, current_state, target_hypothesis)
assert plan.predicted_checkpoints` },
    { kind: "p", text: "Even when the `target_hypothesis` is wrong, this separation helps: the failure is localized. I know whether the mechanic, the simulation, or the interpretation of the goal failed." },

    { kind: "h2", text: "Observed examples" },

    { kind: "h2", text: "ft09 and the importance of well-chosen simple actions" },
    { kind: "p", text: "In ft09, I saw that a good initial investigation can turn an entire game into a compact policy. The agent did not need a complex theory: it needed to discover which action had an effect, verify that the effect was regular, and apply the correct sequence with few wasted actions." },
    { kind: "p", text: "The lesson here was that, even in simpler tasks, the harness has to protect the model from overthinking. When the mechanic is simple, the best agent is not the one that builds the richest explanation; it is the one that identifies the right operator and stops spending actions." },
    { kind: "p", text: "The pattern I want to preserve from this case is:" },
    { kind: "code", text: `observe()
probe_one_action()
verify_regular_effect()
execute_minimal_policy()
stop_on_level_advance()` },
    { kind: "p", text: "This looks trivial, but in LLM agents the trivial is exactly what gets lost: after finding a sufficient action, the model keeps looking for a prettier theory than it needs." },
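    { kind: "p", text: "A runnable sketch of that loop against a toy environment. `ToyEnv`, the shape of its `step` result, and the `max_actions` budget are hypothetical stand-ins; only the ordering of the phases comes from the pattern above." },
    { kind: "code", text: `class ToyEnv:
    """Toy stand-in: pressing 'A' three times advances the level; 'B' does nothing."""
    def __init__(self):
        self.progress = 0
        self.level = 1

    def step(self, action):
        changed = action == "A"
        if changed:
            self.progress += 1
            if self.progress == 3:
                self.level += 1
        return {"changed": changed, "level": self.level}

def run_minimal_policy(env, candidate_actions, max_actions=10):
    start_level = env.level
    log = []
    # Probe each candidate once and keep the first one that has an effect.
    operator = None
    for action in candidate_actions:
        result = env.step(action)
        log.append(action)
        if result["level"] > start_level:
            return log
        if result["changed"]:
            operator = action
            break
    if operator is None:
        return log
    # Keep applying the operator while its effect stays regular.
    while len(log) < max_actions:
        result = env.step(operator)
        log.append(operator)
        if result["level"] > start_level:
            break  # stop on level advance: no further theory-building
        if not result["changed"]:
            break  # effect was not regular; return to investigation
    return log

print(run_minimal_policy(ToyEnv(), ["B", "A"]))` },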

    { kind: "h2", text: "lp85 and multi-step mechanics" },
    { kind: "p", text: "lp85 was more informative for the architecture. The game demands mapping controls, understanding displacements, separating local change from real progress, and planning sequences with state dependencies." },
    { kind: "p", text: "In some levels, the bridge managed to turn local observations into an executable model: identifying lateral controls, mapping displacements per band, discovering rotations, and assembling sequences based on observed transitions. The most interesting part was not just completing a portion of the game; it was watching the agent move from 'testing coordinates' to 'maintaining an operational map of the environment'." },
    { kind: "p", text: "After many iterations of the harness, the same base model went from getting stuck on level two to clearing all eight levels of lp85 in roughly 185 actions. None of those wins came from a smarter prompt. Each came from a new ledger or check that absorbed a class of mistake the previous version was still making." },
    { kind: "p", text: "One particularly instructive case involved cycles controlled by visual regions on the board. The model initially treated some actions as `no_effect`, but later analysis showed real rotation without aggregate color-count change. This is exactly the kind of error a textual LLM tends to miss and that a transition ledger can recover." },
    { kind: "p", text: "The technical mistake looked something like this:" },
    { kind: "code", text: `aggregate_color_delta == {}` },
    { kind: "p", text: "leading to:" },
    { kind: "code", text: `effect = "no_effect"` },
    { kind: "p", text: "But the structural comparison showed something else:" },
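    { kind: "code", text: `before_slots = [1, 9, 15, 10, ...]
after_slots  = [9, 15, 10, 1, ...]
# Inequality only shows that the state changed; confirming a rotation
# also requires after_slots to be a cyclic shift of before_slots.
rotation_verified = before_slots != after_slots` },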
    { kind: "p", text: "The global color count did not change, but the state did. That distinction was decisive in seeing that there was a manipulable cycle. This kind of bug is a good justification for structured ledgers: the semantic interpretation of the event can be wrong, but the before/after artifact lets us recover the mechanical truth." },
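    { kind: "p", text: "A minimal sketch of a classifier that checks structure before declaring `no_effect`. The four-slot lists are made up for the example (the real lists above are elided), and `is_rotation` and `classify_effect` are hypothetical helper names." },
    { kind: "code", text: `from collections import Counter

def is_rotation(before, after):
    """True if after is a nontrivial cyclic shift of before."""
    if len(before) != len(after) or before == after:
        return False
    return any(before[k:] + before[:k] == after for k in range(1, len(before)))

def classify_effect(before, after):
    if before == after:
        return "no_effect"
    if Counter(before) == Counter(after):
        # Aggregate counts are identical, so a count-only check would miss this.
        return "rotation" if is_rotation(before, after) else "reordering"
    return "content_change"

before_slots = [1, 9, 15, 10]
after_slots = [9, 15, 10, 1]

# The count-based view sees nothing...
aggregate_color_delta = Counter(after_slots) - Counter(before_slots)
print(dict(aggregate_color_delta))
# ...while the structural view recovers the mechanic.
print(classify_effect(before_slots, after_slots))` },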

    { kind: "h2", text: "When the world model gets it wrong" },
    { kind: "p", text: "Not every world model built during a run was correct. In lp85, the agent repeatedly confused a repeatable local change with evidence of a terminal condition. It built visual alignment hypotheses, executed shifts, and the game did not advance." },
    { kind: "p", text: "That failure was useful. The harness was able to record that the actions were repeatable but non-terminal, that the visual hypothesis lacked sufficient support, and that repeating the same control family required new proof. The system does not eliminate error, but it makes error auditable and, in some cases, governable." },
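    { kind: "p", text: "One way to sketch the 'new proof' rule: a reviewer that counts non-terminal attempts per control family and, once a budget is spent, rejects further repeats unless the proposal cites an event ref the reviewer has not already seen. `ActionReviewer`, the budget of 3, the family key, and the ref `event_log:52` are assumptions for illustration." },
    { kind: "code", text: `from collections import defaultdict

class ActionReviewer:
    def __init__(self, budget=3):
        self.budget = budget
        self.nonterminal_attempts = defaultdict(int)
        self.seen_refs = defaultdict(set)

    def record(self, family, event_ref, level_delta):
        """Log an executed action; only non-terminal outcomes count against the family."""
        self.seen_refs[family].add(event_ref)
        if level_delta == 0:
            self.nonterminal_attempts[family] += 1

    def review(self, family, cited_refs=()):
        """Allow a repeat while under budget, or when new evidence is cited."""
        if self.nonterminal_attempts[family] < self.budget:
            return "allow"
        new_refs = set(cited_refs) - self.seen_refs[family]
        return "allow" if new_refs else "needs_new_proof"

reviewer = ActionReviewer(budget=3)
for ref in ("event_log:35", "event_log:38", "event_log:39"):
    reviewer.record("ACTION6@23,41", ref, level_delta=0)

# Re-citing old evidence is not enough once the budget is spent.
print(reviewer.review("ACTION6@23,41", cited_refs=["event_log:38"]))
# A ref the reviewer has not seen reopens the family.
print(reviewer.review("ACTION6@23,41", cited_refs=["event_log:52"]))` },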

    { kind: "h2", text: "What seems new here" },
    { kind: "p", text: "I do not claim that typed pipelines, REPL agents, ledgers, or world models are novel ideas in isolation. What looks promising is the combination: a typed orchestrator that decides the route; a local REPL for rich investigation; a world model ledger that separates claim, evidence, and status; a transition ledger that anchors claims in events; a route ledger that invalidates loop sources; an action reviewer that governs repetitions; and structured compaction based on artifacts rather than just textual summary." },
    { kind: "p", text: "That combination creates an agent that is not simply 'LLM with tools'. It is a system in which the LLM builds local models under an evidence regime." },

    { kind: "h2", text: "What I learned about models" },
    { kind: "p", text: "An important part of the experiments was discovering that the quality of the base model is not enough. Different models failed for different reasons." },
    { kind: "p", text: "Open-weight and open-source models were especially interesting for the bridge REPL. In particular, models from the Qwen family running on a local endpoint gave room for long reasoning sessions, map building, and programmatic analysis. The advantage was not just 'raw intelligence'; it was infrastructure control, more operational freedom, and the ability to iterate on the harness's format." },
    { kind: "p", text: "Strong proprietary models, like Claude, were useful for reasoning and review. Different providers fail in different ways: timeouts, image support, transport limits, response format. Session length matters as much as raw reasoning quality." },
    { kind: "p", text: "The core lesson was that ARC-AGI-3 measures the combination of model, harness, and runtime. A better model inside a weaker harness can spend actions without learning. A slightly weaker model with a well-governed workbench can build useful state and avoid some of the loops." },
    { kind: "p", text: "This was the state of model selection in early May. By the time you read this, the harness has moved to Gemini as the default for production runs. The harness lesson holds; the model lesson is more fluid." },

    { kind: "h2", text: "Current limits" },
    { kind: "p", text: "The system is still far from being a general solver. Some limits showed up repeatedly. The agent still builds weak terminal hypotheses when the image does not clearly show the goal. Models can spend a lot of context justifying a simple action. The bridge can still use real actions as an expensive form of simulation. The quality of compaction decides whether the agent keeps reasoning or loses the thread. Strong reviews block loops but can also stall when the model cannot produce the required proof. Models and providers vary widely in stability, timeout, image support, and response format." },
    { kind: "p", text: "These limits do not invalidate the approach. They indicate where the next generation of the harness needs to improve: less real action, more local simulation, better deterministic checks, and a cleaner representation of the world model." },
    { kind: "p", text: "One concrete direction is to make the world model checker less permissive. At times, a claim received `supported` status because the cited events had `local_change`, even when the terminal part of the claim was still unproven. The distinction should be more granular:" },
    { kind: "code", text: `{
  "mechanic_status": "verified",
  "terminal_status": "unverified",
  "safe_for": ["local_simulation"],
  "unsafe_for": ["solve_policy"]
}` },
    { kind: "p", text: "This separation is probably necessary to prevent a good local mechanic from turning into a false solution." },
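    { kind: "p", text: "A sketch of a checker that would produce that record. It assumes each cited event carries `effect` and `level_delta` fields (the exact event schema is an assumption), and `check_claim` is a hypothetical name." },
    { kind: "code", text: `def check_claim(cited_events):
    """Grade the mechanic and the terminal part of a claim separately."""
    mechanic_ok = bool(cited_events) and all(
        e["effect"] == "local_change" for e in cited_events
    )
    # Terminal evidence requires at least one event that advanced the level.
    terminal_ok = any(e["level_delta"] > 0 for e in cited_events)

    safe_for, unsafe_for = [], []
    (safe_for if mechanic_ok else unsafe_for).append("local_simulation")
    # A solve policy needs both a verified mechanic and terminal evidence.
    (safe_for if mechanic_ok and terminal_ok else unsafe_for).append("solve_policy")

    return {
        "mechanic_status": "verified" if mechanic_ok else "unverified",
        "terminal_status": "verified" if terminal_ok else "unverified",
        "safe_for": safe_for,
        "unsafe_for": unsafe_for,
    }

events = [
    {"ref": "event_log:35", "effect": "local_change", "level_delta": 0},
    {"ref": "event_log:38", "effect": "local_change", "level_delta": 0},
]
result = check_claim(events)
print(result)` },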

    { kind: "h2", text: "Why this matters for new agents" },
    { kind: "p", text: "Many current agents are evaluated on tasks where the main action is producing text or code. ARC-AGI-3 forces a different regime: the agent has to discover a world, act in it, observe consequences, and revise its theory." },
    { kind: "p", text: "In that context, 'world model' does not have to mean a large model trained to predict video or state. It can mean an operational test-time layer: small, local, verifiable, and sufficient for planning." },
    { kind: "p", text: "My intuition is that this form of governed world-modeling will be increasingly important for interactive agents: in unknown games, in graphical interfaces, in web navigation, in system debugging, in automation where each action has cost, and in environments where feedback is partial and ambiguous." },
    { kind: "p", text: "The agent I want is not just a model that 'understands' the task. It is a system that knows how to build a temporary theory of the world, act with restraint, discard bad theories, and preserve enough evidence to keep going. For agents that have to discover worlds, the harness can be as important as the model." },
  ],
};

function getArticleBody(id) {
  return ARTICLE_BODIES[id] || [
    { kind: "p", text: "This piece is still in draft. Check back later, or send me a note and I'll share the working version." },
  ];
}

window.ARTICLE_BODIES = ARTICLE_BODIES;
window.getArticleBody = getArticleBody;
