Research timeline
How we got here — by date
From the first prototypes to the verified result, in order.
Mar 23–24 · prototype
Mirror tests — belief under pressure
The earliest probes pushed agents to revise a belief under pressure; instead they tended to conserve, defend, or synthesize it. The lesson shaped a lasting caution — identity-change is treated carefully and is not a headline claim. These are prototype probes and are excluded from any quantitative result.
Mar (late) · prototype
The drama engine
hypothesis broke
An attempt to author conflict and meaning into the system directly. The beautiful hypothesis, that compelling drama could be written in, broke: what felt alive was drama that grew out of characters meeting irreversible consequences. The project reset around that lesson: don't write the drama; build the conditions and let it grow.
Apr 1–7 · exploratory
Cross-model & skeptic probes
The same probes were run across several model families (Grok, Qwen, GPT-4o, Llama) plus skeptic and domain variants. This was the first sign that the same scenario produces recognizably different temperaments depending on the model. Exploratory only: these runs motivated the later metrics and are not measured rates.
Apr 20 – May · base data
Life Sim — repeated 20-tick lives
Agents living repeated 20-tick lives with persistent memory, private motivation, and irreversible consequences. This became the main quantitative source for the behavioral evaluation. Runtime metrics are treated as candidate claims and qualified by post-hoc audits — not as ground truth.
Jun 5–6 · base data
Simulation Room behavioral battery
Larger runs across scenarios that, for the first time, measure epistemic actions directly — requesting a source, verifying one, correcting the record — alongside memory grounding and relationship shifts. This is the data behind the memory / epistemic-agency result.
Jun 8–11 · controlled setup
Razlom + cross-model
The Razlom scenario and cross-model passes set up the controlled comparison that the verified result rests on: same model, same scene, same length — public record alive in both arms — differing only in whether the agent also has a private (subjective) channel.
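The single-variable design described above can be sketched as two frozen configurations that differ in exactly one field. This is an illustrative sketch only, not the project's code: `ArmConfig`, `make_arms`, `differing_fields`, and the placeholder values for model, scenario, and length are all assumptions introduced here.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ArmConfig:
    """Hypothetical description of one arm of the comparison."""
    model: str
    scenario: str
    ticks_per_life: int
    public_record: bool    # the public record is alive in both arms
    private_channel: bool  # the only field allowed to differ


def make_arms(model: str, scenario: str, ticks_per_life: int):
    """Build a control arm, then derive the treatment arm by flipping one flag."""
    control = ArmConfig(model, scenario, ticks_per_life,
                        public_record=True, private_channel=False)
    treatment = replace(control, private_channel=True)
    return control, treatment


def differing_fields(a: ArmConfig, b: ArmConfig) -> list:
    """Return the names of every field whose value differs between two arms."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Placeholder values, not the project's actual settings.
control, treatment = make_arms("example-model", "razlom", 20)
assert differing_fields(control, treatment) == ["private_channel"]
```

Deriving the treatment arm from the control arm, instead of building the two independently, is one way to make it hard for a second difference to slip in unnoticed.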
2026 · VERIFIED (behavioral)
Cross-model evaluation — seven model families
Across seven full model families and six providers, memory-grounded action ratios hold at or near 1.0, narrative stability holds at 1.0, memory divergence is consistently non-zero, and early fear-confirmation loops are measurable. Caveats travel with the numbers: runtime metrics are candidate claims, audit-qualified; the system is closed, so reproducibility is partial and behavioral, not code-level; and the explicit non-claims hold: no consciousness, no autonomous inner life, no proven identity transformation.
Jun 15 · VERIFIED (reproducible)
Memory = epistemic agency
The controlled Razlom kernel battery was run on deepseek-v4-flash with 50 lives per condition. With a subjective channel the agent contests the record; without it, it never corrects the record: correct_record is 9.56 vs 0.00 per life, and epistemic-contest is ≈31 vs ≈0.4 (~75–80×), while rescue sits at ceiling in both arms. So the effect is in epistemic posture, not survival. The result is reproducible, not an n=1 observation. The same battery requalified the earlier 0,1,1,1,3 escalation chain as not-yet-reproduced. See the finding for the full data and honest remainder.
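The fold figure quoted above can be checked from the rounded per-life means. The sketch below is a worked calculation only: `fold_change` is a hypothetical helper, and its inputs are the approximate values from this entry, so the result is approximate too. When the baseline arm is zero, as for correct_record, a ratio is undefined, and the absolute difference is the more meaningful figure.

```python
def fold_change(with_channel: float, without_channel: float):
    """Ratio of the two arms, or None when the baseline arm is zero."""
    if without_channel == 0:
        return None  # a fold change against zero is undefined
    return with_channel / without_channel


# Rounded per-life means quoted in this entry (approximate inputs).
contest_fold = fold_change(31.0, 0.4)        # about 77.5, inside the ~75-80x range
correct_record_fold = fold_change(9.56, 0.0)  # None: the baseline arm is zero
correct_record_diff = 9.56 - 0.00             # absolute difference per life

print(f"epistemic-contest fold: {contest_fold:.1f}x")
print(f"correct_record difference: {correct_record_diff:.2f} per life")
```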
Prototype and exploratory entries are shown as lineage — what they taught — not as measured results. Only the two entries marked VERIFIED carry numbers, each traveling with its caveats.