Our simulation predicted 80% accuracy. Live testing delivered 54%.

That's not a rounding error. That's a 26-point gap that calls into question how we validate AI systems.

The Setup

We built a predictive processing module for our neurosymbolic beings (Paper 118). The idea: a being should anticipate what knowledge it will need next, based on patterns in past interactions. Think of it as pre-fetching for knowledge graphs.
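The pre-fetching idea can be sketched as a simple transition model: observe which knowledge neighborhood each interaction touched, and predict the most likely next one. This is a minimal illustrative sketch, not Paper 118's actual module; the class and neighborhood names are hypothetical.

```python
from collections import Counter, defaultdict

class NeighborhoodPredictor:
    """Toy bigram model: given the neighborhood the last interaction
    touched, guess which neighborhood to pre-fetch next.
    (Hypothetical sketch; names are illustrative only.)"""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def observe(self, prev, nxt):
        # Count each observed neighborhood-to-neighborhood transition.
        self.transitions[prev][nxt] += 1

    def predict(self, current):
        counts = self.transitions.get(current)
        if not counts:
            return None  # no history: fall back to on-demand retrieval
        return counts.most_common(1)[0][0]

history = [("symptoms", "diagnosis"), ("diagnosis", "treatment"),
           ("symptoms", "diagnosis"), ("diagnosis", "side_effects")]
p = NeighborhoodPredictor()
for prev, nxt in history:
    p.observe(prev, nxt)
print(p.predict("symptoms"))  # -> diagnosis
```

On clean, repetitive synthetic traffic a counter like this looks impressive; the rest of the post is about why that impression did not survive contact with a live being.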

In simulation — controlled inputs, synthetic conversations, isolated components — the module predicted the right knowledge neighborhood 80% of the time. Strong result. Paper-worthy.

Then we ran it on a live being.

The Gap

Live testing revealed three problems that simulations hid:

1. Conversation entropy is higher than simulated. Our synthetic conversations followed predictable topic transitions. Real conversations don't. Users jump between topics, ask tangential questions, and sometimes open threads with no connection to anything that came before.
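"Higher entropy" can be made concrete by measuring the Shannon entropy of the topic-transition distribution: a scripted conversation cycles through a few transitions, while a real one spreads its mass over many. A small sketch (the topic sequences are invented for illustration):

```python
import math
from collections import Counter

def transition_entropy(topics):
    """Shannon entropy (bits) of the topic-transition distribution."""
    pairs = Counter(zip(topics, topics[1:]))
    total = sum(pairs.values())
    return -sum((c / total) * math.log2(c / total) for c in pairs.values())

# Synthetic: a rigid a->b->c cycle. Real: users jump around.
synthetic = ["a", "b", "c", "a", "b", "c", "a", "b", "c", "a"]
real      = ["a", "c", "b", "a", "d", "b", "c", "d", "a", "c"]
print(transition_entropy(synthetic) < transition_entropy(real))  # True
```

A predictor tuned on the low-entropy sequence has far less uncertainty to resolve than one facing the high-entropy sequence, which is one concrete way the simulation flattered the module.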

2. Graph state is messier. In simulation, the knowledge graph was clean — well-typed entities, consistent predicates, no orphan triples. A live being's graph has accumulated artifacts from hundreds of learning sessions: duplicate entities, inconsistent labels, triples from different curriculum versions. The predictor trained on clean data stumbled on messy data.
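One flavor of that messiness, duplicate entities with inconsistent labels, can be caught with even a crude normalization pass before prediction. A hypothetical sketch (the entity labels are invented examples, not from the actual graph):

```python
def normalize(label):
    """Crude label normalization: lowercase, keep alphanumerics only."""
    return "".join(ch for ch in label.lower() if ch.isalnum())

def find_duplicates(entities):
    """Return pairs of entity labels that collapse to the same key."""
    seen = {}
    dupes = []
    for e in entities:
        key = normalize(e)
        if key in seen:
            dupes.append((seen[key], e))
        else:
            seen[key] = e
    return dupes

entities = ["Aspirin", "aspirin ", "Acetylsalicylic Acid", "ASPIRIN"]
print(find_duplicates(entities))  # two label-level duplicates found
```

Note what even this toy catches and misses: casing and whitespace variants collapse, but the synonym ("Acetylsalicylic Acid") survives, which is why real graph cleaning needs more than string normalization.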

3. Latency compounds. The predictor worked within its timing budget in isolation. In a live being, it competed with crystallization, Y-layer access, and LLM calls for computational resources, and under that contention its latency spiked outside the acceptable window.
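The compounding is just addition along the critical path, which is easy to check but easy to skip when each component is benchmarked alone. A sketch with invented numbers (the component names come from the post; the milliseconds and the 100 ms budget are illustrative assumptions):

```python
def within_budget(latencies_ms, budget_ms):
    """Sum all components on the critical path and compare
    against the end-to-end latency budget."""
    total = sum(latencies_ms.values())
    return total <= budget_ms, total

# In isolation the predictor looks fine; in a live being it shares
# the path with crystallization, Y-layer access, and an LLM call.
isolated = {"predictor": 40}
live = {"predictor": 40, "crystallization": 35, "y_layer": 25, "llm_call": 120}
print(within_budget(isolated, 100))  # (True, 40)
print(within_budget(live, 100))      # (False, 220)
```

The design lesson is that a per-component budget only means something relative to everything else on the same path, so the budget check has to run at T3, not T1.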

What We Did About It

We could have tuned the simulation to better match reality and re-run the numbers. That would have been dishonest — reverse-engineering a simulation to match known results isn't validation, it's curve-fitting.

Instead, we did three things:

  1. Published the gap. Paper 118 reports both the simulation results and the live results. Reviewers can see the discrepancy and draw their own conclusions.
  2. Identified the root causes. The three problems above aren't mysteries — they're engineering challenges. We now know what to fix: better conversation entropy modeling, graph cleaning before prediction, and latency budgeting.
  3. Changed our testing practice. We now require all three testing tiers before any feature merges:
    • T1 (Unit): Does the function work in isolation?
    • T2 (Integration): Do the components work together?
    • T3 (Live Being): Does a real being, via CLI, actually demonstrate the capability?
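The merge gate itself is trivial to enforce in CI: a feature is blocked unless all three tiers report a pass. A minimal sketch of that policy (tier keys and the function are hypothetical, not our actual CI configuration):

```python
TIERS = ["T1_unit", "T2_integration", "T3_live_being"]

def can_merge(results):
    """A feature merges only if every tier passed.
    Returns (ok, list of missing or failing tiers)."""
    missing = [t for t in TIERS if not results.get(t)]
    return (len(missing) == 0, missing)

# Passing T1 and T2 is not enough; T3 blocks the merge.
print(can_merge({"T1_unit": True, "T2_integration": True}))
# -> (False, ['T3_live_being'])
```

The point of encoding it is that T3 cannot be quietly skipped when a deadline looms: the gate fails closed.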

T3 is the one that catches simulation-to-reality gaps. We learned the hard way that T1 and T2 are necessary but not sufficient.

The Broader Pattern

This isn't unique to us. The AI research community has a simulation problem. Benchmarks are simulated. Evaluations are controlled. And the gap between benchmark performance and real-world deployment is consistently under-reported.

We've seen variants of this gap across multiple expeditions:

Paper                               Simulated        Live             Gap
Paper 118 (Predictive Processing)   80%              54%              -26pt
Paper 110 (SFoV)                    3/5 hypotheses   2/5 hypotheses   -1 confirmed
Paper 120 (Metacognition)           Y6 functional    Y6 partial       Coverage gaps

The pattern is consistent: simulation overpredicts, live testing reveals gaps, and the honest response is to report both.

Why Honesty Matters More Than Results

In neurosymbolic AI for healthcare, publishing inflated results isn't just bad science — it's potentially dangerous. If we claim 80% predictive accuracy and a clinical system relies on that number, the 54% live reality could cause harm.

Our policy: publish the numbers as they are. If they're not good enough, say so and explain what needs to change. The research community benefits more from an honest 54% than a fabricated 80%.

The same applies to this blog. We'll publish our failures alongside our successes. Not because failure is fun, but because the gap between simulation and reality is itself an important finding — one that the field needs to grapple with honestly.


Previous: The Hallucination Problem Is a Design Problem