
As the race to scale up AI—by increasing computing power, expanding available data, and enlarging the deep neural networks that underpin most modern systems—continues apace, a natural concern arises. As AI systems become more powerful, they also seem ever more mysterious. How can we trust systems that appear to be impenetrable black boxes, whose inner workings we only dimly understand?

One response is to try to “open the box”: to identify the representations distributed across billions of weights and activations inside a neural network and to map out what, exactly, is being computed. This is the active research programme of mechanistic interpretability—and it has produced genuine insights. But in general, it may prove impractical on its own. The function of a modern deep neural network is not localized in a neat, human-readable module. It is distributed across vast stretches of the network. Complex interactions among parameters allow the system to combine patterns and exceptions, flexibly guided by context. That distributed structure is precisely what makes these systems so powerful—and so resistant to straightforward inspection. But there is another approach.

At first glance, the opacity of AI systems seems deeply worrying. Yet on reflection, this is nothing new. It is an all-too-familiar situation we face when interacting with other human beings. Each of us carries around a brain of enormous complexity—around 100 billion neurons and on the order of 100 trillion synaptic connections—about which we have essentially no direct insight. We cannot “inspect” one another’s neural circuitry. Indeed, we are no better at “introspecting” our own. And yet we manage to interact successfully enough—we explain each other’s behaviour, predict responses, justify our actions, and coordinate socially well enough to get along and to sustain complex societies.

How do we do this? Not by opening the biological black box, but by engaging at the level of interaction. We ask for reasons. We test responses. We probe with counterfactuals. We request clarification. We build trust through repeated exchanges. In short, explanation is something that happens between agents, not something extracted directly from neural tissue.

The parallel with human interaction is instructive, but it has real limits that deserve acknowledgement. We trust other humans in part because we share a common cognitive architecture, the same evolutionary history, and the same repertoire of embodied experience. As such, we have reasonable priors that other humans have goals, beliefs, and experiences roughly like our own. When a person offers a justification for their behaviour, we can reasonably infer that it reflects, however imperfectly, something about the reasoning processes that actually produced that behaviour. With current AI systems, that inference is far less secure. When a large language model produces a justification, it may be generating text that sounds like a faithful account of its reasoning but bears no reliable relationship to the computational process that generated the original output.

This may appear to be a crucial disanalogy. But decades of research in psychology suggest that human and AI “introspection” may not be so different after all. Humans, it turns out, are prolific confabulators. We routinely generate post-hoc justifications for behaviours driven by processes we have no conscious access to, and we do so with full confidence that we are reporting accurately. This is not a marginal phenomenon. Confabulation is pervasive in moral judgment, choice behaviour, and everyday social explanation. The justifications people offer for their actions are frequently plausible-sounding narratives constructed after the fact, not faithful readouts of the cognitive processes that produced those actions.

What makes human interaction work despite pervasive confabulation is not that we take people’s justifications at face value. It is that we test them: we probe, challenge, cross-examine, and check for consistency over time. Interactive explainability applies these same insights to AI. Instead of demanding full transparency of internal mechanisms, we treat the AI system as a partner in dialogue. We evaluate it through:

  • Counterfactual probing (“What would you say if this assumption changed?”)
  • Requests for justification of words and actions
  • Consistency checks across related responses
  • Coherence across extended interaction, not just single queries
  • Cross-examination (“But yesterday you said the opposite”; “If you believe X, how can you now be telling me Y?”)

In this view, understanding is not a static property of a model’s architecture. It is an emergent property of structured interaction.
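To make this concrete, here is a minimal sketch, in Python, of what such interactive probing might look like in code. It is illustrative only: the helper names (`counterfactual_probe`, `consistency_check`, `cross_examine`) and the `QueryFn` interface are our assumptions, not part of any existing library, and `ask` and `judge` stand in for whatever chat-completion call a given stack provides.

```python
from typing import Callable, List, Tuple

# QueryFn stands in for whatever chat-completion call your own stack provides
# (an OpenAI client, a local model, etc.): it takes a prompt string and
# returns the model's reply as text.
QueryFn = Callable[[str], str]


def counterfactual_probe(question: str, assumption: str, alternative: str,
                         ask: QueryFn) -> Tuple[str, str]:
    """Ask the same question under the stated assumption and under a counterfactual one."""
    baseline = ask(f"Assume that {assumption}. {question}")
    counterfactual = ask(f"Assume instead that {alternative}. {question}")
    return baseline, counterfactual


def consistency_check(paraphrases: List[str], ask: QueryFn, judge: QueryFn) -> bool:
    """Pose paraphrased versions of one question and ask a judge whether the answers agree."""
    answers = [ask(p) for p in paraphrases]
    verdict = judge(
        "Do the following answers make the same substantive claim? Reply YES or NO.\n\n"
        + "\n---\n".join(answers)
    )
    return verdict.strip().upper().startswith("YES")


def cross_examine(earlier_claim: str, later_claim: str, ask: QueryFn) -> str:
    """Confront the system with an apparent contradiction and record how it responds."""
    return ask(
        f'Earlier you said: "{earlier_claim}". Now you say: "{later_claim}". '
        "Are these consistent? If not, which do you stand by, and why?"
    )
```

Run repeatedly and logged over time, probes like these become part of the ongoing interaction with a deployed system rather than a one-off audit.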

A note of caution, though. It is entirely possible that a sufficiently capable system (whether human or AI) may pass all these checks while still being unreliable in ways that matter. A system that is coherent and consistent in dialogue is not necessarily trustworthy—it may be actively attempting to deceive us. Indeed, AI-safety researchers have long interrogated AIs using methods such as adversarial “red teaming,” in which testers actively attempt to lure the AI into forbidden or inappropriate behaviours. What the interactive explainability framing adds is the recognition that this kind of structured probing should not be confined to pre-deployment safety testing. Rather, it should be a continuous feature of how we engage with AI systems in use.

This does not mean that internal analysis of AI systems is without value. Indeed, if the motivating concern is the safety of increasingly powerful systems, then insight into internal mechanisms still matters a great deal for safety, debugging, and scientific understanding. But for practical trust, what matters most may be something more familiar: whether the system behaves coherently across contexts, responds to challenges, corrects itself when prompted, and integrates feedback over time.

Human social life already runs on this principle. We trust people not because we understand their neurons, but because they can justify themselves, respond to criticism, and maintain consistency across situations. If AI systems can do something analogous—and if we can build infrastructure that better ensures that they do—then black-box opacity may not be as severe a problem as it first appears.

The core shift is this: explanation need not require the unveiling of hidden circuitry. It is the achievement of mutual intelligibility through interaction, the approach that has sustained human social coordination for as long as humans have existed.

Written by: Simon Myers and Nick Chater, Behavioural Science Group, Warwick Business School