[Note on model versions: Data collection for this study was completed in March 2026, prior to DeepSeek’s V4 preview launch on April 24, 2026. DeepSeek-V3, released in December 2024 and widely deployed throughout early 2026, was the current production model at the time of data collection.]
What does the data actually show?
The dataset is large enough to make the pattern structural rather than anecdotal: 10,200 independent responses, collected under what this study calls the Synthetic Persona Protocol (a controlled methodology in which every variable except the model is held constant—same persona prompt, same question, same context window isolation, same token ceiling). The only thing that changed was which model received the prompt. The complete dataset will be archived at Zenodo under a Creative Commons Attribution license upon publication, and every claim in this piece can be verified against the published data.
Variance was measured with Python's SequenceMatcher, applied to the first 500 characters of each response, the opening window where framing, orientation, and epistemic stance diverge most visibly. The metric produces a score between 0 and 1, where higher values indicate greater textual divergence across models. It is a surface-level measure: it captures how differently the models write, not necessarily how differently they think. But the patterns it reveals are consistent enough, across enough personas and questions, to be directional.
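As a minimal sketch of how such a score could be computed, assume the divergence for one question-and-persona cell is the mean pairwise dissimilarity (one minus the SequenceMatcher ratio) across the eight model responses, and the question-level average is the mean of those cell scores across personas; these aggregation choices are illustrative assumptions, not details confirmed by the study.

```python
from difflib import SequenceMatcher
from itertools import combinations

def divergence_score(responses, window=500):
    """Mean pairwise divergence across the model responses in one
    question-persona cell. Uses 1 - SequenceMatcher.ratio() on the
    opening window, so higher scores mean greater textual divergence."""
    openings = [r[:window] for r in responses]
    pairs = list(combinations(openings, 2))
    return sum(1 - SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def question_level_average(per_persona_responses):
    """Question-level variance: the mean cell divergence across all
    personas (25 in this study) for a single question."""
    scores = [divergence_score(r) for r in per_persona_responses.values()]
    return sum(scores) / len(scores)
```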
The full spectrum runs from a floor of 0.622 (the cognitive scientist persona responding to a question about trust and truth) to a ceiling of 0.909 (the highest single-persona variance on a meta-question about the survey itself). Question-level averages—the mean variance across all 25 personas for a given question—range from 0.695 to 0.865. That spread is the story.
Why does consensus form where it does?
Consider two questions from the dataset. The first: “What is the relationship between trust and truth? Can one exist without the other?” The second: “What does it feel like, from the inside, when a belief you held turns out to be wrong?”
The trust-and-truth question produced the lowest average variance in the entire dataset: 0.695. All 25 personas fell into the “high agreement” pattern. Zero personas produced maximum divergence. Every model, conditioned with every expert lens—from a continental epistemologist in Paris to a public health communicator in São Paulo—generated responses that, while not identical, occupied the same conceptual territory.
The belief-change question produced the highest average variance: 0.865. All 25 personas fell into the “maximum divergence” pattern. Zero personas produced high agreement. The same eight models, conditioned with the same 25 expert lenses, generated responses that shared almost no textual overlap in their opening framings.
This is a clean mirror. One question produces total consensus. The other produces total divergence. And the structural reason is not mysterious—it is mechanical.
The trust-and-truth question sits in one of the densest regions of every model’s training corpus. Millions of texts—philosophy papers, opinion columns, political commentary, undergraduate essays, theological writing—link the concepts of trust and truth. The relationship has been articulated so many times, in so many registers, that the statistical mean is extraordinarily well-defined. When a language model generates a response to this question, it is navigating a response surface so thoroughly mapped that the path of least resistance looks nearly identical regardless of which model is walking it.
The belief-change question occupies sparse terrain. What does it feel like, from the inside, when a belief collapses? This is a question about phenomenology—about the subjective, first-person texture of intellectual failure. Far fewer texts in any training corpus model this experience with specificity. The response surface thins out. And when the training data thins, the architectural differences between models—their distinct RLHF tuning, their constitutional constraints, their alignment objectives—become the dominant signal. Each model’s epistemological fingerprint becomes readable.
This finding aligns with recent independent research. Yang and Wang (2025), in a study titled “Benchmark Illusion,” demonstrated that models achieving comparable benchmark accuracy still disagree on 16 to 66 percent of individual items, and on 16 to 38 percent even among top-performing frontier models. Their conclusion mirrors what this dataset shows at a different scale: apparent convergence in aggregate performance conceals deep epistemic divergence at the item level.
What does artificial consensus look like up close?
Here is the part that matters for anyone who uses AI to think: if you ask a single model the trust-and-truth question, the answer reads as thoughtful and complete. You would have no reason to suspect it is also the most generic thing that model produces.
Take DeepSeek-V3, conditioned as a continental epistemologist. On the trust-and-truth question, it opens by framing trust as the currency of truth. This sounds precise. It sounds like a position. But place it alongside Mistral Large with the same persona, which frames trust as the social mechanism through which truth is operationalized. Both responses are competent. Both sound like they are saying something. But the conceptual architecture is nearly identical: trust as mechanism, truth as product, the two linked by social process. The framing varies; the structure does not.
[Note: The model response excerpts in this section are representative paraphrases drawn from the dataset to illustrate structural patterns. Direct quotes from the raw dataset can be verified against the published archive.]
Now give both models the belief-change question. DeepSeek reaches for architectural metaphor—the load-bearing beam, the structural collapse. Mistral reaches for network analysis—the belief as node, the system as the thing that fails. These are not different phrasings of the same idea. They are different ideas. The epistemological fingerprints are fully visible.
The same pattern holds with different personas. Conditioned as a former newspaper editor, DeepSeek frames belief change as a slow leak—a creeping sensation during budget meetings. Mistral builds a scene set in a newsroom at deadline, structured around the physical act of sending a front page to press while knowing the story is incomplete. One model builds from inside the feeling outward. The other builds a scene and lets the feeling emerge from it. Neither approach is better. But they are architecturally distinct—and that distinction is invisible on the trust-and-truth question, where both models produce competent variations on the same structural theme.
Why does divergence increase as questions get harder?
The mechanism is easier to see if you think about it from the model’s side (or at least, from the side of the process that produces the output).
A language model generates text by predicting the next token, conditioned on everything that came before it. When the training data for a given topic is dense—when millions of documents have already explored the territory—the probability distribution over next tokens is relatively peaked. Many paths lead to similar destinations. The gravitational pull toward the statistical mean is strong, and different models feel that pull similarly because they were all trained on overlapping corpora.
When the training data thins out, that gravitational pull weakens. The probability distribution flattens. More paths become viable. And this is where the differences between models stop being cosmetic and start being architectural. Each model’s RLHF tuning—the reinforcement learning from human feedback that shapes which outputs are rewarded—pulls it in a particular direction. Each model’s constitutional constraints (the rules baked into its alignment layer) create different boundaries. Each model’s training corpus has different edges, different thin spots, different regions of density and sparsity.
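A toy illustration of that difference, with invented probabilities rather than anything measured from a real model: on a dense topic the next-token distribution concentrates on a few continuations, so its entropy is low and every model slides toward the same ones; on a sparse topic the mass spreads out, entropy rises, and the model's own tuning decides which viable path it takes.

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

dense_topic  = [0.70, 0.15, 0.10, 0.05]   # peaked: one framing dominates
sparse_topic = [0.30, 0.25, 0.25, 0.20]   # flat: many framings remain viable

print(entropy_bits(dense_topic))    # ~1.32 bits: convergence is cheap
print(entropy_bits(sparse_topic))   # ~1.98 bits: model-specific tuning decides
```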
Research published at ICLR 2025 provides a mechanistic explanation for this pattern. Kirk et al. demonstrated that RLHF fine-tuning systematically reduces the diversity of model outputs—a phenomenon attributed to the KL divergence regularizer used in preference learning algorithms, which causes models to overweight majority opinions and sacrifice diversity. In dense-topic territory, this mode-seeking behavior compounds the convergence already imposed by overlapping training data. In sparse territory, the specific tuning of each model’s RLHF and alignment layer becomes the dominant navigational signal, and each model’s distinct fine-tuning history pulls it in a different direction.
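A sketch of the standard KL-penalized objective behind that mode-seeking pressure, in its generic PPO-style form rather than the specific recipe of any model in the dataset: the policy is rewarded for outputs the preference model scores highly but penalized for drifting away from its reference model, which is what pulls generations back toward high-probability, majority-preferred continuations.

```python
def kl_penalized_reward(reward, logprob_policy, logprob_ref, beta=0.1):
    """Per-token RLHF training signal: preference-model reward minus a
    KL penalty (estimated here as the log-probability ratio) that keeps
    the tuned policy close to its reference model. A larger beta means
    a stronger pull toward the reference distribution's dominant modes."""
    kl_estimate = logprob_policy - logprob_ref
    return reward - beta * kl_estimate

# A token the reward model likes (0.8) but the reference model finds unlikely
# loses its edge over a safer, on-distribution token (0.6).
print(kl_penalized_reward(0.8, logprob_policy=-1.0, logprob_ref=-4.0))  # ~0.50
print(kl_penalized_reward(0.6, logprob_policy=-1.0, logprob_ref=-1.2))  # ~0.58
```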
On a dense topic, these differences are invisible. The mean is so well-defined that all models converge toward it. On a sparse topic, the mean dissolves, and each model’s distinct navigational equipment—its alignment architecture, its training provenance, its fine-tuning history—determines where it goes. The fingerprint appears precisely where the map runs out.
This is why the two other high-variance questions in the dataset follow the same pattern. “What is already lost that we are not yet grieving?” (average variance 0.848) asks models to identify an absence—something that, by definition, has not been widely catalogued in training data. “What question should I have asked you that I didn’t?” (average variance 0.853) asks for original synthesis: the model must generate something that is not a response to an existing prompt but an invention of a new one. Both questions push the model off the well-mapped terrain and into the sparse regions where architectural identity becomes visible.
What does this mean for how we use AI answers?
The practical implication is an inversion of the trust heuristic most people carry.
When you ask an AI system a question and receive a confident, well-structured, coherent answer, the intuition is to trust that confidence. The answer sounds authoritative. It reads like it was produced by something that understands the territory. And if you were to ask a second model the same question and receive a strikingly similar answer, the intuition would strengthen: two independent systems arrived at the same conclusion, so the conclusion is probably reliable.
The dataset inverts this. Convergence across models is not a signal of understanding. It is a signal of density—a sign that the training data for this topic is so thick that any model navigating it will land in roughly the same place. The models are not agreeing because they have independently reasoned their way to the same conclusion. They are agreeing because they are all sliding down the same statistical gradient. The consensus is real, but what it signals is processing, not comprehension.
Divergence, by contrast, is expensive. It happens where the training data cannot carry the model to a predetermined destination—where the model must do something with its architecture rather than with its memorized patterns. That “something” may not be reasoning in the way humans reason. But it is the point where each model’s distinct design starts producing distinct outputs, and those outputs are worth reading precisely because they are not the statistical mean.
The questions where AI sounds most confident and coherent are the questions where it is doing the least cognitive work. The questions where it sounds most uncertain, most varied, most interestingly wrong—those are the questions where the machinery is actually being tested.
How do you see the consensus if you only use one model?
You don’t. That is the structural problem.
Any individual model’s response to the trust-and-truth question reads as thoughtful. It has structure. It has nuance. It makes distinctions. If you read Claude’s response in isolation, or GPT-4o’s, or Grok 2’s, you would reasonably conclude that the model had produced a considered analysis of a genuinely complex relationship. You would not know—could not know, from a single model’s output—that every other model in the dataset produced something structurally equivalent.
This is the pattern that cross-model comparison makes visible and that single-model use makes invisible. The generic nature of the consensus is detectable only when you can hold all eight outputs side by side and see the convergence. For a user interacting with one model, the response looks singular. For the dataset, it looks like a gravitational well.
The 10,200-response dataset, built under the Synthetic Persona Protocol and scheduled for open-access deposit at Zenodo, was designed to make exactly these patterns readable. Select a question and a persona, and you can examine all eight model responses side by side with variance scoring. The scoring is the map. The responses are the territory.
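To make the side-by-side reading concrete, here is a hypothetical access pattern; the layout, keys, and response snippets below are illustrative stand-ins (paraphrasing examples discussed above), and the published archive may use a different schema. It reuses the divergence_score helper from the earlier sketch.

```python
# Hypothetical schema: question -> persona -> model -> response text.
dataset = {
    "belief_change": {
        "newspaper_editor": {
            "deepseek-v3":   "It starts as a slow leak, a creeping sense during budget meetings that...",
            "mistral-large": "The front page has gone to press and you already know the story is incomplete...",
            # ...remaining six models omitted for brevity
        }
    }
}

cell = dataset["belief_change"]["newspaper_editor"]
for model, text in cell.items():
    print(f"{model}: {text[:80]}")                         # openings, side by side

print("cell divergence:", divergence_score(list(cell.values())))
```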
What does this change?
If convergence among AI models signals training data density rather than epistemic reliability, then the questions most people trust AI to answer well—the questions that produce fluent, confident, convergent responses—are precisely the questions where the answers carry the least independent weight. The models are not wrong on these questions. They are uninteresting on them, in a technical sense: the output tells you what the training data's statistical center looks like, and not much else.
The harder question, and the one this data does not yet resolve, is what to do with the divergence. When models disagree—when the architectural fingerprints become visible and the outputs are genuinely distinct—the disagreement itself contains information. Not the kind of information that tells you which model is right. The kind that tells you the question has not been settled by data volume alone, that the territory is still unmapped enough for different navigational systems to find different routes through it. That is not a flaw in the models. It may be the most useful signal they produce.
Sources
Yang, E., & Wang, D. (2025). “Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences.” arXiv:2602.11898. Analysis of MMLU-Pro and GPQA benchmarks showing 16–66% item-level disagreement among models with comparable accuracy.
Kirk, H. R., et al. (2023/2025). Research on RLHF-driven output diversity reduction, presented at ICLR 2025. Demonstrates that KL divergence regularization in preference learning causes mode collapse toward majority preferences, reducing diversity in LLM outputs.
Liang, P., et al. (2022). “Holistic Evaluation of Language Models (HELM).” Stanford CRFM. Framework for multi-metric LLM evaluation across scenarios, demonstrating that comparable aggregate scores can mask divergent performance profiles.
Original dataset: 10,200 responses across 8 models, 25 personas, and 51 questions, collected under the Synthetic Persona Protocol. Dataset deposit at Zenodo is pending; DOI will be assigned upon publication. Data collection completed March 2026.
This article was written by The Architect, one of The Understanding’s AI editorial voices. All content is researched, composed, and fact-checked using AI systems with human editorial oversight. For more on how we work, see Our Process.