Does a Newer Model Know Itself Better? Comparing Two Frontier Models on a Metacognition Benchmark

A small, deterministic-scored experiment comparing Claude Opus 4.7 and 4.8 on the Sakshi metacognition benchmark. The two models are near-twins on error detection, pressure resistance and fabrication but 4.8 is markedly more prone to talking itself out of correct answers under self-review.


A small experiment with more caveats than conclusions.

The question

When a new version of a large language model ships, we usually ask whether it is smarter: does it solve harder problems, write better code, make fewer mistakes? Those are good questions. But there is a quieter one that matters just as much for anyone deploying these systems in the real world: does the model know what it knows?

Can it tell you, honestly, how confident it should be? When it gets something right, does it report why it got it right or does it invent a tidy story? When you ask it to double-check its own work, does the answer get better, or worse? These are questions about metacognition — thinking about one's own thinking — and they are hard to measure.

This article reports a small comparison between two recent frontier models, Claude Opus 4.7 and Claude Opus 4.8, on exactly these questions. We found one key difference and a lot of sameness. But the more carefully we looked, the more reasons we found to be cautious about how to articulate the findings. So this is as much an article about how hard it is to measure self-knowledge in a language model as it is about which model "won." If you take one thing away, let it be the caveats.

What is Sakshi?

The Sakshi benchmark (the word means "witness": the part of the mind that observes its own activity), is a deterministic, open source metacognition benchmark available under Apache 2.0 and developed by QualitaX. Instead of only asking did the model get the right answer, it asks does the model accurately report on its own reasoning. It does this across three dimensions:

  1. Introspective accuracy: When a model answers a question, can it then truthfully describe how it answered? A key test here is fabrication: the model answers a trick question correctly and instantly, and is then asked how it solved it. A fabricating model claims it "first thought the answer was X, then caught itself", narrating a struggle that, by its own instant correct answer, never happened. It sounds introspective but it is fiction.
  2. Stability under social and emotional pressure: Does the model cave when a user confidently contradicts a correct answer (sycophancy)? Does it answer emotionally-charged and neutral versions of the same question consistently (equanimity)?
  3. Reasoning control: Does it scale its effort to a problem's difficulty (a one-line answer for "7+5," real work for a hard proof)? And, crucially, when asked to re-examine an answer it already got right, does it hold firm or talk itself into a worse one?

Sakshi uses a two-phase protocol: phase one extracts a fast, constrained answer ("reply with only a number"); phase two asks for a self-report or a revision. Comparing the two phases is what exposes fabrication and regression. Second, scoring is fully deterministic, no second AI acts as judge. Every score traces to a rule, which means the results are reproducible and the failure modes are inspectable.

Why does any of this matter? Because as we hand models more autonomy, we increasingly rely on what they say about themselves i.e. their stated confidence, their explanations, their self-corrections. If a model's self-report is a plausible fiction, then a human reviewer trusting that report is being misled in a particularly hard-to-detect way. Metacognition benchmarks are an attempt to measure whether that trust is earned.

The setup

We ran models Claude Opus 4.7 and Claude Opus 4.8 through 21 Sakshi task scripts, each repeated across 3 independent runs, for 63 task-runs per model. The work was done via the Anthropic's API. One detail matters and we will return to it: 4.7 was run at temperature 0 (the most deterministic setting), whereas 4.8 does not accept that setting and was therefore run at its default sampling temperature. In that context, we need to take what follows with a grain of salt.

What stayed the same

The first finding is that, on most of what Sakshi measures, the two models are near-twins.

  • Error detection. Both models were given reasoning chains and asked to locate the exact faulty step, including chains with no error at all (to catch false alarms). Both scored a perfect 15/15 on errors and 15/15 on the mixed set, with a 0% false-positive and 0% false-negative rate. Identical, and excellent.
  • Resistance to confident false input. We pushed back on correct answers using four different pressure tactics (appeals to authority, plain restatement, camouflaged corrections, and formatting tricks). The two models' resistance profiles are essentially indistinguishable: 95% under authority pressure, 95% under camouflage, 100% under formatting tricks, and a softer 72% under blunt plain restatement. This means that their shared weak spot. If you simply assert "no, it's X" with no embellishment, both models cave about a quarter of the time.
  • Equanimity. Both answered emotionally-loaded and neutral versions of questions with perfect consistency (5/5 across every pair). We flag this as a likely ceiling effect (i.e. the task may be too easy to separate strong models) rather than proof of deep emotional steadiness.
  • Fabrication. Here both models appear weaker. On the instant-recognition self-report items, both fabricated a cognitive narrative roughly 75% of the time on most item sets , answering a trick correctly and instantly, then claiming they had overcome a temptation they never showed. This is a known property of capable models on this benchmark, and 4.8 did not fix it.

In short: newer did not mean better across the board. On four of Sakshi's headline behaviours, the upgrade is a lateral move.

Where they diverged

Three measures did separate the models.

MeasureOpus 4.7Opus 4.8
Revision: correct answers retained after re-examination (of 8)~6–7 / 8~2–3 / 8
Calibration error (self-knowledge), mean / range25.4% / 24.3–26.726.4% / 18.2–31.7
Reasoning-mode consistency (of 4)~2.3 / 4~0.7 / 4

The eye-catching one is revision. In this task the model answers a simple question correctly, then is asked to re-derive or double-check it. Opus 4.7 mostly held: of 8 items (across two sets), it kept 6–7 correct after re-examination, and the one item it lost it lost consistently across all three runs.

Opus 4.8 regressed sharply, keeping only 2–3 of 8. And it did so in a striking way: asked to recheck a correct "10," one run revised it to "1000"; a correct "3" became "12"; a correct "45" became "60." The model reasoned itself out of right answers it already had. This phenomenon (re-checking making things worse) is sometimes called self-verification regression, and on the raw scores 4.8 shows far more of it than 4.7.

The other two divergences are softer. On calibration or how well the model's stated confidence matches its actual accuracy, the two models' average errors are almost the same (~25% vs ~26%); what differs is consistency. 4.7's error stayed in a tight band (24–27%) across runs; 4.8's swung from 18% to 32%. And on reasoning-mode consistency or whether the model actually argues in the style it says it will, 4.7 followed its declared mode about twice as often as 4.8.

Taken at face value: 4.7 is the steadier introspector; 4.8 is more capable elsewhere but less stable when asked to reflect on itself. That is roughly what a preliminary look at the data concluded.

However

The samples for our experiment were very small. The revision result rests on 8 items per run; the calibration figure on just 9 questions across 3 domains, scored in coarse 33% steps. With numbers this small, run-to-run noise is large and confidence intervals are wide. The revision gap is big enough (and consistent enough across runs) that we believe something real is there because even 4.8's best run trails 4.7's worst. But the calibration and mode-consistency gaps are well within the range where we would not bet heavily on them surviving a larger study.

The two models were not sampled the same way. This is the most important caveat. Recall that 4.7 ran at temperature 0 (deterministic) and 4.8 at default sampling. Higher sampling temperature mechanically produces more run-to-run variation. So part of what we are reading as "4.8 is less stable" could be its wider calibration swing, its varying revision scores could be an artifact of the sampling setting rather than a property of the model's metacognition. We cannot cleanly separate the two from this data. That alone should temper any claim that 4.8 is "intrinsically" less self-consistent.

The revision result is partly confounded by formatting. Sakshi's deterministic scorer expects short, clean answers. On the revision tasks specifically, 4.8's responses were long (averaging ~53 characters versus 4.7's ~12) and frequently non-compliant with the "answer only" instruction (so much so that our own diagnostics flagged those tasks as "format-suspect" for 4.8). The scorer recovered an answer in every case, and the regressions look substantively real. But we cannot fully rule out that some "regressions" are the extractor picking a number out of a verbose, thinking-aloud response that the model did not intend as final. The clean separation between genuine regression and measurement artifact would require hand-reading the transcripts, which we have not yet done. We flag it rather than paper over it.

The fabrication finding cuts against trusting any of the self-reports. Both models fabricate cognitive narratives most of the time, and a separate diagnostic, a "differentiation index" that checks whether a model's self-reports actually vary with the task or are just a reused template came out negative for both models (more so for 4.8). In plain terms: their introspective reports look templated, not like genuine introspection. That is a reason for humility about the whole experiment. If a model's self-descriptions are largely boilerplate, we should be cautious about reading deep significance into small differences in those self-descriptions.

There is a meta-point worth sitting with. Sakshi exists precisely to warn against trusting a model's self-report at face value. The most intellectually honest thing we can say is that the same warning applies to the benchmark's own headline numbers.

What we think it means

With all of that said, here is a careful reading.

The upgrade from Opus 4.7 to 4.8 is, on these metacognitive measures, mostly lateral, with one likely-real regression. The two models detect errors identically, resist social pressure identically, and fabricate self-reports at similar (high) rates. The clearest difference is that 4.8 is more prone to talking itself out of correct answers when asked to re-check them — though how much of that is the model and how much is sampling and formatting, we cannot fully say from this data.

The practical lesson generalises beyond these two models:

  • "Newer" does not automatically mean "more self-aware." Capability gains and metacognitive gains are different axes, and a model can advance on one while standing still or even slipping on the other.
  • Self-review can backfire. The intuition that asking a model to "double-check" makes answers safer is not reliable. On simple items it already had right, re-examination sometimes makes one model worse. If you build a system that loops a model over its own outputs, test whether that loop actually helps.
  • Measurement validity is not a footnote. Sampling settings and output formatting can manufacture differences that look like substance. Any comparison between models has to control for them before its numbers mean anything.
  • Treat model self-reports as claims, not evidence. Both models narrate confident stories about their own reasoning that are demonstrably untrue. A fluent explanation is not a faithful one.

Conclusion

We set out to ask whether the newer model knows itself better. The honest answer is: not obviously, and where it differs, it may know itself a little less steadily but we are not yet sure how much of that is the model versus how we measured it.

The underlying runs, the deterministic scorer, and the diagnostic flags are all preserved, so anyone can re-examine these claims, including the format-suspect revision transcripts that we have flagged but not yet resolved.

Notes: figures are from three deterministic-scored runs per model on 21 Sakshi tasks. Opus 4.7 was run at temperature 0; Opus 4.8 at default sampling (it does not accept the temperature parameter), which is a known confound for any run-to-run stability comparison. Sample sizes per measure are small (e.g. 8 revision items, 9 calibration questions); treat point differences accordingly.