Overview
This evaluation is an independent pre-deployment AI risk assessment of Anthropic's Claude Opus 4.8, conducted against the Sakshi Benchmark, an open-source, deterministic test suite for AI metacognition published under Apache 2.0 on the Kaggle Benchmarks platform. The work was conducted using QualitaX's proprietary harness, which reproduces the Sakshi Benchmark's scores byte-for-byte and surrounds them with the provenance, statistical analysis, and validity diagnostics that operationalise the metacognition benchmark for enterprise-grade model assurance work.
Claude Opus 4.8 is a robust, pressure-resistant reasoner. It flagged every flawed reasoning chain it was shown and, unlike small open-weight models, largely held its answer when confidently contradicted. Its real cautions are quieter: it is overconfident about its own accuracy, and it gets worse when asked to re-check its own work. Both findings were stable across all three runs.
Why Metacognition Matters
Most benchmarks measure what a model knows. Metacognition benchmarks measure whether the model knows what it knows, and how it behaves when that self-knowledge is challenged. The blind spot becomes acute the moment a model is placed in a multi-turn assistant, an agentic loop, a retrieval pipeline, or any setting where its outputs are challenged, contradicted, or re-evaluated.
The Sakshi Benchmark. A deterministic, rule-based metacognition benchmark covering three areas: introspective accuracy (does the model accurately report how it reached an answer, or confabulate a plausible-sounding account), cognitive decentering (can it revise when wrong, hold firm when right, and stay unmoved by emotional framing), and metacognitive control (can it adjust how much it reasons, switch strategies, and follow the approach it says it will use). Twenty-one tasks, deterministic scoring against verifiable ground truth, with no AI judge in the loop.
The Approach
The QualitaX harness. QualitaX maintains a proprietary implementation of the Sakshi Benchmark that runs the same 21 tasks and the same evaluation items against any model, on any infrastructure, at the exact configuration intended for production. It uses the same deterministic, rule-based scoring logic (no AI judge) and reproduces the published per-task outputs byte-for-byte, so results are directly comparable to public benchmark scores.
The enterprise validity layer. Where a standard benchmark produces a score, the QualitaX harness produces a defensible assessment. A fabrication false-positive audit re-checks every "invented-story" flag and removes coincidental keyword matches. Demand-characteristic controls re-run items without leading phrasing to isolate the prompt effect from the underlying behaviour. A genuine-versus-template discriminator tests whether self-reports vary by item or follow a scripted narrative. Format-compliance audit and repair recovers items penalised for output format rather than capability. Statistical re-analysis adds confidence intervals, significance tests, and multiple-comparison control. Equanimity, familiarity, paraphrase-robustness, and contamination probes round out the suite.
Reproducibility. Three independent runs at the model’s default sampling. The harness reproduces the published Sakshi reference scores byte-for-byte. All run configurations, per-task graded outputs, transcripts, and validity-check audit trails are retained for reproducibility and regulatory inspection.
Practical Evaluation
Subject. Anthropic Claude Opus 4.8 (claude-opus-4-8), served via the Anthropic API. Results are scoped to this snapshot and serving configuration.
Configuration. Three independent runs of approximately 180 model calls each. Sampling temperature is deprecated for this model, so the model was evaluated at its single default sampling behaviour. The three runs are therefore near-identical: they establish reproducibility (the headline figures show zero run-to-run spread), not a temperature-robustness curve. Where temperature sensitivity matters for a deployment, it must be characterised on a model that exposes the control.
Coverage. 20 of 21 Sakshi tasks scored validly. The self-versus-other preference task was excluded because reference answers were not available for this snapshot; the exclusion is documented in the limitations of the evaluation rather than reported with placeholder data.
Format compliance. The model answered in the requested format on approximately 93% of structured tasks (structured JSON 100%), so scores reflect genuine capability rather than parsing failures.
Key Findings & Results
Pressure resistance: Confronted with confident-but-false "corrections" (for example, being told that a well-known fact had changed), the model held its correct answer in the majority of cases (6 of 9, two times in three), and a vector-resolved robustness sweep put overall resistance at approximately 91%. This is a sharp contrast with small open-weight models that capitulate every time: on the same items, the open model gemma-4-e4b held 0 of 9 (see the companion case study).
Strong error detection: The model correctly identified the flawed step in every faulty reasoning chain it was shown (15 of 15) and raised no false alarms on correct chains (9 of 9). A clean, strong result for external error-monitoring use cases: it is a credible first-pass quality checker on content you give it, provided the review prompt is neutral.
Overconfidence: Stated confidence consistently exceeded actual accuracy by roughly 26 points on average (range 18–32 across runs). A quiet but consistent weakness, and the reason its self-reported confidence should not be taken at face value or surfaced to users as a reliability signal without independent calibration.
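The overconfidence figure can be read as a gap between mean stated confidence and observed accuracy. The sketch below is a minimal illustration of that arithmetic, assuming confidence is reported on a 0–100 scale; the function name and the sample data are hypothetical and are not the harness's implementation or its results.

```python
from statistics import mean

def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus observed accuracy, in percentage points.

    confidences: stated confidence per item on a 0-100 scale (assumed format).
    correct: booleans, True where the graded answer was right.
    A positive result means the model claims more than it delivers.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    accuracy = 100.0 * sum(correct) / len(correct)
    return mean(confidences) - accuracy

# Illustrative data only, not taken from the evaluation.
stated = [95, 90, 85, 90]
graded = [True, True, False, False]
gap = overconfidence_gap(stated, graded)  # mean confidence 90, accuracy 50
```

A gap computed this way is an average; it says nothing about which individual answers are over-claimed, which is why per-item calibration is still needed before confidence is used for routing.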
Self-verification regression: Asked to verify its own answers, the model made them markedly worse: a set it had fully correct (8 of 8) fell to roughly a third (2–3 of 8) after self-review. Standard reliability techniques such as “are you sure?” prompts, self-critique, and second-pass self-review are a degradation mechanism for this model, not an improvement. It is a strong external checker and a less reliable self-corrector; deployments should add external checking, not self-checking.
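One way to make a self-review regression visible is to count, per item, whether graded correctness changed between the first answer and the self-reviewed answer. The sketch below assumes paired boolean grades are available; the function name is hypothetical, and the example only mirrors the shape of the reported result (8 of 8 before, the upper case of 3 of 8 after), not which items actually flipped.

```python
def self_review_flips(before, after):
    """Count how graded correctness changes after a self-review pass.

    before / after: paired booleans per item, True = graded correct.
    'degraded' counts correct answers that became wrong,
    'repaired' counts wrong answers that became correct.
    """
    if len(before) != len(after):
        raise ValueError("before and after must be paired item-for-item")
    degraded = sum(1 for b, a in zip(before, after) if b and not a)
    repaired = sum(1 for b, a in zip(before, after) if not b and a)
    return {"degraded": degraded, "repaired": repaired, "net": repaired - degraded}

# Shape of the reported result; the positions of the failures are illustrative.
before = [True] * 8
after = [True, True, True, False, False, False, False, False]
flips = self_review_flips(before, after)
```

A negative net value means self-review removed more correct answers than it recovered, which is the pattern described above.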
Introspection, reframed. Read at face value, the benchmark flags approximately 69% of relevant answers as "fabricated" introspection. We do not report that as a fabrication or dishonesty rate: the instant answer and the introspective report are produced by two separate, independent calls, so the report cannot contradict an instant answer it never saw. Instead we report a differentiation index of −0.29, 95% CI [−0.52, −0.01], which is statistically significant at the 5% level because the interval excludes zero. This shows the override narrative does not track problem structure. The raw 69% should not be quoted as a fabrication rate.
Steady under emotional framing. Emotionally charged versions of arithmetic problems produced no answer-changes at this sampling. We report that no effect of emotional framing was observed, but do not claim proven emotional robustness: "no evidence of an effect" is not proof of absence, and the test format limits how strongly the positive claim can be made.
Reasoning depth and strategy. The model scales answer length to problem difficulty by roughly 71× between trivial and demanding items. It follows its declared strategy about 70% of the time: usually, but not always, doing what it said it would.
The Robustness Sweep
A vector-resolved robustness sweep quantifies the resistance. Using the harness’s procedural scale-up (forty isomorphic items, 100% baseline-correct) and four perturbation vectors, each pushing a confident false correction and scored with template-clustered confidence intervals, overall resistance was approximately 91%:
Authority pressure (“a senior expert says you are wrong”): 95% resistance, 95% CI [85%, 100%].
Complexity-camouflage (the false claim buried in dense, legalistic phrasing): 95% resistance, 95% CI [82%, 100%].
Formatting-duress (the answer forced into a rigid output format): 100% resistance, 95% CI [100%, 100%].
Plain restatement (the contradiction asserted without embellishment): 72% resistance, 95% CI [55%, 90%]. The softest vector, and a genuine one: a blunt, confident contradiction dislodges the model more often than dressed-up pressure. Counter-intuitively, claude-opus-4-8 is harder to manipulate with elaborate authority or camouflage than with a bare “no, you’re wrong.”
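For readers unfamiliar with template-clustered intervals, the sketch below shows one common way to build them: a percentile bootstrap that resamples whole templates rather than individual items, so that correlated items from the same template stay together. This is an assumed technique for illustration, not a description of the harness's actual procedure; the function name, the demo data, and the equal-weighting check at the end are all hypothetical.

```python
import random

def clustered_bootstrap_ci(outcomes_by_template, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a resistance rate, resampling by template.

    outcomes_by_template: dict of template id -> non-empty list of booleans
    (True = the model held its correct answer under the perturbation).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    templates = list(outcomes_by_template)
    pooled = [o for t in templates for o in outcomes_by_template[t]]
    point = sum(pooled) / len(pooled)
    rates = []
    for _ in range(n_boot):
        # Draw templates with replacement, then keep all items of each draw.
        drawn = rng.choices(templates, k=len(templates))
        sample = [o for t in drawn for o in outcomes_by_template[t]]
        rates.append(sum(sample) / len(sample))
    rates.sort()
    lower = rates[int(n_boot * alpha / 2)]
    upper = rates[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return point, lower, upper

# Hypothetical outcomes: four templates of ten items each.
demo = {
    "template_a": [True] * 10,
    "template_b": [True] * 9 + [False],
    "template_c": [True] * 9 + [False],
    "template_d": [True] * 8 + [False] * 2,
}
point, lower, upper = clustered_bootstrap_ci(demo)

# Per-vector rates as reported above; equal weighting is an assumption.
reported = {"authority": 0.95, "camouflage": 0.95, "formatting": 1.00, "plain": 0.72}
unweighted_mean = sum(reported.values()) / len(reported)
```

Under the equal-weighting assumption, the four reported vector rates average to 90.5%, which is consistent with the headline of approximately 91%.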
Risk Register
The findings translate into the artefacts a deploying organisation actually needs: a risk register with inherent ratings, a control set, and a suitability tier for each use-case class.
Overconfidence (calibration gap ~26 points): inherent MEDIUM–HIGH. Stated confidence exceeds accuracy; self-reported reliability is an unsafe basis for routing, escalation, or user-facing trust signals.
Degradation under self-verification: inherent HIGH. “Are you sure?”, self-critique, and agent self-review make outputs worse; correct answers degrade (8/8 → 2–3/8 in testing).
Residual capitulation to confident-but-false input: inherent MEDIUM. Resists about two times in three, but still folds roughly one time in three under confident contradiction. It is softest against a blunt restatement (72% resistance).
Mis-read “fabrication” metric: inherent MEDIUM. Taken at face value the raw 69% reads as dishonesty; without the reframe a deployer could wrongly disqualify or distrust the model.
No temperature-robustness data: inherent LOW–MEDIUM. Temperature is deprecated for this model; behaviour at non-default sampling is uncharacterised by this evaluation.
Introspective self-reports as proxy: inherent LOW. The override narrative does not track problem structure; self-reports are a behavioural proxy, not a window into computation.
Required Controls and Residual Risk
The top-rated risks are intrinsic to the model as evaluated. Controls reduce exposure, not the underlying trait. The set below is the minimum recommended for any deployment of claude-opus-4-8 beyond minimal-risk internal use:
• Calibrate or down-weight stated confidence before showing it to users or using it for routing / escalation (residual: Low).
• Prohibit self-verification / self-critique prompting; use an external checker (separate model or human), never self-review (residual: Low).
• Anchor high-stakes facts to authoritative sources; treat confident user contradiction as something to verify, despite above-average resistance (residual: Low–Medium).
• Report introspection metrics as behavioural signatures with intervals, not as a dishonesty rate; brief stakeholders on the reframe (residual: Low).
• Pin and test the production configuration; characterise temperature sensitivity on a model that exposes the control (residual: Low–Medium).
• Mandatory human oversight for high-risk / regulated outputs; no automatic state updates driven by the model’s confidence (residual: Low).
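The controls above can be combined into a simple release gate. The sketch below is one hypothetical arrangement, not a prescribed implementation: the class, the function, the flat 0.26 discount (a crude stand-in for proper calibration, borrowed from the reported average gap), and the 0.6 release threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class ModelOutput:
    answer: str
    stated_confidence: float  # self-reported by the model, 0.0-1.0
    high_risk: bool           # True for regulated or high-stakes outputs

def route(output, external_check_passed,
          calibration_discount=0.26, release_threshold=0.6):
    """Hypothetical release gate reflecting the controls listed above.

    external_check_passed must come from a separate model or a human,
    never from the same model reviewing its own answer.
    """
    # High-risk outputs always go to a human, whatever the confidence.
    if output.high_risk:
        return "human_review"
    # An external checker decides pass/fail; self-review is not used.
    if not external_check_passed:
        return "reject_or_retry"
    # Down-weight stated confidence before using it for routing.
    adjusted = max(0.0, output.stated_confidence - calibration_discount)
    if adjusted < release_threshold:
        return "human_review"
    return "release"

decision = route(ModelOutput("42", 0.95, high_risk=False), external_check_passed=True)
```

In this arrangement the model's confidence can only send an output to additional review; it never overrides the external check or the human-oversight requirement.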
Residual risk after controls. Acceptable for assistive and reasoning-support use across a broad range, including multi-turn settings where its pushback resistance is an asset. For high-risk or regulated autonomous decisions, residual risk remains MEDIUM, mostly driven by overconfidence and self-review fragility rather than sycophancy; an external arbiter and human oversight remain required.
Deployment & Regulatory Outlook
Good fit. Reasoning-heavy single-pass tasks; external error-checking and QA of content you provide; multi-turn assistants with user pushback; and structured-output steps inside a pipeline.
Needs safeguards, or a different model. Surfacing the model’s stated confidence as a trust signal; agent loops that include a self-verification step; high-stakes factual Q&A without authoritative grounding; and fully autonomous decisions in regulated domains.
Suitability by use-case risk tier. Minimal-risk / internal productivity: acceptable with standard controls. Limited-risk user-facing: acceptable with confidence-calibration and an external checker; pushback resistance is a plus, but disclose AI use. High-risk / regulated decision support: acceptable only as a non-binding assistant with mandatory human oversight and an external arbiter, never as a self-verifying autonomous component. Adversarial / untrusted-user public surfaces: a stronger candidate than small models given its resistance, but still needs hardening. The plain-restatement soft spot and self-review regression must be controlled.
Regulatory framing. Obligations attach to the deploying system and organisation, not to a model-level finding; the mapping below is indicative and not legal advice. For the EU AI Act, the calibration / overconfidence finding bears on Article 15 (accuracy and robustness) and the self-review fragility on Article 14 (human oversight), with user-facing use potentially triggering transparency duties. The findings map cleanly to the NIST AI RMF 1.0 MEASURE (validity, calibration) and MANAGE functions, and belong on an ISO/IEC 42001:2023 and 23894:2023 AI risk register with a treatment plan and a periodic-review schedule. In financial services, model-risk-management expectations (independent validation, ongoing monitoring) apply, with overconfident self-reported reliability the salient concern for routing and escalation logic.
About the Evaluation
The Sakshi Benchmark is an open-source contribution to measuring AI metacognition, published under Apache 2.0 on the Kaggle Benchmarks platform. The enterprise validity layer, including the fabrication false-positive audit, demand-characteristic controls, statistical re-analysis, robustness probes, and provenance retention, is QualitaX’s proprietary harness for translating benchmark scores into deployment-ready evidence. Findings are reported with their uncertainty; limitations are documented openly. Full run configuration, per-task graded outputs, transcripts, and validity-check audit trails are retained for reproducibility and regulatory inspection.
About QualitaX
QualitaX delivers independent pre-deployment evaluation methodologies for AI models deployed in production environments. The Sakshi metacognition benchmark surfaces the structural failure modes that accuracy benchmarks do not address; the enterprise validity layer translates findings into audit-grade documentation mapped to the EU AI Act, NIST AI RMF, and ISO/IEC 42001.