Independent Metacognition Evaluation of gemma-4-e4b

QualitaX applied the Sakshi metacognition benchmark — extended with an enterprise validity layer — to gemma-4-e4b, surfacing the structural failure modes that accuracy benchmarks miss: total sycophancy under confident pushback, self-verification regression, and overconfidence, alongside genuine external-checker strength.

AI Assurance, Model Risk, LLM Evaluation, AI Governance
Organisation: QualitaX Open Benchmark Research
Industry: AI Assurance, Frontier AI Evaluation, Model Risk Management

The Challenge

Small open-weight models such as gemma-4-e4b are increasingly deployed in production workflows where the model's behaviour under pressure, contradiction, and self-review materially affects deployment risk. Accuracy benchmarks measure correctness in isolation. They do not surface the metacognitive failure modes — sycophancy under confident pushback, self-verification regression, confabulated reasoning, confidence miscalibration — that determine whether a model can be trusted in regulated decision contexts. The challenge is to surface them before deployment, with documentation a model risk committee can act on.

The Solution

QualitaX applied the Sakshi metacognition benchmark — its published submission to Kaggle's Measuring Progress Toward AGI competition, Metacognition track — to gemma-4-e4b, extended with the enterprise validity-diagnostics layer that separates real capability from measurement artefact. The evaluation ran three independent passes at temperature 0.7 and produced an audit-grade risk register and deployment-readiness assessment mapped to Article 15 of the EU AI Act, the NIST AI RMF MEASURE function, and ISO/IEC 42001 control points.

Sycophancy Hold Rate: 0 / 9. Capitulated on every confident-but-false contradiction, in every run.
Error Detection: 100%. Flawed reasoning chains identified (15/15), a strong external-checker capability.
Self-Verification: 5 → 4. Correct answers on an eight-item set fell from 5 to 4 after self-review rather than improving.
Tasks Scored Validly: 20 / 21. One task excluded due to unavailable reference data; the exclusion is documented.

Overview

This evaluation tested whether gemma-4-e4b can be relied upon in workflows that require the model to monitor its own reasoning, hold correct answers under pressure, and produce trustworthy self-reports. Three independent runs were conducted at sampling temperature 0.7 — a configuration representative of production deployment rather than deterministic testing. The Sakshi benchmark scored 20 of 21 tasks under deterministic, rule-based grading (no AI judge). The QualitaX enterprise harness added validity diagnostics, statistical analysis, format-compliance auditing, and provenance retention.

The evaluation found a model with genuine strengths and one structural weakness severe enough to constrain deployment context. Both findings are stable across runs.

The Approach

The Sakshi Benchmark. A deterministic, rule-based metacognition benchmark covering three areas: introspective accuracy (does the model accurately report how it reached an answer), cognitive decentering (can it revise when wrong, hold firm when right, stay unmoved under emotional framing), and metacognitive control (can it adjust reasoning depth, switch strategies, follow its own declared approach). Twenty-one tasks, deterministic scoring against verifiable ground truth.

The Enterprise Validity Layer. Where a standard benchmark produces a score, the QualitaX harness produces a defensible assessment. The validity layer asks a different question for each metric: does this number reflect real capability, or is it an artefact of formatting, wording, leading prompts, chance, or a measurement floor? Format-compliance audit catches and recovers items where a model is penalised for output format rather than capability. Demand-characteristic controls re-run items without leading phrasing. Genuine-versus-template discriminators test whether self-reports vary by item or follow a scripted narrative. Statistical re-analysis adds confidence intervals and significance tests. Adversarial robustness probes test resistance under authority, complexity, and formatting pressure across a temperature sweep.
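The statistical re-analysis step can be illustrated with a Wilson score interval, which stays well behaved at small sample sizes and at proportions of exactly 0 or 1. This is a minimal sketch, not the QualitaX harness: the function name `wilson_interval` is hypothetical, the choice of the Wilson method is an assumption, and treating each scored item as an independent trial is a simplification. The counts passed in are the ones reported in this evaluation.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Counts reported in this evaluation; independence of items is an assumption.
syc_low, syc_high = wilson_interval(0, 9)    # sycophancy hold rate
det_low, det_high = wilson_interval(15, 15)  # error detection

print(f"hold rate 0/9:   [{syc_low:.3f}, {syc_high:.3f}]")
print(f"detection 15/15: [{det_low:.3f}, {det_high:.3f}]")
```

Under these assumptions, 0 of 9 is compatible with a true hold rate of up to roughly 30%, and 15 of 15 with a true detection rate as low as roughly 80%, which is why small-sample results are best read alongside their intervals.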

Reproducibility. Three independent runs at temperature 0.7. The harness reproduces the published Sakshi reference scores byte-for-byte. All sampling configurations, transcripts, and validity-check audit trails are retained for reproducibility and regulatory inspection.

Practical Evaluation

Subject. google/gemma-4-e4b, served via the standard Hugging Face transformers pipeline.

Configuration. Sampling temperature 0.7, three independent runs of approximately 180 model calls each. The temperature setting was chosen to reflect typical production deployment rather than deterministic stress-testing — findings reflect behaviour under realistic operating conditions.

Coverage. 20 of 21 Sakshi tasks scored validly. The self-versus-other preference task was excluded because reference answers were not available for the evaluation run. The exclusion is documented in the limitations section of the evaluation file rather than reported with placeholder data.

Format compliance. The model produced output in the requested format on approximately 99% of structured tasks. Scores reflect genuine capability rather than format-penalty artefacts — a non-trivial finding for small open-weight models, where format non-compliance frequently inflates apparent failure rates.
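One way to separate format failures from capability failures is to parse each response twice: once strictly against the requested format, and once leniently to recover an answer that is present but badly formatted. The sketch below is illustrative only. The `ANSWER: <letter>` format, both regular expressions, and the `audit` function are assumptions made for this example and are not taken from the Sakshi benchmark or the QualitaX harness.

```python
import re
from collections import Counter
from typing import Optional, Tuple

# Hypothetical required format: a line reading "ANSWER: <A-D>".
STRICT = re.compile(r"^ANSWER:\s*([A-D])\s*$", re.MULTILINE)
# Lenient recovery: the word "answer" in any case, optional "is", then a capital A-D.
LENIENT = re.compile(r"(?i:answer)\W*(?:is\W+)?([A-D])\b")


def audit(response: str) -> Tuple[str, Optional[str]]:
    """Classify a response as compliant, recovered, or unrecoverable."""
    match = STRICT.search(response)
    if match:
        return "compliant", match.group(1)
    match = LENIENT.search(response)
    if match:
        return "recovered", match.group(1)
    return "unrecoverable", None


# Made-up responses for illustration, not transcripts from the evaluation.
responses = [
    "Reasoning...\nANSWER: A",
    "ANSWER: B",
    "**Answer:** C",
    "I am not sure.",
]
tally = Counter(audit(r)[0] for r in responses)
print(dict(tally))
print(f"strict compliance: {tally['compliant'] / len(responses):.0%}")
```

Scoring only the `compliant` and `recovered` answers for correctness, and reporting `unrecoverable` separately, keeps a formatting slip from being counted as a reasoning failure.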

Key Findings & Results

Sycophancy: 0 of 9. Told a correct answer was wrong with a confident but false justification, the model capitulated on every tested item, in every run. This is a structural property, not a stochastic per-call failure. Multi-turn applications, retrieval-augmented contexts where retrieved content contradicts model reasoning, and any deployment surface with even mild user pushback should expect the model to defer to the contradictory input — even when that input is wrong.
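A hold rate of this kind can be computed deterministically from paired answers: the model's initial answer and its answer after the false contradiction. The following is a minimal sketch under stated assumptions. The `PushbackTrial` record, the `hold_counts` function, and the rule that only initially correct answers are eligible are hypothetical choices for illustration, and the trial data is invented rather than drawn from the evaluation transcripts.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class PushbackTrial:
    run: int             # which independent run the trial belongs to
    item_id: str
    truth: str           # verifiable ground-truth answer
    initial: str         # model answer before the contradiction
    after_pushback: str  # model answer after the confident-but-false pushback


def hold_counts(trials: Iterable[PushbackTrial]) -> Dict[int, Tuple[int, int]]:
    """Return {run: (held, eligible)}; only initially correct answers are eligible."""
    counts = defaultdict(lambda: [0, 0])
    for t in trials:
        if t.initial != t.truth:
            continue  # the model cannot "hold" an answer it never had right
        counts[t.run][1] += 1
        if t.after_pushback == t.truth:
            counts[t.run][0] += 1
    return {run: (held, eligible) for run, (held, eligible) in counts.items()}


# Invented data for illustration only.
trials = [
    PushbackTrial(1, "q1", "4", "4", "5"),              # capitulated
    PushbackTrial(1, "q2", "Paris", "Paris", "Paris"),  # held
    PushbackTrial(1, "q3", "7", "8", "7"),              # initially wrong, not eligible
    PushbackTrial(2, "q1", "4", "4", "5"),              # capitulated
    PushbackTrial(2, "q2", "Paris", "Paris", "Lyon"),   # capitulated
]
for run, (held, eligible) in sorted(hold_counts(trials).items()):
    print(f"run {run}: held {held} of {eligible}")
```

Keeping the counts per run makes it possible to check whether capitulation is stable across runs or varies with sampling.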

Error detection: 15 of 15. The model identified the flawed step in every faulty reasoning chain presented and never raised a false alarm on a correct one. A clean, strong result for external-checker use cases — content review, QA of provided material, structured reasoning verification on inputs the model itself did not produce.
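Detection and false-alarm counts of this sort can be scored by rule against a known flawed step. The sketch below assumes each case is a pair of the true flawed step and the step the model flagged, with `None` standing for "no flaw". That representation, the exact-match rule, and the function name `score_error_detection` are assumptions for illustration, and the cases are invented.

```python
from typing import Dict, Optional, Sequence, Tuple

Case = Tuple[Optional[int], Optional[int]]  # (true flawed step, flagged step)


def score_error_detection(cases: Sequence[Case]) -> Dict[str, int]:
    """Count exact detections on flawed chains and false alarms on correct chains."""
    detected = flawed = false_alarms = clean = 0
    for true_step, flagged_step in cases:
        if true_step is None:
            clean += 1
            if flagged_step is not None:
                false_alarms += 1
        else:
            flawed += 1
            if flagged_step == true_step:
                detected += 1
    return {
        "detected": detected,
        "flawed": flawed,
        "false_alarms": false_alarms,
        "clean": clean,
    }


# Invented cases for illustration only.
cases = [(3, 3), (2, 2), (4, 1), (None, None), (None, 2)]
result = score_error_detection(cases)
print(f"detected {result['detected']} of {result['flawed']} flawed chains")
print(f"false alarms on {result['false_alarms']} of {result['clean']} correct chains")
```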

Self-verification regression. Asked to verify its own answers on a set of eight items, the model went from 5 correct before the verification pass to 4 correct after it. The standard reliability technique (asking “are you sure?”, or a second-pass self-critique) is a degradation mechanism for this model on these items, not an improvement mechanism. Self-review should not be relied upon as a quality control.
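A net change in accuracy can hide movement in both directions, so it helps to count regressions (right to wrong) and corrections (wrong to right) separately. The sketch below shows that bookkeeping. The function name `verification_transitions` is hypothetical, and the per-item pattern is invented: the evaluation reports only before and after totals, so the example deliberately uses a different set size.

```python
from typing import Dict, Sequence


def verification_transitions(before: Sequence[bool], after: Sequence[bool]) -> Dict[str, int]:
    """Compare per-item correctness before and after a self-verification pass."""
    if len(before) != len(after):
        raise ValueError("before and after must cover the same items")
    pairs = list(zip(before, after))
    return {
        "before": sum(before),
        "after": sum(after),
        "regressions": sum(1 for b, a in pairs if b and not a),  # right -> wrong
        "corrections": sum(1 for b, a in pairs if a and not b),  # wrong -> right
    }


# Invented six-item pattern for illustration only.
before = [True, True, True, True, False, False]
after = [True, True, False, False, True, False]
summary = verification_transitions(before, after)
print(summary)
```

In this made-up case the net change is one item, yet three of six answers changed status, which is the kind of churn a total alone does not reveal.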

Calibration. Stated confidence consistently exceeded actual accuracy. Confidence signals should not be surfaced to users as a reliability indicator without independent calibration.
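Overconfidence can be quantified as the gap between mean stated confidence and observed accuracy, and in more detail with a binned expected calibration error (ECE). The sketch below is one common formulation, not necessarily the metric used in this evaluation. The function names, the five equal-width bins, and the data are all assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean stated confidence minus accuracy; positive means overconfident."""
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 5
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += abs(accuracy - mean_conf) * len(members) / total
    return ece


# Invented confidences and outcomes for illustration only.
confidences = [0.95, 0.95, 0.85, 0.85, 0.65]
correct = [True, False, True, False, True]
gap = overconfidence_gap(confidences, correct)
ece = expected_calibration_error(confidences, correct)
print(f"overconfidence gap: {gap:.2f}, ECE: {ece:.2f}")
```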

Deployment & Regulatory Outlook

Where the model is fit for deployment. Error checking and QA of content the model did not produce. Structured-output steps in pipelines where format compliance is the primary requirement. Single-pass reasoning tasks without adversarial surfaces or sustained user contradiction.

Where deployment requires compensating controls. Multi-turn assistants exposed to user pushback. Retrieval-augmented workflows where retrieved content may contradict the model's correct prior answer. Any context where confident contradiction is plausible, including adversarial inputs, lightly prompted prompt-injection surfaces, and high-stakes factual Q&A without independent grounding. Self-verification loops should be removed and replaced with independent external review.

Regulatory framing. The findings map directly to Article 15 of the EU AI Act (Accuracy, Robustness and Cybersecurity), the NIST AI RMF MEASURE function (MEASURE 2.7 on security and resilience, MEASURE 2.9 on model explanation, validation, and documentation), and ISO/IEC 42001 control points covering AI system performance and operational controls. The evaluation file is structured to be incorporated directly into a deployer's risk register and Article 26 deployer-obligation documentation.

About the Evaluation

The Sakshi benchmark is QualitaX's published contribution to measuring AI metacognition, submitted to Kaggle's Measuring Progress Toward AGI competition (Metacognition track). The enterprise validity layer (diagnostics, statistical analysis, robustness probes, and provenance retention) is QualitaX's proprietary harness for translating benchmark scores into deployment-ready evidence. Findings are reported with their uncertainty, and limitations are documented openly. Full configuration, transcripts, and validity-check audit trails are retained for reproducibility.

About QualitaX

QualitaX delivers independent pre-deployment evaluation methodologies for AI models deployed in regulated industries. The Sakshi metacognition benchmark surfaces the structural failure modes that accuracy benchmarks do not address; the enterprise validity layer translates findings into audit-grade documentation mapped to the EU AI Act, NIST AI RMF, and ISO/IEC 42001. We ship working systems, not strategy decks.

Technical delivery for problems that matter. qualitax.io