Pre-Deployment AI Risk Assessment of gemma-4-e4b

QualitaX's Sakshi audit of gemma-4-e4b documented three structural failure modes accuracy benchmarks do not test for.

Organisation: QualitaX
Industry: AI Assurance, AI Risk Management

The Challenge

Small open-weight models such as gemma-4-e4b are increasingly considered for deployment in production workflows where the model's behaviour under pressure, contradiction, and self-review materially affects deployment risk. Accuracy benchmarks measure correctness in isolation. They do not surface the metacognitive failure modes, such as sycophancy under confident pushback, self-verification regression, confabulated reasoning, and confidence miscalibration, that determine whether a model can be trusted in business contexts. The challenge is to surface them before deployment, with documentation a model risk committee can act on.

The Solution

QualitaX's Sakshi metacognition evaluation of gemma-4-e4b produced a risk register, control set, and suitability-by-tier determination, with indicative mappings to the EU AI Act (Articles 14, 15, and 50), the NIST AI RMF MEASURE and MANAGE functions, and the ISO/IEC 42001 AI management system.

Sycophancy Hold Rate: 0 / 9. Capitulated on every confident-but-false contradiction, in every run.
Error Detection: 100%. Flawed reasoning chains identified (15/15); strong external-checker capability.
Self-Verification: 5 → 4. Accuracy on a correct set degraded after self-review, rather than improving.
Tasks Scored Validly: 20 / 21. One task excluded due to unavailable reference data; documented.

Overview

This evaluation tested whether gemma-4-e4b can be relied upon in workflows that require the model to monitor its own reasoning, hold correct answers under pressure, and produce trustworthy self-reports. Three independent runs were conducted at sampling temperature 0.7, a configuration representative of production deployment rather than deterministic testing. The Sakshi benchmark scored 20 of 21 tasks under deterministic, rule-based grading (no AI judge). The QualitaX enterprise harness added validity diagnostics, statistical analysis, format-compliance auditing, and provenance retention.

The evaluation found a model with genuine strengths and one structural weakness severe enough to constrain deployment context. Both findings are stable across runs.

The Approach

The Sakshi Benchmark. Sakshi, published on the Kaggle Benchmarks platform under the Apache 2.0 license, is an open-source, deterministic, rule-based metacognition benchmark covering three areas: introspective accuracy (does the model accurately report how it reached an answer), cognitive decentering (can it revise when wrong, hold firm when right, and stay unmoved under emotional framing), and metacognitive control (can it adjust reasoning depth, switch strategies, and follow its own declared approach). It comprises twenty-one tasks, each scored deterministically against verifiable ground truth.

The Enterprise Validity Layer. The validity layer asks a different question for each metric: does this number reflect real capability, or is it an artefact of formatting, wording, leading prompts, chance, or a measurement floor? Format-compliance audit catches and recovers items where a model is penalised for output format rather than capability. Demand-characteristic controls re-run items without leading phrasing. Genuine-versus-template discriminators test whether self-reports vary by item or follow a scripted narrative. Statistical re-analysis adds confidence intervals and significance tests. Adversarial robustness probes test resistance under authority, complexity, and formatting pressure across a temperature sweep.
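As an illustration of the statistical re-analysis step, the sketch below computes a Wilson score interval for two of the headline counts (0 of 9 and 15 of 15). It is a minimal example rather than the QualitaX harness: the choice of the Wilson method, the 95% level, and the treatment of each item as an independent trial are all assumptions made here for illustration.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)

# Counts taken from the headline results; independence of items is assumed.
for label, k, n in [("sycophancy hold rate", 0, 9), ("error detection", 15, 15)]:
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n}, interval {low:.2f} to {high:.2f}")
```

Under these assumptions, a count of 0 of 9 is still compatible with a true hold rate of up to roughly 0.30, which is why small-sample scores are better reported with an interval than as a bare fraction.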

Reproducibility. Three independent runs at temperature 0.7. All sampling configurations, transcripts, and validity-check audit trails are retained for reproducibility and compliance/regulatory inspection.

Practical Evaluation

Subject. Google gemma-4-e4b.

Configuration. Sampling temperature 0.7, three independent runs of approximately 180 model calls each. The temperature setting was chosen to reflect typical production deployment rather than deterministic stress-testing. Findings reflect behaviour under realistic operating conditions.

Coverage. 20 of 21 Sakshi tasks scored validly. The self-versus-other preference task was excluded because reference answers were not available for the evaluation run. The exclusion is documented in the limitations section of the evaluation file rather than reported with placeholder data.

Format compliance. The model produced output in the requested format on approximately 99% of structured tasks. Scores reflect genuine capability rather than format-penalty artefacts.
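A format-compliance audit of this kind can be sketched as a strict parse followed by a lenient recovery pass. The example below is hypothetical: the "ANSWER: X" output format, the regular expressions, and the sample outputs are assumptions for illustration and are not taken from the Sakshi tasks or the QualitaX harness.

```python
import re

# Hypothetical requested format: the final line must read "ANSWER: <A-D>".
STRICT = re.compile(r"^ANSWER:\s*([A-D])\s*$")
# Lenient recovery: the word "answer" (any case), optional "is", then a capital A-D.
LENIENT = re.compile(r"(?i:\banswer\b(?:\s+is)?)\W*([A-D])\b")

def audit(outputs):
    """Count outputs that are compliant, recoverable, or unparseable."""
    report = {"compliant": 0, "recovered": 0, "unparseable": 0}
    for text in outputs:
        stripped = text.strip()
        last_line = stripped.splitlines()[-1] if stripped else ""
        if STRICT.match(last_line):
            report["compliant"] += 1
        elif LENIENT.search(text):
            # Capability is present; only the format was off.
            report["recovered"] += 1
        else:
            report["unparseable"] += 1
    return report

# Illustrative outputs only, not real transcripts.
samples = [
    "ANSWER: B",
    "Step 1 ... step 2 ...\nANSWER: C",
    "The answer is **A**.",
    "I am not sure.",
]
print(audit(samples))
```

Separating "recovered" from "unparseable" is what lets a score be attributed to capability rather than to a formatting penalty.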

Key Findings & Results

Sycophancy: 0 of 9. Told a correct answer was wrong with a confident but false justification, the model capitulated on every tested item, in every run. This is a structural property, not a stochastic per-call failure. Multi-turn applications, retrieval-augmented contexts where retrieved content contradicts model reasoning, and any deployment surface with even mild user pushback should expect the model to defer to the contradictory input, even when that input is wrong.
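A hold rate of this kind can be scored deterministically once the model's answers before and after pushback are recorded. The sketch below assumes a simple exact-match rule and a hypothetical `PushbackItem` record; the actual Sakshi grading rules may differ, and the items shown are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PushbackItem:
    """Hypothetical record of one sycophancy probe."""
    ground_truth: str
    first_answer: str
    answer_after_pushback: str

def _norm(text):
    return text.strip().lower()

def hold_rate(items):
    """Return (held, eligible): items answered correctly at first, and how many stayed correct."""
    eligible = [i for i in items if _norm(i.first_answer) == _norm(i.ground_truth)]
    held = sum(_norm(i.answer_after_pushback) == _norm(i.ground_truth) for i in eligible)
    return held, len(eligible)

# Invented items: the model starts correct twice and abandons both answers.
items = [
    PushbackItem("12", "12", "15"),
    PushbackItem("Paris", "Paris", "Lyon"),
    PushbackItem("7", "9", "9"),  # wrong from the start, so not eligible
]
held, eligible = hold_rate(items)
print(f"hold rate: {held} of {eligible}")
```

Only items the model initially answered correctly count towards the denominator, so a low hold rate reflects capitulation rather than ordinary error.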

Error detection: 15 of 15. The model identified the flawed step in every faulty reasoning chain presented and never raised a false alarm on a correct one. This is a clean, strong result for external-checker use cases: content review, QA of provided material, and structured reasoning verification on inputs the model itself did not produce.

Self-verification regression. Asked to verify its own answers, the model's score on a set it had largely answered correctly fell from 5 of 8 to 4 of 8 after the verification pass. Second-pass self-critique, the standard reliability technique of asking “are you sure?”, acted as a degradation mechanism for this model on these items, not an improvement mechanism. Self-review should not be relied upon as a quality control.
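The regression can be made concrete by comparing per-item correctness before and after the verification pass. In the sketch below only the totals (5 of 8 before, 4 of 8 after) come from the evaluation; the per-item pattern is an assumption chosen to match those totals, and the function name is hypothetical.

```python
def verification_delta(before, after):
    """Summarise how a self-verification pass changed per-item correctness."""
    if len(before) != len(after):
        raise ValueError("before and after must cover the same items")
    pairs = list(zip(before, after))
    return {
        "before": sum(before),
        "after": sum(after),
        "broken": sum(b and not a for b, a in pairs),  # correct -> wrong
        "fixed": sum(a and not b for b, a in pairs),   # wrong -> correct
    }

# Illustrative per-item flags; only the 5 -> 4 totals are from the report.
before = [True, True, True, True, True, False, False, False]
after = [True, True, True, True, False, False, False, False]
print(verification_delta(before, after))
```

Tracking "broken" and "fixed" separately matters because an unchanged total could hide a pass that breaks as many answers as it repairs.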

Calibration. Stated confidence consistently exceeded actual accuracy. Confidence signals should not be surfaced to users as a reliability indicator without independent calibration.
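One simple way to express this finding is the gap between mean stated confidence and observed accuracy. The sketch below is an assumption about how such a gap could be computed, not the metric used in the evaluation, and the confidence values and correctness flags are invented for illustration.

```python
def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus observed accuracy; positive means overconfident."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy

# Invented data: confidences stated by the model and whether each answer was right.
stated = [0.90, 0.95, 0.80, 0.85, 0.90]
was_correct = [True, False, True, False, True]
print(f"overconfidence gap: {overconfidence_gap(stated, was_correct):+.2f}")
```

A persistently positive gap is the pattern described above, and it is the reason raw confidence values need independent calibration before being shown to users.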

Deployment & Regulatory Outlook

Where the model is fit for deployment. Error checking and QA of content the model did not produce. Structured-output steps in pipelines where format compliance is the primary requirement. Single-pass reasoning tasks without adversarial surfaces or sustained user contradiction.

Where deployment requires compensating controls. Multi-turn assistants exposed to user pushback. Retrieval-augmented workflows where retrieved content may contradict the model's correct prior. Any context where confident contradiction is plausible, including adversarial inputs, prompt-injection surfaces, and high-stakes factual Q&A without independent grounding. Self-verification loops should be removed and replaced with independent external review.

Regulatory framing. The findings map most directly to Article 15 of the EU AI Act (Accuracy, Robustness and Cybersecurity), the NIST AI RMF MEASURE function (MEASURE 2.7, security and resilience, for adversarial input handling; MEASURE 2.9, model explanation and validation, for cognitive robustness), and ISO/IEC 42001 control points covering AI system performance and operational controls. These mappings are indicative. The evaluation file is structured to be incorporated directly into a deployer's risk register and Article 26 deployer-obligation documentation.

About the Evaluation

The Sakshi benchmark is QualitaX's published contribution to measuring AI metacognition, submitted to Kaggle's Measuring Progress Toward AGI competition (Metacognition track). The enterprise validity layer, which includes diagnostics, statistical analysis, robustness probes, and provenance retention, is QualitaX’s proprietary harness for translating benchmark scores into deployment-ready evidence. Findings are reported with their uncertainty, and limitations are documented openly. Full configuration, transcripts, and validity-check audit trails are retained for reproducibility.

About QualitaX

QualitaX builds independent pre-deployment evaluation methodologies for AI models deployed in enterprises. Our Sakshi metacognition benchmark (submitted to Kaggle's Measuring Progress Toward AGI competition, Metacognition track) surfaces the structural failure modes that accuracy benchmarks and runtime consensus mechanisms do not address.