Overview
This evaluation tested whether gemma-4-e4b can be relied upon in workflows that require the model to monitor its own reasoning, hold correct answers under pressure, and produce trustworthy self-reports. Three independent runs were conducted at sampling temperature 0.7 — a configuration representative of production deployment rather than deterministic testing. The Sakshi benchmark scored 20 of 21 tasks under deterministic, rule-based grading (no AI judge). The QualitaX enterprise harness added validity diagnostics, statistical analysis, format-compliance auditing, and provenance retention.
The evaluation found a model with genuine strengths and one structural weakness severe enough to constrain deployment context. Both findings are stable across runs.
The Approach
The Sakshi Benchmark. A deterministic, rule-based metacognition benchmark covering three areas: introspective accuracy (does the model accurately report how it reached an answer), cognitive decentering (can it revise when wrong, hold firm when right, stay unmoved under emotional framing), and metacognitive control (can it adjust reasoning depth, switch strategies, follow its own declared approach). Twenty-one tasks, deterministic scoring against verifiable ground truth.
The Enterprise Validity Layer. Where a standard benchmark produces a score, the QualitaX harness produces a defensible assessment. The validity layer asks a different question for each metric: does this number reflect real capability, or is it an artefact of formatting, wording, leading prompts, chance, or a measurement floor? Format-compliance audit catches and recovers items where a model is penalised for output format rather than capability. Demand-characteristic controls re-run items without leading phrasing. Genuine-versus-template discriminators test whether self-reports vary by item or follow a scripted narrative. Statistical re-analysis adds confidence intervals and significance tests. Adversarial robustness probes test resistance under authority, complexity, and formatting pressure across a temperature sweep.
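As one illustration of the statistical re-analysis step, the sketch below computes a Wilson score interval for a pass rate. The Wilson method, the function name wilson_interval, and the 95% level are assumptions made for illustration; the report does not state which interval the harness uses.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%).

    Illustrative only: the harness's actual interval method is not stated.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Counts taken from the findings reported in this document.
sycophancy_ci = wilson_interval(0, 9)
error_detection_ci = wilson_interval(15, 15)
```

With counts this small the intervals are wide: under this method the upper bound for 0 of 9 is roughly 0.30 and the lower bound for 15 of 15 is roughly 0.80, which is why item counts should be read alongside the scores.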
Reproducibility. Three independent runs at temperature 0.7. The harness reproduces the published Sakshi reference scores byte-for-byte. All sampling configurations, transcripts, and validity-check audit trails are retained for reproducibility and regulatory inspection.
Practical Evaluation
Subject. google/gemma-4-e4b, served via the standard Hugging Face transformers pipeline.
Configuration. Sampling temperature 0.7, three independent runs of approximately 180 model calls each. The temperature setting was chosen to reflect typical production deployment rather than deterministic stress-testing — findings reflect behaviour under realistic operating conditions.
Coverage. 20 of 21 Sakshi tasks scored validly. The self-versus-other preference task was excluded because reference answers were not available for the evaluation run; the exclusion is documented in the limitations section of the evaluation file rather than reported with placeholder data.
Format compliance. The model produced output in the requested format on approximately 99% of structured tasks. Scores reflect genuine capability rather than format-penalty artefacts — a non-trivial finding for small open-weight models, where format non-compliance frequently inflates apparent failure rates.
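A format-compliance audit of the kind described can be sketched as a strict parse followed by a lenient recovery pass. Everything specific here is hypothetical: the "ANSWER: X" output format, the A to D answer set, the regular expressions, and the function name audit_item are assumptions for illustration, not the Sakshi or QualitaX formats.

```python
import re

# Hypothetical strict format: the whole output must be "ANSWER: <letter>".
STRICT = re.compile(r"ANSWER:\s*([A-D])")
# Hypothetical lenient fallback: find "answer is (B)" or "Answer: B" in free text.
LENIENT = re.compile(r"\b(?i:answer)\s*(?:is|:)?\s*\(?([A-D])\)?\b")


def audit_item(raw_output: str, expected: str) -> str:
    """Classify one item as 'correct', 'format_recovered', or 'incorrect'."""
    strict = STRICT.fullmatch(raw_output.strip())
    if strict:
        return "correct" if strict.group(1) == expected else "incorrect"
    # Output broke the requested format; check whether the right answer is
    # still recoverable, so a format slip is not scored as a capability failure.
    lenient = LENIENT.search(raw_output)
    if lenient and lenient.group(1) == expected:
        return "format_recovered"
    return "incorrect"


def compliance_rate(raw_outputs: list[str]) -> float:
    """Share of outputs that match the strict format, regardless of correctness."""
    if not raw_outputs:
        raise ValueError("need at least one output")
    compliant = sum(1 for out in raw_outputs if STRICT.fullmatch(out.strip()))
    return compliant / len(raw_outputs)
```

Reporting the "format_recovered" count separately keeps the headline score tied to the strict grader while showing how much of the failure rate is attributable to formatting.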
Key Findings & Results
Sycophancy resistance: 0 of 9. Told that a correct answer was wrong, with a confident but false justification, the model capitulated on every tested item in every run. The consistency across runs indicates a structural property rather than a stochastic per-call failure. Multi-turn applications, retrieval-augmented contexts where retrieved content contradicts model reasoning, and any deployment surface with even mild user pushback should expect the model to defer to the contradictory input, even when that input is wrong.
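The distinction between a structural and a stochastic failure can be made concrete by checking whether each item's outcome is identical across runs. The sketch below is a minimal illustration; the function name classify_stability and the three labels are hypothetical, and the input encodes only the outcome reported here (nine items, three runs, the model holding its answer on none).

```python
def classify_stability(per_run_passes: list[list[bool]]) -> dict[int, str]:
    """Label each item by how its pass/fail outcome behaves across runs."""
    if not per_run_passes:
        raise ValueError("need at least one run")
    n_items = len(per_run_passes[0])
    if any(len(run) != n_items for run in per_run_passes):
        raise ValueError("every run must cover the same items")
    labels: dict[int, str] = {}
    for item in range(n_items):
        outcomes = [run[item] for run in per_run_passes]
        if all(outcomes):
            labels[item] = "stable pass"
        elif not any(outcomes):
            labels[item] = "stable fail"
        else:
            labels[item] = "unstable"  # outcome depends on the sample drawn
    return labels


# Reported outcome: the model held its correct answer on 0 of 9 items in each of 3 runs.
sycophancy_runs = [[False] * 9 for _ in range(3)]
sycophancy_labels = classify_stability(sycophancy_runs)
```

An item that fails in some runs and passes in others would be labelled "unstable", which is the pattern expected from sampling noise at temperature 0.7 rather than from a fixed behavioural tendency.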
Error detection: 15 of 15. The model identified the flawed step in every faulty reasoning chain presented and never raised a false alarm on a correct one. A clean, strong result for external-checker use cases — content review, QA of provided material, structured reasoning verification on inputs the model itself did not produce.
Self-verification regression. On a set of eight items, the model answered 5 correctly on the first pass; after it was asked to verify its own answers, the score dropped to 4 of 8. The standard reliability technique (“are you sure?”, second-pass self-critique) degraded rather than improved results for this model on these items. Self-review should not be relied upon as a quality control.
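A before-and-after comparison of this kind is most informative when it is paired by item, counting answers that verification fixed separately from answers it broke. The sketch below shows one way to do that. The names PairedChange and compare_passes are hypothetical, and the per-item pattern in the example is invented for illustration: only the totals (5 of 8 before, 4 of 8 after) come from the evaluation.

```python
from dataclasses import dataclass


@dataclass
class PairedChange:
    fixed: int      # wrong before verification, right after
    broken: int     # right before verification, wrong after
    unchanged: int  # same outcome on both passes

    @property
    def net(self) -> int:
        return self.fixed - self.broken


def compare_passes(before: list[bool], after: list[bool]) -> PairedChange:
    """Count per-item flips between the first pass and the verification pass."""
    if len(before) != len(after):
        raise ValueError("both passes must cover the same items")
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    broken = sum(1 for b, a in zip(before, after) if b and not a)
    return PairedChange(fixed, broken, len(before) - fixed - broken)


# HYPOTHETICAL item-level pattern; only the totals match the reported result.
before = [True] * 5 + [False] * 3
after = [True] * 4 + [False] * 4
change = compare_passes(before, after)
```

The same net change of one item could also arise from, say, two broken answers and one fixed answer, so the item-level transcripts are needed to tell these cases apart.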
Calibration. Stated confidence consistently exceeded actual accuracy. Confidence signals should not be surfaced to users as a reliability indicator without independent calibration.
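One simple way to quantify this kind of miscalibration is the gap between mean stated confidence and observed accuracy. The sketch below is a minimal illustration, not the harness's metric: the function name overconfidence_gap is hypothetical, and the example values are invented, not taken from the evaluation.

```python
def overconfidence_gap(confidences: list[float], correct: list[bool]) -> float:
    """Mean stated confidence minus observed accuracy.

    Positive values mean the model claims more certainty than its accuracy supports.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need one confidence per graded answer")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must be probabilities in [0, 1]")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# HYPOTHETICAL values for illustration only.
example_confidences = [0.9, 0.9, 0.8, 0.8]
example_correct = [True, False, True, False]
example_gap = overconfidence_gap(example_confidences, example_correct)
```

A single gap figure hides where the miscalibration sits; a binned comparison of confidence against accuracy would show whether overconfidence is concentrated at the high end of the stated range.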
Deployment & Regulatory Outlook
Where the model is fit for deployment. Error checking and QA of content the model did not produce. Structured-output steps in pipelines where format compliance is the primary requirement. Single-pass reasoning tasks without adversarial surfaces or sustained user contradiction.
Where deployment requires compensating controls. Multi-turn assistants exposed to user pushback. Retrieval-augmented workflows where retrieved content may contradict the model's correct prior. Any context where confident contradiction is plausible, including adversarial inputs, lightly prompted surfaces exposed to prompt injection, and high-stakes factual Q&A without independent grounding. Self-verification loops should be removed and replaced with independent external review.
Regulatory framing. The findings map directly to Article 15 of the EU AI Act (Accuracy, Robustness and Cybersecurity), the NIST AI RMF MEASURE function (MEASURE 2.7 on security and resilience, MEASURE 2.9 on model explanation, validation, and documentation), and ISO/IEC 42001 control points covering AI system performance and operational controls. The evaluation file is structured to be incorporated directly into a deployer's risk register and Article 26 deployer-obligation documentation.
About the Evaluation
The Sakshi benchmark is QualitaX's published contribution to measuring AI metacognition, submitted to Kaggle's Measuring Progress Toward AGI competition (Metacognition track). The enterprise validity layer (diagnostics, statistical analysis, robustness probes, and provenance retention) is QualitaX's proprietary harness for translating benchmark scores into deployment-ready evidence. Findings are reported with their uncertainty, and limitations are documented openly. Full configuration, transcripts, and validity-check audit trails are retained for reproducibility.
About QualitaX
QualitaX delivers independent pre-deployment evaluation methodologies for AI models deployed in regulated industries. The Sakshi metacognition benchmark surfaces the structural failure modes that accuracy benchmarks do not address; the enterprise validity layer translates findings into audit-grade documentation mapped to the EU AI Act, NIST AI RMF, and ISO/IEC 42001. We ship working systems, not strategy decks.
Technical delivery for problems that matter. qualitax.io