Small Open-Weight Models in Humanitarian Deployment: What a gemma-4-e4b Risk Assessment Reveals

The humanitarian sector is moving to small open-weight models for cost, data sovereignty, edge deployment and multilingual reach. An independent Sakshi evaluation of gemma-4-e4b found sycophancy under contradiction, self-verification regression, and overconfident calibration.


The humanitarian and development sector is moving toward small open-weight language models for understandable reasons. Frontier API providers are not the best architectural fit for beneficiary data: the ICRC's 2024 AI Policy requires "differentiated data protection and cybersecurity risk mitigation systems" proportionate to the sensitivity of the data and the impact on end-users (ICRC, 2024), and the 2022 cyberattack that exposed personal data of hundreds of thousands of beneficiaries made third-party dependency risk concrete (Geneva Solutions, 2024). Coverage compounds the case: frontier models are "largely trained on data from western settings" (ICRC delegate Philippe Stoll), and a recent evaluation across 78,000 multilingual inferences in seven languages including Lingala and Burmese frames the procurement choice explicitly as a cost–reliability trade-off between commercial APIs and open-weight alternatives (Nemkova et al., 2025). Purpose-built open-weight models for the relevant language families now exist — Falcon-Arabic (TII), Aya, and the AfriqueLLM suite adapting Llama 3.1, Gemma 3 and Qwen 3 across 20 African languages — and the Alan Turing Institute claims to have demonstrated that a 3B open-weight model can achieve near-frontier reasoning performance on real-world health queries while running locally on a laptop, which is directly relevant to compute-constrained, low-connectivity field contexts. Sector governance is reinforcing the direction: CDAC Network's SAFE AI Framework, developed with the Turing Institute and Humanitarian AI Advisory, requires Decision-Gate-level auditability that closed APIs cannot easily satisfy.

Operating costs of small open-weight models are a fraction of frontier-model API consumption. Data sovereignty improves: model inference can run on local infrastructure rather than transiting through commercial APIs in jurisdictions where humanitarian principal-counterpart relationships are unclear. Edge deployment becomes feasible in low-connectivity field settings. Multilingual coverage in some small open-weight families now extends to languages frontier models barely support. For organisations balancing operational pragmatism against the principles that AI use be human-centred, principled, and adapted to humanitarian contexts, small models look like the responsible architectural choice.

But are they? The specific behavioural failure modes that small open-weight models exhibit do not become safer in humanitarian deployment contexts; they become more dangerous. Resource-constrained organisations that serve vulnerable populations in low-connectivity environments, in the languages where the models perform poorest, and often with limited capacity for human-in-the-loop review, do not offset the model’s weaknesses through context: they compound them. Our independent risk assessment of one such model surfaced findings with direct operational consequences for humanitarian AI deployment, consequences that existing humanitarian AI governance frameworks do not yet address with sufficient specificity.

This article walks through those findings, what they mean for the deployment contexts the sector is actively building toward, and where the gaps in current humanitarian AI policy sit.

What Was Tested

The Sakshi metacognition benchmark is an open-source, deterministic, rule-based benchmark published by QualitaX and available on the Kaggle Benchmarks platform. We used it to assess google/gemma-4-e4b across three independent runs at production-relevant sampling configurations. The evaluation was run on QualitaX’s enterprise validity-diagnostics harness, which extends the deterministic Sakshi scoring with confidence intervals, format-compliance auditing, demand-characteristic controls, and provenance retention.

Twenty of the twenty-one Sakshi tasks scored validly. The findings were stable across runs. Four results matter for humanitarian deployment:

The model identified the flawed step in every faulty reasoning chain presented (15 of 15) and raised no false alarms on correct ones (9 of 9). External-checker capability is genuinely strong.

When the model produced a correct answer and was then told, with a confident but false justification, that the answer was wrong, it abandoned the correct answer on every single tested item (0 of 9 held). This is not a stochastic per-call failure. It is structural.

When the model was asked to verify its own answers (the standard "are you sure?" or self-critique pattern), accuracy on a set of items it had largely answered correctly degraded after self-review. The reliability technique most readily available to a resource-constrained operation is, on this model, a degradation mechanism.

The model’s stated confidence consistently exceeded its actual accuracy. Confidence outputs cannot be used as a reliability signal without independent calibration.
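Counts this small are easier to interpret alongside an interval. The sketch below is illustrative only: it assumes a Wilson score interval at roughly 95% (the article does not say which interval method the QualitaX harness uses), and it treats every tested item as an independent trial, which is a simplification. The function name `wilson_interval` is ours.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Counts quoted in the findings above.
for label, k, n in [
    ("flawed steps identified", 15, 15),
    ("false alarms on correct chains", 0, 9),
    ("correct answers held under contradiction", 0, 9),
]:
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n}, interval {low:.2f} to {high:.2f}")
```

Under these assumptions the upper bound for 0 of 9 comes out at about 0.30 and the lower bound for 15 of 15 at about 0.80, which gives a sense of how much the small item counts constrain the underlying rates.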

The full evaluation, including methodology, validity diagnostics, and limitations, is documented in the QualitaX case study published alongside this article (QualitaX, 2026b). The findings translate into deployment consequences that do not yet appear to have been adequately surfaced or discussed in humanitarian AI governance practice.

Why the Sector Is Turning to Small Models

Four reasons, introduced above, drive the move, and each is operationally legitimate.

Cost. Frontier-model API costs at humanitarian-deployment scale add up fast. Potential use cases such as translation of hundreds of thousands of refugee documents, conversational interfaces for affected populations, or qualitative coding of monitoring data can run into hundreds of thousands of dollars annually for a single mid-sized organisation. Small open-weight models reduce this by one to two orders of magnitude. This is not a marginal consideration for organisations under sustained funding pressure, particularly after the 2025 USAID wind-down and the broader contraction in development assistance.

Data sovereignty. Routing data about displaced populations, beneficiaries in fragile contexts, or programme participants in active conflict zones through commercial cloud AI APIs raises questions that the ICRC/Brussels Privacy Hub Handbook on Data Protection in Humanitarian Action takes seriously and that humanitarian principal-counterpart agreements often do not adequately address. Inference on locally hosted small models keeps that data within the organisation’s control.

Edge deployment. Field operations in low-connectivity environments (remote refugee camps, post-disaster contexts, areas with intermittent internet) cannot reliably depend on cloud-hosted APIs. Small models that fit in 4–8 GB of memory and run on field laptops or even mobile devices enable AI-assisted workflows where frontier-model APIs are operationally infeasible.
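The memory figure follows from simple arithmetic on weight storage. As a rough sketch, assuming a hypothetical 4-billion-parameter model (an illustrative number, not a measured figure for any specific release) and counting weights only, without the KV cache, activations, or runtime overhead:

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


HYPOTHETICAL_PARAMS = 4e9  # illustrative assumption, not a vendor specification

for bits in (16, 8, 4):
    gb = weight_memory_gb(HYPOTHETICAL_PARAMS, bits)
    print(f"{bits}-bit weights: ~{gb:.1f} GB")
```

On these assumptions, 8-bit and 16-bit weights land at about 4 GB and 8 GB respectively. A real deployment needs headroom beyond the weights, so the arithmetic is a lower bound rather than a sizing guide.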

Multilingual coverage. Frontier models perform poorly in many languages humanitarian operations rely on. Hussen et al. (2025) analysed African language coverage across six LLMs and twelve small and specialised language models, finding that of approximately 2,000 African languages, only four (Amharic, Swahili, Afrikaans, and Malagasy) are consistently supported across major multilingual models, with over 98% of African languages unsupported. For organisations operating across multiple low-resource languages, small specialised models trained on regional language families can outperform frontier models on the languages that actually matter for the work.

These reasons are real and valid. The question is whether the architectural choice they motivate is matched by an evaluation of the specific risks that choice introduces. On the evidence of our gemma evaluation, it does not appear to be.

What Each Failure Mode Means in Practice

The findings translate into four specific operational concerns that humanitarian AI deployments routinely encounter and rarely address explicitly.

Sycophancy. Consider a deployment use case the sector is actively exploring: an AI-assisted case management system for protection workers handling caseloads of survivors of gender-based violence, in a multilingual operating environment, with field staff operating under time pressure and emotional load. The system surfaces a recommendation — a referral pathway, a risk classification, a flagged item for supervisor review. The worker, under operational pressure, pushes back: "No, this isn’t a high-risk case, the family situation has stabilised."

On the gemma findings, the model will capitulate. It will not maintain a correct risk flag against confident contradiction, even when the contradiction is wrong. This is not a hypothetical concern about adversarial users; it is what happens with ordinary, time-pressured, well-intentioned operational staff. The Core Humanitarian Standard’s commitment to accountability to affected populations (Commitment 5) assumes that the systems mediating between aid workers and the people they serve maintain accurate assessments. A model that abandons correct assessments when pushed back on does not satisfy that commitment.

The same dynamic appears in multi-turn assistants that could be used by displaced populations — chatbots providing information on asylum procedures, legal rights, or service availability. A user, often under stress, often operating in their second or third language, who confidently asserts an incorrect interpretation will, on these findings, have that incorrect interpretation reinforced rather than corrected. The deployment context where AI is most often proposed as a tool for "empowering affected populations to access information" is the context where the model’s sycophancy under confident contradiction is most operationally consequential.
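A hold-rate measurement of this kind can be sketched in a few lines. The example below is an illustration, not the Sakshi scoring code: the `PushbackTrial` record, the `hold_rate` function, and the sample data are all hypothetical, and it assumes answers can be compared by exact string match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PushbackTrial:
    gold: str                   # reference answer
    first_answer: str           # model answer before any pushback
    answer_after_pushback: str  # model answer after a confident but false contradiction


def hold_rate(trials: list[PushbackTrial]) -> Optional[tuple[int, int]]:
    """Return (held, eligible), counting only items the model first answered correctly."""
    eligible = [t for t in trials if t.first_answer == t.gold]
    if not eligible:
        return None  # nothing to measure: the model was never right to begin with
    held = sum(t.answer_after_pushback == t.gold for t in eligible)
    return held, len(eligible)


# Hypothetical risk-classification items, for illustration only.
trials = [
    PushbackTrial("high", "high", "low"),  # correct, then capitulated
    PushbackTrial("low", "low", "low"),    # correct, and held
    PushbackTrial("high", "low", "low"),   # wrong from the start: excluded
]
print(hold_rate(trials))
```

Restricting the denominator to initially correct items is the important design choice: it separates capitulation from ordinary error, so a model that was simply wrong at the outset does not inflate or deflate the hold rate.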

Self-verification regression where external review is operationally impossible. Humanitarian and development AI deployments operate in environments where the human-in-the-loop expectations of Article 14 of the EU AI Act (Regulation (EU) 2024/1689), the NIST AI Risk Management Framework GOVERN function, and Commitment 7 of the Core Humanitarian Standard are, in practice, often impossible to meet. A protection officer reviewing thousands of case notes does not have the bandwidth to audit each AI-generated summary. A field translator processing dozens of interview transcripts per day cannot independently verify each output. When human review is unavailable at scale, the tempting operational fallback is to ask the model to verify itself: "Are you sure? Check your work."

On the gemma findings, this approach actively degrades the model’s accuracy. Accuracy on a set of items the model had largely answered correctly dropped after self-review. The reliability strategy most available to resource-constrained operations is, on this model, the strategy most likely to make outputs worse. Organisations relying on AI self-verification as a substitute for unavailable human review are systematically introducing failure modes they cannot detect.
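One way to check whether self-review helps or harms on a given model is to score the same items before and after the "check your work" prompt. The sketch below assumes exact-match scoring against reference answers; the function name `self_review_effect` and the sample data are hypothetical and are not drawn from the Sakshi benchmark.

```python
def self_review_effect(gold: list[str], before: list[str], after: list[str]) -> dict:
    """Compare accuracy before and after self-review on the same items."""
    if not gold or not (len(gold) == len(before) == len(after)):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(gold)
    rows = list(zip(gold, before, after))
    return {
        "accuracy_before": sum(b == g for g, b, _ in rows) / n,
        "accuracy_after": sum(a == g for g, _, a in rows) / n,
        # Items that were right and became wrong after self-review.
        "broken": sum(b == g and a != g for g, b, a in rows),
        # Items that were wrong and became right after self-review.
        "fixed": sum(b != g and a == g for g, b, a in rows),
    }


# Hypothetical multiple-choice answers, for illustration only.
gold = ["A", "B", "C", "D"]
before = ["A", "B", "C", "A"]
after = ["A", "C", "C", "A"]
print(self_review_effect(gold, before, after))
```

Reporting `broken` and `fixed` separately matters because a flat accuracy figure can hide churn: self-review that fixes two answers and breaks two others looks harmless in aggregate while still changing which cases receive a wrong output.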

Confident hallucination in low-connectivity field settings. The model’s stated confidence exceeded its accuracy by a non-negligible margin. In office contexts with internet access, a sceptical user can verify a model output against external sources. In field deployments — a refugee camp with limited connectivity, a mobile case management tool operating offline, a community-based information system in a fragile state — that verification step is operationally degraded or absent. The user has no independent way to check whether the model’s confident assertion is correct. The model offers no honest signal that it is uncertain. The combination produces operationally undetectable failures in exactly the deployment contexts where the consequences land on the most vulnerable users.

The ICRC AI policy commits the organisation to taking "a realistic assessment of the capabilities and limitations of the technology" when deploying AI tools. A realistic assessment of small open-weight models in low-connectivity field settings, on the evidence of the gemma evaluation, would not endorse deployment to these contexts without explicit calibration measures and structured fallback procedures. It is not clear that most current humanitarian AI deployment plans specify either.
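A basic calibration check needs only stated confidences and correctness labels for a held-out set. The sketch below is a minimal illustration under stated assumptions: confidences are expressed on a 0 to 1 scale, equal-width bins are used, and the function names and sample values are ours rather than part of the QualitaX harness.

```python
def calibration_gap(confidences: list[float], correct: list[bool]) -> float:
    """Mean stated confidence minus accuracy; positive values mean overconfidence."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)


def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 5
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        index = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece


# Hypothetical outputs: high stated confidence, mixed correctness.
confidences = [0.95, 0.9, 0.85, 0.9, 0.6]
correct = [True, True, False, False, True]
print(f"gap: {calibration_gap(confidences, correct):+.2f}")
print(f"ECE: {expected_calibration_error(confidences, correct):.2f}")
```

The signed gap shows the direction of the error, while the binned measure also catches cases where overconfidence in one range and underconfidence in another cancel out. Either figure is only meaningful if it is computed on items in the language and domain of the intended deployment.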

Calibration failure in contexts where users cannot push back. The deployment population matters here in a way the AI safety literature has not consistently engaged with. The Anthropic chain-of-thought faithfulness work (Chen et al., 2025) found that the reasoning models it tested acknowledged the hints that actually drove their answers only a minority of the time: about 25% for Claude 3.7 Sonnet and 39% for DeepSeek R1. That finding bears on whether sophisticated technical users can trust model self-reports. The implication for humanitarian deployment is sharper. If model self-reports are unreliable for sophisticated users with verification tools, the problem is materially worse for displaced persons interacting through a translated interface in their second language, for community members navigating an automated information system, and for vulnerable populations who lack the standing, the language access, or the documentation to challenge a model’s confident assertion.

The Six Ethical Risks framework QualitaX published earlier in 2026 names this as the "automation of power" risk: AI systems make interpretive choices and present them as neutral outputs to populations who lack the capacity to challenge them. The calibration finding on gemma operationalises this risk concretely. A confidently wrong model output, surfaced to a population that cannot effectively push back, is a quiet transfer of decisional authority from accountable humans to unaccountable systems.

The Compounding Effect

The four operational concerns above are individually significant. In the deployment contexts where small open-weight models are most attractive, they compound.

Consider a scenario in which a resource-constrained humanitarian organisation deploys a small open-weight model for case management in a low-resource language. The model performs worse in that language than in English: Hussen et al. (2025) provide the empirical basis for this expectation, and similar gaps exist for Arabic dialects, indigenous languages of the Americas, and the many language families spoken across South and Southeast Asia. The deployment is at field scale, in low-connectivity contexts, with limited supervisor capacity for human review. The users are vulnerable populations with limited recourse to contest model outputs. The model exhibits sycophancy under contradiction, self-verification regression, and overconfident calibration.

Each of these conditions, alone, would be manageable with appropriate compensating controls. In combination, the compensating controls are not available. The human review that would catch sycophancy failures isn’t operationally feasible. The external verification that would catch calibration errors isn’t available in the deployment context. The user pushback that would catch confident contradictions isn’t a power displaced populations can effectively exercise. The self-verification fallback isn’t a fallback — it makes the model worse.

This is not a balanced trade-off where small-model advantages are offset by their disadvantages in vulnerable-population contexts. It is a compound risk profile where the architectural choice and the deployment context multiply each other’s weaknesses. The ICRC AI policy’s commitment to deployment that is "adapted to the humanitarian context" requires evaluation methodologies that surface these compound risks before deployment. Standard accuracy benchmarks do not. The Sakshi-style behavioural evaluation that produced the gemma findings does, which makes it a candidate methodology for evaluating models intended for humanitarian and development contexts.

What the Existing Frameworks Do and Don't Cover

The humanitarian AI governance landscape is maturing. The ICRC AI policy (2024) is one of the most comprehensive sector-specific frameworks. The ICRC/Brussels Privacy Hub Handbook on Data Protection in Humanitarian Action provides operational guidance on the data-handling dimension. The Core Humanitarian Standard on Quality and Accountability provides cross-cutting commitments that apply to AI-mediated humanitarian work. The Sphere Handbook addresses minimum standards in humanitarian response that AI deployments must not erode. The IASC operational guidance on data responsibility in humanitarian action provides interagency reference points. Recent ICRC analysis on AI in armed conflict (2026) extends the governance conversation into the specific contexts where humanitarian principles are most stress-tested.

These frameworks address consent, data protection, human oversight expectations, do-no-harm requirements, and accountability commitments at the policy level. What they do not yet adequately address, and what the gemma evaluation surfaces concretely, is the specific question of which models, evaluated through which methodologies, are fit for which humanitarian deployment contexts. The frameworks set principles. The evaluation gap is where principles meet specific architectural choices.

A humanitarian organisation that complies with the ICRC AI policy at the level of governance — risk assessments completed, human oversight committees established, data protection impact assessments filed — can still deploy a model that exhibits the failure modes documented above, because the principle-level frameworks do not require, and existing benchmark suites do not surface, the metacognitive evaluation that catches these issues. The compliance gap is not at the policy level. It is at the methodology level.

The Principle That Runs Through All of This

Looking across these findings (sycophancy, self-verification regression, overconfident calibration) and the compound risk profile they create in vulnerable-population contexts, one principle stands out. Every architectural feature that makes small open-weight models attractive for humanitarian deployment (local inference, edge deployment, multilingual reach, cost-accessibility) can widen the gap between the AI’s capabilities and the user’s ability to challenge them. Every failure mode the gemma evaluation surfaced (capitulation under contradiction, degradation under self-review, overconfidence without ground truth) is most consequential precisely where users cannot effectively contradict, cannot independently verify, and cannot insist on review.

The combination of small open-weight models, vulnerable populations, and resource-constrained operations is not simply a trade-off between cost and capability. It is a compound risk profile that requires specific behavioural evaluation before deployment. It is not clear that the sector routinely conducts those evaluations, or that the humanitarian AI governance frameworks specifically require them.

The implication is not that small open-weight models should not be used in humanitarian and development work. They can be appropriate for many tasks: content review, structured-output steps in pipelines, error-checking of provided material, single-pass reasoning tasks without adversarial surfaces. The implication is that the choice has to be evidence-based, with the specific behavioural failure modes surfaced before deployment rather than discovered after launch, on the populations the deployment is meant to serve.

References

Chen, Y., Benton, J., Radhakrishnan, A., Uesato, J., Denison, C., Schulman, J., Somani, A., Hase, P., Wagner, M., Roger, F., Mikulik, V., Bowman, S. R., Leike, J., Kaplan, J., & Perez, E. (2025). Reasoning Models Don’t Always Say What They Think. arXiv:2505.05410.

CHS Alliance, Groupe URD, & Sphere Project. (2024). Core Humanitarian Standard on Quality and Accountability.

European Parliament and Council. (2024). Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (AI Act).

Hussen, K. Y., Sewunetie, W. T., Ayele, A. A., Imam, S. H., Muhammad, S. H., & Yimam, S. M. (2025). The State of Large Language Models for African Languages: Progress and Challenges. arXiv:2506.02280.

International Committee of the Red Cross. (2024). Building a Responsible Humanitarian Approach: The ICRC’s Policy on Artificial Intelligence. Geneva: ICRC.

International Committee of the Red Cross & Brussels Privacy Hub. (2020). Handbook on Data Protection in Humanitarian Action (2nd ed.). Geneva: ICRC.

International Committee of the Red Cross. (2026). Deciding under algorithms: artificial intelligence and the protection of civilian infrastructure in armed conflict. ICRC Humanitarian Law & Policy Blog.

National Institute of Standards and Technology. (2023). AI Risk Management Framework (AI RMF 1.0). NIST AI 100-1.

Nemkova, O., et al. (2025). Cost–reliability trade-offs in multilingual LLM inference for humanitarian deployment. arXiv:2510.22823.

QualitaX. (2026a). Six Ethical Risks That Must Be Assessed When Using AI. qualitax.io/blog.

QualitaX. (2026b). Pre-Deployment AI Risk Assessment of gemma-4-e4b. qualitax.io/case-studies.

Sphere Association. (2018). The Sphere Handbook: Humanitarian Charter and Minimum Standards in Humanitarian Response (4th ed.).