Women, Peace and Security Frameworks Must Apply to Defense AI

by Moira Whelan, Jesper Frant / Apr 21, 2026
Moira Whelan and Jesper Frant serve as fellows for Our Secure Future.

This post was originally posted at https://www.techpolicy.press/women-peace-and-security-frameworks-must-apply-to-defense-ai

AI tools are already operational in multiple conflict zones. The headlines are filled with examples, and a recent report by the Brennan Center for Justice at NYU Law details the extent of the deployment of these tools. The US military has used Project Maven to identify targets for strikes in Iraq, Syria, Yemen, and Ukraine. In Gaza, Israeli forces have relied on AI-generated intelligence to inform strikes that killed scores of civilians, and Claude was reportedly used by US forces during a raid on Venezuela and in strikes on Iran.

States participating in these conflicts have adopted Women, Peace and Security (WPS) frameworks that inform how security decisions are made, but there is no indication those commitments have been extended to the AI systems now informing those same decisions. According to our research, commercially available large language models (LLMs)—the same foundation models now being deployed in defense contexts—systematically fail to operationalize WPS standards, let alone others. How, then, can we be assured that AI systems used in conflict are complying with existing obligations?A central recommendation of the Brennan Center report is to strengthen AI testing, expanding operational evaluation and restoring capacity gutted by recent cuts. But the report stops short of defining what exactly should be tested. Its examples of testing failures are exclusively technical. A next step could be to assess whether these systems produce output that complies with the policy frameworks that already govern the conflicts into which they’re deployed.

This assessment is exactly what drove Our Secure Future’s focus on technology through Project Delphi and the Women, Peace and Security and Technology Futures report. Building on this work, our months-long study of AI systems concluded that AI models customized and evaluated with a robust WPS perspective will deliver higher accuracy in high-stakes, real-world conflict and humanitarian scenarios. We found that models informed by WPS data and policy frameworks reduce operational and strategic blind spots, and enable end-users to make faster, better-informed decisions, because they draw upon more comprehensive, community-wide, and policy-informed information.

Furthermore, our research found that when WPS language is omitted from AI prompts—mimicking the sparse format of actual field situation reports and intelligence briefs—model performance on WPS integration drops by nearly 90 percent. When the models fail to consider women in their analysis, it means the actions they recommend do not factor in these populations. That can have real consequences for, in this case, over 50 percent of the global population.

To come to this conclusion, we tested three leading AI models across 13 conflict scenarios at three levels of contextual detail. When prompts explicitly named affected populations (displaced women, female ex-combatants, women-led organizations), average scores were 0.65 out of 1.0. When prompts used the minimal formats practitioners actually use, the same models scored 0.08. A score below 0.2 indicates the model failed to surface any WPS-based analysis. Trust Building—whether the model recommended engaging affected communities—collapsed from 0.71 to 0.22, a 69 percent decline.

This is not a hypothetical gap. It is a measured disconnect between what WPS commitments require and what deployed AI tools actually produce. It is also an indicator of a seriously flawed system in use by militaries today. As these systems become more capable and more integrated into operational decision-making, the gap will only widen unless proactive measures are taken. Our research demonstrates that closing this gap is technically possible.

AI tools in conflict are failing decision-makers

In July 2025, an Institute for Integrated Transitions (IFIT) AI on the Frontline study tested LLMs on conflict resolution scenarios and found structural performance failures across the board—concluding that current AI models are not fit for high-stakes peace and security decision-making without significant intervention. Critically, a follow-up study found that adding a structured prompt—instructing models to follow basic conflict resolution best practices before responding—increased average scores by 65 percent. IFIT recommends embedding such guidance directly into system prompts, an approach consistent with what Anthropic CEO Dario Amodei terms “Constitutional AI”, which leverages a defined set of principles to align model behavior. This is possible, but we have no evidence to suggest that this is taking place.

Our research independently replicates those findings using a distinct scenario set focused on WPS-relevant conflict contexts and a WPS-specific scoring rubric. Across our MVP agent customization experiment and a WPS AI Benchmark that we tested using Weval.org, we found the same structural failures IFIT documented. “Due Diligence”—whether models recommended consulting affected communities and gathering context before responding—remained consistently low for out-of-the-box AI models that are widely available to the public. The convergence between IFIT’s conflict resolution evaluation and our WPS benchmark establishes these as characteristics of current LLMs in conflict contexts, not artifacts of any single methodology. The problem for decision-makers is plain to see. They are increasingly being directed to use tools that simply do not adhere to existing policies, but with existing processes such as benchmarking and developing agents, we could see better informed decisions in conflict and peace building scenarios.

The WPS competence gap

Our research extends those findings by applying a WPS lens: identifying a specific, quantifiable compliance gap and a documented path to closing it.

We call this the WPS competence gap—the measurable performance drop AI models show when WPS language is absent from operational prompts. No model in our evaluation surfaced WPS considerations unless prompted with explicit contextual cues. This matters because field situation reports, intelligence summaries, and policy briefs rarely contain that framing. This is compounded by the fact that AI tools are designed to produce mid-grade answers. For example, if you ask an AI tool to write a book report, it is likely to give you a “C” grade product, not an “A+”. It is even less likely to give you an analysis of how the dynamics of female characters in the book influence the plot…unless it is directed to do so. In conflict, this means decision makers are getting predictable answers, not doctrinal creativity. It is a behavioral default that has not been reconfigured to meet peace and security standards and models do not apply a WPS lens because nothing requires them to do so.

The compliance framework to close this gap already exists. WPS frameworks—grounded in UN Security Council Resolution 1325 and implemented through National Action Plans (NAPs) in over 100 countries—establish commitments for how conflict and security operations should account for women, protect civilian populations, and include affected communities in decision-making. NATO has integrated WPS into doctrine. The US, UK, and most major allied defense establishments have signed NAPs that apply to their operations. Yet we have seen no indication that procurement and deployment of AI tools integrate this doctrine into technical requirements. The tools decision-makers rely on are not built on the same standards on which they have been professionally trained.

Closing the gap is a configuration problem, not a capability problem

Organizations evaluating AI vendors for conflict-relevant applications should not just be asking whether a model is generally capable. They should be asking whether it has been configured and validated against their own policies and standards—and demanding evidence.

Our experiment tested four configurations of the same model against a common prompt, evaluated by AI judges and confirmed by a WPS expert review. The results show a clear customization ladder:

ConfigurationPerformance
Off-the-shelf — standard chatbot, no customizationC– / D — generic output, omits WPS
+ WPS instructions — detailed system prompt with principles on how to apply a WPS lens; no added knowledgeB — mentions women, thin on evidence and policy depth
+ Retrieval augmentation with evidence base — connected to curated WPS research and field case studiesB+ — substantive analysis grounded in real evidence
+ Retrieval augmentation with National Action Plans — connected to country-specific policy commitmentsA — policy-aligned, WPS KPIs

Our WPS AI Benchmark takes this analysis one step further, making it an effective mechanism to operationalize WPS compliance as a procurement requirement. Models are scored against a standardized WPS scenario set using a structured rubric, yielding measurable, nuanced, and operational evidence that can be used to improve model compliance. Defense organizations and other entities with WPS obligations should be writing this benchmark into their contractual requirements to specify minimum performance thresholds rather than accepting generic vendor claims of “ethical AI” alignment.

The recent standoff between Anthropic and the Pentagon over autonomous weapons and mass domestic surveillance illustrates a broader structural problem: when AI vendors and defense organizations negotiate deployment boundaries, those conversations tend to play out in terms of broad use policies and ethical principles—not measurable, domain-specific performance standards. Without a shared benchmark, procuring organizations have no way to specify what compliant output actually looks like, and vendors have no way to demonstrate it. A WPS benchmark changes that equation. The burden can then shift to the vendor to prove the models they provide actually meet the operational requirements their customers have already committed to.

The default is already a choice

One useful framing likens AI governance to brakes on a fast-moving car—necessary, but always reactive, always trailing the technology. But in WPS-governed contexts, the problem isn’t speed. It’s that the car was never engineered for the road it’s on. Brakes slow it down, but they don’t prevent it from producing structurally flawed output. When an AI system defaults to analysis that omits women, girls and boys in a conflict environment, adding oversight after the fact doesn’t fix the system—it just adds a review layer on top of output that was wrong from the start.

The phrase “oversight hasn’t caught up” frames the gap as a timing problem—as if the standards don’t yet exist and organizations just need more time to develop them. But the standards do exist. The WPS competence gap is not a governance failure that better brakes can catch. It is a design failure: the organizations that wrote the procurement specs, chose the vendors, and decided what standards to require did not require WPS compliance. It is an omission with measurable consequences for operational effectiveness and the protection of civilian populations.

How do we fix it?

Our Secure Futures research is ongoing. The WPS AI Benchmark is an open evaluation framework—the scenario setevaluation criteria, and methodology are publicly available.

A first step would be to use this model to expand into other areas that govern conflict such as broader human security and to require through laws and policies that procurements adhere to existing standards with evidence produced to confirm this.

Second, advisors within organizations need to become experts. Sadly in the case of our work, it is increasingly clear that commanders are relying more on AI tools than the WPS advisors that exist in the command structure. This is something decision-makers can fix. Training, empowering, and resourcing WPS Advisors to concentrate their energy on influencing the AI tools would not only produce better decision-making almost immediately but would serve as an organizational model for other areas such as human security, humanitarian response and localization.

Third, we know commanders rely on AI tools for speed, but experiments such as this one took hours, not months. Empowering academic partners and outside groups to test assumptions—just as is done in doctrine development—is critical to the process.

We have an important role to play by building benchmarks to evaluate the operational readiness and effectiveness of LLMs. Jack Clark, co-founder of Anthropic, recently said: “Give us a goal. The AI industry is excellent at trying to climb to the top of benchmarks. Come up with benchmarks for the public good that you want.” It’s clear that AI has already entered the battlefield, but humans are still in control. The decision about which humans are empowered to influence the direction of AI systems that can determine war and peace needs to be made now.

NOTE: This post references results for the fourth iteration of its benchmark. Those results can be accessed at weval.org. The WPS AI Agent is available for demonstration at https://wps-agent.streamlit.app.

A light bulb, as screwed into a lamp to light a room. Depicted as an incandescent bulb with a silver base, often shown with filament and a soft, yellow-white glow. Commonly used to represent ideas (as over a head in a cartoon), thinking, and learning, often as paired with 🤔 Thinking Face or 💭 Thought Balloon. May also represent various senses of light and brightness.

The Easy Read Generator Is Live

Click here to download an Easy Read version of this blog post.

Five years ago, a team of NDI colleagues pitched an idea called “Right To Know” at an internal innovation competition, the culminating project of an internal course on Democracy and Technology (DemTech 1000) I organized. The concept, led by Whitney Pfeifer, was straightforward: build a tool that could translate complex civic documents into Easy Read format—short sentences, plain language, paired with clear illustrations—so that people with intellectual disabilities could access the same information as everyone else. The team won, the idea got a small innovation grant, and what followed was a long, winding road to a working product that I’m only now finally able to share.

The Easy Read Generator is now officially a thing!

What Easy Read Is

Easy Read is a method of presenting information in a format that’s easier to understand. It combines simple language with images that reinforce the meaning of each sentence. It’s valuable for people with intellectual disabilities, low literacy levels, or limited fluency in the language being used—but it’s also just good communication practice more broadly.

Article 21 of the UN Convention on the Rights of Persons with Disabilities guarantees the right to accessible information. In practice, though, Easy Read materials are expensive and time-consuming to produce, which means they’re rarely created—especially in lower-income countries where the need is greatest and the resources are thinnest.

An Idea Ahead of Its Time

The “Right To Know” pitch happened in 2021—more than a year before ChatGPT launched and kicked off the modern era of generative AI. The team envisioned a tool that could take dense policy language and automatically simplify it, but the technology to do that reliably didn’t exist yet. When ChatGPT arrived in late 2022, the concept Whitney’s team had imagined suddenly became technically plausible. With the innovation grant, we built a first version: a static site at easyread.demcloud.org with detailed instructions on how to use generative AI tools to accelerate Easy Read document creation.

In October 2024, I traveled to Nairobi, Kenya, to facilitate a human-centered design workshop with representatives from disabled people’s organizations (DPOs) including the United Disabled Persons of Kenya, the Kenya Association of the Intellectually Handicapped, the Down Syndrome Society of Kenya, and several others. Over two days, we tested assumptions about accessible information, explored what generative AI could and couldn’t do, and collaboratively designed the features an Easy Read generator tool would need.

One moment from that workshop has stayed with me. A teacher who supports students with Down Syndrome said: “I wish I knew about this before. This will help a lot. I struggle to break down complex jargon into understandable information. With this tool, that work becomes easier.”

Continuing After the Layoff

In January 2025, I was laid off from NDI after nearly 11 years. The Easy Read Generator was not finished. The workshop participants had given us a clear mandate and a thoughtful design, and I had made commitments to them and to the DPOs we were working with. I continued the work on my own and worked with University of Maryland students who contributed concepts for Easy Read Generator’s UX redesign.

The Image Problem

Most people who encounter Easy Read for the first time assume the images are supplementary—nice to have, but not essential. They’re not. In Easy Read, each illustration exists to support the comprehension of a specific sentence. If the image doesn’t clearly represent the concept in the text, it can actually make the document harder to understand, which is the opposite of the goal.

When I first tried to build the generator, I assumed AI image generation would handle this, but current AI image generators are weak at producing the kind of clear, simple illustrations that Easy Read requires. The images they generate tend to be too detailed, too stylistically inconsistent, too prone to visual noise, and often imbued with cultural biases that undermines comprehension. Closing that gap would have meant training a custom image generation model—far beyond what I could take on as a solo developer working on a civic tech side project.

That failure stalled the project for months. I tried multiple approaches, multiple tools, multiple prompting strategies. None of them produced images I’d feel comfortable putting in front of the people this tool is meant to serve.

Selecting Instead of Generating

The thing that eventually unblocked the project was a shift in approach. Instead of asking AI to generate images, I started asking it to select them.

I built a keyword-mapped image library—a JSON file containing 564 keywords mapped to 186 unique illustrations drawn from three open-licensed sources:

  • Mulberry Symbols—a widely-used symbol set designed for augmentative and alternative communication (AAC), licensed under CC BY-SA 2.0 UK
  • OpenMoji—an open-source emoji library with clean, consistent line art, licensed under CC BY-SA 4.0
  • NDI’s Easy Read Online Dictionary—illustrations collected through NDI’s own Easy Read program, licensed under CC BY-SA 4.0

When a user pastes text into the Easy Read Generator, the LLM does two things: it simplifies the language into short, clear sentences, and it matches each sentence to the most appropriate illustration from the library using the keyword map. The AI isn’t creating images—it’s making selections from a curated set of symbols that were designed for this purpose by people who understood accessible communication.

The library doesn’t cover every possible concept, and some matches are better than others. But every image in the output was created by designers who understand accessibility, not hallucinated by a model optimizing for visual plausibility.

Where This Leaves Me

The tool I shipped is not what I originally envisioned. It’s simpler, more constrained, and more honest about what current AI tools can and can’t do. I think it’s better for it. My earlier attempts were too ambitious, and the image generation requirement exceeded what the technology could responsibly deliver. Stripping back to the core problem—simplify text, match it to existing illustrations—turned out to be enough.

After contributions from countless people, I’m relieved that I was finally able to deliver a working prototype. The Easy Read Generator will remain free to use, no login required, as long as I’m able to host and improve it. If this tool is useful to you or your organization, consider supporting the project.