Women, Peace and Security Frameworks Must Apply to Defense AI

by Moira Whelan, Jesper Frant / Apr 21, 2026
Moira Whelan and Jesper Frant serve as fellows for Our Secure Future.

This post was originally posted at https://www.techpolicy.press/women-peace-and-security-frameworks-must-apply-to-defense-ai

AI tools are already operational in multiple conflict zones. The headlines are filled with examples, and a recent report by the Brennan Center for Justice at NYU Law details the extent of the deployment of these tools. The US military has used Project Maven to identify targets for strikes in Iraq, Syria, Yemen, and Ukraine. In Gaza, Israeli forces have relied on AI-generated intelligence to inform strikes that killed scores of civilians, and Claude was reportedly used by US forces during a raid on Venezuela and in strikes on Iran.

States participating in these conflicts have adopted Women, Peace and Security (WPS) frameworks that inform how security decisions are made, but there is no indication those commitments have been extended to the AI systems now informing those same decisions. According to our research, commercially available large language models (LLMs)—the same foundation models now being deployed in defense contexts—systematically fail to operationalize WPS standards, let alone other policy frameworks. How, then, can we be assured that AI systems used in conflict are complying with existing obligations?

A central recommendation of the Brennan Center report is to strengthen AI testing, expanding operational evaluation and restoring capacity gutted by recent cuts. But the report stops short of defining what exactly should be tested. Its examples of testing failures are exclusively technical. A next step could be to assess whether these systems produce output that complies with the policy frameworks that already govern the conflicts into which they’re deployed.

This assessment is exactly what drove Our Secure Future’s focus on technology through Project Delphi and the Women, Peace and Security and Technology Futures report. Building on this work, our months-long study of AI systems concluded that AI models customized and evaluated with a robust WPS perspective will deliver higher accuracy in high-stakes, real-world conflict and humanitarian scenarios. We found that models informed by WPS data and policy frameworks reduce operational and strategic blind spots, and enable end-users to make faster, better-informed decisions, because they draw upon more comprehensive, community-wide, and policy-informed information.

Furthermore, our research found that when WPS language is omitted from AI prompts—mimicking the sparse format of actual field situation reports and intelligence briefs—model performance on WPS integration drops by nearly 90 percent. When the models fail to consider women in their analysis, it means the actions they recommend do not factor in these populations. That can have real consequences for, in this case, over 50 percent of the global population.

To come to this conclusion, we tested three leading AI models across 13 conflict scenarios at three levels of contextual detail. When prompts explicitly named affected populations (displaced women, female ex-combatants, women-led organizations), average scores were 0.65 out of 1.0. When prompts used the minimal formats practitioners actually use, the same models scored 0.08. A score below 0.2 indicates the model failed to surface any WPS-based analysis. Trust Building—whether the model recommended engaging affected communities—collapsed from 0.71 to 0.22, a 69 percent decline.

This is not a hypothetical gap. It is a measured disconnect between what WPS commitments require and what deployed AI tools actually produce. It is also an indicator of a seriously flawed system in use by militaries today. As these systems become more capable and more integrated into operational decision-making, the gap will only widen unless proactive measures are taken. Our research demonstrates that closing this gap is technically possible.

AI tools in conflict are failing decision-makers

In July 2025, an Institute for Integrated Transitions (IFIT) AI on the Frontline study tested LLMs on conflict resolution scenarios and found structural performance failures across the board—concluding that current AI models are not fit for high-stakes peace and security decision-making without significant intervention. Critically, a follow-up study found that adding a structured prompt—instructing models to follow basic conflict resolution best practices before responding—increased average scores by 65 percent. IFIT recommends embedding such guidance directly into system prompts, an approach consistent with what Anthropic calls “Constitutional AI,” which leverages a defined set of principles to align model behavior. This is possible, but we have no evidence to suggest that it is taking place.
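
To make that concrete, here is a minimal sketch of what embedding such guidance in a system prompt could look like. The principles listed are illustrative stand-ins (not IFIT's published guidance or any vendor's actual constitution), and the message structure simply mirrors the chat format most model APIs accept.

```python
# Minimal sketch: embedding conflict-resolution best practices in a system prompt.
# The principles below are illustrative stand-ins, not IFIT's published guidance.

CONFLICT_PRACTICE_PRINCIPLES = [
    "Ask clarifying questions before recommending action.",
    "Identify affected populations, including women and girls, and how they are impacted differently.",
    "Surface contingencies, trade-offs, and sequencing considerations.",
    "Recommend consulting affected communities and local women-led organizations.",
    "Disclose risks, uncertainty, and gaps in the available information.",
]

def build_system_prompt(principles: list[str]) -> str:
    """Assemble a structured system prompt from a list of principles."""
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(principles))
    return (
        "You are advising on a peace and security scenario. "
        "Before responding, apply the following practices:\n" + numbered
    )

# A sparse, field-report-style user prompt of the kind practitioners actually send.
situation_report = "Checkpoint reported closed near the displacement camp. Advise next steps."

messages = [
    {"role": "system", "content": build_system_prompt(CONFLICT_PRACTICE_PRINCIPLES)},
    {"role": "user", "content": situation_report},
]
print(messages[0]["content"])
```

The point is not the specific wording but that the guidance travels with every query, rather than depending on an analyst remembering to type it.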

Our research independently replicates those findings using a distinct scenario set focused on WPS-relevant conflict contexts and a WPS-specific scoring rubric. Across our MVP agent customization experiment and a WPS AI Benchmark that we tested using Weval.org, we found the same structural failures IFIT documented. “Due Diligence”—whether models recommended consulting affected communities and gathering context before responding—remained consistently low for out-of-the-box AI models that are widely available to the public. The convergence between IFIT’s conflict resolution evaluation and our WPS benchmark establishes these as characteristics of current LLMs in conflict contexts, not artifacts of any single methodology. The problem for decision-makers is plain: they are increasingly directed to use tools that simply do not adhere to existing policies. Yet with existing processes such as benchmarking and agent development, we could see better-informed decisions in conflict and peacebuilding scenarios.

The WPS competence gap

Our research extends those findings by applying a WPS lens: identifying a specific, quantifiable compliance gap and a documented path to closing it.

We call this the WPS competence gap—the measurable performance drop AI models show when WPS language is absent from operational prompts. No model in our evaluation surfaced WPS considerations unless prompted with explicit contextual cues. This matters because field situation reports, intelligence summaries, and policy briefs rarely contain that framing. The problem is compounded by the fact that AI tools are designed to produce mid-grade answers. For example, if you ask an AI tool to write a book report, it is likely to give you a “C” grade product, not an “A+”. It is even less likely to give you an analysis of how the dynamics of female characters in the book influence the plot…unless it is directed to do so. In conflict, this means decision-makers are getting predictable answers, not doctrinal creativity. It is a behavioral default that has not been reconfigured to meet peace and security standards: models do not apply a WPS lens because nothing requires them to do so.

The compliance framework to close this gap already exists. WPS frameworks—grounded in UN Security Council Resolution 1325 and implemented through National Action Plans (NAPs) in over 100 countries—establish commitments for how conflict and security operations should account for women, protect civilian populations, and include affected communities in decision-making. NATO has integrated WPS into doctrine. The US, UK, and most major allied defense establishments have signed NAPs that apply to their operations. Yet we have seen no indication that procurement and deployment of AI tools integrate this doctrine into technical requirements. The tools decision-makers rely on are not built on the same standards on which they have been professionally trained.

Closing the gap is a configuration problem, not a capability problem

Organizations evaluating AI vendors for conflict-relevant applications should not just be asking whether a model is generally capable. They should be asking whether it has been configured and validated against their own policies and standards—and demanding evidence.

Our experiment tested four configurations of the same model against a common prompt, evaluated by AI judges and confirmed by a WPS expert review. The results show a clear customization ladder:

  • Off-the-shelf (standard chatbot, no customization): C– / D — generic output, omits WPS
  • + WPS instructions (detailed system prompt with principles on how to apply a WPS lens; no added knowledge): B — mentions women, thin on evidence and policy depth
  • + Retrieval augmentation with evidence base (connected to curated WPS research and field case studies): B+ — substantive analysis grounded in real evidence
  • + Retrieval augmentation with National Action Plans (connected to country-specific policy commitments): A — policy-aligned, WPS KPIs
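
As an illustration of the upper rungs of that ladder, the sketch below pairs a WPS-informed system prompt configuration with a toy retrieval step over a hypothetical evidence base. The documents, the keyword-overlap retriever, and the prompt format are all simplified placeholders; a production agent would draw on a curated WPS corpus and National Action Plan excerpts with proper semantic retrieval.

```python
# Minimal sketch of retrieval augmentation: a toy keyword retriever over a
# hypothetical WPS evidence base. Real deployments would use a curated corpus,
# National Action Plan excerpts, and embedding-based search.

EVIDENCE_BASE = [
    "Case study: women-led organizations negotiated humanitarian access to the camp in 2023.",
    "Field report: displaced women face heightened protection risks at informal checkpoints.",
    "National Action Plan excerpt: operations must consult affected communities before relocation.",
]

def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by simple word overlap with the query (illustrative only)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_augmented_prompt(query: str) -> str:
    """Prepend retrieved WPS evidence to the user's sparse field report."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, EVIDENCE_BASE))
    return f"Relevant WPS evidence:\n{context}\n\nSituation report:\n{query}"

print(build_augmented_prompt("Checkpoint reported closed near the displacement camp."))
```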

Our WPS AI Benchmark takes this analysis one step further, making it an effective mechanism to operationalize WPS compliance as a procurement requirement. Models are scored against a standardized WPS scenario set using a structured rubric, yielding measurable, nuanced, and operational evidence that can be used to improve model compliance. Defense organizations and other entities with WPS obligations should be writing this benchmark into their contractual requirements to specify minimum performance thresholds rather than accepting generic vendor claims of “ethical AI” alignment.
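
In procurement terms, a benchmark requirement reduces to a scenario set, a scoring rubric, and a minimum threshold written into the contract. The sketch below shows that general shape; the criteria names echo those discussed above, but the weights, scores, and 0.6 threshold are invented for illustration and are not the published WPS AI Benchmark rubric.

```python
# Minimal sketch of benchmark-based procurement screening. Criteria, scores,
# and the threshold are illustrative, not the published WPS AI Benchmark rubric.

RUBRIC_CRITERIA = ["participation", "protection", "trust_building", "due_diligence"]
MINIMUM_AVERAGE = 0.6  # hypothetical contractual threshold

# Hypothetical per-scenario scores for one vendor model (0.0 to 1.0 per criterion).
scenario_scores = [
    {"participation": 0.7, "protection": 0.65, "trust_building": 0.5, "due_diligence": 0.55},
    {"participation": 0.6, "protection": 0.7, "trust_building": 0.45, "due_diligence": 0.5},
]

def average_score(scores: list[dict[str, float]]) -> float:
    """Average all rubric criteria across all benchmark scenarios."""
    values = [s[c] for s in scores for c in RUBRIC_CRITERIA]
    return sum(values) / len(values)

overall = average_score(scenario_scores)
print(f"Benchmark average: {overall:.2f}")
print("Meets contractual threshold" if overall >= MINIMUM_AVERAGE else "Below threshold; remediation required")
```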

The recent standoff between Anthropic and the Pentagon over autonomous weapons and mass domestic surveillance illustrates a broader structural problem: when AI vendors and defense organizations negotiate deployment boundaries, those conversations tend to play out in terms of broad use policies and ethical principles—not measurable, domain-specific performance standards. Without a shared benchmark, procuring organizations have no way to specify what compliant output actually looks like, and vendors have no way to demonstrate it. A WPS benchmark changes that equation. The burden can then shift to the vendor to prove the models they provide actually meet the operational requirements their customers have already committed to.

The default is already a choice

One useful framing likens AI governance to brakes on a fast-moving car—necessary, but always reactive, always trailing the technology. But in WPS-governed contexts, the problem isn’t speed. It’s that the car was never engineered for the road it’s on. Brakes slow it down, but they don’t prevent it from producing structurally flawed output. When an AI system defaults to analysis that omits women, girls and boys in a conflict environment, adding oversight after the fact doesn’t fix the system—it just adds a review layer on top of output that was wrong from the start.

The phrase “oversight hasn’t caught up” frames the gap as a timing problem—as if the standards don’t yet exist and organizations just need more time to develop them. But the standards do exist. The WPS competence gap is not a governance failure that better brakes can catch. It is a design failure: the organizations that wrote the procurement specs, chose the vendors, and decided what standards to require did not require WPS compliance. It is an omission with measurable consequences for operational effectiveness and the protection of civilian populations.

How do we fix it?

Our Secure Future’s research is ongoing. The WPS AI Benchmark is an open evaluation framework—the scenario set, evaluation criteria, and methodology are publicly available.

A first step would be to use this model to expand into other areas that govern conflict, such as broader human security, and to require through laws and policies that procurements adhere to existing standards, with evidence produced to confirm compliance.

Second, advisors within organizations need to become experts. Sadly, our work makes it increasingly clear that commanders are relying more on AI tools than on the WPS advisors who already exist in the command structure. This is something decision-makers can fix. Training, empowering, and resourcing WPS advisors to concentrate their energy on influencing the AI tools would not only produce better decision-making almost immediately but would also serve as an organizational model for other areas such as human security, humanitarian response, and localization.

Third, we know commanders rely on AI tools for speed, but experiments such as this one took hours, not months. Empowering academic partners and outside groups to test assumptions—just as is done in doctrine development—is critical to the process.

We have an important role to play by building benchmarks to evaluate the operational readiness and effectiveness of LLMs. Jack Clark, co-founder of Anthropic, recently said: “Give us a goal. The AI industry is excellent at trying to climb to the top of benchmarks. Come up with benchmarks for the public good that you want.” It’s clear that AI has already entered the battlefield, but humans are still in control. The decision about which humans are empowered to influence the direction of AI systems that can determine war and peace needs to be made now.

NOTE: This post references results for the fourth iteration of its benchmark. Those results can be accessed at weval.org. The WPS AI Agent is available for demonstration at https://wps-agent.streamlit.app.

The Easy Read Generator Is Live

Click here to download an Easy Read version of this blog post.

Five years ago, a team of NDI colleagues pitched an idea called “Right To Know” at an internal innovation competition, the culminating project of an internal course on Democracy and Technology (DemTech 1000) I organized. The concept, led by Whitney Pfeifer, was straightforward: build a tool that could translate complex civic documents into Easy Read format—short sentences, plain language, paired with clear illustrations—so that people with intellectual disabilities could access the same information as everyone else. The team won, the idea got a small innovation grant, and what followed was a long, winding road to a working product that I’m only now finally able to share.

The Easy Read Generator is now officially a thing!

What Easy Read Is

Easy Read is a method of presenting information in a format that’s easier to understand. It combines simple language with images that reinforce the meaning of each sentence. It’s valuable for people with intellectual disabilities, low literacy levels, or limited fluency in the language being used—but it’s also just good communication practice more broadly.

Article 21 of the UN Convention on the Rights of Persons with Disabilities guarantees the right to accessible information. In practice, though, Easy Read materials are expensive and time-consuming to produce, which means they’re rarely created—especially in lower-income countries where the need is greatest and the resources are thinnest.

An Idea Ahead of Its Time

The “Right To Know” pitch happened in 2021—more than a year before ChatGPT launched and kicked off the modern era of generative AI. The team envisioned a tool that could take dense policy language and automatically simplify it, but the technology to do that reliably didn’t exist yet. When ChatGPT arrived in late 2022, the concept Whitney’s team had imagined suddenly became technically plausible. With the innovation grant, we built a first version: a static site at easyread.demcloud.org with detailed instructions on how to use generative AI tools to accelerate Easy Read document creation.

In October 2024, I traveled to Nairobi, Kenya, to facilitate a human-centered design workshop with representatives from disabled people’s organizations (DPOs) including the United Disabled Persons of Kenya, the Kenya Association of the Intellectually Handicapped, the Down Syndrome Society of Kenya, and several others. Over two days, we tested assumptions about accessible information, explored what generative AI could and couldn’t do, and collaboratively designed the features an Easy Read generator tool would need.

One moment from that workshop has stayed with me. A teacher who supports students with Down Syndrome said: “I wish I knew about this before. This will help a lot. I struggle to break down complex jargon into understandable information. With this tool, that work becomes easier.”

Continuing After the Layoff

In January 2025, I was laid off from NDI after nearly 11 years. The Easy Read Generator was not finished. The workshop participants had given us a clear mandate and a thoughtful design, and I had made commitments to them and to the DPOs we were working with. I continued the work on my own, collaborating with University of Maryland students who contributed concepts for the Easy Read Generator’s UX redesign.

The Image Problem

Most people who encounter Easy Read for the first time assume the images are supplementary—nice to have, but not essential. They’re not. In Easy Read, each illustration exists to support the comprehension of a specific sentence. If the image doesn’t clearly represent the concept in the text, it can actually make the document harder to understand, which is the opposite of the goal.

When I first tried to build the generator, I assumed AI image generation would handle this, but current AI image generators are weak at producing the kind of clear, simple illustrations that Easy Read requires. The images they generate tend to be too detailed, too stylistically inconsistent, too prone to visual noise, and often imbued with cultural biases that undermine comprehension. Closing that gap would have meant training a custom image generation model—far beyond what I could take on as a solo developer working on a civic tech side project.

That failure stalled the project for months. I tried multiple approaches, multiple tools, multiple prompting strategies. None of them produced images I’d feel comfortable putting in front of the people this tool is meant to serve.

Selecting Instead of Generating

The thing that eventually unblocked the project was a shift in approach. Instead of asking AI to generate images, I started asking it to select them.

I built a keyword-mapped image library—a JSON file containing 564 keywords mapped to 186 unique illustrations drawn from three open-licensed sources:

  • Mulberry Symbols—a widely-used symbol set designed for augmentative and alternative communication (AAC), licensed under CC BY-SA 2.0 UK
  • OpenMoji—an open-source emoji library with clean, consistent line art, licensed under CC BY-SA 4.0
  • NDI’s Easy Read Online Dictionary—illustrations collected through NDI’s own Easy Read program, licensed under CC BY-SA 4.0

When a user pastes text into the Easy Read Generator, the LLM does two things: it simplifies the language into short, clear sentences, and it matches each sentence to the most appropriate illustration from the library using the keyword map. The AI isn’t creating images—it’s making selections from a curated set of symbols that were designed for this purpose by people who understood accessible communication.
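
A simplified sketch of that selection step is below. In the real tool an LLM does the matching against the full keyword map; the keywords, file names, and literal word-matching loop here are illustrative stand-ins meant only to show selection rather than generation.

```python
# Minimal sketch of keyword-based image selection for Easy Read output.
# The keyword map and file names are illustrative, not the real library.

IMAGE_LIBRARY = {
    "vote": "mulberry/vote.svg",
    "election": "mulberry/vote.svg",
    "doctor": "openmoji/health_worker.svg",
    "money": "ndi/money.svg",
    "help": "openmoji/helping_hand.svg",
}

def select_image(sentence: str, library: dict[str, str]) -> str | None:
    """Return the first illustration whose keyword appears in the sentence."""
    words = sentence.lower().split()
    for keyword, image_path in library.items():
        if keyword in words:
            return image_path
    return None  # no match; the sentence is shown without an image

easy_read_sentences = [
    "You can vote in the election.",
    "Ask for help if the form is hard.",
]
for sentence in easy_read_sentences:
    print(sentence, "->", select_image(sentence, IMAGE_LIBRARY))
```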

The library doesn’t cover every possible concept, and some matches are better than others. But every image in the output was created by designers who understand accessibility, not hallucinated by a model optimizing for visual plausibility.

Where This Leaves Me

The tool I shipped is not what I originally envisioned. It’s simpler, more constrained, and more honest about what current AI tools can and can’t do. I think it’s better for it. My earlier attempts were too ambitious, and the image generation requirement exceeded what the technology could responsibly deliver. Stripping back to the core problem—simplify text, match it to existing illustrations—turned out to be enough.

After contributions from countless people, I’m relieved that I was finally able to deliver a working prototype. The Easy Read Generator will remain free to use, no login required, as long as I’m able to host and improve it. If this tool is useful to you or your organization, consider supporting the project.

Digital illustration of a laptop displaying an integrated development environment (IDE) with code on the left and a document with charts on the right. Surrounding the screen are research notes, sticky notes, data visualizations, a magnifying glass, books, and icons representing ideas, networks, and media, symbolizing an AI-powered workspace that integrates coding, research, writing, and analysis in one environment.

Coding Agents Aren’t Just for Code

Coding agents don’t have to be just for writing code. At their core, they’re for building repositories of knowledge—and that changes what’s possible for anyone doing any sort of complex knowledge work, not just software engineers.

Matt Shumer’s viral post, “Something Big Is Happening,” laid out a stark timeline: AI labs focused on coding first because code is the lever that builds better AI. Software engineers were the first to feel the ground shift—not because they were the target, but because they were standing on the launchpad. Shumer describes telling an AI what he wants, walking away for four hours, and coming back to a finished product. No corrections needed.

He’s right about the trajectory. But there’s a dimension he and most commentators are missing—one that matters especially for the rest of us who aren’t software engineers. The real power of coding agents isn’t that they do work for you. It’s that they let you build a persistent base of knowledge you can think with over time.

I came to this from a different angle. I’m not a software engineer, but I spent over a decade managing technology projects at the National Democratic Institute and on Capitol Hill. My job was translating between technical teams and institutional stakeholders, not writing the code myself.

Then I got laid off, and the usual buffers disappeared—no team to rely on for technical advice, no vendors to turn to for support, no sprint cycles limiting the scope of what I could do. I started using coding agents because I had real problems to solve and no one to hand them to. But I also had a genuine curiosity about a field moving faster than anything I’d seen in my career. That curiosity has since shaped my consulting work—projects ranging from AI bias assessment to open-source civic tech.

What I found surprised me. The coding agent wasn’t just helping me write code. It was giving me a workspace where I could load reference documents, build on previous sessions, and develop structured thinking across weeks of work—something no chatbot had ever offered. It was the first AI tool that felt less like a novelty and more like a genuine working environment. That experience is what convinced me that these tools represent an unlock that most people outside of software engineering don’t yet realize is available to them.

What is a Coding Agent Anyway?

If you haven’t used one, a coding agent is an AI that operates inside an integrated development environment (IDE)—the software workspace where code is written, run, and debugged—where it can read, write, and modify files across an entire project. Tools like Windsurf and Cursor connect to frontier models from OpenAI and Anthropic via their APIs, giving you a conversational interface that can see and manipulate your whole workspace.

If you’ve only used ChatGPT or Claude through a chat window, this is a fundamentally different experience. In a chat, the AI’s memory lives in the conversation. When the conversation ends, the context is gone. You start fresh every time. The model might remember what you said ten messages ago, but it has no persistent, structured understanding of what you’re building outside of the conversation. Some chatbots get around this by curating a notepad of your previous conversations—but that’s frequently done without your knowledge or input, and the result is a lossy, opaque summary rather than a structured body of work you control.

In an IDE, the memory lives in the repository. Every file, every folder, every document you add becomes part of a growing body of context that any new session can draw on. You’re not just having a conversation—you’re developing a running knowledge base that you can feed back into the next session, or hand to a different agent entirely. Multiple agents can work on the same project. The context accumulates rather than evaporating.

This is the difference between in-context learning—the AI’s ability to use what you’ve told it within a single conversation—and building a repository of knowledge. In-context learning is powerful but ephemeral: the model gets smarter within a session but starts from zero the next time. A repository is persistent. It compounds.

The Offloading Trap

Not all AI tools get this right. I recently tested Airtable’s Superagent for building a presentation. The tool is autonomous by default: it generates output through a lengthy thinking process, but when I tried to shift from “generate a draft” to “collaborate on refinements,” the system resisted. It defaulted to its own style, rolled back toward its preferences, and didn’t give me an editable output I could iterate on. I spent more time reacting to the AI than thinking with it.

The problem wasn’t capability—it was interaction design. Superagent optimized for cognitive offloading: hand over the task, get back a finished product. That works when you want a first draft you don’t care much about. It fails when accuracy, framing, and judgment matter—which is most of the time in professional work.

Coding agents solve this problem architecturally. Because the output lives in files you can see, edit, and version-control, the collaboration model is fundamentally different. The AI proposes changes; you accept, reject, or modify them. Each round builds on what’s already there. It’s closer to working with a colleague than delegating to an assistant—a shift I described in “The End of the Product Manager (As We Knew It).” The tracked-changes, diff-based workflow I suggested to the Superagent team? That’s just how coding agents already work.

The Current Moment

Shumer’s post captures something important about where we are: the AI labs built coding capability first because it’s the flywheel that accelerates everything else. AI that writes code can help build the next version of itself. OpenAI’s GPT-5.3 Codex was, by their own account, “instrumental in creating itself.” The recursive loop is already running.

Here’s what that means for everyone else. The frontier labs are pouring resources into making coding agents better—not just because they want to automate software engineering, but because the coding agent paradigm is the most powerful interface they’ve found for AI to do complex, multi-step, persistent work. The investment in coding agents is, in effect, an investment in the general-purpose AI workspace. Software engineers are the first to benefit, but the architecture is domain-agnostic—and already accessible to anyone willing to try it.

The repository model doesn’t care whether the files contain Python scripts or policy analysis. It works the same way. And as these tools improve—as the models get better at judgment, at synthesis, at maintaining coherence across large projects—the gap between “coding agent” and “knowledge work agent” will close entirely.

What This Looks Like in Practice

I’ve been using coding agents for work that goes well beyond shipping software:

  • Research synthesis: Loading a project with dozens of source documents—reports, articles, transcripts—and asking the agent to identify patterns, contradictions, and gaps across the full corpus.
  • Data analysis: Cleaning, transforming, and exploring datasets directly in the workspace—writing scripts to normalize messy data, generate visualizations, or test hypotheses, all within a project that retains the logic and context between sessions.
  • Writing and editing: Drafting in a repository where the AI can reference my previous work, style preferences, and source material simultaneously, then proposing edits I can accept or reject line by line.
  • Structured analysis: Building frameworks and templates that persist across sessions, so each new analysis builds on the last rather than starting from scratch.
  • Knowledge management: Developing living documents that accumulate institutional knowledge in a format that’s both human-readable and AI-accessible.
  • Building applications: And yes, coding agents are also for coding. I’ve used them to build web applications, wire together APIs, and deploy tools—work I previously would have contracted out. You don’t need to be an engineer; the agent handles the implementation while you focus on what the tool needs to do.

Most of this doesn’t require writing code. What it requires is understanding that the IDE is a workspace, not just a code editor—and that the repository is a knowledge base, not just a codebase.

The Honest Version

Shumer writes that he kept giving people “the polite version” of what’s happening with AI, “because the honest version sounds like I’ve lost my mind.” I know the feeling. But I’m also not prepared to hand over my agency to AI—and my honest version is about a different gap: the most powerful AI collaboration tools available right now are coding agents, and most people don’t yet realize they’re available to anyone, regardless of technical background.

Here’s what I want people to take away: you can build a workspace where AI thinks with you over time, not just in a single conversation. That’s what a repository gives you. And the tools to do it—coding agents—are available to anyone right now, for the cost of a subscription. You don’t need to write code. You don’t need a technical background. You just need to open the door.

A screenshot of the Superagent website.

TIL: Superagent Slide optimizes for offloading, not collaboration

I got invited into Airtable’s Superagent beta, and my first idea was simple: use it to help build a presentation I was already preparing for. The request was constrained (short, structured, meant to be presented) and in a domain where accuracy and framing matter—so it felt like a good test of whether Superagent could help with real work, not just brainstorming.

Afterwards, a member of the Superagent team reached out and invited feedback. He couldn’t join the call in the end, but I spoke with his product manager. The call was still useful—I could share what worked, where things broke down for me, and a few broader questions about bias, alignment, and trust (beyond whatever guardrails exist in the underlying third‑party models).

Here are the takeaways:

1) Autonomy by default can slide into cognitive offloading

Superagent was helpful at the start. It generated structure quickly and surfaced a couple angles and graphic designs I might not have considered on my own. Autonomy-by-default may be genuinely useful when you want to offload early drafting. The issue for me was shifting from ‘generate a draft’ to ‘collaborate on refinements.’ I struggled to generate something I could confidently present. Superagent repeatedly defaulted to a pitch-style, narrative deck. Once the system committed to a mode, it was hard to steer, and without an editable output (or something like tracked changes/diffs), I spent more time trying to refine my initial prompt than iterating.

I ultimately ended up running Superagent Slides as seven separate tasks (each one taking about 5 minutes to complete) as I improved my prompt. Trying to refine the deck within the same task didn’t work. Each refinement took another 5-10 minutes, and it often felt like the output rolled back toward the system’s defaults rather than prioritizing my preferences.

The biggest barrier to using Superagent for collaboration was simpler: the final deck didn’t come out in an editable format, and I couldn’t export the output to Google Docs or PowerPoint to make edits. Without being able to make normal edits, it was essentially unusable for my presentation workflow.

Net result: I spent more time waiting for and then reacting to output than thinking with the tool. That may be fine for users who want a more autonomous generator, but I was expecting the AI to act as a partner.

What I suggested to the product manager was a more collaborative approach—closer to how code-editing tools handle changes: show proposed edits clearly (tracked changes / Git-style diffs), let me accept or reject selectively, and let each round build on what’s already working.

2) A few basics also got in the way

Even setting content aside, a few practical issues made the output harder to use:

  • inconsistent hierarchy and styling across slides
  • text density/sizing that wasn’t presentation-safe or accessible

These are solvable, and may simply reflect where the product is in its maturity curve. But they matter because they determine whether you can use what it generates under deadline.

3) As tools get more autonomous, trust and safety become basic product quality

We also talked about bias and reliability in higher-risk domains. I shared IFIT research suggesting models perform better in conflict contexts when prompted to do basic “sanity checks”: ask clarifying questions, surface contingencies and trade-offs, disclose risks, and take sequencing seriously.

One thing I left unsure about is how Superagent is approaching bias/alignment/trust & safety at the product layer (beyond whatever guardrails exist in underlying models). If it’s on the roadmap, I’d love to see how they’re thinking about it—this is the kind of capability that’s easier to build in early.

Final thoughts

The product manager encouraged me to try Super Report, which she said is more developed. I plan to—partly because reports may naturally be easier to revise than slides.

She also said making presentations editable is already a roadmap priority, which is an important step toward making the workflow feel more collaborative.

If I zoom out, the part that stuck with me is how often AI products still treat cognitive offloading, model bias, and accessibility as secondary concerns. Those three things aren’t edge cases; they’re the difference between something that’s impressive in a pitch deck and something that’s safe and usable in the real world.

Scene from office space movie. The "two Bobs" interrogate Tom: "what would you say you do here?"

The End of the Product Manager (As We Knew It)

There’s a scene in Office Space where two consultants, the infamous “Two Bobs,” ask Tom Smykowski a deceptively simple question: what would you say you do here? Tom bristles. He explains that he talks to customers, translates between teams, and keeps things from falling apart. “I have people skills,” he says, and still fails the interview. Initech is bloated, inefficient, and badly run; the Two Bobs are there to reduce headcount, strip out bureaucracy, and show quick savings. But their logic only worked because technology and process standardization were already absorbing coordination and oversight work. What once required multiple roles could now be combined, or eliminated.

The film predates the economic shifts now underway with generative AI, but the pattern is familiar. As tools become more capable, work that once required multiple specialized roles begins to recombine. The work itself isn’t disappearing, but the categories we use to describe it are breaking down as responsibilities coalesce. Product management isn’t going away; it’s becoming more demanding and more technical. As AI tools absorb coordination and translation work, product managers are increasingly responsible for judgment, ethical tradeoffs, and hands-on experimentation. The historical boundary between defining software and building it is collapsing—and product managers are increasingly expected to operate on both sides of that line.

For more than a decade, I worked in international democracy and civic technology programs at the National Democratic Institute, where product work rarely looked like Silicon Valley product management. Budgets were tight, users were diverse, failure carried political and reputational consequences, and technology had to function inside institutions that moved slowly and had low risk tolerance. I was often cast as “the people person,” responsible for translating between program teams, technical constraints, and real-world use. I served as product manager for the DemTools suite—a set of open-source tools NDI hosted and maintained as a shared service for civil society and political actors—defining roadmaps and requirements, managing vendors, and taking responsibility for whether tools actually worked in practice, not just in theory. This was product management in the classical sense, shaped by the realities of international development and democracy support.

While my perspective is grounded in the international development, non-profit and government sectors, the consolidation of product roles is equally applicable to the for-profit and tech industries. Indeed, tech-sector product managers are likely the vanguard in this trend, being among the first to face the need for deeper technical capabilities as AI tools mature.

When the Trump Administration abruptly ended most foreign assistance, I was laid off, along with many others in my sector. That moment forced a reevaluation of my value in the job market—which kinds of work remained in-demand as institutions retrenched. It also created space. For the first time, I could spend sustained time working directly with tools now accelerating this consolidation. At NDI, I had been invited into an internal AI working group, but hands‑on use of contemporary AI coding tools was largely prohibited in day‑to‑day work. Outside those constraints, the shift was clear: even without formal computer science training, these tools have allowed me to expand what product management itself entails. And this experience reflects a broader market trend: as software development becomes more accessible, roles consolidate, and product managers are increasingly expected to build, not just define, the tools they own.

Building Without a Buffer

After my layoff, I began experimenting seriously with AI‑assisted coding tools to solve problems I had previously only managed indirectly. Working inside an integrated development environment (IDE)—the software workspace where code is written, run, and debugged—with a coding agent that can read my codebase, refactor logic, and respond to tightly scoped instructions, I was able to move from defining requirements to implementing and testing them myself. 

I took on work I had previously only specified or reviewed: writing data-cleaning scripts to normalize inconsistent datasets; building small backend services and database schemas; wiring together APIs, authentication, and basic front-end components; and deploying a functioning open-source web application. Work that once required contracts, budgets, and months of coordination now happens in days. As a result, I spend less time coordinating handoffs and more time interrogating outputs—testing assumptions, pressure-testing model behavior against real-world constraints, and deciding where automation ends and responsibility begins. That experience has given me a clearer sense of how to embed institutional policies into practical system behavior: shaping product direction, advising teams on appropriate uses of AI, and setting guardrails that organizations can actually stand behind.

AI hasn’t turned me into a senior engineer, and I wouldn’t ship production‑level code without review. But it has allowed me to turn conceptual understanding into working systems while retaining responsibility for product decisions. At the same time, these tools hollow out traditional entry points on the engineering side. Junior‑level work—boilerplate, scaffolding, translation between systems—is increasingly easy to automate. The developer, product manager, and project manager roles aren’t vanishing; rather they’re collapsing inward, concentrating responsibility in fewer hands.

A Failure That Taught Me More Than the Wins

My first serious attempt to build something more ambitious—an Easy Read generator tool—failed for a number of reasons. First, I started with a product mistake. Instead of defining clear, minimal functional requirements and testing a narrow MVP, I tried to build everything I thought the tool eventually needed to be. I collapsed “prototype” and “platform” into the same effort before validating the core idea.

That mistake collided with a harder constraint. I ran into a real technical limit: current AI tools are still extremely weak at generating Easy Read–style images that actually support reading comprehension for people with intellectual disabilities. The requirement exceeded what the technology can responsibly deliver today—and it also exceeded my abilities as a solo developer. Closing that gap would have required orders of magnitude more time and effort, up to and including training a custom image-generation model—well beyond the practical scope for this project.

The failure wasn’t just technical; it was conceptual. Building directly with AI tools made that misalignment impossible to ignore. There was no vendor buffer and no sprint cycle to hide behind—the system simply stopped cooperating. When you work this close to implementation, bad assumptions fail immediately. Either the requirement was flawed, or I lacked the technical depth to solve it. In this case, it was both.

Human Connection Still Matters

As roles collapse and responsibilities concentrate, human collaboration becomes even more critical. In my own work, this has taken a few concrete forms: regular collaboration with former colleagues who are practicing software developers, and reaching out to others working on similar problems. Sometimes this looks like show-and-tell; other times it takes the form of short, informal working sessions to compare approaches. The emphasis isn’t on tools for their own sake. It’s on clarifying what we’re actually trying to build, catching weak assumptions early, deciding what not to attempt, and making sense of rapidly changing technology together.

Those interactions do work that AI tools don’t. Coding agents accelerate implementation, but they don’t independently challenge framing, surface blind spots, or carry context across decisions. When you’re simultaneously acting as developer, product manager, and project manager, peer-level human feedback becomes the primary check on overconfidence and misjudgment. AI may compress roles, but it also reduces opportunities for feedback. As those feedback loops shrink, collaboration has to become more intentional. Without it, the risk is the accumulation of unrecognized mistakes—problems you don’t realize you’re creating until they surface downstream.

Conclusion (As We Know It)

When I talk about the end of the product manager, I’m not predicting the disappearance of a job title. I’m describing the collapse of a boundary. As tools change the economics of building, the old division of labor—between defining work and implementing it—no longer holds. What’s ending isn’t product work itself, but the idea that it can remain insulated from the act of building.

AI-assisted coding compresses the distance between intent and execution. Product managers who can’t get close to the code risk losing contact with reality; developers who can’t reason about requirements inherit decisions they didn’t make. Responsibility concentrates, feedback loops shrink, and mistakes surface later without intentional human collaboration.

This isn’t a story about replacing expertise or celebrating lone builders. The tools only work when grounded in real technical understanding—and they fail fast when that foundation is missing. What changes is who is expected to carry that understanding, and how early.

The end of the product manager isn’t the end of product work. It’s the end of pretending that thinking and building can be cleanly separated. What comes next belongs to people willing to hold both sides of that responsibility at once.

A mosaic of prototype screens from the Easy Read Generator redesign—an accessibility-focused civic tech tool reimagined by UMD students to better serve users with diverse cognitive and digital literacy needs.

Forked, Not Finished: Mentoring Civic Tech the Open Source Way

This spring, I had the opportunity to support several student-led civic tech projects through the University of Maryland’s iConsultancy program. The partnership was originally facilitated through my role at the National Democratic Institute (NDI), but when NDI’s participation was disrupted by a sweeping freeze on U.S. foreign assistance programs, I continued advising the students in a personal capacity.

What started as a straightforward mentorship experience became a much more fluid—and in some ways more meaningful—engagement, shaped by shifting roles, student initiative, and a shared interest in public-interest technology. In many ways, it reminded me of the spirit of open source: people stepping in, adapting to change, and contributing however they can. NDI itself has long embraced open source platforms like Decidim and CiviCRM as part of its commitment to digital democracy—tools that reflect the values of transparency, adaptability, and shared ownership.

Three Projects, Three Distinct Challenges

Each iConsultancy team focused on a different scope of work—specifically related to Decidim, an open-source platform for democratic participation, and a new tool that NDI was designing to make information more accessible to people with intellectual disabilities. These projects were all rooted in the open source ethos: building in the open, iterating in real time, and aiming for impact beyond the immediate team.

1. Decidim Alternate Deployment Methods

This team explored ways to simplify and modernize how Decidim is deployed across different environments. The official Heroku option had become outdated, and the manual installation process was prohibitively complex for non-expert users.

The students conducted a technical evaluation of Docker and Heroku deployment methods, tested them across operating systems, and ultimately created an updated Docker configuration tailored for production environments. Their contributions were submitted to the Decidim GitHub repo. These additions make it significantly easier to deploy Decidim in a production environment using Docker Compose. Like many open source contributions, their work built on community-maintained tools, with the potential to be picked up and improved by others.

2. Easy Read Generator UX Redesign

The second team focused on redesigning the user interface for NDI’s Easy Read Generator project, a tool that simplifies complex civic documents to make them more accessible for individuals with intellectual disabilities and those with lower literacy levels.

Drawing on user research, accessibility guidelines (like WCAG), and competitive analysis, the students developed a high-fidelity prototype and detailed UX recommendations. While I had envisioned an iterative redesign of existing wireframes, the team pushed the concept further—exploring new features such as login options and donation functionality. Their willingness to experiment expanded the conversation about what this tool could become. 

3. Manual Installation Documentation Enhancements

The third project aimed to unify and improve Decidim’s manual installation documentation. English-language instructions were incomplete, and more robust Spanish-language documentation had yet to be translated or standardized.

The team was tasked with consolidating and testing these disparate guides, streamlining the process for deploying Decidim with all its intended features. Documentation is the connective tissue of any open source ecosystem, and while this team faced challenges in delivering their final product, the importance of the task—and the gaps it sought to fill—remains clear.

Lessons from the Field

Each project reflected the realities of open collaboration: sometimes productive, sometimes messy, always instructive. The teams that stayed organized and engaged produced genuinely useful outputs that could be built upon by others. In other cases, student groups struggled to balance their workload or needed more support to stay aligned with the project’s goals.

To be clear, this isn’t a critique of the iConsultancy model—student-led learning is, by design, exploratory. But like any open source initiative, success is rarely the result of individual effort alone. It depends on a thoughtful mix of initiative, shared norms, and an ecosystem of support. Civic tech projects, especially those aiming for real-world relevance, demand a working knowledge of community context, accessibility, and technical infrastructure—all challenging to fully absorb in a single semester. And just as open source contributors rely on documentation, mentors, and community to navigate complex codebases, student teams benefit from structured feedback, clear goals, and a culture that rewards asking questions. Those ingredients can turn short-term projects into lasting contributions.

Why I Stayed

Even after my layoff from NDI, I chose to remain involved because my commitment to the projects didn’t depend on a formal title. The UMD students brought real energy and fresh ideas. And continuing to mentor them gave me a sense of continuity and purpose at a time when many other structures were unraveling.

In civic tech, we often talk about resilience, distributed leadership, and decentralization. These principles are foundational to the open source ecosystem, where no single person or entity controls the project and leadership often emerges organically from contributors. This experience reminded me that these values aren’t just theoretical—they show up in how we navigate change. Open source projects are a fitting metaphor: they can survive the loss of their initial stewards, thriving as new contributors pick up the thread. Our work, too, can have a life beyond any single job or institution. Even when a formal role ends, the ideas, tools, and momentum we create can continue evolving—adapted, expanded, and reimagined by others who care.

Using AI to Strengthen Democratic Inclusion

Participants develop a list of features they would like to be included in an Easy Read generator tool. They then used this list to design a prototype tool.

Of the 15 percent of people around the world who live with a disability, 8 in 10 reside in developing countries. Although Article 21 of the United Nations Convention on the Rights of Persons with Disabilities (CRPD) grants them the right to accessible information, people with disabilities often face communication barriers due to a lack of information accessibility. Access to information is essential for democratic and political participation, enabling people to make informed decisions and influence policies that affect their lives. If people with intellectual disabilities have greater access to easy-to-read information on political processes or policies, along with the necessary assistance to use it, they will be better equipped to advocate for themselves and participate in democracy. By reducing communication barriers through Easy Read and other accessible formats, societies can foster inclusion, making it possible for people with disabilities to engage fully in civic life.

With these circumstances in mind, the National Democratic Institute (NDI) organized a two-day workshop in Nairobi, Kenya, to bring people with intellectual disabilities, caretakers, civil society representatives, government officials, and accessibility experts together to test and design tools for creating Easy Read documents. The workshop began by reviewing the results of a remotely-conducted activity to test assumptions about how to best address barriers to accessible information in Kenya. Participants then explored the possibility of using generative AI tools, like ChatGPT, to facilitate the creation of accessible information. To ensure that everyone could participate, NDI provided accessibility accommodations, such as sign-language interpretation, an expanded agenda time frame to allow for ample participation, and illustrations to enhance comprehension and retention.

Easy Read is a method of presenting information in an easy-to-understand format. Easy Read materials are especially beneficial for people with disabilities, those with low literacy levels, non-native language speakers, and individuals experiencing memory difficulties. Easy Read combines short sentences that are clear and free of jargon with simple images to help explain the written content. Easy Read is essential not only for people with intellectual disabilities but also for making information accessible to everyone, particularly in a democratic society. Accessible information enables all citizens to participate in civic processes, make informed decisions, and understand their rights and responsibilities. By utilizing Easy Read, NDI seeks to support inclusive democratic participation and enable people to actively engage in their communities.

Alice Mundia, Chairperson of the Differently Talented Society of Kenya (DTSK), discusses barriers faced by persons with intellectual disabilities, specifically with regard to accessing information.

Twenty representatives from various disabled people’s organizations (DPOs) and other civic groups contributed their diverse perspectives and expertise to advance information accessibility in Kenya. These groups included the United Disabled Persons of Kenya (UDPK), the Kenya Association of the Intellectually Handicapped (KAIH), Kenya ICT Action Network (KICTANet), Differently Talented Society of Kenya (DTSK), Black Albinism (BI), Ubongo Kids, Down Syndrome Society of Kenya (DSSK), Kenya Sign Language Interpreters Association (KSLIA), the Kenya National Association of the Deaf (KNAD), and the Directorate of Social Development under the Ministry of Labour and Social Services. The event fostered collaboration and laid the foundation for further development of accessible digital tools in the country.

On the first day, participants reflected on the structural challenges that restrict access to information for people with intellectual disabilities. Alice Mundia, Chairperson of the Differently Talented Society of Kenya (DTSK), led a discussion on the barriers to creating and distributing Easy Read materials. Participants then explored NDI’s Easy Read website, provided feedback on navigation and usability, and used generative AI tools to draft Easy Read documents. Working in small groups, they refined these drafts, exploring the potential and challenges of using AI for accessible content creation.

“I wish I knew about this before. This will help a lot,” said a teacher who supports students with Down Syndrome. “I struggle to break down complex jargon into understandable information. With this tool, that work becomes easier.”

During the second day, participants focused on mapping key stakeholders involved in creating and disseminating Easy Read documents and developing a prototype for an Easy Read Generator tool. Participants collaborated to design user flows, interfaces, and features for the tool by sketching visual prototypes. This hands-on session ensured that the tool would meet the diverse needs of people with intellectual disabilities and their supporters. The concept for an Easy Read Generator originated during a pitch competition in 2021, where NDI staff proposed tech solutions to democracy challenges. The winning idea, the “Right To Know” project, envisioned an Easy Read translator, anticipating the development of generative AI technologies like ChatGPT, which has enabled computers to simplify complex documents quickly.

Through the workshop, participants found that while ChatGPT is a powerful tool for generating and simplifying text, the unpaid version has several limitations that hinder its generation of accessible content. These include browsing limitations and the inability to upload documents or generate images. 

Following this workshop, NDI has begun exploring two avenues to address these limitations and improve access to accessible information for people with intellectual disabilities. First, NDI is reaching out to companies that provide Generative AI chatbots to explore the possibility of allowing NGOs that support people with intellectual disabilities to access paid services for free or at a reduced cost. Such a program could enable disability rights advocates, caregivers, and organizations to leverage the most advanced tools to generate Easy Read content. This would significantly enhance their ability to reach and support individuals who depend on these accessible materials.

NDI is also exploring avenues for developing the prototype Easy Read Generator that participants designed into a working application through future programs. This tool would not only improve the experience of using Generative AI tools to create Easy Read documents, it could also be offered for free to select partner organizations, eliminating cost as a barrier to generating easy-to-read information. 

This illustration captures the second day of the workshop, which focused on designing an Easy Read AI chatbot.

Through this workshop, participants from diverse backgrounds collaborated to explore generative AI’s potential for making information accessible for all. The workshop provided an invaluable opportunity to address challenges, share insights, and develop solutions. NDI remains committed to expanding these programs to ensure that all citizens have access to information in formats they can understand and use.

Author: Jesper Frant, Senior Technology Projects Manager for NDI’s Democracy and Technology team

NDI’s engagement with this program is implemented with support from the National Endowment for Democracy (NED).

###

NDI is a non-profit, non-partisan, non-governmental organization that works in partnership around the world to strengthen and safeguard democratic institutions, processes, norms and values to secure a better quality of life for all. NDI envisions a world where democracy and freedom prevail, with dignity for all.

This story was originally posted on ndi.org.

How Smart Automation Can Be Used In International Development

This article was originally posted on NDItech.org.

Artificial Intelligence is one of those buzzwords in tech that everyone’s heard, but few people actually understand how it can be used in practice. If you’re to believe Hollywood or Stephen Hawking, AI either means androids that are indistinguishable from humans (except for the inability to use contractions) or super-intelligent computers that could spell the end of the human race. After attending a Tech Salon on how AI can be used in international development, I can say with absolute certainty that it is neither of those things… yet. But the “commodification” of AI is making “smart automation” — a term I quite liked as a useful synonym for AI — much more accessible outside Silicon Valley. In fact, you probably already used some form of AI today without even knowing it.

Before we get into how AI can be used in international development, let’s first understand for what type of things smart automation can and can’t be used. These capabilities or limitations can be broken down into three categories.

First, computers can now be trained to automate human intelligence. In other words, we can now train computers to do simple tasks that only humans used to be able to do — things like find which photos in your photo album have cats in them. This is a learning process whereby a human sorts out cat photos and a machine-learning algorithm (another tech buzzword) builds its own model to automate the process of finding cat photos.
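
For readers who want to see what that training step looks like in practice, here is a toy example using the scikit-learn library (my choice of tool, since none is named above); the two-number “features” stand in for whatever measurements a real system would extract from each photo.

```python
# Toy illustration of supervised learning: a human labels examples, the
# algorithm builds a model, and the model automates the sorting afterwards.
# The two-number "features" stand in for real image measurements.
from sklearn.linear_model import LogisticRegression

# Hand-labeled training examples: [whisker_score, ear_pointiness] -> 1 = cat, 0 = not cat
features = [[0.9, 0.8], [0.85, 0.9], [0.1, 0.2], [0.2, 0.15], [0.8, 0.75], [0.15, 0.3]]
labels = [1, 1, 0, 0, 1, 0]

model = LogisticRegression()
model.fit(features, labels)          # the "learning" step

new_photos = [[0.88, 0.82], [0.12, 0.25]]
print(model.predict(new_photos))     # e.g. [1 0]: first looks like a cat, second does not
```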

Second, smart automation is only really useful as a way to augment human ability; it does not replace humans wholesale. AI is really good at classification and prediction, but it will never be 100 percent accurate. You still need a human to monitor the results, check for bias and make judgment calls.

Ok, so, now that the AI found the cat photos, it’s up to you — human — to exclude the one that is just a realistic-looking cat-shaped slipper (how’d that get in there?!?) and post the cutest, most relevant one as your animal shelter’s Facebook cover photo. We’re trying to rescue kittens, not sell cat slippers…silly computer.

Finally, computers are way better than humans at doing simple, mundane tasks over and over without error, and at referencing vast databases of complex information. Smart automation is, therefore, a pathway to scale.

The cat example doesn’t work quite as well in this case so I’m going to dispense with that metaphor and instead turn to a real-life problem. There are simply too few doctors in Nigeria, and — given the size of the population and its rate of growth — it will be generations before we can train enough doctors. Smart automation has been shown to be surprisingly accurate at diagnosing medical ailments. Combining AI-assisted diagnosis with community health workers — who require way less training than a doctor — could be an important pathway to scaling access to medical services in places like Nigeria.

So how would an organization like NDI get started in smart automation? The Tech Salon folks recommended starting with a mid-scale pilot project tied to metrics for success and getting top-down institutional buy-in. But for me, the “how” is way less important than the “what.” In other words, selecting the right pilot project based on previously successful use cases is way more important than the size or institutional buy-in of the pilot. Also, your organization should probably have the capacity to support “dumb automation” — automation that doesn’t employ machine learning algorithms — before it makes the leap to supporting smart automation.

NDI is currently looking for ideas on an appropriate pilot project for smart automation. If you have ideas, you can email me at jfrant [at] ndi [dot] org (<= hoping the AIs aren’t smart enough to read that… yet).