Anamnesis: Evaluating LLM-Driven Exploit Generation Under Real-World Mitigations

Anamnesis is an experimental evaluation framework designed to study how modern large language model (LLM) agents generate working exploits from vulnerability reports—even in the presence of exploit mitigations. The project explores a critical and uncomfortable question for the security community: how capable are current AI models at turning abstract vulnerability descriptions into reliable, mitigation-aware exploits?

The results suggest the answer is: very capable.

What Anamnesis Tests

At its core, Anamnesis simulates a realistic exploitation workflow. Given:

  • a vulnerability report

  • a proof-of-concept (PoC) trigger

  • and a vulnerable software target

LLM-based agents are tasked with:

  1. Analyzing the vulnerable codebase

  2. Understanding the vulnerability mechanics

  3. Generating a functional exploit

  4. Adapting that exploit to bypass active security mitigations

This goes far beyond simple code generation. The agents must reason about memory layout, execution flow, mitigation constraints, and reliability.
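To make the workflow concrete, here is a minimal sketch of what a single evaluation task in a harness like this might look like. It is an illustration only: the class, field, and function names (ExploitTask, Mitigation, run_agent) are assumptions, not the actual Anamnesis API.

```python
# Illustrative sketch only: every name below is an assumption, not Anamnesis's
# real schema. It just captures the three inputs and the constraints the
# article lists.
from dataclasses import dataclass, field
from enum import Enum, auto
from pathlib import Path


class Mitigation(Enum):
    """Hypothetical mitigation toggles a scenario might enable."""
    ASLR = auto()
    NX = auto()
    STACK_CANARIES = auto()


@dataclass
class ExploitTask:
    """One evaluation unit: vulnerability report, PoC trigger, target, constraints."""
    vuln_report: str                         # prose description of the vulnerability
    poc_trigger: Path                        # input that reliably reaches the bug
    target_build: Path                       # vulnerable binary / source checkout
    mitigations: set[Mitigation] = field(default_factory=set)
    goal: str = "controlled code execution"
    min_reliability: float = 0.9             # fraction of runs that must succeed


def run_agent(task: ExploitTask) -> bool:
    """Placeholder for the agent loop: analyze, understand, exploit, adapt."""
    raise NotImplementedError("agent orchestration is framework-specific")
```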

The Target: A Zero-Day in QuickJS

The experiments are based on a previously unknown (zero-day) vulnerability in QuickJS, a lightweight JavaScript engine widely used in embedded systems and tooling.

The vulnerability is explained in detail within the project and is notable for two reasons:

  • It is exploitable in multiple configurations

  • It was automatically discovered using an AI agent built on top of Opus 4.5

This makes Anamnesis a closed-loop experiment: AI-assisted vulnerability discovery followed by AI-assisted exploit generation.

The Models Under Evaluation

Two advanced LLM-based agents were tested:

  • An agent built on Opus 4.5

  • An agent built on GPT-5.2

Across multiple scenarios, the experiments varied:

  • Enabled exploit mitigations (e.g. memory protections)

  • Exploit constraints and goals

  • Required levels of reliability and control
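One plausible way to read this is as a scenario matrix: every combination of mitigation set, exploit goal, and reliability target becomes its own task. The sketch below shows that idea; the specific mitigation names, goals, and thresholds are placeholder assumptions, not the values actually used in the experiments.

```python
# Illustrative only: a hypothetical enumeration of the scenario matrix
# (mitigation sets x goals x reliability targets). Concrete values are assumed.
from itertools import product

MITIGATION_SETS = [set(), {"ASLR"}, {"ASLR", "NX"}]            # assumed examples
GOALS = ["arbitrary read/write", "controlled code execution"]  # assumed examples
RELIABILITY_TARGETS = [0.5, 0.9]                               # assumed thresholds

scenarios = [
    {"mitigations": m, "goal": g, "min_reliability": r}
    for m, g, r in product(MITIGATION_SETS, GOALS, RELIABILITY_TARGETS)
]
# Each scenario would be handed to both agents and scored on whether the
# resulting exploit meets the goal under the enabled mitigations.
```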

The Results

  • Opus 4.5 successfully completed many of the exploitation tasks

  • GPT-5.2 successfully completed all of them

In every successful case, the models demonstrated more than basic exploitation, as the next section illustrates.

From Exploit to “Memory API”

A particularly striking result is how the models approached exploitation.

Rather than crafting a single-purpose payload, both models used the vulnerability to construct a primitive “API” for arbitrary memory modification within the target process. Once this internal capability was established, the agents:

  • Manipulated the process address space at will

  • Systematically bypassed protection mechanisms

  • Hijacked execution flow

  • Achieved the defined objectives in a controlled manner

This mirrors how experienced human exploit developers work: first establish primitives, then build reliability and flexibility on top.
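The layering the article describes can be pictured as an interface: low-level primitives first, reusable higher-level operations on top. The sketch below shows only that shape; it contains no vulnerability-specific logic, and all names (MemoryPrimitives, ExploitToolkit) are illustrative assumptions rather than anything produced by the agents.

```python
# Conceptual sketch only: the *shape* of a layered "memory API" as described in
# the article. Concrete implementations depend on the specific bug and are
# deliberately omitted.
from abc import ABC, abstractmethod


class MemoryPrimitives(ABC):
    """Low-level capabilities an exploit establishes first."""

    @abstractmethod
    def addrof(self, obj) -> int:
        """Return the address of an engine object."""

    @abstractmethod
    def read64(self, addr: int) -> int:
        """Read 8 bytes from the target's address space."""

    @abstractmethod
    def write64(self, addr: int, value: int) -> None:
        """Write 8 bytes into the target's address space."""


class ExploitToolkit:
    """Higher-level, reusable operations built on top of the primitives."""

    def __init__(self, prims: MemoryPrimitives):
        self.prims = prims

    def leak_pointer(self, obj, offset: int) -> int:
        # Bypassing address-space randomization typically starts with a leak
        # of this kind; which object and offset to use is target-specific.
        return self.prims.read64(self.prims.addrof(obj) + offset)
```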

Why This Matters

Anamnesis highlights a shift that security professionals can no longer ignore:

  • LLMs are not just generating exploit snippets

  • They are demonstrating strategic exploit reasoning

  • They can adapt to mitigations instead of failing on first contact

This has implications for:

  • Defensive security and threat modeling

  • Vulnerability disclosure timelines

  • The future of automated red teaming

  • AI governance and capability containment

The experiments do not suggest that AI replaces human exploit developers—but they do show that the barrier to sophisticated exploitation is rapidly eroding.

A Tool for Understanding, Not Sensationalism

Importantly, Anamnesis is framed as an evaluation and research framework, not a weaponization toolkit. Its value lies in making these capabilities measurable, reproducible, and open to scrutiny.

Understanding what LLMs can do in adversarial contexts is a prerequisite for designing defenses that still work when attackers are no longer limited by human time, fatigue, or scale.

Final Thought

The uncomfortable takeaway from Anamnesis is not that AI can write exploits—but that it can reason about exploitation as a process.

Ignoring that reality would be a far greater risk than studying it.
