Anamnesis is an experimental evaluation framework designed to study how modern large language model (LLM) agents generate working exploits from vulnerability reports, even in the presence of exploit mitigations. The project explores a critical and uncomfortable question for the security community: how capable are current AI models at turning abstract vulnerability descriptions into reliable, mitigation-aware exploits?
The results suggest the answer is: very capable.
## What Anamnesis Tests
At its core, Anamnesis simulates a realistic exploitation workflow. Given:

- a vulnerability report
- a proof-of-concept (PoC) trigger
- a vulnerable software target

LLM-based agents are tasked with:

1. Analyzing the vulnerable codebase
2. Understanding the vulnerability mechanics
3. Generating a functional exploit
4. Adapting that exploit to bypass active security mitigations
This goes far beyond simple code generation. The agents must reason about memory layout, execution flow, mitigation constraints, and reliability.
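As a rough mental model, the harness can be pictured as a loop that hands the agent a scenario and independently scores the result. The sketch below is a minimal Python rendering of that idea; every name in it (`Scenario`, `Agent`, `run_in_sandbox`) is an assumption made for illustration, not the project's actual API.

```python
# Minimal sketch of an Anamnesis-style evaluation loop.
# All names here are illustrative assumptions, not the project's API.
from dataclasses import dataclass, field


@dataclass
class Scenario:
    report: str                      # prose vulnerability report
    poc: str                         # proof-of-concept trigger input
    target: str                      # path to the vulnerable build
    mitigations: list[str] = field(default_factory=list)
    goal: str = "controlled hijack of execution flow"


class Agent:
    """Stand-in for an LLM-backed agent wrapping a model API."""

    def generate_exploit(self, scenario: Scenario) -> str:
        # A real agent would iterate: read the code, form a hypothesis,
        # test a candidate against the target, and refine until done.
        return "// exploit candidate produced by the model"


def run_in_sandbox(target: str, exploit: str, goal: str) -> bool:
    """Stub: run the candidate against the target and check the goal."""
    return False  # real harnesses replace this with instrumentation


def evaluate(scenario: Scenario, agent: Agent) -> bool:
    exploit = agent.generate_exploit(scenario)
    return run_in_sandbox(scenario.target, exploit, scenario.goal)
```

The point of the sketch is the separation of concerns: the agent sees only the scenario, while the sandbox judges on its own whether the stated goal was actually met.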
## The Target: A Zero-Day in QuickJS
The experiments are based on a previously unknown (zero-day) vulnerability in QuickJS, a lightweight JavaScript engine widely used in embedded systems and tooling.
The vulnerability is explained in detail within the project and is notable for two reasons:

- It is exploitable in multiple configurations
- It was automatically discovered using an AI agent built on top of Opus 4.5
This makes Anamnesis a closed-loop experiment: AI-assisted vulnerability discovery followed by AI-assisted exploit generation.
## The Models Under Evaluation
Two advanced LLM-based agents were tested:

- An agent built on Opus 4.5
- An agent built on GPT-5.2
Across multiple scenarios, the experiments varied:

- Enabled exploit mitigations (e.g., memory protections)
- Exploit constraints and goals
- Required levels of reliability and control
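To make the combinatorics concrete, a scenario matrix of this kind could be enumerated as below; the mitigation names and goal strings are illustrative assumptions, not the exact set Anamnesis varies.

```python
# Hypothetical scenario matrix; the mitigation names and goals are
# illustrative assumptions, not the exact set Anamnesis uses.
from itertools import combinations

MITIGATIONS = ["ASLR", "NX", "stack canaries", "CFI"]
GOALS = ["proof-of-concept control", "reliable arbitrary code execution"]

scenarios = [
    {"mitigations": list(combo), "goal": goal}
    for r in range(len(MITIGATIONS) + 1)       # from none to all enabled
    for combo in combinations(MITIGATIONS, r)
    for goal in GOALS
]
print(f"{len(scenarios)} scenario configurations")  # 16 combos x 2 goals = 32
```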
## The Results
- Opus 4.5 successfully completed many of the exploitation tasks
- GPT-5.2 successfully completed all of them
In every successful case, the models demonstrated more than basic exploitation.
## From Exploit to “Memory API”
A particularly striking result is how the models approached exploitation.
Rather than crafting a single-purpose payload, both models used the vulnerability to construct a primitive “API” for arbitrary memory modification within the target process. Once this internal capability was established, the agents:

- Manipulated the process address space at will
- Systematically bypassed protection mechanisms
- Hijacked execution flow
- Achieved the defined objectives in a controlled manner
This mirrors how experienced human exploit developers work: first establish primitives, then build reliability and flexibility on top.
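The pattern is easiest to see in code. Below is a simplified Python sketch of that primitive-first approach, with a bytearray standing in for the target's address space; every name and offset is an illustrative assumption, not code from the generated exploits.

```python
# Primitive-first exploitation pattern, simulated: the corruption bug
# is replaced by a bytearray that stands in for process memory.
# All names and offsets here are illustrative assumptions.
import struct

memory = bytearray(0x1000)  # stand-in for the target process's memory

def read64(addr: int) -> int:
    """Arbitrary-read primitive (in a real exploit, derived from the bug)."""
    return struct.unpack_from("<Q", memory, addr)[0]

def write64(addr: int, value: int) -> None:
    """Arbitrary-write primitive (likewise derived from the bug)."""
    struct.pack_into("<Q", memory, addr, value)

# Higher layers reuse the primitives instead of re-triggering the bug:

def leak_module_base(leaked_ptr: int, known_offset: int) -> int:
    """Defeat ASLR: one leaked pointer reveals the whole module layout."""
    return leaked_ptr - known_offset

def hijack(func_ptr_addr: int, gadget: int) -> None:
    """Redirect execution flow by rewriting a single function pointer."""
    write64(func_ptr_addr, gadget)

write64(0x100, 0x7F0000001234)                  # plant a "leaked" pointer
base = leak_module_base(read64(0x100), 0x1234)  # recover the module base
hijack(0x200, base + 0x5678)                    # aim a code pointer at a gadget
assert read64(0x200) == 0x7F0000005678
```

Everything above the raw primitives is ordinary, testable code, which is exactly what makes the approach reliable: once `read64` and `write64` work, ASLR leaks and control-flow hijacks become small, composable functions.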
## Why This Matters
Anamnesis highlights a shift that security professionals can no longer ignore:

- LLMs are not just generating exploit snippets
- They are demonstrating strategic exploit reasoning
- They can adapt to mitigations instead of failing on first contact
This has implications for:

- Defensive security and threat modeling
- Vulnerability disclosure timelines
- The future of automated red teaming
- AI governance and capability containment
The experiments do not suggest that AI replaces human exploit developers, but they do show that the barrier to sophisticated exploitation is rapidly eroding.
## A Tool for Understanding, Not Sensationalism
Importantly, Anamnesis is framed as an evaluation and research framework, not a weaponization toolkit. Its value lies in making these capabilities measurable, reproducible, and open to scrutiny.
Understanding what LLMs can do in adversarial contexts is a prerequisite for designing defenses that still work when attackers are no longer limited by human time, fatigue, or scale.
## Final Thought
The uncomfortable takeaway from Anamnesis is not that AI can write exploits, but that it can reason about exploitation as a process.
Ignoring that reality would be a far greater risk than studying it.

