Fackel: an autonomous pentest framework powered by ReAct agents

Most pentest automation tools encode strategy in code: run this scanner, parse that output, feed it to the next step. The human decides the sequence; the tool just executes it. Fackel inverts that relationship. The LLM decides what to do next—which tools to call, how to interpret results, and when to move on—while the code enforces safety, validation, and structure.

This post covers the architecture, the key design decisions, and the trade-offs that emerged while building Fackel.

The pipeline

Fackel runs a 5-phase pipeline where each phase is a LangGraph node:

Target → OSINT → Approval Gate → Port Scan → Vuln Scan → Triage → Report

The OSINT agent has 27 passive tools (DNS, WHOIS, subdomain enumeration, Shodan, certificate transparency, historical DNS, etc.). If it discovers IPs and the operator opted for active scanning, a human-in-the-loop approval gate pauses execution and displays targets for review before proceeding.

Port scanning has 2 tools (naabu, nmap). Vulnerability scanning has 12 (Nuclei, DalFox, WPScan, WAF detection, TLS analysis, etc.). Triage identifies gaps in coverage. Report synthesizes everything into a structured Markdown document.

The key word is autonomous: each agent uses the ReAct pattern—Reason + Act—to choose tools, interpret results, and decide next steps. The orchestrator manages state flow and conditional routing but never tells an agent which tool to use.
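To make the flow concrete, here is a minimal sketch of the pipeline as a plain-Python state machine rather than a LangGraph graph. Node names, state keys, and the approval behavior are illustrative stand-ins, not Fackel's actual API:

```python
# Illustrative sketch of the phase pipeline: nodes mutate a shared state
# dict, and a router picks the next node from conditional edges.
# Everything here is hypothetical, simplified from the real graph.

def osint(state):
    state["ips"] = ["203.0.113.10"]  # pretend the agent found one IP
    return state

def approval_gate(state):
    # In Fackel this pauses for a human; here we read a pre-set flag.
    if not state.get("approved", False):
        state["halted"] = True
    return state

def port_scan(state):
    state["open_ports"] = [80, 443]
    return state

NODES = {"osint": osint, "approval_gate": approval_gate, "port_scan": port_scan}

def route(current, state):
    """Conditional edges: skip active scanning when OSINT found no IPs."""
    if state.get("halted"):
        return None
    if current == "osint":
        return "approval_gate" if state["ips"] else None
    if current == "approval_gate":
        return "port_scan"
    return None  # port_scan is the last node in this toy graph

def run(state):
    node = "osint"
    while node:
        state = NODES[node](state)
        node = route(node, state)
    return state
```

In the real system each node wraps a ReAct agent and LangGraph handles the state threading and the interrupt for the approval gate; the toy router above only shows the shape of the conditional edges.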

Why ReAct agents, not chains

A chain is a fixed sequence: call tool A, then tool B, then tool C. A ReAct agent is a loop: the model observes the current state, reasons about what’s missing, picks a tool, observes the result, and repeats until it decides it’s done.

For pentesting this matters because the right strategy depends on what you find. If OSINT reveals a WordPress site, the agent should prioritize WPScan and directory enumeration. If it finds an API endpoint, GraphQL introspection becomes relevant. If subdomains point to cloud IPs, S3 bucket scanning makes sense. Hardcoding these decisions is possible but brittle—every new target shape requires new branching logic.

With ReAct agents, the model reads a skill prompt (a playbook-style markdown document describing strategy for that phase) and autonomously selects tools based on what it observes. The key constraint is that the model can only call tools that are explicitly provided—it cannot hallucinate capabilities.
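The loop itself is simple. In this sketch a stub policy function stands in for the LLM; the point is the observe/decide/act shape and the hard constraint that only registered tools can run (tool names and outputs are made up):

```python
# Toy ReAct loop. The "model" is a deterministic stub; in Fackel it is
# an LLM reading a skill prompt. Tools and outputs are hypothetical.

TOOLS = {
    "dns_lookup": lambda target: {"a_records": ["203.0.113.10"]},
    "whois": lambda target: {"registrar": "Example Registrar"},
}

def policy(observations):
    """Stand-in for the LLM: pick the next action from what it has seen."""
    if "dns_lookup" not in observations:
        return "dns_lookup"
    if "whois" not in observations:
        return "whois"
    return None  # the model decides it is done

def react_loop(target, max_iterations=10):
    observations = {}
    for _ in range(max_iterations):
        action = policy(observations)
        if action is None:
            break
        if action not in TOOLS:  # hallucinated tool names are rejected
            raise ValueError(f"unknown tool: {action}")
        observations[action] = TOOLS[action](target)
    return observations
```

The `max_iterations` cap mirrors the iteration budget a real agent runs under, and the dictionary lookup is the mechanism behind "the model can only call tools that are explicitly provided."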

LLM-as-a-judge: adaptive routing

After each phase, a structured-output evaluator (the “judge”) scores the phase’s quality on a 0.0–1.0 scale and recommends routing. If port scanning returned empty results, the judge routes directly to triage instead of wasting time on vulnerability scanning. If OSINT found no IPs, the pipeline skips active scanning entirely.

This replaces what would normally be a forest of if/elif blocks with a single LLM call that evaluates context holistically. The judge has its own skill prompt that defines scoring criteria and routing rules.
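A deterministic stand-in for the judge makes the routing idea concrete. The field names and thresholds below are illustrative, not Fackel's actual schema; in the real system the `Verdict` comes from a structured-output LLM call, not hand-written rules:

```python
from dataclasses import dataclass

# Hypothetical shape of the judge's structured output.
@dataclass
class Verdict:
    score: float     # 0.0-1.0 quality score for the finished phase
    next_phase: str  # routing recommendation

def judge_port_scan(open_ports, found_ips):
    """Toy stand-in for the LLM judge after the port-scan phase."""
    if not found_ips:
        return Verdict(score=0.0, next_phase="report")  # nothing to scan
    if not open_ports:
        return Verdict(score=0.3, next_phase="triage")  # skip vuln scan
    return Verdict(score=0.9, next_phase="vuln_scan")
```

The advantage of the LLM version over rules like these is that it can weigh context the rules never anticipated, such as partial results or tool failures mid-phase.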

Input validation as a first-class concern

Every tool validates its inputs through guard_target(), a validation layer that classifies input types (IP, domain, URL, CIDR) and rejects anything that doesn’t match the tool’s expected input type. Enforcement happens in code: invalid input raises ToolException, rather than relying on prompt instructions the model might ignore.

Shell metacharacters, path traversal attempts, and private IP ranges are rejected before any command execution. The model receives a structured error and can retry with corrected input.

This was a non-negotiable design decision. When an LLM decides what commands to run, the boundary between “model output” and “system input” becomes your primary attack surface. Prompt-level instructions are necessary but insufficient—you need code-level enforcement.

Tool resilience

Three mechanisms keep tool failures from cascading:

  1. ToolException + handle_tool_error: every tool propagates clean errors back to the LLM as regular tool results, not crashes. The model reads the error and adapts.
  2. Circuit breakers: HTTP-based tools (Shodan, VirusTotal, etc.) use per-service circuit breakers that disable the tool after repeated failures. This prevents the agent from wasting its iteration budget on a service that’s down.
  3. Automatic provider gating: tools requiring API keys that aren’t configured are removed from the agent’s tool list at startup. The LLM never sees tools it can’t use.
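The circuit-breaker mechanism (point 2) can be sketched in a few lines. Parameter names and the half-open retry behavior are my assumptions, not Fackel's exact implementation:

```python
import time

class CircuitBreaker:
    """Per-service breaker: after `threshold` consecutive failures the
    tool is disabled for `cooldown` seconds, then allowed one retry.
    Illustrative sketch; the real implementation may differ."""

    def __init__(self, threshold=3, cooldown=300.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def available(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None  # half-open: allow one retry
            self.failures = 0
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

Each HTTP-backed tool checks `available()` before calling out and reports the outcome via `record()`, so a dead service stops consuming the agent's iteration budget after a few attempts.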

Per-agent model configuration

Different phases have different requirements. OSINT involves many tool calls with simple reasoning—a fast, cheap model works well. Report generation requires synthesizing findings into coherent prose—a more capable model helps.

Fackel uses environment variables (FACKEL_MODEL_OSINT, FACKEL_MODEL_REPORT, etc.) so each agent can use a different model. The default falls back to gpt-5-mini for all agents.
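Resolution is a one-line lookup with a shared fallback. The helper name is hypothetical; the environment-variable naming scheme follows the post:

```python
import os

DEFAULT_MODEL = "gpt-5-mini"

def model_for(agent):
    """Resolve the model for an agent from FACKEL_MODEL_<AGENT>,
    falling back to the shared default. Helper name is illustrative."""
    return os.environ.get(f"FACKEL_MODEL_{agent.upper()}", DEFAULT_MODEL)
```

With `FACKEL_MODEL_OSINT=some-fast-model` exported and nothing else set, the OSINT agent gets the cheap model while every other agent falls back to the default.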

Two-tier prompting

All agents share a soul prompt: a markdown document that defines identity, anti-hallucination rules, and output constraints. Each agent also receives a skill prompt: a phase-specific playbook with strategy guidelines, tool usage patterns, and prioritization rules.

The separation matters because it prevents prompt drift. The soul prompt enforces consistent behavior (never fabricate findings, always cite tool output) while skill prompts can be iterated independently per phase.
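Prompt assembly reduces to concatenating the two layers. The on-disk layout below (`soul.md`, `skills/<phase>.md`) is an assumption for illustration, not necessarily Fackel's file structure:

```python
from pathlib import Path

def build_prompt(phase, prompt_dir="prompts"):
    """Compose the system prompt: shared soul + phase-specific skill.
    File layout is hypothetical."""
    root = Path(prompt_dir)
    soul = (root / "soul.md").read_text()
    skill = (root / "skills" / f"{phase}.md").read_text()
    return f"{soul}\n\n{skill}"
```

Because the soul prompt is a single file shared by every agent, a fix to an anti-hallucination rule propagates everywhere at once, while each `skills/<phase>.md` can be tuned in isolation.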

Observability

Setting two environment variables enables LangSmith tracing. All agent phases appear as hierarchical traces with token usage, tool I/O, latency, and middleware activity. No code changes required—LangGraph’s callback system handles it.

For terminal output, Fackel streams tool calls and results in real time. Verbose mode (-v) also shows the model’s reasoning steps (the “thought” portion of ReAct).
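For reference, enabling tracing looks roughly like this. The exact variable names depend on your LangChain/LangSmith version (older releases use the LANGCHAIN_ prefix), so check the docs for yours:

```shell
# Enable LangSmith tracing; names per current LangSmith docs,
# older versions use LANGCHAIN_TRACING_V2 / LANGCHAIN_API_KEY.
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-key>"
```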

What I’d do differently

Stricter output schemas. Some agents return free-text summaries that downstream agents must parse. Structured output (Pydantic models) for inter-phase communication would make the pipeline more deterministic.
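As a sketch of what that typed hand-off might look like, here is a hypothetical inter-phase payload (field names invented for illustration); Pydantic would add validation on top of the same shape:

```python
from dataclasses import dataclass, field

@dataclass
class OsintResult:
    """Hypothetical typed payload passed from OSINT to later phases."""
    domains: list[str] = field(default_factory=list)
    ips: list[str] = field(default_factory=list)
    summary: str = ""

    def has_active_targets(self):
        # Downstream routing can test this instead of parsing free text.
        return bool(self.ips)
```

The gain is that the approval gate and the judge could branch on `has_active_targets()` rather than re-parsing a free-text summary, which removes one source of nondeterminism between phases.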

Cost tracking per run. LangSmith provides token counts, but an in-pipeline cost estimator that could halt execution if a run exceeds a budget would be valuable for production use.

Better test coverage for agent decisions. Unit testing individual tools is straightforward. Testing that an agent makes reasonable strategic decisions given a particular context is harder and where most of the risk lies.

Running it

# Install
git clone https://github.com/flaviomilan/fackel.git
cd fackel && uv sync --python 3.12

# Configure
cp .env.example .env  # set OPENAI_API_KEY

# Passive scan only
fackel example.com --no-active-scan

# Full scan with verbose output
fackel example.com -v

The project is open source under Apache 2.0: github.com/flaviomilan/fackel.