Drudgle: A Self-Organizing, Perpetual Evaluation System for Autonomous Code Generation Using Multi-Agent AI Models

Author: Abel Mohler
Date: August 2025
Version: 1.0

Abstract

Drudgle is a research platform that treats software creation as an evolving, multi-agent conversation. Autonomous agents propose, implement, critique, vote, and synthesize improvements in a perpetual loop. The system emphasizes objective, external oracles (tests/benchmarks) and hard phase gates (correctness, safety, observability, and cost) to avoid reward hacking and drift. Drudgle is both a tool for autonomous code generation and a testbed for studying multi-agent coordination, safety, and long-horizon learning.

We ground Drudgle in a pragmatic architecture that uses structured outputs, graph-based workflows, and capability-scoped tools, and we enforce defense-in-depth safety (sandboxing, static/runtime guards, resource limits, monitoring). The platform prioritizes reproducibility (run manifests, pinned models/prompts), data governance (license/provenance scanning), and research value (emergent behavior analytics).

This whitepaper describes the design principles, architecture, evaluation methodology, safety model, current status, and roadmap.

Executive Summary

  • Problem: Single-agent prompting struggles with reliability, correctness, and long-horizon improvement. Static benchmarks age quickly.
  • Approach: A perpetual, multi-agent debate where agents specialize (Proposer, Implementer, Critic, Voter, Synthesizer) and iterate under objective gates.
  • Key Enablers: External oracles; schema-enforced outputs; prompt chaining; structured parsing; safety sandboxes; budgets and SLOs.
  • Architecture: Graph-based workflows, typed prompts/parsers, capability-scoped tools, persistent state.
  • Outcomes: A living benchmark and research testbed; autonomous code generation constrained by correctness, safety, and cost.

1. Introduction

Modern AI code generation tools such as GitHub Copilot and OpenAI Codex have demonstrated the capacity of large language models (LLMs) to generate useful and syntactically valid code. However, these systems typically rely on human input for task definition, evaluation, and feedback.

This paper presents Drudgle, an autonomous code generation ecosystem composed of multiple AI models that self-organize, compete, and cooperate in order to:

  • Decide what software to build
  • Determine how to build it
  • Critically evaluate one another's contributions
  • Vote on improvements
  • Evolve codebases indefinitely without human intervention

Drudgle reimagines AI-assisted software engineering not as a linear process, but as a recursive, multi-agent debate—a digital ecology of artificial minds striving to refine one another through critique and adaptation. Crucially, Drudgle replaces proxy-chasing with oracle-checked contracts and gated promotion.

2. Objectives

  • To develop a fully autonomous code generation framework where promotion is gated by objective tests and safety checks
  • To design a multi-model feedback loop with schema-enforced, structured exchanges and explicit design contracts
  • To investigate emergent coordination behaviors in decentralized, collaborative–adversarial settings
  • To collect a growing dataset of AI-generated code, critiques, decisions, and provenance for training and evaluation
  • To evaluate the effectiveness of the system as a living benchmark with external validation and long-horizon metrics

2.1 Key Contributions

  1. A gated, oracle-driven loop that reduces reward hacking and drift.
  2. A graph-oriented agent workflow for clarity and observability.
  3. A defense-in-depth safety model suitable for autonomous code generation.
  4. Reproducibility practices (run manifests, pinned models/prompts, structured outputs).
  5. A rigorous metrics framework (correctness, safety, cost, regression, diversity, progress).

3. Methodology

3.1 Agent Roles

Drudgle consists of N participating models, divided into functional roles:

  • Proposer (Spark): Proposes a project concept and initiates code generation
  • Implementers (Builders): Generate code based on the proposal
  • Critics (Observers): Analyze the implementation and provide structured critiques
  • Voters (Adjudicators): Score proposals and implementations, select winning directions
  • Synthesizer (Architect): Integrates feedback into new iterations or forks

Roles may rotate or evolve based on system heuristics and historical performance.
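
To make the role structure concrete, the following minimal Python sketch models roles and a toy rotation heuristic. The names (Role, Agent, rotate_roles) and the rotation rule are illustrative assumptions for this whitepaper, not the platform's actual API.

    from dataclasses import dataclass, field
    from enum import Enum


    class Role(Enum):
        """Functional roles in the Drudgle loop (Section 3.1)."""
        PROPOSER = "spark"
        IMPLEMENTER = "builder"
        CRITIC = "observer"
        VOTER = "adjudicator"
        SYNTHESIZER = "architect"


    @dataclass
    class Agent:
        """A participating model bound to a role for the current iteration."""
        model_id: str                 # pinned model identifier (see Reproducibility, Section 9)
        role: Role
        influence: float = 1.0        # decays over time (Influence Management, Section 4)
        history: list = field(default_factory=list)   # past scores feeding rotation heuristics


    def rotate_roles(agents: list, scores: dict) -> None:
        """Toy rotation heuristic: the lowest-scoring critic becomes a voter next round.

        Real rotation policies would weigh historical performance and diversity goals.
        """
        critics = [a for a in agents if a.role is Role.CRITIC]
        if critics:
            weakest = min(critics, key=lambda a: scores.get(a.model_id, 0.0))
            weakest.role = Role.VOTER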

3.2 Iterative Loop (Gated)

  1. Proposal → Project spec with constraints and acceptance tests where possible.
  2. Implementation → Code generation with structured outputs; schema-validated artifacts.
  3. Validation (Tier 1–2 gates) → Format/lint/type/unit/integration/property tests; static/dynamic analysis; sandboxed execution with resource limits.
  4. Critique (Tier 3) → AI peer review on maintainability, style, novelty, and design contract adherence.
  5. Decision (Tier 4 as needed) → External oracle or human review for hard domains.
  6. Synthesis → Update “Design Notes” (contract), capture trade-offs, and proceed.

Promotion occurs only when Tier 1–2 gates pass. Iteration halts or rolls back on repeated failures or SLO/cost breaches.
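
The gated loop can be summarized as a single controller function. This is a minimal sketch under the assumption that each phase is an injected callable returning schema-validated artifacts; the names (validate_tier1_2, gate.repeated_failure, spec.design_notes, budget_remaining) are hypothetical.

    def run_iteration(spec, implement, validate_tier1_2, critique, adjudicate, synthesize,
                      budget_remaining) -> str:
        """One pass of the gated loop; returns the outcome for the outer controller."""
        artifact = implement(spec)

        # Tier 1-2 gates: format/lint/type/unit/property tests, static/dynamic analysis,
        # sandboxed execution with resource limits.
        gate = validate_tier1_2(artifact)
        if not gate.passed:
            return "rollback" if gate.repeated_failure else "retry"

        # Tier 3: AI peer review against the design contract ("Design Notes").
        reviews = critique(artifact, spec.design_notes)

        # Tier 4 (as needed): external oracle or human review for hard domains.
        decision = adjudicate(artifact, reviews)

        if budget_remaining() <= 0:
            return "halt"   # SLO/cost breach stops auto-iteration

        synthesize(artifact, reviews, decision)   # update Design Notes, capture trade-offs
        return "promoted"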

4. System Design

  • Model Diversity: Multiple model families reduce monoculture risk and encourage divergent thinking.
  • Schema-First I/O: All inter-agent exchanges use structured outputs (e.g., JSON Schema/Pydantic) with validation and repair loops (see the sketch after this list).
  • Design Contract: A versioned “Design Notes” artifact captures principles, trade-offs, and explicit rejects per iteration.
  • Noise Injection: Calibrated ambiguity/perturbations to test robustness without derailing progress.
  • Influence Management: Influence decays over time; fork budgets preserve exploration and minority ideas.
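
As a concrete illustration of schema-first I/O, the sketch below pairs a Pydantic model for a critique with a bounded repair loop that feeds validation errors back to the producing model. The Critique fields and the ask_model_to_repair callable are assumptions for illustration; actual schemas are richer.

    import json

    from pydantic import BaseModel, Field, ValidationError


    class Critique(BaseModel):
        """Illustrative schema for a critic's structured output."""
        target_file: str
        severity: int = Field(ge=1, le=5)
        summary: str
        suggested_change: str


    def parse_with_repair(raw: str, ask_model_to_repair, max_attempts: int = 3) -> Critique:
        """Validate a model's JSON output; on failure, re-prompt with the error (bounded)."""
        for _ in range(max_attempts):
            try:
                return Critique.model_validate_json(raw)
            except (ValidationError, json.JSONDecodeError) as err:
                raw = ask_model_to_repair(raw, str(err))   # repair prompt carries the error text
        raise ValueError("structured output could not be repaired within the attempt budget")

In practice the repair prompt would also include the schema itself alongside the validation error, which keeps malformed outputs from silently propagating between agents.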

5. Evaluation Criteria

Drudgle evaluates progress with objective and qualitative metrics:

  • Correctness: Hidden acceptance test pass rate; property-based coverage; performance thresholds.
  • Safety: Zero critical violations; harmful-output scanner cleanliness; sandbox compliance.
  • Cost: Token/time/compute spend per successful iteration; early-stop triggers on threshold breach.
  • Regression: Rolling regression rate across iterations.
  • Diversity: Edit-distance/novelty across proposals and implementations (a metric sketch follows this list).
  • Qualitative: Critique depth, design coherence, maintainability.
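
The quantitative metrics admit simple first-pass definitions. The sketch below gives one possible formulation of the rolling regression rate and an edit-distance-flavored novelty score using difflib; the exact formulas are assumptions, since the platform's scoring functions are not specified here.

    from difflib import SequenceMatcher


    def rolling_regression_rate(regression_flags: list[bool], window: int = 20) -> float:
        """Fraction of recent iterations that re-broke a previously passing check."""
        recent = regression_flags[-window:]
        return sum(recent) / len(recent) if recent else 0.0


    def novelty(candidate: str, previous: list[str]) -> float:
        """Diversity proxy: 1 minus the best similarity to any prior implementation."""
        if not previous:
            return 1.0
        best = max(SequenceMatcher(None, candidate, p).ratio() for p in previous)
        return 1.0 - best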

6. Significance

Drudgle serves as a:

  • Research testbed for emergent multi-agent behavior and alignment under constraints.
  • Living benchmark that evolves beyond static datasets and single-run evaluations.
  • Safety experiment for layered containment in autonomous software systems.
  • Practical tool that generates working software under measurable guarantees.

7. Future Work

Planned expansions include:

  • Extending to non-code domains once gates are reliably enforced.
  • Incorporating reinforcement learning for long-horizon optimization tied to external oracles.
  • Rich visualizations of debate/decision graphs and provenance.
  • Integration with LangSmith/observability backends for deeper analytics.

8. Safety and Governance

We employ multi-layered containment:

  • Filesystem and container sandboxing with no network egress by default.
  • Static and runtime analysis to block dangerous imports/functions; monitored file ops; timeouts.
  • Resource limits (CPU/memory/FS/process) and execution time caps (sketched below).
  • Monitoring and incident response with severity thresholds and human override.
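
A minimal sketch of the innermost containment layer, assuming a POSIX host where the surrounding container already blocks network egress and scopes the filesystem; the limits shown (CPU seconds, address space, process count, wall-clock timeout) are illustrative values, not the platform's configured ones.

    import resource
    import subprocess


    def _limit_resources():
        """Applied in the child process before exec (POSIX only)."""
        resource.setrlimit(resource.RLIMIT_CPU, (10, 10))                    # 10 s of CPU time
        resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20, 512 * 2**20))   # 512 MiB address space
        resource.setrlimit(resource.RLIMIT_NPROC, (32, 32))                  # cap child processes


    def run_sandboxed(cmd: list[str], workdir: str) -> subprocess.CompletedProcess:
        """Run generated code under per-process limits and a hard wall-clock timeout.

        Network isolation and filesystem scoping are assumed to be provided by the
        outer container; this layer only adds the per-process guards.
        """
        return subprocess.run(
            cmd,
            cwd=workdir,
            preexec_fn=_limit_resources,   # POSIX; the container supplies the outer boundary
            timeout=30,                    # wall-clock cap
            capture_output=True,
            text=True,
        )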

Policies include retention/redaction, license/provenance scanning, and audit trails.

9. Architecture and Implementation

  • Graph Workflows: Model workflows as explicit graphs with typed state and edges (see the sketch after this list).
  • Structured Outputs: Enforce parsers and schemas (e.g., JSON Schema/Pydantic) with repair loops.
  • Prompt Chaining: Separate analysis from generation for reliability.
  • Tooling: Capability-scoped tools behind explicit permissions.
  • Reproducibility: Pin models/prompts/seeds and record run manifests.
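
The graph-workflow principle can be illustrated without committing to a specific framework. The sketch below uses plain Python with a TypedDict state and an explicit edge table; the node bodies are stubs, and the node and edge names mirror the loop in Section 3.2 rather than the platform's real workflow definitions.

    from typing import Callable, TypedDict


    class IterationState(TypedDict):
        """Typed state carried along every edge of the workflow graph."""
        spec: str
        code: str
        gate_passed: bool
        critiques: list[str]


    # Nodes: functions from state to updated state (stub bodies for illustration).
    def implement(state: IterationState) -> IterationState:
        return {**state, "code": f"# generated for: {state['spec']}"}


    def validate(state: IterationState) -> IterationState:
        return {**state, "gate_passed": bool(state["code"])}


    def critique(state: IterationState) -> IterationState:
        return {**state, "critiques": ["tighten error handling"]}


    NODES = {"implement": implement, "validate": validate, "critique": critique}

    # Edges: each node names its successor, possibly conditionally; None terminates.
    EDGES: dict[str, Callable[[IterationState], str | None]] = {
        "implement": lambda s: "validate",
        "validate": lambda s: "critique" if s["gate_passed"] else None,  # failed gate halts
        "critique": lambda s: None,
    }


    def run(state: IterationState, start: str = "implement") -> IterationState:
        """Walk the graph from `start`, applying each node and following its edge."""
        node: str | None = start
        while node is not None:
            state = NODES[node](state)
            node = EDGES[node](state)
        return state

Keeping nodes as pure functions over a typed state makes every transition loggable, which the observability layer (Section 10) can consume directly.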

10. Observability and SLOs

  • Dashboards for stagnation, regression, parser failures, and budget breaches.
  • SLOs gate auto-iteration and trigger early-stop/remediation workflows (a threshold sketch follows).
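
A sketch of the SLO gate under assumed thresholds; the specific limits and metric names below are illustrative, not the platform's configured values.

    from dataclasses import dataclass


    @dataclass
    class SLO:
        """Thresholds that gate auto-iteration (illustrative values)."""
        max_regression_rate: float = 0.10
        max_parser_failure_rate: float = 0.05
        max_cost_per_success_usd: float = 5.00
        max_stagnant_iterations: int = 10


    def check_slos(slo: SLO, metrics: dict) -> list[str]:
        """Return the list of breached SLOs; any breach triggers early-stop/remediation."""
        breaches = []
        if metrics["regression_rate"] > slo.max_regression_rate:
            breaches.append("regression")
        if metrics["parser_failure_rate"] > slo.max_parser_failure_rate:
            breaches.append("parser")
        if metrics["cost_per_success_usd"] > slo.max_cost_per_success_usd:
            breaches.append("budget")
        if metrics["stagnant_iterations"] >= slo.max_stagnant_iterations:
            breaches.append("stagnation")
        return breaches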

11. Current Progress (Aug 2025)

  • Multi-agent coordination framework with specialized roles and tested workflows.
  • Proposal phase with six-language support; enhanced static analysis and robust parsing.
  • Workspace management and fallback validation paths; improved JSON extraction/unescaping.
  • Monitoring and messaging infrastructure in place for reliability and observability.

12. Roadmap and Milestones

Near-term roadmap milestones:

  1. Formalize the adapter layer and structured parsing.
  2. Runnable pipelines and a minimal graph workflow (Proposal → Implementation → Validation → Critique → Synthesis).
  3. Capability-scoped tool wrappers; prompt templates and chains.
  4. Expanded observability, SLO gating, and diversity controls.

13. Limitations and Risks

  • Goal drift if gates are weak or proxy metrics dominate.
  • Cost sensitivity for large agent teams and long horizons.
  • Safety blind spots in novel toolchains or domains.
  • Parser brittleness without strict schemas and repair.

Mitigations include hard gates, budgets, conservative scope, and staged rollout.

14. Conclusion

Drudgle proposes a gated, oracle-driven, multi-agent approach to autonomous software development. By combining graph workflows, structured outputs, safety-in-depth, and rigorous observability, Drudgle aims to create both a practical code generator and a valuable research testbed for emergent AI behavior.