
**MARE PRD: Multi-Agent Redundant Execution (MARE) for Blueprint**

Product: Verdict IDE
Subsystem: Blueprint
Feature: Multi-Agent Redundant Execution (MARE)
Author: Trent Carter
Date: January 1, 2026
Status: Draft (Rewrite v1)

1. Executive Summary

Multi-Agent Redundant Execution (MARE) is a Blueprint execution policy layer that routes a single job across multiple LLM agents/models (serial, parallel, or staged hybrid) to optimize for Cost, Latency, and Confidence under hard-enforced constraints.

MARE replaces the limiting "one default model per agent role" assignment with a constraint-driven dispatcher that:

• enforces budgets and deadlines in real time,

• validates and aggregates candidate outputs (especially code) using artifact-aware methods,

• captures an audit-grade Decision Record (“why did we choose this output?”),

• learns per-domain/per-repo model strengths over time.

2. Customer Value Proposition

MARE delivers value where single-model execution fails:

2.1 Save Money

Most jobs do not require frontier models. MARE uses low-cost scouts first, escalating only when uncertainty or disagreement is detected, reducing average spend without reducing quality.

2.2 Save Time

When deadlines are tight or the job is high criticality, MARE runs parallel redundancy and arbitration to reduce time-to-correctness compared with slow, iterative retries.

2.3 Increase Trust and Correctness

MARE introduces:

• redundancy for verification,

• validators as judges (lint/typecheck/tests/static checks),

• conflict-aware aggregation (merge/repair),

• decision provenance and explainability.

2.4 Increase Flexibility

Users can specify constraints and policies per repo/task/file type rather than being trapped in a static “model assignment tab” approach.

3. Goals and Non-Goals

3.1 Goals

G1 — Constraint-driven execution: Users set constraints; system chooses strategy and agents automatically.

G2 — Hard budget enforcement: Spend must not exceed caps; no unbounded parallel blowups.

G3 — Better outcomes than single-model: Higher correctness/validator pass rates with equal or lower median cost.

G4 — Explainable decisions: Provide a durable Decision Record for every run.

G5 — Blueprint-native integration: MARE is a first-class policy on Blueprint JobCards and execution flows.

3.2 Non-Goals

• Provider SDK implementation details and code-level design

• QA plans, Gherkin scenarios, prototype-only mechanics

• Full training/RL system design (signals are defined; training is separate work)

4. Definitions

Job: A single Blueprint request (answer, patch, refactor, analysis, etc.).

Policy: A set of constraints and behaviors attached to a job (MARE policy).

Wave: A staged group of agent executions (e.g., scouts → infantry → artillery).

Decision Record: The canonical audit trail describing how MARE arrived at the result.

Validator: Deterministic checks used in arbitration (lint/typecheck/tests/static analysis).

DIS: Domain Intelligence Score; model performance score per domain/task type.

5. User Experience: The Parametric Equalizer

MARE is controlled through a constraint UI (“Parametric Equalizer”) that expresses policy, not manual orchestration.

5.1 Core Constraints

Max Spend (hard cap per job)

Max Time (deadline / latency budget)

Target Confidence (acceptance threshold)

Redundancy Floor (minimum independent verifications)

5.2 Locks and Guardrails

• Any constraint can be locked as “hard.”

• Each constraint supports min/max guardrails.

• If constraints are incompatible, the UI must:

• show the unsatisfiable state, and

• offer the closest satisfiable plan (clearly labeled).

5.3 Profiles

• Save/load named profiles (e.g., “Fast Draft”, “Production Patch”, “Security Audit”).

• Auto-apply profiles by:

• repo, directory,

• file type,

• task type (migrations, auth/security, infra, docs).

5.4 Strategy Visibility

The UI must always display:

• selected strategy (Serial / Parallel / Hybrid),

• expected spend range and time range,

• validators that will be applied,

• escalation behavior (what triggers Wave 2/3).

6. Blueprint Integration

6.1 Blueprint as Dispatcher Client


Blueprint generates a JobCard containing:

• task type + domain hints,

• context pointers (files, diffs, retrieval references),

• repo-level validator commands or configured checks,

• selected profile + constraints.

MARE consumes the JobCard and returns:

• an Execution Plan (strategy, waves, budgets, validators, arbitration method),

• final result artifact,

• Decision Record + receipts.
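The request/response contract above could be sketched as plain data structures. Field names here are illustrative assumptions, not the final JobCard schema:

```python
from dataclasses import dataclass, field

@dataclass
class JobCard:
    """Blueprint -> MARE request (section 6.1; field names are illustrative)."""
    task_type: str                  # e.g. "patch", "refactor", "analysis"
    domain_hints: list[str]
    context_pointers: list[str]     # files, diffs, retrieval references
    validator_commands: list[str]   # repo-configured checks
    profile: str
    constraints: dict[str, float]

@dataclass
class MAREResult:
    """MARE -> Blueprint response: the artifact plus its provenance."""
    artifact: str
    execution_plan: dict
    decision_record: dict
    receipts: list[dict] = field(default_factory=list)
```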

6.2 Execution Plan as a First-Class Artifact

MARE must produce a plan before execution, including:

• wave definitions (counts, tiers, context level),

• per-wave budget allocation,

• concurrency ceilings,

• arbitration method,

• validator stack,

• stop conditions and fail-closed behavior.
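A minimal sketch of the plan artifact, assuming illustrative names (`Wave`, `ExecutionPlan`, `total_budget`); the tier and context-level strings echo section 7.3:

```python
from dataclasses import dataclass

@dataclass
class Wave:
    agent_count: int
    tier: str            # "scout" | "infantry" | "artillery"
    context_level: str   # "compressed" | "full"
    budget_usd: float

@dataclass
class ExecutionPlan:
    """Produced before execution (section 6.2; field names are illustrative)."""
    strategy: str                  # "serial" | "parallel" | "hybrid"
    waves: list[Wave]
    concurrency_ceiling: int
    arbitration_method: str
    validator_stack: list[str]
    fail_closed: bool = True       # stop condition: return best-known result

    def total_budget(self) -> float:
        """Per-wave allocations must sum within the job's hard spend cap."""
        return sum(w.budget_usd for w in self.waves)
```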

7. Strategies

MARE selects one of three strategy families per job based on constraints and registry telemetry.

7.1 Serial (“Waterfall”)

• Run agents sequentially, typically cheapest/most domain-fit first.

• Stop when result passes acceptance threshold and validators.

• Best for strict spend caps and low urgency.
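A rough sketch of the waterfall loop, assuming agents arrive as (estimated_cost, run_fn) pairs and `accept` bundles the acceptance threshold plus validators; all names are illustrative:

```python
def run_serial(agents, job, max_spend, accept):
    """Waterfall sketch (section 7.1): try agents cheapest first, stop at the
    first output that passes acceptance + validators, never exceed the cap."""
    spent = 0.0
    best = None
    for cost, run in sorted(agents, key=lambda a: a[0]):
        if spent + cost > max_spend:
            break  # Budget Governor: never start an attempt we cannot afford
        spent += cost
        candidate = run(job)
        best = best or candidate
        if accept(candidate):
            return candidate, spent
    return best, spent  # fail closed: best-known result under the cap
```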

7.2 Parallel (“Shotgun”)

• Run multiple agents concurrently within concurrency and spend ceilings.

• Aggregate + arbitrate quickly to reach a confident decision.

• Best for tight deadlines and critical correctness.

7.3 Staged Hybrid (“Nose-In”)

• Wave 1: low-cost scouts with compressed context

• Wave 2: mid-tier agents with full context

• Wave 3: frontier arbitration only if needed

Important: Wave sizes are not hardcoded. They are derived from:

• Max Spend / Max Time,

• required redundancy floor,

• registry telemetry (latency/cost/capabilities),

• task criticality/profile.
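One possible derivation, shown only to make "derived, not hardcoded" concrete: wave sizes fall out of the spend cap, the redundancy floor, and per-tier cost estimates. The function name, the `reserve_frac` heuristic, and the `tier_costs` representation are all assumptions:

```python
def derive_wave_sizes(max_spend, redundancy_floor, tier_costs, reserve_frac=0.5):
    """Size each wave from remaining budget, holding a fraction in reserve
    for later waves; tier_costs lists estimated cost/attempt, cheapest first."""
    sizes = []
    remaining = max_spend
    for i, cost in enumerate(tier_costs):
        last = (i == len(tier_costs) - 1)
        budget = remaining if last else remaining * (1.0 - reserve_frac)
        n = int(budget // cost)
        if i == 0:
            # Wave 1 carries the redundancy floor, but never past the hard cap.
            n = min(max(n, redundancy_floor), int(remaining // cost))
        sizes.append(n)
        remaining -= n * cost
    return sizes
```

Note how the redundancy floor can push Wave 1 above its reserved share, but only while the total stays under the hard cap.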

8. Two Required Pillars (Non-Negotiable)

8.1 Budget Governor (Hard Enforcement)

MARE must enforce constraints at runtime via a Budget Governor that:

• enforces max spend (hard cap),

• allocates per-wave budgets,

• caps in-flight concurrency,

• allows escalation only when both budget and time allow,

• supports cancellation/termination of in-flight attempts,

• includes provider circuit breakers (timeouts, rate limits, repeated failures),

• fails closed:

• if constraints can’t be satisfied, return best-known result with explicit reasons.
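A minimal sketch of the reserve-then-release pattern such a governor might use, assuming estimated costs are known before launch; class and method names are illustrative:

```python
import threading

class BudgetGovernor:
    """Hard-enforcement sketch (section 8.1): spend is reserved *before* an
    attempt launches, in-flight attempts are capped by a semaphore, and
    over-cap requests are refused (fail closed)."""
    def __init__(self, max_spend, max_concurrency):
        self.max_spend = max_spend
        self.spent = 0.0
        self._lock = threading.Lock()
        self._slots = threading.Semaphore(max_concurrency)

    def try_reserve(self, est_cost):
        """Atomically reserve budget; returns False instead of overspending."""
        with self._lock:
            if self.spent + est_cost > self.max_spend:
                return False
            self.spent += est_cost
            return True

    def release(self, est_cost, actual_cost):
        """Settle a reservation against actual cost after an attempt finishes."""
        with self._lock:
            self.spent += actual_cost - est_cost

    def run(self, est_cost, attempt):
        if not self.try_reserve(est_cost):
            return None  # fail closed: caller falls back to best-known result
        with self._slots:  # caps in-flight concurrency
            result, actual = attempt()
        self.release(est_cost, actual)
        return result
```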

8.2 Decision Provenance (Decision Record)

Every job produces a durable Decision Record containing:

• constraints/profile used,

• strategy chosen and why (reason codes),

• per-attempt receipts (tokens, cost, latency, provider/model),

• aggregation method and outcome,

• dissent summaries,

• validator results,

• final selection rationale.

This record must be accessible from UI and stored for debugging and learning.
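One way the record might accumulate, shown as a sketch; the reason codes ("DEADLINE_TIGHT", "VALIDATORS_PASSED") and field names are placeholders, not a defined vocabulary:

```python
import json
import time

class DecisionRecord:
    """Append-only audit trail sketch (section 8.2)."""
    def __init__(self, job_id, profile, strategy, strategy_reason):
        self.data = {
            "job_id": job_id,
            "profile": profile,
            "strategy": strategy,
            "strategy_reason": strategy_reason,  # reason code, e.g. "DEADLINE_TIGHT"
            "attempts": [],    # per-attempt receipts
            "validators": [],  # validator results
            "dissent": [],     # dissent summaries
            "final": None,
        }

    def add_receipt(self, model, tokens, cost_usd, latency_s):
        self.data["attempts"].append({
            "model": model, "tokens": tokens,
            "cost_usd": cost_usd, "latency_s": latency_s,
            "ts": time.time(),
        })

    def finalize(self, winner, reason):
        self.data["final"] = {"winner": winner, "reason": reason}
        return json.dumps(self.data)  # durable, UI-accessible form
```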

9. Agent/Model Registry

MARE relies on a live registry that provides both capability contracts and telemetry.

9.1 Capability Contracts (Required)

Per model/provider:

• supports streaming (Y/N)

• supports logprobs/entropy signals (Y/N)

• supports tools/function calling (Y/N)

• max context window

• structured output reliability modes (if available)

• known operational constraints (rate limits, typical timeouts)

Requirement: MARE must not pick a strategy requiring a capability that selected agents cannot provide.

9.2 Telemetry (Required)

Per model/provider:

• rolling latency (p50/p95),

• throughput (tokens/sec),

• real cost rates (input/output),

• failure/timeout rate,

• availability signals.

9.3 Domain Intelligence Score (DIS)

DIS is a normalized performance score per domain/task type derived from:

• validator pass rate,

• user accept/reject outcomes,

• rework/rerun rate,

• post-merge regressions (when detectable),

• policy-lane success metrics.

DIS supports cold-start via priors, then adapts to repo/user reality.
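One plausible update rule, assuming DIS is maintained as an exponential moving average seeded from a cold-start prior; the 0/1 signal encoding and the `alpha` value are illustrative assumptions:

```python
def update_dis(prior, outcomes, alpha=0.1):
    """Domain Intelligence Score sketch (section 9.3): start from a prior,
    then move toward observed outcomes (validator pass, user accept, no
    rework encoded as 1; failure/reject/rework as 0). alpha sets how fast
    the score adapts to repo/user reality."""
    score = prior
    for outcome in outcomes:
        score = (1 - alpha) * score + alpha * outcome
    return score
```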

10. Aggregation and Arbitration (Artifact-Aware)

MARE must aggregate outputs as artifacts, not strings.

10.1 Code Synthesis Aggregator

For code-producing jobs:

• compare outputs as diffs/patches when possible,

• merge non-overlapping edits,

• detect conflicts,

• when conflicts exist, trigger a repair arbitration step within remaining budget/time.
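A toy sketch of overlap-based merging, assuming each candidate patch is reduced to (file, start_line, end_line, new_text) hunks; real diff handling would be richer, but the merge/conflict split is the same idea:

```python
def merge_patches(candidates):
    """Artifact-aware aggregation sketch (section 10.1): non-overlapping
    hunks merge; a hunk that overlaps an already-merged hunk on the same
    file becomes a conflict that triggers the repair arbitration step."""
    merged, conflicts = [], []
    for patch in candidates:
        for hunk in patch:
            f, start, end, _ = hunk
            clash = any(
                mf == f and start <= mend and mstart <= end
                for mf, mstart, mend, _ in merged
            )
            (conflicts if clash else merged).append(hunk)
    return merged, conflicts
```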

10.2 Validators as Judges

Validators are first-class decision signals. Arbitration must consider:

• lint/typecheck,

• unit test subsets,

• static analysis (profile-dependent),

• repo-configured commands.

A minority output may be selected if it is the only one passing validators.
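The selection rule can be sketched as a simple validator-pass count, which naturally lets a minority output win when it is the only one passing; the validators here are stand-in callables:

```python
def arbitrate(candidates, validators):
    """Validators-as-judges sketch (section 10.2): score each candidate by
    how many deterministic checks it passes, then pick the top scorer.
    Consensus among candidates is never consulted, so a lone passing
    output beats a failing majority."""
    def score(candidate):
        return sum(1 for check in validators if check(candidate))
    return max(candidates, key=score)
```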

10.3 Judge Agent (Optional, Budgeted)

If validators are unavailable or inconclusive, MARE may use a judge model only when:

• budget/time permit,

• provenance is recorded,

• and the judge method is explicitly identified in the Decision Record.

11. Early Termination and Guardrails

11.1 Stream Quality Guards

MARE must support early termination of “runaway” generations.

• If logprobs exist, entropy-based termination may be used.

• If logprobs do not exist, MARE must use fallback guards such as:

• repetition/loop detection,

• no-progress heuristics,

• structural violations (e.g., JSON/patch format),

• code-syntax divergence indicators.
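As one example of a fallback guard, a naive repetition detector over the streamed text; the window and repeat thresholds are illustrative, and production guards would combine several such signals:

```python
def should_terminate(text, window=40, repeats=3):
    """Fallback stream guard sketch (section 11.1) for providers without
    logprobs: terminate when the tail of the stream is the same window-sized
    chunk repeated `repeats` times, i.e. the generation is looping."""
    tail = text[-window * repeats:]
    if len(tail) < window * repeats:
        return False  # not enough output yet to judge
    chunk = tail[:window]
    return tail == chunk * repeats
```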

11.2 Cancellation Semantics

MARE must be able to:

• cancel in-flight agent attempts when escalation occurs,

• cancel when budget/time is reached,

• cancel when a sufficiently validated answer is found.

12. Trust and Control: User-Facing Features

12.1 MARE Timeline

A live “run timeline” view showing:

• waves launched and completed,

• which agents agreed/disagreed (and why),

• budget/time consumed vs remaining,

• which validators ran and results,

• whether escalation occurred and reason codes.

12.2 Manual Overrides

User can:

• stop escalation,

• force escalation (“use artillery now”),

• pin or ban providers/models per repo/task type,

• override strategy (within hard constraints).

12.3 Safety Gates for Applying Changes

Profiles can require user approval before applying patches, especially for:

• migrations,

• auth/security,

• infra/config,

• low-consensus results,

• validator failures.

13. Functional Requirements

13.1 Inputs

• JobCard (task type, domain hints, context pointers, profile/constraints)

• registry (capabilities + telemetry + DIS)

• repo validator configuration (commands/rules per profile)

13.2 Outputs

• final artifact (answer/patch)

• Execution Plan (what was intended)

• Decision Record (what happened and why)

• receipts (tokens/cost/time by attempt)

• confidence score and validator outcomes

13.3 Operational Requirements

• heterogeneous providers supported concurrently (cloud + local)

• robust timeouts/retries with circuit breakers

• deterministic enforcement of caps and concurrency

• consistent logging and provenance for every attempt

14. Success Metrics

Median cost/job reduced vs single-model baseline under comparable quality constraints

Median time-to-acceptable reduced for “fast” profiles

Validator pass rate improved for code tasks

Rerun/rework rate decreased (proxy for trust + correctness)

Escalation rate stable and explainable (not always going to Wave 3)

User override frequency declines over time as DIS improves

15. Rollout Phases

P0 — Safety and Core Value

• Budget Governor

• Serial + staged hybrid strategy

• capability contracts + telemetry registry

• Decision Record v1

• validator integration (minimum viable set)

P1 — Trust and Performance

• Parallel mode with strict caps

• Code Synthesis Aggregator (merge + conflict repair)

• MARE Timeline UI + manual overrides

P2 — Learning

• DIS reinforcement from outcomes

• per-repo priors and profile recommendations

• smarter auto-selection and lower escalation rates

16. Open Questions (Product)

• Which task types require mandatory human approval by default?

• What is the default validator stack per language/repo type?

• What should “confidence” mean across modalities (code patch vs analysis vs prose)?

• How should local models be scored and integrated into DIS without bias?

