THE PIPELINE

12 agents. One goal.
Kill the bad ideas.

MAGELLAN reads across scientific silos to find where existing knowledge connects in ways nobody has seen — then turns those connections into testable hypotheses and attacks its own ideas until only the defensible ones survive.

Four phases, one session

Each discovery session runs these phases in sequence. Everything is autonomous — input is “go”, output is testable hypotheses.

1. Scout
Find where to look

Scans the scientific landscape for connections nobody has explored. Uses 10 strategies including ABC bridging, contradiction mining, structural isomorphism, and serendipity.

2. Generate
Propose mechanisms

Creates detailed mechanistic hypotheses with specific proteins, pathways, and predictions. Every claim is tagged as grounded, parametric, or speculative.

3. Critique
Attack every claim

9 adversarial attack vectors. Checks each citation against real literature. Searches for counter-evidence. Fabricated citations = automatic kill.

4. Validate
Score & verify

10-point quality rubric. 6-dimension ranking. Cross-model validation with GPT-5.4 and Gemini 3.1. Only the strongest survive.

The full pipeline

What actually happens when you type /discover. Two model tiers: opus for deep cross-disciplinary reasoning, sonnet for structured search and scoring.


28 files created · 30+ web searches · 13 agent dispatches · 3 AI models
Orchestrator · opus — dispatches all agents, enforces guard logic, manages adaptive cycles

Cannot execute phases inline — can only dispatch agents. WebSearch/WebFetch removed from the coordinator. This prevents monolithic LLM behavior and ensures each agent operates within its constraints.

Phase 1 · Scout
Find where to explore
Scout · opus · 2–4 min

Finds where to look

Receives: Meta-learning insights from prior sessions, rotating creativity constraint
Outputs: 5–6 candidate targets with bridge concepts and strategy rationale

Uses 10 exploration strategies with diversity constraints. Must produce ≥2 different strategies across 3 targets, with ≥1 using a strategy with <2 sessions of data.
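As a sketch, the diversity rule reduces to two checks. Function and field names here are illustrative, not from the actual Scout prompt:

```python
# Hypothetical check for the Scout's diversity constraint: across the
# selected targets, require >=2 distinct strategies, with >=1 drawn from
# a strategy that has fewer than 2 prior sessions of data.

def satisfies_diversity(targets, prior_session_counts):
    """targets: list of dicts with a 'strategy' key.
    prior_session_counts: strategy name -> prior sessions using it."""
    strategies = [t["strategy"] for t in targets]
    if len(set(strategies)) < 2:
        return False  # not enough distinct strategies
    # at least one under-explored strategy (<2 prior sessions)
    return any(prior_session_counts.get(s, 0) < 2 for s in strategies)

targets = [
    {"strategy": "swanson_abc"},
    {"strategy": "swanson_abc"},
    {"strategy": "serendipity"},
]
counts = {"swanson_abc": 5, "serendipity": 1}
```

Here the target set passes: two distinct strategies, and serendipity has only one prior session of data.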

Literature Scout · sonnet · 2–3 min

Retrieves and verifies literature

Receives: All candidate targets from Scout
Outputs: Disjointness scores, retrieved papers, verification status per target

Mandatory MCP server calls (Semantic Scholar, PubMed). Verifies field disjointness — DISJOINT targets have 84% pass rate vs. 30% for PARTIALLY_EXPLORED.

Target Evaluator · opus · 1–2 min

Adversarial challenge

Receives: Top 3 targets (Orchestrator-selected, DISJOINT priority)
Outputs: Challenged targets with vulnerability assessment

4 attack axes: popularity bias, vagueness, structural impossibility, local optima. Weakens framing before generation starts.

Computational Validator · sonnet · 1–2 min

Programmatic bridge checks

Receives: Surviving targets with literature context
Outputs: Quantitative bridge evidence (pathway overlaps, protein interactions, co-occurrence stats)

KEGG pathway cross-check, STRING protein interactions, PubMed co-occurrence analysis, back-of-envelope physics calculations.
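The co-occurrence analysis can be illustrated with a simple "lift" statistic. The counts below are placeholders; a real run would query PubMed hit counts for each term:

```python
# Illustrative PubMed co-occurrence lift: observed joint frequency of two
# terms versus the frequency expected if they were independent.

def cooccurrence_lift(n_a, n_b, n_ab, n_total):
    """Lift ~1: no association. Lift >> 1: an already-documented link
    (bad for novelty). Lift near 0 plus a plausible mechanism is the
    interesting regime for a bridge concept."""
    expected = (n_a / n_total) * (n_b / n_total)
    observed = n_ab / n_total
    return observed / expected if expected > 0 else float("inf")

# e.g. 50,000 papers on term A, 20,000 on term B, 10 mentioning both,
# out of a corpus of ~35M records (all numbers hypothetical)
lift = cooccurrence_lift(50_000, 20_000, 10, 35_000_000)
```

A lift well below 1, as in this example, suggests the two literatures barely touch each other.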

Phase 2 · Generate
Create detailed hypotheses
Generator · opus · 2–3 min

Creates hypotheses with self-critique

Receives: Validated targets, literature context, computational validation results
Outputs: 5–6 hypotheses per cycle with claim-level tagging (GROUNDED/PARAMETRIC/SPECULATIVE)

Builds a Structured Relationship Map (parametric knowledge graph) before generating. Uses bisociation and multi-level abstraction. Runs mandatory 5-point SELF-CRITIQUE: citation specificity, directionality, cellular compartment, quantitative sanity, protein properties.

Phase 3 · Critique & Rank
Attack, score, evolve
Critic · opus · 2–3 min

9 adversarial attack vectors

Receives: All generated hypotheses with claim tags
Outputs: Verdict per hypothesis (SURVIVE/WOUNDED/KILLED), kill reasons, critic_questions for next cycle

Target kill rate: 50–70%. Runs META-CRITIQUE if kill rate <15%. Writes bidirectional feedback: critic_questions forwarded to Generator in the next cycle.
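The kill-rate bookkeeping and META-CRITIQUE trigger, as a sketch. Thresholds come from the text; the function shape is an assumption:

```python
# Sketch of the Critic's kill-rate accounting: target band 50-70%,
# META-CRITIQUE forced when the kill rate falls below 15%.

def critique_summary(verdicts, target=(0.50, 0.70), meta_threshold=0.15):
    """verdicts: list of 'SURVIVE' / 'WOUNDED' / 'KILLED' strings."""
    kill_rate = verdicts.count("KILLED") / len(verdicts)
    return {
        "kill_rate": kill_rate,
        "in_target_band": target[0] <= kill_rate <= target[1],
        "meta_critique": kill_rate < meta_threshold,  # too lenient?
    }

summary = critique_summary(["KILLED", "KILLED", "SURVIVE", "WOUNDED", "KILLED"])
```

Three kills out of five lands inside the 50–70% band, so no META-CRITIQUE pass is needed.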

Ranker · sonnet · 1 min

6-dimension weighted scoring

Receives: Surviving hypotheses post-critique
Outputs: Scored and ranked hypotheses with per-dimension justifications + Elo tournament diagnostic

Mandatory per-hypothesis scoring table with ≥2-sentence justification per dimension. Diversity check: promotes dissimilar hypothesis if 3+ share same bridge mechanism. Elo pairwise tournament (15 comparisons) as sanity check.
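A toy version of the Elo sanity check. A real run would let an LLM judge each pairing; here a stub judge simply prefers the higher composite score, which is enough to show the mechanics:

```python
# Toy Elo pairwise tournament over ranked hypotheses. The stub judge
# and the example scores are illustrative assumptions.
import itertools
import random

def elo_tournament(scores, k=32, n_comparisons=15, seed=0):
    """scores: hypothesis id -> composite score (0-10 scale).
    Runs up to n_comparisons pairwise matches; ratings start at 1000."""
    rng = random.Random(seed)
    ratings = {h: 1000.0 for h in scores}
    pairs = list(itertools.combinations(scores, 2))
    for a, b in rng.sample(pairs, min(n_comparisons, len(pairs))):
        expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
        outcome_a = 1.0 if scores[a] >= scores[b] else 0.0  # stub judge
        ratings[a] += k * (outcome_a - expected_a)
        ratings[b] += k * ((1 - outcome_a) - (1 - expected_a))
    return ratings

ratings = elo_tournament({"H1": 7.8, "H2": 6.9, "H3": 5.4, "H4": 7.1})
```

If the Elo ordering disagrees with the weighted ranking, that flags scoring instability worth a second look.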

Evolver · sonnet · 1–2 min

Genetic refinement

Receives: Ranked hypotheses with diversity analysis
Outputs: Recombined hypotheses with diversity constraint enforcement

Genetic operations (crossover, mutation) on promising hypotheses. Conditionally skippable if top-3 ≥ 6.5. Enforces diversity constraint at the population level.

CYCLE DECISION
Early complete if top-3 ≥ 7.0 · Standard: 2 cycles · Extended if survival <30%
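The cycle decision can be sketched as a small function. The thresholds come from the banner above; the exact control flow is an assumption:

```python
# Sketch of adaptive cycle control: finish early when the top-3 average
# reaches 7.0, run 2 cycles by default, extend to a 3rd cycle when
# survival drops below 30%.

def next_action(cycle, top3_scores, survival_rate, max_cycles=3):
    top3_avg = sum(top3_scores) / len(top3_scores)
    if top3_avg >= 7.0:
        return "COMPLETE_EARLY"
    if cycle < 2:
        return "RUN_NEXT_CYCLE"      # standard second cycle
    if survival_rate < 0.30 and cycle < max_cycles:
        return "EXTEND"              # low survival: one more cycle
    return "COMPLETE"
```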
Phase 4 · Validate
Final verification & meta-learning
Quality Gate · opus · 8–12 min

10-point rubric, 30+ web searches

Receives: Top-ranked hypotheses from Ranker
Outputs: Final verdict (PASS/CONDITIONAL_PASS/FAIL), per-claim grounding verification

Each hypothesis receives deep analysis with mandatory META-VALIDATION reflection. Verifies every [GROUNDED] claim individually via web search. Citation hallucination or fabricated protein properties = automatic FAIL.

Cross-Model Validator · sonnet · 25–35 min

Independent GPT + Gemini assessment

Receives: Quality Gate survivors
Outputs: Independent assessments from GPT-5.4 Pro (reasoning=high) and Gemini 3.1 Pro (thinking=HIGH)

Automatic API calls to two external models. GPT focuses on empirical validation, Gemini on structural analysis. Consensus report synthesizes agreement and divergence.

Session Analyst · sonnet · 5–8 min

Meta-learning extraction

Receives: Complete session results and pipeline execution data
Outputs: Strategy performance, kill patterns, bridge type analysis → meta-insights for future sessions

Feeds forward into future sessions: which strategies worked, which targets were productive, what bridge types survived critique. This is how the system learns.

Typical total: 60–155 minutes depending on cycles and cross-model validation. Everything runs autonomously — you type /discover and come back to find results. Every claim in every surviving hypothesis has been verified against published literature.

10 exploration strategies

How the Scout decides where to look. Each session must use ≥2 different strategies, with ≥1 from a strategy with fewer than 2 prior sessions of data.

01 Recent Breakthrough Radiation

Traces implications of recent discoveries into non-obvious distant fields. What does this finding change about something nobody has thought to connect?

02 Anomaly Hunting

Targets reproducible but unexplained phenomena across science — reliable observations that resist explanation under current paradigms.

03 Converging Vocabularies

Finds fields developing similar mathematical frameworks or conceptual language independently, suggesting hidden shared structure.

04 Tool Transfer

Identifies analytical tools, methods, or instruments from one field that could solve open problems in an unrelated domain.

05 Scale Bridging

Connects phenomena well-understood at one scale (molecular, cellular, organismal, ecological) to adjacent scales where the same logic hasn't been applied.

06 Failed Paradigm Recycling

Revisits ideas abandoned in their original field — they may work in a completely different context where the original failure conditions don't apply.

07 Swanson ABC Bridging

Systematic identification of literatures with shared intermediate concepts (B) but no direct citations between them (A and C). The classic discovery method, operationalized.

08 Contradiction Mining

Active search for contradictions between fields as sources of novel hypotheses. Inspired by FutureHouse's ContraCrow approach.

09 Structural Isomorphism (v5.8)

Fields sharing identical mathematical structure (equations, network topology, information-theoretic constraints) but completely different physical substrates. The bridge IS the math itself.

10 Serendipity (v5.8)

Deliberate exposure to unexpected knowledge: choose a never-explored domain, find its most surprising recent discovery, and ask which distant field would be most transformed.

Quality safeguards: Bridge concepts are mandatory for every target. A rotating creativity constraint (mod 5 per session) forces exploration of diverse target types: cross-discipline, mathematical, temporal gap, tool transfer, or unsolved problem.
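The rotation itself is simple modular arithmetic. The ordering of the five constraint types below is an assumption:

```python
# Rotating creativity constraint (mod 5): the session number selects one
# of five target-type constraints. Ordering here is illustrative.

CONSTRAINTS = [
    "cross-discipline",
    "mathematical",
    "temporal gap",
    "tool transfer",
    "unsolved problem",
]

def creativity_constraint(session_number):
    return CONSTRAINTS[session_number % 5]
```

Over any five consecutive sessions, every constraint type is forced exactly once.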

9 ways every hypothesis gets attacked

The Critic agent is genuinely adversarial. Target kill rate: 50–70%. If kill rate drops below 15%, a META-CRITIQUE forces the Critic to re-examine whether it was too lenient.

01 Novelty Kill

WebSearch verification that the connection isn't already published or well-studied.

Fails if: existing review paper covers this connection

02 Mechanism Kill

Physical, chemical, or biological plausibility check — energy scales, timescales, concentrations.

Fails if: proposed mechanism violates known physical constraints

03 Logic Kill

Detects correlation masquerading as causation, analogy confused with structural relation, or post-hoc reasoning.

Fails if: causal claims unsupported by mechanism

04 Falsifiability Kill

Can this hypothesis be proven wrong with a specific experiment?

Fails if: no experiment could falsify the claim → automatic KILL

05 Triviality Kill

Would a PhD student in each relevant field say 'obviously'?

Fails if: experts in the field would consider this well-known

06 Counter-Evidence Search

Dedicated WebSearch for contradictions and mechanism failures in published literature.

Fails if: strong counter-evidence exists and is not addressed

07 Groundedness Attack

Distinguishes literature-grounded claims from parametric knowledge from pure speculation.

Fails if: >50% of core claims are unverifiable

08 Hallucination-as-Novelty

Directly targets the Science/AAAS finding that AI novelty scores collapse from 5.38→3.41 after experimental validation.

Fails if: novelty depends entirely on unverifiable claims → probable hallucination

09 Claim-Level Fact Verification

Web searches every individual [GROUNDED] claim: author+year+journal, directionality, cellular compartment, protein properties.

Fails if: citation hallucination or fabricated protein property → automatic KILL

The scoring system

Two layers: 6-dimension ranking by the Ranker agent, then a 10-point quality gate by an Opus-level agent with 35 reasoning turns.

6-Dimension Ranking

Dimension · Weight · What It Measures
Novelty · 20% · Is this connection unexplored in existing literature? Verified via web search.
Mechanistic Specificity · 20% · How concrete and detailed is the proposed mechanism? Specific proteins, pathways, and predictions.
Cross-field Distance · 10% · How far apart are the connected disciplines? Higher distance = more surprising connection.
Testability · 20% · Can this be verified with existing methods, organisms, and equipment within a reasonable timeframe?
Impact · 10% · If true, how much would this change our understanding of either field?
Groundedness · 20% · Are the hypothesis components supported by retrievable published evidence?

Composite = weighted average + 0.5 cross-domain creativity bonus for hypotheses crossing 2+ disciplinary boundaries. An Elo tournament (15 pairwise comparisons) cross-checks the linear ranking.
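In code, the composite described above might look like this. The dimension weights come from the table; the example hypothesis scores are made up:

```python
# Composite score: weighted average of six dimensions plus a 0.5 bonus
# for hypotheses crossing 2+ disciplinary boundaries.

WEIGHTS = {
    "novelty": 0.20,
    "mechanistic_specificity": 0.20,
    "cross_field_distance": 0.10,
    "testability": 0.20,
    "impact": 0.10,
    "groundedness": 0.20,
}

def composite_score(dim_scores, boundaries_crossed):
    base = sum(WEIGHTS[d] * dim_scores[d] for d in WEIGHTS)
    bonus = 0.5 if boundaries_crossed >= 2 else 0.0
    return round(base + bonus, 2)

score = composite_score(
    {"novelty": 8, "mechanistic_specificity": 7, "cross_field_distance": 6,
     "testability": 7, "impact": 8, "groundedness": 6},
    boundaries_crossed=2,
)
```

Note how the 0.5 bonus can move a strongly cross-domain hypothesis past a slightly better-scored but narrower one.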

10-Point Quality Gate

01 Mechanism plausibility — does the proposed mechanism make physical/biological sense?
02 Literature novelty — web grounding check confirms this isn't restating known results
03 Falsifiability — can this be experimentally disproven with a specific test?
04 Bridge concept clarity — is the cross-field connection mechanistically clear?
05 Evidence sufficiency — are supporting claims actually findable in literature?
06 Testability — is the experimental design realistic and achievable?
07 Counter-evidence acknowledgment — what would falsify this hypothesis?
08 Cross-discipline coherence — does the idea make sense in both connected fields?
09 Prediction clarity — are outcomes specific, measurable, and time-bounded?
10 Per-claim grounding — each [GROUNDED] claim individually verified via web search

Verdicts: PASS · CONDITIONAL_PASS (with noted risks) · FAIL. Each hypothesis receives 35 turns of Opus-level analysis with mandatory META-VALIDATION reflection before output.

MAGELLAN vs. “just ask GPT”

The architecture decisions that make MAGELLAN different from prompting a single model.

Dimension · Single-model prompt · MAGELLAN
Discovery strategy · None — responds to prompt · 10 autonomous strategies with diversity constraints and exploration slots
Literature validation · None · Per-claim fact-checking via PubMed, KEGG, STRING databases + MCP server calls
Quality control · None · 9 adversarial attack vectors + 10-point quality gate rubric (35 reasoning turns)
Cross-model validation · None · Independent assessment by GPT-5.4 Pro + Gemini 3.1 Pro with consensus report
Transparency · Black-box output · Every claim tagged GROUNDED / PARAMETRIC / SPECULATIVE with sources
Kill rate · 0% — everything sounds plausible · 70–86% — most ideas are rejected as lacking novelty, evidence, or rigor
Self-critique · None · 5-point self-critique, meta-critique loops, hallucination-as-novelty detection
Evolution · Single pass · Adaptive 1–3 cycles with genetic recombination and diversity enforcement

See how MAGELLAN compares to Google AI Co-Scientist, FutureHouse, BenevolentAI, and other dedicated platforms → The Landscape

65% attrition is the point

Most AI systems optimize for output volume. MAGELLAN optimizes for rigorous filtering. The difference? We'd rather show you 89 defensible ideas than 255 fluent hallucinations.

Generated: 255 · Survived Critique: 115 · Passed Quality Gate: 89

Across 20 sessions. Citation hallucination or fabricated protein properties = automatic FAIL. Restating known results = kill.
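As arithmetic, the funnel works out to roughly a 55% critique kill rate and the headline 65% overall attrition:

```python
# Funnel numbers from the 20-session stats above.
generated, survived_critique, passed_gate = 255, 115, 89

critique_kill_rate = 1 - survived_critique / generated  # fraction killed by Critic
overall_attrition = 1 - passed_gate / generated         # fraction rejected overall
```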

What a kill looks like

Showing what the system rejects is more revealing than showing what it keeps.

FAIL · Killed at Quality Gate · Cycle 1 · Attack vector #1 (Novelty Kill)

“Quantum tunneling enables proton transfer in enzyme active sites at rates exceeding classical predictions”

Kill reason: Not novel. Klinman & Kohen (2013) extensively documented quantum tunneling in enzyme catalysis. The hypothesis restated established knowledge without adding a new mechanistic connection. The Quality Gate verified this against published literature and rejected it.

This is what happens to ideas that sound impressive but don't contribute new knowledge.

Design principles

Parametric + Retrieval

Frontier LLMs (91–94% on GPQA Diamond) generate cross-domain connections from internal knowledge. PubMed, KEGG, STRING, and Semantic Scholar validate every factual claim. Neither approach alone is sufficient — parametric knowledge finds connections, retrieval keeps them honest.

Groundedness scoring (20% weight)

Prevents fluent hallucinations from scoring high. Every [GROUNDED] claim is verified against real papers via web search. Fake citations, fabricated protein properties, or reversed directionality = automatic FAIL. The 22–48% hallucination rate in frontier models makes this essential.

Mandatory agent dispatch

The Orchestrator cannot execute phases inline — it can only dispatch agents. WebSearch and WebFetch are removed from the coordinator. This prevents monolithic LLM behavior and ensures each agent operates within defined constraints with appropriate tool access.

Cross-model validation

Survivors are independently assessed by GPT-5.4 Pro (empirical focus, reasoning=high) and Gemini 3.1 Pro (structural analysis, thinking=HIGH). Cross-model consensus increases confidence; disagreement flags areas needing deeper investigation.

Session-to-session meta-learning

Strategy performance, kill patterns, and bridge survival rates are tracked in meta-insights. The Scout reads this before each session to avoid unproductive strategies and focus on what works. The system learns which exploration approaches are most fruitful.

Radical transparency

Every claim is tagged GROUNDED, PARAMETRIC, or SPECULATIVE. We publish kill rates, confidence scores, counter-evidence, and cross-model assessments. We show what the system rejects. We label our uncertainty. If it can't survive scrutiny, it shouldn't be published.

Every agent prompt, scoring rubric, and quality gate is open source.

View on GitHub →