THE PIPELINE

12 agents. One goal.
Kill the bad ideas.

MAGELLAN reads across scientific silos to find where existing knowledge connects in ways nobody has seen — then turns those connections into testable hypotheses and attacks its own ideas until only the defensible ones survive.

Four phases, one session

Each discovery session runs these phases in sequence. Everything is autonomous — input is “go”, output is testable hypotheses.

1. Scout
Find where to look

Scans the scientific landscape for connections nobody has explored. Uses 10 strategies including ABC bridging, contradiction mining, structural isomorphism, and serendipity.

2. Generate
Propose mechanisms

Creates detailed mechanistic hypotheses with specific proteins, pathways, and predictions. Every claim is tagged as grounded, parametric, or speculative.

3. Critique
Attack every claim

9 adversarial attack vectors. Checks each citation against real literature. Searches for counter-evidence. Fabricated citations = automatic kill.

4. Validate
Score & verify

10-point quality rubric. 6-dimension ranking. Cross-model validation with GPT-5.4 and Gemini 3.1. Only the strongest survive.

The full pipeline

What actually happens when you type /discover. Two model tiers: opus for deep cross-disciplinary reasoning, sonnet for structured search and scoring.


28 files created · 30+ web searches · 13 agent dispatches · 3 AI models
Orchestrator · opus — dispatches all agents, enforces guard logic, manages adaptive cycles

Cannot execute phases inline — can only dispatch agents. WebSearch/WebFetch removed from the coordinator. This prevents monolithic LLM behavior and ensures each agent operates within its constraints.

Phase 1 · Scout
Find where to explore
Scout · opus · 2–4 min

Finds where to look

Receives: Meta-learning insights from prior sessions, rotating creativity constraint
Outputs: 5–6 candidate targets with bridge concepts and strategy rationale

Uses 10 exploration strategies with diversity constraints. Must produce ≥2 different strategies across 3 targets, with ≥1 using a strategy with <2 sessions of data.
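As a sketch, the diversity rule reduces to two checks. Function and field names here are illustrative, not from the actual Scout prompt:

```python
# Hypothetical check for the Scout's diversity constraint: across the
# selected targets, require >=2 distinct strategies, with >=1 drawn from
# a strategy that has fewer than 2 prior sessions of data.

def satisfies_diversity(targets, prior_session_counts):
    """targets: list of dicts with a 'strategy' key.
    prior_session_counts: strategy name -> prior sessions using it."""
    strategies = [t["strategy"] for t in targets]
    if len(set(strategies)) < 2:
        return False  # not enough distinct strategies
    # at least one under-explored strategy (<2 prior sessions)
    return any(prior_session_counts.get(s, 0) < 2 for s in strategies)

targets = [
    {"strategy": "swanson_abc"},
    {"strategy": "swanson_abc"},
    {"strategy": "serendipity"},
]
counts = {"swanson_abc": 5, "serendipity": 1}
```

Here the target set passes: two distinct strategies, and serendipity has only one prior session of data.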

Literature Scout · sonnet · 2–3 min

Retrieves and verifies literature

Receives: All candidate targets from Scout
Outputs: Disjointness scores, retrieved papers, verification status per target

Mandatory MCP server calls (Semantic Scholar, PubMed). Verifies field disjointness — DISJOINT targets have 84% pass rate vs. 30% for PARTIALLY_EXPLORED.

Target Evaluator · opus · 1–2 min

Adversarial challenge

Receives: Top 3 targets (Orchestrator-selected, DISJOINT priority)
Outputs: Challenged targets with vulnerability assessment

4 attack axes: popularity bias, vagueness, structural impossibility, local optima. Weakens framing before generation starts.

Computational Validator · sonnet · 1–2 min

Programmatic bridge checks

Receives: Surviving targets with literature context
Outputs: Quantitative bridge evidence (pathway overlaps, protein interactions, co-occurrence stats)

KEGG pathway cross-check, STRING protein interactions, PubMed co-occurrence analysis, back-of-envelope physics calculations.
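The co-occurrence analysis can be illustrated with a simple "lift" statistic. The counts below are placeholders; a real run would query PubMed hit counts for each term:

```python
# Illustrative PubMed co-occurrence lift: observed joint frequency of two
# terms versus the frequency expected if they were independent.

def cooccurrence_lift(n_a, n_b, n_ab, n_total):
    """Lift ~1: no association. Lift >> 1: an already-documented link
    (bad for novelty). Lift near 0 plus a plausible mechanism is the
    interesting regime for a bridge concept."""
    expected = (n_a / n_total) * (n_b / n_total)
    observed = n_ab / n_total
    return observed / expected if expected > 0 else float("inf")

# e.g. 50,000 papers on term A, 20,000 on term B, 10 mentioning both,
# out of a corpus of ~35M records (all numbers hypothetical)
lift = cooccurrence_lift(50_000, 20_000, 10, 35_000_000)
```

A lift well below 1, as in this example, suggests the two literatures barely touch each other.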

Phase 2 · Generate
Create detailed hypotheses
Generator · opus · 2–3 min

Creates hypotheses with self-critique

Receives: Validated targets, literature context, computational validation results
Outputs: 5–6 hypotheses per cycle with claim-level tagging (GROUNDED/PARAMETRIC/SPECULATIVE)

Builds a Structured Relationship Map (parametric knowledge graph) before generating. Uses bisociation and multi-level abstraction. Runs mandatory 5-point SELF-CRITIQUE: citation specificity, directionality, cellular compartment, quantitative sanity, protein properties.

Phase 3 · Critique & Rank
Attack, score, evolve
Critic · opus · 2–3 min

9 adversarial attack vectors

Receives: All generated hypotheses with claim tags
Outputs: Verdict per hypothesis (SURVIVE/WOUNDED/KILLED), kill reasons, critic_questions for next cycle

Target kill rate: 50–70%. Runs META-CRITIQUE if kill rate <15%. Writes bidirectional feedback: critic_questions forwarded to Generator in the next cycle.
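The kill-rate bookkeeping and META-CRITIQUE trigger, as a sketch. Thresholds come from the text; the function shape is an assumption:

```python
# Sketch of the Critic's kill-rate accounting: target band 50-70%,
# META-CRITIQUE forced when the kill rate falls below 15%.

def critique_summary(verdicts, target=(0.50, 0.70), meta_threshold=0.15):
    """verdicts: list of 'SURVIVE' / 'WOUNDED' / 'KILLED' strings."""
    kill_rate = verdicts.count("KILLED") / len(verdicts)
    return {
        "kill_rate": kill_rate,
        "in_target_band": target[0] <= kill_rate <= target[1],
        "meta_critique": kill_rate < meta_threshold,  # too lenient?
    }

summary = critique_summary(["KILLED", "KILLED", "SURVIVE", "WOUNDED", "KILLED"])
```

Three kills out of five lands inside the 50–70% band, so no META-CRITIQUE pass is needed.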

Ranker · sonnet · 1 min

6-dimension weighted scoring

Receives: Surviving hypotheses post-critique
Outputs: Scored and ranked hypotheses with per-dimension justifications + Elo tournament diagnostic

Mandatory per-hypothesis scoring table with ≥2-sentence justification per dimension. Diversity check: promotes dissimilar hypothesis if 3+ share same bridge mechanism. Elo pairwise tournament (15 comparisons) as sanity check.
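A toy version of the Elo sanity check. A real run would let an LLM judge each pairing; here a stub judge simply prefers the higher composite score, which is enough to show the mechanics:

```python
# Toy Elo pairwise tournament over ranked hypotheses. The stub judge
# and the example scores are illustrative assumptions.
import itertools
import random

def elo_tournament(scores, k=32, n_comparisons=15, seed=0):
    """scores: hypothesis id -> composite score (0-10 scale).
    Runs up to n_comparisons pairwise matches; ratings start at 1000."""
    rng = random.Random(seed)
    ratings = {h: 1000.0 for h in scores}
    pairs = list(itertools.combinations(scores, 2))
    for a, b in rng.sample(pairs, min(n_comparisons, len(pairs))):
        expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
        outcome_a = 1.0 if scores[a] >= scores[b] else 0.0  # stub judge
        ratings[a] += k * (outcome_a - expected_a)
        ratings[b] += k * ((1 - outcome_a) - (1 - expected_a))
    return ratings

ratings = elo_tournament({"H1": 7.8, "H2": 6.9, "H3": 5.4, "H4": 7.1})
```

If the Elo ordering disagrees with the weighted ranking, that flags scoring instability worth a second look.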

Evolver · sonnet · 1–2 min

Genetic refinement

Receives: Ranked hypotheses with diversity analysis
Outputs: Recombined hypotheses with diversity constraint enforcement

Genetic operations (crossover, mutation) on promising hypotheses. Conditionally skippable if top-3 ≥ 6.5. Enforces diversity constraint at the population level.

CYCLE DECISION
Early complete if top-3 ≥ 7.0 · Standard: 2 cycles · Extended if survival <30%
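The cycle decision can be sketched as a small function. The thresholds come from the banner above; the exact control flow is an assumption:

```python
# Sketch of adaptive cycle control: finish early when the top-3 average
# reaches 7.0, run 2 cycles by default, extend to a 3rd cycle when
# survival drops below 30%.

def next_action(cycle, top3_scores, survival_rate, max_cycles=3):
    top3_avg = sum(top3_scores) / len(top3_scores)
    if top3_avg >= 7.0:
        return "COMPLETE_EARLY"
    if cycle < 2:
        return "RUN_NEXT_CYCLE"      # standard second cycle
    if survival_rate < 0.30 and cycle < max_cycles:
        return "EXTEND"              # low survival: one more cycle
    return "COMPLETE"
```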
Phase 4 · Validate
Final verification & meta-learning
Quality Gate · opus · 8–12 min

10-point rubric, 30+ web searches

Receives: Top-ranked hypotheses from Ranker
Outputs: Final verdict (PASS/CONDITIONAL_PASS/FAIL), per-claim grounding verification

Each hypothesis receives deep analysis with mandatory META-VALIDATION reflection. Verifies every [GROUNDED] claim individually via web search. Citation hallucination or fabricated protein properties = automatic FAIL.

Cross-Model Validator · sonnet · 25–35 min

Independent GPT + Gemini assessment

Receives: Quality Gate survivors
Outputs: Independent assessments from GPT-5.4 Pro (reasoning=high) and Gemini 3.1 Pro (thinking=HIGH)

Automatic API calls to two external models. GPT focuses on empirical validation, Gemini on structural analysis. Consensus report synthesizes agreement and divergence.

Session Analyst · sonnet · 5–8 min

Meta-learning extraction

Receives: Complete session results and pipeline execution data
Outputs: Strategy performance, kill patterns, bridge type analysis → meta-insights for future sessions

Feeds forward into future sessions: which strategies worked, which targets were productive, what bridge types survived critique. This is how the system learns.

Typical total: 60–155 minutes depending on cycles and cross-model validation. Everything runs autonomously — you type /discover and come back to find results. Every claim in every surviving hypothesis has been verified against published literature.

10 exploration strategies

How the Scout decides where to look. Each session must use ≥2 different strategies, with ≥1 from a strategy with fewer than 2 prior sessions of data.

01 Recent Breakthrough Radiation

Traces implications of recent discoveries into non-obvious distant fields. What does this finding change about something nobody has thought to connect?

02 Anomaly Hunting

Targets reproducible but unexplained phenomena across science — reliable observations that resist explanation under current paradigms.

03 Converging Vocabularies

Finds fields developing similar mathematical frameworks or conceptual language independently, suggesting hidden shared structure.

04 Tool Transfer

Identifies analytical tools, methods, or instruments from one field that could solve open problems in an unrelated domain.

05 Scale Bridging

Connects phenomena well-understood at one scale (molecular, cellular, organismal, ecological) to adjacent scales where the same logic hasn't been applied.

06 Failed Paradigm Recycling

Revisits ideas abandoned in their original field — they may work in a completely different context where the original failure conditions don't apply.

07 Swanson ABC Bridging

Systematic identification of literatures with shared intermediate concepts (B) but no direct citations between them (A and C). The classic discovery method, operationalized.

08 Contradiction Mining

Active search for contradictions between fields as sources of novel hypotheses. Inspired by FutureHouse's ContraCrow approach.

09 Structural Isomorphism (v5.8)

Fields sharing identical mathematical structure (equations, network topology, information-theoretic constraints) but completely different physical substrates. The bridge IS the math itself.

10 Serendipity (v5.8)

Deliberate exposure to unexpected knowledge: choose a never-explored domain, find its most surprising recent discovery, and ask which distant field would be most transformed.

Quality safeguards: Bridge concepts are mandatory for every target. A rotating creativity constraint (mod 5 per session) forces exploration of diverse target types: cross-discipline, mathematical, temporal gap, tool transfer, or unsolved problem.
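The rotation itself is simple modular arithmetic. The ordering of the five constraint types below is an assumption:

```python
# Rotating creativity constraint (mod 5): the session number selects one
# of five target-type constraints. Ordering here is illustrative.

CONSTRAINTS = [
    "cross-discipline",
    "mathematical",
    "temporal gap",
    "tool transfer",
    "unsolved problem",
]

def creativity_constraint(session_number):
    return CONSTRAINTS[session_number % 5]
```

Over any five consecutive sessions, every constraint type is forced exactly once.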

9 ways every hypothesis gets attacked

The Critic agent is genuinely adversarial. Target kill rate: 50–70%. If kill rate drops below 15%, a META-CRITIQUE forces the Critic to re-examine whether it was too lenient.

01 Novelty Kill

WebSearch verification that the connection isn't already published or well-studied.

Fails if: existing review paper covers this connection

02 Mechanism Kill

Physical, chemical, or biological plausibility check — energy scales, timescales, concentrations.

Fails if: proposed mechanism violates known physical constraints

03 Logic Kill

Detects correlation masquerading as causation, analogy confused with structural relation, or post-hoc reasoning.

Fails if: causal claims unsupported by mechanism

04 Falsifiability Kill

Can this hypothesis be proven wrong with a specific experiment?

Fails if: no experiment could falsify the claim → automatic KILL

05 Triviality Kill

Would a PhD student in each relevant field say 'obviously'?

Fails if: experts in the field would consider this well-known

06 Counter-Evidence Search

Dedicated WebSearch for contradictions and mechanism failures in published literature.

Fails if: strong counter-evidence exists and is not addressed

07 Groundedness Attack

Distinguishes literature-grounded claims from parametric knowledge from pure speculation.

Fails if: >50% of core claims are unverifiable

08 Hallucination-as-Novelty

Directly targets the Science/AAAS finding that AI novelty scores collapse from 5.38→3.41 after experimental validation.

Fails if: novelty depends entirely on unverifiable claims → probable hallucination

09 Claim-Level Fact Verification

Web searches every individual [GROUNDED] claim: author+year+journal, directionality, cellular compartment, protein properties.

Fails if: citation hallucination or fabricated protein property → automatic KILL

The scoring system

Two layers: 6-dimension ranking by the Ranker agent, then a 10-point quality gate by an Opus-level agent with 35 reasoning turns.

6-Dimension Ranking

Dimension · Weight · What It Measures
Novelty · 20% · Is this connection unexplored in existing literature? Verified via web search.
Mechanistic Specificity · 20% · How concrete and detailed is the proposed mechanism? Specific proteins, pathways, and predictions.
Cross-field Distance · 10% · How far apart are the connected disciplines? Higher distance = more surprising connection.
Testability · 20% · Can this be verified with existing methods, organisms, and equipment within a reasonable timeframe?
Impact · 10% · If true, how much would this change our understanding of either field?
Groundedness · 20% · Are the hypothesis components supported by retrievable published evidence?

Composite = weighted average + 0.5 cross-domain creativity bonus for hypotheses crossing 2+ disciplinary boundaries. An Elo tournament (15 pairwise comparisons) cross-checks the linear ranking.
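In code, the composite described above might look like this. The dimension weights come from the table; the example hypothesis scores are made up:

```python
# Composite score: weighted average of six dimensions plus a 0.5 bonus
# for hypotheses crossing 2+ disciplinary boundaries.

WEIGHTS = {
    "novelty": 0.20,
    "mechanistic_specificity": 0.20,
    "cross_field_distance": 0.10,
    "testability": 0.20,
    "impact": 0.10,
    "groundedness": 0.20,
}

def composite_score(dim_scores, boundaries_crossed):
    base = sum(WEIGHTS[d] * dim_scores[d] for d in WEIGHTS)
    bonus = 0.5 if boundaries_crossed >= 2 else 0.0
    return round(base + bonus, 2)

score = composite_score(
    {"novelty": 8, "mechanistic_specificity": 7, "cross_field_distance": 6,
     "testability": 7, "impact": 8, "groundedness": 6},
    boundaries_crossed=2,
)
```

Note how the 0.5 bonus can move a strongly cross-domain hypothesis past a slightly better-scored but narrower one.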

10-Point Quality Gate

01 Mechanism plausibility — does the proposed mechanism make physical/biological sense?
02 Literature novelty — web grounding check confirms this isn't restating known results
03 Falsifiability — can this be experimentally disproven with a specific test?
04 Bridge concept clarity — is the cross-field connection mechanistically clear?
05 Evidence sufficiency — are supporting claims actually findable in literature?
06 Testability — is the experimental design realistic and achievable?
07 Counter-evidence acknowledgment — what would falsify this hypothesis?
08 Cross-discipline coherence — does the idea make sense in both connected fields?
09 Prediction clarity — are outcomes specific, measurable, and time-bounded?
10 Per-claim grounding — each [GROUNDED] claim individually verified via web search

Verdicts: PASS · CONDITIONAL_PASS (with noted risks) · FAIL. Each hypothesis receives 35 turns of Opus-level analysis with mandatory META-VALIDATION reflection before output.

MAGELLAN vs. “just ask GPT”

The architecture decisions that make MAGELLAN different from prompting a single model.

Dimension · Single-model prompt · MAGELLAN
Discovery strategy · None — responds to prompt · 10 autonomous strategies with diversity constraints and exploration slots
Literature validation · None · Per-claim fact-checking via PubMed, KEGG, STRING databases + MCP server calls
Quality control · None · 9 adversarial attack vectors + 10-point quality gate rubric (35 reasoning turns)
Cross-model validation · None · Independent assessment by GPT-5.4 Pro + Gemini 3.1 Pro with consensus report
Transparency · Black-box output · Every claim tagged GROUNDED / PARAMETRIC / SPECULATIVE with sources
Kill rate · 0% — everything sounds plausible · 70–86% — most ideas are rejected as lacking novelty, evidence, or rigor
Self-critique · None · 5-point self-critique, meta-critique loops, hallucination-as-novelty detection
Evolution · Single pass · Adaptive 1–3 cycles with genetic recombination and diversity enforcement

See how MAGELLAN compares to Google AI Co-Scientist, FutureHouse, BenevolentAI, and other dedicated platforms → The Landscape

65% attrition is the point

Most AI systems optimize for output volume. MAGELLAN optimizes for rigorous filtering. The difference? We'd rather show you 89 defensible ideas than 255 fluent hallucinations.

Generated: 255 · Survived Critique: 115 · Passed Quality Gate: 89

Across 20 sessions. Citation hallucination or fabricated protein properties = automatic FAIL. Restating known results = kill.
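As arithmetic, the funnel works out to roughly a 55% critique kill rate and the headline 65% overall attrition:

```python
# Funnel numbers from the 20-session stats above.
generated, survived_critique, passed_gate = 255, 115, 89

critique_kill_rate = 1 - survived_critique / generated  # fraction killed by Critic
overall_attrition = 1 - passed_gate / generated         # fraction rejected overall
```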

What a kill looks like

Showing what the system rejects is more revealing than showing what it keeps.

FAIL · Killed at Quality Gate · Cycle 1 · Attack vector #1 (Novelty Kill)

“Quantum tunneling enables proton transfer in enzyme active sites at rates exceeding classical predictions”

Kill reason: Not novel. Klinman & Kohen (2013) extensively documented quantum tunneling in enzyme catalysis. The hypothesis restated established knowledge without adding a new mechanistic connection. The Quality Gate verified this against published literature and rejected it.

This is what happens to ideas that sound impressive but don't contribute new knowledge.

Design principles

Parametric + Retrieval

Frontier LLMs (91–94% on GPQA Diamond) generate cross-domain connections from internal knowledge. PubMed, KEGG, STRING, and Semantic Scholar validate every factual claim. Neither approach alone is sufficient — parametric knowledge finds connections, retrieval keeps them honest.

Groundedness scoring (20% weight)

Prevents fluent hallucinations from scoring high. Every [GROUNDED] claim is verified against real papers via web search. Fake citations, fabricated protein properties, or reversed directionality = automatic FAIL. The 22–48% hallucination rate in frontier models makes this essential.

Mandatory agent dispatch

The Orchestrator cannot execute phases inline — it can only dispatch agents. WebSearch and WebFetch are removed from the coordinator. This prevents monolithic LLM behavior and ensures each agent operates within defined constraints with appropriate tool access.

Cross-model validation

Survivors are independently assessed by GPT-5.4 Pro (empirical focus, reasoning=high) and Gemini 3.1 Pro (structural analysis, thinking=HIGH). Cross-model consensus increases confidence; disagreement flags areas needing deeper investigation.

Session-to-session meta-learning

Strategy performance, kill patterns, and bridge survival rates are tracked in meta-insights. The Scout reads this before each session to avoid unproductive strategies and focus on what works. The system learns which exploration approaches are most fruitful.

Radical transparency

Every claim is tagged GROUNDED, PARAMETRIC, or SPECULATIVE. We publish kill rates, confidence scores, counter-evidence, and cross-model assessments. We show what the system rejects. We label our uncertainty. If it can't survive scrutiny, it shouldn't be published.

Every agent prompt, scoring rubric, and quality gate is open source.

View on GitHub →