12 agents. One goal.
Kill the bad ideas.
MAGELLAN reads across scientific silos to find where existing knowledge connects in ways nobody has seen — then turns those connections into testable hypotheses and attacks its own ideas until only the defensible ones survive.
Four phases, one session
Each discovery session runs these phases in sequence. Everything is autonomous — input is “go”, output is testable hypotheses.
Scans the scientific landscape for connections nobody has explored. Uses 10 strategies including ABC bridging, contradiction mining, structural isomorphism, and serendipity.
Creates detailed mechanistic hypotheses with specific proteins, pathways, and predictions. Every claim is tagged as grounded, parametric, or speculative.
10-point quality rubric. 6-dimension ranking. Cross-model validation with GPT-5.4 and Gemini 3.1. Only the strongest survive.
The full pipeline
What actually happens when you type /discover. Two model tiers: opus for deep cross-disciplinary reasoning, sonnet for structured search and scoring.
The Orchestrator cannot execute phases inline; it can only dispatch agents. WebSearch and WebFetch are removed from the coordinator. This prevents monolithic LLM behavior and ensures each agent operates within its constraints.
Scout (opus) · Finds where to look · 2–4 min
Uses 10 exploration strategies with diversity constraints. Must produce ≥2 different strategies across 3 targets, with ≥1 using a strategy with <2 sessions of data.
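A minimal sketch of how a diversity constraint like this could be checked programmatically; the strategy names and session counts below are illustrative, not taken from the actual Scout prompt:

```python
def diversity_ok(targets, prior_sessions):
    """targets: list of (target, strategy) pairs for the 3 proposed targets.
    prior_sessions: strategy -> number of past sessions that used it."""
    strategies = [s for _, s in targets]
    two_distinct = len(set(strategies)) >= 2                      # >= 2 different strategies
    one_fresh = any(prior_sessions.get(s, 0) < 2 for s in strategies)  # >= 1 under-explored
    return two_distinct and one_fresh

targets = [
    ("autophagy x soil carbon cycling", "structural_isomorphism"),
    ("cryo-EM methods x glass physics", "tool_transfer"),
    ("sleep pressure x prion kinetics", "anomaly_hunting"),
]
prior = {"structural_isomorphism": 5, "tool_transfer": 3, "anomaly_hunting": 1}
print(diversity_ok(targets, prior))  # True: 3 distinct strategies, one with <2 prior sessions
```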
Literature Scout (sonnet) · Retrieves and verifies literature · 2–3 min
Mandatory MCP server calls (Semantic Scholar, PubMed). Verifies field disjointness — DISJOINT targets have 84% pass rate vs. 30% for PARTIALLY_EXPLORED.
Target Evaluator (opus) · Adversarial challenge · 1–2 min
4 attack axes: popularity bias, vagueness, structural impossibility, local optima. Stress-tests the framing before generation starts.
Computational Validator (sonnet) · Programmatic bridge checks · 1–2 min
KEGG pathway cross-check, STRING protein interactions, PubMed co-occurrence analysis, back-of-envelope physics calculations.
Generator (opus) · Creates hypotheses with self-critique · 2–3 min
Builds a Structured Relationship Map (parametric knowledge graph) before generating. Uses bisociation and multi-level abstraction. Runs mandatory 5-point SELF-CRITIQUE: citation specificity, directionality, cellular compartment, quantitative sanity, protein properties.
Critic (opus) · 9 adversarial attack vectors · 2–3 min
Target kill rate: 50–70%. Runs META-CRITIQUE if the kill rate falls below 15%. Writes bidirectional feedback: critic_questions are forwarded to the Generator in the next cycle.
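The kill-rate thresholds translate into a simple control rule; a sketch under our own naming assumptions (the function names are not the system's):

```python
def needs_meta_critique(killed, total):
    """META-CRITIQUE fires when the Critic kills fewer than 15% of hypotheses."""
    return killed / total < 0.15

def within_target_band(killed, total):
    """A healthy critique pass kills 50-70% of hypotheses."""
    return 0.50 <= killed / total <= 0.70

print(needs_meta_critique(1, 10))   # True: a 10% kill rate is suspiciously lenient
print(within_target_band(6, 10))    # True: 60% sits inside the target band
```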
Ranker (sonnet) · 6-dimension weighted scoring · 1 min
Mandatory per-hypothesis scoring table with a ≥2-sentence justification per dimension. Diversity check: promotes a dissimilar hypothesis if 3+ share the same bridge mechanism. Elo pairwise tournament (15 comparisons) as a sanity check.
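How an Elo sanity check over 15 pairwise comparisons might look. Note that 15 is exactly the number of pairs among 6 hypotheses; here a stored composite score stands in for the LLM judge, which is our assumption for illustration:

```python
from itertools import combinations

def elo_tournament(scores, k=32, base=1000):
    """scores: hypothesis id -> composite score (stand-in for the LLM judge)."""
    ratings = {h: float(base) for h in scores}
    for a, b in combinations(scores, 2):          # C(6, 2) = 15 pairings
        expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
        actual_a = 1.0 if scores[a] > scores[b] else 0.0
        ratings[a] += k * (actual_a - expected_a)
        ratings[b] -= k * (actual_a - expected_a)  # zero-sum update
    return ratings

ratings = elo_tournament({"H1": 8.2, "H2": 7.5, "H3": 6.9,
                          "H4": 6.1, "H5": 5.4, "H6": 4.8})
print(max(ratings, key=ratings.get))  # H1: the Elo order agrees with the linear ranking
```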
Evolver (sonnet) · Genetic refinement · 1–2 min
Genetic operations (crossover, mutation) on promising hypotheses. Conditionally skipped when the top-3 scores are all ≥ 6.5. Enforces the diversity constraint at the population level.
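The skip condition and a toy genetic operation, sketched under our own naming assumptions (the dict keys and field values are illustrative):

```python
def skip_evolution(composites):
    """Evolver is skippable when the top-3 composite scores are all >= 6.5."""
    top3 = sorted(composites, reverse=True)[:3]
    return len(top3) == 3 and all(s >= 6.5 for s in top3)

def crossover(h_a, h_b):
    """Toy crossover: graft one hypothesis's bridge mechanism onto the
    other's field pairing."""
    return {"fields": h_a["fields"], "bridge": h_b["bridge"]}

print(skip_evolution([7.1, 6.8, 6.5, 4.2]))  # True: top three all clear 6.5
print(skip_evolution([7.1, 6.4, 6.0]))       # False: evolution cycle runs
```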
Quality Gate (opus) · 10-point rubric, 30+ web searches · 8–12 min
Each hypothesis receives deep analysis with mandatory META-VALIDATION reflection. Verifies every [GROUNDED] claim individually via web search. Citation hallucination or fabricated protein properties = automatic FAIL.
Cross-Model Validator (sonnet) · Independent GPT + Gemini assessment · 25–35 min
Automatic API calls to two external models. GPT focuses on empirical validation, Gemini on structural analysis. Consensus report synthesizes agreement and divergence.
Session Analyst (sonnet) · Meta-learning extraction · 5–8 min
Feeds forward into future sessions: which strategies worked, which targets were productive, what bridge types survived critique. This is how the system learns.
Typical total: 60–155 minutes depending on cycles and cross-model validation. Everything runs autonomously — you type /discover and come back to find results. Every claim in every surviving hypothesis has been verified against published literature.
10 exploration strategies
How the Scout decides where to look. Each session must use ≥2 different strategies, with ≥1 from a strategy with fewer than 2 prior sessions of data.
Traces implications of recent discoveries into non-obvious distant fields. What does this finding change about something nobody has thought to connect?
Targets reproducible but unexplained phenomena across science — reliable observations that resist explanation under current paradigms.
Finds fields developing similar mathematical frameworks or conceptual language independently, suggesting hidden shared structure.
Identifies analytical tools, methods, or instruments from one field that could solve open problems in an unrelated domain.
Connects phenomena well-understood at one scale (molecular, cellular, organismal, ecological) to adjacent scales where the same logic hasn't been applied.
Revisits ideas abandoned in their original field — they may work in a completely different context where the original failure conditions don't apply.
ABC bridging: systematic identification of literatures with shared intermediate concepts (B) but no direct citations between them (A and C). The classic discovery method, operationalized.
Contradiction mining: active search for contradictions between fields as sources of novel hypotheses. Inspired by FutureHouse's ContraCrow approach.
Structural isomorphism: fields sharing identical mathematical structure (equations, network topology, information-theoretic constraints) but completely different physical substrates. The bridge IS the math itself.
Serendipity: deliberate exposure to unexpected knowledge. Choose a never-explored domain, find its most surprising recent discovery, and ask which distant field would be most transformed.
Quality safeguards: Bridge concepts are mandatory for every target. A rotating creativity constraint (mod 5 per session) forces exploration of diverse target types: cross-discipline, mathematical, temporal gap, tool transfer, or unsolved problem.
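The mod-5 rotation described above is mechanical enough to sketch directly; zero-based session numbering is our assumption:

```python
TARGET_TYPES = ["cross-discipline", "mathematical", "temporal gap",
                "tool transfer", "unsolved problem"]

def creativity_constraint(session_number):
    """Each session is forced onto the next target type in the rotation."""
    return TARGET_TYPES[session_number % 5]

print(creativity_constraint(0))  # cross-discipline
print(creativity_constraint(7))  # temporal gap (7 mod 5 = 2)
```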
9 ways every hypothesis gets attacked
The Critic agent is genuinely adversarial. Target kill rate: 50–70%. If the kill rate drops below 15%, a META-CRITIQUE forces the Critic to re-examine whether it was too lenient.
The scoring system
Two layers: 6-dimension ranking by the Ranker agent, then a 10-point quality gate by an Opus-level agent with 35 reasoning turns.
6-Dimension Ranking
| Dimension | Weight | What It Measures |
|---|---|---|
| Novelty | 20% | Is this connection unexplored in existing literature? Verified via web search. |
| Mechanistic Specificity | 20% | How concrete and detailed is the proposed mechanism? Specific proteins, pathways, and predictions. |
| Cross-field Distance | 10% | How far apart are the connected disciplines? Higher distance = more surprising connection. |
| Testability | 20% | Can this be verified with existing methods, organisms, and equipment within a reasonable timeframe? |
| Impact | 10% | If true, how much would this change our understanding of either field? |
| Groundedness | 20% | Are the hypothesis components supported by retrievable published evidence? |
Composite = weighted average + 0.5 cross-domain creativity bonus for hypotheses crossing 2+ disciplinary boundaries. An Elo tournament (15 pairwise comparisons) cross-checks the linear ranking.
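The composite formula, written out with the table's weights; the dimension keys are our shorthand, and we assume all six dimensions are scored on one common scale:

```python
WEIGHTS = {
    "novelty": 0.20,
    "mechanistic_specificity": 0.20,
    "cross_field_distance": 0.10,
    "testability": 0.20,
    "impact": 0.10,
    "groundedness": 0.20,
}

def composite_score(scores, boundaries_crossed):
    """Weighted average plus a 0.5 bonus for crossing 2+ disciplinary boundaries."""
    base = sum(w * scores[dim] for dim, w in WEIGHTS.items())
    bonus = 0.5 if boundaries_crossed >= 2 else 0.0
    return base + bonus

uniform = {dim: 8.0 for dim in WEIGHTS}
print(round(composite_score(uniform, boundaries_crossed=3), 2))  # 8.5 (weights sum to 1.0)
```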
10-Point Quality Gate
Verdicts: PASS · CONDITIONAL_PASS (with noted risks) · FAIL. Each hypothesis receives 35 turns of Opus-level analysis with mandatory META-VALIDATION reflection before output.
MAGELLAN vs. “just ask GPT”
The architecture decisions that make MAGELLAN different from prompting a single model.
| Dimension | Single-model prompt | MAGELLAN |
|---|---|---|
| Discovery strategy | None — responds to prompt | 10 autonomous strategies with diversity constraints and exploration slots |
| Literature validation | None | Per-claim fact-checking via PubMed, KEGG, STRING databases + MCP server calls |
| Quality control | None | 9 adversarial attack vectors + 10-point quality gate rubric (35 reasoning turns) |
| Cross-model validation | None | Independent assessment by GPT-5.4 Pro + Gemini 3.1 Pro with consensus report |
| Transparency | Black-box output | Every claim tagged GROUNDED / PARAMETRIC / SPECULATIVE with sources |
| Kill rate | 0% — everything sounds plausible | 70–86% — most ideas are rejected as lacking novelty, evidence, or rigor |
| Self-critique | None | 5-point self-critique, meta-critique loops, hallucination-as-novelty detection |
| Evolution | Single pass | Adaptive 1–3 cycles with genetic recombination and diversity enforcement |
See how MAGELLAN compares to Google AI Co-Scientist, FutureHouse, BenevolentAI, and other dedicated platforms → The Landscape
What a kill looks like
Showing what the system rejects is more revealing than showing what it keeps.
“Quantum tunneling enables proton transfer in enzyme active sites at rates exceeding classical predictions”
This is what happens to ideas that sound impressive but don't contribute new knowledge.
Design principles
Parametric + Retrieval
Frontier LLMs (91–94% on GPQA Diamond) generate cross-domain connections from internal knowledge. PubMed, KEGG, STRING, and Semantic Scholar validate every factual claim. Neither approach alone is sufficient — parametric knowledge finds connections, retrieval keeps them honest.
Groundedness scoring (20% weight)
Prevents fluent hallucinations from scoring high. Every [GROUNDED] claim is verified against real papers via web search. Fake citations, fabricated protein properties, or reversed directionality = automatic FAIL. The 22–48% hallucination rate in frontier models makes this essential.
Mandatory agent dispatch
The Orchestrator cannot execute phases inline — it can only dispatch agents. WebSearch and WebFetch are removed from the coordinator. This prevents monolithic LLM behavior and ensures each agent operates within defined constraints with appropriate tool access.
Cross-model validation
Survivors are independently assessed by GPT-5.4 Pro (empirical focus, reasoning=high) and Gemini 3.1 Pro (structural analysis, thinking=HIGH). Cross-model consensus increases confidence; disagreement flags areas needing deeper investigation.
Session-to-session meta-learning
Strategy performance, kill patterns, and bridge survival rates are tracked in meta-insights. The Scout reads this before each session to avoid unproductive strategies and focus on what works. The system learns which exploration approaches are most fruitful.
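A minimal sketch of the feed-forward record; the real meta-insights schema is not published here, so the class and field names are ours:

```python
from collections import defaultdict

class MetaInsights:
    """Tracks per-strategy outcomes so the Scout can read them next session."""

    def __init__(self):
        self._stats = defaultdict(lambda: {"sessions": 0, "survivors": 0})

    def record_session(self, strategy, survivors):
        s = self._stats[strategy]
        s["sessions"] += 1
        s["survivors"] += survivors

    def survivors_per_session(self, strategy):
        s = self._stats[strategy]
        return s["survivors"] / s["sessions"] if s["sessions"] else 0.0

insights = MetaInsights()
insights.record_session("abc_bridging", survivors=2)
insights.record_session("abc_bridging", survivors=1)
insights.record_session("serendipity", survivors=0)
print(insights.survivors_per_session("abc_bridging"))  # 1.5
```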
Radical transparency
Every claim is tagged GROUNDED, PARAMETRIC, or SPECULATIVE. We publish kill rates, confidence scores, counter-evidence, and cross-model assessments. We show what the system rejects. We label our uncertainty. If it can't survive scrutiny, it shouldn't be published.
Every agent prompt, scoring rubric, and quality gate is open source.
View on GitHub →