Our Methodology
Last updated: May 2026
Learning Whistle publishes learning paths at two tiers, and we build them differently. This page is a transparent account of how each one is made — and exactly what you can trust it for.
The two tiers, at a glance
Standard paths are quality-controlled AI explainers. Every Standard path is written to a target reading level, checked against a structured quality gate before it is published, and labelled as AI-generated on every station. Standard paths are not individually source-cited — they are an accessible, well-structured starting point for learning, not a citable reference. Community event paths are a kind of Standard path, with added collaboration, mileage, and collectible rewards; they are built the same way.
Premium paths go much further: each one is assembled from verified, open-access sources — not from AI memory — with every factual claim cited and independently checked. The rest of this page documents that Premium pipeline in full.
How Standard paths are quality-controlled
Standard paths do not go through source retrieval, license filtering, or per-claim verification, and they carry no citations. What they do go through is a structured quality gate: content is calibrated to the reading level you choose using Flesch-Kincaid grade targets, swept for structural completeness across every station, and scored before publication — paths that fall short are regenerated rather than released. Every Standard station also carries a visible AI-Generated label, so you always know what you are reading. The result is consistent, readable, well-organised explainers; for material you intend to cite or quote, use a Premium path.
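To make the reading-level calibration concrete, here is a minimal sketch of the standard Flesch-Kincaid grade-level formula. The syllable counter is a crude vowel-group heuristic, and the function names and the one-grade tolerance are illustrative assumptions, not a description of our production gate.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels (heuristic only)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def within_target(text: str, target_grade: float, tolerance: float = 1.0) -> bool:
    """Hypothetical check: is the text within `tolerance` grades of the target?"""
    return abs(flesch_kincaid_grade(text) - target_grade) <= tolerance
```

Short sentences and short words pull the grade down; long sentences and polysyllabic vocabulary push it up.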
1. Source Retrieval
Before any content is written, our pipeline queries up to 27 authoritative open-access databases. Our topic decomposition layer also identifies cross-domain connectors relevant to each station, so a path on portrait photography may draw on optics physics (arXiv), colorimetry (NIST), and visual perception research (PubMed), not just art archives. The core databases depend on the learning category:
- Science & Medicine: PubMed Central, Europe PMC, NIH Reporter, openFDA, OpenAlex, StatPearls clinical handbook (via NCBI Bookshelf), ClinicalTrials.gov, OpenStax textbooks
- Physics & Mathematics: arXiv, NIST, OpenAlex, OpenStax textbooks
- Engineering & Computer Science: arXiv, IETF RFCs, NIST, OpenAlex, NASA Technical Reports, OpenStax Introduction to Computer Science
- Chemistry: PubChem, NIST, NIH Reporter, OpenAlex, OpenStax textbooks
- Law & Policy: CourtListener (Free Law Project), Congressional Research Service, OpenStax Business Law
- History, Arts & Humanities: Europeana, Library of Congress, Smithsonian, Project Gutenberg, OpenStax U.S./World History
- Earth Sciences: USGS Publications Warehouse, NASA Technical Reports, OpenAlex, OpenStax
- Economics & Politics: World Bank Open Data, Congressional Research Service, OpenAlex, OpenStax Macroeconomics & Microeconomics
- Philosophy: Stanford Encyclopedia of Philosophy, Project Gutenberg, OpenAlex, OpenStax Introduction to Philosophy
- Literature & Language: Library of Congress, Project Gutenberg, OpenAlex, OpenStax Writing Guide
All sources are retrieved via official public APIs. We do not scrape paywalled content. The OpenStax chapter index is built from the publicly published table of contents of each CC BY textbook and refreshed when the catalog updates.
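The category-to-database mapping with cross-domain connectors can be pictured as a simple lookup. The sketch below is illustrative only: it copies two categories from the list above, and the `databases_for` function and its `connectors` parameter are hypothetical names.

```python
# Two categories copied from the list above; the rest are omitted for brevity.
CATEGORY_DATABASES = {
    "Physics & Mathematics": ["arXiv", "NIST", "OpenAlex", "OpenStax"],
    "History, Arts & Humanities": [
        "Europeana",
        "Library of Congress",
        "Smithsonian",
        "Project Gutenberg",
        "OpenStax",
    ],
}


def databases_for(category, connectors=()):
    """Return the category's databases, then any extra cross-domain connectors.

    `connectors` stands in for whatever the topic decomposition layer suggests
    for a station; duplicates are skipped and order is preserved.
    """
    ordered = list(CATEGORY_DATABASES.get(category, []))
    for database in connectors:
        if database not in ordered:
            ordered.append(database)
    return ordered
```

For a portrait photography station, calling `databases_for("History, Arts & Humanities", ["arXiv", "NIST"])` would append the two physics sources to the art archives.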
2. License Enforcement
Only sources with the following licenses are admitted into the generation pipeline:
- CC0 (Public Domain Dedication)
- CC BY (Creative Commons Attribution)
- Public Domain
- U.S. Government Work
Sources under CC BY-SA or CC BY-NC, or with unknown license terms, are automatically excluded. This is a hard rule with no exceptions, regardless of source quality.
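A hard allowlist of this kind might look like the sketch below. The label strings and the normalisation rules (lowercasing, treating hyphens as spaces, dropping a trailing version number) are assumptions for illustration. The key design point is exact matching after normalisation, so that a share-alike or non-commercial variant can never pass as plain CC BY.

```python
import re

# Assumed normalised labels for the four admitted license classes.
ALLOWED_LICENSES = {"cc0", "cc by", "public domain", "u.s. government work"}


def normalise_license(raw):
    """Lowercase, treat hyphens as spaces, and drop a trailing version number."""
    if not raw:
        return None
    cleaned = " ".join(raw.lower().replace("-", " ").split())
    return re.sub(r"\s+\d+(\.\d+)*$", "", cleaned)


def is_admitted(raw):
    """Exact match against the allowlist; missing or unknown licenses are excluded."""
    return normalise_license(raw) in ALLOWED_LICENSES
```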
3. Relevance Scoring
Each retrieved source is scored for relevance to the specific station learning objective using an AI relevance model. Sources scoring below 0.6 out of 1.0 are dropped before generation begins. This prevents tangential sources from appearing in citations.
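The cut-off itself is a plain threshold filter. In this sketch the `ScoredSource` type and the `drop_tangential` name are hypothetical, and the relevance value is assumed to have been produced already by the relevance model.

```python
from dataclasses import dataclass

RELEVANCE_THRESHOLD = 0.6  # sources scoring below this are dropped


@dataclass
class ScoredSource:
    source_id: str
    relevance: float  # 0.0 to 1.0, from the relevance model


def drop_tangential(sources, threshold=RELEVANCE_THRESHOLD):
    """Keep only sources at or above the threshold, preserving input order."""
    return [s for s in sources if s.relevance >= threshold]
```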
4. Source Ranking
Relevant sources are ranked using a weighted formula:
- Authority tier: 40% — peer-reviewed journals and government data rank highest
- Full text availability: 20% — sources with accessible full text are preferred
- Recency: 15% — exponential decay, newer sources preferred with domain-appropriate half-lives
- Citation count: 15% — log-normalised, highly-cited work preferred
- License permissiveness: 10% — CC0 > CC-BY > government work
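The weighted formula above can be sketched as follows. The weights come from the list; everything else is an assumption for illustration: authority and license are taken as pre-scored values between 0 and 1, the five-year default half-life and the 10,000-citation cap are placeholders, and the names are hypothetical.

```python
import math
from dataclasses import dataclass

WEIGHTS = {
    "authority": 0.40,
    "full_text": 0.20,
    "recency": 0.15,
    "citations": 0.15,
    "license": 0.10,
}


@dataclass
class Candidate:
    authority: float       # assumed pre-scored tier, 0.0 to 1.0
    has_full_text: bool
    age_years: float
    citation_count: int
    license_score: float   # assumed pre-scored permissiveness, 0.0 to 1.0


def recency_score(age_years, half_life_years):
    """Exponential decay: the score halves every `half_life_years`."""
    return 0.5 ** (max(age_years, 0.0) / half_life_years)


def citation_score(count, cap=10_000):
    """Log-normalised citation count, clipped to 1.0 at the assumed cap."""
    return min(1.0, math.log1p(max(count, 0)) / math.log1p(cap))


def rank_score(c, half_life_years=5.0):
    """Weighted sum of the five components; higher ranks first."""
    return (
        WEIGHTS["authority"] * c.authority
        + WEIGHTS["full_text"] * (1.0 if c.has_full_text else 0.0)
        + WEIGHTS["recency"] * recency_score(c.age_years, half_life_years)
        + WEIGHTS["citations"] * citation_score(c.citation_count)
        + WEIGHTS["license"] * c.license_score
    )
```

A domain-appropriate half-life would be passed per category, for example a shorter one for fast-moving fields and a longer one for history or philosophy.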
5. Source-Grounded Generation
Gemini Pro writes each station using only the retrieved source excerpts — it cannot introduce facts from its training data. Every factual claim is required to cite a source using inline citation tokens (e.g. [S1], [S2]) that map back to the retrieved documents.
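Inline citation tokens of this shape are easy to check mechanically. The sketch below is an illustration, not our production checker: it assumes tokens look like [S1] and sit before the sentence's closing punctuation, and it flags every sentence without a token even though not every sentence is a factual claim.

```python
import re

CITATION_TOKEN = re.compile(r"\[S(\d+)\]")


def cited_ids(text):
    """Return the set of source ids referenced in the text, e.g. {"S1", "S2"}."""
    return {f"S{number}" for number in CITATION_TOKEN.findall(text)}


def unknown_citations(text, source_ids):
    """Tokens that do not map back to any retrieved document."""
    return cited_ids(text) - set(source_ids)


def uncited_sentences(text):
    """Sentences containing no citation token (naive punctuation-based split)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if not CITATION_TOKEN.search(s)]
```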
6. Citation Graph Enrichment
After ranking and before generation, the pipeline checks whether any retrieved source is a landmark work: a paper with an unusually high citation count relative to its field. For fast-moving fields like computer science and AI, the threshold is 50 citations; for medicine and biology, 100; for humanities and philosophy, 200. When a landmark is detected, the pipeline fetches that paper's reference list from OpenAlex and retrieves the most-cited open-access works from it. These foundational papers are added to the source set so generation is grounded in both current research and the work that shaped it.
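The landmark test and the selection of foundational works can be sketched over plain dictionaries. The thresholds are the ones stated above; the field keys, the "at or above" comparison, the default limit of five, and the function names are assumptions. The `cited_by_count` and `open_access.is_oa` keys follow the OpenAlex work object.

```python
# Citation thresholds from the text, keyed by assumed field labels.
LANDMARK_THRESHOLDS = {
    "computer_science": 50,
    "ai": 50,
    "medicine": 100,
    "biology": 100,
    "humanities": 200,
    "philosophy": 200,
}


def is_landmark(work, field):
    """True if the work meets its field's threshold (assumed inclusive).

    Fields without a stated threshold are treated as never landmark here.
    """
    threshold = LANDMARK_THRESHOLDS.get(field)
    if threshold is None:
        return False
    return work.get("cited_by_count", 0) >= threshold


def top_foundational(references, limit=5):
    """Most-cited open-access works from an already-fetched reference list."""
    open_refs = [r for r in references if r.get("open_access", {}).get("is_oa")]
    return sorted(open_refs, key=lambda r: r.get("cited_by_count", 0), reverse=True)[:limit]
```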
7. Per-Sentence Verification
After generation, each station undergoes an independent verification pass. A separate AI model (with no shared context with the generator) checks every factual claim against the source excerpts and flags:
- Unsupported claims: fact has no citation or the cited source does not mention it
- Contradicted claims: fact directly contradicts a source
- Over-stated claims: fact goes beyond what the source actually says
Stations with major issues are automatically revised before delivery. The verification report is stored with the path and available on request.
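One way to picture a verification report is as a list of flagged findings plus a revision decision. This is a hypothetical data model: the three flag types come from the list above, but which flags count as "major" is an assumption made here for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Flag(Enum):
    UNSUPPORTED = "unsupported"
    CONTRADICTED = "contradicted"
    OVERSTATED = "overstated"


@dataclass
class Finding:
    sentence: str
    flag: Flag
    cited_source: Optional[str] = None  # e.g. "S2", or None if uncited


# Assumption: unsupported and contradicted claims are treated as major.
MAJOR_FLAGS = {Flag.UNSUPPORTED, Flag.CONTRADICTED}


def needs_revision(findings):
    """True if any finding carries a flag treated as major."""
    return any(f.flag in MAJOR_FLAGS for f in findings)
```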
8. Wikidata Cross-Check
After delivery, each station undergoes an independent fact audit against the Wikidata knowledge graph (CC0), the same structured database that underpins Wikipedia's infoboxes. An AI model extracts up to five atomic, verifiable claims per station (specific dates, numerical values, named relationships, scientific constants) and queries Wikidata's SPARQL endpoint for each. Any discrepancy is logged, classified by severity, and stored in the path audit record. This is an additional layer of fact-checking beyond source verification: it checks claims drawn from our sources against a second, independently maintained knowledge base.
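For a date claim, a cross-check of this kind could be sketched as below. It assumes the claim has already been resolved to a Wikidata entity and property, for example Q937 (Albert Einstein) and P569 (date of birth). The function names and the User-Agent string are placeholders, and comparing only the year is a simplification.

```python
import json
import re
import urllib.parse
import urllib.request

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def build_query(entity_id, property_id):
    """Build a one-triple SPARQL query; ids are validated to avoid injection."""
    if not re.fullmatch(r"Q[1-9]\d*", entity_id):
        raise ValueError(f"not a Wikidata entity id: {entity_id!r}")
    if not re.fullmatch(r"P[1-9]\d*", property_id):
        raise ValueError(f"not a Wikidata property id: {property_id!r}")
    return f"SELECT ?value WHERE {{ wd:{entity_id} wdt:{property_id} ?value . }} LIMIT 10"


def fetch_values(entity_id, property_id):
    """Run the query against the public endpoint (network call)."""
    params = urllib.parse.urlencode(
        {"query": build_query(entity_id, property_id), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WIKIDATA_SPARQL}?{params}",
        headers={"User-Agent": "methodology-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [binding["value"]["value"] for binding in payload["results"]["bindings"]]


def year_of(value):
    """Extract the year from an ISO-style timestamp such as 1879-03-14T00:00:00Z."""
    match = re.match(r"([+-]?\d+)-", value)
    if not match:
        raise ValueError(f"not a date value: {value!r}")
    return int(match.group(1))


def year_matches(claimed_year, wikidata_values):
    """True if any returned value falls in the claimed year."""
    return any(year_of(v) == claimed_year for v in wikidata_values)
```

A station claiming a birth year would pass when `year_matches(claimed_year, fetch_values("Q937", "P569"))` is true; a mismatch would be logged as a discrepancy.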
9. Quality Gate
Every Premium path must pass a 100-point quality gate before delivery. Paths that fail are not released, and the Gold Ticket cost is fully refunded.
10. Verified Source Databases
The following databases are queried during Premium path generation. Every database listed is free, open-access, and operated by a government agency, academic institution, or established non-profit. No paywalled or commercially licensed content is ever used.
Science & Medicine
- PubMed Central (U.S. National Library of Medicine) — peer-reviewed biomedical and life sciences literature; open-access full text
- Europe PMC (European Bioinformatics Institute) — biomedical literature including preprints; CC-BY and CC0 content
- NIH Reporter (National Institutes of Health) — federally funded research project abstracts and plain-language summaries
- openFDA (U.S. Food & Drug Administration) — FDA-approved drug labels including clinical pharmacology and mechanism-of-action text
- PubChem (National Center for Biotechnology Information) — curated chemical compound database with biological activity data
- StatPearls (via NCBI Bookshelf) — 100,000+ peer-reviewed clinical handbook chapters covering diagnosis, treatment, and management; CC BY 4.0. Also includes NICE Guidelines (UK NHS), GeneReviews, and AHRQ evidence syntheses.
- ClinicalTrials.gov (U.S. National Library of Medicine) — comprehensive registry of clinical studies worldwide, with study design, interventions, eligibility, and primary outcomes; public domain
Physical Sciences, Engineering & Mathematics
- arXiv (Cornell University) — open-access preprints in physics, mathematics, computer science, and engineering
- NIST (National Institute of Standards and Technology) — government measurement science publications and the NIST Digital Library of Mathematical Functions
- IETF RFCs (Internet Engineering Task Force) — internet and networking standards documents
Space & Earth Sciences
- NASA Technical Reports (National Aeronautics and Space Administration) — mission publications, research reports, and technical documents
- USGS Publications Warehouse (U.S. Geological Survey) — earth science, geology, hydrology, and geography publications
Law, Policy & Economics
- CourtListener (Free Law Project) — U.S. federal and state court opinions, statutes, and legal documents; CC0
- Congressional Research Service Reports (U.S. Congress) — nonpartisan policy research and analysis on every area of public policy; government work
- World Bank Open Data (World Bank Group) — global economic, financial, and development data and research publications
History, Arts & Humanities
- Library of Congress (U.S. Congress) — digitised historical collections, manuscripts, maps, photographs, and published works; government work and CC0
- Europeana (European Union) — cultural heritage collections from 3,000+ European museums, libraries, and archives; CC0 and CC-BY
- Smithsonian Open Access (Smithsonian Institution) — digitised collections from 19 Smithsonian museums and research centres; CC0
Philosophy
- Stanford Encyclopedia of Philosophy (Stanford University) — peer-reviewed reference work with expert-authored entries on every area of philosophy; CC-BY
Free Open Textbooks
- OpenStax (Rice University) — 127 free, peer-reviewed college and high-school textbooks across biology, chemistry, physics, anatomy & physiology, microbiology, nursing, anthropology, psychology, sociology, history, economics, statistics, calculus, business, computer science, philosophy, writing, and more. Approximately 12,000 chapter-level sources indexed. CC BY 4.0.
Universal Scholarly Discovery
- OpenAlex (OurResearch, non-profit) — open index of 250 million+ scholarly works, authors, institutions, and citations across all disciplines; CC0
Public Domain Primary Texts
- Project Gutenberg — 70,000+ public domain books in full text, including primary philosophical works, literary classics, historical documents, and early scientific writing; all content is public domain (copyright: false)
Limitations
Open-access coverage varies by field. Some specialized topics may have fewer available sources, particularly in areas where most authoritative research is behind paywalls. When fewer than 2 verified sources are found for too many stations in a path, generation is halted before any content is written and your Gold Tickets are refunded in full. We will never generate content without sufficient source grounding.
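The halting rule can be sketched as a check over per-station source counts. The minimum of two sources comes from the text; the fraction of thin stations that counts as "too many" is not stated on this page, so `max_thin_fraction` below is a purely hypothetical placeholder.

```python
MIN_SOURCES_PER_STATION = 2


def should_halt(source_counts, max_thin_fraction=0.25):
    """Decide whether to halt generation before any content is written.

    `source_counts` holds the number of verified sources found per station.
    `max_thin_fraction` is a hypothetical tolerance, not a documented value.
    """
    if not source_counts:
        return True  # nothing to ground the path on
    thin = sum(1 for count in source_counts if count < MIN_SOURCES_PER_STATION)
    return thin / len(source_counts) > max_thin_fraction
```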
Arts, humanities, and cultural heritage categories (Music, Architecture, Visual Arts, History, Literature) draw from Europeana, the Smithsonian, and the Library of Congress. These databases return collection metadata — catalogue records, accession notes, digitised manuscripts — rather than explanatory scholarly prose. This content functions as primary source context: it grounds claims in real artefacts and documented history, but it is not the same as a peer-reviewed analysis of technique or style. Paths in these categories are informed by authentic primary sources; they are not synthesised from academic literature in the same way as science or engineering paths.
Our verification system reduces but does not eliminate errors. If you find a factual inaccuracy, please report it via our errata page.