AGI Canary

Epistemic instrumentation for AGI progress

Methodology

How the AGI Canary Index is built—frameworks, axes, and pipeline

Frameworks
The index consolidates established academic and policy frameworks into a single auditable system.
  • CHC cognitive domains (Hendrycks et al.) — "A Definition of AGI" proposes operationalizing AGI as matching cognitive versatility across Cattell–Horn–Carroll domains, producing an interpretable jagged profile rather than one score.
  • Levels of AGI (Morris et al., DeepMind) — Frames progress using performance, generality, and autonomy. Explains progress without claiming "AGI achieved" prematurely.
  • ARC-AGI — Measures abstraction and reasoning beyond narrow training distribution. The canary for generalization vs memorization.
  • METR autonomy evaluations — Tooling and protocols for agentic autonomy, long-horizon task success, and dangerous-capability detection. The backbone for our autonomy & risk track.
  • OECD AI Capability Indicators — Policy-grade, human-skill-referenced capability levels across domains. Cross-walks with CHC for a public-friendly taxonomy.
9 Cognitive Axes
Capability signals are mapped to these axes. Each score lies on a scale from -1 to 1 and carries uncertainty bounds.
  • Reasoning (reasoning)
  • Learning efficiency (learning_efficiency)
  • Long-term memory (long_term_memory)
  • Planning (planning)
  • Tool use (tool_use)
  • Social cognition (social_cognition)
  • Multimodal perception (multimodal_perception)
  • Robustness (robustness)
  • Alignment & safety (alignment_safety)
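As an illustration, the nine axis identifiers and the -1 to 1 scale could be modeled as below. This is a minimal sketch, not the index's actual schema: the `AxisScore` shape, the `lower`/`upper` field names, and both helper functions are assumptions made for the example. Only the axis identifiers and the score range come from the text above.

```typescript
// Axis identifiers exactly as listed in the methodology.
const AXES = [
  "reasoning",
  "learning_efficiency",
  "long_term_memory",
  "planning",
  "tool_use",
  "social_cognition",
  "multimodal_perception",
  "robustness",
  "alignment_safety",
] as const;

type Axis = (typeof AXES)[number];

// Hypothetical shape: a point estimate plus lower/upper uncertainty bounds.
interface AxisScore {
  axis: Axis;
  score: number; // expected to lie in [-1, 1]
  lower: number; // assumed name for the lower uncertainty bound
  upper: number; // assumed name for the upper uncertainty bound
}

// Type guard: true when the string is one of the nine axis identifiers.
function isAxis(value: string): value is Axis {
  return (AXES as readonly string[]).indexOf(value) !== -1;
}

// Checks that all three numbers are within [-1, 1] and that the
// bounds bracket the point estimate (lower <= score <= upper).
function isValidAxisScore(s: AxisScore): boolean {
  const inRange = (x: number): boolean =>
    Number.isFinite(x) && x >= -1 && x <= 1;
  return (
    inRange(s.score) &&
    inRange(s.lower) &&
    inRange(s.upper) &&
    s.lower <= s.score &&
    s.score <= s.upper
  );
}
```

Keeping the identifiers in a single `as const` array lets the compiler reject misspelled axis names while the same list stays available at runtime for validating extracted signals.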
Pipeline
Content flows from discovery through extraction to daily snapshots.
  1. Discovery — RSS feeds, curated sources, search APIs, and X. URLs are deduplicated and stored as items.
  2. Acquisition — Firecrawl scrapes the full text. Content is validated (length and paywall checks), stored in R2, and linked to documents.
  3. AI extraction — Vercel AI SDK + OpenRouter. LLM extracts claims, axis impacts, benchmarks, confidence, citations. Confidence is adjusted by source trust weight; signals below threshold are filtered.
  4. Daily snapshots — Signals for a date are aggregated into axis scores (confidence-weighted average of direction × magnitude). Delta is day-over-day change.
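Steps 3 and 4 can be sketched as follows. This is one plausible reading of the description, not the production code. It assumes that the trust adjustment is multiplicative, that direction is encoded as -1, 0, or 1, and that magnitude lies in [0, 1]. The `Signal` fields, the function names, and the threshold value of 0.3 are all hypothetical.

```typescript
// Hypothetical signal shape; field names are illustrative only.
interface Signal {
  axis: string;
  direction: -1 | 0 | 1; // assumed encoding of regress / neutral / progress
  magnitude: number;     // assumed to lie in [0, 1]
  confidence: number;    // LLM-reported confidence in [0, 1]
  sourceTrust: number;   // assumed trust weight in [0, 1]
}

const CONFIDENCE_THRESHOLD = 0.3; // hypothetical cut-off

// Assumption: confidence is scaled multiplicatively by source trust.
function adjustedConfidence(s: Signal): number {
  return s.confidence * s.sourceTrust;
}

// Confidence-weighted average of direction * magnitude per axis,
// after dropping signals whose adjusted confidence is below threshold.
function aggregateAxisScores(
  signals: Signal[],
  threshold: number = CONFIDENCE_THRESHOLD
): Map<string, number> {
  const sums = new Map<string, { weighted: number; weight: number }>();
  for (const s of signals) {
    const w = adjustedConfidence(s);
    if (w < threshold) continue; // filtered out
    const acc = sums.get(s.axis) ?? { weighted: 0, weight: 0 };
    acc.weighted += w * s.direction * s.magnitude;
    acc.weight += w;
    sums.set(s.axis, acc);
  }
  const scores = new Map<string, number>();
  sums.forEach((acc, axis) => {
    // Skip axes with zero total weight to avoid dividing by zero.
    if (acc.weight > 0) scores.set(axis, acc.weighted / acc.weight);
  });
  return scores;
}

// Day-over-day change for every axis present in both snapshots.
function dayOverDayDelta(
  today: Map<string, number>,
  yesterday: Map<string, number>
): Map<string, number> {
  const delta = new Map<string, number>();
  today.forEach((score, axis) => {
    const prev = yesterday.get(axis);
    if (prev !== undefined) delta.set(axis, score - prev);
  });
  return delta;
}
```

Because direction is at most 1 in absolute value and magnitude is assumed to be at most 1, a weighted average of their product stays within the -1 to 1 range used for axis scores.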
Audit trail
Every displayed score links to citations and provenance. The app surfaces source attribution, confidence, and uncertainty.
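The requirement that every displayed score links to citations can be expressed as a simple invariant check. The sketch below is illustrative only: `Citation`, `DisplayedScore`, and `findUnauditableScores` are hypothetical names, and the real application's data model may differ.

```typescript
// Hypothetical provenance record for one source document.
interface Citation {
  url: string;
  title: string;
}

// Hypothetical shape of a score as surfaced in the app.
interface DisplayedScore {
  axis: string;
  score: number;
  confidence: number;
  uncertainty: [number, number]; // assumed [lower, upper] bounds
  citations: Citation[];
}

// Returns the axes whose displayed score has no citation attached,
// which would violate the audit-trail requirement.
function findUnauditableScores(scores: DisplayedScore[]): string[] {
  return scores
    .filter((s) => s.citations.length === 0)
    .map((s) => s.axis);
}
```

A check like this could run before a snapshot is published, so that a score without provenance is caught before it reaches the page.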