AGI — the goal
Artificial General Intelligence is the stated target of every serious frontier AI lab, the premise of hundreds of billions in capex, and a moving goalpost that gets redrawn every time a model clears the previous line. This page tries to be useful about it: what the word means (and doesn't), who's racing, what the benchmarks show, and where the doubts are.
1. What "AGI" actually means
"Artificial General Intelligence" was coined in 2002 by Shane Legg and Ben Goertzel as a way to distinguish the original dream of AI — a single system that can do any intellectual work a human can — from the narrow task-specific AI that had come to dominate the field. For two decades the phrase lived in fringe conferences. In 2023 it started showing up in 10-K filings.
The problem is that "general intelligence" is a contested concept even among humans. Psychologists debate whether there's a single $g$ factor; evolutionary biologists point to enormous cross-species variation in kinds of cognition; anthropologists note that most human intelligence is socially distributed, not individual. Importing all this into ML was always going to produce disagreement.
What you can safely say: AGI is not a binary. The question isn't "did we reach AGI?" but "on which axes of general-purpose cognitive work does the system match or exceed humans, and at what cost?" Every useful conversation about AGI starts by being specific about the operationalization.
1. Cognitive AGI — matches a generalist human across a wide bundle of cognitive tasks.
2. Economic AGI — can substitute for most of what humans get paid to do at the keyboard.
3. Transformative AI — causes a phase-change in the economy and society comparable to the industrial revolution.
These are not the same thing. A system can be economic-AGI without being cognitive-AGI (narrow tools with enough coverage) or cognitive-AGI without being transformative (expensive, slow, capped by compute).
2. The competing definitions
A short tour of the influential definitions you'll see cited:
- OpenAI (2018 charter). "Highly autonomous systems that outperform humans at most economically valuable work." Explicitly economic; vague on which humans and which work. Operationalized in Microsoft's contract as a profit milestone: $100B in cumulative profits (per reporting in late 2024).
- DeepMind (Morris et al. 2024). "An AI system that is at least as capable as a skilled adult human at most tasks." They propose a matrix of performance (6 levels, 0–5) × generality (narrow vs. general). By their own scoring, frontier LLMs are "Level 1 General — Emerging AGI." arXiv:2311.02462.
- Nilsson's employment test (2005). An AI is "general enough" if it could do the jobs that humans actually get hired for — including new ones. A pragmatic economic bar.
- Turing (1950). Pass an extended conversation with a human judge. Already passed in short-form controlled studies (Jones & Bergen 2024). Obsolete as a finish line; useful as a historical milestone.
- Coffee test (Wozniak). A robot that can enter an unfamiliar house and make a cup of coffee. A physical-embodiment benchmark that is still far from solved — which is why "AGI" discussions are mostly about disembodied cognition in 2026.
- Karnofsky's "PASTA" — Process for Automating Scientific and Technological Advancement. An AI capable enough to automate the hard parts of AI research. The self-improvement entry point; explicitly the thing the AI 2027 scenario organizes around.
3. DeepMind's 6 levels — a useful ladder
Morris et al. propose a two-axis framework. Don't treat it as a law; treat it as a shared vocabulary.
| Level | Narrow AI | General AI |
|---|---|---|
| 0 · No AI | calculator software | Amazon Mechanical Turk, human-in-the-loop software |
| 1 · Emerging | simple rule-based systems | ChatGPT, Gemini, Claude — the current frontier LLMs |
| 2 · Competent (50th-percentile human) | Siri, Alexa, spam filters | not yet achieved |
| 3 · Expert (90th-percentile) | Grammarly, Imagen | not yet achieved |
| 4 · Virtuoso (99th-percentile) | Deep Blue, AlphaGo | not yet achieved |
| 5 · Superhuman | Stockfish, AlphaZero, AlphaFold, AlphaProof | "ASI" — not yet achieved |
Two observations on this table:
- Narrow superhuman systems exist today in limited domains (chess, Go, protein folding, math-olympiad geometry). The gap is generality, not peak performance in any one area.
- Putting frontier LLMs at "Emerging General" is charitable or deflationary depending on your prior. If you care most about general-knowledge fluency, they're past Level 2. If you care most about reliability over long horizons, they're barely Level 1.
4. Who's building it — 2026 field guide
The "serious AGI lab" list is shorter than the "AI company" list. Entry requirements: (1) frontier-scale pretraining compute, (2) a public commitment to building AGI, (3) an in-house research agenda, not just a wrapper.
OpenAI
Charter: AGI that benefits all of humanity. Stack: GPT series, DALL·E, Sora, o-series reasoning, Realtime API. 2025 Stargate announcement: $500B compute buildout (Oracle/SoftBank/MGX/OpenAI). Microsoft is primary compute partner; Azure deal contains a clause that ends Microsoft's privileged access the moment OpenAI's board "determines we have achieved AGI."
Anthropic
Charter: safe, powerful AI. Stack: Claude family (Opus/Sonnet/Haiku), extended-thinking reasoning, interpretability research (SAEs), responsible scaling policy. Dario's Machines of Loving Grace (Oct 2024) is the clearest public statement of how a frontier lab imagines the good case.
Google DeepMind
Full stack: TPUs, Gemini 2.x, Nano, AlphaFold 3, AlphaProof, AlphaGeometry 2, GraphCast, Genie, Project Astra. The only lab that ships both foundation models and specialized scientific AI at frontier scale. Hassabis won the 2024 Nobel in Chemistry for AlphaFold.
xAI
Grok series. Built the "Colossus" cluster in Memphis from zero to 100k H100s in 122 days; expanding to 1M GPUs. Openly framed as an AGI race. Fused into X (the social network) for distribution and training data.
Meta Superintelligence Labs
After Meta's $14B investment in Scale AI (June 2025), Alexandr Wang joined to run a new Superintelligence Labs unit alongside FAIR. Llama 4 is open-weight; MSL is explicitly going after the frontier. Internal tension: LeCun (FAIR) is skeptical that LLMs get there without world models.
Safe Superintelligence Inc.
Ilya Sutskever's post-OpenAI lab. One goal, one team, one product: safe superintelligence. No APIs, no products shipped yet. Reportedly raising at a ~$30B valuation in 2025 on the strength of the founders alone.
DeepSeek
Spun out of quant fund High-Flyer. DeepSeek-V3 (Dec 2024) and R1 (Jan 2025) shipped as open weights at fractions of US training cost. The January 2025 "DeepSeek moment" wiped ~$1T off US AI-infra market cap in a single day and reset assumptions about how much compute was really needed.
Alibaba Qwen / Zhipu / Moonshot
The Chinese tigers. Qwen 3 is competitive with GPT-4o on many benchmarks and ships open-weight; Zhipu's GLM-4.5 and Moonshot's Kimi are close behind. US chip-export controls forced efficiency-first research cultures — much of the Chinese work is downstream of that constraint.
Two honorable mentions that don't fit the "frontier lab" bucket but matter:
- Mistral AI (Paris) — Europe's sovereign-AI play. Ships both open and commercial models. Strategic, not sheer frontier.
- Tesla FSD / Optimus team — the only serious attempt at embodied, real-world general intelligence at scale. If "AGI must be physical" matters to you, Tesla is the main shop.
5. Benchmark trajectories — the receipts
A few numbers worth memorizing because they show how fast things are moving:
| Benchmark | 2022 | 2024 Q4 | 2025 frontier | Human baseline |
|---|---|---|---|---|
| MMLU (broad knowledge) | 60% (GPT-3) | 88% (GPT-4) | 92%+ (saturated) | ~89% |
| GPQA Diamond (grad-level science) | ~30% | ~78% (o1) | 85%+ (o3) | ~65% (domain experts) |
| SWE-bench Verified (real GitHub bugs) | ~2% | 49% (Claude 3.5) | 70%+ (Claude 4) | ~85% |
| FrontierMath (research-level math) | — | 2% (GPT-4o) | 25%+ (o3) | ~30% (Fields-medalist estimate) |
| ARC-AGI-1 (abstraction puzzles) | ~5% | ~30% (o1) | 76% (o3 tuned) | ~85% |
| Humanity's Last Exam | — | — | ~14% (o3) | varies (crowdsourced experts) |
What to take from this table: the benchmarks that used to be "hard" get saturated in 12–18 months. Each time a new one is published, people expect it to survive 5 years and it survives 1. That's the empirical fact anyone who argues for long AGI timelines has to contend with; that's the empirical fact anyone who argues for imminent AGI can't stop pointing at.
But there are also honest caveats:
- Benchmark gaming is real. Anything Internet-scrapeable ends up in pretraining data eventually. "Held-out" is harder than it sounds.
- Exam-style tasks reward a specific cognitive shape. They reward pattern completion on well-posed questions. Long-horizon, ambiguous, multi-tool work is not what MMLU measures.
- Test-time compute is a confound. o3's ARC-AGI performance used ~$2k of compute per task. That's impressive but also clarifies what "reasoning model" really means: throwing more serial compute at one problem.
6. METR's time-horizon curve
Arguably the most interesting recent metric is METR's time-horizon measurement — the length of a software task (measured by how long it takes a competent human) that a frontier model can complete with 50% reliability. Between 2019 and 2025, METR reports, this number has roughly doubled every 7 months.
On a log-scale plot of task length (in human-minutes) against calendar year, extrapolating a 7-month doubling puts "full workday" tasks around 2028 and "full work-week" tasks around 2029–30. These are extrapolations, not predictions.
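A back-of-envelope version of that extrapolation, in Python. The 7-month doubling is METR's reported trend; the ~15-minute starting horizon at the start of 2025 is an illustrative assumption chosen to land near the dates quoted above, not METR's published fit.

```python
import math

DOUBLING_MONTHS = 7        # METR's reported doubling time for the 50%-reliability horizon
BASE_YEAR = 2025.0         # illustrative reference point (assumption)
BASE_HORIZON_MIN = 15.0    # assumed horizon in human-minutes at BASE_YEAR (assumption)

def year_reached(target_minutes: float) -> float:
    """Calendar year at which the extrapolated horizon reaches target_minutes."""
    doublings = math.log2(target_minutes / BASE_HORIZON_MIN)
    return BASE_YEAR + doublings * DOUBLING_MONTHS / 12

for label, minutes in [("1 hour", 60), ("full workday (8h)", 8 * 60), ("full work-week (40h)", 40 * 60)]:
    print(f"{label:>22}: ~{year_reached(minutes):.1f}")
# -> ~2026.2, ~2027.9, ~2029.3 under these assumptions
```

Shift the assumed baseline or doubling time and the dates move by years — which is exactly why the curve, not any single point on it, is the thing to watch.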
If a model can reliably complete a week-long task, an AGI lab's pitch deck writes itself: point it at "improve our model" and let it run. This is the recursive self-improvement mechanism that turns timelines sharp. The counter-argument: the doubling trend is on software tasks, the friendliest possible domain (clear specs, cheap verification). Extrapolating to biology, law, diplomacy, or physical work is optimism, not measurement.
7. What might stop it
A sober engineer's list of things that might derail the "AGI by 2030" trajectory. None are fatal on their own; several together would be.
Data
High-quality text tokens on the open internet are finite. Epoch AI's 2024 projection: we exhaust the stock of human-written text at frontier-scale training sometime between 2026 and 2032. Synthetic data and multimodal data are the usual answers; both have open questions.
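To see why the wall is close, run the Chinchilla arithmetic (training compute C ≈ 6ND, with compute-optimal data D ≈ 20N, per Hoffmann et al. 2022). The ~300T-token stock below is a rough Epoch-style estimate, and the usable high-quality subset is much smaller; treat everything here as order-of-magnitude.

```python
import math

TOKEN_STOCK = 300e12  # ~300T tokens of public human-written text (rough estimate, assumption)

def chinchilla_tokens(flops: float) -> float:
    """Compute-optimal token count: C = 6*N*D with D = 20*N  =>  N = sqrt(C/120)."""
    n_params = math.sqrt(flops / 120)
    return 20 * n_params

for exp in (25, 26, 27):
    d = chinchilla_tokens(10.0 ** exp)
    print(f"1e{exp} FLOPs -> ~{d / 1e12:.0f}T tokens ({d / TOKEN_STOCK:.0%} of the stock)")
# 1e25 -> ~6T (2%), 1e26 -> ~18T (6%), 1e27 -> ~58T (19%):
# a couple more 10x scale-ups and compute-optimal training collides with the
# raw stock — sooner once you filter for quality and deduplicate.
```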
Compute & power
Training runs are moving past 10²⁶ FLOPs. That's roughly 100k+ GPUs running for months. The immediate bottlenecks are power (a 1 GW data center is a regional infrastructure project), chip supply, and cooling. Bookmark SemiAnalysis if you want to track this.
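A rough sanity check on the "100k+ GPUs for months" claim. The peak-throughput figure is the H100 spec-sheet number; the utilization (MFU) and per-GPU power draw are assumptions.

```python
PEAK_FLOPS = 0.99e15    # ~989 TFLOP/s dense BF16 per H100 (spec-sheet peak)
MFU = 0.35              # assumed model-FLOPs utilization; real runs land roughly 25-45%
N_GPUS = 100_000
TARGET_FLOPS = 1e26     # training budget

cluster_flops = N_GPUS * PEAK_FLOPS * MFU        # usable FLOP/s across the cluster
days = TARGET_FLOPS / cluster_flops / 86_400
power_mw = N_GPUS * 1.4 / 1_000                  # ~1.4 kW per GPU incl. host & cooling (assumption)
print(f"~{days:.0f} days per 1e26 FLOPs at ~{power_mw:.0f} MW continuous")
# -> ~33 days and ~140 MW; runs past 1e26 stretch into months, and the power
#    figure is why frontier data centers are becoming grid-scale projects.
```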
Reliability & evaluation
Current models are great at 10-minute tasks and terrible at 10-hour tasks — the reliability curve falls off a cliff. We don't have well-understood scaling laws for reliability the way we do for loss. This is the current research frontier inside every lab.
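One way to see the cliff: if errors compound, a model whose 10-minute steps each succeed 99% of the time still fails a 10-hour chain almost half the time. A toy model assuming independent per-step failure (a strong simplification — real failures correlate, and agents can sometimes recover):

```python
def chain_success(p_step: float, task_minutes: float, step_minutes: float = 10) -> float:
    """Success probability of a task made of independent sequential steps."""
    return p_step ** (task_minutes / step_minutes)

for hours in (0.5, 1, 10, 40):
    print(f"{hours:>4}h task: {chain_success(0.99, hours * 60):.0%}")
# 0.5h: 97%   1h: 94%   10h: 55%   40h: 9%
# Per-step reliability has to climb with task length just to hold the line.
```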
Architecture lock-in
LeCun's loudest point, and it has some teeth: the transformer plus next-token loss is extremely powerful but may be missing components (world models, explicit memory, planning over long horizons) that cognitive science suggests matter. If true, scaling the current recipe converges to a sub-AGI plateau.
Regulation / export controls
US export controls on advanced GPUs to China, EU AI Act obligations, possible US federal regulation. None of these stop AGI; all of them change the cost structure and who wins.
Alignment verification
Even labs that believe AGI is close say that deploying it requires solving evaluations we don't yet have. If interpretability and evals lag capabilities far enough, there's an implicit brake — either voluntary or imposed.
8. Further reading
Primary sources and deep treatments
- Morris, Sohl-Dickstein, Fiedel et al. (2024) — Levels of AGI for Operationalizing Progress on the Path to AGI. The DeepMind framework.
- Legg & Hutter (2007) — A Collection of Definitions of Intelligence. Seventy-odd definitions gathered in one place; the paper that shaped modern AGI vocabulary.
- Eloundou, Manning, Mishkin, Rock (2023) — GPTs are GPTs. Task-level labor-exposure analysis.
- Amodei, Dario (2024) — Machines of Loving Grace. Anthropic CEO's long-form case for the good outcome.
- Kokotajlo, Alexander et al. (2025) — AI 2027. A month-by-month scenario narrative, widely read inside labs.
- METR (2025) — Measuring AI Ability to Complete Long Tasks. The time-horizon data.
- Epoch AI — ongoing forecasting of compute, data, and scaling trends.