Latent Space: The AI Engineer Podcast

Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

Don't miss George's AIE talk: https://www.youtube.com/watch?v=sRpqPgKeXNk

From launching a side project in a Sydney basement to becoming the independent gold standard for AI benchmarking, trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities, George Cameron and Micah Hill-Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is "open," really?

We discuss:

- The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx's retweet
- Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers
- The mystery shopper policy: they register accounts off their own domain and run intelligence and performance benchmarks incognito, so labs cannot serve different models on private endpoints
- How they make money: an enterprise benchmarking-insights subscription (standardized reports on model deployment: serverless vs. managed vs. leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)
- The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals from repeated runs
- Omissions Index (hallucination rate): scores models from -100 to +100, penalizing incorrect answers and rewarding "I don't know"; Claude models lead with the lowest hallucination rates despite not always being the smartest
- GDPval AA: their version of OpenAI's GDPval (44 white-collar tasks with spreadsheets, PDFs, and PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system) and graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)
- The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron)
- The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)
- Why sparsity might go far lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omissions Index accuracy correlates with total parameters (not active parameters), suggesting massive sparse models are the future
- Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at spending more tokens only when needed (GPT-5.1 Codex has tighter token distributions)
- Intelligence Index V4 coming soon: adding GDPval AA, Critical Point, and hallucination rate, and dropping some saturated benchmarks (HumanEval-style coding is now trivial for small models)

Artificial Analysis Website: https://artificialanalysis.ai
George Cameron on X: https://x.com/georgecameron
Micah Hill-Smith on X: https://x.com/micahhsmith

Chapters

00:00:00 Introduction: Full Circle Moment and Artificial Analysis Origins
00:01:19 Business Model: Independence and Revenue Streams
00:04:33 Origin Story: From Legal AI to Benchmarking Need
00:11:47 Benchmarking Challenges: Variance, Contamination, and Methodology
00:13:52 Mystery Shopper Policy and Maintaining Independence
00:16:22 AI Grant and Moving to San Francisco
00:19:21 Intelligence Index Evolution: From V1 to V3
00:23:01 GDPval AA: Agentic Benchmark for Real Work Tasks
00:28:01 New Benchmarks: Omissions Index for Hallucination Detection
00:33:36 Critical Point: Hard Physics Problems and Research-Level Reasoning
00:50:19 Stirrup Agent Harness: Open Source Agentic Framework
00:52:43 Openness Index: Measuring Model Transparency Beyond Licenses
00:58:25 The Smiling Curve: Cost Falling While Spend Rising
01:02:32 Hardware Efficiency: Blackwell Gains and Sparsity Limits
01:06:23 Reasoning Models and Token Efficiency: The Spectrum Emerges
01:11:00 Multimodal Benchmarking: Image, Video, and Speech Arenas
01:15:05 Looking Ahead: Intelligence Index V4 and Future Directions
01:16:50 Closing: The Insatiable Demand for Intelligence
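The "95% confidence intervals from repeated runs" mentioned for the Intelligence Index can be sketched as follows. This is an illustration only, using a standard normal approximation; the function name and the run scores are invented, not Artificial Analysis methodology or data.

```python
import statistics
from math import sqrt

def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean eval score over repeated runs with an approximate 95% CI.

    Uses the normal approximation mean +/- z * stdev / sqrt(n); the actual
    interval construction used by Artificial Analysis may differ.
    """
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / sqrt(len(scores))
    return mean, mean - half_width, mean + half_width

# Hypothetical scores from five repeated runs of the same benchmark
runs = [71.2, 69.8, 70.5, 70.9, 70.1]
mean, lo, hi = mean_with_ci(runs)  # reported as e.g. "70.5 (95% CI lo-hi)"
```

Repeated runs matter because sampling temperature and provider-side variance make a single eval run an unreliable point estimate.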
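The Omissions Index idea (a -100 to +100 scale that penalizes wrong answers and rewards abstaining) can be illustrated with a toy scorer. The exact weighting Artificial Analysis uses is not given in the show notes, so the formula and `abstain_credit` parameter below are assumptions for illustration.

```python
def omissions_score(correct: int, incorrect: int, abstained: int,
                    abstain_credit: float = 0.0) -> float:
    """Illustrative -100..+100 score: +1 per correct answer, -1 per incorrect,
    and 'I don't know' earns abstain_credit in [0, 1].

    The real Artificial Analysis weighting is unspecified; the key property
    shown here is that abstaining beats guessing wrong.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no graded answers")
    return 100.0 * (correct + abstain_credit * abstained - incorrect) / total
```

Under any such scheme, a model that answers "I don't know" on the five questions it would have gotten wrong outscores one that guesses, which is the behavior the index is designed to surface.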
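The token-efficiency vs. turn-efficiency point (a pricier-per-token model can be cheaper overall if it finishes in fewer turns) reduces to simple arithmetic. The prices and turn counts below are hypothetical, not measured Tau-bench figures, and the model ignores per-turn context growth for simplicity.

```python
def task_cost(price_per_mtok: float, tokens_per_turn: float, turns: int) -> float:
    """Rough total cost of an agentic task in dollars:
    (price per million tokens) x (tokens per turn) x (number of turns)."""
    return price_per_mtok * tokens_per_turn * turns / 1_000_000

# Hypothetical numbers: the cheaper-per-token model needs many more turns
cheap_model = task_cost(price_per_mtok=2.0, tokens_per_turn=4000, turns=30)   # $0.24
pricier_model = task_cost(price_per_mtok=5.0, tokens_per_turn=4000, turns=8)  # $0.16
```

In real agentic workloads the gap is usually wider than this sketch suggests, since each extra turn also re-sends the growing conversation context.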
