
ThursdAI - Dec 11 - GPT 5.2 is HERE! Plus, LLMs in Space, MCP donated, Devstral surprises and more AI news!
Hey everyone,
December started strong and does NOT want to slow down!? OpenAI showed us their response to the Code Red, and it's GPT 5.2, which doesn't feel like a .1 upgrade! We got it literally as breaking news at the end of the show, and oh boy! A new kind of LLM is here.
GPT, then Gemini, then Opus, and now GPT again... Who else feels like we're on a trippy AI rollercoaster? Just me?
I'm writing this newsletter from a fresh "traveling podcaster" setup in SF (huge shoutout to the Chroma team for the studio hospitality).
P.S. - Next week we're doing a year recap episode (52nd episode of the year, what is my life), but today is about the highest-signal stuff that happened this week.
Alright. No more foreplay. Let's dive in. Please subscribe.
The main event: OpenAI launches GPT-5.2 (and it's… a lot)
We started the episode with "garlic in the air" rumors (OpenAI holiday launches always have that Christmas panic energy), and then… boom: GPT-5.2 actually drops while we're live.
What makes this release feel significant isn't "one benchmark went up." It's that OpenAI is clearly optimizing for the things that have become the frontier in 2025: long-horizon reasoning, agentic coding loops, long-context reliability, and lower hallucination rates when browsing/tooling is involved.
5.2 Instant, Thinking and Pro in ChatGPT and in the API
OpenAI shipped multiple variants, and even within those there are "levels" (medium/high/extra-high) that effectively change how much compute the model is allowed to burn. At the extreme end, you're basically running parallel thoughts and selecting winners. That's powerful, but also… very expensive.
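To make the "parallel thoughts, pick a winner" idea concrete, here's a minimal best-of-n sketch. This is purely illustrative (not OpenAI's actual mechanism), and the model id and selection heuristic are placeholders:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment


async def one_attempt(prompt: str) -> str:
    # One independent "thought": a single sampled completion.
    resp = await client.chat.completions.create(
        model="gpt-5.2",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


async def best_of_n(prompt: str, n: int = 4) -> str:
    # Run n attempts in parallel, then pick a winner.
    candidates = await asyncio.gather(*(one_attempt(prompt) for _ in range(n)))
    # Toy selection heuristic (longest answer); a real system would use a
    # verifier, reward model, or majority vote instead.
    return max(candidates, key=len)


if __name__ == "__main__":
    print(asyncio.run(best_of_n("How many weighings to find the odd coin among 12?")))
```

Whatever the extra-high tiers actually do internally, the cost intuition is the same: n parallel attempts means roughly n times the tokens.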
GPT-5.2 is very clearly aimed at the agentic world: coding agents that run in loops, tool-using research agents, and "do the whole task end-to-end" workflows where spending extra tokens is still cheaper than spending an engineer day.
Benchmarks
I'm not going to pretend benchmarks tell the full story (they never do), but the shape of improvements matters. GPT-5.2 shows huge strength on reasoning + structured work.
It hits 90.5% on ARC-AGI-1 in the Pro X-High configuration, and 54%+ on ARC-AGI-2 depending on the setting. For context, ARC-AGI-2 is the one where everyone learns humility again.
On math/science, this thing is flexing. We saw 100% on AIME 2025, and strong performance on FrontierMath tiers (with the usual "Tier 4 is where dreams go to die" vibe still intact). GPQA Diamond is up in the 90s too, which is basically "PhD trivia mode."
But honestly, the most practically interesting one for me is GDPval (knowledge-work tasks: slides, spreadsheets, planning, analysis). GPT-5.2 lands around 70%, which is a massive jump vs earlier generations. This is the category that translates directly into "is this model useful at my job." GDPval is a benchmark OpenAI only launched in September, and back then Opus 4.1 scored a "measly" 47%! Talk about acceleration!
Long context: MRCR is the sleeper highlight
On MRCR (multi-needle long-context retrieval), GPT-5.2 holds up absurdly well even into 128k and beyond. The graph OpenAI shared shows GPT-5.1 falling off a cliff as context grows, while GPT-5.2 stays high much deeper into long contexts.
If you've ever built a real system (RAG, agent memory, doc analysis), you know this pain: long context is easy to offer, hard to use well. If GPT-5.2 actually delivers this in production, it's a meaningful shift.
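If MRCR is new to you, here's a toy version of the multi-needle idea (my own sketch, not the actual benchmark): hide several key-value "needles" inside a long filler document, ask for all of them back, and score the fraction recovered.

```python
import random


def build_haystack(needles: dict[str, str], filler_sentences: int = 10_000) -> str:
    """Scatter key/value 'needles' through a long filler document."""
    doc = ["The quick brown fox jumps over the lazy dog."] * filler_sentences
    for key, value in needles.items():
        doc.insert(random.randrange(len(doc)), f"The secret code for {key} is {value}.")
    return " ".join(doc)


def score(answer: str, needles: dict[str, str]) -> float:
    """Fraction of needle values the model managed to reproduce."""
    return sum(value in answer for value in needles.values()) / len(needles)


needles = {"alpha": "731-QX", "bravo": "208-LM", "charlie": "944-TR"}
prompt = build_haystack(needles) + "\n\nWhat are the secret codes for alpha, bravo, and charlie?"
# answer = call_your_model(prompt)   # plug in any long-context model here
# print(score(answer, needles))
```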
Hallucinations: down (especially with browsing)
One thing we called out on the show is that a bunch of user complaints in 2025 have basically collapsed into one phrase: "it hallucinates." Even people who don't know what a benchmark is can feel when a model confidently lies.
OpenAI's system card shows lower rates of major incorrect claims compared to GPT-5.1, and lower "incorrect claims" overall when browsing is enabled. That's exactly the direction they needed.
Real-world vibes:
We did the traditional "vibe tests" mid-show: generate a flashy landing page, do a weird engineering prompt, try some coding inside Cursor/Codex.
Early testers broadly agree on the shape of the improvement. GPT-5.2 is much stronger in reasoning, math, long-context tasks, visual understanding, and multimodal workflows, with multiple reports of it successfully thinking for one to three hours on hard problems. Enterprise users like Box report faster execution and higher accuracy on real knowledge-worker tasks, while researchers note that GPT-5.2 Pro consistently outperforms the standard "Thinking" variant. The tradeoffs are also clear: creative writing still slightly favors Claude Opus, and the highest reasoning tiers can be slow and expensive. But as a general-purpose reasoning model, GPT-5.2 is now the strongest publicly available option.
AI in space: Starcloud trains an LLM on an H100 in orbit
This story is peak 2025.
Starcloud put an NVIDIA H100 on a satellite, trained Andrej Karpathy's nanoGPT on Shakespeare, and ran inference on Gemma. There's a viral screenshot vibe here that's impossible to ignore: SSH into an H100… in space… with a US flag in the corner. It's engineered excitement, and I'm absolutely here for it.
But we actually had a real debate on the show: is "GPUs in space" just sci-fi marketing, or does it make economic sense?
Nisten made a compelling argument that power is the real bottleneck, not compute, and that big satellites already operate in the ~20kW range. If you can generate that power reliably with solar in orbit, the economics start looking less insane than you'd think. LDJ added the long-term land/power convergence argument: Earth land and grid power get scarcer and more regulated, while launch costs trend down, so eventually the curves may cross.
I played "voice of realism" for a minute: what happens when GPUs fail? It's hard enough to swap a GPU in a datacenter; now imagine doing it in orbit. Cooling and heat dissipation become a different engineering problem too (radiators instead of fans). Networking is nontrivial. But also: we are clearly entering the era where people will try weird infra ideas because AI demand is pulling the whole economy.
Big Company: MCP gets donated, OpenRouter drops a report on AI
Agentic AI Foundation Lands at the Linux Foundation
This one made me genuinely happy.
Block, Anthropic, and OpenAI came together to launch the Agentic AI Foundation under the Linux Foundation, donating key projects like MCP, AGENTS.md, and goose. This is exactly how standards should happen: vendor-neutral, boring governance, lots of stakeholders.
It's not flashy work, but it's the kind of thing that actually lets ecosystems grow without fragmenting.
BTW, I was recording my podcast while Latent.Space were recording theirs in the same office, and they have a banger episode upcoming about this very topic! All I'll say is Alessio Fanelli introduced me to David Soria Parra from MCP. Watch out for that episode on Latent Space dropping soon!
OpenRouter's "State of AI": 100 Trillion Tokens of Reality
OpenRouter and a16z dropped a massive report analyzing over 100 trillion tokens of real-world usage. A few things stood out:
Reasoning tokens now dominate. They're above 50% of all tokens, around 60% since early 2025. Remember when we went from "LLMs can't do math" to reasoning models? That happened in about a year.
Programming exploded. From 11% of usage in early 2025 to over 50% recently. Claude holds 60% of the coding market (at least on OpenRouter).
Open source hit 30% market share, led by Chinese labs: DeepSeek (14T tokens), Qwen (5.59T), Meta LLaMA (3.96T).
Context lengths grew massively. Average prompt length went from 1.5k to 6k+ tokens (4x growth), completions from 133 to 400 tokens (3x).
The "Glass Slipper" effect. When users find a model that fits their use case, they stay loyal. Foundational early-user cohorts retain around 40% at month 5. Claude 4 Sonnet still had 50% retention after three months.
Geography shift. Asia doubled to 31% of usage (China key), while North America is at 47%.
Yam made a good point that we should be careful interpreting these graphs: they're biased toward people trying new models, not necessarily steady usage. But the trends are clear: agentic, reasoning, and coding are the dominant use cases.
Open Source Is Not Slowing Down (If Anything, It's Accelerating)
One of the strongest themes this week was just how fast open source is closing the gap, and in some areas outright leading. We're not talking about toy demos anymore. We're talking about serious models, trained from scratch, hitting benchmarks that were frontier-only not that long ago.
Essential AI's Rnj-1: A Real Frontier 8B Model
This one deserves real attention. Essential AI (led by Ashish Vaswani, yes, Ashish from the original Transformer paper) released Rnj-1, a pair of 8B open-weight models trained fully from scratch. No distillation. No "just a fine-tune." This is a proper pretrain.
What stood out to me isn't just the benchmarks (though those are wild), but the philosophy. Rnj-1 is intentionally focused on pretraining quality: data curation, code execution simulation, STEM reasoning, and agentic behaviors emerging during pretraining instead of being bolted on later with massive RL pipelines.
In practice, that shows up in places like SWE-bench Verified, where Rnj-1 lands in the same ballpark as much larger closed models, and in math and STEM tasks where it punches way above its size. And remember: this is an 8B model you can actually run locally, quantize aggressively, and deploy without legal gymnastics thanks to its Apache 2.0 license.
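As a minimal sketch of what "run it locally, quantize aggressively" looks like in practice: a standard transformers + bitsandbytes 4-bit load. The Hugging Face repo id below is a placeholder; grab the real one from Essential AI's HF page.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "EssentialAI/rnj-1-instruct"  # placeholder; use the actual repo id from Hugging Face

# 4-bit quantization keeps an 8B model comfortably inside a single consumer GPU.
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant, device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that checks whether a number is prime."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```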
Mistral Devstral 2 + Vibe: Open Coding Goes Hard
Mistral followed up last week's momentum with Devstral 2 and Mistral Vibe!
The headline numbers: the 123B Devstral 2 model lands right at the top of open-weight coding benchmarks, nearly matching Claude 3.5 Sonnet on SWE-bench Verified. But what really excited the panel was the 24B Devstral Small 2, which hits high-60s SWE-bench scores while being runnable on consumer hardware.
This is the kind of model you can realistically run locally as a coding agent, without shipping your entire codebase off to someone else's servers. Pair that with Mistral Vibe, their open-source CLI agent, and you suddenly have a credible, fully open alternative to things like Claude Code, Codex, or Gemini CLI.
We talked a lot about why this matters. Some teams can't send code to closed APIs. Others just don't want to pay per-token forever. And some folks (myself included) just like knowing what's actually running under the hood. Devstral 2 checks all those boxes.
This week's Buzz (W&B): Trace OpenRouter traffic into Weave with zero code
We did a quick "Buzz" segment on a feature that I think a lot of builders will love:
OpenRouter launched Broadcast, which can stream traces to observability vendors. One of those destinations is W&B Weave.
The magic here: if you're using a tool that already talks to OpenRouter, you can get tracing into Weave without instrumenting your code. That's especially useful when instrumentation is hard (certain agent frameworks, black-box tooling, restricted environments, etc.).
If you want to set it up: OpenRouter Broadcast settings.
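To be clear about what "zero code" means here: your existing OpenRouter calls stay exactly as they are. Here's a sketch using the OpenAI SDK pointed at OpenRouter's endpoint (the model id is just an example); once Broadcast is enabled with Weave as a destination in the OpenRouter dashboard, requests like this show up as traces in your Weave project.

```python
import os
from openai import OpenAI

# A completely standard OpenRouter call: no Weave SDK, no decorators, no instrumentation.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="mistralai/devstral-small",  # example model id
    messages=[{"role": "user", "content": "Summarize this week's AI news in one sentence."}],
)
print(resp.choices[0].message.content)

# With Broadcast enabled and W&B Weave added as a destination in the OpenRouter
# dashboard, this request is traced into your Weave project automatically.
```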
Vision Models Are Getting Practical (and Weirdly Competitive)
Vision-language models quietly had a massive week.
Jina-VLM: Small, Multilingual, and Very Good at Docs
Jina released a 2.4B VLM that's absolutely dialed in on document understanding, multilingual VQA, and OCR-heavy tasks. This is exactly the kind of model you'd want for PDFs, charts, scans, and messy real-world docs, and it's small enough to deploy without sweating too much.
Z.ai GLM-4.6V: Long Context, Tool Calling, Serious Agent Potential
Z.ai's GLM-4.6V impressed us with its 128K context, native tool calling from vision inputs, and strong performance on benchmarks like MathVista and WebVoyager. It's one of the clearest examples yet of a VLM that's actually built for agentic workflows, not just answering questions about images.
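To make "native tool calling from vision inputs" concrete, here's a rough sketch of the pattern using an OpenAI-compatible chat request. The endpoint URL, model id, and the add_to_cart tool are all illustrative assumptions on my part, not Z.ai's documented API:

```python
from openai import OpenAI

# Illustrative only: assumes an OpenAI-compatible endpoint serving GLM-4.6V.
client = OpenAI(base_url="https://your-glm-endpoint/v1", api_key="...")

tools = [{
    "type": "function",
    "function": {
        "name": "add_to_cart",  # hypothetical tool the model can call after reading the screenshot
        "parameters": {
            "type": "object",
            "properties": {
                "product_name": {"type": "string"},
                "quantity": {"type": "integer"},
            },
            "required": ["product_name", "quantity"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-4.6v",  # illustrative model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/storefront.png"}},
            {"type": "text", "text": "Add two of the cheapest item on this page to my cart."},
        ],
    }],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```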
That said, I did run my unofficial "bee counting test" on it… and yeah, Gemini still wins there.
Perceptron Isaac 0.2: Tiny Models, Serious Perception
Perceptron's Isaac 0.2 (1B and 2B variants) showed something I really like seeing: structured outputs, focus tools, and reliability in very small models. Watching a 2B model correctly identify, count, and point to objects in an image is still wild to me.
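A quick sketch of why structured outputs in small perception models matter: instead of a free-text description, you get a schema you can feed straight into downstream code. The schema and the example JSON below are my own illustration, not Perceptron's actual output format.

```python
from pydantic import BaseModel


class PointedObject(BaseModel):
    label: str                          # e.g. "red mug"
    count: int                          # how many instances were found
    points: list[tuple[float, float]]   # normalized (x, y) image coordinates, one per instance


class PerceptionResult(BaseModel):
    objects: list[PointedObject]


# A small VLM constrained to this schema returns something you can act on directly:
raw = '{"objects": [{"label": "red mug", "count": 2, "points": [[0.31, 0.62], [0.74, 0.58]]}]}'
result = PerceptionResult.model_validate_json(raw)
for obj in result.objects:
    print(obj.label, obj.count, obj.points)
```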
These are the kinds of models that make physical AI, robotics, and edge deployments actually feasible.
Tools: Cursor goes visual, and Google Stitch keeps getting scarier (in a good way)
Cursor: direct visual editing inside the codebase
Cursor shipped a new feature that lets you visually manipulate UI elements (click/drag/resize) directly in the editor. We lumped this under "tools" because it's not just a nicety; it's the next step in "IDE as design surface."
Cursor is also iterating fast on debugging workflows. The meta trend: IDEs are turning into agent platforms, not text editors.
Stitch by Google: Gemini 3 Pro as default, plus clickable prototypes
I showed Stitch on the show because it's one of the clearest examples of "distribution beats raw capability."
Stitch (Google's product born from the Galileo AI acquisition) is doing Shipmas updates and now defaults to "Thinking with Gemini 3 Pro." It can generate complex UIs, export them, and even stitch multiple screens into prototypes. The killer workflow is exporting directly into AI Studio / agent tooling so you can go from UI idea → code → repo without playing copy-paste Olympics.
Site: https://stitch.withgoogle.com
Disney invests $1B into OpenAI (and Sora gets Disney characters)
This is the corporate story that made me do a double take.
Disney, arguably the most IP-protective company on Earth, is investing $1B into OpenAI and enabling use of Disney characters in Sora. That's huge. It signals the beginning of a more explicit "licensed synthetic media" era, where major IP holders decide which model vendors get official access.
It also raises the obvious question: does Disney now go harder against other model providers that generate Disney-like content without permission?
We also talked about how weird the timing is, given Disney has been applying legal pressure elsewhere in the space. The next year of AI video is going to be shaped as much by licensing and distribution as by model quality.
Closing thoughts: the intelligence explosion is loud, messy, and accelerating
This episode had everything: open-source models catching up fast, foundation-level standardization around agents, a usage report that shows what developers actually do with LLMs, voice models getting dramatically better, and OpenAI shipping what looks like a serious "we're not losing" answer to Gemini 3.
And yes: we're also apparently putting GPUs in space.
Next week's episode is our year recap, and, of course, we now have to update it because GPT-5.2 decided to show up like the final boss.
If you missed any part of the show, check out the chapters in the podcast feed and jump around. See you next week.
TL;DR + Show Notes (links for everything)
Hosts
* Alex Volkov – AI Evangelist @ Weights & Biases: @altryne. I host ThursdAI and spend an unhealthy amount of time trying to keep up with this firehose of releases.
* Co-hosts – @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed. Each of them brings a different "lens" (agents, infra, evaluation, open source, tooling), and it's why the show works.
Open Source LLMs
* Essential AI – RNJ-1 (8B base + instruct): tweet, blog, HF instruct, HF base. This is a from-scratch open pretrain led by Ashish Vaswani, and it's one of the most important "Western open model" signals we've seen in a while.
* Mistral – Devstral 2 + Devstral Small 2 + Mistral Vibe: tweet, Devstral Small 2 HF, Devstral 2 HF, news, mistral-vibe GitHub. Devstral is open coding SOTA territory, and Vibe is Mistral's swing at the CLI agent layer.
AI in Space
* Starcloud trains and runs an LLM in orbit on an H100: Philip Johnston, Adi Oltean, CNBC, Karpathy reaction. A satellite H100 trained nanoGPT on Shakespeare and ran Gemma inference, igniting a real debate about power, cooling, repairability, and future orbital compute economics.
Putnam Math Competition
* Nous Research – Nomos 1 (Putnam scoring run): tweet, HF, GitHub harness, Hillclimb. This is a strong open-weight math reasoning model plus an open harness, and it shows how orchestration matters as much as raw weights.
* Axiom – AxiomProver formal Lean proofs on Putnam: tweet, repo. Formal proofs are the "no excuses" version of math reasoning, and this is a serious milestone even if you argue about exact framing.
Big Company LLMs + APIs
* OpenAI – GPT-5.2 release: Alex tweet, OpenAI announcement, ARC Prize verification, Sam Altman tweet. GPT-5.2 brings major jumps in reasoning, long context, and agentic workflows, and it's clearly positioned as an answer to the Gemini 3 era.
* OpenRouter x a16z – State of AI report (100T+ tokens): tweet, landing page, PDF. The report highlights the dominance of programming/agents, the rise of reasoning tokens, and real-world usage patterns that explain why everyone is shipping agent harnesses.
* Agentic AI Foundation under Linux Foundation (AAIF): Goose tweet, Block blog, aaif.io, Linux Foundation tweet. MCP + AGENTS.md + Goose moving into vendor-neutral governance is huge for interoperability and long-term ecosystem stability.
* Disney invests $1B into OpenAI / Sora characters: (covered on the show as a major IP + distribution moment). This is an early signal of licensed synthetic media becoming a first-class business line rather than a legal gray zone.
This week's Buzz (W&B)
* OpenRouter Broadcast → W&B Weave tracing: Broadcast settings. You can trace OpenRouter-based traffic into Weave with minimal setup, which is especially useful when you can't (or don't want to) instrument code directly.
Vision & Video
* Jina – jina-VLM (2.4B): tweet, arXiv, HF, blog. A compact multilingual VLM optimized for doc understanding and VQA.
* Z.ai – GLM-4.6V + Flash: tweet, HF collection, GLM-4.6V, Flash, blog. Strong open VLMs with tool calling and long context, even if my bee counting test still humbled it.
* Perceptron – Isaac 0.2 (1B/2B): tweet, HF 2B, HF 1B, blog, demo. The Focus/zoom tooling and structured outputs point toward "VLMs as reliable perception modules," not just chatty describers.
Voice & Audio
* Google DeepMind – Gemini 2.5 TTS (Flash + Pro): AI Studio tweet, GoogleAI devs tweet, blog, AI Studio speech playground. The key upgrades are control and consistency (emotion, pacing, multi-speaker) across many languages.
* OpenBMB – VoxCPM 1.5: tweet, HF, GitHub. Open TTS keeps getting better, and this release is especially interesting for fine-tuning and voice cloning workflows.
Tools
* Cursor – direct visual editing (new UI workflow): (covered on the show as a major step toward "IDE as design surface"). Cursor continues to push the agentic IDE category into new territory.
* Stitch by Google – Shipmas updates + Gemini 3 Pro "Thinking" + Prototypes: tweet 1, tweet 2, site, plus background articles: TechCrunch launch, acquisition detail. Stitch is turning prompt-to-UI into a full prototype-to-code pipeline with real export paths.
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe