
đ ThursdAI - the week that changed the AI landscape forever - Gemini 3, GPT codex max, Grok 4.1 & fast, SAM3 and Nano Banana Pro
Hey everyone, Alex here đ
Iâm writing this one from a noisy hallway at the AI Engineer conference in New York, still riding the high (and the sleep deprivation) from what might be the craziest week weâve ever had in AI.
In the span of a few days:
Google dropped Gemini 3 Pro, a new Deep Think mode, generative UIs, and a free agent-first IDE called Antigravity.xAI shipped Grok 4.1, then followed it up with Grok 4.1 Fast plus an Agent Tools API.OpenAI answered with GPTâ5.1âCodexâMax, a longâhorizon coding monster that can work for more than a day, and quietly upgraded ChatGPT Pro to GPTâ5.1 Pro.Meta looked at all of that and said âcool, weâll just segment literally everything and turn photos into 3D objectsâ with SAM 3 and SAM 3D.Robotics folks dropped a home robot trained with almost no robot data.And Google, just to flex, capped Thursday with Nano Banana Pro, a 4K image model and a provenance system while we were already live!
For the first time in a while it doesnât just feel like ânew models came out.â It feels like the future actually clicked forward a notch.
This is why ThursdAI exists. Weeks like this are basically impossible to follow if you have a day job, so my coâhosts and I do the noâsleep version so you donât have to. Plus, being at AI Engineer makes it easy to get super high quality guests so this week we had 3 folks join us, Swyx from Cognition/Latent Space, Thor from DeepMind (on his 3rd day) and Dominik from OpenAI! Alright, deep breath. Letâs untangle the week.
TL;DR
If you only skim one section, make it this one (links in the end):
* Gemini 3 Pro: 1Mâtoken multimodal model, huge reasoning gains - new LLM king
* ARCâAGIâ2: 31.11% (Pro), 45.14% (Deep Think) â enormous jumps
* Antigravity IDE: free, Geminiâpowered VS Code fork with agents, plans, walkthroughs, and browser control
* Nano Banana Pro: 4K image generation with perfect text + SynthID provenance; dynamic âgenerative UIsâ in Gemini
* xAI
* Grok 4.1: big postâtraining upgrade â #1 on humanâpreference leaderboards, much better EQ & creative writing, fewer hallucinations
* Grok 4.1 Fast + Agent Tools API: 2M context, SOTA toolâcalling & agent benchmarks (Berkeley FC, T²âBench, research evals), aggressive pricing and tight X + web integration
* OpenAI
* GPTâ5.1âCodexâMax: âfrontier agentic codingâ model built for 24h+ software tasks with native compaction for millionâtoken sessions; big gains on SWEâBench, SWEâLancer, TerminalBench 2
* GPTâ5.1 Pro: new âresearchâgradeâ ChatGPT mode that will happily think for minutes on a single query
* Meta
* SAM 3: openâvocabulary segmentation + tracking across images and video (with text & exemplar prompts)
* SAM 3D: singleâimage â 3D objects & human bodies; surprisingly highâquality 3D from one photo
* Robotics
* Sunday Robotics â ACTâ1 & Memo: home robot foundation model trained from a $200 skill glove instead of $20K teleop rigs; longâhorizon household tasks with solid zeroâshot generalization
* Developer Tools
* Antigravity and Marimoâs VS Code / Cursor extension both push toward agentic, reactive dev workflows
Live from AI Engineer New York: Coding Agents Take Center Stage
We recorded this weekâs show on location at the AI Engineer Summit in New York, inside a beautiful podcast studio the team set up right on the expo floor. Huge shout out to Swyx, Ben, and the whole AI Engineer crew for that â last time I was balancing a mic on a hotel nightstand, this time I had broadcastâgrade audio while a robot dog tried to steal the show behind us.
This yearâs summit theme is very onâtheânose for this week: coding agents.
Everywhere you look, thereâs a company building an âagent labâ on top of foundation models. Amp, Cognition, Cursor, CodeRabbit, Jules, Google Labs, all the openâsource folks, and even the enterprise players like Capital One and Bloomberg are here, trying to figure out what it means to have real software engineers that are partly human and partly model.
Swyx framed it nicely when he said that if you take âvertical AIâ seriously enough, you eventually end up building an agent lab. Lawyers, healthcare, finance, developer tools â they all converge on âagents that can reason and code.â
The big labs heard that theme loud and clear. Almost every major release this week is about agents, tools, and longâhorizon workflows, not just chat answers.
Google Goes All In: Gemini 3 Pro, Antigravity, and the Agent Revolution
Letâs start with Google because, after years of everyone asking âwhereâs Google?â in the AI race, they showed up this week with multiple bombshells that had even the skeptics impressed.
Gemini 3 Pro: Multimodal Intelligence That Actually Delivers
Google finally released Gemini 3 Pro, and the numbers are genuinely impressive. Weâre talking about a 1 million token context window, massive benchmark improvements, and a model thatâs finally competing at the very top of the intelligence charts. Thor from DeepMind joined us on the show (literally on day 3 of his new job!) and you could feel the excitement.
The headline numbers: Gemini 3 Pro with Deep Think mode achieved 45.14% on ARC-AGI-2âthatâs roughly double the previous state-of-the-art on some splits. For context, ARC-AGI has been one of those benchmarks that really tests genuine reasoning and abstraction, not just memorization. The standard Gemini 3 Pro hits 31.11% on the same benchmark, both scores are absolutely out of this world in Arc!
On GPQA Diamond, Gemini 3 Pro jumped about 10 points compared to prior models. Weâre seeing roughly 81% on MMLU-Pro, and the coding performance is where things get really interestingâGemini 3 Pro is scoring around 56% on SciCode, representing significant improvements in actual software engineering tasks.
But hereâs what made Ryan from Amp switch their default model to Gemini 3 Pro immediately: the real-world usability. Ryan told us on the show that theyâd never switched default models before, not even when GPT-5 came out, but Gemini 3 Pro was so noticeably better that they made it the default on Tuesday. Of course, they hit rate limits almost immediately (Google had to scale up fast!), but those have since been resolved.
Antigravity: Googleâs Agent-First IDE
Then Google dropped Antigravity, and honestly, this might be the most interesting part of the whole release. Itâs a free IDE (yes, free!) thatâs basically a fork of VS Code, but reimagined around agents rather than human-first coding.
The key innovation here is something they call the âAgent Managerââthink of it like an inbox for your coding agents. Instead of thinking in folders and files, youâre managing conversations with agents that can run in parallel, handle long-running tasks, and report back when they need your input.
I got early access and spent time playing with it, and hereâs what blew my mind: you can have multiple agents working on different parts of your codebase simultaneously. One agent fixing bugs, another researching documentation, a third refactoring your CSSâall at once, all coordinated through this manager interface.
The browser integration is crazy too. Antigravity can control Chrome directly, take screenshots and videos of your app, and then use those visuals to debug and iterate. Itâs using Gemini 3 Pro for the heavy coding, and even Nano Banana for generating images and assets. The whole thing feels like itâs from a couple years in the future.
Wolfram on the show called out how good Gemini 3 is for creative writing tooâitâs now his main model, replacing GPT-4.5 for German language tasks. The model just âgetsâ the intention behind your prompts rather than following them literally, which makes for much more natural interactions.
Nano Banana Pro: 4K Image Generation With Thinking
And because Google apparently wasnât done announcing things, they also dropped Nano Banana Pro on Thursday morningâliterally breaking news during our live show. This is their image generation model that now supports 4K resolution and includes âthinkingâ traces before generating.
I tested it live by having it generate an infographic about all the weekâs AI news (which you can see on the top), and the results were wild. Perfect text across the entire image (no garbled letters!), proper logos for all the major labs, and compositional understanding that felt way more sophisticated than typical image models. The file it generated was 8 megabytesâan actual 4K image with stunning detail.
Whatâs particularly clever is that Nano Banana Pro is really Gemini 3 Pro doing the thinking and planning, then handing off to Nano Banana for the actual image generation. So you get multimodal reasoning about your request, then production-quality output. You can even upload reference imagesâup to 14 of themâand itâll blend elements while maintaining consistency.
Oh, and every image is watermarked with SynthID (Googleâs invisible watermarking tech) and includes C2PA metadata, so you can verify provenance. This matters as AI-generated content becomes more prevalent.
Generative UIs: The Future of Interfaces
One more thing Google showed off: generative UIs in the Gemini app. Wolfram demoed this for us, and itâs genuinely impressive. Instead of just text responses, Gemini can generate full interactive mini-apps on the flyâcomplete dashboards, data visualizations, interactive widgetsâall vibe-coded in real time.
He asked for âfour panels of the top AI news from last weekâ and Gemini built an entire news dashboard with tabs, live market data (including accurate pre-market NVIDIA stats!), model comparisons, and clickable sections. It pulled real information, verified facts, and presented everything in a polished UI that you could interact with immediately.
This isnât just a demoâitâs rolling out in Gemini now. The implication is huge: weâre moving from static responses to dynamic, contextual interfaces generated just-in-time for your specific need.
xAI Strikes Back: Grok 4.1 and the Agent Tools API
Not to be outdone, xAI released Grok 4.1 at the start of the week, briefly claimed the #1 spot on LMArena (at 1483 Elo, not 2nd to Gemini 3), and then followed up with Grok 4.1 Fast and a full Agent Tools API.
Grok 4.1: Emotional Intelligence Meets Raw Performance
Grok 4.1 brought some really interesting improvements. Beyond the benchmark numbers (64% win rate over the previous Grok in blind tests), what stood out was the emotional intelligence. On EQ-Bench3, Grok 4.1 Thinking scored 1586 Elo, beating every other model including Gemini, GPT-5, and Claude.
The creative writing scores jumped by roughly 600 Elo points compared to earlier versions. And perhaps most importantly for practical use, hallucination rates dropped from around 12% to 4%âthatâs roughly a 3x improvement in reliability on real user queries.
xAIâs approach here was clever: they used âfrontier agentic reasoning models as reward modelsâ during RL training, which let them optimize for subjective qualities like humor, empathy, and conversational style without just scaling up model size.
Grok 4.1 Fast: The Agent Platform Play
Then came Grok 4.1 Fast, released just yesterday, and this is where things get really interesting for developers. Itâs got a 2 million token context window (compared to Gemini 3âs 1 million) and was specifically trained for agentic, tool-calling workflows.
The benchmark performance is impressive: 93-100% on Ď²-Bench Telecom (customer support simulation), ~72% on Berkeley Function Calling v4 (top of the leaderboard), and strong scores across research and browsing tasks. But hereâs the kicker: the pricing is aggressive.
At $0.20 per million input tokens and $0.50 per million output tokens, Grok 4.1 Fast is dramatically cheaper than GPT-5 and Claude while matching or exceeding their agentic performance. For the first two weeks, itâs completely free via the xAI API and OpenRouter, which is smartâget developers hooked on your agent platform.
The Agent Tools API gives Grok native access to X search, web browsing, code execution, and document retrieval. This tight integration with X is a genuine advantageâwhere else can you get real-time access to breaking news, sentiment, and conversation? Yam tested it on the show and confirmed that Grok will search Reddit too, which other models often refuse to do. Iâve used both these models this week in my N8N research agent and I gotta say, 4.1 fast is a MASSIVE improvement!
OpenAIâs Endurance Play: GPT-5.1-Codex-Max and Pro
OpenAI clearly saw Google and xAI making moves and decided they werenât going to let this week belong to anyone else. They dropped two significant releases: GPT-5.1-Codex-Max and an update to GPT-5.1 Pro.
GPT-5.1-Codex-Max: Coding That Never Stops
This is the headline: GPT-5.1-Codex-Max can work autonomously for over 24 hours. Not 24 minutes, not 24 queriesâ24 actual hours on a single software engineering task. I talked to someone from OpenAI at the conference who told me internal checkpoints ran for nearly a week on and off.
How is this even possible? The secret is something OpenAI calls âcompactionââa native mechanism trained into the model that lets it prune and compress its working session history while preserving the important context. Think of it like the model taking notes on itself, discarding tool-calling noise and keeping only the critical design decisions and state.
The performance numbers back this up:
* SOTA 77.9% on SWE-Bench Verified (up from 73.7%)
* SOTA 79.9% on SWE-Lancer IC SWE (up from 66.3%)
* 58.1% on TerminalBench 2.0 (up from 52.8%)
And crucially, in medium reasoning mode, it uses 30% fewer thinking tokens while achieving better results. Thereâs also an âExtra Highâ reasoning mode for when you truly donât care about latency and just want maximum capability.
Yam, one of our co-hosts whoâs been testing extensively, said you can feel the difference immediately. The model just âgets itâ faster, powers through complex problems, and the earlier versionâs quirk of ignoring your questions and just starting to code is fixedânow it actually responds and collaborates.
Dominic from OpenAI joined us on the show and confirmed that compaction was trained natively into the model using RL, similar to how Claude trained natively for MCP. This means the model doesnât waste reasoning tokens on maintaining contextâit just knows how to do it efficiently.
GPT-5.1 Pro: Research-Grade Intelligence & ChatGPT joins your group chat1
Then thereâs GPT-5.1 Pro, which is less about coding and more about deep, research-level reasoning. This is the model that can run for 10-17 minutes on a single query, thinking through complex problems with the kind of depth that previously required human experts.
OpenAI also quietly rolled out group chatsâbasically, you can now have multiple people in a ChatGPT conversation together, all talking to the model simultaneously. Useful for planning trips, brainstorming with teams, or working through problems collaboratively. If agent mode works in group chats (we havenât confirmed yet), that could get really interesting.
Meta drops SAM3 & SAM3D - image and video segmentation models powered by natural language
Phew ok, big lab releases now done, oh.. wait not yet! Because Meta has decided to also make a dent on this Week with SAM3 and SAM3D, which both are crazy. Iâll just add their video release here instead of going on and on!
This Weekâs Buzz from Weights & Biases
Itâs been a busy week at Weights & Biases as well! We are proud Gold Sponsors of the AI Engineer conference here in NYC. If youâre at the event, please stop by our boothâweâre even giving away a $4,000 robodog!
This week, I want to highlight a fantastic update from Marimo, the reactive Python notebook company we acquired.
Marimo just shipped a native VS Code and Cursor extension. This brings Marimoâs reactive, Git-friendly notebooks directly into your favorite editors.
Crucially, it integrates deeply with uv for lightning-fast package installs and reproducible environments. If you import a package you donât have, the extension prompts you to install it and records the dependency in the script metadata. This bridges the gap between experimental notebooks and production-ready code, and itâs a huge boost for AI-native development workflows. (Blog , GitHub )
The Future Arrived Early
Phew... if you read all the way until this point, can you leave a ⥠emoji in the comemnts? I was writing this and it.. is a lot! I was wondering who would even read all the way till here!
This week we felt the acceleration! đĽ I can barely breathe, I need a nap!
A huge thank you to our guestsâRyan, Swyx, Thor, and Dominikâfor navigating the chaos with us live on stage, and to the AI Engineer team for hosting us.
Weâll be back next week to cover whatever the AI world throws at us next. Stay tuned, because at this rate, AGI might be here by Christmas.
TL;DR - show notes and links
Hosts and Coâhosts
* Alex Volkov â AI Evangelist at Weights & Biases / CoreWeave, host of ThursdAI (X)
* Coâhosts - Wolfram Ravenwolf â (X), Yam Peleg (X) LDJ (X)
Guests
* Swyx â Founder of AI Engineer Worldâs Fair and Summit, now at Cognition ( Latent.Space , X)
* Ryan Carson â Amp (X)
* Thor Schaeff â Google DeepMind, Gemini API and AI Studio (X)
* Dominik Kundel â Developer Experience at OpenAI (X)
Open Source LLMs
* Allen Institute Olmo 3 - 7B/32B fully open reasoning suite with end-to-end training transparency (X, Blog)
Big CO LLMs + APIs
* Google Gemini 3 Pro - 1M-token, multimodal, agentic model with Generative UIs (X, X, X)
* Google Antigravity - Agent-first IDE powered by Gemini 3 Pro (Blog, X)
* xAI Grok 4.1 and Grok 4.1 Thinking - big gains in Coding, EQ, creativity, and honesty (X, Blog)
* xAI Grok 4.1 Fast and Agent Tools API - 2M-token context, state-of-the-art tool-calling (X)
* OpenAI GPT-5.1-Codex-Max - long-horizon agentic coding model for 24-hour+ software tasks (X, X)
* OpenAI GPT-5.1 Pro - research-grade reasoning model in ChatGPT Pro
* Microsoft, NVIDIA, and Anthropic partnership - to scale Claude on Azure with massive GPU investments (Announcement, NVIDIA, Microsoft Blog)
This weeks Buzz
* Marimo ships native VS Code & Cursor extension with reactive notebooks and uv-powered environments (X, Blog, GitHub)
Vision & Video & 3D
* Meta SAM 3 & SAM 3D - promptable segmentation, tracking, and single-image 3D reconstruction (X, Blog, GitHub)
AI Art & Diffusion
* Google Nano Banana Pro and SynthID verification - 4K image generation with provenance (Blog)
Show Notes and other Links
* AI Engineer Summit NYC - Live from the conference
* Full livestream available on YouTube
* ThursdAI - Nov 20, 2025
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
Altri episodi di "ThursdAI - The top AI news from the past week"



Non perdere nemmeno un episodio di âThursdAI - The top AI news from the past weekâ. Iscriviti all'app gratuita GetPodcast.







