
ThursdAI - Jun 4 - NVIDIA drops Nemotron 3 Ultra (550B open), Microsoft becomes a frontier lab, Ideogram 4 goes open, Agent Arena & more
Hey folks, Alex here, let me catch you up!
I had a feeling this week was going to be crazy: it started over the weekend with MiniMax M3, then Jensen announced the new RTX Spark, NVIDIA's first PC chip, packing 1 petaflop of local AI power into thin laptops.
A few days later at Microsoft BUILD, Satya & Mustafa from MAI dropped 7 AI models, completely pre-trained from scratch, including a new MAI-Thinking-1, MAI-Code and MAI-Image 2.5 that started topping the image gen charts.
Then other image models started racing to the top of the Arena benchmarks, with Ideogram 4 becoming the SOTA open weights image-gen model, and Reve 2 beating Nano Banana just a few hours after that.
And then today, NVIDIA dropped Nemotron 3 Ultra, their latest 550B open weights model, along with its data and training recipes, Arena published a new agentic eval leaderboard, and we got a new Gemma 4 12B.
I had the great pleasure of hosting Chris (@llm_wizard) from NVIDIA, Peter Gostev from Arena and Karan from Nous Research (who were featured prominently by Jensen!) all on the show.
Def don't miss this one! Let's get into the details.
Open Source LLMs
NVIDIA Nemotron 3 Ultra: The 550B Open Source Beast Built for Agents (X, Arxiv, Announcement)
This was the big one. Breaking news mid-show: NVIDIA drops Nemotron 3 Ultra, a 550 billion parameter sparse MoE model with 55 billion active parameters, built on a hybrid Mamba-Transformer architecture. Chris Alexiuk, AKA Joe Nemotron, joined us live from NVIDIA HQ in Santa Clara to walk us through it.
The headline number is 5.9x higher inference throughput compared to GLM-5.1 on decode-heavy workloads. Chris told us this comes from several things: their hybrid Mamba-Transformer approach, the sparse attention, and the fact that they optimized for decode-heavy workloads (the kinds of workloads agents do).
The architecture is fascinating. They're mixing Mamba-2 state space layers with sparse attention, which means step 300 in an agent loop runs as fast as step 3. Pure transformers can't do that because the attention cost keeps growing with context length. This kicks in big time at 64K+ sequence lengths, which is exactly where you end up in real agentic work when the model is having multi-turn conversations and people are dumping their entire codebase in.
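To make the "step 300 runs as fast as step 3" point concrete, here is a toy cost model in Python. It is only a sketch under loud assumptions: the dimensions, the tokens-per-step figures and the cost formulas are made up for illustration, and it compares plain full attention to a plain state space layer, not NVIDIA's actual hybrid with sparse attention.

```python
def attention_decode_cost(context_len: int, d_model: int = 1024) -> int:
    """Rough work for one decode step of full attention: the new token
    attends over every cached token, so cost grows with context length."""
    return context_len * d_model


def ssm_decode_cost(context_len: int, d_state: int = 128, d_model: int = 1024) -> int:
    """Rough work for one decode step of a state space layer: it updates a
    fixed-size state, so context_len is deliberately unused."""
    return d_state * d_model


# Hypothetical agent loop: the context is 2K tokens at step 3 and 200K at step 300.
for step, context_len in [(3, 2_000), (300, 200_000)]:
    print(step, attention_decode_cost(context_len), ssm_decode_cost(context_len))
```

In this toy model the attention cost per step grows 100x between the two steps while the state space cost stays flat, which is the intuition behind the long-context throughput claim.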
P.S. - We launched Nemotron 3 Ultra with 0-day support on CoreWeave Inference, it's super fast and pretty cheap, give it a try here
They pretrained on 20 trillion tokens, extended context to 1 million tokens, and their post-training pipeline used multi-teacher on-policy distillation from over 10 specialized teacher models covering everything from SWE to terminal use to search to office work, which they are also going to open source soon!
One thing Chris emphasized that I really appreciate: NVIDIA doesn't have their own harness. There's no "NVIDIA Code." Which means they actively resist the temptation to harness-max, to optimize for just one harness and look good on a specific leaderboard. Ultra should be a solid drop-in for whatever harness you're used to, and that generality is worth a lot. It's not the best thinker, but it is the highest scoring US-based open weights model, so again, a huge huge win for the US AI ecosystem!
The Nemotron 3 Ultra release is open under the OpenMDW-1.1 license: base BF16, post-trained BF16, and NVFP4 quantized checkpoints, plus the GenRM, synthetic pre-training data for code, legal, and specialized domains, post-training datasets, RL environments via NeMo Gym, and training recipes in the Nemotron GitHub repo, which is absolutely bonkers! Kudos to team green for this awesome and very important release!
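For a feel of what those checkpoints mean in practice, here is some back-of-the-envelope arithmetic in Python. The parameter counts come from the release; the byte math is an approximation that assumes 16 bits per weight for BF16 and 4 bits per weight for NVFP4, and ignores quantization scale factors, activations and KV/state memory.

```python
TOTAL_PARAMS = 550e9   # total parameters, from the release
ACTIVE_PARAMS = 55e9   # active parameters per token, from the release


def weight_gb(params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = weight_gb(TOTAL_PARAMS, 16)   # assumption: 16 bits per weight
nvfp4_gb = weight_gb(TOTAL_PARAMS, 4)   # assumption: 4 bits per weight, scales ignored
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"BF16 weights:  ~{bf16_gb:,.0f} GB")
print(f"NVFP4 weights: ~{nvfp4_gb:,.0f} GB")
print(f"Active per token: {active_fraction:.0%} of the parameters")
```

Under these assumptions the 4-bit checkpoint needs about a quarter of the weight memory of the BF16 one, and only about a tenth of the parameters are touched per token.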
NVIDIA Nemotron 3.5 ASR: The Tiny Speed Demon (X, HF, Blog, Blog)
Oh, and NVIDIA wasn't done. They also dropped Nemotron 3.5 ASR, a 600 million parameter open source multilingual streaming speech-to-text model covering 40 languages. It's the fastest model Pipecat has ever tested, and the cost math is insane: roughly 5 cents an hour for enterprise deployment when typical API providers charge 10 cents to a dollar per hour. Our friend Kwindla from Daily and Pipecat put together a detailed writeup with benchmarks and cost analysis. Chris couldn't stop praising NVIDIA's speech team and honestly, I can't either. Banger after banger.
Just a week after I told you about Cartesia Ink-2, NVIDIA drops an open version that's pareto optimal, can run fully on-device and is blazing fast at transcription!?
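The cost math above is easy to sanity check. The sketch below uses the per-hour prices quoted in this section; the monthly volume of 10,000 audio hours is a made-up number for illustration, and prices are kept in integer cents to avoid floating point noise.

```python
SELF_HOSTED_CENTS_PER_HOUR = 5          # ~5 cents/hour, as quoted above
API_CENTS_PER_HOUR_RANGE = (10, 100)    # 10 cents to a dollar per hour, as quoted above


def monthly_cost_usd(audio_hours: int, cents_per_hour: int) -> float:
    """Cost in dollars for transcribing `audio_hours` of audio."""
    return audio_hours * cents_per_hour / 100


hours = 10_000  # hypothetical monthly volume
self_hosted = monthly_cost_usd(hours, SELF_HOSTED_CENTS_PER_HOUR)
api_low, api_high = (monthly_cost_usd(hours, c) for c in API_CENTS_PER_HOUR_RANGE)

print(f"self-hosted: ${self_hosted:,.0f}")
print(f"typical API: ${api_low:,.0f} to ${api_high:,.0f}")
print(f"savings factor: {api_low / self_hosted:.0f}x to {api_high / self_hosted:.0f}x")
```

At the quoted prices that is a 2x to 20x difference, before counting whatever it costs you to operate the deployment yourself.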
Other notable open source announcements that would have made full headlines on any other week:
* MiniMax announces M3, a natively multimodal, 1M-context, coding and agentic frontier model (X) This one is very interesting, but it's not yet available as open weights so we haven't tested it fully; we're going to do it next week when they drop the tech report and the weights
* Google drops Gemma 4 12B - encoder-free multimodal model that runs on your laptop with 16GB VRAM under Apache 2 (X, HF) Our friends from DeepMind keep the western open source momentum going with a new 12B size for Gemma (which crossed some 100M downloads on Hugging Face recently).
* JetBrains Mellum2, a 12B MoE model with only 2.5B active, trained from scratch by a team of 7 people (X, Blog, HF, CW Inference) The great folks at JetBrains, the company behind the IntelliJ IDEs, dropped a new model called Mellum2 which they trained from scratch. Very interesting to see them pivot in a world where IDEs are dying at the hands of LLMs.
* H Company drops Holo 3.1: blazing fast local computer-use agents from 0.8B to 35B, with massive mobile benchmark jumps (X, Blog)
NVIDIA's RTX Spark and reinventing the PC - announcement at Computex 2026
While we're on the topic of NVIDIA, they opened the week with a huge announcement involving Microsoft, Dell, Lenovo, HP and a bunch of other partners.
They announced RTX Spark, their first ever PC chip, which is a full system on a chip (SoC) focused on running AI workloads for things like OpenClaw and Hermes!
Announcing this on stage at Computex, Jensen Huang called it "the most amazing chip the world has ever built", able to run every app that Microsoft has ever run.
This is a huge deal, specifically because of how agentic the world is becoming. These machines (thin laptops and a Mac mini alternative were announced) will be able to run 120 billion parameter models on-device, game at the level of an RTX 5070, and run AI agents 24/7. I'm getting excited and I'm not even a Windows user!
Hermes victory + Hermes Desktop and an interview with Karan from Nous Research
If you squint, you can see that by the little red OpenClaw, there's another logo. That's the Nous Girl logo of Nous Research, which was rebranded to be the logo of their Hermes Agent (an open source agentic harness that's passed 181K stars on GitHub, and is the leader in global ranking on OpenRouter).
We had the awesome pleasure of having Karan Malhotra (@karan4d), one of the co-founders of Nous Research, on the show, and Karan broke down how Nous Research evolved from a research lab that created long context innovations (YaRN) and finetuned models (Hermes used to be a series of models) into a full agentic company.
We also chatted with Karan about the new Hermes Desktop experience, which lets folks see the tools that are used and the code that's being written by their agent, and about how it feels to be featured by the world's largest company on the global stage! Definitely check out the conversation with Karan.
Microsoft BUILD, new PC, becoming a frontier lab with MAI-thinking-1, MAI-code and MAI-image 2.5 (Blog)
From Jensen to Satya, the week was full of AI announcements that will impact the world. Microsoft's annual Build conference happened just a few days after, with Jensen zooming in from Taipei to co-announce all these new PC models and chips.
Shortly after that, and after a lot of other announcements about less exciting enterprise-y stuff, Satya handed the stage to Mustafa Suleyman (co-founder of DeepMind and Inflection AI, and now CEO of Microsoft's AI division, MAI) to announce all these new models!
A few of these (in previous versions) were already covered on the show, but the new LLMs are the most interesting! MAI-Thinking-1 is 1T total parameters with 35B active params, trained on roughly 33.5T tokens (30T pre-training, 3.55T mid-training), without any distillation (which felt important for them to say given their proprietary access to OpenAI's models). It's not yet competitive with Opus and OpenAI's flagship models, but they are claiming parity with Sonnet 4.5 and get 53% on SWE-Bench Pro coding tasks!
Given that OpenAI recently started offering their models on AWS, we're now seeing a bit of distancing between Microsoft and OpenAI, with Microsoft showing that it can become a frontier lab in its own right, or well.. maybe a second tier frontier lab.
Of course, we shouldn't forget that Microsoft kind of started the whole era of coding AIs with Copilot and then completely lost to the Cursors and Windsurfs and Devins of the world, despite the huge head start they had with GitHub. So I'm really curious to see how strongly they will push this "second tier frontier lab" angle and if they have what it takes to compete with Google here (not to mention OpenAI and Anthropic).
And while the model wasn't available for me to even test yet, MAI did drop an incredibly in-depth 109 page technical report on it. Our friend of the pod Elie Bacouch (@eliebacouch) did a breakdown of the most interesting aspects of it, calling it a gold-mine for details about training models at this scale.
Image gen models race to the top of the Arena
This week was honestly chaotic for image gen. Three new SOTA models in basically 48 hours, I tried to use them all while preparing for the show, and here's the comparison I ran:
Microsoft MAI-Image 2.5 (X, Try it)
One of the more surprising updates was MAI-Image 2.5: it landed at #3 on text-to-image and #2 on image-to-image, surpassing Nano Banana Pro on the editing leaderboard. It comes in two flavors, MAI-Image-2.5 and a faster Flash variant, both running on H100s which means existing infra can serve it, and it's already rolling out in OneDrive Photos for background cleanup and distraction removal.
That said, my honest take: I tried to generate a ThursdAI thumbnail with it and got "image failed", I think because the word "explosion" tripped its safety filter. I then tried to generate a "horse riding an astronaut on the moon" and got this, yep... this is .. not the best. IDK how and why they shot up so high on the leaderboards. But I guess we'll see as more folks try these models.
Ideogram 4.0 - new SOTA open weights image gen (X, Blog, HF)
The one I want to celebrate hardest is Ideogram 4.0, because they opened the weights! For the previous three Ideogram versions you could only use them on their website, and now they dropped the next one as a 9.3 billion parameter open weights model (non-commercial license, but still). This is now the new #1 open weights text-to-image model, with only closed models from OpenAI and Google ahead of it on DesignArena. At 9.3B params, it beats much larger models like Qwen-Image (20B), FLUX.2 dev (32B), and even the 80B MoE HunyuanImage 3.0 on text rendering benchmarks.
The architecture is wild. Instead of CLIP or T5 they use Qwen3-VL-8B as the text encoder, extract hidden states from 13 intermediate layers, and they trained exclusively on structured JSON captions with bounding boxes. That's why it's so good at layout control: you can prompt it with precise bounding box positions and hex color palettes, and you can see the layout shaping the generation as it converges.
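To give a feel for what a structured caption with bounding boxes and a hex palette might look like, here is a small Python sketch. The schema is entirely hypothetical: the field names (`palette`, `elements`, `bbox`) and the normalized `[x0, y0, x1, y1]` convention are assumptions for illustration, not Ideogram's documented format, so check their model card before prompting the real thing.

```python
import json

# Hypothetical caption schema for illustration only; the real format
# Ideogram 4.0 expects may use different field names and units.
caption = {
    "palette": ["#0B0F19", "#76B900", "#FFFFFF"],
    "elements": [
        {"type": "text", "content": "ThursdAI", "bbox": [0.05, 0.05, 0.60, 0.25]},
        {"type": "image", "content": "host portrait, studio lighting", "bbox": [0.55, 0.30, 0.95, 0.95]},
    ],
}


def validate_caption(caption: dict) -> bool:
    """Check hex colors and that each bbox is [x0, y0, x1, y1] within 0..1."""
    for color in caption["palette"]:
        if len(color) != 7 or not color.startswith("#"):
            raise ValueError(f"bad hex color: {color}")
        int(color[1:], 16)  # raises ValueError on non-hex digits
    for element in caption["elements"]:
        x0, y0, x1, y1 = element["bbox"]
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"bad bounding box: {element['bbox']}")
    return True


validate_caption(caption)
prompt = json.dumps(caption, indent=2)  # the JSON string you would send as the prompt
print(prompt)
```

Validating the layout before sending it is cheap, and it catches the classic mistake of swapping the corners of a box.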
In my thumbnail test it nailed almost everything but had a small typo (it generated "Nemotron" once and then a weird "Nemo 1" duplicate in another area). Still, very impressive for a first open weights release.
Reve 2 jumps to #2 above Nano Banana Pro (X, Blog, Try it)
I've talked about Reve before, and Reve 2.0 just dropped at #2 on the Text-to-Image Arena with a 1280 score, a +125 Elo jump over their v1.5 in a single release. That's basically unheard of on the arena leaderboard. The thing that blows my mind is they're a 65 person lab training at only 2,000 GPU scale, competing with labs that have orders of magnitude more compute.
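To put a +125 jump in perspective, the classic Elo formula turns a rating gap into an expected head-to-head win rate. The sketch below assumes the Arena score behaves like a standard Elo rating on the usual 400-point scale, which is an assumption about their scoring rather than something the leaderboard states here.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Standard Elo expected score for the higher-rated side,
    given its rating advantage over the opponent."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / scale))


# +125 is the jump from Reve v1.5 to Reve 2.0 quoted above.
print(f"{expected_win_rate(125):.1%}")
```

Under that assumption, Reve 2.0 would be preferred over v1.5 in roughly two out of three head-to-head votes.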
The core innovation is that they separated planning from rendering. Every image is first laid out as structured code (composition, relationships, style, labeled segments) before it gets rendered at native 4K (true 16 megapixels, not upscaled). Because the image is represented as code, every element is addressable and editable, so you can manipulate specific regions without regenerating the whole thing. This is also agent-native by design, LLMs can reason directly about the image structure.
I demoed their editing interface live on the show and it's the tightest layout control I've seen in any image model. When I moved my head box to the left, it worked. When I moved the logo to the bottom, it worked. When I changed the word "news" to "imploded", the surrounding text stayed pixel-identical. That precision is genuinely new.
Honest tradeoff though, and Peter Gostev flagged this on the show: they're #2 on text-to-image but only around #9 on image editing. That matched my own experience trying to nail the thumbnail likeness: the layout work is amazing but the face came out a little googly-eyed and cartoonish, with one finger going somewhere fingers should not go.
For what it's worth on my own thumbnail bake-off: Nano Banana Pro is still my pick for the absolute best instruction following (it nails my exact ThursdAI logo color every time), GPT Image 2 is still the highest fidelity but always comes out a little overcooked on the skin, Reve 2 is gorgeous on layout but the face needs work, and Ideogram 4 is the most exciting because it's open. A lot of why I prefer Nano Banana is just that my prompts are very Nano Banana tuned by now.
Breaking news on the show: Agent Arena from LMArena
The breaking news of the day, while we were already on air, was Arena AI launching a brand new Agent Arena leaderboard. Nisten pasted the link in our group chat and three minutes later Peter Gostev himself jumped on the show to walk us through it. Got to love this format.
The motivation is something we've been talking about for a year. The original Arena was built for the chatbot era, where you send one prompt and vote A vs B. But we've all moved to agents, long multi-step tasks running for many minutes or hours, and that comparison no longer captures what matters. Agent Arena fixes this by giving models a real workspace with web search, file system and terminal tools, then measures millions of live sessions across five signals: task success, steerability, error recovery, user praise, and tool hallucination. The launch snapshot is built from 300,000 tasks, 2 million tool calls, and 40 million lines of agent-written code.
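As a rough mental model of how five per-session signals could roll up into per-model rates, here is a Python sketch. Everything in it is hypothetical: the field names, the boolean encoding and the simple averaging are my own illustration, and Arena's actual methodology is surely more involved.

```python
from dataclasses import dataclass, fields


@dataclass
class SessionSignals:
    # Hypothetical per-session flags named after the five signals above.
    task_success: bool
    steerable: bool
    recovered_from_error: bool
    user_praise: bool
    tool_hallucination: bool


def summarize(sessions: list[SessionSignals]) -> dict[str, float]:
    """Fraction of sessions in which each signal fired."""
    total = len(sessions)
    return {
        f.name: sum(getattr(s, f.name) for s in sessions) / total
        for f in fields(SessionSignals)
    }


sessions = [
    SessionSignals(True, True, True, False, False),
    SessionSignals(False, True, False, False, True),
]
print(summarize(sessions))
```

The point of tracking several signals rather than one win rate is that a model can succeed at tasks while still hallucinating tools or ignoring steering, and you want to see those separately.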
The results match the vibes on my feed perfectly. GPT-5.5 High is #1 by a comfortable margin, Claude Opus 4.7 right behind, and very interestingly ZAI's GLM 5.1 (MIT licensed, fully open) lands at #3, above Google, Kimi and DeepSeek. The funniest moment of the show was when we'd been calling out Gemma 4 31B for being bad agentically purely based on vibes, and the brand new benchmark showed up 20 minutes later confirming exactly that. The other juicy signal is "bash recovery", how quickly a model recovers when a command fails. GPT-5.5 leads at ~17%, and Grok 4.3 from xAI sits at -89%, which is so much worse it almost looks like a training bug.
I'm super into this. Give it a spin at arena.ai (@arena on X), they're rolling new models in as labs send early access, so there's a good chance you'll spin up the next Mythos in their agent harness.
This week's Buzz - WeaveHacks 4 + Nemotron on CW Inference + WolfBench 3D
A few things from our corner this week.
WeaveHacks 4 is this weekend in SF - not too late to join yet!
We're hosting WeaveHacks 4 in San Francisco this weekend, and we still have a few spots left, so if you're in town, please come join us at lu.ma/weavehacks. OpenAI is sponsoring us for the first time, Cursor is in too, we've got over $150K in credits to give out, food, and a great panel of judges I reached out to personally.
Nemotron 3 Ultra is live on CW Inference at full NVFP4
I said it above but it bears repeating, our inference team got Nemotron 3 Ultra live on day zero on CoreWeave Inference (via Weights & Biases) at full NVFP4 precision. Nisten plugged it straight into his medical anatomy harness (which was originally built for Kimi and Qwen) and it just worked, plug and play, agentically highlighting body parts and calling custom tools, at around 15 cents cached input. Try it at wandb.me/nemotron-ultra.
WolfBench gets a 3D bar update
Wolfram shipped a quietly important feature on WolfBench: 3D bars where the depth of each bar represents how many tokens the model used to get its score. The 2D view shows Gemini 3.5 Flash sitting comfortably at #2 on the agentic scores, almost matching GPT-5.5. But flip on 3D mode and the picture is very different. Gemini Flash burned over 3 billion input tokens to get that score, where GPT-5.5 used a couple hundred to reach the same level. That's the difference between "cheap fast model" and "actually cheap to run end to end". Wolfram's writing up the full analysis on the W&B blog next week. Check out the new 3D view on wolfbench.ai
AI in Society
Look, tons of other stuff happened this week as well that honestly deserves its own newsletter. We are focused on models and agents, but it's hard to ignore the bigger picture.
Senator Bernie Sanders introduced a bill called the American AI Sovereign Wealth Fund Act, which would have the government tax AI companies, take 50% of the stock, and put it under public control. I personally find this ridiculous, but it apparently caused Sam Altman to request a meeting with Bernie.
Meanwhile there's no doubt that AI hate is growing and that public sentiment is very negative, as we can see on the issue of datacenter water usage for example. Despite Satya Nadella's claim that the latest Microsoft datacenters use a closed-loop water system that uses less water than one restaurant (X), and that datacenters account for less than 1% of total water usage in the US, a lot of politicians and social media users are still pushing the narrative that datacenters are water-guzzling monsters that need to be stopped.
Anthropic's "When AI builds itself" report (X)
Anthropic released a report today called "When AI builds itself" with a haunting graphic.
They have a bunch of previously unreleased data in there on how AI is shaping the work inside Anthropic and outline 3 potential futures:
1 - AI progress stalls and humans are able to catch up. Unlikely.
2 - AI labs continue to see compounding efficiency gains. The nature of work changes, and 100-person companies could do the work of 10,000- or 100,000-person organizations. The role of humans at companies like Anthropic would shift. This is the most likely scenario per Anthropic.
3 - AI systems themselves become capable of full recursive self-improvement and begin building their successors. This is the scenario where it's least clear whether these systems will be aligned to human values or not.
This is a fascinating and, yes, scary read. Anthropic fully acknowledges that it would be dope if everyone chilled for a second and stopped building recursively self-improving AIs that we aren't sure can be aligned, but that it's likely not going to happen, because it would just let other labs, or in fact other countries, catch up and change the frontier.
AI Leaders from top labs Urge Congress to Mandate Synthetic DNA Screening
Sam Altman of OpenAI, Dario Amodei of Anthropic, Demis Hassabis of Google DeepMind, and others signed an open letter on June 3, 2026, pushing for required screening of synthetic DNA and RNA orders to block known risky sequences. The letter, backed by Nobel winners, biotech CEOs, and security experts, notes AI's ability to outpace human experts in biology, heightening biosecurity risks despite voluntary industry efforts since 2009. I think everyone agrees that this is a good idea, especially given the above Anthropic report. Very happy to see this happening.
Pheeeeew what a week.
This was a looong week. I wasn't sure if we'd be able to cover everything, and it feels like we did a decent job! I know it's exhausting, and I hope we on ThursdAI help you readers and listeners stay on top of things without spending too many cycles.
If you enjoyed this newsletter or episode, please share it with a friend and consider subscribing to our YouTube channel (thursdai.news/yt) to help more folks stay up to date.
TL;DR and Show Notes - June 4, 2026
* Show Notes & Guests
  * Alex Volkov - AI Evangelist & Weights & Biases CoreWeave (@altryne)
  * Co Hosts - @WolframRvnwlf @yampeleg @ldjconfirmed
  * Guests:
    * Chris Alexiuk / @llm_wizard from NVIDIA Nemotron
    * Karan Malhotra from Nous Research
    * Peter Gostev from Arena
* Open Source LLMs
  * NVIDIA released Nemotron 3 Ultra, a 550B / 55B-active open-weight MoE built for long-running agents, with weights, data, recipes, GenRM, and training assets released (X, Tech Report, Announcement, HF).
  * NVIDIA also shipped Nemotron 3.5 ASR, a 600M open multilingual streaming STT model for voice agents (X, HF, Benchmark, Voice Agent Repo).
  * Google dropped Gemma 4 12B, an encoder-free multimodal model that runs locally under Apache 2.0 (X, HF).
  * MiniMax announced M3, a natively multimodal, 1M-context coding and agentic model with open weights coming soon (X, API, Code).
  * JetBrains released Mellum2, a 12B MoE with 2.5B active params trained from scratch by a small team (X, Blog, HF).
  * H Company launched Holo 3.1, local computer-use agents from 0.8B to 35B with new quantized checkpoints (X, Blog).
* Big CO LLMs + APIs
  * NVIDIA announced RTX Spark, its new Arm + Blackwell PC platform for local AI agents and 120B-class local inference (coverage).
  * Microsoft AI launched seven new MAI models, including MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Transcribe-1.5, and MAI-Voice-2 (Blog, Tech Report).
* AI Art & Diffusion & 3D
  * MAI-Image-2.5 landed near the top of Arena image leaderboards, though hands-on tests were mixed (X, Try it).
  * Ideogram 4.0 became the top open-weight text-to-image model with strong typography and layout control (X, Blog, HF).
  * Reve 2.0 jumped to #2 on Text-to-Image Arena with native 4K, code-like layout control, and precise editing (X, Blog, Try it).
  * xAI released Grok Imagine Video 1.5 Preview for image-to-video with synced audio (xAI).
* Tools & Agentic Engineering
  * Arena launched Agent Arena, a new leaderboard for real agent workflows instead of one-shot chatbot prompts (Arena).
  * Cognition rebranded Windsurf into Devin Desktop, a multi-agent command center with ACP support (X, Announcement).
  * Nous Research launched Hermes Desktop, bringing Hermes Agent into a native desktop app for Mac, Windows, and Linux (X, Site).
* This Week's Buzz
  * WeaveHacks 4 is this weekend in SF with OpenAI, Cursor, DeepMind, and more joining (lu.ma/weavehacks).
  * Nemotron 3 Ultra is live on CoreWeave Inference through W&B at full NVFP4 precision (Try it).
  * WolfBench added 3D token-depth bars, making model efficiency much easier to see (wolfbench.ai).
* Voice & Audio
  * ElevenLabs launched Dubbing v2, an audio-to-audio dubbing model that preserves performance across 90+ languages (X, Dubbing).
  * Cartesia launched Ink-2, a fast streaming STT model built for voice agents (X, Ink, AA).
  * NVIDIA's Nemotron 3.5 ASR looks like a major open-source voice-agent infrastructure drop (HF).
* AI in Society
  * Bernie Sanders proposed the American AI Sovereign Wealth Fund Act, calling for public equity stakes in major AI companies (coverage).
  * Anthropic published When AI Builds Itself, laying out scenarios for AI-driven AI R&D and recursive self-improvement (Anthropic).
  * AI leaders urged Congress to mandate synthetic DNA/RNA screening and recordkeeping (WIRED).
  * Anthropic confidentially filed for an IPO, adding another frontier-lab public-market storyline to watch (Axios).
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe