
📆 ThursdAI - Dec 4, 2025 - DeepSeek V3.2 Goes Gold Medal, Mistral Returns to Apache 2.0, OpenAI Hits Code Red, and US-Trained MOEs Are Back!
Hey y'all, Alex here 🫡
Welcome to the first ThursdAI of December! Snow is falling in Colorado, and AI releases are falling even harder. This week was genuinely one of those "drink from the firehose" weeks where every time I refreshed my timeline, another massive release had dropped.
We kicked off the show asking our co-hosts for their top AI pick of the week, and the answers were all over the map: Wolfram was excited about Mistral's return to Apache 2.0, Yam couldn't stop talking about Claude Opus 4.5 after a full week of using it, and Nisten came out of left field with an AWQ quantization of Prime Intellect's model that apparently runs incredibly fast on a single GPU. As for me? I'm torn between Opus 4.5 (which literally fixed bugs that Gemini 3 created in my code) and DeepSeek's gold-medal-winning reasoning model.
Speaking of which, let's dive into what happened this week, starting with the open source stuff that's been absolutely cooking.
Open Source LLMs
DeepSeek V3.2: The Whale Returns with Gold Medals
The whale is back, folks! DeepSeek released two major updates this week: V3.2 and V3.2-Speciale. And these aren't incremental improvements: we're talking about an open reasoning-first model that's rivaling GPT-5 and Gemini 3 Pro with actual gold-medal Olympiad wins.
Here's what makes this release absolutely wild: DeepSeek V3.2-Speciale is achieving 96% on AIME versus 94% for GPT-5 High. It's getting gold medals on IMO (35/42), CMO, ICPC (10/12), and IOI (492/600). This is a 685 billion parameter MOE model with an MIT license, and it literally broke the benchmark graph on HMMT 2025; the score was so high it went outside the chart boundaries. That's how you DeepSeek, basically.
But it's not just about reasoning. The regular V3.2 (not Speciale) is absolutely crushing it on agentic benchmarks: 73.1% on SWE-Bench Verified, the first open model over 35% on Tool Decathlon, and 80.3% on τ²-bench. It's now the second most intelligent open-weights model and ranks ahead of Grok 4 and Claude Sonnet 4.5 on Artificial Analysis.
The price is what really makes this insane: 28 cents per million tokens on OpenRouter. That's absolutely ridiculous for this level of performance. They've also introduced DeepSeek Sparse Attention (DSA), which gives you 2-3x cheaper 128K inference without performance loss. LDJ pointed out on the show that he appreciates how transparent they're being about not quite matching Gemini 3's efficiency on reasoning tokens, but it's open source and incredibly cheap.
One thing to note: V3.2-Speciale doesn't support tool calling. As Wolfram pointed out from the model card, it's "designed exclusively for deep reasoning tasks." So if you need agentic capabilities, stick with the regular V3.2.
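If you want to kick the tires yourself, the OpenRouter endpoint is OpenAI-compatible, so a minimal sketch looks like this (the model slug is my assumption; check OpenRouter's catalog for the exact ID):

```python
# Minimal sketch: DeepSeek V3.2 via OpenRouter's OpenAI-compatible API.
# The model slug is an assumption; look up the exact ID on openrouter.ai.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

resp = client.chat.completions.create(
    # Hypothetical slug; use regular V3.2 (not Speciale) if you need tool calling.
    model="deepseek/deepseek-v3.2",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)
print(resp.choices[0].message.content)
```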
Check out the full release on Hugging Face or read the announcement.
Mistral 3: Europe's Favorite AI Lab Returns to Apache 2.0
Mistral is back, and they're back with fully open Apache 2.0 licenses across the board! This is huge news for the open source community. They released two major things this week: Mistral Large 3 and the Ministral 3 family of small models.
Mistral Large 3 is a 675 billion parameter MOE with 41 billion active parameters and a quarter-million-token (256K) context window, trained on 3,000 H200 GPUs. There's been some debate about this model's performance, and I want to address the elephant in the room: some folks saw a screenshot showing Mistral Large 3 very far down on Artificial Analysis and started dunking on it. But here's the key context that Merve from Hugging Face pointed out: this is the only non-reasoning model on that chart besides GPT 5.1. When you compare it to other instruction-tuned (non-reasoning) models, it's actually performing quite well, sitting at #6 among open models on LMSys Arena.
Nisten checked LM Arena and confirmed that on coding specifically, Mistral Large 3 is scoring as one of the best open source coding models available. Yam made an important point that we should compare Mistral to other open source players like Qwen and DeepSeek rather than to closed models, and in that context, this is a solid release.
But the real stars of this release are the Ministral 3 small models: 3B, 8B, and 14B, all with vision capabilities. These are edge-optimized, multimodal, and the 3B actually runs completely in the browser with WebGPU using transformers.js. The 14B reasoning variant achieves 85% on AIME 2025, which is state-of-the-art for its size class. Wolfram confirmed that the multilingual performance is excellent, particularly for German.
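For a feel of how accessible the small end is, here's a rough local-inference sketch with Hugging Face transformers; the repo id is my guess from the family naming, so verify it on Mistral's Hugging Face org:

```python
# Rough sketch: running a Ministral 3-class small model locally with transformers.
# The repo id below is an assumption; check Mistral's Hugging Face org for the real one.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="mistralai/Ministral-3-3B-Instruct",  # hypothetical repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Describe this week in AI in one sentence."}]
out = pipe(messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # last message is the model's reply
```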
There's been some discussion about whether Mistral Large 3 is a DeepSeek finetune given the architectural similarities, but Mistral claims these are fully trained models. As Nisten noted, even if they used a similar architecture (which is Apache 2.0 licensed), there's nothing wrong with that; it's an excellent architecture that works. Lucas Atkins later confirmed on the show that "Mistral Large looks fantastic... it is DeepSeek through and through architecture wise. But Kimi also does that. DeepSeek is the GOAT. Training MOEs is not as easy as just import deepseek and train."
Check out Mistral Large 3 and Ministral 3 on Hugging Face.
Arcee Trinity: US-Trained MOEs Are Back
We had Lucas Atkins, CTO of Arcee AI, join us on the show to talk about their new Trinity family of models, and this conversation was packed with insights about what it takes to train MOEs from scratch in the US.
Trinity is a family of open-weight MOEs fully trained end-to-end on American infrastructure with 10 trillion curated tokens from Datology.ai. They released Trinity-Mini (26B total, 3B active) and Trinity-Nano-Preview (6B total, 1B active), with Trinity-Large (420B parameters, 13B active) coming in mid-January 2026.
The benchmarks are impressive: Trinity-Mini hits 84.95% on MMLU (0-shot), 92.1% on Math-500, and 65% on GPQA Diamond. But what really caught my attention was the inference speed: Nano generates at 143 tokens per second on llama.cpp, and Mini hits 157 t/s on consumer GPUs. They've even demonstrated it running on an iPhone via MLX Swift.
I asked Lucas why it matters where models come from, and his answer was nuanced: for individual developers, it doesn't really matter; use the best model for your task. But for Fortune 500 companies, compliance and legal teams are getting increasingly particular about where models were trained and hosted. This is slowing down enterprise AI adoption, and Trinity aims to solve that.
Lucas shared a fascinating insight about why they decided to do full pretraining instead of just post-training on other people's checkpoints: "We at Arcee were relying on other companies releasing capable open weight models... I didn't like the idea of the foundation of our business being reliant on another company releasing models." He also dropped some alpha about Trinity-Large: they're going with 13B active parameters instead of 32B because going sparser actually gave them much faster throughput on Blackwell GPUs.
The conversation about MOEs being cheaper for RL was particularly interesting. Lucas explained that because MOEs are so inference-efficient, you can do way more rollouts during reinforcement learning, which means more RL benefit per compute dollar. This is likely why we're seeing labs like MiniMax go from their original 456B/45B-active model to a leaner 220B/10B-active model; they can get more gains in post-training by being able to do more steps.
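To make that intuition concrete, here's a back-of-envelope sketch (all numbers illustrative) under the assumption that per-token generation cost scales roughly with active parameters:

```python
# Back-of-envelope: if cost per generated token scales roughly with *active*
# parameters, then for a fixed RL compute budget the affordable rollout count
# scales inversely with active params. All figures illustrative.
models = {
    "Dense 70B": 70,                       # active params, in billions
    "MiniMax 456B / 45B active": 45,
    "MiniMax 220B / 10B active": 10,
    "Trinity-Large 420B / 13B active": 13,
}

baseline = models["Dense 70B"]
for name, active in models.items():
    # Relative number of RL rollouts affordable vs. the dense baseline.
    print(f"{name}: ~{baseline / active:.1f}x the rollouts of a dense 70B")
```

Under this (very rough) model, dropping from 45B to 10B active parameters buys you about 4.5x the rollouts per compute dollar, which is exactly the kind of headroom Lucas is talking about.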
Check out Trinity-Mini and Trinity-Nano-Preview on Hugging Face, or read The Trinity Manifesto.
OpenAI Code Red: Panic at the Disco (and Garlic?)
It was ChatGPT's 3rd birthday this week (Nov 30th), but the party vibes seem... stressful. Reports came out that Sam Altman has declared a "Code Red" at OpenAI.
Why? Gemini 3. The user numbers don't lie. ChatGPT apparently saw a 6% drop in daily active users following the Gemini 3 launch. Google's integration is just too good, and their free tier is compelling.
In response, OpenAI has supposedly paused "side projects" (ads, shopping bots) to focus purely on model intelligence and speed. Rumors point to a secret model codenamed "Garlic": a leaner, more efficient model that beats Gemini 3 and Claude Opus 4.5 on coding and reasoning, targeting a release in early 2026 (or maybe sooner if they want to save Christmas).
Wolfram and Yam nailed the sentiment here: Integration wins. Wolfram's family uses Gemini because it's right there on the Pixel, controlling the lights and calendar. OpenAI needs to catch up not just on IQ, but on being helpful in the moment.
Post the live show, OpenAI also finally added GPT-5.1-Codex-Max (which we covered two weeks ago) to their API, and it's now available in Cursor, for free, until Dec 11!
Amazon Nova 2: Enterprise Push with Serious Agentic Chops
Amazon came back swinging with Nova 2, and the jump on Artificial Analysis is genuinely impressive: from around 30% to 61% on their index. That's a massive improvement.
The family includes Nova 2 Lite (7x cheaper, 5x faster than Nova Premier), Nova 2 Pro (93% on τ²-Bench Telecom, 70% on SWE-Bench Verified), Nova 2 Sonic (speech-to-speech with 1.39s time-to-first-audio), and Nova 2 Omni (unified text/image/video/speech with a 1M token context window; you can upload 90 minutes of video!).
Gemini 3 Deep Think Mode
Google launched Gemini 3 Deep Think mode exclusively for AI Ultra subscribers, and it's hitting some wild benchmarks: 45.1% on ARC-AGI-2 (a 2x SOTA leap using code execution), 41% on Humanity's Last Exam, and 93.8% on GPQA Diamond. This builds on their Gemini 2.5 variants that earned gold medals at IMO and ICPC World Finals. The parallel reasoning approach explores multiple hypotheses simultaneously, but it's compute-heavy: limited to 10 prompts per day, at $77 per ARC-AGI-2 task.
This Weekâs Buzz: Mid-Training Evals are Here!
A huge update from us at Weights & Biases this week: We launched LLM Evaluation Jobs. (Docs)
If you are training models or finetuning, you usually wait until the end to run your expensive benchmarks. Now, directly inside W&B, you can trigger evaluations on mid-training checkpoints.
It integrates with Inspect Evals (100+ public benchmarks). You just point it to your checkpoint or an API endpoint (even OpenRouter!), select the evals (MMLU-Pro, GPQA, etc.), and we spin up the managed GPUs to run it. You get a real-time leaderboard of your runs vs. the field.
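For context, under the hood this is the open-source Inspect framework doing the heavy lifting; a standalone run looks roughly like the sketch below (task and model identifiers are illustrative, and the Eval Jobs UI handles the GPUs and leaderboard for you):

```python
# Rough sketch of a standalone Inspect Evals run. Task and model strings
# are illustrative; see the Inspect / Inspect Evals docs for exact names.
from inspect_ai import eval

results = eval(
    "inspect_evals/gpqa_diamond",               # assumed registry task name
    model="openrouter/deepseek/deepseek-v3.2",  # illustrative provider/model string
)
```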
Also, a shoutout to users of Neptune.ai: congrats on the acquisition by OpenAI, but since the service is shutting down, we have built a migration script to help you move your history over to W&B seamlessly. We aren't going anywhere!
Video & Vision: Physics, Audio, and Speed
The multimodal space was absolutely crowded this week.
Runway Gen 4.5 ("Whisper Thunder")
Runway revealed that the mysterious "Whisper Thunder" model topping the leaderboards is actually Gen 4.5. The key differentiator? Physics and multi-step adherence. It doesn't have that "diffusion wobble" anymore. We watched a promo video where the shot changes every 3-4 seconds, and while it's beautiful, it shows we still haven't cracked super long consistent takes yet. But for 8-second clips? It's apparently the new SOTA.
Kling 2.6: Do you hear that?
Kling hit back with Video 2.6, and the killer feature is Native Audio. I generated a clip of two people arguing, and the lip sync was perfect. Not "dubbed over" perfect, but actively generated with the video. It handles multi-character dialogue, singing, and SFX. It's huge for creators.
Kling was on a roll this week, releasing not one but two video models: Video 2.6 and O1 Video, an omni-modal model that takes text, images, and audio as inputs. O1 Image and Kling Avatar 2.0 are also great updates! (Find all their releases on X)
P-Image: Sub-Second Generation at Half a Cent
Last week we talked about ByteDance's Z-Image, which was super cool and super cheap. Well, this week Pruna AI came out with P-Image, which is even faster and cheaper: image generation in under one second for $0.005, and editing in under one second for $0.01.
I built a Chrome extension this week (completely rewritten by Opus 4.5, by the way; more on that in a second) that lets me play with these new image models inside the Infinite Craft game. When I tested P-Image Turbo against Z-Image, I was genuinely impressed by the quality at that speed. If you want quick iterations before moving to something like Nano Banana Pro for final 4K output, these sub-second models are perfect.
The extension is available on GitHub if you want to try it; you just need to add your Replicate or Fal API keys.
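Under the hood, the call is about as simple as it gets with Replicate's Python client; the model slug and input schema here are my assumptions, so check the actual model page:

```python
# Sketch of what the extension does per craft: one fast image generation call.
# The model slug and input schema are assumptions; verify on replicate.com.
import replicate  # reads REPLICATE_API_TOKEN from the environment

output = replicate.run(
    "prunaai/p-image-turbo",  # hypothetical slug
    input={"prompt": "a tiny dragon forged from water and fire, game-icon style"},
)
print(output)  # typically a URL (or list of URLs) to the generated image
```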
SeeDream 4.5: ByteDance Levels Up
ByteDance also launched SeeDream 4.5 in open beta, with major improvements in detail fidelity, spatial reasoning, and multi-image reference fusion (up to 10 inputs for consistent storyboards). The text rendering is much sharper, and it supports multilingual typography including Japanese. Early tests show it competing well with Nano Banana Pro in prompt adherence and logic.
🎤 Voice & Audio
Microsoft VibeVoice-Realtime-0.5B
In a surprise drop, Microsoft open-sourced VibeVoice-Realtime-0.5B, a compact TTS model optimized for real-time applications. It delivers initial audible output in just 300 milliseconds while generating up to 10 minutes of speech. The community immediately started creating mirrors because, well, Microsoft has a history of releasing things on Hugging Face and then having legal pull them down. Get it while it's hot!
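If you want your own copy before anything happens to the repo, huggingface_hub's snapshot_download is all you need; the repo id here is my assumption based on the model name, so verify it on the Hub:

```python
# Grab a full local copy of the model repo while it's still up.
# The repo id is an assumption inferred from the model name; verify on the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="microsoft/VibeVoice-Realtime-0.5B",  # assumed repo id
    local_dir="./vibevoice-realtime-0.5b",
)
print(f"Model files saved to {local_dir}")
```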
Use Cases: Code, Cursors, and âAntigravityâ
We wrapped up with some killer practical tips:
* Opus 4.5 is a beast: As I mentioned, using Opus inside Cursor's "Ask" mode is currently the supreme coding experience. It debugs logic flaws that Gemini misses completely. I also used Opus as a prompt engineer for my infographics, and it absolutely demolished GPT at creating the specific layouts I needed.
* Google's Secret: Nisten dropped a bomb at the end of the show: Opus 4.5 is available for free inside Google's Antigravity (and Colab)! If you want to try the model that's beating GPT-5 without paying, go check Antigravity now before they patch it or run out of compute.
* Microsoft VibeVoice: The surprise 0.5B real-time TTS drop (300ms latency) we covered above. It was briefly questionable whether it would stay up on Hugging Face, but mirrors are already everywhere.
That's a wrap for this week, folks. Next week is probably going to be our final episode of the year, so we'll be doing recaps and looking at our predictions from last year. Should be fun to see how wrong we were about everything!
Thank you for tuning in. If you missed the live stream, subscribe to our Substack, YouTube, and wherever you get your podcasts. See you next Thursday!
TL;DR and Show Notes
Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed
* Guest - Lucas Atkins (@latkins) - CTO Arcee AI
Open Source LLMs
* DeepSeek V3.2 and V3.2-Speciale - Gold medal olympiad wins, MIT license (X, HF V3.2, HF Speciale, Announcement)
* Mistral 3 family - Large 3 and Ministral 3, Apache 2.0 (X, Blog, HF Large, HF Ministral)
* Arcee Trinity - US-trained MOE family (X, HF Mini, HF Nano, Blog)
* Hermes 4.3 - Decentralized training, SOTA RefusalBench (X, HF)
Big CO LLMs + APIs
* OpenAI Code Red - ChatGPT 3rd birthday, Garlic model in development (The Information)
* Amazon Nova 2 - Lite, Pro, Sonic, and Omni models (X, Blog)
* Gemini 3 Deep Think - 45.1% ARC-AGI-2 (X, Blog)
* Cursor + GPT-5.1-Codex-Max - Free until Dec 11 (X, Blog)
This Weekâs Buzz
* WandB LLM Evaluation Jobs - Evaluate any OpenAI-compatible API (X, Announcement)
Vision & Video
* Runway Gen-4.5 - #1 on text-to-video leaderboard, 1,247 Elo (X)
* Kling VIDEO 2.6 - First native audio generation (X)
* Kling O1 Image - Image generation (X)
Voice & Audio
* Microsoft VibeVoice-Realtime-0.5B - 300ms latency TTS (X, HF)
AI Art & Diffusion
* Pruna P-Image - Sub-second generation at $0.005 (X, Blog, Demo)
* SeeDream 4.5 - Multi-reference fusion, text rendering (X)
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe