📆 ThursdAI - Dec 5 - OpenAI o1 & o1 pro, Tencent HY-Video, FishSpeech 1.5, Google GENIE2, Weave in GA & more AI news

Well well well, December is finally here, we're about to close out this year (and just flew past the second anniversary of ChatGPT 🎂) and it seems that all of the AI labs want to give us X-mas presents to play with over the holidays!

Look, I keep saying this, but weeks are getting crazier and crazier. This week we got the cheapest and the most expensive AI offerings all at once (the cheapest from Amazon and the most expensive from OpenAI), 2 new open-weights models that beat commercial offerings, a diffusion model that predicts the weather, and 2 world-building models. Oh, and 2 decentralized, fully open-source LLMs finished training across the world, LIVE. I said... crazy week!

And for W&B, this week started with Weave finally launching in GA 🎉, which I personally was looking forward to (read more below)!

TL;DR Highlights

* OpenAI o1 & Pro Tier: o1 is out of preview, now smarter, faster, multimodal, and integrated into ChatGPT. For heavy usage, ChatGPT Pro ($200/month) offers unlimited calls and o1 Pro Mode for harder reasoning tasks.

* Video & Audio Open Source Explosion: Tencent’s HYVideo outperforms Runway and Luma, bringing high-quality video generation to open source. FishSpeech 1.5 challenges top TTS providers, making near-human voice available for free research.

* Open Source Decentralization: Nous Research’s DisTrO (15B) and Prime Intellect’s INTELLECT-1 (10B) prove you can train giant LLMs across decentralized nodes globally. Performance is on par with centralized setups.

* Google’s Genie 2 & WorldLabs: Generating fully interactive 3D worlds from a single image, pushing boundaries in embodied AI and simulation. Google’s GenCast also sets a new standard in weather prediction, beating supercomputers in accuracy and speed.

* Amazon’s Nova FMs: Cheap, scalable LLMs with huge context and global language coverage. Perfect for cost-conscious enterprise tasks, though not top-tier on performance.

* 🎉 Weave by W&B: Now in GA, it’s your dashboard and tool suite for building, monitoring, and scaling GenAI apps. Get Started with 1 line of code

OpenAI’s 12 Days of Shipping: o1 & ChatGPT Pro

The biggest splash this week came from OpenAI. They’re kicking off “12 days of launches,” and Day 1 brought the long-awaited full version of o1. The main complaint many people had about o1-preview was how slow it was! Well, the full o1 is not only smarter but significantly faster (60% faster than preview!), and officially multimodal: it can see images and text together.

Better yet, OpenAI introduced a new ChatGPT Pro tier at $200/month. It offers unlimited usage of o1, advanced voice mode, and something called o1 pro mode — where o1 thinks even harder and longer about your hardest math, coding, or science problems. For power users—maybe data scientists, engineers, or hardcore coders—this might be a no-brainer. For others, 200 bucks might be steep, but hey, someone’s gotta pay for those GPUs. Given that OpenAI recently confirmed that there are now 300 million weekly active users on the platform, and many of my friends already upgraded, this is for sure going to boost the bottom line at OpenAI!

Quoting Sam Altman from the stream, “This is for the power users who push the model to its limits every day.” For those who complained o1 took forever just to say “hi,” rejoice: trivial requests will now be answered quickly, while super-hard tasks get that legendary deep reasoning, including a new progress bar and a notification when a task is complete. Friend of the pod Ray Fernando gave o1 pro mode a prompt that took 7 minutes to think through!

I've tested the new o1 myself, and while I've gotten dangerously close to my 50-messages-per-week quota, I've gotten some incredible results already, and very fast as well. My ice-cubes question, which both o1-preview and o1-mini failed (after thinking significantly longer), took the new o1 just 4 seconds to get right.

Open Source LLMs: Decentralization & Transparent Reasoning

Nous Research DisTrO & DeMo Optimizer

We’ve talked about decentralized training before, but the folks at Nous Research are making it a reality at scale. This week, Nous Research wrapped up the training of a new 15B-parameter LLM—codename “Psyche”—using a fully decentralized approach called “Nous DisTrO.” Picture a massive AI model trained not in a single data center, but across GPU nodes scattered around the globe. According to Alex Volkov (host of ThursdAI), “This is crazy: they’re literally training a 15B param model using GPUs from multiple companies and individuals, and it’s working as well as centralized runs.”

The key to this success is “DeMo” (Decoupled Momentum Optimization), a paper co-authored by none other than Diederik Kingma (yes, the Kingma behind the Adam optimizer and VAEs). DeMo drastically reduces communication overhead while maintaining stability and speed. The training loss curve they’ve shown looks just as good as a normal centralized run, proving that decentralized training isn’t just a pipe dream. The code and paper are open source, and soon we’ll have the fully trained Psyche model. It’s a huge step toward democratizing large-scale AI—no more waiting around for Big Tech to drop their weights. Instead, we can all chip in and train together.
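To make the “decoupled momentum” idea a bit more concrete, here’s a toy sketch of one DeMo-style step for a single tensor. Big caveat: the real paper extracts the fast-moving momentum components with a DCT transform and synchronizes those across nodes; I’m standing in a plain top-k magnitude pick and skipping the actual networking, so read this as a conceptual illustration, not the Nous implementation:

```python
import torch

def demo_step(param: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
              beta: float = 0.999, k: int = 32, lr: float = 1e-3) -> None:
    """Toy single-tensor sketch of a DeMo-style update (illustration only)."""
    # 1. Fold the local gradient into local momentum. The full momentum
    #    tensor is never synchronized across nodes.
    momentum.mul_(beta).add_(grad)

    # 2. Extract a small set of "fast" components. The paper uses a DCT
    #    for this; top-k by magnitude is a simplified stand-in.
    flat = momentum.view(-1)
    idx = flat.abs().topk(min(k, flat.numel())).indices
    fast = torch.zeros_like(flat)
    fast[idx] = flat[idx]

    # 3. Decouple: subtract what we're about to transmit, so the residual
    #    keeps accumulating locally instead of being discarded.
    flat.sub_(fast)

    # 4. In a real decentralized run, only `fast` (a handful of values and
    #    indices) crosses the network (e.g. via all_gather), which is where
    #    the communication savings come from. Here we just apply the update.
    param.add_(fast.view_as(param), alpha=-lr)

# Tiny usage example: one step on a random tensor.
p, m = torch.randn(256, 256), torch.zeros(256, 256)
demo_step(p, torch.randn_like(p), m)
```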

Prime Intellect INTELLECT-1 10B: Another Decentralized Triumph

But wait, there’s more! Prime Intellect also finished training their 10B model, INTELLECT-1, using a similar decentralized setup. INTELLECT-1 was trained with a custom framework that reduces inter-GPU communication by 400x. It’s essentially a global team effort, with nodes from all over the world contributing compute cycles.

The result? A model hitting performance similar to older Meta models like Llama 2—but fully decentralized.

Ruliad DeepThought 8B: Reasoning You Can Actually See

If that’s not enough, we’ve got yet another open-source reasoning model: Ruliad’s DeepThought 8B. This 8B-parameter model (finetuned from LLaMA-3.1) comes from friends of the show FarEl, Alpin and Sentdex 👏

That DeepThought attempts to match or exceed the reasoning performance of much larger models (beating several 72B-param models while being just 8B itself) is very impressive.

Google is firing on all cylinders this week

Google didn't stay quiet this week either, and while we all wait for the Gemini team to release the next Gemini after the myriad of very good experimental models recently, we got some amazing things this week.

Google’s PaliGemma 2 - fine-tunable SOTA VLM using Gemma

PaliGemma 2 is a new family of vision-language base models (3B, 10B and 28B) at 224px, 448px and 896px resolutions. They include image segmentation and detection capabilities and are great at OCR, which makes them very versatile for fine-tuning on specific tasks.

They claim to achieve SOTA on chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation!
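If you want to kick the tires, inference should look like standard transformers usage. A hedged sketch below: the checkpoint id follows the Hub’s PaliGemma naming scheme and the bare task-prefix prompt (“ocr”) follows the original PaliGemma conventions, so double-check both against the model card:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Checkpoint id assumed from the Hub naming scheme; "pt" checkpoints are
# the base models meant for fine-tuning.
model_id = "google/paligemma2-3b-pt-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("receipt.png")  # any local image
inputs = processor(text="ocr", images=image, return_tensors="pt")
inputs = inputs.to(torch.bfloat16).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)

new_tokens = out[0][inputs["input_ids"].shape[-1]:]  # strip the prompt tokens
print(processor.decode(new_tokens, skip_special_tokens=True))
```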

Google GenCast SOTA weather prediction with... diffusion!?

More impressively, Google DeepMind released GenCast, a diffusion-based model that beats the state-of-the-art ENS system in 97% of weather predictions. Did we say weather predictions? Yup.

Generative AI is now better at weather forecasting than dedicated physics-based deterministic algorithms running on supercomputers. GenCast can predict 15 days in advance in just 8 minutes on a single TPU v5, instead of hours on a monstrous cluster. This is mind-blowing. As Yam said on the show, “Predicting the world is crazy hard” and now diffusion models handle it with ease.

W&B Weave: Observability, Evaluation and Guardrails now in GA

Speaking of building and monitoring GenAI apps, we at Weights & Biases (the sponsor of ThursdAI) announced that Weave is now GA. Weave is a developer tool for evaluating, visualizing, and debugging LLM calls in production. If you’re building GenAI apps—like a coding agent or a tool that processes thousands of user requests—Weave helps you track costs, latency, and quality systematically.

We showcased two internal apps: Open UI (a website builder from a prompt) and Winston (an AI agent that checks emails, Slack, and more). Both rely on Weave to iterate, tune prompts, measure user feedback, and ensure stable performance. With o1 and other advanced models coming to APIs soon, tools like Weave will be crucial to keep those applications under control.

If you follow this newsletter and develop with LLMs, now is a great time to give Weave a try: it really does take a single line of code to get started.
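Here’s what that looks like in practice, as a minimal sketch (the project name and prompt are placeholders, and the OpenAI call assumes you have an API key set): `weave.init` is the one line, and `@weave.op()` additionally traces any function you decorate.

```python
import weave
from openai import OpenAI

weave.init("thursdai-demo")  # the one line: start logging calls to this project

client = OpenAI()  # OpenAI calls are traced automatically once weave is initialized

@weave.op()  # also trace this function's inputs, outputs and latency
def summarize(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize in one line: {text}"}],
    )
    return resp.choices[0].message.content

print(summarize("OpenAI shipped the full o1 and a $200/mo Pro tier this week."))
```

Every call then shows up in the Weave dashboard with inputs, outputs and latency (plus costs for known models), which is exactly what you need once you start evaluating prompts systematically.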

Open Source Audio & Video: Challenging Proprietary Models

Tencent’s HY Video: Beating Runway & Luma in Open Source

Tencent came out swinging with their open-source model, HYVideo. It’s a video model that generates incredibly realistic footage, camera cuts, and even audio—yep, Foley and lip-synced character speech. Just a single model doing text-to-video, image-to-video, puppeteering, and more. It even outperforms closed-source giants like Runway Gen 3 and Luma 1.6 on over 1,500 prompts.

This is the kind of thing we dreamed about when we first heard of video diffusion models. Now it’s here, open-sourced, ready for tinkering. “It’s near SORA-level,” as I mentioned, referencing OpenAI’s yet-to-be-fully-released SORA model. The future of generative video just got more accessible, and competitors should be sweating right now. We may just get SORA as one of the 12 days of OpenAI releases!

FishSpeech 1.5: Open Source TTS Rivaling the Big Guns

Not just video—audio too. FishSpeech 1.5 is a multilingual, zero-shot voice cloning model that ranks #2 overall on TTS benchmarks, just behind ElevenLabs. This is a 500M-parameter model, trained on a million hours of audio, achieving near-human quality, fast inference, and open for research.

This puts high-quality text-to-speech capabilities in the open-source community’s hands. You can now run a top-tier TTS system locally, clone voices, and generate speech in multiple languages with low latency. No more relying solely on closed APIs. This is how open-source chases—and often catches—commercial leaders.

If you’ve been longing for near-instant voice cloning on your own hardware, this is the model to go play with!
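I haven’t wired FishSpeech into a project myself yet, so here’s only a hypothetical sketch of what zero-shot cloning with this class of models looks like. The `ZeroShotTTS` class and its methods below are illustrative placeholders, not FishSpeech’s actual API (check their repo for the real entry points):

```python
from pathlib import Path

class ZeroShotTTS:
    """Illustrative stand-in for a zero-shot voice-cloning TTS model.
    Hypothetical interface only; NOT FishSpeech's real API."""

    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint  # a real model would load ~500M params here

    def clone(self, reference_wav: Path, text: str, language: str = "en") -> bytes:
        # A real model conditions on a few seconds of reference audio and
        # synthesizes `text` in that voice; we return placeholder bytes.
        return b"\x00" * 16

tts = ZeroShotTTS("fishspeech-1.5")
audio = tts.clone(Path("my_voice_sample.wav"), "ThursdAI, the top AI news!")
Path("out.raw").write_bytes(audio)  # placeholder output, not real audio
```

The shape of the workflow is the point: a few seconds of reference audio in, synthesized speech in that voice out, all running on your own hardware.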

Creating World Models: Genie 2 & WorldLabs

Fei-Fei Li’s WorldLabs: Images to 3D Worlds

WorldLabs, founded by Dr. Fei-Fei Li, showcased a mind-boggling demo: turning a single image into a walkable 3D environment. Imagine you take a snapshot of a landscape, load it into their system, and now you can literally walk around inside that image as if it were a scene in a video game. “I can literally use WASD keys and move around,” Alex commented, clearly impressed.

It’s not perfect fidelity yet, but it’s a huge leap toward generating immersive 3D worlds on the fly. These tools could revolutionize virtual reality, gaming, and simulation training. WorldLabs’ approach is still in early stages, but what they demonstrated is nothing short of remarkable.

Google’s Genie 2: Playable Worlds from a Single Image

If WorldLabs’s 3D environment wasn’t enough, Google dropped Genie 2. Take an image generated by Imagen 3, feed it into Genie 2, and you get a playable world lasting up to a minute. Your character can run, objects have physics, and the environment is consistent enough that if you leave an area and return, it’s still there.

As I said on the pod, “It looks like a bit of Doom, but generated from a single static image. Insane!” The model simulates complex interactions—think water flowing, balloons bursting—and even supports long-horizon memory. This could be a goldmine for AI-based game development, rapid prototyping, or embodied agent training.

Amazon’s Nova: Cheaper LLMs, Not Better LLMs

Amazon is also throwing their hat in the ring with the Nova series of foundation models. They’ve got variants like Nova Micro, Lite, Pro, and even a Premier tier coming in 2025. The catch? Performance is kind of “meh” compared to Anthropic or OpenAI’s top models, but Amazon is aiming to be the cheapest high-quality LLM among the big players. With a context window of up to 300K tokens and 200+ language coverage, Nova could find a niche, especially for those who want to pay less per million tokens.

Nova Micro costs around 3.5 cents per million input tokens and 14 cents per million output tokens—making it dirt cheap to process massive amounts of data. Although not a top performer, Amazon’s approach is: “We may not be best, but we’re really cheap and we scale like crazy.” Given Amazon’s infrastructure, this could be compelling for enterprises looking for cost-effective large-scale solutions.
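For a sense of scale, here’s the back-of-the-envelope math on those Nova Micro rates (prices as quoted above; the token counts are a made-up example workload):

```python
# Nova Micro, as quoted: $0.035 per 1M input tokens, $0.14 per 1M output tokens.
PRICE_IN_PER_M = 0.035
PRICE_OUT_PER_M = 0.14

input_tokens = 1_000_000_000   # example: ingest a billion tokens of documents
output_tokens = 50_000_000     # example: ~50M tokens of summaries back

cost = (input_tokens / 1e6) * PRICE_IN_PER_M + (output_tokens / 1e6) * PRICE_OUT_PER_M
print(f"${cost:,.2f}")  # -> $42.00 for the entire job
```

A billion tokens in and 50 million out for about forty bucks is the kind of pricing that makes bulk document processing pencil out.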

Phew, this was a LONG week with a LOT of AI drops, and NGL, o1 actually helped me a bit with this newsletter. I wonder if you can spot the places where o1 wrote some of the text (using the transcription of the show and the outline as guidelines, and the previous newsletter as a tone guide) and where I wrote it myself?

Next week is NeurIPS 2024, the biggest ML conference in the world, and I'm going to be live streaming from there. If you're at the conference, come by booth #404 and say hi! I'm sure there will be a TON of new AI updates next week as well!

Show Notes & Links

TL;DR of all topics covered:

* This week's Buzz

* Weights & Biases announces Weave is now in GA 🎉(wandb.me/tryweave)

* Tracing LLM calls

* Evaluation & Playground

* Human Feedback integration

* Scoring & Guardrails (in preview)

* Open Source LLMs

* DisTrO & DeMo from NousResearch - decentralized DisTrO 15B run (X, watch live, Paper)

* Prime Intellect - INTELLECT-1 10B decentralized LLM (Blog, watch)

* Ruliad DeepThought 8B - Transparent reasoning model (LLaMA-3.1) w/ test-time compute scaling (X, HF, Try It)

* Google GenCast - diffusion model SOTA weather prediction (Blog)

* Google open sources PaliGemma 2 (X, Blog)

* Big CO LLMs + APIs

* Amazon announces Nova series of FM at AWS (X)

* Google GENIE 2 creates playable worlds from a picture! (Blog)

* OpenAI 12 days started with o1 full and o1 pro and pro tier $200/mo (X, Blog)

* Vision & Video

* Tencent open sources HY Video - beating Luma & Runway (Blog, Github, Paper, HF)

* Runway video keyframing prototype (X)

* Voice & Audio

* FishSpeech V1.5 - multilingual, zero-shot instant voice cloning, low-latency, open text-to-speech model (X, Try It)

* ElevenLabs - real-time audio agents builder (X)



This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
