ThursdAI - August 1st - Meta SAM 2 for video, Gemini 1.5 is king now?, GPT-4o Voice is here (for some), new Stability, Apple Intelligence also here & more AI news
Starting Monday, Apple released the iOS 18.1 beta with Apple Intelligence, then Meta dropped SAM-2 (Segment Anything Model 2), and then Google first open sourced Gemma 2 2B and now (literally 2 hours ago, during the live show) released Gemini 1.5 Pro 0801 experimental, which takes #1 on LMsys arena across multiple categories. To top it all off, we also got a new SOTA image diffusion model called FLUX.1 from ex-Stability folks and their new Black Forest Labs.
This week on the show, we had Joseph Nelson & Piotr Skalski from Roboflow talk in depth about Segment Anything, and as the absolute experts on this topic (Skalski is our returning vision expert), they gave us an incredible deep dive into the importance of dedicated vision models (not VLMs).
We also had Lukas Atkins & Fernando Neto from Arcee AI talk to us about their new DistillKit and explain model distillation in detail, & finally we had Cristiano Giardina, who is one of the lucky few that got access to OpenAI advanced voice mode. His new friend GPT-4o came on the show as well!
Honestly, how can one keep up with all this? By reading ThursdAI of course, that's how! But ⚠️ buckle up, this is going to be a BIG one (I think over 4.5K words, will mark this as the longest newsletter I penned, I'm sorry, maybe read this one on 2x?)
[ Chapters ]
00:00 Introduction to the Hosts and Their Work
01:22 Special Guests Introduction: Piotr Skalski and Joseph Nelson
04:12 Segment Anything 2: Overview and Capabilities
15:33 Deep Dive: Applications and Technical Details of SAM2
19:47 Combining SAM2 with Other Models
36:16 Open Source AI: Importance and Future Directions
39:59 Introduction to Distillation and DistillKit
41:19 Introduction to DistillKit and Synthetic Data
41:41 Distillation Techniques and Benefits
44:10 Introducing Fernando and Distillation Basics
44:49 Deep Dive into Distillation Process
50:37 Open Source Contributions and Community Involvement
52:04 ThursdAI Show Introduction and This Week's Buzz
53:12 Weights & Biases New Course and San Francisco Meetup
55:17 OpenAI's Advanced Voice Mode and Cristiano's Experience
01:08:04 SearchGPT Release and Comparison with Perplexity
01:11:37 Apple Intelligence Release and On-Device AI Capabilities
01:22:30 Apple Intelligence and Local AI
01:22:44 Breaking News: Black Forest Labs Emerges
01:24:00 Exploring the New Flux Models
01:25:54 Open Source Diffusion Models
01:30:50 LLM Course and Free Resources
01:32:26 FastHTML and Python Development
01:33:26 Friend.com: Always-On Listening Device
01:41:16 Google Gemini 1.5 Pro Takes the Lead
01:48:45 GitHub Models: A New Era
01:50:01 Concluding Thoughts and Farewell
Show Notes & Links
* Open Source LLMs
* Meta gives SAM-2 - segment anything with one shot + video capability! (X, Blog, DEMO)
* Google open sources Gemma 2 2.6B (Blog, HF)
* MTEB Arena launching on HF - Embeddings head to head (HF)
* Arcee AI announces DistillKit - (X, Blog, Github)
* AI Art & Diffusion & 3D
* Black Forest Labs - FLUX new SOTA diffusion models (X, Blog, Try It)
* Midjourney 6.1 update - greater realism + potential Grok integration (X)
* Big CO LLMs + APIs
* Google updates Gemini 1.5 Pro with 0801 release and is #1 on LMsys arena (X)
* OpenAI started alpha GPT-4o voice mode (examples)
* OpenAI releases SearchGPT (Blog, Comparison w/ PPXL)
* Apple releases beta of iOS 18.1 with Apple Intelligence (X, hands on, Intents)
* Apple released a technical paper of Apple Intelligence
* This Week's Buzz
* AI Salons in SF + New Weave course for WandB featuring yours truly!
* Vision & Video
* Runway ML adds Gen-3 image to video and makes it 7x faster (X)
* Tools & Hardware
* Avi announces friend.com
* Jeremy Howard releases FastHTML (Site, Video)
* Applied LLM course from Hamel dropped all videos
Open Source
It feels like everyone and their grandma is open sourcing incredible AI this week! Seriously, get ready for segment-anything-you-want + real-time-video capability PLUS small AND powerful language models.
Meta Gives Us SAM-2: Segment ANYTHING Model in Images & Videos... With One Click!
Hold on to your hats, folks! Remember Segment Anything, Meta's already-awesome image segmentation model? They've just ONE-UPPED themselves. Say hello to SAM-2 - it's real-time, promptable (you can TELL it what to segment), and handles VIDEOS like a champ. As I said on the show: "I was completely blown away by segment anything 2".
But wait, what IS segmentation? Basically, pixel-perfect detection - outlining objects with incredible accuracy. My guests, the awesome Piotr Skalski and Joseph Nelson (computer vision pros from Roboflow), broke it down historically, from SAM 1 to SAM 2, and highlighted just how mind-blowing this upgrade is.
"So now, Segment Anything 2 comes out. Of course, it has all the previous capabilities of Segment Anything ... But the segment anything tool is awesome because it also can segment objects on the video". - Piotr Skalski
Think about Terminator vision from the "give me your clothes" bar scene: you see a scene, instantly "understand" every object separately, AND track it as it moves. SAM-2 gives us that, allowing you to click on a single frame, and BAM - perfect outlines that flow through the entire video! I played with their playground, and you NEED to try it - you can blur backgrounds, highlight specific objects... the possibilities are insane. Playground Link
In this video, Piotr annotated only the first few frames of the top video, and SAM understood the bottom two, which were shot from 2 different angles!
Okay, cool tech, BUT why is it actually USEFUL? Well, Joseph gave us incredible examples - from easier sports analysis and visual effects (goodbye manual rotoscoping) to advances in microscopic research and even galactic exploration! Basically, any task requiring precise object identification gets boosted to a whole new level.
"SAM does an incredible job at creating pixel perfect outlines of everything inside visual scenes. And with SAM2, it does it across videos super well, too ... That capability is still being developed for a lot of AI Models and capabilities. So having very rich ability to understand what a thing is, where that thing is, how big that thing is, allows models to understand spaces and reason about them" - Joseph Nelson
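To make Joseph's "where that thing is, how big that thing is" point concrete, here is a tiny sketch of what a pixel mask gives you that a plain bounding box doesn't. Big caveat: the mask below is a hand-written toy grid, NOT real SAM 2 output, and `describe_mask` is a hypothetical helper name made up for this illustration.

```python
# Toy illustration only: a hand-written binary mask, NOT real SAM 2 output.
# 1 = pixel belongs to the object, 0 = background.
MASK = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def describe_mask(mask):
    """Return area, bounding box and centroid of a binary mask (list of rows).

    Hypothetical helper for illustration; returns None for an empty mask.
    """
    pixels = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    if not pixels:
        return None
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    area = len(pixels)  # how big the thing is, in pixels
    bbox = (min(xs), min(ys), max(xs), max(ys))  # where it is (x0, y0, x1, y1)
    centroid = (sum(xs) / area, sum(ys) / area)  # its "center of mass"
    return {"area": area, "bbox": bbox, "centroid": centroid}


if __name__ == "__main__":
    info = describe_mask(MASK)
    x0, y0, x1, y1 = info["bbox"]
    box_area = (x1 - x0 + 1) * (y1 - y0 + 1)
    print(info)
    print(f"mask covers {info['area']} of the {box_area} pixels in its box")
```

The box alone would over-count the object's size; the mask tells you exactly which pixels belong to it, which is the kind of detail that matters for things like rotoscoping or measuring cells under a microscope.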
AND if you combine this power with other models (like Piotr is already doing!), you get zero-shot segmentation - literally type what you want to find, and the model will pinpoint it in your image/video. It's early days, but get ready for robotics applications, real-time video analysis, and who knows what else these clever hackers are dreaming up! 🤯
Check out Piotr's Zero Shot Florence + Sam2 Implementation
Best of all? Apache 2 license, baby! As Joseph said, "Open source is foundational to making the accessibility, the use cases, and the advancement of the field overall", and this is a prime example. Huge kudos to Meta for empowering us with this tech.
The whole conversation w/ Piotr & Joseph is very much worth listening to on the pod
Google Throws Down The Gauntlet: Open Sourcing Gemma 2 2.6B
It was Meta vs. Google on Monday because NOT to be outdone, Google also went on an open-sourcing spree. This time, they gifted us Gemma 2 (a 2.6 billion parameter powerhouse), alongside a safety-focused suite called ShieldGemma AND a transparency tool called Gemma Scope.
So what makes Gemma 2 special? First off, it's optimized for on-device use, meaning super-efficient local running. BUT there's a catch, folks... They claim it beats Mixtral AND Llama 2 70B on the LMsys Arena leaderboard, with an ELO score of 1126. Hold on, a 2 billion parameter model outperforming the big boys? 🤨 As LDJ (one of my regular co-hosts) said on the show:
"Yeah, I think my best theory here is... there's at least two or three variables at play ... In LMSys, people are much more likely to do single turn, and within LMSys, people will usually be biased more towards rating models with a more recent knowledge cutoff as higher".
Translation? It might be gaming the system a bit, but either way, Gemma 2 is an exciting release - super fast, small enough for on-device applications, and coming with safety tools right out the gate! I think Zenova (our Hugging Face wizard) is already running this on WebGPU! You NEED to try it out.
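If you're wondering what an arena score like 1126 actually buys you, here's a rough sketch using the classic Elo expected-score formula. Assumptions, loudly: LMsys has its own rating pipeline and I'm not claiming this is their exact math, and the 100-point gap below is a made-up number for illustration, not a real leaderboard entry.

```python
def expected_win_rate(rating_a, rating_b):
    """Classic Elo expected score of A against B (ties count as half a win).

    This is the textbook formula, used here as an approximation of how to
    read arena-style ratings; it is not taken from the LMsys codebase.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    gemma_score = 1126  # the score quoted above
    hypothetical_rival = gemma_score + 100  # invented gap, illustration only
    rate = expected_win_rate(hypothetical_rival, gemma_score)
    print(f"A model rated 100 points higher wins about {rate:.0%} of votes")
```

So a 100-point lead means winning roughly 64% of head-to-head votes, not 100% of them. That's also why LDJ's point matters: the score measures which answer people prefer in (mostly single-turn) chats, not raw capability.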
And Gemma Scope? That's some cool, cool stuff too. Think about peeking inside the "brain" of the model - you can actually SEE how Gemma 2 processes information. Remember Anthropic Mechinterp? It's like that, giving us unprecedented transparency into how these systems actually "think". You gotta see it on Neuronpedia. Neuronpedia link
It's Meta versus Google - round one, FIGHT!
Distilling Knowledge: Arcee AI Drops DistillKit!
Just when I thought the week was done throwing surprises, Arcee AI casually dropped DistillKit - an open source tool to build distilled language models. Now, this is some NEXT level stuff, folks. We talked with Lukas Atkins and Fernando Neto (the brilliant minds behind DistillKit), and I finally learned what the heck "distillation" really means.
"TLDR - we teach a smaller model to think like a bigger model"
In a nutshell: teach a smaller model how to think like a larger one. Think GPT-4o and GPT-4o mini, where the smaller model supposedly got the "essence" of the bigger version. Or imagine a tiny Llama that inherited the smarts of 405B - ridiculous! 🤯 As Fernando eloquently put it:
"So in the finetuning that we have been doing, just in terms of generating text instructions and so on, we were observing only the token that was generated from the teacher model. And now with the distillation, we are observing the whole distribution of the tokens that could be sampled" - Fernando Neto
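Here's my attempt to put Fernando's point into a few lines of Python. This is a minimal sketch of the general idea (hard-label loss vs. matching the teacher's full distribution with KL divergence), NOT DistillKit's actual code. The 4-token "vocabulary", the logits, the temperature and all the function names are invented for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def hard_label_loss(student_logits, teacher_token):
    """Regular finetuning view: cross-entropy against the ONE token the teacher emitted."""
    probs = softmax(student_logits)
    return -math.log(probs[teacher_token])


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Distillation view: KL(teacher || student) over the WHOLE token distribution."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


# Hypothetical logits over a toy 4-token vocabulary.
teacher_logits = [2.0, 1.5, -1.0, -3.0]
student_logits = [0.5, 0.2, 0.1, 0.0]

# The single token the teacher would emit under greedy decoding.
teacher_token = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)

if __name__ == "__main__":
    print("teacher distribution:", [round(p, 3) for p in softmax(teacher_logits)])
    print("hard-label loss:", round(hard_label_loss(student_logits, teacher_token), 4))
    print("distillation loss:", round(distillation_loss(student_logits, teacher_logits), 4))
```

Notice that the hard-label loss only ever sees token 0, so the student never learns that token 1 was nearly as good a choice. The distillation loss keeps that "second-best" information, which is the extra signal Fernando is describing.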
Now I admit, even after Fernando's expert breakdown, my brain still kind of melted. BUT, here's why this matters: distilled models are super efficient, saving on cost and resources. Imagine powerful AI that runs seamlessly on your phone! 🤯 Arcee is making this possible for everyone.
Was it pure coincidence they released this on the same week as the Llama 3.1 LICENSE CHANGE (Zuckerberg is clearly watching ThursdAI...), which makes distillation perfectly legal?
It's wild, exciting, AND I predict a massive surge in smaller, specialized AI tools that inherit the intelligence of the big boys.
This Week's Buzz
Did I already tell you that someone came up to me and said, hey, you're from Weights & Biases, you are the guys who make the courses right? I said, well yeah, we have a bunch of free courses on wandb.courses but we also have a world leading ML experiment tracking software and an LLM observability toolkit among other things. It was really funny he thought we're just a courses company!
Well this last week, my incredible colleague Agata, who's in charge of our courses, took the initiative and stitched together a course about Weave from a bunch of videos that I already had recorded! It's awesome, please check it out if you're interested to learn about Weave
P.S - we are also starting a series of AI events in our SF office called AI Salons, the first one is going to feature Shreya Shankar, and focus on evaluations, it's on August 15th, so if you're in SF, you're invited for free as a ThursdAI subscriber! Get free tickets
Big Co AI - LLMs & APIs
Not only was open source popping off, but those walled-garden mega corps wanted in on the action too! SearchGPT, anyone?
From Whispers to Reality: OpenAI Alpha Tests GPT-4o Voice (and IT'S WILD)
This was THE moment I waited for, folks - GPT-4o with ADVANCED VOICE is finally trickling out to alpha users. Did I get access? NO. But my new friend, Cristiano Giardina, DID and you've probably seen his viral videos of this tech - they're blowing up MY feed, even Sam Altman retweeted the above one! I said on the show, this new voice "feels like a big next unlock for AI"
What sets this apart from the "regular" GPT-4 voice we have now? As Cristiano told us:
"the biggest difference is that the emotion, and the speech is very real and it follows instructions regarding emotion very well, like you can ask it to speak in a more animated way, you can ask it to be angry, sad, and it really does a good job of doing that."
We did a LIVE DEMO (it worked, thank God), and y'all... I got CHILLS. We heard counting with a breath, depressed Soviet narrators, even a "GET TO THE CHOPPA" Schwarzenegger moment that still makes me laugh. It feels like a completely different level of interaction, something genuinely conversational and even emotional. Check out Cristiano's profile for more insane demos - you won't be disappointed. Follow Cristiano Here For Amazing Voice Mode Videos
Can't wait for access, if anyone from OpenAI is reading this, hook me up! I'll trade my SearchGPT access!
SearchGPT: OpenAI Throws Their Hat Into The Ring (again?)
Did OpenAI want to remind everyone they're STILL here amidst the Llama/Mistral frenzy? Maybe that's why they released SearchGPT - their newest "search engine that can hold a conversation" tool. Again, waitlisted, but unlike with voice mode... I got access.
The good: Fast. Really fast. And impressively competent, considering it's still a demo. Handles complex queries well, and its "follow-up" ability blows even Perplexity out of the water (which is impressive).
The less-good: Still feels early, especially for multi-language and super local stuff. Honestly, feels more like a sneak peek of an upcoming ChatGPT integration than a standalone competitor to Google.
But either way, it's an interesting development - as you may have already learned from my full breakdown of SearchGPT vs. Perplexity
Apple Intelligence is here! (sort of)
And speaking of big companies, how could I not mention the Apple Intelligence release this week? Apple finally dropped the iOS 18.1 beta with the long-awaited ON-DEVICE intelligence, powered by the Apple Foundation Model (AFM). Privacy nerds rejoice!
But how good is it? Mixed bag, I'd say. It's there, and definitely usable for summarization, rewriting tools, text suggestions... but Siri STILL isn't hooked up to it yet, tho speech to text is way faster and she does look more beautiful. Apple did release a ridiculously detailed paper explaining how they trained these models (on Google TPUs, per the paper)... and as Nisten (ever the voice of honesty) said on the show,
"It looks like they've stacked a lot of the tricks that had been working ... overall, they're not actually really doing anything new ... the important thing here is how they apply it all as a system that has access to all your personal data."
Yeah, ouch, BUT still exciting, especially as we get closer to truly personal, on-device AI experiences. Right now, it's less about revolutionary advancements, and more about how Apple can weave this into our lives seamlessly - they're focusing heavily on app intents, meaning AI that can actually DO things for you (think scheduling appointments, drafting emails, finding that photo buried in your library). I'll keep testing this, and the more I play around the more I find out. For example, it suddenly started suggesting replies in Messages for me, though I haven't yet seen the filtered notifications view where it smartly only lets important messages get through your focus mode.
So stay tuned but it's likely not worth the beta iOS upgrade if you're not a dev or a very strong enthusiast.
Wait, MORE Breaking News?? The AI World Doesn't Sleep!
If this episode wasn't already enough... the very day of the live show, as we're chatting, I get bombarded with breaking news alerts from my ever-vigilant listeners.
1. Gemini 1.5 Pro 0801 - Now #1 on LMsys Arena! 🤯 Google apparently loves to ship big right AFTER I finish recording ThursdAI (this happened last week too!). Gemini's new version, released WHILE we were talking about older Gemini versions, claimed the top spot with an insane 1300 ELO score - crushing GPT-4 and taking home 1st place in Math, Instruction Following, and Hard Prompts! It's experimental, it's up on Google AI Studio... Go play with it! (and then tag me with your crazy findings!)
And you know what? Some of this blog was drafted by this new model. In fact, I had the same prompt sent to Claude Sonnet 3.5 and Mistral Large v2 (I also tried Llama 3.1 405B but couldn't find any services that host a full context window), and this Gemini just absolutely demolished all of them on tone, on imitating my style, it even took some of the links from my TL;DR and incorporated them into the draft on its own! I've never seen any other model do that! I haven't used any LLMs so far for this blog besides proof-reading because, well, they all kinda sucked, but damn, I dare you to try and find out where in this blog it was me and where it was Gemini.
2. GitHub Does a Hugging Face: Introducing GitHub Models!
This dropped just as we wrapped - basically a built-in marketplace where you can try, test, and deploy various models right within GitHub! They've already got Llama, Mistral, and some Azure-hosted GPT-4o stuff - very intriguing... Time will tell what Microsoft is cooking here, but you can bet I'll be investigating! 🕵️
AI Art & Diffusion
New Stability: Black Forest Labs and FLUX.1 Rise!
Talk about a comeback story: 14 EX Stability AI pros, led by Robin Rombach, Andreas Blattmann & Patrick Esser (the OG creators of Stable Diffusion), have $31 million in funding from a16z and are back to make diffusion dreams come true. Enter Black Forest Labs. Their first gift? FLUX.1 - a suite of text-to-image models so good, they're breaking records. I saw those demos and wow. It's good, like REALLY good. 🤯
And the real bomb? They're working on open-source TEXT-TO-VIDEO! That's right, imagine generating those mind-blowing moving visuals... with code anyone can access. It's in their "Up Next" section, so watch that space - it's about to get REAL interesting.
Also... Midjourney 6.1 came out, and it looks GOOD
And you can see a comparison between the two new leading models in this thread by Noah Hein
Tools & Hardware: When AI Gets Real (And Maybe Too Real...)
You knew I had to close this madness out with some Hardware, because hardware means that we actually are interacting with these incredible models in a meaningful way.
Friend.com: When Your AI Is... Always Listening? 🤨
And then this happened... Avi Schiffmann (finally) announces friend.com with an amazingly dystopian promo video from Sandwich Video. ~22 million views and counting, not by accident! Link to Video.
It's basically an always-on, listening pendant. "A little like wearing a wire" as Nisten so eloquently put it. Not for memory extension or productivity... for friendship. Target audience? Lonely people who want a device that captures and understands their entire lives, but in an almost comforting way (or maybe unsettling, depending on your viewpoint!). The debate about privacy is already RAGING... But as Nisten pointed out:
"Overall, it is a positive. ...The entire privacy talk and data ownership, I think that's a very important conversation to have".
I kinda get the vision. Combine THIS tech with GPT-4o Voice speed... you could actually have engaging conversations, 24/7! 🤯 I don't think it's as simple as "this is dystopian, end of story". Character AI is EXPLODING right now, remember those usage stats, over 20 million users and counting? The potential to help with loneliness is real...
The Developer Corner: Tools for Those Hacking This Future
Gotta love these shoutouts:
* FastHTML from Jeremy Howard: Not strictly AI, but if you hate JS and love Python, this one's for you - insanely FAST web dev using a mind-bending new syntax. FastHTML website link
* Hamel Husain's Applied LLM Course - All Videos NOW Free!: Want to learn from some of the best minds in the field (including Jeremy Howard, Shreya Shankar evaluation QUEEN, Charles Frye and tons of other great speakers)? This course covers it all - from finetuning to RAG building to optimizing your prompts. Applied LLMs course - free videos link
AND ALSO ... Nisten blew everyone's minds again in the end! Remember last week, we thought it'd take time before anyone could run Llama 3.1 405B on just CPU? Well, this crazy genius already cracked the code - seven tokens per second on a normal CPU! 🤯 If you're a researcher who hates using cloud GPUs (or wants to use ALL THOSE CORES in your Lambda machine, wink wink)... get ready.
Look, I'm not going to sit here and pretend that weeks are not getting crazier. It takes me longer and longer to prep for the show, it really is harder and harder to contain the show to 2 hours, and we had 3 breaking news stories just today!
So we're accelerating, and I'll likely be using a bit of support from AI, but only if it's good, and only if it's proof read by me, so please let me know if you smell slop! I really wanna know!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.