
Voice AI’s Big Moment: Why Everything Is Changing Now (ft. Neil Zeghidour, Gradium AI)
Voice used to be AI’s forgotten modality — awkward, slow, and fragile. Now it’s everywhere. In this reference episode on all things Voice AI, Matt Turck sits down with Neil Zeghidour, a top AI researcher and CEO of Gradium AI (ex-DeepMind/Google, Meta, Kyutai), to cover voice agents, speech-to-speech models, full-duplex conversation, on-device voice, and voice cloning.
We unpack what actually changed under the hood — why voice is finally starting to feel natural, and why it may become the default interface for a new generation of AI assistants and devices.
Neil breaks down today’s dominant “cascaded” voice stack — speech recognition into a text model, then text-to-speech back out — and why it’s popular: it’s modular and easy to customize. But he argues it has two key downsides: chaining models adds latency, and forcing everything through text strips out paralinguistic signals like tone, stress, and emotion. The next wave, he suggests, is combining cascade-like flexibility with the more natural feel of speech-to-speech and full-duplex conversation.
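To make the cascade trade-off concrete, here is a minimal Python sketch under stated assumptions. The three stage functions are stand-ins rather than real models, the `Utterance` type and its `tone` field are hypothetical, and the latency figures are made-up numbers chosen for illustration. It shows only the two points raised above: stage latencies add up, and anything not carried in the transcript never reaches the later stages.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    words: str  # what was said
    tone: str   # paralinguistic signal, e.g. "sarcastic" (illustrative only)


# Hypothetical per-stage latencies in milliseconds (made-up numbers).
STAGE_LATENCY_MS = {"asr": 150, "llm": 300, "tts": 120}


def asr(utt: Utterance) -> str:
    """Speech recognition stand-in: keeps the words, drops the tone."""
    return utt.words


def llm(text: str) -> str:
    """Text model stand-in: only ever sees the transcript."""
    return f"You said: {text}"


def tts(text: str) -> Utterance:
    """Text-to-speech stand-in: has no access to the user's original tone."""
    return Utterance(words=text, tone="neutral")


def cascade(utt: Utterance) -> tuple[Utterance, int]:
    """Run the stages in sequence; the latency is the sum of all stages."""
    reply = tts(llm(asr(utt)))
    total_ms = sum(STAGE_LATENCY_MS.values())
    return reply, total_ms


reply, total_ms = cascade(Utterance("great, another meeting", tone="sarcastic"))
print(reply.words, "|", reply.tone, "|", total_ms, "ms")
```

In this toy version the sarcasm in the input is gone by the time the reply is synthesized, because the text model only received the words. A speech-to-speech model would, in principle, avoid that bottleneck by not passing through text at all.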
We go deep on full-duplex interaction (ending awkward turn-taking), the hardest unsolved problems (noisy real-world environments and multi-speaker chaos), and the realities of deploying voice at scale — including why models must be compact and when on-device voice is the right approach.
Finally, we tackle voice cloning: where it’s genuinely useful, what it means for deepfakes and privacy, and why watermarking isn’t a silver bullet.
If you care about voice agents, real-time AI, and the next generation of human-computer interaction, this is the episode to bookmark.
Neil Zeghidour
LinkedIn - https://www.linkedin.com/in/neil-zeghidour-a838aaa7/
X/Twitter - https://x.com/neilzegh
Gradium
Website - https://gradium.ai
X/Twitter - https://x.com/GradiumAI
Matt Turck (Managing Director)
Blog - https://mattturck.com
LinkedIn - https://www.linkedin.com/in/turck/
X/Twitter - https://twitter.com/mattturck
FirstMark
Website - https://firstmark.com
X/Twitter - https://twitter.com/FirstMarkCap
(00:00) Intro
(01:21) Voice AI’s big moment — and why we’re still early
(03:34) Why voice lagged behind text/image/video
(06:06) The convergence era: transformers for every modality
(07:40) Beyond Her: always-on assistants, wake words, voice-first devices
(11:01) Voice vs text: where voice fits (even for coding)
(12:56) Neil’s origin story: from finance to machine learning
(18:35) Neural codecs (SoundStream): compression as the unlock
(22:30) Kyutai: open research, small elite teams, moving fast
(31:32) Why big labs haven’t “won” voice AI
(34:01) On-device voice: where it works, why compact models matter
(36:37) The last mile: real-world robustness, pronunciation, uptime
(41:35) Benchmarking voice: why metrics fail, how they actually test
(47:03) Cascades vs speech-to-speech: trade-offs + what’s next
(54:05) Hardest frontier: noisy rooms, factories, multi-speaker chaos
(1:00:50) New languages + dialects: what transfers, what doesn’t
(1:02:54) Hardware & compute: why voice isn’t a 10,000-GPU game
(1:07:27) What data do you need to train voice models?
(1:09:02) Deepfakes + privacy: why watermarking isn’t a solution
(1:12:30) Voice + vision: multimodality, screen awareness, video+audio
(1:14:43) Voice cloning vs voice design: where the market goes
(1:16:32) Paris/Europe AI: talent density, underdog energy, what’s next