The Paleo Cyborg Podcast

Caio Temer Episode 13 part 1

Caio Temer explains LLMs and machine learning.

From the opening exchange with ChatGPT: "I found that generalized learning happens after 10k epochs with a considerable amount of data per epoch. Are there studies in this regard?"

ChatGPT: "Yes, there are studies related to generalized learning after many epochs with a large amount of data per epoch. A notable example is the Deep Bootstrap Framework, presented at ICLR 2021. This framework draws a connection between generalization and online optimization. It compares a 'real world', where a model is trained on a finite number of samples over multiple passes, with an 'ideal world', where each training step uses fresh samples from a virtually infinite dataset. The research found that models that train quickly on infinite data are the same ones that generalize well on finite data. This study offers a new perspective on practical design choices and provides a theoretical path toward understanding generalization."

Caio Temer show notes, part one summary.

Interesting topics for trading:
- how I created the first bots
- how models combine language models with autonomous robots
- why conventional methods don't work

About language models:
- language models are simpler to obtain with other structures
- the problem in the current structure and how I solved it
- development of intuitive neural networks using hash abstraction

Caio developed a way for neural networks, through cryptographic techniques based on sums, to learn to abstract and understand deeply and quickly. He created a library for a tiny language model that learns to predict all tokens at absurd speed by using distance-vector embeddings instead of tokens.

Transformers: https://arxiv.org/abs/2205.13504

Training notes from the conversation (illustrative sketches of some of these appear after the links below):
- for example, if the network is exploding, either a component is wrong or the update rate is wrong; if the rate is right, the nonlinearities are wrong: swap a more linear activation like ReLU for tanh, or add a tanh
- a softmax in deep layers, used as a projector, reduces convergence time by forcing layer alignment
- training the transformer one layer at a time, adding a copy of the first layer and freezing the first, accelerates training and improves retention
- for sequence-to-sequence tasks, a permuted linear dense layer solves 99% of problems in less time, and better, than the transformer
- giant embeddings work like entire networks, needing only one activation at the end after reshaping
- "1 tanh layer of 1 million units solves everything better than any other network with the same number of parameters, at least in the tasks I use; I always start with it, then see how to reduce"

Colabs: making a language model using Facebook's FastText
https://colab.research.google.com/drive/1wVQrpzyY-SkCZTRZCcP6xJGRdg1ik0jR#scrollTo=pIW6-VwMuRlz
https://colab.research.google.com/drive/1oABIZr1xiIu7DKc7AbbZfeBJFLkZW6Ep#scrollTo=XfR-3PpPYoFU
https://github.com/rtkclouds/fast-js-language-model
https://x.com/canalCCore2/status/1735044356535472278?s=20

GitHub gists:
https://gist.github.com/rtkclouds/50b81d10736793f07cdca354516e8757
https://gist.github.com/rtkclouds/a6ee9afd96461ca94b3e9c22f78bda3a
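
Below are a few illustrative sketches of the training notes above; none of them are Caio's actual code. First, one way to read the softmax-as-projector remark, assuming a plain PyTorch MLP; the class name and layer sizes are invented for the example.

```python
import torch
import torch.nn as nn

class SoftmaxProjectedMLP(nn.Module):
    """Toy MLP that applies a softmax after the deeper layers as a 'projector'."""
    def __init__(self, dim: int = 128, depth: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == 0:
                x = torch.tanh(x)             # ordinary nonlinearity on the first layer
            else:
                # softmax over the feature axis keeps deep outputs normalized,
                # which is one reading of "forcing layer alignment"
                x = torch.softmax(x, dim=-1)
        return x
```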
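The layer-at-a-time transformer recipe could look roughly like this: keep a list of encoder layers, and when growing, deep-copy the first layer and freeze the original so only the fresh copy keeps training. GrowingEncoder and its hyperparameters are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn

class GrowingEncoder(nn.Module):
    """Transformer encoder stack that grows one layer at a time."""
    def __init__(self, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)]
        )

    def grow(self) -> None:
        new_layer = copy.deepcopy(self.layers[0])    # fresh, still-trainable copy
        for p in self.layers[0].parameters():
            p.requires_grad = False                  # freeze the already-trained layer
        self.layers.append(new_layer)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        for layer in self.layers:
            x = layer(x)
        return x
```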
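For the permuted-linear-dense point, in the spirit of the linked arXiv:2205.13504 line of work, a sketch might permute the tensor so that a single Linear layer mixes the time axis; the module name is hypothetical.

```python
import torch
import torch.nn as nn

class PermutedLinear(nn.Module):
    """Sequence-to-sequence map built from one Linear applied across time."""
    def __init__(self, seq_in: int, seq_out: int):
        super().__init__()
        self.mix_time = nn.Linear(seq_in, seq_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_in, features) -> move time to the last axis
        x = x.permute(0, 2, 1)
        x = self.mix_time(x)            # one dense layer mixes the time steps
        return x.permute(0, 2, 1)       # back to (batch, seq_out, features)

# usage example: map 96 input steps of 7 features to 24 output steps
out = PermutedLinear(seq_in=96, seq_out=24)(torch.randn(8, 96, 7))
```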
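The "one wide tanh layer" quote, together with the ReLU-to-tanh tip for exploding activations, could be tried with a baseline like this; the function name and the exact width are placeholders (a million tanh units is memory-hungry, so scale to what fits).

```python
import torch.nn as nn

def wide_tanh_baseline(in_dim: int, out_dim: int, hidden: int = 1_000_000) -> nn.Sequential:
    """Single hidden layer of tanh units, used as a first baseline to beat.
    tanh is bounded, which also helps when ReLU activations are exploding."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.Tanh(),                      # the only nonlinearity in the network
        nn.Linear(hidden, out_dim),
    )
```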
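Loosely related to the FastText Colabs listed above, a toy next-token guesser over FastText vectors might look like the sketch below. It uses gensim's FastText implementation rather than Facebook's own library, and the corpus and function are invented for the example; it is not the notebooks' actual approach.

```python
from gensim.models import FastText

# tiny toy corpus; the real Colabs train on far more text
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]

ft = FastText(sentences=corpus, vector_size=32, window=2, min_count=1, epochs=50)

def guess_next(context):
    """Rank candidate next tokens by closeness to the mean context vector."""
    avg = sum(ft.wv[w] for w in context) / len(context)
    return ft.wv.similar_by_vector(avg, topn=3)

print(guess_next(["cat", "sat", "on"]))
```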
