The video linked below shows how to train a conversational model from scratch.
3/13/23 - I could not finish the video because some of the concepts were advanced. I need to read up on them or watch his earlier videos covering the foundational elements of AI.
https://www.youtube.com/watch?v=kCc8FmEb1nY
Tokenizing
Converting the raw text into numbers
- character-based
- word-based
- Needs a set of known characters or words to build a vocabulary
- Each entry in the vocabulary is assigned a number
- example - English characters could use 26 numbers (one per letter of the alphabet), whereas a word-based vocabulary built from a dictionary could need a million numbers (one per word)
- Some libraries can be used
- Google - SentencePiece
- OpenAI - tiktoken (this is what GPT uses)
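
To make the character-based option concrete, here is a minimal sketch in plain Python. The sample text and the names stoi, itos, encode and decode are my own choices for illustration, not necessarily what the video uses. It builds the vocabulary from whatever characters appear in the text and assigns each one a number.

```python
# Character-level tokenizer sketch (sample text is an assumption for illustration)
text = "hello world"

# Vocabulary: every distinct character in the text, in sorted order
chars = sorted(set(text))

# Lookup tables: character -> number and number -> character
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Turn a string into a list of integers, one per character."""
    return [stoi[c] for c in s]

def decode(ids):
    """Turn a list of integers back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)
print(decode(ids))
```

A word-based tokenizer would follow the same idea with words in place of characters, which is why its vocabulary is so much larger. Note that this sketch fails on any character that was not in the original text.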
Training a block
- It is common to train a model on chunks of the text rather than on all of it at once. The size of each chunk is the block size. Training teaches the model which word usually comes after the previous ones, or which character comes next after the previous ones.
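
As I understand it, one chunk of block size N yields N training examples, because each position predicts the next token from everything before it in the chunk. A small sketch in plain Python, where the numbers in data are made-up stand-ins for token numbers and block_size = 4 is an arbitrary choice:

```python
# Stand-in for tokenized text (these values are invented for illustration)
data = list(range(10, 20))

block_size = 4  # arbitrary choice for this sketch

# Take one extra token so the last input position still has a target
chunk = data[:block_size + 1]
x = chunk[:block_size]  # inputs
y = chunk[1:]           # targets: the same tokens shifted by one

# Each prefix of x is a context, and the matching entry of y is what comes next
pairs = [(x[:t + 1], y[t]) for t in range(block_size)]

for context, target in pairs:
    print(f"when the input is {context}, the target is {target}")
```

So the model sees contexts of every length from 1 up to the block size, not just full-length ones.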
Tensor