This is an illustration based on Llama 3.1's scaling laws, which apply to their data mix on their dense architecture. LLMs at this next scale will use a different data mix and a different architecture, which will most likely reduce the optimal token count. They can also use synthetic data and multiple epochs, though; GPT-4 used 2-4 epochs.
They may also invest more of the compute budget in FLOPs per forward/backward pass relative to the number of training tokens.
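To make the illustration concrete, here is a minimal sketch of how a Chinchilla-style power-law fit maps a compute budget to an optimal token count and an implied model size. The coefficients A and alpha below are assumed for illustration only, calibrated so that a ~3.8e25 FLOP budget lands near the ~15.6T tokens Llama 3.1 405B was trained on; they are not the fitted values reported in the Llama 3 paper, and C ≈ 6·N·D is the usual training-FLOPs approximation.

```python
# Minimal sketch of a Chinchilla-style compute-optimal fit, D*(C) = A * C**alpha.
# A and alpha are ASSUMED illustrative values, calibrated so a ~3.8e25 FLOP
# budget gives roughly the ~15.6T tokens Llama 3.1 405B was trained on; they
# are not the coefficients reported in the Llama 3 paper.

def optimal_tokens(compute_flops: float, A: float = 0.43, alpha: float = 0.53) -> float:
    """Compute-optimal number of training tokens predicted by the power-law fit."""
    return A * compute_flops ** alpha

def implied_params(compute_flops: float, tokens: float) -> float:
    """Model size implied by the standard C ~= 6 * N * D training-FLOPs rule."""
    return compute_flops / (6 * tokens)

if __name__ == "__main__":
    # 1x, 10x, and 100x a Llama-3.1-405B-scale budget
    for budget in (3.8e25, 3.8e26, 3.8e27):
        d = optimal_tokens(budget)
        n = implied_params(budget, d)
        print(f"C = {budget:.1e} FLOPs -> ~{d/1e12:.0f}T tokens, ~{n/1e9:.0f}B params")
```

Under a dense-architecture fit like this, a ~100x budget lands in the 150-200T-token range, which is roughly where the figures discussed below come from; a different data mix, architecture, or fit would shift A and alpha and hence the optimum.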
Very intriguing. Much obliged for the updates.
Really interesting. Thanks for these updates.
150 Trillion training tokens?
Who has that?
Sorry, 140 trillion. After cleaning/post-processing, text tops out at around 20 trillion tokens; multimodal (audio, video, image) will be 5-10 trillion max.
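For context, here is a quick back-of-the-envelope check of the gap those figures imply, using the rough estimates quoted above (not measured corpus sizes):

```python
# Rough check of the data gap implied by the figures above; all inputs are the
# thread's back-of-the-envelope estimates, not measured corpus sizes.
text_tokens = 20e12                     # ~20T cleaned/post-processed text tokens
multimodal_tokens = (5e12, 10e12)       # ~5-10T from audio/video/image
required = 140e12                       # compute-optimal budget discussed above

for mm in multimodal_tokens:
    unique = text_tokens + mm
    print(f"{unique/1e12:.0f}T unique tokens -> ~{required/unique:.1f} epochs "
          f"(or equivalent synthetic data) to reach {required/1e12:.0f}T")
```

That is roughly 5-6 passes over all unique data, which is why multiple epochs and synthetic data come up above.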