4 Comments

Really interesting. Thanks for these updates


150 trillion training tokens?

Who has that?


Sorry, 140 trillion. After cleaning and post-processing, text tops out at around 20 trillion tokens. Including multimodal data (audio, video, image) adds 5-10 trillion at most.
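
Taking those estimates at face value, the gap versus the illustrative token count looks roughly like the following (a back-of-envelope sketch based on the numbers in this thread, not figures from the post):

```python
# Rough arithmetic using the estimates above (all figures approximate).
tokens_needed = 140e12            # training tokens in the post's illustration
tokens_available = 20e12 + 10e12  # ~20T text plus up to ~10T multimodal
print(f"gap: {tokens_needed / tokens_available:.1f}x")  # ~4.7x unique data
```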

Author:

This is an illustration based on Llama 3.1's scaling laws, which apply to their data mix and their dense architecture. LLMs at this next scale will use a different data mix and a different architecture, which will most likely reduce the optimal token count. They can also use synthetic data and multiple epochs, though; GPT-4 used 2-4 epochs.

They may also invest more compute into FLOPs per forward/backward pass relative to tokens.
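
To make that trade-off concrete, here is a minimal sketch. It uses the generic C ≈ 6·N·D approximation rather than Llama 3.1's actual fitted scaling law, and the compute budget, tokens-per-parameter ratio, and unique-data pool below are illustrative assumptions, not figures from the post.

```python
# Back-of-envelope compute allocation, assuming the standard C ≈ 6*N*D
# approximation (not Llama 3.1's fitted law). All constants are assumptions
# chosen for illustration.
import math

def allocate(compute_flops: float, tokens_per_param: float):
    """Split a FLOP budget between parameters N and training tokens D,
    given a target D/N ratio, using C = 6*N*D with D = r*N."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

C = 3e27   # hypothetical next-scale compute budget in FLOPs (assumption)
r = 40     # tokens per parameter (illustrative assumption)
N, D = allocate(C, r)

unique_tokens = 30e12  # ~20T text + up to ~10T multimodal, per the comment above
print(f"params ≈ {N:.2e}, tokens ≈ {D:.2e}")          # ~3.5e12 params, ~1.4e14 tokens
print(f"epochs over unique data ≈ {D / unique_tokens:.1f}")  # ~4.7 epochs
```

Lowering r (more FLOPs per forward/backward pass relative to tokens) shifts the same budget toward a larger N and a smaller D, which is one way to close the gap between the optimal token count and the unique data actually available.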
