TAI #110: Llama 3.1's scaling laws vs. 100k+ H100 clusters?
Also Mistral Large 2 123B, AlphaProof Olympiad, SearchGPT, GPT-4o 64k output, SAM 2
What happened this week in AI by Louie
The new Llama 3.1 model series was groundbreaking for many reasons: the first open-source LLM to match SOTA, big gains in the smaller 8B and 70B models, a detailed technical paper teaching LLM training, and a large group of GPU cloud partners offering various quantizations for affordable API access. This is a huge free gift to 8 billion people that likely cost $60m+ for the final model training runs alone! It is also important for model distillation as an industry (using a larger model to teach and improve a smaller model), with a license that permits training and distillation from the 405B model. We expect distillation will join RAG and fine-tuning in the LLM builder toolkit. Not to be outdone, Mistral also delivered a very strong new 123B dense model, outperforming Llama 3.1 on general capability relative to parameters. However, model details here were more limited, and it has a less open license. Interestingly, both of these were dense models, whereas closed AI labs are generally assumed to now be using Mixture of Experts (MoE) architectures. These approaches have different trade-offs in training and inference efficiency, which can vary depending on batch size and compute setup (particularly memory vs. FLOPs).
One thing that jumped out from the Llama 3.1 paper was the detailed disclosure of their scaling laws. LLM capability improvement comes broadly from five sources: 1) an increased training compute budget (more GPUs/TPUs, better GPUs, or longer training runs, which can be spent on more parameters, more training data, or more FLOPs per forward/backward pass); 2) increased utilization of this training compute (higher MFU, less downtime); 3) higher-quality training data; 4) more compute-efficient algorithms (e.g., MoE, new attention mechanisms); and 5) better mid-training/post-training performance unlock and enhancement (e.g., RLHF, instruction tuning, Monte Carlo tree search).
Meta unveiled the measured capability vs. training compute scaling laws for its Llama 3.1 model and data recipe (this varies just the compute lever above, with levers 2-5 kept constant). Meta measured negative log-likelihood (NLL) loss relative to training compute for next-token prediction of the correct answer on a benchmark. This measures how well the model performs at next-token prediction, which so far has correlated very well with the model's ability to assist with a huge range of human tasks.
Source: Meta Llama 3.1 paper.
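For intuition, here is a minimal sketch (not Meta's actual evaluation code) of how the NLL of a correct benchmark answer can be measured with an open model: score the answer's tokens under the model and average their negative log-probabilities. The checkpoint name, prompt, and answer below are illustrative placeholders.

```python
# Minimal NLL-of-the-correct-answer sketch; checkpoint, prompt, and answer are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-3.1-8B"  # assumed checkpoint name, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Question: What is the capital of France?\nAnswer:"
answer = " Paris"

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

with torch.no_grad():
    logits = model(input_ids).logits  # (1, seq_len, vocab_size)

# Each answer token at position i is predicted from the logits at position i-1.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
answer_positions = list(range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1))
token_log_probs = log_probs[0, answer_positions].gather(
    1, answer_ids[0].unsqueeze(1)
).squeeze(1)

print(f"NLL per answer token: {-token_log_probs.mean().item():.3f}")
```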
These scaling laws are perhaps the single most important graphs in AI right now, and I think they are the key driver of Nvidia's >$100bn run rate of GPU sales, Elon Musk's planned ~$15 billion training cluster next year, and the potential $100 billion data center planned by OpenAI and Microsoft for 2028. The advancements seen with Llama 3.1 demonstrate this trend vividly. A ~24x increase in compute for 3.1 405B compared to the 3.0 8B version led to huge jumps in utility - for example, a leap in MMLU+ from 36.2 to 61.6 and in LiveBench from 27.6 to 54.3. (Note: Meta didn't use the parameter-optimal model size for the 8B compute budget, as it optimized for inference cost.) So far, progress in next-token prediction has been remarkably consistent over many orders of magnitude of training compute (~6 orders of magnitude in Meta's measurement); what happens if we extrapolate this exponential just a little bit further with the 10x or 100x training compute that is in the near-term pipeline?
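As a toy illustration of what that extrapolation looks like mechanically, here is a sketch that fits a power law with an irreducible-loss term to some (compute, loss) points and projects it forward. The data points and constants below are made-up placeholders, not Meta's measurements.

```python
# Fit loss = a * C^(-b) + c to illustrative data and extrapolate; all numbers are made up.
import numpy as np
from scipy.optimize import curve_fit

compute_flops = np.array([1e21, 1e22, 1e23, 1e24, 1e25])  # training compute, illustrative
nll_loss = np.array([1.10, 0.95, 0.83, 0.74, 0.67])       # measured NLL, illustrative

def power_law(c, a, b, irreducible):
    return a * c ** (-b) + irreducible

params, _ = curve_fit(power_law, compute_flops, nll_loss, p0=[10.0, 0.05, 0.5], maxfev=20000)

for multiplier in (10, 100):
    projected = power_law(compute_flops[-1] * multiplier, *params)
    print(f"{multiplier}x more compute -> projected NLL ~{projected:.3f}")
```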
Looking ahead over the next 12 months, I'd guess GPT-5/next is training on somewhere from 50k to 100k H100s, and Elon has stated Grok-3 is now starting training on 100k H100s (32k already operating), with the aim of releasing the model by December. This would be on the order of 5-6x the compute budget of Llama 3.1 405B. Elon also stated he plans a 300k Nvidia B200 cluster for next summer, which could be ~7x again, or ~40x in total. This excludes gains from more efficient algorithms, data, and training. OpenAI is also reportedly leasing a 100k GB200 (~220k B200s) cluster from Oracle next year (via Microsoft) for $2.5bn per year. Plans for in-house single clusters from Google, Microsoft, Meta, and AWS are unclear but likely at a similar scale. It is interesting to note that a parameter-optimal dense model trained on 300k B200s for ~100 days - with Llama 3.1's data mix and MFU - would have ~2.5trn parameters and ~140trn training tokens. We estimate this would be a ~$15bn capex, ~$2bn (100-day) training cost, ~600MW model. However, it is highly unlikely to use the same architecture, and it will likely bring in multimodal data, which will lead to different scaling laws.
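For transparency on where a figure like "~2.5trn parameters and ~140trn tokens" comes from, here is the rough back-of-envelope arithmetic. Every input below (per-GPU throughput, MFU, tokens-per-parameter ratio) is an assumption for illustration, not a disclosed number.

```python
# Back-of-envelope training-compute arithmetic; all inputs are assumptions.
num_gpus = 300_000                 # hypothetical B200 cluster size
days = 100
flops_per_gpu = 2.25e15            # assumed dense BF16 throughput per B200 (FLOP/s)
mfu = 0.40                         # assumed model FLOPs utilization

total_flops = num_gpus * flops_per_gpu * mfu * days * 24 * 3600
print(f"Total effective training compute: {total_flops:.2e} FLOPs")

# Standard approximation: training FLOPs ~ 6 * parameters * tokens.
tokens_per_param = 55              # assumed compute-optimal tokens-per-parameter ratio
params = (total_flops / (6 * tokens_per_param)) ** 0.5
tokens = tokens_per_param * params
print(f"~{params / 1e12:.1f}T parameters trained on ~{tokens / 1e12:.0f}T tokens")
```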
The assumption of continued progress in LLM capabilities comes with many caveats: limitations of current transformer-based architectures could become apparent, such as a fundamental inability to develop true reasoning capabilities or the exhaustion of available useful training data. However, there are also several pathways for overcoming these obstacles, including improved data filtering, synthetic training data, improved and hybrid architectures, mid/post-training reasoning enhancements, and more sophisticated training objectives. To some extent, I think current methods of next-token prediction on internet data actively train against learning stronger reasoning. Humans very often don't write down their full reasoning steps online, so LLMs are penalized for adding in these intermediary steps and are instead forced to jump straight to guessing the answer. Similarly, if an LLM is trained on scientific papers without the underlying datasets, it doesn't have the full information the scientists needed to reach their conclusions. I'm hopeful some of this can be fixed with synthetic data and self-play, where LLMs help create more thorough descriptions of the required reasoning steps to be used in training.
Why should you care?
The key question is: if and when does the compute scaling law exponential start to look like a sigmoid? And if this doesn't happen very soon, what does that mean for LLM capability and impact on the economy? On specific benchmarks, I expect we will saturate performance, and from one perspective, this can look like slowing progress; however, it can also suggest we just underestimated how challenging we need to make our benchmarks! In more general terms, there is still a long way to go in minimizing LLM NLL loss on the internet.
What exactly would another Llama 3.1 405B vs. Llama 3.0 8B scale leap in intelligence look like in 6-18 months? LLMs are establishing pricing at the equivalent of ~$0.001-$0.01 per hour of human-equivalent work. For example, at a human reading speed of ~300 words a minute, reading would cost ~$0.003 per hour with GPT-4o-mini, while human translators working at ~300 words per hour would cost ~$0.003 per hour with GPT-4o. At the moment, LLM capabilities are limited, and getting to an economically useful level of reliability very often needs a lot of extra human work preparing data, writing detailed prompt instructions, and building RAG, fine-tuning, in-context learning, and agent pipelines - together with diligently checking the results. But what happens if foundation models improve, your existing LLM pipelines port over and still multiply reliability, and you can now hire unlimited useful assistants for the equivalent of $0.01 per hour of human work? As with today, I expect finding your best use cases for LLMs will require deeply understanding their strengths and limitations, together with a lot of imagination and work crafting instructions and datasets.
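The reading-cost figure above is just simple token arithmetic; here is a minimal sketch of it, with an approximate price and tokens-per-word ratio that are assumptions and will drift as pricing changes.

```python
# Cost per hour of human-equivalent work; prices and ratios are approximate assumptions.
def cost_per_human_hour(words_per_hour, price_per_million_tokens, tokens_per_word=1.33):
    tokens = words_per_hour * tokens_per_word
    return tokens * price_per_million_tokens / 1e6

# Reading at ~300 words/minute, priced at an assumed ~$0.15 per million GPT-4o-mini input tokens.
reading_cost = cost_per_human_hour(words_per_hour=300 * 60, price_per_million_tokens=0.15)
print(f"Reading: ~${reading_cost:.4f} per human-hour of work")  # on the order of $0.003
```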
I think it is important to start to consider and plan for further LLM capabilities; however, I see huge economically useful potential even with the LLMs of today. It just takes time to educate people on how to use them safely and effectively and to build custom pipelines that reach sufficient levels of reliability (which isn't 100%, as humans very often make errors too!). I also think the LLM developer stack of today - and much of the work preparing data and enhancing LLM performance on specific niche tasks or for specific companies and industries - is likely to transfer over to improving next-generation models. Those using the models of today are best placed to take advantage of the models of the future!
— Louie Peters — Towards AI Co-founder and CEO
This issue is brought to you by… us!
Building LLMs for Production is currently at 30% off
Take advantage of the current deal offered by Amazon (depending on location) to get our recent book "Building LLMs for Production" at 30% off right now!
Here’s a quote from Jerry Liu, Co-founder and CEO of LlamaIndex describing the book:
“This is the most comprehensive textbook to date on building LLM applications - all essential topics in an AI Engineer's toolkit.”
P.S. If you already have it, please consider leaving a review! If you do, reach out to our co-founder Louis (louis@towardsai.net) with a screenshot, and he'll provide you with free alpha access to our upcoming courses!
Hottest News
1. Meta is releasing Llama 3.1, the largest-ever open-source AI model (128k context window), which the company claims outperforms GPT-4o and Anthropic’s Claude 3.5 Sonnet on several benchmarks. These multilingual models are a collection of pretrained and instruction-tuned generative models in 8B, 70B, and 405B sizes (text in/text out).
2. Mistral released a new flagship 123B model, Large 2, which it claims to be on par with the latest models from OpenAI and Meta in terms of code generation, mathematics, and reasoning. Mistral Large 2 has a 128k context window and supports dozens of languages and 80+ coding languages, including Python, Java, C, C++, JavaScript, and Bash. Weights for the instruct model are available and are also hosted on HuggingFace.
3. Google DeepMind’s latest models, AlphaProof and AlphaGeometry 2, solved four out of six problems from this year’s International Mathematical Olympiad (IMO), achieving a score equivalent to that of a silver medalist. AlphaProof, a reinforcement-learning-based system for formal math reasoning, and AlphaGeometry 2, an improved geometry-solving system, solved two algebra problems, one number theory problem, and one geometry problem.
4. OpenAI announced SearchGPT, an AI-powered search engine with real-time access to information across the Internet. SearchGPT is just a “prototype” for now. The service is powered by the GPT-4 family of models and will only be accessible to 10,000 test users at launch.
5. The ClaudeBot web crawler that Anthropic uses to scrape training data for AI models like Claude hammered iFixit’s website almost a million times in 24 hours, seemingly violating the repair company’s Terms of Use. Anthropic responded by saying that website owners need to specifically block its crawler, ClaudeBot.
6. OpenAI has launched an alpha version of GPT-4o with 64K output tokens per request.
This is now the longest-output model available, and we expect it to be useful for tasks such as language translation, codebase conversion, and format and style conversion of text. The model is slightly more expensive than the base GPT-4o model.
7. Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images
Meta introduced SAM 2, an advanced model for promptable object segmentation in both images and videos, offering real-time performance and state-of-the-art accuracy. The model is open-sourced under the Apache 2.0 license, accompanied by the extensive SA-V dataset featuring 51,000 videos and over 600,000 spatio-temporal masks. SAM 2 can generalize across unseen objects and domains, enabling applications in video editing, scientific research, and autonomous systems.
Five 5-minute reads/videos to keep you learning
This blog dives into some of the latest AI developments, such as Llama 3.1 405B, GPT-4o, and Claude 3.5 Sonnet, from major players like Meta, OpenAI, and Anthropic. This step-by-step guide covers what Llama 3.1 405B is, how to use it locally, and why it is better than GPT-4o and Claude 3.5 Sonnet.
Claude AI is a powerful LLM that is more focused on providing responses that are accurate, ethical, and correct. Though it may not offer as many features as its competitor ChatGPT, it specializes in performing specific tasks and responsible generative AI model development. This article explores Claude AI and its unique features, as well as how it differs from the most popular generative AI tool, ChatGPT.
LLMs like GPT-4 can be used as scalable, cost-efficient evaluators of other models using the LLM-as-a-Judge methodology. Various publications have analyzed this approach, highlighting best practices for its implementation and outlining important sources of bias we should be aware of. This article examines many of these publications and builds a deep, practical understanding of LLM evaluations (see the minimal judge sketch after this list).
The article outlines three types of AI application startups: AI Copilots, which boost productivity by assisting with primary tasks; AI Colleagues, which independently execute tasks to improve operational efficiency; and AI Native Services, highly automated businesses that deliver complete services and compete with conventional companies by offering high-quality, lower-cost alternatives.
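Following up on the LLM-as-a-Judge read above, here is a minimal illustrative sketch of the pattern using the OpenAI Python client. The judge prompt, rubric, scale, and model name are our own illustrative choices, not the article's.

```python
# Minimal LLM-as-a-Judge sketch; prompt wording, rubric, and model choice are illustrative.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_answer(question: str, answer: str, model: str = "gpt-4o") -> str:
    rubric = (
        "You are an impartial judge. Rate the answer to the question on a 1-5 scale "
        "for correctness and helpfulness. Reply with the score followed by one sentence "
        "of justification."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=0,  # keep the judgment as deterministic as possible
    )
    return response.choices[0].message.content

print(judge_answer("What causes tides?", "Mainly the gravitational pull of the Moon and Sun."))
```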
Repositories & Tools
The Llama Agentic System repository allows you to run the Llama 3.1 model for tasks requiring complex reasoning and tool use.
freeCodeCamp is a full-stack web development and machine learning curriculum.
Odyssey is a framework that empowers LLM-based agents with open-world skills to explore the vast Minecraft world.
Athina AI manages LLMs in production by detecting and fixing hallucinations while managing prompts.
Dioptra is a software test platform for assessing the trustworthy characteristics of AI.
Top Papers of The Week
This study explores how using synthetic data to train AI models can result in nonsensical outcomes and a quick decline in model performance. This phenomenon, known as "model collapse," occurs when AI systems are trained on data generated by other AI models, which magnifies errors with each successive generation. Researchers found that language models trained in this recursive manner tend to repeat phrases and struggle to produce diverse and accurate content. While it is an interesting paper, we think this study uses far-from-realistic conditions. It uses OPT-125m - an old and very low-performing model relative to current LLMs - and the model is used to simply regenerate and replace a dataset by completion, without any prompting, so you'd expect it to very quickly lose key information from the initial dataset! With strong LLMs, together with human direction, selection, or correction, AI-assisted content will not lose information to nearly the same degree.
This paper introduces MINT-1T, the most extensive open-source Multimodal INTerleaved dataset. It comprises one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets and previously untapped sources such as PDFs and ArXiv papers.
Diffree is a text-guided model that autonomously integrates new objects into images based on textual descriptions, eliminating the need for manual placement and ensuring visual and contextual coherence.
This paper presents MAIA, a Multimodal Automated Interpretability Agent. It uses neural models to automate neural model understanding tasks like feature interpretation and failure mode discovery. It equips a pre-trained vision-language model with tools that support iterative experimentation on subcomponents of other models to explain their behavior.
LLMs are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. This paper introduces SGLang, a system for executing complex language model programs. It consists of a frontend language to simplify programming with primitives and runtime to accelerate execution with novel optimizations.
Quick Links
1. AI video generator Runway secretly trained its model by scraping thousands of videos from popular YouTube creators and brands. A former employee of Runway leaked a company spreadsheet allegedly showing Runway’s plans to categorize, tag, and train on YouTube channels.
2. X (formerly Twitter) is using your X posts to train its Grok AI model. The platform is under pressure from data regulators after it emerged that users consent to their posts being used to build AI systems via a default setting on the app.
3. Apple signed the White House’s voluntary commitment to AI safety. Apple joins 15 other technology companies — including Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI — that committed to the White House’s ground rules for developing generative AI in July 2023.
4. Harvey partners with Voyage to build custom legal embeddings. Harvey has partnered with Voyage AI to develop custom embeddings for legal applications, reducing the retrieval of irrelevant legal texts by 25% compared to standard models.
5. A step towards making heart health screening accessible for billions with PPG signals. Google Research's Health AI team is developing a method using photoplethysmograph (PPG) signals collected from smartphones to accurately assess cardiovascular disease risk, making heart health screenings more accessible globally.
6. Why could OpenAI lose $5 billion this year? The Information reported that OpenAI is on track to lose $5bn this year, with $4bn spent on inference, $3.5bn on training, and $1.5bn on staff. While we are skeptical of all the assumptions, OpenAI is certainly making big bets on the next generation of models!
Who’s Hiring in AI
Front End Engineer, Comms Design @OpenAI (San Francisco, CA, USA)
Data Scientist- LLM/NLP @Simetrik (Remote)
Sr. Generative AI & ML Specialist Solutions Architect @Amazon (Irvine, CA, USA)
AI Platform Software Engineer @AlphaSense (New York, NY, USA)
Data Analyst - Graphs @CloudWalk (Brazil/Remote)
Machine Learning Engineer (LLM / AI) @Techie Talent (Remote)
Machine Learning Engineer (2-4 years) @Docsumo (Remote)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.