#125: Training Compute Scaling Saturating As Orion, Gemini 2.0, Grok 3, and Llama 4 Approach?
What happened this week in AI by Louie
This week, the potential plateauing of LLM training scaling laws has been a focus of debate in the AI community. The Information reported that OpenAI’s scaling of LLM training compute appears to be hitting a plateau, with its latest model, Orion, delivering more incremental gains over GPT-4 than hoped. Reports of this slowing trend are not isolated to OpenAI. Google DeepMind, for instance, is expected to launch Gemini 2.0 in December, but reports have suggested internal disappointment with its improvements. Similarly, we recently discussed Anthropic’s delayed release of Claude 3.5 Opus, though CEO Dario Amodei has now confirmed they still plan to release it. Meanwhile, Meta’s Llama 4 and xAI’s Grok 3 are currently training on clusters of more than 100,000 H100 GPUs, with Grok 3 expected as soon as late 2024. Despite these investments, the anticipated performance gains across models may be smaller than the leaps seen with previous generations, raising broader questions about the limits of traditional training compute “scaling laws.”
In May, OpenAI CEO Sam Altman expressed high hopes for Orion, the company’s upcoming flagship model, predicting it would be significantly more advanced than GPT-4. At the time, Orion’s training was reportedly only 20% complete, yet it was already performing on par with GPT-4 on some tasks. As training progressed, however, Orion’s improvements have been more incremental, especially compared to the leap between GPT-3 and GPT-4, leading some within OpenAI to temper their expectations. As testing continues, OpenAI employees who have worked with Orion report that, while it shows notable progress on certain language tasks, its performance is inconsistent, particularly on more structured tasks like coding and complex problem-solving. For some applications, Orion’s capabilities don’t clearly surpass GPT-4’s. These mixed results have raised questions about whether Orion’s enhancements are enough to justify its increased operational costs.
OpenAI has yet to finish Orion’s final safety evaluations. The model is expected to be publicly released early next year, with hints that it may depart from the traditional “GPT” branding to reflect its new direction. It is also possible that Orion will be integrated with OpenAI’s new o1 reasoning model family to achieve further performance gains.
Why should you care?
Exactly what bottlenecks or dead ends will get in the way of continuing to improve LLM capabilities is a key factor in how quickly they will significantly transform the global economy and potentially even achieve AGI. While diminishing returns are natural to some extent (particularly after saturating many of the easier capabilities and benchmark tasks), LLMs still have a long way to go to match human performance on many tasks. We actually think current LLM capabilities are already enough for a huge global impact, but foundation LLMs need to be customized to specific tasks and companies to achieve the reliability and productivity gains needed for widespread adoption. This is currently bottlenecked by LLM Developer talent (we think many millions of LLM Developers will be needed, and we are trying to solve this with our Towards AI Academy), employees’ non-technical LLM education, and the time it takes to test and iterate on these advanced LLM pipelines. Nevertheless, progress in foundation model capabilities can open up new use cases, and we would be very disappointed if progress stopped here! However, we don’t think this is likely.
Despite recent press reports of disappointing gains from larger training compute budgets, Sam Altman and Dario Amodei both remain very optimistic in public statements (albeit with an incentive, given fundraising needs!). Sam Altman, for example, said in a recent Reddit AMA that he thinks AGI is “achievable with current hardware.” Dario Amodei, meanwhile, thinks “Powerful AI” will be achieved in 2026 or 2027. Recent newsflow from leading cloud providers lining up nuclear power for the energy needs of next-generation training clusters also contradicts the saturating-returns narrative. Nevertheless, we think there have likely been disappointing results from training runs this year as companies have scaled to 50k+ H100 GPU clusters. Most likely, this is due to a bottleneck in diverse training data after saturating the data that is easily scrapable from the internet. New data (real or synthetic) and new architectures may be needed to make the most of larger clusters. Training compute is not the only path to progress, however: we think huge progress has been made this year in both inference cost reduction and the underappreciated new inference compute scaling paradigm.
Foundation LLM capability improvement comes broadly from six sources:
1. Increased training compute budget (from more GPUs/TPUs, better GPUs, or longer training runs, which can be spent on more parameters, more training data, or more FLOPs per forward/backward pass).
2. Increased utilization of this training compute (higher Model FLOPs Utilization, less downtime).
3. Higher-quality training data.
4. More training-compute-efficient algorithms (e.g., MoE, new attention mechanisms).
5. Better mid-training/post-training performance unlocks and enhancements (e.g., RLHF, instruction tuning, Monte Carlo tree search).
6. More recently, inference or test-time compute scaling (increasing “thinking” time/tokens to solve harder problems).
We think inference compute scaling is currently the most effective path to progress, so we would not be surprised to see the focus shift here from scaling training compute. We also think there is still a lot of room to experiment with new or modified model architectures. However, we think larger and larger training budgets will still be justified in parallel, given that even incremental gains relative to other techniques can still unlock huge economic value.
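To make the test-time compute idea in point 6 concrete, below is a minimal best-of-N / self-consistency sketch: sample several reasoning chains and majority-vote on the final answer. It assumes the OpenAI Python client; the model name, prompt, and answer-extraction heuristic are placeholders for illustration, not a description of how o1 works internally.

```python
# Minimal self-consistency sketch: spend more inference compute by sampling
# N candidate reasoning chains and majority-voting on the final answer.
# Model name and prompt format are placeholders for illustration only.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def self_consistent_answer(question: str, n: int = 8) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        n=n,                  # more samples = more inference compute
        temperature=0.8,
        messages=[
            {"role": "system", "content": "Reason step by step, then give the final answer after 'ANSWER:'."},
            {"role": "user", "content": question},
        ],
    )
    # Extract the final answer from each sampled chain and take a majority vote.
    answers = [
        (choice.message.content or "").rsplit("ANSWER:", 1)[-1].strip()
        for choice in completion.choices
    ]
    return Counter(answers).most_common(1)[0][0]

print(self_consistent_answer("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```

Accuracy on harder problems typically improves as n grows, which is exactly the trade of extra inference compute for capability that this new scaling paradigm exploits.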
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. Google DeepMind Open-Sources AlphaFold 3
Google DeepMind has unexpectedly released the source code and model weights of AlphaFold 3 for academic use, a significant step that could accelerate scientific discovery and drug development. AlphaFold 3 opens new frontiers in protein modeling by predicting molecular interactions, not just individual protein structures.
2. ChatGPT Added 50 Million Weekly Users in Just Two Months
OpenAI revealed that over 250 million people worldwide use ChatGPT weekly. That’s a sharp rise since late August when OpenAI said the chatbot had 200 million weekly users — double the number it had last November. As of June, 350 million people used OpenAI’s tools each month.
3. Meta Is Using More Than 100,000 Nvidia H100 AI GPUs To Train Llama-4
Meta is utilizing over 100,000 Nvidia H100 AI GPUs to develop Llama 4, an advanced AI model with improved modalities and reasoning capabilities. Despite the significant power demands, Meta plans to release Llama models for free to encourage broader development and application.
4. Gemini Is Now Accessible From the OpenAI Library
Google is now offering an OpenAI API-compatible endpoint for Gemini, making Gemini accessible via the OpenAI library, enabling developers to easily switch to it. The inclusion means developers won’t need to overhaul their existing code or pipelines.
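In practice, switching an existing OpenAI-client codebase over to Gemini should mostly come down to changing the API key and base URL. Here is a rough sketch; the base URL and model name are taken from Google’s announcement and should be treated as assumptions that may change.

```python
# Sketch: calling Gemini through the OpenAI Python library by pointing the
# client at Google's OpenAI-compatible endpoint. The base URL and model name
# are assumptions based on Google's announcement and may change.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GEMINI_API_KEY",  # a Google AI Studio key, not an OpenAI key
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

response = client.chat.completions.create(
    model="gemini-1.5-flash",
    messages=[{"role": "user", "content": "Summarize scaling laws in one sentence."}],
)
print(response.choices[0].message.content)
```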
5. Claude 3.5 Sonnet Can Now View Images within PDFs
The new Claude 3.5 Sonnet model can now parse images in PDF input, enabling it to understand both textual and visual content within documents. This enhancement marks a substantial leap forward, allowing the AI to handle a broader range of information from PDFs, including textual explanations, images, charts, and graphs, within documents that span up to 100 pages.
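As a rough sketch of what this looks like from the API side (the beta flag, model name, and “document” content-block format below are assumptions based on Anthropic’s PDF support beta at the time of writing and may change):

```python
# Sketch: sending a PDF containing text, charts, and images to Claude 3.5 Sonnet.
# The beta flag and "document" content block are assumptions based on Anthropic's
# PDF support beta and may change; check the current docs before relying on them.
import base64
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

with open("report.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["pdfs-2024-09-25"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64}},
            {"type": "text", "text": "Summarize the charts and figures in this PDF."},
        ],
    }],
)
print(message.content[0].text)
```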
6. Introducing FLUX1.1 [Pro] Ultra and Raw Modes
Black Forest Labs has enhanced FLUX1.1 [pro] with Ultra and Raw modes, offering 4MP image resolution in just 10 seconds, 2.5x faster than competitors, at $0.06 per image. The Raw mode improves authenticity and diversity, especially in human and nature photography, and is accessible via the company’s API for high-quality, rapid image generation.
7. OpenAI in Talks With Regulators To Become a For-Profit Company
OpenAI, valued at $157 billion, is in early talks with California and Delaware regulators to shift from a nonprofit to a for-profit entity to attract investors and address the valuation challenges of its AI models. It plans to retain a nonprofit arm post-restructuring.
8. Introducing the First AMD 1B Language Models: AMD OLMo
AMD has introduced AMD OLMo, a series of 1 billion parameter language models trained on 1.3 trillion tokens using AMD Instinct MI250 GPUs. These open-sourced models excel in reasoning and instruction-following, outperforming similar-sized models in general reasoning and chat benchmarks.
9. Amazon May Up Its Investment in Anthropic — on One Condition
According to The Information, Amazon is in talks to invest multiple billions of dollars in Anthropic, its first new financial commitment to the company since the $4 billion deal struck last year. The new investment is structured like the last one, but with the condition that Anthropic uses Amazon-developed silicon hosted on Amazon Web Services to train its AI.
Five 5-minute reads/videos to keep you learning
1. DSPy: Machine Learning Attitude Towards LLM Prompting
In this article, the author aims to showcase complex technologies through nontrivial use cases, focusing on the DSPy framework. It explains what DSPy is and focuses on implementing an LLM-based classifier.
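For a flavor of the framework, here is a minimal, hypothetical DSPy classifier sketch. It assumes a recent DSPy release (the LM-configuration API has changed across versions), and the model name and label set are placeholders rather than the article’s exact example.

```python
# Minimal DSPy classifier sketch: declare a signature, let DSPy build the prompt.
# Assumes a recent DSPy release; model name and labels are illustrative placeholders.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder LM

class TicketTopic(dspy.Signature):
    """Classify a support ticket into one of: billing, bug, feature_request."""
    ticket = dspy.InputField()
    label = dspy.OutputField(desc="one of: billing, bug, feature_request")

classify = dspy.Predict(TicketTopic)
print(classify(ticket="I was charged twice this month.").label)
```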
2. This piece breaks down “Global prediction of extreme floods in ungauged watersheds” by Google AI to look into how they solved the various challenges and what that teaches us about building truly meaningful AI solutions for the future. The article covers three primary ideas: why weather forecasting is so difficult, why LSTMs are good for this, and what this teaches us about AI policy for the future.
3. Your Guide to AI: November 2024
The article discusses key AI developments, including the White House’s National Security Memorandum on AI, geopolitical tensions surrounding AI technology, and corporate AI regulation and hardware strategies. It also highlights industry movements like OpenAI’s valuation rise, Anthropic’s new features, and advancements in AI-driven robotics and autonomous systems.
4. Can AI Understand Our Minds?
This article attempts to explain the current state of machine learning through the latest study by Michal Kosinski titled Evaluating Large Language Models in Theory of Mind Tasks. Building on this, it dives into the theory of mind and its implications for the future of AI and our society.
5. I Just Tested Google vs. ChatGPT Search — and I’m Shocked by the Results
This article compares ChatGPT’s new search feature against Google search, covering categories like speed, accuracy, visuals, and overall user experience. ChatGPT and Google excel in different areas but cater to slightly different needs.
Repositories & Tools
1. Hertz-dev is an open-source, first-of-its-kind base model for full-duplex conversational audio.
2. TableGPT is a pre-built agent for TableGPT2, a series of LLMs for table-based question answering.
3. Gen AI Scripting offers convenient tooling for file ingestion, prompt development, and structured data extraction.
4. Developer Roadmap compiles interactive roadmaps, guides, and other educational content to help developers grow in their careers.
5. OpenRLHF is an easy-to-use RLHF framework built on Ray, DeepSpeed, and HF Transformers.
Top Papers of The Week
1. HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
Much of the structural and semantic information inherent in HTML, such as headings and table structures, is lost during plain-text-based RAG. To solve this problem, this paper proposes HtmlRAG, which uses HTML instead of plain text as the format of retrieved knowledge in RAG. It also proposes HTML cleaning, compression, and pruning strategies to shorten the HTML while minimizing the loss of information.
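The cleaning step can be approximated with standard tooling. The sketch below uses BeautifulSoup to strip scripts, styles, comments, and attributes while keeping structural tags such as headings and tables; it illustrates the general idea, not the paper’s released implementation.

```python
# Simplified HTML cleaning in the spirit of HtmlRAG: keep structural tags such
# as headings and tables, drop scripts, styles, comments, and attributes.
# Illustration only; not the paper's released code or pruning algorithm.
from bs4 import BeautifulSoup, Comment

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "noscript", "iframe"]):
        tag.decompose()                      # remove non-content elements
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()                    # remove HTML comments
    for tag in soup.find_all(True):
        tag.attrs = {}                       # drop class/style/id attributes
    return str(soup)

html = "<div class='x'><h2>Results</h2><script>track()</script><table><tr><td>42</td></tr></table></div>"
print(clean_html(html))  # <div><h2>Results</h2><table><tr><td>42</td></tr></table></div>
```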
2. Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?
This paper addresses the issue of chain-of-thought prompting with noisy rationales in LLMs using the NoRa dataset. It introduces contrastive denoising with noisy chain-of-thought (CD-CoT), which enhances reasoning accuracy by 17.8% by contrasting noisy and clean rationales.
3. BitNet a4.8: 4-bit Activations for 1-bit LLMs
BitNet a4.8 introduces 4-bit activations for 1-bit LLMs, using hybrid quantization and sparsification to minimize errors. It employs 4-bit activations in key layers and 8-bit quantization for intermediate states, achieving performance parity with BitNet b1.58 while offering faster inference and 55% reduced parameter activation. Additionally, it supports a 3-bit KV cache for improved efficiency in large-scale LLM deployment.
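For intuition on what “4-bit activations” means, here is a tiny, generic symmetric INT4 quantize/dequantize sketch in NumPy. It only illustrates the rounding and rescaling involved; BitNet a4.8’s actual hybrid quantization and sparsification scheme is more sophisticated.

```python
# Generic symmetric 4-bit quantization of an activation tensor (per-tensor scale).
# For intuition only; not BitNet a4.8's hybrid quantization/sparsification scheme.
import numpy as np

def quantize_int4(x: np.ndarray):
    scale = np.abs(x).max() / 7.0                       # symmetric int4 range is [-8, 7]
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)            # fake activations
q, s = quantize_int4(x)
print("max abs quantization error:", np.abs(x - dequantize_int4(q, s)).max())
```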
4. Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
Agent K v1.0 is an autonomous data science agent designed to automate and optimize the data science lifecycle through structured reasoning and memory management. In evaluations using Kaggle competitions, it achieved a 92.5% success rate, ranking in the top 38% among 5,856 competitors, and performed at a level comparable to a Kaggle Grandmaster, earning multiple medals. This highlights its effectiveness in handling complex, multimodal data science tasks.
5. Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
This paper presents Hunyuan-Large, Tencent’s open-sourced Mixture-of-Experts (MoE) LLM with 389 billion total parameters and 52 billion activated parameters. The paper details the model’s pre-training and post-training stages, highlighting the data synthesis process and training techniques used to achieve high performance across various benchmarks.
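To see why only 52 of 389 billion parameters run per token, here is a toy top-k Mixture-of-Experts gating sketch in NumPy; it shows the generic routing idea, not Hunyuan-Large’s actual architecture or expert configuration.

```python
# Toy top-k MoE gating: a router scores experts per token, and only the top-k
# experts' weights are "activated." Generic illustration, not Hunyuan-Large's design.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2                               # hidden size, experts, experts per token

experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # expert weight matrices
router = rng.standard_normal((d, n_experts))                       # gating weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router
    top = np.argsort(logits)[-k:]                        # indices of the top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    # Only k of n_experts expert matrices are used for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

print(moe_forward(rng.standard_normal(d)).shape)          # (16,)
```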
Quick Links
1. Scale AI announced Defense Llama. Built on Meta’s Llama 3, the LLM is specifically customized and fine-tuned to support American national security missions. Defense Llama is available exclusively in controlled U.S. government environments within Scale Donovan. It aims to apply the power of generative AI to use cases such as planning military or intelligence operations and understanding adversary vulnerabilities.
2. Microsoft researchers recently unveiled a new multi-agent infrastructure called Magentic-One that allows a single AI model to power various helper agents that work together to complete complex, multi-step tasks in different scenarios. Microsoft calls Magentic-One a generalist agentic system that can “fully realize the long-held vision of agentic systems that can enhance our productivity and transform our lives.”
3. The Beatles’ AI-assisted track ‘Now and Then’ is nominated for two Grammy awards. Though the band broke up more than 50 years ago, Paul McCartney used AI last year to create “the last Beatles record.” He took one of Lennon’s demos from 1978 and used AI to clean up the recording’s poor sound quality.
4. Another of OpenAI’s lead safety researchers, Lilian Weng, announced she is departing the startup. Weng had served as VP of research and safety since August and, before that, was the head of OpenAI’s safety systems team. In a post on X, Weng said, “After 7 years at OpenAI, I feel ready to reset and explore something new.”
5. OpenAI defeats news outlets’ copyright lawsuit over AI training. A New York federal judge dismissed a lawsuit against artificial intelligence giant OpenAI that claimed it misused articles from news outlets Raw Story and AlterNet to train its large language models.
6. OpenAI’s o1 Model Leaked on Friday, and It Is Wild — Here’s What Happened
OpenAI’s upcoming AI model, o1, was accidentally leaked, showcasing advanced capabilities surpassing GPT-4, including comprehensive image and multimedia analysis. The leak occurred due to a URL parameter modification, but OpenAI has since resolved the issue, with an official release anticipated soon.
Who’s Hiring in AI
PhD Intern (f/m/d) — Business AI Research @SAP (Berlin, Germany)
Research Engineer @Anthropic (London, UK)
Staff Software Engineer, Generative AI, Gemini Code Assist @Google (New York, NY, USA)
Applied Machine Learning Engineer — Localization @Apple (Cupertino, California, United States)
Generative AI Engineer @FabFitFun (USA/Remote)
AI Engineer @SmartDev (Hanoi, Vietnam)
AI & GenAI Data Scientist-Senior Associate @PwC (Multiple Locations)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.