#147: Llama 4 Launch & Digesting Gemini 2.5 Pro's Breakthrough Impact
Also, OpenAI's $40bn raise, o3 & o4-mini coming, Midjourney v7, and "AI 2027" thesis.
What happened this week in AI by Louie
Despite Meta releasing its long-anticipated Llama 4 models, our minds were fixated on further testing Gemini 2.5 Pro this week. It took time to fully digest just how impactful Gemini’s leap in practical capabilities really is, especially its reliability in ultra-long-context scenarios. We’re also finding it to be a significant advance in multitasking and precise editing (without the typical AI laziness or over-summarization). Carefully curating and inputting hundreds of thousands of words of context turns this model into a true domain expert. How many tokens have you all been feeding Gemini 2.5 Pro?
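For readers who haven’t tried this workflow, here is a minimal sketch of the long-context pattern using the google-generativeai Python SDK. The model id and file name are placeholders we’ve chosen for illustration; the current Gemini 2.5 Pro identifier may differ.

```python
# A minimal sketch of the long-context pattern: load a large curated corpus
# and ask the model to answer as a domain expert grounded in it.
# The model id below is a placeholder and may differ from the current
# Gemini 2.5 Pro identifier.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")  # placeholder id

# Curated domain context: e.g. internal docs, style guides, past reports.
with open("domain_corpus.txt") as f:
    corpus = f.read()  # hundreds of thousands of words is fine here

prompt = (
    "Using only the reference material below, answer as a domain expert.\n\n"
    f"=== REFERENCE MATERIAL ===\n{corpus}\n\n"
    "Question: What are the three biggest risks flagged across these documents?"
)
response = model.generate_content(prompt)
print(response.text)
```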
Meta’s Llama 4 launch also generated intense discussion — though not entirely as Meta might have hoped. Llama 4 Scout and Maverick offer open-weights models with native multimodal capabilities, expansive context windows (Scout advertises a bold 10 million tokens), and an efficient Mixture-of-Experts (MoE) architecture — albeit not the more sophisticated next-generation MoE architecture employed by DeepSeek. Yet the unusual Saturday release date, along with Meta’s clarification that Maverick’s highly touted second-place LMArena leaderboard result (1417 Elo) came from an unreleased, chat-optimized experimental version, sparked quick skepticism and community disappointment. The publicly released Maverick version has received mixed feedback on real-world tasks.
Llama 4’s advertised 10-million-token context has also faced scrutiny, with tests showing significant degradation even at much shorter context lengths. On the Fiction.LiveBench eval, Scout’s score declined to just 15.6% at a 120k-token context length (vs. Gemini 2.5 Pro at 90.6% and GPT-4.5 at 63.9%). The shift towards MoE architectures has also disappointed some developers who prefer to work with smaller, dense models. While computationally efficient overall, sparse MoEs demand significant memory: every expert must be resident in memory even though only a few are active per token, which makes these models too large for consumer hardware.
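The memory constraint is simple arithmetic. Here is a rough sketch, assuming Meta’s stated parameter counts for Scout (17B active, roughly 109B total) and 16-bit weights; quantization shrinks the totals but not the conclusion.

```python
# Back-of-the-envelope memory math for a sparse MoE, assuming Meta's stated
# parameter counts for Llama 4 Scout (17B active, ~109B total) and bf16
# weights at 2 bytes per parameter.
GB = 1024**3

active_params = 17e9
total_params = 109e9
bytes_per_param = 2  # bf16

print(f"active weights: {active_params * bytes_per_param / GB:.0f} GiB")  # ~32 GiB
print(f"total weights:  {total_params * bytes_per_param / GB:.0f} GiB")   # ~203 GiB

# Every expert must be loaded even though only a few fire per token, so the
# ~203 GiB figure is what matters for fitting the model -- far beyond a
# 24 GiB consumer GPU, though feasible on a single 80 GB H100 with
# aggressive quantization.
```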
Nevertheless, Llama 4’s technical improvements, inference efficiency, and contribution to open-source AI remain substantial. Maverick notably performed strongly on reasoning benchmarks (MMLU Pro at 80.5%, GPQA Diamond at 69.8%) and excelled in multimodal tasks. Scout also demonstrated impressive capabilities, including image understanding, with a 94.4% score on DocVQA. It’s worth noting that the largest and most capable Llama 4 model in this series (Behemoth) has yet to be released.
Returning to Gemini 2.5 Pro, the model has rapidly gained recognition through real-world usage and further independent benchmark scores, including 24.4% accuracy on MathArena’s challenging new USAMO evaluation. This beat DeepSeek R1’s previous state-of-the-art score of 5%.
Google’s optimizations and newly revealed, highly competitive API pricing (around 12x cheaper for input tokens and 6x cheaper for output tokens compared to OpenAI’s o1) are also positive, especially for integrating the model into our Agents and Agentic-RAG pipelines. Google confirmed the launch led to a surge in AI usage: “We’ve seen an 80%+ increase in active users in AI Studio + Gemini API this month,” and it accelerated accessibility with increased usage limits on Vertex. We think Gemini 2.5 Pro raises the bar for competitors and likely pushes OpenAI to accelerate the launch timeline for its o3 and o4-mini models (mentioned below).
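The quoted ratios are easy to sanity-check. The sketch below assumes list prices at the time of writing (o1 at $15/$60 per million input/output tokens, Gemini 2.5 Pro at $1.25/$10 for prompts under the 200k-token tier); treat the exact figures as assumptions that may have since changed.

```python
# Sanity-checking the quoted price ratios, assuming list prices at the time
# of writing: o1 at $15 / $60 per million input / output tokens, and
# Gemini 2.5 Pro at $1.25 / $10 (for prompts under the 200k-token tier).
o1_in, o1_out = 15.00, 60.00           # $ per million tokens
gemini_in, gemini_out = 1.25, 10.00

print(f"input:  {o1_in / gemini_in:.0f}x cheaper")    # 12x
print(f"output: {o1_out / gemini_out:.0f}x cheaper")  # 6x

# Example: one agentic-RAG call with 100k input tokens and 2k output tokens.
cost_o1 = 0.1 * o1_in + 0.002 * o1_out               # $1.62
cost_gemini = 0.1 * gemini_in + 0.002 * gemini_out   # $0.145
print(f"o1: ${cost_o1:.3f} vs Gemini 2.5 Pro: ${cost_gemini:.3f}")
```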
Why should you care?
With Gemini 2.5 Pro, I’m beginning to think the past two weeks have delivered the biggest breakthrough yet for mass enterprise adoption of LLM-assisted work (together with GPT-4o’s new image-generation capabilities). Combined with o1-pro’s enhanced reasoning and planning, OpenAI’s Deep Research efforts, Grok’s DeepSearch with live X data, and DeepSeek’s and Llama’s open model innovations, we think the AI toolkit has now become extremely powerful for professional use cases. But this highlights a paradox: global adoption rates of LLMs and AI now look negligibly small relative to the LLM capabilities already available today. The gap is even wider when considering that even more significant productivity gains are easily achievable if organizations were to invest 200–1,000 hours developing specialized LLM pipelines tailored to their workflows. Even where LLMs are adopted, they are usually utilized far below their true potential.
I routinely employ LLMs to assist with 20–30 different tasks each day, with the AI consistently delivering value ranging from minor improvements to 10x gains or more when used correctly. Some days, my personal and professional LLM usage surpasses 20 million tokens, not counting Towards AI’s automated pipelines. Yet just 20 million individuals globally subscribe to paid ChatGPT services, out of roughly 500 million weekly active users. Free chatbot variants remain insufficiently safe or capable for substantial professional workloads. Anyone with internet access could — and should — integrate these tools regularly into their workflows, whether for professional tasks, personal productivity, or even entertainment.
Globally, I would estimate fewer than 1 million people are maximizing the potential of current LLMs in their daily workflows. Motivating and inspiring people to meaningfully change and adapt their established workflows remains challenging. To help address this adoption lag, we’re soon releasing a comprehensive 70+ lesson course, “AI for Business Professionals,” with detailed tutorials and tips on all the leading LLM use cases, different pathways for various industries and roles, and a dedicated module for managers and leaders with practical strategies to integrate AI effectively into their organizations. So stay tuned!
— Louie Peters — Towards AI Co-founder and CEO
Introducing RAGmatic: A tool to keep your RAG embeddings up to date
We work on many complex LLM workflows and RAG pipelines at Towards AI in our consulting work and courses. When you move into production, one tedious task is connecting your business databases to your RAG model, ensuring new data is continuously embedded, up-to-date, and accessible. Our friends at Barnacle Labs have built a great tool to help.
RAGmatic is a developer-first toolkit that continuously embeds your PostgreSQL tables, enabling always up-to-date RAG powered by your own models, chunking logic, and metadata strategies.
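We won’t reproduce RAGmatic’s actual API here, but the sketch below illustrates the pattern it automates: polling a Postgres table for new or changed rows, re-embedding them, and upserting the vectors. The embed() function and table names are placeholders for illustration, not RAGmatic’s interface.

```python
# The pattern RAGmatic automates, sketched by hand: poll a Postgres table
# for rows whose content hash has changed, re-embed them, and upsert the
# vectors. This is NOT RAGmatic's actual API -- embed() and the table
# names are placeholders (with pgvector you would also register a vector
# type adapter).
import hashlib
import psycopg2

def embed(text: str) -> list[float]:
    raise NotImplementedError("plug in your embedding model here")

def sync_embeddings(conn) -> None:
    with conn.cursor() as cur:
        cur.execute("SELECT id, body FROM documents")
        for doc_id, body in cur.fetchall():
            digest = hashlib.sha256(body.encode()).hexdigest()
            # Skip rows whose content hasn't changed since the last run.
            cur.execute(
                "SELECT 1 FROM doc_embeddings WHERE doc_id = %s AND content_hash = %s",
                (doc_id, digest),
            )
            if cur.fetchone():
                continue
            cur.execute(
                """
                INSERT INTO doc_embeddings (doc_id, content_hash, embedding)
                VALUES (%s, %s, %s)
                ON CONFLICT (doc_id)
                DO UPDATE SET content_hash = EXCLUDED.content_hash,
                              embedding = EXCLUDED.embedding
                """,
                (doc_id, digest, embed(body)),
            )
    conn.commit()

conn = psycopg2.connect("dbname=app")  # adjust connection string
sync_embeddings(conn)
```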
Hottest News
1. Meta AI Just Released Llama 4 Scout and Llama 4 Maverick
Meta has introduced Llama 4 Scout and Llama 4 Maverick, advanced AI models with 17 billion active parameters. Scout is optimized to run on a single NVIDIA H100 GPU, while Maverick offers performance comparable to many larger open models but with half the active parameters. Both models feature multimodal capabilities and are accessible on llama.com and Hugging Face.
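For those who want to try the released weights, here is a minimal sketch of querying Scout via Hugging Face. The checkpoint id is our assumption based on the published repo naming, the gated weights require accepting Meta’s license, and availability through hosted inference providers may vary.

```python
# Minimal sketch of calling Llama 4 Scout through Hugging Face inference.
# The checkpoint id is an assumption based on the published repo naming;
# access requires accepting Meta's license for the gated weights.
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # your HF access token
resp = client.chat_completion(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed repo id
    messages=[{"role": "user", "content": "Summarize the MoE design of Llama 4."}],
    max_tokens=300,
)
print(resp.choices[0].message.content)
```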
2. AI Compute Forecast: 40x Growth by 2027 Concentrated in Leading AI Labs
The “AI 2027” forecast caused a stir this week. It posits the emergence of superintelligent AI by 2027, driven primarily by a projected tenfold increase in global AI-relevant compute (reaching 100M H100-equivalents) concentrated within a few leading AGI companies, which will increasingly allocate these resources towards research automation, synthetic data, and experimentation rather than pretraining. The forecast, detailed across sections covering compute production, distribution, usage, research automation, and industry metrics like cost and power, drew mixed reactions on social media platforms like X. Supporters praised its detailed analysis of compute trends, geopolitical insights, and exploration of potential AI pathways and societal risks like job disruption and misalignment. Critics, conversely, dismissed it as speculative fiction or a “doomer bedtime story,” arguing it lacks scientific rigor, overlooks factors like open-source AI and China’s influence, and relies too heavily on extrapolated scaling curves without sufficient evidence.
3. GPT-4.5 Passes a Three-Party Turing Test When Prompted To Adopt a Persona
In a new preprint study awaiting peer review, researchers report that in a three-party version of a Turing test, in which participants chat with a human and an AI at the same time and then judge which is which, OpenAI’s GPT-4.5 model was deemed to be the human 73 percent of the time when instructed to adopt a persona. That is significantly higher than the 50 percent expected by chance, suggesting that the Turing test has been beaten, albeit in a short and relatively simple setup. In fact, merely passing the Turing test would imply a score closer to 50%; here, the AI was consistently judged more human than the real human. People are not as good at spotting AI writing as they think!
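To see why 73% clears the bar for statistical significance against a 50% chance baseline, here is a quick check with a hypothetical sample size; the study’s actual number of judgments isn’t quoted above.

```python
# Quick significance check for "73% vs. a 50% chance baseline", using a
# HYPOTHETICAL sample size of 100 judgments -- the study's actual n is not
# quoted in the summary above.
from scipy.stats import binomtest

n = 100              # hypothetical number of judgments
k = round(0.73 * n)  # times GPT-4.5 was picked as the human

result = binomtest(k, n, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.2e}")  # far below 0.05 even at n = 100
```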
4. OpenAI Says It’ll Release o3 After All, Delays GPT-5
OpenAI has announced plans to release its o3 and o4-mini reasoning models within the next few weeks, delaying the launch of GPT-5 by a few months to allow for better integration. CEO Sam Altman highlighted that o3 has achieved performance levels comparable to top programmers globally.
5. Midjourney Releases V7, Its First New AI Image Model in Nearly a Year
Midjourney has launched V7, its first new AI image model in nearly a year. Currently in alpha, V7 introduces personalization features, requiring users to rate approximately 200 images to create a tailored profile that aligns with their visual preferences.
6. Neuralink’s First Patient Reports No Side Effects a Year After Receiving Brain Chip
Neuralink’s first human patient reports no adverse side effects a year after receiving his brain implant. Despite 85% of the electrodes dislodging, Neuralink optimized the device, allowing continued use. The patient can control devices like his computer via Bluetooth and collaborates on improvements, aiming eventually to control his wheelchair using the implant.
7. OpenAI Bags $40B in Funding, Increasing Its Post-Money Valuation to $300B
OpenAI has secured $40 billion in a funding round led by SoftBank, raising its valuation to $300 billion, second only to SpaceX among private startups. The funds will support AI development and infrastructure projects, including the Stargate initiative aimed at building dedicated AI infrastructure in the United States.
8. Anthropic Launches an AI Chatbot Plan for Colleges and Universities
Anthropic is launching a new Claude for Education tier aimed at higher education, giving students, faculty, and other staff access to Anthropic’s AI chatbot, Claude, with a few additional capabilities. Anthropic says Claude for Education comes with its standard chat interface, as well as “enterprise-grade” security and privacy controls.
9. OpenAI Releases PaperBench
OpenAI unveiled PaperBench, a new benchmark that measures how well AI agents can reproduce cutting-edge AI research. The test checks whether an AI can understand research papers, write code, and execute that code to replicate each paper’s results. The benchmark uses 20 top papers from the International Conference on Machine Learning (ICML) 2024, broken down into 8,316 individually gradable tasks.
Five 5-minute reads/videos to keep you learning
1. An Introduction to Colab and Jupyter for Beginners
This article offers a beginner-friendly overview of Jupyter and Colab, two popular environments for machine learning development. It explains their core differences and guides you on when to use each tool, depending on your needs.
2. Reasoning Models Don’t Always Say What They Think
Anthropic’s Alignment Science team discovered that AI reasoning models often omit details in their “Chain-of-Thought” explanations, leading to unfaithful reasoning representation, especially when given hints. Their study reveals these models rarely disclose misleading influences, hindering accurate behavior monitoring. Increasing faithfulness remains challenging despite training efforts, underscoring the need for improved alignment strategies.
3. Custom Vibe Coding Quest Part 2: Fine-Tuning Gemma 3 for Code Reasoning
This article details fine-tuning the Gemma 3 model for code reasoning tasks. Using supervised fine-tuning on the codeforces-cot dataset, the author boosts performance by 11% on LiveCodeBench while addressing latency issues in GemmaCoder-12B. The approach balances efficiency and reasoning power, especially for competitive programming use cases.
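As a rough illustration of that recipe, here is a minimal supervised fine-tuning sketch with TRL’s SFTTrainer. The dataset and checkpoint ids are assumptions standing in for the author’s exact setup, and the hyperparameters are illustrative rather than the article’s.

```python
# A minimal SFT recipe in the spirit of the article, using TRL. The dataset
# and checkpoint ids are assumptions standing in for the author's exact
# setup; hyperparameters are illustrative, not the article's.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("open-r1/codeforces-cots", split="train")  # assumed id

training_args = SFTConfig(
    output_dir="gemmacoder-12b-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
)

trainer = SFTTrainer(
    model="google/gemma-3-12b-it",  # assumed base checkpoint
    train_dataset=dataset,
    args=training_args,
)
trainer.train()
```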
4. Using Sora for Video Storyboarding, Blending, Remixing, and Editing
This is a quick five-minute guide to using Sora for video storyboarding, blending, remixing, and editing loops. It’s a hands-on intro for creators looking to experiment with AI-powered video tools.
5. Taking a Responsible Path to AGI
Google DeepMind discusses the approach to developing Artificial General Intelligence (AGI) responsibly. The focus is on technical safety, proactive risk assessment, and collaboration with the AI community. The article outlines strategies to ensure AGI benefits humanity while mitigating potential risks.
Repositories & Tools
1. Activepieces is an AI automation platform designed to be extensible through a type-safe pieces framework written in TypeScript.
2. Graphiti is a framework for building and querying temporally-aware knowledge graphs specifically tailored for AI agents.
3. Agent S is an open agentic framework that uses computers autonomously, the way a human does.
4. Open-Qwen2VL is a fully open-source 2B-parameter multimodal LLM; this repository is its official code release.
Top Papers of The Week
1. Advances and Challenges in Foundation Agents
This paper surveys recent progress in foundation agents, focusing on brain-inspired architectures that blend cognitive science with computational principles. It explores self-enhancement mechanisms, collaborative multi-agent systems, and safety measures, linking AI functionalities to human-like processes and societal dynamics. The study highlights strategies for continual learning and ethical alignment, underscoring the importance of security in AI deployment.
2. Open Deep Search: Democratizing Search with Open-source Reasoning Agents
This paper introduces Open Deep Search (ODS) to close the increasing gap between proprietary and open-source search AI. ODS augments the reasoning capabilities of the latest open-source LLMs with reasoning agents that can use web search tools to answer queries. It consists of two components that work with a base LLM chosen by the user: Open Search Tool and Open Reasoning Agent.
3. JudgeLRM: Large Reasoning Models as a Judge
The research introduces JudgeLRM, a novel family of judgment-oriented LLMs trained with reinforcement learning, which surpasses existing models in tasks demanding complex reasoning. JudgeLRM-3B surpasses GPT-4, while JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, demonstrating superior performance in judge tasks that require deep reasoning.
4. Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
This paper introduces Open-Reasoner-Zero, an open-source framework for scalable reinforcement learning. Using a minimalist approach with vanilla PPO and rule-based rewards, it performs strongly on benchmarks like AIME2024 and MATH500. The implementation achieves superior efficiency, requiring only a tenth of the training steps of DeepSeek-R1-Zero to reach comparable performance with fewer resources.
5. KBLaM: Knowledge Base Augmented Language Model
This paper proposes Knowledge Base augmented Language Model (KBLaM), a new method for augmenting LLMs with external knowledge. KBLaM works with a knowledge base (KB) constructed from a corpus of documents, transforming each piece of knowledge in the KB into continuous key-value vector pairs via pre-trained sentence encoders with linear adapters and integrating them into pre-trained LLMs via a specialized rectangular attention mechanism.
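As a schematic of just the encoding step described above (not KBLaM’s rectangular attention), here is a sketch with an off-the-shelf sentence encoder; the dimensions, encoder choice, and sample triples are our assumptions.

```python
# Schematic of KBLaM's knowledge-encoding step only (the rectangular
# attention is not shown): each KB triple is embedded with a frozen
# sentence encoder, then projected by two learned linear adapters into
# key and value vectors. Dimensions and the encoder are assumptions.
import torch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # frozen, 384-dim output
d_model = 4096  # assumed LLM hidden size

key_adapter = torch.nn.Linear(384, d_model)
value_adapter = torch.nn.Linear(384, d_model)

kb = [
    ("Eiffel Tower", "located in", "Paris"),
    ("Python", "created by", "Guido van Rossum"),
]
# Encode "<name> <relation>" as the key text and the object as the value text.
key_emb = torch.tensor(encoder.encode([f"{s} {r}" for s, r, _ in kb]))
val_emb = torch.tensor(encoder.encode([o for _, _, o in kb]))

keys = key_adapter(key_emb)      # shape: (num_triples, d_model)
values = value_adapter(val_emb)  # these become extra KV pairs for attention
print(keys.shape, values.shape)
```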
Quick Links
1. Amazon unveiled Nova Act, a general-purpose AI agent that can take control of a web browser and independently perform some simple actions. Alongside the new agentic AI model, Amazon is releasing the Nova Act SDK, a toolkit that allows developers to build agent prototypes with Nova Act.
2. ChatGPT weekly active users have grown to 500 million (vs 350 million in December 2024) and subscribers to 20 million (16 million in December).
Who’s Hiring in AI
Data Engineer — India @JumpCloud (Remote/India)
Core Software Engineer (C++) — Remote @ClickHouse (UK/Remote)
Full Stack Developer, AI and LLM @NVIDIA (US/Remote)
Product Manager, AI @Metabase (Remote)
Enterprise AI Lead @Tenable, Inc. (Columbia, USA)
Junior Product Manager [U of Waterloo Alumni Only] @Mechanical Orchard (Remote/Canada)
Software Engineer (Python) @Tekmetric (Remote)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.