TAI #156: Gemini 2.5 Pro Takes Benchmark Lead as OpenAI Hits $10bn Revenue
Also, Apple's "Illusion of Thinking" Paper Sparks Debate, and New Open Models from Qwen & NVIDIA
What happened this week in AI by Louie
The AI industry continues its blistering pace, with major players hitting significant milestones and new tools constantly emerging. This week, OpenAI reportedly hit a $10 billion annualized revenue run rate, now boasting 500 million weekly active users and 3 million paid business subscribers. This financial muscle was flexed alongside the launch of new business-centric features for ChatGPT, notably “Connectors” and “Record Mode,” signalling a clear intent to deeply embed AI into enterprise workflows.
This revenue growth isn’t isolated; Anthropic recently hit a $3bn revenue run rate, and AI coding startup Cursor is reportedly at $0.5bn. While these numbers indicate LLM revenue is indeed taking off, particularly in the last few months, there’s still a vast runway ahead, especially compared to the immense economic value AI can already deliver when used effectively at work.
OpenAI’s new “Connectors” feature aims to make ChatGPT a central operational hub by allowing it to securely link with enterprise applications like Google Drive, GitHub, SharePoint, and more, enabling users to search files, pull live data, and reference content directly within the chat. The “Record Mode” introduces transcription and structured summaries for meetings, including action items and time-stamped citations, positioning ChatGPT as a “second memory.” These moves clearly target the productivity suites of Microsoft 365 and Google Workspace. This release intensifies the “coopetition” between Microsoft and OpenAI, as the two are increasingly stepping on each other’s toes. In a notable strategic shift, Microsoft executives, including new CoreAI head Jay Parikh, are now pushing the company to focus on a “platform, platform, platform” approach, building the tools for others to create agents and moving toward a consumption-based pricing model, a clear pivot from their seat-based Copilot subscriptions.
On the capability front, Google’s Gemini 2.5 Pro received another strong upgrade (the 06–05 preview version), decisively taking the lead on several key LLM benchmarks. Notably, on the Aider polyglot coding benchmark, Gemini 2.5 Pro (preview 06–05 with 32k thinking tokens) scored an impressive 83.1% at a cost of $49.88 per evaluation, surpassing OpenAI’s o3 (high), which scored 79.6% at $111.03. Even Gemini’s default thinking mode achieved 79.1%. It also showed a massive 10-percentage-point jump on SimpleBench, taking first place with a 62.4% score. Google’s blog also highlighted a 24-point Elo jump on LMArena (to 1470) and a 35-point jump on WebDevArena (to 1443) for this latest iteration.
The LLM developer toolkit saw further valuable additions this week. Alibaba’s Qwen team released new open embedding and reranking models (Qwen3 Embedding series), designed for text embedding, retrieval, and reranking tasks, with their 8B embedding model claiming the top spot on the MTEB multilingual leaderboard. Concurrently, NVIDIA introduced Llama Nemotron Nano VL, a new OCR-focused vision language model. This Llama 3.1 8B fine-tune excels at extracting information from complex documents like PDFs, graphs, and charts, delivering strong performance on OCRBench v2 and made available via NVIDIA NIM API and Hugging Face.
Why should you care?
This week perfectly encapsulates the central tension in AI today: the seemingly unstoppable commercial momentum versus the nuanced, often critical, debate about the true nature of AI “intelligence.” On one hand, you have OpenAI hitting a $10bn revenue run rate, a figure driven by tangible, real-world value that millions of users and businesses are willing to pay for. On the other hand, you have publications like Apple’s “The Illusion of Thinking” paper, which are often seized upon by skeptics as proof that these models aren’t actually “thinking” at all.
The Apple paper generated discussion by testing Large Reasoning Models (LRMs) on a series of controlled puzzles, like Tower of Hanoi and River Crossing. It highlighted findings where models excelled at one but failed quickly at another, and eventually failed at all puzzles beyond a critical complexity threshold. It presented this as a fundamental failure of LLM reasoning. However, while the research was interesting for highlighting some specific limitations of AI models, we think the paper’s framing and title were a bit clickbaity, extrapolating from narrow puzzles to a broad critique of all LLM reasoning abilities.
The paper presented models performing 100+ correct steps on the Tower of Hanoi but failing after just four steps on River Crossing as “surprising.” However, as community analysis has pointed out, this isn’t a failure of reasoning. The Tower of Hanoi is a well-known, straightforward algorithm. The challenge isn’t figuring out how to solve it, but flawlessly executing an exponentially long sequence of steps (2^N-1), a task constrained by long-context generation limits and reliability, not reasoning and planning. In fact, for a higher number of disks, models may have failed simply because the required output exceeded their token context window. In contrast, the River Crossing puzzle is a much harder planning problem with a high branching factor, where the models’ failures point to genuine current weaknesses in complex search and spatial awareness.
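The distinction between solving a puzzle and flawlessly writing out its solution is easy to see in code. The sketch below is the textbook recursive Tower of Hanoi algorithm: the reasoning fits in a few lines, but the move sequence it must emit grows as 2^N − 1, which is the length constraint models run into.

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Textbook recursive Tower of Hanoi; returns the full move list."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
        return moves
    hanoi(n - 1, src, dst, aux, moves)  # move n-1 disks out of the way
    moves.append((src, dst))            # move the largest disk
    hanoi(n - 1, aux, src, dst, moves)  # move n-1 disks back on top
    return moves

# The algorithm is trivial to state, but the required output doubles
# with every added disk: always 2**n - 1 moves.
for n in (3, 10, 15):
    print(n, len(hanoi(n)))
```

For 15 disks that is 32,767 moves to transcribe without a single slip, which is a long-generation reliability test rather than a planning test.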
This academic contribution comes as Apple itself continues at a somewhat disappointing pace in rolling out its own significant AI features into its core products. It’s far easier to critique the frontiers of AI research than to ship transformative AI products at scale. There are many more benchmarks already out there testing more general reasoning capabilities, but we find the day-to-day real-world use cases for LLMs — assisting with brainstorming, planning, and completing thousands of practical tasks — to be a more useful measure of their ability than performance on abstract puzzle scenarios.
So, who is right? The market and customers pouring billions into AI, or the skeptics highlighting these perceived failures? The answer lies in the messy middle. OpenAI’s $10bn revenue isn’t built on models flawlessly solving abstract puzzles; it’s built on their immense utility as powerful assistants. The “hype” often over-promises fully autonomous agents, and academic critiques serve as a necessary reality check. However, when those critiques are based on narrow benchmarks or are amplified without nuance, they create an “illusion of failure” that is just as misleading as the hype. The real story is that we are in a phase of incredible capability, but one that demands critical thinking from all of us to understand and work effectively with both the strengths and the specific, well-defined limitations of these powerful new tools. Above all, it still requires humans to combine their own strengths and expertise with the models to solve problems together.
— Louie Peters — Towards AI Co-founder and CEO
Introducing the 10-Hour Video Primer: Building and Operating LLMs
The clearest, fastest path to understanding what actually matters when building with language models.
You’ve seen the advice:
“Just use RAG.” “Fine-tune it.” “Build an agent.” But how do you know what to use — and when?
This full video course gives you the mental map of the LLM landscape in 10 hours: when to prompt, when to fine-tune, when to orchestrate agents, and how to avoid months of dead ends.
You’ll learn to:
✅ Understand how transformers really work
✅ Build end-to-end chains with prompting, RAG, and fine-tuning
✅ Evaluate safely with metrics and human feedback
✅ Orchestrate advanced workflows and better understand agents
✅ Optimize with distillation, quantization, RLHF & more
Access it now for just $199 —
Price goes up soon as we add a new 2-hour deep dive on fine-tuning open models (you’ll get that update free if you join now).
Hottest News
1. Gemini 2.5 Pro Gets an Upgrade
Gemini 2.5 Pro has received a significant upgrade, topping major benchmarks like GPQA, Aider, and LMArena. It shows a 24-point Elo increase on LMArena and 35 points on WebDevArena. The update brings enhanced creativity and formatting capabilities while retaining strong coding performance. Now available through Google AI Studio and Vertex AI, this preview version also introduces “thinking budgets” to help manage cost and latency more effectively.
2. Microsoft’s Next AI Bet: Platforms, Agents, and a Shift in Strategy
Microsoft is shifting its AI strategy to focus less on proprietary models and more on becoming the go-to platform for building AI agents. Under CEO Satya Nadella and new CoreAI head Jay Parikh, the company is betting that the next wave of growth will come from tools that help businesses build their own autonomous agents — software that can perform complex tasks with minimal human input. With OpenAI planning to reduce its dependence on Azure, Microsoft is doubling down on offering cost-efficient models, open protocols, and custom agent infrastructure via Azure. Internally, this has meant reorganization, tensions over team control, and a push to move from seat-based pricing to consumption-based billing. Microsoft’s aim is clear: to lead the coming “agentic” web by becoming the platform every AI developer builds on.
3. Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series
Alibaba’s Qwen team has released the Qwen3-Embedding and Qwen3-Reranker series, designed for high-performance multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series is available in 0.6B, 4B, and 8B parameter sizes. It supports 119 languages, making it one of the most capable and flexible open-source model suites. The models are open-sourced under the Apache 2.0 license and are available on Hugging Face, GitHub, and ModelScope, as well as via Alibaba Cloud APIs.
4. Cursor 1.0 Launches With BugBot and Expanded Background Agent Access
Cursor 1.0 rolls out new features, including BugBot for automated code reviews and expanded access to the Background Agent for all users. The release adds support for Jupyter Notebooks and introduces a beta feature called “Memories,” which retains project-specific conversational context. Users can now deploy MCP servers with a single click and visualize outputs directly in the chat. An updated dashboard provides deeper insights with improved usage analytics.
5. ChatGPT Introduces Meeting Recording and Connectors for Google Drive, Box, and More
OpenAI has introduced new productivity tools for ChatGPT business users, including meeting recording, cloud service integrations, and enhanced MCP connections for advanced research workflows. ChatGPT now supports connectors for Google Drive, Dropbox, Box, SharePoint, and OneDrive, enabling it to retrieve and synthesize information from users’ own data sources in response to queries.
6. Mistral Releases a Vibe Coding Client, Mistral Code
Mistral has announced Mistral Code, an AI-driven “vibe coding” assistant built to rival GitHub Copilot, Windsurf, and Anysphere’s Cursor. The tool combines Mistral’s language models, an integrated IDE assistant, local deployment options, and enterprise-friendly features in a single offering. Currently in private beta, it supports JetBrains IDEs and Microsoft’s VS Code.
7. Reddit Sues Anthropic for Allegedly Not Paying for Training Data
Reddit has filed a lawsuit against Anthropic in a Northern California court, accusing the company of using Reddit content to train AI models without a proper licensing agreement. The complaint claims Anthropic’s actions were both unauthorized and commercial, violating Reddit’s terms of service and intellectual property rights.
Five 5-minute reads/videos to keep you learning
1. MCP: The Golden Key for AI Automation
This blog breaks down the Model Context Protocol (MCP), a standard that lets LLMs call external APIs through structured JSON-RPC requests. Using a simple calculator as an example, it walks through how an LLM reads a tool description and crafts a call. It also dives into secure OAuth2-based authorization, showing how platforms like Zerodha Kite can grant access without exposing credentials, crucial for anyone building agents that need safe, real-world tool use.
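The request shape involved is compact enough to show inline. Below is a minimal sketch of an MCP-style JSON-RPC 2.0 tool call; the `calculator` tool name and its arguments are invented for illustration and don’t correspond to any real server.

```python
import json

# Illustrative MCP-style JSON-RPC 2.0 request. The "calculator" tool
# and its argument schema are hypothetical examples.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "calculator",
        "arguments": {"operation": "add", "a": 2, "b": 3},
    },
}
print(json.dumps(request, indent=2))
```

The LLM never calls the API directly; it emits a structured request like this, and the MCP client executes it and feeds the result back into context.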
2. CodeAgents + Structure: A Better Way to Execute Actions
This article introduces Structured CodeAgents, which enforce JSON-formatted outputs containing both an explicit thoughts field and the actual code to run, combining clear reasoning with executable logic. Benchmarks across GAIA, MATH, SimpleQA, and Frames show this structured approach boosts performance by 2–7 points and reduces error-prone parsing problems.
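A minimal sketch of the idea, assuming a two-field JSON schema (the exact field names in the article may differ): the agent emits one JSON object carrying both its reasoning and the code to run, so parsing is deterministic and the executable part is cleanly separated.

```python
import json

# Hypothetical structured output an agent might emit: an explicit
# "thoughts" field plus the code to execute, in a single JSON object.
raw = json.dumps({
    "thoughts": "The user wants the sum of squares of 0..4.",
    "code": "result = sum(i * i for i in range(5))",
})

parsed = json.loads(raw)           # structured output parses deterministically
namespace = {}
exec(parsed["code"], namespace)    # run the code field in an isolated namespace
print(namespace["result"])         # -> 30
```

Compared with free-form code fenced in markdown, a JSON envelope like this removes a whole class of extraction and parsing errors.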
3. Monitor and Evaluate OpenAI SDK Agents Using Langfuse
This guide walks through monitoring a simple three-agent workflow built with the OpenAI SDK: an input guardrail, an assist agent, and a validation agent. It shows how to capture trace data using OpenTelemetry and visualize it in Langfuse, then digs into programmatic trace analysis with the Langfuse SDK to generate custom evaluation plots.
4. The Essential Guide to Model Evaluation Metrics for Classification
This guide breaks down evaluation metrics for classification tasks, starting with the basics — confusion matrix, accuracy, precision, and recall — and moving into advanced options like ROC AUC, Log Loss, and the Phi Coefficient. Through case studies, it highlights when and why to use each, especially in imbalanced data scenarios, and emphasizes aligning metric choice with project goals.
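The core metrics reduce to simple arithmetic on confusion-matrix counts. The sketch below uses made-up counts for an imbalanced dataset (100 positives, 900 negatives) to show why accuracy alone can mislead:

```python
# Precision, recall, accuracy, and F1 from raw confusion-matrix counts.
# The counts are invented for illustration, skewed 1:9 positive:negative.
tp, fp, fn, tn = 80, 10, 20, 890

precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

Here accuracy is 97% even though a fifth of the positives were missed, which is exactly the imbalanced-data pitfall the guide warns about.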
5. LangChain + Segment Any Text + RAG = The Key to Understanding Your Documents
This article tackles the problem of semantic fragmentation in RAG pipelines caused by standard text splitting. The ContextGem framework’s Segment Any Text (SAT) approach ensures chunks are semantically complete before retrieval. The piece shows how to combine structured data extraction from ContextGem with a standard LangChain + FAISS + OpenAI setup, creating an agent that pulls accurate, context-aware answers from structured and unstructured sources.
Repositories & Tools
1. TensorZero unifies an LLM gateway, observability, optimization, evaluations, and experimentation.
2. DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks.
3. Dolphin is a novel multimodal document image parsing model following an analyze-then-parse paradigm.
4. Cognee builds memory for AI agents.
Top Papers of The Week
1. How Much Do Language Models Memorize?
This paper introduces a rigorous framework to quantify how much information language models memorize about specific datapoints, separating unintended memorization from generalization. Through experiments on synthetic and real datasets, it estimates that GPT-style models store about 3.6 bits per parameter, and shows that memorization capacity defines a phase transition where generalization begins.
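The 3.6 bits-per-parameter figure yields a quick back-of-envelope capacity estimate. The sketch below assumes an 8B-parameter model purely as an illustrative size, not one studied in the paper:

```python
# Back-of-envelope memorization capacity using the paper's ~3.6 bits/parameter
# estimate. The 8B model size is an illustrative assumption.
params = 8e9
bits_per_param = 3.6
capacity_bits = params * bits_per_param
capacity_gb = capacity_bits / 8 / 1e9  # bits -> bytes -> gigabytes
print(f"~{capacity_gb:.1f} GB of raw memorized information")
```

Under this estimate, datasets much larger than a few gigabytes of unique information simply cannot be memorized verbatim by such a model, which is where the generalization phase transition comes in.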
2. LlamaFirewall: An Open Source Guardrail System for Building Secure AI Agents
This paper introduces LlamaFirewall, an open-source security-focused guardrail framework designed to serve as a final layer of defense against security risks associated with AI agents. It mitigates risks such as prompt injection, agent misalignment, and insecure code through PromptGuard 2, a universal jailbreak detector; Agent Alignment Checks, a chain-of-thought auditor; and CodeShield, an online static analysis engine.
3. AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
This paper presents AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. It also adopts a staleness-enhanced PPO variant to better handle outdated training samples.
4. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
To better understand how RLVR training impacts LLM reasoning, this paper presents a new methodology focused on token entropy patterns. It observes that in CoT sequences, only a small fraction of tokens (roughly 20%) display significantly higher entropy. These tokens, labeled “forking tokens,” often correspond to moments where the model must decide between multiple reasoning paths. The remaining 80% of tokens typically exhibit low entropy and are extensions of prior statements. By limiting policy gradient updates solely to these high-entropy tokens, the research team could not only maintain but, in many cases, improve performance on challenging reasoning benchmarks.
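The forking-token idea can be sketched with toy next-token distributions: compute Shannon entropy per position and mask out the low-entropy ones. The distributions and the 0.7-nat cutoff below are invented for illustration; the paper selects the top ~20% by entropy rank rather than a fixed threshold.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy per-position distributions: most positions are near-deterministic
# continuations; a few are genuine "forks" between reasoning paths.
positions = [
    [0.97, 0.02, 0.01],   # forced continuation: low entropy
    [0.40, 0.35, 0.25],   # forking token: high entropy
    [0.99, 0.01],
    [0.50, 0.30, 0.20],   # forking token
    [0.95, 0.03, 0.02],
]

cutoff = 0.7  # illustrative nats threshold for "high entropy"
mask = [entropy(p) > cutoff for p in positions]
print(mask)   # only masked-in positions would get policy-gradient updates
```

In the paper's scheme, gradient updates flow only through the `True` positions, concentrating the RL signal on the decision points.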
5. Large Language Models Often Know When They Are Being Evaluated
Frontier language models exhibit substantial evaluation awareness, accurately classifying transcripts from evaluations and real-world interactions with above-random performance. Gemini-2.5-Pro achieves an AUC of 0.83. However, these models still trail behind humans, who reach an AUC of 0.92.
Quick Links
1. Anthropic has released a new set of AI models for U.S. national security customers. Compared to Anthropic’s consumer- and enterprise-focused models, the new custom Claude Gov models were designed to be applied to government operations like strategic planning, operational support, and intelligence analysis.
2. EleutherAI has released Common Pile v0.1, a dataset it claims is one of the largest licensed and open-domain text collections for training AI. Weighing in at 8 terabytes, Common Pile v0.1 was used to train two new models — Comma v0.1–1T and Comma v0.1–2T — that EleutherAI says perform on par with models trained on unlicensed, copyrighted data.
3. NVIDIA has introduced Llama Nemotron Nano VL, a vision-language model (VLM) designed for document-level understanding. Built on the Llama 3.1 architecture and paired with a lightweight vision encoder, the model targets applications that require accurate parsing of complex formats like scanned forms, financial reports, and technical diagrams.
Who’s Hiring in AI
Senior Software Engineer, Agentic AI @NVIDIA (Multiple US Locations)
Software Engineer, Google Cloud Applications AI @Google (Sunnyvale, CA, USA)
Engineering Intern — Gen AI for FP&A Platform @Drivetrain (Remote)
AI Engineer Specialist @Digibee Inc. (Brazil/Remote)
Advanced AI Research Scientist Associate Manager @Accenture (Multiple US Locations)
Senior Salesforce Developer (AI Focus) @Palo Alto Networks (Bangalore, India)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.