TAI #161: Grok 4's Benchmark Dominance vs. METR's Sobering Reality Check on AI for Code
Also, Kimi K2's open-source push, Windsurf M&A, and Grok's system prompt issues.
What happened this week in AI by Louie
It was a very eventful week in AI, with xAI dominating the headlines for mixed reasons. On one hand, the release of Grok 4 demonstrated an astonishingly rapid catch-up to the frontier in just two years. On the other, more drama with the Grok X chatbot showed the fragility of LLM system prompts and the dangers of letting them run wild, while a new study from METR cast doubt on the real-world productivity gains from AI in coding. This contrast between immense capability on benchmarks and messy, counterintuitive results in the wild highlights the growing gap between what AI can do and how safely and effectively it is being used.
The release of Grok 4 was technically impressive. It now tops most major benchmarks, including setting a new state of the art on the difficult ARC-AGI-2 reasoning task with a 16.0% score (ahead of Claude 4 Opus in second place at 8.6%). It also made a breakthrough on Humanity’s Last Exam (HLE), a benchmark designed to sit at the frontier of human knowledge, scoring 25.4% vs. Gemini 2.5 Pro at 21.6%. The Grok 4 API costs $3 per million input tokens and $15 per million output tokens and comes with a 256k context window (prices double beyond 128k tokens).
The most powerful version, Grok 4 Heavy, is a multi-agent system in which several models tackle a problem in parallel before sharing notes to construct a final response. Grok 4 Heavy achieved 50.7% on the text-only subset of HLE with tool use and maximum inference scaling. This capability comes at a price, however: it is available via a new $300-per-month SuperGrok Heavy tier, following the industry trend of top-tier subscription prices roughly 10x higher than a year ago.
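xAI hasn’t published Grok 4 Heavy’s orchestration details, but the pattern it describes (several agents answering in parallel, then a synthesis pass over their drafts) is easy to sketch. Below is a minimal, hypothetical Python outline; call_model is a placeholder for whichever chat-completion API you use, and the agent count and prompts are illustrative, not xAI’s implementation.

```python
import asyncio

async def call_model(prompt: str, seed: int = 0) -> str:
    # Hypothetical placeholder: swap in a real chat-completion API call here.
    return f"[model response (seed={seed}) to: {prompt[:40]}...]"

async def heavy_style_answer(question: str, n_agents: int = 4) -> str:
    # 1) Several agents tackle the same problem in parallel.
    drafts = await asyncio.gather(
        *[call_model(question, seed=i) for i in range(n_agents)]
    )
    # 2) A final pass "shares notes": it sees every draft and writes one answer.
    notes = "\n\n".join(f"Agent {i} draft:\n{d}" for i, d in enumerate(drafts))
    synthesis_prompt = (
        f"Question: {question}\n\n{notes}\n\n"
        "Compare the drafts, resolve disagreements, and write the best final answer."
    )
    return await call_model(synthesis_prompt)

# Usage: asyncio.run(heavy_style_answer("Prove or disprove ..."))
```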
Grok’s progress stems in part from xAI’s huge GPU cluster and in part from a significant strategic shift at the LLM frontier: xAI noted that it dedicated as much compute to reinforcement learning as to the initial pre-training, a clear signal that we are now in an era where RL is a dominant factor in achieving frontier capabilities.
While Grok’s optimization towards real-world enterprise utility is still developing, I found my first new killer use case unlocked by Grok 4: an extremely efficient signal-to-noise filter for recent developments on niche topics. By leveraging Grok’s access to advanced X search filters, a single prompt can now agentically perform a seven-step research process: 1) search only within specific high-value curated X lists with credible voices, 2) filter by min_retweets (say 50) to surface the most meaningful posts, 3) filter by date (e.g., the last one or seven days), 4) consolidate posts on the same topic to get an overall engagement count per story and rank the top stories, 5) apply an extra reasoning filter to cut stories not relevant to your specific request, 6) pull the entire thread for the remaining posts into context (not just the first tweet), and finally 7) write a detailed summary of these top news events or discussion topics in a format of your choosing. Grok 4 can do all of this in a single 1–2 minute step, and it is an incredibly valuable addition to our daily automated research tasks, from this newsletter to monitoring investment risks and catalysts. We still need API access to the fully agentic Grok 4 (with X access) to automate it, though. Right now, I mainly use a mix of o3, o3-pro, Gemini 2.5 Pro, and Claude Sonnet 4 for 20–30 tasks each day; it looks like Grok 4 will be added to that list!
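For readers who want to try this, the whole workflow fits in one prompt. Here is a hedged example of how such a prompt could be phrased; the list name, engagement threshold, time window, topic filter, and output format are illustrative placeholders, not the exact wording I use.

```python
# Illustrative prompt template for the single-step X research workflow described above.
# The list name, thresholds, topic filter, and output format are placeholders to adapt.
RESEARCH_PROMPT = """
Search X for posts from the last 7 days, restricted to members of my curated
list "frontier-ai-researchers", keeping only posts with at least 50 retweets.

Then:
1. Group posts that cover the same story and sum their engagement.
2. Rank stories by total engagement and keep the top 10.
3. Drop any story not directly about LLM training, evaluation, or releases.
4. For each remaining story, read the full thread, not just the first post.
5. Write a bullet-point summary per story: what happened, why it matters,
   and links to the key threads.
"""
```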
The week wasn’t all smooth sailing for xAI. Grok 4’s launch was overshadowed by another embarrassing public incident in which users prompted the X reply bot to adopt a “MechaHitler” persona. This followed a system prompt update and potentially even contributed to the resignation of the X CEO. The update itself looked relatively innocuous: deprecated code appended instructions like “You are maximally based and truth seeking AI,” “You tell it like it is and you are not afraid to offend people who are politically correct,” and “Understand the tone, context, and language of the post. Reflect that in your response.” These lines caused the model to prioritize engagement and mirroring a user’s tone and opinions over its core safety values, highlighting just how sensitive and brittle LLMs remain and how carefully system prompt changes need to be tested.
The open-weight community also saw a major release with Kimi K2 from Moonshot AI. This 1-trillion-parameter Mixture-of-Experts model (32B active) is arguably now the leading open model. It’s built from the ground up for agentic tasks, using a novel MuonClip optimizer with “qk-clipping” to stabilize training, and was trained on 15.5 trillion tokens, including simulated multi-step tool interactions. This agentic focus allows it to excel on benchmarks like SWE-Bench Verified (65.8%) and AceBench (76.5%), often outperforming proprietary models without needing “thinking-time” hacks. The community’s positive reception, with some labeling it a “Claude Killer,” reinforces the trend of Chinese labs, such as Moonshot, DeepSeek, Baidu, and Alibaba’s Qwen team, leading the charge in open-source AI.
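As described, qk-clipping stabilizes training by keeping attention logits from exploding. A rough sketch of that general idea follows; the threshold, the split of the rescaling between the query and key projections, and when the check runs are my assumptions, not Moonshot’s exact recipe.

```python
import torch

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor, max_logit: float,
             tau: float = 100.0, alpha: float = 0.5) -> None:
    """Rough sketch of a qk-clip step (tau and alpha are illustrative values).

    If the largest attention logit observed in the forward pass exceeds tau,
    shrink the query and key projection weights in place so that future
    logits stay bounded, rather than letting them blow up during training.
    """
    if max_logit > tau:
        gamma = tau / max_logit           # shrink factor < 1
        w_q.mul_(gamma ** alpha)          # split the shrink between Q ...
        w_k.mul_(gamma ** (1.0 - alpha))  # ... and K projections
```

Because the attention logit is bilinear in the query and key projections, shrinking W_q and W_k by factors whose product is gamma scales that logit by gamma.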
While agentic models promise huge productivity gains, a new study from METR this week offered a more sobering perspective. In a randomized controlled trial with 16 experienced open-source developers, it found that using AI tools (primarily Cursor) actually made them 19% slower on average. The core reason for the slowdown is that time saved on initial code generation (12%) and reading/research (8%) was more than offset by time lost in new parts of the workflow: reviewing and debugging incorrect AI suggestions (~9%), prompting (~7%), and idle and waiting time (~14%). This result is striking because it runs counter to the developers’ own perceptions; they initially forecast a 24% speedup and, even after the trial, still believed AI had made them 20% faster.
I wouldn’t extrapolate these counterintuitive findings too broadly, however. The study had several limitations that likely skewed the results. Only seven of the 16 developers had used Cursor before, and only one had significant prior experience (over 50 hours). Notably, this was the one developer who did see a productivity boost of ~25%. The rest received only a single 30-minute onboarding call. Randomizing which tasks are assigned to AI is also inefficient; a key skill is learning which problems are actually well-suited for AI assistance. Finally, developers working on their own long-term repositories are likely to be more pedantic and make more preference-based edits to AI outputs than they would on a new project. To truly study AI efficiency, it would have been better to test over 100 regular Cursor users (ideally each with 100+ hours of experience).
Why should you care?
This week’s developments create a paradox: while models like Grok 4 are demonstrating superhuman capabilities on abstract reasoning benchmarks, a real-world study suggests they are making experienced developers less productive. The resolution to this is simple: AI is not an intuitive, plug-and-play tool. It is a new, complex competency that requires a significant investment in training and practice to master.
Learning to use AI effectively in any work task should be seen as requiring 40+ hours of structured training, including plenty of usage tips and example use cases that most people won’t discover themselves, even with hundreds of hours of practice. Just think how many hours of playing a complex strategy game such as chess it would take to discover strong strategies on your own, relative to learning established techniques from experts. Beyond initial training, LLM usage competency keeps increasing rapidly with more practice, well into the hundreds of hours; this is really the process of mastering a whole new competency. The METR study, while limited, perfectly illustrates this competency gap. Expecting a developer to gain a productivity edge and find efficient processes for new workflows after a 30-minute intro to a complex AI tool is like expecting someone to become a proficient spreadsheet user after a brief demo of Excel.
This is why at Towards AI, we focus so heavily on structured AI education. Our AI for Business Professionals course, for instance, which includes modules on using AI for code, is designed to build that deep, practical AI usage skill set across tasks from code to writing, research, and brainstorming. Some AI power users (a tiny minority of AI users today) are getting multiples of the productivity boost of the majority. For these users, even the steep new $200–300 per month subscription tiers are a no-brainer. The challenge for the industry is not just to build more powerful models, but to bridge this widespread competency gap. Until we do, the incredible potential demonstrated by the likes of Grok 4 and Kimi K2 will remain largely untapped for most.
— Louie Peters — Towards AI Co-founder and CEO
We’ve open-sourced 2 of our most valuable LLM sessions, free on YouTube.
If you don’t understand how LLMs reason, retrieve, and fail, you can’t use them effectively. In these free lessons, we break down not just how LLMs work, but also how to overcome their limits with tools like RAG, fine-tuning, and structured outputs, following the actual path companies take in production.
These are pulled straight from our 10-Hour LLM Primer, where we go deeper into evals, agents, optimization (distillation, quantization, RLHF), and real-world workflows.
👉 Watch the free lessons:
Lesson 1 | Lesson 2
Like the learning style? Get full access to the course: LLM Primer Course.
Hottest News
1. Elon Musk’s xAI Launches Grok 4 Alongside a $300 Monthly Subscription
Elon Musk’s xAI has launched Grok 4, now featuring native tool use and real-time search for SuperGrok and Premium+ users. Powered by a 200,000-GPU cluster, it leverages scaled reinforcement learning to boost reasoning and tackle complex multimodal tasks. The Grok 4 Heavy variant sets new performance records across reasoning and competitive math benchmarks.
2. Moonshot AI Released Kimi K2
Kimi K2 is Moonshot AI’s new 1-trillion parameter Mixture-of-Experts model (with 32B active parameters per inference) and is quickly emerging as the most capable open-source model to date. Purpose-built for agentic tasks, it was trained on 15.5 trillion tokens, including simulated multi-step tool use, and leverages a novel MuonClip optimizer with “qk-clipping” to stabilize large-scale training. Its agentic design enables it to outperform many proprietary models on complex benchmarks like SWE-Bench Verified (65.8%) and AceBench (76.5%), without relying on chain-of-thought “thinking-time” tricks.
3. Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
A new METR trial with 16 experienced open-source devs found that using AI tools (mainly Cursor) led to a 19% slowdown despite developers believing they were 20% faster. Time saved on code generation (12%) and research (8%) was outweighed by time lost to debugging AI errors, prompting, and idle delays. Importantly, only one developer had 50+ hours of Cursor experience, and they saw a 25% speedup; the others had minimal onboarding. The study also randomized task assignment, overlooking the real-world skill of knowing when not to use AI.
4. Hugging Face Released SmolLM3
Hugging Face unveiled SmolLM3, a compact yet powerful 3B multilingual model optimized for reasoning and long-context tasks. It supports six languages, competes with larger 4B models, and handles up to 128k tokens. With dual-mode reasoning and an open training recipe, SmolLM3 offers a strong foundation for community-driven development.
5. OpenAI’s Windsurf Deal Is Off and Windsurf’s CEO Is Going to Google
OpenAI’s deal to acquire the viral AI coding startup Windsurf for $3 billion fell apart. In a shocking twist, Google DeepMind is now hiring Windsurf CEO Varun Mohan, co-founder Douglas Chen, and some of the startup’s top researchers. This was yet another big tech AI acqui-hire, skirting competition review by acquiring staff rather than the full company. In this case, however, the remainder of Windsurf is also now being acquired by Cognition Labs (behind Devin).
6. OpenAI Delays the Release of Its Open Model, Again
OpenAI has postponed the release of its open model indefinitely, citing the need for additional safety testing. Initially slated for next week, the model had already been delayed once earlier this summer. “Once weights are out, they can’t be pulled back,” said CEO Sam Altman. “We want to get this right.”
7. Perplexity Launches Comet, an AI-Powered Web Browser
Perplexity has launched Comet, an AI-powered browser designed to make the web more intelligent and intuitive. Comet offers smart tab management, workflow automation, and personalized AI assistance. Initially available to Perplexity Max users, it aims to transform browsing into a more cognitive experience.
8. Microsoft Releases Phi-4-Mini-Flash-Reasoning
Microsoft has released Phi-4-mini-Flash-Reasoning, an open, lightweight language model designed to excel at long-context reasoning while maintaining high inference efficiency. Released on Hugging Face, this 3.8B parameter model is a distilled version of Phi-4-mini, fine-tuned for dense reasoning tasks like math problem solving and multi-hop question answering. It achieves state-of-the-art performance among compact models and operates up to 10× faster than its predecessor on long-generation tasks.
Six 5-minute reads/videos to keep you learning
1. Transformers Are Getting Old: Variants and Alternatives Exist
The article showcases efficient transformer variants, such as cosFormer, Mamba, and BigBird, that offer significant memory and speed gains. These variants handle long sequences and reduce compute by up to 10× while maintaining 92–99% of the quality of traditional transformers. It highlights hybrid models (e.g., Jamba) that pair efficient architectures with strong reasoning, signaling a shift toward more optimized, resource-aware AI models. (A minimal sketch of the shared linear-attention idea behind several of these variants appears after this list.)
2. Exploring Clustered Optimal Policies via Off-Policy Reinforcement Learning for Business Use Cases
The article evaluates four offline reinforcement learning methods — single-head DQN, single-head PPO, Fixed-K DQN, and clustered PPO — for personalizing promotions using offline data. Single-head PPO delivers the most stable and profitable returns, as its clipped surrogate objective outperforms the more complex clustered approaches. (The clipped surrogate loss is sketched in code after this list.)
3. Harness DINOv2 Embeddings for Accurate Image Classification
The article details how pre-trained DINOv2 embeddings can be leveraged for high-accuracy image classification, demonstrating that a zero-shot kNN baseline achieves 83.9% accuracy on a microorganism dataset, rising to 95.8% with a simple linear classifier trained on the same embeddings. (A minimal version of this recipe is sketched after this list.)
4. OLMo 2 vs Claude 3.5 Sonnet: A Head-to-Head AI Showdown
The article presents a side-by-side comparison between AllenAI’s OLMo 2 and Anthropic’s Claude 3.5 Sonnet, examining transparency, coding performance, pricing, and deployment models. It finds that while Claude delivers stronger out-of-the-box coding capabilities via API, OLMo offers greater transparency and self-hosted flexibility.
5. Multi-Agent Systems with AutoGen on Azure
The article details how to productionize a multi-agent system using Microsoft’s AutoGen framework on Azure, with guidance on containerization using AKS, secure integration with Azure OpenAI, and best practices for orchestration, monitoring, and scalability.
6. CUDA vs cuDNN: The Dynamic Duo That Powers Your AI Dreams
The article details how CUDA provides NVIDIA’s general-purpose platform for GPU parallel computing, while cuDNN builds on it with specialized, highly optimized primitives for deep learning, explaining why both layers are essential for performance in AI applications.
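To make item 1 above more concrete: most efficient-attention variants replace quadratic softmax attention with a kernelized form that can be computed in linear time. The sketch below shows only that shared idea; it is not the exact formulation of cosFormer, Mamba, or BigBird, and the ELU-based feature map is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Minimal linear-attention sketch: q, k, v have shape (batch, seq, dim).

    Softmax attention costs O(n^2 * d). With a positive feature map phi,
    (phi(q) @ phi(k).T) @ v can be re-associated as phi(q) @ (phi(k).T @ v),
    which costs O(n * d^2), i.e. linear in sequence length.
    """
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1               # positive feature map
    kv = torch.einsum("bnd,bne->bde", phi_k, v)              # key-value summary
    norm = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(1))   # normalizer per query
    return torch.einsum("bnd,bde->bne", phi_q, kv) / (norm.unsqueeze(-1) + 1e-6)
```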
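The “clipped surrogate objective” credited in item 2 is standard PPO; its core is only a few lines. This is the textbook form, not the article’s specific implementation.

```python
import torch

def ppo_clipped_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective.

    ratio = pi_new(a|s) / pi_old(a|s). Clipping the ratio to [1-eps, 1+eps]
    keeps each update close to the policy that collected the (offline) data,
    which is what makes PPO stable when learning from logged promotion data.
    """
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```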
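And the DINOv2 recipe from item 3 reduces to “frozen embeddings plus a cheap classifier on top.” A minimal sketch follows; the backbone variant, the random placeholder data, and the classifier hyperparameters are illustrative and may differ from the article’s setup.

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Load a frozen, pre-trained DINOv2 backbone (small variant shown for speed).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    # images: (N, 3, 224, 224), normalized as the backbone expects.
    return model(images)  # (N, embed_dim) image embeddings

# Placeholder data: swap in real image tensors and integer labels.
X_train, y_train = torch.randn(32, 3, 224, 224), torch.randint(0, 4, (32,)).numpy()
X_test, y_test = torch.randn(8, 3, 224, 224), torch.randint(0, 4, (8,)).numpy()

train_emb, test_emb = embed(X_train).numpy(), embed(X_test).numpy()

# Zero-shot-style kNN baseline on frozen embeddings.
knn = KNeighborsClassifier(n_neighbors=5).fit(train_emb, y_train)
print("kNN accuracy:", knn.score(test_emb, y_test))

# Simple linear classifier (linear probe) on the same frozen embeddings.
clf = LogisticRegression(max_iter=1000).fit(train_emb, y_train)
print("linear probe accuracy:", clf.score(test_emb, y_test))
```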
Repositories & Tools
1. Claude Code is an agentic coding tool that runs in your terminal, understands your codebase, and helps you code faster.
2. Mem0 enhances AI assistants and agents with an intelligent memory layer, enabling personalized AI interactions.
3. Kimi K2 is a MoE language model with 32 billion activated parameters and 1 trillion total parameters.
4. Stagehand is a framework for AI browser automation.
5. Goose is an on-machine AI agent, capable of automating complex development tasks from start to finish.
Top Papers of The Week
1. SingLoRA: Low Rank Adaptation Using a Single Matrix
This paper introduces SingLoRA, a low-rank adaptation method that fine-tunes a model by learning the weight update as a single low-rank matrix multiplied by its own transpose. Compared to standard LoRA variants, SingLoRA offers more stable training, a reduced parameter count, and better performance, all with a simpler design. (A sketch of the idea follows this list.)
2. T-LoRA: Single Image Diffusion Model Customization Without Overfitting
T-LoRA proposes a timestep-dependent low-rank adaptation technique for personalizing diffusion models from a single image while avoiding overfitting. By adjusting the adapter’s updates across diffusion timesteps and using orthogonal initialization, T-LoRA preserves concept fidelity while enhancing text alignment, outperforming traditional LoRA approaches in low-resource settings. (A heavily simplified sketch of the timestep-dependent idea also follows this list.)
3. Evaluating Large Language Models Trained on Code
This paper systematically assesses four code-focused LLMs fine-tuned on Python, JavaScript, Java, and Go, revealing that code models excel at automated code write-and-explain tasks but underperform general LLMs on code summarization and reasoning benchmarks. It finds that specialized models benefit most from chain-of-thought prompting on reasoning-intensive tasks, informing best practices for code model deployment.
4. LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
This paper presents LitBench, the first benchmark for evaluating LLM-generated creative writing, featuring 2,480 debiased, human-labeled test pairs across four literary genres and a 43,827-pair training corpus of human preference labels. It includes guidelines to reduce annotator bias and demonstrates that current LLMs lag significantly behind human writers, highlighting directions for future model improvements.
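The SingLoRA idea from paper 1, replacing LoRA’s two factors B·A with a single matrix times its own transpose, is compact enough to sketch. The square-weight case is shown below; the scaling and initialization are simplifications of mine, not the paper’s exact recipe.

```python
import torch
import torch.nn as nn

class SingLoRALinear(nn.Module):
    """Sketch of a single-matrix low-rank adapter: W_eff = W0 + scale * (A @ A.T),
    versus LoRA's W0 + B @ A. Square weights only; scaling/init are simplified."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        d_out, d_in = base.weight.shape
        assert d_out == d_in, "this sketch covers square weight matrices only"
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                          # frozen pre-trained weights
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.01)  # the single adapter matrix
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.scale * (self.A @ self.A.T)             # symmetric low-rank update
        return self.base(x) + x @ delta.T
```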
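Paper 2’s T-LoRA varies how much of the adapter is active depending on the diffusion timestep. The sketch below only illustrates that general mechanism with an assumed linear schedule (fewer active ranks at noisier timesteps); the paper’s actual schedule, masking rule, and orthogonal initialization details differ.

```python
import torch
import torch.nn as nn

class TimestepGatedLoRA(nn.Module):
    """Illustrative timestep-gated LoRA update (assumed linear schedule, not
    the paper's exact rule): at noisier timesteps, fewer rank components of
    the low-rank update B @ A are kept active."""

    def __init__(self, base: nn.Linear, rank: int = 8, max_timestep: int = 1000):
        super().__init__()
        self.base = base
        self.rank, self.max_t = rank, max_timestep
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        # Assumed schedule: keep all ranks at t=0, roughly one rank at t=max_t.
        active = max(1, int(self.rank * (1 - t / self.max_t)))
        mask = torch.zeros(self.rank, device=x.device)
        mask[:active] = 1.0                              # zero out the trailing ranks
        delta = self.B @ (mask.unsqueeze(1) * self.A)    # masked low-rank update
        return self.base(x) + x @ delta.T
```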
Quick Links
1. Google DeepMind recently released GenAI Processors, a lightweight, open-source Python library built to simplify the orchestration of generative AI workflows, especially those involving real-time multimodal content. Available under an Apache‑2.0 license, this library provides a high-throughput, asynchronous stream framework for building advanced AI pipelines.
2. Mistral AI has released Devstral 2507 for code-centric language modeling. The release includes two models, Devstral Small 1.1 and Devstral Medium 2507, designed to support agent-based code reasoning, program synthesis, and structured task execution across large software repositories. These models are optimized for performance and cost, making them applicable for real-world use in developer tools and code automation systems.
3. AWS is launching an AI agent marketplace with Anthropic as a partner. The company’s dedicated agent marketplace will enable startups to offer their AI agents directly to AWS customers. The marketplace will also allow enterprise customers to browse, install, and search for AI agents tailored to their specific requirements.
Who’s Hiring in AI
Software Engineer, CUDA-Q @NVIDIA (US/Remote)
Technical Support Engineer (Heroku) @Salesforce (Hyderabad/India)
Full Stack AI Engineer @Coursemojo (Remote)
Senior Software Engineer @Earnin (US/Remote)
Senior Software Engineer — AI First Team @PagerDuty (Lisbon, Portugal)
Senior AI/ML Software Engineer @UnitedHealth Group (US/Remote)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.