TAI #153: AlphaEvolve & Codex - AI Breakthroughs in Algorithm Discovery & Software Engineering
Also, VS Code moves to open-source AI, MiniMax voice, Qwen Parallel Scaling & more.
What happened this week in AI by Louie
This week, Google DeepMind introduced AlphaEvolve, a genuinely innovative agent capable of discovering and evolving new algorithms, representing a leap in AI’s potential to make true original breakthroughs. While AI’s use in fundamental science is only just beginning, AI for software development is maturing rapidly. OpenAI launched its new Codex coding agent this week while Microsoft made a strategic move to open-source key components of its GitHub Copilot Chat extension for VS Code — a timely reaction to momentum at AI-coding competitors such as Windsurf (recently acquired by OpenAI for $3bn) and Cursor (creator Anysphere recently valued at ~$9bn). Both Cursor and Windsurf are themselves forks of VS Code, underscoring the significance of Microsoft’s decision to deeply embed open-source AI into the platform.
AlphaEvolve distinguishes itself by evolving superior algorithms through a robust integration of LLMs like Gemini Pro, evolutionary search frameworks, and rigorous automated evaluation methods. Rather than merely generating plausible code snippets, it iteratively refines entire codebases, simultaneously optimizing across multiple performance metrics. Crucially, by grounding itself in actual code execution results, AlphaEvolve effectively sidesteps hallucinations, resulting in algorithms that not only sound plausible but demonstrably outperform existing methods.
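The core loop described above can be sketched in a few lines. This is a minimal illustrative skeleton, not AlphaEvolve's actual system: here `mutate` stands in for the LLM proposing an edit, and `evaluate` stands in for the automated evaluator that actually executes candidates, which is what grounds the search and avoids hallucination.

```python
import random

def evolve(seed_program, mutate, evaluate, generations=100, pop_size=20):
    """Minimal evolutionary loop: mutate candidate programs, keep the best
    according to an automated evaluator that returns a scalar score."""
    population = [(seed_program, evaluate(seed_program))]
    for _ in range(generations):
        # Tournament selection: pick the best of a small random sample.
        parent, _ = max(random.sample(population, min(3, len(population))),
                        key=lambda pair: pair[1])
        child = mutate(parent)                       # an LLM proposes the edit in AlphaEvolve
        population.append((child, evaluate(child)))  # ground truth: actually run the candidate
        population.sort(key=lambda pair: pair[1], reverse=True)
        population = population[:pop_size]           # keep only the fittest candidates
    return population[0]

# Toy usage: "programs" are just numbers, fitness is closeness to a target.
best, score = evolve(
    seed_program=0.0,
    mutate=lambda x: x + random.uniform(-1, 1),
    evaluate=lambda x: -abs(x - 3.14),
)
```

In the real system the population holds entire codebases, evaluation runs benchmarks across multiple metrics at once, and the selection strategy is far richer, but the shape of the loop is the same.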
The results achieved by AlphaEvolve are notable. It has already optimized critical aspects of Google’s own infrastructure, recovering approximately 0.7% of Google’s entire fleet-wide compute capacity through an improved data center scheduling algorithm. It has also simplified hardware accelerator circuit designs and accelerated the training of its own underlying LLM — a glimpse into a future of practical AI self-improvement. Perhaps most impressively, AlphaEvolve cracked a problem that had stood unchanged since 1969, devising a more efficient method for multiplying two 4x4 complex matrices using only 48 scalar multiplications, besting Strassen’s classic algorithm after 56 years. Moreover, it tackled over 50 other open mathematical problems, often matching or surpassing the state of the art.
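For context on the matrix result: Strassen's 1969 insight was that two 2x2 matrices can be multiplied with 7 scalar multiplications instead of the naive 8, and applying it recursively gives 49 multiplications for 4x4 matrices; AlphaEvolve's 48-multiplication scheme for complex 4x4 matrices is what beats that recursive bound. The classic 7-multiplication identity is easy to verify:

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications (Strassen, 1969)."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

# Matches the naive 8-multiplication product:
A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
assert strassen_2x2(A, B) == [[1*5 + 2*7, 1*6 + 2*8],
                              [3*5 + 4*7, 3*6 + 4*8]]
```

Finding such identities amounts to searching for low-rank decompositions of the matrix-multiplication tensor, which is exactly the kind of rigorously verifiable search problem AlphaEvolve is built for.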
Parallel to AlphaEvolve’s foundational breakthroughs, we’re witnessing accelerated integration of AI into day-to-day software engineering workflows, exemplified by OpenAI’s newly launched Codex agent. Codex, powered by a fine-tuned OpenAI o3 model, operates in a secure cloud environment, seamlessly handling practical development tasks like bug-fixing, code review, refactoring, and responding to real-time user feedback. With its integrated “ask” and “code” modes, it autonomously clones repositories, runs tests, and proposes code improvements via diffs and pull requests, guided by explicit instructions defined in AGENTS.md files. Codex is an evolution toward AI as an ever-present, trusted engineering partner, not merely a passive coding assistant.
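To give a flavor of how that guidance works in practice, an AGENTS.md file is just a markdown document in the repository that tells the agent how to build, test, and style its changes. The sketch below is illustrative; the commands and conventions shown are placeholders, not a prescribed schema:

```markdown
# AGENTS.md (illustrative sketch)

## Setup
- Install dependencies with `pip install -r requirements.txt`.

## Testing
- Run the full test suite with `pytest -q` before proposing any diff.
- All tests must pass; do not disable or skip failing tests.

## Conventions
- Follow PEP 8 and keep functions focused and small.
- Write commit messages in the imperative mood.
```

Because the agent reads these instructions before acting, teams can encode their norms once rather than repeating them in every prompt.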
Why should you care?
AlphaEvolve illustrates a significant threshold: AI methods that combine the generative creativity of LLMs with evolutionary search and rigorous automated verification can now drive genuine foundational breakthroughs, not merely incremental improvements.
To grasp this shift, it helps to consider how AI’s role has evolved. Initially, LLMs emerged primarily as chatbots for straightforward Q&A interactions. When leveraged effectively, these systems have already progressed into high-value assistants capable of either saving substantial time or brainstorming novel ideas and solutions, especially powerful when paired with human intuition to rapidly filter the best insights.
The next stage represents an even deeper collaborative partnership, with AI agents beginning to act more autonomously to complete complex tasks. This trajectory is clearly evident in both Deep Research agents and Coding Agents. Tools like Codex and initiatives by startups such as Devin are steadily nudging AI assistants further along their trajectory from “fancy autocomplete” to AI copilots and on toward greater autonomy. Still, at this stage, expert guidance remains crucial to steer the models effectively.
In scientific research, AI also currently remains most valuable as a powerful copilot, surfacing ideas or connections researchers might otherwise overlook. Deep Research itself is actively advancing AI autonomy across parts of the scientific process, already holding potential to automate significant portions of the research pipeline, though it still encounters hurdles like journal paywalls and currently works best within a heavily interactive, iterative human-AI collaboration. AlphaEvolve, however, offers a glimpse into the future potential of fully autonomous scientific agents — albeit within a narrowly defined domain.
Most current AI capabilities, however, still derive largely from human-generated data provided during their training. These models may uncover valuable existing insights missed by human expert users, or creatively combine ideas from distinct fields (also a fundamental yet often underappreciated driver of human breakthroughs and artistic creations!). The first wave of AI-driven breakthroughs will likely integrate these creative combinational abilities with AI’s brute-force capacity to systematically test a large number of generated ideas.
One field particularly suited to this increased autonomy and brute-force experimentation is AI research itself, where AI agents optimize their own model architectures, training techniques, and experimental pipelines. Autonomous AI-driven experimentation will work particularly effectively for LLMs because:
Fully automated experimentation is straightforward via APIs and cloud computing.
Scaling laws mean results from small models (around 10²⁰ FLOPs) remain predictive and relevant at much larger scales (although extrapolation to larger parameter counts or training-token budgets does sometimes fail unexpectedly!).
Automated verification via training metrics and benchmarks provides clear, quantifiable feedback loops, allowing rapid evolutionary iteration.
We anticipate a powerful “funnel” approach to emerge: autonomous AI runs vast numbers of small-scale experiments, rapidly discarding ineffective approaches, and incrementally escalating the most promising results in larger training runs. At higher compute budgets, human oversight will increasingly step in, ensuring substantial resource commitments follow robust validation.
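The funnel described above is close in spirit to successive halving from hyperparameter optimization: score many candidates at a cheap budget, keep only the top fraction, and re-run survivors at each larger budget. A minimal sketch, with a made-up noisy proxy score standing in for a real small-scale training run:

```python
import math
import random

def funnel(candidates, run_experiment, budgets=(1, 4, 16), keep_frac=0.25):
    """Successive-halving sketch of the 'funnel': evaluate many ideas cheaply,
    promote only the best fraction to each larger compute budget."""
    survivors = list(candidates)
    for budget in budgets:
        scored = [(c, run_experiment(c, budget)) for c in survivors]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        keep = max(1, int(len(scored) * keep_frac))
        survivors = [c for c, _ in scored[:keep]]  # discard weak ideas early
    return survivors  # shortlist for human review and a large training run

# Toy usage: candidates are learning rates; the "experiment" returns a noisy
# proxy score whose noise shrinks as the compute budget grows.
def proxy_score(lr, budget):
    return -abs(math.log10(lr) + 3) + random.gauss(0, 1 / budget)

shortlist = funnel([10 ** random.uniform(-5, -1) for _ in range(64)], proxy_score)
```

Here 64 candidates are cut to 16, then 4, then 1, so most compute is spent on the few ideas that survived cheap screening, which is exactly where human oversight would then step in.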
This funnel approach also creates a potent reinforcement-learning loop. Running experiments at such a scale could itself generate vast amounts of new data, ideal for reinforcement learning of reasoning models. Human ML researchers would guide promising research directions at the outset and step in at high-stakes junctures, dramatically boosting efficiency throughout the research pipeline.
Ultimately, we’re seeing two complementary AI evolutions unfold simultaneously: AI as a potentially autonomous discoverer of fundamental breakthroughs (AlphaEvolve), and AI becoming an increasingly indispensable partner embedded directly into everyday workflows (Codex, AI-enhanced VS Code). The era of AI serving not merely as a productivity booster but as a genuinely powerful partner and independent innovator is clearly upon us.
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. Google DeepMind Announced AlphaEvolve: A Gemini-Powered Coding Agent
Google DeepMind announced AlphaEvolve, an LLM-powered coding agent for general-purpose algorithm discovery and optimization. It pairs Google’s Gemini model with an evolutionary approach that automatically tests, refines, and improves algorithms. The system has already been deployed across Google’s data centers, chip designs, and AI training systems.
2. OpenAI Launches Codex, an AI Coding Agent, in ChatGPT
OpenAI released a research preview of Codex, an AI coding assistant powered by Codex-1, a variant of its o3 reasoning model tailored for software engineering. Users with access can find Codex in ChatGPT’s sidebar, assign coding tasks with a prompt, and trigger execution using the “Code” button.
3. VS Code Announced Open Source AI Editor
Microsoft announced it will open-source the code in the GitHub Copilot Chat extension under the MIT license and refactor AI features from the extension into VS Code core. It will allow developers to directly review the code, contribute new features, or fix issues. Meanwhile, the VS Code team emphasized that integrating AI functions into the editor core will significantly improve development efficiency.
4. Nous Research Launches Psyche Decentralized Network
Nous Research unveiled Psyche, a decentralized network for AI model training built on the Solana blockchain. Using DisTrO technology to reduce bandwidth demands, it enables global participation using idle compute resources. The platform has launched the largest distributed training run to date, aiming to train a 4B-parameter open-source model.
5. Notion Introduced Notion AI for Work
Notion rolled out a new suite of AI tools integrated into its workspace, including AI Meeting Notes, Research Mode, and Enterprise Search. Users can now toggle between GPT-4.1 and Claude 3.5, with AI Connectors and additional features rolling out soon.
6. The Chinese MiniMax Voice Model Beats OpenAI and Everyone Else
China-based MiniMax’s Speech-02-HD, a text-to-speech model, outperforms OpenAI and ElevenLabs on the Artificial Analysis Speech Arena leaderboard. It supports over 32 languages, raw-audio cloning, and can process up to 200,000 characters in a single pass.
7. Stability AI Releases an Audio-Generating Model That Can Run on Smartphones
Stability AI launched Stable Audio Open Small, a 341M-parameter stereo audio model optimized for mobile. Open-source and efficient enough to run on Arm CPUs, it generates up to 11 seconds of audio in under 8 seconds, directly on smartphones.
8. Qwen Introduced Parallel Scaling
Alibaba’s Qwen team introduced a theoretical scaling law for parallel computation, validated through pretraining experiments. The study finds that running a model with P parallel streams over shared weights yields quality gains comparable to scaling its parameter count by a factor of O(log P), offering a new axis for trading inference compute against memory. This approach aims to enhance efficiency for LLMs.
Five 5-minute reads/videos to keep you learning
1. 12 MCP Servers You Can Use in 2025
MCP (Model Context Protocol) is emerging as the go-to standard for connecting LLMs to external data, tools, and services. This guide explores 12 leading MCP servers, highlighting their strengths, ideal use cases, and trade-offs.
2. A Data Scientist’s Guide to Docker Containers
This beginner-friendly guide covers the essentials of Docker — from understanding containers to building and running your first one. It’s tailored for data scientists looking to streamline development, testing, and deployment.
3. Merge Large Language Models with Mergekit
Model merging combines multiple LLMs into one. This tutorial uses the mergekit library, compares four merging strategies, and walks you through building a custom merged model, Marcoro14-7B-slerp.
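For readers who haven't used mergekit: merges are declared in a YAML file and run from the command line. The sketch below roughly follows mergekit's slerp configuration schema; the model names are placeholders, and field details may differ across mergekit versions, so treat this as a shape rather than a copy-paste recipe:

```yaml
# Hypothetical slerp merge of two 7B models (placeholder model names)
slices:
  - sources:
      - model: org-a/model-a-7b
        layer_range: [0, 32]
      - model: org-b/model-b-7b
        layer_range: [0, 32]
merge_method: slerp
base_model: org-a/model-a-7b
parameters:
  t: 0.5        # interpolation factor between the two models
dtype: bfloat16
```

The linked tutorial walks through a concrete, working config and compares slerp against other strategies such as ties and dare.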
4. Regression vs Classification in Machine Learning — Why Most Beginners Get This Wrong
One of the first decisions in machine learning — regression or classification — is often misunderstood. This article explains how to make that call properly, what each term truly means, and how to apply the distinction in real-world projects.
5. TinyAgents: A Minimal Experiment with Code Agents and MCP Tools
TinyAgents explores how lightweight code agents can asynchronously call tools using MCP. By replacing tool-calling agents with code agents, this approach enables more complex and autonomous workflows beyond what standard tool-use agents can handle.
Repositories & Tools
1. Open Agent Platform is a no-code agent-building platform connected to various tools, RAG servers, and even agents through an Agent Supervisor.
2. Paper2Code is a multi-agent LLM system that transforms papers into code repositories.
3. Simple Evals contains a lightweight library for evaluating language models.
Top Papers of The Week
1. ZeroSearch: Incentivize the Search Capability of LLMs without Searching
ZeroSearch introduces a reinforcement learning framework that enables LLMs to perform search-like reasoning without relying on real search engines. Through supervised fine-tuning and a curriculum-based rollout strategy, it trains models to mimic search behavior while avoiding the cost and instability of live API interactions, achieving competitive or superior results.
2. LLMs Get Lost In Multi-Turn Conversation
This study compares LLM performance across single- and multi-turn settings using large-scale simulations. Results show a significant performance drop in multi-turn interactions, averaging 39% across six generation tasks. The decline stems less from aptitude loss and more from increased unreliability, as revealed in the analysis of over 200,000 conversations.
3. Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Addressing two major issues in LLM training data — verification strategy and seed data selection — this paper presents an efficient evaluation pipeline. It enables rapid assessment of data quality impacts and optimizes sample selection for classifier training, improving the filtering and verification process for high-quality training datasets.
4. Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning
Nemotron-Research-Tool-N1 introduces a rule-based reinforcement learning approach for training tool-using language models. It uses a binary reward signal focused solely on the structural and functional correctness of tool use, enabling models to internalize reasoning strategies without relying on annotated trajectories.
5. Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
This research applies Grokking to real-world data by augmenting knowledge graphs with synthetic relational data to enhance Transformer-based multi-hop reasoning. Achieving up to 100% accuracy on 2WikiMultiHopQA, the method surpasses existing benchmarks and demonstrates the potential of grokking-based augmentation in factual reasoning tasks.
Quick Links
1. Google’s Gemma AI models surpass 150M downloads. Google launched Gemma in February 2024, aiming to compete with other “open” model families like Meta’s Llama. The latest Gemma releases are multimodal and support over 100 languages.
2. xAI blamed an “unauthorized modification” for a bug in its AI-powered Grok chatbot that caused Grok to repeatedly refer to “white genocide in South Africa” when invoked in certain contexts on X. On Wednesday, Grok began replying to dozens of posts on X with information about white genocide in South Africa, even in response to unrelated subjects.
3. OpenAI is releasing its GPT-4.1 and GPT-4.1 mini AI models in ChatGPT. The GPT-4.1 models should help software engineers using ChatGPT to write or debug code. As a result of this update, OpenAI is removing GPT-4o mini from ChatGPT for all users.
4. According to CNBC, Anthropic revenue hit $2 billion in the first quarter of 2025, double the $1 billion in the prior period. Customers spending more than $100,000 annually jumped 8x from a year ago. Anthropic also received a $2.5 billion revolving credit line.
Who’s Hiring in AI
AI / Large Language Model Architect @Accenture (Multiple US Locations)
Student Worker- Data Scientist @Salesforce (Tel Aviv, Israel)
Application Developer — Microsoft .NET Stack @IBM (Giza, Egypt)
Sr Gen AI Engineer @Testlio (Remote in APAC)
Software Student — AT Infra Team @Applied Materials (Rehovot, Israel)
Applications Developer 3 @Oracle (Telangana, India)
Head of Responsible AI @National Grid (Waltham, MA, USA)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.