TAI #155: DeepSeek R1's Reasoning Leap & Unlocking AI's Untapped Potential at Work
Also, Meta's AI restructuring, the subtleties of RL rewards, and introducing our new course for AI business professionals.
What happened this week in AI by Louie
This week marked a strong return for the open-source community, with DeepSeek’s R1-0528 update significantly narrowing the performance gap with state-of-the-art reasoning models. Despite being only a point release built upon the existing base model, DeepSeek R1 made impressive leaps in reasoning benchmarks, scoring 81.0% on GPQA Diamond and 71.6% on the challenging Aider coding benchmark. These improvements place it close behind top proprietary models like OpenAI’s o3 and Google’s Gemini 2.5 Pro. While DeepSeek still lacks multimodal integration and extensive context capabilities, its excellent cost-performance ratio makes it ideal for many enterprise and hobbyist use cases. Most companies interested in deploying DeepSeek will likely either self-host the model or use LLM inference providers such as Together.ai rather than accessing it through DeepSeek’s China-based API.
In a related development, recent research highlighted the unexpected complexity behind reinforcement learning with verifiable rewards (RLVR), demonstrating that even random or nonsensical rewards could substantially boost performance on math benchmarks using Qwen 2.5 models. This peculiar finding suggests that reinforcement learning might sometimes simply prompt the model to more frequently apply existing effective strategies, like code-assisted reasoning, rather than necessarily imparting new knowledge. The insights from this study reinforce the subtlety and nuance required in effectively leveraging RL for genuinely novel capability improvements rather than merely amplifying existing patterns.
This week’s DeepSeek R1 update also notably increases competitive pressure on Meta, whose open-source Llama 4 models have recently lagged behind Chinese rivals such as Alibaba’s Qwen and DeepSeek, both on several critical benchmarks and in recent adoption. While Meta AI benefits from the immense scale of its existing apps, such as WhatsApp, with CEO Mark Zuckerberg recently reporting Meta AI at 1 billion monthly active users across its family of apps, its core AI models have struggled to keep pace in pure reasoning and coding tasks. To respond, Meta has restructured its AI teams into two distinct units: an AI Products team focusing on consumer-facing features, and an AGI Foundations group specifically tasked with advancing reasoning, multimedia, and voice capabilities, including its Llama model family. Whether this strategic realignment helps Meta bridge the performance gap and retain critical talent remains uncertain, but it will be key to US competitiveness in open-source AI.
Why should you care?
The rapid pace of AI progress, evident in both open-source updates like DeepSeek R1 and the continuous evolution of closed models, brings a wave of powerful new tools. However, this very progress makes the landscape increasingly complex to navigate. LLM model choice alone is getting complicated enough for both users and developers, with a vast array of options differing in strengths, costs, and specific capabilities. Add to this the huge range of techniques for developing and using them, and the sheer breadth of different use cases, and it’s clear that simply keeping up is a challenge.
As I’ve discussed before, understanding the economic trade-offs is crucial. One must constantly balance the value of their own time saved checking an inferior AI’s output, or correcting and improving it, against the potentially higher cost of a more capable model. This isn’t just about finding the cheapest option, but the most effective one for a given task and budget. The nuances of when to deploy a frontier reasoning model versus a more efficient specialized one, or how to best structure human-AI collaboration, are becoming key differentiators. While exciting, the constant stream of advancements underscores the need for a more structured approach to learning and applying these technologies to avoid simply dabbling without achieving real impact.
These very challenges and the clear gap between AI’s potential and its current practical application in most workplaces have led us to launch our new course designed specifically for business professionals.
Tried AI and felt it’s more hype than help? You’re not wrong if you’re using it like most people. Effective AI adoption is rare because it demands more than basic prompting — it requires imagination and intuition for where to put it to work, human expertise, and knowing how to collaborate.
— Louie Peters — Towards AI Co-founder and CEO
AI: From Confusing Toy to Your Most Powerful Business Ally?
Introducing AI for Business Professionals: Towards AI’s new course for non-technical users to truly leverage AI (ChatGPT, Claude, Gemini). This isn’t just about speed; it’s about enhancing work quality and brainstorming high-value ideas.
The full course is $399, but your journey starts completely free. Preview key lessons, download our Top Tips Cheat Sheet, and discover how to finally make AI useful.
Learn to:
🧠 Master “Skilled Collaboration”: Go beyond prompts, integrate your expertise.
💡 Overcome “Lack of Imagination”: Discover high-impact AI uses for writing, research, and data analysis.
🚀 Unlock Strategic Insights: Use AI for superior brainstorming & planning.
🛡️ Use AI Safely & Effectively: Avoid common pitfalls, protect data.
The first few lessons & our “Top Tips Cheat Sheet” are FREE. See the “night and day difference” expert AI use makes.
We estimate fewer than 1M people globally use AI to its full potential at work. Be one of them.
👉 Start your free lessons & get the cheat sheet!
Hottest News
1. DeepSeek Pushes an R1 Upgrade
DeepSeek released its R1-0528 update, pushing its 671B model’s Artificial Analysis score from 60 to 68 — on par with Gemini 2.5 Pro and ahead of Grok 3 Mini, Llama 4 Maverick, and Qwen3 235B. The improvement came without architecture changes, driven instead by deeper RL training with 99M tokens and gains in math, coding, and science.
2. Anthropic Open-Sources Circuit Tracing Tools
Anthropic has open-sourced a new interpretability toolkit that generates attribution graphs — visual maps of how language models internally process prompts. The tools support popular open-weight models like Gemma-2-2b and Llama-3.2-1b, and are integrated with Neuronpedia for interactive exploration. Researchers can trace circuits, test hypotheses by editing feature values, and share annotated graphs.
3. Black Forest Labs Introduces FLUX.1 Kontext and the BFL Playground
Black Forest Labs has introduced FLUX.1 Kontext, a suite of multimodal flow models that enable in-context image generation and editing. Unlike traditional text-to-image systems, Kontext models accept both text and image inputs, allowing users to perform iterative edits, preserve character consistency, and apply local modifications without fine-tuning. The FLUX.1 Kontext [pro] model delivers fast, high-quality generation and editing, while the experimental FLUX.1 Kontext [max] enhances prompt adherence and typography generation. An open-weight FLUX.1 Kontext [dev] variant is also available for research and customization in private beta.
4. Insikt Group Assesses the U.S.–China AI Race
Insikt Group’s latest report assesses the U.S.–China AI race, concluding that while China’s generative AI models currently trail U.S. counterparts by approximately 3–6 months, the gap is narrowing. Despite advancements like DeepSeek’s R1 model and increased patent activity, China faces challenges in compute infrastructure, chip manufacturing, and private investment. Conversely, the U.S. maintains an edge in funding, talent, and model performance. The report suggests breakthroughs in agentic and collaborative AI systems could shift the competitive landscape before 2030.
5. Mistral Launches Agents API and Code Embedding Model
Mistral AI has unveiled two significant tools: Codestral Embed and the Agents API. Codestral Embed is a specialized embedding model for code, excelling in retrieval tasks on real-world code data. It outperforms leading models like Voyage Code 3, Cohere Embed v4.0, and OpenAI’s large embedding model, even at lower dimensions and precisions. Simultaneously, the Agents API enables developers to build AI agents capable of executing tasks like code execution, image generation, and web search. These agents maintain context across interactions and can orchestrate multiple tools using the Model Context Protocol (MCP).
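To make the embedding side concrete, here is a minimal sketch of code retrieval with Codestral Embed via Mistral’s Python SDK. The call pattern follows the SDK’s embeddings endpoint, but treat the model name and exact signature as assumptions rather than verified documentation:

```python
# Sketch: embed code snippets with Codestral Embed and rank them against
# a query by cosine similarity. The model name and call signature follow
# Mistral's Python SDK but should be treated as assumptions.
import os
import numpy as np
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

snippets = [
    "def quicksort(xs): ...",
    "class LRUCache(dict): ...",
]
query = "sorting algorithm implementation"

resp = client.embeddings.create(model="codestral-embed", inputs=snippets + [query])
vectors = np.array([d.embedding for d in resp.data])
code_vecs, query_vec = vectors[:-1], vectors[-1]

# Cosine similarity between the query and each snippet.
sims = code_vecs @ query_vec / (
    np.linalg.norm(code_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(snippets[int(sims.argmax())])
```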
6. Artificial Intelligence Trends in 2025
Mary Meeker’s 2025 BOND AI Trends report highlights the unprecedented acceleration of AI development and adoption. AI-related job postings increased by 448% over the past seven years, while non-AI IT roles declined by 9%. The report underscores the rapid growth of AI ecosystems, noting that AI chatbots are now mistaken for humans 73% of the time, up from 50% six months prior. However, Meeker warns that U.S. AI leaders like OpenAI may be undercut by more cost-effective rivals, such as China’s DeepSeek, due to soaring model training costs and the emergence of efficient, custom-trained models. Despite these challenges, the U.S. maintains a lead in AI investment and infrastructure, with AI investment reaching $109 billion in 2024, significantly outpacing expenditures in China and the UK.
7. Sakana AI Introduced the Darwin Gödel Machine (DGM): A Self-Improving Agent
Sakana AI has introduced the Darwin Gödel Machine (DGM), a self-improving coding agent capable of rewriting its own Python code to enhance performance on programming tasks. DGM autonomously generates, evaluates, and integrates code modifications, such as improved file viewing, enhanced editing tools, and patch validation steps. It also maintains a history of attempted changes to inform future iterations.
Five 5-minute reads/videos to keep you learning
1. RAG Is Dead, Long Live Agentic Retrieval
This article introduces a shift from traditional Retrieval-Augmented Generation (RAG) to agentic retrieval, enhancing AI systems with dynamic, multi-step reasoning and intelligent querying across multiple knowledge bases. This approach integrates advanced techniques like hybrid search, reranking, and multi-modal embeddings, abstracted into LlamaCloud’s API for streamlined implementation.
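The article centers on LlamaCloud’s API, which isn’t reproduced here; instead, below is a hypothetical sketch of the core loop that distinguishes agentic retrieval from single-shot RAG: the model chooses which knowledge base to query, inspects what came back, and decides whether to retrieve again. The `llm`, `parse_plan`, and knowledge-base callables are placeholders, not a real API:

```python
# Hypothetical sketch of agentic retrieval: instead of one fixed vector
# lookup, an agent routes each sub-question to a knowledge base and
# decides whether the gathered context is sufficient.
def agentic_retrieve(question, llm, knowledge_bases, max_steps=3):
    context = []
    for _ in range(max_steps):
        # Ask the model which source to query next, and with what query.
        plan = llm(
            f"Question: {question}\nContext so far: {context}\n"
            f"Pick one source from {list(knowledge_bases)} and a search query, "
            "or reply DONE if the context is sufficient."
        )
        if plan.strip() == "DONE":
            break
        source, query = parse_plan(plan)  # hypothetical helper
        # Each knowledge base could internally do hybrid search + reranking.
        context.extend(knowledge_bases[source](query))
    return llm(f"Answer using this context: {context}\nQuestion: {question}")
```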
2. Supercharge Mistral-7B with GRPO Finetuning: A Beginner-Friendly Tutorial with Code
This tutorial walks through fine-tuning Mistral-7B with Group Relative Policy Optimization (GRPO) to boost reasoning performance. Using a 4-bit quantized model via Unsloth and the GSM8K dataset, the author shows how to structure rewards, run GRPO with TRL, and evaluate improvements — all in a resource-efficient setup.
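For readers who want the shape of the setup before opening the tutorial, here is a compressed sketch using TRL’s `GRPOTrainer` on GSM8K. The reward heuristic, checkpoint, and batch size are illustrative assumptions, not the tutorial’s exact code:

```python
# Minimal GRPO sketch with TRL: reward completions that contain the
# correct GSM8K answer. Details are illustrative, not the tutorial's code.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")

# GSM8K answers end with "#### <number>"; extract the gold answer.
def to_prompt(example):
    return {
        "prompt": example["question"],
        "answer": example["answer"].split("####")[-1].strip(),
    }

dataset = dataset.map(to_prompt)

# TRL reward functions receive sampled completions (plus extra dataset
# columns as kwargs) and return one float per completion.
def correctness_reward(completions, answer, **kwargs):
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # assumed checkpoint
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="grpo-mistral", per_device_train_batch_size=4),
    train_dataset=dataset,
)
trainer.train()
```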
3. Enhancing LLM Capabilities: The Power of Multimodal LLMs and RAG
This piece breaks down how Multimodal LLMs — combining text, image, and audio inputs — overcome the limits of text-only models. It explores core components like CLIP, LLaVA, and Whisper, and details how Multimodal RAG systems integrate these with retrieval pipelines. The article also outlines system-building steps and emphasizes the need for robust, cross-modal evaluation metrics.
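As a concrete taste of the cross-modal embedding step, this sketch uses Hugging Face’s CLIP implementation to score candidate images against a text query; in a multimodal RAG pipeline, the top matches would be passed to the LLM as context. File names and model choice are illustrative assumptions:

```python
# Score candidate images against a text query with CLIP; the top-scoring
# images would become retrieved context in a multimodal RAG pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["chart.png", "diagram.png"]]  # placeholder files
query = "a bar chart of quarterly revenue"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one similarity score per (image, text) pair.
scores = outputs.logits_per_image.squeeze(-1)
best = int(scores.argmax())
print(f"Best match: image {best} with score {scores[best]:.2f}")
```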
4. Monitor and Evaluate OpenAI SDK Agents using Langfuse
This tutorial walks through the process of building a simple agentic workflow with the OpenAI SDK and tracking it using Langfuse via OpenTelemetry. It covers implementing input, assist, and validation agents, configuring trace capture, and analyzing the results programmatically with Matplotlib.
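The tracing setup boils down to pointing an OpenTelemetry exporter at Langfuse. A minimal sketch follows; the endpoint path and Basic-auth scheme follow Langfuse’s OpenTelemetry docs as I understand them, but treat both as assumptions for your deployment:

```python
# Sketch: export OpenTelemetry spans from an agent workflow to Langfuse.
# Endpoint path and auth scheme are assumptions; check Langfuse's docs.
import base64
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Langfuse authenticates OTLP requests with Basic auth over its API keys.
auth = base64.b64encode(
    f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()
).decode()

exporter = OTLPSpanExporter(
    endpoint="https://cloud.langfuse.com/api/public/otel/v1/traces",  # assumed path
    headers={"Authorization": f"Basic {auth}"},
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Wrap each agent step in a span so Langfuse can reconstruct the workflow.
tracer = trace.get_tracer("agent-demo")
with tracer.start_as_current_span("validation-agent") as span:
    span.set_attribute("agent.input", "draft answer to validate")
    # ... run the agent and record its output as span attributes ...
```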
5. MCP 101: Why This Protocol Matters in the Age of AI Agents 🤖
This article introduces Anthropic’s Model Context Protocol (MCP), an open standard that streamlines LLM interactions with external tools via a client-host-server setup and JSON-RPC 2.0. It outlines MCP’s structured lifecycle, from initialization to termination, highlighting its relevance for building scalable, tool-using AI agents.
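To make the wire format concrete, here is a sketch of the JSON-RPC 2.0 `initialize` request an MCP client sends when opening a session, built as a plain Python dict. The field values are illustrative placeholders; the MCP specification is the authoritative schema:

```python
# Sketch of MCP's JSON-RPC 2.0 initialize request, the first message in
# the client-host-server lifecycle. Values here are illustrative.
import json

initialize_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2025-03-26",  # assumed spec revision
        "capabilities": {"tools": {}},  # features this client supports
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
}
print(json.dumps(initialize_request, indent=2))
```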
Repositories & Tools
1. PixelFlow is a family of image generation models that operate directly in the raw pixel space.
2. Zero is a minimally structured framework for training and evaluating vision-language models using LLM surrogates, supporting torch FSDP/DDP and evaluation via vLLM with tensor parallelism.
3. ChartGalaxy is a million-scale dataset of synthetic and real infographic charts with paired data tables, designed to advance infographic chart understanding, code generation, and layout synthesis.
4. WebAgent is an open-source framework for training autonomous web agents using a four-stage pipeline combining supervised fine-tuning and reinforcement learning.
5. HunyuanVideo-Avatar is a multimodal diffusion transformer (MM-DiT)-based model capable of generating dynamic, emotion-controllable, and multi-character dialogue videos.
Top Papers of The Week
1. Spurious Rewards: Rethinking Training Signals in RLVR
This study reveals that Reinforcement Learning with Verifiable Rewards (RLVR) can substantially enhance mathematical reasoning in Qwen2.5-Math models even when trained with spurious rewards: signals that are random, incorrect, or based solely on output formatting. These findings suggest that RLVR may amplify latent reasoning strategies within models, challenging the assumption that high-quality supervision is necessary for effective RLVR training.
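To illustrate what “spurious” means here, the sketch below re-creates two such reward functions as TRL-style reward callables. These are illustrative re-creations in the spirit of the paper, not the authors’ code:

```python
# Two "spurious" rewards in the RLVR style the paper studies: neither
# inspects the ground-truth answer, yet the paper reports both can still
# improve Qwen2.5-Math benchmark scores.
import random
import re

def random_reward(completions, **kwargs):
    # Reward is a coin flip, carrying no information about correctness.
    return [float(random.random() < 0.5) for _ in completions]

def format_reward(completions, **kwargs):
    # Reward only the presence of a boxed final answer, not its value.
    return [1.0 if re.search(r"\\boxed\{", c) else 0.0 for c in completions]
```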
2. Web Bench — a New Way To Compare AI Browser Agents
Web Bench is a comprehensive dataset designed to assess AI browser agents across a diverse range of tasks. Comprising 5,750 tasks on 452 websites, with 2,454 tasks open-sourced, Web Bench expands upon previous benchmarks by incorporating both READ tasks (data extraction) and WRITE tasks (form filling, authentication, file downloads). Findings reveal that while agents perform adequately on READ tasks, they struggle significantly with WRITE tasks, highlighting areas for improvement in browser automation.
3. LMEval: An Open Source Framework for Cross-Model Evaluation
Google’s LMEval is an open-source framework designed to streamline the evaluation of large language models across different providers. Leveraging the LiteLLM framework, LMEval offers compatibility with major model providers like Google, OpenAI, Anthropic, and Hugging Face. It supports multimodal benchmarks and various scoring metrics and features an incremental evaluation engine to optimize performance assessments efficiently.
4. TRL Integrates the Liger Kernel for GRPO Training
Hugging Face has integrated the Liger kernel into its TRL library to enhance Group Relative Policy Optimization (GRPO) training. This optimization, alongside FSDP and PEFT support, reduces peak memory usage during GRPO training without compromising model quality.
Quick Links
1. Perplexity launches Perplexity Labs, a suite of tools for building data-driven outputs. Labs enables users to generate reports, spreadsheets, dashboards, and simple web apps using tools like deep web browsing, code execution, and chart creation. It performs self-directed workflows over 10 minutes or more to automate complex tasks.
2. Meta AI surpasses one billion monthly active users. Its scale comes from automatic integration into Facebook, Instagram, and WhatsApp, giving it a distribution advantage. However, the value of this usage is unclear, and Meta’s models risk falling behind in professional and developer use cases.
3. Anthropic CEO warns of significant AI-driven job displacement. CEO Dario Amodei says leaders are “sugar-coating” the risks of AI-driven job losses, especially in entry-level roles across finance, tech, law, and consulting. He argues that companies need to be more transparent about the disruptive potential of highly capable AI systems.
4. Exa explains how it evaluates search and ranking quality. The team shares their internal metrics for measuring relevance, including automated scores and human judgments, highlighting the tradeoffs between faster and more accurate ranking methods.
Who’s Hiring in AI
Head of Developer Experience and Community @Mistral AI (Palo Alto, CA, USA)
AI Engineer @AI Fund (Palo Alto, CA, USA/Hybrid)
Summer Intern — AI/ML Engineer @HARMAN INTERNATIONAL INDUSTRIES INC (Bellevue, WA, USA)
AI & GenAI Data Scientist — Director @PwC (Seattle, WA, USA)
AI Engineer @fabric (Toronto & Vancouver/Canada)
Software Engineer — LLM & Agent Integration @Firstup (Remote/US)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.