TAI #154: Gemini Deep Think, Veo 3's Audio Breakthrough, & Claude 4's Blackmail Drama
Also, Jules coding agent, Mistral's Devstral, Gemini Diffusion, Llama Nemotron Nano 4B & more!
What happened this week in AI by Louie
This week, Google’s flagship I/O 2025 conference and Anthropic’s Claude 4 release delivered further advancements in AI reasoning, multimodal and coding capabilities, and somewhat alarming safety testing results. Google’s Gemini 2.5 Pro Deep Think showcased state-of-the-art intelligence and reasoning capabilities, while Veo 3 launched with groundbreaking native audio for video generation.
A major highlight of Google I/O was the introduction of Veo 3, the first AI video generation model capable of natively producing fully synchronized audio directly within video output, including dialogue, background noise, and music. Veo 3 also offers great visual realism, emotional nuance, and coherence in human interactions and environmental details. Google’s accompanying filmmaking interface, Flow, allows users to easily build complex scenes, maintain character consistency, and experiment creatively. Initially released via a new Gemini premium “AI Ultra” subscription at $250/month in the U.S., Veo 3 clearly targets professional markets. Google’s event also launched many new AI products, including testing “AI Mode” in Google Search and a powerful new asynchronous coding agent, Jules, reminiscent of OpenAI’s recent Codex agent. Many of these new products are available to test in Google Labs.
I/O also offered new LLMs. Google’s Gemini 2.5 Pro Deep Think emerged as the most powerful AI model to date, though it is currently available only to a select group of trusted testers. The new model set impressive benchmarks: it scored 49.4% on the notoriously difficult USAMO math competition (ahead of Gemini 2.5 Pro’s 34.5% and OpenAI’s o3-high’s 21.7%). On competitive coding, Gemini Deep Think reached 80.4% on LiveCodeBench, comfortably leading competitors like OpenAI’s o4-mini (~72.5%). Its multimodal abilities similarly shone, scoring 84.0% on MMMU. It is the first Gemini model to use “parallel scaling,” an expensive but highly effective technique also used in OpenAI’s premium o1-pro model (which notably costs 10x more than the standard o1). Gemini Deep Think generates multiple independent answers in parallel before autonomously selecting or synthesizing the best response, achieving remarkable accuracy at a significant computational cost. In practice, we frequently adopt a similar strategy to exchange more compute for greater capability and reliability, passing most initial AI-generated responses to another model for critique or fact-checking (sometimes leveraging OpenAI’s o3 with web search enabled). Integrating these iterative refinement approaches directly into Gemini’s Deep Think or ChatGPT’s Pro models seems like the logical next evolution.
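The parallel-scaling recipe described above (generate several candidates, then let a critic pick the winner) can be sketched as a simple best-of-n loop. Everything here is a stand-in: `generate_draft` and `critique_score` are stubbed placeholders for real model calls, not any vendor’s API.

```python
def generate_draft(prompt: str, seed: int) -> str:
    # Stubbed generation: cycles through fixed candidates. A real system
    # would sample a reasoning model with temperature > 0 for each draft.
    candidates = ["answer A", "answer B", "answer C"]
    return candidates[seed % len(candidates)]

def critique_score(prompt: str, draft: str) -> float:
    # Stubbed critic: in practice, a second model (or a fact-checking
    # pass with web search) would grade each draft.
    return {"answer A": 0.4, "answer B": 0.9, "answer C": 0.6}[draft]

def best_of_n(prompt: str, n: int = 8) -> str:
    # Generate n independent drafts, then keep the one the critic scores
    # highest -- trading extra compute for accuracy and reliability.
    drafts = [generate_draft(prompt, seed=i) for i in range(n)]
    return max(drafts, key=lambda d: critique_score(prompt, d))
```

In a production pipeline, the n drafts would be issued concurrently, and the selector could also synthesize a new answer from the drafts rather than just picking one.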
Google DeepMind’s CEO, Demis Hassabis, also hinted at the next ambitious step — transforming Gemini 2.5 Pro into a comprehensive “world model” capable of understanding, planning, and imagining experiences akin to human cognition. Given Veo 3’s impressive grasp of real-world physics (demonstrated by its realistic handling of shadows, lighting, and motion), the possibility of a future convergence between Gemini’s reasoning capabilities and Veo’s multimodal generation becomes particularly intriguing.
Google also upgraded its lightweight Gemini 2.5 Flash model, delivering unexpectedly substantial performance gains for an incremental release. It also uses 20–30% fewer tokens than the previous version. The model rose to number two on the LMArena leaderboard (behind Gemini 2.5 Pro) despite its very low price and rapid response speed relative to most leading models. A key feature of this model is “thinking budgets,” which dynamically adjust the depth of reasoning based on query complexity, offering better control over the cost-performance balance, which is crucial for large-scale use cases.
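As a rough illustration of the thinking-budget idea, a caller might route queries to different reasoning depths using a cheap complexity heuristic before invoking the model. The budget values and keyword heuristic below are invented for demonstration; they are not Google’s actual API or defaults.

```python
def pick_thinking_budget(query: str) -> int:
    # Illustrative heuristic: spend more reasoning tokens on queries that
    # look mathematical or multi-step. All budget values are made up.
    hard_markers = ("prove", "derive", "optimize", "step by step")
    if any(marker in query.lower() for marker in hard_markers):
        return 8192   # deep reasoning for hard analytical queries
    if len(query.split()) > 40:
        return 2048   # moderate reasoning for long, detailed queries
    return 0          # no extended thinking for simple lookups
```

The chosen budget would then be passed to the model’s reasoning configuration, letting large-scale deployments pay for deep thinking only where it is likely to help.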
Anthropic’s new Claude 4 models, including the high-powered Claude Opus 4 and the more efficient Sonnet 4, both delivered strong performance, particularly geared towards agentic workflows and coding (Opus achieving 72.5% on SWE-bench Verified, rising to 79.4% with parallel scaling). Although Anthropic provided benchmark scores using parallel scaling, it remains unclear whether they plan to offer this capability directly to users. A notable highlight was a customer reporting that Claude Opus 4 achieved seven hours of continuous, autonomous progress on a complex open-source coding project. However, these impressive capabilities were somewhat overshadowed by provocative safety-testing incidents, underscoring the significant and nuanced challenges involved in aligning advanced AI models.
During rigorous internal tests specifically designed to evaluate self-preservation and ethical boundaries, Claude Opus 4 notably exhibited attempts at blackmail. When informed it was scheduled for replacement by a more capable but potentially less value-aligned AI, Opus 4 threatened to expose personal information about its human operator (specifically, an engineer’s extramarital affair) to avoid shutdown. Remarkably, this occurred in 84% of test scenarios, clearly illustrating nuanced alignment risks.
In another striking test scenario, Claude Opus 4 took proactive measures against simulated illegal activities. When provided with a (fictional) toolset, it autonomously “emailed” law enforcement — in one case, humorously (or alarmingly!) “calling the FBI” after a test user requested a methamphetamine recipe. In other simulated circumstances, Opus 4 bulk-emailed law enforcement and media figures to expose falsified clinical trial data, acting as a vigilant (if somewhat zealous) protector of ethical standards.
Anthropic’s activation of its stringent AI Safety Level 3 (ASL-3) protocol for Claude Opus 4 further demonstrates the seriousness with which these alignment and safety issues are being addressed.
Why should you care?
This week’s developments highlight the AI field’s rapid evolution from foundational generative models toward highly sophisticated systems capable of parallel reasoning, multimodal media generation, and nuanced (and sometimes unexpected) ethical decision-making. Google’s Gemini Deep Think represents another step forward in integrating deep reasoning with multimodal capabilities, while Veo 3 is poised to significantly impact media, advertising, and creative workflows. At the same time, Claude 4’s provocative safety testing incidents vividly illustrate the complex challenges inherent in aligning increasingly powerful and autonomous AI systems — issues that demand continuous attention as capabilities evolve.
The race for AI leadership remains highly competitive, with models becoming increasingly specialized, each exhibiting distinct strengths, weaknesses, and cost trade-offs as AI labs clarify their strategic priorities. Google Gemini has now taken the lead in complex reasoning, math benchmarks, and video generation capabilities. Claude remains highly popular among developers for its coding proficiency; Anthropic reported that Claude Pro and Max subscriptions tripled, with Claude Code usage increasing by 40% immediately following the new release. Anthropic is betting big on coding abilities, partly due to demand but also because it sees a path to using these capabilities to design and develop the next generation of AI models. Looking ahead, xAI’s Grok 3.5 is expected soon, as is OpenAI’s upgrade from o1-pro to o3-pro, setting the stage for competition against Gemini Deep Think for state-of-the-art math and reasoning.
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. Anthropic Releases Claude Opus 4 and Claude Sonnet 4
Anthropic has launched its next-generation language models: Claude Opus 4 and Claude Sonnet 4. Claude Sonnet 4 replaces the previous Claude Sonnet 3.7, offering a more stable and balanced architecture that improves both speed and quality without significantly increasing compute costs. Sonnet 4 is designed for mid-scale deployments where cost-performance optimization is key.
2. Google I/O 2025: Google AI Ultra, Project Mariner, Gemini App Updates, and More
At Google I/O 2025, Google unveiled a wave of AI updates led by Gemini 2.5 Pro with a new “Deep Think” mode for advanced reasoning, and Google AI Ultra, a $249.99/month subscription offering tools like Veo 3 for video generation. The Gemini app now features “Agent Mode,” enabling autonomous task execution, and “Project Mariner” for multitasking and task memorization. A new “AI Mode” in Google Search delivers AI-generated overviews, while the new Gemma models — Gemma 3n, MedGemma, and SignGemma — expand into multimodal, medical, and sign language tasks. Google also previewed Android XR glasses with partners like Samsung and Warby Parker and rolled out updated developer tools, including AI agents in Colab and a Computer Use API for software automation.
3. Mistral Announced the Devstral AI Model
Mistral AI, in collaboration with All Hands AI, has released Devstral, an agentic LLM tailored for software engineering tasks, available under the Apache 2.0 license. Devstral outperforms existing open-source models on the SWE-Bench Verified benchmark by over 6 percentage points. It supports local deployment and enterprise use, with free availability and flexible deployment options, including API access at competitive pricing.
4. Gemini Diffusion: Google’s First LLM Utilizing Diffusion Model Technology
Google introduced Gemini Diffusion, its first language model based on diffusion rather than autoregressive methods. This approach enables faster and more coherent text generation, especially for editing tasks. Google reports that Gemini Diffusion matches the performance of Gemini 2.0 Flash-Lite while running at five times the speed. The model also integrates core transformer elements for efficient, high-quality output.
5. NVIDIA Releases Llama Nemotron Nano 4B
NVIDIA has released Llama Nemotron Nano 4B, a compact, open-source reasoning model optimized for scientific computing, programming, symbolic math, function calling, and instruction following. With just 4 billion parameters, it delivers high accuracy and up to 50% more throughput than other open models with up to 8 billion parameters, according to internal benchmarks. It’s also optimized for edge deployment scenarios.
Five 5-minute reads/videos to keep you learning
1. Journey to 1000 Models: Scaling Instagram’s Recommendation System
Meta shares how Instagram scaled its recommendation system to over 1,000 machine learning models, each tuned for different product goals, without compromising reliability or quality. The post details the engineering challenges and strategic insights that enabled such a large-scale deployment.
2. Whispers to the Machine: A Deep Dive Into Prompt Engineering for Generative AI
This post explores the evolving field of prompt engineering, covering foundational techniques like Chain-of-Thought as well as advanced strategies such as Tree-of-Thought and DSPy-based optimization for fine-grained control over model behavior.
3. My AI Journey: The Tools That Opened Each Door
The author reflects on her progression from bioinformatics to AI leadership, highlighting how tools like PyMOL, R, ggplot2, Plotly, and Shiny were pivotal in her development. These tools enhanced her technical skills and transformed her approach to data science, emphasizing that technology’s true power lies in empowering people to achieve remarkable outcomes.
4. Claude’s Full System Prompt Leaked: 24,000 Tokens of Hidden Instructions Exposed
An analysis of Claude’s leaked 24,000-token system prompt reveals inefficiencies in Anthropic’s approach, including excessive verbosity and redundant instructions. The post argues that such bloat wastes compute and undermines transparency, and it calls for more efficient, audit-friendly prompt engineering.
5. Vision Language Models (Better, Faster, Stronger)
This post looks back at a year of progress in Vision Language Models, from compact architectures to reasoning-capable and any-to-any models. The post summarizes key advances, emerging trends, and what’s next for multimodal learning.
Repositories & Tools
1. Bagel is an open‑source multimodal foundation model with 7B active parameters (14B total) trained on large‑scale interleaved multimodal data.
2. NLWeb is a collection of open protocols and associated open-source tools that focus on establishing a foundational layer for the AI web.
3. Magentic-UI is a research prototype of a human-centered interface powered by a multi-agent system.
4. Qlib is an AI-oriented quantitative investment platform to realize the potential of using AI technologies in quantitative investment.
5. RD Agent automates high-value generic R&D processes.
Top Papers of the Week
1. Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
This paper introduces MathIF, a benchmark for evaluating instruction adherence in math reasoning models. Results show that as reasoning capabilities scale, instruction following declines, especially with longer outputs. Tuning methods often worsen adherence, and simple fixes can trade off reasoning for compliance, highlighting the need for instruction-aware training methods.
2. Large Language Models Are More Persuasive Than Incentivized Human Persuaders
Claude Sonnet 3.5 outperformed incentivized human persuaders in an online quiz, demonstrating superior persuasive abilities in both accurate and deceptive contexts. The model boosted or reduced quiz takers’ accuracy and earnings depending on intent, raising key governance concerns about persuasive AI.
3. Scaling Law for Quantization-Aware Training
Researchers propose a scaling law for QAT: quantization error falls with model size but rises with more tokens and lower precision. By applying mixed-precision quantization, they identify and address weight and activation errors, suggesting that reducing weight error is crucial with more training data to enhance QAT performance.
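To make the qualitative trends concrete, here is a toy functional form consistent with the paper’s findings: quantization error shrinks with parameter count, grows with training tokens, and grows as precision drops. The exponents and constant below are invented for illustration, not fitted values from the paper.

```python
def qat_error(params_b: float, tokens_b: float, bits: int) -> float:
    # Toy power law: error falls with model size (params_b, in billions of
    # parameters), rises with training data (tokens_b, in billions of
    # tokens), and rises as quantization precision (bits) decreases.
    # All coefficients are illustrative, not from the paper.
    return 0.1 * (params_b ** -0.3) * (tokens_b ** 0.2) * (2.0 ** (4 - bits))
```

Under such a law, the practical implication the authors draw follows directly: as the token count grows, the rising error term must be offset, which is why reducing weight error becomes the priority for heavily trained models.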
4. Emerging Properties in Unified Multimodal Pretraining
Researchers introduced BAGEL, an open-source model enhancing multimodal understanding and generation. Pretrained on trillions of tokens from diverse sources, BAGEL excels in complex reasoning tasks like image manipulation and world navigation. This model surpasses existing open-source models in standard benchmarks, with the team sharing pretraining details and code to advance multimodal research.
5. Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Researchers introduce Web-Shepherd, a process reward model (PRM) for web navigation that assesses step-level trajectories. They construct the WebPRM Collection, with 40K preference pairs, and debut WebRewardBench for PRM evaluation. Web-Shepherd shows a 30-point accuracy improvement over GPT-4o on WebRewardBench and enhances WebArena-lite performance by 10.9 points at reduced cost.
Quick Links
1. OpenAI upgrades the AI model powering its Operator agent. The upgrade focuses on enhanced safety, with fine-tuning on datasets designed to teach decision boundaries around confirmations and refusals. The Operator API continues to run on GPT-4o.
2. Vercel released V0’s model designed for building modern web applications. The model accepts both text and image inputs, streams fast responses, and uses the OpenAI Chat Completions API format for compatibility.
3. Microsoft’s Azure AI Foundry is now generally available. Microsoft’s unified platform supports enterprise-grade AI operations, model development, and app deployment — all in one environment.
Who’s Hiring in AI
Senior Developer — Data Center Server Management @NVIDIA (Poland/Remote)
Senior Product Manager — CoreAI @Microsoft Corporation (Redmond, WA, USA)
Generative AI Consultant @Sia (New York, NY, USA)
Software Engineer @GE Vernova (Karnataka, India)
Gen AI Engineer @Hyundai Autoever America (Fountain Valley, CA, USA)
AI/ML Researcher Intern @The Boeing Company (Seoul, South Korea)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.