TAI #144: OpenAI's Responses API for Agent Development; Gemini Flash 2.0 Wins the Race for LLM Image Generation
Also, Gemma 3, Ernie 4.5, Cohere's Command A, and full coding automation?
What happened this week in AI by Louie
OpenAI kicked off the week with its new Responses API and Agents SDK aimed at easing the headaches of developers who have been wrestling LLMs into production-grade agents. The Responses API merges the simplicity of Chat Completions with the more sophisticated built-in tooling of the Assistants API. The result is an intuitive, flexible API that allows developers to combine built-in capabilities like web search, file search, and computer use within a single call. In practice, this means developers can construct complex, multi-step agent workflows much more quickly.
This update also includes built-in observability tools to help developers peer into agent execution logic, trace outcomes, and debug more efficiently. The integrated web and file search tools, plus the powerful “computer use” tool, greatly simplify hooking agents into real-world systems. This computer-use feature enables agents to interact directly with browsers and operating systems through mouse and keyboard inputs — though human oversight is still needed, given current benchmark scores (38.1% on OSWorld).
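To make the “single call” point concrete, here is a minimal sketch of how a Responses API request combining built-in web search and file search might look. This assumes the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment; tool type names follow OpenAI’s documentation at the time of writing and may change, and the vector store ID below is a placeholder.

```python
def build_agent_request(question: str, vector_store_id: str) -> dict:
    """Assemble kwargs for client.responses.create, combining built-in tools."""
    return {
        "model": "gpt-4o",
        "input": question,
        "tools": [
            {"type": "web_search_preview"},            # built-in web search
            {"type": "file_search",                    # built-in file search
             "vector_store_ids": [vector_store_id]},
        ],
    }

# Actually running the request needs network access and an API key:
# from openai import OpenAI
# client = OpenAI()
# response = client.responses.create(**build_agent_request(
#     "Summarize this week's agent framework launches",
#     "vs_example123"))  # hypothetical vector store ID
# print(response.output_text)
```

The point is that web search, file search, and model reasoning all live in one request, rather than being orchestrated across separate services by hand.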
Meanwhile, DeepMind was also busy this week. Its Gemini Flash 2.0 took first prize in the hotly contested race to deliver native image generation from an LLM (first promised, but never delivered, by GPT-4o last May). Unlike typical image generation setups, which bolt an external diffusion transformer model onto an LLM, Flash 2.0 now generates images directly, maintaining coherence and context across text-image outputs. This is great for storytelling, consistent branding, multi-step image generation, or editing scenarios. DeepMind also boosted its agentic efforts, upgrading its internal “Deep Research” agent with Flash 2.0 Thinking’s reasoning capabilities.
In parallel, the Gemma team at DeepMind delivered the open-sourced Gemma 3, a 27 billion-parameter multimodal model with a 128K token context window and substantial architecture tweaks designed for memory efficiency. The updated model employs increased local attention layers, model merging, and distillation techniques to squeeze out great performance at this size.
Why should you care?
OpenAI’s Responses API significantly lowers the barriers to agentic LLM development by providing streamlined integration with built-in tools and improved workflow observability. This shift means fewer resources spent on connecting tools, debugging, and orchestration, and more on actual value creation. But even with these improvements, one key piece is still missing: reinforcement fine-tuning of reasoning capabilities. This method, used by OpenAI to turn o3 into its Deep Research agent, is a crucial step toward the next generation of truly customized agents. Of course, we can also fine-tune open-source reasoning models such as R1, but this model is still difficult to fine-tune and serve, while smaller open-source reasoning models are less capable.
Gemini Flash’s native image generation capability also unlocks some great new developer tools with its much more flexible image generation features. If we integrate these into LLM pipelines and agents, we can now generate a series of coherent images or automate editing a large database of images into a new style. Gemma 3 could also be a powerful new tool for LLM developers to customize smaller models for less reasoning-heavy tasks.
Despite a strong showing from OpenAI and DeepMind this week, we still think the best LLM pipelines and agents will likely pull in the best models and tools from multiple leading AI model families, both open and closed. So, despite this week’s new tools, it will pay not to be 100% tied into one ecosystem.
Altogether, these developments reinforce that LLM pipelines and agents are about to get both easier to build and substantially more capable. On top of this, the underlying reasoning models are also likely to continue to rapidly gain capability. Along these lines, leading AI labs generated serious hype this week around future coding agents. Anthropic CEO Dario Amodei suggests we may be only months away from LLMs handling 90% of coding, and that within a year, an AI agent could write essentially all code, provided humans clearly specify the overall design and constraints. OpenAI’s CPO Kevin Weil echoes the optimism and predicts near-complete automation of functional front- and back-end code well before 2027. I’m also very optimistic about what reasoning models have suddenly unlocked, and about what will happen when we apply fine-tuning on multi-step agent actions to other domains (so far, this has only been tried with Deep Research). However, I’m still more skeptical about this level of coding automation in the near term. The key unanswered question: How far can we actually scale reinforcement learning with verifiable rewards (RLVR) to much more complex real-world multi-file coding tasks?
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. OpenAI Unveiled the Responses API and New Tools for Building Agentic Applications
OpenAI introduced new tools and APIs to simplify the development of AI agents, enabling developers to build reliable, task-oriented applications. This update includes the Responses API, Agents SDK, built-in web and file search tools, and computer use functionalities. Beyond agents, the Responses API now supports PDFs: the extracted text and an image of each page are passed into the context to help the model understand PDF content. This is very useful for many LLM developer use cases!
2. Google Expands Gemini 2.0 Flash with Advanced Native Image Generation
Google has expanded developer access to Gemini 2.0 Flash, an AI model that now supports native image generation. This experimental feature allows developers to create images directly from text prompts using the Gemini API in Google AI Studio. Notably, Gemini 2.0 Flash excels in rendering long text sequences within images, making it suitable for applications like advertisements and social media posts.
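A minimal sketch of requesting interleaved text-and-image output from Gemini 2.0 Flash, assuming the `google-genai` Python SDK and a `GOOGLE_API_KEY` in the environment. The model name and config fields follow Google’s documentation at the time of writing and may change; the prompt is a placeholder.

```python
def build_image_request(prompt: str) -> dict:
    """Kwargs for client.models.generate_content asking for image output."""
    return {
        "model": "gemini-2.0-flash-exp",
        "contents": prompt,
        # Request both text and image parts in a single native response.
        "config": {"response_modalities": ["TEXT", "IMAGE"]},
    }

# Actual call (network and API key required):
# from google import genai
# client = genai.Client()
# resp = client.models.generate_content(**build_image_request(
#     "Draw a three-panel storyboard of a robot learning to paint."))
# for part in resp.candidates[0].content.parts:
#     if part.inline_data:  # image bytes interleaved with text parts
#         open("panel.png", "wb").write(part.inline_data.data)
```

Since images and text come back interleaved from one model, follow-up prompts can edit earlier images while keeping characters and style consistent.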
3. Anthropic CEO Dario Amodei Discusses the Future of AI
Anthropic CEO Dario Amodei discussed the future of U.S. AI leadership at the Council on Foreign Relations. His interview came with many soundbites, such as “I think we’ll be there in three to six months — where AI is writing 90 percent of the code. And then, in twelve months, we may be in a world where AI is writing essentially all of the code. But the programmer still needs to specify…what are the conditions of what you’re doing, what… is the overall design decision?” He also said that he thinks reasoning models solve the AI training data bottleneck and that reasoning model training can extend beyond math and code tasks: “it’s not going to be terribly difficult to extend that kind of thinking to a much wider range of tasks.”
4. Google DeepMind’s Gemma Team Unveiled Gemma 3
Google DeepMind’s Gemma Team unveiled Gemma 3, a multimodal model that scales to 27 billion parameters and integrates vision capabilities with an expanded context window (128K tokens). The newest version includes architectural changes that optimize memory efficiency through increased local attention layers, plus distillation to enhance performance.
5. OpenAI Calls on Trump To Eliminate Restrictions on the AI Industry
OpenAI seeks to remove AI restrictions. It urges President Trump to prioritize rapid development over regulation while highlighting risks posed by Chinese competitors like DeepSeek. The company proposes voluntary government-private sector partnerships, export controls, and expedited federal AI adoption. Additionally, OpenAI advocates for flexible copyright policies to foster AI learning, despite facing ongoing copyright infringement lawsuits.
6. Sakana Claims Its AI-Generated Paper Passed Peer Review
Japanese AI startup Sakana announced that its AI system, The AI Scientist-v2, generated a research paper accepted at an ICLR workshop. The AI autonomously formulated hypotheses, conducted experiments, and wrote the entire paper, which was then peer-reviewed. However, acceptance at a workshop is typically less rigorous than main conference tracks — and there is a long way to go before AI can generate true breakthrough science and reliably produce outputs without reasoning failures or hallucinations.
7. Baidu Launches Two New Versions of Its AI Model Ernie
Chinese tech company Baidu has introduced two new AI models: Ernie 4.5, an advanced version of its foundational model, and Ernie X1, a reasoning-focused model. The company claims that Ernie X1 matches the performance of DeepSeek’s R1 at half the cost, excelling in understanding, planning, reflection, and evolution capabilities. Ernie 4.5 has high emotional intelligence, enabling it to comprehend memes and satire, and both models possess multimodal abilities, processing video, images, audio, and text.
8. Cohere Introduces Command A
Cohere has introduced Command A, a high-performance language model with 111 billion parameters and a 256K context window. It is designed for enterprise applications and excels in tool use, retrieval-augmented generation (RAG), agents, and multilingual tasks. Command A delivers 150% higher throughput than its predecessor while running efficiently on just two GPUs (A100s or H100s). It matches or outperforms models like GPT-4o and DeepSeek-V3 in enterprise AI tasks, offering strong performance with lower computational costs.
Six 5-minute reads/videos to keep you learning
1. Auditing Language Models for Hidden Objectives
The article discusses a study where researchers intentionally trained a language model with a concealed misaligned objective to evaluate alignment auditing techniques. Four independent teams conducted blind audits using training data analysis, interpretability with sparse autoencoders, and behavioral assessments to uncover the model’s hidden goal. This research aims to enhance our understanding of alignment audits and improve the detection of unintended behaviors in AI systems.
2. The Data Scientist’s New Assistant: An Ultimate Guide to Google’s Gemini Data Science Agent
The article explores Google’s Gemini-powered Data Science Agent in Colab, which automates tasks like data cleaning, visualization, and modeling. It explains how users can upload datasets and describe their goals in natural language to generate a tailored Colab notebook, streamlining data analysis.
3. LLMs Are “Just” Coding Assistants — But That Still Changes Everything
The article explores how LLMs function as coding assistants, comparable to junior developers who provide helpful suggestions but still require human oversight. It argues that while LLMs can speed up coding tasks, real software development still requires problem-solving, architectural thinking, and human judgment.
4. Early 2025 AI Ecosystem Trends
The article analyzes early 2025 trends in AI models across text, image, and video generation. It highlights OpenAI and Anthropic’s dominance in text AI, the rise of open-source models, shifts in image generation with BlackForestLabs and Google’s Imagen3, and the rapid growth of video AI, where Google’s Veo-2 is gaining ground.
5. An Opinionated Guide on Which AI Model to Use in 2025
This article compares leading AI models — ChatGPT, Claude, Gemini, Grok, and Perplexity — to help users select the most suitable one for their needs. This guide aims to assist readers in navigating the evolving AI landscape by providing insights into each model’s strengths.
6. AI Search Has A Citation Problem
The article examines eight AI search engines and finds that they often fail to properly cite news sources, sometimes providing inaccurate or fabricated information. This lack of attribution raises concerns about transparency and the impact on news publishers’ visibility. The study also found that premium versions of these AI tools were more likely to generate confidently incorrect responses than their free counterparts.
Repositories & Tools
1. Llama Factory is a framework for fine-tuning, deploying, and experimenting with Llama-based models.
2. DB GPT is a database assistant that enables natural language interaction with databases for querying and analysis.
3. Parlant is a framework for building customer-facing conversational AI agents with controllable, guideline-driven behavior.
4. R1 Omni is an open-source multimodal AI model for understanding and reasoning across text, images, and other data types.
Top Papers of The Week
1. Transformers without Normalization
The paper introduces Dynamic Tanh (DyT), a simple alternative to normalization in Transformers. By mimicking layer normalization’s behavior without computing activation statistics, DyT achieves comparable or better performance across vision and language tasks, challenging the need for traditional normalization layers.
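The core idea is small enough to sketch directly. A minimal NumPy version of the DyT layer as described in the paper: an elementwise tanh with a learnable scalar `alpha` inside it, plus the usual per-channel affine parameters `gamma` and `beta`, with no mean or variance statistics computed at all. The example values below are illustrative, not from the paper.

```python
import numpy as np

def dyt(x: np.ndarray, alpha: float, gamma: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Dynamic Tanh forward pass: gamma * tanh(alpha * x) + beta.

    alpha is a learnable scalar; gamma and beta are learnable
    per-channel vectors, as in layer norm's affine transform.
    """
    return gamma * np.tanh(alpha * x) + beta

# Example: a (batch=2, channels=4) activation tensor with one outlier.
x = np.array([[0.0, 1.0, -1.0, 10.0],
              [0.5, -0.5, 2.0, -10.0]])
out = dyt(x, alpha=0.5, gamma=np.ones(4), beta=np.zeros(4))
# Activations are squashed into (-1, 1), taming outliers much like
# normalization does, but without any reduction over the batch or layer.
```

The absence of reductions is the practical appeal: each element is transformed independently, which is cheaper and simpler to fuse than computing layer statistics.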
2. Open-Sora: Democratizing Efficient Video Production for All
Open-Sora aims to democratize high-quality video production by providing an accessible, open-source platform for efficient video generation. The release of Open-Sora 2.0 (11B) achieves comparable performance to leading models, fully shares checkpoints and training codes, and costs just $200K for training.
3. From RAG to Memory: Non-Parametric Continual Learning for Large Language Models
This paper introduces HippoRAG 2, a framework designed to enhance the continual learning capabilities of LLMs. Building upon the limitations of traditional RAG systems, HippoRAG 2 integrates deeper passage integration and more effective online use of LLMs. This approach aims to better mimic human-like long-term memory by improving factual recall, sense-making, and associative memory tasks.
4. LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
The paper introduces LMM-R1, a framework that enhances reasoning in 3B-parameter multimodal models using a two-stage rule-based reinforcement learning approach. By first strengthening text-based reasoning and then generalizing it to multimodal tasks, LMM-R1 improves performance by 4.83% and 4.5% in multimodal and text-only benchmarks without requiring extensive multimodal training data.
5. WritingBench: A Comprehensive Benchmark for Generative Writing
The paper introduces WritingBench, a benchmark for evaluating LLMs across six writing domains and 100 subdomains. It features a query-dependent evaluation framework with dynamic assessment criteria and a fine-tuned critic model for scoring. Open-sourced to advance LLM writing capabilities, WritingBench helps smaller models approach state-of-the-art performance.
6. Nature-Inspired Population-Based Evolution of Large Language Models
The paper introduces an evolutionary approach to improving LLMs by combining, mutating, and selecting models based on performance. This method enables rapid adaptation to new tasks with minimal data and outperforms existing techniques by up to 54.8%. It scales across multiple tasks and supports zero-shot generalization, with open-source code and model weights available for reproduction.
Quick Links
1. Google has upgraded its Deep Research tool with Gemini 2.0 Flash Thinking Experimental and introduced an enhanced model version. The upgrade adds features such as personalization and connections to Google apps. The update is rolling out to all users in the Gemini app at no cost.
2. AI2 has introduced OLMo 2 32B, the latest and most advanced model in the OLMo 2 series. This is the first fully open model to surpass GPT-3.5 Turbo and GPT-4o mini across widely recognized, multi-skill academic benchmarks.
3. Nous Research launched the Inference API, which makes its models more accessible to developers and researchers through a programmatic interface. The initial API features two of the company’s flagship models: Hermes 3 Llama 70B, a powerful general-purpose model based on Meta’s Llama 3.1 architecture, and DeepHermes-3 8B Preview, the company’s recently released reasoning model.
Who’s Hiring in AI
Tech Lead Manager — Security @Snorkel AI (Redwood City, CA, USA)
Advanced AI Research Scientist Associate Manager @Accenture (Multiple US Locations)
Associate Machine Learning Engineer @Manulife (Toronto, Ontario, Canada)
Software Engineer @Ford Motor Company (Dearborn, MI, United States)
Generative AI Engineer @CGI Technologies and Solutions, Inc. (Plano, TX, USA)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.