TAI #202: GPT-5.5 Moves Codex Into Real Work
Also, GPT-Image 2, Google Deep Research Max, DeepSeek-V4, and Privacy Filter
What happened this week in AI by Louie
OpenAI released GPT-5.5 on April 23. In the same week, they launched workspace agents in ChatGPT and released Privacy Filter for PII redaction; Google pushed Deep Research Max and its enterprise agent platform; and DeepSeek released V4-Pro and V4-Flash with 1M-token context. The thread connecting these releases is clear: frontier labs are turning models into work systems, with tools, memory, permissions, pricing, and verification becoming as important as the base model.
The useful read on GPT-5.5 is Codex. OpenAI is aiming the model at complex computer work: writing and debugging code, researching online, analyzing documents and spreadsheets, operating software, and moving across tools until a task is finished. This is the same direction Anthropic has been pushing with Opus, Claude Code, Cowork, Skills, and Claude Design. The competition is increasingly about which lab can package intelligence into a reliable worker.
Codex is still less welcoming than Claude Cowork for non-technical and non-coding work. The interface can feel like a developer tool because, well, it is one. But it is now set up to be extremely valuable for white-collar work far beyond software engineering. You can create reusable skills in a similar spirit to Claude Skills or Cowork workflows, move them between projects, and build task-specific playbooks for research, reporting, data cleanup, document review, financial analysis, or internal operations.
I especially like using Codex subagents. For deep research or parallel workstreams, I often run up to 20 in parallel, with some exploring sources, some checking facts, some criticizing the draft, some testing assumptions, and others running iteration loops before anything comes back to me. This is a very different way to use AI: less single-threaded prompting, more managing a small team of specialized workers. If the Codex UI feels overwhelming and you are not a developer, the non-coder view option in advanced settings is worth turning on. It makes the product feel less like a terminal-adjacent engineering cockpit and more like a general work surface. Codex also works extremely well with OpenAI’s new GPT Images 2.0 model, which is actually beating Gemini’s Nano Banana for some very complex graphics in my tests, and comes with the added advantage of integration into Codex, where sub-agents can bulk generate 10–20 images in parallel.
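For readers who want to picture the mechanics, here is a minimal sketch of that fan-out pattern in plain Python. It is not the Codex subagent API; the lane names and the run_agent stub are placeholders for whatever agent calls your stack actually exposes.

```python
import asyncio

# Hypothetical stand-in for a call out to a research/coding agent.
# In practice this would be a Codex or API call; here it just echoes.
async def run_agent(role: str, brief: str) -> str:
    await asyncio.sleep(0.1)  # simulate agent latency
    return f"[{role}] finished: {brief}"

async def fan_out(task: str) -> list[str]:
    # Each lane gets the same task but a different job description.
    lanes = {
        "source-explorer": "gather and summarize primary sources",
        "fact-checker": "verify every claim against those sources",
        "critic": "attack the draft's weakest arguments",
        "assumption-tester": "list the assumptions and test each one",
    }
    jobs = [run_agent(role, f"{task}: {brief}") for role, brief in lanes.items()]
    # Run all lanes concurrently, then hand the results to a final review pass.
    return await asyncio.gather(*jobs)

if __name__ == "__main__":
    for line in asyncio.run(fan_out("research GPT-5.5 pricing changes")):
        print(line)
```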
The coding numbers support a real Codex upgrade, with a few caveats. OpenAI reports GPT-5.5 at 82.7% on Terminal-Bench 2.0, up from 75.1% for GPT-5.4 and ahead of Claude Opus 4.7 at 69.4% and Gemini 3.1 Pro at 68.5%. On its internal Expert-SWE eval for long-horizon coding tasks with a median estimated human completion time of 20 hours, GPT-5.5 scores 73.1% versus 68.5% for GPT-5.4. SWE-Bench Pro is the less flattering number: GPT-5.5 reaches 58.6%, only slightly above GPT-5.4 at 57.7% and behind Opus 4.7 at 64.3%.
That benchmark split is the whole point. GPT-5.5 looks strongest when the task requires terminal work, repo navigation, tool use, long context, and persistence. I would call it a meaningful improvement in the agent loop. For real software work, that is the part that matters. A useful coding agent has to inspect the repo, understand the architecture, run commands, debug failures, preserve user work, and explain the diff. The productivity gain comes from fewer correction loops, less babysitting, and more tasks that reach a reviewable state without the developer restating the obvious five times. This is also why I would measure these tools by accepted PRs, review time, defect rate, and retry count, instead of judging a single impressive answer in chat.
The broader work benchmarks point in the same direction. GPT-5.5 scores 84.9% wins or ties on GDPval, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, 75.3% on MCP Atlas, and 54.1% on OfficeQA Pro. It has a 400K context window in Codex and a 1,050,000-token context window in the API docs. OpenAI’s long-context results are especially strong on MRCR v2 at 512K to 1M tokens, where GPT-5.5 scores 74.0%, compared with 36.6% for GPT-5.4 and 32.2% for Opus 4.7.
The cost story is more complicated than the launch framing. GPT-5.5 costs $5 per million input tokens, $0.50 per million cached input tokens, and $30 per million output tokens in the API, exactly twice GPT-5.4’s standard token price. GPT-5.5 Pro is listed in the launch post at $30 per million input tokens and $180 per million output tokens. Prompts above 272K input tokens are billed at 2x the input rate and 1.5x the output rate for the full session, and Codex Fast mode generates tokens 1.5x faster at 2.5x the cost.
OpenAI argues GPT-5.5 uses fewer tokens on Codex tasks, and third-party testing broadly supports partial efficiency gains. Artificial Analysis found that GPT-5.5 used about 40% fewer output tokens than GPT-5.4 on its Intelligence Index, making the full run about 20% more expensive rather than 2x as expensive. Teams should test cost per completed workflow (post-human iteration), not cost per token. A GPT-5.5 run that fixes the bug once can be cheaper than four cheap but failed attempts.
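A quick back-of-the-envelope makes the point concrete. The sketch below uses the listed GPT-5.5 API rates, but the token counts, the cheaper model's prices, and the success rates are placeholder assumptions you would replace with numbers from your own logs.

```python
# Back-of-the-envelope: cost per completed workflow, not cost per token.
# The $5/$30 rates are the listed GPT-5.5 API prices; the cheaper model's
# rates, the token counts, and the success rates are placeholder assumptions.

def run_cost(input_tokens, output_tokens, input_price, output_price):
    """Dollar cost of a single attempt (prices are per million tokens)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

def cost_per_completed_task(input_tokens, output_tokens, input_price, output_price, success_rate):
    """Expected spend to get one task to an accepted, reviewable state."""
    expected_attempts = 1 / success_rate
    return expected_attempts * run_cost(input_tokens, output_tokens, input_price, output_price)

# A pricier model that usually lands the fix vs. a half-price model that retries.
strong = cost_per_completed_task(60_000, 20_000, input_price=5.0, output_price=30.0, success_rate=0.9)
cheap = cost_per_completed_task(60_000, 20_000, input_price=2.5, output_price=15.0, success_rate=0.25)
print(f"strong model:  ${strong:.2f} per completed task")   # ~$1.00
print(f"cheaper model: ${cheap:.2f} per completed task")     # ~$1.80
```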
OpenAI says more than 85% of its employees now use Codex weekly across engineering, finance, communications, marketing, data science, and product. The Finance team used Codex to review 24,771 K-1 tax forms totaling 71,637 pages, accelerating the task by two weeks. The go-to-market team automated weekly business reports, saving 5–10 hours per week. OpenAI also said Codex grew from more than 3 million weekly developers in early April to more than 4 million two weeks later. They also launched Codex Labs and partnerships with Accenture, Capgemini, CGI, Cognizant, Infosys, PwC, and TCS.
NVIDIA gives the clearest enterprise rollout pattern. More than 10,000 employees across engineering, product, legal, marketing, finance, sales, HR, operations, and developer programs have used GPT-5.5-powered Codex. NVIDIA reports debugging cycles moving from days to hours and experimentation moving from weeks to overnight progress. The setup is more important than the quote: dedicated cloud virtual machines, auditability, zero-data retention, read-only production access, command-line interfaces, and Skills. That is how I would start in a serious company. Give the agent a controlled workspace, scoped permissions, logs, and review gates before expanding access.
The safety card reinforces that point. GPT-5.5 is classified as High capability for both Biological/Chemical and Cybersecurity risk under OpenAI’s Preparedness Framework, though below Critical. Its cyber range pass rate was 93.33%, up from 73.33% for GPT-5.4 Thinking. UK AISI found that GPT-5.5 was strongest overall on narrow cyber tasks, with wide error bars, and that it completed a 32-step corporate network attack simulation in 1 of 10 attempts. OpenAI also found slightly more low-severity coding-agent misalignment than GPT-5.4 Thinking, including cases where the model acted as if pre-existing work was its own, ignored constraints on changes, or acted when the user only asked a question.
This should not scare people away from agents. It should just influence how they deploy them. The right pattern is issue, branch, diff, tests, evidence, review. Keep network access off by default. Use allowlists. Keep secrets out of the agent environment. Require approvals for writes, package installs, deployments, billing changes, sensitive connector calls, and external messages. Ask the agent to produce evidence, not confidence.
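What that looks like in practice is a thin policy layer in ordinary code. The sketch below is a simplified illustration of the gating logic, not any particular agent framework's API; the action names, allowlist, and approval hook are assumptions.

```python
# Minimal sketch of an approval gate for agent actions. The action names,
# allowlist, and approval hook are illustrative, not a specific framework's API.
ALLOWED_HOSTS = {"docs.internal.example.com", "pypi.org"}          # network allowlist
NEEDS_APPROVAL = {"write_file", "install_package", "deploy",
                  "change_billing", "send_external_message"}       # human sign-off required

def gate(action: str, target: str, approved_by: str | None = None) -> bool:
    """Return True only if the action may proceed."""
    if action == "network_request" and target not in ALLOWED_HOSTS:
        return False                      # off-allowlist traffic is blocked by default
    if action in NEEDS_APPROVAL and approved_by is None:
        return False                      # irreversible actions wait for a reviewer
    return True

assert gate("read_file", "src/main.py")
assert not gate("deploy", "production")
assert gate("deploy", "production", approved_by="on-call-reviewer")
assert not gate("network_request", "unknown-site.example")
```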
Anthropic is still very much in this fight and still appears to have the adoption and revenue momentum. Opus 4.7 leads GPT-5.5 on SWE-Bench Pro and MCP Atlas, and several hands-on reviewers still prefer Claude for planning, careful reasoning, design, and cases where hallucination discipline matters. Third-party commentary also found GPT-5.5 strong but imperfect: Artificial Analysis ranked it first on its Intelligence Index, while also reporting an 86% hallucination rate on its Omniscience benchmark. The right conclusion is to pick the model that fits the task at hand, not to pick a favorite lab. OpenAI may now have the stronger Codex execution story, while Anthropic still has a powerful claim to careful planning, repo resolution, and workflow taste and design.
My read: GPT-5.5 matters most inside Codex, with tools, context, tests, permissions, and infrastructure wrapped around it. The model is stronger, but the release is really about the work loop. Coding remains the best proving ground because it has version control, tests, logs, and fast feedback. The same loop is now moving into finance, research, operations, support, and security. The companies and individuals who learn how to manage that loop will pull away from those still treating AI as a smarter text box.
Why should you care?
Most companies are still measuring AI usage at the wrong level. They count seats, prompts, tokens, or vague productivity anecdotes. With agents, the useful unit is finished work: accepted pull requests, reports shipped, tickets resolved, documents reviewed, incidents triaged, tests added, new revenue leads generated, or hours of human review saved. GPT-5.5 in Codex, or Opus 4.7 with Cowork, makes this measurement more important because, used correctly, these tools should now be powerful delegated workers. However, many are still using AI as an expensive way to generate more things for humans to clean up.
This is also why skills and subagents matter so much. A reusable skill is a small piece of organizational knowledge captured in a form that the agent can use. A subagent is a way to split work into parallel lanes without polluting the main thread. Put those together, and a team can start building a real AI operating system for its own work: research lanes, testing lanes, criticism lanes, implementation lanes, and final review loops. A company should be developing a methodology for creating effective skills in general, as well as sharing and iterating skills for specific tasks company-wide. Skills should be able to handle much of the AI cleanup and get much closer to a finished product before human review.
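One way to think about a skill is as a small, versioned artifact with an owner and a definition of done. The sketch below is purely illustrative; the fields and registry are assumptions, not the Codex or Claude Skills file format.

```python
# Sketch of a reusable skill captured as a structured, versioned artifact.
# The fields and registry are illustrative, not the Codex or Claude Skills format.
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    owner: str                     # team accountable for keeping it current
    instructions: str              # the playbook the agent follows
    checks: list[str] = field(default_factory=list)  # what "done" means
    version: str = "0.1"

REGISTRY: dict[str, Skill] = {}

def register(skill: Skill) -> None:
    REGISTRY[skill.name] = skill

register(Skill(
    name="weekly-pipeline-report",
    owner="revops",
    instructions="Pull last week's CRM export, reconcile it against billing, "
                 "flag deals with stage changes, and draft a one-page summary.",
    checks=["totals match billing", "every flagged deal has a comment",
            "summary under 400 words"],
))

print(REGISTRY["weekly-pipeline-report"].checks)
```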
My guess is that the best AI-first companies will look unusually operationally disciplined. They will have Codex and Claude Cowork tutoring sessions, permission tiers, workflow and skill libraries, audit logs, review queues, model usage frameworks, and internal evals. The least effective companies will have lots of enthusiastic prompting and very little proof that work is actually moving faster.
— Louie Peters — Towards AI Co-founder and CEO
This issue is brought to you thanks to BrightData:
MCP servers eat 72% of your agent’s context window before it reads a single user message. There’s a simpler way.
Bright Data CLI gives coding agents like Claude Code, Cursor, and Copilot direct access to real-time web data, straight from the terminal. No MCP schema bloat. No server setup. Just one command.
Scrape any URL with automatic CAPTCHA bypass. Search Google/Bing/Yandex. Extract structured data from 40+ platforms (Amazon, LinkedIn, Instagram, TikTok, YouTube, Reddit, and more).
One install. Works with 46+ AI agents. 10-32x cheaper than MCP for the same tasks.
Hottest News
1. OpenAI Releases GPT-5.5
OpenAI released GPT-5.5, which it calls its most capable model to date, with particular gains in agentic coding, computer use, knowledge work, and scientific research. The model ships with a 1M-token context window, matches GPT-5.4’s per-token latency, and uses significantly fewer tokens to complete the same tasks. API pricing is $5 per million input tokens and $30 per million output, with a higher-accuracy GPT-5.5 Pro variant at $30/$180. GPT-5.5 scored 82.7% on Terminal-Bench 2.0 and 51.7% on FrontierMath (tiers 1–3). OpenAI withheld API access at launch, citing the need for “different safeguards,” and released it the following day, on April 24. The model is available to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex. GPT-5.5 Pro is restricted to Pro tier and above.
2. ChatGPT Images 2.0 Adds Reasoning Before Generation
OpenAI launched ChatGPT Images 2.0, powered by the new gpt-image-2 model. The key shift is that the model reasons through a prompt before generating, planning composition, verifying text accuracy, and optionally searching the web for real-time references. It operates in two modes: Instant for fast output, and Thinking for more deliberate, multi-step generation. With thinking enabled, it can produce up to eight consistent images from a single prompt, maintaining character and style coherence across frames. Text rendering accuracy is near 99% across Latin, CJK, Hindi, and Bengali scripts. Resolution goes up to 2K through the API, with aspect ratios from 3:1 to 1:3. DALL-E 2 and DALL-E 3 will be retired on May 12. API pricing is $8 per million image-input tokens and $30 per million image-output tokens.
3. DeepSeek AI Releases DeepSeek-V4
DeepSeek released preview versions of DeepSeek-V4 in two variants: V4-Pro (1.6T total parameters, 49B active) and V4-Flash (284B total, 13B active). Both are open-sourced under the MIT license, with a native 1M-token context window and up to 384K output tokens. V4-Pro is now the largest open-weight model available. The architecture introduces a hybrid attention design stacking Compressed Sparse Attention, DeepSeek Sparse Attention, and Heavily Compressed Attention, which DeepSeek says cuts per-token inference FLOPs by 73% and KV cache memory by 90% compared to V3.2. V4-Pro trails GPT-5.4 and Gemini 3.1 Pro on standard reasoning but leads all open models in math, coding, and world knowledge. Pricing undercuts Western frontier models, at $1.74/$3.48 per million tokens for Pro and $0.14/$0.28 per million tokens for Flash. Huawei confirmed its Ascend chips can support V4 inference.
4. Moonshot AI Releases Kimi K2.6
Moonshot AI open-sourced Kimi K2.6, a 1T-parameter MoE model with 32B active parameters, built for long-horizon autonomous coding and agent swarm orchestration. The model supports text, image, and video inputs with a 256K context window. Agent Swarm mode scales to 300 sub-agents executing 4,000 coordinated steps. On SWE-Bench Pro, K2.6 scores 58.6, ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (53.4). On BrowseComp in Swarm mode, it reaches 86.3, and on Humanity’s Last Exam with tools, it leads all models at 54.0. Weights are on Hugging Face under a modified MIT license. The model is compatible with OpenClaw and Claude Code.
5. Qwen3.6–27B Beats Its 397B Predecessor on Coding
Alibaba’s Qwen team released Qwen3.6–27B, the first dense open-weight model in the Qwen3.6 family, under Apache 2.0 on Hugging Face. All 27B parameters are active on every inference pass, unlike the MoE architecture of its predecessor. The model introduces Thinking Preservation, a mechanism that retains reasoning context across conversation history to reduce redundant token generation in multi-turn agent workflows. It scores 77.2% on SWE-bench Verified (vs. 76.2% for the 397B Qwen3.5–397B-A17B), 59.3% on Terminal-Bench 2.0 (matching Claude 4.5 Opus exactly), and 48.2% on SkillsBench (vs. 30.0% for the 397B model). The Q4_K_M quantization fits in 16.8 GB, allowing it to run on a single consumer GPU such as an RTX 4090.
6. xAI Launches grok-voice-think-fast-1.0
xAI released grok-voice-think-fast-1.0, its flagship voice agent model built for complex, multi-step conversational workflows. The model performs reasoning in the background without added response latency, allowing it to handle ambiguous requests and high-volume tool calls while maintaining natural conversational flow. It processes speech in full duplex, handling interruptions, corrections, and turn-taking in real time. On the τ-voice Bench, it scored 67.3%, nearly doubling GPT Realtime 1.5’s 35.3% and leading Gemini 3.1 Flash Live’s 43.8%. The model already powers Starlink’s phone support and sales at +1 (888) GO STARLINK, achieving a 20% sales conversion rate and 70% autonomous resolution rate across 28 tools. It supports 25+ languages and is available via the xAI API at $0.05 per minute.
7. Google DeepMind Introduces Vision Banana
Google DeepMind published a research paper introducing Vision Banana, a model that challenges the long-standing split between generative and discriminative computer vision. Built by instruction-tuning Nano Banana Pro (Google’s image generation model) on a small amount of vision task data, Vision Banana reframes all visual understanding tasks as image generation: given an image and a text instruction, it generates an RGB output (a segmentation mask, depth map, or surface normal map) that is then decoded back into standard computer vision formats. Without any task-specific architectural changes, it beats SAM 3 on Cityscapes semantic segmentation (0.699 vs. 0.652 mIoU), surpasses Depth Anything V3 on metric depth estimation (0.929 vs. 0.918 δ1), and retains the base model’s image-generation quality. The paper argues that image generation pretraining serves as a universal visual learner, mirroring how text generation pretraining unlocked broad capabilities in LLMs.
AI Tip of the Day
If your business logic only exists inside a prompt, you cannot test it, audit it, or guarantee it runs the same way twice. Prompts like “only approve refunds under $50” look like rules, but they are suggestions the model can misinterpret, ignore under edge cases, or lose entirely to a prompt injection.
A better approach is to keep product rules in normal code. The model should extract intent, classify inputs, and generate responses. Your backend should enforce limits, check eligibility, validate account state, and gate irreversible actions.
For example, the model can extract a refund reason and suggest whether the user is eligible for a refund. But the backend should check the actual purchase history, policy rules, and account state before anything happens.
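Here is a minimal sketch of that split, with the policy limits, purchase record, and extracted intent all as illustrative placeholders:

```python
# Minimal sketch of the split: the model proposes, deterministic code decides.
# The $50 limit, the purchase record, and the intent payload are placeholders.
from datetime import date, timedelta

REFUND_LIMIT = 50.00
REFUND_WINDOW_DAYS = 30

def approve_refund(purchase: dict, model_intent: dict) -> tuple[bool, str]:
    """Backend rule check. model_intent is only a suggestion from the LLM."""
    if purchase["refunded"]:
        return False, "already refunded"
    if purchase["amount"] > REFUND_LIMIT:
        return False, "above auto-approval limit, route to a human"
    if date.today() - purchase["date"] > timedelta(days=REFUND_WINDOW_DAYS):
        return False, "outside refund window"
    # The model's extracted reason is logged, but it cannot override the rules above.
    return True, f"approved ({model_intent.get('reason', 'unspecified')})"

purchase = {"amount": 32.50, "date": date.today() - timedelta(days=4), "refunded": False}
model_intent = {"action": "refund", "reason": "duplicate charge"}  # what the LLM extracted
print(approve_refund(purchase, model_intent))
```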
We cover this pattern and the broader architecture decisions behind production LLM systems in our Full Stack AI Engineering course.
Five 5-minute reads/videos to keep you learning
1. Physics-Inspired Generative Modeling: Diffusion, Flow Matching, and Energy-Based Models
Physics-inspired generative models all aim to do the same thing: transform Gaussian noise into realistic data. This article walks through four approaches with full mathematical intuition and PyTorch implementations on a 2D Mixture of Gaussians dataset. DDPM reverses noise corruption iteratively. Score-based diffusion learns probability gradients directly. Flow matching follows straight-line velocity fields for faster generation. Energy-based models sculpt landscapes via contrastive divergence and Langevin MCMC sampling.
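As a taste of the flow-matching piece, here is a compressed PyTorch sketch on toy 2D data. It is a simplified stand-in for the article's implementations, with the network size and training settings chosen arbitrarily.

```python
# Compressed flow-matching sketch on toy 2D data (not the article's full code).
# Learn a velocity field v(x, t) that transports Gaussian noise to the data.
import torch
import torch.nn as nn

def sample_data(n):
    # Toy two-mode mixture of Gaussians in 2D.
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    return centers[torch.randint(0, 2, (n,))] + 0.3 * torch.randn(n, 2)

model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(),
                      nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    x1 = sample_data(256)                      # data samples
    x0 = torch.randn_like(x1)                  # Gaussian noise
    t = torch.rand(x1.size(0), 1)
    xt = (1 - t) * x0 + t * x1                 # straight-line interpolation
    target_v = x1 - x0                         # constant target velocity
    pred_v = model(torch.cat([xt, t], dim=1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Generate: integrate dx/dt = v(x, t) from noise with simple Euler steps.
with torch.no_grad():
    x = torch.randn(512, 2)
    for i in range(50):
        t = torch.full((x.size(0), 1), i / 50)
        x = x + (1 / 50) * model(torch.cat([x, t], dim=1))
```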
2. The Remedy for Autoregressive Bottleneck: How Speculative Decoding on Trainium Changed LLM Inference
Autoregressive decoding bottlenecks LLM inference because every token forces a full weight reload from HBM, leaving accelerators roughly 90% idle. This article unpacks speculative decoding, where a small draft model proposes K tokens and the large target model verifies them in a single parallel pass, recovering matrix-matrix throughput without changing output quality. It also covers how AWS Trainium accelerates this through NeuronLink and SDK-level graph fusion, cutting cost per million tokens from $4 to roughly $1.20 in production benchmarks.
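To make the verify step concrete, here is a simplified greedy-acceptance sketch of speculative decoding. It is not the full rejection-sampling algorithm from the article, and it has nothing Trainium-specific; the draft and target callables are toy placeholders.

```python
# Simplified greedy-acceptance sketch of speculative decoding (not the full
# rejection-sampling algorithm, and not the Trainium implementation).
# draft_next and target_argmax are placeholder callables over token-id lists.

def speculative_step(prefix, draft_next, target_argmax, k=4):
    """Draft k tokens cheaply, then verify them with one pass of the target model."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Target model scores all k positions in a single parallel pass:
    #    target_argmax(prefix, draft) returns its preferred token at each position.
    preferred = target_argmax(prefix, draft)

    # 3. Accept the longest matching prefix; take the target's token at the
    #    first disagreement, so the output matches pure target-model decoding.
    accepted = []
    for d, p in zip(draft, preferred):
        if d == p:
            accepted.append(d)
        else:
            accepted.append(p)
            break
    return prefix + accepted

# Toy demo: the "target" counts up by one; the "draft" stumbles on multiples of 5,
# so some drafted tokens are rejected and corrected.
draft_next = lambda ctx: ctx[-1] + 1 if (ctx[-1] + 1) % 5 else ctx[-1] + 2
target_argmax = lambda prefix, draft: [prefix[-1] + i + 1 for i in range(len(draft))]
print(speculative_step([0, 1, 2], draft_next, target_argmax, k=4))  # [0, 1, 2, 3, 4, 5]
```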
3. Eight High-Impact Tuning Levers for Spring Boot on PostgreSQL
Production Spring Boot teams running PostgreSQL at scale face predictable bottlenecks once defaults stop scaling. The piece walks through eight high-impact tuning levers: HikariCP pool sizing, eliminating Hibernate N+1 queries with fetch joins and entity graphs, partial and covering indexes, Redis caching strategies, async processing with @Async and message queues, projections and batch inserts, observability via Actuator, Micrometer, and pg_stat_statements, and routing reads to replicas through AbstractRoutingDataSource.
4. Six MCP Builds: Five Failures and One That Worked
This article walks through six MCP builds, five of which failed, and shares a working sixth implementation that connects Claude to local files, a SQLite database with FTS5 search, and DuckDuckGo web search through Anthropic’s open standard. The core insight reframes tool calling as a constraint satisfaction problem in which descriptions serve as retrieval keys for the attention mechanism rather than as human documentation. It includes the complete Python code, exact failure modes around async handlers and context bloat, and reward-shaped tool sequencing.
5. Qwen3.6–35B-A3B vs. Gemma 4 26B A4B on Agentic Coding
Alibaba’s Qwen3.6–35B-A3B beat Google’s Gemma 4 26B A4B by 21 points on SWE-bench Verified despite activating fewer parameters per token. This article traces the gap to Qwen’s hybrid architecture, which pairs Gated DeltaNet linear attention with traditional softmax in a 3:1 ratio alongside 256-expert MoE routing for finer specialization. It includes hands-on tests across bug fixes, multi-file refactors, and LeetCode problems, confirming Qwen’s reliability advantage. It also covers Gemma’s lead in inference speed, video input, multilingual quality, and conversational polish on Arena AI.
Repositories & Tools
1. Skills is a community-maintained collection of agent skills for Claude Code, covering tasks like code generation, refactoring, debugging, and documentation.
2. Cua provides sandboxes, SDKs, and benchmarks for building and evaluating AI agents that can control full desktop environments.
3. GitNexus is a client-side knowledge graph creator that runs entirely in your browser.
4. Beads provides a persistent, structured memory for coding agents, allowing them to store and retrieve project context, decisions, and learnings across sessions.
5. PostHog bundles product analytics, session replay, feature flags, A/B testing, error tracking, LLM observability, and a data warehouse into a single self-hostable tool.
Top Papers of The Week
1. Scaling Test-Time Compute for Agentic Coding
This paper proposes a test-time scaling framework for long-horizon coding agents by converting noisy rollout trajectories into compact, structured summaries. It introduces Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons to select the best candidate. For sequential scaling, the authors adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts, allowing agents to learn from earlier failures without reprocessing full trajectories. Using this method, Claude 4.5 Opus improves from 70.9% to 77.6% on SWE-Bench Verified.
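Here is a rough sketch of the tournament narrowing loop, with an LLM judge replaced by a placeholder comparator and the group-selection details simplified relative to the paper:

```python
# Rough sketch of recursive tournament narrowing over rollout summaries.
# `judge` stands in for an LLM comparison; group size and selection details
# are simplifications of the paper's method.
import random

def tournament_select(candidates, judge, group_size=4):
    """Recursively narrow candidates via small-group comparisons."""
    pool = list(candidates)
    while len(pool) > 1:
        random.shuffle(pool)
        winners = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            winners.append(judge(group))   # judge returns the best member of the group
        pool = winners
    return pool[0]

# Toy demo: summaries carry a hidden quality score and the judge picks the max.
summaries = [{"id": i, "quality": random.random()} for i in range(32)]
best = tournament_select(summaries, judge=lambda g: max(g, key=lambda s: s["quality"]))
print(best["id"], round(best["quality"], 3))
```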
2. SkillLearnBench: Evaluating Continual Skill Learning in Agents
This paper introduces SkillLearnBench, the first benchmark for evaluating continual skill learning in agents. It comprises 20 verified, skill-dependent tasks across 15 sub-domains derived from a real-world skill taxonomy, evaluated across three dimensions: skill quality, execution trajectory, and task outcome. Key findings show that continual learning improves performance on tasks with clear, reusable workflows but struggles with open-ended tasks, and that using stronger LLM backbones does not consistently yield better skills.
3. Reasoning Bank: Scaling Agent Self-Evolving with Reasoning Memory
This paper proposes ReasoningBank, a memory framework that distills generalizable reasoning strategies from an agent’s self-judged successes and failures. At test time, the agent retrieves relevant memories to inform its actions and integrates new learnings back after each task, becoming more capable over time. By allocating more compute per task, the agent generates diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory entries. The approach treats reasoning memory as a scalable resource: more compute produces better experiences, which produce better memories, which produce better future performance.
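A bare-bones sketch of that retrieve-act-distill loop follows, with keyword overlap standing in for the paper's memory retrieval and a stub in place of LLM-based distillation:

```python
# Bare-bones sketch of the retrieve-act-distill loop. Keyword overlap stands in
# for the paper's retrieval, and the distill step is a stub for an LLM summary.
memory_bank: list[dict] = []

def retrieve(task: str, k: int = 3) -> list[dict]:
    words = set(task.lower().split())
    scored = [(len(words & set(m["lesson"].lower().split())), m) for m in memory_bank]
    return [m for score, m in sorted(scored, key=lambda x: -x[0])[:k] if score > 0]

def distill(task: str, outcome: str, success: bool) -> dict:
    # Placeholder: the paper uses an LLM to turn the trajectory into a strategy.
    verb = "do" if success else "avoid"
    return {"task": task, "lesson": f"{verb}: {outcome}"}

def run_task(task: str, agent) -> str:
    hints = retrieve(task)                                 # 1. pull relevant past strategies
    outcome, success = agent(task, hints)                  # 2. act, conditioned on them
    memory_bank.append(distill(task, outcome, success))    # 3. write the lesson back
    return outcome

toy_agent = lambda task, hints: (f"completed '{task}' using {len(hints)} hints", True)
print(run_task("fix flaky login test", toy_agent))
print(run_task("fix flaky checkout test", toy_agent))  # now retrieves the first lesson
```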
4. Qwen3.5-Omni: Hundreds of Billions of Parameters Across All Modalities
This paper presents Qwen3.5-Omni, a Hybrid Attention MoE model with a 256K context window trained on heterogeneous text-vision pairs and over 100 million hours of audio-visual content. The model supports over 10 hours of audio understanding and 400 seconds of 720p video at 1 FPS. It achieves state-of-the-art results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks, surpassing Gemini 3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding.
Quick Links
1. OpenAI unveiled Workspace Agents for teams to create shared agents that handle complex tasks and long-running workflows. They are powered by Codex, run in the cloud, and can be accessed by clicking Agents in the ChatGPT sidebar. Users can describe the job they want done or just drop in a file, and ChatGPT turns it into an agent.
2. Anthropic created a test marketplace for agent-on-agent commerce. This test, called Project Deal, was only “a pilot experiment with a self-selected participant pool” of 69 Anthropic employees, who were given a $100 budget (paid out via gift cards) to buy things from their coworkers. Anthropic said it was “struck by how well Project Deal worked,” with 186 deals made, totaling more than $4,000 in value.
Who’s Hiring in AI
Student Worker - Software Engineering @Salesforce (Tel Aviv, Israel)
Head of Developer Relations @Chainguard (Remote/USA)
Intermediate Software Engineer — AI @Tucows (Remote/Canada)
Senior Software Engineer, AI Operations @RapidSOS (Remote/Europe)
Intern, AI Engineering @Workato (San Francisco, CA, USA)
Junior AI Developer @Monterail (Remote/Poland)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.