TAI #205: Codex Special Edition; Interviewing Romain Huet, OpenAI's Head of Developer Experience
Also, Cursor Composer 2.5, Codex on Mobile, Grok Build & more
What happened this week in AI by Louie
This is a special Codex edition. Codex is shipping major new features almost every week, while OpenAI’s GPT-5 model family is getting a major new release almost every month. The last few weeks pushed both lines hard. OpenAI shipped GPT-5.5 into Codex, launched Codex for mobile with remote supervision, expanded enterprise controls, launched the OpenAI Deployment Company, agreed to acquire Tomoro, and partnered with Dell to bring Codex closer to hybrid and on-prem enterprise data. GitHub, Cursor, Anthropic, and Cognition also shipped new agent features aimed at the same loop: assign work, let an agent execute, inspect the result, and push it into production.
The timing was useful because I interviewed Romain Huet, OpenAI’s Head of Developer Experience, on May 13. His central point is worth sitting with: “It is almost like the definition of a developer itself is changing.” When Romain joined OpenAI three years ago, developer experience meant helping people bring models into their products through the API. Codex flips that path. “The real magic is that anyone can now build anything they want with AI,” he said. “Anyone is a developer.”
GPT-5.5 landed in Codex and ChatGPT on April 23, with a 400K context window in Codex and 1M in the API. The benchmark sweep is strong across terminal use, computer use, long context, and partner workflow tests, but the more useful signal is adoption. Romain told me a million enterprise customers picked up GPT-5.5 in its first week, the fastest adoption OpenAI has ever seen for a new model.
The product shift is the surrounding harness. Codex now has subagents, in-app browser use, Chrome extension support for signed-in sites, macOS Computer Use, plugins, skills, hooks, mobile remote control, auto-review, and enterprise managed configuration. The model is becoming the worker. Codex is becoming the work surface, the permission layer, and the coordination layer. As Romain put it, “we are now far, far, far beyond writing the code.”
Romain calls this “agentic delegation.” The best Codex workflows look less like prompting and more like managing a small team. OpenAI’s docs frame subagents as a way to reduce context pollution: keep the main thread focused on requirements, decisions, and final outputs, while separate agents handle exploration, tests, log analysis, and review. For complex research, I now run 10 to 20 subagents in parallel. Ten gather relevant sources, one checks benchmarks, one looks for practitioner complaints, one extracts quotes, one criticizes the draft, another verifies dates and numbers.
This all also changes how I think about AI education. The important skill is work design: breaking a vague goal into bounded tasks, giving each agent the right tools and context, defining what it must return, and setting up verification. Romain thinks people are still holding back. “I find myself, and I think most people probably are still somewhat shy at times of asking Codex very, very complex things,” he said. He also pushed back on the older view of prompting as a craft to learn. “It’s less and less really like a magic thing to learn, as opposed to just treating Codex as your teammate. It’s really your thought partner. Things might not be crisp in your mind just yet. That’s okay. You can still dictate to Codex a lot of what’s on your mind.” His suggested curriculum: “People need to understand agentic delegation, tool use, the harness, and learn to be curious and ambitious with what the models can accomplish.”
This is also why he sees Codex as an on-ramp to building AI products. “Codex has become the gateway to the API almost,” he told me. A developer asks Codex to generate an image, use a browser, reason over documents, or connect a tool, and then the obvious next thought is: can I put that capability inside my own product? Romain went further: “We also think about an agent as kind of the primary consumer almost of our documentation, of our SDKs, of our tools.” That is a practical change for every developer-tools company. Documentation now needs to be readable by humans and useful to agents wiring the product into a real codebase.
The non-developer angle is more complicated. Claude Cowork still feels more inviting for many non-technical users. Codex grew out of code, Git diffs, sandboxes, terminals, and pull requests, so it can feel like a developer tool. The gap is shrinking. The in-app browser lets users leave visual comments on a page. Computer Use lets Codex operate graphical apps on macOS with explicit permissions. Plugins connect Gmail, Google Drive, Slack, GitHub, and other systems. Mobile remote access lets you approve commands, inspect screenshots, review diffs, and steer long-running work from your phone. Romain framed the design challenge well: “If you’re not writing code for a living, you don’t really want to see a git repo and the concept of a pull request. Maybe what you’re trying to accomplish is just creating slide decks or some Excel sheets.” For readers who find the default interface intimidating, check whether the simplified settings are available in your Codex app.
Romain’s own usage has shifted hard. “Now there’s not a single task that I start without Codex,” he said, and pointed out that the shift reaches far beyond OpenAI’s engineering team: “Everything has changed, not just for me, but for everyone at the company.” The examples in OpenAI’s GPT-5.5 post back this up. Comms analyzed six months of speaking-request data and built a scoring framework with an automated Slack agent. Finance reviewed 24,771 K-1 tax forms across 71,637 pages, accelerating the task by two weeks. Go-to-Market staff automated weekly reporting and saved 5 to 10 hours per week.
The enterprise announcements make more sense in that light. On May 11, OpenAI launched the OpenAI Deployment Company, agreed to acquire Tomoro, and said the new company would start with around 150 Forward Deployed Engineers and Deployment Specialists, with more than $4 billion of initial investment. On May 18, OpenAI and Dell announced a partnership to bring Codex to hybrid and on-prem enterprise environments through the Dell AI Data Platform, with the stated goal of getting Codex closer to codebases, documents, business systems, and team workflows.
Romain framed the bottleneck plainly. The hard work is everything around the model: “how do you get access to the right dataset, how do you clean it for the model to actually access it the right way? All of these things that have to go right for something like Codex to do great work.” The Deployment Company is the bandwidth to do that at scale. “We really want to make sure that enterprises can adopt these new tools because we see GPT-5.5 as the turning point,” he told me.
NVIDIA is the clearest public example. OpenAI’s customer story says 40,000 NVIDIANs have access to Codex and reports a 10x speed improvement in end-to-end research workflows. Databricks is bringing GPT-5.5 into enterprise agent workflows through AI Unity Gateway, AgentBricks, and Agent Supervisor API. These are vendor and customer-story claims, so some caution applies, but the direction is consistent.
The caveat is reliability. The last month also included stream disconnects, overload errors, compaction complaints, quota frustration, and a GPT-5.5 degradation incident. Power users report both sides: Codex can plan, review, and debug at a higher level than previous releases, then waste quota on retries, approval waits, or a bloated diff that still needs cleanup. The practical advice is the same as with every serious agent workflow: use it hard, verify harder. Give Codex tests, diff review, browser checks, source constraints, and narrow permissions.
Romain emphasized that verification loop. Codex can test its own work, open browsers at multiple screen sizes, inspect UI states, and iterate. “It can actually check its work and until it’s actually correct, it will keep on iterating,” he said. That is a different product category from autocomplete.
The competitive race is now about the full delegated-work loop. Anthropic is packaging Claude Code, Cowork, Design, finance agents, Microsoft 365 integrations, and the PwC rollout into a coherent non-developer work story. GitHub owns the issue-to-pull-request-to-CI-to-merge path. Cursor is pushing cloud-agent environments, Teams delegation, Bugbot, and Composer 2.5. Cognition is turning Devin toward incident triage. Codex’s advantage is OpenAI’s model platform, the harness, subagents, plugins, browser and computer use, mobile supervision, and the API bridge. The open question is whether OpenAI can keep it approachable for non-technical users without losing the power that engineers want.
My current read is that Codex is becoming the most important OpenAI product for people who build systems, even when those systems are no longer traditional software. The people who learn to use Codex well will produce apps, reports, workflows, research panels, dashboards, slide decks, and internal automations far faster than people who only use chat. The work still needs judgment, domain expertise, and verification. The center of gravity is moving from writing the code yourself to designing the work and managing the agents that carry it out.
Why should you care?
Codex is becoming the bridge from software developer to AI engineer. A strong Codex user learns context engineering, tool use, agent planning, verification, model routing, retrieval, browser automation, permissions, and deployment almost by accident. The next step is obvious: build those capabilities into products and workflows for other people.
This changes what we should teach. Retrieval-augmented generation, fine-tuning, and API calls still matter, but they sit inside a broader agent curriculum: how models use tools, how harnesses constrain them, how to evaluate outputs, how to route tasks across models, how to design safe permissions, and how to turn a messy business process into a repeatable AI workflow. Romain put the underlying point well: the model is where the raw IQ lives, but “similar to a human, it will only do great work if it has access to the tools that it needs to execute this work and also verify the work.” Agents are also becoming primary consumers of documentation and SDKs, which means builders need to write for humans and agents at the same time.
For Towards AI, this is exactly where our education and consulting work is moving. The valuable skill is turning domain expertise into agent-ready workflows that can be tested, governed, improved, and reused. Codex is one of the clearest paths into that future because it lets people feel the whole loop from idea to working system.
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
The Ramp AI Index for May shows Anthropic surpassed OpenAI in business adoption for the first time (34.4% vs. 32.3%) as overall AI adoption among American businesses crossed 50%. Anthropic quadrupled its business adoption share over the past year while OpenAI grew by only 0.3%, driven largely by the coding assistant market. However, the article cautions that Anthropic’s lead may be fragile: Claude has faced frequent outages, rate limits, and growing user dissatisfaction, while its latest model update tripled token costs for image prompts. OpenAI’s Codex operates more cheaply, and open-source inference platforms rank among Ramp’s fastest-growing AI vendors. Switching costs between providers remain minimal, and companies like Uber have already blown through their 2026 AI budgets, suggesting that cost pressure could reshape the rankings.
OpenAI unveiled Daybreak on May 10, 2026, an AI-powered cyber defense initiative that bundles frontier models, the Codex Security agentic system, and a network of over 20 security partners into a unified platform for vulnerability detection and remediation. The initiative deploys three model tiers: GPT-5.5 for general-purpose use with standard safeguards, GPT-5.5 with Trusted Access for Cyber for verified defensive work such as secure code review, malware analysis, and detection engineering, and GPT-5.5-Cyber for red teaming and penetration testing under strict access controls requiring phishing-resistant authentication starting June 1, 2026. At its core, Codex Security, originally launched in March 2026 as a developer tool, ingests entire repositories, builds codebase-specific threat models, maps realistic attack paths, validates issues in isolated environments, and proposes patches for human review, reducing analysis from hours to minutes. Partners including CrowdStrike, Cloudflare, Palo Alto Networks, Cisco, and Trail of Bits are already integrating the platform. All tiers explicitly prohibit credential theft, stealth persistence, malware deployment, and unauthorized exploitation.
OpenAI launched personal finance tools for ChatGPT Pro subscribers in the U.S. on May 15, available in preview on web and iOS. The feature integrates with Plaid to connect to over 12,000 financial institutions, including Schwab, Fidelity, Chase, Robinhood, American Express, and Capital One. Built on OpenAI’s April 2026 acquisition of fintech startup Hiro, previously backed by Ribbit, General Catalyst, and Restive, the tools let users analyze spending, track subscriptions and upcoming payments, plan budgets, and review portfolio performance through an interactive dashboard. ChatGPT stores user-provided financial context such as savings goals and planned purchases as “financial memories,” which users can view and delete. Access is read-only; users cannot move money through ChatGPT. OpenAI plans to expand the feature to Plus subscribers and is working with Intuit to enable capabilities like tax impact analysis and credit card approval odds.
xAI released Grok Build on May 15, 2026 as an early beta, a terminal-based coding agent and CLI aimed at professional software engineering and complex coding work. Available exclusively to SuperGrok Heavy subscribers at $300/month, the tool positions itself as a direct competitor to Anthropic’s Claude Code, OpenAI’s Codex, and Google’s Gemini CLI. The launch follows CEO Elon Musk’s public acknowledgment that xAI had fallen behind rivals in coding capabilities. Grok Build supports AGENTS.md, hooks, skills, MCP servers, subagents for parallel tasks, and deep worktree integration. The release arrives amid organizational turbulence: xAI was acquired by SpaceX in February 2026, and the company has since lost over 50 researchers and engineers. Whether Grok Build can close the gap with established coding agents remains an open question as the product enters its early beta phase.
Cursor released Composer 2.5 on May 18, 2026, upgrading its in-house coding model for sustained, long-running software tasks inside Cursor. The company says the model follows complex instructions more reliably than Composer 2, performs better during extended agent work, and was trained with targeted reinforcement learning using textual feedback, larger synthetic task generation, Sharded Muon optimization, and dual-mesh hybrid sharded data parallelism. Pricing remains split across two usage modes: Standard at $0.50 per million input tokens and $2.50 per million output tokens, and Fast, the default, at $3.00 per million input tokens and $15.00 per million output tokens. The release matters because Cursor is no longer treating model quality as a wrapper problem. It is building a vertically integrated coding stack where the model, integrated development environment, cloud agents, benchmarks, and training pipeline are all tuned around real repository work.
OpenAI brought Codex into the ChatGPT mobile app in preview on May 14, giving developers a way to monitor, steer, and approve long-running coding tasks from iOS and Android. Codex still operates on a trusted machine such as a laptop, Mac mini, devbox, or managed remote environment, while the phone receives live updates including screenshots, terminal output, diffs, test results, and approval requests. Users can review outputs, approve commands, change models, work across active threads, and start new tasks without exposing local files, credentials, or machine setup directly to the public internet, thanks to a secure relay layer. OpenAI says Codex now has more than 4 million weekly users, which makes the mobile release less about typing code on a phone and more about keeping autonomous coding work unblocked while the agent is already running.
AI Tip of the Day
Prompt caching only works when the repeated part of your prompt stays the same.
The repeated parts of an LLM request are often expensive: system instructions, policy text, tool schemas, style rules, and few-shot examples. Caching can reduce the cost and latency of repeated input tokens, but small changes near the front of the prompt can invalidate the cache. A common mistake is putting dynamic user data, timestamps, retrieved chunks, or session-specific metadata before the stable instructions.
Structure the request so the stable prefix comes first: system prompt, durable rules, tool definitions, and fixed examples. Put dynamic content later: user message, retrieved context, current date, account state, and temporary constraints. Then monitor cache hit rate alongside latency and cost per request. If the hit rate is low, inspect prompt ordering.
If you’re building LLM applications and want to go deeper into prompting, cost optimization, and production architecture, our 10-Hour LLM Fundamentals course is a fully video-based primer to get you started. Find more information here.
Five 5-minute reads/videos to keep you learning
The transformer attention mechanism is algebraically identical to the 1964 Nadaraya-Watson kernel regression estimator with an exponential dot-product kernel. This reframing reveals transformers as non-parametric kernel smoothers where context length acts as effective parameters, and predicts that the “lost in the middle” phenomenon stems from the curse of dimensionality.
A practical guide spanning SARIMAX, Prophet, XGBoost, LSTM, and N-BEATS, anchored by Zillow’s $9B iBuying failure as a warning about non-stationary data. Covers decomposition, stationarity testing, and model selection, noting that 92.5% of M5 competition submissions failed to beat a simple exponential smoothing baseline.
NVIDIA’s Guess-Verify-Refine algorithm exploits the 35-50% overlap in Top-K index sets between consecutive decode steps, a temporal correlation arising from RoPE’s Toeplitz structure. The exact method achieves a 1.88x average speedup over optimized baselines and reduces time-per-output-token by up to 7.52% in DeepSeek-V3.2 deployment.
A production pipeline pairing Snowflake’s AI_COMPLETE with gemini-3.1-pro for video analysis and Claude Sonnet for text processing. The SQL-native design handles brand sentiment, product discovery, FTC compliance monitoring, and content moderation through structured JSON schemas and incremental scheduled processing.
KV cache stores key and value tensors across decode steps to avoid redundant recomputation, but scales prohibitively at long contexts, as Llama 3 70B at 128K tokens requires 40GB of cache alone. The article surveys compression techniques including MLA (93.3% reduction), GQA, PagedAttention, and quantization, flagging a critical FP8 bug on Hopper that silently dropped accuracy from 91% to 13%.
Repositories & Tools
claude-context is Zilliz’s MCP server that lets coding agents search entire codebases via hybrid BM25 + vector retrieval with AST-aware chunking and incremental indexing, cutting token usage by ~40%.
Ml-intern is a Hugging Face’s autonomous ML agent that reads papers, finds datasets, writes training code, and fine-tunes models. Improved a Qwen3-1.7B from 10% to 32% on GPQA in 10 hours on one H100.
agentmemory is a persistent cross-session memory for coding agents. It captures interactions into searchable knowledge graphs with 95.2% retrieval accuracy and 92% fewer tokens than full-context replay. Works with Claude Code, Cursor, Codex, and Gemini CLI.
DeepSeek-TUI is a Rust terminal coding agent for DeepSeek V4 with 1M-token context, OS-level sandboxing, LSP integration, and sub-agent coordination at ~1/10th the cost of Claude Code.
Pixelle-Video is an Alibaba’s end-to-end short video engine: input a topic and it generates scripts, images, video, voiceover, and music automatically. Modular ComfyUI architecture with swappable LLM backends. Apache 2.0.
Top Papers of The Week
SenseTime introduces SenseNova-U1, a unified multimodal model built on the NEO-unify architecture that dissolves the conventional boundary between understanding and generation, treating them as synergistic facets of a single process rather than separate pipelines. Available as an 8B dense variant (U1-8B-MoT) and a 30B mixture-of-experts variant (U1-A3B-MoT), the model replaces fragmented architectures, cascaded pipelines, and misaligned representation spaces with a single coherent framework. It achieves competitive results across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. On the generation side, it handles any-to-image synthesis with semantic consistency, text-rich infographic creation, and interleaved vision-language output. The authors also demonstrate promising capabilities in vision-language-action tasks and world modeling, pointing toward broader applicability in embodied and interactive settings.
Orthrus augments a frozen LLM with a lightweight trainable module that introduces a parallel diffusion view operating alongside standard autoregressive decoding, with both views sharing the exact same KV cache to keep memory overhead at O(1). During inference, the autoregressive head handles context pre-filling while the diffusion head generates multiple tokens in parallel, addressing the fundamental throughput bottleneck of sequential decoding. An exact consensus mechanism between the two views guarantees lossless generation fidelity, meaning output quality is preserved without compromise. The framework achieves up to 7.8x inference speedup with minimal additional parameters and requires no retraining of the base model, natively resolving the longstanding tradeoff between the high fidelity of autoregressive LLMs and the speed advantages of diffusion-based generation.
Cactus Needle is a 26M-parameter Simple Attention Network distilled from Gemini 3.1, purpose-built for single-shot function calling on phones, watches, and glasses. The architecture uses 12 encoder layers with masked self-attention and RoPE, 8 decoder layers with gated cross-attention, grouped query attention (8 heads / 4 KV heads), ZCRMSNorm, and an 8,192-token BPE vocabulary, all without FFN layers. Pretrained on 200 billion tokens across 16 TPU v6e in 27 hours, then post-trained on 2 billion tokens of function-call data in 45 minutes, it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function-call benchmarks despite being far smaller. Running on the Cactus runtime at 6,000 tok/s prefill and 1,200 decode, with fully open weights on Hugging Face, local finetuning on a Mac via CLI with as few as 120 examples per tool, and a Gradio playground for testing, redefining what’s achievable for on-device AI inference.
MinT presents a managed infrastructure for LoRA post-training and online serving at million scale, maintaining a shared resident base model while cycling lightweight adapters through training, validation, and deployment stages. The system scales up LoRA reinforcement learning to dense and Mixture-of-Experts architectures exceeding 1T total parameters, supporting advanced attention mechanisms including MLA and DSA. It scales down data movement by transferring only adapter weights, as small as under 1% of base-model size in rank-1 settings, achieving 18.3x faster handoff on a 4B dense model and 2.85x faster on a 30B MoE. Concurrent multi-policy GRPO training reduces runtime by 1.77x and 1.45x without increasing peak memory. The system scales out to 10^6-scale addressable adapter catalogs, sustaining thousands of concurrent active policies with 8.5-8.7x loading efficiency gains via packed MoE LoRA tensors and cold-loading procedures integrated as scheduled service work.
Quick Links
OpenAI Launches $4B Consulting Arm With Tomoro Acquisition. OpenAI acquired London-based Tomoro (150 engineers) and invested $4 billion to launch the OpenAI Deployment Company, a $14B subsidiary backed by TPG, Advent, Bain Capital, and 19 other firms including McKinsey and Capgemini. The unit embeds Forward Deployed Engineers in enterprises to build production AI systems, targeting the implementation gap where only 5% of companies see real returns on generative AI spending.
arXiv Bans Authors for One Year Over AI Slop. arXiv will impose a one-year ban on authors who submit papers with clear signs of unchecked AI generation: hallucinated references, LLM meta-comments, placeholder text, and factual errors. After the ban, offenders must get papers accepted by a peer-reviewed journal before reposting. The policy allows appeals and does not prohibit AI tool use outright, but holds authors “fully responsible” for all content.
Thinking Machines Unveils Real-Time Interaction Models. Thinking Machines Lab introduced interaction models for real-time human-AI collaboration across audio, video, and text simultaneously. Built on a 276B MoE architecture (12B active), the system uses time-aligned micro-turns processing 200ms chunks, enabling verbal interjections, simultaneous speech, and concurrent tool use with 0.40s turn-taking latency. Released as a research preview with larger variants planned for later in 2026.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.





