TAI #208: Open Models Find Their Role as Agent Token Bills Rise
Also, Microsoft MAI models, Gemma 4 12B, MiniMax M3, and more!
What happened this week in AI by Louie
This week brought an unusually broad set of cheaper and open model releases. Microsoft announced seven in-house MAI models, led by MAI-Thinking-1. Google added a local multimodal Gemma 4 12B, and MiniMax launched M3 via its API, promising downloadable weights. This followed NVIDIA’s Nemotron 3 Ultra release last week. OpenAI also expanded Codex with role-specific plugins for work beyond coding, while Apple introduced Core AI for running custom and community models on Apple silicon.
The timing matters because token consumption is climbing fast. Companies are moving from short chats to long-running agents that read larger contexts, call tools repeatedly, spawn subagents, review their own work, and retry failed steps. Anthropic’s earlier research found that agents used around four times as many tokens as chat interactions, and multi-agent systems around fifteen times as many. Cursor’s production data show that average tool calls per agent request rose nearly 28% from early March to mid-May. Sam Altman said some companies spent their full 2026 AI budgets during the first quarter. No one has a representative figure for the rise in company-wide bills, but inference optimization is becoming urgent.
This creates a natural role for cheaper models. I run many subagents for coding, research, fact-checking, criticism, and iteration, and there is little reason to pay frontier prices for every extraction, formatting task, or first-pass review. A strong model can handle ambiguous decisions and final synthesis, while smaller models inspect files, classify documents, run retrieval, generate tests, and produce structured drafts. Open-weight models add further options for private deployment, customization, and high-volume inference.
Vercel’s production data shows this split clearly. In May, DeepSeek jumped from under 1% to 17% of token volume on Vercel’s AI Gateway while accounting for only around 1% of model spend. In coding agents, DeepSeek handled 49% of tokens at 4% of the cost, while Anthropic handled 28% of tokens at 70% of the cost. Anthropic’s token and spend shares still grew during the month. Cheap models are building a high-volume worker layer without displacing frontier models from tasks where customers will pay for quality, and the spend share shows exactly where that willingness is concentrated.
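The size of that price gap can be read off the shares themselves. The sketch below is a rough back-of-the-envelope calculation, assuming the coding-agent shares quoted above and treating all tokens as interchangeable (input, output, and cached tokens are blended, which the figures do not break out). The function name is illustrative only.

```python
def relative_price_per_token(token_share: float, spend_share: float) -> float:
    """Spend share divided by token share.

    A value above 1 means the provider's blended price per token is above
    the segment-wide average; below 1 means it is cheaper than average.
    """
    return spend_share / token_share


# Coding-agent shares quoted in the text (fractions of the segment total).
deepseek = relative_price_per_token(token_share=0.49, spend_share=0.04)
anthropic = relative_price_per_token(token_share=0.28, spend_share=0.70)

# Ratio of the two blended per-token prices.
price_gap = anthropic / deepseek

print(f"DeepSeek:  {deepseek:.2f}x the segment average")
print(f"Anthropic: {anthropic:.2f}x the segment average")
print(f"Implied blended price gap: about {price_gap:.0f}x per token")
```

On these shares, the implied gap is roughly 30x per token, which is why a worker layer can absorb half the volume and barely register in spend.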
But there is a caveat on the word “open.” Only Gemma 4 12B and Nemotron 3 Ultra currently have downloadable weights. MAI-Thinking-1 is in private preview through Microsoft Foundry. MiniMax launched M3 as a hosted model and said its weights would follow within ten days, but they were still unavailable on June 9.
Microsoft’s release is the most interesting place to start. MAI-Thinking-1 is a 35-billion-active-parameter Mixture-of-Experts model with roughly one trillion total parameters and a 256K context window. Microsoft trained it from scratch on 30 trillion tokens, followed by 3.55 trillion tokens of mid-training focused heavily on code, science, technology, engineering, and mathematics.
The model is a credible first effort from Microsoft’s new AI lab, with a visible gap to frontier agent performance. Microsoft says so directly in the technical report: MAI-Thinking-1 “does not lead the field.” On SWE-Bench Pro, it reaches 52.8%, close to Opus 4.6 at 53.4%, while Kimi K2.6 and GLM-5.1 reach 58.6% and 58.4%. Its 46.0% on Terminal-Bench 2.0 trails Sonnet 4.6 at 59.1% and GPT-5.4 at 75.1%. It performs strongly on math, including 97.0% on AIME 2025.
Microsoft also explains one of those gaps. Its software-engineering training used bash and a string-replacement editor, without targeted terminal-interaction environments. The Terminal-Bench score therefore measures generalization into a tool environment the model was never trained to master. That is useful disclosure, and it hands Microsoft an obvious target for the next version.
The technical paper deserves nearly as much attention as the model. It runs beyond 100 pages and covers architecture, data composition, failed assumptions, benchmark methodology, reinforcement-learning infrastructure, and cluster performance. Microsoft trained 183 models from scratch across 61 illustrative data mixtures and found that small-scale experiments could select the wrong data: a science-heavy mixture looked stronger on small models, while a code-heavy mixture won as the models grew. The final pretraining mix was 54.6% code, and the main 30-trillion-token run used 8,192 GB200 GPUs. We rarely get this much detail from a major US lab.
The data strategy is the most striking part. Microsoft used no synthetic data from LLMs during pretraining, actively tried to remove AI-generated content from collected sources, and used no third-party model distillation. It even excluded Hugging Face and mirror domains from its web data over contamination concerns. Later reinforcement learning does use self-distillation from Microsoft’s own checkpoints and synthetic environments, so the claim is specifically about pretraining and external teachers.
This reverses Microsoft’s original Phi strategy. Phi-1 used synthetic textbooks and exercises generated by GPT-3.5, and later Phi models continued to lean on high-quality synthetic data. Synthetic data can be extremely useful, especially for compact specialist models. MAI tests a different thesis at much larger scale: learn base capability from human-generated data, then build reasoning through Microsoft’s own reinforcement-learning stack. If it holds, it is a quiet rebuttal to the “running out of data” worry, at least for a lab willing to curate aggressively.
Google’s Gemma 4 12B sits at the opposite end of the deployment spectrum. It is a dense 11.95B model that handles text, image, audio, and video-frame inputs via a single shared transformer, without large, separate vision or audio encoders. It supports 256K context, configurable thinking, native function calling, and Apache 2.0 weights. Google’s Q4 version occupies around 6.7GB before context and application overhead, putting local multimodal agents on 16GB laptops within reach.
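The 6.7GB figure is easy to sanity-check. The sketch below assumes the Q4 build uses a layout like GGUF's Q4_0, where each block of 32 weights stores 4-bit values plus one 16-bit scale (4.5 bits per weight), and that every tensor is quantized the same way. Real checkpoints usually keep some tensors at higher precision, so treat this as an estimate, not Google's accounting.

```python
def q4_0_bits_per_weight(block_size: int = 32, scale_bits: int = 16) -> float:
    """Bits per weight for a block format: 4-bit values plus one shared scale per block."""
    return 4 + scale_bits / block_size


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight memory in decimal gigabytes, excluding KV cache and runtime overhead."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 11.95e9  # dense parameter count quoted in the text

q4_gb = weight_gb(PARAMS, q4_0_bits_per_weight())
bf16_gb = weight_gb(PARAMS, 16)
saving = 1 - q4_gb / bf16_gb

print(f"Q4_0 weights: {q4_gb:.2f} GB")
print(f"BF16 weights: {bf16_gb:.2f} GB")
print(f"Reduction:    {saving:.1%}")
```

The estimate lands at about 6.72GB, in line with the quoted figure, and leaves the remainder of a 16GB machine for context, the operating system, and the application.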
The independent results point to a bounded role. Artificial Analysis scores Gemma 4 12B at 29.2 on its Intelligence Index and only 18.2% on Terminal-Bench Hard, and its factuality is weak when it attempts difficult closed-book questions. I would use it for private document inspection, extraction, tagging, transcription chunks, visual checks, and routine tool calls, with schemas and stronger verification wrapped around the output. Its value comes from locality and adequate capability, not autonomous frontier-level work.
MiniMax M3 is the strongest model in this group on current independent testing. Artificial Analysis gives it 54.7 on the Intelligence Index and 68.6 on the Agentic Index, ahead of Nemotron 3 Ultra at 47.7 and 57.1, respectively. Its standard API price is $0.30 per million input tokens and $1.20 per million output tokens up to 512K context, which is extremely competitive.
NVIDIA’s Nemotron 3 Ultra is a different proposition. It has 550 billion total parameters, 55 billion active per token, and a one-million-token context window. The released NVFP4 checkpoint cuts measured weight memory from around 1.1TB to 331GB and fits on four B200 GPUs. Artificial Analysis measures roughly 162 output tokens per second through current hosted providers, around four times M3’s speed, though Ultra scores lower on the capability indices.
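Those memory numbers imply an effective precision that can be checked directly. This is a minimal sketch using only the figures quoted above; the 180GB per-GPU capacity is an assumption for illustration, so substitute the figure for your own hardware.

```python
def effective_bits_per_param(weight_bytes: float, params: float) -> float:
    """Average stored bits per parameter implied by a checkpoint size."""
    return weight_bytes * 8 / params


TOTAL_PARAMS = 550e9          # total parameters quoted in the text
NVFP4_WEIGHT_BYTES = 331e9    # measured NVFP4 weight memory quoted in the text
NUM_GPUS = 4
ASSUMED_GPU_MEMORY_GB = 180   # assumption: usable memory per GPU, adjust as needed

bf16_tb = TOTAL_PARAMS * 2 / 1e12  # 2 bytes per parameter at BF16
nvfp4_bits = effective_bits_per_param(NVFP4_WEIGHT_BYTES, TOTAL_PARAMS)
weights_per_gpu_gb = NVFP4_WEIGHT_BYTES / 1e9 / NUM_GPUS
headroom_per_gpu_gb = ASSUMED_GPU_MEMORY_GB - weights_per_gpu_gb

print(f"BF16 weights:        {bf16_tb:.2f} TB")
print(f"Effective precision: {nvfp4_bits:.2f} bits per parameter")
print(f"Weights per GPU:     {weights_per_gpu_gb:.2f} GB")
print(f"Headroom per GPU:    {headroom_per_gpu_gb:.2f} GB (under the assumed capacity)")
```

The effective precision comes out near 4.8 bits per parameter, slightly above a nominal 4-bit format with per-block scales, presumably because some tensors stay at higher precision. Whatever headroom remains per GPU is what the KV cache for a one-million-token context has to fit into.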
NVIDIA now has a useful three-tier family. Nano, at roughly 3B active parameters, fits high-volume leaf tasks. Super, at 12B active, is a more plausible private enterprise reasoning model. Ultra is an escalation model for difficult coding, research, and long-output agent work. The family is a working example of how open weights can support an internal model hierarchy, even if four B200s remain serious data-center infrastructure.
I expect model selection to shift from one model per product to one model per workflow step. The economic gain comes from assigning cheap models the work they can complete reliably, then keeping frontier models on the steps where mistakes, retries, and extra supervision would erase the savings. I have already started implementing this in Codex by requesting that sub-agents use cheaper models for bulk summarization or data-extraction tasks.
The likely outcome is more total inference, with cheap workers making previously uneconomic substeps worth running and frontier models remaining the expensive control and synthesis layer. Cost per verified result is the metric to track, because retries and human review can overwhelm headline token savings.
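Cost per verified result can be sketched in a few lines. The model below is a simplification: it assumes each attempt passes verification independently with a fixed probability, so the expected number of attempts is one over the pass rate, and it charges review cost on every attempt. All dollar figures and pass rates are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class StepProfile:
    """Per-attempt costs and pass rate for one workflow step on one model."""
    token_cost: float   # dollars of inference per attempt
    review_cost: float  # dollars of human or automated review per attempt
    pass_rate: float    # fraction of attempts that pass verification

    def cost_per_verified_result(self) -> float:
        """Expected total cost until one attempt passes verification."""
        if not 0 < self.pass_rate <= 1:
            raise ValueError("pass_rate must be in (0, 1]")
        return (self.token_cost + self.review_cost) / self.pass_rate


# Hypothetical numbers: the cheap model is 30x cheaper per attempt on tokens,
# but fails more often and needs more review time per attempt.
cheap = StepProfile(token_cost=0.002, review_cost=0.05, pass_rate=0.60)
frontier = StepProfile(token_cost=0.060, review_cost=0.01, pass_rate=0.95)

print(f"cheap:    ${cheap.cost_per_verified_result():.4f} per verified result")
print(f"frontier: ${frontier.cost_per_verified_result():.4f} per verified result")
```

With these placeholder inputs the frontier model is cheaper per verified result despite the higher token price. On an easily verified step with negligible review cost, the comparison flips, which is the case for routing per step.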
Why should you care?
Open weights lower marginal cost and give control over privacy, adaptation, and deployment, while adding hardware, operational, security, upgrade, and utilization costs. I still find a much larger gap between frontier and open-weight models in real work than benchmarks suggest, because public evaluations favor standardized, scorable tasks. The larger training budgets behind Claude and GPT show up when I push the limits of their capabilities, run long projects, or encounter unusual cases that fall outside benchmarks. Frontier models need less steering, recover from mistakes more often, and handle the awkward final stretch of a task more reliably.
My time is worth far more than LLM tokens in nearly every personal workflow. Saving a few dollars is poor economics if I spend 20 minutes rewriting instructions, rescuing a tool loop, or checking an answer I would trust from a frontier model. So I default to the best available models for open-ended research, writing, and complex agent work.
The practical takeaway is to treat open versus closed as a routing decision at each workflow step, not as a single choice per product. Open models make most sense when work is repeated, narrow, high-volume, private, or easy to verify. I would start with hosted endpoints, test them against real traces, and measure retry rates, human intervention, and escalation rates before committing to infrastructure. Self-hosting becomes compelling when data residency, custom tuning, or sustained utilization justifies it. Gemma 4 12B is a great new option for local deployment because its quantized weights fit hardware many teams already own. I am glad to see open weights progressing, and this week’s releases offer several genuinely useful options. Deploy them where their economic and control advantages survive the still-higher cost of supervision.
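One way to make that routing measurable is a cheap-first loop with escalation. The sketch below is illustrative only: `cheap_model`, `frontier_model`, and `verify` are stand-in functions, not real API calls, and the stand-in cheap model is deterministic, so a retry never changes its answer here as it might with a sampled model.

```python
from typing import Callable


def run_with_escalation(
    task: str,
    cheap: Callable[[str], str],
    frontier: Callable[[str], str],
    verify: Callable[[str], bool],
    max_cheap_attempts: int = 2,
) -> dict:
    """Try the cheap model first; escalate to the frontier model if verification keeps failing."""
    for attempt in range(1, max_cheap_attempts + 1):
        output = cheap(task)
        if verify(output):
            return {"output": output, "model": "cheap", "cheap_attempts": attempt}
    # Cheap attempts exhausted: pay frontier prices for this task only.
    return {"output": frontier(task), "model": "frontier", "cheap_attempts": max_cheap_attempts}


# Stand-ins for real model calls: the "cheap" model fails on longer tasks.
def cheap_model(task: str) -> str:
    return task.upper() if len(task) <= 12 else ""


def frontier_model(task: str) -> str:
    return task.upper()


def verify(output: str) -> bool:
    return bool(output)


tasks = ["tag invoice", "summarize a long contract", "extract date"]
results = [run_with_escalation(t, cheap_model, frontier_model, verify) for t in tasks]

# The metrics worth logging against real traces before committing to infrastructure.
escalation_rate = sum(r["model"] == "frontier" for r in results) / len(results)
avg_cheap_attempts = sum(r["cheap_attempts"] for r in results) / len(results)
print(f"escalation rate: {escalation_rate:.0%}, average cheap attempts: {avg_cheap_attempts:.2f}")
```

Replaying real traces through a loop like this gives retry and escalation rates per step, which is the evidence needed to decide where a cheaper model actually pays off.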
— Louie Peters — Towards AI Co-founder and CEO
We just open-sourced a complete AI engineering roadmap for 2026.
It covers the full path from Python basics to production AI systems, designed to help you become an AI engineer, not just someone who can build an agent demo. No prior ML background required.
Built by our team at Towards AI, the repo includes beginner Python resources, foundational AI and LLM videos, recommended books, free and paid courses, hands-on coverage of RAG, agents, evals, MCP, deployment, safety, and coding-agent workflows, project ideas, communities, newsletters, people to follow, and job-search advice.
Every resource is tagged with a difficulty level from 1️⃣ to 🔟 so you can start wherever you are.
By the end, you’ll have the foundation to work as an AI engineer. Not just building with LLMs, but knowing when to reach for prompting over fine-tuning, when RAG is the right call and when it isn’t, when to use an agent and when a deterministic workflow will do the job better, when not to use an LLM at all, and how to evaluate, debug, trace, deploy, and monitor the systems you ship.
There’s also a built-in prompt you can paste into Claude or ChatGPT, along with your background, time, and goals, to turn the roadmap into a personalized learning plan.
Everything is free. Paid resources are clearly labeled.
Give it a star if it’s useful, and share it with someone getting into AI engineering this year.
Hottest News
1. Google DeepMind Releases Gemma 4 QAT Checkpoints
Google DeepMind released Quantization-Aware Training (QAT) checkpoints for all Gemma 4 model sizes and their drafter models on Hugging Face. Unlike standard post-training quantization, which compresses a finished model and often degrades quality, QAT simulates quantization during training so the model learns to compensate for precision loss. The result is roughly 72% lower memory requirements while maintaining performance close to full BF16 baselines. The release includes two formats: Q4_0 for general-purpose local deployment on laptops and consumer GPUs, and a new mobile-optimized format that reduces the Gemma 4 E2B model to approximately 1GB. Google did not publish Gemma 4 QAT benchmark scores alongside the release.
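The core QAT idea can be shown in miniature. The sketch below is a generic illustration, not Google's training recipe: the forward pass sees weights rounded to a 4-bit grid ("fake quantization"), while gradients update the full-precision copy as if the rounding were the identity (a straight-through estimator). The per-tensor symmetric scale and the toy single-example regression are simplifying assumptions.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a symmetric integer grid and map them back to floats."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return [c * scale for c in codes], scale


def qat_step(weights, x, target, lr=0.1, bits=4):
    """One SGD step on 0.5 * (prediction - target)**2 with quantized forward weights."""
    quantized, _ = fake_quantize(weights, bits)  # forward pass sees quantized weights
    prediction = sum(w * xi for w, xi in zip(quantized, x))
    error = prediction - target
    # Straight-through estimator: treat d(quantized)/d(weight) as 1, so the
    # gradient for each full-precision weight is simply error * x_i.
    updated = [w - lr * error * xi for w, xi in zip(weights, x)]
    return updated, 0.5 * error ** 2


weights = [0.0, 0.0]
x, target = [1.0, 2.0], 1.0
losses = []
for _ in range(5):
    weights, loss = qat_step(weights, x, target)
    losses.append(loss)

print("loss per step:", [round(l, 4) for l in losses])
```

Because the loss is always computed through the quantized weights, the full-precision weights drift toward values that still work after rounding, which is the compensation effect post-training quantization cannot provide.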
2. MiniMax Releases M3 With Sparse Attention and a 1M-Token Context
MiniMax released M3, a foundation model built on MiniMax Sparse Attention (MSA), a new sparse attention architecture that supports a 1M-token context window. MSA cuts per-token compute at 1M context to one-twentieth of the previous M2 generation, with over 9x faster prefill and over 15x faster decoding. The model supports text, image, and video inputs and is designed for long-horizon coding and agentic workflows. On company-reported benchmarks, M3 scored 59.0% on SWE-Bench Pro, 66.0% on Terminal-Bench 2.1, 83.5 on BrowseComp, and 70.06% on OSWorld-Verified for computer use. API pricing is $0.60/$2.40 per million input/output tokens. MiniMax committed to releasing open weights and a technical report on Hugging Face within 10 days.
3. Microsoft Unveils 7 In-House MAI Models with Frontier Tuning
Microsoft announced seven new in-house AI models at Build 2026: MAI-Thinking-1 (reasoning), MAI-Code-1-Flash (coding), MAI-Image-2.5 and MAI-Image-2.5 Flash (image generation), MAI-Transcribe-1.5 (speech recognition), and MAI-Voice-2 and MAI-Voice-2-Flash (speech generation). MAI-Thinking-1 is Microsoft’s first reasoning model, trained from scratch on commercially licensed data with no distillation from OpenAI or other third-party model families. MAI-Image-2.5 leads the Arena Image Edit leaderboard at 1,403 Elo. Alongside the models, Microsoft introduced Frontier Tuning, which uses reinforcement learning to adapt MAI models to organization-specific workflows within the enterprise’s compliance boundary. Early results show an MAI model tuned for Excel matching GPT-5.4 while being up to 10× more efficient. All models are available on Azure Foundry, with distribution on OpenRouter, Fireworks, and Baseten.
4. Google Research Adds Agentic RAG to Gemini Enterprise Agent Platform
Google Research made an agentic RAG system available in public preview on Gemini Enterprise Agent Platform (formerly Vertex AI). The system addresses a common failure mode in standard RAG: queries that require multi-step reasoning across multiple documents return incomplete or empty results because a single retrieval pass cannot resolve them. The agentic approach introduces query planning and routing, where the system decomposes complex questions into sub-queries, retrieves evidence for each, and iterates until sufficient context is assembled. Google’s stated design goal is that AI-generated responses should be auditable, traceable, and grounded in retrieved source material. The feature integrates with the platform’s existing RAG Engine and Vector Search infrastructure.
5. JetBrains Open-Sources Mellum2
JetBrains open-sourced Mellum2, a 12B-parameter MoE model with 2.5B active parameters per token, built for the infrastructure layer of AI coding systems. The model uses 64 experts and activates 8 per token, achieving over 2x faster inference compared to similar-sized models. It was trained from scratch on approximately 10.6 trillion tokens through a three-phase curriculum that progressively shifts from web data toward curated code and math. Context extends to 128K tokens via layer-selective YaRN. Mellum2 ships in two post-training variants: Instruct for direct low-latency answers and Thinking for step-by-step reasoning traces. The Thinking variant scored 69.9% on LiveCodeBench v6. JetBrains positions Mellum2 as a “focal model,” a fast component inside larger AI systems for routing, RAG, summarization, and sub-agent tasks, not a standalone replacement for frontier models. It is released under Apache 2.0 on Hugging Face.
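The 64-expert, 8-active arrangement can be illustrated with a generic top-k router. This is a minimal sketch of sparse expert routing in general, not Mellum2's actual implementation: the random router logits, the toy scalar experts, and the choice to apply softmax after selecting the top 8 are all assumptions made for illustration.

```python
import math
import random


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # subtract peak for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum over only the selected experts; the rest are never evaluated."""
    return sum(weight * experts[i](x) for i, weight in route_top_k(router_logits, k))


NUM_EXPERTS, TOP_K = 64, 8
rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in for a learned router
experts = [lambda x, s=i: s * x for i in range(NUM_EXPERTS)]       # toy experts: scale the input

selected = route_top_k(router_logits, TOP_K)
output = moe_forward(1.0, experts, router_logits, TOP_K)

print(f"experts evaluated per token: {len(selected)} of {NUM_EXPERTS} ({TOP_K / NUM_EXPERTS:.1%})")
print(f"routing weights sum to {sum(w for _, w in selected):.3f}")
```

Only one eighth of the experts run for each token, which is where the inference speedup comes from. The active share of parameters (2.5B of 12B) is higher than one eighth, presumably because attention layers and embeddings are used by every token.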
6. Nous Research Releases Hermes Desktop
Nous Research released Hermes Desktop in public preview, a native application for macOS, Windows, and Linux that provides a graphical interface for the open-source Hermes Agent. The desktop app runs the same agent core as the CLI and messaging gateways, sharing configuration, API keys, sessions, skills, and memory across all surfaces. The interface includes streaming responses, live tool activity monitoring, and a preview pane for web pages, files, and tool output. Hermes Desktop also supports connecting to a remote Hermes backend running on a separate machine, allowing a long-running agent to stay on a server while users interact from a lightweight native window. The release shipped alongside a migration tool for OpenClaw users. The app is released under the MIT license.
7. OpenAI Expands Codex for Role-Based Tool Workflows
OpenAI expanded Codex from a coding tool into a broader enterprise work platform. The update introduces six role-specific plugins that aggregate 62 business applications (including Snowflake, Figma, and Salesforce) and include 110 built-in automated skills. It also adds Annotations for in-place editing and a Sites preview, which lets users create and share hosted interactive web applications from natural language prompts. More than 5 million people now use Codex weekly. Non-developers, including analysts, marketers, operators, designers, and researchers, make up about 20% of users and are growing at more than 3x the rate of developers. Plugins are rolling out in Codex in supported regions. Sites are in preview for Business and Enterprise teams. Additional plugins for corporate finance, private equity investing, marketing strategy, strategy consulting, and legal are coming next, along with an open ecosystem for partners to create and deploy their own plugins.
8. xAI Launches Grok Imagine 1.5 Preview on the API
xAI released grok-imagine-video-1.5-preview, an image-to-video model now available via the xAI API in preview. The model takes a single still image and a natural-language prompt describing motion, camera movement, pacing, and sound design, and generates video with synchronized audio at up to 720p resolution. Output is H.264 MP4 at 24fps across seven aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3) with durations from 1 to 15 seconds. It also supports multi-shot sequencing, allowing users to stage each frame, animate it, and chain shots into longer scenes that maintain a consistent look. On the Artificial Analysis Video Arena Image-to-Video leaderboard, it debuted in first place with an Elo of 1,404, a 52-point improvement over Grok Imagine Video 1.0. API pricing is $0.08 per second for 480p and $0.14 per second for 720p, plus $0.01 per input image. Audio is included at no additional charge. A broader consumer rollout to X Premium tiers is in progress.
9. Apple Expands Foundation Models and Adds Core AI at WWDC26
Apple unveiled updates to Apple Foundation Models and introduced Core AI at WWDC 2026 on June 8. The new generation of Foundation Models was developed with the aid of Google Gemini technology through distillation and training under the multi-year partnership announced in January 2026, though Apple states the final models are fully developed in-house. The Foundation Models framework now supports multimodal image input, ships a Python SDK, runs on Linux, and introduces a LanguageModel protocol that lets developers write session logic once and swap between Apple’s on-device model, Anthropic Claude, and Google Gemini without changing downstream code. Core AI is introduced as the successor to Core ML for running custom models on-device. MLX receives updates including Metal 4 support, GPU neural accelerators, and multi-Mac RDMA training over Thunderbolt. Siri AI, powered by the new Foundation Models, will launch in beta later this year but will not be available in the EU or China at launch.
10. OpenAI Updates GPT-Rosalind with New Life Sciences Capabilities
OpenAI shipped a major update to GPT-Rosalind on June 3, combining GPT-5.5’s agentic coding and tool-use capabilities with stronger model intelligence in medicinal chemistry, genomics, and broader life sciences workflows. OpenAI designed LifeSciBench, an externally expert-judged benchmark evaluating six workflow areas: evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, and translation and communication. On three domain-specific evaluations, GPT-Rosalind outperforms GPT-5.5 while using fewer tokens: 27.5% vs. 25.1% on MedChemBench (medicinal chemistry) with 7.2% fewer tokens, 21.6% vs. 20.4% on GeneBench (genomics and quantitative biology) with 31% fewer tokens, and 63.2% vs. 55.8% on LabWorkBench (wet lab protocol assistance) with 5.3% fewer tokens. New Codex plugins for life sciences research and NGS analysis extend the model with sourced evidence retrieval, biological interpretation, and bioinformatics execution. GPT-Rosalind is now available in research preview to eligible organizations globally. Novo Nordisk joins existing partners Amgen, Moderna, the Allen Institute, and Thermo Fisher Scientific.
11. OpenAI Introduces Dreaming for Better ChatGPT Memory
OpenAI rolled out Dreaming V3, a rebuilt memory architecture for ChatGPT that replaces the manually curated saved-memories list with a background synthesis process. The system automatically reads across a user’s full conversation history and updates what ChatGPT remembers without explicit prompts, including temporal revision: a stored memory like “going to Singapore in July” is automatically revised to “went to Singapore in July 2026” once the date passes. On OpenAI’s internal evaluations, factual recall improved from 41.5% in 2024 to 82.8% with Dreaming V3. A roughly 5x reduction in the compute required to serve dreaming makes the free-tier rollout practical for the first time. The update also includes a new Memory Summary page where users can see what ChatGPT has inferred, correct or dismiss individual memories, and control which topics it should raise. The rollout began with Plus and Pro subscribers in the US, with expansion to additional tiers and international markets planned in the coming weeks.
Five 5-minute reads/videos to keep you learning
1. The Evolution of LLM Inference: Decoding Algorithms
Speculative decoding has moved from needing two separate models to using a single LLM’s own internals for drafting. This article covers three approaches that need no separate draft model: LayerSkip’s early-exit self-speculation, SWIFT’s training-free layer-skipping, and EAGLE’s hidden-state feature prediction with dynamic draft trees. It also addresses long-context challenges through LongSpec’s cross-attention to target KV caches and TriForce’s hierarchical three-stage verification pipeline.
2. Five Probability Distributions, Explained With Beer, Peas, and Free Throws
This article covers five probability distributions that data practitioners use regularly: the normal, binomial, t, chi-square, and F. Each distribution is explained through an easy-to-understand story, from Fisher’s tea experiment to Gosset brewing beer at Guinness. The piece explains when to use each distribution, what dials to adjust, and how to call them in SciPy. It also ties all five into one family descended from the normal curve.
3. Docling + VectorLess + Gemma 3.5 Flash To Get Higher Accuracy
This article walks you through building a document intelligence system that combines IBM’s Docling parser with a hierarchical tree index to replace conventional vector search in RAG pipelines. Standard embedding-based retrieval chunks documents and loses structural context; this system reconstructs heading hierarchies, maps figures back to their parent sections, and traces every answer to an exact page reference. The system pairs Docling’s layout extraction with Gemma 3.5 Flash, a lightweight model that now matches previous-generation Pro performance on agent benchmarks.
4. Every Token-Based Language Model Is Throwing Away Information at the Last Step
This article discusses ELF, a 2025 language model from Nie et al. that challenges a core assumption shared by every major autoregressive model: that generation must pass through discrete tokens at every step. Each token commitment projects a continuous, high-dimensional hidden state onto a vocabulary of 100,000 entries, permanently discarding semantic information before the next step begins. ELF bypasses this entirely by using Flow Matching in the continuous embedding space, decoding to tokens only at the final step. It surpasses prior diffusion language models while training on 10x fewer tokens, raising serious questions about whether Chinchilla scaling laws remain valid for non-autoregressive objectives.
5. Your JWT Is Lying to You — The Authorization Problem Nobody Solves Correctly
JWTs prove identity but say nothing about what a user can actually do, and this piece traces exactly where that gap breaks real systems. It walks through the four authorization models (RBAC, ABAC, ReBAC, and policy-as-code), dissects why inline authorization logic collapses at scale across microservices, and compares OPA, Cedar, Cerbos, Casbin, and SpiceDB with honest tradeoffs. It also shows how threat signals, such as IP reputation and request velocity, can feed directly into OPA policies to block credential-stuffing attacks at the authorization layer.
Repositories & Tools
1. Kimi Code is Moonshot AI’s open-source CLI coding agent that runs in the terminal with out-of-the-box support for Kimi models. It reads and edits code, runs shell commands, searches files, fetches web pages, and selects its next action based on feedback, with compatibility for OpenAI-compatible API providers.
2. BigSet is a multi-agent system that builds structured, live datasets from a plain-English description. It infers the schema, dispatches parallel agents to gather data from the web, deduplicates results, and exports downloadable CSV or XLSX files with scheduled refresh support.
3. OpenCV is an open-source computer vision and machine learning library with bindings for C++, Python, and Java. Version 5.0, released June 6, adds a rewritten DNN engine with built-in LLM and VLM inference.
4. Project N.O.M.A.D. is a self-contained, offline-first knowledge server that bundles local AI chat with RAG, offline Wikipedia, medical references, survival guides, Khan Academy courses, downloadable maps, and encryption tools, all accessible through a browser-based Command Center on any Debian-based machine.
5. Tolaria is a cross-platform desktop app for managing markdown knowledge bases with built-in AI agent support. It exposes vault contents via MCP and provides setup paths for Claude Code, Codex CLI, Gemini CLI, and other agents to read, create, and edit notes directly within the vault.
Top Papers of The Week
1. Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
Search agents are typically trained over growing transcripts where the model must simultaneously search and manage its own state: remembering which documents it has seen, which evidence is useful, which constraints remain open, and which claims have been verified. This paper argues that RL is forced to optimize both search decisions and recoverable bookkeeping that the environment can maintain more reliably. Harness-1 is a 20B open-source search agent trained with RL inside a stateful harness that externalizes candidate pools, importance-tagged evidence, verification records, search history, and context-budget rendering. The model makes the semantic search decisions while the harness handles the bookkeeping. On eight retrieval benchmarks, Harness-1 achieves an average curated recall of 0.730, outperforming GPT-5.4, Sonnet 4.6, Kimi K2.5, and gpt-oss-120B, with only Opus 4.6 scoring higher among frontier retrievers.
2. OpenSkill: Open-World Self-Evolution for LLM Agents
Existing self-evolving agent approaches assume a usable learning loop: curated skills, successful trajectories, or verifier signals. Real-world deployments often provide none of these, offering only a task prompt. This paper formalizes open-world self-evolution, in which an agent must build both its skills and its own verification signals from scratch, without target-task supervision. OpenSkill bootstraps the loop by acquiring grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizing them into transferable skills, and refining those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved only for final evaluation.
3. PaddleOCR-VL-1.6
PaddleOCR-VL-1.5 established a strong 0.9B baseline for document parsing, but its remaining errors concentrate in under-optimized regions where model behavior is unstable, data coverage is sparse, or supervision is unreliable. Rather than expanding the training corpus indiscriminately, PaddleOCR-VL-1.6 introduces a region-aware data optimization framework that identifies weak regions in the previous model, applies targeted enhancements, and improves the reliability of the supervision signal. It further adopts a progressive post-training approach that uses curated data selection and reinforcement learning. PaddleOCR-VL-1.6 achieves 96.33% on OmniDocBench v1.6, setting a new state-of-the-art while remaining a compact 1.0B model competitive with substantially larger VLMs.
4. KVarN: 2-Bit KV-Cache Quantization That Preserves Reasoning
Current KV-cache quantization methods are evaluated under prefill-like settings, but errors behave differently under autoregressive decoding. This paper shows that in the autoregressive regime, quantization errors accumulate across timesteps, driven primarily by incorrect token magnitudes rather than directional distortion. KVarN is a calibration-free KV-cache quantizer from Huawei that applies a Hadamard rotation followed by dual-scaling variance normalization to both the K and V matrices, thereby fixing outlier token-scale errors that cause most end-to-end degradation. At 2-bit precision, KVarN delivers 3–5x more context capacity while maintaining FP16-level accuracy and exceeding FP16 throughput. It sets a new state of the art on generative reasoning benchmarks, including MATH500, AIME24, and HumanEval. The implementation ships as a native vLLM backend under the Apache 2.0 license.
Quick Links
1. Cognition turns Windsurf into Devin Desktop, unifying Windsurf and Devin under a single brand with the Agent Command Center as the default surface. The Kanban-style dashboard manages all local and cloud agents in a single view, with new Spaces to share context between agents across sessions, PRs, and files. Devin Local replaces Cascade as the default local agent with subagent support and up to 30% greater token efficiency. The update ships with support for the open-source Agent Client Protocol (ACP), letting any compatible agent, including Claude Code and Codex, run inside the editor. Cascade remains available through July 1, 2026.
2. Hugging Face redesigns the hf command-line interface for coding agents, rebuilding it to work for both humans and agents like Claude Code, Codex, and Cursor. The CLI now outputs actionable hints to stderr (naming the next command with the correct IDs), never blocks on interactive prompts, and auto-detects when an agent is driving it via environment variables. For complex multi-step Hub tasks, the CLI-equipped agent used up to 6x fewer tokens than a baseline agent that hand-rolled curl or used the Python SDK. Agents can install the CLI Skill via hf skills add to get full command documentation injected into their context.
3. Meta publishes SIRA for Single-Shot Agentic Retrieval, compressing multi-round exploratory search into a single corpus-discriminative BM25 call. Instead of iteratively issuing queries and inspecting snippets, SIRA uses an LLM to predict which terms will separate desired evidence from corpus-level confusers, validates them against document-frequency statistics, and compiles a single weighted query. Across 10 BEIR benchmarks, it outperforms dense retrievers, learned sparse retrievers, and LLM search-agent baselines without relevance labels or retriever fine-tuning.
4. GitHub Copilot SDK reaches general availability, giving developers programmatic access to the same agent runtime behind GitHub Copilot for planning, tool invocation, file edits, streaming, and multi-turn sessions. The SDK is available in Node.js/TypeScript, Python, Go, .NET, Rust, and Java. New at GA: custom tools and MCP server connections, OpenTelemetry-based tracing, flexible authentication, and cloud-hosted sessions.
Who’s Hiring in AI
Senior AI Engineer/Forward Deployed Engineer @Towards AI (Remote)
Partner AI Deployment Engineer — AWS @OpenAI (Seoul, South Korea)
Staff Backend AI Engineer @Experian (Remote)
Python Developer @Insight Global (Plano, TX, USA)
AI Software Engineer @Avnet (Multiple US locations)
AI/ML Technical Leader @Cisco (San Jose, CA, USA)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.




