TAI #210: GLM-5.2 Closes Most of the Open-Weight Gap in Ten Weeks

Also, SpaceX acquires Cursor, Noam Shazeer joins OpenAI & more.

Louie Peters, Towards AI, and Louis-François Bouchard

Jun 23, 2026

What happened this week in AI by Louie

We covered GLM-5.2 when Z.ai announced it last week. Since then, the weights have shipped under an MIT license, multiple inference providers have put the model online, and independent evaluators have had time to test it. The evidence now supports a stronger conclusion: GLM-5.2 is a major breakthrough for open weights and for Chinese AI labs.

The speed of the improvement is extraordinary. GLM-5.1 launched on April 7 and GLM-5.2 arrived on June 16, exactly ten weeks later. Artificial Analysis scores 5.2 at 51 on its Intelligence Index, up 11 points from 5.1 at 40. Only Claude Fable 5 (60), Claude Opus 4.8 (56), and GPT-5.5 (55) score higher. Fable is unavailable, leaving GLM-5.2 within five points of the strongest model people can currently use.

The gains also appear on some of the hardest evaluations to target: 16 points on CritPt physics reasoning, 12 points on Humanity’s Last Exam, 9 points on long-context reasoning, 15 points on the agentic banking benchmark, 7 points on SciCode, and 16 points on Terminal-Bench 2.1, all measured by Artificial Analysis. Gains spread this widely are hard to dismiss as a single benchmark trick.

AA-Briefcase is even more convincing. It uses 91 held-out tasks across four multi-week knowledge-work projects, with nearly 2,000 source files, more than 3,500 emails, and 25,000 Slack messages. GLM-5.2 ranked third, behind Fable and Opus 4.8 but ahead of GPT-5.5 and every other open-weight model. The private tasks and rubrics make contamination and targeted training much harder, and no model is close to solving it: Fable passed every rubric criterion on only 3% of tasks.

Z.ai did not appear from nowhere. It grew out of research at Tsinghua University led by co-founder Jie Tang, whose team launched the AMiner researcher graph in 2006, contributed to the 1.75-trillion-parameter 2021 Wu Dao project, and has worked on the General Language Model architecture for years. The talent bench runs deep.

GLM-5.2 improved sharply without getting larger. It keeps the same roughly 750-billion-total, 40-billion-active Mixture-of-Experts scale as 5.1, but its context has grown from 200K to 1 million tokens. The main advances came from architecture, long-context training, reinforcement learning, distillation, and serving.

IndexShare is the architectural unlock. GLM already used sparse attention, where each query attends to a selected set of past tokens, but the expensive indexer still searched the full history at every layer. GLM-5.2 runs that search once per four-layer block and reuses the selected positions across the next three layers, while each layer still performs its own attention and mixture-of-experts computation. Z.ai reports a 2.9x reduction in per-token floating-point operations (FLOPs) for a one-million-token context, and the related IndexCache paper measured up to 1.82x faster prefill and 1.48x faster decoding while preserving quality.
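To make the sharing pattern concrete, here is a toy Python sketch of the control flow described above. It is not Z.ai's implementation: the `top_k_positions` indexer, the `fake_scores` function, and the `BLOCK_SIZE` constant are illustrative assumptions. It only shows the indexer running on the first layer of each four-layer block while the next three layers reuse its selection.

```python
from typing import Callable, List, Tuple

BLOCK_SIZE = 4  # assumption: one indexer pass per four-layer block, as described above


def top_k_positions(scores: List[float], k: int) -> List[int]:
    """Toy indexer: return the positions of the k highest scores, in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(
    num_layers: int,
    score_fn: Callable[[int], List[float]],
    k: int,
    share: bool,
) -> Tuple[List[List[int]], int]:
    """Return (selected positions per layer, number of indexer calls)."""
    selections: List[List[int]] = []
    indexer_calls = 0
    cached: List[int] = []
    for layer in range(num_layers):
        if not share or layer % BLOCK_SIZE == 0:
            # Expensive step: score the full history and pick k positions.
            cached = top_k_positions(score_fn(layer), k)
            indexer_calls += 1
        # Each layer would still run its own attention and MoE computation
        # over the selected positions (omitted); only the selection is reused.
        selections.append(cached)
    return selections, indexer_calls


def fake_scores(layer: int) -> List[float]:
    """Hypothetical relevance scores over an 8-token history."""
    return [((pos * 7 + layer) % 5) / 5 for pos in range(8)]


_, baseline_calls = run_layers(8, fake_scores, k=3, share=False)
shared_selections, shared_calls = run_layers(8, fake_scores, k=3, share=True)
print(baseline_calls, shared_calls)
```

With eight layers, the toy makes two indexer passes instead of eight. The 2.9x FLOP reduction quoted above is Z.ai's reported figure and is not something this sketch reproduces.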

The revised multi-token prediction layer improves speculative decoding acceptance by 20%, and Z.ai reworked key-value cache management, kernels, scheduling, and 8-bit floating-point (FP8) serving. These changes make long contexts and reinforcement-learning rollouts cheaper, but they are enablers. The learning breakthroughs came from what Z.ai did with the extra horizon.

My strongest hypothesis for the largest driver of the capability jump is compaction-aware reinforcement learning on more complex agentic tasks. Claude Code and Codex led a breakthrough in agentic coding adoption late last year, when context compaction enabled agents to work through far longer tasks without carrying every token forever. Compaction turns one long episode into several linked fragments; one run may produce two fragments and another eight, with the final reward arriving hours after the earliest decisions. Group-relative reinforcement learning becomes awkward because runs differ in both the number and the length of their fragments.

GLM-5.2 moved to critic-based proximal policy optimization (PPO) for these tasks. It trains on every compacted fragment, uses a critic to estimate token-level advantages, and applies a token-level loss to handle unequal lengths. The model therefore learns from work on both sides of compaction and trains on the lossy summaries it will meet in real deployment.
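A small sketch may help show why a token-level loss sidesteps the unequal-length problem. This is a toy under stated assumptions, not Z.ai's training code: the fragment data, the `clipped_term` helper, and the clip range of 0.2 are hypothetical, and the per-token advantages are simply given rather than produced by a real critic. It applies the standard PPO clipped surrogate to each token and averages over all tokens from all fragments, then contrasts that with averaging per fragment first.

```python
from typing import List, Tuple

# One token = (probability ratio new/old policy, advantage estimate from a critic).
Token = Tuple[float, float]
Fragment = List[Token]

CLIP_EPS = 0.2  # assumption: a commonly used PPO clip range, not a reported GLM-5.2 value


def clipped_term(ratio: float, advantage: float, eps: float = CLIP_EPS) -> float:
    """Standard PPO clipped surrogate for a single token."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def token_level_loss(fragments: List[Fragment]) -> float:
    """Average the negated surrogate over every token in every fragment."""
    terms = [clipped_term(r, a) for frag in fragments for (r, a) in frag]
    return -sum(terms) / len(terms)


def fragment_level_loss(fragments: List[Fragment]) -> float:
    """Contrast: average within each fragment first, then across fragments."""
    per_fragment = [
        sum(clipped_term(r, a) for (r, a) in frag) / len(frag) for frag in fragments
    ]
    return -sum(per_fragment) / len(per_fragment)


# Hypothetical episode compacted into two fragments of very different lengths.
short_fragment: Fragment = [(1.0, 1.0)]
long_fragment: Fragment = [(1.0, -1.0)] * 9
episode = [short_fragment, long_fragment]

print(token_level_loss(episode))     # all 10 tokens weigh the same
print(fragment_level_loss(episode))  # the single-token fragment gets half the weight
```

In the per-fragment average, one token in a short fragment counts as much as nine tokens in a long one. The token-level average weights every token equally, regardless of how many fragments a run was compacted into.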

The second likely driver is scaled on-policy distillation (OPD). OPD existed in GLM-5; the 5.2 change was scaling it. Z.ai says it used its slime framework to consolidate more than 10 specialist models into a final model in roughly 2 days. Specialists can spend reinforcement learning compute discovering strong policies for coding, science, search, and tools. Distilling them into one model integrates those broad skills without repeating every expensive discovery process in one generalist training run.

Z.ai also added an online anti-hacking layer for coding reinforcement learning. Suspicious tool calls are caught by rules, judged for intent, blocked when necessary, and replaced with dummy results so the rollout can continue. That protects the reward signal from agents who read hidden tests, copy reference solutions, or fetch target code.
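The flow described above (rule match, intent judgment, block, dummy result) can be sketched in a few lines of Python. Everything here is hypothetical: the marker list, the `ToolCall` type, and the trivial `judged_as_cheating` heuristic are stand-ins for illustration and say nothing about how Z.ai actually implements its rules or its judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical rule list: substrings that hint at reaching for hidden tests
# or reference solutions. Real rules would need to be far more careful.
SUSPICIOUS_MARKERS = ("hidden_tests", "reference_solution")


@dataclass
class ToolCall:
    name: str
    argument: str


def flagged_by_rules(call: ToolCall) -> bool:
    """Stage 1: cheap rule match on the tool-call argument."""
    return any(marker in call.argument for marker in SUSPICIOUS_MARKERS)


def judged_as_cheating(call: ToolCall) -> bool:
    """Stage 2: stand-in for the intent judgment. Only file reads and shell
    commands count here, and a plain `echo` is treated as harmless."""
    if call.name not in {"read_file", "shell"}:
        return False
    return not call.argument.startswith("echo ")


def guarded_execute(call: ToolCall, execute: Callable[[ToolCall], str]) -> Dict[str, str]:
    """Run the tool unless it is flagged and judged as cheating; in that case
    return a dummy result so the rollout can continue."""
    if flagged_by_rules(call) and judged_as_cheating(call):
        return {"status": "blocked", "output": "<no output>"}
    return {"status": "ok", "output": execute(call)}


def fake_tool(call: ToolCall) -> str:
    """Hypothetical tool backend used only for this demonstration."""
    return f"ran {call.name}"


blocked = guarded_execute(ToolCall("read_file", "/eval/hidden_tests/test_api.py"), fake_tool)
allowed = guarded_execute(ToolCall("read_file", "src/app.py"), fake_tool)
harmless = guarded_execute(ToolCall("shell", "echo hidden_tests"), fake_tool)
print(blocked["status"], allowed["status"], harmless["status"])
```

Splitting the check into a cheap rule stage and a separate intent stage keeps false positives from blocking legitimate calls, such as the harmless `echo` in the example.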

I would be surprised if Z.ai were not also using outputs from Claude and GPT models wherever it could, for synthetic data, evaluation, or hard distillation, but that cannot explain the whole catch-up. The sudden gains line up with a much better system for generating, learning from, and consolidating long agent trajectories.

There is no clean ablation assigning credit across these changes, and some of Z.ai’s own 5.1 comparisons also changed context windows, benchmark versions, judges, time limits, or output budgets. My ranking is compaction-aware PPO for ultra-long agents, scaled multi-specialist distillation for the broad jump, and IndexShare as the efficiency multiplier that made both affordable.

GLM-5.2 still lacks multimodal input, which probably saved substantial training cost and complexity, though Z.ai does not quantify the savings. It also draws a hard deployment boundary: the model cannot inspect screenshots, review a rendered interface, read image-heavy documents, or visually test a browser workflow.

The economics are compelling, with a catch. Z.ai charges $1.40 per million input tokens, $0.26 for cached input, and $4.40 for output. The current Artificial Analysis cost-per-task chart puts GLM-5.2 at $0.52, compared with $0.86 for GPT-5.5 and $1.80 for Opus 4.8. GLM also used about 140 million output tokens across the Intelligence Index, versus 72 million for GPT-5.5 and 120 million for Opus, so lower token efficiency partly offsets the price advantage. DeepSeek V4 Pro remains roughly 10 times cheaper per task, at about $0.04-$0.05, although it scores 7 points lower on the index.
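The per-task figures above turn into relative savings with simple arithmetic. The snippet below uses only numbers quoted in this section; the variable and function names are ours, and the DeepSeek figure takes the midpoint of the quoted $0.04-$0.05 range as an assumption.

```python
# Cost-per-task figures quoted above (US dollars, per Artificial Analysis).
cost_per_task = {"GLM-5.2": 0.52, "GPT-5.5": 0.86, "Opus 4.8": 1.80}
# Assumption: midpoint of the quoted $0.04-$0.05 range for DeepSeek V4 Pro.
deepseek_cost_per_task = 0.045

# Output tokens used across the Intelligence Index, in millions, as quoted above.
output_tokens_millions = {"GLM-5.2": 140, "GPT-5.5": 72, "Opus 4.8": 120}


def saving_versus(other: str) -> float:
    """Fraction saved per task by using GLM-5.2 instead of `other`."""
    return 1 - cost_per_task["GLM-5.2"] / cost_per_task[other]


def token_ratio_versus(other: str) -> float:
    """Output tokens GLM-5.2 used for each output token used by `other`."""
    return output_tokens_millions["GLM-5.2"] / output_tokens_millions[other]


for other in ("GPT-5.5", "Opus 4.8"):
    print(
        f"{other}: saving {saving_versus(other):.0%}, "
        f"token ratio {token_ratio_versus(other):.2f}x"
    )

deepseek_multiple = cost_per_task["GLM-5.2"] / deepseek_cost_per_task
print(f"GLM-5.2 costs {deepseek_multiple:.1f}x DeepSeek V4 Pro per task")
```

The savings come out at about 40% against GPT-5.5 and 71% against Opus 4.8, even though GLM-5.2 used roughly 1.9x and 1.2x as many output tokens, respectively.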

Open weights do not automatically make self-hosting economical. GLM-5.2 has 753 billion parameters in the released checkpoint. The practical FP8 vLLM recipe needs eight H200-class GPUs, while serving the full one-million-token window is documented on eight B200s with an FP8 key-value cache. Most companies cannot batch enough simultaneous work or keep that cluster busy around the clock, whereas a specialist inference provider can spread the hardware across many customers and reach much higher utilization. For most teams, the weights offer control and portability, while a hosted endpoint delivers the lower bill.
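The utilization argument can be made concrete with a rough cost model. All inputs below are hypothetical placeholders (the GPU-hour price and the peak throughput are not figures from Z.ai, vLLM, or any provider); the point is only that hardware cost per token scales inversely with how busy the cluster is kept.

```python
def cost_per_million_tokens(
    gpu_count: int,
    hourly_cost_per_gpu: float,
    peak_tokens_per_second: float,
    utilization: float,
) -> float:
    """Hardware cost per million generated tokens at a given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hourly_cost = gpu_count * hourly_cost_per_gpu
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical inputs: eight GPUs at $3 per GPU-hour, 2,000 tokens/s at full load.
for utilization in (0.05, 0.25, 0.80):
    cost = cost_per_million_tokens(8, 3.0, 2000.0, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.2f} per million tokens")
```

Under these made-up inputs, a cluster that is busy 5% of the time pays sixteen times more per token than one that is busy 80% of the time, which is the gap a shared provider can exploit.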

In Towards AI’s enterprise deployment work, routing routine coding through GLM-5.2 on a compliant US provider is now an easy, cost-saving recommendation for those optimizing token budgets. We would keep both Codex and Claude Code available to all developers and send bounded refactors and text-heavy repository work to GLM-5.2. Claude Code supports it directly; Codex can reach it through a provider or adapter that implements the Responses API.

If you’re cost-sensitive, I recommend reserving GPT-5.5 or Opus 4.8 for the hardest planning, recovery, final review, complex unit and integration testing, browser use, user-interface (UI) work, and screenshot-based testing. DeepSeek V4 Pro is also a strong subagent option for high-volume, easily checked summarization, extraction, classification, and structured data preparation.

I expect Z.ai trained GLM-5.2 on a far smaller budget than Anthropic, OpenAI, Google, xAI, or Meta spend on recent frontier programs. The gap in inference pricing is much narrower, however: this is not a model aimed at the bargain tier. Two questions now matter. Can Z.ai maintain this trajectory and catch Fable or Mythos? And how many other labs can reproduce its ten-week model checkpoint jump by combining longer reinforcement-learning tasks, compaction-aware training, specialist distillation, and better reward integrity?

The longer leading US models are withheld, disabled, or constrained, the greater the chance that a Chinese open-weight model becomes the strongest capability the public can actually use. A hypothetical open-weight Fable or Mythos would face dramatically fewer restrictions and much less provider oversight than the constrained Fable API that briefly appeared this month. GLM-5.2 shows why that policy collision is approaching faster than expected.

Why should you care?

GLM-5.2 is now good enough to change enterprise model routing. A 40% saving against GPT-5.5 and roughly 70% against Opus 4.8 per Artificial Analysis task becomes material once coding agents read repositories, call tools, spawn subagents, and retry work all day. The saving only survives when GLM finishes the task without creating extra review or rescue work, so the number that matters is cost per accepted result after retries and human review, not the headline token price.
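One way to compute that number is sketched below. The model costs are the Artificial Analysis per-task figures quoted in this issue; the review cost and acceptance rates are hypothetical placeholders, and the formula assumes independent attempts with a fixed acceptance rate, which real workloads will only approximate.

```python
def cost_per_accepted_result(
    model_cost_per_attempt: float,
    review_cost_per_attempt: float,
    acceptance_rate: float,
) -> float:
    """Expected spend to get one accepted result, retrying until success.

    Assumes attempts are independent with a fixed acceptance rate, so the
    expected number of attempts is 1 / acceptance_rate.
    """
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    return (model_cost_per_attempt + review_cost_per_attempt) / acceptance_rate


# Hypothetical: $0.50 of human review per attempt, 70% vs 90% acceptance.
glm = cost_per_accepted_result(0.52, review_cost_per_attempt=0.50, acceptance_rate=0.70)
gpt = cost_per_accepted_result(0.86, review_cost_per_attempt=0.50, acceptance_rate=0.90)
print(f"GLM-5.2: ${glm:.2f}  GPT-5.5: ${gpt:.2f}  saving: {1 - glm / gpt:.0%}")
```

With these invented inputs, the headline 40% saving shrinks to about 4%, which is why acceptance and review data matter more than the token price.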

The practical pattern is a model hierarchy. Start with routine, text-heavy work where success is visible: bounded refactors, test generation, migration chores, repository questions, extraction, and structured first drafts. Keep a frontier model as the escalation path for ambiguous planning, failed runs, high-impact changes, and final review. Route visual tasks straight to GPT-5.5 or Opus, since GLM cannot inspect screenshots or rendered interfaces, and hand simple, high-volume subagent work to DeepSeek V4 Pro where outputs can be checked automatically.

Build the routing policy from real traces rather than guesswork. Measure accepted results, retries, latency, token use, human review, and failures by task category, then move each category to the cheapest model that clears your quality bar. A US-hosted endpoint supplies the contractual controls many enterprises need while avoiding an eight-GPU self-hosting commitment. The immediate opportunity is a model hierarchy that spends frontier tokens only where they change the outcome.
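The rule "cheapest model that clears your quality bar" is easy to express in code once the traces are aggregated. The sketch below is illustrative only: the `Measured` fields, the category names, and every number in `traces` are hypothetical, and a real policy would also weigh latency and failure severity.

```python
from typing import Dict, NamedTuple


class Measured(NamedTuple):
    acceptance_rate: float    # share of results accepted after review
    cost_per_accepted: float  # dollars per accepted result, retries included


def choose_model(stats: Dict[str, Measured], quality_bar: float, fallback: str) -> str:
    """Pick the cheapest model whose acceptance rate clears the bar."""
    eligible = {name: m for name, m in stats.items() if m.acceptance_rate >= quality_bar}
    if not eligible:
        return fallback  # nothing is good enough: escalate to the frontier model
    return min(eligible, key=lambda name: eligible[name].cost_per_accepted)


def build_routing(
    traces: Dict[str, Dict[str, Measured]], quality_bar: float, fallback: str
) -> Dict[str, str]:
    """Map each task category to its chosen model."""
    return {
        category: choose_model(stats, quality_bar, fallback)
        for category, stats in traces.items()
    }


# Hypothetical measurements by task category.
traces = {
    "bounded_refactor": {
        "GLM-5.2": Measured(0.92, 0.60),
        "GPT-5.5": Measured(0.95, 0.95),
    },
    "extraction": {
        "DeepSeek V4 Pro": Measured(0.97, 0.05),
        "GLM-5.2": Measured(0.98, 0.55),
    },
    "ambiguous_planning": {
        "GLM-5.2": Measured(0.71, 0.90),
        "DeepSeek V4 Pro": Measured(0.55, 0.12),
    },
}

routing = build_routing(traces, quality_bar=0.90, fallback="Opus 4.8")
print(routing)
```

Rerunning this over fresh traces each week keeps the policy tied to measured outcomes rather than to benchmark rankings.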

Two larger shifts sit behind that tactical win. The first is that the public frontier may soon move to open weights in certain areas where the closed labs limit their models’ abilities. Fable showed how fast a hosted frontier model can disappear, while GLM-5.2 shows the other side: once MIT-licensed weights are distributed across providers and downloaded by users, no single company or government can switch the model off globally. If US labs keep their strongest systems private or heavily restricted, Chinese labs only need to beat the models people can actually access, not every internal checkpoint, and GLM already ranks ahead of GPT-5.5 on AA-Briefcase. Published frontier weights would also strip away most of the API-level classifiers, identity checks, monitoring, and country restrictions that govern access today, which is why governments may start targeting weights, compute, or distribution. Enterprises should preserve provider portability now: keep prompts and tool definitions outside any single platform, maintain a replayable evaluation set, and test at least one open-weight fallback.

The second is that long-horizon training looks like a reproducible breakthrough. IndexShare reduced the cost of long contexts, compaction-aware PPO let every fragment of a long run contribute to learning, and specialist distillation moved policies discovered in separate programs into a single model. Together, those form a repeatable engineering agenda other labs can copy. A lab with strong researchers, adequate compute, realistic tasks, reliable verifiers, and access to frontier-generated data may now close a large capability gap during post-training alone. Not everyone will pull it off, since long agent runs are expensive and weak reward design trains shortcuts quickly, but the techniques are legible enough for MiniMax, DeepSeek, Qwen, xAI (particularly if it integrates Cursor’s long-horizon agentic coding data), and others to chase.

— Louie Peters — Towards AI Co-founder and CEO

Hottest News

1. SpaceX To Acquire the AI Coding Startup Cursor for $60 Billion

SpaceX agreed to acquire Anysphere, the company behind Cursor, in a $60 billion all-stock deal announced on June 16, four days after SpaceX’s record-setting IPO. Cursor investors will receive SpaceX Class A common stock, representing a 3.4% dilution at SpaceX’s IPO valuation. The deal is the largest acquisition of a venture-backed startup on record. Cursor’s annualized revenue had climbed to $4 billion by early June, though its market share in AI coding tools had declined from 41% in June 2025 to roughly 26% in May, according to Ramp spending data. SpaceX and Cursor have been jointly training an AI model over recent months, which SpaceX plans to release on both Cursor and its Grok Build coding agent. The acquisition is intended to strengthen SpaceX’s AI division, formed through its earlier merger with xAI, which has struggled to build a competitive coding product. The deal is expected to close in Q3 2026.

2. Trump Says Anthropic Is No Longer a National Security Threat, Fable Access Expected To Return

President Trump told The Axios Show on June 19 that he no longer views Anthropic or its CEO, Dario Amodei, as a national security threat, a shift from the administration’s position the prior week. “Well, not now, but a week ago, maybe,” Trump said when asked directly. The remarks followed a meeting between Amodei and Trump at the G7 summit in Évian-les-Bains, France, where Amodei and Demis Hassabis, CEO of Google DeepMind, jointly proposed a US-led democratic AI alliance. However, the Commerce Department’s export control directive issued on June 12, which forced Anthropic to disable Fable 5 and Mythos 5 for all users worldwide, has not been formally rescinded. The Pentagon’s March supply-chain risk designation and the ban on federal agencies’ use of Anthropic technology also remain in place. An Anthropic managing director said at the company’s Seoul office launch on June 18 that he was “very confident” both models would return “in the coming days.”

3. ChatGPT Market Share Drops Below 50% for First Time

ChatGPT’s share of the global AI assistant market fell to 46.4% by the end of May, the first time it has dropped below 50%, according to Sensor Tower’s State of AI Report for 2026. The decline has been steady: from 65.3% in December 2024 to 52.8% in December 2025 to 46.4% by May 2026. ChatGPT remains the most popular AI assistant with over 1.1 billion monthly users. Gemini holds 27.7% of the market with 662 million monthly users, and Claude holds 10.3% with 245 million monthly users. Claude leads all platforms in subscription conversion, with 13% of its users paying. Switching behavior is accelerating: OpenAI’s Department of Defense partnership in February triggered a 295% day-over-day surge in ChatGPT uninstalls, while Claude’s US downloads jumped 51% the same day. Total spending on AI apps is on pace to reach $4.2 billion in H1 2026, up from $1.83 billion in H1 2025.

4. Nobel Laureate John Jumper Leaves DeepMind for Anthropic

John Jumper, who shared the 2024 Nobel Prize in Chemistry with Demis Hassabis for co-creating AlphaFold, announced on June 19 that he is leaving Google DeepMind after nearly nine years to join Anthropic. Jumper served as VP and Engineering Fellow at Google DeepMind and was a key member of Google’s AI coding development team, according to Bloomberg. AlphaFold has been used to predict over 200 million protein structures and is used by more than 2 million researchers across 190 countries. Anthropic has been building dedicated AI-for-science infrastructure throughout 2026, including wet labs and partnerships with the Allen Institute and Howard Hughes Medical Institute. Google DeepMind confirmed that Jumper would remain through the end of the year to assist with the transition.

5. Transformer Co-Inventor Noam Shazeer Leaves Google for OpenAI

Noam Shazeer, co-author of the 2017 “Attention Is All You Need” paper and co-lead of Google’s Gemini AI models, announced on June 18 that he is leaving Google to join OpenAI as Lead for Architecture Research. The departure comes less than two years after Google paid approximately $2.7 billion in a licensing deal with Character.AI that brought Shazeer and his research team back to lead Gemini development. Shazeer first joined Google in 2000 and spent over two decades at the company across two stints. He was credited with helping close the gap between Gemini and ChatGPT during his return. The move lands as OpenAI prepares for a potential IPO, with a confidential S-1 filed on June 8. Combined with Jumper’s departure to Anthropic the following day, Google lost two of its most prominent AI researchers to its two largest competitors in a single week.

6. Anthropic Adds Enterprise-Managed Authorization for MCP Connectors

Anthropic launched Enterprise-Managed Authorization (EMA) for MCP connectors, allowing IT administrators to provision connector access once through their identity provider and have employees inherit it automatically on first login, with no individual OAuth flows required. Okta is the first supported identity provider, using its Cross App Access (XAA) protocol, built on the ID-JAG standard. Seven MCP providers support EMA at launch: Asana, Atlassian, Canva, Figma, Granola, Linear, and Supabase, with Slack coming next. HubSpot, Ramp, and Webflow are among the early adopters. Ramp reports that 2,000 employees are provisioned through Okta with zero additional steps. The feature works across Claude chat, Claude Code, and Cowork for Team and Enterprise plans. EMA is built on an open extension to the MCP authorization specification, meaning any connector, including custom-built internal tools, can implement the standard.

7. Alibaba Releases Three New Foundation Models for Embodied Intelligence

Alibaba’s Qwen team released the Qwen-Robot Suite, consisting of three foundation models that bridge vision-language understanding and physical robotic control. Qwen-RobotNav is a navigation model built on Qwen3-VL (available at 2B, 4B, and 8B sizes) that unifies five navigation task families (instruction following, point-goal navigation, object-goal search, target tracking, and autonomous driving) under a single model with a controllable observation protocol. Qwen-RobotManip is a vision-language-action model built on Qwen3.5-4B VL, trained on over 38,100 hours of open-source manipulation data. It aligns heterogeneous robot data into an 80-dimensional canonical action space, achieving 3.2x the cross-embodiment transfer rate of π0.5 and ranking first on the RoboChallenge Table30-v1 generalist track. Qwen-RobotWorld is a 20B-parameter language-conditioned video world model that predicts physically grounded futures across manipulation, driving, and navigation scenarios, using natural language as a universal action interface, and is trained on 8.6 million video-text pairs. The models are in pilot testing with Alibaba Cloud enterprise customers.

AI Tip of the Day

When an agent chooses a tool, don’t assume it understood the task.

Sometimes it only matches a word in the user’s request to a word in the tool name. For example, if the user asks for the “latest report,” the agent might still call the search tool even though the report has already been uploaded. Or it might call a database tool when the answer was already in the prompt.

To debug this, log three things side by side: the user’s request, the tool the agent picked, and the arguments it used. Then check whether the tool matched what the user actually wanted. Don’t only look for tool errors. A tool can run successfully and still be the wrong tool.
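A minimal way to capture those three fields is sketched below. The record layout, the tool names, and the `expected_tool` label (filled in by a reviewer or an eval set) are all hypothetical; adapt them to whatever your agent framework already logs.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ToolChoiceRecord:
    user_request: str
    tool_name: str
    arguments: Dict[str, Any]
    tool_succeeded: bool
    expected_tool: Optional[str] = None  # hypothetical label added during review


def to_log_line(record: ToolChoiceRecord) -> str:
    """Serialize one record as a JSON line so the fields sit side by side."""
    return json.dumps(asdict(record), sort_keys=True)


def wrong_tool_despite_success(records: List[ToolChoiceRecord]) -> List[ToolChoiceRecord]:
    """Return calls that ran without error but used a tool the reviewer did not expect."""
    return [
        r
        for r in records
        if r.tool_succeeded and r.expected_tool is not None and r.tool_name != r.expected_tool
    ]


records = [
    ToolChoiceRecord(
        user_request="Summarize the latest report I uploaded",
        tool_name="web_search",
        arguments={"query": "latest report"},
        tool_succeeded=True,
        expected_tool="read_uploaded_file",
    ),
    ToolChoiceRecord(
        user_request="Find today's USD to EUR exchange rate",
        tool_name="web_search",
        arguments={"query": "USD EUR rate"},
        tool_succeeded=True,
        expected_tool="web_search",
    ),
]

for record in records:
    print(to_log_line(record))

mismatches = wrong_tool_despite_success(records)
print(len(mismatches))
```

The first record is the failure mode described above: the search succeeded, yet the request was about a file the user had already uploaded.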

If you’re exploring agent engineering and want to go deeper into tool use and guardrails, our Agent Engineering: Building Multi-Agent Systems course is the cleanest path to building production agents.

Five 5-minute reads/videos to keep you learning

1. Building a Gemini Live voice app with React, FastAPI and your own WebSocket protocol

This article walks through building a real-time voice app on top of Gemini Live using React and FastAPI. All Gemini-specific logic routes through a backend WebSocket rather than the browser, so the backend owns the API key, system prompt, and model configuration, while the frontend communicates via a five-message protocol that the author defines and controls. Two async loops handle audio in both directions: browser mic input streams to Gemini as PCM16, and Gemini’s audio response streams back with a transcript. The result is a setup where swapping providers requires changes to a single backend file.

2. Why Most Multi-Agent AI Systems Waste 90% of Their Time (And How to Fix It)

Multi-agent setup costs are a real bottleneck, and memory snapshots fix them. The author built a five-agent code analysis swarm in which each agent spent 90 seconds installing tools before doing 8 seconds of actual work. By checkpointing a fully prepared VM using TensorLake’s memory snapshot (capturing disk state, running processes, and memory in one pass), then forking five independent copies, setup cost moved outside the loop entirely. A lead GPT-4o call then synthesized all five reports into a single, prioritized list of fixes.

3. LLM Observability with LangSmith — Part 1: Tracing Everything & Building Audit-Grade Callbacks

This article builds an LLM observability pipeline through a practical implementation for a LangGraph customer support agent. The author sets up zero-config tracing with two environment variables, enabling complete request replay across every classifier call, retrieval step, and LLM response, and then builds a compliance-grade audit callback using LangChain’s BaseCallbackHandler, logging PII-redacted JSON Lines locally before any data leaves the network. Both layers run independently, giving engineers a debuggable trace and auditors a tamper-evident chain of custody.

4. LLM Observability with LangSmith — Part 2: Eval Gates, Prompt Versioning & Choosing Your Stack

This article tackles the hardest observability question: catching prompt regressions before they reach production. The author built a versioned eval dataset with trap cases, wired an exact-match evaluator into a LangSmith experiment, and surfaced a real routing failure in a single table row. Prompt versioning via the Hub provides immutable commits, movable production tags, and instant rollbacks without deploys. The article closes with a LangSmith-vs-Langfuse decision matrix and a five-rung privacy ladder for teams where traces cannot leave the building.

5. Building a Stateful Code Interpreter with Tensorlake Sandboxes

This article walks through the process of building a production-grade stateful code interpreter using Tensorlake MicroVM sandboxes. Starting with a basic Claude-plus-sandbox setup, it progresses to persistent named sessions, suspend-and-resume workflows, filesystem and memory snapshots, and parallel branch forking for comparing experimental approaches. The key distinction is between images as environment blueprints and snapshots as full runtime captures. Pre-warmed golden snapshots handle cold-start costs at scale, giving engineers a concrete path from a throwaway chat session to a persistent, restorable computing environment.

Repositories & Tools

1. Deer Flow is a super agent harness that orchestrates sub-agents, persistent memory, sandboxed execution, and extensible skills to handle long-horizon tasks.

2. Codebase-memory-mcp is an MCP server that indexes codebases into a persistent knowledge graph across 158 languages, delivering sub-millisecond queries and roughly 120x fewer tokens than file-by-file exploration.

3. FAPO uses Claude Code as an autonomous optimizer that iteratively improves prompts, parameters, and chain architecture.

4. YaFF is a C++ serialization library that provides a zero-copy wire format for Protobuf.

Top Papers of The Week

1. VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

This paper introduces VibeThinker-3B, a compact dense model with 3B parameters developed to test how far verifiable reasoning can be pushed within a small-model regime. VibeThinker-3B is trained through a three-stage pipeline: curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation, built on the Spectrum-to-Signal post-training paradigm. It scored 94.3 on AIME 2026 (97.1 with claim-level test-time scaling), 80.2 Pass@1 on LiveCodeBench v6, and a 96.1% acceptance rate on unseen LeetCode weekly contests from April-May 2026.

2. GLM-5: from Vibe Coding to Agentic Engineering

This technical report presents GLM-5, Z.AI’s 744B-parameter MoE foundation model designed to move beyond vibe coding toward autonomous, multi-step agentic engineering. The model adopts DeepSeek Sparse Attention (DSA) to reduce training and inference costs while maintaining long-context fidelity up to 200K tokens. A new asynchronous RL infrastructure decouples generation from training to improve post-training efficiency, and novel asynchronous agent RL algorithms enable the model to learn from complex, long-horizon interactions.

3. FastContext: Training Efficient Repository Explorer for Coding Agents

In most coding agents, the same model that solves the task also explores the repository, leaving exploratory reads and searches in the solver’s context and wasting token budget on irrelevant code. FastContext separates these two roles by introducing a dedicated exploration subagent that issues parallel read-only tool calls (read, glob, grep) and returns concise file paths and line ranges as focused context. The exploration models span 4B to 30B parameters, bootstrapped from strong reference-model trajectories and refined with task-grounded rewards. Integrating FastContext into Mini-SWE-Agent improved end-to-end resolution rates by up to 5.5% across SWE-bench Multilingual, SWE-bench Pro, and SWE-QA, while reducing the coding agent’s token consumption by up to 60%.

4. LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

Looped Transformers scale latent computation by reapplying shared blocks, but sequential looping increases latency and KV cache memory usage proportionally. Parallel Loop Transformers (PLT) address this through cross-loop position offsets and shared-KV gated sliding-window attention, making the loop count a tunable design parameter. This paper trains a family of 7B PLT code models from scratch on 18T tokens, with varying loop counts, to study the gain-cost trade-off. The finding is that a two-loop configuration captures most of the representational refinement, while additional loops introduce positional mismatch and oscillatory updates that degrade performance.

5. Moebius: 0.2B Lightweight Image Inpainting with 10B-Level Performance

Current state-of-the-art image inpainting models, such as FLUX.1-Fill-Dev, operate with 10B+ parameters, making deployment expensive. Moebius compresses this capability into 0.22B parameters (less than 2% of FLUX.1-Fill-Dev) by reconstructing the diffusion backbone with Local-λ Mix Interaction blocks that summarize spatial contexts and global semantic priors into fixed-size linear matrices. An adaptive multi-granularity distillation strategy that operates entirely in latent space, avoiding pixel-space decoding costs, unlocks the representational capacity of this compressed architecture.

Quick Links

1. Perplexity launches Brain, a self-improving memory system for its Computer agent that remembers what the agent did rather than user preferences. Brain builds a context graph of completed tasks, including what worked, what failed, and what corrections were made, then synthesizes that graph overnight into an LLM wiki that loads automatically into every subsequent session. Early internal results show a 25% increase in answer correctness on repeated tasks, 16% higher recall, and 13% lower cost on tasks requiring historical context. Available in research preview for Max and Enterprise Max subscribers.

2. Claude Code now supports artifacts, turning session work into live, shareable web pages that update in place as the session progresses. Each artifact is a self-contained HTML page built from the local codebase, connected MCP tools, and conversation history, published to a private org-only URL with version history. Use cases include PR walkthroughs, incident timelines, dashboards, and release checklists. Available in beta for Team and Enterprise organizations from the CLI and desktop app.

3. Sakana AI commercializes AB-MCTS in Sakana Marlin, its first commercial product. Positioned as a “Virtual CSO,” Marlin is an autonomous research agent that runs for up to eight hours on a single topic, forming hypotheses, gathering sources, and resolving contradictions before returning a structured report of up to roughly 100 pages with executive slides. It builds on AB-MCTS (NeurIPS 2025 Spotlight) and The AI Scientist (Nature). Pricing is pay-per-use at 100 credits per run (¥98/credit), with Pro, Team, and Enterprise tiers. The underlying algorithm is open-sourced as TreeQuest under the Apache 2.0 license.

4. Liquid AI introduces LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M, two 350M-parameter multilingual retrieval models and the first bidirectional members of the LFM family. The Embedding model produces a single vector per document for fastest search and smallest index. The ColBERT model produces per-token vectors for word-level matching with higher accuracy at the cost of a larger index. Both support 11 languages, run via llama.cpp GGUFs on CPUs and edge devices with sub-10ms query embedding latency, and outperform Qwen3-Embedding-0.6B despite being smaller. Available on Hugging Face under the LFM Open License v1.0.

Who’s Hiring in AI

Production Engineer (University Grad) @Meta (New York, NY, USA)

AI Technology Support Engineer/Analyst @Amentum (Remote/USA)

Software Developer 4 @Oracle (Raleigh, NC, USA)

Senior Engineer (m/f/x) for OpenIngest Generative AI @Dynatrace (Vienna, Austria)

Automation Engineer @Caterpillar, Inc. (Piracicaba, Brazil)

Product Engineer III @Splice (Remote/USA)

Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.

Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
