TAI #145: Hybrid Mamba Models Enter the Race as Inference API Costs 10x with o1-pro
Tencent's Hunyuan-T1, Nvidia's Nemotron-H, and Claude Search, Mistral Small 3.1!
What happened this week in AI by Louie
This week saw AI inference costs balloon as OpenAI launched its most expensive model yet, o1-pro, via API. It is priced at an eye-watering $150 per million input tokens and $600 per million output tokens, ten times the already steep cost of o1. The increase likely stems from the rumored use of “parallel scaling,” where multiple model instances concurrently tackle the same task, substantially raising compute demands. These prices are a thousand times higher per token than GPT-4o mini's, and once the extra output tokens generated by reasoning models are factored in, the real-world price per task can be 5,000 to 10,000 times higher.
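To make that arithmetic concrete, here is a quick back-of-the-envelope comparison. The prices are the published per-million-token rates; the token counts, and the assumption that a reasoning model emits roughly ten times more output tokens per task, are illustrative:

```python
# Back-of-the-envelope cost comparison (token counts are illustrative).
# Prices are USD per million tokens, as published.
O1_PRO = {"in": 150.00, "out": 600.00}
GPT_4O_MINI = {"in": 0.15, "out": 0.60}

def task_cost(price: dict, in_tokens: int, out_tokens: int) -> float:
    return price["in"] * in_tokens / 1e6 + price["out"] * out_tokens / 1e6

# Assume a reasoning model emits ~10x more output tokens for the same task.
cheap = task_cost(GPT_4O_MINI, in_tokens=2_000, out_tokens=1_000)
pro = task_cost(O1_PRO, in_tokens=2_000, out_tokens=10_000)
print(f"4o-mini: ${cheap:.4f}  o1-pro: ${pro:.2f}  ratio: {pro / cheap:,.0f}x")
# ~$0.0009 vs ~$6.30 per task, i.e., roughly 7,000x under these assumptions.
```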
Inference cost efficiency has also been making big progress, particularly with DeepSeek R1 and Google DeepMind's Gemini 2.0 Flash, but with so many ways to scale compute for greater capability, cost reduction is losing the race at the frontier. We can now scale inference compute in four different ways (albeit often inefficiently) to get better results: 1) more thinking tokens in a single chain of thought, 2) multiple model instances attempting the same problem in parallel, 3) multiple steps of actions chained together by AI agents such as Deep Research, and 4) larger model sizes.
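Option 2 is the simplest to sketch. Below is a minimal self-consistency loop that samples several answers concurrently and keeps the most common one; it assumes an OpenAI-compatible client, and the model name and exact-match vote heuristic are illustrative (real implementations usually extract a final answer before voting):

```python
# Minimal sketch of parallel scaling: sample N answers concurrently,
# then keep the most common one (self-consistency / majority vote).
import asyncio
from collections import Counter
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def ask(question: str) -> str:
    resp = await client.chat.completions.create(
        model="o1",  # illustrative model name
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content.strip()

async def majority_vote(question: str, n: int = 8) -> str:
    # Each extra sample multiplies per-task cost roughly linearly.
    answers = await asyncio.gather(*[ask(question) for _ in range(n)])
    return Counter(answers).most_common(1)[0][0]

print(asyncio.run(majority_vote("What is 17 * 23? Answer with just the number.")))
```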
This makes inference cost a major bottleneck to the adoption of the most advanced reasoning LLM agents, and the race is on to build architectures that balance reasoning capabilities with computational efficiency. Inference cost reduction so far has mostly come via tweaks to the same core Transformer building blocks: new attention mechanisms, context caching, sparser mixture-of-experts architectures, model distillation, model quantization, better GPUs, and so on. Many people are also researching more significant changes, however. Stepping into this gap are hybrid Mamba architectures, which combine Mamba state-space models with Transformer layers to directly address the efficiency challenges of pure Transformer-based models, particularly at ever-longer context lengths. We discussed Mamba models last year, but now they are truly beginning to be scaled to significant sizes, and even reasoning Mambas are becoming available.
Nvidia released strong hybrid Mamba models this week with Nemotron-H, available in sizes ranging from 8B to 56B parameters. The flagship Nemotron-H-56B was pre-trained on 20 trillion tokens using FP8 precision. Its efficient architecture enables handling contexts up to a million tokens when distilled to 47B parameters, providing substantial memory savings. Nemotron-H-47B notably achieves nearly triple the throughput of comparable Transformer models like Llama-3.1–70B and Qwen-2.5–72B when handling very long contexts (65,536 input tokens). The architecture swaps out most self-attention layers for linear-time Mamba layers, dramatically reducing compute during token generation; a rough sketch of this layer interleaving follows below.
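The sketch shows the general pattern of a hybrid stack: mostly linear-time sequence-mixing layers, with occasional self-attention layers interleaved. This is a toy illustration, not Nemotron-H's actual recipe; the SSMBlock is a depthwise-convolution stand-in for a real Mamba layer (e.g., from the mamba-ssm package), and the one-in-six attention ratio is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSMBlock(nn.Module):
    """Stand-in for a Mamba/state-space layer: runs in O(n) over sequence length."""
    def __init__(self, d_model: int):
        super().__init__()
        # Depthwise causal conv as a cheap placeholder for the SSM scan.
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=4, padding=3, groups=d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        h = self.mix(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + self.proj(F.silu(h))

class AttentionBlock(nn.Module):
    """Standard self-attention layer: O(n^2) over sequence length."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

def hybrid_stack(d_model: int = 512, n_layers: int = 24, attn_every: int = 6):
    # Keep roughly one attention layer per `attn_every` layers; the rest are SSM blocks.
    return nn.Sequential(*[
        AttentionBlock(d_model) if (i + 1) % attn_every == 0 else SSMBlock(d_model)
        for i in range(n_layers)
    ])

x = torch.randn(1, 1024, 512)
print(hybrid_stack()(x).shape)  # torch.Size([1, 1024, 512])
```

Because the quadratic-cost attention layers are a small minority of the stack, per-token generation cost grows far more slowly with context length than in a pure Transformer.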
Nvidia wasn’t alone in the Mamba spotlight. Tencent unveiled Hunyuan-T1, a reasoning model leveraging a Hybrid-Transformer-Mamba architecture. Nearly all (96.7%) of its post-training compute was dedicated to reinforcement learning for enhanced reasoning. Hunyuan-T1 matches or surpasses DeepSeek’s R1 and OpenAI’s o1 on some key benchmarks, scoring an impressive 87.2 on MMLU-Pro and 96.2 on MATH-500. Thanks to its hybrid architecture, Hunyuan-T1 achieves twice the inference speed of comparable Transformer-based models, with much lower API costs ($0.14 per million input tokens, $0.55 per million output tokens). The model was trained via a rigorous curriculum-learning strategy on diverse datasets, progressively increasing complexity and context length to maximize reasoning efficiency.
These developments feed into a broader discussion on whether Transformer architectures themselves deserve credit for recent AI progress or if improved pre-training datasets, scaled compute, and novel training regimes carry more weight. New Hybrid Mamba architectures, trained on high-quality datasets originally developed for Transformers, are rapidly achieving parity or better in some areas. This convergence suggests architecture choice may be less critical than previously thought, provided models are adequately scalable.
In parallel, diffusion-based LLMs, like LLaDA (Large Language Diffusion with mAsking), have also recently emerged as credible competitors to Transformers, applying methods originally developed for image generation to natural language modeling. LLaDA is trained to reconstruct randomly masked tokens and generates text by iteratively unmasking an initially fully masked sequence. At 8 billion parameters, LLaDA already demonstrates in-context learning comparable to LLaMA3–8B on some tasks.
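A toy sketch of that decoding loop is below: start from a fully masked sequence and repeatedly fill in the positions where the model is most confident. The `model` here is assumed to return per-position token logits, and the step count and confidence rule are illustrative, not LLaDA's exact schedule:

```python
# Toy sketch of masked-diffusion decoding for language.
import torch

def diffusion_decode(model, seq_len: int, mask_id: int, steps: int = 8):
    tokens = torch.full((1, seq_len), mask_id)        # start fully masked
    for step in range(steps):
        logits = model(tokens)                        # (1, seq_len, vocab)
        probs, preds = logits.softmax(-1).max(-1)     # confidence + argmax
        still_masked = tokens.eq(mask_id)
        # Unmask an evenly spread fraction of the remaining masked positions,
        # choosing the ones the model is most confident about.
        k = max(1, int(still_masked.sum()) // (steps - step))
        conf = probs.masked_fill(~still_masked, -1.0)
        idx = conf.topk(k, dim=-1).indices
        tokens[0, idx[0]] = preds[0, idx[0]]
    return tokens
```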
Why should you care?
The explosion in inference costs to access the very best models underscores an escalating challenge for AI deployment, particularly for reasoning-intensive applications. This leads to increased urgency around cheaper and more efficient architectures. Hybrid Mamba and diffusion-based models are emerging as serious contenders, offering improved efficiency and challenging the Transformer’s dominance. With inference efficiency over a long context now a critical factor for practical deployment, we may start to see the adoption of these alternative architectures in production scenarios.
The rapid convergence of hybrid and Transformer architectures when trained on similar datasets also raises important questions about future capability scaling. If architectures are becoming commoditized, innovation in datasets, training strategies, and reinforcement learning methods might dominate the next phase of AI development. We still think that the invention of new training objectives — such as the recent huge progress with reasoning models by rewarding “verified solutions” — can potentially unlock even more gains than new architectures.
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. OpenAI Introduced Advanced Audio Models
OpenAI has introduced advanced audio models: gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts, which enhance real-time speech synthesis and transcription capabilities for developers. These models build upon the GPT-4o architecture and are extensively pretrained on specialized audio datasets, improving word error rates and producing more natural-sounding speech.
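As a minimal sketch, assuming the openai Python SDK and the model names above (the file name and voice choice are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Speech-to-text with gpt-4o-transcribe.
with open("meeting.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe", file=f
    )
print(transcript.text)

# Text-to-speech with gpt-4o-mini-tts.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts", voice="alloy", input="Hello from the newsletter!"
)
with open("hello.mp3", "wb") as out:
    out.write(speech.content)
```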
2. ARC Prize Announced ARC-AGI 2 Benchmark, Pure LLMs score 0%
The ARC Prize organization has introduced the ARC-AGI-2 benchmark, designed to evaluate AI models on tasks that are straightforward for humans but challenging for machines. This updated benchmark assesses both performance and cost efficiency, requiring models to interpret symbols, apply interrelated rules, and adapt based on context. Recent evaluations revealed that non-reasoning models, or ‘Pure LLMs,’ scored 0%, while other reasoning models achieved less than 4%. In contrast, a human panel achieved a perfect score of 100%. OpenAI’s unreleased o3 reasoning model attained the highest AI score of 4.0%. Alongside this, the ARC Prize 2025 has been announced, launching on Kaggle this week, aiming to drive open-source innovation in developing highly efficient, general AI systems capable of surpassing ARC-AGI-2.
3. Anthropic Adds Web Search to Its Claude Chatbot
Anthropic has enhanced its AI assistant, Claude, by integrating web search capabilities, allowing Claude to access and incorporate up-to-date information into its responses. The feature is currently available in preview for paid users in the United States and will soon be extended to free users and additional countries. Users can enable web search through their profile settings, enabling Claude to provide more accurate and current information across various tasks.
4. NVIDIA Introduced Nemotron-H, A Family of Hybrid Mamba-Transformer Models
NVIDIA has introduced Nemotron-H, a family of hybrid Mamba-Transformer models designed to enhance inference efficiency without compromising accuracy. The lineup includes models ranging from 8 billion to 56 billion parameters, such as Nemotron-H-8B-Base and Nemotron-H-56B-Base. Additionally, a compressed 47B variant supports inference over approximately 1 million tokens on a single NVIDIA RTX 5090 GPU.
5. Tencent Released the Official Version of Hunyuan-T1
Tencent has upgraded the Hunyuan T1-Preview to the Hunyuan-T1 official version, an advanced AI reasoning model that leverages a Hybrid-Transformer-Mamba Mixture of Experts (MoE) architecture. Hunyuan-T1’s performance rivals DeepSeek’s R1 model, offering improved response times and extended text processing while maintaining clarity and a low hallucination rate.
6. OpenAI’s o1-Pro Now Available in API
OpenAI has introduced o1-pro, an enhanced version of its o1 “reasoning” AI model, now available through its developer API. This model utilizes increased computational resources to deliver more accurate responses to complex queries. Access is currently limited to developers who have previously spent at least $5 on OpenAI’s API services. Pricing for o1-pro is set at $150 per million input tokens and $600 per million output tokens, making it OpenAI’s most expensive model.
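A minimal call might look like the sketch below; it assumes a recent openai Python SDK with Responses API support (o1-pro is served through the Responses API rather than Chat Completions), and the prompt is illustrative:

```python
from openai import OpenAI

client = OpenAI()
resp = client.responses.create(
    model="o1-pro",
    input="Prove that the square root of 2 is irrational.",
)
print(resp.output_text)
# Budget note: at $150/M input and $600/M output tokens, long reasoning
# traces get expensive fast; consider capping usage with max_output_tokens.
```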
7. Mistral Launched Small 3.1, a Multimodal Small Model
Mistral AI has unveiled Mistral Small 3.1, an open-source model with 24 billion parameters that processes text and images. Despite its smaller size, it matches or surpasses the performance of larger proprietary models from companies like Google and OpenAI. The model offers improved text performance and multimodal understanding and supports a context window of up to 128k tokens. Released under the permissive Apache 2.0 license, Mistral Small 3.1 enables businesses to freely modify and deploy it.
8. xAI Launches an API for Generating Images
xAI has introduced an image generation feature to its API, utilizing the model grok-2-image-1212. This model allows users to generate up to 10 JPEG images per request, with a limit of five requests per second, for $0.07 per image. Currently, the API does not support image quality, size, or style adjustments.
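As a sketch, assuming xAI's OpenAI-compatible endpoint (the base URL follows xAI's docs; the prompt is illustrative):

```python
from openai import OpenAI

# xAI exposes an OpenAI-compatible API; point the client at its base URL.
client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_XAI_API_KEY")
result = client.images.generate(
    model="grok-2-image-1212",
    prompt="A watercolor fox reading a newspaper",
    n=4,  # up to 10 images per request, at $0.07 per image
)
for i, img in enumerate(result.data):
    print(i, img.url)
```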
9. Perplexity in Early Talks for Funding at $18 Billion Value
Perplexity AI is in early discussions to raise between $500 million and $1 billion in funding, potentially doubling its valuation to $18 billion. This follows a previous valuation of $9 billion reported in November. Recently, Perplexity introduced Comet, a web browser using AI for complex searches and tasks. Additionally, the company has proposed acquiring TikTok’s U.S. operations and open-sourcing its algorithm, although some view this move as a publicity stunt.
Five 5-minute reads/videos to keep you learning
1. LlamaIndex vs. LangChain vs. Hugging Face Smolagent: A Comprehensive Comparison
The article compares LlamaIndex, LangChain, and Hugging Face’s smolagents, three popular frameworks for integrating LLMs into applications. It breaks down their features, usability, flexibility, and performance, highlighting strengths and trade-offs.
2. You’re Doing RAG Wrong: How To Fix Retrieval-Augmented Generation for Local LLMs
The article addresses common challenges in implementing RAG systems, mainly focusing on issues like context blindness and first-person confusion. It provides practical solutions to enhance the effectiveness of RAG systems and ensure they better understand and utilize retrieved information.
3. DeepSearch Using Visual RAG in Agentic Frameworks
The article explores enhancing RAG systems by integrating visual retrieval methods like ColPali within agentic environments. It discusses how Vision-Language Models (VLMs) can streamline document-based question-answering tasks by directly processing document images, eliminating the need for traditional text extraction pipelines.
4. Measuring AI’s Ability To Complete Long Tasks
The article from METR discusses evaluating AI performance based on the length of tasks AI agents can autonomously complete. The study finds that over the past six years, the duration of tasks that AI can handle has been doubling approximately every seven months. Extrapolating this trend suggests that within five years, AI agents may independently perform complex software tasks that currently take humans days or weeks to complete.
5. Best Practices for Using Files With the Gemini API
The guide for the Gemini API offers best practices for incorporating files into prompts to enhance the model’s performance on tasks involving document understanding. It covers how to effectively reference files within prompts, manage multiple files, and handle large documents by segmenting them into smaller parts.
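A minimal sketch of file-based prompting, assuming the google-genai Python SDK; the upload interface, model name, and file name reflect our reading of the current docs and should be treated as illustrative:

```python
from google import genai

client = genai.Client()  # reads the Gemini API key from the environment

# Upload a document, then reference it directly in the prompt contents.
doc = client.files.upload(file="report.pdf")
resp = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[doc, "Summarize the key findings in three bullet points."],
)
print(resp.text)
```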
Repositories & Tools
1. Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
2. TxAgent is an AI system that delivers evidence-grounded treatment recommendations by integrating multi-step reasoning with real-time biomedical tools.
Top Papers of The Week
1. Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
The paper introduces Search-R1, an extension of DeepSeek-R1 that improves LLMs’ ability to autonomously generate search queries and retrieve information in real-time. It optimizes multi-turn search interactions using reinforcement learning, leading to significant performance gains across question-answering benchmarks.
2. Gemini Embedding: Generalizable Embeddings from Gemini
The paper introduces a state-of-the-art embedding model that leverages Google’s Gemini to produce versatile text representations across numerous languages and modalities. Evaluated on the Massive Multilingual Text Embedding Benchmark (MMTEB), Gemini Embedding significantly outperforms previous models in tasks such as classification, similarity, clustering, ranking, and retrieval, demonstrating its effectiveness in diverse applications.
3. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
The paper presents a framework for automating scientific research using advanced LLMs. The proposed AI Scientist can generate novel research ideas, write code, conduct experiments, visualize results, author scientific papers, and simulate peer reviews for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community.
4. SmolDocling: An Ultra-Compact Vision-Language Model for Document Conversion
Researchers from IBM and Hugging Face have introduced SmolDocling, an ultra-compact vision-language model designed for comprehensive document conversion tasks. With 256 million parameters, SmolDocling efficiently processes entire pages, accurately capturing content, structure, and spatial locations of elements like code listings, tables, equations, charts, and lists. It utilizes DocTags, a universal markup format that preserves the context and positioning of document components.
5. HybridFlow: A Flexible and Efficient RLHF Framework
The paper introduces HybridFlow, a framework designed to enhance the flexibility and efficiency of RLHF in training LLMs. By integrating single-controller and multi-controller paradigms, HybridFlow offers a hybrid approach that allows for flexible representation and efficient execution of complex RLHF dataflows. It features hierarchical APIs that decouple and encapsulate computations and data dependencies, facilitating efficient orchestration of RLHF algorithms and adaptable mapping across various devices.
Quick Links
1. Anthropic appears to be using Brave to power web search for its Claude chatbot. Simon Willison reports that at least one query returned identical citations in Claude and in Brave search. Willison also found that Claude’s web search function contains a parameter called “BraveSearchParams.”
2. ChatGPT hit with privacy complaint over defamatory hallucinations. Privacy rights advocacy group Noyb is supporting an individual in Norway who was horrified to find ChatGPT returning made-up information claiming he’d been convicted of murdering two of his children and attempting to kill the third.
Who’s Hiring in AI
Senior Software Engineer, Compute ML Scheduling and Observability @Anthropic (New York, NY, USA)
Generative AI Engineer @Dataiku (Germany, Remote)
Associate Machine Learning Engineer @Manulife (Toronto, Ontario, Canada)
Generative AI Automation Intern @Blue Cross of Idaho (Meridian, ID, USA)
Working Student (f/m/d) — Data Science and ML Engineering @SAP (Walldorf, Germany)
Student Software Engineer @SoundHound AI (Remote/Germany)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.