TAI #104: LLM progress beyond transformers with Samba?

Monte Carlo Tree Search for LLMs, Nvidia's Nemotron-4, OpenAI's $3.4bn revenue, Mixture of Agents, and more!

Jun 18, 2024

What happened this week in AI by Louie

This week, we saw a wave of exciting papers with new LLM techniques and model architectures, some of which could quickly be integrated into production LLMs. While we discuss more of these in “Hottest News” below, we are particularly excited to see progress in hybrid LLMs moving beyond the transformer architecture.

In particular, we saw progress in 1) the Mamba non-transformer model architecture with the new Samba hybrid model from Microsoft and 2) integrating Monte Carlo Tree Search into LLMs.

Much of the takeoff in AI in the past few years has been built on transformer and diffusion models. But an open question is how much of this progress is due to the capabilities of these architectures, and how much is due to progress in our pre-training (and instruction tuning) dataset recipes and scaled training compute. There have been many hints that “datasets” are, in fact, almost everything, and we may get convergent intelligence from different architectures (so long as they are scalable!). Nowhere is this clearer than in the very impressive new model out this week from Microsoft, Samba. This is a (debatably) non-transformer LLM, and when trained on the same 3.2T token dataset as the 3.8BN Phi-3 transformer LLM, it gets similar performance on most key benchmarks, improved inference metrics, and infinite context length. The model brings together Mamba (a type of State Space Model) with some key elements of transformers (Multi-Layer Perceptrons + Attention via a Sliding Window Attention mechanism) to form a new hybrid model architecture. The model code has been made available, but not yet the model weights. While we have seen progress in hybrid SSM/Transformers before with Jamba from AI21 earlier this year, we think Samba is the first time a non-transformer LLM architecture could have economic AI production use cases. However, it remains to be seen how it performs in real-world testing and if this architecture can be effectively scaled to the >1trn parameter size of leading foundation models such as GPT-4.
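
To illustrate the Sliding Window Attention piece of this hybrid, here is a minimal sketch of a causal sliding-window mask. It is not Samba's actual code: the function name `sliding_window_mask` and the tiny sizes are illustrative assumptions. The point it shows is that each token attends to at most `window` recent tokens, so the attention cost per token stays fixed however long the sequence gets.

```python
def sliding_window_mask(seq_len, window):
    """Build a causal sliding-window attention mask.

    mask[i][j] is True when query position i may attend to key position j:
    j must not be in the future (j <= i) and must be one of the `window`
    most recent positions (i - j < window).
    Hypothetical helper for illustration, not taken from the Samba repo.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def max_keys_per_query(mask):
    """Largest number of keys any single query attends to."""
    return max(sum(row) for row in mask)
```

With `sliding_window_mask(5, 2)`, position 3 can see only positions 2 and 3, and `max_keys_per_query` never exceeds the window size. Anything older than the window has to be carried by the recurrent state of the SSM layers instead.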

A different path to enhance transformers and LLMs with hybrid architectures is to bring in elements of Monte Carlo Tree Search. We think this has been a key focus at both Google DeepMind and OpenAI, with Demis Hassabis talking several times about bringing elements of DeepMind's Reinforcement Learning expertise from models such as AlphaGo into LLMs. Press about OpenAI’s Q* model breakthrough last year is potentially something similar. In particular, these techniques are hoped to advance (or begin?) LLMs’ complex reasoning capabilities, and there are some signs of progress with new papers this week from researchers in China and DeepMind, respectively: “Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B” and “Improve Mathematical Reasoning in Language Models by Automated Process Supervision.”
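
To make the Monte Carlo Tree Search idea more concrete, here is a minimal sketch of its selection step in the common UCT form, assuming a standard UCB1 score. It is not taken from either paper, which may use modified scoring rules. The names `uct_score` and `select_child`, the dictionary layout, and the exploration constant `c=1.4` are illustrative assumptions. In an LLM setting, each child could stand for a candidate answer or reasoning step, with its value coming from a reward model or self-evaluation.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """UCB1-style score: average value plus an exploration bonus.

    c is an assumed exploration constant; larger values favor
    rarely visited children.
    """
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """Return the child with the highest UCT score.

    children: list of dicts with keys "name", "total_value", "visits".
    """
    return max(
        children,
        key=lambda ch: uct_score(ch["total_value"], ch["visits"], parent_visits, c),
    )
```

The trade-off is visible in the two terms: a child with a high average value is favored, but one with few visits receives a bonus, so the search keeps testing less-explored answers rather than committing to the first good one.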

Why should you care?

We think the release of Samba suggests that now that we have great pre-training datasets collected and prepared, innovative new LLM architectures can quickly catch up with transformers (at least at the billion-parameter scale), and there is a huge space open for research and experimentation. At the same time, however, the convergent intelligence of Samba and Transformer architectures, when trained on the same dataset, raises some questions about whether we will hit dead ends in capability with our current focus on scaling up LLM parameters and training data. We think Monte Carlo Tree Search hybrid models are one promising path to unlock improved reasoning and new LLM capabilities beyond the “scale is all you need” route.

—Louie Peters — Towards AI Co-founder and CEO

You asked. We listened.

Many of you asked for an electronic version of our new book, so after working out the kinks, we are excited to finally release the electronic version of “Building LLMs for Production.”

Don’t worry; it’s the same content, same quality, same value, but it’s cheaper!

We’ve heard a lot of feedback from readers who want both the ebook and the print book for different occasions. Think of this version as your “carry it wherever you go” AI toolkit. Enjoy enhanced accessibility and easy navigation!

We know the digital reading experience will differ from the physical version, but we wanted to add something more. We worked hard to ensure it offered a more interactive experience and more knowledge through the embedded links (great free learning references) we could add, especially for coding examples. We are excited to see how you use it to create exciting projects, tools, and more.

We are also really excited to announce that our book is now available in India. Thanks to a partnership with Shroff Publishers, you can order it here!

Find it now on Amazon as a paperback, high-quality colored hardcover, and e-book.

For those seeing it for the first time, “Building LLMs for Production” is an essential toolkit for AI engineers to build reliable real-world LLM applications. It includes fundamental AI & LLM concepts, many Colab notebooks, hands-on projects, community access, and more. Written by over 10 experts from the Towards AI team and reviewed and edited by specialists from Activeloop, LlamaIndex, Mila, and more, it is a roadmap to the tech stack of the future.

It is an end-to-end resource for anyone looking to enhance their skills and develop their understanding of Generative AI and large language models (LLMs).

Hottest News

1. OpenAI’s Annualized Revenue Doubles to $3.4 Billion Since Late 2023

OpenAI is rapidly moving to commercialization, with reports of hitting a $3.4bn revenue run rate, more senior management and board appointments (including some privacy eyebrows raised by the appointment of a former NSA director), and Sam Altman reportedly telling some shareholders OpenAI could convert its governance structure to a for-profit business that the firm’s nonprofit board doesn’t control.

2. NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models

Nvidia released Nemotron-4 340B, a new open-source LLM trained on 9trn tokens. The model is impressive but not ground-breaking on benchmarks; however, more information regarding the model architecture and training regime was released than has become the norm. Nvidia reported that 98% of the data used in post-training was generated synthetically. Nvidia released the full prompts used in its synthetic data generation pipeline and also released the reward model used for filtering the synthetically generated data.

3. Together MoA — collective intelligence of open-source models

Together AI introduced Mixture of Agents (MoA), an approach to harness the collective strengths of multiple LLMs to improve state-of-the-art quality. It leverages several open-source LLM agents to achieve a score of 65.1% on AlpacaEval 2.0, surpassing the prior leader GPT-4o (57.5%).

4. Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations

Lamini Memory Tuning is a new way to embed facts into LLMs that improves their factual accuracy and reduces hallucinations. It entails tuning millions of expert adapters (e.g., LoRAs) with precise facts on top of any open-source LLM, like Llama 3 or Mistral 3.

5. Microsoft’s Nadella Is Building an AI Empire. OpenAI Was Just the First Step

Chief Executive Satya Nadella bet Microsoft’s future on artificial intelligence’s potential when he forged a groundbreaking partnership with OpenAI, but it is only the beginning. In recent months, he’s been spreading his bets, turning Microsoft into the world’s most aggressive amasser of AI talent, tools, and technology. Nadella has also begun building an in-house OpenAI competitor inside Microsoft.

6. DeepSeek-Coder-V2: First Open Source Model Beats GPT4-Turbo in Coding and Math

DeepSeek (an AI lab within a quantitative finance fund in China!) released DeepSeek-Coder-V2, which builds on the LLM architectural and efficiency innovations of its DeepSeek V2 model. The company continued training V2 on a further 6TN code and math tokens to develop a competent code model supporting 338 programming languages. It is the most inference-efficient model in the market (at least for models with disclosed data).
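
As a rough illustration of the Mixture of Agents idea from item 3 above, here is a minimal sketch in which several “proposer” models answer a prompt and an “aggregator” model synthesizes their answers. This is not Together AI's implementation: `mixture_of_agents`, `build_aggregate_prompt`, the prompt wording, and the plain `Callable[[str], str]` model interface are all assumptions for illustration. Any real model client would be wrapped to fit that interface.

```python
from typing import Callable, List

# Assumed interface: a model is any function from prompt text to answer text.
Model = Callable[[str], str]


def build_aggregate_prompt(question: str, candidates: List[str]) -> str:
    """Show the question with numbered candidate answers (hypothetical wording)."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates, 1))
    return (
        f"Question: {question}\n"
        f"Candidate answers:\n{numbered}\n"
        "Write one improved answer."
    )


def mixture_of_agents(
    question: str,
    proposers: List[Model],
    aggregator: Model,
    layers: int = 1,
) -> str:
    """Run `layers` rounds of proposers, then one final aggregation."""
    # Round 1: every proposer answers the raw question.
    answers = [model(question) for model in proposers]
    # Later rounds: proposers also see the previous round's answers.
    for _ in range(layers - 1):
        context = build_aggregate_prompt(question, answers)
        answers = [model(context) for model in proposers]
    # The aggregator turns the last round's answers into one response.
    return aggregator(build_aggregate_prompt(question, answers))
```

Stub functions such as `lambda prompt: "alpha"` are enough to exercise the control flow before plugging in real models.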

Five 5-minute reads/videos to keep you learning

1. Why BERT is Not GPT

While both BERT and GPT are based on the Transformer architecture, the key differentiation lies in their approach to word embeddings and the attention mechanism. This piece compares and contrasts the two models.

2. Training and Finetuning Embedding Models with Sentence Transformers v3

This blog post explains how to fine-tune Sentence Transformer models to improve their performance on specific tasks. It covers the training components, how to initialize the loss function and training arguments, and more.

3. AI Agents: Hype vs. Reality

Most AI agents are not ready for mission-critical work. However, the underlying models and architectures continue to advance quickly, and we can expect to see more successful real-world applications. This piece outlines the landscape of AI agents and presents reasonable points to help you understand the reality and the hype.

4. The AGI-in-2027 Thesis

In this piece, the author dives into the assumptions behind the argument that we’ll reach superintelligence by 2027 and finds that, while the future of AI is undoubtedly exciting, we need to be cautious about overhyping its potential. It also covers the state of AI research and what it means for the future of work.

5. Apple’s AI Strategy in a Nutshell

This post presents a quick recap of how Apple classifies its AI workloads, which are grouped into three buckets: on-device, private cloud compute, and third-party model inference. In a nutshell, Apple will still power most of the AI features and “bring in” ChatGPT on a needs basis.

Repositories & Tools

1. Warp is a Python framework for high-performance GPU simulation and graphics.

2. TextGrad is a framework for automatic “differentiation” via text. It implements backpropagation through text feedback provided by LLMs.

3. AutoKT is a platform for code documentation, generation, and maintenance.

4. Zed is a high-performance, multiplayer code editor.

Top Papers of The Week!

1. The Prompt Report: A Systematic Survey of Prompting Techniques

This paper establishes a structured understanding of prompts by assembling a taxonomy of prompting techniques and analyzing their use. It presents a comprehensive vocabulary of 33 terms, a taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities.

2. Depth Anything V2

This work presents Depth Anything V2, an updated version of V1. This version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of the teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images.

3. Scalable MatMul-free Language Modeling

Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This work shows that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. It investigates the scaling laws and finds that the performance gap between its MatMul-free models and full-precision Transformers narrows as the model size increases.

4. Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

This work presents Samba, a hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). It compresses a given sequence into recurrent hidden states while still maintaining the ability to recall memories with the attention mechanism. The paper reports 3.73x higher throughput than Transformers with grouped-query attention when processing 128K-length prompts.

5. When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

This paper introduces an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. This approach achieves up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2× speedup during generation compared to prior linear attention methods.

Quick Links

1. AWS announces a $230M commitment for generative AI startups. This will provide startups, especially early-stage companies, with AWS credits, mentorship, and education to further their use of AI and ML technologies.

2. Paris-based AI startup Mistral AI raises $640M in a Series B funding round. The startup is now valued at $6 billion following this funding round.

3. At WWDC, Apple introduced Apple Intelligence, a personal intelligence system integrated deeply into iOS 18, iPadOS 18, and macOS Sequoia. It comprises multiple highly capable generative models specialized for users’ everyday tasks and can adapt to their current activity on the fly.

Who’s Hiring in AI!

Data Annotator — Freelance @Mindrift (Remote)

Applied Machine Learning Engineer @Snorkel AI (Redwood City, CA)

Senior Solutions Architect — Generative AI @NVIDIA (Bengaluru, India)

PhD Researchers — Generative AI for Autonomous Driving on Data Generation, Neural Simulation and Large-Scale Foundational Model Training @Torc Robotics (Stuttgart, Germany)

AI Lead @Ethos Life (India/Remote)

Software Engineer @Avalanche Labs (Remote)

Machine Learning Engineer 2, Amazon @Amazon (Seattle, WA, USA)

Junior Data Analyst (Gamification) @Animoca Brands Limited (UK)

Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.

If you are preparing your next machine learning interview, don’t hesitate to check out our leading interview preparation website, confetti!

This AI newsletter is all you need #104 was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
