TAI #127: DeepSeek releases R1-Lite - the first reasoning model competitor to OpenAI’s o1
Also Nvidia Fugatto, Anthropic $4bn raise, Suno v4, Marco-o1 and more!
What happened this week in AI by Louie
We saw several significant events and model releases in AI again this week. The standout for us was the release of DeepSeek’s R1 reasoning model family, the first true competitor to OpenAI’s o1 reasoning models. Amazon’s $4bn investment in Anthropic was also significant, as it strengthens ties between the companies. In particular, Anthropic is required to use Amazon as its primary training partner, and it seems it now also has to use Amazon’s in-house Trainium chips for model training. Outside of the LLM space, we also saw a new 2.5-billion-parameter generative transformer model from Nvidia called Fugatto. The model takes text prompts or audio files as input and then generates or transforms any mix of music, voices, and sounds. Suno also recently released the new V4 iteration of its transformer-based music model, which has become one of the most impressive applications of generative AI across all mediums in our view. Suno has also recently added many more features, such as editing and replacing sections of its generated tracks and creating reusable “personas” to maintain your favorite generated voices and styles across songs.
DeepSeek's R1-Lite-Preview model is the first competitor to OpenAI's o1 reasoning model family, and it will also soon be open-sourced! Unlike OpenAI's o1-preview, R1 publicly shows its thinking tokens, which in many cases are much more impressive than the final output tokens. Hopefully this forces OpenAI's hand both to show its thinking tokens and to release its stronger full o1 model. OpenAI’s o1 models are particularly secretive and hide their thinking tokens. We think this is to avoid giving competitors information on how the model is built and also to prevent third-party distillation from its model outputs.
DeepSeek is a subsidiary of a Chinese quantitative hedge fund and has delivered some of the largest breakthroughs in AI this year, particularly relating to training and inference efficiency. It is unclear if R1 uses the same methods as o1, either for the post-training process that created it or for the inference-time process that implements the thinking step. The benchmark results (beating o1-preview on AIME 2024 and MATH while coming in behind on GPQA Diamond and ZebraLogic) and inference-time scaling laws are similar, however. We think both models were likely trained with large amounts of commissioned human expert inner-monologue data, showing full thinking steps for how to solve complex problems (and, later, similar synthetic data produced by early o1 and R1 models). R1 streams its thinking tokens one by one, which makes it seem less likely that either o1 or R1 is doing real search at inference time. Instead, it is perhaps more likely that OpenAI’s 4o model was post-trained with this new “thinking step” data, using a new reinforcement learning model to optimize a one-shot chain-of-thought (CoT) thinking step. I'd still be surprised if one-shot CoT thinking tokens work best relative to search as you scale thinking tokens into the millions, though; without search at inference time, it seems hard for the model not to get stuck exploring reasoning rabbit holes and dead ends. In any case, we should get some actual information on how these models work when DeepSeek releases its paper (and perhaps someday one from OpenAI as well!).
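One of the simplest published ways to scale inference-time compute without search is self-consistency: sample many independent reasoning chains and majority-vote over their final answers. The sketch below illustrates the idea only; `generate_cot_answer` is a hypothetical stub standing in for a real model call, and nothing here reflects o1's or R1's actual implementation.

```python
import random
from collections import Counter

def generate_cot_answer(question: str, seed: int) -> str:
    """Hypothetical stand-in for an LLM that produces a chain of
    thought and a final answer. Here we simulate a noisy reasoner
    that is right ~70% of the time; wrong answers all differ."""
    random.seed(seed)
    return "42" if random.random() < 0.7 else str(seed)

def self_consistency(question: str, n_samples: int = 16) -> str:
    """Scale inference compute by sampling several reasoning chains
    and majority-voting over the final answers."""
    answers = [generate_cot_answer(question, seed=i) for i in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 x 7?"))  # the majority answer wins
```

Spending more compute here simply means raising `n_samples`; search-based approaches instead spend it steering a single reasoning process, which is the distinction the paragraph above debates.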
To some extent, I think that before this recent paradigm of “reasoning models,” we had actively been training LLMs NOT to reason. Humans skip to the key points when writing up their ideas and don’t write down their full inner monologue with every thinking step, so the LLM learned it was supposed to just guess at these leaps from token to token. LLMs were effectively punished during training for attempting the necessary intermediary calculations and thinking steps rather than skipping straight to mimicking the next word as presented in their internet training data. So we are not surprised to see a lot of easy wins from exploring the benefits of more granular reasoning data (which we also expect to include more sophisticated tool use and function calling). We are surprised, however, at how quickly DeepSeek managed to release a strong competitor; that said, DeepSeek had already been exploring research in this direction for some time before the o1 release.
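To make "more granular reasoning data" concrete, here is a minimal, hypothetical formatting function that wraps an expert's intermediate steps in explicit thinking delimiters before the final answer, so training rewards the monologue rather than punishing it. The `<think>` tag and overall schema are our own illustration, not DeepSeek's or OpenAI's published data format.

```python
def format_reasoning_sample(question: str, thinking_steps: list[str], answer: str) -> str:
    """Build one supervised training example with an explicit
    inner-monologue section (hypothetical schema for illustration)."""
    thoughts = "\n".join(f"- {step}" for step in thinking_steps)
    return (
        f"Question: {question}\n"
        f"<think>\n{thoughts}\n</think>\n"
        f"Answer: {answer}"
    )

# Worked example: the intermediate arithmetic is written out instead
# of the model being asked to leap straight to "408".
sample = format_reasoning_sample(
    "What is 17 * 24?",
    ["17 * 24 = 17 * 20 + 17 * 4", "17 * 20 = 340", "17 * 4 = 68", "340 + 68 = 408"],
    "408",
)
print(sample)
```

Ordinary web text contains only the question and "408"; data in the shape above lets the loss function credit the intermediate steps the paragraph argues were previously trained away.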
Why should you care?
It is clear there is far more research, experimentation, and likely easy wins still to come in this direction of reasoning models, and this will happen much faster once a model is open-sourced. We think OpenAI was likely surprised by how soon a strong reasoning model competitor was released; however, OpenAI still has the full o1 model (and subsequent improvements) up its sleeve. We think it is now only a matter of time before all leading LLM labs release reasoning models.
For some time, we have been hearing that the latest training runs on the largest GPU clusters for the largest models have been disappointing, not achieving the levels of improvement hoped for. We are skeptical of this overall and think there are other factors behind why these models aren’t yet released, including fears of model distillation and the difficulty of instruction-tuning these larger models to ensure they obey company usage and safety policies. It is also increasingly difficult to measure progress now that many benchmarks are saturated. That said, we do think it likely that these new reasoning models, which scale inference compute rather than pre-training compute, are a more efficient use of training capacity and hence may be causing a shift of focus. This has repercussions for the chip industry and for competitive dynamics in foundation LLMs, as huge $10-100bn training clusters are perhaps no longer as essential. It could also support the growth of tech companies’ custom chip programs, as in many ways it is easier for companies to design custom chips for inference than for training. That said, GPUs are still more robust for working across multiple different model architectures.
Overall, we think the success of reasoning models and the upcoming open-source release from DeepSeek significantly increase competition and will lower barriers to entry in the LLM field. We also think a focus on inference-time scaling rather than training compute scaling makes it more likely that smaller labs and the open-source community can continue to compete and deliver breakthroughs going forward.
— Louie Peters — Towards AI Co-founder and CEO
Towards AI now offers the most comprehensive and practical LLM Developer conversion course: From Beginner to LLM Developer.
Many millions of LLM developers will be needed to build reliable, customized products on top of foundation LLMs and achieve mass generative AI adoption in companies. This is the perfect course for software developers, machine learning engineers, aspiring founders, or AI/computer science students to join the LLM field via a one-stop course. Across 85+ lessons, we cover the full stack of building on top of foundation LLMs, from choosing a suitable LLM application to collecting data, iterating on many advanced techniques (RAG, fine-tuning, agents, audio, caching, and more), integrating industry expertise, and deploying. It also means learning many new non-technical skills and habits unique to the world of LLMs. The only skill required for the course is some knowledge of Python (or basic programming). Course participants will create a working product, which we certify, and we also provide instructor support in our dedicated Discord channel. This project could become the seed of a startup, a new tool at your company, or a portfolio project for landing an LLM Developer job.
You can find all the lesson titles, syllabus, and more information via a free preview on the course page linked below.
Hottest News
DeepSeek introduced DeepSeek-R1, a reasoning AI model competing with OpenAI’s o1, highlighting its ability to fact-check itself and plan through complex problems. DeepSeek-R1 matches o1 on the AIME and MATH benchmarks but struggles with certain logic problems. DeepSeek plans to open-source DeepSeek-R1 and release an API.
Anthropic raised $4 billion from Amazon, making AWS its primary AI training partner and bringing Amazon's total investment in Anthropic to $8 billion while Amazon remains a minority investor. The companies will also collaborate on Trainium accelerator development and integrate Claude models on Amazon's platforms. Anthropic has likely already been using Amazon’s Inferentia chips and Google’s TPU chips for inference, but Nvidia GPUs, and to a lesser extent Google TPUs, still dominate large training runs in the industry.
BlackForestLabs has launched FLUX.1 Tools, enhancing control and steerability in their FLUX.1 text-to-image model. The tools include FLUX.1 Fill for inpainting, FLUX.1 Canny and Depth for structural guidance, and FLUX.1 Redux for image variation.
A team of MIT researchers introduced Boltz-1, the first open-source and commercially accessible model that matches AlphaFold3-level accuracy in predicting biomolecular complexes. Boltz-1 follows the general framework used in AlphaFold3 but introduces several architectural and procedural innovations, including new multiple sequence alignment (MSA) pairing algorithms, a unified cropping approach for efficient training, and an enhanced confidence model.
Apple has released AIMv2, a family of open-set vision encoders designed to improve upon existing models in multimodal understanding and object recognition tasks. AIMv2 incorporates a multimodal autoregressive pre-training framework, which builds on the conventional contrastive learning approach used in similar models. The key feature of AIMv2 is its combination of a Vision Transformer (ViT) encoder with a causal multimodal decoder.
NVIDIA has introduced Hymba, a new family of small language models featuring a hybrid architecture that combines Mamba and Attention heads running in parallel. This model, with 1.5 billion parameters, aims to address the efficiency and performance challenges faced by smaller models while being trained on 1.5 trillion tokens.
SmolTalk is a new one-million-sample synthetic dataset that forms the backbone of the SmolLM2 model. It combines newly generated datasets with publicly available ones to create a cohesive collection that serves various facets of language modeling.
Fugatto is short for Foundational Generative Audio Transformer Opus 1. The model is a 2.5-billion-parameter transformer and is described in this paper. It takes text prompts or audio files as input and then generates or transforms any mix of music, voices, and sounds.
Five 5-minute reads/videos to keep you learning
This blog covers tools you can use to monitor and assess the performance of agentic approaches, along with tips on best practices, evaluation, and more.
Web-LLM Assistant is an open-source project designed to integrate local LLMs with real-time web searching capabilities. This guide dives into the functionalities, installation process, and practical demonstrations of Web-LLM Assistant, inspired by its GitHub repository.
This article examines the potential limits of scaling large language models (LLMs) due to finite human-generated training data. It analyzes how the growing demand for high-quality data may outpace available resources, potentially requiring alternative strategies like synthetic data or improved model efficiency.
This article explores visual prompt injections, a way of manipulating AI systems by embedding hidden instructions in images. The article outlines examples, such as misleading adverts and subtle tampering in digital environments, emphasizing the need for robust defenses to safeguard AI systems from exploitation.
This article highlights how OpenAI stress-tests its LLMs using "red-teaming," involving human testers and automated processes to uncover harmful behaviors and vulnerabilities. It also explores challenges in testing general-purpose AI, emphasizing the need for tailored applications and improved evaluation methods for safety and reliability.
Repositories & Tools
Flux, the official inference repo for FLUX.1 models.
SmolLM2 is a family of compact language models available in three sizes: 135M, 360M, and 1.7B parameters.
Open Interpreter provides a natural-language interface to your computer's general-purpose capabilities.
Ml-aim provides the code and model checkpoints for AIMv1 and AIMv2 research projects.
Anchor Context provides the implementation of AnchorAttention, a plug-and-play attention mechanism for improving the long-context training of LLMs.
Top Papers of The Week
Marco-o1 advances reasoning for open-ended tasks by leveraging techniques such as Chain-of-Thought fine-tuning and Monte Carlo Tree Search. It delivers significant accuracy gains, effectively addressing complex problem-solving challenges and excelling in translation by accurately capturing colloquial expressions.
SAMURAI is an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning.
Researchers explored scaling laws to predict the performance of over-trained language models and their accuracy on downstream tasks. An analysis of 104 models demonstrated that accurate predictions can be achieved with substantially reduced computational resources. This research mitigates the risks associated with over-training and provides insights for optimizing future model training processes efficiently.
FrontierMath is a recently introduced benchmark designed with complex, original math problems crafted by expert mathematicians to test AI's advanced reasoning capabilities. Spanning multiple mathematical domains, it tackles issues like data contamination to provide a reliable evaluation of models. AI systems achieve a success rate of less than 2%, underscoring a substantial disparity compared to human proficiency.
SAM-Decoding presents an innovative approach to speculative decoding by utilizing a suffix automaton for efficient draft generation, substantially improving the inference speed of large language models. This technique boasts an average time complexity of O(1) per generation step. When integrated with strategies such as Token Recycling and EAGLE2, it delivers notable performance gains, achieving up to a 2.49× acceleration compared to existing methods.
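To make the retrieval-based drafting idea behind SAM-Decoding concrete, here is a toy sketch. We stand in for the paper's suffix automaton with a naive longest-suffix scan (the real method builds an automaton to get O(1) amortized steps), work on word tokens instead of model tokens, and stub the target model's verification pass as a plain list. All function names and data here are illustrative, not the paper's code.

```python
def propose_draft(context: list[str], reference: list[str],
                  max_suffix: int = 8, draft_len: int = 4) -> list[str]:
    """Find the longest suffix of `context` that occurs in `reference`
    and propose the tokens that follow it as a speculative draft.
    (Naive scan standing in for the paper's suffix automaton.)"""
    for k in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-k:]
        for i in range(len(reference) - k + 1):
            if reference[i:i + k] == suffix:
                return reference[i + k:i + k + draft_len]
    return []  # no match: fall back to ordinary decoding

def verify(draft: list[str], target_next: list[str]) -> list[str]:
    """Accept draft tokens until the first disagreement with the
    target model's own next-token choices (stubbed as a list)."""
    accepted = []
    for d, t in zip(draft, target_next):
        if d != t:
            break
        accepted.append(d)
    return accepted

ref = "the cat sat on the mat and the cat ran away".split()
ctx = "yesterday the cat".split()
draft = propose_draft(ctx, ref)          # draft follows the matched suffix
print(verify(draft, ["sat", "on", "a", "mat"]))
```

Every accepted draft token is one decoding step the large model skips, which is where the reported speedups come from; the automaton only makes the suffix lookup cheap.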
Quick Links
Google DeepMind released a new essay: A new golden age of discovery - Seizing the AI for Science opportunity. The essay highlights how AI tools are increasingly being used by scientists to accelerate literature reviews, design experiments, and predict outcomes, exemplified by breakthroughs like AlphaFold. It also examines the associated risks, such as threats to scientific creativity, reliability, and equity, and presents policy ideas, including defining AI-driven "Hilbert Problems," improving scientific data accessibility, fostering AI literacy, and reimagining scientific institutions.
AI2 has unveiled OpenScholar, a retrieval-augmented language model that searches for relevant papers and generates answers grounded in those sources. This will help scientists effectively navigate and synthesize scientific literature.
Google is working on a new API for Android 16 that lets system apps perform actions on behalf of users inside applications. This new API is guarded by permission granted to the default assistant app, i.e., Gemini, on new Android devices.
Who’s Hiring in AI
Senior Vulnerability Researcher (Open Source) @Snyk (Tel Aviv, Israel)
Software Engineer @Rimes Technologies (Hybrid/Nicosia, Cyprus)
PhD Applied Science Intern- Data Science @LinkedIn (Mountain View, CA, USA)
AI/ML Data Scientist Consultant @Guidehouse (USA)
Data Analyst Intern @Moveworks (Mountain View, CA, USA)
DevOps Engineer @Zeller (India/Remote)
Research Scientist Intern, Algorithms (PhD) @Meta (Burlingame, CA, USA)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.