#114: Two Paths to Small LMs? Synthetic Data (Phi 3.5) vs Pruning & Distillation (Llama-3.1-Minitron)
Also: Jamba 1.5, DisTrO, StormCast, Late Chunking & Cursor's funding round.
What happened this week in AI by Louie
This was a week for small language models (SLMs) with significant releases from Microsoft and NVIDIA. These new models highlight the growing trend towards creating efficient yet powerful AI that can be deployed in resource-constrained environments without compromising performance. The two companies focused on different strategies for achieving these smaller models — Microsoft via training on high-quality synthetic data and Nvidia via pruning and distillation techniques.
Microsoft continued to expand and improve its Phi-3 family, introducing three new models: Phi-3.5-Mini, Phi-3.5-MoE (Mixture-of-Experts), and Phi-3.5-vision. These models underscore Microsoft’s strategy of leveraging high-quality synthetic data to enhance the capabilities of small language models. Phi-3.5-Mini is a compact 3.8 billion parameter model designed for scenarios where memory and latency are critical factors. It achieves performance comparable to, and in some cases surpassing, that of larger models like Mistral-7B and Llama-3.1-8B. Meanwhile, Phi-3.5-MoE is the first Mixture-of-Experts model in the Phi family: it activates only 6.6 billion of its 42 billion parameters, which lets it deliver high performance while keeping inference costs low.
Microsoft’s training data for the Phi-3.5 models encompasses 3.4 trillion tokens sourced from a mix of carefully curated materials. This includes publicly available documents rigorously filtered for quality, high-quality educational data and code to enhance the model’s reasoning capabilities, and newly created synthetic data designed to teach complex subjects such as math, coding, and common sense reasoning. Additionally, supervised data in chat format was used to align the model with human preferences for instruction-following, truthfulness, honesty, and helpfulness. The focus on data quality was paramount: a great deal of effort typically goes into gathering and cleaning LLM training data, yet the result is often still noisy. Microsoft is experimenting to see how much an LLM can learn from a smaller amount of higher-quality training data.
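As a rough illustration of what this kind of synthetic data generation can look like, here is a minimal sketch that prompts an LLM to write short textbook-style lessons and saves them as JSONL. Microsoft’s actual Phi-3.5 pipeline is not public, so the generator model, prompt, and topics below are assumptions for illustration only.

```python
# Hypothetical sketch only: Microsoft's actual Phi-3.5 data pipeline is not public.
# The model name, prompt, and topics below are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()
topics = ["modular arithmetic", "recursion in Python", "reading a train timetable"]

with open("synthetic_lessons.jsonl", "w") as f:
    for topic in topics:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable generator model could play this role
            messages=[{
                "role": "user",
                "content": (
                    f"Write a short, self-contained lesson on {topic} with one "
                    "worked example and one exercise followed by its solution."
                ),
            }],
        )
        # store each generated lesson as one JSONL record for later filtering and training
        f.write(json.dumps({"topic": topic, "text": resp.choices[0].message.content}) + "\n")
```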
NVIDIA’s release of the Llama-3.1-Minitron model highlights a different approach to creating efficient small language models. Llama-3.1-Minitron is a 4 billion parameter model derived from the larger Llama-3.1-8B through a combination of pruning and distillation. Pruning systematically reduces the size of a model by removing less critical layers and neurons, making the model smaller and faster without losing significant capability. NVIDIA employed structured pruning to trim the Llama-3.1-8B model down to a leaner version while preserving its core capabilities in areas like natural language understanding and reasoning. Distillation then transferred knowledge from the larger model to the smaller one: the smaller model (student) was trained to mimic the behavior of the larger model (teacher) by learning from the teacher’s outputs on the same data. The combination of pruning and distillation allowed NVIDIA to create a model that retains much of the predictive power of its larger counterpart, performs competitively with other models in its class, and is significantly cheaper to train and deploy.
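For readers who want a concrete picture of the distillation step, below is a minimal PyTorch sketch of logit-based knowledge distillation, the general technique rather than NVIDIA’s exact recipe. It assumes Hugging Face-style causal LMs `teacher` and `student` that share a tokenizer, and a `batch` dict that already contains `labels` for the standard language-modeling loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student next-token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def training_step(student, teacher, batch, alpha=0.5, temperature=2.0):
    """One distillation step; `batch` includes input_ids, attention_mask, and labels."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits  # teacher is frozen during distillation
    out = student(**batch)
    kd = distillation_loss(out.logits, teacher_logits, temperature)
    lm = out.loss  # standard next-token cross-entropy from the labels
    return alpha * kd + (1 - alpha) * lm
```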
Why should you care?
The new releases from Microsoft and NVIDIA illustrate two different approaches to advancing small language models: a focus on high-quality synthetic training data, as seen with Microsoft’s Phi-3.5 models, and pruning combined with distillation, as demonstrated by NVIDIA’s Llama-3.1-Minitron. So far, smaller models have still felt noticeably less capable in real-world use cases, and there is lingering skepticism about how much they overfit their training data. However, we are hopeful that models in this size category are getting closer to real-world utility.
— Louie Peters — Towards AI Co-founder and CEO
Join 30,000+ GenAI360 Certification Course Takers in a New Challenge: GenAI Aptitude Test.
Towards AI, together with our partners at Activeloop and Intel Disruptor Initiative, was one of the first organizations to pioneer high-quality, production-oriented GenAI courses, namely our marquee LangChain & Vector Databases in Production, Training & Fine-Tuning LLMs, as well as Retrieval Augmented Generation for Production with LlamaIndex and LangChain courses.
One year and tens of thousands of professionals educated later, we’ve noticed one pattern. A lot of people call themselves “AI Engineers.” In fact, there are 47,000 of them on LinkedIn. But can they build AI systems that work in the real world? Because that’s the real test!
So, we’ve created a challenge. We’re calling it the ‘Impossible’ GenAI Test.
You’ll have 40 minutes to answer 24 questions across GenAI knowledge areas such as RAG, fine-tuning, model training, and inference. It’s tough — only about 1 in 20 people pass on their first try, but you will definitely learn a lot about your gaps in GenAI knowledge.
Take the test now for free and find out where you rank with your GenAI skills!
Hottest News
1. Fine-tuning is Now Available for GPT-4o
OpenAI introduces GPT-4o fine-tuning, which allows developers to customize models for better performance and cost-efficiency across domains. The feature is available for paid tiers with free daily training tokens until September 23. Notable achievements include Cosine’s Genie excelling in the SWE-bench and Distyl leading the BIRD-SQL benchmark.
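For those who want to try it, here is a minimal sketch of launching a GPT-4o fine-tuning job with the openai Python SDK; the model snapshot name is an assumption, so check which fine-tunable models are available to your account.

```python
from openai import OpenAI

client = OpenAI()

# training data: a JSONL file of chat-format examples, e.g.
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # assumed fine-tunable snapshot; verify in your account
)
print(job.id, job.status)  # poll progress with client.fine_tuning.jobs.retrieve(job.id)
```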
2. Microsoft Releases New Phi 3.5 Open-Source Language and Vision Models
Microsoft’s new Phi 3.5 series introduces three open-source AI models (mini-instruct, MoE-instruct, and vision-instruct) designed to improve reasoning in multilingual commercial and scientific tasks, with capabilities in long document analysis. However, challenges with factual accuracy and potential bias are noted, and Microsoft recommends pairing these models with a retrieval-augmented generation (RAG) setup for best results in resource-constrained environments.
3. OpenAI Has Formed a Media Partnership With Condé Nast
OpenAI has partnered with Condé Nast to integrate SearchGPT with the media company’s publications, aiming to improve search capabilities and content credibility. The collaboration is seen as a strategy to mitigate the impact of technological advancements on media revenue.
4. AI21 Labs Released Jamba 1.5 Family of Open Models Redefining Long-Context AI
AI21 released Jamba 1.5, a family of models that combines Transformer and State Space Model (SSM) architectures. The release includes Jamba 1.5 Mini (12B active/52B total parameters) and Jamba 1.5 Large (94B active/398B total parameters), both MoE models. Jamba 1.5 Mini is the strongest open model in its size class, scoring 46.1 on the Arena Hard benchmark and surpassing larger models like Mixtral 8x22B and Command-R+.
5. Nvidia Unveils AI Model StormCast for Advanced Weather Prediction
Nvidia has launched StormCast, an AI-driven model on its Earth-2 platform, advancing mesoscale weather prediction with simulations of atmospheric dynamics. It achieves a 10% accuracy improvement over traditional six-hour forecasts, contributing to efficient disaster planning and positioning Nvidia alongside other tech giants like Google, Microsoft, and IBM in AI climate technology.
6. Anthropic’s Claude Surpasses $1M in Mobile App Revenue
Anthropic’s AI assistant, Claude, has surpassed $1 million in mobile app revenue across iOS and Android in just 16 weeks. While Claude has seen strong growth in the U.S. and other markets, it faces challenges as Apple prepares to integrate ChatGPT directly into iPhones.
7. Nvidia’s Llama-3.1-Minitron 4B Is a Small Language Model That Punches Above Its Weight
The Nvidia research team leveraged recent advances in pruning and distillation to create Llama-3.1-Minitron 4B, a compressed version of Llama-3.1-8B. The model rivals the performance of larger models and of equally sized SLMs while being significantly more efficient to train and deploy.
8. Nous Research Publishes a Report on DisTrO
Nous Research released a preliminary report on DisTrO (Distributed Training Over the Internet), a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by 1000x to 10,000x without relying on amortized analysis and matches AdamW+All-Reduce in convergence rates. This could be significant progress towards multi-location training runs, which can be valuable both for large tech companies with multiple data centers and more open-source and blockchain-based decentralized projects.
9. Amazon Q Has a New Code Transformation Capability for Updating Foundational Software
Amazon Q, Amazon’s GenAI assistant for software development, has a new code transformation capability for foundational software hygiene work. The feature helped Amazon save the equivalent of 4,500 developer-years of work on its internal Java upgrades, providing an estimated $260M in annualized efficiency gains. Amazon was also able to upgrade over 50% of its production Java systems to modernized Java versions in a fraction of the usual time and effort.
10. Google DeepMind Research Addresses the Most Difficult Challenges in Quantum Chemistry
Scientists at Imperial College London and Google DeepMind have proposed an AI-based solution to the challenge of modeling the states of molecules. Using a neural network called FermiNet (Fermionic Neural Network), they computed the energy of atoms and molecules from first principles. For a small but complex molecule called the carbon dimer, they achieved a mean absolute error (MAE) of 4 meV (a tiny energy measure), five times more accurate than previous top methods with an MAE of 20 meV.
11. Jina AI Introduces Late Chunking for Better Retrieval Applications
Jina introduced a new approach for embedding chunks called “Late Chunking,” which leverages the rich contextual information provided by 8192-length embedding models. Late chunking creates a set of chunk embeddings where each one is “conditioned on” the previous ones, thereby encoding more contextual information for each chunk.
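To make the idea concrete, here is a minimal sketch of late chunking: embed the whole document once with a long-context embedding model, then pool token embeddings per chunk so each chunk vector is computed from tokens that have attended to the surrounding document. The model name and the naive fixed-size chunk boundaries below are assumptions for illustration, not Jina’s exact implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# assumed long-context (8192-token) embedding model; swap in whichever model you use
MODEL = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk_embeddings(text: str, chunk_tokens: int = 256):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_embs = model(**inputs).last_hidden_state[0]  # embed the whole document first
    # only then pool token embeddings per chunk, so every chunk vector was computed
    # with attention over the full document rather than the chunk in isolation
    return [chunk.mean(dim=0) for chunk in token_embs.split(chunk_tokens)]
```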
Five 5-minute reads/videos to keep you learning
1. Understanding the Best Practices and Ideas for LLM-Enabled RAG Systems
RAG is one of the most important use cases for LLMs. This article studies the various components of RAG in detail.
2. What It Really Takes To Train an Entire Workforce on Gen AI
Companies prioritize generative AI training to boost innovation and competitiveness, with firms like Synechron leveraging specialized tools for AI-enablement and productivity gains. USAA is set to follow suit, emphasizing governance, risk management, and role-based AI training for its workforce.
3. Our Team Procrastinated on Writing Bug Reports. So, We Built an AI To Do It for Us
A team has developed an AI-powered solution to mitigate procrastination in writing bug reports. They crafted an automated system using Python to extract Discord messages, summarize them with Google Gemini, and integrate these summaries as issues in GitLab, thereby improving documentation efficiency and productivity.
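As a rough sketch of what such a pipeline can look like (not the team’s actual code), the snippet below pulls recent messages from a Discord channel via its REST API, summarizes them into a bug report with Gemini, and files the result as a GitLab issue. The tokens, channel ID, project ID, and prompt are placeholders.

```python
import requests
import gitlab
import google.generativeai as genai

DISCORD_BOT_TOKEN, CHANNEL_ID = "...", "..."  # placeholders
GITLAB_TOKEN, PROJECT_ID = "...", 12345       # placeholders

# 1. Pull recent messages from a Discord channel via the REST API
msgs = requests.get(
    f"https://discord.com/api/v10/channels/{CHANNEL_ID}/messages?limit=50",
    headers={"Authorization": f"Bot {DISCORD_BOT_TOKEN}"},
).json()
transcript = "\n".join(f"{m['author']['username']}: {m['content']}" for m in msgs)

# 2. Summarize the discussion into a structured bug report with Gemini
genai.configure(api_key="...")
report = genai.GenerativeModel("gemini-1.5-flash").generate_content(
    "Summarize this Discord discussion as a bug report with steps to reproduce:\n" + transcript
).text

# 3. File the summary as a GitLab issue
gl = gitlab.Gitlab("https://gitlab.com", private_token=GITLAB_TOKEN)
gl.projects.get(PROJECT_ID).issues.create(
    {"title": "Auto-generated bug report", "description": report}
)
```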
4. Interpreting Coefficients in Linear Regression Models
This post will demonstrate how to interpret coefficients by exploring various scenarios. It analyzes a single numerical feature, examines the role of categorical variables, and unravels the complexities introduced when these features are combined.
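As a quick illustration of the kind of interpretation the article walks through, the toy example below fits a regression with one numerical and one categorical feature; the feature names and numbers are made up for the example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy data: one numerical feature (sqft) and one categorical feature (neighborhood)
df = pd.DataFrame({
    "sqft": [600, 800, 1000, 1200, 1500, 900],
    "neighborhood": ["A", "A", "B", "B", "C", "C"],
    "price": [150, 190, 260, 300, 390, 250],
})
X = pd.get_dummies(df[["sqft", "neighborhood"]], drop_first=True)  # "A" becomes the baseline
model = LinearRegression().fit(X, df["price"])

# the sqft coefficient is the expected price change per extra square foot (holding the
# neighborhood fixed); each neighborhood coefficient is the shift relative to baseline "A"
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.2f}")
```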
5. Introduction to ggml
ggml is a machine learning library written in C and C++ that focuses on transformer inference. This article covers the fundamentals of ggml for developers looking to get started with the library.
Repositories & Tools
1. Phi-3 CookBook is the official repository for Microsoft’s Phi-3 models, currently the most cost-effective small language models (SLMs).
2. Cursor is an AI-powered code editor that boosts developer productivity.
3. Haystack is an end-to-end LLM framework that lets you build applications powered by LLMs, Transformer models, vector search, and more.
4. Helicone is an open-source platform for logging, monitoring, and debugging LLMs.
5. N8n is a workflow automation and integration tool that streamlines and connects various applications.
Top Papers of The Week
1. A Survey on Benchmarks of Multimodal Large Language Models
This paper critiques the effectiveness of existing evaluation methods for Multimodal Large Language Models (MLLMs) by examining 180 benchmarks spanning image processing and complex reasoning tasks. It categorizes these evaluations across various criteria, notes the current assessment limitations, and suggests areas for improving MLLM development and research.
2. ShortCircuit: AlphaZero-Driven Circuit Design
This paper introduces ShortCircuit, a transformer-based architecture that uses an AlphaZero-style approach to advance Boolean circuit design by synthesizing smaller AND-Inverter Graphs (AIGs) from truth tables. Combining supervised and reinforcement learning, it beats the leading tool, ABC, with a 14.61% improvement in AIG compactness, tested on 500 real-world truth tables.
3. Searching for Best Practices in Retrieval-Augmented Generation
This paper investigates existing RAG approaches and their potential combinations to identify optimal RAG practices. It suggests several strategies for deploying RAG that balance performance and efficiency. It also demonstrates that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content.
4. To Code, or Not To Code? Exploring Impact of Code in Pre-training
The study investigates the impact of including code in pre-training data for LLMs, even when not specifically designed for code tasks. It aims to understand how code data affects performance on non-code tasks, addressing the lack of comprehensive analysis in this area. The study experimented with varied code proportions, quality, and insertion points in pre-training.
5. Matryoshka-Adaptor: Unsupervised and Supervised Tuning for Smaller Embedding Dimensions
The Matryoshka-Adaptor framework improves the efficiency of LLM embeddings by substantially decreasing their size, preserving performance while cutting computational expenses. Compatible with any LLM, including black-box API architectures, it supports supervised and unsupervised learning. It has shown consistent results across diverse datasets, achieving up to a twelve-fold reduction in embedding dimensions.
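To give a feel for the general idea (an illustrative sketch, not the paper’s exact method), the snippet below tunes a small residual adaptor on top of frozen embeddings so that a truncated prefix of the adapted vectors preserves the pairwise similarity structure of the full embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatryoshkaAdaptor(nn.Module):
    """Small residual adaptor applied on top of frozen (possibly black-box) embeddings."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.proj(x)  # residual keeps adapted vectors close to the originals

def similarity_loss(full, truncated):
    """Match pairwise cosine similarities of truncated embeddings to those of the full ones."""
    sim_full = F.normalize(full, dim=-1) @ F.normalize(full, dim=-1).T
    sim_trunc = F.normalize(truncated, dim=-1) @ F.normalize(truncated, dim=-1).T
    return F.mse_loss(sim_trunc, sim_full)

embeddings = torch.randn(512, 768)  # stand-in for embeddings from any model
adaptor = MatryoshkaAdaptor(768)
opt = torch.optim.Adam(adaptor.parameters(), lr=1e-4)
for _ in range(200):
    adapted = adaptor(embeddings)
    loss = similarity_loss(embeddings, adapted[:, :64])  # keep only the first 64 dimensions
    opt.zero_grad(); loss.backward(); opt.step()
```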
6. Loss of Plasticity in Deep Continual Learning
Deep learning methods gradually lose plasticity in continual learning settings, eventually learning no better than a shallow network. The paper argues that this loss of plasticity is a major challenge to developing AI that can effectively handle the world’s complexity and would need to be solved to reach human-level artificial intelligence. The researchers propose a remedy based on modifying backpropagation, the fundamental algorithm that makes neural networks learn.
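The flavor of that modification, reinitializing a small fraction of the least-useful hidden units so the network keeps adapting, can be sketched as below. The per-unit utility measure here (mean absolute activation) is a simplification of the paper’s, so treat this as illustrative only.

```python
import torch
import torch.nn as nn

def reinit_least_used_units(layer: nn.Linear, activations: torch.Tensor, frac: float = 0.01):
    """Reinitialize the `frac` least-used output units of `layer`.

    `activations` holds the layer's outputs on recent data, shape (batch, out_features);
    mean absolute activation stands in for the paper's utility measure.
    """
    utility = activations.abs().mean(dim=0)
    k = max(1, int(frac * layer.out_features))
    idx = utility.topk(k, largest=False).indices  # indices of the least-used units
    with torch.no_grad():
        new_w = torch.empty(k, layer.in_features)
        nn.init.kaiming_uniform_(new_w, a=5 ** 0.5)  # same default init as nn.Linear
        layer.weight[idx] = new_w
        if layer.bias is not None:
            layer.bias[idx] = 0.0
```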
7. xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
xGen-MM (BLIP-3) is Salesforce’s framework for developing LMMs, offering extensive datasets, unique training approaches, various model architectures, and a range of LMMs that excel at in-context learning and instruction-tuning. The framework’s models are thoroughly evaluated, and Salesforce has open-sourced all related materials to foster additional research in LMMs.
Quick Links
1. OpenAI has hired former Meta executive Irina Kofman to head strategic initiatives. Kofman, who previously worked as a senior director of product management for generative AI at Meta, will now report directly to OpenAI’s CTO Mira Murati, and initially focus on safety and preparedness.
2. Google has introduced a free Prompt Gallery within its AI Studio, enhancing the suite of tools available to developers working with AI. The Prompt Gallery offers a variety of pre-built prompts designed to streamline and optimize the creation of AI models, making it easier for developers to experiment and deploy models quickly.
3. Anysphere, a two-year-old startup that developed an AI-powered coding assistant called Cursor, has raised over $60 million in a Series A financing at a $400 million post-money valuation. The round was co-led by Andreessen Horowitz and Thrive Capital. Patrick Collison, co-founder and CEO of Stripe, also participated in the round.
4. Together AI introduced Rerank API, a new serverless endpoint for enterprise search and RAG systems. The release also includes exclusive access to Salesforce’s LlamaRank reranking model.
5. Luma AI released Dream Machine 1.5, marking a significant advancement in AI-powered video generation. This latest version of their text-to-video model offers enhanced realism, improved motion tracking, and more intuitive prompt understanding.
6. At the 2024 World Robot Conference in Beijing, Chinese companies showcased 27 humanoid robots alongside Tesla’s Optimus, signaling China’s ambition to dominate the industry.
Who’s Hiring in AI
Senior Technical Program Manager I, AI Data @Google (Mountain View, CA, USA)
Software Engineer (Data), AI & Data Platforms @Apple (Sunnyvale, CA, USA)
Software Dev Engineer — Machine Learning Apps, Accelerator @Amazon (Cupertino, CA, USA)
Manager, Site Reliability Engineer — GeForce Now Cloud @NVIDIA (Santa Clara, CA, USA)
Postdoctoral Researcher, Fundamental AI Research (PhD) @Meta (Menlo Park, CA, USA)
Machine Learning Engineer @Bazaarvoice (Remote/Canada)
Engineering Manager, Workspaces @Weights & Biases (Remote)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.