#107: What do enterprise customers need from LLMs?
Salesforce’s xLAM-1B, SenseTime’s SenseNova 5o, Nvidia’s RankRAG, Phi-3 Update, Moshi, SummHay!
What happened this week in AI by Louie
This week in AI saw hints of progress in multi-modal LLMs outside of OpenAI and Google, with SenseNova 5o from SenseTime and Kyutai unveiling its Moshi speech-to-speech model. There was also notable progress on Retrieval-Augmented Generation (RAG) with Nvidia’s RankRAG models and the new SummHay RAG evaluation (more on all this below). Additionally, we saw advances in smaller LLMs, including a strong update to the Phi-3 model and the release of Salesforce’s xLAM-1B.
Salesforce’s model is focused on “function calling” and is the latest in a trend of smaller, niche-focused models coming from LLM research groups that target enterprise customers, such as Salesforce, Cohere, and Databricks. For instance, Cohere’s Command-R+ model is focused on RAG, and Databricks has a new agent framework. As noted by Cameron R. Wolfe, this tells us something about what companies are looking for when integrating Generative AI into their business. Enterprises seek models that are easier to deploy and scale, maintain data privacy, and allow for seamless specialization over their own data through fine-tuning and RAG. The capability to develop reliable, agentic behavior through function calling can also be crucial. Additionally, they demand quick adaptation to new tasks, underscoring the importance of strong instruction-following behavior and contextual awareness.
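To make the pattern concrete, here is a minimal sketch of the function-calling loop that models like xLAM-1B are trained for: the model replies with a structured JSON call rather than free text, and application code validates and dispatches it. The tool name and schema below are hypothetical, not Salesforce’s actual format.

```python
import json

# Illustrative, simplified tool registry; a function-calling model is
# prompted with definitions like these and asked to reply with a
# structured call rather than free text. Names here are hypothetical.
TOOLS = {
    "get_order_status": lambda order_id: f"Order {order_id}: shipped",
}

def dispatch(model_output: str) -> str:
    """Validate the model's JSON tool call and execute the matching function."""
    call = json.loads(model_output)
    if call["name"] not in TOOLS:
        raise ValueError(f"Unknown tool: {call['name']}")
    return TOOLS[call["name"]](**call["arguments"])

# A function-calling model would emit something like:
print(dispatch('{"name": "get_order_status", "arguments": {"order_id": "A-123"}}'))
# -> Order A-123: shipped
```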
Several other factors also drive this trend, in our view. Enterprise use cases depend on substantial internal data, and the current generation of LLMs requires a complex pipeline involving RAG, fine-tuning, and function calling to use this data and achieve the reliability needed for corporate applications. Some of these enterprise-focused AI companies have carved out a niche by offering consulting and cloud services alongside their tailored LLMs, helping clients build robust RAG and agent pipelines while bundling in their own models. This niche focus can also be partly because it is difficult to compete at the leading edge of general foundational LLMs.
Why should you care?
We often see questions and fears over whether AI is in a bubble and whether over-hyping current capabilities could lead to another “AI winter.” A key question is when enterprises will progress from small-scale trials and proofs of concept to full-scale Generative AI rollouts. While revenue at companies such as OpenAI has grown spectacularly, there is still a huge gap to close between an annual run rate of around $100bn spent on GPUs and a Generative AI revenue run rate of perhaps ~$5bn (including consumer-focused applications).
We think it takes time for enterprises to explore, test, and develop new technologies, but the reliability threshold in particular is the missing piece for wide-scale rollout. Companies are learning to use a new tech stack, and building with it takes time and likely outside help. We think LLM applications can cross this threshold on “the march of 9s” (adding digits of accuracy, e.g., 99% vs. 90%) for many more enterprise sub-tasks through a combination of better employee training (how to use it and when not to trust it; Towards AI can help here!), more work on custom LLM pipelines, niche models, and frameworks (agents, data preparation, RAG, fine-tuning), and better foundational LLMs.
We think the need for internal data and retrieval mechanisms in some form will always remain, and advanced custom LLM pipelines will continue to be essential. However, the process will simplify over time, and corporate users will likely develop more expertise in building these systems. As leading-edge LLMs like Claude 3.5 Sonnet, GPT-4o, and Gemini Flash become faster and open-weight models improve, it will also be interesting to see how the smaller, niche models from labs such as Cohere and Salesforce compete.
— Louie Peters — Towards AI Co-founder and CEO
This issue is brought to you thanks to Brilliant:
AI is getting smarter. Are you?
No, AI isn’t replacing us tomorrow. But yes, to stay competitive, it’s essential to start understanding how to use it. Learning a little every day is one of the most important things you can do — both for personal and professional growth.
Brilliant’s suite of learning tools is designed to help you build a daily learning habit you’ll keep. Each lesson is bite-sized, so you can build real skills in minutes a day. Set goals, see your progress, and level up with courses in math, data, programming, and technology.
Join 10 million learners worldwide and start your 30-day free trial today! Plus, Towards AI readers get a special 20% off a premium annual subscription.
Hottest News
1. Salesforce Proves Less Is More: xLAM-1B ‘Tiny Giant’ Beats Bigger AI Models
Salesforce has unveiled xLAM-1B, an AI model dubbed the “Tiny Giant.” With just 1 billion parameters, it outperforms much larger models in function-calling tasks, including those from industry leaders OpenAI and Anthropic. Salesforce CEO Marc Benioff celebrated the achievement on Twitter, highlighting the potential for “on-device agentic AI.”
2. Kyutai Unveils Moshi Speech-to-Speech Model Ahead of OpenAI
French AI lab Kyutai, launched in November and backed by €300M in funding, has released Moshi, a 7-billion-parameter speech-to-speech model with 200ms latency. Moshi can listen and speak continuously, and its multimodal design allows for seamless adaptation and fine-tuning.
3. Nvidia’s Llama3-RankRAG Models Combine Context Ranking and Answer Generation
Nvidia’s Llama3-RankRAG models excel in knowledge-intensive benchmarks, outperforming GPT-4 models in nine tests without instruction fine-tuning. The RankRAG framework combines context ranking and answer generation, significantly improving performance in retrieval-augmented generation tasks.
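For intuition, here is a rough sketch of the rank-then-generate flow that RankRAG instruction-tunes a single model to perform. The `llm_relevance` and `llm_answer` helpers below are hypothetical stand-ins, and the keyword-overlap scoring is just a toy; in RankRAG the same LLM does both jobs.

```python
def llm_relevance(question: str, passage: str) -> float:
    """Stand-in: the LLM scores how relevant a passage is to the question."""
    return sum(w in passage.lower() for w in question.lower().split())

def llm_answer(question: str, contexts: list[str]) -> str:
    """Stand-in: the same LLM generates an answer grounded in the contexts."""
    return f"Answer to {question!r} using {len(contexts)} passages."

def rank_rag(question: str, retrieved: list[str], top_k: int = 3) -> str:
    # Step 1: rerank the retriever's candidates with the LLM itself.
    ranked = sorted(retrieved, key=lambda p: llm_relevance(question, p), reverse=True)
    # Step 2: generate from only the top-k reranked contexts.
    return llm_answer(question, ranked[:top_k])

passages = ["The Eiffel Tower is in Paris.", "Cats sleep a lot.", "Paris is in France."]
print(rank_rag("Where is the Eiffel Tower?", passages, top_k=2))
```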
4. Microsoft’s Phi-3 Mini Update
Microsoft’s Phi-3 Mini received significant upgrades, enhancing its code understanding and long-context comprehension capabilities. The update includes better-structured outputs and improved multi-turn instruction following, applicable to both 4K and 128K context models.
5. Elon Musk: Grok 2 AI Arrives in August
Elon Musk has unveiled plans for Grok 2, a new AI model expected in August 2024, promising enhanced efficiency. His company anticipates an upgrade to Grok 3 by the end of the same year, utilizing cutting-edge Nvidia GPU technology.
6. OpenAI’s ChatGPT Mac App Was Storing Conversations in Plain Text
OpenAI’s recently launched ChatGPT macOS app had a potentially worrying security issue: it stored conversations on disk in plain text, where anyone with access to the machine could read them. After the flaw was spotted, OpenAI updated the desktop app to encrypt the locally stored records.
7. YouTube Now Lets You Request Removal of AI-Generated Content That Simulates Your Face or Voice
YouTube’s revised privacy policy now enables users to request the removal of deepfake content replicating their likeness if it raises privacy issues, with certain considerations for content context and public interest.
Five 5-minute reads/videos to keep you learning
1. Why Are Most LLMs Decoder-Only?
Large language models often use a decoder-only architecture because it is efficient for generative pre-training and cost-effective, exhibiting strong zero-shot generalization. Although encoder-decoder models can excel in multitask finetuning, extensive training diminishes the performance difference. This article dives into this question further.
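The mechanical heart of the decoder-only design is the causal attention mask: each token may attend only to earlier positions, which is what makes next-token pretraining (and KV caching at inference) so natural. A minimal NumPy illustration:

```python
import numpy as np

seq_len = 4
# Decoder-only (causal) mask: token i may attend to tokens 0..i only.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
# An encoder, by contrast, uses a full (all-ones) mask, so every token
# sees the whole sequence bidirectionally.
```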
2. This article challenges the belief that simply scaling up language models will result in artificial general intelligence, highlighting issues such as overhyped scaling laws, misconceptions about emergent abilities, and practical constraints like data scarcity and rising costs.
3. NPU vs. GPU: What’s the Difference?
Neural processing units (NPUs) are often compared to graphics processing units (GPUs) in their ability to accelerate AI tasks. This article compares their differences and examines the strengths and drawbacks of each, including use cases, architecture, and key features.
4. What Is a “Cognitive Architecture”?
The article discusses the role of cognitive architecture in developing applications powered by LLMs, delineating the spectrum of autonomy from basic hardcoded scripts to sophisticated, self-governing agents, and highlights its importance in deploying LLM-enabled decision-making systems.
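As a toy illustration of that spectrum (helper names hypothetical): at the low-autonomy end the developer hardcodes the control flow and the LLM only fills in steps, while at the high end the LLM chooses its own next action.

```python
def llm(prompt: str) -> str:
    return "search"  # stand-in for a real model call

# Low autonomy: the developer hardcodes the sequence of steps.
def hardcoded_chain(query: str) -> str:
    docs = f"docs for {query}"                   # step 1: always retrieve
    return llm(f"Answer {query} using {docs}")   # step 2: always answer

# High autonomy: the LLM picks the next action until it decides to stop.
def agent_loop(query: str, max_steps: int = 5) -> str:
    state = query
    for _ in range(max_steps):
        action = llm(f"Given {state!r}, choose: search / calculate / finish")
        if action == "finish":
            break
        state = f"{state} + result of {action}"
    return state

print(agent_loop("What is the population of Paris?"))
```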
5. Meta 3D Gen
Meta 3D Gen (3DGen) is an AI-driven pipeline that quickly generates detailed 3D models and textures from text descriptions, with capabilities for physically-based rendering and retexturing assets. This is the official research paper explaining the pipeline in detail.
Repositories & Tools
1. GraphRAG is a modular graph-based Retrieval-Augmented Generation (RAG) system.
2. Agentless is an agentless approach to automatically solve software development problems.
3. Vision Agent is a tool that automates code generation for computer vision tasks based on natural language descriptions.
4. Gaianet Node provides a distributed and incentivized GenAI agent network.
5. Public APIs is a collective list of free APIs.
Top Papers of The Week!
1. Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
The “Summary of a Haystack” (SummHay) task is designed to test long-context language models and retrieval-augmented generation systems by evaluating their capacity to summarize and cite from large document sets in which specific insights repeat across documents. This evaluation is closer to real-world use cases than common tests like needle-in-a-haystack. Initial results show that even the best long-context LLMs and RAG systems still lag estimated human performance, with accurate citation a particular weakness.
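Schematically, SummHay scores a summary on two axes: coverage of the known insights and correctness of the citations attached to each. The toy scorer below uses exact string matching purely to show the shape of the metric; the actual benchmark judges matches with an LLM.

```python
def summhay_scores(summary_bullets, gold_insights):
    """Toy scorer. gold_insights maps insight text -> set of source doc IDs;
    summary_bullets is a list of (text, cited_doc_ids) pairs.
    The real benchmark uses an LLM judge, not string matching."""
    covered, citation_hits, citations_made = 0, 0, 0
    for insight, gold_docs in gold_insights.items():
        for text, cited in summary_bullets:
            if insight in text:  # crude stand-in for a judged semantic match
                covered += 1
                citation_hits += len(set(cited) & gold_docs)
                citations_made += len(cited)
                break
    coverage = covered / len(gold_insights)
    cite_precision = citation_hits / citations_made if citations_made else 0.0
    return coverage, cite_precision

gold = {"insight A": {"doc1", "doc2"}}
bullets = [("Summary mentions insight A.", ["doc1", "doc3"])]
print(summhay_scores(bullets, gold))  # (1.0, 0.5)
```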
2. Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Researchers developed Test-Time Training (TTT), a new architecture for long-context modeling whose expressive hidden state is itself a small model, updated by a self-supervised learning step on each incoming token. This lets the layer compress the growing context into its state, offering better performance than traditional models on tasks involving extended sequences.
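A simplified sketch of the idea, loosely in the spirit of the paper’s TTT-Linear layer: the hidden state is a matrix W, and “updating the state” means taking a gradient step on a self-supervised reconstruction loss for each incoming token. The real layers use learned corruption and projection views rather than raw reconstruction, so treat this as an assumption-laden toy.

```python
import numpy as np

def ttt_linear(tokens: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """tokens: (seq_len, d). Hidden state is a d x d matrix W, updated by one
    gradient step per token on the reconstruction loss ||W x - x||^2."""
    d = tokens.shape[1]
    W = np.zeros((d, d))
    outputs = []
    for x in tokens:
        grad = 2 * np.outer(W @ x - x, x)   # d/dW of ||W x - x||^2
        W -= lr * grad                       # the test-time training step
        outputs.append(W @ x)                # output uses the updated state
    return np.stack(outputs)

print(ttt_linear(np.random.randn(8, 4)).shape)  # (8, 4)
```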
3. Mixture of A Million Experts
Google DeepMind introduced PEER, a parameter-efficient expert retrieval mechanism using a product key technique for sparse retrieval from over a million experts. PEER outperforms traditional dense FFW layers and coarse-grained MoEs, offering superior performance and computational efficiency.
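The product-key trick is what makes retrieval over roughly a million experts cheap: split the query in two, take top-k against two small sub-key tables, and score only the k × k combined candidates rather than all |K1| × |K2| experts. A NumPy sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_sub, k = 64, 1024, 8          # 1024^2 ~ 1M experts from 2 x 1024 sub-keys
keys1 = rng.standard_normal((n_sub, d // 2))
keys2 = rng.standard_normal((n_sub, d // 2))

def product_key_topk(query: np.ndarray) -> list:
    q1, q2 = query[: d // 2], query[d // 2 :]
    top1 = np.argsort(keys1 @ q1)[-k:]      # top-k on each half: O(n_sub)
    top2 = np.argsort(keys2 @ q2)[-k:]
    # Expert (i, j) scores <q1, k1_i> + <q2, k2_j>, so the global top-k is
    # guaranteed to lie in these k*k candidates out of ~1M experts.
    cands = [(i * n_sub + j, keys1[i] @ q1 + keys2[j] @ q2)
             for i in top1 for j in top2]
    return sorted(cands, key=lambda c: c[1], reverse=True)[:k]

print(product_key_topk(rng.standard_normal(d))[:3])
```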
4. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
The computational cost of LLM inference remains a significant barrier to widespread deployment, especially as prompt lengths increase. This paper introduces MInference (Million tokens Inference), a dynamic sparse attention method designed to accelerate the pre-filling stage of long-sequence processing.
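As a rough illustration of what gets skipped, here is a toy version of one of the sparse patterns MInference exploits (the “A-shape”: a few initial “sink” tokens plus a local window), applied to a single query position. The real method selects patterns per attention head and builds sparse indices dynamically; this sketch only shows the shape of the saving.

```python
import numpy as np

def a_shape_attention(q, K, V, pos, sink=4, window=64):
    """Toy 'A-shape' sparse attention for one query at position `pos`:
    attend only to the first `sink` tokens plus a local window,
    instead of all `pos + 1` keys."""
    idx = np.unique(np.concatenate([
        np.arange(min(sink, pos + 1)),
        np.arange(max(0, pos - window + 1), pos + 1),
    ]))
    scores = K[idx] @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[idx]

d, n = 32, 10_000
rng = np.random.default_rng(1)
out = a_shape_attention(rng.standard_normal(d), rng.standard_normal((n, d)),
                        rng.standard_normal((n, d)), pos=n - 1)
print(out.shape)  # (32,) computed from ~68 keys instead of 10,000
```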
5. Data Curation via Joint Example Selection Further Accelerates Multimodal Learning
This research from DeepMind demonstrates that jointly selecting batches of data is more effective for learning than selecting examples independently. This approach — multimodal contrastive learning with joint example selection (JEST) — surpasses state-of-the-art models with up to 13× fewer iterations and 10× less computation.
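The scoring idea behind JEST: a batch is worth training on when it is still hard for the learner but easy for a pretrained reference model (“learnability”), and batches are scored jointly because contrastive losses couple their examples. Below is a greedy toy version under that assumption; the paper samples sub-batches in blocks rather than greedily.

```python
def joint_select(candidates, batch_size, learner_loss, reference_loss):
    """Greedily grow a batch maximizing joint learnability:
    score(B) = learner_loss(B) - reference_loss(B), computed on the whole
    batch so interactions between examples are taken into account."""
    batch, pool = [], list(candidates)
    while len(batch) < batch_size and pool:
        best = max(pool, key=lambda ex: learner_loss(batch + [ex])
                                        - reference_loss(batch + [ex]))
        batch.append(best)
        pool.remove(best)
    return batch

# Toy usage with stand-in losses over lists of numbers:
cands = [1, 5, 2, 9, 3]
pick = joint_select(cands, 2,
                    learner_loss=lambda b: sum(b) * 1.0,    # hard for learner
                    reference_loss=lambda b: len(b) * 1.0)  # easy for reference
print(pick)  # [9, 5]
```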
6. RouteLLM: Learning to Route LLMs with Preference Data
This paper proposes several efficient router models that dynamically select between a stronger and a weaker LLM during inference to optimize the balance between cost and response quality. The researchers develop a training framework for these routers to leverage human preference data and data augmentation techniques.
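In sketch form, routing reduces to a learned win-rate predictor plus a threshold encoding your cost tolerance. The predictor below is a hypothetical stand-in for a router trained on preference data (RouteLLM evaluates pairs such as GPT-4 as the strong model and Mixtral 8x7B as the weak one).

```python
def predicted_strong_win_rate(query: str) -> float:
    """Stand-in for a router trained on human preference data
    (here: longer queries are assumed to favor the strong model)."""
    return min(1.0, len(query.split()) / 50)

def route(query: str, threshold: float = 0.3) -> str:
    """Higher threshold = cheaper (more traffic to the weak model);
    lower threshold = higher quality."""
    if predicted_strong_win_rate(query) >= threshold:
        return "gpt-4"      # strong, expensive
    return "mixtral-8x7b"   # weak, cheap

print(route("What is 2 + 2?"))                       # -> weak model
print(route(" ".join(["hard"] * 40) + " question"))  # -> strong model
```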
7. Self-Play Preference Optimization for Language Model Alignment
This paper proposes a self-play-based method for language model alignment. This approach, called Self-Play Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee.
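Schematically, each SPPO iteration regresses the model’s log-probability ratio toward a centered, scaled estimate of each response’s win rate against the current policy. A per-example sketch of that objective, with η as a step-size hyperparameter and the win rate assumed to come from a preference model scoring the response against samples from the current policy:

```python
def sppo_loss(logp_theta: float, logp_current: float,
              win_rate: float, eta: float = 1.0) -> float:
    """Schematic SPPO objective for one (prompt, response) pair:
    push log(pi_theta / pi_current) toward eta * (P(response wins) - 1/2).
    win_rate is an estimate of P(response beats the current policy)."""
    target = eta * (win_rate - 0.5)
    return (logp_theta - logp_current - target) ** 2

print(sppo_loss(logp_theta=-1.2, logp_current=-1.0, win_rate=0.8))
# (-0.2 - 0.3)^2 = 0.25
```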
Quick Links
1. Google DeepMind has developed JEST, a new method that makes training AI models for image and text processing significantly more efficient. This new approach to the energy-intensive process could make AI development faster and cheaper.
2. Adept has agreed to license its tech to Amazon. The Adept co-founders and some of the team are joining Amazon’s AGI organization to continue to pursue the mission of building useful general intelligence.
3. Cloudflare has launched a new, free tool to prevent bots from scraping websites hosted on its platform for data to train AI models. Cloudflare has set up a form for hosts to report suspected AI bots, and it’ll continue to manually blacklist AI bots over time.
4. Chinese AI was on display at the World Artificial Intelligence Conference in Shanghai, where SenseTime introduced SenseNova 5.5, claiming it outperforms GPT-4 in five out of eight key metrics. SenseNova 5o, a real-time multimodal model capable of processing audio, text, image, and video, was also showcased.
Who’s Hiring in AI!
Software Engineer, Full Stack @OpenAI (San Francisco, CA, USA)
Solutions Architect, Generative AI Specialist @NVIDIA (USA/Remote)
Programmer Writer, Documentation @Weights & Biases (Remote)
R&D Assay Engineering Intern @nomic (Montreal, Canada)
Machine Learning Research Engineer @Pear VC (San Francisco, CA, USA)
Python Application Developer @A-TEK, Inc. (Bethesda, MD, USA/Hybrid)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
If you are preparing your next machine learning interview, don’t hesitate to check out our leading interview preparation website, confetti!