TAI #120: OpenAI DevDay in Focus!
Also, Nvidia NVLM 1.0, Flux 1.1, an update to our book "Building LLMs for Production", and more.
What happened this week in AI by Louie
OpenAI’s 2024 DevDay event came against a backdrop of significant changes within the company, including executive departures and new fundraising efforts. Despite the turbulence, OpenAI is pushing forward with new developer features such as vision fine-tuning, model distillation, and prompt caching, which will make it cheaper and easier for people to develop LLM products. OpenAI boasted of over 3 million developers actively using its AI models. The multimodal LLM developer toolkit also expanded outside of OpenAI this week with a very strong new open-source multimodal model from Nvidia, NVLM 1.0.
Here’s a summary of the major announcements at OpenAI DevDay:
Realtime API for Speech-to-Speech Conversations: OpenAI introduced the Realtime API, which allows developers to create low-latency, speech-to-speech experiences directly within their applications. This API supports natural speech conversations using preset voices, similar to ChatGPT’s Advanced Voice Mode, enabling smoother interactions without the need to chain together multiple models. It features streaming audio inputs and outputs, which improves responsiveness and conversational flow. This API also offers function-calling capabilities, letting voice assistants perform actions or retrieve context automatically. The Realtime API is currently available in public beta, with pricing set at $0.06 per minute of audio input and $0.24 per minute of audio output.
Canvas Interface for Writing and Coding: Canvas is a new collaborative tool integrated into ChatGPT (Plus & Team) designed to improve writing and coding tasks. It provides a visual interface where users can work on projects alongside ChatGPT, allowing for more detailed editing, inline feedback, and context-aware suggestions.
Vision Fine-Tuning with GPT-4o: OpenAI now supports vision fine-tuning on GPT-4o, enabling developers to train the model with image and text data to enhance its visual understanding. This upgrade makes it suitable for applications like object detection, visual search, and medical image analysis. Fine-tuning with images follows a similar process to text, and developers can start with as few as 100 images. OpenAI is offering 1 million free training tokens daily for vision fine-tuning until the end of October 2024.
Prompt Caching: Following similar features at DeepSeek, Anthropic, and Google Gemini, prompt caching offers discounts to developers by reusing input tokens that have been previously processed. This reduces both costs and latency for repeated prompts, making it particularly useful for long-running tasks like multi-turn conversations. Cached tokens are priced at half the cost of regular tokens, providing significant savings for API users. This feature is automatically available for GPT-4o and its mini versions, as well as other supported models. Caches are cleared after a period of inactivity but are designed to optimize performance for frequently used prompts.
Model Distillation: This allows developers to fine-tune smaller, cost-efficient models using the outputs of more capable models like GPT-4o or o1-preview. OpenAI’s integrated approach streamlines the distillation process, making it easier to achieve high performance with less resource-intensive models. The distillation suite includes tools like Stored Completions to automatically generate datasets and Evals for evaluating model performance.
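The pricing figures in the summary above lend themselves to a quick back-of-the-envelope estimate. The Python sketch below uses only the per-minute Realtime API rates and the 50% cached-token discount quoted in this issue. The function names and the base input price of $2.50 per million tokens are illustrative assumptions, not published rates for any particular model, so substitute current prices before relying on the numbers.

```python
# Rates quoted in the summary above (USD per minute of audio).
AUDIO_INPUT_PER_MIN = 0.06
AUDIO_OUTPUT_PER_MIN = 0.24
# Cached input tokens are billed at half the regular input price.
CACHED_DISCOUNT = 0.5


def realtime_audio_cost(input_minutes: float, output_minutes: float) -> float:
    """Estimate the audio cost of a Realtime API session in USD."""
    return input_minutes * AUDIO_INPUT_PER_MIN + output_minutes * AUDIO_OUTPUT_PER_MIN


def input_token_cost(total_tokens: int, cached_tokens: int, price_per_million: float) -> float:
    """Estimate input cost in USD when part of the prompt is served from cache."""
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    uncached_tokens = total_tokens - cached_tokens
    # Cached tokens count at the discounted rate, the rest at full price.
    billable_tokens = uncached_tokens + cached_tokens * CACHED_DISCOUNT
    return billable_tokens * price_per_million / 1_000_000


if __name__ == "__main__":
    ASSUMED_PRICE = 2.50  # hypothetical USD per million input tokens

    # A voice call with 5 minutes of user audio and 5 minutes of model audio.
    print(f"Audio cost: ${realtime_audio_cost(5, 5):.2f}")

    # A 10,000-token prompt whose first 8,000 tokens were already cached.
    with_cache = input_token_cost(10_000, 8_000, ASSUMED_PRICE)
    without_cache = input_token_cost(10_000, 0, ASSUMED_PRICE)
    print(f"Input cost with cache: ${with_cache:.4f}")
    print(f"Input cost without cache: ${without_cache:.4f}")
```

Under these assumed numbers, the ten-minute voice exchange costs about $1.50 in audio, and the mostly cached prompt costs about $0.015 instead of $0.025. The saving grows with the share of the prompt that repeats, which is why long system prompts and multi-turn conversations benefit most.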
Why should you care?
While we are glad competition in the foundational LLM race has heated up significantly this year, OpenAI remains the go-to LLM development platform for many. It has lagged behind competitors in releasing some of these features, such as prompt caching, but we still think OpenAI makes its tools particularly easy to use, and many of these new features will be very valuable for LLM builders.
— Louie Peters — Towards AI Co-founder and CEO
We are excited to announce that we have rolled out an updated version of Building LLMs for Production!
The updated version has an improved structure, fresher insights, more up-to-date information, optimized code, and, of course, a more enjoyable reading experience.
Why an update rather than a second edition?
The book is grounded in ‘timeless principles’ that remain relevant despite ongoing developments in the LLM field. This update aims to make the reading experience smoother and more accessible, ensuring that key concepts are easy to understand.
But beyond that, we believe that certain techniques discussed in the book, such as model distillation, are becoming a foundation for practitioners and companies working with LLMs. The updated version provides more practical information on these techniques, which we believe have become more accessible since the book was published and have found broader applications beyond research.
The updated version is available as a paperback, e-book, & hardcover. Grab your copy from your local Amazon page!
We are super excited for you all to read it, and to hear what you think of it.
Hottest News
1. OpenAI’s DevDay Brings Realtime API and Other Treats for AI App Developers
The company announced several new tools on OpenAI’s DevDay, including a public beta of its “Realtime API” for building apps with low-latency, AI-generated voice responses. OpenAI also introduced vision fine-tuning in its API, which will let developers use images and text to fine-tune their applications of GPT-4o.
2. Nvidia Just Dropped a New AI Model That Is Open, Massive, and Ready To Rival GPT-4
Nvidia’s new NVLM 1.0 family of large multimodal language models, led by the 72 billion parameter NVLM-D-72B, demonstrates exceptional performance across vision and language tasks while also enhancing text-only capabilities.
3. ChatGPT’s ‘Canvas’ Interface Makes It Easier To Write and Code
OpenAI has launched a new “Canvas” interface for ChatGPT that allows users to adjust sections of text or code generated by the chatbot in a side-by-side collaboration. ChatGPT Canvas provides users with a menu of shortcuts for suggesting inline edits, quickly checking grammar and clarity, and adjusting text length and reading level. Some coding-specific shortcuts are also available for debugging, adding logs and comments, and translating code into other languages.
4. Meta Has Launched Movie Gen, a Cutting-Edge Media Foundation Model
Meta’s AI model, Movie Gen, generates realistic 16-second videos with sound from text prompts, surpassing competitors with advanced editing and camera movement understanding. However, it lacks voice capabilities and has not been publicly released, in order to prevent misuse.
5. OpenAI Gets $4 Billion in Credit on Top of $6.6 Billion Fundraise
OpenAI has set up a $4 billion credit line from an array of banks, adding to its financial firepower after securing a $6.6 billion round of new investments. OpenAI said the revolving line of credit is with JPMorgan Chase, Citigroup, Goldman Sachs, Morgan Stanley, Santander, Wells Fargo, the Japanese bank SMBC, UBS, and HSBC.
6. Google Releases Gemini 1.5 Flash-8B
Google has launched Gemini 1.5 Flash-8B, a production-ready variant of its lightweight language model. This release marks a significant advancement in efficient AI, designed for high-volume, multimodal applications and long context summarization tasks.
7. Microsoft Gives Copilot a Voice and Vision in Its Biggest Redesign Yet
Microsoft has unveiled a big overhaul of its Copilot experience, adding voice and vision capabilities to transform it into a more personalized AI assistant. Copilot is being redesigned across mobile, web, and the dedicated Windows app into a user experience that’s more card-based and looks very similar to the work Inflection AI has done with its Pi personalized AI assistant.
8. Black Forest Labs Releases Flux 1.1 Pro and an API
Black Forest Labs has announced the release of a new, faster text-to-image model called Flux 1.1 Pro, and with it, a paid application programming interface (API) on which developers can build third-party apps powered by the model. Individual users can access the new Flux 1.1 Pro model not through Black Forest Labs’s site but through partners together.ai, Replicate, fal.ai, and Freepik.
Five 5-minute reads/videos to keep you learning
1. Knowledge Extraction Using LLMs
Knowledge extraction from documents using LLMs (Large Language Models) has become increasingly important in our data-driven world. This article explains how LLMs can help businesses efficiently process and understand content from various sources, including text, tables, and figures.
2. A Data Scientist’s Guide to Ensemble Learning: Techniques, Benefits, and Code
The principle of collective intelligence — the “wisdom of the crowd” — is the foundation of Ensemble Learning in machine learning. Ensemble learning leverages a combination of different algorithms to make smarter decisions. This article dives into this fascinating concept and shows how it works in the world of machine learning.
3. RAG: The Power of Text Splitting for Improving Retrieval: A Developer’s Handbook
Whether you are building a retrieval-augmented generation (RAG) system or simply feeding large datasets into an LLM for processing, how you split your text can dramatically affect performance. This guide explores different text-splitting techniques, from basic to advanced, with practical examples using LangChain, Ollama embeddings, and Llama 3.2.
4. The Power of Model Distillation
This article explores an essential technique in LLMs: model distillation. This approach has become increasingly crucial as LLMs grow larger, allowing us to capture some of their impressive capabilities in more manageable packages. It covers how model distillation works and why OpenAI’s decision to offer it as an integrated feature matters for the approach and for the future of language models.
5. Comparing Open-Source and Proprietary LLMs in Medical AI
Closed-source models, led by GPT-4o and Claude Sonnet, maintain a performance lead in medical benchmarks; however, the gap is narrowing as open-source models continue to improve. This article provides a brief overview of recent evaluations of both closed and open-source LLMs on popular medical benchmark datasets. It describes the methods, costs, and other relevant factors in obtaining these performance results.
Repositories & Tools
1. LLaVA Next is a multimodal LLM designed to evaluate multimodal tasks.
2. Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data.
3. NiceGUI is an easy-to-use, Python-based UI framework that shows up in your web browser.
4. Python, from TheAlgorithms, is a collection of algorithms implemented in Python.
5. GPTme is an AI assistant that can use the terminal, run code, edit files, browse the web, and use vision from a simple but powerful CLI.
Top Papers of The Week
1. MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
MM1.5 is an advanced multimodal large language model series that enhances text-image understanding and multi-image reasoning, building on the MM1 architecture. It employs diverse data sets, including OCR and synthetic captions, and features models from 1B to 30B parameters. MM1.5 also offers specialized video and mobile UI understanding variants, demonstrating strong performance across different model sizes.
2. Were RNNs All We Needed?
This paper revisits traditional recurrent neural networks (RNNs) and shows that by removing the hidden state dependencies from their input, forget, and update gates, LSTMs and GRUs can be efficiently trained in parallel. It also introduces minimal versions (minLSTMs and minGRUs) that use significantly fewer parameters and are fully parallelizable during training.
3. VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Traditional scalar-based weight quantization struggles to reach extremely low bit widths. This paper introduces Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. It uses second-order optimization to formulate the LLM vector quantization (VQ) problem and guides the design of its quantization algorithm by solving that optimization problem.
4. OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
Processing extensive videos such as 24-hour CCTV footage or full-length films presents significant challenges due to the vast data and processing demands. Traditional methods, like extracting key frames or converting frames to text, often result in substantial information loss. To overcome these shortcomings, this paper introduces OmAgent, which efficiently stores and retrieves relevant video frames for specific queries, preserving the detailed content of videos.
5. ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models
This paper introduces a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU. It introduces some key innovations, such as a distribution correction method for transformer blocks, the bit balance strategy to counteract performance degradation, and a quantization acceleration framework.
Quick Links
1. OpenAI Academy launches with $1M in developer credits for devs in low- and middle-income countries. It aims to catalyze economic growth and innovation in sectors such as healthcare, agriculture, education, and finance and “ensure that the transformative potential of artificial intelligence is accessible and beneficial to diverse communities worldwide.”
Who’s Hiring in AI
AI Technical Writer and Developer for Large Language Models @Towards AI Inc (Remote)
A.I. Prompt Engineering Intern @Sezzle (Colombia/Remote)
AI Engineer @Plante Moran (Michigan, USA)
Senior Programmer Writer, SageMaker doc team @Amazon (Seattle, WA, USA)
C360 AI Product Manager, Senior Manager @Salesforce (San Francisco, CA, USA)
Application Developer III @Agile Defense (Remote)
Senior Machine Learning Solutions Architect @Fiddler AI (Remote)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.