#119 New LLM audio capabilities with NotebookLM and ChatGPT Advanced Voice
Also Llama 3.2, Gemini 1.5 Pro-002, AlphaChip, and more turmoil at OpenAI.
What happened this week in AI by Louie
This week, we focused on new voice capabilities for LLMs, with the audio features in Google’s recently released NotebookLM and OpenAI’s move to roll out ChatGPT’s Advanced Voice Mode (built on the fully multimodal version of GPT-4o) more widely. We also saw some great new LLM options released with Llama 3.2, the first multimodal models in the family, and Gemini 1.5 Pro-002, which brings strong benchmark improvements and lower prices.
NotebookLM, an experimental tool from Google, has been created to help users organize, analyze, and synthesize information from their own documents. It acts as a virtual research assistant and allows users to “ground” the language model in their materials, such as Google Docs and PDFs, to help them get insights and potentially generate new ideas. It recently introduced the “Audio Overviews” feature for turning research into audio summaries or short podcasts, which has led to lots of great demos online of people turning their content directly into podcasts. Over 2,000 people have written for our Towards AI publication, and we think this opens up great potential for them to easily expand their audience through new content mediums!
Meanwhile, OpenAI’s wider rollout of Advanced Voice Mode, part of GPT-4o, brought natural conversation capabilities to ChatGPT. It allows real-time voice interactions and is also designed to detect non-verbal cues like tone and speed, making responses more emotionally tuned and human-like. Users can interrupt and guide conversations without losing context, a capability that sets it apart from traditional voice assistants like Siri or Alexa. OpenAI’s voice mode also has an “Audio Overview” feature, which lets users listen to synthesized summaries of their documents in a conversational podcast format. While still experimental, this feature is rolling out across mobile apps, though it remains restricted in regions such as the EU due to regulatory concerns around AI’s ability to detect emotions. OpenAI’s voice mode is still heavily constrained relative to its underlying capabilities by safety mitigations, such as measures preventing it from mimicking voices; working on these safety fixes was a large part of the delay in its release.
Why should you care?
These new models are important because advancements in voice-enabled AI tools like Google’s NotebookLM and OpenAI’s ChatGPT are making AI more practical and accessible. More natural, low-latency voice chatbots improve the quality of real-time conversations, allowing for smoother, more responsive interactions than older, less capable voice assistants like Siri or Alexa. A single multimodal model also enables much more natural conversations than older voice chatbots, which had to chain separate speech-to-text, LLM, and text-to-speech models.
For professionals, researchers, and content creators, these tools offer new ways to handle and distribute their content. Features like “Audio Overviews” turn written content into audio summaries, making it easier to review materials or generate new ideas. This can be especially useful for multitasking or accessing information when reading isn’t convenient. Additionally, these tools offer significant benefits for people with visual impairments by converting text into high-quality audio, making more content accessible.
— Louie Peters — Towards AI Co-founder and CEO
This issue is brought to you thanks to Nebius:
Don’t feel GPU-less. We have H100 and L40S GPUs on demand, and H200s are coming soon and can be pre-ordered right now!
Our platform lets you scale from a single GPU up to thousands, with a dedicated support engineer for multi-host training.
So when you think there’s no GPU at the end of the tunnel… just visit.
Hottest News
1. Meta Unveils Llama 3.2, Edge AI and Vision With Open, Customizable Models
Llama 3.2 features advanced AI models optimized for edge and mobile devices, including vision LLMs (11B and 90B) and lightweight text-only models (1B and 3B). These models excel in tasks such as summarization and image understanding, supporting extensive token lengths.
2. Google’s New Gemini 1.5 AI Models Offer More Power and Speed at Lower Costs
Google has released two updated Gemini AI models that promise more power, speed, and lower costs. The new versions, Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002, offer significant improvements over their predecessors, according to Google, showing gains across a range of benchmarks, particularly in math, long-context, and visual tasks. In addition, the company has reduced the price of input and output tokens for Gemini 1.5 Pro by more than 50%, increased rate limits for both models, and reduced latency.
3. OpenAI CTO Mira Murati Is Leaving
Mira Murati, CTO of OpenAI, is leaving the company after over six years to pursue personal interests. Her departure comes as OpenAI prepares for DevDay and undergoes significant changes, including CEO Sam Altman’s growing influence and a potential $150 billion funding round. Murati played a key role in developing major AI projects like ChatGPT.
4. OpenAI Might Raise the Price of ChatGPT to $44 by 2029
The New York Times, citing internal OpenAI docs, reports that OpenAI plans to raise the price of individual ChatGPT subscriptions from $20/month to $22/month by the end of the year. A steeper increase will come over the next five years; by 2029, OpenAI expects to charge $44 per month for ChatGPT Plus.
5. Microsoft Re-Launches ‘Privacy Nightmare’ AI Screenshot Tool
Microsoft’s Recall, labeled a potential “privacy nightmare” by critics, will be relaunched in November on its new Copilot+ computers. Some of its more controversial features have been stripped out: for example, it will now be opt-in, whereas the original version was turned on by default.
6. Google Unveils AlphaChip, Its Reinforcement Learning Method for Chip Design
Google unveiled its AlphaChip reinforcement learning method for designing chip layouts. AlphaChip promises to substantially speed up the design of chip floorplans and make them more optimal in terms of performance, power, and area. The reinforcement learning method, now shared with the public, has been instrumental in designing Google’s Tensor Processing Units (TPUs).
7. Meta’s New AI-Made Posts Open a Pandora’s Box
Meta plans to generate synthetic content tailored to individual users. Meta said it will generate some images based on a user’s interests and others that feature their likeness. Users will have the option to take that content in a new direction or swipe to see more content imagined for them in real time.
Five 5-minute reads/videos to keep you learning
1. Llama Can Now See and Run on Your Device — Welcome Llama 3.2
Llama 3.2 introduces advanced multimodal and text-only models, including 11B and 90B Vision models and smaller 1B and 3B text models for on-device use. Enhancements feature visual reasoning and multilingual support, though EU users face licensing restrictions on multimodal models.
2. Converting A From-Scratch GPT Architecture to Llama 2
The article outlines the process of converting a GPT model to a Llama 2 model, highlighting key modifications such as replacing LayerNorm with RMSNorm, GELU with SiLU activation, and incorporating rotary position embeddings (RoPE). It also details updates to the MultiHeadAttention and TransformerBlock modules to support these changes.
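The normalization and activation swaps described above can be illustrated in a few lines of numpy. This is a minimal sketch for intuition, not code from the article; the function names, shapes, and epsilon value are our own assumptions:

```python
import numpy as np

def layer_norm(x, weight, bias, eps=1e-5):
    # GPT-style LayerNorm: center by the mean, scale by the standard deviation
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * weight + bias

def rms_norm(x, weight, eps=1e-5):
    # Llama-style RMSNorm: no mean-centering or bias, scale by the root mean square
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def silu(x):
    # SiLU (swish), which replaces GELU in Llama's feed-forward layers
    return x * (1.0 / (1.0 + np.exp(-x)))
```

RMSNorm drops LayerNorm’s mean-subtraction and bias term, saving a little compute per layer; RoPE is the larger architectural change, rotating query/key dimension pairs by position-dependent angles instead of adding learned position embeddings.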
3. ChatGPT-o1 vs Claude 3.5 Coding Performance Compared
This is a comparative analysis of OpenAI o1 and Claude 3.5 using Cursor AI, shedding light on their respective strengths and limitations in coding tasks. While Claude 3.5 demonstrated superior performance in the tested scenarios, the true potential of o1’s advanced reasoning capabilities remains to be fully explored.
4. OpenAI’s Advice on Prompting
The o1-preview and o1-mini models excel in scientific reasoning and programming, showing strong performance in competitive programming and academic benchmarks. Ideal for deep reasoning applications, they currently support text-only inputs and have limitations in their beta phase, such as a lack of image input support and slower response times.
5. Convolutional Networks for Classification and Localization
Convolutional networks have been around for a long time, but their performance has been limited by the size of the available training sets and the size of the networks under consideration. This article highlights newer techniques that produce a classification output while drawing on many layers, allowing effective localization and simultaneous use of context.
6. Top Generative AI Use Cases in 2024
This guide highlights some of the top generative AI use cases across different fields, demonstrating how it revolutionizes areas like healthcare and finance.
7. Devs Gaining Little (if Anything) From AI Coding Assistants
A study from Uplevel found that AI coding assistants like GitHub Copilot do not significantly improve developer productivity as measured by pull request cycle time and throughput, contradicting anecdotal claims.
Repositories & Tools
1. Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for LLMs and AI applications.
2. Kotaemon is an open-source RAG-based tool for chatting with your documents.
3. Exo allows you to run your own AI cluster at home with everyday devices.
4. Kestra is a universal open-source orchestrator that makes scheduled and event-driven workflows easy.
5. Count Token Optimization presents an iterative optimization method for text-to-image diffusion models to enhance object counting accuracy.
Top Papers of The Week
1. Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
This paper introduces Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. Its sparse mixture-of-experts (MoE) design activates only a subset of networks for each prediction, cutting computational load while maintaining high model capacity.
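The core idea, running only a few experts per prediction, fits in a short sketch. This is not Time-MoE’s code; it is a toy numpy illustration under our own assumptions (a per-token loop rather than batched dispatch, and a softmax over only the selected experts):

```python
import numpy as np

def top_k_moe(x, gate_w, experts, k=2):
    """Sparse MoE layer: route each token to its top-k experts only.

    x: (tokens, d) activations; gate_w: (d, n_experts) router weights;
    experts: list of callables, each mapping a (d,) vector to a (d,) vector.
    """
    logits = x @ gate_w                       # router score per (token, expert)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-k:]      # indices of the k best experts
        scores = logits[t][top]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()              # softmax over the selected experts
        for w, e in zip(weights, top):
            out[t] += w * experts[e](x[t])    # only k experts run per token
    return out
```

Because only `k` of the experts execute per token, the compute cost scales with `k` rather than with the total number of experts, which is how MoE models keep inference cheap while growing total parameter count.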
2. HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
HelloBench is a benchmark designed to assess the long text generation abilities of Large Language Models, addressing their difficulties in producing texts over 4000 words with consistent quality. It categorizes tasks into five groups and introduces HelloEval, an evaluation method that closely aligns with human judgment.
3. A Controlled Study on Long Context Extension and Generalization in LLMs
This controlled study on extending language models for long textual contexts establishes a standardized evaluation protocol. Key findings highlight perplexity as a reliable performance metric, the underperformance of approximate attention methods, and the effectiveness of exact fine-tuning methods within their extension range.
4. LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data
This paper introduces semantic operators, a declarative programming interface that extends the relational model with composable AI-based operations for semantic queries over datasets. Each operator can be implemented and optimized in multiple ways, opening a space for execution plans similar to relational operators.
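To illustrate the semantic-operator idea, here is a hypothetical filter operator in Python. LOTUS’s real API differs; the `sem_filter` name, the prompt template, and the stubbed `toy_llm` are all our own stand-ins for an actual LLM call:

```python
def sem_filter(rows, template, llm):
    """Semantic filter: keep rows where the LLM affirms a templated
    natural-language predicate. `llm` is any callable mapping a prompt
    string to a short answer string (stubbed below for illustration)."""
    kept = []
    for row in rows:
        prompt = template.format(**row)
        if llm(prompt).strip().lower().startswith("yes"):
            kept.append(row)
    return kept

def toy_llm(prompt: str) -> str:
    # Stand-in "model" that answers by keyword matching, not a real LLM.
    return "yes" if "transformer" in prompt.lower() else "no"

papers = [
    {"title": "Attention Is All You Need"},
    {"title": "Transformer Circuits"},
]
relevant = sem_filter(papers, "Paper title: {title}. Is this relevant?", toy_llm)
```

Expressing the predicate declaratively, rather than hard-coding prompts per query, is what lets an engine choose among execution plans, for example batching rows or caching repeated judgments, much as a database optimizes relational operators.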
5. SciAgents: Automating Scientific Discovery Through Multi-Agent Intelligent Graph Reasoning
This paper presents SciAgents, an approach that leverages large-scale ontological knowledge graphs to organize and interconnect diverse scientific concepts. The framework autonomously generates and refines research hypotheses, elucidating underlying mechanisms, design principles, and unexpected material properties.
Quick Links
1. Bright Data offers web data solutions for AI and LLM developers. These solutions make gathering, managing, and integrating web data into AI models easier, streamlining the development process. Two standout offerings are its Dataset Marketplace and Web Scraper APIs, designed to make data collection more accessible and efficient.
2. California Governor Gavin Newsom vetoed the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act (SB 1047). In his veto message, Governor Newsom cited multiple factors in his decision, including the burden the bill would have placed on AI companies, California’s lead in the space, and a critique that the bill may be too broad.
3. Airtable launched an enterprise-grade AI platform. It includes App Library, which allows companies to create standardized AI-powered applications that can be customized across an organization, and HyperDB, which enables integration of massive datasets of over 100 million records.
Who’s Hiring in AI
Data Engineer (AWS, Snowflake, dbt) — R2843–6334 @Bcidaho (USA/Remote)
TECH Program Associate — Data Platforms @Spectrum (Madison, WI, USA)
Staff Engineer — AI/Machine Learning @LinkedIn (Sunnyvale, CA, USA)
2025 Summer Internship — MIS — Data Science Analyst — Virtual @Freeport McMoRan (Phoenix, AZ, USA)
Data Science and Analytics, Product @Anthropic (San Francisco, CA, USA)
Senior Machine Learning Engineer @webAI (USA/Remote)
Software Engineer @JPMorgan Chase (Columbus, IN, USA)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.