TAI 130: DeepMind Responds to OpenAI With Gemini Flash 2.0 and Veo 2
Also, Cohere Command R7B, Phi-4, Imagen 3, Grok-2 update, OpenAI Canvas & Projects and more!
What happened this week in AI by Louie
AI model releases remained very busy in the run-up to Christmas, with DeepMind taking center stage this week with a very strong Gemini 2.0 Flash release and its Veo 2 video model. The Flash 2.0 model illustrates the progress made in inference efficiency and model distillation over the past year, together with Gemini's progress in competing at the top of the leaderboards. For example, Flash 2.0's MMMU image understanding score of 70.7% compares to the 59.4% achieved by the far larger and more expensive Gemini 1.0 Ultra almost exactly one year earlier. We also saw a strong update to Grok-2 this week, together with free access to Grok for all users on x.com. Microsoft also delivered an impressive update with Phi-4, its model family focused on pushing synthetic data generation to its limits. The 14B-parameter Phi-4 model achieved an MMLU-Pro score of 70.4 vs. Phi-3 14B at 51.3, and it even beat the recently upgraded Llama 3.3 70B model at 64.4. OpenAI also continued its 12 days of announcements with a focus on ChatGPT, including features such as Canvas, Projects, video input in Advanced Voice Mode, and iPhone integration.
Gemini 2.0 Flash Experimental is an updated multimodal model designed for agentic applications, capable of processing and generating text, images, and audio natively. In benchmark comparisons, it shows strong progress over its predecessors. For example, on the MMLU-Pro test of general understanding, Gemini 2.0 Flash Experimental achieves a score of 76.4%, a slight improvement over Gemini 1.5 Pro’s 75.8% (despite being a smaller and faster model) and a substantial gain compared to Gemini 1.5 Flash’s 67.3%. Similarly, on the MMMU image understanding test, Gemini 2.0 Flash Experimental reaches 70.7%, surpassing Gemini 1.5 Pro’s 65.9% and Gemini 1.5 Flash’s 62.3%.
Gemini 2.0 Flash Experimental supports a range of input and output modalities, offers structured outputs, and integrates tool use, including code execution and search. It can handle large inputs (up to 1 million tokens) and produce outputs of up to 8,192 tokens while maintaining high request throughput. The model's native tool use and code execution features are intended to enhance reliability and adaptiveness, though early feedback shows some inconsistencies in accuracy and voice naturalness. Google also released a new Multimodal Live API with real-time audio and video streaming input.
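For readers who want to try the model, here is a minimal sketch using the google-generativeai Python SDK. The gemini-2.0-flash-exp model ID matches the experimental preview, while the prompt and the use of the built-in code-execution tool are illustrative:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")

# Experimental preview model ID, with the built-in code-execution tool enabled.
model = genai.GenerativeModel("gemini-2.0-flash-exp", tools="code_execution")

# The model can write and run Python to answer computational questions.
response = model.generate_content("What is the sum of the first 50 prime numbers?")
print(response.text)
```

The Multimodal Live API uses a separate WebSocket-based streaming interface, not shown here.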
In a busy week at Google DeepMind, the company also announced Deep Research (a tool for researching complex topics within Gemini Advanced), Veo 2 (a text-to-video model), and Imagen 3 (a text-to-image model). Veo 2 is a video generation model capable of producing realistic motion and high-quality outputs, including 4K-resolution video with reduced artifacts. It accurately interprets and follows both simple and complex textual instructions, simulating real-world physics across a variety of visual styles. Veo 2 supports a range of camera-control options and maintains fidelity across diverse scenes and shot types, enhancing both realism and dynamic motion representation. In human evaluations on the MovieGenBench dataset, Veo 2 outperformed other top models in both overall preference and prompt-following capability.
Why should you care?
As the first release from the Gemini 2.0 family, Flash 2.0 may be our first glimpse of the next generation of LLMs trained on larger compute clusters (TPUs in this case) and compute budgets. The model likely benefits from distillation from larger models in the 2.0 family and shows the huge progress made on inference costs this year. It aligns with a strategy focused on agentic experiences and interoperability with various inputs and tools. Google noted how it fits into agentic research prototypes like Project Astra, which examines video-input AI assistants on mobile and potential wearable devices, and Project Mariner, which explores browser-based agents. The strong capability now possible in low-latency, low-cost smaller-tier models is particularly valuable for these agentic applications, where many tokens may be used in long chains of prompts and where real-time responses can be key. Low costs are also important for reasoning models that scale inference-time compute; this is now the key area where Gemini still lags behind OpenAI, and we expect to hear more from Gemini here in the future.
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. Google Launched Gemini 2.0, Its New AI Model for Practically Everything
Google released Gemini 2.0 Flash, a multilingual and multimodal AI model capable of real-time conversation and image analysis. In addition to advances in multimodality, such as native image and audio output, it offers native tool use, enabling developers to build new AI agents.
2. OpenAI Brings Video to ChatGPT Advanced Voice Mode
OpenAI’s ChatGPT Advanced Voice Mode, previously audio-only, now supports video and screen-sharing features, enabling users to interact visually through a phone camera. The update demonstrates ChatGPT’s ability to identify objects and guide tasks, and it is currently available to ChatGPT Plus and Pro users.
3. Microsoft Launches Phi-4, a New Generative AI Model, in Research Preview
Microsoft introduced Phi-4, a 14B-parameter small language model (SLM) that excels at complex reasoning in areas such as math, in addition to conventional language processing. It surpasses much larger models, excelling in mathematics and outperforming GPT-4 on science and tech queries. Soon to be available on Hugging Face, Phi-4 scored 91.8% on AMC math competition tests, leading all compared models, though it still shows practical limitations despite the strong benchmarks.
4. Apple Releases Apple Intelligence and ChatGPT Integration in Siri
Apple’s iOS 18.2 update brings a whole host of Apple Intelligence features to iPhones, iPads, and Macs, including ChatGPT integration with Siri, Genmoji, Image Playground, and Visual Intelligence. It also adds language support for regions such as the UK and Australia, officially launching Apple’s AI in those countries.
5. Cohere AI Releases Command R7B
Command R7B is the smallest, fastest, and final model in the R Series. It is a versatile tool that supports a range of NLP tasks, including text summarization and semantic search. Its efficient architecture enables enterprises to integrate advanced language processing without the resource demands typically associated with larger models.
6. Google Unveiled Willow, a Quantum Computing Chip
Google announced Willow, a new quantum chip that outperformed even the world’s best supercomputer on an advanced test. The new chip can complete a complex computation in five minutes that would take the most powerful supercomputer 10 septillion years — more than the estimated age of the universe. Google researchers were also able to prove for the first time that the chip’s errors did not increase proportionately as the number of qubits rose.
7. OpenAI Launches ChatGPT Projects, Letting You Organize Files, Chats in Groups
OpenAI is rolling out a feature called “Projects” to ChatGPT. It’s a folder system that makes it easier to organize things you’re working on while using the AI chatbot. Projects keep chats, files, and custom instructions in one place.
8. Grok Is Now Free for All X Users
Grok is now available to free users on X. Several users noticed the change on Friday; it gives non-premium subscribers the ability to send up to 10 messages to Grok every two hours. TechCrunch reported last month that Musk’s xAI started testing a free version of Grok in certain regions. Making Grok more widely available might help it compete with already-free chatbots such as OpenAI’s ChatGPT, Google Gemini, Microsoft Copilot, and Anthropic’s Claude.
9. OpenAI Released the First Version of Sora
OpenAI released Sora, its text-to-video AI, as a standalone product at Sora.com for ChatGPT Plus and Pro users. Sora enables users to create 1080p videos up to 20 seconds long, with features including video remixing and storyboards. Generated videos carry watermarks.
Five 5-minute reads/videos to keep you learning
1. The Epic History of Large Language Models (LLMs)
This article breaks the evolution of language model architectures into five stages: the traditional encoder-decoder architecture, the addition of attention mechanisms to the encoder-decoder, the transformer architecture, the arrival of transfer-learning techniques in NLP, and finally, large language models (like ChatGPT).
2. Building Multimodal RAG Application #5: Multimodal Retrieval From Vector Stores
This article dives into the essentials of setting up multimodal retrieval using vector stores. It covers installing and configuring the LanceDB vector database, demonstrates how to ingest both text and image data into LanceDB using LangChain, and concludes with a practical walkthrough of performing multimodal retrieval, enabling efficient searches across both text and image data.
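As a rough sketch of the pattern the article walks through (the embedding model, file names, and schema below are illustrative, not the article's exact code), multimodal ingestion and retrieval with LanceDB can look like this:

```python
import lancedb
from sentence_transformers import SentenceTransformer
from PIL import Image

# CLIP maps text and images into a shared embedding space,
# so one index can serve both modalities.
encoder = SentenceTransformer("clip-ViT-B-32")

db = lancedb.connect("./multimodal_db")
rows = [
    {"vector": encoder.encode("a diagram of a transformer").tolist(), "source": "caption.txt"},
    {"vector": encoder.encode(Image.open("figure1.png")).tolist(), "source": "figure1.png"},
]
table = db.create_table("docs", data=rows)

# A text query retrieves nearest neighbors across both text and image entries.
query = encoder.encode("attention mechanism illustration").tolist()
print(table.search(query).limit(3).to_list())
```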
3. How To Build a Truly Useful AI Product
The traditional laws of “startup physics”, like solving the biggest pain points first or the cost of supporting users falling at scale, don’t fully apply when building AI products. If your intuitions were trained on regular startup physics, you’ll need to develop some new ones for AI. This article shares four principles for building AI products that every app-layer founder needs to know.
4. Run Gemini Using the OpenAI API
Google confirmed that its Gemini large language models are now mostly compatible with the OpenAI API framework. There are some limitations with features such as structured outputs and image uploading, but chat completions, function calling, streaming, regular question/response, and embeddings work just fine. This article provides Python code examples to show how it works.
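The pattern is simply the standard OpenAI client pointed at Google's OpenAI-compatible endpoint; a minimal sketch (the model choice and prompt are illustrative):

```python
from openai import OpenAI

# Use a Gemini API key (not an OpenAI key) against Google's compatible endpoint.
client = OpenAI(
    api_key="YOUR_GEMINI_API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

response = client.chat.completions.create(
    model="gemini-1.5-flash",
    messages=[{"role": "user", "content": "Explain attention in one sentence."}],
)
print(response.choices[0].message.content)
```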
5. AI Tooling for Software Engineers in 2024: Reality Check (Part 1)
A survey asked software engineers and engineering managers about their hands-on experience with AI tooling. This article provides an overview of the survey, popular software engineering AI tools, AI-assisted software engineering workflows, what’s changed since last year, and more.
Repositories & Tools
1. MarkItDown is a Python tool for converting files and office documents to Markdown (see the usage sketch after this list).
2. HunyuanVideo is a systematic framework for a large video generation model.
3. DeepSeek-VL2 is an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models.
4. TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.
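As promised above, a minimal MarkItDown usage sketch (the input file name is illustrative):

```python
from markitdown import MarkItDown

md = MarkItDown()
# Convert an Office document to Markdown text in one call.
result = md.convert("quarterly_report.docx")
print(result.text_content)
```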
Top Papers of The Week
1. Phi-4 Technical Report
This is the technical report for phi-4, a 14-billion-parameter language model. By strategically integrating synthetic data during training, it excels in STEM-focused QA capabilities. Despite retaining the phi-3 architecture, it outperforms its predecessors due to enhanced data quality, a refined training curriculum, and advanced post-training innovations. It even surpasses its teacher model, GPT-4o, particularly on reasoning-focused benchmarks.
2. ReFT: Representation Finetuning for Language Models
This research develops a family of Representation Finetuning (ReFT) methods. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. The research also defines a strong instance of the family, Low-rank Linear Subspace ReFT (LoReFT). These methods are drop-in replacements for existing PEFTs and learn interventions that are 15x to 65x more parameter-efficient than LoRA.
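To make the idea concrete, here is a minimal PyTorch sketch of the LoReFT intervention phi(h) = h + R^T (W h + b - R h) from the paper; the dimensions and hook placement are illustrative, and the authors' pyreft library is the reference implementation:

```python
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    """LoReFT edit phi(h) = h + R^T (W h + b - R h), applied to frozen hidden states."""

    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        # R (rank x hidden_dim) defines the low-rank subspace; the paper keeps its rows orthonormal.
        R = torch.empty(rank, hidden_dim)
        nn.init.orthogonal_(R)
        self.R = nn.Parameter(R)
        # W h + b is the learned target projection inside the subspace.
        self.proj = nn.Linear(hidden_dim, rank)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Move h toward the target only along directions spanned by R's rows.
        delta = self.proj(h) - h @ self.R.T   # (batch, seq, rank)
        return h + delta @ self.R             # back to (batch, seq, hidden_dim)

# Usage sketch: hook this onto selected layers/positions of a frozen LM and
# train only the intervention parameters (hypothetical dimensions shown).
intervention = LoReFTIntervention(hidden_dim=768, rank=4)
h = torch.randn(2, 16, 768)                   # stand-in hidden states
print(intervention(h).shape)                  # torch.Size([2, 16, 768])
```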
3. Training Large Language Models To Reason in a Continuous Latent Space
This paper introduces Coconut, a novel reasoning paradigm for LLMs that operates in a continuous latent space. Coconut enhances reasoning by utilizing the last hidden state as a continuous thought, enabling advanced reasoning patterns like breadth-first search. It outperforms traditional chain-of-thought approaches in logical tasks with substantial backtracking, demonstrating the promise of latent reasoning.
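The core trick is easy to sketch: instead of sampling a token at each reasoning step, the model's last hidden state is appended directly to the input embeddings as a "continuous thought." Below is a minimal, illustrative loop with Hugging Face transformers; the base model and step count are placeholders, and this omits Coconut's training procedure, so a vanilla model will not actually reason this way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model; Coconut is trained for this regime
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: If Alice has 3 apples and buys 2 more, how many does she have?"
embeds = model.get_input_embeddings()(tok(prompt, return_tensors="pt").input_ids)

# Latent reasoning loop: feed the last hidden state back as the next input
# embedding (a continuous thought) instead of decoding a token.
with torch.no_grad():
    for _ in range(4):  # number of latent steps is illustrative
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]  # (1, 1, hidden_dim)
        embeds = torch.cat([embeds, thought], dim=1)

# After the latent steps, the model can switch back to normal token decoding.
```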
4. GenEx: Generating an Explorable World
This paper introduces GenEx, a system for 3D world exploration that uses generative imagination to create high-quality, 360-degree environments from minimal inputs such as a single RGB image. By simulating outcomes and refining its beliefs, GenEx enables AI agents to perform complex tasks with predictive expectations, advancing embodied AI in imaginative spaces with real-world applications.
5. FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness
This paper proposes a diagrammatic approach to optimizing deep learning algorithms for IO-awareness, deriving methods like FlashAttention that achieve up to sixfold performance improvements. By efficiently managing data transfers and harnessing GPU features, the method generates pseudocode for Ampere and Hopper architectures. It improves energy efficiency and performance by reducing the GPU energy cost of transfer bandwidth, which currently accounts for around 46% of GPU energy use.
Quick Links
1. Harvard and Google are to release 1 million public-domain books as an AI training dataset. The collection spans genres, languages, and authors, including Dickens, Dante, and Shakespeare, whose works are no longer copyright-protected due to their age.
2. Meta is releasing an AI model called Meta Motivo, which could control the movements of a human-like digital agent, potentially enhancing the Metaverse experience. The company said that Meta Motivo addresses body control problems commonly seen in digital avatars, enabling them to perform more realistic and human-like movements.
3. Pika Labs has launched Pika 2.0, an upgraded AI video model and a new step toward creative AI video production. This release combines crisp text alignment with the newly introduced Scene Ingredients feature in the Pika Labs web application. Compared to earlier versions, it adds deeper flexibility and sharper detail.
Who’s Hiring in AI
Machine Learning & Computer Vision Engineer @Corning Incorporated (Remote)
Research Instructor @University of Colorado (Hybrid/Colorado, USA)
Artificial Intelligence Engineer @Fortive Corporation (Hybrid/Bengaluru, India)
Sr. AI Linguist @LinkedIn (Hybrid/Mountain View, CA, USA)
Lead AI Engineer @Capital One Services, LLC (Multiple US Locations)
Senior Generative AI Data Scientist, Amazon SageMaker @Amazon (Seattle, WA, USA)
Machine Learning Research Engineer Intern @Texas Instruments (Dallas, TX, USA)
Software Engineer, Generative AI Engineering (Internship) @Woven by Toyota (Tokyo, Japan)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.