TAI #106: Gemma 2 and new LLM benchmarks
Open LLM Leaderboard v2, CharXiv, CriticGPT, ESM3 protein model, Meta LLM Compiler, and more.
What happened this week in AI by Louie
In AI model announcements and releases this week, we were particularly interested to read about “ESM3” (a new biology foundation model that has generated a new fluorescent protein; more details below) and Gemma 2 from Google. Gemma 2 is openly released and, as such, provides a window into what is happening behind the scenes inside the leading AI labs. The paper is particularly notable for its description of how Google uses knowledge distillation: a large language model acts as a teacher, and smaller models are trained to minimize the negative log-likelihood of their per-token probabilities against the teacher's probability distribution (i.e., learning from the teacher's soft targets rather than one-hot next-token labels). We expect similar techniques are used in leading closed models today.
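To make the distillation idea concrete, below is a minimal PyTorch sketch of a per-token distillation loss of the kind the Gemma 2 paper describes. It is an illustration rather than Google's training code; the tensor shapes, vocabulary size, and temperature parameter are all assumptions.

```python
# Minimal sketch of token-level knowledge distillation: the student is trained
# on the teacher's soft token distribution instead of one-hot next-token labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Negative log-likelihood of the student under the teacher's per-token probabilities.

    Both tensors have shape (batch, seq_len, vocab_size). Minimizing this is
    equivalent (up to the teacher's entropy) to minimizing the KL divergence
    between the teacher and student distributions at every position.
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Random tensors standing in for real model outputs (vocab size is arbitrary here).
student_logits = torch.randn(2, 16, 32_000, requires_grad=True)
teacher_logits = torch.randn(2, 16, 32_000)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # in real training, gradients would flow back into the student model
```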
With new closed and open-source LLMs taking the lead in particular categories almost every week, benchmark tests and evaluations are becoming all the more important for deciding which model is best for your use case. However, these benchmarks can overly emphasize memorization and are vulnerable to issues such as ambiguous benchmark questions and models being trained on test sets. As such, we have seen several developments in benchmarks recently, including the new, improved MMLU-Pro test. Two weeks ago, we saw the release of LiveBench, a new benchmark designed to deal with test-set contamination (including by releasing new questions monthly). The LMSYS Chatbot Arena, a crowdsourced open platform for LLM evals that uses human pairwise comparisons of outputs to test models on more real-world tasks and human preferences, has also gained popularity. We also recently saw the launch of the ARC Prize, whose ARC tasks are designed to test LLMs on reasoning problems where they can't benefit from memorization (and results are still very poor!). Continuing this trend, Hugging Face this week launched an updated Open LLM Leaderboard v2 (including MMLU-Pro!), and a new benchmark for real-world multimodal LLM tasks, “CharXiv,” was released.
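For a flavor of how pairwise preference votes like the Chatbot Arena's battles can be turned into a ranking, here is a toy online Elo update over made-up votes. It is only a sketch: the actual leaderboard fits a Bradley-Terry model over all battles, and the model names and results below are hypothetical.

```python
# Toy illustration of how pairwise "battles" turn into a model ranking.
# This is a simple online Elo update, not the Chatbot Arena's actual methodology.
from collections import defaultdict

def expected_score(r_a, r_b):
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32):
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

# Hypothetical human preference votes: (winner, loser)
battles = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]

ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000
for winner, loser in battles:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # best-rated model first
```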
CharXiv focuses on chart understanding in multimodal LLMs, addressing the limitations of existing datasets by incorporating natural, challenging, and diverse charts sourced from scientific papers. This benchmark features two types of questions: descriptive questions about basic chart elements and reasoning questions that require synthesizing complex visual information. Sonnet 3.5 leads with a 60.2 score on reasoning (humans at 80.5) and 84.3 on descriptive questions (humans at 92.1). Open-source models are still heavily lagging on multimodal capability, with Phi-3 Vision in the lead at just 31.6 on reasoning.
Why should you care?
Choosing the best LLM for your task can get very confusing, yet it is key to making the most of the technology and ensuring you get cost-effective results. You have to weigh your requirements for open-source vs. closed models (including flexibility for fine-tuning), privacy and safety considerations, multimodal capability, speed and price relative to capability trade-offs, context window sizes, free-tier options, context storage, and supporting infrastructure, among other factors! In addition, the best model on any of these metrics can change every week. Scores on leading benchmark tests can help compare models, but they are not without their flaws and can be over-focused on memorization and multiple-choice questions rather than more common real-world use cases. We think it is important for the AI community to continue working on improving benchmark tests and model evaluations, but LLMs are unlikely to be one size fits all. We keep up to speed on new models as they are released, and some of our current favorites for particular use cases are below. Reach out if you ever want to discuss the best model for your project!
— Louie Peters — Towards AI Co-founder and CEO
This issue is brought to you thanks to Ai4:
Ai4, the world’s largest gathering of artificial intelligence leaders in business, is coming to Las Vegas — August 12–14, 2024.
Join 4500+ attendees, 350+ speakers, and 150+ AI exhibitors from 75+ countries at the epicenter of AI innovation.
Don’t wait — prices increase on July 13th. Apply today for a complimentary pass, or register now for 35% off final prices.
New Free Credits for our GenAI360 course with the GenAI360 Scholarship
As many of you know, we launched GenAI360: Foundation Model Certification a year ago in collaboration with Activeloop and the Intel Disruptor Initiative. With nearly 40,000 course takers, thousands of GenAI360-certified professionals, and more prominent partners joining the ride, we're celebrating this milestone with an exciting opportunity for our learning community: GenAI360 Scholarships to complete the certifications, offered in collaboration with the Intel Disruptor Initiative and AWS. These scholarships will enable you to work on hands-on examples using Intel® Xeon® Scalable Processors in the AWS cloud environment. Read more and apply here.
Hottest News
1. Google Releases Gemma 2: A Powerful Family of LLMs 3x Smaller than Llama-3 70B
Google DeepMind launched Gemma 2, an open large language model available in 9 billion (9B) and 27 billion (27B) parameter versions. The 27B model is competitive with models more than twice its size while remaining cost-effective to deploy. Gemma 2's architecture combines interleaved sliding-window and global attention, logit soft-capping, and knowledge distillation during training (a short sketch of soft-capping follows the news items below).
2. ESM3: Simulating 500 million years of evolution with a language model
EvolutionaryScale came out of stealth to reveal ESM3, the latest in a recent wave of biology foundation models. ESM3 simulates 500 million years of protein evolution, creating a new green fluorescent protein (esmGFP) with only 58% similarity to the closest known variant, demonstrating AI’s potential to program biology from first principles. This generative model can reason over protein sequence, structure, and function.
3. Amazon Hires Top Executives From AI Startup Adept for AGI Team
Amazon has hired several top executives and employees from the AI startup Adept, including the company’s co-founder and former CEO, David Luan. This move is part of Amazon’s efforts to bolster its development of artificial general intelligence (AGI).
4. OpenAI Launches CriticGPT to Spot Errors and Bugs in AI-Generated Code
OpenAI has introduced CriticGPT, a new AI model that can help identify mistakes in code generated by ChatGPT. The tool is intended to improve the alignment process in AI systems through RLHF. Built on GPT-4, CriticGPT helps human AI reviewers check the code generated by ChatGPT.
5. Andrew Ng Plans to Raise $120M for Next AI Fund
Andrew Ng plans to raise $120 million for the second fund of his AI startup incubator, the AI Fund. He launched the AI Fund in 2018 with $175 million to fund small teams working on AI solutions. The new $120 million AI Venture Fund II will continue this mission.
6. OpenAI Announces Strategic Content Partnership with TIME
TIME and OpenAI have announced a multi-year strategic partnership to integrate TIME’s extensive archive of trusted journalism into OpenAI products, including ChatGPT. This collaboration will enhance OpenAI’s ability to provide users with accurate information and proper citations, directly linking to Time.com.
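As a quick follow-up on Gemma 2 (item 1 above): logit soft-capping simply squashes logits through a tanh so they can never exceed a fixed cap, which helps training stability. A minimal sketch follows; the cap values of 30 (final logits) and 50 (attention logits) come from the released model configuration, and everything else here is illustrative.

```python
# Sketch of the logit soft-capping used in Gemma 2: logits are squashed
# through tanh so they stay within a fixed bound.
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Values stay roughly linear near zero and saturate smoothly at +/- cap.
    return cap * torch.tanh(logits / cap)

x = torch.tensor([-100.0, -10.0, 0.0, 10.0, 100.0])
print(soft_cap(x, cap=30.0))  # bounded to (-30, 30); Gemma 2 uses 30 for final logits, 50 for attention logits
```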
Five 5-minute reads/videos to keep you learning
1. Have you ever wondered how we can determine which LLM is superior? This article dives into how we can accurately quantify and evaluate the performance of LLMs, covers the current methodologies used for this, and discusses why this process is vital.
2. Fine-tuning Mistral on Your Dataset
This tutorial walks you through fine-tuning the Mistral-7B-Instruct model on your own dataset using the Hugging Face Transformers and PEFT libraries (a minimal LoRA sketch follows this list).
3. How I Built My Own Custom 8-Bit Quantizer From Scratch: A Step-by-Step Guide Using PyTorch
Are you curious how popular quantizers such as BitsAndBytes, AWQ, and GGUF work under the hood? This post provides a step-by-step approach to building a custom 8-bit quantizer from scratch using PyTorch and quantizing facebook/opt-350m (a bare-bones version of the idea is sketched after this list).
4. Bridging the Implementation Gap of Artificial Intelligence in Healthcare
This article discusses why there is an implementation gap in AI in healthcare and proposes a plan to bridge the gap between research and practice.
5. seemore: Implement a Vision Language Model from Scratch
This blog implements a vision language model consisting of an image encoder, a multimodal projection module, and a decoder language model in pure PyTorch. It is a simplified version of the vision capabilities you see in models like GPT-4 or Claude 3 (a toy projection-module sketch follows this list).
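For the Mistral fine-tuning tutorial (item 2 above), here is a minimal LoRA sketch using Transformers and PEFT. It is not the tutorial's exact code: the Hub ID and hyperparameters are assumptions, and the dataset and Trainer setup are omitted.

```python
# Minimal LoRA fine-tuning sketch with Transformers + PEFT.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Attach small trainable low-rank adapters; the base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights

# From here, pass `model` to a transformers Trainer (or trl's SFTTrainer)
# together with your tokenized dataset to run the actual fine-tuning.
```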
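For the 8-bit quantizer guide (item 3 above), this is a bare-bones sketch of symmetric, per-tensor int8 quantization, the simplest form of the idea; the post itself goes further and quantizes facebook/opt-350m layer by layer.

```python
# From-scratch sketch of symmetric 8-bit weight quantization in PyTorch.
import torch

def quantize_int8(weights: torch.Tensor):
    # Map the float range [-max|w|, +max|w|] onto the int8 range [-127, 127].
    scale = weights.abs().max() / 127.0
    q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", (w - w_hat).abs().max().item())  # small round-trip error
```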
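And for the seemore walkthrough (item 5 above), the core trick is a projection module that maps image-encoder features into the language model's embedding space so they can be concatenated with text token embeddings. Below is a toy sketch with made-up dimensions, not the blog's actual implementation.

```python
# Toy multimodal projection: image patch embeddings -> LM embedding space.
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    def __init__(self, vision_dim: int = 768, lm_dim: int = 2048):
        super().__init__()
        # A small MLP mapping vision features to LM token embeddings.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.proj(patch_embeddings)

image_features = torch.randn(1, 196, 768)   # stand-in for image-encoder output
text_embeddings = torch.randn(1, 32, 2048)  # stand-in for embedded text tokens
visual_tokens = VisionLanguageProjector()(image_features)
lm_input = torch.cat([visual_tokens, text_embeddings], dim=1)  # fed to the decoder
```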
Repositories & Tools
1. R2R is a RAG engine with a RESTful API and production-ready features.
2. Semantic Kernel is an SDK that integrates LLMs like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C#, Python, and Java.
3. Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
4. LangGPT aims to facilitate the creation of high-quality ChatGPT prompts for everyone by utilizing a structured, template-based methodology.
5. LevelDB is a key-value storage library written at Google that provides an ordered mapping from string keys to string values.
Top Papers of The Week!
1. LLM Critics Help Catch LLM Bugs
This work trains “critic” models that help humans to more accurately evaluate model-written code and overcome the limitations of RLHF. These critics are LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks.
2. Transcendence: Generative Models Can Outperform The Experts That Train Them
The study explores whether a model can outperform the experts that produced its training data by developing “ChessFormer,” a transformer model trained on chess game transcripts. It uses low-temperature sampling to effectively ensemble predictions from diverse, weak data sources, achieving performance beyond that of any individual source (a short temperature snippet follows this list).
3. A Roadmap to Pluralistic Alignment
This work proposes a roadmap to pluralistic alignment, specifically using language models as a test bed. It identifies and formalizes three possible ways to define and operationalize pluralism in AI systems: Overton pluralistic models, Steerably pluralistic models, and Distributionally pluralistic models.
4. Scalable MatMul-free Language Modeling
Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This work shows that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. It investigates the scaling laws and finds that the performance gap between the MatMul-free models and full-precision Transformers narrows as model size increases (a toy illustration of the idea follows this list).
5. This essay argues that no matter how AI systems develop, if lawmakers do not address the dynamics of dangerous extraction, harmful normalization, and adversarial self-dealing, then AI systems will likely be used to do more harm than good.
6. Block Transformer: Global-to-Local Language Modeling for Fast Inference
This global-to-local approach massively reduces KV-cache IO overhead from quadratic to linear with respect to context length, yielding a 20x improvement in inference throughput for the Block Transformer compared to vanilla transformers with equivalent perplexity and addressing a key challenge in scaling to very long contexts.
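On the low-temperature sampling in the Transcendence paper (item 2 above), this small snippet shows how lowering the temperature concentrates probability mass on the highest-scoring action, which is the mechanism the authors argue lets a model exceed the average strength of its training data. The logits here are made up.

```python
# Temperature scaling: lower temperature sharpens the sampling distribution.
import torch

logits = torch.tensor([2.0, 1.5, 0.2, -1.0])  # hypothetical scores for four chess moves

for temperature in (1.0, 0.5, 0.1):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")
# As T approaches 0, sampling approaches always playing the argmax move,
# effectively taking a "majority vote" over the mixed-quality experts in the data.
```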
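And on the MatMul-free paper (item 4 above), the core idea is that once weights are constrained to {-1, 0, +1}, a matrix multiply reduces to additions and subtractions. Below is a toy illustration, not the paper's BitLinear-style implementation; the thresholding rule is a simplification.

```python
# Toy ternary-weight layer: with weights in {-1, 0, +1}, each output is just
# sums and differences of inputs, so no multiplications are strictly needed.
import torch

def ternary_quantize(w: torch.Tensor) -> torch.Tensor:
    # Crude ternarization: zero out small weights, keep the sign of the rest.
    threshold = 0.5 * w.abs().mean()
    return torch.sign(w) * (w.abs() > threshold)

def ternary_linear(x: torch.Tensor, w_ternary: torch.Tensor) -> torch.Tensor:
    # Equivalent to additions/subtractions of input elements; we emulate it
    # with a matmul here, but hardware can exploit the structure directly.
    return x @ w_ternary.t()

x = torch.randn(2, 8)
w = torch.randn(4, 8)
out = ternary_linear(x, ternary_quantize(w))
print(out.shape)  # (2, 4)
```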
Quick Links
1. Meta has unveiled the Meta LLM Compiler, a suite of robust, open-source models designed to optimize code and revolutionize compiler design. Meta has released the LLM Compiler under a permissive commercial license, allowing both academic researchers and industry practitioners to build upon and adapt the technology.
2. According to OpenAI's Chief Technology Officer (CTO), Mira Murati, GPT-5 is expected to achieve Ph.D.-level intelligence on specific tasks by late 2025 or early 2026. This would be a big jump from its predecessor, GPT-4, launched last year.
3. Figma is announcing new AI features at its Config conference, including a major UI redesign, new generative AI tools to help people more easily make projects, and built-in slideshow functionality.
4. Following the release of Sonnet 3.5, Anthropic CEO Dario Amodei spoke with Time magazine and discussed his thoughts on LLM scaling laws: “We don’t see any evidence that things are leveling off. The reality of the world we live in is that it could stop at any time… I think if [the effects of scaling] did stop, in some ways, that would be good for the world. It would restrain everyone at the same time. But it’s not something we get to choose.”
Who’s Hiring in AI!
Gen AI Engineer @Clarity AI (Madrid, Spain)
Machine Learning/DevOps Consultant @Sia Partners (Amsterdam, Netherlands)
Software Engineer, Applied ML @Isomorphic Labs (London, United Kingdom)
Research Scientist @Waabi (US & Canada/Remote)
Data Analyst, Intern @Dun & Bradstreet (Mumbai, India)
Big Data Developer @Capco (Bengaluru, India)
Machine Learning Intern — Future Product Innovation @Toyota Research Institute (Los Altos, CA)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
If you are preparing your next machine learning interview, don’t hesitate to check out our leading interview preparation website, confetti!