#113: Sakana's AI Scientist - Are LLM Agents Ready To Assist AI Research?
Also, xAI's Grok-2, Claude context caching, Imagen 3, Tree Attention, and more!
What happened this week in AI by Louie
This week, xAI joined the growing crowd of broadly GPT-4 class models, which now includes models from OpenAI, Anthropic, DeepMind, xAI, Meta, Mistral, and DeepSeek (but only the first four have multimodal capabilities). Anthropic also launched a context caching option that cuts the cost of reused input tokens by up to 10x. We recently flagged that context caching opens up many new opportunities, including for complex LLM agent pipelines. On this note, Sakana AI this week introduced “The AI Scientist,” an LLM agent for assisting machine learning research.
Sakana’s agent begins by brainstorming new ideas using an initial topic and codebase (provided by a human researcher) and performs a literature search to review its ideas for novelty. It then plans and executes code-based experiments and gathers and visualizes data before writing a full research paper. It also includes an automated LLM peer review process that evaluates these papers. We think Sakana’s agent includes a strong feedback loop that can drive continuous improvement. In particular, its “peer reviewer” agent can be used to filter and label good and bad examples of ideas, experiments, and papers, and the agent can learn from both in the future.
Currently, this agent has many shortcomings, and the papers it produces are not of great quality. Sakana measures the average cost of these papers at under $15. Given that plausible-looking papers can be created at such a low cost, the agent could even pose a risk to research integrity, with journals and peer reviewers’ inboxes flooded with hard-to-identify, low-quality AI submissions from people using these agents irresponsibly. However, the results are still impressive, and I see many obvious next steps to improve the agent, e.g., multimodal capabilities, giving relevant papers to the model via long context, RAG, or fine-tuning, and scaling up the inference budget for parts of the pipeline.
Why should you care?
I think Sakana’s implementation is impressive and ties into the power of “inference-time scaling laws” we discussed in recent weeks. Many people criticize the “scale is all you need” hypothesis of LLMs’ march to AGI, but in reality, very few people believe in this on its own, and many different avenues are being pursued for progressing LLM capabilities. We can achieve new capabilities via agent pipelines or research breakthroughs without larger training budgets. In fact, one of the key benefits of the training compute vs. capability scaling laws for LLMs is that even risking very small compute budgets on small-scale (and maybe LLM-agent-managed) experiments can potentially produce insights that can be scaled up 5+ orders of magnitude and integrated into SOTA models.
Sakana’s agent does, however, touch on a sensitive subject: many people are resistant to the rush to hand over human work to AI and are also very skeptical that we are remotely close to LLMs helping in actual scientific research. In this case, however, we still see Sakana’s agent as primarily a human amplifier to aid in incremental research, which will work best with an experienced AI scientist proposing interesting ideas and codebases that they think are a promising research direction. As with any GenAI tool, many people are likely to be lazy and use these agents irresponsibly. However, I can imagine many ways to use an AI scientist agent effectively and diligently. For example: 1) giving it an interesting source idea/theme and codebase to experiment on, and 2) using it to generate 100 ideas, running experiments on its self-selected “most interesting ideas,” generating papers for all of these, and ranking the final results. The human researchers can then review the top-ranked papers, do lots of work on improving and iterating on any interesting experimental results, and perhaps eventually get to something worth publishing in a fraction of the time it would have taken from scratch.
In addition to the scaling laws, there are other things that make ML research particularly well suited to LLM research agent assistants: 1) the high availability of open source code and papers, 2) purely cloud-based experiments, 3) the agent’s ML engineers can understand both the agent and the papers it produces to judge quality. Sakana is a respected AI research lab, and it wouldn’t surprise me if other leading AI labs like OpenAI and DeepMind were working on similar technologies in-house. It remains to be seen, however, if any of these agents can really be used to aid scientists in truly novel research.
— Louie Peters — Towards AI Co-founder and CEO
Since the release of “Building LLMs for Production”, many of you have asked us: “How do we make sure the book is not outdated within months?”
These comments are justified. We get it — AI is moving fast; there will be new and better models, better libraries, different tools, etc. But here’s our take:
The book teaches many timeless principles and techniques, such as transformer architecture, prompting, deployment, and more.
GPT-5 will still hallucinate. Hallucinations will stay as long as we don’t reach consciousness. RAG and fine-tuning will remain, even though they will get better and better.
The basics of LLMs are worth learning. Just like learning about the perceptron was (and is still) worthwhile. While the code will change, the idea and structure will stay quite similar.
We also share a lot of additional up-to-date content/code notebooks/resources on our webpage for the book: towardsai.net/book.
We’re already working on the second edition. And your thoughts, your insights, your real experiences with the book — they’re what will make the next version even better. If you’ve got a minute to drop a review, we’d love to hear what’s working and what we can do better.
Grab your copy, dive in, and share your thoughts!
Our friends in AI are hiring:
CTO and Co-founder at stealth AI company for finance. Towards AI are working on a really exciting startup project in the financial services industry, launching a predictive intelligence assistant that will operate at the intersection of LLMs and data science. The project team has a truly impressive track record in the financial services and consulting industries; the founder has been a senior partner with two of the world’s top consulting firms, working with many of the world’s largest financial services firms over a 30-year career. We are now looking for a CTO to join the team full-time as a co-founder. The right individual will have a strong technical background in AI as well as a track record of commercial product development, although not necessarily in financial services. As CTO, you will drive product design, development, and innovation. Just as importantly, you will be a magnet for engineering talent and play a key role in engaging with investors, clients, and strategic partners. If you are looking for a new intellectual and entrepreneurial challenge, working with a fantastic team, please get in touch with us today at louie@towardsai.net!
Our friends at @Mira (Remote) are also hiring a Senior AI Engineer to help build their decentralized AI infrastructure platform.
Hottest News
xAI has launched Grok-2 Beta, featuring the Grok-2 and Grok-2 mini models, now available to users on 𝕏. Grok-2 demonstrates significant improvements over its predecessor, Grok-1.5, and joins the growing group of GPT-4 class text models and the smaller group of GPT-4V class multimodal models. Grok-2 scores 75.5% on MMLU-Pro, up from Grok-1.5’s 51.0%, and even outperforms GPT-4o, which scores 72.6%. On the MMMU benchmark, Grok-2 achieves 66.1%, surpassing Grok-1.5’s 53.6% but trailing GPT-4o’s 69.1%. Both models will soon be available through an enterprise API, offering enhanced security and low-latency access globally.
2. Anthropic Introduced Prompt Caching
Prompt caching, which enables developers to cache frequently used context between API calls, is now available on the Anthropic API. Prompt caching reduces costs by up to 90% and latency by up to 85% for long prompts. It is currently available in public beta for Claude 3.5 Sonnet and Claude 3 Haiku.
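To make the savings concrete, here is a rough cost sketch for a long prompt reused across many calls. It assumes the multipliers from Anthropic's launch pricing (cache writes billed at 1.25x the base input price, cache reads at 0.1x), a hypothetical base price of $3 per million input tokens, and a cache that stays warm for every call. The function name and the example workload are illustrative only, so check current pricing before relying on the numbers.

```python
def prompt_cost(cached_tokens, fresh_tokens, calls,
                base_price_per_mtok=3.0,   # assumed base input price in $/1M tokens
                write_multiplier=1.25,     # assumed surcharge for writing the cache
                read_multiplier=0.10):     # assumed discount for reading the cache
    """Return (cost_without_cache, cost_with_cache) in dollars, input tokens only."""
    price = base_price_per_mtok / 1_000_000

    # Without caching, every call pays full price for the whole prompt.
    without_cache = calls * (cached_tokens + fresh_tokens) * price

    # With caching, the first call writes the shared prefix to the cache,
    # later calls read it back, and the fresh suffix is always full price.
    with_cache = (
        cached_tokens * price * write_multiplier
        + (calls - 1) * cached_tokens * price * read_multiplier
        + calls * fresh_tokens * price
    )
    return without_cache, with_cache


if __name__ == "__main__":
    # Hypothetical workload: 100k-token shared context, 200 new tokens per call, 50 calls.
    without, with_ = prompt_cost(100_000, 200, 50)
    print(f"without cache: ${without:.2f}, with cache: ${with_:.2f}")
    print(f"saving: {1 - with_ / without:.1%}")
```

The saving only approaches the headline 90% when the cached prefix dominates the prompt and is reused many times. With few calls, the one-off cache write surcharge eats into the benefit.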
3. Perplexity Answers 250 Million Questions a Month, Showing Growing Appetite for AI Search
AI search engine Perplexity saw a significant increase in usage last month, handling 250 million queries in a single month, compared with 500 million in all of 2023. While it still lags far behind Google, which handles about 8.5 billion queries a day, this trend indicates a user shift towards AI-driven search options.
After previewing it late last month, Runway ML has officially released Gen-3 Alpha Turbo, the latest version of the AI video generation model that it claims is seven times faster and half the cost of its predecessor, Gen-3 Alpha. Turbo is available for all plans, including a trial for free users. According to its Twitter (X) announcement, more improvements to the model, control mechanisms, and possibilities for real-time interactivity are to come.
5. OpenAI Introduced SWE-bench Verified
OpenAI released a human-verified subset of the SWE-bench benchmark to more reliably evaluate AI models’ ability to solve real-world software issues. They worked with 93 software developers experienced in Python to manually screen SWE-bench samples for quality and annotated 1,699 random samples from the SWE-bench test set to produce SWE-bench Verified.
Grok-2, xAI’s new chatbot released on Elon Musk’s platform X, caused some controversy due to its minimal restrictions on user requests. The chatbot currently integrates Black Forest Labs’ Flux model for image generation but is implemented with far fewer constraints than other providers. While some are concerned that this can risk digital safety and increase AI controversy and regulation, others think AI should be aligned to deliver what its users request and not be trained to circumvent their wishes with top-down rules from its creators.
7. MultiOn Introduced Agent Q, AI Agents With Planning & Self-Healing Capabilities
MultiOn has launched a new type of autonomous AI agent called Agent Q. It is a self-supervised agent reasoning and search framework that can autonomously improve in real environments through self-play and reinforcement learning. It combines technologies such as Monte Carlo Tree Search (MCTS), AI self-critique, and RLHF, enabling AI to engage in complex multi-step reasoning and decision-making in dynamic environments.
8. Google’s Upgraded AI Image Generator Is Now Available
Google has released the latest version of Imagen 3, its AI text-to-image generator, to US users. The tool, which you can access on Google’s AI Test Kitchen, is supposed to generate images with “better detail, richer lighting, and fewer distracting artifacts” compared to Google’s previous models.
Seven 5-minute reads/videos to keep you learning
1. How To Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model
This is a guide on refining the Llama-3.1 8B language model into a compact 4B version using NVIDIA’s structured compression techniques, including weight pruning and knowledge distillation. This approach yields a resource-efficient Llama-3.1-Minitron 4B that delivers high performance on benchmarks while cutting down on computational expenses.
DSPy is an open-source framework that facilitates the coordination of multiple LLM calls to tackle complex issues. It offers verifiable feedback to enhance practical solution deployment. The framework is currently improving reliability and user accessibility to strengthen its utility and continued development within the AI community. This article provides insight into how DSPy forces you to think about the problems with LLMs.
3. Review: ChatGPT’s New Advanced Voice Mode
ChatGPT’s new Advanced Voice Mode enhances speech understanding and production, outperforming predecessors and competitors like Siri and Alexa. In this article, the author reviewed the basics of Advanced Voice Mode and explored a few use cases that underscore the leap-forward nature of this technology.
PEFT is a method designed to fine-tune large models more efficiently by focusing on a subset of parameters. This blog looks under the hood of the PEFT library to better understand how things work and explores how to create a base model and use it to build a LoRA model.
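As a rough illustration of why training only a subset of parameters is cheaper, the sketch below counts trainable weights for a single linear layer under LoRA, where the frozen weight matrix is augmented with two small low-rank matrices. The function name and the layer sizes are illustrative assumptions and are not taken from the blog post or the PEFT library.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Compare full fine-tuning with LoRA for one d_out x d_in linear layer.

    LoRA keeps the original weight W frozen and learns an update B @ A,
    where A has shape (rank, d_in) and B has shape (d_out, rank).
    Returns (full_params, lora_params, lora_params / full_params).
    """
    full_params = d_in * d_out
    lora_params = rank * d_in + d_out * rank
    return full_params, lora_params, lora_params / full_params


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection with a rank-8 adapter.
    full, lora, ratio = lora_trainable_params(4096, 4096, 8)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.4%}")
```

For this hypothetical layer, the adapter trains well under 1% of the weights that full fine-tuning would update, which is the core trade-off PEFT methods exploit.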
5. Free Tools Every ML Beginner Should Use
This article highlights some of the essential tools that every beginner — or person willing to get started — with ML should use. It introduces tools such as Jupyter Notebook, Hugging Face and Transformers, Kaggle, and more.
6. A Crash Course of Model Calibration — Part 1
Many experiments have revealed that modern neural networks are often not well-calibrated. A model is perfectly calibrated if the predicted probabilities of outcomes align closely with the actual outcomes. This article explores how to make ML models reflect true probabilities in their predictions.
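One common way to quantify the gap between predicted probabilities and actual outcomes is expected calibration error (ECE). The article may use a different measure. The sketch below is a minimal binary-classification variant that bins predictions by predicted probability and compares each bin's mean prediction with its observed positive rate. The function name and bin count are illustrative choices.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned calibration error for binary predictions.

    probs  : predicted probabilities of the positive class, each in [0, 1]
    labels : true labels, each 0 or 1
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        # Weight each bin's gap by the share of samples it holds.
        ece += (len(bucket) / total) * abs(mean_pred - positive_rate)
    return ece


if __name__ == "__main__":
    # Predicts 0.8 and is right 8 times out of 10: well calibrated.
    print(expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2))
    # Predicts 0.9 but is right only half the time: overconfident.
    print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))
```

A score near zero means predicted probabilities match observed frequencies within each bin. Larger values indicate over- or under-confidence.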
7. Synthetic Data Solves AI’s Biggest Problem
This article discusses how synthetic data is a useful application of AI technology already delivering real, tangible value to customers. Unlike fake data, synthetic data supports data-driven business systems throughout their lifecycle, mainly where ongoing access to production data is impractical or ill-advised.
Repositories & Tools
1. Qwen2-Audio is the official repository for the Qwen2-Audio chat and pretrained large audio language models proposed by Alibaba Cloud.
2. Deep Live Cam allows real-time face swap and one-click video deepfake with only a single image.
3. LongWriter dataset contains 6,000 SFT examples with ultra-long outputs ranging from 2k to 32k words.
4. SWE Agent takes a GitHub issue and tries to automatically fix it using GPT-4 or your LM of choice.
5. Fabric is an open-source framework for augmenting humans using AI.
6. MiniCPM-V is a GPT-4V-level MLLM for single-image, multi-image, and video understanding on your phone.
7. Tinygrad is a deep learning framework that is like a blend of PyTorch and micrograd.
Top Papers of The Week
This is the official paper for Google’s Imagen 3, a latent diffusion model that generates high-quality images from text prompts. The paper discusses their quality and responsibility evaluations, issues around safety and representation, and methods used to minimize the potential harm of the models.
2. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Researchers from Sakana AI, Oxford, University of British Columbia, and several other institutions published a paper unveiling the AI Scientist, a pipeline for open-ended scientific research using LLMs.
3. Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
Microsoft Research published a paper introducing rStar, a self-play mutual reasoning approach that improves reasoning capabilities in small language models. rStar uses a generation-discrimination process to decouple the different steps in the reasoning process.
4. Causal Agent based on Large Language Model
This paper explores the difficulty of large language models in mastering causal reasoning and addresses the issue by introducing a Causal Agent. This agent, enhanced with causal reasoning techniques and memory components, shows proficiency in tackling various causal problems.
5. Tree Attention: Topology-Aware Decoding for Long-Context Attention on GPU Clusters
The paper presents a topology-aware decoding approach that improves long-context attention in transformer models on GPU clusters. It connects self-attention to energy-based models, leading to parallel GPU computation, significantly faster processing, reduced inter-GPU communication, and lower memory consumption.
6. Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
The paper reviews model merging strategies in machine learning, underscoring their cost-effectiveness and minimal resource usage. It introduces a new classification system for these techniques, detailing their use in language models, continual learning, and multi-task learning. It points out existing literature deficits, current obstacles, and potential areas for future study.
7. Med42-v2: A Suite of Clinical LLMs
This paper introduces Med42-v2, an advanced clinical large language model based on the Llama3 architecture. It is tailored for healthcare with specialized data and preference alignment and surpasses its predecessor and GPT-4 in medical query performance.
Quick Links
1. Nvidia will train 100,000 California residents on AI in a first-of-its-kind partnership. The program focuses on training students, educators, and workers, supporting job creation and promoting innovation, and using AI to solve challenges that can improve the lives of Californians.
2. Midjourney releases a new unified AI image editor on the web. It combines inpainting, outpainting/canvas extension, and more into a single view. The new web editor is now live and available to all users who have created at least ten images on the platform. Users can access this tool by visiting midjourney.com/imagine.
3. Lambda has partnered with Nous Research to launch Hermes 3, a new fine-tuned version of Meta’s open-source Llama 3.1 405-billion-parameter large language model (LLM). Hermes 3 offers an unlocked, uncensored, open-weights model designed to be highly steerable, enabling users to tailor the model’s responses to their individual needs.
Who’s Hiring in AI
Lead Research Engineer @Thomson Reuters Holdings Inc. (Eagan, MN, USA/Hybrid)
Machine Learning Engineer (C++ & CUDA) @Dedrone (Remote)
Director, AI Red Team — Remote @Optum (Plymouth, MN, USA/Remote)
Head of AI @DESIGNLIBRO INC (Santa Clara, CA, USA)
Account Executive, AI Enablement @Invisible Technologies Inc. (Remote)
AI Trainer — Software Developer @Davidayo (Remote)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.