TAI #112; Agent Capabilities Advancing; METR Eval and Inference Compute Scaling
Also, Qwen 2 Math, LG EXAONE, Structured Outputs, Gemini Flash 1.5 fine-tuning.
What happened this week in AI by Louie
This week saw fewer major announcements in AI, but there were still some notable developments. New open-source models were released, including Qwen 2 Math and LG’s EXAONE (7.8B), both achieving state-of-the-art results on some benchmarks. Meanwhile, OpenAI introduced Structured Outputs in its API, adding reliability for developers by ensuring that model-generated outputs conform to specified JSON Schemas. Google DeepMind also rolled out its reduced Gemini 1.5 Flash pricing and fine-tuning capabilities.
Following our comments last week on context caching (10x cheaper reused input tokens with DeepSeek, up to 4x with Gemini) and how this can be synergistic with “inference time scaling laws” and agent pipelines, we were interested to see another paper out this week from DeepMind: “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters”. The paper explores how smaller, less capable models can be enhanced by leveraging increased test-time compute, trading off training compute budgets for inference compute. The idea is similar to how humans can improve decision-making by thinking longer about difficult problems. The study finds that by optimally scaling test-time compute, smaller models can outperform much larger models in FLOPs-matched evaluations.
We were also interested in seeing the GPT-4o system card, including some eerie examples of GPT-4o voice mode spontaneously choosing to imitate the human’s voice (a bug which we understand is now fixed!). The system card included the new METR autonomy evaluation exploring agent capabilities. METR focussed on general autonomous capability measures rather than solely on “red line” threat-specific evaluations. They expanded their task suite to include around 50 new tasks in areas like cybersecurity, software engineering, and machine learning and evaluated these tasks using GPT-4o and Claude 3.5 Sonnet-based agents. While these agents performed comparably to humans on many tasks that took humans under 30 minutes, they struggled on more complex tasks, and performance plateaued after using around 200,000 tokens. On average, when these agents can do a task, they cost ~1/30th of the median hourly wage of a US bachelor’s degree holder. In reality, agent and LLM pipelines will be much more customized to a specific task or set of tasks, so there is a long way to go in developing agent capabilities!
Why should you care?
Several developments this week, such as OpenAI Structured Outputs, more affordable LLMs, and new fine-tuning and caching options, are all making it easier and more economical to build LLM pipelines for production while also potentially lowering the barriers to entry for smaller developers. Meanwhile, the evidence stacks up on the huge potential we can unlock by building agent pipelines and directing more inference-time compute at replicating human tasks. We think these agent pipelines already have plenty of economic applications (where, with lots of work and iteration, the LLM pipeline can cross a task-specific reliability threshold), but we expect them to get more powerful with the next generation of LLMs, particularly if reasoning capabilities can be improved!
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. Gemini 1.5 Flash Price Drop With Tuning Rollout Complete, and More
DeepMind confirmed details of its Gemini 1.5 Flash price drop, which we flagged last week. Prices have been cut significantly for prompts under 128K tokens: input tokens are down 78% to $0.075 per million tokens, and output tokens are down 71% to $0.30 per million tokens. Context caching can make reused input tokens up to 4x cheaper again. The fine-tuning option for Gemini 1.5 Flash is now fully deployed and accessible to all developers.
2. Zuckerberg Says Meta Will Need 10x More Computing Power To Train Llama 4 Than Llama 3
Meta’s CEO, Mark Zuckerberg, has stated that their upcoming language model, Llama 4, will require a tenfold increase in computing power for training compared to its predecessor, Llama 3. This suggests significant capital expenditure on infrastructure. However, CFO Susan Li clarified that these AI advancements are not anticipated to yield substantial revenue in the near term.
3. JPMorgan Chase Is Giving Its Employees an AI Assistant Powered by ChatGPT Maker OpenAI
JPMorgan Chase has rolled out a generative AI assistant to its employees as the initial step of a broader plan to inject the technology throughout the bank. The program, called LLM Suite, is already helping more than 60,000 employees with tasks like writing emails and reports. It is designed to be a portal that allows users to tap external LLMs.
4. Mistral Alpha Release of Agents
Mistral has introduced customization options for its models, including base prompts, few-shot prompting, and fine-tuning. The platform also launched an alpha version of Agents for workflow automation and debuted a stable client SDK for improved integration and application development.
5. AI Chipmaker Groq Raises $640M To Meet Rising Demand for High-Speed Inference Compute
Groq, an AI hardware company, has raised $640 million in a Series D round led by BlackRock, reaching a $2.8 billion valuation. The investment will expand Groq’s capacity by more than 100,000 LPUs to support growing demand from enterprises and developers. It will also enable the company to hire industry experts to drive further growth.
6. AMD Is Becoming an AI Chip Company, Just Like Nvidia
AMD’s Q2 2024 earnings highlighted progress on growing its AI business. Data center sales, led by products like the Instinct MI300 accelerator, surged by 115%. The MI300 broke $1 billion in quarterly sales, and AMD indicated that it intends to release AI chips annually to challenge Nvidia’s market dominance.
7. LG AI Released EXAONE 3.0, a Bilingual Model With 7.8B Parameters
EXAONE-3.0 7.8B-Instruct is an open pre-trained and instruction-tuned bilingual (English and Korean) generative model, pre-trained with 8T tokens and post-trained with supervised fine-tuning and DPO. It demonstrates highly competitive benchmark performance against other state-of-the-art open models of similar size.
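As a rough sanity check on the Gemini 1.5 Flash pricing in item 1 above, here is a minimal cost-estimate sketch. The two per-million-token prices come from the item itself; everything else is our own assumption for illustration: the function names are hypothetical, cached input tokens are billed at a flat quarter of the input price (the item only says "up to 4x"), and any cache storage fees are ignored.

```python
# Prices quoted in the item above (USD per million tokens, prompts under 128K).
INPUT_PRICE_PER_M = 0.075
OUTPUT_PRICE_PER_M = 0.30

# Assumption: reused (cached) input tokens cost exactly 1/4 of the normal
# input price. The item says "up to 4x", so real savings may be smaller.
CACHE_DISCOUNT = 4.0


def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> float:
    """Return the estimated USD cost of a single request."""
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_input_tokens
    fresh_cost = fresh_tokens * INPUT_PRICE_PER_M / 1_000_000
    cached_cost = (cached_input_tokens * INPUT_PRICE_PER_M
                   / CACHE_DISCOUNT / 1_000_000)
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return fresh_cost + cached_cost + output_cost


def implied_old_price(new_price: float, cut_fraction: float) -> float:
    """Back out the previous price from the new price and the stated cut."""
    return new_price / (1.0 - cut_fraction)


if __name__ == "__main__":
    # A 100K-token prompt with a 2K-token answer, with and without caching.
    print(f"no cache:   ${estimate_cost(100_000, 2_000):.4f}")
    print(f"80K cached: ${estimate_cost(100_000, 2_000, 80_000):.4f}")
    # Previous prices implied by the quoted 78% and 71% cuts.
    print(f"old input price:  ${implied_old_price(0.075, 0.78):.3f} per million")
    print(f"old output price: ${implied_old_price(0.30, 0.71):.3f} per million")
```

Under these assumptions, a prompt that is mostly a reused context (a long system prompt or document shared across agent steps) is dominated by output cost rather than input cost, which is why caching matters for agent pipelines that re-read the same context many times.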
Five 5-minute reads/videos to keep you learning
This tutorial covers retrieval augmented generation (RAG), the idea of multimodality, and how the two are combined to make modern multimodal RAG systems. You will also learn to build a multimodal RAG system using Google Gemini and a CLIP-style model for encoding. It is written for both beginners and senior AI researchers.
2. Can Mixture of Experts (MoE) Models Push GenAI to the Next Level?
MoE models have been applied in LLMs, computer vision, and recommendation systems to improve accuracy and speed while reducing computational load. This article closely examines MoE models, highlights some of the most noteworthy MoE models, and more.
3. GPT-5: Everything You Need to Know
The article discusses the expected launch and potential influence of OpenAI’s GPT-5 amidst competition from Google’s Gemini and Anthropic’s Claude. It highlights OpenAI’s need for substantial progress to keep its market lead, and notes that the release timeline remains unclear due to strategic and competitive considerations.
This article introduces a new study titled “Searching for Best Practices in Retrieval-Augmented Generation”. The study determines the optimal combinations of RAG methods to identify the best RAG practices. The article introduces the typical RAG process, presents best practices for each RAG module, and provides a comprehensive evaluation.
5. Get Started with Spark DataFrames and Big Data ML using PySpark
This is a hands-on and beginner-friendly deep dive on PySpark using Databricks.
The article examines OpenAI’s sustainability, highlighting its need for continuous funding and technological advancements against high operational costs. It discusses the complexities of OpenAI’s financial model and the potential conflict of interest posed by Microsoft’s involvement as both a supporter and a competitor. While we disagree with many assumptions made here, it is an interesting read.
7. AI Is Mining the Sum of Human Knowledge From Wikipedia. What Does That Mean for Its Future?
In this article, the author spoke with Wikipedia executives on how AI could jeopardize the encyclopedia’s connection with the volunteers who create it. The main concern is the potential impact these AI tools could have on the human motivation to continue creating and sharing knowledge.
Repositories & Tools
1. Transformer Explainer is an interactive visualization tool designed to help anyone learn how Transformer-based models like GPT work.
2. MetaGPT takes a one-line requirement as input and outputs user stories, competitive analysis, requirements, data structures, APIs, documents, etc.
3. Viking is a simple way to manage your remote machines and SSH keys.
Top Papers of The Week
1. GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
GMAI-MMBench is a new benchmark tool for evaluating Large Vision-Language Models (LVLMs) in medicine, encompassing 285 datasets across different modalities and tasks. Initial evaluations of 50 LVLMs, such as GPT-4o, revealed a peak accuracy of only 52%, indicating the need for further development in the sector.
2. RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation
RAG Foundry is an open-source platform that aims to improve Retrieval-Augmented Generation models by providing an integrated workflow for data creation, training, inference, and evaluation. It allows for the use of various knowledge sources to create specialized datasets and train models, significantly enhancing performance on tasks requiring extensive knowledge, as demonstrated by improved results on augmented Llama-3 and Phi-3 models.
3. Faithfulness Hallucination Detection in Healthcare AI
This study investigates faithfulness hallucinations in medical record summaries generated by LLMs such as GPT-4o and Llama-3. The detection framework categorizes five types of medical event hallucinations, and a pilot study of 100 medical note summaries finds these hallucination types in the output of recent closed-source and open-source LLMs.
4. Autogenic Language Embedding for Coherent Point Tracking
The paper introduces a new method for enhancing point tracking in video sequences by integrating language embeddings into visual features without requiring text annotations. This autogenic language embedding technique considerably improves over standard visual tracking, particularly in videos with diverse appearances.
5. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
This paper studies the scaling of inference-time computation in LLMs, with a focus on answering the question: If an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? This will potentially help with how one should trade off inference-time and pre-training compute.
This work presents an approach to improve model evaluators without human annotations, using synthetic training data only. In this method, the iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this at each new iteration using the improved predictions.
7. CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases
This paper introduces CodexGraph, which integrates LLM agents with graph database interfaces extracted from code repositories. It leverages the structural properties of graph databases and the flexibility of the graph query language, enabling the LLM agent to construct and execute queries and allowing code structure-aware context retrieval and code navigation.
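To make the test-time compute idea from paper 5 above more concrete, here is a toy sketch of best-of-N sampling, one simple way to spend extra inference compute: draw several candidate answers and keep the one a verifier scores highest. This is our own illustration and not the paper's implementation. The `toy_propose` "model" and `toy_score` "verifier" are hypothetical stand-ins, and the verifier here cheats by knowing the true answer, whereas a real one would be a learned scoring model.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    propose: Callable[[random.Random], int],
    score: Callable[[int], float],
    n: int,
    seed: int = 0,
) -> Tuple[int, float]:
    """Draw n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n)]
    best = max(candidates, key=score)
    return best, score(best)


# Hypothetical toy task: the "model" guesses an integer near 12 * 12.
TARGET = 12 * 12


def toy_propose(rng: random.Random) -> int:
    """Stand-in for sampling one answer from a small, noisy model."""
    return TARGET + rng.randint(-20, 20)


def toy_score(answer: int) -> float:
    """Stand-in verifier: higher is better, 0 means exactly right."""
    return -abs(answer - TARGET)


if __name__ == "__main__":
    # More samples means more inference compute spent on the same prompt.
    for n in (1, 4, 16, 64):
        answer, verifier_score = best_of_n(toy_propose, toy_score, n)
        print(f"n={n:>2}  answer={answer}  score={verifier_score}")
```

Because every run uses the same seed, the larger sample sets contain the smaller ones, so the best score can only stay the same or improve as `n` grows. With a real model the gain depends heavily on how good the verifier is and how hard the prompt is, which is the trade-off the paper studies.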
Quick Links
1. Google illegally monopolized the search market through exclusive deals, a judge ruled on Monday, handing the government a win in its first major antitrust case against the tech giant in over two decades.
2. OpenAI introduced Structured Outputs in the API, a new feature designed to ensure model-generated outputs will match JSON Schemas provided by developers. This functionality is available on the Chat Completions API, Assistants API, and Batch API.
3. Qwen introduced Qwen2-Math and Qwen2-Math-Instruct in 1.5B, 7B, and 72B sizes. These are a series of specialized math language models built upon the Qwen2 LLMs, which outperform open-source models and even closed-source models (e.g., GPT-4o) on mathematical capabilities.
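For readers who want a feel for the Structured Outputs feature in item 2 above, here is a small offline sketch. It builds a `response_format` payload in the shape we understand the Chat Completions API to expect (please check OpenAI's current API reference before relying on it), then checks a sample reply against the schema locally. The schema, the `build_response_format` helper, and the `conforms` checker are all hypothetical names for illustration, and `conforms` handles only a tiny subset of JSON Schema. No API call is made.

```python
import json
from typing import Any, Dict

# Hypothetical schema for one newsletter item.
NEWS_ITEM_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "rank": {"type": "integer"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "rank", "tags"],
    "additionalProperties": False,
}


def build_response_format(name: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a JSON Schema in the payload shape we assume the API expects."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }


_SCALARS = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def conforms(value: Any, schema: Dict[str, Any]) -> bool:
    """Check a value against a small subset of JSON Schema (sketch only)."""
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        if any(key not in value for key in schema.get("required", [])):
            return False
        if schema.get("additionalProperties") is False:
            if any(key not in props for key in value):
                return False
        return all(conforms(v, props[k]) for k, v in value.items() if k in props)
    if kind == "array":
        items = schema.get("items", {})
        return isinstance(value, list) and all(conforms(v, items) for v in value)
    if kind in _SCALARS:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if kind in ("integer", "number") and isinstance(value, bool):
            return False
        return isinstance(value, _SCALARS[kind])
    return True  # keywords outside this subset are not checked


if __name__ == "__main__":
    payload = build_response_format("news_item", NEWS_ITEM_SCHEMA)
    print(json.dumps(payload, indent=2))
    reply = '{"title": "Gemini 1.5 Flash price drop", "rank": 1, "tags": ["pricing"]}'
    print(conforms(json.loads(reply), NEWS_ITEM_SCHEMA))
```

The appeal of the feature is that this kind of defensive checking, and the retry loop that usually surrounds it, should rarely be needed when the API itself constrains the output to the schema. A local check can still be useful as a guard when switching between providers that do not offer the same guarantee.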
Who’s Hiring in AI
GenAI Developer @Ampcus Incorporated (TX, USA/Freelancer)
Data Science Associate (ML) @Ignitho (Chennai, India)
AI and Emerging Technology (ET) Researcher @Canadian Tire (Toronto, Canada/Hybrid)
Innovation Lead — AI and Collaboration @Pegasystems (USA/Remote)
AI Engineer @LinkedIn (Sunnyvale, CA, USA/Hybrid)
Full-Stack Developer (Technical Lead) @Frontier Technology Inc. (Colorado Springs, CO, USA)
Data Scientist III @JPMorgan Chase (Columbus, IN, USA)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.