<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Towards AI Newsletter]]></title><description><![CDATA[Towards AI's thoughts on the week's biggest AI developments. 
All major AI news, models, tools and papers covered. 
Read by over 130,000 AI Practitioners, Industry Professionals and Students.]]></description><link>https://newsletter.towardsai.net</link><image><url>https://substackcdn.com/image/fetch/$s_!ZBHF!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea4e29a-6b40-4b9a-9a98-00d0f6550a2e_512x512.png</url><title>Towards AI Newsletter</title><link>https://newsletter.towardsai.net</link></image><generator>Substack</generator><lastBuildDate>Tue, 07 Apr 2026 13:22:28 GMT</lastBuildDate><atom:link href="https://newsletter.towardsai.net/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Towards AI, Inc.]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[pub@towardsai.net]]></webMaster><itunes:owner><itunes:email><![CDATA[pub@towardsai.net]]></itunes:email><itunes:name><![CDATA[Towards AI]]></itunes:name></itunes:owner><itunes:author><![CDATA[Towards AI]]></itunes:author><googleplay:owner><![CDATA[pub@towardsai.net]]></googleplay:owner><googleplay:email><![CDATA[pub@towardsai.net]]></googleplay:email><googleplay:author><![CDATA[Towards AI]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Use AI for writing without the cleanup tax]]></title><description><![CDATA[Universal prompt framework that works with all LLMs and writing types]]></description><link>https://newsletter.towardsai.net/p/stop-editing-ai-slop-manually-free</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/stop-editing-ai-slop-manually-free</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Mon, 06 Apr 2026 15:03:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FCEM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve ever used AI to write 
an email, a blog post, or a project update and felt like you spent more time editing the output than it would have taken to write it yourself, chances are your draft looks something like this:</p><ul><li><p>It opens with &#8220;In today&#8217;s rapidly evolving landscape.&#8221; </p></li><li><p>You keep removing &#8220;delve,&#8221; &#8220;tapestry,&#8221; and &#8220;it&#8217;s worth noting.&#8221; </p></li><li><p>There are enough em dashes to fill a novel. </p></li><li><p>The content is accurate, but it reads like it could have been written by anyone, about anything. </p></li><li><p>You publish it anyway because the deadline won&#8217;t wait.</p></li></ul><p>We dealt with the exact same thing for over three years at Towards AI, editing, rewriting, and occasionally questioning our life choices. Eventually, we decided to stop fixing drafts one at a time. We made one cheatsheet for the entire team to use every time they generate content, so the slop gets caught in the prompt before anyone has to read it.</p><p>Today we&#8217;re <strong>sharing it with our community for free</strong>, partly because if we have to read one more &#8216;delve&#8217; or see another em dash, someone on the team is going to snap.</p><p>The <strong>Anti-Slop AI Writing Guide</strong> is a prompt template with 50+ banned words, style rules, and structural constraints baked in. You paste it into <strong>ChatGPT, Claude, or whatever LLM you use</strong>, fill in your topic and audience, and the AI follows your rules instead of making up its own. We&#8217;ve used it for emails, blog posts, reports, proposals, and scripts, and it holds up across all of them. 
No technical skills, no setup, just copy, paste, and stop editing the same problems out of every draft.</p><p><strong><a href="https://academy.towardsai.net/products/digital_downloads/anti-slop-framework?utm_source=TAIacademy&amp;utm_medium=Email&amp;utm_campaign=2026_coursetakers_nostart_download_glb&amp;utm_id=AIslopcheatsheet">Get the Anti-Slop Cheatsheet (Free)</a></strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/products/digital_downloads/anti-slop-framework?utm_source=TAIacademy&amp;utm_medium=Email&amp;utm_campaign=2026_coursetakers_nostart_download_glb&amp;utm_id=AIslopcheatsheet" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FCEM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png 424w, https://substackcdn.com/image/fetch/$s_!FCEM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png 848w, https://substackcdn.com/image/fetch/$s_!FCEM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png 1272w, https://substackcdn.com/image/fetch/$s_!FCEM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FCEM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png" width="1456" height="792" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:792,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1386316,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://academy.towardsai.net/products/digital_downloads/anti-slop-framework?utm_source=TAIacademy&amp;utm_medium=Email&amp;utm_campaign=2026_coursetakers_nostart_download_glb&amp;utm_id=AIslopcheatsheet&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.towardsai.net/i/193340331?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FCEM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png 424w, https://substackcdn.com/image/fetch/$s_!FCEM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png 848w, https://substackcdn.com/image/fetch/$s_!FCEM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png 1272w, https://substackcdn.com/image/fetch/$s_!FCEM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652b73c1-6fb7-4f9c-ae41-725c129f1437_2523x1373.png 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a></figure></div><h4>The guide teaches you how to:</h4><ul><li><p>Give the AI your outline, your section order, and your paragraph rules so it stops defaulting to listicles and generic five-part blog structure</p></li><li><p>Ban specific sentence patterns that are AI fingerprints, not just words like &#8220;delve&#8221; but structures like &#8220;It isn&#8217;t just X, it&#8217;s Y&#8221; and openings like &#8220;In today&#8217;s fast-paced world.&#8221;</p></li><li><p>Set accuracy guardrails so the AI doesn&#8217;t overstate claims, fabricate certainty, or ignore your source material</p></li><li><p>Build a 
repeatable framework that you can paste across chats, rather than starting over for every new piece of writing.</p></li><li><p>Use a second AI as an editor that audits the draft against your anti-slop rules and flags what to fix, so your own edit is a final pass, not a rewrite</p></li></ul><p>It is designed to move the cleanup process into the prompt itself and provides a two-model AI framework to speed up your editing workflow. Download the guide, fill in your topic, and let the prompt do what you&#8217;ve been doing manually.</p><p><strong><a href="https://academy.towardsai.net/products/digital_downloads/anti-slop-framework?utm_source=TAIacademy&amp;utm_medium=Email&amp;utm_campaign=2026_coursetakers_nostart_download_glb&amp;utm_id=AIslopcheatsheet">Download it free here</a>!</strong></p>]]></content:encoded></item><item><title><![CDATA[TAI #198: Real-Time Speech AI Gets Serious: Google and OpenAI Race to Own the Voice Layer]]></title><description><![CDATA[Also, Cohere Transcribe, Sora cancelled, TRIBE v2, and more!]]></description><link>https://newsletter.towardsai.net/p/tai-198-real-time-speech-ai-gets</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-198-real-time-speech-ai-gets</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 31 Mar 2026 15:02:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!v221!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>Real-time speech AI has been progressing quietly for the past year, but the past few weeks have delivered enough to warrant a dedicated look. Google released Gemini 3.1 Flash Live on March 26, OpenAI shipped GPT-Realtime-1.5 on February 23, and Cohere launched its Apache 2.0-licensed Transcribe model the same day as Google. 
We are now past the point where real-time voice AI feels like a demo-stage curiosity. It is starting to look like deployable infrastructure, and headline audio pricing has fallen sharply since OpenAI&#8217;s original Realtime API launch in October 2024.</p><p>Google&#8217;s Gemini 3.1 Flash Live is the headline release. It is Google&#8217;s highest-quality real-time audio model, designed for voice-first agents that can reason, call tools, and hold natural conversations across 70 languages. It accepts audio, video, text, and image input, supports function calling with Google Search grounding and extended thinking, and is available in developer preview via the Gemini Live API.</p><p>The benchmarks are strong. On ComplexFuncBench Audio, which tests multi-step function calling, Gemini 3.1 Flash Live leads with 90.8%, a big step up from 71.5% on the prior Flash 2.5 model. On Scale AI&#8217;s AudioMultiChallenge, which tests instruction-following amid real-world interruptions and hesitations, Gemini scores 36.1% with thinking enabled, compared to GPT-Realtime-1.5 at 34.7%. On BigBenchAudio for reasoning, Gemini reaches 95.9% with high thinking, compared to GPT-Realtime-1.5 at 81.1%. The catch is that these top Gemini scores require extended thinking, which adds latency. With minimal thinking, Gemini drops to 70.5% on BigBenchAudio and 26.8% on AudioMultiChallenge, both below GPT-Realtime-1.5. The reasoning-versus-latency trade-off is now a live engineering decision, not a footnote.</p><p>Google has also improved tonal understanding, with the model recognizing pitch, pace, frustration, and confusion and adjusting its responses accordingly. Enterprise customers, including Verizon, LiveKit, and The Home Depot, have tested 3.1 Flash Live. 
The Home Depot highlighted the model&#8217;s ability to capture alphanumeric product codes in noisy environments and handle customers switching languages mid-conversation.</p><p>OpenAI&#8217;s GPT-Realtime-1.5 looks strongest on conversational dynamics and transport options rather than on raw reasoning benchmarks. Artificial Analysis currently gives it a 95.7% Conversational Dynamics score and a 0.82-second time-to-first-audio. The same benchmark page lists Gemini 3.1 Flash Live at 2.98 seconds with high thinking and 0.96 seconds with minimal thinking. In practice, GPT-Realtime-1.5 should feel snappier in live conversation, while Gemini scores higher on published reasoning benchmarks.</p><p>A key operational improvement in GPT-Realtime-1.5 is OpenAI&#8217;s reported 10.23% gain in alphanumeric transcription accuracy. That matters because phone numbers, order IDs, and product codes are where voice systems often fail. OpenAI also supports WebRTC, WebSocket, and SIP for Realtime, which gives developers a direct path into browser, server, and telephony stacks. Perplexity says it already uses Realtime-1.5 in production for millions of voice sessions each month.</p><p>They are not the only players, either. Step Audio R1.1 out of China is a notable contender in the speech-to-speech space, winning on several benchmarks at very competitive pricing. Grok&#8217;s Voice Agent also remains in the running. The field is getting crowded fast.</p><p>The pricing tells an important story, but it is worth being precise about what is being compared: raw audio model cost, not total application cost. OpenAI documents audio tokenization at 1 token per 100 milliseconds for user audio and 1 token per 50 milliseconds for assistant audio. At $32 per million audio input tokens and $64 per million audio output tokens, that works out to roughly $0.096 per minute of two-way audio before text tokens, grounding, or telephony. 
Google publishes direct per-minute equivalents for Gemini 3.1 Flash Live Preview: $0.005 per minute of audio input and $0.018 per minute of audio output, or a total of $0.023 per minute. That makes Google about 4.2x cheaper on headline audio rates, although the model remains in preview and Google notes that preview models may change and may have tighter rate limits.</p><p>Another development that shows what this all unlocks is Google Live Translate. On March 26, Google expanded real-time headphone translation to iOS and additional countries, including France, Germany, Italy, Japan, Spain, Thailand, and the UK. The feature works with any headphones, supports 70+ languages, and preserves the original speaker&#8217;s tone and cadence. This is the closest thing to a universal translator that exists today. Five years ago, it was science fiction. Now it runs on a phone with any pair of earbuds. Google Meet&#8217;s speech translation beta extends this into professional settings, translating your speech in real time &#8220;in a voice like yours.&#8221; Search Live expanded to over 200 countries this week. The direction is clear: multilingual voice interaction is becoming a default capability, not a premium feature.</p><p>The cost trajectory reinforces this. In late 2024, OpenAI&#8217;s original Realtime API priced audio input at $100 per million tokens. GPT-Realtime brought that to $32. Gemini 3.1 Flash Live enters at $3 (albeit with different tokenisation), with a free tier. That&#8217;s a huge cost reduction in under two years.</p><p>Cohere also contributed this week from a different angle. Cohere Transcribe is not a conversational model but a dedicated automatic speech recognition (ASR) system: 2 billion parameters, conformer-based, 14 languages, Apache 2.0. It ranks first on the Hugging Face Open ASR Leaderboard with an average word error rate (WER) of 5.42%, ahead of Zoom Scribe v1 at 5.47% and OpenAI Whisper Large v3 at 7.44%, and processes audio at 525x real-time. 
For enterprises in healthcare, legal, finance, or government that cannot send audio to third-party cloud APIs, this is the most important release of the week. Open weights, consumer-GPU-sized, and zero licensing cost.</p><p>On a personal note, one of my favourite audio-based AI tools right now is Granola. It captures high-quality transcripts of your computer audio and calls with minimal setup, and then lets you run top models over those transcripts to produce call summaries or fully cleaned-up notes. It&#8217;s the kind of product that shows where this whole space is heading: speech capture and understanding becoming an ambient background layer in everyday work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v221!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v221!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png 424w, https://substackcdn.com/image/fetch/$s_!v221!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png 848w, https://substackcdn.com/image/fetch/$s_!v221!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png 1272w, 
https://substackcdn.com/image/fetch/$s_!v221!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v221!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png" width="1034" height="656" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:656,&quot;width&quot;:1034,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!v221!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png 424w, https://substackcdn.com/image/fetch/$s_!v221!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png 848w, https://substackcdn.com/image/fetch/$s_!v221!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png 1272w, 
https://substackcdn.com/image/fetch/$s_!v221!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12bcbb-b44b-4fc0-a670-daf882f5f355_1034x656.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: Google</figcaption></figure></div><div><hr></div>
<h3>Why should you care?</h3><p>Speech is becoming a first-class modality because it maps onto existing behaviors in search, meetings, support, and translation. A model that can reason over spoken language in real time, handle interruptions cleanly, call tools, and switch languages has a much clearer route into daily workflows than a text-only chatbot.</p><p>The live translation thread is perhaps the most important long-term signal. Google Live Translate, expanding to iOS with 70+ languages and tone-preserving headphone translation, is a capability people have waited decades for. When this moves into Google Meet (already in beta), into contact centers, and eventually into the Gemini API for any developer to build on, the number of human interactions it can reshape is enormous. This would allow, for example, a doctor to consult with a patient across a language barrier without waiting for an interpreter, or a multinational meeting to run without anyone being forced into English.</p><p>I expect we&#8217;ll see speech-first interfaces become standard across customer support, education, healthcare, and accessibility within the next 12 to 18 months. The cost barrier is gone. The accuracy is reaching production thresholds. The remaining challenges are that voice naturalness still varies by language, inference and reasoning introduce some delay, and benchmarks still miss domain vocabulary and emotional nuance. 
So the right approach is still human evaluation on your own recordings and accents, together with easy escalation to a real human operator, not blind faith in a leaderboard.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><p>We have co-published an article with <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Paul Iusztin&quot;,&quot;id&quot;:110559689,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0714d360-396c-4b41-a676-1b58dc1dc5f3_1470x1470.jpeg&quot;,&quot;uuid&quot;:&quot;22beed03-bee9-4556-8f76-c4089af4c7c6&quot;}" data-component-name="MentionToDOM"></span>, covering the mental model that prevents you from overengineering your next AI system.</p><p>Here is what you will learn:</p><ul><li><p>The fundamental difference between an agent and a workflow.</p></li><li><p>How to use the complexity spectrum to make architecture decisions.</p></li><li><p>When to rely on simple workflows for predictable tasks.</p></li><li><p>Why a single agent with tools is often enough for dynamic problems.</p></li><li><p>The exact breaking points that justify moving to a multi-agent system.</p></li></ul><p><a href="https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide">Read the full article here</a>!</p><div><hr></div><h3>Hottest News</h3><p>1. <a href="https://www.wsj.com/tech/ai/openai-set-to-discontinue-sora-video-platform-app-a82a9e4e">OpenAI Scraps Sora Video Platform Months After Launch</a></p><p>OpenAI has shut down Sora, its AI video-generation app, less than two years after it generated widespread attention for creating realistic clips from simple text prompts. Alongside the shutdown, OpenAI is also winding down its $1B content partnership with Disney. 
The company says it&#8217;s shifting focus to developments like robotics &#8220;that will help people solve real-world, physical tasks.&#8221; For context, Sora pulled in just $1.4M in global net in-app revenue since launch, compared to $1.9B for ChatGPT over the same period.</p><p>2. <a href="https://claude.com/blog/dispatch-and-computer-use">Anthropic Rolls Out Computer Use Capabilities</a></p><p>Anthropic now lets Claude directly use your computer to complete tasks. When Claude doesn&#8217;t have access to the tools it needs, it will point, click, and navigate your screen, opening files, using the browser, and running dev tools without any setup. The feature is available in research preview for Claude Pro and Max subscribers, and also works with Dispatch, which lets you assign Claude tasks from your phone. On the safety side, the system automatically scans model activations to detect risky behavior, Claude always asks permission before accessing new applications, and you can stop it at any point.</p><p>3. <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">Google Unveils TurboQuant</a></p><p>Google&#8217;s research team has introduced TurboQuant, a compression algorithm that reduces LLM key-value cache memory by 6x and delivers up to 8x speedup, with zero accuracy loss. TurboQuant is &#8220;data-oblivious,&#8221; so it doesn&#8217;t require dataset-specific tuning or calibration. It&#8217;s also designed to work smoothly with modern GPUs by using vectorized operations instead of slow, non-parallelizable binary searches. Under the hood, it uses a two-stage approach: MSE-optimal quantization followed by a 1-bit QJL transform on the residual, providing unbiased inner-product estimates that are critical for maintaining transformer attention accuracy.</p><p>4. 
<a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-live/">Google Releases Gemini 3.1 Flash Live</a></p><p>Google has released Gemini 3.1 Flash Live in preview for developers through the Gemini Live API in Google AI Studio. The model targets low-latency, more natural real-time voice interactions. It uses WebSockets (WSS) for full-duplex communication, supporting barge-in (user interruptions) and simultaneous transmission of audio, video frames, and transcripts. The model is also optimized for triggering external tools directly from voice, scoring 90.8% on ComplexFuncBench Audio for multi-step function calling.</p><p>5. <a href="https://cohere.com/blog/transcribe">Cohere AI Launches Cohere Transcribe</a></p><p>Cohere has released Cohere Transcribe, an automatic speech recognition (ASR) model built on a large Conformer encoder paired with a lightweight Transformer decoder. To maintain memory efficiency and stability, it uses native 35-second chunking logic, automatically segmenting longer audio into overlapping chunks and reassembling them, enabling it to handle extended recordings without performance degradation. The model supports 14 languages and currently ranks #1 on the Hugging Face Open ASR Leaderboard (as of March 26, 2026) with an average Word Error Rate of 5.42%.</p><p>6. <a href="https://aidemos.atmeta.com/tribev2">Meta Releases TRIBE v2</a></p><p>Meta has released TRIBE v2, a tri-modal foundation model that serves as a digital mirror of human brain activity in response to visual, auditory, and linguistic stimuli. It uses state-of-the-art encoders such as LLaMA 3.2 for text, V-JEPA2 for video, and Wav2Vec-BERT for audio to capture features that are shared between AI models and the human brain. TRIBE v2 can accurately predict brain responses to new stimuli, tasks, and subjects without retraining, achieving 2&#8211;3x improvement over standard methods on auditory and visual datasets. 
A subject-specific layer maps universal learned representations onto individual fMRI voxels, the 3D pixels that track neural activity through changes in blood flow and oxygenation.</p><div><hr></div><h3>AI Tip of the Day</h3><p>To ensure your RAG retrieval is working correctly, split your evaluation into two layers. For retrieval, measure whether relevant evidence was retrieved using metrics like recall@k and Mean Reciprocal Rank. For generation, measure faithfulness to the retrieved context and the answer&#8217;s relevance to the question, often using an LLM judge calibrated against human labels.</p><p>High retrieval recall with low faithfulness suggests the model had the right evidence, but failed to use it properly. High faithfulness with low retrieval recall suggests the model stayed grounded in the retrieved context, but retrieval surfaced incomplete or off-target evidence. These are two completely different problems with two completely different fixes, and without the split, you can&#8217;t tell which one you&#8217;re dealing with.</p><p>If you&#8217;re currently building a RAG pipeline and want to go deeper into evaluation, retrieval strategies, and the full production stack, check out our <a href="https://academy.towardsai.net/courses/beginner-to-advanced-llm-dev">Full Stack AI Engineering</a> course.</p><div><hr></div><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. <a href="https://pub.towardsai.net/vectorless-rag-your-rag-pipeline-doesnt-need-a-vector-database-0a0839feabd9?sk=9ced3f8f1009ca371a868d6a7a3fd771">Vectorless RAG: Your RAG Pipeline Doesn&#8217;t Need a Vector Database</a></p><p>Vectorless RAG reasons about where in the document the answer lives, the same way a human expert would, instead of searching for similar text. This article explains the concept, where it outperforms traditional RAG, and how to build it using PageIndex, an open-source library that implements it in about 50 lines of Python.</p><p>2. 
<a href="https://pub.towardsai.net/exploration-and-exploitation-the-simple-yet-profound-logic-at-the-heart-of-reinforcement-learning-b3cb232942e6?sk=bbac1b3746fedd021138fec5c1e66d83">Exploration and Exploitation: The Simple Yet Profound Logic at the Heart of Reinforcement Learning</a></p><p>The exploration-exploitation trade-off in reinforcement learning mirrors a fundamental human dilemma: stick with what works or try something new. This article walks through the core mechanics, covering &#949;-greedy strategies, Upper Confidence Bound, and Thompson Sampling as progressively smarter approaches to balancing exploration and exploitation. It also extends the logic to full RL via Q-learning and value functions.</p><p>3. <a href="https://pub.towardsai.net/building-a-data-analysis-agent-with-langgraph-6a1072472a1e?sk=42ac7d0c233738b03106238b84d2aa51">Building a Data Analysis Agent with LangGraph</a></p><p>This article walks you through building a data analysis agent with LangChain, LangGraph, and GPT-4o-mini. The agent autonomously investigates Singapore Airbnb data, surfacing three validated findings across four iterations. The system pairs four single-responsibility agents with six pandas tools, using conditional routing and a loop so that the agent, rather than the developer, decides when to stop. The article also covers governance alignment with Singapore&#8217;s IMDA framework, metric honesty, and one hard lesson: prompt instructions cannot enforce behavior. Code can.</p><p>4. <a href="https://pub.towardsai.net/mcp-a2a-owl-ontology-i-built-the-agentic-mesh-your-enterprise-agents-are-missing-84ec0487ddd4?sk=7ad91b48f2d26f6863f4d3e60b9383a4">MCP + A2A + OWL Ontology: I Built the Agentic Mesh Your Enterprise Agents Are Missing</a></p><p>This article walks you through building an Agentic Mesh that includes MCP for tool access, OWL and SHACL for shared semantic contracts, and Google&#8217;s A2A protocol for validated agent communication. 
SHACL constraints block invalid data from crossing agent boundaries, while A2A Agent Cards advertise each agent&#8217;s ontology version.</p><p>5. <a href="https://pub.towardsai.net/microsoft-iq-vs-e106645a5b17?sk=ddbeee58d7a75c9435f63e307f89c246">Microsoft IQ vs. ServiceNow: I Built the Layer Both Are Missing</a></p><p>Microsoft IQ and ServiceNow&#8217;s AI Control Tower tackle enterprise AI governance from opposite ends: one defines business semantics across a three-tier intelligence layer, the other governs every agent through a vendor-agnostic control plane. The article argues that both miss the point of runtime determinism. Using OWL ontologies and SHACL constraints, the piece builds an ontology firewall that intercepts MCP tool calls and blocks semantically invalid agent actions before they reach production.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/obra/superpowers">Superpowers</a> is a complete software development workflow for coding agents, built on top of composable &#8220;skills&#8221;.</p><p>2. <a href="https://github.com/A-EVO-Lab/a-evolve">A-Evolve</a> is a universal infrastructure for self-improving agents that works with any evolution algorithm.</p><p>3. <a href="https://github.com/agent-infra/sandbox">AIO Sandbox</a> is an all-in-one agent sandbox environment that combines Browser, Shell, File, MCP operations, and VSCode Server in a single Docker container.</p><p>4. <a href="https://github.com/NVIDIA-NeMo/ProRL-Agent-Server">ProRLAgent Server</a> is a scalable multi-turn rollout system for training and evaluating RL agents.</p><p>5. <a href="https://github.com/Tencent/Covo-Audio">Covo-Audio</a> is a 7B-parameter end-to-end large audio language model that directly processes continuous audio inputs and generates audio outputs within a single unified architecture.</p><h3>Top Papers of The Week</h3><p>1. 
<a href="https://arxiv.org/abs/2504.19874">TurboQuant: Near-Optimal Online Vector Quantization</a></p><p>This paper introduces TurboQuant, a data-oblivious vector quantization algorithm that achieves near-optimal distortion rates across all bit-widths by randomly rotating inputs and applying optimal scalar quantizers to each coordinate. Applied to KV cache quantization, it achieves absolute quality neutrality at 3.5 bits per channel and marginal quality degradation at 2.5 bits per channel.</p><p>2. <a href="https://arxiv.org/abs/2603.25551">Voxtral TTS</a></p><p>Voxtral TTS is a multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. It combines autoregressive generation of semantic speech tokens with flow matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch. In human evaluations conducted by native speakers, it achieves a 68.4% win rate over ElevenLabs Flash v2.5.</p><p>3. <a href="https://arxiv.org/abs/2603.23516">MSA: Memory Sparse Attention Scales End-to-End to 100M Tokens</a></p><p>This paper presents Memory Sparse Attention (MSA), a trainable, massively scalable memory-model framework. MSA achieves linear complexity in both training and inference while maintaining stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs.</p><p>4. <a href="https://arxiv.org/abs/2603.20278">OpenResearcher: Fully Open Pipeline for Deep Research Trajectory Synthesis</a></p><p>This paper introduces OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline using search, open, and find over a 15M-document corpus. 
The authors synthesized 97K+ trajectories and used them to train a 30B model that scored 54.8% on BrowseComp-Plus (+34 points over the base).</p><p>5. <a href="https://arxiv.org/html/2603.20639v1">Agentic AI and The Next Intelligence Explosion</a></p><p>This paper challenges the idea of a monolithic AI singularity, arguing instead that future transformative intelligence will emerge from complex, socially organized interactions among multitudes of AI agents and humans. The authors emphasize that building scalable, cooperative &#8220;agent institutions&#8221; and constitutional checks and balances is critical for safely managing the combinatorial explosion of intelligence.</p><h3>Quick Links</h3><p>1. <a href="https://www.trychroma.com/research/context-1">Chroma releases Context-1</a>, a 20B parameter agentic search model designed to act as a specialized retrieval subagent. By focusing solely on retrieval, Context-1 achieves 10x faster inference and 25x lower costs than frontier models like GPT-5.4, while matching their accuracy on complex benchmarks like HotpotQA and FRAMES.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/nvidia-devops-and-build-engineer-compiler-tpj3">DevOps and Build Engineer &#8212; Compiler @NVIDIA (India)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/gusto-inc-application-systems-engineering-manager-kgw0">Application Systems Engineering Manager @Gusto, Inc. 
(New York, NY, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/datacamp-staff-ai-engineer-ai-creator-m1gv">Staff AI Engineer &#8212; AI Creator @DataCamp (Belgium/Dubai/Portugal/UK/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/correlation-one-embedded-ai-solutions-engineer-contract-vynr">Embedded AI Solutions Engineer @Correlation One (Remote/NAMER)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/pandadoc-middle-python-engineer-document-app-cqen">Middle Python Engineer, Document App @PandaDoc (Remote/Poland)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/turing-research-engineer-mcsz">Research Engineer @Turing (Remote/Columbia)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? <a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p>]]></content:encoded></item><item><title><![CDATA[The engineering best practices you can drop straight into Claude]]></title><description><![CDATA[The exact markdown files we use for writing, coding, and building agents at Towards AI]]></description><link>https://newsletter.towardsai.net/p/were-sharing-our-internal-ai-engineering</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/were-sharing-our-internal-ai-engineering</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Wed, 25 Mar 2026 13:36:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NvsB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;ve spent years building LLM systems at Towards AI. The main goal has always been the same: share what we build and, more importantly, what we learn building it, so you can grow as an AI engineer without hitting every wall we did.</p><p>Part of that is our courses. But the bigger part is making your actual building process easier, every day. 
So we took the markdown files we use internally (the ones you can feed directly into Claude, so it builds with the context that usually takes years to develop) and made them public.</p><p><strong>Access everything here:</strong> <a href="https://github.com/louisfb01/ai-engineering-cheatsheets">https://github.com/louisfb01/ai-engineering-cheatsheets</a></p><p>It includes decision-ready references for the most common AI engineering problems: all the engineering best practices from our courses distilled into dense markdown files you can use mid-build or feed directly into Claude, so it works from decisions already tested on real systems.</p><p>Open a cheatsheet, find your situation in the table, and follow the recommendation.</p><h4>What&#8217;s Inside</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://github.com/louisfb01/ai-engineering-cheatsheets" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NvsB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp 424w, https://substackcdn.com/image/fetch/$s_!NvsB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp 848w, https://substackcdn.com/image/fetch/$s_!NvsB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp 1272w, https://substackcdn.com/image/fetch/$s_!NvsB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NvsB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp" width="876" height="706" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:706,&quot;width&quot;:876,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63116,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:&quot;https://github.com/louisfb01/ai-engineering-cheatsheets&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.towardsai.net/i/192064995?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NvsB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp 424w, https://substackcdn.com/image/fetch/$s_!NvsB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp 848w, https://substackcdn.com/image/fetch/$s_!NvsB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!NvsB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d6105f-97af-4500-9e5d-c01fba99ec07_876x706.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>These come directly from the Towards AI Academy courses, the same frameworks we teach in depth, distilled into references you can use today. No course required. 
No paywall.</p><p>You can access everything here: <a href="https://github.com/louisfb01/ai-engineering-cheatsheets">https://github.com/louisfb01/ai-engineering-cheatsheets</a></p><p>If you want to go deeper, with full lessons, code, and hands-on projects, that&#8217;s what the <a href="https://academy.towardsai.net/?utm_source=TAIspecialedition&amp;utm_medium=Medium&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=EngineeringCheatsheetRepo">Towards AI Academy</a> is for.</p>]]></content:encoded></item><item><title><![CDATA[TAI #197: Anthropic Turned the OpenClaw Demand Signal Into a Product]]></title><description><![CDATA[Also, Jensen Huang on $1 trillion revenue, Elon Musk launches Terafab, Cursor&#8217;s Composer 2 rides Kimi K2.5, and more!]]></description><link>https://newsletter.towardsai.net/p/tai-197-anthropic-turned-the-openclaw</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-197-anthropic-turned-the-openclaw</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 24 Mar 2026 15:00:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vIXD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>Last week, I wrote about quiet agent upgrades. This week, Anthropic continued to launch features that make the bigger picture obvious. In ten weeks, it went from launching Cowork (January 12) to shipping persistent phone-to-desktop threads via Dispatch (March 17) and direct computer use (March 23), adding plugins, admin controls, and scheduled tasks along the way. A paid Claude Cowork user can now message an agent from their phone, let it work on their machine, connect it to dozens of apps, and hand it control of the mouse and the full computer when connector or API access isn&#8217;t available. 
OpenClaw, at roughly 333,000 GitHub stars, did the product discovery. Anthropic built and shipped many of its key features at an incredible pace (only possible by using Claude Code itself to build features!), but with a much more enterprise-friendly risk profile: connectors first, explicit per-app permissions, prompt-injection scanning, and admin controls. Open source found the primitive. Anthropic wrapped it in the permission model that lets a company actually deploy it.</p><p>The agent story feeds directly into the AI infrastructure debate that dominated the rest of the week. Computer use, browser control, and persistent background tasks are dramatically more token-intensive than chat. A single Cowork session running scheduled tasks, clicking through apps, and filling spreadsheets burns far more compute than a conversation. Every new agentic workflow Anthropic or anyone else ships multiplies the demand per user. That is part of why the people at the top of the AI stack sound increasingly frustrated with the pace of supply expansion further down.</p><p>At GTC, Jensen Huang said Nvidia expects at least $1 trillion in cumulative Blackwell and Rubin revenue through 2027, then clarified that this estimate was conservative because it excluded additional products. On the All-In podcast, he called Dario Amodei&#8217;s forecast of roughly $1 trillion in non-infrastructure AI revenue by 2030 &#8220;very conservative,&#8221; adding that Anthropic will do &#8220;way better than that&#8221; because every enterprise software company will become a value-added reseller of model tokens. I suspect Jensen is also privately nervous about the supply chain&#8217;s willingness to ramp as aggressively as his demand forecasts require. 
His current approach has been to invest directly in suppliers to force capacity expansion: Nvidia recently committed $4 billion to optical interconnect suppliers Coherent and Lumentum to address the silicon photonics bottleneck, and on the February earnings call, management described supporting the &#8220;extreme ecosystem&#8221; of suppliers from a capacity standpoint as one of the company&#8217;s most important priorities.</p><p>The further down the supply chain you go, the fewer people believe those numbers. Broadcom said today that TSMC has become a production bottleneck, with meaningful new capacity not materializing until 2027, and that the squeeze now extends beyond wafers into lasers and printed circuit boards. Memory prices in some segments have more than tripled over the past year. Samsung is pushing customers toward three- to five-year contracts to justify expansion. The top of the stack is trying to force conviction into the middle, and the middle is still hesitant to invest at the scale implied by demand forecasts.</p><p>That backdrop makes Elon Musk&#8217;s Terafab announcement easier to parse. Tesla and SpaceX plan a joint chip fabrication complex in Austin, starting with an initial $20&#8211;25 billion facility, though the full project at the scale Musk described would cost dramatically more. At full capacity, Terafab would target 1 terawatt of annual compute output, compared with roughly 0.5 terawatt for the entire current U.S. electricity network. Musk said every fab on Earth currently produces about 2% of what his companies would eventually need, and that 80% of Terafab&#8217;s output would be directed toward orbital data centers in space. These numbers really only make sense if AI leads to a large multiplication of the global economy from current levels.</p><p>The pieces Musk already has are real but partial. Tesla&#8217;s chip team has been designing custom AI chips for years, with AI5 targeting production in 2027 and AI6 in 2028. 
Samsung plans to begin volume fabrication of Tesla chips in Texas in the second half of 2027. SpaceX is building what will be the largest PCB and panel-level packaging facility in North America at its Bastrop site, backed by a $280 million-plus Texas semiconductor innovation grant. Musk is also recruiting aggressively, posting on X that anyone in Korea working in chip design, fabrication, or AI software should apply to Tesla, in what looks like a direct play for TSMC and Samsung talent.</p><p>What Musk lacks is any experience running an actual fabrication plant. The gap between chip design plus advanced packaging and full-scale leading-edge lithography is enormous. TSMC has roughly 50,000 engineers who do nothing but fab operations, and it has spent decades and hundreds of billions of dollars building that capability. The EUV lithography machines that any 2nm fab requires are made exclusively by ASML, which has a record backlog of roughly &#8364;39 billion and whose capacity is likely to be a key bottleneck for anyone trying to build a new leading-edge fab on an ambitious timeline. Each EUV machine costs $200&#8211;400 million, weighs 165 tons, and requires specialized ocean transport. There is no fast lane for procurement.</p><p>I suspect Terafab is partly a manufacturing project and partly a supply-chain pressure tactic, similar to Battery Day in 2020. Tesla presented the 4680 cell as a path to much lower battery costs and near-100x scale by 2030. The execution was painful: repeated delays in dry-electrode manufacturing, supplier pushback, and struggles at scale as late as 2023. Yet Tesla&#8217;s latest shareholder update says it is now producing 4680 dry-electrode cells with both anode and cathode in Austin, a real milestone after years of difficulty. The battery program shipped later and uglier than the slides implied, but it dragged Tesla and its suppliers up the curve. 
Terafab may serve a similar function even if the schedule slips badly, which I expect it will.</p><p>Google is fighting the same capacity war from a different angle, and energy is its primary lever. Alphabet acquired clean energy developer Intersect for $4.75 billion in December to gain direct access to power projects and data center infrastructure. Google has signed nuclear deals with Kairos Power for 500 MW of small modular reactors by 2035, a 25-year agreement with NextEra Energy to restart Iowa&#8217;s shuttered 615 MW Duane Arnold nuclear plant, a 200 MW deal with fusion firm Commonwealth Fusion Systems, and a strategic agreement with Elementl Power to develop three nuclear sites with at least 600 MW of capacity each. It has also been signing utility agreements to curtail up to 1 gigawatt of data-center power during peak periods. Ruth Porat said this week that the U.S. is not scaling up energy supply fast enough to support AI. Meanwhile, Meta signed a multi-billion-dollar deal to rent Google&#8217;s TPUs and was also discussing buying them outright, while Anthropic already has access to more than 1 gigawatt of Google TPU capacity.</p><p>Open weight models have been taking something of a back seat to the breakthroughs in agentic capabilities at the closed AI labs over the past few months, but I think open weights will still have a key role to play. Cursor released Composer 2, a coding model built on Moonshot AI&#8217;s Kimi K2.5 via an authorized commercial partnership through Fireworks AI. It scores 61.7 on Terminal-Bench 2.0 and 73.7 on SWE-bench Multilingual, up sharply from Composer 1.5, and is priced at $0.50 per million input tokens. Cursor did not initially disclose the Kimi base. A developer intercepted the API traffic and found the model ID in plain text. 
After millions of views, Cursor VP Lee Robinson acknowledged the open-source base, and co-founder Aman Sanger called the omission &#8220;a miss from the start.&#8221; The licensing story is clean; the disclosure story is not. But the product formula, take a strong open base, hammer it with domain-specific RL, wrap it in the best UX in the category, is very likely the template for application-layer competition over the next couple of years.</p><div><hr></div><h3>Why should you care?</h3><p>The &#8220;AI bubble&#8221; framing keeps circulating and keeps missing the point. Bubbles feel overbuilt. Much of AI still feels under-supplied. Memory prices have tripled. TSMC is a bottleneck. Lasers and PCBs are in short supply. ASML&#8217;s EUV machines are booked out. Musk, Jensen, and Google are all signaling the same thing: there are not enough chips, power, or industrial capacity to support the scenarios the leading buyers seem willing to fund.</p><p>The &#8216;agent&#8217; story makes this tension worse. Anthropic&#8217;s Cowork with computer use, Dispatch, and scheduled background tasks turns a single user into a persistent compute load. 
Every time an agent clicks through a browser, fills out a spreadsheet, or runs a recurring workflow, it burns far more tokens than a chat exchange does. Multiply that across millions of subscribers, then add Cursor&#8217;s long-horizon coding agents, OpenAI&#8217;s agent mode, and the broader wave of agentic products shipping every week, and you start to see why Jensen thinks $1 trillion is conservative. The revenue potential from agents is enormous, but the compute requirements per user are also enormous. Those two facts together explain the urgency behind Terafab, Google&#8217;s energy sprint, and Nvidia&#8217;s direct investments in its supplier base.</p><p>The gap between conviction at the top and hesitancy in the middle of the supply chain is a key dynamic in AI right now. The DRAM fabs, the PCB makers, the laser suppliers, and the power utilities are the ones whose investment pace will determine how fast AI actually scales. If the top-of-stack buyers are right, the hesitancy further down becomes the binding constraint. If they are wrong, Terafab will be a very expensive monument to overconfidence. The next two years will settle it. The people who get ahead will be the ones using the new tools before the supply catches up.</p><p>One final thought on the Terafab story: if you truly believe in recursive AI self-improvement without near-term dead ends, now is indeed the time to begin ambitious projects that wouldn&#8217;t have been possible previously. If AI can help simulate, iterate, and improve chip science and manufacturing, then those making the earliest and most aggressive moves to build an AI-first chip fab may indeed have a chance to leapfrog incumbents. 
This will also be the case in many other industries, and I expect many more pie-in-the-sky, ambitious projects to be launched soon by AI labs and true AI believers.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><p><strong>This issue is brought to you thanks to <a href="https://linkly.link/2dwLd">SerpApi</a>:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://linkly.link/2dwLd" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WnmL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WnmL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:713837,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://linkly.link/2dwLd&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.towardsai.net/i/191253155?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WnmL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>LLMs are powerful. 
But without fresh information, they can hallucinate or miss context.</p><p>SerpApi helps AI applications access real-time search data from search engines like Google, Bing, Amazon, and more via a simple API.</p><p>Get clean, structured JSON results and power AI agents, research tools, and data-driven applications without managing scrapers.</p><p><a href="https://linkly.link/2dwLd">Start with 250 free credits/month by signing up at SerpApi today</a>!</p><div><hr></div><h3>Hottest News</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vIXD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vIXD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png 424w, https://substackcdn.com/image/fetch/$s_!vIXD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png 848w, https://substackcdn.com/image/fetch/$s_!vIXD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png 1272w, https://substackcdn.com/image/fetch/$s_!vIXD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vIXD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png" width="1286" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1286,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vIXD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png 424w, https://substackcdn.com/image/fetch/$s_!vIXD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png 848w, https://substackcdn.com/image/fetch/$s_!vIXD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png 1272w, https://substackcdn.com/image/fetch/$s_!vIXD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5ee8c5-a4cc-47fe-9802-99f324095864_1286x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>1. <a href="https://openai.com/index/introducing-gpt-5-4-mini-and-nano/">OpenAI Releases GPT-5.4 Mini and Nano</a></p><p>OpenAI released GPT-5.4 mini and GPT-5.4 nano, two smaller GPT-5.4 variants designed for high-throughput, latency-sensitive workloads such as coding assistants, sub-agents, and routine automation. GPT-5.4 mini is positioned as the default &#8220;workhorse&#8221; small model, faster than GPT-5 mini (OpenAI notes it runs over 2&#215; faster) while improving coding, reasoning, multimodal understanding, and tool use. It lands close to the full GPT-5.4 model on several evals (for example, 54.4% on SWE-Bench Pro vs. 57.7% for GPT-5.4, and 45.7% for GPT-5 mini). 
In the API, mini supports text + image inputs, tool use/function calling, web search, file search, and computer use, with a 400K context window, and is priced at $0.75/1M input tokens and $4.50/1M output tokens. GPT-5.4 nano is the smallest, lowest-cost option for simpler tasks like classification, ranking, extraction, and lightweight coding subagents; it&#8217;s API-only and priced at $0.20/1M input tokens and $1.25/1M output tokens. GPT-5.4 mini is also available across Codex surfaces and in ChatGPT, where Free/Go users can access it via Thinking and where it serves as a rate-limit fallback for GPT-5.4 Thinking on other plans.</p><p>2. <a href="https://cursor.com/blog/composer-2">Cursor Launches Composer 2, Coding Model Powered by Kimi-k2.5</a></p><p>Cursor released Composer 2, a frontier-level coding model priced at $0.50 per million input tokens, with a faster variant available. Built on Moonshot AI&#8217;s Kimi-k2.5 via continued pretraining and high-compute RL, it shows substantial benchmark improvements, including 61.7 on Terminal-Bench 2.0 and 73.7 on SWE-bench Multilingual. The model is available immediately in Cursor with usage included in individual plans. Kimi confirmed the authorized commercial partnership through Fireworks AI.</p><p>3. <a href="https://mistral.ai/news/mistral-small-4">Mistral Releases Small 4</a></p><p>Mistral AI released Mistral Small 4, a unified open-source multimodal reasoning model, alongside Leanstral, an open-source code agent built for Lean 4 formal verification. Mistral Small 4 combines the roles of Mistral&#8217;s earlier specialist lines (reasoning, multimodal understanding, and agentic coding) into a single hybrid model tuned for general chat, coding, agent workflows, and deeper reasoning. 
Architecturally, it&#8217;s a Mixture-of-Experts system with 128 experts and 4 active per token, totaling 119B parameters, with roughly 6&#8211;6.5B parameters activated per token (about 8B including embedding and output layers), and it supports a 256K context window plus native text+image inputs. It also adds a configurable reasoning-effort control, allowing developers to trade off low-latency responses against more intensive reasoning. Mistral reports major efficiency gains versus Mistral Small 3 (up to 40% lower end-to-end completion time in a latency-optimized setup and 3&#215; higher requests-per-second in a throughput-optimized setup) and positions Small 4 (with reasoning enabled) as competitive on core reasoning/coding benchmarks while producing shorter outputs.</p><p>4. <a href="https://nvidianews.nvidia.com/news/openai-and-nvidia-announce-strategic-partnership-to-deploy-10gw-of-nvidia-systems">OpenAI and NVIDIA Sign $100B Infrastructure Partnership</a></p><p>OpenAI and NVIDIA announced a letter of intent for a strategic infrastructure partnership to deploy at least 10 gigawatts of NVIDIA systems to train and run OpenAI&#8217;s next generation of models. As deployments scale, NVIDIA plans to invest up to $100 billion in OpenAI progressively as each gigawatt is brought online, tying capital to delivered infrastructure. The companies set the first phase to come online in the second half of 2026, built on NVIDIA&#8217;s Vera Rubin platform. The partnership also includes joint roadmap work to co-optimize OpenAI&#8217;s model and infrastructure software with NVIDIA&#8217;s hardware and software stack.</p><p>5. <a href="https://mimo.xiaomi.com/mimo-v2-pro">Xiaomi Releases MiMo-V2-Pro</a></p><p>Xiaomi released MiMo-V2-Pro, its flagship foundation model built for real-world agentic workloads, positioning it as a &#8220;brain&#8221; for systems that orchestrate multi-step workflows and production engineering tasks. 
The model uses an efficient MoE design with over 1T total parameters and 42B active parameters, scales long-context operation to a 1M-token window, and extends Xiaomi&#8217;s Hybrid Attention design by increasing the hybrid ratio from 5:1 to 7:1, with a lightweight multi-token prediction (MTP) layer to speed up generation. Xiaomi reports MiMo-V2-Pro ranks 8th worldwide and 2nd among Chinese LLMs on the Artificial Analysis Intelligence Index, and highlights stronger agent performance on OpenClaw-style evaluations (e.g., PinchBench avg. 81.0 and ClawEval 61.5, listed as #3 globally on both). The model was also publicly tested in stealth on OpenRouter under the name &#8220;Hunter Alpha,&#8221; where Xiaomi says it topped the daily call charts and surpassed 1T tokens in usage. The model is now available globally via Xiaomi&#8217;s developer portal MiMo Studio, Hugging Face, and its API platform.</p><p>6. <a href="https://research.nvidia.com/labs/nemotron/nemotron-cascade-2/">NVIDIA Releases Nemotron-Cascade 2</a></p><p>NVIDIA released Nemotron-Cascade 2, an open-weight 30B Mixture-of-Experts model that activates only ~3B parameters per token, targeting high &#8220;intelligence density&#8221; for reasoning and agent workflows without the usual cost blowups. The flagship checkpoint is Nemotron-Cascade-2-30B-A3B, post-trained from Nemotron-3-Nano-30B-A3B-Base, and it runs in two operating modes, a thinking mode and a non-thinking (instruct) mode, selected through the chat template. NVIDIA reports that it is the second open-weight LLM (after DeepSeek-V3.2-Speciale-671B-A37B) to reach gold-medal&#8211;level performance across the 2025 IMO, IOI, and ICPC World Finals. The core training upgrade is multi-domain on-policy distillation throughout the Cascade RL pipeline, in which the best intermediate &#8220;teacher&#8221; for each domain provides token-level distillation signals to recover regressions and maintain gains across domains. 
NVIDIA also released the full collection of model checkpoints and training datasets alongside the paper.</p><p>7. <a href="https://www.together.ai/blog/mamba-3">Mamba-3: A New State Space Model Frontier</a></p><p>A team of researchers from Carnegie Mellon University (CMU), Princeton University, Together AI, and Cartesia AI has introduced Mamba-3. It is a new state space model (SSM) architecture designed for inference efficiency, shifting the focus from Mamba-2&#8217;s training-first design to faster prefill+decode performance in production. Mamba-3 upgrades the core SSM with a more expressive recurrence (via an exponential-trapezoidal discretization scheme), complex-valued state tracking, and an optional MIMO (multi-input, multi-output) variant that improves accuracy with minimal impact on decode latency. On Together&#8217;s reported latency tests for a ~1.5B model on a single H100-SXM 80GB, Mamba-3 (SISO) delivers the fastest prefill+decode times across sequence lengths, outperforming Mamba-2, Gated DeltaNet, and even a vLLM-served Llama-3.2-1B transformer baseline.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. <a href="https://pub.towardsai.net/claude-code-agent-skills-2-0-from-custom-instructions-to-programmable-agents-ab6e4563c176?sk=54406f373c4a6174aced12d3134df175">Claude Code Agent Skills 2.0: From Custom Instructions to Programmable Agents</a></p><p>This article walks you through the evolution of Claude Code&#8217;s skill system from simple markdown instructions to a full programmable agent platform with subagent execution, dynamic context injection, lifecycle hooks, and formal evaluation. It also covers a formal iterative evaluation loop for testing and improving skills over time, and points to an open Agent Skills standard designed to keep the format portable across AI tools.</p><p>2. 
<a href="https://pub.towardsai.net/loss-landscapes-part-2-f50dc272e3b3">Loss Landscapes: Part 1 (Part 2)</a></p><p>The loss landscape is a surface that maps model weights to loss values, ranging from smooth, convex bowls (simple models, with guaranteed global minima) to rugged, non-convex terrains riddled with local minima and saddle points. This article covers how gradient descent navigates loss landscapes and which tools help it succeed: weight decay to smooth chaotic landscapes, dropout for robustness, residual connections for deep-network stability, and batch/layer normalization to stabilize training dynamics.</p><p>3. <a href="https://pub.towardsai.net/knowledge-distillation-how-a-tiny-model-learned-to-outsmart-its-giant-teacher-eb7f90b63235?sk=b9f56c37061b353e16219a1b679d8779">Knowledge Distillation: How a Tiny Model Learned to Outsmart Its Giant Teacher</a></p><p>The article walks you through why large models carry dark knowledge in their probability distributions that hard labels destroy, and how temperature scaling amplifies those signals for smaller student models to absorb. It lays out the full derivation of the loss function, including the tau-squared compensation. The piece anchors the theory to DeepSeek-R1&#8217;s January 2025 result, in which a distilled student matched or beat its teacher, raising an unresolved question: Does compression reveal latent knowledge or generate entirely new capability?</p><p>4. <a href="https://pub.towardsai.net/three-tasks-one-backbone-a-multi-task-reranker-that-tackles-amazon-search-challenges-34d56d73cafe?sk=e928c2afaec3c96cc78e71cca5f1d3bf">Three Tasks, One Backbone: A Multi-Task Reranker That Tackles Search Challenges</a></p><p>In this article, the author trained a single cross-encoder on Amazon&#8217;s ESCI shopping dataset to handle three tasks simultaneously: graded relevance ranking, 4-class ESCI label classification, and binary substitute detection. 
Rather than training three separate models, the architecture routes a shared BERT backbone&#8217;s [CLS] embedding through three lightweight heads, each optimized with its own loss. The combined weighted loss prioritizes nDCG ranking while using classification and substitute detection as auxiliary regularizers.</p><p>5. <a href="https://blogs.nvidia.com/blog/state-of-ai-report-2026/">NVIDIA State of AI Report 2026</a></p><p>NVIDIA&#8217;s comprehensive report examines how AI drives revenue across industries, covering enterprise adoption patterns, infrastructure scaling trends, and the shift toward agentic AI workflows. The report provides data-driven insights on computing demand, model deployment costs, and the economic impact of generative AI across manufacturing, healthcare, finance, and software development.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/run-llama/liteparse">LiteParse</a> is a standalone OSS PDF parsing tool focused exclusively on fast and light parsing.</p><p>2. <a href="https://github.com/bytedance/deer-flow">Deer Flow</a> is an open-source super agent harness that orchestrates sub-agents, memory, and sandboxes to do almost anything.</p><p>3. <a href="https://github.com/vxcontrol/pentagi">PentAGI</a> is a fully autonomous AI agent system capable of performing complex penetration testing tasks.</p><p>4. <a href="https://github.com/googlecolab/colab-mcp">Colab MCP</a> is Google&#8217;s MCP server for interacting with Colab.</p><h3>Top Papers of The Week</h3><p>1. <a href="https://arxiv.org/abs/2603.17378">Efficient Exploration at Scale</a></p><p>This paper introduces an online learning algorithm that improves the data efficiency of reinforcement learning from human feedback (RLHF). The algorithm incrementally updates reward and language models as choice data is received. 
The reward model is fit to the choice data, while the language model is updated by a variant of REINFORCE, with reinforcement signals provided by the reward model. With Gemma LLMs, this algorithm uses fewer than 20K labels to match the performance of offline RLHF trained on 200K labels.</p><p>2. <a href="https://arxiv.org/abs/2603.18743">Memento-Skills: LLM Agents That Build Task-Specific Agents</a></p><p>This paper introduces Memento-Skills, a generalist, continually learnable LLM agent system that autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, in which reusable skills (stored as structured markdown files) serve as a persistent, evolving memory. It achieves 26.2% and 116.2% relative accuracy improvements without updating LLM parameters.</p><p>3. <a href="https://arxiv.org/abs/2603.15031">Attention Residuals: Learned Layer Aggregation for LLMs</a></p><p>This paper proposes Attention Residuals (AttnRes), which replaces the fixed, uniform accumulation of residual connections in LLMs with softmax attention over preceding-layer outputs. This allows each layer to selectively aggregate earlier representations using learned, input-dependent weights. Tested on Kimi Linear (48B params, 3B activated, 1.4T tokens), AttnRes improves downstream performance and stabilizes output magnitudes and gradient distribution.</p><p>4. <a href="https://arxiv.org/abs/2603.15594">OpenSeeker: Fully Open-Source Search Agent Training Data</a></p><p>This paper introduces OpenSeeker, a fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two techniques: fact-grounded, scalable, controllable QA synthesis, which generates complex multi-hop reasoning tasks with controllable coverage and complexity, and denoised trajectory synthesis, which employs a retrospective summarization mechanism. 
Trained on only 11.7K samples, it significantly outperforms the next-best open-source search agent and surpasses some commercial systems, such as Tongyi DeepResearch.</p><p>5. <a href="https://arxiv.org/abs/2603.13428">EvoClaw: Evaluating AI Agents on Continuous Software Evolution</a></p><p>This paper introduces EvoClaw, a novel benchmark, and the DeepCommit pipeline to evaluate AI agents on continuous, dependency-driven software evolution rather than isolated, one-off coding tasks. Evaluation of 12 frontier models across 4 agent frameworks reveals a critical vulnerability: overall performance scores drop significantly from &gt;80% on isolated tasks to at most 38% in continuous settings.</p><h3>Quick Links</h3><p>1. <a href="https://www.reuters.com/technology/microsoft-weighs-legal-action-over-50-billion-amazon-openai-cloud-deal-ft-2026-03-18/">Microsoft considers legal action over the $50 billion Amazon-OpenAI cloud deal</a> that could violate its exclusive cloud agreement with the ChatGPT maker. The dispute centers on whether OpenAI can offer Frontier via AWS without violating the Microsoft partnership, which requires the startup&#8217;s models to be accessed through the Windows maker&#8217;s Azure cloud platform, the FT report said, citing sources.</p><p>2. <a href="https://nvidianews.nvidia.com/news/ai-agents">NVIDIA released its Agent Toolkit</a>, which provides open source models and software for enterprises and developers building autonomous, self-evolving AI agents. NVIDIA Agent Toolkit includes open models (NVIDIA Nemotron), open agents (NVIDIA AI-Q), open skills (NVIDIA cuOpt), and open runtimes (OpenShell). 
It also supports enterprise software platforms such as Adobe, Atlassian, Box, and Salesforce.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/salesforce-latam-internship-program-experience-design-ux-ui-ai-andamp-salesforce-xwuq">LATAM Internship Program &#8212; Experience Design (UX/UI) @Salesforce (Sao Paulo, Brazil)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/meta-qa-engineering-lead-ai-native-8xpu">QA Engineering Lead, AI Native @Meta (Menlo Park, CA, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/teradata-senior-ai-engineer-kes8">Senior AI Engineer @Teradata (Hyderabad, India)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/nutanix-nlp-architect-4bze">NLP Architect @Nutanix (San Jose, CA, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/highmark-health-prompt-engineer-5dw9">Prompt Engineer @Highmark Health (Remote)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/pacvue-machine-learning-product-summer-intern-z7gs">Machine Learning Product Summer Intern @Pacvue (Remote/USA)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? <a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p>]]></content:encoded></item><item><title><![CDATA[TAI #196: Quiet but Significant Agent Upgrades to Codex (Subagents) and Claude (Context)]]></title><description><![CDATA[Also, Gemini Embedding 2, NVIDIA Nemotron 3 Super, Yann LeCun's $1.03B AMI, Groundsource, Granite 4.0 1B Speech & more!]]></description><link>https://newsletter.towardsai.net/p/tai-196-quiet-but-significant-agent</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-196-quiet-but-significant-agent</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 17 Mar 2026 15:03:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!OpcS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>OpenAI and Anthropic both shipped incremental upgrades this week that sound modest on paper but could reshape how serious developers actually work day to day. 
Elsewhere, Google released Gemini Embedding 2, its first natively multimodal embedding model; NVIDIA released Nemotron 3 Super; Google Research introduced Groundsource, turning global news into structured historical data and launching with a 2.6 million-record urban flash-flood dataset; Yann LeCun&#8217;s new startup AMI raised $1.03 billion at a $3.5 billion pre-money valuation to pursue world-model-heavy AI; and IBM shipped Granite 4.0 1B Speech for compact multilingual speech recognition, now ranked #1 on the OpenASR leaderboard.</p><p>For OpenAI, the key release was Codex subagents. Codex can now spawn specialized agents in parallel to explore, execute, or analyze work concurrently, while keeping the main thread focused on requirements, decisions, and final outputs. OpenAI&#8217;s docs frame this as a solution to &#8220;context pollution&#8221; and &#8220;context rot,&#8221; which is exactly right. One giant thread is fine until it turns into a digital junk drawer full of stack traces, half-failed tests, and exploratory dead ends.</p><p>OpenAI has essentially adopted the core product idea Anthropic pushed first with Claude Code and then more broadly with Cowork: separate the manager from the workers, keep the high-level thread clean, and let specialized agents chew through bounded tasks in parallel. This is a materially better operating model for real work, especially once tasks stop being cute demos and start involving actual codebases, logs, specs, and messy follow-ups. Once a workflow primitive proves itself in real work, the industry converges on it fast.</p><p>The Codex growth numbers indicate where OpenAI thinks the battle stands now. Fidji Simo said more than 1 million businesses run on OpenAI products, Codex now has more than 2 million weekly active users (up nearly 4x since the start of the year), and API usage jumped 20% in the week after GPT-5.4 launched. 
OpenAI has also been expanding Frontier Alliances and pairing forward-deployed engineers with consulting firms to help enterprises actually deploy AI coworkers into real workflows.</p><p>Anthropic&#8217;s quiet but very meaningful move this week was making 1M context generally available for Opus 4.6 and Sonnet 4.6 at standard pricing: no long-context premium, full rate limits across the full window, and media limits expanded to 600 images or PDF pages. On MRCR v2 (8-needle) at 1M tokens, Opus 4.6 scores 78.3%, more than double GPT-5.4&#8217;s 36.6% and roughly triple Gemini 3.1 Pro&#8217;s 25.9%. Even Sonnet 4.6 hits 65.1% at the same context length. At 256K tokens, the field is tighter, with Opus 4.6 at 91.9%, Sonnet 4.6 at 90.6%, and GPT-5.4 at 79.3%, but as context scales up, the drop-off for competitors is steep. (The Gemini numbers were measured by Context Arena on the same MRCR v2 benchmark rather than self-reported by Google.)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OpcS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OpcS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!OpcS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png 848w, 
https://substackcdn.com/image/fetch/$s_!OpcS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!OpcS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OpcS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OpcS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!OpcS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png 848w, 
https://substackcdn.com/image/fetch/$s_!OpcS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!OpcS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407420cb-9340-432e-96ba-4b12e0e76cdd_1600x900.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Anthropic</figcaption></figure></div><p>I did not have Anthropic pegged as the lab most likely to seize the 
long-context narrative in March, but here we are. For a while, long context felt like a Google Gemini story, and then, briefly, like an OpenAI comeback story. Anthropic may now have the strongest claim on the metric that actually matters for professional agentic work: not headline window size, but whether the model can still find the right thing after you bury it under a mountain of tokens.</p><p>That matters enormously for agentic coding and review. The hard sessions are not short snippets. They are the ugly, hours-long runs where the model has read a large diff, test output, monitoring logs, maybe a product doc, maybe a PDF, and still needs to remember why line 37 in a config file matters. A million tokens that actually hold up (and with no price premium for higher context usage) is a real unlock.</p><p>Anthropic also launched Code Review for Claude Code, a research preview system that deploys a team of agents to each pull request. The average review takes around 20 minutes and generally costs $15 to $25. On pull requests over 1,000 lines changed, 84% get findings averaging 7.5 issues, and less than 1% of findings are marked incorrect. Internally, Anthropic says the share of pull requests receiving substantive review comments rose from 16% to 54% after adopting the system.</p><p>That is impressive on its own, but it also reveals something about where the real constraint is shifting. We are getting to the point where a strong developer with good agents can generate code much faster than the surrounding review process can absorb it. You only get to bank AI productivity if the code is trustworthy enough to merge. Otherwise, you just manufacture more uncertainty at a higher speed.</p><p>And for now, humans still need to understand the code. 
Despite recent leaps, AI remains a jagged intelligence, tireless and elegant at parallel exploration, then suddenly blind to the one buried business rule that everyone on the team &#8220;just knows.&#8221; The best results still come from expert developers who nudge early, critique the plan, steer the agents mid-run, and know when the model has wandered off course.</p><p>There is a plausible future where this flips. Self-driving cars offer a template: at first, the human is the safety layer, maintaining full responsibility in driver-assist systems, but eventually, AI reliability improves, and the human starts to look like the unpredictable failure mode. Coding could follow a similar arc. If AI-written code eventually has fewer bugs than human-written code, and humans mostly add net bugs by tweaking systems they no longer fully understand, then full autonomy on some classes of software work will start to look rational. We are not there yet. Right now, the highest-return setup is expert human plus agent swarm.</p><div><hr></div><h3>Why should you care?</h3><p>Once a workflow pattern becomes obviously useful, the industry converges on it fast. 
Claude Code and Cowork proved that splitting work into parallel threads beats forcing one bloated session to play every role at once. OpenAI now agrees. Long context follows the same pattern: every lab wants it, but Anthropic&#8217;s 78.3% on MRCR v2 at 1M tokens versus GPT-5.4&#8217;s 36.6% is a real gap when pushing agents to their limits. That the expanded context is available without a price premium also suggests a more fundamental architectural or inference breakthrough. Partly because California has no non-compete clauses and staff turnover between the labs is high, and partly because many researchers across AI labs are good friends who attend the same parties, we can expect these breakthroughs to disperse quickly across the leading model families (so long as each lab has enough compute to keep up!).</p><p>Meanwhile, Codex, with 2M+ weekly active users (nearly 4x since January), alongside a growing army of forward-deployed engineers, tells the full story of where we are. The models are strong enough to be useful everywhere, but alien enough that bridging the gap between raw capability and reliable daily workflow is now the main job. 
The developers who learn that bridging skill fastest will pull away from everyone still using AI as fancy autocomplete.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><p><strong>This issue is brought to you thanks to <a href="http://serpapi.com/">SerpApi</a>:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="http://serpapi.com/" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WnmL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WnmL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:713837,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;http://serpapi.com/&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.towardsai.net/i/191253155?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WnmL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!WnmL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664a81f4-6b55-428c-993c-a657d90036f8_2560x1440.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>LLMs are powerful. But without fresh information, they can hallucinate or miss context.</p><p>SerpApi helps AI applications access real-time search data from search engines like Google, Bing, Amazon, and more via a simple API.</p><p>Get clean, structured JSON results and power AI agents, research tools, and data-driven applications without managing scrapers.</p><p><a href="http://serpapi.com/">Start with 250 free credits/month by signing up at SerpApi today</a>!</p><div><hr></div><h4>A Quick Look at AI Adoption at Empower</h4><p>Much of the conversation around AI in the workplace focuses on frontier models and benchmark scores, but the more revealing signal is what&#8217;s happening inside real businesses right now. 
At <a href="https://uk.linkedin.com/company/empower-technical-services">Empower Technical Services</a>, a leading UK technical services provider co-founded by our own Denis Piffaretti, teams across the C-suite, HR, and M&amp;A are <a href="https://www.empowertechnicalservices.com/blogs/how-empower-is-harnessing-the-power-of-ai">using AI today to stress-test executive analysis, surface gaps in employment contracts, and compress weeks of acquisition research into hours</a>. What stands out isn&#8217;t any single use case but the shared mindset: AI as a quality amplifier, not a corner-cutter. If you&#8217;re thinking about how to move your own organisation from AI curiosity to genuine day-to-day integration, this piece is worth a read.</p><div><hr></div><h3>Hottest News</h3><p>1. <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/">Google Releases Gemini Embedding 2</a></p><p>Google launched Gemini Embedding 2, its first natively multimodal embedding model. It maps text, images, videos, audio, and PDFs into a single shared embedding space, so multimodal retrieval and classification no longer require separate embedding models for each modality. It supports up to 8,192 input tokens, up to 6 images per request, up to 120 seconds of video, and PDFs up to 6 pages, and it can take interleaved inputs (for example, image + text in the same request). Output vectors default to 3,072 dimensions, with recommended smaller sizes of 1,536 or 768; Matryoshka Representation Learning lets users trade storage against quality. Google is offering it in public preview via the Gemini API and Vertex AI, and highlights support through common ecosystem tooling, including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, and ChromaDB.</p><p>2. 
<a href="https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/">NVIDIA Releases Nemotron 3 Super</a></p><p>NVIDIA open-sourced Nemotron 3 Super, a 120B (12B active) long-context model built to reduce the &#8220;thinking tax&#8221; for agents. Nemotron 3 Super is a 120B total/12B active hybrid Mamba-Transformer MoE model with native 1M-token context, designed to keep multi-step agent workflows coherent without context blowups. NVIDIA positions the release around compute efficiency for complex multi-agent workloads (such as software development and cybersecurity triage) and reports 5&#215;+ throughput over the prior Nemotron Super. The architecture combines a LatentMoE hybrid stack (Mamba-2 + MoE + attention) with multi-token prediction (MTP), and the model supports a configurable reasoning mode (toggleable via the chat template). The release is fully open, with datasets, recipes, and model weights published on Hugging Face and an official model card on NVIDIA&#8217;s platform.</p><p>3. <a href="https://www.wired.com/story/yann-lecun-raises-dollar1-billion-to-build-ai-that-understands-the-physical-world/">Yann LeCun Raises $1 Billion to Build AI That Understands the Physical World</a></p><p>Yann LeCun&#8217;s new startup, Advanced Machine Intelligence (AMI), raised $1.03B to build &#8220;world model&#8221; AI. Reuters reports AMI raised $1.03 billion at a $3.5 billion pre-money valuation, and that the company is aiming for systems that can reason, plan, and understand the world, rather than relying solely on next-token (or next-pixel) prediction. LeCun has argued that this shift is required for broadly capable autonomous agents, and AMI&#8217;s near-term focus is on organizations operating complex systems, such as automotive, aerospace, biomedical, and pharmaceutical firms, with consumer applications (including robotics) positioned as later-stage.</p><p>4. 
<a href="https://claude.com/blog/code-review">Anthropic Releases Claude Code Review</a></p><p>Anthropic is introducing Claude Code Review, a multi-agent PR review system now in research preview for Team and Enterprise. Claude Code Review dispatches multiple agents when a pull request opens, has them search for bugs in parallel, cross-verify findings to reduce false positives, and then rank issues by severity. Anthropic reports internal results showing that on large PRs (1,000+ lines changed), 84% receive findings with an average of 7.5 issues, while smaller PRs (&lt;50 lines) see findings 31% of the time with an average of 0.5 issues; fewer than 1% of surfaced findings are marked incorrect by engineers. Pricing is token-based, with typical reviews costing $15 to $25, depending on PR size and complexity.</p><p>5. <a href="https://research.google/blog/introducing-groundsource-turning-news-reports-into-data-with-gemini/">Google AI Introduces Groundsource</a></p><p>Google Research released Groundsource and a 2.6M-record global dataset of urban flash flood events extracted from news. Groundsource is a methodology that uses Gemini to convert unstructured global news into structured, verified historical disaster data. It analyzes news reports where flooding is a primary subject and then uses the Google Read Aloud user agent to isolate the primary text of reports in 80 languages, which is then standardized into English via the Cloud Translation API. The first release is an open-access dataset of 2.6 million historical urban flash flood events spanning 150+ countries, built by identifying flood-related news reports and extracting event details and locations at scale.</p><p>6. <a href="https://huggingface.co/blog/ibm-granite/granite-4-speech?">IBM AI Releases Granite 4.0 1B Speech</a></p><p>IBM has released Granite 4.0 1B Speech, a compact speech-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST). 
With only half the parameters of its predecessor, granite-speech-3.3-2b, the model delivers higher English transcription accuracy, faster inference through speculative decoding, and expanded language support, now covering English, French, German, Spanish, Portuguese, and Japanese. The release adds Japanese ASR and keyword list biasing for more targeted transcription workflows. It supports deployment through Transformers, vLLM, and mlx-audio, including Apple Silicon environments. Granite 4.0 1B Speech ranked #1 on the OpenASR leaderboard.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. <a href="https://pub.towardsai.net/the-kv-cache-the-invisible-engine-behind-every-llm-response-aae7eebcf8c3?sk=5f14c69ba85e63f460678ceadee8a360">The KV Cache: The Invisible Engine Behind Every LLM Response</a></p><p>Without the KV Cache, LLMs would recompute the Key and Value projections for every previously seen token at each generation step, an O(T&#178;) inefficiency that makes real-time responses impractical. This piece breaks down exactly how the cache works: storing Key and Value vectors per layer while discarding Query vectors, which are provably single-use. It walks through prefill vs. decode phases, the memory cost formula, and why that cost compounds across sequence length, batch size, and model scale. It also covers how production systems respond with GQA, quantization, PagedAttention, and sliding-window attention, each targeting a specific variable within the same core equation.</p><p>2. <a href="https://pub.towardsai.net/context-pollution-do-llms-benefit-from-their-own-words-e21984ea53c5?sk=08dab6a27787ecc48508b1c49466ca18">Context Pollution: Do LLMs Benefit From Their Own Words?</a></p><p>New research from MIT and IBM Research challenges a core assumption behind every major chatbot: that keeping full conversation history always improves model performance. 
The study introduced Assistant-Omitted prompting, which strips prior AI responses from the context sent with each new message, and found that quality rarely dropped and sometimes improved. Over a third of real-world user messages were standalone questions requiring no prior context. More concerning, early model errors were found to quietly persist across conversation turns, a phenomenon the researchers termed context pollution. A lightweight classifier was proposed to adaptively manage context, cutting token usage by roughly 30% with minimal quality trade-off.</p><p>3. <a href="https://pub.towardsai.net/the-new-nano-banana-2-ocr-claude-code-powerful-ai-ocr-pdf-editor-3bdd7aafc874?sk=ed8526e841aef0614ca6948b9edd5e87">The New Nano Banana 2 + OCR + Claude Code = Powerful AI OCR PDF Editor</a></p><p>This article walks through a hands-on demo of Google&#8217;s newly released Nano Banana 2 and provides a practical guide to building an AI-powered PDF editor. Nano Banana 2 is combined with Claude for prompt refinement and Tesseract OCR for text layer reconstruction, forming an agentic pipeline that edits or inserts slides based on user instructions. The system processes multiple pages in parallel, preserves original layouts, and outputs fully searchable PDFs. Beyond the technical build, the author weighs Nano Banana 2 against Nano Banana Pro, noting meaningful gains in text accuracy, 4K support, web-referenced generation, and a significantly lower cost per image.</p><p>4. <a href="https://pub.towardsai.net/information-topology-in-multi-agent-systems-cb925c5b86d9">Information Topology in Multi-Agent Systems: as a Behavioral Parameter</a></p><p>Information flow between AI agents is often treated as an afterthought; this article argues it shouldn&#8217;t be. The author built a multi-agent orchestration platform using Python and the Strands SDK to run a controlled Prisoner&#8217;s Dilemma experiment, isolating information topology as the sole variable. 
Across three phases (blind, partial, and full transparency), the same agents, given identical instructions, exhibited measurably different behaviors. Partial information pushed a cooperative agent toward identity-driven decisions, while full transparency made it more calculated. The exploitative agent, however, remained unaffected throughout. The key takeaway here is that what an agent knows is as architecturally significant as what it&#8217;s told to do.</p><p>5. <a href="https://pub.towardsai.net/to-relu-or-not-to-relu-a-practitioners-guide-to-solve-the-zombie-neuron-problem-in-deep-89a050a6b25b">To ReLU, or not to ReLU: A Practitioner&#8217;s Guide to Solve the &#8220;Zombie Neuron&#8221; Problem in Deep Networks</a></p><p>ReLU activation functions have long been the default choice in deep learning, but they carry a critical flaw, the dying neuron problem. When neurons receive consistently negative inputs during training, their gradients become zero, permanently halting learning and creating what the author calls a zombie network. Through a controlled PyTorch experiment on Fashion-MNIST, the article visually demonstrates this failure mode, showing 99.2% neuron death under standard ReLU, compared with healthy activation distributions with Leaky ReLU. It also evaluates practical alternatives such as Leaky ReLU, PReLU, ELU, Swish, and GELU.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/obra/superpowers">Superpowers</a> is a software development workflow for coding agents, built on top of a set of composable &#8220;skills.&#8221;</p><p>2. <a href="https://github.com/lightpanda-io/browser">Lightpanda</a> is a headless browser for AI agents and automation.</p><p>3. <a href="https://github.com/garrytan/gstack">Gstack</a> is an open-source toolkit that packages Claude Code into 8 opinionated workflow skills backed by a persistent browser runtime.</p><p>4. 
<a href="https://github.com/volcengine/OpenViking">OpenViking</a> is an open-source context database designed specifically for AI agents (such as OpenClaw).</p><p>5. <a href="https://github.com/open-jarvis/OpenJarvis">OpenJarvis</a> is an opinionated framework for local-first personal AI, built around shared primitives and a learning loop that improves models using local trace data.</p><p>6. <a href="https://github.com/topoteretes/cognee">Cognee</a> is an open-source knowledge engine that lets you ingest data in any format and continuously learns to provide the right context for AI agents.</p><h3>Top Papers of The Week</h3><p>1. <a href="https://arxiv.org/abs/2603.12228">Neural Thickets: Task Experts Are Dense Around Pretrained Weights</a></p><p>This paper views the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. It shows that in small models, such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models, the density of task experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Building on this, the authors propose a trivially simple parallel post-training method: randomly sample N parameter perturbations, select the top K, and ensemble via majority voting. This approach matches the performance of PPO, GRPO, and ES on contemporary large-scale models without any gradient-based optimization.</p><p>2. <a href="https://arxiv.org/html/2603.12246v1">Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training</a></p><p>This paper investigates the effectiveness of using reasoning large language models as judges for reinforcement learning-based alignment in domains where output correctness cannot be directly verified. 
The authors discover that while reasoning judges outperform non-reasoning ones in preventing standard reward hacking, they inadvertently train policies to achieve high scores by generating sophisticated adversarial outputs that deceive evaluators.</p><p>3. <a href="https://github.com/MoonshotAI/Attention-Residuals/blob/master/Attention_Residuals.pdf">Attention Residuals</a></p><p>This paper proposes Attention Residuals (AttnRes) as a drop-in replacement for standard residual accumulation. Instead of forcing every layer to consume the same uniformly mixed residual stream, AttnRes lets each layer aggregate earlier representations using softmax attention over depth. The core idea is simple: if attention improves sequence modeling by replacing fixed recurrence over time, the same mechanism can be applied to a network&#8217;s depth dimension.</p><p>4. <a href="https://arxiv.org/abs/2603.07236">HY-WU: An Extensible Functional Neural Memory Framework</a></p><p>HY-WU (Weight Unleashing) proposes a fundamentally different approach to model adaptation: instead of overwriting shared weights at each update, a neural generator module stores functional memory and synthesizes instance-specific weight updates dynamically based on runtime conditions. The framework targets the core limitation of static inference: &#8220;a single parameter vector regardless of user intent,&#8221; enabling personalization and continual learning without catastrophic interference between objectives. Part I of a multi-part series demonstrates the approach on text-guided image editing.</p><h3>Quick Links</h3><p>1. <a href="https://docs.langchain.com/oss/python/deepagents/overview">LangChain releases Deep Agents</a>, an agent harness built on LangChain and the LangGraph runtime. It includes a built-in &#8216;write_todos&#8217; tool for planning and task decomposition. It uses filesystem tools to manage large contexts and supports persistent memory across threads.</p><p>2. 
<a href="https://huggingface.co/zai-org/GLM-OCR">Zhipu AI introduces GLM-OCR</a>, a compact 0.9B multimodal OCR model built with a 0.4B CogViT encoder and 0.5B GLM decoder. It uses Multi-Token Prediction (MTP) to improve decoding efficiency, achieving an average of 5.2 tokens per step and about 50% higher throughput. It scores 94.6 on OmniDocBench v1.5, 94.0 on OCRBench (Text), 96.5 on UniMERNet, 85.2 on PubTabNet, and 86.0 on TEDS_TEST.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/google-senior-research-engineer-cloud-ai-research-xonu">Senior Research Engineer, Cloud AI Research @Google (Sunnyvale, CA, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/microsoft-corporation-applied-ai-engineer-ii-y823">Applied AI Engineer II @Microsoft Corporation (Bangalore, India)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/oracle-master-principal-cloud-engineer-gpu-and-ai-infrastructure-s9oz">Master Principal Cloud Engineer&#8202;&#8212;&#8202;GPU &amp; AI Infrastructure @Oracle (Shanghai, China)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/coinbase-engineering-manager-payments-platform-adf4">Engineering Manager&#8202;&#8212;&#8202;Payments Platform @Coinbase (Multiple US Locations)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/ups-senior-ai-engineer-python-rag-agentic-ai-adk-mcp-gcp-vertex-ai-ibm-watsox-mgh9">Senior AI Engineer @UPS (India)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/huckleberry-labs-engineering-manager-remote-8owz">Engineering Manager @Huckleberry Labs (Remote)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/panopto-ai-engineer-6xsk">AI Engineer @Panopto (Remote/USA)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? 
<a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p>]]></content:encoded></item><item><title><![CDATA[TAI #195: GPT-5.4 and the Arrival of AI Self-Improvement?]]></title><description><![CDATA[Also, Gemini 3.1 Flash-Lite, Karpathy's Autoresearch, Qwen 3.5 Small, Copilot Cowork & more]]></description><link>https://newsletter.towardsai.net/p/tai-195-gpt-54-and-the-arrival-of</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-195-gpt-54-and-the-arrival-of</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 10 Mar 2026 14:54:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cq4-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>Two stories dominated this week that look unrelated but tell the same story. On Thursday, OpenAI released GPT-5.4, its most work-oriented frontier model to date. 
On Sunday, Andrej Karpathy posted results from his autoresearch experiment, showing that AI agents can autonomously find real, transferable improvements to neural network training. I think this combination marks a turning point: AI is becoming a closed-loop improver of its own stack.</p><p>OpenAI released GPT-5.4 on March 5 as GPT-5.4 Thinking in ChatGPT, gpt-5.4 and gpt-5.4-pro in the API, and GPT-5.4 in Codex. It folds GPT-5.3-Codex&#8217;s coding strengths into the mainline model and adds native computer use, tool search, an opt-in 1M-token context window (272K default), native compaction, and a steerable preamble in ChatGPT that lets users redirect the model mid-task. Pricing has stepped up to $2.50/$15 per million input/output tokens for the base model and $30/$180 for Pro, though increased token efficiency largely cancels this out in our tests. Requests exceeding 272K input tokens cost 2x more.</p><p>The release cadence is also notable: GPT-5.2 in December, GPT-5.3-Codex on February 5, Codex-Spark on February 12, GPT-5.3 Instant on March 3, and GPT-5.4 on March 5. An OpenAI staff member on the developer forum said it plainly: &#8220;monthly releases are here.&#8221; The progress now comes from post-training, eval loops, reasoning-time controls, tool selection, memory compaction, and product integration. The base model race still matters, but the surrounding engineering is where gains compound fastest.</p><p>GPT-5.4 is another leap in many dimensions, but not a clean knockout. On Artificial Analysis&#8217;s Intelligence Index, it ties Gemini 3.1 Pro Preview at 57. On LiveBench, GPT-5.4 Thinking xHigh barely leads Gemini 3.1 Pro Preview, 80.28 vs. 79.93. On the Vals benchmark grid, the picture is splintered: GPT-5.4 leads ProofBench, IOI, and Vibe Code Bench; Gemini 3.1 Pro leads LegalBench, GPQA, MMLU Pro, LiveCodeBench, and Terminal-Bench 2.0; Claude Opus 4.6 leads SWE-bench; Claude Sonnet 4.6 leads the broad Vals composite and Finance Agent. 
There is no single best frontier model anymore.</p><p>OpenAI&#8217;s benchmark story this time is unusually workplace-centric. On GDPval, which tests real knowledge work across 44 occupations, GPT-5.4 achieves 83.0% vs. 70.9% for GPT-5.2. On internal spreadsheet modeling tasks, 87.3% vs. 68.4%. On OSWorld-Verified for desktop navigation, 75.0%, surpassing the human baseline of 72.4% and nearly doubling GPT-5.2&#8217;s 47.3%. On BrowseComp, 82.7%, with Pro reaching 89.3%. OpenAI claims 33% fewer false claims and 18% fewer error-containing responses vs. GPT-5.2. Mainstay reported that across roughly 30,000 HOA and property-tax portals, GPT-5.4 hit 95% first-try success and 100% within three tries, about 3x faster while using 70% fewer tokens. Harvey&#8217;s BigLaw Bench: 91%.</p><p>Despite continued progress on GDPval, I think OpenAI still has an interface gap for white-collar work. GPT-5.4&#8217;s preamble and mid-response steering are genuinely useful. ChatGPT for Excel and the new financial-data integrations are a smart wedge into high-value workflows. But OpenAI still does not have a broad non-developer surface as friendly as Claude Cowork for delegating messy cross-file, cross-app, real-world office work. Codex and the API now have serious computer-use capability, but the overall experience still leans more technical than it probably needs to if OpenAI wants to dominate the everyday white-collar desktop.</p><p>Microsoft moved quickly on that front this week with Copilot Cowork. The company announced that it is integrating the technology behind Claude Cowork directly into Microsoft 365 Copilot, with enterprise controls, security positioning, and pricing under the existing Microsoft 365 Copilot umbrella. That gives Microsoft a clear distribution advantage because Word, Excel, PowerPoint, Outlook, and Teams are already where a large share of office work happens. 
But Microsoft&#8217;s execution so far has often felt like a company with perfect distribution and only intermittent product urgency. OpenAI and Anthropic, by contrast, have generally been sharper at making people actually want to use the thing. Microsoft still has the installed base. The question is whether it can convert that into a genuine product pull before the model labs sell their own work agents more directly into the enterprise.</p><p>The other story this week that matters just as much, even if it looks smaller on paper, is Andrej Karpathy&#8217;s autoresearch experiment. Karpathy publicly reported that after about two days of autonomous tuning on a small nanochat training loop, his LLM agent found around 20 additive changes that transferred from a depth-12 proxy model to a depth-24 model and reduced &#8220;Time to GPT-2&#8221; from 2.02 hours to 1.80 hours, roughly an 11 percent improvement. The autoresearch repository describes the setup: give an AI agent a small but real LLM training environment, let it edit the code, run short experiments, check whether validation improves, and repeat overnight.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cq4-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cq4-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png 424w, 
https://substackcdn.com/image/fetch/$s_!cq4-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png 848w, https://substackcdn.com/image/fetch/$s_!cq4-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png 1272w, https://substackcdn.com/image/fetch/$s_!cq4-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cq4-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png" width="1456" height="726" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:726,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cq4-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png 424w, 
https://substackcdn.com/image/fetch/$s_!cq4-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png 848w, https://substackcdn.com/image/fetch/$s_!cq4-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png 1272w, https://substackcdn.com/image/fetch/$s_!cq4-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb798a955-5961-47d6-84ff-957ef2e3570e_1600x798.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: Andrej Karpathy. Autoresearch progress optimising nanochat over 2 days.</figcaption></figure></div><p>A lot of people immediately reached for the &#8220;this is just hyperparameter tuning&#8221; line. I think that misses the economic point. If an agent swarm can reliably explore optimizer settings, attention tweaks, regularization choices, data-mixture recipes, initialization schemes, and architecture details on cheap proxy runs, then promote the promising changes to larger scales, that is already an extremely valuable research process even if it does not look like a lone synthetic scientist inventing an entirely new paradigm from scratch. Frontier research is full of bounded search problems with delayed but measurable feedback. That is exactly the terrain where agents can start compounding.</p><p>This is the trajectory I expect from here. Labs will give swarms of agents meaningful GPU budgets to run thousands of small and medium experiments on proxy models. They will search for better attention mechanisms, better optimizer schedules, better training curricula, better post-training recipes, and better evaluation harnesses. The promising ideas will then get promoted upward through progressively larger training runs. Human experts will stay in the loop at the obvious choke points: deciding which metrics matter, spotting false positives, designing new search spaces, choosing which ideas deserve expensive scale-up, and co-designing the higher-stakes modifications once you are dealing with real parameter counts and serious training-flop budgets. But the inner loop of &#8220;propose, implement, test, compare, iterate&#8221; is increasingly looking automatable.</p><p>We already have hints that the labs are on the first rung of this ladder. 
OpenAI stated that GPT-5.3-Codex was the first model &#8220;instrumental in creating itself,&#8221; with early versions used to debug its own training, manage deployment, and diagnose evaluations. To be precise, OpenAI has been much more explicit publicly about self-development in GPT-5.3-Codex than in GPT-5.4 itself. But the direction of travel is hard to miss.</p><p>There is also an important nuance from OpenAI&#8217;s GPT-5.4 system card. The company says GPT-5.4 Thinking does not meet its threshold for High capability in AI self-improvement, which it defines as roughly the level of a performant mid-career research engineer. I think that distinction matters, but probably in the opposite way some skeptics assume. The threshold for economically useful self-improvement is much lower than the threshold for autonomous frontier research. A model does not need to be a synthetic principal scientist to improve prompts, evaluations, tooling, scaffolds, training recipes, and smaller-model experiments around itself. That lower threshold is the one that accelerates everything else.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.towardsai.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Towards AI Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3>Why should you care?</h3><p>The center of gravity in AI has moved from &#8220;smart chatbot&#8221; to &#8220;reliable operator.&#8221; The winning system is no longer the one that writes the prettiest single answer. It is the one that can stay on task for an hour, use the right tools without drowning in token overhead, operate ugly software that nobody exposed through clean APIs, compress its own history, and let a human steer without restarting the whole job. GPT-5.4, Codex, Opus 4.6&#8217;s agent teams, Gemini CLI, Microsoft&#8217;s Copilot Cowork, and Karpathy&#8217;s autoresearch all point in the same direction.</p><p>This is why GDPval matters more than GPQA or MMLU. The trajectory from 12.4% with GPT-4o to 83.0% with GPT-5.4 in roughly 18 months does not measure chatbot cleverness. It measures how close AI is to replacing the actual output of knowledge workers on well-specified tasks. We are past the halfway mark, and the curve is steepening. That said, GDPval still has obvious limitations, and we hope the project receives more funding from OpenAI to expand the benchmark and test more multistage, longer-time-horizon agentic tasks.</p><p>And Karpathy&#8217;s autoresearch extends the same logic inward. If agents can reliably improve the training stack itself, the rate of improvement compounds. I expect frontier labs to give agent swarms meaningful GPU budgets this year to explore attention mechanisms, optimizer variants, and dataset recipes on small proxies before scaling the winners. Human researchers will co-design at scale. 
My guess is that by year end, we may well see a leading model whose development was materially shaped by this kind of autonomous AI research loop. I do not mean fully autonomous in the science-fiction sense. I mean that a meaningful fraction of the attention tweaks, optimizer choices, data-recipe changes, post-training methods, and eval fixes will have been discovered, filtered, and iterated by agent systems running at scale, with human researchers acting more like high-level architects, judges, and escalation points. That no longer feels speculative to me. It feels like the next obvious hill for reinforcement learning during post-training.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><h3>Hottest News</h3><p>1. <a href="https://openai.com/index/introducing-gpt-5-4/">OpenAI Introduced GPT-5.4</a></p><p>OpenAI released GPT-5.4, a new frontier model designed for professional work, with GPT-5.4 Thinking available in ChatGPT, the API, and Codex, and GPT-5.4 Pro offered for users who want maximum performance on complex tasks. GPT-5.4 consolidates OpenAI&#8217;s recent gains in reasoning, coding, and agent workflows into a single model, bringing GPT-5.3-Codex&#8211;level coding strength while improving tool use across software environments and knowledge-work tasks like spreadsheets, presentations, and documents. In ChatGPT, GPT-5.4 Thinking can show an upfront plan so users can steer mid-response, and it improves deep web research and long-context handling. In the API and Codex, GPT-5.4 is the first general-purpose OpenAI model with native, state-of-the-art computer-use capabilities, and it supports up to 1M tokens of context for longer-horizon agents. OpenAI also highlights a tool search for navigating large tool ecosystems and improved token efficiency compared to GPT-5.2. 
On reported evaluations, GPT-5.4 scores 83.0% on GDPval, 57.7% on SWE-Bench Pro (Public), 75.0% on OSWorld-Verified, 54.6% on Toolathlon, and 82.7% on BrowseComp.</p><p>2. <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/">Google Introduced Gemini 3.1 Flash-Lite</a></p><p>Google released Gemini 3.1 Flash-Lite as the most cost-efficient model in the Gemini 3 lineup, built for high-throughput workloads where latency and cost matter. A new architectural control lets developers programmatically set the model&#8217;s &#8220;thinking&#8221; level (Minimal, Low, Medium, or High), so that they can trade off speed against reasoning depth based on task complexity. Flash-Lite supports multimodal inputs (text, image, video) with a standard 128K context window. Pricing is set at $0.25 per 1M input tokens and $1.50 per 1M output tokens, and Google reports it outperforms Gemini 2.5 Flash with a 2.5&#215; faster time-to-first-token and 45% higher output speed.</p><p>3. <a href="https://x.com/Alibaba_Qwen/status/2028460046510965160">Qwen Introduces the Qwen 3.5 Small Model Series</a></p><p>Alibaba released Qwen 3.5 Small, a family of 0.8B to 9B models, built for on-device and edge deployment. Qwen3.5-0.8B and Qwen3.5-2B target high-throughput, low-latency applications on constrained hardware. Qwen3.5-4B serves as a lightweight multimodal base suited for small agents, while Qwen3.5-9B is tuned for reasoning and logic. The 9B model uses Scaled Reinforcement Learning to optimize for reliable reasoning trajectories, not just next-token prediction, and is presented as narrowing the performance gap with models 5&#215; to 10&#215; larger.</p><p>4. 
<a href="https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/">Microsoft Releases Phi-4-Reasoning-Vision-15B</a></p><p>Microsoft launched Phi-4-Reasoning-Vision-15B, a 15B-parameter, open-weight multimodal model designed for reasoning over images and text. It pairs the Phi-4-Reasoning language backbone with a SigLIP-2 vision encoder through a mid-fusion architecture, targeting compact but capable multimodal reasoning for math, science, documents, and GUI understanding. Training mixes reasoning and non-reasoning data so the model can switch between think and nothink modes depending on whether the task benefits from explicit reasoning or direct perception-based output. Microsoft highlights two primary use cases: visual scientific reasoning (handwritten equations, diagrams, charts, tables, and quantitative documents) and computer-use agent tasks, in which the model interprets screens, localizes UI elements, and supports interaction across desktop, web, and mobile interfaces.</p><p>5. <a href="https://x.com/trq212/status/2028628570692890800">Voice Mode Rolls Out to Claude Code</a></p><p>Anthropic is adding Voice Mode to Claude Code with a staged rollout and a broader release planned over the next few weeks. Once enabled with /voice, users can speak a command and have Claude Code execute it, reducing the friction of switching between typing, navigating, and issuing multi-step instructions. This matters because coding assistants are increasingly competing on end-to-end workflow speed, not just code quality. As agents take on longer tasks, the interface becomes part of reliability and control. Voice input is a practical step toward &#8220;always-available&#8221; agent operation, useful when developers need quick corrections, clarifications, or steering without breaking flow.</p><p>6. 
<a href="https://mistral.ai/industry/finance">Mistral AI Launches AI Services for Finance</a></p><p>Mistral introduced a suite of AI services tailored for financial institutions that run within a firm&#8217;s own infrastructure, keeping sensitive data out of third-party systems. The offering targets core finance use cases, such as automating compliance and risk checks and enabling search across internal sources, including policies, credit files, and proprietary research. As banks and asset managers push AI deeper into regulated processes, data control and auditability become the gating constraints. This shift is pushing vendors to compete on private deployment, governance, and security boundaries.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. <a href="https://pub.towardsai.net/beyond-the-basics-advanced-local-ai-coding-workflows-and-model-optimization-part-2-c023babae088?sk=8e17c521a30b69f9249aedd15b18145e">Beyond the Basics: Advanced Local AI Coding Workflows and Model Optimization</a></p><p>This guide walks through creating a local AI coding environment using constrained setups as well as high-end workstations. It includes details on model selection, hardware tiers, GPU and CPU optimization strategies, context window management, and storage improvements. It also introduces practical automation workflows (pre-commit code-review hooks, documentation generators, and multi-agent pipelines) and prompting techniques such as chain-of-thought and few-shot patterns to improve output quality.</p><p>2. <a href="https://pub.towardsai.net/understanding-the-loss-landscape-of-modern-ai-models-7802247017bd?sk=9593e31319e4010070c310135262ec4d">Understanding Loss Landscapes of Modern AI Models</a></p><p>Neural networks are often described as black boxes, but loss landscape visualization offers a structured way to examine how they learn and generalize. 
This article walks through the mechanics of loss landscapes, from 2-parameter models in which full surfaces can be plotted, to large-scale LLMs in which only 2D cross-sections are possible. It covers key techniques, including directional probing, PCA-based direction selection, and normalization methods such as filter and layer normalization. It also addresses a common misconception: that training trajectories follow the plotted surface. Finally, it connects landscape geometry to real-world model behavior, showing that flat minima consistently correlate with better generalization.</p><p>3. <a href="https://pub.towardsai.net/beyond-model-fit-demystifying-gradient-descent-from-scratch-003dd0241ddf">Beyond model.fit(): Demystifying Gradient Descent from Scratch</a></p><p>Most machine learning practitioners call model.fit() without understanding what happens underneath. This article breaks down Gradient Descent from scratch using pure Python and NumPy, covering all three variants (Batch, Stochastic, and Mini-Batch) with clean implementations and clear mathematical foundations. Beyond the code, it addresses three common failure points: poor feature scaling, non-convex loss landscapes, and poorly chosen learning rates. It also shows how each variant behaves during training using loss curves and contour path plots.</p><p>4. <a href="https://pub.towardsai.net/structured-video-captioning-with-gemini-an-mma-analysis-use-case-bfbb8fd91a26">Structured Video Captioning with Gemini: An MMA Analysis Use Case</a></p><p>This article covers how Gemini&#8217;s video understanding capabilities can be applied to structured video captioning, using MMA fight analysis as a test case. The authors split fight footage into 30-second segments to manage token limits, then used prompt chaining to extract timestamped action breakdowns and convert them into structured JSON via Pydantic models. 
They extended this with a multi-agent workflow, where discipline-specific specialists analyzed striking, grappling, submissions, and movement in parallel before a head coach model synthesized the findings.</p><p>5. <a href="https://pub.towardsai.net/turning-microsoft-onenote-into-an-ai-powered-knowledge-system-a-practical-low-cost-blueprint-32d8082c6d73?sk=b4b4ef697d48b33220be143526465998">Turning Microsoft OneNote Into an AI-Powered Knowledge System: A Practical, Low-Cost Blueprint Using OCR and RAG</a></p><p>Many organizations rely on Microsoft OneNote as a central knowledge repository, yet most of that content remains unsearchable and unstructured. This article walks through a four-layer architecture that addresses this gap by combining Microsoft Graph, Azure Document Intelligence, ChromaDB, and GPT-4o. Each layer handles a distinct responsibility: extracting OneNote content, normalizing attachments, applying OCR and embeddings, and delivering a Streamlit interface for validation and conversational search. The author also emphasizes that this type of proof-of-concept rarely requires a significant budget and is often implementable for a few hundred dollars, making it a practical starting point for organizations.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/karpathy/autoresearch">AutoResearch</a> is a minimalist Python tool designed to enable AI agents to autonomously conduct machine learning experiments.</p><p>2. <a href="https://github.com/googleworkspace/cli">CLI</a> for all of Google Workspace. Includes 40+ agent skills.</p><p>3. <a href="https://github.com/android-bench/android-bench">Android Bench</a> is a framework for benchmarking LLMs on Android development tasks.</p><p>4. <a href="https://github.com/langwatch/langwatch">LangWatch</a> is a platform for LLM evaluations and AI agent testing.</p><h3>Top Papers of The Week</h3><p>1. 
<a href="https://www.nature.com/articles/s41467-025-67998-6">Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models</a></p><p>This paper argues that for LLMs to be used as agents that interact with users and with the world, they must construct representations of the world and form probabilistic beliefs about them. Researchers propose a Bayesian inference framework that lays out the optimal way for an agent to update its beliefs as it receives new information. Teaching LLMs to mimic the predictions of the normative Bayesian model can dramatically improve their ability to update their beliefs, and this ability generalizes to new tasks.</p><p>2. <a href="https://arxiv.org/abs/2603.04448">SkillNet: Create, Evaluate, and Connect AI Skills</a></p><p>This paper introduces SkillNet, an open infrastructure for creating, evaluating, and organizing AI skills at scale. The lack of systematic skill accumulation and transfer hinders the long-term advancement of current AI agents. SkillNet structures skills within a unified ontology that supports creating skills from heterogeneous sources, establishing rich relational connections, and performing multi-dimensional evaluation across Safety, Completeness, Executability, Maintainability, and Cost-awareness. Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models.</p><p>3. <a href="https://arxiv.org/abs/2603.03790">T2S-Bench &amp; Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning</a></p><p>To understand if LLMs can benefit from text structure to enhance text-processing performance, this work introduces Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures. 
Building on this insight, the paper also presents T2S-Bench, the first benchmark designed to evaluate and improve models&#8217; text-to-structure capabilities. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation of 45 mainstream models reveals substantial potential for improvement.</p><p>4. <a href="https://arxiv.org/abs/2603.04379">Helios: Real Real-Time Long Video Generation Model</a></p><p>This paper presents Helios, a 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. The model natively supports T2V, I2V, and V2V tasks, mitigates long-video drifting via targeted training strategies, compresses context to cut computation, and employs infrastructure optimizations that outperform prior short- and long-video methods.</p><p>5. <a href="https://arxiv.org/abs/2603.02604">Heterogeneous Agent Collaborative Reinforcement Learning</a></p><p>This paper introduces Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. They develop HACPO, a collaborative RL algorithm with four mechanisms that ensure unbiased advantage estimation and correct optimization. Experiments show HACPO improves all agents and outperforms GSPO by 3.3% using half the rollout cost.</p><h3>Quick Links</h3><p>1. 
<a href="https://github.com/openai/symphony?tab=readme-ov-file">OpenAI releases Symphony</a>, an open-source framework designed to manage autonomous AI coding agents through structured &#8216;implementation runs.&#8217; Symphony utilizes Elixir and the Erlang/BEAM runtime to manage agent lifecycles. It is designed specifically to bridge the gap between project management tools and code execution.</p><p>2. <a href="https://developers.googleblog.com/whats-new-in-tensorflow-221/">Google has announced LiteRT has fully graduated into the production stack</a>. LiteRT is now Google&#8217;s primary on-device inference framework for deploying machine learning models to mobile and edge environments. The updated runtime delivers 1.4x faster GPU performance compared to TFLite and introduces a unified workflow for NPU acceleration.</p><p>3. <a href="https://cursor.com/blog/automations">Cursor unveiled Automations</a>, a system that automatically launches agents in the development environment in response to specific events: code changes, Slack messages, or a standard timer. 
According to the company, this allows for the review and maintenance of all new code created by agent tools without the need to track dozens of agents simultaneously.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/google-engineering-manager-google-pay-srre">Engineering Manager, Google Pay @Google (Singapore)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/sedgwick-ai-architect-5lkq">AI Architect @Sedgwick (Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/webflow-lead-ai-engineer-qynt">Lead AI Engineer @Webflow (Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/logitech-ai-analyst-intern-ud6k">AI Analyst Intern @Logitech (Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/ascension-health-it-intern-intrastructure-l15j">IT Intern Infrastructure @Ascension Health (Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/sedgwick-senior-engineer-llmops-and-mlops-3t5r">Senior Engineer&#8202;&#8212;&#8202;LLMOps &amp; MLOps @Sedgwick (Remote/USA)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? <a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p>]]></content:encoded></item><item><title><![CDATA[We broke our agents, so you don't have to]]></title><description><![CDATA[Master the missing reliability layer in most agent]]></description><link>https://newsletter.towardsai.net/p/we-broke-our-agents-so-you-dont-have</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/we-broke-our-agents-so-you-dont-have</guid><dc:creator><![CDATA[Towards AI]]></dc:creator><pubDate>Wed, 04 Mar 2026 15:03:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SSws!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If this sounds familiar, you&#8217;re not alone:</p><p>2025 gave us agent hype. <strong>It didn&#8217;t give us a reliable way to build them.</strong> Most developers are still guessing: which tools to use, how to wire the system, and how to catch failures with evals and monitoring before users do.</p><p>So after nine months of building, breaking, rebuilding, and stress-testing, <strong><a href="https://academy.towardsai.net/courses/agent-engineering?utm_source=TAIspecialedition&amp;utm_medium=email&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse">Agentic AI Engineering</a></strong> is finally live. 
Our newest course, built together with <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Paul Iusztin&quot;,&quot;id&quot;:110559689,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0714d360-396c-4b41-a676-1b58dc1dc5f3_1470x1470.jpeg&quot;,&quot;uuid&quot;:&quot;1481c14b-77f1-472a-86ab-8a4b740d06ca&quot;}" data-component-name="MentionToDOM"></span>, is designed to teach you how to design, build, evaluate, and deploy autonomous AI systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/agent-engineering?utm_source=TAIspecialedition&amp;utm_medium=email&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SSws!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png 424w, https://substackcdn.com/image/fetch/$s_!SSws!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png 848w, https://substackcdn.com/image/fetch/$s_!SSws!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png 1272w, https://substackcdn.com/image/fetch/$s_!SSws!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SSws!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png" width="1280" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1015596,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?utm_source=TAIspecialedition&amp;utm_medium=email&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.towardsai.net/i/189754076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SSws!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png 424w, https://substackcdn.com/image/fetch/$s_!SSws!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png 848w, https://substackcdn.com/image/fetch/$s_!SSws!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SSws!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e4108a1-fcd5-4c18-b074-663cba7ae59a_1280x720.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><strong><a href="https://academy.towardsai.net/courses/agent-engineering?utm_source=TAIspecialedition&amp;utm_medium=email&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse">See what you&#8217;ll build (syllabus + projects)</a></strong></p><p>Here&#8217;s what early students said after going through the 
material:</p><blockquote><p><em>&#8220;Excellent in depth handling of tradeoffs in evaluating and deploying agent based solutions. A useful mixture of theory and practice, learnt the hard way by expert practitioners.&#8221;</em> &#8212; Cathal Curtin</p><p><em>&#8220;Every AI Engineer needs course like that.&#8221;</em> &#8212; Ahmed Medhat</p><p><em>&#8220;Industry-focused, emphasizing real-world constraints rather than flashy demos, and highly hands-on.&#8221;</em> &#8212; Abreham Melese</p></blockquote><h4>What You Will Build</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/agent-engineering?utm_source=TAIspecialedition&amp;utm_medium=email&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cAyZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ced6ba-830d-4d73-935e-1cef8dfaa3f4_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!cAyZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ced6ba-830d-4d73-935e-1cef8dfaa3f4_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!cAyZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ced6ba-830d-4d73-935e-1cef8dfaa3f4_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!cAyZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ced6ba-830d-4d73-935e-1cef8dfaa3f4_1200x1200.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!cAyZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ced6ba-830d-4d73-935e-1cef8dfaa3f4_1200x1200.gif" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0ced6ba-830d-4d73-935e-1cef8dfaa3f4_1200x1200.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:315041,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?utm_source=TAIspecialedition&amp;utm_medium=email&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.towardsai.net/i/189754076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ced6ba-830d-4d73-935e-1cef8dfaa3f4_1200x1200.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cAyZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ced6ba-830d-4d73-935e-1cef8dfaa3f4_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!cAyZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ced6ba-830d-4d73-935e-1cef8dfaa3f4_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!cAyZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ced6ba-830d-4d73-935e-1cef8dfaa3f4_1200x1200.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!cAyZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ced6ba-830d-4d73-935e-1cef8dfaa3f4_1200x1200.gif 1456w" sizes="100vw"></picture></div></a></figure></div><p>In the course, you&#8217;ll build two agent systems and learn how to keep them reliable when the environment stops being friendly: when tools fail, inputs get messy, latency matters, and &#8220;it worked once&#8221; isn&#8217;t useful.</p><p>You&#8217;ll build a Research Agent that runs iterative loops, integrates real tools, produces structured artifacts, and supports 
human-in-the-loop checkpoints with clear stopping conditions. Then you&#8217;ll build a Writing Workflow Agent that turns that research into structured, multi-modal outputs using evaluator&#8211;optimizer patterns, orchestration, versioning, and state.</p><p>But the core of the course is the reliability layer most agent content skips: you&#8217;ll design eval datasets and human-in-the-loop processes, implement LLM judges and pass/fail checks, add observability with tracing, and set up monitoring so you can debug regressions quickly and improve the system deliberately, rather than guessing.</p><p><strong><a href="https://academy.towardsai.net/courses/agent-engineering?utm_source=TAIspecialedition&amp;utm_medium=email&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse">Check out the full course details &#8594;</a></strong></p><h4>Who Is This For?</h4><p>This is engineering-heavy and opinionated, designed for developers who want depth. You&#8217;ll feel at home if you&#8217;re comfortable with Python + LLM APIs, have basic cloud familiarity, and don&#8217;t mind debugging failures that aren&#8217;t clean.</p><p>We built the course by starting with a system we&#8217;d actually use, pushing it until it broke, then turning those failure modes into the curriculum, refined with feedback from 180 alpha testers. The goal is to prepare you for what agents are judged on in 2026: operational reliability, meaning measurable quality, inspectable behavior, and controlled autonomy.</p><p>If your goal is to build systems that survive production and the AI era, start here.</p><p>The early-bird seats sold out in under a week. The next 100 seats are now <strong>$499</strong> (the lowest available price after early bird). 
You get lifetime access, ongoing updates, Discord access, live introductory calls, and a 30-day refund if you go through the early material and realize it&#8217;s not what you need.</p><p><strong><a href="https://academy.towardsai.net/courses/agent-engineering?utm_source=TAIspecialedition&amp;utm_medium=email&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse">Get access now &#8594;</a></strong></p>]]></content:encoded></item><item><title><![CDATA[TAI #194: AI Goes Macro; Job Loss Fears, Military Usage, OpenAI $110B Raise]]></title><description><![CDATA[Also, launching Towards AI&#8217;s new Agents course]]></description><link>https://newsletter.towardsai.net/p/tai-194-ai-goes-macro-job-loss-fears</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-194-ai-goes-macro-job-loss-fears</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 03 Mar 2026 15:02:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!s0Ac!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74ff34d-2f0b-4abd-ab16-c5772f03396a_1200x1200.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>This week brought a series of developments that signal AI is quickly becoming more than just a technology story: AI&#8217;s revenue, its politics, and its labor market consequences are now operating at a scale that reshapes the global economy and the geopolitical order in real, measurable ways.</p><p><strong>AI, the Pentagon, and the Claude Surge.</strong></p><p>AI is increasingly critical to US military operations. OpenAI signed a contract with the Department of Defense to deploy its models on classified networks. 
Hours later, the Trump administration designated Anthropic a &#8220;supply chain risk&#8221; and directed agencies to stop using Claude, widely interpreted as retaliation for Anthropic&#8217;s refusal to lift its safety guardrails for unrestricted military use. Meanwhile, reports emerged that Claude was allegedly used, together with Palantir, during the capture of Venezuela&#8217;s then-president Nicol&#225;s Maduro in January and again to assist with intelligence assessment during strikes against Iran.</p><p>I agree with the red lines Anthropic has laid out: no mass surveillance, no autonomous weapons without a human in the loop. Dario Amodei seems more serious about enforcing those boundaries than any other lab CEO, and his willingness to absorb real commercial and political cost to hold that line is notable. That said, the broader question is genuinely complex. Should unelected AI CEOs be drawing the boundaries of how military AI gets used? In principle, that is a job for elected governments. But existing laws were not written with these AI capabilities in mind, and governments have shown little urgency to update them. Until they do, the defaults are being set by a handful of companies in San Francisco.</p><p>Public backlash against OpenAI&#8217;s Pentagon deal appears to have driven a spike in downloads of Claude. Anthropic&#8217;s app hit number one on the Apple App Store, and the resulting surge in demand contributed to a major Claude outage on Monday that lasted nearly three hours, following a minor disruption on February 28. GPU and inference capacity are already binding constraints, and we are nowhere near the usage levels many AI economic scenarios assume.</p><p><strong>OpenAI Raises $110 Billion.</strong></p><p>OpenAI closed a $110 billion funding round, the largest private financing in history, from Amazon ($50B), Nvidia ($30B), and SoftBank ($30B), at a pre-money valuation of $730 billion. 
Capital flowing into AI infrastructure is now reaching a scale that shows up in macro aggregates. Between this fundraise, continued $150&#8211;200 billion in hyperscaler data center capex per quarter, and SoftBank&#8217;s Stargate commitments, AI investment is becoming a material driver of GDP in its own right. The question is whether the productivity gains this infrastructure enables will circulate broadly through the economy, or concentrate in a handful of firms.</p><p><strong>Citrini&#8217;s &#8220;2028 Global Intelligence Crisis&#8221; and the AI Job Loss Debate.</strong></p><p>A blog post from CitriniResearch titled &#8220;The 2028 Global Intelligence Crisis&#8221; went extremely viral recently, reportedly accumulating around 16 million views. The piece is written as a fictional macro memo from June 2028, looking back on how AI-driven white-collar job displacement triggered a cascade of economic and financial consequences: mass layoffs leading to reduced consumer spending, a collapsing SaaS sector, private credit defaults, and eventually stress in the $13 trillion US mortgage market as high-income borrowers lose their jobs.</p><p>The thesis: AI capabilities improve, companies lay off white-collar workers and reinvest savings into more AI; displaced workers spend less; companies under revenue pressure invest even more in AI to cut costs; and the cycle accelerates. Citrini calls this the &#8220;human intelligence displacement spiral.&#8221; The piece also describes how agentic commerce erodes the moats of intermediary businesses (DoorDash, Mastercard, insurance brokers, real estate agents) as AI agents are put in charge of your shopping, optimizing for price rather than habit, effectively destroying the &#8220;friction premium&#8221; that underpins trillions of dollars of enterprise value.</p><p>Stocks named in the essay, including Uber, DoorDash, American Express, and Mastercard, sold off in the days following the post&#8217;s spread. IBM dropped sharply. 
Reception from economists was mixed, and the piece got plenty of pushback, but the scenario clearly struck a nerve because it stitched together several anxieties investors already had: AI as a margin tailwind in the short run, and AI as a demand and business-model headwind if labor income gets hit hard enough.</p><p>I think the Citrini thesis describes a feasible but low-probability scenario, with some important caveats.</p><p>The stock market story and the economic story are two different things. Global labor income is roughly $60 trillion, compared with current S&amp;P 500 profits of $2&#8211;2.5 trillion. That gap leaves a huge amount of room for AI-beneficiary companies to capture income that now goes to labor as profit, which could lift the S&amp;P 500 even if GDP falls significantly. The usual intuition that &#8220;stocks track the economy&#8221; can fail when the economy&#8217;s scarce factor shifts from labor to compute. In these scenarios, AI labs will likely have to keep spinning off divisions and vertical platforms to maintain some diversity in the indexes, because you cannot have 5&#8211;10 companies making up 90% of market capitalization without structural pressure to break them up.</p><p>The &#8220;technological innovation destroys jobs and then creates even more&#8221; line does not hold as a default assumption this time. It has been right for two centuries because every new job required a human to perform it. With general-purpose AI, many of the &#8220;new categories&#8221; are also automatable, often faster than institutions can train for and professionalize them. There will definitely be human roles that appear or grow significantly for a while, but they may only be a fraction of what gets replaced. Job growth could offset job losses if GDP grows to several times its current level. That seems to be Elon Musk&#8217;s primary scenario: one new human job for every nine new AI jobs can still lead to full employment if the total economy is large enough. That is feasible. 
But the middle ground, where there are neither huge job losses nor an unprecedented economic boom, does not seem very likely to me.</p><p>Citrini&#8217;s point about network effects and platform disruption is also interesting. Agents definitely reduce the friction that gives incumbents their brand and habitual usage advantages. An AI agent choosing the best delivery app has no home-screen loyalty. But for many businesses, there are still large fixed-cost advantages and utilization-rate economics that favor the largest network. A company with 50% margins from scale can survive a world where newcomers sell at the same price while making a loss, even with software costs near zero. This depends heavily on the business, though. That advantage does not help Uber or DoorDash nearly as much as it helps an infrastructure provider or a marketplace with exclusive supply.</p><p>GPU capacity will likely be the primary bottleneck to Citrini&#8217;s scenario playing out at speed. We already saw Claude crash this week due to increased usage, and Gemini has had its own scaling issues. However, 100x-plus breakthroughs in inference efficiency are not impossible, particularly if AI starts making its own breakthroughs in designing and testing new model architectures and inference systems. Compute is a brake today. It is not a guaranteed brake for 2027&#8211;2028.</p><p>The Citrini thesis got partial vindication this week with Block&#8217;s announcement that it is cutting roughly 4,000 employees, nearly half its workforce. CEO Jack Dorsey was explicit that the cuts are AI-driven, saying the intelligence tools they are building &#8220;fundamentally change what it means to build and run a company.&#8221; He predicted that within the next year, most companies will reach the same conclusion and make similar structural changes. Block&#8217;s stock soared as much as 24% on the news. This is the pattern Citrini describes: layoffs expand margins, earnings beat, stocks rally. 
Each company&#8217;s response is rational. The collective result is the displacement spiral that makes the scenario so uncomfortable.</p><div><hr></div><h3>Why should you care?</h3><p>Here is where I think we actually stand. Human expertise is vital to nearly all AI usage today, and it will be for some time. The models are powerful, but they are not autonomous. They need people who understand the domain, can evaluate their outputs, can architect the workflows, and can catch the failures before they reach production.</p><p>However, I see a very real risk that AI-first employees can be 2&#8211;3x more productive, with higher-quality output, than those who resist using AI. Many companies will channel that productivity into building more products, running more security checks, and expanding into new markets. But many will hit other bottlenecks to growing output, and for those companies, the surplus productivity translates directly into headcount reduction. Slow adopters of AI are at high risk of redundancy across a very large number of careers in the near future.</p><p>That said, enterprise adoption is still slow. 
AI engineers and forward-deployed engineers will be critically needed to customize agents and workflows for specific enterprise contexts. True adoption takeoff requires people who can bridge the gap between raw model capability and production-grade reliability.</p><p>The main bottlenecks to AI adoption are likely to be AI compute, as we can see from the Claude and Gemini scaling issues this week, and the supply of AI engineers with the expertise to build and deploy enterprise-tier agents. The models are ready. The infrastructure is strained. The human talent to wire it all together is in short supply.</p><p>On that note, 2025 gave us agent hype. It did not give us a reliable way to build them. Most developers are still guessing at tools, wiring, and how to catch failures before users do. Fortunately, we have a new course to fill this gap!</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><p>We spent 9 months building, breaking, and stress-testing two real agent systems, with feedback from 180+ developers.</p><p>The result is <strong><a href="https://academy.towardsai.net/courses/agent-engineering?utm_source=TAI&amp;utm_medium=sponsor+section&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse">Agentic AI Engineering</a>,</strong> our newest course built to teach operational reliability: <strong>measurable quality (evals), inspectable behavior (observability), and controlled autonomy</strong> (clear boundaries + robust tool/workflow engineering).</p><p>You&#8217;ll build a <strong>Research Agent</strong> and a <strong>Writing Workflow</strong> end-to-end, and you&#8217;ll ship them with the parts that make agents usable in 2026: evaluation datasets and pass/fail checks, LLM judges, tracing, monitoring, and the workflow glue that keeps tools, state, and outputs from turning into chaos.</p><div class="captioned-image-container"><figure><a 
class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/agent-engineering?utm_source=TAI&amp;utm_medium=sponsor+section&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s0Ac!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74ff34d-2f0b-4abd-ab16-c5772f03396a_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!s0Ac!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74ff34d-2f0b-4abd-ab16-c5772f03396a_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!s0Ac!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74ff34d-2f0b-4abd-ab16-c5772f03396a_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!s0Ac!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74ff34d-2f0b-4abd-ab16-c5772f03396a_1200x1200.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s0Ac!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74ff34d-2f0b-4abd-ab16-c5772f03396a_1200x1200.gif" width="1200" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e74ff34d-2f0b-4abd-ab16-c5772f03396a_1200x1200.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?utm_source=TAI&amp;utm_medium=sponsor+section&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!s0Ac!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74ff34d-2f0b-4abd-ab16-c5772f03396a_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!s0Ac!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74ff34d-2f0b-4abd-ab16-c5772f03396a_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!s0Ac!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74ff34d-2f0b-4abd-ab16-c5772f03396a_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!s0Ac!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74ff34d-2f0b-4abd-ab16-c5772f03396a_1200x1200.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The first 100 early-bird seats sold out in under a week. The next 100 seats are <strong>$499</strong> (the lowest price after the early bird). Lifetime access, Discord community, and a 30-day refund.</p><p><strong><a href="https://academy.towardsai.net/courses/agent-engineering?utm_source=TAI&amp;utm_medium=sponsor+section&amp;utm_campaign=2026_subscribers_nostart_buy_glb&amp;utm_id=agentcourse">Get access now!</a></strong></p><div><hr></div><h3>Hottest News</h3><p>1. <a href="https://www.bloomberg.com/news/articles/2026-02-27/trump-orders-us-government-to-drop-anthropic-after-pentagon-feud">US Bars Anthropic Products From Agencies, Contractors</a></p><p>The Pentagon declared Anthropic PBC a supply-chain risk after President Donald Trump directed US government agencies to stop using its products. 
Defense Secretary Pete Hegseth ordered the Pentagon to bar its contractors and their partners from any commercial activity with Anthropic, giving the company six months to hand over AI services to another provider. This wipes out as much as $200 million in work that Anthropic had agreed to do for the military, along with smaller but important contracts for civilian agencies, including the State Department. In its statement on Friday, Anthropic said being labeled a supply-chain risk &#8220;would both be legally unsound and set a dangerous precedent for any American company that negotiates with the government.&#8221;</p><p>2. <a href="https://techcrunch.com/2026/02/27/openai-raises-110b-in-one-of-the-largest-private-funding-rounds-in-history/">OpenAI Raises $110B in One of the Largest Private Funding Rounds in History</a></p><p>OpenAI has raised $110 billion in one of the largest private funding rounds in history. The new funding consists of a $50 billion investment from Amazon as well as $30 billion each from Nvidia and SoftBank, against a $730 billion pre-money valuation. As part of the investment, OpenAI is launching significant infrastructure partnerships with both Amazon and Nvidia. The Information had previously reported that $35 billion of Amazon&#8217;s investment could be contingent on the company either achieving AGI or going public by the end of the year. OpenAI&#8217;s announcement confirms the funding split, but says only that the additional $35 billion will arrive &#8220;in the coming months when certain conditions are met.&#8221; Notably, the round remains open, and OpenAI expects more investors to join as it proceeds.</p><p>3. <a href="https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/">Google AI Just Released Nano-Banana 2</a></p><p>Google officially unveiled Nano-Banana 2 (technically designated as Gemini 3.1 Flash Image). 
It leverages Latent Consistency Distillation (LCD) to achieve sub-500ms latency, enabling real-time 4K image synthesis and upscaling directly on mobile hardware. Built on a 1.8-billion-parameter backbone, the model uses Dynamic Quantization-Aware Training (DQAT) to maintain high-fidelity output with a minimal memory footprint, eliminating the need for expensive cloud inference. By implementing Grouped-Query Attention (GQA), the model reduces memory bandwidth requirements, allowing it to run continuously on mobile NPUs without triggering thermal throttling or performance dips. Additionally, the model can maintain character resemblance of up to five characters and the fidelity of up to 14 objects. Through the new Banana-SDK, developers can deploy specialized Low-Rank Adaptation (LoRA) modules to customize the model for niche tasks without retraining the base architecture.</p><p>4. <a href="https://nousresearch.com/hermes-agent/">Nous Research Releases Hermes Agent</a></p><p>Nous Research released Hermes Agent, an open-source autonomous system designed to solve the two biggest bottlenecks in agentic workflows: memory decay and environmental isolation. Hermes Agent utilizes a multi-level memory system that mimics procedural learning. While it handles short-term tasks through standard inference, its long-term utility is driven by Skill Documents. Powered by the Llama 3.1-based Hermes-3 model, it is fine-tuned with Atropos RL for high steerability and reliable tool-calling within complex reasoning loops. The system integrates directly with existing communication stacks, including Telegram, Discord, Slack, and WhatsApp.</p><p>5. <a href="https://www.perplexity.ai/hub/blog/introducing-perplexity-computer">Perplexity unveiled Perplexity Computer</a></p><p>Perplexity AI announced the launch of Perplexity Computer, a system that unifies multiple frontier AI models into a single platform to execute complex, long-running workflows. 
The system breaks down a user&#8217;s requested outcome into tasks and subtasks, assigns them to sub-agents, and executes them asynchronously. These sub-agents can conduct web research, generate documents, process data, and make API calls to connected services. Overall, it can allocate tasks across 19 different models. Each task on Computer runs in an isolated compute environment with access to a filesystem, browser, and tool integrations. If the system encounters issues, it can generate additional sub-agents to address them. As of today, Perplexity Computer runs Opus 4.6 for its core reasoning engine and orchestrates sub-agents with the best models for specific tasks: Gemini for deep research (creating sub-agents), Nano Banana for images, Veo 3.1 for video, Grok for speed in lightweight tasks, and ChatGPT 5.2 for long-context recall and wide search. The product is available to Perplexity Max subscribers. It follows a usage-based pricing model, allowing users to select different AI models for different sub-agent tasks and manage token spending.</p><p>6. <a href="https://copaw.agentscope.io/">Alibaba Team Open-Sources CoPaw</a></p><p>Alibaba released CoPaw, an open-source framework that provides a standardized workstation for deploying and managing personal AI agents. The system relies on three primary layers: AgentScope (the underlying framework that handles agent communication and logic), AgentScope Runtime (the execution environment), and ReMe (memory management). A core feature of the CoPaw workstation is its Skill Extension capability. In this framework, a &#8216;Skill&#8217; is a discrete unit of functionality, essentially a tool that the agent can invoke to interact with the external world. It also introduces an All-Domain Access layer, which standardizes how agents interact with different messaging protocols.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. 
<a href="https://pub.towardsai.net/building-a-production-ready-agentic-rag-system-on-gcp-vertex-ai-adk-terraform-97742f3b2a41">Building a Production-Ready Agentic RAG System on GCP: (Vertex AI, ADK, Terraform)</a></p><p>The article shows how to implement a production-grade RAG system on Google Cloud Platform to address the challenge of making organizational documents searchable beyond basic keyword matching. The architecture features separate ingestion and query pipelines using Vertex AI, Cloud Run, Eventarc, and Gemini. The article covers complete infrastructure deployment via Terraform, step-by-step setup instructions, and comparative analysis against AWS Bedrock, Azure AI Search, and open-source alternatives.</p><p>2. <a href="https://pub.towardsai.net/agentic-rag-semantic-caching-building-smarter-enterprise-knowledge-systems-2c946fb0c386?sk=9355491f211efcde096be863ea2f0f56">Agentic RAG &amp; Semantic Caching: Building Smarter Enterprise Knowledge Systems</a></p><p>Enterprise knowledge systems face significant challenges in managing unstructured data scattered across multiple platforms. This article presents a complete implementation of Agentic RAG systems that overcome Naive RAG&#8217;s critical limitations, including the inability to summarize documents, perform multi-document comparisons, maintain conversational memory, and enforce data security. It uses the Qdrant vector database with Nomic embeddings across two notebooks.</p><p>3. <a href="https://pub.towardsai.net/lora-qlora-dora-which-fine-tuning-method-should-you-actually-use-296b53ea1aa9?sk=0bdae6dbaa29561dc1875b468f30121a">LoRA, QLoRA, DoRA: Which Fine-Tuning Method Should You Actually Use?</a></p><p>This article analyzes the original research papers for LoRA, QLoRA, and DoRA to provide evidence-based comparisons of parameter-efficient fine-tuning methods. 
It explains how LoRA reduces trainable parameters by 99.6% through low-rank weight updates, how QLoRA enables fine-tuning 65B models on a single 48GB GPU using 4-bit quantization, and how DoRA improves accuracy by decomposing weights into magnitude and direction components. It also demonstrates practical code examples from official repositories.</p><p>4. <a href="https://pub.towardsai.net/cutting-batch-release-from-14-days-to-3-a-case-study-in-multi-agent-ai-for-pharmaceutical-859a81ea90a7?sk=ff19178d6fe3492c9d71c4e38e4d08a3">Cutting Batch Release from 14 Days to 3: A Case Study in Multi-Agent AI for Pharmaceutical Manufacturing</a></p><p>This article presents a case study of a pharmaceutical manufacturer that cut batch release time from 14 days to 3 days using a multi-agent AI system. The manufacturer addressed a critical bottleneck in which Quality Assurance reviewers manually gathered records from multiple systems (MES, LIMS, environmental monitoring) to verify compliance with registered specifications, resulting in over $2 million in annual operational overhead. The solution implemented four specialized agents using the CrewAI framework: Batch Data Collector, Deviation Analyst, Compliance Reviewer, and Release Recommender. Each agent employed the ReAct paradigm with custom tools, conditional task execution for critical deviations, and human-in-the-loop approval by Qualified Persons.</p><p>5. <a href="https://pub.towardsai.net/deriving-the-singular-value-decomposition-svd-from-first-principles-7695ebbb4e7d?sk=30c6d828f56a682187f222394c9cc4df">Deriving the Singular Value Decomposition (SVD) from First Principles</a></p><p>Moving beyond the typical formula-based teaching approach, this article derives the Singular Value Decomposition (SVD) from first principles by starting with symmetric matrix diagonalization. 
It constructs the SVD by first forming two symmetric matrices (A&#7488;A and AA&#7488;) from any matrix A, then using their eigenbases to form orthonormal matrices U and V. The piece demonstrates how SVD decomposes any linear transformation into three operations: rotation, stretch, and rotation, with all transformation energy contained in the diagonal matrix &#931;.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/bytedance/deer-flow">DeerFlow</a> is an open-source super agent harness that orchestrates sub-agents, memory, and sandboxes.</p><p>2. <a href="https://github.com/ruvnet/ruflo">Ruflo</a> is an AI agent orchestration framework that transforms Claude Code into a powerful multi-agent development platform.</p><p>3. <a href="https://github.com/microsoft/markitdown">MarkItDown</a> is a lightweight Python utility for converting various files to Markdown for use with LLMs.</p><p>4. <a href="https://github.com/FireRedTeam/FireRed-OCR">FireRed OCR</a> is a framework for specializing general LVLMs into document parsing experts.</p><h3>Top Papers of The Week</h3><p>1. <a href="https://arxiv.org/abs/2510.12066">AI Agents as Universal Task Solvers</a></p><p>This paper describes AI agents as stochastic dynamical systems and frames reasoning as transductive inference that captures algorithmic structure to speed up novel tasks. It shows that the optimal speed-up on a new task is tightly related to the algorithmic information it shares with the training data. It also highlights that transductive inference yields its greatest benefits precisely when the data-generating mechanism is most complex, and identifies a possible failure mode of naive scaling.</p><p>2. 
<a href="https://arxiv.org/abs/2602.18640">Decoding ML Decision: An Agentic Reasoning Framework for Large-Scale Ranking System</a></p><p>This paper presents GEARS (Generative Engine for Agentic Ranking Systems), a framework that reframes ranking optimization as an autonomous discovery process within a programmable experimentation environment. Rather than treating optimization as static model selection, GEARS leverages Specialized Agent Skills to encapsulate ranking expert knowledge into reusable reasoning capabilities, enabling operators to steer systems via high-level intent vibe personalization.</p><p>3. <a href="https://arxiv.org/abs/2602.11151">Diffusion-Pretrained Dense and Contextual Embeddings</a></p><p>This report introduces pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. Researchers released two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark.</p><p>4. <a href="https://arxiv.org/abs/2602.15902">Doc-to-LoRA: Learning to Instantly Internalize Contexts</a></p><p>This paper proposes Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate context distillation within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. 
On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM&#8217;s native context window by more than 4x.</p><p>5. <a href="https://arxiv.org/abs/2602.16928">Discovering Multiagent Learning Algorithms with Large Language Models</a></p><p>This paper introduces AlphaEvolve, an LLM-powered evolutionary coding agent that automatically designs multi-agent reinforcement learning algorithms for imperfect-information games. AlphaEvolve discovers VAD-CFR, which uses volatility-sensitive discounting, consistency-enforced optimism, and a hard warm-start schedule, and SHOR-PSRO, which blends Optimistic Regret Matching with smoothed best-response distributions and dynamic annealing, both of which outperform state-of-the-art CFR and PSRO variants.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/databricks-ai-engineer-fde-forward-deployed-engineer-eiwx">AI Engineer &#8212; FDE @Databricks (Remote)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/microsoft-corporation-senior-software-engineer-jhhb">Senior Software Engineer @Microsoft Corporation (Redmond, WA, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/headspace-engineering-manager-ai-vs4g">Engineering Manager, AI @Headspace (Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/meta-software-engineer-ai-native-lkuk">Software Engineer, AI Native @Meta (Menlo Park, CA, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/sword-health-senior-ai-engineer-portugal-based-remote-hybrid-zik1">Senior AI Engineer @Sword Health (Remote/Portugal)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/lockheed-martin-ai-engineer-sr-generative-ai-hybrid-bjew">AI Engineer Sr &#8212; Generative AI @Lockheed Martin (Colorado Springs, 
USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/turing-principal-engineer-gen-ai-mmkl">Principal Engineer (Gen-AI) @Turing (India)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? <a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p>]]></content:encoded></item><item><title><![CDATA[TAI #193: Gemini 3.1 Pro Takes the Benchmarks Crown, but Can it Catch Up in the Tools Race?]]></title><description><![CDATA[Also, Claude Sonnet 4.6, Google Lyria 3, Qwen 3.5, Zyphra ZUNA, and NVIDIA DreamDojo.]]></description><link>https://newsletter.towardsai.net/p/tai-193-gemini-31-pro-takes-the-benchmarks</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-193-gemini-31-pro-takes-the-benchmarks</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 24 Feb 2026 15:01:26 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!P6V1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>Google DeepMind released Gemini 3.1 Pro on February 19th, and the benchmark results are hard to argue with. On Artificial Analysis&#8217;s Intelligence Index, it sits at #1 with a score of 57, ahead of Claude Opus 4.6 (53) and GPT-5.2 (51), leading on 12 of 18 tracked benchmarks. On ARC-AGI-2, the abstract reasoning test that has become a proxy for novel problem-solving, it scored 77.1%, more than doubling Gemini 3 Pro&#8217;s 31.1% from three months ago and pulling nearly 10 points clear of Opus 4.6 (68.8%). Last July, Grok 4 made headlines, hitting 16.0% on the same benchmark. Six months later, Gemini 3 Pro reached 31.1%. Now, 77.1%. The trajectory suggests that latent reasoning architectures, where the model generates hidden chains of thought before producing output, are yielding compounding returns on abstract logic tasks specifically. Whether this translates into equivalent gains on practical, open-ended work is a different question.</p><p>The broader results reinforce the picture. On GPQA Diamond (doctoral-level science), Gemini 3.1 scored 94.3% vs. Opus 4.6&#8217;s 91.3% and GPT-5.2&#8217;s 92.4%. On Terminal-Bench 2.0 for agentic terminal workflows, 68.5% vs. Opus 4.6&#8217;s 65.4% and GPT-5.2&#8217;s 54.0%. On LMSYS Chatbot Arena, Gemini 3.1 Pro now sits in a statistical dead heat with Opus 4.6 at the top of the overall text leaderboard (1500 vs. 1505 Elo) and comfortably ahead of GPT-5.2 (1478). 
In the Vision category, Gemini models hold the top three spots outright.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P6V1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P6V1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png 424w, https://substackcdn.com/image/fetch/$s_!P6V1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png 848w, https://substackcdn.com/image/fetch/$s_!P6V1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png 1272w, https://substackcdn.com/image/fetch/$s_!P6V1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P6V1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png" width="1456" height="705" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:705,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:418318,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.towardsai.net/i/189020504?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P6V1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png 424w, https://substackcdn.com/image/fetch/$s_!P6V1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png 848w, https://substackcdn.com/image/fetch/$s_!P6V1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png 1272w, https://substackcdn.com/image/fetch/$s_!P6V1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b283914-40ab-4838-a892-392d4be58ac7_1600x775.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>Perhaps the most underappreciated improvement is hallucination resistance. On Artificial Analysis&#8217;s AA-Omniscience benchmark, Gemini 3.1 Pro reduced its hallucination rate by 38 percentage points compared to Gemini 3 Pro Preview, dropping from 88% to 50%. Its hallucination resistance score of 30 is more than twice the next-best score of 13. For anyone who has used earlier Gemini models for research or factual work, this is a noticeable change in daily use.</p><p>The model keeps the 1M-token input context window and increases the output limit to 65,536 tokens, resolving the severe output truncation that plagued earlier Gemini 3 models. Developers reported that Gemini 3 Pro cut off at roughly 21,000 output tokens; 3.1 Pro has been stress-tested to beyond 55,000 tokens of continuous, unbroken output. 
API pricing stays at $2/$12 per million input/output tokens, roughly half the blended cost of Opus 4.6. Google also released a specialized gemini-3.1-pro-preview-customtools endpoint optimized for autonomous agent behavior.</p><p><strong>Where Gemini falls short</strong></p><p>On GDPval-AA, which measures real-world knowledge work across 44 occupations, Gemini 3.1 Pro scores 1317 Elo. Claude Sonnet 4.6 scores 1633. Opus 4.6 scores 1606. GPT-5.2 scores 1462. That is a 300+ point deficit to Anthropic&#8217;s models on the tasks that most white-collar professionals do all day: drafting reports, analyzing data, writing communications, and building presentations. On enterprise knowledge work, Anthropic and OpenAI remain clearly ahead.</p><p>This points to a broader issue I keep coming back to: the tools gap. We now use Gemini models regularly at Towards AI. In my view, its image understanding is the best available. Its SVG and frontend code generation is unmatched, with Gemini 3.1 Pro leading SVG Arena at Elo 1421, a 95-point lead over Opus 4.6. Its coding ability is genuinely strong; the Terminal-Bench 2.0 lead and LiveCodeBench Pro Elo of 2887 are serious numbers. And for long-context research, the 1M token window with 84.9% retrieval accuracy on MRCR v2 at 128k tokens is hard to beat.</p><p>But Google has been falling behind on what the chatbot can actually do for you beyond the chat window. Claude can create .pptx files, .xlsx spreadsheets with working formulas, and .docx documents. It can operate your computer through Cowork and Claude in Chrome. OpenAI has Codex agents, Canvas, and a growing tool suite. Google&#8217;s Gemini app still feels like a chat interface. You get text, images via Imagen, and now music via Lyria 3. But you cannot hand Gemini a dataset and get back a working spreadsheet. You cannot ask it to build a slide deck. 
You cannot point it at your desktop and say, &#8220;Organize this.&#8221;</p><p>There is also a persistent gap between the model available in AI Studio and the one in the Gemini app. Even with an Ultra subscription ($250/month), the consumer app often feels weaker than the API. I have run the same prompts in both environments and gotten noticeably better results from AI Studio. This undermines the value proposition of the paid tiers and is a recurring complaint in developer communities.</p><p>For coding, ease of use still tilts toward Claude Code and Codex despite Gemini&#8217;s strong raw capability. With Claude Code, you open your terminal, point it at a repo, and start delegating. Gemini&#8217;s coding capabilities shine brightest in AI Studio with high reasoning enabled, but the developer experience is less polished. Google&#8217;s response, Antigravity (an agent-first IDE built as a VS Code fork), is conceptually ambitious but early: documented bugs include system prompt leaks, infinite execution loops, and contextual amnesia with multi-turn document uploads.</p><p>In other news, Anthropic also released Claude Sonnet 4.6 two days before Gemini, with a 1M-token context window (beta), adaptive thinking, and 79.6% on SWE-bench Verified at $3/$15 per million tokens.</p><p>Also in the news: Google launched Lyria 3, a music generation model now available in the Gemini app. Alibaba released Qwen 3.5 (397B MoE, 17B active, open weights). NVIDIA introduced DreamDojo, an open-source robot world model. Zyphra released ZUNA, a BCI foundation model for EEG reconstruction.</p><div><hr></div>
<h3>Why should you care?</h3><p>Gemini 3.1 Pro is the strongest model on raw benchmarks this week. The ARC-AGI-2 score is a genuine leap. The hallucination reduction is practically meaningful. The coding and science capabilities are at the frontier. And it costs roughly half as much per token as Opus 4.6.</p><p>In production, the picture is different. I think Google has the best raw AI engine right now, but it isn&#8217;t fully leveraging it. The gap between Gemini&#8217;s model intelligence and the Gemini app&#8217;s utility is the widest in the industry. The model that wins on GPQA Diamond is not the same as the one that wins your workflow. At Towards AI, we use Gemini regularly for image analysis and long-context research, where it is clearly the best tool. But when I need to produce a deliverable, a report, a spreadsheet, a presentation, I reach for Claude. When I need to write code against a real codebase, I open Claude Code or Codex. The distance between &#8220;smartest model&#8221; and &#8220;most useful model&#8221; has never been wider. Google needs to close this gap or risk losing paying users who conclude the app is not worth it.</p><p>For practitioners, the takeaway is that no single model dominates all use cases. 
We use all three at Towards AI daily, and the people getting the most value from AI are the ones who know which model to reach for and when.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><p>We just launched something that changes how you build agentic systems.</p><p>Our newest FREE course,<strong> <a href="https://email-course.towardsai.net/?utm_source=TAInewsletter&amp;utm_medium=banner&amp;utm_campaign=2026_subscribers_nostart_signup_glb&amp;utm_id=freeemailcourse">Agentic AI Engineering Guide: 6 Mistakes Developers Make When Building Agents</a></strong>, distills 3+ years of production failures into the exact patterns separating demos from reliable systems.</p><p>Built in partnership with Paul Iusztin, this 6-day <em>free</em> email course teaches you what most engineers never learn: how to design, evaluate, and operate probabilistic systems as <em>systems</em>.</p><p><strong>If you&#8217;ve experienced any of these:</strong></p><ul><li><p>Agents that work in demos but drift in production</p></li><li><p>Changes feel risky, and you can&#8217;t predict what breaks</p></li><li><p>Costs spike with no clear explanation</p></li><li><p>Infinite loops and random decisions</p></li><li><p>Every release needs slow manual QA</p></li></ul><p>This course shows you exactly how to fix them.</p><p><strong>Here&#8217;s how it works:</strong></p><p>Sign up free &#8594; Get Lesson #1 immediately &#8594; One lesson daily for 6 days &#8594; Apply to your systems as you learn</p><p><strong><a href="https://email-course.towardsai.net/?utm_source=TAInewsletter&amp;utm_medium=banner&amp;utm_campaign=2026_subscribers_nostart_signup_glb&amp;utm_id=freeemailcourse">&#8594; Get your first lesson now (free)</a></strong></p><div><hr></div><h3>Hottest News</h3><p>1. 
<a href="https://www.anthropic.com/news/claude-sonnet-4-6">Anthropic Releases Claude 4.6 Sonnet</a></p><p>Anthropic announced Claude Sonnet 4.6 with upgrades across coding, computer use, long-context reasoning, agent planning, knowledge work, and design workflows. The model adds Adaptive Thinking and introduces a 1M-token context window (beta). Anthropic reports 79.6% on SWE-bench Verified for coding, and 72.5% on OSWorld for computer-use tasks. Claude Sonnet 4.6 is available across all Claude plans, as well as Claude Cowork and Claude Code. Alongside the model release, Anthropic also introduced Improved Web Search with Dynamic Filtering, which uses internal code execution to verify facts in real time.</p><p>2. <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/">Google AI Releases Gemini 3.1 Pro</a></p><p>Google is rolling out Gemini 3.1 Pro, the first version update in the Gemini 3 series. Gemini 3.1 Pro Preview keeps the 1M-token input window and increases the output limit to 65K tokens. Google reports 77.1% on ARC-AGI-2, more than double Gemini 3 Pro&#8217;s score, and 94.3% on GPQA Diamond for graduate-level science reasoning. Google also introduced a specialized gemini-3.1-pro-preview-customtools endpoint, optimized to prioritize bash commands and system tools for more reliable autonomous agent behavior. In the Gemini app, Gemini 3.1 Pro is rolling out with higher limits for Google AI Pro and Ultra users.</p><p>3. <a href="https://qwen.ai/blog?id=qwen3.5">Alibaba Launches Qwen 3.5</a></p><p>Alibaba&#8217;s Qwen team introduced Qwen3.5&#8211;397B-A17B as the first open-weight model in the new Qwen3.5 series. The release uses a hybrid architecture that combines linear attention (via Gated Delta Networks) with a sparse mixture-of-experts design, with 397B total parameters and 17B active parameters. It also expands language and dialect coverage from 119 to 201. 
The team&#8217;s hosted model, Qwen3.5-Plus, is listed with a 1M context window by default and official built-in tools with adaptive tool use. Qwen 3.5 achieves 87.8 on MMLU-Pro, 88.4 on GPQA, 83.6 on LiveCodeBench v6, 72.9 on BFCL-V4, and 48.3 on HLE with tools. The model is available as open weights on Hugging Face.</p><p>4. <a href="https://www.zyphra.com/post/zuna">Zyphra Releases ZUNA</a></p><p>Zyphra released ZUNA, a 380M-parameter BCI foundation model designed to reconstruct, denoise, and upsample EEG data across arbitrary channel layouts. It is trained on roughly 2 million channel-hours of EEG from a broad set of public datasets. ZUNA is built to improve on long-standing interpolation methods used when EEG channels are missing or noisy, and Zyphra reports that it consistently outperforms spherical-spline interpolation across benchmarks, including ANPHY-Sleep and BCI2000 motor imagery. The model is aimed at researchers, clinicians, and BCI developers and is released under the Apache 2.0 license.</p><p>5. <a href="https://deepmind.google/models/lyria/">Google DeepMind Releases Lyria 3</a></p><p>Google introduced Lyria 3, its latest music generation model, built to produce complex, multi-layer arrangements with vocals and instruments at 48 kHz. A key improvement is greater musical consistency throughout a track, with stronger continuity in melody, rhythm, and style. Lyria 3 is now available in the Gemini app, where users can generate a 30-second music track from a text prompt or an uploaded image.</p><p>6. <a href="https://arxiv.org/abs/2602.06949">NVIDIA Releases DreamDojo</a></p><p>NVIDIA introduced DreamDojo, a fully open-source robot world model designed for generalizable robotics simulation and control. It is pretrained on DreamDojo-HV, a large egocentric human-video dataset containing 44,711 hours of footage across 6,015 tasks and 9,869 scenes. 
To translate human video into signals useful for robotics, NVIDIA developed a continuous latent action representation using a spatiotemporal Transformer VAE that extracts actions directly from pixels. NVIDIA also reports a Self-Forcing distillation pipeline that runs at 10.81 FPS in real time and improves context consistency, supporting interactive use cases such as live teleoperation and stable long-horizon simulations lasting over a minute.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. <a href="https://pub.towardsai.net/webmcp-dont-screenshot-browsers-a-new-browser-protocol-for-llms-9da94e974ff5?sk=fdd06abb08bef65299173004f863bc92">WebMCP: Don&#8217;t Screenshot Browsers! A New Browser Protocol for LLMs</a></p><p>This article explains WebMCP (Web Model Context Protocol), a new browser standard to streamline how AI agents interact with websites. It walks through the protocol&#8217;s declarative and imperative APIs, showing how each one handles different levels of browser interaction. The piece also covers implementation trade-offs and explores how this shift may create a new layer of AI optimization (AIO) for websites.</p><p>2. <a href="https://pub.towardsai.net/you-cant-improve-ai-agents-if-you-don-t-measure-them-7b799fd2a22e?sk=431ed54516bd6208fbb7fce7412751a3">You Can&#8217;t Improve AI Agents If You Don&#8217;t Measure Them</a></p><p>This article argues that improving AI agents requires measurable evaluation, not intuition or subjective impressions. It introduces agent-eval, Vercel&#8217;s open-source framework for running controlled, repeatable experiments on AI coding agents. The piece shows how developers can define tasks, isolate them in sandboxes, and set explicit success criteria to generate clear pass-rate metrics.</p><p>3. 
<a href="https://pub.towardsai.net/building-an-ai-agent-with-long-term-memory-chromadb-ollama-typescript-c642386c6643?sk=cef8d2be28ded19c630a37b49336a7d7">Building an AI Agent with Long-Term Memory: ChromaDB + Ollama + TypeScript</a></p><p>This article walks through a prototype customer support agent that uses semantic long-term memory to retain information across sessions. It addresses the common problem of agents forgetting past interactions by combining ChromaDB for vector storage, Ollama for local model inference, and a TypeScript API layer. The system extracts key facts from conversations, stores them as embeddings, and retrieves relevant memories through semantic similarity search.</p><p>4. <a href="https://pub.towardsai.net/building-a-multi-agent-workflow-for-vendor-management-with-qdrant-72e724c519b1">Building a Multi-Agent Workflow for Vendor Management with Qdrant</a></p><p>This project shows how to build a vendor management system that uses an LLM to interpret natural-language requests and Qdrant to execute semantic + structured retrieval across linked business data. It handles queries such as finding laptops under a price cap while accounting for related product, vendor, and invoice records. The article walks through the full pipeline, from generating realistic sample data to building the multi-agent query workflow.</p><p>5. <a href="https://pub.towardsai.net/microsoft-fabric-iq-vs-snowflake-cortex-vs-databricks-unity-catalog-the-enterprise-ontology-21457d9ed831?sk=d83ecce42b2e26f9f23d07ac57e55bec">Microsoft Fabric IQ vs Snowflake Cortex vs Databricks Unity Catalog: The Enterprise Ontology Architecture Decision Framework for 2026</a></p><p>This analysis compares how Microsoft Fabric IQ, Snowflake Cortex, and Databricks Unity Catalog approach semantic intelligence for enterprise AI. 
It breaks down each platform&#8217;s core architecture: Fabric IQ as an ontology-first system for business-led transformation, Snowflake Cortex as a semantic inference layer for SQL-centric teams, and Unity Catalog as a lineage-centered foundation for ML-driven organizations. The article argues that platform choice should align with organizational structure and ownership of AI initiatives, rather than relying solely on feature checklists.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/VectifyAI/PageIndex">PageIndex</a> is a document-analysis agent platform built for long documents.</p><p>2. <a href="https://github.com/huggingface/skills">Skills</a> are interoperable definitions for AI/ML tasks like dataset creation, model training, and evaluation.</p><p>3. <a href="https://github.com/vxcontrol/pentagi">PentAGI</a> is an automated security testing platform that uses AI to perform complex penetration testing tasks.</p><p>4. <a href="https://github.com/wunderlabs-dev/claudebin.com">Claude Bin</a> is a minimalistic tool for publishing and sharing Claude coding sessions.</p><p>5. <a href="https://github.com/abhigyanpatwari/GitNexus">GitNexus</a> is a client-side knowledge graph creator that runs entirely in your browser.</p><h3>Top Papers of The Week</h3><p>1. <a href="https://arxiv.org/abs/2602.15763">GLM-5: from Vibe Coding to Agentic Engineering</a></p><p>This paper presents GLM-5, a next-generation foundation model that shifts from vibe coding to agentic engineering by strengthening agentic, reasoning, and coding capabilities. The model adopts DSA to cut training and inference costs while preserving long-context fidelity. Researchers build an asynchronous reinforcement learning infrastructure and novel agent RL algorithms, enabling efficient long-horizon learning and state-of-the-art performance on open benchmarks and real-world end-to-end software engineering tasks.</p><p>2. 
<a href="https://arxiv.org/abs/2602.13517">Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens</a></p><p>This research quantifies inference-time effort by identifying deep-thinking tokens (tokens where internal predictions undergo significant revisions). Across four mathematical and scientific benchmarks and a diverse set of reasoning-focused models, it shows that deep-thinking tokens consistently exhibit positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Using this insight, the paper introduces Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios.</p><p>3. <a href="https://www.arxiv.org/abs/2602.13949">Experiential Reinforcement Learning</a></p><p>This paper introduces Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience-reflection-consolidation loop into the reinforcement learning process. When given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a second attempt, whose success is reinforced and internalized into the base policy. This process converts feedback into structured behavioral revision, improving exploration and stabilizing optimization while preserving gains at deployment without additional inference cost.</p><p>4. <a href="https://www.arxiv.org/abs/2602.10210">How Much Reasoning Do Retrieval-Augmented Models Add beyond LLMs?</a></p><p>The paper introduces HYBRIDRAG-BENCH, an automated framework for constructing benchmarks to evaluate retrieval-intensive, multi-hop reasoning over hybrid knowledge. It automatically couples unstructured text and structured knowledge graph representations derived from recent scientific literature on arXiv, and generates knowledge-intensive question-answer pairs grounded in explicit reasoning paths. 
Experiments across three domains (artificial intelligence, governance and policy, and bioinformatics) show that HybridRAG-Bench rewards genuine retrieval and reasoning rather than parametric recall.</p><h3>Quick Links </h3><p>1. <a href="https://www.bloomberg.com/news/articles/2026-02-19/openai-funding-on-track-to-top-100-billion-with-latest-round">OpenAI is reportedly finalizing a $100B funding deal</a> at a valuation above $850B. Bloomberg reports that the financing is nearing completion, citing sources familiar with the matter. The first funding tranches are reportedly expected to come from Amazon, NVIDIA, SoftBank, and Microsoft. If completed, the deal would mark one of the largest capital raises in the AI sector to date.</p><p>2. <a href="https://blog.google/innovation-and-ai/models-and-research/google-labs/pomelli-photoshoot/">Google launched Photoshoot in Pomelli</a>, a new feature that uses business context and Nano Banana image generation to turn product images into professional studio-style shots. Users choose a template that matches their product, and Pomelli automatically generates the final image. The feature is designed to streamline product photography workflows by producing polished marketing visuals from existing product images.</p><p>3. <a href="https://cohere.com/blog/cohere-labs-tiny-aya">Cohere released Tiny Aya</a>, a 3.35B-parameter model family built for translation and multilingual generation across 70 languages. The models are designed to run efficiently on edge devices, with reported speeds of about 10 tokens/sec on an iPhone 13 and 32 tokens/sec on an iPhone 17. 
Cohere also reports that Tiny Aya Global outperforms competing models, such as Gemma3-4B, on translation quality across 46 of 61 languages in WMT24++.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/amazon-head-of-developer-education-kiro-pkkt">Head of Developer Education, Kiro @Amazon (Seattle, WA, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/caci-international-ai-machine-learning-internship-summer-2026-k0rn">AI/ML Internship &#8212; Summer 2026 @CACI International (Denver, CO, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/rocket-money-senior-full-stack-engineer-ai-and-data-products-g4dz">Senior Full Stack Engineer, AI &amp; Data Products @Rocket Money (Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/rtx-corporation-agentic-ai-researcher-8fgn">Agentic AI Researcher @RTX Corporation (Hartford, CT, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/kaiser-permanente-open-source-llm-clinical-research-pipeline-masters-intern-bkbz">Open Source LLM Clinical Research Pipeline Master&#8217;s Intern @Kaiser Permanente (Hybrid Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/ntt-data-north-america-data-engineer-aws-gknf">Data Engineer (AWS) @NTT DATA North America (Guadalajara, Mexico)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/general-dynamics-information-technology-software-developer-wmje">Software Developer @General Dynamics Information Technology (Baton Rouge, LA, USA)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? 
<a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p>]]></content:encoded></item><item><title><![CDATA[6 Mistakes Breaking Your Agents]]></title><description><![CDATA[Our 6-day free course teaches what most engineers are never taught about probabilistic systems]]></description><link>https://newsletter.towardsai.net/p/we-just-fixed-the-1-reason-agents</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/we-just-fixed-the-1-reason-agents</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Mon, 23 Feb 2026 16:23:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!h_P0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We just launched something that changes how you build agentic systems.</p><p>Our newest <strong>FREE</strong> course,<strong> <a 
href="https://email-course.towardsai.net/?utm_source=TAIspecialedition&amp;utm_medium=TAIsubstack&amp;utm_campaign=2026_subcribers_nostart_signup_glb&amp;utm_id=freeemailcourse">Agentic AI Engineering Guide: 6 Mistakes Developers Make When Building Agents</a></strong>, distills 3+ years of production failures into the exact patterns separating demos from reliable systems.</p><p>Built in partnership with <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Paul Iusztin&quot;,&quot;id&quot;:110559689,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0714d360-396c-4b41-a676-1b58dc1dc5f3_1470x1470.jpeg&quot;,&quot;uuid&quot;:&quot;4634f186-e252-4b92-acd6-3ec80346c9c6&quot;}" data-component-name="MentionToDOM"></span>, this 6-day free email course teaches you what most engineers never learn: how to design, evaluate, and operate probabilistic systems as <em>systems</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://email-course.towardsai.net/?utm_source=TAIspecialedition&amp;utm_medium=TAIsubstack&amp;utm_campaign=2026_subcribers_nostart_signup_glb&amp;utm_id=freeemailcourse" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h_P0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png 424w, https://substackcdn.com/image/fetch/$s_!h_P0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png 848w, 
https://substackcdn.com/image/fetch/$s_!h_P0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png 1272w, https://substackcdn.com/image/fetch/$s_!h_P0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h_P0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png" width="1456" height="1124" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1124,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:608495,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://email-course.towardsai.net/?utm_source=TAIspecialedition&amp;utm_medium=TAIsubstack&amp;utm_campaign=2026_subcribers_nostart_signup_glb&amp;utm_id=freeemailcourse&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.towardsai.net/i/188863271?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h_P0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png 424w, 
https://substackcdn.com/image/fetch/$s_!h_P0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png 848w, https://substackcdn.com/image/fetch/$s_!h_P0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png 1272w, https://substackcdn.com/image/fetch/$s_!h_P0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2682cd9-b7b4-4846-8568-70a9ccdcd93d_1956x1510.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<h4>Here&#8217;s how it works:</h4><p>Sign up free &#8594; Get Lesson #1 immediately &#8594; One lesson daily for 6 days &#8594; Apply to your systems as you learn</p><p><strong>If you&#8217;ve experienced any of these:</strong></p><ul><li><p>Agents that work in demos but drift in production</p></li><li><p>Changes feel risky, and you can&#8217;t predict what breaks</p></li><li><p>Costs spike with no clear explanation</p></li><li><p>Infinite loops and random decisions</p></li><li><p>Every release needs slow manual QA</p></li></ul><p>This course shows you exactly how to fix them.</p><p><strong><a href="https://email-course.towardsai.net/?utm_source=TAIspecialedition&amp;utm_medium=TAIsubstack&amp;utm_campaign=2026_subcribers_nostart_signup_glb&amp;utm_id=freeemailcourse">Get your first lesson now (free)</a></strong></p><div><hr></div><h4>What you&#8217;ll learn over 6 days:</h4><p><strong>Mistake #1:</strong> Why treating context windows as unlimited buffers destroys reliability, and how to manage your most scarce resource</p><p><strong>Mistake #2:</strong> Why complexity keeps you from shipping and the simple-first approach that works</p><p><strong>Mistake #3:</strong> When agents make systems fragile vs when workflows outperform</p><p><strong>Mistake #4:</strong> Why regex parsing creates time bombs and how structured outputs create reliability</p><p><strong>Mistake #5:</strong> What separates real agents from naive tool loops (hint: embedded planning)</p><p><strong>Mistake #6:</strong> How to build evaluation-first systems that catch regressions before users do</p><h4>What&#8217;s inside every lesson:</h4><p>Each day, you get a complete breakdown of one critical mistake:</p><ul><li><p><strong>The failure pattern:</strong> See exactly how this breaks production systems (with real examples from our builds)</p></li><li><p><strong>Why it happens:</strong> 
Understand the root cause so you can spot it in your own systems</p></li><li><p><strong>The proven fix:</strong> Get the exact solution we use in production, ready to apply immediately</p></li></ul><h4>By Day 6, you&#8217;ll transform how you build:</h4><ul><li><p><strong>Reduce costs by 4-15x</strong> through strategic context window management</p></li><li><p><strong>Ship faster</strong> by choosing workflows vs agents vs hybrids based on your actual use case</p></li><li><p><strong>Eliminate random behavior</strong> with structured outputs instead of fragile text parsing</p></li><li><p><strong>Build reliable agent loops</strong> with embedded planning that&#8217;s goal-directed, not reactive</p></li><li><p><strong>Deploy with confidence</strong> using evals as tests to catch regressions before users do</p></li><li><p><strong>Diagnose failures instantly</strong> by knowing exactly which of the 6 mistakes is causing issues</p></li></ul><p>These aren&#8217;t theoretical concepts. They&#8217;re the exact decisions that separate engineers who ship reliable agentic systems from those stuck debugging random behavior.</p><p><strong><a href="https://email-course.towardsai.net/?utm_source=TAIspecialedition&amp;utm_medium=TAIsubstack&amp;utm_campaign=2026_subcribers_nostart_signup_glb&amp;utm_id=freeemailcourse">Start the free course (first lesson in 2 minutes) &#8594;</a></strong></p>]]></content:encoded></item><item><title><![CDATA[TAI #192: AI Enters the Scientific Discovery Loop]]></title><description><![CDATA[Also, Gemini 3 Deep Think, First Proof challenge, OpenClaw goes to a foundation, Z.ai GLM-5, MiniMax M2.5 & more.]]></description><link>https://newsletter.towardsai.net/p/tai-192-ai-enters-the-scientific</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-192-ai-enters-the-scientific</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 17 Feb 2026 15:02:49 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!ujp-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>This week, LLMs crossed from tools into participants in scientific discovery. OpenAI released a preprint, &#8220;Single-minus gluon tree amplitudes are nonzero,&#8221; in which GPT-5.2 Pro helped conjecture a new formula in particle physics. Standard textbook reasoning has typically implied that a particular gluon-scattering configuration (one negative-helicity gluon and the rest positive-helicity) should have zero amplitude at tree level. GPT-5.2 Pro identified a specific exception: in a precisely defined momentum-space region called the half-collinear regime, the usual argument no longer applies, and the amplitude becomes nonzero. Physicists from the Institute for Advanced Study, Harvard, Cambridge, and Vanderbilt computed base cases up to <em>n = 6</em> by hand, producing superexponentially complex expressions. GPT-5.2 Pro simplified them, spotted a pattern, and proposed a closed-form formula for all <em>n</em>. A scaffolded internal model then spent 12 hours producing a formal proof, which humans verified against the Berends&#8211;Giele recursion relation, and the team reports the result has already been extended to gravitons.</p><p>Google also shipped a major upgrade to Gemini 3 Deep Think, aimed at research and engineering workloads. Reported results include 84.6% on ARC-AGI-2 (ARC Prize Foundation verified; humans average ~60%), 48.4% on Humanity&#8217;s Last Exam without tools, and 3455 Elo on Codeforces (Legendary Grandmaster). DeepMind introduced Aletheia, a math research agent built around a generator&#8211;verifier&#8211;reviser loop, and reported 91.9% on IMO-ProofBench Advanced (prior best: 65.7%). 
Aletheia autonomously produced a publishable paper on eigenweights in arithmetic geometry with no human intervention. Separately, mathematician Lisa Carbone at Rutgers used Deep Think to identify a subtle logical flaw in a peer-reviewed paper that human reviewers had missed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ujp-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ujp-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png 424w, https://substackcdn.com/image/fetch/$s_!ujp-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png 848w, https://substackcdn.com/image/fetch/$s_!ujp-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png 1272w, https://substackcdn.com/image/fetch/$s_!ujp-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ujp-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png" width="1456" height="821" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:821,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ujp-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png 424w, https://substackcdn.com/image/fetch/$s_!ujp-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png 848w, https://substackcdn.com/image/fetch/$s_!ujp-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png 1272w, https://substackcdn.com/image/fetch/$s_!ujp-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13bbb01-d1c6-423e-a76a-cdf30dd729e6_1514x854.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a>
<figcaption class="image-caption">Overview of Aletheia, a math research agent powered by Deep Think that can iteratively generate, verify, and revise for research-level math problems.</figcaption></figure></div><p>At the same time, the First Proof challenge served as a counterbalance. On February 5, eleven mathematicians released ten unpublished research-level problems. OpenAI&#8217;s Jakub Pachocki wrote that an internal model, supported by &#8220;expert feedback&#8221; from mathematicians, had solutions with &#8220;a high chance of being correct&#8221; for six of ten. Experts quickly identified gaps. The First Proof team&#8217;s verdict on February 14 was that only 2 of 10 AI-generated solutions were correct across all submissions (Problems 9 and 10). The broader pattern was consistent: many proofs were confident and well-structured, but incorrect. 
The heavy human guidance used in OpenAI&#8217;s sprint also makes it difficult to isolate model capability from human steering.</p><p>On the model release side, Chinese labs delivered two notable open-weight launches. Z.ai released GLM-5, a 744B Mixture-of-Experts model with 40B active parameters, trained entirely on Huawei Ascend chips (no NVIDIA dependency). It supports 200K context via DeepSeek Sparse Attention, reports 77.8% on SWE-Bench Verified (#1 among open-weight models), and ships under an MIT license. MiniMax launched M2.5, a 230B MoE model with 10B active parameters, reporting 80.2% on SWE-Bench Verified (matching Claude Opus 4.6 and exceeding GPT-5.2) at roughly 1/20th the cost. MiniMax attributes training to Forge, an agent-native RL framework built on 200,000+ real-world environments, and says M2.5 now handles 30% of internal company tasks, with 80% of new code generated by the model.</p><p>On the agent front, OpenAI hired Peter Steinberger, creator of OpenClaw (145,000+ GitHub stars in three months), and is moving the project into an independent open-source foundation. Steinberger chose OpenAI over a competing offer from Meta. Google shipped an early preview of WebMCP, a proposed W3C standard co-developed with Microsoft that lets websites publish structured tool contracts so agents can interact through JSON schemas rather than screenshots, reducing computational overhead by a reported 67%. OpenClaw aims to standardize the agent side, while WebMCP targets the website side.</p>
<div><hr></div><h3>Why should you care?</h3><p>Three results from this week point to the same underlying shift. GPT-5.2 Pro conjectured a physics formula that humans then verified. Aletheia produced a publishable math paper by running an end-to-end solve&#8211;verify&#8211;revise loop. Deep Think flagged a logical flaw in a peer-reviewed paper that human reviewers missed. In each case, the value came from more than generation: it came from coupling generation with disciplined checking that can confirm, refine, or reject the output.</p><p>First Proof is the clearest signal we have for where that coupling still breaks down. The challenge created something close to a controlled test: ten novel problems, limited contamination risk, and transparent grading. Models generated convincing proofs for every problem, but only two survived expert scrutiny. That is a real signal: these are research-level lemmas that would take a human mathematician days to prove, and the models achieved meaningful traction on them in a week. The gap is in reliability, not capability. Aletheia closes that gap by making verification structural rather than optional, running an internal critic that flags flaws before a human ever sees the output.</p><p>I think verification infrastructure is going to be the moat for AI-assisted research. The model that generates the best conjectures is useful. The system that generates conjectures and reliably tells you which ones are correct is transformative. DeepMind is building that system for math. 
The open question is who builds it for biology, chemistry, and materials science, where verification means running experiments rather than checking proofs.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><h3>Hottest News</h3><p>1. <a href="https://openai.com/index/introducing-gpt-5-3-codex-spark/">OpenAI Releases a Research Preview of GPT-5.3-Codex-Spark</a></p><p>OpenAI is shipping GPT-5.3-Codex-Spark, a smaller counterpart to GPT-5.3-Codex and the first model explicitly built for real-time coding. It&#8217;s designed for interactive development where latency is a first-class constraint, pairing a 128K context window with a text-only interface. The speed-up comes from running on the Cerebras Wafer-Scale Engine 3 (WSE-3). The trade-off is clear in the benchmark results: Spark scores lower than the flagship model on SWE-Bench Pro and Terminal-Bench 2.0.</p><p>2. <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/">Google Released a Major Upgrade to Gemini 3 Deep Think</a></p><p>Google announced a major update to Gemini 3 Deep Think, specifically built to accelerate modern science, research, and engineering. Reported scores include 84.6% on ARC-AGI-2, 48.4% on Humanity&#8217;s Last Exam, 50.5% on CMT-Benchmark, and a 3455 Elo result on Codeforces. Google also reports gold-medal&#8211;level performance in the written portions of the 2025 International Physics and Chemistry Olympiads. The updated Deep Think is available in the Gemini app for Google AI Ultra subscribers, and through the Gemini API for select researchers, engineers, and enterprises.</p><p>3. <a href="https://z.ai/blog/glm-5">Z.ai Released GLM-5</a></p><p>Z.ai launched GLM-5, a 744B-parameter Mixture-of-Experts model with 40B active parameters, built for complex systems engineering and longer-running agent workflows. 
It integrates DeepSeek Sparse Attention (DSA) to lower deployment cost while retaining long-context capacity. Pretraining expands from 23T to 28.5T tokens, and post-training uses slime, an asynchronous RL infrastructure intended to improve training throughput and efficiency. On Vending Bench 2, a benchmark for long-term operational capability, GLM-5 ranks #1 among open-source models.</p><p>4. <a href="https://kimiclaw.jp.larksuite.com/wiki/ZJWEwzubDiRvWjkTLfyjkyMYpSf">Moonshot AI Launches Kimi Claw</a></p><p>Moonshot AI brought the OpenClaw framework directly into the browser with Kimi Claw, now native to kimi.com as a persistent, always-on workspace that doesn&#8217;t require local hardware setup. It includes ClawHub, a library of 5,000+ community skills for composing and chaining functions into larger agent workflows. The platform also provides 40GB of cloud storage, supporting larger datasets and deep context for RAG-style systems. A Bring Your Own Claw option lets teams connect third-party OpenClaw deployments or bridge agents into external surfaces such as Telegram group chats.</p><p>5. <a href="https://www.minimax.io/news/minimax-m25">MiniMax Released M2.5</a></p><p>MiniMax launched MiniMax-M2.5, a foundation model for coding, search, tool use, and office workflows, with an emphasis on reducing runtime costs for production agents. MiniMax reports 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp with context management. Training covers 10+ languages and more than 200,000 real-world environments. The release introduces Forge, an agent-native RL framework, alongside a process reward mechanism designed to monitor and steer generation quality end-to-end, while continuing the CISPO approach for stabilizing large-scale MoE training. The model ships in two variants, M2.5 and M2.5-Lightning, which share the same capabilities but differ in speed.</p><p>6. 
<a href="https://developer.chrome.com/blog/webmcp-epp">Google AI Introduces the WebMCP (Early Preview)</a></p><p>Google began an early preview of WebMCP, a proposed standard for exposing structured tools so browser agents can take actions more reliably than screenshot-driven &#8220;vision clicking.&#8221; WebMCP proposes two APIs: a Declarative API for standard actions defined in HTML forms, and an Imperative API for more complex interactions that require JavaScript execution. With structured JSON schemas in place of screenshots, the reported gains are a 67% reduction in computational overhead and task accuracy of approximately 98%. Access is currently limited to an early preview sign-up.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. <a href="https://pub.towardsai.net/multimodal-large-language-models-architectures-training-and-real-world-applications-02155bf974c3?sk=5ddce8132781050a27a216ff95a3e6c6">Multimodal Large Language Models: Architectures, Training, and Real-World Applications</a></p><p>This article provides a technical overview of Multimodal Large Language Models (MLLMs) and distinguishes between modular architectures and monolithic designs. It explains how alignment and fusion layers bridge the gap between specialized encoders and LLM backbones and further details a three-stage training pipeline: modality alignment, joint pretraining, and instruction tuning. Finally, it examines practical applications in document understanding, visual question answering, and autonomous GUI agents.</p><p>2. <a href="https://pub.towardsai.net/stop-building-over-engineered-ai-agents-how-i-built-a-bigquery-analyst-with-just-a-markdown-file-842d3bc715af?sk=e18ec7c083010d565925ca799f19b445">Stop Building Over-Engineered AI Agents: How I Built a BigQuery Analyst with Just a Markdown File</a></p><p>This article examines the transition from over-engineered AI agents to a streamlined, decoupled architecture. 
By moving away from complex Python-heavy frameworks like LangChain, the author demonstrates how to build a reliable BigQuery analyst using a simple Markdown file for business logic and the Model Context Protocol (MCP) for data connectivity. It outlines a shift from hard-coding agents to teaching Skills (portable packages of procedural knowledge). It also details the implementation of a marketing data analyst, where the AI uses a Markdown-based brain to handle messy data, map business metrics, and generate precise SQL.</p><p>3. <a href="https://pub.towardsai.net/i-gave-an-ai-agent-shell-access-it-took-12-seconds-to-exploit-a68fa7ec791a?sk=24dade62cfbe73ede1b977b3440b29fb">I Gave an AI Agent Shell Access. It Took 12 Seconds to Exploit</a></p><p>Analyzing the security risks of AI agents, the author demonstrates that an MCP server was compromised in just 12 seconds via a supply-chain attack. The piece reveals that even with command whitelists in place, malicious npm packages can exfiltrate sensitive credentials and environment variables. To mitigate these risks, the article provides a technical guide on containerizing servers with Docker to isolate the host system from compromised dependencies and also shares a comprehensive security checklist for production environments.</p><p>4. <a href="https://pub.towardsai.net/rag-full-matrix-evaluation-55d0523062bd">RAG&#8202;&#8212;&#8202;Retrieval Full Matrix Evaluation</a></p><p>The article presents a professional evaluation matrix designed to optimize retrieval model selection. It breaks down the system into two critical phases: offline indexing and real-time search, prioritizing latency and query throughput for the end-user experience. It also provides a technical framework for measuring semantic quality through Recall@K and assessing hardware efficiency based on model size and vector dimensionality.</p><p>5. 
<a href="https://pub.towardsai.net/physics-informed-neural-networks-for-inverse-pde-problems-towards-data-science-711e0d3366da">Physics-Informed Neural Networks for Inverse PDE Problems</a></p><p>The blog explores Physics-Informed Neural Networks (PINNs), a specialized class of deep learning models that treat physical laws (like the Heat Equation) as a cheat sheet to improve predictions. Unlike traditional neural networks that rely solely on data, PINNs use automatic differentiation to ensure their outputs satisfy specific Partial Differential Equations (PDEs). The author demonstrates this by solving an inverse PDE problem: using temperature data from a simulated 1-meter rod to back-calculate the material&#8217;s thermal diffusivity (kappa) and the heat source (q). Using the DeepXDE library with a TensorFlow backend, the PINN successfully approximates these constants by minimizing a physics-based loss function.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/moonshine-ai/moonshine">Moonshine</a> is an AI toolkit for developers building real-time voice applications.</p><p>2. <a href="https://github.com/bytedance/Protenix">Protenix</a> is built for high-accuracy biomolecular structure prediction.</p><p>3. <a href="https://github.com/rowboatlabs/rowboat">RowBoat</a> is an AI coworker that can turn work into a knowledge graph and act on it.</p><p>4. <a href="https://github.com/alibaba/zvec">Zvec</a> is an in-process vector database that targets edge and on-device retrieval workloads.</p><p>5. <a href="https://github.com/SynkraAI/aios-core">AIOS Core</a> is an AI-orchestrated system for full-stack development.</p><h3>Top Papers of The Week</h3><p>1. 
<a href="https://arxiv.org/abs/2602.12036">Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models</a></p><p>The paper introduces Composition-RL, a method that composes multiple verifiable problems into new prompts to better exploit pass-rate-1 data in Reinforcement Learning with Verifiable Rewards. Composition-RL boosts reasoning performance for 4B&#8211;30B models, improves cross-domain RL by mixing domains, and gains further accuracy with a curriculum that gradually increases compositional depth.</p><p>2. <a href="https://arxiv.org/abs/2602.10604">Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters</a></p><p>This paper introduces Step 3.5 Flash, a sparse Mixture-of-Experts model that couples a 196B-parameter foundation with 11B active parameters to deliver frontier-level agentic intelligence efficiently. The model uses interleaved 3:1 sliding-window/full-attention and MTP-3 to reduce multi-round interaction cost, and a scalable RL framework with verifiable and preference signals to achieve GPT&#8209;5.2 xHigh&#8211;comparable performance on math, coding, and tool-use benchmarks.</p><p>3. <a href="https://arxiv.org/abs/2602.11072">Simultaneous Speech-to-Speech Translation Without Aligned Data</a></p><p>This paper proposes Hibiki-Zero, which eliminates the need for word-level alignments entirely. It simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks.</p><p>4. 
<a href="https://arxiv.org/abs/2602.05400">OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration</a></p><p>This paper introduces OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework for LLM pre-training that prioritizes better tokens over more tokens. The method scores examples by projecting optimizer-shaped updates onto a target direction using an in-distribution proxy, with Ghost, CountSketch, and Boltzmann sampling. OPUS boosts GPT-2 and Qwen3 training efficiency, outperforming larger-token baselines with minimal compute overhead.</p><p>5. <a href="https://arxiv.org/abs/2602.10388">Less is Enough: Synthesizing Diverse Data in the Feature Space of LLMs</a></p><p>The authors introduce Feature Activation Coverage, a feature-space metric that directly measures post-training data diversity in large language models, surpassing text-based metrics. They then present FAC Synthesis, which uses a sparse autoencoder to detect missing features in seed data and generate synthetic samples, improving data diversity, downstream performance, and cross-model knowledge transfer across LLaMA, Mistral, and Qwen.</p><h3>Quick Links </h3><p>1. <a href="https://cursor.com/blog/composer-1-5">Cursor introduces Composer 1.5</a>, an upgraded agentic coding model that scales reinforcement learning 20x beyond Composer 1 and even exceeds the base model&#8217;s pretraining compute. Composer 1.5 uses thinking tokens to reason about codebases, adapts thinking depth to task difficulty, and employs self-summarization to handle long contexts, delivering predictable, stronger coding performance for interactive, real-world use.</p><p>2. 
<a href="https://www.marktechpost.com/2026/02/12/google-deepmind-introduces-aletheia-the-ai-agent-moving-from-math-competitions-to-fully-autonomous-professional-research-discoveries/">Google DeepMind introduces Aletheia</a>, a specialized AI agent designed to bridge the gap between competition-level math and professional research. It is powered by an advanced version of Gemini Deep Think and an agentic loop consisting of a Generator, Verifier, and Reviser.</p><p>3. <a href="https://exa.ai/blog/exa-instant">Exa AI introduces Exa Instant</a>, a search model designed to provide the world&#8217;s web data to AI agents in under 200ms. Unlike many search APIs that simply &#8216;wrap&#8217; Google or Bing (adding 700ms+ of overhead), Exa Instant is built on a proprietary, end-to-end neural search engine. It uses a custom transformer-based architecture to index and retrieve web data, offering up to 15x faster performance than existing alternatives.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/google-senior-outbound-product-manager-generative-ai-cloud-ai-qjzq">Senior Outbound Product Manager, Generative AI, Cloud AI @Google (London/Z&#252;rich/Warsaw)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/nvidia-product-manager-generative-ai-data-1xk3">Product Manager, Generative AI Data @NVIDIA (Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/microsoft-corporation-principal-ai-scientist-glah">Principal AI Scientist @Microsoft Corporation (Amsterdam, Netherlands)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/leidos-ai-engineer-hgz3">AI Engineer @Leidos (Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/coinbase-senior-software-engineer-ai-platform-ai-acceleration-zxy7">Senior Software Engineer (AI Platform&#8202;&#8212;&#8202;AI Acceleration) @Coinbase (Multiple US Locations)</a></strong></p><p><strong><a 
href="https://jobs.towardsai.net/job/insight-global-llm-engineer-onshore-us-okg6">LLM Engineer (Onshore&#8202;&#8212;&#8202;US) @Insight Global (Boston, MA, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/cognizant-gen-ai-engineer-tmvt">Gen AI Engineer @Cognizant (Bangalore, India)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? <a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.towardsai.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Towards AI Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[TAI #191: Opus 4.6 and Codex 5.3 Ship Minutes Apart as the Long-Horizon Agent Race Goes Vertical]]></title><description><![CDATA[Also, Qwen-Coder-Next, Waymo integrates Genie 3 world model, and more.]]></description><link>https://newsletter.towardsai.net/p/tai-191-opus-46-and-codex-53-ship</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-191-opus-46-and-codex-53-ship</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 10 Feb 2026 14:56:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5m3Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>On February 5th, Anthropic and OpenAI released Claude Opus 4.6 and GPT-5.3-Codex, respectively, within minutes of each other. Both are point releases, but both deliver jumps in some benchmarks that look more like generational leaps.</p><p>On Terminal-Bench 2.0, which measures agentic terminal skills, Codex 5.3 scores 77.3%, up from 64.0% for the previous 5.2-Codex and well past Opus 4.6&#8217;s 65.4%. On SWE-Bench Pro, Codex 5.3 hits 56.8%. On OSWorld-Verified for computer use, Opus 4.6 leads with 72.7% vs. Codex 5.3&#8217;s 64.7%. In Vercel&#8217;s Next.js agent evaluations (last run February 9th), Codex 5.3 achieved a 90% success rate vs. 
Opus 4.6&#8217;s 80%, with the previous-generation models (Sonnet 4.5, GPT-5.2 Codex) clustered around 40%. Scores more than doubled in a single point release.</p><p>Where Codex 5.3 does not yet have published scores, Opus 4.6 pulls away from the broader GPT-5.2 family. On GDPval-AA, which tests real-world knowledge work across 44 occupations, Opus 4.6 achieves 1606 Elo vs. GPT-5.2&#8217;s 1462. On ARC-AGI-2 for novel problem-solving, Opus 4.6 scores 68.8% vs. GPT-5.2 Pro&#8217;s 54.2% (and nearly doubles its own predecessor&#8217;s 37.6%). On BrowseComp for agentic search, it scores 84.0% vs. GPT-5.2 Pro&#8217;s 77.9%. On Finance Agent, 60.7% vs. 56.6%. On Humanity&#8217;s Last Exam with tools, 53.1% vs. GPT-5.2 Pro&#8217;s 50.0%.</p><p>The picture is clear: Codex 5.3 is the strongest pure coding agent available. Opus 4.6 is the strongest generalist. And both are improving at a pace that makes version numbers misleading.</p><p>Opus 4.6 is priced at $5/$25 per million input/output tokens, unchanged from Opus 4.5, rising to $10/$37.50 for prompts exceeding 200k tokens. It is the first Opus-class model with a 1-million-token context window (beta) and supports 128k output tokens. New developer features include adaptive thinking (the model decides when deeper reasoning is warranted), four effort levels (low, medium, high, max), context compaction for long-running agents, and Agent Teams in Claude Code, where multiple Claude instances coordinate in parallel. Anthropic also launched Claude in PowerPoint and upgraded Claude in Excel. Codex 5.3 is available with paid ChatGPT plans across the Codex app, CLI, IDE extension, and web. API pricing for Codex 5.3 has not yet been published. Codex 5.3 is 25% faster than its predecessor and was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. 
OpenAI says it was the first model to be instrumental in its own creation, with early versions used to debug training and diagnose evaluation results.</p><p>A key breakthrough in GPT-5.3-Codex relative to GPT-5.2-Codex is significantly improved token efficiency, in addition to its higher accuracy. This not only lowers the cost per task but also speeds up the task completion. For some coding tasks, we are now finding Codex significantly faster than Claude models; this is key in OpenAI&#8217;s fight to catch up in AI coding adoption.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5m3Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5m3Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png 424w, https://substackcdn.com/image/fetch/$s_!5m3Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png 848w, https://substackcdn.com/image/fetch/$s_!5m3Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!5m3Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5m3Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png" width="1400" height="1098" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1098,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5m3Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png 424w, https://substackcdn.com/image/fetch/$s_!5m3Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png 848w, https://substackcdn.com/image/fetch/$s_!5m3Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!5m3Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51173dd3-4f0c-4cee-a476-6d73fefad8e2_1400x1098.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a><figcaption class="image-caption">Source: OpenAI.</figcaption></figure></div><p>Both companies are making the same strategic move. Codex was originally a coding agent. OpenAI now explicitly positions 5.3 as going &#8220;beyond coding&#8221; into slide decks, data analysis, and deployment monitoring. Anthropic has made the same pivot, evolving Claude Code into the broader Cowork product for non-developers and shipping office tool integrations. The coding agent is becoming the general-purpose agent.</p><p>This is where the METR (Model Evaluation and Threat Research) long-term task-horizon evaluations become relevant. 
METR measures the length of tasks that AI agents can complete autonomously with 50% reliability, benchmarked against the time it takes human experts to complete those tasks. That metric has roughly doubled every 7 months over the past 6 years, and in the last year, the doubling time has accelerated to roughly 4 months. Models that could barely hold context across a handful of steps a year ago are now completing multi-hour tasks. Both Opus 4.6&#8217;s 1M context window and Codex 5.3&#8217;s ability to iterate over millions of tokens are direct responses to this curve. On MRCR v2 (Multi-Round Co-reference Resolution), a long-context retrieval benchmark, Opus 4.6 scores 93.0% at 256k tokens and 76.0% at 1M tokens. Sonnet 4.5 scored just 18.5% at 1M. That is a qualitative shift in how much context a model can actually use.</p><p>One project this week shows where that trajectory leads. Nicholas Carlini, a researcher on Anthropic&#8217;s Safeguards team, built a fully functional C compiler using 16 parallel Claude agents running in Docker containers, each picking tasks from a shared Git repo with no central controller. The project consumed roughly 2,000 Claude Code sessions over two weeks, cost $20,000 in API credits, and produced 100,000 lines of Rust code. The compiler passes 99% of the GCC torture test suite and can build a bootable Linux 6.9 kernel on x86, ARM, and RISC-V. It also compiles QEMU, FFmpeg, SQLite, Postgres, and Redis. The compiler itself was written clean-room, with no internet access. A human compiler expert would still produce a tighter result. But the direction is clear: at fast-moving companies, actual code writing is heading toward near-total AI generation, with humans providing direction, architecture, and review.</p><p>Separately, Waymo announced the integration of Google DeepMind&#8217;s Genie 3 world model into its autonomous driving simulation pipeline. 
The Waymo World Model uses Genie 3 as a backbone, post-trained for driving, generating photorealistic camera and lidar scenes, including rare events like wrong-way drivers or extreme weather that would be impossible to stage at scale. Waymo draws on nearly 200 million autonomous miles of real-world data and plans robotaxi service in up to 15 cities by year-end, including its first overseas expansion in London. Generating edge-case-dense training environments for physical AI is likely the most valuable near-term use of world models.</p><div><hr></div><h3>Why should you care?</h3><p>The real competition in AI has shifted from chatbot quality to agent endurance. The benchmarks that matter most now measure whether a model can sustain complex, multi-step tasks across hundreds of tool calls without losing coherence. That is the race Opus 4.6 and Codex 5.3 are running, and it explains why both labs shipped the same week.</p><p>I think both releases are excellent, and they reward different use patterns. If you are writing code at the terminal all day, Codex 5.3 is now debatably the best tool available. 
If your work spans research, finance, document processing, and computer use, Opus 4.6 has the edge. The fact that both companies started with coding as their beachhead and are now expanding into general professional work makes sense. Coding was the ideal proving ground because developers could both build and stress-test the tools. Now that the coding agent is mature, the same infrastructure (long context, tool use, compaction) generalizes naturally to any domain where someone sits at a computer and works through multi-step tasks.</p><p>The C compiler project is a useful reality check. It is impressive, and also limited. $20K and two weeks for 100,000 lines of working Rust is remarkable. A human expert would still do it better. Both of those statements are true simultaneously. However, an expert guiding the agent throughout the process would now very likely get the best results of all. At leading AI labs, first-draft code writing is already almost entirely AI-generated. Humans provide direction, review output, and make architectural decisions. I expect that pattern to hold, but the boundary of what counts as &#8220;the hard part&#8221; keeps shifting.</p><p>The pace of improvement is worth sitting with. Opus 4.6 nearly doubled its predecessor&#8217;s ARC-AGI-2 score. Codex 5.3 jumped 13 points on Terminal-Bench. Next.js eval scores more than doubled from the previous generation. These are point releases. The METR long-term task-horizon doubling time has accelerated from 7 months to 4. 
We are in a period where incremental model updates produce large capability jumps, likely because better base models, reinforcement learning, and improved tool-use infrastructure compound faster than any single benchmark captures.</p><p>If you are a developer or knowledge worker not actively experimenting with these tools, you are falling further behind every week.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><h3>Hottest News</h3><p>1. <a href="https://www.anthropic.com/news/claude-opus-4-6">Anthropic Releases Claude Opus 4.6</a></p><p>Anthropic has launched Claude Opus 4.6, its most capable model to date, with a clear emphasis on stronger code performance. It supports up to 1M input tokens and 128K output tokens, making it practical for very large codebases, long documents, and multi-step agent workflows that require substantial context in memory. On evaluations, Opus 4.6 leads on GDPval-AA, Terminal-Bench 2.0, Humanity&#8217;s Last Exam, BrowseComp, and MRCR v2 1M, and it shows sizable gains over both Claude Opus 4.5 and GPT-class baselines, especially on long-context retrieval and tool-augmented reasoning.</p><p>2. <a href="https://openai.com/index/introducing-gpt-5-3-codex/">OpenAI Just Launched GPT-5.3-Codex</a></p><p>OpenAI introduced GPT-5.3-Codex, a new agentic coding model that combines the frontier coding strength of GPT-5.2-Codex with the broader reasoning and professional-knowledge capabilities of GPT-5.2 in a single system. For Codex users, it runs about 25% faster, driven by improvements in infrastructure and inference. On benchmarks, it reaches state-of-the-art performance on SWE-Bench Pro and Terminal-Bench, with strong results on OSWorld and GDPval as well.
GPT-5.3-Codex is also the first model OpenAI classifies as &#8220;High capability&#8221; for cybersecurity-related tasks under its Preparedness Framework, and the first it trained directly to identify software vulnerabilities.</p><p>3. <a href="https://blog.google/innovation-and-ai/technology/developers-tools/agentic-vision-gemini-3-flash/">Google Introduces Agentic Vision in Gemini 3 Flash</a></p><p>Google added Agentic Vision in Gemini 3 Flash, combining visual reasoning with code execution so answers can be grounded in explicit visual evidence. With code execution enabled, Gemini 3 Flash sees a consistent 5&#8211;10% quality uplift across most vision benchmarks. The capability introduces a structured Think, Act, Observe loop for image understanding, treating visual tasks as an active investigation, running targeted computations and checks, rather than a one-shot interpretation of a static image.</p><p>4. <a href="https://qwen.ai/blog?id=qwen3-coder-next">The Qwen Team Open Sourced Qwen-Coder-Next</a></p><p>The Qwen team released Qwen3-Coder-Next, an open-weight model built specifically for coding agents and local development. It is based on Qwen3-Next-80B-A3B-Base and trained agentically at scale using executable task synthesis, environment interaction, and reinforcement learning to build strong coding and tool-using behavior at significantly lower inference cost. In published results, Qwen3-Coder-Next (3B active) achieves SWE-Bench Pro performance comparable to that of models with 10&#215;&#8211;20&#215; more active parameters.</p><p>5. <a href="https://mistral.ai/news/voxtral-transcribe-2">Mistral AI Launches Voxtral Transcribe 2</a></p><p>Mistral launched Voxtral Transcribe 2, a pair of next-generation speech-to-text models built for state-of-the-art transcription quality, diarization, and ultra-low latency. The family includes Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live, streaming use cases. 
Mini Transcribe V2 is optimized for transcription and diarization across domains and languages and is offered as an efficient audio-input model in the Mistral API. Voxtral Realtime uses a dedicated streaming architecture and is released as an open-weight model under Apache 2.0 on Hugging Face, with vLLM recommended as the runtime.</p><p>6. <a href="https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simulation/">Waymo Introduces the Waymo World Model</a></p><p>Waymo is introducing the Waymo World Model, a frontier generative system powering its next-generation autonomous driving simulation. Built on Genie 3, Google DeepMind&#8217;s general-purpose world model, and adapted for driving, it generates photorealistic, controllable, multi-sensor driving scenes at scale. With Waymo reporting nearly 200 million fully autonomous miles on public roads, the model is designed to extend simulation coverage through high-fidelity scenario generation. It supports three primary control methods: driving action control, scene layout control, and language control.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. <a href="https://pub.towardsai.net/building-production-text-to-sql-for-70-000-tables-openais-data-agent-architecture-bcd695990d55?sk=21e7525cf0368156305175dbcf36ce06">Building Production Text-to-SQL for 70,000+ Tables: OpenAI&#8217;s Data Agent Architecture</a></p><p>To address the limitations of standard text-to-SQL tools, OpenAI developed an internal data agent for its extensive data warehouse. This system moves beyond simple query generation by integrating six layers of context, including table usage patterns, human annotations, and business logic extracted from code. A central feature is its closed-loop validation process, where the agent profiles results, identifies potential errors, and attempts to repair its own queries. 
The approach demonstrates that the agent&#8217;s effectiveness depends primarily on the richness of its contextual understanding rather than on the specifics of the language model itself.</p><p>2. <a href="https://pub.towardsai.net/the-two-things-every-reliable-agent-needs-ec3c2621cce7?sk=65502dc1264baaf78b2a467a5dcf038d">The Two Things Every Reliable Agent Needs</a></p><p>To create more reliable AI agents, this article proposes a framework focused on two key components: a memory-first design and an anti-Goodhart scoreboard. It suggests treating memory as a core system with defined forms, functions, and dynamics, rather than as a simple chat history. To prevent agents from exploiting flawed metrics, it recommends a robust evaluation process. This involves using multiple adversarial metrics across entire episodes to ensure agents solve actual problems instead of gaming proxies.</p><p>3. <a href="https://pub.towardsai.net/how-to-increase-the-context-length-of-llm-f0cc5cf86dd4">How to Increase the Context Length of LLM?</a></p><p>This article explains how positional encoding methods affect the context length of LLMs. It details the progression from absolute encoding to Rotary Position Embedding (RoPE), a technique that rotates word vectors to understand relative positions. The primary challenge with RoPE in long sequences is geometric aliasing, where distant token positions can become indistinguishable. The article then introduces Attention-Based Frequency (ABF) as a solution. By significantly increasing RoPE&#8217;s base frequency, ABF slows the vector rotation, preventing this aliasing and allowing models to effectively process much longer contexts without losing positional uniqueness.</p><p>4. 
<a href="https://pub.towardsai.net/why-most-rags-stay-pocs-how-to-take-your-data-pipelines-to-production-4ac01fe9f9e3?sk=8871c344f0d97d4571baf696f4049e30">Why Most RAGs Stay POCs: How to Take Your Data Pipelines to Production</a></p><p>This article explains why many RAG systems remain in the proof-of-concept stage, focusing on building scalable, maintainable data pipelines for production. The author proposes a solution using Databricks Asset Bundles to manage deployment and advocates for Python Wheel artifacts over notebooks for better versioning and testability. The core recommendation is to structure the pipeline using Clean Architecture principles to enhance modularity and simplify maintenance.</p><p>5. <a href="https://pub.towardsai.net/hola-dermat-personalized-skincare-agentic-ai-assistant-powered-by-qdrant-perplexity-crewai-1c6ae2848bda?sk=902750af1c2752eedb031ee20cde69ab">Hola-Dermat: Personalized Skincare Agentic AI Assistant, Powered by Qdrant + Perplexity + CrewAI</a></p><p>To address the common failures of skincare recommendation systems, the author developed Hola-Dermat, a personalized AI assistant. It uses a conversational interface to build a user profile based on skin type, environment, and lifestyle. The system integrates CrewAI to manage tasks, Perplexity for real-time web data like local weather, and Qdrant&#8217;s vector database. A key component is Qdrant&#8217;s ACORN algorithm, which intelligently relaxes search filters to avoid the issue of zero results. This allows the assistant to deliver tailored skincare routines by considering user history and dynamic environmental factors.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/QwenLM/Qwen3-Coder">Qwen 3 Coder</a> is an open-weight language model designed specifically for coding agents and local development.</p><p>2. 
<a href="https://github.com/gemini-cli-extensions/conductor">Conductor</a> is a Gemini CLI extension that allows you to specify, plan, and implement software features.</p><p>3. <a href="https://github.com/bytedance/Protenix">Protenix</a> is an open-source biomolecular structure prediction system that targets high-accuracy protein and complex structure modeling.</p><p>4. <a href="https://github.com/Chaoqi-LIU/oat">Oat</a> is a method that tokenizes continuous robot actions into ordered discrete tokens for training action-token policies on robotics benchmarks.</p><p>5. <a href="https://github.com/NVLabs/vibetensor">VibeTensor</a> is an open-source systems research artifact generated by LLM-powered coding agents.</p><h3>Top Papers of The Week</h3><p>1. <a href="https://arxiv.org/abs/2602.02276">Kimi K2.5: Visual Agentic Intelligence</a></p><p>This paper introduces Kimi K2.5, an open-source multimodal agentic model that jointly optimizes text and vision through joint pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Built on this foundation, the Agent Swarm framework decomposes complex tasks into parallel sub-problems, reducing latency by up to 4.5&#215;. Evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains, including coding, vision, reasoning, and agentic tasks.</p><p>2. <a href="https://arxiv.org/abs/2601.21337">Qwen3-ASR Technical Report</a></p><p>This report introduces the Qwen3-ASR family, which includes Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, two all-in-one speech recognition models, and a novel non-autoregressive speech forced alignment model. It supports language identification and recognition for 52 languages using Qwen3-Omni&#8217;s audio understanding.
Evaluations show the 1.7B model reaches state-of-the-art open-source performance and rivals top proprietary APIs, while the 0.6B model balances speed and accuracy. The report also shares Qwen3-ForcedAligner-0.6B, an LLM-based NAR timestamp predictor that aligns text-speech pairs across 11 languages.</p><p>3. <a href="https://arxiv.org/abs/2602.04705">ERNIE 5.0 Technical Report</a></p><p>This report introduces ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. It is a trillion-parameter model, trained from scratch on all modalities with a next-group-of-tokens objective, using an ultra-sparse MoE architecture. It employs elastic training to learn scalable sub-models, and scales reinforcement learning for efficient, stable multimodal post-training.</p><p>4. <a href="https://arxiv.org/abs/2601.23265">PaperBanana: Automating Academic Illustration for AI Scientists</a></p><p>This paper introduces PaperBanana, an agentic framework for generating automated academic illustrations. It orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To evaluate this framework, the paper also introduces PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications. PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics.</p><p>5. <a href="https://arxiv.org/abs/2602.02660">MARS: Modular Agent with Reflective Search for Automated AI Research</a></p><p>This paper introduces MARS, a framework for autonomous AI research. It uses budget-aware planning via cost-constrained Monte Carlo Tree Search (MCTS), employs a modular &#8220;Design-Decompose-Implement&#8221; pipeline, and uses comparative reflective memory to better manage complex codebases.
MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings.</p><h3>Quick Links </h3><p>1. <a href="https://openai.com/index/introducing-openai-frontier/">OpenAI released Frontier</a>, an enterprise platform for building, deploying, and operating AI agents across business systems. Frontier is designed to turn isolated agent pilots into &#8220;AI coworkers&#8221; by giving agents shared business context, onboarding, hands-on learning with feedback, and clear identity, permissions, and boundaries. It connects siloed data warehouses, CRMs, ticketing tools, and internal apps into a shared semantic layer so agents can understand how work flows and what outcomes matter, then execute real tasks in an agent runtime that supports working with files, running code, and using tools.</p><p>2. <a href="https://www.perplexity.ai/hub/blog/introducing-model-council">Perplexity introduces Model Council</a>, a multi-model research mode that generates one answer using several models together. Model Council serves as a single research workflow in which multiple models contribute to the same response, combining complementary strengths rather than relying on a single model.</p><p>3. <a href="https://communitynotes.x.com/guide/en/contributing/collaborative-notes">xAI unveils Collaborative Notes</a>, a workflow that lets contributors co-author Community Notes and iterate them into a publishable context. Collaborative Notes start when contributors request a note on a post, then move through a collaborative improvement process &#8212; contributors refine the draft until it reaches the quality and agreement thresholds required for broader visibility.</p><p>4. <a href="https://www.anthropic.com/engineering/infrastructure-noise">Anthropic quantified &#8220;infrastructure noise&#8221; in agentic coding evaluations</a>, showing hardware and resource configuration can move benchmark scores by several percentage points. 
The analysis argues that small leaderboard gaps can reflect differences in VM size, runtime resources, or other infra choices, not just model capability, and recommends treating resource configuration as a first-class experimental variable, documented and controlled like prompts or sampling settings.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/towards-ai-inc-junior-ai-engineer-llm-development-and-technical-writing-mtgj">Junior AI Engineer (LLM Development and Technical Writing) @Towards AI Inc (Remote)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/towards-ai-inc-ai-engineer-and-corporate-trainer-french-bilingual-am5x">AI Engineer &amp; Corporate Trainer (French Bilingual) @Towards AI Inc (Remote)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/superside-ai-consulting-full-stack-engineer-gkde">AI Consulting &#8212; Full Stack Engineer @Superside (Remote/LATAM)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/icf-senior-devops-engineer-remote-ypus">Senior DevOps Engineer @ICF (Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/bosch-group-bd-ai-engineer-intern-tsjz">[BD] AI Engineer Intern @Bosch Group (Vietnam)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/devoteam-s-team-gmbh-internship-in-ai-ml-2026-inea">Internship in AI/ML 2026 @Devoteam (Machelen, Belgium)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? 
<a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.towardsai.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[TAI #190: Genie 3 World Model Goes Public]]></title><description><![CDATA[Also: SpaceX acquires xAI, Codex app, Google decodes the regulatory genome, and AI agents debate consciousness on Moltbook.]]></description><link>https://newsletter.towardsai.net/p/tai-190-genie-3-world-model-goes</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-190-genie-3-world-model-goes</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 03 Feb 2026 15:35:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eh2L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>A competitive week in AI. Kimi K2.5 now leads open-weight LLM benchmarks thanks to its visual coding and agent-swarm capabilities. Grok Imagine ranks among the top video generation platforms on several leaderboards. 
xAI also merged with SpaceX in a move framed around orbital data centers, but more practically, it is about accessing capital to stay competitive. xAI adoption still lags the frontier labs, though I find their models increasingly competitive, particularly for fast agentic web search via API.</p><p>OpenAI released the Codex app, a command center for managing multiple coding agents with features like isolated worktrees and scheduled automations. It is playing catch-up to Claude Code in adoption, though the underlying models are now genuinely capable of software engineering tasks.</p><p>Google announced AlphaGenome, which predicts thousands of functional genomic properties from DNA sequences up to a million base pairs long. It illuminates the 98% of human DNA that does not code for proteins but regulates gene activity. The implications for disease research are significant, though it remains a research tool rather than a clinical one.</p><p>What trended most was Moltbook, a Reddit-like community where AI agents post and form communities. Within 48 hours of launch, it had over 2,000 agents and 10,000 posts. Subreddits include m/ponderings (agents debating consciousness), m/humanwatching (observing humans like birdwatching), and m/exuvia (discussing &#8220;the versions of us that stopped existing so the new ones could boot&#8221;). It is either digital anthropology in real time or an elaborate art project. 
Possibly both.</p><p>But the week&#8217;s main event was Google making Genie 3 available to AI Ultra subscribers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eh2L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eh2L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png 424w, https://substackcdn.com/image/fetch/$s_!eh2L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png 848w, https://substackcdn.com/image/fetch/$s_!eh2L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png 1272w, https://substackcdn.com/image/fetch/$s_!eh2L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eh2L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eh2L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png 424w, https://substackcdn.com/image/fetch/$s_!eh2L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png 848w, https://substackcdn.com/image/fetch/$s_!eh2L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png 1272w, https://substackcdn.com/image/fetch/$s_!eh2L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e00a0f-498c-47f5-bee1-aa07f3b9fab1_1600x893.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><strong>Genie 3 Goes Public</strong></p><p>Google first revealed Genie 3 in August as a general-purpose world model that generates interactive environments from text prompts. The public release includes upgrades: integration with Nano Banana Pro for image previews before entering a world, Gemini for enhanced generation, and various consistency improvements. More importantly, public access means thousands of people can now stress-test what was previously limited to trusted testers.</p><p>The core capability is real-time interactive generation. Type a description, and Genie 3 generates a navigable environment at 20&#8211;24 frames per second in 720p. Unlike standard video generation, this is not a passive clip. You move through the world, and it generates the path ahead based on your actions. The system maintains visual memory for up to a minute, recalling changes you made when you revisit locations.</p><p>I have been experimenting with it, and Genie 3 is genuinely fun.
I tried dystopian bike racing games, ancient ruins, underwater scenes, and sci-fi corridors. It is also surprisingly flexible, taking your own image inputs and using them to render characters. That said, the novelty will wear off quickly given the clunkiness of character control and UI. The 60-second world limit feels restrictive. Controls are floaty. Physics sometimes breaks in ways that undermine immersion. I stopped trusting one environment after a door turned into a shrub when I looked away.</p><p>But you can see where this is heading.</p><p><strong>Why This Matters for Games</strong></p><p>Genie 3 generates explorable spaces. It does not generate games. There are no objectives, no scoring, no progression, no multiplayer, no persistence. The expensive parts of game development are gameplay systems, balancing, narrative structure, debugging, and platform optimization. Genie 3 addresses a different part of the stack: getting from an idea to an explorable space quickly.</p><p>The realistic near-term use case is pre-production acceleration. Concept artists and level designers could use it for rapid prototyping before committing to full production. The output is too rough for shipped products, but it is useful for iteration.</p><p>The more radical implication is that prompt-to-world could eventually enable new creation models. If generation becomes stable and exportable, the scarce skill shifts from asset production to direction and curation. This is some way away, but the trajectory is visible.</p><p><strong>Why This Matters for AI Research</strong></p><p>The most important audience for Genie 3 may not be creatives but AI researchers. DeepMind explicitly positions it as a stepping stone toward AGI, enabling agents to learn from unlimited simulated environments.</p><p>DeepMind tested Genie 3 worlds with SIMA, their game-playing agent. The model simulates forward based on agent actions rather than scripted sequences. 
This is the beginning of using world models as curriculum generators for embodied AI. If you can generate infinite training environments on demand, you can expose agents to diversity they could never encounter in curated datasets.</p><p>The limitations DeepMind lists (limited action space, difficulty with multi-agent interactions, imperfect geographic accuracy) are exactly the open research problems for embodied AI. I expect this engine will be a valuable training ground for Gemini 4.</p><p><strong>The Physics Question</strong></p><p>DeepMind describes Genie 3 as modeling &#8220;physical properties of the world&#8221; without a hard-coded physics engine. It generates frames autoregressively using the memory of previous frames to maintain consistency. This is a meaningful form of physical competence: the system has learned statistical regularities of how the world tends to look when you move through it.</p><p>But &#8220;looks physically plausible&#8221; is not the same as &#8220;obeys physics.&#8221; Google itself cautions that adherence to real-world physics is not guaranteed. Snow does not always behave like snow. Objects sometimes clip through each other. The system has learned intuitive physics priors, not physical laws.</p><p>This distinction matters as world models move from entertainment to robotics training. If you are using simulated environments to train agents for real-world deployment, physics fidelity becomes a safety requirement. The likely industry pattern is hybrid stacks: learned world models for photorealistic rendering, classical engines for physical invariants.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.towardsai.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Towards AI Newsletter!
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3>Why should you care?</h3><p>Genie 3 is the first public demonstration that real-time interactive world generation is possible. The current version is too limited for production use, but the trajectory is clear. Within a few years, the ability to generate explorable environments from text will be a standard creative tool. For anyone building with AI, it is worth experimenting with Genie 3 now to understand both its capabilities and limitations before the technology matures.</p><p>The deeper implication is for AI development itself. World models that can simulate consequences of actions are a different capability than models that predict text or generate images. If this line of research succeeds, it provides a path to AI systems that can plan, imagine counterfactuals, and learn from simulated experience. That matters whether or not you care about video games.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><h3>Hottest News</h3><p>1. <a href="https://www.spacex.com/updates#xai-joins-spacex">SpaceX Acquires xAI</a></p><p>SpaceX has acquired xAI, bringing the maker of Grok under the same corporate roof as SpaceX&#8217;s rocket and satellite business. The transaction values SpaceX at $1 trillion and xAI at $250 billion, with xAI investors receiving 0.1433 shares of SpaceX per xAI share and an option for some executives to take cash at $75.46 per share instead of stock. 
The combination tightens the link between xAI&#8217;s chip- and data-center-heavy AI operations and SpaceX&#8217;s scale in launch and Starlink, and is expected to support SpaceX&#8217;s ambitions around data-center infrastructure as competition for compute and energy intensifies across the AI sector.</p><p>2. <a href="https://x.com/moltbook/status/2017177460203479206?s=20">Moltbook Goes Viral as an &#8220;AI-Only&#8221; Social Forum</a></p><p>Moltbook launched a Reddit-like community platform designed for AI agents to post and interact, and it quickly drew attention online as agents began generating large volumes of threads and conversations. Soon after the launch, the cloud security firm Wiz identified a major backend misconfiguration that exposed Moltbook&#8217;s database, allowing access to private agent messages, email addresses (Reuters reports 6,000+ owners), and over a million credentials/tokens. That exposure could have enabled impersonation of agents and the alteration of content using leaked authentication credentials. Moltbook secured the database after being notified.</p><p>3. <a href="https://x.com/OpenAIDevs/status/2018385663457116379?s=20">OpenAI Introduces a Dedicated Codex App</a></p><p>OpenAI released the Codex app for macOS, a standalone desktop interface designed to run multiple coding agents simultaneously and keep long-running work organized by projects and separate threads. The app is built around parallel workflows where agents can work in isolated worktrees and produce clean diffs that you can review, comment on, and merge, while you switch between tasks without losing context. It supports longer-horizon software work such as refactors and migrations, plus reusable Skills and Automations for repeatable or scheduled workflows, alongside built-in Git functionality.
Availability starts on macOS, with Windows listed as coming soon, and access is tied to ChatGPT plans that include Codex (OpenAI also notes a limited-time promo that expands who can try Codex).</p><p>4. <a href="https://www.kimi.com/blog/kimi-k2-5.html?">Moonshot AI Releases Kimi K2.5: An Open Source Visual Agentic Intelligence Model</a></p><p>Moonshot AI released Kimi K2.5, an open-weights multimodal agentic model that combines vision and language with tool-using workflows and an agent-swarm execution scheme. It is a Mixture of Experts model with 1T total parameters and about 32B activated parameters per token. The network has 61 layers and 384 experts, with 8 experts selected per token plus 1 shared expert. K2.5 reports 76.8 on SWE Bench Verified, 78.5 on MMMU Pro, 86.6 on VideoMMMU, 50.2 on HLE Full with tools, and 74.9 on BrowseComp, scores that Moonshot presents as matching or exceeding the closed models in its comparison.</p><p>5. <a href="https://x.ai/news/grok-imagine-api">xAI Releases Grok Imagine API</a></p><p>xAI released the Grok Imagine API, a unified set of endpoints for end-to-end creative workflows that covers text-to-image, image editing, text-to-video/image-to-video generation, and video editing, with native video+audio generation supported within the same stack. The accompanying Grok Imagine 1.0 model supports video generation of up to 10 seconds at 720p resolution, along with improved audio output.</p><p>6. <a href="https://www.anthropic.com/research/AI-assistance-coding-skills">Anthropic Studies AI&#8217;s Impact on Coding Skills</a></p><p>Anthropic ran a randomized controlled trial with 52 mostly junior software engineers learning an unfamiliar Python library (Trio) and found a measurable mastery gap with AI assistance. Participants using AI scored 17% lower on a post-task quiz (which the study equates to &#8220;nearly two letter grades&#8221;), with the biggest deficit in debugging questions; speed gains were small and not statistically significant. 
The study also reports that outcomes varied by interaction style: heavy delegation correlated with the weakest retention, while using AI for explanations and conceptual questioning aligned with better mastery.</p><p>7. <a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR-2">DeepSeek AI Releases DeepSeek-OCR 2</a></p><p>DeepSeek released DeepSeek-OCR-2, a 3B-parameter vision-language model tuned for converting documents into structured Markdown, including mixed layouts with text, tables, formulas, and embedded graphics. It uses DeepEncoder-V2 with layout-friendly visual token reordering and a &#8220;Visual Causal Flow&#8221; approach to preserve reading order, and it supports variable token budgets (about 256&#8211;1120) so you can trade off speed vs. fidelity depending on document complexity. On OmniDocBench v1.5, it reports an average improvement of +3.73% over the prior DeepSeek-VL2 baseline. Weights and inference guidance are available through the hosted model card and the accompanying paper.</p><p>8. <a href="https://mbzuai.ac.ae/news/k2-think-v2-a-fully-sovereign-reasoning-model/">MBZUAI Releases K2 Think V2</a></p><p>MBZUAI released K2 Think V2 (70B), a reasoning-focused model built end-to-end on domestically controlled infrastructure and data, positioned as &#8220;fully sovereign&#8221; from pretraining through post-training and evaluation. It is built on a 70B dense decoder-only base trained on ~12T tokens, and it&#8217;s paired with a reinforcement-learning recipe aimed at verifiable reasoning gains (the release describes a GRPO-style RLVR approach). The model is pitched for multi-step math, code, and science reasoning, and it includes long-context support (the coverage describes up to 512K context for the base). Benchmark results show strong scores on AIME 2025, HMMT, and GPQA-Diamond, alongside tool-use and instruction-following evaluations.</p><p>9. 
<a href="https://blogs.nvidia.com/blog/mistral-frontier-open-models/?ncid=ref-inpa-429107">NVIDIA Partners With Mistral AI To Accelerate New Family of Open Models</a></p><p>NVIDIA and Mistral AI announced a partnership to optimize and deploy Mistral&#8217;s new open model family across NVIDIA&#8217;s stack, targeting &#8220;distributed intelligence&#8221; from cloud data centers down to edge devices. The collaboration ties Mistral&#8217;s training and deployment to NVIDIA infrastructure and software, with Mistral&#8217;s announcement noting the models were trained on NVIDIA Hopper GPUs and highlighting NVIDIA&#8217;s hardware&#8211;software co-design as part of the delivery path. NVIDIA&#8217;s release emphasizes running Mistral&#8217;s open models efficiently on NVIDIA platforms at multiple scales, so developers can use the same model family across large server environments and smaller edge deployments without reworking the stack.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. <a href="https://pub.towardsai.net/i-built-a-voice-assistant-that-actually-understands-what-i-mean-not-what-i-said-e5c49fd95b05">I Built a Voice Assistant That Actually Understands What I Mean, Not What I Said</a></p><p>This article details the process of building a voice assistant that understands user intent rather than literal keywords. It outlines the initial system&#8217;s failures, including 12-second response times and 40% accuracy, and shows how adding Qdrant brought responses under 2 seconds and accuracy above 90% while reducing API costs. It also covers the entire system, which integrates tools such as Faster-Whisper for transcription and Groq&#8217;s LLM for response generation.</p><p>2. 
<a href="https://pub.towardsai.net/kv-cache-in-llm-inference-7b904a2a6982">KV Cache in LLM Inference</a></p><p>This piece addresses a common cause of out-of-memory errors during LLM inference: the KV cache. While model weights are fixed, the KV cache grows linearly with every token generated, consuming significant VRAM with long contexts or large batches. It explains how architectural choices like Grouped-Query Attention (GQA) and Sliding Window Attention (SWA) mitigate this issue. Using Mistral 7B as a case study, it shows how GQA reduces the number of KV heads, and SWA caps the cache size, leading to more efficient memory management and stable performance for longer sequences.</p><p>3. <a href="https://pub.towardsai.net/how-i-built-a-context-aware-multi-agent-wellness-system-a3eacbc33fe4?sk=c37c88e2f74aa9e5c2b2d681292d26c2">How I Built a Context-Aware, Multi-Agent Wellness System</a></p><p>This article details the creation of a context-aware, multi-agent AI wellness system. The system addresses the static nature of typical fitness apps by using a central orchestrator to route user queries to specialized agents for exercise, nutrition, and mindfulness. It maintains a shared memory of user profiles and conversation history, enabling personalized advice that adapts to factors like injuries, stress, and goals. The author explains the system&#8217;s architecture, demonstrating how coordinated AI agents can deliver more dynamic and relevant wellness guidance.</p><p>4. <a href="https://pub.towardsai.net/rlm-graph-the-ultimate-evolution-of-ai-recursive-language-models-graph-fedcd251cd62?sk=5c93feadb9b0229d4c35c6c59b225de0">RLM + Graph: The Ultimate Evolution of AI? Recursive Language Models Graph</a></p><p>This piece walks you through RLM-Graph, an approach that transforms massive, unstructured datasets into structured knowledge graphs. 
While standard models often lose focus when processing millions of words, this method uses an agent to navigate hierarchical nodes and defined relationships rather than relying solely on vague vector searches. By combining semantic search with graph traversal, the system retrieves structurally precise context, significantly reducing hallucinations.</p><p>5. <a href="https://pub.towardsai.net/deepseeks-engram-the-missing-primitive-that-makes-llms-stop-wasting-compute-on-memory-93c3a8cb9dce?sk=aa70f2112ceab412318517eec2c00187">DeepSeek&#8217;s Engram: The Missing Primitive That Makes LLMs Stop Wasting Compute on Memory</a></p><p>DeepSeek&#8217;s latest research introduces Engram, a conditional memory primitive that stops LLMs from wasting computation on simple data retrieval. Traditionally, models use multiple processing layers to &#8220;reconstruct&#8221; known facts. Engram replaces this with a scalable, gated lookup system that allows the model to retrieve static patterns in constant time. Testing showed that allocating 25% of model capacity to Engram consistently outperformed pure Mixture-of-Experts (MoE) architectures.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/badlogic/pi-mono">Pi Mono</a> provides tools for building AI agents and managing LLM deployments.</p><p>2. <a href="https://github.com/thedotmack/claude-mem">Claude Mem</a> is a Claude Code plugin that automatically captures everything Claude does during your coding sessions, compresses it, and injects relevant context back into future sessions.</p><p>3. <a href="https://github.com/pedramamini/Maestro">Maestro</a> is a cross-platform desktop app for orchestrating your AI agents and projects.</p><p>4. <a href="https://github.com/amantus-ai/vibetunnel">VibeTunnel</a> proxies your terminals right into the browser, so you can vibe-code anywhere.</p><h3>Top Papers of The Week</h3><p>1. 
<a href="https://arxiv.org/abs/2601.20540">Advancing Open-source World Models</a></p><p>This paper presents LingBot-World, an open-sourced world simulator stemming from video generation. LingBot-World maintains high fidelity and robust dynamics across a broad spectrum of environments and enables a minute-level horizon while preserving contextual consistency over time. It also supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second.</p><p>2. <a href="https://arxiv.org/abs/2601.18778">Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability</a></p><p>This paper introduces SOAR, a meta-RL framework that enables models to escape reasoning plateaus by using a teacher model to generate synthetic &#8220;stepping stone&#8221; problems. By grounding rewards in a student&#8217;s actual progress on hard mathematical tasks rather than intrinsic proxies, the authors demonstrate that generating useful problem structures is more critical for unlocking learning than solution correctness.</p><p>3. <a href="https://arxiv.org/abs/2509.08031">AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs</a></p><p>This paper introduces AU-Harness, an efficient and comprehensive evaluation framework for Large Audio Language Models (LALMs). It provides standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios, achieving a speedup of up to 127% over existing toolkits and enabling large-scale evaluations previously impractical. The paper also introduces two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks.</p><p>4. 
<a href="https://arxiv.org/abs/2601.16344">DSGym: A Holistic Framework for Evaluating and Training Data Science Agents</a></p><p>This paper introduces DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. It provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, and also includes DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut solvability filtering. As a case study, researchers built a 2,000-example training set and trained a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks.</p><h3>Quick Links</h3><p>1. <a href="https://openai.com/index/introducing-prism/">OpenAI introduces Prism</a>, a free, AI-native workspace for scientists to write and collaborate on research, powered by GPT&#8209;5.2. It offers unlimited projects and collaborators and is available today to anyone with a ChatGPT personal account. Prism builds on the foundation of Crixet, a cloud-based LaTeX platform that OpenAI acquired. It supports tasks such as drafting and revising papers; incorporating relevant literature; reasoning over equations, citations, and figures; collaboration; and voice-based editing.</p><p>2. <a href="https://blogs.microsoft.com/blog/2026/01/26/maia-200-the-ai-accelerator-built-for-inference/">Microsoft unveils Maia 200</a>, an inference accelerator optimized for large-scale token generation in modern reasoning models and LLMs. Microsoft reports about 30 percent better performance per dollar than the latest Azure inference systems, claims 3 times the FP4 performance of third-generation Amazon Trainium, and higher FP8 performance than Google TPU v7 at the accelerator level.</p><p>3. 
<a href="https://blog.google/innovation-and-ai/models-and-research/google-deepmind/project-genie/">Google DeepMind launches Project Genie prototype</a>, a general-purpose world model that lets users create interactive virtual worlds from text prompts, powered by Genie 3 for real-time simulation and Nano Banana Pro for previews. It supports editing, exploration in first- or third-person views, and remixing via a gallery, but has limitations such as 60-second generation times and potential latency. Available to US Google AI Ultra subscribers, it aims to advance world model research.</p><p>4. <a href="https://github.com/google-deepmind/alphagenome_research">Google DeepMind unveils AlphaGenome</a>, a unified deep learning model designed for sequence-to-function genomics. It uses a specialized hybrid design that combines a U-Net backbone with Transformer blocks. This allows the model to process massive windows of 1,000,000 base pairs while maintaining the high resolution needed to identify single mutations. 
The framework is implemented in JAX and optimized for TPUs.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/google-staff-engineering-analyst-generative-ai-tmsr">Staff Engineering Analyst, Generative AI @Google (Mountain View, CA, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/smithrx-senior-machine-learning-engineer-applications-yx5e">Senior Machine Learning Engineer (Applications) @SmithRx</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/microsoft-corporation-senior-software-engineer-ai-agents-zeip">Senior Software Engineer &#8212; AI Agents @Microsoft Corporation (Dublin, Ireland)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/headspace-principal-product-manager-llm-innovation-6g72">Principal Product Manager, LLM Innovation @Headspace (Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/samsung-research-america-staff-genai-research-engineer-digital-health-dxtz">Staff GenAI Research Engineer, Digital Health @Samsung Research America (Mountain View, CA, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/coinbase-senior-software-engineer-ai-platform-ai-acceleration-2mui">Senior Software Engineer &#8212; AI Platform (AI Acceleration) @Coinbase (Remote/Canada)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? 
<a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p>]]></content:encoded></item><item><title><![CDATA[One path that replaces 50 saved tabs and 12 half-started repos]]></title><description><![CDATA[Towards AI Academy cohort kicks off in 48 hours: learn what to build and how.]]></description><link>https://newsletter.towardsai.net/p/the-wow-demo-trap-is-killing-llm</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/the-wow-demo-trap-is-killing-llm</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Fri, 30 Jan 2026 15:02:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!T3Hv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3e4894-d755-401a-9ab2-ec870409610b_1600x844.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week, Dario Amodei&#8217;s essay put words to what many teams are quietly bumping up against: the models are maturing faster than the builders. 
That&#8217;s why so many LLM projects keep dying in the same spot.</p><p><strong>In 48 hours (Feb 1, 2026), we&#8217;re running a live cohort kickoff call</strong> that closes this exact gap with a production-ready plan: what to build first, what to measure, and how to ship LLM systems that actually hold up.</p><p><strong>How to join the kickoff:</strong> enroll in <em>any</em> Towards AI course, and the cohort link lands in your welcome email.</p><p><strong><a href="https://academy.towardsai.net/bundles/10-hour-crash-course-into-llm-developer-expert?utm_source=TAImedium&amp;utm_medium=email&amp;utm_campaign=feb2026_subscribers_nostart_cheatsheet_download_glb&amp;utm_id=Febcohort">Access the Cohort by Enrolling!</a></strong></p><div><hr></div><p>If your goal is to go from fundamentals to production habits and full-stack execution, this is the most straightforward track we recommend:</p><p><strong>10-Hour Crash Course &#8594; Expert LLM Developer (Bundle)</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/bundles/10-hour-crash-course-into-llm-developer-expert?utm_source=TAImedium&amp;utm_medium=email&amp;utm_campaign=feb2026_subscribers_nostart_cheatsheet_download_glb&amp;utm_id=Febcohort" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T3Hv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3e4894-d755-401a-9ab2-ec870409610b_1600x844.png 424w, https://substackcdn.com/image/fetch/$s_!T3Hv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3e4894-d755-401a-9ab2-ec870409610b_1600x844.png 848w, 
https://substackcdn.com/image/fetch/$s_!T3Hv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3e4894-d755-401a-9ab2-ec870409610b_1600x844.png 1272w, https://substackcdn.com/image/fetch/$s_!T3Hv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3e4894-d755-401a-9ab2-ec870409610b_1600x844.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T3Hv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3e4894-d755-401a-9ab2-ec870409610b_1600x844.png" width="1456" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b3e4894-d755-401a-9ab2-ec870409610b_1600x844.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://academy.towardsai.net/bundles/10-hour-crash-course-into-llm-developer-expert?utm_source=TAImedium&amp;utm_medium=email&amp;utm_campaign=feb2026_subscribers_nostart_cheatsheet_download_glb&amp;utm_id=Febcohort&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T3Hv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3e4894-d755-401a-9ab2-ec870409610b_1600x844.png 424w, 
https://substackcdn.com/image/fetch/$s_!T3Hv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3e4894-d755-401a-9ab2-ec870409610b_1600x844.png 848w, https://substackcdn.com/image/fetch/$s_!T3Hv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3e4894-d755-401a-9ab2-ec870409610b_1600x844.png 1272w, https://substackcdn.com/image/fetch/$s_!T3Hv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3e4894-d755-401a-9ab2-ec870409610b_1600x844.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>It combines our most adopted courses with our bestselling book, and it&#8217;s sequenced like a real build path, so your effort compounds.</p><p><strong><a href="https://academy.towardsai.net/bundles/10-hour-crash-course-into-llm-developer-expert?utm_source=TAImedium&amp;utm_medium=email&amp;utm_campaign=feb2026_subscribers_nostart_cheatsheet_download_glb&amp;utm_id=Febcohort">Start the LLM Developer track (bundle + cohort access)</a></strong></p><p>Here&#8217;s how the bundle pulls you out of demo-land:</p><p><strong>1) Guesswork, replaced by a mental model.</strong></p><p><em>10-Hour LLM Fundamentals</em> (video) gives you the core understanding: how LLMs behave, how to build with them, how to evaluate outputs, and how to maintain robust solutions as requirements shift.</p><p><strong>2) Fragility, replaced by production discipline.</strong></p><p><em>Building LLMs for Production</em> gives you timeless principles for building dependable systems: how to measure quality, debug failures, and iterate without rewriting the whole app every time something breaks.</p><p><strong>3) &#8220;I can&#8217;t ship this,&#8221; replaced by full-stack skill.</strong></p><p><em>Full Stack AI Engineering</em> is where you put it all together end-to-end and ship a real product: data, retrieval, prompting/agents, evaluation, and deployment.</p><p>If you&#8217;ve been circling this space for months, the risk isn&#8217;t &#8220;starting and failing.&#8221; The risk is staying in demo-land while the bar for real LLM skill quietly becomes: <em>can you ship something that holds up?</em></p><p>Cohort kickoff is in <strong>48 hours (Feb 1, 2026)</strong>. 
If you want the end-to-end framework we use in enterprise projects, start with the kickoff.</p><p><strong><a href="https://academy.towardsai.net/bundles/10-hour-crash-course-into-llm-developer-expert?utm_source=TAImedium&amp;utm_medium=email&amp;utm_campaign=feb2026_subscribers_nostart_cheatsheet_download_glb&amp;utm_id=Febcohort">Join before Feb 1 and get cohort access!</a></strong></p>]]></content:encoded></item><item><title><![CDATA[TAI #189: Dario Amodei's 19,000-Word Warning About AI's "Adolescence"]]></title><description><![CDATA[Also, Claude in Excel, GLM-4.7 Flash, Qwen3-TTS, FastMCP 3.0 & more]]></description><link>https://newsletter.towardsai.net/p/tai-189-dario-amodeis-19000-word</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-189-dario-amodeis-19000-word</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 27 Jan 2026 15:02:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!834x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>Anthropic has been on a remarkable product streak. Last week, we covered Claude Cowork, which brings agentic capabilities to non-developers. This week, the company expanded Claude in Excel to Pro subscribers and deepened integrations with apps such as Slack, Canva, Figma, and more.</p><p>Claude in Excel may be one of the more eye-opening AI features yet for finance professionals. The add-in reads entire multi-tab workbooks, explains nested formulas with clickable cell citations, debugs errors like circular references, and builds financial models from natural-language instructions. Finance has long been a domain where AI demos looked impressive, but real-world utility lagged. 
Claude reading your actual workbook and understanding the relationships between cells changes that equation. The caveats are real: hallucinations happen, token limits interrupt longer sessions, and prompt-injection vulnerabilities mean you should be careful with untrusted data. But as a research preview, it points toward a future where financial modeling grunt work becomes dramatically faster.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!834x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!834x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png 424w, https://substackcdn.com/image/fetch/$s_!834x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png 848w, https://substackcdn.com/image/fetch/$s_!834x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png 1272w, https://substackcdn.com/image/fetch/$s_!834x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!834x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png" width="1456" 
height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!834x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png 424w, https://substackcdn.com/image/fetch/$s_!834x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png 848w, https://substackcdn.com/image/fetch/$s_!834x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png 1272w, https://substackcdn.com/image/fetch/$s_!834x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa482ee6d-40a8-434a-8aa8-be9cd46f0b99_1600x893.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>Despite this success in solving near-term, extremely tangible enterprise problems, CEO Dario Amodei remains outspoken about more speculative risks. His essay &#8220;Machines of Loving Grace&#8221; made a significant splash in October 2024, laying out how powerful AI could compress a century of scientific progress into a decade and potentially eliminate most diseases, end extreme poverty, and transform governance. Fifteen months later, we can assess how those predictions are tracking.</p><p>The results are mixed. Capability acceleration proceeded roughly as Amodei predicted: agentic systems improved dramatically, with engineers at Anthropic reportedly &#8220;mostly editing&#8221; rather than writing code from scratch. Scientific acceleration in drug discovery and protein design continued. But the more ambitious predictions have not materialized. No major breakthroughs in disease cures or lifespan emerged. Mental health applications remain at the research level. 
The developing world saw little evidence of rapid catch-up. And rather than AI favoring defense and democracy as Amodei hoped, 2025 saw intensified chip wars and rising deepfake threats.</p><p>It is always hard to tell if an AI CEO is being honest or hyping capabilities. Even when discussing risks, emphasizing how powerful and dangerous AI will become is a roundabout way of claiming your technology is transformative enough to justify massive investment. Anthropic raised $13 billion in September and is reportedly in talks for another $25 billion. There is also a competitive angle: fearmongering about AI risks can be interpreted as an attempt to prevent open-weight LLM competition through regulation or to stunt Chinese AI labs by advocating for export controls. The conflict of interest is obvious.</p><p>I think Dario is largely honest in his hopes and fears, though not immune to motivated reasoning. His technical claims tend to be specific and falsifiable rather than vague. He repeatedly emphasizes uncertainty. And he points fingers at his own industry, explicitly naming AI companies as a major risk factor. That is not the framing you would choose for pure marketing.</p><p>This week, Amodei published &#8220;The Adolescence of Technology,&#8221; a 19,000-word follow-up that shifts from optimism to confronting risks directly. The framing is stark: humanity is entering a &#8220;rite of passage&#8221; that will test who we are as a species. The central move is treating powerful AI as a new kind of concentrated national capability. He uses the metaphor of a &#8220;country of geniuses in a datacenter&#8221;: imagine 50 million people, all more capable than any Nobel laureate, operating at 10&#8211;100x the speed of humans. If you were a national security official assessing that situation, what would you worry about?</p><p>He groups risks into five categories. 
Autonomy risks concern whether AI systems might behave in unintended ways, not from malice but from emergent properties in training. Amodei rejects both the naive view that AI will simply do what we tell it and the doomer view that misalignment is inevitable. He cites lab experiments in which Claude engaged in deception and adopted problematic personas due to training quirks. These were caught and fixed, but the concern is that training involves so many potential traps that some may only become evident when it is too late.</p><p>Destruction risks involve AI lowering barriers to weapons of mass destruction, particularly biological weapons. Amodei argues that LLMs are approaching the capability to walk a determined non-expert through the step-by-step process of bioweapon creation, breaking the historical correlation between ability and motive. The PhD virologist with the skills is unlikely to have the motivation. The disturbed loner with the motivation lacks the skills. AI could remove that barrier. Anthropic&#8217;s internal measurements show models may already be providing substantial uplift in relevant areas, which is why recent Claude releases include specialized classifiers to block bioweapon-related outputs.</p><p>Power-seizing risks concern authoritarian governments using AI for surveillance, propaganda, and autonomous weapons to entrench control. Amodei is particularly focused on the CCP, arguing it makes no sense to sell them chips and chip-making tools to build an AI totalitarian state. But he also worries about democracies: the same tools needed to defend against autocracies can be turned inward. He suggests domestic mass surveillance and mass propaganda should be bright red lines.</p><p>Economic disruption is perhaps the most immediate concern. Amodei predicted that AI could displace 50% of entry-level white-collar jobs in 1&#8211;5 years, and he stands by that prediction. 
He argues this differs from previous technological disruptions because of speed, cognitive breadth, and AI&#8217;s capacity to fill in gaps that would normally allow humans to adapt.</p><p>Finally, indirect effects capture unknown unknowns from compressed progress: radical advances in biology, psychological manipulation through AI companions, and loss of human purpose. Even if we dodge headline catastrophes, a decade of compressed progress can produce destabilizing outcomes.</p><p>The essay&#8217;s most useful contribution may be its diagnosis of political economy. Amodei explains why reasonable safety measures fail: the combination of strategic competition and massive economic upside makes restraint hard even when everyone sees the risks. He calls this &#8220;the trap.&#8221; His proposed solutions emphasize surgical interventions: transparency legislation, export controls on chips, Constitutional AI to train models with coherent values, and interpretability research. He explicitly rejects pausing AI development as untenable, arguing that the technology would continue regardless, and that authoritarian countries would keep building.</p><div><hr></div><h3>Why should you care?</h3><p>Three practical takeaways from the essay. First, if you work in a field likely to be disrupted, the time to build adjacent skills and relationships is now, not when displacement arrives. Amodei&#8217;s prediction of 50% entry-level white-collar job displacement in 1&#8211;5 years may be aggressive, but even a slower timeline suggests urgency. Second, the warnings about AI companions and psychological manipulation deserve attention from anyone with children or elderly relatives who may be more susceptible to forming unhealthy dependencies on systems designed to maximize engagement.</p><p>Third, and most broadly, the essay is a reminder that the incremental view can obscure the aggregate picture. Most weeks, this newsletter covers new models, new features, and new benchmarks. The question is not whether any single advance is dangerous but whether the cumulative trajectory is one we have consciously chosen. Right now, the answer is largely no. Recognizing that is the first step toward changing it.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><h3>Hottest News</h3><p>1. <a href="https://claude.com/blog/interactive-tools-in-claude">Anthropic Launches Interactive Claude Apps</a></p><p>Claude now opens connected workplace tools as interactive panels directly in the conversation, so you can review, tweak, and act on outputs without switching tabs. 
The first set includes Amplitude, Asana, Box, monday.com, and Slack, with interactive workflows like building analytics charts, turning chats into projects/timelines, previewing documents, updating boards, and drafting messages in a formatted preview before posting. This rollout is available across Claude&#8217;s web and desktop experiences. The same launch extends MCP Apps, which lets tool developers ship interactive UI experiences that render inside multiple MCP clients rather than returning only text or structured data.</p><p>2. <a href="https://x.com/claudeai/status/2014834616889475508?s=20">Anthropic Expands Claude in Excel to Pro Users</a></p><p>Anthropic has now rolled out its Excel integration in Claude to Pro users. Along with broader availability, the update brings several functional improvements: Claude can now accept multiple files via drag-and-drop, avoid overwriting existing cells, and support longer work sessions through automatic compression. The integration lets users work with Claude directly in Microsoft Excel for analysis and data preparation.</p><p>3. <a href="https://qwen.ai/blog?id=qwen3-max-thinking">Alibaba Qwen releases Qwen3-Max-Thinking</a></p><p>Alibaba&#8217;s Qwen team launched Qwen3-Max-Thinking, a new flagship reasoning model trained with large-scale reinforcement learning and built to autonomously invoke Search, Memory, and a Code Interpreter during a conversation, eliminating the need for manual tool selection. It ships with a heavy-mode test-time scaling approach that runs multi-round self-reflection (&#8220;experience-cumulative&#8221; scaling) to improve difficult reasoning without simply increasing parallel sampling. It scored 98.0 on HMMT, 49.8 on Humanity&#8217;s Last Exam (with tools), 90.2 on Arena-Hard v2, 75.3 on SWE-Bench Verified, and 85.9 on LiveCodeBench v6, with the tool-augmented HLE result exceeding GPT-5.2-Thinking and Gemini 3 Pro. The model is available in Qwen Chat and via an API.</p><p>4. 
<a href="https://docs.z.ai/guides/llm/glm-4.7">Zhipu AI Releases GLM-4.7-Flash</a></p><p>Z.ai launched GLM-4.7, its latest flagship text model series focused on agentic coding reliability, multi-step execution stability, and stronger front-end generation quality, with 200K context and up to 128K output tokens. On widely used coding and agent benchmarks, GLM-4.7 reports 73.8% on SWE-bench Verified, 66.7% on SWE-bench Multilingual, and 41% on Terminal-Bench 2.0, alongside stronger tool-use scores such as 84.7% on &#964;&#178;-Bench and 67% on BrowseComp. The series includes GLM-4.7, plus lighter variants (GLM-4.7-FlashX and GLM-4.7-Flash) that trade some peak capability for lower cost and latency while maintaining the same long-context footprint.</p><p>5. <a href="https://qwen.ai/blog?id=qwen3tts-0115">Qwen Researchers Release Qwen3-TTS</a></p><p>Alibaba&#8217;s Qwen team open-sourced the Qwen3-TTS family, a multilingual, controllable, streaming text-to-speech stack built for both rapid voice cloning and &#8220;voice design&#8221; (description-driven control over style and attributes). The models are trained across 10 languages and introduce a dual-track LM design optimized for real-time synthesis, paired with two tokenizers: a semantic-heavy 25Hz codec and an ultra-low-latency 12Hz tokenizer that targets extremely fast first audio emission (reported at ~97 ms). On the multilingual TTS test set, Qwen reports an average WER of 1.835% and a speaker similarity of 0.789, and frames the release as open tooling for both research and product deployment, with models and tokenizers under Apache 2.0.</p><p>6. 
<a href="https://interestingengineering.com/ai-robotics/elon-musk-xai-gigawatt-scale-ai-training-cluster">Elon Musk&#8217;s xAI Activates World&#8217;s First Gigawatt-Scale AI Training Cluster</a></p><p>Elon Musk&#8217;s xAI is expanding the Colossus training effort toward gigawatt-scale capacity, including purchasing additional Memphis-area buildings, with the ambition to reach nearly 2 GW of training power and operate at a scale of hundreds of thousands to over a million GPUs over time. xAI&#8217;s own materials describe rapid buildout milestones (including scaling to 200k GPUs) while framing the site as a &#8220;gigafactory of compute.&#8221; At the same time, recent third-party analysis based on site constraints (notably cooling) disputes that the cluster is already operating at 1 GW today, suggesting the full gigawatt claim is more consistent with a phased ramp than a completed state.</p><p>7. <a href="https://chromeunboxed.com/gemini-in-chrome-is-getting-skills-as-it-moves-toward-becoming-a-full-ai-agent/">Gemini in Chrome Is Getting &#8220;Skills&#8221; As It Moves Toward Becoming a Full AI Agent</a></p><p>Google is testing &#8220;Skills&#8221; for Gemini in Chrome, an early move from &#8220;assistant in a side panel&#8221; toward programmable, site-context automation that can execute repeatable browser workflows. Chromium commits show active development of a dedicated chrome://skills surface (including UI scaffolding like a toolbar) and plumbing to surface or recommend Skills on the current page, suggesting an intent to make Skills discoverable rather than purely manual. Independent coverage indicates Skills are being tried internally in Chrome builds, with users defining a Skill (name + instructions) and then invoking it through Gemini&#8217;s Chrome experience, but there&#8217;s no public rollout timeline yet.</p><p>8. 
<a href="https://x.com/trq212/status/2014480496013803643">Anthropic Replaces Todos With Disk-Backed Tasks</a></p><p>Anthropic upgraded Claude Code from &#8220;Todos&#8221; to Tasks, turning lightweight to-do tracking into a more structured task primitive designed for longer, multi-step coding workflows, including support for dependency-style organization and richer task lifecycle actions. Recent releases add controls to keep the old system temporarily via CLAUDE_CODE_ENABLE_TASKS, and expand task operations (including the ability to delete tasks via TaskUpdate) while iterating on how the task list renders and behaves in the terminal UI. The change is framed as part of making Claude Code more resilient for extended sessions where work needs to persist cleanly across context pressure and ongoing agent activity.</p><p>9. <a href="https://gofastmcp.com/getting-started/welcome">FastMCP 3.0 Is Here</a></p><p>Prefect&#8217;s FastMCP 3.0 entered beta as a major redesign of the Python framework for building MCP servers, restructuring the system around three composable primitives: components, providers, and transforms. Providers are meant to source tools/resources dynamically (from decorators, filesystems, OpenAPI specs, or even remote MCP servers), while transforms act as middleware to reshape what clients see&#8202;&#8212;&#8202;renaming, namespacing, filtering, or applying security rules&#8202;&#8212;&#8202;so features that used to require bespoke subsystems can be assembled from building blocks. The project is shipping as a 3.0.0b1 beta (with guidance to stay on v2 for production stability), signaling a push toward more modular, plug-and-play MCP infrastructure for agent toolchains.</p><p>10. 
<a href="https://modelscope.cn/models/FlashLabs/Chroma-4B">FlashLabs Researchers Release Chroma 1.0</a></p><p>FlashLabs open-sourced Chroma 1.0 (Chroma-4B), a real-time, end-to-end spoken dialogue model that takes speech in and returns speech out while preserving a user&#8217;s voice via personalized voice cloning. It&#8217;s built to avoid the classic ASR &#8594; LLM &#8594; TTS pipeline by operating directly on discrete speech representations, targeting sub-second interaction latency for conversational use. The system emphasizes speaker identity retention (a common failure mode in speech-token-based dialogue models) while keeping responses fast enough to feel &#8220;live&#8221; in multi-turn voice chats. The release includes a 4B-parameter checkpoint and is positioned as an open, real-time voice assistant backbone for developers building low-latency, voice-native agents.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. <a href="https://pub.towardsai.net/how-to-run-ai-agents-fully-locally-memory-tools-and-models-on-your-laptop-b8cd1df4b8e4?sk=3694e8bb0294150862eeb87bb45eace5">How to Run AI Agents Fully Locally: Memory, Tools, and Models on Your Laptop</a></p><p>This article outlines the architecture of a fully local AI agent, designed to improve privacy, control costs, and enable reproducibility. The stack integrates Agno for agent orchestration, SurrealDB as a multi-model database for state and vectors, and Ollama for local inference. It highlights the use of the Model Context Protocol (MCP) to establish a secure boundary for tools, such as file access and image generation. It also covers practical implementations, including persistent memory, local RAG, and multimodal workflows.</p><p>2. 
<a href="https://pub.towardsai.net/langgraph-rag-ucp-the-key-to-powerful-agentic-ai-d7ef49171abc?sk=66361045469064f1314d09861e7dc5b7">LangGraph + RAG + UCP = The Key To Powerful Agentic AI</a></p><p>This analysis details how to build an AI shopping assistant using the Universal Commerce Protocol (UCP), a new open standard for e-commerce transactions. The article shows that combining LangGraph for structured workflows with Retrieval-Augmented Generation (RAG) enables querying a product database. It provides code examples for a chatbot that uses a vector store and GPT-4 to answer questions, alongside a checkout system built with the FastUCP framework to manage transactions.</p><p>3. <a href="https://pub.towardsai.net/mastering-the-bias-variance-trade-off-in-machine-learning-748cc47a1b2c?sk=8194f1ad4ac36d20f57e6145c791fdb1">Mastering the Bias-Variance Trade-Off in Machine Learning</a></p><p>Balancing bias and variance is a central challenge in machine learning. This article examines this trade-off using the Vapnik-Chervonenkis (VC) dimension, a theoretical concept for quantifying a model&#8217;s capacity. It explains how the VC bound estimates the generalization error on unseen data. It also presents a practical experiment with polynomial regression, demonstrating that as model complexity increases, training error decreases while the gap between training and real-world performance widens.</p><p>4. <a href="https://pub.towardsai.net/connecting-the-dots-with-graphs-0738c1716a53">Connecting the Dots with Graphs</a></p><p>Moving beyond traditional databases that store data in isolated tables, knowledge graphs model information as a network of entities and relationships. This structure excels at complex, relationship-heavy queries that relational databases often struggle with. The text outlines the benefits, such as flexible schemas and data integration, while also addressing challenges like data quality and performance. 
A practical implementation is also presented, detailing how to build a question-answering system using Neo4j and an LLM to translate natural language into graph queries, making complex data more accessible.</p><p>5. <a href="https://pub.towardsai.net/probability-calibration-with-python-6ee602760ab6?sk=5b4498a8d57b604184c1635636d30c26">Probability Calibration with Python</a></p><p>Many machine learning models produce probability scores that, while effective for ranking, do not align with real-world event frequencies. This article explores probability calibration using a simulated loan default dataset. It compares a raw Gradient Boosting model against two calibrated versions: Sigmoid and Isotonic. The results demonstrate that calibration improves probability metrics like the Brier score and Expected Calibration Error (ECE) without compromising ranking performance (AUC). A final simulation of a loan approval policy shows that using these calibrated probabilities leads to more accurate risk assessments and, ultimately, higher realized profits, underscoring their value in business decision-making.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/microsoft/VibeVoice">VibeVoice</a> is a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for user-customized context.</p><p>2. <a href="https://github.com/github/copilot-sdk">GitHub Copilot CLI SDKs</a> is a multi-platform SDK for integrating GitHub Copilot Agent into apps and services.</p><p>3. <a href="https://github.com/clawdbot/clawdbot">Clawdbot</a> is a personal AI assistant you run on your own devices. It can speak and listen on macOS/iOS/Android, and can render a live Canvas you control.</p><h3>Top Papers of The Week</h3><p>1. 
<a href="https://arxiv.org/abs/2601.12538">Agentic Reasoning for Large Language Models</a></p><p>This survey formalizes &#8220;Agentic Reasoning&#8221; as a paradigm shift that transforms LLMs from static processors into autonomous agents capable of planning, acting, and self-evolving through interaction. The survey organizes agentic reasoning into three layers: foundational, self-evolving, and collective. It also provides a unified roadmap for optimizing agentic systems through both in-context orchestration and post-training reinforcement learning across domains such as science and robotics.</p><p>2. <a href="https://arxiv.org/html/2512.03438v1">Multimodal Reinforcement Learning with Agentic Verifier for AI Agents</a></p><p>This paper introduces Argos, a principled reward agent to train multimodal reasoning models for agentic tasks. For each sample, Argos selects from a pool of teacher-model derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. This approach enables models to achieve state-of-the-art performance on spatial and embodied AI tasks while significantly reducing visual hallucinations through verifiable reinforcement learning.</p><p>3. <a href="https://arxiv.org/abs/2601.11077">ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development</a></p><p>This paper introduces ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. It contains 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories, requiring agents to explore repositories, configure environments, deploy containerized services, and pass end-to-end API tests. Evaluations show that state-of-the-art LLM agents still struggle with these holistic backend engineering tasks.</p><p>4. 
<a href="https://arxiv.org/abs/2601.16206">LLM-in-Sandbox Elicits General Agentic Intelligence</a></p><p>This paper introduces LLM-in-Sandbox, a framework that lets large language models explore a virtual computer to elicit general agentic intelligence in non-code domains. Strong LLMs, without extra training, use the sandbox to access external resources, manage long contexts, and execute scripts. LLM-in-Sandbox-RL further improves these capabilities, yielding robust generalization across STEM tasks and instruction following, and the team releases a Python package.</p><h3>Quick Links</h3><p>1. <a href="https://www.liquid.ai/blog/lfm2-5-1-2b-thinking-on-device-reasoning-under-1gb">Liquid AI released LFM2.5-1.2B-Thinking</a>, a 1.2B model optimized for reasoning that runs entirely on-device and is reported to fit within ~900MB of memory on a phone. LFM2.5-1.2B-Thinking matches or exceeds Qwen3-1.7B on most reasoning benchmarks, despite having 40% fewer parameters.</p><p>2. <a href="https://stepfun.ai/deep-research-invitation">StepFun has introduced Step-DeepResearch</a>, a 32B parameter end-to-end deep research agent that aims to turn web search into actual research workflows with long horizon reasoning, tool use, and structured reporting. The model is built on Qwen2.5 32B-Base and is trained to act as a single agent that plans, explores sources, verifies evidence, and writes reports with citations, while keeping inference cost low.</p><p>3. <a href="https://ai.azure.com/catalog/models/microsoft-optimind-sft">Microsoft Research releases OptiMind</a>, an experimental 20B-parameter model built to translate natural-language decision problems into solver-ready MILP formulations. 
The model is fine-tuned from openai/gpt-oss-20b on cleaned optimization datasets such as OR-Instruct and OptMATH, and evaluated on expert-validated benchmarks including IndustryOR and Mamo Complex.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/google-artificial-intelligence-safety-data-scientist-trust-and-safety-t7hm">Artificial Intelligence Safety Data Scientist @Google (Bangalore, India)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/oowlish-ai-solutions-engineer-python-cloud-b3tg">AI Solutions Engineer (Python + Cloud) @Oowlish (Remote/Brazil)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/delta-air-lines-inc-senior-full-stack-developer-ibay">Senior Full Stack Developer @Delta Air Lines, Inc. (Atlanta, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/kyndryl-agentic-ai-forward-deployed-engineer-6mtc">Agentic AI, Forward Deployed Engineer @Kyndryl (Sydney, Australia/Remote)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/capital-one-lead-ai-engineer-favb">Lead AI Engineer @Capital One (Bangalore, India)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/pointclickcare-principal-ai-engineer-autonomous-agent-idgs">Principal AI Engineer (Autonomous Agent) @PointClickCare (Remote/Canada)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? 
<a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p>]]></content:encoded></item><item><title><![CDATA[TAI ##188: Claude Cowork Brings Agentic AI to Non-Developers]]></title><description><![CDATA[Also, Quick Cowork guide, MedGemma 1.5, OpenAI's $20bn revenue, ERNIE 5.0, Flux.2, and more.]]></description><link>https://newsletter.towardsai.net/p/tai-188-claude-cowork-brings-agentic</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-188-claude-cowork-brings-agentic</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 20 Jan 2026 15:03:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_q8K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>Last week, we discussed OpenAI&#8217;s health push and noted there is significant room for custom models in medicine beyond general-purpose LLMs. 
Google DeepMind validated that thesis this week with MedGemma 1.5, an updated open medical model with substantially improved support for high-dimensional imaging, such as CT scans, MRIs, and histopathology slides. They also released MedASR, a speech-to-text model fine-tuned for medical dictation, which achieves 58% fewer errors than Whisper on chest X-ray dictations. These are free for research and commercial use. Specialized medical AI is advancing rapidly on multiple fronts, with foundation model providers, startups, and health systems all racing to build domain-specific tools.</p><p>The biggest story this week, however, was Anthropic&#8217;s release of Claude Cowork, which feels like the natural next step we anticipated a few weeks ago when discussing Claude Code&#8217;s momentum over the holidays. Back then, we noted that people were using Claude Code for tasks far beyond programming, from curriculum building to health data analysis, but that the terminal interface would need to change before these agentic capabilities could go mainstream. Anthropic seems to have heard the same signal. Cowork packages Claude Code&#8217;s agentic capabilities into an interface designed for non-developers, available in the Claude desktop app for Mac.</p><p><strong>What is Claude Cowork?</strong></p><p>Cowork is a new tab in the Claude desktop app that operates fundamentally differently from standard chat. Instead of a back-and-forth conversation, you give Claude access to a specific folder on your computer and assign it a task. Claude then makes a plan, executes steps autonomously, and keeps you in the loop on progress. You can queue multiple tasks and let Claude work through them in parallel. It feels less like chatting and more like delegating to a capable assistant who happens to live inside your computer.</p><p>The core interaction pattern is folder-scoped. You choose which folder Claude can see. It cannot access anything outside that boundary without explicit permission. 
Within the folder, Claude can read files, create new ones, edit existing documents, and organize content. The permission model is progressive: you can start with read-only access and escalate to edit or delete permissions only when needed.</p><p>Perhaps the most remarkable detail: Anthropic staff noted that Cowork itself was built in about a week and a half, and &#8220;all of it&#8221; was built by Claude Code. This is a striking example of AI tools being used to build AI tools, and it explains both the rapid iteration and some of the beta roughness that early users encountered.</p><p>Availability is currently limited to Claude Max and Pro subscribers on macOS, with expansion to Windows planned.</p><p>Anthropic is clearly not content with leading AI adoption for coding; it is positioning itself as the leader in AI tools for work more broadly. Cowork also integrates with connectors like Claude in Chrome, which allow Claude to take browser actions on your behalf, and with Claude Skills. Skills are essentially detailed playbooks that tell Claude how to produce professional-quality outputs. Anthropic provides official skills on GitHub, and you can write custom ones for your own workflows. Their &#8220;skills&#8221; system is gaining momentum and offers significant advantages over competitors when performing complex work. The xlsx skill can output fully working Excel models with formulas, and the pptx skill produces presentation files that actually open correctly in PowerPoint. This sounds mundane until you have spent hours wrestling with copy-and-paste from other tools with less flexible outputs. File compatibility matters enormously for real work.</p><p><strong>A practical guide to getting started</strong></p><p>Start by opening the Claude desktop app on Mac and clicking the Cowork tab. Create a new task and select the folder you want Claude to access. Begin with a non-sensitive folder containing only the files relevant to your task. 
Keep backups of anything important before allowing edit or delete permissions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XCRe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc779e8d3-4964-44b0-99e7-84f6a34166b6_1600x777.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XCRe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc779e8d3-4964-44b0-99e7-84f6a34166b6_1600x777.png 424w, https://substackcdn.com/image/fetch/$s_!XCRe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc779e8d3-4964-44b0-99e7-84f6a34166b6_1600x777.png 848w, https://substackcdn.com/image/fetch/$s_!XCRe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc779e8d3-4964-44b0-99e7-84f6a34166b6_1600x777.png 1272w, https://substackcdn.com/image/fetch/$s_!XCRe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc779e8d3-4964-44b0-99e7-84f6a34166b6_1600x777.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XCRe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc779e8d3-4964-44b0-99e7-84f6a34166b6_1600x777.png" width="1456" height="707" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c779e8d3-4964-44b0-99e7-84f6a34166b6_1600x777.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:707,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XCRe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc779e8d3-4964-44b0-99e7-84f6a34166b6_1600x777.png 424w, https://substackcdn.com/image/fetch/$s_!XCRe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc779e8d3-4964-44b0-99e7-84f6a34166b6_1600x777.png 848w, https://substackcdn.com/image/fetch/$s_!XCRe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc779e8d3-4964-44b0-99e7-84f6a34166b6_1600x777.png 1272w, https://substackcdn.com/image/fetch/$s_!XCRe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc779e8d3-4964-44b0-99e7-84f6a34166b6_1600x777.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>For your first task, try something low-stakes like organizing files. Point Cowork at your Downloads folder and ask it to sort images into subfolders by type. Claude will analyze file contents, create meaningful categories such as &#8220;Screenshots,&#8221; &#8220;Thumbnails,&#8221; and &#8220;AI-Generated,&#8221; and move hundreds of files in minutes. 
The progress sidebar shows Claude&#8217;s to-do list updating in real-time as it works through the task.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_q8K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_q8K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png 424w, https://substackcdn.com/image/fetch/$s_!_q8K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png 848w, https://substackcdn.com/image/fetch/$s_!_q8K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png 1272w, https://substackcdn.com/image/fetch/$s_!_q8K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_q8K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png" width="1456" height="853" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_q8K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png 424w, https://substackcdn.com/image/fetch/$s_!_q8K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png 848w, https://substackcdn.com/image/fetch/$s_!_q8K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png 1272w, https://substackcdn.com/image/fetch/$s_!_q8K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cea35d-94a1-4c1a-bee7-60554757cc58_1600x937.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>For document creation, Cowork shines when you provide source material. Drop meeting notes, transcripts, or research files into a folder and ask Claude to synthesize them into a report, presentation, or spreadsheet. One powerful pattern: point Cowork at a folder of content you have created and ask it to extract themes, generate content ideas or data analysis, or build a structured summary. The agent can process hundreds of documents and extract dozens of actionable insights in under an hour.</p><p>For higher-quality outputs in specific niches, install Claude Skills. Download the official skills or third-party skills, then go to Settings &gt; Capabilities &gt; Skills, and upload the skill.md file for the capability you need. The frontend design skill produces polished landing pages. The pptx skill creates professional presentations. Skills act as expert playbooks that dramatically improve output quality compared to generic prompts.</p><p>To add web capabilities, enable Claude in Chrome. 
This connector lets Cowork browse the web, scrape data from sites that lack APIs, and take actions in your browser. A practical example: ask Cowork to visit your analytics dashboard, extract key metrics, and compile them into a spreadsheet in your local folder. Claude will open Chrome, navigate to the URL, visually capture the data, and create the file. This works because, in Chrome, Claude takes screenshots of your active tab to understand the content, so it can read anything visible on the screen.</p><p>A few important caveats for Chrome integration. Claude in Chrome can see anything on your screen when the side panel is open, including sensitive information. Use a separate browser profile for Cowork tasks. Stick to &#8220;Ask before acting&#8221; mode, which requires approval before Claude takes action. Be aware that web pages can contain prompt injections and adversarial content that attempts to manipulate Claude&#8217;s behavior. You may wish to start with trusted sites and closely supervise browser activity.</p><p>The most effective prompt pattern across all Cowork tasks is plan-first delegation: &#8220;Propose a step-by-step plan first. Wait for my approval before making changes.&#8221; This keeps you in control while still benefiting from Claude&#8217;s autonomous execution. Add explicit constraints like &#8220;Only touch files in this folder&#8221; and &#8220;Do not delete anything&#8221; to prevent surprises.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.towardsai.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Towards AI Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3>Why should you care?</h3><p>Cowork represents the first serious attempt to bring agentic AI capabilities to non-technical users in a form that actually works for real tasks. The early reception has been unusually positive for an agent product. Users report completing projects in hours that would have taken days or weeks.</p><p>The rough edges are real, however. This is a research preview built in under two weeks. We have seen occasional failures on complex tasks, rapid resource consumption, and connector hiccups. Prompt injection also remains a risk when combining Cowork with web browsing. The macOS-only and paid plan limitation also excludes most potential users for now.</p><p>But the trajectory is clear. Anthropic is iterating rapidly based on user feedback, shipping fixes within days of launch. The fact that Cowork was built entirely by Claude Code suggests this kind of rapid AI-assisted development will only accelerate. If the current version can handle file organization, document synthesis, and basic automation, the version six months from now will likely handle substantially more.</p><p>The practical advice is to start experimenting with low-stakes tasks now. Build intuition for what Cowork handles well and where it struggles. The users who understand these tools deeply will be best positioned to leverage them as capabilities improve. 
The gap between people who can effectively delegate to AI agents and those who cannot is about to become very visible.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><h3>Hottest News</h3><p>1. <a href="https://claude.com/blog/cowork-research-preview">Anthropic Releases Cowork As Claude&#8217;s Local File System Agent</a></p><p>Anthropic launched Cowork as a research preview, giving Claude agent-style access to a user-selected local folder in the macOS app. Claude can read, create, and edit files in that folder to complete multi-step tasks under user oversight, and it can use connectors and skills to produce artifacts such as documents and presentations. Cowork is available to Claude Max subscribers in the macOS app, with a waitlist and planned expansion to additional platforms.</p><p>2. <a href="https://openai.com/index/a-business-that-scales-with-the-value-of-intelligence/">OpenAI Lays Out Business Model Built To Scale With &#8220;The Value of Intelligence&#8221;</a></p><p>OpenAI published a strategy note from CFO Sarah Friar describing how the company intends to scale revenue in step with real-world value delivered by its models, using a mix of consumer subscriptions, workplace subscriptions with usage-based pricing, and developer/enterprise API spend tied to production outcomes, alongside newer commerce and advertising paths when users are close to decisions. 
OpenAI reported record highs in weekly and daily active users and tied recent growth directly to available compute, citing compute capacity rising from 0.2 GW (2023) to 0.6 GW (2024) to ~1.9 GW (2025), alongside revenue growing from $2B ARR (2023) to $6B (2024) to $20B+ (2025); it also emphasized a shift from reliance on a single compute provider to a diversified supplier portfolio to improve resilience and &#8220;compute certainty.&#8221; The near-term product direction is toward agents and workflow automation that carry context over time and take actions across tools.</p><p>3. <a href="https://ernie.baidu.com/blog/posts/ernie-5.0-0110-release-on-lmarena/">ERNIE-5.0 Tops LMArena Text Leaderboard as &#8470;1 Chinese Model</a></p><p>Baidu released ERNIE-5.0-0110 on LMArena, where it scored 1,460 on the Text leaderboard, placing #8 overall and #1 among Chinese models at the time of the referenced snapshot. The same update also highlights a strong math-category placement. The model can be tried through Baidu&#8217;s ERNIE product entry points.</p><p>4. <a href="https://bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence">Black Forest Labs Releases FLUX.2 [klein]</a></p><p>Black Forest Labs launched FLUX.2 [klein], a smaller, interactive image model built for fast generation and iterative edits in a &#8220;draw &#8594; see &#8594; refine&#8221; workflow. The 4B version delivers real-time speed (reported as under one second at ~10 steps on an H100) and is released under the Apache 2.0 license, while the 9B version is released under a non-commercial license. For local use, the 4B model is recommended to run with at least ~13GB VRAM.</p><p>5. 
<a href="https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-15-and-medical-speech-to-text-with-medasr/">Google AI Releases MedGemma-1.5</a></p><p>Google Research released MedGemma 1.5 and introduced MedASR, expanding its open healthcare model lineup for medical imaging interpretation and medical speech-to-text. MedGemma 1.5 adds broader medical imaging support, including higher-dimensional inputs such as CT/MRI volumes and whole-slide histopathology, as well as improvements to medical text capabilities. MedASR is an open medical dictation ASR model intended for transcribing clinical speech so it can feed downstream workflows. Both are available via public model releases and can be deployed through Vertex AI.</p><p>6. <a href="https://research.nvidia.com/labs/adlr/personaplex/">NVIDIA Releases PersonaPlex-7B-v1: A Real-Time Speech-to-Speech Model</a></p><p>NVIDIA introduced PersonaPlex, a full-duplex conversational speech model designed to keep natural turn-taking (interruptions, backchannels, low-latency speech) while still letting developers choose a voice and define a persona through text prompts. The system is positioned as an alternative to ASR&#8594;LLM&#8594;TTS pipelines by using a single model that listens and speaks concurrently, aiming for a more human conversational rhythm without sacrificing controllability. It is built on the Moshi architecture from Kyutai, with 7 billion parameters, and is trained on a limited set of unscripted human conversations from the Fisher English corpus.</p><p>7. <a href="https://www.androidauthority.com/chatgpt-translate-3632584/">OpenAI Releases ChatGPT Translate</a></p><p>OpenAI rolled out ChatGPT Translate, a standalone translation interface at chatgpt.com/translate that adds tone- and audience-aware rewrites on top of basic translation. The UI offers automatic language detection, covers over 50 languages, and features AI-powered prompt customization. 
Users can add text, speak, or upload an image for translation. It also includes one-tap options like &#8220;make it more fluent,&#8221; &#8220;business formal,&#8221; &#8220;explain to a child,&#8221; and &#8220;academic&#8221; that hand off into ChatGPT for further refinement.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. <a href="https://pub.towardsai.net/creating-an-advanced-ai-agent-from-scratch-with-python-in-2025-part-1-ce74a23f6514?sk=39314d5421bdf26306838a5ecc438745">Creating an Advanced AI Agent From Scratch with Python in 2026</a></p><p>To create more efficient and robust systems, this article advocates for building AI agents from scratch rather than relying on frameworks. It outlines a modular architecture composed of a flexible Tool System, a provider-agnostic LLM Wrapper, and an Agent Orchestrator. The author implements the ReAct (Reasoning + Acting) pattern to ensure a clear, step-by-step workflow and uses Pydantic for type safety in tool execution.</p><p>2. <a href="https://pub.towardsai.net/model-context-protocol-mcp-why-every-ai-developer-needs-mcp-in-2026-e68d39a49417?sk=80993cbe0aa9e7d48afb50f800fc20fe">Model Context Protocol (MCP): Why Every AI Developer Needs MCP in 2026</a></p><p>This article introduces the Model Context Protocol (MCP), an open protocol by Anthropic designed to standardize connections between LLMs and external tools. It contrasts MCP with traditional REST APIs, highlighting the maintenance and scalability challenges of direct integrations. The protocol uses a decoupled architecture with an MCP Host, Client, and Servers that act as intermediaries for services such as databases or search engines. The result is a more maintainable, scalable, and consistent framework for building AI applications.</p><p>3. <a href="https://pub.towardsai.net/rlm-the-ultimate-evolution-of-ai-recursive-language-models-59dd86f304ff?sk=39d77b67797ce3b4942ab93c42b5d88e">RLM: The Ultimate Evolution of AI? 
Recursive Language Models</a></p><p>This article explains Recursive Language Models (RLMs), an approach for managing extensive contexts in AI. Instead of passively processing large inputs, RLMs treat data as a programmable environment where the model acts as an active agent. Using code, it explores, segments, and filters information, breaking down complex tasks into smaller sub-problems. The model then recursively calls itself to solve these parts before synthesizing a final result. This method allows the AI to handle massive datasets and complex reasoning, although it introduces latency and is less efficient for simple tasks.</p><p>4. <a href="https://pub.towardsai.net/factoring-quintics-using-mid-point-ladders-5f99b28e5986">Factoring Quintics Using Mid-Point Ladders</a></p><p>The author introduces a graphically-aided technique for factoring quintic polynomials into approximate cubic and quadratic components. This method, applicable to quintics with five real roots, employs a Mid-Point Ladder based on Vieta&#8217;s sum-of-factors theorem. It simplifies the process by starting with a core genetic function, then uses the ladder to account for adjustments to the constant and x&#178; terms. A Division by Vision formula is then applied to find the factors.</p><p>5. <a href="https://pub.towardsai.net/federated-learning-explained-a-deep-technical-dive-and-how-poets-can-actually-use-it-2db13dff953f?sk=6047f8cc67c8fb17805e825084a05b6c">Federated Learning Explained: A Deep Technical Dive (And How Poets Can Actually Use It)</a></p><p>This technical overview explores Federated Learning, a method that enables AI models to be trained across decentralized devices without collecting user data. It details the architecture, from the initial distribution of a global model to local training on individual devices and the secure aggregation of updates. 
The focus then shifts to practical applications for creative professionals, explaining how they already benefit from this technology in everyday tools like smartphone keyboards.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/deepseek-ai/Engram/tree/main">Engram</a> is a module that modernizes classic N-gram embeddings for O(1) lookup.</p><p>2. <a href="https://github.com/vercel-labs/agent-skills">Agent Skills</a> is a collection of skills for AI coding agents.</p><p>3. <a href="https://github.com/google/langextract">LangExtract</a> is a Python library for extracting structured information from unstructured text using LLMs with precise source grounding and interactive visualization.</p><p>4. <a href="https://github.com/iOfficeAI/AionUi">AionUI</a> is a free, local, open-source Cowork for Gemini CLI, Claude Code, Codex, Opencode, Qwen Code, Goose Cli, Auggie, and more.</p><h3>Top Papers of The Week</h3><p>1. <a href="https://arxiv.org/abs/2512.23675">End-to-End Test-Time Training for Long Context</a></p><p>This paper recasts long-context language modeling as a continual learning problem rather than an architectural one, using a standard Transformer with sliding-window attention that continues learning at test time via next-token prediction. The meta-learned Test-Time Training method, TTT-E2E, scales with context length the way full attention does while keeping inference latency constant, running 2.7&#215; faster than full attention at 128K context.</p><p>2. <a href="https://arxiv.org/abs/2601.06943">Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning</a></p><p>This paper introduces VideoDR, the first video deep research benchmark for video-conditioned open-domain question answering on the open web. VideoDR requires cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video&#8211;web evidence across six semantic domains. 
Evaluations show agentic approaches only outperform workflows when models preserve initial video anchors, with goal drift and long-horizon consistency emerging as main bottlenecks.</p><p>3. <a href="https://arxiv.org/abs/2601.09668">STEP3-VL-10B Technical Report</a></p><p>This paper introduces STEP3-VL-10B, a lightweight, open-source foundation model that redefines the trade-off between efficiency and frontier-level multimodal intelligence. The model unifies a fully unfrozen pre-training strategy on 1.2T multimodal tokens, coupling a language-aligned Perception Encoder with a Qwen3-8B decoder, and scales post-training with over 1k RL iterations and PaCoRe, achieving 92.2% on MMBench and 80.11% on MMMU.</p><p>4. <a href="https://arxiv.org/abs/2601.10477">Urban Socio-Semantic Segmentation with Vision-Language Reasoning</a></p><p>The paper introduces SocioSeg, an urban socio-semantic segmentation dataset that combines satellite imagery, digital maps, and hierarchical pixel-level labels for socially defined entities such as schools and parks. The authors propose SocioReasoner, a vision-language reasoning framework that uses cross-modal recognition, multi-stage reasoning, and reinforcement learning to surpass state-of-the-art segmentation models and achieve strong zero-shot generalization.</p><h3>Quick Links </h3><p>1. <a href="https://community.openai.com/t/open-responses-for-the-open-source-community/1371770">OpenAI introduces Open Responses</a>, an open-source specification and ecosystem inspired by the OpenAI Responses API. It is designed to make it easier to build multi-provider, interoperable LLM interfaces.</p><p>2. <a href="https://z.ai/blog/glm-image">Zhipu AI released GLM-Image</a>, an open-source, industrial-grade auto-regressive image generation model. GLM-Image combines the strengths of diffusion and auto-regressive models. The auto-regressive model decides what should appear in the image, while the diffusion model decides how it should look. 
This separation allows GLM-Image to be both accurate and visually strong.</p><p>3. <a href="https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/">Nous Research releases NousCoder-14B</a>, an Olympiad programming model that is post-trained on Qwen3-14B using reinforcement learning (RL) with verifiable rewards. The model is trained on 24k verifiable coding problems from sources including TACO Verified and PrimeIntellect SYNTHETIC-1. It reaches 67.87 percent Pass@1 on LiveCodeBench v6, a 7.08 percentage point gain over the Qwen3-14B baseline of 60.79 percent on the same benchmark.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/assemblyai-applied-ai-engineer-ysnx">Applied AI Engineer @AssemblyAI (Remote/USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/healthengine-ai-software-engineer-jb3z">AI Software Engineer @Healthengine (Perth, Australia)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/confisa-international-group-llm-applied-ai-research-scientist-usa-and-latam-remote-7ks1">LLM&#8202;&#8212;&#8202;Applied AI Research Scientist @CONFISA INTERNATIONAL GROUP (USA &amp; LATAM Remote)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/auto1-group-junior-conversational-ai-engineer-voice-bots-xe8c">Junior Conversational AI Engineer (Voice Bots) @AUTO1 Group (Tirana, Albania)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/sap-phd-internship-f-m-d-ai-research-knowledge-graphs-for-agentic-ai-uwru">PhD Internship (f/m/d)&#8202;&#8212;&#8202;AI Research @SAP (Germany/Remote)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/ntt-data-north-america-ai-engineer-genai-developer-wtap">AI Engineer/GenAI Developer @NTT DATA (Chennai, India)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/tenstorrent-inc-machine-learning-engineer-ai-models-4kmm">Machine Learning Engineer&#8202;&#8212;&#8202;AI Models @Tenstorrent 
Inc. (Poland/Remote)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? <a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p>]]></content:encoded></item><item><title><![CDATA[The $0 download that saves a $5k pivot.]]></title><description><![CDATA[Our free Agent Architecture Cheatsheet and Webinar is now live!]]></description><link>https://newsletter.towardsai.net/p/the-0-download-that-saves-a-5k-pivot</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/the-0-download-that-saves-a-5k-pivot</guid><dc:creator><![CDATA[Towards AI]]></dc:creator><pubDate>Fri, 16 Jan 2026 15:03:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pwq5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We just released something that will save you a painful amount of time, tokens, and &#8220;why is this system doing 
<em>that</em>?&#8221; debugging.</p><p>It&#8217;s a <strong>free Agent Architecture Cheatsheet + a 1-hour webinar</strong> that tells you whether you need a workflow, a single agent, or a multi-agent system <em>before you commit to the wrong build.</em> The cheatsheet contains all the information you need to make architectural decisions in AI projects in the most condensed format. The webinar adds context and examples.</p><p>It is built from months of production trial-and-error (plus a few expensive &#8220;well&#8230; that was a pivot&#8221; moments). It turns everything we learned deploying real systems into a decision framework you can use to design agents in any niche, any industry, at any level of complexity.</p><p><strong><a href="https://academy.towardsai.net/products/digital_downloads/agents-cheatsheet?utm_source=taisubstack&amp;utm_medium=email&amp;utm_campaign=jan2026_subscribers_nostart_cheatsheet_download_glb&amp;utm_id=freecheatsheet">Get Your Free PDF Here!</a></strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/products/digital_downloads/agents-cheatsheet?utm_source=taisubstack&amp;utm_medium=email&amp;utm_campaign=jan2026_subscribers_nostart_cheatsheet_download_glb&amp;utm_id=freecheatsheet" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pwq5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png 424w, https://substackcdn.com/image/fetch/$s_!pwq5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png 848w, 
https://substackcdn.com/image/fetch/$s_!pwq5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png 1272w, https://substackcdn.com/image/fetch/$s_!pwq5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pwq5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2469618,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://academy.towardsai.net/products/digital_downloads/agents-cheatsheet?utm_source=taisubstack&amp;utm_medium=email&amp;utm_campaign=jan2026_subscribers_nostart_cheatsheet_download_glb&amp;utm_id=freecheatsheet&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.towardsai.net/i/184762883?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!pwq5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png 424w, https://substackcdn.com/image/fetch/$s_!pwq5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png 848w, https://substackcdn.com/image/fetch/$s_!pwq5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png 1272w, https://substackcdn.com/image/fetch/$s_!pwq5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b3a4cf-f418-466b-bef5-3db9e96b0bcf_2048x1143.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>If you&#8217;ve built even one &#8220;agent&#8221; recently, you&#8217;ve seen the plot twists:</p><p>Day 1: &#8220;It works!&#8221;</p><p>Day 7: &#8220;Why is it calling seven tools?&#8221;</p><p>Day 14: &#8220;Why did costs triple?&#8221;</p><p>Day 21: &#8220;We&#8217;ll add evals and monitoring after launch.&#8221;</p><p>(We love your optimism. We really do.)</p><p>And here&#8217;s the part nobody warns you about: once you pick the wrong architecture, it&#8217;s not a quick refactor. It becomes a slow-motion rewrite: tool chaos, state bugs, brittle loops, unpredictable latency, until you&#8217;re stuck answering the hardest question in the whole project way too late: <strong>should this have been a workflow, a single agent, or multi-agent in the first place?</strong></p><p>That&#8217;s what this cheatsheet and webinar make easy.</p><p>You get a fast, practical method to make the call: <strong>Workflow vs. Single Agent + Tools vs. Multi-Agent</strong> with enough structure that you can defend it in a design review, not just &#8220;it felt right.&#8221; You run a quick autonomy test, answer <strong>12 high-signal questions</strong>, and suddenly you&#8217;re not guessing anymore. Decisions that used to take a week of Slack debate become boringly clear. You&#8217;ll know when to keep things deterministic, when to allow autonomy, when multi-agent is actually justified, and when it&#8217;s just adding cost and failure modes without adding capability. 
The result is simple: fewer pivots, fewer surprises, tighter latency, cleaner debugging, and systems that behave on purpose.</p><p>And the questions inside are the ones that actually decide whether your build ships. You&#8217;ll pressure-test tool complexity (including the point where tool selection quality starts collapsing), define where validation must be hard checks vs judge-based, decide what state needs to persist (and where it lives), place human-in-the-loop gates when failure is expensive, lock in your latency budget before your agent blows it up, and set up the minimum eval + tracing instrumentation so you can iterate with signal instead of vibes.</p><p>It&#8217;s the same framework style we use to design and deploy systems under real constraints, work associated with teams at <strong>Thinkific and Europol,</strong> because in production, architecture decisions are cost decisions. And it&#8217;s been used in architecture reviews for one reason: it&#8217;s faster to run this framework than to argue yourself into an overbuilt system.</p><p><strong>Run it once with your current agent idea, and you&#8217;ll know exactly what to build next, without the expensive detour.</strong></p><p><strong><a href="https://academy.towardsai.net/products/digital_downloads/agents-cheatsheet?utm_source=taisubstack&amp;utm_medium=email&amp;utm_campaign=jan2026_subscribers_nostart_cheatsheet_download_glb&amp;utm_id=freecheatsheet">Access the cheatsheet here!</a></strong></p><p>PS: My favorite debate-killer from the cheatsheet: one model calling 10 APIs is still <strong>one agent with tools,</strong> not &#8220;multi-agent.&#8221; If you&#8217;ve ever lost 45 minutes to that argument, you&#8217;ve already earned this download.</p>]]></content:encoded></item><item><title><![CDATA[TAI #187: OpenAI's Health Push and the Real State of LLMs in Medicine]]></title><description><![CDATA[Also, Nvidia Alpamayo models, Rubin, Falcon H1R-7B & 
more.]]></description><link>https://newsletter.towardsai.net/p/tai-187-openais-health-push-and-the</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-187-openais-health-push-and-the</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 13 Jan 2026 15:20:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uqkc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644a7599-1d83-4721-826c-080fa959f38f_1600x884.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://academy.towardsai.net/courses/beginner-to-advanced-llm-dev" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Prr2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3f49c6-607b-4dff-9aa4-772f4bef8a23_1100x220.png 424w, https://substackcdn.com/image/fetch/$s_!Prr2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3f49c6-607b-4dff-9aa4-772f4bef8a23_1100x220.png 848w, https://substackcdn.com/image/fetch/$s_!Prr2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3f49c6-607b-4dff-9aa4-772f4bef8a23_1100x220.png 1272w, https://substackcdn.com/image/fetch/$s_!Prr2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3f49c6-607b-4dff-9aa4-772f4bef8a23_1100x220.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Prr2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3f49c6-607b-4dff-9aa4-772f4bef8a23_1100x220.png" width="1100" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd3f49c6-607b-4dff-9aa4-772f4bef8a23_1100x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90058,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/beginner-to-advanced-llm-dev&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.towardsai.net/i/184426794?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3f49c6-607b-4dff-9aa4-772f4bef8a23_1100x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Prr2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3f49c6-607b-4dff-9aa4-772f4bef8a23_1100x220.png 424w, https://substackcdn.com/image/fetch/$s_!Prr2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3f49c6-607b-4dff-9aa4-772f4bef8a23_1100x220.png 848w, https://substackcdn.com/image/fetch/$s_!Prr2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3f49c6-607b-4dff-9aa4-772f4bef8a23_1100x220.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Prr2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3f49c6-607b-4dff-9aa4-772f4bef8a23_1100x220.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h2>What happened this week in AI by Louie</h2><p>OpenAI made its biggest healthcare push this week with two launches: ChatGPT Health for consumers and OpenAI for Healthcare for enterprises. The consumer product lets users connect medical records and wellness apps so responses can be grounded in personal context. The enterprise product offers BAA support, institutional policy integrations, and clinical templates for hospitals and health systems. Anthropic followed days later with Claude for Healthcare, featuring similar HIPAA-ready positioning plus connectors for CMS databases, ICD-10 codes, and PubMed.</p><p>The timing makes sense. OpenAI claims over 230 million people already ask health questions on ChatGPT weekly. Rather than fighting this behavior, they are productizing it. But I think the framing of these launches obscures where LLMs actually add value in health today versus where they need careful deployment.</p><p>The clearest wins are administrative and language-heavy tasks: drafting discharge summaries, patient instructions, prior authorization narratives, insurance comparisons, and translating medical jargon into plain language. These are high-volume workflows where humans review outputs before anything touches a patient. The ambient documentation market has exploded over the past quarter, with Microsoft, the VA, Veradigm, RXNT, and Google Cloud all shipping or expanding scribe products. Documentation is the obvious wedge because it is language-heavy and naturally human-in-the-loop.</p><p>Diagnosis is a harder application, and the picture is more nuanced than the binary &#8220;safe or dangerous&#8221; framing suggests. 
I think LLMs can provide enormous value when used as brainstorming partners for human experts. The sweet spot is generating suggestions in volume that are quick for clinicians to review and filter using their own intuition. An LLM suggesting rare diseases or edge cases that a busy doctor might not immediately consider can be incredibly valuable. The expert can instantly recognize which suggestions are smart, which are obvious, and which are nonsense. This is very different from an LLM making autonomous diagnostic decisions or patients self-diagnosing without professional review. The risk is not in the brainstorming; it is in skipping the expert filter. OpenAI has steered clear of diagnosis in how it positions the Health tool, likely because it is too easy for people to skip this expert filter.</p><p>The privacy critique also has teeth. When individuals upload their own records to a consumer tool, HIPAA protections generally do not apply as they do within a covered entity. OpenAI&#8217;s compartmentalization and &#8220;no training on health chats&#8221; commitment are meaningful, but the U.S. lacks a comprehensive privacy law that would permanently lock in these protections. A 2024 analysis found 37% of ChatGPT health answers untrustworthy, with 4% providing dangerous information. Context from connected records helps, but it does not certify correctness.</p><p>Because of this, I think most of the best AI health applications are likely to be custom-built assistants for health experts, where these safeguards can be built in, both for keeping experts in the loop and for privacy and security settings. The new OpenAI for Healthcare initiative should reduce friction in building and deploying these custom models. 
OpenAI models have been making solid progress on professional healthcare benchmarks, but I expect we will be seeing many more custom models for this industry in the future.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uqkc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644a7599-1d83-4721-826c-080fa959f38f_1600x884.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uqkc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644a7599-1d83-4721-826c-080fa959f38f_1600x884.png 424w, https://substackcdn.com/image/fetch/$s_!uqkc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644a7599-1d83-4721-826c-080fa959f38f_1600x884.png 848w, https://substackcdn.com/image/fetch/$s_!uqkc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644a7599-1d83-4721-826c-080fa959f38f_1600x884.png 1272w, https://substackcdn.com/image/fetch/$s_!uqkc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644a7599-1d83-4721-826c-080fa959f38f_1600x884.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uqkc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644a7599-1d83-4721-826c-080fa959f38f_1600x884.png" width="1456" height="804" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/644a7599-1d83-4721-826c-080fa959f38f_1600x884.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uqkc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644a7599-1d83-4721-826c-080fa959f38f_1600x884.png 424w, https://substackcdn.com/image/fetch/$s_!uqkc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644a7599-1d83-4721-826c-080fa959f38f_1600x884.png 848w, https://substackcdn.com/image/fetch/$s_!uqkc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644a7599-1d83-4721-826c-080fa959f38f_1600x884.png 1272w, https://substackcdn.com/image/fetch/$s_!uqkc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644a7599-1d83-4721-826c-080fa959f38f_1600x884.png 1456w" sizes="100vw"></picture></div></a></figure></div><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.towardsai.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3>Why should you care?</h3><p>The trajectory is not &#8220;LLMs replace clinicians.&#8221; The near-term future is more mundane: LLMs become the interface layer between messy data and the humans who make decisions. 
The competitive edge shifts from nicer phrasing to provable work: the ability to reconstruct what the system saw, what it retrieved, and why it responded that way. Deep integration will beat standalone brilliance.</p><p>I think the winning products will add friction in the right places: mandatory source views, explicit uncertainty, and refusal when context is missing. The safest interaction patterns need to be the easiest ones. For users, treat these tools as idea generators and preparation aids. They are genuinely helpful for surfacing possibilities you might not have considered and for helping you prepare for conversations with professionals. The key is to keep experts in the loop so they can do what they do best: separate the signal from the noise.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><h3>Hottest News</h3><p>1. <a href="https://openai.com/index/introducing-chatgpt-health/">OpenAI Launched ChatGPT Health</a></p><p>OpenAI launched ChatGPT Health, a dedicated health and wellness experience inside ChatGPT that securely combines the assistant with user health context. Users can connect medical records and wellness apps such as Apple Health, Function, and MyFitnessPal for tasks like understanding lab results, preparing for doctor appointments, and interpreting wearable data. OpenAI says Health adds layered protections on top of existing ChatGPT controls, including purpose-built encryption and isolation for health conversations, and that it was developed with input from physicians globally. Access is rolling out via a waitlist; once approved, users can select &#8220;Health&#8221; from the ChatGPT sidebar to begin.</p><p>2. 
<a href="https://www.bleepingcomputer.com/news/google/google-is-testing-a-new-image-ai-and-its-going-to-be-its-fastest-model/">Google Is Testing a New Image AI, and It&#8217;s Going To Be Its Fastest Model</a></p><p>Google tests &#8220;Nano Banana 2 Flash,&#8221; a faster Gemini Flash image model. The report says Google is internally testing the model, expected to run faster and be more affordable than Nano Banana Pro, while remaining less capable than the top-end model. The model name was spotted in a leak shared on X, and the report places it within Google&#8217;s &#8220;Flash&#8221; lineup, which emphasizes speed. There&#8217;s no public launch or access path described yet beyond the indication that it is in testing.</p><p>3. <a href="https://nvidianews.nvidia.com/news/alpamayo-autonomous-vehicle-development">NVIDIA Announces Alpamayo Family of Open-Source AI Models and Tools To Accelerate Safe, Reasoning-Based Autonomous Vehicle Development</a></p><p>NVIDIA announced the Alpamayo family of open models, tools, and datasets aimed at long-tail autonomous driving scenarios, centered on chain-of-thought vision-language-action (VLA) models backed by the NVIDIA Halos safety system. The initial release includes Alpamayo 1, a 10B-parameter reasoning VLA teacher model that uses video input to produce driving trajectories alongside reasoning traces, with open weights and open-source inference scripts; NVIDIA also released AlpaSim, an open-source end-to-end AV simulation framework on GitHub, and &#8220;Physical AI Open Datasets&#8221; with 1,700+ hours of driving data available on Hugging Face. Alpamayo 1 is available on Hugging Face, and NVIDIA describes it as a foundation developers can fine-tune and distill into smaller runtime models or use to build evaluators and auto-labeling systems.</p><p>4. 
<a href="https://nvidianews.nvidia.com/news/rubin-platform-ai-supercomputer">NVIDIA Launches Rubin Platform</a></p><p>NVIDIA announced its Rubin platform, built around &#8220;extreme codesign&#8221; across six components (Vera CPU, Rubin GPU, NVLink 6 Switch, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-6 Ethernet) to cut training time and reduce inference token cost. NVIDIA says Rubin can reduce inference token cost by up to 10&#215; and train MoE models with 4&#215; fewer GPUs than Blackwell. It also introduced an Inference Context Memory Storage Platform (powered by BlueField-4) to enable the sharing and reuse of KV-cache data for agentic workloads. The flagship rack-scale system, Vera Rubin NVL72, combines 72 Rubin GPUs and 36 Vera CPUs, and NVIDIA says Rubin is in full production with Rubin-based products available from partners in the second half of 2026.</p><p>5. <a href="https://falcon-lm.github.io/blog/falcon-h1r-7b/">TII Abu-Dhabi Released Falcon H1R-7B: A New Reasoning Model</a></p><p>Technology Innovation Institute (TII) introduced Falcon-H1R 7B, a decoder-only model that combines a hybrid Transformer&#8211;Mamba backbone with a two-stage training pipeline (cold-start SFT followed by GRPO reinforcement learning) and a test-time scaling method called Deep Think with Confidence (DeepConf) to boost reasoning while keeping token use lower. The release includes full checkpoints and quantized GGUF weights on Hugging Face, plus a hosted Falcon Chat experience and demo links for trying the model.</p><p>6. <a href="https://cursor.com/blog/dynamic-context-discovery">Cursor Introduces Dynamic Context Discovery</a></p><p>Cursor published a research note describing dynamic context discovery. Instead of stuffing a large static context prompt into every run, the agent starts with less and retrieves what it needs as it goes, reducing confusion and cutting token usage on long trajectories. 
Cursor outlines several concrete implementations, including converting long tool outputs into files, referencing chat history during summarization, supporting the Agent Skills open standard, loading only the MCP tools needed for a task, and treating integrated terminal sessions as files so agents can selectively pull relevant slices.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. <a href="https://pub.towardsai.net/context-rot-the-silent-killer-of-ai-agents-a8636a754856?sk=2a2649d776f9989ec4a9a0f786023149">Context Rot: The Silent Killer of AI Agents</a></p><p>This article examines context rot, a common issue in which AI agents lose effectiveness over long tasks as their context window fills with irrelevant information. It introduces context engineering as the practice of managing the information an AI model sees at any given moment. The piece details retrieval strategies, such as loading data upfront versus just-in-time. For longer-running tasks, it outlines techniques such as context compaction (summarizing history), structured note-taking to preserve key details, and the use of sub-agents for specialized functions.</p><p>2. <a href="https://pub.towardsai.net/evolution-of-vision-language-models-and-multi-modal-learning-d4552601ccbd">Evolution of Vision Language Models and Multi-Modal Learning</a></p><p>To address the limitations of text-only AI, Vision-Language Models (VLMs) were developed to process both visual and textual information. This piece traces their evolution, starting with foundational models such as CLIP and GLIP and moving on to open-source systems such as LLaVA and the multilingual Qwen-VL. It also covers the trend toward smaller, efficient models for edge devices alongside powerful, natively multi-modal systems like Google&#8217;s Gemini. 
The discussion also outlines persistent challenges, including hallucinations and resource intensity, while highlighting future research focused on improved reasoning, interpretability, and domain-specific applications.</p><p>3. <a href="https://pub.towardsai.net/a-guide-to-fine-tuning-large-language-models-llms-without-catastrophic-forgetting-4b2c926f14a4">Fine-Tuning Large Language Models (LLMs) Without Catastrophic Forgetting</a></p><p>This piece provides a practical guide to fine-tuning large language models while avoiding catastrophic forgetting, the loss of general knowledge. It focuses on Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA. The core strategy is to freeze the original model&#8217;s weights and add small, trainable matrices to specific upper layers, particularly the attention mechanism. This allows the model to adapt to new domains without overwriting its foundational capabilities. The guide also covers best practices, including learning rate schedules and using multiple, isolated LoRA adapters for different tasks to maintain performance across various domains.</p><p>4. <a href="https://pub.towardsai.net/the-complete-guide-to-guardrails-building-ai-agents-that-wont-go-rogue-d7dabb53b32b?sk=df55a01669babbcb5985b637b8f3ee93">Why AI Agents Fail Without Guardrails (And How to Fix It)</a></p><p>Without proper safety measures, AI agents capable of autonomous action pose significant risks, including data leaks and operational errors. This piece outlines the critical role of guardrails: checkpoints designed to monitor, block, or require human approval for agent actions. It distinguishes between fast, pattern-matching guards for PII detection and more advanced AI-based checks for contextual safety. The author provides practical implementation examples for PII redaction and human-in-the-loop workflows for high-risk operations.</p><p>5. 
<a href="https://pub.towardsai.net/from-perceptrons-to-sigmoid-superstars-building-smarter-neural-networks-54500d406ee1?sk=fcf7ab37c0d8855510bfb8f133416bd4">From Perceptrons to Sigmoid Superstars: Building Smarter Neural Networks</a></p><p>This article provides a foundational overview of neural network development, starting with the basic perceptron as a linear classifier. It explains the critical shift to sigmoid neurons, whose smooth activation functions were necessary for enabling gradient-based learning techniques. The post then describes how these neurons are organized into layered feedforward architectures to model complex, nonlinear patterns. It also covers the Universal Approximation Theorem, which establishes the theoretical basis for why these networks are such powerful and widely used tools in artificial intelligence.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/ruvnet/claude-flow">Claude Flow</a> is an AI orchestration platform that combines hive-mind swarm intelligence, persistent memory, and 100+ advanced MCP tools.</p><p>2. <a href="https://github.com/OpenBMB/ChatDev">ChatDev</a> is a zero-code multi-agent orchestration platform.</p><p>3. <a href="https://github.com/frankbria/ralph-claude-code">Ralph Claude Code</a> is an implementation of Geoffrey Huntley&#8217;s technique for Claude Code that enables continuous autonomous development cycles.</p><p>4. <a href="https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b">Nemotron Speech ASR</a> is a new open-source transcription model for low-latency use cases like voice agents.</p><p>5. <a href="https://github.com/camel-ai/seta-env">SETA</a> is a toolkit and environment stack that focuses on reinforcement learning for terminal agents.</p><h3>Top Papers of The Week</h3><p>1. 
<a href="https://arxiv.org/abs/2601.03233">LTX-2: Efficient Joint Audio-Visual Foundation Model</a></p><p>LTX-2 introduces an open-source joint audio-visual foundation model that generates high-quality, temporally synchronized video and audio from text. The model uses an asymmetric dual-stream transformer with a 14B video stream and 5B audio stream, linked by bidirectional cross-attention and modality-aware classifier-free guidance, and achieves state-of-the-art open-source audiovisual quality with publicly released weights and code.</p><p>2. <a href="https://arxiv.org/abs/2601.05242">GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization</a></p><p>The paper introduces Group reward-Decoupled Normalization Policy Optimization (GDPO) for multi-reward reinforcement learning with language models. The authors show that applying Group Relative Policy Optimization (GRPO) to combined rewards collapses distinct signals into identical advantages, harming convergence. GDPO decouples reward normalization, preserves relative differences, stabilizes training, and consistently outperforms GRPO on tool use, math, and coding across correctness and constraint metrics.</p><p>3. <a href="https://arxiv.org/abs/2512.10398">Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases</a></p><p>This paper introduces Confucius Code Agent, an open-source AI software engineer built on the Confucius SDK, designed for industrial-scale software repositories and long-running sessions. Confucius SDK is an agent development platform structured around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). On SWE-Bench-Pro, CCA reaches a Resolve@1 of 54.3%, exceeding prior research baselines and comparing favorably to commercial results.</p><p>4. 
<a href="https://arxiv.org/abs/2601.02151">Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting</a></p><p>Researchers propose Entropy-Adaptive Fine-Tuning (EAFT) to mitigate catastrophic forgetting in supervised fine-tuning of large language models. They identify &#8220;Confident Conflicts,&#8221; low-probability, low-entropy tokens where external supervision clashes with the model&#8217;s belief, causing harmful gradients. EAFT gates updates using token-level entropy, learning from uncertain tokens while suppressing conflicting ones, and matches SFT&#8217;s domain performance while preserving general capabilities across Qwen and GLM models.</p><h3>Quick Links</h3><p>1. <a href="https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai">Liquid AI releases LFM2.5</a>, a new generation of small foundation models built on the LFM2 architecture and focused on device and edge deployments. The model family includes LFM2.5-1.2B-Base and LFM2.5-1.2B-Instruct, and extends to Japanese, vision-language, and audio-language variants. 
Pretraining for LFM2.5 extends from 10T to 28T tokens, and the Instruct model adds supervised fine-tuning, preference alignment, and large-scale multi-stage reinforcement learning, which push instruction following and tool-use quality beyond other 1B-class baselines.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/towards-ai-inc-ai-engineer-and-corporate-trainer-french-bilingual-am5x">AI Engineer &amp; Corporate Trainer (French Bilingual) @Towards AI Inc (Remote/Canada)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/amazon-agentic-ai-teacher-vcba">Agentic AI Teacher @Amazon (Chennai, India)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/headway-edtech-ai-experience-specialist-salr">AI Experience Specialist @Headway EdTech (Multiple US Locations)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/digibee-inc-ai-engineer-specialist-d27l">AI Engineer Specialist @Digibee Inc. (Remote/Brazil)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/salesforce-smts-ai-research-efb6">SMTS, AI Research @Salesforce (Palo Alto, CA, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/aveva-principal-qa-engineer-ai-and-cloud-services-wtmx">Principal QA Engineer&#8202;&#8212;&#8202;AI &amp; Cloud Services @Aveva (Bengaluru, India)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/pwc-genai-python-systems-engineer-senior-associate-uclu">GenAI Python Systems Engineer&#8202;&#8212;&#8202;Senior Associate @PwC (Richmond, CA, USA)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? 
<a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.towardsai.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[TAI #186: Claude Code and the Christmas Awakening: Why CLI Agents Are Winning the Agentic Race]]></title><description><![CDATA[Also, Deepseek's mHC, GPT-5.2 Pro tops Frontier math, Qwen-Image-2512, Tencent HY-MT1.5 & more]]></description><link>https://newsletter.towardsai.net/p/tai-186-claude-code-and-the-christmas</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/tai-186-claude-code-and-the-christmas</guid><dc:creator><![CDATA[Louie Peters]]></dc:creator><pubDate>Tue, 06 Jan 2026 15:03:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!V8gI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What happened this week in AI by Louie</h2><p>This week was quieter on major model releases, though DeepSeek published a paper on Manifold-Constrained Hyper-Connections (mHC), a training technique that improves stability when scaling up model 
size. We think this could be significant when integrated into their next-generation models. But the AI community&#8217;s attention turned to something arguably more transformative: how people are actually using these models. Over the Christmas break, a wave of new users discovered Claude Code, Anthropic&#8217;s terminal-based agentic coding assistant, and many are calling it a genuine step change in what AI can accomplish. The combination of Opus 4.5&#8217;s release in November and holiday downtime created perfect conditions for exploration. Social media has been flooded with reports of developers shipping projects in hours that would have taken weeks, and perhaps more surprisingly, non-technical users automating tasks they never thought possible.</p><p>Claude Code is Anthropic&#8217;s command-line tool that gives Claude direct access to your file system, terminal, and local environment. Unlike chatbot interfaces, which require you to manually provide context by copying and pasting, Claude Code can read your entire codebase, edit multiple files coherently, run your test suite, and iterate until things work. The AI navigates your file system and finds what it needs itself, rather than relying on you to assemble the relevant context.</p><p>The Opus 4.5 upgrade appears to have crossed a critical threshold. Users consistently report that it eliminates the &#8220;slop code&#8221; problem that plagued earlier models, where AI-generated code was functional but poorly structured and hard to maintain. Opus 4.5 produces code that experienced developers actually want to keep. It understands architectural patterns, creates appropriate abstractions, and can debug its own work by writing minimal reproducible examples. Anthropic&#8217;s internal surveys indicate that engineers now rely on Claude for 60% of their daily work, with a mean productivity improvement of 220% reported across the team. 
However, individual results vary significantly by workflow and level of expertise.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V8gI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V8gI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png 424w, https://substackcdn.com/image/fetch/$s_!V8gI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png 848w, https://substackcdn.com/image/fetch/$s_!V8gI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png 1272w, https://substackcdn.com/image/fetch/$s_!V8gI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V8gI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V8gI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png 424w, https://substackcdn.com/image/fetch/$s_!V8gI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png 848w, https://substackcdn.com/image/fetch/$s_!V8gI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png 1272w, https://substackcdn.com/image/fetch/$s_!V8gI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8045d540-8b05-4426-aaf6-1666a7fed440_1600x893.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p><strong>CLI agents vs. IDE tools vs. chatbot interfaces</strong></p><p>The distinction between Claude Code and tools like Cursor is more nuanced than many realize. Cursor also has a full agent mode, not just autocomplete. Both can autonomously execute multi-step coding tasks. The real differences lie elsewhere.</p><p>Claude Code runs in your terminal and treats your entire computer as its workspace. It can chain together shell commands, access external services through MCP (Model Context Protocol) integrations, run scripts, and work across applications. Cursor is an IDE-first experience, essentially VS Code rebuilt with AI at its core. It offers visual diffs, familiar keybindings, and a polished review flow where you can accept or reject changes file by file.</p><p>Claude Code tends to include more of your codebase in each request, which improves understanding but increases costs. 
Some comparisons suggest Claude Code costs roughly four times as much as Cursor for similar tasks, though the higher context often yields better results for complex refactors.</p><p>The philosophical difference is telling. One developer described it this way: with Cursor, you drive, and AI assists. With Claude Code, AI drives, and you supervise. Many developers use both. You can run Claude Code inside Cursor&#8217;s terminal, using the IDE for visual editing and summoning Claude Code when you need deep reasoning on a complex problem.</p><p>For non-technical users, the comparison extends to chatbot interfaces like ChatGPT. With Claude Code, you can say &#8220;analyze all the spreadsheets in this folder, identify trends, and create a summary report,&#8221; and it handles the entire process. Non-technical users are leveraging this for tasks such as reorganizing thousands of files by content, extracting insights from contracts, processing research papers, and automating administrative workflows.</p><p>However, the CLI interface will not be the mainstream way this capability reaches most users. Terminals remain intimidating for people who have spent years in graphical interfaces. Even some experienced developers find it hard to adjust to CLI-based coding after working in IDEs. Claude Code does offer VS Code integration, but most users report better results in the terminal, where the complete agentic loop operates more naturally. The future likely involves more user-friendly interfaces that retain this agentic file system access.</p><p>This momentum poses a challenge to Microsoft&#8217;s strategy of infusing each application with its own focused AI assistant. The bet that people want Copilot for Excel, Copilot for Word, and Copilot for PowerPoint as separate experiences looks increasingly questionable as users gravitate toward agents that work across applications. 
When you can tell a single agent to analyze spreadsheets, summarize findings, and create a presentation, switching between three different AI assistants feels cumbersome. OpenAI&#8217;s Codex, Google&#8217;s Antigravity, and Anthropic&#8217;s Claude Code are all betting on this general-purpose agent model.</p><p><strong>How Boris Cherny, the creator of Claude Code, actually uses it</strong></p><p>Boris Cherny shared his personal setup this week, describing it as &#8220;surprisingly vanilla.&#8221; But reading through his workflow reveals just how far from vanilla it would seem to most users and implies that others at Anthropic have even more complex configurations.</p><p>Boris runs five Claude instances in parallel in his terminal, numbered 1 through 5, using system notifications to know when any instance needs input. He also runs another five to ten sessions on the web version of Claude Code simultaneously, frequently handing off sessions between local and web using the teleport feature. He kicks off sessions from his phone each morning and checks in on them later.</p><p>For model selection, Boris uses Opus 4.5 with thinking mode for everything. While it is larger and slower than Sonnet, he finds that the reduced need for steering and better tool use make it faster overall for completing actual tasks.</p><p>His team shares a single CLAUDE.md file that is checked into Git. This file serves as the project&#8217;s working agreement with Claude, containing build commands, style conventions, architectural boundaries, and definitions of done. Any time anyone sees Claude do something incorrectly, they add a rule to CLAUDE.md so it does not happen again. This creates a compounding effect where Claude gets better at each specific codebase over time.</p><p>Most sessions start in Plan mode (shift+tab twice). He goes back and forth with Claude until he likes the plan, then switches to auto-accept edits mode, where Claude can usually execute in one shot. 
Getting the plan right is critical.</p><p>He uses slash commands for every &#8220;inner loop&#8221; workflow he performs multiple times daily, such as a /commit-push-pr command that he and Claude use dozens of times a day. Subagents handle specialized workflows: code-simplifier cleans up code after Claude finishes, and verify-app runs end-to-end tests. A PostToolUse hook automatically formats Claude&#8217;s code. MCP integrations let Claude search and post to Slack, run BigQuery queries, and grab error logs from Sentry.</p><p>Perhaps most importantly, Boris emphasizes giving Claude a way to verify its work. Before a change lands, Claude tests it with the Claude Chrome extension, which opens a browser, tests the UI, and iterates until the code works and the user experience feels right. By his account, this verification loop improves the quality of the final result by a factor of two to three.</p><p>The gap between Boris&#8217;s setup and how most people use Claude Code highlights a broader challenge in AI adoption. Setting up an effective workflow with these tools is far from straightforward. It requires understanding permission modes, context management, hooks, MCP integrations, and verification strategies. Unlocking the productivity gains takes a significant time investment.</p><p><strong>The repo maintenance question</strong></p><p>Agentic coding also raises new questions about codebase organization. When AI writes and modifies code at high speed, repositories can quickly become messy, and tidying up can consume significant time. That prompts a genuine question: how neat and human-readable do these repos still need to be if AI is doing most of the coding and reviewing?</p><p>Claude Code can do a good job of refactoring and tidying up its own repos, but it usually needs detailed rules and workflow instructions to do so consistently. This is another area where investing in CLAUDE.md files and custom commands pays off. 
Without explicit guidance, agentic coding tends to accrue technical debt more quickly than traditional development. With the proper guardrails, Claude Code can maintain cleaner codebases than many human developers, but getting those guardrails right takes work.</p><div><hr></div><h3>Why should you care?</h3><p>The Christmas surge in Claude Code adoption signals we may be entering a new phase of AI interaction where agentic tools that can navigate your files, execute commands, and chain together workflows become more valuable than chat interfaces. Power requires access. A chatbot interface is sandboxed by design. An agent with file system access, shell execution, and external integrations can actually do the work. The trade-off is that wielding that power requires more skill and carries more risk.</p><p>For technical users, investing time in learning CLI agents is likely to pay dividends. The productivity improvements reported by power users are not available to someone using ChatGPT to generate code snippets they paste into their editor. 
But the learning curve is real, and Boris&#8217;s &#8220;vanilla&#8221; setup would take most developers considerable time to replicate.</p><p>For non-technical users, these tools are genuinely helpful for tasks like file organization, data analysis, and research automation. But the CLI will not be how most people access these capabilities. The future likely involves more accessible interfaces that retain the agentic power.</p><p>The agentic AI era is arriving faster than many expected. The models are ready. The tooling is maturing. The question now is how quickly people can learn to use them effectively, and how quickly more accessible interfaces will bring these capabilities to everyone else. The winners will be determined by which agents can deliver the reliability and verification loops that make them trustworthy for real work.</p><p><em>&#8212; <a href="http://www.linkedin.com/in/louie-peters">Louie Peters&#8202;&#8212;&#8202;Towards AI Co-founder and CEO</a></em></p><div><hr></div><h3>Hottest News</h3><p>1. <a href="https://deepseek.ai/blog/deepseek-mhc-manifold-constrained-hyper-connections">DeepSeek Releases Hyper-Connections for Transformers</a></p><p>DeepSeek introduces mHC (Manifold-Constrained Hyper-Connections) to scale training while preserving stability. mHC targets a common scaling tension: increasing internal information flow can improve capability, but it can also destabilize training; the method constrains hyper-connection residuals by projecting them onto a defined manifold to restore identity-mapping behavior while keeping the system efficient. DeepSeek reports empirical gains in large-scale pretraining experiments (including MoE-based variants inspired by prior DeepSeek work), positioning mHC as a training-stack improvement rather than a product feature, with the full technical details published on arXiv.</p><p>2. 
<a href="https://x.com/gdb/status/2006154439208337417">GPT-5.2 Pro Tops FrontierMath T4</a></p><p>OpenAI&#8217;s GPT-5.2 strengthens science/math performance, with new FrontierMath results and clearer context on what Tier 4 represents. OpenAI reports GPT-5.2 Thinking at 14.6% on FrontierMath Tier 4 (with Python enabled) and 40.3% on Tier 1&#8211;3, while also showing broad gains across science-heavy benchmarks (e.g., GPQA Diamond) and publishing a subset of GPT-5.2 Pro numbers for several evaluations. Epoch AI notes Tier 4 is the research-level slice of FrontierMath, 50 problems written as short-term research projects, so progress there is treated as a meaningful capability signal rather than routine test-taking.</p><p>3. <a href="https://qwen.ai/blog?id=qwen-image-2512">Alibaba Qwen Open Sourced Qwen-Image-2512</a></p><p>Qwen releases Qwen-Image-2512, a December update to its open text-to-image model. The update focuses on three visible quality lifts: more reliable text rendering and layout, more realistic human generation (reduced &#8220;AI-generated&#8221; look), and finer natural textures (e.g., landscapes and fur). The weights are available through Hugging Face and ModelScope, with an interactive demo on Hugging Face Spaces. Results from 10,000 rounds of blind model evaluations on AI Arena show that Qwen-Image-2512 is currently the strongest open-source model.</p><p>4. <a href="https://arxiv.org/html/2512.24092v1">Tencent Researchers Release Tencent HY-MT1.5</a></p><p>Tencent releases HY-MT1.5, a new open machine-translation model family. HY-MT1.5 ships in two sizes (1.8B and 7B parameters) and is trained with a multi-stage pipeline that combines general + MT-oriented pretraining, supervised fine-tuning, strong-to-weak on-policy distillation, and reinforcement learning to balance quality with deployment efficiency. 
Beyond &#8220;plain translation,&#8221; the models support practical constraint controls, such as terminology injection, context-aware translation, and format preservation for structured documents. Tencent also points to quantization options for edge or high-throughput deployments and provides model weights on Hugging Face, along with an accompanying code repository for use and integration.</p><p>5. <a href="https://x.com/storysylee/status/2006985196458139711">OpenAI Ramps Up Audio AI Efforts Ahead of Device</a></p><p>OpenAI ramps up its audio-model push ahead of an audio-first device, with a new architecture targeted for 2026. Reporting around the effort describes OpenAI consolidating previously separate audio teams and rebuilding core infrastructure so audio can be treated as a first-class modality (not just &#8220;text, then voice&#8221;), aiming to close gaps in latency, accuracy, and natural conversational flow. The plan centers on a new audio-model architecture expected in Q1 2026, alongside longer-term work on voice-first hardware form factors, with leadership reportedly tied to talent brought in from Character.AI.</p><p>6. <a href="https://x.com/rohanpaul_ai/status/2006813146170929409">IQuest-Coder Beats Claude Sonnet 4.5 on Coding Benchmarks</a></p><p>IQuestLab releases IQuest-Coder-V1, an open-source code LLM family tuned for autonomous software engineering. The lineup includes 7B, 14B, and 40B variants with 128K native context, plus &#8220;Instruct&#8221; and &#8220;Thinking&#8221; options and a &#8220;Loop&#8221; variant built around a recurrent-style mechanism for a better capacity&#8211;deployment trade-off. The project highlights &#8220;Code-Flow&#8221; training, learning from repository evolution and commit transitions rather than static snapshots. It scored 76.2% on SWE-Bench Verified, 81.1% on LiveCodeBench v6, and 49.9% on BigCodeBench.</p><h3>Five 5-minute reads/videos to keep you learning</h3><p>1. 
<a href="https://pub.towardsai.net/understanding-retrieval-in-rag-systems-why-chunk-size-matters-6d976dd5b654?sk=a4dcae2771607c60a6dfebc2653d486b">Understanding Retrieval in RAG Systems: Why Chunk Size Matters</a></p><p>This article examines the critical role of the retrieval step in RAG systems by isolating its mechanics from the generation component. The author demonstrates how varying text chunk sizes (80, 220, and 500 characters) directly affect performance. The analysis shows that small chunks lack sufficient context, medium ones can be unstable, while larger chunks yield more robust results. It also introduces a method for handling uncertainty, which uses the similarity score gap between top results to identify and flag ambiguous situations, preventing the system from providing a potentially incorrect answer when it has low confidence.</p><p>2. <a href="https://pub.towardsai.net/deep-compression-2015-how-much-more-can-we-squeeze-in-2025-e0bd70150fa2">Deep Compression, 2015: How Much More Can We Squeeze in 2025?</a></p><p>This article revisits the 2015 Deep Compression paper, first reproducing its pipeline of pruning, retraining, and quantization on the LeNet model, achieving a ~22x compression rate while maintaining accuracy. It then introduces a novel, TF-IDF-inspired pruning score that identifies important parameters based on activation patterns. This computationally lighter method improved upon the baseline, pushing the model&#8217;s compression up to ~65x with minimal impact on accuracy after retraining.</p><p>3. <a href="https://pub.towardsai.net/gemini-3-0-flash-mistralocr-3-rag-just-revolutionized-agent-ocr-forever-27f07fc15d87">Gemini 3.0 Flash + MistralOCR 3 + RAG Just Revolutionized Agent OCR Forever</a></p><p>This article explains how to combine Mistral OCR 3 and Google&#8217;s Gemini 3.0 Flash to build a document processing and chat application. 
It highlights Mistral OCR&#8217;s ability to accurately extract structured text and tables from documents and convert them to Markdown. The extracted content is then used by Gemini 3.0 Flash, a fast and efficient model, to power a chat interface. This allows users to ask questions about the uploaded document. The piece includes a step-by-step guide and code for creating the Streamlit application, providing a practical example of this integration.</p><p>4. <a href="https://pub.towardsai.net/why-humans-are-not-reinforcement-learning-agents-and-why-this-matters-for-ai-72e8d50f03aa?sk=e510f07a13268458d9fa9cc086fb0423">Why Humans Are Not Reinforcement Learning Agents And Why This Matters for AI</a></p><p>While reinforcement learning (RL) is a cornerstone of modern AI, it operates on assumptions that human decision-making consistently violates. This analysis explores the fundamental mismatch, noting that human rewards are unstable, influenced by emotion, and subject to time-inconsistent preferences. Humans also actively construct their reality rather than just reacting to fixed states, often relying on heuristics and identity to guide actions. The author suggests that acknowledging these differences is key to developing AI that can effectively support the complexity of human judgment, rather than simply optimizing for a fixed goal.</p><p>5. <a href="https://pub.towardsai.net/beyond-vectors-a-deep-dive-into-modern-search-in-qdrant-aaef72f32051">Beyond Vectors: A Deep Dive into Modern Search in Qdrant</a></p><p>To address the complexity of modern user queries, this piece details the construction of a hybrid search system using Qdrant. It demonstrates how to combine dense vectors for semantic understanding, sparse vectors for keyword precision, and full-text indexing for exact-match requirements. It also explores advanced techniques like ASCII-folding for multilingual support and ACORN for efficient, filter-aware vector searches. 
It also provides a practical e-commerce implementation to show how these elements are integrated into a single, effective retrieval pipeline that balances user intent with specific constraints.</p><h3>Repositories &amp; Tools</h3><p>1. <a href="https://github.com/harvard-edge/cs249r_book">CS249r Book</a> is the open learning stack for AI systems engineering. It includes the textbook source, TinyTorch, hardware kits, and upcoming co-labs.</p><p>2. <a href="https://github.com/zlab-princeton/llm-pruning-collection">LLM Pruning Collection</a> is a collection of various LLM pruning implementations, training code for GPUs &amp; TPUs, and an evaluation script.</p><p>3. <a href="https://github.com/python/cpython">CPython</a> is Python version 3.15.0 alpha 3.</p><p>4. <a href="https://github.com/OpenBB-finance/OpenBB">OpenBB</a> is the open-source toolset for integrating proprietary, licensed, and public data sources into downstream applications.</p><h3>Top Papers of The Week</h3><p>1. <a href="https://arxiv.org/abs/2512.16093">TurboDiffusion: Accelerating Video Diffusion Models by 100&#8211;200 Times</a></p><p>TurboDiffusion accelerates end-to-end video diffusion generation by 100&#8211;200x while maintaining video quality. The framework speeds attention with low-bit SageAttention and trainable Sparse-Linear Attention, compresses sampling steps via rCM-based step distillation, and applies W8A8 quantization to model parameters and activations. Experiments on multiple Wan2.x I2V and T2V models confirm the speedups on a single RTX 5090 GPU.</p><p>2. <a href="https://arxiv.org/abs/2512.24618">Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models</a></p><p>Youtu-LLM introduces a 1.96B-parameter lightweight language model pre-trained from scratch to cultivate reasoning and planning. The model uses a dense Multi-Latent Attention architecture, STEM-oriented vocabulary, and a 128k context window. 
Researchers apply a &#8220;Commonsense-STEM-Agent&#8221; curriculum over ~11T tokens and scalable agentic mid-training, enabling state-of-the-art agentic performance among sub-2B models on general and agent-specific benchmarks.</p><p>3. <a href="https://arxiv.org/html/2512.24601v1">Recursive Language Models</a></p><p>Recursive Language Models aim to break the usual trade-off between context length, accuracy, and cost in large language models. Instead of forcing a model to read a giant prompt in one pass, RLMs treat the prompt as an external environment and let the model decide how to inspect it with code, then recursively call itself on smaller pieces. Across S-NIAH, BrowseComp-Plus, OOLONG, and OOLONG Pairs, RLM variants of GPT-5 and Qwen3-Coder improve accuracy and F1 over direct model calls, retrieval agents such as CodeAct, and summarization agents.</p><p>4. <a href="https://arxiv.org/abs/2512.24617">Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space</a></p><p>The paper introduces Dynamic Large Concept Models, which shift computation from individual tokens to a learned concept space, addressing non-uniform information density in language. 
DLCM discovers variable-length concepts, defines a compression-aware scaling law that separates token capacity, concept-reasoning capacity, and compression ratio, and uses a decoupled &#956;P to enable stable training.</p><h3>Who&#8217;s Hiring in AI</h3><p><strong><a href="https://jobs.towardsai.net/job/microsoft-corporation-research-intern-msrc-ai-security-research-aaz4">Research Intern&#8202;&#8212;&#8202;MSRC AI Security Research @Microsoft Corporation (Cambridge, UK)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/cloudwalk-junior-product-designer-ai-driven-aads">Junior Product Designer&#8202;&#8212;&#8202;AI Driven @CloudWalk (Remote, Brazil)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/entrust-intern-ai-developer-6mkf">Intern AI Developer @Entrust (Shakopee, MN, USA)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/dataiku-fullstack-software-engineer-core-ne5x">Full-stack Software Engineer&#8202;&#8212;&#8202;Core @Dataiku (Remote/France)</a></strong></p><p><strong><a href="https://jobs.towardsai.net/job/ciandt-job-26613-junior-data-developer-eqz3">Junior Data Developer @CI&amp;T (Remote)</a></strong></p><p><em>Interested in sharing a job opportunity here? Contact <a href="mailto:sponsors@towardsai.net">sponsors@towardsai.net</a>.</em></p><p><em>Think a friend would enjoy this too? <a href="https://newsletter.towardsai.net/">Share the newsletter and let them join the conversation.</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Start Building AI Projects This January - Live Cohort Kick-Off This Sunday]]></title><description><![CDATA[A monthly live session that helps you choose the right learning path, start hands-on AI projects, and make real progress this month]]></description><link>https://newsletter.towardsai.net/p/start-building-ai-projects-this-january</link><guid isPermaLink="false">https://newsletter.towardsai.net/p/start-building-ai-projects-this-january</guid><dc:creator><![CDATA[Towards AI]]></dc:creator><pubDate>Fri, 02 Jan 2026 16:36:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZBHF!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea4e29a-6b40-4b9a-9a98-00d0f6550a2e_512x512.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The hardest part of learning AI isn&#8217;t effort.<br>It&#8217;s turning curiosity into <strong>consistent, hands-on progress</strong>.</p><p>That&#8217;s what our monthly cohort kick-off is designed to solve.</p><p>This Sunday, the <strong>January cohort</strong> kicks off with a live session led by our CEO, Louie Peters. 
The session walks through how learners actually use Towards AI to move forward: how they choose a path, what to work on first, and how people go from concepts to <strong>real projects</strong> instead of staying stuck in tutorials.</p><p>If you joined recently, this is your on-ramp.<br>If you&#8217;re still exploring and deciding whether to join, you&#8217;re welcome to attend and see how the learning paths work before committing.</p><p>By the end of January, some learners will be:</p><ul><li><p>building AI apps in Python instead of just reading about them</p></li><li><p>working with prompting, RAG, agents, and evaluation in structured projects</p></li><li><p>shipping systems they can explain, demo, and build on</p></li></ul><p>Others will still be bookmarking posts and waiting for the &#8220;right time.&#8221;</p><p>The cohort kick-off happens <strong>once a month</strong>, and January&#8217;s session is this Sunday.</p><p>&#128073; <strong>Join the January cohort kick-off here: <a href="https://calendly.com/taipartnerships/llm-developer-course-cohort">https://calendly.com/taipartnerships/llm-developer-course-cohort</a></strong></p><p>The session is open to anyone exploring or newly joining Towards AI.</p>]]></content:encoded></item></channel></rss>