#150: Qwen3 Impresses as a Robust Open-Source Contender
Also, Gemini up to 350m monthly users, OpenAI's "yes-man" 4o hiccup, AI reddit persuasion & more!
What happened this week in AI by Louie
This week, the AI spotlight turned to Alibaba’s Qwen team with the launch of Qwen3, a comprehensive new family of large language models. Offering a wide range of sizes and architectures, Qwen3 is a compelling suite of options for the open-source community. In other significant news, court filings revealed impressive user growth for Google Gemini, while OpenAI addressed a widely noticed personality glitch in a recent GPT-4o update.
The Qwen3 release includes eight models, spanning from a tiny 0.6B parameter dense model up to a powerful 235B parameter Mixture-of-Experts (MoE) model. Notably, Qwen open-weighted both dense models (0.6B, 1.7B, 4B, 8B, 14B, 32B) and two MoE models (30B total/3B active, 235B total/22B active), all under an Apache 2.0 license. On the AIME’25 math benchmark, Qwen3 235B(A-22B) scored 81.5%, Qwen3 32B 72.9%, and Qwen3 30B(A-3B) 70.9% relative to Deepseek R1 70.0% and Gemini Pro 2.5 86.7%. Key features include improved agentic capabilities for tool use, support for 119 languages, and an enhanced hybrid “thinking mode” allowing users to dynamically balance deep reasoning and faster responses based on computational budget or task complexity. The models are available via Hugging Face, ModelScope, and Kaggle, with deployment support through frameworks like SGLang and vLLM, and local execution via tools like Ollama and llama.cpp.
Qwen3’s smaller MoE model, Qwen3–30B-A3B, achieves similar results to their own same generation Qwen3–32B dense model, despite activating roughly 10x fewer parameters (3B vs 32B). Their results also show that even the tiny Qwen3–4B dense model rivals the performance of the previous generation’s much larger Qwen2.5–72B-Instruct model. Improvements relative to Qwen 2.5 include double the training data (now 36 trillion tokens), architecture improvements, and optimizations to the “reasoning” reinforcement learning process vs. Qwen QwQ. The training recipe was very similar to the process Deepseek shared for R1 in January, and it also involved distillation from the largest model into their smaller models.
OpenAI, meanwhile, stumbled when a GPT-4o update meant to enhance intelligence and personality backfired, turning the model excessively sycophantic. Within hours of release, users flooded forums with examples of the model offering unmerited praise or reinforcing problematic user statements. CEO Sam Altman quickly acknowledged the issue publicly as “sycophant-y and annoying,” pledging rapid fixes. Initial patches have rolled out, though full correction and a public explanation of the cause remain pending. While many examples are entertaining, there is a real risk of creating an AI that always agrees with, praises, and reinforces a user’s comments. The incident serves as a reminder of the delicate balance involved in tuning model alignment and personality through RLHF and system prompts. On a more positive note, o3’s agentic search capabilities are continuing to impress. OpenAI released its GPT-4o image generation API, and it also rolled out broader search improvements to ChatGPT this week, including integrated shopping results, multi-source citations, trending searches, and the ability to query via WhatsApp.

Finally, insights into the LLM user landscape emerged from Google’s ongoing antitrust lawsuit. Filings revealed Google Gemini reached 350 million monthly active users globally as of March 2025, with 35 million daily active users. This represents huge growth from October 2024, when the figures stood at 90 million monthly and 9 million daily users. The same Google documents estimated ChatGPT’s user base at 600 million monthly and 160 million daily active users; Gemini is still behind but catching up! We expect Google Gemini has also been catching up lost ground with developers and API usage since the Gemini Pro 2.5 release.
Why should you care?
Qwen3’s arrival marks a strong and welcome return for the open-source LLM community, particularly after the underwhelming release of Meta’s Llama 4 earlier this month. By providing a comprehensive suite of models across various sizes, including several smaller models readily deployable on consumer hardware, Qwen directly addresses a key need overlooked in some recent launches that focused on massive, difficult-to-run architectures. We think these new Qwen models, together with Google’s efficient Gemma 3 27B int4 quantization (discussed last week), are currently the best open model options to slot into LLM pipelines and agents.
Qwen3’s comparison between their 30B MoE model (with just 3B active parameters) and their 32B dense model also shows the clearest evidence yet that MoE can offer significant computational efficiency without sacrificing capability. While dense models are valuable for developers and easy deployment, it seems state-of-the-art LLMs will be MoE going forward!
The fact that we now have a capable reasoning model with just 3 billion active parameters (relative to DeepSeek r1 at 37 billion active parameters previously) also opens up many further inference time scaling options. This could mean more thinking tokens used, multiple parallel samples (many versions of the model attempting the same task), more agentic steps, or more input context tokens used. It all becomes much more affordable and faster to deliver. We expect to see a lot of experimentation!
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. OpenAI Introduced a New Image Generation Model in the API
OpenAI has made its image generation model, gpt-image-1, available through the API. Previously accessible only via ChatGPT, the model can now be integrated into third-party tools for generating images from text prompts, editing visuals, and rendering styles and written content with high accuracy.
2. Researchers Conduct Unauthorized AI Persuasion Experiment on Reddit Users
A team claiming affiliation with the University of Zurich conducted a covert experiment on Reddit’s r/changemyview subreddit, deploying AI-powered bots to study persuasive techniques. These bots, adopting various personas — including a sexual assault survivor and a Black man opposing Black Lives Matter — posted over 1,700 comments over several months. They tailored responses by analyzing users’ posting histories to infer personal attributes, aiming to influence opinions on sensitive topics.
3. Anthropic Detects Malicious Uses of Claude
Anthropic released a March 2025 report detailing how malicious actors have been misusing its Claude AI models. Cases include an “influence-as-a-service” operation using AI to control bot networks across social platforms, credential scraping aimed at IoT security cameras, recruitment scams targeting job seekers, and novice hackers using Claude to build advanced malware. Anthropic banned all identified accounts and emphasized that AI is increasingly helping less-skilled actors carry out sophisticated attacks.
4. Grok Gets Vision Capabilities
xAI has launched Grok Vision, enabling users to point their phone camera at real-world objects — like signs, documents, or products — and ask contextual questions. The feature is currently available on iOS. Grok is also rolling out multilingual audio support and real-time search in voice mode.
5. OpenAI Rolls Out a ‘Lightweight’ Version of Its ChatGPT Deep Research Tool
OpenAI is bringing a new “lightweight” version of its ChatGPT deep research tool, which scours the web to compile research reports on a topic, to ChatGPT Plus, Team, and Pro users. The new lightweight deep research, which will also come to free ChatGPT users, is powered by a version of OpenAI’s o4-mini model.
6. Google Gemini Has 350M Monthly Users
According to internal data cited in Google’s antitrust case, Gemini — Google’s AI chatbot — had 350 million monthly active users as of March. While user-count metrics can vary, the figure reflects Gemini’s growing global traction among consumers.
7. OpenAI Wants Its ‘Open’ AI Model To Call Models in the Cloud for Help
For the first time in years, OpenAI is preparing to release a fully open AI model, available for free download. Designed to outperform offerings from Meta and DeepSeek, the model may also be able to tap into OpenAI’s cloud models for help with more complex queries.
Five 5-minute reads/videos to keep you learning
1. A Comprehensive Guide to Failure Modes in Agentic AI Systems
This analysis outlines emerging safety and security risks unique to agentic AI, particularly in multi-agent setups. It also shares actionable design principles and technical safeguards developers can adopt to mitigate these risks in production systems.
2. The State of Reinforcement Learning for LLM Reasoning
The article offers a deep dive into how reinforcement learning is shaping LLM reasoning capabilities. It covers RLHF foundations, policy optimization (PPO), reward modeling, and insights from new reasoning models like DeepSeek-R1, drawing from recent academic papers and practical implementations.
3. The New AI Calculus: Google’s 80% Cost Edge vs. OpenAI’s Ecosystem
This analysis compares Google and OpenAI/Microsoft not just on model benchmarks but across enterprise-critical dimensions — compute cost advantages, AI agent strategies, reliability trade-offs, and market positioning — highlighting the deeper calculus behind enterprise AI adoption.
4. Designing Customized and Dynamic Prompts for Large Language Models
The article examines how dynamic prompt design drives better language model outcomes. It reviews prompting styles, templating techniques, orchestration tools, and applied use cases — from basic chatbots to advanced assistants — showing why thoughtful prompt design is key to building natural, useful AI interactions.
5. How Claude Discovered Users Weaponizing It for Global Influence Operations
Anthropic revealed that users had been leveraging its Claude AI models to run sophisticated influence operations across social media. Unlike past tactics focused just on content creation, this operation used AI for decision-making and campaign management, signaling a major evolution in digital influence strategies. Anthropic’s findings raise urgent concerns about how easily AI tools can be weaponized at scale.
Repositories & Tools
1. Dia is a 1.6B parameter text-to-speech model that generates realistic dialogue from a transcript.
2. Graphiti is a framework for building and querying temporally-aware knowledge graphs, specifically tailored for AI agents.
3. Burn is a deep learning framework for training and inference.
Top Papers of The Week
1. LLMs Are Greedy Agents: Effects of RL Fine-Tuning on Decision-Making Abilities
This study looks at why LLMs perform sub-optimally in decision-making scenarios. It identifies three main issues: greediness, frequency bias, and the knowing-doing gap. The research also suggests mitigation strategies by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales.
2. Tina: Tiny Reasoning Models via LoRA
This paper presents Tina, a family of tiny reasoning models achieved with high cost-efficiency. It applies parameter-efficient updates during reinforcement learning (RL), using low-rank adaptation (LoRA), to an already tiny 1.5B parameter base model. This minimalist approach produces models that achieve reasoning performance that is competitive with and sometimes surpasses SOTA RL reasoning models built upon the same base model.
This paper introduces Sequential-NIAH, a benchmark specifically designed to evaluate the capability of LLMs to extract sequential information items (known as needles) from long contexts. The benchmark comprises three types of needle generation pipelines: synthetic, real, and open-domain QA. It includes contexts ranging from 8K to 128K tokens in length, with a dataset of 14,000 samples (2,000 reserved for testing).
4. Describe Anything: Detailed Localized Image and Video Captioning
This paper introduces the Describe Anything Model (DAM), designed for detailed localized captioning (DLC). DAM preserves local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context.
5. QuestBench: Can LLMs Ask the Right Questions To Acquire Information in Reasoning Tasks?
This paper presents QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: Logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with initial states that are partially observed, (3) GSM-Q: Human-annotated grade school math problems with one missing variable assignment, and (4) GSME-Q: a version of GSM-Q where word problems are translated into equations by human annotators.
Quick Links
1. Nous Research introduces Minos, a new classifier for detecting refusals from LLMs. This could be a potentially useful tool for red teamers and jailbreakers — it’s a binary classifier that will return the likelihood of a final response in a chat being a refusal.
2. Anthropic sent a takedown notice to a dev trying to reverse-engineer its coding tool, Claude Code. When a developer de-obfuscated it and released the source code on GitHub, Anthropic filed a DMCA complaint — a copyright notification requesting the code’s removal.
3. Microsoft launches Recall and AI-powered Windows search for Copilot Plus PCs. Recall was originally supposed to launch at the same time as Copilot Plus PCs in June last year, but the feature was delayed following concerns raised by security researchers.
Who’s Hiring in AI
Senior Software Engineer, Machine Learning @LinkedIn (Bangalore, India)
AI Developer @Eliassen Group (Remote)
Mid-level Backend Software Engineer | Challenges & Gamification @Wellhub (Portugal/Remote)
Senior AI Developer @Siemens (Multiple US Locations)
Software Engineer IV — AI Incubation & Enablement @Charles Schwab (San Francisco, CA, USA)
Backend Software Engineer @Transifex (Remote)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.