#146: What does GPT-4o's Native Image Gen Mean for Art, Artists, and AI Adoption?
Also, Gemini Pro 2.5 dominates benchmarks, xAI merges with X, DeepSeek-V3 update & more.
What happened this week in AI by Louie
This week, OpenAI’s native image generation with GPT-4o within ChatGPT captured viral attention, reminiscent of the buzz around the original ChatGPT launch and DeepSeek’s R1 model. We noted recently that Gemini Flash was the first to integrate native image capabilities. However, 4o images now offer much stronger prompt adherence and a strong ability to edit, combine uploaded images, and iteratively refine designs within the chat interface. This addresses a common frustration with traditional diffusion-based image generation, which often feels like pot luck: models struggle to grasp the full nuance of a prompt, start each generation afresh, and make consistent iteration for storyboards or styled graphics difficult. Native generation offers much tighter control, better prompt following, and image consistency.
Not to be overshadowed, DeepMind also had a major release with Gemini Pro 2.5, currently freely available via its API and chatbot. This reasoning model rose to the top of most benchmarks. It took the top spot on the LMArena leaderboard with a score of 1443, ahead of Grok 3 (1404); GPT-4o also saw a substantial incremental update this week, climbing from 6th to 2nd (+32 to 1406). Gemini made big jumps in coding, previously a weak point, including rising to the top of Cognition Labs’ Agentic Coding test with 74% vs. Sonnet 3.7 at 72% and GPT-4.5 at 65%.
Gemini models really shine in long-context tasks, helpful for reviewing full code bases, large sets of professional reports, or advanced RAG pipelines. Pro 2.5 shows great scores on the MRCR test — a proxy for understanding and multitasking over long contexts — scoring 83.1% at 1M tokens and 94.5% at 128k, recovering from a dip with Pro 2.0 (74.7%) while significantly outpacing rivals like GPT-4.5 (64.4%). At Towards AI, we’re exploring LLM workflows involving very long-context RAG, either retrieving many chunks or caching mini datasets. While most models’ intelligence degrades sharply beyond ~10k tokens or so on complex multi-step tasks, Gemini Pro models have been best, though still imperfect. Hopefully, the intelligence and MRCR boost in 2.5 will improve performance for these demanding applications.
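The long-context RAG pattern described above can be sketched in a few lines: instead of retrieving a handful of small chunks, you greedily pack many ranked documents into one large prompt, trimmed to a rough token budget. This is a minimal, model-agnostic sketch; the budget sizes, the chars-per-token heuristic, and the prompt wording are all illustrative assumptions, not any particular provider’s API.

```python
# Minimal sketch of long-context RAG prompt packing.
# Assumes `chunks` is already ranked by relevance; a real pipeline
# would use the model's tokenizer instead of a chars-per-token guess.

def pack_context(chunks, question, max_tokens=1_000_000, chars_per_token=4):
    """Greedily concatenate chunks until a rough token budget is hit."""
    budget_chars = max_tokens * chars_per_token
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget_chars:
            break  # stop once the next chunk would overflow the budget
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n---\n\n".join(selected)
    return (
        "Use the documents below to answer.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = pack_context(["doc one text", "doc two text"], "What changed?",
                      max_tokens=50)
```

With a 1M-token window, the interesting design choice shifts from aggressive chunk filtering to ordering and deduplication, since far more candidate material fits in a single call.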
What does native image generation mean for art and artists?
The mixed reception to ChatGPT 4o image generation makes it clear that AI art is a sensitive topic, and many artists truly hate AI.
This is understandable, as it does bring major problems. First, copyright laws weren’t designed for how AI learns. AI is inspired by the styles and features it sees during training, but it rarely memorizes or copies images exactly. Yet this mass-produced inspiration feels different from the inspiration that’s always been a core part of creativity in human art, music, or writing; many artists view AI art more as copying or stealing. Second, some categories of artists’ and designers’ work are now at real risk from AI. Third, some see a risk of trivializing or diluting artistic merit and meaning.
At the same time though, AI image generation has big positives for non-artists; it means democratizing access to some level of art and graphic design. It helps those who 1) have creative ideas but weren’t born with artistic “execution” talent or haven’t had the chance to learn, 2) don’t have the time to paint all their ideas and express their creativity visually, or 3) can’t afford to commission artists or hire graphic designers. AI art can also just be a lot of fun to play with.
With native LLM image generation, it feels like we’ve now crossed a key threshold for an AI corporate use case for the first time. With obedient instruction following, image merging including brand graphics, and easy collaborative iteration, image generation is suddenly sufficiently flexible, reliable, and easy to use out of the box; its adoption in many workflows is now simply a no-brainer. For some marketing graphics, branding, and similar tasks, it feels hard to justify paying perhaps 100 times more for a fully human design, which adds delays of weeks instead of seconds to the design iteration process, even if the AI image still needs artistic taste, inspired concepts, and many more iterations to get good results.
I don’t think AI will replace human art. People will always want art that creates a human connection and emotional response. Still, AI will produce countless new images that we previously didn’t have time, talent, or money to create. And some human art and design commissions will surely move to AI.
It’s not uncommon for industries to simply have to adapt to an inevitable reality; think of how musicians today make most of their money from live performances (with the human connection) while growing their fan base via mass-duplicated audio files in Spotify streams, after the initial disruption from Napster.
Art has also always been a competitive industry with a Pareto-like curve for winners and those who can truly turn it into a career. So art has never only been about money. Many people enjoy art purely as a hobby, even if only family or friends ever see their creations.
Valid ethics and moral questions of training without the artist’s consent aside — at this stage, it already feels too late for artists to fully keep their work out of AI training datasets or user prompts. So, rather than focusing solely on resisting AI, it seems more fruitful to focus on adapting to the new reality and finding ways to compensate artists fairly if an AI generation uses their distinct style or brand. It’s also important to work on ways to prove human provenance.
I think many more artists should embrace AI tools themselves. They can create new mediums of art using their own style or brand in ways that weren’t possible before. Artists can also set up clear ways for people to pay for the use of their style with permission. Make custom AI apps or licenses yourself so your fans can support you!
Artists so far are instead among the loudest and often most vitriolic critics of AI. This makes sense, but software engineers face similar near-term AI disruption without the same level of anger. Just as with software, many artists already embrace AI tools. Skilled software engineers and artists produce far better results using AI than those without this experience or talent. The divergent response is perhaps because artists more firmly tie their talent to their own identity, while software engineers’ professional work may be less intertwined with theirs. Developers may also be more used to an open-source culture and to building with other people’s code.
Why should you care?
Similar patterns will play out in many other careers in the coming years, perhaps first with software. Different factors will be at play for each field, but it will always be a complex picture. In many fields, AI will be extremely unpopular. Some fields will be much quicker and more willing than others to adapt their workflows to integrate AI.
Looking at the potential for AI more broadly, I think a net positive societal impact from AI is most likely, but it is still far from certain. Right now, even in wealthy countries like the US, more than 99% of people earn below the point at which additional money stops improving quality of life. Most young people can’t even afford their own homes or to start a family. The world is still filled with poverty, disease, premature death, unhappiness, war, precariously balanced societal existential risks, inevitable personal aging, etc. There is a lot that still needs to be fixed, and perhaps only AI can solve some things in a time frame that makes a difference to those alive today.
Most people today also never fully explore their ambitions or creative ideas because of random and unfair factors at birth: wealth, intelligence, or talent in writing, art, singing, or other areas. Leveraging AI tools to democratize and make up for this shouldn’t have to mean communism or AI “sloppification,” though; those with existing talent should still rise to the top, and diversity of personal expression should be enhanced by these AI tools rather than reduced.
In the near term, AI will mostly make people’s existing jobs quicker and easier. It is a powerful tool that gets the best results from people with prior experience and talent in their field.
In the long-term possibility that this all progresses to a primarily automated economy, obviously, it is essential one way or another to make sure everyone’s quality of life increases from AI’s abundance, scientific breakthroughs, deflationary pressures on living costs (lowering the wealth vs. quality of life saturation point) and its democratization of creative mediums — instead of decreasing from unemployment, mental health issues, higher inequality or new risks.
If we eventually automate away most careers, more people will have to learn to accept that their self-worth and meaning don’t come from their jobs, economic output, or winning gold medals against AI. Instead, meaning can derive from authentic relationships, good deeds, personal growth, community contribution, creativity, curiosity-driven exploration, and activities that bring joy independent of external validation. Just as society since the 1800s gradually learned that a woman’s self-worth didn’t come from 100-hour work weeks of manual laundry, cooking, or keeping her husband happy, people today must learn to see value beyond their current economic or cultural role; this doesn’t have to be a bad thing. Many already believe this and are content without winning gold medals or Nobel prizes. Nevertheless, we still need better plans to support people’s mental health during such rapid changes to careers and lifestyles. Earning a good salary also remains key to a family’s quality of life, and managing this is a particularly acute risk for the first careers to see major disruption from AI.
For sure, even the best-case scenario for AI will have many negative consequences. Yet the rapid march of AI now seems inevitable — perhaps it has even been inevitable (sooner or later) since the accelerated breakthroughs of Turing, Von Neumann, and Shannon in the 1940s. I think it’s wisest to take part actively in AI — not just to avoid getting left behind but also to do what you can to steer AI towards the best path and better future.
I don’t know the best solutions to all these long-term issues. For now, the current state of needing essential human expertise in the loop of AI-enhanced workflows may go on for a very long time, a medium amount of time, or a short time. This depends on just how quickly AI abilities progress from here and what dead ends they hit.
For now, I’d say the single best thing to do is focus on this most immediate and generally softer first wave of AI disruption. Start testing these models on your daily tasks and try to truly understand them. Using LLMs well and responsibly is highly valuable for most non-physical work tasks even today; using them badly is both a safety and a reputation risk for your work. As you use the models more, you will build an intuition for where they work well and where they don’t. You will also get a gauge of the pace of progress and be best prepared for the models that are coming next. People skilled at combining the best of AI’s abilities with their own human skills and domain expertise will prosper. However, you have to be agile and prepared for change.
Those dogmatically resistant to adopting AI or adapting are at greatest risk. So persuade your friends, families, and colleagues to be in the group that benefits!
— Louie Peters — Towards AI Co-founder and CEO
This week, we’re also super excited to share a first for us — we’ve teamed up with DecodingML, The AI Edge Newsletter, and LLM Watch for our very first guest posts, focused on optimization, cost reduction, and AI agents! And the timing couldn’t be better. With OpenAI launching its most expensive API yet ($150 per million input tokens and $600 per million output tokens), LLM costs are a bigger concern than ever. In these posts, we dive into one of the biggest challenges in AI today: keeping costs down without compromising performance. We break down the strategies top teams are using, uncover hidden operational expenses, and share a blueprint for optimizing LLM efficiency at scale. We’re also tackling a common challenge in AI — the confusion around AI agents. Even experts often grapple with varying definitions, so we’re breaking it down while also walking through how to build an email automation agent using Hugging Face’s SmolAgents library.
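At the quoted rates ($150 per million input tokens, $600 per million output tokens), per-call costs add up quickly. A minimal sketch of the arithmetic, using only the prices stated above (the 20k/2k token counts in the example are hypothetical):

```python
# Estimate the dollar cost of a single LLM call at the article's
# quoted rates: $150 per 1M input tokens, $600 per 1M output tokens.

def call_cost(input_tokens, output_tokens,
              in_per_million=150.0, out_per_million=600.0):
    """Linear pricing: tokens are billed pro rata per million."""
    return (input_tokens / 1e6) * in_per_million \
         + (output_tokens / 1e6) * out_per_million

# A hypothetical 20k-token prompt producing a 2k-token answer:
cost = call_cost(20_000, 2_000)  # $3.00 input + $1.20 output = $4.20
```

At these prices, a single moderately sized call costs dollars rather than fractions of a cent, which is why the cost-reduction strategies in the guest posts matter.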
Hottest News
1. Google AI Released Gemini 2.5 Pro Experimental
Google DeepMind has introduced Gemini 2.5, an advanced AI model that excels in reasoning, coding, and problem-solving. The Pro Experimental version tops benchmarks in math and science, supports a one million-token context window, and is now available in Google AI Studio and Gemini.
2. ChatGPT Introduces Native Image-Generation with GPT-4o
OpenAI’s GPT-4o has enhanced image generation by integrating it into language models for photorealistic outputs. This tool offers precise image rendering and text blending. Trained on extensive text-image data, GPT-4o ensures contextual consistency and accuracy, supporting creative applications. Access is rolling out to various ChatGPT users and will soon be available via API.
3. Elon Musk Says xAI Has Acquired X in a Deal That Values the Social Media Site at $33 Billion
Elon Musk has merged xAI with X. This is likely to more deeply integrate advanced AI capabilities into the social platform’s 600 million+ user base while giving xAI cleaner access to huge volumes of training data from X. The deal values xAI at $80 billion and X at $33 billion in equity value, plus $12 billion in debt. Given that X owned a stake in xAI, the combined company’s valuation will likely be closer to $100bn.
4. OpenAI Adopts Anthropic’s MCP
In a post on X, OpenAI CEO Sam Altman said that OpenAI will add support for Anthropic’s Model Context Protocol, or MCP, across its products, including the desktop app for ChatGPT. MCP is an open-source standard that makes it easier to build two-way connections between AI models and external data and tools.
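For a sense of what this standard looks like in practice: MCP is built on JSON-RPC 2.0, and a client asks a server to run a tool with a request roughly like the sketch below. The tool name and arguments here are hypothetical, and this is an illustrative message shape rather than a transcript from a real MCP server.

```python
import json

# Sketch of an MCP-style tool-invocation request (JSON-RPC 2.0).
# "search_docs" and its arguments are made-up illustrative values.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",              # MCP's tool-invocation method
    "params": {
        "name": "search_docs",           # hypothetical tool exposed by a server
        "arguments": {"query": "release notes"},
    },
}

wire = json.dumps(request)  # serialized form sent over the transport
```

Because every provider speaks the same message shape, a tool server written once can, in principle, serve Claude, ChatGPT, or any other MCP-capable client.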
5. DeepSeek AI Unveils DeepSeek-V3–0324
DeepSeek AI launched DeepSeek-V3–0324, a high-performance model released under the MIT license. It runs at over 20 tokens/sec on a 512GB M3 Ultra Mac Studio and ranks #2 on aider’s polyglot benchmark.
6. OpenAI’s Viral Studio Ghibli Moment Highlights AI Copyright Concerns
OpenAI’s new AI image generator can recreate artwork in the style of Studio Ghibli, raising copyright concerns. Users generate these images via text prompts, leading to legal questions about training AI models on copyrighted works. OpenAI’s and Google’s tools highlight a leap in AI capabilities, with legal clarity still pending.
7. Microsoft Adds AI-Powered Deep Research Tools to Copilot
Microsoft has unveiled two advanced AI agents, ‘Researcher’ and ‘Analyst,’ for Microsoft 365 Copilot. Researcher assists with complex, multi-step research tasks by integrating OpenAI’s deep research model and accessing third-party data from sources like Salesforce and ServiceNow. Analyst, based on OpenAI’s o3-mini reasoning model, processes raw data, executes Python code, and generates detailed reports, functioning like a skilled data scientist.
8. Qwen Releases the Qwen2.5-VL-32B-Instruct
Qwen has released Qwen2.5-VL-32B, a 32-billion-parameter model that enhances human preference alignment, mathematical reasoning, and image task accuracy compared to earlier models. Benchmark results show that it outperforms models like Gemma 3–27B. Users can deploy the 4-bit version efficiently on devices such as a 64GB Mac.
9. Runway Releases an AI Model for Media Generation
Runway AI launched its most advanced AI video generation model today, entering the next phase of competition to create tools that could transform film production. The new Gen-4 system introduces character and scene consistency across multiple shots — a capability that has evaded most AI video generators until now.
Five 5-minute reads/videos to keep you learning
1. How to Fine-tune and Run Gemma 3
This article explains how Google’s Gemma 3 models (1B to 27B parameters) integrate with Unsloth for efficient fine-tuning. It covers optimizations that reduce VRAM usage by over 60%, enable longer context lengths, and introduce dynamic 4-bit quantization, particularly for vision models.
2. Tracing the Thoughts of a Large Language Model
This article discusses Anthropic’s approach to AI interpretability by tracing how Claude processes information. It explores how the model forms internal representations, plans text generation (such as rhyming in poetry), and where its reasoning can be inconsistent. These findings help improve AI transparency and reliability.
3. Training and Finetuning Reranker Models with Sentence Transformers v4
This article walks through the latest updates in Sentence Transformers v4, focusing on a new training method for cross-encoder reranker models. It explains how these improvements enhance fine-tuning efficiency and model performance on specific datasets, with practical implementation details.
4. Open-Sora 2.0 Explained: Architecture, Training, and Why It Matters
This article explains the architecture and training pipeline behind Open-Sora 2.0, a video generation model trained on a $200,000 budget. It outlines its three-stage training approach, efficiency techniques, and how it compares to models like HunyuanVideo and Runway Gen-3 Alpha.
5. What is MCP by Anthropic? (Model Context Protocol)
This article discusses the Model Context Protocol (MCP), an open standard by Anthropic that enables AI systems to connect with external data sources in real-time. It explains how MCP allows AI assistants to retrieve and act on information dynamically, with examples like integrating Claude with GitHub for automated code management.
Repositories & Tools
1. VideoMind is a multimodal agent framework that enhances video reasoning by emulating human-like processes.
Top Papers of The Week
1. Qwen2.5-Omni Technical Report
This paper introduces Qwen2.5-Omni, a multimodal model capable of processing text, images, audio, and video while generating text and speech simultaneously. It achieves state-of-the-art performance on multimodal benchmarks using block-wise audio and visual encoders with TMRoPE position embedding. The Thinker-Talker architecture enhances concurrent output, improving speech generation quality and robustness.
2. Defeating Prompt Injections by Design
This paper presents CaMeL, a security framework designed to defend against prompt injection attacks in language model agents. Inspired by software security principles, CaMeL secures data and control flows by extracting intended control logic as pseudo-Python code and executing it through a custom interpreter.
3. TxGemma: Efficient and Agentic LLMs for Therapeutics
This paper introduces TxGemma, a suite of specialized language models developed by Google DeepMind and Google Research to support therapeutic development. Available in 2B, 9B, and 27B parameter sizes, these models are fine-tuned from Gemma-2 on biomedical datasets, covering small molecules, proteins, nucleic acids, diseases, and cell lines.
4. Video-R1: Reinforcing Video Reasoning in MLLMs
This paper presents Video-R1, a model designed to improve video reasoning in multimodal large language models. It introduces the T-GRPO algorithm for enhanced temporal modeling and incorporates image-reasoning data into training. Video-R1 achieves 35.8% accuracy on VSI-Bench, outperforming proprietary models like GPT-4o. The datasets and code are publicly available.
5. Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation
This paper introduces ReaRec, a framework for improving sequential recommendations through multi-step reasoning. It employs Ensemble and Progressive Reasoning Learning methods to enhance user representations and boost prediction accuracy. Experiments show that ReaRec improves recommendation performance by 30%-50%, highlighting its potential for advancing recommendation systems.
Quick Links
1. Anysphere, the company behind the AI coding tool Cursor, grew annualized recurring revenue 4x since November to $200m and is raising $625m at a ~$10bn valuation (also up 4x). Cursor is often used with the user’s LLM API keys and most commonly is used with Anthropic’s Sonnet models. Some developers report spending over $5,000 per year via LLM tokens used on Cursor.
2. OpenAI intends to release its first open language model since GPT‑2 in the coming months. OpenAI plans to host developer events to gather feedback and, in the future, demo prototypes of the model.
3. Poe, Quora’s chatbot app, launched one of its most affordable subscription options, priced at just $5 per month. In addition, the company introduced its highest-priced plan at $250 per month, designed for users who need to send a large volume of messages on Poe. Poe allows users to utilize several AI-powered bots, including DeepSeek-R1, GPT-4o, Claude 3.7 Sonnet, o3-mini, ElevenLabs, and more in one place.
Who’s Hiring in AI
Software Engineer, Scalability and Capability @Anthropic (Multiple US Locations)
AI/ML Subject Matter Expert (SME) @Koniag Government Services (Remote)
Software Engineer Intern @Bloomreach (Slovakia, Czech Republic)
AI Developer @CAI (Hybrid/India)
Technical Program Manager @LivePerson (Remote/Germany)
Back End Engineer, AI/ML @Pendo.io (Sheffield, UK)
INTL India — LLM AI Engineer @Insight Global (Woonsocket, RI, USA)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.