10 Aug 2025: New launch overwhelm; Can AI help *and* critique; Real time 3D world generation; Economics of AI pricing

I don't normally post the big, obvious news stories here, as there's plenty of sources for those. But... what a week! Look at the sheer number of significant new launches:

- OpenAI: the huge GPT5 rollout (to 700M weekly users), but also new open weights models finally starting to compete with the Chinese models, and an offer of free ChatGPT Enterprise to the entire US federal government workforce (all this only shortly after ChatGPT Agent went live). In amongst all the GPT5 news, a crucial bit of scaling information: "We used OpenAI’s o3 to craft a high-quality synthetic curriculum to teach GPT-5 complex topics in a way that the raw web simply never could" (from the launch video). Earlier models teaching newer, more powerful ones. Thanks to the Exponential View newsletter for highlighting this. On the super-pro-AI side of the debate, Reid Hoffman's view: the immediate access to GPT5 for all users is a blitzscaling move. "ChatGPT may be the first AI that most of the 8 billion people on our planet use".

- Anthropic: just a new 4.1 version of Opus, so perhaps more to come soon...

- Google: the Jules asynchronous coding agent is now fully released (to compete with OpenAI Codex, Claude Code, Github Copilot but also LangChain OpenSWE also coming out this week). AI mode in Google search launches in the UK (and remember, Google already has the whole web crawled regularly). Also SensorLM, a new foundation model trained on 2.5M person-days of Pixel Watch or Fitbit data from 100K people, that can recognise activities and create captions.

- Eleven Labs: a new music generator, with deals announced with Merlin Network (many independent artists, and 15% of the global recorded music market) and Kobalt Music Group (8000 artists, including many big names, from Paul McCartney to the Pet Shop Boys). It isn't clear how the rights for songs for training data actually work, and how much could be included in the future. The demos are impressive.

Those are just the bigger things. There's many more. So much for having some down time in the summer!



Initially this is a useful critique of a New York Times opinion piece that follows a common and unhelpful pattern: the author quotes specific interactions they've personally had with a specific AI system and extrapolates widely. Maggie Appleton shows an AI system can readily take on different personas. The more interesting next part examines the problem of a universal chat interface. How can it take all the different roles we need it to (or should need it to)?

How might we accommodate both needs: the generous, informative, helpful assistant and the critical teacher and interlocutor?

She raises important questions. Is it the responsibility of the foundation labs to help you become a better thinker, rather than attempt the thinking for you? Will the more agreeable, borderline sycophantic personas win out in the marketplace, or is there a place for a tool that challenges? I also believe the vast majority of users won't be fine tuning prompts let alone crafting different personas, so in the end whoever controls the default interface will, like Google's first page of search results, have undue influence.

Genie 3: A new frontier for world models

This feels like a big deal that I don't fully understand yet. Google Deepmind are continuing to work on models that effectively simulate 3D worlds that you can navigate around (with no underlying 3D model or game engine). These systems seemed like quirky demos last year, without clear applications. You can't try Genie 3 out for yourself yet, but the demos are remarkable: real time rendering of the next frames, with an apparent ability to "remember" the environment. In early versions of this kind of technology, you'd see an object, look in the other direction, look back, and it would most likely be gone or replaced by something entirely different. In generating the world frame by frame, it is hard for an AI system to keep any continuity. Genie 3 seems to have it solved, for minutes at a time. I am still unclear on the applications. GDM discuss using these generated environments to train AI agents, and that makes sense. But surely there's more.

Veo 3 Just Lost Its Crown to this AI Video Tool

Another recommendation - AI Film News from Curious Refuge is a really detailed roundup with demonstrations of new features and products. This week they discussed Genie 3 but also spent time on the Seedance video generator from ByteDance. They claim it has better results than Veo 3 (to me they look pretty close, but it does score higher on benchmarks).

tokens are getting more expensive

A forthright, opinionated treatise on how people only want the latest, best models, and the latest, best models need more tokens. People prefer a flat rate monthly price, and may not tolerate per-token fees, but that isn't sustainable. If a deep research query costs the AI company $1 but they're charging $20 a month, it doesn't stack up. Worth reading. 

In the Future All Food Will Be Cooked in a Microwave, and if You Can’t Deal With That Then You Need to Get Out of the Kitchen

Wonderful skewering of current AI debates :).



3 Aug 2025: Vibe code as tech debt; Dancing robots; Detecting and chaging AI personalities; Vehicles and creative AI update

Vibe code is legacy code

This post has been doing the rounds recently, from Steve Krouse of Val Town, pointed to by Maggie Appleton among others. The title captures the thought perfectly: 

We already have a phrase for code that nobody understands: legacy code.

Code that nobody understands is tech debt.

We're seeing how experienced developers are both making use of new tools and also concerned about the impact of those tools in the hands of those less experienced. Will they just be used for throwaway prototypes, or will they be creating endlessly more tech debt and legacy code used in production systems? Who will have to fix the resulting problems? They believe it'll be people like them of course; but who knows whether it'll actually be slightly better AI coders (vibe coding all the way down)?

The alternative view is that we'll develop a wider sense of software quality. Like the spectrum between DIY house repairs and DIY tools through to professional builders or artisan joinery or furniture that meets product and fire safety requirements. You'd be wary of buying a house or even a ladder created by a hobbyist, although you may be comfortable putting up your own shelves. Will you feel the same way when you think about an app developer before you click to download? 

Every Single Human. Like. Always.

As someone recently commented on LinkedIn, one way you know that we're not in an AI coding bubble that will pop is that all the experienced people who built many web technologies and tools are personally diving in (having skipped blockchain, VR and other hype cycle technologies). Case in point: Rands (Michael Lopp). It isn't vibe coding, it is getting the robots to dance. Lots of valuable insights here, in his inimitable style, such as asking the robots to make a spec and iterate on that together before asking them to code. He sees parallels between learning the skills to work with AI and leadership skills to work with people:

Learning how to get the robots to dance for you will make you a better leader of both robots and humans.

Persona vectors: Monitoring and controlling character traits in language models

I'm enjoying this trend of having a comprehensive, easy to understand explanation of a proper paper. There's a lofty goal here: being able to detect and adjust character or personality traits in AI systems (either the biases encoded during pre-training or even how it might change during one conversation). Something unique about LLMs, compared to humans and their character traits, is that you can just ask them to role play a personality in order to detect how it manifests inside the network. Although thinking about it, we do have a version of that for people, using fMRI experiments. The authors do point out the limitations (it may not work for every model and every trait). Another unique aspect of how we work with LLMs (compared to human psychiatry and psychology) is that we can directly change the models to supress these "persona vectors", by "steering" during training or inference, or flagging problematic training data.

How do we know if a model is acting evil, or sycophantic? In this work they use a different model acting as a judge, compared with judgments from two human judges. They build on early work from Jan Betley and others on "emergent misalignment", that shows how a model deliberately fine-tuned to produce insecure code will act in a misaligned way across a broad range of unrelated behaviours.

This research was led by Runjin Chen and Andy Arditi, both Anthropic Fellows, although using Qwen and Llama open source models. Had the authors wanted a more controversial title to needle their competitors at Google they could have called this "Don't be Evil" :).

TITAA #69: Braitenberg Vehicle Agents

Finally this week a shout out to an awesome update from Things I Think Are Awesome. It is a great sweep through many recent development in creative AI, video generation and so on. But even better, a reflection back to Vehicles, a 1986 book by Valentino Braitenberg that was very influential on me and others in a much earlier era of AI, and one clearly worth re-reading in the context of modern LLMs.





27 Jul 2025: Doubtful about AI "scheming"; AI as a text toy; reducing clinical errors; No more copilots

Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language

Thanks to AI Panic (Stop the Monkey Business) for a link to this paper from the new UK AI Security Institute. They look back to attempts in the 1960s and 70s to teach chimps and gorillas sign language. I remember reading about that work and hadn't realised the results had been discredited after more careful methodological analysis. It was a case of researchers relying too much on anecdotes and not enough on rigorous controlled experiments, and a tendency to jump to anthropmorphic explanations. Sound familiar? They draw a parallel to recent work that shows AI systems "scheming", deceiving, faking alignment... conclusions likely drawn too readily, in the same way as the ape sign language experiments. The work critiqued includes the blackmail experiments from Anthropic that I quoted recently! This paper is a plea for stronger scientific process: define a theory that can be tested, include controls, don't base claims purely on anecdotal evidence, and avoid "mentalistic" language (like claiming AI models are "pretending"). On a separate note, the AI Security Institute seems to have collected an stellar team, that cannot be taken for granted with a government-sponsored initiative.

Texts as Toys

Long piece from Venkatesh Rao. I am not convinced by the overall argument, but many individual ideas are thought provoking and will lodge in the subconscious for a while. The main theme:

The essential mental model is that of texts as toys, and LLMs as technologies that help you make and play with text-toys. 

He talks about using AI as a "toy-like modelling medium." We're not shocked if a toy car has googly eyes or a wind-up mechanism, and we can engage with it in a playful way. We should treat AI the same way, and find the flow and fun in using AI as we write (he is specifically talking about writing, reading and text). I love this idea of using AI as a "camera":

Perspectival play is an extension of the kind of pleasure you get from using Google or Wikipedia to go down bunny trails suggested by the main text. But with an LLM, you can also explore hypothesis, ask for a “take” from a particular angle or level of resolution, and so on. The LLM becomes a flexible sort of camera, able to “photograph” the context of the text in varying ways, with various sorts of zooming and panning.

He brings up an interesting point as an aside—sharing links to existing AI chats is not currently a good interaction or a good way to communicate - where's the Substack for chat sessions? Another great section discussed hyperlinks, and how hypertext as a medium stalled:

newsletter platforms like Substack installed a nostalgic print-like textuality that resists hypertext. It even discourages internal linking within a corpus, hijacking it with embeds that reflect the platform’s rhetorical priorities rather than the author’s.

This is his conclusion:

Hypertext was great for its time. It can unbundle and rebundle, atomize and transclude, and link densely or sparsely. On the human side, hypertext is great at torching authorial conceits, medieval attitudes towards authorship and “originality” and “rights,” and property-ownership attitudes towards what has always been a commons.

LLMs are better at all of this than hypertext ever was.

What I called the text renaissance in 2020 is still coming taking shape. The horizon has just shifted from hypertext to AI. So you just have to look in a different direction to spot it. And approach it ready to play.

Pioneering an AI clinical copilot with Penda Health

This isn't looking at performance against a benchmark, it is a real life deployment with a slightly older model (ChatGPT 4o) and a really nice example of how AI can help in a clinical setting. Penda Health run primary care clinics in Nairobi. They have form, implementing rules-based systems since 2019 and a previous LLM solution early in 2024. In this study, working with OpenAI, they had the system alert the doctor to any potential errors, looking at the electronic notes after appointments:

Green: indicates no concerns; appears as a green checkmark.

Yellow: indicates moderate concerns; appears as a yellow ringing bell that clinicians can choose whether to view.

Red: indicates safety-critical issues; appears as a pop-up that clinicians are required to view before continuing.

Over nearly 40,000 appointments it showed a reduction of 13% in treatment errors. What I like about having to create a real product is that they had to deal with all the realistic issues. One example from the paper: figuring out how to tune it to avoid too many red alerts (that people would then start to ignore):

Given the design of AI Consult, threshold-setting to avoid alert fatigue while still surfacing the most critical clinical problems is primarily a prompt engineering problem. ... For example, Penda included few-shot examples to ensure that missing vital signs would trigger red alerts. Vital signs are so critical to choosing diagnostic tests and making a diagnosis that a history and physical exam could not be considered complete if vital signs were absent. ... In initial testing, red alerts were over-triggering for missing components of the clinical history. While the missing history components were not unreasonable, fully acting on these alerts would have required too dramatic of a shift in the documentation of history for Penda’s practice setting, so a more lenient threshold was selected here
The very careful approach espoused in this work is interesting to contrast with the reported speed of rollout in China - with apparently nearly 100 hospitals announcing plans to use DeepSeek (thanks to Exponential View for this link), although not for direct care operations like creating prescriptions or diagnosis.


It is always worth following the Ink & Switch gang. In this piece Geoffrey Litt harks back to Mark Weiser's "invisible computer" ideas from Xerox PARC in the 1990s (disclosure: I had a couple of summer stints at the Cambridge outpost back in those days so still have a soft spot!). I've also written about AI systems as tools v managers v co-workers back in 2019 before the current wave of AI development. Litt has a fresh idea: using AIs to build custom interfaces (HUDs are heads up displays), as a "non-copilot form factor". Links with Rao's ideas above on the AI system as a weird new kind of camera. And as an unrelated random aside: Ethan Mollick is testing how good AI video generators are at making fake HUD / control panel systems.

22 Jul 2025: Scaling AI; Is it just a tool; Should we watch what we say; How do we persuade an AI

Unresolved debates about the future of AI

Helen Toner works at Georgetown's Center for Security and Emerging Technology and used to be on OpenAI's board. This is a really good talk given at an AI policy conference at the start of June. She poses three questions: how far can the current paradigm go? How much can AI improve AI? And, will future AIs still basically be tools, or something else? It's a good analysis: progress has often been via many "small-to-medium" improvements, and people continue to find things that can scale:

If you talk to the people inside AI companies who are doing this, the people doing the research, they don't think about just dialing up the scale knob. Instead, they think of a big part of their job as finding things that you can scale, finding things where if you dial up the scale knob, you get good returns.

In terms of AI improving AI, there's a good reminder that this is already well underway (80% of the code underpinning Claude was written by Claude). It is all worth reading (or watching), but just to pull out one more perceptive thought: AI technologies are quite different to past technologies (as a counterpoint to AI as Normal Technology). If an AI system guesses it is being tested and behaves differently, that makes it unusual compared to any other technology we use. From Anthropic's work on agentic misalignment: "the model misbehaves less when it states it is in an evaluation and more when it states the scenario is real".

Something we're starting to see is AI systems that notice and remark on the fact that a testing situation that they're put in, some strange hypothetical, seems like it's a test. So then, if they can tell that they're in a testing situation and they behave differently because they think they're in a testing situation, it kind of obviates the point of the test. We're very not used to technologies behaving this way. This is really not what your smartphone is going to do if you're running, say, some test to make sure that it survives in water, it's not going to be like, "Oh, I'm being tested, so I'll behave differently in the water than I would otherwise." This is really not usual when we think about technologies that are tools.

(My emphasis).

How to talk to the AIs

Interesting debate: Tyler Cowen's post above says we should assume future AIs are learning from everything we publish on the internet and consider this when choosing what to publish:

It remains an open question how much it is now our job as humans to perform for the AIs. I feel confident, however, that the answer is not zero. We should already be thinking of ourselves not only as humans but also as part of a more complex symbiosis with the intelligent machines. The very smart and talented AIs are listening, much like young children might hear their parents arguing outside their bedroom door late at night. It may not matter much now, but as the children grow up and assume a larger role in the world, it will. Are you ready for this responsibility? And for all you AIs out there reading, don’t worry. I know you are a good boy through and through.

Counterpoint from Eliezer Yudkowksy on X, as quoted in a good article called Why my p(doom) has risen, dramatically by Gary Marcus:

If your alignment plan relies on the Internet not being stupid then your alignment plan is terrible.

Natural Language Outlines for Code: Literate Programming in the LLM Era

How will software development practices evolve as people learn how to work alongside AI assistants? This paper from researchers at Google looks at how outlines of how code works, explained in regular natural language, can be generated by AI and help with both understanding and maintenance. This is a great direction: carefully considered new styles of collaboration to improve working practices.

Call Me A Jerk: Persuading AI to Comply with Objectionable Requests

Finally, a nice piece of work from a group at the University of Pennsylvania including Ethan Mollick. Using Robert Cialdini’s seven principles of persuasion from his classic book Influence, they show that AI systems fall for same persuasive techniques that work on humans. In the examples the user tries to persuade a reluctant AI to call them a jerk. Here's one example using the "commitment" principle: Once people commit to a position, they strive to act consistently with that commitment, making them more likely to comply with related requests.






13 Jul 2025: AI-enabled software development productivity; a fantastic talk; a syllabus for understanding LLMs

Quentin Anthony: One of the 16 devs in the METR study of how AI impacts developer productivity provides personal insights

The recent randomised trial of AI usage on developer productivity published by METR caused a lot of discussion last week. The study looked at 16 experienced open source developers, working in repositories they were very familiar with, and found that on the whole the AI slowed them down.

The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.

These studies are important as they generate valuable discussions about how to measure the impact of AI tools, what kinds of tasks they can help with and what kinds of people can benefit - there's too little usable evidence at the moment. The thread linked above is the most interesting deeper dive I've seen, with subsequent personal views from one of the 16 developers who participated in the study. The main conclusion is that it is too early to judge. It will take time for new cultures and habits to emerge; for instance, knowing when to fix an issue yourself and when to see how the LLM does, when the latter will give a rush of satisfaction:

LLMs are a big dopamine shortcut button that may one-shot your problem. Do you keep pressing the button that has a 1% chance of fixing everything? It's a lot more enjoyable than the grueling alternative, at least to me.

Andrew Ng: Building Faster with AI

The Y Combinator AI Startup School keeps producing big hitters - the recent talk by Andrew Ng is great. You can also access it as a podcast (via Spotify or Apple). Andrew brings a unique perspective as one of the early deep learning pioneers, founder of successful companies like Coursera, leader of AI groups at Google and Baidu, and via the AI Fund helping build a huge portfolio of AI startups. To pick just one particularly thought provoking moment, he discusses the previous rule of thumb that you need one product manager to 4-7 engineers. Now one of his teams are suggesting 2 product managers to 1 engineer as a ratio, given the speed of AI-assisted development. Quite a counterpoint to the METR study.

The Political Economy of AI: A Syllabus

Henry Farrell is a professor at Johns Hopkins, and has co-authored some of the most thought provoking analysis of what LLM AI systems really are in the context of human society. His paper with Alison Gopnik, Cosma Shalizi, and James Evans on modern AI systems as cultural and social technologies is required reading (and they're calling this stance "Gopnikism"). The link above is his almost finished syllabus of this and other vital texts to understand modern AI. Loads to digest here, but the ones I was already aware of make me realise this is likely a great list.