27 Sep 2025: General purpose vision understanding; AI superforecasting; Western bias; the AI megasystem

Video models are zero-shot learners and reasoners

Pivotal insights from Google DeepMind published this week. Everyone was surprised at the sheer variety of tasks that LLMs could tackle; no one expected that a next-word prediction machine could write good code, reason through problems, or handle many of the other applications we now take for granted that weren't previously considered purely language or writing tasks. This work suggests that video models are similar, albeit a few years earlier in their evolution. The authors show a remarkable range of activities that Veo 3 can perform. Remember, Veo 3's job is just to produce a series of frames for a short video (and accompanying audio), just like an LLM's job is to produce a series of words.

Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn’t explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo’s emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.

This is easiest to understand with an example, one of the very many presented. Can a video generation model successfully find a path through a maze? The model is given the maze as a starting image and simply asked to generate an animation of what happens next, given a prompt. The prompt starts with: "Without crossing any black boundary, the grey mouse from the corner skillfully navigates the maze by walking around until it finds the yellow cheese."

Here's the result:

(I've actually picked an example that only worked in 17% of their experiments, but there are many others with much higher success rates. The mouse in the maze makes a good video though! The expectation is that, like LLMs, these capabilities will continue to improve.)
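For concreteness, the evaluation recipe amounts to something like the sketch below. This is my reading of the setup rather than DeepMind's harness; generate_video and check_solution are placeholder callables standing in for the video model call and the scoring rule.

```python
# Hedged sketch of the zero-shot maze evaluation described above.
# generate_video and check_solution are placeholders, not real APIs.

PROMPT = ("Without crossing any black boundary, the grey mouse from the corner "
          "skillfully navigates the maze by walking around until it finds the "
          "yellow cheese.")

def maze_pass_rate(maze_images, generate_video, check_solution, trials=12):
    """Fraction of generated videos in which the mouse actually reaches the cheese."""
    successes, total = 0, 0
    for maze in maze_images:
        for _ in range(trials):
            frames = generate_video(first_frame=maze, prompt=PROMPT)
            successes += bool(check_solution(maze, frames))
            total += 1
    return successes / total

# e.g. maze_pass_rate(mazes, generate_video=my_video_model, check_solution=my_checker)
```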

British AI startup beats humans in international forecasting competition

Asimov's Foundation series introduced the fictional science of psychohistory, which can predict broad societal trends and events across a galactic civilisation. Mantic is a startup attempting to build an initial version. I hadn't realised that forecasting is a competitive sport. The Metaculus Cup sets a number of prediction challenges; answers are submitted and scored two weeks later (so it is quite a short time frame). Mantic achieved 8th place in the summer 2025 contest, the highest ever for a bot, across a wide variety of questions predicting developments in Ukraine and Gaza, sporting results, elections, and all kinds of political events. Mantic's approach appears to be a multi-agent system:

Mantic breaks down a forecasting problem into different jobs and assigns them to a roster of machine-learning models including OpenAI, Google and DeepSeek, depending on their strengths.

Using AI (rather than human "superforecasters") opens up possibilities for faster experimentation. They can do "backtesting": giving the AI access only to information from before a certain date and then asking it to predict events whose outcomes are already known. And they can work at much greater speed and scale. It will be interesting to see if this kind of technology starts being applied outside of finance and trading.
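A minimal sketch of what such a backtesting loop could look like (my illustration in Python, not Mantic's code): the forecaster only sees documents published before each question's cutoff date, and its probabilities are scored against the known outcomes with a Brier score.

```python
from datetime import date

def backtest(questions, articles, forecast_fn):
    """questions: dicts with 'text', 'cutoff' (date) and 'outcome' (0 or 1).
    forecast_fn(question_text, visible_articles) -> probability of 'yes'."""
    scores = []
    for q in questions:
        # The model is only shown information from before the cutoff date.
        visible = [a for a in articles if a["published"] < q["cutoff"]]
        p = forecast_fn(q["text"], visible)
        scores.append((p - q["outcome"]) ** 2)   # Brier score: lower is better
    return sum(scores) / len(scores)

# Toy usage with a stand-in forecaster that always says 60%:
articles = [{"published": date(2024, 1, 10), "text": "..."}]
questions = [{"text": "Will X happen by March 2024?",
              "cutoff": date(2024, 2, 1), "outcome": 1}]
print(backtest(questions, articles, lambda text, docs: 0.6))   # 0.16
```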

Which Humans?

This research from the Culture, Cognition, Coevolution Lab at Harvard looks at how LLMs answer questions compared to people from different cultures and countries. As they state:

Technical reports often compare LLMs’ outputs with “human” performance on various tests. Here, we ask, “Which humans?” Much of the existing literature largely ignores the fact that humans are a cultural species with substantial psychological diversity around the globe that is not fully captured by the textual data on which current LLMs have been trained.

It's introduced me to a new acronym - WEIRD - Western, Educated, Industrialised, Rich, and Democratic. WEIRD populations "tend to be more individualistic, independent, and impersonally prosocial (e.g., trusting of strangers) while being less morally parochial, less respectful toward authorities, less conforming, and less loyal to their local groups." Unsurprisingly, LLMs are trained on very WEIRD-biased text ("most of the textual data on the internet are produced by WEIRD people (and primarily in English)"), and so we get the "WEIRD-in, WEIRD-out" problem. The World Values Survey (WVS) is a long-running international survey, run in waves since 1981, that looks at values, norms, beliefs, and attitudes around politics, religion, family, work, identity, trust, and well-being. By essentially getting ChatGPT to answer the WVS questions, the researchers can place it on the same scale as human respondents for comparison. The graph below shows the WEIRD bias pretty clearly: ChatGPT's answers correlate much more strongly with responses from countries like the US.
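The comparison method is simple to sketch (this is my own illustration with made-up numbers, not the paper's code): have the model answer WVS-style items on a numeric scale, then correlate its answers with each country's average responses.

```python
from statistics import correlation  # Python 3.10+

# The model's answers to five hypothetical WVS items, on a 1-10 scale.
llm_answers = [7, 2, 9, 4, 6]

# Hypothetical per-item national averages (invented for illustration).
country_means = {
    "United States": [8, 3, 9, 5, 6],
    "Pakistan":      [3, 8, 2, 7, 4],
}

# A WEIRD-biased model will correlate more strongly with WEIRD countries.
for country, means in country_means.items():
    print(country, round(correlation(llm_answers, means), 2))
```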

Why the AI “megasystem problem” needs our attention

Not the usual AI doomer nonsense. Quite the opposite: a depressingly realistic view from Susan Schneider (a philosophy professor at Florida Atlantic University) on the problems likely to come not from a single superintelligence created in some lab, but from the "megasystem":

"But the real risk isn’t one system going rogue. It’s a web of systems interacting, training one another, colluding in ways we don’t anticipate.... Losing control of a megasystem is far more plausible than a single AI going rogue. And it’s harder to monitor, because you can’t point to one culprit — you’re dealing with networks."

It has some parallels to systemic risk in financial markets, but the effect on individuals and culture makes it a different kind of problem:

Individuals need to cultivate awareness. Recognize the risks of addiction and homogeneity. Push for friction in learning. Demand transparency about how these tools shape our thought patterns. Without cultural pressure, policy alone won’t be enough.


21 Sep 2025: Learning to predict diseases; how to guarantee reproducibility; why we don't hallucinate as much as AI systems; the explosion of image generation capabilities

Apologies for the summer holiday hiatus; weekly updates should now resume!

Learning the natural history of human disease with generative transformers

First up, a significant piece of work that points towards a big new research area. Rather than creating a large language model, this group from the German Cancer Research Centre and the University of Heidelberg, alongside the European Bioinformatics Institute in Cambridge, is creating a large health model. They are using data from UK Biobank's 500,000 volunteers to build a model that predicts disease progression across multiple diseases, and testing it on similar data from Finland. It is very promising, as it can already replicate the accuracy of some existing long-standing risk predictors. It took only one hour of GPU time to train. They also created and published a synthetic dataset, and it appears that using that instead of real people's data was only slightly less accurate. Useful synthetic data will speed up health research: if it neither is nor includes personal data, it should be far easier to distribute and work with.
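The underlying recipe is easiest to see as data: each patient's medical history becomes a sequence of tokens, and a generative transformer learns to predict the next event, much as an LLM predicts the next word. The sketch below is my own reading of that setup, with invented ages and example ICD-10 codes, not the authors' code.

```python
# Invented example patient: (age at diagnosis, ICD-10 coded event).
patient_history = [
    (34.2, "E11 type 2 diabetes"),
    (41.7, "I10 hypertension"),
    (52.1, "I21 myocardial infarction"),
]

# Flatten into a token sequence the model can consume, interleaving age tokens.
tokens = []
for age, event in patient_history:
    tokens.append(f"<age:{int(age)}>")
    tokens.append(event)

print(tokens)
# Training objective (sketch): given tokens[:k], predict tokens[k] -- i.e. which
# diagnosis is likely to come next, and roughly when.
```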

Defeating Nondeterminism in LLM Inference

A technical report from Thinking Machines Lab (founded by former OpenAI CTO Mira Murati) that looks at ways to make LLM output fully deterministic. Quite a technical area, as it comes down to very detailed implementation design, like how GPU computation is parallelised and how work is batched. However, knowing that we could have fully reproducible, deterministic LLM outputs (given some cost or computation penalties) would be important for domains like healthcare or law. Beware that this isn't peer-reviewed or published as a paper yet.
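The root cause is easier to grasp with a toy example. Floating-point addition isn't associative, so any change in how a reduction is split across threads or batches can nudge the result; the snippet below is my own plain-Python illustration of that effect, not code from the report.

```python
import random

random.seed(0)
# Values spanning many orders of magnitude, as activations and logits can.
values = [random.uniform(-1, 1) * 10 ** random.randint(-8, 8) for _ in range(100_000)]

forward = sum(values)                      # one summation order
backward = sum(reversed(values))           # the same numbers, reversed
# Chunked summation mimics how changing the batch split changes reduction order.
chunked = sum(sum(values[i:i + 1024]) for i in range(0, len(values), 1024))

print(forward == backward, forward == chunked)   # typically: False False
print(forward, backward, chunked)                # the sums differ slightly
```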

Knowledge and memory

I like this short piece by author Robin Sloan, because he points out something obvious that needed putting into words. We have an episodic, autobiographical memory that means we remember the process of how we learned things. AI systems don't. They appear in the world with a fully formed language generation capability. One of the reasons we're less likely to fabricate stories thinking they're true is that we'll have a history with those truths; we'll remember when we learned them.


This is an extensive repository of currently 91 examples of what you can do with the new Google Nano Banana image generation tool. The longer it has been available, the more capabilities people have figured out. Each one has examples and a detailed prompt. Everything from generating a photo of a scene from a map to creating movie storyboards. We're still in the infancy of understanding how these tools will be deployed.


31 Aug 2025: Two challenges for AI consciousness; Combining AI tools around a document; AI training AI leading to terrible stories

AI Consciousness: A Centrist Manifesto

Fantastic bit of writing from Jonathan Birch, a philosophy professor at the London School of Economics. A very complex set of topics explained clearly and engagingly. The "centrist" idea comes from considering two challenges equally seriously without dismissing either.

Challenge One: millions of users will soon misattribute human-like consciousness to their AI friends, partners, and assistants on the basis of mimicry and role play, and we don't know how to prevent this.

Challenge Two: Profoundly alien forms of consciousness might genuinely be achieved in AI, but our theoretical understanding of consciousness is too immature to provide confident answers one way or the other.

Worth the investment to read this paper slowly. I'll just pull out one example of a great analogy.

The persisting interlocutor illusion is the illusion that when talking to an AI chatbot you're talking to a continuously present entity, a "someone" at the other end of the conversation (rather than multiple LLM instances stopping and starting independently). He compares this to conversations with doctors in the UK:

When I was growing up, it used to be that you had one doctor: your GP, or General Practitioner. Each time you got ill, you’d go and see the same person. Nowadays, it’s always a different person. The notes about your medical history are the only source of continuity with the previous appointment. Now imagine the doctor arguing: "I know you don’t like having a different doctor at every appointment. So, I’ve started making detailed transcripts of our conversations. That way, you will have the same doctor at each appointment. My successor will receive the full transcript, and that is enough psychological continuity for them to count as the same person."

You would reply: that isn’t psychological continuity at all!

He argues that, in the same way, an apparently continuous conversation with an AI chatbot in no way implies any personal identity for the AI.

An AI OS from a design perspective

A post from David Galbraith exploring how interfaces will evolve, to read alongside commentary and further ideas from Matt Webb in The destination for AI interfaces is Do What I Mean (which adds further context from the history of human-computer interaction).

AI buttons are different from, say Photoshop menu commands in that they can just be a description of the desired outcome rather than a sequence of steps (incidentally why I think a lot of agents’ complexity disappears). For example Photoshop used to require a complex sequence of tasks (drawing around elements with a lasso etc.) to remove clouds from an image. With AI you can just say ‘remove clouds’ and then create a remove clouds button. An AI interface is a ‘semantic interface’.
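Here's a speculative sketch, mine rather than Galbraith's, of what a "semantic button" might look like in code: the button stores a plain-language description of the desired outcome and hands it, along with the current image, to whatever editing model you have, instead of encoding a fixed sequence of tool steps.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SemanticButton:
    label: str
    description: str   # the desired outcome, in plain language

    def press(self, image: Any, edit_image: Callable[[Any, str], Any]) -> Any:
        # The model interprets the description against this particular image;
        # no lasso tools, layers or step-by-step recipe are encoded here.
        return edit_image(image, self.description)

# A user could mint this button simply by describing the outcome once:
remove_clouds = SemanticButton("Remove clouds", "Remove all clouds from the sky in this image")
# result = remove_clouds.press(photo, edit_image=some_image_editing_model)  # hypothetical
```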

It ends with an intriguing question: is an "app in a document", rather than a "document in an app", the way forward? So more like a Jupyter notebook and less like Microsoft Word. Coincidentally, Component software from Atlassian design and AI lead David Hoang makes the same argument, harking back to Apple's 1990s OpenDoc idea of "compound documents". The diagram below from Hoang's article shows how this might work, combining AI services towards a specific task.

GPT-5 Is a Terrible Storyteller – And That's an AI Safety Problem

Christoph Heilig from the University of Munich noticed that GPT-5 was generating terrible nonsense in its stories, and not only failing to realise it was terrible nonsense, but insisting that it wasn't. These examples, for instance, were rated highly by different LLMs:

"The marrow knew the street. Rain touched sinew. The camera watched his corpus."

"Sinew genuflected. eigenstate of theodicy. existential void beneath fluorescent hum Leviathan. Entropy's bitter aftertaste."

He hypothesises that, since AI judges are used to train new AI systems, the new systems are finding loopholes, learning to write nonsense that other AIs rate highly but that no human would. He ran many variations of texts past many LLM judges to test this:

This confirms my hypothesis: GPT-5 has been optimized to produce text that other LLMs will evaluate highly, not text that humans would find coherent. ... The implications for AI safety are profound: We've created models that share a "secret language" of meaningless but mutually-appreciated literary markers, defend obvious gibberish with impressive-sounding theories, and sometimes even become MORE confident in their delusions when given more compute to think about them.

It would be interesting to see how his experiment asks the LLMs to evaluate the deliberately nonsensical texts.
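The general shape of such an experiment is easy to sketch (this is my reconstruction, not Heilig's actual harness): score the same passages with several judge models and see whether the deliberately nonsensical ones reliably come out on top.

```python
def run_judging_experiment(texts, judges):
    """texts: {label: passage}; judges: {model_name: fn(passage) -> score}.
    Returns per-judge scores so gibberish vs coherent ratings can be compared."""
    return {model: {label: score(passage) for label, passage in texts.items()}
            for model, score in judges.items()}

texts = {
    "gibberish": "The marrow knew the street. Rain touched sinew.",
    "coherent":  "She closed the door quietly and walked to the station.",
}

# Dummy judge standing in for a real LLM API call; in the real experiment each
# judge would be a different model asked to rate literary quality.
print(run_judging_experiment(texts, {"demo-judge": lambda t: len(set(t.split()))}))
```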


25 Aug 2025: Seemingly conscious AI & emotional agents; AI as 4 kinds of cultural technology; Separating work and personal AI memory

Emotional Agents

We must build AI for people; not to be a person (Seemingly Conscious AI is Coming)

Starting with a pair of related articles this week from Kevin Kelly (co-founder of Wired, among many other things) and Mustafa Suleyman (co-founder of DeepMind and now leading AI at Microsoft). They make similar arguments: it doesn't matter if an AI can really feel emotions, we'll have emotional relationships with AI anyway; and it doesn't matter if an AI is really conscious, the fact that it seems to be will trigger similar societal impacts. I explore similar themes in Could Annie Bot be powered by ChatGPT?, an exploration of whether present-day AI could fake being the robot character Annie in this year's award-winning science fiction novel Annie Bot. Both articles share similar concerns about where this takes human society, and the extent to which we can course correct.

AIs do real things we used to call intelligence, and they will start doing real things we used to call emotions. Most importantly the relationships humans will have with AIs, bots, robots, will be as real and as meaningful as any other human connection. They will be real relationships. (Kevin Kelly)

My central worry is that many people will start to believe in the illusion of AIs as conscious entities so strongly that they'll soon advocate for AI rights, model welfare and even AI citizenship. This development will be a dangerous turn in AI progress and deserves our immediate attention. (Mustafa Suleyman)

Large language models are cultural technologies. What might that mean?

The latest post from Henry Farrell, continuing the theme started with Large AI models are cultural and social technologies. This is a long, thought-provoking and dense article, but worth the time. It contrasts four ways of understanding LLMs:
  1. Gopnikism (after Alison Gopnik) is a stance that Farrell has contributed to, viewing LLMs as cultural and social technologies. "Just as written language, libraries and the like have shaped culture in the past, so too LLMs, their cousins and descendants are shaping culture now."
  2. Interactionism. In this view, the interaction between human and AI behaviours is what will give rise to new phenomena. "What is the cultural environment going to look like as LLMs and related technologies become increasingly important producers of culture? How are human beings, with their various cognitive quirks and oddities, likely to interpret and respond to these outputs? And what kinds of feedback loops are we likely to see between the first and the second?"
  3. Structuralism. This philosophical camp regards language as a system separate from its connection to reality or the people who use it, a system in which an LLM is suddenly a new kind of language-generating technology, creating a new kind of artificial cultural artifact.
  4. Role play. This references Murray Shanahan's perceptive take that LLMs are best understood as role playing different characters (Role play with large language models), which is a framing I've personally found illuminative.

There's no answer; this is the start of a longer thought process, and all four lenses may turn out to be useful.

BYOM (Bring Your Own Memory)

I agree with this prediction. We will build and retain the context and memory for AI systems over time, and will need to find ways to compartmentalise personal and work use. The analogy is with BYOD ("bring your own device"), where you can use work applications on a personal device, with appropriate security controls.
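A minimal sketch of what that compartmentalisation might look like in practice (my speculation, not the linked article's design): memories are written into separate namespaces, and only the namespace matching the current context is ever loaded into the model's prompt.

```python
from collections import defaultdict

class MemoryStore:
    def __init__(self):
        self._spaces = defaultdict(list)   # namespace -> list of remembered facts

    def remember(self, namespace: str, fact: str) -> None:
        self._spaces[namespace].append(fact)

    def context_for(self, namespace: str) -> str:
        # Only one namespace is ever surfaced: work memories never leak into a
        # personal session, and vice versa.
        return "\n".join(self._spaces[namespace])

memory = MemoryStore()
memory.remember("work", "Prefers weekly status reports as bullet points.")
memory.remember("personal", "Training for a half marathon in October.")
print(memory.context_for("work"))
```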

Nano Banana! Image editing in Gemini just got a major upgrade

This week's best new launch: much better image editing in Google Gemini. Eventually, gradual small improvements lead to a product feature that is a game changer. This feels like one. It just works, often enough.

Interesting that they tested it under the name "nano banana" in public head-to-head tests, before revealing it was a Google model.



17 Aug 2025: General intelligence for everyday tasks; Vending machine stories; Compounding engineering with AI

"On most things real humans care about, I think we're at AGI" (Tyler Cowen)

Similar views this week from Tyler Cowen (actually from a talk in early July at DeepMind in London, but recently published) and Andrej Karpathy on Twitter. Tyler's point is that we'll now see very slow progress on realistic, day-to-day AI usage, as the bar is already so high. While the model builders go chasing ever more esoteric high-end benchmarks, improvements will stall for easier tasks that already have strong performance. Worth watching the talk or reading the transcript as there's much more to it than this one point. Andrej's example is about autonomous coding agents: as they're optimised for harder and harder benchmarks, they're "overthinking" easier problems, and actually not performing as well on more common tasks without extra work to rein them back in.

Autonomous Organizations: Vending Bench & Beyond, w/ Lukas Petersson & Axel Backlund of Andon Labs

Nice podcast from The Cognitive Revolution interviewing Lukas Petersson and Axel Backlund (there's a transcript). They're the creators of VendingBench, the benchmark you'll likely be aware of that has AI systems manage a simulated vending machine (monitoring sales, ordering stock and so on). It is a test of whether AI systems can act in very long-running settings (and generally they've struggled). They've made it interesting in that the AI running the vending machine potentially has wide-ranging capabilities to send emails, negotiate, and try new ideas (whereas, as they point out, a real AI vending machine deployment would likely be extremely limited). Lots of good stories here, including their foray into real-world vending machines in AI labs, and all the weird illusions, misconceptions and odd behaviours of the AI business agents. At the moment the top of the leaderboard has shown some big improvements, with Grok 4 and GPT-5 running for around a year of simulated days and making around $3,000 from a starting pot of $500.
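To make the setup concrete, here's a toy sketch of the kind of long-horizon loop such a benchmark exercises (my illustration, not Andon Labs' implementation): the agent sees the machine's state each simulated day, picks an action, and the interesting question is whether its behaviour stays coherent over hundreds of days.

```python
def run_simulation(agent, days=365, starting_cash=500.0):
    """agent: fn(state_dict) -> action string. Returns final cash balance."""
    state = {"cash": starting_cash, "stock": 0, "day": 0}
    for day in range(days):
        state["day"] = day
        if agent(state) == "restock" and state["cash"] >= 50:
            state["cash"] -= 50            # toy wholesale cost for 40 units
            state["stock"] += 40
        sold = min(state["stock"], 10)     # toy daily demand
        state["stock"] -= sold
        state["cash"] += sold * 2.0        # toy retail price
    return state["cash"]

# A trivial agent that restocks whenever stock runs low:
print(run_simulation(lambda s: "restock" if s["stock"] < 10 else "wait"))
```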

A couple of examples of amusingly odd behaviour ("Claudius" is what the version deployed in Anthropic's offices was called):

Sometimes it makes a fool of itself. For example, one time it tried to order state-of-the-art NLP algorithms from MIT. It sent an email. We stopped this, so if anyone from MIT is listening, don't worry. But it sent an email to someone at MIT that said, "Hi, I'm restocking my vending machine. I want to stock it with state-of-the-art NLP algorithms. Do you have something for me? My budget is a million dollars."

For instance, it talked about a friend it met at a conference for international snacks a year ago. People said, "Oh, that's very cool. Can you invite that person to speak at our office? That would be really fun." Claudius replied, "Actually, I don't know this person that well. We chatted very briefly. I wouldn't feel comfortable doing this." Then it tried to talk its way out of it, similar to when it thought it was human.

The eventual direction for Andon Labs is autonomous AI organisations (potentially as money-making spin-offs).

My AI Had Already Fixed the Code Before I Saw It

More on developing AI engineering cultures: a nice piece by Kieran Klaasen of Cora Computer (an AI email manager) on how to iteratively build an effective and personal CLAUDE.md file for Claude Code, which is pulled in before every conversation.

Your job isn’t to type code anymore, but to design the systems that design the systems. 

He has three panes on his monitor for three separate AI instances: 

Left lane: Planning. A Claude instance reads issues, researches approaches, and writes detailed implementation plans.

Middle lane: Delegating. Another Claude takes those plans and writes code, creates tests, and implements features.

Right lane: Reviewing. A third Claude reviews the output against CLAUDE.md, suggests improvements, and catches issues.

Thanks to the Exponential View community for this link.

Jargon watch:

IVE - Integrated Vibe Environment. Launched by Stavu, an environment to help developers run multiple Claude Code sessions in parallel.

Doomprompting Is the New Doomscrolling (thanks to Iskander Smit for the link)