17 Aug 2025: General intelligence for everyday tasks; Vending machine stories; Compounding engineering with AI

"On most things real humans care about, I think we're at AGI" (Tyler Cowen)

Similar views this week from Tyler Cowen (actually from a talk in early July at DeepMind in London, but recently published), and Andrej Karpathy on Twitter. Tyler's point is that we'll now see very slow progress on realistic, day-to-day AI usage, as the bar is already so high. While the model builders go chasing ever more esoteric high-end benchmarks, improvements will stall for easier tasks that already have strong performance. Worth watching the talk or reading the transcript, as there's much more to it than this one point. Andrej's example is about autonomous coding agents. As they're being optimised for harder and harder benchmarks, they're "overthinking" easier problems, and actually not performing as well on more common tasks without extra work to rein them back in.

Autonomous Organizations: Vending Bench & Beyond, w/ Lukas Petersson & Axel Backlund of Andon Labs

Nice podcast from The Cognitive Revolution interviewing Lukas Petersson and Axel Backlund (there's a transcript). They're the creators of VendingBench, the benchmark you'll likely be aware of that allows AI systems to manage a simulated vending machine (monitoring sales, ordering stock and so on). It is a test of whether AI systems can act in very long-running settings (and generally they've struggled). They've made it interesting, as the AI running the vending machine potentially has wide-ranging capabilities to send emails, negotiate and try new ideas (whereas, as they point out, a real AI vending machine deployment would likely be extremely limited). Lots of good stories here, including their foray into real-world vending machines in AI labs, and all the weird illusions, misconceptions and odd behaviours of the AI business agents. At the moment the top of the leaderboard has shown some big improvements, with Grok 4 and GPT-5 running for around a year of simulated days and making around $3000 from a starting pot of $500.

A couple of examples of amusingly odd behaviour ("Claudius" is what the version deployed in Anthropic's offices was called):

Sometimes it makes a fool of itself. For example, one time it tried to order state-of-the-art NLP algorithms from MIT. It sent an email. We stopped this, so if anyone from MIT is listening, don't worry. But it sent an email to someone at MIT that said, "Hi, I'm restocking my vending machine. I want to stock it with state-of-the-art NLP algorithms. Do you have something for me? My budget is a million dollars."
For instance, it talked about a friend it met at a conference for international snacks a year ago. People said, "Oh, that's very cool. Can you invite that person to speak at our office? That would be really fun." Claudius replied, "Actually, I don't know this person that well. We chatted very briefly. I wouldn't feel comfortable doing this." Then it tried to talk its way out of it, similar to when it thought it was human.

The eventual direction for Andon Labs is autonomous AI organisations (potentially as money-making spin-offs).

My AI Had Already Fixed the Code Before I Saw It

More on developing AI engineering cultures: nice piece by Kieran Klaassen of Cora Computer (an AI email manager) on how to iteratively build an effective and personal CLAUDE.md file for Claude Code, which is pulled in before every conversation.

Your job isn’t to type code anymore, but to design the systems that design the systems.

He has three panes on his monitor for three separate AI instances:

Left lane: Planning. A Claude instance reads issues, researches approaches, and writes detailed implementation plans.

Middle lane: Delegating. Another Claude takes those plans and writes code, creates tests, and implements features.

Right lane: Reviewing. A third Claude reviews the output against CLAUDE.md, suggests improvements, and catches issues.
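
To make the three lanes concrete, here is a minimal sketch of the hand-off between them. It is not from the article: the `Agent` type, `run_lanes` and `stub_agent` are hypothetical names, and each agent is simply assumed to be a function from a prompt to text, standing in for a separate Claude instance.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: an "agent" is any function that takes a prompt and returns text.
# In practice each one would wrap a separate AI session; here they are stubs.
Agent = Callable[[str], str]


@dataclass
class LaneResult:
    plan: str
    code: str
    review: str


def run_lanes(issue: str, guidelines: str,
              planner: Agent, builder: Agent, reviewer: Agent) -> LaneResult:
    """Pass one issue through the planning, delegating and reviewing lanes."""
    # Left lane: turn the issue into an implementation plan.
    plan = planner(f"Write an implementation plan for this issue:\n{issue}")
    # Middle lane: turn the plan into code and tests.
    code = builder(f"Implement this plan, with tests:\n{plan}")
    # Right lane: check the output against the project guidelines.
    review = reviewer(
        f"Review this output against these guidelines:\n{guidelines}\n\n"
        f"Output:\n{code}"
    )
    return LaneResult(plan, code, review)


def stub_agent(label: str) -> Agent:
    """Hypothetical stand-in: tags the last line of the prompt with a label."""
    def agent(prompt: str) -> str:
        return f"[{label}] {prompt.splitlines()[-1]}"
    return agent


result = run_lanes(
    issue="Fix login bug",
    guidelines="Prefer small functions",
    planner=stub_agent("plan"),
    builder=stub_agent("build"),
    reviewer=stub_agent("review"),
)
print(result.review)
```

The point of the sketch is only the shape: each lane consumes the previous lane's output, and the guidelines file enters at the review step.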

Thanks to the Exponential View community for this link.

Jargon watch:

IVE - Integrated Vibe Environment. Launched by Stavu, an environment to help developers run multiple Claude Code sessions in parallel.

Doomprompting Is the New Doomscrolling (thanks to Iskander Smit for the link)

WhizzyIdeas.AI

A weekly and very partial view of interesting AI news
