“METR: Measuring AI Ability to Complete Long Tasks” by Zach Stein-Perlman

7/4/2025

LessWrong (Curated & Popular)

Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under five years, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.
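To make the extrapolation concrete, here is a minimal sketch of the arithmetic behind a fixed doubling time. Only the 7-month doubling time comes from the summary above. The starting horizon of one hour and the function names `months_until` and `horizon_after` are assumptions chosen for illustration, not figures or code from the paper.

```python
import math

DOUBLING_TIME_MONTHS = 7.0  # doubling time quoted in the summary


def months_until(target_hours: float, current_hours: float,
                 doubling_months: float = DOUBLING_TIME_MONTHS) -> float:
    """Months for the task horizon to grow from current_hours to target_hours,
    assuming steady exponential growth with the given doubling time."""
    if current_hours <= 0 or target_hours <= 0:
        raise ValueError("horizons must be positive")
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months


def horizon_after(months: float, current_hours: float,
                  doubling_months: float = DOUBLING_TIME_MONTHS) -> float:
    """Task horizon in hours after `months` of steady exponential growth."""
    return current_hours * 2 ** (months / doubling_months)


if __name__ == "__main__":
    start_hours = 1.0  # ASSUMPTION: illustrative starting horizon, not from the source
    targets = [("one 8-hour work day", 8.0), ("one 40-hour work week", 40.0)]
    for label, hours in targets:
        months = months_until(hours, start_hours)
        print(f"{label}: about {months:.1f} months ({months / 12:.1f} years)")
```

Under these assumptions, going from a one-hour horizon to a 40-hour one takes a little over five doublings, or roughly three years. The result is sensitive to the assumed starting point and to whether the trend continues to hold.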

The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years. The shaded region represents 95% CI calculated by hierarchical bootstrap over task families, tasks, and task attempts.
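The hierarchical bootstrap mentioned in the caption can be sketched roughly as follows. This is a toy illustration, not the paper's procedure: the data are invented, the names `TOY_RESULTS`, `resample_once`, and `bootstrap_ci` are hypothetical, and the statistic is a pooled success rate standing in for the time-horizon estimate that the paper actually fits.

```python
import random
from statistics import mean

# Toy data invented for illustration: family -> task -> attempt outcomes (1 = success).
TOY_RESULTS = {
    "family_a": {"task_1": [1, 1, 0, 1], "task_2": [1, 0, 0, 1]},
    "family_b": {"task_3": [0, 0, 1, 0], "task_4": [1, 1, 1, 1], "task_5": [0, 1, 0, 0]},
    "family_c": {"task_6": [0, 0, 0, 1]},
}


def resample_once(results, rng):
    """Draw one hierarchical resample and return its pooled success rate.

    Resampling is with replacement at each level: families, then tasks
    within each drawn family, then attempts within each drawn task.
    """
    outcomes = []
    families = list(results)
    for family in rng.choices(families, k=len(families)):
        tasks = list(results[family])
        for task in rng.choices(tasks, k=len(tasks)):
            attempts = results[family][task]
            outcomes.extend(rng.choices(attempts, k=len(attempts)))
    return mean(outcomes)


def bootstrap_ci(results, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval from repeated hierarchical resamples."""
    rng = random.Random(seed)
    stats = sorted(resample_once(results, rng) for _ in range(n_resamples))
    lo_index = int(n_resamples * alpha / 2)
    hi_index = min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))
    return stats[lo_index], stats[hi_index]


if __name__ == "__main__":
    low, high = bootstrap_ci(TOY_RESULTS)
    print(f"95% CI for pooled success rate: [{low:.2f}, {high:.2f}]")
```

Resampling at all three levels lets the interval reflect uncertainty about which task families were included, not only noise across individual attempts.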

Full paper | GitHub repo

We think that forecasting the capabilities of future AI systems is important for understanding and preparing for the impact of [...]

---

Outline:

(08:58) Conclusion

(09:59) Want to contribute?

---

First published:
March 19th, 2025

Source:
https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks

---

Narrated by TYPE III AUDIO.
