“Activation space interpretability may be doomed” by bilalchughtai, Lucius Bushnaq
1/10/2025
TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neural networks by decomposing their individual activation spaces in isolation: it seems likely to find features of the activations (features that help explain the statistical structure of activation spaces) rather than features of the model (the features the model's own computations make use of).
Written at Apollo Research
Introduction
Claim: Activation space interpretability is likely to give us features of the activations, not features of the model, and this is a problem.
Let's walk through this claim.
What do we mean by activation space interpretability? Interpretability work that attempts to understand neural networks by explaining the inputs and outputs of their layers in isolation. In this post, we focus in particular on the problem of decomposing activations, via techniques such as sparse autoencoders (SAEs), PCA, or just by looking at individual neurons. This [...]
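As a rough illustration of the kind of decomposition being discussed, here is a minimal sketch (not from the original post) of decomposing a layer's activation matrix with PCA and with a toy sparse autoencoder. The activation data, layer width, dictionary size, and training hyperparameters below are all placeholder assumptions.

```python
# Minimal sketch: decomposing a (placeholder) activation matrix in isolation,
# first with PCA, then with a toy sparse autoencoder.
import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-in for activations collected from one layer of a model.
acts_np = np.random.randn(2_000, 512).astype(np.float32)  # [n_samples, d_model]

# PCA via SVD: directions that explain the most variance in the activations.
centered = acts_np - acts_np.mean(axis=0)
_, _, pca_directions = np.linalg.svd(centered, full_matrices=False)  # rows are PCA directions

# Toy sparse autoencoder: an overcomplete dictionary trained with an L1
# sparsity penalty on the latent code.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        code = torch.relu(self.enc(x))
        return self.dec(code), code

acts = torch.from_numpy(acts_np)
sae = SparseAutoencoder(d_model=512, d_hidden=1024)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

for step in range(100):  # short illustrative training loop
    recon, code = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * code.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Both approaches operate on the activation matrix alone, which is exactly the setting the post argues may yield directions that describe the activations' statistics rather than the features the model's own computation uses.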
---
Outline:
(00:33) Introduction
(02:40) Examples illustrating the general problem
(12:29) The general problem
(13:26) What can we do about this?
The original text contained 11 footnotes which were omitted from this narration.
---
First published: January 8th, 2025
Source: https://www.lesswrong.com/posts/gYfpPbww3wQRaxAFD/activation-space-interpretability-may-be-doomed
---
Narrated by TYPE III AUDIO.