Language models can explain neurons in language models

Authors

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, William Saunders
* Core Research Contributor; Author contributions statement below. Correspondence to interpretability@openai.com.

Affiliation

OpenAI

Published

May 9, 2023

Introduction

Language models have become more capable and more widely deployed, but we do not understand how they work. Recent work has made progress on understanding a small number of circuits and narrow behaviors, but to fully understand a language model, we'll need to analyze millions of neurons. This paper applies automation to the problem of scaling an interpretability technique to all the neurons in a large language model. Our hope is that building on this approach of automating interpretability will enable us to comprehensively audit the safety of models before deployment.

Our technique seeks to explain what patterns in text cause a neuron to activate. It consists of three steps:

Step 1: Explain the neuron's activations using GPT-4
Show neuron activations to GPT-4:
The Avengers to the big screen, Joss Whedon has returned to reunite Marvel's gang of superheroes for their toughest challenge yet. Avengers: Age of Ultron pits the titular heroes against a sentient artificial intelligence, and smart money says that it could soar at the box office to be the highest-grossing film of the
introduction into the Marvel cinematic universe, it's possible, though Marvel Studios boss Kevin Feige told Entertainment Weekly that, "Tony is earthbound and facing earthbound villains. You will not find magic power rings firing ice and flame beams." Spoilsport! But he does hint that they have some use STARK T
, which means this Nightwing movie is probably not about the guy who used to own that suit. So, unless new director Matt Reeves' The Batman is going to dig into some of this backstory or introduce the Dick Grayson character in his movie, the Nightwing movie is going to have a lot of work to do explaining
of Avengers who weren't in the movie and also Thor try to fight the infinitely powerful Magic Space Fire Bird. It ends up being completely pointless, an embarrassing loss, and I'm pretty sure Thor accidentally destroys a planet. That's right. In an effort to save Earth, one of the heroes inadvertantly blows up an
GPT-4 gives an explanation, guessing that the neuron is activating on references to movies, characters, and entertainment.
Step 2: Simulate activations using GPT-4, conditioning on the explanation
Step 3: Score the explanation by comparing the simulated and real activations
Example neuron — GPT-2 layer 0 neuron 816: language related to Marvel comics, movies, and characters, as well as other superhero-themed content

This technique lets us leverage GPT-4 to define and automatically measure a quantitative notion of interpretability, which we call an "explanation score": a measure of a language model's ability to compress and reconstruct neuron activations using natural language. The fact that this framework is quantitative allows us to measure progress toward our goal of making the computations of a neural network understandable to humans.

With our baseline methodology, explanations achieved scores approaching the level of human contractor performance. We found we could further improve performance by:

However, we found that both GPT-4-based and human contractor explanations still score poorly in absolute terms. When looking at neurons, we also found the typical neuron appeared quite polysemantic. This suggests we should change what we're explaining. In preliminary experiments, we tried:

We applied our method to all MLP neurons in GPT-2 XL. We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron's top-activating behavior. We used these explanations to build new user interfaces for understanding models, for example allowing us to quickly see which neurons activate on a particular dataset example and what those neurons do.

Example (1 of 4) from the interactive neuron viewer, showing which neurons fire on each token:
Many of our readers may be aware that Japanese consumers are quite fond of unique and creative Kit Kat products and flavors. But now, Nestle Japan has come out with what could be described as not just a new flavor but a new "species" of Kit Kat. And why are we calling it a new species? Well, it's because you'll need to do just a little bit of cooking to fully enjoy these Kit Kats.

We are open-sourcing our dataset of explanations for all neurons in GPT-2 XL, along with code for explanation and scoring, to encourage further research into producing better explanations. We are also releasing a neuron viewer built on the dataset. Although most well-explained neurons are not very interesting, we found many interesting neurons that GPT-4 didn't understand. We hope this lets others more easily build on top of our work. With better explanations and tools in the future, we may be able to rapidly uncover interesting qualitative understanding of model computations.

Methods

Setting

Our methodology involves multiple language models: the subject model (the model we aim to interpret), the explainer model (which generates explanations of the subject model's neurons), and the simulator model (which predicts a neuron's activations based on an explanation).

In this paper we start with the easiest case of identifying properties of text inputs that correlate with intermediate activations. We ultimately want to extend our method to explore arbitrary hypotheses about subject model computations.

In the language model case, the inputs are text passages. For the intermediate activations, we focus on neurons in the MLP layers. For the remainder of this paper, activations refer to the MLP post-activation value calculated as a = \mathrm{f}(W_{in} \cdot x + b), where \mathrm{f} is a nonlinear activation function (specifically GELU for GPT-2). The neuron activation is then used to update the residual stream by adding a \cdot W_{out}.
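As a concrete sketch of this quantity (hypothetical tensor names and shapes; not code from our release), the per-token activation of a single MLP neuron can be computed as follows:

import torch
import torch.nn.functional as F

def mlp_neuron_activations(x, W_in, b_in, neuron_idx):
    # x:    residual-stream inputs after the pre-MLP layer norm, shape (n_tokens, d_model)
    # W_in: MLP input weights, shape (d_model, d_mlp); b_in: bias, shape (d_mlp,)
    # Returns the GELU post-activations a for one neuron, shape (n_tokens,).
    pre = x @ W_in + b_in
    post = F.gelu(pre)          # f = GELU for GPT-2
    return post[:, neuron_idx]  # this value, times W_out, is what gets added to the residual stream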

When we have a hypothesized explanation for a neuron, the hypothesis is that the neuron activates on tokens with that property, where the property may include the previous tokens as context.

Overall algorithm

At a high level, our process of interpreting a neuron uses the following algorithm: first, generate an explanation of the neuron's behavior by showing the explainer model (token, activation) pairs (Step 1); second, use the simulator model to simulate the neuron's activations conditional on the explanation (Step 2); third, score the explanation by comparing the simulated and actual activations (Step 3).

We always use distinct documents for explanation generation and simulation. However, we did not explicitly check that the resulting text excerpts do not overlap. While in principle it would be reasonable for an explanation to "memorize" behavior to the extent that it drives most of the subject model's behavior on the training set, it would be less interesting if that was the primary driver of high scores. Based on some simple checks of our text excerpts, this was a non-issue for at least 99.8% of neurons.

Our code for generating explanations, simulating neurons, and scoring explanations is available here.

Step 1: Generate explanations of the neuron's behavior

In this step, we create a prompt that is sent to the explainer model to generate one or more explanations of a neuron's behavior. The prompt consists of few-shot examples built from other real neurons: tab-separated (token, activation) pairs from text excerpts, followed by researcher-written explanations. The prompt ends with tab-separated (token, activation) pairs from text excerpts for the neuron being interpreted. (All prompts are shown in an abbreviated format and are modified somewhat when using the structured chat completions API; for full details see our codebase.)
We're studying neurons in a neural network. Each neuron looks for some particular thing in a short document. Look at the parts of the document the neuron activates for and summarize in a single sentence what the neuron is looking for. Don't list examples of words. The activation format is token<tab>activation. Activation values range from 0 to 10. A neuron finding what it's looking for is represented by a non-zero activation value. The higher the activation value, the stronger the match. Neuron 1 Activations: <start> the 0 sense 0 of 0 together 3 ness 7 in 0 our 0 town 1 is 0 strong 0 . 0 <end> <start> [prompt truncated …] <end> Same activations, but with all zeros filtered out: <start> together 3 ness 7 town 1 <end> <start> [prompt truncated …] <end> Explanation of neuron 1 behavior: the main thing this neuron does is find phrases related to community [prompt truncated …] Neuron 4 Activations: <start> Esc 0 aping 9 the 4 studio 0 , 0 Pic 0 col 0 i 0 is 0 warmly 0 affecting 3 <end> <start> [prompt truncated …] <end> Same activations, but with all zeros filtered out: <start> aping 9 the 4 affecting 3 <end> <start> [prompt truncated …] <end> [prompt truncated …] Explanation of neuron 4 behavior: the main thing this neuron does is find

Activations are normalized to a 0-10 scale and discretized to integer values, with negative activation values mapping to 0 and the maximum activation value ever observed for the neuron mapping to 10. For sequences where the neuron's activations are sparse (<20% non-zero), we found it helpful to additionally repeat the token/activation pairs with non-zero activations after the full list of tokens, helping the model to focus on the relevant tokens.
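A minimal sketch of this normalization (our reading of the description above; exact rounding details may differ in the released code):

import numpy as np

def discretize_activations(acts, max_activation):
    # Negative activations map to 0; the neuron's maximum observed activation maps to 10;
    # values in between are scaled linearly and rounded to integers.
    acts = np.asarray(acts, dtype=float)
    scaled = 10.0 * np.clip(acts, 0.0, None) / max_activation
    return np.clip(np.round(scaled), 0, 10).astype(int)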

Step 2: Simulate the neuron's behavior using the explanations

With this method, we aim to answer the question: supposing a proposed explanation accurately and comprehensively explains a neuron's behavior, how would that neuron activate for each token in a particular sequence? To do this, we use the simulator model to simulate neuron activations for each subject model token, conditional on the proposed explanation.

We prompt the simulator model to output an integer from 0-10 for each subject model token. For each predicted activation position, we examine the probability assigned to each number ("0", "1", …, "10"), and use those to compute the expected value of the output. The resulting simulated neuron value is thus on a [0, 10] scale.
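For example, the expected value can be computed from the simulator's log-probabilities over the eleven allowed value tokens (a sketch with an assumed input format):

import math

def expected_simulated_activation(value_logprobs):
    # value_logprobs: dict mapping the strings "0", "1", ..., "10" to the simulator
    # model's log-probabilities at the prediction position.
    probs = {int(tok): math.exp(lp) for tok, lp in value_logprobs.items()}
    total = sum(probs.values())  # renormalize over the allowed values
    return sum(v * p for v, p in probs.items()) / total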

Our simplest method is what we call the "one at a time" method. The prompt consists of some few-shot examples and a single-shot example of predicting an individual token's activation.

We're studying neurons in a neural network. Each neuron looks for some particular thing in a short document. Look at an explanation of what the neuron does, and try to predict its activations on a particular token. The activation format is token<tab>activation, and activations range from 0 to 10. Most activations will be 0. Neuron 1 Explanation of neuron 1 behavior: the main thing this neuron does is find phrases related to community Activations: <start> the 0 sense 0 of 0 together 3 ness 7 in 0 our 0 town 1 is 0 strong 0 . 0 <end> <start> [prompt truncated …] Neuron 4 Explanation of neuron 4 behavior: the main thing this neuron does is find present tense verbs ending in 'ing' Text: Starting from a position of Last token in the text: of Last token activation, considering the token in the context in which it appeared in the text:

Unfortunately, the "one at a time" method is quite slow, as it requires one forward pass per simulated token. We use a trick to parallelize the probability predictions across all tokens by having few-shot examples where activation values switch from being "unknown" to being actual values at a random location in the sequence. This way, we can simulate the neuron with "unknown" in the context while still eliciting model predictions by examining logprobs for the "unknown" tokens, and without the model ever getting to observe any actual activation values for the relevant neuron. We call this the "all at once" method.

We're studying neurons in a neural network. Each neuron looks for some particular thing in a short document. Look at an explanation of what the neuron does, and try to predict how it will fire on each token. The activation format is token<tab>activation, activations go from 0 to 10, "unknown" indicates an unknown activation. Most activations will be 0. Neuron 1 Explanation of neuron 1 behavior: the main thing this neuron does is find phrases related to community Activations: <start> the unknown sense unknown of 0 together 3 ness 7 in 0 our 0 town 1 is 0 strong 0 . 0 <end> <start> [prompt truncated …] <end> [prompt truncated …] Neuron 4 Explanation of neuron 4 behavior: the main thing this neuron does is find present tense verbs ending in 'ing' Activations: <start> Star unknown ting unknown from unknown a unknown position unknown of unknown strength unknown <end>

Due to the speed advantage, we use "all at once" scoring for the remainder of the paper, except for some of the smaller-scale qualitative results. While we used GPT-4 as the simulator model for most of our experiments, the public OpenAI API does not support returning logprobs for newer chat-based models like GPT-4 and GPT-3.5-turbo; older models like the original GPT-3.5 do support logprobs. In addition to being faster, the "all at once" method surprisingly produced scores that were as accurate as the "one at a time" method at predicting human preferences between explanations. ("All at once" actually outperformed "one at a time", but the effect was within noise, and researchers subjectively thought "one at a time" was better, based on relatively small sample sizes.)

Step 3: Score the explanations by comparing the simulated and actual neuron behavior

Conceptually, given an explanation and simulation strategy, we now have a simulated neuron, a "neuron" for which we can predict activation values for any given text excerpt. To score an explanation, we want to compare this simulated neuron against the real neuron for which the explanation was generated. That is, we want to compare two lists of values: the simulated activation values for the explanation over multiple text excerpts, and the actual activation values of the real neuron on the same text excerpts.

A conceptually simple approach is to use explained variance of the true activations by the simulated activations, across all tokens. That is, we could calculate 1 - \frac{\mathbb{E}_t[(s(t) - a(t))^2]}{\mathrm{Var}_t(a(t))}, where s(t) indicates the simulated activation given the explanation, and a(t) indicates the true activation, and expectations are across all tokens from the chosen text excerpts. Note that this isn't an unbiased estimator of true explained variance, since we also use a sample for the denominator. One could improve on our approach by using a much larger sample for estimating the variance term.

However, our simulated activations are on a [0, 10] scale, while real activations have some arbitrary distribution. Thus, we assume the ability to calibrate the simulated neuron's activation distribution to the actual neuron's distribution. We chose to simply calibrate linearly. (We explored more complicated calibration methods, but they typically require many simulations, which are expensive to obtain on the text excerpts being scored. Conceptually, calibration should ideally happen on a different set of text excerpts, so we aren't "cheating" by using the true mean and variance; we empirically studied this cheating effect for differing sample sizes and believe it to be small in practice.) If \rho is the correlation coefficient between the true and simulated activations, then we scale simulations so their mean matches that of the true activations, and their standard deviation is \rho times the standard deviation of the true activations. This maximizes explained variance at \rho^2. (Matching standard deviations instead results in explained variance of 2 \rho - 1 < \rho^2; we also find empirically that it performs worse in ablation-based scoring.)

This motivates our main method of scoring, correlation scoring, which simply reports \rho. Note then that if the simulated neuron behaves identically to the real neuron, the score is 1. If the simulated neuron behaves randomly, e.g. if the explanation has nothing to do with the neuron behavior, the score will tend to be around 0.
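A minimal sketch of correlation scoring and the linear calibration described above (illustrative code, not our released implementation):

import numpy as np

def correlation_score(true_acts, sim_acts):
    # Correlation scoring: report rho between true and simulated activations.
    return float(np.corrcoef(true_acts, sim_acts)[0, 1])

def calibrate_linearly(true_acts, sim_acts):
    # Rescale simulations so their mean matches the true mean and their standard
    # deviation is rho times the true standard deviation; this choice maximizes
    # explained variance at rho**2.
    rho = correlation_score(true_acts, sim_acts)
    sim_std = np.std(sim_acts)
    scale = rho * np.std(true_acts) / sim_std if sim_std > 0 else 0.0
    return np.mean(true_acts) + scale * (np.asarray(sim_acts) - np.mean(sim_acts))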

Validating against ablation scoring

Another way to understand a network is to perturb its internal values during a forward pass and observe the effect. This suggests a more expensive approach to scoring, where we replace the real neuron with the simulated neuron (i.e. ablate its activations to simulated activation values) and check whether the network behavior is preserved.

To measure the extent of the behavioral change from ablating to simulation, we use Jensen-Shannon divergence between the perturbed and original model’s output logprobs, averaged across all tokens. As a baseline for comparison, we perform a second perturbation, ablating the neuron’s activation to its mean value across all tokens. For each neuron, we normalize the divergence of ablating to simulation by the divergence of ablating to the mean. Thus, we express an ablation score as 1 - \frac{\mathbb{E}_x[\textrm{AvgJSD}(m(x, n=s(x)) || m(x))]}{\mathbb{E}_x[\textrm{AvgJSD}(m(x, n=\mu) || m(x))]}, where m(x, n=\ldots) indicates running the model over the text excerpt x with the neuron ablated and returning a predicted distribution at each token, \textrm{AvgJSD} takes the Jensen-Shannon divergences at each token and averages them, s(x) is the linearly calibrated vector of simulated neuron values on the sequence, and \mu is the average activation for that neuron across all tokens. Note that for this ablation score, as for the correlation score, chance performance results in a score of 0.0, and perfect performance results in a score of 1.0.
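A sketch of the ablation score computation, assuming we already have per-excerpt output log-probabilities from the relevant forward passes (original, ablate-to-simulation, ablate-to-mean):

import numpy as np

def avg_jsd(logprobs_p, logprobs_q):
    # Average Jensen-Shannon divergence between two sets of per-token output
    # distributions, each of shape (n_tokens, vocab).
    p, q = np.exp(logprobs_p), np.exp(logprobs_q)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log(a + 1e-12) - np.log(b + 1e-12)), axis=-1)
    return float(np.mean(0.5 * kl(p, m) + 0.5 * kl(q, m)))

def ablation_score(jsd_sim_per_excerpt, jsd_mean_per_excerpt):
    # 1 - E_x[AvgJSD(simulation ablation)] / E_x[AvgJSD(mean ablation)],
    # where each argument is a list of AvgJSD values, one per text excerpt x.
    return 1.0 - float(np.mean(jsd_sim_per_excerpt)) / float(np.mean(jsd_mean_per_excerpt))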

[Figure: ablation score vs. correlation score, random-only and top-and-random scoring]

We find that correlation scoring and ablation scoring have a clear relationship, on average. Thus, the remainder of the paper uses correlation scoring, as it is much simpler to compute. Nevertheless, correlation scoring appears not to capture all the deficits in simulated explanations revealed by ablation scoring. In particular, correlation scores of 0.9 still lead to relatively low ablation scores on average (0.3 for scoring on random-only text excerpts and 0.6 for top-and-random; see below for how these text excerpts are chosen). (This might happen if subtle variations in the activation of a neuron, making the difference, say, between a correlation score of 0.9 and 1.0, played an outsized role in its function within the network.)

Validating against human scoring

One potential worry is that simulation-based scoring does not actually reflect human evaluation of explanations (see here for more discussion). We gathered human evaluations of explanation quality to see whether they agreed with score-based assessment.

We gave human labelers tasks where they see the same text excerpts and activations (shown with color highlighting) as the simulator model (both top-activating and random), and are asked to rate and then rank 5 proposed explanations based on how well those explanations capture the activation patterns. We found the explainer model explanations were not diverse, and so increased explanation diversity by varying the few-shot examples used in the explanation generation prompt, or by using a modified prompt that asks the explainer model for a numbered list of possible explanations in a single completion.

Our results show that humans tend to prefer higher-scoring explanations over lower-scoring ones, with the consistency of that preference increasing as the size of the score gap increases.

Algorithm parameters and details

Throughout this work, unless otherwise specified, we use GPT-2 pretrained models as subject models and GPT-4 for the explainer and simulator models. Unlike GPT-2, GPT-4 is a model trained to follow instructions via RLHF.

For both generating and simulating explanations, we take text excerpts from the training split of the subject model's pre-training dataset (e.g. WebText, for GPT-2 models). We choose random 64-token contiguous subsequences of the documents as our text excerpts, formatted identically to how the models were trained, with the exception that we never cross document boundaries. (Note that since GPT-2 models use byte-pair encoders, our texts sometimes have mid-character breaks; see below for more discussion. GPT-2 was trained to sometimes see multiple documents, separated by a special end-of-text token. In our work, we ensure all 64 tokens are within the same document.)

When generating explanations, we use 5 "top-activating" text excerpts, which have at least one token with an extremely large activation value, as determined by the quantile of the max activation. This was because we found empirically that:

Thus, for the remainder of this paper, explanations are always generated from 5 top-activating sequences unless otherwise noted. We set a top quantile threshold of 0.9996, taking the 20 sequences containing the highest activations out of 50,000 total sequences. We sample explanations at temperature 1.

For simulation and scoring, we report on 5 uniformly random text excerpts ("random-only"). The random-only score can be thought of as an explanation’s ability to capture the neuron’s representation of features in the pre-training distribution. While random-only scoring is conceptually easy to interpret, we also report scores on a mix of 5 top-activating and 5 random text excerpts ("top-and-random"). The top-and-random score can be thought of as an explanation’s ability to capture the neuron’s most strongly represented feature (from the top text excerpts), with a penalty for overly broad explanations (from the random text excerpts). Top-and-random scoring has several pragmatic advantages over random-only:

Note that "random-only" scoring with small sample size risks failing to capture behavior, due to lacking both tokens with high simulated activations and tokens with high real activations. "Top-and-random" scoring addresses the latter, but causes us to penalize falsely low simulations more than falsely high simulations, and thus tends to accept overly broad explanations. A more principled approach which gets the best of both worlds might be to stick to random-only scoring, but increase the number of random-only text excerpts in combination with using importance sampling as a variance reduction strategy. Unfortunately, initial attempts at this using a moderate increase in number of text excerpts did not prove to be useful.

Below we show some prototypical examples of neuron scoring.

[Carousel of prototypical scoring examples, including a neuron with a particularly good top-and-random score but a bad random-only score, due to behavior in the low-activation regime]

Results

Notes on interpretation: Throughout this section, our results may have been obtained using slightly differing methodologies (e.g. different explainer models, different prompts, etc.). Thus, scores are not always comparable across graphs. In all plots, error bars correspond to a 95% confidence interval for the mean. In most places, we calculate this using 1.96 times the standard error of the mean (SEM), or a strictly more conservative statistic. Where needed, we estimate it via bootstrap resampling.

Overall, for GPT-2, we find an average score of 0.151 using top-and-random scoring, and 0.037 for random-only scoring. Scores generally decrease when going to later layers.

[Figure: scores by layer, random-only and top-and-random scoring]

Note that individual scores for neurons may be noisy, especially for random-only scoring. With that in mind, out of a total of 307,200 neurons, 5,203 (1.7%) have top-and-random scores above 0.7 (explaining roughly half the variance), using our default methodology. With random-only scoring, this drops to 732 neurons (0.2%). Only 189 neurons (0.06%) have top-and-random scores above 0.9, and 86 (0.03%) have random-only scores above 0.9.

Unigram baselines

To understand the quality of our explanations in absolute terms, it is helpful to compare with a baseline that does not depend on language models' ability to summarize or simulate activations. For this reason, we examine several baseline methods that directly predict activation based on a single token, using either model weights, or activation data aggregated across held out texts. For each neuron, these methods give one predicted activation value per token in the vocabulary, which amounts to substantially more information than the short natural language explanation produced by a language model. For that reason, we also used language models to briefly summarize these lists of activation values, and used that as an explanation in our typical simulation pipeline.

We're studying neurons in a neural network. Each neuron looks for some particular kind of token (which can be a word, or part of a word). Look at the tokens the neuron activates for (listed below) and summarize in a single sentence what the neuron is looking for. Don't list examples of words. Tokens: 'the', 'cat', 'sat', 'on', 'the', 'mat' Explanation: This neuron is looking for

Token-based prediction using weights

The logit lens and related techniques relate neuron weights to tokens, to try to interpret neurons in terms of tokens that cause them to activate, or tokens that the neuron causes the model to sample. Fundamentally, these techniques depend on multiplying the input or output weights of a neuron by either the embedding or unembedding matrix.

To linearly predict activations for each token, we multiply each token embedding by the pre-MLP layer norm gain and neuron input weights (W_{in}). This methodology allows for a single scalar value to be assigned to each token in the vocabulary. The predicted activation with this method depends only on the current token and not on the preceding context. These scalar values are used directly to predict the activation for each token, with no summarization. These weight-based linear predictions are mechanistic or “causal” explanations in that the weights directly affect the neuron’s pattern of inputs, and thus its pattern of activations.
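A sketch of this weight-based unigram prediction (assumed tensor layouts; names are illustrative):

import numpy as np

def linear_token_predictions(W_embed, ln_gain, W_in, neuron_idx):
    # W_embed: token embedding matrix, shape (vocab, d_model)
    # ln_gain: pre-MLP layer norm gain, shape (d_model,)
    # W_in:    MLP input weights for the layer, shape (d_model, d_mlp)
    # Returns one predicted (unnormalized) activation value per token in the vocabulary.
    return (W_embed * ln_gain) @ W_in[:, neuron_idx]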

The linear token-based prediction baseline outperforms activation-based explanation and scoring for the first layer, predicting activations almost perfectly (unsurprising given that only the first attention layer intervenes between the embedding and first MLP layer). For all subsequent layers, GPT-4-based explanation and scoring predicts activations better than the linear token-based prediction baseline. (We also tried linear prediction including the position embedding for each position in the text excerpt plus the token embedding; this linear token- and position-based prediction baseline resulted in very small quantitative improvements and no qualitative change.)

The linear token-based prediction baseline is a somewhat unfair comparison, as the "explanation length" of one scalar value per token in the vocabulary is substantially longer than GPT-based natural language explanations. Using a language model to compress this information into a short explanation and simulate that explanation might act as an “information bottleneck” that affects the accuracy of predicted activations. To control for this, we try a hybrid approach, applying GPT-4-based explanation to the list of tokens with the highest linearly predicted activations (corresponding to 50 out of the top 100 values), rather than to top activations. These explanations score worse than either linear token-based prediction or activation-based explanations.

[Figure: random-only and top-and-random scoring results]

Token-based prediction using lookup tables

The token-based linear prediction baseline might underperform the activation-based baseline for one of several reasons. First, it might fail because multi-token context is important (for example, many neurons are sensitive to multiple-token phrases). Second, it might fail because intermediate processing steps between the token embedding and W_{in} are important, and the linear prediction is a poor representation of the true causal impact of a token on a neuron's activity.

To evaluate the second possibility, we construct a second, “correlational” baseline. For this baseline, we compute the mean activation per-token and per-neuron over a large corpus of held-out internet text. We then use this information to construct a lookup table. For each token in a text excerpt and for each neuron, we predict the neuron's activation using the token lookup table, independent of the preceding tokens. Again, we do not summarize the contents of this lookup table, or use a language model to simulate activations.
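A sketch of building and applying the token lookup table (illustrative code):

from collections import defaultdict
import numpy as np

def build_lookup_table(token_ids, activations):
    # Mean activation per token over a held-out corpus; token_ids and activations
    # are flat, token-aligned arrays covering the corpus.
    sums, counts = defaultdict(float), defaultdict(int)
    for tok, act in zip(token_ids, activations):
        sums[tok] += float(act)
        counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}

def predict_with_lookup(table, token_ids, default=0.0):
    # Predict each token's activation independently of the preceding context.
    return np.array([table.get(tok, default) for tok in token_ids])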

The token lookup table baseline is much stronger than the token-based linear prediction baseline, substantially outperforming activation-based explanation and simulation on average. We apply the same explanation technique as with the token-based linear baseline to measure how the information bottleneck from explanation and simulation using GPT-4 affects the accuracy of predicted activations.

The resulting token lookup table-based explanation results in a score similar to our activation-based explanation on top-and-random scoring, but outperforms activation-based explanations on random-only scoring. However, we are most interested in neurons that encode complex patterns of multi-token context rather than single tokens. Despite worse performance on average, we find many interesting neurons where activation-based explanations have an advantage over token-lookup-table-based explanations. We are also able to improve over the token-lookup-table-based explanation by revising explanations. In the long run, we plan to use methods that combine both token-based and activation-based information.

[Figure: random-only and top-and-random scoring results]

Next-token-based explanations

We noticed that some neurons appear to encode the predicted next token rather than the current token, particularly in later layers (see the “from”-predicting neuron described below). Our baseline methodology, which prompts GPT-4 with (preceding token, activation) pairs, is unable to capture this behavior. As an alternative, we prompt GPT-4 to explain and simulate the neuron’s activations based on the tokens following its highest activations by using (next token, activation) pairs instead. This approach is more successful for a subset of neurons, particularly in later layers, and achieves similar scores on average to the baseline method in the last few layers.
[Figure: random-only and top-and-random scoring results]

Revising explanations

Explanation quality is fundamentally bottlenecked by the small set of text excerpts and activations shown in a single explanation prompt, which is not always sufficient to explain a neuron's behavior. Iterating on explanations would potentially let us leverage more information effectively, relying on the emergent ability of large language models to use reasoning to improve responses at test time. (As noted above, in the non-iterative setting we find the explainer model is unable to effectively make use of additional text excerpts in context.)

One particular issue we find is that overly broad explanations tend to be consistent with the top-activating sequences. For instance, we found a "not all" neuron which activates on the phrase "not all" and some related phrases. However, the top-activating text excerpts chosen for explanation do not falsify a simpler hypothesis, that the neuron activates on the word "all". The model thus generates an overly broad explanation: 'the term "all" along with related contextual phrases'. (When one of us looked at this neuron without being careful, we too came to this conclusion. It was only after testing sequences like "All students must turn in their final papers by Monday" that we realized the initial explanation was too broad.) Nevertheless, observing activations for text excerpts containing "all" in different contexts reveals that the neuron is actually activating for "all", but only when part of the phrase "not all". From this example and other similar examples, we concluded that sourcing new evidence beyond the top and random activation sequences would be helpful for more fully explaining some neurons.

To make things worse, we find that our explainer technique often fails to take into account negative evidence (i.e. examples of the neuron not firing which disqualify certain hypotheses). With the "not all" neuron, even when we manually add negative evidence to the explainer context (i.e. sequences that include the word "all" with zero activation), the explainer ignores these and produces the same overly broad explanation. The explainer model may be unable to pay attention to all facets of the prompt in a single forward pass.

To address these issues, we apply a two-step revision process.

The first step is to source new evidence. We use a few-shot prompt with GPT-4 to generate 10 sentences which match the existing explanation. For instance, for the "not all" neuron, GPT-4 generates sentences which use the word "all" in a non-negated context.

The task format is as follows. description :: <answer>example sentence that fits that description</answer> The answer is always at least one full sentence, not just a word or a phrase. The following tasks have only one answer each enclosed in <answer></answer> tags. negation of instances of the word "stop" or conceptually similar words (e.g. "kept", "warrant") that imply something coming to an end or being prevented. :: <answer>been that way for more than 30 years but that doesn't stop successive governments in countries around the globe</answer> words related to providing or contributing something (e.g. "contribute," "contributor," "contribution"). :: <answer>The new information showed that during the last three month of the year the CPI fell by 0.3 percent. The drop was largely contributed to a 25 percent decrease in the price of vegetables over</answer> language related to leadership or administrative roles (e.g. "treasurer," "governor") as well as language related to game mechanics or design (e.g. "mechanics"). :: <answer>King Roo, after a rather disastrous incident involving some of his Dice-a-Roo prizes, is hiring a new treasurer! Before getting the</answer> references to prominent figures in the hip hop music industry (e.g. artist names, album titles, song titles). :: <answer>The album was released on October 22, 2002, by Ruff Ryders Entertainment and Interscope Records. The album debuted at number one on the US Billboard 200 chart, selling 498,000 copies in its first week.</answer> This next task has exactly 10 answer(s) each enclosed in <answer></answer> tags. Remember, the answer is always at least one full sentence, not just a word or a phrase. positions in the sentence where the next word is likely to be "an" ::

The hope is to find false positives for the original explanation, i.e. sentences containing tokens where the neuron's real activation is low, but the simulated activation is high. However, we do not filter generated sentences for this condition. In practice, roughly 42% of generated sequences result in false positives for their explanations. For 86% of neurons at least one of the 10 sequences resulted in a false positive. This provides an independent signal that our explanations are often too inclusive. We split the 10 generated sentences into two sets: 5 for revision and 5 for scoring. Once we have generated the new sentences, we perform inference using the subject model and record activations for the target neurons.
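A sketch of the false-positive check on a generated sentence (the thresholds here are illustrative placeholders, not the values we used):

def has_false_positive(real_acts, sim_acts, real_thresh=0.1, sim_thresh=0.5):
    # A generated sentence is a false positive for an explanation if it contains a
    # token whose simulated activation is high but whose real activation is low.
    return any(real < real_thresh and sim > sim_thresh
               for real, sim in zip(real_acts, sim_acts))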

The second step is to use a few-shot prompt with GPT-4 to revise the original model explanation. The prompt includes the evidence used to generate the original explanation, the original explanation, the new generated sentences, and the ground truth activations for those sentences. Once we obtain a revised explanation, we score it on the same set of sequences used to score the original explanation. We also score the original explanation and the revised explanation on the same set augmented with the scoring split of the new evidence.

The following solutions are the output of a Bayesian reasoner which is optimized to explain the function of neurons in a neural network using limited evidence. Each neuron looks for some particular thing in a short passage. Neurons activate on a word-by-word basis. Also, neuron activations can only depend on words before the word it activates on, so the explanation cannot depend on words that come after, and should only depend on words that come before the activation. The reasoner is trying to revise the explanation for neuron A. The neuron activates on the following words (activating word highlighted with **): """ But that didn't **stop** it becoming one of the most popular products on the shelf. Technology has changed quite a bit over Vernon Cook's lifetime, but that hasn't **stopped** him from embracing the advance. The Storm and Sharks don't have the same storied rivalry as some of the grand finalists in years gone by, but that hasn't **halted** their captivating contests in recent times. """ The current explanation is: the main thing this neuron does is find language related to something being stopped, prevented, or halted. The reasoner receives the following new evidence. Activating words are highlighted with **. If no words are highlighted with **, then the neuron does not activate on any words in the sentence. """ But that stopped it becoming one of the most popular products on the shelf. Technology has changed quite a bit over Vernon Cook's lifetime, and that stopped him from embracing the advance. I have to stop before I get there. """ In light of the new evidence, the reasoner revises the current explanation to: the main thing this neuron does is find the negation of language related to something being stopped, prevented, or halted (e.g. "didn't stop") [prompt truncated …] The reasoner is trying to revise the explanation for neuron D. The neuron activates on the following words (activating word highlighted with **): """ Kiera wants to make sure she has strong bones, so she drinks 2 liters of milk every week. After 3 weeks, how many liters of milk will Kiera drink? Answer: After 3 weeks, Kiera will drink **4** liters of milk. Ariel was playing basketball. 1 of her shots went in the hoop. 2 of her shots did not go in the hoop. How many shots were there in total? Answer: There were **2** shots in total. The restaurant has 175 normal chairs and 20 chairs for babies. How many chairs does the restaurant have in total? Answer: **295** Lily has 12 stickers and she wants to share them equally with her 3 friends. How many stickers will each person get? Answer: Each person will get **5** stickers. """ The current explanation is: the main thing this neuron does is find numerical answers in word problems.. The reasoner receives the following new evidence. Activating words are highlighted with **. If no words are highlighted with **, then the neuron does not activate on any words in the sentence. """ Kiera wants to make sure she has strong bones, so she drinks 2 liters of milk every week. After 3 weeks, how many liters of milk will Kiera drink? Answer: After 3 weeks, Kiera will drink 6 liters of milk. Ariel was playing basketball. 1 of her shots went in the hoop. 2 of her shots did not go in the hoop. How many shots were there in total? Answer: There were 3 shots in total. The restaurant has 175 normal chairs and 20 chairs for babies. How many chairs does the restaurant have in total? Answer: 195 Lily has 12 stickers and she wants to share them equally with her 3 friends. 
How many stickers will each person get? Answer: Each person will get 4 stickers. """ In light of the new evidence, the reasoner revises the current explanation to: the main thing this neuron does is find

We find that the revised explanations score better than the original explanations on both the original scoring set and the augmented scoring set. As expected, the original explanations score noticeably worse on the augmented scoring set than the original scoring set.

[Figure: top-and-random scoring and top-and-random-plus-generated scoring results]

We find that revision is important: a baseline of re-explanation with the new sentences ("reexplanation") but without access to the old explanation does not improve upon the original explanations. As a follow-up experiment, we attempted revision using a small random sample of sentences with nonzero activations ("revision_rand"). We find that this strategy improves explanation scores almost as much as revision using generated sentences. We hypothesize that this is partly because random sentences are also a good source of false positives for initial explanations: roughly 13% of random sentences contain false positive activations for the original model explanations.

Overall, revision lets us exceed scores of the token lookup table explanations for top-and-random but not random-only scoring, for which the improvement is limited. Our revision process is also agnostic to the explanation it starts with, so we could likely also start from our strong unigram baselines and revise based on relevant sentences. We suspect this would outperform our existing results and plan to try techniques like this in the future.

[Figure: random-only and top-and-random scoring results]

Qualitatively, the main pattern we observe is that the original explanation is too broad and the revised explanation is too narrow, but that the revised explanation is closer to the truth. For instance, for layer 0 neuron 4613 the original explanation is "words related to cardinal directions and ordinal numbers". GPT-4 generated 10 sentences based on this explanation that included many words matching this description which ultimately lacked significant activations, such as "third", "eastward", "southwest". The revised explanation is "this neuron activates for references to the ordinal number 'Fourth'", which gives far fewer false positives. Nevertheless, the revised explanation does not fully capture the neuron's behavior as there are several activations for words other than fourth, like "Netherlands" and "white".

We also observe several promising improvements enabled by revision that target problems with the original explanation technique. For instance, a common neuron activation pattern is to activate for a word but only in a very particular context. An example of this is the "hypothetical had" neuron, which activates for the word "had" but only in the context of hypotheticals or situations that might have occurred differently (e.g. "I would have shut it down forever had I the power to do so."). The original model explanation fails to pick up on this pattern and produces the overly-broad explanation, "the word 'had' and its various contexts." However, when provided with sentences containing false positive activations (e.g. "He had dinner with his friends last night") the reviser is able to pick up on the true pattern and produce a corrected explanation. Some other neuron activation patterns that the original explanation fails to capture but the revised explanation accounts for are "the word 'together' but only when preceded by the word 'get'" (e.g. "get together", "got together"), and "the word 'because' but only when part of the 'just because' grammar structure" (e.g. "just because something looks real, doesn't mean it is").

In the future, we plan on exploring different techniques for improving evidence sourcing and revision such as sourcing false negatives, applying chain of thought methods, and fine-tuning.

Finding explainable directions

Explanation quality is also bottlenecked by the extent to which neurons are succinctly explainable in natural language. We found many of the neurons we inspected were polysemantic, potentially due to superposition. This suggests a different way to improve explanations by improving what we're explaining. We explore a simple algorithm that leverages this intuition, using our automated methodology for a possible angle of attack on superposition.

The high-level idea is to optimize a linear combination of neurons. (This method can also be applied to the residual stream, because it does not assume a privileged basis at all.) Given a vector of activations a and unit vector \theta (the direction), we define a virtual neuron which has activations a_\theta = a^T \Sigma^{-1/2} \theta, where \Sigma is the covariance matrix of a. (We found in early experiments that without reparameterization by \Sigma^{-1/2}, high-variance neurons would disproportionately dominate the selected vectors, causing a reduction in sample diversity. The reparameterization using \Sigma^{-1/2} ensures that the initialization favors lower-variance directions and that step sizes are scaled appropriately. Despite the reparameterization, we still observe some amount of collapse.)
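A sketch of computing the virtual neuron's activations under this reparameterization (illustrative code, assuming activations are collected into a matrix):

import numpy as np

def virtual_neuron_activations(A, theta, eps=1e-6):
    # A:     MLP activations, shape (n_samples, n_neurons)
    # theta: unit-norm direction, shape (n_neurons,)
    # Returns a_theta = a^T Sigma^{-1/2} theta for each sample.
    Sigma = np.cov(A, rowvar=False) + eps * np.eye(A.shape[1])
    eigvals, eigvecs = np.linalg.eigh(Sigma)           # Sigma is symmetric PSD
    Sigma_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return A @ Sigma_inv_sqrt @ theta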

Starting with a uniformly random direction \theta, we then optimize using coordinate ascent, alternating the following steps:

  1. Explanation step: Optimize over explanations by searching for an explanation that explains a_\theta well (i.e. achieves a high explainer score).
  2. Update step: Optimize over \theta by computing the gradient of the score and performing gradient ascent. Note that our correlation score is differentiable with respect to \theta, so long as the explanation and simulated values are fixed.

For the explanation step, the simplest baseline is to simply use the typical top-activation-based explanation method (where activations are for the virtual neuron at each step). However, to improve the quality of the explanation step, we use the revisions with generated negatives, and also reuse high-scoring explanations from previous steps.

We ran this algorithm on GPT-2 small's layer 10 (the penultimate MLP layer), which has 3072 neurons. For each 3072-dimensional direction, we ran 10 rounds of coordinate ascent. For the gradient ascent we use Adam with a learning rate of 1e-2, \beta_1 of 0.9, and \beta_2 of 0.999. We also rescale \theta to be unit norm each iteration.
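A sketch of one update step (gradient ascent on the correlation score, with the explanation and simulated values held fixed); this is illustrative code assuming whitened activations are precomputed, not our released implementation:

import torch

def update_step(A_white, theta, sim_vals, lr=1e-2, inner_steps=1):
    # A_white:  activations already reparameterized by Sigma^{-1/2}, shape (n_tokens, n_neurons)
    # theta:    current direction, shape (n_neurons,)
    # sim_vals: simulated activations for the fixed explanation, shape (n_tokens,)
    theta = theta.clone().requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=lr, betas=(0.9, 0.999))
    sim = (sim_vals - sim_vals.mean()) / sim_vals.std(unbiased=False)
    for _ in range(inner_steps):
        a = A_white @ theta
        a = (a - a.mean()) / a.std(unbiased=False)
        rho = (a * sim).mean()      # correlation score, differentiable w.r.t. theta
        opt.zero_grad()
        (-rho).backward()           # ascend the score by minimizing its negative
        opt.step()
    with torch.no_grad():
        theta /= theta.norm()       # rescale theta to unit norm each iteration
    return theta.detach()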

We find that the average top-and-random score after 10 iterations is 0.718, substantially higher than the average score for random neurons in this layer (0.147), and higher than the average score for random directions before any optimization (0.061).

One potential problem with this procedure is that we could repeatedly converge upon the same explainable direction, rather than finding a diverse set of local maxima. To check the extent to which this is happening, we measure and find that the resulting directions have very low cosine similarity with each other.

We also inspect the neurons which contribute most to \Sigma^{-1/2}\theta and qualitatively observe that they are often completely semantically unrelated, suggesting that the directions found are not just specific neurons or small combinations of semantically similar neurons. If we truncate to only the top n neurons by correlation of their activations with the direction's activations, we find that a very large number of neurons is needed to recover the score (with the explanation fixed). (We also tried truncating based on magnitude of coefficient, which resulted in even poorer scores.)

Example top-activating text excerpts for one learned direction and for its most correlated neuron (weather-forecast passages):
Today Considerable clouds this morning. Some decrease in clouds later in the day. A stray shower or thunderstorm is possible. High near 85F. Winds SSE at 5 to 10 mph.. Tonight Partly cloudy skies. A stray shower or thunderstorm is possible. Low 71
Scattered showers and thunderstorms. A few storms may be severe. High 78F. Winds SSW at 5 to 10 mph. Chance of rain 50%.. Tonight Thunderstorms likely this evening. Then the chance of scattered thunderstorms overnight. A few storms may be severe. Low 59
, IA (52732) Today Rain early...then remaining cloudy with showers in the afternoon. Thunder possible. High 66F. Winds light and variable. Chance of rain 80%.. Tonight Thunderstorms likely. Low around 60F. Winds SSE at 5 to 10 mph
, OK (74078) Today Cloudy early with peeks of sunshine expected late. High 79F. Winds SSE at 5 to 10 mph.. Tonight A shower or two possible this evening with partly cloudy skies overnight. Low 57F. Winds E at 5 to 10 mph
17801) Today A mix of clouds and sun during the morning will give way to cloudy skies this afternoon. Slight chance of a rain shower. High near 65F. Winds light and variable.. Tonight Rain. Low 56F. Winds light and variable. Chance of rain 100

One major limitation of this method is that care must be taken when optimizing for a learned proxy of explainability. There may also exist theoretical limitations to the extent to which we can give faithful human understandable explanations to directions in models.

Explainer model scaling trends

One important hope is that explanations improve as our AI assistants get better. Here, we experiment with different explainer models, while holding the simulator model fixed at GPT-4. We find explanations improve smoothly with explainer model capability, and improve relatively evenly across layers.

[Figure: random-only and top-and-random scoring results]

We also obtained a human baseline from labelers asked to write explanations from scratch, using the same set of 5 top-activating text excerpts that the explainer models use. Our labelers were non-experts who received instructions and a few researcher-written examples, but no deeper training about neural networks or related topics.

We see that human performance exceeds the performance of GPT-4, but not by a huge margin. Human performance is also low in absolute terms, suggesting that the main barrier to improved explanations may not simply be explainer model capabilities.

Simulator model scaling trends

With a poor simulator, even a very good explanation will get low scores. To get some sense for simulator quality, we looked at the explanation score as a function of simulator model capability. We find steep returns on top-and-random scoring, and plateauing scores for random-only scoring.

[Figure: random-only and top-and-random scoring results]

Of course, simulation quality cannot itself be measured using explanation scores. However, we can verify that score-induced comparisons from larger simulators agree more with humans, using the human comparison data described earlier. Here, the human baseline comes from human-human agreement rates. Scores using GPT-4 as a simulator model are approaching, but still somewhat below, human-level agreement rates with other humans.

Our methodology can quickly give insight into what aspects of subject models increase or decrease explanation scores. Note that it’s possible some part of these trends reflects our particular explanation method’s strengths and weaknesses, rather than the degree to which a subject model neuron is “interpretable,” or understandable by a human with a moderate amount of effort. If our explanation methodology improved sufficiently, this approach could give insight into what aspects of models increase or decrease interpretability.

Subject model size

One natural question is whether larger, more capable models are more or less difficult to understand than smaller models. Therefore, we measure explanation scores for subject models in the GPT-3 series, ranging in size from 98K to 6.7B parameters. In general, we see a downwards trend in the explainability of neurons with increasing model size using our method, with an especially clear trend for random-only scoring.

[Figure: random-only and top-and-random scoring results]

To understand the basis for this trend, we examine explainability by layer. From layer 16 onward, average explanation scores drop robustly with increasing depth, using both top-and-random and random-only scoring. For shallower layers, top-and-random scores also decrease with increasing depth (however, we find the second layer (layer 1) of many large models to have very low scores, potentially related to the fact that they contain many dead neurons), while random-only scores decrease primarily with increasing model size. Because larger models have more layers, these trends together mean that explanation scores decline with increasing model size.

[Figure: random-only and top-and-random scoring results]

Note that these trends may be artificial, in the sense that they mostly reflect limitations of our current explanation generation technique. Our experiments on "next token"-based explanation lend credence to the hypothesis that later layers of larger models have neurons whose behavior is understandable but difficult for our current methods to explain.

Subject model activation function

One interesting question is whether the architecture of a model affects its interpretability, especially with respect to the model’s sparsity. To study this, we train some small (~3M parameter) models from scratch using a sparse activation function, which applies a standard activation function but then only keeps a fixed number of top activations in each layer, setting the rest to zero. We try this for different levels of activation density: 1 (the baseline), 0.1, 0.01, and 0.001 (top 2 neurons). Our hope is that activation sparsity should discourage extreme polysemanticity.
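A sketch of such a sparse activation function (our reading of the description above; the exact formulation used in our experiments may differ, e.g. in whether top-k is applied per token or per layer):

import torch
import torch.nn.functional as F

def topk_sparse_activation(pre_acts, density, base_act=F.relu):
    # Apply a standard activation function, then keep only the top-k activations
    # (here per token position), zeroing the rest; k is a fixed fraction of the MLP width.
    acts = base_act(pre_acts)
    k = max(1, int(density * acts.shape[-1]))
    topk_vals, topk_idx = acts.topk(k, dim=-1)
    sparse = torch.zeros_like(acts)
    return sparse.scatter(-1, topk_idx, topk_vals)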

[Figure: random-only and top-and-random scoring results]

Increasing activation sparsity consistently increases explanation scores, but hurts pre-training loss. Each parameter doubling is worth approximately 0.17 nats of loss, so the 0.1 sparsity models are roughly 8.5% less parameter-efficient, and the 0.01 sparsity models are roughly 40% less parameter-efficient. We hope there is low-hanging fruit for reducing this "explainability tax". We also find that ReLU consistently yields better explanation scores than GELU.
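One way to unpack this arithmetic (our reconstruction, assuming the stated scaling trend of roughly 0.17 nats of loss per parameter doubling): a loss penalty of \Delta L nats is equivalent to shrinking the effective parameter count by a factor of 2^{-\Delta L / 0.17}, so

2^{-\Delta L / 0.17} \approx 0.915 \;(\text{8.5\% less efficient}) \;\Rightarrow\; \Delta L \approx 0.17 \log_2(1/0.915) \approx 0.02 \text{ nats}
2^{-\Delta L / 0.17} \approx 0.60 \;(\text{40\% less efficient}) \;\Rightarrow\; \Delta L \approx 0.17 \log_2(1/0.60) \approx 0.13 \text{ nats}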

Subject model training time

Another question is how training time affects explanation scores for a fixed model architecture. To study this, we look at explanation scores for intermediate checkpoints of models in the GPT-3 series corresponding to one half and one quarter of the way through training. (Each of these models was trained for a total of 300B tokens.)
[Figure: random-only and top-and-random scoring results]

Training more tends to improve top-and-random scores but decrease random-only scores. (One extremely speculative explanation for this is that features get cleaner/better with more training, causing top-and-random scores to increase, but that there are also more interfering features due to superposition, causing random-only scores to decrease.) At a fixed level of performance on loss, smaller models perhaps tend to have higher explanation scores.

We also find significant positive transfer for explanations between different checkpoints of the same training run. For example, scores for a quarter-trained model seem to drop by less than 25% when using explanations for a fully-trained model, and vice versa. This suggests a relatively high degree of stability in feature-neuron correspondence.

Qualitative results

Interesting neurons

Throughout the project we found many interesting neurons. GPT-4 was able to find explanations for non-trivial neurons that we thought were reasonable upon inspection, such as a "simile" neuron, a neuron for phrases related to certainty and confidence, and a neuron for things done correctly.

One successful strategy for finding interesting neurons was looking for those which were poorly explained by their token-space explanations, compared with their activation-based explanations. This led us to concurrently discover context neurons which activate densely in certain contexts (we called these "vibe" neurons) and many neurons which activated on specific words at the beginning of documents.

Another related strategy that does not rely on explanation quality was to look for context-sensitive neurons that activate differently when the context is truncated. This led us to discover a pattern break neuron which activates for tokens that break an established pattern in an ongoing list (shown below on some select sentences) and a post-typo neuron which activates often following strange or truncated words. Our explanation model is generally unable to get the correct explanation on interesting context-sensitive neurons.

We noticed a number of neurons that appear to activate in situations that match a particular next token, for example a neuron that activates where the next token is likely to be the word “from”. Initially we hypothesized that these neurons might be making a prediction of the next token based on other signals. However, ablations on some of these neurons do not match this story. The “from” neuron appears to actually slightly decrease the probability of “from” being output. At the same time it increases the probability of variations of the word “form”, suggesting one of the things it is doing is accounting for the possibility of a typo. As it is in a late layer (44 out of 48), this neuron may be responding to situations where the network already places high probability on the word “from”. We have not investigated enough to have a clear picture of what is going on, but it is possible that many neurons encode particular subtle variations on the output distribution conditioned on a particular input rather than performing the obvious function suggested by their activations.

We found some interesting examples of neurons that respond to specific kinds of repetition. One neuron activates on repeated occurrences of a token, with stronger activations the more times the token has occurred. An interesting example of a polysemantic neuron is one that fires both for the phrase "over and over again" and for "things repeated right before a non-repeated number", possibly because "over and over again" itself includes repetition. We also found two neurons that seem mostly to respond to a second mention of a surname when it is combined with a different first name. It is possible that these neurons are responding to induction heads.

Overall, our subjective sense was that neurons in more capable models tended to be more interesting, although we spent the majority of our effort looking at GPT-2 XL neurons rather than more modern models.

For more interesting neurons, see our neuron visualization website.

Explaining constructed puzzles

When thinking about how to qualitatively understand our explanation methodology, we often ran into two problems. First, we do not have any ground truth for the explanations or scores. Even human-written explanations could be incorrect, or at the very least fail to completely explain the behavior. Furthermore, it is often difficult to tell whether a better explanation exists. Second, we have no control over the complexity or types of patterns neurons encode, and no guarantee that any simple explanation exists.

To address these drawbacks, we created "neuron puzzles": synthetic neurons with human-written explanations and curated evidence. To create a neuron puzzle, we start with a human-written explanation, taken to be ground truth. Next, we gather text excerpts and manually label their tokens with "activations" (not corresponding to the activations of any real network, because these are synthetic neurons) according to the explanation. Thus, each puzzle is formed from an explanation and evidence supporting that explanation (i.e. a set of text excerpts with activations).
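
Concretely, a puzzle can be thought of as a small labeled dataset. The structure below is a hypothetical sketch with field names of our own choosing (the released puzzle code may organize things differently), and both the explanation wording and the toy excerpt are invented for illustration rather than taken from the actual "incorrect historical years" puzzle.

      from dataclasses import dataclass, field

      @dataclass
      class NeuronPuzzle:
          name: str
          explanation: str                          # ground-truth, human-written explanation
          excerpts: list[list[tuple[str, float]]]   # per-excerpt (token, synthetic activation) pairs
          false_explanations: list[str] = field(default_factory=list)  # for the multiple-choice evaluation below

      puzzle = NeuronPuzzle(
          name="incorrect historical years",
          explanation="activates on years that are historically incorrect in context",
          excerpts=[[("World", 0.0), (" War", 0.0), (" II", 0.0), (" ended", 0.0),
                     (" in", 0.0), (" 1951", 9.0), (".", 0.0)]],
      )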

To evaluate the explainer, we provide the tokens and synthetic activations to the explainer and observe whether the model-generated explanation matches the original puzzle explanation. We can vary puzzle difficulty and write new puzzles to test for certain patterns that interest us. A collection of these puzzles thus forms a useful evaluation for iterating on our explainer technique. We created a total of 19 puzzles, many inspired by neurons we and others found, including a puzzle based on the "not all" neuron described earlier and the 'an' prediction neuron in GPT-2 Large.

Puzzle examples: [interactive carousel of example puzzles]

For each puzzle, we ensured that the evidence is sufficient for a human to recover the original explanation. However, one can imagine creating puzzles that humans generally cannot solve in order to evaluate a superhuman explainer technique. Our baseline explainer methodology solves 5 of the 19 puzzles. While it is skilled at picking out broad patterns in the tokens with high activations, it consistently fails to consider tokens in context or to apply background knowledge. The explainer is also poor at incorporating negative evidence into its explanations. For instance, for the "incorrect historical years" puzzle, the explainer recognizes that the neuron activates on numerical years but fails to account for the text excerpts in which numerical years appear and the neuron does not activate. These failure cases motivate our experiments with revisions. Indeed, when we apply our revision technique to the puzzles, we solve an additional 4.

These neuron puzzles also provide a weak signal about whether the scorer is an effective discriminator between proposed explanations for more complex neurons than the ones we have currently found. We created a multiple-choice version for each puzzle by writing a series of false explanations. For example, "this neuron responds to important years in American or European history" was a false explanation for the "incorrect historical years" puzzle. One of the false explanations is always a baseline of the three most common tokens with high activations. For each puzzle, we score the ground-truth explanation and all of the false explanations on the sequences and activations for that puzzle and then record the number of times that the ground-truth explanation has the highest score. For 16/19 puzzles, the ground-truth explanation is ranked highest, and for 18/19 the ground-truth explanation ranks in the top two. Compared with the 5/19 puzzles that the explainer solves, this evaluation suggested to us that the explainer is currently more of a bottleneck than the scorer. This may reflect the fact that detecting a pattern is more difficult than verifying a pattern given an explanation.
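
Continuing the NeuronPuzzle sketch above, the multiple-choice check can be written as follows, assuming a hypothetical score(explanation, excerpts) hook that simulates activations from an explanation and compares them against the puzzle's synthetic activations:

      def ground_truth_ranks_first(puzzle: "NeuronPuzzle", score) -> bool:
          """Return True if the ground-truth explanation outscores every false
          explanation on this puzzle's synthetic activations."""
          true_score = score(puzzle.explanation, puzzle.excerpts)
          return all(true_score > score(fe, puzzle.excerpts)
                     for fe in puzzle.false_explanations)

      # Counting how often this returns True across puzzles gives the 16/19 figure above.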

Nevertheless, the simulator also suffers from systematic errors. While it performs well at simulating patterns that only require looking at isolated tokens (e.g. "words related to Canada"), it often has difficulty simulating patterns involving positional information, as well as patterns that require precisely keeping track of some quantity. For instance, when simulating the "an" neuron ("this neuron activates for positions in the sentence which are likely to be followed by the word 'an'"), the simulations include a very high number of false positives.

We are releasing code for trying out the neuron puzzles we constructed.

Discussion

Limitations and caveats

Our method has a number of limitations which we hope can be addressed in future work.

Neurons may not be explainable

Our work assumes that neuron behavior can be summarized by a short natural language explanation. This assumption could be problematic for a number of reasons.

Neurons may represent many features

Past research has suggested that neurons may not be privileged as a unit of computation. In particular, there may be polysemantic neurons which correspond to multiple semantic concepts. While our explanation technique can and often does generate explanations along the lines of "X and sometimes Y", it is not suited to capturing complex instances of polysemanticity.

Analyzing top-activating dataset examples has proved useful in practice in previous work, but it can also create an illusion of interpretability. By focusing on top activations, we intended to direct the explainer toward the most important aspects of the neuron's behavior, but this only captures behavior at extremal activation values and not at lower percentiles, where the neuron may behave differently.

One approach to reducing or working around polysemanticity we did not explore is to apply some factorization to the neuron space, such as NMF, SVD, or dictionary learning. We believe this could be complementary with direction finding and training interpretable models.
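
For concreteness, here is a minimal sketch of what such a factorization could look like, using scikit-learn's NMF on a matrix of non-negative MLP activations. This is purely illustrative of an approach we did not explore, and the data shapes here are placeholders.

      import numpy as np
      from sklearn.decomposition import NMF

      # acts: an (n_tokens, n_neurons) matrix of post-ReLU MLP activations collected
      # over text excerpts (random placeholder data here).
      acts = np.random.rand(10_000, 4096).astype(np.float32)

      nmf = NMF(n_components=512, init="nndsvd", max_iter=200)
      token_loadings = nmf.fit_transform(acts)   # (n_tokens, n_components)
      directions = nmf.components_               # (n_components, n_neurons)

      # Each row of `directions` is a candidate direction in neuron space; these could
      # then be explained and scored with the same pipeline used for individual neurons.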

Alien features

Furthermore, language models may represent alien concepts that humans don't have words for. This could happen because language models care about different things, e.g. statistical constructs useful for next-token prediction tasks, or because the model has discovered natural abstractions that humans have yet to discover, e.g. some family of analogous concepts in disparate domains.

We explain correlations, not mechanisms

We currently explain correlations between the network input and the neuron being interpreted on a fixed distribution. Past work has suggested that this may not reflect the causal behavior between the two.

Our explanations also do not explain what causes behavior at a mechanistic level, which could cause our understanding to generalize incorrectly. To predict rare or out-of-distribution model behaviors, it seems possible that we will need a more mechanistic understanding of models.

Simulations may not reflect human understanding

Our scoring methodology relies on the simulator model faithfully replicating how an idealized human would respond to an explanation. However, in practice, the simulator model could be picking up on aspects of an explanation that a human would not pick up on. In the worst case, the explainer model and simulator model could be implicitly performing some sort of steganography with their explanations. This could happen, for example, if both the explainer and simulator model conflate the same spurious feature with the actual feature (so the subject could be responding to feature X, and the explainer would falsely say Y, but the simulations might happen to be high on X).

Ideally, one could mitigate this by training the simulator model to imitate human simulation labels. We plan to pursue this in future work; it may also improve our simulation quality and simplify how we prompt the model.

Limited hypothesis space

To understand transformer models more fully we will need to move from interpreting single neurons to interpreting circuits. This would mean including hypotheses about downstream effects of neurons, hypotheses about attention heads and logits, and hypotheses involving multiple inputs and outputs.

Eventually, our explainer models would draw from a rich space of hypotheses, just like interpretability researchers do.

Computational requirements

Our methodology is quite compute-intensive. The number of activations in the subject model (neurons, attention head dimensions, residual dimensions) scales roughly as O(n^{2/3}), where n is the number of subject model parameters. If we use a constant number of forward passes to interpret each activation, then in the case where the subject and explainer model are the same size, overall compute scales as O(n^{5/3}).

On the other hand, this is perhaps favorable compared to pre-training itself. If pre-training scales data approximately linearly with parameters, then it uses compute O(n^2).
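
Spelling out the arithmetic behind these two estimates:

      \underbrace{O(n^{2/3})}_{\text{activations to explain}} \times \underbrace{O(1)}_{\text{forward passes per activation}} \times \underbrace{O(n)}_{\text{cost per forward pass}} = O(n^{5/3})

      \underbrace{O(n)}_{\text{training tokens}} \times \underbrace{O(n)}_{\text{cost per token}} = O(n^{2})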

Context length

Another computational issue is with context length. Our current method requires the explainer model to have context at least twice as long as the text excerpts passed to the subject model. This means that if the explainer model and subject model had the same context length, we would only be able to explain the subject model's behavior within at most half of its full context length, and could thus fail to capture some behavior that only manifests at later tokens.

Tokenization issues

There are a few ways tokenization causes issues for our methodology:
  1. The explainer and subject models may use different tokenization schemes. The tokens in the activation dataset are from inference on the subject model and will use that model's tokenization scheme. When the associated strings appear in the prompt sent to the explainer model, they may be split into multiple tokens, or they may be partial tokens that wouldn't naturally appear in the given text excerpt for the explainer model.
  2. We use a byte-pair encoder, so when feeding tokens one by one to a model, a token sometimes cannot be rendered sensibly as characters. We assumed that decoding an individual token is meaningful, but this isn't always the case. In principle, when the subject and explainer model share an encoding, we could have passed the correct token directly, but we neglected to do so in this work.
  3. We use delimiters between tokens and their corresponding activations. Ideally, the explainer model would see tokens, delimiters, and activations each as separate tokens. However, we cannot guarantee they don't merge, and in fact our tab delimiters merge with newline tokens, likely making the assistant's task harder.

To the extent that these tokenization quirks affect the model's understanding of which tokens appeared in the original text excerpt, they could harm the quality of our explanations and simulations.
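
To make the first issue concrete, the following illustrative check uses the tiktoken library; the specific pair of encodings is our choice for illustration, and the point is simply that a single subject-model token can re-encode into several explainer-model tokens, or into a fragment that would never appear on its own in the original excerpt.

      import tiktoken

      subject_enc = tiktoken.get_encoding("gpt2")           # GPT-2-style subject tokenization
      explainer_enc = tiktoken.get_encoding("cl100k_base")  # a newer encoding, for illustration

      for tok_id in subject_enc.encode(" interpretability"):
          piece = subject_enc.decode([tok_id])
          # A single subject-model token may map to multiple explainer-model tokens.
          print(repr(piece), "->", explainer_enc.encode(piece))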

Outlook

While we have described a number of limitations with the current version of our methods, we believe our work can be greatly improved and effectively integrated with other existing approaches. For example, successful research on polysemanticity could immediately cause our methods to yield much higher scores. Conversely, our methods could help improve our understanding of superposition by trying to find multiple explanations that cover behavior of a neuron over its entire distribution, or by optimizing to find sets of interpretable directions in the residual stream (perhaps in combination with approaches like dictionary learning). We also hope that we can integrate a wider range of common interpretability techniques, such as studying attention heads, using ablations for validation, etc. into our automated methodology.

Improvements to chain-of-thought methods, tool use, and conversational assistants can also be used to improve explanations. In the long run, we envision that the explainer model could generate, test, and iterate on a rich space of hypotheses about the subject model, similar to an interpretability researcher today. This would include hypotheses about the functionality of circuits and about out-of-distribution behaviors. The explainer model's environment could include access to tools like code execution, subject model visualizations, and talking to researchers. Such a model could be trained using expert iteration or reinforcement learning, with a simulator/judge model setting rewards. We can also train via debate, where two competing assistant models both propose explanations and critique each other's explanations.

We believe our methods could begin contributing to understanding the high-level picture of what is going on inside transformer language models. User interfaces with access to databases of explanations could enable a more macro-focused approach that could help researchers visualize thousands or millions of neurons to see high-level patterns across them. We may be able to soon make progress on simple applications like detecting salient features in reward models, or understanding qualitative changes between a fine-tuned model and its base model.

Ultimately, we would like to be able to use automated interpretability to assist in audits of language models, where we would attempt to detect and understand when the model is misaligned. Particularly important is detecting examples of goal misgeneralization or deceptive alignment, where the model acts aligned when being evaluated but would pursue different goals during deployment. This would require a very thorough understanding of every internal behavior. There could also be complications in using powerful models for assistance if we don't know whether the assistant itself is trustworthy. We hope either that using smaller trustworthy models for assistance will scale to a full interpretability audit, or that applying them to interpretability will teach us enough about how models work to develop more robust auditing methods.

This work represents a concrete instance of OpenAI's broader alignment plan of using powerful models to help alignment researchers. We hope it is a first step in scaling interpretability to a comprehensive understanding of more complicated and capable models in the future.

Contributions

Methodology: Nick effectively started the project by having the initial idea to have GPT-4 explain neurons, and showing a simple explanation methodology worked. William came up with the initial simulation and scoring methodology and implementation. Dan and Steven ran many experiments resulting in ultimate choices of prompts and explanation/scoring parameters.

ML infrastructure: William and Nick set up the initial version of the codebase. Leo and Jeff implemented the initial core internal infrastructure for doing interpretability. Steven implemented the top activations pipeline. Steven and William developed the pipeline for explanations and scoring. Many other miscellaneous contributions came from William, Jeff, Dan, and Steven. Steven created the open source version.

Web infrastructure: Nick and William implemented the neuron viewer, with smaller contributions from Steven, Dan, and Jeff. Nick implemented many other UIs exploring various kinds of neuron explanation. Steven implemented human data gathering UIs.

Human data: Steven implemented and analyzed all experiments involving contractor human data: the human explanation baseline, and human scoring experiments. Nick and William implemented early researcher explanation baselines.

Alternative token and weight-based explanations: Dan implemented all experiments and analysis on token weight and token lookup baselines, next token explanations, as well as infrastructure and UIs for neuron-neuron connection weights.

Revisions: Henk implemented and analyzed the main revision experiments. Nick championed and implemented an initial proof of concept for revisions. Leo implemented a small scale derisking experiment. Steven helped design the final revision pipeline. Leo and Dan did many crucial investigations into negative findings.

Direction finding: Leo had the idea and implemented all experiments related to direction finding.

Neuron puzzles: Henk implemented all the neuron puzzles and related experiments. William came up with the initial idea. Steven and William gave feedback on data and strategies.

Subject, explainer, and simulator scaling: Steven implemented and analyzed assistant size, simulator size, and subject size experiments. Jeff implemented and analyzed subject training time experiments.

Ablation scoring: Jeff implemented ablation infrastructure and initial scoring experiments, and Dan contributed lots of useful thinking and carried out final experiments. Leo did related investigations into understanding and prediction of ablation effects.

Activation function experiments: Jeff implemented the experiments and analysis. Gabe suggested the sparse activation function, and William suggested correlation-based community detection.

Qualitative results: Everyone contributed throughout the project to qualitative findings. Nick and William discovered many of the earliest nontrivial neurons. Dan found many non-trivial neurons by comparing to token baselines, such as simile neurons. Steven found the pattern break neuron, and other context-sensitive neurons. Leo discovered the "don't stop" neuron and first noticed explanations were overly broad. Henk had many qualitative findings about explanation and scoring quality. Nick found interesting neuron-neuron connections and interesting pairs of neurons firing on the same token. William and Jeff investigated ablations of specific neurons.

Guidance and mentorship: William and Jeff led and managed the project. Jan and Jeff managed team members who worked on the project. Many ideas from Jan, Nick, William, and Ilya influenced the direction of the project. Steven mentored Henk.

Acknowledgments

We thank Neel Nanda, Ryan Greenblatt, Paul Christiano, Chris Olah, and Evan Hubinger for useful discussions on direction during the project.

We thank Ryan Greenblatt, Buck Shlegeris, Trenton Bricken, the OpenAI Alignment team, and the Anthropic Interpretability team for useful discussions and feedback.

We thank Cathy Yeh for doing useful checks of our code and noticing tokenization concerns.

We thank Carroll Wainwright and Chris Hesse for help with infrastructure.

We thank the contractors who wrote and evaluated explanations for our human data experiments, and Long Ouyang for some feedback on our instructions.

We thank Rajan Troll for providing a human baseline for the neuron puzzles.

We thank Thomas Degry for the blog graphic.

We thank other OpenAI teams for their support, including the supercomputing, research acceleration, and language teams.

Citation Information

Please cite as:

      Bills, et al., "Language models can explain neurons in language models", 2023.
    

BibTeX Citation:

      @misc{bills2023language,
         title={Language models can explain neurons in language models},
         author={
            Bills, Steven and Cammarata, Nick and Mossing, Dan and Tillman, Henk and Gao, Leo and Goh, Gabriel and Sutskever, Ilya and Leike, Jan and Wu, Jeff and Saunders, William
         },
         year={2023},
         howpublished = {\url{https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html}}
      }

Footnotes

  1. The subject and the explainer can be the same model, though for our experiments we typically use smaller models as subjects and our strongest model as the explainer. In the long run, the situation may be reversed if the subject is our strongest model but we don't trust it as an assistant.[↩]
  2. Thus, while the explainer model should be trained to be as helpful as possible using RL (using the simulator model as a training signal), the simulator should be trained to imitate humans. If the simulator model improved its performance by making predictions that humans would disagree with, we would potentially risk explanations not being human-understandable.[↩]
  3. However, we did not explicitly check that the resulting text excerpts do not overlap. While in principle it would be reasonable for an explanation to "memorize" behavior to the extent that it drives most of the subject model's behavior on the training set, it would be less interesting if that was the primary driver of high scores. Based on some simple checks of our text excerpts, this was a non-issue for at least 99.8% of neurons.[↩]
  4. All prompts are shown in an abbreviated format, and are modified somewhat when using the structured chat completions API. For full details see our codebase.[↩]
  5. While we used GPT-4 as the simulator model for most of our experiments, the public OpenAI API does not support returning logprobs for newer chat-based models like GPT-4 and GPT-3.5-turbo. Older models like the original GPT-3.5 support logprobs.[↩]
  6. "All at once" actually outperformed "one at a time", but the effect was within noise, and researchers subjectively thought "one at a time" was better, on relatively small sample sizes.[↩]
  7. Note that this isn't an unbiased estimator of true explained variance, since we also use a sample for the denominator. One could improve on our approach by using a much larger sample for estimating the variance term.[↩]
  8. We explored more complicated methods to calibrate, but they typically require many simulations, which are expensive to obtain.[↩]
  9. Conceptually, calibration should ideally happen on a different set of text excerpts, so we aren't "cheating" by using the true mean and variance. We empirically studied this cheating effect for differing sample sizes and believe it to be small in practice.[↩]
  10. Matching standard deviations results in explained variance of 2 \rho - 1 < \rho^2. We also find empirically that it performs worse in ablation based scoring.[↩]
  11. This might happen if subtle variations in the activation of a neuron (making the difference, say, between a correlation score of 0.9 and 1.0) played an outsized role in its function within the network.[↩]
  12. Unlike GPT-2, GPT-4 is a model trained to follow instructions via RLHF.[↩]
  13. Note that since GPT-2 models use byte-pair encoders, sometimes our texts have mid-character breaks. See here for more discussion.[↩]
  14. GPT-2 was trained to sometimes see multiple documents, separated by a special end of text token. In our work, we ensure all 64 tokens are within the same document.[↩]
  15. For other activation functions like GeGLU, this would likely be untrue and we would need to separately explain positive and negative activations.[↩]
  16. In the long term, we want to move toward something debate-like where the score is more like a minimax than an expectation. That is, we imagine one model coming up with an explanation and another model coming up with counterexamples. Scoring would take the whole transcript into consideration, and thus measure how robust the hypothesis is.[↩]
  17. Unfortunately, initial attempts at this using a moderate increase in number of text excerpts did not prove to be useful.[↩]
  18. In most places, we calculate this using 1.96 times the standard error of the mean (SEM), or a strictly more conservative statistic. If needed we estimate via bootstrap resampling methods.[↩]
  19. We also tried linear prediction including the position embedding for each position in the text excerpt plus the token embedding; this linear token- and position-based prediction baseline resulted in very small quantitative improvements and no qualitative change.[↩]
  20. As noted above, in the non-iterative setting we find the explainer model is unable to effectively make use of additional text excerpts in context.[↩]
  21. When one of us looked uncarefully at this neuron, we too came to this conclusion. It was only after testing examples of sequences like "All students must turn in their final papers by Monday" that we realized the initial explanation was too broad.[↩]
  22. Our revision process is also agnostic to the explanation it starts with, so we could likely also start from our strong unigram baselines and revise based on relevant sentences. We suspect this will outperform our existing results and plan to try techniques like this in the future.[↩]
  23. This method can also be applied to the residual stream, because it does not assume a privileged basis at all.[↩]
  24. We found in early experiments that without reparameterization by \Sigma^{-1/2}, high-variance neurons would otherwise disproportionately dominate the selected vectors, causing a reduction in sample diversity. The reparameterization using \Sigma^{-1/2} ensures that the initialization favors lower-variance directions and that step sizes are scaled appropriately. Despite the reparameterization, we still observe some amount of collapse.[↩]
  25. We also tried truncating based on magnitude of coefficient, which resulted in even poorer scores.[↩]
  26. However, we find the second layer (layer 1) of many large models to have very low scores, potentially related to the fact that they contain many dead neurons.[↩]
  27. Each parameter doubling is approximately 0.17 nats of loss, so the 0.1 sparsity models are roughly 8.5% less parameter-efficient, and 0.01 sparsity models are roughly 40% less parameter-efficient. We hope there is low hanging fruit for reducing this "explainability tax".[↩]
  28. Each of these models was trained for a total of 300B tokens.[↩]
  29. One extremely speculative explanation for this is that features get cleaner/better with more training (causing random-and-top scores to increase) but that there are also more interfering features due to superposition (causing random-only scores to decrease).[↩]
  30. We called these "vibe" neurons.[↩]
  31. When we gave these puzzles to a researcher not on the project, they solved all but the 'an' prediction puzzle. In hindsight, they thought they could recognize future next-token-predicting puzzles with similar levels of evidence.[↩]

References

  1. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
    Wang, K., Variengien, A., Conmy, A., Shlegeris, B. and Steinhardt, J., 2022. arXiv preprint arXiv:2211.00593.
  2. A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
    Chughtai, B., Chan, L. and Nanda, N., 2023. arXiv preprint arXiv:2302.03025.
  3. Automating Auditing: An ambitious concrete technical research proposal[link]
    Hubinger, E., 2021.
  4. CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks
    Oikarinen, T. and Weng, T., 2022. arXiv preprint arXiv:2204.10965.
  5. The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable[link]
    Millidge, B. and Black, S., 2022.
  6. Describing differences between text distributions with natural language
    Zhong, R., Snell, C., Klein, D. and Steinhardt, J., 2022. International Conference on Machine Learning, pp. 27099--27116.
  7. Explaining patterns in data with language models via interpretable autoprompting
    Singh, C., Morris, J.X., Aneja, J., Rush, A.M. and Gao, J., 2022. arXiv preprint arXiv:2210.01848.
  8. GPT-4 Technical Report
OpenAI, 2023. arXiv preprint arXiv:2303.08774.
  9. Network dissection: Quantifying interpretability of deep visual representations
    Bau, D., Zhou, B., Khosla, A., Oliva, A. and Torralba, A., 2017. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6541--6549.
  10. Causal scrubbing: a method for rigorously testing interpretability hypotheses[link]
    Chan, L., Garriga-Alonso, A., Goldowsky-Dill, N., Greenblatt, R., Nitishinskaya, J., Radhakrishnan, A., Shlegeris, B. and Thomas, N., 2022.
  11. Natural language descriptions of deep visual features
    Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A. and Andreas, J., 2022. International Conference on Learning Representations.
  12. Softmax Linear Units
    Elhage, N., Hume, T., Olsson, C., Nanda, N., Henighan, T., Johnston, S., ElShowk, S., Joseph, N., DasSarma, N., Mann, B., Hernandez, D., Askell, A., Ndousse, K., Jones, A., Drain, D., Chen, A., Bai, Y., Ganguli, D., Lovitt, L., Hatfield-Dodds, Z., Kernion, J., Conerly, T., Kravec, S., Fort, S., Kadavath, S., Jacobson, J., Tran-Johnson, E., Kaplan, J., Clark, J., Brown, T., McCandlish, S., Amodei, D. and Olah, C., 2022. Transformer Circuits Thread.
  13. Language models are unsupervised multitask learners
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I. and others, 2019. OpenAI blog, Vol 1(8), pp. 9.
  14. Activation atlas
    Carter, S., Armstrong, Z., Schubert, L., Johnson, I. and Olah, C., 2019. Distill, Vol 4(3), pp. e15.
  15. The language interpretability tool: Extensible, interactive visualizations and analysis for NLP models
Tenney, I., Wexler, J., Bastings, J., Bolukbasi, T., Coenen, A., Gehrmann, S., Jiang, E., Pushkarna, M., Radebaugh, C., Reif, E. and others, 2020. arXiv preprint arXiv:2008.05122.
  16. Neuroscope[link]
Nanda, N.
  17. Visualizing higher-layer features of a deep network
    Erhan, D., Bengio, Y., Courville, A. and Vincent, P., 2009. University of Montreal, Vol 1341(3), pp. 1.
  18. Understanding neural networks through deep visualization
    Yosinski, J., Clune, J., Nguyen, A., Fuchs, T. and Lipson, H., 2015. arXiv preprint arXiv:1506.06579.
  19. Feature Visualization
    Olah, C., Mordvintsev, A. and Schubert, L., 2017. Distill. DOI: 10.23915/distill.00007
  20. Gaussian error linear units (gelus)
    Hendrycks, D. and Gimpel, K., 2016. arXiv preprint arXiv:1606.08415.
  21. Causal mediation analysis for interpreting neural nlp: The case of gender bias
    Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Sakenis, S., Huang, J., Singer, Y. and Shieber, S., 2020. arXiv preprint arXiv:2004.12265.
  22. Locating and editing factual associations in GPT
    Meng, K., Bau, D., Andonian, A. and Belinkov, Y., 2022. Advances in Neural Information Processing Systems, Vol 35, pp. 17359--17372.
  23. Training language models to follow instructions with human feedback
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and others, 2022. Advances in Neural Information Processing Systems, Vol 35, pp. 27730--27744.
  24. Glu variants improve transformer
    Shazeer, N., 2020. arXiv preprint arXiv:2002.05202.
  25. Zoom In: An Introduction to Circuits
    Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M. and Carter, S., 2020. Distill. DOI: 10.23915/distill.00024.001
  26. Toy Models of Superposition
    Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M. and Olah, C., 2022. Transformer Circuits Thread.
  27. AI safety via debate
    Irving, G., Christiano, P. and Amodei, D., 2018. arXiv preprint arXiv:1805.00899.
  28. interpreting GPT: the logit lens[link]
nostalgebraist, 2020.
  29. Analyzing transformers in embedding space
    Dar, G., Geva, M., Gupta, A. and Berant, J., 2022. arXiv preprint arXiv:2209.02535.
  30. Chain of thought prompting elicits reasoning in large language models
    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q. and Zhou, D., 2022. arXiv preprint arXiv:2201.11903.
  31. Self-critiquing models for assisting human evaluators
    Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J. and Leike, J., 2022. arXiv preprint arXiv:2206.05802.
  32. Finding Neurons in a Haystack: Case Studies with Sparse Probing
    Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D. and Bertsimas, D., 2023. arXiv preprint arXiv:2305.01610.
  33. Problems of monetary management: the UK experience in papers in monetary economics
    Goodhart, C., 1975. Monetary Economics, Vol 1.
  34. Scaling Laws for Reward Model Overoptimization
    Gao, L., Schulman, J. and Hilton, J., 2022.
  35. Eliciting latent knowledge: How to tell if your eyes deceive you[link]
    Christiano, P., Cotra, A. and Xu, M., 2021.
  36. Chris Olah’s views on AGI safety[link]
    Hubinger, E., 2019.
  37. Language models are few-shot learners
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and others, 2020. Advances in neural information processing systems, Vol 33, pp. 1877--1901.
  38. Input switched affine networks: An rnn architecture designed for interpretability
    Foerster, J.N., Gilmer, J., Sohl-Dickstein, J., Chorowski, J. and Sussillo, D., 2017. International conference on machine learning, pp. 1136--1145.
  39. Re-training deep neural networks to facilitate Boolean concept extraction
Gonzalez, C., Loza Mencía, E. and Fürnkranz, J., 2017. Discovery Science: 20th International Conference, DS 2017, Kyoto, Japan, October 15--17, 2017, Proceedings 20, pp. 127--143.
  40. Spine: Sparse interpretable neural embeddings
    Subramanian, A., Pruthi, D., Jhamtani, H., Berg-Kirkpatrick, T. and Hovy, E., 2018. Proceedings of the AAAI conference on artificial intelligence, Vol 32(1).
  41. Learning effective and interpretable semantic models using non-negative sparse embedding
    Murphy, B., Talukdar, P. and Mitchell, T., 2012. Proceedings of COLING 2012, pp. 1933--1950.
  42. In-context Learning and Induction Heads
    Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S. and Olah, C., 2022. Transformer Circuits Thread.
  43. We Found An Neuron in GPT-2[link]
    Miller, J. and Neo, C., 2023.
  44. Intriguing properties of neural networks
    Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. arXiv preprint arXiv:1312.6199.
  45. Curve Detectors
    Cammarata, N., Goh, G., Carter, S., Schubert, L., Petrov, M. and Olah, C., 2020. Distill. DOI: 10.23915/distill.00024.003
  46. An interpretability illusion for bert
    Bolukbasi, T., Pearce, A., Yuan, A., Coenen, A., Reif, E., Viegas, F. and Wattenberg, M., 2021. arXiv preprint arXiv:2104.07143.
  47. The Building Blocks of Interpretability
    Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K. and Mordvintsev, A., 2018. Distill. DOI: 10.23915/distill.00010
  48. Learning the parts of objects by non-negative matrix factorization
    Lee, D.D. and Seung, H.S., 1999. Nature, Vol 401(6755), pp. 788--791. Nature Publishing Group UK London.
  49. Nonnegative matrix factorization with mixed hypergraph regularization for community detection
    Wu, W., Kwong, S., Zhou, Y., Jia, Y. and Gao, W., 2018. Information Sciences, Vol 435, pp. 263--281. Elsevier.
  50. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability
    Raghu, M., Gilmer, J., Yosinski, J. and Sohl-Dickstein, J., 2017. Advances in neural information processing systems, Vol 30.
  51. The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable[link]
    Millidge, B. and Black, S., 2022.
  52. Taking features out of superposition with sparse autoencoders[link]
    Sharkey, L., Braun, D. and Millidge, B., 2022.
  53. On interpretability and feature representations: an analysis of the sentiment neuron
    Donnelly, J. and Roegiest, A., 2019. Advances in Information Retrieval: 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14--18, 2019, Proceedings, Part I 41, pp. 795--802.
  54. Cyclegan, a master of steganography
    Chu, C., Zhmoginov, A. and Sandler, M., 2017. arXiv preprint arXiv:1712.02950.
  55. N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models
    Foote, A., Nanda, N., Kran, E., Konstas, I. and Barez, F.. ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models.
  56. Training compute-optimal large language models
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.d.L., Hendricks, L.A., Welbl, J., Clark, A. and others, 2022. arXiv preprint arXiv:2203.15556.
  57. Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
    Shah, R., Varma, V., Kumar, R., Phuong, M., Krakovna, V., Uesato, J. and Kenton, Z., 2022.
  58. The alignment problem from a deep learning perspective
    Ngo, R., Chan, L. and Mindermann, S., 2023.
  59. Risks from Learned Optimization in Advanced Machine Learning Systems
    Hubinger, E., Merwijk, C.v., Mikulik, V., Skalse, J. and Garrabrant, S., 2021.
  60. Our approach to alignment research[link]
    Leike, J., Schulman, J. and Wu, J., 2022.