Language models have become more capable and more widely deployed, but we do not understand how they work. Recent work has made progress on understanding a small number of circuits and narrow behaviors, but scaling such analyses to the hundreds of thousands of neurons in a modern language model requires automation.
Our technique seeks to explain what patterns in text cause a neuron to activate. This technique lets us leverage GPT-4 to automatically generate natural-language explanations of neuron behavior and to quantitatively score how well each explanation predicts the neuron's activations.
With our baseline methodology, explanations achieved scores approaching the level of human contractor performance. We found we could further improve performance by iterating on explanations with additional evidence, using larger explainer models, and changing the architecture of the subject model.
We applied our method to all MLP neurons in GPT-2 XL.
We are open sourcing our dataset of explanations for all neurons in GPT-2 XL, along with code for explanation and scoring, to encourage further research into producing better explanations. We are also releasing a neuron viewer built on the dataset. Although most well-explained neurons are not very interesting, we found many interesting neurons that GPT-4 did not understand. We hope this lets others more easily build on top of our work. With better explanations and tools in the future, we may be able to rapidly uncover interesting qualitative understanding of model computations.
In this paper we start with the easiest case of identifying properties of text inputs that correlate with intermediate activations.
In the language model case, the inputs are text passages. For the intermediate activations, we focus on neurons in the MLP layers. For the remainder of this paper, activations refer to the MLP post-activation value, i.e. the output of the MLP nonlinearity for that neuron: $a_{\ell,i}(x) = \phi\!\left(W^{\mathrm{in}}_{\ell}\,\mathrm{LN}(h_{\ell}(x)) + b^{\mathrm{in}}_{\ell}\right)_i$, where $h_{\ell}(x)$ is the residual stream entering layer $\ell$'s MLP, $\mathrm{LN}$ is the pre-MLP layer norm, $W^{\mathrm{in}}_{\ell}, b^{\mathrm{in}}_{\ell}$ are the MLP input weights and biases, and $\phi$ is the GELU nonlinearity.
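For concreteness, here is a minimal sketch (not the released code) of reading these MLP post-activation values out of the Hugging Face implementation of GPT-2; the layer and neuron indices are arbitrary examples.

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

layer, neuron = 5, 131  # hypothetical indices, for illustration only
captured = {}

def hook(module, inputs, output):
    # output has shape (batch, seq_len, 4 * d_model): the post-GELU MLP activations
    captured["acts"] = output.detach()

handle = model.transformer.h[layer].mlp.act.register_forward_hook(hook)

tokens = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    model(**tokens)
handle.remove()

neuron_acts = captured["acts"][0, :, neuron]  # one activation per input token
for tok_id, act in zip(tokens["input_ids"][0], neuron_acts):
    print(repr(tokenizer.decode([int(tok_id)])), float(act))
```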
When we have a hypothesized explanation for a neuron, the hypothesis is that the neuron activates on tokens with that property, where the property may include the previous tokens as context.
At a high level, our process of interpreting a neuron uses the following algorithm:
1. Explain: generate an explanation of the neuron's behavior by showing an explainer model (GPT-4) text excerpts annotated with the neuron's activations.
2. Simulate: use a simulator model (GPT-4) to predict, conditioned only on the explanation, how the neuron would activate on each token of new text excerpts.
3. Score: compare the simulated activations with the neuron's real activations to grade how predictive the explanation is.
We always use distinct documents for explanation generation and simulation.
Our code for generating explanations, simulating neurons, and scoring explanations is available here.
Activations are normalized to a 0-10 scale and discretized to integer values, with negative activation values mapping to 0 and the maximum activation value ever observed for the neuron mapping to 10. For sequences where the neuron's activations are sparse (<20% non-zero), we found it helpful to additionally repeat the token/activation pairs with non-zero activations after the full list of tokens, helping the model to focus on the relevant tokens.
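A minimal sketch of this normalization and of the sparse-sequence trick follows; the exact prompt layout (here, tab-separated token/activation pairs) is an assumption rather than the released format.

```python
import numpy as np

def discretize(acts, max_act):
    """Map raw activations to the 0-10 integer scale described above:
    negative values go to 0, the neuron's maximum observed activation to 10."""
    scaled = 10 * np.clip(np.asarray(acts, dtype=float), 0, None) / max_act
    return np.clip(np.round(scaled), 0, 10).astype(int)

def format_sequence(tokens, acts, max_act):
    """Render one text excerpt for the explainer/simulator prompt."""
    ints = discretize(acts, max_act)
    lines = [f"{tok}\t{a}" for tok, a in zip(tokens, ints)]
    # For sparse sequences (<20% non-zero), repeat the non-zero token/activation
    # pairs after the full list so the model focuses on the relevant tokens.
    if (ints > 0).mean() < 0.2:
        lines.append("")
        lines += [f"{tok}\t{a}" for tok, a in zip(tokens, ints) if a > 0]
    return "\n".join(lines)
```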
We prompt the simulator model to output an integer from 0-10 for each subject model token. For each predicted activation position, we examine the probability assigned to each number ("0", "1", …, "10"), and use those to compute the expected value of the output. The resulting simulated neuron value is thus on a [0, 10] scale.
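A minimal sketch of that expected-value computation, assuming we already have the simulator's log-probabilities for the completions "0" through "10" at a given position (and glossing over the fact that "10" may be tokenized into more than one token):

```python
import numpy as np

def expected_activation(logprobs_by_value: dict) -> float:
    """logprobs_by_value maps the strings "0".."10" to the simulator's
    log-probability of emitting that value at this position."""
    values = np.arange(11)
    probs = np.array([np.exp(logprobs_by_value.get(str(v), -np.inf)) for v in values])
    if probs.sum() == 0:
        return 0.0
    probs /= probs.sum()  # renormalize over the 11 allowed outcomes
    return float(np.dot(probs, values))
```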
Our simplest method is what we call the "one at a time" method. The prompt consists of some few-shot examples and a single-shot example of predicting an individual token's activation.
Unfortunately, the "one at a time" method is quite slow, as it requires one forward pass per simulated token. We use a trick to parallelize the probability predictions across all tokens by having few-shot examples where activation values switch from being "unknown" to being actual values at a random location in the sequence. This way, we can simulate the neuron with "unknown" in the context while still eliciting model predictions by examining logprobs for the "unknown" tokens, and without the model ever getting to observe any actual activation values for the relevant neuron. We call this the "all at once" method.
Due to the speed advantage, we use "all at once" scoring for the remainder of the paper, except for some of the smaller-scale qualitative results.
A conceptually simple approach is to use the explained variance of the true activations by the simulated activations, across all tokens. That is, we could calculate $1 - \frac{\mathbb{E}_t\left[(a_t - s_t)^2\right]}{\mathrm{Var}_t(a_t)}$, where $a_t$ is the neuron's true activation on token $t$ and $s_t$ is the simulated activation.
However, our simulated activations are on a [0, 10] scale, while real activations will have some arbitrary distribution. Thus, we assume the ability to calibrate the simulated neuron's activation distribution to the actual neuron's distribution. We chose to simply calibrate linearly, i.e. to allow an affine transformation of the simulated values, with slope and intercept chosen to minimize the expected squared error.
This motivates our main method of scoring, correlation scoring, which simply reports the correlation coefficient $\rho$ between the true and simulated activations. Under the optimal linear calibration, the explained variance equals $\rho^2$.
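In code, the two quantities can be computed as follows (a sketch; variable names are ours):

```python
import numpy as np

def explained_variance(true_acts, sim_acts):
    """Explained variance of the true activations by the simulated activations."""
    true_acts = np.asarray(true_acts, dtype=float)
    sim_acts = np.asarray(sim_acts, dtype=float)
    return 1 - np.mean((true_acts - sim_acts) ** 2) / true_acts.var()

def correlation_score(true_acts, sim_acts):
    """Correlation scoring: the Pearson correlation between true and simulated
    activations. Under the best affine calibration of the simulation, the
    explained variance equals the square of this correlation."""
    return float(np.corrcoef(true_acts, sim_acts)[0, 1])
```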
Another way to understand a network is to perturb its internal values during a forward pass and observe the effect.
To measure the extent of the behavioral change from ablating to simulation, we use Jensen-Shannon divergence between the perturbed and original model’s output logprobs, averaged across all tokens. As a baseline for comparison, we perform a second perturbation, ablating the neuron’s activation to its mean value across all tokens. For each neuron, we normalize the divergence of ablating to simulation by the divergence of ablating to the mean. Thus, we express an ablation score as one minus this normalized divergence, so that a simulation causing no behavioral change scores 1 and a simulation no more faithful than the mean scores 0.
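A sketch of this ablation score follows; the sign convention (a perfect simulation scores 1, a simulation that perturbs behavior as much as mean ablation scores 0) reflects the normalization described above.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def ablation_score(orig_logprobs, sim_ablated_logprobs, mean_ablated_logprobs):
    """Each argument is an array of shape (num_tokens, vocab_size) holding the
    model's output log-probabilities: unperturbed, with the neuron ablated to
    its simulated value, and with the neuron ablated to its mean value."""
    def avg_jsd(p_logs, q_logs):
        # scipy's jensenshannon returns the JS distance (sqrt of the divergence)
        return np.mean([jensenshannon(np.exp(p), np.exp(q)) ** 2
                        for p, q in zip(p_logs, q_logs)])
    d_sim = avg_jsd(orig_logprobs, sim_ablated_logprobs)
    d_mean = avg_jsd(orig_logprobs, mean_ablated_logprobs)
    return 1.0 - d_sim / d_mean
```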
We find that correlation scoring and ablation scoring have a clear relationship, on average. Thus, the remainder of the paper uses correlation scoring, as it is much simpler to compute. Nevertheless, correlation scoring appears not to capture all the deficits in simulated explanations revealed by ablation scoring. In particular, correlation scores of 0.9 still lead to relatively low ablation scores on average (0.3 for scoring on random-only text excerpts and 0.6 for top-and-random; see below for how these text excerpts are chosen).
We gave human labelers tasks in which they saw the same text excerpts and activations (shown with color highlighting) as the simulator model (both top-activating and random), and were asked to rate and then rank 5 proposed explanations by how well those explanations captured the activation patterns. Because the explainer model's explanations were not diverse on their own, we increased explanation diversity by varying the few-shot examples used in the explanation generation prompt, or by using a modified prompt that asks the explainer model for a numbered list of possible explanations in a single completion.
Our results show that humans tend to prefer higher-scoring explanations over lower-scoring ones, with the consistency of that preference increasing as the size of the score gap increases.
Throughout this work, unless otherwise specified, we use GPT-2 pretrained models as subject models and GPT-4 for the explainer and simulator models.
For both generating and simulating explanations, we take text excerpts from the training split of the subject model's pre-training dataset (e.g. WebText, for GPT-2 models). We choose random 64-token contiguous subsequences of the documents as our text excerpts.
When generating explanations, we use 5 "top-activating" text excerpts, which have at least one token with an extremely large activation value, as determined by the quantile of the max activation. This was because we found empirically that explanations generated from top-activating excerpts scored better than explanations generated from random excerpts.
Thus, for the remainder of this paper, explanations are always generated from 5 top-activating sequences unless otherwise noted. We set a top quantile threshold of 0.9996, taking the 20 sequences containing the highest activations out of 50,000 total sequences. We sample explanations at temperature 1.
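Concretely, excerpt selection for explanation looks roughly like the sketch below; how the 5 prompt excerpts are drawn from the top 20 is an assumption.

```python
import numpy as np

def pick_explanation_excerpts(acts_by_excerpt, rng, quantile=0.9996, num_prompt=5):
    """Rank each 64-token excerpt by its maximum activation for the neuron,
    keep those above the top quantile (20 of 50,000 at quantile 0.9996),
    and draw the excerpts shown in the explanation prompt from that pool."""
    max_per_excerpt = np.array([acts.max() for acts in acts_by_excerpt])
    num_keep = max(num_prompt, round((1 - quantile) * len(acts_by_excerpt)))
    top_pool = np.argsort(max_per_excerpt)[-num_keep:]
    return rng.choice(top_pool, size=num_prompt, replace=False)

# e.g. indices = pick_explanation_excerpts(acts, np.random.default_rng(0))
```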
For simulation and scoring, we report on 5 uniformly random text excerpts ("random-only"). The random-only score can be thought of as an explanation’s ability to capture the neuron’s representation of features in the pre-training distribution. While random-only scoring is conceptually easy to interpret, we also report scores on a mix of 5 top-activating and 5 random text excerpts ("top-and-random"). The top-and-random score can be thought of as an explanation’s ability to capture the neuron’s most strongly represented feature (from the top text excerpts), with a penalty for overly broad explanations (from the random text excerpts). Top-and-random scoring also has pragmatic advantages over random-only: the top excerpts guarantee tokens with high real activations, so scores are less noisy with a small number of excerpts, especially for sparsely activating neurons.
Note that "random-only" scoring with small sample size risks failing to capture behavior, due to lacking both tokens with high simulated activations and tokens with high real activations. "Top-and-random" scoring addresses the latter, but causes us to penalize falsely low simulations more than falsely high simulations, and thus tends to accept overly broad explanations. A more principled approach which gets the best of both worlds might be to stick to random-only scoring, but increase the number of random-only text excerpts in combination with using importance sampling as a variance reduction strategy.
Below we show some prototypical examples of neuron scoring.
Overall, for GPT-2, we find an average score of 0.151 using top-and-random scoring, and 0.037 for random-only scoring. Scores generally decrease when going to later layers.
Note that individual scores for neurons may be noisy, especially for random-only scoring. With that in mind, out of a total of 307,200 neurons, 5,203 (1.7%) have top-and-random scores above 0.7 (explaining roughly half the variance), using our default methodology. With random-only scoring, this drops to 732 neurons (0.2%). Only 189 neurons (0.06%) have top-and-random scores above 0.9, and 86 (0.03%) have random-only scores above 0.9.
To understand the quality of our explanations in absolute terms, it is helpful to compare with a baseline that does not depend on language models' ability to summarize or simulate activations. For this reason, we examine several baseline methods that directly predict activations based on a single token, using either model weights or activation data aggregated across held-out texts. For each neuron, these methods give one predicted activation value per token in the vocabulary, which amounts to substantially more information than the short natural language explanation produced by a language model. For that reason, we also used language models to briefly summarize these lists of activation values, and used that summary as an explanation in our typical simulation pipeline.
The logit lens and related techniques interpret intermediate activations of a network directly in terms of its token embedding space; our first baseline takes a similar weights-based approach.
To linearly predict activations for each token, we multiply each token embedding by the pre-MLP layer norm gain and the neuron's input weights ($W^{\mathrm{in}}$), yielding one predicted activation per token in the vocabulary.
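In terms of the Hugging Face GPT-2 weights, this prediction looks roughly like the sketch below. This is one plausible reading of "embedding times layer-norm gain times input weights"; whether the layer-norm bias or any normalization of the embedding is also incorporated is not specified here.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
layer, neuron = 5, 131  # hypothetical indices

emb = model.transformer.wte.weight             # (vocab_size, d_model) token embeddings
gain = model.transformer.h[layer].ln_2.weight  # pre-MLP LayerNorm gain, (d_model,)
w_in = model.transformer.h[layer].mlp.c_fc.weight[:, neuron]  # neuron input weights, (d_model,)

with torch.no_grad():
    # One scalar per vocabulary token: the embedding, scaled by the LayerNorm
    # gain, projected onto the neuron's input weights.
    predicted = (emb * gain) @ w_in

top_tokens = torch.topk(predicted, 100).indices  # tokens with highest predicted activation
```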
The linear token-based prediction baseline outperforms activation-based explanation and scoring for the first layer, predicting activations almost perfectly (unsurprising given that only the first attention layer intervenes between the embedding and first MLP layer). For all subsequent layers, GPT-4-based explanation and scoring predicts activations better than the linear token-based prediction baseline.
The linear token-based prediction baseline is a somewhat unfair comparison, as the "explanation length" of one scalar value per token in the vocabulary is substantially longer than GPT-based natural language explanations. Using a language model to compress this information into a short explanation and simulate that explanation might act as an “information bottleneck” that affects the accuracy of predicted activations. To control for this, we try a hybrid approach, applying GPT-4-based explanation to the list of tokens with the highest linearly predicted activations (corresponding to 50 out of the top 100 values), rather than to top activations. These explanations score worse than either linear token-based prediction or activation-based explanations.
The token-based linear prediction baseline might underperform the activation-based baseline for one of several reasons. First, it might fail because multi-token context is important (for example, many neurons are sensitive to multiple-token phrases). Second, it might fail because intermediate processing steps between the token embedding and the neuron's input (attention layers and earlier MLP layers) substantially transform the representation.
To evaluate the second possibility, we construct a second, “correlational” baseline. For this baseline, we compute the mean activation per-token and per-neuron over a large corpus of held-out internet text. We then use this information to construct a lookup table. For each token in a text excerpt and for each neuron, we predict the neuron's activation using the token lookup table, independent of the preceding tokens. Again, we do not summarize the contents of this lookup table, or use a language model to simulate activations.
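A sketch of building and applying such a lookup table for a single neuron (array shapes and names are ours):

```python
import numpy as np

def build_lookup_table(token_ids_by_excerpt, acts_by_excerpt, vocab_size):
    """Mean activation per vocabulary token over a held-out corpus, for one neuron."""
    sums = np.zeros(vocab_size)
    counts = np.zeros(vocab_size)
    for tok_ids, acts in zip(token_ids_by_excerpt, acts_by_excerpt):
        np.add.at(sums, tok_ids, acts)
        np.add.at(counts, tok_ids, 1)
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

def predict_with_lookup(table, tok_ids):
    """Context-independent prediction: each token's predicted activation is its mean."""
    return table[np.asarray(tok_ids)]
```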
The token lookup table baseline is much stronger than the token-based linear prediction baseline, substantially outperforming activation-based explanation and simulation on average. We apply the same explanation technique as with the token-based linear baseline to measure how the information bottleneck from explanation and simulation using GPT-4 affects the accuracy of predicted activations.
The resulting token lookup table-based explanation results in a score similar to our activation-based explanation on top-and-random scoring, but outperforms activation-based explanations on random-only scoring. However, we are most interested in neurons that encode complex patterns of multi-token context rather than single tokens. Despite worse performance on average, we find many interesting neurons where activation-based explanations have an advantage over token-lookup-table-based explanations. We are also able to improve over the token-lookup-table-based explanation by revising explanations. In the long run, we plan to use methods that combine both token-based and activation-based information.
Explanation quality is fundamentally bottlenecked by the small set of text excerpts and activations shown in a single explanation prompt, which is not always sufficient to explain a neuron's behavior. Iterating on explanations with additional evidence offers a way around this bottleneck.
One particular issue we find is that overly broad explanations tend to be consistent with the top-activating sequences. For instance, we found a "not all" neuron which activates on the phrase "not all" and some related phrases. However, the top-activating text excerpts chosen for explanation do not falsify a simpler hypothesis, that the neuron activates on the phrase "all". The model thus generates an overly broad explanation: 'the term "all" along with related contextual phrases'.
To make things worse, we find that our explainer technique often fails to take into account negative evidence (i.e. examples of the neuron not firing which disqualify certain hypotheses). With the "not all" neuron, even when we manually add negative evidence to the explainer context (i.e. sequences that include the word "all" with zero activation), the explainer ignores these and produces the same overly broad explanation. The explainer model may be unable to pay attention to all facets of the prompt in a single forward pass.
To address these issues, we apply a two-step revision process.
The first step is to source new evidence. We use a few-shot prompt with GPT-4 to generate 10 sentences which match the existing explanation. For instance, for the "not all" neuron, GPT-4 generates sentences which use the word "all" in a non-negated context.
The hope is to find false positives for the original explanation, i.e. sentences containing tokens where the neuron's real activation is low, but the simulated activation is high. However, we do not filter generated sentences for this condition. In practice, roughly 42% of generated sequences result in false positives for their explanations. For 86% of neurons at least one of the 10 sequences resulted in a false positive. This provides an independent signal that our explanations are often too inclusive. We split the 10 generated sentences into two sets: 5 for revision and 5 for scoring. Once we have generated the new sentences, we perform inference using the subject model and record activations for the target neurons.
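The exact false-positive criterion is not spelled out above; a plausible version, with purely illustrative thresholds, might look like this:

```python
import numpy as np

def is_false_positive(real_acts, sim_acts, max_act, low_frac=0.1, high_frac=0.5):
    """A generated sentence counts as a false positive for an explanation if
    some token is simulated as strongly active while the real activation stays
    low. real_acts are raw activations, sim_acts are on the [0, 10] scale."""
    real = np.asarray(real_acts, dtype=float) / max_act
    sim = np.asarray(sim_acts, dtype=float) / 10.0
    return bool(np.any((sim > high_frac) & (real < low_frac)))
```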
The second step is to use a few-shot prompt with GPT-4 to revise the original model explanation. The prompt includes the evidence used to generate the original explanation, the original explanation, the new generated sentences, and the ground truth activations for those sentences. Once we obtain a revised explanation, we score it on the same set of sequences used to score the original explanation. We also score the original explanation and the revised explanation on the same set augmented with the scoring split of the new evidence.
We find that the revised explanations score better than the original explanations on both the original scoring set and the augmented scoring set. As expected, the original explanations score noticeably worse on the augmented scoring set than the original scoring set.
We find that revision is important: a baseline of re-explanation with the new sentences ("reexplanation") but without access to the old explanation does not improve scores. As a followup experiment, we attempted revision using a small random sample of sentences with nonzero activations ("revision_rand"). We find that this strategy improves explanation scores almost as much as revision using generated sentences. We hypothesize that this is partly because random sentences are also a good source of false positives for initial explanations: roughly 13% of random sentences contain false positive activations for the original model explanations.
Overall, revision lets us exceed the scores of the token lookup table explanations under top-and-random scoring, but not under random-only scoring, where the improvement is limited.
Qualitatively, the main pattern we observe is that the original explanation is too broad and the revised explanation is too narrow, but the revised explanation is closer to the truth. For instance, for layer 0 neuron 4613 the original explanation is "words related to cardinal directions and ordinal numbers". GPT-4 generated 10 sentences based on this explanation that included many words matching this description, such as "third", "eastward", and "southwest", which ultimately lacked significant activations. The revised explanation is "this neuron activates for references to the ordinal number 'Fourth'", which gives far fewer false positives. Nevertheless, the revised explanation does not fully capture the neuron's behavior, as there are several activations for words other than "Fourth", such as "Netherlands" and "white".
We also observe several promising improvements enabled by revision that target problems with the original explanation technique. For instance, a common neuron activation pattern is to activate for a word but only in a very particular context. An example of this is the "hypothetical had" neuron, which activates for the word "had" but only in the context of hypotheticals or situations that might have occurred differently (e.g. "I would have shut it down forever had I the power to do so."). The original model explanation fails to pick up on this pattern and produces the overly-broad explanation, "the word 'had' and its various contexts." However, when provided with sentences containing false positive activations (e.g. "He had dinner with his friends last night") the reviser is able to pick up on the true pattern and produce a corrected explanation. Some other neuron activation patterns that the original explanation fails to capture but the revised explanation accounts for are "the word 'together' but only when preceded by the word 'get'" (e.g. "get together", "got together"), and "the word 'because' but only when part of the 'just because' grammar structure" (e.g. "just because something looks real, doesn't mean it is").
In the future, we plan on exploring different techniques for improving evidence sourcing and revision such as sourcing false negatives, applying chain of thought methods, and fine-tuning.
Explanation quality is also bottlenecked by the extent to which neurons are succinctly explainable in natural language. We found many of the neurons we inspected were polysemantic, potentially due to superposition.
To work around this, we experimented with finding explainable directions in the space of MLP neurons. The high-level idea is to optimize a linear combination of neurons (a "virtual neuron") so that its activations are well explained.
Starting with a uniformly random direction, we alternate between two steps: generating an explanation of the current virtual neuron's activations, and running gradient ascent on the direction to increase the correlation between the virtual neuron's activations and the activations simulated from that explanation.
For the explanation step, the simplest baseline is to simply use the typical top-activation-based explanation method (where activations are for the virtual neuron at each step). However, to improve the quality of the explanation step, we use the revisions with generated negatives, and also reuse high-scoring explanations from previous steps.
We ran this algorithm on GPT-2 small's layer 10 (the penultimate MLP layer), which has 3072 neurons. For each 3072-dimensional direction, we ran 10 rounds of coordinate ascent. For the gradient ascent, we use Adam with a learning rate of 1e-2.
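A sketch of the gradient-ascent phase within one round of coordinate ascent follows; the number of Adam steps and the renormalization of the direction are assumptions.

```python
import torch

def optimize_direction(acts, sim_acts, direction, steps=100, lr=1e-2):
    """acts: (num_tokens, 3072) real MLP activations; sim_acts: activations
    simulated from the current explanation; direction: current 3072-dim vector.
    Gradient ascent increases the correlation between the virtual neuron's
    activations (acts @ direction) and the fixed simulated activations."""
    d = direction.clone().requires_grad_(True)
    opt = torch.optim.Adam([d], lr=lr)
    s = (sim_acts - sim_acts.mean()) / (sim_acts.std() + 1e-8)
    for _ in range(steps):
        v = acts @ d                           # virtual-neuron activations
        v = (v - v.mean()) / (v.std() + 1e-8)
        corr = (v * s).mean()                  # correlation with the simulation
        opt.zero_grad()
        (-corr).backward()                     # ascend the correlation
        opt.step()
    return (d / d.norm()).detach()
```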
We find that the average top-and-random score after 10 iterations is 0.718, substantially higher than the average score for random neurons in this layer (0.147), and higher than the average score for random directions before any optimization (0.061).
One potential problem with this procedure is that we could repeatedly converge upon the same explainable direction, rather than finding a diverse set of local maxima. To check the extent to which this is happening, we measure and find that the resulting directions have very low cosine similarity with each other.
We also inspect the neurons which contribute most to each optimized direction.
One major limitation of this method is that care must be taken when optimizing against a learned proxy of explainability: the optimization may end up exploiting idiosyncrasies of the explainer and simulator rather than finding genuinely interpretable directions.
One important hope is that explanations improve as our assistance gets better. Here, we experiment with different explainer models, while holding the simulator model fixed at GPT-4. We find explanations improve smoothly with explainer model capability, and improve relatively evenly across layers.
We also obtained a human baseline from labelers asked to write explanations from scratch, using the same set of 5 top-activating text excerpts that the explainer models use. Our labelers were non-experts who received instructions and a few researcher-written examples, but no deeper training about neural networks or related topics.
We see that human performance exceeds the performance of GPT-4, but not by a huge margin. Human performance is also low in absolute terms, suggesting that the main barrier to improved explanations may not simply be explainer model capabilities.
With a poor simulator, even a very good explanation will get low scores. To get some sense for simulator quality, we looked at the explanation score as a function of simulator model capability. We find steep returns on top-and-random scoring, and plateauing scores for random-only scoring.
Of course, simulation quality cannot be measured using score. However, we can also verify that score-induced comparisons from larger simulators agree more with humans, using the human comparison data described earlier. Here, the human baseline comes from human-human agreement rates. Scores using GPT-4 as a simulator model are approaching, but still somewhat below, human-level agreement rates with other humans.
One natural question is whether larger, more capable models are more or less difficult to understand than smaller models. Holding our methodology fixed, we find that average explanation scores decrease as the subject model gets larger.
To understand the basis for this trend, we examine explainability by layer. From layer 16 onward, average explanation scores drop robustly with increasing depth, using both top-and-random and random-only scoring. For shallower layers, top-and-random scores also decrease with increasing depth.
Note that these trends may be artificial, in the sense that they mostly reflect limitations of our current explanation generation technique. Our experiments on "next token"-based explanation lend credence to the hypothesis that later layers of larger models have neurons whose behavior is understandable but difficult for our current methods to explain.
One interesting question is whether the architecture of a model affects its interpretability. To explore this, we trained subject models with modified architectures, for example changing the MLP activation function to encourage sparser activations.
Increasing activation sparsity consistently increases explanation scores, but hurts pre-training loss.
Training more tends to improve top-and-random scores but decrease random-only scores.
We also find significant positive transfer for explanations between different checkpoints of the same training run. For example, scores for a quarter-trained model seem to drop by less than 25% when using explanations for a fully-trained model, and vice versa. This suggests a relatively high degree of stability in feature-neuron correspondence.
Throughout the project we found many interesting neurons. GPT-4 was able to find explanations for non-trivial neurons that we thought were reasonable upon inspection, such as a "simile" neuron, a neuron for phrases related to certainty and confidence, and a neuron for things done correctly.
One successful strategy for finding interesting neurons was looking for those which were poorly explained by their token-space explanations, compared with their activation-based explanations. This led us to concurrently discover context neurons, which activate broadly on text with a particular property (for example, text in a particular language or domain) rather than on particular tokens.
Another related strategy that does not rely on explanation quality was to look for context-sensitive neurons that activate differently when the context is truncated. This led us to discover a pattern break neuron which activates for tokens that break an established pattern in an ongoing list (shown below on some select sentences) and a post-typo neuron which activates often following strange or truncated words. Our explanation model is generally unable to get the correct explanation on interesting context-sensitive neurons.
We noticed a number of neurons that appear to activate in situations that match a particular next token, for example a neuron that activates where the next token is likely to be the word “from”. Initially we hypothesized that these neurons might be making a prediction of the next token based on other signals. However, ablations on some of these neurons do not match this story. The “from” neuron appears to actually slightly decrease the probability of “from” being output. At the same time it increases the probability of variations of the word “form”, suggesting one of the things it is doing is accounting for the possibility of a typo. As it is in a late layer (44 out of 48), this neuron may be responding to situations where the network already places high probability on the word “from”. We have not investigated enough to have a clear picture of what is going on, but it is possible that many neurons encode particular subtle variations on the output distribution conditioned on a particular input rather than performing the obvious function suggested by their activations.
We found some interesting examples of neurons that respond to specific kinds of repetition. We found a neuron that activates for repeated occurrences of tokens, with stronger activations for more occurrences. An interesting example of a polysemantic neuron is one that fires for both the phrase "over and over again" and “things repeated right before a non-repeated number”, possibly because “over and over again” itself includes repetition. We also found two neurons that seem mostly to respond to a second mention of a surname when it is combined with a different first name. It is possible that these neurons are responding to induction heads.
Overall, our subjective sense was that neurons for more capable models tended to be more interesting, although we spent the majority of our efforts looking at GPT-2 XL neurons rather than more modern models.
For more interesting neurons, see our neuron visualization website.
When thinking about how to qualitatively understand our explanation methodology, we often ran into two problems. First, we do not have any ground truth for the explanations or scores. Even human-written explanations could be incorrect, or at the very least fail to completely explain the behavior. Furthermore, it is often difficult to tell whether a better explanation exists. Second, we have no control over the complexity or types of patterns neurons encode, and no guarantee that any simple explanation exists.
To address these drawbacks, we created "neuron puzzles": synthetic neurons with human-written explanations and curated evidence. To create a neuron puzzle, we start with a human-written explanation, taken to be ground truth. Next, we gather text excerpts and manually label their tokens with "activations" (not corresponding to the activations of any real network, because these are synthetic neurons) according to the explanation. Thus, each puzzle is formed from an explanation and evidence supporting that explanation (i.e. a set of text excerpts with activations).
To evaluate the explainer, we provide the tokens and synthetic activations to the explainer and observe whether the model-generated explanation matches the original puzzle explanation. We can vary puzzle difficulty and write new puzzles to test for certain patterns that interest us. Thus, a collection of these puzzles forms a useful evaluation for iterating on our explainer technique. We created a total of 19 puzzles, many inspired by neurons we and others found, including a puzzle based on the "not all" neuron described earlier and the 'an' prediction neuron in GPT-2 Large.
For each puzzle, we ensured that the evidence is sufficient for a human to recover the original explanation.
These neuron puzzles also provide a weak signal about whether the scorer is an effective discriminator between proposed explanations for more complex neurons than the ones we have currently found. We created a multiple-choice version for each puzzle by writing a series of false explanations. For example, "this neuron responds to important years in American or European history" was a false explanation for the "incorrect historical years" puzzle. One of the false explanations is always a baseline of the three most common tokens with high activations. For each puzzle, we score the ground-truth explanation and all of the false explanations on the sequences and activations for that puzzle and then record the number of times that the ground-truth explanation has the highest score. For 16/19 puzzles, the ground-truth explanation is ranked highest, and for 18/19 the ground-truth explanation ranks in the top two. Compared with the 5/19 puzzles that the explainer solves, this evaluation suggested to us that the explainer is currently more of a bottleneck than the scorer. This may reflect the fact that detecting a pattern is more difficult than verifying a pattern given an explanation.
Nevertheless, the simulator also suffers from systematic errors. While it performs well at simulating patterns that only require looking at isolated tokens (e.g. "words related to Canada"), it often has difficulty simulating patterns involving positional information, as well as patterns that involve precisely keeping track of some quantity. For instance, when simulating the "an" neuron ("this neuron activates for positions in the sentence which are likely to be followed by the word 'an'"), the results include a very high number of false positives.
We are releasing code for trying out the neuron puzzles we constructed.
Past research has suggested that neurons may not be privileged as a unit of computation.
Analyzing top-activating dataset examples has proved useful in practice in previous work, but can give an incomplete or misleading picture of a neuron's behavior over the full distribution.
One approach to reducing or working around polysemanticity that we did not explore is to apply some factorization to the activation space, for example via dictionary learning, and to explain the resulting components rather than individual neurons.
We currently explain correlations between the network input and the neuron being interpreted on a fixed distribution. Past work has suggested that this may not reflect the causal behavior between the two.
Our explanations also do not explain what causes behavior at a mechanistic level, which could cause our understanding to generalize incorrectly. To predict rare or out-of-distribution model behaviors, it seems possible that we will need a more mechanistic understanding of models.
Our scoring methodology relies on the simulator model faithfully replicating how an idealized human would respond to an explanation. However, in practice, the simulator model could be picking up on aspects of an explanation that a human would not pick up on. In the worst case, the explainer model and simulator model could be implicitly performing some sort of steganography, passing information through the explanation in a way that is not transparent to a human reader.
Ideally, one could mitigate this by training the simulator model to imitate human simulation labels. We plan to investigate this in future work. This may also improve our simulation quality and simplify how we prompt the model.
To understand transformer models more fully we will need to move from interpreting single neurons to interpreting circuits.
Eventually, our explainer models would draw from a rich space of hypotheses, just like interpretability researchers do.
On the other hand, this is perhaps favorable compared to pre-training itself. If pre-training scales data approximately linearly with parameters, then the number of forward passes used in pre-training grows faster than the number of neurons, so explaining every neuron need not dominate the overall compute budget.
Another computational issue is with context length. Our current method requires the explainer model to have context at least twice as long as the text excerpts passed to the subject model. This means that if the explainer model and subject model had the same context length, we would only be able to explain the subject model's behavior within at most half of its full context length, and could thus fail to capture some behavior that only manifests at later tokens.
Serializing the subject model's tokens into the explainer and simulator prompts can also introduce quirks, for example for tokens containing whitespace, newlines, or partial byte sequences that do not round-trip cleanly through the prompt format. To the extent that these tokenization quirks affect the model's understanding of which tokens appeared in the original text excerpt, they could harm the quality of our explanations and simulations.
While we have described a number of limitations with the current version of our methods, we believe our work can be greatly improved and effectively integrated with other existing approaches. For example, successful research on polysemanticity could immediately cause our methods to yield much higher scores. Conversely, our methods could help improve our understanding of superposition by trying to find multiple explanations that cover behavior of a neuron over its entire distribution, or by optimizing to find sets of interpretable directions in the residual stream (perhaps in combination with approaches like dictionary learning). We also hope that we can integrate a wider range of common interpretability techniques, such as studying attention heads, using ablations for validation, etc. into our automated methodology.
Improvements to chain-of-thought methods, tool use, and conversational assistants can also be used to improve explanations. In the long run, we envision that the explainer model could generate, test, and iterate on a rich space of hypotheses about the subject model, similar to an interpretability researcher today. This would include hypotheses about the functionality of circuits and about out-of-distribution behaviors. The explainer model's environment could include access to tools like code execution, subject model visualizations, and talking to researchers. Such a model could be trained using expert iteration or reinforcement learning, with a simulator/judge model setting rewards. We can also train via debate, where two competing assistant models both propose explanations and critique each other's explanations.
We believe our methods could begin contributing to understanding the high-level picture of what is going on inside transformer language models. User interfaces with access to databases of explanations could enable a more macro-focused approach that could help researchers visualize thousands or millions of neurons to see high-level patterns across them. We may be able to soon make progress on simple applications like detecting salient features in reward models, or understanding qualitative changes between a fine-tuned model and its base model.
Ultimately, we would like to be able to use automated interpretability to assist in alignment and safety audits of our models.
This work represents a concrete instance of OpenAI's broader alignment plan of using powerful models to help alignment researchers.
Methodology: Nick effectively started the project by having the initial idea to have GPT-4 explain neurons, and showing a simple explanation methodology worked. William came up with the initial simulation and scoring methodology and implementation. Dan and Steven ran many experiments resulting in ultimate choices of prompts and explanation/scoring parameters.
ML infrastructure: William and Nick set up the initial version of the codebase. Leo and Jeff implemented the initial core internal infrastructure for doing interpretability. Steven implemented the top activations pipeline. Steven and William developed the pipeline for explanations and scoring. Many other miscellaneous contributions came from William, Jeff, Dan, and Steven. Steven created the open source version.
Web infrastructure: Nick and William implemented the neuron viewer, with smaller contributions from Steven, Dan, and Jeff. Nick implemented many other UIs exploring various kinds of neuron explanation. Steven implemented human data gathering UIs.
Human data: Steven implemented and analyzed all experiments involving contractor human data: the human explanation baseline, and human scoring experiments. Nick and William implemented early researcher explanation baselines.
Alternative token and weight-based explanations: Dan implemented all experiments and analysis on token weight and token lookup baselines, next token explanations, as well as infrastructure and UIs for neuron-neuron connection weights.
Revisions: Henk implemented and analyzed the main revision experiments. Nick championed and implemented an initial proof of concept for revisions. Leo implemented a small scale derisking experiment. Steven helped design the final revision pipeline. Leo and Dan did many crucial investigations into negative findings.
Direction finding: Leo had the idea and implemented all experiments related to direction finding.
Neuron puzzles: Henk implemented all the neuron puzzles and related experiments. William came up with the initial idea. Steven and William gave feedback on data and strategies.
Subject, explainer, and simulator scaling: Steven implemented and analyzed assistant size, simulator size, and subject size experiments. Jeff implemented and analyzed subject training time experiments.
Ablation scoring: Jeff implemented ablation infrastructure and initial scoring experiments, and Dan contributed lots of useful thinking and carried out final experiments. Leo did related investigations into understanding and prediction of ablation effects.
Activation function experiments: Jeff implemented the experiments and analysis. Gabe suggested the sparse activation function, and William suggested correlation-based community detection.
Qualitative results: Everyone contributed throughout the project to qualitative findings. Nick and William discovered many of the earliest nontrivial neurons. Dan found many non-trivial neurons by comparing to token baselines, such as simile neurons. Steven found the pattern break neuron, and other context-sensitive neurons. Leo discovered the "don't stop" neuron and first noticed explanations were overly broad. Henk had many qualitative findings about explanation and scoring quality. Nick found interesting neuron-neuron connections and interesting pairs of neurons firing on the same token. William and Jeff investigated ablations of specific neurons.
Guidance and mentorship: William and Jeff led and managed the project. Jan and Jeff managed team members who worked on the project. Many ideas from Jan, Nick, William, and Ilya influenced the direction of the project. Steven mentored Henk.
We thank Neel Nanda, Ryan Greenblatt, Paul Christiano, Chris Olah, and Evan Hubinger for useful discussions on direction during the project.
We thank Ryan Greenblatt, Buck Shlegeris, Trenton Bricken, the OpenAI Alignment team, and the Anthropic Interpretability team for useful discussions and feedback.
We thank Cathy Yeh for doing useful checks of our code and noticing tokenization concerns.
We thank Carroll Wainwright and Chris Hesse for help with infrastructure.
We thank the contractors who wrote and evaluated explanations for our human data experiments, and Long Ouyang for some feedback on our instructions.
We thank Rajan Troll for providing a human baseline for the neuron puzzles.
We thank Thomas Degry for the blog graphic.
We thank other OpenAI teams for their support, including the supercomputing, research acceleration, and language teams.
Please cite as:
Bills, et al., "Language models can explain neurons in language models", 2023.
BibTeX Citation:
@misc{bills2023language,
  title        = {Language models can explain neurons in language models},
  author       = {Bills, Steven and Cammarata, Nick and Mossing, Dan and Tillman, Henk and Gao, Leo and Goh, Gabriel and Sutskever, Ilya and Leike, Jan and Wu, Jeff and Saunders, William},
  year         = {2023},
  howpublished = {\url{https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html}}
}