Language models can explain neurons in language models

Authors

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, William Saunders
* Core Research Contributor; Author contributions statement below. Correspondence to interpretability@openai.com.

Affiliation

OpenAI

Published

May 9, 2023

Introduction

Language models have become more capable and more widely deployed, but we do not understand how they work. Recent work has made progress on understanding a small number of circuits and narrow behaviors, but to fully understand a language model, we'll need to analyze millions of neurons. This paper applies automation to the problem of scaling an interpretability technique to all the neurons in a large language model. Our hope is that building on this approach of automating interpretability will enable us to comprehensively audit the safety of models before deployment.

Our technique seeks to explain what patterns in text cause a neuron to activate. It consists of three steps:

Step 1: Explain the neuron's activations using GPT-4
Show neuron activations to GPT-4:
The Avengers to the big screen, Joss Whedon has returned to reunite Marvel's gang of superheroes for their toughest challenge yet. Avengers: Age of Ultron pits the titular heroes against a sentient artificial intelligence, and smart money says that it could soar at the box office to be the highest-grossing film of the
introduction into the Marvel cinematic universe, it's possible, though Marvel Studios boss Kevin Feige told Entertainment Weekly that, "Tony is earthbound and facing earthbound villains. You will not find magic power rings firing ice and flame beams." Spoilsport! But he does hint that they have some use STARK T
, which means this Nightwing movie is probably not about the guy who used to own that suit. So, unless new director Matt Reeves' The Batman is going to dig into some of this backstory or introduce the Dick Grayson character in his movie, the Nightwing movie is going to have a lot of work to do explaining
of Avengers who weren't in the movie and also Thor try to fight the infinitely powerful Magic Space Fire Bird. It ends up being completely pointless, an embarrassing loss, and I'm pretty sure Thor accidentally destroys a planet. That's right. In an effort to save Earth, one of the heroes inadvertantly blows up an
GPT-4 gives an explanation, guessing that the neuron is activating on references to movies, characters, and entertainment.
Step 2: Simulate activations using GPT-4, conditioning on the explanation
Step 3: Score the explanation by comparing the simulated and real activations
Example neuron — GPT-2 layer 0 neuron 816: language related to Marvel comics, movies, and characters, as well as other superhero-themed content

This technique lets us leverage GPT-4 to define and automatically measure a quantitative notion of interpretability, which we call an "explanation score": a measure of a language model's ability to compress and reconstruct neuron activations using natural language. The fact that this framework is quantitative allows us to measure progress toward our goal of making the computations of a neural network understandable to humans.

With our baseline methodology, explanations achieved scores approaching the level of human contractor performance. We found we could further improve performance by:

However, we found that both GPT-4-based and human contractor explanations still score poorly in absolute terms. When looking at neurons, we also found the typical neuron appeared quite polysemantic. This suggests we should change what we're explaining. In preliminary experiments, we tried:

We applied our method to all MLP neurons in GPT-2 XL. We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron's top-activating behavior. We used these explanations to build new user interfaces for understanding models, for example allowing us to quickly see which neurons activate on a particular dataset example and what those neurons do.

Example (1 of 4) from the interactive neuron viewer, showing which neurons fire on each token:
Many of our readers may be aware that Japanese consumers are quite fond of unique and creative Kit Kat products and flavors. But now, Nestle Japan has come out with what could be described as not just a new flavor but a new "species" of Kit Kat. And why are we calling it a new species? Well, it's because you'll need to do just a little bit of cooking to fully enjoy these Kit Kats.

We are open-sourcing our dataset of explanations for all neurons in GPT-2 XL, along with code for explanation and scoring, to encourage further research into producing better explanations. We are also releasing a neuron viewer built on the dataset. Although most well-explained neurons are not very interesting, we found many interesting neurons that GPT-4 didn't understand. We hope this lets others more easily build on top of our work. With better explanations and tools in the future, we may be able to rapidly uncover interesting qualitative understanding of model computations.

Methods

Setting

Our methodology involves multiple language models: the subject model (the model we aim to interpret), the explainer model (which generates explanations of the subject model's neurons), and the simulator model (which predicts a neuron's activations based on an explanation).

In this paper we start with the easiest case of identifying properties of text inputs that correlate with intermediate activations. We ultimately want to extend our method to explore arbitrary hypotheses about subject model computations.

In the language model case, the inputs are text passages. For the intermediate activations, we focus on neurons in the MLP layers. For the remainder of this paper, activations refer to the MLP post-activation value calculated as a = \mathrm{f}(W_{in} \cdot x + b), where \mathrm{f} is a nonlinear activation function (specifically GELU for GPT-2). The neuron activation is then used to update the residual stream by adding a \cdot W_{out}.
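As a concrete sketch of this quantity (hypothetical tensor names and shapes; not code from our release), the per-token activation of a single MLP neuron can be computed as follows:

import torch
import torch.nn.functional as F

def mlp_neuron_activations(x, W_in, b_in, neuron_idx):
    # x:    residual-stream inputs after the pre-MLP layer norm, shape (n_tokens, d_model)
    # W_in: MLP input weights, shape (d_model, d_mlp); b_in: bias, shape (d_mlp,)
    # Returns the GELU post-activations a for one neuron, shape (n_tokens,).
    pre = x @ W_in + b_in
    post = F.gelu(pre)          # f = GELU for GPT-2
    return post[:, neuron_idx]  # this value, times W_out, is what gets added to the residual stream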

When we have a hypothesized explanation for a neuron, the hypothesis is that the neuron activates on tokens with that property, where the property may include the previous tokens as context.

Overall algorithm

At a high level, our process of interpreting a neuron uses the following algorithm: first, generate an explanation of the neuron's behavior by showing the explainer model (token, activation) pairs (Step 1); second, use the simulator model to simulate the neuron's activations conditional on the explanation (Step 2); third, score the explanation by comparing the simulated and actual activations (Step 3).

We always use distinct documents for explanation generation and simulation. However, we did not explicitly check that the resulting text excerpts do not overlap. While in principle it would be reasonable for an explanation to "memorize" behavior to the extent that it drives most of the subject model's behavior on the training set, it would be less interesting if that was the primary driver of high scores. Based on some simple checks of our text excerpts, this was a non-issue for at least 99.8% of neurons.

Our code for generating explanations, simulating neurons, and scoring explanations is available here.

Step 1: Generate explanations of the neuron's behavior

In this step, we create a prompt that is sent to the explainer model to generate one or more explanations of a neuron's behavior. The prompt consists of few-shot examples built from other real neurons: tab-separated (token, activation) pairs from text excerpts, followed by researcher-written explanations. The prompt ends with tab-separated (token, activation) pairs from text excerpts for the neuron being interpreted. (All prompts are shown in an abbreviated format and are modified somewhat when using the structured chat completions API; for full details see our codebase.)
We're studying neurons in a neural network. Each neuron looks for some particular thing in a short document. Look at the parts of the document the neuron activates for and summarize in a single sentence what the neuron is looking for. Don't list examples of words. The activation format is token<tab>activation. Activation values range from 0 to 10. A neuron finding what it's looking for is represented by a non-zero activation value. The higher the activation value, the stronger the match. Neuron 1 Activations: <start> the 0 sense 0 of 0 together 3 ness 7 in 0 our 0 town 1 is 0 strong 0 . 0 <end> <start> [prompt truncated …] <end> Same activations, but with all zeros filtered out: <start> together 3 ness 7 town 1 <end> <start> [prompt truncated …] <end> Explanation of neuron 1 behavior: the main thing this neuron does is find phrases related to community [prompt truncated …] Neuron 4 Activations: <start> Esc 0 aping 9 the 4 studio 0 , 0 Pic 0 col 0 i 0 is 0 warmly 0 affecting 3 <end> <start> [prompt truncated …] <end> Same activations, but with all zeros filtered out: <start> aping 9 the 4 affecting 3 <end> <start> [prompt truncated …] <end> [prompt truncated …] Explanation of neuron 4 behavior: the main thing this neuron does is find

Activations are normalized to a 0-10 scale and discretized to integer values, with negative activation values mapping to 0 and the maximum activation value ever observed for the neuron mapping to 10. For sequences where the neuron's activations are sparse (<20% non-zero), we found it helpful to additionally repeat the token/activation pairs with non-zero activations after the full list of tokens, helping the model to focus on the relevant tokens.
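A minimal sketch of this normalization (our reading of the description above; exact rounding details may differ in the released code):

import numpy as np

def discretize_activations(acts, max_activation):
    # Negative activations map to 0; the neuron's maximum observed activation maps to 10;
    # values in between are scaled linearly and rounded to integers.
    acts = np.asarray(acts, dtype=float)
    scaled = 10.0 * np.clip(acts, 0.0, None) / max_activation
    return np.clip(np.round(scaled), 0, 10).astype(int)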

Step 2: Simulate the neuron's behavior using the explanations

With this method, we aim to answer the question: supposing a proposed explanation accurately and comprehensively explains a neuron's behavior, how would that neuron activate for each token in a particular sequence? To do this, we use the simulator model to simulate neuron activations for each subject model token, conditional on the proposed explanation.

We prompt the simulator model to output an integer from 0-10 for each subject model token. For each predicted activation position, we examine the probability assigned to each number ("0", "1", …, "10"), and use those to compute the expected value of the output. The resulting simulated neuron value is thus on a [0, 10] scale.
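For example, the expected value can be computed from the simulator's log-probabilities over the eleven allowed value tokens (a sketch with an assumed input format):

import math

def expected_simulated_activation(value_logprobs):
    # value_logprobs: dict mapping the strings "0", "1", ..., "10" to the simulator
    # model's log-probabilities at the prediction position.
    probs = {int(tok): math.exp(lp) for tok, lp in value_logprobs.items()}
    total = sum(probs.values())  # renormalize over the allowed values
    return sum(v * p for v, p in probs.items()) / total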

Our simplest method is what we call the "one at a time" method. The prompt consists of some few-shot examples and a single-shot example of predicting an individual token's activation.

We're studying neurons in a neural network. Each neuron looks for some particular thing in a short document. Look at an explanation of what the neuron does, and try to predict its activations on a particular token. The activation format is token<tab>activation, and activations range from 0 to 10. Most activations will be 0. Neuron 1 Explanation of neuron 1 behavior: the main thing this neuron does is find phrases related to community Activations: <start> the 0 sense 0 of 0 together 3 ness 7 in 0 our 0 town 1 is 0 strong 0 . 0 <end> <start> [prompt truncated …] Neuron 4 Explanation of neuron 4 behavior: the main thing this neuron does is find present tense verbs ending in 'ing' Text: Starting from a position of Last token in the text: of Last token activation, considering the token in the context in which it appeared in the text:

Unfortunately, the "one at a time" method is quite slow, as it requires one forward pass per simulated token. We use a trick to parallelize the probability predictions across all tokens by having few-shot examples where activation values switch from being "unknown" to being actual values at a random location in the sequence. This way, we can simulate the neuron with "unknown" in the context while still eliciting model predictions by examining logprobs for the "unknown" tokens, and without the model ever getting to observe any actual activation values for the relevant neuron. We call this the "all at once" method.

We're studying neurons in a neural network. Each neuron looks for some particular thing in a short document. Look at an explanation of what the neuron does, and try to predict how it will fire on each token. The activation format is token<tab>activation, activations go from 0 to 10, "unknown" indicates an unknown activation. Most activations will be 0. Neuron 1 Explanation of neuron 1 behavior: the main thing this neuron does is find phrases related to community Activations: <start> the unknown sense unknown of 0 together 3 ness 7 in 0 our 0 town 1 is 0 strong 0 . 0 <end> <start> [prompt truncated …] <end> [prompt truncated …] Neuron 4 Explanation of neuron 4 behavior: the main thing this neuron does is find present tense verbs ending in 'ing' Activations: <start> Star unknown ting unknown from unknown a unknown position unknown of unknown strength unknown <end>

Due to the speed advantage, we use "all at once" scoring for the remainder of the paper, except for some of the smaller-scale qualitative results. While we used GPT-4 as the simulator model for most of our experiments, the public OpenAI API does not support returning logprobs for newer chat-based models like GPT-4 and GPT-3.5-turbo; older models like the original GPT-3.5 do support logprobs. In addition to being faster, the "all at once" method surprisingly produced scores that were as accurate as the "one at a time" method at predicting human preferences between explanations. ("All at once" actually outperformed "one at a time", but the effect was within noise, and researchers subjectively thought "one at a time" was better, based on relatively small sample sizes.)

Step 3: Score the explanations by comparing the simulated and actual neuron behavior

Conceptually, given an explanation and simulation strategy, we now have a simulated neuron, a "neuron" for which we can predict activation values for any given text excerpt. To score an explanation, we want to compare this simulated neuron against the real neuron for which the explanation was generated. That is, we want to compare two lists of values: the simulated activation values for the explanation over multiple text excerpts, and the actual activation values of the real neuron on the same text excerpts.

A conceptually simple approach is to use explained variance of the true activations by the simulated activations, across all tokens. That is, we could calculate 1 - \frac{\mathbb{E}_t[(s(t) - a(t))^2]}{\mathrm{Var}_t(a(t))}, where s(t) indicates the simulated activation given the explanation, and a(t) indicates the true activation, and expectations are across all tokens from the chosen text excerpts. Note that this isn't an unbiased estimator of true explained variance, since we also use a sample for the denominator. One could improve on our approach by using a much larger sample for estimating the variance term.

However, our simulated activations are on a [0, 10] scale, while real activations have some arbitrary distribution. Thus, we assume the ability to calibrate the simulated neuron's activation distribution to the actual neuron's distribution. We chose to simply calibrate linearly. (We explored more complicated calibration methods, but they typically require many simulations, which are expensive to obtain on the text excerpts being scored. Conceptually, calibration should ideally happen on a different set of text excerpts, so we aren't "cheating" by using the true mean and variance; we empirically studied this cheating effect for differing sample sizes and believe it to be small in practice.) If \rho is the correlation coefficient between the true and simulated activations, then we scale simulations so their mean matches that of the true activations, and their standard deviation is \rho times the standard deviation of the true activations. This maximizes explained variance at \rho^2. (Matching standard deviations instead results in explained variance of 2 \rho - 1 < \rho^2; we also find empirically that it performs worse in ablation-based scoring.)

This motivates our main method of scoring, correlation scoring, which simply reports \rho. Note then that if the simulated neuron behaves identically to the real neuron, the score is 1. If the simulated neuron behaves randomly, e.g. if the explanation has nothing to do with the neuron behavior, the score will tend to be around 0.
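A minimal sketch of correlation scoring and the linear calibration described above (illustrative code, not our released implementation):

import numpy as np

def correlation_score(true_acts, sim_acts):
    # Correlation scoring: report rho between true and simulated activations.
    return float(np.corrcoef(true_acts, sim_acts)[0, 1])

def calibrate_linearly(true_acts, sim_acts):
    # Rescale simulations so their mean matches the true mean and their standard
    # deviation is rho times the true standard deviation; this choice maximizes
    # explained variance at rho**2.
    rho = correlation_score(true_acts, sim_acts)
    sim_std = np.std(sim_acts)
    scale = rho * np.std(true_acts) / sim_std if sim_std > 0 else 0.0
    return np.mean(true_acts) + scale * (np.asarray(sim_acts) - np.mean(sim_acts))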

Validating against ablation scoring

Another way to understand a network is to perturb its internal values during a forward pass and observe the effect. This suggests a more expensive approach to scoring, where we replace the real neuron with the simulated neuron (i.e. ablate its activations to simulated activation values) and check whether the network behavior is preserved.

To measure the extent of the behavioral change from ablating to simulation, we use Jensen-Shannon divergence between the perturbed and original model’s output logprobs, averaged across all tokens. As a baseline for comparison, we perform a second perturbation, ablating the neuron’s activation to its mean value across all tokens. For each neuron, we normalize the divergence of ablating to simulation by the divergence of ablating to the mean. Thus, we express an ablation score as 1 - \frac{\mathbb{E}_x[\textrm{AvgJSD}(m(x, n=s(x)) || m(x))]}{\mathbb{E}_x[\textrm{AvgJSD}(m(x, n=\mu) || m(x))]}, where m(x, n=\ldots) indicates running the model over the text excerpt x with the neuron ablated and returning a predicted distribution at each token, \textrm{AvgJSD} takes the Jensen-Shannon divergences at each token and averages them, s(x) is the linearly calibrated vector of simulated neuron values on the sequence, and \mu is the average activation for that neuron across all tokens. Note that for this ablation score, as for the correlation score, chance performance results in a score of 0.0, and perfect performance results in a score of 1.0.
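A sketch of the ablation score computation, assuming we already have per-excerpt output log-probabilities from the relevant forward passes (original, ablate-to-simulation, ablate-to-mean):

import numpy as np

def avg_jsd(logprobs_p, logprobs_q):
    # Average Jensen-Shannon divergence between two sets of per-token output
    # distributions, each of shape (n_tokens, vocab).
    p, q = np.exp(logprobs_p), np.exp(logprobs_q)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log(a + 1e-12) - np.log(b + 1e-12)), axis=-1)
    return float(np.mean(0.5 * kl(p, m) + 0.5 * kl(q, m)))

def ablation_score(jsd_sim_per_excerpt, jsd_mean_per_excerpt):
    # 1 - E_x[AvgJSD(simulation ablation)] / E_x[AvgJSD(mean ablation)],
    # where each argument is a list of AvgJSD values, one per text excerpt x.
    return 1.0 - float(np.mean(jsd_sim_per_excerpt)) / float(np.mean(jsd_mean_per_excerpt))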

[Figure: ablation score vs. correlation score, random-only and top-and-random scoring]

We find that correlation scoring and ablation scoring have a clear relationship, on average. Thus, the remainder of the paper uses correlation scoring, as it is much simpler to compute. Nevertheless, correlation scoring appears not to capture all the deficits in simulated explanations revealed by ablation scoring. In particular, correlation scores of 0.9 still lead to relatively low ablation scores on average (0.3 for scoring on random-only text excerpts and 0.6 for top-and-random; see below for how these text excerpts are chosen). (This might happen if subtle variations in the activation of a neuron, making the difference, say, between a correlation score of 0.9 and 1.0, played an outsized role in its function within the network.)

Validating against human scoring

One potential worry is that simulation-based scoring does not actually reflect human evaluation of explanations (see here for more discussion). We gathered human evaluations of explanation quality to see whether they agreed with score-based assessment.

We gave human labelers tasks where they see the same text excerpts and activations (shown with color highlighting) as the simulator model (both top-activating and random), and are asked to rate and then rank 5 proposed explanations based on how well those explanations capture the activation patterns. We found the explainer model explanations were not diverse, and so increased explanation diversity by varying the few-shot examples used in the explanation generation prompt, or by using a modified prompt that asks the explainer model for a numbered list of possible explanations in a single completion.

Our results show that humans tend to prefer higher-scoring explanations over lower-scoring ones, with the consistency of that preference increasing as the size of the score gap increases.

Algorithm parameters and details

Throughout this work, unless otherwise specified, we use GPT-2 pretrained models as subject models and GPT-4 for the explainer and simulator models. Unlike GPT-2, GPT-4 is a model trained to follow instructions via RLHF.

For both generating and simulating explanations, we take text excerpts from the training split of the subject model's pre-training dataset (e.g. WebText, for GPT-2 models). We choose random 64-token contiguous subsequences of the documents as our text excerpts, formatted identically to how the models were trained, with the exception that we never cross document boundaries. (Note that since GPT-2 models use byte-pair encoders, our texts sometimes have mid-character breaks; see below for more discussion. GPT-2 was trained to sometimes see multiple documents, separated by a special end-of-text token. In our work, we ensure all 64 tokens are within the same document.)

When generating explanations, we use 5 "top-activating" text excerpts, which have at least one token with an extremely large activation value, as determined by the quantile of the max activation. This was because we found empirically that:

Thus, for the remainder of this paper, explanations are always generated from 5 top-activating sequences unless otherwise noted. We set a top quantile threshold of 0.9996, taking the 20 sequences containing the highest activations out of 50,000 total sequences. We sample explanations at temperature 1.

For simulation and scoring, we report on 5 uniformly random text excerpts ("random-only"). The random-only score can be thought of as an explanation’s ability to capture the neuron’s representation of features in the pre-training distribution. While random-only scoring is conceptually easy to interpret, we also report scores on a mix of 5 top-activating and 5 random text excerpts ("top-and-random"). The top-and-random score can be thought of as an explanation’s ability to capture the neuron’s most strongly represented feature (from the top text excerpts), with a penalty for overly broad explanations (from the random text excerpts). Top-and-random scoring has several pragmatic advantages over random-only:

Note that "random-only" scoring with small sample size risks failing to capture behavior, due to lacking both tokens with high simulated activations and tokens with high real activations. "Top-and-random" scoring addresses the latter, but causes us to penalize falsely low simulations more than falsely high simulations, and thus tends to accept overly broad explanations. A more principled approach which gets the best of both worlds might be to stick to random-only scoring, but increase the number of random-only text excerpts in combination with using importance sampling as a variance reduction strategy. Unfortunately, initial attempts at this using a moderate increase in number of text excerpts did not prove to be useful.

Below we show some prototypical examples of neuron scoring.

[Carousel of prototypical scoring examples, including a neuron with a particularly good top-and-random score but a bad random-only score, due to behavior in the low-activation regime]

Results

Notes on interpretation: Throughout this section, our results may have been obtained using slightly differing methodologies (e.g. different explainer models, different prompts, etc.). Thus, scores are not always comparable across graphs. In all plots, error bars correspond to a 95% confidence interval for the mean. In most places, we calculate this using 1.96 times the standard error of the mean (SEM), or a strictly more conservative statistic. Where needed, we estimate it via bootstrap resampling.

Overall, for GPT-2, we find an average score of 0.151 using top-and-random scoring, and 0.037 for random-only scoring. Scores generally decrease when going to later layers.

[Figure: scores by layer, random-only and top-and-random scoring]

Note that individual scores for neurons may be noisy, especially for random-only scoring. With that in mind, out of a total of 307,200 neurons, 5,203 (1.7%) have top-and-random scores above 0.7 (explaining roughly half the variance), using our default methodology. With random-only scoring, this drops to 732 neurons (0.2%). Only 189 neurons (0.06%) have top-and-random scores above 0.9, and 86 (0.03%) have random-only scores above 0.9.

Unigram baselines

To understand the quality of our explanations in absolute terms, it is helpful to compare with a baseline that does not depend on language models' ability to summarize or simulate activations. For this reason, we examine several baseline methods that directly predict activation based on a single token, using either model weights, or activation data aggregated across held out texts. For each neuron, these methods give one predicted activation value per token in the vocabulary, which amounts to substantially more information than the short natural language explanation produced by a language model. For that reason, we also used language models to briefly summarize these lists of activation values, and used that as an explanation in our typical simulation pipeline.

We're studying neurons in a neural network. Each neuron looks for some particular kind of token (which can be a word, or part of a word). Look at the tokens the neuron activates for (listed below) and summarize in a single sentence what the neuron is looking for. Don't list examples of words. Tokens: 'the', 'cat', 'sat', 'on', 'the', 'mat' Explanation: This neuron is looking for

Token-based prediction using weights

The logit lens and related techniques relate neuron weights to tokens, to try to interpret neurons in terms of tokens that cause them to activate, or tokens that the neuron causes the model to sample. Fundamentally, these techniques depend on multiplying the input or output weights of a neuron by either the embedding or unembedding matrix.

To linearly predict activations for each token, we multiply each token embedding by the pre-MLP layer norm gain and neuron input weights (W_{in}). This methodology allows for a single scalar value to be assigned to each token in the vocabulary. The predicted activation with this method depends only on the current token and not on the preceding context. These scalar values are used directly to predict the activation for each token, with no summarization. These weight-based linear predictions are mechanistic or “causal” explanations in that the weights directly affect the neuron’s pattern of inputs, and thus its pattern of activations.
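A sketch of this weight-based unigram prediction (assumed tensor layouts; names are illustrative):

import numpy as np

def linear_token_predictions(W_embed, ln_gain, W_in, neuron_idx):
    # W_embed: token embedding matrix, shape (vocab, d_model)
    # ln_gain: pre-MLP layer norm gain, shape (d_model,)
    # W_in:    MLP input weights for the layer, shape (d_model, d_mlp)
    # Returns one predicted (unnormalized) activation value per token in the vocabulary.
    return (W_embed * ln_gain) @ W_in[:, neuron_idx]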

The linear token-based prediction baseline outperforms activation-based explanation and scoring for the first layer, predicting activations almost perfectly (unsurprising given that only the first attention layer intervenes between the embedding and first MLP layer). For all subsequent layers, GPT-4-based explanation and scoring predicts activations better than the linear token-based prediction baseline. (We also tried linear prediction including the position embedding for each position in the text excerpt plus the token embedding; this linear token- and position-based prediction baseline resulted in very small quantitative improvements and no qualitative change.)

The linear token-based prediction baseline is a somewhat unfair comparison, as the "explanation length" of one scalar value per token in the vocabulary is substantially longer than GPT-based natural language explanations. Using a language model to compress this information into a short explanation and simulate that explanation might act as an “information bottleneck” that affects the accuracy of predicted activations. To control for this, we try a hybrid approach, applying GPT-4-based explanation to the list of tokens with the highest linearly predicted activations (corresponding to 50 out of the top 100 values), rather than to top activations. These explanations score worse than either linear token-based prediction or activation-based explanations.

[Figure: random-only and top-and-random scoring results]

Token-based prediction using lookup tables

The token-based linear prediction baseline might underperform the activation-based baseline for one of several reasons. First, it might fail because multi-token context is important (for example, many neurons are sensitive to multiple-token phrases). Second, it might fail because intermediate processing steps between the token embedding and W_{in} are important, and the linear prediction is a poor representation of the true causal impact of a token on a neuron's activity.

To evaluate the second possibility, we construct a second, “correlational” baseline. For this baseline, we compute the mean activation per-token and per-neuron over a large corpus of held-out internet text. We then use this information to construct a lookup table. For each token in a text excerpt and for each neuron, we predict the neuron's activation using the token lookup table, independent of the preceding tokens. Again, we do not summarize the contents of this lookup table, or use a language model to simulate activations.
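A sketch of building and applying the token lookup table (illustrative code):

from collections import defaultdict
import numpy as np

def build_lookup_table(token_ids, activations):
    # Mean activation per token over a held-out corpus; token_ids and activations
    # are flat, token-aligned arrays covering the corpus.
    sums, counts = defaultdict(float), defaultdict(int)
    for tok, act in zip(token_ids, activations):
        sums[tok] += float(act)
        counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}

def predict_with_lookup(table, token_ids, default=0.0):
    # Predict each token's activation independently of the preceding context.
    return np.array([table.get(tok, default) for tok in token_ids])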

The token lookup table baseline is much stronger than the token-based linear prediction baseline, substantially outperforming activation-based explanation and simulation on average. We apply the same explanation technique as with the token-based linear baseline to measure how the information bottleneck from explanation and simulation using GPT-4 affects the accuracy of predicted activations.

The resulting token lookup table-based explanation results in a score similar to our activation-based explanation on top-and-random scoring, but outperforms activation-based explanations on random-only scoring. However, we are most interested in neurons that encode complex patterns of multi-token context rather than single tokens. Despite worse performance on average, we find many interesting neurons where activation-based explanations have an advantage over token-lookup-table-based explanations. We are also able to improve over the token-lookup-table-based explanation by revising explanations. In the long run, we plan to use methods that combine both token-based and activation-based information.

[Figure: random-only and top-and-random scoring results]

Next-token-based explanations

We noticed that some neurons appear to encode the predicted next token rather than the current token, particularly in later layers (see the “from”-predicting neuron described below). Our baseline methodology, which prompts GPT-4 with (preceding token, activation) pairs, is unable to capture this behavior. As an alternative, we prompt GPT-4 to explain and simulate the neuron’s activations based on the tokens following its highest activations by using (next token, activation) pairs instead. This approach is more successful for a subset of neurons, particularly in later layers, and achieves similar scores on average to the baseline method in the last few layers.
[Figure: random-only and top-and-random scoring results]

Revising explanations

Explanation quality is fundamentally bottlenecked by the small set of text excerpts and activations shown in a single explanation prompt, which is not always sufficient to explain a neuron's behavior. Iterating on explanations would potentially let us leverage more information effectively, relying on the emergent ability of large language models to use reasoning to improve responses at test time. (As noted above, in the non-iterative setting we find the explainer model is unable to effectively make use of additional text excerpts in context.)

One particular issue we find is that overly broad explanations tend to be consistent with the top-activating sequences. For instance, we found a "not all" neuron which activates on the phrase "not all" and some related phrases. However, the top-activating text excerpts chosen for explanation do not falsify a simpler hypothesis, that the neuron activates on the word "all". The model thus generates an overly broad explanation: 'the term "all" along with related contextual phrases'. (When one of us looked at this neuron without being careful, we too came to this conclusion. It was only after testing sequences like "All students must turn in their final papers by Monday" that we realized the initial explanation was too broad.) Nevertheless, observing activations for text excerpts containing "all" in different contexts reveals that the neuron is actually activating for "all", but only when part of the phrase "not all". From this example and other similar examples, we concluded that sourcing new evidence beyond the top and random activation sequences would be helpful for more fully explaining some neurons.

To make things worse, we find that our explainer technique often fails to take into account negative evidence (i.e. examples of the neuron not firing which disqualify certain hypotheses). With the "not all" neuron, even when we manually add negative evidence to the explainer context (i.e. sequences that include the word "all" with zero activation), the explainer ignores these and produces the same overly broad explanation. The explainer model may be unable to pay attention to all facets of the prompt in a single forward pass.

To address these issues, we apply a two-step revision process.

The first step is to source new evidence. We use a few-shot prompt with GPT-4 to generate 10 sentences which match the existing explanation. For instance, for the "not all" neuron, GPT-4 generates sentences which use the word "all" in a non-negated context.

The task format is as follows. description :: <answer>example sentence that fits that description</answer> The answer is always at least one full sentence, not just a word or a phrase. The following tasks have only one answer each enclosed in <answer></answer> tags. negation of instances of the word "stop" or conceptually similar words (e.g. "kept", "warrant") that imply something coming to an end or being prevented. :: <answer>been that way for more than 30 years but that doesn't stop successive governments in countries around the globe</answer> words related to providing or contributing something (e.g. "contribute," "contributor," "contribution"). :: <answer>The new information showed that during the last three month of the year the CPI fell by 0.3 percent. The drop was largely contributed to a 25 percent decrease in the price of vegetables over</answer> language related to leadership or administrative roles (e.g. "treasurer," "governor") as well as language related to game mechanics or design (e.g. "mechanics"). :: <answer>King Roo, after a rather disastrous incident involving some of his Dice-a-Roo prizes, is hiring a new treasurer! Before getting the</answer> references to prominent figures in the hip hop music industry (e.g. artist names, album titles, song titles). :: <answer>The album was released on October 22, 2002, by Ruff Ryders Entertainment and Interscope Records. The album debuted at number one on the US Billboard 200 chart, selling 498,000 copies in its first week.</answer> This next task has exactly 10 answer(s) each enclosed in <answer></answer> tags. Remember, the answer is always at least one full sentence, not just a word or a phrase. positions in the sentence where the next word is likely to be "an" ::

The hope is to find false positives for the original explanation, i.e. sentences containing tokens where the neuron's real activation is low, but the simulated activation is high. However, we do not filter generated sentences for this condition. In practice, roughly 42% of generated sequences result in false positives for their explanations. For 86% of neurons at least one of the 10 sequences resulted in a false positive. This provides an independent signal that our explanations are often too inclusive. We split the 10 generated sentences into two sets: 5 for revision and 5 for scoring. Once we have generated the new sentences, we perform inference using the subject model and record activations for the target neurons.
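A sketch of the false-positive check on a generated sentence (the thresholds here are illustrative placeholders, not the values we used):

def has_false_positive(real_acts, sim_acts, real_thresh=0.1, sim_thresh=0.5):
    # A generated sentence is a false positive for an explanation if it contains a
    # token whose simulated activation is high but whose real activation is low.
    return any(real < real_thresh and sim > sim_thresh
               for real, sim in zip(real_acts, sim_acts))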

The second step is to use a few-shot prompt with GPT-4 to revise the original model explanation. The prompt includes the evidence used to generate the original explanation, the original explanation, the new generated sentences, and the ground truth activations for those sentences. Once we obtain a revised explanation, we score it on the same set of sequences used to score the original explanation. We also score the original explanation and the revised explanation on the same set augmented with the scoring split of the new evidence.

The following solutions are the output of a Bayesian reasoner which is optimized to explain the function of neurons in a neural network using limited evidence. Each neuron looks for some particular thing in a short passage. Neurons activate on a word-by-word basis. Also, neuron activations can only depend on words before the word it activates on, so the explanation cannot depend on words that come after, and should only depend on words that come before the activation. The reasoner is trying to revise the explanation for neuron A. The neuron activates on the following words (activating word highlighted with **): """ But that didn't **stop** it becoming one of the most popular products on the shelf. Technology has changed quite a bit over Vernon Cook's lifetime, but that hasn't **stopped** him from embracing the advance. The Storm and Sharks don't have the same storied rivalry as some of the grand finalists in years gone by, but that hasn't **halted** their captivating contests in recent times. """ The current explanation is: the main thing this neuron does is find language related to something being stopped, prevented, or halted. The reasoner receives the following new evidence. Activating words are highlighted with **. If no words are highlighted with **, then the neuron does not activate on any words in the sentence. """ But that stopped it becoming one of the most popular products on the shelf. Technology has changed quite a bit over Vernon Cook's lifetime, and that stopped him from embracing the advance. I have to stop before I get there. """ In light of the new evidence, the reasoner revises the current explanation to: the main thing this neuron does is find the negation of language related to something being stopped, prevented, or halted (e.g. "didn't stop") [prompt truncated …] The reasoner is trying to revise the explanation for neuron D. The neuron activates on the following words (activating word highlighted with **): """ Kiera wants to make sure she has strong bones, so she drinks 2 liters of milk every week. After 3 weeks, how many liters of milk will Kiera drink? Answer: After 3 weeks, Kiera will drink **4** liters of milk. Ariel was playing basketball. 1 of her shots went in the hoop. 2 of her shots did not go in the hoop. How many shots were there in total? Answer: There were **2** shots in total. The restaurant has 175 normal chairs and 20 chairs for babies. How many chairs does the restaurant have in total? Answer: **295** Lily has 12 stickers and she wants to share them equally with her 3 friends. How many stickers will each person get? Answer: Each person will get **5** stickers. """ The current explanation is: the main thing this neuron does is find numerical answers in word problems.. The reasoner receives the following new evidence. Activating words are highlighted with **. If no words are highlighted with **, then the neuron does not activate on any words in the sentence. """ Kiera wants to make sure she has strong bones, so she drinks 2 liters of milk every week. After 3 weeks, how many liters of milk will Kiera drink? Answer: After 3 weeks, Kiera will drink 6 liters of milk. Ariel was playing basketball. 1 of her shots went in the hoop. 2 of her shots did not go in the hoop. How many shots were there in total? Answer: There were 3 shots in total. The restaurant has 175 normal chairs and 20 chairs for babies. How many chairs does the restaurant have in total? Answer: 195 Lily has 12 stickers and she wants to share them equally with her 3 friends. 
How many stickers will each person get? Answer: Each person will get 4 stickers. """ In light of the new evidence, the reasoner revises the current explanation to: the main thing this neuron does is find

We find that the revised explanations score better than the original explanations on both the original scoring set and the augmented scoring set. As expected, the original explanations score noticeably worse on the augmented scoring set than the original scoring set.

[Figure: top-and-random scoring and top-and-random-plus-generated scoring results]

We find that revision is important: a baseline of re-explanation with the new sentences ("reexplanation") but without access to the old explanation does not improve upon the original explanations. As a follow-up experiment, we attempted revision using a small random sample of sentences with nonzero activations ("revision_rand"). We find that this strategy improves explanation scores almost as much as revision using generated sentences. We hypothesize that this is partly because random sentences are also a good source of false positives for initial explanations: roughly 13% of random sentences contain false positive activations for the original model explanations.

Overall, revision lets us exceed scores of the token lookup table explanations for top-and-random but not random-only scoring, for which the improvement is limited. Our revision process is also agnostic to the explanation it starts with, so we could likely also start from our strong unigram baselines and revise based on relevant sentences. We suspect this would outperform our existing results and plan to try techniques like this in the future.

[Figure: random-only and top-and-random scoring results]

Qualitatively, the main pattern we observe is that the original explanation is too broad and the revised explanation is too narrow, but that the revised explanation is closer to the truth. For instance, for layer 0 neuron 4613 the original explanation is "words related to cardinal directions and ordinal numbers". GPT-4 generated 10 sentences based on this explanation that included many words matching this description which ultimately lacked significant activations, such as "third", "eastward", "southwest". The revised explanation is "this neuron activates for references to the ordinal number 'Fourth'", which gives far fewer false positives. Nevertheless, the revised explanation does not fully capture the neuron's behavior as there are several activations for words other than fourth, like "Netherlands" and "white".

We also observe several promising improvements enabled by revision that target problems with the original explanation technique. For instance, a common neuron activation pattern is to activate for a word but only in a very particular context. An example of this is the "hypothetical had" neuron, which activates for the word "had" but only in the context of hypotheticals or situations that might have occurred differently (e.g. "I would have shut it down forever had I the power to do so."). The original model explanation fails to pick up on this pattern and produces the overly-broad explanation, "the word 'had' and its various contexts." However, when provided with sentences containing false positive activations (e.g. "He had dinner with his friends last night") the reviser is able to pick up on the true pattern and produce a corrected explanation. Some other neuron activation patterns that the original explanation fails to capture but the revised explanation accounts for are "the word 'together' but only when preceded by the word 'get'" (e.g. "get together", "got together"), and "the word 'because' but only when part of the 'just because' grammar structure" (e.g. "just because something looks real, doesn't mean it is").

In the future, we plan on exploring different techniques for improving evidence sourcing and revision such as sourcing false negatives, applying chain of thought methods, and fine-tuning.

Finding explainable directions

Explanation quality is also bottlenecked by the extent to which neurons are succinctly explainable in natural language. We found many of the neurons we inspected were polysemantic, potentially due to superposition. This suggests a different way to improve explanations by improving what we're explaining. We explore a simple algorithm that leverages this intuition, using our automated methodology for a possible angle of attack on superposition.

The high-level idea is to optimize a linear combination of neurons. (This method can also be applied to the residual stream, because it does not assume a privileged basis at all.) Given a vector of activations a and unit vector \theta (the direction), we define a virtual neuron which has activations a_\theta = a^T \Sigma^{-1/2} \theta, where \Sigma is the covariance matrix of a. (We found in early experiments that without reparameterization by \Sigma^{-1/2}, high-variance neurons would disproportionately dominate the selected vectors, causing a reduction in sample diversity. The reparameterization using \Sigma^{-1/2} ensures that the initialization favors lower-variance directions and that step sizes are scaled appropriately. Despite the reparameterization, we still observe some amount of collapse.)
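A sketch of computing the virtual neuron's activations under this reparameterization (illustrative code, assuming activations are collected into a matrix):

import numpy as np

def virtual_neuron_activations(A, theta, eps=1e-6):
    # A:     MLP activations, shape (n_samples, n_neurons)
    # theta: unit-norm direction, shape (n_neurons,)
    # Returns a_theta = a^T Sigma^{-1/2} theta for each sample.
    Sigma = np.cov(A, rowvar=False) + eps * np.eye(A.shape[1])
    eigvals, eigvecs = np.linalg.eigh(Sigma)           # Sigma is symmetric PSD
    Sigma_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return A @ Sigma_inv_sqrt @ theta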

Starting with a uniformly random direction \theta, we then optimize using coordinate ascent, alternating the following steps:

  1. Explanation step: Optimize over explanations by searching for an explanation that explains a_\theta well (i.e. achieves a high explainer score).
  2. Update step: Optimize over \theta by computing the gradient of the score and performing gradient ascent. Note that our correlation score is differentiable with respect to \theta, so long as the explanation and simulated values are fixed.

For the explanation step, the simplest baseline is to simply use the typical top-activation-based explanation method (where activations are for the virtual neuron at each step). However, to improve the quality of the explanation step, we use the revisions with generated negatives, and also reuse high-scoring explanations from previous steps.

We ran this algorithm on GPT-2 small's layer 10 (the penultimate MLP layer), which has 3072 neurons. For each 3072-dimensional direction, we ran 10 rounds of coordinate ascent. For the gradient ascent we use Adam with a learning rate of 1e-2, \beta_1 of 0.9, and \beta_2 of 0.999. We also rescale \theta to be unit norm each iteration.
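A sketch of one update step (gradient ascent on the correlation score, with the explanation and simulated values held fixed); this is illustrative code assuming whitened activations are precomputed, not our released implementation:

import torch

def update_step(A_white, theta, sim_vals, lr=1e-2, inner_steps=1):
    # A_white:  activations already reparameterized by Sigma^{-1/2}, shape (n_tokens, n_neurons)
    # theta:    current direction, shape (n_neurons,)
    # sim_vals: simulated activations for the fixed explanation, shape (n_tokens,)
    theta = theta.clone().requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=lr, betas=(0.9, 0.999))
    sim = (sim_vals - sim_vals.mean()) / sim_vals.std(unbiased=False)
    for _ in range(inner_steps):
        a = A_white @ theta
        a = (a - a.mean()) / a.std(unbiased=False)
        rho = (a * sim).mean()      # correlation score, differentiable w.r.t. theta
        opt.zero_grad()
        (-rho).backward()           # ascend the score by minimizing its negative
        opt.step()
    with torch.no_grad():
        theta /= theta.norm()       # rescale theta to unit norm each iteration
    return theta.detach()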

We find that the average top-and-random score after 10 iterations is 0.718, substantially higher than the average score for random neurons in this layer (0.147), and higher than the average score for random directions before any optimization (0.061).

One potential problem with this procedure is that we could repeatedly converge upon the same explainable direction, rather than finding a diverse set of local maxima. To check the extent to which this is happening, we measure and find that the resulting directions have very low cosine similarity with each other.

We also inspect the neurons which contribute most to \Sigma^{-1/2}\theta and qualitatively observe that they are often completely semantically unrelated, suggesting that the directions found are not just specific neurons or small combinations of semantically similar neurons. If we truncate to only the top n neurons by correlation of their activations with the direction's activations, we find that a very large number of neurons is needed to recover the score (with the explanation fixed). (We also tried truncating based on magnitude of coefficient, which resulted in even poorer scores.)

Example top-activating text excerpts for one learned direction and for its most correlated neuron (weather-forecast passages):
Today Considerable clouds this morning. Some decrease in clouds later in the day. A stray shower or thunderstorm is possible. High near 85F. Winds SSE at 5 to 10 mph.. Tonight Partly cloudy skies. A stray shower or thunderstorm is possible. Low 71
Scattered showers and thunderstorms. A few storms may be severe. High 78F. Winds SSW at 5 to 10 mph. Chance of rain 50%.. Tonight Thunderstorms likely this evening. Then the chance of scattered thunderstorms overnight. A few storms may be severe. Low 59
, IA (52732) Today Rain early...then remaining cloudy with showers in the afternoon. Thunder possible. High 66F. Winds light and variable. Chance of rain 80%.. Tonight Thunderstorms likely. Low around 60F. Winds SSE at 5 to 10 mph
, OK (74078) Today Cloudy early with peeks of sunshine expected late. High 79F. Winds SSE at 5 to 10 mph.. Tonight A shower or two possible this evening with partly cloudy skies overnight. Low 57F. Winds E at 5 to 10 mph
17801) Today A mix of clouds and sun during the morning will give way to cloudy skies this afternoon. Slight chance of a rain shower. High near 65F. Winds light and variable.. Tonight Rain. Low 56F. Winds light and variable. Chance of rain 100

One major limitation of this method is that care must be taken when optimizing for a learned proxy of explainability. There may also exist theoretical limitations to the extent to which we can give faithful human understandable explanations to directions in models.

Explainer model scaling trends

One important hope is that explanations improve as our AI assistants get better. Here, we experiment with different explainer models, while holding the simulator model fixed at GPT-4. We find explanations improve smoothly with explainer model capability, and improve relatively evenly across layers.

[Figure: random-only and top-and-random scoring results]

We also obtained a human baseline from labelers asked to write explanations from scratch, using the same set of 5 top-activating text excerpts that the explainer models use. Our labelers were non-experts who received instructions and a few researcher-written examples, but no deeper training about neural networks or related topics.

We see that human performance exceeds the performance of GPT-4, but not by a huge margin. Human performance is also low in absolute terms, suggesting that the main barrier to improved explanations may not simply be explainer model capabilities.

Simulator model scaling trends

With a poor simulator, even a very good explanation will get low scores. To get some sense for simulator quality, we looked at the explanation score as a function of simulator model capability. We find steep returns on top-and-random scoring, and plateauing scores for random-only scoring.

[Figure: random-only and top-and-random scoring results]

Of course, simulation quality cannot itself be measured using explanation scores. However, we can verify that score-induced comparisons from larger simulators agree more with humans, using the human comparison data described earlier. Here, the human baseline comes from human-human agreement rates. Scores using GPT-4 as a simulator model are approaching, but still somewhat below, human-level agreement rates with other humans.

Our methodology can quickly give insight into what aspects of subject models increase or decrease explanation scores. Note that it’s possible some part of these trends reflects our particular explanation method’s strengths and weaknesses, rather than the degree to which a subject model neuron is “interpretable,” or understandable by a human with a moderate amount of effort. If our explanation methodology improved sufficiently, this approach could give insight into what aspects of models increase or decrease interpretability.

Subject model size

One natural question is whether larger, more capable models are more or less difficult to understand than smaller models. Therefore, we measure explanation scores for subject models in the GPT-3 series, ranging in size from 98K to 6.7B parameters. In general, we see a downwards trend in the explainability of neurons with increasing model size using our method, with an especially clear trend for random-only scoring.

[Figure: random-only and top-and-random scoring results]

To understand the basis for this trend, we examine explainability by layer. From layer 16 onward, average explanation scores drop robustly with increasing depth, using both top-and-random and random-only scoring. For shallower layers, top-and-random scores also decrease with increasing depth (however, we find the second layer (layer 1) of many large models to have very low scores, potentially related to the fact that they contain many dead neurons), while random-only scores decrease primarily with increasing model size. Because larger models have more layers, these trends together mean that explanation scores decline with increasing model size.

[Figure: random-only and top-and-random scoring results]

Note that these trends may be artificial, in the sense that they mostly reflect limitations of our current explanation generation technique. Our experiments on "next token"-based explanation lend credence to the hypothesis that later layers of larger models have neurons whose behavior is understandable but difficult for our current methods to explain.

Subject model activation function

One interesting question is whether the architecture of a model affects its interpretability, especially with respect to the model’s sparsity. To study this, we train some small (~3M parameter) models from scratch using a sparse activation function, which applies a standard activation function but then only keeps a fixed number of top activations in each layer, setting the rest to zero. We try this for different levels of activation density: 1 (the baseline), 0.1, 0.01, and 0.001 (top 2 neurons). Our hope is that activation sparsity should discourage extreme polysemanticity.
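A sketch of such a sparse activation function (our reading of the description above; the exact formulation used in our experiments may differ, e.g. in whether top-k is applied per token or per layer):

import torch
import torch.nn.functional as F

def topk_sparse_activation(pre_acts, density, base_act=F.relu):
    # Apply a standard activation function, then keep only the top-k activations
    # (here per token position), zeroing the rest; k is a fixed fraction of the MLP width.
    acts = base_act(pre_acts)
    k = max(1, int(density * acts.shape[-1]))
    topk_vals, topk_idx = acts.topk(k, dim=-1)
    sparse = torch.zeros_like(acts)
    return sparse.scatter(-1, topk_idx, topk_vals)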

[Figure: random-only and top-and-random scoring results]

Increasing activation sparsity consistently increases explanation scores, but hurts pre-training loss. Each parameter doubling is worth approximately 0.17 nats of loss, so the 0.1 sparsity models are roughly 8.5% less parameter-efficient, and the 0.01 sparsity models are roughly 40% less parameter-efficient. We hope there is low-hanging fruit for reducing this "explainability tax". We also find that ReLU consistently yields better explanation scores than GELU.
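One way to unpack this arithmetic (our reconstruction, assuming the stated scaling trend of roughly 0.17 nats of loss per parameter doubling): a loss penalty of \Delta L nats is equivalent to shrinking the effective parameter count by a factor of 2^{-\Delta L / 0.17}, so

2^{-\Delta L / 0.17} \approx 0.915 \;(\text{8.5\% less efficient}) \;\Rightarrow\; \Delta L \approx 0.17 \log_2(1/0.915) \approx 0.02 \text{ nats}
2^{-\Delta L / 0.17} \approx 0.60 \;(\text{40\% less efficient}) \;\Rightarrow\; \Delta L \approx 0.17 \log_2(1/0.60) \approx 0.13 \text{ nats}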

Subject model training time

Another question is how training time affects explanation scores for a fixed model architecture. To study this, we look at explanation scores for intermediate checkpoints of models in the GPT-3 series corresponding to one half and one quarter of the way through training. (Each of these models was trained for a total of 300B tokens.)
[Figure: random-only and top-and-random scoring results]

Training more tends to improve top-and-random scores but decrease random-only scores. (One extremely speculative explanation for this is that features get cleaner/better with more training, causing top-and-random scores to increase, but that there are also more interfering features due to superposition, causing random-only scores to decrease.) At a fixed level of performance on loss, smaller models perhaps tend to have higher explanation scores.

We also find significant positive transfer for explanations between different checkpoints of the same training run. For example, scores for a quarter-trained model seem to drop by less than 25% when using explanations for a fully-trained model, and vice versa. This suggests a relatively high degree of stability in feature-neuron correspondence.

Qualitative results

Interesting neurons

Throughout the project we found many interesting neurons. GPT-4 was able to find explanations for non-trivial neurons that we thought were reasonable upon inspection, such as a "simile" neuron, a neuron for phrases related to certainty and confidence, and a neuron for things done correctly.

One successful strategy for finding interesting neurons was looking for those which were poorly explained by their token-space explanations, compared with their activation-based explanations. This led us to concurrently discover context neurons which activate densely in certain contexts (we called these "vibe" neurons) and many neurons which activated on specific words at the beginning of documents.

Another related strategy that does not rely on explanation quality was to look for context-sensitive neurons that activate differently when the context is truncated. This led us to discover a pattern break neuron which activates for tokens that break an established pattern in an ongoing list (shown below on some select sentences) and a post-typo neuron which activates often following strange or truncated words. Our explanation model is generally unable to get the correct explanation on interesting context-sensitive neurons.

We noticed a number of neurons that appear to activate in situations that match a particular next token, for example a neuron that activates where the next token is likely to be the word “from”. Initially we hypothesized that these neurons might be making a prediction of the next token based on other signals. However, ablations on some of these neurons do not match this story. The “from” neuron appears to actually slightly decrease the probability of “from” being output. At the same time it increases the probability of variations of the word “form”, suggesting one of the things it is doing is accounting for the possibility of a typo. As it is in a late layer (44 out of 48), this neuron may be responding to situations where the network already places high probability on the word “from”. We have not investigated enough to have a clear picture of what is going on, but it is possible that many neurons encode particular subtle variations on the output distribution conditioned on a particular input rather than performing the obvious function suggested by their activations.

We found some interesting examples of neurons that respond to specific kinds of repetition. One neuron activates on repeated occurrences of a token, with stronger activations the more times the token has occurred. An interesting example of a polysemantic neuron is one that fires both for the phrase "over and over again" and for "things repeated right before a non-repeated number", possibly because "over and over again" itself includes repetition. We also found two neurons that seem mostly to respond to a second mention of a surname when it is combined with a different first name. It is possible that these neurons are responding to induction heads.

Overall, our subjective sense was that neurons in more capable models tended to be more interesting, although we spent the majority of our effort looking at GPT-2 XL neurons rather than more modern models.

For more interesting neurons, see our neuron visualization website.

Explaining constructed puzzles

When thinking about how to qualitatively understand our explanation methodology, we often ran into two problems. First, we do not have any ground truth for the explanations or scores. Even human-written explanations could be incorrect, or at the very least fail to completely explain the behavior. Furthermore, it is often difficult to tell whether a better explanation exists. Second, we have no control over the complexity or types of patterns neurons encode, and no guarantee that any simple explanation exists.

To address these drawbacks, we created "neuron puzzles": synthetic neurons with human-written explanations and curated evidence. To create a neuron puzzle, we start with a human-written explanation, taken to be ground truth. Next, we gather text excerpts and manually label their tokens with "activations" (not corresponding to the activations of any real network, because these are synthetic neurons) according to the explanation. Thus, each puzzle is formed from an explanation and evidence supporting that explanation (i.e. a set of text excerpts with activations).
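
Concretely, a puzzle can be thought of as a small labeled dataset. The structure below is a hypothetical sketch with field names of our own choosing (the released puzzle code may organize things differently), and both the explanation wording and the toy excerpt are invented for illustration rather than taken from the actual "incorrect historical years" puzzle.

      from dataclasses import dataclass, field

      @dataclass
      class NeuronPuzzle:
          name: str
          explanation: str                          # ground-truth, human-written explanation
          excerpts: list[list[tuple[str, float]]]   # per-excerpt (token, synthetic activation) pairs
          false_explanations: list[str] = field(default_factory=list)  # for the multiple-choice evaluation below

      puzzle = NeuronPuzzle(
          name="incorrect historical years",
          explanation="activates on years that are historically incorrect in context",
          excerpts=[[("World", 0.0), (" War", 0.0), (" II", 0.0), (" ended", 0.0),
                     (" in", 0.0), (" 1951", 9.0), (".", 0.0)]],
      )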

To evaluate the explainer, we provide the tokens and synthetic activations to the explainer and observe whether the model-generated explanation matches the original puzzle explanation. We can vary puzzle difficulty and write new puzzles to test for certain patterns that interest us. A collection of these puzzles thus forms a useful evaluation for iterating on our explainer technique. We created a total of 19 puzzles, many inspired by neurons we and others found, including a puzzle based on the "not all" neuron described earlier and the 'an' prediction neuron in GPT-2 Large.

Puzzle examples: [interactive carousel of example puzzles]

For each puzzle, we ensured that the evidence is sufficient for a human to recover the original explanation. However, one can imagine creating puzzles that humans generally cannot solve in order to evaluate a superhuman explainer technique. Our baseline explainer methodology solves 5 of the 19 puzzles. While it is skilled at picking out broad patterns in the tokens with high activations, it consistently fails to consider tokens in context or to apply background knowledge. The explainer is also poor at incorporating negative evidence into its explanations. For instance, for the "incorrect historical years" puzzle, the explainer recognizes that the neuron activates on numerical years but fails to account for the text excerpts in which numerical years appear and the neuron does not activate. These failure cases motivate our experiments with revisions. Indeed, when we apply our revision technique to the puzzles, we solve an additional 4.

These neuron puzzles also provide a weak signal about whether the scorer is an effective discriminator between proposed explanations for more complex neurons than the ones we have currently found. We created a multiple-choice version for each puzzle by writing a series of false explanations. For example, "this neuron responds to important years in American or European history" was a false explanation for the "incorrect historical years" puzzle. One of the false explanations is always a baseline of the three most common tokens with high activations. For each puzzle, we score the ground-truth explanation and all of the false explanations on the sequences and activations for that puzzle and then record the number of times that the ground-truth explanation has the highest score. For 16/19 puzzles, the ground-truth explanation is ranked highest, and for 18/19 the ground-truth explanation ranks in the top two. Compared with the 5/19 puzzles that the explainer solves, this evaluation suggested to us that the explainer is currently more of a bottleneck than the scorer. This may reflect the fact that detecting a pattern is more difficult than verifying a pattern given an explanation.
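
Continuing the NeuronPuzzle sketch above, the multiple-choice check can be written as follows, assuming a hypothetical score(explanation, excerpts) hook that simulates activations from an explanation and compares them against the puzzle's synthetic activations:

      def ground_truth_ranks_first(puzzle: "NeuronPuzzle", score) -> bool:
          """Return True if the ground-truth explanation outscores every false
          explanation on this puzzle's synthetic activations."""
          true_score = score(puzzle.explanation, puzzle.excerpts)
          return all(true_score > score(fe, puzzle.excerpts)
                     for fe in puzzle.false_explanations)

      # Counting how often this returns True across puzzles gives the 16/19 figure above.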

Nevertheless, the simulator also suffers from systematic errors. While it performs well at simulating patterns that only require looking at isolated tokens (e.g. "words related to Canada"), it often has difficulty simulating patterns involving positional information, as well as patterns that require precisely keeping track of some quantity. For instance, when simulating the "an" neuron ("this neuron activates for positions in the sentence which are likely to be followed by the word 'an'"), the simulations include a very high number of false positives.

We are releasing code for trying out the neuron puzzles we constructed.

Discussion

Limitations and caveats

Our method has a number of limitations which we hope can be addressed in future work.

Neurons may not be explainable

Our work assumes that neuron behavior can be summarized by a short natural language explanation. This assumption could be problematic for a number of reasons.

Neurons may represent many features

Past research has suggested that neurons may not be privileged as a unit of computation. In particular, there may be polysemantic neurons which correspond to multiple semantic concepts. While our explanation technique can and often does generate explanations along the lines of "X and sometimes Y", it is not suited to capturing complex instances of polysemanticity.

Analyzing top-activating dataset examples has proved useful in practice in previous work, but it can also create an illusion of interpretability. By focusing on top activations, we intended to direct the explainer toward the most important aspects of the neuron's behavior, but this only captures behavior at extremal activation values and not at lower percentiles, where the neuron may behave differently.

One approach to reducing or working around polysemanticity we did not explore is to apply some factorization to the neuron space, such as NMF, SVD, or dictionary learning. We believe this could be complementary with direction finding and training interpretable models.
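
For concreteness, here is a minimal sketch of what such a factorization could look like, using scikit-learn's NMF on a matrix of non-negative MLP activations. This is purely illustrative of an approach we did not explore, and the data shapes here are placeholders.

      import numpy as np
      from sklearn.decomposition import NMF

      # acts: an (n_tokens, n_neurons) matrix of post-ReLU MLP activations collected
      # over text excerpts (random placeholder data here).
      acts = np.random.rand(10_000, 4096).astype(np.float32)

      nmf = NMF(n_components=512, init="nndsvd", max_iter=200)
      token_loadings = nmf.fit_transform(acts)   # (n_tokens, n_components)
      directions = nmf.components_               # (n_components, n_neurons)

      # Each row of `directions` is a candidate direction in neuron space; these could
      # then be explained and scored with the same pipeline used for individual neurons.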

Alien features

Furthermore, language models may represent alien concepts that humans don't have words for. This could happen because language models care about different things, e.g. statistical constructs useful for next-token prediction tasks, or because the model has discovered natural abstractions that humans have yet to discover, e.g. some family of analogous concepts in disparate domains.

We explain correlations, not mechanisms

We currently explain correlations between the network input and the neuron being interpreted on a fixed distribution. Past work has suggested that this may not reflect the causal behavior between the two.

Our explanations also do not explain what causes behavior at a mechanistic level, which could cause our understanding to generalize incorrectly. To predict rare or out-of-distribution model behaviors, it seems possible that we will need a more mechanistic understanding of models.

Simulations may not reflect human understanding

Our scoring methodology relies on the simulator model faithfully replicating how an idealized human would respond to an explanation. However, in practice, the simulator model could be picking up on aspects of an explanation that a human would not pick up on. In the worst case, the explainer model and simulator model could be implicitly performing some sort of steganography with their explanations. This could happen, for example, if both the explainer and simulator model conflate the same spurious feature with the actual feature (so the subject could be responding to feature X, and the explainer would falsely say Y, but the simulations might happen to be high on X).

Ideally, one could mitigate this by training the simulator model to imitate human simulation labels. We plan to pursue this in future work; it may also improve our simulation quality and simplify how we prompt the model.

Limited hypothesis space

To understand transformer models more fully we will need to move from interpreting single neurons to interpreting circuits. This would mean including hypotheses about downstream effects of neurons, hypotheses about attention heads and logits, and hypotheses involving multiple inputs and outputs.

Eventually, our explainer models would draw from a rich space of hypotheses, just like interpretability researchers do.

Computational requirements

Our methodology is quite compute-intensive. The number of activations in the subject model (neurons, attention head dimensions, residual dimensions) scales roughly as O(n^{2/3}), where n is the number of subject model parameters. If we use a constant number of forward passes to interpret each activation, then in the case where the subject and explainer model are the same size, overall compute scales as O(n^{5/3}).

On the other hand, this is perhaps favorable compared to pre-training itself. If pre-training scales data approximately linearly with parameters, then it uses compute O(n^2).
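
Spelling out the arithmetic behind these two estimates:

      \underbrace{O(n^{2/3})}_{\text{activations to explain}} \times \underbrace{O(1)}_{\text{forward passes per activation}} \times \underbrace{O(n)}_{\text{cost per forward pass}} = O(n^{5/3})

      \underbrace{O(n)}_{\text{training tokens}} \times \underbrace{O(n)}_{\text{cost per token}} = O(n^{2})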

Context length

Another computational issue is with context length. Our current method requires the explainer model to have context at least twice as long as the text excerpts passed to the subject model. This means that if the explainer model and subject model had the same context length, we would only be able to explain the subject model's behavior within at most half of its full context length, and could thus fail to capture some behavior that only manifests at later tokens.

Tokenization issues

There are a few ways tokenization causes issues for our methodology:
  1. The explainer and subject models may use different tokenization schemes. The tokens in the activation dataset are from inference on the subject model and will use that model's tokenization scheme. When the associated strings appear in the prompt sent to the explainer model, they may be split into multiple tokens, or they may be partial tokens that wouldn't naturally appear in the given text excerpt for the explainer model.
  2. We use a byte-pair encoder, so when feeding tokens one by one to a model, a token sometimes cannot be rendered sensibly as characters. We assumed that decoding an individual token is meaningful, but this isn't always the case. In principle, when the subject and explainer model share an encoding, we could have passed the correct token directly, but we neglected to do so in this work.
  3. We use delimiters between tokens and their corresponding activations. Ideally, the explainer model would see tokens, delimiters, and activations each as separate tokens. However, we cannot guarantee they don't merge, and in fact our tab delimiters merge with newline tokens, likely making the assistant's task harder.

To the extent that these tokenization quirks affect the model's understanding of which tokens appeared in the original text excerpt, they could harm the quality of our explanations and simulations.
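
To make the first issue concrete, the following illustrative check uses the tiktoken library; the specific pair of encodings is our choice for illustration, and the point is simply that a single subject-model token can re-encode into several explainer-model tokens, or into a fragment that would never appear on its own in the original excerpt.

      import tiktoken

      subject_enc = tiktoken.get_encoding("gpt2")           # GPT-2-style subject tokenization
      explainer_enc = tiktoken.get_encoding("cl100k_base")  # a newer encoding, for illustration

      for tok_id in subject_enc.encode(" interpretability"):
          piece = subject_enc.decode([tok_id])
          # A single subject-model token may map to multiple explainer-model tokens.
          print(repr(piece), "->", explainer_enc.encode(piece))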

Outlook

While we have described a number of limitations with the current version of our methods, we believe our work can be greatly improved and effectively integrated with other existing approaches. For example, successful research on polysemanticity could immediately cause our methods to yield much higher scores. Conversely, our methods could help improve our understanding of superposition by trying to find multiple explanations that cover behavior of a neuron over its entire distribution, or by optimizing to find sets of interpretable directions in the residual stream (perhaps in combination with approaches like dictionary learning). We also hope that we can integrate a wider range of common interpretability techniques, such as studying attention heads, using ablations for validation, etc. into our automated methodology.

Improvements to chain-of-thought methods, tool use, and conversational assistants can also be used to improve explanations. In the long run, we envision that the explainer model could generate, test, and iterate on a rich space of hypotheses about the subject model, similar to an interpretability researcher today. This would include hypotheses about the functionality of circuits and about out-of-distribution behaviors. The explainer model's environment could include access to tools like code execution, subject model visualizations, and talking to researchers. Such a model could be trained using expert iteration or reinforcement learning, with a simulator/judge model setting rewards. We can also train via debate, where two competing assistant models both propose explanations and critique each other's explanations.

We believe our methods could begin contributing to understanding the high-level picture of what is going on inside transformer language models. User interfaces with access to databases of explanations could enable a more macro-focused approach that could help researchers visualize thousands or millions of neurons to see high-level patterns across them. We may be able to soon make progress on simple applications like detecting salient features in reward models, or understanding qualitative changes between a fine-tuned model and its base model.

Ultimately, we would like to be able to use automated interpretability to assist in audits of language models, where we would attempt to detect and understand when the model is misaligned. Particularly important is detecting examples of goal misgeneralization or deceptive alignment, where the model acts aligned when being evaluated but would pursue different goals during deployment. This would require a very thorough understanding of every internal behavior. There could also be complications in using powerful models for assistance if we don't know whether the assistant itself is trustworthy. We hope either that using smaller trustworthy models for assistance will scale to a full interpretability audit, or that applying them to interpretability will teach us enough about how models work to develop more robust auditing methods.

This work represents a concrete instance of OpenAI's broader alignment plan of using powerful models to help alignment researchers. We hope it is a first step in scaling interpretability to a comprehensive understanding of more complicated and capable models in the future.

Contributions

Methodology: Nick effectively started the project by having the initial idea to have GPT-4 explain neurons, and showing a simple explanation methodology worked. William came up with the initial simulation and scoring methodology and implementation. Dan and Steven ran many experiments resulting in ultimate choices of prompts and explanation/scoring parameters.

ML infrastructure: William and Nick set up the initial version of the codebase. Leo and Jeff implemented the initial core internal infrastructure for doing interpretability. Steven implemented the top activations pipeline. Steven and William developed the pipeline for explanations and scoring. Many other miscellaneous contributions came from William, Jeff, Dan, and Steven. Steven created the open source version.

Web infrastructure: Nick and William implemented the neuron viewer, with smaller contributions from Steven, Dan, and Jeff. Nick implemented many other UIs exploring various kinds of neuron explanation. Steven implemented human data gathering UIs.

Human data: Steven implemented and analyzed all experiments involving contractor human data: the human explanation baseline, and human scoring experiments. Nick and William implemented early researcher explanation baselines.

Alternative token and weight-based explanations: Dan implemented all experiments and analysis on token weight and token lookup baselines, next token explanations, as well as infrastructure and UIs for neuron-neuron connection weights.

Revisions: Henk implemented and analyzed the main revision experiments. Nick championed and implemented an initial proof of concept for revisions. Leo implemented a small scale derisking experiment. Steven helped design the final revision pipeline. Leo and Dan did many crucial investigations into negative findings.

Direction finding: Leo had the idea and implemented all experiments related to direction finding.

Neuron puzzles: Henk implemented all the neuron puzzles and related experiments. William came up with the initial idea. Steven and William gave feedback on data and strategies.

Subject, explainer, and simulator scaling: Steven implemented and analyzed assistant size, simulator size, and subject size experiments. Jeff implemented and analyzed subject training time experiments.

Ablation scoring: Jeff implemented ablation infrastructure and initial scoring experiments, and Dan contributed lots of useful thinking and carried out final experiments. Leo did related investigations into understanding and prediction of ablation effects.

Activation function experiments: Jeff implemented the experiments and analysis. Gabe suggested the sparse activation function, and William suggested correlation-based community detection.

Qualitative results: Everyone contributed throughout the project to qualitative findings. Nick and William discovered many of the earliest nontrivial neurons. Dan found many non-trivial neurons by comparing to token baselines, such as simile neurons. Steven found the pattern break neuron, and other context-sensitive neurons. Leo discovered the "don't stop" neuron and first noticed explanations were overly broad. Henk had many qualitative findings about explanation and scoring quality. Nick found interesting neuron-neuron connections and interesting pairs of neurons firing on the same token. William and Jeff investigated ablations of specific neurons.

Guidance and mentorship: William and Jeff led and managed the project. Jan and Jeff managed team members who worked on the project. Many ideas from Jan, Nick, William, and Ilya influenced the direction of the project. Steven mentored Henk.

Acknowledgments

We thank Neel Nanda, Ryan Greenblatt, Paul Christiano, Chris Olah, and Evan Hubinger for useful discussions on direction during the project.

We thank Ryan Greenblatt, Buck Shlegeris, Trenton Bricken, the OpenAI Alignment team, and the Anthropic Interpretability team for useful discussions and feedback.

We thank Cathy Yeh for doing useful checks of our code and noticing tokenization concerns.

We thank Carroll Wainwright and Chris Hesse for help with infrastructure.

We thank the contractors who wrote and evaluated explanations for our human data experiments, and Long Ouyang for some feedback on our instructions.

We thank Rajan Troll for providing a human baseline for the neuron puzzles.

We thank Thomas Degry for the blog graphic.

We thank other OpenAI teams for their support, including the supercomputing, research acceleration, and language teams.

Citation Information

Please cite as:

      Bills, et al., "Language models can explain neurons in language models", 2023.
    

BibTeX Citation:

      @misc{bills2023language,
         title={Language models can explain neurons in language models},
         author={
            Bills, Steven and Cammarata, Nick and Mossing, Dan and Tillman, Henk and Gao, Leo and Goh, Gabriel and Sutskever, Ilya and Leike, Jan and Wu, Jeff and Saunders, William
         },
         year={2023},
         howpublished = {\url{https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html}}
      }

Footnotes

  1. The subject and the explainer can be the same model, though for our experiments we typically use smaller models as subjects and our strongest model as the explainer. In the long run, the situation may be reversed if the subject is our strongest model but we don't trust it as an assistant.[↩]
  2. Thus, while the explainer model should be trained to be as helpful as possible using RL (using the simulator model as a training signal), the simulator should be trained to imitate humans. If the simulator model improved its performance by making predictions that humans would disagree with, we would potentially risk explanations not being human-understandable.[↩]
  3. However, we did not explicitly check that the resulting text excerpts do not overlap. While in principle it would be reasonable for an explanation to "memorize" behavior to the extent that it drives most of the subject model's behavior on the training set, it would be less interesting if that was the primary driver of high scores. Based on some simple checks of our text excerpts, this was a non-issue for at least 99.8% of neurons.[↩]
  4. All prompts are shown in an abbreviated format, and are modified somewhat when using the structured chat completions API. For full details see our codebase.[↩]
  5. While we used GPT-4 as the simulator model for most of our experiments, the public OpenAI API does not support returning logprobs for newer chat-based models like GPT-4 and GPT-3.5-turbo. Older models like the original GPT-3.5 support logprobs.[↩]
  6. "All at once" actually outperformed "one at a time", but the effect was within noise, and researchers subjectively thought "one at a time" was better, on relatively small sample sizes.[↩]
  7. Note that this isn't an unbiased estimator of true explained variance, since we also use a sample for the denominator. One could improve on our approach by using a much larger sample for estimating the variance term.[↩]
  8. We explored more complicated methods to calibrate, but they typically require many simulations, which are expensive to obtain.[↩]
  9. Conceptually, calibration should ideally happen on a different set of text excerpts, so we aren't "cheating" by using the true mean and variance. We empirically studied this cheating effect for differing sample sizes and believe it to be small in practice.[↩]
  10. Matching standard deviations results in explained variance of 2 \rho - 1 < \rho^2. We also find empirically that it performs worse in ablation based scoring.[↩]
  11. This might happen if subtle variations in the activation of a neuron (making the difference, say, between a correlation score of 0.9 and 1.0) played an outsized role in its function within the network.[↩]
  12. Unlike GPT-2, GPT-4 is a model trained to follow instructions via RLHF.[↩]
  13. Note that since GPT-2 models use byte-pair encoders, sometimes our texts have mid-character breaks. See here for more discussion.[↩]
  14. GPT-2 was trained to sometimes see multiple documents, separated by a special end of text token. In our work, we ensure all 64 tokens are within the same document.[↩]
  15. For other activation functions like GeGLU, this would likely be untrue and we would need to separately explain positive and negative activations.[↩]
  16. In the long term, we want to move toward something debate-like where the score is more like a minimax than an expectation. That is, we imagine one model coming up with an explanation and another model coming up with counterexamples. Scoring would take the whole transcript into consideration, and thus measure how robust the hypothesis is.[↩]
  17. Unfortunately, initial attempts at this using a moderate increase in number of text excerpts did not prove to be useful.[↩]
  18. In most places, we calculate this using 1.96 times the standard error of the mean (SEM), or a strictly more conservative statistic. If needed we estimate via bootstrap resampling methods.[↩]
  19. We also tried linear prediction including the position embedding for each position in the text excerpt plus the token embedding; this linear token- and position-based prediction baseline resulted in very small quantitative improvements and no qualitative change.[↩]
  20. As noted above, in the non-iterative setting we find the explainer model is unable to effectively make use of additional text excerpts in context.[↩]
  21. When one of us looked uncarefully at this neuron, we too came to this conclusion. It was only after testing examples of sequences like "All students must turn in their final papers by Monday" that we realized the initial explanation was too broad.[↩]
  22. Our revision process is also agnostic to the explanation it starts with, so we could likely also start from our strong unigram baselines and revise based on relevant sentences. We suspect this will outperform our existing results and plan to try techniques like this in the future.[↩]
  23. This method can also be applied to the residual stream, because it does not assume a privileged basis at all.[↩]
  24. We found in early experiments that without reparameterization by \Sigma^{-1/2}, high-variance neurons would otherwise disproportionately dominate the selected vectors, causing a reduction in sample diversity. The reparameterization using \Sigma^{-1/2} ensures that the initialization favors lower-variance directions and that step sizes are scaled appropriately. Despite the reparameterization, we still observe some amount of collapse.[↩]
  25. We also tried truncating based on magnitude of coefficient, which resulted in even poorer scores.[↩]
  26. However, we find the second layer (layer 1) of many large models to have very low scores, potentially related to the fact that they contain many dead neurons.[↩]
  27. Each parameter doubling is approximately 0.17 nats of loss, so the 0.1 sparsity models are roughly 8.5% less parameter-efficient, and 0.01 sparsity models are roughly 40% less parameter-efficient. We hope there is low hanging fruit for reducing this "explainability tax".[↩]
  28. Each of these models was trained for a total of 300B tokens.[↩]
  29. One extremely speculative explanation for this is that features get cleaner/better with more training (causing random-and-top scores to increase) but that there are also more interfering features due to superposition (causing random-only scores to decrease).[↩]
  30. We called these "vibe" neurons.[↩]
  31. When we gave these puzzles to a researcher not on the project, they solved all but the 'an' prediction puzzle. In hindsight, they thought they could recognize future next-token-predicting puzzles with similar levels of evidence.[↩]

References

  1. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
    Wang, K., Variengien, A., Conmy, A., Shlegeris, B. and Steinhardt, J., 2022. arXiv preprint arXiv:2211.00593.
  2. A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
    Chughtai, B., Chan, L. and Nanda, N., 2023. arXiv preprint arXiv:2302.03025.
  3. Automating Auditing: An ambitious concrete technical research proposal[link]
    Hubinger, E., 2021.
  4. CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks
    Oikarinen, T. and Weng, T., 2022. arXiv preprint arXiv:2204.10965.
  5. The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable[link]
    Millidge, B. and Black, S., 2022.
  6. Describing differences between text distributions with natural language
    Zhong, R., Snell, C., Klein, D. and Steinhardt, J., 2022. International Conference on Machine Learning, pp. 27099--27116.
  7. Explaining patterns in data with language models via interpretable autoprompting
    Singh, C., Morris, J.X., Aneja, J., Rush, A.M. and Gao, J., 2022. arXiv preprint arXiv:2210.01848.
  8. GPT-4 Technical Report
OpenAI, 2023. arXiv preprint arXiv:2303.08774.
  9. Network dissection: Quantifying interpretability of deep visual representations
    Bau, D., Zhou, B., Khosla, A., Oliva, A. and Torralba, A., 2017. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6541--6549.
  10. Causal scrubbing: a method for rigorously testing interpretability hypotheses[link]
    Chan, L., Garriga-Alonso, A., Goldowsky-Dill, N., Greenblatt, R., Nitishinskaya, J., Radhakrishnan, A., Shlegeris, B. and Thomas, N., 2022.
  11. Natural language descriptions of deep visual features
    Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A. and Andreas, J., 2022. International Conference on Learning Representations.
  12. Softmax Linear Units
    Elhage, N., Hume, T., Olsson, C., Nanda, N., Henighan, T., Johnston, S., ElShowk, S., Joseph, N., DasSarma, N., Mann, B., Hernandez, D., Askell, A., Ndousse, K., Jones, A., Drain, D., Chen, A., Bai, Y., Ganguli, D., Lovitt, L., Hatfield-Dodds, Z., Kernion, J., Conerly, T., Kravec, S., Fort, S., Kadavath, S., Jacobson, J., Tran-Johnson, E., Kaplan, J., Clark, J., Brown, T., McCandlish, S., Amodei, D. and Olah, C., 2022. Transformer Circuits Thread.
  13. Language models are unsupervised multitask learners
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I. and others, 2019. OpenAI blog, Vol 1(8), pp. 9.
  14. Activation atlas
    Carter, S., Armstrong, Z., Schubert, L., Johnson, I. and Olah, C., 2019. Distill, Vol 4(3), pp. e15.
  15. The language interpretability tool: Extensible, interactive visualizations and analysis for NLP models
Tenney, I., Wexler, J., Bastings, J., Bolukbasi, T., Coenen, A., Gehrmann, S., Jiang, E., Pushkarna, M., Radebaugh, C., Reif, E. and others, 2020. arXiv preprint arXiv:2008.05122.
  16. Neuroscope[link]
Nanda, N.
  17. Visualizing higher-layer features of a deep network
    Erhan, D., Bengio, Y., Courville, A. and Vincent, P., 2009. University of Montreal, Vol 1341(3), pp. 1.
  18. Understanding neural networks through deep visualization
    Yosinski, J., Clune, J., Nguyen, A., Fuchs, T. and Lipson, H., 2015. arXiv preprint arXiv:1506.06579.
  19. Feature Visualization
    Olah, C., Mordvintsev, A. and Schubert, L., 2017. Distill. DOI: 10.23915/distill.00007
  20. Gaussian error linear units (gelus)
    Hendrycks, D. and Gimpel, K., 2016. arXiv preprint arXiv:1606.08415.
  21. Causal mediation analysis for interpreting neural nlp: The case of gender bias
    Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Sakenis, S., Huang, J., Singer, Y. and Shieber, S., 2020. arXiv preprint arXiv:2004.12265.
  22. Locating and editing factual associations in GPT
    Meng, K., Bau, D., Andonian, A. and Belinkov, Y., 2022. Advances in Neural Information Processing Systems, Vol 35, pp. 17359--17372.
  23. Training language models to follow instructions with human feedback
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and others, 2022. Advances in Neural Information Processing Systems, Vol 35, pp. 27730--27744.
  24. Glu variants improve transformer
    Shazeer, N., 2020. arXiv preprint arXiv:2002.05202.
  25. Zoom In: An Introduction to Circuits
    Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M. and Carter, S., 2020. Distill. DOI: 10.23915/distill.00024.001
  26. Toy Models of Superposition
    Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M. and Olah, C., 2022. Transformer Circuits Thread.
  27. AI safety via debate
    Irving, G., Christiano, P. and Amodei, D., 2018. arXiv preprint arXiv:1805.00899.
  28. interpreting GPT: the logit lens[link]
nostalgebraist, 2020.
  29. Analyzing transformers in embedding space
    Dar, G., Geva, M., Gupta, A. and Berant, J., 2022. arXiv preprint arXiv:2209.02535.
  30. Chain of thought prompting elicits reasoning in large language models
    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q. and Zhou, D., 2022. arXiv preprint arXiv:2201.11903.
  31. Self-critiquing models for assisting human evaluators
    Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J. and Leike, J., 2022. arXiv preprint arXiv:2206.05802.
  32. Finding Neurons in a Haystack: Case Studies with Sparse Probing
    Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D. and Bertsimas, D., 2023. arXiv preprint arXiv:2305.01610.
  33. Problems of monetary management: the UK experience in papers in monetary economics
    Goodhart, C., 1975. Monetary Economics, Vol 1.
  34. Scaling Laws for Reward Model Overoptimization
    Gao, L., Schulman, J. and Hilton, J., 2022.
  35. Eliciting latent knowledge: How to tell if your eyes deceive you[link]
    Christiano, P., Cotra, A. and Xu, M., 2021.
  36. Chris Olah’s views on AGI safety[link]
    Hubinger, E., 2019.
  37. Language models are few-shot learners
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and others, 2020. Advances in neural information processing systems, Vol 33, pp. 1877--1901.
  38. Input switched affine networks: An rnn architecture designed for interpretability
    Foerster, J.N., Gilmer, J., Sohl-Dickstein, J., Chorowski, J. and Sussillo, D., 2017. International conference on machine learning, pp. 1136--1145.
  39. Re-training deep neural networks to facilitate Boolean concept extraction
Gonzalez, C., Loza Mencía, E. and Fürnkranz, J., 2017. Discovery Science: 20th International Conference, DS 2017, Kyoto, Japan, October 15--17, 2017, Proceedings 20, pp. 127--143.
  40. Spine: Sparse interpretable neural embeddings
    Subramanian, A., Pruthi, D., Jhamtani, H., Berg-Kirkpatrick, T. and Hovy, E., 2018. Proceedings of the AAAI conference on artificial intelligence, Vol 32(1).
  41. Learning effective and interpretable semantic models using non-negative sparse embedding
    Murphy, B., Talukdar, P. and Mitchell, T., 2012. Proceedings of COLING 2012, pp. 1933--1950.
  42. In-context Learning and Induction Heads
    Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S. and Olah, C., 2022. Transformer Circuits Thread.
  43. We Found An Neuron in GPT-2[link]
    Miller, J. and Neo, C., 2023.
  44. Intriguing properties of neural networks
    Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. arXiv preprint arXiv:1312.6199.
  45. Curve Detectors
    Cammarata, N., Goh, G., Carter, S., Schubert, L., Petrov, M. and Olah, C., 2020. Distill. DOI: 10.23915/distill.00024.003
  46. An interpretability illusion for bert
    Bolukbasi, T., Pearce, A., Yuan, A., Coenen, A., Reif, E., Viegas, F. and Wattenberg, M., 2021. arXiv preprint arXiv:2104.07143.
  47. The Building Blocks of Interpretability
    Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K. and Mordvintsev, A., 2018. Distill. DOI: 10.23915/distill.00010
  48. Learning the parts of objects by non-negative matrix factorization
    Lee, D.D. and Seung, H.S., 1999. Nature, Vol 401(6755), pp. 788--791. Nature Publishing Group UK London.
  49. Nonnegative matrix factorization with mixed hypergraph regularization for community detection
    Wu, W., Kwong, S., Zhou, Y., Jia, Y. and Gao, W., 2018. Information Sciences, Vol 435, pp. 263--281. Elsevier.
  50. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability
    Raghu, M., Gilmer, J., Yosinski, J. and Sohl-Dickstein, J., 2017. Advances in neural information processing systems, Vol 30.
  51. The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable[link]
    Millidge, B. and Black, S., 2022.
  52. Taking features out of superposition with sparse autoencoders[link]
    Sharkey, L., Braun, D. and Millidge, B., 2022.
  53. On interpretability and feature representations: an analysis of the sentiment neuron
    Donnelly, J. and Roegiest, A., 2019. Advances in Information Retrieval: 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14--18, 2019, Proceedings, Part I 41, pp. 795--802.
  54. Cyclegan, a master of steganography
    Chu, C., Zhmoginov, A. and Sandler, M., 2017. arXiv preprint arXiv:1712.02950.
  55. N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models
    Foote, A., Nanda, N., Kran, E., Konstas, I. and Barez, F.. ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models.
  56. Training compute-optimal large language models
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.d.L., Hendricks, L.A., Welbl, J., Clark, A. and others, 2022. arXiv preprint arXiv:2203.15556.
  57. Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
    Shah, R., Varma, V., Kumar, R., Phuong, M., Krakovna, V., Uesato, J. and Kenton, Z., 2022.
  58. The alignment problem from a deep learning perspective
    Ngo, R., Chan, L. and Mindermann, S., 2023.
  59. Risks from Learned Optimization in Advanced Machine Learning Systems
    Hubinger, E., Merwijk, C.v., Mikulik, V., Skalse, J. and Garrabrant, S., 2021.
  60. Our approach to alignment research[link]
    Leike, J., Schulman, J. and Wu, J., 2022.