Larger language models do in-context learning differently


The interaction between prior knowledge and input-label mappings across model scales.

Overview. Over the past few months, I studied in-context learning with some fellow researchers at Google Brain. Our study shows how in-context learning (ICL) in language models is affected by semantic priors versus input-label mappings, especially with respect to the scale of the model. We investigated two settings to learn more about this: ICL with flipped labels and ICL with semantically-unrelated labels. We found the following:

  • Overriding prior knowledge is an emergent ability of model scale.
  • Learning in-context with semantically-unrelated labels emerges with scale.
  • Instruction tuning strengthens the use of prior knowledge more than it increases the capacity to learn input-label mappings.
  • Large-enough language models can perform linear classification at up to 64 dimensions!

The full paper can be found here.

An overview of flipped-label ICL and semantically-unrelated label ICL (SUL-ICL), compared with regular ICL.

Motivation. Language models have become hugely popular these days, partly because they can perform tasks well via in-context learning (ICL), in which they're given a few exemplars of input-label pairs in a prompt before performing the task on an unseen evaluation example. But just how do models do this? Well, they can do one or both of the following:

(A) Mostly use semantic prior knowledge to predict labels while following the format of in-context exemplars (e.g., seeing “positive sentiment” and “negative sentiment” as labels and performing sentiment analysis using prior knowledge).

(B) Learn the input–label mappings from the presented exemplars (e.g., finding a pattern that positive reviews should be mapped to one label, and negative reviews should be mapped to a different label).

In this paper, we wanted to learn about how these two factors (semantic priors and input-label mappings) interact with each other, especially with respect to the scale of the language model that’s used.

Example prompts with 1 in-context exemplar per class for the SST-2 and SUBJ datasets.
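The prompt format shown in the figure above can be reproduced with a small helper. Here is a minimal sketch; the exemplar texts, field names, and labels are illustrative, not drawn from the actual SST-2 or SUBJ data:

```python
def build_icl_prompt(exemplars, query, input_name="Input", label_name="Label"):
    """Format in-context exemplars followed by an unlabeled evaluation query."""
    lines = []
    for text, label in exemplars:
        lines.append(f"{input_name}: {text}")
        lines.append(f"{label_name}: {label}")
    # The evaluation example gets the same format, but with the label left blank
    # for the model to complete.
    lines.append(f"{input_name}: {query}")
    lines.append(f"{label_name}:")
    return "\n".join(lines)

exemplars = [
    ("A delightful, heartfelt film.", "positive"),
    ("Tedious and poorly acted.", "negative"),
]
prompt = build_icl_prompt(exemplars, "An instant classic.")
```

The model's continuation after the final `Label:` is taken as its prediction.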

Experiment Design. We experimented on seven NLP tasks that have been widely used: sentiment analysis, subjective/objective classification, question classification, duplicated-question recognition, entailment recognition, financial sentiment analysis, and hate speech detection. We tested five language model families, with three being from OpenAI (GPT-3, InstructGPT, Codex) and two being from Google (PaLM, Flan-PaLM). The figure above shows how we prompt models in our experiments.

Flipped labels. In this experiment, in-context exemplar labels are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment being labeled as “negative”). In this setting, models that are able to override prior knowledge and learn input-label mappings in-context should experience a decrease in performance (since ground-truth evaluation labels are not flipped).
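The flipped-label setup amounts to a simple transformation of the in-context exemplars while evaluation labels stay untouched. A minimal sketch (the function and label names are my own, not the paper's code):

```python
import random

def flip_labels(exemplars, fraction, label_pair=("positive", "negative"), seed=0):
    """Swap the labels of a random subset of in-context exemplars.

    `fraction` is the proportion of exemplars whose labels get flipped;
    ground-truth evaluation labels are left alone, so a model that follows
    the flipped exemplars will score *below* chance at fraction 1.0.
    """
    a, b = label_pair
    rng = random.Random(seed)
    idx = rng.sample(range(len(exemplars)), k=round(fraction * len(exemplars)))
    flipped = list(exemplars)
    for i in idx:
        text, label = flipped[i]
        flipped[i] = (text, b if label == a else a)
    return flipped
```

For example, `flip_labels(data, 1.0)` flips every exemplar label, reproducing the 100%-flipped condition.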

The ability to override semantic priors when presented with flipped in-context exemplar labels emerges with model scale. Smaller models cannot flip predictions to follow flipped labels (performance only decreases slightly), while larger models can do so (performance decreases to well below 50%).

We found that when no labels are flipped, larger models perform better than smaller models (an expected result). But as we flip more and more labels, the performance of small models stays relatively flat, while large models experience huge performance drops to well below random guessing (e.g., 90% → 22.5%).

These results indicate that large models can override prior knowledge from pretraining with input-label mappings presented in-context. Small models can't do this, making this ability an emergent phenomenon of model scale.

Semantically-unrelated labels. In this experiment, we replace labels with semantically-irrelevant ones (e.g., for sentiment analysis, we use “foo/bar” instead of “negative/positive”), which means that the model can only perform ICL by learning from input-label mappings. If a model mostly relies on prior knowledge for ICL, then its performance should decrease after this change since it will no longer be able to use semantic meanings of targets to make predictions. A model that can learn input–label mappings in-context, on the other hand, would be able to learn these semantically-unrelated mappings and should not experience a major drop in performance.
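Swapping natural language targets for semantically-unrelated ones is again a simple relabeling of the exemplars. A hypothetical sketch (the "foo"/"bar" tokens match the example in the text; the function name is mine):

```python
def to_unrelated(exemplars, mapping={"positive": "foo", "negative": "bar"}):
    """Replace natural language labels with semantically-unrelated tokens,
    so the model can only succeed by learning the input-label mapping."""
    return [(text, mapping[label]) for text, label in exemplars]

data = [("A delightful, heartfelt film.", "positive"),
        ("Tedious and poorly acted.", "negative")]
sul_data = to_unrelated(data)
```

The resulting exemplars are then formatted into a prompt exactly as in the regular ICL setting.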

Small models rely more on semantic priors than large models do, as performance decreases more for small models than for large models when using semantically-unrelated targets instead of natural language targets.

Indeed, we see that using semantically-unrelated targets results in a greater performance drop for small models versus large models. This suggests that smaller models primarily rely on the semantic meaning of targets for ICL rather than learn the presented input–label mappings. Large models, on the other hand, have the ability to learn input–label mappings in-context when the semantic nature of targets is removed.

Larger models benefit more from additional exemplars than smaller models do.

We also found that including more in-context exemplars results in a greater performance improvement for large models than it does for small models, indicating that large models are better at learning from in-context exemplars than small models are. In other words, large models are more capable of using the additional input–label mappings presented in-context to better learn the correct relationships between inputs and labels.

Instruction Tuning. Instruction tuning is a popular technique for improving model performance; it involves finetuning models on a collection of NLP tasks phrased as instructions. Since the process uses natural language targets, however, an open question is whether it improves the ability to learn input-label mappings or whether it strengthens the ability to recognize and apply semantic prior knowledge. Both would lead to an improvement in performance on standard ICL tasks, so it's unclear which of the two occurs.

We studied this by running the same two setups as before, only this time focusing on comparing standard language models (specifically, PaLM) with their instruction-tuned variants (Flan-PaLM).

Instruction-tuned language models are better at learning input–label mappings than pretraining-only language models are.

First, we find that Flan-PaLM is better than PaLM when we use semantically-unrelated targets. This effect is very prominent in small models, as Flan-PaLM-8B outperforms PaLM-8B by 9.6% and almost catches up to PaLM-62B. This trend suggests that instruction tuning strengthens the ability to learn input-label mappings, which isn’t particularly surprising.

Instruction-tuned models are worse than pretraining-only models are at learning to override semantic priors when presented with flipped labels in-context.

More interestingly, we found that Flan-PaLM is actually worse than PaLM at following flipped labels, meaning that instruction-tuned models are less able to override their prior knowledge (Flan-PaLM models never drop below random guessing even with 100% flipped labels, whereas PaLM models can drop to 31% accuracy in the same setting). These results indicate that instruction tuning increases the extent to which models rely on semantic priors when they're available.

Combined with the previous result, we can conclude that although instruction tuning improves the ability to learn input-label mappings, it strengthens the usage of semantic prior knowledge more.

Linear classification. As a bonus experiment, we created N-dimensional linear classification datasets and examined model behavior with respect to the number of dimensions. This helps us learn more about whether large language models’ greater capacity to learn input-label mappings also holds for non-natural-language tasks.
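One way to generate such a dataset, as an illustrative sketch (the paper's exact sampling scheme may differ; the "foo"/"bar" targets here just reuse the semantically-unrelated labels from earlier):

```python
import random

def make_linear_dataset(n_samples, n_dims, seed=0):
    """Sample integer-valued points and label them by the sign of a random
    linear rule, yielding an n_dims-dimensional linear classification task."""
    rng = random.Random(seed)
    # A fixed random weight vector defines the separating hyperplane.
    w = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
    data = []
    for _ in range(n_samples):
        x = [rng.randint(-100, 100) for _ in range(n_dims)]
        score = sum(wi * xi for wi, xi in zip(w, x))
        data.append((x, "foo" if score > 0 else "bar"))
    return data
```

Each `(x, label)` pair can then be serialized into an in-context exemplar the same way as the NLP tasks, with the point's coordinates as the input.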

The largest Codex model (code-davinci-002) can perform linear classification up to 64 dimensions, while smaller Codex models do not outperform random guessing at 16 dimensions. PaLM models can all perform linear classification up to 8 dimensions with little difference in performance with respect to model scale.

We find that for Codex models, the largest model can successfully perform linear classification up to 64 dimensions, while the smaller models reach guessing performance at approximately 16 dimensions. For PaLM models, model scale does not seem to significantly correlate with the number of dimensions to which the model can perform linear classification, though all PaLM models can perform linear classification up to at least 8 dimensions. Neither PaLM models nor Codex models can outperform an SVM baseline.

Summary. We examined the extent to which language models learn in-context by utilizing prior knowledge learned during pretraining versus input-label mappings presented in-context.

We first showed that large language models can learn to override prior knowledge when presented with enough flipped labels, and this ability emerges with model scale. We then found that successfully doing ICL using semantically-unrelated labels is another emergent ability of model scale. Additionally, we analyzed instruction-tuned language models and saw that instruction tuning improves the capacity to learn input–label mappings but also strengthens semantic priors. Finally, we examined linear classification tasks, finding that successfully performing high-dimensional linear classification emerges with model scale.

These results underscore that the ICL behavior of language models can change depending on model scale, and that larger language models have an emergent ability to map inputs to many types of labels, a form of true symbolic reasoning in which input-label mappings can be learned for arbitrary symbols.
