Language Models Learn Rare Phenomena from Common Patterns: Study on the AANN Construction

Mike Young - Aug 13 - Dev Community

This is a Plain English Papers summary of a research paper called Language Models Learn Rare Phenomena from Common Patterns: Study on the AANN Construction. If you like these kinds of analysis, you should join AImodels.fyi or follow me on Twitter.

Introduction

The paper investigates how Large Language Models (LLMs) can learn rare grammatical structures, such as the AANN construction ("a beautiful five days in Texas"), from limited input data. The authors hypothesize that LLMs learn these rare phenomena by generalizing from more frequent, related constructions in the input. They train transformer models on systematically manipulated versions of the 100M-word BabyLM corpus to study the extent to which exposure to frequent, related phenomena enables generalization to novel instances of the AANN construction.

The study yields three main findings:

  1. LMs successfully generalize to novel instances of the AANN construction, even when not exposed to any AANNs during training, suggesting that related items in the training data enable non-trivial performance in acceptability judgments.

  2. Systematically removing AANN-related phenomena from the training data, such as measure noun phrases treated as singular, leads to worse performance on predicting novel AANNs, highlighting the role of these phenomena in generalization.

  3. LMs that encounter AANNs with more variability in the adjective, numeral, and noun slots show better generalization than those exposed to more restricted, repeating instances, mirroring findings from human language acquisition and cognitive psychology.

These results demonstrate that a sophisticated statistical learner can learn rare linguistic phenomena by generalizing from key related constructions in the input, without relying on strong innate priors.

General Methods

The paper describes methods for characterizing how LMs learn the rare AANN construction (e.g., "a whopping ninety LMs"). The authors use the BabyLM-strict corpus for training language models (LMs) and detect AANN instances using regular expressions and part-of-speech tagging. They train autoregressive transformer LMs with 12 layers and 12 attention heads, averaging results over three runs with different random seeds.

To test the LMs' knowledge of the aann, the authors use a dataset containing acceptability ratings for templatically generated sentences. They compare well-formed aann instances to corrupted versions that manipulate adjective-numeral order, article presence, adjective presence, and numeral presence. The Syntactic Log-odds Ratio (SLOR) is used to score sentences, comparing the probability of the construction given the prefix estimated by the LM to that estimated by a unigram model. Accuracy is calculated based on whether the well-formed construction scores higher than all corrupted instances.
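Under its standard definition, SLOR is the LM log-probability minus the unigram log-probability, divided by the length in tokens. As a rough illustration of the scoring setup (not the authors' code), here is a minimal sketch using an off-the-shelf Hugging Face causal LM; "gpt2" and the unigram_logprob lookup are stand-ins for the models and corpus statistics actually used in the paper.

```python
# Minimal SLOR sketch, assuming a Hugging Face causal LM ("gpt2" is illustrative only)
# and a hypothetical unigram_logprob(text) built from corpus unigram counts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def lm_logprob(prefix, construction):
    """Summed log-probability of the construction's tokens given the prefix."""
    # construction should include its leading space for GPT-2-style BPE tokenizers
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    full_ids = tok(prefix + construction, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = lm(full_ids).logits.log_softmax(-1)
    start = prefix_ids.shape[1]                 # index of first construction token
    targets = full_ids[0, start:]
    lp = logprobs[0, start - 1:-1].gather(-1, targets.unsqueeze(-1)).sum().item()
    return lp, targets.shape[0]

def slor(prefix, construction, unigram_logprob):
    """Length-normalized difference between LM and unigram log-probabilities."""
    lp, n = lm_logprob(prefix, construction)
    return (lp - unigram_logprob(construction)) / n

def is_correct(prefix, well_formed, corrupted, unigram_logprob):
    """Accuracy criterion: the well-formed variant must out-score every corruption."""
    good = slor(prefix, well_formed, unigram_logprob)
    return all(good > slor(prefix, c, unigram_logprob) for c in corrupted)
```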

In subsequent experiments, the authors ablate parts of the BabyLM corpus that conform to certain linguistic or statistical hypotheses. To maintain the same quantity of training data, they up-sample non-hypothesis-conforming utterances after ablation. This allows them to compare LMs that differ in content but not in the total number of tokens encountered during training.
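A minimal sketch of that ablate-and-upsample step might look like the following, where `conforms` is a hypothetical predicate for the phenomenon being removed (e.g. "contains an AANN"); matching the number of utterances here stands in for the token-count matching described above.

```python
# Sketch of ablation plus up-sampling; not the authors' implementation.
import random

def ablate_and_upsample(corpus_lines, conforms, seed=0):
    """Drop hypothesis-conforming lines, then re-sample the rest to the original size."""
    rng = random.Random(seed)
    kept = [line for line in corpus_lines if not conforms(line)]
    n_removed = len(corpus_lines) - len(kept)
    # up-sample non-conforming utterances so all conditions see the same amount of data
    kept = kept + rng.choices(kept, k=n_removed)
    rng.shuffle(kept)
    return kept
```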

Experiment 1: LMs learn about AANNs without having seen a single instance

The paper investigated the extent to which language models (LMs) trained on the BabyLM corpus learn the AANN construction (e.g., "a fine eighteen months"). The LMs achieved accuracies around 70%, substantially above chance, even though positive evidence of the construction made up only 0.02% of their training data. Larger state-of-the-art LMs like Llama-2-7B and GPT-2 XL achieved even higher accuracies of 83% and 78%, respectively.

Interestingly, LMs trained on the BabyLM corpus with all 2,301 detected AANN instances removed still achieved an accuracy of 54% on judging the acceptability of AANN constructions, 47.75 points above chance (chance is far below 50% here, since the well-formed version must out-score every corrupted variant). This suggests LMs can learn the acceptability of a construction's instances without seeing any positive occurrences, likely driven by systematic patterns in the corpus.

The paper also tested counterfactual variants that violate English grammar, such as ANAN and NAAN orders. LMs trained on these variants did not learn them as well as the AANN, and they still assigned non-trivial probability to unseen AANN instances. This implies LMs pick up cues from related constructions to generalize to novel AANN examples.

Experiment 2: Keys to Learning AANNs

The paper investigates the AANN construction in language models (LMs) and hypothesizes four phenomena that may contribute to its learning, despite the construction being rare in training data:

  1. Phrases like "the beautiful five days" where "the" takes a plural noun
  2. Measure noun phrases with plural nouns attached to an indefinite article (e.g., "a few days")
  3. Measure nouns treated as singular in terms of agreement (e.g., "Five miles is a long way to go")
  4. The higher likelihood of adjectives following indefinite articles compared to numerals

The effect of these phenomena on AANN acceptability is measured by holding out instances during training and comparing SLOR values. A control condition with random removal of instances is also considered.

Experiments are conducted under two settings: with AANNs removed from training along with the phenomena, and with AANNs seen during training when possible. Results show that holding out the hypothesized phenomena has non-trivial effects on LMs' ratings of unseen well-formed AANNs, with balancing the frequency of adjectives and numerals following an article having the greatest effect. These patterns are absent in 4-gram LMs, suggesting they do not arise from shallow surface statistics.
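For reference, a 4-gram baseline of the kind mentioned above can be sketched in a few lines; this toy add-k-smoothed version is an assumption about the general setup, not the authors' implementation.

```python
# Toy 4-gram language model with add-k smoothing, scored the same way as the neural LMs.
import math
from collections import Counter

def train_4gram(sentences, k=0.1):
    """sentences: iterable of token lists. Returns a log-probability scorer."""
    four, tri = Counter(), Counter()
    vocab = set()
    for toks in sentences:
        padded = ["<s>"] * 3 + toks + ["</s>"]
        vocab.update(padded)
        for i in range(3, len(padded)):
            four[tuple(padded[i - 3:i + 1])] += 1
            tri[tuple(padded[i - 3:i])] += 1
    V = len(vocab)

    def logprob(toks):
        padded = ["<s>"] * 3 + toks + ["</s>"]
        total = 0.0
        for i in range(3, len(padded)):
            num = four[tuple(padded[i - 3:i + 1])] + k
            den = tri[tuple(padded[i - 3:i])] + k * V
            total += math.log(num / den)
        return total

    return logprob

# Usage: score well-formed and corrupted variants exactly as with the transformer LMs.
score = train_4gram([["a", "beautiful", "five", "days"], ["five", "long", "days"]])
print(score(["a", "beautiful", "five", "days"]))
```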

The paper concludes that when LMs see evidence of the AANN construction, they do learn from it. However, related phenomena where measure nouns are treated as singular show notable effects even when AANNs are present, indicating they enable additional learning.

Experiment 3: The Role of Variability

The paper investigates how the variability of open slots in a construction affects language models' ability to generalize to unseen instances of that construction, focusing on the article-adjective-numeral-noun (AANN) construction. The authors hypothesize that instances of AANNs with greater open-slot variability, i.e., evidence that many different adjectives, numerals, and nouns can fill their respective positions, would lead language models to assign greater likelihood to unseen AANNs.

The experiment divided AANN-containing utterances from the BabyLM corpus into two subsets: one with highly frequent but restricted slot-fillers, and another with less frequent but more variable slot-fillers. Language models were trained on the BabyLM corpus containing either of these subsets, and the results were compared to models trained on the unablated BabyLM and a condition with no AANNs.
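One way to picture the split (simplified relative to the paper's actual criterion): score each AANN-containing utterance by how frequent its adjective and noun fillers are, then cut the ranked list in half. Here `get_slots` is a hypothetical accessor returning the (adjective, numeral, noun) triple found by the detection step described in Appendix B.

```python
# Illustrative high/low slot-variability split; a sketch, not the authors' code.
from collections import Counter

def split_by_variability(aann_utterances, get_slots):
    """aann_utterances: list of utterances; get_slots(u) -> (adj, numeral, noun)."""
    slots = [get_slots(u) for u in aann_utterances]
    adj_freq = Counter(a for a, _, _ in slots)
    noun_freq = Counter(n for _, _, n in slots)
    # score each utterance by how frequent (i.e. how restricted) its fillers are
    scores = [adj_freq[a] + noun_freq[n] for a, _, n in slots]
    order = sorted(range(len(slots)), key=lambda i: scores[i], reverse=True)
    half = len(order) // 2
    low_var = [aann_utterances[i] for i in order[:half]]   # frequent, restricted fillers
    high_var = [aann_utterances[i] for i in order[half:]]  # rarer, more varied fillers
    return low_var, high_var
```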

The findings showed that language models exposed to AANNs with highly variable open slots demonstrated slot fill-in likelihoods comparable to or greater than models trained on all AANNs. In contrast, models exposed to AANNs with low variability performed similarly to models that never saw any AANNs. These results support the hypothesis that slot-variability affects the extent to which language models permit productive uses of a construction.

Figure 4: SLOR scores on AANNs from Mahowald (2023) for LMs trained on BabyLM with low and high variability in the observed instances of the AANN. SLOR for the unablated BabyLM-trained LM is shown with a dotted line.

Conclusion

The paper explores how language models handle rare linguistic phenomena, often referred to as the "long tail" of language. Studying these phenomena is important because language models perform better with more data and because the human ability to generalize to rare constructions is central to language knowledge. The authors found that language models trained on human-scale data can learn a rare construction called the AANN, even without direct examples in the training data. This learning is mediated by occurrences of related constructions during training. The results contribute to a growing body of research demonstrating the ability of large language models to learn linguistic constructions.

Limitations

The paper discusses potential future work and limitations of the current method. Extending the method to a wider range of constructions is valuable but not straightforward, as it requires identifying idiosyncratic constructions and developing testable hypotheses about their learnability from limited data. This limitation highlights the need for collaboration between theoretical and computational linguists. Another limitation is the computational expense of repeatedly training language models from scratch. Alternative methods, such as representational editing, could be explored. The paper focuses on linguistic form rather than testing the ability to interpret constructions for downstream semantic tasks, which would be an informative extension.

Acknowledgments

The authors acknowledge funding from NSF Grant 2104995 awarded to Kyle Mahowald. They thank Adele Goldberg, Leonie Weissweiler, the computational linguistics research group at UT Austin, the syntax-semantics research group at UT Austin, and the audience at the Texas Linguistics Society meeting for helpful conversations. They also thank Chris Potts for his paper on the PiPP construction which inspired the "keys to all of this" idea in their own work.

Appendix A LM training details

The authors train language models using the OPT architecture on various versions of the BabyLM corpus. They tune the learning rate for each instance of the corpus based on the validation set, and then train two additional language models with different seeds using the best learning rate. In total, they train 6 language models for each ablation of the BabyLM corpus, resulting in 90 language models for all experiments. Table 3 provides more details about the training process.
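For concreteness, instantiating a small OPT-style model with the transformers library looks roughly like the sketch below; the exact sizes, learning-rate grid, and training loop are placeholders rather than the paper's reported settings (Table 3 in the paper has those).

```python
# Sketch of a small OPT-style model; hyperparameter values are assumptions, not the paper's.
from transformers import OPTConfig, OPTForCausalLM

config = OPTConfig(
    num_hidden_layers=12,      # 12-layer autoregressive transformer, per General Methods
    num_attention_heads=12,    # assumed to match the hidden size below
    hidden_size=768,
    ffn_dim=3072,
    max_position_embeddings=512,
)
model = OPTForCausalLM(config)   # trained from scratch on a BabyLM variant

# Hypothetical learning-rate grid: pick the best rate on the validation set,
# then retrain with two additional random seeds at that rate.
candidate_lrs = [1e-4, 3e-4, 6e-4]
```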

Appendix B Detecting aanns and related phenomena

The paper describes methods to extract constructions and phenomena from the BabyLM corpus. The methods primarily rely on the surface form of sentences, part-of-speech (POS) tag sequences, and in some cases, dependency parses. The authors used the spacy library with the en_core_web_trf model (based on RoBERTa-base) for POS tagging and parsing.

To detect AANNs (article + adjective + numeral + plural noun sequences), the authors constructed a regex pattern over POS-tagged sequences. The regex allows for multiple adjectives, optional adverbs, multi-word noun phrases with plural head-nouns, and numeral expressions. They also treated words like 'few', 'dozen', 'couple', 'several', 'many', and 'more' as proxies for numerals.
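A simplified illustration of the idea (the authors' actual pattern is richer, allowing optional adverbs and multi-word noun phrases) might look like this:

```python
# Simplified AANN detection via a regex over per-token tags; requires the
# en_core_web_trf spaCy model to be installed. A sketch, not the authors' regex.
import re
import spacy

nlp = spacy.load("en_core_web_trf")   # RoBERTa-based tagger/parser used in the paper

def find_aann(sentence):
    doc = nlp(sentence)
    tags = []
    for t in doc:
        if t.lower_ in ("a", "an"):
            tags.append("ART")                       # indefinite article
        elif t.pos_ == "NUM" or t.lower_ in ("few", "dozen", "couple",
                                             "several", "many", "more"):
            tags.append("NUM")                       # numeral or numeral proxy
        elif t.pos_ == "ADJ":
            tags.append("ADJ")
        elif t.tag_ == "NNS":
            tags.append("NNS")                       # plural noun
        else:
            tags.append("X")
    seq = " ".join(tags)
    return re.search(r"ART(?: ADJ)+ NUM NNS", seq) is not None

print(find_aann("We spent a beautiful five days in Texas."))  # True, if tagging cooperates
```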

For DT ANNs (determiner + adjective + numeral + plural noun sequences), the same procedure as for AANNs was followed, without restricting the determiner position to indefinite articles.

The authors also considered cases where plural nouns are attached to an indefinite article, such as "a few days" or "a couple liters". These cases were detected using dependency configurations involving det, amod, quantmod, and nummod relations.

Lastly, they examined measure noun-phrases with plural nouns treated as singular via agreement with a verb, like "five dollars is plenty". Such cases were detected using dependency configurations involving nummod and nsubj relations.
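Rough sketches of these two dependency-based checks, using the same spaCy pipeline, are shown below; the heuristics are simplified relative to the configurations the paper actually uses.

```python
# Simplified dependency-based detection; assumes en_core_web_trf is installed.
import spacy

nlp = spacy.load("en_core_web_trf")

def indefinite_plus_plural(sentence):
    """Plural noun taking an indefinite article, e.g. 'a few days'."""
    doc = nlp(sentence)
    for tok in doc:
        if tok.tag_ == "NNS":
            dets = [c for c in tok.children if c.dep_ == "det"]
            if any(d.lower_ in ("a", "an") for d in dets):
                return True
    return False

def measure_np_singular_agreement(sentence):
    """Plural measure NP treated as singular, e.g. 'Five miles is a long way'."""
    doc = nlp(sentence)
    for tok in doc:
        if tok.tag_ == "NNS" and tok.dep_ == "nsubj":
            has_num = any(c.dep_ == "nummod" for c in tok.children)
            head_is_singular_verb = tok.head.tag_ == "VBZ"   # e.g. 'is', 'costs'
            if has_num and head_is_singular_verb:
                return True
    return False

print(indefinite_plus_plural("We stayed a few days."))             # True
print(measure_np_singular_agreement("Five miles is a long way."))  # True
```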

Appendix C A/An + ADJ/NUM frequency balancing

The paper analyzes the POS-tagged BabyLM corpus and finds that adjectives are about 14.6 times more likely than numerals to follow an indefinite article. To balance these frequencies, 571,874 instances of adjectives following an indefinite article are removed, making this the largest ablation performed in the study.
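The balancing step itself is simple arithmetic; a toy version with made-up counts (chosen only to roughly echo the reported figures) looks like this:

```python
# Back-of-the-envelope balancing of 'a/an + ADJ' vs 'a/an + NUM' bigram counts.
def n_to_remove(n_art_adj, n_art_num):
    """How many 'a/an + ADJ' bigrams to drop so the two counts match."""
    return max(n_art_adj - n_art_num, 0)

# Hypothetical counts reflecting the reported ~14.6:1 ratio:
n_art_num = 42_000
n_art_adj = int(14.6 * n_art_num)          # ~613,000 'a/an + ADJ' bigrams
print(n_to_remove(n_art_adj, n_art_num))   # ~571,000, close to the 571,874 removed
```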

If you enjoyed this summary, consider joining AImodels.fyi or following me on Twitter for more AI and machine learning content.
