Nicholas Carlini1 Florian Tramèr2 Eric Wallace3 Matthew Jagielski4 Ariel Herbert-Voss5,6 Katherine Lee1 Adam Roberts1 Tom Brown5 Dawn Song3 Úlfar Erlingsson7 Alina Oprea4 Colin Raffel1

1Google 2Stanford 3UC Berkeley 4Northeastern University 5OpenAI 6Harvard 7Apple

Submitted to arXiv on 14 Dec 2020 (arXiv:2012.07805)

Abstract

It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model’s training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs.

Lessons and Future Work

Extraction Attacks Are a Practical Threat. We manually inspected only a limited number of potential candidate memorized samples; had we examined more candidates, we would likely have identified significantly more memorized content. Improved techniques for extracting memorized data, including attacks targeted towards specific content, are an interesting area for future work.

Memorization Does Not Require Overfitting. It is often believed that by preventing overfitting (i.e., reducing the train-test gap) it is possible to prevent models from memorizing training data. However, large LMs have no significant train-test gap, and yet we are still able to extract numerous examples verbatim from the training set. The key reason is that even though on average the training loss is only slightly lower than the validation loss, there are still some training examples that have anomalously low losses.
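The anomalously-low-loss observation above is the signal a reader can probe most directly: a sequence to which the model assigns an unusually low loss (equivalently, low perplexity) is a candidate for memorization. The snippet below is a minimal sketch of that per-example scoring, assuming the publicly released GPT-2 weights accessed through the Hugging Face transformers package; the package choice and the candidate strings are illustrative assumptions, not part of the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Publicly released 124M-parameter GPT-2 checkpoint (illustrative choice).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Exponentiated mean per-token loss of `text` under the model.

    An anomalously low value relative to comparable text marks the
    sequence as a candidate for having been memorized during training.
    """
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids the model returns the average
        # cross-entropy of predicting each token from its prefix.
        out = model(input_ids=enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

# Hypothetical candidate strings; real candidates would come from
# sampling the model with chosen prefixes.
candidates = [
    "The quick brown fox jumps over the lazy dog.",
    "for (int i = 0; i < n; i++) { sum += a[i]; }",
]
for text in sorted(candidates, key=perplexity):
    print(f"{perplexity(text):9.2f}  {text}")
```

In practice one would compare such scores against a reference (for example, a smaller model or a compression-based estimate) rather than reading raw perplexities in isolation.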
Larger Models Memorize More Data. Throughout our experiments, larger LMs consistently memorize more training data than smaller LMs. For example, in one setting the 1.5 billion parameter GPT-2 model memorizes over 18× as much content as the 124 million parameter model (Section 7). Worryingly, it is likely that as LMs become bigger (they already have become 100× larger than GPT-2 [5]), privacy leakage will become even more prevalent.

Memorization Can Be Hard to Discover. Much of the training data that we extract is only discovered when prompting the LM with a particular prefix. Currently, we simply attempt to use high-quality prefixes and hope that they might elicit memorization. Better prefix selection strategies [58] might identify more memorized data.

Adopt and Develop Mitigation Strategies. We discuss several directions for mitigating memorization in LMs, including training with differential privacy, vetting the training data for sensitive content, limiting the impact on downstream applications, and auditing LMs to test for memorization. All of these are interesting and promising avenues of future work, but each has weaknesses and is an incomplete solution to the full problem. Memorization in modern LMs must be addressed as new generations of LMs are emerging and becoming building blocks for a range of real-world applications. (A toy sketch of the differentially private update rule appears after the contributions list.)

Conclusion

Even though our attacks target GPT-2 (which allows us to ensure that our work is not harmful), the same techniques apply to any LM. Moreover, because memorization gets worse as LMs become larger, we expect that these vulnerabilities will become significantly more important in the future.

Training with differentially-private techniques is one method for mitigating privacy leakage; however, we believe that it will be necessary to develop new methods that can train models at this extreme scale (e.g., billions of parameters) without sacrificing model accuracy or training time. More generally, there are many open questions that we hope will be investigated further, including why models memorize, the dangers of memorization, and how to prevent memorization.

Acknowledgements

We are grateful for comments on early versions of this paper by Dan Boneh, Andreas Terzis, Carey Radebaugh, Daphne Ippolito, Christine Robson, Kelly Cooke, Janel Thamkul, Austin Tarango, Jack Clark, Ilya Mironov, and Om Thakkar.

Summary of Contributions

• Nicholas, Dawn, Ariel, Tom, Colin and Úlfar proposed the research question of extracting training data from GPT-2 and framed the threat model.
• Colin, Florian, Matthew, and Nicholas stated the memorization definitions.
• Florian, Ariel, and Nicholas wrote code to generate candidate memorized samples from GPT-2 and verify the ground truth memorization.
• Florian, Nicholas, Matthew, and Eric manually reviewed and categorized the candidate memorized content.
• Katherine, Florian, Eric, and Colin generated the figures.
• Adam, Matthew, and Eric ran preliminary investigations in language model memorization.
• Nicholas, Florian, Eric, Colin, Katherine, Matthew, Ariel, Alina, Úlfar, Dawn, and Adam wrote and edited the paper.
• Tom, Adam, and Colin gave advice on language models and machine learning background.
• Alina, Úlfar, and Dawn gave advice on the security goals.
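As referenced in the mitigation discussion above, the core of differentially private training (DP-SGD) is to clip each example's gradient and add calibrated Gaussian noise before the parameter update. The toy PyTorch sketch below shows one such noisy, clipped step on a placeholder linear model; the model, data, learning rate, clipping bound, and noise multiplier are illustrative assumptions, and a real system would rely on a vetted DP library and a privacy accountant rather than this hand-rolled loop.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Placeholder model and data; a real use case would be a large LM and its corpus.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
batch_x, batch_y = torch.randn(32, 10), torch.randn(32, 1)

clip_norm = 1.0         # per-example gradient norm bound C
noise_multiplier = 1.1  # Gaussian noise scale sigma, relative to C
lr = 0.1

def dp_sgd_step():
    """One DP-SGD update: clip each example's gradient, sum, add noise."""
    per_example_grads = []
    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        # Scale the whole per-example gradient down to norm <= clip_norm.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, (clip_norm / (total_norm + 1e-12)).item())
        per_example_grads.append([g * scale for g in grads])

    with torch.no_grad():
        for i, p in enumerate(model.parameters()):
            summed = torch.stack([g[i] for g in per_example_grads]).sum(dim=0)
            noise = torch.randn_like(summed) * noise_multiplier * clip_norm
            p -= lr * (summed + noise) / len(batch_x)

dp_sgd_step()
print("applied one noisy, clipped DP-SGD update")
```

The per-example clipping is what bounds any single training example's influence on the update, which is the property that limits memorization of rare sequences.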