Nicholas Carlini1 Florian Tramèr2 Eric Wallace3 Matthew Jagielski4 Ariel Herbert-Voss5,6 Katherine Lee1 Adam Roberts1 Tom Brown5 Dawn Song3 Úlfar Erlingsson7 Alina Oprea4 Colin Raffel1

1Google 2Stanford 3UC Berkeley 4Northeastern University 5OpenAI 6Harvard 7Apple

Submitted to arXiv on 14 Dec 2020 (arXiv:2012.07805)

Abstract

It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model’s training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs.

Lessons and Future Work

Extraction Attacks Are a Practical Threat. We manually inspected only a limited number of potential candidate memorized samples; had we examined more candidates, we would likely have identified significantly more memorized content. Improved techniques for extracting memorized data, including attacks targeted towards specific content, are an interesting area for future work.

Memorization Does Not Require Overfitting. It is often believed that by preventing overfitting (i.e., reducing the train-test gap) it is possible to prevent models from memorizing training data. However, large LMs have no significant train-test gap, and yet we are still able to extract numerous examples verbatim from the training set. The key reason is that even though on average the training loss is only slightly lower than the validation loss, there are still some training examples that have anomalously low losses.
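The anomalously-low-loss observation above is the signal a reader can probe most directly: a sequence to which the model assigns an unusually low loss (equivalently, low perplexity) is a candidate for memorization. The snippet below is a minimal sketch of that per-example scoring, assuming the publicly released GPT-2 weights accessed through the Hugging Face transformers package; the package choice and the candidate strings are illustrative assumptions, not part of the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Publicly released 124M-parameter GPT-2 checkpoint (illustrative choice).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Exponentiated mean per-token loss of `text` under the model.

    An anomalously low value relative to comparable text marks the
    sequence as a candidate for having been memorized during training.
    """
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids the model returns the average
        # cross-entropy of predicting each token from its prefix.
        out = model(input_ids=enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

# Hypothetical candidate strings; real candidates would come from
# sampling the model with chosen prefixes.
candidates = [
    "The quick brown fox jumps over the lazy dog.",
    "for (int i = 0; i < n; i++) { sum += a[i]; }",
]
for text in sorted(candidates, key=perplexity):
    print(f"{perplexity(text):9.2f}  {text}")
```

In practice one would compare such scores against a reference (for example, a smaller model or a compression-based estimate) rather than reading raw perplexities in isolation.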
Larger Models Memorize More Data. Throughout our experiments, larger LMs consistently memorize more training data than smaller LMs. For example, in one setting the 1.5 billion parameter GPT-2 model memorizes over 18× as much content as the 124 million parameter model (Section 7). Worryingly, it is likely that as LMs become bigger (they already have become 100× larger than GPT-2 [5]), privacy leakage will become even more prevalent.

Memorization Can Be Hard to Discover. Much of the training data that we extract is only discovered when prompting the LM with a particular prefix. Currently, we simply attempt to use high-quality prefixes and hope that they might elicit memorization. Better prefix selection strategies [58] might identify more memorized data.

Adopt and Develop Mitigation Strategies. We discuss several directions for mitigating memorization in LMs, including training with differential privacy, vetting the training data for sensitive content, limiting the impact on downstream applications, and auditing LMs to test for memorization. All of these are interesting and promising avenues of future work, but each has weaknesses and is an incomplete solution to the full problem. Memorization in modern LMs must be addressed as new generations of LMs are emerging and becoming building blocks for a range of real-world applications. (A toy sketch of the differentially private update rule appears after the contributions list.)

Conclusion

Even though our attacks target GPT-2 (which allows us to ensure that our work is not harmful), the same techniques apply to any LM. Moreover, because memorization gets worse as LMs become larger, we expect that these vulnerabilities will become significantly more important in the future.

Training with differentially-private techniques is one method for mitigating privacy leakage; however, we believe that it will be necessary to develop new methods that can train models at this extreme scale (e.g., billions of parameters) without sacrificing model accuracy or training time. More generally, there are many open questions that we hope will be investigated further, including why models memorize, the dangers of memorization, and how to prevent memorization.

Acknowledgements

We are grateful for comments on early versions of this paper by Dan Boneh, Andreas Terzis, Carey Radebaugh, Daphne Ippolito, Christine Robson, Kelly Cooke, Janel Thamkul, Austin Tarango, Jack Clark, Ilya Mironov, and Om Thakkar.

Summary of Contributions

• Nicholas, Dawn, Ariel, Tom, Colin and Úlfar proposed the research question of extracting training data from GPT-2 and framed the threat model.
• Colin, Florian, Matthew, and Nicholas stated the memorization definitions.
• Florian, Ariel, and Nicholas wrote code to generate candidate memorized samples from GPT-2 and verify the ground truth memorization.
• Florian, Nicholas, Matthew, and Eric manually reviewed and categorized the candidate memorized content.
• Katherine, Florian, Eric, and Colin generated the figures.
• Adam, Matthew, and Eric ran preliminary investigations in language model memorization.
• Nicholas, Florian, Eric, Colin, Katherine, Matthew, Ariel, Alina, Úlfar, Dawn, and Adam wrote and edited the paper.
• Tom, Adam, and Colin gave advice on language models and machine learning background.
• Alina, Úlfar, and Dawn gave advice on the security goals.
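As referenced in the mitigation discussion above, the core of differentially private training (DP-SGD) is to clip each example's gradient and add calibrated Gaussian noise before the parameter update. The toy PyTorch sketch below shows one such noisy, clipped step on a placeholder linear model; the model, data, learning rate, clipping bound, and noise multiplier are illustrative assumptions, and a real system would rely on a vetted DP library and a privacy accountant rather than this hand-rolled loop.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Placeholder model and data; a real use case would be a large LM and its corpus.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
batch_x, batch_y = torch.randn(32, 10), torch.randn(32, 1)

clip_norm = 1.0         # per-example gradient norm bound C
noise_multiplier = 1.1  # Gaussian noise scale sigma, relative to C
lr = 0.1

def dp_sgd_step():
    """One DP-SGD update: clip each example's gradient, sum, add noise."""
    per_example_grads = []
    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        # Scale the whole per-example gradient down to norm <= clip_norm.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, (clip_norm / (total_norm + 1e-12)).item())
        per_example_grads.append([g * scale for g in grads])

    with torch.no_grad():
        for i, p in enumerate(model.parameters()):
            summed = torch.stack([g[i] for g in per_example_grads]).sum(dim=0)
            noise = torch.randn_like(summed) * noise_multiplier * clip_norm
            p -= lr * (summed + noise) / len(batch_x)

dp_sgd_step()
print("applied one noisy, clipped DP-SGD update")
```

The per-example clipping is what bounds any single training example's influence on the update, which is the property that limits memorization of rare sequences.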