Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls
Shotaro Ishihara (2024). Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls. Fourth Workshop on Trustworthy Natural Language Processing.
https://arxiv.org/abs/2404.17143
Research Question: Do Japanese PLMs memorize their training data as much as English PLMs do?

Approach: We pre-trained GPT-2 models on Japanese newspaper articles. The string at the beginning of each article (public, outside the paywall) is used as a prompt, and the remaining string behind the paywall (private) is used as the reference for evaluation (see the evaluation sketch below).

Findings:
1. Japanese PLMs sometimes "copy and paste" on a large scale.
2. We replicated the empirical finding that memorization is related to duplication, model size, and prompt length: the more epochs (i.e., more duplication), the larger the model, and the longer the prompt, the more memorization.

Figure: memorized strings are highlighted in green (48 chars).
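The following is a minimal sketch of this prompt-and-compare evaluation, assuming greedy decoding with Hugging Face transformers and a character-level longest common prefix as the memorization measure. The model path, the example strings, and the memorized_prefix_length helper are illustrative placeholders, not the paper's exact setup.

```python
# Sketch: prompt a domain-specific GPT-2 with the public (pre-paywall) prefix of an
# article and measure how much of the private (paywalled) continuation it reproduces.
# Model path, prompt length, and the LCP metric are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/domain-specific-japanese-gpt2"  # placeholder model path
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def memorized_prefix_length(public_prompt: str, private_continuation: str,
                            max_new_tokens: int = 128) -> int:
    """Generate greedily from the public prompt and return the number of leading
    characters that exactly match the paywalled continuation."""
    inputs = tokenizer(public_prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding, typical for extraction-style evaluation
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens and decode only the newly generated continuation.
    generated = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Character-level longest common prefix between generation and ground truth.
    match = 0
    for gen_ch, ref_ch in zip(generated, private_continuation):
        if gen_ch != ref_ch:
            break
        match += 1
    return match


# Dummy usage; a real evaluation would iterate over many articles.
prompt = "（記事の冒頭、ペイウォール前の公開部分）"      # public prefix
reference = "（ペイウォール内に続く非公開部分）"          # private continuation
print(memorized_prefix_length(prompt, reference))
```

The longest-common-prefix count here is one simple way to quantify verbatim "copy and paste"; aggregating it over many article prompts is what allows comparing memorization across duplication levels, model sizes, and prompt lengths.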