Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls
Shotaro Ishihara (2024). Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls. Fourth Workshop on Trustworthy Natural Language Processing.
https://arxiv.org/abs/2404.17143
Research Question: Do Japanese PLMs memorize their training data as much as English PLMs do?

Approach: We pre-trained GPT-2 models on Japanese newspaper articles. The string at the beginning of each article (public, outside the paywall) is used as a prompt, and the remaining string behind the paywall (private) is used as the reference for evaluation (see the evaluation sketch below).

Findings:
1. Japanese PLMs sometimes "copy and paste" on a large scale.
2. We replicated the empirical finding that memorization is related to duplication, model size, and prompt length: the more epochs (i.e., more duplication), the larger the model, and the longer the prompt, the more memorization.

Figure: memorized strings are highlighted in green (48 chars).
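The following is a minimal sketch of this prompt-and-compare evaluation, assuming greedy decoding with Hugging Face transformers and a character-level longest common prefix as the memorization measure. The model path, the example strings, and the memorized_prefix_length helper are illustrative placeholders, not the paper's exact setup.

```python
# Sketch: prompt a domain-specific GPT-2 with the public (pre-paywall) prefix of an
# article and measure how much of the private (paywalled) continuation it reproduces.
# Model path, prompt length, and the LCP metric are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/domain-specific-japanese-gpt2"  # placeholder model path
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def memorized_prefix_length(public_prompt: str, private_continuation: str,
                            max_new_tokens: int = 128) -> int:
    """Generate greedily from the public prompt and return the number of leading
    characters that exactly match the paywalled continuation."""
    inputs = tokenizer(public_prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding, typical for extraction-style evaluation
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens and decode only the newly generated continuation.
    generated = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Character-level longest common prefix between generation and ground truth.
    match = 0
    for gen_ch, ref_ch in zip(generated, private_continuation):
        if gen_ch != ref_ch:
            break
        match += 1
    return match


# Dummy usage; a real evaluation would iterate over many articles.
prompt = "（記事の冒頭、ペイウォール前の公開部分）"      # public prefix
reference = "（ペイウォール内に続く非公開部分）"          # private continuation
print(memorized_prefix_length(prompt, reference))
```

The longest-common-prefix count here is one simple way to quantify verbatim "copy and paste"; aggregating it over many article prompts is what allows comparing memorization across duplication levels, model sizes, and prompt lengths.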