Datasets have driven the development of recommender system research.
• MovieLens, Netflix Prize
• In recent years, data science competitions such as Kaggle, the KDD Cup, and the RecSys Challenge have promoted dataset publication.
However, implicit feedback datasets from commercial services are still scarce.
• Recommender systems have been adopted in many and varied services, so many and varied datasets are needed.
Dataset publication is important for recommender system research.
Publishing a dataset carries business risks.
• Leaking confidential business metrics.
• Reputation risks.
Before publishing a dataset, researchers must obtain approval from a business manager.
• Many business managers are not specialists in machine learning or recommender systems.
• The researchers are responsible for explaining the risks and benefits.
We focus on implicit feedback datasets.
• Implicit feedback datasets contain confidential business information and users' personal information.
• Explicit feedback datasets are often constructed by crawling public web resources, such as user reviews and ratings available online.
We would like to make it easier for commercial services to publish datasets.
• We summarize the challenges of building and publishing datasets from a commercial service.
• We formulate the problem of building and publishing a dataset as an optimization problem that seeks the sampling weights of users.
• We applied our method to build datasets from the raw data of Gunosy, our real-world mobile news delivery service, which is popular in Japan.
◦ The raw data contains more than 1,000,000 users with 100,000,000 interactions.
• The implementation of our proposed method and a dataset built with it are publicly available: https://github.com/gunosy/publishing-dataset-recsys20
We focus only on the following three types of data to simplify the situation.
• User behavior logs: when user u clicks article a at time t, the triplet (u, a, t) is recorded as a log.
• User attributes: each user has attributes, such as age and gender.
• Article category: each news article has a category, such as sports, entertainment, or politics.
Our task is to publish a subset of the user behavior logs. We build the dataset with a "sampling users" approach (see the sketch after the steps):
1. Sample users from the user behavior logs.
2. Collect all the user behavior logs associated with the sampled users.
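A minimal sketch of the sampling-users approach on toy data (the log format follows the (u, a, t) triplets above; the variable names and data are illustrative, not the authors' production code):

```python
import random
from collections import defaultdict

# Toy user behavior logs: (user, article, timestamp) triplets.
logs = [
    ("user_a", "item_a", 1), ("user_a", "item_c", 2), ("user_a", "item_d", 5),
    ("user_b", "item_b", 1), ("user_b", "item_c", 3), ("user_b", "item_f", 4),
    ("user_c", "item_b", 2), ("user_c", "item_e", 6),
]

def build_dataset_by_sampling_users(logs, n_users, seed=0):
    """Step 1: sample users; step 2: collect ALL behavior logs of the
    sampled users, so each user's consumption history stays intact."""
    logs_by_user = defaultdict(list)
    for u, a, t in logs:
        logs_by_user[u].append((u, a, t))
    rng = random.Random(seed)
    sampled_users = rng.sample(sorted(logs_by_user), k=min(n_users, len(logs_by_user)))
    return [triplet for u in sampled_users for triplet in logs_by_user[u]]

print(build_dataset_by_sampling_users(logs, n_users=2))
```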
Approach (alternative): sampling behavior logs
[Figure: individual behavior logs of Users A, B, and C are sampled into the dataset.]
With this approach, the consumption histories of the users are missing.
Approach (ours): sampling users
[Figure: whole users, with all their behavior logs, are sampled into the dataset.]
With this approach, the consumption histories of the users are kept intact.
We pose the following three challenges:
1. Anonymize the Business Metrics
2. Maintain Fairness
3. Reduce Popularity Bias
Challenge 1: Anonymize the Business Metrics
• We do not want to disclose confidential business metrics.
◦ operating income
◦ the average number of clicks
◦ the average active rate of users
• If users are sampled uniformly, some business metrics, such as the average number of clicks and the average active rate of users, could easily be estimated (see the sketch below).
• We must therefore sample users with a non-uniform distribution.
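A toy illustration (synthetic numbers, not real Gunosy data) of why uniform sampling leaks a metric such as the average number of clicks:

```python
import random

random.seed(0)
# Synthetic per-user click counts standing in for the confidential raw data.
clicks_per_user = [random.randint(1, 200) for _ in range(1_000_000)]
true_mean = sum(clicks_per_user) / len(clicks_per_user)

# A uniformly sampled dataset yields an unbiased estimate of the true mean.
uniform_sample = random.sample(clicks_per_user, 60_000)
estimated_mean = sum(uniform_sample) / len(uniform_sample)

print(f"true mean: {true_mean:.2f}, uniform-sample estimate: {estimated_mean:.2f}")
# The estimate is close to the true value, so the metric is effectively
# disclosed; non-uniform sampling weights w(u) break this correspondence.
```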
1. Anonymize the Business Metrics
2. Maintain Fairness
3. Reduce Popularity Bias
Challenge 2: Maintain Fairness
• Publishing a fair dataset is very important.
◦ Some existing methods that maintain fairness rely on user attributes; however, publishing user attributes risks de-anonymization.
◦ Publishing an unfair dataset indirectly contributes to creating unfair machine learning models.
• These risks would damage the company's reputation.
1. Anonymize the Business Metrics
2. Maintain Fairness
3. Reduce Popularity Bias
Challenge 3: Reduce Popularity Bias
• Recommender systems are expected to match long-tailed items with users; algorithms suffering from popularity bias cannot fulfill this role.
• We believe popularity bias is also a problem when building a dataset.
◦ If the dataset is built by uniform sampling, items of unpopular categories are sampled less frequently (see the sketch below).
◦ Because researchers cannot increase the number of interactions afterwards, the publisher must retain a certain number of interactions with items of unpopular categories.
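A toy sketch (assumed numbers) of how uniform sampling simply preserves the raw data's category skew:

```python
import random

random.seed(0)
# In the toy raw data, only 3% of clicks fall on the "politics" category.
raw_clicks = random.choices(["sports", "politics"], weights=[97, 3], k=100_000)

# A uniform subsample keeps roughly the same 3% share.
sample = random.sample(raw_clicks, 1_000)
print("politics share in sample:", sample.count("politics") / len(sample))
# Uniform sampling cannot boost unpopular categories, so the target category
# distribution must be controlled explicitly when choosing sampling weights.
```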
Formulation
We formulate our task as the problem of finding a sampling weight for each user, w(u).
We assume that our business metrics are anonymized if the distribution of the number of clicks in the dataset differs from that in the raw data.
• Formulating this challenge exactly is impossible, because it would require enumerating all the metrics that should be anonymized.
• However, several important metrics are strongly correlated with the distribution of the number of clicks.
We sample users so that the click distribution of the dataset becomes closer to a target distribution.
[Figure: user sampling transforms the click distribution of the raw data toward the target distribution.]
Sampling Weight
[Figure: Users A, B, and C, each with their full behavior logs, are sampled into the dataset with weights w(User A), w(User B), and w(User C).]
We find the optimal w(u) that brings the resulting distribution close to the target distribution (a minimal sketch follows).
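Once the weights are found, building the dataset is a weighted draw of users; a minimal sketch (the names and weights below are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
users = np.array(["user_a", "user_b", "user_c", "user_d"])
w = np.array([0.1, 0.4, 0.4, 0.1])   # optimized sampling weights w(u), summing to 1

# Draw users without replacement according to w; all behavior logs of the
# drawn users then go into the published dataset.
sampled_users = rng.choice(users, size=2, replace=False, p=w)
print(sampled_users)
```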
Formulation
We sample users to make the click distribution of the dataset closer to a target distribution, i.e., we minimize W(P_click(w), P_target).
• P_click(w): the expected click distribution on the dataset under sampling weights w.
• P_target: the target distribution.
• W: the Wasserstein distance on the real line.
We also sample users to make the distributions of user attributes and of clicks in each article category closer to specified distributions, minimizing D(P_attr(w), P_attr_target) and D(P_cat(w), P_cat_target), where D is the KL divergence.
Each expected distribution is computed in closed form from the sampling weights w(u).
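A sketch of the two distances used in the loss, using scipy/numpy (the distributions here are stand-ins; in the actual method they are the expected distributions induced by w(u), and the Zipf exponents are placeholders for the Zipf(1)/Zipf(2) targets):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# 1-D Wasserstein distance between per-user click counts and a Zipf-like
# target sample.
clicks_dataset = rng.zipf(1.5, size=10_000)
clicks_target = rng.zipf(2.0, size=10_000)
print("W1:", wasserstein_distance(clicks_dataset, clicks_target))

# KL divergence D(p || q) for the user-attribute / article-category
# distributions, each given as a normalized histogram.
def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p_category = [0.5, 0.3, 0.2]    # expected category shares under w(u)
q_category = [1/3, 1/3, 1/3]    # target: uniform over categories
print("KL:", kl_divergence(p_category, q_category))
```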
Formulation
We seek a sampling weight at which all the loss functions take small values, and we apply gradient descent-type algorithms to minimize the losses (see the sketch below).
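A minimal sketch of a gradient-descent-type optimization of the sampling weights. PyTorch, the click-count bucketing, and the KL surrogate loss are my assumptions; the slide only states that gradient descent-type algorithms are applied:

```python
import torch

torch.manual_seed(0)
n_users, n_buckets = 1_000, 10
# Each user falls into a click-count bucket; the expected bucket distribution
# under weights w(u) should match a Zipf-like target.
bucket = torch.randint(0, n_buckets, (n_users,))
one_hot = torch.nn.functional.one_hot(bucket, n_buckets).float()
target = torch.arange(1, n_buckets + 1, dtype=torch.float) ** -1.0
target /= target.sum()                       # stand-in Zipf(1)-like target

logits = torch.zeros(n_users, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    w = torch.softmax(logits, dim=0)         # sampling weights, sum to 1
    expected = w @ one_hot                   # expected click-bucket distribution
    # KL(target || expected) as a smooth surrogate for the distance above.
    loss = torch.sum(target * torch.log(target / (expected + 1e-9)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final KL:", float(loss))
```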
We built eight datasets from the user behavior logs (raw data) of our news delivery service.
• Sampled 60,000 users from the raw data.
• Two types of target click distributions: Zipf(1) and Zipf(2).
• Controlled/uncontrolled target distributions of user attributes and article categories.
Observations:
• Zipf(2) datasets are sparser than Zipf(1) datasets.
• Category-controlled datasets are sparser than uncontrolled datasets.
We successfully controlled the click distributions.
[Figure: click distributions of the built datasets, matching the Zipf(1) and Zipf(2) targets.]
The distributions of both user attributes and article categories were also controlled successfully.
We compared the evaluation results of recommendation algorithms on each dataset.
[Table: evaluation results per algorithm and dataset; best results highlighted.]
Evaluation results on the Zipf(2) datasets were worse than on the Zipf(1) datasets. This may be because the Zipf(2) datasets were sparser.
It is necessary to select sampling settings according to the purpose, and it may be important to publish datasets with various settings.
1. Summarizing the challenges of building and publishing datasets from a commercial service.
2. Formulating the problem of building and publishing a dataset as an optimization problem that seeks the sampling weights of users.
3. Applying our method to build datasets from the raw data of our real-world mobile news delivery service.
Limitations & Future Work
• We did not give a theoretical guarantee of the impossibility of estimating the business metrics from the published dataset. Providing such an impossibility result is important future work.
• This study only considered user-item interactions; however, real-world services may have other types of behavior logs.
This study is the first attempt to reduce business risks in publishing datasets.
This study is the first attempt to reduce business risks in publishing datasets.
Previously, researchers have not disclosed how they built their datasets and have not shared that knowledge with the community.
We hope that our work will lead to more discussion of the process of building and publishing datasets, and that many more datasets will be published.
Our implementation and dataset are available: https://github.com/gunosy/publishing-dataset-recsys20
Feel free to contact me: [email protected]