Datasets have driven the development of recommender system research.
• MovieLens, Netflix Prize
• In recent years, data science competitions such as Kaggle, the KDD Cup, and the RecSys Challenge have promoted dataset publication.
However, implicit feedback datasets from commercial services are still scarce.
• Recommender systems have been adopted in many and varied services, so many and varied datasets are needed.
Dataset publication is important for recommender system research.
Publishing a dataset carries business risks.
• Leaking confidential business metrics.
• Reputation risks.
Before publishing a dataset, researchers must obtain approval from a business manager.
• Many business managers are not specialists in machine learning or recommender systems.
• The researchers are responsible for explaining the risks and benefits.
We focus on implicit feedback datasets.
• Implicit feedback datasets contain confidential business information and users' personal information.
• Explicit feedback datasets are often constructed by crawling public web resources, such as user reviews and ratings available online.
We would like to make it easier for commercial services to publish datasets.
• We summarize the challenges of building and publishing datasets from a commercial service.
• We formulate the problem of building and publishing a dataset as an optimization problem that seeks the sampling weights of users.
• We applied our method to build datasets from the raw data of Gunosy, our real-world mobile news delivery service, which is popular in Japan.
◦ The raw data contains more than 1,000,000 users with 100,000,000 interactions.
• The implementation of our proposed method and a dataset built with it are publicly available: https://github.com/gunosy/publishing-dataset-recsys20
We focus only on the following three types of data to simplify the situation.
• User behavior logs: when user u clicks article a at time t, the triplet (u, a, t) is recorded as a log.
• User attributes: each user has attributes, such as age and gender.
• Article category: each news article has a category, such as sports, entertainment, or politics.
Our task is to publish a subset of the user behavior logs. We build the dataset with a "sampling users" approach (see the sketch after the steps):
1. Sample users from the user behavior logs.
2. Collect all the user behavior logs associated with the sampled users.
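A minimal sketch of the sampling-users approach on toy data (the log format follows the (u, a, t) triplets above; the variable names and data are illustrative, not the authors' production code):

```python
import random
from collections import defaultdict

# Toy user behavior logs: (user, article, timestamp) triplets.
logs = [
    ("user_a", "item_a", 1), ("user_a", "item_c", 2), ("user_a", "item_d", 5),
    ("user_b", "item_b", 1), ("user_b", "item_c", 3), ("user_b", "item_f", 4),
    ("user_c", "item_b", 2), ("user_c", "item_e", 6),
]

def build_dataset_by_sampling_users(logs, n_users, seed=0):
    """Step 1: sample users; step 2: collect ALL behavior logs of the
    sampled users, so each user's consumption history stays intact."""
    logs_by_user = defaultdict(list)
    for u, a, t in logs:
        logs_by_user[u].append((u, a, t))
    rng = random.Random(seed)
    sampled_users = rng.sample(sorted(logs_by_user), k=min(n_users, len(logs_by_user)))
    return [triplet for u in sampled_users for triplet in logs_by_user[u]]

print(build_dataset_by_sampling_users(logs, n_users=2))
```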
Approach (alternative): sampling behavior logs
[Figure: individual behavior logs of Users A, B, and C are sampled into the dataset.]
With this approach, the consumption histories of the users are missing.
Approach (ours): sampling users
[Figure: whole users, with all their behavior logs, are sampled into the dataset.]
With this approach, the consumption histories of the users are kept intact.
We pose the following three challenges:
1. Anonymize the Business Metrics
2. Maintain Fairness
3. Reduce Popularity Bias
Challenge 1: Anonymize the Business Metrics
• We do not want to disclose confidential business metrics.
◦ operating income
◦ the average number of clicks
◦ the average active rate of users
• If users are sampled uniformly, some business metrics, such as the average number of clicks and the average active rate of users, could easily be estimated (see the sketch below).
• We must therefore sample users with a non-uniform distribution.
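A toy illustration (synthetic numbers, not real Gunosy data) of why uniform sampling leaks a metric such as the average number of clicks:

```python
import random

random.seed(0)
# Synthetic per-user click counts standing in for the confidential raw data.
clicks_per_user = [random.randint(1, 200) for _ in range(1_000_000)]
true_mean = sum(clicks_per_user) / len(clicks_per_user)

# A uniformly sampled dataset yields an unbiased estimate of the true mean.
uniform_sample = random.sample(clicks_per_user, 60_000)
estimated_mean = sum(uniform_sample) / len(uniform_sample)

print(f"true mean: {true_mean:.2f}, uniform-sample estimate: {estimated_mean:.2f}")
# The estimate is close to the true value, so the metric is effectively
# disclosed; non-uniform sampling weights w(u) break this correspondence.
```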
1. Anonymize the Business Metrics
2. Maintain Fairness
3. Reduce Popularity Bias
Challenge 2: Maintain Fairness
• Publishing a fair dataset is very important.
◦ Some existing methods that maintain fairness rely on user attributes; however, publishing user attributes risks de-anonymization.
◦ Publishing an unfair dataset indirectly contributes to creating unfair machine learning models.
• These risks would damage the company's reputation.
1. Anonymize the Business Metrics
2. Maintain Fairness
3. Reduce Popularity Bias
Challenge 3: Reduce Popularity Bias
• Recommender systems are expected to match long-tailed items with users; algorithms suffering from popularity bias cannot fulfill this role.
• We believe popularity bias is also a problem when building a dataset.
◦ If the dataset is built by uniform sampling, items of unpopular categories are sampled less frequently (see the sketch below).
◦ Because researchers cannot increase the number of interactions afterwards, the publisher must retain a certain number of interactions with items of unpopular categories.
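A toy sketch (assumed numbers) of how uniform sampling simply preserves the raw data's category skew:

```python
import random

random.seed(0)
# In the toy raw data, only 3% of clicks fall on the "politics" category.
raw_clicks = random.choices(["sports", "politics"], weights=[97, 3], k=100_000)

# A uniform subsample keeps roughly the same 3% share.
sample = random.sample(raw_clicks, 1_000)
print("politics share in sample:", sample.count("politics") / len(sample))
# Uniform sampling cannot boost unpopular categories, so the target category
# distribution must be controlled explicitly when choosing sampling weights.
```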
Formulation
We formulate our task as the problem of finding a sampling weight for each user, w(u).
We assume that our business metrics are anonymized if the distribution of the number of clicks in the dataset differs from that in the raw data.
• Formulating this challenge exactly is impossible, because it would require enumerating all the metrics that should be anonymized.
• However, several important metrics are strongly correlated with the distribution of the number of clicks.
We sample users so that the click distribution of the dataset becomes closer to a target distribution.
[Figure: user sampling transforms the click distribution of the raw data toward the target distribution.]
Sampling Weight
[Figure: Users A, B, and C, each with their full behavior logs, are sampled into the dataset with weights w(User A), w(User B), and w(User C).]
We find the optimal w(u) that brings the resulting distribution close to the target distribution (a minimal sketch follows).
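Once the weights are found, building the dataset is a weighted draw of users; a minimal sketch (the names and weights below are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
users = np.array(["user_a", "user_b", "user_c", "user_d"])
w = np.array([0.1, 0.4, 0.4, 0.1])   # optimized sampling weights w(u), summing to 1

# Draw users without replacement according to w; all behavior logs of the
# drawn users then go into the published dataset.
sampled_users = rng.choice(users, size=2, replace=False, p=w)
print(sampled_users)
```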
Formulation
We sample users to make the click distribution of the dataset closer to a target distribution, i.e., we minimize W(P_click(w), P_target).
• P_click(w): the expected click distribution on the dataset under sampling weights w.
• P_target: the target distribution.
• W: the Wasserstein distance on the real line.
We also sample users to make the distributions of user attributes and of clicks in each article category closer to specified distributions, minimizing D(P_attr(w), P_attr_target) and D(P_cat(w), P_cat_target), where D is the KL divergence.
Each expected distribution is computed in closed form from the sampling weights w(u).
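A sketch of the two distances used in the loss, using scipy/numpy (the distributions here are stand-ins; in the actual method they are the expected distributions induced by w(u), and the Zipf exponents are placeholders for the Zipf(1)/Zipf(2) targets):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# 1-D Wasserstein distance between per-user click counts and a Zipf-like
# target sample.
clicks_dataset = rng.zipf(1.5, size=10_000)
clicks_target = rng.zipf(2.0, size=10_000)
print("W1:", wasserstein_distance(clicks_dataset, clicks_target))

# KL divergence D(p || q) for the user-attribute / article-category
# distributions, each given as a normalized histogram.
def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p_category = [0.5, 0.3, 0.2]    # expected category shares under w(u)
q_category = [1/3, 1/3, 1/3]    # target: uniform over categories
print("KL:", kl_divergence(p_category, q_category))
```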
Formulation
We seek a sampling weight at which all the loss functions take small values, and we apply gradient descent-type algorithms to minimize the losses (see the sketch below).
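A minimal sketch of a gradient-descent-type optimization of the sampling weights. PyTorch, the click-count bucketing, and the KL surrogate loss are my assumptions; the slide only states that gradient descent-type algorithms are applied:

```python
import torch

torch.manual_seed(0)
n_users, n_buckets = 1_000, 10
# Each user falls into a click-count bucket; the expected bucket distribution
# under weights w(u) should match a Zipf-like target.
bucket = torch.randint(0, n_buckets, (n_users,))
one_hot = torch.nn.functional.one_hot(bucket, n_buckets).float()
target = torch.arange(1, n_buckets + 1, dtype=torch.float) ** -1.0
target /= target.sum()                       # stand-in Zipf(1)-like target

logits = torch.zeros(n_users, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    w = torch.softmax(logits, dim=0)         # sampling weights, sum to 1
    expected = w @ one_hot                   # expected click-bucket distribution
    # KL(target || expected) as a smooth surrogate for the distance above.
    loss = torch.sum(target * torch.log(target / (expected + 1e-9)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final KL:", float(loss))
```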
We built eight datasets from the user behavior logs (raw data) of our news delivery service.
• Sampled 60,000 users from the raw data.
• Two types of target click distributions: Zipf(1) and Zipf(2).
• Controlled/uncontrolled target distributions of user attributes and article categories.
Observations:
• Zipf(2) datasets are sparser than Zipf(1) datasets.
• Category-controlled datasets are sparser than uncontrolled datasets.
We successfully controlled the click distributions.
[Figure: click distributions of the built datasets, matching the Zipf(1) and Zipf(2) targets.]
The distributions of both user attributes and article categories were also controlled successfully.
We compared the evaluation results of recommendation algorithms on each dataset.
[Table: evaluation results per algorithm and dataset; best results highlighted.]
Evaluation results on the Zipf(2) datasets were worse than on the Zipf(1) datasets. This may be because the Zipf(2) datasets were sparser.
It is necessary to select sampling settings according to the purpose, and it may be important to publish datasets with various settings.
1. Summarizing the challenges of building and publishing datasets from a commercial service.
2. Formulating the problem of building and publishing a dataset as an optimization problem that seeks the sampling weights of users.
3. Applying our method to build datasets from the raw data of our real-world mobile news delivery service.
Limitations & Future Work
• We did not give a theoretical guarantee of the impossibility of estimating the business metrics from the published dataset. Providing such an impossibility result is important future work.
• This study only considered user-item interactions; however, real-world services may have other types of behavior logs.
This study is the first attempt to reduce business risks in publishing datasets.
This study is the first attempt to reduce business risks in publishing datasets.
Previously, researchers have not disclosed how they built their datasets and have not shared that knowledge with the community.
We hope that our work will lead to more discussion of the process of building and publishing datasets, and that many more datasets will be published.
Our implementation and dataset are available: https://github.com/gunosy/publishing-dataset-recsys20
Feel free to contact me: [email protected]