ysekky
December 10, 2018

Analysis of Bias in Gathering Information Between User Attributes in News Application (ABCCS 2018)

Transcript

  1. Analysis of Bias in Gathering Information Between User Attributes in News Application

    Yoshifumi Seki (Gunosy Inc.)
    Mitsuo Yoshida (Toyohashi University of Technology)
    ABCCS2018@IEEE Bigdata 2018, 2018.12.10
  2. Motivations

    • Confirmation bias exists in information gathering on the web.
      ◦ e.g., filter bubbles, echo chambers
      ◦ These phenomena have been investigated through questionnaires.
    • We would like to clarify these phenomena by analyzing behavior data.
      ◦ In this study, we use user activity logs from a news application.
      ◦ The goals include evaluating the diversity of recommender systems and improving long-term user satisfaction.
  3. Research Question

    • Q. How does behavior in the news application differ between user attributes?
      ◦ Ideally, we would like to analyze users based on their interests.
      ◦ Instead of users' interests, we analyze users based on their attributes.
    • Our contributions:
      ◦ Clarify the relationships in user behavior between user attributes.
      ◦ Detect keywords that are biased by attribute, using regression analysis.
  4. Data Source

    • Gunosy
      ◦ A popular Japanese news delivery service
      ◦ Provides mobile applications (iOS, Android)
      ◦ Over 24 million downloads
      ◦ Delivers news from over 600 media outlets
  5. Dataset

    • August 1 to 31, 2018 (1 month)
    • News articles
      ◦ politics, society
    • 2 types of action
      ◦ Click, Like
    • Clicked more than 100 times
    • User attributes
      ◦ Users register their own attributes in the application.
        ▪ If users don't register, their attributes are predicted by supervised learning.
      ◦ Age
        ▪ -29 (younger), 30-39 (middle), 40- (older)
      ◦ Gender
        ▪ male, female
  6. Gender Action Ratio

                      all      politics   society
    click   male      58.9%    76.2%      54.0%
            female    41.1%    23.8%      46.0%
    like    male      47.7%    78.2%      47.4%
            female    52.3%    21.8%      52.6%
    # of news                  1,333      8,801
  7. Age Action Ratio

                      all      politics   society
    click   young     34.7%    16.4%      23.1%
            middle    30.2%    22.1%      30.4%
            older     35.1%    61.5%      46.5%
    like    young     25.8%     8.8%      16.0%
            middle    25.4%    11.0%      22.1%
            older     48.7%    80.2%      61.9%
    # of news                  1,333      8,801
  8. Normalize # of Actions

    • The trend in the number of actions differs depending on categories and attributes.
      ◦ Normalization is needed.
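The slide does not say which normalization was applied. As a minimal sketch, assuming each article's count is divided by the total for its attribute group (one common choice, not necessarily the one used in the study), groups with very different activity levels become comparable. The function name and the sample counts are hypothetical.

```python
def normalize_counts(counts):
    """Scale per-article action counts so they sum to 1 within one attribute group.

    `counts` maps an article id to the number of actions (clicks or likes)
    made by a single attribute group, e.g. male users.
    """
    total = sum(counts.values())
    if total == 0:
        # No actions at all: every share is zero.
        return {article: 0.0 for article in counts}
    return {article: n / total for article, n in counts.items()}


# Hypothetical click counts for three articles, split by gender.
male_clicks = {"a1": 300, "a2": 100, "a3": 100}
female_clicks = {"a1": 30, "a2": 50, "a3": 20}

male_share = normalize_counts(male_clicks)
female_share = normalize_counts(female_clicks)
```

After this step the two groups can be compared article by article even though one group produced five times as many clicks overall.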
  9. Scatter Plot by gender

    [Figure: scatter plots by gender, with Click and Like panels]
    Pearson's correlation coefficient: 0.902, 0.883, 0.502, 0.509
    ◦ Click: strong positive correlation
    ◦ Like: weaker correlation than click
  10. Pearson's coefficient by ages

                      politics          society
                      click    like     click    like
    young-middle      0.993    0.909    0.985    0.955
    middle-older      0.923    0.845    0.969    0.976
    older-young       0.901    0.786    0.936    0.902
  11. Result of Correlation Analysis

    • Differences in per-category user behavior between attributes were compared using the correlation coefficient.
      ◦ The number of clicks has strong positive correlations between attributes.
      ◦ The number of likes has weaker correlations than the number of clicks.
    • User behavior is strongly correlated between attributes.
      ◦ We are therefore able to discuss their differences using user behavior data.
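The correlation analysis above can be sketched as follows. This is a plain-Python computation of Pearson's correlation coefficient, assuming the inputs are per-article action counts for two attribute groups listed in the same article order. The function name and the sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-article click counts for two age groups.
young_clicks = [120, 80, 200, 150]
older_clicks = [240, 160, 400, 300]
r = pearson(young_clicks, older_clicks)
```

In this made-up example the older group clicks exactly twice as often on every article, so the coefficient is 1 up to floating-point error: the groups differ in volume but not in which articles they prefer.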
  12. Comparison by keywords

    • Our purpose is to clarify how behavior differs between user attributes depending on the topic of news articles.
      ◦ There are various definitions of news topics.
      ◦ This study compares articles based on the keywords included in the title.
    • Extract keywords from news articles.
      ◦ Divide the title of each news article into morphemes using MeCab.
        ▪ These morphemes are taken as keyword candidates.
      ◦ Count the news articles that include each keyword candidate.
      ◦ We adopt the top 100 words by this count as keywords.
        ▪ Meaningless words are excluded.
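The extraction steps above can be sketched as follows. The tokenizer is passed in as a parameter so the example runs without MeCab installed. Whitespace splitting stands in for morphological analysis here, and with the mecab-python3 package one could presumably pass something like `lambda t: MeCab.Tagger("-Owakati").parse(t).split()` instead. The function name, the stopword handling, and the sample titles are assumptions for illustration.

```python
from collections import Counter


def extract_keywords(titles, tokenize, top_n=100, stopwords=frozenset()):
    """Return the top_n keyword candidates ranked by number of titles containing them."""
    article_counts = Counter()
    for title in titles:
        # Use a set so a word repeated within one title counts that article once.
        candidates = set(tokenize(title)) - stopwords
        article_counts.update(candidates)
    return [word for word, _ in article_counts.most_common(top_n)]


# Whitespace splitting stands in for MeCab so the sketch runs anywhere.
titles = [
    "cabinet approves summer time plan",
    "cabinet debates olympics budget",
    "police rescue boy",
]
keywords = extract_keywords(titles, str.split, top_n=1)
```

Counting articles rather than raw occurrences matches the slide's "count news articles including each keyword candidate" step, and the `stopwords` argument is one simple way to exclude meaningless words.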
  13. Distribution of keyword correlation coefficient

    • We would like to compare keywords between user attributes.
      ◦ If the correlation coefficient of a keyword is weak, that keyword is not comparable.
    • Keywords with weak correlation coefficients include articles with very few actions.

    [Figure: distributions for Click and Like]
  14. Regression Analysis

    • To detect differences between keywords, we adopt regression analysis.
    • Regression analysis yields a slope and an intercept.
      ◦ Exclude keywords whose coefficient of determination is 0.5 or less.
        ▪ The coefficient of determination is similar to the correlation coefficient.
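A minimal sketch of this step, assuming that for each keyword the per-article action counts of one attribute group are regressed on those of the other with ordinary least squares. The slides do not spell out the exact setup, so the function names and sample data are hypothetical. For a simple linear regression like this one, the coefficient of determination equals the square of Pearson's correlation coefficient.

```python
def linear_regression(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x is constant; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # R^2 is undefined when y is constant; treat it as 0 so the keyword is excluded.
    r_squared = 0.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


def keep_keyword(r_squared, threshold=0.5):
    """Keep a keyword only if its coefficient of determination exceeds the threshold."""
    return r_squared > threshold


# Hypothetical counts for the articles containing one keyword.
group_a = [1, 2, 3, 4]
group_b = [3, 5, 7, 9]
slope, intercept, r_squared = linear_regression(group_a, group_b)
```

The strict inequality in `keep_keyword` follows the slide's wording: keywords at 0.5 or less are excluded.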
  15. Compare Keyword Intercept

    The slopes of these two keywords are close to the average, while the intercept is large for one and small for the other.
  16. Compare Keyword Slope

    The intercepts of these two keywords are close to the average, while the slope is large for one and small for the other.
  17. Biased Keywords Detection

    • Using slope (s) and intercept (i), keywords are divided into three categories based on mean ± σ.
      ◦ larger than upper (x > mean + σ)
      ◦ smaller than lower (x < mean - σ)
      ◦ within the section (mean - σ < x < mean + σ)
    • These categories are defined under the assumption that these parameters follow a normal distribution.
      ◦ belonging to 95% or not.
    • If one of the two parameters is within the section and the other is not, the keyword is biased.
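The detection rule above can be sketched as follows, reading "one is within the section and the other is not" as referring to a keyword's slope and intercept. The band half-width is left as a multiplier `k` of σ, since under a normal distribution mean ± σ covers roughly 68% of values and mean ± 2σ roughly 95%. The function names, the use of the population standard deviation, and the sample values are assumptions for illustration.

```python
import statistics


def band(value, mean, sigma, k=1.0):
    """Classify a value as 'upper', 'lower', or 'within' relative to mean ± k·sigma."""
    if value > mean + k * sigma:
        return "upper"
    if value < mean - k * sigma:
        return "lower"
    return "within"


def detect_biased(params, k=1.0):
    """Return biased keywords as {keyword: (slope_band, intercept_band)}.

    `params` maps keyword -> (slope, intercept). A keyword is reported when
    exactly one of its two parameters falls outside the mean ± k·sigma band.
    """
    slopes = [s for s, _ in params.values()]
    intercepts = [i for _, i in params.values()]
    s_mean, s_sigma = statistics.mean(slopes), statistics.pstdev(slopes)
    i_mean, i_sigma = statistics.mean(intercepts), statistics.pstdev(intercepts)

    biased = {}
    for keyword, (s, i) in params.items():
        s_band = band(s, s_mean, s_sigma, k)
        i_band = band(i, i_mean, i_sigma, k)
        # Biased when one parameter is within the band and the other is not.
        if (s_band == "within") != (i_band == "within"):
            biased[keyword] = (s_band, i_band)
    return biased


# Hypothetical (slope, intercept) pairs for five keywords.
params = {
    "A": (1.0, 0.0),
    "B": (1.1, 0.0),
    "C": (0.9, 0.0),
    "D": (1.0, 0.0),
    "E": (1.0, 10.0),
}
biased = detect_biased(params)
```

In this made-up data, keywords B and C stand out by slope and keyword E by intercept, which mirrors the two kinds of comparison shown on the preceding slides.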
  18. Biased Keyword by intercept in gender

    • Mio Sugita is a Japanese politician who published articles on LGBT in magazines. The claims in these articles caused controversy.
    • There is news about the possible introduction of Summer Time before the 2020 Summer Olympic Games in Tokyo.
    • A 2-year-old boy went missing in the forest and was rescued by a volunteer.

    Keywords (columns on the slide: politics click, politics like, society click, society like)
    ◦ Upper (biased to male): House of Representatives, China; Police; Obscenity
    ◦ Lower (biased to female): Sugita Mio, Summer Time, Cabinet, Olympics; Child, Mother; Boy, Crush, Mother, Children
  22. Conclusion

    • We analyzed behavior differences between user attributes based on the user behavior log of a news application and extracted keywords with biased behavior.
    • Using regression analysis, we obtain biased keywords from the degree of departure from the average values of slope and intercept.
    • Future work
      ◦ Verify whether these results are valid according to social science knowledge.
      ◦ Discover topics strongly biased by users' interests rather than user attributes.
      ◦ Create a measure that can extract keywords more simply.