Quantifying Diachronic Language Change via Word Embeddings: Analysis of Social Events using 11 Years News Articles in Japanese and English

Shotaro Ishihara, Hiromu Takahashi, and Hono Shirai (2023). Quantifying Diachronic Language Change via Word Embeddings: Analysis of Social Events using 11 Years News Articles in Japanese and English. 9th International Conference on Computational Social Science.
https://upura.github.io/pdf/ic2s2_2023_semantic_shift.pdf
https://www.ic2s2.org/

Shotaro Ishihara

July 11, 2023

Transcript

  1. Quantifying Diachronic Language Change via Word Embeddings: Analysis of Social Events using 11 Years News Articles in Japanese and English

    Shotaro Ishihara (Nikkei Inc. [email protected] ), Hiromu Takahashi, Hono Shirai

    Research Overview
    • We quantitatively analyzed semantic shifts caused by social events across multiple corpora and years, using news articles published in Japanese and English between 2011 and 2021.
    • Studies analyzing social events have often focused on a single event, so it is important to explore more comprehensive methods.
    • RQ1: Is the semantic shift caused by COVID-19 larger than the shifts in other years?
    • RQ2: Are the trends of change in Japanese and English similar?
    • A1&2: Yes (at least in our approach)

    References
    [1] How COVID-19 is changing our language: Detecting semantic shift in twitter word embeddings.
    [2] Semantic Shift Stability: Efficient Way to Detect Performance Degradation of Word Embeddings and Pre-trained Language Models. (AACL2022)

    Acknowledgments
    We are grateful to Kunihiro Miyazaki for the useful research discussions.

    Approach
    1. Corpora are divided by year, and a word2vec model is trained on each year.
    2. We take two trained word2vec models as input and derive rotation matrices (R) to align their coordinate axes.
    3. The stability (stab) of each word is calculated from the similarities in the two mapping directions. [1]
    4. We refer to the average stab over words as semantic shift stability and adopt it as the representative value. [2]

    Experimental Results
    • A1: The semantic shift stability for 2019-2020 was the lowest for both Nikkei (ja) and NOW (en); that is, the degree of change was the greatest in that period.
    • A2: The correlation coefficient between Nikkei and NOW was 0.66, indicating a similar trend.

    Inferring Reason of Semantic Shifts
    The approach has the advantage of identifying the words that exhibited a significant semantic shift (the words with the lowest stab). For 2019-2020:
    • Nikkei: infection, spread, corona, vaccine, virus, mask, infected, North Korea, vaccination, and epidemic.
    • NOW: king, Scott, de, virus, masks, wear, mask, pi, q, and wearing.
    => Words related to COVID-19 appeared at the top of the lists. Note that the analysis for 2015-2016 implied the impact of the U.S. presidential election.
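
    The Approach steps above (align two yearly embedding spaces with rotation matrices, score each word by its similarity in both directions, then average) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: random vectors stand in for trained word2vec embeddings, both matrices share one vocabulary with rows aligned by word, the rotations are fitted by orthogonal Procrustes, and the per-word score is taken as the mean of the two directional cosine similarities (the exact definition is given in [1]). The names fit_rotation, row_cosine, word_stability, and semantic_shift_stability are hypothetical.

```python
import numpy as np


def fit_rotation(src, dst):
    """Orthogonal R minimizing ||src @ R - dst||_F (orthogonal Procrustes).

    Assumption: the poster only says rotation matrices are derived;
    Procrustes is one common way to do that, chosen here for illustration.
    """
    u, _, vt = np.linalg.svd(src.T @ dst)
    return u @ vt


def row_cosine(a, b):
    """Cosine similarity between corresponding rows of a and b."""
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms


def word_stability(emb0, emb1):
    """Per-word stab, assumed here to be the mean of both directions."""
    r01 = fit_rotation(emb0, emb1)  # maps period-0 space onto period-1 space
    r10 = fit_rotation(emb1, emb0)  # maps period-1 space onto period-0 space
    sim01 = row_cosine(emb0 @ r01, emb1)
    sim10 = row_cosine(emb1 @ r10, emb0)
    return (sim01 + sim10) / 2.0


def semantic_shift_stability(emb0, emb1):
    """Average stab over the shared vocabulary (one value per year pair)."""
    return float(np.mean(word_stability(emb0, emb1)))


# Toy data standing in for two yearly word2vec models (not real embeddings).
rng = np.random.default_rng(0)
n_words, dim = 200, 8
emb_2019 = rng.normal(size=(n_words, dim))

# The second "year" is the first one in a rotated coordinate system ...
basis_change, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
emb_2020 = emb_2019 @ basis_change

# ... except for one word whose vector is flipped to mimic a semantic shift.
shifted_word = 5
emb_2020[shifted_word] *= -1.0

stab = word_stability(emb_2019, emb_2020)
print("semantic shift stability:", semantic_shift_stability(emb_2019, emb_2020))
print("least stable word index:", int(np.argmin(stab)))
```

    Sorting stab in ascending order gives the kind of "lowest stab" word list shown for 2019-2020. One design note: when both rotations are exact orthogonal Procrustes solutions fitted on the same rows, the two directional similarities agree up to floating-point error, so averaging them only matters if the rotations are estimated differently (for example, on different anchor words).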