
NLP Colloquium Sep. 11, 2024 Taguchi

Chihiro Taguchi
September 12, 2024


This deck contains the slides from my presentation "Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t", given at the NLP Colloquium on September 11, 2024 (JST). Images that might infringe on others’ copyright or privacy have been removed from this version.


Transcript

  1. Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t
    Chihiro Taguchi, David Chiang (田口智大, 蔣偉)
    NLP Colloquium, September 11, 2024 (presented at ACL 2024)
    Links: Paper, LinkedIn
  2. About me
    • 2015-2019: Faculty of Law, Keio University (language policy and language endangerment in Ikema, Okinawa)
    • 2020-2022: MA in Engineering, Nara Institute of Science and Technology (NLP for the Tatar language)
    • 2021-2022: MScR in Linguistics, University of Edinburgh (Tatar syntax)
    • 2022-present: PhD in Computer Science, University of Notre Dame (NLP for documenting endangered languages)
    Why move from the humanities/social sciences to NLP?
    • A constant interest in languages
    • Luck
  3. Research interests
    NLP
    • Multilingual NLP: NLP for the documentation of endangered languages
    • Automatic speech recognition (ASR)
    • Machine translation (MT)
    • Corpora (Universal Dependencies)
    Linguistics
    • Descriptive linguistics, field methods
    • Syntax (Tatar, Kichwa, Japanese)
  4. Documenting Kichwa
    Kichwa (< Quechuan language family)
    • Fieldwork
    • Building Kichwa ASR with the community: dataset, model, and a paper at LREC-COLING 2024
    • Kichwa syntax: paper at LSA 2024
    (Images: map of the Kichwa-speaking area in South America, Wikimedia Commons, CC-BY; a photo with my informants in Quito, Ecuador, removed for privacy reasons.)
  5. Background of today’s talk
    I was a newbie in ASR… (2022-)
    • Project in 2022-23: speech-to-IPA (Interspeech 2023: https://arxiv.org/abs/2308.03917)
    • NLP coursework taught by my supervisor in Fall 2023: the final project
    • Submission to ARR while down with appendicitis (February 2024)
  6. Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t
    Chihiro Taguchi, David Chiang (田口智大, 蔣偉)
    NLP Colloquium, September 11, 2024 (presented at ACL 2024)
    ㊗ Outstanding Paper Award ㊗ Senior Area Chair’s Award
    Links: Paper, LinkedIn
  7. What makes speech recognition hard?
    For humans,
    1. The number of characters (graphemes): too many characters → difficult prediction
    (Image: https://www.nippon.com/hk/views/b05601/)
  8. What makes speech recognition hard?
    For humans,
    1. The number of characters
    2. Inconsistent spelling (logographicity)
    • English: /raɪt/ → right, write, rite, Wright
    • Chinese: /shìshí/ → 事實 “fact”, 適時 “timely”, 是時 “at this time”, 嗜食 “to have a predilection for certain food”, …
    • Japanese: /senkoo/ → 先行 “preceding”, 専攻 “major”, 選考 “screening”, 閃光 “sparkle”, 先攻 “bat first”, 潜航 “cruising underwater”, 穿孔 “perforation”, …
    • Thai: /kasètsàat/ → เกษตรศาสตร์ (pronounced as if spelled กะเส็ดสาด)
    (Image: https://www.zabaan.com/blog/whats-wrong-with-english-spelling/)
  9. Logographicity?
    (Figure: a two-dimensional typology of writing systems, with the type of phonography (alphabetic, abugida, abjad, moraic, syllabic) on one axis and the degree of phonography vs. morphography on the other; examples include Finnish, Hindi, Phoenician, Japanese kana, Modern Yi, English, German, French, Korean, Japanese kanji, Chinese, Modern Tibetan, Thai, Persian, Akkadian, and Egyptian hieroglyphs. Modified from Sproat (2008), Computational Theory of Writing Systems, p. 138.)
  10. What makes speech recognition hard?
    For humans,
    1. The number of characters
    2. Inconsistent spelling
    3. The number of sounds (phonemes): Japanese has 5 vowels vs. more than 10 in English
    (Image: https://omniglot.com/writing/abkhaz.htm)
  11. What makes speech recognition hard … for machines?
    Today’s topic: Do machines also struggle with these linguistic complexities? If so, what are the factors?
  12. Let’s test it with Wav2Vec2-XLSR-53
    What is Wav2Vec2-XLSR-53? (Conneau et al., 2020)
    • Encoder-only speech model pretrained on 53 languages
    • Self-supervised multilingual model (like mBERT)
    • Adaptable to unseen languages
    https://huggingface.co/facebook/wav2vec2-large-xlsr-53
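To make this concrete, here is a minimal sketch of how XLSR-53 is typically adapted to a new language with the Hugging Face transformers library; the vocabulary file and settings are placeholders, not the configuration used in the paper:

```python
# A minimal sketch (not the paper's code): fine-tuning setup for
# Wav2Vec2-XLSR-53 with a CTC head over the target orthography.
# "vocab.json" is a hypothetical character-level vocabulary file.
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]", pad_token="[PAD]")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=len(tokenizer),            # one CTC label per grapheme type
    pad_token_id=tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # keep the convolutional feature encoder frozen
```

Because the CTC head has one label per grapheme type, the size of the output layer itself already varies with the orthography: a Romaji model predicts over a few dozen labels, a Kanji model over thousands.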
  13. Setup
    Now we want to see what makes speech recognition hard for Wav2Vec2-XLSR-53 → fine-tune it on languages with different writing systems:
    • Japanese: Kanji (日本語), Kana (ニホンゴ), Romaji (nihongo)
    • Chinese: Hanzi (漢語), Zhuyin (注音: ㄏㄢˋ ㄩˇ), Pinyin (拼音: hànyǔ)
    • Korean: Hangul syllabary (한국어), Hangul Jamo (ㅎㅏㄴㄱㅜㄱㅇㅓ)
    • Phonographic languages: Thai, Arabic, English, French, Italian, Czech, Swedish, Dutch, German
  14. Data for speech recognition
    The same amount of data across all the training languages:
    • Common Voice 16.1 (Ardila et al., 2020)
      – English: LibriSpeech (Panayotov et al., 2015)
      – Korean: Zeroth-Korean (https://github.com/goodatlas/zeroth)
    • Training data: 10,000 seconds per language
    • 12 languages, 10 writing systems
  15. Setup
    Metrics:
    • Character Error Rate (CER): CER = (S + I + D) / N, where S is the number of substitutions, I the number of insertions, D the number of deletions, and N the reference length (a quick check follows below)
    • 🆕 Calibrated Errors per Second (CEPS): CEPS = -ln(1 - CER) / τ, where τ is the voiced seconds per reference slice, computed with voice activity detection (VAD) over the audio of length a
      – Calibrated to mitigate the error bias caused by orthographic differences
      – Accounts for the possibility of multiple errors occurring within a single slice (e.g., a character)
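As a sanity check of the CER definition (not the paper’s code), the jiwer library computes it directly over characters:

```python
# CER = (S + I + D) / N over characters; a quick check with the jiwer
# library (pip install jiwer). The strings here are made-up examples.
from jiwer import cer

reference = "kana"
hypothesis = "kanji"
# 1 substitution + 1 insertion over a 4-character reference -> 0.5
print(cer(reference, hypothesis))
```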
  16. Details on Calibrated Errors per Second (CEPS)
    Assumptions:
    • All languages convey the same amount of information per second
    • Speech is divided into equal-length slices of τ seconds each
    • An ASR error is an event that occurs at a single point in time
    • Errors are Poisson-distributed
    Then the probability that a slice of τ seconds contains exactly k errors is P(k) = (λτ)^k e^(-λτ) / k!
    Notation: λ is the calibrated errors per second, τ the seconds per slice (here, per character), λτ the expected number of errors in a slice, p the error rate, and n the total number of slices.
    We want to estimate λ by maximum likelihood estimation (MLE); the likelihood of all observations is the product over slices.
  17. Details on Calibrated Errors per Second (CEPS)
    Log-likelihood: ℓ(λ) = n(1 - p)·(-λτ) + np·ln(1 - e^(-λτ)), where the first term covers slices with no errors and the second term slices with at least one error (p: error rate, such as CER or WER)
    Maximizing gives the estimate λ̂ = -ln(1 - p) / τ
    In the implementation, we use p = CER and τ = VAD(a) / N (voiced seconds per reference character)
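Written out in full (a reconstruction of the derivation from the quantities defined on these two slides, which showed the equations as images):

```latex
P(k \text{ errors in a slice}) = \frac{(\lambda\tau)^k e^{-\lambda\tau}}{k!}
\qquad\Longrightarrow\qquad
\mathcal{L}(\lambda)
  = \underbrace{\left(e^{-\lambda\tau}\right)^{n(1-p)}}_{\text{slices with no errors}}
    \underbrace{\left(1 - e^{-\lambda\tau}\right)^{np}}_{\text{slices with at least one error}}

\ell(\lambda) = -n(1-p)\,\lambda\tau + np \ln\!\left(1 - e^{-\lambda\tau}\right),
\qquad
\frac{d\ell}{d\lambda} = 0
\;\Longleftrightarrow\;
e^{-\hat{\lambda}\tau} = 1 - p
\;\Longleftrightarrow\;
\hat{\lambda} = -\frac{\ln(1-p)}{\tau}
```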
  18. CEPS and CER: example
    Reference: 腹減りすぎて3分待てなかったやべ
    Prediction: 腹減り全て3分待てなかった矢部
    (Assume the utterance took 2 seconds.)
    Source: HikakinTV (2021) “【6年ぶり】 YouTubeの自動字幕実況したら爆笑が止まらない www【ヒカキン TV・セイキン TV・マスオ TV】” (roughly, “[First time in six years] I couldn’t stop laughing at YouTube’s automatic captions lol”), 3:14. Retrieved September 8, 2024. https://www.youtube.com/watch?v=kLHc0c3Yv7U (screenshot removed for copyright reasons)
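A runnable version of this example (my own sketch: the 2-second duration is the slide’s assumption, and the edit-distance count is computed here rather than taken from the slides):

```python
# Worked example from the slide: CER and CEPS for the 2-second utterance.
import math

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (substitutions, insertions, deletions)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,         # deletion
                d[j - 1] + 1,     # insertion
                prev + (r != h),  # substitution (or match)
            )
    return d[-1]

ref = "腹減りすぎて3分待てなかったやべ"
hyp = "腹減り全て3分待てなかった矢部"
seconds = 2.0  # the slide's assumption, standing in here for the VAD-voiced duration

cer = edit_distance(ref, hyp) / len(ref)  # 4 edits / 16 chars = 0.25 with these strings
tau = seconds / len(ref)                  # seconds per reference character
ceps = -math.log(1 - cer) / tau           # ~2.30 calibrated errors per second
print(cer, ceps)
```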
  19. Setup
    Compare CER and CEPS with:
    • Grapheme distribution
      – Number of grapheme types
      – Unigram character entropy: H = -Σ_c p(c) log₂ p(c), summed over grapheme types c (a small sketch of these two measures follows below)
    • Logographicity
      – Attention-based measure (Sproat & Gutkin, 2021)
    • Phoneme distribution
      – Number of phoneme types from Phoible 2.0 (Moran and McCloy, 2019)
    How can we measure logographicity with attention?
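The two grapheme-distribution measures referenced above are straightforward; a minimal sketch (my own, assuming a plain-text training corpus):

```python
# Number of grapheme types and unigram character entropy (in bits),
# ignoring whitespace; a sketch, not the paper's exact preprocessing.
import math
from collections import Counter

def grapheme_stats(lines):
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    total = sum(counts.values())
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return len(counts), entropy  # (#grapheme types, unigram entropy)

print(grapheme_stats(["日本語", "ニホンゴ", "nihongo"]))
```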
  20. But how can we measure logographicity?
    Logographicity: the irregularity of the mapping from pronunciation to spelling (Sproat and Gutkin, 2021)
    → If a writing system is logographic, one must consider the context to determine the correct spelling!
    • on your /raɪt/ hand side → right
    • can you /raɪt/ it down → write
    • Stravinsky’s “The /raɪt/ of Spring” → rite
    • The /raɪt/ brothers invented an aircraft → Wright
  21. Attention-based measure of logographicity
    Check how much the attention is spread out across the context!
    (Figure: the target word in the orthography is decoded from the phoneme sequence with its context; the block of the attention matrix belonging to the target word itself is masked, i.e., zeroed out.)
  22. Attention-based measure of logographicity
    To compute the logographicity of a language:
    1. Train a seq-to-seq model that converts a phoneme sequence into the target word in the orthography
    2. Use the model to get the (last) attention matrix
    3. Mask the target word in the matrix
    4. Compute the ratio between the original matrix and the masked matrix
    (Figure: the attention matrix and its masked version. A code sketch of steps 3-4 follows below.)
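A sketch of steps 3-4 in code (my reading of the procedure; the exact masking and normalization in Sproat & Gutkin (2021) may differ):

```python
# Given the last-layer attention matrix of a phoneme-to-orthography
# seq2seq model, zero out the attention paid to the target word's own
# phonemes and measure the share of attention mass on the context.
import numpy as np

def logographicity_score(attn: np.ndarray, target_cols: slice) -> float:
    masked = attn.copy()
    masked[:, target_cols] = 0.0        # step 3: mask the target word
    return masked.sum() / attn.sum()    # step 4: ratio masked / original

# Toy usage: rows = output graphemes, columns = input phonemes (with context);
# each row of a soft-attention matrix sums to 1.
attn = np.random.dirichlet(np.ones(10), size=4)
print(logographicity_score(attn, slice(3, 6)))
```

The more the model must attend to the surrounding context to spell the target word, the higher this ratio, matching the intuition on the previous slide.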
  23. Results: Same language, different writing systems
    • Japanese: Kanji (日本語), Kana (ニホンゴ), Romaji (nihongo)
    • Chinese: Hanzi (漢語), Zhuyin (ㄏㄢˋ ㄩˇ), Pinyin (hànyǔ)
    • Korean: Hangul syllabary (한국어), Hangul Jamo (ㅎㅏㄴㄱㅜㄱㅇㅓ)
    Would they show different scores?
  24. Results
    Language | Writing system | CER↓ | CEPS↓ | #Graphemes | Unigram entropy | Logographicity | #Phonemes
    Japanese | Kanji + Kana | 58.12 | 7.21 | 1702 | 7.74 | 44.98 | 27
    Japanese | Kana | 29.71 | 3.48 | 92 | 5.63 | 41.22 |
    Japanese | Romaji | 17.09 | 2.91 | 27 | 3.52 | 29.46 |
    Chinese | Hanzi | 62.81 | 2.65 | 2155 | 9.47 | 41.59 | 39.5
    Chinese | Zhuyin | 9.71 | 1.04 | 49 | 4.81 | 24.32 |
    Chinese | Pinyin | 9.17 | 1.01 | 56 | 5.02 | 22.50 |
    Korean | Hangul | 28.21 | 2.63 | 965 | 7.98 | 25.27 | 42.5
    Korean | Jamo | 16.72 | 3.23 | 62 | 4.90 | 15.99 |
    “Simpler” writing systems get better scores!
  25. Logographic orthographies are hard to learn
    • Japanese: slower learning of Kanji than Kana/Romaji
    • Korean: slower learning of Hangul than Jamo
    • Chinese: slower learning of Hanzi than Zhuyin/Pinyin
    (Figure: learning curves, with panels for Japanese, Korean, and Chinese.)
  26. Results (contd.): Phonographic languages
    Language | Writing system | CER | CEPS | #Graphemes | Unigram entropy | Logographicity | #Phonemes
    Thai | Thai | 19.77 | 1.80 | 67 | 5.24 | 20.55 | 20.67
    Arabic | Arabic | 40.59 | 4.78 | 53 | 4.77 | 21.57 | 37
    English | Latin | 3.17 | 0.58 | 27 | 4.17 | 19.17 | 41.22
    French | Latin | 19.64 | 2.79 | 69 | 4.42 | 20.37 | 36.75
    Italian | Latin | 14.82 | 1.84 | 48 | 4.27 | 21.28 | 43.33
    Czech | Latin | 16.89 | 1.86 | 46 | 4.92 | 20.57 | 39
    Swedish | Latin | 20.31 | 2.71 | 34 | 4.52 | 19.81 | 35
    Dutch | Latin | 12.35 | 1.77 | 36 | 4.2 | 19.67 | 49.38
    German | Latin | 7.61 | 1.03 | 48 | 4.18 | 18.03 | 40
  27. Any correlation?
    Correlation matrix:
    | CER | CEPS | #Graphemes | Unigram entropy | Logographicity | #Phonemes
    CER | 1.00 | 0.77 | 0.85 | 0.81 | 0.76 | -0.37
    CEPS | | 1.00 | 0.49 | 0.41 | 0.61 | -0.66
    #Graphemes | | | 1.00 | 0.93 | 0.72 | -0.14
    Unigram entropy | | | | 1.00 | 0.67 | -0.08
    Logographicity | | | | | 1.00 | -0.60
    • Significant correlation between CER and the orthography-related variables
    • CEPS correlates more weakly with the orthography-related variables
    • No significant correlation between CER and the number of phonemes
  28. Any correlation? (Repeats the correlation matrix and the three observations from the previous slide.)
  29. Any correlation? (Repeats the correlation matrix and the three observations from the previous slide.)
  30. Conclusion
    What makes automatic speech recognition hard?
    • Orthographic complexity:
      – worse performance
      – slower learning
      – Calibrated Errors per Second (CEPS) can mitigate the orthographic bias
    • Phonological complexity does not affect performance
  31. Why is this finding interesting?
    • Speech recognition for low-resource logographic languages
      – Some low-resource languages have complex orthographies (images: https://www.omniglot.com/writing/yi.htm, https://www.omniglot.com/writing/inuktitut.htm)
      – Better accuracy with transliterated data? E.g., Taiwanese Hokkien: goa2 si7 jit8 pun2 lang5 → 我是日本人, via rule-based or seq2seq conversion (a toy sketch follows below)
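A toy version of the rule-based direction mentioned above (the syllable table is a hypothetical fragment covering only this example sentence):

```python
# Toy lookup-based transliteration of romanized Taiwanese Hokkien
# (with tone numbers) into Hanzi; a real system needs a full lexicon
# and disambiguation, whether rule-based or seq2seq.
SYLLABLE_TO_HANZI = {"goa2": "我", "si7": "是", "jit8": "日", "pun2": "本", "lang5": "人"}

def transliterate(romanized: str) -> str:
    return "".join(SYLLABLE_TO_HANZI[syllable] for syllable in romanized.split())

print(transliterate("goa2 si7 jit8 pun2 lang5"))  # 我是日本人
```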
  32. Why is this finding interesting?
    • Similarities to children’s first language acquisition
      – Babies can perfectly learn the phonology of their first language regardless of its phonological complexity
      – Children need a lot of conscious effort to learn writing
      – Fine-tuning pretrained multilingual self-supervised ASR models is somewhat like first language acquisition?
      – Choi et al. (2024): “Self-Supervised Speech Representations are More Phonetic than Semantic”