Chihiro Taguchi
September 12, 2024

This deck contains the slides used in my presentation titled "Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t" given at the NLP Colloquium on September 11, 2024 (JST). Images that might infringe copyright and privacy of others are removed in this version.


Transcript

  1. Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t
     Chihiro Taguchi, David Chiang (田口智大, 蔣偉)
     NLP Colloquium, September 11, 2024 (presented at ACL 2024)
     Links: Paper, LinkedIn
  2. About me
     • 2015-2019: Faculty of Law, Keio University (language policy and language endangerment in Ikema, Okinawa)
     • 2020-2022: MA in Engineering, Nara Institute of Science and Technology (NLP for the Tatar language)
     • 2021-2022: MScR in Linguistics, University of Edinburgh (Tatar syntax)
     • 2022-present: PhD in Computer Science, University of Notre Dame (NLP for documenting endangered languages)
     Why move from the humanities/social sciences to NLP?
     • A constant interest in languages
     • Luck
  3. Research interests
     NLP
     • Multilingual NLP: NLP for the documentation of endangered languages
     • Automatic speech recognition (ASR)
     • Machine translation (MT)
     • Corpora (Universal Dependencies)
     Linguistics
     • Descriptive linguistics, field methods
     • Syntax (Tatar, Kichwa, Japanese)
  4. Documenting Kichwa
     Kichwa (< Quechuan language family)
     • Fieldwork
     • Building Kichwa ASR with the community: dataset, model, and a paper at LREC-COLING 2024
     • Kichwa syntax: paper at LSA 2024
     Kichwa-speaking area in South America (image: Wikimedia Commons, CC-BY)
     With my informants (Quito, Ecuador); this image was removed for privacy reasons.
  5. Background of today’s talk
     • I was a newbie in ASR… (2022-)
     • Project in 2022-23: speech-to-IPA (Interspeech 2023: https://arxiv.org/abs/2308.03917)
     • NLP coursework taught by my supervisor in Fall 2023: final project
     • Submitted to ARR while down with appendicitis (February 2024)
  6. Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t
     Chihiro Taguchi, David Chiang (田口智大, 蔣偉)
     NLP Colloquium, September 11, 2024 (presented at ACL 2024)
     ㊗ Outstanding Paper Award / ㊗ Senior Area Chair’s Award
     Links: Paper, LinkedIn
  7. What makes speech recognition hard? For humans:
     1. The number of characters (graphemes): too many characters → difficult prediction
     https://www.nippon.com/hk/views/b05601/
  8. What makes speech recognition hard? For humans:
     1. The number of characters
     2. Inconsistent spelling (logographicity)
     • English: /raɪt/ → right, write, rite, Wright
     • Chinese: /shìshí/ → 事實 “fact”, 適時 “timely”, 是時 “at this time”, 嗜食 “to have a predilection for certain food”, …
     • Japanese: /senkoo/ → 先行 “preceding”, 専攻 “major”, 選考 “screening”, 閃光 “sparkle”, 先攻 “bat first”, 潜航 “cruising underwater”, 穿孔 “perforation”, …
     • Thai: /kasètsàat/ → เกษตรศาสตร์ (pronounced as if spelled กะเส็ดสาด)
     https://www.zabaan.com/blog/whats-wrong-with-english-spelling/
  9. Logographicity?
     [Figure: writing systems arranged along two dimensions, the type of phonography (alphabetic, abugida, abjad, moraic, syllabic) and the degree of morphography. Mostly phonographic: Finnish, Hindi, Phoenician, Japanese (kana), Modern Yi. Increasingly morphographic: English, German, French, Korean, Modern Tibetan, Thai, Persian, Japanese (kanji), Chinese, Akkadian, Egyptian (hieroglyph). Modified from Sproat (2008), Computational Theory of Writing Systems, p. 138.]
  10. What makes speech recognition hard? For humans:
      1. The number of characters
      2. Inconsistent spelling
      3. The number of sounds (phonemes): Japanese has 5 vowels vs. more than 10 in English
      https://omniglot.com/writing/abkhaz.htm
  11. What makes speech recognition hard … for machines?
      Today’s topic: Do machines also struggle with these linguistic complexities? If so, what are the factors?
  12. Let’s test it with Wav2Vec2-XLSR-53
      What is Wav2Vec2-XLSR-53? (Conneau et al., 2020)
      • Encoder-only speech model pretrained on 53 languages
      • Self-supervised multilingual model (like mBERT)
      • Adaptable to unseen languages (see the sketch below)
      https://huggingface.co/facebook/wav2vec2-large-xlsr-53
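A minimal sketch of how such a fine-tuning run can be set up with the Hugging Face transformers library. The toy vocabulary and configuration flags are my assumptions for illustration, not the paper's exact setup:

    # Sketch: load XLSR-53 and attach a fresh CTC head sized to one
    # orthography's grapheme inventory. Toy vocabulary; illustrative only.
    import json
    from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                              Wav2Vec2ForCTC)

    vocab = {"[PAD]": 0, "[UNK]": 1, "|": 2, "n": 3, "i": 4, "h": 5, "o": 6, "g": 7}
    with open("vocab.json", "w") as f:
        json.dump(vocab, f)

    tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                     pad_token="[PAD]", word_delimiter_token="|")
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53")
    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",
        vocab_size=len(tokenizer),            # CTC output layer per orthography
        pad_token_id=tokenizer.pad_token_id,
        ctc_loss_reduction="mean",
    )
    model.freeze_feature_encoder()  # keep the convolutional feature encoder frozen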
  13. Setup
      Now, we want to see what makes speech recognition hard for Wav2Vec2-XLSR-53.
      → Fine-tune it on languages written in different writing systems (see the vocabulary sketch below):
      • Japanese: Kanji (日本語), Kana (ニホンゴ), Romaji (nihongo)
      • Chinese: Hanzi (漢語), Zhuyin (注音: ㄏㄢˋㄩˇ), Pinyin (拼音: hànyǔ)
      • Korean: Hangul syllabary (한국어), Hangul Jamo (ㅎㅏㄴㄱㅜㄱㅇㅓ)
      • Phonographic languages: Thai, Arabic, English, French, Italian, Czech, Swedish, Dutch, German
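A plausible prerequisite for "one model per writing system" is deriving a separate grapheme vocabulary from each transcription's training text (my own illustration, not the authors' code; toy inputs):

    # Sketch: build a grapheme vocabulary per orthography, so Kanji, Kana,
    # and Romaji models get differently sized CTC output layers.
    def build_vocab(transcripts):
        vocab = {"[PAD]": 0, "[UNK]": 1}
        for ch in sorted(set("".join(transcripts))):
            vocab[ch] = len(vocab)
        return vocab

    print(len(build_vocab(["nihongo wo hanasu"])))  # small Romaji inventory
    print(len(build_vocab(["日本語を話す"])))            # larger Kanji/Kana inventory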
  14. Data for speech recognition
      The same amount of data across all the training languages (see the sampling sketch below):
      • Common Voice 16.1 (Ardila et al., 2020)
        – English: LibriSpeech (Panayotov et al., 2015)
        – Korean: Zeroth-Korean (https://github.com/goodatlas/zeroth)
      • Training data: 10,000 seconds per language
      • 12 languages, 10 writing systems
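One plausible way to enforce the equal 10,000-second budget with the Hugging Face datasets library. The selection procedure here is my assumption (the slide does not specify it), and Common Voice on the Hub is gated, so the download requires accepting its terms:

    # Sketch: take utterances until the 10,000-second training budget is met.
    from datasets import Audio, load_dataset

    ds = load_dataset("mozilla-foundation/common_voice_16_1", "ja", split="train")
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

    budget_s, total_s, keep = 10_000.0, 0.0, []
    for i, ex in enumerate(ds):
        dur = len(ex["audio"]["array"]) / ex["audio"]["sampling_rate"]
        if total_s + dur > budget_s:
            break
        total_s += dur
        keep.append(i)
    train = ds.select(keep)  # ~10,000 seconds of speech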
  15. Setup
      Metrics:
      • Character Error Rate (CER):
        CER = (S + I + D) / N
        where S: #substitutions, I: #insertions, D: #deletions, N: reference length (a CER sketch follows this slide)
      • 🆕 Calibrated Errors per Second (CEPS):
        CEPS = -(N / VAD(a)) · ln(1 - CER)
        where VAD(a): the voiced duration of audio a, measured by voice activity detection
        – Calibrated to mitigate the error bias caused by orthographic differences
        – Accounts for multiple errors potentially occurring within one slice (e.g., one character); derived on the next two slides
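A self-contained sketch of CER via the standard Levenshtein dynamic program:

    # Sketch: CER = (S + I + D) / N via a one-row Levenshtein DP.
    def edit_distance(ref: str, hyp: str) -> int:
        d = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, d[0] = d[0], i
            for j, h in enumerate(hyp, 1):
                # deletion, insertion, substitution/match
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
        return d[len(hyp)]

    def cer(ref: str, hyp: str) -> float:
        return edit_distance(ref, hyp) / max(len(ref), 1)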
  16. Details on Calibrated Errors per Second (CEPS)
      Assumptions:
      • All languages convey the same amount of information per second
      • Speech is divided into equal-length slices of τ seconds each
      • An ASR error is an event that occurs at a single point in time
      • Errors are Poisson-distributed
      Then, the probability that a slice (of τ seconds) has k errors:
        P(k) = e^(-λτ) (λτ)^k / k!
      where λ: calibrated errors per second, τ: seconds per character, λτ: expected number of errors per slice, p: error rate, n: total number of slices.
      We want to estimate λ by maximum likelihood estimation (MLE). The likelihood function for all observations is the product over slices:
        L(λ) = (e^(-λτ))^(n(1-p)) · (1 - e^(-λτ))^(np)
  17. Details on Calibrated Errors per Second (CEPS)
      Log-likelihood:
        ℓ(λ) = -n(1-p)λτ + np · ln(1 - e^(-λτ))
      (the first term comes from slices with no errors, the second from slices with at least one error)
      Setting dℓ/dλ = 0 gives the estimate:
        λ̂ = -(1/τ) ln(1 - p)
      where p: error rate (CER, WER, etc.). In the implementation, we use p = CER and τ = VAD(a)/N. (See the sketch below.)
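The estimator translates directly into code (a minimal sketch; the VAD step itself is not shown):

    # Sketch: CEPS = -ln(1 - p) / tau, with tau = VAD(a) / N seconds per slice.
    # Assumes 0 <= p < 1; voiced duration comes from a VAD, not shown here.
    import math

    def ceps(p: float, voiced_seconds: float, ref_len: int) -> float:
        tau = voiced_seconds / ref_len   # seconds per character slice
        return -math.log(1.0 - p) / tau  # calibrated errors per second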
  18. CEPS and CER: example
      Reference:  腹減りすぎて3分待てなかったやべ (“so hungry I couldn’t even wait 3 minutes, damn”)
      Prediction: 腹減り全て3分待てなかった矢部
      (assume the utterance was said in 2 seconds)
      HikakinTV (2021). “[First time in 6 years] Reacting to YouTube’s auto-captions, and I can’t stop laughing lol [HikakinTV / SeikinTV / MasuoTV]”, at 3:14. Retrieved September 8, 2024. https://www.youtube.com/watch?v=kLHc0c3Yv7U
      This image was removed for copyright reasons.
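Working the example through (the alignment is my own reading of the example; the 2-second duration is the slide's assumption):

    import math

    # Alignment: 3 substitutions (す→全, や→矢, べ→部) and 1 deletion (ぎ); N = 16.
    edits, n, voiced_s = 4, 16, 2.0
    cer = edits / n                    # 0.25
    tau = voiced_s / n                 # 0.125 s per character slice
    ceps = -math.log(1.0 - cer) / tau  # ≈ 2.30 calibrated errors per second
    print(cer, round(ceps, 2))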
  19. Setup
      Compare CER and CEPS with:
      • Grapheme distribution
        – Number of grapheme types
        – Unigram character entropy: H = -Σ_c p(c) log p(c), where c ranges over grapheme types (see the sketch below)
      • Logographicity
        – Attention-based measure (Sproat & Gutkin, 2021)
      • Phoneme distribution
        – Number of phoneme types from Phoible 2.0 (Moran and McCloy, 2019)
      How can we measure logographicity with attention?
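Unigram character entropy can be estimated from corpus text; a sketch assuming base-2 logarithms, which matches the magnitudes on the results slides (each entropy value is below log2 of the corresponding grapheme count):

    # Sketch: unigram character entropy in bits, estimated from corpus text.
    import math
    from collections import Counter

    def unigram_entropy(text: str) -> float:
        counts = Counter(text)
        n = sum(counts.values())
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    print(unigram_entropy("nihongo nihongo"))  # toy input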
  20. But how can we measure logographicity?
      Logographicity: the irregularity of the mapping from pronunciation to spelling (Sproat and Gutkin, 2021)
      → If a writing system is logographic, one must consider the context to determine the correct spelling!
      • on your /raɪt/ hand side → right
      • can you /raɪt/ it down → write
      • Stravinsky’s “The /raɪt/ of Spring” → rite
      • The /raɪt/ brothers invented an aircraft → Wright
  21. Attention-based measure of logographicity
      Check how much the attention is spread out across the context!
      [Figure: a seq2seq model attends from the target word in the orthography (output) to a phoneme sequence with the context (input); the attention over the target word itself is masked (zeroed out).]
  22. Attention-based measure of logographicity
      To compute the logographicity of a language:
      1. Train a seq-to-seq model that converts a phoneme sequence into the target word in the orthography
      2. Use the model to get the (last) attention matrix
      3. Mask the target word in the matrix
      4. Compute the ratio of the masked attention matrix to the original attention matrix (see the sketch below)
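A schematic of steps 3-4 (my reading of the measure, not Sproat & Gutkin's exact implementation): the larger the share of attention mass that survives the masking, i.e., falls on the context rather than on the target word's own phonemes, the more logographic the mapping:

    import numpy as np

    # A[i, j]: attention from output grapheme i to input phoneme j (rows sum to 1).
    # target_span indexes the phonemes belonging to the target word itself.
    def logographicity_score(A: np.ndarray, target_span: slice) -> float:
        masked = A.copy()
        masked[:, target_span] = 0.0   # zero out attention on the target word
        return masked.sum() / A.sum()  # share of attention spent on context

    A = np.random.dirichlet(np.ones(10), size=6)  # toy 6x10 attention matrix
    print(logographicity_score(A, slice(3, 6)))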
  23. Results: Same language, different writing systems
      • Japanese: Kanji (日本語), Kana (ニホンゴ), Romaji (nihongo)
      • Chinese: Hanzi (漢語), Zhuyin (ㄏㄢˋㄩˇ), Pinyin (hànyǔ)
      • Korean: Hangul syllabary (한국어), Hangul Jamo (ㅎㅏㄴㄱㅜㄱㅇㅓ)
      Would they show different scores?
  24. Results
      Language  Writing system  CER↓   CEPS↓  #Graphemes  Unigram entropy  Logographicity  #Phonemes
      Japanese  Kanji + Kana    58.12  7.21   1702        7.74             44.98           27
                Kana            29.71  3.48   92          5.63             41.22
                Romaji          17.09  2.91   27          3.52             29.46
      Chinese   Hanzi           62.81  2.65   2155        9.47             41.59           39.5
                Zhuyin           9.71  1.04   49          4.81             24.32
                Pinyin           9.17  1.01   56          5.02             22.50
      Korean    Hangul          28.21  2.63   965         7.98             25.27           42.5
                Jamo            16.72  3.23   62          4.90             15.99
      “Simpler” writing systems get better scores!
  25. Logographic orthographies are hard to learn
      • Japanese: slower learning of Kanji than of Kana/Romaji
      • Korean: slower learning of Hangul than of Jamo
      • Chinese: slower learning of Hanzi than of Zhuyin/Pinyin
      [Learning curves for Japanese, Korean, and Chinese]
  26. Results (contd.): Phonographic languages
      Language  Writing system  CER    CEPS  #Graphemes  Unigram entropy  Logographicity  #Phonemes
      Thai      Thai            19.77  1.80  67          5.24             20.55           20.67
      Arabic    Arabic          40.59  4.78  53          4.77             21.57           37
      English   Latin            3.17  0.58  27          4.17             19.17           41.22
      French    Latin           19.64  2.79  69          4.42             20.37           36.75
      Italian   Latin           14.82  1.84  48          4.27             21.28           43.33
      Czech     Latin           16.89  1.86  46          4.92             20.57           39
      Swedish   Latin           20.31  2.71  34          4.52             19.81           35
      Dutch     Latin           12.35  1.77  36          4.20             19.67           49.38
      German    Latin            7.61  1.03  48          4.18             18.03           40
  27. Any correlation?
      Correlation matrix:
                       CER    CEPS   #Graphemes  Unigram entropy  Logographicity  #Phonemes
      CER              1.00   0.77   0.85        0.81             0.76            -0.37
      CEPS                    1.00   0.49        0.41             0.61            -0.66
      #Graphemes                     1.00        0.93             0.72            -0.14
      Unigram entropy                            1.00             0.67            -0.08
      Logographicity                                              1.00            -0.60
      • Significant correlation between CER and the orthography-related variables
      • CEPS correlates more weakly with the orthography-related variables
      • No significant correlation between CER and the number of phonemes
      (A reproduction sketch follows this slide.)
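The matrix can be recomputed from the two results tables with numpy (a sketch; the slide's values may have been computed slightly differently, so exact agreement is not guaranteed):

    import numpy as np

    # Columns: CER, CEPS, #graphemes, unigram entropy, logographicity, #phonemes.
    # One row per (language, writing system) from slides 24 and 26.
    rows = np.array([
        [58.12, 7.21, 1702, 7.74, 44.98, 27.00],  # Japanese, Kanji + Kana
        [29.71, 3.48,   92, 5.63, 41.22, 27.00],  # Japanese, Kana
        [17.09, 2.91,   27, 3.52, 29.46, 27.00],  # Japanese, Romaji
        [62.81, 2.65, 2155, 9.47, 41.59, 39.50],  # Chinese, Hanzi
        [ 9.71, 1.04,   49, 4.81, 24.32, 39.50],  # Chinese, Zhuyin
        [ 9.17, 1.01,   56, 5.02, 22.50, 39.50],  # Chinese, Pinyin
        [28.21, 2.63,  965, 7.98, 25.27, 42.50],  # Korean, Hangul
        [16.72, 3.23,   62, 4.90, 15.99, 42.50],  # Korean, Jamo
        [19.77, 1.80,   67, 5.24, 20.55, 20.67],  # Thai
        [40.59, 4.78,   53, 4.77, 21.57, 37.00],  # Arabic
        [ 3.17, 0.58,   27, 4.17, 19.17, 41.22],  # English
        [19.64, 2.79,   69, 4.42, 20.37, 36.75],  # French
        [14.82, 1.84,   48, 4.27, 21.28, 43.33],  # Italian
        [16.89, 1.86,   46, 4.92, 20.57, 39.00],  # Czech
        [20.31, 2.71,   34, 4.52, 19.81, 35.00],  # Swedish
        [12.35, 1.77,   36, 4.20, 19.67, 49.38],  # Dutch
        [ 7.61, 1.03,   48, 4.18, 18.03, 40.00],  # German
    ])
    corr = np.corrcoef(rows, rowvar=False)  # 6x6 Pearson correlation matrix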
  30. Conclusion
      What makes automatic speech recognition hard?
      • Orthographic complexity
        – Worse performance
        – Slower learning
        – Calibrated Errors per Second (CEPS) can mitigate the orthographic bias
      • Phonological complexity does not affect the performance
  31. Why is this finding interesting?
      • Speech recognition for low-resource logographic languages
        – Some low-resource languages have complex orthographies
        – Better accuracy with transliterated data? e.g., Taiwanese Hokkien:
          goa2 si7 jit8 pun2 lang5 → 我是日本人 “I am Japanese” (rule-based or seq2seq conversion; see the sketch below)
      https://www.omniglot.com/writing/yi.htm
      https://www.omniglot.com/writing/inuktitut.htm
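The rule-based direction could be as simple as a syllable lookup. The mapping table below is a hypothetical toy, not a real Hokkien lexicon; a real converter would have to disambiguate homophonous syllables in context:

    # Hypothetical sketch: romanized Hokkien syllables -> Han characters.
    SYLLABLE_TO_HAN = {"goa2": "我", "si7": "是", "jit8": "日",
                       "pun2": "本", "lang5": "人"}

    def romanized_to_han(sentence: str) -> str:
        return "".join(SYLLABLE_TO_HAN.get(s, s) for s in sentence.split())

    print(romanized_to_han("goa2 si7 jit8 pun2 lang5"))  # -> 我是日本人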
  32. Why is this finding interesting?
      • Similarities to children’s first language acquisition
        – Babies can perfectly learn the phonology of their first language, regardless of its phonological complexity
        – Children need a lot of conscious effort to learn writing
        – Fine-tuning a pretrained multilingual self-supervised ASR model is somewhat like first language acquisition?
        – Choi et al. (2024): “Self-Supervised Speech Representations are More Phonetic than Semantic”